Log Output Constantly Stalling to Zero

I have a problem with my Graylog instance where the “out” rate regularly drops to zero, several times a minute, and oftentimes stays there for extended periods (a minute or more). To be clear, I am not sure whether the issue is with Graylog, Elasticsearch, or something else altogether. I have tried to look at this from every angle possible before posting here for help.

The overview of the environment is Graylog 3.2.4 and Elasticsearch 6.8.8 running on Red Hat Enterprise Linux 7.8. We have four Graylog nodes sitting behind an F5 LTM load balancer and 12 Elasticsearch nodes. The F5 LTM does a solid job of round-robin distributing the load pretty equally to all four of the Graylog nodes. I am not certain that it’s relevant but there is a Cisco ASA firewall that sits between the Graylog nodes and the Elasticsearch nodes; however, ports are open on TCP 9200 into Elasticsearch from the Graylog nodes. Additionally, our F5 LTM is “in-line” meaning it is the default gateway for our Graylog nodes and all traffic passes through it.

For better or worse, the environment is entirely virtualized; however, I do not believe there are resource contention issues at the CPU, memory or storage level. The Graylog and Elasticsearch nodes do share a common VM datastore on an all-flash, fibre-channel-attached storage array, but Graylog and Elasticsearch are the only things on the array. Performance metrics collected from the storage array don’t indicate that we are anywhere remotely close to running out of performance, and CPU and memory usage across the environment has plenty of headroom.

The Graylog servers (four total) are each 16 CPU cores with 16 GB RAM. The Graylog Java process is limited to 4 GB RAM. There is roughly 4 GB RAM of the 16 GB total that is free. In my graylog.conf file I have processbuffer_processors = 10, outputbuffer_processors = 5 and inputbuffer_processors = 1.
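For reference, this is a sketch of how those settings appear in my graylog.conf (values as described above):

```
# graylog.conf excerpt: buffer processor threads (16 total on a 16-core VM)
processbuffer_processors = 10
outputbuffer_processors = 5
inputbuffer_processors = 1
```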

The Elasticsearch servers (12 total) are each 8 CPU cores with 64 GB RAM. The Elasticsearch Java process is limited to 30 GB RAM. There is roughly 4 GB RAM of the 64 GB total that is free.

I am currently processing a constant stream of roughly 55,000-60,000 messages per second through Graylog. As I type this, the system just started to recover from one of these stalled outputs and is currently pushing 168,000 messages per second out, and the Elasticsearch nodes are at 100-300% CPU utilization with load averages ranging from 1.45 to 2.79 across the 12 Elasticsearch nodes.

My default index set, which pretty much everything is written to, is configured for 12 shards and 1 replica set.

I have also noticed that when the output drops to zero, each Graylog node’s process buffer and output buffer are 100% full.

I can’t correlate any events in the Graylog log file (such as rotating or deleting an index) with the drops to zero either. Same with the Elasticsearch log. Neither offers helpful insight into what could be going on here.

Any ideas on what I am doing wrong here? I am at a total loss as to what to try next.

hey @nnelson

Watch the Elasticsearch threads and check what they are doing.

Configure the index refresh rate of Elasticsearch to 30 seconds; this will give you some help. When the output buffer fills up, the reason is Elasticsearch, or rather, that Graylog is not able to hand the messages over to Elasticsearch.

So take a deeper look into the threads and information from that part of the stack.

@jan through the Graylog GUI I changed within the “Default index set” the “Field type refresh interval” value to 30 seconds (it was previously set to 5). I found reference to what this is (I think) at What is the parameter "field_type_refresh_interval"? and unless I am misunderstanding, it supposedly correlates to the “refresh_interval” in Elasticsearch. I changed it and let the index rotate but after checking the newest index, I didn’t see “refresh_interval” set to 30 seconds as expected. Not sure if I should change that back to 5 seconds or if it matters at all.

Since the previous step didn’t seem to do what I expected, I went ahead and set it within Elasticsearch for the default template:

# curl -XPUT -H 'Content-Type: application/json' http://localhost:9200/_template/all_indices -d '{"template":"*","settings":{"refresh_interval":"30s"}}'

I waited for the index to rotate and checked that the newest index had the “refresh_interval” set to 30 seconds:

# curl --silent http://localhost:9200/graylog_*?pretty | cut -c4-16 | grep ^graylog_ | while read INDEX; do curl --silent http://localhost:9200/${INDEX}/_settings?pretty; done | grep "graylog_\|refresh_interval" | grep -v provided_name | tail -2
  "graylog_99233" : {
        "refresh_interval" : "30s",

Not sure if it was necessary but I also looped through all existing Graylog indices and set the existing ones to 30 seconds also:

# curl --silent http://localhost:9200/graylog_*?pretty | cut -c4-16 | grep ^graylog_ | while read INDEX; do curl --silent -XPUT -H 'Content-Type: application/json' http://localhost:9200/${INDEX}/_settings -d '{"index":{"refresh_interval":"30s"}}'; done
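For what it’s worth, the same bulk update can be done in a single call with an index pattern instead of looping (assuming all your indices match graylog_*; note the quoted URL so the shell doesn’t expand the wildcard):

```sh
# Apply refresh_interval=30s to every index matching graylog_* in one request
curl --silent -XPUT -H 'Content-Type: application/json' \
  'http://localhost:9200/graylog_*/_settings' \
  -d '{"index":{"refresh_interval":"30s"}}'
```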

And verified that all existing Graylog indices were now set to 30 seconds for the “refresh_interval” setting:

# curl --silent http://localhost:9200/graylog_*?pretty | cut -c4-16 | grep ^graylog_ | while read INDEX; do curl --silent http://localhost:9200/${INDEX}/_settings?pretty; done | grep "graylog_\|refresh_interval" | grep -v provided_name
  "graylog_99052" : {
        "refresh_interval" : "30s",
  "graylog_99053" : {
        "refresh_interval" : "30s",
  "graylog_99054" : {
        "refresh_interval" : "30s",
[[removed additional output]]

I also remember trying to fix a similar (same?) performance issue several months ago by tinkering with the “output_batch_size” setting in the “graylog.conf” configuration file. I had it set to 10000 (which seems way too high) and found a GitHub thread where a user had set it to 1000 with positive results; later in that thread I saw a developer comment that the new default is 500. I tried 500 first and it did not help at all. I am now running it at 1000 instead of the 10000 I had initially set, but it still isn’t helping.
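For anyone following along, this is the setting in question in graylog.conf; to my understanding, output_flush_interval is the companion setting that controls how long Graylog waits before flushing a partial batch (1 second being the default):

```
# graylog.conf excerpt: Elasticsearch output batching
output_batch_size = 1000
output_flush_interval = 1
```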

As for taking “a deeper look into the threads and information” within Elasticsearch, with me being an Elasticsearch novice, might you have some advice on specifically what I should be looking for? When searching around I found reference to “/_cat/thread_pool”, which I queried, but I am not sure exactly what would be indicative of a bottleneck or problem, or whether this is even the right thing you’re talking about.

I also found reference in my searching to “/_nodes/hot_threads”, which I also queried, but again, same as above, I am not sure what I should be looking for that is indicative of a bottleneck or problem, or whether this is even what you’re asking me to check.
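In case it helps, this is the sort of hot_threads query I ran; my (novice) understanding is that during a stall you would look for the busiest threads sitting in write/bulk, refresh, or merge code paths:

```sh
# Sample the 5 hottest threads on every node over a 500ms interval;
# run it a few times while the output is stalled and compare snapshots.
curl --silent 'http://localhost:9200/_nodes/hot_threads?threads=5&interval=500ms'
```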

Any advice is greatly appreciated.

To follow up after running this for a few days: changing the “refresh_interval” to 30 seconds mostly stopped it from hitting zero multiple times per minute, though it still happens. That said, the net result is still not very positive. Ultimately, it has had no meaningful impact on the environment as far as I can tell: logs are still heavily throttled and Graylog disk buffers are overflowing.

Any further advice on what to look at on Elasticsearch? I just can’t imagine it being lack of horsepower (CPU/RAM/disk).

Edit: I can no longer say that it has had any measurable impact whatsoever. I must have been watching it during a time when it was running “better.” The multiple zeros per minute are the same as always.

Additional steps I have taken to try to lighten the load as of right now:

  1. Changed the replica count to 0 for all new indices. The thought here was to halve the write load, since Elasticsearch no longer has to write a replica. No meaningful impact on the output dropping to zero constantly.

  2. Changed the Elasticsearch setting index.translog.durability to async. The thought here was that the per-request fsync and commit were hurting throughput. Again, no meaningful impact on the output dropping to zero constantly. Command run for reference:

# curl --silent -H 'Content-Type: application/json' -XPUT 'http://localhost:9200/_settings' -d '{"index.translog.durability": "async"}'

  3. Changed the Elasticsearch -Xmx and -Xms values to 28 GB, as a colleague found reference to potential issues with Java memory usage and compressed object pointers. I don’t pretend to understand it. The referenced page described how to verify whether it was a problem, and when I grepped our Elasticsearch log file the verbiage indicated it was seemingly a non-issue, even at 30 GB. Either way, worth noting it was changed.
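For reference, the verification amounted to grepping the Elasticsearch startup log for the compressed-pointers line it prints once at startup (log path here is a guess for a default RPM install; in our case it reported [true] even at 30 GB):

```sh
# Hypothetical log path; Elasticsearch logs this line once at startup.
grep 'compressed ordinary object pointers' /var/log/elasticsearch/*.log
```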

Additionally, the CPU load on the Elasticsearch nodes continues to be very low relative to the number of CPU cores (load averages currently 1.70, 1.32, 1.48 on 8 vCPUs, roughly consistent across all the Elasticsearch nodes).

I also found on the Elasticsearch forums someone posting the following command for troubleshooting performance issues and thought perhaps it might be relevant. I have run it several times in rapid succession and have never seen anything other than zero in the “active” or “rejected” columns:

# curl --silent -XGET 'http://localhost:9200/_cat/thread_pool/search?v&h=node_name,name,active,rejected,completed'
node_name                 name   active rejected completed
elsrch01a.mydomain.com search      0        0    421084
elsrch01k.mydomain.com search      0        0    416241
elsrch01h.mydomain.com search      0        0    417632
elsrch01e.mydomain.com search      0        0    417228
elsrch01d.mydomain.com search      0        0    421196
elsrch01b.mydomain.com search      0        0    418618
elsrch01j.mydomain.com search      0        0    416366
elsrch01c.mydomain.com search      0        0    419898
elsrch01f.mydomain.com search      0        0    420869
elsrch01i.mydomain.com search      0        0    423459
elsrch01g.mydomain.com search      0        0    421222
elsrch01l.mydomain.com search      0        0    412267

Finally, I ran iostat every second against the /dev/sdb device (my Elasticsearch data volume) and monitored disk I/O, which was very underwhelming. It spent most of the time with zeros across the board (idle). I captured a few of the “spikes” of activity it did have to show an example of what kind of actual disk I/O it is doing:

# iostat -xm 1 /dev/sdb
Linux 3.10.0-1127.el7.x86_64 (elsrch01a.mydomain.com) 	05/19/2020 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.35    0.00    1.85    0.43    0.00   76.37

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     2.28    1.04  135.53     0.09    17.32   261.10     0.52    3.80   23.72    3.65   0.34   4.68

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          25.89    0.00    2.54    1.02    0.00   70.56

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  396.00     0.00    11.16    57.73     0.13    0.33    0.00    0.33   0.33  13.10

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          28.66    0.00    3.03    1.14    0.00   67.17

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  447.00     0.00    12.52    57.34     0.12    0.27    0.00    0.27   0.27  12.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          19.37    0.00    2.15    0.89    0.00   77.59

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  367.00     0.00    10.03    55.98     0.11    0.30    0.00    0.30   0.29  10.60

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.13    0.00    0.00    0.00    0.00   99.87

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    3.00     0.00     0.07    46.33     0.00    0.67    0.00    0.67   0.67   0.20

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.26    0.00    0.88    0.00    0.00   96.87

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.38    0.00    0.50    0.00    0.00   98.12

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.25    0.00    0.13    0.00    0.00   99.62

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.75    0.00    1.00    0.00    0.00   97.25

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

I’m attaching some additional performance graphs showing metrics of the storage array hosting all the VMs as well as CPU graphs from the last few days from each of the Elasticsearch nodes.

IOPS On All-Flash SAN Over 24-hour Period:

Latency On All-Flash SAN Over 24-hour Period:

CPU Utilization On Each Elasticsearch Node Over 48-Hour Period:

The CPU utilization across all 12 Elasticsearch nodes is almost completely identical, so much so that I had to go back and check the source images on my workstation to make sure I hadn’t accidentally uploaded the same image multiple times.

Also note that on the SAN IOPS graph, the drop in total IOPS around the 14:00-14:30 mark correlates with when I changed the Elasticsearch cluster to asynchronous writes AND when I changed the index set replica count to 0, though I suspect the drop in IOPS relates more to the removed replicas than to the asynchronous writes.

You mentioned there is an ASA between the Graylog and Elasticsearch servers. Are you running any kind of IPS? This sounds an awful lot like IPS behavior: the ASA is shunning the connections from Graylog to Elasticsearch because the IPS thinks there is an attack, probably based on the traffic volume.

Check to see if your ASA is shunning the connections between Graylog and ES.

@cawfehman I like your thought process on this, thinking outside the box. Yes, there is an IPS in place, but there has been a rule to completely bypass IPS traffic inspection between the Graylog source IPs and Elasticsearch destination IPs.

Since we are on the same thought process: just earlier this afternoon I temporarily added a second NIC to each Graylog server, putting it in the same network segment as the Elasticsearch servers to completely bypass the ASA and IPS. I can confirm this by capturing traffic on the second NIC (eth1) and seeing TCP 9200 going across that interface to my Elasticsearch nodes.

When I run tcpdump on eth1, the node stops sending any traffic whatsoever to any of the Elasticsearch IPs when the log output reaches 0 (which seems obvious/expected). I also noticed that 1-2 of my 4 Graylog hosts seem to stall and send only a few connections out to Elasticsearch:

graylog01c ~]# tcpdump -i eth1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
16:35:05.340222 IP graylog01c.mydomain.com.54794 > elsrch01h.mydomain.com.wap-wsp: Flags [P.], seq 207808037:207808232, ack 119503898, win 23, options [nop,nop,TS val 2334523825 ecr 2331981541], length 195
16:35:05.343567 IP elsrch01h.mydomain.com.wap-wsp > graylog01c.mydomain.com.54794: Flags [P.], seq 1:846, ack 195, win 7127, options [nop,nop,TS val 2332071539 ecr 2334523825], length 845
16:35:05.343584 IP graylog01c.mydomain.com.54794 > elsrch01h.mydomain.com.wap-wsp: Flags [.], ack 846, win 23, options [nop,nop,TS val 2334523828 ecr 2332071539], length 0
16:35:05.345946 IP graylog01c.mydomain.com.47300 > elsrch01i.mydomain.com.wap-wsp: Flags [P.], seq 1745041446:1745041653, ack 516136779, win 23, options [nop,nop,TS val 2334523831 ecr 2332020546], length 207
16:35:05.348345 IP elsrch01i.mydomain.com.wap-wsp > graylog01c.mydomain.com.47300: Flags [P.], seq 1:954, ack 207, win 8462, options [nop,nop,TS val 2332110547 ecr 2334523831], length 953
16:35:05.348367 IP graylog01c.mydomain.com.47300 > elsrch01i.mydomain.com.wap-wsp: Flags [.], ack 954, win 23, options [nop,nop,TS val 2334523833 ecr 2332110547], length 0
^C
6 packets captured
6 packets received by filter
0 packets dropped by kernel 

For reference, over the same time span in which Graylog node C (above) captured 6 packets, Graylog nodes A, B and D captured 13756, 13424 and 13082 packets respectively.

I so very much wanted it to be a networking issue but the issue persists.

hmmm well, doesn’t seem like a networking issue, but perhaps we’re putting the cart before the horse…

When you see these “zero” out events, what is the status of your nodes? What are your journal and input/output buffers doing?

I’ve read and reread your description a couple of times (kudos on the details, btw) and I can’t see much mention of the state of those during these events. Perhaps it’s implied that they are fine, but I’d be curious for confirmation.

Thanks for the kudos – I try to provide as much technical detail as possible in hopes one tiny detail might just be the key to unlocking the mystery.

During a zero message output occurrence, my Graylog nodes all behave consistently across all four of them.

Assuming the disk buffer is not totally full, the node is marked “ALIVE” and the F5 load balancer keeps sending it a load-balanced portion of the total log volume until the disk buffer fills completely and the node goes “THROTTLED,” at which point the F5 load balancer stops sending syslog data to it.

Assuming there is still room in the disk buffer and the node is “ALIVE,” the following happens:

  • The disk buffers begin to fill.
  • The input buffer is always at 0% utilized and the number of messages in the input buffer bounces from zero to maybe 50 or so (but usually zero).
  • The process buffer sits at 100% utilized with 65,536 messages in the queue.
  • The output buffer sits at 100% utilized with 65,536 messages in the queue.
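For completeness, the journal and buffer state above can also be pulled from the Graylog REST API on each node (host and credentials here are placeholders):

```sh
# Journal state (uncommitted entries, append/read rates) for one node
curl --silent -u admin:password \
  'http://graylog01a.mydomain.com:9000/api/system/journal?pretty=true'
```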

At the risk of breaking it further, tonight I changed the -Xmx and -Xms settings on the Elasticsearch cluster back to 30 GB since the change my coworker suggested didn’t seem to make a difference. I also changed the “output_batch_size” back to the default 500 since changing it to 1000 didn’t seem to help either. I restarted the Elasticsearch cluster and the Graylog cluster.

For a period of an hour or so it seemed actually better and was bursting out at sending out to Elasticsearch at around 150,000+ messages per second at times.

I couldn’t leave well enough alone and thought maybe I could re-enable the replica count to 1 instead of the 0 I had previously set it to, to see if conditions improved. This ended very poorly. Lots and lots of zero-message-out periods, and all four nodes grew to nearly 10,000,000 backlogged unprocessed messages in a matter of minutes. I reverted this change, forced an active index rotation and let Graylog catch back up.

I then thought: what if I had 6 shards across 12 Elasticsearch nodes with 1 replica? In my mind that somehow made logical sense. I made the changes and forced an active index rotation to see. It kind of worked for a brief period but quickly fell back to the old behavior as my disk buffer filled and my Graylog nodes fell further behind. Again, I reverted this change back to 12 shards across 12 Elasticsearch nodes with 0 replicas, forced an active index rotation and let Graylog catch back up.

As it sits right now with my current configuration, it has mostly kept up with the log volume but still drops to zero (and not just when the active index rotates). At least for now, the difference is that it can ramp up to over 100,000 messages per second out to play catch-up and compensate for the dips.

It just seems like something isn’t dialed in right and I can’t put my finger on it. As of right now it is keeping afloat (knock on wood), but I might log into this thing tomorrow morning and find it back in a total tailspin. Not to mention I still don’t have my replicas as it currently sits, which makes patching the Elasticsearch nodes problematic for rolling reboots.

When the output drops to 0, check the activity of Elasticsearch. It is very likely that you do not have any threads left, meaning the processing wait queue is full.

Usually Elasticsearch logs this.
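For example, the pool to watch during a stall would be write (bulk indexing runs on it in Elasticsearch 6.x); a non-zero queue or rejected count there while the output is at zero would point at Elasticsearch:

```sh
# Watch the write pool on all nodes; repeat during a zero-output event
curl --silent 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected'
```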

Assuming that you’ve reverted back to your original configuration and confirmed that the ASA is not blocking/shunning the traffic, I would focus on why your output buffer is filling up. As @jan pointed out, check Elasticsearch. But also check the pieces in between. I don’t think you need the LB between Graylog and ES; I would say it’s not necessary. Not an ES expert, but it handles the load-balancing process pretty seamlessly itself. Perhaps something to test would be to have only 1 ES server as an available node in the LB strategy.

@cawfehman the F5 load balancer is in-line. It is the default gateway for the subnet our Graylog nodes are in. It is only doing load balancing INTO the Graylog nodes for syslog and does nothing functionally different than a router for traffic outbound to Elasticsearch. To that point, we’re still running dual network interface and bypassing the ASA and F5 LTM for traffic between the Graylog and Elasticsearch nodes.

@jan I am still not entirely certain how to check for available threads in Elasticsearch. After Google searching around it appears that, again, perhaps “thread_pool” might be what you mean. I assume that write is the most important here? If this isn’t what you are asking me for then please advise on what exactly I should be querying.

According to the Elasticsearch documentation, default “thread_pool” output columns are “active,” “queue” and “rejected.” This was captured during a zero log output event:

$ curl --silent -X GET "http://elsrch01a.mydomain.com:9200/_cat/thread_pool/"
elsrch01f.mydomain.com analyze             0 0 0
elsrch01f.mydomain.com ccr                 0 0 0
elsrch01f.mydomain.com fetch_shard_started 0 0 0
elsrch01f.mydomain.com fetch_shard_store   0 0 0
elsrch01f.mydomain.com flush               0 0 0
elsrch01f.mydomain.com force_merge         1 0 0
elsrch01f.mydomain.com generic             0 0 0
elsrch01f.mydomain.com get                 0 0 0
elsrch01f.mydomain.com index               0 0 0
elsrch01f.mydomain.com listener            0 0 0
elsrch01f.mydomain.com management          1 0 0
elsrch01f.mydomain.com ml_autodetect       0 0 0
elsrch01f.mydomain.com ml_datafeed         0 0 0
elsrch01f.mydomain.com ml_utility          0 0 0
elsrch01f.mydomain.com refresh             0 0 0
elsrch01f.mydomain.com rollup_indexing     0 0 0
elsrch01f.mydomain.com search              0 0 0
elsrch01f.mydomain.com search_throttled    0 0 0
elsrch01f.mydomain.com security-token-key  0 0 0
elsrch01f.mydomain.com snapshot            0 0 0
elsrch01f.mydomain.com warmer              0 0 0
elsrch01f.mydomain.com watcher             0 0 0
elsrch01f.mydomain.com write               0 0 0
elsrch01a.mydomain.com analyze             0 0 0
elsrch01a.mydomain.com ccr                 0 0 0
elsrch01a.mydomain.com fetch_shard_started 0 0 0
elsrch01a.mydomain.com fetch_shard_store   0 0 0
elsrch01a.mydomain.com flush               0 0 0
elsrch01a.mydomain.com force_merge         0 0 0
elsrch01a.mydomain.com generic             0 0 0
elsrch01a.mydomain.com get                 0 0 0
elsrch01a.mydomain.com index               0 0 0
elsrch01a.mydomain.com listener            0 0 0
elsrch01a.mydomain.com management          1 0 0
elsrch01a.mydomain.com ml_autodetect       0 0 0
elsrch01a.mydomain.com ml_datafeed         0 0 0
elsrch01a.mydomain.com ml_utility          0 0 0
elsrch01a.mydomain.com refresh             0 0 0
elsrch01a.mydomain.com rollup_indexing     0 0 0
elsrch01a.mydomain.com search              0 0 0
elsrch01a.mydomain.com search_throttled    0 0 0
elsrch01a.mydomain.com security-token-key  0 0 0
elsrch01a.mydomain.com snapshot            0 0 0
elsrch01a.mydomain.com warmer              0 0 0
elsrch01a.mydomain.com watcher             0 0 0
elsrch01a.mydomain.com write               0 0 0
elsrch01d.mydomain.com analyze             0 0 0
elsrch01d.mydomain.com ccr                 0 0 0
elsrch01d.mydomain.com fetch_shard_started 0 0 0
elsrch01d.mydomain.com fetch_shard_store   0 0 0
elsrch01d.mydomain.com flush               0 0 0
elsrch01d.mydomain.com force_merge         1 0 0
elsrch01d.mydomain.com generic             0 0 0
elsrch01d.mydomain.com get                 0 0 0
elsrch01d.mydomain.com index               0 0 0
elsrch01d.mydomain.com listener            0 0 0
elsrch01d.mydomain.com management          1 0 0
elsrch01d.mydomain.com ml_autodetect       0 0 0
elsrch01d.mydomain.com ml_datafeed         0 0 0
elsrch01d.mydomain.com ml_utility          0 0 0
elsrch01d.mydomain.com refresh             0 0 0
elsrch01d.mydomain.com rollup_indexing     0 0 0
elsrch01d.mydomain.com search              0 0 0
elsrch01d.mydomain.com search_throttled    0 0 0
elsrch01d.mydomain.com security-token-key  0 0 0
elsrch01d.mydomain.com snapshot            0 0 0
elsrch01d.mydomain.com warmer              0 0 0
elsrch01d.mydomain.com watcher             0 0 0
elsrch01d.mydomain.com write               0 0 0
elsrch01i.mydomain.com analyze             0 0 0
elsrch01i.mydomain.com ccr                 0 0 0
elsrch01i.mydomain.com fetch_shard_started 0 0 0
elsrch01i.mydomain.com fetch_shard_store   0 0 0
elsrch01i.mydomain.com flush               0 0 0
elsrch01i.mydomain.com force_merge         1 0 0
elsrch01i.mydomain.com generic             0 0 0
elsrch01i.mydomain.com get                 0 0 0
elsrch01i.mydomain.com index               0 0 0
elsrch01i.mydomain.com listener            0 0 0
elsrch01i.mydomain.com management          1 0 0
elsrch01i.mydomain.com ml_autodetect       0 0 0
elsrch01i.mydomain.com ml_datafeed         0 0 0
elsrch01i.mydomain.com ml_utility          0 0 0
elsrch01i.mydomain.com refresh             0 0 0
elsrch01i.mydomain.com rollup_indexing     0 0 0
elsrch01i.mydomain.com search              0 0 0
elsrch01i.mydomain.com search_throttled    0 0 0
elsrch01i.mydomain.com security-token-key  0 0 0
elsrch01i.mydomain.com snapshot            0 0 0
elsrch01i.mydomain.com warmer              0 0 0
elsrch01i.mydomain.com watcher             0 0 0
elsrch01i.mydomain.com write               0 0 0
elsrch01c.mydomain.com analyze             0 0 0
elsrch01c.mydomain.com ccr                 0 0 0
elsrch01c.mydomain.com fetch_shard_started 0 0 0
elsrch01c.mydomain.com fetch_shard_store   0 0 0
elsrch01c.mydomain.com flush               0 0 0
elsrch01c.mydomain.com force_merge         0 0 0
elsrch01c.mydomain.com generic             0 0 0
elsrch01c.mydomain.com get                 0 0 0
elsrch01c.mydomain.com index               0 0 0
elsrch01c.mydomain.com listener            0 0 0
elsrch01c.mydomain.com management          1 0 0
elsrch01c.mydomain.com ml_autodetect       0 0 0
elsrch01c.mydomain.com ml_datafeed         0 0 0
elsrch01c.mydomain.com ml_utility          0 0 0
elsrch01c.mydomain.com refresh             0 0 0
elsrch01c.mydomain.com rollup_indexing     0 0 0
elsrch01c.mydomain.com search              0 0 0
elsrch01c.mydomain.com search_throttled    0 0 0
elsrch01c.mydomain.com security-token-key  0 0 0
elsrch01c.mydomain.com snapshot            0 0 0
elsrch01c.mydomain.com warmer              0 0 0
elsrch01c.mydomain.com watcher             0 0 0
elsrch01c.mydomain.com write               0 0 0
elsrch01l.mydomain.com analyze             0 0 0
elsrch01l.mydomain.com ccr                 0 0 0
elsrch01l.mydomain.com fetch_shard_started 0 0 0
elsrch01l.mydomain.com fetch_shard_store   0 0 0
elsrch01l.mydomain.com flush               0 0 0
elsrch01l.mydomain.com force_merge         1 0 0
elsrch01l.mydomain.com generic             0 0 0
elsrch01l.mydomain.com get                 0 0 0
elsrch01l.mydomain.com index               0 0 0
elsrch01l.mydomain.com listener            0 0 0
elsrch01l.mydomain.com management          1 0 0
elsrch01l.mydomain.com ml_autodetect       0 0 0
elsrch01l.mydomain.com ml_datafeed         0 0 0
elsrch01l.mydomain.com ml_utility          0 0 0
elsrch01l.mydomain.com refresh             0 0 0
elsrch01l.mydomain.com rollup_indexing     0 0 0
elsrch01l.mydomain.com search              0 0 0
elsrch01l.mydomain.com search_throttled    0 0 0
elsrch01l.mydomain.com security-token-key  0 0 0
elsrch01l.mydomain.com snapshot            0 0 0
elsrch01l.mydomain.com warmer              0 0 0
elsrch01l.mydomain.com watcher             0 0 0
elsrch01l.mydomain.com write               0 0 0
elsrch01e.mydomain.com analyze             0 0 0
elsrch01e.mydomain.com ccr                 0 0 0
elsrch01e.mydomain.com fetch_shard_started 0 0 0
elsrch01e.mydomain.com fetch_shard_store   0 0 0
elsrch01e.mydomain.com flush               0 0 0
elsrch01e.mydomain.com force_merge         1 0 0
elsrch01e.mydomain.com generic             0 0 0
elsrch01e.mydomain.com get                 0 0 0
elsrch01e.mydomain.com index               0 0 0
elsrch01e.mydomain.com listener            0 0 0
elsrch01e.mydomain.com management          1 0 0
elsrch01e.mydomain.com ml_autodetect       0 0 0
elsrch01e.mydomain.com ml_datafeed         0 0 0
elsrch01e.mydomain.com ml_utility          0 0 0
elsrch01e.mydomain.com refresh             0 0 0
elsrch01e.mydomain.com rollup_indexing     0 0 0
elsrch01e.mydomain.com search              0 0 0
elsrch01e.mydomain.com search_throttled    0 0 0
elsrch01e.mydomain.com security-token-key  0 0 0
elsrch01e.mydomain.com snapshot            0 0 0
elsrch01e.mydomain.com warmer              0 0 0
elsrch01e.mydomain.com watcher             0 0 0
elsrch01e.mydomain.com write               0 0 0
elsrch01h.mydomain.com analyze             0 0 0
elsrch01h.mydomain.com ccr                 0 0 0
elsrch01h.mydomain.com fetch_shard_started 0 0 0
elsrch01h.mydomain.com fetch_shard_store   0 0 0
elsrch01h.mydomain.com flush               0 0 0
elsrch01h.mydomain.com force_merge         1 0 0
elsrch01h.mydomain.com generic             0 0 0
elsrch01h.mydomain.com get                 0 0 0
elsrch01h.mydomain.com index               0 0 0
elsrch01h.mydomain.com listener            0 0 0
elsrch01h.mydomain.com management          1 0 0
elsrch01h.mydomain.com ml_autodetect       0 0 0
elsrch01h.mydomain.com ml_datafeed         0 0 0
elsrch01h.mydomain.com ml_utility          0 0 0
elsrch01h.mydomain.com refresh             0 0 0
elsrch01h.mydomain.com rollup_indexing     0 0 0
elsrch01h.mydomain.com search              0 0 0
elsrch01h.mydomain.com search_throttled    0 0 0
elsrch01h.mydomain.com security-token-key  0 0 0
elsrch01h.mydomain.com snapshot            0 0 0
elsrch01h.mydomain.com warmer              0 0 0
elsrch01h.mydomain.com watcher             0 0 0
elsrch01h.mydomain.com write               0 0 0
elsrch01k.mydomain.com analyze             0 0 0
elsrch01k.mydomain.com ccr                 0 0 0
elsrch01k.mydomain.com fetch_shard_started 0 0 0
elsrch01k.mydomain.com fetch_shard_store   0 0 0
elsrch01k.mydomain.com flush               0 0 0
elsrch01k.mydomain.com force_merge         1 0 0
elsrch01k.mydomain.com generic             0 0 0
elsrch01k.mydomain.com get                 0 0 0
elsrch01k.mydomain.com index               0 0 0
elsrch01k.mydomain.com listener            0 0 0
elsrch01k.mydomain.com management          1 0 0
elsrch01k.mydomain.com ml_autodetect       0 0 0
elsrch01k.mydomain.com ml_datafeed         0 0 0
elsrch01k.mydomain.com ml_utility          0 0 0
elsrch01k.mydomain.com refresh             0 0 0
elsrch01k.mydomain.com rollup_indexing     0 0 0
elsrch01k.mydomain.com search              0 0 0
elsrch01k.mydomain.com search_throttled    0 0 0
elsrch01k.mydomain.com security-token-key  0 0 0
elsrch01k.mydomain.com snapshot            0 0 0
elsrch01k.mydomain.com warmer              0 0 0
elsrch01k.mydomain.com watcher             0 0 0
elsrch01k.mydomain.com write               0 0 0
elsrch01j.mydomain.com analyze             0 0 0
elsrch01j.mydomain.com ccr                 0 0 0
elsrch01j.mydomain.com fetch_shard_started 0 0 0
elsrch01j.mydomain.com fetch_shard_store   0 0 0
elsrch01j.mydomain.com flush               0 0 0
elsrch01j.mydomain.com force_merge         1 0 0
elsrch01j.mydomain.com generic             0 0 0
elsrch01j.mydomain.com get                 0 0 0
elsrch01j.mydomain.com index               0 0 0
elsrch01j.mydomain.com listener            0 0 0
elsrch01j.mydomain.com management          1 0 0
elsrch01j.mydomain.com ml_autodetect       0 0 0
elsrch01j.mydomain.com ml_datafeed         0 0 0
elsrch01j.mydomain.com ml_utility          0 0 0
elsrch01j.mydomain.com refresh             0 0 0
elsrch01j.mydomain.com rollup_indexing     0 0 0
elsrch01j.mydomain.com search              0 0 0
elsrch01j.mydomain.com search_throttled    0 0 0
elsrch01j.mydomain.com security-token-key  0 0 0
elsrch01j.mydomain.com snapshot            0 0 0
elsrch01j.mydomain.com warmer              0 0 0
elsrch01j.mydomain.com watcher             0 0 0
elsrch01j.mydomain.com write               1 0 0
elsrch01g.mydomain.com analyze             0 0 0
elsrch01g.mydomain.com ccr                 0 0 0
elsrch01g.mydomain.com fetch_shard_started 0 0 0
elsrch01g.mydomain.com fetch_shard_store   0 0 0
elsrch01g.mydomain.com flush               0 0 0
elsrch01g.mydomain.com force_merge         1 0 0
elsrch01g.mydomain.com generic             0 0 0
elsrch01g.mydomain.com get                 0 0 0
elsrch01g.mydomain.com index               0 0 0
elsrch01g.mydomain.com listener            0 0 0
elsrch01g.mydomain.com management          1 0 0
elsrch01g.mydomain.com ml_autodetect       0 0 0
elsrch01g.mydomain.com ml_datafeed         0 0 0
elsrch01g.mydomain.com ml_utility          0 0 0
elsrch01g.mydomain.com refresh             0 0 0
elsrch01g.mydomain.com rollup_indexing     0 0 0
elsrch01g.mydomain.com search              0 0 0
elsrch01g.mydomain.com search_throttled    0 0 0
elsrch01g.mydomain.com security-token-key  0 0 0
elsrch01g.mydomain.com snapshot            0 0 0
elsrch01g.mydomain.com warmer              0 0 0
elsrch01g.mydomain.com watcher             0 0 0
elsrch01g.mydomain.com write               0 0 0
elsrch01b.mydomain.com analyze             0 0 0
elsrch01b.mydomain.com ccr                 0 0 0
elsrch01b.mydomain.com fetch_shard_started 0 0 0
elsrch01b.mydomain.com fetch_shard_store   0 0 0
elsrch01b.mydomain.com flush               0 0 0
elsrch01b.mydomain.com force_merge         0 0 0
elsrch01b.mydomain.com generic             0 0 0
elsrch01b.mydomain.com get                 0 0 0
elsrch01b.mydomain.com index               0 0 0
elsrch01b.mydomain.com listener            0 0 0
elsrch01b.mydomain.com management          1 0 0
elsrch01b.mydomain.com ml_autodetect       0 0 0
elsrch01b.mydomain.com ml_datafeed         0 0 0
elsrch01b.mydomain.com ml_utility          0 0 0
elsrch01b.mydomain.com refresh             0 0 0
elsrch01b.mydomain.com rollup_indexing     0 0 0
elsrch01b.mydomain.com search              0 0 0
elsrch01b.mydomain.com search_throttled    0 0 0
elsrch01b.mydomain.com security-token-key  0 0 0
elsrch01b.mydomain.com snapshot            0 0 0
elsrch01b.mydomain.com warmer              0 0 0
elsrch01b.mydomain.com watcher             0 0 0
elsrch01b.mydomain.com write               0 0 0

If I understand the output correctly, the thread pools appear to be doing essentially nothing. Again, assuming this is what you were asking me to check, would you agree?
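For anyone wanting to automate this check rather than eyeball hundreds of lines, here is a minimal sketch (not from the original post) that parses plain-text `_cat/thread_pool` output like the listing above and flags any pool with a non-empty queue or rejections. The column order assumed here (node_name, name, active, queue, rejected) matches the default `_cat/thread_pool` output; the sample lines are illustrative.

```python
# Sketch: scan `_cat/thread_pool` plain-text output for pools that are
# queuing or rejecting work. Sample data is illustrative, in the same
# format as the listing above (node, pool, active, queue, rejected).

sample = """\
elsrch01j.mydomain.com write               1 0 0
elsrch01b.mydomain.com write               0 4 2
"""

def busy_pools(cat_output):
    """Yield (node, pool, active, queue, rejected) for pools with
    queue > 0 or rejected > 0."""
    for line in cat_output.strip().splitlines():
        node, pool, active, queue, rejected = line.split()
        if int(queue) > 0 or int(rejected) > 0:
            yield node, pool, int(active), int(queue), int(rejected)

for row in busy_pools(sample):
    print(row)  # only the second sample line has queued/rejected work
```

If all twelve nodes come back empty from a scan like this during a stall, the bottleneck is probably not Elasticsearch's write path but something upstream (Graylog output buffers, or the network path through the ASA/F5).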

Additionally, I was told that by default an Elasticsearch node acts as both a data node and a master-eligible node, so I was effectively running 12 Elasticsearch master-eligible nodes. That was said to be a bad idea because some sort of (synchronous?) communication is needed between them, and I was advised to limit the number of masters to three. I changed the index replica count to 9, made Elasticsearch nodes A, B and C master-only nodes, and made nodes D through L data-only nodes. I realize this reduced my potential throughput by removing three data nodes, but it was just a test to see whether things improved. It's honestly hard to say whether it made a difference, especially now: as I write this at nearly 10:00 PM, log volume is lower than at peak hours, so it's hard to stress test.
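For reference, the node-role split described above maps to the following elasticsearch.yml settings on ES 6.x (my sketch of the change, not copied from the actual config):

```yaml
# On the three dedicated master nodes (A, B, C):
node.master: true
node.data: false

# On the nine data-only nodes (D through L):
node.master: false
node.data: true

# With three master-eligible nodes, ES 6.x also needs a quorum of 2
# to avoid split brain:
discovery.zen.minimum_master_nodes: 2
```

Note that `discovery.zen.minimum_master_nodes` must be updated whenever the number of master-eligible nodes changes; it was removed in ES 7.x, where quorum is managed automatically.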

Yes, I was referring to the thread_pool, which is limited; you need to keep an eye on the queue for your tasks. @nnelson

Regarding the ES master/data nodes, that is partly true, but not entirely. I would not run dedicated master nodes in this kind of setup. But I do not know enough about your environment, and to be honest, anything more I could tell you would cut into my potential salary, because problems at that scale are what people pay me to solve …