I have a problem with my Graylog instance where I regularly see the “out” rate drop to zero, several times a minute, and often for extended periods (a minute or more). To be clear, I am not sure whether this issue is with Graylog, Elasticsearch, or something else altogether. I have tried to look at this from every angle I could before posting here for help.
The overview of the environment is Graylog 3.2.4 and Elasticsearch 6.8.8 running on Red Hat Enterprise Linux 7.8. We have four Graylog nodes sitting behind an F5 LTM load balancer and 12 Elasticsearch nodes. The F5 LTM does a solid job of round-robin distributing the load fairly evenly across all four Graylog nodes. I am not certain it is relevant, but there is a Cisco ASA firewall between the Graylog nodes and the Elasticsearch nodes; however, TCP 9200 is open from the Graylog nodes into Elasticsearch. Additionally, our F5 LTM is “in-line,” meaning it is the default gateway for our Graylog nodes and all of their traffic passes through it.
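For reference, each Graylog node talks to Elasticsearch directly over HTTP on port 9200; the relevant line in my graylog.conf looks roughly like this (hostnames are anonymized placeholders, and all 12 nodes are listed in the real file):

    elasticsearch_hosts = http://es-node-01:9200,http://es-node-02:9200,http://es-node-03:9200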
For better or worse, the environment is fully virtualized; however, I do not believe there is resource contention at the CPU, memory, or storage level. The Graylog and Elasticsearch nodes do share a common VM datastore on an all-flash, fibre-channel-attached storage array, but Graylog and Elasticsearch are the only things on that array, and performance metrics collected from the array indicate we are nowhere near its limits. CPU and memory usage across the environment also have plenty of headroom.
The Graylog servers (four total) each have 16 CPU cores and 16 GB RAM. The Graylog Java process is limited to a 4 GB heap, and roughly 4 GB of the 16 GB total is free. In my graylog.conf file I have processbuffer_processors = 10, outputbuffer_processors = 5 and inputbuffer_processors = 1.
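For completeness, the Graylog heap is pinned at 4 GB via the JVM options for the graylog-server service (-Xms4g -Xmx4g), and the buffer processors are set in graylog.conf like so:

    processbuffer_processors = 10
    outputbuffer_processors = 5
    inputbuffer_processors = 1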
The Elasticsearch servers (12 total) each have 8 CPU cores and 64 GB RAM. The Elasticsearch Java process is limited to a 30 GB heap, and roughly 4 GB of the 64 GB total is free.
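The Elasticsearch heap is pinned in jvm.options on each node, which as I understand it keeps it under the compressed object pointers cutoff:

    # jvm.options on each Elasticsearch node
    -Xms30g
    -Xmx30g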
I am currently processing a constant stream of roughly 55,000-60,000 messages per second through Graylog. As I type this, the system has just started to recover from one of these stalled outputs and is currently pushing 168,000 messages per second out, with the Elasticsearch nodes running at 100-300% CPU utilization and load averages ranging from 1.45 to 2.79 across the 12 nodes.
My default index set, which pretty much everything is written to, is configured for 12 shards and 1 replica.
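If the shard layout matters, that works out to 24 shards per active index (12 primaries plus 12 replicas), or roughly two shards per Elasticsearch node. I can confirm how they are spread out with something like this (hostname and index prefix are placeholders for my environment):

    curl -s 'http://es-node-01:9200/_cat/shards/graylog_*?v'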
I have also noticed that when the output rate drops to zero, each Graylog node’s process buffer and output buffer are 100% full.
I also can’t correlate the drops to zero with any events in the Graylog log file (such as rotating or deleting an index), and the same goes for the Elasticsearch logs. Neither seems to offer any helpful insight into what could be going on here.
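In case it points at something, the next time the output drops to zero I plan to check whether the Elasticsearch write thread pools are queuing or rejecting bulk requests, along these lines (hostname is a placeholder):

    curl -s 'http://es-node-01:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected'

My untested assumption is that if the rejected counts climb during a stall, the full process and output buffers on the Graylog side are just back-pressure from Elasticsearch.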
Any ideas on what I am doing wrong here? I am at a total loss as to what to try next.