Messages getting queued, high load on ES cluster

I am running Graylog 3.1.2 and an Elasticsearch 6.8.3 cluster on Kubernetes in AWS. The data is stored on SC1 EBS volumes.
The specs are as follows:
Master only (3x): 2 GB RAM, 1 CPU, 1 GB heap
Data and ingest (2x): 24 GB RAM, 6 CPU, 12 GB heap
Graylog (1x): 12 GB RAM, 6 CPU, 6 GB heap

On a good day, Graylog pushes 7-12k messages/s.
However, after a few days it drops to around 1k/s for stretches of a few seconds to a minute, and my messages start to queue up.

On the Elasticsearch side (using Cerebro for stats), the cluster is green, but the load on one of the nodes is really high. Heap is not maxed out, and CPU and RAM usage are barely past 5%. Restarting the pod doesn't do anything; however, if I shut the node down and a brand new EC2 instance comes up in its place so the data node starts fresh, I hit 12k/s again.
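
In case it helps, this is roughly how I check per-node load and hot threads outside of Cerebro (a minimal sketch; the `elasticsearch` service name, port and lack of auth are assumptions about my setup):

```python
# Minimal sketch: pull per-node load and hot threads straight from the
# Elasticsearch REST API to see which data node is struggling.
# ES_URL is an assumption (in-cluster service name, no auth) -- adjust as needed.
import urllib.request

ES_URL = "http://elasticsearch:9200"

def fetch(path: str) -> str:
    """Return the plain-text response of an Elasticsearch GET request."""
    with urllib.request.urlopen(f"{ES_URL}{path}", timeout=10) as resp:
        return resp.read().decode("utf-8")

# Per-node load average, heap and CPU usage (standard _cat/nodes columns).
print(fetch("/_cat/nodes?v&h=name,load_1m,heap.percent,cpu"))

# Hot threads on the busy node often point at merges, GC or slow I/O (e.g. SC1 EBS).
print(fetch("/_nodes/hot_threads?threads=3"))
```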

Any ideas?

I can’t tell exactly what is wrong, but here are some things to check:

  • Make sure Elasticsearch has roughly 2-3% of your raw data volume available as memory.
  • If you have a lot of messages in the queue, check the Graylog node’s processor and output buffers (see the sketch after this list):
    if the processor buffer is full but the output buffer is not, the problem is with processing (Graylog needs more resources, or fewer tasks); if the output buffer is full and that backs up the processor buffer as well, the problem is with your ES cluster.
  • I have seen a single misbehaving ES node cause problems for the whole cluster. Does restarting that node solve it?
  • I’m not an ES specialist, but I don’t think you need more master nodes than data nodes. Maybe try adding one more data node.
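
Here is a rough sketch of that buffer check via the Graylog REST API. The endpoint paths (/api/system/buffers, /api/system/journal), host name and credentials are assumptions about a typical 3.x setup, so verify them against your API browser first:

```python
# Minimal sketch: read process/output buffer and journal utilization from the
# Graylog REST API. Endpoint paths, host and credentials are assumptions --
# check them in your Graylog 3.1 API browser before relying on this.
import base64
import json
import urllib.request

GRAYLOG_URL = "http://graylog:9000"
AUTH = base64.b64encode(b"admin:password").decode()  # replace with real credentials/token

def get(path: str) -> dict:
    """Fetch a Graylog API endpoint and return the decoded JSON body."""
    req = urllib.request.Request(
        f"{GRAYLOG_URL}{path}",
        headers={"Authorization": f"Basic {AUTH}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())

# A full output buffer with an idle process buffer usually points at ES;
# a growing journal means messages are queuing up on disk.
print(json.dumps(get("/api/system/buffers"), indent=2))
print(json.dumps(get("/api/system/journal"), indent=2))
```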
