2 VMs
Elasticsearch data node
16 cores
64 GB memory
Applications: Elasticsearch data
Elasticsearch JVM settings: -Xms30g -Xmx30g
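For anyone comparing setups: the heap values above live in Elasticsearch's jvm.options (the path below assumes a standard package install). Keeping Xms and Xmx equal and under roughly 32 GB keeps compressed object pointers enabled.

    # /etc/elasticsearch/jvm.options (path assumed; adjust for your install)
    -Xms30g
    -Xmx30g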
With the above setup, my input rate ranges from 2k up to 20k messages per second at peak times, but the output to Elasticsearch is only in the range of 50-200 messages per second.
I have about 200 alerts configured on Graylog.
As the company's infrastructure has grown, the log volume has kept increasing over time. I now end up with a full disk journal and very little output to Elasticsearch, and this also triggers a lot of false alerts.
I have seen many posts asking the same question, and the usual suggestion is to look at the Elasticsearch servers. In my case, the Elasticsearch servers are using less than 10% CPU while the disk journal in Graylog is full.
There is no clear guidance from Graylog on how to tune its output performance.
Can someone help me improve the output from Graylog? I am not expecting Graylog to output 14k logs per second, but what worries me is that a 12-core server cannot output even a thousand logs per second.
I have already tried changing the values of processbuffer_processors, outputbuffer_processors and output_batch_size, but it has had no impact on the output. I know the next suggestion will be to move Kafka to a different server; I have already tried that as well.
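For context, this is roughly what I have been experimenting with in server.conf; the values below are only an illustration of the knobs mentioned above, not a recommendation:

    # /etc/graylog/server/server.conf -- example values only
    processbuffer_processors = 8
    outputbuffer_processors = 4
    output_batch_size = 1000
    # also worth checking (defaults assumed here):
    output_flush_interval = 1
    processor_wait_strategy = blocking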
I checked the thread dump and could not see any blocking threads. All the inputs I use are GELF/Kafka inputs; I have around 17 of them, each reading from its own Kafka topic.
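In case it helps, this is roughly how I pulled the thread dump via the REST API (host, port and credentials below are placeholders; the same dump is also available under System > Nodes in the UI):

    # placeholders -- adjust host, port and credentials to your setup
    curl -u admin:password 'http://graylog.example.com:9000/api/system/threaddump'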
Having had a similar issue in my environment and having looked at other posts over time, I am beginning to suspect it is a runaway/recursive GROK issue. I have mostly eliminated GROK in favor of key/value, split and regex, and it seems to have helped. I am not sure yet, as I never see any errors in the logs.
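For what it's worth, the kind of change I made was moving simple key=value payloads from a GROK extractor to a pipeline rule built on key_value(); a rough sketch (the field name and delimiters are assumptions, adjust to your messages):

    rule "parse key/value pairs instead of GROK"
    when
      has_field("message")
    then
      // splits "k1=v1 k2=v2 ..." style payloads without regex backtracking
      set_fields(
        key_value(
          value: to_string($message.message),
          delimiters: " ",
          kv_delimiters: "="
        )
      );
    end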
Which VM is limiting the throughput? Graylog? Elastic?
It's Graylog. The ES servers use less than 10% CPU.
Which VM is limiting the throughput? Graylog? Elastic?
Not sure how to determine this. CPU is at almost 90% on the servers.
Messages are sitting in the disk journal, so that will reduce processing.
There was further research in the post I put up recently that helped show which messages were locking up the process buffers. It was definitely a GROK-lock scenario for me… The post shows where to look for locked process buffers.
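If it helps anyone else: the process-buffer dump is what showed me the stuck messages. There is a button for it on the node page under System > Nodes, and an equivalent REST call; the path below is from memory and may differ by version, and the node ID, host and credentials are placeholders:

    # placeholders -- adjust node ID, host and credentials to your setup
    curl -u admin:password 'http://graylog.example.com:9000/api/cluster/<node-id>/processbufferdump'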