Graylog Output issue to Elasticsearch

My Graylog setup

3 VMs
12 cores
32 GB memory
Applications - Zookeeper, Kafka, MongoDB, Graylog
Zookeeper JVM settings - -Xms512M -Xmx512M
Kafka JVM settings - -Xms1G -Xmx1G
MongoDB settings - default
Graylog JVM settings - -Xms8g -Xmx8g
processbuffer_processors = 12
outputbuffer_processors = 1
output_batch_size = 1000

3 VMs
4 cores
8 GB memory
Applications - Elasticsearch master node
Elasticsearch JVM settings - -Xms4g -Xmx4g

2 VMs
16 cores
64 GB memory
Applications - Elasticsearch data node
Elasticsearch JVM settings - -Xms30g -Xmx30g

With the above setup, incoming logs range from about 2k msgs/sec up to 20k msgs/sec at peak times, but the output to Elasticsearch is only in the range of 50-200 msgs/sec.

I have about 200 alerts configured on Graylog.

With the company's infrastructure growing, the log volume has kept increasing over time. So now I end up with a full disk journal and very little output to Elasticsearch, and I get too many false alerts because of this.

I have seen many posts asking the same question, and the suggestions there say to look at the Elasticsearch servers. My Elasticsearch servers are using less than 10% CPU while the disk journal in Graylog is full.
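
For what it's worth, the Elasticsearch side can be double-checked with the _cat and cluster health APIs, something like this (the hostname is a placeholder; the bulk-indexing thread pool is called write on ES 6.x+ and bulk on older versions):

```
# Placeholder hostname - run against one of the data nodes.
# Non-zero "rejected" counts would mean Elasticsearch is pushing back on bulk requests.
curl -s 'http://es-data-1.example.org:9200/_cat/thread_pool/write,bulk?v&h=node_name,name,active,queue,rejected'

# Overall cluster state (status, relocating/initializing shards, pending tasks).
curl -s 'http://es-data-1.example.org:9200/_cluster/health?pretty'
```

If the rejected column stays at zero and the cluster is green, the bottleneck is most likely on the Graylog side.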

There is no clear guidance from Graylog on how to tune its output performance.

Can someone help me improve the output from Graylog? I am not expecting Graylog to output 14k logs per second, but what worries me is that a 12-core server is not able to output even a thousand logs per second.

I have already tried changing the values of processbuffer_processors, outputbuffer_processors, and output_batch_size, but it had no impact on the output. I know the next suggestion will be to move Kafka to a different server, but I have already tried this as well.
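
For reference, the kind of rebalancing I experimented with in /etc/graylog/server/server.conf looked roughly like this (the numbers are only an example of the ratio I tried, not a recommendation; I kept the total processor count at or below the 12 cores and restarted graylog-server after each change):

```
# example values only - keep processbuffer + outputbuffer + inputbuffer processors
# at or below the number of cores on the box
processbuffer_processors = 8
outputbuffer_processors = 3
inputbuffer_processors = 1
# send larger bulk requests to Elasticsearch per flush
output_batch_size = 5000
```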

Perhaps related to this post where it was a GROK issue? The post, towards the end, shows the beginnings of how it was tracked down…

I checked the thread dump and could not see any blocked threads. All the inputs I use are GELF/Kafka inputs. I have around 17 inputs, each reading from its own Kafka topic.
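
In case it helps, this is roughly how I captured and filtered the dump (the pgrep pattern and thread names are how they appear on my install, so adjust them for yours; jstack has to run as the same user as the graylog-server JVM):

```
# dump all JVM threads of graylog-server and pull out the process buffer processors
# (pgrep pattern and thread names as seen on my install - adjust if yours differ)
jstack "$(pgrep -f graylog-server)" > /tmp/graylog-threads.txt
grep -A 20 -i 'processbufferprocessor' /tmp/graylog-threads.txt | less
```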

Having had a similar issue in my environment and looking at other posts over time, I am beginning to suspect it is a runaway/recursive GROK issue. I have mostly eliminated GROK in favor of key/value, split, and regex, and it seems to have helped. Not sure yet, as I never see any errors in the logs.

Sorry, I should have mentioned this. I am not using any GROK patterns. All my inputs are GELF, which is JSON format.

Which buffer fills up first, before the logs fill the journal? The process buffer? The output buffer? (You can see it in System -> Nodes, select a node.)
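
If you prefer the API over the UI, the buffer utilisation is also exposed as metrics, along these lines (host and credentials are placeholders, and the metric names may differ slightly between Graylog versions):

```
# placeholders for host and credentials; metric names may vary by Graylog version
curl -s -u admin:password 'http://graylog1.example.org:9000/api/system/metrics/org.graylog2.buffers.process.usage'
curl -s -u admin:password 'http://graylog1.example.org:9000/api/system/metrics/org.graylog2.buffers.output.usage'
```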

Which VM is limiting the throughput? Graylog? Elastic?

Which resource is limiting the throughput? CPU? Disk I/O?

It is the process buffer that is always full.

Which VM is limiting the throughput? Graylog? Elastic?
It's Graylog. The ES servers use less than 10% CPU.

Which resource is limiting the throughput? CPU? Disk I/O?
Not sure how to determine this. CPU is at almost 90% on the Graylog servers.
Messages are sitting in the disk journal, so that will reduce processing.
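
For reference, this is how I am watching the journal fill up, apart from the node page in the UI (host and credentials are placeholders; the endpoint is what my install exposes, so it may differ on other versions):

```
# placeholders for host/credentials; returns journal size, utilization and number of unread messages
curl -s -u admin:password 'http://graylog1.example.org:9000/api/system/journal'
```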

There was further research in the post I linked recently that helped show which messages were locking up the process buffers. It was definitely a GROK-lock scenario for me… The post shows where to look for locked process buffers.

For CPU, use the “top” command and check the load averages (the three values).
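
For example (the load averages are the three numbers at the end of the first line; compare them against the core count of the VM):

```
# non-interactive snapshot; the load averages are the three numbers at the end of the first line
top -b -n 1 | head -n 5

# or just the load averages on their own
uptime
```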

For disk I/O, check with the “iostat” command and also on the ESX host.
What is the hardware? SSD? RAID?
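
Something like this (iostat comes with the sysstat package) shows per-device utilisation and latency; consistently high %util and await on the disk holding the journal would point at disk I/O as the limit:

```
# extended device statistics, 5-second intervals, 3 samples (needs the sysstat package)
iostat -x 5 3
```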
