Graylog Output issue to Elasticsearch

mev9669 · April 17, 2020, 5:07pm

My Graylog setup

3 VMs
12 core
32 GB memory
Applications - Zookeeper, Kafka, MongoDB, Graylog
Zookeeper JVM settings - -Xmx512M -Xms512M
Kafka JVM settings - -Xmx1G -Xms1G
Mongo JVM settings - Default
Graylog - -Xms8g -Xmx8g
processbuffer_processors = 12
outputbuffer_processors = 1
output_batch_size = 1000

3 VMs
4 Core
8GB memory
Applications - Elasticsearch Master
Elasticsearch JVM settings - -Xms4g -Xmx4g

2 VMs
Elasticsearch Data node
16 core
64GB memory
Applications - Elasticsearch Data
Elasticsearch JVM settings - -Xms30g -Xmx30g

With the above setup, I get an input of logs in the range of 2k to up to 20k during peak times and I get an output of only in the range of 50-200 msgs per second.

I have about 200 alerts configured on Graylog.

I have seen many similar posts before and with the increased infrastructure in the company, the logs have kept increasing over time. So now I end up with a full disk journal and very little output to Elasticsearch. I get too many false alerts because of this.

I have seen many posts asking the same question and the suggestions I have seen are saying to look at Elasticsearch servers. The Elasticsearch servers are using less than 10% CPU when I have a full disk journal in Graylog.

There is no clear solution from Graylog to tune the output performance from Graylog.

Can someone help me with improving the output from Graylog? I am not expecting Graylog to output 14k logs per second. But what worries me that a 12 core server is not able to output even a thousand logs per second.

I have already tried changing the values of processbuffer_processors, outputbuffer_processors, output_batch_size. But it has no impact on the output. I know the next suggestion will be to move Kafka to a different server. But I have already tried this as well.

tmacgbay · April 17, 2020, 5:56pm

Perhaps related to this post where it was a GROK issue? the post at the end shows the beginnings of how it was tracked down…

mev9669 · April 20, 2020, 10:27am

I checked the thread dump. Could not see any blocking threads. All the inputs i use are Gelf/Kafka inputs. I have around 17 inputs reading from kafka topics for each particular inputs.

tmacgbay · April 20, 2020, 2:28pm

Having had a similar issue in my environment and looking at other posts over time I am beginning to suspect it is a runaway/recursive GROK issue. I have mostly eliminated GROK in favor of key/value, split and regex and is seems to have helped. Not sure yet as I never see any errors in logs.

mev9669 · April 20, 2020, 2:33pm

Sorry. I should have mentioned this. I am not using any GROK patterns. All my inputs are GELF which is JSON format.

frantz · April 24, 2020, 2:42pm

Which buffer is full first before logs fill the journal ? process buffer ? output buffer ? (you can see it in System -> Nodes, select a node)

Which VM is limiting the throughput ? Graylog ? Elastic ?

Which resource is limiting the throughput ? CPU ? Disk I/O ?

mev9669 · April 24, 2020, 2:52pm

It is the process buffer that is always full.

Which VM is limiting the throughput? Graylog ? Elastic ?
its Graylog. ES servers use less than 10% CPU power.

Which VM is limiting the throughput ? Graylog ? Elastic ?
Not sure how to get this. CPU is on almost 90% on the servers.
Messages are on DISK Journal. So it will reduce processing.

tmacgbay · April 24, 2020, 3:14pm

There was further research in the post I put out recently that helped show which messages were locking up the process buffers. It was definitely a GROK-lock scenario for me… It shows in the post where to look for locked process buffers.

frantz · April 24, 2020, 3:27pm

For CPU use the “top” command and check the load averages (the three values).

For Disk I/O check with the command “iostat” and also on the ESX host.
What is the hardware ? SSD ? RAID ?

system · May 8, 2020, 3:27pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Process and output buffer is 100% utilized Graylog Central (peer support)	5	9392	July 26, 2018
Process Buffer Flooding 100% process Graylog Central (peer support)	8	4664	May 7, 2020
Elasticsearch optimization Graylog Central (peer support)	3	1107	January 23, 2023
Very high outputbuffer_processors count and small output_batch_size for better message throughput? Graylog Central (peer support)	5	7950	January 1, 2018
Sudden performance downgrade and search error Graylog Central (peer support)	6	685	August 12, 2020

Graylog Output issue to Elasticsearch

Related topics