Graylog cluster in Kubernetes

cake007 · April 14, 2024, 4:16pm

I have Graylog and Elasticsearch deployed in Kubernetes: three Graylog nodes as a cluster and three Elasticsearch nodes, with one master and two master-eligible nodes. Each worker node has 64 cores, 128 GB RAM and 20 TB of disk. We are getting 2 TB of logs per day. Logs come from the Kubernetes cluster and other applications, each with a different size. Right now I am only getting about 7,000 messages output per second per node, so around 20k to 25k messages are processed per second in total, but the input rate is 20k to 100k per second. Most of the messages therefore go to the journal, which holds millions of unprocessed messages, and I am unable to view the logs from the dashboard most of the time.
The CPU usage of Graylog and Elasticsearch is low, so I am unable to figure out the bottleneck here…

Graylog 5.0, 32 GB heap
Elasticsearch 7.10, 32 GB heap
MongoDB 5.0

The Graylog leader and the Elasticsearch master are on dedicated worker nodes, and the rest are on shared worker nodes.
Please help. I want to use all my CPU resources and increase processing to 100k messages per second. What else do I need to do? Should I add more data nodes, even though the current servers are underutilized? What am I missing here?
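
To put the rates quoted in this post side by side, here is a rough sketch of how fast the journal fills when input outpaces output. The 25,000/s output and the 20,000/s and 100,000/s inputs come from the post; the 50,000/s case is only an illustrative midpoint, and the function name is made up for this example.

```python
def journal_growth_per_hour(input_rate, output_rate):
    """Messages added to the journal per hour when input exceeds output."""
    backlog_per_sec = max(0, input_rate - output_rate)
    return backlog_per_sec * 3600


# Rates in messages per second; 50_000 is an assumed midpoint, not a measured value.
OUTPUT_RATE = 25_000
for input_rate in (20_000, 50_000, 100_000):
    growth = journal_growth_per_hour(input_rate, OUTPUT_RATE)
    print(f"input {input_rate:>7}/s -> journal grows by {growth:,} msgs/hour")
```

At the peak input rate the gap is 75,000 messages per second, which is why the backlog reaches millions within minutes.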

Joel_Duffield · April 14, 2024, 8:36pm

Can you post a screenshot of the nodes page for each of the Graylog nodes, from System > Nodes? It's helpful to have a look at the journals and buffers on each node.

cake007 · April 15, 2024, 6:09am

Thank you for the reply

cake007 · April 15, 2024, 6:11am

Unable to view messages from the dashboard.

cake007 · April 15, 2024, 6:14am

Each Graylog node is processing only 3,000 to 7,000 messages per second, but when I checked, CPU usage is only about 20%, and Elasticsearch is not using system resources either, even though it has 64 cores and 128 GB RAM. How many nodes, and with what spec, are needed to process this load without unprocessed messages?

Joel_Duffield · April 15, 2024, 9:38am

From that, it looks most likely to be an Elasticsearch/OpenSearch problem. What heap is assigned to each of those OpenSearch nodes?

cake007 · April 15, 2024, 10:24am

32 GB for each node
3 master-eligible nodes in Elasticsearch
Each with 64 cores and 128 GB RAM

Joel_Duffield · April 15, 2024, 11:04am

Do you have any relevant messages in server.log about retries, messages being rejected, etc.?
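
As a complementary check on the Elasticsearch side, rejected bulk writes also show up in the write thread pool counters. Below is a minimal sketch that reads them through the `_cat/thread_pool` API. The URL, the lack of authentication, and the function names are assumptions for illustration. The `rejected` counter is cumulative since the node started, so compare two readings taken some time apart.

```python
import json
from urllib.request import urlopen

# Assumption: Elasticsearch is reachable here without authentication.
ES_URL = "http://localhost:9200"


def fetch_write_pool(es_url=ES_URL):
    """Fetch write thread pool stats from the _cat API as a list of dicts."""
    url = (
        f"{es_url}/_cat/thread_pool/write"
        "?format=json&h=node_name,name,active,queue,rejected"
    )
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def nodes_with_rejections(rows):
    """Return {node_name: rejected_count} for nodes that have rejected writes."""
    return {
        row["node_name"]: int(row["rejected"])  # _cat returns numbers as strings
        for row in rows
        if int(row["rejected"]) > 0
    }
```

Calling `nodes_with_rejections(fetch_write_pool())` against a live cluster returns an empty dict when no node has rejected writes.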

ihe · April 15, 2024, 3:14pm

Did you change the number of processors?
Have a look at the server config:
processbuffer_processors = ?
outputbuffer_processors = ?
inputbuffer_processors = ?
The sum of those three should be approximately the number of cores on the node.
The value of output_batch_size is also tunable.

cake007 · April 17, 2024, 7:31am

No.
The only error I got is a message with a missing mandatory host field. Other than that, there are no errors in Graylog.

cake007 · April 17, 2024, 7:32am

output_batch_size = 500
outputbuffer_processors = 20
processbuffer_processors = 40
ring_size = 256128

I tried different values, but the output rate never goes above 10k per node.

I have a 64-core processor.
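
For reference, the rule of thumb from earlier in the thread (the three processor settings should add up to roughly the core count) can be checked against these values with a small sketch. `inputbuffer_processors` was not posted, so the value 2 below is an assumption (it is the Graylog default as far as I know), and the function name is hypothetical.

```python
def processor_budget(cores, processbuffer, outputbuffer, inputbuffer):
    """Return (configured processors, cores left over) for one Graylog node."""
    total = processbuffer + outputbuffer + inputbuffer
    return total, cores - total


# processbuffer and outputbuffer values are from the post above;
# inputbuffer=2 is an assumed default, not a posted value.
total, spare = processor_budget(cores=64, processbuffer=40, outputbuffer=20, inputbuffer=2)
print(f"{total} buffer processors configured, {spare} cores to spare")
```

Under that assumption the settings already account for nearly all 64 cores, so the processor counts alone are unlikely to explain the low output rate.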

cake007 · April 17, 2024, 7:35am

Will adding more Elasticsearch nodes increase the output rate?
Also, are dedicated master and data nodes needed in Elasticsearch?
Right now I have three Elasticsearch nodes with no dedicated roles, each with 128 GB RAM and 64 cores, and a 32 GB JVM heap for Graylog and for Elasticsearch. I am planning to deploy multiple Elasticsearch instances on the same worker node, because CPU and memory usage is very low right now.

How many Graylog and Elasticsearch nodes are recommended for processing 100k messages per second? We get around 2 TB of logs daily, with 7-day retention.
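
As a back-of-the-envelope sketch of what those figures imply: 2 TB per day with 7-day retention means roughly 14 TB of retained data before replicas, and an average write rate of about 23 MB/s. The replica count of 1 is an assumption (it was not stated), decimal units are used, and the on-disk index size can differ from the raw log volume.

```python
def retained_tb(daily_tb, retention_days, replicas):
    """Data held in the cluster, counting each replica as a full extra copy."""
    return daily_tb * retention_days * (1 + replicas)


def average_mb_per_sec(daily_tb):
    """Average ingest rate in MB/s for a given daily volume (decimal units)."""
    return daily_tb * 1_000_000 / 86_400


# 2 TB/day and 7 days are from the post; replicas=1 is an assumption.
print(f"retained: {retained_tb(2, 7, replicas=1)} TB of 60 TB raw disk (3 x 20 TB)")
print(f"average ingest: {average_mb_per_sec(2):.1f} MB/s")
```

The average hides the peaks: at 100k messages per second the instantaneous write load is several times higher than the daily mean.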

Joel_Duffield · April 18, 2024, 12:38am

You should be fine CPU- and RAM-wise, but what does disk performance look like? Are you maxing out IOPS or anything like that on the OpenSearch nodes?

bogd · April 18, 2024, 7:49am

What kind of disk do you have in your ES nodes? From your initial message, I would guess magnetic disks, and this is definitely not recommended with ES.

All the official docs recommend SSDs for (at least) the hot tier. And based on personal experience, trying to use magnetic disks will lead to them getting overwhelmed very quickly. Especially considering that your ingestion rate is not that low.

To answer your other questions:

- Adding ES nodes may increase the processing rate. But in your case, if you are adding nodes on the same physical hardware, on the same disks which are already hitting their limit, you will probably not see a benefit.
- Dedicated master/data nodes are not a requirement in this case. You only have three nodes, so you should be fine for now.

ihe · April 18, 2024, 6:28pm

I agree with you: your Graylog is not the bottleneck. I am very sure it’s OpenSearch.
Your output buffer is full → either there are not enough output processors (not the case) or there is not enough power to process the logs on the OpenSearch end. The flow out of Graylog should therefore be good → it must be OpenSearch.

@bogd asked a very good question: what kind of disks/SSDs do you have for OpenSearch? IO is very important here, and magnetic drives will not deliver at that scale.
Can you run iotop on your OpenSearch nodes? How much IO is it doing, and how high are the IO waits? You might also run “top” to see your IO waits, as you can read here.
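
If iotop is not available inside the container, the same IO wait figure that top reports can be derived from two samples of the aggregate `cpu` line in `/proc/stat` on Linux. This is only a minimal sketch; the function names are made up for this example.

```python
def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat, which holds the aggregate CPU counters."""
    with open(path) as f:
        return f.readline()


def cpu_times(stat_line):
    """Parse the aggregate 'cpu' line into a list of integer tick counters."""
    fields = stat_line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(v) for v in fields[1:]]


def iowait_percent(before, after):
    """Share of CPU time spent waiting on IO between two samples, in percent."""
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))]
    # Fields: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already counted in user/nice, so only the first 8 are summed.
    total = sum(deltas[:8])
    if total == 0:
        return 0.0
    return 100.0 * deltas[4] / total
```

To use it, call `read_cpu_line()` twice a few seconds apart on the node that hosts the OpenSearch data and pass both lines to `iowait_percent`. A persistently high value while indexing points at the disks rather than CPU.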

cake007 · April 19, 2024, 1:24pm

We are using SSD disks.
We tried deploying multiple instances of ES and the processing improved.
How much JVM heap should we assign to the ES nodes? Also, any sharding recommendations? We are using 20+ shards now.
Is 12 GB enough for the ES nodes? We are deploying 10 ES nodes with 12 GB each.
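
One way to sanity-check a heap plan like this is the general Elasticsearch guidance to give the heap no more than about half of the memory available to a node, leaving the rest for the filesystem cache. The sketch below applies that to a 128 GB worker. The number of ES instances per worker is hypothetical (the thread does not say how the 10 nodes are spread), and the function name is made up for this example.

```python
WORKER_RAM_GB = 128  # worker node RAM, from the thread


def worker_memory_needed_gb(es_instances, es_heap_gb, other_heap_gb=0):
    """RAM a worker needs if every ES heap is matched by equal filesystem cache."""
    return es_instances * es_heap_gb * 2 + other_heap_gb


# Hypothetical layouts: how many 12 GB ES instances share one worker.
for instances in (3, 4, 6):
    needed = worker_memory_needed_gb(instances, es_heap_gb=12)
    verdict = "fits" if needed <= WORKER_RAM_GB else "does not fit"
    print(f"{instances} x 12 GB heap -> {needed} GB needed, {verdict} in {WORKER_RAM_GB} GB")
```

If a Graylog node with its own heap shares the worker, pass it as `other_heap_gb`. In Kubernetes the container memory limit, not the worker's RAM, is what each instance actually sees, so the same arithmetic applies per pod.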

gsmith · April 20, 2024, 3:25am

Hey,

Just chiming in. Have you tried raising your output_batch_size above 500?
