Ah, OK, I understand: so that is the Graylog Prometheus counter gl_output_throughput and you are seeing drops. You are not using any additional or special output, just what is being written to Elasticsearch.
Other than this graph/chart, are you experiencing any issues? What happens in those dips? Do you have to take manual action to recover, or does it recover on its own? Any additional context or information you can provide would be very helpful.
Typically a sudden drop in output means that Graylog isn't able to successfully hand off the messages to the indexer (Elasticsearch/OpenSearch) because the indexer cannot keep up with the message volume. This can happen for any number of reasons, but most commonly it is disk throughput, and sometimes CPU.
Are you able to check resource monitoring for Elasticsearch/OpenSearch during this time to see what CPU/RAM utilization looked like? Also storage health?
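If you don't have a monitoring stack in front of the cluster, the _cat/nodes API gives a quick per-node snapshot of CPU, heap, RAM, and disk. A minimal sketch in Python (the URL is a placeholder; add auth/TLS as your cluster requires, and column availability can vary slightly by version):

```python
import requests

# Quick snapshot of per-node resource usage from the _cat/nodes API.
ES_URL = "http://localhost:9200"  # assumption: point this at your cluster

resp = requests.get(
    f"{ES_URL}/_cat/nodes",
    params={
        "v": "true",
        "h": "name,cpu,load_1m,heap.percent,ram.percent,disk.used_percent",
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.text)
```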
One other trick that can greatly reduce resource utilization and allow for more efficient batching of writing messages to disk is the Graylog index set setting "Field type refresh interval":
You can view this by editing any of your Graylog index sets. It defaults to 5 seconds; increasing it can add a small delay before new messages become available in Graylog, but it can also reduce pressure on your indexer. You can experiment with different values; I've used a value as high as 30 seconds with great success.
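If you want to experiment at the indexer level directly (my assumption being that this setting ultimately corresponds to the index-level refresh_interval), you can set index.refresh_interval through the settings API. A rough sketch; the URL and index name are placeholders:

```python
import requests

ES_URL = "http://localhost:9200"  # assumption: point this at your cluster
INDEX = "graylog_0"               # hypothetical name; use your active write index

# Raise the refresh interval so the indexer refreshes segments less often
# (trades search freshness for indexing efficiency).
resp = requests.put(
    f"{ES_URL}/{INDEX}/_settings",
    json={"index": {"refresh_interval": "30s"}},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```

Since Graylog manages its indices, a setting applied manually like this may not carry over when the index rotates, so the index set UI is the safer place for a permanent change.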
I checked the Elasticsearch monitoring and confirmed that CPU and RAM are not fully loaded.
How do network utilization and disk I/O look?
What happens with the buffers is that if Graylog cannot write logs out to the indexer/backend (Elasticsearch/OpenSearch), the output buffer will hold those logs until they can be written. If the output buffer fills to 100%, messages start backing up into the process buffer. This is consistent with what is shown in your most recent screenshot.
In effect, your message throughput (EPS, events per second) is too high for your cluster.
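If you want to put a number on that, and since you are already scraping gl_output_throughput into Prometheus, a rate() over the counter gives you the EPS Graylog is actually getting written out around the dips. A quick sketch against the Prometheus HTTP API (the server URL is a placeholder):

```python
import requests

PROM_URL = "http://prometheus:9090"  # assumption: your Prometheus server

# gl_output_throughput is a counter, so rate() over a short window
# approximates the events per second Graylog writes to the indexer.
resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": "sum(rate(gl_output_throughput[5m]))"},
    timeout=10,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["value"])  # [timestamp, value-as-string]
```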
Some important things you'll want to review/verify:
Make sure your nodes are configured with the recommended amount of heap (no more than about half of system RAM, and below ~32 GB so compressed object pointers stay enabled).
The recommended shard count is no more than 20 × the GB of heap, per Elasticsearch node.
For example, if you have 12 nodes with 31 GB of heap each, you should have no more than 7440 shards (see the worked example below).
The ideal, optimized shard size is 20 GB - 50 GB.
This is easier said than done, as Graylog until this most recent release did not provide any way to manage it. In Graylog 5.1 we added a time-size-optimizing retention option.
Disk throughput is one of the limiting factors in message throughput for Elasticsearch. If you hit a practical performance limit with your current number of Elasticsearch nodes and all of the above tuning has been done, you can scale horizontally, which is to say add additional Elasticsearch nodes.
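To make the shard arithmetic concrete, here is a small sketch applying both rules of thumb; every input is an example value, so plug in your own node count, heap, and retained volume:

```python
# Worked example of the shard guidance above. All numbers are illustrative;
# substitute your own node count, heap size, and retained data volume.
data_nodes = 12          # example: Elasticsearch/OpenSearch data nodes
heap_gb_per_node = 31    # example: JVM heap per data node (GB)
retained_tb = 100        # example: total index data you plan to keep

# Rule of thumb: no more than ~20 shards per GB of heap, per node.
shard_ceiling = data_nodes * heap_gb_per_node * 20
print(f"Shard ceiling for the cluster: {shard_ceiling}")       # 7440

# Rule of thumb: keep individual shards between 20 GB and 50 GB.
retained_gb = retained_tb * 1024
print(f"Shards needed at 50 GB each: {retained_gb / 50:.0f}")  # 2048
print(f"Shards needed at 20 GB each: {retained_gb / 20:.0f}")  # 5120
```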
I hope this is helpful in some way. This is all general advice; without working directly with you and your environment, it is difficult to provide specific, actionable resolutions.
Thank you very much, I will try my best to provide you with detailed data.
I can confirm that the network and disk I/O of the 12 data nodes are normal, but the 3 master nodes have alerts about insufficient memory. I am already upgrading the master nodes now.
Do you mean here that the GRAYLOG_RING_SIZE=524288 internal queue (ring buffer) is too large?
My Graylog node has 32 CPUs and 64 GB RAM, so I set the heap to 32 GB. Official recommendations say it should not exceed 32 GB, but I am not sure what impact setting it to exactly 32 GB will have.
This is my current Graylog allocation for indices, shards, and storage, which I can adjust at any time:
I have 12 nodes, so I set it to 12 shards (no replicas); 30 is the shard size (30 GB); the last value is the number of retained indices. So I got the following data:
A total of 3860 shards
A total of 345 indices
A total of 123 TB of space is required
The total number of worker threads should be below the total number of CPU cores.
Which it looks like you've done. I'm afraid I can't offer much else. I still believe the bottleneck is with OpenSearch, but it's not clear why. It does still hold true that if OpenSearch cannot keep up with the volume of messages, the output buffer fills up; once that fills, the process buffer fills (just reiterating from before), and this is what we are seeing.
Can you post a screenshot of the gl_buffer_usage metric? It should look something like this:
I'm curious if the output buffer is only full SOMETIMES or if it is ALWAYS full.
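If it's easier than a screenshot, a range query over that same metric will tell you whether the output buffer is pinned at 100% or only spikes during the dips. A sketch (the Prometheus URL and the label I use to pick out the output buffer are assumptions, so check your own metric labels first):

```python
import time
import requests

PROM_URL = "http://prometheus:9090"  # assumption: your Prometheus server

# Pull the last 6 hours of output-buffer utilization at 1-minute resolution.
# The label used to select the output buffer ("type" here) is a guess --
# check the labels on gl_buffer_usage in your exporter.
end = time.time()
resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={
        "query": 'gl_buffer_usage{type="output"}',
        "start": end - 6 * 3600,
        "end": end,
        "step": "60s",
    },
    timeout=30,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    values = [float(v) for _, v in series["values"]]
    print(series["metric"], "min:", min(values), "max:", max(values))
```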
My best suggestion is to add more OpenSearch nodes and see if that relieves the pressure.