Graylog output stops intermittently

There are about 20 TB of logs per day.

I have 12 Graylog nodes (AWS EC2, c5a.8xlarge): 32 CPUs, 64 GB RAM each.
I have 12 Elasticsearch nodes (AWS EC2, c5a.8xlarge): 32 CPUs, 64 GB RAM each.

Graylog is deployed with docker-compose; the relevant configuration follows:

      - GRAYLOG_SERVER_JAVA_OPTS=-Xms32g -Xmx32g -XX:NewRatio=1 -XX:MaxMetaspaceSize=16G -server -XX:+ResizeTLAB  -XX:-OmitStackTraceInFastThrow
      - GRAYLOG_TIMEZONE=Asia/Shanghai
      - GRAYLOG_HTTP_EXTERNAL_URI=http://${node_ip}:9000/
      - GRAYLOG_HTTP_PUBLISH_URI=http://${node_ip}:9000/
      - GRAYLOG_WEB_ENDPOINT_URI=http://${node_ip}:9000/api
      - GRAYLOG_WEB_ENABLE=true      
      - GRAYLOG_REST_TRANSPORT_URI=https://${GRAYLOG_domain}:9000/api/
      - GRAYLOG_MONGODB_URI=mongodb://${mg_graylog_user}:${mg_graylog_pass}@mongodb_01:27017,mongodb_02:27017,mongodb_03:27017/graylog?replicaSet=messpush0
      - GRAYLOG_ELASTICSEARCH_HOSTS=http://${es_graylog_user}:${es_graylog_pass}@es01:9200,http://${es_graylog_user}:${es_graylog_pass}@es02:9200,http://${es_graylog_user}:${es_graylog_pass}@es03:9200,http://${es_graylog_user}:${es_graylog_pass}@es04:9200,http://${es_graylog_user}:${es_graylog_pass}@es05:9200,http://${es_graylog_user}:${es_graylog_pass}@es06:9200,http://${es_graylog_user}:${es_graylog_pass}@es07:9200,http://${es_graylog_user}:${es_graylog_pass}@es08:9200,http://${es_graylog_user}:${es_graylog_pass}@es09:9200,http://${es_graylog_user}:${es_graylog_pass}@es10:9200,http://${es_graylog_user}:${es_graylog_pass}@es11:9200,http://${es_graylog_user}:${es_graylog_pass}@es12:9200
      - GRAYLOG_RING_SIZE=524288
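One constraint worth knowing about that last setting: Graylog documents that ring_size must be a power of two (the buffers are based on the LMAX Disruptor), which 524288 = 2^19 satisfies. A quick sanity check:

```python
# GRAYLOG_RING_SIZE must be a power of two (a documented Graylog constraint,
# since the ring buffer is based on the LMAX Disruptor).
ring_size = 524288

# standard bit trick: a power of two has exactly one bit set
is_power_of_two = ring_size > 0 and (ring_size & (ring_size - 1)) == 0
print(is_power_of_two)  # True: 524288 == 2**19
```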

Why is the output throughput so unstable?

What is the y axis (unit of measurement) on that chart?

Also what output are you using?

The unit is ten thousand (messages), and the output goes to Elasticsearch.

Ah, OK, I understand. So that is the Graylog Prometheus counter gl_output_throughput, and you are seeing drops. You are not using any additional or special output, just what is being written to Elasticsearch.

Other than this graph/chart, are you experiencing any issues? What happens in those dips? Do you have to manually take an action to recover? Does it recover on its own? Any additional context or information you can provide would be very helpful.

It recovers by itself, but it causes a lot of log backlog, and I don’t know how to deal with this problem.

Is there anything else I need to provide?


When this happens, what do your output buffers look like?

Typically a sudden drop in output means that Graylog isn’t able to successfully hand off the messages to the indexer (Elasticsearch/OpenSearch) because the indexer cannot keep up with the message volume. This can happen for any number of reasons, but most commonly disk throughput and sometimes CPU usage.

Are you able to check resource monitoring for Elasticsearch/OpenSearch during this time to see what CPU/RAM utilization looked like? Also storage health?

One other trick that can greatly reduce resource utilization and allow for more efficient batching of messages written to disk is the Graylog index set setting ‘Field type refresh interval’:

You can view this by editing any of your Graylog index sets. It defaults to 5 seconds; increasing it can add a small delay before new messages become available in Graylog, but it reduces pressure on your indexer. You can experiment with different values; I’ve used a value as high as 30 seconds with great success.
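For reference, this Graylog setting controls the underlying Elasticsearch index.refresh_interval. Purely as an illustration (the host es01:9200 and the default graylog_* index prefix are assumptions based on the config in this thread), the effective value can be inspected at the Elasticsearch level:

```shell
# Inspect the refresh interval currently applied to Graylog-managed indices.
# Change the value via the Graylog index set UI, not directly in Elasticsearch,
# so newly rotated indices pick it up as well.
curl -s "http://es01:9200/graylog_*/_settings/index.refresh_interval?pretty"
```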



Both the process buffer and the output buffer are blocked.

I also checked the Elasticsearch monitoring and confirmed that CPU and RAM are not fully utilized.

Now I have tried to change the “Field type refresh interval” of the index from the default 5s to 30s.


How does network utilization and disk IO look?

What happens with the buffers is that if Graylog cannot write the logs out to the indexer/backend (Elasticsearch/OpenSearch), the output buffer holds those logs until they can be written. If the output buffer fills up to 100%, messages start backing up into the process buffer. This is consistent with what is shown in your most recent screenshot.
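That backpressure behavior can be sketched as a toy simulation (all capacities and rates below are made-up illustration numbers, not Graylog internals):

```python
from collections import deque

# Toy model of the backpressure described above: messages flow
# process buffer -> output buffer -> indexer. When the indexer drains
# slower than messages arrive, the output buffer fills first, then
# messages back up into the process buffer.
PROCESS_CAP = 100
OUTPUT_CAP = 100
process_buf, output_buf = deque(), deque()

input_rate = 50    # messages arriving per tick
indexer_rate = 10  # messages the indexer can absorb per tick (too slow)

for tick in range(10):
    # new messages enter the process buffer until it is full
    for _ in range(min(input_rate, PROCESS_CAP - len(process_buf))):
        process_buf.append(tick)
    # processed messages move on to the output buffer until it is full
    while process_buf and len(output_buf) < OUTPUT_CAP:
        output_buf.append(process_buf.popleft())
    # the slow indexer drains what it can
    for _ in range(min(indexer_rate, len(output_buf))):
        output_buf.popleft()

# both buffers end up backed up because the indexer is the bottleneck
print(f"output buffer: {len(output_buf)}/{OUTPUT_CAP}")
print(f"process buffer: {len(process_buf)}/{PROCESS_CAP}")
```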

In effect your message throughput (EPS/events per second) is too high for your cluster.

Some important things you’ll want to review/verify:

  1. The maximum heap you can practically configure (beyond which there is no measurable benefit, and in some cases a performance hit!) is 31 GB (good explanation: Why 35GB Heap is Less Than 32GB - Java JVM Memory Oddities).
    • Make sure your nodes are configured with this amount of heap.
  2. The recommended maximum shard count is 20 per GB of heap, per Elasticsearch node.
    • For example, if you have 12 nodes with 31 GB of heap each, you should have no more than 7,440 shards.
  3. The ideal, optimized shard size is 20 GB to 50 GB.
    • This is easier said than done, as Graylog until this most recent release did not provide any way to manage it. In Graylog 5.1 we added a time-size optimizing retention option.
  4. Disk throughput is one of the limiting factors in message throughput for Elasticsearch. If you hit a practical performance limit with your current number of nodes and all of the above tuning has been done, you can scale horizontally, which is to say add additional Elasticsearch nodes.
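The shard budget in point 2 reduces to simple arithmetic for the 12-node cluster described in this thread:

```python
nodes = 12          # Elasticsearch data nodes
heap_gb = 31        # practical per-node heap ceiling (see point 1)
shards_per_gb = 20  # recommended maximum shards per GB of heap

max_shards = nodes * heap_gb * shards_per_gb
print(max_shards)  # 7440
```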

I hope this is helpful in some way. This is all general advice; without working directly with you and your environment, it is difficult to provide specific, actionable resolutions.

Thank you very much, I will try my best to provide you with detailed data.

Here I can confirm that the network and disk I/O of the 12 data nodes are normal, but the 3 master nodes have alerts about insufficient memory. I am now upgrading the master nodes.

Do you mean here that the GRAYLOG_RING_SIZE=524288 internal ring buffer is too large?

My Graylog nodes have 32 CPUs and 64 GB RAM, so I set the heap to 32 GB. The official recommendation says it should not exceed 32 GB, but I am not sure what impact setting it to exactly 32 GB will have.

This is the current index, shard, and storage allocation across my Graylog index sets, which I can adjust at any time:

12 * 30 * 10 = 3600
12 * 30 * 10 = 3600
12 * 30 * 10 = 3600
12 * 30 * 10 = 3600
12 * 30 * 10 = 3600
12 * 30 * 10 = 3600
12 * 30 * 10 = 3600
12 * 30 * 10 = 3600
12 * 30 * 10 = 3600
12 * 30 * 10 = 3600
12 * 30 * 30 = 10800
12 * 30 * 30 = 10800
12 * 30 * 60 = 21600
12 * 30 * 20 = 7200
12 * 30 * 20 = 7200
12 * 30 * 20 = 7200
12 * 30 * 20 = 7200
12 * 30 * 20 = 7200
12 * 30 * 20 = 7200
4 * 30 * 5 = 600

I have 12 nodes, so I use up to 12 shards per index (no replicas); 30 is the shard size (30 GB); the last number is the count of retained indices. So I get the following totals:

A total of 3,860 shards
A total of 345 indices
A total of 123 TB of space is required
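Those totals can be cross-checked with a short script. The allocation list above is reproduced as (shards, shard size in GB, retained indices) tuples; only the index and storage totals are recomputed here:

```python
# (shards, shard_size_gb, retained_indices) per index set, from the list above
index_sets = (
    [(12, 30, 10)] * 10
    + [(12, 30, 30)] * 2
    + [(12, 30, 60)]
    + [(12, 30, 20)] * 6
    + [(4, 30, 5)]
)

total_indices = sum(retained for _, _, retained in index_sets)
total_space_gb = sum(s * size * r for s, size, r in index_sets)

print(total_indices)   # 345 indices
print(total_space_gb)  # 123000 GB, i.e. 123 TB
```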

Good to know that your shard count isn’t too high.

It looks like for ring_size:

For optimum performance your LogMessage objects in the ring buffer should fit in your CPU L3 cache

Regarding the buffer values:

The total number of worker threads should be below the total number of CPU cores.

Which it looks like you’ve done. I’m afraid I can’t offer much else. I still believe the bottleneck is with OpenSearch, but it’s not clear why. It does still hold true that if OpenSearch cannot keep up with the volume of messages, the output buffer fills up; once that fills, the process buffer fills (just reiterating from before), and this is what we are seeing.

Can you post a screenshot of the gl_buffer_usage metric? It should look something like this:

I’m curious if the output buffer is only full SOMETIMES or if it is ALWAYS full.

My best guess is to add more OpenSearch nodes to see if that relieves pressure.
