Problems with Output buffer/ElasticSearch dying

Hey everyone,

Running into a pretty crippling issue here where the processing/output buffers fill up and then all the logs write to the journal. According to the overview page, the ES cluster is showing healthy, but the only way to fix this on my end is to restart all the datanodes and wait for them to fully initialize. Once that's done the journal starts to flush and everything continues as normal. I can't seem to track down why this is happening.

Current setup is relatively simple (using 2.4 AMI from AWS):

  1. webserver configured to run as-server (m5.2xlarge)
  2. 4 datanodes set to run as-datanode (m5.large)

Looking at server performance, I don't see memory or CPU being an issue on any of them. When looking through the indexer logs, I see a whole mess of these:

{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (put-mapping) within 30s"}

Outside of just restarting the datanodes, I'm at a loss as to what I need to do, or what errors I should be looking for in the logs. Anyone know where I should begin?

Note: Yes, I know this isn't the optimal setup and the appliance shouldn't be used for any sort of smaller production setting (this is extended testing). I am currently working on building out a better/more robust 3.0 deploy, but this is what we have for now and I'd really like to get it working.

One other thing I've noticed: this seems to happen when the indices rotate. I don't see any CPU/memory issues on any of the nodes. We are only processing about 50-60GB of logs a day; is this deploy too weak for that volume of logs?

I’m completely lost on what to do/change.

Hello.

By datanode, you mean your elasticsearch cluster? Do you have anything in the ES logs?

Looking at https://discuss.elastic.co/t/failed-to-process-cluster-event-put-mapping/150354, you may have too many old indices, or maybe you need to review the index creation configuration in Graylog.
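
If you want a quick look at what the cluster is actually carrying, a rough sketch like this can help (assuming Python with the requests library, and that ES answers on one of the data nodes at localhost:9200; adjust the URL for your setup). It prints the index count, shards per node, and any stuck cluster events:

```python
# Rough diagnostic sketch -- assumes Elasticsearch answers on localhost:9200
# (adjust ES_URL for your own data nodes) and that requests is installed.
import collections
import requests

ES_URL = "http://localhost:9200"

# Total number of indices in the cluster.
indices = requests.get(f"{ES_URL}/_cat/indices?format=json").json()
print(f"indices: {len(indices)}")

# Shards held by each node -- very high counts per node are what tend to
# make cluster events (like that put-mapping) slow to process.
shards = requests.get(f"{ES_URL}/_cat/shards?format=json").json()
per_node = collections.Counter(s.get("node") for s in shards)
for node, count in per_node.most_common():
    print(f"{node}: {count} shards")

# Anything stuck in the cluster-event queue shows up here.
pending = requests.get(f"{ES_URL}/_cluster/pending_tasks").json()
print(f"pending cluster tasks: {len(pending.get('tasks', []))}")
```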

I’m fairly certain this issue is with how I have the indices configured, but I’ve really been having a hard time figuring out how exactly this should be done. The docs help, but I may have missed a best practices notice.

I’m hoping someone can see this and point out how horribly I configured this and can offer me some help.

I have 4 ES nodes in the cluster - m5.xlarge (4 vCPU / 16GB RAM) - 1 TB drives.

Index set: `3,400 indices, 10,570,300,254 documents, 3.2TB`
Index prefix: graylog
Shards: 4
Replicas: 0
Max. number of segments: 1
Index rotation strategy: Index Size
Max index size: 1073741824 bytes (1.0GB)
Index retention strategy: Delete
Max number of indices: 3400

The idea is to keep as many logs as we can, but clear out the old ones when space fills up. Is this in any way correct, or did I royally screw this up? I ask because I'm working on a more robust deploy but am still unsure how to set this up.
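
For reference, the rough sizing math behind those numbers (back-of-the-envelope arithmetic only; the 1 TB per node is the raw drive size, before any overhead):

```python
# Back-of-the-envelope for the current retention settings -- illustrative
# arithmetic only, using the index set values listed above.
max_indices = 3400
max_index_size_gib = 1.0          # 1073741824 bytes per index
data_nodes = 4
disk_tb_per_node = 1.0            # raw drive size, before overhead

retention_ceiling_tib = max_indices * max_index_size_gib / 1024   # ~3.3 TiB
raw_disk_tb = data_nodes * disk_tb_per_node                       # 4 TB

print(f"retention ceiling: ~{retention_ceiling_tib:.1f} TiB of index data")
print(f"raw disk in cluster: ~{raw_disk_tb:.1f} TB")
```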

That's quite a huge number of indices for your ES cluster to handle. How many indices do you have right now?

3,400 indices, 10,570,300,254 documents, 3.2TB is what the index set is showing. It looks like every time an index gets rotated out it causes an issue. This has been working decently for a few months at the max number of indices, but I know this can't be how it was supposed to be configured.

Like I said, I'm pretty certain I screwed this up, but I'd really like to get it back to being usable :frowning:

At the end of the day, I just want to be able to save as many logs as my disk will allow.

3,400 indices with 4 shards each on 4 servers means 3,400 shards per server. That's a lot. Make bigger indices and try to stay under 500 shards per server max (see https://discuss.elastic.co/t/max-number-of-indices/97932, and that's with bigger servers than you have).

Maybe also review your ES configuration: with 16GB of memory you don't have a lot of room for ES on those servers, so with that many shards you fill up the memory. The problem here, I think, is not the disk size but the ES memory you have.

I would try to make bigger indices, like 80-100M docs, rather than rotating by size. AFAIK you rotate the indices way too often; 1GB is quickly filled with Graylog.
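
As a rough sketch of the budget (illustrative arithmetic only, using the ~500 shards per server rule of thumb above; plug in your own numbers):

```python
# Illustrative shard/index budget -- plain arithmetic, not an ES or Graylog API.
data_nodes = 4
shards_per_node_budget = 500      # rule-of-thumb upper bound discussed above
shards_per_index = 4              # current index set setting

total_shard_budget = data_nodes * shards_per_node_budget    # 2000 shards
max_indices = total_shard_budget // shards_per_index         # ~500 indices

# Keeping roughly the same ~3.2 TB of retained data in only ~500 indices
# means each index needs to be far bigger than 1GB.
retained_gb = 3.2 * 1024
target_index_size_gb = retained_gb / max_indices              # ~6.5 GB

print(f"total shard budget: {total_shard_budget}")
print(f"max indices       : {max_indices}")
print(f"target index size : ~{target_index_size_gb:.1f} GB each")
```

Which is why rotating on something like 80-100M documents (or a much bigger size target) ends up far healthier than 1GB indices.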

Thank you so much. The only question I have about that: if I change from index size to message count, is it possible I'll run into an issue where the disk fills up on the ES cluster servers?

Yes. ES will then complain and stop ingesting messages when disk usage hits 90%.

You can also stay with size-based rotation, just take care of your shard count per server.
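
To keep an eye on that, something like this (again assuming Python with requests and ES on localhost:9200) shows disk usage and shard count per node; the exact cut-offs ES applies come from the cluster.routing.allocation.disk.watermark.* settings:

```python
# Quick disk-usage check per data node -- assumes ES on localhost:9200 and
# the requests library; _cat/allocation reports used/free disk and shard counts.
import requests

ES_URL = "http://localhost:9200"

allocation = requests.get(f"{ES_URL}/_cat/allocation?format=json").json()
for row in allocation:
    if row.get("disk.percent") is None:    # skip the UNASSIGNED row, if any
        continue
    print(f"{row['node']}: {row['shards']} shards, "
          f"{row['disk.percent']}% disk used "
          f"({row['disk.used']} used / {row['disk.avail']} free)")
```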

Awesome. Thank you guys, I will look into it and make sure it all works. Like I said, I was sure this was a configuration issue on my end.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.