Problems with Output buffer/ElasticSearch dying

Hey everyone,

Running into a pretty crippling issue here where the processing/output buffers fill up and then all the logs write to the journal. According to the overview page, the ES cluster is showing healthy, but the only way to fix this on my end is to restart all the datanodes and wait for them to fully initialize. Once that's done the journal starts to flush and everything continues as normal. I can't seem to track down why this is happening.

Current setup is relatively simple (using 2.4 AMI from AWS):

  1. webserver configured to run as-server (m5.2xlarge)
  2. 4 datanodes set to run as-datanode (m5.large)

Looking at server performance, I don't see memory or CPU being an issue on any of them. When looking through the indexer logs, I see a whole mess of these:

{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (put-mapping) within 30s"}

Outside of just restarting the datanodes, I'm at a loss as to what I need to do, or what errors I should be looking for in the logs. Anyone know where I should begin?

Note: Yes, I know this isn't the optimal setup and the appliance shouldn't be used for any sort of smaller production setting (this is extended testing). I am currently working on building out a better/more robust 3.0 deploy, but this is what we have for now and I'd really like to get it working.

One other thing I've noticed: this seems to happen when the indices rotate. I don't see any CPU/memory issues on any of the nodes. We are only processing about 50-60GB of logs a day; is this deploy too weak for that volume of logs?

I’m completely lost on what to do/change.

Hello.

By datanode, you mean your elasticsearch cluster? Do you have anything in the ES logs?

Looking at https://discuss.elastic.co/t/failed-to-process-cluster-event-put-mapping/150354, you may have too many old indices, or maybe you need to review the index creation configuration in Graylog.
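
If you want a quick look at what the cluster is actually carrying, a rough sketch like this can help (assuming Python with the requests library, and that ES answers on one of the data nodes at localhost:9200; adjust the URL for your setup). It prints the index count, shards per node, and any stuck cluster events:

```python
# Rough diagnostic sketch -- assumes Elasticsearch answers on localhost:9200
# (adjust ES_URL for your own data nodes) and that requests is installed.
import collections
import requests

ES_URL = "http://localhost:9200"

# Total number of indices in the cluster.
indices = requests.get(f"{ES_URL}/_cat/indices?format=json").json()
print(f"indices: {len(indices)}")

# Shards held by each node -- very high counts per node are what tend to
# make cluster events (like that put-mapping) slow to process.
shards = requests.get(f"{ES_URL}/_cat/shards?format=json").json()
per_node = collections.Counter(s.get("node") for s in shards)
for node, count in per_node.most_common():
    print(f"{node}: {count} shards")

# Anything stuck in the cluster-event queue shows up here.
pending = requests.get(f"{ES_URL}/_cluster/pending_tasks").json()
print(f"pending cluster tasks: {len(pending.get('tasks', []))}")
```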

I’m fairly certain this issue is with how I have the indices configured, but I’ve really been having a hard time figuring out how exactly this should be done. The docs help, but I may have missed a best practices notice.

I’m hoping someone can see this and point out how horribly I configured this and can offer me some help.

I have 4 ES nodes in the cluster - m5.xlarge (4 vCPU / 16GB RAM) - 1 TB drives.

Index set: `3,400 indices, 10,570,300,254 documents, 3.2TB`
Index prefix: graylog
Shards: 4
Replicas: 0
Max. number of segments: 1
Index rotation strategy: Index Size
Max index size: 1073741824 bytes (1.0GB)
Index retention strategy: Delete
Max number of indices: 3400

The idea is to keep as many logs as we can, but clear out the old ones when space fills up. Is this in any way correct, or did I royally screw this up? I ask because I'm working on a more robust deploy but am still unsure how to set this up.
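
For reference, the rough sizing math behind those numbers (back-of-the-envelope arithmetic only; the 1 TB per node is the raw drive size, before any overhead):

```python
# Back-of-the-envelope for the current retention settings -- illustrative
# arithmetic only, using the index set values listed above.
max_indices = 3400
max_index_size_gib = 1.0          # 1073741824 bytes per index
data_nodes = 4
disk_tb_per_node = 1.0            # raw drive size, before overhead

retention_ceiling_tib = max_indices * max_index_size_gib / 1024   # ~3.3 TiB
raw_disk_tb = data_nodes * disk_tb_per_node                       # 4 TB

print(f"retention ceiling: ~{retention_ceiling_tib:.1f} TiB of index data")
print(f"raw disk in cluster: ~{raw_disk_tb:.1f} TB")
```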

That's quite a huge number of indices for your ES cluster to handle. How many indices do you have right now?

3,400 indices, 10,570,300,254 documents, 3.2TB is what the index set is showing. It looks like every time an index gets rotated out it causes an issue. This has been working decently for a few months at the max number of indices, but I know this can't be how it was supposed to be configured.

Like I said, I'm pretty certain I screwed this up, but I'd really like to get it back to being usable :frowning:

At the end of the day, I just want to be able to save as many logs as my disk will allow.

3,400 indices with 4 shards each on 4 servers means 3,400 shards per server. That's a lot. Make bigger indices and try to stay under 500 shards per server max (see https://discuss.elastic.co/t/max-number-of-indices/97932, and that's with bigger servers than you have).

Maybe also review your ES configuration: with 16GB of memory you don't have a lot of room for ES on those servers, so with that many shards you fill up the memory. The problem here, I think, is not the disk size but the ES memory you have.

I would try to make bigger indices, like 80-100M docs, rather than rotating by size. AFAIK you rotate the indices way too often; 1GB is quickly filled with Graylog.
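
As a rough sketch of the budget (illustrative arithmetic only, using the ~500 shards per server rule of thumb above; plug in your own numbers):

```python
# Illustrative shard/index budget -- plain arithmetic, not an ES or Graylog API.
data_nodes = 4
shards_per_node_budget = 500      # rule-of-thumb upper bound discussed above
shards_per_index = 4              # current index set setting

total_shard_budget = data_nodes * shards_per_node_budget    # 2000 shards
max_indices = total_shard_budget // shards_per_index         # ~500 indices

# Keeping roughly the same ~3.2 TB of retained data in only ~500 indices
# means each index needs to be far bigger than 1GB.
retained_gb = 3.2 * 1024
target_index_size_gb = retained_gb / max_indices              # ~6.5 GB

print(f"total shard budget: {total_shard_budget}")
print(f"max indices       : {max_indices}")
print(f"target index size : ~{target_index_size_gb:.1f} GB each")
```

Which is why rotating on something like 80-100M documents (or a much bigger size target) ends up far healthier than 1GB indices.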

Thank you so much. The only question I have about that: if I change from index size to message count, is it possible I'll run into an issue where the disk fills up on the ES cluster servers?

Yes. ES will then complain and stop ingesting messages when disk usage hits 90%.

You can also stay with size-based rotation, just take care of your shard count per server.
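
To keep an eye on that, something like this (again assuming Python with requests and ES on localhost:9200) shows disk usage and shard count per node; the exact cut-offs ES applies come from the cluster.routing.allocation.disk.watermark.* settings:

```python
# Quick disk-usage check per data node -- assumes ES on localhost:9200 and
# the requests library; _cat/allocation reports used/free disk and shard counts.
import requests

ES_URL = "http://localhost:9200"

allocation = requests.get(f"{ES_URL}/_cat/allocation?format=json").json()
for row in allocation:
    if row.get("disk.percent") is None:    # skip the UNASSIGNED row, if any
        continue
    print(f"{row['node']}: {row['shards']} shards, "
          f"{row['disk.percent']}% disk used "
          f"({row['disk.used']} used / {row['disk.avail']} free)")
```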

Awesome. Thank you guys, I will look into it and make sure it all works. Like I said, I was sure this was a configuration issue on my end.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.