Graylog doesn't process messages from journal

Hi!

After upgrading from 3.0 to 3.2, I am seeing issues with random nodes in my Graylog cluster. Some nodes stop processing messages and only write them to the journal.

Some words about the setup:
56 Graylog nodes sit behind 4 LVS balancers and write logs to a huge Elasticsearch cluster (about 200 TB of data, 360 data nodes).

In most cases the nodes can be “repaired” by just restarting them, but this is an ugly solution that isn’t reliable at all.

I had a similar issue with previous versions, which I solved by increasing the “http.max_content_length” setting in Elasticsearch to 500mb.
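
For reference, that earlier change amounts to one line in elasticsearch.yml on each Elasticsearch node. This is only a sketch of the setting mentioned above: the file location depends on how Elasticsearch was installed, and because this is a static setting, each node has to be restarted before it takes effect.

```yaml
# elasticsearch.yml (path varies by install; /etc/elasticsearch/ is an assumption)
# Raises the maximum size of an HTTP request body from the default of 100mb.
http.max_content_length: 500mb
```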

The only lines related to the problem I was able to discover:
debug log fragment: https://pastebin.com/6RDwsLwC
graylog configs: https://pastebin.com/zftVSvMy

I host the logs and configs on Pastebin due to upload limitations here.

jmap (part 1): https://pastebin.com/c7QjXD3D
jmap (part 2): https://pastebin.com/TswQ8L1m

And this is the Java profile from the problematic node:
profile: https://sendeyo.com/up/d/c63d75f7ad
