Graylog doesn't process messages from journal


After upgrading from 3.0 to 3.2 I am seeing issues with random nodes in my Graylog cluster. Some nodes stop processing messages and only keep writing them to the journal.

Some words about the setup:
56 Graylog nodes sitting behind 4 LVS balancers are writing logs to a huge Elasticsearch cluster (about 200 TB of data, 360 data nodes).

In most cases the nodes can be “repaired” by just restarting them, but this is an ugly solution that isn’t reliable at all.

I had a similar issue with previous versions, which I solved by increasing the “http.max_content_length” setting in Elasticsearch to 500mb.
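
For context, that setting caps the size of a single HTTP request body that Elasticsearch will accept (the default is 100mb), so a bulk request larger than the cap is rejected. Below is a minimal Python sketch of that size comparison. The helper names `parse_es_size` and `exceeds_limit` are made up for illustration, and this is not code from Graylog or Elasticsearch.

```python
import re

# Elasticsearch byte-size units are powers of 1024.
_UNITS = {
    "b": 1,
    "kb": 1024,
    "mb": 1024 ** 2,
    "gb": 1024 ** 3,
    "tb": 1024 ** 4,
    "pb": 1024 ** 5,
}


def parse_es_size(value: str) -> int:
    """Convert a size string such as '500mb' to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*(b|kb|mb|gb|tb|pb)\s*", value.lower())
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def exceeds_limit(body_bytes: int, limit: str = "100mb") -> bool:
    """Return True if a request body of body_bytes is larger than limit.

    The default mirrors Elasticsearch's default http.max_content_length.
    """
    return body_bytes > parse_es_size(limit)


if __name__ == "__main__":
    bulk_body = 200 * 1024 ** 2  # a hypothetical 200 MiB bulk request
    print(exceeds_limit(bulk_body))           # checked against the 100mb default
    print(exceeds_limit(bulk_body, "500mb"))  # checked against the raised limit
```

With the limit raised to 500mb, the same hypothetical 200 MiB body is no longer over the cap, which would be consistent with the earlier issue going away after the change.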

The only lines related to the problem I was able to discover:
debug log fragment:
graylog configs:

I am hosting the logs and configs on Pastebin because of the upload limitations here.

jmap (part 1):
jmap (part 2):

And this is the Java profile from the problematic node:
