After upgrading from 3.0 to 3.2, I am facing issues with random nodes in my Graylog cluster. Some nodes stop processing messages and only write them to the journal.
Some words about the setup:
56 Graylog nodes sitting behind 4 LVS load balancers write logs to a huge Elasticsearch cluster (about 200 TB of data, 360 data nodes).
In most cases the nodes can be “repaired” by just restarting them, but this is an ugly solution that isn’t reliable at all.
I had a similar issue with previous versions, which I solved by increasing the “http.max_content_length” setting in Elasticsearch to 500mb.
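For reference, a minimal sketch of that earlier change, assuming it is applied as a static setting in elasticsearch.yml on each Elasticsearch node (the file and the rollout are assumptions; only the setting name and value come from the description above):

```yaml
# elasticsearch.yml (assumed location of the setting)
# Raises the maximum size of an HTTP request body that Elasticsearch accepts.
# This is a static setting, so the node needs a restart to pick it up.
http.max_content_length: 500mb
```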
I am hosting the logs/configs on a pastebin due to upload limitations here.