I have noticed that for the past couple of days, when Graylog goes to roll over indices, the connection to Elastic stops on my master node, but my secondary node keeps processing messages. During this time, messages all pile up in the journal, and Graylog is not able to run any index processing. Restarting the graylog-server service brings everything back to normal.
Environment:
Graylog v2.4.3 (2 nodes)
Elastic cluster v5.6.6 (3 nodes, separate from Graylog)
Okay, so I was able to trace this back to Elastic storage. Graylog rolls to new indices, but Elastic throws a warning about breaching 85% storage utilization. I reduced the max number of indices in the index set, and the daily rollover looks happier now.
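That 85% figure matches Elasticsearch's default low disk watermark (`cluster.routing.allocation.disk.watermark.low`): once a node's disk crosses it, ES stops allocating new shards to that node, which can stall index rollover. A minimal sketch of the per-node check ES is effectively doing (node names and byte counts below are made up for illustration):

```python
# Sketch of the low-disk-watermark check Elasticsearch applies per node.
# The default low watermark in ES 5.x is 85% disk used; nodes above it
# stop receiving new shard allocations. Node figures are hypothetical.

LOW_WATERMARK = 0.85  # cluster.routing.allocation.disk.watermark.low default

nodes = {
    # node name: (disk_used_bytes, disk_total_bytes) -- illustrative numbers
    "es-data-1": (900_000_000_000, 1_000_000_000_000),
    "es-data-2": (700_000_000_000, 1_000_000_000_000),
    "es-data-3": (860_000_000_000, 1_000_000_000_000),
}

def over_low_watermark(used, total, watermark=LOW_WATERMARK):
    """True if this node has crossed the low disk watermark."""
    return used / total > watermark

blocked = [name for name, (used, total) in nodes.items()
           if over_low_watermark(used, total)]
print(blocked)  # nodes that would no longer accept new shards
```

The real numbers can be pulled from the cluster with the `_cat/allocation?v` API, and the watermark thresholds can be adjusted via `_cluster/settings` if the defaults don't fit.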
Yes, increasing disk space would be optimal, but we are still in proof-of-concept mode and are limited in available space until the project is formally approved.
My Elastic cluster has three nodes with 1 TB of storage each, and it takes about 10 days of ingest to fill that up (with 1 replica per index).
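From those numbers, a rough back-of-the-envelope for ingest rate and usable retention (a sketch only; it assumes daily ingest is roughly constant):

```python
# Rough retention estimate from the cluster figures above.
# Assumes ingest volume is roughly constant day to day.

cluster_bytes = 3 * 1_000_000_000_000   # 3 nodes x 1 TB
replicas = 1                            # 1 replica doubles on-disk footprint
days_to_fill = 10                       # observed time to fill the cluster

# Daily on-disk growth (primaries + replicas) and primary-only ingest:
daily_on_disk = cluster_bytes / days_to_fill    # ~300 GB/day on disk
daily_primary = daily_on_disk / (1 + replicas)  # ~150 GB/day of raw data

# Usable capacity before the 85% low watermark halts shard allocation:
usable = 0.85 * cluster_bytes
retention_days = usable / daily_on_disk
print(round(daily_primary / 1e9), round(retention_days, 1))
```

So in practice the watermark caps retention at roughly 8.5 days rather than the full 10, which is why capping the index set size below the raw capacity keeps the rollover happy.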
I am seeing high journal utilization during the day, but my Elastic nodes max out at about 60% CPU and about 70% JVM heap usage (12 GB combined heap across the cluster).
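The journal buildup itself is expected behavior whenever Elastic can't accept writes fast enough: Graylog spools messages to its on-disk journal and drains it once ES catches up. If the journal is approaching its cap during peak hours, the relevant knobs in Graylog's server.conf are the journal size and location (values below are illustrative, not recommendations):

```
# server.conf -- journal settings (sizes here are illustrative)
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_size = 10gb
```

Raising `message_journal_max_size` only buys buffering headroom; it doesn't fix the underlying ingestion stall when the watermark is breached.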