I have a production Graylog cluster handling an average of 2 million messages per day. It has been working like a charm, but sometimes, when a message peak arrives, the servers behave strangely: they start holding messages in the journal and take a long time to flush it.
Is there any white paper, guide, or documentation for tuning the servers? The server.conf file has a lot of tuning parameters, but most of them are black boxes to me.
I’ve Googled a lot trying to find docs about tuning, but Graylog is kind of new.
I faced the same issue when initially scaling Graylog. There seems to be a lack of published documentation on large-scale installations and tuning. We are currently pushing about 25,000 to 40,000 msg/s through our Graylog. The settings I found that helped the most are as follows.
ring_size = 262144 (must fit in your L3 cache and must be a power of 2)
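The power-of-2 requirement is easy to check before editing server.conf. Here is a minimal Python sketch; the helper names are made up for illustration and are not part of Graylog.

```python
def is_power_of_two(n: int) -> bool:
    """Return True if n is a positive power of two."""
    # A power of two has exactly one bit set, so n & (n - 1) is zero.
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (for n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


print(is_power_of_two(262144))    # True, 262144 == 2**18
print(next_power_of_two(300000))  # 524288 == 2**19
```

This only validates the number itself. Whether a given ring size actually fits in L3 cache depends on your message sizes and hardware, so that part still has to be measured.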
If your CPU cores are not being utilized, you can try increasing the following.
processbuffer_processors = {see what works}
outputbuffer_processors = {see what works}
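One way to pick starting values is to keep the total number of buffer processor threads at or below the core count, then adjust while watching CPU usage. The Python sketch below checks that budget. The rule of thumb, the function name, and the example splits are assumptions for illustration, not values from this thread; inputbuffer_processors is a further server.conf setting included so the total is complete.

```python
import os
from typing import Optional


def processors_fit(processbuffer: int, outputbuffer: int, inputbuffer: int,
                   cores: Optional[int] = None) -> bool:
    """Return True if the requested processor threads do not exceed the cores."""
    if cores is None:
        cores = os.cpu_count() or 1  # os.cpu_count() can return None
    return processbuffer + outputbuffer + inputbuffer <= cores


# Hypothetical splits for an 8-core node.
print(processors_fit(processbuffer=5, outputbuffer=2, inputbuffer=1, cores=8))  # True
print(processors_fit(processbuffer=6, outputbuffer=3, inputbuffer=2, cores=8))  # False
```

Treat a False result as a prompt to measure rather than a hard limit; the right split still comes down to "see what works" under your own load.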
I also set my heap size to 6G out of a total of 10G on the box.
If your output buffers are filling up, you will want to look at expanding or tuning your Elasticsearch cluster.
Yes, Graylog and Elasticsearch run on different servers. Graylog is currently the bottleneck for us, as we are running quite a few inputs and extractors that are rather CPU intensive.
Graylog: 9 VMs, each with 8 CPUs and 10 GB RAM
Elasticsearch: 12 VMs, each with 8 CPUs and 32 GB RAM
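To put the numbers in this thread side by side, here is a small Python sketch of the arithmetic. It assumes the specs above are per VM and that load is spread evenly across Graylog nodes, which real traffic may not do; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_node_rate(total_msgs_per_sec: float, nodes: int) -> float:
    """Average messages per second per node, assuming an even spread."""
    return total_msgs_per_sec / nodes


# The question: 2 million messages per day, as an average rate.
question_rate = 2_000_000 / SECONDS_PER_DAY
print(f"question cluster: {question_rate:.1f} msg/s")  # 23.1 msg/s

# The answer: 25,000 to 40,000 msg/s across 9 Graylog VMs.
low = per_node_rate(25_000, 9)
high = per_node_rate(40_000, 9)
print(f"per Graylog node: {low:.0f} to {high:.0f} msg/s")  # 2778 to 4444 msg/s

# Cluster totals from the per-VM specs.
print("Graylog:", 9 * 8, "CPUs,", 9 * 10, "GB RAM")           # 72 CPUs, 90 GB RAM
print("Elasticsearch:", 12 * 8, "CPUs,", 12 * 32, "GB RAM")   # 96 CPUs, 384 GB RAM
```

The average rate in the question is far below the rates described here, which fits the original report that the problems only show up during message peaks.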