In our environment we generally see 200-300 messages per second. We have a single Graylog instance that seems to handle the amount of data we throw at it very effectively. We have given the system 14 cores and 16 GB of memory, and all of our data stores are on flash storage.
We are able to run effectively for a month, and then during our monthly maintenance we install patches and reboot machines as needed. During this window my logs go from 200-300 messages per second to about 5,000-7,000 messages per second for about four hours or less.
I increased my journal from 1 GB to 25 GB, but our Graylog instance still crashed (although much later in the day). I don't want to build a second or third node just for four hours or less of ingestion a month. I want to be able to build a large journal and just let Graylog work through the data until the journal is empty again, so it can survive the momentary load.
Outside of this once-a-month event our journal stays pretty much empty, and I just want to be able to let messages queue until the system can process everything it needs to.
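For reference, the journal section of my server.conf currently looks roughly like this (the paths are the package defaults, and message_journal_max_size is the only value I changed):

```
# /etc/graylog/server/server.conf -- journal settings (package-default path assumed)
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_size = 25gb
# Messages older than this can be dropped from the journal even if there is
# space left, so a long backlog may need this raised as well (12h is the default).
message_journal_max_age = 12h
```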
What can I do to best optimize around this small once-a-month event?
I think I actually figured it out. I had never properly configured the JVM heap sizes for either Graylog or Elasticsearch, so I assume I was overwhelming the memory I gave them. Based on my research I increased Graylog's heap to 4 GB and the Elasticsearch heap to 8 GB.
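Concretely, this is roughly what I set (the file locations are the defaults for package installs, so adjust for your setup; I left the other stock JVM flags alone):

```
# /etc/default/graylog-server (RPM installs: /etc/sysconfig/graylog-server)
GRAYLOG_SERVER_JAVA_OPTS="-Xms4g -Xmx4g"   # other stock flags omitted here

# /etc/elasticsearch/jvm.options (Elasticsearch 5.x and later; older versions
# set ES_HEAP_SIZE in the init defaults file instead)
-Xms8g
-Xmx8g
```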
The logs before it ran out of memory appear to show the system rotating my indexes.
If both are running on the same machine with only 16 GB of memory, that’s a bad idea.
As a rule of thumb, the machine running Elasticsearch should use one half of its memory for disk buffers and the other half for Elasticsearch (up to 30 GB).
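To put rough numbers on that rule of thumb (a sketch for a machine dedicated to Elasticsearch, not your shared box):

```
# 16 GB machine running only Elasticsearch:
#   Elasticsearch heap : 16 GB / 2 = 8 GB   (-Xms8g -Xmx8g)
#   OS disk buffers    : the remaining ~8 GB, left unallocated on purpose
# Even on much larger machines, keep the heap below ~30 GB so the JVM can
# still use compressed object pointers.
```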
Alright, I assumed (you know what they say about assumptions) about the heap sizes but did it incorrectly. I will increase the total system memory to 24 GB so I can leave 50% free.
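Assuming I keep the heap sizes from my earlier post, the budget on the 24 GB box would look something like this:

```
# Planned split with Graylog and Elasticsearch on the same 24 GB host:
#   Elasticsearch heap : 8 GB    (-Xms8g -Xmx8g)
#   Graylog heap       : 4 GB    (-Xms4g -Xmx4g)
#   Left unallocated   : ~12 GB  (roughly 50% of RAM for the OS and disk buffers)
```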