So, after a lot of hiccups, everything is go! Graylog is running smoothly, receiving output from the testing Graylog VM while we (I mean, I) set up things for production, like getting the PowerShell installer script ready (hint: https://github.com/ion-storm/Graylog_Sysmon/pull/6)
But, yeah, not everything is great. I notice the process buffer is almost constantly full at its 65536-message limit, and the journal sometimes reaches 10-20% utilization…
Meanwhile, the highest CPU usage any of my 3 Graylog VMs reached was 16.15%, and memory never goes above 1.8 GB of the 4 GB of RAM they each have.
Admittedly, I seem to be, once again, lost in the documentation. I remember having read something about optimizing, but I can't seem to find it, and my Google Fu is failing me. From what I remember, all settings are still at their defaults, so should I start tweaking?
Indeed. I wonder what your settings are for the number of processbuffer processors. You could probably increase that number. For example, if you have 10 cores in your VMs, you could have 8 processbuffer processors and 2 outputbuffer processors…
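For illustration, that split is just two lines in server.conf (the counts below are only the example numbers for a hypothetical 10-core VM, not a recommendation for your setup), and graylog-server needs a restart to pick them up:

```
# /etc/graylog/server/server.conf (default path on package installs)
# Hypothetical split for a 10-core VM, as in the example above
processbuffer_processors = 8
outputbuffer_processors = 2
```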
Other threads here about a 100% full process buffer say it may be the fault of a misbehaving extractor, so maybe check your extractor metrics; processing could also be stuck on some slow lookup tables or pipelines.
Indeed. The number of processbuffer processors limits the total CPU load of the Graylog node, so increasing it makes it possible to put a higher load on the node. Then, once you have found processor counts that fully utilize the VM's CPUs at peak load, the next step would be to look at the extractors and optimize their regexes.
So… I upped processbuffer_processors from 5 to 75… CPU peaked at 100% and fell back to 15%. We decided to up the game and configured a Sidecar on our main system, with history.
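(For reference, that's a single line in server.conf on the master, followed by a restart of graylog-server; 75 was simply the value I tried, not a recommendation:)

```
# /etc/graylog/server/server.conf on 131 (master)
# was: processbuffer_processors = 5
processbuffer_processors = 75
```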
131 (the master) hit peak CPU but seemed to be digesting nicely.
I decided to try other values on the 132 and 133 slaves: 65 on 132 and 55 on 133. Both are also peaking on CPU, but things seem to have stabilized. At first, incoming was about 3k msg/s (which seems to have been the history), but now it's down to 10-20 msg/s, while output peaked at 7.5k msg/s. This is on 4-CPU VMs. We're still deciding whether to up the CPU count; it seems to have been just the peak from digesting the history and nothing more…
-edit
Half an hour later and we have 0% journal usage, and the process and output buffers don't even hit 1%. It must really have been the history =)
BUT considering this thread opened with the process buffer at 100% and now it's not even hitting 1%, I'd say upping the processor count did the trick! Thanks @jtkarvo and @maniel! =D
Oh yeah, BTW, I didn't even touch Elasticsearch, neither the service itself nor its VMs. 137 (Graylog + MongoDB use 131-133, Elasticsearch uses 134-138) hit 100% CPU load for less than 10 minutes after the processor count increase, and returned to normal after that.
That is weird. If you have 4 CPUs (cores?), then the original 5 processors dedicated to processing messages should already be enough to reach 100% CPU utilization… Or do your VM CPUs have several cores each? In that case, the total number of processbuffer and outputbuffer processors probably should not exceed the total number of cores.
And 3k messages per second per node sounds like fairly low throughput; if you hit capacity problems, you should look at your GROK and regex usage and optimize it.
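As a purely hypothetical illustration of what I mean (the log format and field names below are made up, not taken from your setup): an unanchored pattern full of GREEDYDATA forces a lot of backtracking on every non-matching message, while an anchored pattern with specific field types is far cheaper:

```
# Hypothetical GROK patterns, for illustration only
# Expensive: unanchored, two greedy captures that backtrack heavily
%{GREEDYDATA:prefix} failed login from %{GREEDYDATA:client}

# Cheaper: anchored, specific patterns, no greedy captures
^%{TIMESTAMP_ISO8601:timestamp} sshd\[%{POSINT:pid}\]: failed login from %{IPORHOST:client}$
```

The same idea applies to plain regex extractors: anchor them and avoid `.*` where a more specific pattern will do.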
I asked around here: we have 4 CPUs on each VM, each CPU with 2 cores. But I still failed to see a significant difference between processbuffer_processors = 5; outputbuffer_processors = 3 and processbuffer_processors = 7; outputbuffer_processors = 1.
server.conf states "# The number of parallel running processors. # Raise this number if your buffers are filling up.", so I thought this referred to the number of parallel processes running, not to actual cores, and that's what made me test higher numbers to drive up CPU load.
Heap sizes have already been tweaked to match the RAM assigned to each VM: 4 GB for each Graylog VM, 8 GB for each Elasticsearch VM.
In the previous scenario, before tweaking processbuffer_processors, the ES VMs never got past 10% CPU load; after the change, and after dealing with that log history, they haven't gone past 40% CPU load. Active memory maxed out at 2.3 GB of the 8 GB total RAM, with 4 GB configured as Java heap… so I see no reason to add a sixth and seventh node.
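In case anyone wonders where those heaps get set: roughly the two places below (exact paths and syntax depend on how Graylog and Elasticsearch were installed and on their versions, and the Graylog -Xms/-Xmx values here are only placeholders, not our exact ones):

```
# Graylog JVM heap, DEB/RPM package installs: /etc/default/graylog-server
GRAYLOG_SERVER_JAVA_OPTS="-Xms2g -Xmx2g"

# Elasticsearch heap, ES 5.x and later: /etc/elasticsearch/jvm.options
# 4g matches the heap I mentioned above on the 8 GB ES VMs
-Xms4g
-Xmx4g
```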
The 8-hour graph for today has, so far, shown peaks of 30k messages/minute.
In general, switching between tasks also consumes resources, so for processing-intensive tasks, having a lot of threads on a single core would be counterproductive. For tasks that wait for I/O, having several threads per core can help. It's hard to tell the right amount; 75 processors just sounded high.
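As a rough illustration only (not an official formula), a conservative starting point is to keep the sum of the buffer processors at or below the core count. On an 8-core VM like yours (4 CPUs with 2 cores each), that could look something like this, with the exact split tuned from there:

```
# Example split for an 8-core VM; keep the sum at or below the core count
processbuffer_processors = 5
outputbuffer_processors = 2
inputbuffer_processors = 1
```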