Status Green, all systems go. How to optimize?

So, after a lot of hiccups, everything is go! Graylog is running smoothly, receiving output from the testing Graylog VM while we (I mean, I) set up things for production, like getting the PowerShell installer script ready (hint: https://github.com/ion-storm/Graylog_Sysmon/pull/6).

But, yeah, not everything is great. I notice that the process buffer is almost always full, right up to its 65536 messages, and the journal sometimes reaches 10-20% utilization…

Yet the highest CPU usage any of my 3 Graylog VMs hit was 16.15%, and memory never goes above 1.8GB of the 4GB of RAM they each have.

Admittedly, I seem to be, once again, at a loss with the documentation. I remember having read something about optimizing, but I can't seem to find it, and my Google-fu is failing me. From what I remember, all settings are at their defaults right now - so should I start tweaking?

My guess would be that the Elasticsearch cluster is the culprit and does not index messages as fast as Graylog feeds them to it.

If Graylog had a high processor load (but you say that is not the case here), then I would look at extractors/pipeline functions.

Wouldn't the output buffer be full then?

Indeed. I wonder what the settings are for the number of processbuffer processors. It would probably be possible to increase their number. For example, if you have 10 cores in your VMs, you could have 8 processbuffer processors and 2 outputbuffer processors (something like the sketch below)…
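Just to make that concrete, here is a minimal sketch of what it could look like in server.conf, assuming a hypothetical 10-core node; the split is only an example, not a tested recommendation:

```
# Hypothetical 10-core Graylog node: most cores for message
# processing, the rest for writing out to Elasticsearch.
processbuffer_processors = 8
outputbuffer_processors  = 2

# The ring buffer size must be a power of two; 65536 is the
# figure mentioned at the top of this thread.
ring_size = 65536
```

A restart of graylog-server is needed for changes like these to take effect.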

Other threads here about a 100% process buffer say it may be the fault of a misbehaving extractor, so maybe check your extractor metrics. Processing could also be stuck on some slow lookup tables or pipelines.

Indeed. The number of processbuffer processors limits the total CPU load of the Graylog node. By increasing that number it is possible to put a higher load on the node. Then, after finding processor counts that fully utilize the VM's CPUs at peak load, the next step would be looking at the extractors and optimizing the regexes.

So… I upped processbuffer_processors from 5 to 75… CPU peaked at 100% and then fell back to 15%. We decided to up the game and configured a Sidecar on our main system, with its log history included.

131 (the master) peaked its CPU but seemed to be digesting the backlog nicely.

I decided to try other configurations on the 132 and 133 slaves: 65 on 132 and 55 on 133. Both are also peaking, but things seem to have stabilized. At first, incoming was about 3k msg/s (which seems to have been the history), but now it's down to 10-20 msg/s, while output peaked at 7.5k msg/s. This is on 4-CPU VMs. We're still deciding whether to up the CPU count; it seems to have been just the peak from digesting the history and nothing more…

-edit
Half an hour later we have 0% journal usage, and the process and output buffers don't even hit 1%. It must really have been the history =)

BUT considering this thread opened with the process buffer at 100% and now it's not even hitting 1%, I'd say upping the processor count did the trick! Thanks @jtkarvo and @maniel! =D

Oh yeah, BTW, I didn't even touch Elasticsearch - neither the service itself nor its VMs. 137 (Graylog+MongoDB uses 131-133, Elasticsearch uses 134-138) hit 100% CPU load for less than 10 minutes after the increase in processor count, and returned to normal after that.

That is weird. If you have 4 CPUs (cores?), then the original 5 processors dedicated to processing messages would sound like enough to reach 100% processor utilization… Or do your VM CPUs have several cores each? In that case, the maximum number of processors for processbuffer and outputbuffer probably should not exceed the total number of cores.

And 3k messages per second per node sounds like rather low throughput; if you hit capacity problems, you should look at your GROK and regex usage and optimize it.
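To illustrate what that optimization usually means (the patterns below are made up for the example, not taken from this setup): an unanchored pattern full of greedy wildcards forces the regex engine to backtrack over every message, while an anchored pattern with tight character classes fails fast on lines that don't match.

```
# Hypothetical extractor patterns, for illustration only.

# Slow: unanchored, greedy wildcards cause heavy backtracking
.*user=(.*) ip=(.*).*

# Faster: anchored, with specific character classes
^user=(\S+) ip=(\d{1,3}(?:\.\d{1,3}){3})
```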

This posting might help you: https://www.graylog.org/blog/74-back-to-basics-from-single-server-to-graylog-cluster

I asked around here: we have 4 CPUs on each VM, each CPU with 2 cores. But I still failed to see a significant difference between processbuffer_processors = 5 / outputbuffer_processors = 3 and processbuffer_processors = 7 / outputbuffer_processors = 1.
server.conf states "# The number of parallel running processors. # Raise this number if your buffers are filling up.", so I thought this was the number of parallel processes running, not actual cores - and that's what made me test with higher numbers to drive up the CPU load.
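For reference, this is roughly how that part of server.conf could look with the advice above applied to these 8-core VMs (4 CPUs x 2 cores each). The split shown is an assumption following the "don't exceed the core count" suggestion, not something benchmarked in this thread:

```
# 8 cores total (4 CPUs x 2 cores each); keep the processor
# counts at or below the number of cores.
processbuffer_processors = 5
outputbuffer_processors  = 3

# inputbuffer_processors (default 2) also runs on this node
# and adds to the CPU demand.
#inputbuffer_processors = 2
```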

Heap sizes have already been tweaked to match the RAM assigned to each VM - 4GB for each Graylog VM, 8GB for each Elasticsearch VM.
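For completeness, a sketch of where those heap settings usually live on a package install; the paths and values below are assumptions based on common defaults for this sizing, not copied from this setup:

```
# /etc/default/graylog-server (Graylog VM with 4GB RAM) -
# roughly half the RAM for heap, the rest left to the OS.
# Other JVM flags from the default config are omitted here.
GRAYLOG_SERVER_JAVA_OPTS="-Xms2g -Xmx2g"

# /etc/elasticsearch/jvm.options (Elasticsearch VM with 8GB RAM,
# about half the RAM for heap, as described above):
-Xms4g
-Xmx4g
```

Depending on the Elasticsearch version, the heap may instead be set via ES_HEAP_SIZE in /etc/default/elasticsearch.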

In the previous scenario, before tweaking processbuffer_processors, the ES VMs never got past 10% CPU load; after the change, and after dealing with that log history, they haven't gone past 40% CPU load. Active memory peaked at 2.3GB of the total 8GB RAM, with 4GB configured as Java heap… so I see no reason to add a sixth or seventh node.

The 8-hour graph for today has so far given me peaks of 30k messages/minute.

Nodes are doing just fine

(screenshot: 2017-08-23 15:20)

Main node details:

This is from yesterday, when we started ingesting the log history from that nginx server, almost hitting 1M messages/minute:

In general, switching between tasks also consumes resources, so for processing-intensive tasks, having a lot of threads on a single core would be counterproductive. For tasks that wait on I/O, having several threads per core can be good. It's hard to tell the right amount; 75 processors just sounded high.

Nevertheless, I’m glad this worked out.

It actually may well be too high. While 131 was set to 75, 132 was set to 65 and 133 to 55 - and all had the exact same result…

Just wondering, do you have more than 75 CPU cores in your machine?

As I've said before…

Then why do you assign so many more process/output buffer processors than you have CPU cores? That's counterproductive, to say the least.

As I’ve said before:
