I’m at a loss as to how to investigate my issue further. There are other similar posts, but none of them describe the issue I have.
I have a single Graylog 3.0.3 instance backed by a single Elasticsearch 6.8.5 node, both hosted on a single VM with 4 vCPUs and 12GB of RAM.
Elasticsearch has 27 indices, 770,879,751 documents, and 229.4GB of data, with 4GB of heap.
Graylog also has 4GB of heap.
For the last two weeks, at least once a day the process buffer fills up and very few (but still some) logs make it through to Elasticsearch. I only have ~100 messages/minute inbound on average and can flush up to 4500 msg/sec to Elasticsearch.
There are no log entries that show that any errors have taken place on either Graylog or Elasticsearch even when I turn up the debugging level.
I have a single input with seven grok extractors, and at no time have I seen the maximum processing time grow enough to suggest a timeout. I have a pipeline that corrects the timezone, but again there are no errors. If there is a specific metric I can monitor to identify whether a grok extractor or the pipeline is my issue, please tell me.
If you need any more info please do ask.
Edit: fixed memory assigned, added Elasticsearch usage.
I’d check the performance metrics for your processing and make sure all your pipeline rules, etc. are working smoothly…
It’s a bit of a pain to troubleshoot, but it could be that some of your rules are taking a while to run… If you have a lot of messages being processed by those rules, it’ll slow your processing right down.
Check if your systems are swapping or showing any other abnormal behaviour.
The number of indices and the amount of data in GB might be too much for only 4GB of heap in Elasticsearch, but that is just an educated guess.
Is your journal filling up and being flushed? Are you losing messages? The process buffer filling up is usually followed up by the journal filling up and then in turn the journal reaching capacity and flushing messages to free up space at which point you’ll lose some messages.
In your server.conf, what are your processbuffer_processors and outputbuffer_processors settings set to?
Can you add more CPUs?
Here are the metrics for my extractors. I’ve been restarting Graylog hourly to avoid it failing overnight and losing logs, but I’m letting it run longer now. I’ve checked this before and I haven’t seen a maximum longer than ~10-20ms (~10,000-20,000μs). Given my very low ingest rate, I don’t see that as a bottleneck.
Here’s my heap usage % wise. I think this is “fine”.
It is happening right now. Here’s an updated view of my extractor metrics:
You can see my timings are still good, but I’m not processing anything new.
Once the process buffer is full and it is hitting the journal no new messages are being sent out to ES.
When it started I had one Graylog Java thread using 100% of one CPU core. Now I have two threads, each spinning at 100% of a single core (I still have two spare vCPUs). I think they are spinning on something and blocking messages, but is there any way to see what they are doing or what message they are stuck on?
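One common way to see what a spinning Java thread is doing is to take its thread ID from `top -H -p <pid>`, convert it to hex, and look for the matching `nid=0x...` entry in a `jstack <pid>` thread dump. Below is a minimal Python sketch of that lookup. It assumes a Java 8 era dump where `nid` is printed in hex and stanzas are separated by blank lines; the function name, the thread names, and the sample dump are made up for illustration and are not real Graylog output.

```python
import re
from typing import Optional

def find_thread_stack(jstack_dump: str, tid: int) -> Optional[str]:
    """Return the jstack stanza whose native id (nid) matches a decimal Linux thread id."""
    nid = f"nid={hex(tid)}"  # e.g. 6699 -> "nid=0x1a2b"
    # Thread stanzas in a jstack dump are separated by blank lines.
    for stanza in re.split(r"\n\s*\n", jstack_dump):
        # Match the nid as a whole token so 0x1a2b does not also match 0x1a2bc.
        if re.search(rf"\b{re.escape(nid)}\b", stanza):
            return stanza.strip()
    return None

# Hypothetical, hand-written sample in the general shape of a jstack dump.
SAMPLE_DUMP = '''"worker-0" #45 prio=5 os_prio=0 tid=0x00007f3c1c0a1000 nid=0x1a2b runnable [0x00007f3bd4a4e000]
   java.lang.Thread.State: RUNNABLE
\tat com.example.Worker.run(Worker.java)

"worker-1" #50 prio=5 os_prio=0 tid=0x00007f3c1c0b2000 nid=0x1a2c waiting on condition [0x00007f3bd494d000]
   java.lang.Thread.State: WAITING (parking)
'''

if __name__ == "__main__":
    # 6699 is the decimal thread id as shown by `top -H`; 0x1a2b in hex.
    print(find_thread_stack(SAMPLE_DUMP, 6699))
```

The stack frames in the matching stanza show which code the busy thread is executing at the moment of the dump; taking a few dumps some seconds apart helps tell a genuinely stuck thread from one that is merely busy.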
I actually don’t have swap set up on this instance, but I still have 4GB available:
              total   used   free   shared   buff/cache   available
Mem:             11      7      0        0            4           4
Swap:             0      0      0
Graylog heap usage is only 1.5GB/4GB.
processbuffer_processors = 3
outputbuffer_processors = 3
You have 4 CPUs, but you’ve allocated 6 processors. Since it’s your process buffer that is filling up, you should probably drop this to:
processbuffer_processors = 2
outputbuffer_processors = 1
That’ll leave you 1 for ES/server overhead.
Or go 3/1 or 2/2 and see what effect that has.
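The arithmetic behind this suggestion can be sketched in a few lines of Python. This is only an illustration of the rule of thumb described above (processor threads should not exceed the cores left after reserving some for Elasticsearch and the OS); the function name and the `reserved_cores` parameter are assumptions, not Graylog settings.

```python
def check_processor_budget(cpu_cores, processbuffer_processors,
                           outputbuffer_processors, reserved_cores=1):
    """Compare configured buffer processors against the cores left after a reserve.

    reserved_cores is a hypothetical knob for ES/server overhead.
    Returns (allocated, budget, fits).
    """
    allocated = processbuffer_processors + outputbuffer_processors
    budget = cpu_cores - reserved_cores
    return allocated, budget, allocated <= budget

if __name__ == "__main__":
    # The settings discussed in the thread, on a 4-core VM.
    for pb, ob in [(3, 3), (2, 1), (3, 1), (2, 2)]:
        allocated, budget, fits = check_processor_budget(4, pb, ob)
        status = "fits" if fits else "over budget"
        print(f"process={pb} output={ob}: {allocated} of {budget} cores, {status}")
```

With one core reserved, only the 2/1 split fits within the budget; 3/1 and 2/2 use all four cores and leave nothing set aside for Elasticsearch.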
I have been watching this one since I have had a similar problem but could not find anything telltale in all the usual places (logs, metrics, etc.). My jump-to-conclusion-without-proof was to convert my extractors to pipeline rules, which has worked only in that Graylog has not been constipated since. I only had two inputs and about 40 or so extractor “rules” to work with.
This all came about because I read somewhere that extractors can get hung up, whereas pipelines time out on errors. I am thoroughly OK with being told that is wrong.
Either way I enjoy writing rules more than creating/modifying extractors - so it worked out for me
…I will keep watching.
EDIT: …Eliminating extractors does not help - it is still happening (Not surprised). I am not clear on where to look next.