Graylog has stopped storing messages in ES - no errors in log

Hi folks, been a while since I stumbled around here, but I've been busy with work. Speaking of which, we ran into an ongoing problem; there are many threads on the forum about it, but none of them have a solution.

Our 4 Graylog nodes stopped storing messages in ES - the ES cluster is green, allocation is enabled, searches work, and external tooling works. We now have a journal with 2 to 3 million messages per node, except one, which has 16 million.

The Graylog nodes run on 48-core machines, and the outputs go to 3 ES servers with a batch size of 1024 and 4 outputbuffer_processors.
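
For reference, the relevant output settings in our server.conf look roughly like this (hostnames are placeholders, and option names can vary a bit between Graylog versions):

```
# graylog server.conf (excerpt) - hostnames below are placeholders
output_batch_size = 1024
outputbuffer_processors = 4
elasticsearch_hosts = http://es1.example.com:9200,http://es2.example.com:9200,http://es3.example.com:9200
```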

Restarting a node briefly gets a few messages submitted (10k or so), but then it goes right back to 0 messages output, across the entire cluster.

Any ideas where to start with this? The logs don't show a damn thing and it's kind of critical this stuff works :wink:

O-kay then, of course right after typing this post out, it all started working again - we're now outputting 40k msg/sec.

This basically mimics what happened the last few times this occurred: Graylog stops outputting for a few hours, the journal goes apeshit big, and suddenly it all starts working again. It's been a while since I've been in the code, but I have a sneaking suspicion there's a backoff retry timer that climbs to a high value when it can't connect to ES (we did change a few things in our firewall configuration today, which may have caused dropped connections).
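
To illustrate the suspicion (this is just the generic backoff pattern, not Graylog's actual output code - treat the delays and cap as made-up numbers): a retry delay that doubles on every failed flush climbs to its cap very quickly, so even after connectivity to ES comes back, the output can sit idle until the current wait runs out.

```java
// Illustrative sketch only: NOT Graylog's code, just the generic
// exponential-backoff pattern I suspect. Delays and cap are invented.
public class BackoffSketch {
    public static void main(String[] args) {
        long waitMs = 1_000;            // assumed initial retry delay
        final long maxWaitMs = 600_000; // hypothetical 10-minute cap

        for (int failedAttempt = 1; failedAttempt <= 12; failedAttempt++) {
            System.out.printf("attempt %2d failed -> next retry in %6d ms%n",
                    failedAttempt, waitMs);
            waitMs = Math.min(waitMs * 2, maxWaitMs); // double, up to the cap
        }
        // After a dozen failures the retry interval is already at the cap,
        // which from the outside looks exactly like "no output for ages".
    }
}
```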

It could be like the GROK-lock issues I was seeing - next time it locks up, check the process-buffer dump on the node?

Dumb question, how do you get that dump again? I’ve not actually had to touch our setup in well over 8 months now so I’ve forgotten most of the good stuff :smiley:

The odd thing, though, is that while we do use Grok, it's very limited - maybe on the order of a message per minute. Most of our apps now send JSON-formatted log data (via Filebeat, because reasons), and our "main" pipeline just decodes it and then extracts some fields for further mangling.

:stuck_out_tongue:

System -> Nodes -> More Actions -> Get Process-Buffer Dump

For me it was my inefficiently written GROK, but it was notable that it only took one runaway GROK on one process-buffer to lock the node up… and I only have one node…

I’ll echo what @tmacgbay posted - I recently saw this issue in an environment where a regex pattern was taking an incredibly long time to do its thing, and the behavior was exactly the same. Getting a process buffer dump is the first step (it should show you what's going on), but I'd also turn on debug metrics for your pipeline rules (System -> Pipelines -> Manage Rules -> Debug metrics). Once you've turned that on, you'll see metrics on the nodes with the prefix org.graylog.plugins.pipelineprocessor.ast.Rule, which should give you some additional information about the rule(s) that might be causing the issue.
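
For anyone who hasn't hit this failure mode before: one pathological regex (and Grok patterns compile down to regex) can pin a process-buffer thread for a very long time on a single message, which is enough to stall the node. A toy example of the kind of catastrophic backtracking involved - the nested quantifier is the classic culprit, and the numbers are just for demonstration:

```java
import java.util.regex.Pattern;

// Toy demo of catastrophic regex backtracking, the kind of thing a
// process-buffer dump tends to point at. The nested quantifier (a+)+
// forces the matcher to try an exponential number of paths on input
// that almost matches but never does.
public class BacktrackDemo {
    public static void main(String[] args) {
        Pattern bad = Pattern.compile("(a+)+$");
        String input = "a".repeat(26) + "!"; // each extra 'a' roughly doubles the runtime

        long start = System.nanoTime();
        boolean matched = bad.matcher(input).matches();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.printf("matched=%b after %d ms on a %d-character input%n",
                matched, elapsedMs, input.length());
    }
}
```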
