Graylog has stopped storing messages in ES - no errors in log

Hi folks, been a while since I stumbled around here, but I've been busy with work. Speaking of which, we ran into an ongoing problem; there are many threads on the forum about it, but none of them have a solution.

Our 4 Graylog nodes stopped storing messages in ES - the ES cluster is green, allocation is enabled, searches work, and external tooling works. We now have a journal with 2 to 3 million messages per node, except one, which has 16 million.

The Graylog nodes run on 48-core machines, and the outputs go to 3 ES servers with a batch size of 1024 and 4 outputbuffer_processors.
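
For reference, the relevant output settings in our server.conf look roughly like this (hostnames are placeholders, and option names can vary a bit between Graylog versions):

```
# graylog server.conf (excerpt) - hostnames below are placeholders
output_batch_size = 1024
outputbuffer_processors = 4
elasticsearch_hosts = http://es1.example.com:9200,http://es2.example.com:9200,http://es3.example.com:9200
```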

Restarting a node briefly gets a few messages submitted (10k or so), but then it goes right back to 0 messages output, across the entire cluster.

Any ideas where to start with this? The logs don't show a damn thing and it's kind of critical this stuff works :wink:

O-kay then, of course right after typing this post out, it all started working again - we're now outputting 40k msg/sec.

This basically mimics what happened the last few times this occurred: Graylog stops outputting for a few hours, the journal goes apeshit big, and suddenly it all starts working again. It's been a while since I've been in the code, but I have a sneaking suspicion there's a backoff retry timer that climbs to a high value when it can't connect to ES (we did change a few things in our firewall configuration today, which may have caused dropped connections).
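
To illustrate the suspicion (this is just the generic backoff pattern, not Graylog's actual output code - treat the delays and cap as made-up numbers): a retry delay that doubles on every failed flush climbs to its cap very quickly, so even after connectivity to ES comes back, the output can sit idle until the current wait runs out.

```java
// Illustrative sketch only: NOT Graylog's code, just the generic
// exponential-backoff pattern I suspect. Delays and cap are invented.
public class BackoffSketch {
    public static void main(String[] args) {
        long waitMs = 1_000;            // assumed initial retry delay
        final long maxWaitMs = 600_000; // hypothetical 10-minute cap

        for (int failedAttempt = 1; failedAttempt <= 12; failedAttempt++) {
            System.out.printf("attempt %2d failed -> next retry in %6d ms%n",
                    failedAttempt, waitMs);
            waitMs = Math.min(waitMs * 2, maxWaitMs); // double, up to the cap
        }
        // After a dozen failures the retry interval is already at the cap,
        // which from the outside looks exactly like "no output for ages".
    }
}
```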

It could be like the GROK-lock issues I was seeing - next time it locks up, check the process-buffer dump on the node?

Dumb question, how do you get that dump again? I’ve not actually had to touch our setup in well over 8 months now so I’ve forgotten most of the good stuff :smiley:

The odd thing, though, is that while we do use Grok, it's very limited - maybe on the order of a message per minute. Most of our apps now send JSON-formatted log data (via Filebeat, because reasons), and our "main" pipeline just decodes it and then extracts some fields for further mangling.

:stuck_out_tongue:

System -> Nodes -> More Actions -> Get Process-Buffer Dump

For me it was my inefficiently written GROK, but it was notable that it only took one runaway GROK on one process-buffer to lock the node up… and I only have one node…

I’ll echo what @tmacgbay posted - I recently saw this issue in an environment where a regex pattern was taking an incredibly long time to do its thing, and the behavior was exactly the same. Getting a process buffer dump is the first step (it should show you what's going on), but I'd also turn on debug metrics for your pipeline rules (System -> Pipelines -> Manage Rules -> Debug metrics). Once you've turned that on, you'll see metrics on the nodes with the prefix org.graylog.plugins.pipelineprocessor.ast.Rule, which should give you some additional information about the rule(s) that might be causing the issue.
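
For anyone who hasn't hit this failure mode before: one pathological regex (and Grok patterns compile down to regex) can pin a process-buffer thread for a very long time on a single message, which is enough to stall the node. A toy example of the kind of catastrophic backtracking involved - the nested quantifier is the classic culprit, and the numbers are just for demonstration:

```java
import java.util.regex.Pattern;

// Toy demo of catastrophic regex backtracking, the kind of thing a
// process-buffer dump tends to point at. The nested quantifier (a+)+
// forces the matcher to try an exponential number of paths on input
// that almost matches but never does.
public class BacktrackDemo {
    public static void main(String[] args) {
        Pattern bad = Pattern.compile("(a+)+$");
        String input = "a".repeat(26) + "!"; // each extra 'a' roughly doubles the runtime

        long start = System.nanoTime();
        boolean matched = bad.matcher(input).matches();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.printf("matched=%b after %d ms on a %d-character input%n",
                matched, elapsedMs, input.length());
    }
}
```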
