Graylog nodes stop outputting/fill up buffers

Hi, we have a curious little problem in which one Graylog node out of a 3 + 2 cluster stops outputting messages, filling up the buffers and subsequently the journal. This happens about every other day and can be resolved by a “graceful shutdown” and restart of the node in question.

Unfortunately there is nothing in the logs from graylog or elasticsearch that would indicate a problem.

Is there additional logging we could enable to get to the bottom of this issue, or other places to look besides the standard logs?

All nodes are running Graylog 3.2.3+a9c061c.

Thanks!

Hey @jrunu,

What Elasticsearch version are you running? I would suggest checking the Elasticsearch threads. It is very likely that Elasticsearch has no threads left and actions piling up in the queue, and that is the reason Graylog is “pausing”. The restart closes all connections from the running Graylog node, so the thread queue gets a little more room.

The items that can be tuned:

  • index refresh_interval (Elasticsearch)
  • output_buffer_processors (Graylog)
  • output_batch_size (Graylog)

Those 3 are the most common settings that will help you.
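For reference, a minimal sketch of where those knobs live; the values and the index name are purely illustrative, not tuned recommendations:

    # Graylog server.conf (per node)
    output_buffer_processors = 3
    output_batch_size = 500

    # Elasticsearch: relax the refresh interval on the active write index
    PUT /graylog_2/_settings
    { "index": { "refresh_interval": "30s" } }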

Hi @jan, thanks for the pointers. With what you’re describing I would expect the node to unfuck itself after a while (unless there are IP-based queues involved somewhere?). The other Graylog nodes deliver to the Elasticsearch nodes without any issues.

Anyway, I set up a script to collect more information to nail this down better before we start fiddling with knobs. So far it has happened once since then; I have not had time to look at the collected info yet.

Just wanted to give you a brief update.

Please let me (us) know what you find - I am having the same issue and I haven’t been able to nail down what is causing it… :stuck_out_tongue:

We are well under the capacity of the system… we average about 100 messages a second (out), but when I restart and it is catching up, it can pump out 1,000+ a second.

@tmacgbay and @jrunu, off topic, but I’m curious how you have your 3+2 clusters set up, because both MongoDB and Elasticsearch recommend a minimum of 3 nodes for redundancy.

don’t want to hijack the thread, but I am curious. Hope you figure out your issue.

My setup is much simpler: I have a single Ubuntu Hyper-V Graylog server that sends to a separate Ubuntu Hyper-V Elasticsearch server. Both are located on the same physical server. We have backups of the virtual environment via the physical server backup and no direct business/compliance reliance on the data; it is used for internal IT security/reporting only.

The only place we match is that Graylog will randomly stop shipping to Elasticsearch, and a restart of Graylog (the app, not the server) clears out the queue. We are at such a low volume that I highly doubt we are resource constrained. There is nothing in the Graylog logs… the only thing I can think of is that there is a GROK that gets out of control. I have eliminated all extractors for that reason and do all my work in the pipeline, but that doesn’t seem to have made a difference. I haven’t put my mind to reworking the GROK in my pipeline… I would rather wait until I see SOMETHING that would lead me to think it’s worth the time.

I have mentioned it in the forums but it is a hard issue to describe as there is no error to show for it.

So I watch the forums… and I occasionally have to restart Graylog.

We have 3 Graylog servers that each also run a MongoDB instance; behind that is currently a two-node Elasticsearch cluster. This is a POC at the moment with roughly 150 to 700 messages per second, and judging by the load it is bored most of the time as well. I guess the final setup will look a bit different, but we started with just two nodes. We’re increasing the intake of events over time to get an idea of the sizing and behavior. There isn’t much more to why we have just 2 Elasticsearch nodes.

I did have a look now. Basically the data I have is a complete stack trace of all current Java threads, a list of currently pending syscalls, and the open file descriptors (net, block, etc.). These were taken after it stopped draining towards Elasticsearch: once, and then again roughly 3-4 seconds later.

Of course there are loads of pending futex() calls that I can’t attribute, but I couldn’t find any potentially indefinitely blocking I/O syscalls. However, I realized I can’t properly attribute the fds to the threads, which I should figure out how to do.

The only thing I found to be out of the ordinary was the stack trace of one of the processbufferprocessor threads. The stack is so deep that it is beyond what is supported by the tool (jattach), which apparently caps out at a call depth of 1023. All the calls I can see are within java.util.regex.Pattern. And it does seem to be doing something, because between the two snapshots there are differences in the call stack. Furthermore, right now I can’t reproduce anything similar; all the processbufferprocessor threads are currently in state WAITING. However, we’re in the off hours here, so I might just not be getting lucky.

So I guess another to-do is to figure out how to get a complete stack trace. But I’m not sure whether this is a red herring: the other threads were idle, and a regex match in an endless loop shouldn’t take any global locks.
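For reference, the usual ways to pull a full thread dump from the Graylog JVM (the PID is a placeholder, and I’m honestly not sure whether these are subject to the same depth limit as jattach):

    # jcmd ships with the JDK
    jcmd <graylog-pid> Thread.print > /tmp/graylog-threads.txt

    # classic jstack, including lock information
    jstack -l <graylog-pid> > /tmp/graylog-threads.txt

    # or send SIGQUIT; the dump lands on the JVM's stdout/console log
    kill -3 <graylog-pid>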

More debugging

I think ultimately only more application-level debugging can help here, so I wanted to dial up the logging via log4j2. I added these two things to the XML config, inside the respective tags:

    <!-- inside the <Appenders> section -->
    <RollingFile name="trace-file" fileName="/var/log/graylog-server/trace.log" filePattern="/var/log/graylog-server/trace.log.%i.gz">
        <PatternLayout pattern="%d{yyyy-MM-dd'T'HH:mm:ss.SSSXXX} %-5p [%c{1}] %m%n"/>
        <Policies>
            <SizeBasedTriggeringPolicy size="50MB"/>
        </Policies>
        <DefaultRolloverStrategy max="10" fileIndex="min"/>
    </RollingFile>

    <!-- inside the <Loggers> section -->
    <Logger name="org.graylog2" level="trace">
        <AppenderRef ref="trace-file"/>
    </Logger>

Then I restarted Graylog. It did create the trace file, but nothing gets written to it. I also played with variations of this using debug and trace, and with setting the root logger to those levels, all to no avail. What am I doing wrong here?

Good news, everyone! It happened again today, and it confirms my suspicion about the processbufferprocessor: again the enormous call stack, and, what I didn’t realize before, the output buffer is empty. So it’s got to have something to do with that.

I’m starting to warm up to the idea of running trace level logging with very aggressive rotation. Still haven’t figured out how to get that working.

A colleague of mine is subscribed to the Graylog newsletter. These newsletters have a little “Did you know” section at the bottom with technical tips.

This month’s last tip was this:

And here’s where the ‘Get process-buffer dump’ feature lets you know which message is sending your processing into an endless loop.

I’m not saying that I’m upset, but it does feel a little bit like the person who writes these also lurks in this forum and wanted to throw some shade at me.

We had this happen again yesterday. With the “process-buffer dump” I could get enough information to pull the log event in question from the upstream syslog server. And et voilà: testing it against one of the two grok patterns that are applied to this line yielded an HTTP 502.

Still, though, I’m surprised that this takes down the complete chain. I would expect the other 3 remaining threads to continue processing messages normally. Of course it would ultimately clog up after a much longer time, but then one could have a watchdog that kills off the stuck thread and gives me a notification/log entry that parsing that message failed for that very reason.

I am curious about the message and GROK pattern … anything you can share?

The original data was full of PII, so I compiled a minimal example.

All this was part of a bigger grok pattern, but the surrounding things are not relevant. The relevant part of the pattern looked like this:

    (<%{EMAILLOCALPART}@%{HOSTNAME}>[,\s]?)+

generating this regular expression:

    (<[a-zA-Z][a-zA-Z0-9_.+-=:]+@\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,100})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,100}))*(\.?|\b)>[,\s]?)+

This is the relevant extract from the message

'<asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <0asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid>'

If you guys want to reproduce this, here is the stuff in a small test harness: GitHub - jrunu/java_regex_complexity_poc
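If you just want the gist without cloning the repo, here is a rough standalone sketch of the same idea (class name and the use of matches() are mine; matches() stands in for the larger surrounding grok pattern, which forces the engine to consume the whole line):

    import java.util.regex.Pattern;

    public class RegexBacktrackSketch {

        // Expansion of (<%{EMAILLOCALPART}@%{HOSTNAME}>[,\s]?)+ with the escapes restored
        private static final Pattern ADDRESS_LIST = Pattern.compile(
                "(<[a-zA-Z][a-zA-Z0-9_.+-=:]+@\\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,100})"
                + "(?:\\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,100}))*(\\.?|\\b)>[,\\s]?)+");

        // n well-formed addresses followed by one that starts with a digit
        private static String input(int n) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n; i++) {
                sb.append("<asdf@inv.alid> ");
            }
            return sb.append("<0asdf@inv.alid>").toString();
        }

        public static void main(String[] args) {
            // matches() forces the engine to consume the whole line, standing in for
            // the larger surrounding grok pattern. Runtime grows roughly exponentially
            // with every extra address; bail out once a single run gets slow.
            for (int n = 10; n <= 40; n++) {
                long start = System.nanoTime();
                boolean ok = ADDRESS_LIST.matcher(input(n)).matches();
                long ms = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("n=%d matches=%b took %d ms%n", n, ok, ms);
                if (ms > 5_000) {
                    break;
                }
            }
        }
    }

The cutoff is only there so the sketch stops on its own instead of hanging the way the real message with its 80-odd addresses did.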

There are some interesting points to take away from this. To make this explode in your face, two things are needed:

  1. A good amount of repetition of email addresses
  2. An email address starting with a number

EMAILLOCALPART assumes that all email addresses start with an alphabetic character, which is not quite correct. This makes the whole thing fail to match, and then a whole lot of backtracking ensues. That I can’t quite wrap my head around, because I assumed the < and > would be anchors enough, but apparently that’s not how things work.

Besides fixing EMAILLOCALPART, a way to reduce the amount of backtracking is to make the outermost group an atomic group, i.e. a group that is treated as “one” unit when backtracking:

    (?><%{EMAILLOCALPART}@%{HOSTNAME}>[,\s]?)+

In this example it then only backtracks, I think, 81 times instead of a metric fuckton of times.

And of course using

    (<[^>]+>[,\s]?)+

instead would have worked just as well and faster.
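And for completeness, a quick self-contained check that both fixes behave on the same kind of input (the expanded patterns are my reconstruction of the grok definitions, not copied from Graylog):

    import java.util.regex.Pattern;

    public class RegexFixSketch {

        // Atomic-group variant: each repetition is matched as one unit, so on failure
        // the engine can only give back whole repetitions, not re-split their insides.
        private static final Pattern ATOMIC = Pattern.compile(
                "(?><[a-zA-Z][a-zA-Z0-9_.+-=:]+@\\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,100})"
                + "(?:\\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,100}))*(\\.?|\\b)>[,\\s]?)+");

        // The blunt alternative: anything between angle brackets.
        private static final Pattern SIMPLE = Pattern.compile("(<[^>]+>[,\\s]?)+");

        public static void main(String[] args) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 80; i++) {
                sb.append("<asdf@inv.alid> ");
            }
            String input = sb.append("<0asdf@inv.alid>").toString();

            // Both return immediately: the atomic variant fails fast, the simple
            // variant happily matches the digit-prefixed address as well.
            System.out.println("atomic matches: " + ATOMIC.matcher(input).matches());
            System.out.println("simple matches: " + SIMPLE.matcher(input).matches());
        }
    }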

I’m still curious about this. To me that feels like something that should be looked into, especially the implication that the matching might lock up all the other threads. Any thoughts? Are we reading the symptoms wrong?

Interesting - this goes into way more depth about the issue than I have - there is a lot for the Graylog developers to chew on… I am normally loath to tag @jan, but this is just to help make sure it gets looked at.

Thanks for posting it and taking the time to weed out PII!

With the info you posted I was able to track down my issue - 100% a runaway GROK. In the IIS logs there was a spot where I was expecting an IP, but in the rare instance someone logged into the webserver locally on the machine, the logs would put “,localhost,” instead, which would permanently tie up a process buffer. Interestingly, when 3 of the 5 process buffers were GROK-locked (and the other two idle), all of Graylog would stop sending to Elasticsearch.

Thanks again for posting the details!! :+1:

EDIT: Turns out that even a single GROK-locked buffer will eventually stop processing. Also: changing %{IP:client_ip} to %{IPORHOST:client_ip} allows for the chance of localhost.
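To illustrate with a made-up fragment (not my real IIS pattern, just the shape of the problem):

    # normally the field is an IP:       192.0.2.10  -> %{IP:client_ip} matches
    # a local logon writes a hostname:   localhost   -> %{IP:client_ip} never matches,
    #                                                   and the failing GROK ties up a buffer
    %{IPORHOST:client_ip}
    # accepts either an IP or a hostname such as localhost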

love seeing these solutions popping up for people… congrats to both… great thread
