Certain messages seem to stop Graylog nodes


I know this seems like a classic topic, but I am not able to hunt it down alone… so here's the situation:

  • 2-node Graylog cluster (version 2.5.1)
  • 11-node ES cluster (3 masters, 8 data nodes each running 2 ES instances, so 3 master and 16 data nodes in total), version 5.6.14
  • about 2,500 msg/s

Our problem seems to be that certain messages, or maybe even just one message, stop message processing on our Graylog nodes. From time to time, with no regularity that I can see, one node stops processing its messages. The process buffer fills up, while the output buffer remains empty. After restarting the service via systemctl, everything works perfectly again.

I am already monitoring [org.graylog2.shared.buffers.processors.ProcessBufferProcessor.processTime], and I see max and mean going up when processing comes to a stop. In htop I don't see any shortage of hardware resources.
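For anyone who wants to script this kind of monitoring, that metric can be read over the Graylog REST API. A minimal Python sketch; the host and token are placeholders, and the response field names are assumptions based on what the metrics endpoint returns for a timer:

```python
import base64
import urllib.request

# Placeholders -- replace with your node's address and an API token:
GRAYLOG_HOST = "https://graylog.example.com"
API_TOKEN = "yourtoken"
METRIC = "org.graylog2.shared.buffers.processors.ProcessBufferProcessor.processTime"

def build_metric_request(host, token, metric):
    """Build an authenticated GET request for a single Graylog metric.

    Graylog API tokens go into HTTP Basic auth as the username, with the
    literal string "token" as the password.
    """
    creds = base64.b64encode(f"{token}:token".encode("ascii")).decode("ascii")
    return urllib.request.Request(
        f"{host}/api/system/metrics/{metric}",
        headers={"Authorization": f"Basic {creds}"},
    )

req = build_metric_request(GRAYLOG_HOST, API_TOKEN, METRIC)
# Uncomment to poll a live node (field names like "max"/"mean" assumed):
# import json
# with urllib.request.urlopen(req) as resp:
#     data = json.load(resp)
#     print(data["max"], data["mean"])
```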

So to me it looks like a message can't be processed by a regex or Grok pattern. But how do I figure out who is sending these messages? I tried to find out via the trace log, but that log also simply stops after the last successfully processed message.

Any help is highly appreciated!


Paying close attention to this one; you have a marvelously interesting situation on your hands. Sorry that I can’t offer anything useful at the mo’.

I’m assuming from the infra you described that this is your production environment, right? So we can’t just stop a bunch of input sources to determine where the message(s) could be coming from.

Do you have a testbed as well? Does it run into the same issues? If not, could it be that the equipment or applications generating the relevant message(s) are not present in your testing environment?


Thanks for your quick reply!
Yes, you are right, the setup I described is production. We do have a test lab, and you are right again: the problem doesn't show up there.

We are feeding our test systems with production messages via an output in production, sending every 10th message generated by applications and servers to our DEV stage.

What confuses me the most is that a service restart resolves the problem. The bad message should still be in the journal, so it seems to be processed correctly after the restart. Is there a config switch to discard messages that take longer than 5 s to process? I haven't found one, but maybe I was looking in the wrong place…


That’s remarkable. I mean, with a setup like that you would expect the same problem to occur in test, just much less frequently. Unless of course Kismet is against you. :slight_smile:

I’m wondering whether one of the oodles of statistics that Graylog provides per node can offer more insight. I mean System > Nodes > Blocked Node > Details (or Metrics). You could click the Metrics button for just the one stuck process buffer.

  • How often does this problem show up?
  • Does it happen to the same Graylog node every time? Or does the problem hop between nodes?
  • Instead of restarting the whole node, could you see what happens when you pause and then unpause the processing on that particular node?

I just noticed: if you go into System > Nodes > Blocked Node, at the top right there’s the Actions button. One of the actions is Get thread dump. That could be useful once the processing gets stuck again.


I am already monitoring these, as I wrote in my first post. Please correct me if I am wrong, but it looks to me as if these are only processing times; I don't get any hints as to why processing takes so long.

Sometimes twice per week, sometimes 8 times a day.

Both nodes are affected.

Just pausing the processing doesn’t help…

I looked at the thread dump, but my Java skills are not good enough to read it. I'll try to catch a dump the next time it happens.

P.S.: Damn Kismet, always giving me a hard time… :wink:


I set up a little quick-and-dirty script to catch the thread dump the next time Graylog fails. It's written in PowerShell and uses the Graylog API. I thought I'd share it, it may come in handy for someone :slight_smile:

    $username = "yourtoken"
    $password = "token"
    $base64AuthInfo = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes(("{0}:{1}" -f $username, $password)))
    $headers = @{ Authorization = ("Basic {0}" -f $base64AuthInfo) }

    # Needed for the MessageBox popup below
    Add-Type -AssemblyName System.Windows.Forms

    while ($true) {
        for ($i = 1; $i -le 2; $i++) {
            # Current output rate of this node; near zero means processing is stuck
            [single]$outstr = Invoke-RestMethod -Method GET -Uri "https://grayloghost$i.fqdn/api/system/metrics/org.graylog2.throughput.output.1-sec-rate" -Headers $headers | Select-Object -ExpandProperty value
            if ($outstr -le 1) {
                if ($i -eq 2) {
                    $nodeid = "nodeid-your-gl-host2"
                } else {
                    $nodeid = "nodeid-your-gl-host1"
                }
                # Capture the thread dump of the stuck node
                Invoke-RestMethod -Method GET -Uri "https://grayloghost$i.fqdn/api/cluster/$nodeid/threaddump" -Headers $headers | Select-Object -ExpandProperty threaddump | Out-File -FilePath C:\Users\youruser\Desktop\Graylog_node$i.txt
                [System.Windows.Forms.MessageBox]::Show("Graylog Failed", "GL FAIL", 0)
                # Record the process buffer timing stats as well
                $stats = Invoke-RestMethod -Method GET -Uri "https://grayloghost$i.fqdn/api/system/metrics/org.graylog2.shared.buffers.processors.ProcessBufferProcessor.processTime" -Headers $headers | Select-Object max, p99, mean
                $stats | Export-Csv -Append C:\Users\youruser\Desktop\GraylogStats_node$i.txt
            }
        }
        Start-Sleep -Seconds 1
    }

Check Elasticsearch's HTTP max content length, and Graylog's output batch size.
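Presumably that refers to these two settings; a sketch of where they live, assuming ES 5.x and Graylog 2.x (the values shown are the stock defaults, not recommendations):

```
# elasticsearch.yml (Elasticsearch 5.x) -- max size of an HTTP request body
http.max_content_length: 100mb

# Graylog server.conf -- messages per bulk-indexing batch sent to ES
output_batch_size = 500
```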


I got him! I captured a thread dump the second one of the nodes failed! Then I made a thread dump after 2 minutes in the failed state. And I made a third one the second after the restart, when the node started to process messages again. :slight_smile:

The second the node failed:

Two minutes in the failed state:

The second processing started again:

Thanks for your help so far! I'll also check these dumps myself, of course!


I’ll have a look! Thanks!


Ooohhh? Thinking the message was simply too large?

Dude(tte) that’s awesome work! Nice going!


Our output batch size is set to 5000. I checked all GL nodes for connection resets; there are none. But thanks for your help. Moreover, I think, please correct me if I am wrong, that if ES were the problem, the output buffer would fill up, not the process buffer, wouldn't it?

I found some differences between the thread dumps:

Output processor:


When in the failed state, there are a lot more threads with state=WAITING (384 vs. 196).

All process buffer processors have these lines in common:

    - waiting on <0x573d260d> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x573d260d> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)

Except for processors 0 and 17.
I wish my Java skills were better…
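For a first pass over dumps like these without deep Java knowledge, simply tallying the thread states can help spot the jump in WAITING threads between a healthy and a stuck dump. A small sketch; the sample dump text below is made up for illustration:

```python
import re
from collections import Counter

def count_thread_states(dump: str) -> Counter:
    """Tally the java.lang.Thread.State lines in a JVM thread dump.

    Comparing the counts from a healthy and a stuck dump makes jumps
    in WAITING/BLOCKED obvious without reading every stack trace.
    """
    states = re.findall(r"java\.lang\.Thread\.State: (\w+)", dump)
    return Counter(states)

# Tiny made-up sample; real dumps have one such line per thread:
sample = """
"processbufferprocessor-0" #42 daemon
   java.lang.Thread.State: WAITING (parking)
"outputbufferprocessor-0" #43 daemon
   java.lang.Thread.State: RUNNABLE
"""
print(count_thread_states(sample))
```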

That makes two of us. But seriously, by grabbing this data you may have made the lives of the developers a lot easier. Perhaps @jochen has some idea what could be going wrong, or knows whom to ping :smiley:


If Graylog can’t write to ES, both the process and the output buffer fill up.


As I said, the output buffer stays empty…

Sorry, I misunderstood you.


Okay, a little update:
In my log there are a lot of these messages:

2019-02-20T16:44:31.770+01:00 WARN  [GelfCodec] GELF message <6a2a0980-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677471.767  (type: STRING)
2019-02-20T16:44:31.773+01:00 WARN  [GelfCodec] GELF message <6a2aa5c0-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677471.771  (type: STRING)
2019-02-20T16:44:32.206+01:00 WARN  [GelfCodec] GELF message <6a6c90c0-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677472.204  (type: STRING)
2019-02-20T16:44:32.209+01:00 WARN  [GelfCodec] GELF message <6a6d2d00-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677472.207  (type: STRING)
2019-02-20T16:44:32.461+01:00 WARN  [GelfCodec] GELF message <6a9379b0-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677472.458  (type: STRING)
2019-02-20T16:44:32.467+01:00 WARN  [GelfCodec] GELF message <6a946410-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677472.465  (type: STRING)
2019-02-20T16:44:32.665+01:00 WARN  [GelfCodec] GELF message <6ab2c180-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677472.663  (type: STRING)
2019-02-20T16:44:32.669+01:00 WARN  [GelfCodec] GELF message <6ab336b0-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677472.667  (type: STRING)
2019-02-20T16:44:32.836+01:00 WARN  [GelfCodec] GELF message <6accd930-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677472.834  (type: STRING)
2019-02-20T16:44:32.840+01:00 WARN  [GelfCodec] GELF message <6acd7570-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677472.839  (type: STRING)
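For context: per the GELF spec, the timestamp field should be a JSON number (seconds since the UNIX epoch, with optional decimal places for milliseconds); the warning fires because the senders serialize it as a JSON string. A minimal illustration, with placeholder host and message values:

```python
import json
import time

# A GELF 1.1 payload with the timestamp sent correctly as a JSON number.
good = {
    "version": "1.1",
    "host": "app01.example.com",      # placeholder
    "short_message": "something happened",
    "timestamp": round(time.time(), 3),  # number: OK
}

# The variant that triggers the warning above: the same value, but
# serialized as a string ("type: STRING" in the GelfCodec log line).
bad = dict(good, timestamp=str(good["timestamp"]))

assert isinstance(json.loads(json.dumps(good))["timestamp"], float)
assert isinstance(json.loads(json.dumps(bad))["timestamp"], str)
```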

I figured that since these were “just” WARN messages, I should be able to find them in Graylog. I tried it with

_id: 6acd7570\-3526\-11e9\-96b4\-38eaa732cb5d

since I don't know the index. But I am getting no results. I don't know if these messages are what keeps stopping our Graylog, but nonetheless I should teach our developers to use correct logging settings…

I can't use the source IP because all messages are coming from our LB. Does anyone have an idea how to figure out where these messages come from?
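One way to retry the same _id query without knowing the index is the relative-time search endpoint of the REST API, which searches across all indices in the time range. A sketch, assuming the Graylog 2.x API; host, token, and whether _id is queryable this way in your version are things to verify:

```python
import base64
import urllib.parse
import urllib.request

def build_search_request(host, token, message_id, range_secs=86400):
    """Build a relative-time search for one message id, across all indices.

    Endpoint: /api/search/universal/relative (Graylog 2.x REST API).
    The API token is the Basic-auth username; the password is "token".
    """
    params = urllib.parse.urlencode({
        "query": f"_id:{message_id}",
        "range": range_secs,  # how far back to search, in seconds
    })
    creds = base64.b64encode(f"{token}:token".encode("ascii")).decode("ascii")
    return urllib.request.Request(
        f"{host}/api/search/universal/relative?{params}",
        headers={"Authorization": f"Basic {creds}",
                 "Accept": "application/json"},
    )

req = build_search_request("https://graylog.example.com", "yourtoken",
                           "6acd7570-3526-11e9-96b4-38eaa732cb5d")
# Uncomment against a live node:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```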


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.