Certain messages seem to stop Graylog nodes


I know this seems like a classic topic, but I am not able to hunt it down alone… so here's the situation:

  • 2-node Graylog cluster (version 2.5.1)
  • 11-node ES cluster (3 masters, 8 data nodes each running 2 ES instances, so 3 master and 16 data nodes in total), version 5.6.14
  • about 2,500 msg/s

Our problem seems to be that certain messages, or maybe even just one message, stop message processing on our Graylog nodes. From time to time, with no regularity that I can see, one node stops processing its messages. The process buffer fills up, while the output buffer remains empty. After restarting the service via systemctl, everything works perfectly again.

I am already monitoring [org.graylog2.shared.buffers.processors.ProcessBufferProcessor.processTime], and I see max and mean going up when processing comes to a stop. In htop I don't see any shortage of hardware resources.
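For anyone who wants to script this kind of monitoring, that metric can be read over the Graylog REST API. A minimal Python sketch; the host and token are placeholders, and the response field names are assumptions based on what the metrics endpoint returns for a timer:

```python
import base64
import urllib.request

# Placeholders -- replace with your node's address and an API token:
GRAYLOG_HOST = "https://graylog.example.com"
API_TOKEN = "yourtoken"
METRIC = "org.graylog2.shared.buffers.processors.ProcessBufferProcessor.processTime"

def build_metric_request(host, token, metric):
    """Build an authenticated GET request for a single Graylog metric.

    Graylog API tokens go into HTTP Basic auth as the username, with the
    literal string "token" as the password.
    """
    creds = base64.b64encode(f"{token}:token".encode("ascii")).decode("ascii")
    return urllib.request.Request(
        f"{host}/api/system/metrics/{metric}",
        headers={"Authorization": f"Basic {creds}"},
    )

req = build_metric_request(GRAYLOG_HOST, API_TOKEN, METRIC)
# Uncomment to poll a live node (field names like "max"/"mean" assumed):
# import json
# with urllib.request.urlopen(req) as resp:
#     data = json.load(resp)
#     print(data["max"], data["mean"])
```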

So to me it looks like a message can't be processed by a regex or Grok pattern. But how do I figure out who is sending these messages? I tried to find out via the trace log, but that log also simply stops after the last successfully processed message.

Any help is highly appreciated!


Paying close attention to this one; you have a marvelously interesting situation on your hands. Sorry that I can’t offer anything useful at the mo’.

I’m assuming from the infra you described that this is your production environment, right? So we can’t just stop a bunch of input sources to determine where the message(s) could be coming from.

Do you have a testbed as well? Does it run into the same issues? If not, could it be that the equipment or applications generating the relevant message(s) are not present in your testing environment?


Thanks for your quick reply!
Yes, you are right, the setup I described is production. We do have a test lab, and you are right again: the problem doesn't show up there.

We are feeding our test systems with production messages via an output in production, sending every 10th message generated by applications and servers to our DEV stage.

What confuses me the most is that a service restart resolves the problem. The bad message should still be in the journal, so it seems to be processed correctly after the restart. Is there a config switch to discard messages that take longer than 5 s to process? I haven't found one, but maybe I was looking in the wrong place…


That’s remarkable. I mean, with a setup like that you would expect the same problem to occur in test, just much less frequently. Unless of course Kismet is against you. :slight_smile:

I’m wondering whether one of the oodles of statistics that Graylog provides per node can offer more insight. I mean System > Nodes > Blocked Node > Details (or Metrics). You could click the Metrics button for just the one stuck process buffer.

  • How often does this problem show up?
  • Does it happen to the same Graylog node every time? Or does the problem hop between nodes?
  • Instead of restarting the whole node, could you see what happens when you pause and then unpause the processing on that particular node?

I just noticed: if you go into System > Nodes > Blocked Node, at the top right there’s the Actions button. One of the actions is Get thread dump. That could be useful once the processing gets stuck again.


I am already monitoring these, as I wrote in my first post. Please correct me if I am wrong, but it looks to me as if these are only processing times; I don't get any hints as to why processing takes so long.

Sometimes twice per week, sometimes 8 times a day.

Both nodes are affected.

Just pausing the processing doesn’t help…

I looked at the thread dump, but my Java skills are not good enough to read it. I'll try to catch a dump the next time it happens.

P.S.: Damn Kismet, always giving me a hard time… :wink:


I set up a little quick-and-dirty script to catch the thread dump the next time Graylog fails. It's written in PowerShell and uses the Graylog API. I thought I'd share it, it may come in handy for someone :slight_smile:

    $username = "yourtoken"
    $password = "token"
    $base64AuthInfo = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes(("{0}:{1}" -f $username, $password)))
    $headers = @{ Authorization = ("Basic {0}" -f $base64AuthInfo) }

    # Needed for the MessageBox popup below
    Add-Type -AssemblyName System.Windows.Forms

    while ($true) {
        for ($i = 1; $i -le 2; $i++) {
            # Current output rate of this node; near zero means processing is stuck
            [single]$outstr = Invoke-RestMethod -Method GET -Uri "https://grayloghost$i.fqdn/api/system/metrics/org.graylog2.throughput.output.1-sec-rate" -Headers $headers | Select-Object -ExpandProperty value
            if ($outstr -le 1) {
                if ($i -eq 2) {
                    $nodeid = "nodeid-your-gl-host2"
                } else {
                    $nodeid = "nodeid-your-gl-host1"
                }
                # Capture the thread dump of the stuck node
                Invoke-RestMethod -Method GET -Uri "https://grayloghost$i.fqdn/api/cluster/$nodeid/threaddump" -Headers $headers | Select-Object -ExpandProperty threaddump | Out-File -FilePath C:\Users\youruser\Desktop\Graylog_node$i.txt
                [System.Windows.Forms.MessageBox]::Show("Graylog Failed", "GL FAIL", 0)
                # Record the process buffer timing stats as well
                $stats = Invoke-RestMethod -Method GET -Uri "https://grayloghost$i.fqdn/api/system/metrics/org.graylog2.shared.buffers.processors.ProcessBufferProcessor.processTime" -Headers $headers | Select-Object max, p99, mean
                $stats | Export-Csv -Append C:\Users\youruser\Desktop\GraylogStats_node$i.txt
            }
        }
        Start-Sleep -Seconds 1
    }

Check Elasticsearch's HTTP max content length, and Graylog's output batch size.
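Presumably that refers to these two settings; a sketch of where they live, assuming ES 5.x and Graylog 2.x (the values shown are the stock defaults, not recommendations):

```
# elasticsearch.yml (Elasticsearch 5.x) -- max size of an HTTP request body
http.max_content_length: 100mb

# Graylog server.conf -- messages per bulk-indexing batch sent to ES
output_batch_size = 500
```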


I got him! I captured a thread dump the second one of the nodes failed! Then I made a thread dump after 2 minutes in the failed state. And I made a third one the second after the restart, when the node started to process messages again. :slight_smile:

The second the node failed:

Two minutes in the failed state:

The second processing started again:

Thanks for your help so far! I'll also check these dumps myself, of course!


I’ll have a look! Thanks!


Ooohhh? Thinking the message was simply too large?

Dude(tte) that’s awesome work! Nice going!


Our output batch size is set to 5000. I checked all GL nodes for connection resets; there are none. But thanks for your help. Moreover, I think, please correct me if I am wrong, that if ES were the problem, the output buffer would fill up, not the process buffer, wouldn't it?

I found some differences between the thread dumps:

Output processor:


When in the failed state, there are a lot more threads with state=WAITING (384 vs. 196).

All process buffer processors have these lines in common:

    - waiting on <0x573d260d> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x573d260d> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)

Except for processors 0 and 17.
I wish my Java skills were better…
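For a first pass over dumps like these without deep Java knowledge, simply tallying the thread states can help spot the jump in WAITING threads between a healthy and a stuck dump. A small sketch; the sample dump text below is made up for illustration:

```python
import re
from collections import Counter

def count_thread_states(dump: str) -> Counter:
    """Tally the java.lang.Thread.State lines in a JVM thread dump.

    Comparing the counts from a healthy and a stuck dump makes jumps
    in WAITING/BLOCKED obvious without reading every stack trace.
    """
    states = re.findall(r"java\.lang\.Thread\.State: (\w+)", dump)
    return Counter(states)

# Tiny made-up sample; real dumps have one such line per thread:
sample = """
"processbufferprocessor-0" #42 daemon
   java.lang.Thread.State: WAITING (parking)
"outputbufferprocessor-0" #43 daemon
   java.lang.Thread.State: RUNNABLE
"""
print(count_thread_states(sample))
```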

That makes two of us. But seriously, by grabbing this data you may have made the lives of the developers a lot easier. Perhaps @jochen has some idea what could be going wrong, or knows whom to ping :smiley:


If Graylog can’t write to ES, both the process and the output buffer fill up.


As I said, the output buffer stays empty…

Sorry, I misunderstood you.


Okay, a little update:
In my log there are a lot of these messages:

2019-02-20T16:44:31.770+01:00 WARN  [GelfCodec] GELF message <6a2a0980-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677471.767  (type: STRING)
2019-02-20T16:44:31.773+01:00 WARN  [GelfCodec] GELF message <6a2aa5c0-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677471.771  (type: STRING)
2019-02-20T16:44:32.206+01:00 WARN  [GelfCodec] GELF message <6a6c90c0-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677472.204  (type: STRING)
2019-02-20T16:44:32.209+01:00 WARN  [GelfCodec] GELF message <6a6d2d00-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677472.207  (type: STRING)
2019-02-20T16:44:32.461+01:00 WARN  [GelfCodec] GELF message <6a9379b0-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677472.458  (type: STRING)
2019-02-20T16:44:32.467+01:00 WARN  [GelfCodec] GELF message <6a946410-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677472.465  (type: STRING)
2019-02-20T16:44:32.665+01:00 WARN  [GelfCodec] GELF message <6ab2c180-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677472.663  (type: STRING)
2019-02-20T16:44:32.669+01:00 WARN  [GelfCodec] GELF message <6ab336b0-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677472.667  (type: STRING)
2019-02-20T16:44:32.836+01:00 WARN  [GelfCodec] GELF message <6accd930-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677472.834  (type: STRING)
2019-02-20T16:44:32.840+01:00 WARN  [GelfCodec] GELF message <6acd7570-3526-11e9-96b4-38eaa732cb5d> (received from <>) has invalid "timestamp": 1550677472.839  (type: STRING)
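For context: per the GELF spec, the timestamp field should be a JSON number (seconds since the UNIX epoch, with optional decimal places for milliseconds); the warning fires because the senders serialize it as a JSON string. A minimal illustration, with placeholder host and message values:

```python
import json
import time

# A GELF 1.1 payload with the timestamp sent correctly as a JSON number.
good = {
    "version": "1.1",
    "host": "app01.example.com",      # placeholder
    "short_message": "something happened",
    "timestamp": round(time.time(), 3),  # number: OK
}

# The variant that triggers the warning above: the same value, but
# serialized as a string ("type: STRING" in the GelfCodec log line).
bad = dict(good, timestamp=str(good["timestamp"]))

assert isinstance(json.loads(json.dumps(good))["timestamp"], float)
assert isinstance(json.loads(json.dumps(bad))["timestamp"], str)
```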

I figured that since these were “just” WARN messages, I should be able to find them in Graylog. I tried it with

_id: 6acd7570\-3526\-11e9\-96b4\-38eaa732cb5d

since I don't know the index. But I am getting no results. I don't know if these messages are what keeps stopping our Graylog, but nonetheless I should teach our developers to use correct logging settings…

I can't use the source IP because all messages are coming from our LB. Does anyone have an idea how to figure out where these messages come from?
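One way to retry the same _id query without knowing the index is the relative-time search endpoint of the REST API, which searches across all indices in the time range. A sketch, assuming the Graylog 2.x API; host, token, and whether _id is queryable this way in your version are things to verify:

```python
import base64
import urllib.parse
import urllib.request

def build_search_request(host, token, message_id, range_secs=86400):
    """Build a relative-time search for one message id, across all indices.

    Endpoint: /api/search/universal/relative (Graylog 2.x REST API).
    The API token is the Basic-auth username; the password is "token".
    """
    params = urllib.parse.urlencode({
        "query": f"_id:{message_id}",
        "range": range_secs,  # how far back to search, in seconds
    })
    creds = base64.b64encode(f"{token}:token".encode("ascii")).decode("ascii")
    return urllib.request.Request(
        f"{host}/api/search/universal/relative?{params}",
        headers={"Authorization": f"Basic {creds}",
                 "Accept": "application/json"},
    )

req = build_search_request("https://graylog.example.com", "yourtoken",
                           "6acd7570-3526-11e9-96b4-38eaa732cb5d")
# Uncomment against a live node:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```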


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.