Input worker process hanging intermittently

I’m having intermittent but frequent issues where one or more worker processes serving a Netflow input appear to be hanging.

The Graylog server has a quad-core Xeon and 16 GB of RAM. Elasticsearch has a 6 GB heap, Graylog 3 GB. Running Graylog 4.2.12 and Elasticsearch 7.10.2 on Ubuntu 20.04.

I have a Netflow input running, which three devices are sending netflow data to. From time to time I notice that I’m not receiving any data from one or more of the devices. On checking with tcpdump, I can see that the device is sending, and the Graylog server receiving, the netflow packets. I can also see with netstat that one or more of the sockets listening on the input UDP port (matching the number of devices with missing data) have a high receive queue, so I assume the worker has stopped reading data from the socket to process it. If I try to stop the input in order to restart it, that doesn’t work: the input state changes to ‘stopping’ but it never stops. The only fix is to stop and start the whole Graylog service.
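
For reference, the check amounts to reading the per-socket receive queue, which can also be done with a small Python sketch against /proc/net/udp rather than netstat (the port below is only an example, not necessarily the real input port):

```python
#!/usr/bin/env python3
"""Print the receive-queue depth of every UDP socket bound to a given port,
roughly what `netstat -una` shows in the Recv-Q column.

INPUT_PORT is an assumption - set it to the UDP port of your Netflow input.
"""

INPUT_PORT = 2055  # assumption: adjust to your Netflow input's UDP port


def udp_recv_queues(port):
    """Yield (local_address, rx_queue_bytes) for UDP sockets on the port."""
    for path in ("/proc/net/udp", "/proc/net/udp6"):
        try:
            with open(path) as fh:
                next(fh)  # skip the header line
                for line in fh:
                    fields = line.split()
                    local, queues = fields[1], fields[4]
                    local_port = int(local.rsplit(":", 1)[1], 16)
                    rx_queue = int(queues.split(":")[1], 16)
                    if local_port == port:
                        yield local, rx_queue
        except FileNotFoundError:
            continue


if __name__ == "__main__":
    for local, rx in udp_recv_queues(INPUT_PORT):
        print(f"{local}  recv-q={rx} bytes")
```

A socket whose receive queue stays high while tcpdump shows packets still arriving is what I’m describing as a hung worker.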

Other inputs are running fine, and I can’t see anything obvious in the logs.

I’ve tried increasing the number of workers from 4 to 6, hoping another would take over, and I’ve tried increasing the receive buffer size on the input, but neither helped. When all three devices are sending, traffic averages about 150 messages/second.
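
For what it’s worth, as far as I understand the receive buffer size requested by the input is silently capped by the kernel’s net.core.rmem_max. A quick way to sanity-check that, sketched in Python (the desired buffer size is a placeholder to match whatever is set on the input):

```python
#!/usr/bin/env python3
"""Sanity check: is the kernel's UDP buffer cap at least as large as the
receive buffer configured on the Graylog input?

DESIRED_RECV_BUFFER is an assumption - set it to the recv_buffer_size
configured on the Netflow input.
"""

DESIRED_RECV_BUFFER = 1048576  # bytes; assumption, match your input setting


def read_sysctl(name):
    """Return the integer value of a sysctl exposed under /proc/sys."""
    with open("/proc/sys/" + name.replace(".", "/")) as fh:
        return int(fh.read().strip())


if __name__ == "__main__":
    rmem_max = read_sysctl("net.core.rmem_max")
    rmem_default = read_sysctl("net.core.rmem_default")
    print(f"net.core.rmem_max     = {rmem_max}")
    print(f"net.core.rmem_default = {rmem_default}")
    if rmem_max < DESIRED_RECV_BUFFER:
        print("rmem_max is below the requested input buffer; the kernel will "
              "cap the socket buffer, so net.core.rmem_max needs raising.")
```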

There are some pipelines attached to that stream to do field normalization and lookups, but if the problem were in a pipeline I would expect the input worker to still pick up the data, and I wouldn’t expect it to be limited to a worker handling a specific device.

Are there any ways I can see what it’s doing and what is triggering the worker to hang (and stop it from doing so)?

That sounds like a difficult question. Do you see some buffers filling up in the overview of the node?

I once had a runaway grok pattern. We finally managed to hunt it down in the thread dump/process-buffer dump. It might be an idea to compare those in working and non-working states.
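
For what it’s worth, both dumps can also be pulled from the REST API, so a “working” and a “hung” snapshot can be saved and diffed offline. A rough Python sketch; the URL and credentials are placeholders, and the endpoint paths are what my API browser shows, so worth confirming on your version:

```python
#!/usr/bin/env python3
"""Fetch a thread dump and a process-buffer dump from the Graylog REST API
and write each to a JSON file, so snapshots can be compared later.

GRAYLOG_URL, USER and PASSWORD are placeholders - point them at your node.
"""

import base64
import json
import urllib.request

GRAYLOG_URL = "http://127.0.0.1:9000/api"   # assumption: your node's API URL
USER = "admin"                               # assumption
PASSWORD = "changeme"                        # assumption


def api_get(path):
    """GET a Graylog API path with basic auth and return the decoded JSON."""
    req = urllib.request.Request(GRAYLOG_URL + path)
    token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    req.add_header("Accept", "application/json")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Endpoint paths as shown in my API browser; verify them on your version.
    for name, path in (("threaddump", "/system/threaddump"),
                       ("processbufferdump", "/system/processbufferdump")):
        with open(f"{name}.json", "w") as fh:
            json.dump(api_get(path), fh, indent=2)
        print(f"wrote {name}.json")
```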

The buffers show as empty, and the journal varies between 0 and roughly 20 unprocessed messages, so there’s nothing visibly queuing up.
The input has hung again; it does mostly seem to be one particular host that is causing the problem. However, it runs the exact same version of the netflow probe, with exactly the same configuration, as another device which isn’t suffering the same problem.
I’ll see if I can make any sense of the thread/process buffer dumps, or I might try disabling the pipelines just to see whether that cures the problem, so I know whether it’s something in the pipeline.

So I think I’ve ruled out the pipelines: I disconnected the pipeline from the stream, and after running for roughly 9 hours, one worker process stalled again and stopped processing data from that device.

Upgrading to 4.3.6 did not fix the problem. When I checked first thing this morning, input from one device had stopped entirely, as before, and input from a second had slowed to a trickle. The count by source on the standard 5-minute view of that stream showed an average of around 30 messages in the 5-minute window from the second device. Restarting the netflow probe didn’t change it; restarting the whole of Graylog did, and it’s now showing in excess of 26,000 messages for the last 5 minutes from that same device, as well as restoring input from the first device. The third device sending netflow packets was unaffected.

So it’s as if the worker process for that input is slowing down to the point where it stops processing anything at all, at which point the UDP receive buffers start filling up. It definitely appears to be the Graylog end that is the problem, although I’m not ruling out the netflow probes as the trigger.
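
In case it’s useful to anyone seeing the same thing, a loop variant of the earlier queue check can timestamp when the stall actually begins, so it can be matched against the Graylog and probe logs (the port is again a placeholder):

```python
#!/usr/bin/env python3
"""Poll the UDP receive queue on the Netflow input port once a minute and
print a timestamped sample, to pin down when the worker stalls.

INPUT_PORT is an assumption - use the input's actual UDP port.
"""

import time
from datetime import datetime

INPUT_PORT = 2055       # assumption: your Netflow input's UDP port
INTERVAL_SECONDS = 60


def total_rx_queue(port):
    """Sum rx_queue bytes over all UDP sockets bound to the given port."""
    total = 0
    for path in ("/proc/net/udp", "/proc/net/udp6"):
        try:
            with open(path) as fh:
                next(fh)  # skip the header line
                for line in fh:
                    fields = line.split()
                    local_port = int(fields[1].rsplit(":", 1)[1], 16)
                    if local_port == port:
                        total += int(fields[4].split(":")[1], 16)
        except FileNotFoundError:
            continue
    return total


if __name__ == "__main__":
    while True:
        print(f"{datetime.now().isoformat()} "
              f"recv-q={total_rx_queue(INPUT_PORT)} bytes", flush=True)
        time.sleep(INTERVAL_SECONDS)
```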

Is anyone else sending netflow data to Graylog and seeing similar issues?

For reference, devices 1 and 3 are running nprobe, and device 2 is using softflowd to capture and send netflow packets. In other words, inputs from two different applications appear to trigger it, while another device running the same software as one of the affected devices is not affected.
