Missing messages after processing

fangycz · March 12, 2019, 4:31pm

Hello everyone. I want to debug some missing messages. I’m just testing the new installation of Graylog 3.0. I have some low-speed live inputs and some customers data from Cisco ASA firewall (syslog messages).

What I am trying now is to send high-speed UDP input (20k+/sec) with custom PowerShell script or nxlog. But I see that when I send more data at once, I miss approx. 50% of messages in the end. I am using samples of 1-8 millions of messages. When I cycle the write index and run the same again I will get different results (number of messages are not the same). Common logs are clean, nothing special.

Setup is one GL 3.0 and ElasticSearch cluster of 5 nodes (4 data / 3 masters). All running on Debian 9. Everything is running on two physical boxes with Hyper-V. I am not using any parsers on that input/stream, just saving them to elastic.

A number of packets on graylog ethernet interface are the same as sent from my device but the number of indexed documents differs.

jan · March 13, 2019, 6:53am

Sherlock, start investigation.

How should someone without access to your box now find the error? You are using UDP, debug that first. Than check if Graylog and Elasticsearch can hold the preasure of 20k messages per second. I guess not.

Just to think, you try to ingest ~17.280.000.000 messages per day what end in ~8640 TB a day when each message is ~500kb. We have some threads in this community about big setups but non has this dimensions.

fangycz · March 13, 2019, 9:10am

Hello again. I did some more testing today.

LAB setup:
PowerShell script with UDP stream
Wireshark which confirms a number of sent packets
tcpdump on Graylog which confirms a number of received packets

One running UDP input with 25m input buffer.
(2019-03-13T09:01:46.510+01:00 WARN [UdpTransport] receiveBufferSize (SO_RCVBUF) for input SyslogUDPInput{title=GrayLog Bruteforce test, type=org.graylog2.inputs.syslog.udp.SyslogUDPInput, nodeId=null} (channel [id: 0x0f033690, L:/0.0.0.0:10514]) should be 26214400 but is 52428800.)

or 50?

Testing at full speed:
1mil of messages sent in 33sec (30k/sec) (interesting thing is, that Graylog is reporting on the top right in In/Out approx. half of them on input (15k))

Processing enabled

536,424 documents in elastic
500,676 documents in elastic
490,999 documents in elastic

input buffer max approx 200msgs
process buffer full (65,536)
output buffer max approx 30% (65,536)

Processing disabled

955,488 added journal
839,327 added journal
815,119 added journal

Testing at limited speed:
1mil of messages sent in 70sec (15k/sec)

Processing enabled

999,942 documents in elastic
999,735 documents in elastic
999,700 documents in elastic

input buffer max approx 200msgs
process buffer full for few seconds at the end of tests (65,536)
output buffer is slowly getting full and stays full (65,536)

My conclusion:
When testing at full speed, I have lost a huge number of messages (50%) between network stack and journal and processing. When processing disabled, things are getting better, but there is still approx 10-20% loss of messages between stack and journal. What drives me crazy is that I don’t see any errors/warnings in the Graylog server log. (at default settings, I don’t know if I can debug something into the log)

When testing with the limited speed I have lost approx 0.3% messages in ethernet (network). TCP dump and Graylog confirms, that all received messages are correctly saved into Elastic.

Jan, please know that this is a lab scenario, I am actually finding the best way for our customers to centralize log location. I have to compare GL with another log analyzer right now.

fangycz · March 13, 2019, 9:40am

In our production use case I don’t expect to receive more than 5k messages per second. I am testing it with higher numbers just to know what to expect. 15k messages per sec was still without any loss (at least on the Greylog side).

I want to understand how to debug the missing messages. We can’t allow any messages to be lost in production environment. Most probably in 5k msg/sec setup nothing will be lost but is there a way how to know for sure? With 25kmsg/sec lots of messages were lost and I found nothing in logs. Is there any way how to know about it?

jan · March 13, 2019, 10:18am

if you use UDP, you can’t be sure. It is a fire and forget protocol, if you need to know that the messages has reached the target, use TCP.

But just to make sure, no logging solution will give you 100% security to no loose message, no system. You can reduce the risk, but no way to be 100%.

fangycz · March 13, 2019, 1:35pm

Hello Jan.
I have to repeat that I have seen all the packets with tcpdump on incoming interface on Graylog server OS, so there should be no ethernet problem.

But before i sent this update i tried to switch to TCP. Everything seems to be fine. Even when I used full speed. Process and output buffers were full as I expected, but the number of sent syslog messages and saved documents in elastic equals.

So I am still thinking about some problem with UDP processing and large numbers of messages per second.

jan · March 13, 2019, 2:35pm

you should read about buffers and how udp messages are handled in the network stack of linux.

macko003 · March 13, 2019, 7:00pm

known issue/working method, when GL runs java GC it drops UDP packages…
so if you have a good “ethernet” connection, there could be a LOT other things.
UDP is UDP
so if you use UDP don’t count packages. Do some research about UDP syslog test. You will see some simmilar problems.

fangycz · March 14, 2019, 10:34am

Thank you, guys. I have to admit that I don’t have a good understanding of Linux systems. I think, that Jan’s answer will guide me well, I have to make some research on receive buffers in Linux. Because I can clearly see, that the packets will not fit into UDP receive buffer in my test.

macko003 · March 14, 2019, 2:55pm

you CAN’T avoid to loose udp messages.
You can play with a lot of thing, but you can’t avoid it.

system · March 28, 2019, 2:55pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Windows nxlog to Graylog 2.4.3 Graylog Central (peer support) sidecar , nxlog , nodatanx	9	2313	March 21, 2018
How to debug missing messages Graylog Central (peer support)	15	3479	February 4, 2022
Message Loss at Ingestion Graylog Central (peer support)	9	2540	July 12, 2018
Not seeing any messages from IIS (nxlog) Graylog Central (peer support) nxlog	8	998	September 14, 2022
Graylog is processing messages really slowly Graylog Central (peer support)	5	1499	March 6, 2018

Missing messages after processing

Related topics