Missing messages after processing


#1

Hello everyone. I want to debug some missing messages. I’m just testing the new installation of Graylog 3.0. I have some low-speed live inputs and some customers data from Cisco ASA firewall (syslog messages).

What I am trying now is to send high-speed UDP input (20k+/sec) with custom PowerShell script or nxlog. But I see that when I send more data at once, I miss approx. 50% of messages in the end. I am using samples of 1-8 millions of messages. When I cycle the write index and run the same again I will get different results (number of messages are not the same). Common logs are clean, nothing special.

Setup is one GL 3.0 and ElasticSearch cluster of 5 nodes (4 data / 3 masters). All running on Debian 9. Everything is running on two physical boxes with Hyper-V. I am not using any parsers on that input/stream, just saving them to elastic.

A number of packets on graylog ethernet interface are the same as sent from my device but the number of indexed documents differs.


(Jan Doberstein) #2

Sherlock, start investigation.

How should someone without access to your box now find the error? You are using UDP, debug that first. Than check if Graylog and Elasticsearch can hold the preasure of 20k messages per second. I guess not.

Just to think, you try to ingest ~17.280.000.000 messages per day what end in ~8640 TB a day when each message is ~500kb. We have some threads in this community about big setups but non has this dimensions.


#3

Hello again. I did some more testing today.

LAB setup:
PowerShell script with UDP stream
Wireshark which confirms a number of sent packets
tcpdump on Graylog which confirms a number of received packets

One running UDP input with 25m input buffer.
(2019-03-13T09:01:46.510+01:00 WARN [UdpTransport] receiveBufferSize (SO_RCVBUF) for input SyslogUDPInput{title=GrayLog Bruteforce test, type=org.graylog2.inputs.syslog.udp.SyslogUDPInput, nodeId=null} (channel [id: 0x0f033690, L:/0.0.0.0:10514]) should be 26214400 but is 52428800.)

or 50? :smiley:

Testing at full speed:
1mil of messages sent in 33sec (30k/sec) (interesting thing is, that Graylog is reporting on the top right in In/Out approx. half of them on input (15k))

Processing enabled

  1. 536,424 documents in elastic
  2. 500,676 documents in elastic
  3. 490,999 documents in elastic

input buffer max approx 200msgs
process buffer full (65,536)
output buffer max approx 30% (65,536)

Processing disabled

  1. 955,488 added journal
  2. 839,327 added journal
  3. 815,119 added journal

Testing at limited speed:
1mil of messages sent in 70sec (15k/sec)

Processing enabled

  1. 999,942 documents in elastic
  2. 999,735 documents in elastic
  3. 999,700 documents in elastic

input buffer max approx 200msgs
process buffer full for few seconds at the end of tests (65,536)
output buffer is slowly getting full and stays full (65,536)

My conclusion:
When testing at full speed, I have lost a huge number of messages (50%) between network stack and journal and processing. When processing disabled, things are getting better, but there is still approx 10-20% loss of messages between stack and journal. What drives me crazy is that I don’t see any errors/warnings in the Graylog server log. (at default settings, I don’t know if I can debug something into the log)

When testing with the limited speed I have lost approx 0.3% messages in ethernet (network). TCP dump and Graylog confirms, that all received messages are correctly saved into Elastic.

Jan, please know that this is a lab scenario, I am actually finding the best way for our customers to centralize log location. I have to compare GL with another log analyzer right now.


#4

In our production use case I don’t expect to receive more than 5k messages per second. I am testing it with higher numbers just to know what to expect. 15k messages per sec was still without any loss (at least on the Greylog side).

I want to understand how to debug the missing messages. We can’t allow any messages to be lost in production environment. Most probably in 5k msg/sec setup nothing will be lost but is there a way how to know for sure? With 25kmsg/sec lots of messages were lost and I found nothing in logs. Is there any way how to know about it?


(Jan Doberstein) #5

if you use UDP, you can’t be sure. It is a fire and forget protocol, if you need to know that the messages has reached the target, use TCP.

But just to make sure, no logging solution will give you 100% security to no loose message, no system. You can reduce the risk, but no way to be 100%.


#6

Hello Jan.
I have to repeat that I have seen all the packets with tcpdump on incoming interface on Graylog server OS, so there should be no ethernet problem.

But before i sent this update i tried to switch to TCP. Everything seems to be fine. Even when I used full speed. Process and output buffers were full as I expected, but the number of sent syslog messages and saved documents in elastic equals.

So I am still thinking about some problem with UDP processing and large numbers of messages per second.


(Jan Doberstein) #7

you should read about buffers and how udp messages are handled in the network stack of linux.


#8

known issue/working method, when GL runs java GC it drops UDP packages…
so if you have a good “ethernet” connection, there could be a LOT other things.
UDP is UDP
so if you use UDP don’t count packages. Do some research about UDP syslog test. You will see some simmilar problems.


#9

Thank you, guys. I have to admit that I don’t have a good understanding of Linux systems. I think, that Jan’s answer will guide me well, I have to make some research on receive buffers in Linux. Because I can clearly see, that the packets will not fit into UDP receive buffer in my test.


#10

you CAN’T avoid to loose udp messages.
You can play with a lot of thing, but you can’t avoid it.