Message Loss at Ingestion

F30 · June 25, 2018, 10:46pm

I am experiencing message loss similar to the issues described on the mailing list and on GitHub in 2013, but with recent versions (Graylog 2.4.5 and Elasticsearch 5.6.10). After ingesting about 8000 messages per second for testing purposes, only about 95% of them are returned by a query.

I am using a GELF UDP input, but have made sure that all packets actually reach the Graylog server at the transport layer. Output buffer utilization and journal size are steadily increasing during the test ingestion, but drop to zero afterwards (before the query). I have also made sure that the number of messages returned by Graylog is equal to the one returned by the Elasticsearch API.

Since I could mitigate (but not eliminate) the issue by increasing elasticsearch_max_retries, I assume that Elasticsearch is returning errors. Unfortunately, neither Graylog nor Elasticsearch logs anything interesting. Elasticsearch tuning options are also rather limited.

Does anybody have an idea for where to look and what to tune?

F30 · June 26, 2018, 2:18pm

I did a tcpdump between Graylog and Elasticsearch and counted the number of messages in the resulting PCAP. That number matches the one from the database. Also, all responses from Elasticsearch have status code 200.

So this doesn’t look like Elasticsearch errors after all, but rather like Graylog not receiving all messages from the kernel or dropping them internally. My original questions remain.

jan · June 26, 2018, 4:06pm

Did you find anything related in the Graylog or Elasticsearch log?

That would be the point I’m looking at.

F30 · June 26, 2018, 4:55pm

Unless I’m missing a log file or log level setting, the logs don’t show an obvious problem.

At log level “Debug”, Graylog still doesn’t log anything during the ingestion. Elasticsearch does have some debug logging, but primarily mumbles something about version mismatches for Logstash, Elasticsearch and Kibana (I don’t use Logstash or Kibana), “Get stats” or “executing watch”.

juemue · June 26, 2018, 5:35pm

Perhaps you have UDP receive errors. You can check this with netstat -us.
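If you prefer to watch the counters from a script, here is a minimal sketch in Python. It assumes a Linux host, where the kernel exposes the same UDP statistics that netstat -us reports in /proc/net/snmp. The function names (parse_udp_counters, read_udp_counters, udp_delta) are made up for this example and are not part of Graylog or any library.

```python
def parse_udp_counters(snmp_text):
    """Parse the 'Udp:' header/value line pair from /proc/net/snmp content."""
    rows = [
        line.split()[1:]
        for line in snmp_text.splitlines()
        if line.startswith("Udp:")  # does not match the separate "UdpLite:" rows
    ]
    if len(rows) < 2:
        raise ValueError("no Udp section found")
    names, values = rows[0], rows[1]
    return dict(zip(names, (int(v) for v in values)))


def read_udp_counters(path="/proc/net/snmp"):
    """Read the current UDP counters (Linux only; the path is an assumption)."""
    with open(path) as f:
        return parse_udp_counters(f.read())


def udp_delta(before, after):
    """Return how much each counter grew between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


if __name__ == "__main__":
    import time

    before = read_udp_counters()
    time.sleep(10)  # run the test ingestion during this window
    after = read_udp_counters()
    delta = udp_delta(before, after)
    # InErrors counts datagrams that could not be delivered;
    # RcvbufErrors (on kernels that report it) counts drops due to a full receive buffer.
    print("InErrors:", delta.get("InErrors"))
    print("RcvbufErrors:", delta.get("RcvbufErrors"))
```

Take one snapshot before the test ingestion and one after. If the growth in InErrors is close to the number of missing messages, the datagrams are being dropped before Graylog reads them from the socket.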

F30 · June 26, 2018, 9:40pm

Thanks, this looks promising.

I am indeed seeing a high number of receive errors, which in a short test almost account for the number of missing messages. Will have to investigate further.

juemue · June 27, 2018, 6:10am

There are probably two places to start your investigation:

Receive buffer size of your input
UDP buffer size of your OS

https://access.redhat.com/documentation/en-US/JBoss_Enterprise_Web_Platform/5/html/Administration_And_Configuration_Guide/jgroups-perf-udpbuffer.html

F30 · June 28, 2018, 8:34am

Exactly as @juemue said, the issue could be fixed by increasing receive buffer size in the input’s config (through the web interface) and accordingly adjusting sysctl net.core.rmem_max on the Graylog host.
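For anyone who wants to verify that the two settings line up, here is a small Python sketch. It assumes a Linux host: it requests a receive buffer on a throwaway UDP socket and reports what the kernel actually granted. On Linux the kernel doubles the requested value for bookkeeping overhead and caps the request at net.core.rmem_max, so a granted size well below twice the request suggests the sysctl limit is still too low. The function names and the 1 MiB request are only illustrative, and this does not inspect Graylog's own socket.

```python
import socket


def effective_rcvbuf(requested_bytes):
    """Request SO_RCVBUF on a temporary UDP socket and return the size the kernel granted."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def read_rmem_max(path="/proc/sys/net/core/rmem_max"):
    """Return net.core.rmem_max in bytes, or None if the file is unavailable (non-Linux)."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except OSError:
        return None


if __name__ == "__main__":
    requested = 1048576  # illustrative value; use the size configured on the input
    print("net.core.rmem_max:", read_rmem_max())
    print("requested:", requested)
    print("granted:  ", effective_rcvbuf(requested))
```

Run it on the Graylog host with the same size you entered in the input's configuration, before and after changing the sysctl value.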

Thanks for the help!

juemue · June 28, 2018, 9:52am

No problem, I am glad to help.

system · July 12, 2018, 9:52am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
