Message Loss at Ingestion

I am experiencing message loss similar to the issues described on the mailing list and on GitHub in 2013, but with recent versions (Graylog 2.4.5 and Elasticsearch 5.6.10). After ingesting about 8000 messages per second for testing purposes, only about 95 % of them are returned by a query.

I am using a GELF UDP input, but have made sure that all packets actually reach the Graylog server at the transport layer. Output buffer utilization and journal size are steadily increasing during the test ingestion, but drop to zero afterwards (before the query). I have also made sure that the number of messages returned by Graylog is equal to the one returned by the Elasticsearch API.

Since I could mitigate (but not eliminate) the issue by increasing elasticsearch_max_retries, I assume that Elasticsearch is returning errors. Unfortunately, neither Graylog nor Elasticsearch logs anything interesting. Elasticsearch tuning options are also rather limited.

Does anybody have an idea for where to look and what to tune?

I did a tcpdump between Graylog and Elasticsearch and counted the number of messages in the resulting PCAP. That number matches the one from the database. Also, all responses from Elasticsearch have status code 200.

So this doesn’t look like Elasticsearch errors after all, but rather like Graylog not receiving all messages from the kernel or dropping them internally. My original questions remain.

Did you find anything related in the Graylog or Elasticsearch logs?

That is where I would look first.

Unless I’m missing a log file or log level setting, the logs don’t show an obvious problem.

At log level “Debug”, Graylog still doesn’t log anything during the ingestion. Elasticsearch does have some debug logging, but primarily mumbles something about version mismatches for Logstash, Elasticsearch and Kibana (I don’t use Logstash or Kibana), “Get stats” or “executing watch”.

Perhaps you have UDP receive errors. You can check this with netstat -us.

Thanks, this looks promising.

I am indeed seeing a high number of receive errors, which in a short test almost account for the number of missing messages. Will have to investigate further.
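To compare the receive errors against the number of missing messages, it can help to take a snapshot of the kernel's UDP counters before and after a test run. The following is only a minimal sketch, assuming a Linux host where /proc/net/snmp exposes the same counters that netstat -us reports; the function names are hypothetical and not part of Graylog.

```python
def parse_udp_counters(snmp_text):
    """Parse the 'Udp:' header/value line pair from /proc/net/snmp-style text."""
    # "UdpLite:" lines are skipped because they do not start with "Udp:".
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    if len(udp_lines) < 2:
        raise ValueError("no Udp header/value pair found")
    header, values = udp_lines[0], udp_lines[1]
    # Drop the leading "Udp:" label and pair each counter name with its value.
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def read_udp_counters(path="/proc/net/snmp"):
    """Read the current UDP counters (Linux only; the path is an assumption)."""
    with open(path) as f:
        return parse_udp_counters(f.read())


def counter_delta(before, after):
    """Return how much each counter grew between two snapshots."""
    return {name: after[name] - before[name] for name in after if name in before}
```

Calling read_udp_counters() before and after the ingestion test and passing both results to counter_delta() gives the growth of counters such as InErrors and RcvbufErrors during the run, which can then be compared with the number of messages missing from the query result.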

There are probably two points to start your investigation:

  • Receive buffer size of your input
  • UDP buffer size of your OS

Exactly as @juemue said, the issue could be fixed by increasing receive buffer size in the input’s config (through the web interface) and accordingly adjusting sysctl net.core.rmem_max on the Graylog host.
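The reason both settings have to be adjusted together is that, on Linux, a socket's requested receive buffer is capped by net.core.rmem_max. The sketch below illustrates this with a plain UDP socket. It assumes that the input's receive buffer setting ends up as an SO_RCVBUF request, which this thread does not confirm, and the helper names are hypothetical.

```python
import socket


def effective_rcvbuf(requested_bytes):
    """Request a UDP receive buffer size and return what the kernel reports."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        # On Linux the reported value is double the granted size (bookkeeping
        # overhead), and the granted size is capped at net.core.rmem_max.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def read_rmem_max(path="/proc/sys/net/core/rmem_max"):
    """Read the system-wide cap on receive buffers (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())
```

If effective_rcvbuf() returns roughly the same value for a small and a much larger request, the request is probably being capped, and net.core.rmem_max needs to be raised before a larger receive buffer in the input's config can take effect.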

Thanks for the help!

No problem, I am glad to help.
