I am experiencing message loss similar to the issues described on the mailing list and on GitHub in 2013, but with recent versions (Graylog 2.4.5 and Elasticsearch 5.6.10). After ingesting about 8000 messages per second for testing purposes, only about 95 % of them are returned by a query.
I am using a GELF UDP input, but have made sure that all packets actually reach the Graylog server at the transport layer. Output buffer utilization and journal size are steadily increasing during the test ingestion, but drop to zero afterwards (before the query). I have also made sure that the number of messages returned by Graylog is equal to the one returned by the Elasticsearch API.
Since I could mitigate (but not eliminate) the issue by increasing elasticsearch_max_retries, I assume that Elasticsearch is returning errors. Unfortunately, neither Graylog nor Elasticsearch logs anything interesting. Elasticsearch tuning options are also rather limited.
Does anybody have an idea where to look and what to tune?
I did a tcpdump between Graylog and Elasticsearch and counted the number of messages in the resulting PCAP. That number matches the one from the database. Also, all responses from Elasticsearch have status code 200.
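For anyone who wants to repeat this check, here is a minimal Python sketch of the counting step. It assumes the HTTP bodies have already been extracted from the capture and that the indexing traffic consists of Elasticsearch _bulk requests. The function names are made up for illustration, and this is not necessarily how I counted. The second function is there because a _bulk response can have HTTP status 200 while individual items in it failed, so it may be worth looking at the per-item status as well.

```python
import json


def count_bulk_documents(request_body):
    """Count documents sent via index/create actions in a _bulk request body.

    The body is newline-delimited JSON: an action line, followed by a source
    line for index, create and update actions (delete has no source line).
    """
    lines = [line for line in request_body.splitlines() if line.strip()]
    count = 0
    i = 0
    while i < len(lines):
        action = next(iter(json.loads(lines[i])))
        if action in ("index", "create"):
            count += 1
            i += 2  # skip the document source line
        elif action == "update":
            i += 2  # has a payload line, but is not a new message
        else:
            i += 1  # delete: action line only
    return count


def count_failed_items(response_body):
    """Count items in a _bulk response whose per-item status is not 2xx."""
    data = json.loads(response_body)
    failed = 0
    for item in data.get("items", []):
        for result in item.values():
            if not 200 <= result.get("status", 200) < 300:
                failed += 1
    return failed
```

Summing count_bulk_documents over all captured requests gives the number of messages Graylog actually sent, and a non-zero count_failed_items would point back at Elasticsearch.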
So this doesn’t look like Elasticsearch errors after all, but rather like Graylog not receiving all messages from the kernel or dropping them internally. My original questions remain.
Unless I’m missing a log file or log level setting, the logs don’t show an obvious problem.
At log level “Debug”, Graylog still doesn’t log anything during the ingestion. Elasticsearch does have some debug logging, but primarily mumbles something about version mismatches for Logstash, Elasticsearch and Kibana (I don’t use Logstash or Kibana), “Get stats” or “executing watch”.
I am indeed seeing a high number of receive errors, which in a short test almost account for the number of missing messages. Will have to investigate further.
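In case it helps others, this is roughly how the counters can be compared before and after a test run. It is only a sketch, assuming a Linux host that exposes the UDP counters in /proc/net/snmp. The function names are hypothetical, and the counters of interest are InErrors and RcvbufErrors.

```python
def parse_udp_counters(snmp_text):
    """Parse the 'Udp:' header/value line pair of /proc/net/snmp-style text."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]  # does not match 'UdpLite:'
    if len(rows) < 2:
        raise ValueError("no Udp header/value pair found")
    header, values = rows[0], rows[1]
    return dict(zip(header[1:], (int(v) for v in values[1:])))


def read_udp_counters(path="/proc/net/snmp"):
    """Read the current UDP counters from the kernel (Linux only)."""
    with open(path) as f:
        return parse_udp_counters(f.read())


def udp_delta(before, after):
    """Return how much each counter grew between two snapshots."""
    return {name: after[name] - before[name]
            for name in after if name in before}
```

Take one snapshot with read_udp_counters() before the ingestion and one after, then pass both to udp_delta(). If RcvbufErrors grows by about the number of missing messages, the datagrams are being dropped at the socket receive buffer.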
Exactly as @juemue said, I was able to fix the issue by increasing the receive buffer size in the input’s config (through the web interface) and adjusting sysctl net.core.rmem_max on the Graylog host accordingly.
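To see why both settings have to be changed together, here is a small Python sketch (hypothetical helper names, Linux behaviour assumed) that requests a receive buffer on a throwaway UDP socket and reports what the kernel actually granted. On Linux the request is capped at net.core.rmem_max, and the value reported back is twice the granted size because the kernel doubles it for bookkeeping overhead.

```python
import socket


def effective_rcvbuf(requested_bytes):
    """Request SO_RCVBUF on a throwaway UDP socket and return what was set."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def read_rmem_max(path="/proc/sys/net/core/rmem_max"):
    """Return net.core.rmem_max in bytes, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None
```

If effective_rcvbuf(n) returns noticeably less than 2 * n for the size configured on the input, the request is most likely being capped by rmem_max, and the sysctl needs to be raised as well.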