I am experiencing message loss similar to the issues described on the mailing list and on GitHub in 2013, but with recent versions (Graylog 2.4.5 and Elasticsearch 5.6.10). After ingesting about 8000 messages per second for testing purposes, only about 95 % of them are returned by a query.
I am using a GELF UDP input, but have made sure that all packets actually reach the Graylog server at the transport layer. Output buffer utilization and journal size are steadily increasing during the test ingestion, but drop to zero afterwards (before the query). I have also made sure that the number of messages returned by Graylog is equal to the one returned by the Elasticsearch API.
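A comparison of sent versus indexed message counts like the one described above could be scripted roughly as follows. This is a minimal sketch, not necessarily the exact check used here: the base URL http://localhost:9200, the index pattern graylog_* (based on Graylog's default index prefix), and the helper names loss_percent and elasticsearch_count are assumptions for illustration. It relies only on the Elasticsearch _count endpoint.

```python
import json
import urllib.request


def loss_percent(sent: int, indexed: int) -> float:
    """Return the percentage of sent messages that are missing from the index."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - indexed) / sent


def elasticsearch_count(base_url: str = "http://localhost:9200",
                        index_pattern: str = "graylog_*") -> int:
    """Ask the Elasticsearch _count API how many documents the indices hold.

    base_url and index_pattern are assumed defaults; adjust to the actual setup.
    """
    url = f"{base_url}/{index_pattern}/_count"
    with urllib.request.urlopen(url, timeout=10) as response:
        # The _count response is a JSON object with a top-level "count" field.
        return json.load(response)["count"]
```

Calling loss_percent(sent, elasticsearch_count()) after the journal has drained gives the share of messages that never reached the index, where sent is the number of messages the test generator emitted.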
Since I could mitigate (but not eliminate) the issue by increasing elasticsearch_max_retries, I assume that Elasticsearch is returning errors. Unfortunately, neither Graylog nor Elasticsearch logs anything interesting. Elasticsearch tuning options are also rather limited.
Does anybody have an idea of where to look and what to tune?