Messages lost when buffers are full

Hello everyone,

Graylog version: v3.3.9

I discovered that my Graylog server’s buffers were full and there were millions of unprocessed messages yesterday. After I deleted some pipelines and extractors, the buffers drained.

However, I found that the messages from the buffer-full period are gone:

And no Elasticsearch log was created during that period.

I found a repeated connection-refused error in the Graylog log:

2020-12-04T15:45:25.986+08:00 ERROR [Messages] Caught exception during bulk indexing: io.searchbox.client.config.exception.CouldNotConnectException: Could not connect to, retrying (attempt #52783).
2020-12-04T15:45:27.375+08:00 ERROR [IndexFieldTypePollerPeriodical] Couldn’t update field types for index set <Default index set/5e47260863f3bb0f5a9bb9cc>
org.graylog2.indexer.ElasticsearchException: Couldn’t collect indices for alias graylog_deflector
at org.graylog2.indexer.cluster.jest.JestUtils.execute( ~[graylog.jar:?]
at org.graylog2.indexer.cluster.jest.JestUtils.execute( ~[graylog.jar:?]
at org.graylog2.indexer.indices.Indices.aliasTarget( ~[graylog.jar:?]
at org.graylog2.indexer.MongoIndexSet.getActiveWriteIndex( ~[graylog.jar:?]
at org.graylog2.indexer.fieldtypes.IndexFieldTypePollerPeriodical.lambda$schedule$4( ~[graylog.jar:?]
at java.util.concurrent.Executors$ [?:1.8.0_265]
at java.util.concurrent.FutureTask.runAndReset( [?:1.8.0_265]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301( [?:1.8.0_265]
at java.util.concurrent.ScheduledThreadPoolExecutor$ [?:1.8.0_265]
at java.util.concurrent.ThreadPoolExecutor.runWorker( [?:1.8.0_265]
at java.util.concurrent.ThreadPoolExecutor$ [?:1.8.0_265]
at [?:1.8.0_265]
Caused by: io.searchbox.client.config.exception.CouldNotConnectException: Could not connect to
at io.searchbox.client.http.JestHttpClient.execute( ~[graylog.jar:?]
at org.graylog2.indexer.cluster.jest.JestUtils.execute( ~[graylog.jar:?]
… 11 more
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to [/] failed: Connection refused (Connection refused)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect( ~[graylog.jar:?]
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect( ~[graylog.jar:?]
at org.apache.http.impl.execchain.MainClientExec.establishRoute( ~[graylog.jar:?]
at org.apache.http.impl.execchain.MainClientExec.execute( ~[graylog.jar:?]
at org.apache.http.impl.execchain.ProtocolExec.execute( ~[graylog.jar:?]
at org.apache.http.impl.execchain.RedirectExec.execute( ~[graylog.jar:?]
at org.apache.http.impl.client.InternalHttpClient.doExecute( ~[graylog.jar:?]
at org.apache.http.impl.client.CloseableHttpClient.execute( ~[graylog.jar:?]
at org.apache.http.impl.client.CloseableHttpClient.execute( ~[graylog.jar:?]
at io.searchbox.client.http.JestHttpClient.executeRequest( ~[graylog.jar:?]
at io.searchbox.client.http.JestHttpClient.execute( ~[graylog.jar:?]
at org.graylog2.indexer.cluster.jest.JestUtils.execute( ~[graylog.jar:?]
… 11 more

I am re-indexing the Elasticsearch index that covers the period. Does that help?
Or are the messages permanently lost?

Thank you all in advance.

If Graylog can’t write data to Elasticsearch, then it writes the messages to the journal.
If your journal was also full at that time, all messages which didn’t fit in the journal were deleted.


Thanks zoulja,

I can’t find anything mentioning the journal or deletions in Graylog’s log (/var/log/graylog-server/).

Can you share how to check for such events (e.g. deletion / journal full) in the logs? Thank you very much.

Actually, the journal is only written to right after a message is received on an input. From there the message is picked up by the processing system, “processed”, and then sent to the output processor to be written to ES. If ES cannot accept the message, the output buffer begins to fill up, then the process buffer fills up, then the journal fills up. Once the journal reaches its limit (either size or age), it purges the oldest messages to make room for new ones. Once the roadblock (in this case ES) is cleared or fixed, messages start getting written to ES again: the output buffer empties, the process buffer sends messages to the output buffer, the process buffer starts picking up messages from the journal… and slowly the system recovers.

Output buffer problems are usually related to either an issue with ES or insufficient/oversubscribed CPUs. Output processing doesn’t require a lot of processors, but it should have at least one dedicated CPU.
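For reference, the buffer processor counts are set in graylog.conf. A sketch of the relevant keys (the values shown are illustrative defaults, not a recommendation — tune them to your host’s actual CPU count):

```conf
# graylog.conf -- buffer processor threads (illustrative values)
inputbuffer_processors = 2
processbuffer_processors = 5
outputbuffer_processors = 3
```

The sum of these should not exceed the number of CPU cores available to Graylog, or the processors end up fighting each other for time.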


Thanks for the details. They help clarify some concepts about the Graylog buffers and ES.

If I understand correctly, you are implying I should look into the ES nodes. If no index is being written to during that period, then Graylog can only do so much (storing messages temporarily in the input buffer, process buffer, output buffer, and journal), and the oldest messages will keep being purged until the ES issue is fixed, right?

Well… I just saw that the journal defaults to 5 GB and a 12-hour retention. So the data is gone anyway… :sweat_smile:
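For anyone else landing here: those defaults come from graylog.conf. A sketch of the relevant keys (values shown are the stock defaults; adjust size and age to how long an ES outage you want to survive):

```conf
# graylog.conf -- disk journal settings (stock defaults)
message_journal_enabled = true
message_journal_max_size = 5gb
message_journal_max_age = 12h
```

Whichever limit is hit first (size or age) triggers purging of the oldest segments.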

I don’t see an alarm setting for this in Graylog. Any suggestion on how to configure an alarm for it? Sorry, I am rather new to Linux.

Yes, the ES nodes would be a good starting point for your issue. You can also check whether Graylog reports your ES cluster as healthy, under System | Overview. Otherwise, check the ES logs.
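A quick way to check ES health outside of Graylog is the standard `_cluster/health` endpoint. A minimal sketch, assuming ES listens on localhost:9200 with no authentication (adjust the URL to your node); the `status` field is `green`, `yellow`, or `red`, and `red` means primary shards are unassigned, so indexing — and therefore Graylog’s output — will fail:

```python
import json
from urllib.request import urlopen


def cluster_writable(health: dict) -> bool:
    """Return True when the cluster can still accept writes.

    'green' and 'yellow' clusters index normally; 'red' means at
    least one primary shard is unassigned and writes will fail.
    """
    return health.get("status") in ("green", "yellow")


if __name__ == "__main__":
    # Host and port are assumptions -- point this at your ES node.
    with urlopen("http://localhost:9200/_cluster/health") as resp:
        health = json.load(resp)
    print(health["status"], "writable:", cluster_writable(health))
```

You can also run the same check by hand with `curl http://localhost:9200/_cluster/health?pretty` and read the `status` field.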

I’m not sure which piece you are asking about in terms of alarm settings. There is a high journal utilization notification and a journal messages lost notification that pop up. Is that what you mean? Also, what kind of alarm are you looking for?

Thanks again for the help. What I mean is an alarm with an email notification for events like the journal filling up, because I am not watching the system all the time. If you can point me to a solution, that would be very nice.

Graylog exposes a lot of metrics that you can use to monitor the system; see System | Nodes | Metrics for a listing. But you would need to access these through the API from your monitoring system. Hopefully you already have a separate monitoring system, as it is never a good idea to monitor a system with itself.
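As a starting point, here is a minimal polling sketch. The URL is a placeholder, and the `journal_size` / `journal_size_limit` field names are assumptions from memory — verify the exact endpoint and response shape in your node’s API browser (System | Nodes | API browser) before relying on this; real deployments would also need API credentials:

```python
import json
from urllib.request import Request, urlopen


def journal_utilization(info: dict) -> float:
    """Fraction of the journal's size limit currently in use."""
    return info["journal_size"] / info["journal_size_limit"]


def below_threshold(info: dict, threshold: float = 0.8) -> bool:
    """True while journal utilization stays under the alert threshold."""
    return journal_utilization(info) < threshold


if __name__ == "__main__":
    # Hostname, port, and auth are assumptions -- adjust for your node.
    req = Request("http://graylog.example.com:9000/api/system/journal",
                  headers={"Accept": "application/json"})
    with urlopen(req) as resp:
        info = json.load(resp)
    if not below_threshold(info):
        # Hook your mail/alerting tool in here (e.g. call sendmail).
        print("ALERT: journal above 80% utilization")
```

Run something like this from cron on your monitoring host and wire the alert branch into whatever mailer you already use.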

Thank you for the advice. I will look into making use of the Graylog API.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.