Graylog Process Buffer Full

I have been running a decent-sized Graylog instance (25,000 msg/s) for the last few months without incident. As of last night we started to experience the process buffers filling up. Eventually it appears that the deflector dies and stops processing messages altogether. The Elasticsearch cluster is green and doesn’t appear to be having performance issues. It doesn’t appear that the messages are even getting to the output buffer. As a result, messages are stacking up in the journal.

I tried the default settings as well as the following to help with the process buffers filling up.

processbuffer_processors = 8
output_batch_size = 100
ring_size = 262144

Any guidance on how I can dig in to see what is causing the process buffers to fill up would be helpful. The logs are not pointing me anywhere currently.

Thanks,

Please upload and share the logs of your Graylog and Elasticsearch nodes.

:arrow_right: http://docs.graylog.org/en/2.2/pages/configuration/file_location.html

I appreciate the reply. I was able to figure out what the issue was: a bad extractor was causing the process buffer pool to fill up and crash. The logs did not indicate an issue from what I was able to see.

I do have a follow up question though.

  1. Are there metrics exposed through the API for extractor performance?
  2. Why can a single bad extractor brick an entire Graylog cluster? It almost seems like this is a bug that should be fixed.

Thanks,

Yes, extractor metrics are exposed through the REST API, and you can also send these metrics to various other systems.
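
For question 1 specifically, here is a minimal sketch of pulling those numbers over the REST API. It assumes Graylog 2.x, a node reachable at graylog.example.com:9000 with the API under /api, and that GET /system/inputs/{inputId}/extractors returns per-extractor metrics; the host, credentials, and input ID are placeholders.

    import requests

    # Placeholders -- point these at a real node, user, and input ID.
    API = "http://graylog.example.com:9000/api"
    AUTH = ("admin", "password")
    INPUT_ID = "<global-input-id>"

    # List the extractors configured on one input. In Graylog 2.x the
    # response is expected to carry timing metrics for each extractor.
    resp = requests.get("%s/system/inputs/%s/extractors" % (API, INPUT_ID),
                        auth=AUTH, headers={"Accept": "application/json"})
    resp.raise_for_status()

    for extractor in resp.json().get("extractors", []):
        # Exact field names differ between versions, so just dump them.
        print(extractor.get("title"), extractor.get("type"))
        print("  metrics:", extractor.get("metrics"))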

If it was a cluster, the other Graylog nodes would still have worked. :wink:

Anyway, we haven’t added timeouts to the extractors until now, on purpose, because sometimes complex (and thus long-running) extractions are necessary. It’s up to you to monitor the health of your Graylog nodes.
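
As a starting point for that monitoring, here is a rough sketch. It assumes Graylog 2.x, that GET /system/journal reports journal utilization, and that the process buffer fill level is exposed as a gauge named org.graylog2.buffers.process.usage; the host, credentials, and the exact metric and field names are assumptions to verify against your own node.

    import requests

    # Placeholders -- point these at a real node and user.
    API = "http://graylog.example.com:9000/api"
    AUTH = ("admin", "password")
    HEADERS = {"Accept": "application/json"}

    # Journal status: when the process buffer stalls, uncommitted entries grow.
    journal = requests.get(API + "/system/journal", auth=AUTH, headers=HEADERS)
    journal.raise_for_status()
    print("uncommitted journal entries:",
          journal.json().get("uncommitted_journal_entries"))

    # Process buffer fill level. The metric name is an assumption; check the
    # full metric list on your node (e.g. GET /system/metrics) to confirm it.
    metric = "org.graylog2.buffers.process.usage"
    usage = requests.get("%s/system/metrics/%s" % (API, metric),
                         auth=AUTH, headers=HEADERS)
    print("process buffer usage:", usage.json() if usage.ok else usage.status_code)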

Awesome, thank you for the link to the plugin.

If it was a cluster, the other Graylog nodes would still have worked.

I would have thought so as well. However, since the extractor was configured on a global input, all of the Graylog nodes were affected. The Filebeat agents sending the data are automatically load balanced across all of the Graylog nodes.

Anyway, we haven’t added timeouts to the extractors until now, on purpose, because sometimes complex (and thus long-running) extractions are necessary.

Is there a specific version of Graylog I need to be running in order to leverage the timeout setting? Is there any documentation on this timeout setting?

Is there a specific version of Graylog I need to be running in order to leverage the timeout setting? Is there any documentation on this timeout setting?

That is not built, and there are no plans to build it.

That is not built, and there are no plans to build it.

Are you saying this is not planned or not in a current build?

I would like to see some way to prevent this in the future besides setting up alerts around extractor metrics. :frowning:
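
In case it helps anyone else, the alerting workaround can stay small. Here is a sketch along the lines of the extractor listing earlier in the thread, with a purely hypothetical threshold and a metric layout (metrics.total.time.mean, in microseconds) that should be verified against your version first.

    import requests

    # Placeholders -- same caveats as the earlier sketch.
    API = "http://graylog.example.com:9000/api"
    AUTH = ("admin", "password")
    INPUT_ID = "<global-input-id>"
    THRESHOLD_US = 5000  # hypothetical: flag extractors averaging over 5 ms

    resp = requests.get("%s/system/inputs/%s/extractors" % (API, INPUT_ID),
                        auth=AUTH, headers={"Accept": "application/json"})
    resp.raise_for_status()

    for extractor in resp.json().get("extractors", []):
        # The nested layout and the microsecond unit are assumptions.
        metrics = extractor.get("metrics") or {}
        total = metrics.get("total") or {}
        mean = (total.get("time") or {}).get("mean") or 0
        if mean > THRESHOLD_US:
            print("slow extractor:", extractor.get("title"), mean)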

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.