Graylog not processing messages / processing buffer full

Writing this here to help with the Google-fu of people later, because it took me ages and ages to find it… longer than it should have. This post needs no reply as the issue is fixed; I just wanted to share my pain for Google breadcrumbs.

TL;DR - My Elasticsearch indices were read-only. It just took me ages to find that, and I was thrown by several other things happening at the same time.

Earlier today Graylog stopped processing messages, but I didn’t realise for a good few hours. Eventually I found it wasn’t “Outputting” messages per the indicator in the top right.
It was ingesting them, but not outputting them. The disk had space. Elasticsearch showed all shards as green.
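For reference, the quick checks I mean are just asking Elasticsearch directly (assuming it’s listening on localhost:9200, as in a default single-box Graylog install):

# Cluster health: status should be green with no unassigned shards
curl 'http://localhost:9200/_cluster/health?pretty'

# Per-index health and doc counts at a glance
curl 'http://localhost:9200/_cat/indices?v'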

Checked the logs, but they were flooded by a misconfigured pipeline rule calling a lookup table, so I couldn’t find anything of use.
Kicked the server, no change.

I did however notice that the process buffer was pegged at 100%. Even straight after a server reboot it sat at 100%. Outputs still climbed, and tailing the logs showed my pipeline error.
Having made no changes, I eventually decided to restart again: killed off inputs, restarted, paused streams. Nothing.
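One thing worth checking at that point is whether the on-disk journal keeps growing, since messages pile up there while processing is stuck (the path is the default for the package install; yours may differ):

# Journal size on disk; steadily growing = messages queuing but not being processed
du -sh /var/lib/graylog-server/journal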

So, first fix: the pipelines. I won’t bore you with my typos, but once I corrected them and restarted, same problem.

So I tackled the process buffer first. It was always pegged at 100%, and no matter how many CPUs I threw at it or how many times I rebooted the VM, no change.
I eventually found this command in a bug report, but it made no difference:

rm -rf /var/lib/graylog-server/journal/.lock

I found a bug report which led to this thread: After disk space issue - no out messages - help - #7 by jochen, which resulted in deleting the whole journal folder.
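Roughly, that looks like the below. Be aware it throws away any unprocessed messages still sitting in the journal, and the service name and path are the defaults for the DEB/RPM packages, so adjust for your install:

# Stop Graylog so nothing is writing to the journal (default package service name)
systemctl stop graylog-server

# Wipe the journal contents; anything unprocessed in it is lost
rm -rf /var/lib/graylog-server/journal/*

systemctl start graylog-server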
This did fix it. OK, so now I have the process buffer at 0% again and the logs are clean. Awesome. Still not outputting messages, though.

That’s when I found this in the logs:

WARN [Messages] Retrying 7 messages, because their indices are blocked with status [read-only / allow delete]

WTH? The disk is only 81% full (on a 2TB disk). OK fine, re-indexed them (something I found in my googling to try), no errors. Elasticsearch is green, happy days. So why are they read-only?
Tried creating a new index; that went OK, but it didn’t change the active write index over, which was weird. Deleted it and tried again, no dice.
OK fine, to Google.
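(In hindsight, a quicker way to see which indices actually carry that block is to ask Elasticsearch for just that one setting, again assuming it’s on localhost:9200:)

# Shows index.blocks.read_only_allow_delete per index (empty where it isn't set)
curl 'http://localhost:9200/_all/_settings/index.blocks.read_only_allow_delete?pretty'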

high disk watermark [90%] exceeded on… shows up in a search. OK, cool story… but I’m only at 81%.
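(Side note: you can also ask Elasticsearch what it thinks the disk usage is, rather than trusting df, assuming it’s on localhost:9200:)

# Disk used/available/percent per node, as Elasticsearch sees it
curl 'http://localhost:9200/_cat/allocation?v'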

Oh. Right. That makes sense: the disk must have gone over the watermark at some point earlier, Elasticsearch flipped the indices to read-only when it did, and (at least on the Elasticsearch version I was running) it doesn’t lift that block by itself once space frees up again. Tried that curl command:

curl -XPUT -H "Content-Type: application/json" http://[YOUR_ELASTICSEARCH_ENDPOINT]:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'

BOOM! Logs. What a nightmare.

Lessons learnt:

  1. Elasticsearch isn’t set-and-forget
  2. The Graylog forums are full of really useful info
  3. Fix your log spew so you can actually see things
  4. Give it ample room and tweak your watermarks (sketch below, after the log excerpt). Graylog not showing messages in seach view - #6 by Markus
  5. Because of my log spew, the high watermark warnings had rotated out AGES before I even found the issue to begin looking at it (see the grep output below)
  6. Other monitoring systems are needed on top of this
[root@node179 graylog-server]# grep -i watermark server.log
[root@node179 graylog-server]# zgrep -i watermark server.log.*.gz
server.log.7.gz:2020-06-18T10:54:11.624+10:00 WARN  [IndexerClusterCheckerThread] Elasticsearch node [127.0.0.1] triggered [ES_NODE_DISK_WATERMARK_LOW] due to low free disk space
server.log.7.gz:2020-06-18T10:57:11.380+10:00 WARN  [IndexerClusterCheckerThread] Elasticsearch node [127.0.0.1] triggered [ES_NODE_DISK_WATERMARK_LOW] due to low free disk space
server.log.7.gz:2020-06-18T11:02:11.377+10:00 WARN  [IndexerClusterCheckerThread] Elasticsearch node [127.0.0.1] triggered [ES_NODE_DISK_WATERMARK_LOW] due to low free disk space
server.log.7.gz:2020-06-18T11:03:11.365+10:00 WARN  [IndexerClusterCheckerThread] Elasticsearch node [127.0.0.1] triggered [ES_NODE_DISK_WATERMARK_LOW] due to low free disk space
[root@node179 graylog-server]#
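On the watermark point (lesson 4): raising the thresholds is a single cluster-settings call. The percentages below are only an illustration, not a recommendation; pick values that suit your disk, and note that persistent settings survive a cluster restart.

curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings -d '
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "92%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}'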

Graylog was on 3.2 and is now on 3.3; it was already broken, so why not upgrade!

