Every morning I come in at 8 a.m. and Graylog appears to be processing a backlog of messages (pic below), usually around 1.5 million. It seems to take a couple of hours before it catches up and the dashboards start updating again. I'm assuming I've botched a config or have some taxing regex filters or something. Can someone tell me where to look to find out what the problem is?
I've looked in /var/log/graylog-server/graylog.log and the last entry is Kafka barfing over the journal lock; however, IIRC, that was two days ago after a reboot, when I accidentally had two graylog-server services trying to start.
2017-07-31T08:07:39.265-05:00 ERROR [KafkaJournal] Unable to start logmanager.
kafka.common.KafkaException: Failed to acquire lock on file .lock in /var/lib/graylog-server/journal. A Kafka instance in another process or thread is using this directory.
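In case that lock error comes back, here's roughly how to confirm that only one graylog-server instance is running and see who owns the journal (this assumes the stock package paths and service name, so adjust to your install):

systemctl status graylog-server                # should show exactly one active instance
pgrep -a -f graylog-server                     # any stray duplicate processes?
lsof /var/lib/graylog-server/journal/.lock     # which process currently holds the journal lock
du -sh /var/lib/graylog-server/journal         # how much backlog is sitting in the journal on disk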
You'll have to find out what's causing Graylog to restart. Also check the logs of your Elasticsearch node(s), and look for cron jobs that might be putting additional load on your systems.
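For example, something along these lines (assuming Elasticsearch listens on localhost:9200 and the host uses systemd; names and paths may differ on your setup):

curl -s 'http://localhost:9200/_cluster/health?pretty'    # cluster status and unassigned shards
tail -n 200 /var/log/elasticsearch/graylog.log            # recent Elasticsearch log entries
journalctl -u graylog-server --since yesterday            # did graylog-server restart overnight?
ls /etc/cron.d /etc/cron.daily                            # system-wide cron jobs
crontab -l                                                # per-user cron jobs (check root's as well)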
I have this in the Elasticsearch log from last night; nothing in there from 8/02 yet. Something caused a "RED" status change. I've attached a CPU graph which seems to show some major processing going on from 4:00 to 9:30 (obviously the backlog processing) and then again from 18:00 to 22:30. I don't see any cron jobs starting at 4:00 or 18:00. Is there some housecleaning that Graylog does by default at those times? It could be a remote backup process, but I'm not sure. I'll have to check with the storage ops guys…
# cat /var/log/elasticsearch/graylog.log
[2017-08-01 10:35:41,330][INFO ][cluster.metadata ] [Dragon of the Moon] [graylog_41] update_mapping [message]
[2017-08-01 18:49:28,353][INFO ][cluster.metadata ] [Dragon of the Moon] [graylog_42] creating index, cause [api], templates [graylog-internal], shards [4]/[0], mappings [message]
[2017-08-01 18:49:28,406][INFO ][cluster.routing.allocation] [Dragon of the Moon] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[graylog_42][0], [graylog_42][0]] ...]).
[2017-08-01 18:49:28,692][INFO ][cluster.metadata ] [Dragon of the Moon] [graylog_42] update_mapping [message]
[2017-08-01 18:50:05,293][INFO ][cluster.metadata ] [Dragon of the Moon] [graylog_42] update_mapping [message]
[2017-08-01 18:54:15,331][INFO ][cluster.metadata ] [Dragon of the Moon] [graylog_42] update_mapping [message]
[2017-08-01 18:54:38,343][INFO ][cluster.metadata ] [Dragon of the Moon] [graylog_42] update_mapping [message]
[2017-08-01 18:55:00,517][INFO ][cluster.metadata ] [Dragon of the Moon] [graylog_42] update_mapping [message]
[2017-08-01 18:56:03,338][INFO ][cluster.metadata ] [Dragon of the Moon] [graylog_42] update_mapping [message]
[2017-08-01 18:56:23,927][INFO ][cluster.metadata ] [Dragon of the Moon] [graylog_42] update_mapping [message]
[2017-08-01 18:56:23,946][INFO ][cluster.metadata ] [Dragon of the Moon] [graylog_42] update_mapping [message]
[2017-08-01 19:30:31,270][INFO ][cluster.metadata ] [Dragon of the Moon] [graylog_42] update_mapping [message]
[2017-08-01 21:36:41,343][INFO ][cluster.metadata ] [Dragon of the Moon] [graylog_42] update_mapping [message]
[2017-08-01 23:17:53,325][INFO ][cluster.metadata ] [Dragon of the Moon] [graylog_42] update_mapping [message]
[2017-08-01 23:22:24,487][INFO ][cluster.metadata ] [Dragon of the Moon] [graylog_42] update_mapping [message]
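For what it's worth, the graylog_42 "creating index, cause [api]" line at 18:49 looks like Graylog's normal index rotation rather than a cron job. A rough way to compare those times against the index creation dates and the rotation/retention settings, assuming Elasticsearch on localhost:9200 and the stock config path (treat this as a sketch):

curl -s 'http://localhost:9200/_cat/indices/graylog_*?v'                          # all Graylog indices with doc counts and sizes
curl -s 'http://localhost:9200/graylog_42/_settings?pretty' | grep creation_date  # when graylog_42 was created (epoch millis)
grep -E 'rotation|retention' /etc/graylog/server/server.conf                      # rotation/retention strategy on older setups (newer versions configure this under System -> Indices in the UI)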
It seems to be related to a process on another server that dumps massive amounts of logs to Graylog during an import at those times. I've disabled logging from that host, and the problem isn't there this morning.