Has anyone experienced something like this, and what did you do to resolve it?
1. Describe your incident:
Bringing up additional Graylog nodes at a second physical location causes all nodes to “drop” from the cluster entirely and processing halts. Stopping the graylog service on the new node allows the others to resume.
2. Describe your environment:
Three-node cluster at Site1; attempting to bring up additional nodes at Site2.
Two physical network locations separated by SD-WAN.
Two subnets under the same broadcast domain.
Package Version:
4.3.5+32fa802 (Debian 11.0.16 on Linux 5.10.0-16-amd64)
Not a lot to go on other than the Graylog version… Here are some tips on how to make your question clearer here and here.
Have you looked in the Graylog logs? What do the server.conf files look like? The tips I posted show how to get the server.conf data without all the comments. Make sure to obfuscate where needed. They also show how to use the </> forum tool when posting code/logs so they are easy to read.
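If you don't have the linked tips handy, here is a rough Python sketch of the same idea: drop comments and blank lines from server.conf and mask anything that looks like a secret before posting. The function name `effective_settings` and the "password|secret" key pattern are my own placeholders and not part of Graylog, so extend the pattern to whatever is sensitive in your setup.

```python
import re

# Keys whose values should not be posted publicly.
# Illustrative pattern only; extend it for your own configuration.
SENSITIVE = re.compile(r"password|secret", re.IGNORECASE)


def effective_settings(text: str) -> list[str]:
    """Return the non-blank, non-comment lines of a config file,
    with the values of sensitive-looking keys masked."""
    out = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        key, sep, _value = line.partition("=")
        # Mask the value when the key name matches the sensitive pattern.
        if sep and SENSITIVE.search(key):
            line = f"{key.strip()} = <redacted>"
        out.append(line)
    return out
```

To use it, read the file and print the result, for example `print("\n".join(effective_settings(open("/etc/graylog/server/server.conf").read())))`, assuming the config lives at the usual Debian package path. Check the output by hand before posting, since connection strings such as the MongoDB URI can also carry credentials that this pattern will not catch.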
I’m happy to report that I’ve found the issue and everything appears to be working now!
It turns out that our WAN link was just laggy enough to cause a timeout for the Master Node. After a bit of digging I found the stale_master_timeout setting in server.conf and bumped it up to a few seconds instead of two, and bam - problem solved!
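For anyone landing here later, the change looks roughly like this in server.conf. The 5000 below is only a placeholder and not the exact value I ended up with, and I'm assuming the setting is expressed in milliseconds, so the original two seconds would have been 2000.

```
# server.conf
# Assumed unit: milliseconds. 5000 is an illustrative value; tune it to your WAN latency.
stale_master_timeout = 5000
```

Restart the Graylog service after editing the file so the new value is picked up.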
Hopefully this post can help someone else in the future!