Hello all,
Let’s assume an installation of four Graylog nodes across two datacenters, two in each. One of the datacenters is considered primary. The four Graylog nodes are backed by a single five-node MongoDB replica set and a five-node OpenSearch cluster; in both cases, the fifth node is an arbiter/tiebreaker that sits in the primary datacenter.
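For reference, here is a minimal sketch (Python/pymongo) of the replica set layout I mean; the hostnames, the set name and the priorities are placeholders, not my real config:

```python
# Hypothetical layout of the five-member replica set described above.
# Hostnames, the set name ("graylog") and priorities are placeholders.
from pymongo import MongoClient

rs_config = {
    "_id": "graylog",
    "members": [
        # Data-bearing members in the primary datacenter (dc1); higher
        # priority so the primary is normally elected there.
        {"_id": 0, "host": "mongo1.dc1.example:27017", "priority": 2},
        {"_id": 1, "host": "mongo2.dc1.example:27017", "priority": 2},
        # Data-bearing members in the secondary datacenter (dc2).
        {"_id": 2, "host": "mongo1.dc2.example:27017", "priority": 1},
        {"_id": 3, "host": "mongo2.dc2.example:27017", "priority": 1},
        # Arbiter (votes only, holds no data) in the primary datacenter.
        {"_id": 4, "host": "arbiter.dc1.example:27017", "arbiterOnly": True},
    ],
}

# Initiate the set from one of the dc1 members.
client = MongoClient("mongodb://mongo1.dc1.example:27017/?directConnection=true")
client.admin.command("replSetInitiate", rs_config)
```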
Now, let’s assume the following two scenarios:
S1 - the primary datacenter goes completely offline
S2 - the connection between the two datacenters gets severed for X minutes
I understand that in both scenarios MongoDB becomes read-only on the secondary site (the members there can no longer see a majority of the replica set), and Graylog stops working there. What happens to log collection in that case?
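To make the “read-only” part concrete, this is roughly the check I run from a dc2 member (the hostname is a placeholder, and I’m assuming MongoDB 5.0+ so the hello command is available):

```python
# Quick check from a secondary-site member: does the replica set still
# have a writable primary from this node's point of view?
from pymongo import MongoClient

client = MongoClient("mongodb://mongo1.dc2.example:27017/?directConnection=true")
hello = client.admin.command("hello")

print("this node is a writable primary:", hello.get("isWritablePrimary"))
print("primary visible from this node:", hello.get("primary", "none"))
```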
Let’s assume that I start an arbiter on the secondary site so a majority can be formed there and Graylog resumes. When we recover from S1, the primary datacenter would catch up and all would be good (correct me if I’m wrong).
What happens, however, in S2, where the primary datacenter kept a working MongoDB majority the whole time (thanks to its arbiter) and never stopped collecting? If I have also started an arbiter on the secondary site, I end up with a split-brain: the replica set has diverged, and when the link is restored the cluster will reject some of the nodes, if not all, and a manual MongoDB recovery will be required.
I understand that this may look like a MongoDB problem rather than a Graylog one, but my question remains: what is the best way to implement a fault-tolerant Graylog cluster across two datacenters?
Can I run a single Graylog cluster backed by two MongoDB clusters, one for each site?
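For context, all four Graylog nodes currently share a single mongodb_uri pointing at that one replica set, roughly like the sketch below (hostnames and names are placeholders), which is why I’m wondering whether two separate clusters are even possible:

```python
# The kind of connection string all Graylog nodes currently share via
# mongodb_uri: one replica set, seeded with the data-bearing members.
from pymongo import MongoClient

MONGODB_URI = (
    "mongodb://mongo1.dc1.example:27017,mongo2.dc1.example:27017,"
    "mongo1.dc2.example:27017,mongo2.dc2.example:27017"
    "/graylog?replicaSet=graylog"
)

# Sanity check: the URI resolves and the replica set answers.
client = MongoClient(MONGODB_URI)
print(client.admin.command("ping"))
```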
(I’m obviously not worried about the OpenSearch cluster going offline temporarily, since Graylog can buffer incoming messages in its disk journal and keep collecting with OpenSearch unavailable for a few hours, depending on the load.)
Thanks in advance and apologies for the long post.
- AKG