This issue has only been observed since we upgraded from Graylog 2.2.3 to 2.4.6.
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-b15)
OpenJDK 64-Bit Server VM (build 25.111-b15, mixed mode)
We have two physically identical Graylog servers behind a BIG-IP load balancer, with one of the nodes acting as the Graylog master. Both servers have identical Graylog configs (apart from one being the master).
I’ve noticed that under peak load the master node starts to backlog an increasing number of messages to the journal, whilst the other server keeps up with the increased message rate. At points during the peak load the master node sets its lifecycle to dead. When this happens, the slave comfortably deals with the extra load imposed on it by the master being offline. The example metrics pasted below show the scenario far more effectively than I can describe it.
plp-glserver04 is the master, whilst plp-glserver03 is a regular node. In particular, the ‘Network RX’ metric clearly shows the master being taken out of the LB pool and the regular node absorbing the extra load.