This issue has only been observed since we upgraded from Graylog version 2.2.3 to 2.4.6.
/usr/bin/java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-b15)
OpenJDK 64-Bit Server VM (build 25.111-b15, mixed mode)
We have two physically identical Graylog servers behind a BIG-IP load balancer, with one of the nodes acting as the Graylog master. Both servers have identical Graylog configurations (apart from the master flag).
I’ve noticed that under peak load the master node starts to backlog increasing amounts of messages to the journal, whilst the other server keeps up with the increased message rate. At points during peak load the master node sets its lifecycle to dead. When this happens, the slave comfortably handles the extra load imposed on it by the master being offline. The example metrics pasted below show the scenario far more effectively than I can describe it.
plp-glserver04 is the master whilst plp-glserver03 is a regular node. In particular, the ‘Network RX’ graph clearly shows the master being taken out of the LB pool and the regular node absorbing the extra load.
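If it helps to quantify the backlog, the journal state can also be queried per node over the Graylog 2.x REST API. A minimal sketch; the host, port, and credentials below are placeholders for this setup, and the call needs an account with API access:

```shell
# Query the disk journal status (size, utilization, unread messages)
# directly from one node. Host and credentials are placeholders;
# --max-time keeps the call from hanging, and || true keeps the
# snippet from aborting if the node is unreachable.
OUT=$(curl -s --max-time 5 -u admin:PASSWORD \
  "http://plp-glserver04:9000/api/system/journal" || true)
printf '%s\n' "$OUT"
```

Comparing the output of this call on both nodes during peak load would show whether only the master's journal is filling up.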
Did you check the Graylog server.log on the master node?
Some maintenance tasks run only on the master, which might give it higher load. But it could also be that other, non-Graylog-related issues are happening on the system. The Graylog server.log might tell.
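A quick way to surface the relevant entries; the log path below is the default for the DEB/RPM packages and is an assumption, so adjust it if your installation logs elsewhere:

```shell
# Show the 50 most recent warnings and errors from the Graylog server
# log. The path is the DEB/RPM package default; adjust as needed.
# || true keeps the snippet from aborting if the log is absent.
LOG=/var/log/graylog-server/server.log
grep -E ' (WARN|ERROR) ' "$LOG" 2>/dev/null | tail -n 50 || true
```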
I’ve managed to get some logs when this issue takes place. I get many repeats of the following error.
2018-11-19 06:04:06,479 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Unable to call http://plp-glserver03.betgenius.net/api/system/metrics/multiple on node <003>
java.net.SocketTimeoutException: timeout
    at okio.Okio$4.newTimeoutException(Okio.java:230) ~[graylog.jar:?]
    at okio.AsyncTimeout.exit(AsyncTimeout.java:285) ~[graylog.jar:?]
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:241) ~[graylog.jar:?]
    at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345) ~[graylog.jar:?]
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217) ~[graylog.jar:?]
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211) ~[graylog.jar:?]
    at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:187) ~[graylog.jar:?]
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:125) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
    at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:59) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200) ~[graylog.jar:?]
    at okhttp3.RealCall.execute(RealCall.java:77) ~[graylog.jar:?]
    at retrofit2.OkHttpCall.execute(OkHttpCall.java:180) ~[graylog.jar:?]
    at org.graylog2.shared.rest.resources.ProxiedResource.lambda$getForAllNodes$0(ProxiedResource.java:76) ~[graylog.jar:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_111]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
Caused by: java.net.SocketException: Socket closed
    at java.net.SocketInputStream.read(SocketInputStream.java:203) ~[?:1.8.0_111]
    at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_111]
    at okio.Okio$2.read(Okio.java:139) ~[graylog.jar:?]
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:237) ~[graylog.jar:?]
    ... 28 more
The node is actually timing out when attempting to connect to itself. Could it be that the server is running out of file handles?
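One way to check the file-handle hypothesis on a Linux host; this is a sketch, and the pgrep pattern is an assumption to adjust for your service (it falls back to the current shell purely so the commands produce output on any machine):

```shell
# Compare the number of open file descriptors of the Graylog JVM with
# its soft "Max open files" limit from /proc. The pgrep pattern is an
# assumption; falls back to the current shell PID so the snippet runs
# anywhere for illustration.
PID=$(pgrep -f graylog.jar | head -n 1)
PID=${PID:-$$}
OPEN=$(ls /proc/"$PID"/fd | wc -l)
LIMIT=$(awk '/Max open files/ {print $4}' /proc/"$PID"/limits)
echo "pid=$PID open_fds=$OPEN soft_limit=$LIMIT"
```

If open_fds is close to soft_limit during peak load, raising the limit for the Graylog service (e.g. via systemd's LimitNOFILE or /etc/security/limits.conf) would be the next step to try.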