Graylog master node LB status marks its lifecycle state as dead during peak load periods

This issue has only been observed since we upgraded from Graylog version 2.2.3 to 2.4.6.

/usr/bin/java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-b15)
OpenJDK 64-Bit Server VM (build 25.111-b15, mixed mode)

We have two physically identical Graylog servers behind a BIG-IP load balancer, with one of the nodes acting as the Graylog master. Both servers have identical Graylog configs (apart from the one node being designated master).

I’ve noticed that under peak load the master node will start to backlog increasing amounts of messages to the journal, whilst the other server keeps up with the increased message rate. At points during the peak load the master node will set its lifecycle state to dead. When this happens the slave comfortably deals with the extra load imposed on it due to the master being offline. The example metrics pasted below show the scenario far more effectively than I can describe it.

[image: node metrics screenshot]

plp-glserver04 is the master whilst plp-glserver03 is a regular node. In particular, the ‘Network RX’ graph clearly shows the master being taken out of the LB pool and the regular node absorbing the extra load.

[image: node metrics screenshot]
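For anyone wanting to watch the same transition without the BIG-IP graphs, the node’s load balancer status endpoint can be polled directly; a minimal sketch, assuming the 2.4 default of the REST API listening on port 9000 under /api (the lbstatus endpoint needs no authentication):

# Poll the master's load balancer status; it answers HTTP 200 (ALIVE) while the
# node reports itself healthy and HTTP 503 (DEAD) once it drops out of the pool.
# Port 9000 and the /api prefix are assumptions based on a default 2.4 install.
while true; do
  printf '%s ' "$(date '+%H:%M:%S')"
  curl -s -o /dev/null -w 'lbstatus HTTP %{http_code}\n' \
    http://plp-glserver04.betgenius.net:9000/api/system/lbstatus
  sleep 10
done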

Did you check the Graylog server.log on the master node?

Some maintenance tasks run only on the master, which might give it a higher load. But it could also be that other, non-Graylog-related issues are happening on the system. The Graylog server.log might tell you that.
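For example, something along these lines on the master during the peak window; a sketch, assuming the default package-install log location:

# Follow the master's server log and surface anything at WARN or above;
# the path below assumes a default package install.
tail -F /var/log/graylog-server/server.log | grep -E 'WARN|ERROR|FATAL'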

Hi Jan

Unfortunately I can’t show you any useful logs, because since I upgraded to the latest version the logs have been saturated with repeats of the following examples…

2018-11-14 09:32:29,570 WARN : org.graylog2.inputs.codecs.GelfCodec - GELF message <349d4710-e7f0-11e8-b35e-1866daec73dc> (received from <XXX.XXX.XXX.XXX:53243>) has invalid "timestamp": 1542187949.54509 (type: STRING)

and…

java.lang.IllegalArgumentException: GELF message <349caad0-e7f0-11e8-b35e-1866daec73dc> (received from <10.128.40.232:59815>) has empty mandatory "short_message" field.
2018-11-14 09:32:29,566 ERROR: org.graylog2.shared.buffers.processors.DecodingProcessor - Error processing message RawMessage{id=349caad0-e7f0-11e8-b35e-1866daec73dc, journalOffset=1572882118, codec=gelf, payloadSize=427, timestamp=2018-11-14T09:32:29.565Z, remoteAddress=/XXX.XXX.XXX.XXX:59815}

I could lower the logging level to FATAL, but I’m guessing that won’t capture the logs we need?
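(A less drastic option than dropping everything to FATAL would be to raise only the noisy GelfCodec logger to ERROR at runtime; a rough sketch, assuming the API is on port 9000, the loggers REST resource is available in this version, and an admin user exists. The change is runtime-only and resets on restart.)

# Silence only the chatty GELF codec WARNs while leaving the rest of
# server.log at its normal level. Port 9000, the /api prefix and the
# loggers resource path are assumptions for a 2.4 setup; runtime-only.
curl -u admin -X PUT -H 'X-Requested-By: cli' \
  'http://plp-glserver04.betgenius.net:9000/api/system/loggers/org.graylog2.inputs.codecs.GelfCodec/level/ERROR'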

I would first fix the GELF transmission errors and then look into the issue mentioned above when it appears again.
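For reference, a well-formed GELF message sends the timestamp as a JSON number (seconds since the epoch, with optional decimal milliseconds) and never leaves short_message empty. A minimal sketch of a valid payload, assuming a GELF HTTP input listening on the default port 12201; the host value and the _service field are placeholders:

# Minimal valid GELF message: numeric timestamp, non-empty short_message.
# The GELF HTTP input and port 12201 are assumptions; the same JSON shape
# applies to the GELF TCP/UDP inputs.
curl -s -X POST 'http://plp-glserver04.betgenius.net:12201/gelf' \
  -H 'Content-Type: application/json' \
  -d '{
        "version": "1.1",
        "host": "app-server-01",
        "short_message": "orders service restarted",
        "timestamp": 1542187949.545,
        "level": 6,
        "_service": "orders"
      }'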

I’ve managed to get some logs when this issue takes place. I get many repeats of the following error.

2018-11-19 06:04:06,479 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Unable to call http://plp-glserver03.betgenius.net/api/system/metrics/multiple on node <003>
java.net.SocketTimeoutException: timeout
    at okio.Okio$4.newTimeoutException(Okio.java:230) ~[graylog.jar:?]
    at okio.AsyncTimeout.exit(AsyncTimeout.java:285) ~[graylog.jar:?]
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:241) ~[graylog.jar:?]
    at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345) ~[graylog.jar:?]
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217) ~[graylog.jar:?]
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211) ~[graylog.jar:?]
    at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:187) ~[graylog.jar:?]
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:125) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
    at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:59) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200) ~[graylog.jar:?]
    at okhttp3.RealCall.execute(RealCall.java:77) ~[graylog.jar:?]
    at retrofit2.OkHttpCall.execute(OkHttpCall.java:180) ~[graylog.jar:?]
    at org.graylog2.shared.rest.resources.ProxiedResource.lambda$getForAllNodes$0(ProxiedResource.java:76) ~[graylog.jar:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_111]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
Caused by: java.net.SocketException: Socket closed
    at java.net.SocketInputStream.read(SocketInputStream.java:203) ~[?:1.8.0_111]
    at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_111]
    at okio.Okio$2.read(Okio.java:139) ~[graylog.jar:?]
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:237) ~[graylog.jar:?]
    ... 28 more

The node is actually timing out when attempting to connect to itself. Could it be that the server is running out of file handles?
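One quick way to check is to compare the graylog-server JVM’s open file descriptors with the limit that actually applies to that running process (which is what counts, rather than the shell’s ulimit); a sketch, assuming a single JVM whose command line contains ‘graylog’:

# Compare current fd usage of the running graylog-server JVM against the
# limit that applies to that process. Run as root (or the graylog user) so
# /proc/<pid>/fd is readable; assumes one JVM matching "graylog".
GL_PID=$(pgrep -f graylog | head -n1)
echo "open fds: $(ls /proc/$GL_PID/fd | wc -l)"
grep 'Max open files' /proc/$GL_PID/limits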

It might also be that the server is not able to resolve the given DNS entry:

http://plp-glserver03.betgenius.net/api/system/metrics/multiple

At least that is my guess as to why the server gets the timeout…

Hi Jan,

I doubt it’s a DNS lookup failure. The name being looked up is the host name of the server itself, which is in the /etc/hosts file.
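That’s easy to confirm from the node itself: getent resolves through the same NSS order the system resolver uses (files before dns on a default nsswitch.conf), so it shows whether /etc/hosts answers, and a timed request to the unauthenticated lbstatus endpoint would catch a slow resolver; the API port 9000 below is an assumption:

# Confirm the node's own FQDN resolves locally (via /etc/hosts through NSS)
# and time a round trip to the unauthenticated lbstatus endpoint to rule out
# slow name resolution. Port 9000 is an assumption for a default 2.4 install.
getent hosts plp-glserver03.betgenius.net
time curl -s -o /dev/null http://plp-glserver03.betgenius.net:9000/api/system/lbstatus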

I’ve doubled the max file handles on the node in question. I’ll leave it to soak over the weekend when we have peak load.
