Graylog master node LB status marks its lifecycle state as dead during peak load periods

This issue has only been observed since we upgraded from Graylog version 2.2.3 to 2.4.6.

/usr/bin/java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-b15)
OpenJDK 64-Bit Server VM (build 25.111-b15, mixed mode)

We have two physically identical Graylog servers behind a BIG-IP load balancer, with one of the nodes acting as the Graylog master. Both servers have identical Graylog configs (apart from the one node being designated master).

I’ve noticed that under peak load the master node will start to backlog increasing amounts of messages to the journal, whilst the other server keeps up with the increased message rate. At points during the peak load the master node will set its lifecycle state to dead. When this happens the slave comfortably deals with the extra load imposed on it due to the master being offline. The example metrics pasted below show the scenario far more effectively than I can describe it.

[image: node metrics screenshot]

plp-glserver04 is the master whilst plp-glserver03 is a regular node. In particular, the ‘Network RX’ graph clearly shows the master being taken out of the LB pool and the regular node absorbing the extra load.

[image: node metrics screenshot]
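For anyone wanting to watch the same transition without the BIG-IP graphs, the node’s load balancer status endpoint can be polled directly; a minimal sketch, assuming the 2.4 default of the REST API listening on port 9000 under /api (the lbstatus endpoint needs no authentication):

# Poll the master's load balancer status; it answers HTTP 200 (ALIVE) while the
# node reports itself healthy and HTTP 503 (DEAD) once it drops out of the pool.
# Port 9000 and the /api prefix are assumptions based on a default 2.4 install.
while true; do
  printf '%s ' "$(date '+%H:%M:%S')"
  curl -s -o /dev/null -w 'lbstatus HTTP %{http_code}\n' \
    http://plp-glserver04.betgenius.net:9000/api/system/lbstatus
  sleep 10
done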

Did you check the Graylog server.log on the master node?

Some maintenance tasks run only on the master, which might give it a higher load. But it could also be that other, non-Graylog-related issues are happening on the system. The Graylog server.log might tell you that.
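For example, something along these lines on the master during the peak window; a sketch, assuming the default package-install log location:

# Follow the master's server log and surface anything at WARN or above;
# the path below assumes a default package install.
tail -F /var/log/graylog-server/server.log | grep -E 'WARN|ERROR|FATAL'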

Hi Jan

Unfortunately I can’t show you any useful logs, because since I upgraded to the latest version the logs have been saturated with repeats of the following examples…

2018-11-14 09:32:29,570 WARN : org.graylog2.inputs.codecs.GelfCodec - GELF message <349d4710-e7f0-11e8-b35e-1866daec73dc> (received from <XXX.XXX.XXX.XXX:53243>) has invalid "timestamp": 1542187949.54509 (type: STRING)

and…

java.lang.IllegalArgumentException: GELF message <349caad0-e7f0-11e8-b35e-1866daec73dc> (received from <10.128.40.232:59815>) has empty mandatory "short_message" field.
2018-11-14 09:32:29,566 ERROR: org.graylog2.shared.buffers.processors.DecodingProcessor - Error processing message RawMessage{id=349caad0-e7f0-11e8-b35e-1866daec73dc, journalOffset=1572882118, codec=gelf, payloadSize=427, timestamp=2018-11-14T09:32:29.565Z, remoteAddress=/XXX.XXX.XXX.XXX:59815}

I could lower the logging level to FATAL, but I’m guessing that won’t capture the logs we need?
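(A less drastic option than dropping everything to FATAL would be to raise only the noisy GelfCodec logger to ERROR at runtime; a rough sketch, assuming the API is on port 9000, the loggers REST resource is available in this version, and an admin user exists. The change is runtime-only and resets on restart.)

# Silence only the chatty GELF codec WARNs while leaving the rest of
# server.log at its normal level. Port 9000, the /api prefix and the
# loggers resource path are assumptions for a 2.4 setup; runtime-only.
curl -u admin -X PUT -H 'X-Requested-By: cli' \
  'http://plp-glserver04.betgenius.net:9000/api/system/loggers/org.graylog2.inputs.codecs.GelfCodec/level/ERROR'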

I would first fix the GELF transmission errors and then look into the issue mentioned above when it appears again.
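For reference, a well-formed GELF message sends the timestamp as a JSON number (seconds since the epoch, with optional decimal milliseconds) and never leaves short_message empty. A minimal sketch of a valid payload, assuming a GELF HTTP input listening on the default port 12201; the host value and the _service field are placeholders:

# Minimal valid GELF message: numeric timestamp, non-empty short_message.
# The GELF HTTP input and port 12201 are assumptions; the same JSON shape
# applies to the GELF TCP/UDP inputs.
curl -s -X POST 'http://plp-glserver04.betgenius.net:12201/gelf' \
  -H 'Content-Type: application/json' \
  -d '{
        "version": "1.1",
        "host": "app-server-01",
        "short_message": "orders service restarted",
        "timestamp": 1542187949.545,
        "level": 6,
        "_service": "orders"
      }'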

I’ve managed to get some logs when this issue takes place. I get many repeats of the following error.

2018-11-19 06:04:06,479 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Unable to call http://plp-glserver03.betgenius.net/api/system/metrics/multiple on node <003>
java.net.SocketTimeoutException: timeout
    at okio.Okio$4.newTimeoutException(Okio.java:230) ~[graylog.jar:?]
    at okio.AsyncTimeout.exit(AsyncTimeout.java:285) ~[graylog.jar:?]
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:241) ~[graylog.jar:?]
    at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345) ~[graylog.jar:?]
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217) ~[graylog.jar:?]
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211) ~[graylog.jar:?]
    at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:187) ~[graylog.jar:?]
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:125) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
    at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:59) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200) ~[graylog.jar:?]
    at okhttp3.RealCall.execute(RealCall.java:77) ~[graylog.jar:?]
    at retrofit2.OkHttpCall.execute(OkHttpCall.java:180) ~[graylog.jar:?]
    at org.graylog2.shared.rest.resources.ProxiedResource.lambda$getForAllNodes$0(ProxiedResource.java:76) ~[graylog.jar:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_111]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]
Caused by: java.net.SocketException: Socket closed
    at java.net.SocketInputStream.read(SocketInputStream.java:203) ~[?:1.8.0_111]
    at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_111]
    at okio.Okio$2.read(Okio.java:139) ~[graylog.jar:?]
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:237) ~[graylog.jar:?]
    ... 28 more

The node is actually timing out when attempting to connect to itself. Could it be that the server is running out of file handles?
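One quick way to check is to compare the graylog-server JVM’s open file descriptors with the limit that actually applies to that running process (which is what counts, rather than the shell’s ulimit); a sketch, assuming a single JVM whose command line contains ‘graylog’:

# Compare current fd usage of the running graylog-server JVM against the
# limit that applies to that process. Run as root (or the graylog user) so
# /proc/<pid>/fd is readable; assumes one JVM matching "graylog".
GL_PID=$(pgrep -f graylog | head -n1)
echo "open fds: $(ls /proc/$GL_PID/fd | wc -l)"
grep 'Max open files' /proc/$GL_PID/limits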

It might also be that the server is not able to resolve the given DNS entry:

http://plp-glserver03.betgenius.net/api/system/metrics/multiple

At least that is my guess as to why the server gets the timeout…

Hi Jan,

I doubt it’s a DNS lookup failure. The name being looked up is the host name of the server itself, which is in the /etc/hosts file.
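That’s easy to confirm from the node itself: getent resolves through the same NSS order the system resolver uses (files before dns on a default nsswitch.conf), so it shows whether /etc/hosts answers, and a timed request to the unauthenticated lbstatus endpoint would catch a slow resolver; the API port 9000 below is an assumption:

# Confirm the node's own FQDN resolves locally (via /etc/hosts through NSS)
# and time a round trip to the unauthenticated lbstatus endpoint to rule out
# slow name resolution. Port 9000 is an assumption for a default 2.4 install.
getent hosts plp-glserver03.betgenius.net
time curl -s -o /dev/null http://plp-glserver03.betgenius.net:9000/api/system/lbstatus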

I’ve doubled the max file handles on the node in question. I’ll leave it to soak over the weekend when we have peak load.
