Graylog System API Timeouts

Hello all,

I’ve been running Graylog in production for a while now, but recently I’ve been experiencing intermittent, at times very severe, service degradation, and it seems to stem from the Graylog API making calls to the /system endpoint.


Specs:
Graylog Server 3.3.2 on CentOS 7.8.2003 @ 3.10.0-1127.13.1.el7.x86_64

Graylog is a single-node setup, with the “master” being an Azure Standard D8 v3 (8 vcpus, 32 GiB memory) that also runs the MongoDB (4.0.19) instance for configs. The same server runs Nginx on 443, proxying for the web/API on 9000. Both Nginx and the API/web are TLS’ed (a rough sketch of the proxy config follows the specs below).

ES (6.8) cluster of 5 nodes, each an Azure Standard D4s v3 (4 vcpus, 16 GiB memory) with 1 TB of disk space.
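
Since the proxying came up, here is roughly what the Nginx side looks like. This is only a minimal sketch with placeholder hostnames and certificate paths, not my exact config:

```
server {
    listen 443 ssl;
    server_name graylog.example.org;                    # placeholder

    ssl_certificate     /etc/nginx/ssl/graylog.crt;     # placeholder paths
    ssl_certificate_key /etc/nginx/ssl/graylog.key;

    location / {
        # Graylog's own listener on 9000 is also TLS'ed, hence https here
        proxy_pass https://127.0.0.1:9000/;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Graylog-Server-URL https://$server_name/;
    }
}
```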


Observations:

This instance started life on 3.1 and was upgraded to 3.3.2 the day it came out. It ran without issue for two days until the API calls to the system endpoint suddenly started to fail. The UI remained somewhat serviceable until it began to time out here and there as well.

We’re still getting logs into ES and even seeing them in Search. But when navigating to domain/system/nodes/, the server.log spews this:

```
2020-07-15T19:55:13.456Z ERROR [AnyExceptionClassMapper] Unhandled exception in REST resource
java.net.SocketTimeoutException: timeout
at okio.Okio$4.newTimeoutException(Okio.java:232) ~[graylog.jar:?]
at okio.AsyncTimeout.exit(AsyncTimeout.java:286) ~[graylog.jar:?]
at okio.AsyncTimeout$2.read(AsyncTimeout.java:241) ~[graylog.jar:?]
at okio.RealBufferedSource.indexOf(RealBufferedSource.java:358) ~[graylog.jar:?]
at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:230) ~[graylog.jar:?]
at okhttp3.internal.http1.Http1ExchangeCodec.readHeaderLine(Http1ExchangeCodec.java:242) ~[graylog.jar:?]
at okhttp3.internal.http1.Http1ExchangeCodec.readResponseHeaders(Http1ExchangeCodec.java:213) ~[graylog.jar:?]
at okhttp3.internal.connection.Exchange.readResponseHeaders(Exchange.java:115) ~[graylog.jar:?]
at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:94) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:43) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[graylog.jar:?]
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:94) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[graylog.jar:?]
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:88) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[graylog.jar:?]
at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:61) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[graylog.jar:?]
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:229) ~[graylog.jar:?]
at okhttp3.RealCall.execute(RealCall.java:81) ~[graylog.jar:?]
at retrofit2.OkHttpCall.execute(OkHttpCall.java:190) ~[graylog.jar:?]
at org.graylog2.rest.resources.cluster.ClusterSystemResource.jvm(ClusterSystemResource.java:91) ~[graylog.jar:?]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:205) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102) ~[graylog.jar:?]
at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors.process(Errors.java:315) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors.process(Errors.java:297) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors.process(Errors.java:267) [graylog.jar:?]
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317) [graylog.jar:?]
at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305) [graylog.jar:?]
at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154) [graylog.jar:?]
at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:384) [graylog.jar:?]
at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:224) [graylog.jar:?]
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181) [graylog.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
```

The rest of the log is filled with:

```
2020-07-15T20:14:32.024Z WARN  [ProxiedResource] Unable to call https:domain:9000/api/system/jobs on node <<redacted>>: timeout
2020-07-15T20:32:43.229Z WARN  [ProxiedResource] Unable to call https:domain:9000/api/system/metrics/multiple on node <>: timeout
2020-07-15T20:32:54.025Z WARN  [ProxiedResource] Unable to call https:domain:9000/api/system on node <>: timeout
```
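
From what I can tell, those ProxiedResource warnings mean the node is calling its own REST API over HTTP to gather system/metrics information for the UI, and those calls are hitting the inter-node HTTP client’s timeout. The relevant settings live in server.conf; the values below are my understanding of the defaults, shown only as a sketch rather than pulled from my actual config:

```
# Inter-node / proxied API client settings in server.conf
# (defaults as I understand them, for illustration only)
http_connect_timeout = 5s
http_read_timeout = 10s
http_write_timeout = 10s

# Thread pool used for proxied requests like /system and /system/metrics/multiple
proxied_requests_thread_pool_size = 32
```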

The problem is, the instance is practically asleep; it’s idling along at 3-5% CPU usage. Any help here is greatly appreciated. Please let me know what more I can provide, or whether I need to up the verbosity of any logging. I apologize for the format of the URIs; it seems the forum won’t let me post more than two “links.”

How much RAM do you have configured for the Java heap?

If you do not know, it is very likely that you only have the default settings, and you should raise the heap.
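
On an RPM-based install such as CentOS, the heap is usually set via GRAYLOG_SERVER_JAVA_OPTS in /etc/sysconfig/graylog-server; something along these lines (the -Xms/-Xmx values here are only an example, and the remaining flags vary by version):

```
# /etc/sysconfig/graylog-server (example values)
GRAYLOG_SERVER_JAVA_OPTS="-Xms4g -Xmx4g -XX:NewRatio=1 -server -XX:-OmitStackTraceInFastThrow"
```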

Hi Jan, the heap size is set to 28 gig and has been from the outset.

It seems like this is an issue for others as well? I would love to help debug this before going full production on GL3; if not, I’ll probably revert to GL2. Please let me know what else I can do as far as debugging goes. I feel like there is a bug somewhere.

Hey @finite,

Setting a 28 GB heap for Graylog on a 32 GB system is way too much. I recommend no more than 12 GB of heap, unless you make heavy use of CSV or DSV lookup tables.

Hi @jan, I’ve decreased the heap to 12 GB, but the issue persists.

Just stumbled on this thread via https://github.com/Graylog2/graylog2-server/issues/8610. Is this the same issue I am describing in “API timeout when querying list of metrics”?

Hi @fadenb, indeed it looks like you and I might be in a very similar situation, which to me means there has to be a bug somewhere. I only have one node running GL, but it has plenty of idle CPU that the process could use.


Still seeing the same behavior in 3.3.3.

Is there anything else I can try? Upping the verbosity of the logging?
