Graylog System API Timeouts

Hello all,

I’ve been running Graylog in production for a while now, but recently I’ve been experiencing intermittent, at times very severe, service degradation, and it seems to stem from the Graylog API making calls to the /system endpoint.


Specs:
Graylog Server 3.3.2 on CentOS 7.8.2003 @ 3.10.0-1127.13.1.el7.x86_64

Graylog is a single-node setup, with the “master” being an Azure Standard D8 v3 (8 vcpus, 32 GiB memory) that also runs the MongoDB (4.0.19) instance for configs. The same server runs Nginx on 443, proxying for the web/API on 9000. Both Nginx and the API/web are TLS’ed (a rough sketch of the proxy config follows the specs below).

ES (6.8) cluster of 5 nodes, each an Azure Standard D4s v3 (4 vcpus, 16 GiB memory) with 1 TB of disk space.
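
Since the proxying came up, here is roughly what the Nginx side looks like. This is only a minimal sketch with placeholder hostnames and certificate paths, not my exact config:

```
server {
    listen 443 ssl;
    server_name graylog.example.org;                    # placeholder

    ssl_certificate     /etc/nginx/ssl/graylog.crt;     # placeholder paths
    ssl_certificate_key /etc/nginx/ssl/graylog.key;

    location / {
        # Graylog's own listener on 9000 is also TLS'ed, hence https here
        proxy_pass https://127.0.0.1:9000/;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Graylog-Server-URL https://$server_name/;
    }
}
```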


Observations:

This instance started life on 3.1 and was upgraded to 3.3.2 the day it came out. It ran without issue for two days until the API calls to the system endpoint suddenly started to fail. The UI remained somewhat serviceable until it began to time out here and there as well.

We’re still getting logs into ES and even seeing them in Search. But when navigating to domain/system/nodes/, the server.log spews this:

```
2020-07-15T19:55:13.456Z ERROR [AnyExceptionClassMapper] Unhandled exception in REST resource
java.net.SocketTimeoutException: timeout
at okio.Okio$4.newTimeoutException(Okio.java:232) ~[graylog.jar:?]
at okio.AsyncTimeout.exit(AsyncTimeout.java:286) ~[graylog.jar:?]
at okio.AsyncTimeout$2.read(AsyncTimeout.java:241) ~[graylog.jar:?]
at okio.RealBufferedSource.indexOf(RealBufferedSource.java:358) ~[graylog.jar:?]
at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:230) ~[graylog.jar:?]
at okhttp3.internal.http1.Http1ExchangeCodec.readHeaderLine(Http1ExchangeCodec.java:242) ~[graylog.jar:?]
at okhttp3.internal.http1.Http1ExchangeCodec.readResponseHeaders(Http1ExchangeCodec.java:213) ~[graylog.jar:?]
at okhttp3.internal.connection.Exchange.readResponseHeaders(Exchange.java:115) ~[graylog.jar:?]
at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:94) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:43) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[graylog.jar:?]
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:94) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[graylog.jar:?]
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:88) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[graylog.jar:?]
at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:61) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[graylog.jar:?]
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:229) ~[graylog.jar:?]
at okhttp3.RealCall.execute(RealCall.java:81) ~[graylog.jar:?]
at retrofit2.OkHttpCall.execute(OkHttpCall.java:190) ~[graylog.jar:?]
at org.graylog2.rest.resources.cluster.ClusterSystemResource.jvm(ClusterSystemResource.java:91) ~[graylog.jar:?]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:205) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102) ~[graylog.jar:?]
at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors.process(Errors.java:315) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors.process(Errors.java:297) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors.process(Errors.java:267) [graylog.jar:?]
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317) [graylog.jar:?]
at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305) [graylog.jar:?]
at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154) [graylog.jar:?]
at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:384) [graylog.jar:?]
at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:224) [graylog.jar:?]
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181) [graylog.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
```

The rest of the log is filled with:

```
2020-07-15T20:14:32.024Z WARN  [ProxiedResource] Unable to call https:domain:9000/api/system/jobs on node <<redacted>>: timeout
2020-07-15T20:32:43.229Z WARN  [ProxiedResource] Unable to call https:domain:9000/api/system/metrics/multiple on node <>: timeout
2020-07-15T20:32:54.025Z WARN  [ProxiedResource] Unable to call https:domain:9000/api/system on node <>: timeout
```
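
From what I can tell, those ProxiedResource warnings mean the node is calling its own REST API over HTTP to gather system/metrics information for the UI, and those calls are hitting the inter-node HTTP client’s timeout. The relevant settings live in server.conf; the values below are my understanding of the defaults, shown only as a sketch rather than pulled from my actual config:

```
# Inter-node / proxied API client settings in server.conf
# (defaults as I understand them, for illustration only)
http_connect_timeout = 5s
http_read_timeout = 10s
http_write_timeout = 10s

# Thread pool used for proxied requests like /system and /system/metrics/multiple
proxied_requests_thread_pool_size = 32
```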

The problem is, the instance is practically asleep; it’s idling along at 3-5% CPU usage. Any help here is greatly appreciated. Please let me know what more I can provide, or whether I need to up the verbosity of any logging. I apologize for the format of the URIs; it seems the forum won’t let me post more than two “links.”

How much RAM do you have configured for the Java heap?

If you do not know, it is very likely that you only have the default settings, and you should raise the heap.
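
On an RPM-based install such as CentOS, the heap is usually set via GRAYLOG_SERVER_JAVA_OPTS in /etc/sysconfig/graylog-server; something along these lines (the -Xms/-Xmx values here are only an example, and the remaining flags vary by version):

```
# /etc/sysconfig/graylog-server (example values)
GRAYLOG_SERVER_JAVA_OPTS="-Xms4g -Xmx4g -XX:NewRatio=1 -server -XX:-OmitStackTraceInFastThrow"
```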

Hi Jan, the heap size is set to 28 gig and has been from the outset.

It seems like this is an issue for others as well? I would love to help debug this before going full production on GL3; if not, I’ll probably revert to GL2. Please let me know what else I can do as far as debugging goes. I feel like there is a bug somewhere.

Hey @finite,

Setting a 28 GB heap for Graylog on a 32 GB system is way too much. I recommend no more than 12 GB of heap, unless you make heavy use of CSV or DSV lookup tables.

Hi @jan, I’ve decreased the heap to 12 GB, but the issue persists.

Just stumbled on this thread via https://github.com/Graylog2/graylog2-server/issues/8610. Is this the same issue I am describing in “API timeout when querying list of metrics”?

Hi @fadenb, indeed it looks like you and I might be in a very similar situation, which to me means there has to be a bug somewhere. I only have one node running GL, but it has plenty of idle CPU that the process could use.


Still seeing the same behavior in 3.3.3.

Is there anything else I can try? Upping the verbosity of the logging?
