API timeout when querying list of metrics

Hey,

I am having trouble with the metrics page of a cluster node.

When I open http://<graylog-master-IP>:9000/api/cluster/bcc504fe-f4f3-4fa6-b6d0-57d12691e5ab/metrics/namespace/org it takes a while and then I am presented with an error message:

Cannot read property 'length' of undefined

Chrome dev-tools shows that it is trying to load http://<graylog-master-IP>:9000/api/cluster/bcc504fe-f4f3-4fa6-b6d0-57d12691e5ab/metrics/namespace/org which most of the time fails after ~10 seconds with this error:

{"type":"ApiError","message":"timeout"}

Every time this happens, the following error is logged to the server log:

2020-07-17T18:30:53.527+02:00 ERROR [AnyExceptionClassMapper] Unhandled exception in REST resource
java.net.SocketTimeoutException: timeout
        at okio.Okio$4.newTimeoutException(Okio.java:232) ~[graylog.jar:?]
        at okio.AsyncTimeout.exit(AsyncTimeout.java:286) ~[graylog.jar:?]
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:241) ~[graylog.jar:?]
        at okio.RealBufferedSource.indexOf(RealBufferedSource.java:358) ~[graylog.jar:?]
        at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:230) ~[graylog.jar:?]
        at okhttp3.internal.http1.Http1ExchangeCodec.readHeaderLine(Http1ExchangeCodec.java:242) ~[graylog.jar:?]
        at okhttp3.internal.http1.Http1ExchangeCodec.readResponseHeaders(Http1ExchangeCodec.java:213) ~[graylog.jar:?]
        at okhttp3.internal.connection.Exchange.readResponseHeaders(Exchange.java:115) ~[graylog.jar:?]
        at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:94) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:43) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[graylog.jar:?]
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:94) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[graylog.jar:?]
        at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:88) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[graylog.jar:?]
        at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:61) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[graylog.jar:?]
        at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:229) ~[graylog.jar:?]
        at okhttp3.RealCall.execute(RealCall.java:81) ~[graylog.jar:?]
        at retrofit2.OkHttpCall.execute(OkHttpCall.java:190) ~[graylog.jar:?]
        at org.graylog2.rest.resources.cluster.ClusterNodeMetricsResource.byNamespace(ClusterNodeMetricsResource.java:122) ~[graylog.jar:?]
        at sun.reflect.GeneratedMethodAccessor203.invoke(Unknown Source) ~[?:?]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
        at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:205) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102) ~[graylog.jar:?]
        at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:315) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:297) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:267) [graylog.jar:?]
        at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317) [graylog.jar:?]
        at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305) [graylog.jar:?]
        at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154) [graylog.jar:?]
        at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:384) [graylog.jar:?]
        at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:224) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181) [graylog.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
Caused by: java.net.SocketException: Socket closed
        at java.net.SocketInputStream.read(SocketInputStream.java:204) ~[?:1.8.0_252]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_252]
        at okio.Okio$2.read(Okio.java:140) ~[graylog.jar:?]
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:237) ~[graylog.jar:?]
        ... 51 more

So it seems like one Graylog node is trying to reach the other and the request times out.
When it does work (from time to time), the response usually arrives after ~9 seconds.
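
For reference, the per-node equivalent of that route should be something like

http://<2nd-GL-node-IP>:9000/api/system/metrics/namespace/org

assuming the usual /system/metrics/namespace/<namespace> path (I have not double-checked it). If that URL answers quickly while the /cluster/<node-id>/... route times out, the delay is in the node-to-node hop rather than in collecting the metrics themselves.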

Now to my questions:

  • Is there a way to increase the timeout? (quick fix to allow us to browse the available metrics more easily — see the config sketch right after this list)
  • Are there any common pitfalls that might explain the slow response? (when it works, the resulting document is ~34 kB, so there are quite a lot of metrics)
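
Regarding the timeout question: the only obviously related options I could find are the outgoing HTTP connection timeouts in server.conf. I have not tested whether they apply to the proxied /cluster/... calls, so treat this as a sketch, but their documented defaults (if I read the sample config right) would line up with the ~10 second failures:

### Outgoing HTTP connection timeouts (documented defaults; assuming they also cover node-to-node API calls)
http_connect_timeout = 5s
http_read_timeout = 10s
http_write_timeout = 10s

Bumping http_read_timeout would at least tell us whether the other node answers eventually or not at all.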

Thanks!

Obligatory info:

2 GL nodes (same specs) + 1 Elastic node:
GL Memory: -Xms2g -Xmx2g
MongoDB running as replicaSet on both GL nodes
Graylog: 3.3.2+ade4779
Java: Oracle Corporation 1.8.0_252
Linux: Linux 3.10.0-1127.el7.x86_64

Hey @fadenb,

What are the http_* parameters of both of your Graylog nodes?

Hey @jan,

thanks for taking the time to look into this!

See below for an excerpt from the config.
The 2nd Graylog node is configured almost identically except that it publishes its own IP/port.

http_bind_address = 0.0.0.0:9000

### This is the IP of the Graylog node. Each node publishes its own IP.
http_publish_uri = http://<redacted>.48:9000/

### Leftover from some experimentation. Was not set before (apparently makes no difference for this issue)
http_thread_pool_size = 32

### Outbound (internet) traffic is only possible via the proxy
http_proxy_uri = http://<redacted>:<redacted>@proxy.<redacted>.local:8080

### <redacted>.48  <- This node
### <redacted>.6   <- 2nd GL node
### <redacted>.204 <- Elasticsearch server
http_non_proxy_hosts = 127.0.0.1,<redacted>.48,<redacted>.6,<redacted>.204

I believe the endpoints are configured correctly, as I can see the communication happening.
It is just that ~95% of the time the request for the metrics results in a timeout.

If this happens only from time to time, it is very likely that you do not have enough cores available for the work at those moments … add more cores and it should work.

Unfortunately it happens all the time (it only works as expected in rare cases).

The systems currently each have 8 cores and a load average between 1 and 1.5. Besides some rare log-volume peaks they each only process 1-2k messages/s, so they are basically idle (we tested the setup and know they can handle about ten times that while using roughly 80% of the available CPU).

What metrics can I check to identify the bottleneck?
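
For what it is worth, one setting I came across that might be related (pure assumption on my part, taken from the server.conf sample and not verified) is the thread pool Graylog uses for API requests that are proxied to other nodes:

### Thread pool for proxied cluster API requests (name and default copied from the sample config; not verified)
proxied_requests_thread_pool_size = 32

If that pool were exhausted on the answering node, proxied calls could sit in a queue until the read timeout hits.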
