1. Describe your incident:
2025-03-07 10:57:17,856 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Failed to call API on node <4cafb399-387b-4b90-9116-2f7deac188b2>, cause: timeout (duration: 1003 ms)
2025-03-07 10:57:19,865 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Failed to call API on node <4cafb399-387b-4b90-9116-2f7deac188b2>, cause: timeout (duration: 1002 ms)
2025-03-07 10:57:20,479 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Failed to call API on node <4cafb399-387b-4b90-9116-2f7deac188b2>, cause: timeout (duration: 5001 ms)
I have set up Graylog in HA mode according to guides at:
Graylog-Cluster-Docker-Swarm/README.md at main · s0p4L1n3/Graylog-Cluster-Docker-Swarm
I made only minor tweaks to the setup so that it fits my current server architecture.
The config files for all 3 Graylog instances are the default config files and the only config changes come with env variables set in the docker stack yaml file, as per: Graylog-Cluster-Docker-Swarm/docker-stack-with-Traefik.yml at main · s0p4L1n3/Graylog-Cluster-Docker-Swarm
When deploying the cluster, Graylog nodes connect RANDOMLY. That is, after deploying the stack sometimes only 1 node is connected, sometimes 2, sometimes all 3. On the /system/nodes page, the unavailable node(s) appear, but with the message "System information is currently unavailable.", and when I try to open the node info page, I get these errors:
1)
Could not get plugins
Getting plugins on node "4cafb399-387b-4b90-9116-2f7deac188b2" failed: FetchError: There was an error fetching a resource: . Additional information: timeout
2)
Could not get JVM information
Getting JVM information for node "4cafb399-387b-4b90-9116-2f7deac188b2" failed: FetchError: There was an error fetching a resource: . Additional information: timeout
2. Describe your environment:
- OS Information:
All 3 VMs are set up exactly the same way.
NAME="AlmaLinux"
VERSION="9.5 (Teal Serval)"
- Package Version:
Docker v. 27.4.1
GlusterFS v. 11.1
Keepalived v. 2.2.8
Traefik v. 3.3.2 (image traefik:3.3.2)
MongoDB v. 7.0.14 (image mongo:7.0.14)
OpenSearch v. 2.15.0 (image opensearchproject/opensearch:2.15.0)
Graylog v. 6.1.5 (image graylog/graylog:6.1.5)
- Service logs, configurations, and environment variables:
The environment variables for the services are set as per the above-mentioned GitHub guide.
In addition, GRAYLOG_IS_MASTER is changed to GRAYLOG_IS_LEADER so that the setting is read and applied correctly.
http_bind_address = 0.0.0.0:9000 (on all three nodes)
When I open the node page at /system/nodes/, I additionally get the following in the Graylog logs:
2025-03-07 10:57:21,836 ERROR: org.graylog2.shared.rest.exceptionmappers.AnyExceptionClassMapper - Unhandled exception in REST resource
java.io.InterruptedIOException: timeout
at okhttp3.internal.connection.RealCall.timeoutExit(RealCall.kt:398) ~[graylog.jar:?]
at okhttp3.internal.connection.RealCall.callDone(RealCall.kt:360) ~[graylog.jar:?]
at okhttp3.internal.connection.RealCall.noMoreExchanges$okhttp(RealCall.kt:325) ~[graylog.jar:?]
at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:209) ~[graylog.jar:?]
at okhttp3.internal.connection.RealCall.execute(RealCall.kt:154) ~[graylog.jar:?]
at retrofit2.OkHttpCall.execute(OkHttpCall.java:207) ~[graylog.jar:?]
at org.graylog2.rest.resources.cluster.ClusterSystemResource.jvm(ClusterSystemResource.java:92) ~[graylog.jar:?]
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
at java.base/java.lang.reflect.Method.invoke(Unknown Source) ~[?:?]
at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:146) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:189) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:93) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81) ~[graylog.jar:?]
at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:274) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors.process(Errors.java:292) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors.process(Errors.java:274) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors.process(Errors.java:244) [graylog.jar:?]
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:266) [graylog.jar:?]
at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:253) [graylog.jar:?]
at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:696) [graylog.jar:?]
at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:367) [graylog.jar:?]
at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:190) [graylog.jar:?]
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:259) [graylog.jar:?]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
at java.base/java.lang.Thread.run(Unknown Source) [?:?]
Caused by: java.net.SocketException: Socket closed
at java.base/sun.nio.ch.NioSocketImpl.endConnect(Unknown Source) ~[?:?]
at java.base/sun.nio.ch.NioSocketImpl.connect(Unknown Source) ~[?:?]
at java.base/java.net.SocksSocketImpl.connect(Unknown Source) ~[?:?]
at java.base/java.net.Socket.connect(Unknown Source) ~[?:?]
at okhttp3.internal.platform.Platform.connectSocket(Platform.kt:128) ~[graylog.jar:?]
at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:295) ~[graylog.jar:?]
at okhttp3.internal.connection.RealConnection.connect(RealConnection.kt:207) ~[graylog.jar:?]
at okhttp3.internal.connection.ExchangeFinder.findConnection(ExchangeFinder.kt:226) ~[graylog.jar:?]
at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.kt:106) ~[graylog.jar:?]
at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:74) ~[graylog.jar:?]
at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:255) ~[graylog.jar:?]
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109) ~[graylog.jar:?]
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109) ~[graylog.jar:?]
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109) ~[graylog.jar:?]
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:76) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109) ~[graylog.jar:?]
at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:75) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109) ~[graylog.jar:?]
at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201) ~[graylog.jar:?]
... 30 more
2025-03-07 10:57:21,847 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Failed to call API on node <4cafb399-387b-4b90-9116-2f7deac188b2>, cause: timeout (duration: 5008 ms)
2025-03-07 10:57:21,861 ERROR: org.graylog2.shared.rest.exceptionmappers.AnyExceptionClassMapper - Unhandled exception in REST resource
java.io.InterruptedIOException: timeout
at okhttp3.internal.connection.RealCall.timeoutExit(RealCall.kt:398) ~[graylog.jar:?]
at okhttp3.internal.connection.RealCall.callDone(RealCall.kt:360) ~[graylog.jar:?]
at okhttp3.internal.connection.RealCall.noMoreExchanges$okhttp(RealCall.kt:325) ~[graylog.jar:?]
at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:209) ~[graylog.jar:?]
at okhttp3.internal.connection.RealCall.execute(RealCall.kt:154) ~[graylog.jar:?]
at retrofit2.OkHttpCall.execute(OkHttpCall.java:207) ~[graylog.jar:?]
at org.graylog2.rest.resources.cluster.ClusterSystemPluginResource.list(ClusterSystemPluginResource.java:77) ~[graylog.jar:?]
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
at java.base/java.lang.reflect.Method.invoke(Unknown Source) ~[?:?]
at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:146) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:189) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:93) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400) ~[graylog.jar:?]
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81) ~[graylog.jar:?]
at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:274) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors.process(Errors.java:292) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors.process(Errors.java:274) [graylog.jar:?]
at org.glassfish.jersey.internal.Errors.process(Errors.java:244) [graylog.jar:?]
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:266) [graylog.jar:?]
at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:253) [graylog.jar:?]
at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:696) [graylog.jar:?]
at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:367) [graylog.jar:?]
at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:190) [graylog.jar:?]
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:259) [graylog.jar:?]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
at java.base/java.lang.Thread.run(Unknown Source) [?:?]
Caused by: java.net.SocketException: Socket closed
at java.base/sun.nio.ch.NioSocketImpl.endConnect(Unknown Source) ~[?:?]
at java.base/sun.nio.ch.NioSocketImpl.connect(Unknown Source) ~[?:?]
at java.base/java.net.SocksSocketImpl.connect(Unknown Source) ~[?:?]
at java.base/java.net.Socket.connect(Unknown Source) ~[?:?]
at okhttp3.internal.platform.Platform.connectSocket(Platform.kt:128) ~[graylog.jar:?]
at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.kt:295) ~[graylog.jar:?]
at okhttp3.internal.connection.RealConnection.connect(RealConnection.kt:207) ~[graylog.jar:?]
at okhttp3.internal.connection.ExchangeFinder.findConnection(ExchangeFinder.kt:226) ~[graylog.jar:?]
at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.kt:106) ~[graylog.jar:?]
at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:74) ~[graylog.jar:?]
at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:255) ~[graylog.jar:?]
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109) ~[graylog.jar:?]
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109) ~[graylog.jar:?]
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109) ~[graylog.jar:?]
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:76) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109) ~[graylog.jar:?]
at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:75) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109) ~[graylog.jar:?]
at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201) ~[graylog.jar:?]
... 30 more
3. What steps have you already taken to try and solve the problem?
I redeployed the whole stack multiple times - each time a different combination of working/non-working nodes comes up.
I tried opening all ports on all nodes - it did not help.
I restarted individual docker containers.
I restarted and recreated individual docker stack services.
I verified all services are up, healthy and can communicate with each other (graylog, mongo, elasticsearch).
I manually started a mongodb replicaset from inside mongodb service.
I CAN reach the problematic node's /api from inside the leader container and receive a proper response with
curl graylog03:9000/api
{"cluster_id":"3cf39bee-d21a-4854-b66c-396f4baf4525","node_id":"4cafb399-387b-4b90-9116-2f7deac188b2","version":"6.1.5+e3ae3ce","tagline":"Manage your logs in the dark and have lasers going and make it look like you're from space!"}
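The same check can be scripted so that every node is probed in one go. The following is only a minimal sketch, not part of the guide: the hostnames graylog01 and graylog02 are assumptions (only graylog03 is confirmed above), and it has to run somewhere with Python 3 that is attached to the same Docker network as the Graylog services. The 5-second default roughly mirrors the longer timeout seen in the warnings.

```python
import json
import urllib.request

# Assumed hostnames: only graylog03 is confirmed in this post.
NODES = ["graylog01", "graylog02", "graylog03"]


def parse_api_root(body: str) -> tuple[str, str]:
    """Return (node_id, version) from the JSON document served at /api."""
    data = json.loads(body)
    return data["node_id"], data["version"]


def check_node(host: str, port: int = 9000, timeout: float = 5.0) -> str:
    """Fetch http://host:port/api and describe the outcome in one line."""
    url = f"http://{host}:{port}/api"
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            node_id, version = parse_api_root(resp.read().decode("utf-8"))
        return f"{host}: ok, node_id={node_id}, version={version}"
    except (OSError, ValueError, KeyError) as exc:
        # URLError and socket timeouts are OSError subclasses;
        # malformed JSON raises ValueError, a missing field raises KeyError.
        return f"{host}: FAILED ({exc})"


def main() -> None:
    """Probe every node in NODES and print one result line per node."""
    for node in NODES:
        print(check_node(node))
```

Calling `main()` from inside each of the three Graylog hosts' networks in turn would show whether the failure depends on which node is doing the calling.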
Everything seems to work, but still, the Graylog cluster can't see the last node (graylog03 in my case) and I'm still getting the error:
2025-03-07 11:56:32,096 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Failed to call API on node <4cafb399-387b-4b90-9116-2f7deac188b2>, cause: timeout (duration: 5003 ms)
2025-03-07 11:57:27,948 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Failed to call API on node <4cafb399-387b-4b90-9116-2f7deac188b2>, cause: timeout (duration: 1002 ms)
2025-03-07 11:57:32,176 WARN : org.graylog2.shared.rest.resources.ProxiedResource - Failed to call API on node <4cafb399-387b-4b90-9116-2f7deac188b2>, cause: timeout (duration: 5001 ms)
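To make it easier to see which nodes are affected after each redeploy, the warnings can be tallied per node ID. This is a rough sketch that assumes the log lines look exactly like the ProxiedResource excerpts in this post; the function name is made up for illustration. In the excerpts, the durations sit just above 1000 ms and 5000 ms, which suggests the calls run into fixed client-side timeouts rather than being refused outright.

```python
import re
from collections import defaultdict

# Matches the ProxiedResource warnings in the format shown in this post.
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) WARN\s*: "
    r"org\.graylog2\.shared\.rest\.resources\.ProxiedResource - "
    r"Failed to call API on node <(?P<node>[0-9a-f-]+)>, "
    r"cause: (?P<cause>.+?) \(duration: (?P<ms>\d+) ms\)"
)


def summarize_timeouts(lines):
    """Group failed proxied API calls by node ID; values are durations in ms."""
    per_node = defaultdict(list)
    for line in lines:
        match = TIMEOUT_RE.match(line.strip())
        if match:
            per_node[match["node"]].append(int(match["ms"]))
    return dict(per_node)
```

Feeding it the output of `docker service logs` for each Graylog service (one line per list element) gives a per-node list of timeout durations that can be compared between deployments.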
4. How can the community help?
Please help me find the root cause of why this is happening. Everything seems to be working, all services are healthy and communicating, but each time I redeploy the whole stack, I get a different combination of working/non-working nodes in the Graylog cluster.