Problem with Graylog cluster

Hi. Hopefully there is someone here who knows the deeper workings of Graylog and might be able to point me in the right direction.
I inherited a fairly large Graylog cluster. It consists of 10 hardware data nodes, each running 1 Graylog instance and 4 Elasticsearch instances. In addition, there are 3 master VMs which also run Elasticsearch and Graylog.
The problem started when we began replacing some of the old hardware nodes. We built an 11th node, fully with Ansible, identical to the other 10. When we add one additional Elasticsearch instance to the cluster, it works just fine for around 30-60 minutes, and then Graylog simply stops processing new logs: the output buffer fills up and the journal starts growing, on all 10 nodes simultaneously. When I stop/restart the additional ES instance, processing resumes immediately on all 10 nodes. This happens with or without any roles assigned to the ES instance. Otherwise the cluster seems to recognize the new node just fine, and on the ES side everything looks green and butterflies.
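For reference, this is roughly the kind of check I run to confirm that the extra node has actually joined and that the cluster is green. It is only a minimal Python sketch; the host name and the default HTTP port 9200 are placeholders for our setup:

# Minimal sketch: confirm the extra ES node joined and the cluster is green.
# "es-master-01" and port 9200 are placeholders for our actual hosts/ports.
import json
import urllib.request

ES = "http://es-master-01:9200"

def get(path):
    # Plain HTTP GET against the ES REST API with a 10 s timeout.
    with urllib.request.urlopen(ES + path, timeout=10) as resp:
        return resp.read().decode("utf-8")

health = json.loads(get("/_cluster/health"))
print("status:", health["status"], "nodes:", health["number_of_nodes"])

# List the nodes and their roles to confirm the 11th instance shows up.
print(get("/_cat/nodes?v&h=name,ip,node.role,master"))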
I have dug deep into the logs, even enabled full debug, and can't find anything of interest in either ES or GL, at either the "stoppage" time or the "restart" time.
The only difference right now between the 10 nodes and the new one is that the new one is in a different network segment, but both the old and the new segment have any<>any allowed in both directions.
GL: 3.3.0
ES: 6.8.7

Thanks in advance, if someone cares to think along.
Best!

:wave: Hmmm, this seems like a tough one. Given the symptoms, this has me leaning toward it being an ES issue. Are you getting anything in the logs from the ES nodes at that time?

Hi, thanks for the answer!

No, I don’t see anything of relevance in the ES logs at the time the stoppage occurs.

I actually do see some errors on the Graylog master instance which I somehow missed previously. They seem to match the times when the new ES node is running. Sadly, the error does not give me much to go on for further debugging.
This time, processing stopped at around the same moment the error appeared:

2020-11-30T19:19:55.720+02:00 ERROR [IndexerClusterCheckerThread] Uncaught exception in periodical
org.graylog2.indexer.ElasticsearchException: Unable to read Elasticsearch node information
at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:54) ~[graylog.jar:?]
at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:65) ~[graylog.jar:?]
at org.graylog2.indexer.cluster.Cluster.catNodes(Cluster.java:128) ~[graylog.jar:?]
at org.graylog2.indexer.cluster.Cluster.getFileDescriptorStats(Cluster.java:133) ~[graylog.jar:?]
at org.graylog2.periodical.IndexerClusterCheckerThread.checkOpenFiles(IndexerClusterCheckerThread.java:73) ~[graylog.jar:?]
at org.graylog2.periodical.IndexerClusterCheckerThread.doRun(IndexerClusterCheckerThread.java:62) ~[graylog.jar:?]
at org.graylog2.plugin.periodical.Periodical.run(Periodical.java:77) [graylog.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_121]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_121]
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_121]
at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_121]
at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_121]
at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[graylog.jar:?]
at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[graylog.jar:?]
at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282) ~[graylog.jar:?]
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[graylog.jar:?]
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[graylog.jar:?]
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[graylog.jar:?]
at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[graylog.jar:?]
at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) ~[graylog.jar:?]
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[graylog.jar:?]
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[graylog.jar:?]
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[graylog.jar:?]
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[graylog.jar:?]
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111) ~[graylog.jar:?]
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[graylog.jar:?]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[graylog.jar:?]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[graylog.jar:?]
at io.searchbox.client.http.JestHttpClient.executeRequest(JestHttpClient.java:151) ~[graylog.jar:?]
at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:77) ~[graylog.jar:?]
at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:49) ~[graylog.jar:?]
… 13 more
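
The failing call in that trace is Cluster.catNodes, which boils down to a GET against the ES _cat/nodes endpoint over HTTP. To see which node answers slowly, something like this sketch can time that same request against each ES HTTP endpoint; the host list and the 60 s read timeout are assumptions about our setup, not values taken from the Graylog config:

# Sketch: time the same _cat/nodes call that the IndexerClusterCheckerThread
# triggers via Jest, against each ES HTTP endpoint, to see which one is slow.
# Host names and the 60 s read timeout are assumptions about our setup.
import time
import urllib.request

ES_HOSTS = ["es-data-01", "es-data-02", "es-new-11"]  # placeholders

for host in ES_HOSTS:
    url = f"http://{host}:9200/_cat/nodes?v"
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=60) as resp:
            resp.read()
        print(f"{host}: ok in {time.monotonic() - start:.2f}s")
    except Exception as exc:  # e.g. socket.timeout, URLError
        print(f"{host}: failed after {time.monotonic() - start:.2f}s: {exc}")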

I also have this error spamming the master every 30 seconds or so. I'm not sure whether it is related, as it has been going on for a long time already:

ERROR [IndexerClusterCheckerThread] Error while trying to check Elasticsearch disk usage.Details: null

A little update.
I ended up moving the 11th node into the same network as the rest of the cluster. The problem seems to have disappeared: the 11th ES node is running and GL has been happily processing for 2+ hours already.
It might be some timeouts, although the networks are very closely connected. I'm not sure, though, whether it's on the GL or the ES side in this case. Is it expected that there are no error messages when timeouts occur?
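To test the timeout theory, I would measure raw TCP connect times from the Graylog nodes toward both segments, roughly like this sketch; the host names and ports (9200 HTTP, 9300 transport) are placeholders, and this only checks connection setup, not sustained throughput:

# Sketch: measure raw TCP connect latency toward ES endpoints in both
# network segments. Host names are placeholders for our setup.
import socket
import time

TARGETS = [("es-new-11", 9200), ("es-new-11", 9300), ("es-data-01", 9300)]

for host, port in TARGETS:
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=5):
            pass
        print(f"{host}:{port} connect in {(time.monotonic() - start) * 1000:.1f} ms")
    except OSError as exc:
        print(f"{host}:{port} failed: {exc}")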
