Problem with Graylog cluster

Hi. Hopefully there is someone here who knows the deeper workings of Graylog and might be able to point me in the right direction.
I inherited a fairly large Graylog cluster. It consists of 10 hardware data nodes, each running 1 Graylog instance and 4 Elasticsearch instances. In addition, there are 3 master VMs which also run Elasticsearch and Graylog.
The problem started when we began replacing some of the old hardware nodes. We built an 11th node, fully with Ansible, identical to the other 10. When we add one additional Elasticsearch instance to the cluster, it works just fine for around 30-60 minutes, and then Graylog simply stops processing new logs: the output buffer fills up and the journal starts growing, on all 10 nodes simultaneously. When I stop/restart the additional ES instance, processing resumes immediately on all 10 nodes. This happens with or without any roles assigned to the ES instance. Otherwise the cluster seems to recognize the new node just fine, and on the ES side everything looks green and butterflies.
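For reference, this is roughly the kind of check I run to confirm that the extra node has actually joined and that the cluster is green. It is only a minimal Python sketch; the host name and the default HTTP port 9200 are placeholders for our setup:

# Minimal sketch: confirm the extra ES node joined and the cluster is green.
# "es-master-01" and port 9200 are placeholders for our actual hosts/ports.
import json
import urllib.request

ES = "http://es-master-01:9200"

def get(path):
    # Plain HTTP GET against the ES REST API with a 10 s timeout.
    with urllib.request.urlopen(ES + path, timeout=10) as resp:
        return resp.read().decode("utf-8")

health = json.loads(get("/_cluster/health"))
print("status:", health["status"], "nodes:", health["number_of_nodes"])

# List the nodes and their roles to confirm the 11th instance shows up.
print(get("/_cat/nodes?v&h=name,ip,node.role,master"))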
I have dug deep into the logs, even enabled full debug, and can't find anything of interest in either ES or GL, at either the "stoppage" time or the "restart" time.
The only difference right now between the 10 nodes and the new one is that the new one is in a different network segment, but both the old and the new segment have any<>any allowed in both directions.
GL: 3.3.0
ES: 6.8.7

Thanks in advance, if someone cares to think along.
Best!

:wave: Hmmm, this seems like a tough one. Given the symptoms, this has me leaning toward it being an ES issue. Are you getting anything in the logs from the ES nodes at that time?

Hi, thanks for the answer!

No, I don’t see anything of relevance in the ES logs at the time the stoppage occurs.

I actually do see some errors on the Graylog master instance which I somehow missed previously. They seem to match the times when the new ES node is running. Sadly, the error does not give me much to go on for further debugging.
This time, processing stopped at around the same moment the error appeared:

2020-11-30T19:19:55.720+02:00 ERROR [IndexerClusterCheckerThread] Uncaught exception in periodical
org.graylog2.indexer.ElasticsearchException: Unable to read Elasticsearch node information
at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:54) ~[graylog.jar:?]
at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:65) ~[graylog.jar:?]
at org.graylog2.indexer.cluster.Cluster.catNodes(Cluster.java:128) ~[graylog.jar:?]
at org.graylog2.indexer.cluster.Cluster.getFileDescriptorStats(Cluster.java:133) ~[graylog.jar:?]
at org.graylog2.periodical.IndexerClusterCheckerThread.checkOpenFiles(IndexerClusterCheckerThread.java:73) ~[graylog.jar:?]
at org.graylog2.periodical.IndexerClusterCheckerThread.doRun(IndexerClusterCheckerThread.java:62) ~[graylog.jar:?]
at org.graylog2.plugin.periodical.Periodical.run(Periodical.java:77) [graylog.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_121]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_121]
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_121]
at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_121]
at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_121]
at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[graylog.jar:?]
at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[graylog.jar:?]
at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282) ~[graylog.jar:?]
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[graylog.jar:?]
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[graylog.jar:?]
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[graylog.jar:?]
at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[graylog.jar:?]
at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) ~[graylog.jar:?]
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[graylog.jar:?]
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[graylog.jar:?]
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[graylog.jar:?]
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[graylog.jar:?]
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111) ~[graylog.jar:?]
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[graylog.jar:?]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[graylog.jar:?]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[graylog.jar:?]
at io.searchbox.client.http.JestHttpClient.executeRequest(JestHttpClient.java:151) ~[graylog.jar:?]
at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:77) ~[graylog.jar:?]
at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:49) ~[graylog.jar:?]
… 13 more
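
The failing call in that trace is Cluster.catNodes, which boils down to a GET against the ES _cat/nodes endpoint over HTTP. To see which node answers slowly, something like this sketch can time that same request against each ES HTTP endpoint; the host list and the 60 s read timeout are assumptions about our setup, not values taken from the Graylog config:

# Sketch: time the same _cat/nodes call that the IndexerClusterCheckerThread
# triggers via Jest, against each ES HTTP endpoint, to see which one is slow.
# Host names and the 60 s read timeout are assumptions about our setup.
import time
import urllib.request

ES_HOSTS = ["es-data-01", "es-data-02", "es-new-11"]  # placeholders

for host in ES_HOSTS:
    url = f"http://{host}:9200/_cat/nodes?v"
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=60) as resp:
            resp.read()
        print(f"{host}: ok in {time.monotonic() - start:.2f}s")
    except Exception as exc:  # e.g. socket.timeout, URLError
        print(f"{host}: failed after {time.monotonic() - start:.2f}s: {exc}")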

I also have this error spamming the master every 30 seconds or so. I'm not sure whether it is related, as it has been going on for a long time already:

ERROR [IndexerClusterCheckerThread] Error while trying to check Elasticsearch disk usage.Details: null

A little update.
I ended up moving the 11th node into the same network as the rest of the cluster. The problem seems to have disappeared: the 11th ES node is running and GL has been happily processing for 2+ hours already.
It might be some timeouts, although the networks are very closely connected. I'm not sure, though, whether it's on the GL or the ES side in this case. Is it expected that there are no error messages when timeouts occur?
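To test the timeout theory, I would measure raw TCP connect times from the Graylog nodes toward both segments, roughly like this sketch; the host names and ports (9200 HTTP, 9300 transport) are placeholders, and this only checks connection setup, not sustained throughput:

# Sketch: measure raw TCP connect latency toward ES endpoints in both
# network segments. Host names are placeholders for our setup.
import socket
import time

TARGETS = [("es-new-11", 9200), ("es-new-11", 9300), ("es-data-01", 9300)]

for host, port in TARGETS:
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=5):
            pass
        print(f"{host}:{port} connect in {(time.monotonic() - start) * 1000:.1f} ms")
    except OSError as exc:
        print(f"{host}:{port} failed: {exc}")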
