I have 2 Graylog nodes, and in some scenarios one of them processes messages normally while the other does not. Its node overview constantly shows 0 in the read column:
`957 345 unprocessed messages in 1 segment. 1,237 messages appended, 0 messages read in the last second`
My question is: what is the best way to debug or compare them? For example, in the GUI under System / Nodes there is a Metrics page with hundreds of options, and a "Get thread dump" option that returns hundreds of lines of output.
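One way to compare the two nodes without clicking through hundreds of metrics is to query the REST API for the journal state on each node. This is only a sketch: the node addresses and credentials below are placeholders, and the endpoint path should be double-checked against your own API browser.

```shell
#!/bin/sh
# Compare journal state on two Graylog nodes via the REST API.
# NODE1, NODE2 and AUTH are placeholders - set them for your environment.
NODE1="${NODE1:-graylog1.example.com}"
NODE2="${NODE2:-graylog2.example.com}"
AUTH="${AUTH:-admin:password}"

for node in "$NODE1" "$NODE2"; do
    echo "== $node =="
    # /api/system/journal reports uncommitted entries and append/read rates;
    # on a healthy node the journal drains, i.e. entries trend towards 0.
    curl -s --max-time 5 -u "$AUTH" "http://$node:9000/api/system/journal" \
        || echo "(request to $node failed)"
    echo
done
```

Running this against both nodes side by side should make the difference obvious: the healthy node shows a non-zero read rate, the stuck one shows appends only.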
I compared server.conf on both nodes and they are almost identical. The main difference is that the problematic one is marked as master; in the Graylog GUI it also has the golden star that marks the master node.
There are 2 Graylog nodes on the latest 3.x version and 7 Elasticsearch nodes.
You might want to post service logs, configuration, and environment variables here. This would help troubleshoot your issue quicker.
Judging from your statement, it looks like your journal is filling up. I'm also assuming that the buffers on the node in question are at 100% or close?
Did you check your firewall?
Did you check permissions?
Have you checked that all services are functioning on this node?
Were there any updates on this node prior to this issue?
You could have posted that here. Make sure you remove any personal info you don't want others to see.
I'm just assuming at this point, but if your process buffer is full while your output buffer is empty and the journal is filling up, it might not be an Elasticsearch problem. If processing can't keep up, it's because you don't have enough CPUs, your buffer configuration is off, or you have poorly performing message processing due to extractors or pipeline rules, or some combination of those. Another thing you can check is your Java heap size. By default it's 1 GB; bump it up slowly, maybe to 2 GB. This is just a guess until you post more information.
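For reference, on package installs the heap is set via the JVM options in the server's defaults file; a sketch of raising it from the 1 GB default to 2 GB (paths and any additional flags may differ on your install, so keep whatever other options your file already has):

```shell
# /etc/default/graylog-server (Debian/Ubuntu; /etc/sysconfig/graylog-server on RHEL)
# Only -Xms/-Xmx change here; keep the other JVM flags your file already has.
GRAYLOG_SERVER_JAVA_OPTS="-Xms2g -Xmx2g"

# Then restart the service:
#   sudo systemctl restart graylog-server
```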
99% of the time the cluster works just fine. Only in the rare case when one Elasticsearch node gets shut down or is no longer pingable does this strange behavior happen. One Graylog node keeps working fine; the other fills its process buffer to 100% and its journal queue starts filling. This lasts until the failed Elasticsearch host comes back up (just the operating system is enough; Elasticsearch itself can still be down). From that moment, processing on the stuck Graylog node starts working again. One hint I got over on the Elasticsearch forum was to decrease the TCP retransmission timeout, sysctl net.ipv4.tcp_retries2. The default is 15; I changed it to 5 on the whole cluster, but it didn't help in this scenario.
See the "TCP retransmission timeout" page in the Elasticsearch Guide [7.14]. I will try to prepare some more details.
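For completeness, the change described above can be made persistent like this (the drop-in file name is arbitrary):

```shell
# Lower the TCP retransmission limit cluster-wide (kernel default is 15).
# Persist it in a sysctl drop-in, then apply it without a reboot:
echo 'net.ipv4.tcp_retries2 = 5' | sudo tee /etc/sysctl.d/90-tcp-retries.conf
sudo sysctl -p /etc/sysctl.d/90-tcp-retries.conf

# Verify the running value:
sysctl net.ipv4.tcp_retries2
```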
And some errors from the server.log file on the same node (it is the master in the cluster):
2021-08-20T15:51:30.843+02:00 ERROR [IndexFieldTypePollerPeriodical] Couldn't update field types for index set <OTH ezdrav/5af405179712e82fb5133b27>
org.graylog2.indexer.ElasticsearchException: Couldn't collect indices for alias oth_deflector
at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:54) ~[graylog.jar:?]
at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:65) ~[graylog.jar:?]
at org.graylog2.indexer.indices.Indices.aliasTarget(Indices.java:336) ~[graylog.jar:?]
at org.graylog2.indexer.MongoIndexSet.getActiveWriteIndex(MongoIndexSet.java:204) ~[graylog.jar:?]
at org.graylog2.indexer.fieldtypes.IndexFieldTypePollerPeriodical.lambda$schedule$4(IndexFieldTypePollerPeriodical.java:201) ~[graylog.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:1.8.0_144]
at java.util.concurrent.FutureTask.runAndReset(Unknown Source) [?:1.8.0_144]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source) [?:1.8.0_144]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_144]
at java.lang.Thread.run(Unknown Source) [?:1.8.0_144]
Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to ELASTICIP6:9200 [/ELASTICIP6] failed: connect timed out
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151) ~[graylog.jar:?]
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373) ~[graylog.jar:?]
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381) ~[graylog.jar:?]
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237) ~[graylog.jar:?]
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[graylog.jar:?]
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111) ~[graylog.jar:?]
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[graylog.jar:?]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[graylog.jar:?]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[graylog.jar:?]
at io.searchbox.client.http.JestHttpClient.executeRequest(JestHttpClient.java:151) ~[graylog.jar:?]
at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:77) ~[graylog.jar:?]
at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:49) ~[graylog.jar:?]
... 11 more
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_144]
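The root cause in the trace is a plain connect timeout to one Elasticsearch node. A quick way to reproduce that from the Graylog host is to probe the HTTP port with a short, bounded timeout (ELASTICIP6 below stands for the unreachable node's address, as in the log above):

```shell
#!/bin/sh
# Probe the Elasticsearch HTTP port from the Graylog host with a bounded
# timeout, so a dead node fails fast instead of hanging the check.
ES_HOST="${ES_HOST:-ELASTICIP6}"
if curl -s --connect-timeout 3 --max-time 5 "http://$ES_HOST:9200/_cluster/health?pretty"; then
    echo "$ES_HOST:9200 reachable"
else
    echo "$ES_HOST:9200 unreachable (connect timed out or refused)"
fi
```

If this hangs for much longer than the timeout on the Graylog host, the problem is at the network/TCP layer rather than inside Graylog.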
Yes, there are many extractors, over 70. But the input where they are defined is running on both Graylog nodes, the one that is OK and the one that is not. Maybe the master status of the failing node is the reason. I will try to collect more data on why the nodes behave differently. The server.conf is almost the same on both.
Whoa…70?!? That's quite a few. Is there a reason for using that many extractors over using pipelines? Keep in mind that extractors are more computationally intensive than pipelines. That said, I don't think those are the issue. If you're stuck with processing messages, something is preventing them from being processed. Do you have any outputs configured?
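For context, a single regex extractor can usually be expressed as a pipeline rule instead. A hypothetical sketch in the pipeline rule DSL (the rule name, target field, and pattern are made up for illustration):

```
rule "extract http status"
when
  has_field("message")
then
  // regex() returns the capture groups as a map keyed "0", "1", ...
  set_field("http_status",
            regex("\" (\\d{3}) ", to_string($message.message))["0"]);
end
```

Rules like this run in the pipeline processor and can be profiled per rule, which makes it easier to spot a slow pattern than with 70 extractors on one input.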