Best way to debug and compare graylog nodes - one not processing messages

Description of your problem

I have 2 Graylog nodes, and in some scenarios one of them processes messages normally while the other one does not. On the stuck node the read column always shows 0:

957 345 unprocessed messages in 1 segment. 1,237 messages appended, 0 messages read in the last second

My question is: what is the best way to debug or compare them? For example, in the GUI under Nodes there is a Metrics option with hundreds of entries, and a Get thread dump option that produces hundreds of lines of output.
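One practical way to compare the two nodes without clicking through hundreds of GUI metrics is to pull the metrics registry from each node over the REST API and diff just the few values you care about. This is only a sketch: the base URL layout, the auth scheme, and the metric names below are assumptions that you would verify against your own node's `/api/system/metrics` output (the response shape also varies between Graylog versions).

```python
# Sketch: compare a few throughput metrics between two Graylog nodes.
# Assumptions to verify against your setup: the REST API is reachable at
# http://<node>:9000/api, Basic auth works for your user, and the metric
# names listed here exist in your version's /system/metrics output.
import json
import urllib.request
from base64 import b64encode

METRICS = [
    "org.graylog2.throughput.input.1-sec-rate",
    "org.graylog2.throughput.output.1-sec-rate",
    "org.graylog2.journal.entries-uncommitted",
]

def fetch_metrics(base_url, user, password):
    """Fetch the metrics registry from one node (hypothetical URL layout)."""
    req = urllib.request.Request(base_url + "/system/metrics")
    token = b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def diff_metrics(node_a, node_b, names=METRICS):
    """Return {metric: (value_on_node_a, value_on_node_b)} for chosen metrics."""
    return {n: (node_a.get(n), node_b.get(n)) for n in names}
```

Running `diff_metrics(fetch_metrics(url_a, u, p), fetch_metrics(url_b, u, p))` side by side should make it obvious which counters diverge between the healthy and the stuck node.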

Environmental information

I compared the server.conf files and they are almost the same on both nodes. The main difference is that the problematic one is marked as master; in the Graylog GUI it also has the golden star that marks the master node.

There are 2 Graylog nodes on the latest 3.x version and 7 Elasticsearch nodes.

Operating system information

Oracle Linux


Check the Elasticsearch/Graylog/MongoDB log files to see if there is anything that pertains to your issue.

You might want to post service logs, configuration, and environment variables here. That would help troubleshoot your issue quicker.

Judging from your statement, it looks like your journal is filling up. I’m also assuming that the buffers on the node in question are at or close to 100%?
Couple questions:

  • Did you check your firewall?
  • Did you check permissions?
  • Have you checked that all services are functioning on this node?
  • Were there any updates on this node prior to this issue?

You could have posted that here. Make sure you remove any personal info you don’t want others to see.

I’m just assuming at this point, but if your process buffers are full, your output buffer is empty, and the journal is filling up, it might not be an Elasticsearch problem. If processing can’t keep up, it’s because you don’t have enough CPUs, your buffer configuration is off, or you have poorly performing message processing due to extractors or pipeline rules, or all of the above. Another thing you can do is check your Java heap size. By default it’s 1 GB; bump it up slowly, maybe to 2 GB. This is just a guess right now until you provide more information.
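For reference, on RPM-based installs (such as Oracle Linux) the heap is usually set via `GRAYLOG_SERVER_JAVA_OPTS`; the file path below is typical for package installs but may differ on your system, so treat it as an example:

```
# /etc/sysconfig/graylog-server -- restart graylog-server after editing.
# Keep whatever other JVM flags your install already sets here.
GRAYLOG_SERVER_JAVA_OPTS="-Xms2g -Xmx2g"
```

Setting `-Xms` equal to `-Xmx` avoids heap resizing pauses while the node is under load.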



99% of the time the cluster works just fine. Only in the rare case when one Elasticsearch node gets shut down or is no longer pingable does this strange error happen. One Graylog node works fine, while the other fills its process buffer to 100% and the journal queue starts filling. This lasts until the failed Elasticsearch host comes back, and only the system itself needs to be up: even with the Elasticsearch service still down, from that moment the processing on the stuck Graylog node starts working again. One hint I got over on the Elasticsearch forum was to decrease the value of the TCP retransmission timeout sysctl net.ipv4.tcp_retries2 (see “TCP retransmission timeout” in the Elasticsearch Guide [7.14]). The default is 15; I changed it to 5 on the whole cluster, but it didn’t help in this scenario. I will try to prepare some more details.
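For anyone following along, the sysctl change described above could be applied like this (the /etc/sysctl.d file name is just an example):

```
# apply at runtime on each node
sysctl -w net.ipv4.tcp_retries2=5

# persist across reboots, e.g. in /etc/sysctl.d/99-graylog-tcp.conf
net.ipv4.tcp_retries2 = 5
```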

This sounds like a wonky regex rule to me. What do you see when you get a processbuffer dump (NOT a thread dump) from that node?


Here is the process buffer dump from the node that is stuck:

"ProcessBufferProcessor #0": "source: IP1 | message: IP1 AgentDevice=WindowsDNSAgentLogFile=DNSDebug.txtPluginVersion= ID=1FF4Context=PACKETMessage=Internal packet identifier=00000205B31F1CD0UDP/TCP indicator=UDPSend/Receive indicator=RcvRemote IP=IP2Xid (hex)=fd27Query/Response=QOpcode=QFlags (hex)=0001Flags (char codes)=DResponseCode=NOERRORQuestion Type=PTRQuestion { level: 5 | gl2_remote_ip: IP2 | gl2_remote_port: 47736 | IP: IP2 | gl2_message_id: 01FDHX3QJTBBWWFH4Z8RX07631 | gl2_source_input: 59d38342d5888c6d23588d66 | domain: | gl2_source_node: dc160ee2-e515-4ba8-90ff-8654f1cdcdb2 | _id: 497eb493-01bd-11ec-abae-022128574717 | facility: user-level | timestamp: 2021-08-20T15:46:56.000+02:00 }",

  "ProcessBufferProcessor #4": "source: 192.IP1 | message: 192.IP1 AgentDevice=WindowsDNSAgentLogFile=DNSDebug.txtPluginVersion= ID=1E14Context=PACKETMessage=Internal packet identifier=00000205A00CDCC0UDP/TCP indicator=UDPSend/Receive indicator=RcvRemote IP=10.IP2Xid (hex)=23fbQuery/Response=QOpcode=QFlags (hex)=0001Flags (char codes)=DResponseCode=NOERRORQuestion Type=PTRQuestion { level: 5 | gl2_remote_ip: 10.IP3 | gl2_remote_port: 47736 | IP: 10.IP4 | gl2_message_id: 01FDHX3QJSJ0G1K6V43NATDYPK | gl2_source_input: 59d38342d5888c6d23588d66 | domain: | gl2_source_node: dc160ee2-e515-4ba8-90ff-8654f1cdcdb2 | _id: 497e8d88-01bd-11ec-abae-022128574717 | facility: user-level | timestamp: 2021-08-20T15:46:56.000+02:00 }"


  "ProcessBufferProcessor #8": "source: 192.IP1 | message: 192.IP1 AgentDevice=WindowsDNSAgentLogFile=DNSDebug.txtPluginVersion= ID=1FF4Context=PACKETMessage=Internal packet identifier=000002059B04A960UDP/TCP indicator=UDPSend/Receive indicator=RcvRemote IP=10.IP2Xid (hex)=2c30Query/Response=QOpcode=QFlags (hex)=0001Flags (char codes)=DResponseCode=NOERRORQuestion Type=AQuestion Name=hanahy2.DOMAIN { level: 5 | gl2_remote_ip: 10.IP3 | gl2_remote_port: 47736 | IP: 10.IP2 | gl2_message_id: 01FDHX3QJSC4QPXHCPD0WA4A08 | gl2_source_input: 59d38342d5888c6d23588d66 | domain: hanahy2.DOMAIN | gl2_source_node: dc160ee2-e515-4ba8-90ff-8654f1cdcdb2 | _id: 497eb495-01bd-11ec-abae-022128574717 | facility: user-level | timestamp: 2021-08-20T15:46:56.000+02:00 }"
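If a slow regex rule is the suspicion, one cheap offline check is to time each extractor pattern against one of the messages sitting in the process buffer dump above. This is a generic sketch; the pattern and sample message below are placeholders, not the actual extractors from this setup:

```python
# Sketch: time a suspect extractor regex against a sample message.
# The pattern and sample here are placeholders; substitute your own
# extractor patterns and a representative message from the buffer dump.
import re
import time

sample = "AgentDevice=WindowsDNS AgentLogFile=DNSDebug.txt ID=1FF4 Context=PACKET"

def time_regex(pattern, text, repeats=1000):
    """Return seconds spent applying `pattern` to `text` `repeats` times."""
    compiled = re.compile(pattern)
    start = time.perf_counter()
    for _ in range(repeats):
        compiled.search(text)
    return time.perf_counter() - start

elapsed = time_regex(r"ID=(\w+)", sample)
# Any pattern that is orders of magnitude slower than the others on a
# representative message (often due to catastrophic backtracking) is a
# candidate culprit for a stuck ProcessBufferProcessor.
```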

And some errors in the server.log file from the same node (it is the master in the cluster):

2021-08-20T15:51:30.843+02:00 ERROR [IndexFieldTypePollerPeriodical] Couldn't update field types for index set <OTH ezdrav/5af405179712e82fb5133b27>
org.graylog2.indexer.ElasticsearchException: Couldn't collect indices for alias oth_deflector
        at org.graylog2.indexer.cluster.jest.JestUtils.execute( ~[graylog.jar:?]
        at org.graylog2.indexer.cluster.jest.JestUtils.execute( ~[graylog.jar:?]
        at org.graylog2.indexer.indices.Indices.aliasTarget( ~[graylog.jar:?]
        at org.graylog2.indexer.MongoIndexSet.getActiveWriteIndex( ~[graylog.jar:?]
        at org.graylog2.indexer.fieldtypes.IndexFieldTypePollerPeriodical.lambda$schedule$4( ~[graylog.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:1.8.0_144]
        at java.util.concurrent.FutureTask.runAndReset(Unknown Source) [?:1.8.0_144]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source) [?:1.8.0_144]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_144]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_144]
Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to ELASTICIP6:9200 [/ELASTICIP6] failed: connect timed out
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect( ~[graylog.jar:?]
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect( ~[graylog.jar:?]
        at org.apache.http.impl.execchain.MainClientExec.establishRoute( ~[graylog.jar:?]
        at org.apache.http.impl.execchain.MainClientExec.execute( ~[graylog.jar:?]
        at org.apache.http.impl.execchain.ProtocolExec.execute( ~[graylog.jar:?]
        at org.apache.http.impl.execchain.RedirectExec.execute( ~[graylog.jar:?]
        at org.apache.http.impl.client.InternalHttpClient.doExecute( ~[graylog.jar:?]
        at org.apache.http.impl.client.CloseableHttpClient.execute( ~[graylog.jar:?]
        at org.apache.http.impl.client.CloseableHttpClient.execute( ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.executeRequest( ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.execute( ~[graylog.jar:?]
        at org.graylog2.indexer.cluster.jest.JestUtils.execute( ~[graylog.jar:?]
        ... 11 more
Caused by: java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_144]


I’m assuming you have Elasticsearch problems? If so, check all your Elasticsearch nodes’ log files for errors/warnings that pertain to this issue. Maybe post your configuration files and/or log files.

EDIT: As @aaronsachs suggested

Did you create any extractors? If so, can you post them?

Yes, there are many extractors, over 70. But the input where they are defined is running on both Graylog nodes, the one that is OK and the one that is not. Maybe the master status of the failed node is the reason. I will try to collect more data on the differences between the nodes. The server.conf is almost the same on both.

Whoa…70?!? That’s quite a few. Is there a reason for using that many extractors over using pipelines? Keep in mind that extractors are more computationally intensive than pipelines. That said, I don’t think that those are the issue. If you’re stuck with processing messages, something is preventing them from being processed. Do you have any outputs configured?
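To illustrate the extractor-to-pipeline suggestion: a regex extractor can usually be rewritten as a pipeline rule. The rule below is a hypothetical example (field name and pattern invented to match the DNS log format in the buffer dump, not taken from this setup):

```
// Hypothetical pipeline rule replacing a regex extractor: pull the
// DNS question type out of the raw message into its own field.
rule "extract dns question type"
when
  has_field("message")
then
  let m = regex("Question Type=(\\w+)", to_string($message.message));
  set_field("question_type", m["0"]);
end
```

Pipeline rules also let you guard expensive matching with a cheap `when` condition, so the regex only ever runs on messages it can apply to.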


Agreed, I think I have a maximum of 7 extractors.

It seems there was something missed or a configuration has changed.

org.apache.http.conn.ConnectTimeoutException: Connect to ELASTICIP6:9200 [/ELASTICIP6] failed: connect timed out

You may have either a permissions issue or a configuration issue.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.