Best way to debug and compare graylog nodes - one not processing messages

Description of your problem

I have 2 Graylog nodes, and in some scenarios one of them processes messages normally while the other one does not. On the stuck node the read column always shows 0:

957 345 unprocessed messages in 1 segment. 1,237 messages appended, 0 messages read in the last second

My question is: what is the best way to debug or compare them? For example, in the GUI under Nodes there is a Metrics option with hundreds of entries, and a Get thread dump option that produces hundreds of lines of output.
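One practical way to compare the two nodes without clicking through hundreds of GUI metrics is to pull the metrics registry from each node over the REST API and diff just the few values you care about. This is only a sketch: the base URL layout, the auth scheme, and the metric names below are assumptions that you would verify against your own node's `/api/system/metrics` output (the response shape also varies between Graylog versions).

```python
# Sketch: compare a few throughput metrics between two Graylog nodes.
# Assumptions to verify against your setup: the REST API is reachable at
# http://<node>:9000/api, Basic auth works for your user, and the metric
# names listed here exist in your version's /system/metrics output.
import json
import urllib.request
from base64 import b64encode

METRICS = [
    "org.graylog2.throughput.input.1-sec-rate",
    "org.graylog2.throughput.output.1-sec-rate",
    "org.graylog2.journal.entries-uncommitted",
]

def fetch_metrics(base_url, user, password):
    """Fetch the metrics registry from one node (hypothetical URL layout)."""
    req = urllib.request.Request(base_url + "/system/metrics")
    token = b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def diff_metrics(node_a, node_b, names=METRICS):
    """Return {metric: (value_on_node_a, value_on_node_b)} for chosen metrics."""
    return {n: (node_a.get(n), node_b.get(n)) for n in names}
```

Running `diff_metrics(fetch_metrics(url_a, u, p), fetch_metrics(url_b, u, p))` side by side should make it obvious which counters diverge between the healthy and the stuck node.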

Environmental information

I compared the server.conf files and they are almost the same on both nodes. The main difference is that the problematic one is marked as master; in the Graylog GUI it also has the golden star that marks the master node.

There are 2 Graylog nodes on the latest 3.x version and 7 Elasticsearch nodes.

Operating system information

Oracle Linux


Check the Elasticsearch/Graylog/MongoDB log files to see if there is anything that pertains to your issue.

You might want to post service logs, configuration, and environment variables here. That would help troubleshoot your issue quicker.

Judging from your statement, it looks like your journal is filling up. I’m also assuming that the buffers on the node in question are at or close to 100%?
Couple questions:

  • Did you check your firewall?
  • Did you check permissions?
  • Have you checked that all services are functioning on this node?
  • Were there any updates on this node prior to this issue?

You could have posted that here. Make sure you remove any personal info you don’t want others to see.

I’m just assuming at this point, but if your process buffers are full, your output buffer is empty, and the journal is filling up, it might not be an Elasticsearch problem. If processing can’t keep up, it’s because you don’t have enough CPUs, your buffer configuration is off, or you have poorly performing message processing due to extractors or pipeline rules, or all of the above. Another thing you can do is check your Java heap size. By default it’s 1 GB; bump it up slowly, maybe to 2 GB. This is just a guess right now until you provide more information.
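For reference, on RPM-based installs (such as Oracle Linux) the heap is usually set via `GRAYLOG_SERVER_JAVA_OPTS`; the file path below is typical for package installs but may differ on your system, so treat it as an example:

```
# /etc/sysconfig/graylog-server -- restart graylog-server after editing.
# Keep whatever other JVM flags your install already sets here.
GRAYLOG_SERVER_JAVA_OPTS="-Xms2g -Xmx2g"
```

Setting `-Xms` equal to `-Xmx` avoids heap resizing pauses while the node is under load.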



99% of the time the cluster works just fine. Only in the rare case when one Elasticsearch node gets shut down or is no longer pingable does this strange error happen. One Graylog node works fine, while the other fills its process buffer to 100% and the journal queue starts filling. This lasts until the failed Elasticsearch host comes back, and only the system itself needs to be up: even with the Elasticsearch service still down, from that moment the processing on the stuck Graylog node starts working again. One hint I got over on the Elasticsearch forum was to decrease the value of the TCP retransmission timeout sysctl net.ipv4.tcp_retries2 (see “TCP retransmission timeout” in the Elasticsearch Guide [7.14]). The default is 15; I changed it to 5 on the whole cluster, but it didn’t help in this scenario. I will try to prepare some more details.
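For anyone following along, the sysctl change described above could be applied like this (the /etc/sysctl.d file name is just an example):

```
# apply at runtime on each node
sysctl -w net.ipv4.tcp_retries2=5

# persist across reboots, e.g. in /etc/sysctl.d/99-graylog-tcp.conf
net.ipv4.tcp_retries2 = 5
```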

This sounds like a wonky regex rule to me. What do you see when you get a processbuffer dump (NOT a thread dump) from that node?


Here is the process buffer dump from the node that is stuck:

"ProcessBufferProcessor #0": "source: IP1 | message: IP1 AgentDevice=WindowsDNSAgentLogFile=DNSDebug.txtPluginVersion= ID=1FF4Context=PACKETMessage=Internal packet identifier=00000205B31F1CD0UDP/TCP indicator=UDPSend/Receive indicator=RcvRemote IP=IP2Xid (hex)=fd27Query/Response=QOpcode=QFlags (hex)=0001Flags (char codes)=DResponseCode=NOERRORQuestion Type=PTRQuestion { level: 5 | gl2_remote_ip: IP2 | gl2_remote_port: 47736 | IP: IP2 | gl2_message_id: 01FDHX3QJTBBWWFH4Z8RX07631 | gl2_source_input: 59d38342d5888c6d23588d66 | domain: | gl2_source_node: dc160ee2-e515-4ba8-90ff-8654f1cdcdb2 | _id: 497eb493-01bd-11ec-abae-022128574717 | facility: user-level | timestamp: 2021-08-20T15:46:56.000+02:00 }",

  "ProcessBufferProcessor #4": "source: 192.IP1 | message: 192.IP1 AgentDevice=WindowsDNSAgentLogFile=DNSDebug.txtPluginVersion= ID=1E14Context=PACKETMessage=Internal packet identifier=00000205A00CDCC0UDP/TCP indicator=UDPSend/Receive indicator=RcvRemote IP=10.IP2Xid (hex)=23fbQuery/Response=QOpcode=QFlags (hex)=0001Flags (char codes)=DResponseCode=NOERRORQuestion Type=PTRQuestion { level: 5 | gl2_remote_ip: 10.IP3 | gl2_remote_port: 47736 | IP: 10.IP4 | gl2_message_id: 01FDHX3QJSJ0G1K6V43NATDYPK | gl2_source_input: 59d38342d5888c6d23588d66 | domain: | gl2_source_node: dc160ee2-e515-4ba8-90ff-8654f1cdcdb2 | _id: 497e8d88-01bd-11ec-abae-022128574717 | facility: user-level | timestamp: 2021-08-20T15:46:56.000+02:00 }"


  "ProcessBufferProcessor #8": "source: 192.IP1 | message: 192.IP1 AgentDevice=WindowsDNSAgentLogFile=DNSDebug.txtPluginVersion= ID=1FF4Context=PACKETMessage=Internal packet identifier=000002059B04A960UDP/TCP indicator=UDPSend/Receive indicator=RcvRemote IP=10.IP2Xid (hex)=2c30Query/Response=QOpcode=QFlags (hex)=0001Flags (char codes)=DResponseCode=NOERRORQuestion Type=AQuestion Name=hanahy2.DOMAIN { level: 5 | gl2_remote_ip: 10.IP3 | gl2_remote_port: 47736 | IP: 10.IP2 | gl2_message_id: 01FDHX3QJSC4QPXHCPD0WA4A08 | gl2_source_input: 59d38342d5888c6d23588d66 | domain: hanahy2.DOMAIN | gl2_source_node: dc160ee2-e515-4ba8-90ff-8654f1cdcdb2 | _id: 497eb495-01bd-11ec-abae-022128574717 | facility: user-level | timestamp: 2021-08-20T15:46:56.000+02:00 }"
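If a slow regex rule is the suspicion, one cheap offline check is to time each extractor pattern against one of the messages sitting in the process buffer dump above. This is a generic sketch; the pattern and sample message below are placeholders, not the actual extractors from this setup:

```python
# Sketch: time a suspect extractor regex against a sample message.
# The pattern and sample here are placeholders; substitute your own
# extractor patterns and a representative message from the buffer dump.
import re
import time

sample = "AgentDevice=WindowsDNS AgentLogFile=DNSDebug.txt ID=1FF4 Context=PACKET"

def time_regex(pattern, text, repeats=1000):
    """Return seconds spent applying `pattern` to `text` `repeats` times."""
    compiled = re.compile(pattern)
    start = time.perf_counter()
    for _ in range(repeats):
        compiled.search(text)
    return time.perf_counter() - start

elapsed = time_regex(r"ID=(\w+)", sample)
# Any pattern that is orders of magnitude slower than the others on a
# representative message (often due to catastrophic backtracking) is a
# candidate culprit for a stuck ProcessBufferProcessor.
```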

And some errors in the server.log file from the same node (it is the master in the cluster):

2021-08-20T15:51:30.843+02:00 ERROR [IndexFieldTypePollerPeriodical] Couldn't update field types for index set <OTH ezdrav/5af405179712e82fb5133b27>
org.graylog2.indexer.ElasticsearchException: Couldn't collect indices for alias oth_deflector
        at org.graylog2.indexer.cluster.jest.JestUtils.execute( ~[graylog.jar:?]
        at org.graylog2.indexer.cluster.jest.JestUtils.execute( ~[graylog.jar:?]
        at org.graylog2.indexer.indices.Indices.aliasTarget( ~[graylog.jar:?]
        at org.graylog2.indexer.MongoIndexSet.getActiveWriteIndex( ~[graylog.jar:?]
        at org.graylog2.indexer.fieldtypes.IndexFieldTypePollerPeriodical.lambda$schedule$4( ~[graylog.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:1.8.0_144]
        at java.util.concurrent.FutureTask.runAndReset(Unknown Source) [?:1.8.0_144]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source) [?:1.8.0_144]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_144]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_144]
Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to ELASTICIP6:9200 [/ELASTICIP6] failed: connect timed out
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect( ~[graylog.jar:?]
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect( ~[graylog.jar:?]
        at org.apache.http.impl.execchain.MainClientExec.establishRoute( ~[graylog.jar:?]
        at org.apache.http.impl.execchain.MainClientExec.execute( ~[graylog.jar:?]
        at org.apache.http.impl.execchain.ProtocolExec.execute( ~[graylog.jar:?]
        at org.apache.http.impl.execchain.RedirectExec.execute( ~[graylog.jar:?]
        at org.apache.http.impl.client.InternalHttpClient.doExecute( ~[graylog.jar:?]
        at org.apache.http.impl.client.CloseableHttpClient.execute( ~[graylog.jar:?]
        at org.apache.http.impl.client.CloseableHttpClient.execute( ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.executeRequest( ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.execute( ~[graylog.jar:?]
        at org.graylog2.indexer.cluster.jest.JestUtils.execute( ~[graylog.jar:?]
        ... 11 more
Caused by: java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_144]


I’m assuming you have Elasticsearch problems? If so, check all your Elasticsearch nodes’ log files for errors/warnings that pertain to this issue. Maybe post your configuration files and/or log files.

EDIT: As @aaronsachs suggested

Did you create any extractors? If so, can you post them?

Yes, there are many extractors, over 70. But the input where they are defined is running on both Graylog nodes, the one that is OK and the one that is not. Maybe the master status of the failed node is the reason. I will try to collect more data on the differences between the nodes. The server.conf is almost the same on both.

Whoa…70?!? That’s quite a few. Is there a reason for using that many extractors over using pipelines? Keep in mind that extractors are more computationally intensive than pipelines. That said, I don’t think that those are the issue. If you’re stuck with processing messages, something is preventing them from being processed. Do you have any outputs configured?
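To illustrate the extractor-to-pipeline suggestion: a regex extractor can usually be rewritten as a pipeline rule. The rule below is a hypothetical example (field name and pattern invented to match the DNS log format in the buffer dump, not taken from this setup):

```
// Hypothetical pipeline rule replacing a regex extractor: pull the
// DNS question type out of the raw message into its own field.
rule "extract dns question type"
when
  has_field("message")
then
  let m = regex("Question Type=(\\w+)", to_string($message.message));
  set_field("question_type", m["0"]);
end
```

Pipeline rules also let you guard expensive matching with a cheap `when` condition, so the regex only ever runs on messages it can apply to.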


Agreed, I think I have a maximum of 7 extractors.

It seems there was something missed or a configuration has changed.

org.apache.http.conn.ConnectTimeoutException: Connect to ELASTICIP6:9200 [/ELASTICIP6] failed: connect timed out

You may have either a permissions issue or a configuration issue.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.