Master/Leader node disappearing with high CPU on Java

Describe your incident:
We have a Graylog 4 cluster with four nodes running Graylog and MongoDB locally, connecting to AWS OpenSearch for Elasticsearch. Generally it’s been running well, but recently the master node can’t seem to stay online. It disappears from the node list. I can restart the Graylog service and it comes back, but only for a few minutes. If I change the master (leader) to a different node, that node then crashes and the original node performs properly. When this occurs, I’m seeing Graylog’s Java process using 400-600% CPU, which I assume is what causes the node to drop out of the cluster. But I can’t figure out why it’s pegging the CPU.

Describe your environment:

  • Node CPU/Memory:
    6 CPUs
    16 GB Memory

  • OS Information:
    CentOS 7.9

  • Package Version:
    Java 18.0.1 (8GB Heap Size)
    MongoDB 4.4.15-1.el7
    Graylog 4.3.5-1

  • Service logs, configurations, and environment variables:
    server.conf:

# Cluster settings
is_leader = true
node_id_file = /etc/graylog/server/node-id

# General
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
dashboard_widget_default_cache_time = 10s

# Access secrets
password_secret = <MASKED>
root_username = <MASKED>
root_email = <MASKED>
root_timezone = America/Los_Angeles
root_password_sha2 = <MASKED>

# Web interface
http_bind_address = <MASKED>:9000
http_publish_uri = https://<MASKED>:9000/
http_external_uri = https://<MASKED>/
http_enable_tls = true
http_tls_cert_file = /etc/graylog/server/graylog-certificate.pem
http_tls_key_file = /etc/graylog/server/graylog-key.pem

# Proxies

# Search options
allow_leading_wildcard_searches = false
allow_highlighting = false

# Elasticsearch
elasticsearch_hosts = https://<MASKED>.es.amazonaws.com
elasticsearch_max_total_connections = 200
elasticsearch_max_total_connections_per_route = 20
elasticsearch_max_retries = 2
elasticsearch_discovery_enabled = false
graylog_elasticsearch_version=7

elasticsearch_index_prefix = graylog
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30

# Processors
processbuffer_processors = 5
outputbuffer_processors = 3
async_eventbus_processors = 2
outputbuffer_processor_keep_alive_time = 5000
outputbuffer_processor_threads_core_pool_size = 3
outputbuffer_processor_threads_max_pool_size = 30
processor_wait_strategy = blocking
udp_recvbuffer_sizes = 1048576
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking

# Message journal
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_age = 12h
message_journal_max_size = 5gb
message_journal_flush_age = 1m
message_journal_flush_interval = 1000000
message_journal_segment_age = 1h
message_journal_segment_size = 100mb

# Timeouts
output_module_timeout = 10000
stale_leader_timeout = 30000
shutdown_timeout = 30000
ldap_connection_timeout = 2000
http_connect_timeout = 5s
http_read_timeout = 10s
http_write_timeout = 10s
stream_processing_timeout = 2000

# Ring buffers
ring_size = 65536

# Load balancing
lb_recognition_period_seconds = 3
lb_throttle_threshold_percentage = 95

# Stream processing
stream_processing_max_faults = 3

# Intervals

# MongoDB Configuration
mongodb_uri = mongodb://<MASKED>:27017,<MASKED>:27017,<MASKED>:27017,<MASKED>:27017/graylog?replicaSet=rs01
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5

# Email transport
transport_email_enabled = true
transport_email_hostname = <MASKED>
transport_email_port = 25
transport_email_use_auth = false
transport_email_use_tls = true
transport_email_use_ssl = false
transport_email_auth_username = you@example.com
transport_email_auth_password = secret
transport_email_subject_prefix = [graylog]
transport_email_from_email = <MASKED>
transport_email_web_interface_url = https://<MASKED>

# HTTP proxy for outgoing HTTP calls
proxied_requests_thread_pool_size = 32

# GC

# Additional options

What steps have you already taken to try and solve the problem?
I’ve scoured Mongo and Graylog logs, but haven’t found an obvious root cause. Our Mongo logs are very clean (all informational). Our Graylog logs are a bit difficult to dig through, as we have a lot of noise in them. The major sources of noise are (an example of what the GELF codec expects follows this list):

  1. ERROR [DecodingProcessor] Error processing message RawMessage…Unrecognized token 'default': was expecting ('true', 'false' or 'null')
  2. ERROR [GelfCodec] Could not parse JSON, first 400 characters: default send string
    com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'default': was expecting ('true', 'false' or 'null')
  3. WARN [GelfCodec] GELF message <358274f6-1caf-11ed-90be-0050569302c8> (received from ) has invalid "timestamp": 1660577269.37275 (type: STRING)
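
For reference, the GELF codec expects each payload to be a single JSON object, and the GELF spec wants “timestamp” as a number (seconds since the epoch), so the non-JSON payloads behind errors 1 and 2 can never parse, and a string timestamp triggers the WARN in error 3. A quick way to send a known-good test message to a GELF TCP input is below; this is only a sketch, the hostname is a placeholder and 12202 is one of our GELF TCP ports (bash’s echo -e turns \0 into the null byte that delimits GELF TCP frames):

echo -n -e '{"version":"1.1","host":"test-host","short_message":"gelf test","timestamp":1660577269.372,"level":6}\0' | nc -w1 <graylog-node> 12202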

When the master node disappears, we also start to see these errors (a GC/heap capture sketch follows this list):

  1. WARN [StreamRouterEngine] Unexpected error during stream matching
  2. WARN [GarbageCollectionWarningThread] Last GC run with G1 Old Generation took longer than 1 second (last duration=3320 milliseconds)
  3. ERROR [JobWorkerPool] Unhandled exception java.lang.OutOfMemoryError: Java heap space
  4. ERROR [DynamoDBLeaseRefresher] Failed to get table status for graylog-aws-plugin-AWSLogs software.amazon.kinesis.leases.exceptions.DependencyException: java.util.concurrent.TimeoutException
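
The long G1 Old Generation pauses and the Java heap space OutOfMemoryError point at heap pressure, so capturing GC logs and a class histogram before the node drops out should show what is filling the heap. A rough sketch, assuming the package install reads JVM options from GRAYLOG_SERVER_JAVA_OPTS in /etc/sysconfig/graylog-server (paths are just examples):

# Add to GRAYLOG_SERVER_JAVA_OPTS in /etc/sysconfig/graylog-server, then restart graylog-server:
#   -Xlog:gc*:file=/var/log/graylog-server/gc.log:time,uptime
#   -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/graylog-server/heapdump.hprof

# Watch GC pressure on the running JVM (run the tools as the graylog user so they can attach)
GRAYLOG_PID=$(pgrep -f graylog.jar)
sudo -u graylog jstat -gcutil "$GRAYLOG_PID" 5000

# Snapshot the biggest heap consumers while CPU is pegged
sudo -u graylog jmap -histo "$GRAYLOG_PID" | head -n 40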

How can the community help?
Can anyone give me any ideas on configuration issues, performance/optimization recommendations, or other ways to troubleshoot this issue?

Some additional Graylog server.log errors while the master node is in its current failed state:

  • ERROR [ClusterAdapterES7] An error occurred: (Request cannot be executed; I/O reactor status: STOPPED)
  • INFO [IndexerClusterCheckerThread] Indexer not fully initialized yet. Skipping periodic cluster check.
  • ERROR [IndexRotationThread] Uncaught exception in Periodical…Caused by: java.lang.RuntimeException: Request cannot be executed; I/O reactor status: STOPPED

I’ve also observed that the master node will intermittently show up in the node list with “System information is currently unavailable” and then disappear again.

Hard to say which of these are just follow-on errors due to lack of resources.
Do you use regex in processing messages (pipeline rule or extractor)? That can be a reason for excessive resource usage.

Thanks for the ideas @patrickmann. I’ve gone through all of our pipeline rules and extractors and found that we don’t use any regex in pipeline rules, but we do have a half dozen regex-based extractors. I’ve gone through them and confirmed the regex is valid. Is there anything in particular I should be looking for in those regexes? I’m also trying to understand why the other three nodes, which run the same inputs and such, have been running fine. What does a master/leader node do that is different from, or in addition to, the other nodes?

One other thing I did was bump up the resources on the master node to 8 CPUs and 64GB memory, giving Graylog Java 32GB of heap space. I’ve done this in the past as I’ve found the node “lives” longer and I’m able to enable debug mode and collect more logs before it goes offline. I did that last night and then let it run/churn overnight. This morning the master is actually showing up under nodes, but it’s showing ~3.8 million unprocessed messages, ~41,000 messages in the Process buffer, and load balancer indication of “DEAD”. While in that state, it’s currently showing 50-150% CPU utilization on Graylog Java.
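
To quantify that backlog from the shell while the node is in this state, the REST API exposes the journal, buffer, and load-balancer status; a sketch with placeholder credentials and hostname (-k only if the TLS certificate isn’t trusted locally):

# Journal backlog: unprocessed messages, segment count, utilization
curl -k -u admin:<password> https://<node>:9000/api/system/journal

# Process/output buffer utilization on this node
curl -k -u admin:<password> https://<node>:9000/api/system/buffers

# What the load balancer health check sees (the "DEAD" indication)
curl -k https://<node>:9000/api/system/lbstatus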

Today, I’m planning to scour the debug log I captured last night… :face_with_spiral_eyes:

Valid regex can still cause catastrophic backtracking - here is a good article about it. This is not to say that’s what is happening, more that you can examine your regex (and Grok patterns if you have them) and consider ways to make them more efficient.

Hello,

Adding on to @tmacgbay & @patrickmann

What I noticed is that some messages received on the GELF input may be causing issues:

  • has invalid “timestamp”:
  • Unrecognized token ‘default’: was expecting (‘true’, ‘false’ or ‘null’)

As @patrickmann suggested, this looks like running out of resources.

Wow that’s a lot :eyes:

When I see that, it’s usually from one of the following:

  • Inputs (i.e., the wrong type of logs going into an input)
  • Extractors
  • Pipelines
  • Lookup Tables

I ingest 30 Million messages a day using 4 GB heap.


@gsmith I had only upped CPU/memory/heap size temporarily, as I’d found it extends the time before the node fails and I was trying to capture more logs. The rest of our nodes run with a 4GB heap.

I’ve been working on moving some regex input extractors to a pipeline, so they aren’t executed on every single message. I’ve just about gotten rid of the “Unrecognized token ‘default’” error; our F5 load balancer UDP health monitors were causing that. I don’t understand the “has invalid timestamp” errors, though, as the timestamps look valid (e.g. 1660849509.445, 1660849509.484, 1660849308.95, 1660849308.03504). That said, we’re getting a TON of those messages in the server logs, so they could be affecting performance and are definitely hurting the readability of the logs. Any idea how I might get rid of those, or is this a bug??
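
Looking at the WARN again, it reports (type: STRING), so I suspect those senders are putting the epoch value in a JSON string rather than a number, even though the value itself looks fine. In the meantime, would raising the log level for that one logger be a reasonable way to cut the noise? A sketch of what I mean, assuming a package install with the logging config at /etc/graylog/server/log4j2.xml; the logger name comes from the stack traces, and graylog-server needs a restart afterwards:

<!-- inside the <Loggers> section: drop per-message GELF timestamp WARNs, keep ERROR-level parse failures -->
<Logger name="org.graylog2.inputs.codecs.GelfCodec" level="ERROR"/>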

Thanks,

Vishnu

It would help if you posted the errors from server.log for us to look at… :smiley:

I’ve still got a few “default send string” errors, but I think I’m down to one source that is sending a few. They are occasional in the logs now:

2022-08-18T13:20:39.663-07:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=3a1548e1-1f33-11ed-8dfc-00505693e572, messageQueueId=844486734, codec=gelf, payloadSize=19, timestamp=2022-08-18T20:20:39.662Z, remoteAddress=/10.104.72.253:45328}
com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'default': was expecting ('true', 'false' or 'null')
 at [Source: (String)"default send string"; line: 1, column: 8]
        at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1804) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:703) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._reportInvalidToken(ReaderBasedJsonParser.java:2853) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1899) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:757) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectMapper._readTreeAndClose(ObjectMapper.java:4043) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2552) ~[graylog.jar:?]
        at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:130) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:156) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:94) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:95) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:49) [graylog.jar:?]
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]

Otherwise, as far as repeating errors go, I might actually be down to just the GELF timestamp warnings, which are very frequent in server.log:

2022-08-18T13:35:46.698-07:00 WARN  [GelfCodec] GELF message <56b7b081-1f35-11ed-8a9d-005056931163> (received from <10.104.72.254:44395>) has invalid "timestamp": 1660854946.696  (type: STRING)
2022-08-18T13:35:46.698-07:00 WARN  [GelfCodec] GELF message <56b7b082-1f35-11ed-8a9d-005056931163> (received from <10.104.72.254:44395>) has invalid "timestamp": 1660854946.696  (type: STRING)
2022-08-18T13:35:46.699-07:00 WARN  [GelfCodec] GELF message <56b7b083-1f35-11ed-8a9d-005056931163> (received from <10.104.72.254:44395>) has invalid "timestamp": 1660854946.696  (type: STRING)
2022-08-18T13:35:46.699-07:00 WARN  [GelfCodec] GELF message <56b7b084-1f35-11ed-8a9d-005056931163> (received from <10.104.72.254:52940>) has invalid "timestamp": 1660854946.693  (type: STRING)
2022-08-18T13:35:46.699-07:00 WARN  [GelfCodec] GELF message <56b7b085-1f35-11ed-8a9d-005056931163> (received from <10.104.72.254:41504>) has invalid "timestamp": 1660854946.693  (type: STRING)
2022-08-18T13:35:46.695-07:00 WARN  [GelfCodec] GELF message <56b76267-1f35-11ed-8a9d-005056931163> (received from <10.104.72.254:44395>) has invalid "timestamp": 1660854946.694  (type: STRING)
2022-08-18T13:35:46.695-07:00 WARN  [GelfCodec] GELF message <56b76266-1f35-11ed-8a9d-005056931163> (received from <10.104.72.254:44395>) has invalid "timestamp": 1660854946.694  (type: STRING)
2022-08-18T13:35:46.700-07:00 WARN  [GelfCodec] GELF message <56b7d792-1f35-11ed-8a9d-005056931163> (received from <10.104.72.254:44395>) has invalid "timestamp": 1660854946.697  (type: STRING)
2022-08-18T13:35:46.700-07:00 WARN  [GelfCodec] GELF message <56b825b2-1f35-11ed-8a9d-005056931163> (received from <10.104.72.254:52940>) has invalid "timestamp": 1660854946.696  (type: STRING)
2022-08-18T13:35:46.700-07:00 WARN  [GelfCodec] GELF message <56b7d795-1f35-11ed-8a9d-005056931163> (received from <10.104.72.254:44395>) has invalid "timestamp": 1660854946.697  (type: STRING)

I just did a clean restart of the graylog-server service on my master node and some of the errors that occur before it disappears out of the node list are:

2022-08-18T14:04:16.323-07:00 WARN  [AbstractTcpTransport] receiveBufferSize (SO_RCVBUF) for input GELFTCPInput{title=gelf-tcp, type=org.graylog2.inputs.gelf.tcp.GELFTCPInput, nodeId=9f572587-1a47-4ecc-b4a4-0191e6b4a149} (channel [id: 0x7da022bd, L:/0.0.0.0:12202]) should be >= 2097152 but is 425984.
2022-08-18T14:04:16.324-07:00 WARN  [UdpTransport] receiveBufferSize (SO_RCVBUF) for input GELFUDPInput{title=winlogs-gelf, type=org.graylog2.inputs.gelf.udp.GELFUDPInput, nodeId=9f572587-1a47-4ecc-b4a4-0191e6b4a149} (channel [id: 0xaff45739, L:/0.0.0.0:5414]) should be >= 2097152 but is 425984.
2022-08-18T14:04:16.374-07:00 WARN  [UdpTransport] receiveBufferSize (SO_RCVBUF) for input GELFUDPInput{title=winlogs-gelf, type=org.graylog2.inputs.gelf.udp.GELFUDPInput, nodeId=9f572587-1a47-4ecc-b4a4-0191e6b4a149} (channel [id: 0x9518413e, L:/0.0.0.0:5414]) should be >= 2097152 but is 425984.
2022-08-18T14:04:16.386-07:00 INFO  [InputStateListener] Input [Syslog UDP/62d5e5022559d94d70bed868] is now RUNNING
2022-08-18T14:04:16.393-07:00 INFO  [InputStateListener] Input [GELF UDP/6254cf616b834f046bcb223c] is now RUNNING
2022-08-18T14:04:16.394-07:00 WARN  [UdpTransport] receiveBufferSize (SO_RCVBUF) for input GELFUDPInput{title=winlogs-gelf, type=org.graylog2.inputs.gelf.udp.GELFUDPInput, nodeId=9f572587-1a47-4ecc-b4a4-0191e6b4a149} (channel [id: 0x32be7490, L:/0.0.0.0:5414]) should be >= 2097152 but is 425984.
2022-08-18T14:04:16.408-07:00 WARN  [UdpTransport] receiveBufferSize (SO_RCVBUF) for input GELFUDPInput{title=winlogs-gelf, type=org.graylog2.inputs.gelf.udp.GELFUDPInput, nodeId=9f572587-1a47-4ecc-b4a4-0191e6b4a149} (channel [id: 0x507e10e2, L:/0.0.0.0:5414]) should be >= 2097152 but is 425984.
2022-08-18T14:04:20.264-07:00 WARN  [GarbageCollectionWarningThread] Last GC run with G1 Young Generation took longer than 1 second (last duration=2163 milliseconds)
2022-08-18T14:04:33.864-07:00 WARN  [GelfCodec] GELF message <5c2ec7c0-1f39-11ed-808a-0050569302c8> (received from <10.104.72.252:40206>) is missing mandatory "host" field.
2022-08-18T14:04:34.701-07:00 WARN  [GelfCodec] GELF message <5cabe700-1f39-11ed-808a-0050569302c8> (received from <10.104.72.252:40304>) is missing mandatory "host" field.
2022-08-18T14:04:50.939-07:00 WARN  [StreamRouterEngine] Unexpected error during stream matching
java.util.concurrent.TimeoutException: null
        at java.util.concurrent.FutureTask.get(FutureTask.java:204) ~[?:?]
        at com.google.common.util.concurrent.SimpleTimeLimiter.callWithTimeout(SimpleTimeLimiter.java:162) ~[graylog.jar:?]
        at org.graylog2.streams.StreamRouterEngine$Rule.matchWithTimeOut(StreamRouterEngine.java:311) [graylog.jar:?]
        at org.graylog2.streams.StreamRouterEngine$Rule.access$000(StreamRouterEngine.java:268) [graylog.jar:?]
        at org.graylog2.streams.StreamRouterEngine.match(StreamRouterEngine.java:195) [graylog.jar:?]
        at org.graylog2.streams.StreamRouter.route(StreamRouter.java:99) [graylog.jar:?]
        at org.graylog2.filters.StreamMatcherFilter.filter(StreamMatcherFilter.java:42) [graylog.jar:?]
        at org.graylog2.messageprocessors.MessageFilterChainProcessor.process(MessageFilterChainProcessor.java:104) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.handleMessage(ProcessBufferProcessor.java:158) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.dispatchMessage(ProcessBufferProcessor.java:128) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:98) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:49) [graylog.jar:?]
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
2022-08-18T14:05:18.069-07:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=76386e00-1f39-11ed-808a-0050569302c8, messageQueueId=745, codec=gelf, payloadSize=307, timestamp=2022-08-18T21:05:17.536Z, remoteAddress=/10.104.72.254:52095}
com.fasterxml.jackson.core.JsonParseException: Unexpected character ('<' (code 60)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: (String)"<11>Navis.WinServices.MessageEventProcessService.3.exe:    at Navis.WinServices.MessageEventProcessService.Service.ParseWebEvent.ParseEvents(SendGridEvent eventMsg, String batchId) in D:\BuildAgent\work\e674a8e3a1e2d2e0\MessageEventProcessService\MessageEventProcessService\Service\ParseWebEvent.cs:line 100"; line: 1, column: 2]
        at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1804) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:693) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:591) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1902) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:757) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectMapper._readTreeAndClose(ObjectMapper.java:4043) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2552) ~[graylog.jar:?]
        at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:130) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:156) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:94) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:95) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:49) [graylog.jar:?]
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
2022-08-18T14:06:48.490-07:00 ERROR [ServerRuntime$Responder] An I/O error has occurred while writing a response message entity to the container output stream.
org.glassfish.jersey.server.internal.process.MappableException: java.io.IOException: Connection is closed
        at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:67) ~[graylog.jar:?]
        at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:139) ~[graylog.jar:?]
        at org.glassfish.jersey.message.internal.MessageBodyFactory.writeTo(MessageBodyFactory.java:1116) ~[graylog.jar:?]
        at org.glassfish.jersey.server.ServerRuntime$Responder.writeResponse(ServerRuntime.java:635) [graylog.jar:?]
        at org.glassfish.jersey.server.ServerRuntime$Responder.processResponse(ServerRuntime.java:373) [graylog.jar:?]
        at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:363) [graylog.jar:?]
        at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:258) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:292) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:274) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:244) [graylog.jar:?]
        at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) [graylog.jar:?]
        at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:234) [graylog.jar:?]
        at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:680) [graylog.jar:?]
        at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:356) [graylog.jar:?]
        at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:200) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180) [graylog.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.io.IOException: Connection is closed
        at org.glassfish.grizzly.nio.NIOConnection.assertOpen(NIOConnection.java:441) ~[graylog.jar:?]
        at org.glassfish.grizzly.http.io.OutputBuffer.write(OutputBuffer.java:663) ~[graylog.jar:?]
        at org.glassfish.grizzly.http.server.NIOOutputStreamImpl.write(NIOOutputStreamImpl.java:59) ~[graylog.jar:?]
        at java.io.ByteArrayOutputStream.writeTo(ByteArrayOutputStream.java:161) ~[?:?]
        at org.glassfish.jersey.message.internal.CommittingOutputStream.flushBuffer(CommittingOutputStream.java:278) ~[graylog.jar:?]
        at org.glassfish.jersey.message.internal.CommittingOutputStream.write(CommittingOutputStream.java:203) ~[graylog.jar:?]
        at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$UnCloseableOutputStream.write(WriterInterceptorExecutor.java:276) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8JsonGenerator._flushBuffer(UTF8JsonGenerator.java:2085) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8JsonGenerator.flush(UTF8JsonGenerator.java:1097) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:2658) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.base.GeneratorBase.writeObject(GeneratorBase.java:381) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.JsonGenerator.writeObjectField(JsonGenerator.java:1726) ~[graylog.jar:?]
        at com.codahale.metrics.json.MetricsModule$GaugeSerializer.serialize(MetricsModule.java:43) ~[graylog.jar:?]
        at com.codahale.metrics.json.MetricsModule$GaugeSerializer.serialize(MetricsModule.java:26) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ser.std.MapSerializer.serializeFields(MapSerializer.java:722) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ser.std.MapSerializer.serialize(MapSerializer.java:643) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ser.std.MapSerializer.serialize(MapSerializer.java:33) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:727) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:719) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:155) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1396) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:913) ~[graylog.jar:?]
        at com.fasterxml.jackson.jaxrs.base.ProviderBase.writeTo(ProviderBase.java:625) ~[graylog.jar:?]
        at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.invokeWriteTo(WriterInterceptorExecutor.java:242) ~[graylog.jar:?]
        at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.aroundWriteTo(WriterInterceptorExecutor.java:227) ~[graylog.jar:?]
        at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:139) ~[graylog.jar:?]
        at org.glassfish.jersey.server.internal.JsonWithPaddingInterceptor.aroundWriteTo(JsonWithPaddingInterceptor.java:85) ~[graylog.jar:?]
        at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:139) ~[graylog.jar:?]
        at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:61) ~[graylog.jar:?]
        ... 20 more
Caused by: java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:?]
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:62) ~[?:?]
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:137) ~[?:?]
        at sun.nio.ch.IOUtil.write(IOUtil.java:81) ~[?:?]
        at sun.nio.ch.IOUtil.write(IOUtil.java:58) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:532) ~[?:?]
        at org.glassfish.grizzly.nio.transport.TCPNIOUtils.flushByteBuffer(TCPNIOUtils.java:125) ~[graylog.jar:?]
        at org.glassfish.grizzly.nio.transport.TCPNIOUtils.writeSimpleBuffer(TCPNIOUtils.java:109) ~[graylog.jar:?]
        at org.glassfish.grizzly.nio.transport.TCPNIOAsyncQueueWriter.write0(TCPNIOAsyncQueueWriter.java:102) ~[graylog.jar:?]
        at org.glassfish.grizzly.nio.transport.TCPNIOAsyncQueueWriter.write0(TCPNIOAsyncQueueWriter.java:82) ~[graylog.jar:?]
        at org.glassfish.grizzly.nio.AbstractNIOAsyncQueueWriter.write(AbstractNIOAsyncQueueWriter.java:236) ~[graylog.jar:?]
        at org.glassfish.grizzly.nio.AbstractNIOAsyncQueueWriter.write(AbstractNIOAsyncQueueWriter.java:145) ~[graylog.jar:?]
        at org.glassfish.grizzly.nio.AbstractNIOAsyncQueueWriter.write(AbstractNIOAsyncQueueWriter.java:47) ~[graylog.jar:?]
        at org.glassfish.grizzly.nio.transport.TCPNIOTransportFilter.handleWrite(TCPNIOTransportFilter.java:102) ~[graylog.jar:?]
        at org.glassfish.grizzly.filterchain.TransportFilter.handleWrite(TransportFilter.java:167) ~[graylog.jar:?]
        at org.glassfish.grizzly.ssl.SSLBaseFilter$SSLTransportFilterWrapper.handleWrite(SSLBaseFilter.java:1125) ~[graylog.jar:?]
        at org.glassfish.grizzly.filterchain.ExecutorResolver$8.execute(ExecutorResolver.java:87) ~[graylog.jar:?]
        at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:260) ~[graylog.jar:?]
        at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:177) ~[graylog.jar:?]
        at org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:109) ~[graylog.jar:?]
        at org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:88) ~[graylog.jar:?]
        at org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:53) ~[graylog.jar:?]
        at org.glassfish.grizzly.filterchain.FilterChainContext.write(FilterChainContext.java:866) ~[graylog.jar:?]
        at org.glassfish.grizzly.filterchain.FilterChainContext.write(FilterChainContext.java:719) ~[graylog.jar:?]
        at org.glassfish.grizzly.ssl.SSLBaseFilter.doHandshakeStep(SSLBaseFilter.java:719) ~[graylog.jar:?]
        at org.glassfish.grizzly.ssl.SSLBaseFilter.doHandshakeSync(SSLBaseFilter.java:577) ~[graylog.jar:?]
        at org.glassfish.grizzly.ssl.SSLBaseFilter.silentRehandshake(SSLBaseFilter.java:796) ~[graylog.jar:?]
        at org.glassfish.grizzly.ssl.SSLBaseFilter.unwrapAll(SSLBaseFilter.java:413) ~[graylog.jar:?]
        at org.glassfish.grizzly.ssl.SSLBaseFilter.handleRead(SSLBaseFilter.java:345) ~[graylog.jar:?]
        at org.glassfish.grizzly.filterchain.ExecutorResolver$9.execute(ExecutorResolver.java:95) ~[graylog.jar:?]
        at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:260) ~[graylog.jar:?]
        at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:177) ~[graylog.jar:?]
        at org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:109) ~[graylog.jar:?]
        at org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:88) ~[graylog.jar:?]
        at org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:53) ~[graylog.jar:?]
        at org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:515) ~[graylog.jar:?]
        at org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:89) ~[graylog.jar:?]
        at org.glassfish.grizzly.strategies.SameThreadIOStrategy.executeIoEvent(SameThreadIOStrategy.java:79) ~[graylog.jar:?]
        at org.glassfish.grizzly.strategies.AbstractIOStrategy.executeIoEvent(AbstractIOStrategy.java:66) ~[graylog.jar:?]
        at org.glassfish.grizzly.nio.SelectorRunner.iterateKeyEvents(SelectorRunner.java:391) ~[graylog.jar:?]
        at org.glassfish.grizzly.nio.SelectorRunner.iterateKeys(SelectorRunner.java:360) ~[graylog.jar:?]
        at org.glassfish.grizzly.nio.SelectorRunner.doSelect(SelectorRunner.java:324) ~[graylog.jar:?]
        at org.glassfish.grizzly.nio.SelectorRunner.run(SelectorRunner.java:255) ~[graylog.jar:?]
        at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:569) ~[graylog.jar:?]
        at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:549) ~[graylog.jar:?]
        ... 1 more
2022-08-18T14:06:48.645-07:00 WARN  [CloudWatchMetricsPublisher] Could not publish 4 datums to CloudWatch
java.util.concurrent.TimeoutException: null
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960) ~[?:?]
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095) ~[?:?]
        at software.amazon.kinesis.metrics.CloudWatchMetricsPublisher.blockingExecute(CloudWatchMetricsPublisher.java:87) ~[graylog-plugin-integrations-4.3.3.jar:?]
        at software.amazon.kinesis.metrics.CloudWatchMetricsPublisher.publishMetrics(CloudWatchMetricsPublisher.java:74) [graylog-plugin-integrations-4.3.3.jar:?]
        at software.amazon.kinesis.metrics.CloudWatchPublisherRunnable.runOnce(CloudWatchPublisherRunnable.java:138) [graylog-plugin-integrations-4.3.3.jar:?]
        at software.amazon.kinesis.metrics.CloudWatchPublisherRunnable.run(CloudWatchPublisherRunnable.java:84) [graylog-plugin-integrations-4.3.3.jar:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
2022-08-18T14:06:56.595-07:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.

I don’t know if the last few errors are an effect of running out of resources, but I hadn’t seen the “ServerRuntime$Responder” one before. That seems a bit interesting to me…

Hello,

Some ideas after looking over your logs.

This one is an OS-level issue:

WARN [UdpTransport] receiveBufferSize (SO_RCVBUF) for input GELFUDPInput{title=winlogs-gelf,
type=org.graylog2.inputs.gelf.udp.GELFUDPInput, nodeId=9f572587-1a47-4ecc-b4a4-0191e6b4a149}
(channel [id: 0x9518413e, L:/0.0.0.0:5414]) should be >= 2097152 but is 425984

My Solution:

Issue:

WARN [UdpTransport] receiveBufferSize (SO_RCVBUF) for input RawUDPInput{title=Security Devices, type=org.graylog2.inputs.raw.udp.RawUDPInput, nodeId=null} (channel [id: 0 L:/0:0:0:0:0:0:0:0%0:51430]) should be 1703936 but is 1048576

Fix:
Inspect maximum socket buffer size for SO_RCVBUF (receive, rmem_max) and SO_SNDBUF (send, wmem_max).

sysctl --all --pattern "net.core.[rw]mem_max"
sudo sysctl --write net.core.rmem_max=1703936 
systemctl restart graylog-server
tail -f /var/log/graylog-server/server.log
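
To make that survive a reboot, and to match the 2097152 bytes your inputs are asking for, you can also drop the values into a sysctl config file. A sketch; the file name is just an example:

# /etc/sysctl.d/99-graylog.conf
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152

# Apply without rebooting, then restart Graylog so the inputs re-bind their sockets
sudo sysctl --system
sudo systemctl restart graylog-server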

I would check out these two senders. Ensure they’re shipping logs that match the type of input used to collect them, just a thought (a packet capture sketch follows this list). You could also try creating a new input, such as Raw/Plaintext, to see if that resolves the issue.

  • (received from <10.104.72.252:40206>) is missing mandatory “host” field.

  • Error processing message RawMessage {id=76386e00-1f39-11ed-808a-0050569302c8, messageQueueId=745, codec=gelf, payloadSize=307, timestamp=2022-08-18T21:05:17.536Z, remoteAddress=/10.104.72.254:52095}
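
If it isn’t obvious what those senders are actually transmitting, a short packet capture on the input port shows the raw payload. A sketch using one host and one input port from your logs; adjust to whichever input the sender is pointed at:

# Print the ASCII payload of a few packets from the suspect sender
sudo tcpdump -A -c 10 -i any 'host 10.104.72.252 and udp port 5414'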

Not 100% sure, but check that the ‘Pipeline Processor’ runs after the ‘Message Filter Chain’ under System/Configurations.

org.graylog2.messageprocessors.MessageFilterChainProcessor.process(MessageFilterChainProcessor.java:104) [graylog.jar:?]

The most common reason I’ve had for a “broken pipe” is that one machine (of a pair communicating via socket) has shut down its end of the socket before communication was complete.
About half of those were because the program communicating on that socket had terminated.

  • Caused by: java.io.IOException: Broken pipe

EDIT: I forgot about this log.

ERROR [ServerRuntime$Responder] An I/O error has occurred while writing a response message entity to the container output stream.

If I understand the log correctly, it’s the processing of an incoming request failing to write the response because the caller closed the connection prematurely. You should check which device(s) on your network make those requests. Most likely these are UPnP devices, as they are “asked” to start communicating.

I think this WARN below is for AWS?

WARN [CloudWatchMetricsPublisher] Could not publish 4 datums to CloudWatch
java.util.concurrent.TimeoutException: null

