Graylog high journaling and low output rate

After running for a couple of days in a row, Graylog puts up a notification saying that journal utilization is too high. Output drops significantly and no longer matches the input. Graylog stays this way until the server is rebooted or the services are restarted.

OS: Ubuntu 20.04.3 LTS (kernel 5.4.0-89-generic)
Graylog: v4.2.1+5442e44
MongoDB: v4.0.27
Elasticsearch: 7.10.2

1 node: 24 cores, 48 GB RAM, 7 drives in hardware RAID 5
Average ingestion rate: ~2,500 msg/s in
Peak ingestion rate: ~5,000 msg/s in

At the time of the error I got the following message.

2021-11-24T16:56:19.289-05:00 INFO [connection] Opened connection [connectionId{localValue:13, serverValue:13}] to localhost:27017
2021-11-24T16:56:52.734-05:00 WARN [StreamRouterEngine] Unexpected error during stream matching
java.util.concurrent.TimeoutException: null
at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[?:1.8.0_292]
at com.google.common.util.concurrent.SimpleTimeLimiter.callWithTimeout(SimpleTimeLimiter.java:155) ~[graylog.jar:?]
at org.graylog2.streams.StreamRouterEngine$Rule.matchWithTimeOut(StreamRouterEngine.java:311) [graylog.jar:?]
at org.graylog2.streams.StreamRouterEngine$Rule.access$000(StreamRouterEngine.java:268) [graylog.jar:?]
at org.graylog2.streams.StreamRouterEngine.match(StreamRouterEngine.java:195) [graylog.jar:?]
at org.graylog2.streams.StreamRouter.route(StreamRouter.java:99) [graylog.jar:?]
at org.graylog2.filters.StreamMatcherFilter.filter(StreamMatcherFilter.java:42) [graylog.jar:?]
at org.graylog2.messageprocessors.MessageFilterChainProcessor.process(MessageFilterChainProcessor.java:104) [graylog.jar:?]
at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.handleMessage(ProcessBufferProcessor.java:158) [graylog.jar:?]
at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.dispatchMessage(ProcessBufferProcessor.java:128) [graylog.jar:?]
at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:98) [graylog.jar:?]
at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:49) [graylog.jar:?]
at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]

After a while I started getting this error.

2021-11-24T18:31:12.562-05:00 WARN [LocalKafkaJournal] Journal utilization (99.0%) has gone over 95%.

When running top:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND

1568 graylog 20 0 24.8g 8.5g 26084 S 393.0 17.9 167:01.59 java

Main PID: 1129 (graylog-server)
Tasks: 414 (limit: 57780)
Memory: 8.7G
CGroup: /system.slice/graylog-server.service
├─1129 /bin/sh /usr/share/graylog-server/bin/graylog-server
└─1568 /usr/bin/java -Xms10g -Xmx10g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:-OmitStackTraceInFastThrow -Djdk.tls.acknowledgeCloseNotify=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -jar -Dlog4j.configurationFile=file:///etc/graylog/server/log4j2.xml -Djava.library.path=/usr/share/graylog-server/lib/sigar -Dgraylog2.installation_source=deb /usr/shar>

I need some help figuring out why it keeps ending up in this state of high journal utilization and low output rate.

Hello @Chase, welcome to the community!

Do you have any extractors running? Very generic/broad regex extractors in particular are known for causing this behavior. You can try disabling extractors to see if the issue goes away.

I do have some extractors running, most of which I got from a content pack for my firewall. I’m guessing, but I don’t think those extractors are the ones causing this specific issue. Is there a way to verify before disabling them? If I were to guess, I would guess that the problem stems from the logging coming from our servers or desktops, none of which have extractors on their inputs.

Chase

Is the output buffer full or empty when the issue occurs?

If it’s full, then some possibilities:

  • The elasticsearch service could be entering a read-only state. You would see a read-only response in the graylog-server logs. This is often a disk space issue.
  • There could be a connectivity problem between graylog-server and elasticsearch. This only applies if the deployment isn’t monolithic (i.e., Elasticsearch runs on a separate host).
  • The elasticsearch service could be repeatedly crashing due to some error. You would see this in the elasticsearch service logs. (A few quick checks for all three are sketched after this list.)
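A few quick ways to rule those out, assuming a monolithic install with Elasticsearch listening on the default port 9200 (adjust host, port, and log paths to your setup):

# Cluster health and per-node disk usage
curl -s 'http://localhost:9200/_cluster/health?pretty'
curl -s 'http://localhost:9200/_cat/allocation?v'

# Indices flipped to read-only by the flood-stage disk watermark
curl -s 'http://localhost:9200/_all/_settings/index.blocks.read_only_allow_delete?pretty'

# Once disk space has been freed, the read-only block can be cleared
curl -s -X PUT 'http://localhost:9200/_all/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'

# Repeated crashes or restarts show up here (the log file name follows your cluster name)
journalctl -u elasticsearch --since "1 hour ago" | tail -n 100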

If it’s empty:
Is the input buffer full or empty?
If the input buffer is full but the output buffer is empty, there’s an issue with message processing within graylog-server. Some possibilities:

  • The server could be out of resources. You would see this in performance metrics.
  • The service could be encountering some error causing it to keep restarting. You would see this in your usage (web dashboard unavailable) and in the graylog-server logs.
  • The server could be hung on a message because user logic is not working as expected, e.g. an extractor taking a very long time to process messages. You would see this in the message processing time metric (periods of very high processing time). A way to pull these buffer and timing numbers from the API is sketched after this list.
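If you want to put numbers on those, the buffer fill levels and per-message processing time are exposed through the Graylog REST API. A rough sketch; the credentials and metric names below are assumptions on my part, and /api/system/metrics/names will list what your node actually registers:

API='http://127.0.0.1:9000/api'   # adjust to your http_publish_uri
AUTH='admin:yourpassword'         # placeholder credentials

# Buffer utilization gauges plus the process-buffer timer
for m in org.graylog2.buffers.input.usage \
         org.graylog2.buffers.process.usage \
         org.graylog2.buffers.output.usage \
         org.graylog2.shared.buffers.processors.ProcessBufferProcessor.processTime
do
  curl -s -u "$AUTH" "$API/system/metrics/$m"; echo
done

The same figures are visible under System > Nodes in the web interface, and recent 4.x versions can also produce a process-buffer dump there, which shows the exact message each processor thread is currently working on.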

If the input buffer is empty:
Is the disk journal full or empty? If the disk journal is full but the input buffer is empty, some possibilities:

  • The disk journal has become corrupted due to an unclean shutdown or (commonly) an underlying disk condition, such as a disk size change made without cleanly stopping the service.
  • The graylog-server service is unable to read the disk journal due to some other condition, which will appear in the graylog-server logs.

If the disk journal is empty, either you aren’t receiving messages or you’re processing them so quickly that it appears empty.
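For the journal side of that tree, a couple of hedged checks (the API path and the default journal location below are from memory for a deb install):

# Journal state: uncommitted entries, append/read rates, size
curl -s -u admin:yourpassword 'http://127.0.0.1:9000/api/system/journal'

# On-disk footprint of the journal
sudo du -sh /var/lib/graylog-server/journal

# Corruption or recovery messages land in the server log
sudo grep -iE 'journal|kafka' /var/log/graylog-server/server.log | tail -n 50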

Is there a way to verify before disabling?

Manually audit messages to which the extractor has been applied and look for messages with a large delta between the time they were received and the time they were indexed. You will probably have to use the API to do this; I don’t think it’s all surfaced in the web interface.
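One hedged way to do that audit without disabling anything first: each extractor carries execution-time metrics, so a pathological one usually stands out. The paths and response shape below are from memory; the built-in API browser will show the exact layout:

# Find the input IDs, then inspect the extractors attached to a suspect input;
# the per-extractor metrics include condition and execution timing percentiles
curl -s -u admin:yourpassword 'http://127.0.0.1:9000/api/system/inputs'
curl -s -u admin:yourpassword 'http://127.0.0.1:9000/api/system/inputs/<input-id>/extractors'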

If I were to guess, I would guess that the problem stems from the logging coming from our servers or desktops, none of which have extractors on their inputs.

This suggests to me that the server may be suffering from an overall performance issue. Look at the performance metrics: if you’re constantly seeing high disk wait queues or CPU usage greater than the total CPU available, that should be pretty straightforward to diagnose. In my experience, OOM issues will manifest in Elasticsearch before either of the other core services. If there is physical memory available and Elasticsearch is complaining it’s OOM, allocate more to it. If there isn’t, you’ll need to offload the service or increase the physical resources of the machine, or alternatively scale out to a cluster.
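For the host-level side, something along these lines (iostat comes from the sysstat package, and the Elasticsearch calls assume the default port on the same host):

# CPU saturation and disk wait at the host level
vmstat 5 3
iostat -x 5 3

# Heap pressure and load as Elasticsearch sees it
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m'
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty' | grep -E 'heap_used_percent|collection_time'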


In addition to extractors, also check your stream rules. If you are using regex to do message routing, it can cause this behavior as well. I always recommend that people spend the time defining their unique sources so they can do an exact string match on the message source, if possible.
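If you want to double-check that without clicking through every stream, the stream definitions, including their rules, can be pulled from the API. A sketch, assuming jq is installed and using placeholder credentials; cross-check the numeric rule type codes against the API browser to spot any regex matchers:

curl -s -u admin:yourpassword 'http://127.0.0.1:9000/api/streams' \
  | jq '.streams[] | {title: .title, rules: [.rules[] | {field, type, value}]}'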

We don’t have any regex in our streams at present. Currently we have a different input for each device and streams based purely on the source input. Likewise, our extractors are set up the same way.

I think Andrew has covered most of the important questions above, but can I also ask how much Java heap memory is assigned to Elasticsearch at present? (14 GB perhaps, since I see 10 GB is assigned to Graylog, and it’s advised not to assign more than 50% of RAM to Java heap memory.)

I will check these the next time the server starts having issues, which should be within the next couple of days. Is there anything I can look at now?

48GB total. Elastic is assigned 20GB. Graylog is assigned 10GB

Elasticsearch jvm.options:

## Xms represents the initial size of total heap space
## Xmx represents the maximum size of total heap space
-Xms20g
-Xmx20g

/etc/default/graylog-server:

# Default Java options for heap and garbage collection.
GRAYLOG_SERVER_JAVA_OPTS="-Xms10g -Xmx10g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:-OmitStackTraceInFastThrow"

Thanks. I would suggest that you would benefit from altering your Java heap sizing so that the combined value for Elasticsearch and Graylog is slightly under 24 GB. As a rule of thumb, you want maximum Java heap usage to be under 50% of total available memory (normal memory demands will exceed heap usage during Java garbage collection).

To this end, you might try a sizing of 8 GB heap for Graylog and 16 GB heap for Elasticsearch; you will probably find that performance improves.
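In concrete terms, that would look something like the following (the file paths are the defaults for a deb install; adjust if yours differ, and restart each service after editing):

# /etc/elasticsearch/jvm.options
-Xms16g
-Xmx16g

# /etc/default/graylog-server
GRAYLOG_SERVER_JAVA_OPTS="-Xms8g -Xmx8g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:-OmitStackTraceInFastThrow"

# Then:
sudo systemctl restart elasticsearch
sudo systemctl restart graylog-server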

I’d be interested to know the answers to Andrew’s queries regarding the input and output buffer states.

I’ll give that a try, and update everyone as soon as the problem happens again.

