Messages stop being read from disk journal

We have been running Graylog for some time now, but we recently decided to point our vCenter and ESXi syslogs at our Graylog server. When we do, the disk journal fills, but no messages seem to be read from it.

We stop the vCenter/ESXi syslog input, stop Graylog, delete the journal, start Graylog, and everything goes back to normal: messages enter and leave the journal.

We can’t figure out why exactly vCenter/ESXi syslog data jams things up. We know that vCenter and ESXi output a ton of messages, but our node reports everything is healthy. Elasticsearch remains healthy while the “firehose” is turned on and we don’t get any indexer failures. CPU and RAM on the VM are good.

Is it just that the disk isn’t fast enough for the torrent of data coming from vCenter/ESXi? How can we troubleshoot this when everything else appears “green”?

Graylog 4.3, Ubuntu 22.04, VMware VM, 12 vCPUs, 16 GB vRAM, 500 GB disk, installed from package

Do you have any regex/GROK extractors or pipeline rules that might be getting jammed up? If they aren’t efficient, they could definitely lock things up. You might be able to see in the process buffer dump (normally it shows idle for me) what it is caught up on. This is under the Actions menu on the individual node.
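If you prefer the command line, the same data the UI’s process-buffer dump action shows is available from the REST API. The hostname, credentials, and node ID below are placeholders, not values from this thread:

```shell
# Placeholders throughout -- substitute your own host, credentials, and node ID.
# Returns the same process-buffer dump the node's Actions menu shows, so you
# can see which message each processor thread is stuck on.
curl -s -u admin:password \
  "http://graylog.example.org:9000/api/cluster/NODE_ID/processbufferdump?pretty=true"
```

A buffer that is stuck will show the same message in the dump across repeated calls, rather than "idle".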



OK, so the 5 processor buffers do fill up with 5 messages (one per buffer) that don’t seem to clear out. So now what?

Hello,

Adding on to @tmacgbay

How many logs are ingested during this time?
Can you show the buffer configuration from your Graylog config file?
You stated that there were index failures? What do those logs show, and can you post them?

You likely have a runaway (poorly written) GROK or regex statement. When a particular message hits the rule with the offending GROK/regex, it locks up the processor buffer… at least this is what I have observed in the past. That’s what you need to track down in your extractors and/or pipeline rules: what changed for this to happen?
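To see why one bad message can pin a processor thread, here is a small sketch of catastrophic backtracking. The pattern is deliberately pathological and only stands in for a real extractor with nested quantifiers; it is not taken from this thread:

```python
import re
import time

# A deliberately pathological pattern, similar in spirit to a GROK/regex
# extractor with nested quantifiers: (a+)+ backtracks exponentially when
# the overall match is forced to fail.
BAD = re.compile(r"^(a+)+b$")

def match_time(n: int) -> float:
    """Time one failing match attempt against n repeated characters."""
    subject = "a" * n + "c"  # the trailing 'c' guarantees the match fails
    start = time.perf_counter()
    BAD.match(subject)
    return time.perf_counter() - start

# Each extra character roughly doubles the work, so one modestly sized
# syslog message can occupy a process-buffer thread almost indefinitely.
for n in (14, 18, 22):
    print(f"n={n}: {match_time(n):.4f}s")
```

The thread never crashes or logs an error; it just never finishes, which matches the “everything looks green but nothing moves” symptom described above.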

The amount of logs coming through on the input is in the thousands.

Here is the buffer config:

# The number of parallel running processors.
# Raise this number if your buffers are filling up.
processbuffer_processors = 5
outputbuffer_processors = 8

# The following settings (outputbuffer_processor_*) configure the thread pools backing each output buffer processor.
# See https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ThreadPoolExecutor.html for technical details

# When the number of threads is greater than the core (see outputbuffer_processor_threads_core_pool_size),
# this is the maximum time in milliseconds that excess idle threads will wait for new tasks before terminating.
# Default: 5000
#outputbuffer_processor_keep_alive_time = 5000

# The number of threads to keep in the pool, even if they are idle, unless allowCoreThreadTimeOut is set
# Default: 3
#outputbuffer_processor_threads_core_pool_size = 3

# The maximum number of threads to allow in the pool
# Default: 30
#outputbuffer_processor_threads_max_pool_size = 30

# UDP receive buffer size for all message inputs (e. g. SyslogUDPInput).
#udp_recvbuffer_sizes = 1048576

# Wait strategy describing how buffer processors wait on a cursor sequence. (default: sleeping)
# Possible types:
#  - yielding
#     Compromise between performance and CPU usage.
#  - sleeping
#     Compromise between performance and CPU usage. Latency spikes can occur after quiet periods.
#  - blocking
#     High throughput, low latency, higher CPU usage.
#  - busy_spinning
#     Avoids syscalls which could introduce latency jitter. Best when threads can be bound to specific CPU cores.
processor_wait_strategy = blocking

# Size of internal ring buffers. Raise this if raising outputbuffer_processors does not help anymore.
# For optimum performance your LogMessage objects in the ring buffer should fit in your CPU L3 cache.
# Must be a power of 2. (512, 1024, 2048, ...)
ring_size = 65536

inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking

We don’t get any index failures; that is the weird part. It just stops reading from disk.

We thought it was a GROK or regex… but we disconnected all the pipelines from that message stream… at least we think we did. We are still learning Graylog so we could have been mistaken. The vCenter input goes directly to a vCenter stream and we remove those messages from the All Messages stream. I then double checked the pipelines to make sure that stream wasn’t connected. I guess for further testing we could disconnect all pipelines for a bit and see what happens.

and nothing in the Graylog logs?
tail -f /var/log/graylog-server/server.log

Do you have any extractors on the inputs that receive the vCenter/ESXi data? Is it possible to temporarily reduce what is coming in to see if you can find a breaking point? @gsmith knows a bit more about large environments and how to adjust for them; you may need to adjust the amount of memory for Java.
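For the Java memory adjustment on a package install, the heap is set in /etc/default/graylog-server. The sizes below are illustrative for this VM’s 16 GB, not a recommendation:

```shell
# /etc/default/graylog-server -- heap flags for the Graylog JVM.
# Illustrative sizing only: on a 16 GB VM that also hosts Elasticsearch
# and MongoDB, leave plenty of headroom for those and the OS page cache.
GRAYLOG_SERVER_JAVA_OPTS="-Xms2g -Xmx2g -server"
```

Restart graylog-server after changing it for the new heap size to take effect.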

Are you running a single machine or a cluster?

Hello,

The first configuration I noticed was…

I see you have 12 vCPU cores, but you are using 15 parallel running processors, as shown above. I suggest something like this:

processbuffer_processors = 7
outputbuffer_processors = 3
inputbuffer_processors = 2

processbuffer_processors is your heavy hitter. I have seen others raise these settings too high and end up with the same issue.

If the journal fills up, it may take a little while to process. If you can increase the number of cores on the device, that would be the way to go.

Another suggestion: depending on the amount of logs ingested, you could raise
output_batch_size. You can find more here.
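Putting those suggestions together, the relevant server.conf section might look like this. The values are illustrative for a 12-vCPU node, not tuned for this specific workload:

```ini
# /etc/graylog/server/server.conf -- illustrative values for a 12-vCPU node.
# The three processor pools should sum to no more than the core count.
processbuffer_processors = 7
outputbuffer_processors = 3
inputbuffer_processors = 2

# Messages written to Elasticsearch per batch (default is 500). Raising it
# can help at high ingest rates, at the cost of more memory per batch.
output_batch_size = 1000
```

Graylog needs a restart to pick up changes to these settings.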


I totally forgot about Extractors! I imported them from here. I guess I could go through those and see what is getting hit… although there are a lot of them…

Thanks for the buffer tips. I am going to adjust those values. Maybe not the output_batch_size, but the others for sure.


OK, so it is certainly the extractors causing the problem. We removed them all and things flow. We added just the regex ones back and left the GROK patterns off (just to split the field), and things jammed back up. So it is the extractors. Now to go through them and maybe figure out which one is broken.

The question is, why doesn’t the process buffer just kick the messages out after a time period? It is weird that they just stay there.

I wish it would! One of the keys to making regex and GROK efficient is to use ^ and $ to explicitly delineate the beginning and end of the line. Without them there is a lot more iteration in the search, which slows things down.
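The effect of anchoring is easy to measure. This sketch uses a made-up pattern and input, not anything from the thread:

```python
import re
import time

# A long message that will never match, standing in for a chatty ESXi line.
LOG_LINE = "x" * 1_000_000

unanchored = re.compile(r"\d+ ERROR \d+")
anchored = re.compile(r"^\d+ ERROR \d+")  # ^ pins the attempt to offset 0

def search_time(pattern: re.Pattern) -> float:
    """Time one search over the non-matching line."""
    start = time.perf_counter()
    pattern.search(LOG_LINE)
    return time.perf_counter() - start

# The unanchored pattern is retried at offset after offset along the line
# before giving up; the anchored one fails after a single attempt.
print(f"unanchored: {search_time(unanchored):.6f}s")
print(f"anchored:   {search_time(anchored):.6f}s")
```

The same principle applies inside GROK, since GROK patterns compile down to regular expressions.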
