Process Buffer Full, Slow Output, Journaling

We are using Graylog 6.x on CentOS. This specific node has 24 CPUs, 24 GB RAM and SSD storage, and is part of a 3-node Graylog cluster where each node is also a MongoDB node. There is no load balancing in play here, so it's all about this specific node.
Behind the Graylog cluster is a separate 4-node OpenSearch cluster that is bored. Normally we use pipeline rules, but for all messages on this node I deactivated the rules, which brought nearly no performance boost. There are no extractors.

The input is coming from a Graylog Forwarder. All messages are Windows Server event logs, shipped to the forwarder via NXLog.
The config of the node is:

output_batch_size = 1000
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 22
outputbuffer_processors = 1
processor_wait_strategy = blocking
ring_size = 131072
inputbuffer_ring_size = 131072
inputbuffer_processors = 1
inputbuffer_wait_strategy = blocking

The heap of the node is 12 GB.
Refresh rate of the indices is 30 seconds.

Now my question: Why is this node only outputting ~700 messages per second, and how can I increase the output? Isn't that far too low for this hardware, or is it simply because Windows event logs are huge compared to other syslog events? I think I have tried all the solutions from the community forums, but maybe if I ask directly there is another solution.

Hello @fsv

If you were to alter the NXLog configuration to send to one of the other Graylog nodes in the 3-node cluster, does the same behaviour occur? Is there anything within the Graylog logs of the affected node that points to an issue?
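
It can also help to see what the processor threads are actually busy with while the process buffer is full. A rough example using Graylog's process-buffer dump API (host, credentials and the node ID are placeholders here):

    # list the cluster nodes to find the node ID of the affected node
    curl -s -u admin:password 'http://127.0.0.1:9000/api/cluster?pretty=true'

    # dump what each process-buffer processor is currently working on
    curl -s -u admin:password \
      'http://127.0.0.1:9000/api/cluster/<node-id>/processbufferdump?pretty=true'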

To increase output to OpenSearch I would increase outputbuffer_processors to around 3 or 4 (reducing processbuffer_processors by the same amount), and then incrementally increase output_batch_size by 500 at a time until you stop seeing improvements.
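
For example, something like this in server.conf as a starting point (untested values, just keeping the total processor threads within your 24 cores):

    processbuffer_processors = 19
    outputbuffer_processors = 4
    # raise in steps of 500 until throughput stops improving
    output_batch_size = 1500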

Hello, thanks for the input!

Yes. I tried using inputs instead of the forwarder, and I tried switching the forwarder from one node to another. In all scenarios the outcome is the same.

Unfortunately, my logs are as clean as ever. I'm not even encountering any index errors.

Been there, done that. In the end I changed outputbuffer_processors back to 1 because the input and output buffers are always empty. The only thing that had a positive effect on the output was setting the input processor buffer higher. The default value of output_batch_size is 500. Mine is at 1500 at the moment, but the switch from 500 → 1k → 1.5k had absolutely no effect, so I stopped increasing it.

Oh, there is one thing I didn't mention. The max index size on all my indices is increased from 1k to 2.5k because the service and application logs of multiple Windows servers were forcing me to. But I tested setting all new and old indices back to the default of 1k and it had no impact on performance, which makes sense because OpenSearch has no load.

Morning @fsv

When you speak of the max index size, what exactly do you mean by 2.5k? Is this total messages per index or total size, and if total size, is that in KiB?

Maximum fields per index. The default is 1k, which should normally be enough, but Microsoft event logs or third-party application event logs are sometimes so bad that a single event can use up many fields.
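
On the OpenSearch side this is the index.mapping.total_fields.limit setting. Roughly how you can check and raise it directly on an existing index (host and index name are just examples; for new indices the same setting would go into a custom index template):

    # check the current field limit (defaults to 1000 if unset)
    curl -s 'http://opensearch:9200/graylog_0/_settings/index.mapping.total_fields.limit?include_defaults=true&pretty'

    # raise the limit on an existing index to 2500
    curl -X PUT 'http://opensearch:9200/graylog_0/_settings' \
      -H 'Content-Type: application/json' \
      -d '{"index.mapping.total_fields.limit": 2500}'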

Thanks for clarifying; realistically there shouldn't be a need for more than 1000 individual fields, even if the logs are coming from Windows.

Are you using the default NXLog configuration that ships with Graylog?

If these logs are currently arriving on a GELF input, try switching to a RAW input and see if the process buffer empties. While not a solution, it may help narrow down the issue.

If possible, try sending via Filebeat. Does the same issue occur, and is the total field count reduced?

I looked into the indices to see which fields are getting created, and they are all legit. The standard Windows logs work fine because they use the same format most of the time, but some service and application logs are really bad and create 50 fields because they look like this:

    "SynchronousReadIoCountsBucket1" : {
      "type" : "keyword"
    },
    "SynchronousReadIoCountsBucket10" : {
      "type" : "keyword"
    },
    "SynchronousReadIoCountsBucket11" : {
      "type" : "keyword"
    },
    "SynchronousReadIoCountsBucket12" : {
      "type" : "keyword"
    },
    "SynchronousReadIoCountsBucket2" : {
      "type" : "keyword"
    },
    "SynchronousReadIoCountsBucket3" : {
      "type" : "keyword"
    },
    "SynchronousReadIoCountsBucket4" : {
      "type" : "keyword"
    },
    "SynchronousReadIoCountsBucket5" : {
      "type" : "keyword"
    },
    "SynchronousReadIoCountsBucket6" : {
      "type" : "keyword"
    },
    "SynchronousReadIoCountsBucket7" : {
      "type" : "keyword"
    },
    "SynchronousReadIoCountsBucket8" : {
      "type" : "keyword"
    },
    "SynchronousReadIoCountsBucket9" : {
      "type" : "keyword"
    },
    "SynchronousReadIoMaxLatency" : {
      "type" : "keyword"
    },
    "SynchronousReadIoNonBlockingMaxLatency" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingIoCountsBucket1" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingIoCountsBucket10" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingIoCountsBucket11" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingIoCountsBucket12" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingIoCountsBucket2" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingIoCountsBucket3" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingIoCountsBucket4" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingIoCountsBucket5" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingIoCountsBucket6" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingIoCountsBucket7" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingIoCountsBucket8" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingIoCountsBucket9" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingTotalLatencyBucket1" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingTotalLatencyBucket10" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingTotalLatencyBucket11" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingTotalLatencyBucket12" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingTotalLatencyBucket2" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingTotalLatencyBucket3" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingTotalLatencyBucket4" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingTotalLatencyBucket5" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingTotalLatencyBucket6" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingTotalLatencyBucket7" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingTotalLatencyBucket8" : {
      "type" : "keyword"
    },
    "SynchronousReadNonBlockingTotalLatencyBucket9" : {
      "type" : "keyword"
    },
    "SynchronousReadTotalLatencyBucket1" : {
      "type" : "keyword"
    },
    "SynchronousReadTotalLatencyBucket10" : {
      "type" : "keyword"
    },
    "SynchronousReadTotalLatencyBucket11" : {
      "type" : "keyword"
    },
    "SynchronousReadTotalLatencyBucket12" : {
      "type" : "keyword"
    },
    "SynchronousReadTotalLatencyBucket2" : {
      "type" : "keyword"
    },
    "SynchronousReadTotalLatencyBucket3" : {
      "type" : "keyword"
    },
    "SynchronousReadTotalLatencyBucket4" : {
      "type" : "keyword"
    },
    "SynchronousReadTotalLatencyBucket5" : {
      "type" : "keyword"
    },
    "SynchronousReadTotalLatencyBucket6" : {
      "type" : "keyword"
    },
    "SynchronousReadTotalLatencyBucket7" : {
      "type" : "keyword"
    },
    "SynchronousReadTotalLatencyBucket8" : {
      "type" : "keyword"
    },
    "SynchronousReadTotalLatencyBucket9" : {
      "type" : "keyword"
    },
    "SynchronousWriteIoCountsBucket1" : {
      "type" : "keyword"
    },
    "SynchronousWriteIoCountsBucket10" : {
      "type" : "keyword"
    },
    "SynchronousWriteIoCountsBucket11" : {
      "type" : "keyword"
    },
    "SynchronousWriteIoCountsBucket12" : {
      "type" : "keyword"
    },
    "SynchronousWriteIoCountsBucket2" : {
      "type" : "keyword"
    },
    "SynchronousWriteIoCountsBucket3" : {
      "type" : "keyword"
    },
    "SynchronousWriteIoCountsBucket4" : {
      "type" : "keyword"
    },
    "SynchronousWriteIoCountsBucket5" : {
      "type" : "keyword"
    },
    "SynchronousWriteIoCountsBucket6" : {
      "type" : "keyword"
    },
    "SynchronousWriteIoCountsBucket7" : {
      "type" : "keyword"
    },
    "SynchronousWriteIoCountsBucket8" : {
      "type" : "keyword"
    },
    "SynchronousWriteIoCountsBucket9" : {
      "type" : "keyword"
    },
    "SynchronousWriteIoMaxLatency" : {
      "type" : "keyword"
    },
    "SynchronousWriteIoNonBlockingMaxLatency" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingIoCountsBucket1" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingIoCountsBucket10" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingIoCountsBucket11" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingIoCountsBucket12" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingIoCountsBucket2" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingIoCountsBucket3" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingIoCountsBucket4" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingIoCountsBucket5" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingIoCountsBucket6" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingIoCountsBucket7" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingIoCountsBucket8" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingIoCountsBucket9" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingTotalLatencyBucket1" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingTotalLatencyBucket10" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingTotalLatencyBucket11" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingTotalLatencyBucket12" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingTotalLatencyBucket2" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingTotalLatencyBucket3" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingTotalLatencyBucket4" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingTotalLatencyBucket5" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingTotalLatencyBucket6" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingTotalLatencyBucket7" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingTotalLatencyBucket8" : {
      "type" : "keyword"
    },
    "SynchronousWriteNonBlockingTotalLatencyBucket9" : {
      "type" : "keyword"
    },
    "SynchronousWriteTotalLatencyBucket1" : {
      "type" : "keyword"
    },
    "SynchronousWriteTotalLatencyBucket10" : {
      "type" : "keyword"
    },
    "SynchronousWriteTotalLatencyBucket11" : {
      "type" : "keyword"
    },
    "SynchronousWriteTotalLatencyBucket12" : {
      "type" : "keyword"
    },
    "SynchronousWriteTotalLatencyBucket2" : {
      "type" : "keyword"
    },
    "SynchronousWriteTotalLatencyBucket3" : {
      "type" : "keyword"
    },
    "SynchronousWriteTotalLatencyBucket4" : {
      "type" : "keyword"
    },
    "SynchronousWriteTotalLatencyBucket5" : {
      "type" : "keyword"
    },
    "SynchronousWriteTotalLatencyBucket6" : {
      "type" : "keyword"
    },
    "SynchronousWriteTotalLatencyBucket7" : {
      "type" : "keyword"
    },
    "SynchronousWriteTotalLatencyBucket8" : {
      "type" : "keyword"
    },
    "SynchronousWriteTotalLatencyBucket9" : {
      "type" : "keyword"
    },

If you're a madman like me and want to collect all the logs, you end up with indices like that. Sure, I could break those logs up into multiple indices, or even better, make a list of all the logs I really need instead of collecting everything, but until now I had no problem, so I did not touch it.
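
A quick way to see how many fields an index has actually accumulated (host and index name are examples; jq is only used for counting the top-level mapped fields):

    curl -s 'http://opensearch:9200/graylog_0/_mapping' \
      | jq '.[].mappings.properties | length'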

I'm using my own NXLog configuration. Most of the time I have three inputs: one for the standard Windows log files (System, Security, Application, Setup), one for the Service and Application logs, and one or more for special outputs like log file exports for NAP, SCCM, Exchange and things like that.
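
To give an idea, the first input looks roughly like this; hostname, port and the query are simplified placeholders, and GELF over TCP via xm_gelf is assumed:

    <Extension gelf>
        Module      xm_gelf
    </Extension>

    # Input 1: the standard Windows logs (System, Security, Application, Setup)
    <Input windows_standard>
        Module      im_msvistalog
        Query       <QueryList>\
                      <Query Id="0">\
                        <Select Path="System">*</Select>\
                        <Select Path="Security">*</Select>\
                        <Select Path="Application">*</Select>\
                        <Select Path="Setup">*</Select>\
                      </Query>\
                    </QueryList>
    </Input>

    <Output to_forwarder>
        Module      om_tcp
        Host        forwarder.example.org
        Port        12201
        OutputType  GELF_TCP
    </Output>

    <Route r1>
        Path        windows_standard => to_forwarder
    </Route>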

I will try switching to RAW to see if GELF processing is the problem, that's a good hint.

A switch to Filebeat is something on my list, because NXLog locks all its good features behind a paywall. Do you know if Filebeat supports local caching in case the Graylog nodes are unreachable?

With regards to Filebeat: the delivery state of each message is tracked, so if Graylog or the Forwarder becomes unavailable it remembers the last line sent, and when communication is restored it resumes sending from there.

Queuing options can be found here.
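
As a rough sketch of what the relevant part of filebeat.yml could look like (host, port and sizes are placeholders; the Graylog Beats input speaks the same protocol as the Logstash output):

    # filebeat.yml (fragment)
    output.logstash:
      # point this at a Graylog Beats input
      hosts: ["graylog-node1.example.org:5044"]

    # optional: buffer events on disk while the output is unreachable
    queue.disk:
      path: "/var/lib/filebeat/diskqueue"
      max_size: 10GB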

Keep us updated on your progress!
