Process and output buffers are full

I’m new to the Graylog family. I run a test server, and the message rate increased a couple of days ago. As a consequence, the process and output buffers are constantly full.

I have searched around for my issue and found some changes that could help. I applied them, but the buffers are still full.

Here are my specs:

  • VM with 4 vCPUs
  • 8GB RAM
  • 150GB disk

I changed some values:

Elasticsearch conf:

  • max heap size: 2GB

Graylog conf:

  • max heap size: 2GB (it never uses more than 1GB)
  • output_batch_size = 2000
  • outputbuffer_processors = 6
  • processbuffer_processors = 6
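
For reference, here is a sketch of where those settings live; the file paths assume a standard Debian/Ubuntu package install and may differ on your system, and the default JVM flags are trimmed:

```
# /etc/graylog/server/server.conf
output_batch_size = 2000
outputbuffer_processors = 6
processbuffer_processors = 6

# /etc/default/graylog-server  (Graylog heap)
GRAYLOG_SERVER_JAVA_OPTS="-Xms2g -Xmx2g"

# /etc/elasticsearch/jvm.options  (Elasticsearch heap)
-Xms2g
-Xmx2g
```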

But this is not helping.

Getting the heap values from Elasticsearch:

root@graylog:~# curl -sS -XGET "localhost:9200/_cat/nodes?h=heap*&v"
heap.current heap.percent heap.max
563.7mb 27 1.9gb

Input message rate is 600/s.

  • All my extractors are GROK patterns.
  • IO stats do not seem to be the problem.
  • I get “Allocation Failure” entries in the Elasticsearch logs.

Here are the logs:
Graylog (since the last reboot): 2020-11-06T11:11:12.834+01:00 INFO [CmdLineTool] Loaded plugin: AWS plugins 3.3 -
Elasticsearch: [2020-11-06T10:22:37.085+0000][496848][gc,age ] GC(511) - age 4: 584 -

Hope I have given you all the information you need.
Thanks in advance for your help!


Since this morning, even with all the changes above, I have around 350-450k unprocessed messages.

Hope this helps,
Thanks !

What are your extractor metrics? Check whether your extractors take 100 seconds per message.
Performance tuning with Graylog and Elastic is always full of magic.
Check CPU utilization, check that all indices are green, and try to reduce the batch size.

Thanks for your quick reply.

Where can I find the extractor metrics?
CPU utilization is around 70-80%.
All indices are green.

While searching for the extractor metrics, I just saw this:

For the record, I rebooted the services several times (conf changes).

All the remaining messages show the same error.


Go to Inputs and check the specific extractors; make sure none of them is too heavy.
From the screenshot, Elasticsearch had some issues (most probably because of performance, which is classic for this Java thing), but it has now recovered.
About this part:

outputbuffer_processors = 6
processbuffer_processors = 6

Graylog folks usually say these parameters must not exceed the number of cores (even though on this forum we can find that it's not always true, lol).
You have 12 buffer processors but just 4 vCPUs, and part of those are occupied by Elastic.
Damn, that looks bad!
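
As a rough sketch of the sizing logic above (a rule of thumb from this thread, not an official Graylog formula; the two cores reserved for Elasticsearch are an assumption):

```python
# Rule of thumb: process + output buffer threads should fit in the
# cores left over after Elasticsearch takes its share.
def max_buffer_processors(vcpus: int, reserved_for_es: int) -> int:
    """Cores realistically available for Graylog's buffer processor threads."""
    return max(1, vcpus - reserved_for_es)

# The setup in this thread: 4 vCPUs shared with Elasticsearch,
# yet 6 + 6 = 12 buffer processor threads configured.
configured = 6 + 6
available = max_buffer_processors(4, 2)  # assume ES occupies ~2 cores
print(f"{configured} threads competing for ~{available} cores")
```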

Here are some metrics from the extractors.
These are for firewall logs.

The default configuration for these processors is 3 for the output buffer and 5 for the process buffer.
Should I test like this?

outputbuffer_processors = 2
processbuffer_processors = 2


In general your metrics aren't bad, but the maximum values (like 22,000) seem a little suspicious.
I would try to optimize the patterns.
For example, try to modify

And if you only use IPv4, you can replace the IP grok pattern, which also includes IPv6.
Always check the metrics after modifications.
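
Grok patterns compile down to regular expressions, so the cost of the combined `%{IP}` pattern can be illustrated in plain Python; the pattern strings below are simplified stand-ins for the real grok definitions, not the actual ones:

```python
import re

# Simplified stand-ins: grok's IPV4 vs the combined IP (IPv6-or-IPv4) pattern.
IPV4 = r"(?:\d{1,3}\.){3}\d{1,3}"
IP_COMBINED = r"(?:(?:[0-9A-Fa-f]{1,4}:){1,7}[0-9A-Fa-f]{0,4}|(?:\d{1,3}\.){3}\d{1,3})"

line = "action=accept src=192.168.1.10 dst=10.0.0.5"

# Both find the IPv4 address...
assert re.search(IPV4, line).group() == "192.168.1.10"
assert re.search(IP_COMBINED, line).group() == "192.168.1.10"

# ...but the combined pattern tries (and fails) the IPv6 branch at every
# scan position first, which is wasted work on IPv4-only logs.
```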

Yes, 2x2 seems more reasonable (if we decide to believe the Graylog team members); maybe you can even test 1x1, 2x1, or 1x2.

I think this problem is very common with Graylog. Today I have the same problem; I have read a lot of topics here and I can't resolve this issue.
I’m following this topic.


Thanks for your input.

I tried

I did it on all of my GROK patterns, and it worked.

I will test the processor modifications later and come back with the results.

Thanks :slight_smile:

Hi @zouljan

I did test the processors with 2x2, 1x2, and 2x1, but that is not working either. It's actually worse.
I went back to 6x6, but right now I can see that I have over 1.5 million messages.

Does anyone have experience with how many messages per second a single Graylog VM with my specs can ingest?
It seems that this is too much.

EDIT: no index failures for two days :slight_smile:

Thanks !

To ingest more data you need more CPU. With 4 CPUs I would not expect to ingest more than 400 logs per second (of course it depends on the logs, extractors, pipelines, alerts…).
What is the CPU load average? Give us the 1-minute, 5-minute, and 15-minute averages.

Load average (4 cores):

1 min: 10.66
5 min: 10.95
15 min: 10.66
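
For context, a quick way to read those numbers (assuming standard Unix load-average semantics, where load counts runnable plus uninterruptible tasks):

```python
# On a 4-core box, a 1-minute load of ~10.7 means roughly 2.7x more
# runnable work queued than there are CPUs to run it.
vcpus = 4
load_1min = 10.66
saturation = load_1min / vcpus
print(f"~{saturation:.1f}x oversubscribed")
```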

Well, I’ll try to increase CPUs.
Thanks for your input !

You may need this information then:
I also have about 15 GROK extractors, each with one field, for my received syslog messages.
I'm guessing each message passes through all the extractors.
I cannot merge them because I receive different types of messages (firewall).

For example, a few of my extractors :




All extractors process one field in my message: the fields I'm interested in.

Maybe there are too many extractors and I need to find a way to merge them all?
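
One way the merge could work (the patterns below are made-up, simplified stand-ins for the real firewall extractors) is to collapse several per-type extractors into a single alternation, so each message is scanned once instead of ~15 times:

```python
import re

# Hypothetical simplified versions of three separate firewall extractors.
ACCEPT = r"action=accept src=(?P<src>\S+)"
DROP = r"action=drop src=(?P<src>\S+)"
REJECT = r"action=reject src=(?P<src>\S+)"

# Merged: one scan per message; the action becomes a captured field
# instead of being baked into which extractor matched.
MERGED = r"action=(?P<action>accept|drop|reject) src=(?P<src>\S+)"

msg = "action=drop src=203.0.113.7 dst=10.0.0.1"
m = re.search(MERGED, msg)
print(m.group("action"), m.group("src"))  # drop 203.0.113.7
```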


Yes, you get the issue: the processes require 10 CPUs but you have only 4.

I was in this situation before.
One careless regex and your Graylog server is wrecked, and you watch your journal grow.
Client-side parsing is my weapon of choice now.
Filebeat can replace 99.9% of the extractor logic in my case.
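
For anyone who can use it, a minimal sketch of what that client-side parsing looks like in `filebeat.yml` with the dissect processor (the tokenizer and field names are illustrative, not taken from this thread):

```yaml
# filebeat.yml (fragment) -- parse on the client before shipping to Graylog
processors:
  - dissect:
      tokenizer: "action=%{action} src=%{src} dst=%{dst}"
      field: "message"
      target_prefix: "fw"
```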

Unfortunately, I can't play with Filebeat.

I just cleared this huge backlog of unprocessed messages by setting the Message Processors Configuration back to its default… I had changed the order of the Pipeline Processor and the Message Filter Chain in order to start filtering with streams, but it seems that is too heavy…
Now I need to find a solution with pipelines for filtering… but that will be another thread eventually :slight_smile:

Thanks for helping people !


Looks like this change only solved the issue temporarily… (2M messages unprocessed)
If anyone has suggestions, I'll take them!

I can see that my message rate per second is 327.

And also some iowait… between 30 and 40%.

For the record, the VM is now 8 cores with 8GB RAM.


In this case you should always start with the Graylog and Elasticsearch logs.
Are there any errors there?
Usually a growing journal means Graylog can't write messages into ES.

Indeed, I checked.
Same output from Elasticsearch since this thread began: [2020-11-06T10:22:37.085+0000][496848][gc,age ] GC(511) - age 4: 584 -
It's the same pattern, the same kind.
No errors so far in the Graylog log.

Thanks for your help.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.