Graylog: journal utilization is too high

Ubuntu 20.04.3 LTS
5.4.0-91-generic
graylog-4.2

1 NODE - 24Core, 48GB ram, 7 Drives Hardware Raid 5
Average ingestion rate: 2,500 msg/s
Peak ingestion rate: 5,000 msg/s
Graylog 4.2.4+b643d2b, codename Noir
JVM - PID 1616, Private Build 1.8.0_312 on Linux 5.4.0-91-generic
MongoDB
db version v4.0.27
Elasticsearch
7.10.2

Graylog is saying that journal utilization is too high. I had a similar issue before that I believe was resolved by changing the amount of RAM in my heap settings from 20 GB to 16 GB. I checked the heap settings, and they still appear to be set to 16 GB.

I did upgrade my OS yesterday, which brought the kernel from 5.4.0-89-generic to 5.4.0-91-generic. This was done alongside a Graylog upgrade to patch the Log4j vulnerability.

Everything appeared to be working for more than 24 hours, then I got the journal error. When I checked Elasticsearch's CPU it was close to 90-100% usage. I have turned off all inputs and the CPU is still extremely high.

Best I can tell, Elasticsearch didn't get updated, but honestly I'm not sure where to go from here.

From running top:

1120 elastic+ 20 0 2003.8g 33.9g 16.5g S 1031 72.1 643:32.30 java
1616 graylog 20 0 22.7g 8.2g 9496 S 304.8 17.3 282:20.55 java
1125 mongodb 20 0 1155252 90156 10832 S 32.3 0.2 22:53.21 mongod
248 root 20 0 0 0 0 S 15.5 0.0 8:49.84 kswapd0
98 root 20 0 0 0 0 S 6.8 0.0 1:11.82 ksoftirqd/14
249 root 20 0 0 0 0 S 6.1 0.0 4:28.41 kswapd1
3636 graylog+ 20 0 9528 4116 3252 R 6.1 0.0 0:00.83 top

Attached is a picture showing that all inputs are down, but my output is still going. CPU is still extremely high.

Hello,

If your journal is filling up, check your Elasticsearch status:

curl -XGET http://localhost:9200/_cluster/health?pretty=true

Or you may need to adjust your process buffers.

Example:

processbuffer_processors = 7
outputbuffer_processors = 3

These settings would be in your Graylog configuration file. Rule of thumb: only increase them until the issue is resolved, and don't exceed the number of physical CPU cores you have.
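
For reference, a sketch of how that section might look in /etc/graylog/server/server.conf (the path assumes a package install; the extra values below are only assumed starting points, and graylog-server needs a restart to pick them up):

processbuffer_processors = 7
outputbuffer_processors = 3
inputbuffer_processors = 2
output_batch_size = 500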

Have you tailed your Graylog or Elasticsearch log files?
Have you checked permissions for Graylog and Elasticsearch?
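
For example, with the default package locations (the Elasticsearch log file is named after the cluster, which is "graylog" here):

tail -f /var/log/graylog-server/server.log
tail -f /var/log/elasticsearch/graylog.log

And for a quick ownership/permissions check on the data directories (paths again assume a package install):

ls -ld /var/lib/graylog-server /var/lib/elasticsearch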

EDIT:

If your Elasticsearch status is GREEN, then maybe Elasticsearch is still trying to index the journal messages; depending on how big/how many messages you have, it might take an hour to index those. This could happen if you rebooted your device and Elasticsearch failed in some way. It would also depend on how you upgraded the Graylog server and the procedure that was executed. Under System/Nodes, do you see something like this?

{
  "cluster_name" : "graylog",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 1004,
  "active_shards" : 1004,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
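
You can also watch the journal itself through the Graylog REST API; a sketch, assuming the API is on localhost:9000 (adjust host, port, and credentials to your setup):

curl -u admin -XGET 'http://localhost:9000/api/system/journal'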

Process buffer is set to 5

Output buffer is set to 3

After a reboot, and some patience, I was able to see the output log go to zero. However, what I'm noticing now is that when I go to my dashboards and narrow the query down to, say, 5 minutes, instead of the dashboard covering all 5 minutes, I'm only seeing results from when I started the query. See image:

I have done some further testing. When trying to view live traffic (1-second intervals), I'm seeing about a 1.5-minute delay between when the traffic is generated and when it is shown. I have disabled our noisiest input (servers), and that went down to 3 seconds. The servers generate anywhere from 1,000-3,000 logs per second. This shouldn't be a problem, as I've seen the server push up to 6K logs per second. From what I can tell this appears to be an issue since the upgrade, but I can't say for sure.

I have also tried increasing the processors from 5 and 3 to 10 and 7. Still the same results.

As you know, Elasticsearch grabs the logs and indexes them, so if there is a delay I would first look into Elasticsearch (resources/logs). Next would be the date/time on all devices (NTP). When you stated that the heap was increased, was this on Graylog or Elasticsearch?
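
For the time check, something like this on each box (standard systemd tooling on Ubuntu 20.04):

timedatectl    # "System clock synchronized" should say yes on every device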

I'm assuming that all your buffers (input/process/output) are at 0%?

Adjusting these could increase throughput for processing logs/messages, but if your buffers are at 0%, I'm not too sure that will help the delayed messages in your case.

Process buffer is set to 5

Output buffer is set to 3

inputbuffer_processors = 2

What is the refresh interval of your ES node (as it looks like you have just one)?

You can check it with:

curl -X GET "<ip-address>:9200/<index_name>/_settings?pretty"

Try setting it to 30 seconds, if you can wait that long for data to be indexed by ES:

curl -X PUT "<ip-address>:9200/<index_name>/_settings?pretty" -H 'Content-Type: application/json' -d'
{
  "index" : {
    "refresh_interval" : "30s"
  }
}
'

Making your cluster consist of more nodes could speed things up.


Input buffer is at 0, but process buffer and output buffer are both at 100%

Is ulimit properly configured? It defaults to 1024 and should be 65536 when using Graylog. It could be an unseen configuration problem.

When I run ulimit -u, I get 192601. I'm not sure if that's what I need to be running to check the ulimit.
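
Looking into it, ulimit -u reports max user processes, not open files; the limit applied to the running services should be readable from /proc (a sketch using the PIDs from the top output above, which will change after a restart):

grep 'Max open files' /proc/1616/limits    # graylog
grep 'Max open files' /proc/1120/limits    # elasticsearch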

I have a few other observations that I have noticed.

  1. The output buffer seems to fill up first. Once it hits 100 percent, this appears to cause a chain reaction with the process buffer (see the thread-pool check after this list).

  2. We have multiple inputs: firewall, wireless, etc. However, the biggest culprit appears to be the servers input. I don't know if this is purely because it has the most logs, or because nxlog is sending these logs and Graylog is having a harder time with them than, say, our firewall logs. But when server logs are disabled, the buffers seem to drain the quickest.

  3. The buffers seem to fill up faster when I do a search or open a dashboard.
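
Since the output buffer is the first to saturate, I suspect Elasticsearch can't keep up with indexing; checking the write thread-pool rejections seems like one way to confirm that (assuming Elasticsearch listens on localhost:9200):

curl -XGET 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected'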

I see the same thing when shutting down Elasticsearch; it looks like the problem could reside over there.

Do you have something like ElasticHQ to monitor your cluster? It can give some diagnostics on Elasticsearch.

Unfortunately we don't. We are not running Docker, and I haven't seen a specific version for Ubuntu.

There is no specific version of ElasticHQ for Ubuntu; it's a Python thing. For now it has a small quirk, but one should get it running.
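
Until something like ElasticHQ is in place, the built-in _cat APIs give a rough picture of node and index health (assuming Elasticsearch listens on localhost:9200):

curl -XGET 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu,load_1m,load_5m'
curl -XGET 'http://localhost:9200/_cat/indices?v&s=store.size:desc' | head -n 20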

Do you have any link to instructions that I could follow for the installation? I tend to agree that it's an Elasticsearch problem, as I have tried upping my processor count for both the output and process buffers, and I've changed the batch size. Nothing seems to fix this.

Again, the weird thing is that everything seemed to be working after I made my update, which was supposed to patch the Log4j vulnerability. Should I try reverting back to an older version? Did the latest update make any changes to Elasticsearch? I don't recall that it did.

Let's try some things, @Chase.

Graylog + the OS, with this kind of message volume, possibly needs 8 GB of RAM.
That leaves 40 GB for Elasticsearch.

So /etc/elasticsearch/jvm.options should contain:

  -Xms20g
  -Xmx20g
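
After restarting Elasticsearch, you can verify the heap it actually picked up (again assuming it listens on localhost:9200):

curl -XGET 'http://localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent'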

/etc/security/limits.conf should contain something like this for ulimits:

root       -  nofile  65535
*          -  nofile  65535

/etc/sysctl.conf should contain:
vm.swappiness = 0
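
The sysctl change can be applied without a reboot and verified with:

sudo sysctl -p
sysctl vm.swappiness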

Disable SELinux in /etc/selinux/config
(https://linuxconfig.org/how-to-disable-enable-selinux-on-ubuntu-20-04-focal-fossa-linux)

Then the cores: you have 24, as I read above.

processbuffer_processors = 7
outputbuffer_processors = 3

That makes a total of 10 cores for Graylog and leaves 14 cores for Elasticsearch. If the old parameters did their job, throttle processbuffer_processors down if Elasticsearch is your problem.

Post your Elasticsearch config over here:
grep -v "#" /etc/elasticsearch/elasticsearch.yml

Good luck.

I made the requested changes, but the buffers are still filling up.

Post your Elasticsearch config over here:
grep -v "#" /etc/elasticsearch/elasticsearch.yml

path.data: /var/lib/elasticsearch

path.logs: /var/log/elasticsearch

bootstrap.memory_lock: true

cluster.name: graylog

action.auto_create_index: false

One other thing that I've noted is that I keep getting blank space when I do a search, even if the search is within the last 5 minutes.

I don't know what would have changed in the upgrade, but again, this continues to suggest to me that the problem is more on the Elasticsearch side than the Graylog side.