Journal error, no throughput


Hi, I'm really a newbie to Linux and Graylog. This was mostly set up and maintained by interns, but now I've got a problem I haven't been able to fix on my own. It was previously working just fine.

1. Describe your incident:
Recently I've been getting this error: "Journal utilization is too high

Journal utilization is too high and may go over the limit soon. Please verify that your Elasticsearch cluster is healthy and fast enough. You may also want to review your Graylog journal settings and set a higher limit."

I also can't see/find anything in Graylog anymore (all streams show throughput = 0 msg).

2. Describe your environment:

  • OS Information:
    Debian 11 on Hyper-V VM
    351 GB disk
    10 GB memory

  • Package Version:
    Graylog 5.1.8+507d172

  • Service logs, configurations, and environment variables:
    Index rotation is set to keep data for 1 year. I'm monitoring about 6 servers and our firewall, that's all.

Everything (MongoDB, Elasticsearch, Graylog) is on the same VM.

The Elasticsearch log says this: [2024-03-11T23:32:48,523][WARN ][o.e.c.r.a.DiskThresholdMonitor] [SRV-011-VM] high disk watermark [90%] exceeded on [jgKl9_TxT5SGFGvLe84uhw][SRV-011-VM][/var/lib/elasticsearch/nodes/0] free: 28.2gb[8.3%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete
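For anyone who wants to double-check the same thing, a few read-only calls against the local Elasticsearch node show cluster health and how full it thinks its disk is (this assumes the default port 9200 and no authentication, so adjust if your setup differs):

# Overall cluster health (status should be green or yellow, not red)
curl -s 'http://localhost:9200/_cluster/health?pretty'

# Disk usage per node, as Elasticsearch sees it
curl -s 'http://localhost:9200/_cat/allocation?v'

# Index sizes, biggest first
curl -s 'http://localhost:9200/_cat/indices?v&s=store.size:desc'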

Last changes:
- As this was set up by an intern, I recently had to do some work on the host machine because I had disk size issues on C:, so things got temporarily moved, then put back and turned back on (the VMs are all on D:).
- I added "Windows successful logon local" with rules/pipelines to check for logons outside office hours and send an alert; this was my first time doing this. I do think it was working when I did it (about 2 months ago).

I really don't monitor Graylog this closely, I just make sure the web console is reachable. We added this to meet client requirements, even though I do find it useful sometimes. I'm a one-person IT department, so I'm not logging on every day and I'm not sure when it stopped logging. I saw the journal error two weeks ago and gave it more disk space to see if that would help, but it doesn't look like it did.

3. What steps have you already taken to try and solve the problem?

I tried restarting MongoDB, Elasticsearch, and Graylog.

Checked server.conf: the "journal size" setting is commented out, so I assume that it is currently at the default size?
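If anyone wants to check the same thing, something like this should show it (assuming the standard Debian package locations, which may not match your install):

# Show the journal settings, commented out or not
grep -n 'message_journal' /etc/graylog/server/server.conf

# Default journal directory for package installs; see how big it has grown
du -sh /var/lib/graylog-server/journal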

Found a command “df -h” that shows me this:
Filesystem Size Used Avail Use% Mounted on
udev 4,9G 0 4,9G 0% /dev
tmpfs 995M 1,2M 994M 1% /run
/dev/sda2 339G 296G 29G 92% /
tmpfs 4,9G 0 4,9G 0% /dev/shm
tmpfs 5,0M 0 5,0M 0% /run/lock
/dev/loop3 64M 64M 0 100% /snap/core20/2105
/dev/loop1 106M 106M 0 100% /snap/core/16574
/dev/loop0 32M 32M 0 100% /snap/glpi-agent/x1
/dev/loop2 64M 64M 0 100% /snap/core20/2182
/dev/loop4 106M 106M 0 100% /snap/core/16202
/dev/sda1 511M 5,8M 506M 2% /boot/efi
tmpfs 995M 72K 995M 1% /run/user/115
tmpfs 995M 60K 995M 1% /run/user/1000
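Most of that 296G used on / is presumably the Elasticsearch data plus the Graylog journal. A rough way to confirm where the space is going (again assuming the default package paths) is:

# Sizes of the usual suspects
du -sh /var/lib/elasticsearch /var/lib/graylog-server/journal /var/lib/mongodb 2>/dev/null

# Or rank the biggest directories under /var/lib
du -h --max-depth=1 /var/lib 2>/dev/null | sort -hr | head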

4. How can the community help?
Any help on how to get Graylog functional again would be appreciated! I'll admit I don't really know what I'm doing and haven't been able to find training. Thank you!


What do the buffers look like on the System > Nodes page, are any of them more than just a little full?
What is the log volume per day on the System > Overview page?

Thanks for responding. Literally was just looking at this: the process and output buffers are at 100%.

Disk journal utilization is at 96%
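I believe the same journal numbers can also be pulled from the command line through the Graylog REST API, something along these lines (assuming the API listens on the default 127.0.0.1:9000 and you substitute real admin credentials):

# Journal size and utilization for this node
curl -s -u admin:PASSWORD 'http://127.0.0.1:9000/api/system/journal'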

I understand that something is full, but I'm not quite sure what or how to fix it. 350 GB of hard drive for the VM seems like it should be enough to me, so I'm guessing it's something else.

In System > Inputs I do see throughput and I can see messages there.

But in search/streams I see nothing. After some reading, could this be related to the pipeline I tried to add a couple of months ago? It was the first time I tried adding a pipeline (I was following something like this: Time based alerts).

If the output buffer is full, then it's probably that Elasticsearch or OpenSearch has stopped storing messages. The most likely reason for that is that its disk is too full; it freaks out at like 80% full.
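Once it hits the flood-stage watermark, Elasticsearch also marks the indices read-only. Newer versions release that block on their own once disk usage drops back under the high watermark, but if it sticks around after you free up space you can clear it manually with something like this (assuming Elasticsearch on the default port with no authentication):

# Remove the read-only block from all indices once there is free disk again
curl -s -X PUT 'http://localhost:9200/_all/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'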

OK, so you think I just need to increase the disk size for my VM? This ran all last year and is supposed to delete old data, but I can increase the size; I just thought it was a bit crazy that 350 GB is not enough. I did run the df command on Linux and I see the following, but I don't know what this /dev/loop stuff means:

/dev/loop3 65536 65536 0 100% /snap/core20/2105
/dev/loop1 108032 108032 0 100% /snap/core/16574
/dev/loop0 32512 32512 0 100% /snap/glpi-agent/x1
/dev/loop2 65536 65536 0 100% /snap/core20/2182
/dev/loop4 108416 108416 0 100% /snap/core/16202

Do I need to do anything in particular after increasing the VM disk size? I just increased it by 25 GB and the buffers are still at 100%.

I tried restarting the VM and I still get the watermark and "indices blocked" errors.

Debian doesn't seem to take the increased disk size into account; I'm not sure what I need to do for that.

In case anyone else is looking:
In addition to adding disk space to my VM, I had to

  1. Resize my partition on Debian. This included having to delete my swap partition in order to resize; I did this with fdisk, following the post by "automatix" here: Auto expand last partition to use all unallocated space, using parted in batch mode - Unix & Linux Stack Exchange

  2. Then resize the file system using: resize2fs /dev/sda2, where sda2 was my partition (rough sketch of the commands below).
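For anyone with a similar layout, the overall shape of it looks something like the sketch below. I used fdisk to delete and recreate the partition, but growpart (from the cloud-guest-utils package) can do that step in one command if nothing like a swap partition sits between your root partition and the new free space; the device names here are from my setup, so adjust to yours:

# Make the kernel notice the larger virtual disk (a reboot works too)
echo 1 | sudo tee /sys/class/block/sda/device/rescan

# Grow partition 2 into the new free space
sudo apt install cloud-guest-utils
sudo growpart /dev/sda 2

# Grow the ext4 filesystem to fill the partition
sudo resize2fs /dev/sda2

# Confirm
df -h /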

The buffers quickly went back down to 0 and I'm seeing my messages again! It took me a whole day's work to research and figure all this out, but that's how you learn, I guess. Thought I'd save anyone else the time because the actual changes themselves were pretty quick!

Thank you again to Joel for pointing me in the right direction and showing me how to check whether things were working.
