Graylog 2.5 REST API Failing - Continued

Hi all,

I let the other thread I started about this die due to lack of time to get back to the issue. Background here: Graylog 2.5 REST API Failing

To answer Andrew’s question, I am running 1.8.0-openjdk-headless.x86_64 1:1.8.0.282.b08-1.amzn2.0.1, which I just upgraded to today. I know I’m running an older version and am now in the process of planning the migration to 4.0, but as of now it’s falling over multiple times a day, and I’ve got a Python script running as a systemd service keeping it alive as a band-aid. Any help would be greatly appreciated.

Thanks!

@finite
Hello,
Before this issue you’re having, was Graylog running well?
Did you apply any updates to the host prior to this issue, as @ttsandrew suggested? Specifically for Java? It’s kind of strange that this happened suddenly.
How have you tried to resolve this issue besides the Python script?

@finite I also meant to ask on the other thread if you could qualify what’s happening when it falls over. Do you have any metrics about the event rate, process/output buffers, anything about system resources, etc.? What about heap usage? Do you see anything about the process getting killed off by the OOM killer, or any OutOfMemory errors in the logs?

Hi @gsmith, indeed, before this started happening my instance was happily churning away at ~50 GB a day, and still is, with the exception of a spike here and there when someone turns on debug mode. I run yum security updates bi-weekly, which seems to have bumped my Java from 1.8.0.265.b01-1.amzn2.0.1.x86_64 to 1.8.0.272.b10-1.amzn2.0.1.x86_64 a few months ago. Running yum yesterday bumped me from 272 to 1.8.0.282.b08-1.amzn2.0.1.x86_64.

Other than the Python band-aid, I’ve turned on debug in the application logs, but I’m not really seeing anything jump out at me that indicates a failure in the API.

@aaronsachs Hi Aaron, that’s the real kicker: resource utilization appears completely normal. The CPU doesn’t break above 50%, and the heap/buffers/memory are within range as well. In my Python keepalive, I threw in some psutil calls for that very reason:

Mar 24 04:40:09 python[6356]: The Graylog service was restarted at 2021-03-24 04:40:09.539326
Mar 24 04:40:09 python[6356]: Memory usage: svmem(total=16214753280, available=9239326720, 
percent=43.0, used=6646013952, free=4742971392, active=6716796928, inactive=4351053824, 
buffers=2138112, cached=4823629824, shared=389120, slab=256032768)
Mar 24 04:40:09 python[6356]: CPU usage: scputimes(user=58052.41, nice=0.19, 
system=16311.24, idle=550130.79, iowait=946.91, irq=0.0, softirq=187.98, steal=1.62, guest=0.0, 
guest_nice=0.0)
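
For reference, a rough sketch of what the keepalive does (simplified; the health endpoint, unit name, and interval shown here are assumptions rather than the exact script):

#!/usr/bin/env python3
# Rough sketch of the keepalive band-aid (simplified, not the exact script).
# Assumes Graylog listens on localhost:9000 and runs as the "graylog-server"
# systemd unit -- adjust both for your install.
import datetime
import subprocess
import time

import psutil
import requests

GRAYLOG_URL = "http://127.0.0.1:9000/api/system/lbstatus"  # assumed lightweight health endpoint
CHECK_INTERVAL = 60  # seconds between probes
TIMEOUT = 10         # seconds before a probe counts as dead

def graylog_alive() -> bool:
    try:
        return requests.get(GRAYLOG_URL, timeout=TIMEOUT).status_code == 200
    except requests.RequestException:
        return False

while True:
    if not graylog_alive():
        subprocess.run(["systemctl", "restart", "graylog-server"], check=False)
        # Log the restart plus a resource snapshot, since utilization looks normal
        print(f"The Graylog service was restarted at {datetime.datetime.now()}")
        print(f"Memory usage: {psutil.virtual_memory()}")
        print(f"CPU usage: {psutil.cpu_times()}")
    time.sleep(CHECK_INTERVAL)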

Checking overnight, I’ve actually had no failures in over 24 hours from the time of this post, so I’m wondering if that Java update did in fact cause this issue. But I’m nowhere near certain.

@finite

Looking over your last post here
When you stated “REST/web APIs seem to just die,” were there any errors shown on the web interface?

I do agree with you; it could be something to do with Java being updated.

@gsmith sadly, the entire web interface is unresponsive. Any request to it times out, so I couldn’t get any error there or via any API call to port 9000.

Oh, so are you able to log into the Web UI? Or does your web interface look like this?

Sorry, yeah, it’s unreachable due to timeouts. Any request to the dashboard just times out.
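
A quick probe along these lines (assuming the default single-port setup on 9000) is one way to confirm it’s a hard hang rather than a connection refusal:

# Assumes the web UI and REST API both live on port 9000 (the 2.x default).
import requests

for url in ("http://127.0.0.1:9000/", "http://127.0.0.1:9000/api/"):
    try:
        r = requests.get(url, timeout=5)
        print(url, "->", r.status_code)
    except requests.exceptions.Timeout:
        print(url, "-> timed out (no response within 5s)")
    except requests.exceptions.ConnectionError as exc:
        print(url, "-> connection refused/error:", exc)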

Hmm… curious. I can’t replicate a timeout right now, but I’ve had those before. I can’t remember what I did to fix it, but what I can remember is checking the following:

SELinux - I had to execute “sealert -a /var/log/audit/audit.log” as root. I did find some warnings.
Firewall - I checked for errors or warnings in the logs.
Reverse proxy - (i.e. nginx) At one point I removed nginx and just ran Graylog over HTTPS directly. This person’s post also showed it worked without a reverse proxy; maybe it’s the same issue?

File permissions - I checked to make sure Graylog had permission to the files it needed (keystore, etc.); see the sketch at the end of this post for one way to spot-check that.
I monitored my graylog-server while loading the dashboard to see if I needed more resources.

I used the command “iotop” the same way, to identify long wait times that stuck out.

Other than that, I’m not too sure.
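
For the file-permissions item above, a rough spot-check sketch (the paths are assumptions for a typical package install; run it as the graylog user, e.g. via sudo -u graylog, to see what Graylog itself can read):

# Spot-check ownership, mode, and readability of the files Graylog needs.
# Paths are assumptions -- adjust for your install and add your keystore/cert
# paths if you run HTTPS.
import grp
import os
import pwd
import stat

FILES = [
    "/etc/graylog/server/server.conf",
    "/etc/graylog/server/node-id",
]

for path in FILES:
    try:
        st = os.stat(path)
    except FileNotFoundError:
        print(f"{path}: missing")
        continue
    owner = pwd.getpwuid(st.st_uid).pw_name
    group = grp.getgrgid(st.st_gid).gr_name
    mode = stat.filemode(st.st_mode)
    # os.access checks the user running this script, hence the sudo -u note above
    readable = os.access(path, os.R_OK)
    print(f"{path}: {mode} {owner}:{group} readable_by_current_user={readable}")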

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.