Graylog 2.5 REST API Failing - Continued

finite · March 24, 2021, 7:05pm

Hi all,

I let the other thread I started about this die because I didn’t have time to get back to the issue. Background here: Graylog 2.5 REST API Failing

To answer Andrew’s question, I am running 1.8.0-openjdk-headless.x86_64 1:1.8.0.282.b08-1.amzn2.0.1 which I just upgraded to today. I know I’m running an older version and am now in the process of planning the migration to 4.0, but as of now it’s falling over multiple times a day and I’ve got a python script running as a systemd service keeping it alive as a band-aid. Any help would be greatly appreciated.

Thanks!
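For readers wondering what such a band-aid might look like, here is a minimal sketch. It is not finite's actual script, which was not posted. It assumes the API answers at http://127.0.0.1:9000/api/ and that the systemd unit is named graylog-server; adjust both for your setup. It restarts only after several consecutive failed probes, so a single slow response does not trigger a restart.

```python
import http.client
import subprocess
import time
import urllib.error
import urllib.request

API_URL = "http://127.0.0.1:9000/api/"  # assumption: adjust to your listen URI
SERVICE = "graylog-server"               # assumption: systemd unit name
MAX_FAILURES = 3                         # hypothetical threshold


def api_alive(url=API_URL, timeout=10.0):
    """Return True if the URL yields any HTTP response within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # e.g. 401: the listener answered, so it is alive
    except (OSError, http.client.HTTPException):
        return False  # timeout, refused connection, broken response, ...


def restart_service(name=SERVICE):
    """Ask systemd to restart the unit (needs sufficient privileges)."""
    subprocess.run(["systemctl", "restart", name], check=False)


def watch(probe=api_alive, restart=restart_service, interval=30.0,
          max_failures=MAX_FAILURES, sleep=time.sleep, rounds=None):
    """Probe repeatedly; restart after max_failures consecutive failures.

    Runs forever when rounds is None. Returns the number of restarts.
    """
    failures = restarts = done = 0
    while rounds is None or done < rounds:
        if probe():
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                restart()
                restarts += 1
                failures = 0
        sleep(interval)
        done += 1
    return restarts
```

Calling watch() with no arguments from the script that the systemd unit starts would run the loop indefinitely. The probe and restart functions are passed in as parameters so the loop can be exercised without touching a real service.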

gsmith · March 25, 2021, 12:17am

@finite
Hello,
Before this issue you’re having, was Graylog running well?
Did you apply any updates to the host prior to this issue, as @ttsandrew suggested? Specifically for Java? It’s kind of strange that this suddenly happened.
How have you tried to resolve this issue besides a Python script?

aaronsachs · March 25, 2021, 12:39am

@finite I also meant to ask on the other thread if you could qualify what’s happening when it falls over. Do you have any metrics about the event rate, process/output buffers, system resources, etc.? What about heap usage? Do you see anything about the process getting killed off by the OOM killer, or any OutOfMemory errors in the logs?
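One way to check that last question is to scan the kernel log and the Graylog server log for the usual signatures. The sketch below is only illustrative: the patterns cover common kernel OOM killer messages and Java's OutOfMemoryError, and the file paths in the usage note are assumptions that vary by distribution and install method.

```python
import re

# Common signatures; the kernel wording differs slightly between versions.
PATTERNS = {
    "kernel_oom": re.compile(r"invoked oom-killer|Out of memory: Kill(ed)? process"),
    "java_oom": re.compile(r"java\.lang\.OutOfMemoryError"),
}


def find_oom_events(lines):
    """Return (kind, line) for every line that matches a known pattern."""
    hits = []
    for line in lines:
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((kind, line.rstrip("\n")))
                break  # one label per line is enough
    return hits


def scan_file(path):
    """Scan one log file, tolerating undecodable bytes."""
    with open(path, errors="replace") as handle:
        return find_oom_events(handle)
```

For example, scan_file("/var/log/messages") and scan_file("/var/log/graylog-server/server.log") would be plausible starting points on an Amazon Linux package install, but both paths are assumptions. An empty result from both makes memory exhaustion a less likely cause.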

finite · March 25, 2021, 4:21pm

Hi @gsmith, indeed, before this started happening my instance was happily churning away at ~50 GB a day, and it still is, with the exception of a spike here and there when someone turns on debug mode. I run yum security updates bi-weekly, which seems to have upped my Java from 1.8.0.265.b01-1.amzn2.0.1.x86_64 to 1.8.0.272.b10-1.amzn2.0.1.x86_64 a few months ago. Running yum yesterday then bumped me from 272 to 1.8.0.282.b08-1.amzn2.0.1.x86_64.

Other than the python band-aid, I’ve turned on debug in the application logs, but I’m not really seeing anything jump out at me that indicates a failure in the API.

@aaronsachs Hi Aaron, the real kicker is that resource utilization appears very normal: the CPU doesn’t break above half, and the heap, buffers, and memory are within range as well. In my Python keepalive, I threw in some psutil calls for that very reason:

Mar 24 04:40:09 python[6356]: The Graylog service was restarted at 2021-03-24 04:40:09.539326
Mar 24 04:40:09 python[6356]: Memory usage: svmem(total=16214753280, available=9239326720, percent=43.0, used=6646013952, free=4742971392, active=6716796928, inactive=4351053824, buffers=2138112, cached=4823629824, shared=389120, slab=256032768)
Mar 24 04:40:09 python[6356]: CPU usage: scputimes(user=58052.41, nice=0.19, system=16311.24, idle=550130.79, iowait=946.91, irq=0.0, softirq=187.98, steal=1.62, guest=0.0, guest_nice=0.0)

Checking overnight, I’ve actually had no failures in over 24 hours as of this post, so I’m wondering if that Java update did in fact cause this issue. But I’m nowhere near certain.
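A note on reading the psutil figures logged in this post: psutil.cpu_times() reports cumulative seconds since boot, so the scputimes line describes the average over the whole uptime rather than the load at the moment of the restart. The sketch below parses the two logged tuples and works out the percentages. The parse_fields helper is a hypothetical convenience for this example, not part of psutil.

```python
import re

MEM_LINE = ("svmem(total=16214753280, available=9239326720, percent=43.0, "
            "used=6646013952, free=4742971392, active=6716796928, "
            "inactive=4351053824, buffers=2138112, cached=4823629824, "
            "shared=389120, slab=256032768)")
CPU_LINE = ("scputimes(user=58052.41, nice=0.19, system=16311.24, "
            "idle=550130.79, iowait=946.91, irq=0.0, softirq=187.98, "
            "steal=1.62, guest=0.0, guest_nice=0.0)")


def parse_fields(text):
    """Turn the printed form 'name(a=1, b=2.5)' into a dict of floats."""
    return {key: float(value)
            for key, value in re.findall(r"(\w+)=([-\d.]+)", text)}


def memory_used_percent(mem):
    """Share of RAM that is not available for new allocations."""
    return (mem["total"] - mem["available"]) / mem["total"] * 100


def cpu_busy_percent(cpu):
    """Average non-idle share of CPU time since boot, over all cores."""
    # On Linux, guest time is already counted inside user and nice.
    total = sum(v for k, v in cpu.items() if k not in ("guest", "guest_nice"))
    return (total - cpu["idle"] - cpu["iowait"]) / total * 100


mem_pct = memory_used_percent(parse_fields(MEM_LINE))
cpu_pct = cpu_busy_percent(parse_fields(CPU_LINE))
print(f"memory used: {mem_pct:.1f}%  cpu busy since boot: {cpu_pct:.1f}%")
```

The memory figure agrees with the logged percent=43.0, and the CPU figure comes out at roughly 12% averaged since boot. To capture the load around a failure instead, psutil.cpu_percent(interval=1) samples usage over the given interval.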

gsmith · March 25, 2021, 10:38pm

@finite

Looking over your last post here:
When you stated “REST/web APIs seem to just die.”, were there any errors shown on the web interface?

I do agree with you, it looks like something to do with Java being updated.

finite · March 25, 2021, 11:23pm

@gsmith sadly, the entire web interface is unresponsive. Any request to it times out, so I couldn’t get any error there or via any API call to port 9000.

gsmith · March 25, 2021, 11:31pm

Oh, so are you able to log into the web UI? Or does your web interface look like this?

finite · March 25, 2021, 11:48pm

Sorry, yeah, it’s unreachable due to timeouts. When you make any request to the dashboard, it just times out.
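A timeout is a different symptom from a refused connection or an HTTP error, and telling them apart can narrow the search. A refused connection usually means nothing is listening. A timeout usually means either that packets are being dropped or that the process accepted the connection and never answered. A small probe along these lines could make that distinction; the URL in the usage note is an assumption based on the port 9000 mentioned above.

```python
import http.client
import socket
import urllib.error
import urllib.request

# Bypass any proxy set in the environment so the check hits the host directly.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, timeout=5.0):
    """Classify one GET as 'ok', 'http <code>', 'timeout', or 'unreachable'."""
    try:
        with _OPENER.open(url, timeout=timeout):
            return "ok"
    except urllib.error.HTTPError as exc:
        return f"http {exc.code}"       # the server answered, with an error
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, socket.timeout):
            return "timeout"            # timed out while connecting or sending
        return "unreachable"            # refused, DNS failure, no route, ...
    except socket.timeout:
        return "timeout"                # connected, but no response arrived
    except (OSError, http.client.HTTPException):
        return "unreachable"            # reset or malformed response
```

For instance, probe("http://127.0.0.1:9000/api/") run on the Graylog host itself, and then the same call against the public address, would show whether the hang is in the Java process or somewhere in front of it. Any "http" result, even a 401, means the listener is still answering.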

gsmith · March 26, 2021, 12:27am

Umm… curious. I can’t replicate a timeout right now, but I have had those before. I can’t remember what I did to fix it. What I can remember is checking the following.

SELinux - I had to execute “sealert -a /var/log/audit/audit.log” as root, and I did find some warnings.
Firewall - I checked for errors or warnings in the logs.
Reverse Proxy - (i.e. nginx) At one point I removed nginx and just ran Graylog over HTTPS. This person’s post also showed that it worked without a reverse proxy, so maybe it’s the same?
File Permissions - I checked to make sure Graylog had permission to the files it needed (keystore, etc.).
I monitored my graylog-server while loading the dashboard to see if I needed more resources.
I used the command “iotop” the same way to identify long wait times that stuck out.

Other than that, I’m not too sure.
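The file-permission item in the list above can be checked quickly by running a small script as the service account. This sketch reports which of a list of paths the current user cannot read. The paths and the account name in the usage note are placeholders, not known locations on finite's host.

```python
import os


def unreadable(paths):
    """Return (path, reason) for every path the current user cannot read."""
    problems = []
    for path in paths:
        if not os.path.exists(path):
            # Also reported when a parent directory cannot be traversed.
            problems.append((path, "missing or not reachable"))
        elif not os.access(path, os.R_OK):
            problems.append((path, "not readable"))
    return problems
```

Saved as a script that calls unreadable(["/etc/graylog/server/server.conf", "/path/to/keystore.jks"]) and prints the result, it would need to run under the account the service uses, for example with sudo -u graylog, since root can read almost everything and would hide the problem.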

system · April 9, 2021, 12:27am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
