Web interface just stops responding

The Graylog instance I have set up randomly stops responding in the UI.

So far, I have tried purging the journal and increasing the heap size by 2 GB, and I can’t see a single error in the log. mongod and elasticsearch are both running and healthy.

I’ve restarted it several times to no avail. I need the UI to come up and work before anything else so I can troubleshoot and manage the instance. Is there any option to start Graylog with processing and inputs disabled?

EDIT: This is a production instance.

Steps I tried so far:

  • disabled journaling
  • restarted mongod with verbose logging (vvvv)
  • stopped ES while restarting
  • added firewall rules to prevent traffic to all inputs
  • recreated the journal directory
  • turned on debug logging in log4j2
  • tried increasing the HTTP thread pool to 8, then 32, and back to 6

Does it give you an error message? Can you post the error message that you get? Have you recently upgraded? What version of Graylog are you running? What OS are you running it on? We need some more information about the problem and the system to offer suggestions on causes or fixes.

No error message in the UI; it just does not load. I looked at the developer console, and requests to https://:9000/api/system/sessions take 12 minutes before returning the response {"session_id":null,"username":null,"is_valid":false}

Even with debugging on, after waiting for Graylog to start up and monitoring the server.log file, I don’t see any logs at all. I’ll reply with a redacted version of the log in a bit. I monitored mongod.log and Elasticsearch’s log too, with no relevant messages.
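To put a number on that delay outside the browser, something like the following could be used. This is only a rough sketch: the helper name `timed_get` and the host `graylog.example.org` are made up for illustration (the real host is redacted above), and the 900-second timeout is just an assumption chosen to outlast the 12-minute wait. With a self-signed certificate, `urlopen` would likely need an explicit SSL context as well.

```python
import json
import time
import urllib.request


def timed_get(url, timeout=900.0, opener=urllib.request.urlopen):
    """Fetch a JSON endpoint and return (elapsed_seconds, parsed_body).

    `opener` is injectable so the function can be exercised without a
    live Graylog server; by default it is urllib's urlopen.
    """
    start = time.monotonic()
    with opener(url, timeout=timeout) as resp:
        body = resp.read()
    elapsed = time.monotonic() - start
    return elapsed, json.loads(body)


# Example call (hypothetical host, not taken from the thread):
#   elapsed, session = timed_get(
#       "https://graylog.example.org:9000/api/system/sessions")
#   print(f"{elapsed:.1f}s", session["is_valid"])
```

Running this a few times while the UI is hanging would show whether the slowness is in the API itself rather than in the browser.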

I am running Graylog 3.0.2 on RHEL 7.6.

Here are the debug logs. I’ve excluded lines containing specific lookup adapters and tables.
https://pastebin.com/XUQ5xXR2

Just to clarify, has this ever worked? Are you running Graylog, MongoDB and Elasticsearch on the same node? Also what versions of those are you running? Installed from packages?

From the logs I noticed that the Graylog server did start successfully, but there was one line:

2019-08-14T18:40:50.808Z INFO [AbstractJestClient] Setting server pool to a list of 1 servers: [http://127.0.0.1:9200]

That leads me to believe that perhaps everything is running on a single node? Not a big deal, just trying to better understand.

This has been working fine for months. We onboarded log sources within the past few weeks. I had some issues last week, but when this happened everything was working fine. I did change some pipeline rules, but they did not cause any issues within 2-3 hours after the change.

It is all one node. I’m running mongod v2.6.12 and ES 6.7.0, all packages installed from the official RPM repos. The main problem here is that I can’t get the UI to respond. The logs indicate it might actually be processing logs as they come in.
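For what it's worth, the same inference can be made mechanically from that log line. A minimal sketch, assuming the pool is printed as a bracketed list and that multiple entries would be comma-separated (only the one-server form appears in the log above); the function names are made up for illustration:

```python
import ipaddress
import re
from urllib.parse import urlsplit

# The log line quoted above, copied verbatim.
LOG_LINE = (
    "2019-08-14T18:40:50.808Z INFO [AbstractJestClient] "
    "Setting server pool to a list of 1 servers: [http://127.0.0.1:9200]"
)


def server_pool(line):
    """Extract the Elasticsearch URLs from a 'Setting server pool' line."""
    match = re.search(r"servers: \[(.*)\]", line)
    if match is None:
        return []
    # Assumption: several servers would be separated by commas.
    return [url.strip() for url in match.group(1).split(",") if url.strip()]


def is_loopback(url):
    """Return True if the URL's host is localhost or a loopback IP address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (an ordinary hostname); cannot tell from here.
        return False


pool = server_pool(LOG_LINE)
print(pool, all(is_loopback(url) for url in pool))
```

A pool made up only of loopback addresses suggests Elasticsearch is on the same host as Graylog, though it says nothing about where MongoDB runs.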

Assuming that’s a typo and you’re actually using MongoDB 3.6.12?

Also to clarify, when you say the UI stops responding, you mean you are able to use it and then it stops while navigating or you can’t even get to the login screen?

I’ve had issues where the UI stops responding due to message volume and message processing limitations. Can you provide your CPU/Memory and all or some of your server.conf file?

I have the issue temporarily resolved, but I am still trying to collect information for RCA.

I am running mongodb-server-2.6.12-6.el7.x86_64, and mongod --version shows v2.6.12 as well.
The UI does not load at all. When I curl it, I get the HTML response after about a minute. In a browser, /api/system/sessions and other requests take 12 minutes or so to load, and even then the page doesn’t load. It sometimes loads the login prompt, but it won’t get past that.

I was able to block all connections from outside the server except SSH and the UI to prevent input traffic (I thought I had done this earlier, but it turns out I was blocking the source port when it should have been the destination port), disabled journaling, and restarted it. It slowly came up and I was able to disable the inputs.

The last rule I changed is this (with obvious redactions):

rule "my rule"
when 
    has_field("myfield") && $message.myfield == "unknown"
then
    let newmessage = clone_message();
    route_to_stream("My Stream","<stream id>",newmessage);
end

I removed that. In addition, a rule that would have performed PTR record lookups at a later stage had been moved to an earlier stage; I moved that rule back.

Before restarting again, I looked at the heap usage in the UI and it was almost 100% used. Changing the logging level in the UI is also showing a lot more events than when I tried the same with the log4j2 config.

I’ll slowly bring the inputs back up. If it resumes working fine as before, I will assume that the rule above is the cause.

Glad to hear it’s working… slowly… But according to the documentation, Graylog 3.0 requires MongoDB version 3.6 or later. Perhaps that’s not helping your situation.

http://docs.graylog.org/en/3.0/pages/installation.html#system-requirements
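A quick way to check an installed version against that requirement might look like this. It is just a sketch: the function names are made up, and the 3.6 minimum is the figure quoted above for Graylog 3.0, hard-coded here as an assumption rather than read from anywhere.

```python
import re

# Minimum MongoDB version for Graylog 3.0, as quoted in this thread.
MIN_MONGODB = (3, 6)


def parse_version(text):
    """Pull the first x.y.z version number out of a string.

    Works on strings such as "v2.6.12" or an RPM package name.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version, minimum=MIN_MONGODB):
    """Compare only as many components as the minimum specifies."""
    return version[: len(minimum)] >= minimum


installed = parse_version("mongodb-server-2.6.12-6.el7.x86_64")
print(installed, meets_minimum(installed))
```

With the package name reported earlier in the thread, this yields (2, 6, 12), which falls below the 3.6 minimum.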

Either way… g’luck

+1 same idea.
mongo 4.0 available

I experienced a very slow Graylog startup issue in the past that turned out to be related to a half-duplex network switch issue and the DNS config on the Graylog server.

I think what you and @cawfehman suggested might have been the true problem. I incrementally upgraded MongoDB from 2.6 -> 3.6 -> 4.0 -> 4.2. Before the upgrade, the problem returned and I couldn’t access the UI again. While still very slow, after the MongoDB upgrade it is loading pages, and that might have solved the issue.

Just looking at the log messages, it appears as if Graylog is responding and starting up faster than before.

I’ll monitor for a while and update.

Well, it’s been stable for over 24 hours now. I’ve reinstated that rule I mentioned earlier (with route_to_stream() instead of clone_message()) and I haven’t seen any problems at all. I’ll add that I have also turned off GeoIP processing as a precaution.

Thanks very much for your help and assistance!