Web UI goes down after adding new input

(Heather) #1

I am running into an issue I can’t solve on a single-node install of Graylog 2.5 on an Ubuntu 16.04 server with the following specs: 32 GB RAM and 8 CPUs.

Graylog is installed and running successfully receiving logs from syslog and http inputs on the node described above.

Issue:

  1. Add the Palo Alto input:
    http://docs.graylog.org/en/2.5/pages/integrations/inputs/palo_alto_networks_input.html
  2. The input is added in the UI successfully and is reported in the UI as: Running and netstat shows that the port in question is listening.
  3. Update the PA NGFW to start sending logs to the IP/port of the Graylog server. As SOON as the config on the FW is updated to send logs to the Graylog server, the Graylog UI dies and can no longer be loaded in a browser. The error in the browser is:
    This site can’t be reached - graylog-sec.local took too long to respond.

The graylog-server server.log does not update at all when this condition occurs (the last log message is that the input is running). If I restart the graylog-server service it shows that the http server and rest api start up and the inputs are running (see the end of this post for the log snippet).

Initially I noticed that there were messages like this in the logs:
WARN [NettyTransport] receiveBufferSize (SO_RCVBUF) for input Palo Alto Networks Input{title=pa-tcp, type=org.graylog2.inputs.syslog.tcp.SyslogTCPInput, nodeId=105bdb6c-01d4-4621-bc7b-81643b0414bd} should be 1048576 but is 212992

I updated the receive buffer defaults from 212992 to a number higher than 1048576, and confirmed the change took effect (the logs no longer report this warning about the PA input).
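
For anyone hitting the same warning, here is a minimal sketch of how the kernel limit behind it can be checked on Linux. It assumes the relevant setting is `net.core.rmem_max`. The 2097152 target is only an example value, since the exact number used above was not stated, and `rcvbuf_ok` is a hypothetical helper name.

```shell
# Required size comes from the WARN line above.
# The 2 MiB target in the hints below is an assumed example value.
required=1048576

# rcvbuf_ok CURRENT REQUIRED: succeeds when the kernel limit covers the requested size.
rcvbuf_ok() {
  [ "$1" -ge "$2" ]
}

# Read the current kernel limit (Linux only); fall back to 0 if the file is missing.
current=$(cat /proc/sys/net/core/rmem_max 2>/dev/null || echo 0)

if rcvbuf_ok "$current" "$required"; then
  echo "net.core.rmem_max=$current is large enough"
else
  echo "net.core.rmem_max=$current is below $required; to raise it (as root):"
  echo "  sysctl -w net.core.rmem_max=2097152"
  echo "  echo 'net.core.rmem_max=2097152' >> /etc/sysctl.conf"
fi
```

The script only prints the commands that would raise the limit, so it is safe to run as a regular user.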

Restarting the graylog-server service, everything starts up successfully according to the logs, but the UI is still unreachable w/ the error I noticed above (site can’t be reached).

Server log snippet:

2019-01-29T16:58:08.106Z INFO  [NetworkListener] Started listener bound to [10.xxx.xxx.xxx:9000]
2019-01-29T16:58:08.108Z INFO  [HttpServer] [HttpServer] Started.
2019-01-29T16:58:08.108Z INFO  [JerseyService] Started REST API at <https://10.xxx.xxx.xxx:9000/api/>
2019-01-29T16:58:08.109Z INFO  [JerseyService] Started Web Interface at <https://10.xxx.xxx.xxx:9000/>
2019-01-29T16:58:08.109Z INFO  [ServiceManagerListener] Services are healthy
2019-01-29T16:58:08.110Z INFO  [InputSetupService] Triggering launching persisted inputs, node transitioned from Uninitialized [LB:DEAD] to Running [LB:ALIVE]
2019-01-29T16:58:08.110Z INFO  [ServerBootstrap] Services started, startup times in ms: {BufferSynchronizerService [RUNNING]=2, JournalReader [RUNNING]=18, InputSetupService [RUNNING]=20, KafkaJournal [RUNNING]=60, ConfigurationEtagService [RUNNING]=119, OutputSetupService [RUNNING]=143, StreamCacheService [RUNNING]=152, PeriodicalsService [RUNNING]=365, LookupTableService [RUNNING]=439, JerseyService [RUNNING]=18915}
2019-01-29T16:58:08.120Z INFO  [ServerBootstrap] Graylog server up and running.
2019-01-29T16:58:08.190Z INFO  [InputStateListener] Input [Syslog TCP/5c4b5a0c878cc5594a1e22a7] is now STARTING
2019-01-29T16:58:08.192Z INFO  [InputStateListener] Input [Palo Alto Networks Input (TCP)/5c5073cc878cc510adf1eff5] is now STARTING
2019-01-29T16:58:08.283Z INFO  [InputStateListener] Input [Palo Alto Networks Input (TCP)/5c5073cc878cc510adf1eff5] is now RUNNING
2019-01-29T16:58:08.285Z INFO  [InputStateListener] Input [Syslog TCP/5c4b5a0c878cc5594a1e22a7] is now RUNNING

(Jesse Hills) #2

Have you tried accessing it via the IP that Graylog is listening on when you are getting the error that graylog-sec.local is taking too long to respond?

(Heather) #3

Yup - both IP and DNS are reporting the same error unfortunately! (tested in multiple browsers too, in case it was a weird Chrome issue)

(Tess) #4

Interesting problem you have there!

Am I right in understanding that you are running Graylog + MongoDB + ElasticSearch together on that one host? Technically speaking it shouldn’t be much of a problem, though it’s suboptimal. Best do some maths on your Java heap sizes.

What are your Java heaps for Graylog and Elastic configured for right now?

Based on your current 32 GB of RAM I would suggest:

  • 1 or 2 GB for Graylog heap, no more (as @jan has suggested before).
  • 10 GB for Elastic, assuming that caching etc will eat up another 10GB.

See also this excellent post by Jan, explaining some more about the math:

You say that you’re restarting the Graylog server service(s). But have you also restarted Elastic and Mongo? These are separate process stacks.

(Dan Torrey) #5

Hi @heather,
Thanks for the details on the issue. I would like to do a bit of troubleshooting to investigate the issue further. You mentioned that after restarting Graylog, the UI was still not accessible. Are logs still being sent from the Palo Alto device after the reboot? I am wondering if you can please test if a reboot + stop sending logs from Palo Alto allows the UI to become responsive again. If so, then can you please reboot once more while sending Palo Alto logs and confirm that the same issue happens?

Activating debug logging for the Integrations plugin might also provide some additional log messages to help investigate the issue. The command below turns on debug logging for the Integrations plugin specifically. Once it has been executed, debug entries stay enabled until the server is restarted.

curl -I -X PUT http://<graylog-username>:<graylog-password>@<graylog-node-ip>:9000/api/system/loggers/org.graylog.integrations/level/debug \
-H 'X-Requested-By: graylog-api-user'

Thanks for your help investigating this!

(Heather) #6

Hey @danotorrey, thanks for your help!

I stopped log forwarding on the Palo Alto device, then stopped mongod, elasticsearch, & graylog-server (in that order), then started all three services again (in that order), and the UI did not become responsive again. So it seems that whatever error condition was caused by sending logs from the PA didn’t resolve once log forwarding stopped. By the way, the PA device is running 8.1.5, in case this matters.

Similarly, making the request to activate debug logging for the integrations plugin failed with the error curl: (7) Failed to connect to 10.xxx.xxx.xxx port 9000: Operation timed out
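
One way to narrow down a timeout like this is to probe the port both from the Graylog host itself and from the client machine. If the port answers locally but not remotely, the network path is a more likely suspect than the server. Below is a minimal sketch, assuming bash and coreutils `timeout` are available. `probe` is a hypothetical helper name, and the host in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# probe HOST PORT [TIMEOUT_SECONDS]: succeeds if a TCP connection can be opened.
# Relies on bash's /dev/tcp pseudo-device and coreutils `timeout`.
probe() {
  local host=$1 port=$2 secs=${3:-5}
  timeout "$secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Example usage (run once on the Graylog host and once on the workstation):
#   probe <graylog-node-ip> 9000 && echo "port reachable" || echo "port NOT reachable"
```

This only tests whether a TCP handshake completes. It says nothing about whether the REST API behind the port is healthy.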

I am going to try rebuilding graylog on a new (same specs) server and recreating this issue, having enabled debug logging on the integrations plugin before adding the new input. Palo Alto log ingestion is the #1 log use case that I am trying to solve with Graylog, so I am motivated to figure out the fix!

(Heather) #7

@Totally_Not_A_Robot thank you for this useful info! I had not considered java heaps at all, (java newb here) but will look into this further. Right now this is a POC system with two tiny inputs (less than 10mb of logs a day), whatever defaults come loaded w/ the apt repo version of graylog, and no one else but me using it, so I “feel” like I should be able to get things running on a single server, but we do plan to move everything to a distributed architecture once we determine that graylog is what we want to move forward with. Potentially I will need to do that sooner rather than later though.

Also, I had only been restarting graylog server, but not mongod or elastic, so good point there.

Edited to add:
Both Graylog and elasticsearch are currently set to 1gb for both Xms and Xmx, so gonna try upping elastic to 10g to see if it changes anything
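
For reference, a small sketch for checking which heap settings are actually configured. The two file paths are assumptions based on a typical Debian/Ubuntu package install and may differ on other setups. `show_heap` is a hypothetical helper name.

```shell
# show_heap FILE...: print any -Xms/-Xmx settings found in the given files.
# Unreadable or missing files are skipped silently.
show_heap() {
  for f in "$@"; do
    if [ -r "$f" ]; then
      grep -oHE -e '-Xm[sx][0-9]+[kKmMgG]?' "$f" || true
    fi
  done
}

# Typical locations on a Debian/Ubuntu package install (assumed; adjust as needed).
show_heap /etc/default/graylog-server /etc/elasticsearch/jvm.options
```

Each match is printed with its file name, so it is easy to see which service a given `-Xms`/`-Xmx` value belongs to.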

(Dan Torrey) #8

Hi @heather
Thanks for the details. This helps a lot. Did you happen to check whether rebooting the entire machine made any difference?

The fact that the logging level API call timed out definitely means that the Graylog server was in a bad state. The challenge will be to find out why.

Please let me know what happens with the new test environment.

If the issue continues to occur with the new test environment, please let us know, and we will continue to help investigate.

(Tess) #9

Yup! I mean, with the original 10 MB per day that @heather mentions, there should be zero issues whatsoever.

It’s interesting that simply starting a data feed to an input could crash anything. That would suggest some very weird data making its way into the input.

@heather, for shits and giggles we could try an experiment! :slight_smile: What happens if you open a Netcat listener on the Graylog box, just on a free port of your liking and then you tell your Palo Alto boxen to send their logging to that port. That’ll give us some idea of what the incoming data looks like.
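
A rough sketch of what that listener could look like. It assumes the OpenBSD flavour of netcat (`nc -l PORT`); the port number and output path are arbitrary examples, and `capture_raw` is just a hypothetical wrapper name.

```shell
# capture_raw PORT OUTFILE: listen on TCP PORT, show whatever arrives,
# and keep a copy in OUTFILE for later inspection.
# Uses OpenBSD netcat syntax; traditional netcat needs `nc -l -p PORT` instead.
capture_raw() {
  nc -l "$1" | tee "$2"
}

# Example (5555 is an arbitrary free port chosen for illustration):
#   capture_raw 5555 /tmp/pa-sample.log
```

If nothing shows up at all after pointing the firewall at that port, the traffic is probably not reaching the host in the first place.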

(Dan Torrey) #10

This is a fantastic idea. Thanks @Totally_Not_A_Robot! Once log data is captured, we can inspect it to see what it looks like, and we can even feed it back into the Graylog input for testing.

@heather Please let us know how the setup of the new test environment goes, and we can continue to troubleshoot from there.

(Heather) #11

Update: using the netcat listener was a great idea, because it helped me diagnose the fact that NO logs are making it from the PA device to the Graylog server (a firewall rule was blocking the traffic). The fact that the Graylog UI seemed to die at the same time that the commit was made to send logs from the PA device was a red herring!

I will have my new test machine up and running later today and will update with the results of the debug log after enabling the PA input.

(Tess) #12

Huh, would you look at that? :smiley: That’s interesting!

So are you saying something along the network is blocking the traffic? Or did you perhaps forget to open the firewall port on the receiving host? That’s always an option.

(Heather) #13

Back to update this thread with the cause and resolution to the issue described:

What ended up being the cause was adding a static route from the PA device directly to the IP address of the Graylog server. The PA device has a limited number of routes from the management plane, for security reasons. So I was having to update the route table to allow logs to flow to the subnet that contained the Graylog node.

I am not sure why that caused the web UI to go “down” and start reporting the “this site can’t be reached” error in the browser, but after a bunch of testing, that change was the culprit. The moment the static route was removed, the UI came back up. I solved this by setting a larger CIDR block that contained the Graylog node, and this resolved the issue. ¯_(ツ)_/¯

(Tess) #14

Huh, imagine that :smiley:

Could it be that the traffic from your browser to the GUI actually passes through the PA device? Thus the traffic to/from your workstation could’ve gotten buggered… Who knows? :smiley:

Also, here, you dropped this \
