We have been running one Graylog server for about a year, but recently the UI stopped responding. The server runs Ubuntu 18 with Graylog version 3.1.2. It has two network interfaces, and the environment also has a proxy.
The Graylog server works normally for at most a few minutes after the service is started, and then the UI stops responding. Nothing special appears in the logs before or after the UI hangs, and log collection continues without problems the whole time. Restarting the graylog-server service makes the UI respond again for a few minutes, which should rule out the proxy or some other external factor as the cause.
After the UI stops responding, netstat shows existing connections ending up in CLOSE_WAIT state. After the UI hangs, wget from localhost shows:
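A small sketch for tracking the CLOSE_WAIT build-up over time without netstat: the Linux kernel lists TCP sockets in /proc/net/tcp and /proc/net/tcp6, where state code 0x08 is CLOSE_WAIT. CLOSE_WAIT means the client has closed its end but the local process has not yet closed the socket, so a count that keeps growing on the web port may point at Graylog no longer servicing HTTP connections rather than at the network. Port 9000 is an assumption (Graylog's default HTTP port); adjust it to your http_bind_address. `count_close_wait` is a hypothetical helper name.

```python
import os

# Value of the "st" column in /proc/net/tcp for TCP_CLOSE_WAIT (Linux kernel).
CLOSE_WAIT = 0x08

def count_close_wait(proc_text: str, port: int) -> int:
    """Count sockets in CLOSE_WAIT whose *local* port equals `port`."""
    count = 0
    for line in proc_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        local_addr, state = fields[1], fields[3]
        # local_address is HEXIP:HEXPORT (IPv6 addresses are 32 hex digits)
        local_port = int(local_addr.split(":")[1], 16)
        if local_port == port and int(state, 16) == CLOSE_WAIT:
            count += 1
    return count

if __name__ == "__main__":
    # Assumption: the web interface listens on Graylog's default port 9000.
    total = 0
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        if os.path.exists(path):
            with open(path) as fh:
                total += count_close_wait(fh.read(), 9000)
    print(f"CLOSE_WAIT sockets on :9000 -> {total}")
```

Running it once a minute after a restart would show whether the count starts climbing at the same moment the UI stops responding.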
We have seen this situation on an older version, where it seemed to be caused by the server running out of memory. Check memory usage with top or htop.
The memory (JVM heap) for the server can be set in /etc/sysconfig/graylog-server (on Ubuntu/Debian package installations the file is /etc/default/graylog-server).
Another possibility is the ARP cache, which is flushed by default after 300 seconds.
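For the memory suggestion, a rough sketch that prints the host's available memory next to the JVM heap limit configured for Graylog. It assumes the official packages' layout, where the heap is set via -Xmx inside the GRAYLOG_SERVER_JAVA_OPTS variable in /etc/default/graylog-server (DEB) or /etc/sysconfig/graylog-server (RPM); verify the variable name against your own file. The helper names are hypothetical. Note that the JVM can exhaust its own heap while the host still has free memory, so a healthy MemAvailable does not by itself rule out a memory problem.

```python
import os
import re

# Multipliers for the size suffixes accepted by -Xmx (no suffix means bytes).
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

def parse_meminfo(text: str) -> dict:
    """Return /proc/meminfo fields as {name: value}; values are kB where a unit is shown."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[name.strip()] = int(parts[0])
    return info

def parse_xmx(java_opts: str):
    """Return the -Xmx heap limit in bytes, or None if no -Xmx is present."""
    matches = re.findall(r"-Xmx(\d+)([kKmMgG]?)", java_opts)
    if not matches:
        return None
    number, unit = matches[-1]  # the JVM honors the last -Xmx it is given
    return int(number) * _UNITS[unit.lower()]

if __name__ == "__main__":
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as fh:
            mem = parse_meminfo(fh.read())
        print(f"MemAvailable: {mem.get('MemAvailable', 0) // 1024} MiB, "
              f"SwapFree: {mem.get('SwapFree', 0) // 1024} MiB")
    # Assumption: DEB packages use /etc/default, RPM packages use /etc/sysconfig.
    for path in ("/etc/default/graylog-server", "/etc/sysconfig/graylog-server"):
        if not os.path.exists(path):
            continue
        with open(path) as fh:
            for line in fh:
                if line.startswith("GRAYLOG_SERVER_JAVA_OPTS"):
                    heap = parse_xmx(line)
                    print(f"{path}: -Xmx = {heap // 1024**2} MiB" if heap
                          else f"{path}: no -Xmx set")
```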
Flushing the caches did not help, and the stats show that they really were flushed. The logs have no indication of relevant errors or warnings. This really is a strange problem, as we also have other systems running on the same server (on different ports) and they are working fine. Therefore the problem is somehow related to Graylog, but we haven't been able to figure out how.
By any chance, were any updates applied to your GL server before this happened?
We have only installed standard Ubuntu server updates; the Graylog server itself has not been updated. We needed to disable IPv6 support on the host (in /etc/sysctl.conf), but it has since been restored. One thing we noticed was that some update had messed up the /etc/hosts file so that 127.0.0.1 was mapped to the server name instead of localhost, and there was no entry for localhost at all. We changed it back (127.0.0.1 to localhost, and the other IP addresses to the actual server name). The caches have been flushed since, so name resolution should be fine.
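Since an update had rewritten /etc/hosts, a minimal sketch for checking both the file and what the resolver actually returns. `parse_hosts` and `localhost_ok` are hypothetical helpers; `socket.gethostbyname` goes through the system resolver, so its answer also reflects /etc/nsswitch.conf and DNS, not only /etc/hosts.

```python
import socket

def parse_hosts(text: str) -> dict:
    """Map each address in /etc/hosts-style text to its list of names."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        addr, *names = line.split()
        entries.setdefault(addr, []).extend(names)
    return entries

def localhost_ok(entries: dict) -> bool:
    """True if 127.0.0.1 lists 'localhost' among its names."""
    return "localhost" in entries.get("127.0.0.1", [])

if __name__ == "__main__":
    with open("/etc/hosts") as fh:
        entries = parse_hosts(fh.read())
    print("127.0.0.1 -> localhost in /etc/hosts:", localhost_ok(entries))
    # What the resolver actually returns for both names.
    for name in ("localhost", socket.gethostname()):
        try:
            print(f"{name} resolves to {socket.gethostbyname(name)}")
        except socket.gaierror as exc:
            print(f"{name} does not resolve: {exc}")
```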
Are you running Apache or Nginx?
Actually, the proxy is not on the server; it is Squid on another host. On the server we have set no_proxy for local names and addresses. The same symptoms occur when we use curl or wget from localhost, and we have confirmed that the requests are not going through the proxy server. Therefore the proxy should not be the problem. And again, other services are running fine and are reachable both via the proxy and from localhost.
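To confirm from localhost that the proxy is out of the picture and to time how long the UI takes to answer, a minimal sketch using only Python's standard library. Passing an empty dict to `urllib.request.ProxyHandler` disables proxies regardless of http_proxy or no_proxy, and the timeout turns a hang into a reported failure instead of waiting forever. The URL assumes Graylog's default 127.0.0.1:9000; adjust it to your http_bind_address. `probe` is a hypothetical helper name.

```python
import time
import urllib.error
import urllib.request

def probe(url: str, timeout: float = 5.0) -> str:
    """GET `url` with proxies disabled; report the status or failure with elapsed time."""
    # An empty mapping makes ProxyHandler ignore http_proxy/https_proxy entirely.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    start = time.monotonic()
    try:
        with opener.open(url, timeout=timeout) as resp:
            return f"HTTP {resp.status} in {time.monotonic() - start:.2f}s"
    except urllib.error.HTTPError as exc:  # server answered, but with an error status
        return f"HTTP {exc.code} in {time.monotonic() - start:.2f}s"
    except OSError as exc:  # refused, unreachable, or timed out (URLError is an OSError)
        return f"failed after {time.monotonic() - start:.2f}s: {exc}"

if __name__ == "__main__":
    # Assumption: Graylog's default HTTP address; change to match http_bind_address.
    print(probe("http://127.0.0.1:9000/"))
```

If the connection succeeds but the request then times out waiting for a response, that would be consistent with the CLOSE_WAIT pattern: the listening socket still accepts connections while the application never answers them.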