We have been using the open source edition of Graylog for a couple of years and are totally happy with it.
Our setup:
2 load balancers → 2 Graylog servers → 4 Elasticsearch nodes
Graylog version: 3.0.1+de74b68
Unfortunately, we were hit by a strange problem today:
Our GUI access and all API access on the default port 9000 is extremely slow. We also get socket timeouts in the server.log.
The incoming log messages seem to be processed ok as far as we can tell. Also, the sheer amount of log messages doesn’t seem to slow it down because we experience the same problem when no messages are pouring into the system.
The Elasticsearch cluster seems to be ok, MongoDB seems to be ok, DNS resolution is working, there is enough disk space and memory, and the load on the system is minimal. These are the things other people in the forum mentioned.
We are running out of ideas…
Does anyone have a hint for how we could work towards a solution? We have no clue what’s going on.
Thanks for that hint, I will look into this.
It’s kinda strange, but once we stopped trying to find the problem, the cluster sort of healed itself: after 2 hours without any intervention from us, everything worked as smoothly as before and the UI was responsive again.
That sounds like you have a cluster that can be pushed over its limit just by creating new indices or something similar, which is usually a sign that the cluster is carrying more metadata than it should.
Any given hardware can only handle a certain amount of data. What you describe sounds like you are a little over what your resources can handle, because the symptoms you mention are how an overwhelmed system typically starts to show.
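If you want to check whether index and shard count is the issue, something like the following might help. It is only a minimal sketch: it reads Elasticsearch's `_cluster/health` API and works out the number of active shards per data node. The address in `ES_URL`, the function names and the `shard_limit_per_node` threshold are assumptions for illustration, not values recommended by Graylog or Elasticsearch.

```python
import json
from urllib.request import urlopen

# Assumed address of one Elasticsearch node; adjust for your setup.
ES_URL = "http://localhost:9200"


def fetch_cluster_health(base_url=ES_URL, timeout=10):
    """Fetch the JSON document returned by the _cluster/health API."""
    with urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def summarize_health(health, shard_limit_per_node=500):
    """Reduce a _cluster/health response to a few numbers worth watching.

    shard_limit_per_node is an illustrative threshold (an assumption),
    not an official limit.
    """
    # Guard against division by zero if no data node is reported.
    data_nodes = max(health["number_of_data_nodes"], 1)
    shards_per_node = health["active_shards"] / data_nodes
    return {
        "status": health["status"],
        "shards_per_node": shards_per_node,
        "pending_tasks": health["number_of_pending_tasks"],
        "over_limit": shards_per_node > shard_limit_per_node,
    }
```

Calling `summarize_health(fetch_cluster_health())` while the UI is slow, and again when it is responsive, could show whether the shard count or the number of pending cluster tasks differs between the two situations.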