1. Describe your incident:
Periodically a Graylog pod will start to hammer our Elasticsearch cluster with GET requests to /_all/_alias. This causes Elasticsearch to consume a lot of threads and leaves it in an unstable state.
2. Describe your environment:
OS Information:
Kubernetes cluster with 35 Graylog pods and 1 master
Package Version:
3.3.15
Elasticsearch 6.8.22 on Elastic Cloud.
Service logs, configurations, and environment variables:
3. What steps have you already taken to try and solve the problem?
Tried to find out why these specific calls are made and whether there’s a way to disable them.
4. How can the community help?
I’m looking to understand why these specific /_all/_alias calls are made. I also want to understand what would cause an instance to make hundreds of these GETs per minute. I would understand Graylog making a GET call to /graylog*/_alias or to other specific indexes, but not to /_all.
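To make the difference between the two lookups concrete, here is a minimal sketch that issues the same alias GET against either /_all or a narrower index pattern so the response sizes can be compared. The helper names alias_url and fetch_aliases are hypothetical, the base URL is a placeholder, and authentication (which an Elastic Cloud deployment would normally require) is left out, so treat it as an illustration rather than what Graylog does internally.

```python
import json
import urllib.request


def alias_url(base_url: str, index_pattern: str) -> str:
    """Build the URL for Elasticsearch's GET /<index>/_alias endpoint."""
    return f"{base_url.rstrip('/')}/{index_pattern}/_alias"


def fetch_aliases(base_url: str, index_pattern: str) -> dict:
    """Return the index -> aliases mapping for the given index pattern.

    Assumption: the cluster is reachable without authentication.
    """
    url = alias_url(base_url, index_pattern)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Calling fetch_aliases(base, "_all") and fetch_aliases(base, "graylog_*") against the same cluster, with base set to the cluster URL, shows how many more indices the /_all variant has to enumerate on each request.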
I must say that’s a lot of Graylog pods. I assume this is a very large environment?
What kind of security configurations were made on Graylog? Is it only one Graylog pod, and is it the same one every time? Since there are 35 pods, did you try shutting down the pod that is making all these calls? If so, did you notice any other pods executing the same calls?
I’m curious whether this is an isolated issue or a configuration issue.
It’s processing a few TB a day so it’s not small. During my investigation I found that a major increase in calls to /_all/_alias occurred when I had a tab open on the detailed view for an index set.
You can see a major increase from about 7 requests per second to just under 40 while this tab is open.
Regarding your edit: do you believe that would be happening even if it’s not creating those indexes? We haven’t had an issue with an alias being created as an index. I’ve also confirmed that this only happens when the UI is open and not during regular operations.
In some rare situations, there might be an Elasticsearch index with a name which has been reserved for the deflector of an index set managed by Graylog, so that Graylog is unable to create the proper Elasticsearch index alias.
We’re not actually seeing that. Writes continue to work and the deflector is rotated according to our policy. The only oddity is this high rate of GETs to that alias endpoint. I also see GETs to /graylog_*/_alias at a reasonable rate, without a corresponding spike.