We’ve been debugging a strange issue with Graylog for the past couple of days but we’re stuck so I’m posting here to see if anyone has any insights.
We’re running a 9 node Graylog 2.4.6 cluster on AWS (now c4.4xlarge, was c4.2xlarge) which uses a 9 node Elasticsearch 5.6.9 cluster (now m4.4xlarge, was r4.2xlarge). We’re running 2 hour indices in the main index set and keep a maximum of 84 indices (7 days). We were running 9 shards per index with no replicas. Last week, Graylog CPU load was between 50 and 80% during the day, and ES CPU usage was similar.
I’ll start with the first events (which may or may not be relevant to the current issue).
Late last week we started planning to add a replica to the ES index shards (a second copy of each shard). For that we added 3 more ES nodes (the cluster started relocating shards) and made the replica change in System/Indices. Unfortunately, we underestimated the additional load this would put on the cluster and ended up with an overloaded ES cluster (CPU). Graylog uncommitted messages went really high (up to 60 million), so we started adding new ES nodes to compensate. We ended the day with 18 nodes which weren’t perfectly balanced, and we reverted the replica change back to the original setup (primaries only, no replicas). Since ES wasn’t balanced, it would put multiple shards on a single (new, mostly empty) node, so we had to manually relocate shards every 2 hours.
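For anyone curious, the manual relocation was done through the standard Elasticsearch cluster reroute API; a minimal sketch (the proxy hostname, index name, shard number and node names below are placeholders, not our real values):

```shell
# Move one shard from an overloaded node to a mostly-empty one.
# Hostname, index name, shard number and node names are placeholders.
curl -XPOST 'http://es-proxy:9200/_cluster/reroute' \
  -H 'Content-Type: application/json' -d '{
  "commands": [
    { "move": {
        "index": "graylog_1234",
        "shard": 0,
        "from_node": "es-data-01",
        "to_node": "es-data-10"
    } }
  ]
}'
```

This only needs to be repeated per misplaced shard; ES won’t move a primary somewhere that would violate its own allocation rules, so a rejected command is safe.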
This is when we first noticed the issue I’m writing about, but didn’t think much of it. After we added the extra replica (and then removed it), network traffic from Graylog to ES doubled. We have a detailed Datadog dashboard for Graylog and we saw an increased number of ES writes in Graylog. There was also increased (2x) Network Out traffic (in EC2 monitoring) from the Graylog nodes and increased (2x) incoming traffic to the ES proxy instances as well as to the ELB in front of them. The incoming message rate was unchanged, and since we had reverted the replica increase, so was the number of shards per index. The number of documents per index didn’t change, and neither did the average index (2h) size (around 350-400GB per index). The only thing that changed was network traffic, as if Graylog were sending more data to ES. Since we were focused on fixing the ES issues, we didn’t investigate it further.
After the weekend we focused on scaling ES and Graylog. We moved ES to bigger instance types (excluding the old nodes and adding new ones), ending up with 9 m4.4xlarge instances (plus the 9 old r4.2xlarge nodes, which are on the allocation exclude IP list and completely empty - 0% CPU and disk usage - we plan to terminate them soon). On Tuesday morning we upscaled the Graylog instances as well, moving them to c4.4xlarge. This was done by removing each instance from the DNS/ELB entries, waiting for its traffic to drain, then stopping it, changing the instance type and starting it again. Everything went fine here and all 9 nodes were upscaled.
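The exclude list mentioned above is the standard ES allocation filtering setting; for reference, draining nodes by IP looks roughly like this (hostname and IPs are placeholders):

```shell
# Tell ES to move all shards off the listed nodes.
# Hostname and IP addresses are placeholders.
curl -XPUT 'http://es-proxy:9200/_cluster/settings' \
  -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "10.0.1.10,10.0.1.11"
  }
}'
```

Once the excluded nodes report zero shards they can be terminated safely.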
We only noticed it later, but the ES write count and Graylog outgoing network traffic went back to the pre-replica-issue levels from last week. It seems that rotating the Graylog instances (stop/start) somehow brought ES traffic back to normal (no more unaccounted-for extra traffic).
On Wednesday (since we had a bigger ES cluster now) we went ahead with the replica change. We increased the number of shards from 9 to 18 (for new indices, in the Graylog UI) and added an extra replica. All went well! ES load went up a bit (expected) and there was no change in traffic or anything else from Graylog to ES (verified later while debugging).
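In case anyone wants to double-check the same thing, the shard and replica counts a new index was actually created with can be verified directly against ES (hostname and index name are placeholders):

```shell
# Show only the shard/replica counts for one index.
# Hostname and index name are placeholders.
curl -XGET 'http://es-proxy:9200/graylog_1234/_settings?filter_path=*.settings.index.number_of_shards,*.settings.index.number_of_replicas&pretty'
```

Note that changes made in the Graylog UI only apply to indices created after the change, so the current deflector index still has the old counts until rotation.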
After this everything was great for around 10 hours. Late in the evening, at precisely 19:46 UTC, network traffic from the Graylog nodes to ES went up sharply (2x in 5 minutes). Elasticsearch merges went up, and ES CPU went up as well. Incoming traffic to Graylog stayed the same. The number of documents per index stayed the same. Index size stayed the same. And the actual Graylog messages (verified in Graylog search) stayed the same. It’s as if Graylog is sending more traffic to ES which ends up nowhere.
Here is a screenshot of the metrics: https://zappy.zapier.com/A78D6390-0EED-4749-8AC0-9C65D5796302.png
Because of the spike in ES writes and outgoing network traffic, ES is again under heavier load and we’re seeing uncommitted messages and Graylog output buffers filling up on each of the 9 Graylog nodes. It’s more obvious during the busier parts of the day. We’ve been looking at various things trying to find the cause:
- nothing in the logs for either Graylog or ES
- no changes to the Graylog config at that time
- 19:46 UTC isn’t related to any deflector rotations, so it’s in the middle of an index
- the replicas were added 10 hours earlier, so (probably) not related
- ES is balanced and no shards are relocating
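For completeness, this is roughly how we verified the last point (hostname is a placeholder); any relocating or initializing shards would show up here:

```shell
# Cluster health includes counts of relocating/initializing shards.
# Hostname is a placeholder.
curl -XGET 'http://es-proxy:9200/_cat/health?v'
# Any shard not in STARTED state would be listed by this filter.
curl -XGET 'http://es-proxy:9200/_cat/shards?v' | grep -E 'RELOCATING|INITIALIZING'
# Per-node shard counts and disk usage, to eyeball balance.
curl -XGET 'http://es-proxy:9200/_cat/allocation?v'
```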
I guess both the Graylog and ES configurations (conf files) would be useful; if needed, I’ll paste them here. We’re currently out of ideas for where to look next. We’re close to trying another Graylog node rotation to see if it has the same effect as before.
tl;dr: 10 hours after we doubled the shards per index and added a replica, network traffic from Graylog to ES doubled. Graylog ES writes jumped 2x as well and ES is under more load. The actual number of documents and index sizes remained the same, and incoming messages haven’t changed at all.
Thanks everyone in advance. We’re a bit out of ideas right now.