1. Describe your incident:
I’m using Graylog 5.0.3 (single node) on Kubernetes (specifically OpenShift) with an OpenSearch 2.7.0 backend and MongoDB 6.0.4, installed through the Helm chart by KongZ, as part of my PoC of a central logging solution for Kubernetes clusters.
Right now I’m streaming logs from all namespaces and pods of the cluster (a small cluster with 3 masters and 4 workers, each with 16 GB RAM and 8 CPUs), and I’m seeing odd spikes in Graylog’s RAM usage every couple of hours, without a corresponding spike in the volume of logs Graylog consumes:
The drops after the spikes are caused by the memory pressure the node comes under because of Graylog, which sometimes makes it evict the pod and reschedule it.
I’m currently testing with 16 GB of JVM memory (which, given the node’s low capacity, leads to eviction whenever a spike occurs). My question is whether this is normal behavior and whether I should anticipate such spikes in production without a large increase in the volume of logs sent to Graylog.
What might be the cause for it?
2. Describe your environment:
- OpenShift 4.11.35 on CoreOS servers
- Graylog 5.0.3 single node
- OpenSearch 2.7.0
- MongoDB 6.0.4
- Default configuration (from helm chart), OpenSearch without security
3. What steps have you already taken to try and solve the problem?
I have tested Graylog against a massive spike (from 50 logs/s to 4,000 logs/s) and managed to replicate the memory spike (and the eviction by the node), but this also happens under normal load.
4. How can the community help?
I would like to understand whether these spikes are normal and what configuration I should try in order to resolve them (other than increasing the nodes’ resources or lowering the JVM maximum heap).
To me it looks like the garbage collector just coming around and collecting garbage. Is it the same amount as shown in the nodes overview?
Can you clarify if you are seeing this caused by Graylog’s heap or OpenSearch’s heap?
Both are configurable, giving you control over how much RAM/heap they consume.
That chart appears more consistent with OpenSearch memory usage:
Here are examples showing the last 1 hour from my environment:
To answer your question, this is completely normal.
Well, actually, I double-checked and this is Graylog’s behavior, not OpenSearch’s. The memory consumption pattern seems different in my case:
I have created a high-capacity worker specifically for Graylog, with 32 GB RAM (while the JVM maximum allowed memory is 16 GB), and scheduled the Graylog pod on this worker. This happened around 22:00 (my time) yesterday, and since then Graylog’s memory consumption has remained very high (though the pod wasn’t evicted, since the node is no longer under memory pressure).
OpenSearch, on the other hand, remains very steady with low memory consumption. The provided picture shows only one of the 3 masters (on the previous day), but the graph for the other 2 masters is the same (a total of about 4.5 GB of RAM consumed by all the OpenSearch pods). Compared to the OpenSearch pattern provided by @drewmiranda-gl, my OpenSearch memory consumption pattern looked suspicious, but after checking the OpenSearch logs I found entries indicating that data from Graylog is being written to OpenSearch, and Graylog logs indicating that indices are being written and deleted. This, and the fact that I can search for logs from previous days in Graylog, makes me think the problem may not be with my OpenSearch configuration. I would be happy, though, to have some procedure for double-checking that OpenSearch is working as intended.
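As a quick sanity check that OpenSearch is healthy and receiving Graylog’s writes, the standard OpenSearch REST endpoints can be queried directly. The hostname and port below (`opensearch:9200`) are assumptions; substitute your cluster’s service name, and add credentials/TLS flags if security is enabled:

```shell
# Cluster health: expect "green" (or "yellow" on a single-node setup with replicas).
curl -s "http://opensearch:9200/_cluster/health?pretty"

# Graylog's index sets should show up here (e.g. graylog_0, graylog_1, ...)
# with a growing docs.count if messages are being written.
curl -s "http://opensearch:9200/_cat/indices/graylog*?v&h=index,health,docs.count,store.size"

# Per-node heap usage -- handy for telling OpenSearch's heap apart from Graylog's.
curl -s "http://opensearch:9200/_cat/nodes?v&h=name,heap.current,heap.max,heap.percent"
```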
Regarding Graylog, after scheduling it on a larger worker (dedicated to it), I saw that it consumes a steadily high amount of memory (about 10.5 GB), even though the JVM only consumes 4–8 GB (unfortunately I don’t have a graph showing the JVM’s memory consumption over time). Is this considered normal behavior as well? If so, is there a document describing the machine requirements and the recommended ratio of JVM heap to available memory?
Thank you for the help! Much appreciated!
Is this considered normal behavior as well
Without knowing more specifics it’s difficult to say. However, I can say, having a lot of hands-on experience with Graylog (running it since v0.20), that I’ve not seen or personally experienced any memory issues. It’s also important to keep in mind that JVM heap is not the same as JVM memory usage, which in turn is not the same as system RAM usage.
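The gap between configured heap and total process memory can be sketched with a rough back-of-the-envelope model. All the component sizes below are illustrative assumptions, not Graylog measurements, but they show why a 16 GB heap can translate into a much larger resident set:

```python
def estimate_jvm_footprint_gb(heap_gb: float,
                              metaspace_gb: float = 0.3,
                              code_cache_gb: float = 0.25,
                              threads: int = 200,
                              stack_mb_per_thread: int = 1,
                              direct_buffers_gb: float = 1.0,
                              gc_overhead_ratio: float = 0.05) -> float:
    """Very rough estimate of a JVM process's resident memory.

    The heap (-Xmx) is only one component: metaspace, the JIT code
    cache, per-thread stacks, direct (off-heap) buffers, and GC
    bookkeeping all add to the process RSS. Also note that once the
    JVM has grown the heap, it tends to keep it reserved even when
    live data inside it shrinks.
    """
    stacks_gb = threads * stack_mb_per_thread / 1024
    gc_overhead_gb = heap_gb * gc_overhead_ratio
    return (heap_gb + metaspace_gb + code_cache_gb
            + stacks_gb + direct_buffers_gb + gc_overhead_gb)

# With a 16 GB heap, total process memory plausibly exceeds 18 GB
# even when the application's live data is far smaller.
print(round(estimate_jvm_footprint_gb(16), 2))
```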
is there some document about all the requirements from the machine and the JVM memory to allowed memory ratio
My understanding is that the heap should never be configured to more than half (50%) of the available system RAM. 2 GB of heap for graylog-server should be sufficient for most use cases and should be able to process upwards of 1,500–2,000 events per second.
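That rule of thumb can be written down as a tiny helper. The 50% ratio comes from the advice above; the ~31 GB ceiling is general JVM guidance (staying below it lets the JVM keep using compressed object pointers) and is not Graylog-specific:

```python
def suggested_max_heap_gb(system_ram_gb: float, ceiling_gb: float = 31.0) -> float:
    """Upper bound for JVM heap: at most half of system RAM,
    capped below ~31 GB to preserve compressed object pointers."""
    return min(system_ram_gb / 2, ceiling_gb)

print(suggested_max_heap_gb(32))   # half of a 32 GB worker
print(suggested_max_heap_gb(4))    # small node
print(suggested_max_heap_gb(128))  # large node: capped by the ceiling
```

Note this is an upper bound, not a target: at ~100 events/s, a 2 GB heap (well below the bound for a 32 GB worker) is plenty.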
What do you have your graylog heap configured to? Also how many graylog nodes do you have, what is the system ram available to each, and what is your total cluster events per second?
Graylog is configured with a 16 GB heap, while OpenSearch is configured with only 512 MB (I didn’t notice this difference until now; I used the default configuration provided by the Helm chart). I’m running it on a Kubernetes cluster (OpenShift) with a single-node Graylog cluster. The Graylog pod runs on a dedicated OpenShift worker with 32 GB RAM available and no limit on RAM usage.
Currently, my Graylog processes about 100 events per second. I performed a small test and turned off the sender completely, but that didn’t change much: Graylog consumed the same amount of memory as when processing messages (after about an hour without processing any).
For that message volume I would recommend:
- Graylog-server: 2GB heap, 4GB ram
- OpenSearch: 8GB heap, 16GB ram
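Those recommendations would be applied through the Helm values. The exact keys differ between chart versions, so treat the key names below as assumptions and verify them against each chart’s `values.yaml`; the resource limits follow the heap-to-RAM ratios suggested above:

```yaml
# values.yaml override -- key names are assumptions; check your chart version.
graylog:
  heapSize: "2g"                 # JVM -Xms/-Xmx for graylog-server
  resources:
    limits:
      memory: 4Gi                # ~2x heap leaves headroom for off-heap usage

opensearch:
  opensearchJavaOpts: "-Xms8g -Xmx8g"   # OpenSearch JVM heap
  resources:
    limits:
      memory: 16Gi               # heap at ~half of the container's memory
```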
OK, thank you, I will give it a try.
I assume, based on the discussion, that the consistently high RAM consumption is due to the high heap configuration. Is that a correct assumption?
That is my understanding. The heap will grow to a larger size before garbage collection kicks in.
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.