That process buffer is what I was talking about… in my experience, it can still lock up a machine even if it shows just one entry that is not idle. Take that redacted message and trace back the path it takes through your Graylog system (input, extractor, stream, pipeline, alert…) and you may find a step that is not efficient… causing the buffer to have something in it. I have a very small system, but from what I understand, these should all show idle. In the few cases I have participated in, the cause seemed to point back to a regex or GROK pattern that was spending too much time on a message trying to find something… particularly in an overload situation or where the incoming message format has changed.
I didn’t quite understand what I need to do. If it’s not too much trouble, could you describe where to click and what to look at? I’m having some trouble understanding what you are writing.
Take a look at the message that shows in the process buffer.
- What input did it come in on?
- Are there any extractors associated with that input? Do they use GROK/Regex?
- Are there any streams that the message would be assigned to based on stream properties that match the message?
- If the message is assigned to a stream or multiple streams, are there pipelines attached to that stream?
- If the message is traversing pipelines, which rules in those pipelines does the message execute, based on passing the when…then… section?
- Do any of the rules that are executed contain GROK or regex?
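One rough way to compare how expensive candidate patterns are on a given message is to time them outside Graylog. The following is only a sketch: it uses Python's `re` module, whose engine differs from the Java regex engine Graylog runs on, and the sample message and pattern names are hypothetical, so treat the numbers as relative hints rather than measurements of your system.

```python
import re
import time


def time_patterns(patterns, message, repeats=100):
    """Return (name, average seconds per search) pairs, slowest first."""
    results = []
    for name, pattern in patterns.items():
        compiled = re.compile(pattern)
        start = time.perf_counter()
        for _ in range(repeats):
            compiled.search(message)
        elapsed = (time.perf_counter() - start) / repeats
        results.append((name, elapsed))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical message and patterns, for illustration only.
sample = "2021-09-01T12:00:00Z host01 sshd[1234]: Failed password for root from 10.0.0.5"
candidates = {
    "anchored_ip": r"from (\d{1,3}(?:\.\d{1,3}){3})$",
    "greedy_prefix": r".*.*from (.*)",
}

for name, seconds in time_patterns(candidates, sample):
    print(f"{name}: {seconds * 1e6:.1f} microseconds per search")
```

Running this against the actual message found in the process buffer, with the patterns copied from the extractors or pipeline rules it passes through, can show whether one pattern stands out.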
You need to understand the path that the message caught in the process buffer takes through Graylog before it’s stored in Elasticsearch. How is it processed?
This may not even be the right road to your solution but it is still a good thing to understand.
I have a lot of messages from different inputs in the buffer. Do I need to analyze all of them, or can I make do with one (provided that extractors are used on that input, etc.)?
I am not 100% convinced this is where we will solve the problem so start small and look at one… more if you have time…
Here is a brief conclusion I have.
It may not be a fix, more of a suggestion. I don’t want to tell you to start reconfiguring your environment, since it was and seems to be working fine. The problem, from what you stated, was an increase in memory usage, which doesn’t seem too bad right now.
Since you are ingesting a lot of logs, and I think you have quite a few fields generated from the messages/logs, I believe you may need more memory added. Just an easy, temporary solution.
I was thinking about decreasing the memory in Elasticsearch, but that would do no good because you are ingesting 300-500 GB per day, about 3000-5000 messages per second. That would just cause more problems.
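As a quick sanity check on those ingest figures, the daily volume and the message rate can be combined into an average message size. This is a back-of-the-envelope sketch that assumes decimal gigabytes (10^9 bytes) and a steady rate over the whole day, neither of which is stated in the thread.

```python
SECONDS_PER_DAY = 86_400


def average_message_bytes(gb_per_day, messages_per_second):
    """Average message size in bytes, assuming decimal GB and a steady rate."""
    return gb_per_day * 10**9 / (messages_per_second * SECONDS_PER_DAY)


# The low and high ends of the range quoted above.
low = average_message_bytes(300, 3000)
high = average_message_bytes(500, 5000)
print(round(low), round(high))  # 1157 1157
```

Both ends of the range work out to roughly 1.2 KB per message, so the two figures are at least consistent with each other.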
Some minor suggestions that could be applied.
If possible, try to decrease the amount of log being ingested. This would depend on how you’re shipping the logs to Graylog INPUT.
Are there a lot of saved searches? Try to reduce those if they are not needed; the same goes for widgets, dashboards, etc…
Try to fine-tune this environment, meaning: if you really don’t need it, remove it.
Try to increase the field type refresh interval to 30 seconds. You would need to edit your default index set and then manually recalculate and/or rotate the indices.
The errors/warnings in the log are just stating that it could not decode the raw message (RawMessage) sent from the remote device. Tuning your log shippers (Nxlog, Winlogbeat, Rsyslog, etc…) to send the proper type of data for the input you’re using might help. Example: if Windows is using Winlogbeat as a log shipper, reconfigure it to send only the data you need, and try not to send every event log from those machines. I noticed in the logs posted above that you seem to be using GELF; this does create a lot of fields.
Not knowing exactly when this problem started or what took place before this issue was noticed it’s hard to say.
From what you stated above, it’s only an increase of memory BUT everything is working correctly? It might just be you need more resources.
Sum it up
You can adjust a few different configurations to lower your memory usage, but from what I see in everything you have shared, everything seems to be running fine. Am I correct?
I don’t believe there is one single solution to lower your memory usage. I think it’s a few different configurations, and to be honest it probably doesn’t even need to happen.
If possible, add more memory to each system (e.g., 2 GB) and watch and trend the data to see if usage increases over time. If it does, then we might need to look further into fine-tuning your environment. If you do add more memory (2 GB), wait a couple of days or a week, and don’t add any new configuration or updates during that time if possible. The more data we have, the better we can find a solution.
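To make "watch and trend" a bit more concrete, a straight line can be fitted to periodic memory readings to estimate a growth rate. This is a minimal sketch with hypothetical readings; it assumes roughly linear growth, which real memory usage (caches, garbage collection) often does not follow.

```python
def linear_trend(samples):
    """Least-squares slope (percent per day) and intercept for (day, percent) samples."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(samples, threshold_percent):
    """Days after the last sample until the fitted line reaches the threshold.

    Returns None when the fitted line is flat or falling.
    """
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    last_day = samples[-1][0]
    return max(0.0, (threshold_percent - intercept) / slope - last_day)


# Hypothetical daily readings: (day number, memory used in percent).
readings = [(0, 64.0), (1, 66.0), (2, 68.0), (3, 70.0)]
slope, _ = linear_trend(readings)
print(f"growth: {slope:.1f} % per day, days until 90 %: {days_until(readings, 90)}")
```

A slope that flattens out after a few days suggests caches filling up rather than a leak; a slope that stays constant is more worth investigating.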
If you are experiencing data loss, Graylog freezing/crashing, gaps in the graphs, etc… then we’ll look further into this ASAP.
EDIT: This is a good read if you have a chance.
- Managing and troubleshooting Elasticsearch memory | Elastic Blog
- Elasticsearch in garbage collection hell
EDIT2: I just noticed this, @Uporaba Did you configure this on purpose?
GRAYLOG_SERVER_JAVA_OPTS="-Xms3g -Xmx3g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:-OmitStackTraceInFastThrow"
Here is mine. Maybe mine is just old.
GRAYLOG_SERVER_JAVA_OPTS="-Xms3g -Xmx3g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:-OmitStackTraceInFastThrow "
You might take a look here. Not sure what happened.
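For reference, the two option strings above can be compared mechanically. The sketch below only splits each string into flags and takes the set difference; it says nothing about which set of flags is preferable.

```python
import shlex


def jvm_flags(opts):
    """Split a JAVA_OPTS string into a set of individual flags."""
    return set(shlex.split(opts))


# The two GRAYLOG_SERVER_JAVA_OPTS values quoted in the thread.
installed = (
    "-Xms3g -Xmx3g -XX:NewRatio=1 -server -XX:+ResizeTLAB "
    "-XX:-OmitStackTraceInFastThrow"
)
older = (
    "-Xms3g -Xmx3g -XX:NewRatio=1 -server -XX:+ResizeTLAB "
    "-XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled "
    "-XX:+CMSClassUnloadingEnabled -XX:-OmitStackTraceInFastThrow "
)

only_in_older = sorted(jvm_flags(older) - jvm_flags(installed))
only_in_installed = sorted(jvm_flags(installed) - jvm_flags(older))
print("only in older config:", only_in_older)
print("only in installed config:", only_in_installed)
```

The only difference is the three CMS garbage collector flags, which are present in the older configuration and absent from the automatically installed one.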
Hello. This was done automatically after installation
Everything you described above didn’t really work during the day. After restarting the virtual machine, I noticed that 19 GB of RAM immediately filled up (as indicated, 16 GB for elastic and 3 for graylog), then RAM began to grow slowly. As a result, after an hour I have this picture:
Hmmm… I did it, and after about 6 hours the RAM is at around 64% (20 GB out of 32). Sometimes it goes down, sometimes up, but it does not climb to 30-31 GB as it did before. I will keep watching on Saturday and Sunday; maybe we have found a solution.
I think we found a clue
Hello. Over the weekend the numbers increased. The memory does not fill up immediately but gradually: at the beginning it was 64%, now it is 74%. As you can see from the graph, this happens on the two nodes where Graylog is located (Graylog is not on the third node), so the problem is not with Elasticsearch but with Graylog. It remains to understand what the problem is. The point where I changed the settings and rebooted the nodes is marked in red. I also see that on the first node (it is the Graylog master) the processor is loaded.
What I know is…
Field type refresh interval to 30 seconds will reduce the load on resources needed.
Java tends to use a lot of memory depending on how many logs are being ingested/indexed, etc…
Having that many shards, and the types of searches being executed, will have an impact on memory.
Elasticsearch, Graylog and MongoDB on the same node could be fighting over resources. If the number of logs didn’t exceed 1000-1500 per second I would rule that out, but you’re receiving over 3000 per second, so it makes me wonder.
So quick question.
- Do all three nodes have Graylog, ES and MongoDB?
- Is es-node-03 the master node?
- Are es-node-02/01 master/data nodes?
From the graph it looks like only two nodes have memory increasing. The “elastic_data-3” is steady. If this is correct, something weird is happening; perhaps a missed configuration?
Can I ask, was es-node-03 always a master node before this issue?
EDIT: I just noticed Graylog is not on es-node-03, so it’s just Elasticsearch and MongoDB?
I’m kind of confused when you stated.
When I stated this…
I was assuming that was right.
I’ve been going over this issue, doing some research on high memory utilization in Graylog.
By chance, what do you see when you execute this on the nodes with high memory usage?
root# free -m
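When reading that output, it helps to separate memory that is genuinely unavailable from memory the kernel holds as page cache and can give back. The sketch below parses `/proc/meminfo`-style text, which is where `free` gets its numbers on Linux; the sample values are hypothetical, and `MemAvailable` is assumed to be present (it is on reasonably recent kernels).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(meminfo):
    """Percent of RAM not available to new work, and percent held as cache/buffers."""
    total = meminfo["MemTotal"]
    unavailable = total - meminfo["MemAvailable"]
    cache = meminfo.get("Cached", 0) + meminfo.get("Buffers", 0)
    return {
        "unavailable_percent": 100 * unavailable / total,
        "cache_percent": 100 * cache / total,
    }


# Hypothetical sample, not taken from the systems in this thread.
sample = """MemTotal:       32000000 kB
MemFree:         1000000 kB
MemAvailable:    8000000 kB
Buffers:          500000 kB
Cached:          7500000 kB
"""

print(memory_summary(parse_meminfo(sample)))
```

On a real node the text would come from `open("/proc/meminfo").read()`. A high "used" figure with a large cache share is usually less worrying than the same figure with almost no cache.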
And is it possible to see this output? I’m curious; something doesn’t seem to add up.
Out of curiosity, which one have you installed: Oracle Java or OpenJDK?
EDIT: I don’t think you mentioned this, but besides the memory usage,
how is everything else working? Have any other problems arisen?
EDIT2: Oversharding, as we talked about before: 515 indices with 7170 active shards.
This is just calculated from one index and not all the other indices you have.
That results in 2 replica shards per primary shard, giving you a total of 9 shards per index.
That would be 3 primary shards + 3 first replicas + 3 second replicas = 9 per index (515 × 9 = 4635 if every index is configured this way). A node with 30 GB of heap memory should have at most 600 shards.
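The per-index arithmetic can be written out as a small helper. This sketch assumes every index uses the same settings as the one that was inspected (3 primaries, 2 replicas), which the thread itself notes may not hold for all 515 indices.

```python
def shards_per_index(primaries, replicas):
    """Total shards for one index: each primary plus its replica copies."""
    return primaries * (1 + replicas)


per_index = shards_per_index(primaries=3, replicas=2)
total_if_uniform = 515 * per_index

print(per_index)         # 9
print(total_if_uniform)  # 4635
```

The gap between 4635 and the 7170 active shards actually reported suggests that some indices are configured with more shards than the one examined here.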
The recommended size of each shard is shown in the document below.
To ensure you’re not going over the shard size, you can execute this:
curl -X GET 'http://localhost:9200/_cat/indices?v'
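If the same endpoint is called with `format=json&bytes=b`, the result is easier to process in a script. The sketch below flags indices whose average primary shard size exceeds a limit. The 50 GB limit is an assumption based on commonly cited Elasticsearch sizing guidance, and the sample rows are hypothetical.

```python
import json


def oversized_indices(cat_indices_json, max_shard_bytes):
    """Return (index, average primary shard bytes) for indices above the limit.

    Expects the JSON body of GET /_cat/indices?format=json&bytes=b.
    """
    flagged = []
    for row in json.loads(cat_indices_json):
        # Closed indices can report null sizes; skip them.
        if row.get("pri") is None or row.get("pri.store.size") is None:
            continue
        average = int(row["pri.store.size"]) / int(row["pri"])
        if average > max_shard_bytes:
            flagged.append((row["index"], average))
    return flagged


# Hypothetical response rows, for illustration only.
sample = json.dumps([
    {"index": "graylog_100", "pri": "3", "rep": "2", "pri.store.size": "180000000000"},
    {"index": "graylog_101", "pri": "3", "rep": "2", "pri.store.size": "30000000000"},
    {"index": "graylog_closed", "pri": None, "rep": None, "pri.store.size": None},
])

LIMIT_BYTES = 50 * 10**9  # assumed upper bound per shard
print(oversized_indices(sample, LIMIT_BYTES))
```

The same loop can be inverted to find very small shards, which is the more likely situation when a cluster is oversharded.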
Not sure if it’s the issue, but it does have an impact on resources, along with all the other things I’ve stated above.
I also want to say that graylog is on node 1 and 2
All 3 nodes have the same settings (except node 3 there is no graylog_journal because there is no graylog)
`openjdk version "1.8.0_302" OpenJDK Runtime Environment (build 1.8.0_302-b08) OpenJDK 64-Bit Server VM (build 25.302-b08, mixed mode)`
No. There are no other problems; only the RAM is heavily loaded.
And now it is filling up very slowly: +20% in 4 days.
My retention period, except for 2 or 3 indices, is 14 days.
At this point I really don’t know. Your setup looks really good.
I assume node one is the master? Or are both of these masters?
For example, are they configured like this?
Node-01 is_master = true
Node-02 is_master = false
@ttsandrew @tmacgbay @cawfehman @tfpk By chance, do you guys have any ideas on this? The only suggestion I can think of, since these servers are running well, is that either Elasticsearch needs to be on its own node, or more memory is needed because of the size of the data being ingested.
Yes, that’s right.
I noticed that the RAM is now filling up slowly, but it’s still growing.
Over the past 2 days, it has grown from 76% to 77%.
I apologize, I’m running out of suggestions to offer you. The only thing I can think of now would be to separate your Elasticsearch instance from Graylog/MongoDB.
This is my conclusion for the size of the environment and the amount of logs being ingested.
I realize this was running with less memory, but something changed in this environment. It could be the number of fields being generated for the amount of logs being shipped. I have seen Java, which Graylog is built on, use a lot of memory. Taking into consideration any and all configuration changes made, I really don’t know. I do know of other members here who handle an equal or greater amount of logs than you have by separating Elasticsearch from Graylog/MongoDB.
Maybe try to think of a better way to configure and/or ingest messages. Saved searches and widgets consume memory, and using wildcards in searches will also use memory. Please keep us updated.
EDIT: out of curiosity what is the output of this command? Just double checking,
sysctl -a | grep -i vm.swappiness
Just an FYI, If you think this might be a bug you could post it here.
vm.swappiness = 30
It looks like we’ll have to change the architecture. What if I change it and it doesn’t solve the problem? When changing the architecture, will there be any delays when sending to Elasticsearch? And should I put MongoDB together with Graylog or keep it on a separate host?
If I am understanding correctly the environment is performing as expected? Do you see any metrics that are out of line? High memory usage especially in a database environment is by design, why would you want your high cost/high performance memory to sit available if it can be allocated and used to increase system performance?
Looking at our environment I see that memory utilization has been near 77% for several days.
I built our Graylog environment monitoring (graylog-server, elasticsearch, and mongo) using this as a starting point:
If you are concerned because of the lack of face-up insight into performance I recommend you start there. The Graylog API makes most of the information you need readily available and it is easily parsed. I monitor the following via the API:
I also monitor jvm heap usage, swap usage, physical memory usage, cpu usage, disk space usage, and number of dropped packets.
I believe that everyone prior in this thread has reviewed your environment and determined it to be healthy. If that is true, and if you are not experiencing symptoms of performance issues, then I think that you do not need to worry.
Sounds like there have been some good ideas and troubleshooting done already, so I won’t rehash it. What I am curious about, though (and perhaps I missed it): are these systems VMs? Are the resources dedicated? How do the ESXi host or hosts look? Also, memory usage isn’t bad, as @ttsandrew pointed out. Have you tried allocating more? As @gsmith pointed out as well, a re-architecture may be needed here. With all those shards, you either need more nodes or more memory. The 20 shards per GB of heap guideline isn’t a hard and fast limitation, but with 16 GB of heap per node across 3 nodes, you are looking at a “recommended” maximum of 960 shards. You have almost 4 times that. Officially, I think you are in “unsupported” territory from an Elasticsearch perspective. Adding RAM and bumping up the heap might be a simple test/fix. But I think you’ll really want to think about separating Graylog/Mongo from ES.
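The shard budget mentioned here follows directly from the 20-shards-per-GB-of-heap rule of thumb. The sketch below just spells out that multiplication; the rule is a guideline rather than a hard limit, and the node count of 3 is taken from the cluster described in this thread.

```python
SHARDS_PER_GB_HEAP = 20  # rule of thumb quoted in the thread, not a hard limit


def shard_budget(heap_gb_per_node, nodes=1):
    """Recommended maximum number of shards for the given heap and node count."""
    return SHARDS_PER_GB_HEAP * heap_gb_per_node * nodes


print(shard_budget(30))           # 600, a single node with 30 GB of heap
print(shard_budget(16, nodes=3))  # 960, three nodes with 16 GB of heap each

active_shards = 7170  # figure reported earlier in the thread
print(active_shards > shard_budget(16, nodes=3))  # True
```

Either raising the heap per node or adding nodes raises the budget; reducing the shard count per index lowers the demand.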
If you have the resources, I would recommend standing up 3 new ES nodes with the same configuration as the current 3 nodes (this facilitates integration with the current environment). Add one new node into the ES cluster at a time: add the first new node (4th) and wait for ES to rebalance the indices. Add the second new node (5th) and wait for ES to rebalance the indices. Add the last new node (6th) into the cluster and let ES rebalance.
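"Wait for ES to rebalance" can be checked against the cluster health API before adding the next node. The function below is a sketch of one possible readiness check on the JSON returned by `GET /_cluster/health`; the sample dictionaries are hypothetical, and the exact criteria (green status, no shards moving) are an assumption about what "rebalanced" should mean here.

```python
def rebalance_finished(health, expected_nodes):
    """True when the cluster is green, has all expected nodes, and no shards are moving."""
    return (
        health["status"] == "green"
        and health["number_of_nodes"] == expected_nodes
        and health["relocating_shards"] == 0
        and health["initializing_shards"] == 0
        and health["unassigned_shards"] == 0
    )


# Hypothetical responses while the 4th node is joining, and after it has settled.
still_moving = {
    "status": "green",
    "number_of_nodes": 4,
    "relocating_shards": 12,
    "initializing_shards": 0,
    "unassigned_shards": 0,
}
settled = dict(still_moving, relocating_shards=0)

print(rebalance_finished(still_moving, expected_nodes=4))  # False
print(rebalance_finished(settled, expected_nodes=4))       # True
```

Polling this every minute or so and only proceeding once it returns True keeps each node addition from overlapping with the previous rebalance.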
At this point, if performance is OK, you can leave the build as it is, but I wouldn’t recommend it. I would decommission the ES portion of the original 3 nodes one by one, so that only Graylog and Mongo are left on those. I think you’ll see a performance increase as well. If you don’t, at that point I would bump the RAM on the new nodes and modify the heap accordingly.