Hello,
Here is a brief conclusion I’ve come to.
It may not be a fix, but more or less a suggestion. I don’t want to tell you to start reconfiguring your environment, since it was and seems to be working fine. The problem, from what you stated, is an increase in memory usage, which doesn’t seem too bad right now.
Since you are ingesting a lot of logs, and I think you have quite a few fields generated from the messages/logs, I believe you may need more memory added. Just an easy, temporary solution.
I was thinking about suggesting you decrease the memory given to Elasticsearch, but that would do no good because you are ingesting 300-500 GB/day, about 3,000-5,000 messages per second. That would just cause more problems.
Some minor suggestions that could be applied:
- If possible, try to decrease the amount of logs being ingested. How to do this depends on how you’re shipping the logs to your Graylog input (see the Winlogbeat sketch after this list).
- Are there a lot of saved searches? Try to reduce those if they’re not needed, along with widgets, dashboards, etc.
- Try to fine-tune this environment; meaning, if you really don’t need something, remove it.
- Try increasing the Field type refresh interval to 30 seconds. You would need to edit your default index set and then manually recalculate the index ranges and/or rotate the indices (a sketch of the API call for that follows this list).
- The errors/warnings in the log are just stating that Graylog could not decode the raw message (RawMessage) sent from the remote device. Tuning your log shippers (NXLog, Winlogbeat, Rsyslog, etc.) to send the proper type of data for the input you’re using might help. For example: if your Windows machines use Winlogbeat as the log shipper, reconfigure it to send only the data you need and try not to ship every event log from those machines (see the sketch below). I also noticed in the logs posted above that you are using GELF, which does create a lot of fields.
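On the index maintenance point above: recalculating ranges and rotating can both be done from the UI under System > Indices, but here is a minimal sketch of the recalculation via the REST API, assuming Graylog listens on http://graylog.example.com:9000 and you authenticate as admin (hostname and credentials are placeholders):

# Rebuild/recalculate index ranges for all indices
# (the X-Requested-By header is required by Graylog for POST requests)
curl -u admin:yourpassword -X POST -H 'X-Requested-By: cli' \
  'http://graylog.example.com:9000/api/system/indices/ranges/rebuild'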
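And on the Winlogbeat point: here is a minimal winlogbeat.yml sketch of what I mean, assuming a Graylog Beats input on graylog.example.com:5044 (the host, port, channels, and event IDs are just examples; adjust them to what you actually need):

winlogbeat.event_logs:
  # Ship only the channels you care about instead of every event log
  - name: Security
    event_id: 4624, 4625, 4648   # e.g., logon/failed-logon events only
  - name: System
    level: critical, error, warning
    ignore_older: 72h

output.logstash:
  # Graylog Beats input
  hosts: ["graylog.example.com:5044"]

Filtering at the shipper both cuts ingest volume and reduces the number of fields Graylog has to index.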
Not knowing exactly when this problem started or what took place before the issue was noticed, it’s hard to say.
From what you stated above, it’s only an increase in memory usage, BUT everything is working correctly? It might just be that you need more resources.
To sum it up:
You can adjust a few different configurations to lower your memory usage, but from everything you have shared, it seems everything is running fine. Am I correct?
I don’t believe there is one single solution to lower your memory usage; I think it’s a few different configuration changes, and to be honest it probably doesn’t even need to happen.
If possible, add more memory to each system (e.g., 2 GB) and watch and trend the data to see if usage increases over time. If it does, then we might need to look further into fine-tuning your environment. If you do add more memory (2 GB), wait a couple of days or a week, and don’t add any new configuration or updates during that time if possible. The more data we have, the better we can find a solution.
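If you want to trend memory over those few days, a quick way to sample it (assuming Elasticsearch is on localhost:9200) is the _cat/nodes API; run it periodically, e.g., from cron, and keep the output:

# Heap and RAM usage per Elasticsearch node
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max,ram.percent'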
If you’re experiencing data loss, Graylog freezing/crashing, gaps in the graphs, etc., then we’ll look further into this ASAP.
EDIT: These are good reads if you have a chance:
- Managing and troubleshooting Elasticsearch memory | Elastic Blog
- Elasticsearch in garbage collection hell
EDIT2: I just noticed this. @Uporaba, did you configure this on purpose?
GRAYLOG_SERVER_JAVA_OPTS="-Xms3g -Xmx3g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:-OmitStackTraceInFastThrow"
Here is mine; maybe mine is just old.
GRAYLOG_SERVER_JAVA_OPTS="-Xms3g -Xmx3g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:-OmitStackTraceInFastThrow "
Might take a look here; not sure what happened.
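One possibility (an assumption on my part, so check your Java version): the CMS flags in my line (-XX:+UseConcMarkSweepGC, -XX:+CMSConcurrentMTEnabled, -XX:+CMSClassUnloadingEnabled) were deprecated in Java 9 and removed in Java 14, so on a newer JVM they would be ignored with a warning or even keep the server from starting, depending on the version. You can check which collectors your JVM still supports with:

# List the GC-related flags this JVM actually accepts
java -XX:+PrintFlagsFinal -version 2>/dev/null | grep -E 'Use(ConcMarkSweep|G1|Parallel)GC'

If UseConcMarkSweepGC is not listed, that would explain why your options line dropped those flags.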