Hello. I have a cluster of 3 machines, each with 16 cores and 32 GB of RAM. For the last 2 weeks there has been a problem: 28-29 GB of the 32 GB fills up. I suspected the cache and cleared it, but the memory was not freed. Restarting Elasticsearch itself also does not help. On 2 of the machines Graylog is also running alongside it; restarting it did not help either. I executed curl "localhost:9200/_nodes/jvm?pretty&human"
and here is the command output log: Output logs. What other data is needed to diagnose and understand what is happening? My Graylog version is 4.1.6 and Elasticsearch is 7.10.2.
Hello @Uporaba
Showing how you set up your Graylog cluster (i.e. config files, logs, etc.) would help.
There could be multiple reasons why this is happening.
@gsmith, hello. Here are the graylog.conf settings, the Elasticsearch config, jvm.options, and the VM characteristics:
The settings on the other nodes are identical.
What other data needs to be provided to understand what is going on?
Hello
Thanks for the added info.
I need to ask some questions about this setup.
- How much data/messages are you ingesting (i.e. per second or per day)?
- From the screenshot above, what are your top services using RAM?
- What do you have set for the Graylog heap? Depending on what type of OS you have, its location may vary (a quick way to check is sketched after this list).
RPM PACKAGE
/etc/sysconfig/graylog-server
or
DEB PACKAGE
/etc/default/graylog-server
- What do you see in Graylog log file?
- What do you see in elasticsearch log file?
- What do you see in MongoDb log file?
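For the Graylog heap question above, a quick way to check what is set is something like this (the paths are the package defaults mentioned above; adjust if your install differs):
grep GRAYLOG_SERVER_JAVA_OPTS /etc/sysconfig/graylog-server 2>/dev/null || grep GRAYLOG_SERVER_JAVA_OPTS /etc/default/graylog-server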
Out of curiosity, can I ask why you have three data paths on the same node?
path.data:
- /var/lib/elasticsearch/data1
- /var/lib/elasticsearch/data2
- /var/lib/elasticsearch/data3
The purpose is data integrity, so if a drive crashes you can recover from it.
Example:
path.data:
- /mnt/elasticsearch/data1
- /mnt/elasticsearch/data2
- /mnt/elasticsearch/data3
If all the data directories are on the same drive, that's a lot of I/O, plus it serves no real purpose that I can see.
The reason I say this is because you have path.repo: ["/var/backup/graylog"] set, which is also on the same drive.
I'm assuming you set these up yourself? (A common way to generate them is sketched just below.)
- password_secret =
- root_password_sha2 =
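For reference, these are usually generated with something along these lines; "yourpassword" is just a placeholder and the exact commands depend on the tools you have installed:
pwgen -N 1 -s 96                                      # random string for password_secret
echo -n "yourpassword" | sha256sum | cut -d" " -f1    # SHA-256 hash for root_password_sha2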
To sum it up, the configuration looks good, but you have a lot of your fault-redundancy backups on the same drive, and this could be part of the issue.
Just a suggestion, maybe look into something like this in the future.
/dev/mapper/centos-root 83G 6.9G 76G 9% /
/dev/sda1 194M 126M 69M 65% /boot
/dev/sdb1 296G 109G 172G 39% /mnt/elasticsearch/data1
/dev/sdc1 296G 109G 172G 39% /mnt/elasticsearch/data2
/dev/sdd1 296G 109G 172G 39% /mnt/elasticsearch/data3
### path.repo: ["/mnt/my_repo"] <--- Elasticsearch config
/dev/sde1 296G 109G 172G 39% /mnt/my_repo
I see you have set the Elasticsearch heap to 16 GB, which I think is half the memory on each node?
Depending on how you set the Graylog heap (from the question above), you may have over-committed the memory allocation on each node. This also depends on how many logs are being ingested.
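As a very rough back-of-the-envelope budget for one 32 GB node running all three services (the Graylog heap figure below is only an assumption until you confirm it):
Elasticsearch heap            16 GB
Graylog heap (assumed)        ~3 GB
MongoDB + OS overhead         ~1-2 GB
-------------------------------------
Left for the OS page cache    ~11-12 GB
Elasticsearch leans heavily on the OS page cache for the index data on disk, so depending on how you measure it (used vs. buff/cache), a node like this showing 28-29 GB "in use" is not necessarily abnormal.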
Couple more questions.
- By chance do you have swap configured? (A quick OS-level check for this and the file descriptors is sketched below.)
- Check on max_file_descriptors from your OS.
- What is your configuration on the indices used? Example:
Sorry about all the questions; I'm trying to narrow down what the issue or issues could be.
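For the swap and file-descriptor questions, a quick way to check from the OS side is something like this (standard Linux commands; adjust for your distribution):
free -h              # shows total/used swap on the bottom line
swapon --show        # lists active swap devices (no output means swap is off)
ulimit -n            # max open file descriptors for the current shell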
EDIT:
I forgot to mention, you can execute this curl command to find out what is going on with your heap, etc. Perhaps it will show some info that can be used.
curl -XGET "http://localhost:9200/_cat/nodes?v=true"
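If you just want the key columns, you can also narrow it down, e.g. (these are the standard _cat/nodes column names):
curl -XGET "http://localhost:9200/_cat/nodes?v=true&h=name,master,heap.percent,ram.percent,cpu,load_1m"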
Well, there are a lot of questions, but I will try to answer them all. I hope my knowledge of English is enough.
It varies, from 300 to 500 GB per day.
The per-second rate also varies. Especially when a DDoS is going on, it is mostly 3000-8000 messages per second.
Elasticsearch and Graylog.
Graylog memory setting:
GRAYLOG_SERVER_JAVA_OPTS="-Xms3g -Xmx3g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:-OmitStackTraceInFastThrow"
Nothing, the logs are all clean.
This setting is applied on all 3 nodes. Data is stored on different disks for system performance.
The disks are partitioned using LVM.
Yes, I removed these so as not to show my passwords to everyone on the Internet.
This problem began to manifest itself recently; before that, the service worked properly for ~8 months with these settings.
No, it's off.
[root@elastic uporaba]# curl -X GET "localhost:9200/_nodes/stats/process?filter_path=**.max_file_descriptors&pretty"
{
"nodes" : {
"77JvB6XvS6K-x_Cswz21eA" : {
"process" : {
"max_file_descriptors" : 65535
}
},
"8yNrg7oLS6yhGCGC37b4tg" : {
"process" : {
"max_file_descriptors" : 65535
}
},
"7m1bYlXVTxO5ZDlCJ7nkZQ" : {
"process" : {
"max_file_descriptors" : 65535
}
}
}
}
The settings for all indices are the same.
That's all right; if I had initially known what data was needed, I would have attached it myself.
Hello,
Thanks for the added info.
Ah good, I was like, holy cow, all the replicas and data directories are on the same disk/HDD.
Something that caught my eye: the node es-node-01 in the screenshot is using 3 times more CPU than the other two nodes. At first I thought it might be the master node, but it's not. Unsure if that was just a random metric or if it is always like that?
Sum it up:
- 3-node cluster with all three services on each node.
- Each node's resources are 16 cores and 32 GB of RAM.
- Each Elasticsearch node has 16 GB of heap allocated.
- Graylog/Java has 3 GB of heap allocated.
- Log ingestion is 300-500 GB a day, about 3000-8000 messages per second.
- 515 indices with 7170 active shards (that's a lot for a three-node cluster).
- All Logs are clean
- I’m assuming the Process/Output/Input buffers & Journal are good.
A couple of questions on this statement:
- Was this gradual or an overnight issue?
- What changed before this issue started?
- Was there always 300-500 GB a day?
- Any updates applied? Server rebooted?
- Plugins installed?
- Do you have regex extractors, GROK patterns, or pipelines configured?
It's been known that certain Java versions can increase memory usage. I was curious about the amount of logs per day; if it was lower before (say 200-300 GB) and then increased, that could also have had an impact on resources. A bad regex expression or bad GROK pattern could also be a culprit for high memory usage.
You do have large amounts of data, judging from the messages per day and how many shards are being generated.
From what I'm seeing, this is pretty normal for the amount of memory being used and for having all three services on each node. This is why the documentation suggests that Elasticsearch should be on its own node and be given as much memory as possible. That way Graylog/MongoDB are not fighting over resources.
To be honest, I feel something was changed: either the log shippers are sending a lot more logs, or perhaps new configurations or updates were made. Were you trending data on these cluster servers for the past month or two? If so, did you see anything that may pertain to this issue?
To give you an idea, here is my lab GL server with all three services on one node. It has 12 CPUs, 12 GB of memory, and a 500 GB drive, with 4 GB of RAM for Elasticsearch and 3 GB for the GL heap. This server is only ingesting 30 GB a day. No replicas, only 4 shards per index, and retention is 1-day rotation, deleted after 30 days.
As you can see I’m using about the same percentage of memory as you.
In the forum there have been issues with "over-sharding" tying up memory. I'm not sure if this pertains to your issue, but even if it doesn't, the posts below are a good read.
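If you want to see how the shards are spread across the nodes, something like this should show it (standard cluster APIs):
curl -XGET "http://localhost:9200/_cluster/health?pretty"      # overall active_shards count
curl -XGET "http://localhost:9200/_cat/allocation?v=true"      # shards and disk usage per node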
Hello again
I have configured each node to be both a data and a master node (this can be seen in the configuration file that I sent). If one of the nodes goes down for a reboot or fails, another node takes on the master role (the so-called Raft methodology, if memory serves).
And what settings do I need to make so that everything is set up properly?
I do not know; one day when I checked the server, the RAM had filled up, so it most likely happened at night.
No
Yes, always.
I rebooted only after I saw that the RAM was full and restarting Elasticsearch and Graylog did not help.
No
Yes, but everything worked fine before that, so I don't think that's the problem. The service worked for 4 years before I switched it to a cluster architecture ~8 months ago.
And a question from me: what time zone do you live in, and could we get in touch outside the forum, in Telegram or somewhere else, to work on the problem online together, if that doesn't bother you? When you write messages, it is 4-5 AM for me.
Hello,
I'm not very good at troubleshooting regex/GROK patterns, but @tmacgbay might be able to jump in here.
My time zone is Central (UTC-6). I don't mind; as for communicating, how about Discord or Zoom? You can DM me in here if you like.
This issue is starting to sound like bad regex extractors or GROK patterns, but I'm not 100% sure. Just a thought: it might be from a burst of messages after which something broke, and since it is affecting all the GL servers at once, I'm leaning towards the extractors right now…
Happy to take a look at what you have going on if you want to post the regex/GROK and an example message. If you search the forums for "GROK lock"
there are a couple of posts I have put out there, like this one.
Hi, I didn't quite understand what I am required to do. Do you need any additional data or not?
@gsmith is suggesting that during high-volume time periods a regex or GROK statement gets overloaded and could possibly lock up or slow down your process buffers. In the link, and if you search the forum, you can find out more about that issue and where to view your process buffers when the issue is happening. If you think there may be a regex or GROK statement that is inefficient, you can post it here (as well as an example message) and I am happy to take a look. I am by no means an expert, but I have dealt with process buffers locking up before.
One of the best ways to make GROK or regex more efficient is to lock it to the beginning ^ or the end $ of the message; otherwise it will shift through the message attempting a match… and when you are processing thousands of messages, that can get very inefficient…
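As a toy illustration of the difference (using grep here as a stand-in for the regex engine inside an extractor; the pattern and the sample line are made up for this example):
echo '2022-03-24T13:53:15Z 10.101.15.160 GET /index.html 200' > /tmp/sample.log
# Unanchored: the engine retries the match at every position in the line before giving up on a non-match.
grep -E '[0-9]{1,3}(\.[0-9]{1,3}){3} GET' /tmp/sample.log
# Anchored to the start of the line: a line that does not begin with a timestamp is rejected almost immediately.
grep -E '^[0-9T:.Z-]+ [0-9]{1,3}(\.[0-9]{1,3}){3} GET' /tmp/sample.log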
@tmacgbay
As I understand it, we are talking about this now?
Here's what it shows where you asked me to click (I hid some confidential information, I hope that's not a problem):
Or do you need a full message?
Even today, in the Graylog logs, I noticed the following entries:
2022-03-24T13:53:15.001Z ERROR [DecodingProcessor] Unable to decode raw message RawMessage{id=c0763e91-ab79-11ec-b760-001a0000144c, messageQueueId=28898782352, codec=gelf, payloadSize=1025, timestamp=2022-03-24T13:53:15.001Z, remoteAddress=/10.101.15.160:40611} on input <5f367557046ddce7db14e9a3>.
2022-03-24T13:53:15.001Z ERROR [DecodingProcessor] Error processing message RawMessage{id=c0763e91-ab79-11ec-b760-001a0000144c, messageQueueId=28898782352, codec=gelf, payloadSize=1025, timestamp=2022-03-24T13:53:15.001Z, remoteAddress=/10.101.15.160:40611}
java.io.EOFException: Unexpected end of ZLIB input stream
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240) ~[?:1.8.0_302]
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158) ~[?:1.8.0_302]
at com.google.common.io.ByteStreams$LimitedInputStream.read(ByteStreams.java:731) ~[graylog.jar:?]
at com.google.common.io.ByteStreams.toByteArrayInternal(ByteStreams.java:181) ~[graylog.jar:?]
at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:221) ~[graylog.jar:?]
at org.graylog2.plugin.Tools.decompressZlib(Tools.java:217) ~[graylog.jar:?]
at org.graylog2.inputs.codecs.gelf.GELFMessage.getJSON(GELFMessage.java:74) ~[graylog.jar:?]
at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:125) ~[graylog.jar:?]
at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:153) ~[graylog.jar:?]
at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:94) [graylog.jar:?]
at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:90) [graylog.jar:?]
at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:47) [graylog.jar:?]
at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
That process buffer is what I was talking about… in my experience, it can still lock up a machine even if it shows just one processor that is not idle. Take that redacted message and wind back the process it goes through on your Graylog system (input, extractor, stream, pipeline, alert…) and you may find an area that is not being efficient… causing the buffer to have something in it. I have a very small system, but from what I understand, these should all show idle. In the few cases I have participated in, it seems to have pointed back to a regex or GROK command that was spending too much time on a message trying to find something… particularly in an overload situation or where the incoming message format has changed.
I didn't quite understand what I needed to do. If it's not difficult, could you describe where to click and what to look at? I have some trouble understanding what you are writing.
Take a look at the message that shows in the process buffer.
- what input did it come in on?
- Are there any extractors associated with that input? Do they use GROK/Regex?
- Are there any streams that the message would be assigned to based on stream properties that match the message?
- If the message is assigned to a stream or multiple streams, are there pipelines attached to that stream?
- If the message is traversing pipelines, what are the rules in those pipelines that the message executes, based on passing the when…then… section?
- Do any of the rules that are executed contain GROK or regex?
You need to understand the path that the message caught in the process buffer passes through in Graylog before it is stored in Elasticsearch. How is it processed?
This may not even be the right road to your solution but it is still a good thing to understand.
I have a lot of messages from different inputs in the buffer. Do I need to analyze all of them, or can I get by with one (provided that extractors and so on are used on that input)?
I am not 100% convinced this is where we will solve the problem so start small and look at one… more if you have time…
Hello,
Here is a brief conclusion I have.
It may not be a fix but more or less a suggestion. I don't want to tell you to start reconfiguring your environment, since it was and seems to be working fine. The problem, from what you stated, was an increase in memory usage, which doesn't seem too bad right now.
Since you are ingesting a lot of logs, and I think you have quite a few fields generated from the messages/logs, I believe you may need more memory. Just an easy, temporary solution.
I was thinking about decreasing the memory for Elasticsearch, but that would do no good because you are ingesting 300-500 GB a day at about 3000-8000 messages per second. That would just cause more problems.
Some minor suggestions that could be applied:
- If possible, try to decrease the amount of logs being ingested. This would depend on how you're shipping the logs to the Graylog INPUT.
- Are there a lot of saved searches? Try to decrease those if not needed, along with widgets, dashboards, etc.
- Try to fine-tune this environment, meaning: if you really don't need it, remove it.
- Try to increase the field type refresh interval to 30 seconds. You would need to edit your default index set and then manually recalculate and/or rotate the indices.
- The errors/warnings in the log are just stating that Graylog could not decode a raw message (RawMessage) sent from the remote device. Tuning your log shippers (i.e. NXLog, Winlogbeat, Rsyslog, etc.) to send the proper type of data for the input you're using might help. Example: if Windows machines are using Winlogbeat as the log shipper, reconfigure it to send only the data you need and try not to send every event log from those machines. I noticed in the logs posted above that you are using GELF, which does create a lot of fields (a rough way to gauge the field count is sketched below).
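If you want a rough idea of how many fields GELF has created, something like this against the active write index can give a ballpark figure (graylog_deflector is the default write alias; adjust it if your index prefix is different):
curl -s -XGET "http://localhost:9200/graylog_deflector/_mapping?pretty" | grep -c '"type"'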
Not knowing exactly when this problem started or what took place before the issue was noticed, it's hard to say.
From what you stated above, it's only an increase in memory usage, BUT everything is working correctly? It might just be that you need more resources.
Sum it up
You can adjust a few different configurations to lower your memory usage, but from everything you have shared, what I do see is that everything seems to be running fine. Am I correct?
I don't believe there is a single solution to lower your memory usage; I think it's a combination of a few different configurations, and to be honest it probably doesn't even need to happen.
If possible, add more memory to each system (i.e., 2 GB) and watch and trend the data to see if usage increases over time. If it does, then we might need to look further into fine-tuning your environment. If you do add more memory (2 GB), wait a couple of days or a week, and don't add any new configurations or updates during that time if possible. The more data we have, the better we can find a solution.
If you are experiencing data loss, Graylog freezing/crashing, gaps in the graphs, etc., then we'll look further into this ASAP.
EDIT: This is a good read if you have a chance.
- Managing and troubleshooting Elasticsearch memory | Elastic Blog
- Elasticsearch in garbage collection hell
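To keep an eye on heap pressure and garbage collection directly, this filtered stats call can also help (standard _nodes/stats fields):
curl -XGET "http://localhost:9200/_nodes/stats/jvm?filter_path=**.heap_used_percent,**.gc.collectors&pretty"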
EDIT2: I just noticed this, @Uporaba Did you configure this on purpose?
GRAYLOG_SERVER_JAVA_OPTS="-Xms3g -Xmx3g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:-OmitStackTraceInFastThrow"
Here is mine; maybe mine is just old.
GRAYLOG_SERVER_JAVA_OPTS="-Xms3g -Xmx3g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:-OmitStackTraceInFastThrow "
You might take a look here; I'm not sure what happened.
Hello. This was done automatically after installation.
Everything you described above didn't really help over the course of the day. After restarting the virtual machine, I noticed that 19 GB of RAM filled up immediately (as indicated, 16 GB for Elasticsearch and 3 GB for Graylog), and then RAM began to grow slowly. As a result, after an hour I have this picture: