Out of RAM


Thanks for the added info.

Ah good, I was like holy cow, all the replicas and data directories are on the same disk/HDD :laughing:

Something that caught my eye: the node es-node-01 in the screenshot is using 3 times more CPU than the other two nodes. At first I thought it might be the master node, but it’s not. Unsure if that was just a random metric or if it’s always like that?

To sum it up:

  • 3-node cluster with all three services on each node.
  • Each node has 16 cores and 32 GB of RAM.
  • Each Elasticsearch instance has 16 GB of RAM allocated.
  • Graylog/Java has a 3 GB heap allocated.
  • Log ingestion is 300-500 GB a day, about 3,000-5,000 messages per second.
  • 515 indices with 7,170 active shards (that’s a lot for a three-node cluster).
  • All logs are clean.
  • I’m assuming the Process/Output/Input buffers & Journal are good.

A couple of questions on this:

  • Was this gradual or an overnight issue?
  • What changed before this issue started?
  • Was it always 300-500 GB a day?
  • Any updates applied? Server rebooted?
  • Any plugins installed?
  • Do you have regex extractors, GROK patterns, or pipelines configured?

Certain Java versions have been known to increase memory usage. I was also curious about the amount of logs per day: if it was lower before (say 200-300 GB) and then increased, that could also have an impact on resources. A bad regex expression or a bad GROK pattern could likewise be a culprit for high memory usage.

You do have a large amount of data, judging from the messages per day and how many shards are being generated.

From what I’m seeing, this is a pretty normal amount of memory usage for having all three services on each node. This is why the documentation suggests that Elasticsearch should be on its own node with as much memory as possible; that way Graylog/MongoDB are not fighting over resources.
To be honest, I feel something was changed: either the log shippers are sending a lot more logs, or perhaps new configurations or updates were made. Have you been trending data on these cluster servers for the past month or two? If so, did you see anything that may pertain to this issue?

To give you an idea, here is my lab GL server with all three services on one node. It has 12 CPUs, 12 GB of memory, and a 500 GB drive, with 4 GB of RAM for Elasticsearch and 3 GB for the GL heap. This server is only ingesting 30 GB a day. No replicas, only 4 shards per index, and rotation is 1 day / deleted after 30 days.

As you can see I’m using about the same percentage of memory as you.

In the forum there have been issues with “over-sharding” tying up memory. Not sure if this pertains to your issue, but even if not, the posts below are a good read.

Hello again :upside_down_face:

I have configured each node as both a data and a master node (this can be seen in the configuration file I sent). If one of the nodes reboots or goes down, another node takes on the master role (the so-called Raft methodology, if memory serves).

And what settings do I need to change so that everything works nicely?

I don’t know; it’s just that one day when I checked the server the RAM had filled up, so it most likely happened at night.


Yes, always.

I rebooted only after I saw that the RAM was full, and restarting Elasticsearch and Graylog did not help.


Yes, but everything worked fine before that, so I don’t think that’s the problem. The service ran for 4 years before I switched it to a cluster architecture ~8 months ago.

And a question from me: what time zone are you in, and could we move from the forum to Telegram or somewhere else to work on this problem together online, if you don’t mind? When you write messages, it’s 4-5 AM for me.


I’m not very good at troubleshooting Regex/GROK patterns but @tmacgbay might be able to jump in here :smiley:

My time zone is Central (UTC-6). I don’t mind; as for communicating, how about Discord or Zoom? You can DM me here if you like.

This issue is starting to sound like bad regex extractors or GROK patterns, but I’m not 100% sure. Just a thought: it might have started with a burst of messages that broke something, and since it is affecting all the GL servers at once, I’m leaning towards the extractors right now…

Happy to take a look at what you have going on if you want to post the regex/GROK and an example message. If you search the forums for “GROK lock” there are a couple of posts I have put out there like this one.


Hi, I didn’t quite understand what I’m supposed to do. Do you need any additional data or not?

@gsmith is suggesting that during high-volume periods a regex or GROK statement gets overloaded and could possibly lock up or slow down your process buffers. In the link (and if you search the forum for more) you can find out more about that issue and where to view your process buffers while it is happening. If you think there may be a regex or GROK statement that is inefficient, you can post it here (along with an example message) and I am happy to take a look. I am by no means an expert, but I have dealt with process buffers locking up before.

One of the best ways to make GROK or regex more efficient is to anchor it to the beginning ^ or the end $ of the message; otherwise it will sift through the whole message attempting a match… and when you have thousands of messages processing, that can get very inefficient…
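As a tiny illustration of the anchoring point, here is a sketch using `grep -E` on a made-up log line (the message and the patterns are purely hypothetical, not taken from this cluster):

```shell
# A made-up log line, similar in shape to a syslog message.
msg='2022-03-24T13:53:15 host sshd[123]: Failed password for root from 10.0.0.5'

# Unanchored pattern: on a non-matching message the engine retries the
# match at every position in the string before giving up.
echo "$msg" | grep -Eq 'Accepted publickey' || echo 'no match (after scanning the whole line)'

# Anchored with ^: a message that does not start with a timestamp is
# rejected after a single attempt at position 0, which is much cheaper
# at thousands of messages per second.
echo "$msg" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[^ ]+ [^ ]+ sshd' && echo 'match (anchored)'
```

The same idea applies inside a Graylog extractor or pipeline rule: the regex engine behaves the same way regardless of where it runs.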

As I understand it, this is what we are talking about now?

Here’s what it shows me where you asked me to click (I blurred out some confidential information, I hope that doesn’t hurt):

Or do you need a full message?

Even today in the Graylog logs I noticed the following entries:

2022-03-24T13:53:15.001Z ERROR [DecodingProcessor] Unable to decode raw message RawMessage{id=c0763e91-ab79-11ec-b760-001a0000144c, messageQueueId=28898782352, codec=gelf, payloadSize=1025, timestamp=2022-03-24T13:53:15.001Z, remoteAddress=/} on input <5f367557046ddce7db14e9a3>.
2022-03-24T13:53:15.001Z ERROR [DecodingProcessor] Error processing message RawMessage{id=c0763e91-ab79-11ec-b760-001a0000144c, messageQueueId=28898782352, codec=gelf, payloadSize=1025, timestamp=2022-03-24T13:53:15.001Z, remoteAddress=/}
java.io.EOFException: Unexpected end of ZLIB input stream
        at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240) ~[?:1.8.0_302]
        at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158) ~[?:1.8.0_302]
        at com.google.common.io.ByteStreams$LimitedInputStream.read(ByteStreams.java:731) ~[graylog.jar:?]
        at com.google.common.io.ByteStreams.toByteArrayInternal(ByteStreams.java:181) ~[graylog.jar:?]
        at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:221) ~[graylog.jar:?]
        at org.graylog2.plugin.Tools.decompressZlib(Tools.java:217) ~[graylog.jar:?]
        at org.graylog2.inputs.codecs.gelf.GELFMessage.getJSON(GELFMessage.java:74) ~[graylog.jar:?]
        at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:125) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:153) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:94) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:90) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:47) [graylog.jar:?]
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]

That process buffer is what I was talking about… in my experience, it can still lock up a machine even if it shows just one that is not idle. Take that redacted message and wind back the process it goes through on your Graylog system (input, extractor, stream, pipeline, alert…) and you may find an area that is not being efficient… causing the buffer to have something in it. I have a very small system but from what I understand, these should all show idle. In the few that I have participated in, it seems to have pointed back to a regex or GROK command that was spending too much time on a message trying to find something… particularly in an overload situation or where incoming message format has changed.

I didn’t quite understand what I need to do. If it’s not difficult, could you describe where to click and what to look at? I have some trouble understanding what you are writing.

Take a look at the message that shows in the process buffer.

  1. What input did it come in on?
  2. Are there any extractors associated with that input? Do they use GROK/regex?
  3. Are there any streams the message would be assigned to based on stream rules that match the message?
  4. If the message is assigned to a stream or multiple streams, are there pipelines attached to that stream?
  5. If the message is traversing pipelines, what are the rules in those pipelines that the message executes after passing the when… then… section?
  6. Do any of the rules that are executed contain GROK or regex?

You need to understand the path that a message caught in the process buffer takes through Graylog before it’s stored in Elasticsearch. How is it processed?

This may not even be the right road to your solution but it is still a good thing to understand.

I have a lot of messages from different inputs in the buffer. Do I need to analyze all of them, or can I get by with one (provided that extractors are used on that input, etc.)?

I am not 100% convinced this is where we will solve the problem, so start small and look at one… more if you have time…

Here is a brief conclusion I have.
It may not be a fix but more or less a suggestion. I don’t want to tell you to start reconfiguring your environment, since it was, and seems to be, working fine. The problem, from what you stated, was an increase in memory usage, which doesn’t seem too bad right now.

Since you are ingesting a lot of logs, and I think you have quite a few fields generated from the messages/logs, I believe you may need more memory added. Just an easy, temporary solution.
I thought about decreasing the memory given to Elasticsearch, but that would do no good because you are ingesting 300-500 GB a day, about 3,000-5,000 messages per second. That would just cause more problems.

Some minor suggestions that could be applied.

  • If possible, try to decrease the amount of logs being ingested. This would depend on how you’re shipping the logs to the Graylog INPUT.

  • Are there a lot of saved searches? Try to decrease those if not needed; the same goes for widgets, dashboards, etc…

  • Try to fine-tune this environment, meaning: if you really don’t need it, remove it.

  • Try to increase the field type refresh interval to 30 seconds. You would need to edit your default index set and then manually recalculate and/or rotate the indices.
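For context, the field type refresh interval maps to Elasticsearch’s `index.refresh_interval` setting. You would normally change it in the Graylog UI under System > Indices > (index set) > Edit, but you can inspect the current value directly. This sketch assumes Elasticsearch is listening on localhost:9200 and that you use the default `graylog_` index prefix; adjust both for your environment:

```shell
# Show the current refresh interval on the Graylog-managed indices.
# (Read-only query; the setting itself should be changed via Graylog
# so the index set configuration stays in sync.)
curl -s 'http://localhost:9200/graylog_*/_settings/index.refresh_interval?pretty'
```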

  • The errors/warnings in the log are just stating that Graylog could not decode a RawMessage sent from a remote device. Tuning your log shippers (NXLog, Winlogbeat, Rsyslog, etc…) to send the proper type of data for the input you’re using might help. Example: if Windows machines use Winlogbeat as the log shipper, reconfigure them to send only the data you need and try not to send every event log from those machines. I noticed in the logs posted above that you are using GELF; this does create a lot of fields.

Not knowing exactly when this problem started or what took place before the issue was noticed, it’s hard to say.

From what you stated above, it’s only an increase in memory, BUT everything is working correctly? It might just be that you need more resources.

To sum it up

You can adjust a few different configurations to lower your memory usage, but from everything you have shared, everything seems to be running fine. Am I correct?

I don’t believe there is one single change that will lower your memory; I think it’s a combination of different configurations, and to be honest it probably doesn’t even need to happen.

If possible, add more memory to each system (e.g., 2 GB), then watch and trend the data to see if usage increases over time. If it does, then we might need to look further into fine-tuning your environment. If you do add more memory (2 GB), wait a couple of days or a week, and don’t add any new configuration or updates if possible. The more data we have, the better we can find a solution.

If you’re experiencing data loss, Graylog freezing/crashing, gaps in the graphs, etc…, then we’ll look further into this ASAP.

EDIT: This is a good read if you have a chance.

EDIT2: I just noticed this, @Uporaba Did you configure this on purpose?

GRAYLOG_SERVER_JAVA_OPTS="-Xms3g -Xmx3g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:-OmitStackTraceInFastThrow"

Here is mine; maybe mine is just old.

GRAYLOG_SERVER_JAVA_OPTS="-Xms3g -Xmx3g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:-OmitStackTraceInFastThrow "

Might take a look here; not sure what happened.

Hello. This was done automatically after installation

Everything you described above didn’t really help during the day. After restarting the virtual machine, I noticed that 19 GB of RAM filled up immediately (as indicated: 16 GB for Elasticsearch and 3 for Graylog), then RAM began to grow slowly. As a result, after an hour I have this picture:

Hmmm… I did it, and after about ~6 hours the RAM is around 64% (20 GB out of 32). Sometimes it gets smaller, sometimes bigger, but it does not increase to 30-31 GB as it did before. I will keep observing on Saturday and Sunday; maybe we have found a solution.


I think we found a clue :male_detective:

Hello. Over the weekend the indicators increased: the memory does not fill up immediately but gradually. In the beginning it was 64%, now 74%, and as you can see from the graph this happens on the two nodes where Graylog is located (Graylog is not on the third node). Accordingly, the problem is not with Elasticsearch but with Graylog; it remains to understand what the problem is. The point where I changed the settings and rebooted the nodes is marked in red. I also see that on the first node (it is the master in Graylog) the processor is loaded.


What I know is…

Setting the field type refresh interval to 30 seconds will reduce the load on resources.
Java tends to use a lot of memory depending on how many logs are being ingested/indexed, etc…
That many shards, and the types of searches being executed, will have an impact on memory.

Elasticsearch, Graylog, and MongoDB on the same node could be fighting over resources. If the amount of logs didn’t exceed 1,000-1,500 per second I would rule that out, but you’re receiving over 3,000 per second, so it makes me wonder.

So quick question.

  • All three nodes have Graylog, ES, and MongoDB.
  • Is es-node-03 the master node?
  • Are es-node-02/01 master/data nodes?

From the graph it looks like only two nodes have memory increasing; “elastic_data-3” is steady. If this is correct, something weird is happening, perhaps a missed configuration?
Can I ask, was es-node-03 always a master node before this issue?

EDIT: I just noticed Graylog is not on es-node-03, so it’s just Elasticsearch and MongoDB?
I’m kind of confused by what you stated.

When I stated this…

I was assuming that was right.

I’ve been going over this issue, doing some research on High memory Utilization on Graylog.
By chance, what do you see when you execute this on the nodes with high memory usage?

root# free -m

And is it possible to see this output? I’m curious; something doesn’t seem to add up.

root# lsblk

Out of curiosity, which one have you installed: Oracle Java or OpenJDK?

EDIT: I don’t think you mentioned this, but besides the memory usage, how is everything else working? Have any other problems come up?

EDIT2: Over-sharding, as we talked about before. 515 indices with 7,170 active shards.
Shards: 3
Replicas: 2

This is calculated from just one index set, not all the other indices you have.
That results in 2 replica shards per primary shard, giving you a total of 9 shards per index.
That would be 3 primary shards + 3 first replicas + 3 second replicas = 9 per index (4,635 across all 515 indices). As a rule of thumb, a node with 30 GB of heap memory should have at most about 600 shards.
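The arithmetic above can be sketched out like this (the numbers come from this thread; the ~600-shards-per-node figure is common guidance for a 30 GB heap, not a hard Elasticsearch limit):

```shell
# Shard math for this cluster, as described in the thread.
primaries=3
replicas=2
indices=515

per_index=$((primaries * (1 + replicas)))      # 3 + 3 + 3 = 9 shards per index
total=$((per_index * indices))                 # 9 * 515 = 4635 shards

per_node_guideline=600                         # rough guidance for a 30 GB heap node
cluster_guideline=$((per_node_guideline * 3))  # three nodes -> ~1800

echo "$per_index shards/index, $total total vs ~$cluster_guideline suggested ceiling"
```

Even by this rough guideline the cluster is carrying several times more shards than recommended, which is why over-sharding keeps coming up as a suspect.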
The size of each shard as shown in this document below.

Shard Size

Shard Count

To ensure you’re not going over the recommended shard size, you can execute this:

curl -X GET 'http://localhost:9200/_cat/indices?v'

Not sure if it’s the issue, but it does have an impact on resources, along with all the other things I’ve stated above.