Another thing that I have noticed is that when searching or going to a dashboard, my out traffic drops to 0 a lot. It makes me think that Elasticsearch is dedicating its processing to the query rather than splitting it with the traffic coming in as well.
Hello,
I’m back from vacation
I noticed these statements, which can be a direct result of your resources and configuration.
This can come from the configuration made in the Graylog config file.
The INPUT buffer configuration doesn't need a lot of CPU, but if you see it climb, just add another processor:
inputbuffer_processors = 2
Normally the process buffer gets the greatest number of CPUs.
processbuffer_processors = 12
The trick with this is that once you increase the number of process buffer processors AND restart the Graylog service, it takes a couple of minutes to kick in. Start with a small number and increase it gradually. Sometimes you won't see results right away.
This depends on how many logs are being received while old messages are still waiting to be indexed.
If the output buffer fills up, then you see an increase in the process buffer. I would increase the number of CPU cores for the output buffer, then restart the Graylog service and wait a few minutes. If the issue continues, I would repeat the process, but you shouldn't have to increase the outputbuffer number too much. As a reminder, try not to increase those numbers beyond the number of physical CPU cores.
outputbuffer_processors = 5
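As a rough sketch, assuming a package install where the config lives in /etc/graylog/server/server.conf, you can check how many cores you actually have before dividing them up, and restart Graylog after editing so the change takes effect:
nproc    # reports logical cores; keep the three *_processors values at or below this
sudo systemctl restart graylog-server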
As for this statement
This could be from a date/time mismatch, from Elasticsearch not processing quickly enough, OR from reconfiguring Graylog and making adjustments while restarting services.
Since you're stating that the buffers are filling up, I'm leaning more toward your configuration and what you're allowing Elasticsearch to use.
At first, try removing or commenting out this one; it could prohibit Elasticsearch from starting up if you did not make some other adjustments to your system.
This is mine on a server running Graylog and Elasticsearch on the same node:
cluster.name: graylogsrktst
node.name: montst.log.srk
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: <ipadres>
http.port: 9200
discovery.seed_hosts: ["<ipadres>"]
cluster.initial_master_nodes: ["<ipadres>"]
action.auto_create_index: false
It could be that you have to change some parameters in server.conf:
elasticsearch_hosts = http://<ipadres>:9200
Thank you for your reply.
I have tested increasing the number of processors allocated to both the output and the process buffer, but neither seems to have helped. I have also tried changing the output batch size from 200 all the way up to 2000, and there was no change.
Currently I'm sitting at 6 processors for processing and another 6 for output, and the same thing seems to happen. I have noticed that decreasing the amount of traffic coming in seems to help, but then I'm not logging all my traffic.
I think it's important to state that this did not appear to be a problem before the log4j update. Since then the output buffer appears to have become a problem, and from what I can tell it is also causing the issues that we had with the gap in time when searching. NOTE: if I disable most of my inputs so that the output buffer can catch up, that gap appears to recede, if not go away completely.
At this point, I'm really thinking that it has something to do with Elasticsearch not being able to process the data from the output buffer fast enough.
You are saying to comment out the bootstrap.memory_lock:true?
What type of things in the server.conf file are you suggesting that we change?
Comment out the bootstrap.memory_lock: true and see what happens, and check your Elasticsearch logs for any indication that could give some clues.
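For example, something roughly like this on a package install (paths are assumptions, adjust to your setup):
sudo sed -i 's/^bootstrap.memory_lock: true/#bootstrap.memory_lock: true/' /etc/elasticsearch/elasticsearch.yml    # comment out the setting
sudo systemctl restart elasticsearch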
Made the change, and it doesn't appear to have any effect.
Hello,
I completely agree,
Since your buffers are showing 100%, or they're not processing messages fast enough, you will see gaps in your graphs. This problem can be resolved in the Graylog config file as I stated above; the only way I know to fix it is by increasing those settings I showed above.
Have you searched the forum for Processors, Buffers full, etc?
Most everything that I have found said they needed to give more resources to Elasticsearch. And while that can make sense in certain circumstances, in this case it doesn't, as it was working for us up until the upgrade.
At this point, I'm kind of at a loss and wondering if I should just reinstall, as we only have a couple of months' worth of data and have no obligation to keep it. However, I'm not looking forward to rebuilding, and I wish there were some way to troubleshoot Elasticsearch better.
Any other thoughts as to other options that I can do? Any specific utilities that I could get or look at that would help to determine if and where the problem would be in elasticsearch?
The best possible thing for ES is to lower the refresh rate by raising index.refresh_interval.
If you have 6,000 messages coming in on a single node that all have to be indexed immediately, that can be a lot for Elasticsearch to handle. Try setting it to 5 or even 30 seconds.
I do this with Cerebro and made a template with it so that any new index picks it up too.
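If you'd rather do it without Cerebro, a rough sketch of the same idea with the legacy template API (the template name here is just an example; any index matching graylog_* created afterwards should pick the setting up):
curl -X PUT "localhost:9200/_template/graylog-custom-refresh" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["graylog_*"],
  "order": 10,
  "settings": {
    "index.refresh_interval": "30s"
  }
}
'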
Are you saying that you make this change on elastic or in the graylog conf file?
Hello,
What do the logs show in Elasticsearch and Graylog? Have you tried tail'ing them? Especially the Elasticsearch logs. You mention an upgrade; what exactly did you do?
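For example (default locations on a package install; the Elasticsearch log is named after your cluster.name):
sudo tail -f /var/log/graylog-server/server.log
sudo tail -f /var/log/elasticsearch/<cluster_name>.log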
Check this out.
There was an updated version of Graylog that came out about 3 weeks ago that patched the log4j vulnerability. We installed that, but haven't seen any problems listed in the logs.
I ended up reinstalling. This fixed the problem, and now the output is way up. However, I’m getting a new error. Not sure if I should create a new ticket or continue on with this. The new error is:
“Nodes with too long GC pauses”
I have noticed that when this problem happens, it also appears to make the process buffer go up to 100 percent, in spite of the fact that I gave the process buffer 7 cores and the output buffer (which isn't having a problem anymore) 6 cores.
Any thoughts on how to resolve this, or where to look?
Thanks,
Hello
EDIT: After looking more into this error, it looks like a direct result of what @Arie was stating above.
Change index.refresh_interval: 5s to index.refresh_interval: 30s:
curl -X PUT "localhost:9200/graylog_*/_settings" -H 'Content-Type: application/json' -d'
{
  "index" : {
    "refresh_interval" : "30s"
  }
}
'
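To confirm the change took, you can read the setting back (it only shows up for indices where it is explicitly set):
curl -X GET "localhost:9200/graylog_*/_settings/index.refresh_interval?pretty"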
You may need to read this post; it has some good ideas/information:
https://graylog2.narkive.com/CRI3N2ju/how-to-fix-nodes-with-too-long-gc-pauses-issues-in-my-cluster
You may need to adjust the Process buffer a little more.
As @Arie stated above, you're getting around 6,000 messages coming in on a single node. You have tripled what I do in my lab. That's a lot of messages for one single Graylog server, so you need to adjust your resources to accommodate Elasticsearch and Graylog.
Hello,
@Arie After testing this out in my lab I was able to set refresh_interval to 30s, but after rotating the indices the setting didn't take on the newly created indices. I used graylog_*, hoping that any indices created after I reconfigured the setting would pick it up. Any idea?
Could this be set in the template? What does changing the refresh interval do?
When running the status command I get the following
graylog-admin@graylog01:~$ sudo systemctl status elasticsearch.service
● elasticsearch.service - Elasticsearch
Loaded: loaded (/lib/systemd/system/elasticsearch.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2022-01-07 16:08:35 UTC; 36s ago
Docs: https://www.elastic.co
Main PID: 5604 (java)
Tasks: 189 (limit: 57775)
Memory: 17.1G
CGroup: /system.slice/elasticsearch.service
└─5604 /usr/share/elasticsearch/jdk/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true>
I currently have the heap in my jvm.options set to 16g. I've noticed that the Memory value when running the command has gone as high as 24g, which would be the max my system has. Should that Memory value be the same as what is set in jvm.options?
Rather than every 30s, could changing it to 5 seconds still help with performance? If so, is this how I would change it?
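Presumably the same _settings call as above, just with the smaller value; something like:
curl -X PUT "localhost:9200/graylog_*/_settings" -H 'Content-Type: application/json' -d'
{
  "index" : {
    "refresh_interval" : "5s"
  }
}
'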