Issues with Graylog after moving to an Elasticsearch cluster

(Mariusgeonea) #1


After I moved my setup from a single Elasticsearch server to a cluster (3 master nodes and 3 data nodes), the output rate shown in the right corner dropped from 7500 msg/s to around 4500 msg/s, while my input stayed the same as before, around 7500 msg/s.

What I did so far: I checked the time and date. They are the same everywhere because everything is synced via NTP.
I have also configured the following on Elasticsearch:

# Recover only after the given number of nodes have joined the cluster. Can be seen as "minimum number of nodes to attempt recovery at all".
gateway.recover_after_nodes: 4
# Time to wait for additional nodes after recover_after_nodes is met.
gateway.recover_after_time: 5m
# Inform ElasticSearch how many nodes form a full cluster. If this number is met, start up immediately.
gateway.expected_nodes: 6

then i did 300mb instead of 150: 300mb

In Graylog I changed the process buffer from 10 to 20.
Also in Graylog, I rotated all the indexes manually and deleted the old ones.

None of this made anything better.

Have you ever seen this problem before, or do you have any ideas that could help me?


(Jochen) #2

Which version of Graylog are you using?
Which version of Elasticsearch are you using?
Was the single Elasticsearch node previously running on the same machine as the Graylog node?
What’s the complete configuration of Graylog and Elasticsearch?
What’s in the logs of your Graylog and Elasticsearch nodes?

(Mariusgeonea) #3

Q: Which version of Graylog are you using?
A: Graylog v2.4.5+8e18e6a

Q: Which version of Elasticsearch are you using?
A: elasticsearch-5.6

Q: Was the single Elasticsearch node previously running on the same machine as the Graylog node?
A: no

Q: What’s the complete configuration of Graylog and Elasticsearch?
A: (Please note that some things were deleted from the Graylog conf file for privacy reasons.)
graylog -
elastic master node -
data node -

Q: What’s in the logs of your Graylog and Elasticsearch nodes?

(Jochen) #4

There are no files in that folder.

(Mariusgeonea) #5

(Jochen) #6

2018-06-08T03:57:16.436-04:00 WARN  [Messages] Failed to index message: index=<firewall_deflector> id=<725de9a3-6af1-11e8-82eb-0050568640e7> error=<{"type":"invalid_index_name_exception","reason":"Invalid index name [firewall_deflector], already exists as alias","index_uuid":"_na_","index":"firewall_deflector"}>

There are a lot of warnings and error messages in these logs, which you should try to resolve individually. If the performance is still worse after that, come back and post the current logs (and configuration files if anything changed).

(Mariusgeonea) #7

How do I resolve those? By deleting the current logs?

(Jochen) #8

Please read the linked FAQ article.

(Mariusgeonea) #9

no more invalid indexes

but the output messages didn’t increase…

(Mariusgeonea) #10

I rebuilt the Elasticsearch cluster from scratch… same issue, nothing has improved…
I’m really running out of ideas here…

(Mariusgeonea) #11

Another thing is that the CPU is at 100% all the time.

(Jochen) #12

Try playing around with the batch sizes and batch commit intervals:

(Mariusgeonea) #13

Right now I have:

# Batch size for the Elasticsearch output. This is the maximum (!) number of messages the Elasticsearch output
# module will get at once and write to Elasticsearch in a batch call. If the configured batch size has not been
# reached within output_flush_interval seconds, everything that is available will be flushed at once. Remember
# that every outputbuffer processor manages its own batch and performs its own batch write calls.
# ("outputbuffer_processors" variable)
output_batch_size = 1000

# Flush interval (in seconds) for the Elasticsearch output. This is the maximum amount of time between two
# batches of messages written to Elasticsearch. It is only effective at all if your minimum number of messages
# for this time period is less than output_batch_size * outputbuffer_processors.
output_flush_interval = 2

# As stream outputs are loaded only on demand, an output which is failing to initialize will be tried over and
# over again. To prevent this, the following configuration options define after how many faults an output will
# not be tried again for an also configurable amount of seconds.
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30

# The number of parallel running processors.
# Raise this number if your buffers are filling up.

processbuffer_processors = 15
outputbuffer_processors = 10

with the same result

(Jan Doberstein) #14

What are your sharding and replication settings? Did you set the refresh interval for elasticsearch?

In addition, I would work through the checklist in the following Elastic article

I would raise the output_batch_size to 4000 and lower the outputbuffer_processors to 5. I have the feeling that your updates are eating all available Elasticsearch threads, and you need to push more messages over fewer connections.
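
As a rough back-of-the-envelope check of that suggestion, here is a minimal Python sketch. It assumes the 7500 msg/s input rate from the first post and the output_flush_interval = 2 shown above. The function names are made up for illustration and are not part of Graylog. It only does arithmetic on the config values and says nothing about how Elasticsearch actually schedules bulk requests.

```python
def min_bulk_requests_per_sec(msgs_per_sec, batch_size):
    """Lower bound on bulk requests per second: reached only if every batch is full."""
    return msgs_per_sec / batch_size


def flush_interval_is_effective(msgs_per_sec, batch_size, processors, flush_interval_s):
    """Condition from the server.conf comment: the flush interval only matters if
    fewer messages arrive per interval than batch_size * outputbuffer_processors."""
    return msgs_per_sec * flush_interval_s < batch_size * processors


# Assumed figures: 7500 msg/s input, 2 second flush interval.
for batch_size, processors in [(1000, 10), (4000, 5)]:
    rate = min_bulk_requests_per_sec(7500, batch_size)
    effective = flush_interval_is_effective(7500, batch_size, processors, 2)
    print(batch_size, processors, rate, effective)
```

With 1000 x 10 the batches fill up before the flush interval expires, so at least 7.5 bulk requests per second are needed. With 4000 x 5 that lower bound drops below 2 per second, and the flush interval becomes the limiting factor instead.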

(Mariusgeonea) #15

My shards were 4 per index and I have 6 indexes. Unfortunately I didn’t test that one out…
What I did was raise the CPU cores from 16 to 24, and now everything is running fine again with a CPU utilization of 50%. I will try to test what you have suggested to see if the CPU usage decreases.

Now I’m trying to test with 2 shards per index to see if there is any change in the storage space. Maybe that will also help reduce the CPU usage…
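
To make the shard numbers concrete, here is a small Python sketch of the arithmetic. It assumes 6 indexes and 3 data nodes as described in this thread. The replica count was asked about but never stated, so `replicas` is a hypothetical parameter, and `shard_counts` is a made-up helper, not an Elasticsearch API.

```python
def shard_counts(indexes, shards_per_index, replicas=0, data_nodes=3):
    """Return (primary shards, total shards, average shards per data node).

    replicas is the number of replica copies per primary shard; the thread
    does not state it, so 0 is only a placeholder default.
    """
    primaries = indexes * shards_per_index
    total = primaries * (1 + replicas)
    return primaries, total, total / data_nodes


print(shard_counts(6, 4))              # 4 shards per index
print(shard_counts(6, 2))              # 2 shards per index
print(shard_counts(6, 4, replicas=1))  # 4 shards per index, one replica each
```

Halving the shards per index halves the number of shards each data node has to serve. Whether that lowers CPU usage or storage in this setup would still have to be measured.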

(Jan Doberstein) #16

The total number of outputbuffer, inputbuffer and processbuffer processors should be at most 3/4 of the available cores.

(Mariusgeonea) #17

Is 3 the number of processors and 4 the number of CPU cores?

I’m afraid I don’t really understand the formula.

(Mariusgeonea) #18

I think I get it: it’s probably three quarters, meaning 75%. So with 24 cores I should give it 18 in total. I have now set outputbuffer to 6, inputbuffer to 2 and processbuffer to 10.

I hope this is what you meant.
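
The three-quarters rule can be written out as a short Python sketch. The helper names are made up for illustration. The core counts and processor values are the ones mentioned in this thread, and the rule itself is the guideline from the post above, not something Graylog enforces.

```python
def max_processors(cores):
    """Upper limit for outputbuffer + inputbuffer + processbuffer processors (3/4 of cores)."""
    return cores * 3 // 4


def within_budget(cores, outputbuffer, inputbuffer, processbuffer):
    """True if the three processor settings together stay within 3/4 of the cores."""
    return outputbuffer + inputbuffer + processbuffer <= max_processors(cores)


print(max_processors(24))            # budget with 24 cores
print(within_budget(24, 6, 2, 10))   # the settings chosen in this post
print(within_budget(16, 10, 0, 15))  # earlier settings on 16 cores, inputbuffer not stated
```

With 24 cores the budget is 18, which is exactly 6 + 2 + 10. The earlier values (15 processbuffer and 10 outputbuffer processors on 16 cores) were over the limit of 12 even before counting the inputbuffer processors.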

(Jan Doberstein) #19

You got it. Sorry that I did not write it more clearly.

(Mariusgeonea) #20

No worries. In fact you were pretty clear, but because I’m not that used to math I didn’t understand the three quarters :)

Anyway, at the moment I’m running those plus an output batch size of 12000. Do you think I can go to 20000, and what could happen in the worst case if an output batch of 20k or 30k won’t work?