Server Type 1 - 16 cores, 32 GB memory
Server Type 2 - 16 cores, 64 GB memory
Server Type 1 - 3 nodes
Graylog, Kafka, Elasticsearch - master nodes, 4 GB heap
Server Type 2 - 2 nodes
Elasticsearch - data nodes, 30 GB heap
Graylog reads from Kafka on the Server Type 1 nodes.
ES is using less than 50% of the allocated heap.
I have about 20 inputs configured on Graylog, and most of them are of type GELF Kafka.
I have about 300 alerts configured on the 20 streams.
Graylog's message processing is very slow. I get an output of at most 2,500 messages per second, which makes it a bottleneck. I can see the number of unprocessed messages climbing very high while the Graylog output stays very slow.
How can I improve Graylog's message processing output?
2 Elasticsearch data nodes
64 GB memory - 30 GB allocated to Elasticsearch
16 cores
I have about 20 GELF Kafka inputs configured.
I get an input of up to 10,000 msgs per second.
I get an output from Graylog of at most 2,700 per second.
I am struggling to improve Graylog performance. There is no load on my Elasticsearch data nodes at all. I would be happy if I could get an output of 10,000 per second.
Perhaps the admins could move this to a separate thread.
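To put the figures quoted above (up to 10,000 msgs per second in, at most 2,700 out) into perspective, here is a rough back-of-the-envelope sketch in Python. The function names and the 1,000,000-message backlog used in the example are illustrative assumptions, not values taken from Graylog itself.

```python
def backlog_growth(input_rate: float, output_rate: float) -> float:
    """Messages per second that pile up as unprocessed when input exceeds output."""
    return max(float(input_rate) - float(output_rate), 0.0)


def seconds_until_backlog(target_backlog: float, input_rate: float, output_rate: float) -> float:
    """Time needed to accumulate `target_backlog` unprocessed messages at a steady rate."""
    growth = backlog_growth(input_rate, output_rate)
    if growth == 0.0:
        return float("inf")  # output keeps up, so the backlog never builds
    return target_backlog / growth


# Rates reported in the post; the backlog size is a hypothetical example.
growth = backlog_growth(10_000, 2_700)
eta = seconds_until_backlog(1_000_000, 10_000, 2_700)
print(f"backlog grows by {growth:.0f} msgs/s; 1,000,000 queued after about {eta / 60:.1f} min")
```

At a sustained peak, roughly 7,300 messages per second would be left unprocessed, so even short bursts are enough to build a large backlog.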
You should have more output than that. If you look at Graylog nodes, does it show that output buffers are at 100%, or are the processor buffers at 100% while output buffers are at low usage?
I ran into issues in the spring of last year when my usage began climbing. At least once per day my processor buffers would max out and the nodes would begin backing up onto the journal, sometimes with a backlog well above 1 - 10 million messages. It turned out to be my extractors: the regex ones were poorly written, and they also weren’t using conditions efficiently enough. I spent several days back in June tuning each of them and giving each one a more efficient condition, which completely solved my issue.
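The general idea behind that kind of tuning can be sketched in Python. This is only an illustration of the principle (a cheap condition before an anchored, specific pattern), not Graylog's extractor engine, which uses Java regular expressions; the pattern, field layout and function name are hypothetical.

```python
import re

# Hypothetical pattern: anchored at the start and specific about what it expects,
# instead of an unanchored pattern with a leading ".*" that scans every message.
STATUS_RE = re.compile(r"^\S+ \S+ status=(\d{3}) ")


def extract_status(message: str):
    """Return the HTTP-style status code as an int, or None if it is not present."""
    # Cheap condition first: a substring test skips the regex entirely
    # for the (usually large) share of messages that cannot match.
    if "status=" not in message:
        return None
    match = STATUS_RE.match(message)
    return int(match.group(1)) if match else None
```

The substring test plays the role of an extractor condition: messages that cannot match never reach the regex at all.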
On your Elasticsearch nodes, what is your storage type and configuration? I have 4 data nodes using 7200 rpm midline SAS drives in RAID 0 groups, 12 drives per node. I haven’t benchmarked these nodes with Rally, but in the benchmarking I’ve done on other systems (http_logs track), a desktop system with a quad-core i7 and a single desktop-grade SSD outperformed a 2-node cluster of multi-processor servers using RAID 0 groups of 10K SAS drives. Are you seeing high iowait on your nodes during periods of heavier event input? I'm not sure which OS they’re running, but if you install the sysstat package (I’m on CentOS), you can use iostat to get a good idea of your I/O usage.
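If installing sysstat is not an option, a rough iowait figure can also be derived from two samples of the aggregate `cpu` line in `/proc/stat` on Linux. The following is a minimal sketch under that assumption; the function names are made up for illustration, and it is not a replacement for iostat's per-device statistics.

```python
import time


def parse_cpu_line(line: str) -> list[int]:
    """Parse the aggregate 'cpu' line: user nice system idle iowait irq softirq steal."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in parts[1:9]]


def iowait_percent(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two samples, in percent."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


def sample_iowait(interval: float = 5.0, path: str = "/proc/stat") -> float:
    """Read /proc/stat twice, `interval` seconds apart, and return the iowait percentage."""
    with open(path) as handle:
        before = handle.readline()
    time.sleep(interval)
    with open(path) as handle:
        after = handle.readline()
    return iowait_percent(before, after)
```

Calling `sample_iowait(5.0)` on a data node while the input rate is high gives a single percentage to compare against quieter periods.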
I’m going to be expanding our Elasticsearch cluster probably this summer, hoping to do 4 nodes with SSD, enough to keep maybe 7-14 days of data, use them as hot nodes with my existing nodes as warm.
All my servers have SSD storage, and I keep at most one week's data.
I have a few CSV extractors for inputs received as Raw/Plaintext Kafka. I will try to change the log format to JSON so that I can use GELF Kafka instead. But processing is slow even when I do not receive any messages on the inputs that have extractors configured.
I will also check the custom processor plugins that we have added.
One of the Graylog server nodes always has 3 times more messages to process than the other 2 nodes when there is a slowdown in processing.
I removed the alert condition plugins in the staging environment and compared the output before and after removing them, but I did not see any change.