User feedback / guide for a heavy-load Graylog cluster

Based on my performance monitoring it's not really "heavy load", but maybe it is useful.
// I have mentioned most of this before, but here it is in one place.

Traffic:
4-6k logs/sec ~ 350-400 million logs/day
Peaks when a misconfigured device sends all its logs - 1-2 million logs/min - no problem
~40-45 MB/s load balancer output traffic to the GL servers
EDIT2:
We have 40+ streams, 600+ sources, 50+ different source types

First of all, we planned it to be geo-redundant, so take that into account.
At the moment our system:
2 load balancers - nginx (we will change it to IPVS) - 2 vCPU / 2 GB mem
4 Graylog servers - 8 vCPU, 16 GB mem - 10 GB heap
EDIT3:
@jan suggests somewhere not to go above a 2 GB heap, BUT
we tested with a 4 GB heap: Java's GC ran more often and for longer, and during GC the OS UDP error counters increased more than with the 10 GB heap (a sketch of the heap settings follows this list).
10 Elasticsearch servers - 8 vCPU, 32 GB mem, 16 GB heap - 3 TB storage, ~10 MB/s daily max write speed; 40 MB/s is not a problem
3 MongoDB servers - 2 vCPU, 2 GB mem
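
For reference, a minimal sketch of where those heap sizes are set, assuming a DEB/RPM package install (file paths may differ on your setup):
// GL heap, /etc/default/graylog-server (keep Xms equal to Xmx)
GRAYLOG_SERVER_JAVA_OPTS="-Xms10g -Xmx10g"
// ES heap, jvm.options on every data node
-Xms16g
-Xmx16g
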
In GL:
~50 inputs
~40 streams, ~10 of them with syslog output
50+ extractors
10 pipelines

My experiences
The load balancer and MongoDB servers do nothing :slight_smile: - no load on them.
Graylog can process all messages immediately (except during peaks): 0% in/out/process buffer usage on every node, 15% CPU at night, 20% during the day.
Although you get better throughput if you increase output_batch_size in GL, don't do it (see the server.conf note after this list): https://github.com/Graylog2/graylog2-server/issues/5091
I think Elasticsearch is the bottleneck in our system at the moment - 20% CPU usage on every node. We can increase the CPU and memory next.
You need ES heap of roughly 2-3% of the stored data (without replicas) to have usable search; in our case that means the 160 GB of total heap (10 nodes x 16 GB) is good for roughly 5-8 TB of primary data.
In our system a single-word search over 35 days of data takes about 5 seconds, including drawing the histogram.
Monitor all the performance data you can (I think that could be another good topic).
I did some stress tests on a cloned test system about a year ago with loggen; 50k logs/sec was not a problem with 4 ES servers.
I hope the administrators will slowly start to use the system and increase the amount of logs, so I can start doing some performance optimization.
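
The relevant setting in server.conf, left at what I believe is the default value because of the issue linked above:
// server.conf - don't raise this because of Graylog2/graylog2-server#5091
output_batch_size = 500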

My “special” settings:
GL:
// to survive a weekend without an Elasticsearch connection
message_journal_max_age = 48h
message_journal_max_size = 95gb
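// the journal lives here (default value, assumed); this volume needs ~95 GB free on every GL node
message_journal_dir = data/journal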
ES:
// for geo-redundant data storage
node.attr.site: A
cluster.routing.allocation.awareness.force.site.values: A,B
cluster.routing.allocation.awareness.attributes: site
// for the GL bug mentioned above
http.max_content_length: 250mb
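
The ES nodes in the other data center get the matching attribute, so with forced awareness the shard copies are split across the two sites (only for index sets that actually have replicas):
// elasticsearch.yml on the site-B nodes
node.attr.site: B
cluster.routing.allocation.awareness.force.site.values: A,B
cluster.routing.allocation.awareness.attributes: site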

Backup:
Every day an Elasticsearch snapshot of every index set (IS) and index to NFS
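
A minimal sketch of how such a snapshot can be taken, assuming the NFS share is mounted on every ES node and whitelisted via path.repo (the repository name, mount point and index names below are made up):
// elasticsearch.yml on every node
path.repo: ["/mnt/es-backup"]
// register the repository once
curl -XPUT 'http://localhost:9200/_snapshot/nfs_backup' -H 'Content-Type: application/json' -d '{"type": "fs", "settings": {"location": "/mnt/es-backup"}}'
// daily snapshot of one index (run per index from cron)
curl -XPUT 'http://localhost:9200/_snapshot/nfs_backup/graylog_2019_01_01?wait_for_completion=true' -H 'Content-Type: application/json' -d '{"indices": "graylog_42"}'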

Some evidence, by Photoshop :slight_smile:
(At the top you see the data with replicas; below that, the index sets without replicas. Other IS names are not public.)
[screenshot: graylog-indices]

(daily)
[screenshot: graylog-daily]

EDIT:
Colleagues have started to log a bit more. At the moment 13-15k logs/sec, ~3x the previous daytime traffic.
GL: 6% -> 20% CPU; ES output average process time 200 -> 230k us / node
ES: 6% -> 20% CPU; 4 -> 11 MB/s disk IO; 300 -> 900 ms index time; 0 -> 4 ms fetch time / node
LB: 10% -> 20% CPU; 47 -> 130 Mbps interface bandwidth (total)
As I see it, we will need to increase the ES disk size.
