User feedback / guides for a heavy-load Graylog cluster


Hi everyone,
I'm creating this topic to gather information about sizing a “heavy-load” Graylog cluster.

Especially the sizing of resources (CPU / IO / RAM) for each node (Graylog / Elasticsearch), heap memory configuration… tricky options…

I think some general advice would be interesting for the community.

Inside this Topic :

@macko003 (great performance / sizing review): User feedback / guides for a heavy-load Graylog cluster

@benvanstaveren, review of his Graylog cluster: User feedback / guides for a heavy-load Graylog cluster

Other topics:

@jfowler: Understanding how to use a Graylog Cluster

@scampuza: Performance Tuning Whitepaper, Guide, Doc

@Aenima4six2: Championing Graylog and need performance advice

@zahnd: Enhance Graylog search performance by adding new Elastic nodes?

@arnaud: Best practice for index and shards configuration

External links:

A great Elasticsearch cluster design approach: Designing the Perfect Elasticsearch Cluster: the (almost) Definitive Guide

Loggly tips for Elasticsearch:

From @jan, Elasticsearch shard sizing and considerations:

Graylog as Aggregator and ETL tool
How to make search queries run faster?
(Tess) #2

Paging @macko003, who apparently does 13TB each month :smiley:


Gorgeous :smiley: @macko003, feel free to provide any feedback to the community.


Based on my performance monitoring it's not “heavy load”, but maybe it's useful.
// I mentioned these before, but here they are in one place.

4-6k logs/sec ≈ 350-400 million logs/day
Peaks when a misconfigured device sends all its logs - 1-2 million/min - no problem
~40-45 MB/s of load balancer output traffic to the GL servers
We have 40+ streams, 600+ sources, 50+ different source types

First of all, we planned it geo-redundant, so factor that in.
Our system at the moment:
2 load balancer - nginx (we will change it to ipvs) - 2 vcpu/2GB mem
4 graylog servers - 8 vcpu, 16GB mem - 10GB heap
@jan suggested somewhere to use no more than 2 GB of heap, BUT
we tried a 4 GB heap: Java's GC ran more often and for longer, and during GC the OS UDP error counters increased more than with the 10 GB heap.
10 Elasticsearch servers - 8 vcpu, 32 GB mem, 16 GB heap - 3 TB storage, ~10 MB/s max daily write speed; 40 MB/s is not a problem
3 mongodb - 2 vcpu, 2 GB mem
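For reference, those heap sizes are normally set outside graylog.conf; a sketch, assuming the standard DEB/RPM package layout (file paths and exact values are illustrative):

```conf
# /etc/default/graylog-server — 10 GB Graylog heap (illustrative)
GRAYLOG_SERVER_JAVA_OPTS="-Xms10g -Xmx10g"

# /etc/elasticsearch/jvm.options — 16 GB Elasticsearch heap (illustrative)
-Xms16g
-Xmx16g
```

Keeping -Xms and -Xmx equal avoids the JVM resizing the heap at runtime.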
In GL:
~50 inputs
~40 streams, ~ 10 with syslog output
50+ extractors
10 pipelines

My experiences:
The load balancer and mongo servers do nothing :slight_smile: - no load on them.
Graylog can process all messages immediately (except peaks) (0% in/out/process buffer usage on every node); 15% CPU usage at night, 20% in the daytime.
Although you get better performance if you increase output_batch_size in GL, don't do it.
I think Elasticsearch is currently the bottleneck in our system - 20% CPU usage on every node. We can increase the CPU and memory next.
You need ES heap of about 2-3% of the stored data (without replicas) to have usable search.
In our system a single-word search over 35 days of data, including the histogram drawing, takes about 5 seconds.
Monitor all the performance data you can (I think that could be another good topic).
I did some stress tests on a cloned test system about a year ago with loggen; 50k logs/sec was no problem with 4 ES servers.
I hope the administrators will slowly start to use the system and increase the amount of logs, so I can start doing some performance optimization.
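The 2-3% heap rule of thumb above turns into a quick back-of-the-envelope check; a sketch with made-up volumes (not this cluster's real numbers):

```python
import math

def es_heap_needed_gb(stored_primary_tb: float, ratio: float = 0.025) -> float:
    """Rough total ES heap (GB) for usable search: ~2-3% of stored data without replicas."""
    return stored_primary_tb * 1024 * ratio

# e.g. 15 TB of primary (replica-free) data at the 2.5% midpoint
heap_gb = es_heap_needed_gb(15)   # 384.0 GB of heap across the cluster
nodes = math.ceil(heap_gb / 16)   # 24 nodes at 16 GB heap apiece
```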

My “special” settings:
// survive a weekend without an Elasticsearch connection
message_journal_max_age = 48h
message_journal_max_size = 95gb
// for geo-redundant data storage across sites A and B
cluster.routing.allocation.awareness.attributes: site
// to work around the GL bug mentioned above
http.max_content_length: 250mb
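For anyone copying these: the first two are graylog.conf settings, the last two belong in elasticsearch.yml. A sketch (the per-node `site` attribute value is an assumption, and the `node.attr.*` syntax assumes ES 5+):

```yaml
# graylog.conf — survive a weekend without an Elasticsearch connection
message_journal_max_age = 48h
message_journal_max_size = 95gb

# elasticsearch.yml — geo-redundant shard allocation across sites
cluster.routing.allocation.awareness.attributes: site
node.attr.site: site-a   # each ES node declares its own site (assumed value)

# elasticsearch.yml — allow larger bulk requests
http.max_content_length: 250mb
```

Allocation awareness only works if every data node sets the named attribute, so replicas get spread across sites.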

Every day we take an Elasticsearch snapshot of every index set and index to NFS.
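Those daily snapshots can be scripted against the Elasticsearch snapshot API; a minimal sketch, assuming the NFS mount is at /mnt/nfs/es-snapshots and listed in `path.repo` on every node (repository and index names are made up):

```shell
# Register an NFS-backed snapshot repository (one-time setup)
curl -XPUT 'http://localhost:9200/_snapshot/nfs_backup' \
  -H 'Content-Type: application/json' \
  -d '{ "type": "fs", "settings": { "location": "/mnt/nfs/es-snapshots" } }'

# Snapshot one index — e.g. run daily from cron, one call per index
curl -XPUT 'http://localhost:9200/_snapshot/nfs_backup/graylog_123-daily?wait_for_completion=true' \
  -H 'Content-Type: application/json' \
  -d '{ "indices": "graylog_123" }'
```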

Some evidence, by Photoshop :slight_smile:
(The first graph shows the data with replicas, below it an index set without replicas; other index set names are not public)


Colleagues have started to log a bit more. At the moment 13-15k logs/sec, ~3x the previous daytime traffic.
GL: 6% -> 20% CPU; ES output average process time 200 -> 230k µs, per node
ES: 6% -> 20% CPU; 4 -> 11 MBps disk IO; 300 -> 900 ms index time; 0 -> 4 ms fetch time, per node
LB: 10% -> 20% CPU; 47 -> 130 mbps interface bandwidth (total)
As I see it, we will need to increase the ES disk size.

Graylog heap size maximum
(Tess) #5

Odd… clicking that little :heart: icon more than once just makes it go ba-bump, ba-bump :smiley: I can only upvote you once, @macko003


So much useful information :heart_eyes: thanks @macko003

(Ben van Staveren) #7

Not entirely sure if our setup classifies as “heavy load” but here goes.

Keep in mind we run everything on bare metal (yes, cloud is great, but sometimes it isn’t :smiley: )

Currently running as follows:

3 Graylog servers (24-core CPU, 128 GB memory, 32 GB heap allocated to Graylog, which doesn't seem to be an issue; also 12 processbuffer_processors and 4 outputbuffer_processors)
25 Elasticsearch servers (19 data, 3 master, 3 routing). Data nodes are 64 GB memory, quad-core CPU, 32 GB heap for ES, 2x4 TB RAID 0 storage for data. Master and routing nodes are 32 GB memory, quad-core CPU, 16 GB heap for ES.
1 MongoDB instance on each Graylog server, in a replica set.
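The processor counts mentioned above are plain graylog.conf settings; for reference (values copied from this post, comments are mine):

```conf
# graylog.conf — buffer processor threads for a 24-core box
processbuffer_processors = 12
outputbuffer_processors = 4
# inputbuffer_processors is a separate knob, left at its default here
```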

Graylog itself runs with 2 inputs, currently 50-odd streams, with about 30 single-stage pipelines, and 5 pipelines with multiple stages and a heck of a lot of grokking/lookup tabling going on.

The reason for the number of data servers is two-fold: first, we have hilarious retention policies on some of this data (60 days is the norm), and second, we run most index sets with 3 replicas because the data is important - as well as for the boost to search speed it brings.

We’re currently pushing a consistent 2500-3000 msg/sec through.

Graylog's journal is set to a max age of 48h and 512 GB, since that is about the average size of logs we pull in across that time - so at least I can be gone for a weekend or something and heroically save the world on a Monday.

Backup wise we don’t actually back up the ES data on account of the replicas we run (accepted risk), index archival/closure/retention is handled by our in-house snapshot manager that uses the Graylog API to find indices that contain “stale” data, and snapshots them to S3 before removing them.

(Jan Doberstein) #8

Thank you for this topic. I want to add these two links:

Sharding and size calculation in Elasticsearch

Fix unassigned shards in Elasticsearch

(Ben van Staveren) #9

Since we have daily indices for legacy reasons, we run 4 shards per index given our average daily volume of 100 GB (100/30 ≈ 3.33, so 4 shards of up to 30 GB apiece should cover it). We may switch to size-based indices, given that Graylog manages it all (in our previous setup we used an external tool to manage index closure/archiving).
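That arithmetic generalizes to any daily volume; a tiny sketch of the rule used above (30 GB target shard size):

```python
import math

def shards_per_index(daily_gb: float, target_shard_gb: float = 30.0) -> int:
    """Primary shards needed so each shard stays at or under the target size."""
    return max(1, math.ceil(daily_gb / target_shard_gb))

shards_per_index(100)   # 100/30 = 3.33 -> 4 shards
shards_per_index(20)    # a small day still gets at least 1 shard
```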

(Tess) #10

*cough* Try five frackin’ years… That’s how long our security-related logs need to be retained :frowning: Technically it’s easily doable, but it sure is gonna get messy. It’s tempting to make an output that just barfs those into a logfile.


Thanks Jan, I updated the main post.


Interesting - and are your long-term retention log files for security purposes saved on tapes?

(Ben van Staveren) #13

We also keep 5 years but after 6 months we shuffle everything off to a nice cozy S3 bucket… bit more manageable that way :smiley:

(Tess) #14

Pffft, heck no :smiley: Tapes? We don’t have those… With luck I’ll shift them into a dead file on a filesystem. Ditto for S3 and other cloud storage: our env is 100% off-the-grid, so we’re stuck with what we have available.

(Ben van Staveren) #15

That sounds so… James Bond-ish, somehow.

(Tess) #16



Mouhahaha, tape is cheap and not so bad for very long retention :stuck_out_tongue:

(Ben van Staveren) #18

You wouldn't do that to a fellow Dutchman, would you… >.>

But yeah off-the-grid entirely seems like you need yourself a heck of a lot of storage boxes to keep up with demand :smiley:

(Tess) #19

Adoeh! You know we Dutch can sometimes react pretty fiercely among ourselves, right? :wink:

But let's get back on topic :slight_smile: Our environment is far from heavy-load: I'm expecting less than 10 GB a day. Our difficulty lies in both the long retention times and our network topology, which is highly dispersed and comes with an expectation of network loss. I still need to look into queueing servers that can locally cache the logging in case the Graylog receivers become unreachable.

(Jan Doberstein) #20

Use Beats - it can handle the receiver being unavailable. Or, if you're bound to syslog, something like what's written here:
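A minimal Filebeat sketch for that setup (hostname and file paths are placeholders; Graylog's Beats input speaks the Logstash/Lumberjack protocol, commonly on port 5044). Filebeat tracks file offsets locally and retries on failure, so logs wait at the source while the receiver is unreachable:

```yaml
# filebeat.yml — ship local log files to a Graylog Beats input
filebeat.inputs:
  - type: log
    paths:
      - /var/log/syslog
      - /var/log/messages

output.logstash:
  # Graylog Beats input (placeholder hostname)
  hosts: ["graylog.example.org:5044"]
```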