Users feedbacks / Guides for heavy load graylog Cluster

Hi everyones,
i create this topic to have informations about sizing for a “heavy load” graylog cluster.

Specially for the sizing of ressources (CPU / IO / RAM) of each nodes (graylog / elasticsearch), heap memory configuration …tricky options …

I think it would be interesting for the community to have generals advises.


Inside this Topic :

@macko003 (great performance / sizing review) : Users feedbacks / Guides for heavy load graylog Cluster

@benvanstaveren, review of his graylog cluster : Users feedbacks / Guides for heavy load graylog Cluster

Others Topics registry :

@jfowler : Understanding how to use a Graylog Cluster

@scampuza Performance Tuning Whitepaper, Guide, Doc

@Aenima4six2 Championing Graylog and need performance advice

@zahnd Enhance Graylog search performance by adding new Elastic nodes?

@arnaud Best practice for index and shards configuration

External link:

Another (and great) external blog : https://thehftguy.com/2016/09/12/250-gbday-of-logs-with-graylog-lessons-learned/

great Elasticsearch cluster design approach : Designing the Perfect Elasticsearch Cluster: the (almost) Definitive Guide

loggly tips for Elastic : https://www.loggly.com/blog/nine-tips-configuring-elasticsearch-for-high-performance/

from @jan, Elasticsearch shards sizing and considerations :


1 Like

Paging @macko003, who apparently does 13TB each month :smiley:

Gorgeous :smiley:@macko003 free to provide any feedbacks to the community.

Based on my performance monitoring, it’s not “heavy load”, but maybe useful
//I mentioned these, but here is at one place.

Traffic:
4-6k log/sec ~ 350-400.000.000 log/day
peaks when a missconfigured device sends all its log - 1-2.000.000/min - no problem
~40-45 MB/s load balancer output if traffic to GL servers
EDIT2:
We have 40+ streams, 600+ sources, 50+ different source type

First of all, We planned it geo redundant, so calculate with it.
At the moment our system:
2 load balancer - nginx (we will change it to ipvs) - 2 vcpu/2GB mem
4 graylog servers - 8 vcpu, 16GB mem - 10GB heap
EDIT3:
@jan suggest somewhere not more than 2 GB heap, BUT
We checked the HEAP with 4GB, but the java’s GC run more often and longer, and under the GC the OS UDP error counters increase more than the 10GB HEAP.
10 Elasticsearch servers - 8 vcpu, 32 GB mem, 16 GB Heap - 3TB storage ~10 MB/s daily max write speed. 40MB/s not problem
3 mongodb - 2 vcpu, 2 GB mem
In GL:
~50 input
~40 streams, ~ 10 with syslog output
50+ extractors
10 pipelines

My experiences
loadbalancer, mongo servers do nothing :slight_smile: , no load on it
Graylog can process immediately all messages (except peaks) (0% in/out/process buffer usage on every node), 15% night, 20% daytime cpu usage.
Although you get better performance if you increase the output_batch_size in GL, don’t do it. https://github.com/Graylog2/graylog2-server/issues/5091
I think at the moment the elastic is the battleneck in our system. 20% CPU usage on every node. We can increase the cpu and the mem next.
You need about 2-3% ES heap of the stored data without replica to have a usable search.
In out system one single word search takes about 5 sec with the histogram drawing in the 35 days data
monitor all performance data what you can (I think it could be another good topic)
I did some stress tests on a cloned test system about 1 year ago, with loggen, the 50k log/sec won’t be problem with 4 ES servers
I hope slowly the administrators will start to use the system, and increase the amount of logs, and I can start to do some performance optimization

My “special” settings:
GL:
//live without elastic connection for a weekend
message_journal_max_age = 48h
message_journal_max_size = 95gb
ES:
//for geo redundant data storage
node.attr.site: A
cluster.routing.allocation.awareness.force.site.values: A,B
cluster.routing.allocation.awareness.attributes: site
//the GL bug over
http.max_content_length: 250mb

Backup:
Every day elastic snapshot for every IS and index to NFS

Some evidence by photoshop :slight_smile:
(At the first, you see the data with replica, under IS without replica, other IS names not public)
graylog-indices

(daily)
graylog-daily

EDIT :
Colleagues start to log a bit more. At the moment 13-15k log/sec ~x3 of the previous daytime traffic
GL: 6%-> 20% CPU; ES output average process time 200->230k us; /node
ES: 6%->20%; 4->11 MBps disk IO; 300->900 ms index time; 0->4 ms fetch time /node
LB: 10%->20% CPU; 47-> 130 mbps interface bandwidth (total)
As I see, we will need to increase the ES disk size.

5 Likes

Odd… clicking that little :heart: icon more than once just makes it go ba-bump,ba-bump:smiley: I can only upvote you once @macko003

2 Likes

So many useful informations :heart_eyes: thanks @macko003

Not entirely sure if our setup classifies as “heavy load” but here goes.

Keep in mind we run everything on bare metal (yes, cloud is great, but sometimes it isn’t :smiley: )

Currently running as follows:

3 Graylog servers (24 core cpu, 128Gb memory, 32Gb heap allocated to Graylog which doesn’t seem to be an issue; also 12 processbuffer_processors and 4 outputbuffer_processors)
25 Elasticsearch servers (19 data, 3 master, 3 routing). Data nodes are 64Gb memory, quad-core CPU, 32Gb heap for ES, 2x4Tb raid 0 storage for data. Master and routing nodes are 32Gb memory, quad core CPU, 16Gb heap for ES.
1 Mongodb instance on each Graylog server in a replica set.

Graylog itself runs with 2 inputs, currently 50-odd streams, with about 30 single-stage pipelines, and 5 pipelines with multiple stages and a heck of a lot of grokking/lookup tabling going on.

The reason for the number of data servers is two-fold: first, we have hiliarious retention policies on some of this data (60 days is the norm), and we run most index sets with 3 replicas because the data is important - as well as the boost to search speed it brings.

We’re currently pushing a consistent 2500-3000 msg/sec through.

Graylog’s journal is set to a max age of 48h, and 512Gb since that is about the average size of logs we pull in across that time so at least I can be gone for a weekend or something and heroically save the world on a Monday.

Backup wise we don’t actually back up the ES data on account of the replicas we run (accepted risk), index archival/closure/retention is handled by our in-house snapshot manager that uses the Graylog API to find indices that contain “stale” data, and snapshots them to S3 before removing them.

2 Likes

Thank you for this topic. I want to add this two links:

Sharding and size calculation in Elasticsearch

Fix unassigned shards in Elasticsearch

2 Likes

Since for legacy reasons we have daily indices, we run at 4 shards per index given our average volume of 100Gb daily (100/30 = 3.33 = 4 shards of 30Gb a piece should cover it) - we may switch to using size-based indices given that Graylog manages it all (in our previous setup we used an external tool to manage index closure/archiving).

1 Like

*cough* Try five frackin’ years… That’s how long our security-related logs need to be retained :frowning: Technically it’s easily doable, but it sure is gonna get messy. It’s tempting to make an output that just barfs those into a logfile.

1 Like

Thanks Jan. i update the main post.

Interesting, and your long-term retention logs fils for security purposes are saves on tapes ?

We also keep 5 years but after 6 months we shuffle everything off to a nice cozy S3 bucket… bit more manageable that way :smiley:

Pffft, heck no :smiley: Tapes? We don’t have those… With luck I’ll shift them into a dead file on a filesystem. Ditto for S3 and other cloud storage: our env is 100% off-the-grid, so we’re stuck with what we have available.

That sounds so… James Bond-ish, somehow.

:stuck_out_tongue:

mouhahaha tape is cheap and not so disgusting for very long retention :stuck_out_tongue:

Dat ga je een mede-Nederlander toch niet aandoen… >.>

But yeah off-the-grid entirely seems like you need yourself a heck of a lot of storage boxes to keep up with demand :smiley:

2 Likes

Adoeh! Je weet dat Belanda onderling soms best pittig kunnen reageren, toch? :wink:

But let’s get back on-topic :slight_smile: Our environment is far from heavy-load: I’m expecting less than 10GB a day. Our difficulty lies in both the long retention times as well as our network topology, which is both highly dispersed and has an expectancy of network loss. I still need to look into queueing servers that can be used to locally cache the logging in case the Graylog receivers become unreachable.

use beats - that can handle if the receiver is not available. Or if bound to syslog something like written here:

1 Like