User feedback / guides for heavy-load Graylog clusters

Not entirely sure if our setup qualifies as “heavy load”, but here goes.

Keep in mind we run everything on bare metal (yes, cloud is great, but sometimes it isn’t :smiley: )

Currently running as follows:

3 Graylog servers (24-core CPU, 128 GB memory, 32 GB heap allocated to Graylog, which doesn’t seem to be an issue; also 12 processbuffer_processors and 4 outputbuffer_processors; see the sketch after this list)
25 Elasticsearch servers (19 data, 3 master, 3 routing). Data nodes have 64 GB memory, a quad-core CPU, 32 GB heap for ES, and 2x 4 TB RAID 0 storage for data. Master and routing nodes have 32 GB memory, a quad-core CPU, and 16 GB heap for ES.
1 MongoDB instance on each Graylog server, running as a replica set.
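
For what it’s worth, the Graylog side of that boils down to only a few lines of configuration. A rough sketch of the relevant bits (exact file locations and heap flags depend on how you installed it):

```
# /etc/graylog/server/server.conf (only the settings relevant to the sizing above)
processbuffer_processors = 12
outputbuffer_processors = 4

# The 32 GB heap itself lives outside server.conf, e.g. in
# /etc/default/graylog-server on Debian-style installs:
# GRAYLOG_SERVER_JAVA_OPTS="-Xms32g -Xmx32g"
```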

Graylog itself runs with 2 inputs, currently 50-odd streams, with about 30 single-stage pipelines, and 5 pipelines with multiple stages and a heck of a lot of grokking/lookup tabling going on.
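
To give a flavour of what those pipelines do, a typical rule looks something like the sketch below; the field names, the grok pattern and the lookup table name are made up for illustration, not our actual rules:

```
rule "parse firewall log and enrich"
when
    has_field("message")
then
    // grok the raw message into structured fields
    let parsed = grok(
        pattern: "%{IPV4:src_ip} %{IPV4:dst_ip} %{NUMBER:dst_port}",
        value: to_string($message.message),
        only_named_captures: true
    );
    set_fields(parsed);

    // then enrich from a lookup table
    set_field("src_owner", lookup_value(lookup_table: "asset_owner", key: to_string($message.src_ip)));
end
```

The multi-stage pipelines mostly just chain a few of these, with later stages keying off fields set in earlier ones.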

The reason for the number of data servers is two-fold: first, we have hilariously long retention requirements on some of this data (60 days is the norm), and second, we run most index sets with 3 replicas because the data is important, as well as for the boost to search speed it brings.

We’re currently pushing a consistent 2500-3000 msg/sec through.

Graylog’s journal is set to a max age of 48h and a max size of 512 GB, since that is roughly the average volume of logs we pull in over that time, so at least I can be gone for a weekend or something and heroically save the world on a Monday.
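
Both of those are plain server.conf settings; a sketch of what that looks like (same caveat about install layout as above):

```
# /etc/graylog/server/server.conf
message_journal_max_age = 48h
message_journal_max_size = 512gb

# make sure message_journal_dir points at a disk that can actually hold that
#message_journal_dir = /var/lib/graylog-server/journal
```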

Backup-wise, we don’t actually back up the ES data, on account of the replicas we run (accepted risk). Index archival/closure/retention is handled by our in-house snapshot manager, which uses the Graylog API to find indices that contain “stale” data and snapshots them to S3 before removing them.
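
I can’t share the actual snapshot manager, but the core of it is only a couple of API calls. Here is a minimal Python sketch of the idea, assuming an already-registered S3 snapshot repository, and with the hostnames, credentials, repository name and cutoff all as placeholders rather than our real values:

```python
#!/usr/bin/env python3
"""Sketch of a snapshot/retention helper: ask Graylog for index time ranges,
snapshot anything whose newest message is older than the cutoff to S3,
then delete it through the Graylog API."""
from datetime import datetime, timedelta, timezone

import requests

GRAYLOG_URL = "https://graylog.example.com:9000/api"   # placeholder
ES_URL = "https://es-routing.example.com:9200"         # placeholder
AUTH = ("admin", "change-me")                          # or an API token
REPO = "s3_archive"                                    # pre-registered repository-s3 repo
MAX_AGE = timedelta(days=60)                           # our usual retention window


def stale_indices():
    """Yield indices whose newest message is older than MAX_AGE."""
    resp = requests.get(f"{GRAYLOG_URL}/system/indices/ranges", auth=AUTH)
    resp.raise_for_status()
    cutoff = datetime.now(timezone.utc) - MAX_AGE
    for rng in resp.json()["ranges"]:
        end = datetime.fromisoformat(rng["end"].replace("Z", "+00:00"))
        if end < cutoff:
            yield rng["index_name"]


def snapshot_and_remove(index):
    """Snapshot one index to the S3 repo, then let Graylog delete it."""
    requests.put(
        f"{ES_URL}/_snapshot/{REPO}/{index}",
        params={"wait_for_completion": "true"},
        json={"indices": index, "include_global_state": False},
    ).raise_for_status()
    requests.delete(
        f"{GRAYLOG_URL}/system/indexer/indices/{index}",
        auth=AUTH,
        headers={"X-Requested-By": "snapshot-manager"},  # Graylog wants this on writes
    ).raise_for_status()


if __name__ == "__main__":
    for index in stale_indices():
        snapshot_and_remove(index)
```

The real thing adds retries, verifies that each snapshot actually completed, and skips the current write-active index per index set, but the flow is the same.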
