Not entirely sure if our setup qualifies as “heavy load”, but here goes.
Keep in mind we run everything on bare metal (yes, cloud is great, but sometimes it isn’t).
Currently running as follows:
3 Graylog servers (24-core CPU, 128 GB memory, 32 GB heap allocated to Graylog, which doesn’t seem to be an issue; also 12 processbuffer_processors and 4 outputbuffer_processors)
25 Elasticsearch servers (19 data, 3 master, 3 routing). Data nodes are 64 GB memory, quad-core CPU, 32 GB heap for ES, 2 × 4 TB RAID 0 storage for data. Master and routing nodes are 32 GB memory, quad-core CPU, 16 GB heap for ES.
1 MongoDB instance on each Graylog server, in a replica set. (The relevant config for all three tiers is sketched just below.)
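For anyone curious, the config boils down to roughly this — file paths and hostnames are from memory and depend on your packaging, so treat it as a sketch rather than gospel:

```
# /etc/graylog/server/server.conf (the bits that matter here)
processbuffer_processors = 12
outputbuffer_processors = 4
mongodb_uri = mongodb://graylog1:27017,graylog2:27017,graylog3:27017/graylog?replicaSet=rs01

# /etc/default/graylog-server -- heap goes via the JVM options
GRAYLOG_SERVER_JAVA_OPTS="-Xms32g -Xmx32g"
```

```
# elasticsearch.yml on a data node (5.x/6.x-style role flags)
node.master: false
node.data: true
node.ingest: false

# dedicated masters flip node.master to true and node.data to false;
# the routing (coordinating-only) nodes set all three to false.

# jvm.options on the data nodes
-Xms32g
-Xmx32g
```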
Graylog itself runs with 2 inputs, currently 50-odd streams, about 30 single-stage pipelines, and 5 pipelines with multiple stages and a heck of a lot of grokking and lookup tables going on (a simplified rule is sketched below).
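To give an idea, most rules in those multi-stage pipelines are variations on this theme — the field names, grok pattern and lookup table here are made up purely for illustration:

```
rule "enrich firewall events"
when
  has_field("message")
then
  // pull fields out with grok, then enrich via a lookup table
  let parsed = grok(pattern: "%{IPV4:src_ip} %{WORD:action} %{GREEDYDATA:details}",
                    value: to_string($message.message),
                    only_named_captures: true);
  set_fields(parsed);
  set_field("site", lookup_value("site_by_ip", to_string($message.src_ip)));
end
```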
The reason for the number of data servers is twofold: first, we have hilariously long retention policies on some of this data (60 days is the norm); second, we run most index sets with 3 replicas, both because the data is important and for the boost to search speed it brings.
We’re currently pushing a consistent 2500-3000 msg/sec through.
Graylog’s journal is set to a max age of 48h and a max size of 512 GB, since that is roughly the volume of logs we pull in over that window, so at least I can be gone for a weekend or something and heroically save the world on a Monday. The relevant settings are below.
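The journal part is just these lines in server.conf (the enabled flag is the default, shown for completeness):

```
# /etc/graylog/server/server.conf
message_journal_enabled = true
message_journal_max_age = 48h
message_journal_max_size = 512gb
```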
Backup-wise, we don’t actually back up the ES data, on account of the replicas we run (accepted risk). Index archival/closure/retention is handled by our in-house snapshot manager, which uses the Graylog API to find indices that contain “stale” data and snapshots them to S3 before removing them (rough sketch below).
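The snapshot manager itself is in-house so I can’t share it, but the core loop is conceptually something like the Python below. Hostnames, the `s3_archive` repository and the 60-day cutoff are placeholders, and the real thing does a lot more bookkeeping and error handling:

```python
#!/usr/bin/env python3
"""Rough sketch of the idea behind our snapshot manager (hypothetical names throughout)."""
from datetime import datetime, timedelta, timezone

import requests

GRAYLOG_API = "https://graylog.example.com:9000/api"
ES_API = "http://es-router.example.com:9200"
API_TOKEN = "REPLACE_ME"          # Graylog API tokens authenticate as <token> / "token"
SNAPSHOT_REPO = "s3_archive"      # assumes an S3 snapshot repository is already registered in ES
CUTOFF = datetime.now(timezone.utc) - timedelta(days=60)


def stale_indices():
    """Ask Graylog for index time ranges; yield indices whose newest message is older than the cutoff."""
    resp = requests.get(f"{GRAYLOG_API}/system/indices/ranges",
                        auth=(API_TOKEN, "token"),
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    for rng in resp.json()["ranges"]:
        end = datetime.fromisoformat(rng["end"].replace("Z", "+00:00"))
        if end < CUTOFF:
            yield rng["index_name"]


def archive(index):
    """Snapshot a single index to the S3 repository, then drop it from the cluster."""
    snap = requests.put(f"{ES_API}/_snapshot/{SNAPSHOT_REPO}/{index}",
                        params={"wait_for_completion": "true"},
                        json={"indices": index, "include_global_state": False})
    snap.raise_for_status()
    requests.delete(f"{ES_API}/{index}").raise_for_status()


if __name__ == "__main__":
    for index in stale_indices():
        print(f"archiving {index}")
        archive(index)
```

The index ranges endpoint gives you the newest message timestamp per index, which is what makes the “is this stale” decision cheap.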