I’ll chime in with what we have.
I can’t answer #1 - we haven’t moved any of this to Docker yet.
As for #2 - we are working through issues and learning as we go.
Our current ingest rate is between 8k and 15k messages per second, depending on time of day.
We are working to keep 12 months of data online.
What we have learned so far:
Elasticsearch shards/indexes do consume memory. We started with 10 shards per index and a new index every 20,000,000 records, which resulted in roughly 800 indexes before we switched gears and moved to fewer, larger indexes. Memory and CPU on the ES nodes matter - don’t skimp if you can help it. If you need financial flexibility, look at the hot-warm architecture (https://www.elastic.co/blog/hot-warm-architecture) to allow fast log ingest and slower/cheaper searching later.
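For what it’s worth, in Graylog the shards-per-index and replica counts are normally set on the index set (System → Indices) rather than directly in ES, but the equivalent Elasticsearch index template looks roughly like this - the pattern and shard count here are illustrative, not our exact values:

```json
PUT _template/graylog-custom
{
  "index_patterns": ["graylog_*"],
  "settings": {
    "index.number_of_shards": 4,
    "index.number_of_replicas": 1
  }
}
```

Fewer shards per index means fewer total shards for the cluster to track, which is where the memory savings comes from.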
We use SSD nodes to hold the first 24 hours of logs, then the logs are moved to less expensive servers with SATA disks and less ram. If budget were no issue, all the nodes would be SSD.
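A minimal sketch of how that hot/warm split is wired up, using the `box_type` attribute name from the Elastic blog post (the exact attribute syntax varies by ES version - older 2.x used `node.box_type`, 5.x+ uses `node.attr.box_type`):

```yaml
# elasticsearch.yml on the SSD (hot) nodes
node.attr.box_type: hot

# elasticsearch.yml on the SATA (warm) nodes
node.attr.box_type: warm
```

New indexes get allocated to the hot nodes; when an index ages out, flipping its allocation requirement migrates it to the warm tier:

```json
PUT graylog_1234/_settings
{
  "index.routing.allocation.require.box_type": "warm"
}
```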
Extractors and pipelines can change how many messages you can accept per second and still keep up. Our Graylog server nodes have two 6-core Intel® Xeon® CPU E5-2609 v3 @ 1.90GHz. This is enough performance, but it gets challenged as load gets heavy. Our initial strategy was to send in “perfect” logs wherever possible and use extractors where the source couldn’t do that kind of thing (firewalls and switches, for example).
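For the sources that can’t send structured logs, a pipeline rule along these lines does the extraction server-side - the grok pattern and field names here are purely illustrative, not our actual firewall format:

```
rule "parse firewall syslog"
when
  has_field("message")
then
  // Illustrative pattern only - match source/destination IPs out of the raw message
  let parsed = grok(pattern: "%{IPV4:src_ip} -> %{IPV4:dst_ip}", value: to_string($message.message));
  set_fields(parsed);
end
```

Every rule like this costs CPU per message, which is why CPU headroom on the Graylog nodes matters as ingest rates climb.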
Big Impact items for us:
- Hot/Warm Architecture - fairly instant increase in speed.
- Enable the disk journal on the Graylog nodes. This seems to relieve some memory pressure for us, and makes GL more resilient if you need to restart the node.
- CPU on ES nodes - the documentation for ES seemed to indicate CPU wasn’t a huge concern, but I don’t think they accounted for the age of our CPUs. Be prepared to change things around as you find the right balance.
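For reference, the disk journal mentioned above is controlled in Graylog’s server.conf; a sketch with typical values (the path and size here are illustrative - size it for how long you want to buffer during an ES outage or restart):

```
# /etc/graylog/server/server.conf
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_size = 5gb
```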
Hope this helped.