Looking for Big Production Environment

Hi

we are looking at the “Bigger production setup” (http://docs.graylog.org/en/2.2/pages/architecture.html) for our environment.

Right now we have 120+ servers in AWS. We expect to process 10,000 messages per second with a retention period of at least 12 months.

Please suggest a few things:

  1. Is it convenient to set up the bigger production setup with Docker or Ansible?
  2. What server capacity should we plan for handling and managing these log rates?

Thanks & Regards,
Amit

Amit,

I’ll chime in with what we have.

I can’t answer #1 - we haven’t moved any of this to Docker yet.

As for #2 - we are working through issues and learning as we go.

Our current ingest rate is between 8,000 and 15,000 messages per second, depending on the time of day.
We are working to keep 12 months of data online.

What we have learned so far:

Elasticsearch
Elasticsearch shards/indexes do consume memory. We started with 10 shards per index and a new index every 20,000,000 records. This resulted in roughly 800 indexes before we switched gears and moved to fewer, larger indexes. Memory/CPU on the ES nodes matter - don’t skimp if you can help it. If you need financial flexibility, look at the hot/warm architecture (https://www.elastic.co/blog/hot-warm-architecture) to allow fast log inserts and slower/cheaper searching later.
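
If you want a rough picture of how much shard overhead the cluster is carrying, something like this sketch works. It assumes an Elasticsearch endpoint at localhost:9200 and uses the `_cat/shards` API; adjust for your cluster:

```python
import requests

ES = "http://localhost:9200"  # assumed Elasticsearch endpoint

# Each open shard carries its own Lucene/heap overhead, so hundreds of small
# indexes add up quickly. Count them before and after changing rotation settings.
shards = requests.get(f"{ES}/_cat/shards", params={"format": "json"}).json()
indexes = {s["index"] for s in shards}
print(f"{len(indexes)} indexes, {len(shards)} shards total")
```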

We use SSD nodes to hold the first 24 hours of logs, then the logs are moved to less expensive servers with SATA disks and less RAM. If budget were no issue, all the nodes would be SSD.
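
The move itself is just an index-settings update once the nodes are tagged (e.g. with a `box_type` attribute, as in the hot/warm blog post linked above). A minimal sketch, assuming that attribute name and a local Elasticsearch endpoint:

```python
import requests

ES = "http://localhost:9200"  # assumed Elasticsearch endpoint

def move_to_warm(index_name):
    """Re-route an index that has aged out of the 24-hour hot window onto
    the warm (SATA) nodes by requiring the 'warm' node attribute."""
    resp = requests.put(
        f"{ES}/{index_name}/_settings",
        json={"index.routing.allocation.require.box_type": "warm"},
    )
    resp.raise_for_status()
    return resp.json()

# Example (index name is illustrative):
# move_to_warm("graylog_742")
```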

Graylog

Extractors and pipelines can change how many messages you can accept per second and still keep up. Our Graylog server nodes have two 6-core Intel Xeon E5-2609 v3 CPUs @ 1.90GHz. This is enough performance, but it is challenged as load gets heavy. Our initial strategy was to send in “perfect” logs wherever possible and use extractors only where the source couldn’t do that kind of thing (firewalls and switches, for example).
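
By “perfect” logs I mean the application ships structured fields itself (GELF, for example), so Graylog doesn’t have to run extractors at all. A minimal sketch, assuming a GELF UDP input on the default port 12201 (hostname and field names are illustrative):

```python
import json
import socket

GRAYLOG_HOST = "graylog.example.com"  # assumed GELF UDP input address
GRAYLOG_PORT = 12201                  # default GELF UDP port

# GELF 1.1 requires version, host and short_message; custom fields start with "_".
message = {
    "version": "1.1",
    "host": "app01",
    "short_message": "user login succeeded",
    "level": 6,
    "_user_id": 4242,
    "_source_app": "billing",
}

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(json.dumps(message).encode("utf-8"), (GRAYLOG_HOST, GRAYLOG_PORT))
```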

Big Impact items for us:

  1. Hot/Warm Architecture - fairly instant increase in speed.
  2. Enable the disk journal on the Graylog nodes. This seems to relieve some memory pressure for us, and it makes GL more resilient if you need to restart a node.
  3. CPU on ES nodes - documentation for ES seemed to indicate CPU wasn’t a huge concern, but I don’t think they knew the age of our CPUs :slight_smile:. Be prepared to change things around as you find the right balance (a quick way to watch per-node CPU is sketched below).
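
For point 3, the nodes stats API is an easy way to spot an ES node that can’t keep up at peak ingest. A minimal sketch, assuming a local Elasticsearch endpoint (the exact layout of the `os` block differs between ES versions):

```python
import requests

ES = "http://localhost:9200"  # assumed Elasticsearch endpoint

# Per-node OS-level stats; watch the CPU/load figures for nodes falling behind.
stats = requests.get(f"{ES}/_nodes/stats/os").json()
for node in stats["nodes"].values():
    print(node["name"], node["os"])  # key names vary by ES version
```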

Hope this helped.

Dustin


I see you didn’t get any other responses to this post - do you have any updates on your situation?

Our environment has settled down, and we run at a steady state of about 10,000 messages per second.

Current issues:
CPU power on Graylog nodes - occasionally we receive enough data that one node can’t get the logs processed quickly enough, and we are behind for periods of time. We plan to add two more Graylog nodes to beef up processing power during ingestion.

Current stats:
1 Load Balancer (not dedicated, but a VIP was created to handle Graylog data)
2 Graylog nodes (about 20 GB of heap used on each box)
14 ES nodes (2 master nodes, 2 SSD nodes for Hot Data, 10 spinning disk nodes for Warm data)
We have differing amounts of data kept online and created different index sets to handle that. We are meeting our search and storage needs.

It’s helpful to deploy a load balancer at the beginning of the process and have those issues figured out - I wish we had done that. With it in place, scaling is easily accomplished.
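
One nice thing for the LB health checks: Graylog exposes a load balancer status endpoint on its REST API that answers ALIVE or DEAD, so the balancer can drop a node you’ve paused. A minimal sketch, assuming the REST API is reachable on port 9000 under /api (adjust the base URL to your setup):

```python
import requests

GRAYLOG_API = "http://graylog01:9000/api"  # assumed REST API base URL

# The load balancer (or a probe script like this one) can poll this endpoint;
# Graylog returns 200/ALIVE when the node should receive traffic.
resp = requests.get(f"{GRAYLOG_API}/system/lbstatus", timeout=2)
print(resp.status_code, resp.text.strip())
```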

I hope things are going OK for you.

Dustin


Thanks for sharing your setup and experiences!