Graylog Sizing/Optimization

I’m running a Graylog cluster in AWS and I was hoping for some sizing/optimization advice.

We’re ingesting about 140 million messages a day (roughly 130 GB), and that’s only going to grow.

I have 3 ES 2.3.2 hosts running on M3.xlarge. ES has a 9 GB heap on each host and is maintaining 5 shards, each with 1 replica.

I have 2 Graylog 2.2.0 hosts running on C3.xlarge. Graylog has a 2 GB heap on each host, and each host also runs MongoDB and td-agent to receive secure logging connections (we use the secure_forward plugin to log over SSL).
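
For reference, the td-agent receiver on each Graylog host looks roughly like the snippet below. The shared key, hostname, and certificate paths are placeholders, and the exact option names depend on the secure_forward plugin version you run, so treat this as a sketch of our setup rather than a copy-paste config.

```
# td-agent.conf sketch: accept secure_forward (SSL) connections from shippers.
# shared_key, self_hostname, and cert paths are placeholders for our real values.
<source>
  @type secure_forward
  shared_key        some_shared_secret
  self_hostname     graylog-01.example.internal
  secure            yes
  cert_path         /etc/td-agent/ssl/server.crt
  private_key_path  /etc/td-agent/ssl/server.key
</source>
```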

The main issue I’m having is that CPU and memory are under heavy pressure on the Elasticsearch hosts. I think this is because we’re undersized and are exhausting our resources.

I have two questions:

  1. Are there any optimizations/configurations specific to Graylog I could make to wring more performance out of this before sizing up?
    Aside from not analyzing every field in log messages (sketched just after this list) and not throttling my indexes, I’m basically running ES out of the box.

  2. What would be a more ideal spec for this farm, other than just “bigger”?
    I found the Graylog Sizing Guidelines document, and while I know Jochen described them as “bad”, they also seem to indicate I’m running at about half the recommended size, especially on the Elasticsearch side.
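
To show what I mean by not analyzing every field: this is roughly the custom index template I load into ES. The template name and the field are just examples from my setup, and it assumes the default graylog_* index prefix, so don’t read it as a recommendation.

```
# Custom ES 2.x index template so selected message fields are stored not_analyzed.
# Template name, field name, and the graylog_* prefix are examples from my setup.
curl -XPUT 'http://localhost:9200/_template/graylog-custom-mapping' -d '
{
  "template": "graylog_*",
  "mappings": {
    "message": {
      "properties": {
        "source_ip": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'
```

A template only applies to indices created after it is installed, so it takes effect on the next index rotation.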

I can provide any other info about my setup that would help discussion.

Very interested in seeing some responses to this.

Have you tried adjusting your output batch size and buffer processor counts in the Graylog config? On the ES side, make sure you adjust the index refresh interval.
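
For example, something along these lines; the values are just the usual defaults to start tuning from, not recommendations, and the 30s refresh interval and hostname are placeholders.

```
# graylog server.conf: processor threads and output batching (tune to your cores and ingest rate)
processbuffer_processors = 5
outputbuffer_processors = 3
output_batch_size = 500
output_flush_interval = 1
```

```
# ES side: relax the refresh interval on the existing Graylog indices.
# New indices need the same setting in an index template; host and 30s are examples.
curl -XPUT 'http://localhost:9200/graylog_*/_settings' -d '{ "index": { "refresh_interval": "30s" } }'
```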

I can only tell you what I have done and what has worked. We have a different environment on the backend, since we run VMware on our own hardware backed by SAN storage vs. your EC2/Xen setup.

Here we go.
We are currently very close to your ingestion rate: we pull around 200 million+ messages per day and growing.

- 3 Graylog nodes, 4 vCPUs each, 16 GB memory with a 12 GB heap. (2 nodes work, but if things get behind it can take a while to catch up.)
- 8 Elasticsearch nodes, 4 vCPUs each, 12 GB memory with a 6 GB heap. (A bit of overkill, but we find that scaling horizontally with smaller instances is more flexible and easier to maintain.)

Storage can be sized to your requirements, but at our ingestion rate each Elasticsearch node holds 500 GB, and each can easily be upgraded if need be, or another node can be added. Migration tends to get unwieldy once you start pushing multi-TB VMs around. This nets us about one month of retention before we need to rotate out logs, given the current storage setup.

Now on to performance. While the current setup might be overkill for an ingestion rate that averages around 2,000 msg/s, we find that when we get a backlog or a spike in the incoming rate, this setup can easily sustain 30k+ messages per second and keeps up in almost real time.

Another thing to note: I only assign 1 shard per node, which seems to work just fine. We’re currently not using replicas, since in our case they aren’t all that worthwhile; we can stand to lose nodes/data, so they didn’t bring much value that we could see. If we were using this in a more critical context, we’d likely change that to reflect the importance of the data being ingested.
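
For what it’s worth, the shard/replica counts are just the settings Graylog creates its indices with. On Graylog 2.2 they’re configured per index set in the web interface (System → Indices); older versions took them from server.conf, along these lines. The numbers below just mirror the layout described above, so adjust them to your node count and how much you care about losing data.

```
# pre-2.2 server.conf style; on 2.2 set the same values in the index set configuration
elasticsearch_shards = 8      # e.g. one shard per ES node
elasticsearch_replicas = 0    # no replicas; only acceptable if data loss is tolerable
```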

With all this we have about 20-30 users a day using it for searching and dashboarding without any signs of slowness. Searches come back almost instantly. This includes 6 full-time display dashboards that are always on and make pretty constant use of Elasticsearch.

One thing to keep in mind: where possible, we have the log shippers (nxlog) do the parsing and formatting into GELF or JSON to lighten the load on Graylog (a rough example is below). If you just ship raw messages and do all the heavy lifting in Graylog with regex or Grok extractors, the requirements for your Graylog nodes to keep up will arguably be MUCH higher. So keep that in mind. You could likely make huge gains there, and avoid needing to upsize your Amazon instances, by offloading the parsing and formatting to the log shipping systems.
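
To illustrate the shipper-side idea, an nxlog config along these lines structures the message on the endpoint and sends GELF straight to a Graylog GELF UDP input. The host, port, and file input are placeholders; we vary the inputs per system.

```
# nxlog.conf sketch: format on the shipper and emit GELF so Graylog doesn't need
# regex/Grok extractors. Host, port, and the input file are placeholders.
<Extension gelf>
    Module      xm_gelf
</Extension>

<Input app_log>
    Module      im_file
    File        "/var/log/app/app.log"
</Input>

<Output graylog>
    Module      om_udp
    Host        graylog.example.internal
    Port        12201
    OutputType  GELF
</Output>

<Route to_graylog>
    Path        app_log => graylog
</Route>
```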

This is just what we have found works for us and our workloads, so don’t take it as the holy grail of configurations. What works for you could be wildly different.

There are some other knobs to tweak, but I have found this blog post informative for scaling: https://thehftguy.com/2016/09/12/250-gbday-of-logs-with-graylog-lessons-learned/
