We are finishing up a Graylog pilot and want to kick this into high gear with a full-blown production install. We are currently running Graylog and Elasticsearch on one VM:

- 4 CPUs and 16 GiB of RAM total
- 4 GiB of RAM for Elasticsearch
- 4 GiB of RAM for Graylog
We’re going to move it onto more powerful hardware where I can provision (almost) as many VMs, RAM and CPUs as I like. We’re currently doing < 1500 messages per second and the above hardware is handling that volume fine. It’s about 50 GB of logs per day. But I’d like to quadruple that volume with the new setup.
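(For reference, 4× the current load works out to roughly 6,000 messages per second, or about 200 GB of logs per day.)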
I’m assuming that the recommendation will be to put Graylog and Elasticsearch on separate VMs. But should I use two or more Elasticsearch VMs? How much RAM/CPU should I give the Elasticsearch VMs?
I know the actual answer is “it depends”. But does anyone have a starting point I could use?
Here’s an example. I’ve read that you shouldn’t give the Elasticsearch JVM more than 32 GB of heap because that disables Java’s compressed object pointers (compressed oops) and hurts performance. I’ve also read that you shouldn’t allocate more than 50% of the VM’s RAM to Elasticsearch because you need to leave room for the OS-level disk cache. Does that mean that a single “max” Elasticsearch VM would be something around 64 GB of RAM, with about 26 GB (32 minus some safety margin) for the Elasticsearch JVM heap?
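Concretely, I’m picturing something like this in Elasticsearch’s jvm.options on a 64 GB VM (the exact heap number is just my guess at a safe margin):

```
# /etc/elasticsearch/jvm.options on a hypothetical 64 GB VM
# min and max heap set equal, kept well under ~32 GB so compressed oops stay enabled
-Xms26g
-Xmx26g
```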
Thanks.
There is a lot of information in the docs about planning your environment; other sections cover multi-node setups, and there are plenty of other posts in the community you can search, such as this one, that will give you more information about building out. A search on “scaling” will point you to a bunch more. For Elasticsearch-specific information on sizing… well, that would be in the Elasticsearch documentation/community…
I would definitely look at what @tmacgbay suggested for starters. There are some really good ideas there.
Just to give you an idea of what we did for collecting logs from over 300 remote devices (trial and error), here are some steps we came up with when building a large cluster.
How many logs will be ingested per second, hour, day, etc.? That determines what kind of resources you will need. See if you can get a rough idea of how much log volume per day and work from that. We ingest 30 GB per day from over 60 remote devices with one Graylog server VM: 14 cores, 12 GB of RAM, and a 500 GB HDD. Below is a message count per day.
What type of logs will be ingested? This determines what type of inputs and log shippers need to be acquired. Depending on what you decide to use, some inputs left unchecked, like GELF for Windows, will produce a lot of fields, so you will see your volume fill up. This can also be controlled from the client side.
How long do you plan on retaining logs (weeks, months, years)? This, as you know, depends on your storage resources.
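As a rough back-of-the-envelope sketch (my own rule of thumb, not an official formula): required disk ≈ daily volume × retention days × (1 + replicas), plus some headroom for Elasticsearch overhead. For example, 200 GB/day kept for 30 days with one replica works out to about 200 × 30 × 2 = 12 TB across the cluster before headroom.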
What type of devices will be sending logs (Linux, Windows, switches, etc.)? This determines how many inputs you might want to use. If you decide to add extractors to inputs, that will increase the resources needed.
Overall, when you finish setting up your cluster and before sending ALL logs to the Graylog server, I would advise starting slow, maybe with 1/3 of your remote devices, and working your way up. This will give you the opportunity to break in Graylog and see how it functions as a cluster. I have seen others just overwhelm the Graylog server (message storm) and then state, “My server crashed.”
If you start to notice problems, stop, because you may need to increase your resources and/or reconfigure your server.conf file: buffers filling up, volume filling up, heap filling up, etc. You just don’t know about the other little details until logs start to roll in. By going slow you can catch these issues and start to adjust before they become a big problem. Once you have cleared up any issues, start sending more logs again until you are finished.
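For reference, the buffer and journal settings I’m talking about live in Graylog’s server.conf. The values below are only illustrative starting points to tune against your own buffer/journal metrics, not recommendations, and the Graylog heap itself is set in the JVM options for the graylog-server service rather than in this file:

```
# server.conf - illustrative starting points, tune against your buffer/journal metrics
processbuffer_processors = 5
outputbuffer_processors = 3
inputbuffer_processors = 2
output_batch_size = 500
message_journal_enabled = true
message_journal_max_size = 5gb
```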
If you are increasing your volume by 4x, then having 3 Elasticsearch nodes separated from your 3 Graylog/MongoDB nodes would be good. You can always expand your cluster/volumes if need be.
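A minimal sketch of how each Graylog node would point at that layout (the hostnames es1-es3 / gl1-gl3 and the replica set name are placeholders, not anything you have to use):

```
# server.conf on each Graylog node - hostnames are placeholders
elasticsearch_hosts = http://es1:9200,http://es2:9200,http://es3:9200
mongodb_uri = mongodb://gl1:27017,gl2:27017,gl3:27017/graylog?replicaSet=rs0
```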
Since you are using virtual machines, that’s ideal; as you know, it’s very easy to add resources to a VM.
We’re doing 58M messages per day. Our department is Network Engineering, so all of our messages are syslog.
I found a couple of people in our department who have some ELK stack experience. I’ll combine some of the advice above with their expertise and see where we land. Thank you for your suggestions!