I’ve been running Graylog2 in production for 1-2 years now, on a simple setup with 2x VMs. One runs GL + Mongo, the other is the singular ES node.
Another department is moving to ‘the cloud’ and I managed to convince them to give my department 3x almost-brand-new servers; the specs are:
2xE5-2643 v4 (6-core @ 3.4Ghz)
512GB DDR4
24x 400GB SSD + 2x 800GB HDD.
I’m aware that it’s preferable to have a three-node setup for most of the components. Usually we do everything with VMware ESXi, but I’m keen to go bare metal to get the most out of the hardware.
How would you set this up? I was thinking of a simple bare-metal OS per server, then each server runs: ES (3-node cluster), MongoDB (3-node replica set), and graylog2.
I would likely be using keepalived between the three servers to hold the logging VIP address, and then whichever server receives the logs would also act as the load balancer to the others. Redundancy is important, so I don’t want to do the load balancing upstream as it means more hardware.
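For the VIP piece, something like this minimal keepalived sketch is what I have in mind (interface name, addresses and password are just placeholders):

```
# /etc/keepalived/keepalived.conf on the first server
vrrp_instance LOGGING_VIP {
    state MASTER              # BACKUP on the other two servers
    interface eth0
    virtual_router_id 51
    priority 150              # lower priorities (e.g. 100, 50) on the backups
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme
    }
    virtual_ipaddress {
        192.0.2.10/24         # the logging VIP that all syslog senders point at
    }
}
```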
You could use an external load balancer such as nginx or haproxy to do the load balancing.
But it all depends on your needs, your options, and your environment.
I’m not sure I would use a load balancer at first; it can add a fair bit of network overhead:
LB -> GL -> ES (each step could land on a different host, so a single log message might be handled by three different hosts).
If one Graylog node can handle the full traffic, I suggest forgetting the load balancer.
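If you do put one in front, a minimal haproxy sketch could look like this (hostnames and ports are only examples, assuming a syslog TCP input on 1514 on every Graylog node):

```
# haproxy.cfg fragment -- TCP load balancing to three Graylog syslog inputs
frontend syslog_in
    bind *:1514
    mode tcp
    option tcplog
    default_backend graylog_nodes

backend graylog_nodes
    mode tcp
    balance roundrobin
    server gl1 graylog1.example.com:1514 check
    server gl2 graylog2.example.com:1514 check
    server gl3 graylog3.example.com:1514 check
```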
Another problem:
Elasticsearch can only make effective use of roughly 64GB of RAM per node (the JVM heap has to stay below ~32GB, and about the same again is wanted for the filesystem cache), and unfortunately ES is the component that needs the most resources, so with a basic one-node-per-host setup you would only be using a small fraction of your 512GB.
So it may be better to run several ES nodes on the same host in your favorite virtualization solution.
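As an illustration only (ES 5.x-style settings, names and paths made up), if you ran two ES instances directly on one OS instead of separate VMs, the second node mostly just needs its own name, ports and data path, plus a setting so a replica never lands next to its primary on the same physical box:

```
# elasticsearch.yml for one of two data nodes sharing a physical host
cluster.name: graylog
node.name: es-host1-a                 # es-host1-b on the second instance
path.data: /var/lib/elasticsearch/a   # .../b on the second instance
http.port: 9200                       # 9201 on the second instance
transport.tcp.port: 9300              # 9301 on the second instance
# keep a primary shard and its replica off the same physical machine
cluster.routing.allocation.same_shard.host: true
```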
The maximum memory setting of ES is something I definitely wasn’t aware of, so thanks for that - that will definitely help shape how I run this, likely doing what you’ve suggested and running multiple ES nodes per host, or just on one/two hosts. It’s a shame these servers are CPU constrained…
I need to do some looking into how CPU-hungry ES is versus GL, so I can further figure out how to divide resources.
Networking overhead isn’t a problem, we’re an ISP, so our internal core network(s) are far, far bigger than required, load balancing and shifting data around isn’t a problem at all.
Also keep in mind for ES that if you have 64GB of RAM, you don’t want to allocate it all to ES. You want to keep some memory free for the OS disk cache. We generally run ES on 64GB machines with 32GB allocated to ES, and the rest of the memory is free for the OS to use as it sees fit.
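For reference, with the ES 5.x packaging that allocation is just the two heap lines in jvm.options, kept equal and below the ~32GB compressed-oops threshold:

```
# /etc/elasticsearch/jvm.options (ES 5.x layout assumed)
-Xms31g
-Xmx31g
```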
RAM is abundant on these servers and we have little other use for it, so what will likely happen is each box will get a single VM (likely go with ESXi now instead of bare metal) for ES with something like 100GB, 64 for ES and the rest for activities.
It’s been a long time since I built anything GL/logging-related, but I’m guessing the best approach is design + build the ES cluster, then get MongoDB running, then finally install GL?
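For my own notes, something like this is the kind of minimal elasticsearch.yml I expect to start the ES cluster with (ES 5.x-style zen discovery; hostnames are placeholders):

```
# elasticsearch.yml -- identical on all three nodes apart from node.name
cluster.name: graylog                   # Graylog's docs use 'graylog' as the cluster name
node.name: es1                          # es2 / es3 on the other nodes
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["es1.example.com", "es2.example.com", "es3.example.com"]
discovery.zen.minimum_master_nodes: 2   # quorum for 3 master-eligible nodes
```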
If you run any virtualization on the machines, I would suggest using a separate LB VM on each host.
It doesn’t need much, and you can decouple this function from the OS of your other services.
And if you need HA, you need two LBs with the same config…
A 10GB OS disk, 2 vCPU and 2-4GB of RAM is enough for a small one.
I was only considering something like keepalived to keep the configuration/software minimal, but I’ll likely use a set of redundant relay servers - this is actually what I run in production now, using rsyslog to accept, filter, and then forward logs (or spool them to disk if the downstream server isn’t ready).
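Roughly what those relays do today, as a sketch (hostname, port and queue sizes are placeholders):

```
# /etc/rsyslog.d/relay.conf -- accept syslog, forward to the logging VIP,
# and spool to disk if the downstream Graylog/LB is unreachable
module(load="imudp")
input(type="imudp" port="514")
module(load="imtcp")
input(type="imtcp" port="514")

action(type="omfwd" target="logging-vip.example.com" port="1514" protocol="tcp"
       queue.type="LinkedList"
       queue.filename="graylog_fwd"       # a filename here enables on-disk spooling
       queue.maxdiskspace="1g"
       queue.saveonshutdown="on"
       action.resumeretrycount="-1")      # keep retrying until the target comes back
```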
I’m assuming that if I ran 3 Graylog servers in a cluster, I would need to send all of my logs to the master, and the master would then distribute them? Or is it the case that any node in the Graylog cluster can receive? If the latter, I would assume/hope that the config/plugins such as Streams and Extractors are still global to the cluster?
I thought only the master could receive; I think I was confusing it with a different stack we’re using for something else. If each node can receive data and process it as part of the cluster, that should make load balancing a lot easier, and I can simply round-robin between my 3 potential GL VMs.
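In that case the per-node server.conf presumably only differs in the master flag, something like (Graylog 2.x settings; hostnames are examples):

```
# /etc/graylog/server/server.conf -- cluster-relevant bits
is_master = true          # true on exactly one node, false on the other two
password_secret = <identical on every node>
mongodb_uri = mongodb://mongo1.example.com:27017,mongo2.example.com:27017,mongo3.example.com:27017/graylog?replicaSet=rs0
# inputs created as "global" are started on every node, so any node can receive
```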
Yes, having read up on it, 31GB seems to be the best number to allocate, to ensure you don’t hit the issues listed in that article.
My current design involves 2x load balancers (ipvs/keepalived), 3x graylog, and 3x ES. I’ll use ESXi instead of 3 hosts on bare metal, to make things easier.
I’ve set aside:
Load balancers - 2 vCPU + 16GB
Graylog - 6 vCPU + 64GB (Half for JVM heap)
ElasticSearch - 4 vCPU + 64GB (Half for JVM heap)
I think that will be more than enough for what we are doing. All of our logs obviously go to flat-file, as is standard practice, and we use GL for some nice analysis and alerting. We’re barely using GL at the moment, so the above cluster should be enough to move to, and then start throwing logs we don’t currently analyse at.
I’ll likely get the above built sometime soon and then build a couple of load testing VMs to throw junk data at it for a bit, to see how it runs.
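Probably nothing fancier than a loop hammering a GELF HTTP input, something like (VIP name and port are placeholders):

```
#!/bin/bash
# throw junk messages at a GELF HTTP input listening on 12201
while true; do
  curl -s -o /dev/null -X POST "http://logging-vip.example.com:12201/gelf" \
       -H 'Content-Type: application/json' \
       -d '{"version":"1.1","host":"loadtest01","short_message":"junk message '"$RANDOM"'","level":6}'
done
```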
Thanks for all the help, I’d have definitely made some mistakes without asking.
You should not assign that much JVM heap to Graylog without a reason. The GC in Graylog will make you cry and you’ll run into issues. Assign ~12GB to Graylog’s JVM and you are safe for a looong time.
We have some older threads discussing recommended resources and how others have built their setups. Just search this community.
Thanks @jan, I do tend to throw resources at things a bit, as this hardware is overkill.
I’ll take your advice and use 32GB machines for GL, 12GB for GL JVM and the rest for OS + Mongo.
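If I’ve understood the packaging right (Debian/Ubuntu path assumed), that’s just the heap flags in the service defaults:

```
# /etc/default/graylog-server   (RPM-based installs use /etc/sysconfig/graylog-server)
# raise only -Xms/-Xmx; append the GC options from the packaged default after the heap flags
GRAYLOG_SERVER_JAVA_OPTS="-Xms12g -Xmx12g"
```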
I had a search through the forums, but that was mostly about clustering. Now that I have a rough idea of a simple setup that would work for me, I’ll start looking more into best practices and configuration tweaks for performance.