Our company runs Graylog to process the Windows security event logs of all of our ~500 Windows hosts. Let me describe our environment first.
We have ~70 domains with ~5 Windows hosts each. Every domain has its own VLAN and can be seen as a separate network.
We first tried sending all of our data to 2 Graylog nodes, and everything was fine until the traffic from our logs started to overload our firewalls.
Now we have 1 Elasticsearch cluster, 1 central Graylog, and a Graylog Docker container in every domain that works as a proxy to avoid large numbers of connections (which results in 70 Graylog nodes). Everything worked great in the beginning, but then various problems with the Elasticsearch index started to occur roughly every 2 months. I am currently upgrading our Graylog, Elasticsearch, and MongoDB to the newest versions (we are running Graylog 2.4 at the moment), but I have recently started to question this design with its 70 Graylog nodes. I have never seen anyone use more than 2-3 Graylog nodes to process data.
What do you think? Is this the right approach? Should it work just fine, or was Graylog never designed to be used like this?
You might want to contact us at Graylog, as we have released something that can help you with this: a dedicated input/output plugin to forward data from satellite environments to a central one.
Alternatively, others in this community may be able to suggest a different kind of architecture.
If your bottleneck is the firewalls, I think changing that part would be the better option from a purely management perspective. You can, theoretically, set things up so that the local Windows nodes report to a Logstash instance running in the domain, and have Logstash forward (via GELF or Beats) to Graylog. It may still cause bandwidth issues, but at least each domain then makes only one incoming connection (well, depending on the Logstash config) to the Graylog nodes.
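To illustrate, here is a minimal sketch of such a per-domain Logstash pipeline, assuming a Beats input on port 5044 and a GELF UDP input on the central Graylog. The hostname and ports are placeholders, and you may need to install the logstash-output-gelf plugin first:

```
# Per-domain Logstash (sketch): accept Beats from the local Windows hosts,
# forward everything to the central Graylog over GELF (UDP).
input {
  beats {
    port => 5044                    # Winlogbeat on the domain's hosts points here
  }
}

output {
  gelf {
    host => "graylog.example.com"   # placeholder: central Graylog address
    port => 12201                   # placeholder: Graylog GELF UDP input port
  }
}
```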
I tried to overcome the problem with Logstash as an aggregator and forwarder, but I failed at the last mile. The Logstash-to-Graylog part is pretty hard when you want to use TLS.
If you don't need TLS on the last mile, you can use the GELF output of Logstash. If you want to ship your logs in a secure way, you need the Lumberjack output into a Beats input, but I have struggled to get that working. Maybe we can team up?
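For reference, the Lumberjack output side looks roughly like this (a sketch only; the hostname, port, and certificate path are placeholders, the logstash-output-lumberjack plugin may need to be installed separately, and as noted, getting it to interoperate with Graylog's Beats input is exactly the hard part):

```
# Logstash Lumberjack output (sketch): TLS-secured forwarding to a
# downstream Beats/Lumberjack listener. All values are placeholders.
output {
  lumberjack {
    hosts           => ["receiver.example.com"]            # downstream listener
    port            => 5044
    ssl_certificate => "/etc/logstash/certs/receiver.crt"  # certificate securing the hop
  }
}
```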
I’ve never had much luck with the Logstash Lumberjack output either, because the documentation is super skimpy on how to secure it. But you can still use GELF if you connect things over a VPN, or alternatively a VLAN if it’s all inside the same network.
@jan Thank you for your advice, but the company I work for prefers open-source and free solutions.
@benvanstaveren We already have high-end Clavister firewalls. The problem is in the infrastructure itself and how it’s built: the number of connections was the problem, not the amount of data.
@gruselglatz Thanks, your explanation and examples gave me some ideas, as this is exactly what we tried to accomplish with the Graylog “proxies” that create only one connection to the Elasticsearch cluster.
I don’t think I can help you with your issue then, as we don’t need TLS encryption: all our connections are local and go through a VPN between the firewalls.
So, guys, if I understand you correctly, I should change my architecture to: Winlogbeat -> Logstash per domain -> Graylog (via GELF) -> Elasticsearch?
@matgel I changed my plan to: Filebeat [Beats protocol] -> Logstash per site [Lumberjack protocol] -> Logstash in the HQ DMZ [Lumberjack protocol] -> Logstash in the secure domain [GELF] -> Graylog. The sketches below show the first and last hops.
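The per-site hop could look something like the following; this is a sketch under my assumptions, with placeholder hostnames, ports, and certificate paths. The tagging filter is just one way to remember the beat type across the Lumberjack hops, since the [@metadata] fields set by the Beats input are not forwarded downstream:

```
# Per-site Logstash (sketch): accept Beats, forward over TLS Lumberjack
# to the DMZ relay at HQ. All hostnames, ports, and paths are placeholders.
input {
  beats {
    port => 5044                    # Filebeat (and later Metricbeat) points here
  }
}

filter {
  # [@metadata][beat] is set by the Beats input but is not forwarded,
  # so copy the routing decision into a persistent tag.
  if [@metadata][beat] == "metricbeat" {
    mutate { add_tag => ["metrics"] }
  }
}

output {
  lumberjack {
    hosts           => ["dmz-logstash.example.com"]     # DMZ relay at HQ
    port            => 5045
    ssl_certificate => "/etc/logstash/certs/dmz.crt"    # certificate securing the hop
  }
}
```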
This way I can add Metricbeat and the APM agent later and only have to change the input on the external site side and the output on the secure-domain side.
The communication is authenticated at the Filebeat-to-Logstash start and stays encrypted from the beginning until it is inside my secure domain, where I go to Graylog, and with Metricbeat and APM Server directly to Elasticsearch.
With this architecture I can scale out as many external-to-DMZ Logstash instances as I want, and I could also add internal Logstash instances in front of my secure-domain Logstash.
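The last hop in the secure domain might then look like this; again just a sketch, where the Elasticsearch routing relies on the "metrics" tag set at the edge in the previous sketch, and all addresses and paths are placeholders:

```
# Secure-domain Logstash (sketch): terminate the Lumberjack chain and
# fan out to Graylog (logs) and Elasticsearch (metrics).
input {
  lumberjack {
    port            => 5046
    ssl_certificate => "/etc/logstash/certs/secure.crt"   # placeholder certificate
    ssl_key         => "/etc/logstash/certs/secure.key"   # placeholder private key
  }
}

output {
  if "metrics" in [tags] {
    # Metric traffic tagged at the per-site hop goes straight to Elasticsearch.
    elasticsearch {
      hosts => ["https://es.internal.example.com:9200"]   # placeholder cluster
    }
  } else {
    # Everything else goes to Graylog's GELF input.
    gelf {
      host => "graylog.internal.example.com"              # placeholder Graylog
      port => 12201
    }
  }
}
```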