We are currently testing a Graylog cluster with 4 nodes (1 frontend/web UI node & 3 backend DB nodes):
1 - HAProxy (frontend, not cutting it so far…) + Graylog WebUI
2 - Graylog/Elasticsearch/MongoDB
3 - Graylog/Elasticsearch/MongoDB
4 - Graylog/Elasticsearch/MongoDB
We are running into major message backlogs because HAProxy only binds to one of the backend nodes instead of evenly distributing the log load across all three. The logs are VERY bursty, with bursts in the thousands to tens of thousands of messages…
What is the recommended architecture and/or best option for us to queue up these large bursts of log messages, so they don’t create huge backlogs of message processing on a single backend node? HAProxy doesn’t seem to be doing the trick…
This is using TCP inputs, sending logs from rsyslog via TCP… we want TCP for reliability. We’ve tried HAProxy’s roundrobin LB method and also the leastconn LB method, and neither is working out.
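For reference, the rsyslog side is just a plain TCP forwarding (omfwd) action pointed at the HAProxy address, roughly like the sketch below (the host/port are placeholders, and the disk-assisted queue settings are just an example of the kind of local buffering I’m asking about, not something we have in place yet):

# Forward everything to the HAProxy front end over plain TCP (omfwd).
# Host/port are placeholders; the queue.* settings enable an optional
# disk-assisted queue that buffers bursts locally if the target stalls.
action(type="omfwd" target="haproxy.example.com" port="514" protocol="tcp"
       queue.type="LinkedList" queue.filename="graylog_fwd"
       queue.maxDiskSpace="1g" queue.saveOnShutdown="on"
       action.resumeRetryCount="-1")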
We do the same, but HAProxy is working very well for us to balance and distribute the logs evenly. It’s a very simple listener configuration. Here is an example config from our HAProxy instance that handles our nginx logs; of course, replace the IP addresses and port numbers accordingly.
listen NgnixLogs
bind IPaddress:514
balance roundrobin
mode tcp
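# health checks run every 2s (inter 2000); a server is marked down after 3 failed
# checks (fall 3) and back up after 5 successful ones (rise 5); weights are equal (weight 10)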
server Graylog01 IPaddress:514 check fall 3 rise 5 inter 2000 weight 10
server Graylog02 IPaddress:514 check fall 3 rise 5 inter 2000 weight 10
server Graylog03 IPaddress:514 check fall 3 rise 5 inter 2000 weight 10
This works pretty much flawlessly for us for all TCP logging instances. We have 3 nodes, with a normal ingestion rate of 5-10k messages per second on average, but it easily handles sustained loads of up to 30k per second.
Thank you much! I think our HAProxy config was not quite right… I wasn’t the one who set it up, but I tried out your example and things seem better. It’s still only splitting the log load between two of the backend nodes instead of all three, though that might have been a fluke since I accidentally dumped a huge burst of ~300,000 logs at it with the new config… so I’ll stop the incoming logs, let it catch up, and try turning it back on without such a huge burst to see if it distributes evenly this time.
I’m also curious what your server specs are for your backend nodes and your HAProxy server?
Also, are you parsing (extracting) all the log fields for all of your logs at the ingestion rate you mentioned?
I’m glad things are looking better for you. When things go wacky on the balance, especially after reloads, I find that a restart of the HAProxy service works better than a reload to get things back in order, since it forces old sessions onto the fresh instance. Might wanna try that and see how it goes. As far as specs: HAProxy is only a 2 vCPU VM with 4 GB of memory (could probably get by with less, as this doesn’t really stress either resource). We have 3 Graylog-only servers right now, each with 4 vCPUs and 16 GB of memory; under average load each tends to hover around 40-50% utilization. We keep Elasticsearch on its own separate cluster away from Graylog to keep one from affecting the other, especially during high demand.
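In case it’s unclear, I just mean restarting the service rather than reloading it, e.g. (assuming the standard systemd unit name):

# A reload hands new connections to a fresh HAProxy process but leaves existing
# long-lived TCP sessions pinned to the old one; a full restart drops those
# sessions so clients reconnect and get rebalanced across all backends.
sudo systemctl restart haproxy    # instead of: sudo systemctl reload haproxy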
We parse logs where needed, for logs that can’t be preformatted and sent via GELF with nxlog or Filebeat, but we try where possible to have them shipped in a way that parsing isn’t needed. Again, GELF and JSON are nice for this, but not always an option with some systems (network hardware, appliances, 3rd-party software, etc.); those we parse out as needed. I’d say maybe 25% of what we bring in needs to be parsed using an extractor of some kind. The rest is pre-formatted as JSON and shipped via GELF using nxlog (works great for IIS logs). Where possible, we also now build into our applications, using Log4Net, the ability to log directly to Graylog in GELF format.
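As a rough sketch of that nxlog setup (paths, hostnames, and ports are placeholders, and this is simplified from what we actually run), the IIS case looks along these lines:

# Load the GELF extension so outputs can emit GELF
<Extension gelf>
    Module      xm_gelf
</Extension>

# Tail the IIS log files (path is a placeholder)
<Input iis>
    Module      im_file
    File        "C:\\inetpub\\logs\\LogFiles\\W3SVC1\\u_ex*.log"
</Input>

# Ship to a Graylog GELF TCP input (host/port are placeholders)
<Output graylog>
    Module      om_tcp
    Host        graylog.example.com
    Port        12201
    OutputType  GELF_TCP
</Output>

<Route iis_to_graylog>
    Path        iis => graylog
</Route>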
Sounds good. I think everything is looking good with the load now balancing evenly between our 3 backend nodes, very much appreciated! That’s the next thing I’m considering: separating Elasticsearch out from Graylog/MongoDB… I think it’s going to be a necessity. On that note, what are your VM specs for your Elasticsearch cluster? The same as your Graylog VMs, or different?
Interesting. I’ve been setting up Grok patterns to parse/extract log fields, but I’m not sure how much I’m liking it so far… I may just switch to pre-formatting my logs as JSON before I ship them to Graylog. With Grok patterns I can’t seem to find a way to tell an extractor to apply only to a specific application_name, or something similarly identifiable and unique in the logs that extractor is meant for, so all of my extractors end up trying to parse every log that comes into that input… I’ll have to read up on GELF/nxlog and maybe make the switch to logging in JSON format.
For the most part, yes, the configs are roughly the same, except with this setup I can individually scale Elasticsearch or Graylog out horizontally. I use Grok on occasion where it makes sense, but if the logs aren’t pre-formatted I generally use a regex extractor, or “Copy input” if key-value pair data is being pushed. Substring and split & index can be pretty helpful and powerful too if applied in the right situation.
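Purely as an illustration (the log line and field names are made up), a Grok extractor with a pattern like the one below would pull client_ip, http_method, request_path, and response_code out of a simple access-log-style line such as 10.0.0.5 GET /api/orders 200:

%{IPORHOST:client_ip} %{WORD:http_method} %{URIPATH:request_path} %{NUMBER:response_code}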