We are sending logs from Filebeat to a Logstash instance for enrichment, and then load balancing between two Graylog nodes via an F5 round robin. The issue we are seeing is that Logstash creates a persistent connection to the F5, which results in all of the messages being sent to a single Graylog node. Has anyone successfully implemented a similar setup, and/or are there any recommended alternatives (Redis?) for balancing the messages output from Logstash between the Graylog nodes? Any advice would be appreciated.
Which protocol are you using to send messages from Logstash to Graylog?
I am using the GELF protocol as the Logstash output.
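For reference, the output section of my Logstash pipeline looks roughly like this (the hostname is a placeholder for our F5 Virtual IP):

```
output {
  gelf {
    host => "f5-vip.example.com"  # F5 Virtual IP (placeholder)
    port => 12201                 # default GELF port
  }
}
```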
The Logstash GELF output only supports GELF UDP (i.e. a connectionless transport protocol), so I’m wondering what exactly you mean by:
Related GitHub issue for GELF TCP in Logstash:
When messages are sent directly from a Filebeat agent to the F5 Virtual IP, they round robin correctly to the Graylog servers. However, if the messages are sent from Logstash to the F5 Virtual IP, they only go to one of the servers. I didn’t know if others have experienced this issue with an F5 (or a similar load balancer) and were able to resolve it, or had an alternative way to balance the messages between the Graylog servers using Logstash.
Filebeat (or rather the Beats protocol) uses TCP, while Logstash sends GELF via UDP.
Maybe your load balancer handles TCP and UDP differently.
That being said, it’s not a good idea to round robin GELF UDP packets across different servers, because chunked messages will be corrupted that way.
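To illustrate why: a large GELF message is split into UDP chunks that all carry the same 8-byte message ID, and a single Graylog node needs every chunk to reassemble the message. A rough Python sketch of the chunk framing described in the GELF spec (2 magic bytes, message ID, sequence number, chunk count):

```python
import os
import struct

GELF_CHUNK_MAGIC = b"\x1e\x0f"  # marks a chunked GELF datagram
MAX_CHUNKS = 128                # the GELF spec allows at most 128 chunks

def chunk_gelf(payload: bytes, chunk_size: int) -> list[bytes]:
    """Split a GELF payload into chunked UDP datagrams.

    Every chunk carries the same random 8-byte message ID; if a
    round-robin load balancer spreads the chunks across servers,
    no single server can reassemble the message and it is dropped.
    """
    pieces = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]
    if len(pieces) > MAX_CHUNKS:
        raise ValueError("message too large for GELF chunking")
    message_id = os.urandom(8)
    return [
        GELF_CHUNK_MAGIC + message_id + struct.pack("BB", seq, len(pieces)) + piece
        for seq, piece in enumerate(pieces)
    ]

# A 2500-byte message split into 1000-byte chunks yields 3 datagrams,
# all sharing one message ID.
chunks = chunk_gelf(b"x" * 2500, 1000)
```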
Thanks for the info. If I understand correctly, there is currently no working GELF TCP output for Logstash. Is there another option that others are using to send from Logstash to Graylog over TCP?
I had a similar problem with Citrix ADCs.
The problem we faced was that although the NXLog-to-Graylog transfer (using GELF UDP) was configured to use a virtual IP (VIP), a “session” with a 120-second timeout was being created between the VIP and the round-robin-configured Graylog servers. This timeout was never reached because an agent machine was always sending data within the 120-second window. To get around this, we set the “session” timeout to 1 second, which resulted in much closer round robin behavior.
Thanks for the info, Harry. I’ll check with our F5 administrator about reducing the timeout. I believe the default for UDP on the F5 is 60 seconds.
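If it helps anyone else, I believe the change would look something like this in tmsh (the profile and virtual server names here are hypothetical; verify the exact syntax against your TMOS version):

```shell
# Custom UDP profile with a 1-second idle timeout, based on the
# built-in 'udp' profile (whose default idle-timeout is 60s).
tmsh create ltm profile udp udp-gelf-short defaults-from udp idle-timeout 1

# Attach the profile to the GELF virtual server (name is a placeholder).
tmsh modify ltm virtual vs_graylog_gelf profiles replace-all-with { udp-gelf-short }
```

Note that per the earlier comment, more aggressive per-datagram balancing could still corrupt chunked GELF messages, so test with large messages before rolling this out.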