How to balance UDP inputs (GELF)

Hi all. We are using GELF (UDP) to collect messages. For HA we have created global inputs. How can we balance UDP GELF traffic between two nodes with a health check? Is it possible? Thanks.

Hej @davidoff, you would need to use a load balancer for your UDP input. This can be done, for example, with nginx.

For reference: https://www.nginx.com/resources/admin-guide/tcp-load-balancing/

Load-balancing GELF UDP is not exactly trivial because of its chunking characteristics. You would need to use client-based balancing, i.e. send all UDP packets from one client to the same Graylog input. Otherwise the chunks of a single message can land on different nodes and you’ll end up with corrupted messages.
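
One way to get client-based balancing in nginx is to hash on the client address in the stream module. The following is only a sketch, not a tested production config: the upstream name gelf_backend is made up for illustration, and the backend addresses and port are the ones that appear in the logs later in this thread.

```nginx
stream {
    upstream gelf_backend {
        # Pin each client IP to one backend so that all chunks of a
        # GELF message reach the same Graylog input.
        hash $remote_addr consistent;
        server 172.16.20.58:12206;
        server 172.16.20.59:12206;
    }

    server {
        listen 12206 udp;
        proxy_pass gelf_backend;
    }
}
```

With this approach the load is only as even as the spread of client IPs, so a few very chatty clients can still overload one node.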

We are using the FOSS version of NGINX as a UDP loadbalancer for GELF messages. The one major downside is that in order to get active health checks (to poll the graylog-servers’ lbstatus page), we’d need NGINX Plus which I consider a bit too expensive. So this basically means that whenever we perform some maintenance on the graylog-server nodes (like yum updates with reboots) some of the log messages are lost during that time period.

Luckily, we only need to use GELF UDP for a handful of services due to technical restrictions. The rest of the log messages are sent via Filebeat which has proven a great way to handle log collecting (and loadbalancing) on the client-side.

Hi all,
We had many problems with GELF UDP because of its chunking characteristics.
We’re using Keepalived and have configured virtual IPs on every Graylog node for HA.
We have also configured a “dummy” balancing method by using DNS round robin across the virtual IPs.
What we’ve done is not “scientific” but still works for us.

Regards

Thanks for the answers. We decided to use nginx with UDP balancing on one node; if that node goes down, a script switches to the other one by copying in a new config with new UDP backends. Because NGINX Plus is not free, we wrote a simple health check script (see below).

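#!/bin/bash
# Prerequisites, both in the working directory:
#   - a file named "master" containing: node1
#   - a file named "health_check.log" for logs

# Query the Graylog load balancer status of each node.
# -s silences curl's progress output; --max-time keeps the script
# from hanging when a node is unreachable.
status1=$(curl -s --max-time 5 'http://172.16.20.58:9000/api/system/lbstatus')
status2=$(curl -s --max-time 5 'http://172.16.20.59:9000/api/system/lbstatus')
master=$(cat master)

# An unreachable node yields an empty string, so treat anything
# other than ALIVE as DEAD.
[[ "$status1" == "ALIVE" ]] || status1="DEAD"
[[ "$status2" == "ALIVE" ]] || status2="DEAD"

if [[ "$status1" == "ALIVE" && "$status2" == "ALIVE" ]]; then
        exit 0

elif [[ "$status1" == "ALIVE" && "$status2" == "DEAD" ]]; then
        echo "[$(date)]Node2 - $status2" >> health_check.log

elif [[ "$status1" == "DEAD" && "$status2" == "ALIVE" ]]; then
        echo "[$(date)]Node1 - $status1" >> health_check.log
        # Check which node is master; fail over only if it is the dead node1
        if [[ "$master" == "node1" ]]; then
                # Swap in the backend config that points at node2
                cp /path1/gelf_backend /etc/nginx/gelf/gelf_backend
                /etc/init.d/nginx reload
                echo "node2" > master
        else
                exit 0
        fi

elif [[ "$status1" == "DEAD" && "$status2" == "DEAD" ]]; then
        echo "[$(date)]Node1 - $status1" >> health_check.log
        echo "[$(date)]Node2 - $status2" >> health_check.log
fi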

Hi. Does anyone have production configs for nginx UDP balancing? We get a lot of errors under load, such as:

2017/03/02 06:33:29 [alert] 2435#2435: 16000 worker_connections are not enough

When we increase workers and connections, we get these errors instead:

2017/03/02 11:24:28 [error] 2416#2416: *702506 connect() to 172.16.20.58:12206 failed (11: Resource temporarily unavailable) while connecting to upstream, udp client: 172.16.20.47, server: 0.0.0.0:12206, upstream: "172.16.20.58:12206", bytes from/to client:0/0, bytes from/to upstream:0/0

We solved this problem by adding the following to the nginx config:

proxy_timeout 10s;

The default value is 10m, so nginx keeps a lot of connections open and there are not enough local ports left for new ones. Alternatively, you can widen the port range with the sysctl setting net.ipv4.ip_local_port_range.
So the backend config looks like:

server {
    listen 12204 udp;
    proxy_timeout 10s;
    proxy_pass example.com;
}
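
For anyone hitting the same errors, here is a slightly fuller sketch of how the pieces could fit together. It is not the config used above: the upstream name gelf_backend is hypothetical, the worker_connections value is simply the one from the alert, and proxy_responses 0 is an extra directive that was not part of the fix described here. It tells nginx not to expect any reply datagrams, which fits GELF UDP since senders never get a response, and it may help sessions end sooner than the timeout alone.

```nginx
events {
    # Value taken from the alert above; size it to your own load.
    worker_connections 16000;
}

stream {
    upstream gelf_backend {
        # Backend addresses as seen in the error log above.
        server 172.16.20.58:12206;
        server 172.16.20.59:12206;
    }

    server {
        listen 12206 udp;
        # Close idle UDP sessions after 10s instead of the 10m default.
        proxy_timeout 10s;
        # Assumption: no reply is expected from the Graylog input.
        proxy_responses 0;
        proxy_pass gelf_backend;
    }
}
```

It is worth testing this under realistic load before relying on it, since the right values depend on message rate and the number of clients.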