Hi all. We are using GELF (UDP) to collect messages. For HA we have created global inputs. How can we balance GELF UDP traffic between two nodes with a health check? Is it possible? Thanks.
Hey @davidoff, you would need to use a load balancer in front of your UDP input. This can be done, for example, with nginx.
For your reference: https://www.nginx.com/resources/admin-guide/tcp-load-balancing/
Load-balancing GELF UDP is not exactly trivial because of its chunking behavior. You need client-based balancing, i.e. all UDP packets from one client have to reach the same Graylog input; otherwise the chunks of a single message get spread across nodes and you'll end up with corrupted messages.
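A minimal sketch of such client-based balancing with the open-source nginx stream module (the node addresses match the ones used later in this thread; the GELF port 12201 is an assumption): `hash $remote_addr` pins all packets from one source IP to the same backend, so the chunks of a message are never split across nodes.

```nginx
# nginx.conf (stream context, not http) -- illustrative sketch, not a tested production config
stream {
    upstream graylog_gelf {
        # hash on the client IP so every chunk of one message
        # reaches the same Graylog input
        hash $remote_addr;
        server 172.16.20.58:12201;
        server 172.16.20.59:12201;
    }

    server {
        listen 12201 udp;
        proxy_pass graylog_gelf;
    }
}
```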
We are using the FOSS version of NGINX as a UDP load balancer for GELF messages. The one major downside is that active health checks (polling the graylog-server nodes' lbstatus endpoint) require NGINX Plus, which I consider a bit too expensive. So whenever we perform maintenance on the graylog-server nodes (like yum updates with reboots), some log messages are lost during that time.
Luckily, we only need to use GELF UDP for a handful of services due to technical restrictions. The rest of the log messages are sent via Filebeat which has proven a great way to handle log collecting (and loadbalancing) on the client-side.
We had many problems with GELF UDP because of its chunking characteristics.
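For the Filebeat side, that client-side load balancing can be sketched like this (assuming a Graylog Beats input listening on port 5044 on both nodes; the addresses are placeholders):

```yaml
# filebeat.yml -- sketch; hosts are assumptions, adjust to your Beats input
output.logstash:
  # Graylog's Beats input speaks the Logstash/Beats protocol
  hosts: ["172.16.20.58:5044", "172.16.20.59:5044"]
  # distribute events across both hosts instead of sticking to one
  loadbalance: true
```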
We're using Keepalived and have configured some virtual IPs on every Graylog node for HA purposes.
We have also configured a "dummy" balancing method by using DNS round robin over the virtual IPs.
What we've done is not "scientific", but it still works for us.
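A hedged sketch of what one such Keepalived instance could look like (the interface name, router ID, and VIP below are assumptions, not the poster's actual values); the peer node runs the same block with `state BACKUP` and a lower priority:

```
# /etc/keepalived/keepalived.conf -- illustrative only
vrrp_instance graylog_vip_1 {
    state MASTER            # BACKUP on the other node
    interface eth0          # assumption: adjust to your NIC
    virtual_router_id 51    # must match on both nodes
    priority 100            # lower value on the BACKUP node
    advert_int 1
    virtual_ipaddress {
        172.16.20.100/24    # the VIP your DNS round robin points at
    }
}
```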
Nginx load balancer
Thanks for the answers. We decided to use nginx with UDP balancing on one node; if a backend goes down, a script switches to the other one by copying a new config with new UDP backends. Since NGINX Plus is not free, we wrote a simple health-check script (see below).
```shell
#!/bin/bash
# Create a file named "master" containing: node1
# Create a file health_check.log for logging
status1=`curl 'http://172.16.20.58:9000/api/system/lbstatus'`
status2=`curl 'http://172.16.20.59:9000/api/system/lbstatus'`
master=`cat master`

if [[ "$status1" == "ALIVE" && "$status2" == "ALIVE" ]]; then
    exit 0
elif [[ "$status1" == "ALIVE" && "$status2" == "DEAD" ]]; then
    echo "[`date`] Node2 - $status2" >> health_check.log
elif [[ "$status1" == "DEAD" && "$status2" == "ALIVE" ]]; then
    echo "[`date`] Node1 - $status1" >> health_check.log
    # Check which node is master
    if [[ "$master" == "node1" ]]; then
        cp /path1/gelf_backend /etc/nginx/gelf/gelf_backend
        /etc/init.d/nginx reload
        echo "node2" > master
    else
        exit 0
    fi
elif [[ "$status1" == "DEAD" && "$status2" == "DEAD" ]]; then
    echo "[`date`] Node1 - $status1" >> health_check.log
    echo "[`date`] Node2 - $status2" >> health_check.log
fi
```
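For the failover to be automatic, a script like this has to run periodically; one simple option is a cron entry (the script path and filename below are assumptions):

```
# /etc/crontab -- run the health check every minute
* * * * * root /path1/health_check.sh
```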
Hi. Does anyone have production configs for nginx UDP balancing? We get a lot of errors under load, such as:
2017/03/02 06:33:29 [alert] 2435#2435: 16000 worker_connections are not enough
When we increase workers and connections, we get these errors:
2017/03/02 11:24:28 [error] 2416#2416: *702506 connect() to 172.16.20.58:12206 failed (11: Resource temporarily unavailable) while connecting to upstream, udp client: 172.16.20.47, server: 0.0.0.0:12206, upstream: “172.16.20.58:12206”, bytes from/to client:0/0, bytes from/to upstream:0/0
We solved this problem by lowering the connection timeout in the nginx config (the stream module's proxy_timeout directive, whose default is 10m). With the default, nginx keeps a lot of UDP "connections" open, so there are not enough local ports left for new ones. Alternatively, you can widen the port range via the sysctl setting net.ipv4.ip_local_port_range.
so the backend config looks like:
listen 12204 udp;
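Putting the pieces together, a hedged sketch of such a stream server block (the specific timeout value and the proxy_responses setting are assumptions, not the poster's exact config; the upstream file matches the failover script above, which swaps it on node failure):

```nginx
# /etc/nginx/gelf/gelf_backend -- illustrative sketch, tune values for your load
upstream gelf_backend {
    server 172.16.20.58:12204;    # active node; failover script switches to .59
}

server {
    listen 12204 udp;
    proxy_pass gelf_backend;
    proxy_timeout 1s;             # default is 10m; shorter frees local ports faster
    proxy_responses 0;            # GELF UDP is one-way, expect no reply packets
}
```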