My application uses log4j2 with a Gelf appender to log messages to our Graylog server.
The configuration is the same in dev, qa, uat, and production. I can view the messages in Graylog from dev, qa, and uat, but production messages are not being received.
I have tested connectivity between production and Graylog (UDP on port 12201) using echo and nc. When I send a manual test message this way, it shows up in Graylog.
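The echo and nc test can also be repeated from inside a JVM, which exercises Java's own name resolution and UDP stack rather than the shell's. The following is only a minimal sketch: the class name `GelfUdpProbe`, the default target `localhost`, and the source host label `probe` are placeholders, and it assumes the GELF UDP input accepts an uncompressed JSON payload with the `version`, `host`, and `short_message` fields.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of log4j2 or Graylog.
class GelfUdpProbe {

    // Escape backslashes and double quotes so the value is safe inside a JSON string.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Build a minimal GELF 1.1 message as plain JSON.
    static String buildPayload(String sourceHost, String message) {
        return "{\"version\":\"1.1\",\"host\":\"" + escape(sourceHost)
                + "\",\"short_message\":\"" + escape(message) + "\"}";
    }

    // Resolve the target inside the JVM, send one datagram, and return the address used.
    static InetAddress send(String targetHost, int port, String payload) throws Exception {
        InetAddress target = InetAddress.getByName(targetHost);
        byte[] data = payload.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(data, data.length, target, port));
        }
        return target;
    }

    public static void main(String[] args) throws Exception {
        // Pass the Graylog input host name as the first argument; "localhost" is a placeholder.
        String targetHost = args.length > 0 ? args[0] : "localhost";
        InetAddress used = send(targetHost, 12201, buildPayload("probe", "test from JVM"));
        System.out.println("Sent GELF test message to " + used.getHostAddress());
    }
}
```

Running it on the production server with the same host name the appender is configured with prints the IP address the JVM actually sent to, which can be compared with what nc used.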
Does anyone have any other suggestions to troubleshoot this issue?
Unfortunately I do not have a lot of details, other than that the Graylog server and the application servers are in the same data centre. Manual testing using echo and nc has shown that there are no firewall rules blocking traffic between the servers.
The log4j2-gelf appender is working (as demonstrated in your dev, qa, and uat environments).
The network connection from the machine running the application server to the Graylog GELF UDP input on port 12201/udp is working (as demonstrated by your test with netcat).
So maybe the application server in the production environment is using a different configuration file for Log4j 2.
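One rough way to check which configuration file the production JVM could be picking up is to look at the configuration system property and at the standard file names on the classpath. This is a sketch under assumptions: `Log4jConfigLocator` is a made-up name, the candidate list follows the file names Log4j 2 documents for automatic configuration, and it only reports what is visible to the class loader it runs with, so it would need to run with the application's classpath to be meaningful.

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic; it approximates Log4j 2's lookup but does not call Log4j itself.
class Log4jConfigLocator {

    // File names Log4j 2 searches for on the classpath during automatic configuration.
    static final String[] CANDIDATES = {
        "log4j2-test.properties", "log4j2-test.yaml", "log4j2-test.yml",
        "log4j2-test.json", "log4j2-test.jsn", "log4j2-test.xml",
        "log4j2.properties", "log4j2.yaml", "log4j2.yml",
        "log4j2.json", "log4j2.jsn", "log4j2.xml"
    };

    // Return "name -> URL" for every candidate the given class loader can see.
    static List<String> findOnClasspath(ClassLoader loader) {
        List<String> found = new ArrayList<>();
        for (String name : CANDIDATES) {
            URL url = loader.getResource(name);
            if (url != null) {
                found.add(name + " -> " + url);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // An explicitly configured file takes precedence over classpath lookup.
        System.out.println("log4j.configurationFile  = " + System.getProperty("log4j.configurationFile"));
        System.out.println("log4j2.configurationFile = " + System.getProperty("log4j2.configurationFile"));
        for (String hit : findOnClasspath(ClassLoader.getSystemClassLoader())) {
            System.out.println(hit);
        }
    }
}
```

If more than one candidate shows up, or the system property points somewhere unexpected, that would support the theory that production is not using the same file as the other environments.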
But ultimately, that’s pretty much how far free support can go.
If you want to buy professional support (with NDA and everything, so you can share sensitive information), please check out https://www.graylog.org/enterprise.
The plot thickens … we ran tcpdump for udp traffic on port 12201 and saw that it was trying to send the log messages to the incorrect IP address.
Running nslookup on the Graylog inputs domain returns the correct IP address. Similarly, a manual test using echo and nc on the production server shows the UDP traffic being sent to the correct IP address.
It seems that, for some reason, the host name resolves to the wrong IP address when the log messages are sent via the Gelf appender running in the application.
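To compare nslookup with what Java itself resolves, a small standalone check like the one below may help. It is a sketch only: `JvmResolveCheck` is a made-up name, and because it runs in a fresh JVM it shows what a new lookup returns, not what the long-running application may still hold in its cache.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic for comparing JVM name resolution with nslookup.
class JvmResolveCheck {

    // Return every address the JVM resolves for the host, or an empty list if it cannot be resolved.
    static List<String> resolve(String host) {
        List<String> addresses = new ArrayList<>();
        try {
            for (InetAddress address : InetAddress.getAllByName(host)) {
                addresses.add(address.getHostAddress());
            }
        } catch (UnknownHostException e) {
            // Leave the list empty so the caller can report the failure.
        }
        return addresses;
    }

    public static void main(String[] args) {
        // Pass the Graylog input host name as the first argument; "localhost" is a placeholder.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(host + " resolves to " + resolve(host));
    }
}
```

If this prints the correct address while tcpdump still shows the application sending to the old one, a stale entry cached inside the running application becomes a likely suspect.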
By default, when a security manager is installed, in order to protect against DNS spoofing attacks, the result of positive host name resolutions are cached forever. When a security manager is not installed, the default behavior is to cache entries for a finite (implementation dependent) period of time. The result of unsuccessful host name resolution is cached for a very short period of time (10 seconds) to improve performance.
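The caching behaviour described above is controlled by the `networkaddress.cache.ttl` security property. The sketch below shows one way to inspect it and to set a finite value; the class name `DnsCacheCheck` and the 60 second value are illustrative choices, and the property generally has to be set before the JVM performs its first lookup to have an effect.

```java
import java.net.InetAddress;
import java.security.Security;

// Hypothetical example of inspecting and setting the JVM's positive DNS cache TTL.
class DnsCacheCheck {

    // Translate the raw property value into a readable description.
    static String describeTtl(String raw) {
        if (raw == null) {
            return "default (implementation dependent; forever with a security manager)";
        }
        int ttl = Integer.parseInt(raw.trim());
        if (ttl < 0) {
            return "forever"; // -1 is the documented "cache forever" value
        }
        if (ttl == 0) {
            return "no caching";
        }
        return ttl + " seconds";
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Configured TTL: "
                + describeTtl(Security.getProperty("networkaddress.cache.ttl")));

        // Set a finite TTL before any lookup happens in this JVM (60 is an arbitrary example).
        Security.setProperty("networkaddress.cache.ttl", "60");
        System.out.println("TTL now: "
                + describeTtl(Security.getProperty("networkaddress.cache.ttl")));

        // Pass the Graylog input host name as the first argument; "localhost" is a placeholder.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(host + " -> " + InetAddress.getByName(host).getHostAddress());
    }
}
```

The same property can be set in the JRE's `java.security` file instead of in code, which avoids changing the application. Restarting the application also clears whatever it has cached.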