We recently set up Graylog for a cluster of Apache2 servers, shipping log messages by piping them to /bin/nc.
At some point, Graylog’s listener stopped working and we started filling the apache2 error logs with errors like:
piped log program '/bin/nc -u OURSERVER.com 12201' failed unexpectedly
Around the same time, our web clusters started dying unexpectedly and falling off their load balancers. I suspect the hung connections to Graylog were somehow causing them to fault.
In short, I suspect the failure of the Graylog listener was the cause of a cascading failure across our web server clusters.
With the above in mind, what is the best practice for getting data from an apache2 web server cluster to a Graylog install in a way that prevents this from happening again if the Graylog instance “goes away”?
Many thanks in advance for the conversation and advice to come!
The problem is local to your Apache servers: because you use UDP to send the logs, nothing in Graylog can be the reason for it.
You should check the log files of the systems themselves and your metrics.
Hello Jan, thanks very much for your reply, time and efforts!
Initially I thought so as well, but restarting the Graylog box corrected the problem. The sources graph corroborates this: all source activity stops abruptly at the same moment all three Apache servers begin logging the error described.
I’ve since added -w 1 to the nc command to enforce a timeout, so it cannot wait indefinitely and hang up the Apache instance, but I was hoping there was a better solution.
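One way to make the pipe itself harmless, offered here as an assumption rather than a known fix: replace `/bin/nc` with a small piped-log program that resolves the Graylog host once, sends each line as a single non-blocking UDP datagram, and drops lines on any socket error instead of exiting. Apache logs the "failed unexpectedly" error when a piped log program exits and then respawns it, so a forwarder that neither exits nor blocks avoids that loop. The script name `udp_forward.py` and its HOST PORT arguments are hypothetical; the payload is the raw line, as `nc` would send it.

```python
#!/usr/bin/env python3
"""Hypothetical drop-in for `/bin/nc -u HOST PORT` as an Apache piped-log
program. Usage: udp_forward.py HOST PORT"""
import socket
import sys


def resolve(host, port):
    """Resolve once at startup so no DNS lookup happens per log line."""
    try:
        return socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_DGRAM)[0][4]
    except OSError:
        return None  # unresolvable: keep draining stdin, drop every line


def forward_lines(lines, sock, addr):
    """Send each non-empty line as one datagram; return (sent, dropped)."""
    sent = dropped = 0
    for line in lines:
        payload = line.rstrip("\n").encode("utf-8", errors="replace")
        if not payload:
            continue
        if addr is None:
            dropped += 1
            continue
        try:
            sock.sendto(payload, addr)
            sent += 1
        except OSError:
            # Full socket buffer (BlockingIOError), network unreachable, etc.:
            # drop the line rather than stall Apache or crash the pipe.
            dropped += 1
    return sent, dropped


def main(host, port):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setblocking(False)  # sendto raises instead of waiting
    forward_lines(sys.stdin, sock, resolve(host, port))


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], int(sys.argv[2]))
```

In the Apache config this would take the place of the `nc` pipe, for example `CustomLog "|/usr/local/bin/udp_forward.py OURSERVER.com 12201" combined` (the path is hypothetical). Losing lines while Graylog is down is the deliberate trade-off here.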
Perhaps there is some way to queue messages and transmit them in bulk every minute or so, to avoid multiple open connections that are each at risk of hanging?
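Queuing and sending in bulk could be sketched as a bounded in-memory buffer: `add()` never blocks the caller, the oldest lines are evicted once the cap is reached, and a background thread flushes whatever is queued every `interval` seconds. This is a hypothetical illustration, not a Graylog or Apache feature; the class name, the 10,000-line cap and the `send_batch` callable (for example one that opens a TCP connection with a short timeout) are assumptions.

```python
import collections
import threading


class BatchingBuffer:
    """Hypothetical non-blocking buffer that ships queued lines in batches."""

    def __init__(self, send_batch, max_lines=10000, interval=60.0):
        self._send_batch = send_batch  # callable taking a list of lines
        self._lines = collections.deque(maxlen=max_lines)
        self._lock = threading.Lock()
        self._interval = interval
        self.dropped = 0  # lines evicted by the cap or lost in failed sends

    def add(self, line):
        with self._lock:
            if len(self._lines) == self._lines.maxlen:
                self.dropped += 1  # append below evicts the oldest line
            self._lines.append(line)

    def flush(self):
        """Send everything queued so far; return the number of lines sent."""
        with self._lock:
            batch = list(self._lines)
            self._lines.clear()
        if not batch:
            return 0
        try:
            self._send_batch(batch)
        except OSError:
            # Destination unreachable: discard so memory stays bounded.
            with self._lock:
                self.dropped += len(batch)
            return 0
        return len(batch)

    def run_forever(self, stop_event):
        # Event.wait returns True once stop_event is set, ending the loop.
        while not stop_event.wait(self._interval):
            self.flush()
        self.flush()  # final flush on shutdown
```

The producer (the piped-log reader) only ever calls `add()`, so a slow or dead Graylog can delay or lose a batch but cannot hold up Apache; `run_forever` would run in a separate thread.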
With great appreciation,
Personally, I would never make my user-facing servers that vulnerable by sending the messages directly out. Because the solution you selected is hacky, you run into strange issues.
I would write the log messages locally, have Filebeat collect them, and rotate and delete the files quickly if you have space issues. This way your frontend does not depend on any back-office logging to work properly.
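The core idea behind this suggestion is that Apache only ever writes to a local file, while a separate shipper sends new lines and remembers how far it got, so an unreachable Graylog leaves lines on disk instead of stalling the web server. The sketch below is a simplified illustration of that idea, not how Filebeat is implemented; the function name, the JSON state file and the `send_lines` callable are hypothetical, and the size-based rotation check is deliberately naive.

```python
import json
import os


def ship_new_lines(log_path, state_path, send_lines):
    """Send complete lines appended to log_path since the last successful run.

    The byte offset is saved only after send_lines returns, so lines written
    while the destination is down stay on disk and are retried next time.
    """
    offset = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            offset = json.load(f)["offset"]
    if os.path.getsize(log_path) < offset:
        offset = 0  # naive rotation/truncation check: start from the top
    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Only ship complete lines; a partial last line waits for the next run.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    if not lines:
        return 0
    send_lines(lines)  # may raise; the saved offset is then left unchanged
    tmp = state_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset + end}, f)
    os.replace(tmp, state_path)  # atomically replace the saved offset
    return len(lines)
```

This could run periodically (for example from cron) or in a loop. A real shipper also has to recognise renamed or rotated files, which is one reason to use an existing tool rather than a script like this.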
This is great advice; I will look into Filebeat.
Is there a best-practices doc somewhere I should be following?
BTW, re: hacky: this is what turned up as the solution when I searched for the above. It’s actually piping through a VPC and not user-facing (the port is not exposed), so the transmission is secure end-to-end.
Any docs or FAQ on setting this up for an Apache cluster that you could point me to would be greatly appreciated, as is your help with this problem so far. Many thanks!
You can get some insights in this community, and the documentation will give you some additional ones.
There is no single “final solution”, as there is more than one way to reach your goal. You need to decide what fits your setup.
Some more information about this issue, which continues for us:
Every 12-48 hours the input itself stops ingesting messages.
I was able to get it running again by stopping / starting the input from the web interface alone this time (as opposed to restarting the entire server).
The input is currently frozen/locked up. I will screenshot what I can and put it below for reference:
If anyone has suggestions for things to check or try next, I would greatly appreciate them. Thank you.