How to diagnose a freezing GELF UDP input?

I have a GELF UDP input that has been collecting log messages from multiple Apache servers for about a month now. Every 1-2 days, log collection from all Apache machines stops at the same time.

Generally I would restart the instance and it would come back up, but this morning I tried something new: just stopping and starting the input itself via the web interface.

I’m not sure how to proceed with determining the cause of the freezing - any advice would be greatly appreciated.


- check your system logs
- check the Graylog server.log
- check your system metrics
- check kernel messages

Thanks Jan - I've been monitoring all of those over previous failures and haven't seen anything that sticks out, except for one nagging thing in the Graylog server log file (attached at the end).

I see nothing consistent with overload in the system metrics, but I do see network traffic still trying to come in (Graylog just doesn't appear to be listening).
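One way to confirm from a client machine whether the port is actually reachable is a connected UDP socket: if the port is closed, the kernel surfaces the ICMP "port unreachable" reply as a connection-refused error on the next send/recv. A minimal sketch (Linux behavior; the GELF-ish probe payload is just an example):

```python
import socket

def probe_udp(host: str, port: int, timeout: float = 2.0) -> str:
    """Send a datagram to host:port and watch for ICMP port-unreachable.

    A GELF input never replies, so a timeout means the listener is
    probably alive; ConnectionRefusedError means the port is closed.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))  # connected socket -> ICMP errors surface
        s.send(b'{"version":"1.1","host":"probe","short_message":"ping"}')
        s.recv(1)                # we expect no reply at all
        return "got unexpected reply"
    except socket.timeout:
        return "alive (no ICMP error within timeout)"
    except ConnectionRefusedError:
        return "dead (ICMP port unreachable)"
    finally:
        s.close()
```

A timeout is only weak evidence of life (a firewall dropping ICMP looks the same), but a refused result is a clear sign the listener is gone.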

Re: the server log - I do see a lot of messages like the one below, which don't make sense to me. It looks like Apache might be sending in many bad messages, though I don't know why. (EDIT: Note that the timestamp of this message coincides, within seconds, with the last received message.)

2019-07-05T14:42:27.271Z ERROR [DecodingProcessor] Unable to decode raw message RawMessage{id=1bf52571-9f33-11e9-bf84-063bc20c2362, journalOffset=10377850, codec=gelf, payloadSize=2048, timestamp=2019-07-05T14:42:27.271Z, remoteAddress=/[REDACTED]:55800} on input <5d07f63af4ca790892f00bff>.
2019-07-05T14:42:27.271Z ERROR [DecodingProcessor] Error processing message RawMessage{id=1bf52571-9f33-11e9-bf84-063bc20c2362, journalOffset=10377850, codec=gelf, payloadSize=2048, timestamp=2019-07-05T14:42:27.271Z, remoteAddress=/[REDACTED]:55800}
com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'AY43MEUQtBhQyaoFm1QXxcYbTVjnwvjvG231HEcZroRhcCTi9Nk7rQ13rzzPY6I5VEhv1ihlIHFgMwOozY_ndzshQwyqPOLv7k_z6UA7NOE30': was expecting ('true', 'false' or 'null')
 at [Source: AY43MEUQtBhQyaoFm1QXxcYbTVjnwvjvG231HEcZroRhcCTi9Nk7rQ13rzzPY6I5VEhv1ihlIHFgMwOozY_ndzshQwyqPOLv7k_z6UA7NOE30-7BReAokuZUJsexeYdCuKmQcHX4BP5RFNFMCv639dwnQBQ61-USBP7fIerGPT3KEsKQLtyJcJJ-H-E5Ak6nVzp4i6OHu8gY2N5yild9qkJkxc10YzWoRFPWkUf7l3SNi-o3qMS-MCCvRMwsGPkjM-xDqmGSJ2plaLaU0XoLv_F4MXRhDBHeTFINPqKaB1CiF_oudqO3q02_RB0x0gyTR08E55B-V09qa6dmuJ4vucHnXHP961LTB0QuMXLIlz79VkmVgzkhBdcKseD7-1RSionOSraxPkBDgo2mvQBfobBSthxAQQaLwhUqP6DGJIOu7wWI5VeKwFQTrAOykaZOJ7TCPbDTR7xGwXMnV30YE_54XMxVQidkYWA3ZDk3kHpuBoF1DnlQdOGZtkSI71HCOwj65zsFoLnGc41weKr_ABIdKLOdlCYdJ0qADXJTL_qDtC5YqRu2yrH-_UbvxITxTtiHa_nsrzxjzREklcRomcpMr3dPwIh3fjAc6pKPG1XZZ5fZeidbwYfI_i0NGQlA34dG8QgXyxXgSZJVEsorzN8ZUQ1BtgVD04d9dXfHHNuABnX1439MukPmg81cE7ZG0pFQql3mg9lvzUV6Buaa9OFb8AD9jjCjImPN23oJj6I7SnU5_JJgWzAnRsfD8WaYsGETNNA&pt=text&li=rbox-t2v&sig=8c9e58debd91891c42dae4af026c3bb429b4c7a37699&", "_from_apache": "true" }; line: 1, column: 110]
        at com.fasterxml.jackson.core.JsonParser._constructError( ~[graylog.jar:?]
        at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError( ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._reportInvalidToken( ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue( ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken( ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectMapper._initForReading( ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose( ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectMapper.readTree( ~[graylog.jar:?]
        at org.graylog2.inputs.codecs.GelfCodec.decode( ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage( ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent( [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent( [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent( [graylog.jar:?]
        at [graylog.jar:?]
        at com.codahale.metrics.InstrumentedThreadFactory$ [graylog.jar:?]
        at [?:1.8.0_212]

There are many hundreds of entries like it before this one, but this is the last message in the log.
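One detail worth noting in the error above: payloadSize=2048 is suspiciously exact, and the JSON in the exception is cut off mid-value. That is consistent with the datagram being truncated at a 2048-byte read buffer rather than Apache emitting garbage. A quick sketch to flag records whose GELF payload would exceed that size (the 2048 cutoff is my assumption, taken from the payloadSize in the log line; the sample records are made up):

```python
import json

# Assumed cutoff, taken from payloadSize=2048 in the error above.
TRUNCATION_LIMIT = 2048

def would_truncate(record: dict, limit: int = TRUNCATION_LIMIT) -> bool:
    """Return True if the GELF JSON payload for `record` exceeds `limit` bytes."""
    payload = json.dumps(record).encode("utf-8")
    return len(payload) > limit

# e.g. a request with a very long query string blows past the limit,
# while a typical request fits comfortably:
big = {"version": "1.1", "host": "web1",
       "short_message": "GET /x?q=" + "A" * 2100}
small = {"version": "1.1", "host": "web1", "short_message": "GET /"}
```

If the long query strings visible in the exception text are routine for these Apache servers, truncated datagrams alone would explain the flood of JsonParseExceptions.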

My Apache log configuration for GELF is as follows:

LogFormat "{ \"version\": \"1.1\", \"host\": \"%V\", \"short_message\": \"%r\", \"timestamp\": %{%s}t, \"level\": 6, \"_user_agent\": \"%{User-Agent}i\", \"_source_ip\": \"%{X-Forwarded-For}i\", \"_duration_usec\": %D, \"_duration_sec\": %T, \"_request_size_byte\": %O, \"_http_status_orig\": %s, \"_http_status\": %>s, \"_http_request_path\": \"%U\", \"_http_request\": \"%U%q\", \"_http_method\": \"%m\", \"_http_referer\": \"%{Referer}i\", \"_from_apache\": \"true\" }" graylog_access

The above format is mentioned here and on Stack Overflow:

I’ll keep monitoring of course - if you or anyone else thinks of anything further to check, your input is greatly appreciated.

I suspect the following issue may be related to mine; I will investigate:

Per that ticket, it appears the UDP receiver can be trashed by malformed packets (insufficient validation?).

I will use ss the next time this happens to see whether the listener is in fact dead or alive. If it's dead, I will see whether I can reproduce the other issue's OP's solution - more to come.
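For reference, the same check ss does can be scripted directly against /proc/net/udp, which is handy for a watchdog. A sketch assuming Linux (12201 would be the default GELF port, but any port works):

```python
def udp_listener_present(port: int) -> bool:
    """Scan /proc/net/udp{,6} for a socket bound to `port` (Linux only)."""
    target = format(port, "04X")             # kernel lists ports as 4 hex digits
    for path in ("/proc/net/udp", "/proc/net/udp6"):
        try:
            with open(path) as f:
                next(f)                      # skip the header row
                for line in f:
                    local = line.split()[1]  # local address, e.g. "00000000:2FA9"
                    if local.rsplit(":", 1)[1] == target:
                        return True
        except FileNotFoundError:
            continue
    return False
```

Run from cron, this could raise an alert (or trigger an input restart) the moment the listener vanishes instead of waiting for someone to notice missing logs.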

Update: Indeed, you can see below that the listener is initially present and then gone some time after logging stops functioning:


I’ve also noticed that since I increased the number of incoming Apache logs (more servers) the fault happens more frequently (every hour or two so far).

Glad that you have a trace now.

