Graylog 3.x cluster intermittently stops receiving GELF UDP logs with malformed packet decode errors and input bind conflicts

We run a multi-node Graylog 3.3.17 cluster and recently started seeing certain nodes intermittently stop receiving GELF UDP logs for several minutes at a time. A typical alert:

CRITICAL : Graylog node graylog1 is not receiving GELF UDP logs since last 10m.

At the same time, some streams temporarily show 0 processed messages.

The issue is intermittent and mostly affects GELF UDP ingestion. GELF TCP and the Graylog web UI continue to function normally.

We initially suspected Elasticsearch or memory pressure, but Elasticsearch cluster health remains green.

We restarted Graylog containers on all nodes to clear memory, but the issue still occurs intermittently.

We also noticed repeated GELF UDP parsing/decode errors in Graylog logs such as:

GELF message is too short. Not even the type header would fit.

and:

GELF message is missing mandatory "host" field.
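
As a rough illustration of what these two errors correspond to, here is a minimal Python sketch that sorts a raw UDP payload into categories. It assumes the usual GELF conventions: a two-byte type header (0x1e 0x0f for chunks, 0x1f 0x8b for gzip, a leading 0x78 for zlib) and mandatory "host" and "short_message" fields. The function name and the category labels are made up for this example; this is not Graylog's own decoder.

```python
import gzip
import json
import zlib

CHUNK_MAGIC = b"\x1e\x0f"   # chunked GELF marker
GZIP_MAGIC = b"\x1f\x8b"    # gzip-compressed payload
CHUNK_HEADER_LEN = 12       # magic (2) + message id (8) + seq number (1) + seq count (1)


def classify_gelf_datagram(payload: bytes) -> str:
    """Return a rough label for one UDP payload (labels are hypothetical)."""
    if len(payload) < 2:
        # Too small to hold even the two-byte type header.
        return "too-short"
    if payload[:2] == CHUNK_MAGIC:
        # A chunk needs the full header plus at least one data byte.
        return "chunk" if len(payload) > CHUNK_HEADER_LEN else "bad-chunk"
    try:
        if payload[:2] == GZIP_MAGIC:
            body = gzip.decompress(payload)
        elif payload[0] == 0x78:
            body = zlib.decompress(payload)
        else:
            body = payload  # assume uncompressed JSON
        message = json.loads(body.decode("utf-8"))
    except (OSError, EOFError, zlib.error, ValueError):
        # Covers bad compression, bad UTF-8 and bad JSON.
        return "undecodable"
    if not isinstance(message, dict):
        return "not-an-object"
    if not message.get("host"):
        return "missing-host"
    if not message.get("short_message"):
        return "missing-short-message"
    return "ok"
```

Feeding captured payloads through a check like this would show whether the bad traffic is truncated datagrams, non-GELF data such as syslog lines sent to the GELF port, or JSON that simply lacks "host".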

We are trying to understand whether this behavior is:

  • expected under malformed GELF traffic

  • caused by input conflicts

  • related to UDP load balancing

  • caused by a forwarding loop

  • or a known issue in older Graylog 3.x deployments

OS Information:

Linux (Docker-based deployment)

Package Version:

  • Graylog 3.3.17

  • Elasticsearch 6.x

  • MongoDB Replica Set

Graylog Docker image:

FROM graylog/graylog:3.3.17

Deployment architecture:

We run a 3-node Graylog cluster in Docker using network_mode: host.

Elasticsearch cluster health remains healthy:

{
  "status": "green",
  "number_of_nodes": 3,
  "active_shards_percent_as_number": 100.0
}

Inputs configured:

  • GELF UDP

  • GELF TCP

  • GELF HTTP

  • Syslog UDP

  • Beats Input

NGINX stream load balancing

We also use nginx stream-based load balancing in front of Graylog inputs.

Example UDP config:

upstream gelf_udp_servers {
    server graylog1:12200 weight=8;
    server graylog2:12200 weight=4;
    server graylog3:12200 weight=4;
}

server {
    listen 12201 udp;
    listen 12202 udp;
    proxy_pass gelf_udp_servers;

    proxy_responses 0;
}

TCP config:

upstream gelf_tcp_servers {
    server graylog1:12200 weight=8;
    server graylog2:12200 weight=4;
    server graylog3:12200 weight=4;
}

server {
    listen 12201;
    proxy_pass gelf_tcp_servers;

    proxy_responses 0;
}

Syslog UDP config:

upstream syslog_udp_servers {
    server graylog1:513 weight=8;
    server graylog2:513 weight=4;
    server graylog3:513 weight=4;
}

server {
    listen 514 udp;
    proxy_pass syslog_udp_servers;

    proxy_responses 0;
}
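
One way to check whether the proxy path itself alters datagrams is to send a known-good GELF message through nginx and look for it in Graylog. The following Python sketch builds a small GELF 1.1 payload and sends it over UDP. The helper names and the "probe_id" field are hypothetical, and the target address is whatever your nginx listener uses.

```python
import json
import socket
import time
import zlib


def build_gelf_payload(host, short_message, compress=True, **extra):
    """Build a minimal GELF 1.1 message; extra fields get the '_' prefix."""
    message = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "timestamp": time.time(),
        "level": 6,  # informational
    }
    message.update({f"_{key}": value for key, value in extra.items()})
    raw = json.dumps(message).encode("utf-8")
    return zlib.compress(raw) if compress else raw


def send_gelf_udp(payload, address):
    """Send one datagram to (host, port) and return the number of bytes sent."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, address)
```

For example, `send_gelf_udp(build_gelf_payload("probe-host", "nginx path test", probe_id="a1"), ("nginx-host", 12201))`, where "nginx-host" is a placeholder. If the probe arrives intact while the decode errors continue, the malformed packets are more likely coming from a sender than from the proxy.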

Relevant Graylog logs:

ERROR: org.graylog2.plugin.inputs.transports.NettyTransport - Error in Input [GELF UDP]
cause java.lang.IllegalStateException:
GELF message is too short. Not even the type header would fit.
WARN : org.graylog2.inputs.codecs.GelfCodec -
GELF message is missing mandatory "host" field.
ERROR: org.graylog2.shared.buffers.processors.DecodingProcessor -
Unable to decode raw message

Example malformed message log:

RawMessage{
 codec=gelf,
 payloadSize=252,
 remoteAddress=/<ip:port>
}

We also occasionally see:

An input has failed to start:
bind(..) failed: Address already in use
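
Since the containers use host networking, anything else on the host that holds the same UDP port will produce this bind error. A minimal Python sketch for checking whether a UDP port can be bound at a given moment (the function name is made up for this example; ports below 1024 need elevated privileges, and that error is re-raised rather than hidden):

```python
import errno
import socket


def udp_port_is_free(port, host="0.0.0.0"):
    """Return True if a UDP socket can be bound to host:port right now."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return False
        raise  # e.g. permission denied on a privileged port
    finally:
        sock.close()
    return True
```

Running it on a Graylog node while the input is stopped, for instance `udp_port_is_free(12200)`, would show whether some other process is still holding the port.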

Additionally, during incidents we observed MongoDB monitor reconnect messages:

com.mongodb.MongoSocketOpenException:
Exception opening socket

followed shortly by successful reconnects.

What steps have you already taken to try and solve the problem?

  • Restarted Graylog containers on all 3 nodes

  • Verified Elasticsearch cluster health is green

  • Checked Graylog logs on all nodes

  • Confirmed the malformed GELF UDP decoding errors in the logs

  • Investigated streams showing 0 processed messages

  • Confirmed the issue appears mostly related to GELF UDP inputs

  • Checked Graylog UI for failed/binding inputs

  • Investigated whether nodes might be forwarding traffic to each other unintentionally

How can the community help?

We are looking for guidance on the following:

  1. Could this indicate a logging loop, UDP forwarding loop, or input misconfiguration?

  2. Could nginx stream UDP proxying contribute to duplicated/malformed GELF packets in Graylog 3.x?

  3. Could malformed GELF packets alone cause Graylog to temporarily stop processing UDP traffic on a node?

  4. Are there known GELF UDP transport/input stability issues in older Graylog 3.x releases?

  5. Would upgrading from Graylog 3.0 likely resolve transport/input related behaviors like this?

  6. Any recommended debugging steps to identify which service/process is generating malformed GELF UDP packets?

Any guidance or similar experiences would be appreciated.

Sorry, I gave the wrong version above. The issue occurred while we were running Graylog 3.0. After upgrading to 3.3.17 it is resolved, in the sense that the newer version ignores the bad traffic. However, I could not find any documentation or GitHub issue describing this fix. Does anyone know of one?

My first step for issues like this is always to switch to a raw input. If you are receiving malformed messages, you will then see exactly what is arriving, because raw inputs accept just about anything, especially over UDP.
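
Outside of Graylog, the same idea can be sketched in a few lines of Python: listen on a spare UDP port, count datagrams per source IP, and keep the first bytes of one sample per sender. The function name and the port in the usage note are hypothetical. The port must not already be in use by a Graylog input, so traffic has to be pointed or mirrored at it.

```python
import socket
from collections import Counter


def tally_senders(sock, max_packets, timeout=1.0):
    """Read up to max_packets datagrams; return (counts per IP, hex sample per IP)."""
    sock.settimeout(timeout)
    counts = Counter()
    samples = {}
    for _ in range(max_packets):
        try:
            data, (ip, _port) = sock.recvfrom(65535)
        except socket.timeout:
            break  # nothing more arrived within the timeout
        counts[ip] += 1
        # Keep the first 16 bytes of the first packet seen from each sender.
        samples.setdefault(ip, data[:16].hex())
    return counts, samples
```

A usage sketch: create `sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)`, bind it with `sock.bind(("0.0.0.0", 12299))`, then call `tally_senders(sock, 1000)`. A sample starting with `1f8b` or `78` suggests compressed GELF, `1e0f` a GELF chunk, and anything else is worth a closer look.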

Hello @rahulpatil, could I ask what is blocking you from upgrading your Graylog instance? Since the newest release is 7.1, it’s safe to assume it contains a host of fixes for issues that appear in Graylog 3.x, not to mention new features to make use of.