Graylog node memory leak / OOM

While this may prove a bit premature, I've rolled OpenJDK back one version, and node3 has already been up and carrying traffic for twice as long as I could keep it stable before, even with no traffic on it.

Suspect Java version in RHEL7…

# java -version
openjdk version "1.8.0_144"
OpenJDK Runtime Environment (build 1.8.0_144-b01)
OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)

Rolled back initially to…

# java -version
openjdk version "1.8.0_141"
OpenJDK Runtime Environment (build 1.8.0_141-b16)
OpenJDK 64-Bit Server VM (build 25.141-b16, mixed mode)

After seeing node3 stay up for more than 90 minutes, I have now rolled Java back on all three nodes to the following version (which incidentally matches the Java version in my Ubuntu Xenial lab environment).

# java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-b12)
OpenJDK 64-Bit Server VM (build 25.131-b12, mixed mode)

This looks like a bug in Java 1.8.0_144 itself. I suggest people stay away from it until it's looked into and fixed.

Interesting. Thanks for sharing this information!

Did OpenJDK 8u141 work correctly or did you have to revert to 8u131 for everything to work?

8u141 appeared to be working (I did not test past 92 minutes); however, when running yum downgrade "*openjdk*", node1 and node2 went straight back to 8u131. Not wanting to risk strange issues by leaving mismatched Java versions in place long term, I downgraded node3 a couple more times until it reached version parity with the others.

All nodes have been up and running since that time.
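
For anyone wanting to do the same, here is a rough sketch of the steps, assuming the stock RHEL7 OpenJDK package names, that yum-plugin-versionlock is available in your repos, and that graylog-server is the service name on your boxes. yum downgrade only steps back one available release at a time, so it may need to be repeated until java -version reports 1.8.0_131, and locking the version keeps a later yum update from pulling 8u144 back in.

# yum downgrade "*openjdk*"
# java -version
# yum install yum-plugin-versionlock
# yum versionlock "java-1.8.0-openjdk*"
# systemctl restart graylog-server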

And these messages continue to rage on in the logs with GELF TCP…

2017-09-21T07:49:28.886-05:00 ERROR [DecodingProcessor] Unable to decode raw message RawMessage{id=4e63c31a-9ecb-11e7-8b97-0050568a6438, journalOffset=4088676143, codec=gelf, payloadSize=191, timestamp=2017-09-21T12:49:28.880Z, remoteAddress=/10.2.81.252:57186} on input <59c2630365ea93611450c4b7>.
2017-09-21T07:49:28.886-05:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=4e63c31a-9ecb-11e7-8b97-0050568a6438, journalOffset=4088676143, codec=gelf, payloadSize=191, timestamp=2017-09-21T12:49:28.880Z, remoteAddress=/10.2.81.252:57186}
java.lang.IllegalArgumentException: GELF message <4e63c31a-9ecb-11e7-8b97-0050568a6438> (received from <10.2.81.252:57186>) has empty mandatory "short_message" field.
        at org.graylog2.inputs.codecs.GelfCodec.validateGELFMessage(GelfCodec.java:252) ~[graylog.jar:?]
        at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:134) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:146) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:87) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:74) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:42) [graylog.jar:?]
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]

I noticed messages like these in my logs with 1.8.0_144 that I had never seen before. I rolled back to 8u141 and will check tomorrow whether it helped.

Edit: It did help with the OOM messages and the Graylog instability.

@jasonkeller that is simple:

2017-09-21T07:49:28.886-05:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=4e63c31a-9ecb-11e7-8b97-0050568a6438, journalOffset=4088676143, codec=gelf, payloadSize=191, timestamp=2017-09-21T12:49:28.880Z, remoteAddress=/10.2.81.252:57186}
java.lang.IllegalArgumentException: GELF message <4e63c31a-9ecb-11e7-8b97-0050568a6438> (received from <10.2.81.252:57186>) has empty mandatory "short_message" field.

I just highlighted the important part of the message.

The GELF messages are not following the specification.
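
For reference, GELF 1.1 only requires version, host, and a non-empty short_message; additional fields are prefixed with an underscore. If you want to check that the input itself is fine, you can hand-feed it a well-formed message over GELF TCP (messages are null-byte terminated). A rough sketch, with graylog.example.com and the default port 12201 standing in for your own input:

# printf '{"version":"1.1","host":"test-host","short_message":"manual GELF test","level":6}\0' | nc -w 1 graylog.example.com 12201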

I'm using the latest NXLog CE, and I have it set not to truncate the short_message field. Are there known bugs with their implementation? Is there a different way I should be tweaking the configuration file?

Hey @jasonkeller

I do not know how to configure NXLog, but your short_message field is empty; that is the reason for the message.

regards
Jan

So after much confusion (as the messages clearly do have the short_message field; I'm getting appropriate data in Graylog), I performed a packet capture and saw the following among the din of valid messages…

{"version":"1.1","_EventReceivedTime":"2017-10-04 08:48:37","_SourceModuleName":"in_log","_SourceModuleType":"im_file","host":"SERVER2","short_message":"","timestamp":1507124917,"level":6} {"version":"1.1","_EventReceivedTime":"2017-10-04

Seemingly random blank short_message fields interspersed with valid ones! Where were they coming from, you might ask? Arbitrary blank lines in the log file (another shout-out to Palo Alto for poor formatting). I fixed it with the following NXLog configuration, which drops the blank lines in the input section (the Exec line)…

<Input in_log>
 Module im_file
 File 'TheLogFile'
 SavePos TRUE
 ReadFromLast TRUE
 # drop blank lines before they become GELF messages with an empty short_message
 Exec if $raw_event =~ /^$/ drop();
 PollInterval 1
</Input>
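
If you want to confirm you're hitting the same thing before and after the change, here are a couple of quick checks; 'TheLogFile' is of course a placeholder for your real path, and 12201 is assumed to be the port of your GELF TCP input. The first counts blank lines in the file NXLog is tailing, the second watches the wire for payloads with an empty short_message.

# grep -c '^$' TheLogFile
# tcpdump -A -nn port 12201 | grep '"short_message":""'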

Hopefully this helps others that run into this.


Thanks for sharing, Jason!
