Graylog node memory leak / OOM

Hi guys - I’ve got a vexing issue that has plagued me for the past week and, despite my best efforts, persists.

I have three Graylog nodes (Graylog 2.3.1, MongoDB, Unbound) and two Elasticsearch nodes (ES5, LVS/Keepalived, NGINX). All nodes are running the latest RHEL 7 (Linux 3.10.0-693.2.2.el7.x86_64) under vSphere 6. The Graylog nodes have 4 vCPU and 3GB memory; the ES5 nodes have 6 vCPU and 12GB memory.

When I first clustered in the 2.2 train with ES2.x, all nodes worked fine. Through the upgrade to Graylog 2.3 and ES5, all nodes again worked fine. That all changed last week when the disks on my ES nodes filled up and the journals started overflowing on the Graylog nodes.

I quickly destroyed old indices to free up space (and switched to space-based retention, as my bursts are becoming larger and more frequent) and got the node journals flushed out. Since then, however, the first two nodes have been OK, but node 3 behaves differently: sometimes as little as 2 minutes after startup, its memory usage suddenly shoots upward (usually within a couple of seconds) until the oom-killer destroys it.

The most vexing part of all this is that even with debugging turned on, I still have not found a single useful snippet in the logs to indicate why this is happening. It is not happening in my test environment on Ubuntu Xenial. Worse, I’ve even gone to the length of destroying the old node 3, cloning node 2 and adjusting all of its settings (including node-id) to make it a new node 3, and the issue persists!

Whatever is causing this issue is utterly persistent. I’m looking for some direction on how to troubleshoot it further, since it seems to be very deeply embedded somewhere in either MongoDB or ES.

With the given RAM, what is your configured JVM heap size?

What are your configured shard and retention settings? At what rate did the indices rotate?

What can be found in the Graylog log files shortly before it got killed?

JVM heap is default (1GB). ES5 is set to one primary shard, one replica shard, rotating every 5GB for some indices and every day for others.

journalctl -xe

Sep 13 09:55:04 graylog3.local kernel: java invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
Sep 13 09:55:04 graylog3.local kernel: java cpuset=/ mems_allowed=0
Sep 13 09:55:04 graylog3.local kernel: CPU: 1 PID: 2936 Comm: java Not tainted 3.10.0-693.2.2.el7.x86_64 #1
Sep 13 09:55:04 graylog3.local kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
Sep 13 09:55:04 graylog3.local kernel:  ffff8800b8240fd0 00000000ff832169 ffff8800a2387988 ffffffff816a3db1
Sep 13 09:55:04 graylog3.local kernel:  ffff8800a2387a18 ffffffff8169f1a6 ffff8800a2387a20 ffffffff812b7e6b
Sep 13 09:55:04 graylog3.local kernel:  ffff8800aa6fc768 0000000000000202 ffffffff00000202 fffeefff00000000
Sep 13 09:55:04 graylog3.local kernel: Call Trace:
Sep 13 09:55:04 graylog3.local kernel:  [<ffffffff816a3db1>] dump_stack+0x19/0x1b
Sep 13 09:55:04 graylog3.local kernel:  [<ffffffff8169f1a6>] dump_header+0x90/0x229
Sep 13 09:55:04 graylog3.local kernel:  [<ffffffff812b7e6b>] ? cred_has_capability+0x6b/0x120
Sep 13 09:55:04 graylog3.local kernel:  [<ffffffff81186394>] oom_kill_process+0x254/0x3d0
Sep 13 09:55:04 graylog3.local kernel:  [<ffffffff812b804e>] ? selinux_capable+0x2e/0x40
Sep 13 09:55:04 graylog3.local kernel:  [<ffffffff81186bd6>] out_of_memory+0x4b6/0x4f0
Sep 13 09:55:04 graylog3.local kernel:  [<ffffffff8169fcaa>] __alloc_pages_slowpath+0x5d6/0x724
Sep 13 09:55:04 graylog3.local kernel:  [<ffffffff8118cd85>] __alloc_pages_nodemask+0x405/0x420
Sep 13 09:55:04 graylog3.local kernel:  [<ffffffff811d4135>] alloc_pages_vma+0xb5/0x200
Sep 13 09:55:04 graylog3.local kernel:  [<ffffffff811c453d>] read_swap_cache_async+0xed/0x160
Sep 13 09:55:04 graylog3.local kernel:  [<ffffffff811c4658>] swapin_readahead+0xa8/0x110
Sep 13 09:55:04 graylog3.local kernel:  [<ffffffff811b235b>] handle_mm_fault+0xadb/0xfa0
Sep 13 09:55:04 graylog3.local kernel:  [<ffffffff816afff4>] __do_page_fault+0x154/0x450
Sep 13 09:55:04 graylog3.local kernel:  [<ffffffff816b0325>] do_page_fault+0x35/0x90
Sep 13 09:55:04 graylog3.local kernel:  [<ffffffff816ac548>] ? page_fault+0x28/0x30
Sep 13 09:55:04 graylog3.local kernel:  [<ffffffff816ac548>] page_fault+0x28/0x30
Sep 13 09:55:04 graylog3.local kernel: Mem-Info:
Sep 13 09:55:04 graylog3.local kernel: active_anon:531277 inactive_anon:134288 isolated_anon:0
                                            active_file:7 inactive_file:345 isolated_file:0
                                            unevictable:1 dirty:0 writeback:0 unstable:0
                                            slab_reclaimable:4783 slab_unreclaimable:14641
                                            mapped:96 shmem:702 pagetables:4657 bounce:0
                                            free:14065 free_pcp:41 free_cma:0
Sep 13 09:55:04 graylog3.local kernel: Node 0 DMA free:11484kB min:244kB low:304kB high:364kB active_anon:1664kB inactive_anon:1908kB active_file:0kB inactive_file:16kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15
Sep 13 09:55:04 graylog3.local kernel: lowmem_reserve[]: 0 2813 2813 2813
Sep 13 09:55:04 graylog3.local kernel: Node 0 DMA32 free:44776kB min:44808kB low:56008kB high:67212kB active_anon:2123444kB inactive_anon:535244kB active_file:28kB inactive_file:1364kB unevictable:4kB isolated(anon):0kB isolated(file
Sep 13 09:55:04 graylog3.local kernel: lowmem_reserve[]: 0 0 0 0
Sep 13 09:55:04 graylog3.local kernel: Node 0 DMA: 13*4kB (UM) 8*8kB (UM) 6*16kB (UEM) 5*32kB (UEM) 6*64kB (UEM) 2*128kB (UM) 3*256kB (UEM) 1*512kB (E) 1*1024kB (E) 2*2048kB (EM) 1*4096kB (M) = 11508kB
Sep 13 09:55:04 graylog3.local kernel: Node 0 DMA32: 895*4kB (UE) 867*8kB (UE) 408*16kB (UEM) 231*32kB (UEM) 145*64kB (UEM) 55*128kB (UEM) 16*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44852kB
Sep 13 09:55:04 graylog3.local kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Sep 13 09:55:04 graylog3.local kernel: 1705 total pagecache pages
Sep 13 09:55:04 graylog3.local kernel: 603 pages in swap cache
Sep 13 09:55:04 graylog3.local kernel: Swap cache stats: add 294604, delete 294001, find 6410/9812
Sep 13 09:55:04 graylog3.local kernel: Free swap  = 0kB
Sep 13 09:55:04 graylog3.local kernel: Total swap = 1047548kB
Sep 13 09:55:04 graylog3.local kernel: 786301 pages RAM
Sep 13 09:55:04 graylog3.local kernel: 0 pages HighMem/MovableOnly
Sep 13 09:55:04 graylog3.local kernel: 61717 pages reserved
Sep 13 09:55:04 graylog3.local kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Sep 13 09:55:04 graylog3.local kernel: [  504]     0   504     9207       35      21       58             0 systemd-journal
Sep 13 09:55:04 graylog3.local kernel: [  529]     0   529    48772        0      32      699             0 lvmetad
Sep 13 09:55:04 graylog3.local kernel: [  534]     0   534    12025        1      26      820         -1000 systemd-udevd
Sep 13 09:55:04 graylog3.local kernel: [  649]     0   649    13863        0      28      111         -1000 auditd
Sep 13 09:55:04 graylog3.local kernel: [  668]     0   668     3035        0      11      921             0 haveged
Sep 13 09:55:04 graylog3.local kernel: [  669]     0   669     1618        0       9       43             0 rngd
Sep 13 09:55:04 graylog3.local kernel: [  670]     0   670    66523        0      77      332             0 sssd
Sep 13 09:55:04 graylog3.local kernel: [  671]   998   671   136701        0      60     2210             0 polkitd
Sep 13 09:55:04 graylog3.local kernel: [  672]     0   672     5406       41      16       42             0 irqbalance
Sep 13 09:55:04 graylog3.local kernel: [  674]     0   674    53030      567      38      152             0 rsyslogd
Sep 13 09:55:04 graylog3.local kernel: [  677]     0   677    24902        0      42      403             0 VGAuthService
Sep 13 09:55:04 graylog3.local kernel: [  679]     0   679    76269       33      57      307             0 vmtoolsd
Sep 13 09:55:04 graylog3.local kernel: [  681]    81   681    11238       10      20      116          -900 dbus-daemon
Sep 13 09:55:04 graylog3.local kernel: [  683]   996   683    29553        0      31      115             0 chronyd
Sep 13 09:55:04 graylog3.local kernel: [  716]     0   716    84506       59      85     6515             0 firewalld
Sep 13 09:55:04 graylog3.local kernel: [  717]     0   717   121121        0     175     5472             0 sssd_be
Sep 13 09:55:04 graylog3.local kernel: [  947]     0   947    67889        0      83      236             0 sssd_nss
Sep 13 09:55:04 graylog3.local kernel: [  948]     0   948    62802        1      74      230             0 sssd_pam
Sep 13 09:55:04 graylog3.local kernel: [  950]     0   950     6051        2      17       73             0 systemd-logind
Sep 13 09:55:04 graylog3.local kernel: [  952]     0   952    31559       19      18      138             0 crond
Sep 13 09:55:04 graylog3.local kernel: [  953]     0   953    27511        1      10       32             0 agetty
Sep 13 09:55:04 graylog3.local kernel: [ 1018]     0  1018    55883       44      64     1063             0 snmpd
Sep 13 09:55:04 graylog3.local kernel: [ 1019]     0  1019   140598       99      90     3081             0 tuned
Sep 13 09:55:04 graylog3.local kernel: [ 1020]     0  1020    28911        0      13       39             0 rhsmcertd
Sep 13 09:55:04 graylog3.local kernel: [ 1024]     0  1024    26499        0      56      244         -1000 sshd
Sep 13 09:55:04 graylog3.local kernel: [ 1068]     0  1068    87123        3      49      902             0 dsmcad
Sep 13 09:55:04 graylog3.local kernel: [ 1074]   995  1074   362486     4880     171    17136             0 mongod
Sep 13 09:55:04 graylog3.local kernel: [ 1106]   994  1106    36906     2150      37      623             0 unbound
Sep 13 09:55:04 graylog3.local kernel: [ 1116]     0  1116    26973        0       8       26             0 rhnsd
Sep 13 09:55:04 graylog3.local kernel: [ 2848]     0  2848    40529       43      78      296             0 sshd
Sep 13 09:55:04 graylog3.local kernel: [ 2853]     0  2853    28848        0      13      102             0 bash
Sep 13 09:55:04 graylog3.local kernel: [ 2875]     0  2875    39589      150      33      170             0 top
Sep 13 09:55:04 graylog3.local kernel: [ 2878]     0  2878    40529       29      78      304             0 sshd
Sep 13 09:55:04 graylog3.local kernel: [ 2883]     0  2883    28848        1      14      102             0 bash
Sep 13 09:55:04 graylog3.local kernel: [ 2928]   993  2928    28282        1      12       47             0 graylog-server
Sep 13 09:55:04 graylog3.local kernel: [ 2929]   993  2929  2426981   647804    2918   216846             0 java
Sep 13 09:55:04 graylog3.local kernel: [ 2948]     0  2948    38405      217      32       71             0 watch
Sep 13 09:55:04 graylog3.local kernel: [ 4029]     0  4029    38404      217      28       71             0 watch
Sep 13 09:55:04 graylog3.local kernel: [ 4030]     0  4030    32106       53      18        0             0 systemctl
Sep 13 09:55:04 graylog3.local kernel: Out of memory: Kill process 2929 (java) score 879 or sacrifice child
Sep 13 09:55:04 graylog3.local kernel: Killed process 2929 (java) total-vm:9707924kB, anon-rss:2591216kB, file-rss:0kB, shmem-rss:0kB
Sep 13 09:55:04 graylog3.local graylog-server[2928]: /usr/share/graylog-server/bin/graylog-server: line 24:  2929 Killed                  $GRAYLOG_COMMAND_WRAPPER ${JAVA:=/usr/bin/java} $GRAYLOG_SERVER_JAVA_OPTS -jar -Dlog4j.configur
Sep 13 09:55:04 graylog3.local systemd[1]: graylog-server.service: main process exited, code=exited, status=137/n/a
Sep 13 09:55:04 graylog3.local systemd[1]: Unit graylog-server.service entered failed state.
Sep 13 09:55:04 graylog3.local systemd[1]: graylog-server.service failed.

This is the log output. I’ve even gone all the way back to the start, looking through hundreds of thousands of log lines, and the logs on all of the nodes are just filled with this garbage.

tail -100 /var/log/graylog-server/server.log

2017-09-13T09:54:59.979-05:00 ERROR [DecodingProcessor] Unable to decode raw message RawMessage{id=83f07d98-9893-11e7-a202-0050568a570f, journalOffset=3204499478, codec=gelf, payloadSize=155, timestamp=2017-09-13T14:54:59.928Z, remoteAddress=/10.2.12.82:50478} on input <59676e7a55928d1858fccaf3>.
2017-09-13T09:54:59.979-05:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=83f07d98-9893-11e7-a202-0050568a570f, journalOffset=3204499478, codec=gelf, payloadSize=155, timestamp=2017-09-13T14:54:59.928Z, remoteAddress=/10.2.12.82:50478}
java.lang.IllegalArgumentException: GELF message <83f07d98-9893-11e7-a202-0050568a570f> (received from <10.2.12.82:50478>) has empty mandatory "short_message" field.
        at org.graylog2.inputs.codecs.GelfCodec.validateGELFMessage(GelfCodec.java:252) ~[graylog.jar:?]
        at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:134) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:146) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:87) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:74) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:42) [graylog.jar:?]
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
2017-09-13T09:54:59.979-05:00 ERROR [DecodingProcessor] Unable to decode raw message RawMessage{id=83f07d9a-9893-11e7-a202-0050568a570f, journalOffset=3204499480, codec=gelf, payloadSize=155, timestamp=2017-09-13T14:54:59.928Z, remoteAddress=/10.2.12.82:50478} on input <59676e7a55928d1858fccaf3>.
2017-09-13T09:54:59.979-05:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=83f07d9a-9893-11e7-a202-0050568a570f, journalOffset=3204499480, codec=gelf, payloadSize=155, timestamp=2017-09-13T14:54:59.928Z, remoteAddress=/10.2.12.82:50478}
java.lang.IllegalArgumentException: GELF message <83f07d9a-9893-11e7-a202-0050568a570f> (received from <10.2.12.82:50478>) has empty mandatory "short_message" field.
        at org.graylog2.inputs.codecs.GelfCodec.validateGELFMessage(GelfCodec.java:252) ~[graylog.jar:?]
        at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:134) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:146) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:87) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:74) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:42) [graylog.jar:?]
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
2017-09-13T09:54:59.979-05:00 ERROR [DecodingProcessor] Unable to decode raw message RawMessage{id=83f07d9b-9893-11e7-a202-0050568a570f, journalOffset=3204499481, codec=gelf, payloadSize=155, timestamp=2017-09-13T14:54:59.928Z, remoteAddress=/10.2.12.82:50478} on input <59676e7a55928d1858fccaf3>.
2017-09-13T09:54:59.979-05:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=83f07d9b-9893-11e7-a202-0050568a570f, journalOffset=3204499481, codec=gelf, payloadSize=155, timestamp=2017-09-13T14:54:59.928Z, remoteAddress=/10.2.12.82:50478}
java.lang.IllegalArgumentException: GELF message <83f07d9b-9893-11e7-a202-0050568a570f> (received from <10.2.12.82:50478>) has empty mandatory "short_message" field.
        at org.graylog2.inputs.codecs.GelfCodec.validateGELFMessage(GelfCodec.java:252) ~[graylog.jar:?]
        at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:134) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:146) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:87) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:74) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:42) [graylog.jar:?]
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
2017-09-13T09:54:59.985-05:00 ERROR [DecodingProcessor] Unable to decode raw message RawMessage{id=83f0a499-9893-11e7-a202-0050568a570f, journalOffset=3204499507, codec=gelf, payloadSize=155, timestamp=2017-09-13T14:54:59.929Z, remoteAddress=/10.2.12.82:50478} on input <59676e7a55928d1858fccaf3>.
2017-09-13T09:54:59.985-05:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=83f0a499-9893-11e7-a202-0050568a570f, journalOffset=3204499507, codec=gelf, payloadSize=155, timestamp=2017-09-13T14:54:59.929Z, remoteAddress=/10.2.12.82:50478}
java.lang.IllegalArgumentException: GELF message <83f0a499-9893-11e7-a202-0050568a570f> (received from <10.2.12.82:50478>) has empty mandatory "short_message" field.
        at org.graylog2.inputs.codecs.GelfCodec.validateGELFMessage(GelfCodec.java:252) ~[graylog.jar:?]
        at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:134) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:146) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:87) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:74) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:42) [graylog.jar:?]
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
2017-09-13T09:54:59.985-05:00 ERROR [DecodingProcessor] Unable to decode raw message RawMessage{id=83f0cba0-9893-11e7-a202-0050568a570f, journalOffset=3204499511, codec=gelf, payloadSize=155, timestamp=2017-09-13T14:54:59.930Z, remoteAddress=/10.2.12.82:50478} on input <59676e7a55928d1858fccaf3>.
2017-09-13T09:54:59.985-05:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=83f0cba0-9893-11e7-a202-0050568a570f, journalOffset=3204499511, codec=gelf, payloadSize=155, timestamp=2017-09-13T14:54:59.930Z, remoteAddress=/10.2.12.82:50478}
java.lang.IllegalArgumentException: GELF message <83f0cba0-9893-11e7-a202-0050568a570f> (received from <10.2.12.82:50478>) has empty mandatory "short_message" field.
        at org.graylog2.inputs.codecs.GelfCodec.validateGELFMessage(GelfCodec.java:252) ~[graylog.jar:?]
        at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:134) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:146) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:87) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:74) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:42) [graylog.jar:?]
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
2017-09-13T09:54:59.986-05:00 ERROR [DecodingProcessor] Unable to decode raw message RawMessage{id=83f0f2b3-9893-11e7-a202-0050568a570f, journalOffset=3204499518, codec=gelf, payloadSize=155, timestamp=2017-09-13T14:54:59.931Z, remoteAddress=/10.2.12.82:50478} on input <59676e7a55928d1858fccaf3>.
2017-09-13T09:54:59.986-05:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=83f0f2b3-9893-11e7-a202-0050568a570f, journalOffset=3204499518, codec=gelf, payloadSize=155, timestamp=2017-09-13T14:54:59.931Z, remoteAddress=/10.2.12.82:50478}
java.lang.IllegalArgumentException: GELF message <83f0f2b3-9893-11e7-a202-0050568a570f> (received from <10.2.12.82:50478>) has empty mandatory "short_message" field.
        at org.graylog2.inputs.codecs.GelfCodec.validateGELFMessage(GelfCodec.java:252) ~[graylog.jar:?]
        at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:134) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:146) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:87) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:74) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:42) [graylog.jar:?]
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
2017-09-13T09:54:59.987-05:00 ERROR [DecodingProcessor] Unable to decode raw message RawMessage{id=83f0f2b7-9893-11e7-a202-0050568a570f, journalOffset=3204499522, codec=gelf, payloadSize=155, timestamp=2017-09-13T14:54:59.931Z, remoteAddress=/10.2.12.82:50478} on input <59676e7a55928d1858fccaf3>.
2017-09-13T09:54:59.987-05:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=83f0f2b7-9893-11e7-a202-0050568a570f, journalOffset=3204499522, codec=gelf, payloadSize=155, timestamp=2017-09-13T14:54:59.931Z, remoteAddress=/10.2.12.82:50478}
java.lang.IllegalArgumentException: GELF message <83f0f2b7-9893-11e7-a202-0050568a570f> (received from <10.2.12.82:50478>) has empty mandatory "short_message" field.
        at org.graylog2.inputs.codecs.GelfCodec.validateGELFMessage(GelfCodec.java:252) ~[graylog.jar:?]
        at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:134) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:146) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:87) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:74) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:42) [graylog.jar:?]
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
2017-09-13T09:54:59.988-05:00 ERROR [DecodingProcessor] Unable to decode raw message RawMessage{id=83f119c1-9893-11e7-a202-0050568a570f, journalOffset=3204499526, codec=gelf, payloadSize=155, timestamp=2017-09-13T14:54:59.932Z, remoteAddress=/10.2.12.82:50478} on input <59676e7a55928d1858fccaf3>.
2017-09-13T09:54:59.988-05:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=83f119c1-9893-11e7-a202-0050568a570f, journalOffset=3204499526, codec=gelf, payloadSize=155, timestamp=2017-09-13T14:54:59.932Z, remoteAddress=/10.2.12.82:50478}
java.lang.IllegalArgumentException: GELF message <83f119c1-9893-11e7-a202-0050568a570f> (received from <10.2.12.82:50478>) has empty mandatory "short_message" field.
        at org.graylog2.inputs.codecs.GelfCodec.validateGELFMessage(GelfCodec.java:252) ~[graylog.jar:?]
        at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:134) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:146) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:87) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:74) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:42) [graylog.jar:?]
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
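
The stack traces above all report the same thing: a GELF message arriving with an empty "short_message" field. A minimal Python sketch of a sender-side pre-check follows. It only mirrors the condition named in the error text and is not Graylog's own validation code; the function name `has_short_message` and the sample payloads (including the host `app01`) are made up for illustration.

```python
import json


def has_short_message(payload: str) -> bool:
    """Return True if the JSON payload carries a non-empty "short_message"."""
    try:
        doc = json.loads(payload)
    except json.JSONDecodeError:
        # Not valid JSON at all, so it cannot carry the field.
        return False
    if not isinstance(doc, dict):
        return False
    value = doc.get("short_message")
    # Treat missing, non-string, and whitespace-only values as empty.
    return isinstance(value, str) and value.strip() != ""


# Hypothetical sample payloads, not taken from the real sender.
good = '{"version": "1.1", "host": "app01", "short_message": "disk full"}'
bad = '{"version": "1.1", "host": "app01", "short_message": ""}'

print(has_short_message(good))
print(has_short_message(bad))
```

Running a check like this on the sending side (here, whatever is sending from 10.2.12.82) would show whether the client really emits empty messages, independent of what the Graylog node does with them.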

Please provide the specs of the VM running Graylog.

The Linux OOM killer forcefully killed the Java process because the system ran out of memory (both physical and swap).
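
For what it's worth, the numbers in the kernel report above can be turned into something more readable. This is a minimal Python sketch, assuming 4 KiB pages (the rss column for pid 2929, 647804 pages, times 4 matches the reported anon-rss of 2591216kB). All values are copied from the oom-killer output; nothing else is measured.

```python
import re

PAGE_KB = 4  # assumption: 4 KiB pages, consistent with rss vs anon-rss above


def kb_to_gib(kb: int) -> float:
    """Convert kibibytes to gibibytes."""
    return kb / (1024 * 1024)


# The "Killed process" line, copied from the journal output.
killed = (
    "Killed process 2929 (java) total-vm:9707924kB, "
    "anon-rss:2591216kB, file-rss:0kB, shmem-rss:0kB"
)
anon_rss_kb = int(re.search(r"anon-rss:(\d+)kB", killed).group(1))

ram_kb = 786301 * PAGE_KB      # "786301 pages RAM"
swapped_kb = 216846 * PAGE_KB  # swapents column for pid 2929 (java)

print(f"total RAM:     {kb_to_gib(ram_kb):.2f} GiB")
print(f"java resident: {kb_to_gib(anon_rss_kb):.2f} GiB")
print(f"java swapped:  {kb_to_gib(swapped_kb):.2f} GiB")
```

That works out to roughly 2.47 GiB resident plus about 0.83 GiB swapped for the java process, on a VM with about 3.00 GiB of RAM. Both figures are well above a 1GB heap, so whatever is growing may not be the Java heap alone. That reading is an inference from the arithmetic, not something the log states.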

From one of the other Graylog nodes that’s working fine…

# top -SHi

top - 10:34:02 up  9:03,  1 user,  load average: 7.12, 10.26, 11.11
Threads: 632 total,  11 running, 621 sleeping,   0 stopped,   0 zombie
%Cpu(s): 51.6 us,  6.0 sy,  0.0 ni, 41.0 id,  0.1 wa,  0.0 hi,  1.3 si,  0.0 st
KiB Mem :  2899240 total,   263076 free,  1797520 used,   838644 buff/cache
KiB Swap:  1047548 total,  1047548 free,        0 used.   742680 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 24880 graylog   20   0 6112904 1.489g 155568 S 15.0 53.8  12:23.38 java
 24928 graylog   20   0 6112904 1.489g 155568 S 11.3 53.8   6:48.49 java
 24930 graylog   20   0 6112904 1.489g 155568 S 10.3 53.8   6:47.89 java
 24937 graylog   20   0 6112904 1.489g 155568 S  9.6 53.8   6:50.00 java
 24922 graylog   20   0 6112904 1.489g 155568 S  9.3 53.8   6:49.69 java
 24921 graylog   20   0 6112904 1.489g 155568 S  9.0 53.8   6:48.61 java
 24936 graylog   20   0 6112904 1.489g 155568 S  9.0 53.8   6:48.18 java
 24918 graylog   20   0 6112904 1.489g 155568 S  8.6 53.8   6:50.42 java
 24920 graylog   20   0 6112904 1.489g 155568 S  8.6 53.8   6:49.65 java
 24925 graylog   20   0 6112904 1.489g 155568 S  8.6 53.8   6:49.43 java
 24927 graylog   20   0 6112904 1.489g 155568 S  8.6 53.8   6:49.93 java
 24929 graylog   20   0 6112904 1.489g 155568 S  8.6 53.8   6:49.48 java
 24931 graylog   20   0 6112904 1.489g 155568 S  8.6 53.8   6:48.39 java
 24933 graylog   20   0 6112904 1.489g 155568 S  8.6 53.8   6:50.73 java
 24919 graylog   20   0 6112904 1.489g 155568 S  8.3 53.8   6:49.48 java
 24926 graylog   20   0 6112904 1.489g 155568 S  8.3 53.8   6:48.06 java
 24932 graylog   20   0 6112904 1.489g 155568 S  8.3 53.8   6:49.09 java
 24923 graylog   20   0 6112904 1.489g 155568 S  8.0 53.8   6:49.13 java
 24924 graylog   20   0 6112904 1.489g 155568 S  8.0 53.8   6:48.13 java
 24934 graylog   20   0 6112904 1.489g 155568 S  8.0 53.8   6:48.87 java

CPU arch…

# lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             4
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Model name:            Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
Stepping:              2
CPU MHz:               2297.339
BogoMIPS:              4594.67
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              40960K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm epb tsc_adjust dtherm ida arat pln pts

How do I attach dump files here? They’re too big for the forum apparently.

You’ll have to upload these files somewhere else (Google Drive, Dropbox, etc.) and share a link here.

https://drive.google.com/file/d/0B5h6wOk0h5FbOXdZY3ZncFo0OWM/view?usp=sharing
https://drive.google.com/file/d/0B5h6wOk0h5FbaTFNWEFVd2NRSjA/view?usp=sharing

Any news/direction? I’m drowning without a third operable node and, at this point, pretty lost as to what could be wrong. Again, there have been no useful log entries even in debug mode, and recreating the third node nearly from scratch did not help either, as the issue seems to follow it.

I will attempt to provide any outputs needed in as timely a manner as possible to hopefully get to the bottom of the issue.

What’s the configuration of all Graylog and Elasticsearch nodes in the cluster?

Graylog Node 1

1 ~]# cat /etc/graylog/server/server.conf | egrep -v "^\s*(#|$)"
is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = [snip]
root_password_sha2 = [snip]
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://10.2.81.244:9000/api/
web_listen_uri = http://10.2.81.244:9000/
web_endpoint_uri = http://10.2.81.244:9000/api/
elasticsearch_hosts = http://elastic1.local:9200,http://elastic2.local:9200
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 20
outputbuffer_processors = 15
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://graylog:[snip]@graylog1.local:27017,graylog2.local:27017,graylog3.local:27017/graylog?replicaSet=rs01
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
transport_email_enabled = true
transport_email_hostname = apprelay.local
transport_email_port = 25
transport_email_use_tls = false
transport_email_use_ssl = false
transport_email_from_email = graylog@graylog.local
transport_email_web_interface_url = https://graylog.local
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32

Graylog Node 2

2~]# cat /etc/graylog/server/server.conf | egrep -v "^\s*(#|$)"
is_master = false
node_id_file = /etc/graylog/server/node-id
password_secret = [snip]
root_password_sha2 = [snip]
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://10.2.81.245:9000/api/
web_listen_uri = http://10.2.81.245:9000/
web_endpoint_uri = http://10.2.81.245:9000/api/
elasticsearch_hosts = http://elastic1.local:9200,http://elastic2.local:9200
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 20
outputbuffer_processors = 15
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://graylog:[snip]@graylog1.local:27017,graylog2.local:27017,graylog3.local:27017/graylog?replicaSet=rs01
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
transport_email_enabled = true
transport_email_hostname = apprelay.local
transport_email_port = 25
transport_email_use_tls = false
transport_email_use_ssl = false
transport_email_from_email = graylog@graylog.local
transport_email_web_interface_url = https://graylog.local
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32

Graylog Node 3

3 ~]# cat /etc/graylog/server/server.conf | egrep -v "^\s*(#|$)"
is_master = false
node_id_file = /etc/graylog/server/node-id
password_secret = [snip]
root_password_sha2 = [snip]
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://10.2.81.246:9000/api/
web_listen_uri = http://10.2.81.246:9000/
web_endpoint_uri = http://10.2.81.246:9000/api/
elasticsearch_hosts = http://elastic1.local:9200,http://elastic2.local:9200
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 20
outputbuffer_processors = 15
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://graylog:[snip]@graylog1.local:27017,graylog2.local:27017,graylog3.local:27017/graylog?replicaSet=rs01
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
transport_email_enabled = true
transport_email_hostname = apprelay.local
transport_email_port = 25
transport_email_use_tls = false
transport_email_use_ssl = false
transport_email_from_email = graylog@graylog.local
transport_email_web_interface_url = https://graylog.local
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32

Elasticsearch Node 1

cluster.name: graylog
node.name: Sinister
network.host: [_eth0_, _local_]
discovery.zen.ping.unicast.hosts: ["10.2.81.252"]

Elasticsearch Node 2

cluster.name: graylog
node.name: Blackout
network.host: [_eth0_, _local_]
discovery.zen.ping.unicast.hosts: ["10.2.81.251"]

I should probably also mention that even without sending messages to node 3 for processing, it still dies after a few minutes.

You could probably reduce the settings for processbuffer_processors and outputbuffer_processors on each Graylog node, but apart from that the configuration files look normal.
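To put numbers on that suggestion: the Graylog nodes in this thread have 4 vCPUs each, while the posted server.conf asks for 20 + 15 + 2 buffer processor threads. Here is a minimal sketch of that comparison. The `parse_conf` helper is hypothetical, and treating "total processors should not exceed the core count" as the threshold is an assumption (a common rule of thumb, not something established in this thread).

```python
# Compare the configured Graylog buffer processors with the vCPU count.
# CONF mirrors the three relevant lines from the posted server.conf;
# VCPUS comes from the node sizing described at the top of the thread.

CONF = """
processbuffer_processors = 20
outputbuffer_processors = 15
inputbuffer_processors = 2
"""
VCPUS = 4


def parse_conf(text):
    """Parse simple 'key = value' lines, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


settings = parse_conf(CONF)
processor_keys = (
    "processbuffer_processors",
    "outputbuffer_processors",
    "inputbuffer_processors",
)
total_threads = sum(int(settings[key]) for key in processor_keys)
oversubscribed = total_threads > VCPUS  # assumed rule of thumb, see above

print(f"{total_threads} processor threads configured for {VCPUS} vCPUs")
if oversubscribed:
    print("Consider lowering processbuffer_processors and outputbuffer_processors.")
```

This only shows that the thread counts are far above the core count. It does not by itself explain the memory growth on node 3.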

Are the contents of the /etc/graylog/server/node-id file unique on each Graylog node?
Are you using a load-balancer for the Graylog inputs? If yes, what’s its configuration and the configurations of the inputs?

Also try removing the files from the journal directory of Graylog node 3, if you haven’t done so.

The node-id files are indeed unique on each node. The load balancer is Keepalived/LVS for UDP traffic, using one-packet scheduling (OPS), and NGINX for the HTTPS sessions to the GUI.

The journal has been wiped several times (including the .lock file). I did manage to dig up an error in a log now, but I don’t know how much help it will be. I typically see these in chunks on startup of the Graylog engine in debug mode. I will post the Keepalived/LVS configuration shortly.

2017-09-19T11:02:18.428-05:00 DEBUG [Version] Git commit details are not available, skipping.
java.lang.NullPointerException: null
	at sun.misc.MetaIndex.mayContain(MetaIndex.java:242) ~[?:1.8.0_144]
	at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1032) ~[?:1.8.0_144]
	at sun.misc.URLClassPath.getResource(URLClassPath.java:239) ~[?:1.8.0_144]
	at sun.misc.URLClassPath.getResource(URLClassPath.java:292) ~[?:1.8.0_144]
	at java.lang.ClassLoader.getBootstrapResource(ClassLoader.java:1264) ~[?:1.8.0_144]
	at java.lang.ClassLoader.getResource(ClassLoader.java:1093) ~[?:1.8.0_144]
	at java.lang.ClassLoader.getResource(ClassLoader.java:1091) ~[?:1.8.0_144]
	at java.lang.ClassLoader.getResource(ClassLoader.java:1091) ~[?:1.8.0_144]
	at org.graylog2.plugin.Version.getResource(Version.java:247) ~[graylog.jar:?]
	at org.graylog2.plugin.Version.fromClasspathProperties(Version.java:228) [graylog.jar:?]
	at org.graylog2.plugin.Version.fromPluginProperties(Version.java:163) [graylog.jar:?]
	at org.graylog.plugins.usagestatistics.UsageStatsMetaData.<clinit>(UsageStatsMetaData.java:29) [graylog-plugin-anonymous-usage-statistics-2.3.1.jar:?]
	at org.graylog.plugins.usagestatistics.UsageStatsPlugin.metadata(UsageStatsPlugin.java:31) [graylog-plugin-anonymous-usage-statistics-2.3.1.jar:?]
	at org.graylog2.shared.plugins.PluginLoader$PluginAdapter.metadata(PluginLoader.java:159) [graylog.jar:?]
	at org.graylog2.shared.plugins.PluginLoader$PluginComparator.compare(PluginLoader.java:139) [graylog.jar:?]
	at org.graylog2.shared.plugins.PluginLoader$PluginComparator.compare(PluginLoader.java:132) [graylog.jar:?]
	at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) [?:1.8.0_144]
	at java.util.TimSort.sort(TimSort.java:220) [?:1.8.0_144]
	at java.util.Arrays.sort(Arrays.java:1512) [?:1.8.0_144]
	at com.google.common.collect.ImmutableSortedSet.construct(ImmutableSortedSet.java:392) [graylog.jar:?]
	at com.google.common.collect.ImmutableSortedSet$Builder.build(ImmutableSortedSet.java:542) [graylog.jar:?]
	at org.graylog2.shared.plugins.PluginLoader.loadPlugins(PluginLoader.java:64) [graylog.jar:?]
	at org.graylog2.bootstrap.CmdLineTool.loadPlugins(CmdLineTool.java:294) [graylog.jar:?]
	at org.graylog2.bootstrap.CmdLineTool.installPluginConfigAndBindings(CmdLineTool.java:269) [graylog.jar:?]
	at org.graylog2.bootstrap.CmdLineTool.run(CmdLineTool.java:171) [graylog.jar:?]
	at org.graylog2.bootstrap.Main.main(Main.java:44) [graylog.jar:?]

LB1 (on elastic1 right now)

1 ~]# cat /etc/keepalived/keepalived.conf | egrep -v "^\s*(#|$)"
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 1
    priority 1
    advert_int 1
    nopreempt
    virtual_ipaddress {
        10.2.81.242
    }
   unicast_src_ip 10.2.81.251
   unicast_peer {
     10.2.81.252
   }
}
virtual_server fwmark 1 {
        lb_algo rr
        lb_kind NAT
        protocol UDP
        ops
        delay_loop 1
        real_server 10.2.81.244 * {
                HTTP_GET {
                        url {
                                path  /api/system/lbstatus
                                status_code 200
                        }
                        connect_port 9000
                        connect_timeout 1
                        nb_get_retry 1
                        delay_before_retry 1
                }
        }
        real_server 10.2.81.245 * {
                HTTP_GET {
                        url {
                                path  /api/system/lbstatus
                                status_code 200
                        }
                        connect_port 9000
                        connect_timeout 1
                        nb_get_retry 1
                        delay_before_retry 1
                }
        }
        real_server 10.2.81.246 * {
                HTTP_GET {
                        url {
                                path  /api/system/lbstatus
                                status_code 200
                        }
                        connect_port 9000
                        connect_timeout 1
                        nb_get_retry 1
                        delay_before_retry 1
                }
        }
}

LB2 (on elastic2 right now)

2 ~]# cat /etc/keepalived/keepalived.conf | egrep -v "^\s*(#|$)"
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 1
    priority 1
    advert_int 1
    nopreempt
    virtual_ipaddress {
        10.2.81.242
    }
   unicast_src_ip 10.2.81.252
   unicast_peer {
     10.2.81.251
   }
}
virtual_server fwmark 1 {
        lb_algo rr
        lb_kind NAT
        protocol UDP
        ops
        delay_loop 1
        real_server 10.2.81.244 * {
                HTTP_GET {
                        url {
                                path  /api/system/lbstatus
                                status_code 200
                        }
                        connect_port 9000
                        connect_timeout 1
                        nb_get_retry 1
                        delay_before_retry 1
                }
        }
        real_server 10.2.81.245 * {
                HTTP_GET {
                        url {
                                path  /api/system/lbstatus
                                status_code 200
                        }
                        connect_port 9000
                        connect_timeout 1
                        nb_get_retry 1
                        delay_before_retry 1
                }
        }
        real_server 10.2.81.246 * {
                HTTP_GET {
                        url {
                                path  /api/system/lbstatus
                                status_code 200
                        }
                        connect_port 9000
                        connect_timeout 1
                        nb_get_retry 1
                        delay_before_retry 1
                }
        }
}

Here is the NGINX configuration on the LB nodes…

1 ~]# cat /etc/nginx/conf.d/graylog.conf | egrep -v "^\s*(#|$)"
upstream graylog {
 ip_hash;
 server graylog1.local:9000 max_fails=1 fail_timeout=5s;
 server graylog2.local:9000 max_fails=1 fail_timeout=5s;
 server graylog3.local:9000 max_fails=1 fail_timeout=5s;
 }
server {
listen 443 ssl http2;
server_name graylog.local;
ssl_certificate /etc/nginx/ca/graylog.local.chain;
ssl_certificate_key /etc/nginx/ca/graylog.local.key;
ssl_session_timeout 10m;
ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
ssl_ciphers 'EECDH+AESGCM:EDH+AESGCM:AES+EECDH:AES+EDH';
ssl_prefer_server_ciphers on;
ssl_session_cache shared:SSL:10m;
keepalive_timeout 60;
ssl_stapling on;
ssl_stapling_verify on;
ssl_trusted_certificate /etc/nginx/ca/graylog.local.chain;
resolver 8.8.8.8 8.8.4.4 valid=86400s;
resolver_timeout 5s;
ssl_ecdh_curve secp384r1;
ssl_dhparam /etc/ssl/certs/dhparam.pem;
location / {
 proxy_pass http://graylog;
 proxy_http_version 1.1;
 proxy_set_header Host $host;
 proxy_set_header X-Real-IP $remote_addr;
 proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
 proxy_set_header X-Graylog-Server-URL https://graylog.local/api;
 proxy_pass_request_headers on;
 proxy_connect_timeout 5;
 proxy_send_timeout 60;
 proxy_read_timeout 60;
 proxy_buffers 4 32k;
 client_max_body_size 8m;
 client_body_buffer_size 128k;
 }
}

GELF UDP is inherently unfriendly for load-balancers due to its message chunking.
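To illustrate why chunking and per-packet balancing do not mix: a chunked GELF datagram starts with the magic bytes 0x1e 0x0f, followed by an 8-byte message ID, a sequence number byte, and a sequence count byte, and the receiver has to collect every chunk of a message before it can reassemble it. The sketch below splits one payload into chunks and hands them out round-robin to the three Graylog nodes, roughly what `lb_algo rr` with `ops` would do per packet. The `chunk_gelf` helper, the 1000-byte chunk size, and the dummy payload are assumptions for illustration only (real GELF payloads are usually compressed JSON).

```python
import os

GELF_MAGIC = b"\x1e\x0f"   # marks a chunked GELF datagram
MAX_CHUNKS = 128           # GELF allows at most 128 chunks per message


def chunk_gelf(payload, chunk_size, message_id=None):
    """Split payload into GELF chunks: magic + 8-byte id + seq + count + data."""
    if message_id is None:
        message_id = os.urandom(8)
    pieces = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]
    if len(pieces) > MAX_CHUNKS:
        raise ValueError("payload needs more than 128 chunks")
    return [
        GELF_MAGIC + message_id + bytes([seq, len(pieces)]) + piece
        for seq, piece in enumerate(pieces)
    ]


# Graylog node addresses taken from the configuration posted in this thread.
nodes = ["10.2.81.244", "10.2.81.245", "10.2.81.246"]

chunks = chunk_gelf(b"x" * 3000, chunk_size=1000, message_id=b"\x00" * 8)

# Per-packet round-robin: every datagram goes to the next node in turn.
received = {node: [] for node in nodes}
for index, chunk in enumerate(chunks):
    received[nodes[index % len(nodes)]].append(chunk)

# A node can only reassemble the message if it holds all of its chunks.
complete = [node for node in nodes if len(received[node]) == len(chunks)]

for node in nodes:
    print(node, "received", len(received[node]), "of", len(chunks), "chunks")
print("nodes able to reassemble the message:", complete)
```

With three chunks and three nodes, each node ends up holding one chunk, so none of them can reassemble the message.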

Try sending the GELF UDP messages directly to Graylog or configure some sort of source pinning in the load-balancer, so that all UDP packets from the same source are sent to the same Graylog instance.
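Source pinning can be sketched as a deterministic mapping from the sender's address to a backend, so that every packet (and therefore every chunk) from one source lands on the same Graylog node. The `pick_node` function below is a hypothetical illustration of the idea, not how LVS implements it. LVS does ship a source-hashing scheduler (`sh`), but whether it fits this particular OPS setup has not been tested here.

```python
import ipaddress

# Graylog node addresses taken from the configuration posted in this thread.
NODES = ["10.2.81.244", "10.2.81.245", "10.2.81.246"]


def pick_node(source_ip, nodes=NODES):
    """Map a source address to one backend; same source, same node."""
    return nodes[int(ipaddress.ip_address(source_ip)) % len(nodes)]


# Example senders (documentation addresses, purely illustrative).
senders = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
for sender in senders:
    # Every packet from this sender is routed identically.
    targets = {pick_node(sender) for _ in range(5)}
    print(sender, "->", sorted(targets))
```

The trade-off is that load is only as even as the distribution of senders: one very chatty source will always hit the same node.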

Alternatively try switching to GELF TCP.

I have switched over to GELF TCP. Strangely, I’m still getting the same GELF error messages on the two nodes, even with no GELF UDP inputs left. Only one of them is even receiving the GELF TCP traffic, yet both continue to log errors.

Node 3 continues to crash even absent traffic being sent to it.

Thoughts?

Bad news - whatever has been happening to node 3 is now also happening to node 1 (the master node). I’m desperate for ideas at this point, aside from rolling Java back to a previous version as a last-ditch effort.