Hi,
Regarding the ingestion: the system generally handles ~50 GB per day.
As suggested, I took a peek at the processes running on one of the nodes this morning. I ran
top -b -d 60 | grep -A 10 -F PID
over the hottest hour, and the outputs all looked like this:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26247 graylog 20 0 8268596 4,4g 84420 S 275,4 57,1 3214:56 java
512 mongodb 20 0 2203624 323536 41216 S 4,3 4,0 8546:17 mongod
4246 root 20 0 0 0 0 I 0,7 0,0 0:02.78 kworker/2:2-events
4173 node-exp 20 0 719380 22296 12088 S 0,1 0,3 296:52.89 node_exporter
4462 root 20 0 0 0 0 I 0,1 0,0 0:00.09 kworker/3:2-events_freezable_power_
28744 haproxy 20 0 15700 2740 1152 S 0,1 0,0 43:33.34 haproxy
10 root 20 0 0 0 0 I 0,0 0,0 67:27.42 rcu_sched
248 root 20 0 0 0 0 S 0,0 0,0 69:31.40 jbd2/dm-0-8
515 Debian-+ 20 0 42900 14500 9600 S 0,0 0,2 72:53.43 snmpd
1 root 20 0 170776 10636 7904 S 0,0 0,1 70:21.02 systemd
which kind of confirms that Graylog is the biggest CPU consumer. I also reviewed buffer and journal utilization using the built-in Prometheus exporter, and I can confirm that the buffers are always empty and the journal never fills beyond 2%.
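If it helps, I can also grab a per-thread view of the java process and a raw dump of those metrics; something along these lines (the PID is the one from the top output above, and I am assuming the exporter's default 127.0.0.1:9833 bind address and the usual /metrics path):
# per-thread CPU of the Graylog JVM, single batch snapshot
top -H -b -n 1 -p 26247 | head -n 40
# buffer/journal metrics from the built-in Prometheus exporter
curl -s http://127.0.0.1:9833/metrics | grep -iE 'journal|buffer'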
In the afternoon, I tried a full rolling shutdown and cold restart of all VMs, but this did not change any of the performance metrics.
Finally, I tried restarting one of the graylog-server instances without -Dlog4j2.formatMsgNoLookups=true
and left it running for some time, but the performance of that node exactly matched that of the nodes still running with the option enabled.
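In case the exact check matters: to confirm whether the flag is actually active on a running node, one can look at the JVM's command line, for example (using the PID from the top output above):
tr '\0' '\n' < /proc/26247/cmdline | grep formatMsgNoLookups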
I took the time to make sure the JSON parsing warning from one of my pipelines was resolved, and since then I see nothing interesting in the logs except for index rotation. Upon restarting, I mostly get informational messages, apart from these warnings:
2022-01-14T16:17:16.460+01:00 WARN [UdpTransport] receiveBufferSize (SO_RCVBUF) for input SyslogUDPInput{title=Syslog UDP, type=org.graylog2.inputs.syslog.udp.SyslogUDPInput, nodeId=null} (channel [id: 0xdd42cb3b, L:/0:0:0:0:0:0:0:0%0:1514]) should be 4194304 but is 8388608.
2022-01-14T16:17:16.460+01:00 WARN [UdpTransport] receiveBufferSize (SO_RCVBUF) for input SyslogUDPInput{title=Syslog UDP, type=org.graylog2.inputs.syslog.udp.SyslogUDPInput, nodeId=null} (channel [id: 0xb8d7f8de, L:/0:0:0:0:0:0:0:0%0:1514]) should be 4194304 but is 8388608.
2022-01-14T16:17:16.462+01:00 WARN [AbstractTcpTransport] receiveBufferSize (SO_RCVBUF) for input SyslogTCPInput{title=Syslog TCP, type=org.graylog2.inputs.syslog.tcp.SyslogTCPInput, nodeId=null} (channel [id: 0x92f75b38, L:/0:0:0:0:0:0:0:0%0:1514]) should be 4194304 but is 8388608.
2022-01-14T16:17:16.470+01:00 WARN [UdpTransport] receiveBufferSize (SO_RCVBUF) for input SyslogUDPInput{title=Syslog UDP, type=org.graylog2.inputs.syslog.udp.SyslogUDPInput, nodeId=null} (channel [id: 0x5a74b7d2, L:/0:0:0:0:0:0:0:0%0:1514]) should be 4194304 but is 8388608.
2022-01-14T16:17:16.476+01:00 WARN [UdpTransport] receiveBufferSize (SO_RCVBUF) for input SyslogUDPInput{title=Syslog UDP, type=org.graylog2.inputs.syslog.udp.SyslogUDPInput, nodeId=null} (channel [id: 0xf49687f1, L:/0:0:0:0:0:0:0:0%0:1514]) should be 4194304 but is 8388608.
which do not worry me (I set 4194304 as the receive buffer size on the inputs, since the default one was low and caused me headaches by dropping messages, and I do not think that having them larger is a problem!), and the following JVM output in systemd's journal:
gen 14 16:17:02 graylog1 systemd[1]: Started Graylog server.
gen 14 16:17:03 graylog1 graylog-server[501]: OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
gen 14 16:17:04 graylog1 graylog-server[501]: WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
gen 14 16:17:11 graylog1 graylog-server[501]: WARNING: An illegal reflective access operation has occurred
gen 14 16:17:11 graylog1 graylog-server[501]: WARNING: Illegal reflective access by retrofit2.Platform (file:/usr/share/graylog-server/graylog.jar) to constructor java.lang.invoke.MethodHandles$Lookup(java.lang.Class,int)
gen 14 16:17:11 graylog1 graylog-server[501]: WARNING: Please consider reporting this to the maintainers of retrofit2.Platform
gen 14 16:17:11 graylog1 graylog-server[501]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
gen 14 16:17:11 graylog1 graylog-server[501]: WARNING: All illegal access operations will be denied in a future release
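Related to the UseConcMarkSweepGC warning above: if it is useful, I can also dump the flags the running JVM actually picked up, e.g. (PID from the top output above, run as the graylog user, assuming the JDK tools are installed on the node):
sudo -u graylog jcmd 26247 VM.flags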
Regarding plugins, I did not install any. I guess, then, that I should not be worried about the version mismatch you suggested checking for?
I compared my configuration with yours and I do not see many differences. You have some performance parameters (from the little I understand) that have been raised to better suit the fact that you have more cores, and you do bigger output batches to Elasticsearch, but I do not see any other relevant difference.
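For reference, the knobs I am talking about are, I believe, the usual processing/output ones in server.conf; the values below are just the stock defaults as far as I know, not necessarily what either of us actually runs:
# server.conf excerpt (stock defaults, only indicative)
processbuffer_processors = 5
outputbuffer_processors = 3
inputbuffer_processors = 2
output_batch_size = 500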
I admit I have very few further steps in mind except maybe:
- Upgrading all pending OS packages (I am behind a few bugfix releases for both OpenJDK and MongoDB); a quick way to see what is pending is sketched after this list. I have low hopes for this, but at least it should not make things worse, and I will need to apply the bugfix updates someday anyway.
- Trying to upgrade to Debian bullseye, Graylog 4.2 and OpenJDK 17. This will take me quite some time, however, since it is a big leap forward.
- Trying to go back to OpenJDK 8. I never had a problem with OpenJDK 11 on this installation, and the docs seem to suggest it has been compatible since Graylog 3.x, but they also still list OpenJDK 8 as the official requirement, so maybe my luck with OpenJDK 11 has run out!
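(The quick check I mean for the pending packages, nothing fancy, just to see the exact versions involved:)
java -version
apt list --upgradable 2>/dev/null | grep -Ei 'openjdk|mongodb'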
One other thing I am wondering is whether I should take a closer look at the JVM performance parameters. I would like, for example, to understand whether the heap allocated to Graylog is doing fine, or whether the load may come from stress somewhere like the GC. This, however, is also rather outside my expertise; I am only parroting the most commonly heard horror stories without knowing whether they even apply to Graylog!
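If it would help, I can collect some heap/GC numbers with the stock JDK tools; something along these lines (PID from the top output above, run as the graylog user, assuming jstat is available on the node):
# heap occupancy and GC counts/times, sampled every 10 seconds
sudo -u graylog jstat -gcutil 26247 10s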