Graylog Server Fails on Startup after upgrade to 4.3

Before you post: Your responses to these questions will help the community help you. Please complete this template if you’re asking a support question.
Don’t forget to select tags to help index your topic!

1. Describe your incident:
Upgraded to 4.3 this morning, now graylog-server.service fails on startup

2. Describe your environment:
The Graylog Server is run inside of an LXC container

  • OS Information:
    Distributor ID: Ubuntu
    Description: Ubuntu 20.04.4 LTS
    Release: 20.04
    Codename: focal
  • Package Version:
    graylog-enterprise-integrations-plugins(4.3.1-1).
    graylog-enterprise-plugins (4.3.1-1).
    graylog-integrations-plugins (4.3.1-1).
    graylog-server (4.3.1-1).
  • Service logs, configurations, and environment variables:
    Systemctl status:
ubuntu@graylog:~$ sudo systemctl status graylog-server.service
● graylog-server.service - Graylog server
     Loaded: loaded (/lib/systemd/system/graylog-server.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Sat 2022-06-04 10:25:35 EDT; 321ms ago
       Docs: http://docs.graylog.org/
    Process: 1758 ExecStart=/usr/share/graylog-server/bin/graylog-server (code=exited, status=1/FAILURE)
   Main PID: 1758 (code=exited, status=1/FAILURE)
        CPU: 7.737s

Jun 04 10:25:35 graylog systemd[1]: graylog-server.service: Failed with result 'exit-code'.
Jun 04 10:25:35 graylog systemd[1]: graylog-server.service: Consumed 7.737s CPU time.

Journalctl -xe

Jun 04 10:26:22 graylog systemd[1]: graylog-server.service: Scheduled restart job, restart counter is at 12.
-- Subject: Automatic restarting of a unit has been scheduled
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Automatic restarting of the unit graylog-server.service has been scheduled, as the result for
-- the configured Restart= setting for the unit.
Jun 04 10:26:22 graylog systemd[1]: Stopped Graylog server.
-- Subject: A stop job for unit graylog-server.service has finished
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A stop job for unit graylog-server.service has finished.
--
-- The job identifier is 1219 and the job result is done.
Jun 04 10:26:22 graylog systemd[1]: graylog-server.service: Consumed 8.086s CPU time.
-- Subject: Resources consumed by unit runtime
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- The unit graylog-server.service completed and consumed the indicated resources.
Jun 04 10:26:22 graylog systemd[1]: Started Graylog server.
-- Subject: A start job for unit graylog-server.service has finished successfully
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit graylog-server.service has finished successfully.
--
-- The job identifier is 1219.
Jun 04 10:26:25 graylog graylog-server[2314]: Exception in thread "main" java.lang.ArithmeticException: / by zero
Jun 04 10:26:25 graylog graylog-server[2314]:         at org.graylog2.shared.system.stats.fs.OshiFsProbe.lambda$fsStats$11(OshiFsProbe.java:116)
Jun 04 10:26:25 graylog graylog-server[2314]:         at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1321)
Jun 04 10:26:25 graylog graylog-server[2314]:         at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
Jun 04 10:26:25 graylog graylog-server[2314]:         at java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:440)
Jun 04 10:26:25 graylog graylog-server[2314]:         at java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1723)
Jun 04 10:26:25 graylog graylog-server[2314]:         at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
Jun 04 10:26:25 graylog graylog-server[2314]:         at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
Jun 04 10:26:25 graylog graylog-server[2314]:         at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
Jun 04 10:26:25 graylog graylog-server[2314]:         at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
Jun 04 10:26:25 graylog graylog-server[2314]:         at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
Jun 04 10:26:25 graylog graylog-server[2314]:         at org.graylog2.shared.system.stats.fs.OshiFsProbe.fsStats(OshiFsProbe.java:106)
Jun 04 10:26:25 graylog graylog-server[2314]:         at org.graylog2.bootstrap.preflight.DiskJournalPreflightCheck.checkJournalDirSizeAndType(DiskJournalPreflightCheck.java:67)
Jun 04 10:26:25 graylog graylog-server[2314]:         at org.graylog2.bootstrap.preflight.DiskJournalPreflightCheck.runCheck(DiskJournalPreflightCheck.java:62)
Jun 04 10:26:25 graylog graylog-server[2314]:         at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:422)
Jun 04 10:26:25 graylog graylog-server[2314]:         at org.graylog2.bootstrap.preflight.PreflightCheckService.runChecks(PreflightCheckService.java:51)
Jun 04 10:26:25 graylog graylog-server[2314]:         at org.graylog2.bootstrap.ServerBootstrap.runPreFlightChecks(ServerBootstrap.java:135)
Jun 04 10:26:25 graylog graylog-server[2314]:         at org.graylog2.bootstrap.ServerBootstrap.beforeInjectorCreation(ServerBootstrap.java:126)
Jun 04 10:26:25 graylog graylog-server[2314]:         at org.graylog2.bootstrap.CmdLineTool.run(CmdLineTool.java:292)
Jun 04 10:26:25 graylog graylog-server[2314]:         at org.graylog2.bootstrap.Main.main(Main.java:45)
Jun 04 10:26:25 graylog systemd[1]: graylog-server.service: Main process exited, code=exited, status=1/FAILURE
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- An ExecStart= process belonging to unit graylog-server.service has exited.
--
-- The process' exit code is 'exited' and its exit status is 1.
Jun 04 10:26:25 graylog systemd[1]: graylog-server.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- The unit graylog-server.service has entered the 'failed' state with result 'exit-code'.
Jun 04 10:26:25 graylog systemd[1]: graylog-server.service: Consumed 7.573s CPU time.
-- Subject: Resources consumed by unit runtime
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- The unit graylog-server.service completed and consumed the indicated resources.
Jun 04 10:26:30 graylog sudo[2366]:   ubuntu : TTY=pts/1 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/bin/journalctl -xe
Jun 04 10:26:30 graylog sudo[2366]: pam_unix(sudo:session): session opened for user root by (uid=0)

/var/log/graylog-server/server.log:

2022-06-04T10:27:13.775-04:00 INFO  [ImmutableFeatureFlagsCollector] Following feature flags are used: {}
2022-06-04T10:27:14.401-04:00 INFO  [CmdLineTool] Loaded plugin: AWS plugins 4.3.1 [org.graylog.aws.AWSPlugin]
2022-06-04T10:27:14.402-04:00 INFO  [CmdLineTool] Loaded plugin: Enterprise Integrations 4.3.1 [org.graylog.enterprise.integrations.EnterpriseIntegrationsPlugin]
2022-06-04T10:27:14.403-04:00 INFO  [CmdLineTool] Loaded plugin: Integrations 4.3.1 [org.graylog.integrations.IntegrationsPlugin]
2022-06-04T10:27:14.404-04:00 INFO  [CmdLineTool] Loaded plugin: Collector 4.3.1 [org.graylog.plugins.collector.CollectorPlugin]
2022-06-04T10:27:14.404-04:00 INFO  [CmdLineTool] Loaded plugin: Graylog Enterprise 4.3.1 [org.graylog.plugins.enterprise.EnterprisePlugin]
2022-06-04T10:27:14.405-04:00 INFO  [CmdLineTool] Loaded plugin: Graylog Enterprise (ES6 Support) 4.3.1 [org.graylog.plugins.enterprise.org.graylog.plugins.enterprise.es6.Enterpri>2022-06-04T10:27:14.406-04:00 INFO  [CmdLineTool] Loaded plugin: Graylog Enterprise (ES7 Support) 4.3.1 [org.graylog.plugins.enterprise.org.graylog.plugins.enterprise.es7.Enterpri>2022-06-04T10:27:14.407-04:00 INFO  [CmdLineTool] Loaded plugin: Threat Intelligence Plugin 4.3.1 [org.graylog.plugins.threatintel.ThreatIntelPlugin]
2022-06-04T10:27:14.407-04:00 INFO  [CmdLineTool] Loaded plugin: Elasticsearch 6 Support 4.3.1+78ee2f1 [org.graylog.storage.elasticsearch6.Elasticsearch6Plugin]
2022-06-04T10:27:14.407-04:00 INFO  [CmdLineTool] Loaded plugin: Elasticsearch 7 Support 4.3.1+78ee2f1 [org.graylog.storage.elasticsearch7.Elasticsearch7Plugin]
2022-06-04T10:27:14.440-04:00 INFO  [CmdLineTool] Running with JVM arguments: -Xms1g -Xmx1g -XX:NewRatio=1 -XX:+ResizeTLAB -XX:-OmitStackTraceInFastThrow -Djdk.tls.acknowledgeClos>2022-06-04T10:27:14.812-04:00 INFO  [cluster] Cluster created with settings {hosts=[localhost:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', >2022-06-04T10:27:14.843-04:00 INFO  [cluster] Cluster description not yet available. Waiting for 30000 ms before timing out
2022-06-04T10:27:14.876-04:00 INFO  [connection] Opened connection [connectionId{localValue:1, serverValue:33}] to localhost:27017
2022-06-04T10:27:14.884-04:00 INFO  [cluster] Monitor thread successfully connected to server with description ServerDescription{address=localhost:27017, type=STANDALONE, state=CO>2022-06-04T10:27:14.895-04:00 INFO  [connection] Opened connection [connectionId{localValue:2, serverValue:34}] to localhost:27017
2022-06-04T10:27:14.918-04:00 INFO  [connection] Closed connection [connectionId{localValue:2, serverValue:34}] to localhost:27017 because the pool has been closed.
2022-06-04T10:27:14.919-04:00 INFO  [MongoDBPreflightCheck] Connected to MongoDB version 4.4.14
2022-06-04T10:27:15.020-04:00 INFO  [SearchDbPreflightCheck] Connected to (Elastic/Open)Search version <Elasticsearch:6.8.23>
2022-06-04T10:27:26.303-04:00 INFO  [ImmutableFeatureFlagsCollector] Following feature flags are used: {}
2022-06-04T10:27:26.949-04:00 INFO  [CmdLineTool] Loaded plugin: AWS plugins 4.3.1 [org.graylog.aws.AWSPlugin]
2022-06-04T10:27:26.950-04:00 INFO  [CmdLineTool] Loaded plugin: Enterprise Integrations 4.3.1 [org.graylog.enterprise.integrations.EnterpriseIntegrationsPlugin]
2022-06-04T10:27:26.951-04:00 INFO  [CmdLineTool] Loaded plugin: Integrations 4.3.1 [org.graylog.integrations.IntegrationsPlugin]
2022-06-04T10:27:26.952-04:00 INFO  [CmdLineTool] Loaded plugin: Collector 4.3.1 [org.graylog.plugins.collector.CollectorPlugin]
2022-06-04T10:27:26.953-04:00 INFO  [CmdLineTool] Loaded plugin: Graylog Enterprise 4.3.1 [org.graylog.plugins.enterprise.EnterprisePlugin]
2022-06-04T10:27:26.953-04:00 INFO  [CmdLineTool] Loaded plugin: Graylog Enterprise (ES6 Support) 4.3.1 [org.graylog.plugins.enterprise.org.graylog.plugins.enterprise.es6.Enterpri>2022-06-04T10:27:26.954-04:00 INFO  [CmdLineTool] Loaded plugin: Graylog Enterprise (ES7 Support) 4.3.1 [org.graylog.plugins.enterprise.org.graylog.plugins.enterprise.es7.Enterpri>2022-06-04T10:27:26.955-04:00 INFO  [CmdLineTool] Loaded plugin: Threat Intelligence Plugin 4.3.1 [org.graylog.plugins.threatintel.ThreatIntelPlugin]
2022-06-04T10:27:26.956-04:00 INFO  [CmdLineTool] Loaded plugin: Elasticsearch 6 Support 4.3.1+78ee2f1 [org.graylog.storage.elasticsearch6.Elasticsearch6Plugin]
2022-06-04T10:27:26.956-04:00 INFO  [CmdLineTool] Loaded plugin: Elasticsearch 7 Support 4.3.1+78ee2f1 [org.graylog.storage.elasticsearch7.Elasticsearch7Plugin]
2022-06-04T10:27:26.989-04:00 INFO  [CmdLineTool] Running with JVM arguments: -Xms1g -Xmx1g -XX:NewRatio=1 -XX:+ResizeTLAB -XX:-OmitStackTraceInFastThrow -Djdk.tls.acknowledgeClos>2022-06-04T10:27:27.343-04:00 INFO  [cluster] Cluster created with settings {hosts=[localhost:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', >2022-06-04T10:27:27.375-04:00 INFO  [cluster] Cluster description not yet available. Waiting for 30000 ms before timing out
2022-06-04T10:27:27.408-04:00 INFO  [connection] Opened connection [connectionId{localValue:1, serverValue:35}] to localhost:27017
2022-06-04T10:27:27.412-04:00 INFO  [cluster] Monitor thread successfully connected to server with description ServerDescription{address=localhost:27017, type=STANDALONE, state=CO>2022-06-04T10:27:27.422-04:00 INFO  [connection] Opened connection [connectionId{localValue:2, serverValue:36}] to localhost:27017
2022-06-04T10:27:27.444-04:00 INFO  [connection] Closed connection [connectionId{localValue:2, serverValue:36}] to localhost:27017 because the pool has been closed.
2022-06-04T10:27:27.445-04:00 INFO  [MongoDBPreflightCheck] Connected to MongoDB version 4.4.14
2022-06-04T10:27:27.546-04:00 INFO  [SearchDbPreflightCheck] Connected to (Elastic/Open)Search version <Elasticsearch:6.8.23>

3. What steps have you already taken to try and solve the problem?

Google-foo, mostly. Honestly I am not sure what is causing the issue.

4. How can the community help?

Good morning. The issue I am encountering is that when I upgraded from 4.2.x to 4.3 the graylog-server.service no longer starts up but fails. I’ll note that I see there is a sidecar error but I presume that is only because the server is not up and therefore the sidecar can’t check in.

Any suggestions are sincerely welcome; thank you!

EDIT: I will also note the following:

  • I did not install a new server.conf file with the upgrade (perhaps a mistake as a I now see some things have changed
  • It appears graylog-server.service initially succeeds at startup but then fails seconds after

Helpful Posting Tips: Tips for Posting Questions that Get Answers [Hold down CTRL and link on link to open tips documents in a separate tab]

For now I have downgraded to 4.2.x as I could not find the cause of the issue with 4.3 and could not afford additional downtime. Still, I welcome suggestions to get up to 4.3. Thank you.

Looking at the code, the /0 exception is caused by fileStore.getTotalInodes returning 0. I assume this is due to running in an LXC container.
What does sudo df -ih /init report?

GL should handle this more gracefully. I have filed a bug.

I don’t know LXC very well - perhaps the configuration can be adjusted to work around this. Here is a somewhat similar issue that may be helpful.
Alternatively, do not use LXC until we can get this fixed.

Both the container and the host report as follows:
df: /init: No such file or directory

Thank you.

Sorry, drop the /init. Please just run sudo df -ih and paste the result.

I thought so but wasn’t sure haha
Host:

Filesystem     Inodes IUsed IFree IUse% Mounted on
udev             3.9M   509  3.9M    1% /dev
tmpfs            3.9M   842  3.9M    1% /run
/dev/nvme0n1p2      0     0     0     - /
tmpfs            3.9M     1  3.9M    1% /dev/shm
tmpfs            3.9M     2  3.9M    1% /run/lock
/dev/nvme0n1p1      0     0     0     - /boot/efi
/dev/sda1           0     0     0     - /home
/dev/loop0        816   816     0  100% /snap/lxd/23155
/dev/loop1        12K   12K     0  100% /snap/core20/1494
/dev/loop2        484   484     0  100% /snap/snapd/15904
tmpfs            789K    30  789K    1% /run/user/106
tmpfs            3.9M     3  3.9M    1% /var/snap/lxd/common/ns
tmpfs            789K    24  789K    1% /run/user/1000

Container:

Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/nvme0n1p2      0     0     0     - /
none             3.9M    25  3.9M    1% /dev
udev             3.9M   509  3.9M    1% /dev/tty
tmpfs            3.9M     2  3.9M    1% /dev/lxd
tmpfs            3.9M     6  3.9M    1% /dev/.lxd-mounts
tmpfs            3.9M     1  3.9M    1% /dev/shm
tmpfs            3.9M   252  3.9M    1% /run
tmpfs            3.9M     3  3.9M    1% /run/lock
snapfuse          12K   12K     0  100% /snap/core20/1434
snapfuse          12K   12K     0  100% /snap/core20/1494
snapfuse          802   802     0  100% /snap/lxd/22753
snapfuse          484   484     0  100% /snap/snapd/15534
snapfuse          484   484     0  100% /snap/snapd/15904
tmpfs            3.9M    21  3.9M    1% /run/user/1000

Here is a quick workaround:
disable_native_system_stats_collector=true in your graylog.conf.

3 Likes

I’ll give this a shot off-hours; thank you!

Hi Matthew, I had the same issue… I tested your workaround and is working just fine. I was wandering what effects has adding this setting. ( disable_native_system_stats_collector=true in your graylog.conf.)
I googled about it and documentation says:

  • Disable the use of OSHI for collecting system stats.

which seems to be related with Operating System and Hardware Information metrics gathered by Graylog.
So I was wandering what effect may have to disable this library and if it is safe to do it in production.

Also I would like to add more data about my issue. I think this issue was triggered when I upgraded from 4.2.8 to 4.3.0 skipping 4.2.9. At least I think this is the only difference from the other Graylog instances I’m running. I had no issues in the other instances but I did the upgrade from 4.2.9 to 4.3.0. So I’m not sure if skipping the 4.2.9 when I did the upgrade could have caused the issue. Also I would like to note here that after I detected the issue, I rolled back to 4.2.8 and did the upgrade to 4.2.9 without any problem, but when I try to go to 4.3.0 it keeps failing with the same error described abobe

Thanks in advance,
Sebastian.

Early 2021 we switched to OSHI for collecting stats, as it offered a few benefits over our existing code. The old code is still available as a fallback - which is handy in cases like this. Eventually I assume we will deprecate and remove the old code. But meanwhile you can safely use this configuration setting as a workaround.

I have the same issue too. Running Ubuntu 20.04 on btrfs file system which has no inodes.
disable_native_system_stats_collector=true works for me, but devs should allow non-inode filesystems.

2 Likes

This worked for me but would echo the desire to come up with a fix for those of us using non-inode filesystems (Btrfs)

Fix is slated for the next patch release

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.