Error after upgrade from 4.2.7 to 4.3.3

1. Incident:
Hello,
After upgrading Graylog from version 4.2.7 to version 4.3.3, my Graylog node started but did not connect to the cluster. I found the following Java error in the log:

2022-08-04 13:32:22,514 INFO    [ServerBootstrap] - Graylog server 4.3.3+86369d3 starting up - {}
2022-08-04 13:32:22,514 INFO    [ServerBootstrap] - JRE: Oracle Corporation 1.8.0_332 on Linux 5.4.170+ - {}
2022-08-04 13:32:22,514 INFO    [ServerBootstrap] - Deployment: docker - {}
2022-08-04 13:32:22,515 INFO    [ServerBootstrap] - OS: Debian GNU/Linux 11 (bullseye) (debian) - {}
2022-08-04 13:32:22,515 INFO    [ServerBootstrap] - Arch: amd64 - {}
2022-08-04 13:32:22,564 INFO    [PeriodicalsService] - Starting 7 periodicals ... - {}
2022-08-04 13:32:22,564 INFO    [PeriodicalsService] - Delaying start of 21 periodicals until this node becomes leader ... - {}
2022-08-04 13:32:22,565 INFO    [Periodicals] - Starting [org.graylog2.periodical.GarbageCollectionWarningThread] periodical, running forever. - {}
2022-08-04 13:32:22,577 INFO    [Periodicals] - Starting [org.graylog2.periodical.TrafficCounterCalculator] periodical in [0s], polling every [1s]. - {}
2022-08-04 13:32:22,597 INFO    [Periodicals] - Starting [org.graylog2.periodical.NodePingThread] periodical in [0s], polling every [1s]. - {}
2022-08-04 13:32:22,608 INFO    [Periodicals] - Starting [org.graylog2.events.ClusterEventPeriodical] periodical in [0s], polling every [1s]. - {}
2022-08-04 13:32:22,611 INFO    [Periodicals] - Starting [org.graylog2.periodical.ThroughputCalculator] periodical in [0s], polling every [1s]. - {}
2022-08-04 13:32:22,638 INFO    [Periodicals] - Starting [org.graylog2.periodical.BatchedElasticSearchOutputFlushThread] periodical in [0s], polling every [1s]. - {}
2022-08-04 13:32:22,642 INFO    [Periodicals] - Starting [org.graylog2.periodical.ThrottleStateUpdaterThread] periodical in [1s], polling every [1s]. - {}
2022-08-04 13:32:22,670 INFO    [connection] - Opened connection [connectionId{localValue:12, serverValue:2123918}] to mongodb-secondary-0.mongodb.graylog.svc.cluster.local:27017 - {}
2022-08-04 13:32:22,854 INFO    [PrometheusExporterHTTPServer] - Exporting Prometheus metrics on <0.0.0.0:9833> via HTTP - {}
2022-08-04 13:32:22,882 INFO    [JerseyService] - Enabling CORS for HTTP endpoint - {}
2022-08-04 13:32:24,455 INFO    [NetworkListener] - Started listener bound to [0.0.0.0:9000] - {}
2022-08-04 13:32:24,457 INFO    [HttpServer] - [HttpServer] Started. - {}
2022-08-04 13:32:24,457 INFO    [JerseyService] - Started REST API at <0.0.0.0:9000> - {}
2022-08-04 13:32:24,457 INFO    [ServiceManagerListener] - Services are healthy - {}
2022-08-04 13:32:24,457 INFO    [JobSchedulerService] - Job scheduler execution is disabled. Waiting and trying again until enabled. - {}
2022-08-04 13:32:24,458 INFO    [ServerBootstrap] - Services started, startup times in ms: {FailureHandlingService [RUNNING]=5, UserSessionTerminationService [RUNNING]=11, JobSchedulerService [RUNNING]=27, InputSetupService [RUNNING]=28, BufferSynchronizerService [RUNNING]=32, OutputSetupService [RUNNING]=32, LocalKafkaMessageQueueWriter [RUNNING]=33, UrlWhitelistService [RUNNING]=33, GracefulShutdownService [RUNNING]=34, LocalKafkaMessageQueueReader [RUNNING]=36, EtagService [RUNNING]=84, ConfigurationEtagService [RUNNING]=86, LocalKafkaJournal [RUNNING]=89, MongoDBProcessingStatusRecorderService [RUNNING]=93, LookupTableService [RUNNING]=94, PeriodicalsService [RUNNING]=105, StreamCacheService [RUNNING]=115, PrometheusExporter [RUNNING]=283, JerseyService [RUNNING]=1895} - {}
2022-08-04 13:32:24,458 INFO    [InputSetupService] - Triggering launching persisted inputs, node transitioned from Uninitialized [LB:DEAD] to Running [LB:ALIVE] - {}
2022-08-04 13:32:24,467 INFO    [ServerBootstrap] - Graylog server up and running. - {}
2022-08-04 13:32:24,472 INFO    [InputLauncher] - Launching input [Beats/PubSub-input/620a4cab38926a16a72d774c] - desired state is RUNNING - {}
2022-08-04 13:32:24,477 INFO    [InputStateListener] - Input [Beats/620a4cab38926a16a72d774c] is now STARTING - {}
2022-08-04 13:32:24,542 INFO    [InputStateListener] - Input [Beats/620a4cab38926a16a72d774c] is now RUNNING - {}
2022-08-04 13:33:00,226 ERROR   [AnyExceptionClassMapper] - Unhandled exception in REST resource - {}
java.lang.NullPointerException: null
	at org.graylog2.cluster.NodeImpl.isLeader(NodeImpl.java:51) ~[graylog.jar:?]
	at org.graylog2.rest.resources.system.ClusterResource.nodeSummary(ClusterResource.java:110) ~[graylog.jar:?]
	at org.graylog2.rest.resources.system.ClusterResource.nodes(ClusterResource.java:76) ~[graylog.jar:?]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_332]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_332]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_332]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_332]
	at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) ~[graylog.jar:?]
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:124) ~[graylog.jar:?]
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:167) ~[graylog.jar:?]
	at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219) ~[graylog.jar:?]
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79) ~[graylog.jar:?]
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:469) ~[graylog.jar:?]
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:391) ~[graylog.jar:?]
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:80) ~[graylog.jar:?]
	at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:255) [graylog.jar:?]
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) [graylog.jar:?]
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) [graylog.jar:?]
	at org.glassfish.jersey.internal.Errors.process(Errors.java:292) [graylog.jar:?]
	at org.glassfish.jersey.internal.Errors.process(Errors.java:274) [graylog.jar:?]
	at org.glassfish.jersey.internal.Errors.process(Errors.java:244) [graylog.jar:?]
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) [graylog.jar:?]
	at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:234) [graylog.jar:?]
	at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:680) [graylog.jar:?]
	at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:356) [graylog.jar:?]
	at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:200) [graylog.jar:?]
	at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180) [graylog.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_332]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_332]
	at java.lang.Thread.run(Thread.java:750) [?:1.8.0_332]

I don’t know what this error means. I tried the upgrade on a very small cluster with only one Beats input. The cluster has 3 nodes and I was updating them one by one, but this error showed up on the first node. After downgrading the Graylog node to the previous version, the node connected to the Graylog cluster successfully.

2. Environment:
The Graylog cluster is running in Kubernetes.
Kubernetes version: 1.21.9-gke.300
Elasticsearch version: 7.10.1
MongoDB version: 4.2.1

I don’t know if this error is related to a Graylog input or to the connection to one of the databases (Elasticsearch or MongoDB), but I didn’t find any errors about database connections in the logs.
Thanks for any help or information about what could be wrong.

Hello,

I’m not familiar with Kubernetes, but I am with Docker/Docker-compose. I need to ask a couple of questions.

  • How did you perform your upgrade, besides one node at a time?
  • Are there any more logs you can show, e.g. from Elasticsearch/MongoDB?
  • You mentioned this was one node. How about the other ones? Do they show the same error?
  • Are you able to log on to the Web UI? If not, what do you see?

From the logs it looks like your inputs started. If you can’t log on to the Web UI, I would look into the Graylog logs a little more. Also check permissions, configuration, network, firewalls, etc. It seems everything is running on Java, so ensure you have the right versions.
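To compare nodes quickly, the ERROR entries can be pulled out of each node's log together with the exception line and the top stack frame. The following is only a minimal sketch: it assumes the log layout shown in the first post (timestamp, level, logger in square brackets), and the name `find_errors` is made up for illustration.

```python
import re

# Start of a log entry in the layout seen in this thread, e.g.
# "2022-08-04 13:33:00,226 ERROR   [AnyExceptionClassMapper] - Unhandled ..."
ENTRY_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<logger>[^\]]+)\]\s+-\s+(?P<msg>.*)$"
)


def find_errors(lines):
    """Return one dict per ERROR entry with its timestamp, logger, message,
    the exception line and the first stack frame (None if absent)."""
    errors = []
    current = None  # the ERROR entry whose continuation lines we are reading
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY_RE.match(line)
        if match:
            # A new log entry starts; only ERROR entries are tracked.
            current = None
            if match.group("level") == "ERROR":
                current = {
                    "ts": match.group("ts"),
                    "logger": match.group("logger"),
                    "message": match.group("msg"),
                    "exception": None,
                    "first_frame": None,
                }
                errors.append(current)
            continue
        if current is None:
            continue
        stripped = line.strip()
        if stripped.startswith("at "):
            # Keep only the top-most frame, which usually names the culprit.
            if current["first_frame"] is None:
                current["first_frame"] = stripped[3:]
        elif stripped and current["exception"] is None:
            current["exception"] = stripped
    return errors
```

Any iterable of lines works as input, for example an open file containing the saved log output of one pod. For the log in the first post, the single ERROR entry would report `java.lang.NullPointerException: null` with a first frame in `org.graylog2.cluster.NodeImpl.isLeader`.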

Hi Tomas,

When you upgrade, the server.conf for Graylog needs some adjustments.
Did you make those changes?

https://docs.graylog.org/docs/upgrading-to-graylog-43x

Kind Greetings,
Arie

Hello,

update:

  • My first try was to update Graylog node by node (rollingUpdate): one node is updated to the new version and connects to the existing cluster, then the second node, and so on. But this did not work correctly, because the Graylog node with the new version started showing the Java error that I posted yesterday.
  • The other components, Elasticsearch and MongoDB, showed no errors or warnings, and the Graylog cluster kept working correctly with 2 nodes.
  • The other Graylog nodes had no errors and worked correctly; they collected data and search worked.
  • The Web UI worked.

Solution:

  • I stopped all inputs manually via the Graylog UI (this is safer, because then I can’t lose data).
  • After that I released the new Graylog version, but instead of using rollingUpdate I removed all Graylog pods in Kubernetes and waited for new pods to be created with the new Graylog version.
  • After the pods were created successfully, I checked all logs and everything was OK, without errors or warnings.
  • Then I logged into the Graylog UI and started the inputs.
  • Graylog started receiving new data from the inputs, and now everything works correctly with the new Graylog version.
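The pod-related steps above could be scripted roughly as follows. This is only a sketch under several assumptions: the workload spec already points at the new Graylog image, the namespace `graylog` is guessed from the MongoDB host name in the log, and the label selector `app=graylog` is hypothetical and has to be replaced with whatever labels the pods really carry. Stopping and starting the inputs is still done in the UI, as described. The script only builds and prints the `kubectl` commands; it does not run them.

```python
import shlex


def full_restart_commands(namespace, selector, timeout="600s", tail=200):
    """Build kubectl argument lists for a non-rolling restart of all pods
    matching `selector` in `namespace`."""
    return [
        # Remove every Graylog pod at once so no old-version node stays
        # in the cluster while the new version starts.
        ["kubectl", "delete", "pod", "-n", namespace, "-l", selector],
        # Wait until the recreated pods report Ready.
        ["kubectl", "wait", "--for=condition=Ready", "pod",
         "-n", namespace, "-l", selector, f"--timeout={timeout}"],
        # Show recent log lines of the new pods to check for errors.
        ["kubectl", "logs", "-n", namespace, "-l", selector,
         f"--tail={tail}"],
    ]


if __name__ == "__main__":
    # Assumed values; adjust both to the actual deployment.
    for cmd in full_restart_commands("graylog", "app=graylog"):
        print(shlex.join(cmd))
```

One caveat: if `kubectl wait` runs before the controller has recreated any pod, it may find nothing to wait for, so a short pause or a retry between the first two commands may be needed.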

Before the update to version 4.3.3 I thought that a rolling update would work, because previous updates worked with a rolling update. But maybe Graylog changed more than just a few variables in the config file. However, Graylog works fine now.
Thanks for the help.

Thank you for posting the solution, :+1:

It might be. I always double-check the changelog, just in case.