Failed Cluster Member shuts down entire Graylog Cluster

Heyo,

I just ran into the problem, that one of my Graylog Nodes shut down, because the storage system behind our virtualization platform (more exact the storage for the vm-host my Graylog was running on) became full and therefore the system was halted for some reason (I can’t get the server to start, I have to wait for the storage guys to solve their problem first, so I can’t look for problems atm.)

But my bigger concern is, that the second Graylog node also failed, after the other one stopped being reachable. So, atm, we are blind. Is this a known problem, that Graylog stops responing when another cluster member goes down unexpectedly? This is the first time this happened, every planned shutdown or service restart never caused any problems.

Has anyone experienced this problem? And if yes, any ideas why? And any idea on how to get Graylog01 running again?

Symptoms:

  • Graylog02 stopped responding (Server seems to have been shut down by the hypervisor).
  • Graylog01 stopped responding on the Web-Interface (the good old server not reachable screen of Graylog)
  • Graylog01 stopped indexing into Elasticsearch, even into the still available elasticsearch nodes. The Node however still received the messages and wrote them into the Journal (Journal filled to ca. 250k messages, this is why I noticed the problem, I get alerted when a Journal fills up)
  • I Tried restarting Graylog01 but it will not come up, it is stuck here:
2017-09-28T13:04:29.274+02:00 INFO  [cluster] Monitor thread successfully connected to server with description ServerDescription{address=graylog-arbiter.example.local:27018, type=REPLICA_SET_ARBITER, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 4, 2]}, minWireVersion=0, maxWireVersion=5, maxDocumentSize=16777216, roundTripTimeNanos=9646035, setName='rewe-graylog-prod', canonicalAddress=graylog-arbiter.example.local:27018, hosts=[graylog01:27017, graylog02.example.local:27017], passives=[], arbiters=[graylog-arbiter.example.local:27018], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=5, lastWriteDate=Thu Jan 01 01:00:00 CET 1970, lastUpdateTimeNanos=18043098561357175}
2017-09-28T13:04:29.626+02:00 INFO  [cluster] Exception in monitor thread while connecting to server graylog02.example.local:27017
com.mongodb.MongoSocketOpenException: Exception opening socket
        at com.mongodb.connection.SocketStream.open(SocketStream.java:63) ~[graylog.jar:?]
        at com.mongodb.connection.InternalStreamConnection.open(InternalStreamConnection.java:115) ~[graylog.jar:?]
        at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:113) [graylog.jar:?]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_121]
Caused by: java.net.NoRouteToHostException: No route to host (Host unreachable)
        at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_121]
        at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source) ~[?:1.8.0_121]
        at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source) ~[?:1.8.0_121]
        at java.net.AbstractPlainSocketImpl.connect(Unknown Source) ~[?:1.8.0_121]
        at java.net.SocksSocketImpl.connect(Unknown Source) ~[?:1.8.0_121]
        at java.net.Socket.connect(Unknown Source) ~[?:1.8.0_121]
        at com.mongodb.connection.SocketStreamHelper.initialize(SocketStreamHelper.java:57) ~[graylog.jar:?]
        at com.mongodb.connection.SocketStream.open(SocketStream.java:58) ~[graylog.jar:?]
        ... 3 more
2017-09-28T13:04:59.047+02:00 ERROR [MongoConnectionProvider] Error connecting to MongoDB: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=REPLICA_SET, servers=[{address=graylog-arbiter.example.local:27018, type=REPLICA_SET_ARBITER, roundTripTime=1.2 ms, state=CONNECTED}, {address=graylog01:27017, type=REPLICA_SET_OTHER, roundTripTime=0.9 ms, state=CONNECTED}, {address=graylog02.example.local:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.NoRouteToHostException: No route to host (Host unreachable)}}]
2017-09-28T13:04:59.307+02:00 INFO  [cluster] No server chosen by WritableServerSelector from cluster description ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=graylog-arbiter.example.local:27018, type=REPLICA_SET_ARBITER, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 4, 2]}, minWireVersion=0, maxWireVersion=5, maxDocumentSize=16777216, roundTripTimeNanos=1159190, setName='rewe-graylog-prod', canonicalAddress=graylog-arbiter.example.local:27018, hosts=[graylog01:27017, graylog02.example.local:27017], passives=[], arbiters=[graylog-arbiter.example.local:27018], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=5, lastWriteDate=Thu Jan 01 01:00:00 CET 1970, lastUpdateTimeNanos=18043128104881863}, ServerDescription{address=graylog01:27017, type=REPLICA_SET_OTHER, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 4, 2]}, minWireVersion=0, maxWireVersion=5, maxDocumentSize=16777216, roundTripTimeNanos=877379, setName='rewe-graylog-prod', canonicalAddress=graylog01:27017, hosts=[graylog01:27017, graylog02.example.local:27017], passives=[], arbiters=[graylog-arbiter.example.local:27018], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=5, lastWriteDate=Tue Jul 04 14:28:27 CEST 2017, lastUpdateTimeNanos=18043128098131404}, ServerDescription{address=graylog02.example.local:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.NoRouteToHostException: No route to host (Host unreachable)}}]}. Waiting for 30000 ms before timing out
2017-09-28T13:05:29.317+02:00 INFO  [cluster] No server chosen by WritableServerSelector from cluster description ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=graylog-arbiter.example.local:27018, type=REPLICA_SET_ARBITER, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 4, 2]}, minWireVersion=0, maxWireVersion=5, maxDocumentSize=16777216, roundTripTimeNanos=1170321, setName='rewe-graylog-prod', canonicalAddress=graylog-arbiter.example.local:27018, hosts=[graylog01:27017, graylog02.example.local:27017], passives=[], arbiters=[graylog-arbiter.example.local:27018], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=5, lastWriteDate=Thu Jan 01 01:00:00 CET 1970, lastUpdateTimeNanos=18043158154655368}, ServerDescription{address=graylog01:27017, type=REPLICA_SET_OTHER, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 4, 2]}, minWireVersion=0, maxWireVersion=5, maxDocumentSize=16777216, roundTripTimeNanos=722206, setName='rewe-graylog-prod', canonicalAddress=graylog01:27017, hosts=[graylog01:27017, graylog02.example.local:27017], passives=[], arbiters=[graylog-arbiter.example.local:27018], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=5, lastWriteDate=Tue Jul 04 14:28:27 CEST 2017, lastUpdateTimeNanos=18043158114364905}, ServerDescription{address=graylog02.example.local:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.NoRouteToHostException: No route to host (Host unreachable)}}]}. Waiting for 30000 ms before timing out

Here are screenshots, when it happened (yes, I am monitoring my Graylog-Prod with my Graylog-Test environment. :smiley:)

Logs: (This is repeated over and over before I restarted Graylog01)

2017-09-28T12:26:17.327+02:00 INFO  [cluster] No server chosen by WritableServerSelector from cluster description ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=graylog-arbiter.example.local:27018, type=REPLICA_SET_ARBITER, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 4, 2]}, minWireVersion=0, maxWireVersion=5, maxDocumentSize=16777216, roundTripTimeNanos=830087, setName='rewe-graylog-prod', canonicalAddress=graylog-arbiter.example.local:27018, hosts=[graylog01:27017, graylog02.example.local:27017], passives=[], arbiters=[graylog-arbiter.example.local:27018], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=5, lastWriteDate=Thu Jan 01 01:00:00 CET 1970, lastUpdateTimeNanos=18040806167280731}, ServerDescription{address=graylog01:27017, type=REPLICA_SET_OTHER, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 4, 2]}, minWireVersion=0, maxWireVersion=5, maxDocumentSize=16777216, roundTripTimeNanos=335873, setName='rewe-graylog-prod', canonicalAddress=graylog01:27017, hosts=[graylog01:27017, graylog02.example.local:27017], passives=[], arbiters=[graylog-arbiter.example.local:27018], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=5, lastWriteDate=Tue Jul 04 14:28:27 CEST 2017, lastUpdateTimeNanos=18040806133671295}, ServerDescription{address=graylog02.example.local:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.NoRouteToHostException: No route to host (Host unreachable)}}]}. Waiting for 30000 ms before timing out
2017-09-28T12:26:20.011+02:00 INFO  [RetryExec] I/O exception (java.net.NoRouteToHostException) caught when processing request to {}->http://graylog02.example.local:9200: No route to host (Host unreachable)
2017-09-28T12:26:20.011+02:00 INFO  [RetryExec] I/O exception (java.net.NoRouteToHostException) caught when processing request to {}->http://graylog02.example.local:9200: No route to host (Host unreachable)
2017-09-28T12:26:20.011+02:00 INFO  [RetryExec] Retrying request to {}->http://graylog02.example.local:9200
2017-09-28T12:26:20.011+02:00 INFO  [RetryExec] Retrying request to {}->http://graylog02.example.local:9200
2017-09-28T12:26:22.784+02:00 ERROR [IndexRotationThread] Uncaught exception in periodical
com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) Error injecting constructor, com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, servers=[{address=graylog-arbiter.example.local:27018, type=REPLICA_SET_ARBITER, roundTripTime=1.8 ms, state=CONNECTED}, {address=graylog01:27017, type=REPLICA_SET_OTHER, roundTripTime=0.4 ms, state=CONNECTED}, {address=graylog02.example.local:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.NoRouteToHostException: No route to host (Host unreachable)}}]
  at org.graylog2.system.activities.SystemMessageServiceImpl.<init>(SystemMessageServiceImpl.java:37)
  while locating org.graylog2.system.activities.SystemMessageServiceImpl
  while locating org.graylog2.system.activities.SystemMessageService
    for the 1st parameter of org.graylog2.system.activities.SystemMessageActivityWriter.<init>(SystemMessageActivityWriter.java:41)
  while locating org.graylog2.system.activities.SystemMessageActivityWriter
  while locating org.graylog2.shared.system.activities.ActivityWriter
    for the 1st parameter of org.graylog2.bindings.providers.SystemJobManagerProvider.<init>(SystemJobManagerProvider.java:30)
  while locating org.graylog2.bindings.providers.SystemJobManagerProvider
  while locating org.graylog2.system.jobs.SystemJobManager
    for the 6th parameter of org.graylog2.indexer.MongoIndexSet.<init>(MongoIndexSet.java:93)
  while locating org.graylog2.indexer.MongoIndexSet annotated with @com.google.inject.internal.UniqueAnnotations$Internal(value=1)
Caused by: com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, servers=[{address=graylog-arbiter.example.local:27018, type=REPLICA_SET_ARBITER, roundTripTime=1.8 ms, state=CONNECTED}, {address=graylog01:27017, type=REPLICA_SET_OTHER, roundTripTime=0.4 ms, state=CONNECTED}, {address=graylog02.example.local:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.NoRouteToHostException: No route to host (Host unreachable)}}]
    at com.mongodb.connection.BaseCluster.createTimeoutException(BaseCluster.java:377)
    at com.mongodb.connection.BaseCluster.selectServer(BaseCluster.java:104)
    at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:75)
    at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:71)
    at com.mongodb.binding.ClusterBinding.getWriteConnectionSource(ClusterBinding.java:68)
    at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:411)
    at com.mongodb.operation.CreateIndexesOperation.execute(CreateIndexesOperation.java:144)
    at com.mongodb.operation.CreateIndexesOperation.execute(CreateIndexesOperation.java:71)
    at com.mongodb.Mongo.execute(Mongo.java:845)
    at com.mongodb.Mongo$2.execute(Mongo.java:828)
    at com.mongodb.DBCollection.createIndex(DBCollection.java:1618)
    at com.mongodb.DBCollection.createIndex(DBCollection.java:1603)
    at org.graylog2.system.activities.SystemMessageServiceImpl.<init>(SystemMessageServiceImpl.java:39)
    at org.graylog2.system.activities.SystemMessageServiceImpl$$FastClassByGuice$$c6891cf2.newInstance(<generated>)
    at com.google.inject.internal.DefaultConstructionProxyFactory$FastClassProxy.newInstance(DefaultConstructionProxyFactory.java:89)
    at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:111)
    at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:90)
    at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:268)
    at com.google.inject.internal.FactoryProxy.get(FactoryProxy.java:56)
    at com.google.inject.internal.SingleParameterInjector.inject(SingleParameterInjector.java:38)
    at com.google.inject.internal.SingleParameterInjector.getAll(SingleParameterInjector.java:62)
    at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:110)
    at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:90)
    at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:268)
    at com.google.inject.internal.FactoryProxy.get(FactoryProxy.java:56)
    at com.google.inject.internal.SingleParameterInjector.inject(SingleParameterInjector.java:38)
    at com.google.inject.internal.SingleParameterInjector.getAll(SingleParameterInjector.java:62)
    at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:110)
    at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:90)
    at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:268)
    at com.google.inject.internal.BoundProviderFactory.get(BoundProviderFactory.java:61)
    at com.google.inject.internal.SingleParameterInjector.inject(SingleParameterInjector.java:38)
    at com.google.inject.internal.SingleParameterInjector.getAll(SingleParameterInjector.java:62)
    at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:110)
    at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:90)
    at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:268)
    at com.google.inject.internal.InjectorImpl$2$1.call(InjectorImpl.java:1019)
    at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1085)
    at com.google.inject.internal.InjectorImpl$2.get(InjectorImpl.java:1015)
    at com.google.inject.assistedinject.FactoryProvider2.invoke(FactoryProvider2.java:776)
    at com.sun.proxy.$Proxy110.create(Unknown Source)
    at org.graylog2.indexer.MongoIndexSetRegistry.findAllMongoIndexSets(MongoIndexSetRegistry.java:99)
    at org.graylog2.indexer.MongoIndexSetRegistry.getAll(MongoIndexSetRegistry.java:107)
    at org.graylog2.indexer.MongoIndexSetRegistry.iterator(MongoIndexSetRegistry.java:222)
    at java.lang.Iterable.forEach(Unknown Source)
    at org.graylog2.periodical.IndexRotationThread.doRun(IndexRotationThread.java:73)
    at org.graylog2.plugin.periodical.Periodical.run(Periodical.java:77)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.runAndReset(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

2) Error injecting constructor, com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, servers=[{address=graylog-arbiter.example.local:27018, type=REPLICA_SET_ARBITER, roundTripTime=0.7 ms, state=CONNECTED}, {address=graylog01:27017, type=REPLICA_SET_OTHER, roundTripTime=0.4 ms, state=CONNECTED}, {address=graylog02.example.local:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.NoRouteToHostException: No route to host (Host unreachable)}}]
  at org.graylog2.system.activities.SystemMessageServiceImpl.<init>(SystemMessageServiceImpl.java:37)
  while locating org.graylog2.system.activities.SystemMessageServiceImpl
  while locating org.graylog2.system.activities.SystemMessageService
    for the 1st parameter of org.graylog2.system.activities.SystemMessageActivityWriter.<init>(SystemMessageActivityWriter.java:41)
  while locating org.graylog2.system.activities.SystemMessageActivityWriter
  while locating org.graylog2.shared.system.activities.ActivityWriter
    for the 8th parameter of org.graylog2.indexer.MongoIndexSet.<init>(MongoIndexSet.java:93)
  while locating org.graylog2.indexer.MongoIndexSet annotated with @com.google.inject.internal.UniqueAnnotations$Internal(value=1)
Caused by: com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, servers=[{address=graylog-arbiter.example.local:27018, type=REPLICA_SET_ARBITER, roundTripTime=0.7 ms, state=CONNECTED}, {address=graylog01:27017, type=REPLICA_SET_OTHER, roundTripTime=0.4 ms, state=CONNECTED}, {address=graylog02.example.local:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.NoRouteToHostException: No route to host (Host unreachable)}}]
2017-09-28T12:26:23.017+02:00 INFO  [RetryExec] I/O exception (java.net.NoRouteToHostException) caught when processing request to {}->http://graylog02.example.local:9200: No route to host (Host unreachable)
2017-09-28T12:26:23.017+02:00 INFO  [RetryExec] I/O exception (java.net.NoRouteToHostException) caught when processing request to {}->http://graylog02.example.local:9200: No route to host (Host unreachable)
2017-09-28T12:26:23.017+02:00 INFO  [RetryExec] Retrying request to {}->http://graylog02.example.local:9200
2017-09-28T12:26:23.017+02:00 INFO  [RetryExec] Retrying request to {}->http://graylog02.example.local:9200
2017-09-28T12:26:25.205+02:00 WARN  [Errors] The following warnings have been detected: WARNING: Unknown HK2 failure detected:
MultiException stack 1 of 1
com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) Error injecting constructor, com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, servers=[{address=graylog-arbiter.example.local:27018, type=REPLICA_SET_ARBITER, roundTripTime=0.8 ms, state=CONNECTED}, {address=graylog01:27017, type=REPLICA_SET_OTHER, roundTripTime=0.4 ms, state=CONNECTED}, {address=graylog02.example.local:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.NoRouteToHostException: No route to host (Host unreachable)}}]
  at org.graylog2.users.UserServiceImpl.<init>(UserServiceImpl.java:67)
  while locating org.graylog2.users.UserServiceImpl
  while locating org.graylog2.shared.users.UserService

[....]

I appreciate any help :slight_smile:

Greetings - Phil

PS: Yes I know, having Graylog and Elasticsearch on the same host is not best practice, but this was not my decision ^^

MongoDB is essential for Graylog to run.

It seems like the failure of your storage backend also took the MongoDB cluster or at least parts of the MongoDB cluster with it.

Hey @jochen ,

Well, this is strange. The MongoDB process on Graylog01 never logged any error and was in a healthy state. But restarting it allowed Graylog to come back up (Only thing that did not work was AD/LDAP-authentication).

Another problem: As long as the failed Elasticsearch-server was still in the configuration of Graylog 01, Graylog would start up, output messages to Elasticsearch for a minute or two, and then suddenly stop sending messages to Elasticsearch alltogether (Output = 0msg/s while the Incoming messages were still received and written to Journal). Removing the failed Elasticsearch-Server from the Graylog configuration enabled Graylog to work as normal.

This does not really seem wanted, that a failed/not-reachable Elasticsearch-server inside the Graylog configuration disables output to healthy and reachable Elasticsearch-Servers…

Edit: The AD/LDAP-auth did not work because Graylog now uses an old search string that was changed months ago… this is strange…

Greetings - Phil

you should check the elasticsearch status. I personal notice that if Elasticsearch shard routing is disabled my labs shows the same.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.