Elasticsearch cluster is red. Default Index set shard allocation issue

1. Describe your incident:

Hello!

I inherited a Graylog 5.x installation with a 3-node Elasticsearch 7.17 cluster backing it, and yesterday the cluster health status went red.

Three of the 4 shards for the graylog_ Default Index Set are unassigned.

I did note that the unassigned shards report no_valid_shard_copy. I also saw a 4th shard that looks available, but the cluster reports that it has unassigned shards and that the cluster setting cluster.routing.allocation.allow_rebalance is set to indices_all_active. All 4 shards for this index are primaries, but there are only 3 nodes.
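
(For reference, I believe that setting can be checked like this against one of the Elasticsearch nodes; localhost here is just a placeholder for my environment:)

curl -XGET 'http://localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.allow_rebalance&pretty'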

There are no snapshots for this cluster at this time.

I was thinking there might be a way to copy the “good” shard onto the others (is that rebalancing?), and I am willing to accept data loss (hopefully only as a last resort) since this is log data.
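
From what I have read so far, a primary shard cannot simply be copied onto another shard (each primary holds a different subset of documents), so the data-loss last resort appears to be allocating an empty primary for each missing shard. I have not run this; the index, shard number, and node name below are placeholders based on my cluster, and it would have to be repeated per unassigned shard:

curl -XPOST 'http://localhost:9200/_cluster/reroute?pretty' -H 'Content-Type: application/json' -d '
{
  "commands": [
    { "allocate_empty_primary": { "index": "graylog_39", "shard": 3, "node": "elastic-01-in-prod", "accept_data_loss": true } }
  ]
}'

Is that roughly the right direction, or is there a safer option?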

Additionally, I wondered whether Rotating Indexes Manually, as described in the documentation here, would also fix my issue without diving too deep into the guts of the Elasticsearch cluster.
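
(My understanding is that rotating would only start a new write index and would not repair the red shards in the current one, but it might at least let new messages flow again. I would probably just use the “Rotate active write index” maintenance option under System > Indices in the web UI. I have also seen the REST API used for this, but I am not sure of the exact endpoint on 5.0, so the path, host, and credentials below are only my guess from older posts:)

curl -XPOST -u admin:password -H 'X-Requested-By: cli' 'http://graylog.example.org:9000/api/system/deflector/<index-set-id>/cycle'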

Sending a screenshot related to this. Any suggestions or help is appreciated.

Thanks!

2. Describe your environment:

  • OS Information: Ubuntu 20.04.6 LTS

  • Package Version: Graylog 5.0.8+4c22532

  • Service logs, configurations, and environment variables: See screenshot. Happy to add more.

3. What steps have you already taken to try and solve the problem?

So far I have just gathered data from the Elasticsearch cluster and tried to review the docs. I am looking for options to try, as I am new to Graylog/Elasticsearch and inherited this service.

4. How can the community help?

See Incident Description above.

I forgot to post the screenshot of the index I could possibly rotate from the web front end.

This is something I’ve never run into, so I don’t have any firsthand knowledge of that specific error. I did find this post, which may be helpful: How to fix cluster is red, reason: no_valid_shard_copy after stopping data nodes? - Elasticsearch - Discuss the Elastic Stack

For what it’s worth, Graylog is not compatible with any version of Elasticsearch after 7.10.2.

Are there any errors in your Graylog server.log and/or Elasticsearch logs?

I am still reading over the post you sent, @drewmiranda-gl. In the meantime, I have the following from the logs…

Graylog server log is saying:

2023-06-21T20:42:00.023-09:00 ERROR [MessagesAdapterES7] Failed to index [3] messages. Please check the index error log in your web interface for the reason. Error: failure in bulk execution:
[0]: index [graylog_39], type [_doc], id [2f0b5aab-1052-11ee-8e15-005056a99d43], message [ElasticsearchException[Elasticsearch exception [type=unavailable_shards_exception, reason=[graylog_39][3] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[graylog_39][3]] containing [3] requests]]]]
[1]: index [graylog_39], type [_doc], id [2f0bcfd7-1052-11ee-8e15-005056a99d43], message [ElasticsearchException[Elasticsearch exception [type=unavailable_shards_exception, reason=[graylog_39][3] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[graylog_39][3]] containing [3] requests]]]]
[2]: index [graylog_39], type [_doc], id [2f8b6019-1052-11ee-8e15-005056a99d43], message [ElasticsearchException[Elasticsearch exception [type=unavailable_shards_exception, reason=[graylog_39][3] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[graylog_39][3]] containing [3] requests]]]]

Elastic errors:

[2023-06-21T07:39:24,219][WARN ][o.e.c.r.a.AllocationService] [elastic-01-in-prod] failing shard [failed shard, shard [graylog_39][3], node[POHn_aN0R-CE7kteDGVRfA], [P], s[STARTED], a[id=KnnvS0goSM2tlczol_G5Rg], message [shard failure, reason [lucene commit failed]], failure [NoSuchFileException[/var/lib/elasticsearch/nodes/0/indices/utxx9Zc_TKaeZqRwpTRkfw/3/index/_l_Lucene84_0.tim]], markAsStale [true]]
java.nio.file.NoSuchFileException: /var/lib/elasticsearch/nodes/0/indices/utxx9Zc_TKaeZqRwpTRkfw/3/index/_l_Lucene84_0.tim
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
        at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:224) ~[?:?]
        at java.nio.channels.FileChannel.open(FileChannel.java:308) ~[?:?]
        at java.nio.channels.FileChannel.open(FileChannel.java:367) ~[?:?]
        at org.apache.lucene.util.IOUtils.fsync(IOUtils.java:469) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:331) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.store.FSDirectory.sync(FSDirectory.java:286) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.store.FilterDirectory.sync(FilterDirectory.java:84) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.store.FilterDirectory.sync(FilterDirectory.java:84) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:68) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:5099) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3460) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:2793) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:2075) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1432) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.index.shard.IndexShard$8.doRun(IndexShard.java:3818) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.10.jar:7.17.10]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
        at java.lang.Thread.run(Thread.java:1623) [?:?]
[2023-06-21T07:39:24,418][INFO ][o.e.c.r.a.AllocationService] [elastic-01-in-prod] Cluster health status changed from [GREEN] to [RED] (reason: [shards failed [[graylog_39][3]]]).
[2023-06-21T07:39:25,682][WARN ][o.e.c.r.a.AllocationService] [elastic-01-in-prod] failing shard [failed shard, shard [graylog_39][3], node[POHn_aN0R-CE7kteDGVRfA], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[INITIALIZING], a[id=KnnvS0goSM2tlczol_G5Rg], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-06-21T16:39:24.209Z], failed_attempts[1], delayed=false, details[failed shard on node [POHn_aN0R-CE7kteDGVRfA]: shard failure, reason [lucene commit failed], failure NoSuchFileException[/var/lib/elasticsearch/nodes/0/indices/utxx9Zc_TKaeZqRwpTRkfw/3/index/_l_Lucene84_0.tim]], allocation_status[no_valid_shard_copy]], message [shard failure, reason [corrupt file (source: [start])]], failure [CorruptIndexException[Problem reading index. (resource=/var/lib/elasticsearch/nodes/0/indices/utxx9Zc_TKaeZqRwpTRkfw/3/index/_l_Lucene84_0.tim)]; nested: NoSuchFileException[/var/lib/elasticsearch/nodes/0/indices/utxx9Zc_TKaeZqRwpTRkfw/3/index/_l_Lucene84_0.tim]; ], markAsStale [true]]
org.apache.lucene.index.CorruptIndexException: Problem reading index. (resource=/var/lib/elasticsearch/nodes/0/indices/utxx9Zc_TKaeZqRwpTRkfw/3/index/_l_Lucene84_0.tim)
        at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:144) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:83) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:171) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:213) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.IndexWriter.lambda$getReader$0(IndexWriter.java:571) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:108) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:629) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:121) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:97) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.elasticsearch.index.engine.InternalEngine.createReaderManager(InternalEngine.java:669) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:261) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:199) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:14) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2064) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2028) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:472) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:90) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:436) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:88) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2361) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.10.jar:7.17.10]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
        at java.lang.Thread.run(Thread.java:1623) [?:?]
Caused by: java.nio.file.NoSuchFileException: /var/lib/elasticsearch/nodes/0/indices/utxx9Zc_TKaeZqRwpTRkfw/3/index/_l_Lucene84_0.tim
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
        at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:224) ~[?:?]
        at java.nio.channels.FileChannel.open(FileChannel.java:308) ~[?:?]
        at java.nio.channels.FileChannel.open(FileChannel.java:367) ~[?:?]
        at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:238) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.elasticsearch.index.store.FsDirectoryFactory$HybridDirectory.openInput(FsDirectoryFactory.java:126) ~[elasticsearch-7.17.10.jar:7.17.10]
        at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:100) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:100) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:141) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.codecs.lucene84.Lucene84PostingsFormat.fieldsProducer(Lucene84PostingsFormat.java:441) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:315) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:395) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:114) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
        ... 25 more
[2023-06-21T07:45:38,828][WARN ][o.e.i.e.Engine           ] [elastic-01-in-prod] [graylog_39][2] failed engine [lucene commit failed]
java.nio.file.NoSuchFileException: /var/lib/elasticsearch/nodes/0/indices/utxx9Zc_TKaeZqRwpTRkfw/2/index/_b_Lucene84_0.tim
...

Hey @PresGas

Adding on to @drewmiranda-gl's suggestion: you can run a couple of tests; in the examples below you may need to adjust the IP address. This should give you more of an idea of why the shards are unassigned.

Check Shards

curl -XGET http://192.168.1.100:9200/_cat/shards

ES Shard Info

curl -XGET 'http://192.168.1.100:9200/_cluster/allocation/explain?pretty'

Or

curl -XGET 'http://192.168.1.100:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
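
If you want the explanation for the specific primary that is stuck, that same allocation/explain endpoint should also accept a request body naming the shard (the index and shard number here match what your logs show; adjust as needed):

curl -XGET 'http://192.168.1.100:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '{ "index": "graylog_39", "shard": 3, "primary": true }'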
