Graylog Elasticsearch cluster has been yellow for 3 days

Hi,

I set up my Graylog server with 2 Elasticsearch nodes, and the system has worked perfectly for more than 10 months.

But since 3 days ago, the Indices section of the Graylog server has been showing "Elasticsearch cluster is yellow".

When I check, both Elasticsearch nodes are running.

There are 2344 shards in total, but only 1174 are active and 1174 are unassigned.

Why does this kind of incident happen?
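
One way to cross-check the numbers shown in the Graylog UI is the Elasticsearch cluster health API (`GET /_cluster/health`), which reports `status`, `active_primary_shards`, `active_shards` and `unassigned_shards`. The sketch below only interprets such a response; the `summarize_health` helper and the sample values are hypothetical, and the hint it returns is a rough heuristic, not a diagnosis.

```python
def summarize_health(health):
    """Return a short hint from a parsed _cluster/health response (a dict)."""
    status = health["status"]
    primaries = health["active_primary_shards"]
    active = health["active_shards"]
    unassigned = health["unassigned_shards"]

    if status == "green":
        return "green: all primary and replica shards are assigned"
    if status == "red":
        return "red: at least one primary shard is unassigned"
    # yellow: every primary is active, but some replicas are not
    if active == primaries:
        return "yellow: %d primaries active, no replica active (%d unassigned)" % (
            primaries, unassigned)
    return "yellow: %d replica shards unassigned" % unassigned


# Made-up sample values, not taken from the cluster in this thread.
sample = {
    "status": "yellow",
    "active_primary_shards": 10,
    "active_shards": 10,
    "unassigned_shards": 10,
}
print(summarize_health(sample))
```

If the active shard count equals the primary count, as in the sample, the unassigned shards are replicas, which is what a yellow status means.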

Check the logs of your Graylog and Elasticsearch nodes.
:arrow_right: http://docs.graylog.org/en/2.4/pages/configuration/file_location.html

When I looked at the log in the web GUI, I found the following lines

But I have given full permissions (777) to the destination folder where Graylog stores the indices.

That’s not what I’ve asked for.

@jochen

I couldn’t find any error related to node failure in the Graylog server log or the Elasticsearch log.

Exact issue is:

Existing setup:

  1. I have one Graylog server with two Elasticsearch nodes.
  2. This setup worked perfectly for more than 10 months.
  3. This setup had 8 shards: 4 shards on Elasticsearch node 1 and the other 4 on Elasticsearch node 2.

Current situation:

  1. The Graylog server system node now shows that the Elasticsearch cluster is yellow.
  2. The setup now shows only 4 shards, and when I go through the indices I find that some have their shards on node 1 and some on node 2.

Observation:

  1. Both Elasticsearch nodes are active.

How can I fix this node issue?

I can’t help you without the complete logs of your Graylog and Elasticsearch nodes. :man_shrugging:

My friend, both log files contain sensitive data because of some parser exceptions, so I can’t share them with you in full.

If you can tell me which part of a file would give you some direction in understanding the problem, let me know. I can share that part with you.

Here are some lines which I have taken from the Graylog server.log file:

2018-06-20T10:29:25.284+05:30 ERROR [Messages] Failed to index [1] messages. Please check the index error log in your web interface for the reason. Error: failure in bulk execution:
[28]: index [graylog_1066], type [message], id [b3d25982-7446-11e8-ab7b-0050568908d1], message [MapperParsingException[failed to parse [version]]; nested: NumberFormatException[For input string: “IKEv2”];]
2018-06-20T10:29:45.357+05:30 ERROR [LocalCopyListProvider] Could not refresh [Abuse.ch Ransomware tracker] table.
java.util.concurrent.ExecutionException: Could not refresh local source table.
at org.graylog.plugins.threatintel.providers.abusech.AbuseChRansomLookupProvider.refreshTable(AbuseChRansomLookupProvider.java:113) ~[?:?]
at org.graylog.plugins.threatintel.providers.LocalCopyListProvider.lambda$initialize$1(LocalCopyListProvider.java:107) ~[?:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_121]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_121]
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_121]
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_121]
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_121]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_121]
at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_121]
at okhttp3.internal.platform.Platform.connectSocket(Platform.java:124) ~[graylog.jar:?]
at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.java:187) ~[graylog.jar:?]
at okhttp3.internal.connection.RealConnection.buildConnection(RealConnection.java:173) ~[graylog.jar:?]
at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:114) ~[graylog.jar:?]
at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:196) ~[graylog.jar:?]
at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:132) ~[graylog.jar:?]
at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:101) ~[graylog.jar:?]
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:179) ~[graylog.jar:?]
at okhttp3.RealCall.execute(RealCall.java:63) ~[graylog.jar:?]
at org.graylog.plugins.threatintel.providers.abusech.AbuseChRansomLookupProvider.refreshTable(AbuseChRansomLookupProvider.java:99) ~[?:?]
… 8 more
2018-06-20T10:29:56.315+05:30 ERROR [Messages] Failed to index [1] messages. Please check the index error log in your web interface for the reason. Error: failure in bulk execution:
[178]: index [graylog_1066], type [message], id [c65397e0-7446-11e8-ab7b-0050568908d1], message [MapperParsingException[failed to parse [version]]; nested: NumberFormatException[For input string: “IKEv2”];]
2018-06-20T10:30:08.284+05:30 INFO [TorExitNodeLookupProvider] Refreshing internal table of known Tor exit nodes.
2018-06-20T10:30:18.299+05:30 ERROR [LocalCopyListProvider] Could not refresh [Tor exit nodes] table.
java.util.concurrent.ExecutionException: Could not refresh local source table.
at org.graylog.plugins.threatintel.providers.tor.TorExitNodeLookupProvider.refreshTable(TorExitNodeLookupProvider.java:120) ~[?:?]
at org.graylog.plugins.threatintel.providers.LocalCopyListProvider.lambda$initialize$1(LocalCopyListProvider.java:107) ~[?:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_121]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_121]
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_121]
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_121]
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_121]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_121]
at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_121]
at okhttp3.internal.platform.Platform.connectSocket(Platform.java:124) ~[graylog.jar:?]
at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.java:187) ~[graylog.jar:?]
at okhttp3.internal.connection.RealConnection.buildConnection(RealConnection.java:173) ~[graylog.jar:?]
at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:114) ~[graylog.jar:?]
at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:196) ~[graylog.jar:?]
at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:132) ~[graylog.jar:?]
at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:101) ~[graylog.jar:?]
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:179) ~[graylog.jar:?]
at okhttp3.RealCall.execute(RealCall.java:63) ~[graylog.jar:?]
at org.graylog.plugins.threatintel.providers.tor.TorExitNodeLookupProvider.refreshTable(TorExitNodeLookupProvider.java:89) ~[?:?]
… 8 more

You’ll find the reason for unassigned shards in the logs of your Elasticsearch nodes.

Also read "How to Resolve Unassigned Shards in Elasticsearch" (Datadog) for some hints about how to assign these shards to nodes.
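
As a starting point, the unassigned shards can be listed with the `_cat/shards` API, for example `GET /_cat/shards?h=index,shard,prirep,state,unassigned.reason`. The sketch below groups them by reason. It assumes that column set is available on the Elasticsearch version in use; the function name and the sample text are made up for illustration.

```python
from collections import Counter


def unassigned_by_reason(cat_shards_text):
    """Count UNASSIGNED shards per (prirep, reason) from _cat/shards text.

    Expects the columns index, shard, prirep, state, unassigned.reason,
    separated by whitespace, one shard per line, without a header row.
    """
    counts = Counter()
    for line in cat_shards_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[3] != "UNASSIGNED":
            continue
        # The reason column is empty for assigned shards and may be missing.
        reason = fields[4] if len(fields) > 4 else "unknown"
        counts[(fields[2], reason)] += 1
    return counts


# Hypothetical output, only to show the expected shape of the input.
sample_output = """\
graylog_1066 0 p STARTED
graylog_1066 0 r UNASSIGNED NODE_LEFT
graylog_1066 1 p STARTED
graylog_1066 1 r UNASSIGNED NODE_LEFT
graylog_1065 0 r UNASSIGNED CLUSTER_RECOVERED
"""

for (prirep, reason), n in sorted(unassigned_by_reason(sample_output).items()):
    print(prirep, reason, n)
```

A `prirep` value of `r` means a replica and `p` a primary; a large count under a single reason is usually the first thing to look up in the Elasticsearch logs.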

@jochen

Thanks for the link you provided. It was very helpful.

Since I am using Elasticsearch version 2.3.3, I am unable to get “allocate_explanation” from the index to rectify the issue.

However, when I stop ES node 2, log data is saved to node 1. But when I start the ES service on node 2 again, the data is stored on node 2 and node 1 becomes idle. By idle I mean that the node shows as active but does not store data. I observed this through the shards shown on the Graylog server. It operates like an active/passive setup. Why do I observe this kind of behavior? The whole system worked perfectly for more than 10 months without any abnormal errors.
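
One possible way to verify the active/passive impression is to look at which node holds the started shards of each index, for example with `GET /_cat/shards?h=index,shard,prirep,state,node`. If the replicas of the current write index are unassigned and all of its primaries sit on one node, only that node receives new data. The sketch below is an illustration under those assumptions; the function name, node names and sample text are hypothetical.

```python
from collections import defaultdict


def started_shards_per_node(cat_shards_text):
    """Map each index to {node name: number of STARTED shards on that node}.

    Expects the columns index, shard, prirep, state, node, one shard per
    line, without a header row. Node names may contain spaces.
    """
    result = defaultdict(lambda: defaultdict(int))
    for line in cat_shards_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[3] != "STARTED":
            continue  # unassigned shards have no node column
        node = " ".join(fields[4:])
        result[fields[0]][node] += 1
    return {index: dict(nodes) for index, nodes in result.items()}


# Hypothetical output; the node names are placeholders.
sample_output = """\
graylog_1066 0 p STARTED es-node-2
graylog_1066 1 p STARTED es-node-2
graylog_1066 0 r UNASSIGNED
graylog_1065 0 p STARTED es-node-1
"""

for index, nodes in sorted(started_shards_per_node(sample_output).items()):
    print(index, nodes)
```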
