Elasticsearch Node is in Red Status

All,
An issue occurred recently, and I want to show how I was able to troubleshoot and resolve it. As a reminder, this was my lab Graylog server, so I assume this was a user error on my part :thinking:

About a week ago I installed Cerebro on my remote Graylog Docker test VM (GitHub - lmenezes/cerebro) for testing purposes. Today I logged into Cerebro and noticed one of my Elasticsearch nodes was in RED status. It showed three different indices with unassigned shards, mainly the new indices that should have been created within the past 4 days. :astonished:

First, I executed the command below, which I think most of us know and use for troubleshooting shard issues.

[root@graylog conf]# curl -XGET http://192.168.1.100:9200/_cluster/allocation/explain?pretty
{
  "index" : "gl-events_33",
  "shard" : 2,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2022-04-01T00:01:01.903Z",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "srgsYBshRb-ue3IQzIK-RQ",
      "node_name" : "graylog.domain.com",
      "transport_address" : "192.168.1.100:9300",
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "enable",
          "decision" : "NO",
          "explanation" : "no allocations are allowed due to cluster setting [cluster.routing.allocation.enable=none]"
        }
      ]
    }
  ]
}
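The "enable" decider at the bottom is the key part: shard allocation had been disabled cluster-wide. If you want a quick overview of the cluster before digging into individual shards, the standard health and settings endpoints are handy too (the IP below is my lab node, so adjust it to your environment):

curl -XGET http://192.168.1.100:9200/_cluster/health?pretty
curl -XGET http://192.168.1.100:9200/_cluster/settings?pretty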

That command gave me a clear idea of what was wrong, and after 5 minutes of research I found out what I needed to do. I noticed Cerebro has a REST API command section in its web UI, much like OpenSearch/Elasticsearch. Good job, open source guys :+1:

Example of the CLI:

I copy & Paste this command below, in that box .

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "primaries"
  }
}
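If you are not running Cerebro, the same change can be applied straight from the shell with curl against the cluster settings API (a sketch using my lab node's address, adjust as needed):

curl -XPUT http://192.168.1.100:9200/_cluster/settings -H 'Content-Type: application/json' -d '{ "transient": { "cluster.routing.allocation.enable": "primaries" } }'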

Then I restarted the Elasticsearch service.

systemctl restart elasticsearch
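The node takes a little while to come back up; if you want to watch it, something like this works (assuming a systemd install like mine):

systemctl status elasticsearch
journalctl -u elasticsearch -f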

Just to make sure things were good, I executed this.

curl -XGET http://192.168.1.100:9200/_cat/shards
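On a larger cluster the full shard list gets long, so you can ask _cat/shards for just the columns you care about and filter out anything that has already started, for example:

curl -XGET 'http://192.168.1.100:9200/_cat/shards?v&h=index,shard,prirep,state,node' | grep -v STARTED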

I tailed my Elasticsearch log file.

[root@graylog conf]# tail -f /var/log/elasticsearch/graylog.log
[2022-04-04T16:44:27,711][INFO ][o.e.n.Node ] [graylog.domain.com] starting ...
[2022-04-04T16:44:27,883][INFO ][o.e.t.TransportService ] [graylog.enseva-labs.net] publish_address {192.168.1.100:9300}, bound_addresses {192.168.1.100:9300}
[2022-04-04T16:44:29,217][INFO ][o.e.c.c.Coordinator ] [graylog.domain.com] cluster UUID [OMgi3eu5QGiJ3buKOYn4_w]
[2022-04-04T16:44:29,522][INFO ][o.e.c.s.MasterService ] [graylog.domain.com] elected-as-master ([1] nodes joined)[{graylog.domain.com}{srgsYBshRb-ue3IQzIK-RQ}{4i1Mvoc8Q0epxjIP7STI9g}{192.168.1.100}{192.168.1.100:9300}{dimr} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 59, version: 135416, delta: master node changed {previous [], current [{graylog.domain.com}{srgsYBshRb-ue3IQzIK-RQ}{4i1Mvoc8Q0epxjIP7STI9g}{192.168.1.100}{192.168.1.100:9300}{dimr}]}
[2022-04-04T16:44:29,925][INFO ][o.e.c.s.ClusterApplierService] [graylog.domain.com] master node changed {previous [], current [{graylog.domain.com}{srgsYBshRb-ue3IQzIK-RQ}{4i1Mvoc8Q0epxjIP7STI9g}{192.168.1.100}{192.168.1.100:9300}{dimr}]}, term: 59, version: 135416, reason: Publication{term=59, version=135416}
[2022-04-04T16:44:29,967][INFO ][o.e.h.AbstractHttpServerTransport] [graylog.domain.com] publish_address {192.168.1.100:9200}, bound_addresses {192.168.1.100:9200}
[2022-04-04T16:44:29,968][INFO ][o.e.n.Node ] [graylog.domain.com] started
[2022-04-04T16:44:30,619][INFO ][o.e.c.s.ClusterSettings ] [graylog.domain.com] updating [cluster.routing.allocation.enable] from [all] to [primaries]
[2022-04-04T16:44:30,620][INFO ][o.e.c.s.ClusterSettings ] [graylog.domain.com] updating [action.destructive_requires_name] from [false] to [true]
[2022-04-04T16:44:31,603][INFO ][o.e.g.GatewayService ] [graylog.domain.com] recovered [111] indices into cluster_state
[2022-04-04T16:45:14,265][INFO ][o.e.c.r.a.AllocationService] [graylog.domain.com] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[gl-system-events_20][2], [gl-system-events_20][3], [gl-system-events_20][0]]]).
[2022-04-04T16:45:17,006][INFO ][o.e.c.r.a.AllocationService] [graylog.domain.com] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[graylog_1560][2], [graylog_1560][3], [graylog_1560][0]]]).

As you can see from the logs, everything worked out and the cluster was back in GREEN status. I had to wait about 5-10 minutes for Graylog to settle down because the buffer/journal was full from the weekend. Other than that, all is good.
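One thing to keep in mind: the transient setting above only allows primary shards to be allocated. Once the cluster is back in GREEN, you may want to reset it so allocation goes back to the default, for example:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": null
  }
}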
