Couldn't read Elasticsearch cluster health after extending LV and upgrading OpenSearch to 1.3.8

1. Describe your incident:
The OpenSearch cluster had storage issues and was hitting the high disk watermark threshold, so I extended the LV holding the OpenSearch data, upgraded OpenSearch to 1.3.8, and restarted the OpenSearch service.
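For context, the extension itself was nothing exotic; this is a minimal sketch of the kind of commands involved, assuming the data directory sits on an XFS filesystem on LVM (device, VG/LV and mount point names are placeholders, not the real ones):

$ df -h /var/lib/opensearch                           # check usage against the high disk watermark (90% by default)
$ sudo lvextend -L +100G /dev/vg_data/lv_opensearch
$ sudo xfs_growfs /var/lib/opensearch                 # grow the filesystem into the newly added space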

After the restart, Graylog's web interface complains with:

Could not retrieve Elasticsearch cluster health. Fetching Elasticsearch cluster health failed: There was an error fetching a resource: . Additional information: Couldn't read Elasticsearch cluster health

2. Describe your environment:

  • OS Information:
    Ubuntu 20.04 LTS for Graylog
    CentOS 7.9 for OpenSearch

  • Package Version:
    Graylog 4.3.9
    OpenSearch 1.3.8

  • Service logs, configurations, and environment variables:

Relevant config:

$ grep elasticsearch_ /etc/graylog/server/server.conf
elasticsearch_version = 7
elasticsearch_hosts = http://admin:admin@node-1:9200,http://admin:admin@node-2:9200,http://admin:admin@node-3:9200

From the Graylog logs:

2023-03-02T17:22:07.788+01:00 INFO [SearchDbPreflightCheck] Connected to (Elastic/Open)Search version OpenSearch:1.3.8
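For completeness, each host from elasticsearch_hosts can also be probed individually, using the same embedded-credentials URL form Graylog uses (hostnames exactly as in server.conf):

$ for n in node-1 node-2 node-3; do
    curl -s -o /dev/null -w "%{http_code} $n\n" "http://admin:admin@$n:9200/_cluster/health"
  done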

3. What steps have you already taken to try and solve the problem?

I checked that the OpenSearch status can be queried from the Graylog cluster nodes:

$ curl http://opensearch-node1:9200/_cluster/health?pretty -u admin:admin -k
{
  "cluster_name" : "opensearch-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "discovered_master" : true,
  "active_primary_shards" : 1798,
  "active_shards" : 1827,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 1169,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 60.9
}
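With ~1,169 shards still unassigned, the yellow state itself is not surprising right after the restart. Something like this (same host and credentials as above) summarizes why shards are unassigned, grouped by reason:

$ curl -s -u admin:admin 'http://opensearch-node1:9200/_cat/shards?h=state,unassigned.reason' \
    | grep UNASSIGNED | sort | uniq -c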

4. How can the community help?

Is there any reason for this to suddenly stop working? Or should I just be patient and wait for OpenSearch to catch up eventually?

TIA!

Hey @m_mlk

I see OpenSearch is in yellow.

You can execute something like this to see why it's in yellow:

curl -XGET http://opensearch-node1:9200/_cluster/allocation/explain?pretty

For troubleshooting, have you tried NOT using admin:admin to connect to the OpenSearch nodes, to see if that makes a difference?

Hi @gsmith

Thanks for your reply.

$ curl http://opensearch-node1:9200/_cluster/allocation/explain?pretty
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
  },
  "status" : 400
}
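That 400 just means there were no unassigned shards left at the moment the call was made. If it happens again, a specific shard can still be explained by naming it in the request body; the index name below is only an example:

curl -s -XGET 'http://opensearch-node1:9200/_cluster/allocation/explain?pretty' \
  -H 'Content-Type: application/json' \
  -d '{ "index": "graylog_0", "shard": 0, "primary": true }'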

The connection to OpenSearch has been working with admin:admin since day one.

Anyway, NOT using those credentials still seems to work… O_o

2023-03-03T09:49:03.581+01:00 INFO [MongoDBPreflightCheck] Connected to MongoDB version 5.0.13
2023-03-03T09:49:03.684+01:00 INFO [SearchDbPreflightCheck] Connected to (Elastic/Open)Search version OpenSearch:1.3.8

Still, the web GUI shows the same error message:

Could not retrieve Elasticsearch cluster health. Fetching Elasticsearch cluster health failed: There was an error fetching a resource: . Additional information: Couldn't read Elasticsearch cluster health

…but the cluster state is back to “green”:

graylog-node2 $ curl http://opensearch-node3:9200/_cluster/health?pretty
{
  "cluster_name" : "opensearch-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "discovered_master" : true,
  "active_primary_shards" : 1798,
  "active_shards" : 3000,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
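In case it helps narrow things down, I can also watch the Graylog side while reloading the overview page; this assumes the default log location of the Ubuntu package:

$ sudo tail -f /var/log/graylog-server/server.log | grep -iE 'elasticsearch|indexer|health'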

Ideas?

TIA!

Hey,

So somewhere you have a configuration that may be incorrect. If Graylog is telling you it cannot get the health status (i.e., via the API), something is either blocking it or misconfigured. You can try restarting the Graylog service as a starter, but we would need more info on the configuration changes you made.
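Something along these lines (assuming a systemd install) restarts the service and lets you watch what it logs when the health check fails:

sudo systemctl restart graylog-server
sudo journalctl -u graylog-server -f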

Hi all,

the problem is almost solved…
It seems our OpenSearch cluster had run out of shards, i.e. the cluster-wide shard limit had been reached.
Adjusting that value also resolved the issue of not being able to retrieve the OpenSearch cluster status from Graylog…

Reference: Size your shards | Elasticsearch Guide [7.17] | Elastic
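For anyone hitting the same thing: the value in question is presumably the cluster-wide shard limit, cluster.max_shards_per_node (it is not named explicitly above, so treat that as an assumption). It can be checked and, as a stopgap, raised roughly like this; reducing the shard count as described in the guide above is the proper long-term fix:

$ curl -s -u admin:admin 'http://opensearch-node1:9200/_cluster/settings?include_defaults=true&filter_path=*.cluster.max_shards_per_node&pretty'
$ curl -s -u admin:admin -XPUT 'http://opensearch-node1:9200/_cluster/settings' \
    -H 'Content-Type: application/json' \
    -d '{ "persistent": { "cluster.max_shards_per_node": 3000 } }'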

HTH

Cheers

