Couldn't read Elasticsearch cluster health after extending LV and upgrading OpenSearch to 1.3.8

1. Describe your incident:
The OpenSearch cluster had storage issues and was hitting the high disk watermark threshold, so I extended the LV holding the OpenSearch data, upgraded OpenSearch to 1.3.8, and restarted the OpenSearch service.
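For context, the extension itself was nothing exotic; this is a minimal sketch of the kind of commands involved, assuming the data directory sits on an XFS filesystem on LVM (device, VG/LV and mount point names are placeholders, not the real ones):

$ df -h /var/lib/opensearch                           # check usage against the high disk watermark (90% by default)
$ sudo lvextend -L +100G /dev/vg_data/lv_opensearch
$ sudo xfs_growfs /var/lib/opensearch                 # grow the filesystem into the newly added space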

After the restart, Graylog's web interface complains with:

Could not retrieve Elasticsearch cluster health. Fetching Elasticsearch cluster health failed: There was an error fetching a resource: . Additional information: Couldn't read Elasticsearch cluster health

2. Describe your environment:

  • OS Information:
    Ubuntu 20.04 LTS for Graylog
    CentOS 7.9 for OpenSearch

  • Package Version:
    Graylog 4.3.9
    OpenSearch 1.3.8

  • Service logs, configurations, and environment variables:

Relevant config:

$ grep elasticsearch_ /etc/graylog/server/server.conf
elasticsearch_version = 7
elasticsearch_hosts = http://admin:admin@node-1:9200,http://admin:admin@node-2:9200,http://admin:admin@node-3:9200

From the Graylog logs:

2023-03-02T17:22:07.788+01:00 INFO [SearchDbPreflightCheck] Connected to (Elastic/Open)Search version OpenSearch:1.3.8
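For completeness, each host from elasticsearch_hosts can also be probed individually, using the same embedded-credentials URL form Graylog uses (hostnames exactly as in server.conf):

$ for n in node-1 node-2 node-3; do
    curl -s -o /dev/null -w "%{http_code} $n\n" "http://admin:admin@$n:9200/_cluster/health"
  done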

3. What steps have you already taken to try and solve the problem?

I checked that the OpenSearch status can be queried from the Graylog cluster nodes:

$ curl http://opensearch-node1:9200/_cluster/health?pretty -u admin:admin -k
{
  "cluster_name" : "opensearch-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "discovered_master" : true,
  "active_primary_shards" : 1798,
  "active_shards" : 1827,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 1169,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 60.9
}
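With ~1,169 shards still unassigned, the yellow state itself is not surprising right after the restart. Something like this (same host and credentials as above) summarizes why shards are unassigned, grouped by reason:

$ curl -s -u admin:admin 'http://opensearch-node1:9200/_cat/shards?h=state,unassigned.reason' \
    | grep UNASSIGNED | sort | uniq -c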

4. How can the community help?

Is there any reason for this to suddenly stop working? Or should I just be patient and wait for OpenSearch to catch up eventually?

TIA!

Hey @m_mlk

I see OpenSearch is in yellow.

You can execute something like this to see why it's in yellow:

curl -XGET http://opensearch-node1:9200/_cluster/allocation/explain?pretty

For troubleshooting, have you tried NOT using admin:admin to connect to the OpenSearch nodes, to see if that makes a difference?

Hi @gsmith

Thanks for your reply.

$ curl http://opensearch-node1:9200/_cluster/allocation/explain?pretty
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
  },
  "status" : 400
}
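That 400 just means there were no unassigned shards left at the moment the call was made. If it happens again, a specific shard can still be explained by naming it in the request body; the index name below is only an example:

curl -s -XGET 'http://opensearch-node1:9200/_cluster/allocation/explain?pretty' \
  -H 'Content-Type: application/json' \
  -d '{ "index": "graylog_0", "shard": 0, "primary": true }'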

The connection to OpenSearch has been working with admin:admin since day one.

Anyway, NOT using those credentials still seems to work… O_o

2023-03-03T09:49:03.581+01:00 INFO [MongoDBPreflightCheck] Connected to MongoDB version 5.0.13
2023-03-03T09:49:03.684+01:00 INFO [SearchDbPreflightCheck] Connected to (Elastic/Open)Search version OpenSearch:1.3.8

Still, the web GUI shows the same error message:

Could not retrieve Elasticsearch cluster health. Fetching Elasticsearch cluster health failed: There was an error fetching a resource: . Additional information: Couldn't read Elasticsearch cluster health

…but the cluster state is back to “green”:

graylog-node2 $ curl http://opensearch-node3:9200/_cluster/health?pretty
{
  "cluster_name" : "opensearch-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "discovered_master" : true,
  "active_primary_shards" : 1798,
  "active_shards" : 3000,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
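In case it helps narrow things down, I can also watch the Graylog side while reloading the overview page; this assumes the default log location of the Ubuntu package:

$ sudo tail -f /var/log/graylog-server/server.log | grep -iE 'elasticsearch|indexer|health'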

Ideas?

TIA!

Hey,

So somewhere you have a configuration that may be incorrect. If Graylog is telling you it cannot get the health status (i.e., via the API), something is either blocking it or misconfigured. You can try restarting the Graylog service as a starter, but we would need more info on the configuration changes you made.
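Something along these lines (assuming a systemd install) restarts the service and lets you watch what it logs when the health check fails:

sudo systemctl restart graylog-server
sudo journalctl -u graylog-server -f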

Hi all,

the problem is almost solved…
It seems our OpenSearch cluster had run out of shards, i.e. the cluster-wide shard limit had been reached.
Adjusting that value also resolved the issue of not being able to retrieve the OpenSearch cluster status from Graylog…

Reference: Size your shards | Elasticsearch Guide [7.17] | Elastic
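For anyone hitting the same thing: the value in question is presumably the cluster-wide shard limit, cluster.max_shards_per_node (it is not named explicitly above, so treat that as an assumption). It can be checked and, as a stopgap, raised roughly like this; reducing the shard count as described in the guide above is the proper long-term fix:

$ curl -s -u admin:admin 'http://opensearch-node1:9200/_cluster/settings?include_defaults=true&filter_path=*.cluster.max_shards_per_node&pretty'
$ curl -s -u admin:admin -XPUT 'http://opensearch-node1:9200/_cluster/settings' \
    -H 'Content-Type: application/json' \
    -d '{ "persistent": { "cluster.max_shards_per_node": 3000 } }'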

HTH

Cheers

