Migration from Elasticsearch to OpenSearch gone wrong?

1. Describe your incident:
We had a working 3-node cluster with some performance issues. We decided to split the services and migrate from Elasticsearch to OpenSearch (3 OpenSearch nodes, 3 Graylog + MongoDB nodes). We also agreed to start anew, so we didn’t migrate the old Elasticsearch data.

After a while we got it working smoothly, but are seeing 2 issues:

  • We can’t see the Elasticsearch/OpenSearch health in the dashboard anymore (System → Overview → Elastic Cluster). It just says:

Could not retrieve Elasticsearch cluster health. Fetching Elasticsearch cluster health failed: There was an error fetching a resource: Internal Server Error. Additional information: Couldn’t read Elasticsearch cluster health

I can’t find anything about this in the server.log file. Also, a curl against the OpenSearch cluster returns this:

{
  "cluster_name" : "graylog",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 25,
  "active_shards" : 27,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
  • Also, under System → Overview → System Messages, I find a message every 10 seconds:

There is no index target to point to. Creating one now.

In the logfile I found this:

2023-01-11T17:02:08.079+01:00 WARN [Indices] Couldn't create index gl-failures_0. Error: No index template provider found for type 'failures'
java.lang.IllegalStateException: No index template provider found for type 'failures'
at org.graylog2.indexer.IndexMappingFactory.resolveIndexMappingTemplateProvider(IndexMappingFactory.java:58) ~[graylog.jar:?]
at org.graylog2.indexer.IndexMappingFactory.createIndexMapping(IndexMappingFactory.java:50) ~[graylog.jar:?]
at org.graylog2.indexer.indices.Indices.buildTemplate(Indices.java:223) ~[graylog.jar:?]
at org.graylog2.indexer.indices.Indices.ensureIndexTemplate(Indices.java:173) ~[graylog.jar:?]
at org.graylog2.indexer.indices.Indices.create(Indices.java:210) ~[graylog.jar:?]
at org.graylog2.indexer.MongoIndexSet.cycle(MongoIndexSet.java:292) ~[graylog.jar:?]
at org.graylog2.indexer.MongoIndexSet.setUp(MongoIndexSet.java:260) ~[graylog.jar:?]
at org.graylog2.periodical.IndexRotationThread.checkAndRepair(IndexRotationThread.java:152) ~[graylog.jar:?]
at org.graylog2.periodical.IndexRotationThread.lambda$doRun$0(IndexRotationThread.java:90) ~[graylog.jar:?]
at java.lang.Iterable.forEach(Iterable.java:75) [?:?]
at org.graylog2.periodical.IndexRotationThread.doRun(IndexRotationThread.java:87) [graylog.jar:?]
at org.graylog2.plugin.periodical.Periodical.run(Periodical.java:94) [graylog.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]

2023-01-11T17:02:08.079+01:00 ERROR [IndexRotationThread] Couldn't point deflector to a new index
java.lang.RuntimeException: Could not create new target index <gl-failures_0>.
at org.graylog2.indexer.MongoIndexSet.cycle(MongoIndexSet.java:293) ~[graylog.jar:?]
at org.graylog2.indexer.MongoIndexSet.setUp(MongoIndexSet.java:260) ~[graylog.jar:?]
at org.graylog2.periodical.IndexRotationThread.checkAndRepair(IndexRotationThread.java:152) ~[graylog.jar:?]
at org.graylog2.periodical.IndexRotationThread.lambda$doRun$0(IndexRotationThread.java:90) ~[graylog.jar:?]
at java.lang.Iterable.forEach(Iterable.java:75) [?:?]
at org.graylog2.periodical.IndexRotationThread.doRun(IndexRotationThread.java:87) [graylog.jar:?]
at org.graylog2.plugin.periodical.Periodical.run(Periodical.java:94) [graylog.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]

I also checked `_template/gl*?pretty` and couldn’t find a gl-failures template:

curl -X GET "https://user:password@opensearchnode:9200/_template/gl*?pretty"
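For reference, a couple of other ways to check what templates exist on the cluster (hostname and credentials are placeholders; note that OpenSearch serves both the legacy `_template` endpoint and the composable `_index_template` endpoint):

```shell
# Legacy templates matching the Graylog prefix
curl -s "https://user:password@opensearchnode:9200/_template/gl*?pretty"

# Composable index templates, in case something landed there instead
curl -s "https://user:password@opensearchnode:9200/_index_template/gl*?pretty"

# Compact overview of all templates on the cluster
curl -s "https://user:password@opensearchnode:9200/_cat/templates?v"
```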

Does anyone have a hint for me?

2. Describe your environment:

  • OS Information:
    Ubuntu 20.04

  • Package Version:
    OpenSearch 2.4.1
    Graylog 5.0.2
    MongoDB 6.0.3

  • Service logs, configurations, and environment variables:
    Can be provided if needed

3. What steps have you already taken to try and solve the problem?
See above

4. How can the community help?
Maybe someone has an idea what went wrong here

Hello && welcome @PSchillmaier

The WARN shown is about this index.

Have you tried restarting the Graylog service? If so, what does the Graylog log file show?

Have you tried manually rotating the indices? What I mean is: in Graylog’s GUI, under index sets, there should be a drop-down in the upper left to rotate indices.

Also make sure your OpenSearch/Elasticsearch cluster is not in read-only mode.
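A quick way to check for that (hostname is a placeholder; add credentials if the security plugin is enabled):

```shell
# Look for read_only / read_only_allow_delete blocks on any index
curl -s "http://opensearchnode:9200/_all/_settings?pretty" | grep -i read_only

# If a block is set (often after a disk watermark was hit), clear it
curl -s -X PUT "http://opensearchnode:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'
```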

Thanks for your fast reply!

Yes, I already figured that out. Mine looks like this now:
[screenshot]

I just restarted the whole Graylog cluster (fully shut down all machines at once for a cluster cold boot) and the issue persists.

2023-01-12T07:16:49.154+01:00 INFO  [MongoIndexSet] Did not find a deflector alias. Setting one up now.
2023-01-12T07:16:49.160+01:00 INFO  [MongoIndexSet] There is no index target to point to. Creating one now.
2023-01-12T07:16:49.168+01:00 INFO  [MongoIndexSet] Cycling from <none> to <gl-failures_0>.
2023-01-12T07:16:49.168+01:00 INFO  [MongoIndexSet] Creating target index <gl-failures_0>.
2023-01-12T07:16:49.170+01:00 WARN  [Indices] Couldn't create index gl-failures_0. Error: No index template provider found for type 'failures'
java.lang.IllegalStateException: No index template provider found for type 'failures'

I see the very same error when I try to rotate the indices.

Can you point me in the right direction on how to check this? OpenSearch does index a lot of stuff, apparently (it has already indexed 100,000,000 documents).

I just stumbled upon a small notice in my logs that doesn’t make sense to me:

2023-01-12T07:16:49.086+01:00 INFO  [IndexerClusterCheckerThread] Indexer not fully initialized yet. Skipping periodic cluster check.

Thanks for your time and great work here to help me!

BR,
Phil

Also this still bothers me a lot

curl from all three Graylog nodes to all three OpenSearch nodes works and shows status green.
Any ideas?

Hey @PSchillmaier

If it’s indexing, then it’s not in read-only mode. The warning states it can’t find the index template, and then you have “Could not retrieve cluster health” in the GUI. I’m kind of leaning toward a configuration issue. I had this before, and it was the settings made in the elasticsearch.yml file and the Graylog configuration file. Double-check the settings for Graylog to connect to the Elasticsearch cluster (i.e., IP address, etc.).

In other words, you might want to bind your Graylog HTTP interface to your server’s IP. Not knowing how you configured your setup, this is just a guess. If you can show those configurations, it might help.

Here’s my config

server.conf on node-1

is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = <redacted>
root_password_sha2 = <redacted>
root_timezone = Europe/Berlin
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin

http_bind_address = <node-1-ip>:9000
http_external_uri = https://<node-1-fqdn>:9000/
http_enable_cors = true
http_enable_tls = true
http_tls_cert_file = /graylog/certs/<certificate>.cer
http_tls_key_file = /graylog/certs/<key>.key

elasticsearch_hosts = https://user:password@<node-4-fqdn>:9200,https://user:password@<node-5-fqdn>:9200,https://user:password@<node-6-fqdn>:9200
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = true
allow_highlighting = false
elasticsearch_analyzer = standard

output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
processor_wait_strategy = sleeping
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking

message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3

mongodb_uri = mongodb://user:password@<node-1-fqdn>:27017,<node-2-fqdn>:27017,<node-3-fqdn>:27017/graylog?replicaSet=graylog0
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5

proxied_requests_thread_pool_size = 32

opensearch.yml on node-4

path.data: /opt/graylog
path.logs: /opt/opensearch/logs

cluster.name: graylog
action.auto_create_index: false
cluster.initial_master_nodes: ["<node-4-fqdn>","<node-5-fqdn>","<node-6-fqdn>"]
node.name: "<node-4-fqdn>"
network.host: <node-4-ip>
discovery.seed_hosts: ["<node-4-fqdn>","<node-5-fqdn>","<node-6-fqdn>"]

plugins.security.ssl.transport.enabled: "true"
plugins.security.ssl.transport.pemcert_filepath: <certificate>.cer
plugins.security.ssl.transport.pemkey_filepath: <key>.key
plugins.security.ssl.transport.pemtrustedcas_filepath: <RootCA>.crt
plugins.security.ssl.transport.enforce_hostname_verification: false

plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: <certificate>.cer
plugins.security.ssl.http.pemkey_filepath: <key>.key
plugins.security.ssl.http.pemtrustedcas_filepath: <RootCA>.crt

plugins.security.authcz.admin_dn:
  - "<admin-dn>"

plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
plugins.security.system_indices.enabled: false
plugins.security.system_indices.indices: [".plugins-ml-model", ".plugins-ml-task", ".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opensearch-notifications-*", ".opensearch-notebooks", ".opensearch-observability", ".opendistro-asynchronous-search-response*", ".replication-metadata-store"]
plugins.security.nodes_dn: 
  - 'CN=<node-4-fqdn>'
  - 'CN=<node-5-fqdn>'
  - 'CN=<node-6-fqdn>'
node.max_local_storage_nodes: 3

I already tried disabling the OpenSearch security plugin; the error was the same. It seems that for some reason Graylog can’t recreate the gl-failures template (it recreated the others without any hassle). I can also create new indices via the GUI:
[screenshot]

I also already checked whether the local firewall is interfering (I even disabled UFW completely); same issue. The nodes are also in the same subnet, so there’s no firewall between them but the local one.

I even performed a complete cluster halt and restarted everything in order (OpenSearch first, followed a few minutes later by the Graylog nodes). The issue persists.

@PSchillmaier

So the only missing index template is gl-failures :thinking:

I was actually going to ask about the security plugin. When you disabled it, I assume you restarted the services for OpenSearch & Graylog?

The cluster health screenshot tells me that Graylog doesn’t have permissions for ES/OS. It seems like a connection issue between those two. :thinking:

I have a couple of suggestions for troubleshooting, and one relates to the following HTTPS setting:

elasticsearch_hosts = https://user:password@<node-4-fqdn>:9200,https://user:password@<node-5-fqdn>:9200,https://user:password@<node-6-fqdn>:9200
#### External Graylog URI
#
# The public URI of Graylog which will be used by the Graylog web interface to communicate with the Graylog REST API.
#
# The external Graylog URI usually has to be specified, if Graylog is running behind a reverse proxy or load-balancer
# and it will be used to generate URLs addressing entities in the Graylog REST API (see $http_bind_address).
#
# When using Graylog Collector, this URI will be used to receive heartbeat messages and must be accessible for all collectors.
#
# This setting can be overriden on a per-request basis with the "X-Graylog-Server-URL" HTTP request header.
#
# Default: $http_publish_uri
#http_external_uri =

I noticed it talks about:

“…in the Graylog REST API (see $http_bind_address).”

So here is mine, I also use certs.

[root@graylog graylog_user]# cat /etc/graylog/server/server.conf  | egrep -v "^\s*(#|$)"
is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = epOqmLi7r7CdZxl76QOQxr8bRSKlKXjMQG9ojc0bn22EBUJgbD
root_password_sha2 = 5e884898da28047151d0e5ef721d1542d8
root_email = "greg.smith@domain.com"
root_timezone = America/Chicago
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = graylog.enseva-labs.net:9000
http_publish_uri = https://graylog.enseva-labs.net:9000/
http_enable_cors = true
http_enable_tls = true
http_tls_cert_file = /etc/ssl/certs/graylog/graylog-certificate.pem
http_tls_key_file = /etc/ssl/certs/graylog/graylog-key.pem
http_tls_key_password = secret
elasticsearch_hosts = http://192.168.1.100:9200,http://192.168.1.102:9200,http://192.168.1.103:9200
rotation_strategy = count

I think one of the reasons why the Graylog configuration example only shows HTTP:

# Default: http://127.0.0.1:9200
#elasticsearch_hosts = http://node1:9200,http://user:password@node2:19200

Because there isn’t a configuration for HTTPS, hence why the documentation states to “Disable the Security plugin” on OpenSearch.

So if this were mine, I would adjust the following settings, along with the proper restarts, as follows.

## Just set publish_uri &  comment out *http_external_uri*
http_publish_uri = https://<node-1-fqdn>:9000/
### set URI HTTP
elasticsearch_hosts = http://<node-4-fqdn>:9200,http://<node-5-fqdn>:9200,http://<node-6-fqdn>:9200

Then disable the Security plugin as shown in the docs.
Restart services.

Correct, the only index template missing is the failures.

Yes, just redid everything in case I missed something. Same error

Also, thanks for your config. I should add that we have an LB in front of our first 3 nodes, which makes http_external_uri mandatory; it worked great before we switched to OpenSearch.

Turned off OpenSearch security:

user@node-1 ~ $ curl -X GET "http://<node-4-fqdn>:9200/_cluster/health?pretty=true"
{
  "cluster_name" : "graylog",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 29,
  "active_shards" : 31,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

(note that the curl works without https and without authentication from node-1)

Even after I set elasticsearch_hosts to HTTP-only and without authentication, the gl-failures template isn’t recreated and I still see the cluster health error.

I’m nearly at the point where I just switch back to Elasticsearch, tbh. One of the main reasons why we switched was the enhanced security with certificates and the authentication.

It just came to me… Could this be caused by old data in the MongoDB? Like, I don’t know, cached connection strings, maybe? Like I said, we shifted the indexer from nodes 1-3 to 4-6.

Hey

Not sure, it might be.

As for the curl commands: I can execute curl against all my ES/OS clusters, BUT if Graylog cannot connect, that really doesn’t mean anything except that your clusters are fine.

I agree, but it doesn’t now. By default it will use http_publish_uri.

I quote:

The public URI of Graylog which will be used by the Graylog web interface to communicate with the Graylog REST API.

Not sure if you’re able to test this out in a lab or non-production environment.

Oh, I see. I just disabled http_external_uri and enabled http_publish_uri, and it still works via the LB. Thanks for that, I didn’t know this.

I fiddled a bit more with it now and can’t get rid of the errors. I even tried to use IP addresses in the elasticsearch_hosts parameter.

Current state is like this:

  • Opensearch security disabled
  • http_external_uri disabled
  • http_publish_uri enabled, set to https
  • http_bind_address set to hostname
  • elasticsearch_hosts set to http and ip addresses
  • ufw disabled on all machines
  • no other firewall between those machines
  • All machines / services have been restarted.

I still suspect my MongoDB instance here. Unfortunately, I really have no clue how to check what data might be stuck in there, and I really don’t want to drop that database :slight_smile:
Maybe it helps to know that we started with Graylog 3.3 and kept it up to date. Maybe there is something in the DB that produces this.
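In case someone can point me further: as far as I can tell, Graylog keeps its index set definitions in MongoDB (on my install in the `index_sets` collection of the `graylog` database; collection names may differ between versions), so inspecting them without dropping anything should be possible with something like this (connection string is a placeholder):

```shell
# Print every stored index set definition from Graylog's MongoDB database
mongosh "mongodb://user:password@<node-1-fqdn>:27017/graylog?replicaSet=graylog0" \
  --eval 'db.index_sets.find().forEach(printjson)'
```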

Maybe I can create a test replication set some time this week…

Still: thanks a lot, @gsmith !

Hey,

Not 100% sure what’s going on, but if you have 3 ES nodes, the Graylog configuration should match the Elasticsearch/OpenSearch YAML configuration, as shown in this section here.

Is the error you’re stating the one from above in this post, about Elasticsearch cluster health?

Hey,

I just dropped the graylog database in my MongoDB and restarted Graylog. Guess what:

So it IS caused by my MongoDB. For now I will try to recreate my stuff and move on.
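For anyone trying the same: I’d take a dump of the database first so the drop is reversible (URIs and paths are placeholders):

```shell
# Back up the graylog database before dropping it
mongodump --uri="mongodb://user:password@<node-1-fqdn>:27017/graylog?replicaSet=graylog0" \
  --out=/backup/graylog-$(date +%F)

# Restore it later if needed, limited to the graylog namespace
mongorestore --uri="mongodb://user:password@<node-1-fqdn>:27017/?replicaSet=graylog0" \
  --nsInclude='graylog.*' /backup/graylog-<date>
```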

[EDIT] Oh, by the way, it also works with OpenSearch security enabled :slight_smile:

Many thanks to you!


Awesome, thanks for the feedback.
