Search failing to load when adding nodes to a cluster

I have an issue searching when running a multi-node setup. If I run just a single node, things are okay. As soon as I spin up a second or third node, I get the spinning “Updating search results” which changes to “This is taking a bit longer, please hold on” and ultimately nothing. If I delete the additional node(s), it all starts working perfectly again.

From what I can tell, nothing is wrong with MongoDB or OpenSearch. I don’t see any errors or issues in my logs. I am completely baffled as to what would even cause this, and with no errors being reported, I can’t think what to do next. I normally bang my head against a problem until I figure it out, but I’ve tried that and am getting nowhere, so I figured I’d break down and ask the question here.

Running the k3s (v1.31.2+k3s1) distribution of Kubernetes on Ubuntu 22.04
Graylog server 6.1.2+c8b654f
JRE: Eclipse Adoptium 17.0.13 on Linux 6.8.0-48-generic
Deployment: docker

mongodb-agent:12.0.24.7719-1
opensearch:2.18.0

Here is the startup log from my master node:

And then here is the first additional node in the cluster:

Data seems to be collecting fine, no issues with inputs.
OpenSearch health is fine.
All nodes on the status page in graylog seem fine, no errors or warnings.
No issues reported under any of the indexes.

Really at a loss and very confused. I appreciate in advance any assistance anyone can provide on what I might look at next to figure this one out. The good news is that if I just run one node, it works okay, but that’s not the setup I want.

I have a bad feeling I’m going to feel really stupid when someone points out the obvious problem that I missed :frowning:

If you look up online how to use curl and run a query directly against the OpenSearch API, do those results show correctly?
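For example, something along these lines from a box that can reach OpenSearch (a rough sketch; adjust the host, and graylog_* assumes the default index prefix):

curl -s http://<opensearch-host>:9200/_cluster/health?pretty
curl -s "http://<opensearch-host>:9200/graylog_*/_search?q=*&size=1&pretty"

The first checks cluster health; the second pulls a single document back from the Graylog indices, which is roughly what the search page is doing under the hood.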

Are you listing all of the OpenSearch nodes in your server.conf file and in the OpenSearch configs?
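For reference, Graylog takes a comma-separated list there, so with multiple nodes it would look something like this (made-up hostnames):

elasticsearch_hosts = http://opensearch-0:9200,http://opensearch-1:9200,http://opensearch-2:9200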

Thanks @Joel_Duffield

Yes, OpenSearch seems fine. That is actually where I started. But those tests go through a load balancer that sits in front of OpenSearch.

I’ve configured graylog with the following to reach opensearch:

elasticsearch_hosts = http://opensearch-cluster-master.opensearch.svc.cluster.local.:9200

Oh, I think you just figured it out! I’m wondering if this is a cross-node issue. I’m going to check whether graylog-0 works because it’s on the same k3s node as the OpenSearch master, while graylog-1 fails because it’s on a different k3s node… I’ll come back with more test results, but that would explain the problem. Then I’d just need to figure out why :slight_smile:
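To check the placement, I’ll compare the NODE column for the Graylog and OpenSearch pods with something like this (the opensearch namespace I know from the service FQDN; the graylog namespace name is a guess here):

kubectl get pods -n graylog -o wide
kubectl get pods -n opensearch -o wide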

Nope, it was a good theory, but I don’t think that is the problem. When I connect to both pods directly, I can curl OpenSearch from both nodes using that URL with no problem, and it hits each of the different OpenSearch nodes.

╰─○ kubectl describe service/opensearch-cluster-master | grep Endp
Endpoints:                192.168.125.102:9200,192.168.125.106:9200,192.168.96.173:9200
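To rule out a single dead backend, the endpoints can also be curled directly by IP from inside a pod (a quick sketch using the IPs above):

curl -s http://192.168.125.102:9200 | grep name
curl -s http://192.168.125.106:9200 | grep name
curl -s http://192.168.96.173:9200 | grep name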

I’ll keep looking. This is definitely something strange in how I have Graylog set up, though. I’m pretty sure OpenSearch is fine. This used to work, and I’m not exactly sure what I changed in the environment that broke it. I’m sure it’s something I did; I’m just trying to figure out what.

I have confirmed further that it’s not an issue with communications across nodes. I ran the active Graylog node on my -2 and -3 servers, and they can query fine, as long as no other Graylog servers are running. As soon as I have 2 or more nodes in my Graylog cluster, I start having issues with querying. Strange, right!? :slight_smile:

Wait, you’re putting a load balancer between Graylog and OpenSearch? That’s not how it’s built to work.

No, sorry, let me clarify. The manual test goes through a load balancer; that’s how I normally expose services on my network. However, the config in Graylog uses the Kubernetes service address, which resolves to the three different OpenSearch nodes. So if I manually run curl from the Graylog pod, I see each of the different servers. For example:

graylog@graylog-0:~$ curl http://opensearch-cluster-master.opensearch.svc.cluster.local.:9200 -s | grep name
  "name" : "opensearch-cluster-master-0",
  "cluster_name" : "opensearch-cluster",
graylog@graylog-0:~$ curl http://opensearch-cluster-master.opensearch.svc.cluster.local.:9200 -s | grep name
  "name" : "opensearch-cluster-master-1",
  "cluster_name" : "opensearch-cluster",
graylog@graylog-0:~$ curl http://opensearch-cluster-master.opensearch.svc.cluster.local.:9200 -s | grep name
  "name" : "opensearch-cluster-master-1",
  "cluster_name" : "opensearch-cluster",
graylog@graylog-0:~$ curl http://opensearch-cluster-master.opensearch.svc.cluster.local.:9200 -s | grep name
  "name" : "opensearch-cluster-master-2",
  "cluster_name" : "opensearch-cluster",
graylog@graylog-0:~$ curl http://opensearch-cluster-master.opensearch.svc.cluster.local.:9200 -s | grep name
  "name" : "opensearch-cluster-master-1",
  "cluster_name" : "opensearch-cluster",
graylog@graylog-0:~$ curl http://opensearch-cluster-master.opensearch.svc.cluster.local.:9200 -s | grep name
  "name" : "opensearch-cluster-master-2",
  "cluster_name" : "opensearch-cluster",
graylog@graylog-0:~$ curl http://opensearch-cluster-master.opensearch.svc.cluster.local.:9200 -s | grep name
  "name" : "opensearch-cluster-master-2",
  "cluster_name" : "opensearch-cluster",

I could change this value to test, but I haven’t yet. The reason being, I don’t see how it would only affect things when more than one node is running; it’s the same value either way, and it works regardless of the number of nodes running. That’s the part I can’t figure out: what Graylog does differently when running a query once it has a live partner. Also, I know Graylog is seeing all the nodes, as they show up in the interface too.
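For what it’s worth, the node list can also be pulled from the Graylog REST API rather than the UI, with something like this (assuming the default web port 9000 and an admin user):

curl -s -u admin:<password> http://graylog-0:9000/api/cluster

which should return an entry per active Graylog node.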

So to clarify, adding and removing OpenSearch nodes isn’t the issue; the issue is when you add additional Graylog nodes?

If so, and you have, say, 3 Graylog nodes running and you connect to the web interface of each directly, bypassing the load balancer, does a query work from one but not the others? Say, the leader works but not the two member nodes?

That is correct. OpenSearch doesn’t seem to be a problem; it’s just adding Graylog nodes. My LB does round robin, and I can see which node I’m on when I run queries. I don’t have direct access to them from my desktop currently, as they are only in the Kubernetes network, so I only reach them through the LB. I can make that change later, though, and play around more. Since I can see which node I’m connected to, I have confirmed that they all end up having problems as I move through them. But queries will sometimes work, so it’s intermittent, and I have not found a pattern for when they work. It’s a good point, though, so let me do that and provide the results so I can be specific. Thanks.
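For the direct test, I should be able to port-forward to each pod from my desktop instead of going through the LB, something like this (assuming a graylog namespace and the default web port 9000):

kubectl -n graylog port-forward pod/graylog-0 9000:9000
kubectl -n graylog port-forward pod/graylog-1 9001:9000
kubectl -n graylog port-forward pod/graylog-2 9002:9000

Then I can hit localhost:9000/9001/9002 and run the same query against each node.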