I have an issue with search when running a multi-node setup. If I am running just a single node, things are okay. As soon as I spin up a second or third node, I get the spinning “Updating search results”, which changes to “This is taking a bit longer, please hold on”, and ultimately nothing. If I delete the additional node(s), it all starts working perfectly again. From what I can tell, nothing is wrong with MongoDB or OpenSearch, and I don’t see any errors or issues in my logs. I am completely baffled as to what would even cause this, and with no errors being reported, I can’t think what to do next. I normally bang my head against a problem until I figure it out, but I’ve tried that and am getting nowhere, so I figured I’d break down and ask the question here.
Running k3s (v1.31.2+k3s1) version of kubernetes on Ubuntu 22.04
Graylog server 6.1.2+c8b654f
JRE: Eclipse Adoptium 17.0.13 on Linux 6.8.0-48-generic
Deployment: docker
mongodb-agent:12.0.24.7719-1
opensearch:2.18.0
Here is the startup log from my master node:
And then here is the first additional node in the cluster:
Data seems to be collecting fine, no issue with input.
Opensearch health is fine.
All nodes on the status page in graylog seem fine, no errors or warnings.
No issues reported under any of the indexes.
Really at a loss and very confused. I appreciate in advance any assistance anyone can provide on what I might look at next to figure this one out. The good news: if I just run one node, it works okay, but that’s not the setup I want.
Oh, I think you just figured it out! I’m wondering if this is a cross-node issue. I’m going to check whether it’s because graylog-0 works and is on the same k3s node as the opensearch master, but fails when a graylog-1 node is on a different k3s node. I’ll come back with more test results, but that would make sense as a description of the problem. Then I’d just need to figure out why.
Nope, it was a good theory, but I don’t think that is the problem. When I connect to both pods directly, I can curl opensearch from both nodes using that URL no problem, and it hits each of the different opensearch nodes.
I’ll keep looking. This is definitely something strange in how I have Graylog set up, though. I’m pretty sure OpenSearch is fine. This did work before, and I’m not exactly sure what I changed in the environment that broke it. I’m sure it’s something I did; I’m just trying to figure out what.
I have confirmed further that it’s not an issue with communications across nodes. I ran the active graylog node on my -2 and -3 servers, and they can query fine, as long as no other graylog servers are running. As soon as I have 2 or more nodes in my graylog cluster, I start having issues with querying. Strange, right!?
No, sorry, let me clarify. When running the manual test, it goes through a load balancer; that’s how I normally expose services on my network. However, the config in Graylog is the kubernetes service address, which resolves to the three different opensearch nodes. So if I manually run curl from the graylog pod, I see each of the different servers. For example:
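Something along these lines (the service name and port below are just placeholders for illustration; substitute the actual kubernetes service address from your Graylog config):

```shell
# Run from inside a graylog pod. Repeating the call returns a different
# opensearch node "name" each time, because the Service distributes
# requests across the three opensearch pods.
for i in 1 2 3; do
  curl -s http://opensearch.graylog.svc.cluster.local:9200/ | grep '"name"'
done
```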
I could change this value to test, but I haven’t yet, because I don’t see how it would affect things only when more than one node is running; it’s the same value either way, and it works regardless of the number of nodes running. That’s the part I can’t figure out: what graylog does differently, when running a query, once it has a partner live. Also, I know graylog is seeing all nodes, as they show up in the interface too.
So to clarify, adding and removing opensearch nodes isn’t the issue; the issue is when you add additional graylog nodes?
If so, if you have say 3 graylog nodes running and you connect to the web interface of each directly, bypassing the load balancer, does a query work from one but not the others, say the leader works but not the two member nodes?
That is correct. OpenSearch doesn’t seem to be the problem; it’s just adding graylog nodes. My LB does round robin, and I can see which node I’m on when I run queries. I don’t have direct access to them from my desktop currently, as they are only in the kubernetes network, so I only reach them through the LB. I can make that change later, though, and play around more. But I have confirmed, since I can see which one I’m connected to, that they all end up having problems as I move through them. But queries will sometimes work, so it’s intermittent, and I have not found a pattern for when they work. It’s a good point, though, so let me do that and provide the results so I can be specific. Thanks.
So strange… yes, I can confirm that each node stops working correctly as soon as it has a sibling. I almost just want to start over.
What I find frustrating is just not knowing why. Something in my recent upgrades did this, but I did several, so I can’t tell which one would cause it. Everything else seems to be working just fine. Oh well.
I have completely wiped out all persistent volumes and nodes and booted fresh; same thing. Maybe I’ll wipe out all my settings and MongoDB next.
I think I might know what is happening here. And now what I can’t understand is how this ever worked for me. I’ve been running like this for ages.
Here is my theory. When running with a single node, an LB is no problem; all requests go to one place. With multiple nodes, the page loads fine from one server, but when a search is run, the search job is created on one node, and the follow-up requests might get directed to different nodes.
When I look at the console, I see a 404 on the search job. This makes me think the search job lives on the server originally loaded, but the follow-up request was sent to a different node.
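To make the theory concrete, here’s a toy simulation (not Graylog code, just an illustration of the general failure mode): each node keeps its search jobs in local memory, and a round-robin balancer hands the follow-up poll to a different node, which answers 404.

```python
import itertools

class Node:
    """Toy app node that keeps its search jobs in local memory only."""
    def __init__(self, name):
        self.name = name
        self.jobs = {}

    def create_search(self, job_id):
        # Job state exists only on this node
        self.jobs[job_id] = "results"
        return 200

    def poll_search(self, job_id):
        # 404 if the job was created on a different node
        return 200 if job_id in self.jobs else 404

nodes = [Node("graylog-0"), Node("graylog-1"), Node("graylog-2")]
lb = itertools.cycle(nodes)  # round-robin load balancer

# The browser creates a search job on whichever node the LB picks...
first = next(lb)
first.create_search("job-1")
# ...then polls for results; round robin hands the poll to the next node.
second = next(lb)
status = second.poll_search("job-1")
print(first.name, second.name, status)  # graylog-0 graylog-1 404
```

With a single node (or with sticky routing), both requests land on the same node and the poll succeeds, which matches the behavior I’m seeing.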
I have confirmed the issue. I tried to get sticky sessions set up, but I’m not doing something right there. I just forced all traffic to a single node, and that works: all three nodes can be up, but all requests go to the first node. So I believe my theory is right. The search is getting sent to a different node than the one originally visited, so the search job is unknown to the node handling the follow-up request.
Now to figure out a good solution. Folks must run Graylog behind traffic managers; do the sessions need to be sticky?
Time to head back to the docs, but I do welcome any thoughts folks have on their setups if they use traefik or another LB.
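For what it’s worth, here is the direction I was trying for sticky sessions with traefik’s Kubernetes Ingress provider, which supports cookie-based session affinity via annotations on the Service (a sketch only; the service name, cookie name, and port are my own placeholders):

```yaml
# Annotations on the graylog Service, traefik kubernetes-ingress provider.
apiVersion: v1
kind: Service
metadata:
  name: graylog
  annotations:
    traefik.ingress.kubernetes.io/service.sticky.cookie: "true"
    traefik.ingress.kubernetes.io/service.sticky.cookie.name: "graylog_sticky"
spec:
  selector:
    app: graylog
  ports:
    - port: 9000
      targetPort: 9000
```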
Your http_publish_uri is the load balancer address, but it needs to be the address that the other nodes use to communicate with that node. Only http_external_uri should be the load balancer address.
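Roughly like this in server.conf (a sketch; the hostnames are placeholders, and http_publish_uri must be different on each node):

```ini
# server.conf on graylog-0 -- hostnames here are illustrative
http_bind_address = 0.0.0.0:9000

# Address the OTHER graylog nodes use to reach THIS node (per-node value)
http_publish_uri = http://graylog-0.graylog.svc.cluster.local:9000/

# Address browsers use -- this one can be the load balancer
http_external_uri = http://graylog.example.com/
```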
Thank you @Joel_Duffield and @kpearson! I was not reading that config option correctly; that was exactly the problem. What I can’t figure out is how it was working at all before??? I went back through my git history, and that setting has never changed. Anyway, it’s working now. Great… moving on. Really appreciate the help and patience. Sure enough, my original comment came true: I said I’d feel stupid when I learned what the issue was.
On the bright side, maybe someone else will come along some day with this same problem and this post will help them so they don’t have to feel as stupid.