1. Describe your incident:
I wanted to free up a bit of disk space, and most of the data (280 GB) was in the “graylog_events” index. So I lowered the retention period, then ran into errors, and the dialog more or less suggested that I could delete the shards.
Since the data isn’t terribly important yet, I figured “okay, I guess I’m going to lose everything that wasn’t routed to streams with other index sets, but here we go.” But now no messages seem to arrive in the default stream at all anymore, while they still arrive in some of the other streams (which, to my understanding, are routed through the default stream first? I just remove them from it after rerouting).
I tried a full restart of the cluster, but still no messages are arriving in the default stream.
What was the previous retention period and policy? What is it now?
Are you missing logs from any particular hosts?
What were the error messages you ran into after changing the retention period?
Does anything show up in graylog-server logs?
If so, post them. If something is wrong with your server it will tell you; you need to look at the logs in /var/log/graylog-server/ and the system journal.
The only way we can help you is if you give us the exact error messages you are experiencing.
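For reference, something along these lines should surface any real errors (paths assume a standard package install; adjust if yours differ):

# Follow the Graylog server log
tail -f /var/log/graylog-server/server.log

# Pull recent warnings/errors out of it
grep -E "WARN|ERROR" /var/log/graylog-server/server.log | tail -n 50

# And check the systemd journal for the service
journalctl -u graylog-server.service --since "1 hour ago"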
The server is not showing any errors in the web UI, and in the log files, to the best of my knowledge, there are only malformed messages or non-TLS connection attempts from some clients on TLS inputs (which had already been happening before).
journalctl -u graylog-server.service only gives me a warning about an unsupported reflection call: WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
But this has been there forever, too.
I think I might have F*ed up by changing the retention period from P30D (30 days) with a maximum of 3 indices to something like P20D with a maximum of 2 indices.
After that I had some errors, but I don’t fully remember what they said. It also warned me about something like all primaries being deleted if I pressed delete, but I took that as “oh, at worst I probably just lose all the data?” (the more important data is in a different stream with a different index set behind it).
I probably should have paid more attention to what exactly would happen, but with how little I understand about the underlying OpenSearch mechanisms it would have been guesswork anyway, and I didn’t think I’d do more than erase data, which I was okay with in this case.
One of my two data nodes keeps running full sweeps, if that is any indication of anything.
edit: I just noticed this shows in “green” and thus is not an error, but it might be relevant: 1 indices with a total of 290,659 messages under management, current write-active index is gl-events_3.
Hm, graylog-events holds logs about what’s happening with the server and what operations are being done, e.g. inputs starting, system jobs being run, and so on.
Deleting data from there should not cause any issues. Also, data from the default stream is not routed into this index. As a matter of fact, it is impossible to assign any stream to it: graylog-events is strictly a system index, and you cannot route any messages there even if you tried.
edit: I just noticed this shows in “green” and thus is not an error, but it might be relevant: 1 indices with a total of 290,659 messages under management, current write-active index is gl-events_3
Nothing wrong with that message; I have an almost identical one (except for the number of messages) on my server for that index.
It means that the server has routed 290,659 messages into this index set, and gl-events_3 is just the name of the current index (gl-events is the prefix). Once it has been closed/deleted (depending on your policy), Graylog will start writing data into a new index, likely gl-events_4.
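If you want to see that rotation for yourself, listing the event indices directly in OpenSearch shows the numbered generations (this assumes the OpenSearch HTTP API is reachable on localhost:9200; adjust host, port, and credentials to your setup):

curl -s "http://localhost:9200/_cat/indices/gl-events*?v&s=index"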
Now, you say that no messages are showing up in the Default Stream any longer? Side note: only messages that are not routed to any other stream wind up there. So no, if a message is being routed into stream X, it is not going to be in the Default Stream first.
Just to make sure: there were messages in the default stream before you changed the retention policy? Say there was an average of 20 msg/s in this stream, and it dropped to 0 and stayed at 0 directly after you changed the policy?
Any particular hosts that you’re missing data from? Any particular Inputs that are now empty?
There is a sharp cutoff at about 10:25 today, after which no messages arrive in the default stream anymore. In the meantime I have been able to confirm that my pipelines connected to the default stream still see decently high throughput… oddly enough, all streams but one are empty after the cutoff, even all the Graylog default ones.
The even odder thing: I have one pipeline that bundles input from a specific domain and adds a few extra fields that it parses from the messages… this pipeline routes into a stream that is also empty. The weird part is that it also reroutes the messages into a second stream that does further processing so we can run basic analytics on the data without any personal details attached (so we can store it longer), and THIS stream still shows up properly, as if nothing had happened. All other streams have been empty since roughly 10:25 today, when I deleted the shards or whatever exactly I was doing. Maybe I should try lowering the retention policy even further now to see what options I was presented with, so that with some luck I can give better information about what exactly I did.
But to my understanding I deleted shards for which no primary was available anymore… so maybe that was important data Graylog needed to index and find things?
I think everything is still working, as the message throughput is still there and even visible on the pipeline overview page… it just seems that Graylog “forgot” how to assign the messages to the correct streams (even the default one) or something… so they’re invisible to me now even though they’re getting processed… but sadly I don’t understand enough to fix that or force Graylog to recalibrate… or to know what exactly I broke.
edit: I also didn’t stop or start any inputs, or change them in any way.
edit2: to clarify, all the pipelines (except one) are connected to the default stream, and some of them remove the messages from the default stream after rerouting, but not all do. Only the stream fed by a “second-layer pipeline”, which is connected to a different stream than the default stream, shows any messages at all. But the “first-layer pipeline” that produces the messages for the stream the “second-layer pipeline” is connected to is itself connected to the default stream… this further hints that the messages get processed just fine, but can’t be matched properly when queried for.
And one of the two data nodes has been running full sweeps all day now:
tail -f /var/log/opensearch/graylog.log
[2024-06-20T16:37:42,698][INFO ][o.o.j.s.JobSweeper ] [graylog-data1] Running full sweep
[2024-06-20T16:42:42,700][INFO ][o.o.j.s.JobSweeper ] [graylog-data1] Running full sweep
[2024-06-20T16:47:42,701][INFO ][o.o.j.s.JobSweeper ] [graylog-data1] Running full sweep
[2024-06-20T16:52:42,702][INFO ][o.o.j.s.JobSweeper ] [graylog-data1] Running full sweep
[2024-06-20T16:57:42,703][INFO ][o.o.j.s.JobSweeper ] [graylog-data1] Running full sweep
[2024-06-20T17:02:26,992][INFO ][o.o.a.t.CronTransportAction] [graylog-data1] Start running AD hourly cron.
[2024-06-20T17:02:26,995][INFO ][o.o.a.t.ADTaskManager ] [graylog-data1] Start to maintain running historical tasks
[2024-06-20T17:02:42,704][INFO ][o.o.j.s.JobSweeper ] [graylog-data1] Running full sweep
[2024-06-20T17:07:42,704][INFO ][o.o.j.s.JobSweeper ] [graylog-data1] Running full sweep
[2024-06-20T17:12:42,705][INFO ][o.o.j.s.JobSweeper ] [graylog-data1] Running full sweep
Ok, this sounds like an issue with OpenSearch to me.
To understand what you might have done, I looked up how exactly OpenSearch works with shards and indexes (I’m no expert in OpenSearch). From this blog:
An OpenSearch index is composed of shards. Each document in an index is stored in the shards of an index. An index can have two types of shards, primary and replica. When you write documents to an OpenSearch index, indexing requests first go through primary shards before they are replicated to the replica shard(s). Each primary shard is hosted on a data node in an OpenSearch domain. When you read/search data in OpenSearch, a search request may interact with a number of replica or primary shards. Replica shards are automatically updated, mirroring their corresponding primary shards.
So in short, when a document is written to an index it first goes to a primary shard and is then copied to the replica shard(s).
What this might mean in your case is that you either deleted all the primary shards, or deleted enough of them that the index can’t function properly.
I keep saying “might” because I’m not 100% sure; since you yourself cannot recall the exact error, I’m making reasonable assumptions.
But this gives us more insight. Check the OpenSearch logs and cluster health.
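A quick way to check both is the OpenSearch REST API, for example (assuming the API is reachable on localhost:9200 from one of the data nodes; add credentials/TLS options if your cluster is secured):

# Overall cluster health: status and number of unassigned shards
curl -s "http://localhost:9200/_cluster/health?pretty"

# Per-shard view: look for primaries (prirep = p) that are not STARTED
curl -s "http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node"

# Ask the cluster why a shard is unassigned, if any are
curl -s "http://localhost:9200/_cluster/allocation/explain?pretty"

If any primary shards show up as UNASSIGNED, the cluster will not be green and searches against the affected indices can come back incomplete.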
Oh, I forgot to mention that the cluster health itself was green too; I did check that but forgot to include it.
But I think the issue fixed itself overnight… after a good night’s rest my streams show up properly again, even the messages that arrived during the time when nothing was showing.
I suppose it just took a long time for OpenSearch to recover from my rather rude deletion of shards?
I’ll definitely read up more on OpenSearch, though, to make sure I have a better understanding of what I’m doing… thank you for your help.
Did you try to clean up and recalculate all index ranges under System → Indices, in the index set’s maintenance options?
Reconfiguring an index set should normally not lead to this kind of problem. I have made those changes multiple times, and Graylog reworks the indexes in its maintenance jobs.
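If it helps, the recalculation does not have to happen through the UI; recent Graylog versions expose it through the REST API as well (host, port, and the graylog_0 index name below are placeholders, so check your version’s API browser for the exact paths):

# Rebuild index ranges for all index sets
curl -u admin -H "X-Requested-By: cli" -X POST "http://graylog.example.org:9000/api/system/indices/ranges/rebuild"

# Rebuild the range for a single index
curl -u admin -H "X-Requested-By: cli" -X POST "http://graylog.example.org:9000/api/system/indices/ranges/graylog_0/rebuild"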