1. Describe your incident:
I am unable to make a data lake work properly. I have tried with Graylog 6.2.1 and 6.2.2, on a cluster updated from previous versions.
I have an input feeding a stream called “AIS”. The requirement for messages to be routed to that stream is the presence of a field called “mssi”. Both “mssi” and the “type” field mentioned below are numeric.
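For context, the routing condition is equivalent to a pipeline rule like this minimal sketch (in practice it is a plain field-presence stream rule; the remove_from_default choice here is just illustrative):

```
rule "route messages carrying mssi to AIS"
when
  // Same field-presence check as the stream rule
  has_field("mssi")
then
  route_to_stream(name: "AIS", remove_from_default: true);
end
```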
I have created a data lake backend (I tried both a filesystem and S3-backed storage based on MinIO).
In order to decide which messages to route to the data lake and which to OpenSearch, I used the field “type”. Messages with type <> 11 would go to the data lake, while messages with type == 11 would stay in the OpenSearch stream. I just wanted to test the data lake feature.
It seems it didn’t really write to the data lake. On 6.2.1 I recall the stream filter inside the data lake configuration kind of worked, but on 6.2.2 it seems to be completely ignored. No matter what I put in it, I find messages with type 11 as well as others.
I understand that the stream filter determines which messages will not be routed to the OpenSearch index, so I defined type <= 10 and type >= 12. The field is numeric, of course.
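In pipeline-rule terms, the split I am after would look something like this (just a sketch to show the intent, not what the data lake filter UI actually generates; the stream name “AIS-archive” is made up):

```
rule "send everything except type 11 to the archive stream"
when
  has_field("type") && to_long($message.type) != 11
then
  // Hypothetical stream with data lake routing enabled.
  route_to_stream(name: "AIS-archive", remove_from_default: false);
end
```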
2. Describe your environment:
OS Information: FreeBSD 14.2
Package Version: Graylog 6.2.2
Service logs, configurations, and environment variables:
OpenSearch 2.19.2 (I know; later I realised that I should not have updated OpenSearch)
MongoDB 6.0.19
3. What steps have you already taken to try and solve the problem?
I tried removing the data lake configuration and recreating it. I made it worse: when I deleted everything, a value called “AIS” somehow remained stuck somewhere, so when I go to the data lake “preview”, the stream name “AIS” appears twice.
4. How can the community help?
Has anyone else tried the data lake feature?
I think it is a bit confusing, and I recall a difference between 6.2.1 and 6.2.2 when defining a stream filter under the data lake configuration. I think 6.2.1 had a “true/false” toggle that was removed in 6.2.2? Sorry about this sketchy report, but when you are trying something out you don’t imagine that you might end up reporting a bug!
When you are talking about streams, do you have two streams, and you are routing messages to both before the data lake and then sending one to the data lake and one to the index? Or do you have one stream and you want to route a subset of that stream into the data lake?
Also what kind of license is applied to the cluster?
Also, this being an experimental cluster, I have configured a new OpenSearch cluster running 2.15.0 just in case I had messed something up.
With brand-new data (keeping the rest of the configuration) I have tried to enable the data lake.
I have tried three kinds of backends:
Filesystem (ZFS dataset)
Filesystem (conventional FFS filesystem just in case)
S3 backend (based on Minio)
Behavior is the same for all three attempts. I have added routing to a stream; I haven’t configured any filters. The result looks like this screenshot:
Checking the relevant directories or the S3 backend, I have seen that no data is being stored.
Moreover, I have tried to route a second stream (in this case NetFlow data) in the same way, with no filters. Behavior in this case is worse.
For the “AIS data” stream, a data warehouse has been created: 6822e7110a9407408312c459.
For “NetFlow” nothing has been created, and if I try to do a preview, the result is an error:
“Could not find archive for stream : 60350cb139a3c87a2ae5705b”. Indeed, looking at the data-warehouse.streams directory I don’t see 60350cb139a3c87a2ae5705b, just 6822e7110a9407408312c459, which I presume is for the “AIS data” stream.
I first tried this with 6.2.1; it didn’t work. I tried again on 6.2.2. Same result.
Have I hit some bug, or does my instance have some hidden problem?
How long are you waiting to see if data is stored? Depending on which version you started with, the default store time can be up to an hour before messages are actually written to the data lake.
I am checking now roughly 10 hours after configuring data lake routing for two streams.
The first one has a data warehouse created in the data lake filesystem, but the second one doesn’t have anything.
The data lake directory is /var/db/datalakeV
Inside data-warehouse.streams I see a directory called 6822e7110a9407408312c459, which I presume would be for “AIS data”. But there is no directory for NetFlow, which I imagine would be “60350cb139a3c87a2ae5705b”, as I get an error when trying to preview it:
“Could not find archive for stream : 60350cb139a3c87a2ae5705b”
I am considering starting an empty instance without any configuration (it would be easy), as I am wondering whether there is something rotten in my cluster, which has gone through many versions since 4.1+.
Launched an empty OpenSearch + MongoDB + brand-new Graylog.
Requested a new license and imported it.
Created two inputs: AIS (raw UDP with JSON decoding) and UDP NetFlow.
Created indices and streams for them. No Illuminate.
Activated the data lake for both. The backend is a ZFS dataset, with proper permissions of course.
Now I see two data warehouses created. Waiting for them to be populated with some data.
Looks like something in my MongoDB database might be corrupted. It has been migrated without interruption since 2021. Now that I recall, several months ago it began complaining about being unable to load recent activity.
Sorry, it seems that for some reason I have a broken instance. Everything works except for that.
Now I try to apply a similar filter to the data lake, but the opposite one. This is tricky: the field does not exist for IPv4 flows, and it is a numeric field with a value of 6 for IPv6.
It seems the filter gets confused because some records contain nf_ip_protocol_version and others don’t. This does look like a bug to me? I know I can just check for field existence.
Anyway, if I try to use two rules, field existence and field value >=, it still fails.
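For reference, the combined condition I am trying to express would read like this as a pipeline rule (again just a sketch; the data lake filter builder constructs its own conditions, and the stream name is hypothetical):

```
rule "match only IPv6 flows"
when
  // Guard against the field being absent on IPv4 flows
  // before comparing the numeric value.
  has_field("nf_ip_protocol_version") &&
  to_long($message.nf_ip_protocol_version) >= 6
then
  route_to_stream(name: "netflow-ipv6-only", remove_from_default: false);
end
```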