1. Describe your incident:
I am unable to make a data lake work properly. I have tried with Graylog 6.2.1 and 6.2.2, on a cluster updated from previous versions.
I have an input feeding a stream called “AIS”. The requirement for messages to be routed to that stream is the presence of a field called “mssi”. Both “mssi” and the “type” field mentioned below are numeric.
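For context, the routing condition is equivalent to a pipeline rule like this minimal sketch (in practice it is a plain field-presence stream rule; the remove_from_default choice here is just illustrative):

```
rule "route messages carrying mssi to AIS"
when
  // Same field-presence check as the stream rule
  has_field("mssi")
then
  route_to_stream(name: "AIS", remove_from_default: true);
end
```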
I have created a data lake backend (I tried both a filesystem and S3-backed storage based on MinIO).
In order to decide which messages to route to the data lake and which to OpenSearch, I used the field “type”. Messages with type <> 11 would go to the data lake, while messages with type == 11 would stay in the OpenSearch stream. I just wanted to test the data lake feature.
It seems it didn’t really write to the data lake. On 6.2.1 I recall the stream filter inside the data lake configuration kind of worked, but on 6.2.2 it seems to be completely ignored. No matter what I put in it, I find messages with type 11 as well as others.
I understand that the stream filter determines which messages will not be routed to the OpenSearch index, so I defined type <= 10 and type >= 12. The field is numeric, of course.
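In pipeline-rule terms, the split I am after would look something like this (just a sketch to show the intent, not what the data lake filter UI actually generates; the stream name “AIS-archive” is made up):

```
rule "send everything except type 11 to the archive stream"
when
  has_field("type") && to_long($message.type) != 11
then
  // Hypothetical stream with data lake routing enabled.
  route_to_stream(name: "AIS-archive", remove_from_default: false);
end
```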
2. Describe your environment:
OS Information: FreeBSD 14.2
Package Version: Graylog 6.2.2
Service logs, configurations, and environment variables:
OpenSearch 2.19.2 (I know; later I realised that I should not have updated OpenSearch)
MongoDB 6.0.19
3. What steps have you already taken to try and solve the problem?
I tried removing the data lake configuration and recreating it. I made it worse: when I deleted everything, a value called “AIS” somehow remained stuck somewhere, so when I go to the data lake “preview”, the stream name “AIS” appears twice.
4. How can the community help?
Has anyone else tried the data lake feature?
I think it is a bit confusing, and I recall a difference between 6.2.1 and 6.2.2 when defining a stream filter under the data lake configuration. I think 6.2.1 had a “true/false” toggle that was removed in 6.2.2? Sorry about this sketchy report, but when you are trying something out you don’t imagine that you might end up reporting a bug!
When you are talking about streams, do you have two streams, and you are routing messages to both before the data lake and then sending one to the data lake and one to the index? Or do you have one stream and you want to route a subset of that stream into the data lake?
Also what kind of license is applied to the cluster?
Also, this being an experimental cluster, I have configured a new OpenSearch cluster running 2.15.0 just in case I had messed something up.
With brand-new data (keeping the rest of the configuration) I have tried to enable the data lake.
I have tried three kinds of backends:
Filesystem (ZFS dataset)
Filesystem (conventional FFS filesystem just in case)
S3 backend (based on Minio)
Behavior is the same for all three attempts. I have added routing to a stream; I haven’t configured any filters. The result looks like this screenshot:
Checking the relevant directories or the S3 backend, I have seen that no data is being stored.
Moreover, I have tried to route a second stream (in this case NetFlow data) in the same way, with no filters. Behavior in this case is worse.
For the “AIS data” stream, a data warehouse has been created: 6822e7110a9407408312c459.
For “NetFlow” nothing has been created, and if I try to do a preview, the result is an error:
“Could not find archive for stream : 60350cb139a3c87a2ae5705b”. Indeed, looking at the data-warehouse.streams directory I don’t see 60350cb139a3c87a2ae5705b, just 6822e7110a9407408312c459, which I presume is for the “AIS data” stream.
I first tried this with 6.2.1; it didn’t work. I tried again on 6.2.2. Same result.
Have I hit some bug, or does my instance have some hidden problem?
How long are you waiting to see if data is stored? Depending on which version you started with, the default store time can be up to an hour before messages are actually written to the data lake.
I am checking now roughly 10 hours after configuring data lake routing for two streams.
The first one has a data warehouse created in the data lake filesystem, but the second one doesn’t have anything.
The data lake directory is /var/db/datalakeV
Inside data-warehouse.streams I see a directory called 6822e7110a9407408312c459, which I presume would be for “AIS data”. But there is no directory for NetFlow, which I imagine would be “60350cb139a3c87a2ae5705b”, as I get an error when trying to preview it:
“Could not find archive for stream : 60350cb139a3c87a2ae5705b”
I am considering starting an empty instance without any configuration (it would be easy), as I am wondering whether there is something rotten in my cluster, which has gone through many versions since 4.1+.
Launched an empty OpenSearch + MongoDB + brand-new Graylog.
Requested a new license and imported it.
Created two inputs: AIS (raw UDP with JSON decoding) and UDP NetFlow.
Created indices and streams for them. No Illuminate.
Activated the data lake for both. The backend is a ZFS dataset, with proper permissions of course.
Now I see two data warehouses created. Waiting for them to be populated with some data.
Looks like something in my MongoDB database might be corrupted. It has been migrated without interruption since 2021. Now that I recall, several months ago it began complaining about being unable to load recent activity.
Sorry, it seems that for some reason I have a broken instance. Everything works except for that.
Now I try to apply a similar filter to the data lake, but the opposite one. This is tricky: the field does not exist for IPv4 flows, and it is a numeric field with a value of 6 for IPv6.
It seems the filter gets confused because some records contain nf_ip_protocol_version and others don’t. This does look like a bug to me? I know I can just check for field existence.
Anyway, if I try to use two rules, field existence and field value >=, it still fails.
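For reference, the combined condition I am trying to express would read like this as a pipeline rule (again just a sketch; the data lake filter builder constructs its own conditions, and the stream name is hypothetical):

```
rule "match only IPv6 flows"
when
  // Guard against the field being absent on IPv4 flows
  // before comparing the numeric value.
  has_field("nf_ip_protocol_version") &&
  to_long($message.nf_ip_protocol_version) >= 6
then
  route_to_stream(name: "netflow-ipv6-only", remove_from_default: false);
end
```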