Anonymized and raw views of the same logs in different streams possible?

Hello!

I would like to configure my Graylog instance in such a way that for certain logs there are two “views”: one in which personal data is anonymized (for everyday troubleshooting while staying GDPR-compliant) and one that contains more information (for when, for example, there is evidence of an attack and we need to track down the “patient zero” workstation).
I have been tinkering with this for quite a while. I tried to route some logs into different streams (with different index sets) and then manipulate one of the streams with processing pipelines. Every time I did this, the changes the processing pipelines made to the logs were also applied to the other stream, which I meant to stay unchanged. My Message Processors Configuration is this:

  1. Message Filter Chain (active)
  2. Pipeline processor (active)
  3. AWS Instance Name Lookup (disabled)
  4. GeoIP Resolver (active)

I also tried to send the raw logs via an output to a different input and route them into the other stream from there, but the result was the same.
Now, I could probably accomplish my goal by setting up a second Graylog instance to host only anonymized logs, but that seems overkill to me.
Can anyone tell me if there is a way to configure what I referred to as different “views” in Graylog? And if so, how should I go about that?

Any help would be greatly appreciated.

Greetings,
Philipp

Hey Philipp,

your initial idea looks like the best approach, but it seems that you bound the pipelines to the wrong streams or that your rules are not specific enough.

You need to explicitly duplicate the messages: create two streams, each with its own index set, and create the same stream rules for both. Now you should have duplicated messages, and you can work with a pipeline on one of these streams to manipulate the logs and make them GDPR-compliant, as sketched below.
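
The manipulation itself is then just a pipeline rule connected to the anonymized stream. A minimal, untested sketch (“src_ip” is only an example field name, sha256 is one of the built-in hash functions; mask whatever fields actually contain personal data):

rule "anonymize-personal-data"
when
    has_field("src_ip")
then
    // "src_ip" is only an example field; replace the raw value with an
    // irreversible hash so the message stays usable for correlation
    set_field("src_ip", sha256(to_string($message.src_ip)));
end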

Hello Jan,

thank you very much for your quick response.

I tried what you recommended, but I still run into my old problem of both streams being manipulated instead of just the one I want to be manipulated. I must have made a mistake somewhere, but I don’t see where exactly. I’ll describe my configuration below; if you could check it for possible mistakes, I’d appreciate that.
I did the following:

  1. Created two new index sets which I called index_raw and index_anon.
  2. Created two streams, one configured with index_raw, one configured with index_anon, both with the same stream rule (match the source name to a given string and include the logs if matched), the same input and the same content.
  3. Created a new pipeline “test_pipeline” and connected it to the stream configured with index_anon.
  4. To this pipeline, added a pipeline rule “test-pipeline-rule” which simply lowercases the source field and which looks like this:
rule "test-pipeline-rule"
when
    to_string($message.source)=="my_server"
then
    set_field("source", lowercase(to_string($message.source)));
end

I expected this test rule to only lowercase the source field in the stream with index “index_anon”. It did that, but also lowercased the source field in the other stream where I wanted to keep the raw data.

Any ideas why that is and how I can change this to the desired behaviour?

If you have duplicated the messages, did you check whether that actually happened?

When the admin user (who is able to search everything without a stream filter) searches for the messages you have in those streams, you should get two identical messages returned.

The same message ID, but different indices and streams in the message details:

[screenshot: message details showing the same message ID in two different streams and indices]

When searching for the source name in the input my two streams are connected to, I find each event three times: once in each new index set and once in the default index set. Under “Stored in index” in my two separate streams, the names of the specified index sets are shown. In one stream it says “index_anon”, in the other “index_raw” - just as it should be, I think.

So I suppose something might be wrong with the pipeline? I only connected it to the stream which is connected to index_anon, yet it still processes entries in all three indices…

That indeed sounds like something is not configured the way it should be, just as you said.

If everything is configured like you wrote, this might be a bug. In that case we would need a detailed bug report, including how to reproduce it, over at https://github.com/Graylog2/graylog2-server/issues

I just submitted an issue at https://github.com/Graylog2/graylog2-server/issues/5016 .
Thank you for your time.

I’ve checked something and … I think I have some serious problems with my Elasticsearch installation. Please feel free not to look into the bug report too much right now; I may have found the problem.

The issue with Elasticsearch (I was still using 2.4.6, which is not recommended for Graylog anymore) was apparently unrelated; the pipeline issue is still present after the upgrade to Elasticsearch 5.x, so the bug report seems to still be relevant.

Heyo,

just a quick jump into the convo here since I did not read all of it.

Have you tried duplicating the message with the clone_message([message: Message]) or create_message([message: string], [source: string], [timestamp: DateTime]) and then sending it to the different stream with route_to_stream(id: string | name: string, [message: Message], [remove_from_default: boolean])?

What I read and saw in the screenshot is that the “supposed to be different” messages were still the same message (same message ID), simply attached to two streams. What you need to do instead is create a copy of the message, apply your filtering/redactions/etc. to it, and then route that copy to the stream with the limited view while routing the original message to the stream with full visibility.

This will result in (almost) double data usage, but will make sure that the limited view is completely decoupled from the original data, since they are two different messages.

Just a quick thought from me :)
Greetings,
Philipp

Hey Philipp,

I appreciate your feedback, thank you!
So far, though, I couldn’t get the duplication of messages working. This is the pipeline rule I tried:

rule "copy_message_into_stream"
when
    to_string($message.source)=="<name_of_machine>"
then
    let msg = clone_message();
    route_to_stream("5b8fe2b089e98d0a1f23c07e", to_string(msg));
end

But not only did it not send copies to the other stream, it even seemed to process more and more messages (exponentially?), which made me stop it very quickly. Do you have experience with this, and can you tell me how to correct the rule?

What I changed:

  • I might be wrong, but just to be sure, add a check whether the message is already a duplicate, to avoid loops (the clone runs through the pipeline again and would otherwise be cloned itself)
  • The route_to_stream function expects a Message object, not a string. I also pass the stream identifier explicitly as the named id argument to make clear it is a stream ID
rule "copy_message_into_stream"
when
    to_string($message.source)=="<name_of_machine>" && has_field("isDuplicate") == false
then
    let msg = clone_message();
    set_field(field: "isDuplicate", value: true, message: msg);
    route_to_stream(id: "5b8fe2b089e98d0a1f23c07e", message: msg); //Function expects Message object, not a String
end

The last time I tried cloning a message was more than one and a half years ago, so I can’t really recall the details, sorry. But this should work in theory (I did not check it with my test Graylog; I can’t reach it atm, sorry ^^)

Greetings,
Philipp

Hey Philipp,

thanks again for your reply! This pipeline rule worked quite well for me.
I added a line to your rule that also removes the duplicate message from the original stream, so that the original stream contains only the original messages while the stream receiving the copies contains only those.

rule "copy_message_into_stream"
when
    to_string($message.source)=="<machine_name>" && has_field("isDuplicate") == false
then
    let msg = clone_message();
    set_field(field: "isDuplicate", value: true, message: msg);
    route_to_stream(id: "<id_of_target_stream>", message: msg); 
    remove_from_stream(id: "<id_of_first_stream>", message: msg);
end

Using this mechanism, I finally managed to get two different versions of the same data, which is what I wanted to accomplish.
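
For the anonymization itself, a rule along these lines should work (a sketch only; hashing the source with the built-in sha256 function is just an example, in practice you would mask whatever fields contain personal data on the clone before routing it):

rule "copy_message_into_stream"
when
    to_string($message.source)=="<machine_name>" && has_field("isDuplicate") == false
then
    let msg = clone_message();
    set_field(field: "isDuplicate", value: true, message: msg);
    // example anonymization applied to the clone only: hash the source field
    set_field(field: "source", value: sha256(to_string($message.source)), message: msg);
    route_to_stream(id: "<id_of_target_stream>", message: msg);
    remove_from_stream(id: "<id_of_first_stream>", message: msg);
end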

Thank you very much again, I appreciate your help.

While I do consider my problem solved, I think the implementation of different views on log data in Graylog might be worth some documentation over at docs.graylog.org?

At the same time, I think a feature for dashboards and/or streams that enables administrators to mask parts of the log data for certain roles would be neat, as it would not increase the log volume. I suppose I could propose that as a new feature in a new issue on GitHub?

Greetings,
Philipp

Heyo,
cool, I’m happy that it works :)

Sure, that’s a good idea. Simply open an issue in the documentation repo on GitHub. Or, if you want to write it yourself: open an issue, fork the repo, write the documentation, and submit a pull request :)

Sure, that’s the correct way to propose new features :)

Greetings,
Philipp
