I would like to configure my Graylog instance so that for certain logs there are two “views”: one in which personal data is anonymized (for everyday troubleshooting while staying GDPR-compliant, etc.) and one that contains the full information (for when, for example, there is evidence of an attack and we need to track down the “patient zero” workstation).
I have been tinkering with this for quite a bit. I have tried to route some logs into different streams (and different index sets) and then manipulate one of the streams with processing pipelines. Every time I did this, the changes the processing pipelines made to the logs also applied to the other stream, which I meant to keep unchanged. My Message Processors Configuration is this:
Message Filter Chain (active)
Pipeline processor (active)
AWS Instance Name Lookup (disabled)
GeoIP Resolver (active)
I also tried to send the raw logs via an output to a different input and route them into the other stream from there, but the result was the same.
Now, I could probably accomplish my goal by setting up a second Graylog instance to host only anonymized logs, but that seems overkill to me.
Can anyone tell me if there is a way to configure what I referred to as different “views” in Graylog? And if so, how should I go about that?
Your initial idea looks like the best approach, but it seems that you bound the pipelines to the wrong streams or that the rules are not specific enough.
You need to explicitly duplicate the messages. Create two streams, each with its own index set, and create the same stream rules for both. Now you should have duplicated messages, and you can work with a pipeline on one of these streams to manipulate the logs and make them GDPR-compliant.
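As a rough sketch of what such an anonymization rule could look like (untested; the field name username is just a placeholder for whatever field holds personal data in your logs):
rule "anonymize_personal_data"
when
  has_field("username")
then
  // overwrite the personal value so this stream no longer exposes it
  set_field("username", "anonymized");
end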
I tried what you recommended, but I still run into my old problem of both streams being manipulated instead of just the one that I want to be manipulated. I must have made a mistake somewhere, but I don’t see where exactly. I’ll describe my configuration below - if you could check it for possible mistakes, I’d appreciate that.
I did the following:
Created two new index sets which I called index_raw and index_anon.
Created two streams, one configured with index_raw, one configured with index_anon, both with the same stream rule (match the source name to a given string and include the logs if matched), the same input and the same content.
Created a new pipeline “test_pipeline” and connected it to the stream configured with index_anon.
Added to this pipeline a pipeline rule “test-pipeline-rule” which simply lowercases the source field and looks like this:
rule "test-pipeline-rule"
when
to_string($message.source)=="my_server"
then
set_field("source", lowercase(to_string($message.source)));
end
I expected this test rule to only lowercase the source field in the stream with index “index_anon”. It did that, but also lowercased the source field in the other stream where I wanted to keep the raw data.
Any ideas why that is and how I can change this to the desired behaviour?
If you have duplicated the messages - did you check whether that actually happened?
When the admin user, who is able to search across everything without a filter, does a search that matches the messages you have in those streams, you should get two identical messages returned.
The same message ID, but different indices and streams on the message details:
When searching for the source name on the input my two streams are connected to, I find each event three times: once in each new index set and once in the default index set. Under “Stored in index” in my two separate streams, the names of the specified index sets are shown. In one stream it says “index_anon”, in the other it says “index_raw” - just as it should be, I think.
So I suppose something might be wrong with the pipeline? I only connected it to the stream which is connected to index_anon, yet it processes entries in all three indices…
I’ve checked something and … I think I have some serious problems with my elasticsearch installation. Please feel free to not look into the bug report too much right now, I may have found the problem.
The issue with Elasticsearch (I was still using 2.4.6, which is no longer recommended for Graylog) was apparently unrelated; the pipeline issue is still present after the upgrade to Elasticsearch 5.x. The bug report seems to still be relevant.
Just a quick jump into the convo here, since I did not read all of it.
Have you tried duplicating the message with the clone_message([message: Message]) or create_message([message: string], [source: string], [timestamp: DateTime]) and then sending it to the different stream with route_to_stream(id: string | name: string, [message: Message], [remove_from_default: boolean])?
What I read and saw in the screenshot was that the “supposed to be different” messages were still the same message (same message ID), simply attached to two streams. What you need to do instead is create a copy of the message, apply your filtering/redactions/etc., and then route that copy to the stream with the limited view while routing the original message to the stream with full visibility.
This will result in (almost) double data usage, but will make sure that the limited view is completely decoupled from the original data, since they are two different messages.
I appreciate your feedback, thank you!
So far, though, I couldn’t get the duplication of messages working. This is the pipeline rule I tried:
rule "copy_message_into_stream"
when
to_string($message.source)=="<name_of_machine>"
then
let msg = clone_message();
route_to_stream("5b8fe2b089e98d0a1f23c07e", to_string(msg));
end
But not only did it not send copies to the other stream, it also seemed to process more and more messages (exponentially?), which made me stop it very quickly. Do you have experience with this and can you tell me how to correct the rule?
I might be wrong, but just to be sure, add a check whether the message is already a duplicate, to avoid loops.
The route_to_stream function expects a Message object, not a string. And I explicitly specified that the stream identifier is a stream ID:
rule "copy_message_into_stream"
when
to_string($message.source)=="<name_of_machine>" && has_field("isDuplicate") == false
then
let msg = clone_message();
set_field(field: "isDuplicate", value: true, message: msg);
route_to_stream(id: "5b8fe2b089e98d0a1f23c07e", message: msg); //Function expects Message object, not a String
end
I tried cloning a message more than 1 1/2 years ago. I can’t really recall it, sorry. But this should work in theory (I did not check it with my test Graylog, can’t reach it atm, sorry ^^)
Thanks again for your reply! This pipeline rule worked quite well for me.
I added a line to your rule to also remove the duplicate message from the original stream, so that the original stream only contains the original messages while the stream receiving the copies only contains those.
rule "copy_message_into_stream"
when
to_string($message.source)=="<machine_name>" && has_field("isDuplicate") == false
then
let msg = clone_message();
set_field(field: "isDuplicate", value: true, message: msg);
route_to_stream(id: "<id_of_target_stream>", message: msg);
remove_from_stream(id: "<id_of_first_stream>", message: msg);
end
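The actual anonymization can then live in a separate pipeline that is connected only to the stream receiving the copies. A minimal sketch of such a masking rule, assuming a hypothetical field user_name carries the personal data (hashing keeps events correlatable without exposing the value):
rule "mask_personal_data_in_copies"
when
  has_field("user_name")
then
  // pseudonymize: replace the value with its SHA-256 hash
  set_field("user_name", sha256(to_string($message.user_name)));
end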
Using this mechanism, I finally managed to get two different versions of the same data, which is what I wanted to accomplish.
Thank you very much again, I appreciate your help.
While I do consider my problem solved, I think the implementation of different views on log data in Graylog might be worth some documentation over at docs.graylog.org?
At the same time, I think a feature for dashboards and/or streams that enables administrators to mask parts of the log data for certain roles would be neat, as it would not increase the volume of logs. I suppose I can propose that as a new feature by opening an issue on GitHub?
Sure, that’s a good idea. Simply open an issue in the documentation repo on GitHub. Or, if you want to write it yourself, open an issue, fork the repo, write it, and submit a pull request.
Sure, that’s the correct way to propose new features