I wonder if anyone can tell me whether it’s currently possible to implement data pseudonymization with Graylog. Specifically, we’d like to retain log messages containing personally identifiable information for a set period, say 6 months, and then pseudonymize the data after that.
I saw this thread where someone seems to have the same question, but I don’t think it answers whether it’s possible to hash the logging data after a set period (only that it is possible to hash the data as it comes in). As far as I can tell, this wouldn’t be possible with pipelines?
Would appreciate any advice from people dealing with the same issues at the moment.
Unfortunately, Graylog does not let you change data after it has been indexed in Elasticsearch.
However, you have two options: anonymize the logs at ingestion time (as stated in the thread you mentioned), or duplicate the data into two different indexes: one holding the raw data with a short retention time, and another with a long retention time where you store the anonymized logs.
What I’m not sure about is how to route both copies of the logs inside the pipeline.
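To make the second option concrete, here is a minimal sketch in Python of the duplicate-and-hash logic. It is not Graylog pipeline code and does not use any Graylog API; it only illustrates the idea of producing one raw copy for the short-retention index and one pseudonymized copy for the long-retention index. The field names (`username`, `source_ip`), the function names, and the choice of a keyed HMAC-SHA256 rather than a plain hash are all assumptions made for illustration.

```python
import hashlib
import hmac
from typing import Dict, Iterable, Tuple

# Hypothetical PII field names; adjust to whatever your messages carry.
PII_FIELDS = ("username", "source_ip")


def pseudonymize(message: Dict, key: bytes, fields: Iterable[str] = PII_FIELDS) -> Dict:
    """Return a copy of `message` with the listed fields replaced by keyed hashes."""
    result = dict(message)
    for field in fields:
        if field in result:
            value = str(result[field]).encode("utf-8")
            # Keyed hash: the same input always maps to the same pseudonym,
            # but the mapping cannot be recomputed without the key.
            result[field] = hmac.new(key, value, hashlib.sha256).hexdigest()
    return result


def split_for_retention(message: Dict, key: bytes) -> Tuple[Dict, Dict]:
    """Return (raw_copy, pseudonymized_copy) for the short- and long-retention indexes."""
    return dict(message), pseudonymize(message, key)
```

For example, `split_for_retention({"username": "alice", "message": "login ok"}, key)` returns the original message unchanged plus a second copy in which `username` is a 64-character hex digest and `message` is untouched. Because the pseudonym is deterministic for a given key, events from the same user can still be correlated in the long-retention copy.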
Thanks for the reply, Eduardo. Yes, and for anyone else interested, I think this is the page of the docs that covers time-based index retention.
It sounds like all the tools are there (multiple indexes with different retention strategies, pipelines and hash functions), but I’m just not entirely sure how to piece them together in the right way yet. Let me know if you work it out, as it sounds like you’ve got pretty much the same problem to solve.