Anonymizer extractor idea

Hello,

Companies operating inside the EU, ours included, currently face GDPR concerns regarding log data and data anonymization, e.g. not storing personally identifiable information in logs for more than 14 days.

I would like to discuss the idea of adding a new extractor type that anonymizes any string field or regex result by hashing it with SHA-256, SHA-512, or similar. Logstash has a plugin that implements this and could serve as an example: https://www.elastic.co/guide/en/logstash/6.x/plugins-filters-fingerprint.html

There are some examples in which this feature would be great to have:

  • As a sysadmin, I need to be GDPR-compliant and at the same time want to view unique web accesses to my application over a range longer than 14 days, to help identify crawlers and/or other threats accessing my websites.

  • My logs contain some personal data (a Tax ID, for example) that needs to be distinguishable for analysis but cannot be exposed to the Graylog operator. In this case the raw data is not important per se, but being able to tell the values apart is crucial.

Hope that this topic brings some new cool ideas for Graylog :smiley:

Graylog already supports quite a few hash functions in pipeline rules.

Great! I wasn’t aware.

Do you have an example of applying SHA256 in a pipeline?

I’m not sure what type of example you would expect. Maybe you should play around with the hash functions a bit first. :wink:

Sure, I asked because I’m not that familiar with the Pipeline rules.

I also checked the docs, but it’s a bit challenging to start from scratch without some complete examples.

The following rule would replace the value of the “ip_address” message field with a SHA-256 hash of the value:

rule "anonymize-ip"
when
  has_field("ip_address")
then
  let ip_addr = to_string($message.ip_address);
  let hash = sha256(ip_addr);
  set_field("ip_address", hash);
end
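
To illustrate why hashing keeps values distinguishable without exposing them, here is a minimal Python sketch of the same idea outside Graylog. The records, field names, and the `pseudonymize` helper are hypothetical and only for illustration; the optional key (HMAC-SHA256) is an assumption added here, not something the rule above does.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes = b"") -> str:
    """Return a hex SHA-256 digest of value; with a non-empty key, use HMAC-SHA256."""
    data = value.encode("utf-8")
    if key:
        return hmac.new(key, data, hashlib.sha256).hexdigest()
    return hashlib.sha256(data).hexdigest()


# Hypothetical log records; the field names and values are made up.
records = [
    {"tax_id": "123-45-6789", "path": "/login"},
    {"tax_id": "123-45-6789", "path": "/account"},
    {"tax_id": "987-65-4321", "path": "/login"},
]

# Replace the raw value with its digest, as the pipeline rule does for ip_address.
anonymized = [{**r, "tax_id": pseudonymize(r["tax_id"])} for r in records]

# Equal inputs give equal digests, so distinct values can still be counted.
unique_ids = {r["tax_id"] for r in anonymized}
print(len(unique_ids))  # prints 2
```

Because the digest is deterministic, an operator can still group or count by the hashed field. One caveat: values drawn from a small space, such as IPv4 addresses, can be recovered from an unkeyed hash by hashing every candidate, so a secret key may be worth considering.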