How to be GDPR-compliant with log retention?


I’m trying to find a way to store all logs while staying GDPR-compliant.
We need to limit access to older log files to the GDPR team only.

Graylog can’t create one user profile with access to 0–6 months of logs and another profile with access to 0–1 year.

Is it possible to have 2 Graylog servers: one for admins, and a second for our GDPR team?

The first one would be used by all admins, for stats, checks, etc.:

  • collect all logs with, for example, a 3-month retention
  • forward all logs to the second server

The second one, used only by the GDPR team:

  • receives all logs from the first one, but with a one-year log retention.

Thank you

Yes, it is possible, but in that case you would have to run two different clusters and store the data twice.
I suggest using the Graylog API instead.
My idea:
create a new stream every month and forward all related messages into it
grant the “see all” group rights on that stream
remove those rights from the old streams

Or you can create a lot of streams in advance, via script (API) or manually (e.g. 1812, 1901, 1902, …), and create a pipeline that routes each message into a stream based on its arrival date.
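The scripted variant above could be sketched like this. Note the assumptions: the YYMM stream names follow the 1812/1901 example, the payload fields mirror the Graylog REST API’s stream-creation body, and the index set ID is a made-up placeholder - check both against your Graylog version before using this.

```python
from datetime import datetime

def monthly_stream_title(when: datetime) -> str:
    """Stream name in the YYMM scheme from the post, e.g. 1812 for Dec 2018."""
    return when.strftime("%y%m")

def stream_payload(title: str, index_set_id: str) -> dict:
    """Body for POST /api/streams (field names per the Graylog REST API)."""
    return {
        "title": title,
        "description": f"Monthly GDPR stream {title}",
        "index_set_id": index_set_id,   # placeholder ID - use your own index set
        "rules": [],                    # routing is done by a pipeline instead
        "matching_type": "AND",
        "remove_matches_from_default_stream": False,
    }

# Example: the stream a December 2018 message would be routed into.
title = monthly_stream_title(datetime(2018, 12, 24))
payload = stream_payload(title, "57f3d721a43c2d59d3d7d7d0")
# A script would then POST this payload to /api/streams with an admin token,
# e.g. requests.post(f"{base}/api/streams", json=payload, auth=(token, "token"))
```

A cron job running once a month would be enough to keep the streams ahead of the calendar.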

Thank you, great idea !

I will try this !

If you do it, please share with us the final solution.


Ditto, that’s certainly worthy of a nice blog post :slight_smile:

The first solution has to be done manually, which is boring :slight_smile:

Another solution: a plugin.

Graylog keeps all logs for a year.
With a specific plugin:
If a user is limited by the plugin (via user management), every search that user runs is restricted to logs at most 6 months old, unless they choose a narrower date range themselves.

We would need a plugin administration page to set the duration - not hard-coded to 6 months, but changeable by a Graylog admin.
We would also need a way to activate it on user profiles.
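No such plugin exists out of the box, and a real one would be written in Java against Graylog’s plugin API - but the clamping logic it would need is simple. A minimal sketch in Python, with hypothetical names, just to pin down the behaviour described above:

```python
from datetime import datetime, timedelta

def clamp_search_from(requested_from: datetime,
                      now: datetime,
                      limited: bool,
                      max_age_days: int = 180) -> datetime:
    """If the user is flagged as limited, never let a search reach further
    back than max_age_days (the admin-configurable setting; 6 months here).
    Unlimited users keep whatever range they asked for."""
    if not limited:
        return requested_from
    oldest_allowed = now - timedelta(days=max_age_days)
    return max(requested_from, oldest_allowed)

now = datetime(2019, 6, 1)
# A limited user asking for a full year only gets the last 180 days:
clamped = clamp_search_from(datetime(2018, 6, 1), now, limited=True)
# An unlimited admin keeps the requested range:
full = clamp_search_from(datetime(2018, 6, 1), now, limited=False)
```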

Is it possible to create such a plugin? I’m not a dev :confused: so I don’t know whether this would be easy with the web UI, API, etc.

Of course there’s one more issue: system administrators. They can always, always access your data. Even if it’s supposed to be off-limits.

Also, if someone manages to get the credentials to Elasticsearch, they can easily bypass Graylog. So make sure that you’ve hammered down Elasticsearch as well!

I would turn that around and look at it from a different angle - that might make the issue easier to solve.

How long do you really need the log data to be instantly searchable? Is it really that long? I don’t know many environments that really need data for longer than 40 days. Most keep aggregated data for that period of time and raw data only for a few days …

I do not know what amount of data we are talking about, but having Elasticsearch handle multiple TB or PB is not a side job - so if you can make some assumptions about that, it would help us to help you.

What is the job of your GDPR team? Do they need raw log data?


You are right: for sysadmins, only a few days are useful to check what happened, get notified when lots of errors occur, etc.

For GDPR, they need more/all logs. I don’t know how much data that is, or how many servers I would have to use. So if I have to build two clusters, that’s not great :confused: I want something simple.

So I tried this:
I created two index sets: the first one, the Graylog default, with a 6-month retention.
The second one, called “GDPR”, with a 1-year retention.

To feed the second one, I created a stream with the rule “Rule always matches”.

So when a log comes in, both index sets are fed.

I created a role which can only see the GDPR stream.

The GDPR user can only see their own index, and can create dashboards, alerts, etc.
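The role part of this setup can also be created via the Graylog REST API instead of the web UI. A sketch of the request body in Python - the permission string follows Graylog’s `streams:read:<stream-id>` convention, and the stream ID here is a made-up placeholder:

```python
def gdpr_role(stream_id: str) -> dict:
    """Body for POST /api/roles: a role that can only read one stream
    (the 1-year GDPR stream), nothing else."""
    return {
        "name": "gdpr-readers",
        "description": "Read-only access to the 1-year GDPR stream",
        "permissions": [
            f"streams:read:{stream_id}",   # see and search this stream only
        ],
        "read_only": False,  # the role itself stays editable by admins
    }

role = gdpr_role("5c0e1a2b3c4d5e6f7a8b9c0d")
# POST role to /api/roles, then assign it to the GDPR users.
```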

What do you think ?

Thank you

It is not easy to give a recommendation, as I can’t imagine what your “GDPR Team” is doing.

With your idea you will have more data, as you duplicate it by holding it in different indices. From my experience, such plans come up for review very quickly once you put a price tag on the requirements.

Ask which devices and services messages should be kept from, for how long, and what volume that will be. Once you have the complete daily amount of data, multiply it by 1.7 (just to be safe) and you have the storage you need per day. Now multiply that up to one year and you have the amount of data you have to manage. If you want to duplicate the messages for one month, take the daily volume, multiply it by one month, and add that on top.

In a small environment, 10 GB per day is not uncommon. Taking the above example:

10 GB/day * 365 days = 3,650 GB ≈ 3.65 terabytes kept for one year
10 GB/day * 31 days = 310 GB duplicated for one month
310 GB * 1.7 (safety factor) = 527 GB; in total 3,650 GB + 527 GB ≈ 4,177 GB ≈ 4.18 terabytes

Spread over the year, that means the indices together grow by roughly 11 GB per day. When you need to be resilient - no data lost even if a node is nuked - you need 3 Elasticsearch nodes and a replication factor of 1. The data then gets duplicated to ~22 GB per day, and the final storage needed across the three nodes together is ~8,354 GB ≈ 8.35 terabytes.
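The arithmetic above, spelled out - same assumptions as the example: 10 GB/day ingest, a one-month duplicate with the 1.7 safety factor, and a replication factor of 1 doubling everything:

```python
# Back-of-the-envelope Graylog/Elasticsearch storage sizing, in GB.
daily = 10                          # ingest per day
yearly_raw = daily * 365            # 3,650 GB kept for one year
month_dup = daily * 31              # 310 GB duplicated for one month
month_dup_safe = month_dup * 1.7    # safety factor -> 527 GB

total = yearly_raw + month_dup_safe        # ~4,177 GB (~4.18 TB)
per_day_in_indices = total / 365           # ~11.4 GB/day across the indices
with_replication = total * 2               # replication factor 1 -> ~8,354 GB
```

Swap in your own daily volume and retention numbers to get a price tag before committing to a design.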

We could spin this out into even more detail with no issues - but this is just to give you something to think about.


Absolutely! Don’t just start building stuff, even if it’s very tempting. You need a proper set of requirements from them, especially with something as important as data security and privacy legislation.

Personally, I don’t see why a privacy/legal team would need access to all server logs going back a year. I could understand them wanting specific security and access logs, to trace which users accessed which data. But as @jan already said: we can’t imagine what your “GDPR Team” is supposed to be doing all day :wink:

I mean, even our security auditors do not require full access to all server logs going back that far. It’s mostly security stuff, which is a limited subset.

So… time for talks, meetings, proposals and most importantly: lists of requirements.

To throw in a few cents here: my company requires logs to be kept searchable for a minimum of 90 days, preferably more than that (even 180), with a 5-year archive. That’s partly due to legal reasons, partly because we need to be able to look that far back. (This is backed by, give or take, 48 TB of storage.)

If the log data contains personal data, you need a legal basis to store it for a certain period of time. The right to erasure must not be forgotten: the data subject may request that his or her personal data be deleted.

Before you think about the technical implementation: Check how you can meet the requirements (purpose of processing personal data: storage).

Has a processing activity been created for this purpose?

One purpose is surely legitimate interest, but how can that be justified for more than 3 months?

Here you can read about the lawfulness of processing:
A legal basis can be: processing is necessary for compliance with a legal obligation to which the controller is subject.

Why does the GDPR team need access to the log data, and why for so long? What is the purpose? These questions should be answered. The data protection officer should reduce the storage duration to a minimum, not stretch it to a maximum.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.