Is there a way (a pipeline?) to filter out old data being ingested? Or more ideally, a way to ensure that older data ends up in the correct index file (and/or dropped if too old) in order to meet retention schedules?
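To illustrate what I mean by "a pipeline": I'm imagining something like the rule below, written against Graylog's pipeline rule language. This is an untested sketch — the 30-day cutoff is just an example, and I'm assuming date arithmetic with 'now()' and 'days()' works this way in rule conditions:

```
rule "drop events older than retention"
when
  // Hypothetical 30-day cutoff; would need to match the index's retention period.
  $message.timestamp < now() - days(30)
then
  drop_message();
end
```

Even if that works, it only drops too-old data; it doesn't solve the second problem of in-retention historical data landing in today's index file.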
As per the documentation, I have set up multiple indices with time-based rotation schedules to meet the retention policies in our GDPR policy documents. But I have found that, after setting up a sidecar with Winlogbeat, old data that precedes my retention schedule is imported and searchable, and all newly ingested data lands in the current file for the index regardless of the actual timestamp in the logs. So old data is present that should already have expired, and data within the retention schedule is present but won't be automatically deleted when it reaches the appropriate age — only when today's file for that index reaches the end of its retention schedule. That could conceivably put us in breach if there's any personal data in those logs.
I'm assuming this is not unique to sidecar/beats inputs, so any after-the-fact ingestion of logs would result in the same issue?
I know that for Beats at least, there's the 'ignore_older' configuration option to simply skip importing anything older than a given age. But that isn't ideal if you do actually want to ingest those logs, and it may not be available for other ingestion methods.
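For reference, in the sidecar's Winlogbeat configuration that option would look roughly like this (the 'Security' log name and the 72h cutoff are just example values, not what I'd necessarily use):

```yaml
winlogbeat.event_logs:
  - name: Security
    # Skip events older than 72 hours at collection time.
    # Example value only; would need to match the shortest retention period.
    ignore_older: 72h
```

Note that this discards the old events entirely at the collector, which is exactly the trade-off I'd like to avoid if there's a server-side way to route them into the correct index file instead.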