Preface:
Efficient storage and management of log data is a thorny topic, mainly because of the sheer volume of data involved. Nevertheless, it is of crucial importance for our business. In this blog post, I will present a preliminary storage architecture for log data that both meets current requirements and remains flexible enough for future challenges. I will place a special focus on aspects such as scalability, security, and data accessibility in order to provide a robust foundation for data-driven decisions.
According to one study, only about 1% of all indexed data is ever actively searched or reviewed by an administrator; the remaining 99% "wastes" costly storage space in indices. On average, 95% of the data is produced by the same 5 errors. (Source: "Observability is too expensive", Chris Cooney, presentation at Containerdays 2023)
Technical Information:
Currently, one index collects one day's worth of data (35 GiB, or roughly 80,000,000 entries).
An index consists of 4 shards, each containing a subset of the data.
This is necessary to keep searches for individual entries fast.
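For illustration, this is roughly how the 4-shard layout could be pinned down as an index template against the Elasticsearch/OpenSearch REST API backing Graylog. Note that Graylog normally manages shard counts itself through its index-set configuration, and the cluster URL, template name, and index pattern here are assumptions:

    import requests  # third-party: pip install requests

    ES_URL = "http://localhost:9200"  # assumed cluster address

    # Composable index template (Elasticsearch/OpenSearch 7.8+) pinning
    # new daily indices to 4 primary shards, matching the layout above.
    template = {
        "index_patterns": ["graylog_*"],  # assumed index naming scheme
        "template": {
            "settings": {
                "number_of_shards": 4,
                "number_of_replicas": 0,
            }
        },
    }

    resp = requests.put(f"{ES_URL}/_index_template/graylog-daily", json=template)
    resp.raise_for_status()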
Architecture:
Since, as mentioned, data must sit in indices to be retrievable, the retention time there has to be reduced considerably. In our experience, 1-2 weeks of data on demand is sufficient. This currently corresponds to approx. 500 GiB of data, or about 1,100,000,000 entries.
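Those figures follow directly from the per-day numbers above; a quick sizing check:

    GIB_PER_DAY = 35                  # observed index growth per day
    ENTRIES_PER_DAY = 80_000_000      # entries per daily index

    def hot_stage_size(days: int) -> tuple[int, int]:
        """Return (GiB, entries) held in the fast tier for a given retention."""
        return days * GIB_PER_DAY, days * ENTRIES_PER_DAY

    for days in (7, 14):
        gib, entries = hot_stage_size(days)
        print(f"{days} days -> ~{gib} GiB, ~{entries:,} entries")
    # 14 days -> ~490 GiB, ~1,120,000,000 entries: the ~500 GiB quoted above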
With the help of a "job", all entries older than that should be "archived" and possibly also compressed; one possible shape of such a job is sketched below. The retention period in this "WARM stage" should be about one month. Here I would recommend somewhat cheaper HDDs: at the current rate, 2 TB should be enough for one month of uncompressed data, and with further growth I would increase that fairly quickly to 3-4 TB.
(Potentially this could even run on Ceph, where the cheapest server disks would do; alternatively, NAS drives would also be possible. Neither would noticeably interfere with other write processes.)
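As a minimal sketch, such a job could be a nightly cron-driven Python script that snapshots every index older than the hot retention into an HDD-backed snapshot repository and then drops it from the hot tier. The cluster URL, repository name, and index pattern are assumptions, the repository must already be registered with the cluster, and Graylog Enterprise ships its own archiving feature that may fit better:

    import datetime as dt
    import requests  # third-party: pip install requests

    ES_URL = "http://localhost:9200"   # assumed cluster address
    REPO = "warm_hdd"                  # assumed snapshot repo on the cheap HDDs
    HOT_RETENTION = dt.timedelta(days=14)

    now = dt.datetime.now(dt.timezone.utc)

    # Read the creation date of every Graylog index from its settings.
    settings = requests.get(f"{ES_URL}/graylog_*/_settings").json()

    for index, data in settings.items():
        created_ms = int(data["settings"]["index"]["creation_date"])
        created = dt.datetime.fromtimestamp(created_ms / 1000, dt.timezone.utc)
        if now - created < HOT_RETENTION:
            continue
        # Snapshot the index into the WARM repository, then free the hot tier.
        requests.put(
            f"{ES_URL}/_snapshot/{REPO}/{index}?wait_for_completion=true",
            json={"indices": index},
        ).raise_for_status()
        requests.delete(f"{ES_URL}/{index}").raise_for_status()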
The "COLD stage" contains all entries. They can be kept indefinitely and are used for jurisdiction, forensics, and preservation of evidence. Storage tapes (possibly WORM tapes) would be possible here. The tape drive we use supports up to 30TiB of storage in "LP" mode. To get even more storage you could (if it is possible) compress the data beforehand.
Important:
There are some points that still need to be clarified:
- The temporary coexistence of data during the "move" process requires double the storage size
- Handling when expanding to multiple Graylog instances
- The upcoming migration of Graylog
- Indices are not tied to a fixed point in time (their age has to be derived, e.g. from the creation date)
- Which kind of "job" to use (Veeam, cron, Ansible, ...)
- Handling of the restore (possibly a dedicated Graylog instance for forensics)