Preliminary storage architecture for log data

Preface:

Efficient storage and management of log data is a contentious topic, mainly due to the sheer volume of data involved. Nevertheless, it is of crucial importance for our business. In this blog post, I will present a preliminary storage architecture for log data that both meets current requirements and is flexible enough for future challenges. I will put a special focus on aspects such as scalability, security, and data accessibility, to provide a robust foundation for data-driven decisions.

According to one study, only about 1% of all indexed data is ever actively searched and reviewed by an administrator; the remaining 99% "wastes" costly storage space in the indexes. On average, 95% of the data is produced by the same 5 errors. (Source: "Observability is too expensive", Chris Cooney, presentation at Containerdays 2023.)

Technical Information:

Currently, one index collects a single day of data (approx. 35 GiB, or 80,000,000 entries).

Each index consists of 4 shards, each containing a subset of the data.

This sharding is necessary to allow fast searching of individual entries.
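For illustration, a shard count like this could be pinned for new indices via the OpenSearch/Elasticsearch REST API; the template name and index pattern below are assumptions (in practice, Graylog sets shards per index set in its own configuration):

```
# Illustrative index template: every new index matching "graylog_*"
# receives 4 primary shards, mirroring the layout described above.
# Template name and index pattern are assumptions, not our real setup.
curl -X PUT "http://localhost:9200/_index_template/daily-logs" \
  -H 'Content-Type: application/json' \
  -d '{
    "index_patterns": ["graylog_*"],
    "template": {
      "settings": {
        "index.number_of_shards": 4,
        "index.number_of_replicas": 1
      }
    }
  }'
```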

Architecture:

Since, as mentioned, data must live in indexes to be searchable, the retention time in the hot tier must be greatly reduced. In our experience, 1-2 weeks of data on demand is sufficient; this currently corresponds to approx. 500 GiB of data, or 1,100,000,000 entries.
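As a sketch of how this cut-off could be enforced automatically (assuming Elasticsearch ILM; OpenSearch offers the equivalent ISM plugin, and Graylog can achieve much of this through its index set rotation/retention settings), a lifecycle policy could move indices out of the hot tier after 14 days:

```
# Sketch of a lifecycle policy: indices stay in the hot tier for 14
# days and are then reallocated to nodes tagged "warm" (this assumes
# "node.attr.data: warm" is set in elasticsearch.yml on the HDD nodes).
# The policy name is an assumption.
curl -X PUT "http://localhost:9200/_ilm/policy/log-tiering" \
  -H 'Content-Type: application/json' \
  -d '{
    "policy": {
      "phases": {
        "hot": { "actions": {} },
        "warm": {
          "min_age": "14d",
          "actions": {
            "allocate": { "require": { "data": "warm" } }
          }
        }
      }
    }
  }'
```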

A "job" should then "archive" (and possibly also compress) all entries older than that. The retention period in the "WARM stage" should be about 1 month; here I would recommend somewhat cheaper HDDs. For one month, 2 TB is currently enough in uncompressed form; with further growth I would quickly increase that to 3-4 TB.

(This could potentially even run on Ceph, where the cheapest server disks would suffice; alternatively, NAS drives would also work. Neither would noticeably interfere with other write processes.)
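The archiving "job" mentioned above could be as simple as a nightly cron entry. A minimal sketch, assuming daily indices named "graylog_YYYY-MM-DD" and the "cold_archive" snapshot repository registered in the COLD stage section below (all names are hypothetical):

```
#!/bin/sh
# Hypothetical nightly archive job: snapshot the daily index that has
# just aged out of the warm tier, then delete it from the cluster.
OLD=$(date -d "1 month ago" +%F)   # GNU date; yields e.g. 2023-10-05
INDEX="graylog_${OLD}"

# Snapshot the index into the pre-registered "cold_archive" repository.
# -f makes curl fail on HTTP errors, so we never delete unarchived data.
curl -f -X PUT \
  "http://localhost:9200/_snapshot/cold_archive/${INDEX}?wait_for_completion=true" \
  -H 'Content-Type: application/json' \
  -d "{ \"indices\": \"${INDEX}\" }" \
  && curl -X DELETE "http://localhost:9200/${INDEX}"
```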

The "COLD stage" contains all entries. They can be kept indefinitely and are used for jurisdiction, forensics, and preservation of evidence. Storage tapes (possibly WORM tapes) would be possible here. The tape drive we use supports up to 30TiB of storage in "LP" mode. To get even more storage you could (if it is possible) compress the data beforehand.

[Architecture diagram]

Important:

There are some things which need to be clarified:

  • Temporary coexistence of data during the "move" process requires double the storage
  • Handling when expanding to multiple Graylog instances
  • Upcoming move of Graylog
  • Indices are not tied to a specific point in time
  • Which kind of "jobs" to use (Veeam, cron, Ansible, ...)
  • Handling of restores (possibly a dedicated Graylog instance for forensics)

Hi @Marvin1,
that is a very nice post, thank you very much!
Could you help with the practical implementation by letting us know how to change the settings for the elastic/opensearch nodes to the appropriate level?