Graylog Data Retention and Compression

Hi, folks!

I’m a newbie in Graylog/Elasticsearch and I’m starting to implement it to manage some logs in my organization. Reading the documentation (of both products) I’ve come across a lot of information about what I will ask here. But…I’m looking for some advice from more experienced users about best practices.

The scenario: Graylog 4.3 and Elasticsearch 7.17.
I already have a Squid proxy server sending users’ Internet access logs; I have a dashboard with some charts, tables, metrics, etc., and I’m able to search and manage the data that I’m collecting. I’ll start to collect the logs of other servers in the future but, for now, the above is enough to explain my doubts. Now…I’m concerned about the increase in disk consumption.

So, here are some doubts:
How can I configure data retention in Graylog?
For example…the most costly logs (in terms of disk consumption) that I’ll manage are the Squid proxy logs. I want to store the users’ Internet access logs for a 1 year period. I’ve read in the Elasticsearch docs about a feature named “rollup”, but I don’t know if it’s a good idea to configure it “directly” on ES under Graylog. I mean, is Graylog able to deal with all those concepts of “rollup search”, “rollup index”, etc.?

Is there a way to “rotate” and/or “compress” the older indexes while still retaining Graylog’s ability to search them (with some performance degradation, I think, but it’s OK)?

In other words…can someone “guide” me on where I can start to look and read about this? The ES documentation is rich, but I could not figure out which features are applicable in the Graylog stack. What are the best practices? How would you deal with my scenario?

Thanks in advance. Please, be patient with this newbie and sorry for my bad English. xD

Hello && Welcome @araubach

First, are you aware that your version of Elasticsearch is newer than Graylog supports? Second, FYI, Graylog is moving toward OpenSearch.

Data retention in Graylog.
You have a couple of choices:
1. The Graylog Operations license has an Archive solution.
2. All logs go through the stream “All Messages”; you could route those logs to a different stream and then into a different index. Another option is the Elasticsearch repository (snapshot) configuration.
The cool thing about a repository is that you can set its path to a different volume or server.
Example:

path.repo: ["/mnt/volume_disk_2/my_repo"]
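
In case it helps, here is a minimal sketch of registering that repository once path.repo is set in elasticsearch.yml (the repository name and path are just examples, and the nodes need a restart after changing path.repo):

curl -X PUT "localhost:9200/_snapshot/my_repo" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": { "location": "/mnt/volume_disk_2/my_repo" }
}'

After that, snapshots of older indices can be taken into the repo, and the live indices can be deleted or closed to free space.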

Or perhaps something like this.

mount server:/directory/with/data /mnt

and you can verify the mount

mount -t nfs
server:/directory/with/data on /mnt type nfs (rw,addr=192.168.254.196)

Another suggestion: when configuring the index set, you could set the retention strategy to close older indices, let’s say for the proxy log index. That may save some room, but not much over a year.
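
If you ever want to close or reopen an older index by hand (the index name below is just an example), the Elasticsearch APIs look like this:

curl -X POST "localhost:9200/squid-access-logs_42/_close"
curl -X POST "localhost:9200/squid-access-logs_42/_open"

A closed index stays on disk but takes no heap and cannot be searched until it is reopened.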


Hi, @gsmith!
Thanks for the reply.

I installed and configured the application following a tutorial. After the installation, while reading the documentation, I noticed that the version was unsupported. However, so far I’ve had no problems with the application, even using this version.

I’ve already got a stream “squid-access-logs” to which I’m routing the messages, as well as an index with the same name.

I will search and read about the ES repo configuration. It may be a good solution for me, since I have remote storage where I can keep this data.

Thanks again!


To be honest: better to migrate to OpenSearch now rather than later. If your instance is still growing, you might lose less data.
How many GB of logs do you ingest every day with your Squid? In my experience it can be worth throwing a little hard drive at the problem and leaving the data in the live database, OpenSearch or Elastic. It will save you from workarounds.

A little goodie for Squid users:
It is possible to define your own logformat, which can then be referenced by your logging config.

logformat graylog_vhost { "server_fqdn": "%{Host}>h", "short_message": "%rm %"ru HTTP/%rv", "timestamp": %ts, "client_source_ip": "%>a", "squid_ip": "%la", "server_ip": "%<a", "response_time": %tr, "size_of_request": %>st, "size_of_reply": %<st, "request_url": "%"ru", "http_status_code": %>Hs, "request_method": "%rm", "squid_request_status": "%Ss", "squid_hierarchy_status": "%Sh", "mime_type": "%mt", "x_forwarded_for": "%{X-Forwarded-For}>h", "referer": "%{Referer}>h", "user_agent": "%"{User-Agent}>h"}

This produces nicely parsed JSON which can easily be ingested into Graylog :slight_smile:
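
For completeness, referencing the format from squid.conf looks roughly like this (host and port are placeholders; on the Graylog side you would have a matching UDP input, e.g. Raw/Plaintext UDP with a JSON extractor or pipeline rule):

# squid.conf - ship access logs in the custom format to Graylog
access_log udp://graylog.example.com:5555 graylog_vhost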


Yes, agreed! I’m already reading about OpenSearch. My scenario is still an initial setup for learning and testing. I’ll build another VM, probably with OpenSearch now, considering that Graylog will migrate to it in the future.

The ingestion of Squid logs is about ~2GB/day.

Yeah! I’m already using this format. It’s much better than needing to create rules on the stream or in a pipeline to parse the log after it’s received in Graylog.

Thanks for the reply, @ihe!


2 GB/day is approx. 700 GB/year, which will compress down to 400-500 GB on disk. Are you sure you don’t want to tackle the problem with a bit of hard drive space?


Yes, I will. Actually, hard disk space is not a big issue in my infrastructure. I planned to dedicate about 2 TB to the logs, considering that in the future I will be ingesting logs from other servers into Graylog. Still, available space doesn’t mean infinite space, right? :sweat_smile: And I think I’d better worry about that now, before the application gets “bigger” and holds a huge amount of log data. You know…it’s better to prevent the problem.

When you grow:

  • put streams with the same set of fields on the same index set: Winlogbeat with Winlogbeat, Squid on its own, firewall on its own and so on.
  • separate your Elastic/OpenSearch from your Graylog nodes
  • put a small load balancer in front of your Graylog

Quick sketch of how I would tune OpenSearch/Elastic next:

  • set the index set to daily rotation if suitable. For 2 GB of data per day you might increase that to a week.
  • for each 20-30 GB of data per rotation on an index set there should be one shard
  • for every 20 shards, one GB of heap on the OS/Elastic node
  • 50% of RAM for the Java heap, the rest for filesystem caches via the OS on the OS/Elastic node (see the heap example below).
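
As a concrete illustration of the heap guideline (assuming a dedicated Elastic/OpenSearch node with 16 GB of RAM; the numbers are just an example), in jvm.options:

# give half of the machine's RAM to the JVM heap, identical min/max
-Xms8g
-Xmx8g
# the remaining ~8 GB stays free for the OS filesystem cache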

Thanks for the tips! Is it necessary to deploy MongoDB on separate nodes too? What do you think is the minimum number of nodes for each component (Graylog/Mongo/Elastic)? Consider that mine is not a big-data scenario; it’s only internal applications. I want to centralize the main logs of my infrastructure (Squid access logs, DNS query logs, DHCP lease logs, iptables “drop/reject” matches, etc.), put them on dashboards with some metrics and overviews, and maybe set some alerts. Of course, high availability is desirable, to avoid losing logs sent from the servers during an eventual Graylog outage.
