Hi all, we’ve been running Graylog for several years and are looking to move it to a public cloud and make it a bit more resilient.
During the investigation and sizing/design process, I've done some number crunching and realised that our average of ~130GB per day of input to Graylog results in an expansion factor of over 10 in the Elasticsearch cluster. To store 6 days' worth of logs, we're using 8TB of backend storage for Elasticsearch.
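In case it helps anyone sanity-check it, here's the rough maths behind that figure (a quick sketch, treating the 8TB as total cluster storage including replicas):

```python
# Rough expansion-factor arithmetic, using the figures quoted above.
daily_input_gb = 130          # ~130GB/day ingested into Graylog
retention_days = 6
total_storage_gb = 8 * 1024   # ~8TB of Elasticsearch backend storage

raw_data_gb = daily_input_gb * retention_days       # ~780GB of raw log data
expansion_factor = total_storage_gb / raw_data_gb   # comes out a bit over 10
print(f"expansion factor: {expansion_factor:.1f}")
```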
Obviously the way our log data is being indexed is creating this overhead, but I'm struggling with how to start identifying the issues. I've Googled a fair bit, but I suspect my limited knowledge means I'm not searching with the right terminology to find anything helpful.
We rotate our indices hourly and keep 432 of them for our 6-day retention. The short retention is due to the ancient hardware in use for Elasticsearch and the slow, non-optimal storage it sits on. This timeframe will probably increase to 30 days in the new cloud environment, hence the need to optimise the backend storage requirements.
The average index is 8 to 10GB of primary size, split into 4 shards with 1 replica, so the numbers do add up: we really are using 8TB of backend storage for our ~130GB per day of input. That should rule out orphaned data lingering in Elasticsearch.
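For what it's worth, this is roughly how I'm checking the index sizes — a minimal sketch against Elasticsearch's `_cat/indices` API. It assumes the default `graylog_*` index prefix and an unauthenticated cluster on `localhost:9200`, so adjust to suit:

```python
import requests

# List Graylog indices with primary and total (primary + replica) store sizes.
resp = requests.get(
    "http://localhost:9200/_cat/indices/graylog_*",
    params={
        "format": "json",
        "bytes": "gb",
        "h": "index,pri,rep,docs.count,pri.store.size,store.size",
    },
)
resp.raise_for_status()

for idx in sorted(resp.json(), key=lambda i: i["index"]):
    print(f"{idx['index']:<20} primaries={idx['pri']} replicas={idx['rep']} "
          f"primary={idx['pri.store.size']}GB total={idx['store.size']}GB")
```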
The cluster status is green and Graylog is happily rotating indices etc. and has been stable for a while.
Graylog is version 2.3.2 and ElasticSearch is version 5.6.5.
Can anyone point me in the right direction to start identifying why our overhead is so large? I figure an expansion factor of >10 is roughly 10 times what we should be seeing, so clearly we're doing something wrong.
I don’t know what you mean by expansion factor, but I suspect you have too many shards that are too small. Fewer, bigger shards should improve your performance.
Thanks for the reply, I did see that in my searching but didn’t read it in detail as I didn’t think it would apply to this situation. I will give it a closer look.
“Expansion factor” is a term I came across with regard to Elasticsearch sizing, which I took to mean the size of the data being indexed vs. the size of the resulting index. On re-reading, the term is actually “expansion ratio”. This is the article:
Going through the examples in that article, the maximum ratio seems to be on the order of ~1.1, but in my case ~130GB per day of indexed data produces a primary index size of ~670GB per day.
I just realised I didn’t take replicas into account in my initial calculation; even so, the expansion ratio is still ~5.25, which I’m assuming is a lot more than it should be.
This is where my concerns lie… consuming 8TB of on-prem storage we already own for 6 days of data isn’t an issue, but moving this to a cloud solution means I need to optimise the index sizes, as that 8TB will quickly become 40TB at 30 days of retention, and we’re likely to send even more data to Graylog on top of that, which will blow it out further still.
Just a quick update: yes, that linked article was very helpful with regard to shard sizing. I calculate an average shard size of ~2.5GB, which is way too small.
I’ve adjusted our Graylog index rotation so we now rotate every 12 hours, which should give us an average shard size of ~30GB and leaves a decent buffer below the recommended 50GB maximum shard size.
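The arithmetic behind that, in case it helps anyone else (assuming ~9GB of primary data per hourly index, i.e. roughly my current average):

```python
# Shard sizing before and after the rotation change (assumed averages from above).
hourly_primary_gb = 9      # ~8-10GB primary per hourly index
shards_per_index = 4

old_shard_gb = hourly_primary_gb / shards_per_index        # ~2.25GB - far too small
new_shard_gb = hourly_primary_gb * 12 / shards_per_index   # ~27GB with 12-hour rotation
print(f"old ~{old_shard_gb:.1f}GB/shard, new ~{new_shard_gb:.0f}GB/shard")
```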
It’ll only take us a few days to see what impact this has on our backend sizing for Elasticsearch.
Going forward, I’ll need to keep monitoring our average shard size and adjust the rotation interval to maintain it, which isn’t particularly complicated.
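Something like this is what I have in mind for that ongoing check — a rough sketch using the `_cat/shards` API (again assuming the default `graylog_*` prefix and no authentication), just to keep an eye on the average primary shard size:

```python
import requests

# Average primary shard size across Graylog indices, via the _cat/shards API.
resp = requests.get(
    "http://localhost:9200/_cat/shards/graylog_*",
    params={"format": "json", "bytes": "b", "h": "index,shard,prirep,store"},
)
resp.raise_for_status()

# Only count assigned primary shards ("p"); unassigned shards have no store size.
primaries = [int(s["store"]) for s in resp.json()
             if s["prirep"] == "p" and s["store"] is not None]
avg_gb = sum(primaries) / len(primaries) / 1024 ** 3
print(f"{len(primaries)} primary shards, average {avg_gb:.1f}GB "
      f"(aim to stay well under the ~50GB guideline)")
```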
I suspect it won’t affect the disk usage much, but in my experience the Elasticsearch cluster gradually became much quicker when I made those changes.
Where do you get the figure for the amount of data being indexed, the one you’re comparing your disk usage against? My feeling is that the disk usage should be roughly in line with the numbers shown on Graylog’s index sets page.
A bit over 24 hours in and no, the space hasn’t really changed at all, although I still have a lot of the hourly indices waiting to “age off”.
I’ve yet to determine whether there’s any difference in indexing/search performance, but the change certainly hasn’t had any ill effects.
To obtain my GB/day figure, I wrote a Perl script a couple of years ago that reads the throughput of each input on each node of the cluster and tracks the delta over a 24-hour period. At the time I was trying to justify purchasing new hardware for Elasticsearch so we could move data off Splunk, so I needed a GB/day number to compare against the Splunk licensing costs.
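For anyone wanting a similar GB/day number without writing their own script, a rough alternative is to sum the primary store size of the indices created in the last 24 hours straight out of Elasticsearch. Note this measures indexed size rather than the raw input size my Perl script reports, but it's close enough for capacity planning. A sketch, with the same assumptions as before about the `graylog_*` prefix and no auth:

```python
import time
import requests

# Approximate daily volume by summing the primary store size of indices
# created in the last 24 hours (indexed size, not raw input size).
resp = requests.get(
    "http://localhost:9200/_cat/indices/graylog_*",
    params={"format": "json", "bytes": "b", "h": "index,creation.date,pri.store.size"},
)
resp.raise_for_status()

cutoff_ms = (time.time() - 24 * 3600) * 1000
recent = [i for i in resp.json() if float(i["creation.date"]) >= cutoff_ms]
total_gb = sum(int(i["pri.store.size"]) for i in recent) / 1024 ** 3
print(f"{len(recent)} indices created in the last 24h, ~{total_gb:.0f}GB primary")
```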
I must add that I have a bit of egg on my face this morning… we’re not retaining 6 days of data at all; we’re in fact retaining 18 days. I had completely forgotten that the last time I changed Graylog’s retention settings I’d been able to increase the retention (documentation, anyone?).
So the 8TB of backend Elasticsearch storage is actually OK, and works out to an expansion ratio of ~1.75:1, which is fine.
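For completeness, the corrected maths (again treating the 8TB as primaries plus one replica):

```python
# Corrected expansion-ratio arithmetic with the actual 18-day retention.
daily_input_gb = 130
retention_days = 18
total_storage_gb = 8 * 1024                # primaries + replicas
primary_storage_gb = total_storage_gb / 2  # 1 replica, so roughly half is primary

ratio = primary_storage_gb / (daily_input_gb * retention_days)   # ~1.75
print(f"expansion ratio ~{ratio:.2f}:1")
```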
Apologies for the confusion and misinformation here, I’ve set myself on a bit of a wild goose chase.
However, I do appreciate the replies and I think the information you’ve provided is invaluable regardless, so thanks again for that.