Hi all, we’ve been running Graylog for several years and are looking to move it to a public cloud and make it a bit more resilient.
During the investigation and sizing/design process, I've done some number crunching and realised that our average of ~130GB per day of input to Graylog results in an expansion factor of over 10 in the Elasticsearch cluster: to store 6 days' worth of logs, we're using 8TB of backend storage for Elasticsearch.
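For context, the back-of-the-envelope maths I'm working from (just taking the ~130GB/day and ~8TB figures at face value) is roughly this:

```python
# Back-of-the-envelope expansion factor (assumes ~130 GB/day ingest and ~8 TB used)
daily_ingest_gb = 130            # average raw input to Graylog per day
retention_days = 6               # current retention window
backend_storage_gb = 8 * 1024    # ~8 TB of Elasticsearch storage in use

raw_logs_gb = daily_ingest_gb * retention_days   # ~780 GB of raw log data retained
expansion_factor = backend_storage_gb / raw_logs_gb

print(f"Raw logs retained: {raw_logs_gb} GB")
print(f"Expansion factor:  {expansion_factor:.1f}x")   # ~10.5x
```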
The way our log data is being indexed is obviously creating this overhead, but I'm struggling with how to start identifying the issues. I've Googled a fair bit, but I suspect my limited knowledge means I'm not searching with the right terminology to find anything helpful.
We rotate our indices hourly and keep 432 of them for our 6-day retention. The short retention is down to the ancient hardware the Elasticsearch cluster runs on and its slow, non-optimal storage. The retention will probably increase to 30 days in the new cloud environment, hence the need to optimise the backend storage requirements.
The average index is 8 to 10GB of primary size, split into 4 shards with 1 replica, so the numbers add up: we really are using ~8TB of backend storage for our ~130GB per day of input. This should rule out orphaned data that is no longer valid sitting in Elasticsearch.
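For reference, this is roughly how I've been sanity-checking the per-index numbers; the host is a placeholder and I'm assuming our indices use the default graylog_ prefix:

```python
# Sanity-check per-index primary vs. total store size via the _cat API.
# ES_HOST is a placeholder; the graylog_* pattern assumes the default index prefix.
import requests

ES_HOST = "http://elasticsearch.example.com:9200"

resp = requests.get(
    f"{ES_HOST}/_cat/indices/graylog_*",
    params={
        "v": "true",
        "h": "index,pri,rep,docs.count,pri.store.size,store.size",
        "s": "index",
    },
)
print(resp.text)  # with 1 replica, store.size should be roughly double pri.store.size
```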
The cluster status is green, Graylog is happily rotating indices etc., and everything has been stable for a while.
Graylog is version 2.3.2 and Elasticsearch is version 5.6.5.
Can anyone point me in the right direction for identifying why our overhead is so large? I figure an expansion factor of >10 is probably close to 10 times what we should be seeing, so we're obviously doing something wrong somewhere.
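In case it helps show where my head is at, this is the sort of thing I've started poking at: dumping the mapping of the newest index and counting field types, on the guess that a very wide mapping or lots of analysed string fields might be part of the overhead (the host and index name are placeholders):

```python
# Dump the mapping of one index and summarise field types, since a very wide
# mapping (or everything indexed as analysed text) is one place overhead can hide.
import requests
from collections import Counter

ES_HOST = "http://elasticsearch.example.com:9200"   # placeholder
INDEX = "graylog_431"                                # placeholder: the newest index

mapping = requests.get(f"{ES_HOST}/{INDEX}/_mapping").json()

type_counts = Counter()
for index_name, index_body in mapping.items():
    for doc_type, type_body in index_body.get("mappings", {}).items():
        for field, definition in type_body.get("properties", {}).items():
            type_counts[definition.get("type", "object")] += 1

print(f"Top-level fields in {INDEX}: {sum(type_counts.values())}")
for field_type, count in type_counts.most_common():
    print(f"  {field_type}: {count}")
```

I'm not sure if that's even the right thing to be looking at, so any pointers on what actually drives the on-disk size would be much appreciated.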