I've set up a production cluster consisting of an Elasticsearch cluster with 3 nodes and 3 Graylog servers running MongoDB as a replica set.
I set the shard count to 3 (one per node) and the replica count to 1.
Now I’m planning how to organize my logs in it. I need to retain 6 months’ worth of logs for every device sending logs to it for “hot access”, and I was thinking of having separate indices for every device for better archiving purposes (I need to store logs for 7 years). And then it comes down to:
If I set my rotation strategy to 1D and retain 180 daily indices, in 6 months I’m going to have 540 primary shards and 540 replica shards per device! So 1080 shards per device seems like a LOT! Is that a problem?
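The arithmetic behind those numbers, as a quick sketch. The function name is made up for illustration; the inputs are the 180 daily indices, 3 shards and 1 replica mentioned above.

```python
def shards_per_index_set(retained_indices: int, primary_shards: int, replicas: int) -> dict:
    """Count primary and replica shards for one index set (hypothetical helper)."""
    primaries = retained_indices * primary_shards
    replica_shards = primaries * replicas
    return {
        "primary": primaries,
        "replica": replica_shards,
        "total": primaries + replica_shards,
    }


# One index per day, kept for 180 days, 3 shards each, 1 replica.
daily = shards_per_index_set(retained_indices=180, primary_shards=3, replicas=1)
print(daily)  # {'primary': 540, 'replica': 540, 'total': 1080}
```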
When I run a search, is it possible to search all indices with the same prefix? I know I can search a specific index but it seems I can’t use wildcards for this field.
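For what it's worth, Elasticsearch itself accepts wildcard index names in the search URL, so a prefix search is possible at that level. The sketch below only builds such a request and does not send it. The host, the index prefix `firewall01` and the query are placeholders, and it assumes Graylog's usual `<prefix>_<number>` index naming. It talks to Elasticsearch directly and says nothing about what the Graylog search page allows.

```python
import json
from urllib.parse import quote


def build_prefix_search(es_url: str, index_prefix: str, query_string: str, size: int = 10):
    """Build (url, json_body) for a search across every index matching '<prefix>_*'."""
    index_pattern = f"{index_prefix}_*"
    # Keep '*' unescaped so Elasticsearch expands it as an index wildcard.
    url = f"{es_url.rstrip('/')}/{quote(index_pattern, safe='*')}/_search"
    body = {"size": size, "query": {"query_string": {"query": query_string}}}
    return url, json.dumps(body)


# Placeholder host, prefix and query, for illustration only.
url, body = build_prefix_search("http://localhost:9200", "firewall01", "level:3")
print(url)  # http://localhost:9200/firewall01_*/_search
```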
I read a lot of documentation but I’d like some thoughts from experienced people about this.
If you need to hold the data for that long without using any kind of archiving, you will need more nodes. Do you really need 7 years instantly available for search, or is it acceptable that it takes some time (hours) to make requested data searchable?
You have 3 shards with replica 1, which gives you 6 shards per index, two on each node. How much data do you have per day? Can you possibly reduce the amount by only storing the significant parts of the message?
No, I need 6 months for instant access and 7 years for regulatory purposes. I’m planning to use the archive feature to store anything older than 6 months in cloud blob storage.
Is it possible to reduce the message size on its arrival in Graylog? I can’t change it at the source.
I have seen that article about shards, but still, that’s a general document for general purposes. Graylog is really write-intensive most of the time, so I was looking for some advice focused on Graylog use. I’m getting about 3500 msgs/sec.
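To put the 3500 msgs/sec into daily terms, here is a back-of-the-envelope sketch. The average stored message size of 500 bytes is purely an assumption (it was not measured or stated in this thread), and the rate is treated as constant around the clock, so read the storage figures as an order of magnitude only.

```python
MSGS_PER_SEC = 3500          # from the post above
SECONDS_PER_DAY = 86_400
AVG_MSG_BYTES = 500          # ASSUMPTION: not given in the thread, adjust to your data
REPLICAS = 1                 # from the cluster setup described earlier
HOT_DAYS = 180               # 6 months of "hot access"

msgs_per_day = MSGS_PER_SEC * SECONDS_PER_DAY
primary_gb_per_day = msgs_per_day * AVG_MSG_BYTES / 1e9
total_gb_per_day = primary_gb_per_day * (1 + REPLICAS)
hot_tb = total_gb_per_day * HOT_DAYS / 1000

print(f"messages per day: {msgs_per_day:,}")  # messages per day: 302,400,000
print(f"primary data per day: about {primary_gb_per_day:.0f} GB")
print(f"with replicas per day: about {total_gb_per_day:.0f} GB")
print(f"180 days hot: about {hot_tb:.1f} TB")
```

Halving the stored message size halves every storage figure, which is why trimming fields on arrival can matter at this rate.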
I’m not sure how the Archive feature works, but I believe that’s what you are looking for if you are an enterprise user.
1000 shards are fine for 3 nodes with enough resources; we somehow managed to handle even ~1800 on just two nodes.
Instead of dropping messages in Pipelines, you can try a different approach that served us for some time.
Configure one stream for match-all and route it to some temporary index set, which will store all messages just for some time before you discard them.
Process messages with Pipelines from that match-all stream and route them to the specific stream based on your retention conditions.
Use one stream per retention policy (e.g. 6m, 7y…). Each stream needs to be configured to use its own index set with the appropriate retention strategy. No need to rotate every day; you can rotate every 7, 10 or more days. In our case the index size rotation strategy turned out to be the best option for keeping the number of shards minimal.
Manual archiving can be done with the ES Snapshot API and a cloud storage plugin such as S3, GCS and similar. Don’t forget to trigger index range recalculation, or close the index through GL so that it does this for you.
Once you have that in place you can even use a separate node/cluster to temporarily restore archived data when needed.
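A rough sketch of what the manual archiving calls could look like against the ES Snapshot API. It only builds the requests (method, URL, JSON body) and does not send anything. The repository name, bucket and index name are placeholders, using the index name as the snapshot name is just one possible convention, and the `s3` repository type assumes the S3 repository plugin is installed on the cluster.

```python
import json


def register_s3_repo(es_url: str, repo: str, bucket: str):
    """Request that registers an S3-backed snapshot repository."""
    body = {"type": "s3", "settings": {"bucket": bucket}}
    return ("PUT", f"{es_url}/_snapshot/{repo}", json.dumps(body))


def snapshot_index(es_url: str, repo: str, index: str):
    """Request that snapshots a single index; the snapshot is named after the index."""
    body = {"indices": index, "ignore_unavailable": True, "include_global_state": False}
    url = f"{es_url}/_snapshot/{repo}/{index}?wait_for_completion=true"
    return ("PUT", url, json.dumps(body))


def restore_index(es_url: str, repo: str, index: str):
    """Request that restores that index, e.g. on a separate node/cluster."""
    url = f"{es_url}/_snapshot/{repo}/{index}/_restore"
    return ("POST", url, json.dumps({"indices": index}))


# Placeholder values, for illustration only.
ES = "http://localhost:9200"
for method, url, body in (
    register_s3_repo(ES, "log_archive", "my-archive-bucket"),
    snapshot_index(ES, "log_archive", "longterm_42"),
    restore_index(ES, "log_archive", "longterm_42"),
):
    print(method, url, body)
```

After a restore, Graylog still needs the index range recalculated before the data shows up in searches, as mentioned above.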
Hi @breshich, I need your suggestion for our case. Our ES cluster, currently on version 5.6.4, has the specification below:
3 ES master nodes - 32 GB / 8 cores
4 ES data nodes - 64 GB / 16 cores
Number of shards right now - 2364 (the default shard count is 4)
We have 4 index sets right now in Graylog:
1st index set: small data, rotation period of 7 days (max indices 50) - 1 replica
2nd index set: huge number of documents, rotation period of 6 hours (max indices 400) - 1 replica
3rd index set: rotation period of 1 day (max indices 90) - no replica
4th index set: rotation period of 1 day (max indices 90) - 1 replica
Sometimes ES searches over 45 days, or over more than 60 days, fail, but sometimes they work; we don’t know the reason yet. At peak time our message rate is 15k-25k; generally it’s 3k.
Please suggest what changes I should make to optimize ES performance.
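For reference, the shard ceiling implied by the four index sets above can be tallied like this. It is only a sketch: it assumes every index set uses the default of 4 shards and that each set eventually fills up to its max indices, neither of which is confirmed in the post.

```python
DEFAULT_SHARDS = 4   # "default shard" value from the post; assumed for all sets
DATA_NODES = 4

# (label, max indices, replicas) as listed above
index_sets = [
    ("1st", 50, 1),
    ("2nd", 400, 1),
    ("3rd", 90, 0),
    ("4th", 90, 1),
]


def max_shards(max_indices: int, replicas: int, shards: int = DEFAULT_SHARDS) -> int:
    """Primary plus replica shards once the index set holds max_indices indices."""
    return max_indices * shards * (1 + replicas)


per_set = {name: max_shards(n, r) for name, n, r in index_sets}
total = sum(per_set.values())

for name, count in per_set.items():
    print(f"{name} index set: up to {count} shards")
print(f"total: up to {total} shards, about {total // DATA_NODES} per data node")
# 1st: 400, 2nd: 3200, 3rd: 360, 4th: 720 -> total 4680, about 1170 per data node
```

Under these assumptions the set that rotates every 6 hours accounts for most of the shards, so it would be the first candidate for a longer rotation period.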
Hi @amitshar04, I really don’t have experience with a setup and load like yours, so can’t give you a very reliable suggestion for it, as every case is specific on its own.
Based on the rule of thumb from the article @jan linked, it looks like you are not lacking memory, but regardless you should monitor all system resources and analyze the findings at peak time.
To me, 25K looks like a very high ingest rate, and it can have an impact on both CPU and disk IOPS.