Confusion surrounding hardware

Hi there, apologies in advance if any of these questions have been answered before or are considered common knowledge, but big data is fairly new to me. I have a few questions that have been tormenting me for quite a while and that I can't seem to find concrete answers to.

The situation is I have roughly 4TB+ of data at rest in various formats (JSON, TXT, RTF, SQL, CSV, TSV, etc.) which I use on a daily basis for work in online investigations. The trouble is that a single ripgrep search through the larger items can take 15 to 20 minutes, which severely hinders productivity; hence the need to index it all for much faster searches, and how I stumbled from ELK (what a headache that was) to Graylog.
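For context, a typical search looks something like this (the pattern and directory are placeholders, not my real data):

```sh
# hypothetical pattern and path; one pass like this over the larger
# dumps is what takes 15 to 20 minutes
rg -i --no-heading 'person-of-interest@example.com' /data/archive/
```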

My questions are (assuming data is indexed via graylog):

  • For data at rest, are SSDs still significantly quicker for performing searches?
  • Why is it not recommended to use NFS/SMB for ingesting data?
  • Is Grok the best way to “parse” databases without prior cleaning?

I already have 32TB of enterprise spinning disks and would rather not shell out another few thousand for SSDs unless the performance increase is VASTLY greater.

Any help is much appreciated.

Hi xLqF7NG3c6wUfpHh,

Welcome to the Graylog community. Your question is important to all members. I’ve moved it to a place we call “Daily Challenges” where it’ll get attention more quickly. If you have any visuals or examples that will support your question, please post them. It often helps to accelerate a response.

Thanks for joining. We’re glad you’re a part of our community.

Hey there, I’ll attempt to answer your questions:

For data at rest, are SSDs still significantly quicker for performing searches?

Yes. Since the data itself is indexed in Elasticsearch, I’ll point you to their recommendations: Tune for search speed | Elasticsearch Guide [7.13] | Elastic
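One tip from that guide that fits data at rest particularly well: since your indices won't change once ingested, you can force-merge each one down to a single segment, which speeds up searches on any disk type. A rough sketch, with a made-up index name:

```sh
# force-merge a read-only index down to one segment
# (the index name here is just an example)
curl -X POST "localhost:9200/investigations-2021/_forcemerge?max_num_segments=1"
```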

Why is it not recommended to use NFS/SMB for ingesting data?

See Why NFS is to be avoided for data directories - #3 by DavidTurner - Elasticsearch - Discuss the Elastic Stack. Elasticsearch is particularly sensitive to latency, which I’ve also seen in other data stores like etcd.
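In practice, that means keeping Elasticsearch's data directory on local storage rather than on a network mount. A minimal sketch of the relevant elasticsearch.yml settings, with illustrative paths:

```yaml
# keep the data directory on a local disk, not an NFS/SMB mount
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
```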

Is Grok the best way to “parse” databases without prior cleaning?

It really depends on where you’re planning to use Grok patterns. In general, I’d avoid using an extractor; they’re more computationally intensive than applying a Grok pattern in a pipeline rule. I personally don’t have a ton of experience using Grok patterns, but there are folks in the community who do and may be able to provide more input on this question.
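For what it's worth, here's roughly what a Grok pattern in a pipeline rule looks like; the rule name, pattern, and field names below are purely illustrative, not tailored to your data:

```
rule "illustrative grok parse"
when
  has_field("message")
then
  // grok() returns the named captures as a map of field names to values
  let parsed = grok(
    pattern: "%{IPORHOST:client_ip} %{WORD:http_method} %{URIPATHPARAM:request_path}",
    value: to_string($message.message),
    only_named_captures: true
  );
  // attach the extracted fields to the message
  set_fields(parsed);
end
```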

So all that to say, I would definitely plan on using SSDs for your storage, especially if you’re concerned about productivity. The spinning disks are going to cause more headaches than they’re worth, IMO, and given that Elasticsearch is also sensitive to disk latency, you’ll want to do everything you can to minimize it. Hope this helps.

