Default index set 180 indices, 2,579,511,730 documents, 1.1TiB

Dear all

In Graylog I see 1.1TiB, but on the 2 Elasticsearch master/data servers, df -h shows only 5xxGB each. What is the reason?
I use Graylog 4.x open source and Elasticsearch 6.8 OSS. Elasticsearch runs with master and data nodes.

Best regards
Tigo

Hey there. Your question ^ is unclear. What are you asking about? Are you asking why the index set shows 1TB of data, but you only see 500GB used on each Elasticsearch node?

If that’s the question, my question back to you is this: are you familiar with how Elasticsearch works under the hood? An index is made up of shards, and those shards are distributed between your two nodes, so your data should be spread roughly evenly across them. It’s not as if the data is replicated between the two nodes: each node holds only a piece of your overall data set.
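
If it helps to see this concretely, here’s a minimal sketch (Python with the `requests` library; the host name is just a placeholder for one of your Elasticsearch nodes) that uses the standard `_cat` APIs to show how shards and disk usage are spread across the nodes:

```python
import requests

ES = "http://es-node-1:9200"  # placeholder: point this at one of your two nodes

# Per-node shard count and disk usage -- this is where the ~500GB per node comes from.
print(requests.get(f"{ES}/_cat/allocation?v").text)

# Which shard of which index lives on which node.
print(requests.get(f"{ES}/_cat/shards?v").text)
```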

Does this help answer your question?

Thanks for the feedback. So if one of my Elasticsearch servers dies, will I lose data? Is it not possible to have all of the data on each Elasticsearch data server?

Best regards
Tigo

Hello, in case I lose one data node (hardware defect), Graylog will stop sending data to the Elasticsearch server… How do I get Graylog to send data to the remaining Elasticsearch host? Or how do I add a new Elasticsearch host to the Elasticsearch cluster so that Graylog sends data to the cluster again?

Best regards
Tigo

Hey there. I just saw your replies. So because Elasticsearch shards data and you only have 2 nodes, if one goes down, then there’s an incomplete data set. Let’s first look at what things look like from Graylog’s perspective:

So inside of Graylog, you have an index set, and that index set is actually made up of multiple indices. However, that data looks different when it comes to Elasticsearch:

What happens is if you take the default configuration (4 shards, no replicas), each index ends up having 4 shards split across your two Elasticsearch nodes. In order for Elasticsearch to read the data, all 4 shards must be active. That means that if one node goes down, you’re down at least 2 shards, which means only half your data is present and is therefore incomplete.
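
If you want to confirm what your own indices were created with, here’s a quick sketch (Python/`requests` again; `graylog_0` just follows Graylog’s default index naming, so substitute one of your actual indices):

```python
import requests

ES = "http://es-node-1:9200"  # placeholder

# Read back the shard/replica settings an index was created with.
resp = requests.get(f"{ES}/graylog_0/_settings").json()
index_settings = resp["graylog_0"]["settings"]["index"]
print(index_settings["number_of_shards"], index_settings["number_of_replicas"])
```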

So when Graylog attempts to send data and one node is down, the Elasticsearch cluster is considered unhealthy. Because it’s considered unhealthy, Graylog cannot send data and will start to store the outgoing messages in a buffer, and eventually its journal. So all of that to say, simply having 2 nodes won’t work in this case. In order for you to still be able to send data to Elasticsearch, you’d need at least 3 nodes (though 5 will ensure some fault tolerance in your architecture) and you’ll also need to look at using replicas as a part of your index set configuration.
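
To see what “unhealthy” looks like from the Elasticsearch side, a small sketch that checks the cluster health Graylog is reacting to (same placeholder host as above):

```python
import requests

ES = "http://es-node-1:9200"  # placeholder

# "red" means at least one primary shard is unassigned, i.e. part of the data is missing.
health = requests.get(f"{ES}/_cluster/health").json()
print(health["status"], health["unassigned_shards"])
```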

There are a lot of architectural and design decisions that you’ll need to consider in order for your Graylog deployment to be one that is resilient and can withstand failure. If that’s something you need assistance with, there are folks here in the community who may volunteer some time to discuss this with you. The alternative is to reach out to our sales team and discuss options for a more formal engagement, in which our professional services team evaluates your deployment and works with you to deploy something that meets your requirements.

Thanks a lot Aaron!
In my case, if one Elasticsearch node dies, do I no longer have the possibility to add a new Elasticsearch node and bring the cluster back up?
So for disaster recovery: if I have 3 Elasticsearch nodes and 2 of them die (hardware defect), what is the right way to bring the Elasticsearch cluster online again?

Hey there, the short story is no: you’re not just able to add a node and expect that the cluster will return to a healthy state. The reason is that the shards present on the node that died are gone forever, meaning that you’ll forever have an incomplete dataset. Elasticsearch isn’t able to just magically recreate the data. The ONLY exception to this is if you decide to add replicas. Replicas are copies of shards that will be promoted if the primary shard is no longer active.
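
For illustration only (Graylog normally manages replicas through the index set configuration, so treat this as a sketch of the underlying Elasticsearch setting rather than the recommended workflow), this is what adding a replica to an existing index and checking primary vs. replica shards looks like; the index name is just an example:

```python
import requests

ES = "http://es-node-1:9200"  # placeholder

# Add one replica copy per primary shard on an existing index (index name is an example).
requests.put(f"{ES}/graylog_0/_settings", json={"index": {"number_of_replicas": 1}})

# The "prirep" column shows "p" for primary shards and "r" for replicas.
print(requests.get(f"{ES}/_cat/shards/graylog_0?v").text)
```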

That said, I’d highly recommend reading over these two articles:

^ Those will help you better understand shards, replicas and how you can deploy Elasticsearch to be fault-tolerant.

Now to answer this:

If you have a 3-node cluster and 2 of them die, you’re going to be in a bad way. Again, adding more nodes won’t solve the issue if you don’t have replicas as a part of your sharding strategy within Graylog. But even then, you’d have to have multiple replicas as part of your index set. Without replicas, you can’t hope to ever bring the cluster back into a healthy state.

Let’s use this as an example: you create an index set with 2 shards and 3 replicas for your 3-node cluster. That would look like this:

So for each shard, you’d have a copy of the data on a node. Now, in the event of failure, those replica shards would be promoted to primary shards, meaning that the data is still present:

But what this also means is that the cluster would recreate those shards on the remaining node, so you’d have to have enough disk, CPU, and RAM overhead available on that remaining node. Keep in mind that this is only for a single index set: the more index sets you have, the more overhead you’ll need in order to support a failure scenario (see the sketch after the list below). This also doesn’t take into account the following:

  • The size of the data in each index and its component shards
  • The resources you’ve allocated to Elasticsearch (heap is going to be of primary concern here)
  • Your retention requirements
  • How often you’re rotating your data
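
As a rough way to gauge that headroom, here’s a sketch that pulls heap/RAM pressure per node and per-index sizes from the `_cat` APIs (the host is a placeholder again):

```python
import requests

ES = "http://es-node-1:9200"  # placeholder

# Heap and RAM pressure per node.
print(requests.get(f"{ES}/_cat/nodes?v&h=name,heap.percent,ram.percent").text)

# Per-index primary/replica counts, document counts, and on-disk size.
print(requests.get(f"{ES}/_cat/indices?v&h=index,pri,rep,docs.count,store.size").text)
```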

With that in mind, you’ll want to also read up on these recommendations from Elasticsearch:

So take some time to read up on those links so you have a better understanding of sharding and how data is stored within Elasticsearch. This should help inform your architectural design as you deploy Elasticsearch.

Thanks a lot for your support, Aaron. But why 3 replicas for 3 cluster nodes?
