Graylog heap size maximum

I’ve got a server with 256 GB of RAM and I’m researching which heap size will give the best performance.
For Elasticsearch there is a clear rule: “Don’t give it more than 32 GB”
https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html#compressed_oops
But I didn’t find such guidelines for Graylog.
So, should we assume the ES approach also applies to Graylog, or does Graylog have its own way and will it, for example, work faster with 128 GB?

The way I’ve understood things so far:

  • ElasticSearch greatly benefits from lots of RAM.
  • Graylog greatly benefits from more CPUs.

So while you can max out ES’s heap and RAM, Graylog won’t have much benefit from the same treatment.

Thanks for linking to that article; I’ll have to look into why ES can only go to a certain point. I guess that, if you need to grow beyond that limit, you’ll just have to add more and more ES nodes.

EDIT:
Ah! Another point in that article suggests that, if your host has lots more RAM, you can also increase the file access caching into RAM which leads to faster access times. Noice.

I think the reason for the “don’t give ES more than 32 GB” rule is that with a heap larger than that, Java switches to a different (uncompressed) type of object pointer, which causes (or caused) issues for ES. What ES really benefits from is having free RAM for the OS disk cache: our data servers have 64 GB of memory, with 32 GB allocated to ES, and the other 30-odd gigabytes are generally in use by Linux’s disk cache.
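
If you want to see that cutoff for yourself, here is a minimal sketch (the 31 GB value and the jvm.options path are the commonly used safe margin and the packaged-install default, not something from this thread):

    # /etc/elasticsearch/jvm.options -- stay below the compressed-oops cutoff
    -Xms31g
    -Xmx31g

    # Ask HotSpot whether compressed ordinary object pointers are enabled
    # for a given heap size (true just below ~32 GB, false above it):
    java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops
    java -Xmx33g -XX:+PrintFlagsFinal -version | grep UseCompressedOops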

For Graylog, memory isn’t too big of an issue; I allocate 16 GB to our Graylog instances and that seems to be more than enough.
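
In case it helps, the Graylog heap is set through the JVM options in the service defaults file. A minimal sketch, assuming a package install (the path differs per distro, e.g. /etc/default/graylog-server on Debian/Ubuntu or /etc/sysconfig/graylog-server on RHEL-style systems):

    # /etc/default/graylog-server
    # Only the -Xms/-Xmx values are changed here; keep the GC and other
    # flags your package shipped with.
    GRAYLOG_SERVER_JAVA_OPTS="-Xms16g -Xmx16g"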

1 Like

Just to mention it: giving Graylog too much heap will sometimes make GC a problem.

We do not see many environments with more than 16 GB; most are ~12 GB or less, depending heavily on their lookup table and cache usage.

1 Like

Thanks, so I assume allocating more than 16 GB to Graylog is pointless and more RAM here doesn’t mean more performance.
I believe this should be reflected in a guide or in the Graylog documentation.

Also, GL and ES scale horizontally rather than vertically, so install your favorite container/VM software and run more hosts on your metal. You can use all of your memory this way (see the compose sketch below for the container route).
You can check the Graylog sizing chart to see what resources you need.
https://docs.google.com/viewer?a=v&pid=forums&srcid=MTMyNzU4MDI3MTY0NTIwNzM3MDcBMTc1MTY2ODg0OTMxNzA5MTc1OTYBbHpVZEN5SDNBUUFKATAuMQEBdjI&authuser=0
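
If you go the container route, a heavily simplified docker-compose sketch of one Graylog + MongoDB + Elasticsearch stack could look roughly like this (image tags, ports, heap sizes and secrets are placeholders, not a recommendation from this thread):

    # docker-compose.yml -- single-stack sketch, not production-ready
    version: "3"
    services:
      mongodb:
        image: mongo:4.2
      elasticsearch:
        image: docker.elastic.co/elasticsearch/elasticsearch-oss:6.8.23
        environment:
          - discovery.type=single-node
          - "ES_JAVA_OPTS=-Xms4g -Xmx4g"   # heap per ES container, size to your split
      graylog:
        image: graylog/graylog:4.3
        environment:
          # both values are placeholders -- generate your own
          - GRAYLOG_PASSWORD_SECRET=replace-with-16plus-chars
          - GRAYLOG_ROOT_PASSWORD_SHA2=replace-with-sha256-of-admin-password
          - GRAYLOG_HTTP_EXTERNAL_URI=http://127.0.0.1:9000/
        depends_on:
          - mongodb
          - elasticsearch
        ports:
          - "9000:9000"        # web interface / API
          - "1514:1514/udp"    # example syslog input

Running several such stacks (or adding more graylog/elasticsearch services) is how you spread the load across the box.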

2 Likes

Thanks for your advice, but it seems that in my test environment CPU is the bottleneck and adding more instances will not improve anything.
So on my 256 GB RAM / 56 cores / 18 TB HDD server I’m able to deploy only 1 Graylog + 1 ES.

No? Dude, no :smiley: That’s what @macko003 and I are telling you: split up those resources! Make multiple VMs.

With those resources you can easily run:

  • Three ElasticSearch nodes (3*64GB = 192GB, 2 cores per VM)
  • Three Graylog+MongoDB nodes (3*32GB = 96GB, 4 cores per VM)

As Macko, myself and others like @benvanstaveren have already pointed out here and in other threads: many modern applications scale horizontally, not vertically. If you need more processing power, you add on extra cluster nodes instead of pumping up one node to huge proportions.

I’m crying for a similar test system.
My live system doesn’t reach even a third of that, and I handle over 15k msg/s with it.

1 Like

Dude, what OP described is basically half of one of my data centers. We use resources like that one box to run a whole frickin’ corporate network :smiley: Trying to use one fully stacked box like that to run one single application stack feels so odd :smiley:

Like buying a Lamborghini to do your grocery shopping :smiley:

1 Like

Sounds good, but how is that possible?
If I’ve already reached the CPU limit (about 20k msg/s with short peaks up to 200k msg/s), how can splitting resources help?
Are VMs the only option? What about containers?

It will help you effectively split the whole workload across all those resources you have!

One problem with processing huge amounts of data is that one set of processes can only handle so much. There is no such thing as limitless parallelization for a single set of processes. I’m sure Graylog is very well built, but I sincerely doubt that Graylog can push 56 cores to the max with one install. I’m sure @jan can weigh in on that :wink:

Let me sketch my simple setup:

  • We have four Graylog servers in our environment.
  • One Graylog server is the GUI / query box, the other three are ingestors.
  • All of our inputs are configured to run on the three ingestors.
  • Our log sources are configured to send their data to these ingestors. The Beats variants can figure out load balancing themselves, so we just provide the three ingestor IPs. Other, dumber protocols are directed towards a load-balanced address which forwards and divides the traffic between the three ingestors (see the sketch after this list).
  • Off the bat, each of these three Graylog hosts will get about 1/3 of all incoming traffic. Three hosts, each having their own journal and their own individual inputs, with fewer active connections and less incoming data than your single box.
  • We also have three ElasticSearch boxen in a cluster. Because we combine sharding with replication these boxen provide both load-balancing and high availability (one box can fail without us losing data).
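
As an illustration of that load-balanced address, here is a minimal sketch using the nginx stream module; the IPs, ports and the choice of nginx are my assumptions, and any TCP/UDP load balancer works the same way:

    # nginx.conf (stream module) -- spread plain syslog across the ingestors
    stream {
        upstream graylog_syslog {
            server 10.0.0.11:1514;
            server 10.0.0.12:1514;
            server 10.0.0.13:1514;
        }
        server {
            listen 514 udp;             # dumb UDP syslog from network gear
            proxy_pass graylog_syslog;
        }
    }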

We’re putting resources to much better use now!

  • Elastic can use three times the theoretical max of RAM, instead of just one.
  • Graylog is running multiple inputs to parse the whole incoming blarf of data, instead of one pipe.
  • Graylog can now also process many more messages in parallel, because effectively you’re running three Graylogs that each parse their own set of incoming data.

I don’t know enough about containers versus VMs to answer your last question. I can imagine that containers can help you achieve the same goal, but in a different way. Me, I’m just used to working with virtualization.

I’m sure Graylog is very well built, but I sincerely doubt that Graylog can push 56 cores to the max with one install. I’m sure @jan can weigh in on that

Funny war story from the field: I have seen a box with 72 cores but half the RAM, and that box was for Graylog only (!!). With inputbuffer_processors, processbuffer_processors and outputbuffer_processors well tuned (4 input, 40 processing and 8 output) this beast did heavy processing on a constant stream of 150k msg/s. But it was backed by a 6-node ES cluster to take the load and IO and to be resilient to hardware failure.

The key in this: have 3/4 of the available cores configured for the input, processing and output threads, with the bulk going to processing. More input threads are only needed if you have issues with TCP connection timeouts or similar, and output can easily overwhelm Elasticsearch because each thread will open one bulk ingest thread in Elasticsearch. Processing, on the other hand, means dedicated cores to crunch numbers (regex and all of that).
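
To make that concrete, here is a sketch of the relevant server.conf settings using the split from the 72-core example above (the exact numbers are illustrative; tune them to your own core count and load):

    # graylog server.conf -- buffer processor threads
    # Keep the sum comfortably below the physical core count so the OS,
    # journal and GC still have headroom.
    inputbuffer_processors = 4
    processbuffer_processors = 40
    outputbuffer_processors = 8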

If the available hardware is split between ES and GL it becomes tricky, because ES is very insistent about taking the resources it detects.

3 Likes

So it seems Graylog actually does scale vertically, and well-tuned inputbuffer_processors, processbuffer_processors and outputbuffer_processors on a powerful single physical machine will beat a swarm of VMs.
Thanks everybody. I will research a VM-based configuration; currently I’m not able to process more than 20k msg/s for any length of time (and search doesn’t work during indexing at that speed).

1 Like

In that case, Jan, was I off base? Is my understanding of Graylog performance just too basic? The way it reads right now, it sounds like horizontal scaling doesn’t apply to Graylog itself.

Both work.

That is the nice thing about Graylog: some people prefer to build smaller boxes to be fail-safe against hardware issues, others want one big box. But that is just for Graylog; Elasticsearch is something different.

1 Like

I could use some clarification on this statement. Do you refer to OP’s case where they were running both ES and GL on one huge piece of iron? That this would lead to Elastic acting oddly? Or do you mean something else?

Either way I think we can conclude that OP would be well-served by at least switching Elastic to a clustered approach.

I could use some clarification on this statement. Do you refer to OP’s case where they were running both ES and GL on one huge piece of iron? That this would lead to Elastic acting oddly? Or do you mean something else?

That is the exact meaning.

See here: Performance advice. I'm missing something - #2 by jan

I was about to link to that excellent post of yours right here :smiley:

Dude, that’s such a great explanation!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.