Graylog grinding to a halt

I’ve had my share of experience with Graylog in other use cases, but now I’m running into issues.

Setup:

  • Graylog 4.0.0 (installed from an OVA)
  • 8GB memory
  • 2 vCPUs (yes, it’s on a VM)
  • Elasticsearch (bundled in the appliance)

All running on 1 node (as this was an appliance install).
We have around 100 servers running Sidecar with winlogbeat and 50-odd pieces of networking gear. Java is running at 100% all the time and the web GUI is not responding.

What’s wrong here? The fact that we’re using the OVA instead of a clean install? The sizing? Parameters? Winlogbeat produces around 120 msg/second.

The OVA isn’t really recommended for production use; it’s more for PoC/testing (see Virtual Machine Appliances — Graylog 4.0.0 documentation).

You most likely need more resources on the host for the amount of data you’re ingesting.
You could either try to resolve your issue by giving the VM more resources or, build a fresh install using the available packages.

https://docs.graylog.org/en/4.0/pages/installation.html

Hmmmm…off the top of my head, it’s hard to help narrow down what the issue is. Are you seeing a backlog of messages? I guess having a better idea of what “grinding to a halt” would help. What do the system resources look like? What do your heap values look like on the VM?

Xms represents the initial size of total heap space
Xmx represents the maximum size of total heap space
-Xms1g
-Xmx1g

So really tiny. This is the default on the OVA.

Now my question is about the redesign (using the open-source version of Graylog/Elasticsearch). I need 1 year of data retained (indices can be closed after a month). We are getting roughly 10 GB (or 27 million documents) every 10 days.
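Rough math on those numbers (my own back-of-envelope, so double-check it):

  10 GB / 10 days ≈ 1 GB/day ≈ 365 GB/year of raw index data (before any Elasticsearch replicas)
  27 million docs / 10 days = 2.7 million docs/day ≈ 31 msg/second sustained average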

Should I be looking at more than one GL/ES server, as in a load-balanced 3×GL / 3×ES setup? Or would one work with enough horsepower?

I’d start by increasing the heap, both for Graylog and Elasticsearch. If you’ve got 8GB on that box, increasing each to, say, 2GB should give you plenty of headroom to spare. That said, the OVA isn’t something I would consider for production use. Granted, it makes things easy by having each of the component pieces preinstalled, but you’ll still have to tune things (as you’re seeing now). A single node can get you quite a bit of mileage (we’ve not really benchmarked what a single node of your size can do for throughput), but for production use cases, you’ll definitely want more than just the single node.
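For reference, if the OVA follows the standard package layout, the heap settings should live in the usual places; a minimal sketch, assuming those default paths (double-check them on your install):

  # Graylog heap: /etc/default/graylog-server (RPM installs use /etc/sysconfig/graylog-server)
  # keep the other options already set in this variable, just raise -Xms/-Xmx
  GRAYLOG_SERVER_JAVA_OPTS="-Xms2g -Xmx2g …"

  # Elasticsearch heap: /etc/elasticsearch/jvm.options
  -Xms2g
  -Xmx2g

  # then restart both services
  sudo systemctl restart elasticsearch graylog-server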

One example of a close-to-real-world production setup would look something like:

  • 3 Elastic nodes
  • 3 MongoDB nodes (1 primary, 2 secondaries in a replica set)
  • 3 Graylog nodes (load balancer in front for api/UI/inputs)

That alone will get you quite a bit of mileage depending on what you’re doing with the logs. If it’s just dumping things to Graylog without any sort of additional processing, this setup would be able to do quite a bit of throughput. But of course, that would change if you’re doing heavy post-processing of your log messages.
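For the “load balancer in front” piece, a minimal sketch of what that might look like with nginx (HAProxy or any HTTP/TCP load balancer works just as well; the hostnames and ports here are placeholders):

  # goes in the http {} context of nginx.conf or a conf.d/ file
  upstream graylog_web {
      server graylog1.example.com:9000;
      server graylog2.example.com:9000;
      server graylog3.example.com:9000;
  }

  server {
      listen 80;
      location / {
          proxy_pass http://graylog_web;
          proxy_set_header Host $host;
          proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
          proxy_set_header X-Graylog-Server-URL http://$host/;
      }
  }

  # inputs (e.g. Beats on 5044/tcp) would be balanced separately as plain TCP streams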

So all that to say, if there’s a critical business need to keep Graylog up, then I’d architect it like so–a single node won’t get you there.

As I am just the network admin, they have chosen to opt for a small-scale setup: 1 Graylog server, 1 Elasticsearch server. What would (given the numbers above) be a good memory, CPU, and disk sizing? @aaronsachs

Under System → Nodes, take a look at your JVM heap usage.

One other thing that I’d note here is that you’ll probably want to take a look at some of the info under Nodes → Details (buffer utilization, journal, and heap usage all show up there).

If you’re ingesting 10GB of data over 10 days, and Graylog has issues, finding out where the bottleneck is would be helpful, because then we can make some targeted recommendations as to the sizing. I will note that having 2 CPUs with that much data is probably going to be an issue. Here’s why–Graylog has 3 buffers: Input, Process, and Output. The input is responsible for just ingesting messages. The process buffer is responsible for processing those messages (i.e., parsing and manipulating those messages), and the output buffer is just responsible for shoving all of that into Elasticsearch. Each of those buffers corresponds to at least 1 cpu or core. If you only have 2 single core CPUs available, then they’re probably going to have to work overtime if you’re doing anything beyond just ingesting the data (especially if you’re using pipelines, or if you have any computationally intensive rules).
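For what it’s worth, the buffer-to-core mapping is controlled by these settings in server.conf; the values below are the shipped defaults, which already assume far more cores than the 2 vCPUs you have:

  # /etc/graylog/server/server.conf (defaults)
  inputbuffer_processors = 2
  processbuffer_processors = 5
  outputbuffer_processors = 3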

I’ll also note that for my lab, I have a 3-node cluster whose Graylog nodes have 4 cores and 16GB of RAM each, and it easily handles 50GB over a 30-day period. I don’t have a whole bunch of stream rules and am really only making use of 1 pipeline for Unifi logs. I could drop 2 out of the 3 nodes and still handle that 50GB/month easily, so feel free to use that as a reference point.

So all of that to say, I’d try and narrow down where the bottleneck is before going about making any sort of infrastructure changes. Yes, you’ll likely need to scale your instance up a bit and tune some things, but blindly doing that isn’t a good solution IMO.

I second @aaronsachs’ recommendation of trying to narrow down the bottleneck, to help us help you. To your sizing question, based on the information you’ve provided and a couple of assumptions, I would build your solution with the following:

1 GL node with 4 CPUs, 8GB RAM, and 50-100GB storage (SSD preferred)
1 ES node with 4 CPUs, 8GB RAM, and 500GB storage** (SSD preferred)

** 10GB per 10 days is one unit of retention, so a year needs about 36.5 units, or roughly 365GB of storage.

Configure the Java heap on both to 50% of your RAM (-Xms4g -Xmx4g).
Modify the default Graylog config to allocate 2 processors to the process buffer and 1 to the output buffer.
Modify your journal size to be at least 10GB (this will let it hold roughly 10 days of logs if Elasticsearch runs into a problem or your system bogs down).
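Roughly, that translates to the following (the heap lives in the same files mentioned earlier in the thread, the rest goes in server.conf; the values just match the sizing above, adjust as needed):

  # Java heap, 50% of 8GB, on both the Graylog node and the ES node
  -Xms4g
  -Xmx4g

  # /etc/graylog/server/server.conf
  processbuffer_processors = 2
  outputbuffer_processors = 1
  message_journal_max_size = 10gb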

Things this design does not take into consideration:

  • No redundancy on the front end
  • No redundancy on the back end
  • No always-on capability (any maintenance will cause logs to be lost)
  • No data protection for either the config (a MongoDB replica set is needed) or the data (an ES cluster with a replica is needed)
  • Depending on your storage setup, you may have 1 drive doing all the work per node, which could lead to performance issues (consider RAID or a SAN environment)

Meanwhile, the Graylog server has stopped because it ran out of memory.
The current setup I am installing is:

VM for Graylog:

  • CPU: 4 cores
  • Memory: 16GB
  • 3 HDDs:
      ◦ 20GB → OS mountpoint
      ◦ 40GB → /var/log mountpoint (to prevent logs from shutting down the OS)
      ◦ 50GB → /var/lib mountpoint (Graylog and MongoDB data)
  • Network: 1 NIC

VM for Elastic:

  • CPU: 4 cores
  • Memory: 16GB
  • 3 HDDs:
      ◦ 20GB → OS mountpoint
      ◦ 40GB → /var/log mountpoint (to prevent logs from shutting down the OS)
      ◦ 2048GB → /var/lib mountpoint (Elasticsearch data)
  • Network: 1 NIC

Does this sound like a more useful setup? It’s running on Ubuntu with LVM, so I can grow any of the logical volumes at any time.
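(For example, growing a volume later is just something like the following; the VG/LV names are placeholders for whatever mine end up being called:)

  # extend the logical volume and grow the filesystem in one step (-r)
  sudo lvextend -r -L +50G /dev/vg0/var_lib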

The sysadmin does not want more servers (too much work). So there goes redundancy :frowning:

Yeah, looks like you have a good foundation.

FWIW, redundancy is (should be) a business decision, not a sysadmin’s choice. Not sure how critical this is for you, but a little bit of extra work now can save a lot of work down the road.

But then again, I’m not in his/her shoes, nor yours… so I just wish you luck :slight_smile:
