Graylog died on me (again). There is something I am missing


About two weeks ago, I posted about - among other things - Graylog slowing down considerably. Well, shortly thereafter, it ground to a halt. I was unable to do any searches, change any index settings, or stop inputs. Nothing. The GUI was inoperable. The OpenSearch nodes said they were operating properly; same with the Graylog nodes. Honestly, I can't say I did much research, because I am a one-man show and I needed to get our third-party CISO his security information. So I set up a Wazuh stack and got what he needed. (Sidebar: I know Wazuh and Graylog do different things.)

All this to say, I am not complaining. I KNOW I AM AT FAULT. I can’t expect Graylog to work as intended if I don’t understand it sufficiently and am abusing it.

So I am seeking advice.

My previous setup was as follows:

3 x Opensearch Nodes, each 32GB RAM 8 cores 750GB Storage
3 x Graylog/MongoDB Nodes, each 16GB RAM 16 cores (1 marked as Leader)

Nginx Load Balance distributing round-robin to the 3 Graylog nodes

I need 30 days hot and 360 days cold storage. I know this will cause eye-rolls, but Archiving (i.e. Enterprise) is not an option (my budget is "Please, sir, may I have some more" with hat in hand).

That being said, I have, at my disposal, a rig running PVE with 32 cores and 128GB RAM (yes, I know I was robbing Peter to pay Paul with the 3 x 32GB and 3 x 16GB RAM config). And, if I am lucky, there is a chance I can beg for the money for new hardware (one-time purchases are preferred over subscriptions).

So, knowing the above, how would you configure the stack? (Ask as many questions as you'd like.) Additionally, if my hardware is insufficient to achieve my goals, what should I look for? If I need more than 128GB RAM, I am going to have to start looking at enterprise servers, and unfortunately my hardware knowledge in that area is not sufficient.

Final question: Opensearch allows for role-specific designations. I was thinking of having one node dedicated as master, 2 hot data nodes, and 1 warm/cold data node. Can I do that? If so, what does Graylog actually do when it comes to configuring the Opensearch Nodes? Would it then overwrite those settings? To quote Bob from Office Space, “What is it you would say [Graylog] does, here?”
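For reference, the role-specific designations I have in mind would look something like this in each node's `opensearch.yml` (a sketch; the node names and the `temp` attribute are hypothetical, and the hot/warm attribute only matters if shard allocation is later filtered on it):

```yaml
# Node 1: dedicated cluster manager (no data)
node.name: os-manager-1
node.roles: [ cluster_manager ]

# Nodes 2-3: hot data nodes
node.name: os-hot-1
node.roles: [ data, ingest ]
node.attr.temp: hot

# Node 4: warm/cold data node
node.name: os-warm-1
node.roles: [ data ]
node.attr.temp: warm
```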

Thank you all in advance. I owe you all several rounds.

What does your daily ingestion look like, and what are your current retention settings set to?

Also, generally, more resources won't help: Graylog and OpenSearch are built to scale horizontally, not vertically, and sometimes scaling beyond a certain point actually causes problems or worse performance.

I guess I should have mentioned that. 10GB daily on the low end; 20GB on the high end. I think I will pare down the ingestion to keep it to 15GB at most.

Retention is not set, as there is no Graylog currently. I will admit, however, that I likely f'd things up there trying to play the shard-size (20-40GB) balancing act (and I still don't know how to do that).

Unfortunately I don’t think I understand. Could you give me an example of horizontal vs vertical? (sincerely)

So for shard sizes, just use the new time/size-balanced rotation setting, and let Graylog look after the sizes of the shards.

When I say horizontal, what I mean is many small nodes rather than a few massive ones. For example, it doesn't make sense for an OpenSearch node to have more than ~31GB of heap (above roughly 32GB the JVM loses compressed object pointers), so nothing above 64GB of system RAM per node makes sense.
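As a concrete example, that heap ceiling is set in OpenSearch's `config/jvm.options`; on a 64GB box you would pin the heap at or below ~31GB and leave the rest to the OS filesystem cache (a sketch — 30g is an illustrative value, not a recommendation for every workload):

```
# config/jvm.options (excerpt)
# Keep Xms and Xmx equal so the heap never resizes, and keep both
# at or below ~31GB so the JVM retains compressed ordinary object
# pointers (oops); the remaining RAM serves the filesystem cache.
-Xms30g
-Xmx30g
```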

At your size, you may not even need more than a single Graylog node with 16GB of heap, if that. You can check out these recommendations

Your OpenSearch setup, if you plan on keeping the data longer, is another story; all hot data has a resource cost in OpenSearch. This blog will give you some idea of that: How many shards should I have in my Elasticsearch cluster? | Elastic Blog

For 20GB per day, a single all-in-one node (Graylog, OpenSearch and MongoDB on the same host) is enough, with 16 CPUs and 64GB RAM.

Regarding your hardware and architecture, you need to configure the heap sizes of Graylog and OpenSearch, and tune some parameters in Graylog (output_batch_size, inputbuffer_processors, outputbuffer_processors and processbuffer_processors).
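For what it's worth, those parameters live in Graylog's `server.conf`. On a 16-core box, a starting point might look like this (a sketch — the values are illustrative assumptions, not tuned recommendations, and the three processor counts should sum to no more than your core count):

```
# /etc/graylog/server/server.conf (excerpt)
output_batch_size = 1000        # messages per batch written to OpenSearch
inputbuffer_processors = 2      # threads pulling messages off inputs
processbuffer_processors = 8    # threads running extractors/pipelines
outputbuffer_processors = 4     # threads flushing to OpenSearch
```

The process buffer usually deserves the most threads, since that's where extractors and pipeline rules (including any expensive regexes) run.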

Finally, be careful with Extractors, Pipelines, Streams and Event Definitions; a bad regular expression can have a very significant impact on performance. Check the corresponding metrics.

Thank you; that youtube link is incredibly informative.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.