How many nodes needed for >5TB logs in a day?

Hello guys,
So, my company produces approximately 5 TB of logs a day…
Right now I am using 3 nodes; each server has 16 cores and 32 GB of memory.
What do you think?
Is it enough?

That’s really hard to say, because we have no idea what you are using those:

3 nodes; each server has 16 cores and 32 GB of memory.

… for, and how you have them configured.

You tell us. Are they enough? Is it working? Because the way you write suggests that it’s already been built and that it’s working :slight_smile:

If it hasn’t been built and you’re still in the design phase, please tell us a bit more about your situation.

5 TB a day sounds like a lot. More than @benvanstaveren, I believe, and competing with @macko003.

I think it’s more than 5 TB. I enabled a 5 TB SSD on the master node, and it took just 8 hours to reach 4.1 TB with segmentation still running, so I think it’s more than 5 TB. I have now set up an additional server (16 cores, 32 GB) as well, so now I have 4 nodes.

is_master = true
root_timezone = Asia/Jakarta
node_id_file = /etc/graylog/server/node-id
password_secret = REDACTED
root_username = REDACTED
root_password_sha2 = REDACTED
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://0.0.0.0:9000/applogs/api/
web_listen_uri = http://0.0.0.0:9000/applogs/
elasticsearch_hosts = http://REDACTED:9200,http://REDACTED:9200,http://REDACTED:9200
elasticsearch_max_total_connections = 2048
elasticsearch_max_total_connections_per_route = 2
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = true
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 5000
output_flush_interval = 5
output_fault_count_threshold = 5
output_fault_penalty_seconds = 15
processbuffer_processors = 6
outputbuffer_processors = 5
processor_wait_strategy = blocking
ring_size = 16384
inputbuffer_ring_size = 16384
inputbuffer_processors = 5
inputbuffer_wait_strategy = blocking
message_journal_enabled = false
message_journal_dir = /data/graylog-server/journal
message_journal_max_size = 4700gb
message_journal_segment_size = 10gb
lb_recognition_period_seconds = 10
mongodb_uri = mongodb://REDACTED/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
dashboard_widget_default_cache_time = 1s
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32

If you have any suggestions, they are very welcome.
My plan right now is to migrate to Google Cloud Platform or AWS,
but that is so expensive compared to bare metal in the company.

message_journal_max_size = 4700gb
It’s only the log “buffer”, so with 3 nodes it’s enough to buffer messages for about one day (you won’t be able to search these messages…). Are you sure you want that?
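For comparison, a more typical journal setup only buffers a few hours of backlog instead of a full day - something along these lines (the size is purely illustrative, not a sizing recommendation for your volume):

# illustrative only: a few hours of buffer, not a full day
message_journal_enabled = true
message_journal_max_size = 200gb
# 12h is, I believe, the Graylog default; older journal entries are discarded
message_journal_max_age = 12h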

At System -> Nodes you can check the uncommitted messages. You need enough nodes to keep this number under 2-4x your messages/s rate.
I suggest you don’t start with your full message volume at first. You should understand Graylog’s log processing method; after that you can play with the configs to optimize your performance.
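If you prefer the API over the UI, something along these lines should show the journal and uncommitted-entry numbers for a node (hedged: the exact path depends on your Graylog version and on rest_listen_uri - with the config above the API sits under /applogs/api/):

curl -u youruser:yourpassword http://<graylog-node>:9000/applogs/api/system/journal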
If I were you…
I would drop a large share of your messages with e.g. iptables to test how many messages/s your system can handle (see the sketch below).
After that I would look at my monitoring system’s graphs and check for bottlenecks…
Once I had found the bottlenecks, I would start scaling up my system.
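As a very rough illustration of the iptables idea (the port and subnet here are purely hypothetical; adjust them to whatever input and sources you actually use):

# temporarily drop syslog traffic from one chatty subnet so only part of the volume reaches Graylog
iptables -A INPUT -p udp --dport 1514 -s 10.10.0.0/16 -j DROP
# delete the rule again once the test is done
iptables -D INPUT -p udp --dport 1514 -s 10.10.0.0/16 -j DROP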

It is hard to say anything exact, because (as @Totally_Not_A_Robot said) everything depends on your goals and resources.
If you run bad, complex regexes on all of your messages, it needs far more resources than running a good, simple one on only part of your messages. So first, “Keep It Simple, Stupid”. After a few months, when you (think you) know everything about Graylog, start to play, and you will be able to find problems in your system (because right now you only know a small part of GL). (I started with Graylog about 2 years ago, and I still find new features every week, and half of the features I know I have never used…)


You don’t need more Graylog nodes; what you need for storage is more Elasticsearch nodes. For ingesting 5 TB of logs per day, you need to look at how you want to store that - if you need replicas (which is never a bad idea), your storage has to be able to deal with that.

For example, if you use 1 replica, there will be 2 copies of the data around, meaning 10 TB you need to store on a daily basis. For Elasticsearch, at that point, you need to see how much you can store on a single data node, and then figure out how many days of logs you want to be able to search in.

So, let’s say you can store about 6 TB on a single ES data node and you want 30 days of logs: you need 300 TB of total storage. Divide that 300 TB by, let’s say, 5 TB of usable space per data node (leaving a bit of headroom), and you need 60 data nodes just to store that amount of logs and have them available.
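Written out, that back-of-the-envelope calculation is:

5 TB/day ingested x 2 copies (1 replica)   = 10 TB/day to store
10 TB/day x 30 days retention              = 300 TB total
300 TB / ~5 TB usable per data node        = ~60 ES data nodes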

The Graylog nodes themselves only deal with ingestion and processing, not storage. Keep that in mind :slight_smile:


I feel honored to get such a complete answer.

I have tried message_journal_enabled = true on 3 nodes, but the config only worked correctly on the master node; the second and third nodes stayed in stand-by (active-passive) mode.
When I turned it to false on all 3 nodes, they worked together.

I did set up the fourth node… memory usage is back to normal…
I think there are unprocessable messages, such as very long messages.

I have 3 ES nodes, with a 10.5 TB SSD on each ES node.
In the overview sub-menu it showed only 200 GB - 350 GB a day. After adding more nodes it increased significantly: the overview sub-menu showed 600 GB - 700 GB.
The amount of logs discovered yesterday was 5 TB, when I set up the message journal disk.
It seems fishy: I added the message journal disk to store the data, and it reached 4.1 TB in only 6-8 hours.
That’s why I added 2 more servers (4 nodes total), set message_journal_enabled to false, and Graylog managed to collect 600 GB in 6 hours.

The Graylog nodes themselves only deal with ingestion and processing, not storage.

I got that, but that was when it was 100 GB of logs a day.
After I released it to production, all the developers started using it. RIP haha

Seems like Graylog stack itself could be generating lots of messages/errors, which are filling up Graylog :slight_smile: I’ve had that happen: Elastic logging exploding and filling up Graylog.


The message journal is used when there is more data coming in than Graylog can push out, so if the message journal keeps filling up you may indeed need more Graylog nodes for processing - but, when you have 5 nodes, each node acts independently of the others - there is no “shared” capacity. So what you want to do is create an input on each node (or one global one, it’ll automatically start one on each Graylog node), and distribute your logs across all inputs.

For the Beats input that’s easy, because you can list all hosts in the Filebeat output config, set the loadbalance flag to “true”, and Filebeat will “Do The Right Thing™”.
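For example, a Filebeat output section along these lines (hostnames and the port are placeholders; 5044 is just the usual Beats port, use whatever your Graylog Beats inputs actually listen on):

output.logstash:
  hosts: ["graylog1:5044", "graylog2:5044", "graylog3:5044", "graylog4:5044"]
  loadbalance: true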

Do keep in mind that if you use “bad” regexes (unanchored, too loose) it can affect processing speed. There’s a million topics on that on the forums though so you may want to look at that just for some tips and tricks :slight_smile:

Hah. Yeah, that happened to us as well, the developers were all like “ooooh shiny new toy!” so… yeah.


Yes yes yes, I got a broken pipe error. I don’t know why.
I created a post on this forum 2 days ago.
When I got the broken pipe error, a “No Master has been fixed” error appeared as well.

Gosh I wish… So far everybody’s staying away from mine, because making their apps log someplace else is “complicated”.

It’s not complicated, it’s just dev speak for “I can’t be bothered with that” :smiley:


Hence the italics :wink:

…So what you want to do is create an input on each node (or one global one, it’ll automatically start one on each Graylog node), and distribute your logs across all inputs…

Hmm, interesting…
I am using a Grok pattern for the Nginx log, and the extractor only runs if the message matches the regex pattern HTTP/*.*.

If there’s one thing @benvanstaveren has taught me, it is to always anchor your regex against the exact start of a line first. That will cut down unneeded processing greatly.
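For instance, for a standard nginx combined log line, instead of letting a pattern like HTTP/\d\.\d float anywhere in the message, anchoring the whole thing to the start of the line makes non-matching messages fail fast:

^\S+ \S+ \S+ \[[^\]]+\] "\S+ \S+ HTTP/\d\.\d"

(This assumes the default combined format: remote address, “-”, user, [timestamp], then the quoted request; adjust it to your actual log format.)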


lol…
The developers in my company are funny.
They produce exactly that: trash logs… I told them, “dude, please just send nginx responses > 400 and > 500, plus some 3XX, to Graylog.”
Somebody said, “bro, we need to care about 200 responses too.”
OMG -_-
I can’t do anything. Speechless.


They have a point, perhaps, but then you should tell them that their 200 responses will be purged after a day or so. Let’s make some assumptions here:

1: You have an “NGINX” stream that gets all the logs from NGINX routed to it
2: You create 2 more streams, called “NGINX 200” and “NGINX 300+”, each with their own index set (so you can tweak the retention policies)
3: Attach a pipeline to the NGINX stream where you check the status, e.g. set up a rule that checks if status == 200, then does a route_to_stream call to put the message in the NGINX 200 stream, and a remove_from_stream call that removes it from the NGINX stream (see the rule sketch after this list).
4: Do #3 again, except check for statuses > 300.
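A rule for step 3 could look roughly like this (just a sketch: “response_status” stands in for whatever field name your nginx extractor actually produces, and the stream names have to match the streams you created):

rule "route nginx 200s"
when
  has_field("response_status") && to_long($message.response_status) == 200
then
  // move the message into the dedicated 200 stream and out of the generic NGINX stream
  route_to_stream(name: "NGINX 200");
  remove_from_stream(name: "NGINX");
end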

But that’s like, advanced Graylogging :slight_smile:

Or just tell your devs like… mas dev, gak ada kapasitasnya, pelan-pelan saja (“hey dev, there’s no capacity, take it slow”)… or something. That might work :smiley:


Don’t forget, you can also anchor against the end of a line :wink: Even better are regexes that anchor on both sides - those tend to be really fast :smiley:
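For example, if you have already extracted the status code into its own field, checking it with something like ^[2345]\d\d$ (anchored on both ends) is much cheaper than a bare \d+ floating somewhere in the text - purely an illustration, not something from this thread.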


Yeah man, that’s really advanced :slight_smile:
I haven’t even covered all the Graylog features like you yet, bro.

I can’t imagine if my company were using Scalyr or another paid log management service, LOL.
5 TB, OMG, RIP…

Or just tell your devs like… mas dev, gak ada kapasitasnya, pelan-pelan saja… or something. That might work

Wait, are you Indonesian?

Aduh! No, he’s Belanda (Dutch) like me :slight_smile: But he’s had waaaaay more exposure to Indonesia than me; that’s his story to tell, though :blush:

But seriously… if you’re looking at 5TB of logging every day, you’re going to have to get creative with multiple streams, indices and retention times pretty fast. You can’t afford to just pile it all and save it for later.
