How many nodes needed for >5TB logs in a day?

Hello guys,
So, my company produces approximately 5 TB of logs a day…
Right now I am using 3 nodes; each server has 16 cores and 32 GB of memory.
What do you think?
Is it enough?

That’s really hard to say, because we have no idea what you are using those:

3 nodes; each server has 16 cores and 32 GB of memory.

… for, and how you have them configured.

You tell us. Are they enough? Is it working? Because the way you write suggests that it’s already been built and that it’s working :slight_smile:

If it hasn’t been built and you’re still in the design phase, please tell us a bit more about your situation.

5 TB a day sounds like a lot. More than @benvanstaveren, I believe, and competing with @macko003.

I think it’s more than 5 TB. I enabled a 5 TB SSD on the master node, and it took just 8 hours to reach 4.1 TB with segmentation still running, so I think it’s more than 5 TB. I have now set up an additional server (16 cores, 32 GB) as well, so now I have 4 nodes.

is_master = true
root_timezone = Asia/Jakarta
node_id_file = /etc/graylog/server/node-id
password_secret = REDACTED
root_username = REDACTED
root_password_sha2 = REDACTED
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://0.0.0.0:9000/applogs/api/
web_listen_uri = http://0.0.0.0:9000/applogs/
elasticsearch_hosts = http://REDACTED:9200,http://REDACTED:9200,http://REDACTED:9200
elasticsearch_max_total_connections = 2048
elasticsearch_max_total_connections_per_route = 2
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = true
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 5000
output_flush_interval = 5
output_fault_count_threshold = 5
output_fault_penalty_seconds = 15
processbuffer_processors = 6
outputbuffer_processors = 5
processor_wait_strategy = blocking
ring_size = 16384
inputbuffer_ring_size = 16384
inputbuffer_processors = 5
inputbuffer_wait_strategy = blocking
message_journal_enabled = false
message_journal_dir = /data/graylog-server/journal
message_journal_max_size = 4700gb
message_journal_segment_size = 10gb
lb_recognition_period_seconds = 10
mongodb_uri = mongodb://REDACTED/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
dashboard_widget_default_cache_time = 1s
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32

If you have any suggestions, they are very welcome.
My plan right now is to migrate to Google Cloud Platform or AWS,
but that is so expensive compared to bare metal in the company.

message_journal_max_size = 4700gb
It’s only the log “buffer”, so with 3 nodes it’s enough to buffer messages for about one day (you won’t be able to search these messages…). Are you sure you want that?
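For comparison, a more typical journal setup only buffers a few hours of backlog instead of a full day - something along these lines (the size is purely illustrative, not a sizing recommendation for your volume):

# illustrative only: a few hours of buffer, not a full day
message_journal_enabled = true
message_journal_max_size = 200gb
# 12h is, I believe, the Graylog default; older journal entries are discarded
message_journal_max_age = 12h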

At System -> Nodes you can check the uncommitted messages. You need enough nodes to keep this number under 2-4x your messages/s rate.
I suggest you don’t start with your full message volume at first. You should understand Graylog’s log processing method; after that you can play with the configs to optimize your performance.
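If you prefer the API over the UI, something along these lines should show the journal and uncommitted-entry numbers for a node (hedged: the exact path depends on your Graylog version and on rest_listen_uri - with the config above the API sits under /applogs/api/):

curl -u youruser:yourpassword http://<graylog-node>:9000/applogs/api/system/journal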
If I were you…
I would drop a large share of your messages with e.g. iptables to test how many messages/s your system can handle (see the sketch below).
After that I would look at my monitoring system’s graphs and check for bottlenecks…
Once I had found the bottlenecks, I would start scaling up my system.
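As a very rough illustration of the iptables idea (the port and subnet here are purely hypothetical; adjust them to whatever input and sources you actually use):

# temporarily drop syslog traffic from one chatty subnet so only part of the volume reaches Graylog
iptables -A INPUT -p udp --dport 1514 -s 10.10.0.0/16 -j DROP
# delete the rule again once the test is done
iptables -D INPUT -p udp --dport 1514 -s 10.10.0.0/16 -j DROP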

It is hard to say anything exact, because (as @Totally_Not_A_Robot said) everything depends on your goals and resources.
If you run bad, complex regexes on all of your messages, it needs far more resources than running a good, simple one on only part of your messages. So first, “Keep It Simple, Stupid”. After a few months, when you (think you) know everything about Graylog, start to play, and you will be able to find problems in your system (because right now you only know a small part of GL). (I started with Graylog about 2 years ago, and I still find new features every week, and half of the features I know I have never used…)


You don’t need more Graylog nodes; what you need for storage is more Elasticsearch nodes. For ingesting 5 TB of logs per day, you need to look at how you want to store that - if you need replicas (which is never a bad idea), your storage has to be able to deal with that.

For example, if you use 1 replica, there will be 2 copies of the data around, meaning 10 TB you need to store on a daily basis. For Elasticsearch, at that point, you need to see how much you can store on a single data node, and then figure out how many days of logs you want to be able to search in.

So, let’s say you can store about 6 TB on a single ES data node and you want 30 days of logs: you need 300 TB of total storage. Divide that 300 TB by, let’s say, 5 TB of usable space per data node (leaving a bit of headroom), and you need 60 data nodes just to store that amount of logs and have them available.
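Written out, that back-of-the-envelope calculation is:

5 TB/day ingested x 2 copies (1 replica)   = 10 TB/day to store
10 TB/day x 30 days retention              = 300 TB total
300 TB / ~5 TB usable per data node        = ~60 ES data nodes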

The Graylog nodes themselves only deal with ingestion and processing, not storage. Keep that in mind :slight_smile:


I feel honored to get such a complete answer.

I have tried message_journal_enabled = true on 3 nodes, but the config only worked correctly on the master node; the second and third nodes stayed in stand-by (active-passive) mode.
When I turned it to false on all 3 nodes, they worked together.

I did set up the fourth node… memory usage is back to normal…
I think there are unprocessable messages, such as very long messages.

I have 3 ES nodes, with a 10.5 TB SSD on each ES node.
In the overview sub-menu it showed only 200 GB - 350 GB a day. After adding more nodes it increased significantly: the overview sub-menu showed 600 GB - 700 GB.
The amount of logs discovered yesterday was 5 TB, when I set up the message journal disk.
It seems fishy: I added the message journal disk to store the data, and it reached 4.1 TB in only 6-8 hours.
That’s why I added 2 more servers (4 nodes total), set message_journal_enabled to false, and Graylog managed to collect 600 GB in 6 hours.

The Graylog nodes themselves only deal with ingestion and processing, not storage.

I got that, but that was when it was 100 GB of logs a day.
After I released it to production, all the developers started using it. RIP haha

Seems like Graylog stack itself could be generating lots of messages/errors, which are filling up Graylog :slight_smile: I’ve had that happen: Elastic logging exploding and filling up Graylog.


The message journal is used when there is more data coming in than Graylog can push out, so if the message journal keeps filling up you may indeed need more Graylog nodes for processing - but, when you have 5 nodes, each node acts independently of the others - there is no “shared” capacity. So what you want to do is create an input on each node (or one global one, it’ll automatically start one on each Graylog node), and distribute your logs across all inputs.

For the Beats input that’s easy, because you can list all hosts in the Filebeat output config, set the loadbalance flag to “true”, and Filebeat will “Do The Right Thing™”.
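For example, a Filebeat output section along these lines (hostnames and the port are placeholders; 5044 is just the usual Beats port, use whatever your Graylog Beats inputs actually listen on):

output.logstash:
  hosts: ["graylog1:5044", "graylog2:5044", "graylog3:5044", "graylog4:5044"]
  loadbalance: true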

Do keep in mind that if you use “bad” regexes (unanchored, too loose) it can affect processing speed. There’s a million topics on that on the forums though so you may want to look at that just for some tips and tricks :slight_smile:

Hah. Yeah, that happened to us as well, the developers were all like “ooooh shiny new toy!” so… yeah.


Yes yes yes, I got a broken pipe error. I don’t know why.
I created a post on this forum 2 days ago.
When I got the broken pipe error, a “No Master has been fixed” error appeared as well.

Gosh I wish… So far everybody’s staying away from mine, because making their apps log someplace else is “complicated”.

It’s not complicated, it’s just dev speak for “I can’t be bothered with that” :smiley:


Hence the italics :wink:

…So what you want to do is create an input on each node (or one global one, it’ll automatically start one on each Graylog node), and distribute your logs across all inputs…

Hmm, interesting…
I am using a Grok pattern for the Nginx log, and the extractor only runs if the message matches the regex pattern HTTP/*.*.

If there’s one thing @benvanstaveren has taught me, it is to always anchor your regex against the exact start of a line first. That will cut down unneeded processing greatly.
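For instance, for a standard nginx combined log line, instead of letting a pattern like HTTP/\d\.\d float anywhere in the message, anchoring the whole thing to the start of the line makes non-matching messages fail fast:

^\S+ \S+ \S+ \[[^\]]+\] "\S+ \S+ HTTP/\d\.\d"

(This assumes the default combined format: remote address, “-”, user, [timestamp], then the quoted request; adjust it to your actual log format.)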


lol…
The developers in my company are funny.
They produce exactly that: trash logs… I told them, “dude, please just send nginx responses > 400 and > 500, plus some 3XX, to Graylog.”
Somebody said, “bro, we need to care about 200 responses too.”
OMG -_-
I can’t do anything. Speechless.


They have a point, perhaps, but then you should tell them that their 200 responses will be purged after a day or so. Let’s make some assumptions here:

1: You have an “NGINX” stream that gets all the logs from NGINX routed to it
2: You create 2 more streams, called “NGINX 200” and “NGINX 300+”, each with their own index set (so you can tweak the retention policies)
3: Attach a pipeline to the NGINX stream where you check the status, e.g. set up a rule that checks if status == 200, then does a route_to_stream call to put the message in the NGINX 200 stream, and a remove_from_stream call that removes it from the NGINX stream (see the rule sketch after this list).
4: Do #3 again, except check for statuses > 300.
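A rule for step 3 could look roughly like this (just a sketch: “response_status” stands in for whatever field name your nginx extractor actually produces, and the stream names have to match the streams you created):

rule "route nginx 200s"
when
  has_field("response_status") && to_long($message.response_status) == 200
then
  // move the message into the dedicated 200 stream and out of the generic NGINX stream
  route_to_stream(name: "NGINX 200");
  remove_from_stream(name: "NGINX");
end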

But that’s like, advanced Graylogging :slight_smile:

Or just tell your devs like… mas dev, gak ada kapasitasnya, pelan-pelan saja (“hey dev, there’s no capacity, take it slow”)… or something. That might work :smiley:


Don’t forget, you can also anchor against the end of a line :wink: Even better are regexes that anchor on both sides - those tend to be really fast :smiley:
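For example, if you have already extracted the status code into its own field, checking it with something like ^[2345]\d\d$ (anchored on both ends) is much cheaper than a bare \d+ floating somewhere in the text - purely an illustration, not something from this thread.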


Yeah man, that’s really advanced :slight_smile:
I haven’t even covered all the Graylog features like you yet, bro.

I can’t imagine if my company were using Scalyr or another paid log management service, LOL.
5 TB, OMG, RIP…

Or just tell your devs like… mas dev, gak ada kapasitasnya, pelan-pelan saja… or something. That might work

Wait, are you Indonesian?

Aduh! No, he’s Belanda (Dutch) like me :slight_smile: But he’s had waaaaay more exposure to Indonesia than me; that’s his story to tell, though :blush:

But seriously… if you’re looking at 5TB of logging every day, you’re going to have to get creative with multiple streams, indices and retention times pretty fast. You can’t afford to just pile it all and save it for later.
