Advice for logging infrastructure


I would like advice from experienced people on building an HA infrastructure to log 1.5 TB of data in JSON format every week. I need a retention time of 7 days, and I need to be able to query this data via an API.

The overall requirements are:

  • Handle 400,000,000 requests per month (~154/s)
  • Handle flat logs in JSON format and clean them by removing unneeded fields, to save disk space in the end.
  • Have a real-time dashboard of the flat logs
  • Have a dashboard to build charts and run queries on the final logs stored for 7 days.
  • Have an API available so that external software can retrieve data from these logs.
  • The infrastructure should be able to grow easily in the future.
  • Send log data by HTTP request
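A quick back-of-envelope check (a sketch only; the 15 KB average message size is taken from the questions further down) shows how these numbers fit together:

```python
# Back-of-envelope capacity check for the stated requirements.
SECONDS_PER_MONTH = 30 * 24 * 3600   # ~2.59 million
SECONDS_PER_WEEK = 7 * 24 * 3600     # 604,800

requests_per_month = 400_000_000
avg_msg_bytes = 15 * 1024            # assumption: 15 KB per JSON log

# Sustained ingest rate implied by 400M requests/month.
rate_per_sec = requests_per_month / SECONDS_PER_MONTH
print(f"{rate_per_sec:.0f} msg/s")   # ≈ 154 msg/s

# Raw weekly volume at that rate and message size.
bytes_per_week = rate_per_sec * avg_msg_bytes * SECONDS_PER_WEEK
print(f"{bytes_per_week / 1e12:.2f} TB/week")   # ≈ 1.43 TB/week
```

The ~1.43 TB/week result is consistent with the stated 1.5 TB/week, so the 154 msg/s figure and the 15 KB message size hang together.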

I was thinking of creating an on-premise Kubernetes cluster on 6 dedicated servers (3 masters and 3 workers).
The 3 worker nodes will have 3.85 TB of SSD disk space each.
The flat log requests will be sent to Graylog: Graylog lets me handle logs in real time and gives me a dashboard, while Elasticsearch gives me an analytics dashboard and an API.
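Since the logs are to be sent by HTTP request, a minimal sketch of a sender could target a Graylog GELF HTTP input (the hostname and the custom fields below are made up; the default GELF HTTP port is 12201, adjust to your setup):

```python
import json
from urllib import request

# Hypothetical endpoint: a Graylog GELF HTTP input on its default port.
GRAYLOG_URL = "http://graylog.example.com:12201/gelf"

def build_gelf(short_message, **fields):
    """Build a GELF 1.1 payload; custom fields must be prefixed with '_'."""
    payload = {"version": "1.1",
               "host": "app-server-01",
               "short_message": short_message}
    payload.update({f"_{k}": v for k, v in fields.items()})
    return payload

def send_log(payload):
    """POST the JSON payload to the GELF HTTP input."""
    req = request.Request(GRAYLOG_URL,
                          data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return resp.status

payload = build_gelf("user login", user_id=42, status="ok")
# send_log(payload)  # uncomment once the input is actually running
```

This is only a sketch of the client side; in production you would batch messages and add retries rather than one POST per log line.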

I have some questions :

  • Do you think Kubernetes is required or overkill for my use case?
  • Do you know whether a log that has been sent to Elasticsearch is automatically deleted from Graylog, to avoid excessive disk usage?
  • Do you know how many requests per second Graylog can handle, and whether it is possible to send 15 KB logs in JSON format?
  • Do you think 3 dedicated servers with an 8C/16T Xeon at 3.7 GHz, 64 GB of memory, and 3.85 TB of disk space each, in different datacenters but in the same country, can work well as a cluster?

Thanks in advance for your help and advice.

Hello @loicattena

Most of what you want regarding dashboards and index retention is very achievable. For the JSON format, there are different inputs you can use.

I've seen people running setups like that. In our environment we use virtual machines, and K8s/Docker is basically the same idea. I would compare the throughput on Kubernetes against actual virtual machines. The only thing I hated about Kubernetes was the network connection issues, though that was mainly because I was unfamiliar with the software. I have grown to like Docker/Docker Compose, but we only use those for labbing and dev stuff.

Logs get sent to Graylog inputs and are then ingested into Elasticsearch. There are several ways to keep unwanted logs from being ingested. First, the remote device sending the logs needs to be configured properly, either in its own software or in the log shipper used. Graylog pipelines are ideal for situations like that: pipeline rules can drop messages or remove fields before they are stored.
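Graylog pipeline rules do this trimming server-side; as an illustration of the same idea done in the shipper before the message ever leaves the source (the field names below are made up), a sketch:

```python
# Strip fields you never query before shipping, so they don't consume
# disk space in Elasticsearch. The field list is purely illustrative.
UNNEEDED_FIELDS = {"debug_trace", "raw_headers", "internal_id"}

def trim_log(record: dict) -> dict:
    """Return a copy of the log record without the unneeded fields."""
    return {k: v for k, v in record.items() if k not in UNNEEDED_FIELDS}

log = {"ts": "2021-05-01T12:00:00Z",
       "msg": "login ok",
       "debug_trace": "long stack trace here",
       "raw_headers": "User-Agent: ..."}
trimmed = trim_log(log)
```

Trimming at the source also reduces network traffic, whereas a pipeline rule only saves disk space after ingestion; which to choose depends on how much control you have over the senders.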

I have seen 30,000+ messages per second on some machines, and I've also seen a single node ingest a very high message rate. It depends on resources and configuration.

What I don't see is your logical diagram for this scenario. It would be advisable to separate Elasticsearch from Graylog/MongoDB, and to ensure the Elasticsearch nodes have plenty of CPU cores.

Here are a couple of links you should probably look at; they may help.

On my setup I cap at 30k msg/s:
4 Graylogs
4 ES for ingestion

Each ES+Graylog pair runs on the same ESX host.
I don't really know where the bottleneck is; that's the problem with virtualization.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.