Today I found out something was wrong with Graylog: the Graylog journal was 100% full, so I tried to figure out what was going on. I found an issue related to the graylog_deflector index, so I deleted it and created a new index. I also stopped my input from receiving messages, because there were more than 8 million logs in the Graylog journal. So my question is: how can I solve this problem? After stopping the input, the journal message count is decreasing, but the buffer is still full.
32GB RAM per host
4 cores per host
1.5TB disk per host
Logs: 50GB/day, 800 logs/sec average
Another question: since I am using a Graylog cluster, is it possible to put a load balancer in front of the 3 Graylog nodes? I previously tried an NGINX load balancer, but I found many logs went missing between Logstash (GELF UDP) and Graylog, so I removed it. Now all logs go to the first server, while the ES data is distributed across the 3 servers.
Quick answer: your output process bottleneck points to CPU issues, ES server issues, or both. Since the process buffers are filling up, your journal will also fill up. Yes, you can use a load balancer in front; NGINX works.
I think you need to rethink your architecture. I don't know what resources you have available, or your requirements, but 50GB/day x 25 days = 1.25TB. So in theory you have enough storage, but you are up against the limits and would want to think about expanding. Also, if you can, increase your CPUs to 8. 50GB/day is almost too much if you want real-time search capabilities. In a typical environment, the vast majority of logging is generated during normal business hours, so you'd want to size the system to handle the higher volume during the 8-12 hours of heavy log generation. Or, if you don't mind waiting for a slower system to process the messages in the journal, and the journal is large enough to hold the messages without flushing, then you'll just need to make the journal adjustments.
Based on my limited understanding of your requirements, all you really need is 1 GL server with 8 CPUs, 16GB RAM, and 250-500GB SSD/HDD, plus 1 ES server with 8 CPUs, 32GB RAM, and 2-3TB SSD.
This will allow you to easily handle your load and provide a path forward for growth.
I am confused: since Graylog, MongoDB, and Elasticsearch are running on all three servers, which process is taking too much CPU?
I have the following questions:
Should I remove Graylog and MongoDB from the other 2 servers? They are ultimately consuming host resources, and since there is no load balancer in front of the inputs, I can see no messages going to the other 2 nodes, and Grafana shows nothing for graylog 02/03.
As I mentioned before, I already tried NGINX for GELF UDP, but because of chunking I saw so many messages lost. Since I am not in the cloud, what should be the alternative?
In the future I want a 90-day retention time. I know that only increasing the disk size is not a solution; can you help me with what else I need to keep in mind for 90 days of retention?
Will creating multiple inputs help, or will switching from UDP to TCP help?
What adjustments can I make to the journal to make it more robust?
By mistake, I put the core number as 4, but actually it's 8. I also decreased the retention time from 25 to 15 days, and I can see Graylog is working fine now. What action can be taken so that such an incident will not happen again in the future?
What's your daily ingest volume? 10GB/day? 20? 50?
If you are not using a load balancer, and all your logs are pointed at a single GL node, you don't need MongoDB/Graylog on the other servers. I would separate the ES instance from Graylog, as it helps Java performance, cleans up the architecture, and simplifies troubleshooting.
Increasing disk size/quantity is an absolutely valid solution to increase retention, but so is efficiently processing and storing the messages. Archiving is an option if you purchase the enterprise license. And finally, develop a data retention policy and build Graylog to support it: debug logs? 1 day, then delete. NAT logs? 7-10 days, then delete, etc. Simple retention policies are easy to create and maintain, but they are typically inefficient, so be as thorough as possible, but don't over-engineer it unless needed.
The journal is where a log is written first, before it is processed by Graylog. Typically this flow happens quickly and the message is only in the journal momentarily. The journal starts filling up when the processing of messages is delayed. The causes for this are numerous, but they are typically related to CPU allocation, Java heap, or issues writing to Elasticsearch. Without more information, I wouldn't really know where your issues are. You said you have 8 CPUs; how are they allocated in the server.conf file? From the first section it seems the output buffer is filling up, which indicates either Elasticsearch issues or insufficient output processors. In your case, I would guess it also has something to do with your architecture.
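For reference, the processor allocation and journal limits live in Graylog's server.conf. The counts below are an illustrative split for an 8-core host, not a recommendation tuned to your workload:

```ini
# Graylog server.conf -- buffer processor allocation (illustrative values).
# The three counts should roughly add up to the cores you can spare for Graylog.
processbuffer_processors = 4
outputbuffer_processors = 3
inputbuffer_processors = 1

# Journal sizing: how much disk the journal may use, and how long
# unprocessed messages are kept before being discarded.
message_journal_max_size = 10gb
message_journal_max_age = 12h
```

If the output buffer is the one filling up, raising outputbuffer_processors alone rarely helps when Elasticsearch itself is the bottleneck.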
I would rebuild with a single GL server and a separate ES node, but without knowing what your ingest is, this may not be adequate. I like to recommend this as a starting point because it simplifies a lot and separates the major components, but still allows you to grow into a multi-node deployment. Just make sure you read through the multi-node documentation to ensure you are building with growth in mind.
Currently, I am using a shell script which creates the cluster; all I need to pass is the Graylog hostname and IP. This setup has been working for the last 2 years, but now logs are increasing day by day and I also need to increase retention to 90 days, so the setup has started causing problems.
I hope you now have a much better idea of my current configuration.
OK, 55GB/day with more in the future and 90-day retention. I'll use 100GB/day to make the math easy: 100GB * 90 days = 9TB of space. You need to factor in some overhead, so I would make it 10TB. This is the size needed for the Elasticsearch node/nodes. How you eventually build it out is up to you. I'm not sure how easily you can add disk space or what type of storage you have, but I would highly recommend SSD; I/O can be a bottleneck, and SSD will help at that ingest rate.
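The sizing math above as a quick sketch (the 100GB/day and 10% overhead figures are this thread's assumptions, not measurements):

```shell
# Rough ES disk sizing: planned ingest x retention, plus ~10% overhead.
ingest_gb_per_day=100
retention_days=90

raw_gb=$((ingest_gb_per_day * retention_days))     # 9000 GB = 9 TB
with_overhead_gb=$((raw_gb + raw_gb / 10))         # 9900 GB -> round up to 10 TB

echo "raw:           ${raw_gb} GB"
echo "with overhead: ${with_overhead_gb} GB"
```

Re-run the numbers whenever the daily ingest grows; at 90-day retention, every extra 10GB/day costs roughly another 1TB of ES disk.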
The Graylog node doesn't need a lot of storage, because the MongoDB instance is really just storing configuration information. However, if you are going to grow the number of GL nodes, or want HA, then make sure you build the GL node according to the documentation's recommendations (MongoDB replica set, etc.).
Based on your conf file, I'm thinking the issue you're having is an Elasticsearch issue. What exactly, I'm not sure, perhaps disk I/O, but you have 3 processors dedicated to output and it's 100% full, so something seems misconfigured or bottlenecked on the Elasticsearch/Java side. Again, separating the GL node and ES node helps alleviate some of these concerns, as each node then has its own Java heap and resources.
To get your 90-day retention with the current setup, look at your index set statistics and determine the average size of an index before it's rotated and the time frame that gives you, then increase the number of retained indices to the correct amount. If you're averaging 1GB indices and rotating every 12 hours, then you generate 2GB/day and need 180 indices and 180GB for 90 days.
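The same index-count math, using the example numbers above (1GB average index, one rotation every 12 hours, 90-day retention; substitute what your index set statistics actually show):

```shell
# How many indices must be retained to cover the retention window,
# and how much disk that implies.
avg_index_gb=1
rotations_per_day=2      # one rotation every 12 hours
retention_days=90

indices_to_keep=$((rotations_per_day * retention_days))             # 180
storage_gb=$((avg_index_gb * rotations_per_day * retention_days))   # 180 GB

echo "max number of indices: ${indices_to_keep}"
echo "storage needed:        ${storage_gb} GB"
```

The "max number of indices" result is what goes into the index set's retention configuration in Graylog.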
I did as you suggested and increased the index refresh time to 30 seconds, but there was no improvement in performance. I stopped my input for a few hours so all the unprocessed logs would be processed, and after all buffers were clear I started my input again, but I can see a spike in the buffer again. Any idea?
**NOTE**: I tried to update the index refresh interval, but when I checked via the Elasticsearch settings I did not find anything, so I created a JSON file and applied it via an API call. Now I can see the 30-second refresh interval in the index settings, and I forcefully rotated the index via Graylog.
As the output buffer is the one filling, your Elasticsearch is having issues accepting the messages.
Check the thread pools of Elasticsearch, and check the ES settings.
For the setting: it is set per index, so only newly created indices will have the 30-second value; all existing indices will still have the previous 5s. You need to force it for all existing ones!
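A sketch of both steps, assuming ES listens on localhost:9200 and your indices use the default graylog_* prefix (adjust both to your cluster; the curl calls are commented out because they need a live cluster):

```shell
# 1) Check the indexing thread pool for rejections -- a growing "rejected"
#    count means ES cannot keep up with the write load. Depending on your
#    ES version the pool is named "write" (6.x+) or "bulk" (5.x):
# curl -s 'localhost:9200/_cat/thread_pool/write?v&h=name,active,queue,rejected'

# 2) Apply the 30s refresh interval to ALL existing indices, not just new ones.
#    Build the settings payload, then PUT it against the index pattern:
payload='{"index":{"refresh_interval":"30s"}}'
echo "$payload"
# curl -XPUT 'localhost:9200/graylog_*/_settings' \
#      -H 'Content-Type: application/json' -d "$payload"
```

A longer refresh interval trades search freshness for indexing throughput, which is usually the right trade when the output buffer is saturated.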
Thanks for the quick solution. The given link explains how to fix it by deleting, but I wanted to know the root cause of why this error happens. I cannot see the issue now; the error is 8-9 hours old. Will it create a problem in the future if Graylog has already fixed it?
After updating the configuration, Elasticsearch is not able to start. When I checked the logs, I found this:
Caused by: java.lang.IllegalArgumentException: the [action.auto_create_index] setting value [false] is too restrictive. disable [action.auto_create_index] or set it to [.watches,.triggered_watches,.watcher-history-*]
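The error message itself names the fix: in elasticsearch.yml, either remove the `action.auto_create_index` line or relax it to allow the X-Pack Watcher system indices it lists. A minimal sketch, assuming you still want auto-creation disabled for everything else:

```yaml
# elasticsearch.yml -- allow only the Watcher system indices to be
# auto-created; everything else still requires explicit index creation.
action.auto_create_index: ".watches,.triggered_watches,.watcher-history-*"
```

Restart Elasticsearch after the change; the startup check that threw the IllegalArgumentException should then pass.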