Graylog journal getting full

Hi team,

Today I got to know that something was wrong with Graylog, so I checked and found that the Graylog journal was 100% full. While trying to figure out what was wrong, I found an issue related to graylog_deflector, so I deleted it and created a new index. I also stopped my input from receiving messages because there were more than 8 million logs in the Graylog journal. My question is: how can I solve this problem? After stopping the input the journal message count is decreasing, but the buffer is still full.

The only thing I changed in recent days was the retention time: indices older than 25 days (instead of 15) now get deleted.

Graylog heap size

GRAYLOG_SERVER_1_GL_HEAP="-Xms2g -Xmx4g"
GRAYLOG_SERVER_2_GL_HEAP="-Xms2g -Xmx4g"
GRAYLOG_SERVER_3_GL_HEAP="-Xms2g -Xmx4g"

Elasticsearch heap size

GRAYLOG_SERVER_1_ES_HEAP="16g"
GRAYLOG_SERVER_2_ES_HEAP="16g"
GRAYLOG_SERVER_3_ES_HEAP="16g"

INFRA
32 GB RAM per host
4 cores per host
1.5 TB storage per host

Logs: 50 GB/day, ~800 logs/sec average

Another question: since I am running a Graylog cluster, is it possible to put a load balancer in front of the three Graylog nodes? I previously tried an NGINX load balancer, but I found that many logs went missing between Logstash (GELF UDP) and Graylog, so I removed it. Now all logs go to the first server, while the ES data is distributed across the three servers.

please help me out

Quick answer: your output processing bottleneck points to CPU issues, ES server issues, or both. Since the process buffers are filling up, your journal will also fill up. Yes, you can use a load balancer in front; NGINX works.
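For the load balancer part, here is a minimal NGINX stream-module sketch for fanning GELF traffic out to three nodes. The hostnames, the port 12201, and the choice of TCP are assumptions to adapt to your setup; with GELF UDP, all chunks of one message must reach the same node, which is why a GELF TCP input behind the balancer tends to be the safer combination.

stream {
    upstream graylog_gelf {
        # assumed node names; replace with your three Graylog hosts
        server graylog-01:12201;
        server graylog-02:12201;
        server graylog-03:12201;
    }
    server {
        # GELF TCP input on each Graylog node; "listen 12201 udp;" also works,
        # but chunked UDP messages must not be split across nodes
        listen 12201;
        proxy_pass graylog_gelf;
    }
}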

I think you need to rethink your architecture. I don't know what resources you have available, or your requirements, but 50GB/day x 25 days = 1.25TB. So in theory you have enough storage, but you are up against the limit and would want to think about expanding it. Also, if you can, increase your CPUs to 8. 50GB/day is almost too much if you want real-time search capability. In a typical environment, the vast majority of logging is generated during normal business hours, so you'd want the system sized to handle the higher volume during the 8-12 hours of heavy log generation. Or, if you don't mind waiting for a slower system to work through the messages in the journal, and the journal is large enough to hold them without flushing, then you'll just need to make the journal adjustments.
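If you go the route of sizing the journal to absorb the daytime peak, the relevant server.conf knobs look like this. The values below are only a sketch, assuming you have the disk to spare, not a recommendation tuned to your load:

message_journal_enabled = true
message_journal_dir = data/journal
# let the journal grow past the 5gb default so it can buffer a backlog
message_journal_max_size = 20gb
# keep messages up to this long before the journal discards them unprocessed
message_journal_max_age = 24h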

Based on my limited understanding of your requirements, all you really need is one GL server with 8 CPUs, 16GB RAM, and 250-500GB SSD/HDD, plus one ES server with 8 CPUs, 32GB RAM, and 2-3TB SSD.

This will allow you to easily handle your load and provide a path forward for growth.

I am confused: since Graylog, MongoDB, and Elasticsearch are running on all three servers, which process is taking too much CPU?
[screenshot]

I have the following questions:

  • Should I remove Graylog and MongoDB from the other two servers? Ultimately they are consuming host resources, and since there is no load balancer in front of the inputs, no messages are going to the other two nodes and Grafana shows nothing for the 02/03 Graylog nodes.

[screenshot]

  • As I mentioned before, I already tried NGINX as a load balancer for GELF UDP, but because of GELF chunking I saw many messages get lost. Since I am not in the cloud, what should the alternative be?

  • In the future I want a 90-day retention time. I know that just increasing the disk size is not the whole solution, so can you help me with what else I need to keep in mind for 90-day retention?

  • Will creating multiple inputs help at all, or will switching from UDP to TCP help?

  • What adjustments can I make to the journal to make it more robust?

By mistake I wrote 4 cores, but it's actually 8. I also decreased the retention time from 25 back to 15 days and Graylog started working fine again. What action can be taken so that such an incident does not happen again in the future?

CURRENT ARCHITECTURE

[architecture diagram]

thanks in advance

what’s your daily ingest volume? 10GB/day? 20? 50?

If you are not using a load balancer and all your logs are pointed at a single GL node, you don't need Mongo/Graylog on the other servers. I would separate the ES instances from Graylog, as it helps Java performance, cleans up the architecture, and simplifies troubleshooting.

Increasing disk size/quantity is an absolutely valid way to increase retention, but so is efficiently processing and storing the messages. Archiving is an option if you purchase the Enterprise license. And finally, develop a data retention policy and be sure to build Graylog to support it: debug logs? 1 day, then delete. NAT logs? 7-10 days, then delete, and so on. Simple retention policies are easy to create and maintain but are typically inefficient, so be as thorough as possible without over-engineering it unless needed.
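In Graylog 3.x those per-category policies are normally built as separate index sets in the UI with streams routed into them; the legacy server.conf keys below only seed the default index set, but they show the shape of a time-based policy. The 1-day/90-index values are illustrative, not a recommendation:

rotation_strategy = time
elasticsearch_max_time_per_index = 1d
retention_strategy = delete
# e.g. 90 daily indices = 90 days online, then the oldest index is deleted
elasticsearch_max_number_of_indices = 90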

The journal is where a log message is written first, before it is processed by Graylog. Typically this flow happens quickly and the message is only in the journal momentarily. The journal starts filling up when the processing of messages is delayed. The causes for this are numerous, but typically relate to CPU allocation, Java heap, or issues writing to Elasticsearch. Without more information, I wouldn't really know where your issues are. You said you have 8 CPUs; how are they allocated in the server.conf file? From the first section it seems the output buffer is filling up, which is indicative of either Elasticsearch issues or insufficient output processors. In your case I would guess it also has something to do with your architecture.
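To make that question concrete, one possible split of the processor settings on an 8-core node that also runs ES and MongoDB is sketched below; the numbers are a starting point to experiment with, not a prescription. The usual rule of thumb is to keep the sum of the three at or below the cores you can actually spare.

inputbuffer_processors = 2
processbuffer_processors = 4
outputbuffer_processors = 2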

I would rebuild with a single GL server and a separate ES node, but without knowing what your ingest is, this may not be adequate. I like to recommend this as a starting point because it simplifies a lot and separates the major components while still allowing you to grow into a multi-node deployment. Just make sure you read through the multi-node documentation to ensure you are building it with growth in mind.


It's somewhere around 55 GB/day, but it will increase since more applications are going to be added in the future.

So will increasing the disk size help me store 90 days of logs?

graylog.conf (01 graylog server)
is_master = true
root_username = admin
password_secret =23cf370dc3e66e9da9ddf5b8b982cd6311155018faa1ccf176341843d5b11cad 
root_password_sha2 = 9b8769a4a742959a2d0298c36fb70623f2dfacda8436237df08d8dfd5b37374c
http_bind_address = 0.0.0.0:9000
http_publish_uri = http://${DH}:8081/
http_external_uri = http://${DH}:8081/
node_id_file = /opt/graylog/node-id
plugin_dir = plugin
http_enable_cors = true
rotation_strategy = count
retention_strategy = delete
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
processor_wait_strategy = blocking
ring_size = 65536
message_journal_enabled = true
message_journal_dir = data/journal
lb_recognition_period_seconds = 3
content_packs_auto_load = grok-patterns.json
mongodb_uri = mongodb://${GRAYLOG_SERVER_1_NAME}:9301,${GRAYLOG_SERVER_2_NAME}:9301,${GRAYLOG_SERVER_3_NAME}:9301/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
elasticsearch_cluster_name = graylog2
elasticsearch_shards = 2
elasticsearch_replicas = 1
elasticsearch_discovery_zen_ping_unicast_hosts = ${GRAYLOG_SERVER_1_NAME}:9300, ${GRAYLOG_SERVER_2_NAME}:9300, ${GRAYLOG_SERVER_3_NAME}:9300
elasticsearch_hosts = http://${GRAYLOG_SERVER_1_NAME}:8995, http://${GRAYLOG_SERVER_2_NAME}:8995, http://${GRAYLOG_SERVER_3_NAME}:8995
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking

02/03 graylog.conf

is_master = false
root_username = admin
password_secret =23cf370dc3e6d5b11cad 
root_password_sha2 = 9b8769a48436237df08d8dfd5b37374c
http_bind_address = 0.0.0.0:9000
http_publish_uri = http://${DH}:8081/
http_external_uri = http://${DH}:8081/
node_id_file = /opt/graylog/node-id
plugin_dir = plugin
http_enable_cors = true
rotation_strategy = count
retention_strategy = delete
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
processor_wait_strategy = blocking
ring_size = 65536
message_journal_enabled = true
message_journal_dir = data/journal
lb_recognition_period_seconds = 3
content_packs_auto_load = grok-patterns.json
mongodb_uri = mongodb://${GRAYLOG_SERVER_1_NAME}:9301,${GRAYLOG_SERVER_2_NAME}:9301,${GRAYLOG_SERVER_3_NAME}:9301/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
elasticsearch_cluster_name = graylog2
elasticsearch_shards = 2
elasticsearch_replicas = 1
elasticsearch_discovery_zen_ping_unicast_hosts = ${GRAYLOG_SERVER_1_NAME}:9300, ${GRAYLOG_SERVER_2_NAME}:9300, ${GRAYLOG_SERVER_3_NAME}:9300
elasticsearch_hosts = http://${GRAYLOG_SERVER_1_NAME}:8995, http://${GRAYLOG_SERVER_2_NAME}:8995, http://${GRAYLOG_SERVER_3_NAME}:8995
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking

For retention, I used to update the configuration from the Graylog UI once Graylog was up.

This is my Elasticsearch configuration, where DH is the host IP and NUM is the node number:

cluster.name: graylog2
node.name: node0${NUM}
network.host: 0.0.0.0
network.publish_host: $DH
transport.host: 0.0.0.0
transport.publish_host: $DH
http.host: 0.0.0.0
#http.bind_host: 0.0.0.0
http.publish_host: $DH
discovery.zen.ping.unicast.hosts: ["$GRAYLOG_SERVER_1_NAME:9300", "$GRAYLOG_SERVER_2_NAME:9300", "$GRAYLOG_SERVER_3_NAME:9300"]
discovery.zen.minimum_master_nodes: 1
#bootstrap.memory_lock: true

Currently I am using a shell script which creates the cluster; all I need to pass is the Graylog hostname and IP. This setup has been working for the last two years, but now the log volume is increasing day by day and I also need to increase retention to 90 days, and the setup has started causing problems.

I hope you now have a much better idea of my current configuration.

OK, 55GB/day with more in the future and 90-day retention. I'll use 100GB/day to make the math easy. 100GB * 90 days = 9TB of space. You need to factor in some overhead, so I would make it 10TB. This is the size needed for the Elasticsearch node/nodes. How you eventually build it out is up to you; I'm not sure how easily you can add disk space or what type of storage you have, but I would highly recommend SSD. I/O can be a bottleneck, and SSD will help at that ingest rate.
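As a back-of-the-envelope check of that math (shell arithmetic only; the 10% overhead figure is an assumption):

# 100 GB/day x 90 days + ~10% overhead; replicas multiply the cluster-wide footprint
DAILY_GB=100; DAYS=90; REPLICAS=0
echo "$(( DAILY_GB * DAYS * (1 + REPLICAS) * 110 / 100 )) GB"   # 9900 GB, i.e. the ~10TB above
# with elasticsearch_replicas = 1 (as in the posted config) this roughly doubles to ~19800 GB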

The Graylog node doesn't need a lot of storage, because the MongoDB instance really just stores configuration information. However, if you are going to grow the number of GL nodes, or want HA, then make sure you build the GL nodes according to the documentation's recommendations (MongoDB replica set, etc.).

Based on your conf file, I'm thinking the issue you're having is an Elasticsearch issue. What exactly, I'm not sure; perhaps disk I/O. But you have 3 processors dedicated to output and it's 100% full, so something seems misconfigured or bottlenecked on the Elasticsearch/Java side. Again, separating the GL node and the ES node helps alleviate some of these concerns, since each node gets its own Java heap and resources.

To get your 90-day retention with the current setup, look at your index set statistics, determine the average size of an index before it's rotated and the time frame that gives you, and increase the number of indices to the correct amount. If you're averaging 1GB indices and rotating every 12 hours, then you need 2GB/day and 180GB for 90 days.
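Worked through with those example numbers, and assuming it is the default index set you adjust (in the UI, or via the legacy server.conf key shown here):

# 1 GB per index, rotation every 12 h  =>  2 indices/day, ~2 GB/day
# 90 days x 2 indices/day = 180 indices to keep online (~180 GB before replicas)
elasticsearch_max_number_of_indices = 180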


With a 15-day retention time Graylog works fine, but yesterday I increased the retention to 20 days and again saw the Graylog journal fill up with 4 million messages.

GRAFANA BOARD FOR GRAYLOG

GRAFANA BOARD FOR ELASTICSEARCH

My configuration settings are already shared above.

retention time policy

I am getting 50-55GB of data per node per day.

raise the index-refresh time to 30 seconds and you should see a performance gain …

I did as you suggested and increased the index refresh time to 30 seconds, but there was no improvement in performance. I stopped my input for a few hours so all the unprocessed logs would be processed, and after all the buffers were clear I started the input again, but I can see the buffers spiking again. Any idea?

**NOTE**: I tried to update the index refresh interval, but when I checked the Elasticsearch settings I did not find anything, so I created a JSON file and applied it via an API call. Now I can see the 30-second setting on the index, and I forcefully rotated the index via Graylog.

Given that it is the output buffer that fills up, your Elasticsearch is having issues accepting the messages.

Check the thread pools of Elasticsearch, and check the ES settings.
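One quick way to look is the _cat thread-pool API; a growing "rejected" count on the bulk/write pool means ES is pushing back on Graylog's output. The host and port here are assumptions; adjust them to wherever your ES HTTP endpoint listens (your Graylog config points at port 8995):

# list thread pools per node and keep only the bulk/write rows (plus the header)
curl -s 'http://localhost:8995/_cat/thread_pool?v&h=node_name,name,active,queue,rejected' | grep -E 'name|bulk|write'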


As for the setting: it is applied per index, so only newly created indices will get the 30-second value, while all existing indices will still have the previous 5s. You need to force it for all existing ones!
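For example, a sketch of applying it to every existing graylog_* index in one call (adjust the host/port, and the index prefix if yours differs):

curl -X PUT 'http://localhost:8995/graylog_*/_settings' \
     -H 'Content-Type: application/json' \
     -d '{ "index": { "refresh_interval": "30s" } }'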


Hey, I have updated the setting for all indices, but today I found an error. Can you please tell me the root cause?

please see the docs:

http://docs.graylog.org/en/3.1/pages/faq.html#how-do-i-fix-the-deflector-exists-as-an-index-and-is-not-an-alias-error-message
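Roughly what the FAQ walks through, assuming the default index prefix so the deflector alias is called graylog_deflector, and an ES HTTP port of 8995 as in your config; follow the linked page rather than this sketch before deleting anything:

# see whether the deflector name exists as a real index instead of an alias
curl -s 'http://localhost:8995/_cat/indices/graylog_deflector?v'
# the FAQ's fix is to remove that stray index so Graylog can recreate it as an alias
curl -X DELETE 'http://localhost:8995/graylog_deflector'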

Thanks for the quick solution. The given link explains how to fix it by deleting, but I wanted to know the root cause of why this error happens. I can no longer see the issue (it is an 8-9 hour old error); will it create problems in the future if Graylog has already fixed it?

The root cause is a behavior change in Elasticsearch combined with your Elasticsearch configuration not following our guidance for avoiding it.

Hey, I have already posted my Elasticsearch configuration. Can you please tell me what's wrong with it?

please compare the suggested settings in our installation step-by-step guides to your settings:

http://docs.graylog.org/en/3.1/pages/installation/os/ubuntu.html#elasticsearch

I checked, and the only setting I am missing is action.auto_create_index: false. Is there anything else I am missing?

You have my Elasticsearch configuration; can you please point out exactly where I need to focus?

that single setting is the reason.


After updating the configuration, Elasticsearch is not able to start. When I checked the logs I found this problem:

Caused by: java.lang.IllegalArgumentException: the [action.auto_create_index] setting value [false] is too restrictive. disable [action.auto_create_index] or set it to [.watches,.triggered_watches,.watcher-history-*]

And the error includes a solution already. Brave new world.
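In other words, in elasticsearch.yml replace the blanket false with the allow-list the error message itself suggests:

# allow only the X-Pack watcher indices to be auto-created; everything else stays manual
action.auto_create_index: .watches,.triggered_watches,.watcher-history-*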