1. Describe your incident:
Installed Ubuntu 20.04 and followed the guide to install Graylog. I set up three DCs and our Meraki gear to forward syslog traffic to an input on port 1515, and also tinkered with nxlog forwarding into a Beats input on port 5044.
Everything works well for a day or two, and then messages stop flowing inbound.
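For reference, next time it stops I plan to confirm the inputs are still listening and that traffic is actually reaching the VM, with something along these lines (a rough sketch; 1515 and 5044 are the ports mentioned above):
# check that the Graylog inputs are still bound to their ports
sudo ss -lntup | grep -E ':1515|:5044'
# watch for inbound syslog/beats packets on those ports
sudo tcpdump -ni any 'port 1515 or port 5044'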
2. Describe your environment:
- OS Information:
Ubuntu 20.04 on a Hyper-V VM
8 cores, Xeon Gold 6148 CPU
24 GB memory
- Package Version:
ii elasticsearch-oss 7.10.2 amd64 Distributed RESTful search engine built for the cloud
ii graylog-4.2-repository 1-4 all Package to install Graylog 4.2 GPG key and repository
ii graylog-integrations-plugins 4.2.5-1 all Graylog Integrations plugins
ii graylog-server 4.2.5-1 all Graylog server
ii mongodb-org 4.0.28 amd64 MongoDB open source document-oriented database system (metapackage)
ii mongodb-org-mongos 4.0.28 amd64 MongoDB sharded cluster query router
ii mongodb-org-server 4.0.28 amd64 MongoDB database server
ii mongodb-org-shell 4.0.28 amd64 MongoDB shell client
ii mongodb-org-tools 4.0.28 amd64 MongoDB tools
- Service logs, configurations, and environment variables:
server.conf file:
is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret =
root_password_sha2 =
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = 10.10.10.27:9000
http_enable_cors = false
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://localhost/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
proxied_requests_thread_pool_size = 32
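One thing I notice now: I have not set any of the journal sizing options in server.conf, so I assume the defaults apply. If I'm reading the 4.2 docs right they would be roughly this (my assumption, not copied from my file):
# journal sizing options I have NOT set - defaults as I understand them
message_journal_max_age = 12h
message_journal_max_size = 5gb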
3. What steps have you already taken to try and solve the problem?
tail -f /var/log/graylog-server/server.log
2022-02-01T15:38:10.404-06:00 WARN [LocalKafkaJournal] Journal utilization (101.0%) has gone over 95%.
2022-02-01T15:38:41.843-06:00 INFO [connection] Opened connection [connectionId{localValue:17, serverValue:17}] to localhost:27017
2022-02-01T15:38:41.850-06:00 INFO [connection] Opened connection [connectionId{localValue:15, serverValue:15}] to localhost:27017
2022-02-01T15:38:41.851-06:00 INFO [connection] Opened connection [connectionId{localValue:16, serverValue:16}] to localhost:27017
2022-02-01T15:38:41.851-06:00 INFO [connection] Opened connection [connectionId{localValue:14, serverValue:12}] to localhost:27017
2022-02-01T15:38:41.852-06:00 INFO [connection] Opened connection [connectionId{localValue:13, serverValue:13}] to localhost:27017
2022-02-01T15:38:41.854-06:00 INFO [connection] Opened connection [connectionId{localValue:11, serverValue:11}] to localhost:27017
2022-02-01T15:38:41.857-06:00 INFO [connection] Opened connection [connectionId{localValue:12, serverValue:14}] to localhost:27017
2022-02-01T15:39:11.056-06:00 WARN [LocalKafkaJournal] Journal utilization (98.0%) has gone over 95%.
2022-02-01T15:39:11.058-06:00 INFO [LocalKafkaJournal] Journal usage is 98.00% (threshold 100%), changing load balancer status from THROTTLED to ALIVE
Journal usage overrun maybe?
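To sanity-check that, I figure I can look at how much the journal is actually holding, something like this (the API path is my best guess from the docs; admin credentials assumed):
# size of the on-disk journal
sudo du -sh /var/lib/graylog-server/journal
# journal status from the node's REST API
curl -u admin http://10.10.10.27:9000/api/system/journal
Elasticsearch itself reports green, for what it's worth: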
curl -XGET http://localhost:9200/_cluster/health?pretty=true
{
  "cluster_name" : "graylog",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 20,
  "active_shards" : 20,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
I'm not sure what a shard is, but I'm assuming that if it's at 100% it is full or too busy to process more. What can I do to get messages flowing again? And most importantly, please help me understand why this happened so I can prevent it from recurring. (Knowing how to delete a folder of queued messages is useful, but why didn't those messages get processed, and what settings do I need to adjust for proper retention and automatic digestion?)
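If it helps with diagnosis, I can also pull per-index and disk usage numbers from Elasticsearch, along these lines (paths are the package defaults as far as I know):
# list the Graylog indices with document counts and sizes
curl -XGET 'http://localhost:9200/_cat/indices/graylog_*?v'
# disk used vs. available per data node
curl -XGET 'http://localhost:9200/_cat/allocation?v'
df -h /var/lib/elasticsearch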
4. How can the community help?
I'm guessing my Elasticsearch settings are wrong, or perhaps the server isn't powerful enough? I also cannot get into the /etc/elasticsearch directory because access is denied; is that expected? I don't want to start modifying file permissions without understanding why, or what that might break.
How can I get messages flowing again?
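If it's relevant, my plan was only to inspect the /etc/elasticsearch permissions rather than change anything, e.g.:
# look at ownership and permissions without modifying them
ls -ld /etc/elasticsearch
sudo ls -l /etc/elasticsearch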