Issues with Master Node after upgrade to 2.4.3

At the beginning of this week we upgraded Graylog from 2.2 to 2.4.3.
After the upgrade the master node began to stop processing messages; the daemon hangs with no errors or other indicators in the logs.
Restarting the daemon gets it processing again, with a large queue of messages to work through. This happens every 10 hours.
It is also worth noting that the master node seems to ingest more of the data and has a tougher time with it than the rest of the nodes.
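
We have been keeping an eye on the journal backlog on the master before restarting it. A rough sketch of the check, assuming the standard Graylog 2.x REST API on port 9000 and admin credentials (both are placeholders here):

```
# Rough sketch: inspect the journal backlog on the master node.
# Assumes the REST API is reachable on port 9000 (as in rest_listen_uri below) and that
# admin:password are valid credentials; replace x.x.x.x with the master's address.
curl -s -u admin:password 'http://x.x.x.x:9000/api/system/journal'
```

A steadily growing number of uncommitted journal entries would match the large queue we see after a restart.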

We have 4 Graylog nodes on c4.2xlarge instances and 4 Elasticsearch nodes on c4.4xlarge instances.
We process 250 GB on average.
We use round-robin DNS for the UDP inputs and a load balancer for the UI.
Here is our server.conf:

```
node_id_file = /etc/graylog/server/node-id
password_secret = xxx
root_password_sha2 = xxx
root_timezone = UTC
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://x.x.x.x:9000/api/
web_listen_uri = http://x.x.x.x:9000/
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = true
allow_highlighting = false
elasticsearch_hosts = http://x.x.x.x:9200,http://x.x.x.x:9200,http://x.x.x.x:9200,http://x.x.x.x:9200
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 6
outputbuffer_processors = 4
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://user:xxx@x.x.x.x:27017,x.x.x.x:27017,x.x.x.x:27017/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32
```

Any help in understanding why the master keeps crashing would be most welcome.

With the given information, I would say: check your Elasticsearch cluster.

Maybe it happens every time you rotate the index?
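
For a quick look, checking cluster health and index state around the time the node hangs might show something. These are standard Elasticsearch HTTP APIs; use any of the hosts from your elasticsearch_hosts setting:

```
# Check Elasticsearch cluster health and the Graylog indices around the time the master hangs.
curl -s 'http://x.x.x.x:9200/_cluster/health?pretty'
curl -s 'http://x.x.x.x:9200/_cat/indices/graylog_*?v'
# Thread pool queues and rejections can also point at an overloaded cluster.
curl -s 'http://x.x.x.x:9200/_cat/thread_pool?v'
```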

Hi @jan,
It's not when we rotate the index; the timing doesn't line up.

However, there is some performance tuning we can do on the Elasticsearch side.
ES_HEAP_SIZE was at 10 GB and we can push it to 15 GB (the instances have 30 GB of RAM in total).
Based on the docs at http://docs.graylog.org/en/2.4/pages/configuration/elasticsearch.html there are a couple of other performance tips, such as setting indices.store.throttle.max_bytes_per_sec in Elasticsearch.
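
A rough sketch of the two changes. The exact files depend on the Elasticsearch version and distribution, and the 150mb value is only an illustrative example, so treat all of it as an assumption to verify against your own setup:

```
# 1) Raise the heap. On Debian/Ubuntu the service environment file is /etc/default/elasticsearch,
#    on RHEL/CentOS it is /etc/sysconfig/elasticsearch (older ES releases; newer ones use jvm.options):
#      ES_HEAP_SIZE=15g
# 2) Raise the store throttle in elasticsearch.yml (example value, check what fits your disks):
#      indices.store.throttle.max_bytes_per_sec: 150mb
# Then restart Elasticsearch on each data node, one node at a time:
sudo systemctl restart elasticsearch
```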

I am going to make these changes and see if that helps with the master node.

@jan
You were correct; it is when the indices are rotated.
We found some other issues too and sorted them out. However, the master node still crashes a few minutes after the default index and one other index get rotated.
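
For anyone hitting the same thing: a quick way to confirm the correlation is to grep the master's server log for rotation events and compare the timestamps with the hangs. A sketch, assuming the default package-install log location:

```
# Compare rotation timestamps with the times the master stops processing.
# /var/log/graylog-server/server.log is the default path for package installs; adjust if yours differs.
grep -i 'rotat' /var/log/graylog-server/server.log
```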
