I would like to ask for advice regarding wait strategy usage and whether it would benefit our use case.
Specifically, I am asking about the processor_wait_strategy setting in https://github.com/Graylog2/graylog2-server/blob/2.1.1/misc/graylog.conf#L352
Our current setup:
Kafka Broker --> Graylog --> Elasticsearch
We have Kafka inputs in Graylog with throttling enabled, and the journal is enabled on the Graylog nodes.
Previously, we encountered an issue where our Graylog processors were not fast enough to process logs, causing the disk journal to fill up.
During the incident, we observed that Graylog deleted messages from the disk journal once it was full, and the journal kept filling up even though the Graylog nodes showed “THROTTLED” status. I believe this let the Kafka input's offset keep advancing even though we were not indexing, which resulted in data loss.
We were hoping that it would actually slow down the inputs instead of deleting messages in the journal.
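To make the difference concrete, here is a toy model of the two behaviors: what we observed (the journal drops its oldest unread messages when full) versus what we hoped for (the input stops reading so the data stays in Kafka). This is only a sketch and not Graylog's actual implementation. The function name, the policy names and all numbers are made up for illustration.

```python
from collections import deque


def simulate(policy, capacity, produce_per_tick, consume_per_tick, ticks):
    """Toy model of a bounded journal (hypothetical, not Graylog code).

    policy "drop_oldest":  when the journal is full, the oldest unread
                           message is discarded to make room.
    policy "backpressure": when the journal is full, the input stops
                           reading, so messages stay upstream.
    Returns (processed, lost, left_upstream).
    """
    journal = deque()
    next_offset = 0  # next upstream (e.g. Kafka) offset the input would read
    processed = 0
    lost = 0
    for _ in range(ticks):
        # Input side: try to read produce_per_tick messages from upstream.
        for _ in range(produce_per_tick):
            if len(journal) >= capacity:
                if policy == "backpressure":
                    break  # stop reading; the offset does not advance
                journal.popleft()  # unread message is gone for good
                lost += 1
            journal.append(next_offset)
            next_offset += 1
        # Processing side: drain at most consume_per_tick messages.
        for _ in range(min(consume_per_tick, len(journal))):
            journal.popleft()
            processed += 1
    left_upstream = produce_per_tick * ticks - next_offset
    return processed, lost, left_upstream


if __name__ == "__main__":
    for policy in ("drop_oldest", "backpressure"):
        print(policy, simulate(policy, capacity=10, produce_per_tick=5,
                               consume_per_tick=2, ticks=10))
```

In this model both policies process the same number of messages, but with "drop_oldest" the backlog is lost, while with "backpressure" it simply waits upstream, which is the trade-off we would prefer.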
Our Kafka brokers are set up to buffer a huge number of messages, and we value “no data loss” more than low log latency.
As a side note, we fixed the issue above by increasing processbuffer_processors, outputbuffer_processors and output_batch_size, which, I guess, suggests that our ES cluster is fast enough.
Here’s an excerpt of our settings:
# Elasticsearch
elasticsearch_node_master = false
elasticsearch_node_data = false
elasticsearch_http_enabled = false
elasticsearch_config_file = /etc/graylog/server/graylog-elasticsearch.yml
elasticsearch_shards = 4
elasticsearch_replicas = 1
elasticsearch_index_prefix = dcslogs-prod
elasticsearch_cluster_name = dcslogs-prod-muc1
elasticsearch_transport_tcp_port = 9350
elasticsearch_discovery_zen_ping_unicast_hosts = ["10.36.20.143:9300"]
elasticsearch_network_host = 0.0.0.0
elasticsearch_analyzer = standard
output_batch_size = 3000
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30

# Processors
processbuffer_processors = 8
outputbuffer_processors = 5
async_eventbus_processors = 2
outputbuffer_processor_keep_alive_time = 5000
outputbuffer_processor_threads_core_pool_size = 3
outputbuffer_processor_threads_max_pool_size = 30
processor_wait_strategy = blocking
udp_recvbuffer_sizes = 1048576
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking

# Message journal
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_age = 12h
message_journal_max_size = 10gb
message_journal_flush_age = 1m
message_journal_flush_interval = 1000000
message_journal_segment_age = 1h
message_journal_segment_size = 100mb
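
For a rough sense of how much headroom message_journal_max_size = 10gb gives, here is a back-of-the-envelope sketch. The ingest and processing rates are made-up placeholders, not measurements from our cluster, and I am assuming that "10gb" means 10 GiB and that the journal grows by the difference between the two rates.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def seconds_until_journal_full(max_size_bytes, ingest_bytes_per_s,
                               process_bytes_per_s):
    """Estimate how long a journal lasts while processing lags behind.

    Returns None when processing keeps up (the journal never fills).
    Hypothetical helper for illustration, not part of Graylog.
    """
    backlog_rate = ingest_bytes_per_s - process_bytes_per_s
    if backlog_rate <= 0:
        return None
    return max_size_bytes / backlog_rate


if __name__ == "__main__":
    # Placeholder rates: 5 MiB/s coming in, 3 MiB/s being processed.
    seconds = seconds_until_journal_full(10 * GIB, 5 * MIB, 3 * MIB)
    print(f"journal full after ~{seconds / 3600:.1f} hours")
```

With these placeholder rates the journal would hit the size limit well before the 12h message_journal_max_age, so in such a case the size limit, not the age limit, would be what triggers deletion.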
Thanks a lot!