- Describe your incident:
I’m running four Graylog 5.0.5 nodes behind a load balancer and five Elasticsearch instances. The four Graylog nodes are not sending messages to Elasticsearch, and the backlog of messages is saturating the journal. server.log is full of error messages like this one:
ERROR [Messages] Caught exception during bulk indexing: ElasticsearchException{message=ElasticsearchException[An error occurred: ]; nested: IOException[Connection reset]; nested: SocketException[Connection reset];, errorDetails=}, retrying (attempt #38)
All hosts of the Elasticsearch and Graylog clusters are on the same subnet and there is nothing that should prevent communication between them; even when testing connectivity manually there do not seem to be any problems (a sketch of those checks is below).
The infrastructure had been running smoothly for months when this problem suddenly appeared.
On the Elasticsearch side, by contrast, there are no evident issues and resource consumption is low.
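For completeness, the manual connectivity tests I mean are along these lines (a rough sketch; the host names are the same placeholders used in the configs below, and Elasticsearch is plain HTTP here because opendistro_security.ssl.http.enabled is false):

# Cluster reachability and health from one of the Graylog nodes
curl -s 'http://host1:9200/_cluster/health?pretty'

# Confirm every data node answers on the HTTP port Graylog uses
for h in host1 host2 host3 host4 host5; do curl -s -o /dev/null -w "$h %{http_code}\n" "http://$h:9200"; done

These all complete without errors from every Graylog node.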
- Describe your environment:
Graylog nodes details:
CPU
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7J13 64-Core Processor
Stepping: 1
CPU MHz: 2445.392
BogoMIPS: 4890.78
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-15
OS version
NAME="Oracle Linux Server"
VERSION="8.8"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Oracle Linux Server 8.8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:8:8:server"
Elasticsearch Version: 7.10.2
Graylog Version: 5.0.5
GRAYLOG graylog.conf
is_leader = true
node_id_file = /appl/graylog/graylog-5.0.5-linux-x64/node_id
password_secret = xxxx
root_password_sha2 = xxxx
root_timezone = America/Sao_Paulo
bin_dir = /appl/graylog/graylog-5.0.5-linux-x64/bin
data_dir = /appl/graylog/graylog-5.0.5-linux-x64/data
plugin_dir = /appl/graylog/graylog-5.0.5-linux-x64/plugin
http_bind_address = node_ip_addr:9000
http_enable_tls = true
http_tls_cert_file = /appl/graylog/cert.pem
http_tls_key_file = /appl/graylog/pkcs8-encrypted.pem
http_tls_key_password = pwd
stream_aware_field_types=false
elasticsearch_hosts = host1:9200,host2:9200,host3:9200,host4:9200,host5:9200
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 3
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 1000
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 7
outputbuffer_processors = 4
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 3
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = data/journal
message_journal_max_age = 24h
message_journal_max_size = 10gb
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://host1:27017,host2:27017,host3:27017,host4:27017/graylog?replicaSet=rs0
mongodb_max_connections = 1000
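In case it helps, journal utilization can also be checked per node through the Graylog REST API (a sketch; the credentials and node address are placeholders, and I’m assuming the standard system/journal and indexer-failures endpoints, so adjust the paths if your version differs):

# Journal utilization on this node (uncommitted entries, journal size, etc.)
curl -k -u admin:password 'https://node_ip_addr:9000/api/system/journal?pretty=true'

# Indexing failures recorded by Graylog itself
curl -k -u admin:password 'https://node_ip_addr:9000/api/system/indexer/failures?limit=10&offset=0&pretty=true'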
ELASTICSEARCH CONFIG
cluster.name: elasticsearch_cluster
node.name: hostname
node.attr.rack: r1
path.data: /appl/elasticsearch/elasticsearch-7.10.2/data
path.logs: /appl/elasticsearch/elasticsearch-7.10.2/log
network.host: ip_add
discovery.seed_hosts: ["host1", "host2", "host3", "host4", "host5"]
cluster.initial_master_nodes: ["host1-1", "host2-2", "host3-3"]
gateway.recover_after_nodes: 2
action.auto_create_index: false
opendistro_security.ssl.transport.pemcert_filepath: esnode.pem
opendistro_security.ssl.transport.pemkey_filepath: esnode-key.pem
opendistro_security.ssl.transport.pemtrustedcas_filepath: root-ca.pem
opendistro_security.ssl.transport.enforce_hostname_verification: false
opendistro_security.ssl.http.enabled: false
opendistro_security.ssl.http.pemcert_filepath: esnode.pem
opendistro_security.ssl.http.pemkey_filepath: esnode-key.pem
opendistro_security.ssl.http.pemtrustedcas_filepath: root-ca.pem
opendistro_security.allow_unsafe_democertificates: true
opendistro_security.allow_default_init_securityindex: true
opendistro_security.authcz.admin_dn:
  - CN=kirk,OU=client,O=client,L=test, C=de
#opendistro_security.audit.type: internal_elasticsearch
opendistro_security.enable_snapshot_restore_privilege: true
opendistro_security.check_snapshot_restore_write_privileges: true
opendistro_security.restapi.roles_enabled: ["all_access", "security_rest_api_access", "readall"]
opendistro_security.system_indices.enabled: true
opendistro_security.system_indices.indices: [".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opendistro-notifications-*", ".opendistro-notebooks", ".opendistro-asynchronous-search-response*"]
cluster.routing.allocation.disk.threshold_enabled: false
node.max_local_storage_nodes: 5
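On the Elasticsearch side, these are the kinds of checks I would expect to surface a problem (a sketch using standard 7.x APIs; host1 is again a placeholder):

# Write thread pool activity and rejections per node
curl -s 'http://host1:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected'

# Circuit breaker statistics (tripped counts per breaker)
curl -s 'http://host1:9200/_nodes/stats/breaker?pretty'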
- What steps have you already taken to try and solve the problem?
I tried raising the number of buffer processors (processbuffer_processors = 7, outputbuffer_processors = 4, inputbuffer_processors = 3), but nothing changed. I have restarted the Graylog services several times; each restart seemed to help at first, but now the situation appears stuck.
I also tried increasing the field type refresh interval from 5 s to 30 s.
Any ideas on what to troubleshoot in order to solve this issue?