GL version: v2.4.3+2c41897
Elastic version: 5.6.9
We have 4 GL nodes and had 6 Elasticsearch nodes.
The ES nodes have 3 TB of disk space, and we needed more, so we added 4 new nodes to the cluster.
So we have 10 ES nodes now.
After the update we have a new error:
Every 2-3 days one of the GL servers loses its connection to the whole ES cluster (not at the same time, and not always the same server). At first it loses only some of the ES servers, but after a while it can't reach any of the 10. (I caught the start of it; the netstat output is below.)
We get the same messages for every IP in the cluster.
GL:
2018-07-31T15:21:10.291+02:00 INFO [RetryExec] I/O exception (java.net.SocketException) caught when processing request to {}->http://1.2.3.35:9200: Connection reset
2018-07-31T15:21:10.291+02:00 INFO [RetryExec] Retrying request to {}->http://1.2.3.35:9200
2018-07-31T15:21:10.292+02:00 INFO [RetryExec] I/O exception (java.net.SocketException) caught when processing request to {}->http://1.2.3.35:9200: Connection reset
2018-07-31T15:21:10.292+02:00 INFO [RetryExec] Retrying request to {}->http://1.2.3.35:9200
2018-07-31T15:21:10.294+02:00 INFO [RetryExec] I/O exception (java.net.SocketException) caught when processing request to {}->http://1.2.3.35:9200: Connection reset
2018-07-31T15:21:10.294+02:00 INFO [RetryExec] Retrying request to {}->http://1.2.3.35:9200
2018-07-31T15:21:10.295+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Connection reset, retrying (attempt #166).
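To check whether the resets hit every ES node or only a few, the RetryExec lines can be tallied per target host. This is a minimal sketch, assuming the log format matches the lines above; the regular expression and the function name count_resets are made up for illustration and are not part of Graylog.

```python
import re
from collections import Counter

# Pattern derived from the sample lines above; it is an assumption,
# not a documented Graylog log format.
RESET_RE = re.compile(
    r"I/O exception \(java\.net\.SocketException\) caught when processing "
    r"request to \{\}->http://([^:\s]+):(\d+): Connection reset"
)


def count_resets(lines):
    """Return a Counter mapping 'host:port' to the number of resets seen."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            counts[f"{match.group(1)}:{match.group(2)}"] += 1
    return counts


# Excerpt of the log lines quoted in this post.
sample = [
    "2018-07-31T15:21:10.291+02:00 INFO [RetryExec] I/O exception (java.net.SocketException) caught when processing request to {}->http://1.2.3.35:9200: Connection reset",
    "2018-07-31T15:21:10.291+02:00 INFO [RetryExec] Retrying request to {}->http://1.2.3.35:9200",
    "2018-07-31T15:21:10.292+02:00 INFO [RetryExec] I/O exception (java.net.SocketException) caught when processing request to {}->http://1.2.3.35:9200: Connection reset",
    "2018-07-31T15:21:10.295+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Connection reset, retrying (attempt #166).",
]

for host, count in count_resets(sample).most_common():
    print(host, count)
```

Feeding the whole server log through count_resets (for example via open(path) as the lines argument) would show whether the resets are spread across all 10 nodes or concentrated on the newly added ones.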
No related entries in the Elasticsearch master's log.
ES logs:
[2018-07-31T12:10:13,875][INFO ][o.e.c.m.MetaDataMappingService] [elastic_a5] [graylog_218/ejHAf-upRsuNEL8ls2yx7w] update_mapping [message]
[2018-07-31T14:03:23,780][INFO ][o.e.c.m.MetaDataMappingService] [elastic_a5] [graylog_218/ejHAf-upRsuNEL8ls2yx7w] update_mapping [message]
[2018-07-31T14:51:03,401][INFO ][o.e.m.j.JvmGcMonitorService] [elastic_a5] [gc][1790804] overhead, spent [488ms] collecting in the last [1s]
I did a little research and found the following:
http://docs.graylog.org/en/2.4/pages/configuration/elasticsearch.html
Config setting | Type | Comments | Default
elasticsearch_max_total_connections | int | Maximum number of total Elasticsearch connections | 20
elasticsearch_max_total_connections_per_route | int | Maximum number of Elasticsearch connections per route/host | 2
These settings are available since GL 2.3.
So with the defaults, 10 nodes * elasticsearch_max_total_connections_per_route (2) = elasticsearch_max_total_connections (20) in our case.
I increased elasticsearch_max_total_connections to 50, but it didn't help.
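The arithmetic behind those two settings can be written down as a small check. This is only a sketch under an assumption: that the HTTP client pool caps connections per host at the per-route value and overall at the total value, so the usable number is the smaller of the two limits. The function name pool_check is hypothetical.

```python
def pool_check(node_count, per_route, max_total):
    """Compare the configured total against what the per-route limit allows.

    Assumes each ES host gets at most `per_route` connections and the
    whole pool at most `max_total` (an assumption about the client pool,
    not something confirmed in this thread).
    """
    reachable = node_count * per_route
    return {
        "usable_connections": min(reachable, max_total),
        "total_is_bottleneck": max_total < reachable,
    }


# Values from this post: 10 ES nodes, per-route default of 2.
for total in (20, 50):
    print(total, pool_check(node_count=10, per_route=2, max_total=total))
```

Under that assumption, raising only the total from 20 to 50 leaves the usable number at 20, because the per-route limit is still at its default of 2.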
The configs and the hosts are cloned, so only the hostnames and IP addresses differ (we checked; the only other difference is the is_master parameter).
GL config:
is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = XXX
root_username = XXX
root_password_sha2 = XXX
root_timezone = Europe/Budapest
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://IP:9000/api/
rest_transport_uri = http://IP:9000/api/
trusted_proxies = IP/32, IP/32
web_listen_uri = http://IP:9000/
rotation_strategy = time
elasticsearch_max_docs_per_index = 20000000
rotation_strategy = time
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_time_per_index = 1d
elasticsearch_max_number_of_indices = 365
retention_strategy = delete
elasticsearch_max_number_of_indices = 365
retention_strategy = delete
elasticsearch_shards = 2
elasticsearch_replicas = 1
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = true
allow_highlighting = true
elasticsearch_cluster_name = elasticsearch
elasticsearch_discovery_zen_ping_multicast_enabled = false
elasticsearch_hosts = http://IP:9200, …
elasticsearch_network_host = IP
elasticsearch_analyzer = standard
elasticsearch_request_timeout = 2m
elasticsearch_index_optimization_timeout = 1h
elasticsearch_max_total_connections = 50
output_batch_size = 10000
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_age = 24h
message_journal_max_size = 45gb
message_journal_flush_age = 1m
message_journal_flush_interval = 25000
message_journal_segment_age = 15m
message_journal_segment_size = 100mb
lb_recognition_period_seconds = 3
lb_throttle_threshold_percentage = 50
mongodb_uri = mongodb://XXX:XXX@db_a1:27017,db_b1:27017,db_q1:27017/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32
ES config:
cluster.name: elasticsearch
node.name: elastic_a1
node.attr.site: A
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
path.repo: /backup
discovery.zen.ping.unicast.hosts: ["elastic_a1", "elastic_a2", "elastic_a3", "elastic_a4", "elastic_a5", "elastic_b1", "elastic_b2", "elastic_b3", "elastic_b4", "elastic_b5"]
gateway.recover_after_nodes: 3
cluster.routing.allocation.awareness.force.site.values: A,B
cluster.routing.allocation.awareness.attributes: site
And here is the filtered netstat output from the moment the error starts (wap-wsp is how netstat displays port 9200):
[root@log_b1 ~]# while true ;do date ;netstat -tpa | grep wap ; sleep 30; done
Tue Jul 31 15:58:56 CEST 2018
tcp 0 0 LOG_B1:38132 ELASTIC_B2:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:42840 ELASTIC_B3:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:46114 ELASTIC_B1:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:50314 ELASTIC_B4:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:59772 ELASTIC_B5:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:57514 ELASTIC_A4:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:55802 ELASTIC_A1:wap-wsp ESTABLISHED 6506/java
Tue Jul 31 15:59:26 CEST 2018
tcp 0 0 LOG_B1:38132 ELASTIC_B2:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:42840 ELASTIC_B3:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:46114 ELASTIC_B1:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:50314 ELASTIC_B4:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:59772 ELASTIC_B5:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:57514 ELASTIC_A4:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:55802 ELASTIC_A1:wap-wsp ESTABLISHED 6506/java
Tue Jul 31 15:59:56 CEST 2018
tcp 0 0 LOG_B1:38132 ELASTIC_B2:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:42840 ELASTIC_B3:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:46114 ELASTIC_B1:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:50314 ELASTIC_B4:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:59772 ELASTIC_B5:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:57514 ELASTIC_A4:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:55802 ELASTIC_A1:wap-wsp ESTABLISHED 6506/java
Tue Jul 31 16:00:26 CEST 2018
tcp 0 0 LOG_B1:60506 ELASTIC_A2:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:38132 ELASTIC_B2:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:42840 ELASTIC_B3:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:50314 ELASTIC_B4:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:59772 ELASTIC_B5:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:57514 ELASTIC_A4:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:51570 ELASTIC_A3:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:46090 ELASTIC_A5:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:46312 ELASTIC_B1:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:55802 ELASTIC_A1:wap-wsp ESTABLISHED 6506/java
Tue Jul 31 16:00:56 CEST 2018
tcp 0 0 LOG_B1:60506 ELASTIC_A2:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:42840 ELASTIC_B3:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:50314 ELASTIC_B4:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:59772 ELASTIC_B5:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:57514 ELASTIC_A4:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:51570 ELASTIC_A3:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:46090 ELASTIC_A5:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:46312 ELASTIC_B1:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:55802 ELASTIC_A1:wap-wsp ESTABLISHED 6506/java
Tue Jul 31 16:01:26 CEST 2018
tcp 0 0 LOG_B1:60506 ELASTIC_A2:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:42840 ELASTIC_B3:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:50314 ELASTIC_B4:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:57514 ELASTIC_A4:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:51570 ELASTIC_A3:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:46090 ELASTIC_A5:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:46312 ELASTIC_B1:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:55802 ELASTIC_A1:wap-wsp ESTABLISHED 6506/java
Tue Jul 31 16:01:56 CEST 2018
tcp 0 0 LOG_B1:60506 ELASTIC_A2:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:42840 ELASTIC_B3:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:50314 ELASTIC_B4:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:57514 ELASTIC_A4:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:51570 ELASTIC_A3:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:46090 ELASTIC_A5:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:46312 ELASTIC_B1:wap-wsp ESTABLISHED 6506/java
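The 30-second snapshots above are easier to read as a diff between consecutive samples. The following is a rough sketch, assuming the output has exactly the shape shown (a date line followed by netstat rows); parse_snapshots and diff_snapshots are hypothetical helper names, and the sample text is a shortened excerpt of two of the snapshots above.

```python
import re

# A timestamp line printed by `date`, e.g. "Tue Jul 31 15:58:56 CEST 2018".
DATE_RE = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d \S+ \d{4}$"
)


def parse_snapshots(text):
    """Split the loop output into (timestamp, set of (local, remote)) pairs."""
    snapshots = []
    for line in text.splitlines():
        line = line.strip()
        if DATE_RE.match(line):
            snapshots.append((line, set()))
        elif line.startswith("tcp") and snapshots:
            # Columns: proto, recv-q, send-q, local, remote, state, pid/program
            fields = line.split()
            if len(fields) >= 6 and fields[5] == "ESTABLISHED":
                snapshots[-1][1].add((fields[3], fields[4]))
    return snapshots


def diff_snapshots(snapshots):
    """Yield (timestamp, opened, closed) for every snapshot that changed."""
    for (_, prev), (stamp, cur) in zip(snapshots, snapshots[1:]):
        opened, closed = cur - prev, prev - cur
        if opened or closed:
            yield stamp, sorted(opened), sorted(closed)


# Shortened excerpt of the 15:59:56 and 16:00:26 snapshots.
sample = """\
Tue Jul 31 15:59:56 CEST 2018
tcp 0 0 LOG_B1:46114 ELASTIC_B1:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:55802 ELASTIC_A1:wap-wsp ESTABLISHED 6506/java
Tue Jul 31 16:00:26 CEST 2018
tcp 0 0 LOG_B1:60506 ELASTIC_A2:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:46312 ELASTIC_B1:wap-wsp ESTABLISHED 6506/java
tcp 0 0 LOG_B1:55802 ELASTIC_A1:wap-wsp ESTABLISHED 6506/java
"""

for stamp, opened, closed in diff_snapshots(parse_snapshots(sample)):
    print(stamp)
    for local, remote in opened:
        print("  opened:", local, "->", remote)
    for local, remote in closed:
        print("  closed:", local, "->", remote)
```

A connection that shows up as closed and reopened to the same host on a new local port (as with ELASTIC_B1 in the excerpt) points to a reconnect rather than a host that became unreachable.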