Elastic cluster connection lost

GL version: v2.4.3+2c41897
Elastic version: 5.6.9

We have 4 GL nodes and had 6 Elasticsearch nodes.
Each ES node has 3 TB of disk space, and we needed more, so we added 4 new nodes to the cluster.
So we have 10 now.
Since the expansion we have a new error:
Every 2-3 days one of the GL servers loses its connection to the whole ES cluster (not at the same time, and not the same servers). At first it is only some servers, but after a while it can't reach any of the 10 servers. (I caught the start time; the netstat output is below.)
We get the same messages for every cluster IP.
GL:

2018-07-31T15:21:10.291+02:00 INFO  [RetryExec] I/O exception (java.net.SocketException) caught when processing request to {}->http://1.2.3.35:9200: Connection reset
2018-07-31T15:21:10.291+02:00 INFO  [RetryExec] Retrying request to {}->http://1.2.3.35:9200
2018-07-31T15:21:10.292+02:00 INFO  [RetryExec] I/O exception (java.net.SocketException) caught when processing request to {}->http://1.2.3.35:9200: Connection reset
2018-07-31T15:21:10.292+02:00 INFO  [RetryExec] Retrying request to {}->http://1.2.3.35:9200
2018-07-31T15:21:10.294+02:00 INFO  [RetryExec] I/O exception (java.net.SocketException) caught when processing request to {}->http://1.2.3.35:9200: Connection reset
2018-07-31T15:21:10.294+02:00 INFO  [RetryExec] Retrying request to {}->http://1.2.3.35:9200
2018-07-31T15:21:10.295+02:00 ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Connection reset, retrying (attempt #166).

No related logs in the Elasticsearch master's log.
ES logs:

[2018-07-31T12:10:13,875][INFO ][o.e.c.m.MetaDataMappingService] [elastic_a5] [graylog_218/ejHAf-upRsuNEL8ls2yx7w] update_mapping [message]
[2018-07-31T14:03:23,780][INFO ][o.e.c.m.MetaDataMappingService] [elastic_a5] [graylog_218/ejHAf-upRsuNEL8ls2yx7w] update_mapping [message]
[2018-07-31T14:51:03,401][INFO ][o.e.m.j.JvmGcMonitorService] [elastic_a5] [gc][1790804] overhead, spent [488ms] collecting in the last [1s]

I did a little research and found the following:
http://docs.graylog.org/en/2.4/pages/configuration/elasticsearch.html
Config Setting                                  Type   Comments                                                      Default
elasticsearch_max_total_connections             int    Maximum number of total Elasticsearch connections            20
elasticsearch_max_total_connections_per_route   int    Maximum number of Elasticsearch connections per route/host   2

These variables have been available since GL 2.3.
So in our case 10 nodes * elasticsearch_max_total_connections_per_route (2) = 20, which is exactly the default elasticsearch_max_total_connections.
I increased the value to 50, but it didn't help.
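For reference, this is what the change looks like in server.conf (the per-route line is only a sketch of a possible next experiment; my actual config below only raises the total):

# raise the total pool above 10 nodes x the per-route limit
elasticsearch_max_total_connections = 50
# default is 2; raising this as well would be the next thing to try
elasticsearch_max_total_connections_per_route = 5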

The configs and the hosts are cloned, so only the hostnames and IP addresses differ (we checked; the only other exception is the is_master parameter).

GL config:

is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = XXX
root_username = XXX
root_password_sha2 = XXX
root_timezone = Europe/Budapest
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://IP:9000/api/
rest_transport_uri = http://IP:9000/api/
trusted_proxies = IP/32, IP/32
web_listen_uri = http://IP:9000/
rotation_strategy = time
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_time_per_index = 1d
elasticsearch_max_number_of_indices = 365
retention_strategy = delete
elasticsearch_shards = 2
elasticsearch_replicas = 1
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = true
allow_highlighting = true
elasticsearch_cluster_name = elasticsearch
elasticsearch_discovery_zen_ping_multicast_enabled = false
elasticsearch_hosts = http://IP:9200, …
elasticsearch_network_host = IP
elasticsearch_analyzer = standard
elasticsearch_request_timeout = 2m
elasticsearch_index_optimization_timeout = 1h
elasticsearch_max_total_connections = 50
output_batch_size = 10000
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_age = 24h
message_journal_max_size = 45gb
message_journal_flush_age = 1m
message_journal_flush_interval = 25000
message_journal_segment_age = 15m
message_journal_segment_size = 100mb
lb_recognition_period_seconds = 3
lb_throttle_threshold_percentage = 50
mongodb_uri = mongodb://XXX:XXX@db_a1:27017,db_b1:27017,db_q1:27017/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32

ES config:

cluster.name: elasticsearch
node.name: elastic_a1
node.attr.site: A
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
path.repo: /backup
discovery.zen.ping.unicast.hosts: ["elastic_a1", "elastic_a2", "elastic_a3", "elastic_a4", "elastic_a5", "elastic_b1", "elastic_b2", "elastic_b3", "elastic_b4", "elastic_b5"]
gateway.recover_after_nodes: 3
cluster.routing.allocation.awareness.force.site.values: A,B
cluster.routing.allocation.awareness.attributes: site

And here is netstat's filtered output from the time the error starts:

[root@log_b1 ~]# while true ;do date ;netstat -tpa | grep wap ; sleep 30; done
Tue Jul 31 15:58:56 CEST 2018
tcp        0      0 LOG_B1:38132        ELASTIC_B2:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:42840        ELASTIC_B3:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:46114        ELASTIC_B1:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:50314        ELASTIC_B4:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:59772        ELASTIC_B5:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:57514        ELASTIC_A4:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:55802        ELASTIC_A1:wap-wsp  ESTABLISHED 6506/java
Tue Jul 31 15:59:26 CEST 2018
tcp        0      0 LOG_B1:38132        ELASTIC_B2:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:42840        ELASTIC_B3:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:46114        ELASTIC_B1:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:50314        ELASTIC_B4:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:59772        ELASTIC_B5:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:57514        ELASTIC_A4:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:55802        ELASTIC_A1:wap-wsp  ESTABLISHED 6506/java
Tue Jul 31 15:59:56 CEST 2018
tcp        0      0 LOG_B1:38132        ELASTIC_B2:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:42840        ELASTIC_B3:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:46114        ELASTIC_B1:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:50314        ELASTIC_B4:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:59772        ELASTIC_B5:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:57514        ELASTIC_A4:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:55802        ELASTIC_A1:wap-wsp  ESTABLISHED 6506/java
Tue Jul 31 16:00:26 CEST 2018
tcp        0      0 LOG_B1:60506        ELASTIC_A2:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:38132        ELASTIC_B2:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:42840        ELASTIC_B3:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:50314        ELASTIC_B4:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:59772        ELASTIC_B5:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:57514        ELASTIC_A4:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:51570        ELASTIC_A3:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:46090        ELASTIC_A5:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:46312        ELASTIC_B1:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:55802        ELASTIC_A1:wap-wsp  ESTABLISHED 6506/java
Tue Jul 31 16:00:56 CEST 2018
tcp        0      0 LOG_B1:60506        ELASTIC_A2:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:42840        ELASTIC_B3:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:50314        ELASTIC_B4:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:59772        ELASTIC_B5:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:57514        ELASTIC_A4:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:51570        ELASTIC_A3:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:46090        ELASTIC_A5:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:46312        ELASTIC_B1:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:55802        ELASTIC_A1:wap-wsp  ESTABLISHED 6506/java
Tue Jul 31 16:01:26 CEST 2018
tcp        0      0 LOG_B1:60506        ELASTIC_A2:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:42840        ELASTIC_B3:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:50314        ELASTIC_B4:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:57514        ELASTIC_A4:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:51570        ELASTIC_A3:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:46090        ELASTIC_A5:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:46312        ELASTIC_B1:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:55802        ELASTIC_A1:wap-wsp  ESTABLISHED 6506/java
Tue Jul 31 16:01:56 CEST 2018
tcp        0      0 LOG_B1:60506        ELASTIC_A2:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:42840        ELASTIC_B3:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:50314        ELASTIC_B4:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:57514        ELASTIC_A4:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:51570        ELASTIC_A3:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:46090        ELASTIC_A5:wap-wsp  ESTABLISHED 6506/java
tcp        0      0 LOG_B1:46312        ELASTIC_B1:wap-wsp  ESTABLISHED 6506/java
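(Note: netstat resolves port 9200 to the wap-wsp service name from /etc/services, which is why the loop greps for "wap". The same check with numeric ports would be:)

while true; do date; netstat -tpan | grep :9200; sleep 30; done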

We did more research, and we see the following:
During normal operation, the GL server keeps open TCP sessions to the ES servers.
During this error, the ES nodes somehow reset the TCP sessions after a few packets.

Does anyone have any idea?

Not an answer per se, but I believe you don't need to specify all of your ES hosts in your Graylog unicast.hosts discovery parameter. I have mine pointed only at a handful of non-data nodes (in my case master/ingest, but perhaps your setup warrants making these ingest-only). With Graylog connecting to only a few designated ES nodes, the Elasticsearch cluster will handle the rest.
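For example (a sketch with hypothetical hostnames, using the elasticsearch_hosts setting that is already in your config):

elasticsearch_hosts = http://es-coord1:9200, http://es-coord2:9200, http://es-coord3:9200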

@jebucha

That line is in the Elasticsearch (ES) config file, not in the Graylog config file.
In my config every ES server has the master role and there are no dedicated data nodes, so it already contains only the possible master nodes :slight_smile:

Does anyone have any idea? Unfortunately we still have the problem, and I have run out of things to try.

  • Do you have metrics that you can check?
  • Did you check the Elasticsearch log?
  • Did you check the threads running in the Elasticsearch JVM?

Yes, I have metrics. I have checked them, but I don't see anything interesting. If you can suggest a metric, I can recheck it and post it.
Yes, I have checked the ES logs; there are no entries at the time we see the error.
As for the threads, if that is what you mean, yes, I have checked:

curl -XGET 'http://XX:9200/_cat/thread_pool?v&h=id,name,active,rejected,completed'
id                     name                active rejected completed
-FZzllvhTg-BIvyT01lvoQ bulk                     0        0    977403
-FZzllvhTg-BIvyT01lvoQ fetch_shard_started      0        0         0
-FZzllvhTg-BIvyT01lvoQ fetch_shard_store        0        0         0
-FZzllvhTg-BIvyT01lvoQ flush                    0        0       369
-FZzllvhTg-BIvyT01lvoQ force_merge              0        0         9
-FZzllvhTg-BIvyT01lvoQ generic                  0        0   2855665
-FZzllvhTg-BIvyT01lvoQ get                      0        0         0
-FZzllvhTg-BIvyT01lvoQ index                    0        0         0
-FZzllvhTg-BIvyT01lvoQ listener                 0        0         0
-FZzllvhTg-BIvyT01lvoQ management               1        0     36393
-FZzllvhTg-BIvyT01lvoQ refresh                  1        0  26100532
-FZzllvhTg-BIvyT01lvoQ search                   0        0     89527
-FZzllvhTg-BIvyT01lvoQ snapshot                 0        0       297
-FZzllvhTg-BIvyT01lvoQ warmer                   0        0         0
Bhal_2SARBy011GJkG4UQw bulk                     2        0  20951198
Bhal_2SARBy011GJkG4UQw fetch_shard_started      0        0        24
Bhal_2SARBy011GJkG4UQw fetch_shard_store        0        0       590
Bhal_2SARBy011GJkG4UQw flush                    0        0      5202
Bhal_2SARBy011GJkG4UQw force_merge              0        0       108
Bhal_2SARBy011GJkG4UQw generic                  0        0   3839849
Bhal_2SARBy011GJkG4UQw get                      0        0         0
Bhal_2SARBy011GJkG4UQw index                    0        0         0
Bhal_2SARBy011GJkG4UQw listener                 0        0         0
Bhal_2SARBy011GJkG4UQw management               1        0    420499
Bhal_2SARBy011GJkG4UQw refresh                  0        0 321899810
Bhal_2SARBy011GJkG4UQw search                   0        0   1608937
Bhal_2SARBy011GJkG4UQw snapshot                 0        0      4202
Bhal_2SARBy011GJkG4UQw warmer                   0        0         0
PjUuFw_VRrSi2cvtwDiNPA bulk                     0        0  20748934
PjUuFw_VRrSi2cvtwDiNPA fetch_shard_started      0        0        73
PjUuFw_VRrSi2cvtwDiNPA fetch_shard_store        0        0      1305
PjUuFw_VRrSi2cvtwDiNPA flush                    0        0      6291
PjUuFw_VRrSi2cvtwDiNPA force_merge              0        0       105
PjUuFw_VRrSi2cvtwDiNPA generic                  0        0   4200083
PjUuFw_VRrSi2cvtwDiNPA get                      0        0         1
PjUuFw_VRrSi2cvtwDiNPA index                    0        0         0
PjUuFw_VRrSi2cvtwDiNPA listener                 0        0         0
PjUuFw_VRrSi2cvtwDiNPA management               1        0    417623
PjUuFw_VRrSi2cvtwDiNPA refresh                  0        0 322559025
PjUuFw_VRrSi2cvtwDiNPA search                   0        0   1597171
PjUuFw_VRrSi2cvtwDiNPA snapshot                 0        0      4020
PjUuFw_VRrSi2cvtwDiNPA warmer                   0        0         0
VdLQocL2RtSVlrAOVSHhYw bulk                     0        0  22672754
VdLQocL2RtSVlrAOVSHhYw fetch_shard_started      0        0        73
VdLQocL2RtSVlrAOVSHhYw fetch_shard_store        0        0     43878
VdLQocL2RtSVlrAOVSHhYw flush                    0        0      5658
VdLQocL2RtSVlrAOVSHhYw force_merge              0        0       108
VdLQocL2RtSVlrAOVSHhYw generic                  0        0   6916542
VdLQocL2RtSVlrAOVSHhYw get                      0        0         2
VdLQocL2RtSVlrAOVSHhYw index                    0        0         0
VdLQocL2RtSVlrAOVSHhYw listener                 0        0         0
VdLQocL2RtSVlrAOVSHhYw management               1        0    421411
VdLQocL2RtSVlrAOVSHhYw refresh                  0        0 318694189
VdLQocL2RtSVlrAOVSHhYw search                   0        0   1431749
VdLQocL2RtSVlrAOVSHhYw snapshot                 0        0      3842
VdLQocL2RtSVlrAOVSHhYw warmer                   0        0         0
Xx5-US-2SKymqsG8uM2ziw bulk                     2        0  20854918
Xx5-US-2SKymqsG8uM2ziw fetch_shard_started      0        0        73
Xx5-US-2SKymqsG8uM2ziw fetch_shard_store        0        0      1010
Xx5-US-2SKymqsG8uM2ziw flush                    0        0      5951
Xx5-US-2SKymqsG8uM2ziw force_merge              0        0       108
Xx5-US-2SKymqsG8uM2ziw generic                  0        0   4523321
Xx5-US-2SKymqsG8uM2ziw get                      0        0         0
Xx5-US-2SKymqsG8uM2ziw index                    0        0         0
Xx5-US-2SKymqsG8uM2ziw listener                 0        0         0
Xx5-US-2SKymqsG8uM2ziw management               1        0    419092
Xx5-US-2SKymqsG8uM2ziw refresh                  0        0 322464963
Xx5-US-2SKymqsG8uM2ziw search                   0        0   1664934
Xx5-US-2SKymqsG8uM2ziw snapshot                 0        0      1936
Xx5-US-2SKymqsG8uM2ziw warmer                   0        0         0
QGFHX6cPT3SlpC2Q35EkYA bulk                     0        0  22400192
QGFHX6cPT3SlpC2Q35EkYA fetch_shard_started      0        0        73
QGFHX6cPT3SlpC2Q35EkYA fetch_shard_store        0        0      1459
QGFHX6cPT3SlpC2Q35EkYA flush                    0        0      5523
QGFHX6cPT3SlpC2Q35EkYA force_merge              0        0       105
QGFHX6cPT3SlpC2Q35EkYA generic                  0        0   6485289
QGFHX6cPT3SlpC2Q35EkYA get                      0        0         2
QGFHX6cPT3SlpC2Q35EkYA index                    0        0         0
QGFHX6cPT3SlpC2Q35EkYA listener                 0        0         0
QGFHX6cPT3SlpC2Q35EkYA management               1        0    413729
QGFHX6cPT3SlpC2Q35EkYA refresh                  0        0 319202263
QGFHX6cPT3SlpC2Q35EkYA search                   0        0   1491721
QGFHX6cPT3SlpC2Q35EkYA snapshot                 0        0      1888
QGFHX6cPT3SlpC2Q35EkYA warmer                   0        0         0
auOfjotvTXSpNoIgfc40WQ bulk                     2        0  21166250
auOfjotvTXSpNoIgfc40WQ fetch_shard_started      0        0        71
auOfjotvTXSpNoIgfc40WQ fetch_shard_store        0        0       802
auOfjotvTXSpNoIgfc40WQ flush                    0        0      5643
auOfjotvTXSpNoIgfc40WQ force_merge              0        0       108
auOfjotvTXSpNoIgfc40WQ generic                  0        0   4081400
auOfjotvTXSpNoIgfc40WQ get                      0        0         0
auOfjotvTXSpNoIgfc40WQ index                    0        0         0
auOfjotvTXSpNoIgfc40WQ listener                 0        0         0
auOfjotvTXSpNoIgfc40WQ management               1        0    418687
auOfjotvTXSpNoIgfc40WQ refresh                  0        0 322419263
auOfjotvTXSpNoIgfc40WQ search                   0        0   1647344
auOfjotvTXSpNoIgfc40WQ snapshot                 0        0      1756
auOfjotvTXSpNoIgfc40WQ warmer                   0        0         0
dIzxbHNGQIK8dVWhgO2ThQ bulk                     0        0  20824086
dIzxbHNGQIK8dVWhgO2ThQ fetch_shard_started      0        0        73
dIzxbHNGQIK8dVWhgO2ThQ fetch_shard_store        0        0      1434
dIzxbHNGQIK8dVWhgO2ThQ flush                    0        0      6208
dIzxbHNGQIK8dVWhgO2ThQ force_merge              0        0       105
dIzxbHNGQIK8dVWhgO2ThQ generic                  0        0   4479901
dIzxbHNGQIK8dVWhgO2ThQ get                      0        0         0
dIzxbHNGQIK8dVWhgO2ThQ index                    0        0         0
dIzxbHNGQIK8dVWhgO2ThQ listener                 0        0         0
dIzxbHNGQIK8dVWhgO2ThQ management               1        0    417817
dIzxbHNGQIK8dVWhgO2ThQ refresh                  0        0 322432830
dIzxbHNGQIK8dVWhgO2ThQ search                   0        0   1593028
dIzxbHNGQIK8dVWhgO2ThQ snapshot                 0        0      4035
dIzxbHNGQIK8dVWhgO2ThQ warmer                   0        0         0
4edjrBQpT2-8J2pLCiHLsw bulk                     2        0   1014470
4edjrBQpT2-8J2pLCiHLsw fetch_shard_started      0        0         0
4edjrBQpT2-8J2pLCiHLsw fetch_shard_store        0        0         0
4edjrBQpT2-8J2pLCiHLsw flush                    0        0       387
4edjrBQpT2-8J2pLCiHLsw force_merge              0        0         9
4edjrBQpT2-8J2pLCiHLsw generic                  0        0   2785258
4edjrBQpT2-8J2pLCiHLsw get                      0        0         0
4edjrBQpT2-8J2pLCiHLsw index                    0        0         0
4edjrBQpT2-8J2pLCiHLsw listener                 0        0         0
4edjrBQpT2-8J2pLCiHLsw management               1        0     36406
4edjrBQpT2-8J2pLCiHLsw refresh                  1        0  26052019
4edjrBQpT2-8J2pLCiHLsw search                   0        0     90067
4edjrBQpT2-8J2pLCiHLsw snapshot                 0        0       170
4edjrBQpT2-8J2pLCiHLsw warmer                   0        0         0
TTGSytdORTKTKdmEGw4FZg bulk                     2        0  21285645
TTGSytdORTKTKdmEGw4FZg fetch_shard_started      0        0        73
TTGSytdORTKTKdmEGw4FZg fetch_shard_store        0        0      1642
TTGSytdORTKTKdmEGw4FZg flush                    0        0      7337
TTGSytdORTKTKdmEGw4FZg force_merge              0        0       105
TTGSytdORTKTKdmEGw4FZg generic                  0        0   4235929
TTGSytdORTKTKdmEGw4FZg get                      0        0         1
TTGSytdORTKTKdmEGw4FZg index                    0        0         0
TTGSytdORTKTKdmEGw4FZg listener                 0        0         0
TTGSytdORTKTKdmEGw4FZg management               1        0    417863
TTGSytdORTKTKdmEGw4FZg refresh                  0        0 322470044
TTGSytdORTKTKdmEGw4FZg search                   0        0   1597756
TTGSytdORTKTKdmEGw4FZg snapshot                 0        0      3996
TTGSytdORTKTKdmEGw4FZg warmer                   0        0         0
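Another thing on my list is the HTTP connection counters on the ES side; the node stats API has an http section with current_open and total_opened, so a rapidly growing total_opened would reveal connection churn:

curl -XGET 'http://XX:9200/_nodes/stats/http?pretty'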

Maybe this blog post will give you some guidance

https://www.graylog.org/post/back-to-basics-monitoring-graylog

Your posted thread pools might be useful when seen with context - meaning a history or similar. Did you have network errors or anything similar?

As you can see, no red herring is visible, so you will need to Sherlock your environment.

Thanks, it's a useful article, but I already monitor all the important metrics in the GL and ES clusters through the OS and the REST API.
We don't have network errors: when the GL node reports the connection resets, I can still ping and telnet to all ES nodes. And in tcpdump I see data flowing between the GL and ES nodes, but after a few packets full of logs the ES node sends a TCP reset. At the same time, all other GL nodes can communicate with the ES node(s).
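For anyone who wants to reproduce the capture, a filter like this shows only the resets on the ES HTTP port (a sketch; the interface name is environment-specific):

tcpdump -ni eth0 'tcp[tcpflags] & tcp-rst != 0 and port 9200'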
We are debugging our environment, but it would be easier if someone has seen something similar or has an idea based on deeper knowledge.
