Output to elasticsearch stops


(Dima Zyuryaev) #1

Hi

We are using a Graylog cluster of 4 nodes, a MongoDB replica set, and a separate Elasticsearch cluster.
Each Graylog server runs a client-only Elasticsearch service, and each Graylog server is configured to send logs to itself and to the other 3 nodes. Sometimes I observe nodes that simply stop sending output to Elasticsearch: the output buffer is 100% full and no messages go out.
Using tcpdump, I noticed that the affected Graylog server does not even try to send output to Elasticsearch and instead accumulates messages in its local journal. I don't see any rejects or denies from the Elasticsearch side in tcpdump; messages just don't go out, while at the same time the other cluster members send messages to Elasticsearch successfully.

  1. Which logs should I check to debug this? Nothing appears in the Graylog server log, in the local journalctl output, or in the Elasticsearch log.
  2. Which configuration parameters should I generally set to fix it?

Graylog version: 2.5
Elasticsearch version: 5.6.15
Relevant configuration (maybe I am missing something here):
elasticsearch_hosts = http://10.25.100.57:9200,http://10.25.100.87:9200,http://10.25.100.40:9200,http://10.25.100.18:9200
elasticsearch_connect_timeout = 2s
elasticsearch_socket_timeout = 5s
elasticsearch_max_total_connections = 256
elasticsearch_max_total_connections_per_route = 64
elasticsearch_max_retries = 1
output_batch_size = 1024
output_flush_interval = 1
processbuffer_processors = 5
outputbuffer_processors = 7
ring_size = 262144
inputbuffer_ring_size = 16384
inputbuffer_processors = 4
inputbuffer_wait_strategy = blocking
message_journal_enabled = true


(Jan Doberstein) #2

Check your Elasticsearch log files, and check the available threads in Elasticsearch's JVM: look at the JVM metrics of Elasticsearch.


(Dima Zyuryaev) #3

ES is totally OK. All the other nodes communicate with it at the same time.
According to tcpdump, Graylog does not even try to send messages to port 9200.
Also, how do I find the best balance between elasticsearch_max_total_connections_per_route and output_batch_size?
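
One way to reason about the balance question is to estimate how many bulk requests a single Graylog node can have in flight at once. The sketch below rests on an assumption that is not confirmed anywhere in this thread: that each output buffer processor sends at most one bulk request of up to output_batch_size messages at a time. The function name output_capacity is hypothetical; the numbers come from the configuration posted above.

```python
def output_capacity(outputbuffer_processors, output_batch_size,
                    hosts, per_route, total):
    """Rough sizing arithmetic.

    ASSUMPTION: one in-flight bulk request per output buffer processor.
    """
    max_bulk_requests = outputbuffer_processors
    max_messages_in_flight = outputbuffer_processors * output_batch_size
    # The HTTP client caps connections per host (route) and in total,
    # so the usable pool is whichever limit is reached first.
    usable_connections = min(per_route * hosts, total)
    return {
        "max_bulk_requests": max_bulk_requests,
        "max_messages_in_flight": max_messages_in_flight,
        "usable_connections": usable_connections,
        "connections_per_bulk_request": usable_connections / max_bulk_requests,
    }


# Values from the posted configuration: 7 output processors, batches of
# 1024, 4 Elasticsearch hosts, 64 connections per route, 256 in total.
result = output_capacity(7, 1024, 4, 64, 256)
for key, value in result.items():
    print(key, value)
```

Under that assumption the posted values allow 7 concurrent bulk requests carrying at most 7168 messages against 256 usable connections, so the connection limits look unlikely to be the constraint, and output_batch_size together with outputbuffer_processors is what bounds indexing throughput per node.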


(Jan Doberstein) #4

How many threads are in the waiting stage in your Elasticsearch cluster?


(Dima Zyuryaev) #5

None, as far as I can see:
es-datanode1 bulk 0 0 0
es-datanode1 force_merge 0 0 0
es-datanode1 generic 0 0 0
es-datanode1 get 0 0 0
es-datanode1 index 0 0 0
es-datanode11 bulk 0 0 0
es-datanode11 force_merge 0 0 0
es-datanode11 generic 0 0 0
es-datanode11 get 0 0 0
es-datanode11 index 0 0 0
es-datanode13 bulk 2 0 0
es-datanode13 force_merge 0 0 0
es-datanode13 generic 0 0 0
es-datanode13 get 0 0 0
es-datanode13 index 0 0 0
awsgraylog-node3-coordinator bulk 0 0 0
awsgraylog-node3-coordinator force_merge 0 0 0
awsgraylog-node3-coordinator generic 0 0 0
awsgraylog-node3-coordinator get 0 0 0
awsgraylog-node3-coordinator index 0 0 0
es-datanode16 bulk 0 0 0
es-datanode16 force_merge 0 0 0
es-datanode16 generic 0 0 0
es-datanode16 get 0 0 0
es-datanode16 index 0 0 0
es-datanode4 bulk 1 0 0
es-datanode4 force_merge 1 0 0
es-datanode4 generic 0 0 0
es-datanode4 get 0 0 0
es-datanode4 index 0 0 0
es-datanode2 bulk 1 0 0
es-datanode2 force_merge 1 0 0
es-datanode2 generic 0 0 0
es-datanode2 get 0 0 0
es-datanode2 index 0 0 0
awses-mnode3 bulk 0 0 0
awses-mnode3 force_merge 0 0 0
awses-mnode3 generic 0 0 0
awses-mnode3 get 0 0 0
awses-mnode3 index 0 0 0
es-datanode3 bulk 1 0 0
es-datanode3 force_merge 1 0 0
es-datanode3 generic 0 0 0
es-datanode3 get 0 0 0
es-datanode3 index 0 0 0
es-datanode8 bulk 0 0 0
es-datanode8 force_merge 0 0 0
es-datanode8 generic 0 0 0
es-datanode8 get 0 0 0
es-datanode8 index 0 0 0
es-datanode10 bulk 0 0 0
es-datanode10 force_merge 0 0 0
es-datanode10 generic 0 0 0
es-datanode10 get 0 0 0
es-datanode10 index 0 0 0
es-datanode15 bulk 0 0 0
es-datanode15 force_merge 0 0 0
es-datanode15 generic 0 0 0
es-datanode15 get 0 0 0
es-datanode15 index 0 0 0
es-datanode6 bulk 0 0 0
es-datanode6 force_merge 0 0 0
es-datanode6 generic 0 0 0
es-datanode6 get 0 0 0
es-datanode6 index 0 0 0
es-datanode14 bulk 1 0 0
es-datanode14 force_merge 0 0 0
es-datanode14 generic 0 0 0
es-datanode14 get 0 0 0
es-datanode14 index 0 0 0
es-datanode5 bulk 0 0 0
es-datanode5 force_merge 0 0 0
es-datanode5 generic 0 0 0
es-datanode5 get 0 0 0
es-datanode5 index 0 0 0
awsgraylog-node1-coordinator bulk 0 0 0
awsgraylog-node1-coordinator force_merge 0 0 0
awsgraylog-node1-coordinator generic 0 0 0
awsgraylog-node1-coordinator get 0 0 0
awsgraylog-node1-coordinator index 0 0 0
es-datanode7 bulk 0 0 0
es-datanode7 force_merge 0 0 0
es-datanode7 generic 0 0 0
es-datanode7 get 0 0 0
es-datanode7 index 0 0 0
awsgraylog-node4-coordinator bulk 0 0 0
awsgraylog-node4-coordinator force_merge 0 0 0
awsgraylog-node4-coordinator generic 0 0 0
awsgraylog-node4-coordinator get 0 0 0
awsgraylog-node4-coordinator index 0 0 0
awsgraylog-node2-coordinator bulk 0 0 0
awsgraylog-node2-coordinator force_merge 0 0 0
awsgraylog-node2-coordinator generic 0 0 0
awsgraylog-node2-coordinator get 0 0 0
awsgraylog-node2-coordinator index 0 0 0
awses-mnode1 bulk 0 0 0
awses-mnode1 force_merge 0 0 0
awses-mnode1 generic 0 0 0
awses-mnode1 get 0 0 0
awses-mnode1 index 0 0 0
awses-mnode2 bulk 0 0 0
awses-mnode2 force_merge 0 0 0
awses-mnode2 generic 0 0 0
awses-mnode2 get 0 0 0
awses-mnode2 index 0 0 0
es-datanode12 bulk 0 0 0
es-datanode12 force_merge 0 0 0
es-datanode12 generic 0 0 0
es-datanode12 get 0 0 0
es-datanode12 index 0 0 0
es-datanode9 bulk 0 0 0
es-datanode9 force_merge 0 0 0
es-datanode9 generic 0 0 0
es-datanode9 get 0 0 0
es-datanode9 index 0 0 0
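
A listing this long is easier to read when the idle pools are filtered out. The sketch below assumes the columns are node name, pool name, active, queue and rejected, which matches the default layout of Elasticsearch's _cat/thread_pool API; the post does not say which command produced the output, so treat the column meaning as an assumption. The function name busy_thread_pools is hypothetical and the sample is a small excerpt of the lines above.

```python
from collections import defaultdict


def busy_thread_pools(cat_output):
    """Return {node: {pool: (active, queue, rejected)}} for non-idle pools.

    ASSUMPTION: each line is "node_name pool_name active queue rejected".
    """
    busy = defaultdict(dict)
    for line in cat_output.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # blank line or unexpected layout
        node, pool = parts[0], parts[1]
        try:
            active, queue, rejected = (int(p) for p in parts[2:])
        except ValueError:
            continue  # e.g. a header row with column names
        if active or queue or rejected:
            busy[node][pool] = (active, queue, rejected)
    return dict(busy)


sample = """\
es-datanode13 bulk 2 0 0
es-datanode13 get 0 0 0
es-datanode4 bulk 1 0 0
es-datanode4 force_merge 1 0 0
"""

report = busy_thread_pools(sample)
for node, pools in sorted(report.items()):
    print(node, pools)
```

If the column assumption holds, the full listing shows a few active bulk and force_merge threads and nothing queued or rejected, which is consistent with Elasticsearch not pushing back on indexing requests.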


(Dima Zyuryaev) #6

Also, the OS stats do not look overloaded:

root@es-datanode3:~# vmstat -S M 1
procs -----------memory---------- ---swap-- -----io----- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi     bo   in   cs us sy id wa st
 2  0      0   9727     19  81871    0    0   175   1079    0    1  6  0 94  0  0
 1  0      0   9690     19  81908    0    0     0    857 1860 1510  9  1 90  0  0
 2  0      0   9650     19  81949    0    0     0   9626 4273 2494 15  0 84  0  0
 1  0      0   9596     19  82003    0    0     0   7801 3347 1727 12  1 87  0  0
 2  0      0   9564     19  82034    0    0     0   3151 3686 2141 14  0 86  0  0
 2  0      0   9502     19  82095    0    0     0 421108 6677 8620 10  1 86  3  0
 1  0      0   9446     19  82153    0    0     0   3775 2729 1814 10  0 90  0  0
 2  0      0   9379     19  82219    0    0     0   2969 3971 2435 14  1 85  0  0
 1  0      0   9321     19  82277    0    0     0 167725 4204 4133 10  1 89  1  0
 2  0      0   9253     19  82345    0    0     0   9870 5247 2681 18  1 81  0  0
 4  0      0   9195     19  82399    0    0     0   9842 4676 2351 15  0 84  0  0



(Dima Zyuryaev) #7

Actually, the issue was resolved by raising refresh_interval to 15s. Since then, even at peaks of 20,000-30,000 msgs/sec, the output rate rises accordingly and logs no longer pile up in the local journal. I would be glad to understand better why this option matters so much on heavily loaded setups and how it affects search performance.
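
For context on why the setting matters: every refresh makes recently indexed documents searchable by opening a new Lucene segment, so a longer interval means fewer, larger segments and less refresh and merge work during heavy indexing. The price is that new messages can take up to the interval (here 15 s) to show up in searches. The sketch below only builds a request against the Elasticsearch index settings endpoint and does not send it. The index name graylog_0 is hypothetical, the host is the first entry of elasticsearch_hosts from the original post, and the thread does not say how the setting was actually applied.

```python
import json
import urllib.request


def refreshes_per_minute(refresh_interval_seconds):
    """How often an index is refreshed per minute for a given interval."""
    return 60 / refresh_interval_seconds


def build_refresh_interval_request(base_url, index, interval):
    """Build (but do not send) a PUT <index>/_settings request."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical index name; the host comes from elasticsearch_hosts above.
req = build_refresh_interval_request("http://10.25.100.57:9200", "graylog_0", "15s")
print(req.get_method(), req.full_url, req.data.decode("utf-8"))

# Going from a 1 s interval to 15 s cuts refreshes from 60 to 4 per minute.
print(refreshes_per_minute(1), "->", refreshes_per_minute(15))
```

A setting applied this way affects only the index it is sent to. Newly created indices take their settings from index templates, so a lasting change would normally have to be made there.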


(Jan Doberstein) #8

You should read this blog posting

That is the best you can find about this topic.


(Dima Zyuryaev) #9

This is amazing. Thank you very much