Hi there
I just upgraded a well-running graylog-server-2.2.0-11 instance to 2.2.1-1 (fully patched CentOS-7, via the yum repo) and had an outage today within an hour of upgrading. I hadn't had any outages with Graylog/ES for 6+ months before today.
What seems to happen is that I do a search over a large timeframe (e.g. 30 days), and the In/Out msg/sec counter on the top right suddenly has the “Out” number collapse to 0/sec and stay there. Going over to the Node page then shows the journal expanding and expanding, so that it would soon fill the disk, etc.
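In case it helps anyone reproduce this, the journal growth is also visible outside the web UI. This is only a sketch, assuming the Graylog REST API journal endpoint and the default rest_listen_uri port 12900 from 2.2 (adjust host, port and credentials to your setup):

# journal status for this node (uncommitted entries, size, etc.)
curl -u admin:PASSWORD -s "http://graylog-host:12900/system/journal"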
Querying Elasticsearch (which was already up-to-date at elasticsearch-2.4.4-1.noarch) via
curl -m 60 -s http://$ES_SRV:9200/_cluster/health
shows that “unassigned_shards” just keeps growing and growing. There are no errors at all in /var/log/graylog-server/server.log, but /var/log/elasticsearch/logfile reports the following (see end) when the search starts to hang. It looks like the request tried for 30 seconds, timed out, and then the answer came back 46 seconds later?
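In case it's useful to anyone looking at this, the same shard state can be watched with the standard _cat endpoints in ES 2.x ($ES_SRV being the ES host as above):

# overall cluster health, including the unassigned_shards column
curl -m 60 -s "http://$ES_SRV:9200/_cat/health?v"

# list only the shards that are currently unassigned
curl -m 60 -s "http://$ES_SRV:9200/_cat/shards" | grep UNASSIGNED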
I appreciate this looks like an ES bug/issue, but it just seems weird that this started happening right after upgrading to graylog-2.2.1. So I am wondering whether searches are done differently now, causing ES to work harder (or tickling a bug in 2.4.4)? I am also seeing less demanding searches make the “Out” rate drop to 0/sec for a few seconds and then pop back to normal; I can't recall seeing that happen before either (we average 0.5K-3K msg/sec, so 0/sec really looks unexpected).
In the end I shut down graylog-server, then shut down elasticsearch. However, after waiting 6 minutes for it to exit (which seems way too long), I killed the process and restarted it. It came back fine, processed the unassigned shards, and after I restarted graylog-server everything works again. But ES clearly didn't like something…
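For completeness, the recovery was roughly this sequence (assuming the stock CentOS 7 systemd unit names from the yum packages; the elasticsearch stop is the step that hung for ~6 minutes before I killed it):

systemctl stop graylog-server
systemctl stop elasticsearch     # this is what hung; I eventually killed the java process
systemctl start elasticsearch
# waited until the unassigned shards were reassigned and the cluster health recovered
curl -m 60 -s "http://$ES_SRV:9200/_cat/health?v"
systemctl start graylog-server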
Any ideas?
Thanks, Jason
[2017-02-21 01:26:29,497][DEBUG][action.admin.cluster.node.stats] [fantail] failed to execute on node [fNe28oG6TWCOWwOffzHV1A]
ReceiveTimeoutTransportException[[fantail][1.3.4.5:9300][cluster:monitor/nodes/stats[n]] request_id [31950609] timed out after [30527ms]]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:698)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2017-02-21 01:26:29,610][WARN ][monitor.jvm ] [fantail] [gc][old][2495698][273347] duration [14s], collections [1]/[14.3s], total [14s]/[7.8h], memory [16.8gb]->[16.7gb]/[16.8gb], all_pools {[young] [1.1gb]->[1.1gb]/[1.1gb]}{[survivor] [121mb]->[83.2mb]/[149.7mb]}{[old] [15.5gb]->[15.5gb]/[15.5gb]}
[2017-02-21 01:26:45,621][WARN ][transport ] [fantail] Received response for a request that has timed out, sent [46684ms] ago, timed out [16157ms] ago, action [cluster:monitor/nodes/stats[n]], node [{fantail}{fNe28oG6TWCOWwOffzHV1A}{1.3.4.5}{1.3.4.5:9300}], id [31950609]
[2017-02-21 01:27:15,810][WARN ][monitor.jvm ] [fantail] [gc][old][2495699][273350] duration [45.6s], collections [3]/[46.2s], total [45.6s]/[7.8h], memory [16.7gb]->[16.8gb]/[16.8gb], all_pools {[young] [1.1gb]->[1.1gb]/[1.1gb]}{[survivor] [83.2mb]->[105.5mb]/[149.7mb]}{[old] [15.5gb]->[15.5gb]/[15.5gb]}