Latest graylog-server 2.2.1 bug?

Hi there

I just upgraded a well-running graylog-server-2.2.0-11 instance to 2.2.1-1 (fully patched CentOS-7, via the yum repo) and had an outage today within an hour of upgrading (I hadn’t had any outages with graylog/ES for 6+ months before today).

What seems to happen is that I run a search over a large timeframe (e.g. 30 days), and the In/Out msg/sec counter at the top right suddenly has the “Out” number collapse to 0/sec and stay there. Then going over to the Node page shows the journal expanding and expanding, so that it would soon fill the disk, etc.

Querying Elasticsearch (which was already up-to-date at elasticsearch-2.4.4-1.noarch) via

curl -m 60 -s http://$ES_SRV:9200/_cluster/health

shows the “unassigned_shards” count just growing and growing. There are no errors at all in /var/log/graylog-server/server.log, but /var/log/elasticsearch/logfile reports the following (see end) when the search starts to hang. It looks like it tried for 30sec, timed out, but the answer came back 46sec later?
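
(A quick way to watch that number directly, assuming the same host and port as the health check above, is the cat shards API:

curl -m 60 -s http://$ES_SRV:9200/_cat/shards | grep -c UNASSIGNED

The API prints one line per shard, and grep -c counts the ones still marked UNASSIGNED.)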

I appreciate this looks like an ES bug/issue, but it just seems weird that this starts to happen after upgrading to graylog-2.2.1. So I am wondering if searches are done differently now, causing ES to work harder (or tickling a bug in 2.4.4?). I am also seeing less demanding searches have the “Out” number go to 0/sec for a few seconds and then pop back to normal - I can’t recall seeing that happen before either (we average 0.5K-3K msg/sec, so 0/sec really looks unexpected).

In the end I shut down graylog-server, then shut down elasticsearch. However, after waiting 6 minutes for it to exit (which seems way too long), I killed it and restarted it. It came back fine, processed the unassigned shards, and then I restarted graylog and it all works again. But ES didn’t like something…

Any ideas?

Thanks, Jason

[2017-02-21 01:26:29,497][DEBUG][action.admin.cluster.node.stats] [fantail] failed to execute on node [fNe28oG6TWCOWwOffzHV1A]
ReceiveTimeoutTransportException[[fantail][1.3.4.5:9300][cluster:monitor/nodes/stats[n]] request_id [31950609] timed out after [30527ms]]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:698)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[2017-02-21 01:26:29,610][WARN ][monitor.jvm              ] [fantail] [gc][old][2495698][273347] duration [14s], collections [1]/[14.3s], total [14s]/[7.8h], memory [16.8gb]->[16.7gb]/[16.8gb], all_pools {[young] [1.1gb]->[1.1gb]/[1.1gb]}{[survivor] [121mb]->[83.2mb]/[149.7mb]}{[old] [15.5gb]->[15.5gb]/[15.5gb]}
[2017-02-21 01:26:45,621][WARN ][transport                ] [fantail] Received response for a request that has timed out, sent [46684ms] ago, timed out [16157ms] ago, action [cluster:monitor/nodes/stats[n]], node [{fantail}{fNe28oG6TWCOWwOffzHV1A}{1.3.4.5}{1.3.4.5:9300}], id [31950609]
[2017-02-21 01:27:15,810][WARN ][monitor.jvm              ] [fantail] [gc][old][2495699][273350] duration [45.6s], collections [3]/[46.2s], total [45.6s]/[7.8h], memory [16.7gb]->[16.8gb]/[16.8gb], all_pools {[young] [1.1gb]->[1.1gb]/[1.1gb]}{[survivor] [83.2mb]->[105.5mb]/[149.7mb]}{[old] [15.5gb]->[15.5gb]/[15.5gb]}

Hi Jason,

There have been absolutely no changes in Graylog 2.2.1 that affect how Elasticsearch is used, so I guess this problem was caused by something else.

The GC pauses mentioned in the logs of your Elasticsearch node are extremely long (46s); usually they complete in sub-second time.
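
If you want to confirm whether the heap on that node is actually exhausted (your log lines show the old generation at 15.5gb out of 15.5gb), the node stats API reports heap usage directly - for example, assuming the node answers HTTP on localhost:9200:

curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty' | grep heap_used_percent

A node sitting at or near 100% heap spends most of its time in old-generation GC, which matches the stop-the-world pauses in your log.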

Are there any other error messages in the (historic) logs of your Elasticsearch nodes?

None that seem serious. I have some GELF record errors due to (I guess) ES incorrectly guessing that the first appearance of a field in a new index was a date when it was actually a string; the end result is that the second and later occurrences of that field trigger errors. I’m going to have to hard-wire the mapping to string to make that problem go away.

BTW, is there any way to declare the field type on a per-record basis? This particular set of data is in JSON format and a batch process injects it into graylog via GELF - so I’d rather alter the JSON data directly before submission to do something like

“fieldname”:varchar(“2017-02-21 00:33:hello”)

so that graylog and ES both know it’s a string and not anything else. Is that possible, or am I stuck with the graylog-custom-mapping.json trick? Also, is there a magic directory to put that file in so that it gets auto-loaded at boot time? Or do I have to create a process to ensure that gets loaded?

Thanks

Jason

You can use the message processing pipelines for this, using to_bool(), to_string(), etc.
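
For example, a minimal pipeline rule forcing that field to a string could look roughly like this (a sketch that assumes the field is literally called fieldname, as in your example):

rule "force fieldname to string"
when
  has_field("fieldname")
then
  // overwrite the field with its own value converted to a string
  set_field("fieldname", to_string($message.fieldname));
end

Attach the rule to a pipeline connected to the stream your GELF messages arrive on, and the value will be stored as a string no matter what it looks like.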

Additionally, creating a custom index mapping for your (known) fields will help: http://docs.graylog.org/en/2.2/pages/configuration/elasticsearch.html#custom-index-mappings
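
For reference, the approach on that page boils down to PUTting an index template into Elasticsearch. A sketch along the lines of the documented example, assuming the default graylog_* index prefix and a field named fieldname:

{
  "template": "graylog_*",
  "mappings": {
    "message": {
      "properties": {
        "fieldname": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

Save that as graylog-custom-mapping.json and load it once with:

curl -X PUT -d @graylog-custom-mapping.json 'http://localhost:9200/_template/graylog-custom-mapping?pretty'

“not_analyzed” keeps the whole value as a single token, which is usually what you want for timestamp-like strings. The template only applies to indices created after it is installed, so you may want to manually rotate the active write index afterwards.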

What do you mean exactly?

I figured it out. I was assuming the custom mapping loaded via graylog-custom-mapping.json was ephemeral - but now I see it actually gets stored in ES as an index template - so it’s permanent.
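
(Assuming the template was installed under the name graylog-custom-mapping, what is stored can be checked at any time with

curl -s 'http://localhost:9200/_template/graylog-custom-mapping?pretty'

which shows exactly what Elasticsearch will apply to new graylog_* indices.)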

Jason