Elasticsearch reports OutOfMemoryError, causing Graylog to queue messages

I have a set of Graylog 2.2.3 servers and an Elasticsearch 2.4.4 cluster with 3 master-eligible nodes and 10 data nodes. The master nodes have 4 CPUs and 16 GB RAM with 8 GB allocated to the Java heap; the data nodes have 8 CPUs and 64 GB RAM with 31 GB allocated to the Java heap. I have 4,112 shards across 439 indices totalling approximately 29 TB. Even after following the Graylog and Elasticsearch memory-tuning documentation I still get OutOfMemoryError, at which point the node fails, its shards become unassigned, and Graylog begins to queue messages while Elasticsearch reallocates the data:
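For context, some back-of-the-envelope math on the figures above (this assumes shards are spread roughly evenly across the 10 data nodes, which Elasticsearch attempts by default):

```python
# Rough cluster sizing math from the numbers in this post.
# Assumption: shards are balanced evenly across the data nodes.
data_nodes = 10
shards = 4112
total_tb = 29
heap_gb_per_node = 31

shards_per_node = shards / data_nodes               # roughly 411 shards per data node
avg_shard_gb = total_tb * 1024 / shards             # roughly 7.2 GB per shard
shards_per_gb_heap = shards_per_node / heap_gb_per_node  # roughly 13 shards per GB of heap

print(f"{shards_per_node:.0f} shards/node, {avg_shard_gb:.1f} GB/shard, "
      f"{shards_per_gb_heap:.1f} shards per GB of heap")
```

So each data node is carrying on the order of 400 shards against a 31 GB heap, which is the kind of per-node overhead that matters when reasoning about heap pressure.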

[2017-04-27 05:49:56,738][INFO ][monitor.jvm ] [es-node-08.example.com] [gc][old][28914][348] duration [6.7s], collections [1]/[6.7s], total [6.7s]/[27.1m], memory [30.8gb]->[30.9gb]/[30.9gb], all_pools {[young] [507mb]->[532.5mb]/[532.5mb]}{[survivor] [0b]->[37.6mb]/[66.5mb]}{[old] [30.3gb]->[30.3gb]/[30.3gb]}
[2017-04-27 05:50:47,914][WARN ][transport.netty ] [es-node-08.example.com] exception caught on transport layer [[id: 0xc5cd721b, / => /]], closing connection java.lang.OutOfMemoryError: Java heap space
[2017-04-27 05:51:14,697][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop. java.lang.OutOfMemoryError: Java heap space
[2017-04-27 05:51:22,189][INFO ][monitor.jvm ] [es-node-08.example.com] [gc][old][28916][367] duration [7.1s], collections [1]/[7.4s], total [7.1s]/[28.5m], memory [30.3gb]->[30gb]/[30.9gb], all_pools {[young] [84mb]->[90.2mb]/[532.5mb]}{[survivor] [0b]->[0b]/[66.5mb]}{[old] [30.2gb]->[29.9gb]/[30.3gb]}
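Reading the first `[monitor.jvm]` line: the old generation is at its ceiling (30.3gb of 30.3gb) and a full GC frees essentially nothing, so the live set no longer fits in the heap. A small sketch of pulling those numbers out of a GC log line (the regex here is my own, written against the log format shown above):

```python
import re

# The memory [before]->[after]/[limit] figures from the first GC line above.
line = ("[2017-04-27 05:49:56,738][INFO ][monitor.jvm ] [es-node-08.example.com] "
        "[gc][old][28914][348] duration [6.7s], collections [1]/[6.7s], "
        "total [6.7s]/[27.1m], memory [30.8gb]->[30.9gb]/[30.9gb]")

m = re.search(r"memory \[([\d.]+)gb\]->\[([\d.]+)gb\]/\[([\d.]+)gb\]", line)
before, after, limit = map(float, m.groups())

# When "after" is still at "limit" following an old-gen collection,
# the heap is effectively exhausted and an OutOfMemoryError follows.
print(f"after GC: {after:.1f}/{limit:.1f} GB ({after / limit:.0%} full)")
```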

Hey @JoeG,

what exactly is your question?

Where do I start troubleshooting or tuning to prevent the OutOfMemoryError? Is there a recommended shard-to-node ratio or a shard-size sweet spot?