I found the Graylog cluster sometimes just stops responing for a while, then continues. Today I noticed kernel transparent hugepage daemon on top, and it seems memory allocation is the problem.
According to this:
the problem could be avoided by disabling transparent hugepages. Trying that now.
I find adding a section on actual Graylog documentation on which parameters to twiddle when deploying a production load would be helpful (and if this is indeed one useful, it could be mentioned there).