Hello,
Yesterday, we started having quite severe issues with our Graylog installation, more specifically with Elasticsearch. We had been running this installation without a hitch for about a month after upgrading to Graylog 2.4.6 / Elasticsearch 5.6.10.
The symptom we're seeing is simply that our Elasticsearch servers started throwing OutOfMemory exceptions after about 5-15 minutes of operation. The first thing we tried was to gradually increase the heap size from 3 GB to eventually 16 GB. This mitigates the issue somewhat, in that it now takes 1-2 hours between crashes, but something is still clearly wrong.
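For reference, the heap size was changed via the standard JVM options file; the values we ended up with look roughly like this (the file path is an assumption based on the default package layout and may differ under the Graylog Ansible playbook):

# /etc/elasticsearch/jvm.options
-Xms16g
-Xmx16g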
We have 120 indices, rotated daily, each 5-10 GB in size and containing 1,000,000-1,500,000 documents, for a total of about 800 GB of data.
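(The figures above can be reproduced with the cat APIs, e.g. the following, where localhost:9200 stands in for one of the Elasticsearch nodes:)

curl -s 'http://localhost:9200/_cat/indices?v&h=index,docs.count,store.size'
curl -s 'http://localhost:9200/_cat/allocation?v'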
Here is an example of the log output leading up to a crash, including the resulting stack trace:
[2018-08-29T12:51:32,245][INFO ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][988] overhead, spent [271ms] collecting in the last [1s]
[2018-08-29T12:51:35,255][INFO ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][991] overhead, spent [252ms] collecting in the last [1s]
[2018-08-29T12:51:40,066][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][994] overhead, spent [2.4s] collecting in the last [2.7s]
[2018-08-29T12:51:42,326][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][995] overhead, spent [2.1s] collecting in the last [2.2s]
[2018-08-29T12:51:44,902][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][996] overhead, spent [2.4s] collecting in the last [2.5s]
[2018-08-29T12:51:46,717][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][997] overhead, spent [1.7s] collecting in the last [1.8s]
[2018-08-29T12:51:48,947][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][998] overhead, spent [2.1s] collecting in the last [2.2s]
[2018-08-29T12:51:50,710][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][999] overhead, spent [1.7s] collecting in the last [1.7s]
[2018-08-29T12:51:54,499][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][1000] overhead, spent [3.7s] collecting in the last [3.8s]
[2018-08-29T12:52:57,651][INFO ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][1001] overhead, spent [25.4s] collecting in the last [1m]
[2018-08-29T12:53:10,657][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [000int_prd_esdata2-graylog] fatal error in thread [elasticsearch[000int_prd_esdata2-graylog][bulk][T#2]], exiting
java.lang.OutOfMemoryError: Java heap space
at com.fasterxml.jackson.core.util.TextBuffer.carr(TextBuffer.java:778) ~[jackson-core-2.8.6.jar:2.8.6]
at com.fasterxml.jackson.core.util.TextBuffer.finishCurrentSegment(TextBuffer.java:632) ~[jackson-core-2.8.6.jar:2.8.6]
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2493) ~[jackson-core-2.8.6.jar:2.8.6]
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishAndReturnString(UTF8StreamJsonParser.java:2469) ~[jackson-core-2.8.6.jar:2.8.6]
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:315) ~[jackson-core-2.8.6.jar:2.8.6]
at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.common.xcontent.support.AbstractXContentParser.textOrNull(AbstractXContentParser.java:237) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.KeywordFieldMapper.parseCreateField(KeywordFieldMapper.java:336) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.FieldMapper.parse(FieldMapper.java:287) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:468) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:591) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:396) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:373) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.internalParseDocument(DocumentParser.java:93) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:66) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:277) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:530) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.shard.IndexShard.prepareIndexOnPrimary(IndexShard.java:507) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.bulk.TransportShardBulkAction.prepareIndexOperationOnPrimary(TransportShardBulkAction.java:458) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.bulk.TransportShardBulkAction.executeIndexRequestOnPrimary(TransportShardBulkAction.java:466) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:145) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:114) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:69) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:975) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:944) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:113) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:345) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:270) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:924) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:921) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:151) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationLock(IndexShard.java:1659) ~[elasticsearch-5.6.10.jar:5.6.10]
Additionally, I have taken a heap dump of one of the processes and inspected it in VisualVM. About 90% of the heap usage (out of 6 GB) is byte[] objects, and the vast majority of those seem to be referenced by org.elasticsearch.common.util.BigArrays$ByteArrayWrapper.
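(For anyone who wants to reproduce this, a heap dump like that can be taken with a standard jmap invocation, e.g. the following, where the PID and output path are placeholders:)

jmap -dump:live,format=b,file=/tmp/es-heap.hprof <elasticsearch-pid>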
We are running on AWS, using t2.medium instances for Graylog and m5.large for Elasticsearch (upgraded to r5.xlarge to test the larger heap size). All of the instances are running Amazon Linux 2. Graylog and Elasticsearch are installed with the Graylog Ansible playbook.
This morning, our Graylog installation was completely unresponsive after all Elasticsearch nodes had crashed during the night. Rebooting all of the servers seemed to do the trick.