Elasticsearch Heap requirements skyrocketed

Hello,

Yesterday, we started having quite severe issues with our Graylog installation, more specifically with Elasticsearch. We had been running this installation without a hitch for about a month after upgrading to Graylog 2.4.6 / Elasticsearch 5.6.10.

The symptom we’re seeing is simply that our Elasticsearch servers started throwing OutOfMemory errors after about 5-15 minutes of operation. The first thing we tried was to gradually increase the heap size from 3 GB to eventually 16 GB. This mitigates the issue somewhat, in that it now takes 1-2 hours between crashes, but something is still clearly wrong.
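For anyone curious, the heap bump itself is a small change: on a standard package install the settings live in /etc/elasticsearch/jvm.options (our Ansible-managed setup may template this file differently, so treat the path as an assumption):

# /etc/elasticsearch/jvm.options on an Elasticsearch 5.x package install (path assumed)
# Min and max heap kept identical, as recommended for Elasticsearch heap sizing
-Xms16g
-Xmx16g

followed by a restart of the service (sudo systemctl restart elasticsearch on Amazon Linux 2).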

We have 120 indices of 5-10 GB and 1,000,000-1,500,000 documents each, rotated daily, for a total of about 800 GB of data.
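If anyone wants to sanity-check similar numbers on their own cluster, the _cat API gives a quick per-index overview (host and port here assume the default local HTTP endpoint):

curl -s 'http://localhost:9200/_cat/indices?v&h=index,docs.count,store.size'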

This is an example stack trace:

[2018-08-29T12:51:32,245][INFO ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][988] overhead, spent [271ms] collecting in the last [1s]
[2018-08-29T12:51:35,255][INFO ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][991] overhead, spent [252ms] collecting in the last [1s]
[2018-08-29T12:51:40,066][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][994] overhead, spent [2.4s] collecting in the last [2.7s]
[2018-08-29T12:51:42,326][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][995] overhead, spent [2.1s] collecting in the last [2.2s]
[2018-08-29T12:51:44,902][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][996] overhead, spent [2.4s] collecting in the last [2.5s]
[2018-08-29T12:51:46,717][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][997] overhead, spent [1.7s] collecting in the last [1.8s]
[2018-08-29T12:51:48,947][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][998] overhead, spent [2.1s] collecting in the last [2.2s]
[2018-08-29T12:51:50,710][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][999] overhead, spent [1.7s] collecting in the last [1.7s]
[2018-08-29T12:51:54,499][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][1000] overhead, spent [3.7s] collecting in the last [3.8s]
[2018-08-29T12:52:57,651][INFO ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][1001] overhead, spent [25.4s] collecting in the last [1m]
[2018-08-29T12:53:10,657][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [000int_prd_esdata2-graylog] fatal error in thread [elasticsearch[000int_prd_esdata2-graylog][bulk][T#2]], exiting
java.lang.OutOfMemoryError: Java heap space
at com.fasterxml.jackson.core.util.TextBuffer.carr(TextBuffer.java:778) ~[jackson-core-2.8.6.jar:2.8.6]
at com.fasterxml.jackson.core.util.TextBuffer.finishCurrentSegment(TextBuffer.java:632) ~[jackson-core-2.8.6.jar:2.8.6]
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2493) ~[jackson-core-2.8.6.jar:2.8.6]
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishAndReturnString(UTF8StreamJsonParser.java:2469) ~[jackson-core-2.8.6.jar:2.8.6]
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:315) ~[jackson-core-2.8.6.jar:2.8.6]
at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.common.xcontent.support.AbstractXContentParser.textOrNull(AbstractXContentParser.java:237) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.KeywordFieldMapper.parseCreateField(KeywordFieldMapper.java:336) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.FieldMapper.parse(FieldMapper.java:287) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:468) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:591) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:396) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:373) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.internalParseDocument(DocumentParser.java:93) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:66) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:277) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:530) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.shard.IndexShard.prepareIndexOnPrimary(IndexShard.java:507) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.bulk.TransportShardBulkAction.prepareIndexOperationOnPrimary(TransportShardBulkAction.java:458) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.bulk.TransportShardBulkAction.executeIndexRequestOnPrimary(TransportShardBulkAction.java:466) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:145) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:114) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:69) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:975) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:944) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:113) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:345) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:270) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:924) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:921) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:151) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationLock(IndexShard.java:1659) ~[elasticsearch-5.6.10.jar:5.6.10]

Additionally, I have taken a heap dump of one of the processes and looked at it in VisualVM. About 90% of the heap usage (at 6 GB) is byte[] objects, the vast majority of which seem to be referenced by org.elasticsearch.common.util.BigArrays$ByteArrayWrapper.
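For completeness, a dump like this can be taken with jmap against the running Elasticsearch process and then opened in VisualVM (the PID lookup and output path are just examples; you may need to run it as the elasticsearch user):

# dump only live objects to a file VisualVM can open
sudo -u elasticsearch jmap -dump:live,format=b,file=/tmp/es-heap.hprof $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)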

We are running on AWS, using t2.medium instances for Graylog and m5.large for Elasticsearch (since upgraded to r5.xlarge to test a larger heap). All of the instances are running Amazon Linux 2. Graylog and Elasticsearch are installed with the Graylog Ansible playbook.

This morning, our Graylog installation was completely unresponsive after all the Elasticsearch nodes crashed during the night. Rebooting all of the servers seemed to do the trick.

You did not mention how many servers you have for Elasticsearch … In addition, you should only assign 50% of the server's RAM to Elasticsearch.

So you would need m5.2xlarge instances to assign 16 GB of RAM to Elasticsearch for optimal performance. Depending on the data and usage, this might be the best way to get a stable system back.
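A quick way to compare the configured heap against the total RAM on each node is the _cat API, for example (the host is only an example, point it at one of your Elasticsearch nodes):

curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.current,heap.max,ram.max'

heap.max should be roughly half of ram.max on each node.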

Ah, yes. We have 3x nodes for Graylog and 3x Elasticsearch data nodes. As I mentioned, we increased the data nodes to r5.xlarge, which have 32 GB of RAM each. This is more stable, but it still crashes, although less frequently.

Something is clearly not right, but I have not been able to figure out what yet. It worked fine for weeks, and suddenly an increase from 3 GB to 16 GB of heap is not enough?

Depending on the data and usage, this can happen. Did you enable a new message source? Are your users now actively using Graylog? Did your messages change?

One of the above might give you an idea of what happened. In addition, that amount of data needs a certain amount of resources to be made available. If you hit the tipping point, the only cure is more resources.

After some more experimentation, I have found some widget configurations that can trigger the Elasticsearch OutOfMemory error: specifically, high granularity over large time ranges.
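To give an idea of what that means in practice, here is a rough sketch of the kind of search such a widget could generate; the index pattern, field name, time range and interval are only illustrative, not taken from our actual dashboards:

curl -s -H 'Content-Type: application/json' 'http://localhost:9200/graylog_*/_search?size=0' -d '{
  "query": { "range": { "timestamp": { "gte": "now-90d" } } },
  "aggs": {
    "per_minute": { "date_histogram": { "field": "timestamp", "interval": "1m" } }
  }
}'

A one-minute histogram over 90 days works out to roughly 130,000 buckets in a single request, which lines up with the kind of memory pressure we are seeing.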

I think that someone may have accidentally added a widget like that in their session right about when the issues started, and won't admit it now.
