Elasticsearch Heap requirements skyrocketed

Hello,

Yesterday, we started having quite severe issues with our Graylog installation, more specifically with Elasticsearch. We had been running this installation without a hitch for about a month after upgrading to Graylog 2.4.6 / Elasticsearch 5.6.10.

The symptom we’re seeing is simply that our Elasticsearch servers started throwing OutOfMemory errors after about 5-15 minutes of operation. The first thing we tried was to gradually increase the heap size from 3 GB to eventually 16 GB. This mitigates the issue somewhat, in that it now takes 1-2 hours between crashes, but something is still clearly wrong.
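For anyone curious, the heap bump itself is a small change: on a standard package install the settings live in /etc/elasticsearch/jvm.options (our Ansible-managed setup may template this file differently, so treat the path as an assumption):

# /etc/elasticsearch/jvm.options on an Elasticsearch 5.x package install (path assumed)
# Min and max heap kept identical, as recommended for Elasticsearch heap sizing
-Xms16g
-Xmx16g

followed by a restart of the service (sudo systemctl restart elasticsearch on Amazon Linux 2).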

We have 120 indices of 5-10 GB and 1,000,000-1,500,000 documents each, rotated daily, for a total of about 800 GB of data.
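If anyone wants to sanity-check similar numbers on their own cluster, the _cat API gives a quick per-index overview (host and port here assume the default local HTTP endpoint):

curl -s 'http://localhost:9200/_cat/indices?v&h=index,docs.count,store.size'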

This is an example stack trace:

[2018-08-29T12:51:32,245][INFO ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][988] overhead, spent [271ms] collecting in the last [1s]
[2018-08-29T12:51:35,255][INFO ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][991] overhead, spent [252ms] collecting in the last [1s]
[2018-08-29T12:51:40,066][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][994] overhead, spent [2.4s] collecting in the last [2.7s]
[2018-08-29T12:51:42,326][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][995] overhead, spent [2.1s] collecting in the last [2.2s]
[2018-08-29T12:51:44,902][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][996] overhead, spent [2.4s] collecting in the last [2.5s]
[2018-08-29T12:51:46,717][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][997] overhead, spent [1.7s] collecting in the last [1.8s]
[2018-08-29T12:51:48,947][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][998] overhead, spent [2.1s] collecting in the last [2.2s]
[2018-08-29T12:51:50,710][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][999] overhead, spent [1.7s] collecting in the last [1.7s]
[2018-08-29T12:51:54,499][WARN ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][1000] overhead, spent [3.7s] collecting in the last [3.8s]
[2018-08-29T12:52:57,651][INFO ][o.e.m.j.JvmGcMonitorService] [000int_prd_esdata2-graylog] [gc][1001] overhead, spent [25.4s] collecting in the last [1m]
[2018-08-29T12:53:10,657][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [000int_prd_esdata2-graylog] fatal error in thread [elasticsearch[000int_prd_esdata2-graylog][bulk][T#2]], exiting
java.lang.OutOfMemoryError: Java heap space
at com.fasterxml.jackson.core.util.TextBuffer.carr(TextBuffer.java:778) ~[jackson-core-2.8.6.jar:2.8.6]
at com.fasterxml.jackson.core.util.TextBuffer.finishCurrentSegment(TextBuffer.java:632) ~[jackson-core-2.8.6.jar:2.8.6]
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2493) ~[jackson-core-2.8.6.jar:2.8.6]
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishAndReturnString(UTF8StreamJsonParser.java:2469) ~[jackson-core-2.8.6.jar:2.8.6]
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:315) ~[jackson-core-2.8.6.jar:2.8.6]
at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.common.xcontent.support.AbstractXContentParser.textOrNull(AbstractXContentParser.java:237) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.KeywordFieldMapper.parseCreateField(KeywordFieldMapper.java:336) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.FieldMapper.parse(FieldMapper.java:287) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:468) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:591) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:396) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:373) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.internalParseDocument(DocumentParser.java:93) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:66) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:277) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:530) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.shard.IndexShard.prepareIndexOnPrimary(IndexShard.java:507) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.bulk.TransportShardBulkAction.prepareIndexOperationOnPrimary(TransportShardBulkAction.java:458) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.bulk.TransportShardBulkAction.executeIndexRequestOnPrimary(TransportShardBulkAction.java:466) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:145) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:114) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:69) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:975) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:944) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:113) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:345) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:270) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:924) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:921) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:151) ~[elasticsearch-5.6.10.jar:5.6.10]
at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationLock(IndexShard.java:1659) ~[elasticsearch-5.6.10.jar:5.6.10]

Additionally, I have taken a heap dump of one of the processes and looked at it in VisualVM. About 90% of the heap usage (at 6 GB) is byte[] objects, the vast majority of which seem to be referenced by org.elasticsearch.common.util.BigArrays$ByteArrayWrapper.
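For completeness, a dump like this can be taken with jmap against the running Elasticsearch process and then opened in VisualVM (the PID lookup and output path are just examples; you may need to run it as the elasticsearch user):

# dump only live objects to a file VisualVM can open
sudo -u elasticsearch jmap -dump:live,format=b,file=/tmp/es-heap.hprof $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)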

We are running on AWS, using t2.medium instances for Graylog and m5.large for Elasticsearch (since upgraded to r5.xlarge to test a larger heap). All of the instances are running Amazon Linux 2. Graylog and Elasticsearch are installed with the Graylog Ansible playbook.

This morning, our Graylog installation was completely unresponsive after all the Elasticsearch nodes crashed during the night. Rebooting all of the servers seemed to do the trick.

You did not mention how many servers you have for Elasticsearch … In addition, you should only assign 50% of the server's RAM to Elasticsearch.

So you would need m5.2xlarge instances to assign 16 GB of RAM to Elasticsearch for optimal performance. Depending on the data and usage, this might be the best way to get a stable system back.
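A quick way to compare the configured heap against the total RAM on each node is the _cat API, for example (the host is only an example, point it at one of your Elasticsearch nodes):

curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.current,heap.max,ram.max'

heap.max should be roughly half of ram.max on each node.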

Ah, yes. We have 3x nodes for Graylog and 3x Elasticsearch data nodes. As I mentioned, we increased the data nodes to r5.xlarge, which have 32 GB of RAM each. This is more stable, but it still crashes, although less frequently.

Something is clearly not right, but I have not been able to figure out what yet. It worked fine for weeks, and suddenly an increase from 3 GB to 16 GB of heap is not enough?

Depending on the data and usage, this can happen. Did you enable a new message source? Are your users now actively using Graylog? Did your messages change?

One of the above might give you an idea of what happened. In addition, that amount of data needs a certain amount of resources to be made available. If you hit the tipping point, the only cure is more resources.

After some more experimentation, I have found some widget configurations that can trigger the Elasticsearch OutOfMemory error: specifically, high granularity over large time ranges.
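To give an idea of what that means in practice, here is a rough sketch of the kind of search such a widget could generate; the index pattern, field name, time range and interval are only illustrative, not taken from our actual dashboards:

curl -s -H 'Content-Type: application/json' 'http://localhost:9200/graylog_*/_search?size=0' -d '{
  "query": { "range": { "timestamp": { "gte": "now-90d" } } },
  "aggs": {
    "per_minute": { "date_histogram": { "field": "timestamp", "interval": "1m" } }
  }
}'

A one-minute histogram over 90 days works out to roughly 130,000 buckets in a single request, which lines up with the kind of memory pressure we are seeing.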

I think that someone may have accidentally added a widget like that in their session right about when the issues started, and won't admit it now.
