Graylog failing to index data

Hello everyone. I have the following setup:
- A Kafka queue containing the messages I need to process
- A Graylog cluster with one master and 3 collector nodes that pull the data from Kafka
- Over 300 million messages processed per day (about 500 GB per day)
- 20 Elasticsearch data nodes, each with 4 CPUs and 32 GB RAM (16 GB heap)

The data is collected as expected during most of the day, but around 12am-2am EST we start getting the errors below. My question is: does Graylog retry sending the messages that failed to be indexed?

Hi @lacarrillo

Normally yes; that's why Graylog has a journal.

When indexing fails, you'll see journal utilization increase on the nodes of your Graylog cluster.
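If you want to watch that happen, a rough sketch along these lines (assuming a recent Graylog version that exposes the /system/journal API endpoint, an API listener on port 9000, and placeholder host/credentials) would show the uncommitted message count climbing while Elasticsearch is rejecting writes:

    # Sketch: poll a Graylog node's journal status while indexing failures occur.
    # Assumptions: Graylog REST API at http://graylog-node:9000/api (placeholder
    # host), basic auth with placeholder credentials, and the GET /system/journal
    # endpoint; exact field names can vary slightly between Graylog versions.
    import requests

    GRAYLOG_API = "http://graylog-node:9000/api"
    AUTH = ("admin", "CHANGE_ME")

    resp = requests.get(f"{GRAYLOG_API}/system/journal", auth=AUTH,
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    journal = resp.json()

    # If Elasticsearch keeps rejecting bulk requests, these numbers grow until
    # ES catches up (or the journal hits its configured size/age limits).
    print("uncommitted entries:", journal.get("uncommitted_journal_entries"))
    print("journal size (bytes):", journal.get("journal_size"))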

About those errors: it looks like you need to fine-tune your ES cluster configuration to handle this volume of messages.

I think this Doc could help you with that:

Hello @reimlima thank you for replying and for the link. I'm using Elasticsearch 6.8, so it looks like I will need to upgrade to version 7.x in order to get a queue size of 1000; Elasticsearch 6.8 has a queue size of 200, and even though I can change it, it is not recommended. The good thing is that at least I'm not losing those messages. Thank you.
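For anyone landing here with the same question, a quick way to confirm what each node is actually configured with is the nodes info API; a rough sketch (assuming an unauthenticated Elasticsearch HTTP endpoint on port 9200 and a placeholder host name) could look like this:

    # Sketch: print each node's configured write thread pool size and queue_size.
    # Assumption: Elasticsearch reachable at http://es-node:9200 (placeholder)
    # without auth; add credentials/TLS as your cluster requires.
    import requests

    resp = requests.get("http://es-node:9200/_nodes/thread_pool")
    resp.raise_for_status()

    for node in resp.json()["nodes"].values():
        write_pool = node["thread_pool"].get("write", {})
        print(node["name"],
              "pool size:", write_pool.get("size"),
              "queue_size:", write_pool.get("queue_size"))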


@lacarrillo
Hello,

I have to agree with @reimlima

If you see bulk rejections, try to understand why they are taking place and whether a single node or the whole cluster is affected. If the cluster is unable to cope with the load, ensure that all nodes are sharing the load evenly. If that does not help, it may be necessary to scale the cluster out or up. If I see and understand this right, you have 20 ES nodes? If so, then scaling probably won't help, and upgrading to ES 7 is only likely to postpone the problem.
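One way to tell whether rejections are cluster-wide or concentrated on a few nodes is to watch the write thread pool counters per node; a minimal sketch, with the same placeholder endpoint and no auth assumed:

    # Sketch: active/queued/rejected counts for the write thread pool, per node.
    # If 'rejected' climbs on every node, the whole cluster is saturated; if it
    # only climbs on a few, the load (or shard allocation) is uneven.
    # Assumption: Elasticsearch at http://es-node:9200 (placeholder), no auth.
    import requests

    resp = requests.get(
        "http://es-node:9200/_cat/thread_pool/write",
        params={"v": "true", "h": "node_name,name,active,queue,rejected,completed"},
    )
    resp.raise_for_status()
    print(resp.text)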


Hello @gsmith thank you for replying. The error I'm getting is that the nodes are reaching the queue limit of 200; if I upgrade to 7.x the limit goes up to 1000. I understand that if the cluster is not configured correctly this upgrade will only delay my issues, but here is the thing: the cluster is well balanced, all nodes are getting the same amount of data, I have a single shard per node and 20 shards per index, and the errors come from the entire cluster, not from a single failing node. That is why I thought about the upgrade; everything else seems to be ok.

@lacarrillo
I see now, and yeah maybe the upgrade will do it.

I figured I would throw my 2¢ in here. The number of queued tasks is one thing (namely, what's waiting to be processed), but your pool size is another, and that corresponds directly to the number of CPUs on a box: you have 4 CPUs, so that's your pool size. Increasing the number of queued tasks is categorically NOT going to help you and will, in fact, lead to a larger backlog of tasks.

You will have to make some changes to stop the backlog, whether by tuning things in Graylog or by adding more resources to those Elasticsearch servers.

There are a number of things that could be impacting you here. What I suspect is that, given your time frame, that's when a lot of Elasticsearch operations are taking place. Keep in mind that any index, query, merge, etc. operation is going to require resources. So if you're doing a lot of index rotations at P1D, then those are happening at midnight UTC, and if you have a lot of index sets, this could be impacting things. It's also worth noting that if you've configured your indices to be large (e.g., set to 1 week or 1 month's worth of data), then the index operations will take a LOT longer (usually on the order of hours to days).

But of course, this is all speculation on my part, since we have no idea how your index rotation is configured, nor do we have visibility into other parts of your Graylog configuration. I would tend to suspect that you've got some wonky index configurations that are leading to what you're seeing, rather than it being a resource issue. But if it is indeed a resource issue, you might just end up having to add more resources. A good place to start investigating would be to see how many shards you have in your cluster (How many shards should I have in my Elasticsearch cluster? | Elastic Blog) and compare that to the 20:1 (shards : GB of heap) ratio that Elasticsearch recommends. If you're over that ratio and you already have low index retention, then you're going to have to throw more resources at this to compensate for the spikes. But if you have a high retention setting in multiple index sets, get those down; that should allow index operations to complete in a shorter amount of time, which should reduce those messages.
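As a rough illustration of that check (a sketch only, using the node count and heap size quoted earlier in this thread and assuming an unauthenticated Elasticsearch endpoint at a placeholder host):

    # Sketch: compare the cluster's active shard count against the ~20 shards
    # per GB of heap guideline from the Elastic blog post referenced above.
    # Assumptions: http://es-node:9200 is a placeholder, no auth; 20 data nodes
    # with 16 GB heap each, as described earlier in the thread.
    import requests

    health = requests.get("http://es-node:9200/_cluster/health").json()
    active_shards = health["active_shards"]

    data_nodes, heap_gb_per_node = 20, 16
    guideline_ceiling = 20 * data_nodes * heap_gb_per_node  # ~20 shards per GB heap

    print(f"active shards: {active_shards}, guideline ceiling: {guideline_ceiling}")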


Tacking on to @aaronsachs reply: not long after we first deployed Graylog we encountered a similar issue, and tuning was key to resolving it. Make sure that you're balancing index settings, especially within the recommendations he's providing. In our initial configuration we had some huge indices because the rotation interval was set to P30D, but we had configured it that way to keep the number of shards down. Our retention target was only 3 months, so it was pretty easy. Once we observed maintenance tasks taking days, we did the math based on the same guidelines @aaronsachs is providing. We reduced the interval to P10D and increased the number of retained indices, which corrected the issue while still meeting our 3-month target and staying within the Elasticsearch guidelines. Our problems went away.
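To make the arithmetic behind that change explicit (an illustrative sketch with round numbers, not our exact figures):

    # Illustrative sketch of the rotation/retention trade-off described above.
    # Same ~90-day retention target either way; a shorter rotation interval means
    # more, smaller indices, so per-index maintenance after rotation finishes faster.
    retention_days = 90

    for rotation_days in (30, 10):
        indices_retained = retention_days // rotation_days
        print(f"P{rotation_days}D rotation -> keep {indices_retained} indices, "
              f"each covering {rotation_days} days of data")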

Definitely spend some time reviewing tuning and sizing.

Hello @aaronsachs and @ttsandrew thank you for replying. Currently I have one index set in Graylog, plus one index set for data that is not going through Graylog but is also indexed into the Elasticsearch cluster. The index set in Graylog has 25 indices and the rotation is P1D; each index has about 325 GB by the time the rotation starts. The rotation is done at 8 PM EST, and the issues I'm seeing happen around 2-3 AM EST, which is why I discarded the possibility of the issue being caused by the index rotation. The other index set, which is not managed by Graylog, also rotates daily at 8 PM with almost the same amount of data. I'm keeping 25 indices, which gives me almost one month of retention for the data. I have 25 data nodes in the cluster and 25 shards per index, and I have index optimization after rotation enabled with the max number of segments set to 1. The number of shards in the Graylog setup is 660, with a total of 7.3 TiB of data. I'm not sure what else I could do to tune Graylog further; any ideas?
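For reference, plugging those numbers into the shards-per-GB-of-heap guideline mentioned earlier (just a quick sanity check):

    # Quick check of the shard-to-heap guideline with the numbers above.
    shards = 660
    total_heap_gb = 25 * 16                  # 25 data nodes x 16 GB heap each
    print(round(shards / total_heap_gb, 2))  # ~1.65 shards per GB, far below ~20/GB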


Hey @lacarrillo, thanks for the additional information. Although, can you please clarify your architecture more precisely? In the first post you mention 20 Elasticsearch nodes and now you mention 25. Can you give us the exact architecture?

  • Graylog server
    1 index set
    P1D
    25-day retention
    do they delete after the retention limit is reached?
  • 25-node Elasticsearch cluster
    4 CPU each
    32 GB RAM
    16 GB heap
  • Reformat this and provide as much information as you can

How many active shards can you see Graylog reporting in the web console?

325 GB is significant even when broken up into 13 GB (325/25) chunks distributed across 25 nodes. Each ES node still has to service active requests (which are significant) while performing maintenance tasks after index rotation. What does the load look like on the nodes? Any evidence of resource exhaustion? CPU pegging? OOM errors?
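If it helps, something along these lines (an unauthenticated endpoint and a placeholder host assumed again) would show CPU, load, and heap pressure per node around the rotation window:

    # Sketch: per-node CPU, load average and heap usage, to spot resource exhaustion.
    # Assumption: Elasticsearch 6.x at http://es-node:9200 (placeholder), no auth.
    import requests

    stats = requests.get("http://es-node:9200/_nodes/stats/os,jvm").json()

    for node in stats["nodes"].values():
        cpu = node["os"]["cpu"]
        print(node["name"],
              "cpu%:", cpu.get("percent"),
              "load:", cpu.get("load_average"),
              "heap_used%:", node["jvm"]["mem"]["heap_used_percent"])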

Hello @ttsandrew I added 5 extra nodes to the original setup, so I now have:

  • 1 index set
  • P1D
  • 25-day retention (indices are deleted once the 25 days are reached)
  • 25-node ES cluster
    4 CPU each
    32 GB RAM
    16 GB heap
    660 active shards

I just checked some of the dashboards I have for the cluster, and I see a small spike at 8 PM EST. The average CPU is usually around 14%, but during that period it went up to 24% on most of the nodes; that is when the rotation runs. I also noticed that at that time 70-80% of the incoming data goes to 1-5 nodes, and the other 20-30% is distributed to the rest of the nodes. But Graylog does not give me any errors at that time; the errors happen about 3-4 hours later. Please see the dashboards:

The first dashboard shows CPU utilization at the time of rotation and at the time the errors start.
The second dashboard shows the spike in bytes in at rotation time, with each line representing a node; as you can see, only about 4 nodes see an increase in bytes in while the rest stay normal. In the second screenshot, the node I'm pointing at is indexing about 2.2 million bytes while the rest were indexing about 35k.

Hey @lacarrillo, thanks for the detailed explanation! What do the Elasticsearch logs say when this issue is occurring? I don't see any obvious loading issues with some napkin math, but it's interesting that the work doesn't appear to be distributed evenly. Is it always unevenly distributed in the same way? I'd start by looking at the Elasticsearch logs on the heavily loaded nodes at that time.
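It might also be worth checking whether the heavily loaded nodes happen to hold a disproportionate share of the active shards; a rough sketch (placeholder endpoint, no auth, and the graylog_* index pattern is a guess that may need adjusting):

    # Sketch: list which node hosts each shard of the current Graylog indices,
    # since a bulk write lands on whichever nodes hold the target shards.
    # Assumptions: Elasticsearch at http://es-node:9200 (placeholder), no auth,
    # and "graylog_*" as the index name pattern (adjust to your index prefix).
    import requests

    resp = requests.get(
        "http://es-node:9200/_cat/shards/graylog_*",
        params={"v": "true", "h": "index,shard,prirep,docs,store,node"},
    )
    resp.raise_for_status()
    print(resp.text)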

Hey @ttsandrew this is what I see in the logs:

org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.IngestService$4@7bd747ed on EsThreadPoolExecutor[name = ebiFAwY/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@2110d362[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 182347897]]
	at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-6.8.15.jar:6.8.15]
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) ~[?:1.8.0_282]
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) ~[?:1.8.0_282]
	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:98) ~[elasticsearch-6.8.15.jar:6.8.15]
	at org.elasticsearch.ingest.IngestService.executeBulkRequest(IngestService.java:381) ~[elasticsearch-6.8.15.jar:6.8.15]
	at org.elasticsearch.action.bulk.TransportBulkAction.processBulkIndexIngestRequest(TransportBulkAction.java:588) ~[elasticsearch-6.8.15.jar:6.8.15]
	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:217) ~[elasticsearch-6.8.15.jar:6.8.15]
	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:92) ~[elasticsearch-6.8.15.jar:6.8.15]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167) ~[elasticsearch-6.8.15.jar:6.8.15]
	at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.apply(SecurityActionFilter.java:121) ~[?:?]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165) ~[elasticsearch-6.8.15.jar:6.8.15]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139) ~[elasticsearch-6.8.15.jar:6.8.15]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:81) ~[elasticsearch-6.8.15.jar:6.8.15]
	at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:87) ~[elasticsearch-6.8.15.jar:6.8.15]
	at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:76) ~[elasticsearch-6.8.15.jar:6.8.15]
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:403) ~[elasticsearch-6.8.15.jar:6.8.15]
	at org.elasticsearch.client.support.AbstractClient.bulk(AbstractClient.java:482) ~[elasticsearch-6.8.15.jar:6.8.15]
	at org.elasticsearch.xpack.core.ClientHelper.executeAsyncWithOrigin(ClientHelper.java:73) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.doFlush(LocalBulk.java:105) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk.flush(ExportBulk.java:63) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$1(ExportBulk.java:107) ~[?:?]
	at org.elasticsearch.xpack.core.common.IteratingActionListener.run(IteratingActionListener.java:102) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.doFlush(ExportBulk.java:123) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk.flush(ExportBulk.java:63) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.Exporters.doExport(Exporters.java:236) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.Exporters.lambda$export$3(Exporters.java:211) ~[?:?]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) [elasticsearch-6.8.15.jar:6.8.15]
	at org.elasticsearch.xpack.monitoring.exporter.Exporters$AccumulatingExportBulkActionListener.delegateIfComplete(Exporters.java:310) [x-pack-monitoring-6.8.15.jar:6.8.15]
	at org.elasticsearch.xpack.monitoring.exporter.Exporters$AccumulatingExportBulkActionListener.onResponse(Exporters.java:289) [x-pack-monitoring-6.8.15.jar:6.8.15]
	at org.elasticsearch.xpack.monitoring.exporter.Exporters$AccumulatingExportBulkActionListener.onResponse(Exporters.java:260) [x-pack-monitoring-6.8.15.jar:6.8.15]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalExporter.openBulk(LocalExporter.java:148) [x-pack-monitoring-6.8.15.jar:6.8.15]
	at org.elasticsearch.xpack.monitoring.exporter.Exporters.wrapExportBulk(Exporters.java:197) [x-pack-monitoring-6.8.15.jar:6.8.15]
	at org.elasticsearch.xpack.monitoring.exporter.Exporters.export(Exporters.java:209) [x-pack-monitoring-6.8.15.jar:6.8.15]
	at org.elasticsearch.xpack.monitoring.MonitoringService$MonitoringExecution$1.doRun(MonitoringService.java:282) [x-pack-monitoring-6.8.15.jar:6.8.15]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.8.15.jar:6.8.15]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_282]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_282]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:708) [elasticsearch-6.8.15.jar:6.8.15]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
