Event definitions causing concurrent modification exception

Hi everyone. I have recently been getting notifications like the following from our Graylog Open 5.2.7, running on Ubuntu 20.04 with MongoDB 5.0.26 and two OpenSearch 2.14 nodes:

Aggregation search failed (triggered a few seconds ago)

Event definition XXX (65bd781456b5eb0bc31a0863) failed: Unable to perform search query: OpenSearch exception [type=concurrent_modification_exception, reason=null].

Aggregation search failed (triggered a few seconds ago)

Event definition XXX (657b43346aa6d452a18e1eb3) failed: Unable to perform search query: OpenSearch exception [type=concurrent_modification_exception, reason=null].

Aggregation search failed (triggered 16 hours ago)

Event definition XXX (63ea7fb166c2691bc2c064ab) failed: Unable to perform search query: OpenSearch exception [type=concurrent_modification_exception, reason=null].

I receive probably 40-50 of these a day. They don’t seem to be preventing our event definitions from triggering (at least not all of the time).

I’m not sure what change brought this about. It could have been the upgrade of OpenSearch from 2.13 to 2.14. I also noticed that one of our OpenSearch nodes was down from the weekend, but it seemed to start back up and recover fine. This doesn’t seem like the kind of problem that would result from a missing OpenSearch node; we’ve had that happen before and it hasn’t caused any long-lasting problems. This has been going on for about a week now.

Has anyone seen this error before or have any ideas what might be going on?

Thanks,
Mark

I am also receiving the same error. I think it’s related to OpenSearch 2.14, as there are a lot of bugs, including a performance analyzer bug, showing up in the Graylog and OpenSearch logs.

Can you share the event definition?

Patrickmann,

I’m getting the alert for various event definitions, perhaps all of them. I’m attaching the screenshot of one particular definition, but reviewing the errors, it’s quite possibly every definition I have configured.

I will add: it’s possible that the definitions throwing this error are those in which I have checked the “Aggregation of results reaches a threshold” option in the event definition.

Mark

We also started receiving this error after upgrading to Graylog 6.0.0. We are now on 6.0.2 and see this error several times per day. I didn’t see this error in 5.2.x, and some of the event definitions that are failing now haven’t been changed since then.

Hi @cocorossello - that’s an interesting data point. Did you also upgrade OpenSearch? So far I believe all instances of this issue have been related to an upgrade to OS 2.14.

Yes, we are on OS 2.14.0. Is it safe to downgrade? Maybe I can just try 2.13.0 and see what happens.

There is no real downgrade support in OS. The recommended way is to take a snapshot of the data & restore onto a new cluster of OS 2.13.0.

This will be harder; a straight downgrade is not possible:

[2024-05-29T08:51:24,482][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [6d113a1fa4ec] uncaught exception in thread [main]
org.opensearch.bootstrap.StartupException: java.lang.IllegalStateException: cannot downgrade a node from version [2.14.0] to version [2.13.0]
        at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:185) ~[opensearch-2.13.0.jar:2.13.0]
        at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:172) ~[opensearch-2.13.0.jar:2.13.0]
        at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-2.13.0.jar:2.13.0]
        at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-2.13.0.jar:2.13.0]
        at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-2.13.0.jar:2.13.0]
        at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:138) ~[opensearch-2.13.0.jar:2.13.0]
        at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:104) ~[opensearch-2.13.0.jar:2.13.0]

Patrickmann, do you suspect this is an issue with Graylog that may be resolved in a future update, or a problem with OpenSearch? I may just wait for a fix from either side, as opposed to attempting a data restore into 2.13.

I suspect it is an OpenSearch issue. Hard to say if and when it might be fixed, or a workaround implemented in GL, unless we can narrow it down.

I have the same issue after upgrading to Graylog 6.0.2.

Thx for reporting. We are actively working on it.
You can help us by letting us know more about the context this occurs in: how many nodes, how heavily utilized, and the rate of event executions.
So far we can’t reproduce it reliably.

Piggybacking off of @patrickmann’s reply, I’m curious whether you use the OpenSearch security plugin, and whether you could post a redacted opensearch.yml?

Drew,

In my case, I do not use the security plugin. Here is a redacted opensearch.yml for you:

action.auto_create_index: false
cluster.name: graylog
network.host: x.x.x.x
discovery.seed_hosts: ["x.x.x.x"]
cluster.initial_master_nodes: graylog1,graylog2
node.name: graylog1
path.data: /var/lib/opensearch
path.logs: /var/log/opensearch
plugins.security.disabled: true
search.max_buckets: 131070
indices.query.bool.max_clause_count: 4096
cluster.routing.allocation.disk.watermark.low: 93%
cluster.routing.allocation.disk.watermark.high: 95%
cluster.routing.allocation.disk.watermark.flood_stage: 97%

Thanks,
Mark

I can also make some observations here by looking at my errors and comparing them to my total list of alerts:

  • The alert does not have to trigger for the error to occur. I have some errors for alerts that have never triggered, or last triggered 5 months ago.
  • The error has not occurred for any of the event definitions that are in the Disabled state.
  • I originally thought that perhaps the “Aggregation of results reaches a threshold” option needed to be enabled on an event definition for the error to occur, but that does not seem to be the case. I have seen it occur for event definitions that do not have that option enabled (set to “Filter has results”).

I haven’t yet found a “smoking gun” for why this is occurring on some of my event definitions but not others. If I notice anything, I’ll let you know.

This is helpful. Thank you!

After investigating, IMHO this is a bug in OpenSearch 2.14 (probably already present in 2.13, but maybe not emerging there for other reasons). See “[BUG] sporadic concurrent_modification_exception during query in 2.14”, Issue #14032 in opensearch-project/OpenSearch on GitHub.

Thank you, Jan. It appears that, thanks to your report to the OpenSearch team, a fix has been implemented for 2.15:

We are using a HashMap (not thread safe) for the inner map of the cleanupKeyToCountMap and hence it throws a Concurrent Modification Exception when the map is getting updated by multiple threads concurrently.

The fix is to use a thread safe Concurrent Map instead.
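
To illustrate the quoted explanation, here is a minimal Java sketch of the pattern. It is not the actual OpenSearch code: the class `CleanupKeyCountSketch`, its methods, and the keys `"index-0"` and `"key-N"` are all hypothetical names chosen for illustration. It only shows the general idea that, in a map of maps updated by several threads, the inner map has to be thread safe as well. With a plain `HashMap` as the inner map, concurrent updates can corrupt it or surface as a `ConcurrentModificationException`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not OpenSearch source: a nested counter map
// in which both the outer and the inner map are thread safe.
final class CleanupKeyCountSketch {

    // outer key -> (inner key -> count)
    private final Map<String, Map<String, Integer>> counts = new ConcurrentHashMap<>();

    void increment(String outerKey, String innerKey) {
        // The inner map is a ConcurrentHashMap. Using "new HashMap<>()" here
        // would be the unsafe variant described in the quoted explanation.
        counts.computeIfAbsent(outerKey, k -> new ConcurrentHashMap<>())
              .merge(innerKey, 1, Integer::sum); // atomic per-key update
    }

    int total(String outerKey) {
        Map<String, Integer> inner = counts.get(outerKey);
        if (inner == null) {
            return 0;
        }
        int sum = 0;
        // Iterating a ConcurrentHashMap is weakly consistent and does not
        // throw ConcurrentModificationException.
        for (int value : inner.values()) {
            sum += value;
        }
        return sum;
    }

    // Runs several writer threads against one shared instance and
    // returns the total count once all of them have finished.
    static int countConcurrently(int threads, int incrementsPerThread) {
        CleanupKeyCountSketch sketch = new CleanupKeyCountSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int t = 0; t < threads; t++) {
                pool.submit(() -> {
                    for (int i = 0; i < incrementsPerThread; i++) {
                        sketch.increment("index-0", "key-" + (i % 16));
                    }
                });
            }
            pool.shutdown();
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return sketch.total("index-0");
    }

    public static void main(String[] args) {
        System.out.println(countConcurrently(4, 1000));
    }
}
```

Because every update goes through `computeIfAbsent` and `merge` on concurrent maps, no increments are lost, and the total equals the number of threads times the increments per thread.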
