Hi everyone. I have recently been getting some notifications from our Graylog Open 5.2.7 running on Ubuntu 20.04 with MongoDB 5.0.26 and two Opensearch 2.14 nodes as follows:
Aggregation search failed (triggered a few seconds ago)
I receive probably 40-50 of these a day. They don't seem to be preventing our event definitions from triggering (at least not all of the time).
I'm not sure what change brought this about. It could have been an upgrade of Opensearch from 2.13 to 2.14. I also noticed that one of our Opensearch nodes was down from the weekend, but it seemed to start back up and recover fine. This doesn't seem like the kind of problem that would result from a missing Opensearch node, and we've had that happen before and it hasn't caused any long-lasting problems. This has been going on for about a week now.
Has anyone seen this error before or have any ideas what might be going on?
I am also receiving the same error. I think it's related to OpenSearch 2.14, as there are a lot of bugs, including a Performance Analyzer bug, showing up in the Graylog and OpenSearch logs.
I'm getting the alert for various event definitions, perhaps all of them. I'm attaching the screenshot of one particular definition, but reviewing the errors, it's quite possibly every definition I have configured.
I will add - it's possible that the definitions that are throwing this error are those in which I have checked the "Aggregation of results reaches a threshold" option in the event definition.
We also started receiving this error after we upgraded to Graylog 6.0.0. We are now on 6.0.2 and see this error several times per day. I didn't see this error in 5.2.x, and I haven't changed some of the event definitions that are failing now.
Hi @cocorossello - that's an interesting data point. Did you also upgrade Opensearch? So far I believe all instances of this issue have been related to an upgrade to OS 2.14.
This will be harder; a straight downgrade is not possible:
[2024-05-29T08:51:24,482][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [6d113a1fa4ec] uncaught exception in thread [main]
org.opensearch.bootstrap.StartupException: java.lang.IllegalStateException: cannot downgrade a node from version [2.14.0] to version [2.13.0]
at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:185) ~[opensearch-2.13.0.jar:2.13.0]
at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:172) ~[opensearch-2.13.0.jar:2.13.0]
at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-2.13.0.jar:2.13.0]
at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-2.13.0.jar:2.13.0]
at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-2.13.0.jar:2.13.0]
at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:138) ~[opensearch-2.13.0.jar:2.13.0]
at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:104) ~[opensearch-2.13.0.jar:2.13.0]
Patrickmann, do you suspect this is an issue with Graylog that may be resolved in a future update, or a problem with Opensearch? I may just wait for a fix from either side, as opposed to attempting a data restore into 2.13.
Thx for reporting. We are actively working on it.
You can help us by letting us know more about the context this occurs in: how many nodes, how heavily utilized, rate of event executions... So far we can't reproduce it reliably.
I can also make some observations here by looking at my errors and comparing them to my total list of alerts:
The alert does not have to trigger for the error to occur. I have some errors for alerts that have never triggered, or last triggered 5 months ago.
The error has not occurred for any of the event definitions that are in the Disabled state.
I originally thought that perhaps the "Aggregation of results reaches a threshold" option needed to be enabled on an event definition for the error to occur, but that does not seem to be the case. I have seen it occur for event definitions that do not have that option enabled (set to "Filter has results").
I haven't yet found a "smoking gun" for why this is occurring on some of my event definitions but not others, but if I notice anything I'll let you know.
Thank you Jan, it appears that due to you reporting it to the opensearch team, there is a fix implemented for 2.15:
We are using a HashMap (not thread safe) for the inner map of the cleanupKeyToCountMap and hence it throws a Concurrent Modification Exception when the map is getting updated by multiple threads concurrently.
The fix is to use a thread safe Concurrent Map instead.
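For anyone curious what that failure mode looks like, here is a minimal Java sketch. It is not the OpenSearch code: the class name, method name and map keys are made up for illustration, and only the idea of a nested map with a `HashMap` as the inner map is taken from the quoted description. To keep it deterministic, the "other thread" is simulated by modifying the inner map in the middle of an iteration on the same thread, which trips the same fail-fast check that a concurrent update can trip.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical demo class; names are illustrative, not taken from OpenSearch.
class InnerMapDemo {

    // Returns true if modifying the inner map during iteration
    // raises a ConcurrentModificationException.
    static boolean failsOnModificationDuringIteration(
            Supplier<Map<String, Integer>> innerFactory) {
        // Outer map is thread safe; the inner map type is what we vary.
        Map<String, Map<String, Integer>> outer = new ConcurrentHashMap<>();
        Map<String, Integer> inner =
                outer.computeIfAbsent("shard-0", k -> innerFactory.get());
        inner.put("a", 1);
        inner.put("b", 2);
        inner.put("c", 3);
        try {
            for (String key : inner.keySet()) {
                // Stand-in for a second thread adding an entry mid-iteration.
                inner.put("added-during-iteration", 1);
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("HashMap inner map fails: "
                + failsOnModificationDuringIteration(HashMap::new));
        System.out.println("ConcurrentHashMap inner map fails: "
                + failsOnModificationDuringIteration(ConcurrentHashMap::new));
    }
}
```

With a `HashMap` inner map the iterator notices the structural change and throws. The iterators of `ConcurrentHashMap` are weakly consistent and do not throw, which matches the direction of the fix described above. In the real multi-threaded case the exception only shows up when the timing lines up, which would be consistent with the error being intermittent and hard to reproduce.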