Graylog 3.3.5 - Some messages in stream do not generate event and notification

We are running:
Graylog 3.3.5+6436f1b in a single-node environment.
The problem we are facing is that sometimes messages in our stream do not generate an alert, and therefore no notification, when they should.
Sometimes this happens after a few days of everything working correctly.

Our setup:
We have an alert:

    Type: Aggregation
    Search Query: *
    Streams: [Errors](/streams/5fd39b11cc99b90cb001d9b9/search)
    Search within: 1 minute
    Execute search every: 1 minute
    Enable scheduling: yes
    Group by Field(s): No Group by configured
    Create Events if: count() > 0

Now I am executing code which successfully generates new messages in our "ERROR" stream, but no corresponding Event is generated for those messages. The worst part is that sometimes this works and sometimes it does not. If I leave a job putting a message into this stream every 30 minutes through the night, then in the morning I have around 5 new mails in my inbox instead of at least 10+ (2 messages per hour).

Output of: echo 'db.processing_status.find()' | mongo graylog:
{ "_id" : ObjectId("603cebdb19c921775b573519"), "node_id" : "ec3cbb8d-0b71-4b8e-bc7d-53057bfa7406", "node_lifecycle_status" : "RUNNING", "updated_at" : ISODate("2021-03-01T15:18:39.988Z"), "receive_times" : { "ingest" : ISODate("2021-03-01T15:18:38.879Z"), "post_processing" : ISODate("2021-03-01T15:18:38.879Z"), "post_indexing" : ISODate("2021-03-01T15:18:38.879Z") }, "input_journal" : { "uncommitted_entries" : NumberLong(0), "read_messages_1m_rate" : 6.951131632689634, "written_messages_1m_rate" : 6.951131632689634, "journal_enabled" : true } }

After enabling "DEBUG" log level via: PUT " /"

I can see such messages popping up all the time:
2021-03-01 15:20:34,825 DEBUG: - Executing event processor <At least one ERROR message/5fd39b11cc99b90cb001d9c3/aggregation-v1>
2021-03-01 15:20:34,832 DEBUG: - Event processor <At least one ERROR message/5fd39b11cc99b90cb001d9c3> couldn't be executed because of a failed precondition (retry in 5000 ms)

Additionally in metrics for my node I can see such entries:

Is there anyone who could help me understand what is going on?

New debug info:
Our stream has ID: 5fd39b11cc99b90cb001d9b9
Event Definition ID: 5fd39b11cc99b90cb001d9c3
A new message in this stream arrived in Graylog's Elasticsearch index at: "timestamp" : "2021-03-02 12:47:14.174"

In the logs I can see that Graylog is not able to count this message in the aggregation event processor:

2021-03-02 12:47:30,291 DEBUG: - Executing event processor <At least one ERROR message/5fd39b11cc99b90cb001d9c3/aggregation-v1>
2021-03-02 12:47:30,300 DEBUG: - Creating events for config=AggregationEventProcessorConfig{type=aggregation-v1, query=*, queryParameters=[], streams=[5fd39b11cc99b90cb001d9b9], groupBy=[], series=[AggregationSeries{id=ef239e8e-a046-43f0-9a36-68a8f32fa8df, function=COUNT, field=Optional.empty}], conditions=Optional[AggregationConditions{expression=Optional[Greater{expr=>, left=NumberReference{expr=number-ref, ref=ef239e8e-a046-43f0-9a36-68a8f32fa8df}, right=NumberValue{expr=number, value=0.0}}]}], searchWithinMs=60000, executeEveryMs=60000} parameters=AggregationEventProcessorParameters{type=aggregation-v1, timerange=AbsoluteRange{type=absolute, from=2021-03-02T12:46:24.354Z, to=2021-03-02T12:47:24.353Z}, streams=[], batchSize=500}
2021-03-02 12:47:30,350 DEBUG: - Got 1 (total-aggregated-messages=0) results.
2021-03-02 12:47:30,350 DEBUG: - Skipping result <AggregationKeyResult{key=[], timestamp=Optional[2021-03-02T12:47:24.354Z], seriesValues=[AggregationSeriesValue{series=AggregationSeries{id=ef239e8e-a046-43f0-9a36-68a8f32fa8df, function=COUNT, field=Optional.empty}, key=[], value=0.0}]}> because the conditions <Optional[AggregationConditions{expression=Optional[Greater{expr=>, left=NumberReference{expr=number-ref, ref=ef239e8e-a046-43f0-9a36-68a8f32fa8df}, right=NumberValue{expr=number, value=0.0}}]}]> don't match

The problem is that:

2021-03-02 12:47:30,350 DEBUG: - Got 1 (total-aggregated-messages=0) results.

Any idea why it would return 0 when the absolute time range is:

timerange=AbsoluteRange{type=absolute, from=2021-03-02T12:46:24.354Z, to=2021-03-02T12:47:24.353Z}

So the timestamp of the message from Elasticsearch should fall into this range, yet it's not counted and the processing is skipped.
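To sanity-check this, here is a minimal Python sketch (values copied from the log excerpt above; the comparison is just illustrative, not Graylog's actual code) confirming that the message timestamp does fall inside the reported absolute range:

```python
from datetime import datetime

def in_range(ts: str, start: str, end: str) -> bool:
    """True if ts falls within [start, end]; all ISO-8601 strings with offsets."""
    p = datetime.fromisoformat
    return p(start) <= p(ts) <= p(end)

# Values from the log excerpt above ("+00:00" added, since Graylog logs UTC).
print(in_range("2021-03-02T12:47:14.174+00:00",
               "2021-03-02T12:46:24.354+00:00",
               "2021-03-02T12:47:24.353+00:00"))  # True: inside the range
```

So the time range itself is not the problem; the message should have been counted.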

As soon as I disable the conditions for this Event Definition it all works, but we really need aggregation :/

Anyone? Still stuck on this, not sure how to debug it to find a root cause.
I tried to get the Elasticsearch query that Graylog executes in AggregationEventProcessor, but even with DEBUG log level enabled in "" I could not find it in string form.

Maybe this was already fixed in a newer version of 3.3; the latest is 3.3.11, check the changelog:

I have just upgraded to 3.3.11+9335407 but it did not change anything.

I think your problem is the aggregation function enabled in the event definition. Why did you enable Aggregation when you didn't define a group-by field at all? Try switching your event definition to Filter has result, it should help.

Also check the Grace period in your notification, to make sure you didn't postpone notifications.

If I switch to Filter has result then it works great, but the problem is that 50 errors in a 1-minute period generate 50 notifications, and this is why we want the Aggregation setup: to receive only a single mail.
We do not have a Grace period set up. I will try to define a group-by field to see if this changes anything, but the weird thing is that it sometimes works correctly even with the current setup.
Thank you for your advice and time, I will try that and let you know.

One more hint: check the Events System section in System - Configurations; there are default timeouts and so on.

Hello again. Unfortunately, setting a Group by field does not change anything. The only real difference is when we disable aggregation and just resort to the Filter has result option, but we really like Aggregation for two reasons:

  • grouped messages in a single email, so 200 new messages won't send out 200 emails

  • seeing the result of the count() aggregation in the email, allowing us to easily see the scale of the problem

I think we could use Filter has result together with a Grace period to avoid triggering 200 emails when 200 new messages appear in the stream, but then we lose the information about the scale of the incident.

Currently the weird situation is that our setup with aggregation sometimes works great but randomly fails at other times, and I do not know if I should look for the reason in Elasticsearch or somewhere else. We are thinking about upgrading to 4.x, but we are not sure whether 4.x had changes in the events subsystem that could potentially fix this for us.

Bumping once more, still trying to figure this stuff out.
We have noticed that changing the event settings from:

Search within: 1 minute
Execute search every: 1 minute

to:

Search within: 5 minutes
Execute search every: 5 minutes

gives us stable notifications; now we are trying to figure out why.

Bumping, is there anybody who could help with this? Is Graylog just not designed to work with such short and frequent notifications?

Good morning, all of ours are set to:

Maybe you need more resources on the server? RAM, CPU, JVM Heap, etc…

We are only receiving about 6 MB/day. So very light. Thank you, Zach.

As I understand it, there is a gap in the way the notification system works: it does not box in the search window based on the prior execution. If you define the filter to run once per minute searching the past minute, the time it takes for the search to actually be staged for execution shifts the search windows. If your first execution runs at 00:01:00.000 searching the last minute, but it takes 15 ms for the search to be staged, then your actual search window is 00:00:00.015 to 00:01:00.015. The scheduler then schedules the next execution exactly one minute from the beginning of the prior execution (00:02:00.000), meaning that if performance conditions are now better and staging takes 10 ms rather than 15, you get a 5 ms overlap with the prior window (00:01:00.010 to 00:02:00.010); if staging takes longer, you get a gap instead.

All of this means that your event definitions/alerts can theoretically have gaps or duplicates if the execution frequency and lookback timeframe are the same.
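The staging-delay arithmetic above can be sketched in a few lines of Python (a simplified model of the scheduler as described here, not Graylog's actual implementation):

```python
from datetime import datetime, timedelta

def window(scheduled: datetime, staging_delay_ms: int, lookback_s: int = 60):
    """Actual search window when staging shifts execution by staging_delay_ms."""
    end = scheduled + timedelta(milliseconds=staging_delay_ms)
    return end - timedelta(seconds=lookback_s), end

# Numbers from the explanation above: first run staged 15 ms late,
# second run (scheduled exactly one minute later) staged only 10 ms late.
first = window(datetime(2021, 3, 2, 0, 1, 0), 15)
second = window(datetime(2021, 3, 2, 0, 2, 0), 10)

overlap = first[1] - second[0]  # positive -> overlap, negative -> gap
print(overlap)  # 0:00:00.005000, i.e. a 5 ms overlap between consecutive windows
```

With a slower second staging (say 20 ms) the same arithmetic yields a 5 ms gap instead, which is exactly the duplicate/miss behavior described.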

If this has been resolved since it was explained to me I invite @aaronsachs or anyone else to please correct me.


Hello once again to you all.
Sorry for not replying for a longer period of time, but I was busy debugging this on my own.
I think we did find the root of the problem and, unfortunately, I must apologize: it has nothing to do with Graylog itself. It seems I foolishly assumed it was a Graylog issue because the notifications sometimes worked and sometimes did not.

The thing is that we are using Fluentd to collect logs from containers and send them via a GELF input to Graylog. Fluentd is configured with "reserve_time true", which means Fluentd will preserve the time of the log from the container. That container log time is then mapped into the "timestamp" field in Elasticsearch, which is all good.

The problem is that it takes some time for Fluentd to collect a log and send it to Graylog. When I disabled the "reserve_time" feature I could see that the time difference between the actual log time in the container and the timestamp value in Graylog is around 50 seconds, and that leads to a situation where a log with a timestamp in the past is actually persisted in Graylog AFTER the EventProcessorJob cycle.

In other words, let's assume Graylog is scanning the timespan 26.03.2021 09:10:25 - 26.03.2021 09:11:25. It is then possible for our Fluentd pods to send a log AFTER 26.03.2021 09:11:25 but with a timestamp within that timeframe. That is why I can see a message with a timestamp in the range but no notification: at the time of the EventProcessorJob execution the message is just not yet in Elasticsearch.

For a test run I disabled the "reserve_time" flag, and right off the bat I was getting every notification, since now the message's timestamp is most probably set to "now()", so the EventProcessorJob can actually query it correctly.
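The failure mode described above can be sketched as a small Python model (assumed values: the ~50 s Fluentd lag and the window from the example; the processor is assumed to run exactly at window close, which simplifies Graylog's actual scheduling):

```python
from datetime import datetime, timedelta

def is_counted(log_ts, arrival_ts, window_start, window_end, exec_time):
    """A message is counted only if its timestamp falls inside the search window
    AND it was already indexed when the event processor ran."""
    return window_start <= log_ts <= window_end and arrival_ts <= exec_time

# Window from the example above.
w_start = datetime(2021, 3, 26, 9, 10, 25)
w_end   = datetime(2021, 3, 26, 9, 11, 25)

log_ts  = w_end - timedelta(seconds=10)   # timestamp inside the window...
arrival = log_ts + timedelta(seconds=50)  # ...but delivered ~50 s later by Fluentd

print(is_counted(log_ts, arrival, w_start, w_end, exec_time=w_end))  # False: missed
print(is_counted(log_ts, log_ts, w_start, w_end, exec_time=w_end))   # True without lag
```

This matches the observed behavior: the message is visible afterwards in a search over that timeframe, but it did not exist in the index at the moment the processor evaluated the window.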

With all of this said, do you know any recommended technique for dealing with such situations?
Currently we are thinking about two solutions:

  1. Reconfigure Fluentd and try to make it send logs more frequently
  2. Persist the actual log time into a new field "container_timestamp" and make Graylog sort messages by that field
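For what it's worth, the improvement seen earlier in the thread with the 5-minute window is consistent with a very crude model (my assumptions, not measured: a fixed ingestion lag, timestamps spread uniformly over the window, and evaluation exactly at window close):

```python
def miss_fraction(lag_s: float, window_s: float) -> float:
    """Fraction of messages missed under a crude model: messages timestamped in
    the last lag_s seconds of the window are not yet indexed when the window is
    evaluated, so they are never counted."""
    return min(lag_s / window_s, 1.0)

print(round(miss_fraction(50, 60), 2))   # 0.83 -> most messages missed at 1 minute
print(round(miss_fraction(50, 300), 2))  # 0.17 -> far fewer missed at 5 minutes
```

Under that model, widening the window does not eliminate misses, it only shrinks the vulnerable tail of each window, which would explain the "more stable but still occasionally flaky" behavior.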

Both of your solutions sound feasible.
I don't know much about Fluentd, but what are the drawbacks of the following statement?

IMO, if that is the simplest way to solve your issue (i.e., with no major drawbacks) and all it takes is to run Fluentd more often, then I would do that.
The easier you make it, the fewer problems you might have.
Just a thought…

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.