Alert events are being randomly skipped

Hello,

I don’t receive an Alert, nor is it shown in the Alerts & Events list.
The filter preview in the Event Definition shows the missing event while editing, if I switch it to search within the last 24 hours.
Searching manually with the filter’s search query also returns the missing events.

Here’s the part of the log that is shipped from the remote instance to Graylog via filebeat:

Mar 28 19:26:31 mail postfix/submission/smtpd: disconnect from ec2-18-214-89-240.compute-1.amazonaws.com[18.214.89.240] commands=0/0
Mar 28 19:26:34 mail postfix/submission/smtpd: warning: ec2-18-214-89-240.compute-1.amazonaws.com[18.214.89.240]: SASL LOGIN authentication failed: UGFzc3dvcmQ6
Mar 28 19:26:34 mail postfix/submission/smtpd: lost connection after AUTH from ec2-18-214-89-240.compute-1.amazonaws.com[18.214.89.240]
Mar 28 19:26:34 mail postfix/submission/smtpd: disconnect from ec2-18-214-89-240.compute-1.amazonaws.com[18.214.89.240] ehlo=2 starttls=1 auth=0/1 commands=3/4
Mar 28 21:27:13 mail postfix/submission/smtpd: lost connection after UNKNOWN from unknown[103.45.246.23]
Mar 28 21:27:13 mail postfix/submission/smtpd: disconnect from unknown[103.45.246.23] unknown=0/2 commands=0/2
Mar 29 03:27:21 mail postfix/submission/smtpd: SSL_accept error from <redacted>: lost connection
Mar 29 03:27:21 mail postfix/submission/smtpd: disconnect from <redacted> commands=0/0
Mar 29 03:27:21 mail postfix/smtpd: disconnect from <redacted> commands=0/0
Mar 29 03:27:21 mail postfix/submission/smtpd: disconnect from <redacted> commands=0/0
Mar 29 06:10:41 mail postfix/submission/smtpd: lost connection after UNKNOWN from unknown[168.121.97.40]
Mar 29 06:10:41 mail postfix/submission/smtpd: disconnect from unknown[168.121.97.40] ehlo=1 auth=0/1 unknown=0/2 commands=1/4
Mar 29 13:25:25 mail postfix/smtpd: warning: non-SMTP command from 233.150.216.162.bc.googleusercontent.com[162.216.150.233]: GET / HTTP/1.1
Mar 29 13:25:25 mail postfix/smtpd: disconnect from 233.150.216.162.bc.googleusercontent.com[162.216.150.233] unknown=0/1 commands=0/1

The following events were not delivered via Notifications, and they are also missing from “Alerts & Events” [1]:

Mar 28 21:27:13 mail postfix/submission/smtpd: lost connection after UNKNOWN from unknown[103.45.246.23]
Mar 28 21:27:13 mail postfix/submission/smtpd: disconnect from unknown[103.45.246.23] unknown=0/2 commands=0/2
Mar 29 06:10:41 mail postfix/submission/smtpd: lost connection after UNKNOWN from unknown[168.121.97.40]
Mar 29 06:10:41 mail postfix/submission/smtpd: disconnect from unknown[168.121.97.40] ehlo=1 auth=0/1 unknown=0/2 commands=1/4

When editing the Event Definition, if I set the Filter settings (“search within the last” and “execute search every”) to show the last 24 hours, I can clearly see the missing events. [2]
Searching manually with the same Search Query as used in the Event Definition returns the missing messages. [3]
For reference, I’m also attaching the Event Definition summary exactly as it is configured and used (unaltered). [4]
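
Since the summary is only available as a screenshot, here is roughly what the relevant part of the Event Definition looks like when exported through the Events API (GET /api/events/definitions/<id>). The field names are the ones used by the aggregation-v1 processor, and the values are reconstructed from my settings and from the replay info in the logs further down, so please treat this as a sketch rather than the exact export:

"config": {
  "type": "aggregation-v1",
  "query": "source:mail.<redacted> AND (\"authentication failed\" OR \"lost connection after UNKNOWN\" OR ...)",
  "streams": ["000000000000000000000001"],
  "search_within_ms": 30000,
  "execute_every_ms": 30000,
  "group_by": [],
  "series": [],
  "conditions": {}
}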

Environment:
Debian 10 host with Docker 26.0.0
Graylog 5.2.5
Elasticsearch 7.10.2
Mongo 5.0.13
All three run in Docker

docker-compose.yml

version: '3'
services:
  mongo:
    image: mongo:5.0.13
    container_name: graylog_mongo
    restart: always
    volumes:
      - mongo_data:/data/db
    networks:
      - graylog
    ports:
      - 127.0.0.1:27017:27017/tcp

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2
    container_name: graylog_es
    restart: always
    volumes:
      - es_data:/usr/share/elasticsearch/data
    environment:
      - http.host=0.0.0.0
      - transport.host=localhost
      - network.host=0.0.0.0
      - "ES_JAVA_OPTS=-Dlog4j2.formatMsgNoLookups=true -Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    deploy:
      resources:
        limits:
          memory: 1g
    networks:
      - graylog

  graylog:
    image: graylog/graylog:5.2.5
    container_name: graylog_graylog
    restart: always
    volumes:
      - graylog_data:/usr/share/graylog/data
      - ./certs:/opt/certs:ro
      - ./graylog.conf:/usr/share/graylog/graylog.conf
      - ./graylog-plugin-telegram-notification-2.4.0.jar:/usr/share/graylog/plugin/graylog-plugin-telegram-notification-2.4.0.jar
    environment:
      - GRAYLOG_PASSWORD_SECRET=<redacted>
      - GRAYLOG_ROOT_PASSWORD_SHA2=<redacted>
      - TZ=Europe/Prague
      - GRAYLOG_ROOT_TIMEZONE=Europe/Prague
    entrypoint: /usr/bin/tini -- wait-for-it elasticsearch:9200 -- /docker-entrypoint.sh
    networks:
      - graylog
    depends_on:
      - mongo
      - elasticsearch
    ports:
      - 127.0.0.1:9000:9000/tcp
      - 5044:5044/tcp

networks:
  graylog:
    driver: bridge

volumes:
  mongo_data:
    driver: local
  es_data:
    driver: local
  graylog_data:
    driver: local

graylog.conf:

is_leader = true
node_id_file = /etc/graylog/server/node-id
bin_dir = bin
data_dir = data
plugin_dir = plugin
http_publish_uri = http://172.0.0.1:9000/
http_enable_tls = true
http_tls_cert_file = /opt/certs/cert.pem
http_tls_key_file = /opt/certs/privkey.pem
stream_aware_field_types=false
allow_leading_wildcard_searches = true
allow_highlighting = false
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = data/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://localhost/graylog
mongodb_max_connections = 1000

The Graylog container has no logs around the time of the missing messages. (I fixed the warnings and errors below by removing the problematic Custom Fields while gathering the information for this topic.)

2024-03-28 19:26:53,533 WARN : org.graylog.events.processor.EventProcessorEngine - Key spec <[message]> for event <EventImpl{eventId=01HT3516WNKT1SMJ3P93CEDZRR, eventDefinitionType=aggregation-v1, eventDefinitionId=65e50cfceba8cf7317beded8, originContext=urn:graylog:message:es:graylog_0:bc909871-ed30-11ee-8201-0242ac190004, eventTimestamp=2024-03-28T18:26:37.279Z, processingTimestamp=2024-03-28T18:26:53.461Z, timerangeStart=null, timerangeEnd=null, streams=[], sourceStreams=[000000000000000000000001], message=New mail issue, source=3fa06454e9e7, keyTuple=[], priority=2, alert=true, fields={message=FieldValue{dataType=ERROR, value=}}, groupByFields={}, replayInfo=EventReplayInfo{timerangeStart=2024-03-28T18:26:17.585Z, timerangeEnd=2024-03-28T18:26:47.585Z, query=source:mail.<redacted> AND ("does not resolve to" OR "non-SMTP command" OR "authentication failed" OR "Relay access denied" OR "Domain not found" OR "warning: TLS library problem" OR "Service unavailable" OR "SSL_accept error" OR "failed to parse" OR "too many errors" OR "commands=0/0" OR "auth=0/1" OR "auth=0/2" OR "unknown=0" OR "SERVFAIL" OR "lost connection after AUTH" OR "lost connection after EHLO" OR "lost connection after UNKNOWN" OR "lost connection after STARTTLS"), streams=[000000000000000000000001]}}> cannot be fulfilled
2024-03-29 03:27:48,439 ERROR: org.graylog.events.fields.providers.TemplateFieldValueProvider - No value found for variable "message" in template "${message}"
2024-03-29 03:27:48,482 WARN : org.graylog.events.processor.EventProcessorEngine - Key spec <[message]> for event <EventImpl{eventId=01HT40HSHM9D974AT22V5VB714, eventDefinitionType=aggregation-v1, eventDefinitionId=65e50cfceba8cf7317beded8, originContext=urn:graylog:message:es:graylog_0:e67281c1-ed73-11ee-8201-0242ac190004, eventTimestamp=2024-03-29T02:27:23.833Z, processingTimestamp=2024-03-29T02:27:48.404Z, timerangeStart=null, timerangeEnd=null, streams=[], sourceStreams=[000000000000000000000001], message=New mail issue, source=3fa06454e9e7, keyTuple=[], priority=2, alert=true, fields={message=FieldValue{dataType=ERROR, value=}}, groupByFields={}, replayInfo=EventReplayInfo{timerangeStart=2024-03-29T02:27:17.585Z, timerangeEnd=2024-03-29T02:27:47.585Z, query=source:mail.<redacted> AND ("does not resolve to" OR "non-SMTP command" OR "authentication failed" OR "Relay access denied" OR "Domain not found" OR "warning: TLS library problem" OR "Service unavailable" OR "SSL_accept error" OR "failed to parse" OR "too many errors" OR "commands=0/0" OR "auth=0/1" OR "auth=0/2" OR "unknown=0" OR "SERVFAIL" OR "lost connection after AUTH" OR "lost connection after EHLO" OR "lost connection after UNKNOWN" OR "lost connection after STARTTLS"), streams=[000000000000000000000001]}}> cannot be fulfilled
2024-03-29 13:25:47,865 ERROR: org.graylog.events.fields.providers.TemplateFieldValueProvider - No value found for variable "message" in template "${message}"
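
For context, those warnings come from a Custom Field keyed on message whose value template was ${message}. In the Event Definition that part presumably looked roughly like the snippet below; the field names are taken from the Events API and the values are reconstructed from the error messages above, so this is an assumption rather than a copy of my actual config. If I understand the template syntax correctly, message fields are referenced through the source variable (e.g. ${source.message}), which would explain why no value could be found for the variable “message”:

"field_spec": {
  "message": {
    "data_type": "string",
    "providers": [
      { "type": "template-v1", "template": "${message}", "require_values": false }
    ]
  }
},
"key_spec": ["message"]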

Elasticsearch has no logs around the time of the missing messages:

{"type": "server", "timestamp": "2024-03-28T18:25:22,016Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "docker-cluster", "node.name": "0e6ecec725cb", "message": "[gc][272490] overhead, spent [273ms] collecting in the last [1s]", "cluster.uuid": "qbsNRgP8SaWhxjWwTjrIog", "node.id": "6sI6NmX1Q5eN0khzul73jg"  }
{"type": "server", "timestamp": "2024-03-28T23:00:10,963Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "docker-cluster", "node.name": "0e6ecec725cb", "message": "[gc][288968] overhead, spent [270ms] collecting in the last [1s]", "cluster.uuid": "qbsNRgP8SaWhxjWwTjrIog", "node.id": "6sI6NmX1Q5eN0khzul73jg"  }
{"type": "server", "timestamp": "2024-03-28T23:00:10,963Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "docker-cluster", "node.name": "0e6ecec725cb", "message": "[gc][288968] overhead, spent [270ms] collecting in the last [1s]", "cluster.uuid": "qbsNRgP8SaWhxjWwTjrIog", "node.id": "6sI6NmX1Q5eN0khzul73jg"  }
{"type": "server", "timestamp": "2024-03-29T15:36:12,362Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "0e6ecec725cb", "message": "version[7.10.2], pid[7], build[oss/docker/747e1cc71def077253878a59143c1f785afa92b9/2021-01-13T00:42:12.435326Z], OS[Linux/4.19.0-26-amd64/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/15.0.1/15.0.1+9]" }

Mongo has no logs around the time of the missing messages; the last log entry is dated 23.03.2024 and is clearly unrelated.

Maybe the “Key spec for event […] cannot be fulfilled” error was causing problems; I don’t know at this time. I will keep watching for the issue to happen again.

Any contributions are highly appreciated! I hope someone can point me in the right direction.

__
Four screenshots have been merged into a single one due to the “new users” limitation.

Maybe the “Key spec for event […] cannot be fulfilled” error was causing problems; I don’t know at this time. I will keep watching for the issue to happen again.

It happened again last night, even though I fixed the problem quoted above.

The missing event is visible in a manual search, and it is visible in the Event Definition edit mode when the preview is adjusted to 24 hours, but it is not in the Alerts list and it was not reported via Notifications :frowning:

I haven’t found anything related in the logs of any of the containers.

So under normal conditions you run this search every 30 seconds, for the previous 30 seconds?

If you are, there are lots of things that can go wrong with such a short timeframe. If it is taking longer than 30 seconds for messages to be stored and become available for search, then yes, you will have issues.
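
As a purely hypothetical timeline (all times made up) of how a message can slip through a 30-second window:

19:26:34  message is written on the mail server and shipped by filebeat
19:26:35  the event search runs over the window 19:26:05 - 19:26:35
19:26:50  the message is finally indexed and searchable in Elasticsearch
          -> the only run that covered 19:26:34 has already happened, so no
             event fires even though a later manual search finds the message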

I would try changing it to run every 5 minutes and search back 6 minutes. Or keep running it every 30 seconds but extend the timeframe. When you extend the timeframe beyond the frequency, yes, you can get duplicate notifications, but you are less likely to miss things.
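
Concretely, and just as an example, that would be something like the following in the Filter & Aggregation settings (the *_ms names in parentheses are how the aggregation-v1 config exposes these values in the Events API):

Search within the last:  6 minutes   (search_within_ms = 360000)
Execute search every:    5 minutes   (execute_every_ms = 300000)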

So under normal conditions you run this search every 30 seconds, for the previous 30 seconds?

Under normal conditions I execute the search every 30 seconds to search within the last 30 seconds. It was set to 15 seconds before, but I adjusted it a while ago.

Now I have set it to search within the last 300 s and execute the search every 315 s; I will be watching to see whether the problem happens again.

Thanks for your input!

I think you have those numbers backwards: you want the search window to be longer than how often it runs, so that each run reaches a little bit back into the previous search window.
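
For example, with the numbers you mentioned each cycle leaves a small gap, while flipping them turns the gap into an overlap:

search within 300 s, execute every 315 s:
  run N    covers [T,         T + 300 s]
  run N+1  covers [T + 315 s, T + 615 s]
  -> the 15 s between T + 300 s and T + 315 s are never searched

search within 315 s, execute every 300 s:
  run N    covers [T,         T + 315 s]
  run N+1  covers [T + 300 s, T + 615 s]
  -> a 15 s overlap instead: possible duplicate events, but nothing missed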

