Index ranges wrong - manual recalculation yields dates in the future

Graylog: 2.4.7+9116ead
ES: 5.6.16

Graylog cluster: 4 machines w/ 8 cores, 6G of RAM, usually around 10 - 30% load (calculated as Unix load / # of CPUs)

ES cluster:
masters: 3× VM, 2 cores, 2G of RAM (1G heap)
data: 2× physical, 16 cores, 62G of RAM (31G heap), 16 SSDs in RAID0
Around 2000 msg/sec

Issue: a while ago, we noticed the read IOPS of our ES cluster increasing from 500 to 8000. Writes stayed roughly the same. The number of segments was also stable at around 900 during that time (we force merge to 1 segment every night). ES load increased from 15% to up to 150%, probably because of the aforementioned read IO.

Checking the index sets in Graylog, we can see that the time ranges of the default index set are just plain wrong. They are not sequential and are all over the place. We have, in total, maybe 6 months' worth of data in our ES cluster, but some individual index time ranges span 6 months alone. After doing an index range recalculation we get messages like this:

2020-07-13T10:17:36.221+02:00 INFO [RebuildIndexRangesJob] Created ranges for index graylog2_6126: MongoIndexRange{id=null, indexName=graylog2_6126, begin=2020-03-20T06:00:01.000Z, end=2020-09-12T19:06:26.000Z, calculatedAt=2020-07-13T08:17:04.975Z, calculationDuration=31244, streamIds=[000000000000000000000001, 5b21d991e5b43f814010354d, 5a4f7aa9b884dda95e3ef25f, 5a4f818ab884dda95e3ef958, 5b20cc65e5b43f81400f2805, 5d319ac2b884dde38b433ac1, 561ce9fbe4b042d56225d130, 5c3c60056274fe852220ebe3, 55ed3f33e4b0dfacf5f08604, 59bfc85fe5b43f192db4cb74]}

Which makes me think congratulations might be in order for the Graylog team for having invented a time machine.

The timestamp mapping for that index:

 "timestamp": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss.SSS"
          },

Server time:
Mon Jul 13 10:42:41 CEST 2020

We get some timestamp indexing errors; maybe that has something to do with it:

2020-07-13T10:11:47.789+02:00 WARN [GelfCodec] GELF message <7f6e60c6-c4e0-11ea-b255-005056a7e5cd> (received from <xxx>) has invalid "timestamp": 1594627906.977 (type: STRING)

Current hypothesis: the wrong timestamps cause Graylog to do far more lookups than required for queries, hence the higher read IOPS. But maybe the two issues are not related; in that case, this topic is about the wrong timestamps only.

Any ideas? Maybe we can go into MongoDB and fix it ourselves.

Just a guess:

How many shards do you have in your Elasticsearch cluster?

Please read: https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

Hi,

we have 67 indices and 266 shards in the cluster, so around 130 per node. However, the number of shards and indices did not change recently, and this issue did not occur with the same number of shards in the past.

Hey @mh-pinguin,

The range of an index is calculated from the ingested data: when you recalculate the index range, Graylog runs a min/max query on the timestamps in that index.

That means that some of your sources do not have a proper NTP configuration and send messages with dates from the past or the future into Graylog.

Graylog just writes messages into Elasticsearch in the order they are received, keeping whatever date they carry.

When you do a search on a specific date, all indices whose ranges can hold data from that time are queried. I guess in your case all indices are queried all the time.
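
For reference, the recalculation boils down to something like a min/max aggregation (a sketch for ES 5.x; not necessarily the exact request Graylog sends), e.g. a GET to graylog2_6126/_search with:

{
  "size": 0,
  "aggs": {
    "ts_min": { "min": { "field": "timestamp" } },
    "ts_max": { "max": { "field": "timestamp" } }
  }
}

A single stray message with a bogus date is enough to stretch the reported range months beyond the rest of the index.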

Check your received messages and correct the clock on the message senders, OR change the timestamp during ingest in Graylog to the receive time.


Hi Jan,

Makes sense. I will check the logs to see which hosts send data from the future and report back.

Indeed, there was a lonely computer in some office corner setting wrong timestamps. I am trying to create a pipeline rule to automatically fix timestamps that are off by more than one hour, but I am having difficulties with boolean operations on a period and a datetime object:

when
    period(parse_date(
        value: to_string($message.timestamp),
        pattern: "yyyy-MM-dd HH:mm:ss Z") - now()) > seconds(3600)
then
    let time = parse_date(
        value: to_string(now()),
        pattern: "yyyy-MM-dd HH:mm:ss Z"
    );
    set_field("timestamp", time);
    set_field("timestamp_corrected", "true");
end

I am not sure how to either make timestamp - now() yield a comparable period, or make seconds() comparable to a timestamp.

For posterity: to find the machines with future timestamps, one cannot use Graylog, since Graylog does not allow searching in the future. So instead, it is a GET to elasticsearch//_search with this body:

{
  "from": 0,
  "size": 150,
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": {
            "range": {
              "timestamp": {
                "from": "2020-07-28 10:40:53.990",
                "to": "2020-12-28 07:45:53.990",
                "include_lower": true,
                "include_upper": true
              }
            }
          }
        }
      }
    }
  },
  "sort": [
    {
      "timestamp": {
        "order": "desc"
      }
    }
  ]
}

This rule
rule "fix-timestamp"
when
    to_long(parse_date(
        value: to_string($message.timestamp),
        pattern: "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").seconds - now().seconds
    ) > 3600 || to_long(parse_date(
        value: to_string($message.timestamp),
        pattern: "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").seconds - now().seconds
    ) < 3600
then
    set_field("timestamp", now());
    set_field("timestamp_corrected", "true");
end

in a pipeline just causes no messages to be written anymore. There are no errors, and simulation simply does not return a result. So for now we use this:

rule "fix-timestamp"
when
   true
then
    set_field("timestamp", now());
    set_field("timestamp_corrected", "true");
end

Then we wrote some Python code to delete all the messages from the client with the wrong timestamps.

You could also use the "receive" timestamp from Graylog and compare that to the timestamp of the message, and if it is off by whatever threshold, overwrite it.

the field is called gl2_receive_timestamp
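
An untested sketch of such a rule, assuming gl2_receive_timestamp is present on the message and parseable with to_date():

rule "fix-timestamp via receive time"
when
    // untested sketch: assumes gl2_receive_timestamp is present and parseable
    to_date($message.timestamp) - seconds(3600) > to_date($message.gl2_receive_timestamp) ||
    to_date($message.timestamp) + seconds(3600) < to_date($message.gl2_receive_timestamp)
then
    // replace the bogus message time with the time Graylog received it
    set_field("timestamp", to_date($message.gl2_receive_timestamp));
    set_field("timestamp_corrected", "true");
end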

Graylog can definitely find messages in the future, but you must use the Absolute time frame selector:
https://docs.graylog.org/en/3.3/pages/searching/time_frame_selector.html#time-frame-selector

I wrote this Python script to bulk delete messages from multiple indices that match a certain query: https://pastebin.com/aCVJWfqw
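
In essence it runs a _delete_by_query against the affected indices. A minimal sketch of the idea (not the linked script; endpoint, index pattern and query values below are placeholders):

import requests

ES = "http://localhost:9200"   # assumption: your Elasticsearch endpoint
INDICES = "graylog2_*"         # assumption: all Graylog indices

# hypothetical query: everything from the misbehaving source with a future timestamp
query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"source": "lonely-office-pc"}},
                {"range": {"timestamp": {"gte": "2020-07-28 10:40:53.990"}}}
            ]
        }
    }
}

# _delete_by_query removes every document matching the query in one call
resp = requests.post("%s/%s/_delete_by_query" % (ES, INDICES), json=query)
resp.raise_for_status()
print("deleted:", resp.json().get("deleted"))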

Indeed, that has helped our load and the index ranges are back to normal.

However, I still have a problem with

when
    to_long(parse_date(
        value: to_string($message.timestamp),
        pattern: "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").seconds - now().seconds
    ) > 3600 || to_long(parse_date(
        value: to_string($message.timestamp),
        pattern: "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").seconds - now().seconds
    ) < 3600
then
    set_field("timestamp", now());
    set_field("timestamp_corrected", "true");
end

While Graylog reports no errors, the pipeline simply outputs no messages anymore, which, when applied to all messages, means nothing gets written at all.
So for now, we just use

when
  true

However, I would like to have this conditional and am not sure why it causes the pipeline to stop working.

Hi,
you can use date arithmetic directly with now() and seconds(); it is not necessary to convert to long (which indeed does not work, as in your code):

rule "fix timestamp"
when
    parse_date(value: to_string($message.timestamp), pattern: "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'") - seconds(3600) > now() || 
    parse_date(value: to_string($message.timestamp), pattern: "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'") + seconds(3600) < now()
then
  set_field("timestamp", now());
  set_field("timestamp_corrected", "true");
end

That seems to work, thanks. However, it really cuts our throughput down to maybe 1500 msg/s. Nothing to be done about that, I guess, so we just stick with setting the timestamp to now() for everything. Thank you for your help.

If you know the exact client(s), you can use a new pipeline stage (with a lower number than your fix-timestamp stage) with a condition on the source:

rule "example.com client"
when
    to_string($message.source) == "example.com"
then
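    // intentionally empty: a match here only gates which messages reach the later fix-timestamp stage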
end

And use a higher pipeline stage number for fix-timestamp. This way, you only check for bad timestamps from those exact clients, so it shouldn't slow down your msg/s so much.
