Index ranges wrong - manual recalculation yields dates in the future

Graylog: 2.4.7+9116ead
ES: 5.6.16

Graylog cluster: 4 machines w/ 8 cores, 6G of RAM, usually around 10 - 30% load (calculated as Unix load / # of CPUs)

ES cluster:
masters: 3× VM, 2 cores, 2G of RAM (1G heap)
data: 2× physical, 16 cores, 62G of RAM (31G heap), 16 SSDs in RAID0
Around 2000 msg/sec

Issue: a while ago, we noticed the read IOPS of our ES cluster increasing from 500 to 8000. Writes stayed roughly the same. The number of segments was also stable at around 900 during that time (we force merge to 1 segment every night). ES load increased from 15% to up to 150%, probably because of the aforementioned read IO.

Checking the index sets in Graylog, we can see that the time ranges of the default index set are just plain wrong. They are not sequential and are all over the place. We have, in total, maybe 6 months' worth of data in our ES cluster, but some individual index time ranges span 6 months alone. After doing an index range recalculation we get messages like this:

2020-07-13T10:17:36.221+02:00 INFO [RebuildIndexRangesJob] Created ranges for index graylog2_6126: MongoIndexRange{id=null, indexName=graylog2_6126, begin=2020-03-20T06:00:01.000Z, end=2020-09-12T19:06:26.000Z, calculatedAt=2020-07-13T08:17:04.975Z, calculationDuration=31244, streamIds=[000000000000000000000001, 5b21d991e5b43f814010354d, 5a4f7aa9b884dda95e3ef25f, 5a4f818ab884dda95e3ef958, 5b20cc65e5b43f81400f2805, 5d319ac2b884dde38b433ac1, 561ce9fbe4b042d56225d130, 5c3c60056274fe852220ebe3, 55ed3f33e4b0dfacf5f08604, 59bfc85fe5b43f192db4cb74]}

Which makes me think congratulations might be in order for the Graylog team for having invented a time machine.

The timestamp mapping for that index:

 "timestamp": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss.SSS"
          },

Server time:
Mon Jul 13 10:42:41 CEST 2020

We get some timestamp indexing errors; maybe that has something to do with it:

2020-07-13T10:11:47.789+02:00 WARN [GelfCodec] GELF message <7f6e60c6-c4e0-11ea-b255-005056a7e5cd> (received from <xxx>) has invalid "timestamp": 1594627906.977 (type: STRING)

Current hypothesis: the wrong timestamps cause Graylog to do far more lookups than required for queries, hence the higher read IOPS. But maybe the two issues are not related; in that case, this topic is about the wrong timestamps only.

Any ideas? Maybe we can go into MongoDB and fix it ourselves.

Just a guess:

How many shards do you have in your Elasticsearch cluster?

Please read: https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

Hi,

we have 67 indices and 266 shards in the cluster, so around 130 per node. However, the number of shards and indices did not change recently, and this issue did not occur with the same number of shards in the past.

Hey @mh-pinguin,

The range of an index is calculated from the ingested data: when you recalculate the index range, Graylog runs a min/max query on the timestamps in that index.

That means that some of your sources do not have a proper NTP configuration and send messages with dates from the past or the future into Graylog.

Graylog just writes messages into Elasticsearch in the order they are received, keeping whatever date they carry.

When you do a search on a specific date, all indices whose ranges can hold data from that time are queried. I guess in your case all indices are queried all the time.
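
For reference, the recalculation boils down to something like a min/max aggregation (a sketch for ES 5.x; not necessarily the exact request Graylog sends), e.g. a GET to graylog2_6126/_search with:

{
  "size": 0,
  "aggs": {
    "ts_min": { "min": { "field": "timestamp" } },
    "ts_max": { "max": { "field": "timestamp" } }
  }
}

A single stray message with a bogus date is enough to stretch the reported range months beyond the rest of the index.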

Check your received messages and correct the clock on the message senders, OR change the timestamp during ingest in Graylog to the receive time.


Hi Jan,

Makes sense. I will check the logs to see which hosts send data from the future and report back.

Indeed, there was a lonely computer in some office corner setting wrong timestamps. I am trying to create a pipeline rule to automatically fix timestamps that are off by more than one hour, but I am having difficulties with boolean operations on a period and a datetime object:

when
    period(parse_date(
        value: to_string($message.timestamp),
        pattern: "yyyy-MM-dd HH:mm:ss Z") - now()) > seconds(3600)
then
    let time = parse_date(
        value: to_string(now()),
        pattern: "yyyy-MM-dd HH:mm:ss Z"
    );
    set_field("timestamp", time);
    set_field("timestamp_corrected", "true");
end

I am not sure how to either make timestamp - now() yield a comparable period, or make seconds() comparable to a timestamp.

For posterity: to find the machines with future timestamps, one cannot use Graylog, since Graylog does not allow searching in the future. So instead, it is a GET to elasticsearch//_search with this body:

{
  "from": 0,
  "size": 150,
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": {
            "range": {
              "timestamp": {
                "from": "2020-07-28 10:40:53.990",
                "to": "2020-12-28 07:45:53.990",
                "include_lower": true,
                "include_upper": true
              }
            }
          }
        }
      }
    }
  },
  "sort": [
    {
      "timestamp": {
        "order": "desc"
      }
    }
  ]
}

This rule
rule "fix-timestamp"
when
    to_long(parse_date(
        value: to_string($message.timestamp),
        pattern: "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").seconds - now().seconds
    ) > 3600 || to_long(parse_date(
        value: to_string($message.timestamp),
        pattern: "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").seconds - now().seconds
    ) < 3600
then
    set_field("timestamp", now());
    set_field("timestamp_corrected", "true");
end

in a pipeline just causes no messages to be written anymore. There are no errors, and simulation simply does not return a result. So for now we use this:

rule "fix-timestamp"
when
   true
then
    set_field("timestamp", now());
    set_field("timestamp_corrected", "true");
end

Then we wrote some Python code to delete all the messages from the client with the wrong timestamps.

You could also use the "receive" timestamp from Graylog and compare that to the timestamp of the message, and if it is off by whatever threshold, overwrite it.

the field is called gl2_receive_timestamp
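
An untested sketch of such a rule, assuming gl2_receive_timestamp is present on the message and parseable with to_date():

rule "fix-timestamp via receive time"
when
    // untested sketch: assumes gl2_receive_timestamp is present and parseable
    to_date($message.timestamp) - seconds(3600) > to_date($message.gl2_receive_timestamp) ||
    to_date($message.timestamp) + seconds(3600) < to_date($message.gl2_receive_timestamp)
then
    // replace the bogus message time with the time Graylog received it
    set_field("timestamp", to_date($message.gl2_receive_timestamp));
    set_field("timestamp_corrected", "true");
end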

Graylog can definitely find messages in the future, but you must use the Absolute time frame selector:
https://docs.graylog.org/en/3.3/pages/searching/time_frame_selector.html#time-frame-selector

I wrote this Python script to bulk delete messages from multiple indices that match a certain query: https://pastebin.com/aCVJWfqw
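
In essence it runs a _delete_by_query against the affected indices. A minimal sketch of the idea (not the linked script; endpoint, index pattern and query values below are placeholders):

import requests

ES = "http://localhost:9200"   # assumption: your Elasticsearch endpoint
INDICES = "graylog2_*"         # assumption: all Graylog indices

# hypothetical query: everything from the misbehaving source with a future timestamp
query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"source": "lonely-office-pc"}},
                {"range": {"timestamp": {"gte": "2020-07-28 10:40:53.990"}}}
            ]
        }
    }
}

# _delete_by_query removes every document matching the query in one call
resp = requests.post("%s/%s/_delete_by_query" % (ES, INDICES), json=query)
resp.raise_for_status()
print("deleted:", resp.json().get("deleted"))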

Indeed, that has helped our load and the index ranges are back to normal.

However, I still have a problem with

when
    to_long(parse_date(
        value: to_string($message.timestamp),
        pattern: "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").seconds - now().seconds
    ) > 3600 || to_long(parse_date(
        value: to_string($message.timestamp),
        pattern: "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").seconds - now().seconds
    ) < 3600
then
    set_field("timestamp", now());
    set_field("timestamp_corrected", "true");
end

While Graylog reports no errors, the pipeline simply outputs no messages anymore, which, when applied to all messages, means nothing gets written at all.
So for now, we just use

when
  true

However, I would like to have this conditional and am not sure why it causes the pipeline to stop working.

Hi,
you can use date arithmetic directly with now() and seconds(); it is not necessary to convert to long (which indeed does not work, as in your code):

rule "fix timestamp"
when
    parse_date(value: to_string($message.timestamp), pattern: "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'") - seconds(3600) > now() || 
    parse_date(value: to_string($message.timestamp), pattern: "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'") + seconds(3600) < now()
then
  set_field("timestamp", now());
  set_field("timestamp_corrected", "true");
end

That seems to work, thanks. However, it really cuts our throughput down to maybe 1500 msg/s. Nothing to be done about that, I guess, so we just stick with setting the timestamp to now() for everything. Thank you for your help.

If you know the exact client(s), you can use a new pipeline stage (with a lower number than your fix-timestamp stage) with a condition on the source:

rule "example.com client"
when
    to_string($message.source) == "example.com"
then
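    // intentionally empty: a match here only gates which messages reach the later fix-timestamp stage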
end

And use a higher pipeline stage number for fix-timestamp. This way, you only check for bad timestamps from those exact clients, so it shouldn't slow down your msg/s so much.
