Graylog: 2.4.7+9116ead
ES: 5.6.16
Graylog cluster: 4 machines with 8 cores and 6 GB of RAM each, usually around 10-30% load (calculated as Unix load / number of CPUs)
ES cluster:
masters: 3 VMs, 2 cores, 2 GB of RAM (1 GB heap)
data: 2 physical machines, 16 cores, 62 GB of RAM (31 GB heap), 16 SSDs in RAID0
Throughput: around 2000 msg/sec
Issue: a while ago, we noticed the read IOPS of our ES cluster increasing from 500 to 8000, while writes stayed roughly the same. The number of segments also stayed stable at around 900 during that time (we force merge to 1 segment every night). ES load increased from 15% to up to 150%, probably because of the aforementioned read IO.
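(For context, the nightly merge is nothing fancy, just the force-merge API with max_num_segments=1, roughly like this; host and index pattern are placeholders for our setup:)

```python
# Sketch of the nightly merge step: force-merge the older, read-only indices
# down to a single segment. Host and index pattern are placeholders.
import requests

ES = "http://localhost:9200"   # placeholder for our ES HTTP endpoint

requests.post(f"{ES}/graylog2_*/_forcemerge", params={"max_num_segments": 1})
```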
Checking the index sets in Graylog, we can see that the time ranges of the default index set are just plain wrong: they are not sequential and are all over the place. We have, in total, maybe 6 months' worth of data in our ES cluster, but some individual index time ranges span 6 months on their own. After running an index range recalculation we get messages like this:
2020-07-13T10:17:36.221+02:00 INFO [RebuildIndexRangesJob] Created ranges for index graylog2_6126: MongoIndexRange{id=null, indexName=graylog2_6126, begin=2020-03-20T06:00:01.000Z, end=2020-09-12T19:06:26.000Z, calculatedAt=2020-07-13T08:17:04.975Z, calculationDuration=31244, streamIds=[000000000000000000000001, 5b21d991e5b43f814010354d, 5a4f7aa9b884dda95e3ef25f, 5a4f818ab884dda95e3ef958, 5b20cc65e5b43f81400f2805, 5d319ac2b884dde38b433ac1, 561ce9fbe4b042d56225d130, 5c3c60056274fe852220ebe3, 55ed3f33e4b0dfacf5f08604, 59bfc85fe5b43f192db4cb74]}
Which makes me think congratulations might be in order for the Graylog team for inventing a time machine: the range was calculated on 2020-07-13 but supposedly ends on 2020-09-12, two months in the future.
Timestamp mapping for that index:
"timestamp": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss.SSS"
},
Server time:
Mon Jul 13 10:42:41 CEST 2020
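To cross-check whether those odd range boundaries actually exist in the data or are an artifact of the range calculation, a min/max aggregation on the timestamp field should tell us. A rough sketch; the ES endpoint is a placeholder for our setup, the index name is taken from the log line above:

```python
# Sketch: compare the calculated range against the oldest/newest timestamp
# actually stored in the index, via a min/max aggregation on "timestamp".
import requests

ES = "http://localhost:9200"   # placeholder for our ES HTTP endpoint
INDEX = "graylog2_6126"        # index from the RebuildIndexRangesJob log line

query = {
    "size": 0,
    "aggs": {
        "oldest": {"min": {"field": "timestamp", "format": "yyyy-MM-dd HH:mm:ss.SSS"}},
        "newest": {"max": {"field": "timestamp", "format": "yyyy-MM-dd HH:mm:ss.SSS"}},
    },
}

aggs = requests.post(f"{ES}/{INDEX}/_search", json=query).json()["aggregations"]
print("oldest:", aggs["oldest"]["value_as_string"])
print("newest:", aggs["newest"]["value_as_string"])
```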
We also get some timestamp indexing errors; maybe that has something to do with it:
2020-07-13T10:11:47.789+02:00 WARN [GelfCodec] GELF message <7f6e60c6-c4e0-11ea-b255-005056a7e5cd> (received from <xxx>) has invalid "timestamp": 1594627906.977 (type: STRING)
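As far as I understand the GELF spec, the timestamp field is supposed to be a number (seconds since the UNIX epoch, with optional decimal milliseconds), and the warning suggests one of our senders ships it as a string instead. For reference, a minimal sketch of a well-formed message, assuming a default GELF UDP input (host and port are placeholders):

```python
# Sketch: a GELF message with a *numeric* timestamp. The warning above suggests
# some sender puts the epoch timestamp into the JSON as a string
# ("1594627906.977") instead of a number. Host/port are placeholders.
import json
import socket
import time
import zlib

GRAYLOG_HOST = "graylog.example.com"   # placeholder: our Graylog node
GELF_UDP_PORT = 12201                  # assumption: default GELF UDP input port

message = {
    "version": "1.1",
    "host": "test-sender",
    "short_message": "timestamp format test",
    "timestamp": time.time(),          # number (seconds.millis), not a string
    "level": 6,
}

payload = zlib.compress(json.dumps(message).encode("utf-8"))
socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(
    payload, (GRAYLOG_HOST, GELF_UDP_PORT)
)
```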
Our current hypothesis is that the wrong timestamps (and therefore the wrong index ranges) cause Graylog to search far more indices than necessary per query, resulting in the higher read IOPS. But maybe the two issues are not related; in that case, this topic is about the wrong timestamps only.
Any ideas? Maybe we can go into MongoDB and fix it ourselves.
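Before changing anything by hand, I would at least dump what is stored. A read-only sketch, assuming the ranges live in the index_ranges collection of the graylog database (collection and field names are my assumption, please correct me):

```python
# Sketch: read-only dump of the stored index ranges in MongoDB, to see which
# entries look bogus before touching anything by hand.
# Assumptions: the ranges live in the "index_ranges" collection of the
# "graylog" database and carry "index_name", "begin" and "end" fields;
# the connection string is a placeholder for our replica set.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
ranges = client["graylog"]["index_ranges"]

for doc in ranges.find().sort("index_name"):
    print(doc.get("index_name"), doc.get("begin"), doc.get("end"))
```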