Problems with mongodb replication

Hi everyone,

we’re currently running a Graylog cluster (3 master/graylog-server/mongo-db nodes und 3 elasticsearch datanodes) and facing an issue with the mongodb replica set.
One of the mongod services on one particular host is continuously dying, and having checked the log we’re seeing the following message:
Timestamp(1579543862, 2) is before their earliest timestamp: Timestamp(1580905960, 11) for 1min until: 2020-02-05T15:50:29.644+0100",
“2020-02-05T15:49:29.644+0100 I REPL [replication-0] could not find member to sync from”,
“2020-02-05T15:50:29.655+0100 I REPL [rsBackgroundSync] sync source candidate: blabla.bla.com:27017”,
“2020-02-05T15:50:29.655+0100 I REPL [replication-0] We are too stale to use blabla.bla.com:27017 as a sync source. Blacklisting this sync source because our last fetched timestamp: Timestamp(1579543862, 2) is before their earliest timestamp: Timestamp(1580905978, 28) for 1min until: 2020-02-05T15:51:29.655+0100”,
“2020-02-05T15:50:29.655+0100 I REPL [replication-0] sync source candidate: blabla.bla.com:27017”,
“2020-02-05T15:50:29.655+0100 I REPL [replication-0] We are too stale to use blabla.bla.com:27017 as a sync source. Blacklisting this sync source because our last fetched timestamp: Timestamp(1579543862, 2) is before their earliest timestamp: Timestamp(1580905960, 11) for 1min until: 2020-02-05T15:51:29.655+0100”,
“2020-02-05T15:50:29.655+0100 I REPL [replication-0] could not find member to sync from”

this doesn’t seem to look too good.
has anyone ever had a similar problem with mongodb replica sets and how to fix it? we have found some posts on stackoverflow that indicated it would need to be re-installed/setup again…

Thankful for any advice here,
thanks,
theresa

he @micsnare

Did you have NTPD running on all servers that keep the time in sync? I have often seen this when the time differs on members of the replica set.

In addition when the network connection is flaky or saturated, but mostly time was the problem. You might want to check that.

Servus @jan

Thanks for your immediate reply. I think you’re right, for some reason the ntp service wasn’t running, and therefore the time didn’t get synced.
However, I have now started the ntpd on all nodes, but it seems mongodb is still in a recovering state.
In the mongo shell it says “recovering”

rs-graylog:RECOVERING>

                "2020-02-06T13:12:01.534+0100 I NETWORK  [listener] connection accepted from 10.17.7.1:37294 #776 (10 connections now open)",
            "2020-02-06T13:12:01.534+0100 I NETWORK  [conn776] end connection 10.17.7.1:37294 (9 connections now open)",
            "2020-02-06T13:12:29.558+0100 I NETWORK  [listener] connection accepted from 10.18.7.1:37498 #777 (10 connections now open)",
            "2020-02-06T13:12:29.560+0100 I NETWORK  [conn777] end connection 10.18.7.1:37498 (9 connections now open)",
            "2020-02-06T13:12:42.821+0100 I REPL     [rsBackgroundSync] sync source candidate: server3.blabla.com:27017",
            "2020-02-06T13:12:42.822+0100 I REPL     [replication-0] We are too stale to use server3.blabla.com:27017 as a sync source. Blacklisting this sync source because our last fetched timestamp: Timestamp(1579543862, 2) is before their earliest timestamp: Timestamp(1580907197, 1889) for 1min until: 2020-02-06T13:13:42.822+0100",
            "2020-02-06T13:12:42.822+0100 I REPL     [replication-0] sync source candidate: server2.blabla.com:27017",
            "2020-02-06T13:12:42.822+0100 I REPL     [replication-0] We are too stale to use server2.blabla.com:27017 as a sync source. Blacklisting this sync source because our last fetched timestamp: Timestamp(1579543862, 2) is before their earliest timestamp: Timestamp(1580906717, 975) for 1min until: 2020-02-06T13:13:42.822+0100",
            "2020-02-06T13:12:42.822+0100 I REPL     [replication-0] could not find member to sync from",
            "2020-02-06T13:13:01.528+0100 I NETWORK  [listener] connection accepted from 10.17.7.1:57436 #778 (10 connections now open)",
            "2020-02-06T13:13:01.528+0100 I NETWORK  [conn778] end connection 10.17.7.1:57436 (9 connections now open)",
            "2020-02-06T13:13:29.556+0100 I NETWORK  [listener] connection accepted from 10.18.7.1:57688 #779 (10 connections now open)",
            "2020-02-06T13:13:29.556+0100 I NETWORK  [conn779] end connection 10.18.7.1:57688 (9 connections now open)",
            "2020-02-06T13:13:42.831+0100 I REPL     [rsBackgroundSync] sync source candidate: server3.blabla.com:27017",
            "2020-02-06T13:13:42.831+0100 I REPL     [replication-0] We are too stale to use server3.blabla.com:27017 as a sync source. Blacklisting this sync source because our last fetched timestamp: Timestamp(1579543862, 2) is before their earliest timestamp: Timestamp(1580907197, 1889) for 1min until: 2020-02-06T13:14:42.831+0100",
            "2020-02-06T13:13:42.831+0100 I REPL     [replication-0] sync source candidate: server2.blabla.com:27017",
            "2020-02-06T13:13:42.832+0100 I REPL     [replication-0] We are too stale to use server2.blabla.com:27017 as a sync source. Blacklisting this sync source because our last fetched timestamp: Timestamp(1579543862, 2) is before their earliest timestamp: Timestamp(1580906717, 975) for 1min until: 2020-02-06T13:14:42.832+0100",
            "2020-02-06T13:13:42.832+0100 I REPL     [replication-0] could not find member to sync from",
            "2020-02-06T13:14:01.528+0100 I NETWORK  [listener] connection accepted from 10.17.7.1:49416 #780 (10 connections now open)",
            "2020-02-06T13:14:01.528+0100 I NETWORK  [conn780] end connection 10.17.7.1:49416 (9 connections now open)",
            "2020-02-06T13:14:03.631+0100 I NETWORK  [LogicalSessionCacheRefresh] Starting new replica set monitor for rs-graylog/server1.blabla.com:27017,server2.blabla.com:27017,server3.blabla.com:27017",
            "2020-02-06T13:14:03.634+0100 I NETWORK  [LogicalSessionCacheRefresh] Starting new replica set monitor for rs-graylog/server1.blabla.com:27017,server2.blabla.com:27017,server3.blabla.com:27017",
            "2020-02-06T13:14:03.634+0100 I NETWORK  [LogicalSessionCacheRefresh] Starting new replica set monitor for rs-graylog/server1.blabla.com:27017,server2.blabla.com:27017,server3.blabla.com:27017",
            "2020-02-06T13:14:25.404+0100 I NETWORK  [listener] connection accepted from 127.0.0.1:51316 #781 (10 connections now open)",
            "2020-02-06T13:14:25.405+0100 I NETWORK  [conn781] received client metadata from 127.0.0.1:51316 conn781: { application: { name: \"MongoDB Shell\" }, driver: { name: \"MongoDB Internal Client\", version: \"4.0.12\" }, os: { type: \"Linux\", name: \"Red Hat Enterprise Linux Server release 7.7 (Maipo)\", architecture: \"x86_64\", version: \"Kernel 3.10.0-1062.el7.x86_64\" } }",
            "2020-02-06T13:14:29.555+0100 I NETWORK  [listener] connection accepted from 10.18.7.1:49578 #782 (11 connections now open)",
            "2020-02-06T13:14:29.555+0100 I NETWORK  [conn782] end connection 10.18.7.1:49578 (10 connections now open)"
    ],
    "ok" : 1,
    "operationTime" : Timestamp(1579543862, 2),
    "$clusterTime" : {
            "clusterTime" : Timestamp(1580991270, 7),
            "signature" : {
                    "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                    "keyId" : NumberLong(0)
            }
    }

}

Do you think this is a lost caused, or will it eventually finish recovering?
Would the restart of the mongod process help?

Lg,
theresa

@micsnare

I hope that you already have restarted the server/mongodb.

The adjustmend by ntpd that is done slowly can cause the problems. It is better to stop MongoDB, sync the time (maybe restart the server) and start MongoDB again.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.