Hello,
After more than a year of running a 3-node Graylog cluster with a 3-node MongoDB replica set, I have what seems like a MongoDB (or Graylog) problem.
Graylog is running on physical machines with Oracle Linux.
I was running 3.2.4 and upgraded nodes 1 and 2 to the latest version. After that, I lost all configuration on the upgraded nodes. It is possible that the MongoDB replica set was not in an ideal state before the upgrade, but no such error was seen.
The log files when starting Graylog on nodes 1 and 2 don't show any errors, not even warnings. But all config is empty: no Streams, no Inputs, no Users. The GUI Nodes page does show 2 nodes in the cluster, node 1 and node 2.
On node 3, the GUI Nodes page shows only node 3.
Funnily, it shows that Elasticsearch is OK, and it even shows a very small number of messages from the All messages stream.
Luckily I still have one node running 3.2.4, and there all config is still OK. That node 3 was not the cluster master, so I had to restart it and change it to master. It went well; after the restart it still has the whole config.
I suspect there is probably some problem in the MongoDB config. All 3 nodes connect to the MongoDB replica set with:
(I changed the IPs a bit from the real ones)
mongodb_uri = mongodb://192.158.20.100/graylog,192.158.20.101/graylog,192.158.20.102/graylog?replicaSet=reproduk
I tried to do some MongoDB investigation. If I log in to node 1, it shows me these 3 dbs:
reproduk:PRIMARY> show dbs
graylog 0.029GB
graylog,192 0.002GB
local 0.312GB
Especially this entry graylog,192 looks very suspicious to me; I haven't noticed it before.
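One possible reading of that name, as a hedged sketch only: the documented MongoDB connection string format lists all hosts, comma-separated, before a single /database part. If the URI above is instead split at its first slash, the "database" part becomes one long string, and cutting that at the first dot (namespaces are written db.collection) leaves exactly graylog,192. The function splitMongoUri below is hypothetical and only illustrates this split; it is not the driver's real parser, and whether the upgraded driver actually behaves this way is an assumption.

```javascript
// Naive illustration only: NOT the MongoDB driver's real parser.
// splitMongoUri is a hypothetical helper written for this post.
function splitMongoUri(uri) {
  var prefix = "mongodb://";
  if (uri.indexOf(prefix) !== 0) {
    throw new Error("not a mongodb:// URI");
  }
  var rest = uri.slice(prefix.length);

  // Separate the ?options part, if present.
  var q = rest.indexOf("?");
  var options = q === -1 ? "" : rest.slice(q + 1);
  if (q !== -1) {
    rest = rest.slice(0, q);
  }

  // Everything before the first "/" is treated as the host list,
  // everything after it as the database name.
  var slash = rest.indexOf("/");
  var hosts = (slash === -1 ? rest : rest.slice(0, slash)).split(",");
  var database = slash === -1 ? "" : rest.slice(slash + 1);
  return { hosts: hosts, database: database, options: options };
}

// The URI as configured on all three nodes.
var mine = splitMongoUri(
  "mongodb://192.158.20.100/graylog,192.158.20.101/graylog,192.158.20.102/graylog?replicaSet=reproduk"
);
// Assumption: the namespace is cut at the first "." into db and collection.
var dbPart = mine.database.split(".")[0];
console.log(mine.hosts);  // [ '192.158.20.100' ]
console.log(dbPart);      // graylog,192

// The documented layout: all hosts first, then one database.
var documented = splitMongoUri(
  "mongodb://192.158.20.100,192.158.20.101,192.158.20.102/graylog?replicaSet=reproduk"
);
console.log(documented.hosts.length, documented.database);  // 3 'graylog'
```

With the hosts grouped before the single /graylog, the same sketch yields three hosts and the database graylog.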
If I run the same command on the only node still running OK, node 3, I get an error:
reproduk:SECONDARY> show dbs
2020-09-21T14:27:51.126+0200 E QUERY [thread1] Error: listDatabases failed:{ "ok" : 0, "errmsg" : "not master and slaveOk=false", "code" : 13435 } :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
Mongo.prototype.getDBs@src/mongo/shell/mongo.js:62:1
shellHelper.show@src/mongo/shell/utils.js:769:19
shellHelper@src/mongo/shell/utils.js:659:15
@(shellhelp2):1:1
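That message is what the mongo shell returns for reads on a SECONDARY unless reads are explicitly allowed for the connection. A minimal sketch of a read-only look around, assuming the legacy mongo shell (where rs.slaveOk() exists; newer shells name it differently) and assuming the odd database has replicated to this member:

```javascript
// Run inside the mongo shell connected to node 3 (a SECONDARY).
// Allows reads on this shell connection only; it changes no data
// and no replica set configuration.
rs.slaveOk()

// Same listing as "show dbs", via the shell's connection object.
db.getMongo().getDBNames()

// Peek into the suspicious database without switching to it.
// The name is taken from the "show dbs" output on node 1.
db.getSiblingDB("graylog,192").getCollectionNames()
```

None of these calls write anything, so they should be safe to try on the one node that still has the config.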
But if I run commands like rs.conf() or rs.status(), I get practically the same working result on both node 1 and node 3:
reproduk:SECONDARY> rs.status()
{
    "set" : "reproduk",
    "date" : ISODate("2020-09-21T12:37:04.748Z"),
    "myState" : 2,
    "term" : NumberLong(65),
    "syncingTo" : "192.158.20.100:27017",
    "heartbeatIntervalMillis" : NumberLong(2000),
    "members" : [
        {
            "_id" : 0,
            "name" : "192.158.20.100:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 2625358,
            "optime" : {
                "ts" : Timestamp(1600691823, 20),
                "t" : NumberLong(65)
            },
            "optimeDate" : ISODate("2020-09-21T12:37:03Z"),
            "lastHeartbeat" : ISODate("2020-09-21T12:37:03.658Z"),
            "lastHeartbeatRecv" : ISODate("2020-09-21T12:37:03.128Z"),
            "pingMs" : NumberLong(0),
...
Any pointers on how I could continue my debugging?
Maybe by deleting this graylog,192 database?
I am a little cautious with any work in MongoDB, because I would not like to make things worse on the only running node, node 3. I made several MongoDB backups and also a content pack “backup” from node 3 to have the Graylog config saved.
Thanks