Newly created instance occasionally ends up with two replica set IDs

Description

Phenomenon:

Occasionally a newly created instance never enters the ready state, and the log shows "Error in heartbeat (requestId: 7994) to j2-j1-0-2.j2-j1-0.zxl-mongo.svc.cluster.local:27017, response status: InvalidReplicaSetConfig: replica set IDs do not match, ours: 61cacf1f471fc92488836164; remote node's: 61cacf3a6c41a4c95e261555".
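To spot this condition across many log lines, a small parser can pull the two replica set IDs out of the heartbeat error. This is only a sketch: the regular expression is derived from the single log line quoted above, and the names `parse_mismatch` and `ReplSetIdMismatch` are hypothetical helpers, not part of mongod or the operator.

```python
import re
from typing import NamedTuple, Optional

# Pattern derived from the one heartbeat error quoted in this report;
# other server versions may word the message differently.
_MISMATCH = re.compile(
    r"to (?P<remote>\S+), response status: InvalidReplicaSetConfig: "
    r"replica set IDs do not match, "
    r"ours: (?P<ours>[0-9a-f]{24}); remote node's: (?P<theirs>[0-9a-f]{24})"
)


class ReplSetIdMismatch(NamedTuple):
    remote_host: str  # host:port the heartbeat was sent to
    our_id: str       # replica set ID of the node that wrote the log line
    remote_id: str    # replica set ID reported by the remote node


def parse_mismatch(line: str) -> Optional[ReplSetIdMismatch]:
    """Return the IDs from a replica-set-ID mismatch line, or None."""
    m = _MISMATCH.search(line)
    if m is None:
        return None
    return ReplSetIdMismatch(m["remote"], m["ours"], m["theirs"])
```

Running it over a mongod log and collecting the non-None results shows which pairs of nodes disagree about the replica set ID.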

 

Running rs.status() returns the following (abridged):

members:

  • name: j2-j1-0-0.j2-j1-0.zxl-mongo.svc.cluster.local:27017
    role: PRIMARY

  • name: j2-j1-0-1.j2-j1-0.zxl-mongo.svc.cluster.local:27017
    role: (not reachable/healthy)

  • name: j2-j1-0-2.j2-j1-0.zxl-mongo.svc.cluster.local:27017
    role: SECONDARY

 

The log from j2-j1-0-1 showed that it became a new primary in a new replica set. Some suspicious log lines are below:

From the log above, we can see that the config read from local.system.replset contains only the node itself (I guess), and that this node became a new primary in a new replica set.
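One way to confirm the split is to compare the replica set ID stored in each node's config. The sketch below assumes the output of rs.conf() has already been collected from every node (for example by running it in each pod) and passed in as plain dictionaries. The ID lives under `settings.replicaSetId` in the config document. The function names and the host names and IDs used when calling it are placeholders.

```python
from collections import defaultdict
from typing import Dict, List, Mapping


def group_by_replset_id(configs: Mapping[str, Mapping]) -> Dict[str, List[str]]:
    """Group hosts by the replicaSetId found in their rs.conf() document.

    `configs` maps a host name to that host's config document. Hosts whose
    config has no settings.replicaSetId are grouped under "unknown".
    """
    groups = defaultdict(list)
    for host, conf in configs.items():
        rs_id = str(conf.get("settings", {}).get("replicaSetId", "unknown"))
        groups[rs_id].append(host)
    return {rs_id: sorted(hosts) for rs_id, hosts in groups.items()}


def is_split(configs: Mapping[str, Mapping]) -> bool:
    """True when the hosts do not all agree on a single replica set ID."""
    return len(group_by_replset_id(configs)) > 1
```

In a healthy replica set every host lands in the same group. In the situation described here, the split-off node would be expected to appear alone under its own ID.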

 

Some facts about this phenomenon:

  1. It happens only occasionally, and I do not know what causes it.

  2. The instances are newly created with newly created PVs. After the first start, the split-off node gets restarted over and over again.

The full startup log, with debug logging enabled, is attached.

Environment

None

Attachments

3

Activity

Aaditya Dubey June 6, 2022 at 10:08 AM

Hi,

Thank you for the updates.
Closing the bug now.

Xiaolu January 12, 2022 at 3:36 AM

This bug can be closed for now.

Xiaolu January 12, 2022 at 3:35 AM

This bug is caused by network latency. When a new instance is created, the operator cannot ping mongod successfully, so it fails to create the system user. The operator then erroneously creates a second replica set, which leaves two replica sets coexisting. I fixed it by increasing the timeout from 2 seconds to 10 seconds, as the main branch does.
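The effect of the timeout change can be illustrated with a generic wait-for-ping loop. This is not the operator's code. It is a minimal sketch with hypothetical names (`wait_for_ping`, `ping`, `timeout_s`, `interval_s`), showing how a mongod that needs a few seconds to become reachable fails a 2-second deadline but passes a 10-second one.

```python
import time
from typing import Callable


def wait_for_ping(
    ping: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `ping` repeatedly until it succeeds or `timeout_s` elapses.

    `ping` is a hypothetical callable that returns True once mongod answers.
    `clock` and `sleep` are injectable so the loop can be tested without
    real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if ping():
            return True
        # Stop if another sleep would carry us past the deadline.
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)
```

Whatever the timeout, a False result means only that the node was not reachable yet. Based on the cause described above, treating it as grounds to initialize a replica set again is presumably what produced the second replica set ID.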

Xiaolu December 28, 2021 at 11:11 AM

What is weird is that when I exec into the mongod container of j2-j1-0-1 and try to check rs.status(), I get the error below:

And the account I used is 100% correct.

Xiaolu December 28, 2021 at 11:05 AM

This bug is not related to the server version. It happens with both 4.2.17 and 4.4.6.

Done

Created December 28, 2021 at 11:00 AM
Updated June 6, 2022 at 10:08 AM
Resolved June 6, 2022 at 10:08 AM