Fix replaying oplog on system collections during the restore

Description

I have been investigating a recent failure to restore a MongoDB backup to a different cluster using PBM 1.6.0. It appears PBM attempted to replay a change to the admin.system.users collection, which failed because no collection with the expected UUID existed on the target.

2022-06-02T18:15:10Z E [rs0/XXX:27017] [pitrestore/2022-05-31T11:15:05Z] restore: replay chunk 1653708638.1653719417: apply oplog for chunk: applying an entry: applyOps: (NamespaceNotFound) Failed to apply operation due to missing collection (dbab81d0-060c-4633-958d-8a1618ce6310): { ts: Timestamp(1653717902, 1), t: 76, h: -5679716188828143268, v: 2, op: "i", ns: "admin.system.users", o: { _id: "$external.CN=XXX", userId: UUID("29ce3a40-513e-40c2-b46c-e091c9c615fb"), user: "CN=XXX", db: "$external", credentials: { external: true }, roles: [ { role: "readAnyDatabase", db: "admin" }, { role: "clusterMonitor", db: "admin" } ] }, ui: UUID("dbab81d0-060c-4633-958d-8a1618ce6310") }

I can see in restore/restore.go that the order of restore is:

1. PBM restores the backup, setting mr.SkipUsersAndRoles = true.
2. The function swapUsers restores users and roles from the pbmRUsers and pbmRRoles collections, skipping anything related to the currentUser.
3. PBM replays the oplog, filtering out entries related to collections excluded from the restore.

I cannot see anywhere that would cause modifications to the admin.system.users collection to be excluded from the oplog replay – it is not included in excludeFromRestore.

I know that the above transaction was created in the Mongo Shell using:

use admin
db.getSiblingDB("$external").createUser({
  user: "CN=XX",
  roles: [ {role: "readAnyDatabase", db: "admin"}, {role: "clusterMonitor", db: "admin"} ]
});

The original system where the error was observed is now restoring from a later snapshot that does not include the application upgrade, which is a multi-day process due to the number of indexes.

I have tried to reproduce the failure in a simpler and quicker lab scenario, but have so far been unsuccessful.

I suspect there is a bug in an edge case that I haven't been able to fully identify. However, I am also unclear about exactly what the expected behaviour is. If users and roles are handled specially as part of the snapshot restore, should changes to them in the oplog be replayed as normal?

Possibly related: https://perconadev.atlassian.net/browse/PBM-659#icft=PBM-659

Environment

None

Linked issues

relates to PBM-841: apply oplogs failed when grant roles to user

Activity

andrew.pogrebnoi June 27, 2022 at 1:56 PM

Hi @Daniel Oliver,

Thanks for the report. It appears we shouldn't preserve the UUID during the oplog replay for those collections.

The fix is merged into the main branch.
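
To illustrate the idea of not preserving the UUID, here is a minimal sketch in Go. It is not the merged PBM code: the `OplogEntry` type, the `skipUUIDFor` set, and the `prepareForReplay` function are hypothetical names, and the list of namespaces is assumed from the two collections reported in this ticket. The point is only that once the `ui` value is dropped, the entry is applied by namespace rather than by the source cluster's collection UUID.

```go
package main

import "fmt"

// OplogEntry is a simplified, hypothetical stand-in for an oplog record.
// Only the fields relevant to this ticket are modelled.
type OplogEntry struct {
	Op string // operation type, e.g. "i" for insert
	NS string // namespace, e.g. "admin.system.users"
	UI string // collection UUID as text; empty means "not set"
}

// skipUUIDFor lists namespaces whose UUID should not be carried into the
// replay. The contents are an assumption based on the collections named in
// this ticket, not a copy of PBM's actual list.
var skipUUIDFor = map[string]bool{
	"admin.system.users": true,
	"admin.system.keys":  true,
}

// prepareForReplay returns a copy of e with the UUID cleared when the
// namespace is in skipUUIDFor. All other entries are returned unchanged.
func prepareForReplay(e OplogEntry) OplogEntry {
	if skipUUIDFor[e.NS] {
		e.UI = ""
	}
	return e
}

func main() {
	e := OplogEntry{
		Op: "i",
		NS: "admin.system.users",
		UI: "dbab81d0-060c-4633-958d-8a1618ce6310",
	}
	out := prepareForReplay(e)
	fmt.Printf("ns=%s ui=%q\n", out.NS, out.UI)
}
```

Because `prepareForReplay` takes the entry by value, the caller's original record is left untouched, which keeps the sketch easy to test.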

Daniel Oliver June 8, 2022 at 12:52 PM

I just encountered another oplog replay failure, this time applying operations to admin.system.keys. I'm not sure which operation in our upgrade generated this entry.

I don't have sufficient understanding of the innards of MongoDB to know whether these collections (or all of admin.system.*) should just be filtered from a restore.

2022-06-08T11:20:26Z E [rs0/XXX:27017] [pitrestore/2022-06-07T08:15:03Z] restore: replay chunk 1654533816.1654544592: apply oplog for chunk: applying an entry: applyOps: (NamespaceNotFound) Failed to apply operation due to missing collection (f004b29e-c517-4d6a-b07e-42644335e073): { ts: Timestamp(1654543614, 2), t: 6, h: -8753832872775161847, v: 2, op: "i", ns: "admin.system.keys", o: { _id: 7106210711935647745, purpose: "HMAC", key: BinData(0, XXX), expiresAt: Timestamp(1670088827, 0) }, ui: UUID("f004b29e-c517-4d6a-b07e-42644335e073") }
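
When reading a log line like the one above, it can help to confirm that the failing entry's timestamp really falls inside the chunk being replayed. The Go sketch below does that check. It assumes the chunk label has the form `start.end` in Unix seconds, which is inferred from the log output in this ticket rather than taken from PBM documentation; `chunkRange` and `parseChunk` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// chunkRange holds the two numbers from a chunk label such as
// "1654533816.1654544592". Treating them as start and end Unix seconds
// is an assumption drawn from the log lines in this ticket.
type chunkRange struct {
	Start, End int64
}

// parseChunk splits a "start.end" label into its two integer parts.
func parseChunk(label string) (chunkRange, error) {
	parts := strings.SplitN(label, ".", 2)
	if len(parts) != 2 {
		return chunkRange{}, fmt.Errorf("chunk label %q: want \"start.end\"", label)
	}
	start, err := strconv.ParseInt(parts[0], 10, 64)
	if err != nil {
		return chunkRange{}, fmt.Errorf("chunk start: %w", err)
	}
	end, err := strconv.ParseInt(parts[1], 10, 64)
	if err != nil {
		return chunkRange{}, fmt.Errorf("chunk end: %w", err)
	}
	return chunkRange{Start: start, End: end}, nil
}

// contains reports whether ts (Unix seconds) lies within the chunk, inclusive.
func (c chunkRange) contains(ts int64) bool {
	return ts >= c.Start && ts <= c.End
}

func main() {
	c, err := parseChunk("1654533816.1654544592")
	if err != nil {
		panic(err)
	}
	// 1654543614 is the seconds part of the failing entry's ts field.
	fmt.Println(c.contains(1654543614)) // prints "true"
}
```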

Daniel Oliver June 7, 2022 at 6:39 PM

Sorry, I can now confirm I've managed to reproduce the problem. In my earlier attempts I failed to completely reset my MongoDB data directory before restoring, so the users collection kept the same UUID and the error did not occur. Steps to reproduce:

1. Take a full snapshot backup using PBM.
2. Insert a user (I used the createUser helper, as above).
3. Allow the oplog entry to be saved by PBM.
4. Completely reset MongoDB (I used rm -rf * in the data directory).
5. Re-create the replica set configuration and re-configure PBM.
6. Run a restore to the latest oplog entry. A replay error occurs.

Done

Details
Assignee: andrew.pogrebnoi
Reporter: Daniel Oliver
Fix versions: 1.8.1
Affects versions: 1.6.0
Priority: High

Created June 7, 2022 at 6:09 PM

Updated March 5, 2024 at 6:52 PM

Resolved June 27, 2022 at 1:57 PM
