Fix replaying oplog on system collections during the restore

Description

I have been investigating a recent failure to restore a MongoDB backup to a different cluster using PBM 1.6.0. It appears PBM attempted to replay a change to the admin.system.users collection, which failed because the collection did not exist with the expected UUID.

 

2022-06-02T18:15:10Z E [rs0/XXX:27017] [pitrestore/2022-05-31T11:15:05Z] restore: replay chunk 1653708638.1653719417: apply oplog for chunk: applying an entry: applyOps: (NamespaceNotFound) Failed to apply operation due to missing collection (dbab81d0-060c-4633-958d-8a1618ce6310): { ts: Timestamp(1653717902, 1), t: 76, h: -5679716188828143268, v: 2, op: "i", ns: "admin.system.users", o: { _id: "$external.CN=XXX", userId: UUID("29ce3a40-513e-40c2-b46c-e091c9c615fb"), user: "CN=XXX", db: "$external", credentials: { external: true }, roles: [ { role: "readAnyDatabase", db: "admin" }, { role: "clusterMonitor", db: "admin" } ] }, ui: UUID("dbab81d0-060c-4633-958d-8a1618ce6310") }

 

I can see in restore/restore.go that the order of restore is:

  1. PBM restores the backup, setting mr.SkipUsersAndRoles = true

  2. The function swapUsers restores users and roles from the pbmRUsers and pbmRRoles collections, skipping anything related to the currentUser

  3. PBM replays the oplog, filtering out entries for collections excluded from the restore.

I cannot see anywhere that would cause modifications to the admin.system.users collection to be excluded from the oplog replay – it is not included in excludeFromRestore.
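
To make the question concrete, here is a minimal sketch of namespace-based filtering during replay. It is not PBM's code: `EXCLUDED_NAMESPACES`, `is_excluded`, and `filter_entries` are hypothetical names, and the exclusion list holds assumed example values rather than the real contents of excludeFromRestore. It shows that an entry for admin.system.users passes straight through to the replay unless its namespace matches an excluded pattern.

```python
from fnmatch import fnmatchcase

# Hypothetical exclusion list, for illustration only;
# this is NOT the actual content of PBM's excludeFromRestore.
EXCLUDED_NAMESPACES = ["config.cache.*", "admin.pbmBackups"]


def is_excluded(ns, patterns=EXCLUDED_NAMESPACES):
    """Return True if the namespace matches any excluded pattern."""
    return any(fnmatchcase(ns, pattern) for pattern in patterns)


def filter_entries(entries, patterns=EXCLUDED_NAMESPACES):
    """Keep only the oplog entries whose namespace is not excluded."""
    return [e for e in entries if not is_excluded(e["ns"], patterns)]


# Simplified oplog entries (only the fields needed for the example).
entries = [
    {"op": "i", "ns": "admin.system.users"},
    {"op": "u", "ns": "config.cache.chunks.test.c"},
    {"op": "i", "ns": "app.orders"},
]

kept = filter_entries(entries)
```

With this list, the admin.system.users insert stays in `kept` and would be replayed, which is consistent with the behaviour described above.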

I know that the above transaction was created in the Mongo Shell using:

use admin
db.getSiblingDB("$external").createUser({ user: "CN=XX", roles: [ {role: "readAnyDatabase", db: "admin"}, {role: "clusterMonitor", db: "admin"} ] });

 

The original system where the error was observed is now restoring using a later snapshot that does not include the application upgrade, which is a multi-day process due to the number of indexes.

I have tried to reproduce the failure in a simpler and quicker lab scenario, but have so far been unsuccessful.

I suspect there is a bug in an edge case that I haven't been able to fully identify. However, I am also unclear about what the expected behaviour is. If users and roles are handled specially as part of the snapshot restore, should changes to them in the oplog be replayed as expected?

 

Possibly related: https://perconadev.atlassian.net/browse/PBM-659#icft=PBM-659

 

Environment

None

Activity

andrew.pogrebnoi June 27, 2022 at 1:56 PM

Hi,

Thanks for the report. It appears we shouldn't preserve the UUID during the oplog replay for those collections.

The fix is merged into the main branch.
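
One way to picture the change described here is to drop the `ui` (collection UUID) field from entries that target these collections before they are applied, so that the target is resolved by `ns` instead. The following is a minimal sketch of that idea, not the merged fix: `NO_UUID_NAMESPACES` and `strip_uuid` are hypothetical names, and the namespace list is an assumption based on the collections mentioned in this ticket plus admin.system.roles.

```python
# Assumed set of namespaces whose UUID should not be preserved on replay.
# Hypothetical; the real fix may use a different list or mechanism.
NO_UUID_NAMESPACES = {
    "admin.system.users",
    "admin.system.roles",
    "admin.system.keys",
}


def strip_uuid(entry, namespaces=NO_UUID_NAMESPACES):
    """Return a copy of the entry without 'ui' if it targets a listed namespace.

    Entries for other namespaces are returned unchanged.
    """
    if entry.get("ns") in namespaces and "ui" in entry:
        return {key: value for key, value in entry.items() if key != "ui"}
    return entry


# Simplified version of the failing entry from the report.
users_entry = {
    "op": "i",
    "ns": "admin.system.users",
    "ui": "dbab81d0-060c-4633-958d-8a1618ce6310",
    "o": {"_id": "$external.CN=XXX"},
}
other_entry = {"op": "i", "ns": "app.orders", "ui": "some-uuid", "o": {"_id": 1}}

cleaned = strip_uuid(users_entry)
untouched = strip_uuid(other_entry)
```

The function returns a copy rather than mutating its input, so the original oplog chunk stays intact if it needs to be inspected after a failed replay.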

Daniel Oliver June 8, 2022 at 12:52 PM

 

I just encountered another oplog replay failure, this time applying operations to admin.system.keys.  I'm not sure which operation in our upgrade generated this entry.

I don't have sufficient understanding of the innards of MongoDB to know if these collections (or all of admin.system.*) should just be filtered from a restore?

2022-06-08T11:20:26Z E [rs0/XXX:27017] [pitrestore/2022-06-07T08:15:03Z] restore: replay chunk 1654533816.1654544592: apply oplog for chunk: applying an entry: applyOps: (NamespaceNotFound) Failed to apply operation due to missing collection (f004b29e-c517-4d6a-b07e-42644335e073): { ts: Timestamp(1654543614, 2), t: 6, h: -8753832872775161847, v: 2, op: "i", ns: "admin.system.keys", o: { _id: 7106210711935647745, purpose: "HMAC", key: BinData(0, XXX), expiresAt: Timestamp(1670088827, 0) }, ui: UUID("f004b29e-c517-4d6a-b07e-42644335e073") }

 

 

Daniel Oliver June 7, 2022 at 6:39 PM

Sorry, I can now confirm I've managed to reproduce the problem. My earlier attempts did not reproduce it because I had failed to completely reset my MongoDB data directory before restoring, so the users collection had the same UUID. Steps to reproduce:

  1. Take a full snapshot backup using PBM

  2. Insert user (I used the createUser macro, as above)

  3. Allow oplog entry to be saved by PBM

  4. Completely reset MongoDB (I used rm -rf * in the data directory)

  5. Re-create the replicaset configuration and re-configure PBM

  6. Run a restore to the latest oplog entry.  A replay error occurs.
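
The dependence on a full reset in step 4 can be illustrated with a toy model. This is a sketch under stated assumptions, not MongoDB's implementation: `Catalog`, `apply_insert`, and `NamespaceNotFound` are hypothetical names, and the model assumes only that an entry carrying `ui` is resolved by collection UUID, which matches the NamespaceNotFound message in the logs in this ticket.

```python
import uuid


class NamespaceNotFound(Exception):
    """Raised when an entry's UUID does not match any known collection."""


class Catalog:
    """Toy collection catalog: maps collection UUIDs to namespaces."""

    def __init__(self):
        self.by_uuid = {}
        self.docs = {}

    def create(self, ns):
        """Create a collection with a freshly generated UUID."""
        coll_uuid = str(uuid.uuid4())
        self.by_uuid[coll_uuid] = ns
        self.docs.setdefault(ns, [])
        return coll_uuid


def apply_insert(catalog, entry):
    """Apply an insert entry, resolving the target by UUID when 'ui' is set."""
    if "ui" in entry:
        ns = catalog.by_uuid.get(entry["ui"])
        if ns is None:
            raise NamespaceNotFound(f"missing collection ({entry['ui']})")
    else:
        ns = entry["ns"]
        catalog.docs.setdefault(ns, [])
    catalog.docs[ns].append(entry["o"])


# Source cluster: the oplog entry records the UUID of admin.system.users.
source = Catalog()
old_uuid = source.create("admin.system.users")
entry = {"op": "i", "ns": "admin.system.users", "ui": old_uuid,
         "o": {"_id": "$external.CN=XXX"}}

# Data directory not reset: the collection keeps its UUID, so replay succeeds.
apply_insert(source, entry)

# Fresh cluster: same namespace, but a newly generated UUID.
fresh = Catalog()
fresh.create("admin.system.users")
try:
    apply_insert(fresh, entry)
    failed = False
except NamespaceNotFound:
    failed = True
```

In this model the replay fails only on the fresh catalog, which mirrors why the error appears after a complete reset but not when the old data directory is left in place.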

Done

Created June 7, 2022 at 6:09 PM
Updated March 5, 2024 at 6:52 PM
Resolved June 27, 2022 at 1:57 PM