Logical restore fails with E11000 duplicate key error collection: config.chunks
Description
Environment
Attachments
causes
relates to
Activity
Oksana Grishchenko February 5, 2025 at 9:59 AM
Thank you for the investigation! We’ll check with the Cloud team whether the restoration process follows the docs. It’s great to know that it’s not a PBM bug, which in terms of Everest means it could be fixed sooner.
Oleksandr Havryliak February 5, 2025 at 9:07 AM
I wasn’t able to reproduce this issue from scratch using my own data; it can only be reproduced with the provided backup, and only under certain conditions.
From the parsed data I can conclude that the cause of the issue is MongoDB’s AutoMerger, which started its job exactly in the middle of the restore process. This also means that MongoDB’s balancer had not been stopped before the restore.
The PBM docs advise users to stop the balancer and all mongos instances before the restore to avoid such issues. I suggest you check whether Everest follows all of those steps for a logical restore.
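The pre-restore steps mentioned above can be sketched roughly as follows. This is a hypothetical sequence based on the documented requirement to stop the balancer and shut down all mongos instances before a logical restore; the host name is a placeholder, not a value from this ticket.

```shell
# Stop the balancer and verify its state (run against a mongos);
# the connection string is a placeholder.
mongosh "mongodb://mongos.example:27017" \
  --eval 'sh.stopBalancer(); sh.getBalancerState()'

# Shut down every mongos instance so no router can route writes
# (repeat for each mongos in the cluster).
mongosh "mongodb://mongos.example:27017/admin" \
  --eval 'db.shutdownServer()'
```

With the balancer stopped, the AutoMerger cannot rewrite `config.chunks` mid-restore, which is the collision described in this ticket.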
Oksana Grishchenko January 30, 2025 at 2:30 PM
Hi, thank you for providing the details!
So the root cause of the problem is the state of the target cluster before the restoration.
Does it mean that the duplicate key appeared after the last backup but before the restoration time? That would explain why you were able to restore from the backup, but the point-in-time restore failed.
But in fact there were no changes in the DB between the last backup and the point in time, so why did the duplication suddenly appear? Any ideas on what could be done to prevent that?
Oleksandr Havryliak January 30, 2025 at 1:28 PM
Oleksandr Havryliak January 30, 2025 at 12:42 PM
UPD: finally reproduced the issue
STR: Using the provided backup, first restore to the latest timestamp and then restore to the point-in-time from the logs.
So the root cause of the problem is the state of the target cluster before the restoration.
Problem description
Full error:
Steps to reproduce:
Using the provided backup, first restore to the latest timestamp and then restore to the point-in-time from the logs.
So the root cause of the problem is the state of the target cluster before the restoration.
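The two-step restore above might look like the following with the PBM CLI. This is a sketch only: the backup name and timestamp are placeholders, not the actual values from the provided backup or logs.

```shell
# Step 1: restore the snapshot backup (backup name is a placeholder;
# list available backups with `pbm list`).
pbm restore 2025-01-28T10:00:00Z

# Step 2: after the first restore completes, run a point-in-time
# restore to a moment taken from the logs (placeholder timestamp).
pbm restore --time="2025-01-28T10:05:00"
```

The second restore is the one that fails with the E11000 duplicate key error on `config.chunks` when the target cluster's balancer was left running.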
Acceptance criteria