Logical restore fails with E11000 duplicate key error collection: config.chunks

Description

Problem description

Full error:

 

Steps to reproduce:

Using the provided backup, first restore to the latest timestamp, then restore to the point-in-time from the logs.

So the root cause of the problem is the state of the target cluster before the restoration

Acceptance criteria

Environment

None

Attachments

2

Activity

Oksana Grishchenko 
February 5, 2025 at 9:59 AM

Thank you for the investigation! We’ll check with the Cloud team whether the restoration process follows the docs. It’s great to know that it’s not a PBM bug, which for Everest means it could be fixed sooner.

Oleksandr Havryliak 
February 5, 2025 at 9:07 AM

I wasn’t able to reproduce this issue from scratch using my own data. It can only be reproduced with the provided backup, and only under certain conditions.

From the parsed data I can conclude that the cause of the issue is MongoDB’s AutoMerger, which started its job exactly in the middle of the restore process. This also means that MongoDB’s balancer had not been stopped before the restore.

PBM docs advise users to stop the balancer and all mongos instances before the restore to avoid such issues. I suggest you check whether Everest follows all of those steps for a logical restore.
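
As an illustration of the balancer step only, here is a minimal Python sketch of a pre-restore check. It assumes a client object that exposes `admin.command(...)` the way pymongo's `MongoClient` does and that is connected to a mongos. The helper name `stop_balancer_and_verify` is hypothetical and is not taken from PBM or Everest. `balancerStop` and `balancerStatus` are standard MongoDB admin commands. Whether stopping the balancer also pauses the AutoMerger depends on the MongoDB version and should be checked against the MongoDB documentation for the version in use.

```python
from typing import Any, Dict


def stop_balancer_and_verify(client: Any) -> Dict[str, Any]:
    """Stop the balancer through mongos and confirm that it is off.

    `client` is any object exposing `client.admin.command(name)` in the
    same way pymongo's MongoClient does (assumption: connected to mongos).
    """
    # Ask the cluster to stop the balancer; MongoDB waits for an
    # in-progress balancing round to finish before returning.
    client.admin.command("balancerStop")

    # Read the state back instead of trusting the stop call alone.
    status = client.admin.command("balancerStatus")
    if status.get("mode") != "off" or status.get("inBalancerRound"):
        raise RuntimeError(f"balancer is still active: {status}")
    return status
```

With pymongo installed, this could be called as `stop_balancer_and_verify(MongoClient(uri))` before starting the restore, where `uri` is a placeholder for the mongos connection string. Failing loudly here is a deliberate choice: aborting the restore is cheaper than hitting a duplicate key error in config.chunks halfway through.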

Oksana Grishchenko 
January 30, 2025 at 2:30 PM

Hi, thank you for providing the details!

So the root cause of the problem is the state of the target cluster before the restoration

Does it mean that the duplicated key appeared after the last backup but before the restoration time? That would explain why you were able to restore from the backup while the point-in-time restore failed.

But in fact there were no changes in the DB between the last backup and the point-in-time, so why did the duplication suddenly appear? Any ideas on what could be done to prevent that?

Oleksandr Havryliak 
January 30, 2025 at 1:28 PM

Oleksandr Havryliak 
January 30, 2025 at 12:42 PM

UPD: finally reproduced the issue

STR: Using the provided backup, first restore to the latest timestamp, then restore to the point-in-time from the logs.

So the root cause of the problem is the state of the target cluster before the restoration

Not a Bug

Details

Found by Automation

Yes

Created February 29, 2024 at 8:12 AM
Updated February 5, 2025 at 1:51 PM
Resolved February 5, 2025 at 9:07 AM