Physical restore assumes a functioning cluster
General
Escalation
Description
Environment
None
20% Done
Activity

Boris Ilijic February 27, 2025 at 10:41 AM
Converted it to the epic and linked
Added child tickets to this epic: & .

radoslaw.szulgo September 12, 2024 at 10:12 AM
Moved under the

Jan Mynar September 12, 2024 at 10:02 AM
Please create an incubator item for this improvement, and take this topic to the MongoDB leaders forum.

Aaditya Dubey June 7, 2024 at 5:30 AM
Hi
Thank you for the report and feedback.
Problem description
With the current design, PBM cannot restore a physical backup if the target cluster is broken (e.g., one of the shards is unhealthy or has no primary). In a real-world scenario, we want to be able to restore a cluster without having to fix anything first: as long as the target hosts themselves are healthy, the restore should succeed.
With physical backups, the agents already coordinate with each other through the backup storage (e.g., an S3 bucket) by creating files there. A physical restore should therefore not rely on the MongoDB replication mechanism or node states in any way.
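The file-based coordination described above can be sketched as follows. This is a minimal illustration, not PBM code: it assumes a local directory stands in for the backup storage, and the marker-file layout (restore/node.phase) and the functions mark_phase and pending_nodes are hypothetical. It shows that an agent can tell which nodes have finished a phase purely by listing storage, without asking MongoDB about replica set or node state.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: <storage>/<restore_name>/<node>.<phase>
# A local directory stands in for the backup storage (e.g. an S3 bucket).


def mark_phase(storage: Path, restore: str, node: str, phase: str) -> None:
    """Record that `node` has reached `phase` by creating an empty marker file."""
    restore_dir = storage / restore
    restore_dir.mkdir(parents=True, exist_ok=True)
    (restore_dir / f"{node}.{phase}").touch()


def pending_nodes(storage: Path, restore: str, nodes: list[str], phase: str) -> list[str]:
    """Return the nodes that have not yet written a marker for `phase`.

    Only the storage contents are consulted; no MongoDB connection,
    replication state, or primary election is involved.
    """
    restore_dir = storage / restore
    return [n for n in nodes if not (restore_dir / f"{n}.{phase}").exists()]


if __name__ == "__main__":
    nodes = ["rs0-a", "rs0-b", "rs1-a"]
    with tempfile.TemporaryDirectory() as tmp:
        storage = Path(tmp)
        mark_phase(storage, "restore-1", "rs0-a", "done")
        mark_phase(storage, "restore-1", "rs1-a", "done")
        print(pending_nodes(storage, "restore-1", nodes, "done"))  # ['rs0-b']
```

Because the only shared dependency is the storage listing, a node whose replica set has no primary can still report progress, which is the property this proposal relies on.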
Solution proposition
Introduce an “emergency mode” (ideally selected automatically) for when the target cluster is broken.
Users might not know what to check to determine whether a cluster is broken, or which mode to use.
The user is notified via logs/output that emergency mode is in use because the cluster is unhealthy, along with the reason.
PBM agents communicate with each other over HTTP instead of through the MongoDB database.
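A minimal sketch of the proposed HTTP communication between agents, assuming each agent exposes its restore state as JSON on a hypothetical /restore/status endpoint. The path, payload, and function names are illustrative and not part of PBM. Using only the Python standard library, one agent serves its state and a peer reads it directly, with no MongoDB connection involved.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical endpoint and payload; not part of PBM.
STATUS_PATH = "/restore/status"


def make_handler(state: dict):
    """Build a request handler that serves this agent's restore state as JSON."""

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != STATUS_PATH:
                self.send_error(404)
                return
            body = json.dumps(state).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # suppress per-request logging to keep the example quiet

    return StatusHandler


def start_agent(state: dict) -> ThreadingHTTPServer:
    """Start an agent status server on a free localhost port in a background thread."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), make_handler(state))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def peer_status(host: str, port: int, timeout: float = 2.0) -> dict:
    """Ask a peer agent for its restore state directly over HTTP."""
    url = f"http://{host}:{port}{STATUS_PATH}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    server = start_agent({"node": "rs0-b", "phase": "copying"})
    host, port = server.server_address
    print(peer_status(host, port))  # {'node': 'rs0-b', 'phase': 'copying'}
    server.shutdown()
    server.server_close()
```

A real design would also need authentication, retries, and agreed-upon addresses for the peers; the sketch only shows that status exchange does not have to go through the database.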
Acceptance Criteria
Physical restore works if a shard doesn’t have a primary instance
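The automatic mode selection from the solution proposition, combined with this acceptance criterion, could look roughly like the sketch below. Everything in it is an assumption for illustration: the Shard/Member snapshot, the state strings, and the rule that a missing primary or an unreachable member triggers emergency mode are hypothetical, not PBM's actual health checks. Instead of aborting on a shard without a primary, the function returns the chosen mode plus human-readable reasons, which the caller prints so the user sees why emergency mode was used.

```python
from dataclasses import dataclass, field

# Hypothetical snapshot of what an agent could observe before a restore.
# Field names, states, and rules are illustrative assumptions, not PBM's data model.


@dataclass
class Member:
    host: str
    state: str  # e.g. "PRIMARY", "SECONDARY", "DOWN"


@dataclass
class Shard:
    name: str
    members: list[Member] = field(default_factory=list)


def choose_restore_mode(shards: list[Shard]) -> tuple[str, list[str]]:
    """Pick "standard" or "emergency" mode and explain why.

    Emergency mode is selected automatically if any shard has no primary
    or has an unreachable member, so the user does not have to diagnose
    the cluster first.
    """
    reasons = []
    for shard in shards:
        if not any(m.state == "PRIMARY" for m in shard.members):
            reasons.append(f"shard {shard.name} has no primary")
        down = [m.host for m in shard.members if m.state == "DOWN"]
        if down:
            reasons.append(f"shard {shard.name} has unreachable members: {', '.join(down)}")
    return ("emergency" if reasons else "standard"), reasons


if __name__ == "__main__":
    cluster = [
        Shard("rs0", [Member("a:27017", "PRIMARY"), Member("b:27017", "SECONDARY")]),
        Shard("rs1", [Member("c:27017", "SECONDARY"), Member("d:27017", "DOWN")]),
    ]
    mode, reasons = choose_restore_mode(cluster)
    print(f"restore mode: {mode}")
    for reason in reasons:
        print(f"  reason: {reason}")
```

Returning the reasons alongside the mode keeps the decision and the user-facing explanation in one place, matching the requirement that the output state why the cluster was considered unhealthy.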
Documentation
In the restore procedure for physical backups, add a section on restoring in emergency mode:
The topic should describe the situations in which an emergency restore is needed.
It should describe how to run it.
It should describe any side effects or differences in results when using emergency mode versus a standard restore.