Fallback dbpath for physical restore
Description
Environment
None
Activity

Boris Ilijic March 5, 2025 at 4:08 PM
Added follow-up improvements ticket based on the above comments:

Boris Ilijic February 27, 2025 at 12:30 PM (Edited)
Thanks to both of you.
What should the default for “--allow-partial-restore” be?

Boris Ilijic February 27, 2025 at 10:48 AM (Edited)
Please review this ticket, especially the 2b) logic: is it better to do that, or to simply roll back to the state before the restore (fallbacksync, which is effectively the same as 2c)?
Details
Assignee

Reporter

Needs QA
Yes
Needs Doc
Yes
Story Points
8
Components
Sprint
Priority
Created February 26, 2025 at 6:29 PM
Updated March 7, 2025 at 1:33 PM
Problem description
During a physical restore, pbm-agent manipulates directories and files within mongod's dbpath directory. If an unexpected error occurs during the physical restore phase, there is a high chance that the mongod instance cannot be restarted, because the files within dbpath are left in an inconsistent state by the error and the unfinished restore procedure.
If there is a problem with the backup data files, a network issue towards the backup storage, or an unexpected PBM issue during the restore procedure, there is a high chance that the whole RS, the shard, or the whole cluster goes down without the possibility of restarting it. When that happens, PBM is no longer functional, because PBM uses MongoDB as its communication channel and metadata storage, and MongoDB therefore represents a single point of failure for the PBM system.
Solution proposition
Make PBM more resilient during the physical restore by introducing a fallback dbpath.
Before doing any file operation within dbpath, PBM will store all content of the dbpath dir (all files and subdirs) in the .fallbacksync directory. By doing that, PBM will have an alternative version of the dbpath content that can be used in case of an error during the physical restore procedure.
During the physical restore procedure, PBM will apply the following additional logic related to the fallback dbpath:
1. Just before the content of the dbpath directory is wiped out, PBM will move all of it into the dbpath/.fallbacksync dir. After that, the dbpath dir is ready for the backup files download.
2. In case of an error during the restore procedure, PBM will try to swap dbpath from the .fallbacksync directory using the following rules:
a) If the cluster is in status done or partly-done, .fallbacksync is not used, and it is deleted at the end of the restore procedure.
b) If the cluster is partly done, .fallbacksync is not applied, and it is also not deleted on the nodes with an error. That allows the user to delete it or move it into the dbpath dir manually.
c) If the cluster is in error state (on at least one RS, all members are in error state), .fallbacksync is moved into the dbpath directory.
3. At the beginning of the restore procedure, the .fallbacksync dir is always wiped out.
Acceptance Criteria
The described solution should work for both RS and sharded clusters.
Additional improvements for checking the available free space (given that the content of dbpath may need to be stored twice) will be part of a follow-up ticket.
QA and Documentation
<What do we need from QA and Documentation team?>
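The fallback rules from the solution proposition can be illustrated with a minimal Python sketch. It is not PBM's implementation (PBM is written in Go). The function names, the ClusterStatus enum, and the returned strings are hypothetical and exist only for illustration. Only the .fallbacksync directory name and the done / partly-done / error rules come from this ticket.

```python
import shutil
from enum import Enum
from pathlib import Path

FALLBACK_DIR = ".fallbacksync"  # directory name taken from the ticket


class ClusterStatus(Enum):
    """Hypothetical names for the restore outcomes described in the ticket."""
    DONE = "done"
    PARTLY_DONE = "partly-done"
    ERROR = "error"


def _remove(entry: Path) -> None:
    """Delete a file, symlink, or directory tree."""
    if entry.is_dir() and not entry.is_symlink():
        shutil.rmtree(entry)
    else:
        entry.unlink()


def move_to_fallback(dbpath: Path) -> Path:
    """Wipe any stale .fallbacksync, then move all dbpath content into it."""
    fallback = dbpath / FALLBACK_DIR
    if fallback.exists():
        shutil.rmtree(fallback)
    fallback.mkdir()
    for entry in list(dbpath.iterdir()):
        if entry.name != FALLBACK_DIR:
            shutil.move(str(entry), str(fallback / entry.name))
    return fallback


def roll_back(dbpath: Path) -> None:
    """Drop partially restored files and move the saved content back."""
    fallback = dbpath / FALLBACK_DIR
    for entry in list(dbpath.iterdir()):
        if entry.name != FALLBACK_DIR:
            _remove(entry)
    for entry in list(fallback.iterdir()):
        shutil.move(str(entry), str(dbpath / entry.name))
    fallback.rmdir()


def finish_restore(dbpath: Path, status: ClusterStatus, node_failed: bool) -> str:
    """Apply the ticket's rules once the cluster-wide status is known."""
    if status is ClusterStatus.ERROR:
        roll_back(dbpath)  # error state: swap dbpath from .fallbacksync
        return "rolled-back"
    if status is ClusterStatus.PARTLY_DONE and node_failed:
        return "kept"  # left in place for manual cleanup or recovery
    shutil.rmtree(dbpath / FALLBACK_DIR, ignore_errors=True)
    return "deleted"  # done, or partly-done on a healthy node
```

In this sketch the decision is taken per node after the cluster-wide status is known, which is an assumption about how the rules would be evaluated. Because the content is moved rather than copied, no extra space is used at that moment, but the downloaded backup files then sit next to the saved content, which is why the free-space check mentioned in the acceptance criteria matters.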