Fallback dbpath for physical restore

Description

Problem description

During a physical restore, pbm-agent manipulates directories and files directly within mongod's dbpath directory.


If an unexpected error occurs during the physical restore phase, there is a high chance that the mongod instance cannot be restarted, because the files within dbpath are left in an inconsistent state by the error and the unfinished restore procedure.
If there is a problem with the backup data files, a network issue towards the backup storage, or an unexpected PBM issue during the restore procedure, there is a high chance that the whole RS, the shard, or the whole cluster goes down without the possibility of restarting it. When that happens, PBM is no longer functional, because PBM uses MongoDB as its communication channel and metadata storage, so MongoDB is a single point of failure for the PBM system.

Proposed solution

Make PBM more resilient during the physical restore by introducing a fallback dbpath.

Before doing any file operation within dbpath, PBM will store the entire content of the dbpath dir (all files and subdirs) in the .fallbacksync directory. This gives PBM a preserved version of the dbpath content that can be used if an error occurs during the physical restore procedure.
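The stashing step can be sketched as follows. This is only an illustration in Python, not PBM's actual implementation (PBM itself is written in Go); the function name `stash_dbpath` is hypothetical, and the sketch assumes that .fallbacksync lives inside dbpath, so the move is a cheap rename on the same filesystem.

```python
import shutil
from pathlib import Path

FALLBACK_DIR = ".fallbacksync"  # directory name taken from this ticket


def stash_dbpath(dbpath: Path) -> Path:
    """Move everything in dbpath into dbpath/.fallbacksync (hypothetical helper)."""
    fallback = dbpath / FALLBACK_DIR
    # A stale fallback left over from an earlier restore is wiped first.
    if fallback.exists():
        shutil.rmtree(fallback)
    fallback.mkdir()
    # Snapshot the listing before moving, and never move the fallback dir into itself.
    for entry in list(dbpath.iterdir()):
        if entry.name == FALLBACK_DIR:
            continue
        shutil.move(str(entry), str(fallback / entry.name))
    return fallback
```

After the call, dbpath contains only the .fallbacksync directory and is ready to receive the backup files.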

During the physical restore procedure, PBM will apply the following additional logic related to the fallback dbpath.

  1. Just before the content of the dbpath directory is wiped out, PBM will move all of that content into the dbpath/.fallbacksync dir. After that, the dbpath dir is ready for the backup files to be downloaded.

  2. In case of an error during the restore procedure, PBM will try to swap dbpath from the .fallbacksync directory using the following rules:

    a. if the cluster is in status done or partly-done, .fallbacksync is not used, and it is deleted at the end of the restore procedure.

    b. if the cluster is partly-done, .fallbacksync is not applied, and on the nodes with an error it is not deleted either. That allows the user to delete it or move it back into the dbpath dir manually.

    c. if the cluster is in the error state (on at least one RS, all members are in the error state), .fallbacksync is moved back into the dbpath directory.

  3. At the beginning of the restore procedure, the .fallbacksync dir is always wiped out.
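The rules under step 2 can be read as a small decision function followed by a file operation. The Python sketch below is one possible interpretation, not PBM's actual code: the names `Action`, `fallback_action` and `apply_fallback` are hypothetical, the status strings are the ones used in this ticket, and it assumes that in a partly-done cluster the fallback is kept only on the nodes that ended with an error.

```python
import shutil
from enum import Enum
from pathlib import Path

FALLBACK_DIR = ".fallbacksync"  # directory name taken from this ticket


class Action(Enum):
    DELETE = "delete"    # restore is usable, fallback is no longer needed
    KEEP = "keep"        # leave fallback in place for manual handling
    RESTORE = "restore"  # move fallback content back into dbpath


def fallback_action(cluster_status: str, node_failed: bool) -> Action:
    """Map the cluster status and this node's outcome to a fallback action."""
    if cluster_status == "error":
        return Action.RESTORE
    if cluster_status == "partly-done":
        return Action.KEEP if node_failed else Action.DELETE
    if cluster_status == "done":
        return Action.DELETE
    raise ValueError(f"unexpected cluster status: {cluster_status!r}")


def apply_fallback(dbpath: Path, action: Action) -> None:
    """Carry out the chosen action on dbpath/.fallbacksync."""
    fallback = dbpath / FALLBACK_DIR
    if action is Action.KEEP or not fallback.exists():
        return
    if action is Action.RESTORE:
        # Drop the partially restored files first.
        for entry in list(dbpath.iterdir()):
            if entry.name == FALLBACK_DIR:
                continue
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
        # Then move the saved content back into place.
        for entry in list(fallback.iterdir()):
            shutil.move(str(entry), str(dbpath / entry.name))
    # For both DELETE and RESTORE the fallback dir itself is removed at the end.
    shutil.rmtree(fallback)
```

Keeping the decision separate from the file operations makes the rules easy to test without touching a real dbpath.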

Acceptance Criteria

The solution described above should work for both a replica set (RS) and a sharded cluster.

Additional improvements for checking the amount of free space (the disk must be able to hold the content of the dbpath twice) will be part of a follow-up ticket.
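As a rough illustration of what such a free-space check might look like, here is a Python sketch. It is not part of this ticket's scope and not PBM's implementation: the helper names and the 10% headroom are assumptions. Because the old content stays on disk in .fallbacksync, the free space has to cover the incoming backup on its own.

```python
import os
import shutil
from pathlib import Path


def dir_size(path: Path) -> int:
    """Total size in bytes of regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def has_room(free_bytes: int, incoming_bytes: int, headroom: float = 0.1) -> bool:
    """True if free space covers the incoming data plus an assumed safety margin."""
    return free_bytes >= incoming_bytes * (1 + headroom)


def can_restore(dbpath: Path, backup_bytes: int) -> bool:
    """Check the filesystem that holds dbpath against the size of the backup."""
    return has_room(shutil.disk_usage(dbpath).free, backup_bytes)
```

`dir_size` can be used to report how much space the current dbpath content, and therefore the fallback, occupies.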

QA and Documentation

<What do we need from QA and Documentation team?>

Environment

None

Activity

Boris Ilijic March 5, 2025 at 4:08 PM

Added a follow-up improvements ticket based on the above comments:

Boris Ilijic February 27, 2025 at 12:30 PM
Edited

Thanks to both of you.
What should be the preferred default for "--allow-partial-restore"?

Boris Ilijic February 27, 2025 at 10:48 AM
Edited

Please review this ticket, especially the 2b) logic. What is your opinion: is it better to keep that logic, or to simply roll back to the state before the restore (using .fallbacksync, which is effectively the same as 2c)?

Details

Needs QA

Yes

Needs Doc

Yes

Created February 26, 2025 at 6:29 PM
Updated March 7, 2025 at 1:33 PM