Fallback dbpath for physical restore

General

Escalation

Description

Problem description

During a physical restore, pbm-agent manipulates directories and files within mongod's dbpath directory.

If an unexpected error occurs during the physical restore phase, there is a high chance that the mongod instance cannot be restarted, because the files within dbpath are left in an inconsistent state by the error and the unfinished restore procedure.
If there is a problem with the backup data files, a network issue towards the backup storage, or an unexpected PBM issue during the restore procedure, there is a high chance that the whole RS, the shard, or the whole cluster goes down without the possibility of restarting it. When that happens, PBM is no longer functional, because PBM uses MongoDB as its communication channel and metadata storage, which makes MongoDB a single point of failure for the PBM system.

Solution proposition

Make PBM more resilient during the physical restore by introducing a fallback dbpath.

Before performing any file operation within dbpath, PBM will store the entire content of the dbpath dir (all files and subdirs) in the .fallbacksync directory. This gives PBM an alternative copy of the dbpath content that can be used if an error occurs during the physical restore procedure.
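The move into the fallback directory could look roughly like the sketch below. It is only an illustration in Python, not PBM code: the function name `move_to_fallback` is hypothetical, and it assumes that .fallbacksync lives inside dbpath and is wiped before reuse, as described elsewhere in this ticket.

```python
import os
import shutil

FALLBACK_DIR = ".fallbacksync"  # directory name taken from the ticket description


def move_to_fallback(dbpath: str) -> list[str]:
    """Move every entry of dbpath into dbpath/.fallbacksync; return the moved names."""
    fallback = os.path.join(dbpath, FALLBACK_DIR)
    # Start from an empty fallback directory (hypothetical helper behaviour,
    # mirroring the "always wiped out at the beginning" rule).
    shutil.rmtree(fallback, ignore_errors=True)
    os.mkdir(fallback)

    moved = []
    for name in sorted(os.listdir(dbpath)):
        if name == FALLBACK_DIR:
            continue  # never move the fallback directory into itself
        shutil.move(os.path.join(dbpath, name), os.path.join(fallback, name))
        moved.append(name)
    return moved
```

Because the fallback directory sits on the same filesystem as dbpath in this sketch, each move is a rename rather than a copy, so no extra disk space is consumed by this step itself.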

During the physical restore procedure, PBM will apply the following additional logic related to the fallback dbpath.

Just before the content of the dbpath directory is wiped out, PBM will move all of it into the dbpath/.fallbacksync dir. After that, the dbpath dir is ready for the download of backup files.
In case of an error during the restore procedure, PBM will try to swap dbpath from the .fallbacksync directory using the following rules:
1. if the cluster is in status done or partly-done, .fallbacksync is not used, and it is deleted at the end of the restore procedure.
2. if the cluster is partly-done, .fallbacksync is not applied, and it is not deleted on the nodes with an error either. That allows the user to delete it or move it to the dbpath dir manually.
3. if the cluster is in error state (on at least one RS all members are in error state), .fallbacksync is moved into the dbpath directory.
At the beginning of the restore procedure the .fallbacksync dir is always wiped out.
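The rules above can be sketched as a small decision function. This is an illustrative reading of the ticket, not the PBM implementation: the names `Action`, `cluster_status` and `fallback_action` are hypothetical, the definition of partly-done (some members failed, but no RS failed completely) is an assumption, and rules 1 and 2 are interpreted as "delete on healthy nodes, keep on nodes with an error".

```python
from enum import Enum


class Action(Enum):
    DELETE_FALLBACK = "delete"  # fallback not needed, remove it at the end
    KEEP_FALLBACK = "keep"      # leave .fallbacksync for manual handling
    APPLY_FALLBACK = "apply"    # move .fallbacksync content back into dbpath


def cluster_status(replsets: dict[str, list[bool]]) -> str:
    """replsets maps an RS name to per-member flags; True means the member failed."""
    # Rule 3: at least one RS has all of its members in error state.
    if any(members and all(members) for members in replsets.values()):
        return "error"
    # Assumption: any other failure leaves the cluster partly-done.
    if any(any(members) for members in replsets.values()):
        return "partly-done"
    return "done"


def fallback_action(status: str, node_failed: bool) -> Action:
    """Decide what a single node does with its .fallbacksync directory."""
    if status == "error":
        return Action.APPLY_FALLBACK
    if status == "partly-done" and node_failed:
        return Action.KEEP_FALLBACK  # rule 2: not applied and not deleted
    return Action.DELETE_FALLBACK    # rule 1


# Example: rs0 lost one of three members, rs1 is healthy.
status = cluster_status({"rs0": [True, False, False], "rs1": [False, False, False]})
print(status, fallback_action(status, node_failed=True).value)
```

Keeping the status computation separate from the per-node decision mirrors the text: the status is a cluster-wide property, while deleting, keeping or applying the fallback happens on each node.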

Acceptance Criteria

The solution described above should work for both an RS and a sharded cluster.

Additional improvements for checking the available free space (and the possibility of storing the dbpath content twice) are expected to be part of the next ticket.
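As a rough idea of what such a free-space check might involve, the sketch below compares the free space on the dbpath filesystem with the size of the backup to be downloaded. All names are hypothetical, and the rule "free space must cover the backup size" is an assumption that only holds if the old content stays on the same filesystem as dbpath; the actual check is left to the follow-up ticket.

```python
import os
import shutil


def dir_size(path: str) -> int:
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def has_room_for_restore(free_bytes: int, backup_bytes: int) -> bool:
    """Assumed rule: old content is kept in place, so only the backup needs new space."""
    return free_bytes >= backup_bytes


def check_dbpath(dbpath: str, backup_bytes: int) -> bool:
    """Check the filesystem that holds dbpath against the expected backup size."""
    free_bytes = shutil.disk_usage(dbpath).free
    return has_room_for_restore(free_bytes, backup_bytes)
```

`dir_size` is included because the current dbpath size is the amount of data that would be held a second time while the fallback copy exists.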

QA and Documentation

<What do we need from QA and Documentation team?>

Environment

None

Linked work items

has to be finished together with

PBM-1511

Configuration for Fallback dbpath feature

Activity

Details

Assignee

Boris Ilijic

Reporter

Boris Ilijic

Reviewer

Sandra Romanchenko

Needs QA

Yes

Needs Doc

Yes

Story Points

Components

Sprint

MongoDB Tools 18

Fix versions

2.10.0

Priority

Medium

Parent

PBM-1335 Physical restore assumes a functioning cluster

Created February 26, 2025 at 6:29 PM

Updated 3 days ago