Improve resync performance

Description

Resync takes a long time (13 minutes) on a storage that contains only 20 backups:

Machine has 4 CPUs:

I don't know why resync takes this long.

Observations

As mentioned in the comments, resync is slow because fetching the physical restore metadata takes a long time. The main cause is how the metadata is stored: instead of being in a single file, it is split into many small files in the storage (such as S3 or GCS).

For each restore, we need to read and parse all these files to figure out the status of the cluster, each replicaset, and each node. This means many small download and decode operations, which adds up quickly.
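To make the "adds up quickly" point concrete, here is a rough back-of-the-envelope sketch of a purely sequential resync. The function name and all numbers are illustrative assumptions, not measurements from PBM or from this storage.

```python
def estimate_sequential_resync_ms(n_restores, files_per_restore,
                                  request_latency_ms, decode_ms=0):
    """Estimate total time when every small file is fetched one by one.

    Assumption: each file costs one full storage round trip plus a
    fixed decode time, and nothing runs in parallel.
    """
    n_files = n_restores * files_per_restore
    return n_files * (request_latency_ms + decode_ms)


if __name__ == "__main__":
    # Hypothetical figures: 20 restores, 30 small files each,
    # 100 ms per request and 5 ms to decode one file.
    many_small = estimate_sequential_resync_ms(20, 30, 100, 5)
    # Same restores if each one kept its metadata in a single file.
    single_file = estimate_sequential_resync_ms(20, 1, 100, 5)
    print(f"many small files: {many_small / 1000:.1f} s")
    print(f"one file per restore: {single_file / 1000:.1f} s")
```

Under these assumed figures the per-request latency dominates, so the total grows with the number of files rather than with the amount of data.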

Acceptance criteria:

  • make the sync/resync faster

  • please specify more criteria during the implementation

  • introduce a flag and config option to skip parsing of .pbm.restore and process only the latest entry, which is the one required for physical restore.
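The last criterion could look roughly like the sketch below. It is not PBM code: the function names, the `latest_only` parameter, and the example file names are hypothetical, and it assumes that each restore has its own sub-directory under .pbm.restore whose name sorts chronologically (for example, a timestamp).

```python
from collections import defaultdict

RESTORE_PREFIX = ".pbm.restore/"


def group_by_restore(keys):
    """Group storage object keys by restore name (first path segment)."""
    groups = defaultdict(list)
    for key in keys:
        if not key.startswith(RESTORE_PREFIX):
            continue  # not restore metadata, e.g. backup files
        name, sep, _ = key[len(RESTORE_PREFIX):].partition("/")
        if sep:  # ignore objects that sit directly under the prefix
            groups[name].append(key)
    return groups


def keys_to_parse(keys, latest_only=True):
    """Return the restore metadata keys a resync would need to read.

    latest_only=True keeps only the newest restore; this relies on the
    assumption that restore names sort chronologically as strings.
    """
    groups = group_by_restore(keys)
    if not groups:
        return []
    if latest_only:
        return sorted(groups[max(groups)])
    return sorted(k for group in groups.values() for k in group)


if __name__ == "__main__":
    listing = [
        ".pbm.restore/2025-03-10T16:15:00Z/cluster.done",
        ".pbm.restore/2025-03-10T16:15:00Z/rs.rs0/node.a.done",
        ".pbm.restore/2025-03-12T15:21:00Z/cluster.done",
        "backup-1/meta.json",
    ]
    print(keys_to_parse(listing))
    print(keys_to_parse(listing, latest_only=False))
```

With the default, the number of files read depends only on the latest restore, not on how many restores have accumulated in the storage.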

Environment

None

Activity

Jakub Vecera 
May 19, 2025 at 11:39 AM

Hi, we plan to change the default for pbm config --force-resync so that it reads only the latest entry in .pbm.restore (skipping the rest) and fetches the full restore metadata only when --include-restores is additionally specified. Could you confirm this is okay for you? Thanks.

ege.gunes 
March 12, 2025 at 3:21 PM

Looks like the problem is resyncing physical restore metadata. I removed the .pbm.restore dir from the storage and resync finished in 1m for 20 backups.

radoslaw.szulgo 
March 10, 2025 at 7:24 PM

Sure! Thanks for reporting. We’ll review this as soon as possible and will most likely plan the fix for the next version.

Slava Sarzhan 
March 10, 2025 at 4:57 PM

Could you please check this? It is critical for us. As you can see, even 15-20 backups with minimal data can take up to 20 minutes to resync.

Details

Created March 10, 2025 at 4:15 PM
Updated 2 days ago