$backupCursor may have oplog holes

Description

While testing $backupCursor, we found that the resulting backup may contain oplog holes. The log is below:


$backupCursor returns oplogEnd as {"t":1704373138,"i":4}, but the oplog entry {"t":1704373138,"i":3} is not in the backup files.

Looking at the implementation of $backupCursor in Percona Server for MongoDB, oplogEnd is taken from the last record of the oplog collection. There may be holes before that record, because earlier oplog entries may not be committed yet.
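To make the suspected failure mode concrete, here is a toy model in Python. It is not the server's code: the `OplogSlot` type and both helper functions are hypothetical, and the entries with increments 1 and 2 are invented for illustration (only increments 3 and 4 come from the log above). It only shows why "timestamp of the last visible record" and "newest timestamp with no hole before it" can differ.

```python
from typing import List, NamedTuple, Optional, Tuple

Ts = Tuple[int, int]  # (seconds, increment), mirroring {"t": ..., "i": ...}


class OplogSlot(NamedTuple):
    """Hypothetical model of an allocated oplog slot."""
    ts: Ts
    committed: bool  # False means the write is still in flight (a hole)


def last_visible_ts(slots: List[OplogSlot]) -> Optional[Ts]:
    """Timestamp of the newest committed entry, ignoring earlier holes."""
    visible = [s.ts for s in slots if s.committed]
    return max(visible) if visible else None


def hole_free_ts(slots: List[OplogSlot]) -> Optional[Ts]:
    """Newest timestamp such that every slot at or before it is committed."""
    safe = None
    for slot in sorted(slots, key=lambda s: s.ts):
        if not slot.committed:
            break  # first hole: nothing after this point is safe
        safe = slot.ts
    return safe


# Increments 1 and 2 are made up; 3 (missing) and 4 (oplogEnd) are from the report.
slots = [
    OplogSlot((1704373138, 1), True),
    OplogSlot((1704373138, 2), True),
    OplogSlot((1704373138, 3), False),
    OplogSlot((1704373138, 4), True),
]

print(last_visible_ts(slots))  # what a "last record" read would report
print(hole_free_ts(slots))     # the point up to which the oplog has no holes
```

In this model, a backup that reports the first value as oplogEnd would claim coverage of an entry it does not contain.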

So here is my question: can $backupCursor be used alone, without $backupCursorExtend? It looks like $backupCursorExtend waits for the given timestamp to be committed. Or should $backupCursor be implemented like $backupCursorExtend, waiting until oplogEnd is safe and free of holes?
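For reference, the two stages in question could be issued as aggregation pipelines roughly as follows. This is a sketch, not a tested backup procedure: the helper names are hypothetical, and the `backupId` and `timestamp` field names are my understanding of the stage's arguments, so please check them against the documentation for your server version.

```python
from typing import Any, Dict, List


def backup_cursor_pipeline() -> List[Dict[str, Any]]:
    """Pipeline that opens a backup cursor (run against the admin database)."""
    return [{"$backupCursor": {}}]


def backup_cursor_extend_pipeline(backup_id: Any, timestamp: Any) -> List[Dict[str, Any]]:
    """Pipeline that extends an open backup up to `timestamp`.

    Assumption: `backup_id` is the identifier reported in the backup
    cursor's metadata document, and `timestamp` is a BSON Timestamp.
    """
    return [{"$backupCursorExtend": {"backupId": backup_id, "timestamp": timestamp}}]


# Hypothetical usage with a driver handle `admin_db` for the admin database:
#   cursor = admin_db.aggregate(backup_cursor_pipeline())
#   ... read the metadata document (backup id, oplogEnd) from `cursor` ...
#   admin_db.aggregate(backup_cursor_extend_pipeline(backup_id, oplog_end_ts))
print(backup_cursor_pipeline())
```

The question above is whether the second call is required for a hole-free backup, or whether the first one alone should already guarantee it.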

Environment

None

Activity


radoslaw.szulgo October 1, 2024 at 8:44 AM

This is probably caused by a race condition that is hard to reproduce and troubleshoot. It would take weeks of work, so we are closing this as on demand.

Jan Mynar September 3, 2024 at 12:12 PM

Flag added

This seems to be a very complex issue. We are not even able to reproduce it now, so we have to discuss it again with Radek.

Jan Mynar June 18, 2024 at 8:51 AM

Refinement comment: the plan is to reproduce the problem, find the root cause, and decide whether the problem is a bug or not.

Given that we would like to start working on this in Q3/2024, we are moving this bug to the “open” state.

MingTotti Guoming He February 8, 2024 at 8:37 AM

Hi,

Here are my answers to your questions:

  • My question is whether I can use $backupCursor alone, without $backupCursorExtend. From my testing, doing so risks losing oplog entries that fall in a hole before the backup timestamp.

  • My deployment type is a single replica set with three nodes

  • The workload is simply insertions of records in a single thread, and another thread performs the $backupCursor
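A check along the following lines can flag the problem in such a test. It is a minimal sketch with a hypothetical helper name: it assumes the inserting thread records the timestamp of every acknowledged write, and that the timestamps of the oplog entries found in the restored backup can be listed. The sample values other than increments 3 and 4 are invented.

```python
from typing import Iterable, List, Tuple

Ts = Tuple[int, int]  # (seconds, increment)


def missing_before_oplog_end(acknowledged: Iterable[Ts],
                             in_backup: Iterable[Ts],
                             oplog_end: Ts) -> List[Ts]:
    """Acknowledged write timestamps at or before oplog_end that the backup lacks."""
    present = set(in_backup)
    return sorted(ts for ts in acknowledged
                  if ts <= oplog_end and ts not in present)


acknowledged = [(1704373138, 2), (1704373138, 3), (1704373138, 4), (1704373138, 5)]
in_backup = [(1704373138, 2), (1704373138, 4)]
oplog_end = (1704373138, 4)

# Any entry listed here is a hole: the backup claims to cover it but does not contain it.
print(missing_before_oplog_end(acknowledged, in_backup, oplog_end))
```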

Thanks

Konstantin Trushin February 7, 2024 at 11:49 AM

Hello,
sorry for not keeping you updated for so long. We are investigating the issue. In the meantime, it would be useful to know more context. Could you please provide us with the following:

  • what problem you are trying to solve;

  • your deployment type: single replica set or multiple shards;

  • workload type and, if possible, an approximate description of the operation that hadn't reached the oplog before the backup cursor was created.

With best regards,
Konstantin.

Done

Created January 6, 2024 at 1:31 AM
Updated April 22, 2025 at 12:06 PM
Resolved April 22, 2025 at 12:06 PM