Overlapping PITR chunks could happen

Description

A node in customer environment froze for ~10 minutes, which caused overlapping PITR chunks, which in turn made restore impossible.

Reason for the freeze is unknown, probably some virtualization issue. In dmesg we see multiple processes "blocked for more than 120 seconds", as well as:

 

Mar 07 02:32:11 xyz0000953.somedomain.com systemd-journal[719]: Journal started Mar 07 02:22:28 xyz0000953.somedomain.com chronyd[939]: Forward time jump detected! Mar 07 02:22:28 xyz0000953.somedomain.com chronyd[939]: Can't synchronise: no selectable sources # also stall for every CPU was detected, for example: Mar 07 02:32:11 xyz0000953.somedomain.com kernel: INFO: rcu_sched self-detected stall on CPU { 2} (t=217023 jiffies g=240400579 c=240400578 q=2855)

Looking at the timestamps, the stall lasted about 10 minutes. After taking a nap the server continued working as if nothing unusual happened.

The node that froze was also responsible for PITR chunks. While the node was paused, another one took over as it should:

Mar 07 02:19:06 xyz0000948.somedomain.com pbm-agent[22996]: 2022-03-07T02:19:06.000-0700 D [pitr] start_ok Mar 07 02:19:06 xyz0000948.somedomain.com pbm-agent[22996]: 2022-03-07T02:19:06.000-0700 I [pitr] streaming started from 2022-03-07 09:06:11 +0000 UTC / 1646643971

However, the frozen node after waking up still finished the PITR chunk it had started:

Mar 07 02:32:38 xyz0000953.somedomain.com pbm-agent[30659]: 2022-03-07T02:32:38.000-0700 I [pitr] created chunk 2022-03-07T09:06:11 - 2022-03-07T09:16:11. Next chunk creation scheduled to begin at ~2022-03-07T02:26:11 Mar 07 02:32:38 xyz0000953.somedomain.com pbm-agent[30659]: 2022-03-07T02:32:38.000-0700 I [pitr] streaming oplog: pitr slicing resumed on node xyz0000948.somedomain.com:27017

The result of all this is, restoring backup for the affected time range fails with:

bash-4.2$ pbm restore --time="2022-03-07T10:34:00" ..Error: verify oplog slices chain: integrity vilolated, expect chunk with start_ts {1646644571 1}, but got {1646643971 1} -bash-4.2$ pbm restore --time="2022-03-07T10:29:06" ..Error: verify oplog slices chain: integrity vilolated, expect chunk with start_ts {1646644571 1}, but got {1646643971 1}

To recover from this, removing the overlapping PITR chunk from storage, and running pbm config --force-resync works, but in an optimal case this should not have happened.

 

Environment

None

AFFECTED CS IDs

CS0025411

Smart Checklist

Activity

andrew.pogrebnoi March 14, 2022 at 2:59 PM

Before making a slice agent checks if it still owns the lock. But if the freeze happened right after that check it obviously won’t help…
Anyway, since PBM 1.6.0 overlapping shouldn’t be an issue anymore. Now it doesn’t require the start time of the next chunk to exactly much the end time of the current one. The next start could be less than the current end. The only requirement (since v1.6.0) the next start should be more than the current end (no gaps allowed).

But we should test if it tolerates different invariants of overlapping chunks.

Won't Do

Details

Assignee

Reporter

Components

Affects versions

Priority

Smart Checklist

Created March 14, 2022 at 12:21 PM
Updated February 4, 2025 at 11:18 AM
Resolved March 20, 2023 at 2:37 PM

Flag notifications