Won't Do
Details
Assignee
UnassignedUnassignedReporter
Sami AhlroosSami AhlroosLabels
Components
Affects versions
Priority
Medium
Details
Details
Assignee
Unassigned
UnassignedReporter
Sami Ahlroos
Sami AhlroosLabels
Components
Affects versions
Priority
Smart Checklist
Smart Checklist
Smart Checklist
Created March 14, 2022 at 12:21 PM
Updated February 4, 2025 at 11:18 AM
Resolved March 20, 2023 at 2:37 PM
A node in customer environment froze for ~10 minutes, which caused overlapping PITR chunks, which in turn made restore impossible.
Reason for the freeze is unknown, probably some virtualization issue. In dmesg we see multiple processes "blocked for more than 120 seconds", as well as:
Mar 07 02:32:11 xyz0000953.somedomain.com systemd-journal[719]: Journal started Mar 07 02:22:28 xyz0000953.somedomain.com chronyd[939]: Forward time jump detected! Mar 07 02:22:28 xyz0000953.somedomain.com chronyd[939]: Can't synchronise: no selectable sources # also stall for every CPU was detected, for example: Mar 07 02:32:11 xyz0000953.somedomain.com kernel: INFO: rcu_sched self-detected stall on CPU { 2} (t=217023 jiffies g=240400579 c=240400578 q=2855)
Looking at the timestamps, the stall lasted about 10 minutes. After taking a nap the server continued working as if nothing unusual happened.
The node that froze was also responsible for PITR chunks. While the node was paused, another one took over as it should:
Mar 07 02:19:06 xyz0000948.somedomain.com pbm-agent[22996]: 2022-03-07T02:19:06.000-0700 D [pitr] start_ok Mar 07 02:19:06 xyz0000948.somedomain.com pbm-agent[22996]: 2022-03-07T02:19:06.000-0700 I [pitr] streaming started from 2022-03-07 09:06:11 +0000 UTC / 1646643971
However, the frozen node after waking up still finished the PITR chunk it had started:
Mar 07 02:32:38 xyz0000953.somedomain.com pbm-agent[30659]: 2022-03-07T02:32:38.000-0700 I [pitr] created chunk 2022-03-07T09:06:11 - 2022-03-07T09:16:11. Next chunk creation scheduled to begin at ~2022-03-07T02:26:11 Mar 07 02:32:38 xyz0000953.somedomain.com pbm-agent[30659]: 2022-03-07T02:32:38.000-0700 I [pitr] streaming oplog: pitr slicing resumed on node xyz0000948.somedomain.com:27017
The result of all this is, restoring backup for the affected time range fails with:
bash-4.2$ pbm restore --time="2022-03-07T10:34:00" ..Error: verify oplog slices chain: integrity vilolated, expect chunk with start_ts {1646644571 1}, but got {1646643971 1} -bash-4.2$ pbm restore --time="2022-03-07T10:29:06" ..Error: verify oplog slices chain: integrity vilolated, expect chunk with start_ts {1646644571 1}, but got {1646643971 1}
To recover from this, removing the overlapping PITR chunk from storage, and running
pbm config --force-resync
works, but in an optimal case this should not have happened.