pbm-agent crashes if the primary steps down during a restore, and the restore command runs forever

Description

Steps to reproduce:

1. Run a 3-node replica set.
2. Add some data and create a backup with PBM.
3. Start a restore and, while it is running, issue "rs.stepDown()" on the primary.

Notice:

The restore command waits forever.
The data is only partially restored.
pbm-agent crashes.

This is related to the linked issue (see Linked work items), but it now happens with additional problems (the crash, etc.) after the fix for PBM-211 (synchronous restore command).

Here's how it looks.
Restore waits forever:

Table is too small (should have 5000000)

The third pbm-agent is no longer running:

This is the stderr output from pbm-agent:

IIRC, it was discussed in the meeting that the restore cannot complete when this happens. Even so, it seems that:

The restore should stop and exit with a failure.
pbm-agent should log that the restore has failed, and should not crash in the process.
pbm-agent should recognize that the mongod node it is attached to is now a secondary, so that this is taken into account when new "backup" or "restore" commands are issued.
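
To make the three expectations above concrete, here is a minimal sketch of the behavior being asked for. It is not PBM code: the names Agent, Role, SteppedDown, and the restore() signature are all hypothetical, and the step-down is simulated by an exception rather than detected from a real mongod.

```python
import logging
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, Optional

log = logging.getLogger("pbm-agent-sketch")  # hypothetical logger name


class Role(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"


class SteppedDown(Exception):
    """Hypothetical error: the node lost primary status during a restore."""


@dataclass
class Agent:
    """Toy stand-in for an agent; not the real pbm-agent."""
    role: Role = Role.PRIMARY
    restore_running: bool = False
    last_error: Optional[str] = None

    def restore(self, steps: Iterable[Callable[[], None]]) -> bool:
        """Run the restore steps. Returns False on failure instead of crashing."""
        if self.role is not Role.PRIMARY:
            # Expectation 3: the agent knows its node is now a secondary.
            self.last_error = "restore refused: node is not primary"
            log.error(self.last_error)
            return False
        self.restore_running = True
        try:
            for step in steps:
                step()
            return True
        except SteppedDown as exc:
            # Expectations 1 and 2: stop, report failure, log it, keep running.
            self.role = Role.SECONDARY
            self.last_error = f"restore failed: {exc}"
            log.error(self.last_error)
            return False
        finally:
            # Never leave the "restore in progress" flag set after an exit.
            self.restore_running = False


def apply_ok() -> None:
    """Simulated restore step that succeeds."""


def apply_during_step_down() -> None:
    """Simulated restore step that hits a primary step-down."""
    raise SteppedDown("primary stepped down during restore")


agent = Agent()
succeeded = agent.restore([apply_ok, apply_during_step_down, apply_ok])
```

In this sketch the restore call returns a failure instead of hanging, the agent object survives, and a second restore request on the same agent is refused because its role is now secondary.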

Environment

None

Linked work items

relates to

PBM-187

restore incomplete when primary steps down

Activity

Tomislav Plavcic
April 4, 2019 at 1:10 PM

After we cancel the restore command (Ctrl-C) and start the crashed pbm-agent again, we cannot run another restore, since PBM claims that another restore is still in progress:

This is the stderr output from pbm-coordinator:

So the coordinator seems to be unaware that the restore has failed; it only unregistered the client because pbm-agent crashed.
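
One way to picture the missing piece: when the client that owns a running restore is unregistered, the coordinator could also mark that restore as failed and release it. The sketch below only illustrates that idea. Coordinator, restore_owner, and the method names are hypothetical and do not reflect the actual pbm-coordinator code.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Coordinator:
    """Toy stand-in for a coordinator; not the real pbm-coordinator."""
    clients: Set[str] = field(default_factory=set)
    restore_owner: Optional[str] = None        # client running the restore
    last_restore_status: Optional[str] = None  # "running" or "failed"

    def register(self, client_id: str) -> None:
        self.clients.add(client_id)

    def start_restore(self, client_id: str) -> None:
        if client_id not in self.clients:
            raise KeyError(f"unknown client: {client_id}")
        if self.restore_owner is not None:
            raise RuntimeError("another restore is still in progress")
        self.restore_owner = client_id
        self.last_restore_status = "running"

    def unregister(self, client_id: str) -> None:
        """Drop a client; if it owned the restore, fail and release it."""
        self.clients.discard(client_id)
        if self.restore_owner == client_id:
            self.restore_owner = None
            self.last_restore_status = "failed"


coordinator = Coordinator()
coordinator.register("agent-3")
coordinator.start_restore("agent-3")
coordinator.unregister("agent-3")     # simulates the agent crashing
coordinator.register("agent-3")       # the agent is started again
coordinator.start_restore("agent-3")  # accepted, since the old restore was released
```

Without the extra branch in unregister(), the second start_restore() call would raise the "another restore is still in progress" error, which matches the symptom described above.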

Done

Details

Assignee

stefan.vinasi

Reporter

Tomislav Plavcic

Labels

retest

Time tracking

45m logged

Fix versions

1.1.0

Priority

High

Created April 4, 2019 at 12:51 PM

Updated March 5, 2024 at 7:29 PM

Resolved January 4, 2020 at 5:43 PM