pbm-agent crashes and the restore command hangs forever if the primary steps down during a restore

Description

Steps to reproduce are:

  • run a 3-node replica set

  • add some data and create a backup with PBM

  • start a restore and, while it is running, issue "rs.stepDown()" on the primary

Notice:

  • the restore command waits forever

  • the data is only partially restored

  • pbm-agent crashes

This is actually related to , but it now comes with additional issues (the crash etc.) after the fix for PBM-211 (synchronous restore command).

Here's how it looks.
Restore waits forever:

Table is too small (should have 5000000)

The third pbm-agent is no longer running:

This is the stderr output from pbm-agent:

IIRC it was discussed in the meeting that the restore cannot complete when this happens. In that case it seems that:

  • the restore should stop and exit with a failure

  • pbm-agent should log that the restore has failed and should not crash in the process

  • pbm-agent should be aware of the new situation, namely that the mongod node it is attached to is now a secondary, so that this is taken into account when new "backup" or "restore" commands are issued

Environment

None

Activity

Tomislav Plavcic 
April 4, 2019 at 1:10 PM

After we cancel the restore command (Ctrl-C) and start the crashed pbm-agent again, we cannot run another restore, because the coordinator claims that another restore is still in progress:

This is the stderr output from pbm-coordinator:

So pbm-coordinator seems to be unaware that the restore has failed; it only unregistered the client because pbm-agent crashed.

Done

Details

Time tracking

45m logged

Created April 4, 2019 at 12:51 PM
Updated March 5, 2024 at 7:29 PM
Resolved January 4, 2020 at 5:43 PM