pbm-agent crashes if the primary steps down during restore, leaving the restore command running forever
General
Escalation
Description
Steps to reproduce are:
run a 3-node replica set
add some data and create a backup with PBM
start a restore and, while it is running, issue "rs.stepDown()" on the primary (see the sketch after this list)
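For reference, the step-down can also be triggered programmatically instead of via the mongo shell. Below is a minimal sketch using the official MongoDB Go driver; the connection URI and the 60-second step-down window are illustrative assumptions, not values from this report:

{code:go}
package main

import (
	"context"
	"fmt"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Connect directly to the node that is currently primary.
	client, err := mongo.Connect(ctx, options.Client().
		ApplyURI("mongodb://localhost:27017/?directConnection=true"))
	if err != nil {
		panic(err)
	}
	defer client.Disconnect(ctx)

	// Equivalent of rs.stepDown(60) in the mongo shell. The server
	// closes client connections when stepping down, so a network
	// error here usually means the command actually took effect.
	err = client.Database("admin").
		RunCommand(ctx, bson.D{{Key: "replSetStepDown", Value: 60}}).Err()
	fmt.Println("stepDown result:", err)
}
{code}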
Notice:
the restore command will wait forever
the data will be only partially restored
pbm-agent will crash
This is related to a previously reported issue, but it now comes with additional problems (the crash, etc.) after the fix for PBM-211 (synchronous restore command).
Here's how it looks. Restore waits forever:
The collection is too small (it should have 5,000,000 documents):
The third pbm-agent is no longer running:
This is the stderr output from pbm-agent:
IIRC it was discussed in the meeting that the restore cannot complete when this happens, but then it seems that:
the restore should stop/exit with a failure
pbm-agent should log that the restore has failed, without crashing in the process
pbm-agent should be aware of the new situation, i.e. that the mongod node it is attached to is now a secondary, so that this is taken into account when new "backup" or "restore" commands are issued (a detection sketch follows this list)
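For the last point, here is a minimal sketch of what such a role check could look like, again with the Go driver. The helper name and the polling approach are assumptions for illustration, not PBM's actual implementation:

{code:go}
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// isPrimary reports whether the directly connected mongod is currently
// the primary, based on the isMaster (now "hello") command.
func isPrimary(ctx context.Context, client *mongo.Client) (bool, error) {
	var res struct {
		IsMaster bool `bson:"ismaster"`
	}
	err := client.Database("admin").
		RunCommand(ctx, bson.D{{Key: "isMaster", Value: 1}}).
		Decode(&res)
	if err != nil {
		return false, err
	}
	return res.IsMaster, nil
}

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().
		ApplyURI("mongodb://localhost:27017/?directConnection=true"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// During a restore the agent could poll its node's role and abort
	// with an explicit failure, instead of crashing, when it flips.
	for {
		primary, err := isPrimary(ctx, client)
		if err != nil || !primary {
			fmt.Println("node is no longer primary (or unreachable); failing restore cleanly")
			return
		}
		time.Sleep(2 * time.Second)
	}
}
{code}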
After we cancel the restore command (Ctrl-C) and restart the crashed pbm-agent, we cannot run a restore again because PBM claims that another restore is still in progress:
This is the stderr output from pbm-coordinator:
So the coordinator seems to be unaware that the restore has failed; it just unregistered the client because pbm-agent crashed.
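One way the coordinator could avoid this stuck state is to treat the in-progress restore as a lease with a heartbeat, and consider it failed once the owning agent stops refreshing it. Below is a minimal sketch of that idea; the types, field names, and TTL are assumptions, not the coordinator's actual bookkeeping:

{code:go}
package main

import (
	"fmt"
	"time"
)

// restoreLock is a hypothetical record of an in-progress restore.
// The owning agent is expected to refresh Heartbeat periodically.
type restoreLock struct {
	Owner     string    // agent that started the restore
	Heartbeat time.Time // last time the agent confirmed it is alive
}

const heartbeatTTL = 30 * time.Second // assumed staleness threshold

// isStale reports whether the lock's owner has stopped heartbeating,
// e.g. because the pbm-agent process crashed mid-restore.
func (l restoreLock) isStale(now time.Time) bool {
	return now.Sub(l.Heartbeat) > heartbeatTTL
}

func main() {
	lock := restoreLock{
		Owner:     "pbm-agent@rs1-node3",
		Heartbeat: time.Now().Add(-2 * time.Minute), // agent crashed
	}
	if lock.isStale(time.Now()) {
		// Instead of blocking new restores forever, mark the old one
		// as failed and release the lock.
		fmt.Printf("restore by %s is stale: marking failed, releasing lock\n", lock.Owner)
	}
}
{code}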