garbd crashes after sending the SST request and PXC stays in DONOR/DESYNCED state for a long time

Description

Hi,

We have noticed that the PXC operator has unstable backups. I reproduced the problem and found that the SST request never completes, yet the donor script keeps running and the PXC pod stays in the DONOR/DESYNCED state for more than one hour.

As I can see from the

:
The state was changed at 09:15:48.427711

due to the backup (the backup pod is 10.72.2.235), which failed at 09:15:48.430 with the following error:

but pxc-1 kept the connection open until 10:21:20.012214Z and only then printed the error:

and the state changed back to synced. If a few backups fail, the PXC cluster ends up with only one pod in the synced state (we do not allow using the pxc-0 primary for backups).
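For reference, the time spent in DONOR/DESYNCED can be worked out from the two timestamps quoted above. This is only a small arithmetic sketch: it assumes both timestamps are from the same day and the same time zone, and the variable names are illustrative.

```python
from datetime import datetime

# Timestamps copied from the description above (trailing "Z" dropped).
# Assumption: both were logged on the same day, in the same time zone.
FMT = "%H:%M:%S.%f"
entered_donor = datetime.strptime("09:15:48.427711", FMT)
error_logged = datetime.strptime("10:21:20.012214", FMT)

# Time between the state change to DONOR/DESYNCED and the error on pxc-1.
stuck_for = error_logged - entered_donor
print(stuck_for)  # 1:05:31.584503
```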

Could you please have a look at this issue and fix it?

P.S. You can find all the logs in the attachment.

 

Environment

None

Attachments

1

Activity

Slava Sarzhan April 18, 2022 at 8:13 AM

Thanks for the fix.

Slava Sarzhan April 6, 2022 at 4:42 PM

Hi,

Do we have any update here? Our PXC operator still has unstable backups. When do you plan to fix it?

Slava Sarzhan November 3, 2021 at 8:59 AM

Hi,

If we fix the first issue, backups will be more stable: there will be no crashes and, as a result, fewer hung backups. Backups were stable when garbd was stable.

Zsolt Parragi November 3, 2021 at 7:13 AM

There are actually two issues here:

  • garbd sometimes tries to join the cluster twice, and the second attempt results in the crash. That's easy to fix.

  • If garbd or a new node requests an SST but crashes/disconnects/... shortly after, the donor thread hangs for a long time, keeping the donor node in the donor state. The issue here is that the SST is sent by a bash script: while the server itself realizes that the new node has disappeared, the bash script does not know about this and keeps waiting for the joiner to receive the data.
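To illustrate the second issue, here is a rough sketch of a watchdog that stops a long-running transfer process once its peer is known to be gone. It is not the actual fix and not PXC code: `run_with_peer_watchdog`, `peer_alive`, and the parameters are hypothetical names, and the child process is just a stand-in for the donor script.

```python
import subprocess
import sys
import time
from typing import Callable, Sequence


def run_with_peer_watchdog(
    cmd: Sequence[str],
    peer_alive: Callable[[], bool],
    poll_interval: float = 0.1,
    grace: float = 2.0,
) -> str:
    """Run cmd, but stop it early if peer_alive() reports the peer is gone.

    Returns "completed" if cmd exited on its own, "aborted" if it was stopped.
    All names here are hypothetical; cmd stands in for the donor script and
    peer_alive for whatever tells the server that the joiner disappeared.
    """
    proc = subprocess.Popen(cmd)
    while proc.poll() is None:  # child is still running
        if not peer_alive():
            proc.terminate()  # ask the child to stop
            try:
                proc.wait(timeout=grace)
            except subprocess.TimeoutExpired:
                proc.kill()  # force-stop if it ignores the request
                proc.wait()
            return "aborted"
        time.sleep(poll_interval)
    return "completed"


# Example call (the child just sleeps; the peer is reported as gone at once):
# run_with_peer_watchdog(
#     [sys.executable, "-c", "import time; time.sleep(3600)"],
#     peer_alive=lambda: False,
# )
```

The point of the sketch is only that the process sending the data needs some signal from the side that already knows the joiner is gone; without it, the sender waits until its own connection finally fails.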

Sergey Pronin November 1, 2021 at 11:43 AM

Your team's analysis would be great input here. I know this issue was already being looked into.

Done

Details

Time tracking

4d 6h 13m logged, 1h remaining

Created October 28, 2021 at 1:23 PM
Updated March 6, 2024 at 9:04 PM
Resolved April 18, 2022 at 6:51 AM