Liveness probe is not working after recovery

Cannot Reproduce

Description

We noticed a Percona cluster was not reachable and the haproxy instances in front of it were restarting, marking 2 out of 3 backends as down. After thorough investigation it turned out the mysqld-ps process was not running and there were no listeners on ports 3306 and 33062 on 2 of the 3 pods.

On one of the pods mysqld was still listening, but a single node could not form a quorum.

After looking more closely at the liveness script, it turned out it was not really checking whether the mysql process listens and responds, because a file called recovery was still present.

This file seems to be created by the PXC scripts during the recovery process, but nothing appears to remove it afterwards, which makes the liveness probe useless: even if the mysql process crashes, the probe will not detect it, because it exits with 0 while the file is present.
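For illustration only, a minimal sketch of the pattern described above, assuming a hypothetical sentinel path such as /var/lib/mysql/recovery and a simplified check; the real operator script differs in detail:

    #!/bin/bash
    # Simplified liveness check illustrating the reported problem.
    # RECOVERY_FILE is an assumed path; the actual PXC scripts may use another name.
    RECOVERY_FILE=/var/lib/mysql/recovery

    if [ -f "$RECOVERY_FILE" ]; then
        # While the sentinel file exists, the probe reports success
        # unconditionally, so a crashed mysqld is never detected.
        exit 0
    fi

    # Only when the file is absent is mysqld actually checked.
    mysqladmin ping -h 127.0.0.1 || exit 1

Under that reading, the implied fix would be for the recovery scripts to remove the sentinel file once recovery completes, or for the probe to verify mysqld regardless of the file.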

Environment

None

Details

Assignee

Reporter

Needs QA

Yes

Components

Affects versions

Priority

Smart Checklist

Created September 14, 2023 at 2:32 PM
Updated December 13, 2023 at 9:56 PM
Resolved December 13, 2023 at 9:56 PM

Activity

Sveta Smirnova, December 13, 2023 at 9:56 PM

Thank you for the feedback.

What you report looks more like a failed recovery, not a failed liveness check. Recovery may fail for different reasons. Since you don't have a repeatable test case, I am closing this report as "Cannot Reproduce". If you manage to find out what causes recovery to fail, please open a new report with details and steps to reproduce.

Mihail Vukadinoff, September 15, 2023 at 10:33 AM

Thanks. We switched to the debug image for a while since we were experiencing other issues, described here: https://jira.percona.com/browse/PXC-4281, and wanted to get more information on what is actually happening.
But the recent occurrence made me think the problems might be related. Could it be that the debug image behaves differently from the normal image with regard to the recovery process and this lock file?

In this particular case the DB cluster had been running fine for more than 30 days, and only now did we get an alarm that the apps cannot reach the database.

Unfortunately there are no clear steps to reproduce. On another test database, if the file is not there before I force-delete the pods and trigger recovery, it is also not there after recovery finishes.

However, if it is there from before, it seems it doesn't get cleared.
I wasn't able to reproduce this by force-deleting all pods: when recovery completes, it seems to remove the file fine.

So does this mean we were stuck in recovery on those databases?

Slava Sarzhan, September 15, 2023 at 7:59 AM

Hi,
STR means steps to reproduce. Why do you need to use 'percona/percona-xtradb-cluster:8.0.32-24.2-debug'? This image is useful in case of manual recovery or if you need some additional rpms to collect a core dump.

P.S. Full cluster crash recovery is very easy to test. Just delete all PXC pods together with the '--force' flag.
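For reference, a hedged example of how such a test might be run with kubectl, assuming the operator's usual pxc component label and a namespace named pxc (both are assumptions; adjust to your deployment):

    # Force-delete every PXC pod at once to simulate a full cluster crash.
    # Label selector and namespace below are illustrative; check your own labels first.
    kubectl -n pxc delete pod -l app.kubernetes.io/component=pxc --force --grace-period=0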

Mihail Vukadinoff, September 14, 2023 at 3:02 PM

btw, Percona containers themselves are pretty new:

percona/percona-xtradb-cluster:8.0.32-24.2-debug

Mihail Vukadinoff, September 14, 2023 at 2:56 PM

Thanks for the quick response, Slava.

The cluster had been running for more than 30 days; I don't imagine a recovery was running all that time. It looked like all nodes were healthy. Even after we restarted the whole cluster, it went into "full cluster crash" and we revived it by sending a USR1 signal to the most advanced pod. The file still remained there even after we confirmed in the logs that the recovery had finished on all pods.
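For context, sending that signal is typically done with kubectl exec; a sketch, assuming a hypothetical pod name cluster1-pxc-0, a container named pxc, and the entrypoint waiting on a signal to PID 1:

    # Send SIGUSR1 to PID 1 of the pxc container on the most advanced pod
    # so recovery can proceed. Pod and container names are assumptions.
    kubectl -n pxc exec cluster1-pxc-0 -c pxc -- kill -s USR1 1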

I was just reviewing the code and comparing it against version 1.9 to see if something is different in the handling.

We'll plan an upgrade for sure.

Sorry, I didn't understand the question. What is STR?