Need a feature for introducing a custom delay at the entrypoint of the backup pod

Description

There are specific K8s environments that require lead time after a pod starts before a backup can run.

Since the backup runs at the entrypoint of the backup pod, there is a good chance of failure in such environments. This is a feature request to introduce a custom delay at the entrypoint of the backup pod.
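
As a rough illustration only (not the operator's actual API), the delay could be modeled as an initContainer that sleeps before the backup container's entrypoint runs; the resource names, images, and the STARTUP_DELAY_SECONDS variable below are hypothetical placeholders:

apiVersion: batch/v1
kind: Job
metadata:
  name: example-backup-with-delay           # placeholder name, not generated by the operator
spec:
  template:
    spec:
      restartPolicy: OnFailure
      initContainers:
        - name: startup-delay
          image: busybox:1.36                # placeholder image
          # Sleep for a configurable number of seconds so the backup
          # container's entrypoint only starts after the lead time expires.
          command: ["sh", "-c", "sleep ${STARTUP_DELAY_SECONDS:-60}"]
          env:
            - name: STARTUP_DELAY_SECONDS    # hypothetical knob, not an existing operator field
              value: "60"
      containers:
        - name: pgbackrest-backup
          image: example/pgbackrest:latest   # placeholder image
          command: ["pgbackrest", "backup"]  # simplified stand-in for the real entrypoint

An equivalent approach would be to prefix the backup command itself with a configurable sleep; either way, the key point is that the wait happens before the entrypoint contacts the apiserver.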

Environment

None

AFFECTED CS IDs

CS0047481

Activity

Kory Brown August 5, 2024 at 4:18 PM

Here are some logs of the kube-apiserver timeout. These are from a DB node spinning up. When this node runs into the apiserver timeout, it either kills itself and spins right back up, or it retries the apiserver and it eventually works. Once it spins back up, the apiserver flow seems to work. If we could have the job restart itself instead, or have it retry the backup cmd, it should work, since it just needs a little bit of time after pod creation for the flow to be enabled. As of now the job deletes itself and a new job is spun up in its place, which continues to run into the apiserver timeout issue.

 

2024-08-05 15:54:45,558 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f0c06797320>, 'Connection to 192.168.0.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/anchorepercona-tcop/pods?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dpercona-anchore%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dpercona-anchore-ha
2024-08-05 15:54:45,559 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f0c067973c8>, 'Connection to 192.168.0.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/anchorepercona-tcop/endpoints?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dpercona-anchore%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dpercona-anchore-ha
2024-08-05 15:54:48,062 ERROR: Request to server https://192.168.0.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='192.168.0.1', port=443): Max retries exceeded with url: /api/v1/namespaces/anchorepercona-tcop/endpoints?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dpercona-anchore%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dpercona-anchore-ha (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f0c06797470>, 'Connection to 192.168.0.1 timed out. (connect timeout=2.5)'))",)
2024-08-05 15:54:48,063 ERROR: Request to server https://192.168.0.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='192.168.0.1', port=443): Max retries exceeded with url: /api/v1/namespaces/anchorepercona-tcop/pods?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dpercona-anchore%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dpercona-anchore-ha (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f0c06797240>, 'Connection to 192.168.0.1 timed out. (connect timeout=2.5)'))",)
2024-08-05 15:54:49,064 ERROR: ObjectCache.run K8sConnectionFailed('No more API server nodes in the cluster',)
2024-08-05 15:54:49,065 ERROR: ObjectCache.run K8sConnectionFailed('No more API server nodes in the cluster',)
2024-08-05 15:54:49,089 INFO: No PostgreSQL configuration items changed, nothing to reload.
2024-08-05 15:54:49,093 INFO: Lock owner: None; I am percona-anchore-db-l2jd-0
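
A variant of the retry idea described above, sketched under the assumption that an initContainer could poll the in-cluster apiserver before the backup entrypoint runs; the image, endpoint, and timeout below are illustrative assumptions, not existing operator behavior:

# Slots into the backup Job's pod template alongside the backup container.
initContainers:
  - name: wait-for-apiserver
    image: curlimages/curl:8.8.0     # placeholder image
    command:
      - sh
      - -c
      - |
        # Poll the in-cluster apiserver until a connection succeeds
        # (curl exits 0 for any HTTP response); give up after ~5 minutes.
        for i in $(seq 1 60); do
          if curl -sk --connect-timeout 2 https://kubernetes.default.svc/healthz >/dev/null; then
            echo "apiserver reachable"; exit 0
          fi
          echo "apiserver not reachable yet ($i/60)"; sleep 5
        done
        echo "apiserver still unreachable, giving up"; exit 1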

Kory Brown August 1, 2024 at 12:55 AM

We have a delay in our environment before the kube-apiserver flow is available, and it is causing a timeout to occur during backup. If we could add a delay before the backup process starts, that would allow time for the flow to be enabled.

One angle we could explore to fix the issue is that the pod is deleted after the backup process runs into the timeout. Because the pod is deleted and a new pod is constantly deployed in its place, the delay to open the flow is never resolved. If it were possible for the backup pod not to be deleted after it errors, but instead to restart itself until the flow is enabled, then I believe that would solve the problem as well. I'm not sure whether adding a probe would allow such a behavior of being restarted rather than deleted; my K8s admins suggested adding a readiness probe, and this may be why.
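
For reference, a minimal sketch of the restart-in-place behavior described above, assuming the backup Job's spec could be adjusted; backoffLimit and restartPolicy are standard Kubernetes Job fields, but whether the operator exposes them for its backup Jobs is not confirmed here:

apiVersion: batch/v1
kind: Job
metadata:
  name: example-backup-retry               # placeholder name
spec:
  backoffLimit: 10                         # failed attempts allowed before the Job is marked failed
  template:
    spec:
      # OnFailure makes the kubelet restart the failing container in the same
      # pod with exponential backoff, instead of the Job controller replacing
      # the pod, so later attempts run after the network flow is available.
      restartPolicy: OnFailure
      containers:
        - name: pgbackrest-backup
          image: example/pgbackrest:latest   # placeholder image
          command: ["pgbackrest", "backup"]  # simplified stand-in for the real entrypoint

Note that a readiness probe only controls whether the pod is considered ready to receive traffic; it does not restart containers. A failing liveness or startup probe is what triggers a container restart, and only according to the pod's restartPolicy.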

Jobin Augustine July 24, 2024 at 2:42 PM

It's not reproducible in the K8s environment I have, but the problem is reproducible in the K8s environment the customer has.

Details

Needs QA

Yes

Needs Doc

Yes

Created July 24, 2024 at 2:10 PM
Updated 3 days ago
