Add Option to Limit SST Retry Attempts

Description

Hi,

When SST fails several times in a row, for example because of a software bug, it is unlikely to recover on its own without intervention.

Meanwhile, SST keeps being retried, consuming resources such as network bandwidth and degrading the donor's performance.

The Kubelet's --backoff-max-restart-delay is set to 5 minutes by default, which can be too short an interval between restart attempts. As a result, SST can be retried many times within a short timeframe, potentially impacting cluster performance.
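To give a rough sense of scale, here is a small Python sketch that counts container restarts under a capped exponential backoff. It assumes the usual Kubernetes crash-loop pattern (delay starts at 10 seconds and doubles up to the 5-minute cap) and that the backoff is never reset. The function names and the `attempt_seconds` parameter are made up for illustration, and real SST attempts take longer than zero seconds, so treat the numbers as an upper bound.

```python
from itertools import count


def restart_delays(initial=10, factor=2, cap=300):
    """Yield backoff delays in seconds: 10, 20, 40, ... capped at `cap`."""
    delay = initial
    for _ in count():
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def restarts_within(window_seconds, attempt_seconds=0, cap=300):
    """Count restarts that happen within `window_seconds`.

    Simplified model (assumption): every attempt fails after
    `attempt_seconds`, and the backoff is never reset.
    """
    elapsed = 0
    restarts = 0
    for delay in restart_delays(cap=cap):
        elapsed += attempt_seconds + delay
        if elapsed > window_seconds:
            return restarts
        restarts += 1


if __name__ == "__main__":
    # Restarts per hour and per day if each attempt fails immediately.
    print(restarts_within(3600))
    print(restarts_within(24 * 3600))
```

Under this model, once the cap is reached a failing node asks for a new SST roughly every 5 minutes plus the time each attempt takes, and it never stops.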

Ideally, nodes should have a configurable limit on SST attempts. After a few failed retries, they could stop requesting SST to avoid putting additional pressure on the cluster.
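A minimal sketch of the behaviour I have in mind, in Python for brevity. This is not the operator's API: `SSTRetryLimiter`, `max_attempts`, and the method names are all hypothetical and only illustrate the idea of a configurable cap.

```python
from dataclasses import dataclass


@dataclass
class SSTRetryLimiter:
    """Tracks consecutive SST failures against a configurable cap."""

    max_attempts: int  # hypothetical setting; 0 means "no limit"
    failures: int = 0  # consecutive failed SST attempts so far

    def may_request_sst(self) -> bool:
        """Return True while the node is still allowed to request SST."""
        return self.max_attempts == 0 or self.failures < self.max_attempts

    def record_failure(self) -> None:
        """Call after an SST attempt fails."""
        self.failures += 1

    def record_success(self) -> None:
        """Call after a successful SST; resets the counter."""
        self.failures = 0


if __name__ == "__main__":
    limiter = SSTRetryLimiter(max_attempts=3)
    while limiter.may_request_sst():
        # A real node would request SST here; this demo fails every attempt.
        limiter.record_failure()
    print(f"giving up after {limiter.failures} failed SST attempts")
```

In a real pod the counter would need to survive container restarts (for example, kept on the data volume), otherwise every restart would start counting from zero again.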

Thanks!

Environment

None

Assignee

ege.gunes

Reporter

Juan Arruti

Needs QA

Yes

Fix versions

1.19.0

Affects versions

1.16.1

Priority

Medium

Created April 9, 2025 at 12:50 AM

Updated April 9, 2025 at 2:10 PM