Add Option to Limit SST Retry Attempts

Description

Hi,

When SST fails several times in a row, for example because of a software bug, the failure is unlikely to resolve itself without manual intervention.

Meanwhile, SST keeps being retried, consuming resources such as network bandwidth and degrading the donor's performance.

The kubelet's --backoff-max-restart-delay is set to 5 minutes by default, which can be too short a delay between restart attempts. As a result, SST can be retried many times within a short timeframe, potentially impacting cluster performance.
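
To put a rough number on this, here is a small sketch that counts how many restarts fit into a time window when the delay is capped at 5 minutes. It assumes the delay starts at 10 seconds and doubles after each restart, and it ignores the time each SST attempt itself takes. The function names are made up for illustration.

```python
def restart_delays(initial=10, cap=300, factor=2):
    """Yield successive restart delays in seconds, doubling up to the cap.

    Assumption: a 10 s initial delay that doubles each time; only the
    5 minute (300 s) cap comes from the description above.
    """
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def attempts_within(window_seconds, initial=10, cap=300):
    """Count restarts whose cumulative delay fits inside the window."""
    elapsed = 0
    count = 0
    for delay in restart_delays(initial, cap):
        if elapsed + delay > window_seconds:
            return count
        elapsed += delay
        count += 1


if __name__ == "__main__":
    # Number of restart attempts (and so potential SST requests) in one hour.
    print(attempts_within(3600))
```

Under these assumptions, a node that never recovers would trigger 15 restarts in the first hour, and roughly 12 per hour after that.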

Ideally, nodes should have a configurable limit on SST attempts. After a few failed retries, they could stop requesting SST to avoid putting additional pressure on the cluster.
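
A minimal sketch of the behavior I have in mind, only to illustrate the request. The class, the `max_attempts` option, and the `request_sst` callable are all hypothetical and do not correspond to any existing setting or API.

```python
from dataclasses import dataclass


@dataclass
class SSTRetryLimiter:
    """Tracks consecutive SST failures against a configurable limit."""

    max_attempts: int = 3  # hypothetical option, not an existing setting
    failures: int = 0

    def may_request_sst(self) -> bool:
        # Allow another SST request only while under the limit.
        return self.failures < self.max_attempts

    def record_failure(self) -> None:
        self.failures += 1

    def record_success(self) -> None:
        # A successful SST resets the counter.
        self.failures = 0


def join_cluster(request_sst, limiter: SSTRetryLimiter) -> bool:
    """Request SST until it succeeds or the limit is reached.

    `request_sst` is a placeholder callable returning True on success.
    Returns False once the node has given up and needs manual intervention.
    """
    while limiter.may_request_sst():
        if request_sst():
            limiter.record_success()
            return True
        limiter.record_failure()
    return False
```

With `max_attempts=3`, a node whose SST always fails would contact the donor three times and then stop, instead of retrying indefinitely.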

Thanks!

Environment

None

Created April 9, 2025 at 12:50 AM
Updated April 9, 2025 at 2:10 PM