Hi,
When SST fails several times in a row, for example because of a software bug, it is unlikely to recover on its own without manual intervention.
Meanwhile, the node keeps retrying SST, which consumes resources such as network bandwidth and degrades the donor's performance.
The kubelet's --backoff-max-restart-delay is 5 minutes by default, which can be too short an interval between restart attempts. The result is repeated SST retries within a short timeframe, which can hurt cluster performance.
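To put a rough number on this, here is a small back-of-the-envelope sketch. It assumes the usual Kubernetes restart back-off (starting at 10 seconds, doubling each time, capped at 5 minutes) and that every attempt fails immediately. The function names are mine, purely for illustration.

```python
def restart_delays(initial=10, factor=2, cap=300):
    """Yield successive back-off delays in seconds, doubling up to `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def restarts_within(window_seconds, **kwargs):
    """Count restarts that fit in the window.

    Assumption: each attempt fails instantly, so only the back-off
    delays consume time. Real SST attempts take longer, so this is an
    upper bound on the number of attempts.
    """
    elapsed = 0
    count = 0
    for delay in restart_delays(**kwargs):
        if elapsed + delay > window_seconds:
            return count
        elapsed += delay
        count += 1


if __name__ == "__main__":
    print(restarts_within(3600))  # restarts in the first hour
```

Under these assumptions a failing node restarts 15 times in the first hour and then about 12 times per hour after that, and each restart can trigger another SST from a donor.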
Ideally, nodes would have a configurable limit on SST attempts. After a few failed retries, a node would stop requesting SST so that it puts no additional pressure on the cluster.
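To make the idea concrete, here is a minimal sketch of the behavior I have in mind. Everything in it is hypothetical: `SstRetryGuard`, `max_attempts`, and `join_cluster` are illustrative names, not existing options or APIs, and `request_sst` stands in for whatever actually performs the transfer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SstRetryGuard:
    """Tracks consecutive SST failures against a configurable limit."""
    max_attempts: int = 3  # hypothetical setting, not an existing option
    failures: int = 0

    def may_request_sst(self) -> bool:
        return self.failures < self.max_attempts

    def record_failure(self) -> None:
        self.failures += 1

    def record_success(self) -> None:
        # A successful transfer resets the counter.
        self.failures = 0


def join_cluster(guard: SstRetryGuard, request_sst: Callable[[], bool]) -> bool:
    """Request SST until it succeeds or the limit is reached.

    Returns True on success, False once the node gives up and
    should wait for an operator instead of asking a donor again.
    """
    while guard.may_request_sst():
        if request_sst():
            guard.record_success()
            return True
        guard.record_failure()
    return False
```

In a real deployment the failure count would probably have to survive container restarts (for example by being stored on the data volume), since each failed attempt may end with the process exiting.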
Thanks!