ALTER INSTANCE RELOAD TLS results in a very long execution

Description

Running ALTER INSTANCE RELOAD TLS may result in very long execution times, like over 30 minutes or more depending on server load.

 

mysql [(none)]> select * from performance_schema.processlist where info like '%reload%'; +---------+----------------+--------------------+------+---------+------+-----------+-------------------------------------------------------------------------+---------+-----------+---------------+------------------+ | ID | USER | HOST | DB | COMMAND | TIME | STATE | INFO | TIME_MS | ROWS_SENT | ROWS_EXAMINED | EXECUTION_ENGINE | +---------+----------------+--------------------+------+---------+------+-----------+-------------------------------------------------------------------------+---------+-----------+---------------+------------------+ | 8695002 | root | 10.63.222.42:42742 | NULL | Query | 1142 | starting | alter instance reload tls | 1141688 | 1 | 1 | PRIMARY | .... +---------+----------------+--------------------+------+---------+------+-----------+-------------------------------------------------------------------------+---------+-----------+---------------+------------------+ 2 rows in set (0.00 sec) mysql [(none)]> select * from performance_schema.threads where processlist_id = 8695002\G *************************** 1. row *************************** THREAD_ID: 8695166 NAME: thread/sql/one_connection TYPE: FOREGROUND PROCESSLIST_ID: 8695002 PROCESSLIST_USER: root PROCESSLIST_HOST: 10.63.222.42 PROCESSLIST_DB: NULL PROCESSLIST_COMMAND: Query PROCESSLIST_TIME: 1177 PROCESSLIST_STATE: starting PROCESSLIST_INFO: alter instance reload tls PARENT_THREAD_ID: NULL ROLE: NULL INSTRUMENTED: YES HISTORY: YES CONNECTION_TYPE: SSL/TLS THREAD_OS_ID: 17204 RESOURCE_GROUP: USR_default EXECUTION_ENGINE: PRIMARY CONTROLLED_MEMORY: 12336 MAX_CONTROLLED_MEMORY: 20576 TOTAL_MEMORY: 133394 MAX_TOTAL_MEMORY: 139460 TELEMETRY_ACTIVE: NO 1 row in set (0.00 sec)

If we strace that thread we get a bunch of spinlock waits:

sudo strace -p 17204 strace: Process 17204 attached restart_syscall(<... resuming interrupted read ...>) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, 0x7ed74a557f40) = 0 ....

This looks to be related to a bug fixed in 8.0.33:
An object used internally by ALTER INSTANCE RELOAD TLS was not freed until the number of readers reached 0, under the assumption is that the number of readers should reach 0 fairly frequently. The read lock held during an SSL handshake is generally an expensive operation, with network calls, so when roundtrips between the client and the server took excessively long, the lock was held for a relatively long amount of time. This meant that, when changing the value of this object and there were a sufficient number of incoming SSL connections being made, the number of readers might not reach 0 in a reasonable length of time, leaving the thread holding the lock using 100% of the CPU until the lock was released.We fix this by adding a wait after setting the pointer to this object to a new value, but before releasing the old object.Our thanks to Sinisa Milivojevic for the contribution. (Bug #107567, Bug #34284186)

But the bug is private so I can’t tell much about it.

 

We did a quick code review and seems that there is a forced loop wait when there are active readers being > 0 so it just loops through a wait process then retry, see https://github.com/percona/percona-server/blob/release-8.0.35-27/include/my_rcu_lock.h#L204 and https://github.com/percona/percona-server/blob/release-8.0.35-27/include/my_rcu_lock.h#L186

The affected version is 8.0.35, I assume it was never really fixed but just found a workaround that works ok in non-busy systems.

We don’t have a good way to reproduce but it seems that if we try to open a bunch of TLS connections against the server and run the ALTER command the condition will be hit.

The only effect we observed is the ALTER command running for a very long being in starting state.

Environment

None

Activity

Show:

Sveta Smirnova December 10, 2024 at 2:22 PM

This is a duplicate of https://perconadev.atlassian.net/browse/PS-8846

And you are correct, the upstream patch is just a workaround for https://perconadev.atlassian.net/browse/PS-8699, not real fix.

Duplicate

Details

Assignee

Reporter

Needs QA

Yes

Sprint

Affects versions

Priority

Smart Checklist

Created December 9, 2024 at 8:13 PM
Updated December 20, 2024 at 3:09 AM
Resolved December 12, 2024 at 1:34 PM