"Header page contains inconsistent data in datafile" after SST

Description

It is reported that occasionally, during SST, tablespaces get silently corrupted, causing a later XtraBackup run to fail with the error "Header page contains inconsistent data in datafile".

The triggering condition appears to be a PXC 5.7 => 8.0 upgrade, where the corruption manifests on a second node that later joined from the upgraded node. The corruption is discovered once a third node tries to join with the second node as the donor, or when a regular backup is taken from the second node.

The issue is difficult to reproduce but possible with a custom script.

 

Environment

None

AFFECTED CS IDs

CS0042298

Activity


Kamil Holubicki December 18, 2023 at 11:55 AM

Hi, thank you for the information you provided. It was very helpful during the investigation.

I was not able to reproduce it so far, but the code analysis confirms your findings.

I’ve created a PR which implements the idea you described in your previous comment (shutdown() should be called before close()), but I implemented it in a way that tries to fix the broken API contract: a client calling close() expects the underlying machinery to just close; the client should not have to care about the underlying details.
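
A minimal sketch of that contract, assuming the fix folds the orderly TLS shutdown into close() itself; the class and member names here are illustrative, not Galera's actual API:

    #include <unistd.h>

    // Illustrative only: close() performs the TLS shutdown internally, so
    // callers never have to sequence shutdown() and close() themselves.
    class TlsStream {
    public:
        void close() {
            if (fd_ < 0) return;   // idempotent: a second close() is a no-op
            shutdown();            // send close_notify while fd_ is still valid
            ::close(fd_);          // only now release the descriptor
            fd_ = -1;
        }
        ~TlsStream() { close(); }  // the destructor reuses the same safe path
    private:
        void shutdown() { /* SSL_shutdown(ssl_) in the real implementation */ }
        int fd_ = -1;
    };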

Would it be possible to test it on your side? The PR is

Andrew Garner December 12, 2023 at 5:40 PM

Avoiding the write-after-close in the Galera IST receiver makes the data corruption go away for us with a small change.

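A compilable sketch of the shape of such a change, assuming the fix is simply to stop closing the socket explicitly and to let the stream destructor run the shutdown/close pair in order (see the next paragraph); the types and helpers are hypothetical stand-ins, not Galera's real code:

    #include <memory>

    // Hypothetical stand-in for the Galera TLS stream.
    struct TlsStream {
        void shutdown() { /* SSL_shutdown(): emits close_notify on the open fd */ }
        void close()    { /* ::close(fd): releases the descriptor */ }
        ~TlsStream() { shutdown(); close(); }   // orderly teardown, in this order
    };

    void receiver_run(std::unique_ptr<TlsStream> stream)
    {
        // ... receive and apply IST events ...

        // stream->close();   // removed: closing here, ahead of the destructor's
        //                    // shutdown(), is the out-of-order sequence that let
        //                    // close_notify land on a reused descriptor
    }   // stream destroyed here: shutdown() first, then close()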

This relies on the fact that the socket destructor is going to SSL_shutdown and close the file descriptor anyway, so avoiding the out-of-order close/shutdown sequence avoids the bug here.

 

With this change, we observe a different i/o pattern during SST: the previously observed write to an already closed file descriptor is gone.

aristotle.po December 12, 2023 at 3:04 AM

I was able to reproduce it only after 2045 attempts on my test AWS EC2 instance (t2.medium), which I left running the script reproduce_data_corruption_via_docker.bash provided in the internal ticket.


Logfile: /tmp/xtrabackup.log

OS:

Andrew Garner December 6, 2023 at 5:45 PM

Reproducing the data corruption is difficult and very timing-dependent. However, we still see the problematic i/o pattern during a normal IST; it just doesn't result in data corruption there.


Essentially, we see the IST thread writing to a closed file descriptor - the write obviously fails, and the failure is ignored.

Andrew Garner December 6, 2023 at 2:13 PM

I think the issue is the same as the one reported in https://jira.mariadb.org/browse/MDEV-31502.

 

We think what is happening is in galera::ist::Receiver::run(), and we suggest the following sequence of events can occur (a standalone sketch of the resulting fd-reuse hazard follows the list):

  • Connection is accepted from the donor => fd=16

  • IST runs to the end

  • galera::ist::Receiver::run() closes the socket (fd=16 is now closed)

  • In a very timing-dependent window, InnoDB opens a file and happens to reuse fd=16

  • The gu::AsioStreamReact object goes out of scope and its destructor runs

  • That destructor runs SSL_shutdown() (via engine_->shutdown()), which under TLSv1.3 writes out a 24-byte close_notify message. This gets written to the original socket's descriptor, fd=16, which no longer refers to the socket but has been reused for an InnoDB datafile

  • socket_.close() is also run again, but the asio stream was already marked as closed, so this is a no-op and does not result in a system call.
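
A minimal standalone sketch of the fd-reuse hazard in that sequence, using two ordinary files in place of the TLS socket and the InnoDB datafile (all paths here are made up for illustration):

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main()
    {
        // 1. "IST socket": open something and remember its descriptor number.
        int sock_fd = open("/tmp/fake_ist_socket", O_CREAT | O_WRONLY, 0600);

        // 2. Receiver::run() finishes and closes the socket.
        close(sock_fd);

        // 3. "InnoDB" opens a datafile; the kernel hands out the lowest free
        //    descriptor number, which is the one just released.
        int ibd_fd = open("/tmp/fake_table.ibd", O_CREAT | O_RDWR, 0600);
        printf("old fd=%d, new fd=%d, reused=%s\n", sock_fd, ibd_fd,
               sock_fd == ibd_fd ? "yes" : "no");

        // 4. The stream destructor still holds the stale number and flushes a
        //    24-byte close_notify record "to the socket" - i.e. into the datafile.
        const unsigned char record[24] = {0x17, 0x03, 0x03, 0x00, 0x13};
        ssize_t n = write(sock_fd, record, sizeof(record));
        printf("stale write returned %zd -> datafile silently corrupted\n", n);

        close(ibd_fd);
        return 0;
    }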

 

Reproducing the data corruption seems to require really unlucky timing. On fast systems, we've run thousands of iterations with no problem. We have had at least two production systems stumble across this, and our CI uses fairly small VM types with cheap external persistent disks on GCP, which seem to be slow enough to see this with some frequency.

 

With strace, we can observe the i/o pattern here.


In https://jira.mariadb.org/browse/MDEV-31502 the cause seems to be identical, but TLSv1.2 was probably being used, which results in 31 vs. 24 bytes being flushed out. In both cases we see the pattern of an OpenSSL record - here `0x1703030013...`, and in the MariaDB bug, `0x150303001a...`.
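
The byte counts follow from the 5-byte TLS record header (content type, version, length); a quick check, assuming the captured prefixes are read as record headers:

    #include <cstdio>

    int main()
    {
        // TLSv1.3: 0x17 = application_data (1.3 wraps the encrypted alert),
        // 0x0303 = legacy version, length 0x0013 = 19 -> 5 + 19 = 24 bytes.
        printf("TLSv1.3 close_notify record: %d bytes\n", 5 + 0x0013);

        // TLSv1.2: 0x15 = alert, 0x0303 = version,
        // length 0x001a = 26 -> 5 + 26 = 31 bytes.
        printf("TLSv1.2 close_notify record: %d bytes\n", 5 + 0x001a);
        return 0;
    }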

Most of the time, if you just trace Galera's system calls, you'll see a failed write, e.g. "write(16, ..., 24) = -EBADF", which doesn't cause any harm - it tried to write to a closed fd.

We reproduced this on GCP e2-medium VMs running MySQL on a rather small (15GB) pd-standard disk. We observed the corruption anywhere between 1 and 129 iterations in this environment. If TLS is disabled, the problem does not manifest at all, but that's not an option for our environments.

 

Done

Details

Assignee

Reporter

Needs Review

Yes

Needs QA

Yes

Affects versions

Priority


Created December 6, 2023 at 1:10 PM
Updated March 8, 2024 at 1:52 PM
Resolved January 16, 2024 at 10:11 AM