LP #1127450: pt-archiver --bulk-insert may corrupt data

lpjirasync January 24, 2018 at 2:23 PM
**Comment from Launchpad by: Daniel Nichter on: 23-04-2013 17:13:45
Another fix was made for this: http://bazaar.launchpad.net/~percona-toolkit-dev/percona-toolkit/release-2.2.2/revision/581

lpjirasync January 24, 2018 at 2:23 PM
**Comment from Launchpad by: Alex Geis on: 16-04-2013 23:28:39
Appreciate you getting this one fixed for 2.2.2. This was a huge one for our workflow. Many thanks!

lpjirasync January 24, 2018 at 2:23 PM
**Comment from Launchpad by: Brian Fraser on: 02-04-2013 09:59:17
Having looked more into this, I have to amend my previous, overly optimistic message. Please do not use that workaround, and, at least until 2.2.2, do not use --bulk-insert with anything besides binary data / latin1. It may corrupt your data by double-encoding it.
There were two issues here: First, the missing encodings for the bulk-insert filehandle, and second, a missing 'CHARACTER SET ...' for the LOAD DATA LOCAL INFILE statement. Once this is properly fixed in trunk, I'll try posting a workaround for previous versions of pt-archiver here.
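To illustrate what "double-encoding" means here, the following is a minimal standalone sketch, not code from pt-archiver. It encodes one character to UTF-8 once (correct) and then a second time after the bytes have been misread as latin1 (the corruption). The final statement only shows where a CHARACTER SET clause sits in LOAD DATA LOCAL INFILE; the file path and table name are hypothetical placeholders, and the thread does not say exactly what statement the fix generates.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Render a byte string as space-separated hex for inspection.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# One character, U+00E9 (e with acute accent), as a Perl character string.
my $char = "\x{e9}";

# Correct: encode once. UTF-8 for U+00E9 is the two bytes C3 A9.
my $once = encode( 'UTF-8', $char );

# Double-encoding: the UTF-8 bytes are mistaken for latin1 characters
# and encoded again, giving the four bytes C3 83 C2 A9.
my $twice = encode( 'UTF-8', decode( 'ISO-8859-1', $once ) );

print "once:  ", hex_bytes($once),  "\n";
print "twice: ", hex_bytes($twice), "\n";

# Hypothetical statement text (path and table name are placeholders):
# the CHARACTER SET clause tells the server how to interpret the file.
my $sql = sprintf "LOAD DATA LOCAL INFILE '%s' INTO TABLE %s CHARACTER SET %s",
    '/tmp/bulk-insert.txt', 'dest_tbl', 'utf8';
print "$sql\n";
```

If a file is written as UTF-8 but read as latin1 into a utf8 column, each non-ASCII character ends up stored in the "twice" form.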

lpjirasync January 24, 2018 at 2:23 PM
**Comment from Launchpad by: Brian Fraser on: 02-04-2013 08:48:22
Possible workaround for previous versions: Try running the tool as
$ perl -Mopen=utf8 /path/to/pt-archiver ...
But not setting the encoding on the bulk-insert filehandle is a glaring oversight. This will be fixed in 2.2.2.

lpjirasync January 24, 2018 at 2:23 PM
**Comment from Launchpad by: Alex Geis on: 28-03-2013 10:16:10
I couldn't edit my earlier comment, and it looks like there was a line in the 2.1.8 build on my vol. Better diff:
diff -u /usr/local/bin/pt-archiver /usr/local/bin/pt-archiver.1
--- /usr/local/bin/pt-archiver 2013-03-28 06:11:27.143479965 -0400
+++ /usr/local/bin/pt-archiver.1 2013-03-28 06:03:48.391462519 -0400
@@ -5740,6 +5740,7 @@
require File::Temp;
$bulkins_file = File::Temp->new( SUFFIX => 'pt-archiver' )
or die "Cannot open temp file: $OS_ERROR\n";
+ binmode($bulkins_file,":utf8");
}
# This row is the first row fetched from each 'chunk'.
@@ -5966,7 +5967,8 @@
if ( $o->get('bulk-insert') ) {
$bulkins_file = File::Temp->new( SUFFIX => 'pt-archiver' )
or die "Cannot open temp file: $OS_ERROR\n";
+ binmode($bulkins_file,":utf8");
}
} # no next row (do bulk operations)
else {
PTDEBUG && _d('Got another row in this chunk');
Details
Assignee: Unassigned
Reporter: lpjirasync (Deactivated)
Priority: High

**Reported in Launchpad by Alex Geis last update 02-05-2013 23:07:52
This bug seems to pop up whenever the following conditions hold for a table-to-table archive copy:
1. The tables have a TEXT field that includes utf8 characters, such as foreign-language text
2. --charset utf8 is used
3. --bulk-insert is used
When all three conditions are true, pt-archiver immediately exits with a Wide Character error on line 3950. This seems similar to bug #940253 and appears to be linked to the utf8 encoding of the temporary file created for the bulk load, because the problem goes away as soon as I turn off --bulk-insert or set --no-check-charset. For now, I'm sacrificing speed (about 3x) by not using bulk-insert to get around this problem. Otherwise I need to use pt-table-sync once finished to repair all rows with encoding mismatches.
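The symptom can be reproduced outside pt-archiver with a minimal sketch like the one below, which is an illustration and not the tool's code. It prints a string containing a code point above 255 to a temporary file with no encoding layer, which makes Perl emit "Wide character in print", and then repeats the print on a handle with an explicit UTF-8 layer. The variable names and the '.tmp' suffix are made up for the example. A script that promotes warnings to fatal errors would die at the first print instead of only warning.

```perl
use strict;
use warnings;
use File::Temp;

# A character string containing a code point above 255 (U+4E2D).
my $text = "\x{4e2d}";

# Collect warnings so they can be inspected instead of going to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# No encoding layer on the handle: Perl warns "Wide character in print".
my $raw = File::Temp->new( SUFFIX => '.tmp' );
print {$raw} $text, "\n";

# With an explicit UTF-8 layer the same print is silent.
my $utf8 = File::Temp->new( SUFFIX => '.tmp' );
binmode( $utf8, ':encoding(UTF-8)' );
print {$utf8} $text, "\n";

print "warnings caught: ", scalar(@warnings), "\n";
print $warnings[0] if @warnings;
```

Only the first handle triggers the warning, which matches the observation that the problem is tied to the bulk-insert temporary file rather than to the rows themselves.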