LP #987495: pt-table-checksum: Switch to disable #-of-rows checks on slaves

Description

Reported in Launchpad by Mrten, last update 30-08-2013 09:33:04

I keep running into the check that the number of rows on the slave must not be too different from the number of rows on the master:

04-23T02:30:33 Skipping table aal.it_attachment because on the master it would be checksummed in one chunk but on these replicas it has too many rows:
57328 rows on piro.ii.nl
The current chunk size limit is 42938 rows (chunk size=21469 * chunk size limit=2.0).
04-23T03:21:59 Skipping table coe.blob because on the master it would be checksummed in one chunk but on these replicas it has too many rows:
14154 rows on erika.ii.nl
51916 rows on piro.ii.nl
The current chunk size limit is 11924 rows (chunk size=5962 * chunk size limit=2.0).
04-23T04:03:20 Skipping table ncd.tAdres because on the master it would be checksummed in one chunk but on these replicas it has too many rows:
21899 rows on erika.ii.nl
22069 rows on piro.ii.nl
The current chunk size limit is 20902 rows (chunk size=10451 * chunk size limit=2.0).
pt-table-checksum failed
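
For reference, a minimal Python sketch of the check as I understand it from the log above (this is not the tool's actual Perl code; the function name is made up for illustration):

    # Sketch of the oversized-replica check described in the log above;
    # not pt-table-checksum's actual code.
    def replica_too_big(replica_rows, master_chunk_size, chunk_size_limit=2.0):
        # The table is skipped when a replica reports more rows than the
        # master-derived chunk size times the chunk size limit.
        return replica_rows > master_chunk_size * chunk_size_limit

    # aal.it_attachment from the log: 57328 rows on piro.ii.nl, chunk
    # size 21469, limit 2.0 -> cap of 42938 rows -> table skipped.
    print(replica_too_big(57328, 21469))  # True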

I've already had a discussion with Baron about this on percona-discussion (subject 'pt-table-checksum and chunking'; please see that thread for the history). This is the current invocation:

pt-table-checksum --recursion-method dsn=h=127.0.0.1,P=3306,D=maatkit,t=pt_check_slave_delay_dsns --function MURMUR_HASH --replicate maatkit.pt_checksum --ignore-tables "$ignore_tables,security_log" --ignore-databases "$ignore_databases,mysql,maatkit" --no-check-replication-filters --chunk-time 0.25 $quiet --max-lag 600 h=127.0.0.1,P=3306

Environment

None


Activity


lpjirasync January 24, 2018 at 6:33 PM

Comment from Launchpad by: lyxing on: 30-08-2013 09:33:03

I have a table with two columns and a primary key, and the table only has 2019 rows. When I use pt-table-checksum on it, I get the same issue:
08-30T05:02:04 Skipping table game_sns.wm_user because on the master it would be checksummed in one chunk but on these replicas it has too many rows:
2094 rows on g2dbbackup
The current chunk size limit is 2000 rows (chunk size=1000 * chunk size limit=2.0).

I checked the table status and found that it differs between the master (1902 rows) and the slave (2033 rows). But when I set chunk-size-limit=3.0, it goes well.
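
For illustration only, assuming the master-side chunk size was about 1000 rows as the log above suggests:

    # With a master-side chunk size of ~1000 rows:
    print(2094 > 1000 * 2.0)  # True  -> skipped with the default limit of 2.0
    print(2094 > 1000 * 3.0)  # False -> not skipped with chunk-size-limit=3.0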

lpjirasync January 24, 2018 at 6:33 PM

Comment from Launchpad by: Sheeri K. Cabral on: 03-04-2013 15:55:03

Since this is the bug to register the desire to be able to disable chunk-size-limit, and since chunk-size-limit does not work, could you at least take this out of the docs?

"You can disable oversized chunk checking by specifying a value of 0."

lpjirasync January 24, 2018 at 6:33 PM

Comment from Launchpad by: Mrten on: 22-11-2012 22:25:57

I don't know if I'm an edge case. I do have quite big tables (with blobs, yes, I know) sitting in between small ones, and that scenario is repeated for some customers.

As for suggestions:

Ask for the cardinality of the table a few times and average the results (no idea if this is a dumb suggestion).

Or, single-chunkability on the master could be decided by the (dynamic) chunk-size instead of chunk-size * chunk-size-limit.

Or, though you would solidly be in the heuristics category: initialize the dynamic chunk size for each table anew:

  • calculate the average row length (filesize / #rows) of both the previous (A) and the next table (B), in bytes

  • chunk-size for the new table is chunk-size * A/B

This would keep the number of bytes checksummed per second around the same value; a rough sketch follows below.
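
A rough Python sketch of that last idea (my own illustration, not anything the tool currently does; all figures are made up):

    # Hypothetical sketch of the suggestion above; not existing
    # pt-table-checksum behaviour. Scale the starting chunk size for the
    # next table by the ratio of average row lengths so that roughly the
    # same number of bytes is checksummed per second.
    def initial_chunk_size(chunk_size, prev_file_size, prev_rows,
                           next_file_size, next_rows):
        avg_row_a = prev_file_size / prev_rows   # bytes per row, previous table (A)
        avg_row_b = next_file_size / next_rows   # bytes per row, next table (B)
        return int(chunk_size * avg_row_a / avg_row_b)

    # e.g. moving from ~100-byte rows to ~4000-byte (blob-heavy) rows
    # shrinks the starting chunk size from 10000 to 250.
    print(initial_chunk_size(10000, 50_000_000, 500_000, 400_000_000, 100_000))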

I'll keep thinking...

lpjirasync January 24, 2018 at 6:33 PM

Comment from Launchpad by: Daniel Nichter on: 22-11-2012 19:53:23

Yes, that's quite plausible. But what's the alternative? Even if the slave reported 36001 rows (i.e. one too many), the line has to be drawn somewhere. Granted, that line is dynamic, but I think it works in the majority of cases. It seems you have many edge cases (hence the request for a way to disable this check)?

The real, long-term solution is what Baron wanted to do a while ago: stop single-chunking and just always nibble. Future versions of this tool (and others) will do this, but there's no ETA for that yet. However, I'm not sure what the best near-term solution is. I don't think this feature needs a switch to disable it because afaik it works the vast majority of the time, but I would also like the tool to work better for you. Suggestions?

Also, this example from the original report demonstrates the usefulness of this check:

04-23T03:21:59 Skipping table coe.blob because on the master it would be checksummed in one chunk but on these replicas it has too many rows:
14154 rows on erika.ii.nl
51916 rows on piro.ii.nl
The current chunk size limit is 11924 rows (chunk size=5962 * chunk size limit=2.0).

So erika.ii.nl is kind of near the limit, but piro.ii.nl is way off. Maybe it's just a really bad EXPLAIN estimate, or maybe not--the tool can't know. If you disable the check, the single-chunk on piro.ii.nl could be very slow.

lpjirasync January 24, 2018 at 6:33 PM

Comment from Launchpad by: Mrten on: 22-11-2012 19:16:43

OK. So what I think is happening is that at a given time, after having checksummed a table, the dynamic chunk-size is, for example, around 9000. The master, when queried, reports the next table as having 35000 rows, so the tool decides it is single-chunkable (I'm using chunk-time 0.25 and chunk-size-limit 4, and 4 * 9000 = 36000).

However, the slave reports the same table as having 37000 rows there, so the test errors out.

Is this plausible?

Won't Do

Details


Created January 24, 2018 at 6:32 PM
Updated February 4, 2018 at 12:23 AM
Resolved January 24, 2018 at 6:33 PM
