LP #987495: pt-table-checksum: Switch to disable #-of-rows checks on slaves

Description

Reported in Launchpad by Mrten, last update 30-08-2013 09:33:04

I keep running into the check that the number of rows on the slave must not be too different from the number of rows on the master:

04-23T02:30:33 Skipping table aal.it_attachment because on the master it would be checksummed in one chunk but on these replicas it has too many rows:
57328 rows on piro.ii.nl
The current chunk size limit is 42938 rows (chunk size=21469 * chunk size limit=2.0).
04-23T03:21:59 Skipping table coe.blob because on the master it would be checksummed in one chunk but on these replicas it has too many rows:
14154 rows on erika.ii.nl
51916 rows on piro.ii.nl
The current chunk size limit is 11924 rows (chunk size=5962 * chunk size limit=2.0).
04-23T04:03:20 Skipping table ncd.tAdres because on the master it would be checksummed in one chunk but on these replicas it has too many rows:
21899 rows on erika.ii.nl
22069 rows on piro.ii.nl
The current chunk size limit is 20902 rows (chunk size=10451 * chunk size limit=2.0).
pt-table-checksum failed
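
For reference, a minimal Python sketch of the check as I understand it from the log above (this is not the tool's actual Perl code; the function name is made up for illustration):

    # Sketch of the oversized-replica check described in the log above;
    # not pt-table-checksum's actual code.
    def replica_too_big(replica_rows, master_chunk_size, chunk_size_limit=2.0):
        # The table is skipped when a replica reports more rows than the
        # master-derived chunk size times the chunk size limit.
        return replica_rows > master_chunk_size * chunk_size_limit

    # aal.it_attachment from the log: 57328 rows on piro.ii.nl, chunk
    # size 21469, limit 2.0 -> cap of 42938 rows -> table skipped.
    print(replica_too_big(57328, 21469))  # True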

I've already had a discussion with Baron about this on percona-discussion (subject 'pt-table-checksum and chunking'; please see that thread for the history). This is the current invocation:

pt-table-checksum --recursion-method dsn=h=127.0.0.1,P=3306,D=maatkit,t=pt_check_slave_delay_dsns --function MURMUR_HASH --replicate maatkit.pt_checksum --ignore-tables "$ignore_tables,security_log" --ignore-databases "$ignore_databases,mysql,maatkit" --no-check-replication-filters --chunk-time 0.25 $quiet --max-lag 600 h=127.0.0.1,P=3306

Environment

None


Activity


lpjirasync January 24, 2018 at 6:33 PM

Comment from Launchpad by: lyxing on: 30-08-2013 09:33:03

I have a table with two columns and a primary key, and the table only has 2019 rows. When I use pt-table-checksum on it, I get the same issue:
08-30T05:02:04 Skipping table game_sns.wm_user because on the master it would be checksummed in one chunk but on these replicas it has too many rows:
2094 rows on g2dbbackup
The current chunk size limit is 2000 rows (chunk size=1000 * chunk size limit=2.0).

I checked the table status and found that it differs between the master (1902 rows) and the slave (2033 rows). But when I set chunk-size-limit=3.0, it goes well.
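
For illustration only, assuming the master-side chunk size was about 1000 rows as the log above suggests:

    # With a master-side chunk size of ~1000 rows:
    print(2094 > 1000 * 2.0)  # True  -> skipped with the default limit of 2.0
    print(2094 > 1000 * 3.0)  # False -> not skipped with chunk-size-limit=3.0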

lpjirasync January 24, 2018 at 6:33 PM

Comment from Launchpad by: Sheeri K. Cabral on: 03-04-2013 15:55:03

Since this is the bug to register the desire to be able to disable chunk-size-limit, and since chunk-size-limit does not work, could you at least take this out of the docs?

"You can disable oversized chunk checking by specifying a value of 0."

lpjirasync January 24, 2018 at 6:33 PM

Comment from Launchpad by: Mrten on: 22-11-2012 22:25:57

I don't know if I'm an edge case. I do have quite big tables (with blobs, yes, I know) sitting in between small ones, and that scenario is repeated for some customers.

As for suggestions:

Ask for the cardinality of the table a few times and average the results (no idea if this is a dumb suggestion).

Or, single-chunkability on the master could be decided by the (dynamic) chunk-size instead of chunk-size * chunk-size-limit.

Or, though you would solidly be in the heuristics category: initialize the dynamic chunk size for each table anew:

  • calculate the average row length (filesize / #rows) of both the previous (A) and the next table (B), in bytes

  • chunk-size for the new table is chunk-size * A/B

This would keep the number of bytes checksummed per second around the same value; a rough sketch follows below.
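
A rough Python sketch of that last idea (my own illustration, not anything the tool currently does; all figures are made up):

    # Hypothetical sketch of the suggestion above; not existing
    # pt-table-checksum behaviour. Scale the starting chunk size for the
    # next table by the ratio of average row lengths so that roughly the
    # same number of bytes is checksummed per second.
    def initial_chunk_size(chunk_size, prev_file_size, prev_rows,
                           next_file_size, next_rows):
        avg_row_a = prev_file_size / prev_rows   # bytes per row, previous table (A)
        avg_row_b = next_file_size / next_rows   # bytes per row, next table (B)
        return int(chunk_size * avg_row_a / avg_row_b)

    # e.g. moving from ~100-byte rows to ~4000-byte (blob-heavy) rows
    # shrinks the starting chunk size from 10000 to 250.
    print(initial_chunk_size(10000, 50_000_000, 500_000, 400_000_000, 100_000))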

I'll keep thinking...

lpjirasync January 24, 2018 at 6:33 PM

Comment from Launchpad by: Daniel Nichter on: 22-11-2012 19:53:23

Yes, that's quite plausible. But what's the alternative? Even if the slave reported 36001 rows (i.e. one too many), the line has to be drawn somewhere. Granted, that line is dynamic, but I think it works in the majority of cases. It seems you have many edge cases (hence the request for a way to disable this check)?

The real, long-term solution is what Baron wanted to do a while ago: stop single-chunking and just always nibble. Future versions of this tool (and others) will do this, but there's no ETA for that yet. However, I'm not sure what the best near-term solution is. I don't think this feature needs a switch to disable it because afaik it works the vast majority of the time, but I would also like the tool to work better for you. Suggestions?

Also, this example from the original report demonstrates the usefulness of this check:

04-23T03:21:59 Skipping table coe.blob because on the master it would be checksummed in one chunk but on these replicas it has too many rows:
14154 rows on erika.ii.nl
51916 rows on piro.ii.nl
The current chunk size limit is 11924 rows (chunk size=5962 * chunk size limit=2.0).

So erika.ii.nl is kind of near the limit, but piro.ii.nl is way off. Maybe it's just a really bad EXPLAIN estimate, or maybe not--the tool can't know. If you disable the check, the single-chunk on piro.ii.nl could be very slow.

lpjirasync January 24, 2018 at 6:33 PM

Comment from Launchpad by: Mrten on: 22-11-2012 19:16:43

OK. So what I think is happening is that at a given time, after having checksummed a table, the dynamic chunk-size is, for example, around 9000. The master, when queried, reports the next table as having 35000 rows, so the tool decides it is single-chunkable (I'm using chunk-time 0.25 and chunk-size-limit 4, and 4 * 9000 = 36000).

However, the slave reports the same table as having 37000 rows there, so the test errors out.

Is this plausible?

Won't Do

Details


Created January 24, 2018 at 6:32 PM
Updated February 4, 2018 at 12:23 AM
Resolved January 24, 2018 at 6:33 PM
