LP #1575959: Node consistency compromized during pt-online-schema-change

Description

**Reported in Launchpad by Alexander Ilyin last update 29-06-2016 04:17:47

This behaviou happend two times in row when I was altering two big tables. Used versions:

CentOS release 6.7 (Final)

pt-online-schema-change 2.2.17

Percona-XtraDB-Cluster-client-56.x86_64 1:5.6.28-25.14.1.el6 @percona-release-x86_64
Percona-XtraDB-Cluster-devel-56.x86_64 1:5.6.28-25.14.1.el6 @percona-release-x86_64
Percona-XtraDB-Cluster-full-56.x86_64 1:5.6.28-25.14.1.el6 @percona-release-x86_64
Percona-XtraDB-Cluster-galera-3.x86_64 3.14-1.rhel6 @percona-release-x86_64
Percona-XtraDB-Cluster-garbd-3.x86_64 3.14-1.rhel6 @percona-release-x86_64
Percona-XtraDB-Cluster-server-56.x86_64 1:5.6.28-25.14.1.el6 @percona-release-x86_64
Percona-XtraDB-Cluster-shared-56.x86_64 1:5.6.28-25.14.1.el6 @percona-release-x86_64
Percona-XtraDB-Cluster-test-56.x86_64 1:5.6.28-25.14.1.el6 @percona-release-x86_64

We have 3 servers in Percona XtraDB

On 1st node I've started alter with pt-online-schema-change:

pt-online-schema-change --print --execute --alter-foreign-keys-method=drop_swap --no-drop-old-table --host "127.0.0.1" --user "root" --password "***" --port "3306" --alter "DROP COLUMN platform, ADD partner_id INT unsigned NOT NULL DEFAULT '0', ADD COLUMN flags INT unsigned NOT NULL DEFAULT '0';" D=elite,t=user

And in 10 minutes 3rd node was shutdown:

2016-04-27 15:39:45 4558 [Warning] Aborted connection 2857066 to db: 'elite' user: 'elite' host: '*****' (Got an error reading communication packets)
BF-BF X lock conflict,mode: 1027 supremum: 0
conflicts states: my 0 locked 0
RECORD LOCKS space id 2623 page no 1340865 n bits 152 index `PRIMARY` of table `elite`.`device` trx id 4637663494 lock_mode X locks rec but not gap
2016-04-27 15:42:08 4558 [ERROR] Slave SQL: Could not execute Update_rows event on table elite.device; Can't find record in 'device', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 457, Error_code: 1032
2016-04-27 15:42:08 4558 [Warning] WSREP: RBR event 3 Update_rows apply warning: 120, 1252701249
2016-04-27 15:42:08 4558 [Warning] WSREP: Failed to apply app buffer: seqno: 1252701249, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 2th time
BF-BF X lock conflict,mode: 1027 supremum: 0
conflicts states: my 0 locked 0
RECORD LOCKS space id 2623 page no 1340865 n bits 152 index `PRIMARY` of table `elite`.`device` trx id 4637663494 lock_mode X locks rec but not gap
2016-04-27 15:42:08 4558 [ERROR] Slave SQL: Could not execute Update_rows event on table elite.device; Can't find record in 'device', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 457, Error_code: 1032
2016-04-27 15:42:08 4558 [Warning] WSREP: RBR event 3 Update_rows apply warning: 120, 1252701249
2016-04-27 15:42:08 4558 [Warning] WSREP: Failed to apply app buffer: seqno: 1252701249, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 3th time
BF-BF X lock conflict,mode: 1027 supremum: 0
conflicts states: my 0 locked 0
RECORD LOCKS space id 2623 page no 1340865 n bits 152 index `PRIMARY` of table `elite`.`device` trx id 4637663494 lock_mode X locks rec but not gap
2016-04-27 15:42:08 4558 [ERROR] Slave SQL: Could not execute Update_rows event on table elite.device; Can't find record in 'device', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 457, Error_code: 1032
2016-04-27 15:42:08 4558 [Warning] WSREP: RBR event 3 Update_rows apply warning: 120, 1252701249
2016-04-27 15:42:08 4558 [Warning] WSREP: Failed to apply app buffer: seqno: 1252701249, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 4th time
BF-BF X lock conflict,mode: 1027 supremum: 0
conflicts states: my 0 locked 0
RECORD LOCKS space id 2623 page no 1340865 n bits 152 index `PRIMARY` of table `elite`.`device` trx id 4637663494 lock_mode X locks rec but not gap
2016-04-27 15:42:08 4558 [ERROR] Slave SQL: Could not execute Update_rows event on table elite.device; Can't find record in 'device', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 457, Error_code: 1032
2016-04-27 15:42:08 4558 [Warning] WSREP: RBR event 3 Update_rows apply warning: 120, 1252701249
2016-04-27 15:42:08 4558 [ERROR] WSREP: Failed to apply trx: source: 135aa33a-0c9d-11e6-8474-4bed42818caa version: 3 local: 0 state: APPLYING flags: 1 conn_id: 638350 trx_id: 4637945874 seqnos (l: 1784028, g: 1252701249, s: 1252701248, d: 1252701243, ts: 23188580704163)
2016-04-27 15:42:08 4558 [ERROR] WSREP: Failed to apply trx 1252701249 4 times
2016-04-27 15:42:08 4558 [ERROR] WSREP: Node consistency compromized, aborting...
2016-04-27 15:42:08 4558 [Note] WSREP: Closing send monitor...
2016-04-27 15:42:08 4558 [Note] WSREP: Closed send monitor.
2016-04-27 15:42:08 4558 [Note] WSREP: gcomm: terminating thread
2016-04-27 15:42:08 4558 [Note] WSREP: gcomm: joining thread
2016-04-27 15:42:08 4558 [Note] WSREP: gcomm: closing backend
2016-04-27 15:42:08 4558 [Note] WSREP: view(view_id(NON_PRIM,135aa33a,11) memb {
e89d9ff6,0
} joined {
} left {
} partitioned {
135aa33a,0
138ce52e,0
})
2016-04-27 15:42:08 4558 [Note] WSREP: view((empty))
2016-04-27 15:42:08 4558 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2016-04-27 15:42:08 4558 [Note] WSREP: gcomm: closed
2016-04-27 15:42:08 4558 [Note] WSREP: Flow-control interval: [512, 512]
2016-04-27 15:42:08 4558 [Note] WSREP: Received NON-PRIMARY.
2016-04-27 15:42:08 4558 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 1252701251)
2016-04-27 15:42:08 4558 [Note] WSREP: Received self-leave message.
2016-04-27 15:42:08 4558 [Note] WSREP: Flow-control interval: [0, 0]
2016-04-27 15:42:08 4558 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2016-04-27 15:42:08 4558 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 1252701251)
2016-04-27 15:42:08 4558 [Note] WSREP: RECV thread exiting 0: Success
2016-04-27 15:42:08 4558 [Note] WSREP: recv_thread() joined.
2016-04-27 15:42:08 4558 [Note] WSREP: Closing replication queue.
2016-04-27 15:42:08 4558 [Note] WSREP: Closing slave action queue.
2016-04-27 15:42:08 4558 [Note] WSREP: /usr/sbin/mysqld: Terminated.
160427 15:42:10 mysqld_safe Number of processes running now: 0
160427 15:42:10 mysqld_safe WSREP: not restarting wsrep node automatically
160427 15:42:10 mysqld_safe mysqld from pid file /var/run/mysql/mysql.pid ended

Environment

None

Smart Checklist

Activity

Show:

lpjirasync January 14, 2018 at 6:06 PM

**Comment from Launchpad by: Launchpad Janitor on: 29-06-2016 04:17:46

[Expired for Percona XtraDB Cluster because there has been no activity for 60 days.]

lpjirasync January 14, 2018 at 6:05 PM

**Comment from Launchpad by: Alexander Ilyin on: 29-04-2016 21:58:05

How I can get more details about conflicting record?

lpjirasync January 14, 2018 at 6:05 PM

**Comment from Launchpad by: Alexander Ilyin on: 29-04-2016 21:56:02

Unfortunately, I don't have small reproduce case for this problem which I could provide to you.

Also, there is no other workload on failed host except wsrep itself.

The table which faced with conflict has foreign keys, but it was so for last year and only now we faced with such problem.

lpjirasync January 14, 2018 at 6:05 PM

**Comment from Launchpad by: Krunal Bauskar on: 28-04-2016 09:47:22

Shutdown happened because of BF-BF conflict

BF-BF X lock conflict,mode: 1027 supremum: 0 conflicts states: my 0 locked 0

------
Do you have a reproducible test-case. there is some other conflicting workload that causes this things to fail.

Cannot Reproduce

Details

Assignee

Reporter

Priority

Smart Checklist

Created January 14, 2018 at 6:05 PM
Updated December 25, 2018 at 5:25 PM
Resolved December 15, 2023 at 2:26 PM