An XtraDB node starts writing to pagestore instead of ring buffer
Description
Environment
Linux
Kernel 3.13.0-112-generic
PXC - 5.7.22-22-57
Attachments
- 29 Oct 2018, 05:04 PM
Activity
Krunal Bauskar November 1, 2018 at 6:56 AM
commit cbb86224e7b4de0bfdb84fcd3a52967a0ae9fb19
Merge: 62ab7140 c03db52a
Author: Krunal Bauskar <krunal.bauskar@percona.com>
Date: Thu Nov 1 12:21:02 2018 +0530
Merge pull request #148 from percona/3.x-pxc-2220
PXC#2220: An Xtradb node starts writing to pagestore instead of rin…
commit c03db52adc4aee2c832755c744e4d02a2698fc5e
Author: Krunal Bauskar <krunal.bauskar@percona.com>
Date: Wed Oct 31 08:57:10 2018 +0530
PXC#2220: An Xtradb node starts writing to pagestore instead of ring bugger
The issue is triggered if the user tries to start a parallel instance of the
PXC/mysqld server on a node that already has an instance running.
The parallel instance eventually fails when it tries to bind the address, but before
that the Galera initialization sequence has already altered the gcache in an
irrevocable fashion, causing the gcache page error.
In fact, even before the gcache is touched, the saved state is read.
Galera's existing flow took a lock on grastate.dat (the state file).
Unfortunately, this lock code was buggy: it missed the fact
that re-opening the file in a different mode (through freopen)
doesn't retain the original lock taken on the old file descriptor.
This eventually meant there was no lock after freopen, so the parallel
instance was allowed to start.
Fixed the bug by correcting the said condition.
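For illustration only, here is a minimal standalone sketch (not the actual Galera code; the file name and flow are assumptions) of how a lock taken with flock() on the original descriptor is silently dropped when the stream is re-opened with freopen(), which is the kind of gap the fix closes:
{code}
/* lock_demo.c - illustrative only, not Galera source.
 * flock() locks belong to the open file description; freopen() closes the
 * old descriptor first, so any lock taken before the re-open is released. */
#include <stdio.h>
#include <sys/file.h>

int main(void)
{
    /* Hypothetical state file standing in for grastate.dat. */
    FILE *fp = fopen("demo_state.dat", "w");
    if (fp == NULL)
        return 1;

    /* Take an exclusive advisory lock on the current descriptor. */
    if (flock(fileno(fp), LOCK_EX | LOCK_NB) != 0) {
        fprintf(stderr, "another instance already holds the lock\n");
        return 1;
    }

    /* Re-open the same file in a different mode.  freopen() closes the old
     * descriptor, so the lock above is gone; a second process running the
     * same code in this window would acquire the lock and proceed. */
    fp = freopen("demo_state.dat", "r+", fp);
    if (fp == NULL)
        return 1;

    /* The protection only holds if the lock is re-acquired on the new
     * descriptor after every freopen(). */
    if (flock(fileno(fp), LOCK_EX | LOCK_NB) != 0)
        fprintf(stderr, "lost the lock race after freopen\n");

    fclose(fp);
    return 0;
}
{code}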
Krunal Bauskar October 30, 2018 at 9:18 AM (Edited)
@Shashank Sahni
We have reproduced the issue. Thanks for the detailed test case.
The issue kicks in only when you try to start a parallel mysqld (with an existing one already running),
so the immediate workaround is to avoid doing so.
There is a check to avoid this from happening, but the check is not working as expected. We will fix the issue in an upcoming release.
Shashank Sahni October 29, 2018 at 5:05 PM
@krunal
That's the point, Krunal. The gcache isn't acting as a ring buffer, and TXs are spilling to pagestore.
I haven't followed this sequence, but please find attached a screenshot charting this value for an affected node. You can see that the value stopped increasing at one point, which is exactly when the cache spilled to pagestore.
No, I didn't.
Have you tried the steps I listed in my last comment to reproduce it?
Krunal Bauskar August 30, 2018 at 3:34 AM
There is something that we are missing here.
Most of the sysbench workload is lightweight and generates small packets of trx. This means N trx get appended to the galera.cache file and, on hitting the end of the galera cache, the cache gets used in ring fashion, freeing up entries from the start of the cache. This sequence continues.
You can monitor this sequence as follows:
mysql> show status like '%downto%';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| wsrep_local_cached_downto | 85858 |
+---------------------------+-------+
1 row in set (0.01 sec)
mysql> show status like '%downto%';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| wsrep_local_cached_downto | 86910 |
+---------------------------+-------+
1 row in set (0.00 sec)
mysql> show status like '%downto%';
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_cached_downto | 105507 |
+---------------------------+--------+
1 row in set (0.01 sec)
3. Have you frozen the purge of the gcache? That can cause the creation of gcache page files even for smaller trx.
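For reference, a sketch of how gcache purge is frozen or resumed dynamically through the provider options; the same option, gcache.freeze_purge_at_seqno, appears with its default of -1 in the config dump quoted below:
{code}
-- Freeze gcache purge at the current seqno; retained write-sets may spill to page files:
SET GLOBAL wsrep_provider_options = 'gcache.freeze_purge_at_seqno = now';
-- Resume normal purging (the default, as seen in the config dump):
SET GLOBAL wsrep_provider_options = 'gcache.freeze_purge_at_seqno = -1';
{code}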
Shashank Sahni August 29, 2018 at 5:43 PM
Hey Krunal,
I've been able to successfully reproduce the issue in a test environment. This behavior looks like a bug to me. Here are the steps to reproduce it. Please check and share your thoughts.
1. Run a standalone PXC node with following wsrep_provider options -
wsrep_provider_options = "gcache.size=100M;gcache.page_size=1M;gcs.fc_limit=500;gcs.fc_factor=0.99;gcs.fc_master_slave=yes;gcache.keep_pages_size=30M;gcache.keep_pages_count=10"
2. Generate load. I used the sysbench-tpcc benchmark to do so. Let it run long enough that the amount of data written exceeds the gcache size.
3. Invoke a second mysqld. I ran `mysqld --versin`. This logs the following failure messages:
2018-08-29T17:34:42.289123Z 0 [Warning] The syntax '--log_warnings/-W' is deprecated and will be removed in a future release. Please use '--log_error_verbosity' instead.
2018-08-29T17:34:42.289287Z 0 [ERROR] --verbose is for use with --help; did you mean --log-error-verbosity?
2018-08-29T17:34:42.289293Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2018-08-29T17:34:42.305406Z 0 [Note] mysqld (mysqld 5.7.22-22-57-log) starting as process 26493 ...
2018-08-29T17:34:42.311869Z 0 [Note] WSREP: Setting wsrep_ready to false
2018-08-29T17:34:42.311893Z 0 [Note] WSREP: No pre-stored wsrep-start position found. Skipping position initialization.
2018-08-29T17:34:42.311900Z 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
2018-08-29T17:34:42.348615Z 0 [Note] WSREP: wsrep_load(): Galera 3.26(rac090bc) by Codership Oy <info@codership.com> loaded successfully.
2018-08-29T17:34:42.359185Z 0 [Note] WSREP: CRC-32C: using hardware acceleration.
2018-08-29T17:34:42.359725Z 0 [Note] WSREP: Found saved state: 30e67dee-8156-11e8-85da-c64fdabff85c:-1, safe_to_bootstrap: 1
2018-08-29T17:34:44.023646Z 0 [Note] WSREP: Passing config to GCS: base_dir = /db/mysql/; base_host = 10.210.1.15; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 10; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 4; evs.view_forget_timeout = PT24H; gcache.dir = /db/mysql/; gcache.freeze_purge_at_seqno = -1; gcache.keep_pages_count = 10; gcache.keep_pages_size = 30M; gcache.mem_size = 0; gcache.name = /db/mysql//galera.cache; gcache.page_size = 1M; gcache.recover = no; gcache.size = 100M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 0.99; gcs.fc_limit = 500; gcs.fc_master_slave = yes; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.recovery = 1; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = PT30S; pc.weight = 1; protonet.backend = asio; protonet.version = 0; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.max_ws_size = 2147483647; repl.proto_max = 8; socket.checksum = 2; socket.recv_buf_size = 212992;
2018-08-29T17:34:44.047028Z 0 [Note] WSREP: GCache history reset: 30e67dee-8156-11e8-85da-c64fdabff85c:0 -> 30e67dee-8156-11e8-85da-c64fdabff85c:-1
2018-08-29T17:34:44.158481Z 0 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2018-08-29T17:34:44.174233Z 0 [Note] WSREP: Preparing to initiate SST/IST
2018-08-29T17:34:44.174247Z 0 [Note] WSREP: Starting replication
2018-08-29T17:34:44.174273Z 0 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2018-08-29T17:34:44.174459Z 0 [Note] WSREP: Using CRC-32C for message checksums.
2018-08-29T17:34:44.174552Z 0 [Note] WSREP: gcomm thread scheduling priority set to other:0
2018-08-29T17:34:44.174731Z 0 [Note] WSREP: Restoring primary-component from disk successful
2018-08-29T17:34:44.174892Z 0 [Note] WSREP: GMCast version 0
2018-08-29T17:34:44.175015Z 0 [Note] WSREP: (70f9b595, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2018-08-29T17:34:44.175024Z 0 [Note] WSREP: (70f9b595, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2018-08-29T17:34:44.175372Z 0 [Note] WSREP: EVS version 0
2018-08-29T17:34:44.175476Z 0 [Note] WSREP: gcomm: connecting to group 'te_chaos_mm', peer ''
2018-08-29T17:34:44.175509Z 0 [Note] WSREP: start_prim is enabled, turn off pc_recovery
2018-08-29T17:34:44.175647Z 0 [ERROR] WSREP: bind: Address already in use
2018-08-29T17:34:44.180266Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 98: error while trying to listen 'tcp://0.0.0.0:4567?socket.non_blocking=1', asio error 'bind: Address already in use': 98 (Address already in use)
at gcomm/src/asio_tcp.cpp:listen():836
2018-08-29T17:34:44.180287Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -98 (Address already in use)
2018-08-29T17:34:44.180350Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1514: Failed to open channel 'te_chaos_mm' at 'gcomm:///': -98 (Address already in use)
2018-08-29T17:34:44.180360Z 0 [ERROR] WSREP: gcs connect failed: Address already in use
2018-08-29T17:34:44.180366Z 0 [ERROR] WSREP: Provider/Node (gcomm:///) failed to establish connection with cluster (reason: 7)
2018-08-29T17:34:44.180375Z 0 [ERROR] Aborting
4. Within a few seconds, you should start seeing gcache.page.* files being created and the following messages in the log file:
2018-08-29T17:34:56.791554Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000000 of size 1048576 bytes
2018-08-29T17:34:56.841794Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000001 of size 1048576 bytes
2018-08-29T17:34:57.027707Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000002 of size 1048576 bytes
Details
Assignee: Krunal Bauskar (Deactivated)
Reporter: Shashank Sahni
Time tracking: 1d 4h logged
Fix versions:
Affects versions:
Priority: Medium

One of the nodes in the Galera cluster had high memory usage. Investigation
showed that this node, instead of writing transactions to the galera.cache ring
buffer, was writing them to page stores. It had already created 276 x 128 MB pages.
This started happening a few hours after a wsrep process failed
to start on the local mysql node. I am not sure what caused the wsrep event.
Note that this doesn't mean that Galera wasn't running
or that the changes weren't being applied. At least based on metrics,
the node was participating in certification and commits/write-sets were being
applied. It's just that the write-sets were being retained. Not sure why.
{code}
2018-08-13T21:26:00.305187Z 0 [ERROR] WSREP: bind: Address already in use
2018-08-13T21:26:00.305338Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 98: error while trying to listen 'tcp://0.0.0.0:4567?socket.non_blocking=1', asio error 'bind: Address already in use': 98 (Address already in use)
at gcomm/src/asio_tcp.cpp:listen():836
2018-08-13T21:26:00.305370Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -98 (Address already in use)
2018-08-13T21:26:00.305567Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1514: Failed to open channel 'cluster1' at 'gcomm://member1,member2': -98 (Address already in use)
2018-08-13T21:26:00.305593Z 0 [ERROR] WSREP: gcs connect failed: Address already in use
2018-08-13T21:26:00.305608Z 0 [ERROR] WSREP: Provider/Node (gcomm://member1,member2) failed to establish connection with cluster (reason: 7)
2018-08-13T21:26:00.305627Z 0 [ERROR] Aborting
2018-08-14T01:13:29.240162Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000000 of size 134217728 bytes
2018-08-14T01:35:41.118082Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000001 of size 134217728 bytes
2018-08-14T01:57:41.416134Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000002 of size 134217728 bytes
2018-08-14T02:19:25.737846Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000003 of size 134217728 bytes
...
{code}
Restarting the node didn't clean up the data. It complained about the data in the files still being mmapped.
{code}
2018-08-18T02:43:13.427709Z 0 [ERROR] WSREP: Could not delete 276 page files: some buffers are still "mmapped".
{code}
However, the files weren't accessed/opened by the node after the restart, and a
manual delete was required.
This behavior seems related to the one intentionally introduced by
gcache.freeze_purge_at_seqno, but it somehow got triggered here.
This seems like a regression.