An XtraDB node starts writing to pagestore instead of ring buffer

Description

One of the nodes in the Galera cluster had high memory usage. Investigation
showed that this node, instead of writing transactions to the galera.cache ring
buffer, was writing them to page stores. It had already created 276 pages of
128 MB each (roughly 34.5 GB).

This started happening a few hours after a wsrep process failed
to start on the local MySQL node. I am not sure what caused that wsrep event.

Note that this doesn't mean Galera wasn't running
or that the changes weren't being applied. At least based on metrics,
the node was participating in certification, and commits/write-sets were being
applied. It's just that the write-sets were being retained; I am not sure why.
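For context, "metrics" here means the standard wsrep status counters; a quick check along these lines (a sketch only, with connection details left out) is enough to see whether a node is still certifying and applying:

{code}
# If the node is still replicating, the state stays 'Synced' and
# wsrep_last_committed keeps advancing between two samples.
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_last_committed'"
{code}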

{code}
2018-08-13T21:26:00.305187Z 0 [ERROR] WSREP: bind: Address already in use
2018-08-13T21:26:00.305338Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 98: error while trying to listen 'tcp://0.0.0.0:4567?socket.non_blocking=1', asio error 'bind: Address already in use': 98 (Address already in use)
at gcomm/src/asio_tcp.cpp:listen():836
2018-08-13T21:26:00.305370Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -98 (Address already in use)
2018-08-13T21:26:00.305567Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1514: Failed to open channel 'cluster1' at 'gcomm://member1,member2': -98 (Address already in use)
2018-08-13T21:26:00.305593Z 0 [ERROR] WSREP: gcs connect failed: Address already in use
2018-08-13T21:26:00.305608Z 0 [ERROR] WSREP: Provider/Node (gcomm://member1,member2) failed to establish connection with cluster (reason: 7)
2018-08-13T21:26:00.305627Z 0 [ERROR] Aborting

2018-08-14T01:13:29.240162Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000000 of size 134217728 bytes
2018-08-14T01:35:41.118082Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000001 of size 134217728 bytes
2018-08-14T01:57:41.416134Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000002 of size 134217728 bytes
2018-08-14T02:19:25.737846Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000003 of size 134217728 bytes

...
{code}

Restarting the node didn't clean up the data. It complained about the data in the files still being mmapped.

{code}
2018-08-18T02:43:13.427709Z 0 [ERROR] WSREP: Could not delete 276 page files: some buffers are still "mmapped".
{code}

However, the files weren't accessed/opened by the node after the restart, and a
manual delete was required.
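For the record, the manual cleanup amounted to something like this (a sketch; it assumes a clean shutdown first so nothing holds the files mmapped, and the service name and datadir path are taken from the logs above):

{code}
systemctl stop mysql              # make sure no process maps the pages
rm -f /db/mysql/gcache.page.*     # datadir path from the logs above
systemctl start mysql
{code}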

This behavior looks related to the one intentionally introduced by
gcache.freeze_purge_at_seqno, except that here it somehow got triggered without
that option being set. This seems like a regression.
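For comparison, freezing the purge is normally an explicit, reversible toggle on the provider options; a sketch of that documented knob, executed through the mysql client:

{code}
# Freeze gcache purge at the current seqno (documented PXC option) ...
mysql -e "SET GLOBAL wsrep_provider_options='gcache.freeze_purge_at_seqno=now'"
# ... and release it again.
mysql -e "SET GLOBAL wsrep_provider_options='gcache.freeze_purge_at_seqno=-1'"
{code}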

Environment

Linux

Kernel 3.13.0-112-generic

PXC - 5.7.22-22-57

Attachments

  • 29 Oct 2018, 05:04 PM

Activity

Krunal Bauskar November 1, 2018 at 6:56 AM

commit cbb86224e7b4de0bfdb84fcd3a52967a0ae9fb19
Merge: 62ab7140 c03db52a
Author: Krunal Bauskar <krunal.bauskar@percona.com>
Date: Thu Nov 1 12:21:02 2018 +0530

Merge pull request #148 from percona/3.x-pxc-2220

  • PXC#2220: An Xtradb node starts writing to pagestore instead of rin…

commit c03db52adc4aee2c832755c744e4d02a2698fc5e

Author: Krunal Bauskar <krunal.bauskar@percona.com>
Date: Wed Oct 31 08:57:10 2018 +0530

  • PXC#2220: An Xtradb node starts writing to pagestore instead of ring bugger

  • The issue is triggered if the user tries to start a parallel instance of
    the PXC/mysqld server on a node that already has an instance running.

  • The parallel instance eventually fails when it tries to bind the address,
    but before that the galera initialization sequence has altered the gcache
    in an irrevocable fashion, causing the gcache page error.

  • In fact, even before the gcache is touched, the saved state is read.
    Galera's existing flow had a lock on grastate.dat (the state file).
    Unfortunately, this lock code was buggy, or missed the condition that
    re-opening the file in a different mode (through freopen) does not retain
    the original lock taken on the old file descriptor. This eventually
    caused no lock to be held post-freopen, and so a parallel instance was
    allowed to start (see the sketch after this list).

  • Fixed the bug by correcting the said condition.
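A minimal shell analogy of that failure mode (an illustration only: the real code takes a POSIX lock on grastate.dat in C and loses it through freopen(); flock(1) on a scratch file shows the same descriptor semantics):

{code}
# "Instance A" takes an advisory lock on the state file via fd 9.
exec 9>/tmp/grastate.lock
flock -n 9

# "Instance B" is correctly kept out while A holds the lock.
flock -n /tmp/grastate.lock true && echo "B got the lock" || echo "B blocked"

# A re-opens fd 9 (the freopen() analogue): the old file description is
# closed and the lock silently dies with it.
exec 9>/tmp/grastate.lock

# Now B slips through, which is how the parallel instance got started.
flock -n /tmp/grastate.lock true && echo "B got the lock" || echo "B blocked"
{code}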

Krunal Bauskar October 30, 2018 at 9:18 AM
Edited

We have reproduced the issue. Thanks for the detailed test case.

  1. The issue kicks in only when you try to start a parallel mysqld (with an existing one running).

  2. So the immediate workaround is to avoid doing so (see the guard sketch after this list).

  3. There is a check to prevent this from happening, but the check is not working as expected. We will fix the issue in an upcoming release.
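For operators, a pre-start guard along these lines avoids the accidental parallel start (a sketch only; init-system and option details vary):

{code}
# Refuse to launch a second mysqld on a node that already runs one.
if pgrep -x mysqld >/dev/null; then
    echo "mysqld is already running on this node; aborting" >&2
    exit 1
fi
mysqld --defaults-file=/etc/my.cnf &
{code}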

Shashank Sahni October 29, 2018 at 5:05 PM

@krunal

  1. That's the point, Krunal. The gcache isn't acting as a ring buffer, and TXs are spilling to the pagestore.

  2. I haven't followed this sequence, but PFA a screenshot charting this value for an affected node. You can see that the value stopped increasing at a point, and this is exactly when the cache spilled to the pagestore.

  3. No, I didn't.

Have you tried the steps I listed in my last comment to reproduce it?

Krunal Bauskar August 30, 2018 at 3:34 AM

There is something that we are missing here.

  1. Most of the sysbench workload is lightweight and generates small packets of trx. This means N trx get appended to the galera.cache file, and on hitting the end of the galera cache, the cache gets used in ring fashion, freeing up entries from the start of the cache. This sequence continues.

  2. You can monitor this sequence as follows:

mysql> show status like '%downto%';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| wsrep_local_cached_downto | 85858 |
+---------------------------+-------+
1 row in set (0.01 sec)

mysql> show status like '%downto%';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| wsrep_local_cached_downto | 86910 |
+---------------------------+-------+
1 row in set (0.00 sec)

mysql> show status like '%downto%';
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_cached_downto | 105507 |
+---------------------------+--------+
1 row in set (0.01 sec)

 

  3. Have you frozen the gcache purge? That can cause the creation of gcache page files even for smaller trx.

Shashank Sahni August 29, 2018 at 5:43 PM

Hey Krunal,

I've been able to successfully reproduce the issue in a test environment. This behavior looks like a bug to me. Here are the steps to reproduce it. Please check and share your thoughts.

1. Run a standalone PXC node with the following wsrep_provider options:

wsrep_provider_options = "gcache.size=100M;gcache.page_size=1M;gcs.fc_limit=500;gcs.fc_factor=0.99;gcs.fc_master_slave=yes;gcache.keep_pages_size=30M;gcache.keep_pages_count=10"

2. Generate load. I used the sysbench-tpcc benchmark to do so. Let it run long enough that the data written exceeds the gcache size.

3. Invoke a second mysqld while the first is running. I ran `mysqld --versin`, which logged these failure messages:

2018-08-29T17:34:42.289123Z 0 [Warning] The syntax '--log_warnings/-W' is deprecated and will be removed in a future release. Please use '--log_error_verbosity' instead.
2018-08-29T17:34:42.289287Z 0 [ERROR] --verbose is for use with --help; did you mean --log-error-verbosity?
2018-08-29T17:34:42.289293Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2018-08-29T17:34:42.305406Z 0 [Note] mysqld (mysqld 5.7.22-22-57-log) starting as process 26493 ...
2018-08-29T17:34:42.311869Z 0 [Note] WSREP: Setting wsrep_ready to false
2018-08-29T17:34:42.311893Z 0 [Note] WSREP: No pre-stored wsrep-start position found. Skipping position initialization.
2018-08-29T17:34:42.311900Z 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
2018-08-29T17:34:42.348615Z 0 [Note] WSREP: wsrep_load(): Galera 3.26(rac090bc) by Codership Oy <info@codership.com> loaded successfully.
2018-08-29T17:34:42.359185Z 0 [Note] WSREP: CRC-32C: using hardware acceleration.
2018-08-29T17:34:42.359725Z 0 [Note] WSREP: Found saved state: 30e67dee-8156-11e8-85da-c64fdabff85c:-1, safe_to_bootstrap: 1
2018-08-29T17:34:44.023646Z 0 [Note] WSREP: Passing config to GCS: base_dir = /db/mysql/; base_host = 10.210.1.15; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 10; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 4; evs.view_forget_timeout = PT24H; gcache.dir = /db/mysql/; gcache.freeze_purge_at_seqno = -1; gcache.keep_pages_count = 10; gcache.keep_pages_size = 30M; gcache.mem_size = 0; gcache.name = /db/mysql//galera.cache; gcache.page_size = 1M; gcache.recover = no; gcache.size = 100M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 0.99; gcs.fc_limit = 500; gcs.fc_master_slave = yes; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.recovery = 1; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = PT30S; pc.weight = 1; protonet.backend = asio; protonet.version = 0; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.max_ws_size = 2147483647; repl.proto_max = 8; socket.checksum = 2; socket.recv_buf_size = 212992;
2018-08-29T17:34:44.047028Z 0 [Note] WSREP: GCache history reset: 30e67dee-8156-11e8-85da-c64fdabff85c:0 -> 30e67dee-8156-11e8-85da-c64fdabff85c:-1
2018-08-29T17:34:44.158481Z 0 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2018-08-29T17:34:44.174233Z 0 [Note] WSREP: Preparing to initiate SST/IST
2018-08-29T17:34:44.174247Z 0 [Note] WSREP: Starting replication
2018-08-29T17:34:44.174273Z 0 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2018-08-29T17:34:44.174459Z 0 [Note] WSREP: Using CRC-32C for message checksums.
2018-08-29T17:34:44.174552Z 0 [Note] WSREP: gcomm thread scheduling priority set to other:0
2018-08-29T17:34:44.174731Z 0 [Note] WSREP: Restoring primary-component from disk successful
2018-08-29T17:34:44.174892Z 0 [Note] WSREP: GMCast version 0
2018-08-29T17:34:44.175015Z 0 [Note] WSREP: (70f9b595, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2018-08-29T17:34:44.175024Z 0 [Note] WSREP: (70f9b595, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2018-08-29T17:34:44.175372Z 0 [Note] WSREP: EVS version 0
2018-08-29T17:34:44.175476Z 0 [Note] WSREP: gcomm: connecting to group 'te_chaos_mm', peer ''
2018-08-29T17:34:44.175509Z 0 [Note] WSREP: start_prim is enabled, turn off pc_recovery
2018-08-29T17:34:44.175647Z 0 [ERROR] WSREP: bind: Address already in use
2018-08-29T17:34:44.180266Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 98: error while trying to listen 'tcp://0.0.0.0:4567?socket.non_blocking=1', asio error 'bind: Address already in use': 98 (Address already in use)
at gcomm/src/asio_tcp.cpp:listen():836
2018-08-29T17:34:44.180287Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -98 (Address already in use)
2018-08-29T17:34:44.180350Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1514: Failed to open channel 'te_chaos_mm' at 'gcomm:///': -98 (Address already in use)
2018-08-29T17:34:44.180360Z 0 [ERROR] WSREP: gcs connect failed: Address already in use
2018-08-29T17:34:44.180366Z 0 [ERROR] WSREP: Provider/Node (gcomm:///) failed to establish connection with cluster (reason: 7)
2018-08-29T17:34:44.180375Z 0 [ERROR] Aborting

4. Within a few seconds, you should start seeing gcache.page.* files being created and the following messages in the log file (a quick way to count the pages follows below):

2018-08-29T17:34:56.791554Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000000 of size 1048576 bytes
2018-08-29T17:34:56.841794Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000001 of size 1048576 bytes
2018-08-29T17:34:57.027707Z 0 [Note] WSREP: Created page /db/mysql/gcache.page.000002 of size 1048576 bytes
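To quantify the spill while the load runs, counting the page files is enough (a trivial sketch; the datadir path is the one from the logs above):

{code}
ls /db/mysql/gcache.page.* 2>/dev/null | wc -l
{code}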

 

Details

Status: Done

Time tracking: 1d 4h logged
Created August 22, 2018 at 10:07 PM
Updated March 6, 2024 at 10:26 PM
Resolved November 1, 2018 at 6:56 AM
