PXC node creates new gcache.page.X files but never deletes them, for no apparent reason

Description

Under unknown circumstances, a healthy PXC member suddenly starts to accumulate on-demand cache files (gcache.page.X), and the cache pool grows without bound, eventually filling the disk, for example:

The freeze_purge_at_seqno function was not used here.

The only way to unblock this is to restart the affected node.

I am unable to reproduce the problem on demand.
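
Since the problem cannot be triggered on demand, one way to catch it early is to watch the number and total size of page files. The following is only a minimal sketch: it assumes the page files live in a single directory and match the gcache.page.* pattern from the title. The function name and the idea of polling are illustrative assumptions, not part of PXC.

```python
from pathlib import Path


def gcache_page_usage(directory):
    """Return (file_count, total_bytes) for gcache.page.* files in `directory`.

    `directory` is assumed to be wherever the node keeps its GCache pages
    (typically the data directory, unless configured otherwise).
    """
    pages = [p for p in Path(directory).glob("gcache.page.*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in pages)
    return len(pages), total_bytes
```

Calling this periodically and alerting when both numbers keep rising would flag an affected node before the disk fills up.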

Environment

None

AFFECTED CS IDs

CS0049562, CS0049588

Attachments

1

Activity

Kamil Holubicki September 13, 2024 at 9:46 AM

I did a quick check of the provided logs, and it is clearly visible that:

  1. node 113 is the IST donor (let’s call it DONOR)

  2. node 10.206.73.91 is the IST joiner (let’s call it JOINER)

  3. the SST/IST request happens at 2024-09-05T08:31:53.737336-04:00

  4. DONOR decides that it can serve the request through IST, so SST is bypassed

  5. the IST sender is started to serve seqnos 3601164752 -> 3609386786

  6. on the JOINER side, the IST is being received

What happens next is the interesting part:

  1. DONOR is waiting in send_eof() (pt-pmp.log). This means it has finished serving IST, sent EOF, and is waiting for JOINER to close the connection.

  2. JOINER processes all IST events and syncs with the cluster, but the IST async receiver thread either:

    1. doesn’t close the connection

    2. or closes the connection, but DONOR somehow does not see it

The result is that the DONOR is still waiting for the JOINER to close the connection.
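
To illustrate the handshake described above, here is a toy sketch of a sender that has sent its EOF marker and then waits for the peer to close the connection. This is not Galera's code (Galera is written in C++). The function names, the b"EOF" marker, and the timeout are assumptions made only so the example terminates. A sender that waits without a timeout would block indefinitely in the second case.

```python
import socket


def wait_for_peer_close(sock, timeout):
    """Return True if the peer closed the connection within `timeout` seconds."""
    sock.settimeout(timeout)
    try:
        while True:
            data = sock.recv(4096)
            if data == b"":
                # An empty read means the peer closed its end.
                return True
            # Any unexpected bytes are ignored; keep waiting for the close.
    except socket.timeout:
        return False


def demo(peer_closes):
    """Simulate DONOR sending EOF and waiting for JOINER to close."""
    donor, joiner = socket.socketpair()
    try:
        donor.sendall(b"EOF")   # hypothetical end-of-stream marker
        joiner.recv(16)         # JOINER consumes the marker
        if peer_closes:
            joiner.close()      # healthy case: JOINER closes after EOF
        return wait_for_peer_close(donor, timeout=0.2)
    finally:
        donor.close()
        joiner.close()
```

demo(True) models the healthy case where the close is observed. demo(False) models what the logs suggest happened here: the sender never sees the close.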

 

Why does the above cause GCache to grow?

At the start of IST, the DONOR locks the starting seqno in GCache. This prevents commit-cut processing from releasing writesets that will be served via IST: even if the node receives commit-cut messages, it cannot release any writesets above the locked seqno. Meanwhile, writesets from the normal workload keep being cached in GCache, causing off-pages to be created.

When IST serving is finished, the IST-start seqno is unlocked in GCache, and old writesets (together with their off-pages) can be removed. IST is finished when the DONOR detects that the JOINER closed the connection after EOF. As noted above, that did not happen. The end of IST should be indicated by the “async IST sender served” log message on the DONOR side; here we see that it was never logged.
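
The effect of the seqno lock can be shown with a toy model. This is only a sketch of the behavior described above, not GCache's actual data structures. The class and method names are hypothetical, and the seqnos are small made-up numbers.

```python
class ToyGCache:
    """Toy model: cached writesets keyed by seqno, with an optional seqno lock."""

    def __init__(self):
        self.writesets = {}        # seqno -> payload
        self.locked_seqno = None   # lowest seqno that must be kept for IST

    def store(self, seqno, payload=b""):
        self.writesets[seqno] = payload

    def seqno_lock(self, seqno):
        self.locked_seqno = seqno

    def seqno_unlock(self):
        self.locked_seqno = None

    def release_up_to(self, commit_cut):
        """Release writesets up to `commit_cut`, but never at or above the lock."""
        limit = commit_cut
        if self.locked_seqno is not None:
            limit = min(limit, self.locked_seqno - 1)
        released = [s for s in self.writesets if s <= limit]
        for s in released:
            del self.writesets[s]
        return len(released)


cache = ToyGCache()
for seqno in range(1, 11):
    cache.store(seqno)

cache.seqno_lock(4)                  # IST starts at seqno 4
first = cache.release_up_to(8)       # only seqnos 1..3 can go

for seqno in range(11, 21):          # normal workload keeps arriving
    cache.store(seqno)
stuck = cache.release_up_to(18)      # nothing is released while locked
size_while_locked = len(cache.writesets)

cache.seqno_unlock()                 # IST finished
after_unlock = cache.release_up_to(18)
size_after_unlock = len(cache.writesets)
```

While the lock is held, every commit cut is a no-op for seqnos at or above the lock, so the cache only grows. In this ticket the unlock step is the one that never ran.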

 

It would be useful to see pt-pmp.log from the JOINER, which would show whether the JOINER closed the connection.

Done

Details

Needs QA

No

In progress time

16.8

Time tracking

No time logged, 2d 6h 45m remaining

Created September 12, 2024 at 4:32 PM
Updated December 20, 2024 at 2:20 PM
Resolved October 4, 2024 at 1:05 PM