DBaaS Pod running out of memory with PXC
Activity

Kamil Holubicki March 1, 2023 at 5:02 PM
My understanding is that there are no more mysteries in this matter: everything is clear, we all know what happens, why it happens, and how to solve it, so I am closing this ticket.

Kamil Holubicki December 7, 2022 at 11:20 AM
The following things are related to DBaaS.
Problem 1: When using global wsrep_trx_fragment_size/wsrep_trx_fragment_unit everything works fine, but when using session variables, the pod is OOM killed
Problem 2: During the load, it is visible that the client reconnects to the server
Problem 3: After approximately 60 seconds of client inactivity, the next query causes a reconnection.
Problem 4: Calculation of the InnoDB Buffer Pool size (and maybe of the max_connections parameter). Right now, for a 2G pod, the buffer pool is set to 1G and gcache.size is 600M. Observed pod memory consumption is 2G, so we are right at the limit. Any memory pressure on the node could cause an OOM kill of the pod. I think we need a smaller buffer pool to be compliant with the calculations explained in the previous comment.
Conclusion 1: Problem 1 is caused by Problem 2. When the client reconnects, the session variables set previously are lost, so we continue without streaming replication (we are back in the original state described at the beginning of this ticket). There are two possible solutions:
Use global variables
Set both session and global variables before the load, and restore the defaults after the load (see the sketch below)
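A minimal sketch of the second option (it reuses the 3.5M byte fragments recommended in the comment below and assumes the defaults are wsrep_trx_fragment_size=0 and wsrep_trx_fragment_unit='bytes'):
{code:sql}
-- Sketch only: enable streaming replication for the bulk load,
-- both globally (so it survives a reconnect) and for the current session,
-- then restore the defaults afterwards.
SET GLOBAL  wsrep_trx_fragment_unit = 'bytes';
SET GLOBAL  wsrep_trx_fragment_size = 3670016;   -- 3.5M fragments
SET SESSION wsrep_trx_fragment_unit = 'bytes';
SET SESSION wsrep_trx_fragment_size = 3670016;

-- ... run myloader / load the dataset here ...

SET GLOBAL  wsrep_trx_fragment_size = 0;         -- back to default (streaming off)
SET GLOBAL  wsrep_trx_fragment_unit = 'bytes';
{code}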
Conclusion 2: Problem 3 is caused by the HAProxy setup. Adding the following to the DB config solves the problem:
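(The exact snippet is not quoted here; the fragment below is only a hypothetical illustration of the kind of override meant, assuming the idle disconnect comes from HAProxy's client/server timeouts and that the override is applied through the operator's spec.haproxy.configuration field.)
{code:yaml}
# Hypothetical illustration only, not the snippet referred to above.
# Only the relevant HAProxy directives are shown; in the PXC operator CR the
# custom HAProxy config is normally supplied via spec.haproxy.configuration.
spec:
  haproxy:
    configuration: |
      defaults
        timeout client 28800s
        timeout server 28800s
{code}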
However, it does not solve the load-vs-session-variables problem: network failures may still occur and cause reconnections.

Denys Kondratenko December 7, 2022 at 10:58 AM
could you please provide a summary of the recent findings from Slack.
volunteered to provide different recommended configurations for different types of workloads that should prevent OOM. Could you also check https://jira.percona.com/browse/K8SPXC-441 and provide a recommendation for that corner case where little memory is available.

Kamil Holubicki November 30, 2022 at 11:02 AM
I talked to on Slack and I think it is worth documenting it for the future:
Let me summarize what we've learned so far. That will be good guidance.
1. We've got the following significant memory consumers
- (A) Buffer Pool
- (B) WriteSet Cache off pages
- (C) GCache Ring Buffer
- (D) GCache off pages
- (E) MySQL allocations
2. (A) and (C) are static/one-time allocations with defaults:
- (A) 128MB
- (C) 128MB
3. Large transactions cause OOM because of (B) and (D).
4. We should avoid (B) by setting wsrep_trx_fragment_unit='bytes' and wsrep_trx_fragment_size=3670016. This way large transactions will be chunked into 3.5M fragments and streamed across the cluster while the transaction is still ongoing.
5. We should avoid (D) by setting large enough (C).
- if there are not many simultaneous write transactions, the default may be enough
- if there are many simultaneous transactions, we should increase (C). Let's say 151 connections (the default max_connections) x a 4M chunk => roughly 600M. We also need the previous chunk to still be present in (C), so this gives a rough estimate of 1.2G.
6. For (E) we need to run tests and see how it behaves. My tests with one connection showed that it is approximately 600MB.
7. So our memory demand is M = (A) + (C) + (E)
If we go with (A) being 70% of the memory available to the pod, we've got:
Small:
(A) = 1.4G => (C) + (E) = 600M
As it is a Small instance, we can probably assume we will not have many parallel writers, so the default (C) should be enough; however, we still have no space for (E), so (A) should be decreased.
Medium:
(A) = 5.4G => (C) + (E) = 2.4G
I think we should expect parallel writers here, so we should increase (C), let's say to 1G, which leaves 1.4G. That should be enough, but, again, we should test with a simultaneous write workload and different transaction/row sizes (not to be confused with wsrep_trx_fragment_size, which is always 3.5M).
Large:
(A) = 22.4G => (C) + (E) = 9.6G
Even with a 2G (C) we are safe (depending on how (E) behaves; again, to be tested)
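As a rough sketch only (using the Medium numbers above; these values still need to be validated with the parallel write workload mentioned below), the corresponding my.cnf settings could look like:
{code:ini}
# Sketch only - values taken from the Medium example above, to be validated.
[mysqld]
# (A) InnoDB Buffer Pool, ~70% of the pod memory (~5.4G here)
innodb_buffer_pool_size = 5530M
# (C) GCache ring buffer, sized for parallel writers
wsrep_provider_options  = "gcache.size=1G"
# Stream large transactions in 3.5M fragments to avoid (B) off-pages
wsrep_trx_fragment_unit = bytes
wsrep_trx_fragment_size = 3670016
{code}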
Another perspective
All we have considered here so far is the case of loading data, which happens in huge transactions. Is that always the case? If you do not do this, and you do not do huge (parallel) writes, here are the knobs you can manipulate:
1. wsrep_provider_options="gcache.size=N" - the bigger, the better, as it affects the node's ability to be a good donor for IST, but this is a one-time allocation that is never freed. So maybe a huge amount of memory is not needed for (C)? On the other hand, if a writeset does not fit into (C) (precisely: if it is bigger than (C)/2), (D) is created
2. wsrep_trx_fragment_size=N - maybe it is not bad if WriteSet Cache pages are created sometimes? If we have just a few write transactions and (C) is big enough that (D) is not created, it should not be bad.
3. gcache.page_size - the size of a single page of (D)
And let me stress the following again:
Right now we know how the system behaves with a single writer, but we need to test it with a parallel write workload!

Sergey Pronin November 29, 2022 at 6:55 AM
just FYI - I tried to reproduce it on our new PS operator with Group Replication and it is not reproducible. Memory consumption stays flat and limited by the InnoDB buffer pool; no OOMs.
Details
Assignee: Kamil Holubicki
Reporter: Tibor Korocz (Percona)
Needs QA: Yes
Affects versions:
Priority: Medium
Description
Hi,
Summary
If you create a 3-node PXC cluster and start myloader, or load a bigger dataset, the pod will run out of memory and be killed.
More details can be found in this doc: https://docs.google.com/document/d/1EYdnqyxmRrgtOAQUQDdF_FYKvGAwUlxHQx-YILy02wQ/edit#
Reproducing
We were able to reproduce it by loading back a backup with myloader, and also by simply loading the SQL files of the public IMDb database, for example:
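(A reproduction sketch under assumptions: host, credentials and backup directory below are placeholders, and the backup is a mydumper dump.)
{code:bash}
# Sketch only: load a previously taken mydumper backup back into the cluster
# through the HAProxy service; host, credentials and paths are placeholders.
myloader --host=cluster1-haproxy --port=3306 \
         --user=root --password="$MYSQL_ROOT_PASSWORD" \
         --directory=/backup/dump --threads=4 --overwrite-tables
{code}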
Notes
I think we are facing this issue: https://github.com/kubernetes/kubernetes/issues/43916
But it only happens with PXC: if I disable the Galera plugin on the same pod, so that it runs only a standalone MySQL, I am not able to reproduce the issue. It only happens when Galera is enabled. I think it is because of GCache and the way Kubernetes calculates used memory, which you can see in the GitHub ticket above.
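(One way to see the difference, as a sketch: compare what Kubernetes reports for the pod with the RSS/cache breakdown of the container's memory cgroup. Pod and container names below are placeholders, and a cgroup v1 path is assumed.)
{code:bash}
# Sketch only: pod/container names are placeholders, cgroup v1 layout assumed.
kubectl top pod cluster1-pxc-0

# rss vs cache inside the PXC container's memory cgroup
kubectl exec cluster1-pxc-0 -c pxc -- \
  sh -c 'grep -E "^(rss|cache|total_rss|total_cache) " /sys/fs/cgroup/memory/memory.stat'
{code}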