Issues
- PXC-4047 (resolved): DBaas Pod running out of memory with PXC (Kamil Holubicki)
- PMM-11428 (resolved): PMM DBaaS edit db cluster broken (Iaroslavna Soloveva)
- PMM-11361 (resolved): [DBaaS] Internal Server Error on creating multiple DB clusters (Yusaf Awan)
- PMM-11333 (resolved): DBaaS: API key missing after move to OLM
- PMM-11311 (resolved): [DBaaS] DB type can not be changed to MongoDB while creating DB cluster (Iaroslavna Soloveva)
- PMM-11183 (resolved): Default values in PMM DBaaS in Operators (Andrei Minkin)
- PMM-11121 (resolved): DBaaS: list of DB clusters doesn't load if one of my k8s clusters is not responding (Iaroslavna Soloveva)
- PMM-10627 (resolved): [DBaaS] [FE] Ability to create single psmdb node cluster (Carlos Salguero)
- PMM-9682 (resolved): Improve the response speed for the List of DB Clusters created with DBaaS (Yusaf Awan)
- PMM-9681 (resolved): Support of large scale environments
- PMM-7918 (resolved): [DBaaS] resource calculator fixes (peter.szczepaniak)
- PMM-7175 (resolved): DBaaS API
- K8SPXC-1082 (resolved): Ability to ignore specific annotations for k8s Service objects
- K8SPSMDB-824 (resolved): Ability to ignore specific annotations for k8s Service objects (Andrii Dema)
- K8SPSMDB-786 (resolved): When changing to allowUnsafeConfigurations: true, cluster goes to failures and mongos does not get to Ready state (Pavel Tankov)
- EVEREST-52 (resolved): A new design for REST API for DBaaS (Oksana Grishchenko)
- EVEREST-22: [DBaaS] Create DB Cluster: Network and security (Taras Kozub)
- EVEREST-12: [DBaaS] Can't load imdb dataset to pxc cluster running under dbaas on EKS
DBaas Pod running out of memory with PXC
Details
Assignee: Kamil Holubicki
Reporter: Tibor Korocz (Percona)
Needs QA: Yes
Priority: Medium
Activity
Kamil Holubicki, March 1, 2023 at 5:02 PM
My understanding is that there are no more mysteries in this matter: everything is clear, we all know what happens, why it happens, and how to solve it, so I'm closing this ticket.
Kamil Holubicki, December 7, 2022 at 11:20 AM
The following things are related to DBaaS.
Problem 1: When using global wsrep_trx_fragment_size/wsrep_trx_fragment_unit everything works fine, but when using session variables the pod is OOM-killed.
Problem 2: During the load it is visible that the client reconnects to the server.
Problem 3: After about 60 seconds of client inactivity, the next query causes a reconnection.
Problem 4: Calculation of the InnoDB buffer pool size (and maybe of the max_connections parameter). Right now, for a 2G pod, the buffer pool is set to 1G and gcache.size is 600M. Observed pod memory consumption is 2G, so we are right at the limit; any memory pressure on the node could get the pod OOM-killed. I think we need a smaller buffer pool to be compliant with the calculations explained in the previous comment.
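As a rough cross-check against the ~600MB estimate for (E) in the comment below: 1G buffer pool + 600M gcache + ~600M of other MySQL allocations comes to about 2.2G, which already exceeds the 2G pod limit, so any additional pressure gets the pod OOM-killed.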
Conclusion 1: Problem 1 is caused by Problem 2. When the client reconnects, the session variables set previously are lost, so the load continues without streaming replication (we are back in the state we started this ticket with). There are two possible solutions (a sketch follows the list):
Use global variables
Set session and global variables before the load, and restore the defaults after the load
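As an illustration of the second option, a minimal SQL sketch, assuming the 3670016-byte fragment size suggested in the comment below and the stock defaults when restoring:

    -- before the bulk load: enable streaming replication for this session
    SET SESSION wsrep_trx_fragment_unit = 'bytes';
    SET SESSION wsrep_trx_fragment_size = 3670016;  -- ~3.5M fragments

    -- ... run myloader / load the SQL files on this same connection ...

    -- after the load: back to the default (0 disables streaming replication)
    SET SESSION wsrep_trx_fragment_size = 0;

Note that this only helps while the client stays on the same connection; after a reconnection (Problem 2) the session settings are gone. Setting the same values with SET GLOBAL (the first option) survives reconnections because new sessions pick up the global values.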
Conclusion 2: Problem 3 is caused by the HAProxy setup. Adding the following to db config solves the problem
However, it does not solve the load-vs-session-variables problem; network failures may still occur and cause reconnections.
Denys Kondratenko, December 7, 2022 at 10:58 AM
Could you please provide a summary of the recent findings from Slack.
volunteered to provide different recommended configurations for different types of workloads that should prevent OOMs. Could you also check https://jira.percona.com/browse/K8SPXC-441 and provide a recommendation for the corner case where little memory is available.
Kamil Holubicki, November 30, 2022 at 11:02 AM
I talked to on Slack and I think it is worth documenting it for the future:
Let me summarize what we've learned so far. That will be good guidance.
1. We've got the following significant memory consumers:
- (A) Buffer Pool
- (B) WriteSet Cache off pages
- (C) GCache Ring Buffer
- (D) GCache off pages
- (E) MySQL allocations
2. (A) and (C) are static/one-time allocations with defaults:
- (A) 128MB
- (C) 128MB
3. Large transactions cause OOM because of (B) and (D).
4. We should avoid (B) by setting wsrep_trx_fragment_unit='bytes' and wsrep_trx_fragment_size=3670016. This way large transactions are chunked into 3.5M fragments and streamed across the cluster while the transaction is still ongoing.
5. We should avoid (D) by setting large enough (C).
- if there are not many simultaneous write transactions, the default may be enough
- if there are many simultaneous transactions, we should increase (C). Let's say 151 connections (the default) times a 4M chunk => roughly 600M. Since we also need the previous chunk to still be present in (C), a rough estimate is 1.2G.
6. For (E) we need to run tests and see how it behaves. My tests with a single connection showed that it is about 600MB.
7. So our memory demand is M = (A) + (C) + (E)
If we go with (A) being 70% of the memory available to the pod, we've got:
Small:
(A) = 1.4G => (C) + (E) = 600M
As it is a Small instance, we can probably assume we will not have many parallel writers, so the default (C) should be enough; however, that leaves no room for (E), so (A) should be decreased.
Medium:
(A) = 5.4G => (C) + (E) = 2.4G
I think we should expect parallel writers here, so we should increase (C), let's say to 1G, which leaves 1.4G. That should be enough, but, again, we should test with a simultaneous write workload and different transaction/row sizes (not to be confused with wsrep_trx_fragment_size, which is always 3.5M). A concrete config sketch for this case follows below.
Large:
(A) = 22.4G => (C) + (E) = 9.6G
Even with a 2G (C) we are safe (depending on how (E) behaves; again, to be tested).
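To make the Medium case concrete, a minimal my.cnf sketch using the numbers from the estimate above (roughly a 7.8G pod; the values are illustrative, not a validated recommendation):

    [mysqld]
    # ~70% of pod memory for the buffer pool (A), 1G GCache ring buffer (C),
    # the rest is left for MySQL's own allocations (E)
    innodb_buffer_pool_size = 5530M                 # ~5.4G (A)
    wsrep_provider_options  = "gcache.size=1G"      # (C)
    # stream big transactions in ~3.5M fragments to avoid WriteSet Cache off pages (B)
    wsrep_trx_fragment_unit = bytes
    wsrep_trx_fragment_size = 3670016

Whether 1G of (C) is actually enough under a real parallel write workload is exactly the open question stressed at the end of this comment.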
Another perspective
Everything we have considered so far is the case of loading data, which happens in huge transactions. Is that always the case? If you don't do that, and you don't do huge (parallel) writes, here are the knobs you can manipulate (a config sketch follows the list):
1. wsrep_provider_options="gcache.size=N" - the bigger, the better, as it affects the node's ability to be a good donor for IST, but this is a one-time allocation that is never freed. So maybe a huge amount of memory is not needed for (C)? On the other hand, if a writeset does not fit into (C) (precisely: if it is bigger than (C)/2), (D) is created.
2. wsrep_trx_fragment_size=N - maybe it is not bad if WriteSet Cache pages are created sometimes? If we've got just a few write transactions and (C) is big enough not to create (D), it should not hurt.
3. gcache.page_size - the size of a single page of (D)
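For reference, a sketch of how those three knobs are spelled in the config; the values here are only placeholders:

    [mysqld]
    # (C) ring buffer size and (D) page size are both Galera provider options
    wsrep_provider_options = "gcache.size=600M; gcache.page_size=128M"
    # fragment size for streaming replication (default unit is bytes)
    wsrep_trx_fragment_size = 3670016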
And let me stress the following again:
Right now we know how the system behaves with a single writer, but we need to test it with a parallel write workload!
Sergey Pronin, November 29, 2022 at 6:55 AM
Just FYI - I tried to reproduce it on our new PS operator with Group Replication and it is not reproducible. Memory consumption stays flat and is bounded by the InnoDB buffer pool; no OOMs.
Hi,
Summary
If you create a 3-node PXC cluster and start myloader or load a bigger dataset, the pod will run out of memory and be killed.
More details can be found in this doc: https://docs.google.com/document/d/1EYdnqyxmRrgtOAQUQDdF_FYKvGAwUlxHQx-YILy02wQ/edit#
Reproducing
We were able to reproduce it by loading back a backup with myloader and also by simply loading the SQL files of the public imdb database.
Notes
I think we are facing this issue: https://github.com/kubernetes/kubernetes/issues/43916
But it only happens with PXC: if I disable the Galera plugin so that the same pod runs only a standalone MySQL, I was not able to reproduce the issue. It only happens when Galera is enabled; I think it is because of GCache and the way Kubernetes calculates used memory, as described in the GitHub ticket above.