pg_tde has performance issues beyond a specific client threshold and with large databases
Description
The investigation and testing of Andreas Scherbaum revealed that we have a performance threshold/bottleneck at a specific client count, which becomes even more pronounced with larger databases.
We need to investigate what is causing this and try to fix it.
How to document
How to test
Attachments
Activity

Kai Wagner February 10, 2025 at 8:52 AM
It was confirmed that the issue was caused by the cleared hint bits. The newest benchmarks look completely fine. I will close this issue.

Kai Wagner February 6, 2025 at 6:54 AM
Currently, “ads” is running some long-running benchmarks with different database sizes and client counts. Waiting for the results.

Kai Wagner January 16, 2025 at 10:53 AM
So, after further investigation of this pgbench performance drop, we believe we have found the cause. The issue is triggered by ALTER TABLE <tablename> SET ACCESS METHOD, as this clears the hint bits. As the upstream documentation explains, without these bits queries can cause heavy writes to a table even though you are only reading from it: to determine the visibility of a tuple without the hint bits set, PostgreSQL has to consult pg_clog and possibly pg_subtrans, which is an expensive check.
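For completeness, one way to see whether those hint bits are actually present is the pageinspect extension. A minimal sketch (the database name "pgbench", the table, and the page number are just example values):

    psql pgbench -c "CREATE EXTENSION IF NOT EXISTS pageinspect;"
    # HEAP_XMIN_COMMITTED is the 0x0100 bit of t_infomask; false here means the hint bit is not set
    psql pgbench -c "SELECT lp, (t_infomask & x'0100'::int) <> 0 AS xmin_committed_hint
                     FROM heap_page_items(get_raw_page('pgbench_accounts', 0));"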
Further information and the explanation can be found here:
To overcome this after the ALTER, you can do the following: "A plain SELECT, count(*), or VACUUM on the entire table will check every tuple for visibility and set its hint bits."
What I did now was test a manual VACUUM on all pgbench tables after the ALTER and before the benchmark run, and also a count(*). Obviously the count(*) is much quicker, which is fine for testing purposes. In both tests I was no longer able to reproduce the issue, as the hint bits were set again afterward.
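For reference, the sequence looks roughly like this (a sketch only; the database name "pgbench" is an assumption, the table names are the standard pgbench ones):

    # after the ALTER TABLE ... SET ACCESS METHOD and before the benchmark run
    psql pgbench -c "VACUUM pgbench_accounts, pgbench_branches, pgbench_tellers, pgbench_history;"
    # or, quicker for a test, force a full scan so the hint bits get written again
    psql pgbench -c "SELECT count(*) FROM pgbench_accounts;"
    psql pgbench -c "SELECT count(*) FROM pgbench_branches;"
    psql pgbench -c "SELECT count(*) FROM pgbench_tellers;"
    psql pgbench -c "SELECT count(*) FROM pgbench_history;"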
Some results below:
Vacuum on all tables before the pgbench run
tps = 13700.238776 (without initial connection time)
tps = 15705.631591 (without initial connection time)
tps = 15401.540154 (without initial connection time)
tps = 12119.952905 (without initial connection time)
tps = 13572.204126 (without initial connection time)
tps = 15592.978704 (without initial connection time)
tps = 11616.329240 (without initial connection time)
tps = 13548.561917 (without initial connection time)
tps = 13811.609644 (without initial connection time)
tps = 14721.345952 (without initial connection time)
count(*) on all tables before the pgbench run
tps = 13553.283767 (without initial connection time)
tps = 13542.271232 (without initial connection time)
tps = 13759.484216 (without initial connection time)
tps = 15279.172305 (without initial connection time)
tps = 13750.294649 (without initial connection time)
tps = 16313.974084 (without initial connection time)
tps = 11417.014614 (without initial connection time)
tps = 14031.992944 (without initial connection time)
tps = 11565.277732 (without initial connection time)
tps = 12746.276266 (without initial connection time)

andreas.karlsson January 15, 2025 at 1:18 PM
My first thought was: isn't this just part of the amazing world of hint bits? But I will take a look.

Kai Wagner January 14, 2025 at 2:23 PM (edited)
What's the issue?
Whenever you ALTER TABLE a table from one access method to another and then run pgbench against it, performance can be extremely slow (around 5-10% of the normal speed).
Adjusting memory settings, running ANALYZE on the tables, or even restarting the PostgreSQL service makes no difference; the issue persists.
If you run a very short pgbench run against the freshly altered table, the issue is gone. Even a simple “-c 1 -t 1” is enough.
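For reference, such a short run would be something like this (the database name "pgbench" is an assumption):

    pgbench -c 1 -t 1 pgbench    # one client, one transaction per client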
This issue is reproducible with the latest upstream codebase.
The easiest way to reproduce it (sketched as commands after these steps) is:
Initialize pgbench without any changes to PostgreSQL
After that, create a new access method called heap2 based on heap
ALTER all pgbench tables to use the new access method heap2
Run pgbench
You may need to run pgbench multiple times, as the issue doesn't occur with every run, but this is the most consistent way to trigger it.
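Sketched as shell commands, the steps above look roughly like this (database name, scale factor, client count, and run length are assumptions; heap2 is simply a copy of the built-in heap handler):

    # 1. initialize pgbench on a stock PostgreSQL instance
    createdb pgbench
    pgbench -i -s 100 pgbench

    # 2. create a new access method based on heap
    psql pgbench -c "CREATE ACCESS METHOD heap2 TYPE TABLE HANDLER heap_tableam_handler;"

    # 3. switch all pgbench tables over to heap2 (this clears the hint bits)
    psql pgbench -c "ALTER TABLE pgbench_accounts SET ACCESS METHOD heap2;"
    psql pgbench -c "ALTER TABLE pgbench_branches SET ACCESS METHOD heap2;"
    psql pgbench -c "ALTER TABLE pgbench_tellers  SET ACCESS METHOD heap2;"
    psql pgbench -c "ALTER TABLE pgbench_history  SET ACCESS METHOD heap2;"

    # 4. run pgbench; repeat a few times if the slowdown does not show up immediately
    pgbench -c 16 -j 4 -T 60 pgbench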