erogov Jan 24 2020 at 16:59

MVCC in PostgreSQL-6. Vacuum

13 min

3.4K

Postgres Professional corporate blog PostgreSQL *SQL *

Translation

We started with problems related to isolation, made a digression about low-level data structure, then discussed row versions and observed how data snapshots are obtained from row versions.

Last time we talked about HOT updates and in-page vacuuming, and today we'll proceed to a well-known vacuum vulgaris. Really, so much has already been written about it that I can hardly add anything new, but the beauty of a full picture requires sacrifice. So keep patience.

Vacuum

What does vacuum do?

In-page vacuum works fast, but frees only part of the space. It works within one table page and does not touch indexes.

The basic, «normal» vacuum is done using the VACUUM command, and we will call it just «vacuum» (leaving «autovacuum» for a separate discussion).

So, vacuum processes the entire table. It vacuums away not only dead tuples, but also references to them from all indexes.

Vacuuming is concurrent with other activities in the system. The table and indexes can be used in a regular way both for reads and updates (however, concurrent execution of commands such as CREATE INDEX, ALTER TABLE and some others is impossible).

Only those table pages are looked through where some activities took place. To detect them, the visibility map is used (to remind you, the map tracks those pages that contain pretty old tuples, which are visible in all data snapshots for sure). Only those pages are processed that are not tracked by the visibility map, and the map itself gets updated.

The free space map also gets updated in the process to reflect the extra free space in the pages.

As usual, let's create a table:

=> CREATE TABLE vac(
  id serial,
  s char(100)
) WITH (autovacuum_enabled = off);
=> CREATE INDEX vac_s ON vac(s);
=> INSERT INTO vac(s) VALUES ('A');
=> UPDATE vac SET s = 'B';
=> UPDATE vac SET s = 'C';

We use the autovacuum_enabled parameter to turn the autovacuum process off. We will discuss it next time, and now it is critical for our experiments that we manually control vacuuming.

The table now has three tuples, each of which are referenced from the index:

=> SELECT * FROM heap_page('vac',0);

 ctid  | state  |   xmin   |   xmax   | hhu | hot | t_ctid 
-------+--------+----------+----------+-----+-----+--------
 (0,1) | normal | 4000 (c) | 4001 (c) |     |     | (0,2)
 (0,2) | normal | 4001 (c) | 4002     |     |     | (0,3)
 (0,3) | normal | 4002     | 0 (a)    |     |     | (0,3)
(3 rows)

=> SELECT * FROM index_page('vac_s',1);

 itemoffset | ctid  
------------+-------
          1 | (0,1)
          2 | (0,2)
          3 | (0,3)
(3 rows)

After vacuuming, dead tuples get vacuumed away, and only one, live, tuple remains. And only one reference remains in the index:

=> VACUUM vac;
=> SELECT * FROM heap_page('vac',0);

 ctid  | state  |   xmin   | xmax  | hhu | hot | t_ctid 
-------+--------+----------+-------+-----+-----+--------
 (0,1) | unused |          |       |     |     | 
 (0,2) | unused |          |       |     |     | 
 (0,3) | normal | 4002 (c) | 0 (a) |     |     | (0,3)
(3 rows)

=> SELECT * FROM index_page('vac_s',1);

 itemoffset | ctid  
------------+-------
          1 | (0,3)
(1 row)

Note that the first two pointers acquired the status «unused» instead of «dead», which they would acquire with in-page vacuum.

About the transaction horizon once again

How does PostgreSQL make out which tuples can be considered dead? We already touched upon the concept of transaction horizon when discussing data snapshots, but it won't hurt to reiterate such an important matter.

Let's start the previous experiment again.

=> TRUNCATE vac;
=> INSERT INTO vac(s) VALUES ('A');
=> UPDATE vac SET s = 'B';

But before updating the row once again, let one more transaction start (but not end). In this example, it will use the Read Committed level, but it must get a true (not virtual) transaction number. For example, the transaction can change and even lock certain rows in any table, not obligatory vac:

|  => BEGIN;
|  => SELECT s FROM t FOR UPDATE;

|    s  
|  -----
|   FOO
|   BAR
|  (2 rows)

=> UPDATE vac SET s = 'C';

There are three rows in the table and three references in the index now. What will happen after vacuuming?

=> VACUUM vac;
=> SELECT * FROM heap_page('vac',0);

 ctid  | state  |   xmin   |   xmax   | hhu | hot | t_ctid 
-------+--------+----------+----------+-----+-----+--------
 (0,1) | unused |          |          |     |     | 
 (0,2) | normal | 4005 (c) | 4007 (c) |     |     | (0,3)
 (0,3) | normal | 4007 (c) | 0 (a)    |     |     | (0,3)
(3 rows)

=> SELECT * FROM index_page('vac_s',1);

 itemoffset | ctid  
------------+-------
          1 | (0,2)
          2 | (0,3)
(2 rows)

Two tuples remain in the table: VACUUM decided that the (0,2) tuple cannot be vacuumed yet. The reason is certainly in the transaction horizon of the database, which in this example is determined by the non-completed transaction:

|  => SELECT backend_xmin FROM pg_stat_activity WHERE pid = pg_backend_pid();

|   backend_xmin 
|  --------------
|           4006
|  (1 row)

We can ask VACUUM to report what is happening:

=> VACUUM VERBOSE vac;

INFO:  vacuuming "public.vac"
INFO:  index "vac_s" now contains 2 row versions in 2 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "vac": found 0 removable, 2 nonremovable row versions in 1 out of 1 pages
DETAIL:  1 dead row versions cannot be removed yet, oldest xmin: 4006
There were 1 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

Note that:

2 nonremovable row versions — two tuples that cannot be deleted are found in the table.
1 dead row versions cannot be removed yet — one of them is dead.
oldest xmin shows the current horizon.

Let's reiterate the conclusion: if a database has long-lived transactions (not completed or being performed really long), this can entail table bloat regardless of how often vacuuming happens. Therefore, OLTP- and OLAP-type workloads poorly coexist in one PostgreSQL database: reports running for hours will not let updated tables be duly vacuumed. Creation of a separate replica for reporting purposes may be a possible solution to this.

After completion of an open transaction, the horizon moves, and the situation gets fixed:

|  => COMMIT;

=> VACUUM VERBOSE vac;

INFO:  vacuuming "public.vac"
INFO:  scanned index "vac_s" to remove 1 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  "vac": removed 1 row versions in 1 pages
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  index "vac_s" now contains 1 row versions in 2 pages
DETAIL:  1 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "vac": found 1 removable, 1 nonremovable row versions in 1 out of 1 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 4008
There were 1 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

Now only latest, live, version of the row is left in the page:

=> SELECT * FROM heap_page('vac',0);

 ctid  | state  |   xmin   | xmax  | hhu | hot | t_ctid 
-------+--------+----------+-------+-----+-----+--------
 (0,1) | unused |          |       |     |     | 
 (0,2) | unused |          |       |     |     | 
 (0,3) | normal | 4007 (c) | 0 (a) |     |     | (0,3)
(3 rows)

The index also has only one row:

=> SELECT * FROM index_page('vac_s',1);

 itemoffset | ctid  
------------+-------
          1 | (0,3)
(1 row)

What happens inside?

Vacuuming must process the table and indexes at the same time and do this so as not to lock the other processes. How can it do so?

All starts with the scanning heap phase (the visibility map taken into account, as already mentioned). In the pages read, dead tuples are detected, and their tids are written down to a specialized array. The array is stored in the local memory of the vacuum process, where maintenance_work_mem bytes of memory are allocated for it. The default value of this parameter is 64 MB. Note that the full amount of memory is allocated at once, rather than as the need arises. However, if the table is not large, a smaller amount of memory is allocated.

Then we either reach the end of the table or the memory allocated for the array is over. In either case, the vacuuming indexes phase starts. To this end, each index created on the table is fully scanned in search of the rows that reference the remembered tuples. The rows found are vacuumed away from index pages.

Here we confront the following: the indexes do not already have references to dead tuples, while the table still has them. And this is contrary to nothing: when executing a query, we either don't hit dead tuples (with index access) or reject them at the visibility check (when scanning the table).

After that, the vacuuming heap phase starts. The table is scanned again to read the appropriate pages, vacuum them of the remembered tuples and release the pointers. We can do this since there are no references from the indexes anymore.

If the table was not entirely read during the first cycle, the array is cleared and everything is repeated from where we reached.

In summary:

The table is always scanned twice.
If vacuuming deletes so many tuples that they all do not fit in memory of size maintenance_work_mem, all the indexes will be scanned as many times as needed.

For large tables, this can require a lot of time and add considerable system workload. Of course, queries will not be locked, but extra input/output is definitely undesirable.

To speed up the process, it makes sense to either call VACUUM more often (so that not too many tuples are vacuumed away each time) or allocate more memory.

To note in parentheses, starting with version 11, PostgreSQL can skip index scans unless a compelling need arises. This must make the life easier for owners of large tables where rows are only added (but not changed).

Monitoring

How can we figure out that VACUUM cannot do its job in one cycle?

We've already seen the first way: to call the VACUUM command with the VERBOSE option. In this case, information about the phases of the process will be output to the console.

Second, starting with version 9.6, the pg_stat_progress_vacuum view is available, which also provides all the necessary information.

(The third way is also available: to output the information to the message log, but this works only for autovacuum, which will be discussed next time.)

Let's insert quite a few rows in the table, for the vacuum process to last pretty long, and let's update all of them, for VACUUM to get stuff to do.

=> TRUNCATE vac;
=> INSERT INTO vac(s) SELECT 'A' FROM generate_series(1,500000);
=> UPDATE vac SET s  = 'B';

Let's reduce the memory size allocated for the array of identifiers:

=> ALTER SYSTEM SET maintenance_work_mem = '1MB';
=> SELECT pg_reload_conf();

Let's start VACUUM and while it is working, let's access the pg_stat_progress_vacuum view several times:

=> VACUUM VERBOSE vac;

|  => SELECT * FROM pg_stat_progress_vacuum \gx

|  -[ RECORD 1 ]------+------------------
|  pid                | 6715
|  datid              | 41493
|  datname            | test
|  relid              | 57383
|  phase              | vacuuming indexes
|  heap_blks_total    | 16667
|  heap_blks_scanned  | 2908
|  heap_blks_vacuumed | 0
|  index_vacuum_count | 0
|  max_dead_tuples    | 174762
|  num_dead_tuples    | 174480

|  => SELECT * FROM pg_stat_progress_vacuum \gx

|  -[ RECORD 1 ]------+------------------
|  pid                | 6715
|  datid              | 41493
|  datname            | test
|  relid              | 57383
|  phase              | vacuuming indexes
|  heap_blks_total    | 16667
|  heap_blks_scanned  | 5816
|  heap_blks_vacuumed | 2907
|  index_vacuum_count | 1
|  max_dead_tuples    | 174762
|  num_dead_tuples    | 174480

Here we can see, in particular:

The name of the current phase — we discussed three main phases, but there are more of them in general.
The total number of table pages (heap_blks_total).
The number of scanned pages (heap_blks_scanned).
The number of already vacuumed pages (heap_blks_vacuumed).
The number of index vacuum cycles (index_vacuum_count).

The general progress is determined by the ratio of heap_blks_vacuumed to heap_blks_total, but we should take into account that this value changes in large increments rather than smoothly because of scanning the indexes. The main attention, however, should be given to the number of vacuum cycles: the number greater than 1 means that the memory allocated was not enough to complete vacuuming in one cycle.

The output of the VACUUM VERBOSE command, already completed by that time, will show the general picture:

INFO:  vacuuming "public.vac"

INFO:  scanned index "vac_s" to remove 174480 row versions
DETAIL:  CPU: user: 0.50 s, system: 0.07 s, elapsed: 1.36 s
INFO:  "vac": removed 174480 row versions in 2908 pages
DETAIL:  CPU: user: 0.02 s, system: 0.02 s, elapsed: 0.13 s

INFO:  scanned index "vac_s" to remove 174480 row versions
DETAIL:  CPU: user: 0.26 s, system: 0.07 s, elapsed: 0.81 s
INFO:  "vac": removed 174480 row versions in 2908 pages
DETAIL:  CPU: user: 0.01 s, system: 0.02 s, elapsed: 0.10 s

INFO:  scanned index "vac_s" to remove 151040 row versions
DETAIL:  CPU: user: 0.13 s, system: 0.04 s, elapsed: 0.47 s
INFO:  "vac": removed 151040 row versions in 2518 pages
DETAIL:  CPU: user: 0.01 s, system: 0.02 s, elapsed: 0.08 s

INFO:  index "vac_s" now contains 500000 row versions in 17821 pages
DETAIL:  500000 index row versions were removed.
8778 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "vac": found 500000 removable, 500000 nonremovable row versions in 16667 out of 16667 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 4011
There were 0 unused item pointers.
0 pages are entirely empty.
CPU: user: 1.10 s, system: 0.37 s, elapsed: 3.71 s.
VACUUM

We can see here that three cycles over the indexes were done, and in each cycle, 174480 pointers to dead tuples were vacuumed away. Why exactly this number? One tid occupies 6 bytes, and 1024*1024/6 = 174762, which is the number that we see in pg_stat_progress_vacuum.max_dead_tuples. In reality, slightly less may be used: this ensures that when a next page is read, all pointers to dead tuples will fit in memory for sure.

Analysis

Analysis, or, in other words, collecting statistics for the query planner, is formally unrelated to vacuuming at all. Nevertheless, we can perform the analysis not only using the ANALYZE command, but combine vacuuming and analysis in VACUUM ANALYZE. Here the vacuum is done first and then the analysis, so this gives no gains.

But as we will see later, autovacuum and automatic analysis are done in one process and are controlled in a similar way.

VACUUM FULL

As noted above, vacuum frees more space than in-page vacuum, but still it does not entirely solve the problem.

If for some reasons the size of a table or an index has increased a lot, VACUUM will free space inside the existing pages: «holes» will occur there, which will then be used for insertion of new tuples. But the number of pages won't change, and therefore, from the viewpoint of the operating system, the files will occupy exactly the same space as before the vacuum. And this is no good because:

Full scan of the table (or index) slows down.
A larger buffer cache may be required (since it is the pages that are stored there and the density of useful information decreases).
In the index tree an extra level can occur, which will slow down index access.
The files occupy extra space on disk and in backup copies.

(The only exception is fully vacuumed pages, located at the end of the file. These pages are trimmed from the file and returned to the operating system.)

If the share of useful information in the files falls below some reasonable limit, the administrator can do VACUUM FULL of the table. In this case, the table and all its indexes are rebuilt from scratch and the data are packed in a mostly compact way (of course, the fillfactor parameter taken into account). During the rebuild, PostgreSQL first rebuilds the table and then each of its indexes one-by-one. For each object, new files are created, and old files are removed at the end of rebuilding. We should take into account that extra disk space will be needed in the process.

To illustrate this, let's again insert a certain number of rows into the table:

=> TRUNCATE vac;
=> INSERT INTO vac(s) SELECT 'A' FROM generate_series(1,500000);

How can we estimate the information density? To do this, it's convenient to use a specialized extension:

=> CREATE EXTENSION pgstattuple;
=> SELECT * FROM pgstattuple('vac') \gx

-[ RECORD 1 ]------+---------
table_len          | 68272128
tuple_count        | 500000
tuple_len          | 64500000
tuple_percent      | 94.47
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 38776
free_percent       | 0.06

The function reads the entire table and shows statistics: which data occupies how much space in the files. The main information of our interest now is the tuple_percent field: the percentage of useful data. It is less than 100 because of the inevitable information overhead inside a page, but is still pretty high.

For the index, different information is output, but the avg_leaf_density field has the same meaning: the percentage of useful information (in leaf pages).

=> SELECT * FROM pgstatindex('vac_s') \gx

-[ RECORD 1 ]------+---------
version            | 3
tree_level         | 3
index_size         | 72802304
root_block_no      | 2722
internal_pages     | 241
leaf_pages         | 8645
empty_pages        | 0
deleted_pages      | 0
avg_leaf_density   | 83.77
leaf_fragmentation | 64.25

And these are the sizes of the table and indexes:

=> SELECT pg_size_pretty(pg_table_size('vac')) table_size,
  pg_size_pretty(pg_indexes_size('vac')) index_size;

 table_size | index_size 
------------+------------
 65 MB      | 69 MB
(1 row)

Now let's delete 90% of all rows. We do a random choice of rows to delete, so that at least one row is highly likely to remain in each page:

=> DELETE FROM vac WHERE random() < 0.9;

DELETE 450189

What size will the objects have after VACUUM?

=> VACUUM vac;
=> SELECT pg_size_pretty(pg_table_size('vac')) table_size,
  pg_size_pretty(pg_indexes_size('vac')) index_size;

 table_size | index_size 
------------+------------
 65 MB      | 69 MB
(1 row)

We can see that the size did not change: VACUUM no way can reduce the size of files. And this is although the information density decreased by approximately 10 times:

=> SELECT vac.tuple_percent, vac_s.avg_leaf_density
FROM pgstattuple('vac') vac, pgstatindex('vac_s') vac_s;

 tuple_percent | avg_leaf_density 
---------------+------------------
          9.41 |             9.73
(1 row)

Now let's check what we get after VACUUM FULL. Now the table and indexes use the following files:

=> SELECT pg_relation_filepath('vac'), pg_relation_filepath('vac_s');

 pg_relation_filepath | pg_relation_filepath 
----------------------+----------------------
 base/41493/57392     | base/41493/57393
(1 row)

=> VACUUM FULL vac;
=> SELECT pg_relation_filepath('vac'), pg_relation_filepath('vac_s');

 pg_relation_filepath | pg_relation_filepath 
----------------------+----------------------
 base/41493/57404     | base/41493/57407
(1 row)

The files are replaced with new ones now. The sizes of the table and indexes considerably decreased, while the information density increased accordingly:

=> SELECT pg_size_pretty(pg_table_size('vac')) table_size,
  pg_size_pretty(pg_indexes_size('vac')) index_size;

 table_size | index_size 
------------+------------
 6648 kB    | 6480 kB
(1 row)

=> SELECT vac.tuple_percent, vac_s.avg_leaf_density
FROM pgstattuple('vac') vac, pgstatindex('vac_s') vac_s;

 tuple_percent | avg_leaf_density 
---------------+------------------
         94.39 |            91.08
(1 row)

Note that the information density in the index is even greater than the original one. It is more advantageous to rebuild an index (B-tree) from the data available than insert the data in an existing index row by row.

The functions of the pgstattuple extension that we used read the entire table. But this is inconvenient if the table is large, so the extension has the pgstattuple_approx function, which skips the pages marked in the visibility map and shows approximate figures.

One more way, but even less accurate, is to use the system catalog to roughly estimate the ratio of the data size to the file size. You can find examples of such queries in wiki.

VACUUM FULL is not intended for regular use since it blocks any work with the table (querying included) for all the duration of the process. It's clear that for a heavily used system, this may appear unacceptable. Locks will be discussed separately, and now we'll only mention the pg_repack extension, which locks the table for only a short period of time at the end of the work.

Similar commands

There are a few commands that also fully rebuild tables and indexes and therefore resemble VACUUM FULL. All of them fully block any work with the table, they all remove old data files and create new ones.

The CLUSTER command is in all similar to VACUUM FULL, but it also physically orders tuples according to one of the available indexes. This enables the planner to use index access more efficiently in some cases. But we should bear in mind that clustering is not maintained: the physical order of tuples will be broken with subsequent changes of the table.

The REINDEX command rebuilds a separate index on the table. VACUUM FULL and CLUSTER actually use this command to rebuild indexes.

The logic of the TRUNCATE command is similar to that of DELETE — it deletes all table rows. But DELETE, as was already mentioned, only marks tuples as deleted, and this requires further vacuuming. And TRUNCATE just creates a new, clean file instead. As a rule, this works faster, but we should mind that TRUNCATE will block any work with the table up to the end of the transaction.

Read on.

Tags:

Hubs: