# MVCC in PostgreSQL-5. In-page vacuum and HOT updates

• Translation
Just to remind you, we already discussed issues related to isolation, made a digression regarding low-level data structure, and then explored row versions and observed how data snapshots are obtained from row versions.

Now we will proceed to two closely connected problems: in-page vacuum и HOT updates. Both techniques can be referred to optimizations; they are important, but virtually not covered in the documentation.

When accessing a page for either an update or read, if PostgreSQL understands that the page is running out of space, it can do a fast in-page vacuum. This happens in either of the cases:

1. A previous update in this page did not find enough space to allocate a new row version in the same page. Such a situation is remembered in the page header, and next time the page is vacuumed.
2. The page is more than fillfactor percent full. In this case, vacuum is performed right away without putting off till next time.

fillfactor is a storage parameter that can be defined for a table (and for an index). PostgresSQL inserts of a new row in a page only if the page is less than fillfactor percent full. The remaining space is reserved for new tuples that are created as a result of updates. The default value for tables is 100, that is, no space is reserved (and the default value for indexes is 90).

In-page vacuum deletes tuples that are not visible in any snapshot (those that are beyond the transaction horizon of the database, which was discussed last time), but does this strictly within one table page. Pointers to vacuumed tuples are not released since they can be referenced from indexes, and an index is in another page. In-page vacuum never reaches beyond one table page, but works very quickly.

For the same reasons, the free space map does not get updated; this also reserves the extra space for updates rather than for inserts. The visibility map does not get updated either.

The fact that a page can be vacuumed during reads means that a SELECT query can entail changing of pages. This is one more case like this, in addition to a deferred change of hint bits, discussed earlier.

Let's consider an example of how it works. Let's create a table and indexes on both columns.

=> CREATE TABLE hot(id integer, s char(2000)) WITH (fillfactor = 75);
=> CREATE INDEX hot_id ON hot(id);
=> CREATE INDEX hot_s ON hot(s);


If the s column stores only Latin characters, each row version will occupy 2004 bytes plus 24 bytes of a header. We set the fillfactor storage parameter to 75%, which reserves just enough space for three rows.

To conveniently look into the contents of the table page, let's recreate an already familiar function by adding two more fields to the output:

=> CREATE FUNCTION heap_page(relname text, pageno integer)
RETURNS TABLE(ctid tid, state text, xmin text, xmax text, hhu text, hot text, t_ctid tid)
AS $$SELECT (pageno,lp)::text::tid AS ctid, CASE lp_flags WHEN 0 THEN 'unused' WHEN 1 THEN 'normal' WHEN 2 THEN 'redirect to '||lp_off WHEN 3 THEN 'dead' END AS state, t_xmin || CASE WHEN (t_infomask & 256) > 0 THEN ' (c)' WHEN (t_infomask & 512) > 0 THEN ' (a)' ELSE '' END AS xmin, t_xmax || CASE WHEN (t_infomask & 1024) > 0 THEN ' (c)' WHEN (t_infomask & 2048) > 0 THEN ' (a)' ELSE '' END AS xmax, CASE WHEN (t_infomask2 & 16384) > 0 THEN 't' END AS hhu, CASE WHEN (t_infomask2 & 32768) > 0 THEN 't' END AS hot, t_ctid FROM heap_page_items(get_raw_page(relname,pageno)) ORDER BY lp;$$ LANGUAGE SQL;


Let's also create a function to look into the index page:

=> CREATE FUNCTION index_page(relname text, pageno integer)
RETURNS TABLE(itemoffset smallint, ctid tid)
AS $$SELECT itemoffset, ctid FROM bt_page_items(relname,pageno);$$ LANGUAGE SQL;


Let's check how in-page vacuum works. To do this, we insert one row and change it several times:

=> INSERT INTO hot VALUES (1, 'A');
=> UPDATE hot SET s = 'B';
=> UPDATE hot SET s = 'C';
=> UPDATE hot SET s = 'D';


There are four tuples in the page now:

=> SELECT * FROM heap_page('hot',0);

 ctid  | state  |   xmin   |   xmax   | hhu | hot | t_ctid
-------+--------+----------+----------+-----+-----+--------
(0,1) | normal | 3979 (c) | 3980 (c) |     |     | (0,2)
(0,2) | normal | 3980 (c) | 3981 (c) |     |     | (0,3)
(0,3) | normal | 3981 (c) | 3982     |     |     | (0,4)
(0,4) | normal | 3982     | 0 (a)    |     |     | (0,4)
(4 rows)


As expected, we've just exceeded the fillfactor threshold. This is clear from the difference between the pagesize and upper values: it exceeds the threshold equal to 75% of the page size, which makes 6144 bytes.

=> SELECT lower, upper, pagesize FROM page_header(get_raw_page('hot',0));

 lower | upper | pagesize
-------+-------+----------
40 |    64 |     8192
(1 row)


So, when the page is accessed next time, in-page vacuum must occur. Let's check this.

=> UPDATE hot SET s = 'E';
=> SELECT * FROM heap_page('hot',0);

 ctid  | state  |   xmin   | xmax  | hhu | hot | t_ctid
-------+--------+----------+-------+-----+-----+--------
(0,1) | dead   |          |       |     |     |
(0,2) | dead   |          |       |     |     |
(0,3) | dead   |          |       |     |     |
(0,4) | normal | 3982 (c) | 3983  |     |     | (0,5)
(0,5) | normal | 3983     | 0 (a) |     |     | (0,5)
(5 rows)


All dead tuples (0,1), (0,2) and (0,3) are vacuumed away; after that a new tuple (0,5) is added in the freed space.

The tuples that survived vacuuming are physically moved towards high addresses of the page so that all the free space is represented by one continuous area. The values of the pointers are changed accordingly. Thanks to this, no issues arise with the fragmentation of free space in a page.

Pointers to vacuumed tuples cannot be released since they are referenced from the index page. Let's look into the first page of the hot_s index (because page zero is occupied by metainformation):

=> SELECT * FROM index_page('hot_s',1);

 itemoffset | ctid
------------+-------
1 | (0,1)
2 | (0,2)
3 | (0,3)
4 | (0,4)
5 | (0,5)
(5 rows)


We also see the same picture in the other index:

=> SELECT * FROM index_page('hot_id',1);

 itemoffset | ctid
------------+-------
1 | (0,5)
2 | (0,4)
3 | (0,3)
4 | (0,2)
5 | (0,1)
(5 rows)


You might notice that pointers to table rows follow here in a reverse order, but this makes no difference since all the tuples have the same value: id=1. But in the previous index, pointers are ordered by the values of s, and this is essential.

With index access, PostgreSQL can get (0,1), (0,2) or (0,3) as tuple identifiers. It will then try to get the appropriate row version from the table page, but because of the «dead» status of the pointer, PostgreSQL will discover that such a version no longer exists and will ignore it. (Actually, having once discovered that the version of a table row is unavailable, PostgreSQL will change the pointer status in the index page in order not to access the table page anymore.)

It is essential that in-page vacuum works only within one table page and does not vacuum index pages.

Why is it no good to store references to all row versions in the index?

First, for any change of the row, all indexes created for the table have to be updated: once a new version has been created, it needs to be referenced. And we need to do this in any case, even if the fields are changed that are not indexed. This is obviously not very efficient.

Second, indexes accumulate references to historical tuples, which then need to be vacuumed away along with the tuples themselves (we will discuss a bit later how this is done).

Moreover, B-tree in PostgreSQL has the implementation specifics. If an index page does not have sufficient space to insert a new row, the page is divided into two, and all data are distributed between them. This is called a split of a page. However, when rows are deleted, the two index pages are not merged into one. Because of this, the index size may fail to reduce even if a considerable part of the data is deleted.

Naturally, the more indexes are created on a table, the more complexities are encountered.

However, if a value is changed in a column that is not indexed at all, it does not make sense to create an extra B-tree row that contains the same value of the key. This is exactly how the optimization called HOT update (Heap-Only Tuple update) works.

During this update, the index page contains only one row, which references the very first version of the row in the table page. And it's already inside the table page, that a chain of tuples is organized:

• Updated rows that are in the chain are labeled with the Heap Hot Updated bit.
• Rows that are not referenced from the index are labeled with Heap Only Tuple bit.
• As usual, row versions are linked through the ctid field.

If during index scan, PostgreSQL accesses a table page and finds a tuple labeled as Heap Hot Updated, it understands that it should not stop, but must follow the HOT chain, considering each tuple in it. Certainly, for all the tuples got this way, the visibility is checked prior to returning them to the client.

To observe how a HOT update works, let's delete one index and clear the table.

=> DROP INDEX hot_s;
=> TRUNCATE TABLE hot;


Now we redo the insert and update of a row.

=> INSERT INTO hot VALUES (1, 'A');
=> UPDATE hot SET s = 'B';


And this is what we see in the table page:

=> SELECT * FROM heap_page('hot',0);

 ctid  | state  |   xmin   | xmax  | hhu | hot | t_ctid
-------+--------+----------+-------+-----+-----+--------
(0,1) | normal | 3986 (c) | 3987  | t   |     | (0,2)
(0,2) | normal | 3987     | 0 (a) |     | t   | (0,2)
(2 rows)


There is a chain of changes in the page:

• The Heap Hot Updated flag indicates that the ctid chain must be followed.
• The Heap Only Tuple flag indicates that this tuple is not referenced from indexes.

The chain will grow (within the page) with further changes:

=> UPDATE hot SET s = 'C';
=> UPDATE hot SET s = 'D';
=> SELECT * FROM heap_page('hot',0);

 ctid  | state  |   xmin   |   xmax   | hhu | hot | t_ctid
-------+--------+----------+----------+-----+-----+--------
(0,1) | normal | 3986 (c) | 3987 (c) | t   |     | (0,2)
(0,2) | normal | 3987 (c) | 3988 (c) | t   | t   | (0,3)
(0,3) | normal | 3988 (c) | 3989     | t   | t   | (0,4)
(0,4) | normal | 3989     | 0 (a)    |     | t   | (0,4)
(4 rows)


But there is only one reference to the head of the chain in the index:

=> SELECT * FROM index_page('hot_id',1);

 itemoffset | ctid
------------+-------
1 | (0,1)
(1 row)


To emphasize, HOT updates work in the case where the fields to update are not indexed at all. Otherwise, some index would contain a reference directly to a new row version, and this is incompatible with the concept of this optimization.

The optimization works only within one page, and therefore, an additional walk through the chain does not need access to other pages and does not affect the performance.

# In-page vacuum during HOT updates

Vacuuming during HOT updates is a special, but important case of in-page vacuum.

As before, we've already exceeded the fillfactor threshold, so the next update must cause an in-page vacuum. But this time there is a chain of updates in the page. The head of this HOT chain must always remain where it is since it is referenced by the index, while the rest of the pointers can be released: they are known to have no references from the outside.

In order not to touch the head pointer, indirect addressing is used: the pointer referenced by the index — (0,1) in this case — acquires the «redirect» status, which redirects to the appropriate tuple.

=> UPDATE hot SET s = 'E';
=> SELECT * FROM heap_page('hot',0);

 ctid  |     state     |   xmin   | xmax  | hhu | hot | t_ctid
-------+---------------+----------+-------+-----+-----+--------
(0,1) | redirect to 4 |          |       |     |     |
(0,2) | normal        | 3990     | 0 (a) |     | t   | (0,2)
(0,3) | unused        |          |       |     |     |
(0,4) | normal        | 3989 (c) | 3990  | t   | t   | (0,2)
(4 rows)


Note that:

• Tuples (0,1), (0,2) and (0,3) were vacuumed away.
• The head pointer (0,1) remains, but it acquired the «redirect» status.
• The new row version overwrote (0,2) since there were no references to that tuple for sure, and the pointer was released («unused» status).

Let's do an update several times more:

=> UPDATE hot SET s = 'F';
=> UPDATE hot SET s = 'G';
=> SELECT * FROM heap_page('hot',0);

 ctid  |     state     |   xmin   |   xmax   | hhu | hot | t_ctid
-------+---------------+----------+----------+-----+-----+--------
(0,1) | redirect to 4 |          |          |     |     |
(0,2) | normal        | 3990 (c) | 3991 (c) | t   | t   | (0,3)
(0,3) | normal        | 3991 (c) | 3992     | t   | t   | (0,5)
(0,4) | normal        | 3989 (c) | 3990 (c) | t   | t   | (0,2)
(0,5) | normal        | 3992     | 0 (a)    |     | t   | (0,5)
(5 rows)


Next update causes in-page vacuuming again:

=> UPDATE hot SET s = 'H';
=> SELECT * FROM heap_page('hot',0);

 ctid  |     state     |   xmin   | xmax  | hhu | hot | t_ctid
-------+---------------+----------+-------+-----+-----+--------
(0,1) | redirect to 5 |          |       |     |     |
(0,2) | normal        | 3993     | 0 (a) |     | t   | (0,2)
(0,3) | unused        |          |       |     |     |
(0,4) | unused        |          |       |     |     |
(0,5) | normal        | 3992 (c) | 3993  | t   | t   | (0,2)
(5 rows)


Again, some of the tuples are vacuumed away, and the pointer to the head of the chain is moved accordingly.

Conclusion: if columns that are not indexed are frequently updated, it might make sense to reduce the fillfactor parameter in order to reserve some page space for updates. We should, however, take into account that the less the fillfactor, the more free space is left in a page, so the physical size of the table increases.

# Break of a HOT chain

If the page lacks free space to allocate a new tuple, the chain will break. And we will have to make a separate reference from the index to the row version located in a different page.

To reproduce this situation, let's start a concurrent transaction and build the data snapshot in it.

|  => BEGIN ISOLATION LEVEL REPEATABLE READ;
|  => SELECT count(*) FROM hot;

|   count
|  -------
|       1
|  (1 row)


The snapshot will not allow vacuuming away the tuples in the page. Now let's do an update in the first session:

=> UPDATE hot SET s = 'I';
=> UPDATE hot SET s = 'J';
=> UPDATE hot SET s = 'K';
=> SELECT * FROM heap_page('hot',0);

 ctid  |     state     |   xmin   |   xmax   | hhu | hot | t_ctid
-------+---------------+----------+----------+-----+-----+--------
(0,1) | redirect to 2 |          |          |     |     |
(0,2) | normal        | 3993 (c) | 3994 (c) | t   | t   | (0,3)
(0,3) | normal        | 3994 (c) | 3995 (c) | t   | t   | (0,4)
(0,4) | normal        | 3995 (c) | 3996     | t   | t   | (0,5)
(0,5) | normal        | 3996     | 0 (a)    |     | t   | (0,5)
(5 rows)


At the next update, the page won't have enough space, but the in-page vacuum will be unable to vacuum anything away:

=> UPDATE hot SET s = 'L';


|  => COMMIT; -- snapshot no longer needed


=> SELECT * FROM heap_page('hot',0);

 ctid  |     state     |   xmin   |   xmax   | hhu | hot | t_ctid
-------+---------------+----------+----------+-----+-----+--------
(0,1) | redirect to 2 |          |          |     |     |
(0,2) | normal        | 3993 (c) | 3994 (c) | t   | t   | (0,3)
(0,3) | normal        | 3994 (c) | 3995 (c) | t   | t   | (0,4)
(0,4) | normal        | 3995 (c) | 3996 (c) | t   | t   | (0,5)
(0,5) | normal        | 3996 (c) | 3997     |     | t   | (1,1)
(5 rows)


In the tuple (0,5), there is a reference to (1,1), which is in page 1.

=> SELECT * FROM heap_page('hot',1);

 ctid  | state  | xmin | xmax  | hhu | hot | t_ctid
-------+--------+------+-------+-----+-----+--------
(1,1) | normal | 3997 | 0 (a) |     |     | (1,1)
(1 row)


Now there are two rows in the index, each of which points to the beginning of its HOT chain:

=> SELECT * FROM index_page('hot_id',1);

 itemoffset | ctid
------------+-------
1 | (1,1)
2 | (0,1)
(2 rows)