Accelerating PHP connectors for Tarantool using Async, Swoole, and Parallel

In the PHP ecosystem, there are currently two connectors for the Tarantool server: the official PECL extension tarantool/tarantool-php, written in C, and tarantool-php/client, written in PHP. I am the author of the latter.

In this article, I would like to share the results of performance testing of both libraries and show how you can achieve a 3x-5x performance improvement (on synthetic tests!) with minimal changes to the code.

What are we going to test?

We will test the above-mentioned synchronous connectors launched asynchronously, in parallel, and asynchronously in parallel. We also want to leave the connectors' source code unchanged. At the moment, several extensions can do the job:

• Swoole, a high-performance asynchronous framework for PHP, used by Internet giants such as Alibaba and Baidu. Version 4.1.0 introduced the amazing runtime hook Swoole\Runtime::enableCoroutine(), which allows “to transform synchronous PHP network libraries into co-routine libraries using a single line of code”.
• Async, which until recently was a very promising extension for asynchronous work in PHP. Why until recently? Unfortunately, for reasons I do not know, the author deleted the repository, and the future of the project is questionable. I will use one of the forks. Like Swoole, this extension makes it easy to activate the asynchronous mode by replacing PHP's default stream implementations with their async counterparts. This is done through the option “async.tcp = 1”.
• Parallel, a fairly new extension from the well-known Joe Watkins, the author of such libraries as phpdbg, apcu, pthreads, pcov, and uopz. The extension provides a multi-threading API for PHP and is positioned as a replacement for pthreads. A significant limitation of the library is that it only works with the ZTS (Zend Thread Safe) version of PHP.

How are we going to test?

We will run a Tarantool instance with write-ahead logging disabled (wal_mode = none) and an enlarged network buffer (readahead = 1 * 1024 * 1024). The first option eliminates disk I/O, and the second allows Tarantool to read more requests from the operating system buffer at once, thus minimizing the number of system calls.

For benchmarks that work with data (insertion, deletion, reading, etc.), a memtx space will be (re-)created before the benchmark starts, and the initial index values for this space will be created by the sequence generator.

DDL of the space is as follows:

space = box.schema.space.create(config.space_name, {
    id = config.space_id,
    temporary = true
})

space:create_index('primary', {
    type = 'tree',
    parts = {1, 'unsigned'},
    sequence = true
})

space:format({
    {name = 'id', type = 'unsigned'},
    {name = 'name', type = 'string', is_nullable = false}
})

If necessary, before starting the benchmark, the space is filled with 10,000 tuples of the following form:

{id, 'tuple_' .. id}

Tuples are accessed by a random key value.
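The fill-and-lookup pattern described above can be sketched in a few lines. This is a Python illustration only, not the benchmark's actual PHP code: the helper names make_tuples and random_key are hypothetical, and the sketch assumes that ids run from 1 to 10,000, as a sequence-backed primary index would produce them by default.

```python
import random

TUPLE_COUNT = 10_000  # number of tuples pre-loaded into the space


def make_tuples(count):
    """Build (id, name) pairs mirroring the {id, 'tuple_' .. id} shape."""
    return [(i, f"tuple_{i}") for i in range(1, count + 1)]


def random_key(count, rng=random):
    """Pick a random existing primary key, assuming ids run from 1 to count."""
    return rng.randint(1, count)


tuples = make_tuples(TUPLE_COUNT)
key = random_key(TUPLE_COUNT)
```

Picking keys at random keeps every request hitting an existing tuple without always reading the same one.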

Each benchmark is a single request to the server that is executed 10,000 times (revolutions); a set of revolutions makes up one iteration. Iterations are repeated until the time deviations among 5 iterations all fall within a 3% error margin*. After that, the average result is taken. Between iterations, there is a 1-second pause to keep the CPU from throttling. The Lua garbage collector is disabled before each iteration and forced to run after the iteration finishes. The PHP process is launched with only the extensions required for the benchmark, with output buffering enabled and the garbage collector disabled.

* The number of revolutions, iterations and error threshold can be altered in the benchmark settings.
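The stopping rule above can be sketched as follows. This is a minimal Python illustration, not the benchmark harness itself: the function names are hypothetical, and it assumes that "deviation" means the relative distance of each iteration time from the mean of the last 5 iterations, which the article does not spell out.

```python
import statistics
import time


def is_stable(times, threshold=0.03):
    """True if every time lies within `threshold` (relative) of the mean."""
    mean = statistics.mean(times)
    return all(abs(t - mean) / mean <= threshold for t in times)


def run_until_stable(run_iteration, window=5, threshold=0.03,
                     pause=1.0, max_iterations=100):
    """Repeat iterations until the last `window` timings are stable.

    `run_iteration` is a hypothetical callable that executes all
    revolutions once and returns the elapsed time.
    """
    times = []
    for _ in range(max_iterations):
        times.append(run_iteration())
        recent = times[-window:]
        if len(recent) == window and is_stable(recent, threshold):
            return statistics.mean(recent)  # the averaged result
        time.sleep(pause)  # pause between iterations
    raise RuntimeError("timings did not stabilise")
```

The max_iterations guard is an addition of this sketch so that a noisy machine cannot keep the loop running forever.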

Test environment

The results posted below were obtained on a MacBook Pro (mid 2015) running Fedora 30 (kernel version 5.3.8-200.fc30.x86_64). Tarantool was launched in Docker with the “--network host” setting.

Package versions:

Tarantool: 2.3.0-115-g5ba5ed37e
Docker: 19.03.3, build a872fc2f86
PHP: 7.3.11 (cli) (built: Oct 22 2019 08:11:04)
tarantool/client: 0.6.0
rybakit/msgpack: 0.6.1
ext-tarantool: 0.3.2 (patched)*
ext-msgpack: 2.0.3
ext-async: 0.3.0-8c1da46
ext-swoole: 4.4.12
ext-parallel: 1.1.3

* Unfortunately, the official connector does not work with PHP > 7.2. To compile and run the extension on PHP 7.3, I had to use a patch.

Results

Sync (default)

The Tarantool protocol uses the MessagePack binary format to serialize messages. In the PECL connector, serialization is hidden deep inside the library, so it appears impossible to affect the encoding process from userland code. In contrast, the pure PHP connector provides the ability to customize the encoding process, either by extending one of the standard encoders, or by using your own implementation. Two encoders are available out of the box: one is based on msgpack/msgpack-php (the official MessagePack PECL extension) and the other one is based on rybakit/msgpack (pure PHP).

Before we proceed to comparing the connectors, let’s measure the performance of MessagePack encoders for the PHP connector, so that we use the best performer further on in our tests:

Although the PHP version (Pure) is not as fast as the PECL extension, I would still recommend using rybakit/msgpack in real projects. The official PECL extension implements the MessagePack specification only partially (e.g. there is no support for custom data types, and without it you cannot use Decimal, a new data type introduced in Tarantool 2.3), has a number of other issues (including compatibility issues with PHP 7.4), and in general looks abandoned.

So, let's measure the performance of the connectors in the synchronous mode:

As you can see from the graph, the PECL connector (Tarantool) performs better than the PHP connector (Client). That is not surprising, considering that the latter, in addition to being implemented in a slower language, actually does more work: new Request and Response objects are created with every request (in the case of Select there is also Criteria, and in the case of Update/Upsert there is Operations), and Connection, Packer and Handler also add some overhead. Needless to say, higher flexibility comes at a cost. However, the PHP interpreter shows good performance in general. Although there is a difference, it is insignificant and may shrink further with preloading in PHP 7.4, not to mention JIT in PHP 8.

Moving on. Tarantool 2.0 introduced SQL support. Let's try to perform Select, Insert, Update and Delete operations using the SQL protocol and compare the results with their NoSQL (binary) equivalents:

SQL results are not that impressive (let me remind you that we are still testing the synchronous mode). However, I wouldn't get upset about it prematurely: SQL support is still under active development (for instance, support for prepared statements was added not long ago) and, according to the list of issues, the SQL engine will get a number of optimizations in the future.

Async

Well, let's see now how the Async extension can help us improve the results above. For asynchronous programming, the extension provides a coroutine-based API, which we are going to use here. First, as testing shows, the optimal number of coroutines for our environment is 25:

Then we spread 10,000 operations over 25 coroutines and check the result:

The number of operations per second has grown more than 3 times for the PHP connector! Sadly, the PECL connector failed to launch with ext-async.

As you can see, in the asynchronous mode the difference between the binary protocol and SQL falls within the margin of error.
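How 10,000 operations are spread over 25 coroutines can be illustrated with Python's asyncio. This is an analogy only: the benchmark itself is written in PHP with ext-async, and fake_request is a hypothetical stand-in for one network round trip to Tarantool.

```python
import asyncio

TOTAL_OPS = 10_000
COROUTINES = 25


async def fake_request(i):
    # Hypothetical stand-in for one request: it yields control to the
    # event loop, as a coroutine waiting on a socket would.
    await asyncio.sleep(0)
    return i


async def worker(ops):
    """Run `ops` requests one after another inside a single coroutine."""
    done = 0
    for i in range(ops):
        await fake_request(i)
        done += 1
    return done


async def main():
    per_worker = TOTAL_OPS // COROUTINES  # 400 operations per coroutine
    counts = await asyncio.gather(
        *(worker(per_worker) for _ in range(COROUTINES))
    )
    return sum(counts)


total = asyncio.run(main())
```

While one coroutine waits for a reply, the others can send their own requests, which is why a connector written in a synchronous style can still benefit once its socket operations become non-blocking.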

Swoole

Again, let's determine the optimal number of coroutines, this time for Swoole:

Let's take 25. Now we repeat the same trick as with the Async extension and distribute 10,000 operations among 25 coroutines. In addition, let's add one more test, where we split the work between two processes (i.e. each process performs 5,000 operations in 25 coroutines). The processes are created with the help of Swoole\Process.
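The arithmetic behind this split is simple, but it is easy to lose operations when the totals do not divide evenly. Below is a small Python sketch of the partitioning; split_evenly is a hypothetical helper and is not part of Swoole or of the benchmark code.

```python
def split_evenly(total, parts):
    """Split `total` operations into `parts` chunks that differ by at most 1."""
    base, extra = divmod(total, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


# 10,000 operations over 2 processes, then 25 coroutines in each process.
per_process = split_evenly(10_000, 2)
per_coroutine = [split_evenly(n, 25) for n in per_process]
```

With these numbers each process gets 5,000 operations and each coroutine 200, so the total amount of work stays the same as in the single-process test.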

Results:

Swoole shows slightly lower performance than Async when running in one process, but with 2 processes the picture changes drastically (the number 2 was not chosen by accident: on my machine, exactly this number of processes showed the best result).

By the way, there is also an API for working with processes in the Async extension, but I noticed no difference between launching benchmarks in a single process or in several processes (it's possible though that I've made some mistakes).

SQL versus binary protocol:

As with Async, the difference between binary and SQL operations gets leveled out in the asynchronous mode.

Parallel

Since the Parallel extension is about threads, not coroutines, we need to measure the optimal number of parallel threads:

It's 16 on my machine. Now let's benchmark the connectors on 16 parallel threads:

As you can see, the result is even better than with asynchronous extensions (except Swoole launched with 2 processes). Note that for the PECL connector, Update and Upsert operations have no bar. This is because those operations crashed with an error, and I'm not sure what is to blame: ext-parallel, or ext-tarantool, or both.

Now let’s add SQL performance to the comparison:

Have you noticed similarities to the graph for the connectors launched synchronously?

All in one

Finally, let's combine all the results in one graph to see the whole picture for the extensions under test. We are going to add only one new test to the graph, which we haven’t done yet: launch Async coroutines in parallel using Parallel*. The idea of integrating the aforementioned extensions has already been discussed by the authors but no consensus has been reached, so we will have to do it ourselves.

* I failed to launch Swoole coroutines with Parallel; it seems that these extensions are incompatible.

Now, the final results:

Conclusion

To my mind, the results are quite decent, but there is something that makes me believe we’re not there yet! If you have any ideas on how to improve the benchmarks, I will be happy to review your pull request. All code with launch instructions and results is published in a dedicated repository.

I will leave it up to you to decide whether you need this in a real project. For me, it was an exciting experiment that allowed me to estimate how much one could squeeze out of a synchronous TCP connector with minimal effort.