t1ha = Fast Positive Hash / Habr

Just about the fastest portable 64-bit hash function with decent quality.

This is a translation of the original article by Leonid Yuriev.

Instead of a Disclaimer

I will omit the definition of hash functions along with the detailed listing of the properties and requirements for their cryptographic application, and assume that the reader either has the necessary minimum knowledge or will read up on it. It should also be noted, that hereinafter I will talk about non-cryptographic (not suitable for cryptography) hash functions, unless explicitly stated otherwise.

Banalities

Hashing is used in a lot of algorithms, and almost always the most efficient (fast) data processing is required, along with a certain minimum level of hashing quality. Here the term “quality” means, first of all, a sort of “randomness” (stochasticity) in relation to the initial data. Somewhat less often additional requirements are imposed such as resistance to deliberate generation of collisions or irreversibility.

To be a bit more tedious

For clarity, it is necessary to define the concept of “quality” of the hash function and the rest of the requirements in a little bit more detail:
Baseline quality and the avalanche effect: changing one or more arbitrary bits in an arbitrary set of source data causes each bit of the result to change with a probability of ½.

Irreversibility or first preimage resistance: the impossibility of obtaining the original data or individual bits from the hash result.
Resistance to first-order collisions and/or second pre-image resistance: the difficulty of finding/fitting the original data set to obtain a specified result or part of it, including when the initial data set is known.
Resistance to second-order collisions: the difficulty of finding/fitting two different data sets that would give the same result or a match of a significant part.

Omitting long citations of the underlying mathematics, it can be summarized:

Satisfying all the above requirements while ensuring high performance is quite a difficult problem, solving which would give us a good cryptographic hash function. But we are not going to do this yet.
Providing basic quality requires a sufficiently large number of ALU operations. To put it simply, quality always compromises with speed.
Obtaining a high-quality result with a bit width greater than the bit width of ALU operations requires more than a several-fold increase in the number of mixings, and therefore basic ALU operations.
In general, creating a fast hash function involves achieving a weighted compromise between speed, quality and result bitness.

Therefore, I can say that t1ha appeared as a result of searching for a compromise between quality and speed, at the same time taking into account the capabilities of modern processors and the already found methods (arithmetic-logical combinations) of mixing and spreading dependencies (avalanche effect).

The basic version of t1ha is one of the fastest portable hash functions for constructing hash tables and other related applications. The basic version of t1ha is focused on 64-bit little-endian architectures, takes a 64-bit salt value (seed) and produces a 64-bit result, which includes strengthening by key-length and seed. It is worth noting that t1ha is intentionally designed to return 0 for zero input data (a key of zero size and zero seed).

Answering most popular questions

64-bit operations: Perhaps is should be noted that it is the 64-bit operations that provide speed and quality without hurting portability. In fact, the wider the digit capacity of arithmetic operations, the more avalanche effect they produce and the better they mix the data. Additionally, data processing, all other things being equal, is certainly faster by 8 bytes than by 4. On the other hand, exactly 64-bit operations are natively available on many modern processors, and can be more or less adequately translated into 32-bit ones. All other options, including SIMD operations, force us to greatly sacrifice portability and/or speed on non-native platforms.

64-bit result: To construct hash tables, in many cases, a smaller bit width result is enough. Even 32 bits may be more than sufficient. However, when using 64-bit operations, the 64-bit result comes naturally. At the same time, a sufficiently high-quality 64-bit hash result allows you to quickly perform a comparison for non-equality, and with good accuracy to compare for equality.

The above "magic" of replacing comparisons can seem unclear and unnecessary, or it can increase the speed of hashing by an order of magnitude just by means of data locality, i.e. less CPU cache pollution. Simply put, one can build a hash table structure in such a way that the calculated hash values lie side by side (packed in cache lines). The CPU then would only grab the real data if the hash values matched. And in this case, the 64-bits from t1ha allow to get the best possible result. That being said, 128 bits will not provide an advantage anymore, while taking less from 64 bits is always possible.

Comparison with HighwayHash : I have mixed feelings about this unofficial project by Google employees.

On the one hand, it has good code and excellent technical implementation. On the other hand, HighwayHash is positioned as possibly cryptographically strong (at least equal to SipHash). Inside HighwayHash there are quite a few manipulations that allow us to expect that the result will not be bad. However, there are no proofs that would allow us to reliably say that. The provided proof of "strength" comes down to the results of statistical tests, but with no ability to reproduce them (at one point I even allowed myself a somewhat superfluous comment).
HighwayHash is really fast only on x86_64 with AVX2 or SSE41. Is it not easier then to just use AES-NI or SHA acceleration?

If all goes well, additional options will be added to the t1ha suite (primarily for the result bitness) and optimized for E2K. With this I would like to close the topic of comparisons with HighwayHash.

Quality

Evaluating the quality of a hash function in all aspects can be quite difficult. It can be done either analytically or by implementing various statistical tests. Unfortunately, the analytical approach is not very effective for evaluating hash functions with a compromise between quality and speed. Moreover, a comparative analytical evaluation of such functions tends to be subjective.

In contrast, statistical tests can provide clear quantitative estimates. For such purposes there are well-proven test packages, such as SMHasher. For t1ha, the results are simple — all t1ha variants pass all tests without any comments. On the other hand, one should not assume that t1ha has any properties in excess of those that are necessary for the target application (building hash tables).

The number of collisions at all levels (variants) of t1ha corresponds to the birthday paradox. To formulate it strictly — the collision probability in t1ha corresponds to the probability of coincidence of random discrete values with corresponding bitness.
A similar probability of collisions is observed in all high-quality hash functions. However, this is only probability, so the actual number of collisions can vary for each specific data set.

After this article was first published, Yves Orton discovered, that the first t1ha1() does not always meet the strict avalanche criterion. This drawback is insignificant for targeted applications of t1ha1() and unnoticeable from a practical point of view. However, this disadvantage is eliminated in the next t1ha2()level/variant, which was originally planned to provide a slightly higher quality. On new processors, which use current versions of compilers, t1ha2() is on average one cycle faster than t1ha1(), and in the rest of cases it can be one cycle slower. It is worth noting that t1ha2() additionally offers stream hashing mode and a 128-bit result.

Readers would certainly appreciate a thorough and in-depth analysis of the quality and/or strength of t1ha. However, based on the target t1ha application areas, this seems redundant. Simply put, speed was more important to us, even for short keys. Therefore, multi-round mixing was not considered. The present t1ha version saves on cycles and gives a 64-bit result — it is practically pointless to measure the found compromise in any way other than statistically, and its results are simply good.

In fact

I just followed my colleagues from Google in how they provide their statistical proof

Benchmarks

Regarding the claim of being «the fastest». it is important to note that it is obviously not likely that there is a hash function that is at the same time useful and the fastest on all platforms/architectures. Different processors have different sets of instructions available and execute similar instructions with different efficiencies. Obviously, the «universally fastest» function most likely cannot be created. However, it seems acceptable to use the term «the
fastest» for a function that is portable and at the same time the fastest, at least on the most common platform (x86_64), while having little chance of losing on any modern processor with a decent optimizing compiler.

The source code of the project includes a test that checks both the correctness of the result and measures the speed of each implemented variant. At the same time, on x86, depending on the capabilities of the processor (and the compiler), additional variants of functions can be checked, and measurements are made in processor cycles.

In addition, the project website contains tables with the results of performance measurements through a modified version of SMHasher from Reini Urban. One can double-check all figures and/or get results on a specific processor using a specific compiler.

Here you can compare t1ha with some of its closest competitors.

Hashing short keys (average for 1..31 bytes).
Look at the right column «Cycles/Hash» (smaller is better):

Function	MiB/Second	Cycles/Hash
t1ha	12228.80	35.55
FastHash64	5578.06	43.42
CityHash64	11041.72	51.77
xxHash64	11123.15	56.17
MetroHash	11808.92	46.33

Hashing long keys (256 Kb).
Look at the middle column «MiB/Second» (larger is better):

Function	MiB/Second	Cycles/Hash
t1ha	12228.80	35.55
FarmHash64	12145.36	60.12
CityHash64	11041.72	51.77
xxHash64	11123.15	56.17
Spooky64	11820.20	60.39

Variants of t1ha

Development of t1ha The first of such goals was to obtain a fast portable function of sufficiently high quality for constructing hash tables.

Then we wanted to have the fastest version of the hash function that would give a result of comparable quality but was adapted to the target platform as much as possible. For example, the basic t1ha version works with little-endian byte order, due to which a conversion is necessary for big-endian architectures with inevitable loss of performance. So why not get rid of unnecessary operations on a specific target platform? This way, several more options were added:

Simplified version for 32-bit platforms, both little and big-endian.
Variant using AES-NI instructions, but without AVX.
Two variants using the AES-NI and AVX instructions.

Later it became clear that more options designed for various applications would be needed, including different bit width results, quality and durability requirements. Such diversity required proper systematization. This was achieved by changing the naming scheme, in which the numeric suffix indicates the “level” of the function:

t1ha0() — is the fastest option for the current processor.
t1ha1() — is the basic portable 64-bit version of t1ha.
t1ha2() — is a portable 64-bit version with a little more concern for quality.
t1ha3() — is a fast portable 128-bit version for fingerprinting.
etc.

In this scheme, it is assumed that t1ha0() is a dispatcher that implements redirection depending on the platform and capabilities of the current processor. In addition, the use of the suffixes "_le" and "_be" for an explicit choice between the little-endian and big-endian variants can be introduced. Thus, under the “t1ha” signboard now there are several hash functions, and this family will grow I the future, including a version optimized for Russian E2K “Elbrus”.

A general idea of the current set of functions and their properties can be grasped by looking at the built-in test output (make check). It is worth noting that all functions pass all SMHasher tests, and the performance of the AES-NI variants varies greatly depending on the processor model:

Intel(R) Core(TM) i7-6700K CPU @ 3.00GHz
Build by GNU C/C++ compiler 8.2
[...]
 - use RDPMC_40000001 as clock source
 - measure granularity and overhead: 53 cycles, 0.0188679 iteration/cycle

Bench for tiny keys (7 bytes):
t1ha0                 :    13.14 cycle/hash,  1.877 cycle/byte,  1.598 Gb/s @3GHz
t1ha1_64le            :    15.14 cycle/hash,  2.163 cycle/byte,  1.387 Gb/s @3GHz
t1ha2_atonce          :    15.50 cycle/hash,  2.163 cycle/byte,  1.387 Gb/s @3GHz
t1ha1_64be            :    16.78 cycle/hash,  2.397 cycle/byte,  1.251 Gb/s @3GHz
xxhash32              :    17.17 cycle/hash,  2.453 cycle/byte,  1.223 Gb/s @3GHz
StadtX                :    17.59 cycle/hash,  2.513 cycle/byte,  1.194 Gb/s @3GHz
t1ha0_32le            :    18.28 cycle/hash,  2.612 cycle/byte,  1.149 Gb/s @3GHz
t1ha0_32be            :    20.24 cycle/hash,  2.892 cycle/byte,  1.037 Gb/s @3GHz
xxhash64              :    22.17 cycle/hash,  3.167 cycle/byte,  0.947 Gb/s @3GHz
t1ha2_atonce128*      :    29.93 cycle/hash,  4.277 cycle/byte,  0.701 Gb/s @3GHz
t1ha2_stream*         :    79.81 cycle/hash, 11.402 cycle/byte,  0.263 Gb/s @3GHz
HighwayHash64_avx2    :    83.75 cycle/hash, 11.964 cycle/byte,  0.251 Gb/s @3GHz
HighwayHash64_sse41   :    85.25 cycle/hash, 12.179 cycle/byte,  0.246 Gb/s @3GHz
t1ha2_stream128*      :    99.06 cycle/hash, 14.152 cycle/byte,  0.212 Gb/s @3GHz
HighwayHash64_portable:   480.75 cycle/hash, 68.679 cycle/byte,  0.044 Gb/s @3GHz
HighwayHash64_pure_c  :   652.58 cycle/hash, 93.226 cycle/byte,  0.032 Gb/s @3GHz

Bench for large keys (16384 bytes):
t1ha0                 :  1185.00 cycle/hash,  0.072 cycle/byte, 41.478 Gb/s @3GHz
t1ha2_atonce          :  3436.00 cycle/hash,  0.210 cycle/byte, 14.305 Gb/s @3GHz
t1ha2_atonce128*      :  3440.00 cycle/hash,  0.210 cycle/byte, 14.288 Gb/s @3GHz
t1ha1_64le            :  3449.00 cycle/hash,  0.211 cycle/byte, 14.251 Gb/s @3GHz
t1ha2_stream*         :  3479.00 cycle/hash,  0.212 cycle/byte, 14.128 Gb/s @3GHz
t1ha2_stream128*      :  3508.00 cycle/hash,  0.214 cycle/byte, 14.011 Gb/s @3GHz
StadtX                :  3550.00 cycle/hash,  0.217 cycle/byte, 13.846 Gb/s @3GHz
xxhash64              :  4121.00 cycle/hash,  0.252 cycle/byte, 11.927 Gb/s @3GHz
t1ha1_64be            :  4567.00 cycle/hash,  0.279 cycle/byte, 10.762 Gb/s @3GHz
HighwayHash64_avx2    :  4580.00 cycle/hash,  0.280 cycle/byte, 10.732 Gb/s @3GHz
HighwayHash64_sse41   :  6412.00 cycle/hash,  0.391 cycle/byte,  7.666 Gb/s @3GHz
t1ha0_32le            :  7191.00 cycle/hash,  0.439 cycle/byte,  6.835 Gb/s @3GHz
t1ha0_32be            :  7928.00 cycle/hash,  0.484 cycle/byte,  6.200 Gb/s @3GHz
xxhash32              :  8197.00 cycle/hash,  0.500 cycle/byte,  5.996 Gb/s @3GHz
HighwayHash64_portable: 41895.27 cycle/hash,  2.557 cycle/byte,  1.173 Gb/s @3GHz
HighwayHash64_pure_c  : 53296.11 cycle/hash,  3.253 cycle/byte,  0.922 Gb/s @3GHz

A little bit about the internal structure

To delve into a little more detail, t1ha is built according to the Merkle-Damgård scheme (“wipe-pipe” version) with strengthening from the data size and the seed value. Inside the main compression loop, a 256-bit state is used, with the same size of the input block. Moreover, for each data operand there are two injection points with cross pollination. Upon completion of the compression cycle, the 256-bit state is compressed to 128 bits.

When performing the above actions, 64-bit operations, combined into mixers ARX (Add-Rotate-Xor) and MUX/MRX (Mul-Rotate-Xor), are used. It is important that all these calculations are built in such a way as to ensure the possibility of parallel execution of most operations and tight packing of u-ops both into the pipeline and into x86_64 execution units. Due to this, a sufficiently good quality is achieved with almost maximum hash rate for long keys.

It is worth noting that the compression loop runs only for blocks of sufficient size. If there is less data, then the intermediate 128-bit state will consist only of the key size and the salt value.

Then, the remaining tail of the data is mixed in portions of 64 bits alternately to the halves of the 128-bit state. Finally, the state is mixed and simultaneously compressed to a 64-bit result. An important feature of t1ha here is the use of a mixer based on wide multiplication (128-bit product of two 64-bit multipliers). This allows for good quality mixing with a good avalanche effect and fewer operations. Even though wide multiplication is a relatively expensive operation, fewer such operations allow t1ha to process short keys in a record-low number of processor cycles.

It should be noted that the mixer based on wide multiplication and exclusive OR is not perfect. Although t1ha passes all SMHasher, tests, the author understands the consequences of non-injectivity). Nevertheless, the resulting quality seems to be rationally sufficient, and the development plans for the t1ha line already reflect the intention to provide a slightly higher quality variants.

You can find more information and source code here.

Thank you for reading!