November 10, 2020 was in many ways a landmark date for the microprocessor industry: Apple unveiled its new Mac Mini, whose main feature was the new M1 chip, developed in-house. It is no exaggeration to say that this processor is a landmark achievement for the ARM ecosystem: finally, an ARM architecture chip whose performance surpassed that of x86 chips from competitors such as Intel, in a market segment that x86 had dominated for decades.
But our main interest here is not the M1 processor itself, but the Rosetta 2 binary translation technology, which allows the user to run legacy x86 software that has not been migrated to the ARM architecture. Apple has a lot of experience in developing binary translation solutions and is a recognized leader in this area. The first version of the Rosetta binary translator appeared in 2006, when it aided Apple in the transition from the PowerPC to the x86 architecture. Although the platforms this time were different from those of 2006, it was obvious that the experience Apple engineers had accumulated over the years was not lost, but was used to develop the next version, Rosetta 2.
We were keen to compare this new solution from Apple with a similar product, Huawei ExaGear (with its lineage from Eltechs ExaGear), developed by our team. At the same time, we evaluated the performance of the x86 to Arm binary translation provided by Microsoft (part of MS Windows 10 for Arm devices) on the Huawei MateBook E laptop. At present, these are the only other x86 to Arm binary translation solutions that we are aware of on the open market.
Since the solutions were originally created for different operating systems (Huawei ExaGear for Linux, Apple Rosetta 2 for MacOS, and the Microsoft binary translator for MS Windows), it was impossible to run them under the same conditions, so we had to find an appropriate comparison method. We chose a “translation efficiency” metric: the ratio between the performance of a benchmark built for the guest architecture and executed under the binary translator, and the performance of the same benchmark built natively. For our tests the native target platform is Arm and the guest platform is x86. In other words, we measured what percentage of the native result the benchmark achieves in binary translation mode. For Geekbench, in which a higher score is better, this ratio is the translated score divided by the native score. For SpecCPU, in which a lower execution time is better, it is the native execution time divided by the translated execution time. The same metric was used by the Anandtech website in its review of the Apple M1, which also gave performance figures for Rosetta 2.
All measurements are for a single thread, since we are primarily interested in the performance of the translated code, that is, in what percentage of performance is lost compared to native code.
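As a minimal sketch of the metric described above, the following Python snippet computes translation efficiency for both kinds of benchmark. The function names are our own, chosen purely for illustration; the sample figures are the Rosetta 2 results for AES-XTS (Geekbench) and 400.perlbench (SpecCPU2006) from the tables below.

```python
def efficiency_from_scores(native_score: float, translated_score: float) -> float:
    """Efficiency in percent when a higher score is better (e.g. Geekbench)."""
    return 100.0 * translated_score / native_score


def efficiency_from_times(native_seconds: float, translated_seconds: float) -> float:
    """Efficiency in percent when a lower time is better (e.g. SpecCPU)."""
    return 100.0 * native_seconds / translated_seconds


# Geekbench AES-XTS on the M1: native 2769 points, Rosetta 2 1720 points.
print(f"{efficiency_from_scores(2769, 1720):.1f}%")  # 62.1%

# SpecCPU2006 400.perlbench on the M1: native 157 s, Rosetta 2 218 s.
print(f"{efficiency_from_times(157, 218):.1f}%")  # 72.0%
```

With this definition, a value above 100% (as for 456.hmmer under ExaGear) means that the translated binary ran faster than the native build.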
Huawei ExaGear Vs Apple Rosetta 2
Let us start by comparing Huawei ExaGear with Apple Rosetta 2 on an Apple MacBook Pro (M1). The Rosetta 2 tests ran natively under MacOS Big Sur 11.1; the ExaGear tests ran in a virtual machine with Linux kernel 5.4.1.
Geekbench 5.4.1 (points, higher is better):
Bench name | ARM64 MacOS | Rosetta 2 (x86) | Rosetta 2 efficiency | ARM64 Linux(VM) | ExaGear (x86) | ExaGear efficiency |
AES-XTS | 2769 | 1720 | 62.1% | 2703 | 1823 | 67.4% |
Text Compression | 1502 | 1319 | 87.8% | 1438 | 1349 | 93.8% |
Image Compression | 1356 | 1056 | 77.9% | 1335 | 1230 | 92.1% |
Navigation | 1716 | 1678 | 97.8% | 1717 | 1605 | 93.5% |
HTML5 | 1642 | 1066 | 64.9% | 1717 | 1302 | 75.8% |
SQLite | 1400 | 1000 | 71.4% | 1328 | 1229 | 92.6% |
PDF Rendering | 1600 | 1324 | 82.8% | 1738 | 1464 | 84.2% |
Text Rendering | 1778 | 1290 | 72.6% | 1726 | 1480 | 85.8% |
Clang | 1661 | 1244 | 74.9% | 1738 | 1335 | 76.8% |
Camera | 1612 | 1278 | 79.3% | 1464 | 1148 | 78.4% |
N-Body Physics | 1789 | 1503 | 84.0% | 1830 | 1616 | 88.3% |
Rigid Body Physics | 1772 | 1352 | 76.3% | 1690 | 1176 | 69.6% |
Gaussian Blur | 1407 | 1251 | 88.9% | 1435 | 1297 | 90.4% |
Face Detection | 2215 | 1500 | 67.7% | 2216 | 1632 | 73.7% |
Horizon Detection | 1964 | 1268 | 64.6% | 1879 | 1386 | 73.8% |
Image Inpainting | 3214 | 2893 | 90.0% | 3345 | 2903 | 86.8% |
HDR | 2486 | 2250 | 90.5% | 2761 | 2690 | 97.4% |
Ray Tracing | 2553 | 1970 | 77.2% | 2055 | 1992 | 96.9% |
Structure from Motion | 1406 | 1068 | 76.0% | 1507 | 1150 | 76.3% |
Speech Recognition | 1601 | 1371 | 85.6% | 1485 | 1355 | 91.3% |
Machine Learning | 1243 | 713 | 57.4% | 1229 | 727 | 59.2% |
GeoMean | | | 76.9% | | | 82.4% |
As we can see, Huawei ExaGear has the better average efficiency, 82.4% against 76.9% for Rosetta 2, and ExaGear loses in only 4 tests out of 21: Navigation, Camera, Rigid Body Physics, and Image Inpainting.
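The per-test comparison can be reproduced directly from the efficiency columns of the table above. The sketch below is illustrative only (the variable names are our own); it lists the tests in which Rosetta 2 achieves the higher efficiency.

```python
# (Rosetta 2 efficiency %, ExaGear efficiency %) per Geekbench subtest,
# copied from the table above.
efficiency = {
    "AES-XTS": (62.1, 67.4),
    "Text Compression": (87.8, 93.8),
    "Image Compression": (77.9, 92.1),
    "Navigation": (97.8, 93.5),
    "HTML5": (64.9, 75.8),
    "SQLite": (71.4, 92.6),
    "PDF Rendering": (82.8, 84.2),
    "Text Rendering": (72.6, 85.8),
    "Clang": (74.9, 76.8),
    "Camera": (79.3, 78.4),
    "N-Body Physics": (84.0, 88.3),
    "Rigid Body Physics": (76.3, 69.6),
    "Gaussian Blur": (88.9, 90.4),
    "Face Detection": (67.7, 73.7),
    "Horizon Detection": (64.6, 73.8),
    "Image Inpainting": (90.0, 86.8),
    "HDR": (90.5, 97.4),
    "Ray Tracing": (77.2, 96.9),
    "Structure from Motion": (76.0, 76.3),
    "Speech Recognition": (85.6, 91.3),
    "Machine Learning": (57.4, 59.2),
}

# Keep the subtests where Rosetta 2 is strictly more efficient than ExaGear.
rosetta_wins = [
    name
    for name, (rosetta, exagear) in efficiency.items()
    if rosetta > exagear
]

print(f"Rosetta 2 wins {len(rosetta_wins)} of {len(efficiency)} tests:")
for name in rosetta_wins:
    print(" ", name)
```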
Here we need to make a small but quite interesting digression. The M1 processor was officially released only in November 2020, which means that its internal design and instruction set were finalized long before, likely in 2019. Despite this, a closer examination reveals that it contains support for the Armv8.7 architecture extension, which was not published until autumn 2020. A significant portion of this extension aims to simplify and improve the performance of some operations common to binary translation for an x86 guest architecture. That is, Apple was developing a processor with an extension that was not official at the time. Moreover, a close look at earlier extensions reveals that Armv8.5 and Armv8.4 also included operations to support binary translation. This suggests that Apple has been working in close cooperation with Arm for quite some time, pursuing hardware support for its binary translation solution. Apple Rosetta 2 leverages all these features and therefore has a definite advantage over Huawei ExaGear, which does not exploit these extensions.
We believe this is why Apple Rosetta 2 performs better in these four benchmarks. Nevertheless, Huawei ExaGear shows better results in the other tests, despite being at something of a disadvantage for not using these advanced architecture features.
It is also worth noting that the native Arm performance figures for MacOS and Linux are generally quite close, which supports the validity of our approach to comparing binary translators that run on different operating systems.
Next, let us compare performance on SpecCPU2006 and SpecCPU2017. The execution time of each subtest is measured in seconds. Benchmarks written in Fortran were excluded.
SpecCPU2006 (in seconds, lower is better)
Compiler: clang 11.0 -O3 -flto
INT tests | ARM64 MacOS | Rosetta 2 (x86) | Rosetta 2 efficiency | ARM64 Linux(VM) | ExaGear (x86) | ExaGear efficiency |
400.perlbench | 157 | 218 | 72.0% | 145 | 166 | 87.3% |
401.bzip2 | 248 | 326 | 76.1% | 247 | 282 | 87.6% |
429.mcf | 106 | 118 | 89.8% | 129 | 123 | 104.9% |
445.gobmk | 178 | 198 | 89.9% | 183 | 193 | 94.8% |
456.hmmer | 159 | 170 | 93.5% | 164 | 142 | 115.5% |
458.sjeng | 246 | 300 | 82.0% | 253 | 281 | 90.0% |
462.libquantum | 94 | 107 | 87.9% | 101 | 128 | 78.9% |
464.h264ref | 203 | 328 | 61.9% | 201 | 271 | 74.2% |
471.omnetpp | 146 | 179 | 81.6% | 173 | 193 | 89.6% |
473.astar | 184 | 203 | 90.6% | 196 | 201 | 97.5% |
483.xalancbmk | 82 | 104 | 78.8% | 96 | 112 | 85.7% |
GeoMean INT | | | 81.7% | | | 90.8% |
FP tests | ||||||
433.milc | 95 | 130 | 73.1% | 98 | 127 | 77.2% |
444.namd | 139 | 172 | 80.8% | 139 | 164 | 84.8% |
447.dealII | 108 | 117 | 92.3% | 115 | 151 | 76.2% |
450.soplex | 91 | 104 | 87.5% | 95 | 107 | 88.8% |
453.povray | 60 | 78 | 76.9% | 52 | 65 | 80.0% |
470.lbm | 113 | 119 | 95.0% | 114 | 99 | 115.2% |
482.sphinx3 | 194 | 207 | 93.7% | 195 | 215 | 90.7% |
GeoMean FP | | | 85.2% | | | 86.7% |
SpecCPU2017 (in seconds, lower is better)
Compiler: clang 11.0 -O3 -flto
INT tests | ARM64 MacOS | Rosetta 2 (x86) | Rosetta 2 efficiency | ARM64 Linux(VM) | ExaGear (x86) | ExaGear efficiency |
500.perlbench_r | 216 | 282 | 76.6% | 210 | 241 | 87.1% |
502.gcc_r | 125 | 164 | 76.2% | 125 | 153 | 81.7% |
505.mcf_r | 202 | 235 | 86.0% | 217 | 231 | 93.9% |
520.omnetpp_r | 277 | 360 | 76.9% | 295 | 324 | 91.0% |
523.xalancbmk_r | 166 | 204 | 81.4% | 189 | 197 | 95.9% |
525.x264_r | 154 | 176 | 87.5% | 161 | 189 | 85.2% |
531.deepsjeng_r | 175 | 221 | 79.2% | 190 | 197 | 96.4% |
541.leela_r | 282 | 294 | 95.9% | 285 | 277 | 102.9% |
557.xz_r | 275 | 319 | 86.2% | 307 | 335 | 91.6% |
GeoMean INT | | | 82.7% | | | 91.6% |
FP tests | ||||||
508.namd_r | 111 | 134 | 82.8% | 111 | 133 | 83.5% |
510.parest_r | 308 | 331 | 93.1% | 304 | 336 | 90.5% |
511.povray_r | 211 | 320 | 65.9% | 196 | 253 | 77.5% |
519.lbm_r | 121 | 153 | 79.1% | 127 | 142 | 89.4% |
526.blender_r | 149 | 179 | 83.2% | 163 | 175 | 93.1% |
538.imagick_r | 210 | 345 | 60.9% | 227 | 271 | 83.8% |
544.nab_r | 156 | 186 | 83.9% | 150 | 174 | 86.2% |
GeoMean FP | | | 77.7% | | | 86.1% |
As we can see, the results from SpecCPU2006 and SpecCPU2017 follow those of Geekbench: on average, Huawei ExaGear beats Apple Rosetta 2, although Rosetta 2 outperforms ExaGear in a few subtests (462.libquantum, 447.dealII, and 482.sphinx3 in SpecCPU2006; 525.x264_r and 510.parest_r in SpecCPU2017).
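The GeoMean rows aggregate the per-test efficiencies with a geometric mean, the customary way to average ratios. As a minimal sketch (the helper name is our own), the snippet below recomputes the Rosetta 2 figure for the SpecCPU2006 FP tests from the execution times in the table above; the result should land close to the 85.2% reported there, up to rounding of the published times.

```python
import math


def geometric_mean(values):
    """Geometric mean, computed via logarithms rather than a running product."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# (native seconds, Rosetta 2 seconds) for the SpecCPU2006 FP tests above.
spec2006_fp = {
    "433.milc": (95, 130),
    "444.namd": (139, 172),
    "447.dealII": (108, 117),
    "450.soplex": (91, 104),
    "453.povray": (60, 78),
    "470.lbm": (113, 119),
    "482.sphinx3": (194, 207),
}

# Lower time is better, so efficiency = native time / translated time.
ratios = [native / translated for native, translated in spec2006_fp.values()]
geomean_percent = 100.0 * geometric_mean(ratios)

print(f"GeoMean FP (Rosetta 2): {geomean_percent:.1f}%")
```

Unlike an arithmetic mean, the geometric mean is not dominated by a single unusually high or low ratio, which is why it is the usual choice for summarizing benchmark suites.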
Huawei ExaGear Vs Microsoft binary translator
Next, we compare the performance of Huawei ExaGear against the binary translator from Microsoft on a Huawei MateBook E.
ExaGear runs in a WSL environment, and Microsoft’s translator runs natively under Windows 10.
When running within WSL, ExaGear cannot access the timers it requires for profiling, which prevented it from applying full code optimization. We estimate that this reduces ExaGear performance by around 10-20%, depending on the benchmark.
Geekbench 5.4.1 (points, higher is better):
Bench name | ARM64 Windows | MS BT (x86) | MS BT efficiency | ARM64 WSL | ExaGear (x86) | ExaGear efficiency |
AES-XTS | 872 | 437 | 50.1% | 892 | 437 | 49.0% |
Text Compression | 514 | 381 | 74.1% | 518 | 451 | 87.1% |
Image Compression | 577 | 328 | 56.8% | 606 | 433 | 71.5% |
Navigation | 402 | 393 | 97.8% | 522 | 502 | 96.2% |
HTML5 | 517 | 224 | 43.3% | 522 | 371 | 71.1% |
SQLite | 534 | 240 | 44.9% | 565 | 412 | 72.9% |
PDF Rendering | 515 | 253 | 49.1% | 574 | 462 | 80.5% |
Text Rendering | 530 | 264 | 49.8% | 544 | 430 | 79.0% |
Clang | 485 | 187 | 38.6% | 601 | 405 | 67.4% |
Camera | 437 | 221 | 50.6% | 479 | 308 | 64.3% |
N-Body Physics | 390 | 253 | 64.9% | 390 | 317 | 81.3% |
Rigid Body Physics | 634 | 299 | 47.2% | 717 | 485 | 67.6% |
Gaussian Blur | 279 | 217 | 77.8% | 282 | 238 | 84.4% |
Face Detection | 598 | 301 | 50.3% | 657 | 387 | 58.9% |
Horizon Detection | 484 | 235 | 48.6% | 542 | 331 | 61.1% |
Image Inpainting | 580 | 419 | 72.2% | 775 | 504 | 65.0% |
HDR | 637 | 481 | 75.5% | 1035 | 843 | 81.4% |
Ray Tracing | 758 | 296 | 39.1% | 762 | 513 | 67.3% |
Structure from Motion | 379 | 203 | 53.6% | 457 | 289 | 63.2% |
Speech Recognition | 346 | 283 | 81.8% | 355 | 312 | 87.9% |
Machine Learning | 252 | 108 | 42.9% | 240 | 128 | 53.3% |
GeoMean | | | 55.6% | | | 70.9% |
SpecCPU2017 (in seconds, lower is better)
Compiler: clang 11.0 -O3 -flto
INT tests | ARM64 Windows | MS BT (x86) | MS BT efficiency | ARM64 WSL | ExaGear (x86) | ExaGear efficiency |
500.perlbench_r | 824 | 1526 | 54.0% | 800 | 1042 | 76.8% |
502.gcc_r | 716 | 1119 | 64.0% | 690 | 820 | 84.1% |
505.mcf_r | 781 | 1069 | 73.1% | 792 | 992 | 79.8% |
520.omnetpp_r | 1280 | 2025 | 63.2% | 1186 | 1414 | 83.9% |
525.x264_r | 460 | 771 | 59.7% | 481 | 638 | 75.4% |
531.deepsjeng_r | 507 | 711 | 71.3% | 524 | 629 | 83.3% |
541.leela_r | 564 | 897 | 62.9% | 547 | 652 | 83.9% |
557.xz_r | 733 | 938 | 78.1% | 733 | 827 | 88.6% |
GeoMean INT | | | 65.4% | | | 81.9% |
FP tests | ||||||
508.namd_r | 423 | 698 | 60.6% | 429 | 619 | 69.3% |
510.parest_r | 1006 | 1274 | 79.0% | 970 | 1179 | 82.3% |
511.povray_r | 808 | 1540 | 52.5% | 761 | 1027 | 74.1% |
519.lbm_r | 428 | 767 | 55.8% | 429 | 549 | 78.1% |
526.blender_r | 474 | 688 | 68.9% | 484 | 716 | 67.6% |
538.imagick_r | 560 | 1075 | 52.1% | 566 | 847 | 66.8% |
544.nab_r | 522 | 1050 | 49.7% | 517 | 617 | 83.8% |
GeoMean FP | | | 59.0% | | | 74.3% |
Unfortunately, we were unable to build SpecCPU2006 for MS Windows. SpecCPU2017, on the other hand, built successfully, and we were able to collect results. As we saw in the comparison with Rosetta 2, SpecCPU2006 and SpecCPU2017 give broadly similar results.
Conclusion
Apple’s engineers have not only produced an outstanding, game-changing processor, but have also equipped it with a high-performance x86 binary translator. Nevertheless, our tests show that Huawei ExaGear outperforms Apple’s Rosetta 2 on average in every benchmark suite we ran.
Microsoft’s solution is shown to be inferior to ExaGear and, although we cannot compare them directly, also to Rosetta 2. This is to be expected, given Microsoft’s relative lack of experience in developing binary translators.