Armmaster, September 10, 2021, 12:30

Comparing Huawei ExaGear to Apple's Rosetta 2 and Microsoft's solution

Huawei company blog, Compilers, Processors

Translation

November 10, 2020 was in many ways a landmark date for the microprocessor industry: Apple unveiled its new Mac Mini, the main feature of which was the new M1 chip, developed in-house. It is no exaggeration to say that this processor is a landmark achievement for the ARM ecosystem: at last, an ARM architecture chip outperformed x86 architecture chips from competitors such as Intel, in a niche that x86 had dominated for decades.

But our main interest is not the M1 processor itself but the Rosetta 2 binary translation technology, which lets the user run legacy x86 software that has not been migrated to the ARM architecture. Apple has a lot of experience in developing binary translation solutions and is a recognized leader in this area. The first version of the Rosetta binary translator appeared in 2006, when it aided Apple in the transition from the PowerPC to the x86 architecture. Although the platforms this time were different from those of 2006, it was obvious that the experience Apple engineers had accumulated over the years was not lost but was used to develop the next version, Rosetta 2.

We were keen to compare this new solution from Apple with a similar product developed by our team, Huawei ExaGear (which traces its lineage to Eltechs ExaGear). We also evaluated the performance of the x86-to-Arm binary translation provided by Microsoft (part of MS Windows 10 for Arm devices) on the Huawei MateBook E laptop. At present, these are the only other x86-to-Arm binary translation solutions on the open market that we are aware of.

Since the solutions were originally created for different operating systems (Huawei ExaGear for Linux, Apple Rosetta 2 for MacOS, and the Microsoft binary translator for MS Windows), it was impossible to run them under the same conditions, and we had to find an appropriate comparison method. We chose a “translation efficiency” metric: the performance of the guest-architecture build of a benchmark executed under the binary translator, expressed as a fraction of the performance of the native build of the same benchmark. For our tests the native target platform is Arm and the guest platform is x86. In other words, we measured what percentage of the native result the benchmark achieves in binary translation mode. For Geekbench, in which a higher score is better, this ratio is the translated score divided by the native score. For SpecCPU, in which a lower execution time is better, this ratio is the native execution time divided by the translated execution time. The same metric was used by the experts at the AnandTech website, who published a review article about the Apple M1 that also gave performance figures for Rosetta 2.

All measurements are single-threaded, because we are primarily interested in the performance of the translated code itself, that is, in what percentage of performance is lost compared to native code.
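As a minimal sketch of the metric just described, the two cases can be written as a pair of small Python functions. The function names are our own illustration and not part of any benchmark tooling; the sample values are taken from the AES-XTS row of the first Geekbench table and the 400.perlbench row of the SpecCPU2006 table further down.

```python
def efficiency_from_scores(native_score, translated_score):
    """Translation efficiency when a higher score is better (Geekbench)."""
    return translated_score / native_score


def efficiency_from_times(native_seconds, translated_seconds):
    """Translation efficiency when a lower time is better (SpecCPU)."""
    return native_seconds / translated_seconds


# Geekbench AES-XTS: native ARM64 MacOS 2769 points, Rosetta 2 1720 points.
print(f"{efficiency_from_scores(2769, 1720):.1%}")  # prints 62.1%

# SpecCPU2006 400.perlbench: native ARM64 Linux 145 s, ExaGear 166 s.
print(f"{efficiency_from_times(145, 166):.1%}")  # prints 87.3%
```

In both cases 100% means no loss relative to native code, and a value above 100% means the translated binary ran faster than the native build.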

Huawei ExaGear Vs Apple Rosetta 2

So, let us start by comparing Huawei ExaGear with Apple Rosetta 2 on an Apple MacBook Pro (M1). The Rosetta 2 tests ran natively under MacOS Big Sur 11.1; the ExaGear tests ran in a virtual machine with Linux kernel 5.4.1.

Geekbench 5.4.1 (points, higher is better):

Bench name	ARM64 MacOS	Rosetta 2 (x86)	Rosetta 2 efficiency	ARM64 Linux(VM)	ExaGear (x86)	ExaGear efficiency
AES-XTS	2769	1720	62.1%	2703	1823	67.4%
Text Compression	1502	1319	87.8%	1438	1349	93.8%
Image Compression	1356	1056	77.9%	1335	1230	92.1%
Navigation	1716	1678	97.8%	1717	1605	93.5%
HTML5	1642	1066	64.9%	1717	1302	75.8%
SQLite	1400	1000	71.4%	1328	1229	92.6%
PDF Rendering	1600	1324	82.8%	1738	1464	84.2%
Text Rendering	1778	1290	72.6%	1726	1480	85.8%
Clang	1661	1244	74.9%	1738	1335	76.8%
Camera	1612	1278	79.3%	1464	1148	78.4%
N-Body Physics	1789	1503	84.0%	1830	1616	88.3%
Rigid Body Physics	1772	1352	76.3%	1690	1176	69.6%
Gaussian Blur	1407	1251	88.9%	1435	1297	90.4%
Face Detection	2215	1500	67.7%	2216	1632	73.7%
Horizon Detection	1964	1268	64.6%	1879	1386	73.8%
Image Inpainting	3214	2893	90.0%	3345	2903	86.8%
HDR	2486	2250	90.5%	2761	2690	97.4%
Ray Tracing	2553	1970	77.2%	2055	1992	96.9%
Structure from Motion	1406	1068	76.0%	1507	1150	76.3%
Speech Recognition	1601	1371	85.6%	1485	1355	91.3%
Machine Learning	1243	713	57.4%	1229	727	59.2%
GeoMean			76.9%			82.4%

As we can see, Huawei ExaGear has a better average translation efficiency, 82.4% versus 76.9% for Rosetta 2, and ExaGear loses in only 4 tests out of 21: Navigation, Camera, Rigid Body Physics, and Image Inpainting.
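The count of tests in which Rosetta 2 comes out ahead can be rechecked from the raw Geekbench scores in the table above. The following is only an illustrative sketch: the `GEEKBENCH` list and the `rosetta_wins` helper are names we made up for this example, and the numbers are copied from the table.

```python
# (benchmark, ARM64 MacOS, Rosetta 2, ARM64 Linux VM, ExaGear), in points.
GEEKBENCH = [
    ("AES-XTS", 2769, 1720, 2703, 1823),
    ("Text Compression", 1502, 1319, 1438, 1349),
    ("Image Compression", 1356, 1056, 1335, 1230),
    ("Navigation", 1716, 1678, 1717, 1605),
    ("HTML5", 1642, 1066, 1717, 1302),
    ("SQLite", 1400, 1000, 1328, 1229),
    ("PDF Rendering", 1600, 1324, 1738, 1464),
    ("Text Rendering", 1778, 1290, 1726, 1480),
    ("Clang", 1661, 1244, 1738, 1335),
    ("Camera", 1612, 1278, 1464, 1148),
    ("N-Body Physics", 1789, 1503, 1830, 1616),
    ("Rigid Body Physics", 1772, 1352, 1690, 1176),
    ("Gaussian Blur", 1407, 1251, 1435, 1297),
    ("Face Detection", 2215, 1500, 2216, 1632),
    ("Horizon Detection", 1964, 1268, 1879, 1386),
    ("Image Inpainting", 3214, 2893, 3345, 2903),
    ("HDR", 2486, 2250, 2761, 2690),
    ("Ray Tracing", 2553, 1970, 2055, 1992),
    ("Structure from Motion", 1406, 1068, 1507, 1150),
    ("Speech Recognition", 1601, 1371, 1485, 1355),
    ("Machine Learning", 1243, 713, 1229, 727),
]


def rosetta_wins(rows):
    """Return the benchmarks where Rosetta 2 efficiency exceeds ExaGear's."""
    winners = []
    for name, mac_native, rosetta, linux_native, exagear in rows:
        # Higher score is better, so efficiency = translated / native.
        if rosetta / mac_native > exagear / linux_native:
            winners.append(name)
    return winners


winners = rosetta_wins(GEEKBENCH)
print(f"{len(winners)} of {len(GEEKBENCH)}: {winners}")
```

Each translator is compared against the native score from its own operating system, which is why the two native columns are kept separate.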

Here we need to make a small but quite interesting digression. Although the M1 processor was officially released only in November 2020, its internal design and instruction set must have been finalized long before, likely in 2019. Despite this, it contains support for the Armv8.7 architecture extension, which was not published until autumn 2020. A significant portion of this extension aims to simplify and speed up some operations common in binary translation for an x86 guest architecture. That is, Apple was developing a processor with an extension that was not yet official at the time. Moreover, a close look at earlier extensions reveals that Armv8.5 and Armv8.4 also included operations to support binary translation. This suggests that Apple has been working in close cooperation with Arm for quite some time, pursuing hardware support for its binary translation solution. Apple Rosetta 2 leverages all these features and therefore has a definite advantage over Huawei ExaGear, which does not exploit these extensions.

We believe this is why Apple Rosetta 2 performs better in these four benchmarks. Nevertheless, Huawei ExaGear shows better results in the other tests, despite being at something of a disadvantage by not using these advanced architecture features.

It is also worth noting that the performance figures in native Arm-mode for MacOS and Linux are generally quite close, which confirms the general correctness of our approach of comparing the performance of binary translators.

Next, let us compare performance on SpecCPU2006 and SpecCPU2017. The execution time of each subtest is measured in seconds. Benchmarks written in Fortran were excluded.

SpecCPU2006 (in seconds, lower is better)

Compiler: clang 11.0 -O3 -flto

INT tests	ARM64 MacOS	Rosetta 2 (x86)	Rosetta 2 efficiency	ARM64 Linux(VM)	ExaGear (x86)	ExaGear efficiency
400.perlbench	157	218	72.0%	145	166	87.3%
401.bzip2	248	326	76.1%	247	282	87.6%
429.mcf	106	118	89.8%	129	123	104.9%
445.gobmk	178	198	89.9%	183	193	94.8%
456.hmmer	159	170	93.5%	164	142	115.5%
458.sjeng	246	300	82.0%	253	281	90.0%
462.libquantum	94	107	87.9%	101	128	78.9%
464.h264ref	203	328	61.9%	201	271	74.2%
471.omnetpp	146	179	81.6%	173	193	89.6%
473.astar	184	203	90.6%	196	201	97.5%
483.xalancbmk	82	104	78.8%	96	112	85.7%
GeoMean INT			81.7%			90.8%

FP tests
433.milc	95	130	73.1%	98	127	77.2%
444.namd	139	172	80.8%	139	164	84.8%
447.dealII	108	117	92.3%	115	151	76.2%
450.soplex	91	104	87.5%	95	107	88.8%
453.povray	60	78	76.9%	52	65	80.0%
470.lbm	113	119	95.0%	114	99	115.2%
482.sphinx3	194	207	93.7%	195	215	90.7%
GeoMean FP			85.2%			86.7%
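The GeoMean rows can be reproduced from the per-test times. The sketch below assumes that each GeoMean value is the geometric mean of the per-test efficiencies (native time divided by translated time) computed from unrounded ratios; the `geomean` helper and the `FP_2006` table are illustrative names, with the numbers copied from the SpecCPU2006 FP rows above.

```python
import math


def geomean(values):
    """Geometric mean of a sequence of positive numbers."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# test: (ARM64 MacOS, Rosetta 2, ARM64 Linux VM, ExaGear), in seconds.
FP_2006 = {
    "433.milc": (95, 130, 98, 127),
    "444.namd": (139, 172, 139, 164),
    "447.dealII": (108, 117, 115, 151),
    "450.soplex": (91, 104, 95, 107),
    "453.povray": (60, 78, 52, 65),
    "470.lbm": (113, 119, 114, 99),
    "482.sphinx3": (194, 207, 195, 215),
}

# Lower time is better, so efficiency = native time / translated time.
rosetta = geomean(mac / ros for mac, ros, _, _ in FP_2006.values())
exagear = geomean(lin / exa for _, _, lin, exa in FP_2006.values())

print(f"Rosetta 2 GeoMean FP: {rosetta:.1%}")
print(f"ExaGear GeoMean FP:   {exagear:.1%}")
```

Under that assumption the two values should agree with the GeoMean FP row to within rounding. A geometric mean is the usual choice for averaging ratios, since no single benchmark with long run times dominates the result.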

SpecCPU2017 (in seconds, lower is better)

Compiler: clang 11.0 -O3 -flto

INT tests	ARM64 MacOS	Rosetta 2 (x86)	Rosetta 2 efficiency	ARM64 Linux(VM)	ExaGear (x86)	ExaGear efficiency
500.perlbench_r	216	282	76.6%	210	241	87.1%
502.gcc_r	125	164	76.2%	125	153	81.7%
505.mcf_r	202	235	86.0%	217	231	93.9%
520.omnetpp_r	277	360	76.9%	295	324	91.0%
523.xalancbmk_r	166	204	81.4%	189	197	95.9%
525.x264_r	154	176	87.5%	161	189	85.2%
531.deepsjeng_r	175	221	79.2%	190	197	96.4%
541.leela_r	282	294	95.9%	285	277	102.9%
557.xz_r	275	319	86.2%	307	335	91.6%
GeoMean INT			82.7%			91.6%

FP tests
508.namd_r	111	134	82.8%	111	133	83.5%
510.parest_r	308	331	93.1%	304	336	90.5%
511.povray_r	211	320	65.9%	196	253	77.5%
519.lbm_r	121	153	79.1%	127	142	89.4%
526.blender_r	149	179	83.2%	163	175	93.1%
538.imagick_r	210	345	60.9%	227	271	83.8%
544.nab_r	156	186	83.9%	150	174	86.2%
GeoMean FP			77.7%			86.1%

As we can see, the results from SpecCPU2006 and SpecCPU2017 follow those from Geekbench: on average, Huawei ExaGear beats Apple Rosetta 2, although in a few subtests Rosetta 2 outperforms ExaGear.

Huawei ExaGear Vs Microsoft binary translator

Next, we will compare the performance of Huawei ExaGear against the binary translator from Microsoft using a Huawei MateBook E.

ExaGear runs in a WSL environment, and Microsoft’s translator runs natively under Windows 10.

Running ExaGear within a WSL environment has one drawback: the timers we require for profiling are not available. This prevented ExaGear from applying its full code optimization. We estimate that this reduces performance by around 10-20%, depending on the benchmark.

Geekbench 5.4.1 (points, higher is better):

Bench name	ARM64 Windows	MS BT (x86)	MS BT efficiency	ARM64 WSL	ExaGear (x86)	ExaGear efficiency
AES-XTS	872	437	50.1%	892	437	49.0%
Text Compression	514	381	74.1%	518	451	87.1%
Image Compression	577	328	56.8%	606	433	71.5%
Navigation	402	393	97.8%	522	502	96.2%
HTML5	517	224	43.3%	522	371	71.1%
SQLite	534	240	44.9%	565	412	72.9%
PDF Rendering	515	253	49.1%	574	462	80.5%
Text Rendering	530	264	49.8%	544	430	79.0%
Clang	485	187	38.6%	601	405	67.4%
Camera	437	221	50.6%	479	308	64.3%
N-Body Physics	390	253	64.9%	390	317	81.3%
Rigid Body Physics	634	299	47.2%	717	485	67.6%
Gaussian Blur	279	217	77.8%	282	238	84.4%
Face Detection	598	301	50.3%	657	387	58.9%
Horizon Detection	484	235	48.6%	542	331	61.1%
Image Inpainting	580	419	72.2%	775	504	65.0%
HDR	637	481	75.5%	1035	843	81.4%
Ray Tracing	758	296	39.1%	762	513	67.3%
Structure from Motion	379	203	53.6%	457	289	63.2%
Speech Recognition	346	283	81.8%	355	312	87.9%
Machine Learning	252	108	42.9%	240	128	53.3%
GeoMean			55.6%			70.9%

SpecCPU2017 (in seconds, lower is better)

Compiler: clang 11.0 -O3 -flto

INT tests	ARM64 Windows	MS BT (x86)	MS BT efficiency	ARM64 WSL	ExaGear (x86)	ExaGear efficiency
500.perlbench_r	824	1526	54.0%	800	1042	76.8%
502.gcc_r	716	1119	64.0%	690	820	84.1%
505.mcf_r	781	1069	73.1%	792	992	79.8%
520.omnetpp_r	1280	2025	63.2%	1186	1414	83.9%
525.x264_r	460	771	59.7%	481	638	75.4%
531.deepsjeng_r	507	711	71.3%	524	629	83.3%
541.leela_r	564	897	62.9%	547	652	83.9%
557.xz_r	733	938	78.1%	733	827	88.6%
GeoMean INT			65.4%			81.9%

FP tests
508.namd_r	423	698	60.6%	429	619	69.3%
510.parest_r	1006	1274	79.0%	970	1179	82.3%
511.povray_r	808	1540	52.5%	761	1027	74.1%
519.lbm_r	428	767	55.8%	429	549	78.1%
526.blender_r	474	688	68.9%	484	716	67.6%
538.imagick_r	560	1075	52.1%	566	847	66.8%
544.nab_r	522	1050	49.7%	517	617	83.8%
GeoMean FP			59.0%			74.3%

Unfortunately, we were unable to build SpecCPU2006 for MS Windows. SpecCPU2017, on the other hand, built successfully, and we were able to collect results. As the comparison with Rosetta 2 showed, the results of SpecCPU2006 and SpecCPU2017 are broadly similar, so we do not expect the missing suite to change the overall picture.

Conclusion

Apple’s engineers have not only produced an outstanding, game-changing processor but have also equipped it with a high-performance x86 binary translator. Nevertheless, our tests show that Huawei ExaGear confidently outperforms Apple’s Rosetta 2 in translation efficiency.

Microsoft’s solution proved inferior to ExaGear and, although we cannot compare them directly, also to Rosetta 2. This is to be expected, given Microsoft’s lack of expertise and experience in developing binary translators.
