About six months ago I got acquainted with the Elbrus-8SV VLIW processor. By then I already had experience writing assembly code for another VLIW processor, the TMS320C66, so I wanted to write something similar for Elbrus: namely, to implement the FFT algorithm in assembly. However, because documentation on the processor's instructions is scarce, I had to start by implementing a simple algorithm in C in order to study the compiler's assembly output. The previous article was written based on that work.

After that article I decided to try implementing the FFT algorithm in C for Elbrus. The work is not finished yet, but there is already some progress (a comparison with EML is included). In this article I want to share the results obtained so far.

Contents:

Writing the Reverse function
Writing the Stage function
stage_radix2
stage_radix2_2x
stage_radix2_readConjSwap
stage_radix2_readConjSwap_2x
stage_radix4
stage_radix4_2x
stage_radix4_readConjSwap
stage_radix4_readConjSwap_2x
Assembling the FFT

Problem statement

Given: a pointer to an input array of complex numbers, the number of elements in it, and a pointer to an output array.
Required: compute the FFT of the input data and write the result to the output array.

We assume that the complex numbers have type float complex (that is, the real and imaginary parts are of type float).
We assume that the number of array elements is a power of 2 (or of 4, where needed).
And, of course, the arrays are assumed to be aligned in memory on boundaries convenient for us.

Compiler flags

The following lcc flags were used for compilation:

-Wall -O3 -faligned -ffast-math -march=elbrus-v5

Why I compile for elbrus-v5

After the previous article was written, I lost access to the elbrus-v5 machine (it was sent for repair). At first I worked out algorithms on paper, then I was given access to an elbrus-v6. I was planning to target elbrus-v5, so I added the -march=elbrus-v5 flag to the build script. Later access to v5 was restored, but by then I was used to working on v6, because it was available around the clock.

Before writing this article I decided to remove the -march=elbrus-v5 flag. After recompilation some of the code ran slower than with the flag. It turned out that this code compiles more efficiently for v5 than for v6 (the instructions are packed more densely). As a result, code compiled for v5 runs faster on v6 than the same code compiled for v6.

This behavior can be observed in the following functions:

So for now I have settled on compiling for v5.

The differences in compilation can be seen on the site ce.mentality.rip:

  1. Paste one of the functions listed above into the left pane

  2. Add the following code before the pasted function:

    #include <e2kintrin.h>
    #include <stdint.h>
    
    typedef struct
    {
    	float real;
    	float imag;
    } myComplex;

  3. In the "Compiler options" field, specify the compiler flags:

    -Wall -O3 -faligned -ffast-math -march=elbrus-v5

After that, count the number of cycles in the loop (search for "loop_mode").
With -march=elbrus-v5 there are fewer cycles than with -march=elbrus-v6.

How time was measured

Time was measured with the clock_gettime() function:

	struct timespec t0, t1;
	clock_gettime(CLOCK_REALTIME, &t0);
	/***  code being measured goes here  ***/
	clock_gettime(CLOCK_REALTIME, &t1);
	int usec = (t1.tv_sec - t0.tv_sec)*1000000 + (t1.tv_nsec - t0.tv_nsec)/1000;
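
The conversion to microseconds can also be wrapped in a small helper. The sketch below is only an illustration: the name elapsed_usec is hypothetical and not part of the article's code. It does the arithmetic in 64 bits so that long intervals do not overflow an int.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper (not from the article): elapsed time in microseconds
   between two timestamps, computed entirely in 64-bit arithmetic. */
static int64_t elapsed_usec(const struct timespec *t0, const struct timespec *t1)
{
	assert(t0 != 0 && t1 != 0);
	int64_t nsec = ((int64_t)t1->tv_sec - (int64_t)t0->tv_sec) * 1000000000
	             + ((int64_t)t1->tv_nsec - (int64_t)t0->tv_nsec);
	return nsec / 1000;
}
```

It would be called as elapsed_usec(&t0, &t1) after the two clock_gettime() calls. For interval measurements CLOCK_MONOTONIC is often preferred, because it is not affected by wall-clock adjustments.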

The processor's cycle counter was also read:

	uint64_t get_clock_count()
	{
		uint64_t dst;
		#pragma asm_inline
		asm ("rrd %%clkr, %0" : "=r" (dst));
		return dst;
	}

	...

	uint64_t ticks0 = get_clock_count();
	/***  code being measured goes here  ***/
	uint64_t ticks1 = get_clock_count();
	uint64_t ticks = ticks1 - ticks0;

About the pragma options

This time I used #pragma prefetch instead of #pragma loop_count(100).
As far as I understand, both options enable APB. The difference is that prefetch does not disable the preliminary data request before the loop, while loop_count does. This request slightly reduces the loop's execution time. With a large number of iterations the speedup is negligible, while the request itself increases the code size.
There is almost no difference, but prefetch looks simpler than loop_count(100).

I also used #pragma ivdep, which tells the compiler that different iterations of the loop are independent of each other in their memory accesses, so the next iteration can start without waiting for the current one to finish.
This option is useful when a loop both reads from and writes to memory.

About loop unrolling

Loop unrolling is sometimes useful for speeding code up. Up to 6 instructions can execute in one cycle, because this processor has 6 execution units. If, for example, a loop iteration consists of 9 instructions, it takes 2 cycles, and two such iterations take 4 cycles. After unrolling by a factor of 2, one iteration of the new loop consists of 2*9=18 instructions, and it can be expected to take 3 cycles. Thus two iterations of the original loop take 3 cycles instead of 4. Keep in mind, though, that for various reasons it is not always possible to fit 6 instructions into every cycle, for example because some instructions cannot execute on every execution unit.
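
The arithmetic above is just a ceiling division. A minimal sketch, assuming the idealized case where every cycle really can hold 6 instructions (min_cycles is a hypothetical name used only for this illustration):

```c
#include <assert.h>

/* Lower bound on cycles per iteration if every cycle could hold `units`
   instructions. This is an idealization: real packing is often worse. */
static int min_cycles(int instructions, int units)
{
	assert(instructions >= 0 && units > 0);
	return (instructions + units - 1) / units;	/* ceiling division */
}
```

With 6 units, min_cycles(9, 6) gives 2 cycles per original iteration, while min_cycles(18, 6) gives 3 cycles for the unrolled iteration that covers two original ones.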

Ways to unroll a loop:

  • Unrolling by the compiler (I add the suffix "_unroll2"/"_unroll3"/"_unroll4" to the names of such functions)

    To use it, write #pragma unroll(k) before the loop, where k is the unroll factor. The compiler then unrolls the loop exactly k times. It also adds code that checks whether the iteration count of the original loop is a multiple of k; if it is not, the remainder is processed by separate code. The unrolled code performs the same actions in the same order as the original code. With #pragma unroll(1) no unrolling is performed. This can be useful because by default the compiler tries to apply unroll(2).

  • Manual unrolling by the programmer (I add the suffix "_x2"/"_x3"/"_x4" to the names of such functions)

    With manual unrolling the programmer writes the code so that one loop iteration performs k iterations of the algorithm. Checking that the number of algorithm iterations is a multiple of k is the programmer's responsibility. If it may not be a multiple of k, those cases have to be handled separately. For simplicity, the code in this article does not handle them. Unlike compiler unrolling, which repeats all actions in the original order, manually unrolled code can be optimized by replacing actions with others and reordering them.

    An interesting variant is manual unrolling of a loop whose body consists of another loop, with every pass working on the same array. In that case k iterations of the inner loop can be merged. For example, if an iteration of the inner loop consists of reading data from memory, processing it and storing the result to memory, then after unrolling the outer loop by 2 the store at the end of the first iteration and the load before the second one can be removed. What remains is simply: load, process, process again, store. An example of this optimization is found in functions marked "2x", for example stage_radix2_2x ("2x" rather than "x2" so that files sort better by name).
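
The "load, process, process again, store" idea can be sketched on a toy example. Everything here is an assumption made for illustration: step is a hypothetical per-element operation, and real FFT stages are not element-wise, so merging them takes more care than shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-element operation standing in for one pass over the array. */
static float step(float v, float k) { return v * k + 1.0f; }

/* Two separate passes: every element is loaded and stored twice. */
static void two_passes(float *a, size_t n, float k1, float k2)
{
	assert(a != NULL || n == 0);
	for (size_t i = 0; i < n; ++i) a[i] = step(a[i], k1);
	for (size_t i = 0; i < n; ++i) a[i] = step(a[i], k2);
}

/* Merged "2x" variant: one load, two processing steps, one store. */
static void fused_2x(float *a, size_t n, float k1, float k2)
{
	assert(a != NULL || n == 0);
	for (size_t i = 0; i < n; ++i)
	{
		float v = a[i];		/* load once */
		v = step(v, k1);	/* first pass */
		v = step(v, k2);	/* second pass, no store/load in between */
		a[i] = v;		/* store once */
	}
}
```

Both functions produce the same array; the merged one touches memory half as often.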

What is FFT

Let a set of complex numbers x_0, …, x_{N-1} be given.

The discrete Fourier transform (DFT) maps this set to another set of complex numbers X_0, …, X_{N-1} according to the following formula:

\large \begin{gather} \begin{aligned} X_t &= \sum_{n=0}^{N-1} x_n {c_t}^n \end{aligned} \\ \\ c_t = e^{- \frac{2 \pi i}{N} t} \\ \\ \small (t = 0,..., N-1) \end{gather}

Computing X_0, …, X_{N-1} by this formula requires O(N^2) operations.

A fast Fourier transform (FFT) is an algorithm that computes the DFT much faster (usually in O(N log N) operations).

The best-known FFT is the Cooley-Tukey algorithm. It recursively computes a DFT through DFTs of smaller size (the "divide-and-conquer" method).

For example, if the original array is split into two sub-arrays (the "radix-2" variant), the original formula can be transformed into the following form:

\large \begin{gather} \begin{aligned} X_t &= E_t + c_t O_t \\ X_{t+N/2} &= E_t - c_t O_t \end{aligned}  \\ \\ c_t = e^{- \frac{2 \pi i}{N} t} \\ \\ \small (t = 0,..., \frac{N}{2} - 1) \end{gather}

E_t is the DFT of the N/2 even elements of the original array (indices of the form 2s+0).
O_t is the DFT of the N/2 odd elements of the original array (indices of the form 2s+1).

The "radix-4" variant

By analogy, the "radix-4" variant transforms the formula into the following form:

\large  \begin{gather} \begin{alignedat}{1} X_t          & = EE_t\ +&\  &c_t EO_t + {c_t}^2 OE_t\ +&\  &{c_t}^3 OO_t \\ X_{t+N/4}    & = EE_t\ -&\ i&c_t EO_t - {c_t}^2 OE_t\ +&\ i&{c_t}^3 OO_t \\ X_{t + 2N/4} & = EE_t\ -&\  &c_t EO_t + {c_t}^2 OE_t\ -&\  &{c_t}^3 OO_t \\ X_{t+3N/4}   & = EE_t\ +&\ i&c_t EO_t - {c_t}^2 OE_t\ -&\ i&{c_t}^3 OO_t \end{alignedat} \\ \\ c_t = e^{- \frac{2 \pi i}{N} t} \\ \\ \small (t = 0,..., \frac{N}{4} - 1) \end{gather}

EE_t is the DFT of the N/4 elements of the original array with indices of the form 4s+0.
EO_t is the DFT of the N/4 elements of the original array with indices of the form 4s+1.
OE_t is the DFT of the N/4 elements of the original array with indices of the form 4s+2.
OO_t is the DFT of the N/4 elements of the original array with indices of the form 4s+3.

Let's write pseudocode for a recursive function implementing the "radix-2" variant:

	FFT(IN)	// IN - the input data set
	{
		N = IN.length

		if N == 1
			return IN

		E = FFT(IN[0:N:2])	// FFT of the even elements
		O = FFT(IN[1:N:2])	// FFT of the odd  elements

		for t = 0:N/2
		{
			c = e^(-2*pi*i*t/N)
			OUT[t      ] = E[t] + c*O[t]
			OUT[t + N/2] = E[t] - c*O[t]
		}

		return OUT
	}

Now a bit closer to real code:

	FFT(*IN, N, s)
	// IN - pointer to the first element
	// N  - number of elements
	// s  - distance between elements
	{
		if N == 1
			return IN[0]

		E[0:N/2] = FFT(IN,     N/2, 2s)	// FFT of the even elements
		O[0:N/2] = FFT(IN + s, N/2, 2s)	// FFT of the odd  elements

		for t = 0:N/2
		{
			c = e^(-2*pi*i*t/N)
			OUT[t      ] = E[t] + c*O[t]
			OUT[t + N/2] = E[t] - c*O[t]
		}

		return OUT[0:N]
	}

Let's place the elements E_t in the first half of OUT and the elements O_t in the second half of OUT:

	FFT(*IN, N, s)
	// IN - pointer to the first element
	// N  - number of elements
	// s  - distance between elements
	{
		if N == 1
			return IN[0]

		OUT[0:N/2] = FFT(IN,     N/2, 2s)	// FFT of the even elements
		OUT[N/2:N] = FFT(IN + s, N/2, 2s)	// FFT of the odd  elements

		for t = 0:N/2
		{
			x = OUT[t      ]
			y = OUT[t + N/2]
			c = e^(-2*pi*i*t/N)
			OUT[t      ] = x + c*y
			OUT[t + N/2] = x - c*y
		}

		return OUT[0:N]
	}

For convenience of implementation, before the recursive calls all even elements can be moved to the beginning of the array and all odd ones to the end. The sub-arrays then occupy consecutive memory cells instead of being interleaved.

If the number of elements is N = 2^n, this permutation is equivalent to a permutation of the form IN[k] → OUT[rotateRight(k, n)], where rotateRight(k, n) is an operation that "rotates right" the lowest n bits of k, i.e. moves the lowest bit of k from position 0 to position n-1 and shifts the bits at positions n-1, …, 1 one position toward position 0.
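
A minimal sketch of rotateRight in C, assuming k fits in n bits (the function name rotate_right is chosen here for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Rotates the lowest n bits of k right by one position:
   bit 0 moves to position n-1, bits n-1..1 move down by one. */
static uint32_t rotate_right(uint32_t k, unsigned n)
{
	assert(n >= 1 && n <= 32);
	assert(n == 32 || k < (1u << n));
	return (k >> 1) | ((k & 1u) << (n - 1));
}
```

For n = 3 the even indices 0, 2, 4, 6 map to 0, 1, 2, 3 and the odd indices 1, 3, 5, 7 map to 4, 5, 6, 7, which is exactly "even to the beginning, odd to the end".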

The pseudocode becomes:

	FFT(*IN, N)
	// IN - pointer to the first element
	// N  - number of elements
	{
		if N == 1
			return IN[0]

		// permutation "even to the beginning, odd to the end"
		IN = Even2Beginning_Odd2Ending(IN, N)

		OUT[0:N/2] = FFT(IN,       N/2)	// FFT of the first  half of the array
		OUT[N/2:N] = FFT(IN + N/2, N/2)	// FFT of the second half of the array

		for t = 0:N/2
		{
			x = OUT[t      ]
			y = OUT[t + N/2]
			c = e^(-2*pi*i*t/N)
			OUT[t      ] = x + c*y
			OUT[t + N/2] = x - c*y
		}

		return OUT[0:N]
	}

If we now mentally trace the recursion, we can see that by the time the deepest level of recursive calls is reached, all elements of the original array have been rearranged into the order usually called "bit reversal". In this article I will simply call this permutation "Reverse".

What is Reverse

It is a permutation of elements of the form IN[k] → OUT[reverseNumber(k)], where reverseNumber(k) is an operation that puts the bits of k in reverse order.

The result of reverseNumber(k) depends on the number of bits in k.
For example:
- with 3 bits, reverseNumber(6) = 3  ( 110 → 011 )
- with 4 bits, reverseNumber(6) = 6  ( 0110 → 0110 )
- with 5 bits, reverseNumber(6) = 12 (00110 → 01100)

In the "radix-4" case we arrive at a similar permutation, in which base-4 digits (pairs of bits) are reversed instead of binary digits (bits).
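
The base-4 digit reversal can be sketched with a loop over bit pairs. The function name reverse_number_radix4 and the digit_count parameter (the number of base-4 digits, i.e. log4(N)) are assumptions made for this illustration.

```c
#include <assert.h>

/* Reverses the order of the base-4 digits (bit pairs) of number.
   digit_count is the number of base-4 digits. */
static int reverse_number_radix4(int number, int digit_count)
{
	int answer = 0;
	assert(number >= 0 && digit_count >= 0 && digit_count <= 15);
	for (int i = 0; i < digit_count; ++i)
	{
		answer <<= 2;			/* make room for the next digit */
		answer |= number & 3;		/* take the lowest base-4 digit */
		number >>= 2;
	}
	return answer;
}
```

For example, with 2 digits the number 6 (digits 1, 2) becomes 9 (digits 2, 1).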

Thus, the descent into the recursion can be replaced by Reverse.

The way back up the recursion consists of processing sub-arrays whose size grows as the recursion unwinds. One step of returning from the recursion that processes all sub-arrays at once will be called a "Stage".

	Stage(*OUT, N, stage_num)
	{
		m = 2^stage_num		// m - sub-array size
		for k = 0:N/m		// k - sub-array number
		{
			for t = 0:m/2
			{
				x = OUT[m*k + t      ]
				y = OUT[m*k + t + m/2]
				c = e^(-2*pi*i*t/m)
				OUT[m*k + t      ] = x + c*y
				OUT[m*k + t + m/2] = x - c*y
			}
		}
	}

The pseudocode of the algorithm now looks like this (without recursion):

	FFT(*IN, N, *OUT)
	{
		OUT = Reverse(IN, N)
		for stage_num = 1, ..., log2(N)
			Stage(OUT, N, stage_num)
	}

Thus, FFT = Reverse + log2(N)*Stage.

To solve the problem, two functions have to be written: Reverse and Stage.

Specifics of implementing FFT on Elbrus

The Elbrus processor has the APB mechanism, which allows fast reading of data located in memory with a constant stride. The number of read streams in APB is limited to 32.

In the last variant of the algorithm (without recursion), different Stages process sub-arrays of different sizes:

  • the first Stage processes sub-arrays of length 2 (this can be viewed as 2 read streams with a constant stride: even and odd elements)

  • the second Stage processes sub-arrays of length 4 (this can be viewed as 4 read streams with a constant stride)

  • the third Stage processes sub-arrays of length 8 (this can be viewed as 8 read streams with a constant stride)

  • and so on

With a sufficiently large number of Stages, the APB read streams run out. To use APB efficiently, let's modify the Stage algorithm so that elements are always read in pairs (as the first Stage does now).

The result is the following variant for Elbrus:

	Stage(*IN, N, *OUT, stage_num)
	{
		m = 2^stage_num
		for k = 0:m/2
		{
			c = e^(-2*pi*i*k/m)
			for t = 0:N/m
			{
				j = k*(N/m) + t		// butterfly number, 0 .. N/2-1
				x = IN[2*j    ]
				y = IN[2*j + 1]
				OUT[j      ] = x + c*y
				OUT[j + N/2] = x - c*y
			}
		}
	}

If we mentally trace the operations, we can see that in this variant every Stage performs the same arithmetic operations on the same pairs of numbers with the same coefficients as the classic variant. The pairs are merely processed in a different order (on all Stages except the first). In addition, the permutation "even to the beginning, odd to the end" is appended to the end of every Stage. As noted above, this permutation is equivalent to IN[k] → OUT[rotateRight(k, n)], so after passing through all Stages the numbers return to their original positions (the rotation is applied log2(N) times).

What Stage looks like for the "radix-4" version

This is the variant that will be implemented in this article.

A note on coefficients in the Stage functions

Computing a Stage requires the input data and the coefficients. The input data is in memory from the start. The coefficients can either be read from memory along with the input data (precomputed) or computed on the fly.

Reading coefficients from memory works well when there are few coefficients and they all fit in the cache.

Computing on the fly is justified when there are many coefficients, because in that case the bottleneck is the memory channel, and not reading coefficients from memory leaves the whole channel for reading input data. The total size of the coefficients (depending on the algorithm) can be several times larger than the input data. Therefore, not reading coefficients from memory can increase the input read rate several-fold. With few coefficients, computing on the fly slows things down, because it requires additional instructions for the computation itself.

The code in this article implements only the variant that reads coefficients from memory.

Writing the Reverse function

reverse_radix2

1. reverse_radix2_etalon

A reference variant used for correctness checks.
Here reverseNumber is computed with a loop.

int reverseNumber_radix2(int number, int bit_count)
{
	int answer = 0;

	for(int i = 0; i < bit_count; ++i)
	{
		answer <<= 1;
		answer |= number & 1;
		number >>= 1;
	}

	return answer;
}


void reverse_radix2_etalon(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;

	for(int64_t i = 0; i < count; ++i)
	{
		int index = reverseNumber_radix2(i, bit_count);
		data_out[index] = data_in[i];
	}
}

2. reverse_radix2

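The processor has the bitrevd instruction, which performs the reverseNumber_radix2 operation on a 64-bit number. Let's replace reverseNumber_radix2() with __builtin_e2k_bitrevd(). Since all 64 bits are reversed, the result is then shifted right by 64 - bit_count to obtain the reversal of the lowest bit_count bits.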
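
To experiment with this on a machine without bitrevd, a software stand-in can be used. The sketch below is an assumption: bitrev64_soft is a hypothetical helper that reverses all 64 bits with the usual swap-by-halves technique, and it is not the Elbrus builtin itself.

```c
#include <assert.h>
#include <stdint.h>

/* Portable 64-bit bit reversal: swaps adjacent bits, then pairs,
   nibbles, bytes, 16-bit halves and finally the 32-bit halves. */
static uint64_t bitrev64_soft(uint64_t x)
{
	x = ((x >> 1)  & 0x5555555555555555ULL) | ((x & 0x5555555555555555ULL) << 1);
	x = ((x >> 2)  & 0x3333333333333333ULL) | ((x & 0x3333333333333333ULL) << 2);
	x = ((x >> 4)  & 0x0F0F0F0F0F0F0F0FULL) | ((x & 0x0F0F0F0F0F0F0F0FULL) << 4);
	x = ((x >> 8)  & 0x00FF00FF00FF00FFULL) | ((x & 0x00FF00FF00FF00FFULL) << 8);
	x = ((x >> 16) & 0x0000FFFF0000FFFFULL) | ((x & 0x0000FFFF0000FFFFULL) << 16);
	return (x >> 32) | (x << 32);
}

/* Reversal of the lowest bit_count bits: reverse all 64, then shift down. */
static uint64_t reverse_low_bits(uint64_t k, int bit_count)
{
	assert(bit_count >= 1 && bit_count <= 64);
	return bitrev64_soft(k) >> (64 - bit_count);
}
```

The unsigned type matters here: the shift has to be a logical one, otherwise indices with the lowest bit set would come out negative.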

[Diagram: data movement in memory]

C code:

void reverse_radix2(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count; ++i)
	{
		int64_t index = __builtin_e2k_bitrevd(i) >> shift;
		data_out[index] = data_in[i];
	}
}

Main loop in assembly:

.L1554:
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=5, abs=0, disp=0
        }
.L1385:
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          bitrevd,0,sm  %b[9], %b[1]
          addd,2,sm     %b[9], 0x1, %b[7]
          shrd,3,sm     %b[5], %r0, %b[10]
          shld,4,sm     %b[12], 0x3, %b[11]
          std,5         %r2, %b[13], %b[8]
          movad,1       area=0, ind=0, am=1, be=0, %b[0]
        }

Theoretical speed: 1 complex number per cycle (1/1) = 8 bytes/cycle

[Chart: speed measurements]

3. reverse_radix2_x2_bad

Let's try to speed it up by manually unrolling the loop by a factor of 2.

[Diagram: data movement in memory]

C code:

void reverse_radix2_x2_bad(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;

	myComplex *data_out_0 = &data_out[0];
	myComplex *data_out_1 = &data_out[count/2];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count; i += 2)
	{
		int64_t index = __builtin_e2k_bitrevd(i) >> shift;
		data_out_0[index] = data_in[i + 0];
		data_out_1[index] = data_in[i + 1];
	}
}

Main loop in assembly:

.L1860:
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
        }
.L1655:
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          bitrevd,0,sm  %b[20], %b[12]
          addd,1,sm     %b[20], 0x2, %b[18]
          std,2         %r2, %b[17], %b[10]
          shrd,3,sm     %b[16], %r4, %b[19]
          shld,4,sm     %b[21], 0x3, %b[13]
          std,5         %r0, %b[17], %b[11]
          movad,0       area=0, ind=0, am=0, be=0, %b[0]
          movad,1       area=0, ind=8, am=1, be=0, %b[1]
        }

Theoretical speed: 2 complex numbers per cycle (2/1) = 16 bytes/cycle

[Chart: speed measurements]

We see a speedup at the beginning of the graph.

Here both reads come from one place in memory and the writes go to two different places.
The lack of speedup along the whole graph is probably because the writes to memory are still performed one after another (into the same memory bank?).


4. reverse_radix2_x2_good

Let's try the opposite: read from two different places and write to adjacent locations.

[Diagram: data movement in memory]

C code:

void reverse_radix2_x2_good(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;

	myComplex *data_in_0 = &data_in[0];
	myComplex *data_in_1 = &data_in[count/2];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count/2; ++i)
	{
		int64_t index = __builtin_e2k_bitrevd(i) >> shift;
		data_out[index + 0] = data_in_0[i];
		data_out[index + 1] = data_in_1[i];
	}
}

Main loop in assembly:

.L2162:
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=5, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=5, abs=0, disp=0
        }
.L1993:
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          bitrevd,0,sm  %b[20], %b[12]
          addd,1,sm     %b[20], 0x1, %b[18]
          std,2         %b[17], %r2, %b[11]
          shrd,3,sm     %b[16], %r4, %b[19]
          shld,4,sm     %b[21], 0x3, %b[13]
          std,5         %r0, %b[17], %b[10]
          movad,1       area=0, ind=0, am=1, be=0, %b[1]
          movad,3       area=0, ind=0, am=1, be=0, %b[0]
        }

Theoretical speed: 2 complex numbers per cycle (2/1) = 16 bytes/cycle

[Chart: speed measurements]

We see the desired speedup along the whole graph.

Strictly speaking, this is not loop unrolling. The previous variant was an honest unrolling. Here the algorithm itself has changed (the data is processed in a different order). But I could not come up with a name for it ("stream2"?), so all further "unrollings" will be called x4/x8 and so on.


5. reverse_radix2_x2_best

Before moving on to larger unroll factors, let's see what happens if one 128-bit store is used instead of two 64-bit stores.

[Diagram: data movement in memory]

C code:

void reverse_radix2_x2_best(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;

	uint64_t *data_in_0 = (uint64_t*)&data_in[0];
	uint64_t *data_in_1 = (uint64_t*)&data_in[count/2];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count/2; ++i)
	{
		int64_t offset = 8 * (__builtin_e2k_bitrevd(i) >> shift);
		*(__v2du*)((void*)data_out + offset) = (__v2du){data_in_0[i], data_in_1[i]};
	}
}

Main loop in assembly:

.L2350:
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=5, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=5, abs=0, disp=0
        }
.L2262:
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          bitrevd,0,sm  %b[22], %b[21]
          qppackdl,1,sm %b[10], %b[11], %b[13]
          addd,2,sm     %b[22], 0x1, %b[20]
          shrd,3,sm     %b[25], %r0, %b[24]
          shld,4,sm     %b[26], 0x3, %b[27]
          stqp,5        %r2, %b[29], %b[19]
          movad,1       area=0, ind=0, am=1, be=0, %b[1]
          movad,3       area=0, ind=0, am=1, be=0, %b[0]
        }

Theoretical speed: 2 complex numbers per cycle (2/1) = 16 bytes/cycle

[Chart: speed measurements]

We see a small speedup.

From now on we will always write to memory in 128-bit chunks.


6. reverse_radix2_x4

Now let's do a similar "pseudo-unrolling" by a factor of 4.

[Diagram: data movement in memory]

C code:

void reverse_radix2_x4(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;

	uint64_t *data_in_00 = (uint64_t*)&data_in[0 * count/4];
	uint64_t *data_in_01 = (uint64_t*)&data_in[1 * count/4];
	uint64_t *data_in_10 = (uint64_t*)&data_in[2 * count/4];
	uint64_t *data_in_11 = (uint64_t*)&data_in[3 * count/4];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count/4; ++i)
	{
		int64_t offset = 8 * (__builtin_e2k_bitrevd(i) >> shift);
		*(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_00[i], data_in_10[i]};
		*(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_01[i], data_in_11[i]};
	}
}

Main loop in assembly:

.L2619:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=0, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=4, abs=16, disp=0
        }
.L2503:
        {
          loop_mode
          qppackdl,1,sm %b[15], %b[18], %b[20]
          shrd,3,sm     %b[21], %r4, %b[1]
          qppackdl,4,sm %b[9], %b[12], %b[22]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          addd,1,sm     %b[4], 0x1, %b[2]
          stqp,2        %r2, %b[17], %b[20]
          bitrevd,3,sm  %b[4], %b[19]
          shld,4,sm     %b[1], 0x3, %b[15]
          stqp,5        %r0, %b[17], %b[22]
          movad,0       area=1, ind=0, am=1, be=0, %b[6]
          movad,1       area=0, ind=0, am=1, be=0, %b[12]
          movad,2       area=1, ind=0, am=1, be=0, %b[3]
          movad,3       area=0, ind=0, am=1, be=0, %b[9]
        }

Theoretical speed: 4 complex numbers per 2 cycles (4/2) = 16 bytes/cycle

[Chart: speed measurements]

We see a strong speedup.

However, the code no longer fits into one cycle. I would like to fix that.


7. reverse_radix2_x4_oneTickVersion

Let's rewrite the code so that it fits into one cycle (by getting rid of the shrd and shld instructions).

[Diagram: data movement in memory]

C code:

void reverse_radix2_x4_oneTickVersion(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;
	int64_t delta = (1LL << shift) / 8;

	uint64_t *data_in_00 = (uint64_t*)&data_in[0 * count/4];
	uint64_t *data_in_01 = (uint64_t*)&data_in[1 * count/4];
	uint64_t *data_in_10 = (uint64_t*)&data_in[2 * count/4];
	uint64_t *data_in_11 = (uint64_t*)&data_in[3 * count/4];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0, shifted_i = 0; i < count/4; ++i, shifted_i += delta)
	{
		int64_t offset = __builtin_e2k_bitrevd(shifted_i);
		*(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_00[i], data_in_10[i]};
		*(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_01[i], data_in_11[i]};
	}
}

Main loop in assembly:

.L2891:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=0, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=4, abs=16, disp=0
        }
.L2777:
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          bitrevd,0,sm  %b[34], %b[25]
          qppackdl,1,sm %b[22], %b[23], %b[31]
          stqp,2        %r2, %b[29], %b[33]
          qppackdl,3,sm %b[10], %b[11], %b[35]
          addd,4,sm     %b[32], %r4, %b[30]
          stqp,5        %r0, %b[29], %b[37]
          movad,0       area=1, ind=0, am=1, be=0, %b[1]
          movad,1       area=0, ind=0, am=1, be=0, %b[13]
          movad,2       area=1, ind=0, am=1, be=0, %b[0]
          movad,3       area=0, ind=0, am=1, be=0, %b[12]
        }

Theoretical speed: 4 complex numbers per cycle (4/1) = 32 bytes/cycle

[Chart: speed measurements]

We see a speedup at the beginning of the graph and a slowdown at the end.
It should have been either better than the previous variant or the same. I do not know how to explain this.
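
The reason the two shifts can be dropped in the one-cycle version is that reversing a pre-shifted counter yields the byte offset directly: 8 * (bitrev(i) >> shift) equals bitrev(i * delta) with delta = 2^shift / 8, as long as i < 2^bit_count. The sketch below checks this identity in portable C; bitrev64_soft is a hypothetical software stand-in for bitrevd, and the two offset functions are named only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Portable 64-bit bit reversal (software stand-in for bitrevd). */
static uint64_t bitrev64_soft(uint64_t x)
{
	x = ((x >> 1)  & 0x5555555555555555ULL) | ((x & 0x5555555555555555ULL) << 1);
	x = ((x >> 2)  & 0x3333333333333333ULL) | ((x & 0x3333333333333333ULL) << 2);
	x = ((x >> 4)  & 0x0F0F0F0F0F0F0F0FULL) | ((x & 0x0F0F0F0F0F0F0F0FULL) << 4);
	x = ((x >> 8)  & 0x00FF00FF00FF00FFULL) | ((x & 0x00FF00FF00FF00FFULL) << 8);
	x = ((x >> 16) & 0x0000FFFF0000FFFFULL) | ((x & 0x0000FFFF0000FFFFULL) << 16);
	return (x >> 32) | (x << 32);
}

/* Byte offset computed with two shifts: 8 * (bitrev(i) >> shift). */
static uint64_t offset_with_shifts(uint64_t i, int bit_count)
{
	assert(bit_count >= 1 && bit_count <= 61);
	int shift = 64 - bit_count;
	return 8 * (bitrev64_soft(i) >> shift);
}

/* Byte offset computed without shifts: bitrev(i * delta),
   where delta = 2^shift / 8 is what the loop adds on every iteration. */
static uint64_t offset_one_cycle(uint64_t i, int bit_count)
{
	assert(bit_count >= 1 && bit_count <= 61);
	int shift = 64 - bit_count;
	uint64_t delta = (UINT64_C(1) << shift) / 8;
	return bitrev64_soft(i * delta);
}
```

For example, with bit_count = 3 and i = 1 both functions return 32.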


Around this point I thought that no further optimization was possible. But when I looked at the algorithm diagrams, a question came up: what if we unroll by another factor of 2 in the same way? Is it possible to read from 8 different places at once without losing speed?

8. reverse_radix2_x8

Let's continue the "pseudo-unrolling".

[Diagram: data movement in memory]

C code:

void reverse_radix2_x8(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;

	uint64_t *data_in_000 = (uint64_t*)&data_in[0 * count/8];
	uint64_t *data_in_001 = (uint64_t*)&data_in[1 * count/8];
	uint64_t *data_in_010 = (uint64_t*)&data_in[2 * count/8];
	uint64_t *data_in_011 = (uint64_t*)&data_in[3 * count/8];
	uint64_t *data_in_100 = (uint64_t*)&data_in[4 * count/8];
	uint64_t *data_in_101 = (uint64_t*)&data_in[5 * count/8];
	uint64_t *data_in_110 = (uint64_t*)&data_in[6 * count/8];
	uint64_t *data_in_111 = (uint64_t*)&data_in[7 * count/8];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count/8; ++i)
	{
		int64_t offset = 8 * (__builtin_e2k_bitrevd(i) >> shift);
		*(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_000[i], data_in_100[i]};
		*(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_010[i], data_in_110[i]};
		*(__v2du*)((void*)data_out + offset + 2*16) = (__v2du){data_in_001[i], data_in_101[i]};
		*(__v2du*)((void*)data_out + offset + 3*16) = (__v2du){data_in_011[i], data_in_111[i]};
	}
}

Main loop in assembly:

.L3314:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=3, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=3, abs=0, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=3, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=3, abs=8, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=3, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=3, abs=16, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=3, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=3, abs=24, disp=0
        }
.L3146:
        {
          loop_mode
          bitrevd,0,sm  %b[32], %b[29]
          qppackdl,1,sm %b[20], %b[21], %b[33]
          stqp,2        %r2, %b[42], %b[38]
          shld,3,sm     %b[34], 0x3, %b[40]
          qppackdl,4,sm %b[6], %b[7], %b[37]
          stqp,5        %r0, %b[42], %b[36]
          movad,0       area=3, ind=0, am=1, be=0, %b[1]
          movad,1       area=2, ind=0, am=1, be=0, %b[15]
          movad,2       area=3, ind=0, am=1, be=0, %b[0]
          movad,3       area=2, ind=0, am=1, be=0, %b[14]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qppackdl,0,sm %b[26], %b[27], %b[36]
          addd,1,sm     %b[30], 0x1, %b[28]
          stqp,2        %r4, %b[42], %b[35]
          qppackdl,3,sm %b[12], %b[13], %b[34]
          shrd,4,sm     %b[31], %r6, %b[32]
          stqp,5        %r5, %b[42], %b[39]
          movad,0       area=1, ind=0, am=1, be=0, %b[7]
          movad,1       area=0, ind=0, am=1, be=0, %b[21]
          movad,2       area=1, ind=0, am=1, be=0, %b[6]
          movad,3       area=0, ind=0, am=1, be=0, %b[20]
        }

Theoretical speed: 8 complex numbers per 2 cycles (8/2) = 32 bytes/cycle

[Chart: speed measurements]

We see a slowdown at the beginning of the graph and a speedup at the end.

Still, the system coped with reading from 8 different places.


9. reverse_radix2_x16

And what about a factor of 16?

[Diagram: data movement in memory]

C code:

void reverse_radix2_x16(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;

	uint64_t *data_in_0000 = (uint64_t*)&data_in[ 0 * count/16];
	uint64_t *data_in_0001 = (uint64_t*)&data_in[ 1 * count/16];
	uint64_t *data_in_0010 = (uint64_t*)&data_in[ 2 * count/16];
	uint64_t *data_in_0011 = (uint64_t*)&data_in[ 3 * count/16];
	uint64_t *data_in_0100 = (uint64_t*)&data_in[ 4 * count/16];
	uint64_t *data_in_0101 = (uint64_t*)&data_in[ 5 * count/16];
	uint64_t *data_in_0110 = (uint64_t*)&data_in[ 6 * count/16];
	uint64_t *data_in_0111 = (uint64_t*)&data_in[ 7 * count/16];
	uint64_t *data_in_1000 = (uint64_t*)&data_in[ 8 * count/16];
	uint64_t *data_in_1001 = (uint64_t*)&data_in[ 9 * count/16];
	uint64_t *data_in_1010 = (uint64_t*)&data_in[10 * count/16];
	uint64_t *data_in_1011 = (uint64_t*)&data_in[11 * count/16];
	uint64_t *data_in_1100 = (uint64_t*)&data_in[12 * count/16];
	uint64_t *data_in_1101 = (uint64_t*)&data_in[13 * count/16];
	uint64_t *data_in_1110 = (uint64_t*)&data_in[14 * count/16];
	uint64_t *data_in_1111 = (uint64_t*)&data_in[15 * count/16];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count/16; ++i)
	{
		int64_t offset = 8 * (__builtin_e2k_bitrevd(i) >> shift);
		*(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_0000[i], data_in_1000[i]};
		*(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_0100[i], data_in_1100[i]};
		*(__v2du*)((void*)data_out + offset + 2*16) = (__v2du){data_in_0010[i], data_in_1010[i]};
		*(__v2du*)((void*)data_out + offset + 3*16) = (__v2du){data_in_0110[i], data_in_1110[i]};
		*(__v2du*)((void*)data_out + offset + 4*16) = (__v2du){data_in_0001[i], data_in_1001[i]};
		*(__v2du*)((void*)data_out + offset + 5*16) = (__v2du){data_in_0101[i], data_in_1101[i]};
		*(__v2du*)((void*)data_out + offset + 6*16) = (__v2du){data_in_0011[i], data_in_1011[i]};
		*(__v2du*)((void*)data_out + offset + 7*16) = (__v2du){data_in_0111[i], data_in_1111[i]};
	}
}
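The index arithmetic of this variant can be cross-checked off-target with a portable sketch. It is not the code that runs on Elbrus: `bit_reverse` is a plain-C stand-in for `__builtin_e2k_bitrevd(i) >> shift`, the 16 named stream pointers are folded into a loop over `j`, elements are treated as opaque 64-bit words, and the names `reverse_x16_sketch` and `check_reverse_x16` are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Portable stand-in for (bitrevd(x) >> (64 - bits)): reverse the low `bits` bits of x. */
static uint64_t bit_reverse(uint64_t x, int bits)
{
	uint64_t r = 0;
	for (int b = 0; b < bits; ++b) {
		r = (r << 1) | (x & 1);
		x >>= 1;
	}
	return r;
}

/* Same data movement as reverse_radix2_x16, with the 16 streams written as a loop.
   One 64-bit word stands for one complex number. */
static void reverse_x16_sketch(int bit_count, const uint64_t *in, uint64_t *out)
{
	assert(bit_count >= 4 && bit_count < 64);
	uint64_t block = (UINT64_C(1) << bit_count) / 16;	/* length of one input stream */

	for (uint64_t i = 0; i < block; ++i) {
		/* i < count/16, so the low 4 bits of base are always zero */
		uint64_t base = bit_reverse(i, bit_count);
		for (uint64_t j = 0; j < 16; ++j)
			out[base + bit_reverse(j, 4)] = in[j * block + i];
	}
}

/* Hypothetical self-check: compare with an element-by-element bit reversal.
   Returns 1 on success, 0 on mismatch or allocation failure. */
static int check_reverse_x16(int bit_count)
{
	uint64_t count = UINT64_C(1) << bit_count;
	uint64_t *in = malloc(count * sizeof *in);
	uint64_t *out = malloc(count * sizeof *out);
	int ok = (in != NULL && out != NULL);

	if (ok) {
		for (uint64_t n = 0; n < count; ++n)
			in[n] = n;	/* tag every element with its input index */
		reverse_x16_sketch(bit_count, in, out);
		for (uint64_t n = 0; n < count; ++n)
			if (out[bit_reverse(n, bit_count)] != n)
				ok = 0;
	}
	free(in);
	free(out);
	return ok;
}
```

Calling `check_reverse_x16(10)` from a test program exercises a 1024-element array. The inner loop makes the pairing in the listing above visible: stream `j` always lands in slot `bit_reverse(j, 4)` of a 16-element output group, which is why streams 0000 and 1000 end up side by side.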
Main loop in assembly
.L4040:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=2, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=2, abs=0, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=2, abs=4, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=2, abs=4, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=2, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=2, abs=8, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=2, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=2, abs=12, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=2, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=2, abs=16, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=2, abs=20, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=2, abs=20, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=2, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=2, abs=24, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=2, abs=28, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=2, abs=28, disp=0
        }
.L3765:
        {
          loop_mode
          qppackdl,1,sm %b[21], %b[24], %b[40]
          stqp,2        %r2, %b[37], %b[41]
          qppackdl,4,sm %b[7], %b[10], %b[42]
          stqp,5        %r0, %b[37], %b[38]
          movad,0       area=7, ind=0, am=1, be=0, %b[4]
          movad,1       area=6, ind=0, am=1, be=0, %b[18]
          movad,2       area=7, ind=0, am=1, be=0, %b[1]
          movad,3       area=6, ind=0, am=1, be=0, %b[15]
        }
        {
          loop_mode
          qppackdl,1,sm %b[25], %b[28], %b[41]
          stqp,2        %r5, %b[37], %b[40]
          shrd,3,sm     %b[31], %r12, %b[29]
          qppackdl,4,sm %b[11], %b[14], %b[38]
          stqp,5        %r6, %b[37], %b[42]
          movad,0       area=5, ind=0, am=1, be=0, %b[10]
          movad,1       area=4, ind=0, am=1, be=0, %b[24]
          movad,2       area=5, ind=0, am=1, be=0, %b[7]
          movad,3       area=4, ind=0, am=1, be=0, %b[21]
        }
        {
          loop_mode
          qppackdl,1,sm %b[17], %b[20], %b[32]
          stqp,2        %r7, %b[37], %b[41]
          shld,3,sm     %b[29], 0x3, %b[35]
          qppackdl,4,sm %b[3], %b[6], %b[31]
          stqp,5        %r9, %b[37], %b[38]
          movad,0       area=1, ind=0, am=1, be=0, %b[14]
          movad,1       area=0, ind=0, am=1, be=0, %b[28]
          movad,2       area=1, ind=0, am=1, be=0, %b[11]
          movad,3       area=0, ind=0, am=1, be=0, %b[25]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qppackdl,0,sm %b[27], %b[30], %b[39]
          addd,1,sm     %b[2], 0x1, %b[0]
          stqp,2        %r10, %b[37], %b[34]
          qppackdl,3,sm %b[13], %b[16], %b[36]
          bitrevd,4,sm  %b[2], %b[29]
          stqp,5        %r11, %b[37], %b[33]
          movad,0       area=3, ind=0, am=1, be=0, %b[6]
          movad,1       area=2, ind=0, am=1, be=0, %b[20]
          movad,2       area=3, ind=0, am=1, be=0, %b[3]
          movad,3       area=2, ind=0, am=1, be=0, %b[17]
        }

Theoretical speed: 16 complex numbers per 4 cycles (16/4 = 4 numbers per cycle, 8 bytes each) = 32 bytes/cycle

Speed measurements

We see a strong speedup.


10. reverse_radix2_x32

A factor of 32?

C code
void reverse_radix2_x32(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;

	uint64_t *data_in_00000 = (uint64_t*)&data_in[ 0 * count/32];
	uint64_t *data_in_00001 = (uint64_t*)&data_in[ 1 * count/32];
	uint64_t *data_in_00010 = (uint64_t*)&data_in[ 2 * count/32];
	uint64_t *data_in_00011 = (uint64_t*)&data_in[ 3 * count/32];
	uint64_t *data_in_00100 = (uint64_t*)&data_in[ 4 * count/32];
	uint64_t *data_in_00101 = (uint64_t*)&data_in[ 5 * count/32];
	uint64_t *data_in_00110 = (uint64_t*)&data_in[ 6 * count/32];
	uint64_t *data_in_00111 = (uint64_t*)&data_in[ 7 * count/32];
	uint64_t *data_in_01000 = (uint64_t*)&data_in[ 8 * count/32];
	uint64_t *data_in_01001 = (uint64_t*)&data_in[ 9 * count/32];
	uint64_t *data_in_01010 = (uint64_t*)&data_in[10 * count/32];
	uint64_t *data_in_01011 = (uint64_t*)&data_in[11 * count/32];
	uint64_t *data_in_01100 = (uint64_t*)&data_in[12 * count/32];
	uint64_t *data_in_01101 = (uint64_t*)&data_in[13 * count/32];
	uint64_t *data_in_01110 = (uint64_t*)&data_in[14 * count/32];
	uint64_t *data_in_01111 = (uint64_t*)&data_in[15 * count/32];
	uint64_t *data_in_10000 = (uint64_t*)&data_in[16 * count/32];
	uint64_t *data_in_10001 = (uint64_t*)&data_in[17 * count/32];
	uint64_t *data_in_10010 = (uint64_t*)&data_in[18 * count/32];
	uint64_t *data_in_10011 = (uint64_t*)&data_in[19 * count/32];
	uint64_t *data_in_10100 = (uint64_t*)&data_in[20 * count/32];
	uint64_t *data_in_10101 = (uint64_t*)&data_in[21 * count/32];
	uint64_t *data_in_10110 = (uint64_t*)&data_in[22 * count/32];
	uint64_t *data_in_10111 = (uint64_t*)&data_in[23 * count/32];
	uint64_t *data_in_11000 = (uint64_t*)&data_in[24 * count/32];
	uint64_t *data_in_11001 = (uint64_t*)&data_in[25 * count/32];
	uint64_t *data_in_11010 = (uint64_t*)&data_in[26 * count/32];
	uint64_t *data_in_11011 = (uint64_t*)&data_in[27 * count/32];
	uint64_t *data_in_11100 = (uint64_t*)&data_in[28 * count/32];
	uint64_t *data_in_11101 = (uint64_t*)&data_in[29 * count/32];
	uint64_t *data_in_11110 = (uint64_t*)&data_in[30 * count/32];
	uint64_t *data_in_11111 = (uint64_t*)&data_in[31 * count/32];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count/32; ++i)
	{
		int64_t offset = 8 * (__builtin_e2k_bitrevd(i) >> shift);
		*(__v2du*)((void*)data_out + offset +  0*16) = (__v2du){data_in_00000[i], data_in_10000[i]};
		*(__v2du*)((void*)data_out + offset +  1*16) = (__v2du){data_in_01000[i], data_in_11000[i]};
		*(__v2du*)((void*)data_out + offset +  2*16) = (__v2du){data_in_00100[i], data_in_10100[i]};
		*(__v2du*)((void*)data_out + offset +  3*16) = (__v2du){data_in_01100[i], data_in_11100[i]};
		*(__v2du*)((void*)data_out + offset +  4*16) = (__v2du){data_in_00010[i], data_in_10010[i]};
		*(__v2du*)((void*)data_out + offset +  5*16) = (__v2du){data_in_01010[i], data_in_11010[i]};
		*(__v2du*)((void*)data_out + offset +  6*16) = (__v2du){data_in_00110[i], data_in_10110[i]};
		*(__v2du*)((void*)data_out + offset +  7*16) = (__v2du){data_in_01110[i], data_in_11110[i]};
		*(__v2du*)((void*)data_out + offset +  8*16) = (__v2du){data_in_00001[i], data_in_10001[i]};
		*(__v2du*)((void*)data_out + offset +  9*16) = (__v2du){data_in_01001[i], data_in_11001[i]};
		*(__v2du*)((void*)data_out + offset + 10*16) = (__v2du){data_in_00101[i], data_in_10101[i]};
		*(__v2du*)((void*)data_out + offset + 11*16) = (__v2du){data_in_01101[i], data_in_11101[i]};
		*(__v2du*)((void*)data_out + offset + 12*16) = (__v2du){data_in_00011[i], data_in_10011[i]};
		*(__v2du*)((void*)data_out + offset + 13*16) = (__v2du){data_in_01011[i], data_in_11011[i]};
		*(__v2du*)((void*)data_out + offset + 14*16) = (__v2du){data_in_00111[i], data_in_10111[i]};
		*(__v2du*)((void*)data_out + offset + 15*16) = (__v2du){data_in_01111[i], data_in_11111[i]};
	}
}
Main loop in assembly
.L5406:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=17, incr=0, ind=0, asz=1, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=16, incr=0, ind=0, asz=1, abs=0, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=15, incr=0, ind=0, asz=1, abs=2, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=14, incr=0, ind=0, asz=1, abs=2, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=13, incr=0, ind=0, asz=1, abs=4, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=12, incr=0, ind=0, asz=1, abs=4, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=11, incr=0, ind=0, asz=1, abs=6, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=10, incr=0, ind=0, asz=1, abs=6, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=9, incr=0, ind=0, asz=1, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=8, incr=0, ind=0, asz=1, abs=8, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=7, incr=0, ind=0, asz=1, abs=10, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=6, incr=0, ind=0, asz=1, abs=10, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=5, incr=0, ind=0, asz=1, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=4, incr=0, ind=0, asz=1, abs=12, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=3, incr=0, ind=0, asz=1, abs=14, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=2, incr=0, ind=0, asz=1, abs=14, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=1, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=1, abs=16, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=1, abs=18, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=1, abs=18, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=1, abs=20, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=1, abs=20, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=1, abs=22, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=1, abs=22, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=1, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=1, abs=24, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=1, abs=26, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=1, abs=26, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=1, abs=28, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=1, abs=28, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=1, abs=30, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=1, abs=30, disp=0
        }
.L4875:
        {
          loop_mode
          qppackdl,1,sm %b[46], %b[47], %b[63]
          stqp,2        %r2, %b[61], %b[32]
          qppackdl,4,sm %b[38], %b[39], %b[62]
          stqp,5        %r4, %b[61], %b[51]
          movad,0       area=15, ind=0, am=1, be=0, %b[5]
          movad,1       area=14, ind=0, am=1, be=0, %b[13]
          movad,2       area=15, ind=0, am=1, be=0, %b[1]
          movad,3       area=14, ind=0, am=1, be=0, %b[9]
        }
        {
          loop_mode
          qppackdl,1,sm %b[20], %b[26], %b[65]
          stqp,2        %r5, %b[61], %b[63]
          qppackdl,4,sm %b[8], %b[14], %b[64]
          stqp,5        %r6, %b[61], %b[62]
          movad,0       area=13, ind=0, am=1, be=0, %b[25]
          movad,1       area=12, ind=0, am=1, be=0, %b[32]
          movad,2       area=13, ind=0, am=1, be=0, %b[17]
          movad,3       area=12, ind=0, am=1, be=0, %b[33]
        }
        {
          loop_mode
          qppackdl,1,sm %b[59], %b[60], %b[38]
          stqp,2        %r7, %b[61], %b[65]
          qppackdl,4,sm %b[55], %b[56], %b[39]
          stqp,5        %r9, %b[61], %b[64]
          movad,0       area=11, ind=0, am=1, be=0, %b[14]
          movad,1       area=10, ind=0, am=1, be=0, %b[26]
          movad,2       area=11, ind=0, am=1, be=0, %b[8]
          movad,3       area=10, ind=0, am=1, be=0, %b[20]
        }
        {
          loop_mode
          qppackdl,1,sm %b[57], %b[58], %b[47]
          stqp,2        %r10, %b[61], %b[40]
          qppackdl,4,sm %b[54], %b[53], %b[46]
          stqp,5        %r11, %b[61], %b[41]
          movad,0       area=9, ind=0, am=1, be=0, %b[51]
          movad,1       area=8, ind=0, am=1, be=0, %b[56]
          movad,2       area=9, ind=0, am=1, be=0, %b[52]
          movad,3       area=8, ind=0, am=1, be=0, %b[55]
        }
        {
          loop_mode
          addd,0,sm     %b[4], 0x1, %b[2]
          qppackdl,1,sm %b[22], %b[28], %b[40]
          stqp,2        %r12, %b[61], %b[49]
          bitrevd,3,sm  %b[4], %b[59]
          qppackdl,4,sm %b[10], %b[16], %b[41]
          stqp,5        %r13, %b[61], %b[48]
          movad,0       area=7, ind=0, am=1, be=0, %b[54]
          movad,1       area=6, ind=0, am=1, be=0, %b[58]
          movad,2       area=7, ind=0, am=1, be=0, %b[53]
          movad,3       area=6, ind=0, am=1, be=0, %b[57]
        }
        {
          loop_mode
          qppackdl,1,sm %b[35], %b[34], %b[48]
          stqp,2        %r14, %b[61], %b[42]
          shrd,3,sm     %b[59], %r0, %b[49]
          qppackdl,4,sm %b[19], %b[27], %b[28]
          stqp,5        %r15, %b[61], %b[43]
          movad,0       area=5, ind=0, am=1, be=0, %b[10]
          movad,1       area=4, ind=0, am=1, be=0, %b[22]
          movad,2       area=5, ind=0, am=1, be=0, %b[4]
          movad,3       area=4, ind=0, am=1, be=0, %b[16]
        }
        {
          loop_mode
          qppackdl,1,sm %b[9], %b[13], %b[19]
          stqp,2        %r16, %b[61], %b[50]
          shld,3,sm     %b[49], 0x3, %b[59]
          qppackdl,4,sm %b[1], %b[5], %b[27]
          stqp,5        %r17, %b[61], %b[30]
          movad,0       area=3, ind=0, am=1, be=0, %b[35]
          movad,1       area=2, ind=0, am=1, be=0, %b[43]
          movad,2       area=3, ind=0, am=1, be=0, %b[34]
          movad,3       area=2, ind=0, am=1, be=0, %b[42]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qppackdl,0,sm %b[11], %b[15], %b[30]
          stqp,2        %r18, %b[61], %b[23]
          qppackdl,3,sm %b[3], %b[7], %b[49]
          stqp,5        %r19, %b[61], %b[31]
          movad,0       area=1, ind=0, am=1, be=0, %b[5]
          movad,1       area=0, ind=0, am=1, be=0, %b[13]
          movad,2       area=1, ind=0, am=1, be=0, %b[1]
          movad,3       area=0, ind=0, am=1, be=0, %b[9]
        }

Theoretical speed: 32 complex numbers per 8 cycles (32/8 = 4 numbers per cycle, 8 bytes each) = 32 bytes/cycle

Speed measurements

We see a slowdown at the start of the graph and a speedup at its end.


Trying to "pseudo-unroll" by a factor of 64 produces much less efficient code. The APB can read from at most 32 streams, so to read from 64 streams the compiler falls back to ordinary ldd loads. As a result, the speed drops sharply.

Can it be made any faster?
The only idea that comes to mind is to read in 128-bit chunks instead of 64-bit ones.

11. reverse_radix2_x32x2

Let's try to increase the read speed of the reverse_radix2_x32 version.
In essence, this variant is a genuine 2x unroll.
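What changes compared with reverse_radix2_x32 is easiest to see in plain C. The sketch below is only an off-target model of the data movement: each stream is read two neighbouring elements at a time (one 128-bit load in the real code), and plain array indexing stands in for the shuffle step that picks the low or high half. `bit_reverse` replaces `__builtin_e2k_bitrevd(x) >> shift`, and the function names are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Portable stand-in for (bitrevd(x) >> (64 - bits)): reverse the low `bits` bits of x. */
static uint64_t bit_reverse(uint64_t x, int bits)
{
	uint64_t r = 0;
	for (int b = 0; b < bits; ++b) {
		r = (r << 1) | (x & 1);
		x >>= 1;
	}
	return r;
}

/* Model of the x32x2 data movement: 32 streams, two elements read from each per iteration.
   One 64-bit word stands for one complex number. */
static void reverse_x32x2_sketch(int bit_count, const uint64_t *in, uint64_t *out)
{
	assert(bit_count >= 6 && bit_count < 64);
	uint64_t block = (UINT64_C(1) << bit_count) / 32;	/* length of one input stream */

	for (uint64_t i = 0; i < block / 2; ++i) {
		uint64_t base0 = bit_reverse(2 * i + 0, bit_count);	/* destination of the low halves */
		uint64_t base1 = bit_reverse(2 * i + 1, bit_count);	/* destination of the high halves */
		for (uint64_t j = 0; j < 32; ++j) {
			const uint64_t *pair = &in[j * block + 2 * i];	/* one 128-bit read in the original */
			uint64_t slot = bit_reverse(j, 5);
			out[base0 + slot] = pair[0];
			out[base1 + slot] = pair[1];
		}
	}
}

/* Hypothetical self-check against an element-by-element bit reversal.
   Returns 1 on success, 0 on mismatch or allocation failure. */
static int check_reverse_x32x2(int bit_count)
{
	uint64_t count = UINT64_C(1) << bit_count;
	uint64_t *in = malloc(count * sizeof *in);
	uint64_t *out = malloc(count * sizeof *out);
	int ok = (in != NULL && out != NULL);

	if (ok) {
		for (uint64_t n = 0; n < count; ++n)
			in[n] = n;	/* tag every element with its input index */
		reverse_x32x2_sketch(bit_count, in, out);
		for (uint64_t n = 0; n < count; ++n)
			if (out[bit_reverse(n, bit_count)] != n)
				ok = 0;
	}
	free(in);
	free(out);
	return ok;
}
```

Since 2*i and 2*i + 1 differ only in the lowest bit, `base0` and `base1` differ only in the highest bit, so the two halves of every 128-bit read are written count/2 elements apart. Reads become wider, but the number of 128-bit stores per element stays the same.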

C code
void reverse_radix2_x32x2(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;

	__v2di *data_in_00000 = (__v2di*)&data_in[ 0 * count/32];
	__v2di *data_in_00001 = (__v2di*)&data_in[ 1 * count/32];
	__v2di *data_in_00010 = (__v2di*)&data_in[ 2 * count/32];
	__v2di *data_in_00011 = (__v2di*)&data_in[ 3 * count/32];
	__v2di *data_in_00100 = (__v2di*)&data_in[ 4 * count/32];
	__v2di *data_in_00101 = (__v2di*)&data_in[ 5 * count/32];
	__v2di *data_in_00110 = (__v2di*)&data_in[ 6 * count/32];
	__v2di *data_in_00111 = (__v2di*)&data_in[ 7 * count/32];
	__v2di *data_in_01000 = (__v2di*)&data_in[ 8 * count/32];
	__v2di *data_in_01001 = (__v2di*)&data_in[ 9 * count/32];
	__v2di *data_in_01010 = (__v2di*)&data_in[10 * count/32];
	__v2di *data_in_01011 = (__v2di*)&data_in[11 * count/32];
	__v2di *data_in_01100 = (__v2di*)&data_in[12 * count/32];
	__v2di *data_in_01101 = (__v2di*)&data_in[13 * count/32];
	__v2di *data_in_01110 = (__v2di*)&data_in[14 * count/32];
	__v2di *data_in_01111 = (__v2di*)&data_in[15 * count/32];
	__v2di *data_in_10000 = (__v2di*)&data_in[16 * count/32];
	__v2di *data_in_10001 = (__v2di*)&data_in[17 * count/32];
	__v2di *data_in_10010 = (__v2di*)&data_in[18 * count/32];
	__v2di *data_in_10011 = (__v2di*)&data_in[19 * count/32];
	__v2di *data_in_10100 = (__v2di*)&data_in[20 * count/32];
	__v2di *data_in_10101 = (__v2di*)&data_in[21 * count/32];
	__v2di *data_in_10110 = (__v2di*)&data_in[22 * count/32];
	__v2di *data_in_10111 = (__v2di*)&data_in[23 * count/32];
	__v2di *data_in_11000 = (__v2di*)&data_in[24 * count/32];
	__v2di *data_in_11001 = (__v2di*)&data_in[25 * count/32];
	__v2di *data_in_11010 = (__v2di*)&data_in[26 * count/32];
	__v2di *data_in_11011 = (__v2di*)&data_in[27 * count/32];
	__v2di *data_in_11100 = (__v2di*)&data_in[28 * count/32];
	__v2di *data_in_11101 = (__v2di*)&data_in[29 * count/32];
	__v2di *data_in_11110 = (__v2di*)&data_in[30 * count/32];
	__v2di *data_in_11111 = (__v2di*)&data_in[31 * count/32];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count/32/2; ++i)
	{
		int64_t offset0 = 8 * (__builtin_e2k_bitrevd(2*i + 0) >> shift);
		__v2di mask0 = {0x0706050403020100, 0x0706050403020100};
		*(__v2du*)((void*)data_out + offset0 +  0*16) = __builtin_e2k_qpshufb(data_in_10000[i], data_in_00000[i], mask0);
		*(__v2du*)((void*)data_out + offset0 +  1*16) = __builtin_e2k_qpshufb(data_in_11000[i], data_in_01000[i], mask0);
		*(__v2du*)((void*)data_out + offset0 +  2*16) = __builtin_e2k_qpshufb(data_in_10100[i], data_in_00100[i], mask0);
		*(__v2du*)((void*)data_out + offset0 +  3*16) = __builtin_e2k_qpshufb(data_in_11100[i], data_in_01100[i], mask0);
		*(__v2du*)((void*)data_out + offset0 +  4*16) = __builtin_e2k_qpshufb(data_in_10010[i], data_in_00010[i], mask0);
		*(__v2du*)((void*)data_out + offset0 +  5*16) = __builtin_e2k_qpshufb(data_in_11010[i], data_in_01010[i], mask0);
		*(__v2du*)((void*)data_out + offset0 +  6*16) = __builtin_e2k_qpshufb(data_in_10110[i], data_in_00110[i], mask0);
		*(__v2du*)((void*)data_out + offset0 +  7*16) = __builtin_e2k_qpshufb(data_in_11110[i], data_in_01110[i], mask0);
		*(__v2du*)((void*)data_out + offset0 +  8*16) = __builtin_e2k_qpshufb(data_in_10001[i], data_in_00001[i], mask0);
		*(__v2du*)((void*)data_out + offset0 +  9*16) = __builtin_e2k_qpshufb(data_in_11001[i], data_in_01001[i], mask0);
		*(__v2du*)((void*)data_out + offset0 + 10*16) = __builtin_e2k_qpshufb(data_in_10101[i], data_in_00101[i], mask0);
		*(__v2du*)((void*)data_out + offset0 + 11*16) = __builtin_e2k_qpshufb(data_in_11101[i], data_in_01101[i], mask0);
		*(__v2du*)((void*)data_out + offset0 + 12*16) = __builtin_e2k_qpshufb(data_in_10011[i], data_in_00011[i], mask0);
		*(__v2du*)((void*)data_out + offset0 + 13*16) = __builtin_e2k_qpshufb(data_in_11011[i], data_in_01011[i], mask0);
		*(__v2du*)((void*)data_out + offset0 + 14*16) = __builtin_e2k_qpshufb(data_in_10111[i], data_in_00111[i], mask0);
		*(__v2du*)((void*)data_out + offset0 + 15*16) = __builtin_e2k_qpshufb(data_in_11111[i], data_in_01111[i], mask0);

		int64_t offset1 = 8 * (__builtin_e2k_bitrevd(2*i + 1) >> shift);
		__v2di mask1 = {0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908};
		*(__v2du*)((void*)data_out + offset1 +  0*16) = __builtin_e2k_qpshufb(data_in_10000[i], data_in_00000[i], mask1);
		*(__v2du*)((void*)data_out + offset1 +  1*16) = __builtin_e2k_qpshufb(data_in_11000[i], data_in_01000[i], mask1);
		*(__v2du*)((void*)data_out + offset1 +  2*16) = __builtin_e2k_qpshufb(data_in_10100[i], data_in_00100[i], mask1);
		*(__v2du*)((void*)data_out + offset1 +  3*16) = __builtin_e2k_qpshufb(data_in_11100[i], data_in_01100[i], mask1);
		*(__v2du*)((void*)data_out + offset1 +  4*16) = __builtin_e2k_qpshufb(data_in_10010[i], data_in_00010[i], mask1);
		*(__v2du*)((void*)data_out + offset1 +  5*16) = __builtin_e2k_qpshufb(data_in_11010[i], data_in_01010[i], mask1);
		*(__v2du*)((void*)data_out + offset1 +  6*16) = __builtin_e2k_qpshufb(data_in_10110[i], data_in_00110[i], mask1);
		*(__v2du*)((void*)data_out + offset1 +  7*16) = __builtin_e2k_qpshufb(data_in_11110[i], data_in_01110[i], mask1);
		*(__v2du*)((void*)data_out + offset1 +  8*16) = __builtin_e2k_qpshufb(data_in_10001[i], data_in_00001[i], mask1);
		*(__v2du*)((void*)data_out + offset1 +  9*16) = __builtin_e2k_qpshufb(data_in_11001[i], data_in_01001[i], mask1);
		*(__v2du*)((void*)data_out + offset1 + 10*16) = __builtin_e2k_qpshufb(data_in_10101[i], data_in_00101[i], mask1);
		*(__v2du*)((void*)data_out + offset1 + 11*16) = __builtin_e2k_qpshufb(data_in_11101[i], data_in_01101[i], mask1);
		*(__v2du*)((void*)data_out + offset1 + 12*16) = __builtin_e2k_qpshufb(data_in_10011[i], data_in_00011[i], mask1);
		*(__v2du*)((void*)data_out + offset1 + 13*16) = __builtin_e2k_qpshufb(data_in_11011[i], data_in_01011[i], mask1);
		*(__v2du*)((void*)data_out + offset1 + 14*16) = __builtin_e2k_qpshufb(data_in_10111[i], data_in_00111[i], mask1);
		*(__v2du*)((void*)data_out + offset1 + 15*16) = __builtin_e2k_qpshufb(data_in_11111[i], data_in_01111[i], mask1);
	}
}
Main loop in assembly
.L7238:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=17, incr=0, ind=0, asz=1, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=16, incr=0, ind=0, asz=1, abs=0, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=15, incr=0, ind=0, asz=1, abs=2, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=14, incr=0, ind=0, asz=1, abs=2, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=13, incr=0, ind=0, asz=1, abs=4, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=12, incr=0, ind=0, asz=1, abs=4, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=11, incr=0, ind=0, asz=1, abs=6, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=10, incr=0, ind=0, asz=1, abs=6, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=9, incr=0, ind=0, asz=1, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=8, incr=0, ind=0, asz=1, abs=8, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=7, incr=0, ind=0, asz=1, abs=10, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=6, incr=0, ind=0, asz=1, abs=10, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=5, incr=0, ind=0, asz=1, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=4, incr=0, ind=0, asz=1, abs=12, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=3, incr=0, ind=0, asz=1, abs=14, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=2, incr=0, ind=0, asz=1, abs=14, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=1, incr=0, ind=0, asz=1, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=15, asz=1, abs=16, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=14, asz=1, abs=18, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=13, asz=1, abs=18, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=12, asz=1, abs=20, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=11, asz=1, abs=20, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=10, asz=1, abs=22, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=9, asz=1, abs=22, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=8, asz=1, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=1, abs=24, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=1, abs=26, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=1, abs=26, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=1, abs=28, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=1, abs=28, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=1, abs=30, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=1, asz=1, abs=30, disp=0
        }
.L5673:
        {
          loop_mode
          qpshufb,1,sm  %b[42], %b[41], %r0, %b[40]
          stqp,2        %r2, %b[4], %b[48]
          qpshufb,3,sm  %b[40], %b[39], %r0, %b[39]
          stqp,5        %r4, %b[4], %b[47]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[46], %b[45], %r1, %b[40]
          stqp,2        %r2, %g16, %b[40]
          qpshufb,4,sm  %b[44], %b[43], %r1, %b[39]
          stqp,5        %r4, %g16, %b[39]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[46], %b[45], %r0, %b[40]
          stqp,2        %r6, %b[4], %b[40]
          qpshufb,4,sm  %b[44], %b[43], %r0, %b[39]
          stqp,5        %r7, %b[4], %b[39]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[38], %b[37], %r0, %b[38]
          stqp,2        %r6, %g16, %b[40]
          qpshufb,4,sm  %b[38], %b[37], %r1, %b[37]
          stqp,5        %r7, %g16, %b[39]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[36], %b[35], %r0, %b[36]
          stqp,2        %r9, %g16, %b[38]
          qpshufb,4,sm  %b[36], %b[35], %r1, %b[35]
          stqp,5        %r9, %b[4], %b[37]
          movaqp,0      area=15, ind=0, am=1, be=0, %b[5]
          movaqp,1      area=14, ind=0, am=1, be=0, %b[9]
          movaqp,2      area=15, ind=0, am=1, be=0, %b[1]
          movaqp,3      area=14, ind=0, am=1, be=0, %b[6]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[34], %b[33], %r0, %b[34]
          stqp,2        %r10, %g16, %b[36]
          qpshufb,4,sm  %b[34], %b[33], %r1, %b[33]
          stqp,5        %r10, %b[4], %b[35]
          movaqp,0      area=13, ind=0, am=1, be=0, %b[13]
          movaqp,1      area=12, ind=0, am=1, be=0, %b[17]
          movaqp,2      area=13, ind=0, am=1, be=0, %b[10]
          movaqp,3      area=12, ind=0, am=1, be=0, %b[14]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[32], %b[30], %r0, %b[32]
          stqp,2        %r11, %g16, %b[34]
          qpshufb,4,sm  %b[32], %b[30], %r1, %b[30]
          stqp,5        %r11, %b[4], %b[33]
          movaqp,0      area=11, ind=0, am=1, be=0, %b[21]
          movaqp,1      area=10, ind=0, am=1, be=0, %b[25]
          movaqp,2      area=11, ind=0, am=1, be=0, %b[18]
          movaqp,3      area=10, ind=0, am=1, be=0, %b[22]
        }
        {
          loop_mode
          qpshufb,1,sm  %g17, %g18, %r0, %b[32]
          stqp,2        %r12, %g16, %b[32]
          qpshufb,4,sm  %g17, %g18, %r1, %b[30]
          stqp,5        %r12, %b[4], %b[30]
          movaqp,0      area=9, ind=0, am=1, be=0, %b[29]
          movaqp,1      area=8, ind=0, am=1, be=0, %g17
          movaqp,2      area=9, ind=0, am=1, be=0, %b[26]
          movaqp,3      area=8, ind=0, am=1, be=0, %g18
        }
        {
          loop_mode
          qpshufb,1,sm  %b[31], %b[28], %r0, %b[34]
          stqp,2        %r13, %g16, %b[32]
          qpshufb,4,sm  %b[31], %b[28], %r1, %b[33]
          stqp,5        %r13, %b[4], %b[30]
          movaqp,0      area=7, ind=0, am=1, be=0, %b[30]
          movaqp,1      area=6, ind=0, am=1, be=0, %b[32]
          movaqp,2      area=7, ind=0, am=1, be=0, %b[28]
          movaqp,3      area=6, ind=0, am=1, be=0, %b[31]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[27], %b[24], %r0, %b[38]
          stqp,2        %r14, %g16, %b[34]
          qpshufb,4,sm  %b[27], %b[24], %r1, %b[37]
          stqp,5        %r14, %b[4], %b[33]
          movaqp,0      area=5, ind=0, am=1, be=0, %b[34]
          movaqp,1      area=4, ind=0, am=1, be=0, %b[36]
          movaqp,2      area=5, ind=0, am=1, be=0, %b[33]
          movaqp,3      area=4, ind=0, am=1, be=0, %b[35]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[23], %b[20], %r0, %b[42]
          stqp,2        %r15, %g16, %b[38]
          qpshufb,4,sm  %b[23], %b[20], %r1, %b[41]
          stqp,5        %r15, %b[4], %b[37]
          movaqp,0      area=1, ind=0, am=1, be=0, %b[38]
          movaqp,1      area=0, ind=0, am=1, be=0, %b[40]
          movaqp,2      area=1, ind=0, am=1, be=0, %b[37]
          movaqp,3      area=0, ind=0, am=1, be=0, %b[39]
        }
        {
          loop_mode
          addd,0,sm     0x2, %b[2], %b[0]
          qpshufb,1,sm  %b[19], %b[16], %r0, %b[47]
          stqp,2        %r16, %g16, %b[42]
          addd,3,sm     %b[2], 0x1, %b[45]
          qpshufb,4,sm  %b[19], %b[16], %r1, %b[46]
          stqp,5        %r16, %b[4], %b[41]
          movaqp,0      area=3, ind=0, am=1, be=0, %b[42]
          movaqp,1      area=2, ind=0, am=1, be=0, %b[44]
          movaqp,2      area=3, ind=0, am=1, be=0, %b[41]
          movaqp,3      area=2, ind=0, am=1, be=0, %b[43]
        }
        {
          loop_mode
          bitrevd,0,sm  %b[2], %b[46]
          qpshufb,1,sm  %b[15], %b[12], %r0, %b[49]
          stqp,2        %r17, %g16, %b[47]
          bitrevd,3,sm  %b[45], %b[45]
          qpshufb,4,sm  %b[15], %b[12], %r1, %b[47]
          stqp,5        %r17, %b[4], %b[46]
        }
        {
          loop_mode
          shrd,0,sm     %b[46], %r5, %b[45]
          qpshufb,1,sm  %b[11], %b[8], %r0, %b[48]
          stqp,2        %r18, %g16, %b[49]
          shrd,3,sm     %b[45], %r5, %b[46]
          qpshufb,4,sm  %b[11], %b[8], %r1, %b[47]
          stqp,5        %r18, %b[4], %b[47]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[7], %b[3], %r0, %b[47]
          stqp,2        %r19, %g16, %b[48]
          shld,3,sm     %b[46], 0x3, %b[2]
          qpshufb,4,sm  %b[7], %b[3], %r1, %b[46]
          stqp,5        %r19, %b[4], %b[47]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qpshufb,0,sm  %b[40], %b[39], %r1, %b[46]
          stqp,2        %r20, %g16, %b[47]
          qpshufb,3,sm  %b[38], %b[37], %r1, %b[45]
          shld,4,sm     %b[45], 0x3, %g16
          stqp,5        %r20, %b[4], %b[46]
        }

Theoretical speed: 64 complex numbers per 16 cycles (64/16) = 32 bytes/cycle

Speed measurements

The graph shows a slowdown in its middle part.


Summary for reverse_radix2

The reverse_radix2_x32 variant can be considered the winner.

We will use it when implementing the Radix-2 FFT.


reverse_radix4

1. reverse_radix4_etalon

A reference variant used to check the other variants for correctness.
Here reverseNumber is computed with a loop.

int reverseNumber_radix4(int number, int bit_count)
{
	int answer = 0;

	for(int i = 0; i < bit_count/2; ++i)
	{
		answer <<= 2;
		answer |= number & 3;
		number >>= 2;
	}

	return answer;
}


void reverse_radix4_etalon(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;

	for(int64_t i = 0; i < count; ++i)
	{
		int index = reverseNumber_radix4(i, bit_count);
		data_out[index] = data_in[i];
	}
}

2. reverse_radix4

The processor has no ready-made instruction that performs the reverseNumber_radix4 operation.
So we execute the bitrevd instruction and then swap the two bits in each adjacent pair.
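
The identity behind this trick can be checked on any host with a portable sketch. Here `bitrev64` is a hypothetical software stand-in for `__builtin_e2k_bitrevd` (assumed to reverse all 64 bits), `digit_reverse4_loop` repeats the loop from reverseNumber_radix4, and `digit_reverse4_swap` is the "bit reversal plus adjacent-bit swap" path. The sketch assumes an even bit_count.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software stand-in for __builtin_e2k_bitrevd:
   reverses the order of all 64 bits. */
static uint64_t bitrev64(uint64_t x)
{
	uint64_t r = 0;
	for (int i = 0; i < 64; ++i) {
		r = (r << 1) | (x & 1);
		x >>= 1;
	}
	return r;
}

/* Base-4 digit reversal with a loop (same idea as reverseNumber_radix4). */
static uint64_t digit_reverse4_loop(uint64_t number, int bit_count)
{
	uint64_t answer = 0;
	for (int i = 0; i < bit_count / 2; ++i) {
		answer = (answer << 2) | (number & 3);
		number >>= 2;
	}
	return answer;
}

/* Base-4 digit reversal as a bit reversal followed by a swap of adjacent bits. */
static uint64_t digit_reverse4_swap(uint64_t number, int bit_count)
{
	assert(bit_count > 0 && bit_count <= 62 && bit_count % 2 == 0);
	uint64_t rev = bitrev64(number) >> (64 - bit_count);
	return ((rev << 1) & 0xAAAAAAAAAAAAAAAAull) | ((rev >> 1) & 0x5555555555555555ull);
}

/* Returns 1 if both methods agree for every index below 2^bit_count. */
static int digit_reverse4_methods_agree(int bit_count)
{
	for (uint64_t i = 0; i < (1ull << bit_count); ++i)
		if (digit_reverse4_loop(i, bit_count) != digit_reverse4_swap(i, bit_count))
			return 0;
	return 1;
}
```

For example, with bit_count = 4 the index 7 (base-4 digits 1, 3) maps to 13 (digits 3, 1) on both paths.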

Data movement scheme in memory
C code
void reverse_radix4(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count; ++i)
	{
		uint64_t rev = __builtin_e2k_bitrevd(i) >> shift;
		int64_t index = ((rev<<1) & 0xAAAAAAAAAAAAAAAA) | ((rev>>1) & 0x5555555555555555);
		data_out[index] = data_in[i];
	}
}
Main loop in assembly
.L1601:
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=5, abs=0, disp=0
        }
.L1398:
        {
          loop_mode
          shrd,2,sm     %b[16], %r0, %b[1]
          shld,4,sm     %b[17], 0x3, %b[18]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          bitrevd,0,sm  %b[2], %b[14]
          shr_andd,1,sm %b[1], 0x1, %r5, %b[5]
          addd,2,sm     %b[2], 0x1, %b[0]
          ord,3,sm      %b[13], %b[9], %b[15]
          shl_andd,4,sm %b[3], 0x1, %r4, %b[11]
          std,5         %r2, %b[18], %b[12]
          movad,1       area=0, ind=0, am=1, be=0, %b[4]
        }

Theoretical speed: 1 complex number per 2 cycles (1/2) = 4 bytes/cycle
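
The "theoretical speed" figures in this article are plain arithmetic: the number of elements handled by one pass through the loop body, times the element size, divided by the number of wide instructions (cycles) in the body. A minimal sketch, assuming 8 bytes per element (float complex); the helper name `bytes_per_cycle` is made up for illustration:

```c
#include <assert.h>

/* Bytes moved per cycle for a loop body that handles `elements` items
   of `element_size` bytes in `cycles` wide instructions. */
static double bytes_per_cycle(int elements, int cycles, int element_size)
{
	assert(elements > 0 && cycles > 0 && element_size > 0);
	return (double)elements * element_size / cycles;
}
```

With one 8-byte element per two-cycle loop body this gives 4 bytes/cycle. It is an upper bound that ignores memory stalls, which is why the measured graphs differ from it.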

Speed measurements

Note that the loop body could fit into a single cycle if the instructions were shuffled a bit.


3. reverse_radix4_oneTickVersion

Let's rewrite the code so that the loop body fits into one cycle (the separate shrd and shld instructions go away).
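
One way to see why the two extra shifts are not needed: the right shift by `shift` and the final left shift by 3 (the multiplication by the 8-byte element size) can be folded into the shift amounts of the two shift-and-mask operations, with the two masks trading places. The sketch below checks this on a host machine. `bitrev64` is a hypothetical stand-in for `__builtin_e2k_bitrevd`, and the check assumes an even bit_count between 4 and 60 and indices below 2^bit_count.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software stand-in for __builtin_e2k_bitrevd. */
static uint64_t bitrev64(uint64_t x)
{
	uint64_t r = 0;
	for (int i = 0; i < 64; ++i) {
		r = (r << 1) | (x & 1);
		x >>= 1;
	}
	return r;
}

/* Straightforward path: shift down, swap adjacent bits, scale by 8 bytes. */
static uint64_t offset_two_step(uint64_t i, int bit_count)
{
	uint64_t rev = bitrev64(i) >> (64 - bit_count);
	uint64_t index = ((rev << 1) & 0xAAAAAAAAAAAAAAAAull) | ((rev >> 1) & 0x5555555555555555ull);
	return index << 3;
}

/* Folded path: both extra shifts are merged into the shift-and-mask steps. */
static uint64_t offset_folded(uint64_t i, int bit_count)
{
	assert(bit_count >= 4 && bit_count <= 60 && bit_count % 2 == 0);
	int shift = 64 - bit_count;
	uint64_t rev = bitrev64(i);
	return ((rev >> (shift - 3 - 1)) & 0x5555555555555555ull)
	     | ((rev >> (shift - 3 + 1)) & 0xAAAAAAAAAAAAAAAAull);
}

/* Returns 1 if both paths give the same byte offset for every index. */
static int offsets_agree(int bit_count)
{
	for (uint64_t i = 0; i < (1ull << bit_count); ++i)
		if (offset_two_step(i, bit_count) != offset_folded(i, bit_count))
			return 0;
	return 1;
}
```

The masks swap because a left shift by 3 moves every even bit position to an odd one and vice versa.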

Data movement scheme in memory
C code
void reverse_radix4_oneTickVersion(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count; ++i)
	{
		uint64_t rev = __builtin_e2k_bitrevd(i);
		int64_t offset = ((rev>>(shift-3-1)) & 0x5555555555555555) | ((rev>>(shift-3+1)) & 0xAAAAAAAAAAAAAAAA);
		*(myComplex*)((void*)data_out + offset) = data_in[i];
	}
}
Main loop in assembly
.L1873:
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=5, abs=0, disp=0
        }
.L1686:
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          bitrevd,0,sm  %b[21], %b[14]
          shr_andd,1,sm %b[16], %r0, %r6, %b[1]
          addd,2,sm     %b[21], 0x1, %b[19]
          ord,3,sm      %b[17], %b[9], %b[20]
          shr_andd,4,sm %b[18], %r4, %r5, %b[11]
          std,5         %r2, %b[22], %b[12]
          movad,1       area=0, ind=0, am=1, be=0, %b[0]
        }

Theoretical speed: 1 complex number per 1 cycle (1/1) = 8 bytes/cycle

Speed measurements

The graph shows a speedup in its initial part.


4. reverse_radix4_x4_bad

Let's try to speed things up by manually unrolling the loop 4 times.

Data movement scheme in memory
C code
void reverse_radix4_x4_bad(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;

	myComplex *data_out_0 = &data_out[0 * count/4];
	myComplex *data_out_1 = &data_out[1 * count/4];
	myComplex *data_out_2 = &data_out[2 * count/4];
	myComplex *data_out_3 = &data_out[3 * count/4];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count; i += 4)
	{
		uint64_t rev = __builtin_e2k_bitrevd(i);
		int64_t index = ((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555);
		data_out_0[index] = data_in[i + 0];
		data_out_1[index] = data_in[i + 1];
		data_out_2[index] = data_in[i + 2];
		data_out_3[index] = data_in[i + 3];
	}
}
Main loop in assembly
.L2338:
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=16
        }
.L2049:
        {
          loop_mode
          bitrevd,0,sm  %b[29], %b[23]
          std,2         %r0, %b[26], %b[18]
          addd,3,sm     %b[29], 0x4, %b[27]
          shld,4,sm     %b[30], 0x3, %b[24]
          std,5         %r5, %b[26], %b[19]
          movad,0       area=0, ind=0, am=0, be=0, %b[1]
          movad,1       area=0, ind=8, am=1, be=0, %b[11]
          movad,2       area=0, ind=0, am=0, be=0, %b[0]
          movad,3       area=0, ind=8, am=1, be=0, %b[10]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          shr_andd,1,sm %b[23], %r7, %r9, %b[18]
          std,2         %r4, %b[26], %b[8]
          ord,3,sm      %b[21], %b[22], %b[28]
          shr_andd,4,sm %b[25], %r6, %r8, %b[19]
          std,5         %r2, %b[26], %b[9]
        }

Theoretical speed: 4 complex numbers per 2 cycles (4/2) = 16 bytes/cycle

Speed measurements

The graph shows a speedup in its initial part.

As we remember from reverse_radix2, writing to scattered memory locations performs worse than writing to adjacent ones.


5. reverse_radix4_x4_good

Let's try the opposite: read from different places and write to adjacent ones.

Data movement scheme in memory
C code
void reverse_radix4_x4_good(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;

	myComplex *data_in_0 = &data_in[0 * count/4];
	myComplex *data_in_1 = &data_in[1 * count/4];
	myComplex *data_in_2 = &data_in[2 * count/4];
	myComplex *data_in_3 = &data_in[3 * count/4];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count/4; ++i)
	{
		uint64_t rev = __builtin_e2k_bitrevd(i);
		int64_t index = ((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555);
		data_out[index + 0] = data_in_0[i];
		data_out[index + 1] = data_in_1[i];
		data_out[index + 2] = data_in_2[i];
		data_out[index + 3] = data_in_3[i];
	}
}
Main loop in assembly
.L2807:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=0, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=4, abs=16, disp=0
        }
.L2518:
        {
          loop_mode
          bitrevd,0,sm  %b[29], %b[23]
          std,2         %r5, %b[26], %b[19]
          addd,3,sm     %b[29], 0x1, %b[27]
          shld,4,sm     %b[30], 0x3, %b[24]
          std,5         %r0, %b[26], %b[18]
          movad,0       area=1, ind=0, am=1, be=0, %b[1]
          movad,1       area=0, ind=0, am=1, be=0, %b[0]
          movad,2       area=1, ind=0, am=1, be=0, %b[11]
          movad,3       area=0, ind=0, am=1, be=0, %b[10]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          shr_andd,1,sm %b[23], %r7, %r9, %b[18]
          std,2         %r4, %b[26], %b[9]
          ord,3,sm      %b[21], %b[22], %b[28]
          shr_andd,4,sm %b[25], %r6, %r8, %b[19]
          std,5         %b[26], %r2, %b[8]
        }

Theoretical speed: 4 complex numbers per 2 cycles (4/2) = 16 bytes/cycle

Speed measurements

The graph shows the desired speedup along its whole length.

Strictly speaking, this is not loop unrolling. The previous variant was the honest unrolling. Here the algorithm itself has changed (the data is processed in a different order). I could not come up with a name for this ("stream4"?), so all further "unrollings" will be called x4/x16 and so on.
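
The changed processing order can be stated as an index identity: the element at position k*count/4 + i of the input (stream k, step i) must land at row_index(i) + k in the output, where row_index(i) is the digit-reversed i with its two lowest bits equal to zero. Below is a portable sketch that checks this identity against the loop-based digit reversal. `bitrev64`, `row_index` and `gather_mapping_holds` are names invented for this illustration, and bit_count is assumed to be even.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software stand-in for __builtin_e2k_bitrevd. */
static uint64_t bitrev64(uint64_t x)
{
	uint64_t r = 0;
	for (int i = 0; i < 64; ++i) {
		r = (r << 1) | (x & 1);
		x >>= 1;
	}
	return r;
}

/* Loop-based base-4 digit reversal, as in the reference variant. */
static uint64_t digit_reverse4(uint64_t number, int bit_count)
{
	uint64_t answer = 0;
	for (int i = 0; i < bit_count / 2; ++i) {
		answer = (answer << 2) | (number & 3);
		number >>= 2;
	}
	return answer;
}

/* Start of the 4-element output row for step i (i < count/4). */
static uint64_t row_index(uint64_t i, int bit_count)
{
	assert(bit_count >= 2 && bit_count <= 60 && bit_count % 2 == 0);
	int shift = 64 - bit_count;
	uint64_t rev = bitrev64(i);
	return ((rev >> (shift - 1)) & 0xAAAAAAAAAAAAAAAAull)
	     | ((rev >> (shift + 1)) & 0x5555555555555555ull);
}

/* Returns 1 if stream k, step i always lands at row_index(i) + k. */
static int gather_mapping_holds(int bit_count)
{
	uint64_t quarter = (1ull << bit_count) / 4;
	for (uint64_t k = 0; k < 4; ++k)
		for (uint64_t i = 0; i < quarter; ++i)
			if (digit_reverse4(k * quarter + i, bit_count) != row_index(i, bit_count) + k)
				return 0;
	return 1;
}
```

The stream number k is the top base-4 digit of the input index, so after the reversal it becomes the lowest digit of the output index. That is what makes the four writes adjacent.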


6. reverse_radix4_x4_best

Instead of four 64-bit writes to memory, let's do two 128-bit writes.

Data movement scheme in memory
C code
void reverse_radix4_x4_best(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;

	uint64_t *data_in_0 = (uint64_t*)&data_in[0 * count/4];
	uint64_t *data_in_1 = (uint64_t*)&data_in[1 * count/4];
	uint64_t *data_in_2 = (uint64_t*)&data_in[2 * count/4];
	uint64_t *data_in_3 = (uint64_t*)&data_in[3 * count/4];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count/4; ++i)
	{
		uint64_t rev = __builtin_e2k_bitrevd(i);
		int64_t offset = 8 * (((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555));
		*(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_0[i], data_in_1[i]};
		*(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_2[i], data_in_3[i]};
	}
}
Main loop in assembly
.L3099:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=0, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=4, abs=16, disp=0
        }
.L2975:
        {
          loop_mode
          qppackdl,0,sm %b[10], %b[16], %b[9]
          shr_andd,1,sm %b[23], %r5, %r7, %b[0]
          qppackdl,3,sm %b[21], %b[22], %b[5]
          shr_andd,4,sm %b[25], %r4, %r6, %b[13]
          ord,5,sm      %b[15], %b[4], %b[26]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          bitrevd,0,sm  %b[3], %b[21]
          stqp,2        %r0, %b[24], %b[11]
          addd,3,sm     %b[3], 0x1, %b[1]
          shld,4,sm     %b[26], 0x3, %b[22]
          stqp,5        %r2, %b[24], %b[7]
          movad,0       area=1, ind=0, am=1, be=0, %b[10]
          movad,1       area=0, ind=0, am=1, be=0, %b[16]
          movad,2       area=1, ind=0, am=1, be=0, %b[4]
          movad,3       area=0, ind=0, am=1, be=0, %b[15]
        }

Theoretical speed: 4 complex numbers per 2 cycles (4/2) = 16 bytes/cycle

Speed measurements

The graph shows a strong speedup.

From now on we will always write to memory in 128-bit chunks.


7. reverse_radix4_x16

Let's continue the "pseudo-unrolling".

Data movement scheme in memory
C code
void reverse_radix4_x16(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;

	uint64_t *data_in_00 = (uint64_t*)&data_in[ 0 * count/16];
	uint64_t *data_in_01 = (uint64_t*)&data_in[ 1 * count/16];
	uint64_t *data_in_02 = (uint64_t*)&data_in[ 2 * count/16];
	uint64_t *data_in_03 = (uint64_t*)&data_in[ 3 * count/16];
	uint64_t *data_in_10 = (uint64_t*)&data_in[ 4 * count/16];
	uint64_t *data_in_11 = (uint64_t*)&data_in[ 5 * count/16];
	uint64_t *data_in_12 = (uint64_t*)&data_in[ 6 * count/16];
	uint64_t *data_in_13 = (uint64_t*)&data_in[ 7 * count/16];
	uint64_t *data_in_20 = (uint64_t*)&data_in[ 8 * count/16];
	uint64_t *data_in_21 = (uint64_t*)&data_in[ 9 * count/16];
	uint64_t *data_in_22 = (uint64_t*)&data_in[10 * count/16];
	uint64_t *data_in_23 = (uint64_t*)&data_in[11 * count/16];
	uint64_t *data_in_30 = (uint64_t*)&data_in[12 * count/16];
	uint64_t *data_in_31 = (uint64_t*)&data_in[13 * count/16];
	uint64_t *data_in_32 = (uint64_t*)&data_in[14 * count/16];
	uint64_t *data_in_33 = (uint64_t*)&data_in[15 * count/16];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count/16; ++i)
	{
		uint64_t rev = __builtin_e2k_bitrevd(i);
		int64_t offset = 8 * (((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555));
		*(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_00[i], data_in_10[i]};
		*(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_20[i], data_in_30[i]};
		*(__v2du*)((void*)data_out + offset + 2*16) = (__v2du){data_in_01[i], data_in_11[i]};
		*(__v2du*)((void*)data_out + offset + 3*16) = (__v2du){data_in_21[i], data_in_31[i]};
		*(__v2du*)((void*)data_out + offset + 4*16) = (__v2du){data_in_02[i], data_in_12[i]};
		*(__v2du*)((void*)data_out + offset + 5*16) = (__v2du){data_in_22[i], data_in_32[i]};
		*(__v2du*)((void*)data_out + offset + 6*16) = (__v2du){data_in_03[i], data_in_13[i]};
		*(__v2du*)((void*)data_out + offset + 7*16) = (__v2du){data_in_23[i], data_in_33[i]};
	}
}
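
The order of the sixteen streams in the stores above follows a simple rule: the output slot of a stream is the base-4 digit reversal of the stream number. For a stream named data_in_ab (stream number 4*a + b) the slot is a + 4*b. Here is a small portable sketch of that rule; `stream_slot` and `slots16_match_names` are hypothetical helper names, not part of the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the base-4 digits of `number`, treated as bit_count/2 digits. */
static uint64_t digit_reverse4(uint64_t number, int bit_count)
{
	uint64_t answer = 0;
	for (int i = 0; i < bit_count / 2; ++i) {
		answer = (answer << 2) | (number & 3);
		number >>= 2;
	}
	return answer;
}

/* Output slot (in elements from the row start) of input stream s
   when the input is split into 2^stream_bits equal streams. */
static uint64_t stream_slot(uint64_t s, int stream_bits)
{
	assert(stream_bits > 0 && stream_bits % 2 == 0);
	return digit_reverse4(s, stream_bits);
}

/* Returns 1 if, for 16 streams, stream 4*a + b always gets slot a + 4*b. */
static int slots16_match_names(void)
{
	for (uint64_t a = 0; a < 4; ++a)
		for (uint64_t b = 0; b < 4; ++b)
			if (stream_slot(4 * a + b, 4) != a + 4 * b)
				return 0;
	return 1;
}
```

For instance, data_in_10 (stream 4) goes to slot 1, right next to data_in_00 in slot 0, which is why the two share one 128-bit store.
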
Main loop in assembly
.L3848:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=2, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=2, abs=0, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=2, abs=4, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=2, abs=4, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=2, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=2, abs=8, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=2, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=2, abs=12, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=2, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=2, abs=16, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=2, abs=20, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=2, abs=20, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=2, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=2, abs=24, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=2, abs=28, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=2, abs=28, disp=0
        }
.L3565:
        {
          loop_mode
          qppackdl,0,sm %b[51], %b[58], %b[1]
          stqp,2        %r0, %b[57], %b[42]
          qppackdl,3,sm %b[35], %b[50], %b[6]
          shr_andd,4,sm %b[60], %r12, %r15, %b[29]
          stqp,5        %r2, %b[57], %b[43]
          movad,0       area=7, ind=0, am=1, be=0, %b[18]
          movad,1       area=6, ind=0, am=1, be=0, %b[26]
          movad,2       area=7, ind=0, am=1, be=0, %b[13]
          movad,3       area=6, ind=0, am=1, be=0, %b[21]
        }
        {
          loop_mode
          ord,0,sm      %b[61], %b[31], %b[59]
          qppackdl,1,sm %b[54], %b[55], %b[35]
          stqp,2        %r5, %b[57], %b[5]
          qppackdl,4,sm %b[46], %b[47], %b[34]
          stqp,5        %r6, %b[57], %b[10]
          movad,0       area=5, ind=0, am=1, be=0, %b[43]
          movad,1       area=4, ind=0, am=1, be=0, %b[51]
          movad,2       area=5, ind=0, am=1, be=0, %b[42]
          movad,3       area=4, ind=0, am=1, be=0, %b[50]
        }
        {
          loop_mode
          shld,0,sm     %b[59], 0x3, %b[55]
          qppackdl,1,sm %b[23], %b[28], %b[5]
          stqp,2        %r7, %b[57], %b[39]
          addd,3,sm     %b[4], 0x1, %b[2]
          qppackdl,4,sm %b[15], %b[20], %b[10]
          stqp,5        %r9, %b[57], %b[38]
          movad,0       area=3, ind=0, am=1, be=0, %b[46]
          movad,1       area=2, ind=0, am=1, be=0, %b[54]
          movad,2       area=3, ind=0, am=1, be=0, %b[31]
          movad,3       area=2, ind=0, am=1, be=0, %b[47]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qppackdl,0,sm %b[19], %b[24], %b[38]
          shr_andd,1,sm %b[60], %r13, %r14, %b[59]
          stqp,2        %r10, %b[57], %b[11]
          qppackdl,3,sm %b[27], %b[32], %b[39]
          bitrevd,4,sm  %b[4], %b[58]
          stqp,5        %r11, %b[57], %b[16]
          movad,0       area=1, ind=0, am=1, be=0, %b[20]
          movad,1       area=0, ind=0, am=1, be=0, %b[28]
          movad,2       area=1, ind=0, am=1, be=0, %b[15]
          movad,3       area=0, ind=0, am=1, be=0, %b[23]
        }

Theoretical speed: 16 complex numbers per 4 cycles (16/4) = 32 bytes/cycle

Speed measurements

The graph shows a strong speedup.


An attempt to "pseudo-unroll" 64 times produces sharply less efficient code. APB can read from at most 32 streams, so to read from 64 streams the compiler inserts ordinary ldd load operations. As a result, the speed drops sharply.

Let's try reading in 128-bit chunks instead of 64-bit ones.

8. reverse_radix4_x16x2

Let's try to increase the read speed of the reverse_radix4_x16 version.
In essence, this variant is an honest 2x unrolling.
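
Each 128-bit read now brings in two consecutive elements of a stream, so every pair of reads has to be split into an "even" output vector and an "odd" one. The portable sketch below emulates a two-source byte shuffle to show the idea. It rests on an assumption about the semantics of `__builtin_e2k_qpshufb` that is not taken from the documentation: the low 5 bits of each mask byte select one of 32 bytes, where bytes 0..15 come from the second operand and bytes 16..31 from the first. All type and function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t b[16]; } vec128;   /* hypothetical 128-bit value */

/* Build a vector from two 64-bit halves (byte 0 is the lowest byte of lo). */
static vec128 make_vec(uint64_t lo, uint64_t hi)
{
	vec128 v;
	for (int i = 0; i < 8; ++i) {
		v.b[i] = (uint8_t)(lo >> (8 * i));
		v.b[8 + i] = (uint8_t)(hi >> (8 * i));
	}
	return v;
}

/* Read back one 64-bit half (half = 0 for low, 1 for high). */
static uint64_t vec_half(vec128 v, int half)
{
	uint64_t x = 0;
	for (int i = 0; i < 8; ++i)
		x |= (uint64_t)v.b[8 * half + i] << (8 * i);
	return x;
}

/* Assumed shuffle model: indices 0..15 pick from src2, 16..31 from src1. */
static vec128 shuffle2(vec128 src1, vec128 src2, vec128 mask)
{
	vec128 r;
	for (int i = 0; i < 16; ++i) {
		unsigned sel = mask.b[i] & 0x1F;
		r.b[i] = (sel < 16) ? src2.b[sel] : src1.b[sel - 16];
	}
	return r;
}

/* { low element of a, low element of b } */
static vec128 pick_even(vec128 a, vec128 b)
{
	return shuffle2(b, a, make_vec(0x0706050403020100ull, 0x1716151413121110ull));
}

/* { high element of a, high element of b } */
static vec128 pick_odd(vec128 a, vec128 b)
{
	return shuffle2(b, a, make_vec(0x0F0E0D0C0B0A0908ull, 0x1F1E1D1C1B1A1918ull));
}
```

Under this model the two halves of a mask must refer to different sources. A mask with identical halves would produce a vector with two copies of the same element.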

C code
void reverse_radix4_x16x2(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;

	__v2di *data_in_00 = (__v2di*)&data_in[ 0 * count/16];
	__v2di *data_in_01 = (__v2di*)&data_in[ 1 * count/16];
	__v2di *data_in_02 = (__v2di*)&data_in[ 2 * count/16];
	__v2di *data_in_03 = (__v2di*)&data_in[ 3 * count/16];
	__v2di *data_in_10 = (__v2di*)&data_in[ 4 * count/16];
	__v2di *data_in_11 = (__v2di*)&data_in[ 5 * count/16];
	__v2di *data_in_12 = (__v2di*)&data_in[ 6 * count/16];
	__v2di *data_in_13 = (__v2di*)&data_in[ 7 * count/16];
	__v2di *data_in_20 = (__v2di*)&data_in[ 8 * count/16];
	__v2di *data_in_21 = (__v2di*)&data_in[ 9 * count/16];
	__v2di *data_in_22 = (__v2di*)&data_in[10 * count/16];
	__v2di *data_in_23 = (__v2di*)&data_in[11 * count/16];
	__v2di *data_in_30 = (__v2di*)&data_in[12 * count/16];
	__v2di *data_in_31 = (__v2di*)&data_in[13 * count/16];
	__v2di *data_in_32 = (__v2di*)&data_in[14 * count/16];
	__v2di *data_in_33 = (__v2di*)&data_in[15 * count/16];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count/16/2; ++i)
	{
		uint64_t rev0 = __builtin_e2k_bitrevd(2*i+0);
		int64_t offset0 = 8 * (((rev0>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev0>>(shift+1)) & 0x5555555555555555));
		__v2di mask0 = {0x0706050403020100, 0x1716151413121110};
		*(__v2du*)((void*)data_out + offset0 + 0*16) = __builtin_e2k_qpshufb(data_in_10[i], data_in_00[i], mask0);
		*(__v2du*)((void*)data_out + offset0 + 1*16) = __builtin_e2k_qpshufb(data_in_30[i], data_in_20[i], mask0);
		*(__v2du*)((void*)data_out + offset0 + 2*16) = __builtin_e2k_qpshufb(data_in_11[i], data_in_01[i], mask0);
		*(__v2du*)((void*)data_out + offset0 + 3*16) = __builtin_e2k_qpshufb(data_in_31[i], data_in_21[i], mask0);
		*(__v2du*)((void*)data_out + offset0 + 4*16) = __builtin_e2k_qpshufb(data_in_12[i], data_in_02[i], mask0);
		*(__v2du*)((void*)data_out + offset0 + 5*16) = __builtin_e2k_qpshufb(data_in_32[i], data_in_22[i], mask0);
		*(__v2du*)((void*)data_out + offset0 + 6*16) = __builtin_e2k_qpshufb(data_in_13[i], data_in_03[i], mask0);
		*(__v2du*)((void*)data_out + offset0 + 7*16) = __builtin_e2k_qpshufb(data_in_33[i], data_in_23[i], mask0);

		uint64_t rev1 = __builtin_e2k_bitrevd(2*i+1);
		int64_t offset1 = 8 * (((rev1>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev1>>(shift+1)) & 0x5555555555555555));
		__v2di mask1 = {0x0F0E0D0C0B0A0908, 0x1F1E1D1C1B1A1918};
		*(__v2du*)((void*)data_out + offset1 + 0*16) = __builtin_e2k_qpshufb(data_in_10[i], data_in_00[i], mask1);
		*(__v2du*)((void*)data_out + offset1 + 1*16) = __builtin_e2k_qpshufb(data_in_30[i], data_in_20[i], mask1);
		*(__v2du*)((void*)data_out + offset1 + 2*16) = __builtin_e2k_qpshufb(data_in_11[i], data_in_01[i], mask1);
		*(__v2du*)((void*)data_out + offset1 + 3*16) = __builtin_e2k_qpshufb(data_in_31[i], data_in_21[i], mask1);
		*(__v2du*)((void*)data_out + offset1 + 4*16) = __builtin_e2k_qpshufb(data_in_12[i], data_in_02[i], mask1);
		*(__v2du*)((void*)data_out + offset1 + 5*16) = __builtin_e2k_qpshufb(data_in_32[i], data_in_22[i], mask1);
		*(__v2du*)((void*)data_out + offset1 + 6*16) = __builtin_e2k_qpshufb(data_in_13[i], data_in_03[i], mask1);
		*(__v2du*)((void*)data_out + offset1 + 7*16) = __builtin_e2k_qpshufb(data_in_33[i], data_in_23[i], mask1);
	}
}
Main loop in assembly
.L4839:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=1, incr=0, ind=0, asz=2, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=15, asz=2, abs=0, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=14, asz=2, abs=4, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=13, asz=2, abs=4, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=12, asz=2, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=11, asz=2, abs=8, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=10, asz=2, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=9, asz=2, abs=12, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=8, asz=2, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=2, abs=16, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=2, abs=20, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=2, abs=20, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=2, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=2, abs=24, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=2, abs=28, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=1, asz=2, abs=28, disp=0
        }
.L3987:
        {
          loop_mode
          qpshufb,0,sm  %b[26], %b[21], %r12, %b[51]
          shr_andd,1,sm %b[23], %r14, %r17, %b[1]
          stqp,2        %r2, %b[4], %b[13]
          qpshufb,3,sm  %b[45], %b[42], %r12, %b[50]
          shr_andd,4,sm %b[23], %r15, %r16, %b[5]
          stqp,5        %r0, %b[4], %b[12]
        }
        {
          loop_mode
          qpshufb,0,sm  %b[35], %b[32], %r12, %b[54]
          shr_andd,1,sm %b[3], %r14, %r17, %b[23]
          stqp,2        %r5, %b[4], %b[51]
          qpshufb,3,sm  %b[29], %b[11], %r12, %b[53]
          ord,4,sm      %b[7], %b[25], %b[52]
          stqp,5        %r6, %b[4], %b[50]
          movaqp,0      area=7, ind=0, am=1, be=0, %b[12]
          movaqp,1      area=6, ind=0, am=1, be=0, %b[18]
          movaqp,2      area=7, ind=0, am=1, be=0, %b[6]
          movaqp,3      area=6, ind=0, am=1, be=0, %b[13]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[22], %b[17], %r12, %b[55]
          stqp,2        %r7, %b[4], %b[54]
          qpshufb,3,sm  %b[22], %b[17], %r13, %b[51]
          shld,4,sm     %b[52], 0x3, %b[50]
          stqp,5        %r9, %b[4], %b[53]
          movaqp,0      area=5, ind=0, am=1, be=0, %b[25]
          movaqp,1      area=4, ind=0, am=1, be=0, %b[31]
          movaqp,2      area=5, ind=0, am=1, be=0, %b[7]
          movaqp,3      area=4, ind=0, am=1, be=0, %b[28]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[45], %b[42], %r13, %b[53]
          stqp,2        %r10, %b[4], %b[55]
          qpshufb,4,sm  %b[35], %b[32], %r13, %b[52]
          stqp,5        %r10, %b[50], %b[51]
          movaqp,0      area=3, ind=0, am=1, be=0, %b[41]
          movaqp,1      area=2, ind=0, am=1, be=0, %b[22]
          movaqp,2      area=3, ind=0, am=1, be=0, %b[38]
          movaqp,3      area=2, ind=0, am=1, be=0, %b[17]
        }
        {
          loop_mode
          addd,0,sm     %b[2], 0x1, %b[48]
          qpshufb,1,sm  %b[49], %b[46], %r13, %b[54]
          stqp,2        %r6, %b[50], %b[53]
          addd,3,sm     0x2, %b[2], %b[0]
          qpshufb,4,sm  %b[26], %b[21], %r13, %b[51]
          stqp,5        %r7, %b[50], %b[52]
          movaqp,0      area=1, ind=0, am=1, be=0, %b[45]
          movaqp,1      area=0, ind=0, am=1, be=0, %b[35]
          movaqp,2      area=1, ind=0, am=1, be=0, %b[42]
          movaqp,3      area=0, ind=0, am=1, be=0, %b[32]
        }
        {
          loop_mode
          bitrevd,0,sm  %b[2], %b[21]
          qpshufb,1,sm  %b[39], %b[36], %r13, %b[52]
          stqp,2        %r0, %b[50], %b[54]
          ord,3,sm      %b[5], %b[1], %b[26]
          qpshufb,4,sm  %b[16], %b[10], %r13, %b[49]
          stqp,5        %r5, %b[50], %b[51]
        }
        {
          loop_mode
          bitrevd,0,sm  %b[48], %b[1]
          qpshufb,1,sm  %b[16], %b[10], %r12, %b[53]
          stqp,2        %r2, %b[50], %b[52]
          shld,3,sm     %b[26], 0x3, %b[2]
          qpshufb,4,sm  %b[29], %b[11], %r13, %b[51]
          stqp,5        %r11, %b[50], %b[49]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qpshufb,0,sm  %b[37], %b[34], %r12, %b[11]
          stqp,2        %r11, %b[4], %b[53]
          qpshufb,3,sm  %b[47], %b[44], %r12, %b[10]
          shr_andd,4,sm %b[3], %r15, %r16, %b[5]
          stqp,5        %r9, %b[50], %b[51]
        }

Theoretical speed: 32 complex numbers per 8 cycles (32/8) = 32 bytes/cycle

Speed measurements

The graph shows a slowdown in its middle part.


A 32x unrolling is also possible. To get it, we write the 64x version and process one half of the rows in one loop and the other half in a second loop. Each loop uses 32 APB read streams.
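
The split into two loops can be sketched in portable C. The input is treated as 64 streams and each output row holds 64 elements: the first pass fills slots 0..31 of every row, the second pass fills slots 32..63, so each pass reads from only 32 streams. This is an illustration rather than the optimized code: `cplx` stands in for myComplex, `bitrev64` for `__builtin_e2k_bitrevd`, and the inner loop over slots replaces the hand-written stores. bit_count is assumed to be even and at least 6.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { float re, im; } cplx;   /* hypothetical stand-in for myComplex */

static uint64_t bitrev64(uint64_t x)
{
	uint64_t r = 0;
	for (int i = 0; i < 64; ++i) {
		r = (r << 1) | (x & 1);
		x >>= 1;
	}
	return r;
}

static uint64_t digit_reverse4(uint64_t number, int bit_count)
{
	uint64_t answer = 0;
	for (int i = 0; i < bit_count / 2; ++i) {
		answer = (answer << 2) | (number & 3);
		number >>= 2;
	}
	return answer;
}

static void reverse4_two_passes(int bit_count, const cplx *in, cplx *out)
{
	const int stream_bits = 6;                      /* 64 input streams */
	assert(bit_count >= stream_bits && bit_count <= 60 && bit_count % 2 == 0);
	uint64_t per_stream = (1ull << bit_count) >> stream_bits;
	int shift = 64 - bit_count;

	for (uint64_t pass = 0; pass < 2; ++pass) {
		for (uint64_t i = 0; i < per_stream; ++i) {
			uint64_t rev = bitrev64(i);
			uint64_t index = ((rev >> (shift - 1)) & 0xAAAAAAAAAAAAAAAAull)
			               | ((rev >> (shift + 1)) & 0x5555555555555555ull);
			/* This pass touches only 32 of the 64 streams. */
			for (uint64_t slot = 32 * pass; slot < 32 * pass + 32; ++slot) {
				uint64_t s = digit_reverse4(slot, stream_bits);
				out[index + slot] = in[s * per_stream + i];
			}
		}
	}
}

/* Returns 1 if the two-pass version matches the plain digit-reversal copy. */
static int two_passes_match_reference(int bit_count)
{
	static cplx in[1024], ref[1024], got[1024];
	uint64_t count = 1ull << bit_count;
	assert(count <= 1024);
	for (uint64_t i = 0; i < count; ++i) {
		in[i].re = (float)i;
		in[i].im = (float)(2 * i + 1);
	}
	for (uint64_t i = 0; i < count; ++i)
		ref[digit_reverse4(i, bit_count)] = in[i];
	reverse4_two_passes(bit_count, in, got);
	return memcmp(ref, got, count * sizeof(cplx)) == 0;
}
```

The price of this scheme is that every output row is visited twice, once per pass.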

9. reverse_radix4_x32

Let's do the 32x "pseudo-unrolling" with two loops.

C code
void reverse_radix4_x32(int bit_count, myComplex *data_in, myComplex *data_out)
{
	int count = 1 << bit_count;
	int shift = 64 - bit_count;

	uint64_t *data_in_000 = (uint64_t*)&data_in[ 0 * count/64];
	uint64_t *data_in_001 = (uint64_t*)&data_in[ 1 * count/64];
	uint64_t *data_in_002 = (uint64_t*)&data_in[ 2 * count/64];
	uint64_t *data_in_003 = (uint64_t*)&data_in[ 3 * count/64];
	uint64_t *data_in_010 = (uint64_t*)&data_in[ 4 * count/64];
	uint64_t *data_in_011 = (uint64_t*)&data_in[ 5 * count/64];
	uint64_t *data_in_012 = (uint64_t*)&data_in[ 6 * count/64];
	uint64_t *data_in_013 = (uint64_t*)&data_in[ 7 * count/64];
	uint64_t *data_in_020 = (uint64_t*)&data_in[ 8 * count/64];
	uint64_t *data_in_021 = (uint64_t*)&data_in[ 9 * count/64];
	uint64_t *data_in_022 = (uint64_t*)&data_in[10 * count/64];
	uint64_t *data_in_023 = (uint64_t*)&data_in[11 * count/64];
	uint64_t *data_in_030 = (uint64_t*)&data_in[12 * count/64];
	uint64_t *data_in_031 = (uint64_t*)&data_in[13 * count/64];
	uint64_t *data_in_032 = (uint64_t*)&data_in[14 * count/64];
	uint64_t *data_in_033 = (uint64_t*)&data_in[15 * count/64];
	uint64_t *data_in_100 = (uint64_t*)&data_in[16 * count/64];
	uint64_t *data_in_101 = (uint64_t*)&data_in[17 * count/64];
	uint64_t *data_in_102 = (uint64_t*)&data_in[18 * count/64];
	uint64_t *data_in_103 = (uint64_t*)&data_in[19 * count/64];
	uint64_t *data_in_110 = (uint64_t*)&data_in[20 * count/64];
	uint64_t *data_in_111 = (uint64_t*)&data_in[21 * count/64];
	uint64_t *data_in_112 = (uint64_t*)&data_in[22 * count/64];
	uint64_t *data_in_113 = (uint64_t*)&data_in[23 * count/64];
	uint64_t *data_in_120 = (uint64_t*)&data_in[24 * count/64];
	uint64_t *data_in_121 = (uint64_t*)&data_in[25 * count/64];
	uint64_t *data_in_122 = (uint64_t*)&data_in[26 * count/64];
	uint64_t *data_in_123 = (uint64_t*)&data_in[27 * count/64];
	uint64_t *data_in_130 = (uint64_t*)&data_in[28 * count/64];
	uint64_t *data_in_131 = (uint64_t*)&data_in[29 * count/64];
	uint64_t *data_in_132 = (uint64_t*)&data_in[30 * count/64];
	uint64_t *data_in_133 = (uint64_t*)&data_in[31 * count/64];
	uint64_t *data_in_200 = (uint64_t*)&data_in[32 * count/64];
	uint64_t *data_in_201 = (uint64_t*)&data_in[33 * count/64];
	uint64_t *data_in_202 = (uint64_t*)&data_in[34 * count/64];
	uint64_t *data_in_203 = (uint64_t*)&data_in[35 * count/64];
	uint64_t *data_in_210 = (uint64_t*)&data_in[36 * count/64];
	uint64_t *data_in_211 = (uint64_t*)&data_in[37 * count/64];
	uint64_t *data_in_212 = (uint64_t*)&data_in[38 * count/64];
	uint64_t *data_in_213 = (uint64_t*)&data_in[39 * count/64];
	uint64_t *data_in_220 = (uint64_t*)&data_in[40 * count/64];
	uint64_t *data_in_221 = (uint64_t*)&data_in[41 * count/64];
	uint64_t *data_in_222 = (uint64_t*)&data_in[42 * count/64];
	uint64_t *data_in_223 = (uint64_t*)&data_in[43 * count/64];
	uint64_t *data_in_230 = (uint64_t*)&data_in[44 * count/64];
	uint64_t *data_in_231 = (uint64_t*)&data_in[45 * count/64];
	uint64_t *data_in_232 = (uint64_t*)&data_in[46 * count/64];
	uint64_t *data_in_233 = (uint64_t*)&data_in[47 * count/64];
	uint64_t *data_in_300 = (uint64_t*)&data_in[48 * count/64];
	uint64_t *data_in_301 = (uint64_t*)&data_in[49 * count/64];
	uint64_t *data_in_302 = (uint64_t*)&data_in[50 * count/64];
	uint64_t *data_in_303 = (uint64_t*)&data_in[51 * count/64];
	uint64_t *data_in_310 = (uint64_t*)&data_in[52 * count/64];
	uint64_t *data_in_311 = (uint64_t*)&data_in[53 * count/64];
	uint64_t *data_in_312 = (uint64_t*)&data_in[54 * count/64];
	uint64_t *data_in_313 = (uint64_t*)&data_in[55 * count/64];
	uint64_t *data_in_320 = (uint64_t*)&data_in[56 * count/64];
	uint64_t *data_in_321 = (uint64_t*)&data_in[57 * count/64];
	uint64_t *data_in_322 = (uint64_t*)&data_in[58 * count/64];
	uint64_t *data_in_323 = (uint64_t*)&data_in[59 * count/64];
	uint64_t *data_in_330 = (uint64_t*)&data_in[60 * count/64];
	uint64_t *data_in_331 = (uint64_t*)&data_in[61 * count/64];
	uint64_t *data_in_332 = (uint64_t*)&data_in[62 * count/64];
	uint64_t *data_in_333 = (uint64_t*)&data_in[63 * count/64];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count/32/2; ++i)
	{
		uint64_t rev = __builtin_e2k_bitrevd(i);
		int64_t offset = 8 * (((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555));
		*(__v2du*)((void*)data_out + offset +  0*16) = (__v2du){data_in_000[i], data_in_100[i]};
		*(__v2du*)((void*)data_out + offset +  1*16) = (__v2du){data_in_200[i], data_in_300[i]};
		*(__v2du*)((void*)data_out + offset +  2*16) = (__v2du){data_in_010[i], data_in_110[i]};
		*(__v2du*)((void*)data_out + offset +  3*16) = (__v2du){data_in_210[i], data_in_310[i]};
		*(__v2du*)((void*)data_out + offset +  4*16) = (__v2du){data_in_020[i], data_in_120[i]};
		*(__v2du*)((void*)data_out + offset +  5*16) = (__v2du){data_in_220[i], data_in_320[i]};
		*(__v2du*)((void*)data_out + offset +  6*16) = (__v2du){data_in_030[i], data_in_130[i]};
		*(__v2du*)((void*)data_out + offset +  7*16) = (__v2du){data_in_230[i], data_in_330[i]};
		*(__v2du*)((void*)data_out + offset +  8*16) = (__v2du){data_in_001[i], data_in_101[i]};
		*(__v2du*)((void*)data_out + offset +  9*16) = (__v2du){data_in_201[i], data_in_301[i]};
		*(__v2du*)((void*)data_out + offset + 10*16) = (__v2du){data_in_011[i], data_in_111[i]};
		*(__v2du*)((void*)data_out + offset + 11*16) = (__v2du){data_in_211[i], data_in_311[i]};
		*(__v2du*)((void*)data_out + offset + 12*16) = (__v2du){data_in_021[i], data_in_121[i]};
		*(__v2du*)((void*)data_out + offset + 13*16) = (__v2du){data_in_221[i], data_in_321[i]};
		*(__v2du*)((void*)data_out + offset + 14*16) = (__v2du){data_in_031[i], data_in_131[i]};
		*(__v2du*)((void*)data_out + offset + 15*16) = (__v2du){data_in_231[i], data_in_331[i]};
	}

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < count/32/2; ++i)
	{
		uint64_t rev = __builtin_e2k_bitrevd(i);
		int64_t offset = 8 * (((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555));
		*(__v2du*)((void*)data_out + offset + 16*16) = (__v2du){data_in_002[i], data_in_102[i]};
		*(__v2du*)((void*)data_out + offset + 17*16) = (__v2du){data_in_202[i], data_in_302[i]};
		*(__v2du*)((void*)data_out + offset + 18*16) = (__v2du){data_in_012[i], data_in_112[i]};
		*(__v2du*)((void*)data_out + offset + 19*16) = (__v2du){data_in_212[i], data_in_312[i]};
		*(__v2du*)((void*)data_out + offset + 20*16) = (__v2du){data_in_022[i], data_in_122[i]};
		*(__v2du*)((void*)data_out + offset + 21*16) = (__v2du){data_in_222[i], data_in_322[i]};
		*(__v2du*)((void*)data_out + offset + 22*16) = (__v2du){data_in_032[i], data_in_132[i]};
		*(__v2du*)((void*)data_out + offset + 23*16) = (__v2du){data_in_232[i], data_in_332[i]};
		*(__v2du*)((void*)data_out + offset + 24*16) = (__v2du){data_in_003[i], data_in_103[i]};
		*(__v2du*)((void*)data_out + offset + 25*16) = (__v2du){data_in_203[i], data_in_303[i]};
		*(__v2du*)((void*)data_out + offset + 26*16) = (__v2du){data_in_013[i], data_in_113[i]};
		*(__v2du*)((void*)data_out + offset + 27*16) = (__v2du){data_in_213[i], data_in_313[i]};
		*(__v2du*)((void*)data_out + offset + 28*16) = (__v2du){data_in_023[i], data_in_123[i]};
		*(__v2du*)((void*)data_out + offset + 29*16) = (__v2du){data_in_223[i], data_in_323[i]};
		*(__v2du*)((void*)data_out + offset + 30*16) = (__v2du){data_in_033[i], data_in_133[i]};
		*(__v2du*)((void*)data_out + offset + 31*16) = (__v2du){data_in_233[i], data_in_333[i]};
	}
}
Main loop in assembly
.L7926:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=17, incr=0, ind=0, asz=1, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=16, incr=0, ind=0, asz=1, abs=0, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=15, incr=0, ind=0, asz=1, abs=2, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=14, incr=0, ind=0, asz=1, abs=2, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=13, incr=0, ind=0, asz=1, abs=4, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=12, incr=0, ind=0, asz=1, abs=4, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=11, incr=0, ind=0, asz=1, abs=6, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=10, incr=0, ind=0, asz=1, abs=6, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=9, incr=0, ind=0, asz=1, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=8, incr=0, ind=0, asz=1, abs=8, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=7, incr=0, ind=0, asz=1, abs=10, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=6, incr=0, ind=0, asz=1, abs=10, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=5, incr=0, ind=0, asz=1, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=4, incr=0, ind=0, asz=1, abs=12, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=3, incr=0, ind=0, asz=1, abs=14, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=2, incr=0, ind=0, asz=1, abs=14, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=1, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=1, abs=16, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=1, abs=18, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=1, abs=18, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=1, abs=20, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=1, abs=20, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=1, abs=22, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=1, abs=22, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=1, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=1, abs=24, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=1, abs=26, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=1, abs=26, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=1, abs=28, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=1, abs=28, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=1, abs=30, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=1, abs=30, disp=0
        }
.L6604:
        {
          loop_mode
          qppackdl,0,sm %b[55], %b[56], %b[1]
          shr_andd,1,sm %b[62], %r0, %r21, %b[67]
          stqp,2        %r4, %b[58], %b[14]
          qppackdl,3,sm %b[26], %b[50], %b[14]
          shr_andd,4,sm %b[62], %r1, %r20, %b[68]
          stqp,5        %r2, %b[58], %b[19]
          movad,0       area=15, ind=0, am=1, be=0, %b[62]
          movad,1       area=14, ind=0, am=1, be=0, %b[66]
          movad,2       area=15, ind=0, am=1, be=0, %b[60]
          movad,3       area=14, ind=0, am=1, be=0, %b[64]
        }
        {
          loop_mode
          qppackdl,1,sm %b[44], %b[49], %b[31]
          stqp,2        %r5, %b[58], %b[3]
          qppackdl,4,sm %b[25], %b[31], %b[26]
          stqp,5        %r6, %b[58], %b[16]
          movad,0       area=13, ind=0, am=1, be=0, %b[16]
          movad,1       area=12, ind=0, am=1, be=0, %b[25]
          movad,2       area=13, ind=0, am=1, be=0, %b[3]
          movad,3       area=12, ind=0, am=1, be=0, %b[19]
        }
        {
          loop_mode
          qppackdl,1,sm %b[63], %b[65], %b[33]
          stqp,2        %r7, %b[58], %b[33]
          qppackdl,4,sm %b[59], %b[61], %b[28]
          stqp,5        %r9, %b[58], %b[28]
          movad,0       area=11, ind=0, am=1, be=0, %b[49]
          movad,1       area=10, ind=0, am=1, be=0, %b[55]
          movad,2       area=11, ind=0, am=1, be=0, %b[44]
          movad,3       area=10, ind=0, am=1, be=0, %b[50]
        }
        {
          loop_mode
          qppackdl,1,sm %g18, %g19, %b[37]
          stqp,2        %r10, %b[58], %b[37]
          qppackdl,4,sm %g16, %g17, %b[32]
          stqp,5        %r11, %b[58], %b[32]
          movad,0       area=9, ind=0, am=1, be=0, %g17
          movad,1       area=8, ind=0, am=1, be=0, %g19
          movad,2       area=9, ind=0, am=1, be=0, %g16
          movad,3       area=8, ind=0, am=1, be=0, %g18
        }
        {
          loop_mode
          qppackdl,1,sm %b[52], %b[57], %b[41]
          stqp,2        %r12, %b[58], %b[41]
          qppackdl,4,sm %b[46], %b[51], %b[36]
          stqp,5        %r13, %b[58], %b[36]
          movad,0       area=7, ind=0, am=1, be=0, %b[59]
          movad,1       area=6, ind=0, am=1, be=0, %b[63]
          movad,2       area=7, ind=0, am=1, be=0, %b[57]
          movad,3       area=6, ind=0, am=1, be=0, %b[61]
        }
        {
          loop_mode
          addd,0,sm     %b[4], 0x1, %b[2] ? %pcnt2
          qppackdl,1,sm %b[21], %b[27], %b[5]
          stqp,2        %r14, %b[58], %b[45]
          ord,3,sm      %b[68], %b[67], %b[65]
          qppackdl,4,sm %b[5], %b[18], %b[18]
          stqp,5        %r15, %b[58], %b[40]
          movad,0       area=5, ind=0, am=1, be=0, %b[27]
          movad,1       area=4, ind=0, am=1, be=0, %b[45]
          movad,2       area=5, ind=0, am=1, be=0, %b[21]
          movad,3       area=4, ind=0, am=1, be=0, %b[40]
        }
        {
          loop_mode
          bitrevd,0,sm  %b[4], %b[60]
          qppackdl,1,sm %b[64], %b[66], %b[9]
          stqp,2        %r16, %b[58], %b[9]
          shld,3,sm     %b[65], 0x3, %b[56]
          qppackdl,4,sm %b[60], %b[62], %b[4]
          stqp,5        %r17, %b[58], %b[22]
          movad,0       area=3, ind=0, am=1, be=0, %b[46]
          movad,1       area=2, ind=0, am=1, be=0, %b[52]
          movad,2       area=3, ind=0, am=1, be=0, %b[22]
          movad,3       area=2, ind=0, am=1, be=0, %b[51]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qppackdl,0,sm %g22, %g23, %b[10]
          stqp,2        %r18, %b[58], %b[15]
          qppackdl,3,sm %g20, %g21, %b[15]
          stqp,5        %r19, %b[58], %b[10]
          movad,0       area=1, ind=0, am=1, be=0, %g23
          movad,1       area=0, ind=0, am=1, be=0, %g21
          movad,2       area=1, ind=0, am=1, be=0, %g22
          movad,3       area=0, ind=0, am=1, be=0, %g20
        }

        ...

.L7272:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=17, incr=0, ind=0, asz=1, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=16, incr=0, ind=0, asz=1, abs=0, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=15, incr=0, ind=0, asz=1, abs=2, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=14, incr=0, ind=0, asz=1, abs=2, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=13, incr=0, ind=0, asz=1, abs=4, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=12, incr=0, ind=0, asz=1, abs=4, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=11, incr=0, ind=0, asz=1, abs=6, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=10, incr=0, ind=0, asz=1, abs=6, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=9, incr=0, ind=0, asz=1, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=8, incr=0, ind=0, asz=1, abs=8, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=7, incr=0, ind=0, asz=1, abs=10, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=6, incr=0, ind=0, asz=1, abs=10, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=5, incr=0, ind=0, asz=1, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=4, incr=0, ind=0, asz=1, abs=12, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=3, incr=0, ind=0, asz=1, abs=14, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=2, incr=0, ind=0, asz=1, abs=14, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=1, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=1, abs=16, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=1, abs=18, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=1, abs=18, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=1, abs=20, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=1, abs=20, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=1, abs=22, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=1, abs=22, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=1, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=1, abs=24, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=1, abs=26, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=1, abs=26, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=1, abs=28, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=1, abs=28, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=1, abs=30, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=1, abs=30, disp=0
        }
.L6518:
        {
          loop_mode
          qppackdl,0,sm %b[55], %b[56], %b[1]
          shr_andd,1,sm %b[62], %r0, %r21, %b[67]
          stqp,2        %r4, %b[58], %b[19]
          qppackdl,3,sm %b[26], %b[50], %b[14]
          shr_andd,4,sm %b[62], %r1, %r20, %b[68]
          stqp,5        %r2, %b[58], %b[14]
          movad,0       area=15, ind=0, am=1, be=0, %b[62]
          movad,1       area=14, ind=0, am=1, be=0, %b[66]
          movad,2       area=15, ind=0, am=1, be=0, %b[60]
          movad,3       area=14, ind=0, am=1, be=0, %b[64]
        }
        {
          loop_mode
          qppackdl,1,sm %b[44], %b[49], %b[26]
          stqp,2        %r5, %b[58], %b[3]
          qppackdl,4,sm %b[25], %b[31], %b[31]
          stqp,5        %r6, %b[58], %b[16]
          movad,0       area=13, ind=0, am=1, be=0, %b[16]
          movad,1       area=12, ind=0, am=1, be=0, %b[25]
          movad,2       area=13, ind=0, am=1, be=0, %b[3]
          movad,3       area=12, ind=0, am=1, be=0, %b[19]
        }
        {
          loop_mode
          qppackdl,1,sm %b[63], %b[65], %b[33]
          stqp,2        %r7, %b[58], %b[28]
          qppackdl,4,sm %b[59], %b[61], %b[28]
          stqp,5        %r9, %b[58], %b[33]
          movad,0       area=11, ind=0, am=1, be=0, %b[49]
          movad,1       area=10, ind=0, am=1, be=0, %b[55]
          movad,2       area=11, ind=0, am=1, be=0, %b[44]
          movad,3       area=10, ind=0, am=1, be=0, %b[50]
        }
        {
          loop_mode
          qppackdl,1,sm %g18, %g19, %b[37]
          stqp,2        %r10, %b[58], %b[37]
          qppackdl,4,sm %g16, %g17, %b[32]
          stqp,5        %r11, %b[58], %b[32]
          movad,0       area=9, ind=0, am=1, be=0, %g17
          movad,1       area=8, ind=0, am=1, be=0, %g19
          movad,2       area=9, ind=0, am=1, be=0, %g16
          movad,3       area=8, ind=0, am=1, be=0, %g18
        }
        {
          loop_mode
          qppackdl,1,sm %b[52], %b[57], %b[36]
          stqp,2        %r12, %b[58], %b[41]
          qppackdl,4,sm %b[46], %b[51], %b[41]
          stqp,5        %r13, %b[58], %b[36]
          movad,0       area=7, ind=0, am=1, be=0, %b[59]
          movad,1       area=6, ind=0, am=1, be=0, %b[63]
          movad,2       area=7, ind=0, am=1, be=0, %b[57]
          movad,3       area=6, ind=0, am=1, be=0, %b[61]
        }
        {
          loop_mode
          addd,0,sm     %b[4], 0x1, %b[2] ? %pcnt2
          qppackdl,1,sm %b[21], %b[27], %b[5]
          stqp,2        %r14, %b[58], %b[40]
          ord,3,sm      %b[68], %b[67], %b[65]
          qppackdl,4,sm %b[5], %b[18], %b[18]
          stqp,5        %r15, %b[58], %b[45]
          movad,0       area=5, ind=0, am=1, be=0, %b[27]
          movad,1       area=4, ind=0, am=1, be=0, %b[45]
          movad,2       area=5, ind=0, am=1, be=0, %b[21]
          movad,3       area=4, ind=0, am=1, be=0, %b[40]
        }
        {
          loop_mode
          bitrevd,0,sm  %b[4], %b[60]
          qppackdl,1,sm %b[64], %b[66], %b[4]
          stqp,2        %r16, %b[58], %b[9]
          shld,3,sm     %b[65], 0x3, %b[56]
          qppackdl,4,sm %b[60], %b[62], %b[9]
          stqp,5        %r17, %b[58], %b[22]
          movad,0       area=3, ind=0, am=1, be=0, %b[46]
          movad,1       area=2, ind=0, am=1, be=0, %b[52]
          movad,2       area=3, ind=0, am=1, be=0, %b[22]
          movad,3       area=2, ind=0, am=1, be=0, %b[51]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qppackdl,0,sm %g22, %g23, %b[15]
          stqp,2        %r18, %b[58], %b[10]
          qppackdl,3,sm %g20, %g21, %b[10]
          stqp,5        %r19, %b[58], %b[15]
          movad,0       area=1, ind=0, am=1, be=0, %g23
          movad,1       area=0, ind=0, am=1, be=0, %g21
          movad,2       area=1, ind=0, am=1, be=0, %g22
          movad,3       area=0, ind=0, am=1, be=0, %g20
        }

Theoretical throughput: 32 complex numbers in 8 cycles (32/8 = 4 numbers per cycle, 8 bytes each) = 32 bytes/cycle

Speed measurements

The graph shows a slowdown at the beginning and a speedup at the end.
The overhead of setting up the second loop prevents the speedup from showing across the whole graph.
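
The index arithmetic in the loops above, a 64-bit bit reversal followed by two masked shifts, amounts to reversing the base-4 digits of the index: bit reversal flips the digit order but also flips the two bits inside each digit, and the 0xAAAAAAAAAAAAAAAA and 0x5555555555555555 masks swap them back. Below is a minimal portable sketch of that idea. The helper names `bitrev64` and `digitrev4` and the `bits` parameter are hypothetical; the loop stands in for `__builtin_e2k_bitrevd`, assumed here to reverse all 64 bits, and the sketch does not reproduce the article's exact `shift` convention or the scaling by 8.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for a 64-bit bit-reversal instruction
   (assumption: the hardware builtin reverses all 64 bits). */
static uint64_t bitrev64(uint64_t v)
{
	uint64_t r = 0;
	for (int k = 0; k < 64; ++k) {
		r = (r << 1) | (v & 1);
		v >>= 1;
	}
	return r;
}

/* Reverse the base-4 digits of an index that is `bits` wide
   (bits must be even, in the range 2..62). */
static uint64_t digitrev4(uint64_t i, int bits)
{
	int shift = 64 - bits;       /* convention of this sketch only */
	uint64_t rev = bitrev64(i);  /* reversed index sits in the top `bits` bits */

	/* (rev >> shift) would be the plain bit-reversed index.
	   Shifting one position less moves even bits to odd positions,
	   shifting one position more moves odd bits to even positions;
	   the masks keep exactly those bits, swapping each bit pair. */
	return ((rev >> (shift - 1)) & 0xAAAAAAAAAAAAAAAAULL)
	     | ((rev >> (shift + 1)) & 0x5555555555555555ULL);
}
```

For a 4-bit index, `digitrev4(6, 4)` maps the digits (1, 2) to (2, 1), i.e. 6 to 9.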


Summary for reverse_radix4

Either reverse_radix4_x16 or reverse_radix4_x32 can be considered the winner.

The FFT algorithm consists of one Reverse pass and several Stage passes. The more Stage passes there are, the less Reverse contributes to the overall FFT time. Reverse speed therefore matters most for short inputs, where there are fewer Stage passes.
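
To put rough numbers on this, here is a small sketch that counts Stage passes and estimates the share of Reverse in the total time. It assumes, purely for illustration, that Reverse and every Stage pass cost the same; the function names are hypothetical, and the real ratio depends on the measured speeds.

```c
#include <assert.h>

/* Number of Stage passes for an n-point transform, n being a power of radix. */
static int stage_passes(unsigned long n, unsigned radix)
{
	int passes = 0;
	while (n > 1) {
		n /= radix;
		++passes;
	}
	return passes;
}

/* Share of Reverse in the total time, ASSUMING every pass costs the same. */
static double reverse_share(unsigned long n, unsigned radix)
{
	return 1.0 / (1 + stage_passes(n, radix));
}
```

Under that assumption a 64-point radix-4 FFT has 3 Stage passes, so Reverse is a quarter of the work, while for 2^20 points (10 passes) it drops to 1/11.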

The Radix-4 FFT implementation will use reverse_radix4_x16.
Later it can be swapped for reverse_radix4_x32 to see how the FFT speed changes.


Writing the Stage function

stage_radix2

Diagram of the Stage algorithm for the radix-2 version.

1. stage_radix2_etalon

A reference version used to check the other variants for correctness.
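
The article's listing relies on `myComplex` and the `complex_mul`, `complex_add` and `complex_sub` helpers, which are defined outside this section. Assuming they behave like ordinary C99 `float complex` arithmetic, the butterfly that one radix-2 stage computes can be sketched in portable C as follows. The function name `stage_radix2_sketch` is hypothetical, and the sketch makes no attempt at performance.

```c
#include <assert.h>
#include <complex.h>

/* One radix-2 stage: pairs (x, y) are read from adjacent input slots,
   y is multiplied by its coefficient, and the sum and the difference
   go to the first and second half of the output. */
static void stage_radix2_sketch(int data_count, const float complex *in,
                                float complex *out, const float complex *coef)
{
	for (int i = 0; i < data_count / 2; ++i) {
		float complex x  = in[2 * i];
		float complex y  = in[2 * i + 1];
		float complex cy = coef[i] * y;

		out[i]                  = x + cy;
		out[i + data_count / 2] = x - cy;
	}
}
```

Comparing an optimized variant against such a plain loop on the same input is one simple way to check it for correctness.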

C code
void stage_radix2_etalon(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
	myComplex *x_in = &data_in[0];
	myComplex *y_in = &data_in[1];
	myComplex *c_in = coef;

	myComplex *out_add = &data_out[0];
	myComplex *out_sub = &data_out[data_count/2];

	#pragma ivdep
	#pragma unroll(1)
//	#pragma prefetch
	for(int64_t i = 0; i < data_count/2; ++i)
	{
		myComplex x = x_in[2*i];
		myComplex y = y_in[2*i];
		myComplex c = c_in[i];

		myComplex cy = complex_mul(c, y);

		out_add[i] = complex_add(x, cy);
		out_sub[i] = complex_sub(x, cy);
	}
}
Main loop in assembly
.L444:
        {
          fapb  ct=1, dcd=0, fmt=3, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=2, asz=5, abs=0, disp=0
        }
.L120:
        {
          loop_mode
          fmuls,0,sm    %b[67], %b[6], %b[37]
          fsubs,1,sm    %b[46], %b[24], %b[58]
          staaw,2       %b[62], %aad1[ %aasti3 + _f32s,_lts0 0x4 ]
          fmul_adds,3,sm        %b[55], %b[13], %b[43], %b[14]
          fadds,4,sm    %b[46], %b[24], %b[57]
          staaw,5       %b[61], %aad2[ %aasti4 + _f32s,_lts0 0x4 ]
          movaw,0       area=0, ind=8, am=0, be=0, %b[0]
          movaw,1       area=0, ind=12, am=0, be=0, %b[1]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          fmuls,0,sm    %b[67], %b[7], %b[62]
          fsubs,1,sm    %b[35], %b[56], %b[70]
          staaw,2       %b[74], %aad1[ %aasti3 ]
          incr,2        %aaincr3
          fmul_rsubs,3,sm       %b[55], %b[12], %b[68], %b[46]
          fadds,4,sm    %b[35], %b[56], %b[69]
          staaw,5       %b[73], %aad2[ %aasti4 ]
          incr,5        %aaincr3
          movaw,0       area=0, ind=0, am=0, be=0, %b[13]
          movaw,1       area=0, ind=4, am=1, be=0, %b[24]
          movaw,2       area=0, ind=4, am=1, be=0, %b[61]
          movaw,3       area=0, ind=0, am=0, be=0, %b[43]
        }

Theoretical throughput: 2 complex numbers in 2 cycles (2/2) = 8 bytes/cycle

Speed measurements

2. stage_radix2_etalon_unroll2

This variant came about by accident.
Here the loop is unrolled by a factor of 2 with the unroll pragma.
The assembly shows that the compiler is able to use vector instructions on its own.

C code
void stage_radix2_etalon_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
	myComplex *x_in = &data_in[0];
	myComplex *y_in = &data_in[1];
	myComplex *c_in = coef;

	myComplex *out_add = &data_out[0];
	myComplex *out_sub = &data_out[data_count/2];

	#pragma ivdep
	#pragma unroll(2)
//	#pragma prefetch
	for(int64_t i = 0; i < data_count/2; ++i)
	{
		myComplex x = x_in[2*i];
		myComplex y = y_in[2*i];
		myComplex c = c_in[i];

		myComplex cy = complex_mul(c, y);

		out_add[i] = complex_add(x, cy);
		out_sub[i] = complex_sub(x, cy);
	}
}
Main loop in assembly
.L1266:
        {
          fapb  ct=0, dcd=0, fmt=3, mrng=12, d=0, incr=1, ind=1, asz=4, abs=0, disp=20
          fapb  dpl=0, dcd=0, fmt=3, mrng=20, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=4, abs=16, disp=0
        }
.L463:
        {
          loop_mode
          pfmuls,0,sm   %b[51], %b[32], %b[53]
          insfd,1,sm    %b[28], %r8, %b[54], %b[1]
          pfmuls,2,sm   %b[51], %b[13], %b[45]
          insfd,3,sm    %b[23], %r8, %b[50], %b[0]
          pshufb,4,sm   %b[9], %b[19], %r0, %b[56]
          pfadds,5,sm   %b[33], %b[39], %b[10]
          movaw,1       area=0, ind=8, am=0, be=0, %b[38]
          movaw,3       area=0, ind=12, am=0, be=0, %b[44]
        }
        {
          loop_mode
          pfmul_rsubs,0,sm      %b[5], %b[15], %b[55], %b[39]
          insfd,1,sm    %b[20], %r8, %b[24], %b[23]
          pfmul_adds,2,sm       %b[5], %b[34], %b[47], %b[33]
          insfd,3,sm    %b[40], %r8, %b[46], %b[28]
          pshufb,4,sm   %b[12], %b[16], %r0, %b[54]
          staad,5       %b[56], %aad1[ %aasti3 + _f32s,_lts0 0x8 ]
          movad,1       area=1, ind=0, am=0, be=0, %b[50]
        }
        {
          loop_mode
          pfsubs,0,sm   %b[8], %b[43], %b[15]
          insfd,1,sm    %b[9], %r8, %b[19], %b[57]
          pfsubs,2,sm   %b[31], %b[37], %b[5]
          insfd,3,sm    %b[12], %r8, %b[16], %b[55]
          staad,5       %b[54], %aad2[ %aasti4 + _f32s,_lts0 0x8 ]
          movad,0       area=1, ind=8, am=1, be=0, %b[24]
          movaw,1       area=0, ind=4, am=0, be=0, %b[34]
          movaw,2       area=0, ind=4, am=0, be=0, %b[20]
          movaw,3       area=0, ind=8, am=0, be=0, %b[40]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          pfadds,0,sm   %b[8], %b[43], %b[12]
          insfd,1,sm    %b[36], %r8, %b[42], %b[9]
          staad,2       %b[57], %aad1[ %aasti3 ]
          incr,2        %aaincr3
          pshufb,4,sm   %b[26], %b[52], %r0, %b[47]
          staad,5       %b[55], %aad2[ %aasti4 ]
          incr,5        %aaincr3
          movaw,1       area=0, ind=0, am=1, be=0, %b[16]
          movaw,2       area=0, ind=0, am=1, be=0, %b[46]
          movaw,3       area=0, ind=16, am=0, be=0, %b[19]
        }

Theoretical throughput: 4 complex numbers in 4 cycles (4/4) = 8 bytes/cycle

Speed measurements

The graph shows a speedup.
The theoretical throughput is the same as for the reference version, but the measured speed went up. The assembly shows that the compiler inserted vector instructions.


3. stage_radix2_simd64

We will not try direct vectorization for now; it will be examined separately later.

For now, let's try the SIMD64 vector instructions to perform several multiplications with a single instruction.

Two complex numbers c and y are multiplied as follows:

  • read the complex numbers c and y from memory into 64-bit registers (one half of the register holds the real part, the other half the imaginary part)

  • flip the sign of the imaginary part of c with xor (giving conj_c) and multiply conj_c by y lane-wise: this yields the partial products for the real part of cy (summing the two halves of the register completes the real part)

  • swap the real and imaginary parts of c with shuf (giving swap_c) and multiply swap_c by y lane-wise: this yields the partial products for the imaginary part of cy (summing the two halves of the register completes the imaginary part)

  • sum the halves of the two partial-product registers with the vector horizontal add fhadd, which gives cy
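
The steps above can be emulated in portable C to see why they produce a complex product. This is only a sketch: the helpers `pack2f`, `lo_f`, `hi_f`, `mul2f`, `hadd2f` and `cmul_packed` are hypothetical stand-ins for the packed instructions, and the lane order of the horizontal add (low half from the first operand, high half from the second) is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two floats in one 64-bit word: low half = real, high half = imaginary. */
static uint64_t pack2f(float lo, float hi)
{
	uint32_t a, b;
	memcpy(&a, &lo, sizeof a);
	memcpy(&b, &hi, sizeof b);
	return ((uint64_t)b << 32) | a;
}

static float lo_f(uint64_t v)
{
	uint32_t a = (uint32_t)v;
	float f;
	memcpy(&f, &a, sizeof f);
	return f;
}

static float hi_f(uint64_t v)
{
	return lo_f(v >> 32);
}

/* Lane-wise multiply: stand-in for a packed float multiply. */
static uint64_t mul2f(uint64_t a, uint64_t b)
{
	return pack2f(lo_f(a) * lo_f(b), hi_f(a) * hi_f(b));
}

/* Horizontal add: low = a.lo + a.hi, high = b.lo + b.hi (assumed order). */
static uint64_t hadd2f(uint64_t a, uint64_t b)
{
	return pack2f(lo_f(a) + hi_f(a), lo_f(b) + hi_f(b));
}

static uint64_t cmul_packed(uint64_t c, uint64_t y)
{
	uint64_t conj_c = c ^ (1ULL << 63);       /* (cr, -ci) */
	uint64_t swap_c = (c << 32) | (c >> 32);  /* (ci,  cr) */

	uint64_t re_parts = mul2f(conj_c, y);     /* (cr*yr, -ci*yi) */
	uint64_t im_parts = mul2f(swap_c, y);     /* (ci*yr,  cr*yi) */

	return hadd2f(re_parts, im_parts);        /* (cr*yr - ci*yi, ci*yr + cr*yi) */
}
```

For c = 1 + 2i and y = 3 + 4i the partial products are (3, -8) and (6, 4), and the horizontal add gives -5 + 10i.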

C code
void stage_radix2_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
	uint64_t *x_in = (uint64_t*)&data_in[0];
	uint64_t *y_in = (uint64_t*)&data_in[1];
	uint64_t *c_in = (uint64_t*)coef;

	uint64_t *out_add = (uint64_t*)&data_out[0];
	uint64_t *out_sub = (uint64_t*)&data_out[data_count/2];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/2; ++i)
	{
		uint64_t x = x_in[2*i];
		uint64_t y = y_in[2*i];
		uint64_t c = c_in[i];

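		// conj_c: flip the sign bit of the imaginary part (upper 32 bits)
		uint64_t conj_c = __builtin_e2k_pxord(c, 1LL<<63);
		// swap_c: swap the real and imaginary halves
		uint64_t swap_c = __builtin_e2k_pshufb(0, c, 0x0302010007060504);

		// lane-wise products; each still needs its two halves summed
		uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
		uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);

		// horizontal add of both partial results yields cy = c * y
		uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);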

		out_add[i] = __builtin_e2k_pfadds(x, cy);
		out_sub[i] = __builtin_e2k_pfsubs(x, cy);
	}
}
Main loop in assembly
.L1588:
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=5, abs=0, disp=0
        }
.L1388:
        {
          loop_mode
          pfmuls,0,sm   %b[35], %b[28], %b[18]
          pfmul_hadds,1,sm      %b[33], %b[32], %b[22], %b[0]
          pshufb,4,sm   0x0, %b[7], %r5, %b[29]
          pfadds,5,sm   %b[27], %b[10], %b[12]
          movad,3       area=0, ind=0, am=1, be=0, %b[1]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          xord,0,sm     %b[5], %r0, %b[33]
          pfsubs,1,sm   %b[27], %b[10], %b[32]
          staad,2       %b[36], %aad1[ %aasti3 ]
          incr,2        %aaincr0
          staad,5       %b[16], %aad2[ %aasti4 ]
          incr,5        %aaincr0
          movad,0       area=0, ind=8, am=1, be=0, %b[22]
          movad,1       area=0, ind=0, am=0, be=0, %b[7]
        }

After compilation the loop consists of 8 instructions: xor, shuf, fmul, fmul_fhadd, fadd, fsub, std, std. The fhadd instruction got fused with one of the fmul instructions (it turns out Elbrus can do that).

Theoretical throughput: 2 complex numbers in 2 cycles (2/2) = 8 bytes/cycle

Speed measurements

The graph shows a small speedup.

One cycle has room for 6 instructions, and here we have 8, so the loop body takes 8/6 of a cycle. Ideally, unrolling the loop by a factor of 3 gives the densest packing (3 * 8/6 = 4 cycles). The unrolling will be done with the unroll pragma.

But first let's look at unrolling by a factor of 2 (2 * 8/6 ≈ 2.67, rounded up to 3 cycles).
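
The packing arithmetic can be written out as a tiny calculation. The sketch below counts only the 8 arithmetic and store instructions against 6 slots per cycle, as the text does, and ignores loads and any other constraints; the names are hypothetical.

```c
#include <assert.h>

enum { SLOTS_PER_CYCLE = 6, OPS_PER_ITERATION = 8, BYTES_PER_COMPLEX = 8 };

/* Lower bound on cycles for the unrolled loop body: ceil(unroll * 8 / 6). */
static int min_cycles(int unroll)
{
	return (unroll * OPS_PER_ITERATION + SLOTS_PER_CYCLE - 1) / SLOTS_PER_CYCLE;
}

/* Each source iteration processes 2 complex numbers (x and y). */
static double bytes_per_cycle(int unroll)
{
	return 2.0 * unroll * BYTES_PER_COMPLEX / min_cycles(unroll);
}
```

With these assumptions, unroll factors 1, 2 and 3 give 2, 3 and 4 cycles, i.e. 8, about 10.67 and 12 bytes/cycle; only the factor 3 fills every slot.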


4. stage_radix2_simd64_unroll2

Here the loop is unrolled by a factor of 2 with the unroll pragma.

C code
void stage_radix2_simd64_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
	uint64_t *x_in = (uint64_t*)&data_in[0];
	uint64_t *y_in = (uint64_t*)&data_in[1];
	uint64_t *c_in = (uint64_t*)coef;

	uint64_t *out_add = (uint64_t*)&data_out[0];
	uint64_t *out_sub = (uint64_t*)&data_out[data_count/2];

	#pragma ivdep
	#pragma unroll(2)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/2; ++i)
	{
		uint64_t x = x_in[2*i];
		uint64_t y = y_in[2*i];
		uint64_t c = c_in[i];

		uint64_t conj_c = __builtin_e2k_pxord(c, 1LL<<63);
		uint64_t swap_c = __builtin_e2k_pshufb(0, c, 0x0302010007060504);

		uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
		uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);

		uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);

		out_add[i] = __builtin_e2k_pfadds(x, cy);
		out_sub[i] = __builtin_e2k_pfsubs(x, cy);
	}
}
Main loop in assembly
.L2152:
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=5, abs=0, disp=0
        }
.L1710:
        {
          loop_mode
          pfmul_hadds,0,sm      %b[51], %b[10], %b[36], %b[11]
          pfmuls,1,sm   %b[57], %b[7], %b[44]
          pfsubs,2,sm   %b[41], %b[25], %b[45]
          xord,4,sm     %b[42], %r0, %b[52]
          pfadds,5,sm   %b[41], %b[25], %b[48]
          movad,0       area=0, ind=24, am=0, be=0, %b[1]
          movad,1       area=0, ind=8, am=0, be=0, %b[0]
        }
        {
          loop_mode
          pshufb,0,sm   0x0, %b[32], %r9, %b[41]
          pfsubs,1,sm   %b[26], %b[17], %b[51]
          staad,2       %b[47], %aad1[ %aasti3 + _f32s,_lts0 0x8 ]
          pfadds,3,sm   %b[26], %b[17], %b[54]
          xord,4,sm     %b[30], %r0, %b[55]
          staad,5       %b[50], %aad2[ %aasti4 + _f32s,_lts0 0x8 ]
          movad,0       area=0, ind=0, am=1, be=0, %b[10]
          movad,1       area=0, ind=16, am=0, be=0, %b[25]
          movad,3       area=0, ind=0, am=0, be=0, %b[36]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          pfmul_hadds,0,sm      %b[43], %b[9], %b[46], %b[17]
          pfmuls,1,sm   %b[52], %b[6], %b[32]
          staad,2       %b[53], %aad1[ %aasti3 ]
          incr,2        %aaincr3
          pshufb,4,sm   0x0, %b[42], %r9, %b[47]
          staad,5       %b[56], %aad2[ %aasti4 ]
          incr,5        %aaincr3
          movad,3       area=0, ind=8, am=1, be=0, %b[26]
        }

Theoretical throughput: 4 complex numbers in 3 cycles (4/3) = 10.67 bytes/cycle

Speed measurements

The graph shows a speedup.


Now let's look at unrolling by a factor of 3.

5. stage_radix2_simd64_unroll3

Here the loop is unrolled by a factor of 3 with the unroll pragma.

C code
void stage_radix2_simd64_unroll3(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
	uint64_t *x_in = (uint64_t*)&data_in[0];
	uint64_t *y_in = (uint64_t*)&data_in[1];
	uint64_t *c_in = (uint64_t*)coef;

	uint64_t *out_add = (uint64_t*)&data_out[0];
	uint64_t *out_sub = (uint64_t*)&data_out[data_count/2];

	#pragma ivdep
	#pragma unroll(3)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/2; ++i)
	{
		uint64_t x = x_in[2*i];
		uint64_t y = y_in[2*i];
		uint64_t c = c_in[i];

		uint64_t conj_c = __builtin_e2k_pxord(c, 1LL<<63);
		uint64_t swap_c = __builtin_e2k_pshufb(0, c, 0x0302010007060504);

		uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
		uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);

		uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);

		out_add[i] = __builtin_e2k_pfadds(x, cy);
		out_sub[i] = __builtin_e2k_pfsubs(x, cy);
	}
}
Main loop in assembly
.L2815:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=24, d=0, incr=2, ind=2, asz=4, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=4, abs=16, disp=32
        }
.L2177:
        {
          loop_mode
          pfmuls,0,sm   %b[61], %b[23], %b[41]
          pfmuls,1,sm   %b[73], %b[12], %b[1]
          xord,2,sm     %b[57], %r0, %b[59]
          pfmul_hadds,3,sm      %b[78], %b[53], %b[28], %b[0]
          xord,4,sm     %b[52], %r0, %b[66]
          pfadds,5,sm   %b[48], %b[34], %b[58]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[67], %b[25], %b[43], %b[28]
          pfsubs,1,sm   %b[48], %b[34], %b[68]
          staad,2       %b[70], %aad1[ %aasti3 + _f32s,_lts0 0x10 ]
          pfsubs,3,sm   %b[17], %b[4], %b[61]
          xord,4,sm     %b[20], %r0, %b[71]
          staad,5       %b[60], %aad2[ %aasti4 + _f32s,_lts0 0x10 ]
          movad,1       area=0, ind=16, am=0, be=0, %b[53]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[72], %b[14], %b[3], %b[60]
          pfsubs,1,sm   %b[39], %b[64], %b[73]
          staad,2       %b[75], %aad1[ %aasti3 + _f32s,_lts0 0x8 ]
          pfadds,3,sm   %b[17], %b[4], %b[67]
          pshufb,4,sm   0x0, %b[22], %r10, %b[70]
          staad,5       %b[63], %aad1[ %aasti3 ]
          incr,5        %aaincr3
          movad,0       area=0, ind=0, am=0, be=0, %b[48]
          movad,1       area=1, ind=0, am=0, be=0, %b[34]
          movad,2       area=0, ind=16, am=0, be=0, %b[25]
          movad,3       area=0, ind=8, am=0, be=0, %b[43]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          pfmuls,0,sm   %b[66], %b[47], %b[22]
          pfadds,1,sm   %b[39], %b[64], %b[72]
          staad,2       %b[74], %aad2[ %aasti4 + _f32s,_lts0 0x8 ]
          pshufb,3,sm   0x0, %b[57], %r10, %b[63]
          pshufb,4,sm   0x0, %b[56], %r10, %b[76]
          staad,5       %b[69], %aad2[ %aasti4 ]
          incr,5        %aaincr3
          movad,0       area=1, ind=8, am=1, be=0, %b[17]
          movad,1       area=0, ind=8, am=1, be=0, %b[14]
          movad,2       area=0, ind=0, am=1, be=0, %b[3]
          movad,3       area=0, ind=24, am=0, be=0, %b[4]
        }

Theoretical speed: 6 complex numbers in 4 cycles (6/4) = 12 bytes/cycle

Speed measurements

We see a speedup in the middle of the graph, but a slowdown at its beginning and end.


6. stage_radix2_simd128

Now let's try the SIMD128 vector instructions, by analogy with SIMD64.

Unlike SIMD64, here the data has to be shuffled at the start and at the end of the loop body with the shuf instruction. This is needed so that one 128-bit register holds the data of two x values and the other holds the data of two y values.
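
The data layout behind this shuffle can be illustrated with a small portable sketch. It is not the e2k code: `cplx_t`, `reg128_t`, `split_xy` and `rrii_to_riri` are hypothetical names that only model which complex number ends up in which half of a 128-bit register. The [r0, r1, i0, i1] order of the intermediate result is an assumption inferred from the variable name cy_rrii and the final shuffle mask.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float re, im; } cplx_t;     /* stand-in for myComplex */
typedef struct { cplx_t lo, hi; } reg128_t;  /* models a 128-bit register: two complex floats */

/* Input order is x0, y0, x1, y1. Two 128-bit loads give xy0 = [x0, y0] and
   xy1 = [x1, y1]; the shuffles regroup them as x = [x0, x1] and y = [y0, y1]. */
void split_xy(const cplx_t in[4], reg128_t *x, reg128_t *y)
{
    assert(in != NULL && x != NULL && y != NULL);
    reg128_t xy0 = { in[0], in[1] };
    reg128_t xy1 = { in[2], in[3] };
    x->lo = xy0.lo;  x->hi = xy1.lo;
    y->lo = xy0.hi;  y->hi = xy1.hi;
}

/* Final shuffle (assumed layout): [r0, r1, i0, i1] -> [(r0, i0), (r1, i1)]. */
reg128_t rrii_to_riri(const float rrii[4])
{
    assert(rrii != NULL);
    reg128_t out = { { rrii[0], rrii[2] }, { rrii[1], rrii[3] } };
    return out;
}
```

For the input x0, y0, x1, y1 the helper `split_xy` yields x = [x0, x1] and y = [y0, y1], which is the grouping the butterfly needs.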

C code
void stage_radix2_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
	__v2di *xy0_in = (__v2di*)&data_in[0];
	__v2di *xy1_in = (__v2di*)&data_in[2];
	__v2di *c_in   = (__v2di*)coef;

	__v2di *out_add = (__v2di*)&data_out[0];
	__v2di *out_sub = (__v2di*)&data_out[data_count/2];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/4; ++i)
	{
		__v2di xy0 = xy0_in[2*i];
		__v2di xy1 = xy1_in[2*i];
		__v2di c   = c_in[i];

		__v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		__v2di conj_c = __builtin_e2k_qpxor(c, (__v2di){1LL<<63, 1LL<<63});
		__v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

		__v2di cy_real = __builtin_e2k_qpfmuls(conj_c, y);
		__v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);

		__v2di cy_rrii = __builtin_e2k_qpfhadds(cy_real, cy_imag);

		__v2di cy = __builtin_e2k_qpshufb(cy_rrii, cy_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

		out_add[i] = __builtin_e2k_qpfadds(x, cy);
		out_sub[i] = __builtin_e2k_qpfsubs(x, cy);
	}
}
Main loop in assembly
.L3099:
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=5, abs=0, disp=0
        }
.L2840:
        {
          loop_mode
          qpshufb,0,sm  %b[36], %b[45], %r6, %b[0]
          qpfmuls,1,sm  %b[28], %b[5], %b[18]
          qpfsubs,2,sm  %b[16], %b[47], %b[11]
          qpshufb,3,sm  %b[34], %b[43], %r7, %b[1]
          qpxor,4,sm    %b[23], %r0, %b[24]
          staaqp,5      %b[17], %aad1[ %aasti3 ]
          incr,5        %aaincr0
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qpshufb,0,sm  %b[35], %b[35], %r5, %b[45]
          qpfmul_hadds,1,sm     %b[42], %b[9], %b[22], %b[27]
          qpfadds,2,sm  %b[16], %b[47], %b[44]
          qpshufb,4,sm  %b[25], %b[25], %r9, %b[36]
          staaqp,5      %b[50], %aad2[ %aasti4 ]
          incr,5        %aaincr0
          movaqp,0      area=0, ind=0, am=0, be=0, %b[37]
          movaqp,1      area=0, ind=16, am=1, be=0, %b[28]
          movaqp,3      area=0, ind=0, am=1, be=0, %b[17]
        }

After compilation we see that the loop consists of 11 instructions (the same 8 instructions as in the SIMD64 variant, plus 3 extra shuf instructions).

Theoretical speed: 4 complex numbers in 2 cycles (4/2) = 16 bytes/cycle

Speed measurements

We see a speedup.

Before moving on to unrolling, note that the loop can be made one instruction shorter. Later this will allow a more efficient unrolling.


7. stage_radix2_simd128_noConj

Let's use the fact that Elbrus can fuse certain instructions.
We drop conj_c (which removes the xor instruction) and use fhsub to obtain the real part of cy from the intermediate product. The imaginary part is obtained with fhadd, as before. Both instructions get fused with the two fmul instructions, so they are effectively "free". The final assembly into a single complex number is done in the final shuf, together with the data shuffle that is already there.

In the SIMD64 version this trick was not possible, because there was no final shuf there.

(The final shuf had to be replaced with perm; the two instructions are analogous.)
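
The arithmetic behind both variants can be checked with a scalar model of one complex lane. This is only a sketch under assumptions: `lane_t`, `lane_mul`, `lane_hadd` and `lane_hsub` are hypothetical stand-ins, not the e2k builtins, and the horizontal subtraction is assumed to compute the real slot minus the imaginary slot.

```c
#include <assert.h>

typedef struct { float re, im; } lane_t;  /* one complex number: two float slots */

/* Element-wise product of two lanes (models one lane of a packed fmul). */
lane_t lane_mul(lane_t a, lane_t b) { lane_t r = { a.re * b.re, a.im * b.im }; return r; }
/* Horizontal add and subtract inside one lane (assumed slot order: re, im). */
float lane_hadd(lane_t a) { return a.re + a.im; }
float lane_hsub(lane_t a) { return a.re - a.im; }

/* Variant with conj_c: negate the imaginary slot, then use hadd for both parts. */
lane_t cmul_conj(lane_t c, lane_t y)
{
    lane_t conj_c = { c.re, -c.im };
    lane_t swap_c = { c.im, c.re };
    lane_t r = { lane_hadd(lane_mul(conj_c, y)), lane_hadd(lane_mul(swap_c, y)) };
    return r;
}

/* Variant without conj_c: hsub yields the real part directly. */
lane_t cmul_noconj(lane_t c, lane_t y)
{
    lane_t swap_c = { c.im, c.re };
    lane_t r = { lane_hsub(lane_mul(c, y)), lane_hadd(lane_mul(swap_c, y)) };
    return r;
}

/* Usage example: c = 0.5 + 0.25i, y = 3 - 2i, so c*y = 2 - 0.25i (exact in float). */
void self_check(void)
{
    lane_t c = { 0.5f, 0.25f }, y = { 3.0f, -2.0f };
    lane_t a = cmul_conj(c, y), b = cmul_noconj(c, y);
    assert(a.re == 2.0f && a.im == -0.25f);
    assert(b.re == 2.0f && b.im == -0.25f);
}
```

Both functions perform the same two multiplications. The only difference is that the sign flip moves from a separate operation on c into the horizontal subtraction.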

C code
void stage_radix2_simd128_noConj(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
	__v2di *xy0_in = (__v2di*)&data_in[0];
	__v2di *xy1_in = (__v2di*)&data_in[2];
	__v2di *c_in   = (__v2di*)coef;

	__v2di *out_add = (__v2di*)&data_out[0];
	__v2di *out_sub = (__v2di*)&data_out[data_count/2];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/4; ++i)
	{
		__v2di xy0 = xy0_in[2*i];
		__v2di xy1 = xy1_in[2*i];
		__v2di c   = c_in[i];

		__v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		__v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

		__v2di cy_real = __builtin_e2k_qpfmuls(     c, y);
		__v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);

		__v2di cy_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy_real);
		__v2di cy_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy_imag);

		__v2di cy = __builtin_e2k_qppermb(cy_ii, cy_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});

		out_add[i] = __builtin_e2k_qpfadds(x, cy);
		out_sub[i] = __builtin_e2k_qpfsubs(x, cy);
	}
}
Main loop in assembly
.L3385:
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=5, abs=0, disp=0
        }
.L3124:
        {
          loop_mode
          qpfmul_hsubs,0,sm     %b[25], %b[28], %r9, %b[16]
          qpfmul_hadds,1,sm     %b[27], %b[28], %r9, %b[1]
          qpfsubs,2,sm  %b[14], %b[42], %b[37]
          qpshufb,3,sm  %b[35], %b[36], %r6, %b[0]
          qppermb,4,sm  %b[11], %b[26], %r7, %b[38]
          staaqp,5      %b[43], %aad1[ %aasti3 ]
          incr,5        %aaincr0
          movaqp,0      area=0, ind=0, am=0, be=0, %b[30]
          movaqp,1      area=0, ind=16, am=1, be=0, %b[29]
          movaqp,3      area=0, ind=0, am=1, be=0, %b[19]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qpshufb,0,sm  %b[33], %b[34], %r0, %b[26]
          qpshufb,1,sm  %b[23], %b[23], %r5, %b[25]
          qpfadds,2,sm  %b[14], %b[42], %b[11]
          staaqp,5      %b[17], %aad2[ %aasti4 ]
          incr,5        %aaincr0
        }

After compilation we see that the loop consists of 10 instructions (the xor instruction is gone).

Theoretical speed: 4 complex numbers in 2 cycles (4/2) = 16 bytes/cycle

Speed measurements

The speed has not changed compared with the previous variant.

Now we move on to unrolling.
The loop currently needs 10/6 cycles, which rounds up to 2 cycles. Unrolling by 2 gives 2 * 10/6 = 3.33, which rounds up to 4 cycles, so nothing interesting (one loop iteration processes twice as much data in twice as many cycles).
So we go straight to unrolling by 3 (3 * 10/6 = 5 cycles).
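
The estimate above can be reproduced with a few lines of C. This is a rough sketch, assuming that the only limit is six arithmetic channels per wide instruction (channels 0..5 in the listings); `ALU_CHANNELS`, `cycles_per_body` and `cycles_per_iteration` are hypothetical names, and other limits such as loads, stores or dependencies are ignored.

```c
#include <assert.h>

enum { ALU_CHANNELS = 6 };  /* assumption: six arithmetic channels per wide instruction */

/* Lower bound on cycles for one unrolled loop body:
   ceil(instr_per_iter * unroll / ALU_CHANNELS). */
int cycles_per_body(int instr_per_iter, int unroll)
{
    assert(instr_per_iter > 0 && unroll > 0);
    int total = instr_per_iter * unroll;
    return (total + ALU_CHANNELS - 1) / ALU_CHANNELS;
}

/* Cycles spent per original (non-unrolled) iteration. */
double cycles_per_iteration(int instr_per_iter, int unroll)
{
    return (double)cycles_per_body(instr_per_iter, unroll) / unroll;
}
```

With 10 instructions per iteration this gives 2, 4 and 5 cycles for unroll factors 1, 2 and 3, so the cost per original iteration only drops below 2 cycles at a factor of 3.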


8. stage_radix2_simd128_noConj_unroll3

Here the loop is unrolled by 3 using the unroll pragma.

C code
void stage_radix2_simd128_noConj_unroll3(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
	__v2di *xy0_in = (__v2di*)&data_in[0];
	__v2di *xy1_in = (__v2di*)&data_in[2];
	__v2di *c_in   = (__v2di*)coef;

	__v2di *out_add = (__v2di*)&data_out[0];
	__v2di *out_sub = (__v2di*)&data_out[data_count/2];

	#pragma ivdep
	#pragma unroll(3)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/4; ++i)
	{
		__v2di xy0 = xy0_in[2*i];
		__v2di xy1 = xy1_in[2*i];
		__v2di c   = c_in[i];

		__v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		__v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

		__v2di cy_real = __builtin_e2k_qpfmuls(     c, y);
		__v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);

		__v2di cy_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy_real);
		__v2di cy_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy_imag);

		__v2di cy = __builtin_e2k_qppermb(cy_ii, cy_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});

		out_add[i] = __builtin_e2k_qpfadds(x, cy);
		out_sub[i] = __builtin_e2k_qpfsubs(x, cy);
	}
}
Main loop in assembly
.L3932:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=8, disp=64
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=4, abs=16, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=2, asz=4, abs=16, disp=32
        }
.L3410:
        {
          loop_mode
          qpfmul_hsubs,0,sm     %b[19], %b[55], %r14, %b[0]
          qpshufb,1,sm  %b[26], %b[23], %r12, %b[15]
          qpfmul_hsubs,2,sm     %b[31], %b[63], %r14, %b[1]
          qpshufb,3,sm  %b[52], %b[53], %r0, %b[18]
          qpfadds,4,sm  %b[66], %b[69], %b[6]
          qpfadds,5,sm  %b[64], %b[42], %b[68]
        }
        {
          loop_mode
          qpfmul_hsubs,0,sm     %b[34], %b[15], %r14, %b[52]
          qpshufb,1,sm  %b[34], %b[34], %r11, %b[61]
          staaqp,2      %b[35], %aad1[ %aasti3 + _f32s,_lts0 0x10 ]
          qpshufb,3,sm  %b[44], %b[45], %r12, %b[53]
          qpfsubs,4,sm  %b[62], %b[40], %b[57]
          qpfsubs,5,sm  %b[18], %b[65], %b[58]
          movaqp,0      area=0, ind=0, am=0, be=0, %b[43]
          movaqp,1      area=0, ind=16, am=1, be=0, %b[42]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[61], %b[15], %r14, %b[34]
          qpshufb,1,sm  %b[31], %b[31], %r11, %b[66]
          staaqp,2      %b[60], %aad1[ %aasti3 ]
          qpshufb,3,sm  %b[30], %b[27], %r0, %b[64]
          qppermb,4,sm  %b[38], %b[56], %r13, %b[67]
          qpfadds,5,sm  %b[18], %b[65], %b[35]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[66], %b[63], %r14, %b[18]
          qpshufb,1,sm  %b[19], %b[19], %r11, %b[56]
          staaqp,2      %b[8], %aad2[ %aasti4 + _f32s,_lts0 0x10 ]
          qppermb,3,sm  %b[22], %b[5], %r13, %b[38]
          qpfsubs,4,sm  %b[64], %b[67], %b[31]
          staaqp,5      %b[37], %aad2[ %aasti4 ]
          movaqp,1      area=2, ind=0, am=1, be=0, %b[27]
          movaqp,2      area=1, ind=16, am=1, be=0, %b[30]
          movaqp,3      area=1, ind=0, am=0, be=0, %b[15]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qpfmul_hadds,0,sm     %b[56], %b[55], %r14, %b[37]
          qpshufb,1,sm  %b[10], %b[7], %r12, %b[61]
          staaqp,2      %b[59], %aad1[ %aasti3 + _f32s,_lts0 0x20 ]
          incr,2        %aaincr3
          qpshufb,3,sm  %b[16], %b[13], %r0, %b[60]
          qppermb,4,sm  %b[41], %b[4], %r13, %b[63]
          staaqp,5      %b[68], %aad2[ %aasti4 + _f32s,_lts0 0x20 ]
          incr,5        %aaincr3
          movaqp,0      area=1, ind=0, am=0, be=0, %b[5]
          movaqp,1      area=1, ind=16, am=1, be=0, %b[8]
          movaqp,2      area=0, ind=0, am=0, be=0, %b[19]
          movaqp,3      area=0, ind=16, am=1, be=0, %b[22]
        }

Theoretical speed: 12 complex numbers in 5 cycles (12/5) = 19.2 bytes/cycle

Speed measurements

We see a speedup in the middle of the graph, but a slowdown at its beginning and end.


Summary for stage_radix2

The FFT graph is here.


stage_radix2_2x

Diagram of the Stage algorithm for the "radix-2" 2x version.

One pass of stage_radix2_2x does the same work as 2 passes of stage_radix2. So, for easier comparison with stage_radix2, we will multiply the speed of stage_radix2_2x by 2 (this is noted on the graph axis and in the console output).
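
This convention can be written down as a small helper. It is a sketch only: `stage_speed` and its parameters are hypothetical names, and it assumes 8 bytes per float complex number, as in the problem statement.

```c
#include <assert.h>

enum { BYTES_PER_COMPLEX = 8 };  /* float complex: 4-byte real part + 4-byte imaginary part */

/* Comparable speed in bytes/cycle.
   complex_per_iter - complex numbers processed by one loop iteration
   cycles_per_iter  - cycles taken by one loop iteration
   passes_covered   - number of stage_radix2 passes that one pass replaces
                      (1 for stage_radix2, 2 for stage_radix2_2x) */
double stage_speed(int complex_per_iter, int cycles_per_iter, int passes_covered)
{
    assert(complex_per_iter > 0 && cycles_per_iter > 0 && passes_covered > 0);
    return (double)passes_covered * BYTES_PER_COMPLEX * complex_per_iter / cycles_per_iter;
}
```

For example, a loop that handles 4 complex numbers in 7 cycles gives about 4.57 bytes/cycle with passes_covered = 1 and about 9.14 bytes/cycle with passes_covered = 2.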

1. stage_radix2_2x_etalon

Here the stage_radix2_etalon algorithm is manually unrolled by 2.

C code
void stage_radix2_2x_etalon(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef_a, myComplex *coef_b)
{
	myComplex *x0_in = &data_in[0];
	myComplex *y0_in = &data_in[1];
	myComplex *x1_in = &data_in[2];
	myComplex *y1_in = &data_in[3];
	myComplex *c0a_in = &coef_a[0];
	myComplex *c1a_in = &coef_a[1];
	myComplex *c0b_in = &coef_b[0];
	myComplex *c1b_in = &coef_b[data_count/4];

	myComplex *out_add0 = &data_out[0*data_count/4];
	myComplex *out_add1 = &data_out[1*data_count/4];
	myComplex *out_sub0 = &data_out[2*data_count/4];
	myComplex *out_sub1 = &data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(1)
//	#pragma prefetch
	for(int64_t i = 0; i < data_count/4; ++i)
	{
		myComplex x0 = x0_in[4*i];
		myComplex y0 = y0_in[4*i];
		myComplex c0 = c0a_in[2*i];

		myComplex x1 = x1_in[4*i];
		myComplex y1 = y1_in[4*i];
		myComplex c1 = c1a_in[2*i];

		myComplex cy0 = complex_mul(c0, y0);
		myComplex cy1 = complex_mul(c1, y1);

		myComplex add0 = complex_add(x0, cy0);
		myComplex sub0 = complex_sub(x0, cy0);
		myComplex add1 = complex_add(x1, cy1);
		myComplex sub1 = complex_sub(x1, cy1);


		x0 = add0;
		y0 = add1;
		c0 = c0b_in[i];

		x1 = sub0;
		y1 = sub1;
		c1 = c1b_in[i];

		cy0 = complex_mul(c0, y0);
		cy1 = complex_mul(c1, y1);

		out_add0[i] = complex_add(x0, cy0);
		out_sub0[i] = complex_sub(x0, cy0);
		out_add1[i] = complex_add(x1, cy1);
		out_sub1[i] = complex_sub(x1, cy1);
	}
}
Main loop in assembly
.L965:
        {
          fapb  ct=0, dcd=0, fmt=3, mrng=16, d=0, incr=3, ind=2, asz=3, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=4, asz=3, abs=8, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=3, asz=4, abs=16, disp=0
        }
.L259:
        {
          loop_mode
          fmul_adds,0,sm        %b[27], %b[84], %b[93], %b[45]
          fsub_adds,1,sm        %b[15], %b[79], %b[88], %b[0]
          fsub_rsubs,2,sm       %b[15], %b[79], %b[88], %b[1]
          fmuls,3,sm    %b[56], %b[74], %b[90]
          fmuls,4,sm    %b[45], %b[34], %b[89]
          fmuls,5,sm    %b[68], %b[65], %b[88]
        }
        {
          loop_mode
          fmul_rsubs,0,sm       %b[24], %b[67], %b[95], %b[56]
          fadd_adds,1,sm        %b[15], %b[79], %b[47], %b[24]
          fadd_rsubs,2,sm       %b[15], %b[79], %b[47], %b[47]
          fmul_rsubs,3,sm       %b[73], %b[74], %b[94], %b[15]
          fmuls,5,sm    %b[63], %b[51], %b[91]
        }
        {
          loop_mode
          fmul_rsubs,0,sm       %b[27], %b[53], %b[96], %b[58]
          fsub_adds,1,sm        %b[14], %b[83], %b[58], %b[27]
          fsub_rsubs,2,sm       %b[14], %b[83], %b[58], %b[53]
          fmuls,4,sm    %b[54], %b[50], %b[92]
          fmuls,5,sm    %b[68], %b[85], %b[93]
        }
        {
          loop_mode
          fadd_adds,0,sm        %b[14], %b[83], %b[60], %b[67]
          fadd_rsubs,1,sm       %b[14], %b[83], %b[60], %b[68]
          staaw,2       %b[2], %aad2[ %aasti6 + _f32s,_lts0 0x4 ]
          fsubs,3,sm    %b[48], %b[17], %b[63]
          fmuls,4,sm    %b[63], %b[82], %b[94]
          staaw,5       %b[3], %aad1[ %aasti5 + _f32s,_lts0 0x4 ]
          movaw,0       area=2, ind=0, am=0, be=0, %b[14]
          movaw,1       area=2, ind=4, am=1, be=0, %b[60]
          movaw,2       area=0, ind=0, am=0, be=0, %b[2]
          movaw,3       area=0, ind=4, am=0, be=0, %b[3]
        }
        {
          loop_mode
          staaw,2       %b[26], %aad4[ %aasti8 + _f32s,_lts0 0x4 ]
          fmul_adds,3,sm        %b[73], %b[52], %b[90], %b[73]
          fadds,4,sm    %b[48], %b[17], %b[49]
          staaw,5       %b[49], %aad3[ %aasti7 + _f32s,_lts0 0x4 ]
          movaw,0       area=1, ind=0, am=0, be=0, %b[17]
          movaw,1       area=0, ind=12, am=0, be=0, %b[52]
          movaw,2       area=0, ind=8, am=0, be=0, %b[26]
          movaw,3       area=0, ind=28, am=0, be=0, %b[48]
        }
        {
          loop_mode
          fmul_rsubs,1,sm       %b[42], %b[34], %b[87], %b[79]
          staaw,2       %b[29], %aad2[ %aasti6 ]
          incr,2        %aaincr4
          fsubs,4,sm    %b[80], %b[75], %b[83]
          staaw,5       %b[55], %aad1[ %aasti5 ]
          incr,5        %aaincr4
          movaw,0       area=1, ind=4, am=1, be=0, %b[55]
          movaw,1       area=0, ind=0, am=0, be=0, %b[34]
          movaw,2       area=0, ind=12, am=0, be=0, %b[29]
          movaw,3       area=0, ind=20, am=0, be=0, %b[74]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          fmul_adds,0,sm        %b[42], %b[37], %b[89], %b[75]
          fmul_adds,1,sm        %b[22], %b[85], %b[88], %b[84]
          staaw,2       %b[69], %aad4[ %aasti8 ]
          incr,2        %aaincr4
          fmuls,3,sm    %b[43], %b[35], %b[85]
          fadds,4,sm    %b[80], %b[75], %b[80]
          staaw,5       %b[70], %aad3[ %aasti7 ]
          incr,5        %aaincr4
          movaw,0       area=0, ind=4, am=1, be=0, %b[37]
          movaw,1       area=0, ind=8, am=0, be=0, %b[69]
          movaw,2       area=0, ind=16, am=1, be=0, %b[42]
          movaw,3       area=0, ind=24, am=0, be=0, %b[70]
        }

Theoretical speed: 4 complex numbers in 7 cycles (4/7) = 4.57 bytes/cycle
Doubled theoretical speed: 9.14 bytes/cycle

Speed measurements

2. stage_radix2_2x_etalon_unroll2

Here the loop is unrolled by 2 using the unroll pragma.

C code
void stage_radix2_2x_etalon_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef_a, myComplex *coef_b)
{
	myComplex *x0_in = &data_in[0];
	myComplex *y0_in = &data_in[1];
	myComplex *x1_in = &data_in[2];
	myComplex *y1_in = &data_in[3];
	myComplex *c0a_in = &coef_a[0];
	myComplex *c1a_in = &coef_a[1];
	myComplex *c0b_in = &coef_b[0];
	myComplex *c1b_in = &coef_b[data_count/4];

	myComplex *out_add0 = &data_out[0*data_count/4];
	myComplex *out_add1 = &data_out[1*data_count/4];
	myComplex *out_sub0 = &data_out[2*data_count/4];
	myComplex *out_sub1 = &data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(2)
//	#pragma prefetch
	for(int64_t i = 0; i < data_count/4; ++i)
	{
		myComplex x0 = x0_in[4*i];
		myComplex y0 = y0_in[4*i];
		myComplex c0 = c0a_in[2*i];

		myComplex x1 = x1_in[4*i];
		myComplex y1 = y1_in[4*i];
		myComplex c1 = c1a_in[2*i];

		myComplex cy0 = complex_mul(c0, y0);
		myComplex cy1 = complex_mul(c1, y1);

		myComplex add0 = complex_add(x0, cy0);
		myComplex sub0 = complex_sub(x0, cy0);
		myComplex add1 = complex_add(x1, cy1);
		myComplex sub1 = complex_sub(x1, cy1);


		x0 = add0;
		y0 = add1;
		c0 = c0b_in[i];

		x1 = sub0;
		y1 = sub1;
		c1 = c1b_in[i];

		cy0 = complex_mul(c0, y0);
		cy1 = complex_mul(c1, y1);

		out_add0[i] = complex_add(x0, cy0);
		out_sub0[i] = complex_sub(x0, cy0);
		out_add1[i] = complex_add(x1, cy1);
		out_sub1[i] = complex_sub(x1, cy1);
	}
}
Main loop in assembly
.L2305:
        {
          fapb  ct=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=3, mrng=16, d=0, incr=2, ind=2, asz=3, abs=0, disp=16
        }
        {
          fapb  ct=0, dcd=0, fmt=3, mrng=16, d=0, incr=2, ind=2, asz=3, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=3, abs=8, disp=32
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=3, ind=3, asz=4, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=3, ind=4, asz=4, abs=16, disp=0
        }
.L1020:
        {
          loop_mode
          pfmul_rsubs,0,sm      %b[71], %b[17], %b[75], %b[1]
          insfd,1,sm    %b[10], %r6, %b[11], %b[87]
          pfmul_rsubs,2,sm      %b[101], %b[47], %b[96], %b[0]
          insfd,3,sm    %b[92], %r6, %b[98], %b[86]
          insfd,4,sm    %b[86], %r6, %b[87], %b[10]
          pfsubs,5,sm   %b[29], %b[40], %b[11]
        }
        {
          loop_mode
          pfmul_adds,0,sm       %b[101], %b[13], %b[80], %b[17]
          insfd,1,sm    %b[63], %r6, %b[38], %b[12]
          pfmul_rsubs,2,sm      %b[87], %b[93], %b[85], %b[47]
          pfmul_adds,3,sm       %b[90], %b[12], %b[97], %b[38]
          insfd,4,sm    %b[76], %r6, %b[52], %b[13]
        }
        {
          loop_mode
          pfadd_adds,0,sm       %b[18], %b[3], %b[49], %b[49]
          insfd,1,sm    %b[82], %r6, %b[95], %b[29]
          pfadd_rsubs,2,sm      %b[18], %b[3], %b[49], %b[40]
          pfadds,3,sm   %b[29], %b[40], %b[52]
          pshufb,4,sm   %b[43], %b[57], %r0, %b[80]
          pfmuls,5,sm   %b[91], %b[93], %b[76]
        }
        {
          loop_mode
          pfsub_rsubs,0,sm      %b[18], %b[3], %b[2], %b[55]
          insfd,1,sm    %b[94], %r6, %b[81], %b[82]
          pfsub_rsubs,2,sm      %b[37], %b[28], %b[19], %b[41]
          pshufb,3,sm   %b[24], %b[25], %r0, %b[85]
          pshufb,4,sm   %b[34], %b[48], %r0, %b[90]
          pfmuls,5,sm   %b[41], %b[15], %b[81]
        }
        {
          loop_mode
          pfsub_adds,0,sm       %b[18], %b[3], %b[2], %b[46]
          pfmuls,1,sm   %b[82], %b[10], %b[92]
          pfsub_adds,2,sm       %b[37], %b[28], %b[19], %b[32]
          pfadds,3,sm   %b[32], %b[46], %b[91]
          pshufb,4,sm   %b[64], %b[42], %r0, %b[93]
          movad,0       area=2, ind=0, am=0, be=0, %b[19]
          movad,1       area=2, ind=8, am=1, be=0, %b[18]
          movad,2       area=2, ind=0, am=0, be=0, %b[3]
          movad,3       area=2, ind=8, am=1, be=0, %b[2]
        }
        {
          loop_mode
          pfadd_rsubs,0,sm      %b[37], %b[28], %b[84], %b[62]
          insfd,1,sm    %b[89], %r6, %b[100], %b[56]
          staad,2       %b[80], %aad1[ %aasti5 + _f32s,_lts0 0x8 ]
          pshufb,3,sm   %b[8], %b[9], %r0, %b[89]
          pshufb,4,sm   %b[70], %b[51], %r0, %b[95]
          pfmuls,5,sm   %b[85], %b[11], %b[94]
          movaw,0       area=1, ind=0, am=0, be=0, %b[63]
          movaw,1       area=1, ind=4, am=0, be=0, %b[66]
          movaw,2       area=1, ind=4, am=0, be=0, %b[80]
          movaw,3       area=1, ind=0, am=0, be=0, %b[59]
        }
        {
          loop_mode
          pfadd_adds,0,sm       %b[37], %b[28], %b[84], %b[68]
          insfd,1,sm    %b[78], %r6, %b[99], %b[28]
          staad,2       %b[90], %aad2[ %aasti6 + _f32s,_lts0 0x8 ]
          pfmuls,3,sm   %b[85], %b[45], %b[78]
          insfd,4,sm    %b[83], %r6, %b[68], %b[37]
          pfmuls,5,sm   %b[89], %b[52], %b[83]
          movaw,0       area=0, ind=24, am=0, be=0, %b[96]
          movaw,1       area=0, ind=28, am=0, be=0, %b[85]
          movaw,2       area=1, ind=24, am=0, be=0, %b[90]
          movaw,3       area=1, ind=28, am=0, be=0, %b[84]
        }
        {
          loop_mode
          insfd,1,sm    %b[79], %r6, %g16, %b[88]
          staad,2       %b[93], %aad3[ %aasti7 + _f32s,_lts0 0x8 ]
          insfd,4,sm    %b[88], %r6, %b[65], %b[65]
          pfmuls,5,sm   %b[39], %b[58], %b[71]
          movaw,0       area=1, ind=8, am=0, be=0, %g16
          movaw,1       area=1, ind=12, am=1, be=0, %b[79]
          movaw,2       area=1, ind=8, am=0, be=0, %b[72]
          movaw,3       area=1, ind=20, am=0, be=0, %b[75]
        }
        {
          loop_mode
          pfmul_adds,0,sm       %b[87], %b[54], %b[76], %b[82]
          insfd,1,sm    %b[43], %r6, %b[57], %b[98]
          staad,2       %b[95], %aad4[ %aasti8 + _f32s,_lts0 0x8 ]
          pfmuls,3,sm   %b[82], %b[86], %b[95]
          insfd,4,sm    %b[34], %r6, %b[48], %b[97]
          pfsubs,5,sm   %b[30], %b[44], %b[43]
          movaw,0       area=0, ind=4, am=0, be=0, %b[93]
          movaw,1       area=0, ind=0, am=0, be=0, %b[34]
          movaw,2       area=1, ind=16, am=0, be=0, %b[76]
          movaw,3       area=1, ind=12, am=1, be=0, %b[87]
        }
        {
          loop_mode
          pfmul_rsubs,0,sm      %b[88], %b[86], %b[92], %b[42]
          insfd,1,sm    %b[70], %r6, %b[51], %b[97]
          staad,2       %b[98], %aad1[ %aasti5 ]
          incr,2        %aaincr4
          insfd,4,sm    %b[64], %r6, %b[42], %b[98]
          staad,5       %b[97], %aad2[ %aasti6 ]
          incr,5        %aaincr4
          movaw,0       area=0, ind=8, am=0, be=0, %b[48]
          movaw,1       area=0, ind=20, am=0, be=0, %b[51]
          movaw,2       area=0, ind=0, am=0, be=0, %b[86]
          movaw,3       area=0, ind=12, am=0, be=0, %b[92]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          pfmul_adds,0,sm       %b[69], %b[60], %b[81], %b[24]
          insfd,1,sm    %b[24], %r6, %b[25], %b[99]
          staad,2       %b[97], %aad4[ %aasti8 ]
          incr,2        %aaincr4
          insfd,4,sm    %b[77], %r6, %b[53], %b[25]
          staad,5       %b[98], %aad3[ %aasti7 ]
          incr,5        %aaincr4
          movaw,0       area=0, ind=16, am=0, be=0, %b[97]
          movaw,1       area=0, ind=12, am=1, be=0, %b[98]
          movaw,2       area=0, ind=4, am=1, be=0, %b[81]
          movaw,3       area=0, ind=8, am=0, be=0, %b[77]
        }

Just as in stage_radix2_etalon_unroll2, we can see that the compiler has inserted vector instructions.

Theoretical speed: 8 complex numbers in 11 cycles (8/11) = 5.82 bytes/cycle
Doubled theoretical speed: 11.64 bytes/cycle

Speed measurements

3. stage_radix2_2x_simd64

Here the stage_radix2_simd64 algorithm is manually unrolled by 2.

C code
void stage_radix2_2x_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef_a, myComplex *coef_b)
{
	uint64_t *x0_in = (uint64_t*)&data_in[0];
	uint64_t *y0_in = (uint64_t*)&data_in[1];
	uint64_t *x1_in = (uint64_t*)&data_in[2];
	uint64_t *y1_in = (uint64_t*)&data_in[3];
	uint64_t *c0a_in = (uint64_t*)&coef_a[0];
	uint64_t *c1a_in = (uint64_t*)&coef_a[1];
	uint64_t *c0b_in = (uint64_t*)&coef_b[0];
	uint64_t *c1b_in = (uint64_t*)&coef_b[data_count/4];

	uint64_t *out_add0 = (uint64_t*)&data_out[0*data_count/4];
	uint64_t *out_add1 = (uint64_t*)&data_out[1*data_count/4];
	uint64_t *out_sub0 = (uint64_t*)&data_out[2*data_count/4];
	uint64_t *out_sub1 = (uint64_t*)&data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/4; ++i)
	{
		uint64_t x0 = x0_in[4*i];
		uint64_t y0 = y0_in[4*i];
		uint64_t c0 = c0a_in[2*i];

		uint64_t x1 = x1_in[4*i];
		uint64_t y1 = y1_in[4*i];
		uint64_t c1 = c1a_in[2*i];

		uint64_t conj_c0 = __builtin_e2k_pxord(c0, 1LL<<63);
		uint64_t conj_c1 = __builtin_e2k_pxord(c1, 1LL<<63);
		uint64_t swap_c0 = __builtin_e2k_pshufb(0, c0, 0x0302010007060504);
		uint64_t swap_c1 = __builtin_e2k_pshufb(0, c1, 0x0302010007060504);

		uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
		uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
		uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
		uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);

		uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
		uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);

		uint64_t add0 = __builtin_e2k_pfadds(x0, cy0);
		uint64_t sub0 = __builtin_e2k_pfsubs(x0, cy0);
		uint64_t add1 = __builtin_e2k_pfadds(x1, cy1);
		uint64_t sub1 = __builtin_e2k_pfsubs(x1, cy1);


		x0 = add0;
		y0 = add1;
		c0 = c0b_in[i];

		x1 = sub0;
		y1 = sub1;
		c1 = c1b_in[i];

		conj_c0 = __builtin_e2k_pxord(c0, 1LL<<63);
		conj_c1 = __builtin_e2k_pxord(c1, 1LL<<63);
		swap_c0 = __builtin_e2k_pshufb(0, c0, 0x0302010007060504);
		swap_c1 = __builtin_e2k_pshufb(0, c1, 0x0302010007060504);

		cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
		cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
		cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
		cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);

		cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
		cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);

		out_add0[i] = __builtin_e2k_pfadds(x0, cy0);
		out_sub0[i] = __builtin_e2k_pfsubs(x0, cy0);
		out_add1[i] = __builtin_e2k_pfadds(x1, cy1);
		out_sub1[i] = __builtin_e2k_pfsubs(x1, cy1);
	}
}
Main loop in assembly
.L2998:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=3, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=3, abs=8, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=16, disp=0
        }
.L2607:
        {
          loop_mode
          pfmul_hadds,0,sm      %b[43], %b[20], %b[37], %b[24]
          pfmul_hadds,1,sm      %b[41], %b[5], %b[39], %b[28]
          pfmuls,2,sm   %b[75], %b[18], %b[35]
          xord,3,sm     %b[44], %r0, %b[84]
          xord,4,sm     %b[2], %r0, %b[81]
          pfsubs,5,sm   %b[78], %b[49], %b[1]
          movad,1       area=0, ind=8, am=0, be=0, %b[0]
        }
        {
          loop_mode
          pfmul_hadds,1,sm      %b[62], %b[9], %b[60], %b[20]
          pfmuls,2,sm   %b[83], %b[3], %b[37]
          pshufb,3,sm   0x0, %b[71], %r6, %b[41]
          pshufb,4,sm   0x0, %b[58], %r6, %b[39]
          pfadds,5,sm   %b[78], %b[49], %b[5]
        }
        {
          loop_mode
          pfmul_hadds,1,sm      %b[73], %b[15], %b[53], %b[43]
          pfmuls,2,sm   %b[84], %b[7], %b[58]
          pshufb,4,sm   0x0, %b[44], %r6, %b[60]
          pfmuls,5,sm   %b[81], %b[11], %b[49]
          movad,3       area=0, ind=24, am=0, be=0, %b[9]
        }
        {
          loop_mode
          pfsub_adds,0,sm       %b[33], %b[26], %b[30], %b[62]
          pfsub_rsubs,1,sm      %b[33], %b[26], %b[30], %b[53]
          staad,2       %b[66], %aad2[ %aasti6 ]
          incr,2        %aaincr0
          xord,3,sm     %b[69], %r0, %b[73]
          pshufb,4,sm   0x0, %b[4], %r6, %b[71]
          staad,5       %b[57], %aad1[ %aasti5 ]
          incr,5        %aaincr0
          movad,1       area=2, ind=0, am=1, be=0, %b[44]
          movad,3       area=0, ind=0, am=0, be=0, %b[15]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          pfadd_adds,0,sm       %b[33], %b[26], %b[22], %b[78]
          pfadd_rsubs,1,sm      %b[33], %b[26], %b[22], %b[75]
          staad,2       %b[82], %aad4[ %aasti8 ]
          incr,2        %aaincr0
          xord,3,sm     %b[56], %r0, %b[81]
          staad,5       %b[79], %aad3[ %aasti7 ]
          incr,5        %aaincr0
          movad,0       area=1, ind=0, am=1, be=0, %b[30]
          movad,1       area=0, ind=0, am=1, be=0, %b[57]
          movad,2       area=0, ind=8, am=1, be=0, %b[4]
          movad,3       area=0, ind=16, am=0, be=0, %b[66]
        }

Theoretical speed: 4 complex numbers in 5 cycles (4/5) = 6.4 bytes/cycle
Doubled theoretical speed: 12.8 bytes/cycle

Speed measurements

4. stage_radix2_2x_simd128

Here the stage_radix2_simd128 algorithm is manually unrolled by 2.

C code
void stage_radix2_2x_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef_a, myComplex *coef_b)
{
	__v2di *xy0_in = (__v2di*)&data_in[0];
	__v2di *xy1_in = (__v2di*)&data_in[2];
	__v2di *xy2_in = (__v2di*)&data_in[4];
	__v2di *xy3_in = (__v2di*)&data_in[6];
	__v2di *c0a_in = (__v2di*)&coef_a[0];
	__v2di *c1a_in = (__v2di*)&coef_a[2];
	__v2di *c0b_in = (__v2di*)&coef_b[0];
	__v2di *c1b_in = (__v2di*)&coef_b[data_count/4];

	__v2di *out_add0 = (__v2di*)&data_out[0*data_count/4];
	__v2di *out_add1 = (__v2di*)&data_out[1*data_count/4];
	__v2di *out_sub0 = (__v2di*)&data_out[2*data_count/4];
	__v2di *out_sub1 = (__v2di*)&data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/8; ++i)
	{
		__v2di xy0 = xy0_in[4*i];
		__v2di xy1 = xy1_in[4*i];
		__v2di c0  = c0a_in[2*i];

		__v2di xy2 = xy2_in[4*i];
		__v2di xy3 = xy3_in[4*i];
		__v2di c1  = c1a_in[2*i];

		__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		__v2di conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63});
		__v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

		__v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
		__v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
		__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);

		__v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
		__v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);

		__v2di cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

		__v2di add0 = __builtin_e2k_qpfadds(x0, cy0);
		__v2di sub0 = __builtin_e2k_qpfsubs(x0, cy0);
		__v2di add1 = __builtin_e2k_qpfadds(x1, cy1);
		__v2di sub1 = __builtin_e2k_qpfsubs(x1, cy1);


		xy0 = add0;
		xy1 = add1;
		c0 = c0b_in[i];

		xy2 = sub0;
		xy3 = sub1;
		c1 = c1b_in[i];

		x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63});
		conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63});
		swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

		cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
		cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
		cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);

		cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
		cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);

		cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

		out_add0[i] = __builtin_e2k_qpfadds(x0, cy0);
		out_sub0[i] = __builtin_e2k_qpfsubs(x0, cy0);
		out_add1[i] = __builtin_e2k_qpfadds(x1, cy1);
		out_sub1[i] = __builtin_e2k_qpfsubs(x1, cy1);
	}
}
Main loop in assembly
.L3790:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=32
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=4, abs=16, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=4, abs=16, disp=0
        }
.L3059:
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[56], %b[54], %b[70], %b[1]
          qpshufb,1,sm  %b[9], %b[32], %r15, %b[52]
          qpfadds,2,sm  %b[88], %b[87], %b[0]
          qpshufb,3,sm  %b[33], %b[33], %r0, %b[94]
          qpshufb,4,sm  %b[6], %b[30], %r16, %b[95]
          qpfadds,5,sm  %b[61], %b[71], %b[92]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[74], %b[69], %b[7], %b[33]
          qpshufb,1,sm  %b[29], %b[29], %r14, %b[56]
          qpfsubs,2,sm  %b[55], %b[90], %b[30]
          qpxor,3,sm    %b[60], %r13, %b[54]
          qpxor,4,sm    %b[27], %r13, %b[57]
          qpfsubs,5,sm  %b[95], %b[94], %b[96]
          movaqp,3      area=1, ind=0, am=0, be=0, %b[6]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[56], %b[73], %b[93], %b[29]
          qpshufb,1,sm  %b[13], %b[36], %r16, %b[59]
          qpfsubs,2,sm  %b[86], %b[85], %b[7]
          qpshufb,4,sm  %b[62], %b[62], %r14, %b[61]
          qpfadds,5,sm  %b[95], %b[94], %b[97]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[3], %b[3], %r0, %b[69]
          qpfmuls,2,sm  %b[53], %b[52], %b[68]
          qpshufb,3,sm  %b[14], %b[43], %r15, %b[62]
          qpshufb,4,sm  %b[75], %b[76], %r15, %b[63]
          staaqp,5      %b[72], %aad1[ %aasti5 ]
          incr,5        %aaincr0
          movaqp,0      area=2, ind=0, am=1, be=0, %b[36]
          movaqp,1      area=1, ind=0, am=1, be=0, %b[13]
          movaqp,3      area=1, ind=16, am=1, be=0, %b[56]
        }
        {
          loop_mode
          qpfmuls,0,sm  %b[41], %b[65], %b[3]
          qpshufb,1,sm  %b[0], %b[24], %r15, %b[71]
          qpfsubs,2,sm  %b[59], %b[69], %b[70]
          qpshufb,3,sm  %b[12], %b[12], %r14, %b[72]
          qpshufb,4,sm  %b[83], %b[84], %r16, %b[53]
          staaqp,5      %b[92], %aad2[ %aasti6 ]
          incr,5        %aaincr0
        }
        {
          loop_mode
          qpfmuls,0,sm  %b[54], %b[64], %b[87]
          qpshufb,1,sm  %b[35], %b[35], %r0, %b[88]
          qpfmuls,2,sm  %b[57], %b[71], %b[91]
          qpshufb,3,sm  %b[22], %b[51], %r16, %b[84]
          qpshufb,4,sm  %b[39], %b[39], %r0, %b[83]
          staaqp,5      %b[96], %aad3[ %aasti7 ]
          incr,5        %aaincr0
          movaqp,0      area=0, ind=0, am=0, be=0, %b[41]
          movaqp,1      area=0, ind=16, am=1, be=0, %b[12]
          movaqp,2      area=0, ind=0, am=0, be=0, %b[74]
          movaqp,3      area=0, ind=16, am=1, be=0, %b[73]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qpfmul_hadds,0,sm     %b[61], %b[66], %b[89], %b[35]
          qpshufb,1,sm  %b[50], %b[50], %r14, %b[54]
          qpfadds,2,sm  %b[55], %b[90], %b[22]
          qpxor,3,sm    %b[8], %r13, %b[39]
          qpxor,4,sm    %b[48], %r13, %b[51]
          staaqp,5      %b[97], %aad4[ %aasti8 ]
          incr,5        %aaincr0
        }

Theoretical speed: 8 complex numbers in 7 cycles (8/7) = 9.14 bytes/cycle
Doubled theoretical speed: 18.29 bytes/cycle

Speed measurements

5. stage_radix2_2x_simd128_noConj

Here the stage_radix2_simd128_noConj algorithm is manually unrolled by a factor of 2.

C code
void stage_radix2_2x_simd128_noConj(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef_a, myComplex *coef_b)
{
	__v2di *xy0_in = (__v2di*)&data_in[0];
	__v2di *xy1_in = (__v2di*)&data_in[2];
	__v2di *xy2_in = (__v2di*)&data_in[4];
	__v2di *xy3_in = (__v2di*)&data_in[6];
	__v2di *c0a_in = (__v2di*)&coef_a[0];
	__v2di *c1a_in = (__v2di*)&coef_a[2];
	__v2di *c0b_in = (__v2di*)&coef_b[0];
	__v2di *c1b_in = (__v2di*)&coef_b[data_count/4];

	__v2di *out_add0 = (__v2di*)&data_out[0*data_count/4];
	__v2di *out_add1 = (__v2di*)&data_out[1*data_count/4];
	__v2di *out_sub0 = (__v2di*)&data_out[2*data_count/4];
	__v2di *out_sub1 = (__v2di*)&data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/8; ++i)
	{
		__v2di xy0 = xy0_in[4*i];
		__v2di xy1 = xy1_in[4*i];
		__v2di c0  = c0a_in[2*i];

		__v2di xy2 = xy2_in[4*i];
		__v2di xy3 = xy3_in[4*i];
		__v2di c1  = c1a_in[2*i];

		__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		__v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

		__v2di cy0_real = __builtin_e2k_qpfmuls(     c0, y0);
		__v2di cy1_real = __builtin_e2k_qpfmuls(     c1, y1);
		__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);

		__v2di cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);
		__v2di cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
		__v2di cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
		__v2di cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);

		__v2di cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});

		__v2di add0 = __builtin_e2k_qpfadds(x0, cy0);
		__v2di sub0 = __builtin_e2k_qpfsubs(x0, cy0);
		__v2di add1 = __builtin_e2k_qpfadds(x1, cy1);
		__v2di sub1 = __builtin_e2k_qpfsubs(x1, cy1);


		xy0 = add0;
		xy1 = add1;
		c0 = c0b_in[i];

		xy2 = sub0;
		xy3 = sub1;
		c1 = c1b_in[i];

		x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

		cy0_real = __builtin_e2k_qpfmuls(     c0, y0);
		cy1_real = __builtin_e2k_qpfmuls(     c1, y1);
		cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);

		cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);
		cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
		cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
		cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);

		cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});

		out_add0[i] = __builtin_e2k_qpfadds(x0, cy0);
		out_sub0[i] = __builtin_e2k_qpfsubs(x0, cy0);
		out_add1[i] = __builtin_e2k_qpfadds(x1, cy1);
		out_sub1[i] = __builtin_e2k_qpfsubs(x1, cy1);
	}
}
Main loop in assembly
.L4577:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=32
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=4, abs=16, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=4, abs=16, disp=0
        }
.L3851:
        {
          loop_mode
          qpfmul_hsubs,0,sm     %b[53], %b[54], %r16, %b[5]
          qpshufb,1,sm  %b[39], %b[39], %r15, %b[25]
          qpfmul_hadds,2,sm     %b[92], %b[54], %r16, %b[1]
          qpshufb,3,sm  %b[16], %b[8], %r14, %b[24]
          qpfsubs,4,sm  %b[66], %b[22], %b[9]
          qpfsubs,5,sm  %b[89], %b[69], %b[0]
        }
        {
          loop_mode
          qpfmul_hsubs,0,sm     %b[39], %b[71], %r16, %b[58]
          qpshufb,1,sm  %b[72], %b[75], %r0, %b[64]
          qpfmul_hadds,2,sm     %b[25], %b[71], %r16, %b[54]
          qpshufb,3,sm  %b[20], %b[20], %r15, %b[62]
          qpfsubs,4,sm  %b[86], %b[59], %b[8]
          qpfadds,5,sm  %b[24], %b[48], %b[61]
          movaqp,2      area=1, ind=0, am=0, be=0, %b[53]
          movaqp,3      area=1, ind=16, am=1, be=0, %b[16]
        }
        {
          loop_mode
          qpfmul_hsubs,0,sm     %b[57], %b[64], %r16, %b[78]
          qpshufb,1,sm  %b[57], %b[57], %r15, %b[88]
          staaqp,2      %b[87], %aad1[ %aasti5 ]
          incr,2        %aaincr0
          qpshufb,3,sm  %b[26], %b[34], %r0, %b[81]
          qpshufb,4,sm  %b[32], %b[40], %r14, %b[84]
          qpfsubs,5,sm  %b[24], %b[48], %b[85]
          movaqp,0      area=2, ind=0, am=1, be=0, %b[39]
          movaqp,1      area=1, ind=0, am=1, be=0, %b[25]
          movaqp,2      area=0, ind=0, am=0, be=0, %b[71]
          movaqp,3      area=0, ind=16, am=1, be=0, %b[68]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[88], %b[64], %r16, %b[48]
          qpshufb,1,sm  %b[51], %b[51], %r15, %b[90]
          staaqp,2      %b[63], %aad2[ %aasti6 ]
          incr,2        %aaincr0
          qpshufb,3,sm  %b[76], %b[79], %r14, %b[87]
          qppermb,4,sm  %b[15], %b[67], %r13, %b[57]
          qpfadds,5,sm  %b[89], %b[69], %b[40]
          movaqp,0      area=0, ind=0, am=0, be=0, %b[32]
          movaqp,1      area=0, ind=16, am=1, be=0, %b[24]
        }
        {
          loop_mode
          qpfmul_hsubs,0,sm     %b[20], %b[83], %r16, %b[63]
          qpshufb,1,sm  %b[17], %b[42], %r0, %b[69]
          staaqp,2      %b[11], %aad3[ %aasti7 ]
          incr,2        %aaincr0
          qppermb,3,sm  %b[52], %b[82], %r13, %b[67]
          qpshufb,4,sm  %b[21], %b[46], %r14, %b[64]
          qpfadds,5,sm  %b[86], %b[59], %b[15]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qpfmul_hadds,0,sm     %b[62], %b[83], %r16, %b[11]
          qpshufb,1,sm  %b[10], %b[2], %r0, %b[52]
          staaqp,2      %b[23], %aad4[ %aasti8 ]
          incr,2        %aaincr0
          qppermb,3,sm  %b[3], %b[7], %r13, %b[46]
          qppermb,4,sm  %b[56], %b[60], %r13, %b[20]
          qpfadds,5,sm  %b[66], %b[22], %b[21]
        }

Theoretical speed: 8 complex numbers in 6 cycles (8/6) = 10.67 bytes/cycle
Doubled theoretical speed: 21.33 bytes/cycle

Speed measurements

Summary for stage_radix2_2x

Speeds have increased compared with the original stage_radix2 versions.
The FFT graph is here.


stage_radix2_readConjSwap

Let's return to the stage_radix2 algorithms. Note that conj_c and swap_c are derived directly from c, which is read from memory and is not used anywhere else.

Optimization: instead of computing conj_c and swap_c, read them from memory directly; c itself no longer needs to be read. As a result, two instructions go away: xor and shuf.
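To make the idea concrete, here is a portable scalar sketch of how such tables could be prepared and used. It is not the author's table-generation code: `cpx_t`, `make_conj_swap` and `mul_via_conj_swap` are hypothetical names, and the sketch assumes that the horizontal add sums the two halves of each operand, which is what the variable names in the listings suggest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for myComplex: real part first, then imaginary. */
typedef struct { float re, im; } cpx_t;

/* Build both precomputed tables from the plain coefficient table:
   conj_coef[i] = (re, -im)   replaces the xor with the sign bit,
   swap_coef[i] = (im,  re)   replaces the shuffle. */
static void make_conj_swap(size_t n, const cpx_t *coef,
                           cpx_t *conj_coef, cpx_t *swap_coef)
{
	for (size_t i = 0; i < n; ++i) {
		conj_coef[i].re =  coef[i].re;
		conj_coef[i].im = -coef[i].im;
		swap_coef[i].re =  coef[i].im;
		swap_coef[i].im =  coef[i].re;
	}
}

/* c*y computed the way the loop does it: two element-wise products
   followed by a horizontal add of each product. */
static cpx_t mul_via_conj_swap(cpx_t conj_c, cpx_t swap_c, cpx_t y)
{
	float real_lo = conj_c.re * y.re;   /*  c.re * y.re */
	float real_hi = conj_c.im * y.im;   /* -c.im * y.im */
	float imag_lo = swap_c.re * y.re;   /*  c.im * y.re */
	float imag_hi = swap_c.im * y.im;   /*  c.re * y.im */
	cpx_t cy = { real_lo + real_hi, imag_lo + imag_hi };
	return cy;
}

/* Checks one product against the textbook formula:
   (0.5 - 2i) * (3 + 4i) = 9.5 - 4i. */
static void check_conj_swap(void)
{
	cpx_t c = { 0.5f, -2.0f }, y = { 3.0f, 4.0f };
	cpx_t conj_c, swap_c;
	make_conj_swap(1, &c, &conj_c, &swap_c);
	cpx_t cy = mul_via_conj_swap(conj_c, swap_c, y);
	assert(cy.re == 9.5f && cy.im == -4.0f);
}
```

The price of the optimization is visible here as well: two tables have to be stored and read instead of one.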

Let's see what comes of it.

1. stage_radix2_readConjSwap_simd64

A development of stage_radix2_simd64: computing conj and swap is replaced with reading them from memory.

C code
void stage_radix2_readConjSwap_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef, myComplex *swap_coef)
{
	uint64_t *x_in = (uint64_t*)&data_in[0];
	uint64_t *y_in = (uint64_t*)&data_in[1];
	uint64_t *conj_c_in = (uint64_t*)conj_coef;
	uint64_t *swap_c_in = (uint64_t*)swap_coef;

	uint64_t *out_add = (uint64_t*)&data_out[0];
	uint64_t *out_sub = (uint64_t*)&data_out[data_count/2];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/2; ++i)
	{
		uint64_t x = x_in[2*i];
		uint64_t y = y_in[2*i];
		uint64_t conj_c = conj_c_in[i];
		uint64_t swap_c = swap_c_in[i];

		uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
		uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);

		uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);

		out_add[i] = __builtin_e2k_pfadds(x, cy);
		out_sub[i] = __builtin_e2k_pfsubs(x, cy);
	}
}
Main loop in assembly
.L326:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
        }
.L125:
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          pfmuls,0,sm   %b[67], %b[33], %b[45]
          pfsubs,1,sm   %b[42], %b[62], %b[62]
          staad,2       %b[70], %aad1[ %aasti4 ]
          incr,2        %aaincr0
          pfmul_hadds,3,sm      %b[23], %b[45], %b[57], %b[42]
          pfadds,4,sm   %b[42], %b[62], %b[67]
          staad,5       %b[75], %aad2[ %aasti5 ]
          incr,5        %aaincr0
          movad,0       area=0, ind=0, am=1, be=0, %b[57]
          movad,1       area=1, ind=0, am=1, be=0, %b[1]
          movad,2       area=0, ind=8, am=1, be=0, %b[23]
          movad,3       area=0, ind=0, am=0, be=0, %b[0]
        }

Previously the loop had 8 instructions; now it has 6.
Six instructions fit perfectly into 1 cycle.

Theoretical speed: 2 complex numbers in 1 cycle (2/1) = 16 bytes/cycle

Speed measurements

2. stage_radix2_readConjSwap_simd128

A development of stage_radix2_simd128: computing conj and swap is replaced with reading them from memory.
Developing stage_radix2_simd128_noConj leads to this same function.

C code
void stage_radix2_readConjSwap_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef, myComplex *swap_coef)
{
	__v2di *xy0_in = (__v2di*)&data_in[0];
	__v2di *xy1_in = (__v2di*)&data_in[2];
	__v2di *conj_c_in = (__v2di*)conj_coef;
	__v2di *swap_c_in = (__v2di*)swap_coef;

	__v2di *out_add = (__v2di*)&data_out[0];
	__v2di *out_sub = (__v2di*)&data_out[data_count/2];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/4; ++i)
	{
		__v2di xy0 = xy0_in[2*i];
		__v2di xy1 = xy1_in[2*i];
		__v2di conj_c = conj_c_in[i];
		__v2di swap_c = swap_c_in[i];

		__v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		__v2di cy_real = __builtin_e2k_qpfmuls(conj_c, y);
		__v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);

		__v2di cy_rrii = __builtin_e2k_qpfhadds(cy_real, cy_imag);

		__v2di cy = __builtin_e2k_qpshufb(cy_rrii, cy_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

		out_add[i] = __builtin_e2k_qpfadds(x, cy);
		out_sub[i] = __builtin_e2k_qpfsubs(x, cy);
	}
}
Main loop in assembly
.L599:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=4, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
        }
.L353:
        {
          loop_mode
          qpshufb,1,sm  %b[31], %b[40], %r0, %b[16]
          qpshufb,3,sm  %b[33], %b[42], %r7, %b[0]
          qpfsubs,4,sm  %b[14], %b[44], %b[21]
          staaqp,5      %b[25], %aad1[ %aasti4 ]
          incr,5        %aaincr0
          movaqp,0      area=0, ind=0, am=1, be=0, %b[13]
          movaqp,1      area=1, ind=0, am=1, be=0, %b[1]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qpfmuls,0,sm  %b[19], %b[16], %b[33]
          qpfmul_hadds,2,sm     %b[11], %b[20], %b[37], %b[22]
          qpshufb,3,sm  %b[32], %b[32], %r6, %b[42]
          qpfadds,4,sm  %b[14], %b[44], %b[39]
          staaqp,5      %b[43], %aad2[ %aasti5 ]
          incr,5        %aaincr0
          movaqp,2      area=0, ind=0, am=0, be=0, %b[34]
          movaqp,3      area=0, ind=16, am=1, be=0, %b[25]
        }

Previously the loop had 11 instructions; now it has 9.

Theoretical speed: 4 complex numbers in 2 cycles (4/2) = 16 bytes/cycle

Speed measurements

The loop now occupies 9/6 of a cycle, which rounds up to 2 whole cycles. Unrolling by a factor of 2 should give 2 * 9/6 = 3 cycles.
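The estimate can be written down as a tiny helper. This is a rough sketch under one assumption taken from the text: at most 6 of the counted instructions can be issued per cycle. The names `OPS_PER_CYCLE` and `min_cycles` are invented for illustration, and the helper ignores every other limit (loads, dependencies, register pressure).

```c
#include <assert.h>

/* Assumption from the text: six counted instructions fit into one cycle. */
enum { OPS_PER_CYCLE = 6 };

/* Lower bound on cycles per loop iteration after unrolling `unroll` times:
   the instruction count divided by the issue width, rounded up. */
static int min_cycles(int ops_per_iter, int unroll)
{
	int ops = ops_per_iter * unroll;
	return (ops + OPS_PER_CYCLE - 1) / OPS_PER_CYCLE;
}

/* 9 instructions need 2 cycles as they are, and 18 need 3 after
   unrolling by 2, so two iterations cost 3 cycles instead of 4. */
static void check_unroll_estimate(void)
{
	assert(min_cycles(9, 1) == 2);
	assert(min_cycles(9, 2) == 3);
}
```

The gain comes purely from rounding: 1.5 cycles cannot be scheduled, but 3 cycles for two iterations can.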


3. stage_radix2_readConjSwap_simd128_unroll2

Here the loop is unrolled by a factor of 2 using the unroll pragma.

C code
void stage_radix2_readConjSwap_simd128_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef, myComplex *swap_coef)
{
	__v2di *xy0_in = (__v2di*)&data_in[0];
	__v2di *xy1_in = (__v2di*)&data_in[2];
	__v2di *conj_c_in = (__v2di*)conj_coef;
	__v2di *swap_c_in = (__v2di*)swap_coef;

	__v2di *out_add = (__v2di*)&data_out[0];
	__v2di *out_sub = (__v2di*)&data_out[data_count/2];

	#pragma ivdep
	#pragma unroll(2)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/4; ++i)
	{
		__v2di xy0 = xy0_in[2*i];
		__v2di xy1 = xy1_in[2*i];
		__v2di conj_c = conj_c_in[i];
		__v2di swap_c = swap_c_in[i];

		__v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		__v2di cy_real = __builtin_e2k_qpfmuls(conj_c, y);
		__v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);

		__v2di cy_rrii = __builtin_e2k_qpfhadds(cy_real, cy_imag);

		__v2di cy = __builtin_e2k_qpshufb(cy_rrii, cy_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

		out_add[i] = __builtin_e2k_qpfadds(x, cy);
		out_sub[i] = __builtin_e2k_qpfsubs(x, cy);
	}
}
Main loop in assembly
.L992:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=4, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=4, abs=0, disp=32
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=3, asz=4, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=2, asz=4, abs=16, disp=0
        }
.L626:
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[27], %b[14], %b[92], %b[1]
          qpfmuls,1,sm  %b[81], %b[4], %b[9]
          qpfsubs,2,sm  %b[89], %b[88], %b[34]
          qpshufb,3,sm  %b[52], %b[51], %r0, %b[0]
          qpshufb,4,sm  %b[20], %b[33], %r0, %b[8]
          qpfadds,5,sm  %b[89], %b[88], %b[13]
        }
        {
          loop_mode
          qpfmuls,0,sm  %b[73], %b[10], %b[88]
          qpfsubs,1,sm  %b[84], %b[85], %b[89]
          staaqp,2      %b[36], %aad1[ %aasti4 ]
          qpshufb,3,sm  %b[66], %b[65], %r13, %b[80]
          qpshufb,4,sm  %b[74], %b[74], %r14, %b[81]
          staaqp,5      %b[15], %aad2[ %aasti5 ]
          movaqp,0      area=0, ind=0, am=0, be=0, %b[27]
          movaqp,1      area=0, ind=16, am=1, be=0, %b[14]
          movaqp,2      area=0, ind=0, am=0, be=0, %b[47]
          movaqp,3      area=0, ind=16, am=1, be=0, %b[48]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qpfmul_hadds,0,sm     %b[46], %b[6], %b[11], %b[66]
          qpfadds,1,sm  %b[82], %b[83], %b[74]
          staaqp,2      %b[91], %aad1[ %aasti4 + _f32s,_lts0 0x10 ]
          incr,2        %aaincr3
          qpshufb,3,sm  %b[32], %b[45], %r13, %b[85]
          qpshufb,4,sm  %b[7], %b[7], %r14, %b[84]
          staaqp,5      %b[78], %aad2[ %aasti5 + _f32s,_lts0 0x10 ]
          incr,5        %aaincr3
          movaqp,0      area=1, ind=0, am=0, be=0, %b[65]
          movaqp,1      area=1, ind=16, am=1, be=0, %b[73]
          movaqp,2      area=1, ind=0, am=0, be=0, %b[15]
          movaqp,3      area=1, ind=16, am=1, be=0, %b[36]
        }

Theoretical speed: 8 complex numbers in 3 cycles (8/3) = 21.33 bytes/cycle

Speed measurements

Summary for stage_radix2_readConjSwap

The FFT graph is here.


stage_radix2_readConjSwap_2x

One pass of stage_radix2_readConjSwap_2x does the same work as 2 passes of stage_radix2. For that reason, the speed of stage_radix2_readConjSwap_2x is multiplied by 2 to make the comparison with stage_radix2 easier (this is noted on the graph axis and in the console output).
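The equivalence can be checked with a scalar reference model. The sketch below follows the indexing used in the listings of this article (pairs read from neighbouring elements, sums written to the first half of the output and differences to the second half), but it is a simplified model with plain coefficients rather than the conj/swap tables; `cpx_t`, `stage_ref` and `stage_2x_ref` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for myComplex. */
typedef struct { float re, im; } cpx_t;

static cpx_t c_add(cpx_t a, cpx_t b) { cpx_t r = { a.re + b.re, a.im + b.im }; return r; }
static cpx_t c_sub(cpx_t a, cpx_t b) { cpx_t r = { a.re - b.re, a.im - b.im }; return r; }
static cpx_t c_mul(cpx_t a, cpx_t b)
{
	cpx_t r = { a.re * b.re - a.im * b.im, a.im * b.re + a.re * b.im };
	return r;
}

/* One radix-2 pass: x = in[2i], y = in[2i+1], c = coef[i]. */
static void stage_ref(size_t n, const cpx_t *in, cpx_t *out, const cpx_t *coef)
{
	for (size_t i = 0; i < n / 2; ++i) {
		cpx_t cy = c_mul(coef[i], in[2 * i + 1]);
		out[i]         = c_add(in[2 * i], cy);
		out[i + n / 2] = c_sub(in[2 * i], cy);
	}
}

/* Fused pass: two butterflies of the first pass feed two butterflies
   of the second pass without going through memory. */
static void stage_2x_ref(size_t n, const cpx_t *in, cpx_t *out,
                         const cpx_t *coef_a, const cpx_t *coef_b)
{
	for (size_t i = 0; i < n / 4; ++i) {
		cpx_t cy0 = c_mul(coef_a[2 * i],     in[4 * i + 1]);
		cpx_t cy1 = c_mul(coef_a[2 * i + 1], in[4 * i + 3]);
		cpx_t add0 = c_add(in[4 * i],     cy0), sub0 = c_sub(in[4 * i],     cy0);
		cpx_t add1 = c_add(in[4 * i + 2], cy1), sub1 = c_sub(in[4 * i + 2], cy1);

		cy0 = c_mul(coef_b[i],         add1);
		cy1 = c_mul(coef_b[i + n / 4], sub1);
		out[i]             = c_add(add0, cy0);
		out[i + n / 2]     = c_sub(add0, cy0);
		out[i + n / 4]     = c_add(sub0, cy1);
		out[i + 3 * n / 4] = c_sub(sub0, cy1);
	}
}

/* Returns 1 when the fused pass matches two separate passes exactly.
   Small integer inputs keep every float operation exact. */
static int fused_matches_two_passes(void)
{
	enum { N = 8 };
	cpx_t in[N], mid[N], ref[N], out[N], ca[N / 2], cb[N / 2];
	for (int k = 0; k < N; ++k) { in[k].re = (float)(k + 1); in[k].im = (float)-k; }
	for (int k = 0; k < N / 2; ++k) {
		ca[k].re = (float)k; ca[k].im = 1.0f;
		cb[k].re = 1.0f;     cb[k].im = (float)-k;
	}
	stage_ref(N, in, mid, ca);
	stage_ref(N, mid, ref, cb);
	stage_2x_ref(N, in, out, ca, cb);
	for (int k = 0; k < N; ++k)
		if (ref[k].re != out[k].re || ref[k].im != out[k].im) return 0;
	return 1;
}
```

The fused version reads the input once and writes the output once, which is the reason for counting its speed twice.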

1. stage_radix2_readConjSwap_2x_simd64

Here the stage_radix2_readConjSwap_simd64 algorithm is manually unrolled by a factor of 2.

C code
void stage_radix2_readConjSwap_2x_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef_a, myComplex *conj_coef_b, myComplex *swap_coef_a, myComplex *swap_coef_b)
{
	uint64_t *x0_in = (uint64_t*)&data_in[0];
	uint64_t *y0_in = (uint64_t*)&data_in[1];
	uint64_t *x1_in = (uint64_t*)&data_in[2];
	uint64_t *y1_in = (uint64_t*)&data_in[3];
	uint64_t *conj_c0a_in = (uint64_t*)&conj_coef_a[0];
	uint64_t *conj_c1a_in = (uint64_t*)&conj_coef_a[1];
	uint64_t *conj_c0b_in = (uint64_t*)&conj_coef_b[0];
	uint64_t *conj_c1b_in = (uint64_t*)&conj_coef_b[data_count/4];
	uint64_t *swap_c0a_in = (uint64_t*)&swap_coef_a[0];
	uint64_t *swap_c1a_in = (uint64_t*)&swap_coef_a[1];
	uint64_t *swap_c0b_in = (uint64_t*)&swap_coef_b[0];
	uint64_t *swap_c1b_in = (uint64_t*)&swap_coef_b[data_count/4];

	uint64_t *out_add0 = (uint64_t*)&data_out[0*data_count/4];
	uint64_t *out_add1 = (uint64_t*)&data_out[1*data_count/4];
	uint64_t *out_sub0 = (uint64_t*)&data_out[2*data_count/4];
	uint64_t *out_sub1 = (uint64_t*)&data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/4; ++i)
	{
		uint64_t x0 = x0_in[4*i];
		uint64_t y0 = y0_in[4*i];
		uint64_t conj_c0 = conj_c0a_in[2*i];
		uint64_t swap_c0 = swap_c0a_in[2*i];

		uint64_t x1 = x1_in[4*i];
		uint64_t y1 = y1_in[4*i];
		uint64_t conj_c1 = conj_c1a_in[2*i];
		uint64_t swap_c1 = swap_c1a_in[2*i];

		uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
		uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
		uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
		uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);

		uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
		uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);

		uint64_t add0 = __builtin_e2k_pfadds(x0, cy0);
		uint64_t sub0 = __builtin_e2k_pfsubs(x0, cy0);
		uint64_t add1 = __builtin_e2k_pfadds(x1, cy1);
		uint64_t sub1 = __builtin_e2k_pfsubs(x1, cy1);


		x0 = add0;
		y0 = add1;
		conj_c0 = conj_c0b_in[i];
		swap_c0 = swap_c0b_in[i];

		x1 = sub0;
		y1 = sub1;
		conj_c1 = conj_c1b_in[i];
		swap_c1 = swap_c1b_in[i];

		cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
		cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
		cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
		cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);

		cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
		cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);

		out_add0[i] = __builtin_e2k_pfadds(x0, cy0);
		out_sub0[i] = __builtin_e2k_pfsubs(x0, cy0);
		out_add1[i] = __builtin_e2k_pfadds(x1, cy1);
		out_sub1[i] = __builtin_e2k_pfsubs(x1, cy1);
	}
}
Main loop in assembly
.L723:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=3, asz=3, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=1, asz=3, abs=0, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=2, asz=3, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=3, abs=8, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=3, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=16, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=3, abs=24, disp=0
        }
.L322:
        {
          loop_mode
          pfmuls,1,sm   %b[60], %b[45], %b[34]
          pfmul_hadds,2,sm      %b[108], %b[47], %b[36], %b[0]
          pfmul_hadds,3,sm      %b[41], %b[97], %b[111], %b[21]
          pfsub_adds,4,sm       %b[77], %b[103], %b[25], %b[6]
          pfsub_rsubs,5,sm      %b[77], %b[103], %b[25], %b[1]
        }
        {
          loop_mode
          pfmul_hadds,3,sm      %b[76], %b[102], %b[107], %b[53]
          pfadd_adds,4,sm       %b[77], %b[103], %b[57], %b[50]
          pfadd_rsubs,5,sm      %b[77], %b[103], %b[57], %b[47]
          movad,0       area=0, ind=8, am=0, be=0, %b[56]
          movad,1       area=3, ind=0, am=1, be=0, %b[36]
          movad,2       area=0, ind=24, am=0, be=0, %b[41]
          movad,3       area=2, ind=0, am=1, be=0, %b[25]
        }
        {
          loop_mode
          staad,2       %b[10], %aad2[ %aasti9 ]
          incr,2        %aaincr0
          pfsubs,3,sm   %b[32], %b[4], %b[91]
          pfmuls,4,sm   %b[22], %b[17], %b[92]
          staad,5       %b[5], %aad1[ %aasti8 ]
          incr,5        %aaincr0
          movad,0       area=2, ind=0, am=1, be=0, %b[77]
          movad,1       area=1, ind=0, am=0, be=0, %b[76]
          movad,2       area=1, ind=0, am=1, be=0, %b[60]
          movad,3       area=0, ind=0, am=0, be=0, %b[57]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          pfadds,0,sm   %b[32], %b[4], %b[96]
          pfmuls,1,sm   %b[89], %b[98], %b[103]
          staad,2       %b[54], %aad4[ %aasti11 ]
          incr,2        %aaincr0
          pfmuls,3,sm   %b[48], %b[93], %b[107]
          pfmul_hadds,4,sm      %b[90], %b[19], %b[94], %b[97]
          staad,5       %b[51], %aad3[ %aasti10 ]
          incr,5        %aaincr0
          movad,0       area=1, ind=8, am=1, be=0, %b[102]
          movad,1       area=0, ind=0, am=1, be=0, %b[10]
          movad,2       area=0, ind=8, am=1, be=0, %b[5]
          movad,3       area=0, ind=16, am=0, be=0, %b[22]
        }

Theoretical speed: 4 complex numbers in 4 cycles (4/4) = 8 bytes/cycle
Doubled theoretical speed: 16 bytes/cycle

Speed measurements

2. stage_radix2_readConjSwap_2x_simd64_unroll2

Here the loop is unrolled by a factor of 2 using the unroll pragma.

C code
void stage_radix2_readConjSwap_2x_simd64_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef_a, myComplex *conj_coef_b, myComplex *swap_coef_a, myComplex *swap_coef_b)
{
	uint64_t *x0_in = (uint64_t*)&data_in[0];
	uint64_t *y0_in = (uint64_t*)&data_in[1];
	uint64_t *x1_in = (uint64_t*)&data_in[2];
	uint64_t *y1_in = (uint64_t*)&data_in[3];
	uint64_t *conj_c0a_in = (uint64_t*)&conj_coef_a[0];
	uint64_t *conj_c1a_in = (uint64_t*)&conj_coef_a[1];
	uint64_t *conj_c0b_in = (uint64_t*)&conj_coef_b[0];
	uint64_t *conj_c1b_in = (uint64_t*)&conj_coef_b[data_count/4];
	uint64_t *swap_c0a_in = (uint64_t*)&swap_coef_a[0];
	uint64_t *swap_c1a_in = (uint64_t*)&swap_coef_a[1];
	uint64_t *swap_c0b_in = (uint64_t*)&swap_coef_b[0];
	uint64_t *swap_c1b_in = (uint64_t*)&swap_coef_b[data_count/4];

	uint64_t *out_add0 = (uint64_t*)&data_out[0*data_count/4];
	uint64_t *out_add1 = (uint64_t*)&data_out[1*data_count/4];
	uint64_t *out_sub0 = (uint64_t*)&data_out[2*data_count/4];
	uint64_t *out_sub1 = (uint64_t*)&data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(2)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/4; ++i)
	{
		uint64_t x0 = x0_in[4*i];
		uint64_t y0 = y0_in[4*i];
		uint64_t conj_c0 = conj_c0a_in[2*i];
		uint64_t swap_c0 = swap_c0a_in[2*i];

		uint64_t x1 = x1_in[4*i];
		uint64_t y1 = y1_in[4*i];
		uint64_t conj_c1 = conj_c1a_in[2*i];
		uint64_t swap_c1 = swap_c1a_in[2*i];

		uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
		uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
		uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
		uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);

		uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
		uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);

		uint64_t add0 = __builtin_e2k_pfadds(x0, cy0);
		uint64_t sub0 = __builtin_e2k_pfsubs(x0, cy0);
		uint64_t add1 = __builtin_e2k_pfadds(x1, cy1);
		uint64_t sub1 = __builtin_e2k_pfsubs(x1, cy1);


		x0 = add0;
		y0 = add1;
		conj_c0 = conj_c0b_in[i];
		swap_c0 = swap_c0b_in[i];

		x1 = sub0;
		y1 = sub1;
		conj_c1 = conj_c1b_in[i];
		swap_c1 = swap_c1b_in[i];

		cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
		cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
		cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
		cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);

		cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
		cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);

		out_add0[i] = __builtin_e2k_pfadds(x0, cy0);
		out_sub0[i] = __builtin_e2k_pfsubs(x0, cy0);
		out_add1[i] = __builtin_e2k_pfadds(x1, cy1);
		out_sub1[i] = __builtin_e2k_pfsubs(x1, cy1);
	}
}
Main loop in assembly
.L1964:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=3, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=3, abs=0, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=3, asz=3, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=2, asz=3, abs=8, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=7, asz=3, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=6, asz=3, abs=16, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=5, asz=3, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=4, asz=3, abs=24, disp=0
        }
.L1045:
        {
          loop_mode
          pfmul_hadds,0,sm      %b[71], %b[68], %b[118], %b[31]
          pfsub_adds,1,sm       %b[85], %b[94], %b[107], %b[1]
          pfsub_rsubs,2,sm      %b[85], %b[94], %b[107], %b[0]
          pfmuls,3,sm   %b[54], %b[50], %b[119]
          pfmuls,4,sm   %b[31], %b[18], %b[117]
          pfmuls,5,sm   %b[62], %b[103], %b[115]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[43], %b[116], %g16, %b[85]
          pfadd_adds,1,sm       %b[85], %b[94], %b[33], %b[71]
          pfadd_rsubs,2,sm      %b[85], %b[94], %b[33], %b[68]
          pfmul_hadds,3,sm      %b[108], %b[38], %g17, %b[62]
          pfmuls,5,sm   %b[95], %b[66], %b[116]
          movad,0       area=3, ind=0, am=0, be=0, %b[43]
          movad,1       area=3, ind=8, am=1, be=0, %b[54]
          movad,2       area=3, ind=0, am=0, be=0, %b[33]
          movad,3       area=3, ind=8, am=1, be=0, %b[38]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[61], %b[76], %g18, %b[107]
          pfsub_adds,1,sm       %b[23], %b[101], %b[87], %b[94]
          pfsub_rsubs,2,sm      %b[23], %b[101], %b[87], %b[95]
          pfmuls,4,sm   %g19, %b[36], %g17
          pfmuls,5,sm   %b[51], %b[114], %g16
          movad,0       area=2, ind=0, am=0, be=0, %b[76]
          movad,1       area=2, ind=8, am=1, be=0, %b[87]
          movad,2       area=2, ind=0, am=0, be=0, %b[51]
          movad,3       area=2, ind=8, am=1, be=0, %b[61]
        }
        {
          loop_mode
          pfadd_adds,0,sm       %b[23], %b[101], %b[109], %b[108]
          pfadd_rsubs,1,sm      %b[23], %b[101], %b[109], %b[109]
          staad,2       %b[3], %aad2[ %aasti9 + _f32s,_lts0 0x8 ]
          pfsubs,3,sm   %b[30], %b[64], %b[101]
          pfmuls,4,sm   %b[84], %b[74], %g18
          staad,5       %b[2], %aad1[ %aasti8 + _f32s,_lts0 0x8 ]
          movad,0       area=1, ind=16, am=0, be=0, %b[23]
          movad,1       area=1, ind=0, am=0, be=0, %b[84]
          movad,2       area=1, ind=16, am=0, be=0, %b[2]
          movad,3       area=1, ind=0, am=0, be=0, %b[3]
        }
        {
          loop_mode
          staad,2       %b[73], %aad4[ %aasti11 + _f32s,_lts0 0x8 ]
          pfmul_hadds,3,sm      %b[34], %b[50], %b[119], %b[70]
          pfadds,4,sm   %b[30], %b[64], %b[64]
          staad,5       %b[70], %aad3[ %aasti10 + _f32s,_lts0 0x8 ]
          movad,0       area=1, ind=8, am=1, be=0, %b[50]
          movad,1       area=1, ind=24, am=0, be=0, %g19
          movad,2       area=1, ind=8, am=0, be=0, %b[30]
          movad,3       area=0, ind=24, am=0, be=0, %b[34]
        }
        {
          loop_mode
          pfmul_hadds,1,sm      %b[11], %b[104], %b[113], %b[97]
          staad,2       %b[96], %aad2[ %aasti9 ]
          incr,2        %aaincr4
          pfsubs,4,sm   %b[24], %b[72], %b[112]
          staad,5       %b[97], %aad1[ %aasti8 ]
          incr,5        %aaincr4
          movad,0       area=0, ind=0, am=0, be=0, %b[11]
          movad,1       area=0, ind=8, am=0, be=0, %b[96]
          movad,2       area=1, ind=24, am=1, be=0, %b[104]
          movad,3       area=0, ind=0, am=0, be=0, %b[73]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          pfmul_hadds,0,sm      %b[10], %b[18], %b[117], %b[90]
          pfmul_hadds,1,sm      %b[46], %b[103], %b[115], %b[103]
          staad,2       %b[110], %aad4[ %aasti11 ]
          incr,2        %aaincr4
          pfmuls,3,sm   %b[90], %b[102], %b[111]
          pfadds,4,sm   %b[24], %b[72], %b[72]
          staad,5       %b[111], %aad3[ %aasti10 ]
          incr,5        %aaincr4
          movad,0       area=0, ind=16, am=0, be=0, %b[18]
          movad,1       area=0, ind=24, am=1, be=0, %b[46]
          movad,2       area=0, ind=8, am=1, be=0, %b[10]
          movad,3       area=0, ind=16, am=0, be=0, %b[24]
        }

Theoretical speed: 8 complex numbers per 7 cycles (8/7) = 9.14 bytes/cycle
Doubled theoretical speed: 18.29 bytes/cycle

Speed measurements

3. stage_radix2_readConjSwap_2x_simd64_unroll4

Here the loop is unrolled by a factor of 4 with the unroll pragma.

C code
void stage_radix2_readConjSwap_2x_simd64_unroll4(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef_a, myComplex *conj_coef_b, myComplex *swap_coef_a, myComplex *swap_coef_b)
{
	uint64_t *x0_in = (uint64_t*)&data_in[0];
	uint64_t *y0_in = (uint64_t*)&data_in[1];
	uint64_t *x1_in = (uint64_t*)&data_in[2];
	uint64_t *y1_in = (uint64_t*)&data_in[3];
	uint64_t *conj_c0a_in = (uint64_t*)&conj_coef_a[0];
	uint64_t *conj_c1a_in = (uint64_t*)&conj_coef_a[1];
	uint64_t *conj_c0b_in = (uint64_t*)&conj_coef_b[0];
	uint64_t *conj_c1b_in = (uint64_t*)&conj_coef_b[data_count/4];
	uint64_t *swap_c0a_in = (uint64_t*)&swap_coef_a[0];
	uint64_t *swap_c1a_in = (uint64_t*)&swap_coef_a[1];
	uint64_t *swap_c0b_in = (uint64_t*)&swap_coef_b[0];
	uint64_t *swap_c1b_in = (uint64_t*)&swap_coef_b[data_count/4];

	uint64_t *out_add0 = (uint64_t*)&data_out[0*data_count/4];
	uint64_t *out_add1 = (uint64_t*)&data_out[1*data_count/4];
	uint64_t *out_sub0 = (uint64_t*)&data_out[2*data_count/4];
	uint64_t *out_sub1 = (uint64_t*)&data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(4)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/4; ++i)
	{
		uint64_t x0 = x0_in[4*i];
		uint64_t y0 = y0_in[4*i];
		uint64_t conj_c0 = conj_c0a_in[2*i];
		uint64_t swap_c0 = swap_c0a_in[2*i];

		uint64_t x1 = x1_in[4*i];
		uint64_t y1 = y1_in[4*i];
		uint64_t conj_c1 = conj_c1a_in[2*i];
		uint64_t swap_c1 = swap_c1a_in[2*i];

		uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
		uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
		uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
		uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);

		uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
		uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);

		uint64_t add0 = __builtin_e2k_pfadds(x0, cy0);
		uint64_t sub0 = __builtin_e2k_pfsubs(x0, cy0);
		uint64_t add1 = __builtin_e2k_pfadds(x1, cy1);
		uint64_t sub1 = __builtin_e2k_pfsubs(x1, cy1);


		x0 = add0;
		y0 = add1;
		conj_c0 = conj_c0b_in[i];
		swap_c0 = swap_c0b_in[i];

		x1 = sub0;
		y1 = sub1;
		conj_c1 = conj_c1b_in[i];
		swap_c1 = swap_c1b_in[i];

		cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
		cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
		cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
		cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);

		cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
		cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);

		out_add0[i] = __builtin_e2k_pfadds(x0, cy0);
		out_sub0[i] = __builtin_e2k_pfsubs(x0, cy0);
		out_add1[i] = __builtin_e2k_pfadds(x1, cy1);
		out_sub1[i] = __builtin_e2k_pfsubs(x1, cy1);
	}
}
Main loop in assembly
.L3317:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=2, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=2, abs=0, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=2, abs=4, disp=64
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=2, abs=4, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=3, asz=2, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=3, asz=2, abs=8, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=2, asz=2, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=2, asz=2, abs=12, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=7, asz=3, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=6, asz=3, abs=16, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=5, asz=3, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=4, asz=3, abs=24, disp=0
        }
.L2286:
        {
          loop_mode
          pfmul_hadds,0,sm      %b[13], %b[110], %g16, %b[110]
          pfadd_adds,1,sm       %b[37], %b[72], %b[111], %b[115]
          pfmuls,2,sm   %b[18], %b[116], %g17
          pfadd_rsubs,3,sm      %b[37], %b[72], %b[111], %b[111]
          pfmuls,4,sm   %b[115], %g20, %g21
          pfmuls,5,sm   %g18, %b[107], %g19
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[12], %b[103], %g24, %b[118]
          pfsub_adds,1,sm       %b[30], %b[102], %b[118], %g23
          pfmuls,2,sm   %b[52], %b[114], %g25
          pfsub_rsubs,3,sm      %b[30], %b[102], %b[118], %g22
          pfmuls,4,sm   %g26, %b[99], %g27
          pfmuls,5,sm   %b[40], %b[80], %b[103]
          movad,0       area=5, ind=16, am=0, be=0, %b[12]
          movad,1       area=5, ind=0, am=0, be=0, %b[13]
          movad,2       area=5, ind=16, am=0, be=0, %b[0]
          movad,3       area=5, ind=0, am=0, be=0, %b[1]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[7], %r0, %r1, %b[102]
          pfadd_adds,1,sm       %b[30], %b[102], %g29, %g31
          pfsubs,2,sm   %b[92], %b[109], %r2
          pfadd_rsubs,3,sm      %b[30], %b[102], %g29, %g30
          pfmuls,4,sm   %b[21], %b[57], %r3
          staad,5       %g28, %aad2[ %aasti9 + _f32s,_lts0 0x8 ]
          movad,0       area=5, ind=24, am=0, be=0, %b[21]
          movad,1       area=5, ind=8, am=1, be=0, %b[30]
          movad,2       area=5, ind=24, am=0, be=0, %b[7]
          movad,3       area=5, ind=8, am=1, be=0, %b[18]
        }
        {
          loop_mode
          pfmul_hadds,1,sm      %b[46], %b[108], %r4, %b[109]
          pfadds,2,sm   %b[92], %b[109], %r5
          pfsubs,4,sm   %b[81], %b[100], %b[108]
          staad,5       %b[119], %aad1[ %aasti8 + _f32s,_lts0 0x8 ]
          movad,0       area=4, ind=16, am=0, be=0, %b[46]
          movad,1       area=4, ind=0, am=0, be=0, %b[52]
          movad,2       area=4, ind=16, am=0, be=0, %b[37]
          movad,3       area=4, ind=0, am=0, be=0, %b[40]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[49], %b[63], %r6, %b[100]
          pfmul_hadds,1,sm      %b[6], %b[116], %g17, %b[116]
          pfmuls,2,sm   %b[71], %b[88], %g17
          pfadds,4,sm   %b[81], %b[100], %b[101]
          staad,5       %b[101], %aad4[ %aasti11 + _f32s,_lts0 0x8 ]
          movad,0       area=4, ind=24, am=0, be=0, %b[63]
          movad,1       area=4, ind=8, am=1, be=0, %b[71]
          movad,2       area=4, ind=24, am=0, be=0, %b[6]
          movad,3       area=4, ind=8, am=1, be=0, %b[49]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[43], %b[114], %g25, %g29
          pfsub_adds,1,sm       %b[68], %b[112], %b[117], %g28
          pfsubs,2,sm   %b[84], %r9, %b[114]
          pfsubs,4,sm   %b[89], %b[106], %r0
          staad,5       %r8, %aad3[ %aasti10 + _f32s,_lts0 0x8 ]
          movad,0       area=3, ind=16, am=0, be=0, %b[81]
          movad,1       area=3, ind=0, am=0, be=0, %b[93]
          movad,2       area=3, ind=0, am=0, be=0, %b[43]
          movad,3       area=3, ind=16, am=0, be=0, %b[72]
        }
        {
          loop_mode
          pfsub_rsubs,1,sm      %b[68], %b[112], %b[117], %b[117]
          pfmuls,2,sm   %b[34], %r2, %g25
          pfadds,4,sm   %b[89], %b[106], %b[106]
          staad,5       %r10, %aad2[ %aasti9 + _f32s,_lts0 0x18 ]
          movad,0       area=3, ind=8, am=1, be=0, %b[34]
          movad,1       area=3, ind=24, am=0, be=0, %b[96]
          movad,2       area=3, ind=8, am=1, be=0, %b[89]
          movad,3       area=3, ind=24, am=0, be=0, %b[92]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[98], %b[99], %g27, %b[107]
          pfadd_adds,1,sm       %b[68], %b[112], %r11, %b[99]
          pfmuls,2,sm   %b[75], %r5, %r12
          pfmul_hadds,3,sm      %b[94], %b[107], %g19, %b[98]
          pfmuls,4,sm   %b[25], %b[108], %g16
          staad,5       %b[113], %aad1[ %aasti8 + _f32s,_lts0 0x18 ]
          movad,0       area=2, ind=8, am=0, be=0, %g19
          movad,1       area=0, ind=24, am=0, be=0, %b[75]
          movad,2       area=2, ind=0, am=0, be=0, %b[25]
          movad,3       area=2, ind=8, am=0, be=0, %b[113]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[91], %g20, %g21, %r9
          pfadd_rsubs,1,sm      %b[68], %b[112], %r11, %r8
          staad,2       %r13, %aad4[ %aasti11 + _f32s,_lts0 0x18 ]
          pfadds,3,sm   %b[84], %r9, %b[112]
          pfmuls,4,sm   %b[67], %b[101], %g24
          staad,5       %b[105], %aad3[ %aasti10 + _f32s,_lts0 0x18 ]
          movad,0       area=2, ind=0, am=0, be=0, %b[67]
          movad,1       area=1, ind=24, am=0, be=0, %g20
          movad,2       area=2, ind=24, am=0, be=0, %g18
          movad,3       area=1, ind=24, am=0, be=0, %b[105]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[97], %b[88], %g17, %b[68]
          pfsub_adds,1,sm       %b[62], %r16, %b[110], %r10
          staad,2       %r15, %aad2[ %aasti9 ]
          pfmul_hadds,3,sm      %b[36], %b[77], %b[104], %b[104]
          pfmuls,4,sm   %b[17], %r0, %r1
          staad,5       %r14, %aad1[ %aasti8 ]
          movad,0       area=2, ind=16, am=0, be=0, %b[17]
          movad,1       area=2, ind=24, am=1, be=0, %g26
          movad,2       area=2, ind=16, am=1, be=0, %b[36]
          movad,3       area=0, ind=24, am=0, be=0, %b[97]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[85], %b[57], %r3, %b[110]
          pfmul_hadds,1,sm      %b[22], %r2, %g25, %b[115]
          staad,2       %b[115], %aad4[ %aasti11 ]
          pfsub_rsubs,3,sm      %b[62], %r16, %b[110], %b[111]
          pfmuls,4,sm   %b[56], %b[106], %r4
          staad,5       %b[111], %aad3[ %aasti10 ]
          movad,0       area=1, ind=0, am=0, be=0, %b[22]
          movad,1       area=1, ind=8, am=0, be=0, %b[57]
          movad,2       area=1, ind=0, am=0, be=0, %b[56]
          movad,3       area=1, ind=16, am=0, be=0, %b[77]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[76], %b[80], %b[103], %r16
          pfadd_adds,1,sm       %b[62], %r16, %b[118], %r13
          staad,2       %g23, %aad2[ %aasti9 + _f32s,_lts0 0x10 ]
          incr,2        %aaincr4
          pfadd_rsubs,3,sm      %b[62], %r16, %b[118], %b[103]
          pfmuls,4,sm   %b[29], %b[61], %r6
          staad,5       %g22, %aad1[ %aasti8 + _f32s,_lts0 0x10 ]
          incr,5        %aaincr4
          movad,0       area=1, ind=16, am=1, be=0, %b[80]
          movad,1       area=0, ind=0, am=0, be=0, %b[29]
          movad,2       area=1, ind=8, am=1, be=0, %b[76]
          movad,3       area=0, ind=0, am=0, be=0, %b[62]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          pfmul_hadds,0,sm      %b[53], %r5, %r12, %r11
          pfsub_adds,1,sm       %b[35], %b[70], %b[102], %r15
          staad,2       %g31, %aad4[ %aasti11 + _f32s,_lts0 0x10 ]
          incr,2        %aaincr4
          pfsub_rsubs,3,sm      %b[35], %b[70], %b[102], %r14
          pfmuls,4,sm   %g19, %b[75], %b[102]
          staad,5       %g30, %aad3[ %aasti10 + _f32s,_lts0 0x10 ]
          incr,5        %aaincr4
          movad,0       area=0, ind=16, am=0, be=0, %b[85]
          movad,1       area=0, ind=8, am=1, be=0, %b[84]
          movad,2       area=0, ind=8, am=1, be=0, %b[53]
          movad,3       area=0, ind=16, am=0, be=0, %b[88]
        }

Theoretical speed: 16 complex numbers per 13 cycles (16/13) = 9.85 bytes/cycle
Doubled theoretical speed: 19.69 bytes/cycle

Speed measurements

4. stage_radix2_readConjSwap_2x_simd128

This is a manual 2x unrolling of the stage_radix2_readConjSwap_simd128 algorithm.
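
As far as the code reads, "2x" here means that one pass over the data computes two consecutive radix-2 butterfly levels. A portable scalar model can make the data flow easier to follow. It is a sketch under assumptions: the index pattern is taken from the simd64 variant (one complex number per register), plain twiddle arrays `coef_a` and `coef_b` replace the separate conj/swap arrays, and the conj/swap layout `(re, -im)` / `(im, re)` is inferred from the names, not confirmed by this section. The names `cplx`, `mul_conj_swap` and `stage_radix2_2x_model` are hypothetical.

```c
#include <assert.h>
#include <complex.h>

typedef float complex cplx;

/* Complex product c*y formed the way the packed code appears to form it.
   ASSUMED layouts: conj_c = (re(c), -im(c)), swap_c = (im(c), re(c)).
   Each is multiplied lane by lane with y, then the two lanes are summed. */
cplx mul_conj_swap(cplx c, cplx y)
{
	float conj_lo = crealf(c), conj_hi = -cimagf(c);
	float swap_lo = cimagf(c), swap_hi =  crealf(c);
	float re = conj_lo * crealf(y) + conj_hi * cimagf(y);
	float im = swap_lo * crealf(y) + swap_hi * cimagf(y);
	return re + im * I;
}

/* Scalar model of one pass that fuses two radix-2 levels. */
void stage_radix2_2x_model(int data_count, const cplx *in, cplx *out,
                           const cplx *coef_a, const cplx *coef_b)
{
	int q = data_count / 4;
	assert(data_count > 0 && data_count % 4 == 0);
	for (int i = 0; i < q; ++i)
	{
		/* First level: two butterflies on four neighbouring inputs. */
		cplx cy0 = mul_conj_swap(coef_a[2*i],     in[4*i + 1]);
		cplx cy1 = mul_conj_swap(coef_a[2*i + 1], in[4*i + 3]);
		cplx add0 = in[4*i]     + cy0, sub0 = in[4*i]     - cy0;
		cplx add1 = in[4*i + 2] + cy1, sub1 = in[4*i + 2] - cy1;

		/* Second level: the sums form one butterfly, the differences another. */
		cy0 = mul_conj_swap(coef_b[i],     add1);
		cy1 = mul_conj_swap(coef_b[q + i], sub1);
		out[0*q + i] = add0 + cy0;   /* out_add0 */
		out[2*q + i] = add0 - cy0;   /* out_sub0 */
		out[1*q + i] = sub0 + cy1;   /* out_add1 */
		out[3*q + i] = sub0 - cy1;   /* out_sub1 */
	}
}
```

A model like this is handy as a correctness reference: the packed kernels can be compared against it element by element on random input.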

C code
void stage_radix2_readConjSwap_2x_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef_a, myComplex *conj_coef_b, myComplex *swap_coef_a, myComplex *swap_coef_b)
{
	__v2di *xy0_in = (__v2di*)&data_in[0];
	__v2di *xy1_in = (__v2di*)&data_in[2];
	__v2di *xy2_in = (__v2di*)&data_in[4];
	__v2di *xy3_in = (__v2di*)&data_in[6];
	__v2di *conj_c0a_in = (__v2di*)&conj_coef_a[0];
	__v2di *conj_c1a_in = (__v2di*)&conj_coef_a[2];
	__v2di *conj_c0b_in = (__v2di*)&conj_coef_b[0];
	__v2di *conj_c1b_in = (__v2di*)&conj_coef_b[data_count/4];
	__v2di *swap_c0a_in = (__v2di*)&swap_coef_a[0];
	__v2di *swap_c1a_in = (__v2di*)&swap_coef_a[2];
	__v2di *swap_c0b_in = (__v2di*)&swap_coef_b[0];
	__v2di *swap_c1b_in = (__v2di*)&swap_coef_b[data_count/4];

	__v2di *out_add0 = (__v2di*)&data_out[0*data_count/4];
	__v2di *out_add1 = (__v2di*)&data_out[1*data_count/4];
	__v2di *out_sub0 = (__v2di*)&data_out[2*data_count/4];
	__v2di *out_sub1 = (__v2di*)&data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/8; ++i)
	{
		__v2di xy0 = xy0_in[4*i];
		__v2di xy1 = xy1_in[4*i];
		__v2di conj_c0 = conj_c0a_in[2*i];
		__v2di swap_c0 = swap_c0a_in[2*i];

		__v2di xy2 = xy2_in[4*i];
		__v2di xy3 = xy3_in[4*i];
		__v2di conj_c1 = conj_c1a_in[2*i];
		__v2di swap_c1 = swap_c1a_in[2*i];

		__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		__v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
		__v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
		__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);

		__v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
		__v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);

		__v2di cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

		__v2di add0 = __builtin_e2k_qpfadds(x0, cy0);
		__v2di sub0 = __builtin_e2k_qpfsubs(x0, cy0);
		__v2di add1 = __builtin_e2k_qpfadds(x1, cy1);
		__v2di sub1 = __builtin_e2k_qpfsubs(x1, cy1);


		xy0 = add0;
		xy1 = add1;
		conj_c0 = conj_c0b_in[i];
		swap_c0 = swap_c0b_in[i];

		xy2 = sub0;
		xy3 = sub1;
		conj_c1 = conj_c1b_in[i];
		swap_c1 = swap_c1b_in[i];

		x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
		cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
		cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);

		cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
		cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);

		cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

		out_add0[i] = __builtin_e2k_qpfadds(x0, cy0);
		out_sub0[i] = __builtin_e2k_qpfsubs(x0, cy0);
		out_add1[i] = __builtin_e2k_qpfadds(x1, cy1);
		out_sub1[i] = __builtin_e2k_qpfsubs(x1, cy1);
	}
}
Main loop in assembly
.L4621:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=3, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=3, abs=0, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=3, asz=3, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=2, asz=3, abs=8, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=3, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=3, abs=16, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=3, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=24, disp=0
        }
.L3922:
        {
          loop_mode
          qpfsubs,2,sm  %b[65], %b[78], %b[95]
          qpfadds,3,sm  %b[28], %b[3], %b[0]
          qpshufb,4,sm  %b[69], %b[69], %r20, %b[1]
          qpfadds,5,sm  %b[65], %b[78], %b[96]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[74], %b[30], %b[76], %b[17]
          qpshufb,1,sm  %b[50], %b[55], %r18, %b[28]
          qpfsubs,2,sm  %b[77], %b[92], %b[97]
          qpshufb,4,sm  %b[88], %b[91], %r18, %b[21]
          qpfadds,5,sm  %b[77], %b[92], %b[98]
          movaqp,0      area=3, ind=0, am=1, be=0, %b[8]
          movaqp,1      area=2, ind=0, am=1, be=0, %b[3]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[68], %b[89], %b[71], %b[65]
          qpshufb,1,sm  %b[90], %b[93], %r19, %b[59]
          qpfmuls,2,sm  %b[86], %b[28], %b[74]
          qpshufb,4,sm  %b[62], %b[62], %r20, %b[76]
          qpfmuls,5,sm  %b[82], %b[87], %b[69]
          movaqp,0      area=0, ind=0, am=0, be=0, %b[51]
          movaqp,1      area=0, ind=16, am=1, be=0, %b[46]
          movaqp,2      area=3, ind=0, am=1, be=0, %b[33]
          movaqp,3      area=2, ind=0, am=1, be=0, %b[30]
        }
        {
          loop_mode
          qpshufb,0,sm  %b[4], %b[29], %r18, %b[85]
          qpshufb,1,sm  %b[2], %b[81], %r19, %b[71]
          qpfadds,2,sm  %b[58], %b[54], %b[77]
          qpfsubs,3,sm  %b[58], %b[54], %b[89]
          qpshufb,4,sm  %b[27], %b[27], %r20, %b[90]
          qpfsubs,5,sm  %b[26], %b[1], %b[86]
          movaqp,0      area=1, ind=16, am=1, be=0, %b[78]
          movaqp,1      area=1, ind=0, am=0, be=0, %b[82]
          movaqp,2      area=1, ind=16, am=1, be=0, %b[62]
          movaqp,3      area=1, ind=0, am=0, be=0, %b[68]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[47], %b[23], %b[94], %b[58]
          qpshufb,1,sm  %b[52], %b[57], %r19, %b[54]
          staaqp,2      %b[95], %aad1[ %aasti8 ]
          incr,2        %aaincr0
          qpfmuls,3,sm  %b[20], %b[21], %b[92]
          qpshufb,4,sm  %b[0], %b[79], %r18, %b[81]
          staaqp,5      %b[96], %aad2[ %aasti9 ]
          incr,5        %aaincr0
          movaqp,2      area=0, ind=0, am=0, be=0, %b[27]
          movaqp,3      area=0, ind=16, am=1, be=0, %b[2]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qpfmul_hadds,0,sm     %b[44], %b[83], %b[49], %b[23]
          qpshufb,1,sm  %b[6], %b[31], %r19, %b[20]
          staaqp,2      %b[97], %aad3[ %aasti10 ]
          incr,2        %aaincr0
          qpfmuls,3,sm  %b[15], %b[81], %b[47]
          qpshufb,4,sm  %b[19], %b[19], %r20, %b[52]
          staaqp,5      %b[98], %aad4[ %aasti11 ]
          incr,5        %aaincr0
        }

Theoretical speed: 8 complex numbers per 6 cycles (8/6) = 10.67 bytes/cycle
Doubled theoretical speed: 21.33 bytes/cycle

Speed measurements

5. stage_radix2_readConjSwap_2x_simd128_v2

The code was reshuffled to reduce the number of instructions.
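
Judging by the variable names, the saving comes from keeping the first-level results in an `rrii` layout (both real parts, then both imaginary parts), which is what the horizontal add produces anyway, instead of shuffling back to the interleaved `riri` layout before every add and subtract. Lane-wise addition commutes with a fixed lane permutation, so the reorder can be postponed and folded into the permute that the second level needs in any case. Counting the listings, the loop body goes from 12 `qpshufb` operations to 10 `qpshufb`/`qppermb` operations. A portable sketch of the identity, with plain arrays standing in for 128-bit registers (the `vec4` type and all helper names are hypothetical):

```c
#include <assert.h>

typedef struct { float lane[4]; } vec4;

/* Lane-wise add, the portable stand-in for a packed float add. */
vec4 add4(vec4 a, vec4 b)
{
	vec4 r;
	for (int k = 0; k < 4; ++k)
		r.lane[k] = a.lane[k] + b.lane[k];
	return r;
}

/* (re0, re1, im0, im1) -> (re0, im0, re1, im1): swap the middle lanes. */
vec4 rrii_to_riri(vec4 v)
{
	vec4 r = {{ v.lane[0], v.lane[2], v.lane[1], v.lane[3] }};
	return r;
}

/* The inverse conversion is the same middle-lane swap. */
vec4 riri_to_rrii(vec4 v)
{
	return rrii_to_riri(v);
}

/* First-version order: fix the layout of cy, then add to interleaved x. */
vec4 add_after_shuffle(vec4 x_riri, vec4 cy_rrii)
{
	return add4(x_riri, rrii_to_riri(cy_rrii));
}

/* Second-version order: add in rrii layout, reorder once afterwards.
   riri_to_rrii(x) stands in for the permute that extracts x from the
   input registers, which has to be executed in either version. */
vec4 shuffle_after_add(vec4 x_riri, vec4 cy_rrii)
{
	return rrii_to_riri(add4(riri_to_rrii(x_riri), cy_rrii));
}

/* Self-check: both orders give the same lanes. */
void check_deferred_shuffle(vec4 x_riri, vec4 cy_rrii)
{
	vec4 a = add_after_shuffle(x_riri, cy_rrii);
	vec4 b = shuffle_after_add(x_riri, cy_rrii);
	for (int k = 0; k < 4; ++k)
		assert(a.lane[k] == b.lane[k]);
}
```

The same argument applies to the subtraction, since it is lane-wise as well.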

C code
void stage_radix2_readConjSwap_2x_simd128_v2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef_a, myComplex *conj_coef_b, myComplex *swap_coef_a, myComplex *swap_coef_b)
{
	__v2di *xy0_in = (__v2di*)&data_in[0];
	__v2di *xy1_in = (__v2di*)&data_in[2];
	__v2di *xy2_in = (__v2di*)&data_in[4];
	__v2di *xy3_in = (__v2di*)&data_in[6];
	__v2di *conj_c0a_in = (__v2di*)&conj_coef_a[0];
	__v2di *conj_c1a_in = (__v2di*)&conj_coef_a[2];
	__v2di *conj_c0b_in = (__v2di*)&conj_coef_b[0];
	__v2di *conj_c1b_in = (__v2di*)&conj_coef_b[data_count/4];
	__v2di *swap_c0a_in = (__v2di*)&swap_coef_a[0];
	__v2di *swap_c1a_in = (__v2di*)&swap_coef_a[2];
	__v2di *swap_c0b_in = (__v2di*)&swap_coef_b[0];
	__v2di *swap_c1b_in = (__v2di*)&swap_coef_b[data_count/4];

	__v2di *out_add0 = (__v2di*)&data_out[0*data_count/4];
	__v2di *out_add1 = (__v2di*)&data_out[1*data_count/4];
	__v2di *out_sub0 = (__v2di*)&data_out[2*data_count/4];
	__v2di *out_sub1 = (__v2di*)&data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/8; ++i)
	{
		__v2di xy0 = xy0_in[4*i];
		__v2di xy1 = xy1_in[4*i];
		__v2di conj_c0 = conj_c0a_in[2*i];
		__v2di swap_c0 = swap_c0a_in[2*i];

		__v2di xy2 = xy2_in[4*i];
		__v2di xy3 = xy3_in[4*i];
		__v2di conj_c1 = conj_c1a_in[2*i];
		__v2di swap_c1 = swap_c1a_in[2*i];

		__v2di x0_rrii = __builtin_e2k_qppermb(xy1, xy0, (__v2di){0x1312111003020100, 0x1716151407060504});
		__v2di x1_rrii = __builtin_e2k_qppermb(xy3, xy2, (__v2di){0x1312111003020100, 0x1716151407060504});
		__v2di y0      = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di y1      = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		__v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
		__v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
		__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);

		__v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
		__v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);

		__v2di add0_rrii = __builtin_e2k_qpfadds(x0_rrii, cy0_rrii);
		__v2di sub0_rrii = __builtin_e2k_qpfsubs(x0_rrii, cy0_rrii);
		__v2di add1_rrii = __builtin_e2k_qpfadds(x1_rrii, cy1_rrii);
		__v2di sub1_rrii = __builtin_e2k_qpfsubs(x1_rrii, cy1_rrii);


		__v2di xy0_rrii = add0_rrii;
		__v2di xy1_rrii = add1_rrii;
		conj_c0 = conj_c0b_in[i];
		swap_c0 = swap_c0b_in[i];

		__v2di xy2_rrii = sub0_rrii;
		__v2di xy3_rrii = sub1_rrii;
		conj_c1 = conj_c1b_in[i];
		swap_c1 = swap_c1b_in[i];

		__v2di x0 = __builtin_e2k_qppermb(xy1_rrii, xy0_rrii, (__v2di){0x0B0A090803020100, 0x1B1A191813121110});
		__v2di x1 = __builtin_e2k_qppermb(xy3_rrii, xy2_rrii, (__v2di){0x0B0A090803020100, 0x1B1A191813121110});
		       y0 = __builtin_e2k_qppermb(xy1_rrii, xy0_rrii, (__v2di){0x0F0E0D0C07060504, 0x1F1E1D1C17161514});
		       y1 = __builtin_e2k_qppermb(xy3_rrii, xy2_rrii, (__v2di){0x0F0E0D0C07060504, 0x1F1E1D1C17161514});

		cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
		cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
		cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);

		cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
		cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);

		__v2di cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

		out_add0[i] = __builtin_e2k_qpfadds(x0, cy0);
		out_sub0[i] = __builtin_e2k_qpfsubs(x0, cy0);
		out_add1[i] = __builtin_e2k_qpfadds(x1, cy1);
		out_sub1[i] = __builtin_e2k_qpfsubs(x1, cy1);
	}
}
Main loop in assembly
.L5345:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=3, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=3, abs=0, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=3, asz=3, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=2, asz=3, abs=8, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=3, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=3, abs=16, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=3, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=24, disp=0
        }
.L4696:
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[62], %b[79], %b[110], %b[62]
          qpshufb,1,sm  %b[83], %b[83], %r11, %b[110]
          qpfsubs,2,sm  %b[119], %b[118], %b[113]
          qpfmuls,3,sm  %b[73], %b[103], %b[73]
          qppermb,4,sm  %b[69], %b[12], %r9, %b[115]
          qpfadds,5,sm  %b[119], %b[118], %b[114]
          movaqp,0      area=0, ind=0, am=0, be=0, %b[0]
          movaqp,1      area=1, ind=0, am=0, be=0, %b[69]
          movaqp,2      area=0, ind=0, am=0, be=0, %b[12]
          movaqp,3      area=0, ind=16, am=1, be=0, %b[1]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[44], %b[26], %b[108], %b[79]
          qppermb,1,sm  %b[92], %b[76], %r0, %b[117]
          qpfmuls,2,sm  %b[57], %b[77], %b[108]
          qpfmuls,3,sm  %b[80], %b[109], %b[80]
          qpshufb,4,sm  %b[66], %b[66], %r11, %b[116]
          qpfadds,5,sm  %b[115], %b[25], %b[66]
          movaqp,0      area=0, ind=16, am=1, be=0, %b[57]
          movaqp,1      area=1, ind=16, am=1, be=0, %b[76]
          movaqp,2      area=3, ind=0, am=1, be=0, %b[26]
          movaqp,3      area=2, ind=0, am=1, be=0, %b[44]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[106], %b[111], %b[82], %b[92]
          qpshufb,1,sm  %b[3], %b[14], %r12, %b[107]
          qpfmuls,2,sm  %b[41], %b[24], %b[106]
          qpfadds,3,sm  %g16, %b[98], %b[82]
          qpshufb,4,sm  %b[59], %b[2], %r12, %b[101]
          qpfsubs,5,sm  %b[115], %b[25], %b[83]
          movaqp,0      area=3, ind=0, am=1, be=0, %b[25]
          movaqp,1      area=2, ind=0, am=1, be=0, %b[41]
          movaqp,2      area=1, ind=0, am=0, be=0, %b[93]
          movaqp,3      area=1, ind=16, am=1, be=0, %b[100]
        }
        {
          loop_mode
          qpfsubs,0,sm  %g18, %b[110], %g17
          qppermb,1,sm  %b[11], %b[22], %r9, %g16
          staaqp,2      %g17, %aad1[ %aasti8 ]
          incr,2        %aaincr0
          qpfsubs,3,sm  %g16, %b[98], %b[11]
          qppermb,4,sm  %b[13], %b[85], %r10, %b[22]
          staaqp,5      %b[112], %aad2[ %aasti9 ]
          incr,5        %aaincr0
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qpfmul_hadds,0,sm     %b[99], %b[105], %b[75], %b[19]
          qppermb,1,sm  %b[84], %b[68], %r10, %b[75]
          staaqp,2      %b[113], %aad3[ %aasti10 ]
          incr,2        %aaincr0
          qpfadds,3,sm  %g18, %b[110], %b[110]
          qppermb,4,sm  %b[19], %b[91], %r0, %g18
          staaqp,5      %b[114], %aad4[ %aasti11 ]
          incr,5        %aaincr0
        }

Theoretical speed: 8 complex numbers (64 bytes) in 5 cycles = 12.8 bytes/cycle
Doubled theoretical speed: 25.6 bytes/cycle

Speed measurements

Results for stage_radix2_readConjSwap_2x

Speeds increased compared with the original stage_radix2_readConjSwap versions.
The FFT graph is here.


stage_radix4

Diagram of the Stage algorithm for the "radix-4" version.

One pass of stage_radix4 does the same work as 2 passes of stage_radix2. For this reason the speed of stage_radix4 is multiplied by 2 so that it can be compared directly with stage_radix2 (this is noted on the graph axis and in the console output).
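The statement that one radix-4 pass replaces two radix-2 passes can be checked on a single 4-point transform. The sketch below is only an illustration in portable C: the `cpx` type is a hypothetical stand-in for the article's myComplex, all twiddle factors are assumed to be 1, and the function names are invented for this example. It compares the butterfly used in the loop body of stage_radix4_etalon with two consecutive radix-2 butterflies.

```c
#include <assert.h>

/* Hypothetical stand-in for the article's myComplex (real part, then imaginary part). */
typedef struct { float real, imag; } cpx;

static cpx cpx_add(cpx a, cpx b) { return (cpx){a.real + b.real, a.imag + b.imag}; }
static cpx cpx_sub(cpx a, cpx b) { return (cpx){a.real - b.real, a.imag - b.imag}; }
/* Multiply by i: (re, im) -> (-im, re). */
static cpx cpx_mul_i(cpx a) { return (cpx){-a.imag, a.real}; }

/* Radix-4 butterfly with all twiddles equal to 1 (assumption of this sketch). */
void radix4_butterfly(const cpx in[4], cpx out[4])
{
	cpx add02 = cpx_add(in[0], in[2]);
	cpx sub02 = cpx_sub(in[0], in[2]);
	cpx add13 = cpx_add(in[1], in[3]);
	cpx sub13 = cpx_sub(in[1], in[3]);
	cpx sub13i = cpx_mul_i(sub13);

	out[0] = cpx_add(add02, add13);
	out[1] = cpx_sub(sub02, sub13i);
	out[2] = cpx_sub(add02, add13);
	out[3] = cpx_add(sub02, sub13i);
}

/* One twiddle-free radix-2 butterfly: (a, b) -> (a + b, a - b). */
static void radix2_butterfly(cpx a, cpx b, cpx *sum, cpx *diff)
{
	*sum = cpx_add(a, b);
	*diff = cpx_sub(a, b);
}

/* The same 4-point transform computed as two radix-2 passes. */
void two_radix2_passes(const cpx in[4], cpx out[4])
{
	cpx e0, e1, o0, o1;
	/* pass 1: butterflies on (in[0], in[2]) and (in[1], in[3]) */
	radix2_butterfly(in[0], in[2], &e0, &e1);
	radix2_butterfly(in[1], in[3], &o0, &o1);
	/* pass 2: twiddle 1 for (e0, o0), twiddle -i for (e1, o1) */
	cpx t = cpx_mul_i(o1);
	t = (cpx){-t.real, -t.imag};
	radix2_butterfly(e0, o0, &out[0], &out[2]);
	radix2_butterfly(e1, t, &out[1], &out[3]);
}

/* Returns 1 when both routes agree within eps on every output. */
int radix4_matches_two_radix2(const cpx in[4], float eps)
{
	cpx a[4], b[4];
	assert(eps >= 0.0f);
	radix4_butterfly(in, a);
	two_radix2_passes(in, b);
	for (int k = 0; k < 4; ++k) {
		float dr = a[k].real - b[k].real;
		float di = a[k].imag - b[k].imag;
		if (dr < -eps || dr > eps || di < -eps || di > eps)
			return 0;
	}
	return 1;
}
```

Calling radix4_matches_two_radix2 on any four inputs should return 1. The radix-4 route reads and writes each value once, while the two-pass route does so twice, which is why the radix-4 speed is doubled for comparison.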

1. stage_radix4_etalon

Reference variant, used to check the other variants for correctness.

C code
void stage_radix4_etalon(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE)
{
	myComplex *x_in = &data_in[0];
	myComplex *y_in = &data_in[1];
	myComplex *z_in = &data_in[2];
	myComplex *w_in = &data_in[3];
	myComplex *c_in = coefC;
	myComplex *d_in = coefD;
	myComplex *e_in = coefE;

	myComplex *out_0 = &data_out[0*data_count/4];
	myComplex *out_1 = &data_out[1*data_count/4];
	myComplex *out_2 = &data_out[2*data_count/4];
	myComplex *out_3 = &data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(1)
//	#pragma prefetch
	for(int64_t i = 0; i < data_count/4; ++i)
	{
		myComplex x = x_in[4*i];
		myComplex y = y_in[4*i];
		myComplex z = z_in[4*i];
		myComplex w = w_in[4*i];
		myComplex c = c_in[i];
		myComplex d = d_in[i];
		myComplex e = e_in[i];

		myComplex cy = complex_mul(c, y);
		myComplex dz = complex_mul(d, z);
		myComplex ew = complex_mul(e, w);

		myComplex add02 = complex_add( x, dz);
		myComplex sub02 = complex_sub( x, dz);
		myComplex add13 = complex_add(cy, ew);
		myComplex sub13 = complex_sub(cy, ew);
		// Multiply sub13 by i: (re, im) -> (-im, re)
		myComplex sub13i = (myComplex){.real = -sub13.imag, .imag = sub13.real};

		out_0[i] = complex_add(add02, add13);
		out_1[i] = complex_sub(sub02, sub13i);
		out_2[i] = complex_sub(add02, add13);
		out_3[i] = complex_add(sub02, sub13i);
	}
}
Main loop in assembly
.L868:
        {
          fapb  ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=4, asz=3, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=3, asz=3, abs=8, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=2, asz=4, abs=16, disp=0
        }
.L236:
        {
          loop_mode
          fmul_adds,0,sm        %b[66], %b[72], %b[54], %b[1]
          fsub_rsubs,1,sm       %b[12], %b[71], %b[55], %b[5]
          fsub_adds,2,sm        %b[12], %b[71], %b[55], %b[45]
          fmuls,4,sm    %b[59], %b[58], %b[52]
          fsubs,5,sm    %b[79], %b[75], %b[53]
          movaw,3       area=0, ind=4, am=0, be=0, %b[0]
        }
        {
          loop_mode
          fmul_rsubs,0,sm       %b[66], %b[60], %b[82], %b[71]
          fadd_adds,1,sm        %b[24], %b[83], %b[90], %b[72]
          fadd_rsubs,2,sm       %b[24], %b[83], %b[90], %b[76]
          fmuls,4,sm    %b[59], %b[70], %b[80]
          fadds,5,sm    %b[79], %b[75], %b[88]
          movaw,1       area=2, ind=4, am=0, be=0, %b[55]
          movaw,2       area=0, ind=0, am=0, be=0, %b[12]
          movaw,3       area=0, ind=24, am=0, be=0, %b[54]
        }
        {
          loop_mode
          fmul_rsubs,0,sm       %b[42], %b[17], %b[84], %b[75]
          fmul_rsubs,1,sm       %b[32], %b[67], %b[85], %b[79]
          staaw,2       %b[36], %aad3[ %aasti7 ]
          fmuls,3,sm    %b[29], %b[46], %b[82]
          fmuls,4,sm    %b[35], %b[23], %b[83]
          staaw,5       %b[39], %aad1[ %aasti5 ]
          movaw,0       area=2, ind=0, am=1, be=0, %b[60]
          movaw,1       area=1, ind=0, am=0, be=0, %b[24]
          movaw,2       area=0, ind=16, am=0, be=0, %b[59]
          movaw,3       area=0, ind=28, am=0, be=0, %b[66]
        }
        {
          loop_mode
          fmul_adds,0,sm        %b[40], %b[46], %b[86], %b[39]
          fmul_adds,1,sm        %b[32], %b[25], %b[87], %b[67]
          staaw,2       %b[11], %aad4[ %aasti8 + _f32s,_lts0 0x4 ]
          fmuls,3,sm    %b[27], %b[13], %b[84]
          fmuls,4,sm    %b[35], %b[65], %b[85]
          staaw,5       %b[51], %aad2[ %aasti6 + _f32s,_lts0 0x4 ]
          movaw,0       area=1, ind=4, am=1, be=0, %b[29]
          movaw,1       area=0, ind=0, am=0, be=0, %b[36]
          movaw,2       area=0, ind=20, am=0, be=0, %b[17]
          movaw,3       area=0, ind=12, am=0, be=0, %b[42]
        }
        {
          loop_mode
          fsub_rsubs,0,sm       %b[22], %b[81], %b[48], %b[32]
          fsub_adds,1,sm        %b[22], %b[81], %b[48], %b[35]
          staaw,2       %b[7], %aad3[ %aasti7 + _f32s,_lts0 0x4 ]
          incr,2        %aaincr3
          fsubs,4,sm    %b[3], %b[43], %b[46]
          staaw,5       %b[47], %aad1[ %aasti5 + _f32s,_lts0 0x4 ]
          incr,5        %aaincr3
          movaw,1       area=0, ind=4, am=1, be=0, %b[25]
          movaw,3       area=0, ind=8, am=1, be=0, %b[11]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          fadd_adds,0,sm        %b[10], %b[69], %b[50], %b[7]
          fadd_rsubs,1,sm       %b[10], %b[69], %b[50], %b[47]
          staaw,2       %b[74], %aad4[ %aasti8 ]
          incr,2        %aaincr3
          fadds,4,sm    %b[43], %b[3], %b[48]
          staaw,5       %b[78], %aad2[ %aasti6 ]
          incr,5        %aaincr3
        }

Theoretical speed: 4 complex numbers (32 bytes) in 6 cycles = 5.33 bytes/cycle
Doubled theoretical speed: 10.67 bytes/cycle
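The theoretical figures in this section all follow from the same arithmetic: the number of complex inputs consumed per loop iteration, times 8 bytes per complex number, divided by the number of wide instructions in the loop body. A minimal sketch of that calculation is shown below. The function names are invented for this example, and it assumes one wide instruction per cycle with no stalls.

```c
#include <assert.h>

/* One complex number with float real and imaginary parts takes 8 bytes. */
enum { BYTES_PER_COMPLEX = 8 };

/* Theoretical throughput in bytes per cycle, assuming the loop body
   issues one wide instruction per cycle and never stalls. */
double theoretical_bytes_per_cycle(int complex_per_iter, int wide_instructions)
{
	assert(complex_per_iter > 0 && wide_instructions > 0);
	return (double)complex_per_iter * BYTES_PER_COMPLEX / wide_instructions;
}

/* A radix-4 pass replaces two radix-2 passes, so its rate is doubled
   before it is compared with stage_radix2. */
double radix4_comparable_rate(int complex_per_iter, int wide_instructions)
{
	return 2.0 * theoretical_bytes_per_cycle(complex_per_iter, wide_instructions);
}
```

For stage_radix4_etalon (4 complex numbers, 6 wide instructions) this gives about 5.33 and 10.67 bytes/cycle, matching the figures quoted above.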

Speed measurements

2. stage_radix4_etalon_unroll2

Here the loop is unrolled by a factor of 2 with #pragma unroll(2).

C code
void stage_radix4_etalon_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE)
{
	myComplex *x_in = &data_in[0];
	myComplex *y_in = &data_in[1];
	myComplex *z_in = &data_in[2];
	myComplex *w_in = &data_in[3];
	myComplex *c_in = coefC;
	myComplex *d_in = coefD;
	myComplex *e_in = coefE;

	myComplex *out_0 = &data_out[0*data_count/4];
	myComplex *out_1 = &data_out[1*data_count/4];
	myComplex *out_2 = &data_out[2*data_count/4];
	myComplex *out_3 = &data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(2)
//	#pragma prefetch
	for(int64_t i = 0; i < data_count/4; ++i)
	{
		myComplex x = x_in[4*i];
		myComplex y = y_in[4*i];
		myComplex z = z_in[4*i];
		myComplex w = w_in[4*i];
		myComplex c = c_in[i];
		myComplex d = d_in[i];
		myComplex e = e_in[i];

		myComplex cy = complex_mul(c, y);
		myComplex dz = complex_mul(d, z);
		myComplex ew = complex_mul(e, w);

		myComplex add02 = complex_add( x, dz);
		myComplex sub02 = complex_sub( x, dz);
		myComplex add13 = complex_add(cy, ew);
		myComplex sub13 = complex_sub(cy, ew);
		myComplex sub13i = (myComplex){.real = -sub13.imag, .imag = sub13.real};

		out_0[i] = complex_add(add02, add13);
		out_1[i] = complex_sub(sub02, sub13i);
		out_2[i] = complex_sub(add02, add13);
		out_3[i] = complex_add(sub02, sub13i);
	}
}
Main loop in assembly
.L2050:
        {
          fapb  ct=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=4, asz=3, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=3, asz=4, abs=16, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=4, abs=16, disp=0
        }
.L913:
        {
          loop_mode
          insfd,0,sm    %b[67], %r7, %b[70], %b[0]
          pfmul_rsubs,1,sm      %b[78], %b[5], %g16, %b[30]
          pfadd_adds,2,sm       %b[50], %b[32], %b[77], %b[5]
          pshufb,3,sm   %b[22], %b[47], %r0, %b[67]
          insfd,4,sm    %b[30], %r7, %b[31], %b[1]
          pfmuls,5,sm   %b[83], %b[58], %g16
        }
        {
          loop_mode
          insfd,0,sm    %b[24], %r7, %b[49], %b[78]
          pfsub_adds,1,sm       %b[50], %b[32], %b[81], %b[31]
          pfsub_adds,2,sm       %b[6], %b[25], %b[71], %b[24]
          pshufb,3,sm   %b[61], %b[8], %r0, %b[83]
          pshufb,4,sm   %b[26], %b[33], %r0, %b[87]
          pfmuls,5,sm   %b[83], %b[3], %b[70]
        }
        {
          loop_mode
          insfd,0,sm    %b[63], %r7, %b[10], %b[71]
          pfsub_rsubs,1,sm      %b[50], %b[32], %b[81], %b[36]
          pfsub_rsubs,2,sm      %b[6], %b[25], %b[71], %b[35]
          insfd,3,sm    %b[76], %r7, %b[79], %b[10]
          pshufb,4,sm   %b[37], %b[38], %r0, %b[76]
        }
        {
          loop_mode
          insfd,0,sm    %b[85], %r7, %b[86], %b[43]
          pfadd_rsubs,1,sm      %b[50], %b[32], %b[77], %b[39]
          pfadd_rsubs,2,sm      %b[6], %b[25], %b[75], %b[32]
          insfd,3,sm    %b[80], %r7, %b[84], %b[40]
          pshufb,4,sm   %b[34], %b[41], %r0, %b[77]
        }
        {
          loop_mode
          pfmuls,0,sm   %b[67], %b[43], %b[25]
          pfadd_adds,1,sm       %b[6], %b[25], %b[75], %b[49]
          staad,2       %b[87], %aad1[ %aasti5 + _f32s,_lts0 0x8 ]
          pfsubs,3,sm   %b[64], %b[21], %b[79]
          pshufb,4,sm   %b[51], %b[7], %r0, %b[84]
          pfadds,5,sm   %b[11], %b[20], %b[75]
          movad,0       area=2, ind=0, am=0, be=0, %b[6]
          movaw,1       area=0, ind=24, am=0, be=0, %b[81]
          movaw,3       area=0, ind=24, am=0, be=0, %b[80]
        }
        {
          loop_mode
          pfmuls,0,sm   %b[83], %b[40], %b[50]
          pfmul_adds,1,sm       %b[78], %b[45], %b[69], %b[17]
          staad,2       %b[76], %aad3[ %aasti7 + _f32s,_lts0 0x8 ]
          pfsubs,3,sm   %b[11], %b[20], %b[69]
          insfd,4,sm    %b[59], %r7, %b[17], %b[76]
          movad,0       area=1, ind=0, am=0, be=0, %b[45]
          movad,1       area=1, ind=8, am=1, be=0, %b[20]
          movad,3       area=1, ind=0, am=0, be=0, %b[11]
        }
        {
          loop_mode
          insfd,0,sm    %b[72], %r7, %b[82], %b[54]
          pfmul_adds,1,sm       %b[71], %b[42], %g17, %b[60]
          staad,2       %b[77], %aad2[ %aasti6 + _f32s,_lts0 0x8 ]
          pfmuls,3,sm   %b[83], %b[14], %g17
          insfd,4,sm    %b[73], %r7, %b[74], %b[42]
          pfmuls,5,sm   %b[67], %b[10], %b[67]
          movad,0       area=2, ind=8, am=1, be=0, %b[59]
          movaw,1       area=0, ind=4, am=0, be=0, %b[66]
          movad,2       area=1, ind=8, am=1, be=0, %b[53]
          movaw,3       area=0, ind=4, am=0, be=0, %b[63]
        }
        {
          loop_mode
          insfd,0,sm    %b[26], %r7, %b[33], %b[83]
          pfmul_rsubs,1,sm      %b[71], %b[16], %b[52], %b[16]
          staad,2       %b[84], %aad4[ %aasti8 + _f32s,_lts0 0x8 ]
          insfd,4,sm    %b[37], %r7, %b[38], %b[82]
          pfadds,5,sm   %b[21], %b[64], %b[73]
          movaw,0       area=0, ind=0, am=0, be=0, %b[72]
          movaw,1       area=0, ind=8, am=0, be=0, %b[77]
          movaw,2       area=0, ind=0, am=0, be=0, %b[71]
          movaw,3       area=0, ind=8, am=0, be=0, %b[74]
        }
        {
          loop_mode
          insfd,0,sm    %b[51], %r7, %b[7], %b[85]
          pfmul_rsubs,1,sm      %b[78], %b[12], %b[27], %b[7]
          staad,2       %b[83], %aad1[ %aasti5 ]
          incr,2        %aaincr3
          insfd,4,sm    %b[34], %r7, %b[41], %b[86]
          staad,5       %b[82], %aad3[ %aasti7 ]
          incr,5        %aaincr3
          movaw,0       area=0, ind=28, am=0, be=0, %b[82]
          movaw,1       area=0, ind=12, am=0, be=0, %b[84]
          movaw,2       area=0, ind=28, am=0, be=0, %b[78]
          movaw,3       area=0, ind=12, am=0, be=0, %b[83]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          pfmul_adds,1,sm       %b[76], %b[58], %b[70], %b[21]
          staad,2       %b[85], %aad4[ %aasti8 ]
          incr,2        %aaincr3
          insfd,3,sm    %b[80], %r7, %b[81], %b[12]
          pshufb,4,sm   %b[57], %b[15], %r0, %b[81]
          staad,5       %b[86], %aad2[ %aasti6 ]
          incr,5        %aaincr3
          movaw,0       area=0, ind=16, am=0, be=0, %b[27]
          movaw,1       area=0, ind=20, am=1, be=0, %b[80]
          movaw,2       area=0, ind=16, am=0, be=0, %b[26]
          movaw,3       area=0, ind=20, am=1, be=0, %b[70]
        }

Theoretical speed: 8 complex numbers (64 bytes) in 10 cycles = 6.4 bytes/cycle
Doubled theoretical speed: 12.8 bytes/cycle

Speed measurements

We see a speedup; the theoretical rate also rises from 5.33 to 6.4 bytes/cycle.
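For readers unfamiliar with unrolling, the idea behind #pragma unroll(2) can be sketched by hand on a trivial loop. This is a conceptual illustration only: the functions below are invented for this example, and the transformation that lcc actually performs (including the packed instructions visible in the listing above) is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled loop: one element per iteration. */
void scale_rolled(float *dst, const float *src, float k, size_t n)
{
	assert(n == 0 || (dst && src));
	for (size_t i = 0; i < n; ++i)
		dst[i] = k * src[i];
}

/* The same loop unrolled by 2 by hand: two independent elements per
   iteration, plus a tail step for an odd element count. */
void scale_unrolled2(float *dst, const float *src, float k, size_t n)
{
	assert(n == 0 || (dst && src));
	size_t i = 0;
	for (; i + 1 < n; i += 2) {
		dst[i]     = k * src[i];
		dst[i + 1] = k * src[i + 1];
	}
	if (i < n)
		dst[i] = k * src[i];
}
```

Both functions produce the same output. The unrolled body gives the scheduler two independent computations to interleave, which is one plausible reason the 10-cycle loop above processes twice as many numbers as the 6-cycle one.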


3. stage_radix4_simd64

The computation is done in the same way as in stage_radix2_simd64.

C code
void stage_radix4_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE)
{
	uint64_t *x_in = (uint64_t*)&data_in[0];
	uint64_t *y_in = (uint64_t*)&data_in[1];
	uint64_t *z_in = (uint64_t*)&data_in[2];
	uint64_t *w_in = (uint64_t*)&data_in[3];
	uint64_t *c_in = (uint64_t*)coefC;
	uint64_t *d_in = (uint64_t*)coefD;
	uint64_t *e_in = (uint64_t*)coefE;

	uint64_t *out_0 = (uint64_t*)&data_out[0*data_count/4];
	uint64_t *out_1 = (uint64_t*)&data_out[1*data_count/4];
	uint64_t *out_2 = (uint64_t*)&data_out[2*data_count/4];
	uint64_t *out_3 = (uint64_t*)&data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/4; ++i)
	{
		uint64_t x = x_in[4*i];
		uint64_t y = y_in[4*i];
		uint64_t z = z_in[4*i];
		uint64_t w = w_in[4*i];
		uint64_t c = c_in[i];
		uint64_t d = d_in[i];
		uint64_t e = e_in[i];

		uint64_t conj_c = __builtin_e2k_pxord(c, 1LL<<63);
		uint64_t conj_d = __builtin_e2k_pxord(d, 1LL<<63);
		uint64_t conj_e = __builtin_e2k_pxord(e, 1LL<<63);
		uint64_t swap_c = __builtin_e2k_pshufb(0, c, 0x0302010007060504);
		uint64_t swap_d = __builtin_e2k_pshufb(0, d, 0x0302010007060504);
		uint64_t swap_e = __builtin_e2k_pshufb(0, e, 0x0302010007060504);

		uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
		uint64_t dz_real = __builtin_e2k_pfmuls(conj_d, z);
		uint64_t ew_real = __builtin_e2k_pfmuls(conj_e, w);
		uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);
		uint64_t dz_imag = __builtin_e2k_pfmuls(swap_d, z);
		uint64_t ew_imag = __builtin_e2k_pfmuls(swap_e, w);

		uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);
		uint64_t dz = __builtin_e2k_pfhadds(dz_real, dz_imag);
		uint64_t ew = __builtin_e2k_pfhadds(ew_real, ew_imag);

		uint64_t add02 = __builtin_e2k_pfadds( x, dz);
		uint64_t sub02 = __builtin_e2k_pfsubs( x, dz);
		uint64_t add13 = __builtin_e2k_pfadds(cy, ew);
		uint64_t sub13 = __builtin_e2k_pfsubs(cy, ew);

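		// Multiply sub13 by i: swap the halves, then flip the sign bit (bit 31) of the new real part.
		// Commented-out alternative with the same result: conjugate first (flip bit 63), then swap.
		//uint64_t conj_sub13 = __builtin_e2k_pxord(sub13, 1LL<<63);
		//uint64_t sub13i = __builtin_e2k_pshufb(0, conj_sub13, 0x0302010007060504);
		uint64_t swap_sub13 = __builtin_e2k_pshufb(0, sub13, 0x0302010007060504);
		uint64_t sub13i = __builtin_e2k_pxord(swap_sub13, 1LL<<31);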

		out_0[i] = __builtin_e2k_pfadds(add02, add13);
		out_1[i] = __builtin_e2k_pfsubs(sub02, sub13i);
		out_2[i] = __builtin_e2k_pfsubs(add02, add13);
		out_3[i] = __builtin_e2k_pfadds(sub02, sub13i);
	}
}
Main loop in assembly
.L2675:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=3, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=3, abs=8, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
        }
.L2317:
        {
          loop_mode
          pfadds,0,sm   %b[70], %b[67], %b[73]
          pfadd_adds,1,sm       %b[40], %b[47], %b[75], %b[1]
          pfadd_rsubs,2,sm      %b[40], %b[47], %b[75], %b[0]
          pfsubs,3,sm   %b[66], %b[63], %b[39]
          xord,4,sm     %b[51], %r0, %b[58]
          xord,5,sm     %b[33], %r0, %b[79]
        }
        {
          loop_mode
          pfmuls,0,sm   %b[60], %b[21], %b[70]
          pfsub_rsubs,1,sm      %b[40], %b[47], %b[81], %b[67]
          pfsub_adds,2,sm       %b[40], %b[47], %b[81], %b[33]
          pshufb,3,sm   0x0, %b[18], %r8, %b[75]
          pshufb,4,sm   0x0, %b[57], %r8, %b[84]
          xord,5,sm     %b[18], %r0, %b[82]
        }
        {
          loop_mode
          pfmuls,0,sm   %b[79], %b[13], %b[85]
          pfmul_hadds,1,sm      %b[78], %b[15], %b[87], %b[57]
          staad,2       %b[5], %aad4[ %aasti8 ]
          incr,2        %aaincr0
          pfmul_hadds,3,sm      %b[84], %b[25], %b[74], %b[60]
          pshufb,4,sm   0x0, %b[41], %r8, %b[81]
          staad,5       %b[4], %aad2[ %aasti6 ]
          incr,5        %aaincr0
          movad,1       area=0, ind=0, am=1, be=0, %b[47]
          movad,2       area=0, ind=0, am=0, be=0, %b[18]
          movad,3       area=0, ind=16, am=0, be=0, %b[40]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          pfmuls,0,sm   %b[82], %b[54], %b[78]
          pfmul_hadds,1,sm      %b[77], %b[56], %b[80], %b[41]
          staad,2       %b[71], %aad3[ %aasti7 ]
          incr,2        %aaincr0
          pshufb,3,sm   0x0, %b[31], %r8, %b[74]
          xord,4,sm     %b[83], %r7, %b[79]
          staad,5       %b[37], %aad1[ %aasti5 ]
          incr,5        %aaincr0
          movad,0       area=2, ind=0, am=1, be=0, %b[25]
          movad,1       area=1, ind=0, am=1, be=0, %b[4]
          movad,2       area=0, ind=24, am=0, be=0, %b[5]
          movad,3       area=0, ind=8, am=1, be=0, %b[15]
        }

Theoretical speed: 4 complex numbers (32 bytes) in 4 cycles = 8 bytes/cycle
Doubled theoretical speed: 16 bytes/cycle

Speed measurements

We see a speedup; the theoretical rate also rises from 6.4 to 8 bytes/cycle.
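The complex multiplication in stage_radix4_simd64 is built from a conjugation, a swap, two lane-wise multiplications and one horizontal addition. The portable model below may help to follow that sequence. It is a sketch under assumptions: the `pf2` type and all function names are invented here, the low lane is assumed to hold the real part, and the horizontal add is modeled the way the article's code uses it rather than taken from official documentation of the builtin.

```c
#include <assert.h>

/* Model of one 64-bit packed register: two float lanes.
   Assumed layout for a complex value: lo = real part, hi = imaginary part. */
typedef struct { float lo, hi; } pf2;

/* Lane-wise multiply (models pfmuls). */
static pf2 pf2_mul(pf2 a, pf2 b) { return (pf2){a.lo * b.lo, a.hi * b.hi}; }

/* Horizontal add as used in the article (models pfhadds):
   result.lo = a.lo + a.hi, result.hi = b.lo + b.hi. */
static pf2 pf2_hadd(pf2 a, pf2 b) { return (pf2){a.lo + a.hi, b.lo + b.hi}; }

/* Negate the high lane (models the XOR with bit 63, i.e. conjugation). */
static pf2 pf2_conj(pf2 a) { return (pf2){a.lo, -a.hi}; }

/* Exchange the two lanes (models pshufb with mask 0x0302010007060504). */
static pf2 pf2_swap(pf2 a) { return (pf2){a.hi, a.lo}; }

/* Complex product c*y assembled in the same order as in the loop body. */
pf2 pf2_complex_mul(pf2 c, pf2 y)
{
	pf2 re_terms = pf2_mul(pf2_conj(c), y); /* (c.re*y.re, -c.im*y.im) */
	pf2 im_terms = pf2_mul(pf2_swap(c), y); /* (c.im*y.re,  c.re*y.im) */
	return pf2_hadd(re_terms, im_terms);    /* (real part, imaginary part) */
}

/* Usage example: (1 + 2i) * (3 + 4i) = -5 + 10i. */
void pf2_self_check(void)
{
	pf2 r = pf2_complex_mul((pf2){1.0f, 2.0f}, (pf2){3.0f, 4.0f});
	assert(r.lo == -5.0f && r.hi == 10.0f);
}
```

In the assembly listing the multiply and the horizontal add appear fused as pfmul_hadds, which is why the 4-cycle loop needs fewer arithmetic slots than this step-by-step model suggests.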


4. stage_radix4_simd128

The computation is done in the same way as in stage_radix2_simd128.

C code
void stage_radix4_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE)
{
	__v2di *xy0_in = (__v2di*)&data_in[0];
	__v2di *zw0_in = (__v2di*)&data_in[2];
	__v2di *xy1_in = (__v2di*)&data_in[4];
	__v2di *zw1_in = (__v2di*)&data_in[6];
	__v2di *c_in   = (__v2di*)coefC;
	__v2di *d_in   = (__v2di*)coefD;
	__v2di *e_in   = (__v2di*)coefE;

	__v2di *out_0 = (__v2di*)&data_out[0*data_count/4];
	__v2di *out_1 = (__v2di*)&data_out[1*data_count/4];
	__v2di *out_2 = (__v2di*)&data_out[2*data_count/4];
	__v2di *out_3 = (__v2di*)&data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/8; ++i)
	{
		__v2di xy0 = xy0_in[4*i];
		__v2di zw0 = zw0_in[4*i];
		__v2di xy1 = xy1_in[4*i];
		__v2di zw1 = zw1_in[4*i];
		__v2di c   = c_in[i];
		__v2di d   = d_in[i];
		__v2di e   = e_in[i];

		__v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		__v2di conj_c = __builtin_e2k_qpxor(c, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_d = __builtin_e2k_qpxor(d, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_e = __builtin_e2k_qpxor(e, (__v2di){1LL<<63, 1LL<<63});
		__v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d = __builtin_e2k_qpshufb(d, d, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e = __builtin_e2k_qpshufb(e, e, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

		__v2di cy_real = __builtin_e2k_qpfmuls(conj_c, y);
		__v2di dz_real = __builtin_e2k_qpfmuls(conj_d, z);
		__v2di ew_real = __builtin_e2k_qpfmuls(conj_e, w);
		__v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
		__v2di dz_imag = __builtin_e2k_qpfmuls(swap_d, z);
		__v2di ew_imag = __builtin_e2k_qpfmuls(swap_e, w);

		__v2di cy_rrii = __builtin_e2k_qpfhadds(cy_real, cy_imag);
		__v2di dz_rrii = __builtin_e2k_qpfhadds(dz_real, dz_imag);
		__v2di ew_rrii = __builtin_e2k_qpfhadds(ew_real, ew_imag);

		__v2di cy = __builtin_e2k_qpshufb(cy_rrii, cy_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di dz = __builtin_e2k_qpshufb(dz_rrii, dz_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di ew = __builtin_e2k_qpshufb(ew_rrii, ew_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

		__v2di add02 = __builtin_e2k_qpfadds( x, dz);
		__v2di sub02 = __builtin_e2k_qpfsubs( x, dz);
		__v2di add13 = __builtin_e2k_qpfadds(cy, ew);
		__v2di sub13 = __builtin_e2k_qpfsubs(cy, ew);

		//__v2di conj_sub13 = __builtin_e2k_qpxor(sub13, (__v2di){1LL<<63, 1LL<<63});
		//__v2di sub13i = __builtin_e2k_qpshufb(conj_sub13, conj_sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13 = __builtin_e2k_qpshufb(sub13, sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di sub13i = __builtin_e2k_qpxor(swap_sub13, (__v2di){1LL<<31, 1LL<<31});

		out_0[i] = __builtin_e2k_qpfadds(add02, add13);
		out_1[i] = __builtin_e2k_qpfsubs(sub02, sub13i);
		out_2[i] = __builtin_e2k_qpfsubs(add02, add13);
		out_3[i] = __builtin_e2k_qpfadds(sub02, sub13i);
	}
}
Main loop in assembly
.L3309:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=4, abs=16, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
        }
.L2728:
        {
          loop_mode
          qpfadd_rsubs,0,sm     %b[13], %b[38], %b[21], %b[0]
          qpshufb,1,sm  %b[44], %b[44], %r9, %b[8]
          qpshufb,3,sm  %b[16], %b[16], %r10, %b[25]
          qpshufb,4,sm  %b[17], %b[17], %r10, %b[48]
          qpfmuls,5,sm  %b[60], %b[49], %b[1]
        }
        {
          loop_mode
          qpfsub_rsubs,0,sm     %b[13], %b[38], %b[58], %b[44]
          qpshufb,1,sm  %b[36], %b[36], %r9, %b[61]
          qpshufb,3,sm  %b[29], %b[52], %r12, %b[16]
          qpfsub_adds,4,sm      %b[13], %b[38], %b[58], %b[21]
          qpfadds,5,sm  %b[48], %b[25], %b[17]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[10], %b[51], %b[3], %b[13]
          qpshufb,1,sm  %b[56], %b[56], %r10, %b[36]
          qpxor,3,sm    %b[34], %r0, %b[38]
          qpshufb,4,sm  %b[59], %b[59], %r9, %b[52]
          qpfmuls,5,sm  %b[57], %b[45], %b[29]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[55], %b[47], %b[31], %b[10]
          qpshufb,1,sm  %b[37], %b[43], %r12, %b[3]
          staaqp,2      %b[26], %aad4[ %aasti8 ]
          incr,2        %aaincr0
          qpxor,4,sm    %b[52], %r7, %b[56]
          qpfmuls,5,sm  %b[38], %b[20], %b[51]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[61], %b[22], %b[53], %b[52]
          qpxor,1,sm    %b[4], %r0, %b[55]
          staaqp,2      %b[2], %aad2[ %aasti6 ]
          incr,2        %aaincr0
          qpshufb,3,sm  %b[35], %b[41], %r11, %b[47]
          qpshufb,4,sm  %b[27], %b[50], %r11, %b[43]
          qpfsubs,5,sm  %b[48], %b[25], %b[57]
          movaqp,0      area=1, ind=0, am=1, be=0, %b[38]
          movaqp,1      area=0, ind=0, am=0, be=0, %b[37]
          movaqp,2      area=1, ind=0, am=1, be=0, %b[26]
          movaqp,3      area=0, ind=0, am=0, be=0, %b[31]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qpfadd_adds,0,sm      %b[11], %b[36], %b[19], %b[22]
          qpshufb,1,sm  %b[6], %b[6], %r9, %b[53]
          staaqp,2      %b[46], %aad3[ %aasti7 ]
          incr,2        %aaincr0
          qpxor,3,sm    %b[42], %r0, %b[58]
          staaqp,5      %b[23], %aad1[ %aasti5 ]
          incr,5        %aaincr0
          movaqp,0      area=2, ind=0, am=1, be=0, %b[2]
          movaqp,1      area=0, ind=16, am=1, be=0, %b[48]
          movaqp,3      area=0, ind=16, am=1, be=0, %b[25]
        }

Theoretical speed: 8 complex numbers (64 bytes) in 6 cycles = 10.67 bytes/cycle
Doubled theoretical speed: 21.33 bytes/cycle

Speed measurements

We see a speedup; the theoretical rate also rises from 8 to 10.67 bytes/cycle.
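Both SIMD variants obtain sub13i (sub13 multiplied by i) without any floating-point instruction: the halves of each complex number are swapped and one sign bit is flipped. The sketch below models this on a single 64-bit value in portable C. The function names are invented for this example, and the bit layout (real part in bits 0..31, imaginary part in bits 32..63) is an assumption inferred from the masks used in the article.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack a complex number: real part in bits 0..31, imaginary part in bits 32..63. */
uint64_t pack_complex(float re, float im)
{
	uint32_t r, i;
	assert(sizeof(float) == sizeof(uint32_t));
	memcpy(&r, &re, sizeof r);
	memcpy(&i, &im, sizeof i);
	return ((uint64_t)i << 32) | r;
}

/* Inverse of pack_complex. */
void unpack_complex(uint64_t v, float *re, float *im)
{
	uint32_t r = (uint32_t)v;
	uint32_t i = (uint32_t)(v >> 32);
	memcpy(re, &r, sizeof r);
	memcpy(im, &i, sizeof i);
}

/* Multiply by i: swap the halves, then flip the sign of the new real part (bit 31).
   (re, im) -> (-im, re) */
uint64_t mul_by_i(uint64_t v)
{
	uint64_t swapped = (v << 32) | (v >> 32);
	return swapped ^ (UINT64_C(1) << 31);
}

/* The commented-out alternative: conjugate first (flip bit 63), then swap. */
uint64_t mul_by_i_alt(uint64_t v)
{
	uint64_t conj = v ^ (UINT64_C(1) << 63);
	return (conj << 32) | (conj >> 32);
}
```

Both functions return the same bit pattern. The 128-bit code applies the same idea to two complex numbers at once, with the sign mask repeated in each 64-bit half.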


5. stage_radix4_simd128_noConj

We reduce the number of instructions in the same way as in stage_radix2_simd128_noConj.

C code
void stage_radix4_simd128_noConj(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE)
{
	__v2di *xy0_in = (__v2di*)&data_in[0];
	__v2di *zw0_in = (__v2di*)&data_in[2];
	__v2di *xy1_in = (__v2di*)&data_in[4];
	__v2di *zw1_in = (__v2di*)&data_in[6];
	__v2di *c_in   = (__v2di*)coefC;
	__v2di *d_in   = (__v2di*)coefD;
	__v2di *e_in   = (__v2di*)coefE;

	__v2di *out_0 = (__v2di*)&data_out[0*data_count/4];
	__v2di *out_1 = (__v2di*)&data_out[1*data_count/4];
	__v2di *out_2 = (__v2di*)&data_out[2*data_count/4];
	__v2di *out_3 = (__v2di*)&data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/8; ++i)
	{
		__v2di xy0 = xy0_in[4*i];
		__v2di zw0 = zw0_in[4*i];
		__v2di xy1 = xy1_in[4*i];
		__v2di zw1 = zw1_in[4*i];
		__v2di c   = c_in[i];
		__v2di d   = d_in[i];
		__v2di e   = e_in[i];

		__v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		__v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d = __builtin_e2k_qpshufb(d, d, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e = __builtin_e2k_qpshufb(e, e, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

		__v2di cy_real = __builtin_e2k_qpfmuls(     c, y);
		__v2di dz_real = __builtin_e2k_qpfmuls(     d, z);
		__v2di ew_real = __builtin_e2k_qpfmuls(     e, w);
		__v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
		__v2di dz_imag = __builtin_e2k_qpfmuls(swap_d, z);
		__v2di ew_imag = __builtin_e2k_qpfmuls(swap_e, w);

		__v2di cy_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy_real);
		__v2di dz_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz_real);
		__v2di ew_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew_real);
		__v2di cy_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy_imag);
		__v2di dz_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz_imag);
		__v2di ew_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew_imag);

		__v2di cy = __builtin_e2k_qppermb(cy_ii, cy_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di dz = __builtin_e2k_qppermb(dz_ii, dz_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di ew = __builtin_e2k_qppermb(ew_ii, ew_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});

		__v2di add02 = __builtin_e2k_qpfadds( x, dz);
		__v2di sub02 = __builtin_e2k_qpfsubs( x, dz);
		__v2di add13 = __builtin_e2k_qpfadds(cy, ew);
		__v2di sub13 = __builtin_e2k_qpfsubs(cy, ew);

		//__v2di conj_sub13 = __builtin_e2k_qpxor(sub13, (__v2di){1LL<<63, 1LL<<63});
		//__v2di sub13i = __builtin_e2k_qpshufb(conj_sub13, conj_sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13 = __builtin_e2k_qpshufb(sub13, sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di sub13i = __builtin_e2k_qpxor(swap_sub13, (__v2di){1LL<<31, 1LL<<31});

		out_0[i] = __builtin_e2k_qpfadds(add02, add13);
		out_1[i] = __builtin_e2k_qpfsubs(sub02, sub13i);
		out_2[i] = __builtin_e2k_qpfsubs(add02, add13);
		out_3[i] = __builtin_e2k_qpfadds(sub02, sub13i);
	}
}
Main loop in assembly
.L3939:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=4, abs=16, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
        }
.L3362:
        {
          loop_mode
          qpfmul_hsubs,0,sm     %b[61], %b[63], %r12, %b[1]
          qpfmul_hadds,2,sm     %b[64], %b[63], %r12, %b[0]
          qpshufb,3,sm  %b[8], %b[9], %r9, %b[40]
          qpshufb,4,sm  %b[19], %b[19], %r11, %b[31]
          qpfadds,5,sm  %b[37], %b[30], %b[26]
          movaqp,0      area=1, ind=0, am=1, be=0, %b[17]
          movaqp,1      area=0, ind=0, am=0, be=0, %b[7]
          movaqp,3      area=0, ind=0, am=0, be=0, %b[6]
        }
        {
          loop_mode
          qpfmul_hsubs,0,sm     %b[21], %b[42], %r12, %b[44]
          qpfmul_hadds,2,sm     %b[33], %b[42], %r12, %b[41]
          qpshufb,3,sm  %b[49], %b[52], %r9, %b[61]
          qpshufb,4,sm  %b[59], %b[59], %r11, %b[62]
          qpfsubs,5,sm  %b[37], %b[30], %b[58]
          movaqp,0      area=2, ind=0, am=1, be=0, %b[57]
          movaqp,1      area=0, ind=16, am=1, be=0, %b[50]
          movaqp,3      area=0, ind=16, am=1, be=0, %b[47]
        }
        {
          loop_mode
          qpfmul_hsubs,0,sm     %b[16], %b[39], %r12, %b[21]
          qpshufb,1,sm  %b[60], %b[60], %r11, %b[42]
          qpfadd_adds,2,sm      %b[24], %b[56], %b[28], %b[30]
          qpshufb,3,sm  %b[14], %b[14], %r11, %b[33]
          qpshufb,4,sm  %b[51], %b[54], %r10, %b[37]
          staaqp,5      %b[34], %aad4[ %aasti8 ]
          incr,5        %aaincr0
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[35], %b[39], %r12, %b[34]
          qpxor,1,sm    %b[42], %r7, %b[60]
          qpfadd_rsubs,2,sm     %b[24], %b[56], %b[28], %b[51]
          qpshufb,3,sm  %b[10], %b[11], %r10, %b[16]
          qppermb,4,sm  %b[38], %b[25], %r0, %b[54]
          staaqp,5      %b[55], %aad2[ %aasti6 ]
          incr,5        %aaincr0
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qpfsub_rsubs,0,sm     %b[24], %b[56], %b[60], %b[25]
          qpfsub_adds,1,sm      %b[24], %b[56], %b[60], %b[11]
          staaqp,2      %b[29], %aad3[ %aasti7 ]
          incr,2        %aaincr0
          qppermb,3,sm  %b[4], %b[5], %r0, %b[28]
          qppermb,4,sm  %b[45], %b[48], %r0, %b[35]
          staaqp,5      %b[15], %aad1[ %aasti5 ]
          incr,5        %aaincr0
          movaqp,3      area=1, ind=0, am=1, be=0, %b[10]
        }

Theoretical speed: 8 complex numbers (64 bytes) in 5 cycles = 12.8 bytes/cycle
Doubled theoretical speed: 25.6 bytes/cycle

Speed measurements

We see a speedup; the theoretical rate also rises from 10.67 to 12.8 bytes/cycle.
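The noConj variant drops the three conjugation XORs: the coefficient is multiplied as is, and the minus sign of the real part comes from a horizontal subtraction instead of a horizontal addition. The model below illustrates this on four float lanes in portable C. It is a sketch under assumptions: the `qf4` type and the function names are invented here, the lane order (re0, im0, re1, im1) is assumed, and the pairwise operations and the final interleave are modeled after how the article's code uses qpfhsubs, qpfhadds and qppermb, not after official documentation.

```c
#include <assert.h>

/* Model of a 128-bit register: two complex numbers as (re0, im0, re1, im1). */
typedef struct { float v[4]; } qf4;

/* Lane-wise multiply. */
static qf4 qf4_mul(qf4 a, qf4 b)
{
	qf4 r;
	for (int k = 0; k < 4; ++k)
		r.v[k] = a.v[k] * b.v[k];
	return r;
}

/* Swap the real and imaginary parts inside each complex number. */
static qf4 qf4_swap_pairs(qf4 a)
{
	return (qf4){{a.v[1], a.v[0], a.v[3], a.v[2]}};
}

/* Pairwise difference of adjacent lanes (the useful half of the horizontal subtract). */
static void pair_sub(qf4 a, float out[2])
{
	out[0] = a.v[0] - a.v[1];
	out[1] = a.v[2] - a.v[3];
}

/* Pairwise sum of adjacent lanes (the useful half of the horizontal add). */
static void pair_add(qf4 a, float out[2])
{
	out[0] = a.v[0] + a.v[1];
	out[1] = a.v[2] + a.v[3];
}

/* c*y for two complex numbers at once, without conjugating c. */
qf4 qf4_complex_mul_noconj(qf4 c, qf4 y)
{
	float re[2], im[2];
	pair_sub(qf4_mul(c, y), re);                 /* c.re*y.re - c.im*y.im */
	pair_add(qf4_mul(qf4_swap_pairs(c), y), im); /* c.im*y.re + c.re*y.im */
	return (qf4){{re[0], im[0], re[1], im[1]}};  /* interleave back to (re, im) pairs */
}

/* Usage example: (1 + 2i)*(3 + 4i) = -5 + 10i and i*(5 + 6i) = -6 + 5i. */
void qf4_self_check(void)
{
	qf4 c = {{1.0f, 2.0f, 0.0f, 1.0f}};
	qf4 y = {{3.0f, 4.0f, 5.0f, 6.0f}};
	qf4 r = qf4_complex_mul_noconj(c, y);
	assert(r.v[0] == -5.0f && r.v[1] == 10.0f);
	assert(r.v[2] == -6.0f && r.v[3] == 5.0f);
}
```

Compared with the conjugating version, this trades three XORs for one extra permutation per product, which is consistent with the loop shrinking from 6 to 5 wide instructions.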


6. stage_radix4_simd128_noConj_unroll3

Here the loop is unrolled by a factor of 3 with #pragma unroll(3).

C code
void stage_radix4_simd128_noConj_unroll3(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE)
{
	__v2di *xy0_in = (__v2di*)&data_in[0];
	__v2di *zw0_in = (__v2di*)&data_in[2];
	__v2di *xy1_in = (__v2di*)&data_in[4];
	__v2di *zw1_in = (__v2di*)&data_in[6];
	__v2di *c_in   = (__v2di*)coefC;
	__v2di *d_in   = (__v2di*)coefD;
	__v2di *e_in   = (__v2di*)coefE;

	__v2di *out_0 = (__v2di*)&data_out[0*data_count/4];
	__v2di *out_1 = (__v2di*)&data_out[1*data_count/4];
	__v2di *out_2 = (__v2di*)&data_out[2*data_count/4];
	__v2di *out_3 = (__v2di*)&data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(3)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/8; ++i)
	{
		__v2di xy0 = xy0_in[4*i];
		__v2di zw0 = zw0_in[4*i];
		__v2di xy1 = xy1_in[4*i];
		__v2di zw1 = zw1_in[4*i];
		__v2di c   = c_in[i];
		__v2di d   = d_in[i];
		__v2di e   = e_in[i];

		__v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		__v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d = __builtin_e2k_qpshufb(d, d, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e = __builtin_e2k_qpshufb(e, e, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

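		// Complex products c*y, d*z, e*w: lane-wise multiplies by the coefficient and by its
		// real/imag-swapped copy, then a horizontal subtract (real part) and add (imaginary part).
		__v2di cy_real = __builtin_e2k_qpfmuls(     c, y);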
		__v2di dz_real = __builtin_e2k_qpfmuls(     d, z);
		__v2di ew_real = __builtin_e2k_qpfmuls(     e, w);
		__v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
		__v2di dz_imag = __builtin_e2k_qpfmuls(swap_d, z);
		__v2di ew_imag = __builtin_e2k_qpfmuls(swap_e, w);

		__v2di cy_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy_real);
		__v2di dz_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz_real);
		__v2di ew_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew_real);
		__v2di cy_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy_imag);
		__v2di dz_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz_imag);
		__v2di ew_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew_imag);

		__v2di cy = __builtin_e2k_qppermb(cy_ii, cy_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di dz = __builtin_e2k_qppermb(dz_ii, dz_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di ew = __builtin_e2k_qppermb(ew_ii, ew_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});

		__v2di add02 = __builtin_e2k_qpfadds( x, dz);
		__v2di sub02 = __builtin_e2k_qpfsubs( x, dz);
		__v2di add13 = __builtin_e2k_qpfadds(cy, ew);
		__v2di sub13 = __builtin_e2k_qpfsubs(cy, ew);

		//__v2di conj_sub13 = __builtin_e2k_qpxor(sub13, (__v2di){1LL<<63, 1LL<<63});
		//__v2di sub13i = __builtin_e2k_qpshufb(conj_sub13, conj_sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
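		// Multiply sub13 by i: swap real and imag, then flip the sign bit (bit 31) of the low
		// float in each 64-bit lane, giving (-imag, real) as in the etalon code.
		__v2di swap_sub13 = __builtin_e2k_qpshufb(sub13, sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});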
		__v2di sub13i = __builtin_e2k_qpxor(swap_sub13, (__v2di){1LL<<31, 1LL<<31});

		out_0[i] = __builtin_e2k_qpfadds(add02, add13);
		out_1[i] = __builtin_e2k_qpfsubs(sub02, sub13i);
		out_2[i] = __builtin_e2k_qpfsubs(add02, add13);
		out_3[i] = __builtin_e2k_qpfadds(sub02, sub13i);
	}
}
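
The SIMD code above builds each complex product from two lane-wise multiplies (by the coefficient and by its swapped copy) followed by a horizontal subtract and a horizontal add. The scalar sketch below models only that arithmetic, to show that it equals an ordinary complex multiplication. The type cpx and the function names are hypothetical stand-ins, and the exact lane order of qpfhsubs, qpfhadds and qppermb is deliberately not modelled.

```c
/* Hypothetical stand-in for the article's myComplex. */
typedef struct { float real, imag; } cpx;

/* Textbook complex multiplication. */
cpx mul_direct(cpx c, cpx y)
{
	cpx r = { c.real * y.real - c.imag * y.imag,
	          c.real * y.imag + c.imag * y.real };
	return r;
}

/* The same product computed the way the SIMD code does it. */
cpx mul_swap_trick(cpx c, cpx y)
{
	cpx swap_c = { c.imag, c.real };          /* real/imag swapped copy of c */

	/* Lane-wise products c * y (models cy_real). */
	float re0 = c.real * y.real;
	float re1 = c.imag * y.imag;

	/* Lane-wise products swap_c * y (models cy_imag). */
	float im0 = swap_c.real * y.real;
	float im1 = swap_c.imag * y.imag;

	/* Horizontal subtract gives the real part, horizontal add the imaginary part. */
	cpx r = { re0 - re1, im0 + im1 };
	return r;
}

/* Multiplication by i as swap plus one sign flip: (a + bi) * i = -b + ai. */
cpx mul_by_i(cpx v)
{
	cpx r = { -v.imag, v.real };
	return r;
}
```

For c = 1 + 2i and y = 3 + 4i both functions return -5 + 10i.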
Main loop in assembly
.L5038:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=0, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=4, disp=64
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=4, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=8, disp=128
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=8, disp=160
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=4, asz=2, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=3, asz=2, abs=12, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=3, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=4, asz=3, abs=16, disp=32
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=3, asz=3, abs=24, disp=32
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=2, asz=3, abs=24, disp=32
        }
.L3992:
        {
          loop_mode
          qpfsub_adds,0,sm      %b[106], %b[103], %g16, %g17
          qpfmul_hsubs,1,sm     %b[10], %b[97], %r9, %b[0]
          qpfadds,2,sm  %g18, %g19, %g20
          qpshufb,3,sm  %b[82], %b[81], %r0, %g22
          qpshufb,4,sm  %b[82], %b[81], %r5, %g21
          qpfmul_hadds,5,sm     %g23, %g24, %r9, %b[1]
        }
        {
          loop_mode
          qpfsub_rsubs,0,sm     %g25, %g26, %g27, %b[117]
          qppermb,1,sm  %g28, %b[117], %r1, %g29
          qpfadds,2,sm  %b[105], %b[108], %b[108]
          qpshufb,3,sm  %b[54], %b[54], %r3, %g31
          qpshufb,4,sm  %b[13], %b[13], %r3, %g30
          qpfmul_hsubs,5,sm     %b[13], %g21, %r9, %b[105]
        }
        {
          loop_mode
          qpfsub_adds,0,sm      %b[116], %b[109], %b[113], %r2
          qpfmul_hadds,1,sm     %r11, %b[43], %r9, %b[3]
          qpfsubs,2,sm  %g29, %b[110], %r7
          qppermb,3,sm  %b[3], %b[64], %r1, %g19
          qppermb,4,sm  %b[118], %r10, %r1, %g18
          qpfmul_hadds,5,sm     %g30, %g21, %r9, %g21
        }
        {
          loop_mode
          qpfsub_rsubs,0,sm     %b[106], %b[103], %g16, %b[110]
          qpfmul_hsubs,1,sm     %b[38], %b[114], %r9, %b[13]
          qpfadds,2,sm  %g29, %b[110], %b[113]
          qppermb,3,sm  %b[27], %b[16], %r1, %b[106]
          qppermb,4,sm  %b[21], %b[17], %r1, %b[103]
          qpfsub_rsubs,5,sm     %b[116], %b[109], %b[113], %b[109]
        }
        {
          loop_mode
          qpfmul_hsubs,0,sm     %b[54], %g22, %r9, %r12
          qpfmul_hsubs,1,sm     %b[57], %b[41], %r9, %b[16]
          staaqp,2      %b[101], %aad2[ %aasti6 + _f32s,_lts0 0x10 ]
          qpshufb,3,sm  %b[59], %b[42], %r5, %b[102]
          qppermb,4,sm  %b[102], %r12, %r1, %b[101]
          qpfmul_hadds,5,sm     %b[104], %b[114], %r9, %b[17]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %g31, %g22, %r9, %b[100]
          qpshufb,1,sm  %b[74], %b[73], %r0, %b[104]
          staaqp,2      %b[115], %aad4[ %aasti8 + _f32s,_lts0 0x10 ]
          qpshufb,3,sm  %b[100], %b[100], %r3, %b[115]
          qpshufb,4,sm  %b[10], %b[10], %r3, %b[114]
          qpfmul_hsubs,5,sm     %b[100], %b[102], %r9, %b[10]
        }
        {
          loop_mode
          qpfmul_hsubs,0,sm     %b[51], %b[112], %r9, %b[115]
          qpfsubs,1,sm  %g18, %g19, %b[27]
          staaqp,2      %r13, %aad2[ %aasti6 + _f32s,_lts0 0x20 ]
          qpshufb,3,sm  %b[36], %b[36], %r3, %b[102]
          qpshufb,4,sm  %b[88], %b[87], %r5, %b[116]
          qpfmul_hadds,5,sm     %b[115], %b[102], %r9, %b[21]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %r14, %b[112], %r9, %g28
          qpfsubs,1,sm  %b[103], %b[106], %b[38]
          staaqp,2      %b[119], %aad2[ %aasti6 ]
          incr,2        %aaincr3
          qpshufb,3,sm  %b[31], %b[26], %r5, %b[112]
          qpshufb,4,sm  %b[60], %b[60], %r3, %g22
          qpfmul_hsubs,5,sm     %b[60], %b[116], %r9, %r10
        }
        {
          loop_mode
          qpfadd_rsubs,0,sm     %b[104], %b[101], %b[113], %b[99]
          qppermb,1,sm  %b[5], %b[20], %r1, %b[107]
          staaqp,2      %b[107], %aad4[ %aasti8 + _f32s,_lts0 0x20 ]
          qppermb,3,sm  %b[99], %b[2], %r1, %g26
          qpshufb,4,sm  %b[39], %b[34], %r0, %g25
          qpfmul_hadds,5,sm     %g22, %b[116], %r9, %b[116]
          movaqp,1      area=5, ind=0, am=1, be=0, %b[2]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[114], %b[97], %r9, %b[97]
          qpshufb,1,sm  %b[92], %b[91], %r0, %b[114]
          staaqp,2      %b[98], %aad4[ %aasti8 ]
          incr,2        %aaincr3
          qpshufb,3,sm  %b[96], %b[95], %r0, %b[39]
          qpshufb,4,sm  %b[24], %b[24], %r3, %g23
          qpfadd_adds,5,sm      %b[104], %b[101], %b[113], %b[113]
          movaqp,0      area=4, ind=16, am=1, be=0, %b[5]
          movaqp,1      area=4, ind=0, am=0, be=0, %b[20]
          movaqp,2      area=5, ind=0, am=1, be=0, %b[98]
          movaqp,3      area=4, ind=0, am=1, be=0, %b[34]
        }
        {
          loop_mode
          qpfadd_adds,0,sm      %b[114], %b[107], %g20, %b[96]
          qpshufb,1,sm  %r7, %r7, %r3, %g17
          staaqp,2      %g17, %aad1[ %aasti5 + _f32s,_lts0 0x10 ]
          qpshufb,3,sm  %b[96], %b[95], %r5, %g24
          qpshufb,4,sm  %b[63], %b[46], %r0, %b[95]
          qpfadd_rsubs,5,sm     %g25, %g26, %b[108], %r13
          movaqp,0      area=3, ind=16, am=1, be=0, %b[43]
          movaqp,1      area=3, ind=0, am=0, be=0, %b[54]
          movaqp,2      area=3, ind=16, am=1, be=0, %b[46]
          movaqp,3      area=3, ind=0, am=0, be=0, %b[51]
        }
        {
          loop_mode
          qpfadd_rsubs,0,sm     %b[114], %b[107], %g20, %b[117]
          qpshufb,1,sm  %b[40], %b[40], %r3, %g22
          staaqp,2      %b[117], %aad3[ %aasti7 + _f32s,_lts0 0x20 ]
          qpshufb,3,sm  %b[57], %b[57], %r3, %r11
          qpshufb,4,sm  %b[29], %b[29], %r3, %g29
          qpfmul_hsubs,5,sm     %b[24], %g24, %r9, %b[60]
          movaqp,0      area=2, ind=0, am=0, be=0, %b[24]
          movaqp,1      area=2, ind=16, am=1, be=0, %b[40]
          movaqp,2      area=2, ind=0, am=0, be=0, %b[29]
          movaqp,3      area=2, ind=16, am=1, be=0, %b[57]
        }
        {
          loop_mode
          qpfadd_adds,0,sm      %g25, %g26, %b[108], %b[105]
          qpxor,1,sm    %g22, %r4, %g27
          staaqp,2      %b[111], %aad1[ %aasti5 + _f32s,_lts0 0x20 ]
          qppermb,3,sm  %g21, %b[105], %r1, %b[108]
          qpxor,4,sm    %g29, %r4, %b[111]
          staaqp,5      %r2, %aad1[ %aasti5 ]
          incr,5        %aaincr3
          movaqp,0      area=1, ind=16, am=1, be=0, %b[73]
          movaqp,1      area=1, ind=0, am=0, be=0, %b[63]
          movaqp,2      area=1, ind=16, am=1, be=0, %b[74]
          movaqp,3      area=1, ind=0, am=0, be=0, %b[64]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qpfsub_adds,0,sm      %g25, %g26, %g27, %b[109]
          qpxor,1,sm    %g17, %r4, %g16
          staaqp,2      %b[110], %aad3[ %aasti7 + _f32s,_lts0 0x10 ]
          qpshufb,3,sm  %b[49], %b[49], %r3, %r14
          qpshufb,4,sm  %b[70], %b[69], %r5, %b[110]
          staaqp,5      %b[109], %aad3[ %aasti7 ]
          incr,5        %aaincr3
          movaqp,0      area=0, ind=0, am=0, be=0, %b[81]
          movaqp,1      area=0, ind=16, am=1, be=0, %b[91]
          movaqp,2      area=0, ind=0, am=0, be=0, %b[82]
          movaqp,3      area=0, ind=16, am=1, be=0, %b[92]
        }

Theoretical speed: 24 complex numbers in 14 cycles (24/14) = 13.71 bytes/cycle
Doubled theoretical speed: 27.43 bytes/cycle
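
The figures above can be reproduced with a few lines of C. This sketch assumes one wide instruction per cycle with no stalls, and 8 bytes per complex number (two floats). The loop body .L3992 consists of 14 wide instructions and, after unrolling by 3, handles 3 x 8 = 24 complex numbers. The function name bytes_per_cycle is hypothetical.

```c
/* Bytes per cycle for a loop that processes `complex_per_iter` complex
   numbers in `cycles_per_iter` cycles. */
double bytes_per_cycle(int complex_per_iter, int cycles_per_iter)
{
	const double bytes_per_complex = 2.0 * sizeof(float);  /* real + imag */
	return complex_per_iter * bytes_per_complex / cycles_per_iter;
}
```

bytes_per_cycle(8, 5) gives 12.8 for the non-unrolled loop and bytes_per_cycle(24, 14) gives about 13.71 for the loop unrolled by 3. The doubled figures are simply these values multiplied by 2.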

Speed measurements

We see a speedup in the middle of the graph, but a slowdown at its beginning and end.


Summary for stage_radix4

The FFT graph is here.


stage_radix4_2x

Diagram of the Stage algorithm for the "radix-4" 2x version.

One pass of stage_radix4_2x does the same work as 2 passes of stage_radix4, and one pass of stage_radix4 does the same work as 2 passes of stage_radix2. Therefore, for easier comparison with stage_radix2, we will multiply the speed of stage_radix4_2x by 4 (this is noted on the graph axis and in the console output).
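
As a small illustration of why one radix-4 pass counts as two radix-2 passes, the sketch below writes the radix-4 butterfly used in the code (inputs x and the already multiplied cy, dz, ew) next to the same computation arranged as two layers of radix-2 butterflies. It uses C99 float complex for brevity instead of the article's myComplex, and both function names are hypothetical.

```c
#include <complex.h>

/* Radix-4 butterfly in the form used by stage_radix4. */
void butterfly_radix4(float complex x, float complex cy,
                      float complex dz, float complex ew,
                      float complex out[4])
{
	float complex add02 = x + dz,  sub02 = x - dz;
	float complex add13 = cy + ew, sub13 = cy - ew;
	float complex sub13i = I * sub13;          /* (-imag, real) */
	out[0] = add02 + add13;
	out[1] = sub02 - sub13i;
	out[2] = add02 - add13;
	out[3] = sub02 + sub13i;
}

/* The same outputs as two layers of radix-2 butterflies. */
void butterfly_two_radix2(float complex x, float complex cy,
                          float complex dz, float complex ew,
                          float complex out[4])
{
	/* Layer 1: butterflies on the pairs (x, dz) and (cy, ew). */
	float complex s0 = x + dz,  d0 = x - dz;
	float complex s1 = cy + ew, d1 = cy - ew;

	/* Layer 2: butterflies on (s0, s1) with factor 1 and on (d0, d1) with factor -i. */
	float complex t = -I * d1;
	out[0] = s0 + s1;
	out[2] = s0 - s1;
	out[1] = d0 + t;
	out[3] = d0 - t;
}
```

Both functions compute a 4-point DFT of (x, cy, dz, ew) with the factor -i, which is why the speed of a radix-4 pass is scaled by 2, and that of the 2x version by 4, when comparing with stage_radix2.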

1. stage_radix4_2x_etalon

Here the stage_radix4_etalon algorithm is manually unrolled 2 times, so that two consecutive radix-4 passes are performed inside a single loop body.

C code
void stage_radix4_2x_etalon(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b)
{
	myComplex *x0_in = &data_in[ 0];
	myComplex *y0_in = &data_in[ 1];
	myComplex *z0_in = &data_in[ 2];
	myComplex *w0_in = &data_in[ 3];
	myComplex *x1_in = &data_in[ 4];
	myComplex *y1_in = &data_in[ 5];
	myComplex *z1_in = &data_in[ 6];
	myComplex *w1_in = &data_in[ 7];
	myComplex *x2_in = &data_in[ 8];
	myComplex *y2_in = &data_in[ 9];
	myComplex *z2_in = &data_in[10];
	myComplex *w2_in = &data_in[11];
	myComplex *x3_in = &data_in[12];
	myComplex *y3_in = &data_in[13];
	myComplex *z3_in = &data_in[14];
	myComplex *w3_in = &data_in[15];
	myComplex *c0a_in = &coefC_a[0];
	myComplex *c1a_in = &coefC_a[1];
	myComplex *c2a_in = &coefC_a[2];
	myComplex *c3a_in = &coefC_a[3];
	myComplex *d0a_in = &coefD_a[0];
	myComplex *d1a_in = &coefD_a[1];
	myComplex *d2a_in = &coefD_a[2];
	myComplex *d3a_in = &coefD_a[3];
	myComplex *e0a_in = &coefE_a[0];
	myComplex *e1a_in = &coefE_a[1];
	myComplex *e2a_in = &coefE_a[2];
	myComplex *e3a_in = &coefE_a[3];
	myComplex *c0b_in = &coefC_b[0*data_count/16];
	myComplex *c1b_in = &coefC_b[1*data_count/16];
	myComplex *c2b_in = &coefC_b[2*data_count/16];
	myComplex *c3b_in = &coefC_b[3*data_count/16];
	myComplex *d0b_in = &coefD_b[0*data_count/16];
	myComplex *d1b_in = &coefD_b[1*data_count/16];
	myComplex *d2b_in = &coefD_b[2*data_count/16];
	myComplex *d3b_in = &coefD_b[3*data_count/16];
	myComplex *e0b_in = &coefE_b[0*data_count/16];
	myComplex *e1b_in = &coefE_b[1*data_count/16];
	myComplex *e2b_in = &coefE_b[2*data_count/16];
	myComplex *e3b_in = &coefE_b[3*data_count/16];

	myComplex *out_0  = &data_out[ 0*data_count/16];
	myComplex *out_1  = &data_out[ 1*data_count/16];
	myComplex *out_2  = &data_out[ 2*data_count/16];
	myComplex *out_3  = &data_out[ 3*data_count/16];
	myComplex *out_4  = &data_out[ 4*data_count/16];
	myComplex *out_5  = &data_out[ 5*data_count/16];
	myComplex *out_6  = &data_out[ 6*data_count/16];
	myComplex *out_7  = &data_out[ 7*data_count/16];
	myComplex *out_8  = &data_out[ 8*data_count/16];
	myComplex *out_9  = &data_out[ 9*data_count/16];
	myComplex *out_10 = &data_out[10*data_count/16];
	myComplex *out_11 = &data_out[11*data_count/16];
	myComplex *out_12 = &data_out[12*data_count/16];
	myComplex *out_13 = &data_out[13*data_count/16];
	myComplex *out_14 = &data_out[14*data_count/16];
	myComplex *out_15 = &data_out[15*data_count/16];

	#pragma ivdep
	#pragma unroll(1)
//	#pragma prefetch
	for(int64_t i = 0; i < data_count/16; ++i)
	{
		myComplex x0 = x0_in[16*i];
		myComplex y0 = y0_in[16*i];
		myComplex z0 = z0_in[16*i];
		myComplex w0 = w0_in[16*i];
		myComplex c0 = c0a_in[4*i];
		myComplex d0 = d0a_in[4*i];
		myComplex e0 = e0a_in[4*i];

		myComplex x1 = x1_in[16*i];
		myComplex y1 = y1_in[16*i];
		myComplex z1 = z1_in[16*i];
		myComplex w1 = w1_in[16*i];
		myComplex c1 = c1a_in[4*i];
		myComplex d1 = d1a_in[4*i];
		myComplex e1 = e1a_in[4*i];

		myComplex x2 = x2_in[16*i];
		myComplex y2 = y2_in[16*i];
		myComplex z2 = z2_in[16*i];
		myComplex w2 = w2_in[16*i];
		myComplex c2 = c2a_in[4*i];
		myComplex d2 = d2a_in[4*i];
		myComplex e2 = e2a_in[4*i];

		myComplex x3 = x3_in[16*i];
		myComplex y3 = y3_in[16*i];
		myComplex z3 = z3_in[16*i];
		myComplex w3 = w3_in[16*i];
		myComplex c3 = c3a_in[4*i];
		myComplex d3 = d3a_in[4*i];
		myComplex e3 = e3a_in[4*i];

		myComplex cy0 = complex_mul(c0, y0);
		myComplex cy1 = complex_mul(c1, y1);
		myComplex cy2 = complex_mul(c2, y2);
		myComplex cy3 = complex_mul(c3, y3);
		myComplex dz0 = complex_mul(d0, z0);
		myComplex dz1 = complex_mul(d1, z1);
		myComplex dz2 = complex_mul(d2, z2);
		myComplex dz3 = complex_mul(d3, z3);
		myComplex ew0 = complex_mul(e0, w0);
		myComplex ew1 = complex_mul(e1, w1);
		myComplex ew2 = complex_mul(e2, w2);
		myComplex ew3 = complex_mul(e3, w3);

		myComplex add02_0 = complex_add( x0, dz0);
		myComplex add02_1 = complex_add( x1, dz1);
		myComplex add02_2 = complex_add( x2, dz2);
		myComplex add02_3 = complex_add( x3, dz3);
		myComplex sub02_0 = complex_sub( x0, dz0);
		myComplex sub02_1 = complex_sub( x1, dz1);
		myComplex sub02_2 = complex_sub( x2, dz2);
		myComplex sub02_3 = complex_sub( x3, dz3);
		myComplex add13_0 = complex_add(cy0, ew0);
		myComplex add13_1 = complex_add(cy1, ew1);
		myComplex add13_2 = complex_add(cy2, ew2);
		myComplex add13_3 = complex_add(cy3, ew3);
		myComplex sub13_0 = complex_sub(cy0, ew0);
		myComplex sub13_1 = complex_sub(cy1, ew1);
		myComplex sub13_2 = complex_sub(cy2, ew2);
		myComplex sub13_3 = complex_sub(cy3, ew3);
		myComplex sub13i_0 = (myComplex){.real = -sub13_0.imag, .imag = sub13_0.real};
		myComplex sub13i_1 = (myComplex){.real = -sub13_1.imag, .imag = sub13_1.real};
		myComplex sub13i_2 = (myComplex){.real = -sub13_2.imag, .imag = sub13_2.real};
		myComplex sub13i_3 = (myComplex){.real = -sub13_3.imag, .imag = sub13_3.real};

		myComplex out0  = complex_add(add02_0, add13_0);
		myComplex out1  = complex_add(add02_1, add13_1);
		myComplex out2  = complex_add(add02_2, add13_2);
		myComplex out3  = complex_add(add02_3, add13_3);
		myComplex out4  = complex_sub(sub02_0, sub13i_0);
		myComplex out5  = complex_sub(sub02_1, sub13i_1);
		myComplex out6  = complex_sub(sub02_2, sub13i_2);
		myComplex out7  = complex_sub(sub02_3, sub13i_3);
		myComplex out8  = complex_sub(add02_0, add13_0);
		myComplex out9  = complex_sub(add02_1, add13_1);
		myComplex out10 = complex_sub(add02_2, add13_2);
		myComplex out11 = complex_sub(add02_3, add13_3);
		myComplex out12 = complex_add(sub02_0, sub13i_0);
		myComplex out13 = complex_add(sub02_1, sub13i_1);
		myComplex out14 = complex_add(sub02_2, sub13i_2);
		myComplex out15 = complex_add(sub02_3, sub13i_3);


		x0 = out0;
		y0 = out1;
		z0 = out2;
		w0 = out3;
		c0 = c0b_in[i];
		d0 = d0b_in[i];
		e0 = e0b_in[i];

		x1 = out4;
		y1 = out5;
		z1 = out6;
		w1 = out7;
		c1 = c1b_in[i];
		d1 = d1b_in[i];
		e1 = e1b_in[i];

		x2 = out8;
		y2 = out9;
		z2 = out10;
		w2 = out11;
		c2 = c2b_in[i];
		d2 = d2b_in[i];
		e2 = e2b_in[i];

		x3 = out12;
		y3 = out13;
		z3 = out14;
		w3 = out15;
		c3 = c3b_in[i];
		d3 = d3b_in[i];
		e3 = e3b_in[i];

		cy0 = complex_mul(c0, y0);
		cy1 = complex_mul(c1, y1);
		cy2 = complex_mul(c2, y2);
		cy3 = complex_mul(c3, y3);
		dz0 = complex_mul(d0, z0);
		dz1 = complex_mul(d1, z1);
		dz2 = complex_mul(d2, z2);
		dz3 = complex_mul(d3, z3);
		ew0 = complex_mul(e0, w0);
		ew1 = complex_mul(e1, w1);
		ew2 = complex_mul(e2, w2);
		ew3 = complex_mul(e3, w3);

		add02_0 = complex_add( x0, dz0);
		add02_1 = complex_add( x1, dz1);
		add02_2 = complex_add( x2, dz2);
		add02_3 = complex_add( x3, dz3);
		sub02_0 = complex_sub( x0, dz0);
		sub02_1 = complex_sub( x1, dz1);
		sub02_2 = complex_sub( x2, dz2);
		sub02_3 = complex_sub( x3, dz3);
		add13_0 = complex_add(cy0, ew0);
		add13_1 = complex_add(cy1, ew1);
		add13_2 = complex_add(cy2, ew2);
		add13_3 = complex_add(cy3, ew3);
		sub13_0 = complex_sub(cy0, ew0);
		sub13_1 = complex_sub(cy1, ew1);
		sub13_2 = complex_sub(cy2, ew2);
		sub13_3 = complex_sub(cy3, ew3);
		sub13i_0 = (myComplex){.real = -sub13_0.imag, .imag = sub13_0.real};
		sub13i_1 = (myComplex){.real = -sub13_1.imag, .imag = sub13_1.real};
		sub13i_2 = (myComplex){.real = -sub13_2.imag, .imag = sub13_2.real};
		sub13i_3 = (myComplex){.real = -sub13_3.imag, .imag = sub13_3.real};

		out_0[i]  = complex_add(add02_0, add13_0);
		out_1[i]  = complex_add(add02_1, add13_1);
		out_2[i]  = complex_add(add02_2, add13_2);
		out_3[i]  = complex_add(add02_3, add13_3);
		out_4[i]  = complex_sub(sub02_0, sub13i_0);
		out_5[i]  = complex_sub(sub02_1, sub13i_1);
		out_6[i]  = complex_sub(sub02_2, sub13i_2);
		out_7[i]  = complex_sub(sub02_3, sub13i_3);
		out_8[i]  = complex_sub(add02_0, add13_0);
		out_9[i]  = complex_sub(add02_1, add13_1);
		out_10[i] = complex_sub(add02_2, add13_2);
		out_11[i] = complex_sub(add02_3, add13_3);
		out_12[i] = complex_add(sub02_0, sub13i_0);
		out_13[i] = complex_add(sub02_1, sub13i_1);
		out_14[i] = complex_add(sub02_2, sub13i_2);
		out_15[i] = complex_add(sub02_3, sub13i_3);
	}
}
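
The fully written-out function above is easier to follow in a compact form. The sketch below is meant to perform the same computation with small loops and a butterfly helper; it is an illustration of the data flow, not the author's code. The type cpx and the helpers cadd, csub, cmul and radix4 are hypothetical stand-ins for myComplex, complex_add, complex_sub and complex_mul, which are defined outside this excerpt.

```c
/* Hypothetical stand-ins for the article's myComplex and helper functions. */
typedef struct { float real, imag; } cpx;

static cpx cadd(cpx a, cpx b) { return (cpx){ a.real + b.real, a.imag + b.imag }; }
static cpx csub(cpx a, cpx b) { return (cpx){ a.real - b.real, a.imag - b.imag }; }
static cpx cmul(cpx a, cpx b)
{
	return (cpx){ a.real * b.real - a.imag * b.imag,
	              a.real * b.imag + a.imag * b.real };
}

/* One radix-4 butterfly with coefficients c, d, e applied to y, z, w. */
static void radix4(cpx x, cpx y, cpx z, cpx w, cpx c, cpx d, cpx e, cpx out[4])
{
	cpx cy = cmul(c, y), dz = cmul(d, z), ew = cmul(e, w);
	cpx add02 = cadd(x, dz),  sub02 = csub(x, dz);
	cpx add13 = cadd(cy, ew), sub13 = csub(cy, ew);
	cpx sub13i = { -sub13.imag, sub13.real };   /* i * sub13 */
	out[0] = cadd(add02, add13);
	out[1] = csub(sub02, sub13i);
	out[2] = csub(add02, add13);
	out[3] = cadd(sub02, sub13i);
}

void stage_radix4_2x_compact(int data_count, const cpx *in, cpx *out,
                             const cpx *ca, const cpx *da, const cpx *ea,
                             const cpx *cb, const cpx *db, const cpx *eb)
{
	int q = data_count / 16;
	for (int i = 0; i < q; ++i)
	{
		cpx t[4][4];

		/* First pass: four butterflies on consecutive quads of the input. */
		for (int k = 0; k < 4; ++k)
		{
			const cpx *p = &in[16 * i + 4 * k];
			radix4(p[0], p[1], p[2], p[3],
			       ca[4 * i + k], da[4 * i + k], ea[4 * i + k], t[k]);
		}

		/* Second pass: butterfly j combines output j of all four
		   first-pass butterflies; its output m goes to block 4*m + j. */
		for (int j = 0; j < 4; ++j)
		{
			cpx r[4];
			radix4(t[0][j], t[1][j], t[2][j], t[3][j],
			       cb[j * q + i], db[j * q + i], eb[j * q + i], r);
			for (int m = 0; m < 4; ++m)
				out[(4 * m + j) * q + i] = r[m];
		}
	}
}
```

With data_count = 16, all inputs equal to 1 and all coefficients equal to 1, the only non-zero output is out[0] = 16, which is a quick sanity check of the indexing.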
Main loop in assembly
.L1379:
        {
          fapb  ct=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=3, mrng=16, d=0, incr=3, ind=2, asz=1, abs=0, disp=16
        }
        {
          fapb  ct=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=64
          fapb  dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=3, mrng=0, d=0, incr=3, ind=4, asz=1, abs=4, disp=0
          fapb  dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=3, mrng=16, d=0, incr=3, ind=2, asz=1, abs=6, disp=0
          fapb  dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=3, ind=3, asz=1, abs=6, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=15, asz=2, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=3, mrng=8, d=1, incr=2, ind=0, asz=2, abs=8, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=13, asz=2, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=14, asz=2, abs=12, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=11, asz=2, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=12, asz=2, abs=16, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=9, asz=2, abs=20, disp=0
          fapb  dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=10, asz=2, abs=20, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=7, asz=2, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=8, asz=2, abs=24, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=5, asz=2, abs=28, disp=0
          fapb  dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=6, asz=2, abs=28, disp=0
        }
.L285:
        {
          loop_mode
          disp  %ctpr1, .L285
          movaw,0       area=0, ind=24, am=0, be=0, %g17
          movaw,1       area=0, ind=28, am=0, be=0, %g16
          movaw,2       area=0, ind=8, am=0, be=0, %g19
          movaw,3       area=0, ind=12, am=0, be=0, %g18
        }
        {
          loop_mode
          movaw,0       area=0, ind=16, am=0, be=0, %g21
          movaw,1       area=0, ind=20, am=0, be=0, %g20
          movaw,2       area=0, ind=0, am=1, be=0, %g23
          movaw,3       area=0, ind=4, am=0, be=0, %g22
        }
        {
          loop_mode
          movaw,0       area=0, ind=8, am=0, be=0, %g25
          movaw,1       area=0, ind=12, am=0, be=0, %g24
          movaw,2       area=1, ind=24, am=0, be=0, %g27
          movaw,3       area=1, ind=28, am=0, be=0, %g26
        }
        {
          loop_mode
          movaw,0       area=0, ind=0, am=1, be=0, %g29
          movaw,1       area=0, ind=4, am=0, be=0, %g28
          movaw,2       area=1, ind=16, am=0, be=0, %g31
          movaw,3       area=1, ind=20, am=0, be=0, %g30
        }
        {
          loop_mode
          movaw,0       area=1, ind=24, am=0, be=0, %r3
          movaw,1       area=1, ind=28, am=0, be=0, %r1
          movaw,2       area=1, ind=8, am=0, be=0, %r5
          movaw,3       area=1, ind=12, am=0, be=0, %r4
        }
        {
          loop_mode
          movaw,0       area=1, ind=16, am=0, be=0, %r9
          movaw,1       area=1, ind=20, am=0, be=0, %r7
          movaw,2       area=1, ind=0, am=1, be=0, %r42
          movaw,3       area=1, ind=4, am=0, be=0, %r41
        }
        {
          loop_mode
          movaw,0       area=1, ind=8, am=0, be=0, %r44
          movaw,1       area=1, ind=12, am=0, be=0, %r43
          movaw,2       area=2, ind=24, am=0, be=0, %r46
          movaw,3       area=2, ind=28, am=0, be=0, %r45
        }
        {
          loop_mode
          movaw,0       area=1, ind=0, am=1, be=0, %r48
          movaw,1       area=1, ind=4, am=0, be=0, %r47
          movaw,2       area=2, ind=16, am=0, be=0, %r50
          movaw,3       area=2, ind=20, am=0, be=0, %r49
        }
        {
          loop_mode
          movaw,0       area=2, ind=24, am=0, be=0, %r52
          movaw,1       area=2, ind=28, am=0, be=0, %r51
          movaw,2       area=2, ind=8, am=0, be=0, %r54
          movaw,3       area=2, ind=12, am=0, be=0, %r53
        }
        {
          loop_mode
          fmuls,0       %g23, %r3, %r57
          fmuls,1       %g22, %r1, %r58
          fmuls,2       %g23, %r1, %g23
          fmuls,3       %g22, %r3, %g22
          movaw,0       area=2, ind=16, am=0, be=0, %r56
          movaw,1       area=2, ind=20, am=0, be=0, %r55
          movaw,2       area=2, ind=0, am=1, be=0, %r3
          movaw,3       area=2, ind=4, am=0, be=0, %r1
        }
        {
          loop_mode
          movaw,0       area=2, ind=8, am=0, be=0, %r60
          movaw,1       area=2, ind=12, am=0, be=0, %r59
          movaw,2       area=3, ind=24, am=0, be=0, %r62
          movaw,3       area=3, ind=28, am=0, be=0, %r61
        }
        {
          loop_mode
          fmuls,0       %g19, %r46, %r63
          fmuls,1       %g18, %r45, %b[0]
          fmuls,2       %g19, %r45, %g19
          fmuls,3       %g18, %r46, %g18
          movaw,0       area=2, ind=0, am=1, be=0, %r46
          movaw,1       area=2, ind=4, am=0, be=0, %r45
          movaw,2       area=3, ind=16, am=0, be=0, %b[2]
          movaw,3       area=3, ind=20, am=0, be=0, %b[1]
        }
        {
          loop_mode
          movaw,0       area=3, ind=8, am=0, be=0, %b[4]
          movaw,1       area=3, ind=12, am=0, be=0, %b[3]
          movaw,2       area=3, ind=8, am=0, be=0, %b[6]
          movaw,3       area=3, ind=12, am=0, be=0, %b[5]
        }
        {
          loop_mode
          fmuls,0       %r52, %r54, %b[7]
          fmuls,1       %r51, %r53, %b[8]
          fmuls,2       %r52, %r53, %r52
          fmuls,3       %r51, %r54, %r51
          movaw,0       area=3, ind=0, am=1, be=0, %r54
          movaw,1       area=3, ind=4, am=0, be=0, %r53
          movaw,2       area=3, ind=0, am=1, be=0, %b[10]
          movaw,3       area=3, ind=4, am=0, be=0, %b[9]
        }
        {
          loop_mode
          fmuls,0       %r56, %r44, %b[11]
          fmuls,1       %r55, %r43, %b[12]
          fmuls,2       %r56, %r43, %r43
          fmuls,3       %r55, %r44, %r44
          movaw,0       area=4, ind=0, am=1, be=0, %r56
          movaw,1       area=4, ind=4, am=0, be=0, %r55
          movaw,2       area=4, ind=0, am=1, be=0, %b[14]
          movaw,3       area=4, ind=4, am=0, be=0, %b[13]
        }
        {
          loop_mode
          fmuls,0       %r60, %r5, %b[15]
          fmuls,1       %r59, %r4, %b[16]
          fmuls,2       %r60, %r4, %r4
          fmuls,3       %r59, %r5, %r5
          fmuls,4       %r61, %r49, %r59
          fmuls,5       %r61, %r50, %r60
          movaw,0       area=5, ind=0, am=1, be=0, %b[17]
          movaw,1       area=5, ind=4, am=0, be=0, %r61
          movaw,2       area=5, ind=0, am=1, be=0, %b[19]
          movaw,3       area=5, ind=4, am=0, be=0, %b[18]
        }
        {
          loop_mode
          fmuls,0       %b[2], %r9, %b[20]
          fmuls,1       %b[1], %r7, %b[21]
          fmuls,2       %b[2], %r7, %r7
          fmuls,3       %b[1], %r9, %r9
          fmuls,4       %r62, %r50, %r50
          fmuls,5       %r62, %r49, %r49
          movaw,0       area=6, ind=0, am=1, be=0, %b[1]
          movaw,1       area=6, ind=4, am=0, be=0, %r62
          movaw,2       area=6, ind=0, am=1, be=0, %b[22]
          movaw,3       area=6, ind=4, am=0, be=0, %b[2]
        }
        {
          loop_mode
          fmuls,0       %b[5], %g30, %b[23]
          fmuls,1       %b[5], %g31, %b[5]
          fmuls,2       %b[4], %g27, %b[24]
          fmuls,3       %b[3], %g26, %b[25]
          fmuls,4       %b[4], %g26, %g26
          fmuls,5       %b[3], %g27, %g27
          movaw,0       area=7, ind=0, am=1, be=0, %b[4]
          movaw,1       area=7, ind=4, am=0, be=0, %b[3]
          movaw,2       area=7, ind=0, am=1, be=0, %b[27]
          movaw,3       area=7, ind=4, am=0, be=0, %b[26]
        }
        {
          loop_mode
          fmuls,0       %b[6], %g31, %g31
          fmuls,1       %b[6], %g30, %g30
          fsubs,2       %r57, %r58, %r57
          fadds,3       %g23, %g22, %g22
          fsubs,4       %r63, %b[0], %g23
          fadds,5       %g19, %g18, %g18
          movaw,0       area=8, ind=0, am=1, be=0, %r58
          movaw,1       area=8, ind=4, am=0, be=0, %g19
          movaw,2       area=8, ind=0, am=1, be=0, %b[0]
          movaw,3       area=8, ind=4, am=0, be=0, %r63
        }
        {
          loop_mode
          fsubs,0       %b[7], %b[8], %b[6]
          fadds,1       %r52, %r51, %r51
          fsubs,2       %b[11], %b[12], %r52
          fmuls,3       %r45, %g24, %b[7]
          fmuls,4       %r46, %g24, %g24
          fmuls,5       %r45, %g25, %r45
          movaw,0       area=9, ind=0, am=1, be=0, %b[11]
          movaw,1       area=9, ind=4, am=0, be=0, %b[8]
          movaw,2       area=9, ind=0, am=1, be=0, %b[28]
          movaw,3       area=9, ind=4, am=0, be=0, %b[12]
        }
        {
          loop_mode
          fadds,0       %r43, %r44, %r43
          fsubs,1       %b[20], %b[21], %r44
          fsubs,2       %b[15], %b[16], %b[15]
          fsubs,3       %r50, %r59, %r50
          fadds,4       %r49, %r60, %r49
          fmuls,5       %r46, %g25, %g25
        }
        {
          loop_mode
          fadds,0       %r4, %r5, %r4
          fmuls,1       %r53, %g16, %g27
          fmuls,2       %r53, %g17, %r5
          fadds,3       %g26, %g27, %g26
          fmuls,4       %b[9], %g21, %r46
          fmuls,5       %r54, %g17, %g17
        }
        {
          loop_mode
          fadds,0       %r7, %r9, %r7
          fsubs,1       %g31, %b[23], %g31
          fadds,2       %g30, %b[5], %g30
          fmuls,3       %r54, %g16, %g16
          fmuls,4       %b[10], %g21, %g21
          fmuls,5       %b[9], %g20, %r9
        }
        {
          loop_mode
          fsubs,0       %b[24], %b[25], %r53
          fmuls,1       %b[10], %g20, %g20
          fadds,2       %r52, %r57, %r54
          fadds,3       %g24, %r45, %g24
        }
        {
          loop_mode
          fadds,0       %r48, %r44, %r45
          fsubs,1       %r48, %r44, %r44
          fsubs,2       %g22, %r43, %r48
          fsubs,3       %r1, %r49, %r59
          fadds,4       %r3, %r50, %r60
          fadds,5       %r1, %r49, %r1
        }
        {
          loop_mode
          fadds,0       %r43, %g22, %g22
          fsubs,1       %g18, %r51, %r43
          fadds,2       %b[6], %g23, %r49
          fadds,3       %r51, %g18, %g18
          fsubs,4       %b[6], %g23, %g23
          fsubs,5       %r52, %r57, %r51
        }
        {
          loop_mode
          fsubs,0       %r3, %r50, %r3
          fadds,1       %r47, %r7, %r50
          fsubs,2       %r47, %r7, %r7
          fsubs,3       %g25, %b[7], %g25
          fsubs,4       %g21, %r9, %g21
        }
        {
          loop_mode
          fadds,0       %r4, %g26, %r9
          fsubs,1       %g26, %r4, %g26
          fadds,2       %r42, %g31, %r4
          fsubs,3       %g17, %g27, %g17
          fadds,4       %g16, %r5, %g16
        }
        {
          loop_mode
          fadds,0       %r41, %g30, %g27
          fsubs,1       %r42, %g31, %g31
          fadds,2       %b[15], %r53, %r5
          fsubs,3       %r41, %g30, %g30
        }
        {
          loop_mode
          fsubs,0       %b[15], %r53, %r41
          fadds,1       %g20, %r46, %g20
          fsubs,2       %r44, %r48, %r42
          fsubs,3       %r59, %g23, %r46
          fadds,4       %r59, %g23, %g23
          fadds,5       %r1, %g18, %r47
        }
        {
          loop_mode
          fsubs,0       %r45, %r54, %r52
          fadds,1       %r44, %r48, %r44
          fadds,2       %r45, %r54, %r45
          fsubs,3       %r1, %g18, %g18
          fsubs,4       %g29, %g21, %r1
          fadds,5       %g29, %g21, %g21
        }
        {
          loop_mode
          fadds,0       %r50, %g22, %g29
          fsubs,1       %r50, %g22, %g22
          fadds,2       %r3, %r43, %r48
          fadds,3       %r60, %r49, %r50
          fsubs,4       %r60, %r49, %r49
          fadds,5       %g24, %g16, %r53
        }
        {
          loop_mode
          fsubs,0       %r3, %r43, %r3
          fadds,1       %g27, %r9, %r43
          fsubs,2       %r7, %r51, %r54
          fadds,3       %r7, %r51, %r7
          fsubs,4       %g25, %g17, %r51
          fsubs,5       %g16, %g24, %g16
        }
        {
          loop_mode
          fsubs,0       %g27, %r9, %g24
          fadds,1       %g31, %g26, %g27
          fadds,2       %r4, %r5, %r9
          fadds,3       %g25, %g17, %g17
          fmuls,4       %r62, %r46, %g25
          fmuls,5       %b[1], %r46, %r46
        }
        {
          loop_mode
          fadds,0       %g30, %r41, %r57
          fsubs,1       %g31, %g26, %g26
          fsubs,2       %g30, %r41, %g30
          fsubs,3       %r4, %r5, %g31
          fmuls,4       %b[8], %g23, %r4
          fmuls,5       %b[11], %g23, %g23
        }
        {
          loop_mode
          fadds,0       %g28, %g20, %r5
          fsubs,1       %g28, %g20, %g20
          fmuls,2       %b[22], %r42, %g28
          fmuls,3       %b[2], %r42, %r41
          fmuls,4       %b[18], %r47, %r42
          fmuls,5       %b[19], %r47, %r47
        }
        {
          loop_mode
          fmuls,0       %b[4], %r52, %r59
          fmuls,1       %b[3], %r52, %r52
          fmuls,2       %b[28], %r44, %r60
          fmuls,3       %b[12], %r44, %r44
          fmuls,4       %r56, %r45, %b[5]
          fmuls,5       %r55, %r45, %r45
        }
        {
          loop_mode
          fmuls,0       %r63, %g18, %b[6]
          fmuls,1       %b[0], %g18, %g18
          fmuls,2       %b[11], %r48, %b[7]
          fmuls,3       %b[8], %r48, %r48
          fmuls,4       %r55, %g29, %r55
          fmuls,5       %r56, %g29, %g29
        }
        {
          loop_mode
          fmuls,0       %b[3], %g22, %r56
          fmuls,1       %b[4], %g22, %g22
          fmuls,2       %b[1], %r3, %b[1]
          fmuls,3       %r62, %r3, %r3
          fmuls,4       %b[12], %r7, %r62
          fmuls,5       %b[28], %r7, %r7
        }
        {
          loop_mode
          fmuls,0       %b[19], %r50, %b[3]
          fmuls,1       %b[18], %r50, %r50
          fmuls,2       %b[0], %r49, %b[0]
          fmuls,3       %r63, %r49, %r49
          fmuls,4       %b[26], %g24, %r63
          fmuls,5       %b[27], %g24, %g24
        }
        {
          loop_mode
          fmuls,0       %g19, %r57, %b[4]
          fmuls,1       %r58, %r57, %r57
          fmuls,2       %b[2], %r54, %b[2]
          fmuls,3       %b[22], %r54, %r54
          fmuls,4       %b[13], %r43, %b[8]
          fmuls,5       %b[14], %r43, %r43
        }
        {
          loop_mode
          fmuls,0       %r61, %g30, %b[9]
          fmuls,1       %b[17], %g30, %g30
          fmuls,2       %r58, %g27, %r58
          fmuls,3       %g19, %g27, %g19
          fmuls,4       %b[14], %r9, %g27
          fmuls,5       %b[13], %r9, %r9
        }
        {
          loop_mode
          fmuls,0       %b[17], %g26, %b[10]
          fmuls,1       %r61, %g26, %g26
          fmuls,2       %b[27], %g31, %r61
          fmuls,3       %b[26], %g31, %g31
          fsubs,4       %g20, %r51, %b[11]
          fadds,5       %g20, %r51, %g20
        }
        {
          loop_mode
          fadds,0       %g21, %g17, %r51
          fadds,1       %r5, %r53, %b[12]
          fsubs,2       %r1, %g16, %b[13]
          fsubs,3       %g21, %g17, %g17
          fsubs,4       %r5, %r53, %g21
          fadds,5       %r1, %g16, %g16
        }
        {
          loop_mode
          fsubs,0       %b[5], %r55, %r1
          fadds,1       %g29, %r45, %g29
          fsubs,2       %r59, %r56, %r5
          fadds,3       %g22, %r52, %g22
          fsubs,4       %b[7], %r4, %r4
          fadds,5       %g23, %r48, %g23
        }
        {
          loop_mode
          fsubs,0       %b[3], %r42, %r42
          fadds,1       %r47, %r50, %r45
          fsubs,2       %b[1], %g25, %g25
          fadds,3       %r46, %r3, %r3
          fsubs,4       %b[0], %b[6], %r46
          fadds,5       %g18, %r49, %g18
        }
        {
          loop_mode
          fsubs,0       %r61, %r63, %r47
          fsubs,1       %r58, %b[4], %r48
          fsubs,2       %g28, %b[2], %g28
          fadds,3       %r54, %r41, %r41
          fsubs,4       %r60, %r62, %r49
          fadds,5       %r7, %r44, %r7
        }
        {
          loop_mode
          fsubs,0       %g27, %b[8], %g27
          fadds,1       %r43, %r9, %r9
          fadds,2       %g30, %g26, %g26
          fadds,3       %g24, %g31, %g24
          fadds,4       %r57, %g19, %g19
        }
        {
          loop_mode
          fsubs,0       %b[10], %b[9], %g30
          fsubs,1       %r51, %r1, %r43
          fadds,2       %r51, %r1, %g22
          fadds,3       %g21, %g22, %g31
          fsubs,4       %g21, %g22, %g21
        }
        {
          loop_mode
          fadds,0       %g17, %r5, %r1
          fsubs,1       %g17, %r5, %g17
          fadds,2       %b[12], %g29, %r5
        }
        {
          loop_mode
          fsubs,0       %b[12], %g29, %g29
          fadds,1       %b[13], %g28, %r44
          fsubs,2       %b[13], %g28, %g28
          fadds,3       %g20, %r7, %r50
          fsubs,4       %g16, %r49, %r51
          fsubs,5       %g20, %r7, %g20
        }
        {
          loop_mode
          fsubs,0       %r48, %r4, %r7
          fadds,1       %r47, %r46, %r49
          fadds,2       %r48, %r4, %r4
          fadds,3       %b[11], %r41, %r52
          fadds,4       %g16, %r49, %g16
          fsubs,5       %b[11], %r41, %r41
        }
        {
          loop_mode
          fsubs,0       %r47, %r46, %r46
          fadds,1       %r9, %r45, %r47
          fsubs,2       %g27, %r42, %r53
          fadds,3       %g19, %g23, %r48
          fsubs,4       %g18, %g24, %r54
          fsubs,5       %g23, %g19, %g19
        }
        {
          loop_mode
          fsubs,0       %r45, %r9, %g23
          fadds,1       %g27, %r42, %g27
          fadds,2       %g26, %r3, %r9
          fadds,3       %g24, %g18, %g18
          fsubs,4       %r3, %g26, %g24
        }
        {
          loop_mode
          fadds,0       %g30, %g25, %g26
          fsubs,1       %g30, %g25, %g25
        }
        {
          loop_mode
          fsubs,0       %r1, %r49, %g30
          fadds,1       %r1, %r49, %r1
        }
        {
          loop_mode
          fsubs,0       %g21, %r46, %r3
          fadds,1       %g21, %r46, %g21
          fsubs,2       %g20, %r7, %r42
          fadds,3       %r51, %g19, %r45
          fadds,4       %r50, %r48, %r46
          fsubs,5       %r51, %g19, %g19
        }
        {
          loop_mode
          fsubs,0       %r43, %g23, %r49
          fadds,1       %r43, %g23, %g23
          fadds,2       %g20, %r7, %g20
          fsubs,3       %r50, %r48, %r7
          fadds,4       %g16, %r4, %r43
          fsubs,5       %g17, %r54, %r48
        }
        {
          loop_mode
          fadds,0       %r5, %r47, %r50
          fsubs,1       %r5, %r47, %r5
          fadds,2       %g29, %r53, %r47
          fsubs,3       %g29, %r53, %g29
          fsubs,4       %g16, %r4, %g16
          fadds,5       %g17, %r54, %g17
        }
        {
          loop_mode
          fsubs,0       %g22, %g27, %r4
          fadds,1       %r52, %r9, %r51
          fadds,2       %g31, %g18, %r53
          fsubs,3       %g28, %g24, %r54
          fsubs,4       %r52, %r9, %r9
          fsubs,5       %g31, %g18, %g18
        }
        {
          loop_mode
          fadds,0       %g28, %g24, %g24
          fadds,1       %g22, %g27, %g22
          fsubs,2       %r44, %g26, %g26
          fadds,3       %r44, %g26, %g27
          fsubs,4       %r41, %g25, %g28
          fadds,5       %r41, %g25, %g25
        }
        {
          loop_mode
          stw,2 %r23, %r0, %g21
          stw,5 %r32, %r0, %g30
        }
        {
          loop_mode
          stw,2 %r6, %r0, %r3
          stw,5 %r39, %r0, %r1
        }
        {
          loop_mode
          stw,2 %r22, %r0, %g20
          stw,5 %r21, %r0, %r42
        }
        {
          loop_mode
          stw,2 %r27, %r0, %r45
          stw,5 %r38, %r0, %r49
        }
        {
          loop_mode
          stw,2 %r31, %r0, %g23
          stw,5 %r34, %r0, %g19
        }
        {
          loop_mode
          stw,2 %r17, %r0, %r7
          stw,5 %r13, %r0, %r46
        }
        {
          loop_mode
          stw,2 %r20, %r0, %r5
          stw,5 %r2, %r0, %r50
        }
        {
          loop_mode
          stw,2 %r16, %r0, %r47
          stw,5 %r29, %r0, %r48
        }
        {
          loop_mode
          stw,2 %r28, %r0, %g17
          stw,5 %r12, %r0, %g29
        }
        {
          loop_mode
          stw,2 %r26, %r0, %g16
          stw,5 %r30, %r0, %r43
        }
        {
          loop_mode
          stw,2 %r37, %r0, %r4
          stw,5 %r14, %r0, %r53
        }
        {
          loop_mode
          stw,2 %r18, %r0, %g18
          stw,5 %r35, %r0, %r54
        }
        {
          loop_mode
          stw,2 %r25, %r0, %g24
          stw,5 %r15, %r0, %r51
        }
        {
          loop_mode
          stw,2 %r19, %r0, %r9
          stw,5 %r40, %r0, %g22
        }
        {
          loop_mode
          stw,2 %r24, %r0, %g25
          stw,5 %r11, %r0, %g28
        }
        {
          loop_mode
          ct    %ctpr1 ? %NOT_LOOP_END
          alc   alcf=1, alct=1
          stw,2 %r33, %r0, %g26
          addd,3,sm     0x8, %r0, %r0
          stw,5 %r36, %r0, %g27
        }

Theoretical throughput: 16 complex numbers (16 × 8 = 128 bytes) in 77 cycles, i.e. 128/77 ≈ 1.66 bytes/cycle
Fourfold theoretical throughput: ≈ 6.65 bytes/cycle
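
The arithmetic behind these two figures can be reproduced with a tiny helper. This is only a sketch: the function name `bytes_per_cycle` is hypothetical, and it assumes each complex number is a pair of 4-byte floats, as stated in the problem setup.

```c
#include <stddef.h>

/* Bytes processed per cycle when `count` single-precision complex values
   (two floats each: real and imaginary part) take `cycles` cycles. */
static double bytes_per_cycle(size_t count, size_t cycles)
{
    const size_t bytes_per_complex = 2 * sizeof(float); /* re + im */
    return (double)(count * bytes_per_complex) / (double)cycles;
}
```

Calling `bytes_per_cycle(16, 77)` gives the first figure, and multiplying the result by 4 gives the second.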

Speed measurements

2. stage_radix4_2x_simd64

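Here the stage_radix4_simd64 algorithm is manually unrolled by a factor of 2: each loop iteration performs two consecutive radix-4 passes, feeding the results of the first directly into the second.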
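
For reference, a single radix-4 butterfly of the kind used here can be written in plain scalar C. This is a minimal sketch, not the author's code: the function name `radix4_butterfly` and the `out[4]` parameter are hypothetical, while the intermediate names (`add02`, `sub02`, `add13`, `sub13`, `sub13i`) mirror those in the SIMD listing.

```c
#include <complex.h>

/* One radix-4 butterfly: the 4-point DFT of (x, c*y, d*z, e*w).
   c, d, e are the twiddle factors; out[] receives the four results. */
static void radix4_butterfly(float complex x, float complex y,
                             float complex z, float complex w,
                             float complex c, float complex d,
                             float complex e, float complex out[4])
{
    float complex cy = c * y, dz = d * z, ew = e * w;
    float complex add02 = x + dz,  sub02 = x - dz;
    float complex add13 = cy + ew, sub13 = cy - ew;
    float complex sub13i = I * sub13;   /* rotate by +90 degrees */

    out[0] = add02 + add13;
    out[1] = sub02 - sub13i;
    out[2] = add02 - add13;
    out[3] = sub02 + sub13i;
}
```

With all twiddle factors equal to 1 this reduces to an ordinary 4-point DFT, which makes it convenient for checking a vectorized version against known values.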

C code
void stage_radix4_2x_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b)
{
	uint64_t *x0_in = (uint64_t*)&data_in[ 0];
	uint64_t *y0_in = (uint64_t*)&data_in[ 1];
	uint64_t *z0_in = (uint64_t*)&data_in[ 2];
	uint64_t *w0_in = (uint64_t*)&data_in[ 3];
	uint64_t *x1_in = (uint64_t*)&data_in[ 4];
	uint64_t *y1_in = (uint64_t*)&data_in[ 5];
	uint64_t *z1_in = (uint64_t*)&data_in[ 6];
	uint64_t *w1_in = (uint64_t*)&data_in[ 7];
	uint64_t *x2_in = (uint64_t*)&data_in[ 8];
	uint64_t *y2_in = (uint64_t*)&data_in[ 9];
	uint64_t *z2_in = (uint64_t*)&data_in[10];
	uint64_t *w2_in = (uint64_t*)&data_in[11];
	uint64_t *x3_in = (uint64_t*)&data_in[12];
	uint64_t *y3_in = (uint64_t*)&data_in[13];
	uint64_t *z3_in = (uint64_t*)&data_in[14];
	uint64_t *w3_in = (uint64_t*)&data_in[15];
	uint64_t *c0a_in = (uint64_t*)&coefC_a[0];
	uint64_t *c1a_in = (uint64_t*)&coefC_a[1];
	uint64_t *c2a_in = (uint64_t*)&coefC_a[2];
	uint64_t *c3a_in = (uint64_t*)&coefC_a[3];
	uint64_t *d0a_in = (uint64_t*)&coefD_a[0];
	uint64_t *d1a_in = (uint64_t*)&coefD_a[1];
	uint64_t *d2a_in = (uint64_t*)&coefD_a[2];
	uint64_t *d3a_in = (uint64_t*)&coefD_a[3];
	uint64_t *e0a_in = (uint64_t*)&coefE_a[0];
	uint64_t *e1a_in = (uint64_t*)&coefE_a[1];
	uint64_t *e2a_in = (uint64_t*)&coefE_a[2];
	uint64_t *e3a_in = (uint64_t*)&coefE_a[3];
	uint64_t *c0b_in = (uint64_t*)&coefC_b[0*data_count/16];
	uint64_t *c1b_in = (uint64_t*)&coefC_b[1*data_count/16];
	uint64_t *c2b_in = (uint64_t*)&coefC_b[2*data_count/16];
	uint64_t *c3b_in = (uint64_t*)&coefC_b[3*data_count/16];
	uint64_t *d0b_in = (uint64_t*)&coefD_b[0*data_count/16];
	uint64_t *d1b_in = (uint64_t*)&coefD_b[1*data_count/16];
	uint64_t *d2b_in = (uint64_t*)&coefD_b[2*data_count/16];
	uint64_t *d3b_in = (uint64_t*)&coefD_b[3*data_count/16];
	uint64_t *e0b_in = (uint64_t*)&coefE_b[0*data_count/16];
	uint64_t *e1b_in = (uint64_t*)&coefE_b[1*data_count/16];
	uint64_t *e2b_in = (uint64_t*)&coefE_b[2*data_count/16];
	uint64_t *e3b_in = (uint64_t*)&coefE_b[3*data_count/16];

	uint64_t *out_0  = (uint64_t*)&data_out[ 0*data_count/16];
	uint64_t *out_1  = (uint64_t*)&data_out[ 1*data_count/16];
	uint64_t *out_2  = (uint64_t*)&data_out[ 2*data_count/16];
	uint64_t *out_3  = (uint64_t*)&data_out[ 3*data_count/16];
	uint64_t *out_4  = (uint64_t*)&data_out[ 4*data_count/16];
	uint64_t *out_5  = (uint64_t*)&data_out[ 5*data_count/16];
	uint64_t *out_6  = (uint64_t*)&data_out[ 6*data_count/16];
	uint64_t *out_7  = (uint64_t*)&data_out[ 7*data_count/16];
	uint64_t *out_8  = (uint64_t*)&data_out[ 8*data_count/16];
	uint64_t *out_9  = (uint64_t*)&data_out[ 9*data_count/16];
	uint64_t *out_10 = (uint64_t*)&data_out[10*data_count/16];
	uint64_t *out_11 = (uint64_t*)&data_out[11*data_count/16];
	uint64_t *out_12 = (uint64_t*)&data_out[12*data_count/16];
	uint64_t *out_13 = (uint64_t*)&data_out[13*data_count/16];
	uint64_t *out_14 = (uint64_t*)&data_out[14*data_count/16];
	uint64_t *out_15 = (uint64_t*)&data_out[15*data_count/16];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/16; ++i)
	{
		uint64_t x0 = x0_in[16*i];
		uint64_t y0 = y0_in[16*i];
		uint64_t z0 = z0_in[16*i];
		uint64_t w0 = w0_in[16*i];
		uint64_t c0 = c0a_in[4*i];
		uint64_t d0 = d0a_in[4*i];
		uint64_t e0 = e0a_in[4*i];

		uint64_t x1 = x1_in[16*i];
		uint64_t y1 = y1_in[16*i];
		uint64_t z1 = z1_in[16*i];
		uint64_t w1 = w1_in[16*i];
		uint64_t c1 = c1a_in[4*i];
		uint64_t d1 = d1a_in[4*i];
		uint64_t e1 = e1a_in[4*i];

		uint64_t x2 = x2_in[16*i];
		uint64_t y2 = y2_in[16*i];
		uint64_t z2 = z2_in[16*i];
		uint64_t w2 = w2_in[16*i];
		uint64_t c2 = c2a_in[4*i];
		uint64_t d2 = d2a_in[4*i];
		uint64_t e2 = e2a_in[4*i];

		uint64_t x3 = x3_in[16*i];
		uint64_t y3 = y3_in[16*i];
		uint64_t z3 = z3_in[16*i];
		uint64_t w3 = w3_in[16*i];
		uint64_t c3 = c3a_in[4*i];
		uint64_t d3 = d3a_in[4*i];
		uint64_t e3 = e3a_in[4*i];

		/* Conjugate: flip the sign bit of the imaginary (upper) float. 1ULL avoids shifting into the sign bit of a signed type. */
		uint64_t conj_c0 = __builtin_e2k_pxord(c0, 1ULL<<63);
		uint64_t conj_c1 = __builtin_e2k_pxord(c1, 1ULL<<63);
		uint64_t conj_c2 = __builtin_e2k_pxord(c2, 1ULL<<63);
		uint64_t conj_c3 = __builtin_e2k_pxord(c3, 1ULL<<63);
		uint64_t conj_d0 = __builtin_e2k_pxord(d0, 1ULL<<63);
		uint64_t conj_d1 = __builtin_e2k_pxord(d1, 1ULL<<63);
		uint64_t conj_d2 = __builtin_e2k_pxord(d2, 1ULL<<63);
		uint64_t conj_d3 = __builtin_e2k_pxord(d3, 1ULL<<63);
		uint64_t conj_e0 = __builtin_e2k_pxord(e0, 1ULL<<63);
		uint64_t conj_e1 = __builtin_e2k_pxord(e1, 1ULL<<63);
		uint64_t conj_e2 = __builtin_e2k_pxord(e2, 1ULL<<63);
		uint64_t conj_e3 = __builtin_e2k_pxord(e3, 1ULL<<63);
		uint64_t swap_c0 = __builtin_e2k_pshufb(0, c0, 0x0302010007060504);
		uint64_t swap_c1 = __builtin_e2k_pshufb(0, c1, 0x0302010007060504);
		uint64_t swap_c2 = __builtin_e2k_pshufb(0, c2, 0x0302010007060504);
		uint64_t swap_c3 = __builtin_e2k_pshufb(0, c3, 0x0302010007060504);
		uint64_t swap_d0 = __builtin_e2k_pshufb(0, d0, 0x0302010007060504);
		uint64_t swap_d1 = __builtin_e2k_pshufb(0, d1, 0x0302010007060504);
		uint64_t swap_d2 = __builtin_e2k_pshufb(0, d2, 0x0302010007060504);
		uint64_t swap_d3 = __builtin_e2k_pshufb(0, d3, 0x0302010007060504);
		uint64_t swap_e0 = __builtin_e2k_pshufb(0, e0, 0x0302010007060504);
		uint64_t swap_e1 = __builtin_e2k_pshufb(0, e1, 0x0302010007060504);
		uint64_t swap_e2 = __builtin_e2k_pshufb(0, e2, 0x0302010007060504);
		uint64_t swap_e3 = __builtin_e2k_pshufb(0, e3, 0x0302010007060504);

		uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
		uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
		uint64_t cy2_real = __builtin_e2k_pfmuls(conj_c2, y2);
		uint64_t cy3_real = __builtin_e2k_pfmuls(conj_c3, y3);
		uint64_t dz0_real = __builtin_e2k_pfmuls(conj_d0, z0);
		uint64_t dz1_real = __builtin_e2k_pfmuls(conj_d1, z1);
		uint64_t dz2_real = __builtin_e2k_pfmuls(conj_d2, z2);
		uint64_t dz3_real = __builtin_e2k_pfmuls(conj_d3, z3);
		uint64_t ew0_real = __builtin_e2k_pfmuls(conj_e0, w0);
		uint64_t ew1_real = __builtin_e2k_pfmuls(conj_e1, w1);
		uint64_t ew2_real = __builtin_e2k_pfmuls(conj_e2, w2);
		uint64_t ew3_real = __builtin_e2k_pfmuls(conj_e3, w3);
		uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
		uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
		uint64_t cy2_imag = __builtin_e2k_pfmuls(swap_c2, y2);
		uint64_t cy3_imag = __builtin_e2k_pfmuls(swap_c3, y3);
		uint64_t dz0_imag = __builtin_e2k_pfmuls(swap_d0, z0);
		uint64_t dz1_imag = __builtin_e2k_pfmuls(swap_d1, z1);
		uint64_t dz2_imag = __builtin_e2k_pfmuls(swap_d2, z2);
		uint64_t dz3_imag = __builtin_e2k_pfmuls(swap_d3, z3);
		uint64_t ew0_imag = __builtin_e2k_pfmuls(swap_e0, w0);
		uint64_t ew1_imag = __builtin_e2k_pfmuls(swap_e1, w1);
		uint64_t ew2_imag = __builtin_e2k_pfmuls(swap_e2, w2);
		uint64_t ew3_imag = __builtin_e2k_pfmuls(swap_e3, w3);

		uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
		uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
		uint64_t cy2 = __builtin_e2k_pfhadds(cy2_real, cy2_imag);
		uint64_t cy3 = __builtin_e2k_pfhadds(cy3_real, cy3_imag);
		uint64_t dz0 = __builtin_e2k_pfhadds(dz0_real, dz0_imag);
		uint64_t dz1 = __builtin_e2k_pfhadds(dz1_real, dz1_imag);
		uint64_t dz2 = __builtin_e2k_pfhadds(dz2_real, dz2_imag);
		uint64_t dz3 = __builtin_e2k_pfhadds(dz3_real, dz3_imag);
		uint64_t ew0 = __builtin_e2k_pfhadds(ew0_real, ew0_imag);
		uint64_t ew1 = __builtin_e2k_pfhadds(ew1_real, ew1_imag);
		uint64_t ew2 = __builtin_e2k_pfhadds(ew2_real, ew2_imag);
		uint64_t ew3 = __builtin_e2k_pfhadds(ew3_real, ew3_imag);

		uint64_t add02_0 = __builtin_e2k_pfadds( x0, dz0);
		uint64_t add02_1 = __builtin_e2k_pfadds( x1, dz1);
		uint64_t add02_2 = __builtin_e2k_pfadds( x2, dz2);
		uint64_t add02_3 = __builtin_e2k_pfadds( x3, dz3);
		uint64_t sub02_0 = __builtin_e2k_pfsubs( x0, dz0);
		uint64_t sub02_1 = __builtin_e2k_pfsubs( x1, dz1);
		uint64_t sub02_2 = __builtin_e2k_pfsubs( x2, dz2);
		uint64_t sub02_3 = __builtin_e2k_pfsubs( x3, dz3);
		uint64_t add13_0 = __builtin_e2k_pfadds(cy0, ew0);
		uint64_t add13_1 = __builtin_e2k_pfadds(cy1, ew1);
		uint64_t add13_2 = __builtin_e2k_pfadds(cy2, ew2);
		uint64_t add13_3 = __builtin_e2k_pfadds(cy3, ew3);
		uint64_t sub13_0 = __builtin_e2k_pfsubs(cy0, ew0);
		uint64_t sub13_1 = __builtin_e2k_pfsubs(cy1, ew1);
		uint64_t sub13_2 = __builtin_e2k_pfsubs(cy2, ew2);
		uint64_t sub13_3 = __builtin_e2k_pfsubs(cy3, ew3);

		//uint64_t conj_sub13_0 = __builtin_e2k_pxord(sub13_0, 1LL<<63);
		//uint64_t conj_sub13_1 = __builtin_e2k_pxord(sub13_1, 1LL<<63);
		//uint64_t conj_sub13_2 = __builtin_e2k_pxord(sub13_2, 1LL<<63);
		//uint64_t conj_sub13_3 = __builtin_e2k_pxord(sub13_3, 1LL<<63);
		//uint64_t sub13i_0 = __builtin_e2k_pshufb(0, conj_sub13_0, 0x0302010007060504);
		//uint64_t sub13i_1 = __builtin_e2k_pshufb(0, conj_sub13_1, 0x0302010007060504);
		//uint64_t sub13i_2 = __builtin_e2k_pshufb(0, conj_sub13_2, 0x0302010007060504);
		//uint64_t sub13i_3 = __builtin_e2k_pshufb(0, conj_sub13_3, 0x0302010007060504);
		uint64_t swap_sub13_0 = __builtin_e2k_pshufb(0, sub13_0, 0x0302010007060504);
		uint64_t swap_sub13_1 = __builtin_e2k_pshufb(0, sub13_1, 0x0302010007060504);
		uint64_t swap_sub13_2 = __builtin_e2k_pshufb(0, sub13_2, 0x0302010007060504);
		uint64_t swap_sub13_3 = __builtin_e2k_pshufb(0, sub13_3, 0x0302010007060504);
		uint64_t sub13i_0 = __builtin_e2k_pxord(swap_sub13_0, 1LL<<31);
		uint64_t sub13i_1 = __builtin_e2k_pxord(swap_sub13_1, 1LL<<31);
		uint64_t sub13i_2 = __builtin_e2k_pxord(swap_sub13_2, 1LL<<31);
		uint64_t sub13i_3 = __builtin_e2k_pxord(swap_sub13_3, 1LL<<31);

		uint64_t out0  = __builtin_e2k_pfadds(add02_0, add13_0);
		uint64_t out1  = __builtin_e2k_pfadds(add02_1, add13_1);
		uint64_t out2  = __builtin_e2k_pfadds(add02_2, add13_2);
		uint64_t out3  = __builtin_e2k_pfadds(add02_3, add13_3);
		uint64_t out4  = __builtin_e2k_pfsubs(sub02_0, sub13i_0);
		uint64_t out5  = __builtin_e2k_pfsubs(sub02_1, sub13i_1);
		uint64_t out6  = __builtin_e2k_pfsubs(sub02_2, sub13i_2);
		uint64_t out7  = __builtin_e2k_pfsubs(sub02_3, sub13i_3);
		uint64_t out8  = __builtin_e2k_pfsubs(add02_0, add13_0);
		uint64_t out9  = __builtin_e2k_pfsubs(add02_1, add13_1);
		uint64_t out10 = __builtin_e2k_pfsubs(add02_2, add13_2);
		uint64_t out11 = __builtin_e2k_pfsubs(add02_3, add13_3);
		uint64_t out12 = __builtin_e2k_pfadds(sub02_0, sub13i_0);
		uint64_t out13 = __builtin_e2k_pfadds(sub02_1, sub13i_1);
		uint64_t out14 = __builtin_e2k_pfadds(sub02_2, sub13i_2);
		uint64_t out15 = __builtin_e2k_pfadds(sub02_3, sub13i_3);


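		/* Second radix-4 pass: first-pass outputs are regrouped and reused as inputs, with no intermediate store to data_out. */
		x0 = out0;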
		y0 = out1;
		z0 = out2;
		w0 = out3;
		c0 = c0b_in[i];
		d0 = d0b_in[i];
		e0 = e0b_in[i];

		x1 = out4;
		y1 = out5;
		z1 = out6;
		w1 = out7;
		c1 = c1b_in[i];
		d1 = d1b_in[i];
		e1 = e1b_in[i];

		x2 = out8;
		y2 = out9;
		z2 = out10;
		w2 = out11;
		c2 = c2b_in[i];
		d2 = d2b_in[i];
		e2 = e2b_in[i];

		x3 = out12;
		y3 = out13;
		z3 = out14;
		w3 = out15;
		c3 = c3b_in[i];
		d3 = d3b_in[i];
		e3 = e3b_in[i];

		/* Conjugate the second-pass twiddles (1ULL avoids shifting into the sign bit of a signed type). */
		conj_c0 = __builtin_e2k_pxord(c0, 1ULL<<63);
		conj_c1 = __builtin_e2k_pxord(c1, 1ULL<<63);
		conj_c2 = __builtin_e2k_pxord(c2, 1ULL<<63);
		conj_c3 = __builtin_e2k_pxord(c3, 1ULL<<63);
		conj_d0 = __builtin_e2k_pxord(d0, 1ULL<<63);
		conj_d1 = __builtin_e2k_pxord(d1, 1ULL<<63);
		conj_d2 = __builtin_e2k_pxord(d2, 1ULL<<63);
		conj_d3 = __builtin_e2k_pxord(d3, 1ULL<<63);
		conj_e0 = __builtin_e2k_pxord(e0, 1ULL<<63);
		conj_e1 = __builtin_e2k_pxord(e1, 1ULL<<63);
		conj_e2 = __builtin_e2k_pxord(e2, 1ULL<<63);
		conj_e3 = __builtin_e2k_pxord(e3, 1ULL<<63);
		swap_c0 = __builtin_e2k_pshufb(0, c0, 0x0302010007060504);
		swap_c1 = __builtin_e2k_pshufb(0, c1, 0x0302010007060504);
		swap_c2 = __builtin_e2k_pshufb(0, c2, 0x0302010007060504);
		swap_c3 = __builtin_e2k_pshufb(0, c3, 0x0302010007060504);
		swap_d0 = __builtin_e2k_pshufb(0, d0, 0x0302010007060504);
		swap_d1 = __builtin_e2k_pshufb(0, d1, 0x0302010007060504);
		swap_d2 = __builtin_e2k_pshufb(0, d2, 0x0302010007060504);
		swap_d3 = __builtin_e2k_pshufb(0, d3, 0x0302010007060504);
		swap_e0 = __builtin_e2k_pshufb(0, e0, 0x0302010007060504);
		swap_e1 = __builtin_e2k_pshufb(0, e1, 0x0302010007060504);
		swap_e2 = __builtin_e2k_pshufb(0, e2, 0x0302010007060504);
		swap_e3 = __builtin_e2k_pshufb(0, e3, 0x0302010007060504);

		cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
		cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
		cy2_real = __builtin_e2k_pfmuls(conj_c2, y2);
		cy3_real = __builtin_e2k_pfmuls(conj_c3, y3);
		dz0_real = __builtin_e2k_pfmuls(conj_d0, z0);
		dz1_real = __builtin_e2k_pfmuls(conj_d1, z1);
		dz2_real = __builtin_e2k_pfmuls(conj_d2, z2);
		dz3_real = __builtin_e2k_pfmuls(conj_d3, z3);
		ew0_real = __builtin_e2k_pfmuls(conj_e0, w0);
		ew1_real = __builtin_e2k_pfmuls(conj_e1, w1);
		ew2_real = __builtin_e2k_pfmuls(conj_e2, w2);
		ew3_real = __builtin_e2k_pfmuls(conj_e3, w3);
		cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
		cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
		cy2_imag = __builtin_e2k_pfmuls(swap_c2, y2);
		cy3_imag = __builtin_e2k_pfmuls(swap_c3, y3);
		dz0_imag = __builtin_e2k_pfmuls(swap_d0, z0);
		dz1_imag = __builtin_e2k_pfmuls(swap_d1, z1);
		dz2_imag = __builtin_e2k_pfmuls(swap_d2, z2);
		dz3_imag = __builtin_e2k_pfmuls(swap_d3, z3);
		ew0_imag = __builtin_e2k_pfmuls(swap_e0, w0);
		ew1_imag = __builtin_e2k_pfmuls(swap_e1, w1);
		ew2_imag = __builtin_e2k_pfmuls(swap_e2, w2);
		ew3_imag = __builtin_e2k_pfmuls(swap_e3, w3);

		cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
		cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
		cy2 = __builtin_e2k_pfhadds(cy2_real, cy2_imag);
		cy3 = __builtin_e2k_pfhadds(cy3_real, cy3_imag);
		dz0 = __builtin_e2k_pfhadds(dz0_real, dz0_imag);
		dz1 = __builtin_e2k_pfhadds(dz1_real, dz1_imag);
		dz2 = __builtin_e2k_pfhadds(dz2_real, dz2_imag);
		dz3 = __builtin_e2k_pfhadds(dz3_real, dz3_imag);
		ew0 = __builtin_e2k_pfhadds(ew0_real, ew0_imag);
		ew1 = __builtin_e2k_pfhadds(ew1_real, ew1_imag);
		ew2 = __builtin_e2k_pfhadds(ew2_real, ew2_imag);
		ew3 = __builtin_e2k_pfhadds(ew3_real, ew3_imag);

		add02_0 = __builtin_e2k_pfadds( x0, dz0);
		add02_1 = __builtin_e2k_pfadds( x1, dz1);
		add02_2 = __builtin_e2k_pfadds( x2, dz2);
		add02_3 = __builtin_e2k_pfadds( x3, dz3);
		sub02_0 = __builtin_e2k_pfsubs( x0, dz0);
		sub02_1 = __builtin_e2k_pfsubs( x1, dz1);
		sub02_2 = __builtin_e2k_pfsubs( x2, dz2);
		sub02_3 = __builtin_e2k_pfsubs( x3, dz3);
		add13_0 = __builtin_e2k_pfadds(cy0, ew0);
		add13_1 = __builtin_e2k_pfadds(cy1, ew1);
		add13_2 = __builtin_e2k_pfadds(cy2, ew2);
		add13_3 = __builtin_e2k_pfadds(cy3, ew3);
		sub13_0 = __builtin_e2k_pfsubs(cy0, ew0);
		sub13_1 = __builtin_e2k_pfsubs(cy1, ew1);
		sub13_2 = __builtin_e2k_pfsubs(cy2, ew2);
		sub13_3 = __builtin_e2k_pfsubs(cy3, ew3);

		//conj_sub13_0 = __builtin_e2k_pxord(sub13_0, 1LL<<63);
		//conj_sub13_1 = __builtin_e2k_pxord(sub13_1, 1LL<<63);
		//conj_sub13_2 = __builtin_e2k_pxord(sub13_2, 1LL<<63);
		//conj_sub13_3 = __builtin_e2k_pxord(sub13_3, 1LL<<63);
		//sub13i_0 = __builtin_e2k_pshufb(0, conj_sub13_0, 0x0302010007060504);
		//sub13i_1 = __builtin_e2k_pshufb(0, conj_sub13_1, 0x0302010007060504);
		//sub13i_2 = __builtin_e2k_pshufb(0, conj_sub13_2, 0x0302010007060504);
		//sub13i_3 = __builtin_e2k_pshufb(0, conj_sub13_3, 0x0302010007060504);
		swap_sub13_0 = __builtin_e2k_pshufb(0, sub13_0, 0x0302010007060504);
		swap_sub13_1 = __builtin_e2k_pshufb(0, sub13_1, 0x0302010007060504);
		swap_sub13_2 = __builtin_e2k_pshufb(0, sub13_2, 0x0302010007060504);
		swap_sub13_3 = __builtin_e2k_pshufb(0, sub13_3, 0x0302010007060504);
		sub13i_0 = __builtin_e2k_pxord(swap_sub13_0, 1LL<<31);
		sub13i_1 = __builtin_e2k_pxord(swap_sub13_1, 1LL<<31);
		sub13i_2 = __builtin_e2k_pxord(swap_sub13_2, 1LL<<31);
		sub13i_3 = __builtin_e2k_pxord(swap_sub13_3, 1LL<<31);

		out_0[i]  = __builtin_e2k_pfadds(add02_0, add13_0);
		out_1[i]  = __builtin_e2k_pfadds(add02_1, add13_1);
		out_2[i]  = __builtin_e2k_pfadds(add02_2, add13_2);
		out_3[i]  = __builtin_e2k_pfadds(add02_3, add13_3);
		out_4[i]  = __builtin_e2k_pfsubs(sub02_0, sub13i_0);
		out_5[i]  = __builtin_e2k_pfsubs(sub02_1, sub13i_1);
		out_6[i]  = __builtin_e2k_pfsubs(sub02_2, sub13i_2);
		out_7[i]  = __builtin_e2k_pfsubs(sub02_3, sub13i_3);
		out_8[i]  = __builtin_e2k_pfsubs(add02_0, add13_0);
		out_9[i]  = __builtin_e2k_pfsubs(add02_1, add13_1);
		out_10[i] = __builtin_e2k_pfsubs(add02_2, add13_2);
		out_11[i] = __builtin_e2k_pfsubs(add02_3, add13_3);
		out_12[i] = __builtin_e2k_pfadds(sub02_0, sub13i_0);
		out_13[i] = __builtin_e2k_pfadds(sub02_1, sub13i_1);
		out_14[i] = __builtin_e2k_pfadds(sub02_2, sub13i_2);
		out_15[i] = __builtin_e2k_pfadds(sub02_3, sub13i_3);
	}
}
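
The packed complex multiplication in the listing (conjugate, swap, two packed multiplies, one horizontal add) and the multiplication by i (swap plus a sign flip) can be checked on any machine with a portable model. The sketch below is an illustration under assumptions: all helper names (`pack`, `lo_f`, `hi_f`, `swap_halves`, `mul2`, `hadd2`, `cmul`, `mul_i`) are hypothetical stand-ins rather than real e2k builtins, and the semantics assumed for `pshufb` with mask 0x0302010007060504 and for `pfhadds` are inferred from how the listing uses them. Each `uint64_t` holds one complex number with the real part in bits 0..31 and the imaginary part in bits 32..63.

```c
#include <stdint.h>
#include <string.h>

/* Read the lower / upper 32 bits of v as a float. */
static float lo_f(uint64_t v) { uint32_t u = (uint32_t)v;         float f; memcpy(&f, &u, 4); return f; }
static float hi_f(uint64_t v) { uint32_t u = (uint32_t)(v >> 32); float f; memcpy(&f, &u, 4); return f; }

/* Build a packed pair: lo goes to bits 0..31, hi to bits 32..63. */
static uint64_t pack(float lo, float hi)
{
    uint32_t l, h;
    memcpy(&l, &lo, 4);
    memcpy(&h, &hi, 4);
    return ((uint64_t)h << 32) | l;
}

/* Model of pshufb(0, v, 0x0302010007060504): swap the two 32-bit halves. */
static uint64_t swap_halves(uint64_t v) { return (v << 32) | (v >> 32); }

/* Model of pfmuls: element-wise product of two float pairs. */
static uint64_t mul2(uint64_t a, uint64_t b)
{
    return pack(lo_f(a) * lo_f(b), hi_f(a) * hi_f(b));
}

/* Model of pfhadds (assumed): (a.lo + a.hi, b.lo + b.hi). */
static uint64_t hadd2(uint64_t a, uint64_t b)
{
    return pack(lo_f(a) + hi_f(a), lo_f(b) + hi_f(b));
}

/* c * y, following the same steps as the listing. */
static uint64_t cmul(uint64_t c, uint64_t y)
{
    uint64_t conj_c   = c ^ (1ULL << 63);   /* (c.re, -c.im)            */
    uint64_t swap_c   = swap_halves(c);     /* (c.im,  c.re)            */
    uint64_t re_terms = mul2(conj_c, y);    /* (c.re*y.re, -c.im*y.im)  */
    uint64_t im_terms = mul2(swap_c, y);    /* (c.im*y.re,  c.re*y.im)  */
    return hadd2(re_terms, im_terms);       /* (Re(c*y), Im(c*y))       */
}

/* i * v: swap the halves, then flip the sign of the new real part. */
static uint64_t mul_i(uint64_t v)
{
    return swap_halves(v) ^ (1ULL << 31);
}
```

For example, `cmul(pack(1.0f, 2.0f), pack(3.0f, 4.0f))` models (1+2i)(3+4i), and `mul_i(pack(3.0f, 4.0f))` models i(3+4i); the results can be unpacked with `lo_f` and `hi_f`.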
Main loop in assembly
.L3676:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=1, abs=0, disp=16
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=64
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=4, asz=1, abs=4, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=1, abs=6, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=3, asz=1, abs=6, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=2, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=2, abs=8, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=2, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=2, abs=12, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=2, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=2, abs=16, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=2, abs=20, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=2, abs=20, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=2, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=2, abs=24, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=2, abs=28, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=2, abs=28, disp=0
        }
.L2949:
        {
          loop_mode
          pfmul_hadds,0,sm      %b[111], %b[62], %b[104], %b[62]
          pfmuls,1,sm   %b[63], %b[57], %b[63]
          pfmuls,2,sm   %b[97], %b[82], %b[97]
          pshufb,3,sm   0x0, %b[29], %r26, %b[99]
        }
        {
          loop_mode
          pfsubs,1,sm   %b[68], %b[67], %b[103]
          pfmul_hadds,2,sm      %b[100], %b[52], %b[103], %b[100]
          pshufb,3,sm   0x0, %b[32], %r26, %b[104]
          xord,4,sm     %b[86], %r0, %b[105]
          xord,5,sm     %b[105], %r6, %b[106]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[80], %b[53], %b[98], %b[80]
          pfsubs,1,sm   %b[81], %b[79], %b[93]
          pfmul_hadds,2,sm      %b[93], %b[96], %b[89], %b[89]
          pshufb,3,sm   0x0, %b[49], %r26, %b[98]
          pfmuls,5,sm   %b[105], %b[102], %b[96]
        }
        {
          loop_mode
          pfsub_adds,0,sm       %b[13], %b[64], %b[106], %b[87]
          pfsubs,1,sm   %b[91], %b[87], %b[91]
          pfadds,2,sm   %b[91], %b[87], %b[105]
          xord,5,sm     %b[21], %r0, %b[108]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[99], %b[82], %b[97], %b[82]
          pfsubs,1,sm   %b[94], %b[84], %b[84]
          pfadds,2,sm   %b[94], %b[84], %b[86]
          pshufb,3,sm   0x0, %b[48], %r26, %b[94]
          pshufb,4,sm   0x0, %b[86], %r26, %b[97]
          xord,5,sm     %b[74], %r0, %b[99]
        }
        {
          loop_mode
          pfsub_rsubs,0,sm      %b[13], %b[64], %b[106], %b[64]
          pfsubs,1,sm   %b[69], %b[15], %b[106]
          pfmuls,3,sm   %b[99], %b[77], %b[99]
          pfmuls,4,sm   %b[108], %b[10], %b[1]
          xord,5,sm     %b[83], %r0, %b[108]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[104], %b[60], %b[90], %b[63]
          pfadds,1,sm   %b[81], %b[79], %b[79]
          pfmul_hadds,2,sm      %b[98], %b[57], %b[63], %b[60]
          pshufb,3,sm   0x0, %b[23], %r26, %b[90]
          pfmuls,4,sm   %b[108], %b[72], %b[81]
          pfmul_hadds,5,sm      %b[97], %b[102], %b[96], %b[13]
        }
        {
          loop_mode
          pfadd_adds,0,sm       %b[107], %b[71], %b[105], %b[75]
          pfsubs,1,sm   %b[85], %b[75], %b[85]
          pfadds,2,sm   %b[85], %b[75], %b[96]
          xord,5,sm     %b[78], %r0, %b[97]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[94], %b[61], %b[88], %b[93]
          xord,1,sm     %b[47], %r0, %b[61]
          pfadd_rsubs,2,sm      %b[101], %b[70], %b[86], %b[94]
          pshufb,3,sm   0x0, %b[93], %r26, %b[97]
          xord,4,sm     %b[30], %r0, %b[88]
          pfmuls,5,sm   %b[97], %b[92], %b[98]
        }
        {
          loop_mode
          pfadd_rsubs,0,sm      %b[107], %b[71], %b[105], %b[83]
          pfadds,1,sm   %b[69], %b[15], %b[91]
          pfadds,2,sm   %b[62], %b[73], %b[102]
          pshufb,3,sm   0x0, %b[91], %r26, %b[104]
          pshufb,4,sm   0x0, %b[83], %r26, %b[105]
          xord,5,sm     %b[109], %r0, %b[69]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[90], %b[12], %b[3], %b[62]
          pshufb,1,sm   0x0, %b[74], %r26, %b[90]
          pfadd_adds,2,sm       %b[101], %b[70], %b[86], %b[73]
          pshufb,3,sm   0x0, %b[84], %r26, %b[74]
          pfsubs,4,sm   %b[62], %b[73], %b[84]
          xord,5,sm     %b[76], %r0, %b[86]
        }
        {
          loop_mode
          pfadd_adds,0,sm       %b[41], %b[80], %b[79], %b[68]
          pfadds,1,sm   %b[68], %b[67], %b[78]
          pfadd_rsubs,2,sm      %b[87], %b[89], %b[96], %b[72]
          pshufb,3,sm   0x0, %b[106], %r26, %b[81]
          pshufb,4,sm   0x0, %b[78], %r26, %b[105]
          pfmul_hadds,5,sm      %b[105], %b[72], %b[81], %b[67]
        }
        {
          loop_mode
          pfadd_rsubs,0,sm      %b[41], %b[80], %b[79], %b[90]
          pfmul_hadds,1,sm      %b[90], %b[77], %b[99], %b[77]
          pfadd_adds,2,sm       %b[87], %b[89], %b[96], %b[96]
          pshufb,3,sm   0x0, %b[103], %r26, %b[98]
          xord,4,sm     %b[110], %r0, %b[92]
          pfmul_hadds,5,sm      %b[105], %b[92], %b[98], %b[79]
        }
        {
          loop_mode
          pfadd_adds,0,sm       %b[20], %b[100], %b[91], %b[85]
          pfmuls,1,sm   %b[86], %b[66], %b[86]
          pfadd_adds,2,sm       %b[64], %b[82], %b[102], %b[99]
          pshufb,3,sm   0x0, %b[85], %r26, %b[105]
          xord,4,sm     %b[28], %r0, %b[103]
          xord,5,sm     %b[104], %r6, %b[104]
        }
        {
          loop_mode
          pfadd_rsubs,0,sm      %b[20], %b[100], %b[91], %b[74]
          xord,1,sm     %b[95], %r0, %b[108]
          pfadd_rsubs,2,sm      %b[64], %b[82], %b[102], %b[84]
          pshufb,3,sm   0x0, %b[84], %r26, %b[106]
          xord,4,sm     %b[45], %r0, %b[91]
          xord,5,sm     %b[74], %r6, %b[102]
        }
        {
          loop_mode
          pfadd_rsubs,0,sm      %b[56], %b[60], %b[78], %b[97]
          pfmuls,1,sm   %b[108], %b[65], %b[111]
          pfsub_adds,2,sm       %b[107], %b[71], %b[104], %b[108]
          xord,3,sm     %b[46], %r0, %b[81]
          xord,4,sm     %b[97], %r6, %b[112]
          xord,5,sm     %b[81], %r6, %b[113]
        }
        {
          loop_mode
          pfadd_adds,0,sm       %b[56], %b[60], %b[78], %b[109]
          pshufb,1,sm   0x0, %b[110], %r26, %b[78]
          pfsub_rsubs,2,sm      %b[101], %b[70], %b[102], %b[110]
          pshufb,3,sm   0x0, %b[109], %r26, %b[98]
          xord,4,sm     %b[98], %r6, %b[114]
          addd,5,sm     0x8, %b[8], %b[6] ? %pcnt0
        }
        {
          loop_mode
          pfsub_rsubs,0,sm      %b[20], %b[100], %b[113], %b[71]
          pfsubs,1,sm   %b[93], %b[63], %b[75]
          pfsub_rsubs,2,sm      %b[107], %b[71], %b[104], %b[76]
          pshufb,3,sm   0x0, %b[76], %r26, %b[104]
          xord,4,sm     %b[105], %r6, %b[105]
          std,5 %r25, %b[8], %b[75]
        }
        {
          loop_mode
          pfsub_adds,0,sm       %b[20], %b[100], %b[113], %b[63]
          pfadds,1,sm   %b[93], %b[63], %b[93]
          pfsub_adds,2,sm       %b[101], %b[70], %b[102], %b[70]
          pshufb,3,sm   0x0, %b[95], %r26, %b[94]
          xord,4,sm     %b[106], %r6, %b[95]
          std,5 %r23, %b[8], %b[94]
        }
        {
          loop_mode
          pfsub_adds,0,sm       %b[56], %b[60], %b[114], %b[83]
          pfmuls,1,sm   %b[91], %b[68], %b[100]
          pfsub_rsubs,2,sm      %b[87], %b[89], %b[105], %b[101]
          pshufb,3,sm   0x0, %b[36], %r26, %b[91]
          xord,4,sm     %b[44], %r0, %b[102]
          std,5 %r18, %b[8], %b[83]
        }
        {
          loop_mode
          pfsub_rsubs,0,sm      %b[56], %b[60], %b[114], %b[60]
          pfmuls,1,sm   %b[103], %b[90], %b[103]
          pfsub_adds,2,sm       %b[64], %b[82], %b[95], %b[73]
          pshufb,3,sm   0x0, %b[45], %r26, %b[106]
          xord,4,sm     %b[24], %r0, %b[107]
          std,5 %r2, %b[8], %b[73]
        }
        {
          loop_mode
          pfsub_adds,0,sm       %b[87], %b[89], %b[105], %b[64]
          pfmuls,1,sm   %b[102], %b[85], %b[82]
          pfsub_rsubs,2,sm      %b[64], %b[82], %b[95], %b[72]
          pshufb,3,sm   0x0, %b[44], %r26, %b[87]
          xord,4,sm     %b[33], %r0, %b[89]
          std,5 %r12, %b[8], %b[72]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[94], %b[65], %b[111], %b[65]
          pfmuls,1,sm   %b[107], %b[74], %b[95]
          pfsub_adds,2,sm       %b[41], %b[80], %b[112], %b[94]
          pshufb,3,sm   0x0, %b[24], %r26, %b[96]
          xord,4,sm     %b[40], %r0, %b[102]
          std,5 %r16, %b[8], %b[96]
          movad,0       area=9, ind=0, am=1, be=0, %b[15]
          movad,1       area=8, ind=0, am=1, be=0, %b[12]
          movad,2       area=9, ind=0, am=1, be=0, %b[3]
          movad,3       area=8, ind=0, am=1, be=0, %b[20]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[104], %b[66], %b[86], %b[66]
          pfmuls,1,sm   %b[89], %b[97], %b[86]
          pshufb,3,sm   0x0, %b[33], %r26, %b[89]
          xord,4,sm     %b[36], %r0, %b[99]
          std,5 %r22, %b[8], %b[99]
          movad,0       area=7, ind=0, am=1, be=0, %b[24]
          movad,1       area=6, ind=0, am=1, be=0, %b[32]
          movad,2       area=7, ind=0, am=1, be=0, %b[29]
          movad,3       area=6, ind=0, am=1, be=0, %b[23]
        }
        {
          loop_mode
          pfsub_rsubs,0,sm      %b[41], %b[80], %b[112], %b[80]
          pfmuls,1,sm   %b[102], %b[109], %b[84]
          pshufb,3,sm   0x0, %b[40], %r26, %b[102]
          xord,4,sm     %b[19], %r0, %b[104]
          std,5 %r19, %b[8], %b[84]
          movad,0       area=5, ind=0, am=1, be=0, %b[33]
          movad,1       area=4, ind=0, am=1, be=0, %b[41]
          movad,2       area=5, ind=0, am=1, be=0, %b[40]
          movad,3       area=4, ind=0, am=1, be=0, %b[36]
        }
        {
          loop_mode
          pfadd_adds,0,sm       %b[11], %b[62], %b[93], %b[99]
          pfmuls,1,sm   %b[99], %b[71], %b[111]
          pfadd_rsubs,2,sm      %b[11], %b[62], %b[93], %b[105]
          pshufb,3,sm   0x0, %b[28], %r26, %b[112]
          xord,4,sm     %b[16], %r0, %b[113]
          std,5 %r14, %b[8], %b[108]
          movad,0       area=3, ind=8, am=1, be=0, %b[93]
          movad,1       area=3, ind=0, am=0, be=0, %b[28]
          movad,2       area=3, ind=16, am=0, be=0, %b[108]
          movad,3       area=3, ind=24, am=0, be=0, %b[107]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[87], %b[85], %b[82], %b[82]
          pfmuls,1,sm   %b[104], %b[63], %b[87]
          pfmul_hadds,2,sm      %b[96], %b[74], %b[95], %b[85]
          pshufb,3,sm   0x0, %b[19], %r26, %b[95]
          xord,4,sm     %b[37], %r0, %b[96]
          std,5 %r24, %b[8], %b[110]
          movad,0       area=2, ind=0, am=0, be=0, %b[44]
          movad,1       area=2, ind=8, am=0, be=0, %b[74]
          movad,2       area=3, ind=8, am=1, be=0, %b[45]
          movad,3       area=3, ind=0, am=0, be=0, %b[19]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[89], %b[97], %b[86], %b[89]
          pfmuls,1,sm   %b[81], %b[59], %b[86]
          pfmuls,2,sm   %b[113], %b[83], %b[97]
          pshufb,3,sm   0x0, %b[16], %r26, %b[104]
          std,5 %r15, %b[8], %b[76]
          movad,0       area=2, ind=16, am=0, be=0, %b[76]
          movad,1       area=2, ind=24, am=1, be=0, %b[81]
          movad,2       area=2, ind=0, am=0, be=0, %b[16]
          movad,3       area=2, ind=16, am=0, be=0, %b[48]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[102], %b[109], %b[84], %b[92]
          pfmuls,1,sm   %b[92], %b[51], %b[96]
          pfmuls,2,sm   %b[96], %b[60], %b[102]
          pshufb,3,sm   0x0, %b[37], %r26, %b[109]
          xord,4,sm     %b[7], %r0, %b[110]
          std,5 %r17, %b[8], %b[70]
          movad,0       area=1, ind=0, am=0, be=0, %b[37]
          movad,1       area=1, ind=16, am=0, be=0, %b[49]
          movad,2       area=2, ind=8, am=0, be=0, %b[70]
          movad,3       area=0, ind=8, am=0, be=0, %b[84]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[106], %b[68], %b[100], %b[68]
          pfmuls,1,sm   %b[69], %b[50], %b[101]
          pfmul_hadds,2,sm      %b[112], %b[90], %b[103], %b[69]
          pshufb,3,sm   0x0, %b[75], %r26, %b[103]
          std,5 %r20, %b[8], %b[101]
          movad,0       area=1, ind=8, am=1, be=0, %b[90]
          movad,1       area=1, ind=24, am=0, be=0, %b[75]
          movad,2       area=2, ind=24, am=1, be=0, %b[100]
          movad,3       area=1, ind=0, am=0, be=0, %b[52]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[91], %b[71], %b[111], %b[71]
          pfmuls,1,sm   %b[110], %b[94], %b[87]
          pfmul_hadds,2,sm      %b[95], %b[63], %b[87], %b[73]
          pshufb,3,sm   0x0, %b[7], %r26, %b[91]
          xord,4,sm     %b[27], %r0, %b[95]
          std,5 %r11, %b[8], %b[73]
          movad,0       area=0, ind=0, am=0, be=0, %b[7]
          movad,1       area=0, ind=24, am=0, be=0, %b[56]
          movad,2       area=1, ind=16, am=0, be=0, %b[53]
          movad,3       area=1, ind=24, am=0, be=0, %b[63]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          pfmul_hadds,0,sm      %b[104], %b[83], %b[97], %b[83]
          pfmuls,1,sm   %b[88], %b[58], %b[88]
          std,2 %r13, %b[8], %b[64]
          std,5 %r21, %b[8], %b[72]
          movad,0       area=0, ind=8, am=1, be=0, %b[57]
          movad,1       area=0, ind=16, am=0, be=0, %b[8]
          movad,2       area=1, ind=8, am=1, be=0, %b[64]
          movad,3       area=0, ind=0, am=1, be=0, %b[72]
        }

Theoretical throughput: 16 complex numbers in 32 cycles (16/32 of a complex number per cycle, 8 bytes each) = 4 bytes/cycle
Quadrupled theoretical throughput: 16 bytes/cycle
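
The arithmetic behind these two figures can be reproduced with a small helper. This is only a sketch: the function name `bytes_per_cycle` is hypothetical, and it assumes one complex value is two `float`s (8 bytes), as stated in the problem setup.

```c
#include <stddef.h>

/* Bytes of complex data processed per clock cycle.
   Assumption: one complex value is two floats (8 bytes in total). */
static double bytes_per_cycle(size_t complex_count, size_t cycles)
{
    const size_t bytes_per_complex = 2 * sizeof(float);
    return (double)(complex_count * bytes_per_complex) / (double)cycles;
}
```

`bytes_per_cycle(16, 32)` gives 4.0, and multiplying by the factor of 4 from the text gives the 16 bytes/cycle figure.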

Speed measurements

3. stage_radix4_2x_simd128

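Here the stage_radix4_simd128 algorithm is manually unrolled by a factor of 2. The loop body performs two radix-4 passes back to back: the results of the first pass (coefficients coefC_a, coefD_a, coefE_a) stay in registers and feed the second pass (coefficients coefC_b, coefD_b, coefE_b).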
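
For reference, a single radix-4 butterfly can be written in scalar form as follows. This is a minimal sketch, not the author's code: the function name `radix4_butterfly` is hypothetical and it uses standard `float complex` instead of the e2k intrinsics, but the sums, differences and output order follow the `add02`, `sub02`, `add13`, `sub13i` naming of the vector listing.

```c
#include <complex.h>

/* One radix-4 butterfly in scalar form.
   x, y, z, w are the four inputs; c, d, e are the twiddle factors
   applied to y, z and w. out[0..3] receive the four results. */
static void radix4_butterfly(float complex x, float complex y,
                             float complex z, float complex w,
                             float complex c, float complex d, float complex e,
                             float complex out[4])
{
    float complex cy = c * y;
    float complex dz = d * z;
    float complex ew = e * w;

    float complex add02  = x + dz;
    float complex sub02  = x - dz;
    float complex add13  = cy + ew;
    float complex sub13i = I * (cy - ew);   /* difference multiplied by i */

    out[0] = add02 + add13;
    out[1] = sub02 - sub13i;
    out[2] = add02 - add13;
    out[3] = sub02 + sub13i;
}
```

With all twiddle factors equal to 1 this reduces to a plain 4-point DFT of the sequence x, y, z, w, which makes it a convenient reference when checking the vectorised version.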

C code
void stage_radix4_2x_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b)
{
	__v2di *xy0_in = (__v2di*)&data_in[ 0];
	__v2di *zw0_in = (__v2di*)&data_in[ 2];
	__v2di *xy1_in = (__v2di*)&data_in[ 4];
	__v2di *zw1_in = (__v2di*)&data_in[ 6];
	__v2di *xy2_in = (__v2di*)&data_in[ 8];
	__v2di *zw2_in = (__v2di*)&data_in[10];
	__v2di *xy3_in = (__v2di*)&data_in[12];
	__v2di *zw3_in = (__v2di*)&data_in[14];
	__v2di *xy4_in = (__v2di*)&data_in[16];
	__v2di *zw4_in = (__v2di*)&data_in[18];
	__v2di *xy5_in = (__v2di*)&data_in[20];
	__v2di *zw5_in = (__v2di*)&data_in[22];
	__v2di *xy6_in = (__v2di*)&data_in[24];
	__v2di *zw6_in = (__v2di*)&data_in[26];
	__v2di *xy7_in = (__v2di*)&data_in[28];
	__v2di *zw7_in = (__v2di*)&data_in[30];
	__v2di *c0a_in = (__v2di*)&coefC_a[0];
	__v2di *c1a_in = (__v2di*)&coefC_a[2];
	__v2di *c2a_in = (__v2di*)&coefC_a[4];
	__v2di *c3a_in = (__v2di*)&coefC_a[6];
	__v2di *d0a_in = (__v2di*)&coefD_a[0];
	__v2di *d1a_in = (__v2di*)&coefD_a[2];
	__v2di *d2a_in = (__v2di*)&coefD_a[4];
	__v2di *d3a_in = (__v2di*)&coefD_a[6];
	__v2di *e0a_in = (__v2di*)&coefE_a[0];
	__v2di *e1a_in = (__v2di*)&coefE_a[2];
	__v2di *e2a_in = (__v2di*)&coefE_a[4];
	__v2di *e3a_in = (__v2di*)&coefE_a[6];
	__v2di *c0b_in = (__v2di*)&coefC_b[0*data_count/16];
	__v2di *c1b_in = (__v2di*)&coefC_b[1*data_count/16];
	__v2di *c2b_in = (__v2di*)&coefC_b[2*data_count/16];
	__v2di *c3b_in = (__v2di*)&coefC_b[3*data_count/16];
	__v2di *d0b_in = (__v2di*)&coefD_b[0*data_count/16];
	__v2di *d1b_in = (__v2di*)&coefD_b[1*data_count/16];
	__v2di *d2b_in = (__v2di*)&coefD_b[2*data_count/16];
	__v2di *d3b_in = (__v2di*)&coefD_b[3*data_count/16];
	__v2di *e0b_in = (__v2di*)&coefE_b[0*data_count/16];
	__v2di *e1b_in = (__v2di*)&coefE_b[1*data_count/16];
	__v2di *e2b_in = (__v2di*)&coefE_b[2*data_count/16];
	__v2di *e3b_in = (__v2di*)&coefE_b[3*data_count/16];

	__v2di *out_0  = (__v2di*)&data_out[ 0*data_count/16];
	__v2di *out_1  = (__v2di*)&data_out[ 1*data_count/16];
	__v2di *out_2  = (__v2di*)&data_out[ 2*data_count/16];
	__v2di *out_3  = (__v2di*)&data_out[ 3*data_count/16];
	__v2di *out_4  = (__v2di*)&data_out[ 4*data_count/16];
	__v2di *out_5  = (__v2di*)&data_out[ 5*data_count/16];
	__v2di *out_6  = (__v2di*)&data_out[ 6*data_count/16];
	__v2di *out_7  = (__v2di*)&data_out[ 7*data_count/16];
	__v2di *out_8  = (__v2di*)&data_out[ 8*data_count/16];
	__v2di *out_9  = (__v2di*)&data_out[ 9*data_count/16];
	__v2di *out_10 = (__v2di*)&data_out[10*data_count/16];
	__v2di *out_11 = (__v2di*)&data_out[11*data_count/16];
	__v2di *out_12 = (__v2di*)&data_out[12*data_count/16];
	__v2di *out_13 = (__v2di*)&data_out[13*data_count/16];
	__v2di *out_14 = (__v2di*)&data_out[14*data_count/16];
	__v2di *out_15 = (__v2di*)&data_out[15*data_count/16];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/32; ++i)
	{
		// Load 32 input samples and the first-pass twiddle factors (coefC_a, coefD_a, coefE_a)
		__v2di xy0 = xy0_in[16*i];
		__v2di zw0 = zw0_in[16*i];
		__v2di xy1 = xy1_in[16*i];
		__v2di zw1 = zw1_in[16*i];
		__v2di c0  = c0a_in[4*i];
		__v2di d0  = d0a_in[4*i];
		__v2di e0  = e0a_in[4*i];

		__v2di xy2 = xy2_in[16*i];
		__v2di zw2 = zw2_in[16*i];
		__v2di xy3 = xy3_in[16*i];
		__v2di zw3 = zw3_in[16*i];
		__v2di c1  = c1a_in[4*i];
		__v2di d1  = d1a_in[4*i];
		__v2di e1  = e1a_in[4*i];

		__v2di xy4 = xy4_in[16*i];
		__v2di zw4 = zw4_in[16*i];
		__v2di xy5 = xy5_in[16*i];
		__v2di zw5 = zw5_in[16*i];
		__v2di c2  = c2a_in[4*i];
		__v2di d2  = d2a_in[4*i];
		__v2di e2  = e2a_in[4*i];

		__v2di xy6 = xy6_in[16*i];
		__v2di zw6 = zw6_in[16*i];
		__v2di xy7 = xy7_in[16*i];
		__v2di zw7 = zw7_in[16*i];
		__v2di c3  = c3a_in[4*i];
		__v2di d3  = d3a_in[4*i];
		__v2di e3  = e3a_in[4*i];

		// Regroup the loaded vectors into the butterfly inputs x, y, z, w
		__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		// conj_*: flip the sign bit of the imaginary part (upper float of each 64-bit half)
		__v2di conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_c2 = __builtin_e2k_qpxor(c2, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_c3 = __builtin_e2k_qpxor(c3, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_d0 = __builtin_e2k_qpxor(d0, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_d1 = __builtin_e2k_qpxor(d1, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_d2 = __builtin_e2k_qpxor(d2, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_d3 = __builtin_e2k_qpxor(d3, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_e0 = __builtin_e2k_qpxor(e0, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_e1 = __builtin_e2k_qpxor(e1, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_e2 = __builtin_e2k_qpxor(e2, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_e3 = __builtin_e2k_qpxor(e3, (__v2di){1LL<<63, 1LL<<63});
		// swap_*: exchange the real and imaginary parts of each complex number
		__v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

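		// Element-wise products: conj_* times v gives (re*re, -im*im), swap_* times v gives (im*re, re*im)
		__v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);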
		__v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
		__v2di cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2);
		__v2di cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3);
		__v2di dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0);
		__v2di dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1);
		__v2di dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2);
		__v2di dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3);
		__v2di ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0);
		__v2di ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1);
		__v2di ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2);
		__v2di ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3);
		__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
		__v2di cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
		__v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
		__v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
		__v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
		__v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
		__v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
		__v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
		__v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
		__v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
		__v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);

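		// Horizontal adds finish the complex products; per the _rrii names the result is ordered re0, re1, im0, im1
		__v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);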
		__v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
		__v2di cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag);
		__v2di cy3_rrii = __builtin_e2k_qpfhadds(cy3_real, cy3_imag);
		__v2di dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag);
		__v2di dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag);
		__v2di dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag);
		__v2di dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag);
		__v2di ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag);
		__v2di ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag);
		__v2di ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag);
		__v2di ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag);

		// Reorder re0, re1, im0, im1 back to the interleaved layout re0, im0, re1, im1
		__v2di cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

		// First butterfly level: sums and differences of (x, d*z) and of (c*y, e*w)
		__v2di add02_0 = __builtin_e2k_qpfadds( x0, dz0);
		__v2di add02_1 = __builtin_e2k_qpfadds( x1, dz1);
		__v2di add02_2 = __builtin_e2k_qpfadds( x2, dz2);
		__v2di add02_3 = __builtin_e2k_qpfadds( x3, dz3);
		__v2di sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
		__v2di sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
		__v2di sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
		__v2di sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
		__v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
		__v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
		__v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
		__v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
		__v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
		__v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
		__v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
		__v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);

		// Multiply (c*y - e*w) by i: swap re/im, then flip the sign of the new real part
		__v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});

		// Results of the first radix-4 pass (kept in registers, not stored)
		__v2di out0  = __builtin_e2k_qpfadds(add02_0, add13_0);
		__v2di out1  = __builtin_e2k_qpfadds(add02_1, add13_1);
		__v2di out2  = __builtin_e2k_qpfadds(add02_2, add13_2);
		__v2di out3  = __builtin_e2k_qpfadds(add02_3, add13_3);
		__v2di out4  = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
		__v2di out5  = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
		__v2di out6  = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
		__v2di out7  = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
		__v2di out8  = __builtin_e2k_qpfsubs(add02_0, add13_0);
		__v2di out9  = __builtin_e2k_qpfsubs(add02_1, add13_1);
		__v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2);
		__v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3);
		__v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
		__v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
		__v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
		__v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3);


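		// Second pass: the 16 results of the first pass become the inputs,
		// and the twiddle factors now come from the coefC_b, coefD_b, coefE_b tables
		xy0 = out0;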
		zw0 = out1;
		xy1 = out2;
		zw1 = out3;
		c0  = c0b_in[i];
		d0  = d0b_in[i];
		e0  = e0b_in[i];

		xy2 = out4;
		zw2 = out5;
		xy3 = out6;
		zw3 = out7;
		c1  = c1b_in[i];
		d1  = d1b_in[i];
		e1  = e1b_in[i];

		xy4 = out8;
		zw4 = out9;
		xy5 = out10;
		zw5 = out11;
		c2  = c2b_in[i];
		d2  = d2b_in[i];
		e2  = e2b_in[i];

		xy6 = out12;
		zw6 = out13;
		xy7 = out14;
		zw7 = out15;
		c3  = c3b_in[i];
		d3  = d3b_in[i];
		e3  = e3b_in[i];

		x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
		w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
		w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
		y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
		w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
		y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
		w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63});
		conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63});
		conj_c2 = __builtin_e2k_qpxor(c2, (__v2di){1LL<<63, 1LL<<63});
		conj_c3 = __builtin_e2k_qpxor(c3, (__v2di){1LL<<63, 1LL<<63});
		conj_d0 = __builtin_e2k_qpxor(d0, (__v2di){1LL<<63, 1LL<<63});
		conj_d1 = __builtin_e2k_qpxor(d1, (__v2di){1LL<<63, 1LL<<63});
		conj_d2 = __builtin_e2k_qpxor(d2, (__v2di){1LL<<63, 1LL<<63});
		conj_d3 = __builtin_e2k_qpxor(d3, (__v2di){1LL<<63, 1LL<<63});
		conj_e0 = __builtin_e2k_qpxor(e0, (__v2di){1LL<<63, 1LL<<63});
		conj_e1 = __builtin_e2k_qpxor(e1, (__v2di){1LL<<63, 1LL<<63});
		conj_e2 = __builtin_e2k_qpxor(e2, (__v2di){1LL<<63, 1LL<<63});
		conj_e3 = __builtin_e2k_qpxor(e3, (__v2di){1LL<<63, 1LL<<63});
		swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

		cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
		cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
		cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2);
		cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3);
		dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0);
		dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1);
		dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2);
		dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3);
		ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0);
		ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1);
		ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2);
		ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3);
		cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
		cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
		cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
		dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
		dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
		dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
		dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
		ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
		ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
		ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
		ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);

		cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
		cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
		cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag);
		cy3_rrii = __builtin_e2k_qpfhadds(cy3_real, cy3_imag);
		dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag);
		dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag);
		dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag);
		dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag);
		ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag);
		ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag);
		ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag);
		ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag);

		cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

		add02_0 = __builtin_e2k_qpfadds( x0, dz0);
		add02_1 = __builtin_e2k_qpfadds( x1, dz1);
		add02_2 = __builtin_e2k_qpfadds( x2, dz2);
		add02_3 = __builtin_e2k_qpfadds( x3, dz3);
		sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
		sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
		sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
		sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
		add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
		add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
		add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
		add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
		sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
		sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
		sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
		sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);

		swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
		sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
		sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
		sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});

		// Results of the second pass, written to 16 separate output streams
		out_0[i]  = __builtin_e2k_qpfadds(add02_0, add13_0);
		out_1[i]  = __builtin_e2k_qpfadds(add02_1, add13_1);
		out_2[i]  = __builtin_e2k_qpfadds(add02_2, add13_2);
		out_3[i]  = __builtin_e2k_qpfadds(add02_3, add13_3);
		out_4[i]  = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
		out_5[i]  = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
		out_6[i]  = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
		out_7[i]  = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
		out_8[i]  = __builtin_e2k_qpfsubs(add02_0, add13_0);
		out_9[i]  = __builtin_e2k_qpfsubs(add02_1, add13_1);
		out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2);
		out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3);
		out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
		out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
		out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
		out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
	}
}
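
For reference, the arithmetic that the loop body above performs on each complex value can be written in scalar C99. This is only a sketch for checking intuition, not the author's code: the names `cmul_conj_swap` and `radix4_butterfly_ref` are made up here, and the sketch assumes the usual `float complex` layout (real part first, imaginary part second) that the byte masks appear to rely on. It forms a twiddle product the way the vector code does (conjugated and swapped copies of the coefficient, two elementwise multiplies, one horizontal add) and then combines add02/sub02/add13/sub13, where `sub13i` is `sub13` multiplied by i (swap the parts, negate the new real part).

```c
#include <complex.h>

/* Product c*y for one complex value, split into the same steps as the
   vector code: conj(c) and a swapped copy of c are multiplied elementwise
   with y, then each pair of products is summed. */
static float complex cmul_conj_swap(float complex c, float complex y)
{
    float cr = crealf(c), ci = cimagf(c);
    float yr = crealf(y), yi = cimagf(y);
    float re = cr * yr + (-ci) * yi;  /* conj_c * y, summed: real part      */
    float im = ci * yr + cr * yi;     /* swap_c * y, summed: imaginary part */
    return re + I * im;
}

/* Scalar radix-4 step with twiddles c, d, e applied to y, z, w.
   out[0..3] correspond to the four output groups of the vector code
   (out_0.., out_4.., out_8.., out_12..). */
static void radix4_butterfly_ref(float complex x, float complex y,
                                 float complex z, float complex w,
                                 float complex c, float complex d,
                                 float complex e, float complex out[4])
{
    float complex cy = cmul_conj_swap(c, y);
    float complex dz = cmul_conj_swap(d, z);
    float complex ew = cmul_conj_swap(e, w);

    float complex add02 = x + dz;
    float complex sub02 = x - dz;
    float complex add13 = cy + ew;
    float complex sub13 = cy - ew;

    /* i * sub13: swap real and imaginary parts, flip the sign of the new
       real part (the qpshufb + qpxor pair in the vector code). */
    float complex sub13i = -cimagf(sub13) + I * crealf(sub13);

    out[0] = add02 + add13;
    out[1] = sub02 - sub13i;
    out[2] = add02 - add13;
    out[3] = sub02 + sub13i;
}
```

With all twiddles equal to 1 this reduces to a plain 4-point DFT of (x, y, z, w), which makes it a convenient check for the vector version on small inputs.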
Main loop in assembly
.L7295:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=64
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=128
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=160
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=6, disp=192
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=6, disp=224
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=4, asz=1, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=4, asz=1, abs=8, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=3, asz=1, abs=10, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=3, asz=1, abs=10, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=1, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=1, abs=12, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=1, incr=0, ind=0, asz=1, abs=14, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=15, asz=1, abs=14, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=14, asz=1, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=13, asz=1, abs=16, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=12, asz=1, abs=18, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=11, asz=1, abs=18, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=10, asz=2, abs=20, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=9, asz=2, abs=20, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=8, asz=2, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=2, abs=24, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=2, abs=28, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=2, abs=28, disp=0
        }
.L3969:
        {
          loop_mode
          disp  %ctpr1, .L3969
          movaqp,0      area=0, ind=0, am=1, be=0, %g17
          movaqp,1      area=0, ind=16, am=0, be=0, %g16
          movaqp,2      area=0, ind=0, am=1, be=0, %g19
          movaqp,3      area=0, ind=16, am=0, be=0, %g18
        }
        {
          loop_mode
          movaqp,0      area=1, ind=0, am=1, be=0, %g21
          movaqp,1      area=1, ind=16, am=0, be=0, %g20
          movaqp,2      area=1, ind=0, am=1, be=0, %g23
          movaqp,3      area=1, ind=16, am=0, be=0, %g22
        }
        {
          loop_mode
          movaqp,0      area=2, ind=0, am=1, be=0, %g25
          movaqp,1      area=2, ind=16, am=0, be=0, %g24
          movaqp,2      area=2, ind=0, am=1, be=0, %g27
          movaqp,3      area=2, ind=16, am=0, be=0, %g26
        }
        {
          loop_mode
          movaqp,0      area=3, ind=0, am=1, be=0, %g29
          movaqp,1      area=3, ind=16, am=0, be=0, %g28
          movaqp,2      area=3, ind=0, am=1, be=0, %g31
          movaqp,3      area=3, ind=16, am=0, be=0, %g30
        }
        {
          loop_mode
          movaqp,0      area=4, ind=0, am=1, be=0, %r26
          movaqp,1      area=4, ind=16, am=0, be=0, %r9
          movaqp,2      area=4, ind=0, am=1, be=0, %r28
          movaqp,3      area=4, ind=16, am=0, be=0, %r27
        }
        {
          loop_mode
          qpshufb,0     %g19, %g17, %r1, %r33
          qpshufb,1     %g18, %g16, %r1, %r34
          qpshufb,3     %g18, %g16, %r7, %g16
          qpshufb,4     %g19, %g17, %r7, %g17
          movaqp,0      area=5, ind=0, am=1, be=0, %r30
          movaqp,1      area=5, ind=16, am=0, be=0, %r29
          movaqp,2      area=5, ind=0, am=1, be=0, %r32
          movaqp,3      area=5, ind=16, am=0, be=0, %r31
        }
        {
          loop_mode
          qpshufb,0     %g23, %g21, %r1, %r37
          qpshufb,1     %g22, %g20, %r1, %r38
          qpshufb,3     %g22, %g20, %r7, %g20
          qpshufb,4     %g23, %g21, %r7, %g21
          movaqp,0      area=6, ind=0, am=1, be=0, %g19
          movaqp,1      area=6, ind=16, am=0, be=0, %g18
          movaqp,2      area=6, ind=0, am=1, be=0, %r36
          movaqp,3      area=6, ind=16, am=0, be=0, %r35
        }
        {
          loop_mode
          qpshufb,0     %g27, %g25, %r1, %g22
          qpshufb,1     %g26, %g24, %r1, %g23
          qpshufb,3     %g26, %g24, %r7, %g24
          qpshufb,4     %g27, %g25, %r7, %g25
          movaqp,0      area=8, ind=0, am=1, be=0, %r39
          movaqp,1      area=7, ind=0, am=1, be=0, %g26
          movaqp,2      area=8, ind=0, am=1, be=0, %r40
          movaqp,3      area=7, ind=0, am=1, be=0, %g27
        }
        {
          loop_mode
          qpshufb,0     %g31, %g29, %r1, %r41
          qpshufb,1     %g30, %g28, %r1, %r42
          qpshufb,3     %g30, %g28, %r7, %g28
          qpshufb,4     %g31, %g29, %r7, %g29
          movaqp,0      area=10, ind=0, am=1, be=0, %r43
          movaqp,1      area=9, ind=0, am=1, be=0, %g30
          movaqp,2      area=10, ind=0, am=1, be=0, %r44
          movaqp,3      area=9, ind=0, am=1, be=0, %g31
        }
        {
          loop_mode
          qpxor,0       %r26, %r5, %r45
          qpxor,1       %r9, %r5, %r46
          qpshufb,3     %r26, %r26, %r6, %r26
          qpshufb,4     %r9, %r9, %r6, %r9
          movaqp,0      area=12, ind=0, am=1, be=0, %r49
          movaqp,1      area=11, ind=0, am=1, be=0, %r47
          movaqp,2      area=12, ind=0, am=1, be=0, %r50
          movaqp,3      area=11, ind=0, am=1, be=0, %r48
        }
        {
          loop_mode
          qpxor,0       %r28, %r5, %r51
          qpxor,1       %r27, %r5, %r52
          qpfmuls,2     %r45, %r33, %r33
          qpshufb,3     %r28, %r28, %r6, %r28
          qpshufb,4     %r27, %r27, %r6, %r27
          qpfmuls,5     %r26, %r33, %r26
        }
        {
          loop_mode
          qpxor,0       %g19, %r5, %r45
          qpxor,1       %g18, %r5, %r53
          qpfmuls,2     %r46, %r37, %r37
          qpshufb,3     %g19, %g19, %r6, %g19
          qpshufb,4     %g18, %g18, %r6, %g18
          qpfmuls,5     %r9, %r37, %r9
        }
        {
          loop_mode
          qpxor,0       %r36, %r5, %r46
          qpxor,1       %r35, %r5, %r54
          qpfmuls,2     %r51, %g22, %g22
          qpshufb,3     %r36, %r36, %r6, %r36
          qpshufb,4     %r35, %r35, %r6, %r35
          qpfmuls,5     %r28, %g22, %r28
        }
        {
          loop_mode
          qpfmuls,0     %r52, %r41, %r41
          qpfmuls,1     %r45, %r34, %r45
          qpfmuls,2     %r53, %r38, %r51
          qpfmuls,3     %r27, %r41, %r27
          qpfmuls,4     %g19, %r34, %g19
          qpfmuls,5     %g18, %r38, %g18
        }
        {
          loop_mode
          qpfmuls,0     %r54, %r42, %r38
          qpxor,1       %r29, %r5, %r36
          qpfmuls,2     %r46, %g23, %r34
          qpfmuls,3     %r35, %r42, %r35
          qpshufb,4     %r29, %r29, %r6, %r29
          qpfmuls,5     %r36, %g23, %g23
        }
        {
          loop_mode
          qpfmuls,2     %r36, %g20, %r36
          qpfmuls,5     %r29, %g20, %g20
        }
        {
          loop_mode
          qpxor,1       %r30, %r5, %r29
          qpfhadds,2    %r33, %r26, %r26
          qpxor,4       %r31, %r5, %r42
        }
        {
          loop_mode
          qpshufb,0     %r30, %r30, %r6, %r30
          qpshufb,1     %r31, %r31, %r6, %r31
          qpfmuls,2     %r29, %g16, %r29
          qpxor,3       %r32, %r5, %r33
          qpshufb,4     %r32, %r32, %r6, %r32
          qpfmuls,5     %r42, %g28, %r42
        }
        {
          loop_mode
          qpfmuls,0     %r30, %g16, %g16
          qpfhadds,1    %r37, %r9, %r9
          qpfmuls,2     %r31, %g28, %g28
          qpfmuls,3     %r32, %g24, %g24
          qpfhadds,4    %g22, %r28, %g22
          qpfmuls,5     %r33, %g24, %r30
        }
        {
          loop_mode
          qpfhadds,0    %r45, %g19, %g19
          qpfhadds,1    %r51, %g18, %g18
          qpfhadds,2    %r41, %r27, %r27
          qpshufb,3     %g26, %g26, %r6, %r28
          qpxor,4       %g26, %r5, %g26
        }
        {
          loop_mode
          qpfhadds,0    %r38, %r35, %r31
          qpfhadds,2    %r34, %g23, %g23
        }
        {
          loop_mode
          qpfhadds,2    %r36, %g20, %g20
          qpshufb,3     %r39, %r39, %r6, %r32
          qpxor,4       %r39, %r5, %r33
        }
        {
          loop_mode
          qpshufb,1     %r26, %r26, %r3, %r26
          qpfhadds,2    %r29, %g16, %g16
          qpshufb,3     %g22, %g22, %r3, %g22
          qpshufb,4     %r40, %r40, %r6, %r29
          qpfhadds,5    %r30, %g24, %g24
        }
        {
          loop_mode
          qpshufb,0     %r9, %r9, %r3, %r9
          qpshufb,1     %g19, %g19, %r3, %g19
          qpfhadds,2    %r42, %g28, %g28
          qpxor,3       %r40, %r5, %r30
          qpshufb,4     %g31, %g31, %r6, %r34
        }
        {
          loop_mode
          qpshufb,0     %r27, %r27, %r3, %r27
          qpshufb,1     %g18, %g18, %r3, %g18
          qpfsubs,2     %r26, %g19, %r35
          qpxor,3       %g31, %r5, %g31
          qpxor,4       %r43, %r5, %r36
        }
        {
          loop_mode
          qpshufb,0     %r31, %r31, %r3, %r31
          qpshufb,1     %g23, %g23, %r3, %g23
          qpfsubs,2     %r9, %g18, %r37
          qpshufb,3     %r43, %r43, %r6, %r38
          qpxor,4       %r48, %r5, %r39
        }
        {
          loop_mode
          qpfsubs,0     %g22, %g23, %r41
          qpshufb,1     %g16, %g16, %r3, %g16
          qpfsubs,2     %r27, %r31, %r40
          qpshufb,3     %g24, %g24, %r3, %g24
          qpxor,4       %r47, %r5, %r42
        }
        {
          loop_mode
          qpshufb,0     %g28, %g28, %r3, %g28
          qpshufb,1     %g20, %g20, %r3, %g20
          qpfadds,2     %r26, %g19, %g19
          qpfsubs,3     %g25, %g24, %g24
          qpshufb,4     %r48, %r48, %r6, %g25
          qpfadds,5     %g25, %g24, %r26
        }
        {
          loop_mode
          qpfadds,0     %r9, %g18, %g18
          qpfadds,1     %r27, %r31, %r9
          qpfadds,2     %g22, %g23, %g22
          qpshufb,3     %r47, %r47, %r6, %g23
          qpxor,4       %r50, %r5, %r27
        }
        {
          loop_mode
          qpfadds,0     %g17, %g16, %r31
          qpfadds,1     %g29, %g28, %g17
          qpfsubs,2     %g17, %g16, %g16
          qpshufb,4     %r50, %r50, %r6, %r43
        }
        {
          loop_mode
          qpfadds,0     %g21, %g20, %g29
          qpfsubs,1     %g21, %g20, %g20
          qpfsubs,2     %g29, %g28, %g28
          qpshufb,3     %r35, %r35, %r6, %g21
          qpxor,4       %g27, %r5, %r35
        }
        {
          loop_mode
          qpshufb,3     %r37, %r37, %r6, %r37
          qpxor,4       %g21, %r4, %g21
        }
        {
          loop_mode
          qpshufb,3     %r40, %r40, %r6, %r40
          qpshufb,4     %r41, %r41, %r6, %r41
        }
        {
          loop_mode
          qpfsubs,0     %r31, %g19, %r45
          qpfadds,1     %r31, %g19, %g19
          qpfadds,2     %g17, %r9, %r31
          qpxor,3       %r37, %r4, %r37
          qpxor,4       %r41, %r4, %r41
        }
        {
          loop_mode
          qpfsubs,0     %g17, %r9, %g17
          qpfadds,1     %g29, %g18, %r9
          qpfsubs,2     %g29, %g18, %g18
          qpxor,3       %r40, %r4, %r40
          qpfadds,4     %r26, %g22, %g29
          qpfsubs,5     %r26, %g22, %g22
        }
        {
          loop_mode
          qpfsubs,2     %g16, %g21, %r26
          qpfsubs,3     %g24, %r41, %g21
          qpfadds,4     %g24, %r41, %g24
          qpfadds,5     %g16, %g21, %g16
        }
        {
          loop_mode
          qpfsubs,3     %g20, %r37, %r46
          qpfadds,4     %g28, %r40, %g28
          qpfsubs,5     %g28, %r40, %r41
        }
        {
          loop_mode
          qpshufb,0     %g27, %g27, %r6, %g27
          qpshufb,1     %g30, %g30, %r6, %r37
          qpfadds,2     %g20, %r37, %g20
        }
        {
          loop_mode
          qpshufb,0     %r31, %r9, %r1, %r40
          qpshufb,1     %g17, %g18, %r1, %r47
        }
        {
          loop_mode
          qpfmuls,0     %r33, %r40, %r33
          qpfmuls,1     %r42, %r47, %r40
          qpfmuls,2     %r32, %r40, %r32
          qpshufb,3     %g29, %g19, %r1, %r48
          qpshufb,4     %g22, %r45, %r1, %r50
        }
        {
          loop_mode
          qpxor,0       %g30, %r5, %g30
          qpshufb,1     %r44, %r44, %r6, %r47
          qpfmuls,2     %g23, %r47, %g23
          qpshufb,3     %r41, %r46, %r1, %r42
          qpshufb,4     %g24, %g16, %r1, %r51
          qpfmuls,5     %r38, %r50, %r38
        }
        {
          loop_mode
          qpshufb,3     %g21, %r26, %r1, %r52
          qpfmuls,4     %g31, %r42, %g31
          qpfmuls,5     %r34, %r42, %r34
        }
        {
          loop_mode
          qpshufb,0     %g28, %g20, %r1, %r42
          qpshufb,1     %g17, %g18, %r7, %g17
          qpfmuls,3     %g26, %r48, %g26
          qpfmuls,4     %r36, %r50, %r36
          qpfmuls,5     %r28, %r48, %r28
        }
        {
          loop_mode
          qpfmuls,0     %r43, %r42, %r39
          qpxor,1       %r44, %r5, %r42
          qpfmuls,2     %r27, %r42, %r27
          qpfmuls,3     %r39, %r51, %g18
          qpfmuls,4     %g25, %r51, %g25
          qpfmuls,5     %r29, %r52, %r29
        }
        {
          loop_mode
          qpxor,0       %r49, %r5, %r43
          qpshufb,1     %r49, %r49, %r6, %r44
          qpfmuls,2     %r42, %g17, %r42
          qpfmuls,5     %r30, %r52, %r30
        }
        {
          loop_mode
          qpfhadds,0    %r33, %r32, %r31
          qpshufb,1     %r31, %r9, %r7, %r9
          qpfmuls,2     %r47, %g17, %g17
          qpfhadds,5    %g31, %r34, %g31
        }
        {
          loop_mode
          qpshufb,0     %g28, %g20, %r7, %g20
          qpshufb,1     %r41, %r46, %r7, %g28
          qpfmuls,2     %r35, %r9, %r32
          qpfhadds,3    %r36, %r38, %r28
          qpfhadds,4    %r40, %g23, %g23
          qpfhadds,5    %g26, %r28, %g26
        }
        {
          loop_mode
          qpfmuls,0     %r43, %g20, %r33
          qpfmuls,1     %r44, %g20, %g20
          qpfmuls,2     %g27, %r9, %g27
          qpshufb,3     %g29, %g19, %r7, %g19
          qpshufb,4     %g22, %r45, %r7, %g22
          qpfhadds,5    %g18, %g25, %g18
        }
        {
          loop_mode
          qpfmuls,0     %g30, %g28, %g28
          qpfhadds,1    %r27, %r39, %g29
          qpfmuls,2     %r37, %g28, %g25
          qpfhadds,5    %r30, %r29, %g30
        }
        {
          loop_mode
          qpfhadds,2    %r42, %g17, %g17
          qpshufb,3     %g31, %g31, %r3, %g31
          qpshufb,4     %g24, %g16, %r7, %g16
        }
        {
          loop_mode
          qpshufb,3     %g26, %g26, %r3, %g24
          qpshufb,4     %r28, %r28, %r3, %g26
        }
        {
          loop_mode
          qpfhadds,0    %r32, %g27, %g27
          qpshufb,1     %r31, %r31, %r3, %r9
          qpfhadds,2    %r33, %g20, %g20
          qpshufb,3     %g23, %g23, %r3, %g23
          qpshufb,4     %g18, %g18, %r3, %g18
        }
        {
          loop_mode
          qpshufb,0     %g29, %g29, %r3, %g28
          qpshufb,1     %g21, %r26, %r7, %g21
          qpfhadds,2    %g28, %g25, %g25
          qpshufb,3     %g30, %g30, %r3, %g29
          qpfadds,4     %g26, %g23, %g23
          qpfsubs,5     %g26, %g23, %g30
        }
        {
          loop_mode
          qpshufb,1     %g17, %g17, %r3, %g17
          qpfadds,3     %g29, %g31, %g29
          qpfsubs,5     %g29, %g31, %g26
        }
        {
          loop_mode
          qpfsubs,0     %g24, %r9, %g31
          qpfadds,1     %g24, %r9, %g24
          qpfadds,2     %g22, %g17, %r9
        }
        {
          loop_mode
          qpshufb,0     %g20, %g20, %r3, %g20
          qpshufb,1     %g27, %g27, %r3, %g27
          qpfsubs,2     %g18, %g28, %r26
        }
        {
          loop_mode
          qpfadds,0     %g18, %g28, %g18
          qpfsubs,1     %g16, %g20, %g16
          qpfadds,2     %g16, %g20, %g28
          qpshufb,3     %g30, %g30, %r6, %g20
        }
        {
          loop_mode
          qpshufb,0     %g25, %g25, %r3, %g25
          qpfadds,1     %g19, %g27, %g19
          qpfsubs,2     %g19, %g27, %g30
          qpshufb,3     %g26, %g26, %r6, %g22
          qpxor,4       %g20, %r4, %g20
          qpfsubs,5     %g22, %g17, %g17
        }
        {
          loop_mode
          qpfsubs,0     %g21, %g25, %g21
          qpfadds,1     %r9, %g23, %g25
          qpfadds,2     %g21, %g25, %g26
          qpxor,3       %g22, %r4, %g22
        }
        {
          loop_mode
          qpshufb,0     %g31, %g31, %r6, %g27
          qpfsubs,1     %r9, %g23, %g23
        }
        {
          loop_mode
          qpfadds,0     %g28, %g18, %g31
          qpfsubs,1     %g28, %g18, %g18
        }
        {
          loop_mode
          qpshufb,0     %r26, %r26, %r6, %g28
          qpfadds,1     %g19, %g24, %r9
          qpfsubs,2     %g19, %g24, %g19
          qpfsubs,3     %g17, %g20, %g24
          qpfadds,4     %g17, %g20, %g17
        }
        {
          loop_mode
          qpfadds,0     %g26, %g29, %g20
          qpfsubs,1     %g26, %g29, %g26
          qpfadds,2     %g21, %g22, %g29
        }
        {
          loop_mode
          qpxor,0       %g27, %r4, %g27
          qpfsubs,1     %g21, %g22, %g21
          stqp,2        %r18, %r0, %g23
        }
        {
          loop_mode
          qpfsubs,0     %g30, %g27, %g22
          qpfadds,1     %g30, %g27, %g23
          stqp,2        %r12, %r0, %g18
          stqp,5        %r25, %r0, %g25
        }
        {
          loop_mode
          qpxor,0       %g28, %r4, %g18
          stqp,2        %r16, %r0, %g31
          stqp,5        %r14, %r0, %g17
        }
        {
          loop_mode
          qpfsubs,0     %g16, %g18, %g17
          qpfadds,1     %g16, %g18, %g16
          stqp,2        %r23, %r0, %g19
          stqp,5        %r15, %r0, %g24
        }
        {
          loop_mode
          stqp,2        %r2, %r0, %r9
        }
        {
          loop_mode
          stqp,2        %r24, %r0, %g22
          stqp,5        %r22, %r0, %g20
        }
        {
          loop_mode
          stqp,2        %r19, %r0, %g26
          stqp,5        %r21, %r0, %g21
        }
        {
          loop_mode
          stqp,2        %r17, %r0, %g23
          stqp,5        %r11, %r0, %g29
        }
        {
          loop_mode
          stqp,2        %r20, %r0, %g17
        }
        {
          loop_mode
          ct    %ctpr1 ? %NOT_LOOP_END
          alc   alcf=1, alct=1
          addd,0,sm     0x10, %r0, %r0
          stqp,2        %r13, %r0, %g16
        }

Theoretical throughput: 32 complex numbers (256 bytes) in 73 cycles, 256/73 = 3.51 bytes/cycle
Quadrupled theoretical throughput: 14.03 bytes/cycle
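
The two figures can be reproduced with a few lines of C. This is a sketch under stated assumptions: the helper name `bytes_per_cycle` is hypothetical, one complex number is taken to be a `float complex` (8 bytes), one wide instruction of the loop body is counted as one cycle (the listing above has 73 of them, with 16 quad stores of 16 bytes each per iteration), and the factor of 4 is simply taken from the text.

```c
#include <complex.h>
#include <stddef.h>

/* Bytes of float complex data written per cycle, assuming the loop body
   issues one wide instruction per cycle with no stalls. */
static double bytes_per_cycle(size_t complex_count, size_t cycles)
{
    return (double)(complex_count * sizeof(float complex)) / (double)cycles;
}

/* The same figure scaled by a factor, as in the "quadrupled" estimate. */
static double scaled_bytes_per_cycle(size_t complex_count, size_t cycles,
                                     double factor)
{
    return factor * bytes_per_cycle(complex_count, cycles);
}
```

Calling `bytes_per_cycle(32, 73)` gives about 3.51, and `scaled_bytes_per_cycle(32, 73, 4.0)` gives about 14.03.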

Speed measurements

4. stage_radix4_2x_simd128_unroll2

Here the loop is unrolled by a factor of 2 using the unroll pragma (#pragma unroll(2)).
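
As an illustration of what unrolling by 2 means, here is a hand-unrolled version of a trivial loop. It is a generic sketch, not the FFT stage itself: the functions `scale_plain` and `scale_unrolled2` are hypothetical, and the compiler pragma may schedule the two copies of the body differently from this manual version.

```c
#include <stddef.h>

/* Plain loop: one element per iteration. */
static void scale_plain(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = k * src[i];
}

/* The same loop unrolled by 2 by hand: two independent copies of the body
   per iteration, plus a tail step for an odd element count. */
static void scale_unrolled2(float *dst, const float *src, float k, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        dst[i]     = k * src[i];
        dst[i + 1] = k * src[i + 1];
    }
    if (i < n)
        dst[i] = k * src[i];
}
```

The two copies of the body do not depend on each other, so a VLIW scheduler has more independent operations to pack into each wide instruction; the cost is higher register pressure and a longer loop body.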

C code
void stage_radix4_2x_simd128_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b)
{
	__v2di *xy0_in = (__v2di*)&data_in[ 0];
	__v2di *zw0_in = (__v2di*)&data_in[ 2];
	__v2di *xy1_in = (__v2di*)&data_in[ 4];
	__v2di *zw1_in = (__v2di*)&data_in[ 6];
	__v2di *xy2_in = (__v2di*)&data_in[ 8];
	__v2di *zw2_in = (__v2di*)&data_in[10];
	__v2di *xy3_in = (__v2di*)&data_in[12];
	__v2di *zw3_in = (__v2di*)&data_in[14];
	__v2di *xy4_in = (__v2di*)&data_in[16];
	__v2di *zw4_in = (__v2di*)&data_in[18];
	__v2di *xy5_in = (__v2di*)&data_in[20];
	__v2di *zw5_in = (__v2di*)&data_in[22];
	__v2di *xy6_in = (__v2di*)&data_in[24];
	__v2di *zw6_in = (__v2di*)&data_in[26];
	__v2di *xy7_in = (__v2di*)&data_in[28];
	__v2di *zw7_in = (__v2di*)&data_in[30];
	__v2di *c0a_in = (__v2di*)&coefC_a[0];
	__v2di *c1a_in = (__v2di*)&coefC_a[2];
	__v2di *c2a_in = (__v2di*)&coefC_a[4];
	__v2di *c3a_in = (__v2di*)&coefC_a[6];
	__v2di *d0a_in = (__v2di*)&coefD_a[0];
	__v2di *d1a_in = (__v2di*)&coefD_a[2];
	__v2di *d2a_in = (__v2di*)&coefD_a[4];
	__v2di *d3a_in = (__v2di*)&coefD_a[6];
	__v2di *e0a_in = (__v2di*)&coefE_a[0];
	__v2di *e1a_in = (__v2di*)&coefE_a[2];
	__v2di *e2a_in = (__v2di*)&coefE_a[4];
	__v2di *e3a_in = (__v2di*)&coefE_a[6];
	__v2di *c0b_in = (__v2di*)&coefC_b[0*data_count/16];
	__v2di *c1b_in = (__v2di*)&coefC_b[1*data_count/16];
	__v2di *c2b_in = (__v2di*)&coefC_b[2*data_count/16];
	__v2di *c3b_in = (__v2di*)&coefC_b[3*data_count/16];
	__v2di *d0b_in = (__v2di*)&coefD_b[0*data_count/16];
	__v2di *d1b_in = (__v2di*)&coefD_b[1*data_count/16];
	__v2di *d2b_in = (__v2di*)&coefD_b[2*data_count/16];
	__v2di *d3b_in = (__v2di*)&coefD_b[3*data_count/16];
	__v2di *e0b_in = (__v2di*)&coefE_b[0*data_count/16];
	__v2di *e1b_in = (__v2di*)&coefE_b[1*data_count/16];
	__v2di *e2b_in = (__v2di*)&coefE_b[2*data_count/16];
	__v2di *e3b_in = (__v2di*)&coefE_b[3*data_count/16];

	__v2di *out_0  = (__v2di*)&data_out[ 0*data_count/16];
	__v2di *out_1  = (__v2di*)&data_out[ 1*data_count/16];
	__v2di *out_2  = (__v2di*)&data_out[ 2*data_count/16];
	__v2di *out_3  = (__v2di*)&data_out[ 3*data_count/16];
	__v2di *out_4  = (__v2di*)&data_out[ 4*data_count/16];
	__v2di *out_5  = (__v2di*)&data_out[ 5*data_count/16];
	__v2di *out_6  = (__v2di*)&data_out[ 6*data_count/16];
	__v2di *out_7  = (__v2di*)&data_out[ 7*data_count/16];
	__v2di *out_8  = (__v2di*)&data_out[ 8*data_count/16];
	__v2di *out_9  = (__v2di*)&data_out[ 9*data_count/16];
	__v2di *out_10 = (__v2di*)&data_out[10*data_count/16];
	__v2di *out_11 = (__v2di*)&data_out[11*data_count/16];
	__v2di *out_12 = (__v2di*)&data_out[12*data_count/16];
	__v2di *out_13 = (__v2di*)&data_out[13*data_count/16];
	__v2di *out_14 = (__v2di*)&data_out[14*data_count/16];
	__v2di *out_15 = (__v2di*)&data_out[15*data_count/16];

	#pragma ivdep
	#pragma unroll(2)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/32; ++i)
	{
		__v2di xy0 = xy0_in[16*i];
		__v2di zw0 = zw0_in[16*i];
		__v2di xy1 = xy1_in[16*i];
		__v2di zw1 = zw1_in[16*i];
		__v2di c0  = c0a_in[4*i];
		__v2di d0  = d0a_in[4*i];
		__v2di e0  = e0a_in[4*i];

		__v2di xy2 = xy2_in[16*i];
		__v2di zw2 = zw2_in[16*i];
		__v2di xy3 = xy3_in[16*i];
		__v2di zw3 = zw3_in[16*i];
		__v2di c1  = c1a_in[4*i];
		__v2di d1  = d1a_in[4*i];
		__v2di e1  = e1a_in[4*i];

		__v2di xy4 = xy4_in[16*i];
		__v2di zw4 = zw4_in[16*i];
		__v2di xy5 = xy5_in[16*i];
		__v2di zw5 = zw5_in[16*i];
		__v2di c2  = c2a_in[4*i];
		__v2di d2  = d2a_in[4*i];
		__v2di e2  = e2a_in[4*i];

		__v2di xy6 = xy6_in[16*i];
		__v2di zw6 = zw6_in[16*i];
		__v2di xy7 = xy7_in[16*i];
		__v2di zw7 = zw7_in[16*i];
		__v2di c3  = c3a_in[4*i];
		__v2di d3  = d3a_in[4*i];
		__v2di e3  = e3a_in[4*i];

		__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		__v2di conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_c2 = __builtin_e2k_qpxor(c2, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_c3 = __builtin_e2k_qpxor(c3, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_d0 = __builtin_e2k_qpxor(d0, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_d1 = __builtin_e2k_qpxor(d1, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_d2 = __builtin_e2k_qpxor(d2, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_d3 = __builtin_e2k_qpxor(d3, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_e0 = __builtin_e2k_qpxor(e0, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_e1 = __builtin_e2k_qpxor(e1, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_e2 = __builtin_e2k_qpxor(e2, (__v2di){1LL<<63, 1LL<<63});
		__v2di conj_e3 = __builtin_e2k_qpxor(e3, (__v2di){1LL<<63, 1LL<<63});
		__v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

		__v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
		__v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
		__v2di cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2);
		__v2di cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3);
		__v2di dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0);
		__v2di dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1);
		__v2di dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2);
		__v2di dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3);
		__v2di ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0);
		__v2di ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1);
		__v2di ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2);
		__v2di ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3);
		__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
		__v2di cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
		__v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
		__v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
		__v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
		__v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
		__v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
		__v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
		__v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
		__v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
		__v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);

		__v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
		__v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
		__v2di cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag);
		__v2di cy3_rrii = __builtin_e2k_qpfhadds(cy3_real, cy3_imag);
		__v2di dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag);
		__v2di dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag);
		__v2di dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag);
		__v2di dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag);
		__v2di ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag);
		__v2di ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag);
		__v2di ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag);
		__v2di ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag);

		__v2di cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

		__v2di add02_0 = __builtin_e2k_qpfadds( x0, dz0);
		__v2di add02_1 = __builtin_e2k_qpfadds( x1, dz1);
		__v2di add02_2 = __builtin_e2k_qpfadds( x2, dz2);
		__v2di add02_3 = __builtin_e2k_qpfadds( x3, dz3);
		__v2di sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
		__v2di sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
		__v2di sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
		__v2di sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
		__v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
		__v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
		__v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
		__v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
		__v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
		__v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
		__v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
		__v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);

		__v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});

		// Second butterfly level: four radix-4 outputs for each of the four unrolled butterflies.
		__v2di out0  = __builtin_e2k_qpfadds(add02_0, add13_0);
		__v2di out1  = __builtin_e2k_qpfadds(add02_1, add13_1);
		__v2di out2  = __builtin_e2k_qpfadds(add02_2, add13_2);
		__v2di out3  = __builtin_e2k_qpfadds(add02_3, add13_3);
		__v2di out4  = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
		__v2di out5  = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
		__v2di out6  = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
		__v2di out7  = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
		__v2di out8  = __builtin_e2k_qpfsubs(add02_0, add13_0);
		__v2di out9  = __builtin_e2k_qpfsubs(add02_1, add13_1);
		__v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2);
		__v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3);
		__v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
		__v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
		__v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
		__v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3);

		// Second pass: reuse the first-pass outputs as inputs and load the c/d/e factors for this pass.
		xy0 = out0;
		zw0 = out1;
		xy1 = out2;
		zw1 = out3;
		c0  = c0b_in[i];
		d0  = d0b_in[i];
		e0  = e0b_in[i];

		xy2 = out4;
		zw2 = out5;
		xy3 = out6;
		zw3 = out7;
		c1  = c1b_in[i];
		d1  = d1b_in[i];
		e1  = e1b_in[i];

		xy4 = out8;
		zw4 = out9;
		xy5 = out10;
		zw5 = out11;
		c2  = c2b_in[i];
		d2  = d2b_in[i];
		e2  = e2b_in[i];

		xy6 = out12;
		zw6 = out13;
		xy7 = out14;
		zw7 = out15;
		c3  = c3b_in[i];
		d3  = d3b_in[i];
		e3  = e3b_in[i];

		// Rebuild the x/y/z/w operands for the second pass from pairs of first-pass results.
		x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
		w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
		w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
		y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
		w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
		y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
		w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		// Two helper forms of each factor: imaginary sign flipped (conj_*) and re/im swapped (swap_*).
		conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63});
		conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63});
		conj_c2 = __builtin_e2k_qpxor(c2, (__v2di){1LL<<63, 1LL<<63});
		conj_c3 = __builtin_e2k_qpxor(c3, (__v2di){1LL<<63, 1LL<<63});
		conj_d0 = __builtin_e2k_qpxor(d0, (__v2di){1LL<<63, 1LL<<63});
		conj_d1 = __builtin_e2k_qpxor(d1, (__v2di){1LL<<63, 1LL<<63});
		conj_d2 = __builtin_e2k_qpxor(d2, (__v2di){1LL<<63, 1LL<<63});
		conj_d3 = __builtin_e2k_qpxor(d3, (__v2di){1LL<<63, 1LL<<63});
		conj_e0 = __builtin_e2k_qpxor(e0, (__v2di){1LL<<63, 1LL<<63});
		conj_e1 = __builtin_e2k_qpxor(e1, (__v2di){1LL<<63, 1LL<<63});
		conj_e2 = __builtin_e2k_qpxor(e2, (__v2di){1LL<<63, 1LL<<63});
		conj_e3 = __builtin_e2k_qpxor(e3, (__v2di){1LL<<63, 1LL<<63});
		swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

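		// Element-wise products; the horizontal adds below turn them into the real and imaginary parts of c*y, d*z, e*w.
		cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);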
		cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
		cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2);
		cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3);
		dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0);
		dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1);
		dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2);
		dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3);
		ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0);
		ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1);
		ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2);
		ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3);
		cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
		cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
		cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
		dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
		dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
		dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
		dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
		ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
		ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
		ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
		ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);

		cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
		cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
		cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag);
		cy3_rrii = __builtin_e2k_qpfhadds(cy3_real, cy3_imag);
		dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag);
		dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag);
		dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag);
		dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag);
		ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag);
		ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag);
		ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag);
		ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag);

		// Reorder [re0, re1, im0, im1] into interleaved [re0, im0, re1, im1].
		cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

		add02_0 = __builtin_e2k_qpfadds( x0, dz0);
		add02_1 = __builtin_e2k_qpfadds( x1, dz1);
		add02_2 = __builtin_e2k_qpfadds( x2, dz2);
		add02_3 = __builtin_e2k_qpfadds( x3, dz3);
		sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
		sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
		sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
		sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
		add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
		add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
		add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
		add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
		sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
		sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
		sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
		sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);

		swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
		sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
		sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
		sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});

		// Results of the second pass are stored directly to the output arrays.
		out_0[i]  = __builtin_e2k_qpfadds(add02_0, add13_0);
		out_1[i]  = __builtin_e2k_qpfadds(add02_1, add13_1);
		out_2[i]  = __builtin_e2k_qpfadds(add02_2, add13_2);
		out_3[i]  = __builtin_e2k_qpfadds(add02_3, add13_3);
		out_4[i]  = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
		out_5[i]  = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
		out_6[i]  = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
		out_7[i]  = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
		out_8[i]  = __builtin_e2k_qpfsubs(add02_0, add13_0);
		out_9[i]  = __builtin_e2k_qpfsubs(add02_1, add13_1);
		out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2);
		out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3);
		out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
		out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
		out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
		out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
	}
}
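The listing above is hard to check by eye, so here is a minimal scalar sketch of what a single lane of one butterfly pass appears to compute, based on my reading of the intrinsics. The names `mul_like_simd`, `radix4_butterfly_ref` and `radix4_selfcheck` are hypothetical and not part of the original code. The mapping of each step to an intrinsic is given in the comments and assumes that the real part sits in the low 32 bits of each 64-bit element.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical helper: complex multiply c*y done the way the vector code does it
   for one complex number (sign flip, re/im swap, two products, horizontal add). */
static float complex mul_like_simd(float complex c, float complex y)
{
	float conj_c[2] = { crealf(c), -cimagf(c) }; /* qpxor with 1LL<<63: flip sign of imag */
	float swap_c[2] = { cimagf(c),  crealf(c) }; /* qpshufb: swap re and im */
	float yv[2]     = { crealf(y),  cimagf(y) };

	float real_parts[2] = { conj_c[0] * yv[0], conj_c[1] * yv[1] }; /* qpfmuls */
	float imag_parts[2] = { swap_c[0] * yv[0], swap_c[1] * yv[1] }; /* qpfmuls */

	/* qpfhadds: add neighbouring elements to get re and im of the product */
	float re = real_parts[0] + real_parts[1];
	float im = imag_parts[0] + imag_parts[1];
	return re + im * I;
}

/* Hypothetical scalar model of one radix-4 butterfly:
   x is used as is, y, z, w are first multiplied by the factors c, d, e. */
static void radix4_butterfly_ref(float complex x, float complex y,
                                 float complex z, float complex w,
                                 float complex c, float complex d, float complex e,
                                 float complex out[4])
{
	float complex cy = mul_like_simd(c, y);
	float complex dz = mul_like_simd(d, z);
	float complex ew = mul_like_simd(e, w);

	float complex add02 = x + dz;
	float complex sub02 = x - dz;
	float complex add13 = cy + ew;
	float complex sub13 = cy - ew;

	/* i * sub13: swap re/im and negate the new real part
	   (qpshufb followed by qpxor with 1LL<<31) */
	float complex sub13i = -cimagf(sub13) + crealf(sub13) * I;

	out[0] = add02 + add13;  /* out0..out3   in the listing */
	out[1] = sub02 - sub13i; /* out4..out7   */
	out[2] = add02 - add13;  /* out8..out11  */
	out[3] = sub02 + sub13i; /* out12..out15 */
}

/* With all factors equal to 1 the butterfly is a plain 4-point DFT,
   so the DFT of {1, 2, 3, 4} serves as a quick self-check. */
static void radix4_selfcheck(void)
{
	float complex out[4];
	radix4_butterfly_ref(1.0f, 2.0f, 3.0f, 4.0f, 1.0f, 1.0f, 1.0f, out);
	assert(out[0] == 10.0f);
	assert(out[1] == -2.0f + 2.0f * I);
	assert(out[2] == -2.0f);
	assert(out[3] == -2.0f - 2.0f * I);
}
```

Calling `radix4_selfcheck()` from a test program is enough to exercise the sketch. Comparing such a scalar model against the vector function on random input is a convenient way to catch a wrong shuffle mask or sign constant.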
Main loop in assembly
.L11610:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=64
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=128
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=160
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=192
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=224
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=256
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=288
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=320
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=352
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=384
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=416
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=448
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=480
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=8, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=10, disp=64
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=10, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=12, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=14, disp=64
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=14, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=16, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=18, disp=64
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=18, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=1, incr=2, ind=0, asz=1, abs=20, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=15, asz=1, abs=20, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=14, asz=1, abs=22, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=13, asz=1, abs=22, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=12, asz=1, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=11, asz=1, abs=24, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=10, asz=1, abs=26, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=9, asz=1, abs=26, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=8, asz=1, abs=28, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=7, asz=1, abs=28, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=6, asz=1, abs=30, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=5, asz=1, abs=30, disp=0
        }
.L7588:
        {
          loop_mode
          disp  %ctpr1, .L7588
          movaqp,0      area=0, ind=0, am=1, be=0, %g17
          movaqp,1      area=0, ind=16, am=0, be=0, %g16
          movaqp,2      area=0, ind=0, am=1, be=0, %g19
          movaqp,3      area=0, ind=16, am=0, be=0, %g18
        }
        {
          loop_mode
          movaqp,0      area=1, ind=0, am=1, be=0, %g21
          movaqp,1      area=1, ind=16, am=0, be=0, %g20
          movaqp,2      area=1, ind=0, am=1, be=0, %g23
          movaqp,3      area=1, ind=16, am=0, be=0, %g22
        }
        {
          loop_mode
          movaqp,0      area=2, ind=0, am=1, be=0, %g25
          movaqp,1      area=2, ind=16, am=0, be=0, %g24
          movaqp,2      area=2, ind=0, am=1, be=0, %g27
          movaqp,3      area=2, ind=16, am=0, be=0, %g26
        }
        {
          loop_mode
          movaqp,0      area=3, ind=0, am=1, be=0, %g29
          movaqp,1      area=3, ind=16, am=0, be=0, %g28
          movaqp,2      area=3, ind=0, am=1, be=0, %g31
          movaqp,3      area=3, ind=16, am=0, be=0, %g30
        }
        {
          loop_mode
          movaqp,0      area=4, ind=0, am=1, be=0, %b[20]
          movaqp,1      area=4, ind=16, am=0, be=0, %b[19]
          movaqp,2      area=4, ind=0, am=1, be=0, %b[22]
          movaqp,3      area=4, ind=16, am=0, be=0, %b[21]
        }
        {
          loop_mode
          qpshufb,0     %g19, %g17, %r24, %b[27]
          qpshufb,1     %g18, %g16, %r24, %b[28]
          qpshufb,3     %g18, %g16, %r7, %g16
          qpshufb,4     %g19, %g17, %r7, %g17
          movaqp,0      area=5, ind=0, am=1, be=0, %b[24]
          movaqp,1      area=5, ind=16, am=0, be=0, %b[23]
          movaqp,2      area=5, ind=0, am=1, be=0, %b[26]
          movaqp,3      area=5, ind=16, am=0, be=0, %b[25]
        }
        {
          loop_mode
          qpshufb,0     %g23, %g21, %r24, %b[31]
          qpshufb,1     %g22, %g20, %r24, %b[32]
          qpshufb,3     %g22, %g20, %r7, %g20
          qpshufb,4     %g23, %g21, %r7, %g21
          movaqp,0      area=6, ind=0, am=1, be=0, %g19
          movaqp,1      area=6, ind=16, am=0, be=0, %g18
          movaqp,2      area=6, ind=0, am=1, be=0, %b[30]
          movaqp,3      area=6, ind=16, am=0, be=0, %b[29]
        }
        {
          loop_mode
          qpshufb,0     %g27, %g25, %r24, %b[35]
          qpshufb,1     %g26, %g24, %r24, %b[36]
          qpshufb,3     %g26, %g24, %r7, %g24
          qpshufb,4     %g27, %g25, %r7, %g25
          movaqp,0      area=7, ind=0, am=1, be=0, %g23
          movaqp,1      area=7, ind=16, am=0, be=0, %g22
          movaqp,2      area=7, ind=0, am=1, be=0, %b[34]
          movaqp,3      area=7, ind=16, am=0, be=0, %b[33]
        }
        {
          loop_mode
          qpshufb,0     %g31, %g29, %r24, %b[39]
          qpshufb,1     %g30, %g28, %r24, %b[40]
          qpshufb,3     %g30, %g28, %r7, %g28
          qpshufb,4     %g31, %g29, %r7, %g29
          movaqp,0      area=8, ind=0, am=1, be=0, %g27
          movaqp,1      area=8, ind=16, am=0, be=0, %g26
          movaqp,2      area=8, ind=0, am=1, be=0, %b[38]
          movaqp,3      area=8, ind=16, am=0, be=0, %b[37]
        }
        {
          loop_mode
          qpshufb,0     %b[22], %b[20], %r24, %b[43]
          qpshufb,1     %b[21], %b[19], %r24, %b[44]
          qpshufb,3     %b[21], %b[19], %r7, %b[19]
          qpshufb,4     %b[22], %b[20], %r7, %b[20]
          movaqp,0      area=9, ind=0, am=1, be=0, %g31
          movaqp,1      area=9, ind=16, am=0, be=0, %g30
          movaqp,2      area=9, ind=0, am=1, be=0, %b[42]
          movaqp,3      area=9, ind=16, am=0, be=0, %b[41]
        }
        {
          loop_mode
          qpshufb,0     %b[26], %b[24], %r24, %b[47]
          qpshufb,1     %b[25], %b[23], %r24, %b[48]
          qpshufb,3     %b[25], %b[23], %r7, %b[23]
          qpshufb,4     %b[26], %b[24], %r7, %b[24]
          movaqp,0      area=10, ind=0, am=1, be=0, %b[22]
          movaqp,1      area=10, ind=16, am=0, be=0, %b[21]
          movaqp,2      area=10, ind=0, am=1, be=0, %b[46]
          movaqp,3      area=10, ind=16, am=0, be=0, %b[45]
        }
        {
          loop_mode
          qpshufb,0     %b[30], %g19, %r24, %b[51]
          qpshufb,1     %b[29], %g18, %r24, %b[52]
          qpshufb,3     %b[29], %g18, %r7, %g18
          qpshufb,4     %b[30], %g19, %r7, %g19
          movaqp,0      area=11, ind=0, am=1, be=0, %b[26]
          movaqp,1      area=11, ind=16, am=0, be=0, %b[25]
          movaqp,2      area=11, ind=0, am=1, be=0, %b[50]
          movaqp,3      area=11, ind=16, am=0, be=0, %b[49]
        }
        {
          loop_mode
          qpshufb,0     %b[34], %g23, %r24, %b[55]
          qpshufb,1     %b[33], %g22, %r24, %b[56]
          qpshufb,3     %b[33], %g22, %r7, %g22
          qpshufb,4     %b[34], %g23, %r7, %g23
          movaqp,0      area=12, ind=0, am=1, be=0, %b[30]
          movaqp,1      area=12, ind=16, am=0, be=0, %b[29]
          movaqp,2      area=12, ind=0, am=1, be=0, %b[54]
          movaqp,3      area=12, ind=16, am=0, be=0, %b[53]
        }
        {
          loop_mode
          qpxor,0       %g27, %r6, %b[59]
          qpxor,1       %g26, %r6, %b[60]
          qpxor,3       %b[38], %r6, %b[61]
          qpxor,4       %b[37], %r6, %b[62]
          movaqp,0      area=13, ind=0, am=1, be=0, %b[34]
          movaqp,1      area=13, ind=16, am=0, be=0, %b[33]
          movaqp,2      area=13, ind=0, am=1, be=0, %b[58]
          movaqp,3      area=13, ind=16, am=0, be=0, %b[57]
        }
        {
          loop_mode
          qpxor,0       %g31, %r6, %b[63]
          qpxor,1       %g30, %r6, %b[64]
          qpfmuls,2     %b[60], %b[31], %b[60]
          qpxor,3       %b[42], %r6, %b[65]
          qpxor,4       %b[41], %r6, %b[66]
          qpfmuls,5     %b[62], %b[39], %b[62]
          movaqp,0      area=14, ind=0, am=1, be=0, %b[68]
          movaqp,1      area=14, ind=16, am=0, be=0, %b[67]
          movaqp,2      area=14, ind=0, am=1, be=0, %b[70]
          movaqp,3      area=14, ind=16, am=0, be=0, %b[69]
        }
        {
          loop_mode
          qpfmuls,0     %b[63], %b[43], %b[63]
          qpfmuls,1     %b[64], %b[47], %b[64]
          qpfmuls,2     %b[59], %b[27], %b[59]
          qpfmuls,3     %b[65], %b[51], %b[65]
          qpxor,4       %b[22], %r6, %b[71]
          qpfmuls,5     %b[61], %b[35], %b[61]
          movaqp,0      area=15, ind=0, am=1, be=0, %b[73]
          movaqp,1      area=15, ind=16, am=0, be=0, %b[72]
          movaqp,2      area=15, ind=0, am=1, be=0, %b[75]
          movaqp,3      area=15, ind=16, am=0, be=0, %b[74]
        }
        {
          loop_mode
          qpxor,0       %b[21], %r6, %b[76]
          qpxor,1       %b[45], %r6, %b[77]
          qpxor,3       %b[46], %r6, %b[78]
          qpxor,4       %b[26], %r6, %b[79]
          qpfmuls,5     %b[66], %b[55], %b[66]
          movaqp,0      area=16, ind=0, am=1, be=0, %b[81]
          movaqp,1      area=16, ind=16, am=0, be=0, %b[80]
          movaqp,2      area=16, ind=0, am=1, be=0, %b[83]
          movaqp,3      area=16, ind=16, am=0, be=0, %b[82]
        }
        {
          loop_mode
          qpfmuls,0     %b[77], %g28, %b[77]
          qpfmuls,2     %b[76], %g20, %b[76]
          qpfmuls,3     %b[78], %g24, %b[78]
          qpxor,4       %b[30], %r6, %b[84]
          qpfmuls,5     %b[71], %g16, %b[71]
          movaqp,0      area=17, ind=0, am=1, be=0, %b[86]
          movaqp,1      area=17, ind=16, am=0, be=0, %b[85]
          movaqp,2      area=17, ind=0, am=1, be=0, %b[88]
          movaqp,3      area=17, ind=16, am=0, be=0, %b[87]
        }
        {
          loop_mode
          qpxor,0       %b[29], %r6, %b[89]
          qpxor,1       %b[53], %r6, %b[90]
          qpxor,3       %b[54], %r6, %b[91]
          qpxor,4       %b[34], %r6, %b[92]
          qpfmuls,5     %b[84], %b[28], %b[84]
          movaqp,0      area=18, ind=0, am=1, be=0, %b[94]
          movaqp,1      area=18, ind=16, am=0, be=0, %b[93]
          movaqp,2      area=18, ind=0, am=1, be=0, %b[96]
          movaqp,3      area=18, ind=16, am=0, be=0, %b[95]
        }
        {
          loop_mode
          qpfmuls,0     %b[90], %b[40], %b[90]
          qpxor,1       %b[33], %r6, %b[97]
          qpfmuls,2     %b[89], %b[32], %b[89]
          qpfmuls,3     %b[92], %b[44], %b[92]
          qpxor,4       %b[58], %r6, %b[98]
          qpfmuls,5     %b[91], %b[36], %b[91]
          movaqp,0      area=19, ind=0, am=1, be=0, %b[100]
          movaqp,1      area=19, ind=16, am=0, be=0, %b[99]
          movaqp,2      area=19, ind=0, am=1, be=0, %b[102]
          movaqp,3      area=19, ind=16, am=0, be=0, %b[101]
        }
        {
          loop_mode
          qpxor,0       %b[57], %r6, %b[103]
          qpxor,1       %b[25], %r6, %b[104]
          qpfmuls,2     %b[97], %b[48], %b[97]
          qpxor,3       %b[49], %r6, %b[105]
          qpxor,4       %b[50], %r6, %b[106]
          qpfmuls,5     %b[98], %b[52], %b[98]
        }
        {
          loop_mode
          qpfmuls,0     %b[79], %b[19], %b[79]
          qpfmuls,1     %b[104], %b[23], %b[104]
          qpfmuls,2     %b[103], %b[56], %b[103]
          qpfmuls,3     %b[106], %g18, %b[106]
          qpshufb,4     %g27, %g27, %r25, %g27
          qpfmuls,5     %b[105], %g22, %b[105]
        }
        {
          loop_mode
          qpshufb,0     %g26, %g26, %r25, %g26
          qpshufb,1     %b[37], %b[37], %r25, %b[37]
          qpshufb,3     %b[38], %b[38], %r25, %b[38]
          qpshufb,4     %g31, %g31, %r25, %g31
          qpfmul_hadds,5        %g27, %b[27], %b[59], %g27
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[37], %b[39], %b[62], %b[27]
          qpfmul_hadds,2        %g26, %b[31], %b[60], %g26
          qpfmul_hadds,3        %g31, %b[43], %b[63], %g31
          qpshufb,4     %g30, %g30, %r25, %g30
          qpfmul_hadds,5        %b[38], %b[35], %b[61], %b[31]
        }
        {
          loop_mode
          qpshufb,0     %b[41], %b[41], %r25, %b[35]
          qpshufb,1     %b[30], %b[30], %r25, %b[30]
          qpshufb,3     %b[29], %b[29], %r25, %b[29]
          qpshufb,4     %b[42], %b[42], %r25, %b[37]
          qpfmul_hadds,5        %g30, %b[47], %b[64], %g30
        }
        {
          loop_mode
          qpshufb,0     %b[54], %b[54], %r25, %b[38]
          qpshufb,1     %b[53], %b[53], %r25, %b[39]
          qpfmul_hadds,2        %b[35], %b[55], %b[66], %b[35]
          qpshufb,3     %b[34], %b[34], %r25, %b[34]
          qpshufb,4     %b[33], %b[33], %r25, %b[33]
          qpfmul_hadds,5        %b[37], %b[51], %b[65], %b[37]
        }
        {
          loop_mode
          qpshufb,0     %b[58], %b[58], %r25, %b[41]
          qpshufb,1     %b[57], %b[57], %r25, %b[42]
          qpfmul_hadds,2        %b[30], %b[28], %b[84], %b[28]
          qpfmul_hadds,3        %b[29], %b[32], %b[89], %b[29]
          qpfmul_hadds,4        %b[34], %b[44], %b[92], %b[32]
          qpfmul_hadds,5        %b[33], %b[48], %b[97], %b[30]
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[38], %b[36], %b[91], %b[34]
          qpfmul_hadds,1        %b[41], %b[52], %b[98], %b[36]
          qpfmul_hadds,2        %b[39], %b[40], %b[90], %b[33]
          qpshufb,3     %b[21], %b[21], %r25, %b[21]
          qpshufb,4     %b[22], %b[22], %r25, %b[22]
        }
        {
          loop_mode
          qpshufb,0     %b[46], %b[46], %r25, %b[39]
          qpshufb,1     %b[45], %b[45], %r25, %b[40]
          qpfmul_hadds,2        %b[42], %b[56], %b[103], %b[38]
          qpshufb,3     %b[26], %b[26], %r25, %b[26]
          qpshufb,4     %b[25], %b[25], %r25, %b[25]
          qpfmul_hadds,5        %b[21], %g20, %b[76], %g20
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[40], %g28, %b[77], %g28
          qpshufb,1     %b[50], %b[50], %r25, %b[21]
          qpfmul_hadds,2        %b[39], %g24, %b[78], %g24
          qpfmul_hadds,3        %b[26], %b[19], %b[79], %b[19]
          qpshufb,4     %b[49], %b[49], %r25, %b[41]
          qpfmul_hadds,5        %b[22], %g16, %b[71], %g16
        }
        {
          loop_mode
          qpxor,0       %b[67], %r6, %b[21]
          qpxor,1       %b[68], %r6, %b[23]
          qpfmul_hadds,2        %b[21], %g18, %b[106], %g18
          qpfmul_hadds,3        %b[41], %g22, %b[105], %g22
          qpshufb,4     %g27, %g27, %r22, %g27
          qpfmul_hadds,5        %b[25], %b[23], %b[104], %b[22]
        }
        {
          loop_mode
          qpshufb,0     %g26, %g26, %r22, %g26
          qpshufb,1     %b[27], %b[27], %r22, %b[26]
          qpshufb,3     %b[31], %b[31], %r22, %b[25]
          qpshufb,4     %g31, %g31, %r22, %g31
        }
        {
          loop_mode
          qpxor,0       %b[73], %r6, %b[27]
          qpxor,1       %b[72], %r6, %b[31]
        }
        {
          loop_mode
          qpshufb,3     %g30, %g30, %r22, %g30
          qpshufb,4     %b[37], %b[37], %r22, %b[37]
        }
        {
          loop_mode
          qpshufb,0     %b[35], %b[35], %r22, %b[35]
          qpshufb,1     %b[28], %b[28], %r22, %b[28]
          qpshufb,3     %b[29], %b[29], %r22, %b[29]
          qpshufb,4     %b[32], %b[32], %r22, %b[32]
        }
        {
          loop_mode
          qpfsubs,0     %g27, %b[28], %b[39]
          qpshufb,1     %b[34], %b[34], %r22, %b[34]
          qpfadds,2     %g27, %b[28], %g27
          qpfsubs,3     %g26, %b[29], %b[40]
          qpshufb,4     %b[30], %b[30], %r22, %b[30]
          qpfsubs,5     %g31, %b[32], %b[41]
        }
        {
          loop_mode
          qpshufb,0     %b[33], %b[33], %r22, %b[28]
          qpshufb,1     %b[36], %b[36], %r22, %b[33]
          qpfsubs,2     %b[25], %b[34], %b[36]
          qpfadds,3     %g26, %b[29], %g26
          qpshufb,4     %g20, %g20, %r22, %g20
          qpfsubs,5     %g30, %b[30], %b[42]
        }
        {
          loop_mode
          qpfsubs,0     %b[37], %b[33], %b[43]
          qpshufb,1     %b[38], %b[38], %r22, %b[29]
          qpfsubs,2     %b[26], %b[28], %b[38]
          qpfadds,3     %g31, %b[32], %g31
          qpshufb,4     %g16, %g16, %r22, %g16
          qpfadds,5     %g30, %b[30], %g30
        }
        {
          loop_mode
          qpshufb,0     %g28, %g28, %r22, %g28
          qpshufb,1     %g24, %g24, %r22, %g24
          qpfsubs,2     %b[35], %b[29], %b[30]
          qpfadds,3     %g21, %g20, %b[32]
          qpshufb,4     %b[22], %b[22], %r22, %b[22]
          qpfsubs,5     %g21, %g20, %g20
        }
        {
          loop_mode
          qpfadds,0     %b[26], %b[28], %b[19]
          qpshufb,1     %b[19], %b[19], %r22, %g21
          qpfadds,2     %b[25], %b[34], %b[25]
          qpfadds,3     %b[24], %b[22], %b[26]
          qpshufb,4     %g22, %g22, %r22, %g22
          qpfsubs,5     %b[24], %b[22], %b[22]
        }
        {
          loop_mode
          qpshufb,0     %g18, %g18, %r22, %g18
          qpfadds,1     %b[37], %b[33], %b[24]
          qpfadds,2     %b[35], %b[29], %b[28]
          qpfadds,3     %g17, %g16, %b[29]
          qpfsubs,4     %g17, %g16, %g16
          qpfadds,5     %g23, %g22, %g17
        }
        {
          loop_mode
          qpfadds,0     %g29, %g28, %b[33]
          qpfsubs,1     %g29, %g28, %g28
          qpfadds,2     %b[20], %g21, %g29
          qpshufb,4     %b[39], %b[39], %r25, %g23
          qpfsubs,5     %g23, %g22, %g22
        }
        {
          loop_mode
          qpfsubs,0     %g25, %g24, %g24
          qpfsubs,1     %g19, %g18, %g25
          qpfsubs,2     %b[20], %g21, %g21
          qpfsubs,3     %b[32], %g26, %b[34]
          qpfadds,4     %b[32], %g26, %g26
          qpfadds,5     %g25, %g24, %b[20]
        }
        {
          loop_mode
          qpfadds,2     %g19, %g18, %g18
          qpshufb,3     %b[40], %b[40], %r25, %g19
          qpshufb,4     %b[36], %b[36], %r25, %b[32]
          qpfsubs,5     %b[26], %g30, %b[35]
        }
        {
          loop_mode
          qpfadds,3     %b[26], %g30, %g30
          qpfadds,4     %b[29], %g27, %g27
          qpfsubs,5     %b[29], %g27, %b[36]
        }
        {
          loop_mode
          qpshufb,0     %b[38], %b[38], %r25, %b[26]
          qpshufb,1     %b[42], %b[42], %r25, %b[29]
          qpfsubs,2     %g29, %g31, %b[38]
          qpshufb,3     %b[41], %b[41], %r25, %b[37]
          qpshufb,4     %b[30], %b[30], %r25, %b[30]
        }
        {
          loop_mode
          qpshufb,0     %b[43], %b[43], %r25, %b[39]
          qpxor,1       %g23, %r23, %g23
          qpfsubs,2     %b[33], %b[19], %b[40]
          qpfadds,3     %b[20], %b[25], %b[41]
          qpfadds,4     %g17, %b[28], %b[25]
          qpfsubs,5     %b[20], %b[25], %b[20]
        }
        {
          loop_mode
          qpxor,0       %g19, %r23, %g19
          qpxor,1       %b[32], %r23, %b[32]
          qpfadds,2     %g29, %g31, %g29
          qpxor,3       %b[37], %r23, %b[37]
          qpxor,4       %b[30], %r23, %b[30]
          qpfadds,5     %b[33], %b[19], %g31
        }
        {
          loop_mode
          qpxor,0       %b[29], %r23, %b[19]
          qpxor,1       %b[26], %r23, %b[26]
          qpfadds,2     %g20, %g19, %b[29]
          qpfsubs,3     %g17, %b[28], %g17
          qpfsubs,4     %g21, %b[37], %b[28]
          qpfadds,5     %g21, %b[37], %g21
        }
        {
          loop_mode
          qpxor,0       %b[39], %r23, %b[33]
          qpfsubs,1     %g20, %g19, %g19
          qpfadds,2     %g18, %b[24], %g20
          qpfsubs,3     %g18, %b[24], %g18
          qpfsubs,4     %g22, %b[30], %g22
          qpfadds,5     %g22, %b[30], %b[24]
        }
        {
          loop_mode
          qpfsubs,0     %g16, %g23, %b[30]
          qpfadds,1     %g16, %g23, %g16
          qpfadds,2     %g24, %b[32], %g23
        }
        {
          loop_mode
          qpfadds,0     %g28, %b[26], %b[37]
          qpfsubs,1     %g24, %b[32], %g24
          qpfsubs,2     %g28, %b[26], %g28
        }
        {
          loop_mode
          qpfsubs,0     %g25, %b[33], %b[22]
          qpfadds,1     %g25, %b[33], %g25
          qpfsubs,2     %b[22], %b[19], %b[26]
          qpxor,3       %b[75], %r6, %b[32]
          qpxor,4       %b[83], %r6, %b[33]
          qpfadds,5     %b[22], %b[19], %b[19]
        }
        {
          loop_mode
          qpxor,3       %b[74], %r6, %b[39]
          qpxor,4       %b[82], %r6, %b[42]
        }
        {
          loop_mode
          qpxor,3       %b[86], %r6, %b[43]
          qpxor,4       %b[94], %r6, %b[44]
        }
        {
          loop_mode
          qpxor,0       %b[85], %r6, %b[45]
          qpxor,1       %b[93], %r6, %b[46]
          qpxor,3       %b[96], %r6, %b[47]
          qpxor,4       %b[102], %r6, %b[48]
        }
        {
          loop_mode
          qpxor,0       %b[95], %r6, %b[49]
          qpxor,1       %b[101], %r6, %b[50]
          qpshufb,3     %b[41], %g27, %r24, %b[51]
          qpshufb,4     %g31, %g26, %r24, %b[52]
        }
        {
          loop_mode
          qpshufb,0     %b[20], %b[36], %r24, %b[53]
          qpshufb,1     %b[40], %b[34], %r24, %b[54]
          qpshufb,3     %g20, %g29, %r24, %b[55]
          qpshufb,4     %b[25], %g30, %r24, %b[56]
          qpfmuls,5     %b[23], %b[51], %b[23]
        }
        {
          loop_mode
          qpshufb,0     %g18, %b[38], %r24, %b[57]
          qpshufb,1     %g17, %b[35], %r24, %b[58]
          qpfmuls,2     %b[43], %b[53], %b[43]
          qpshufb,3     %g24, %b[30], %r24, %b[59]
          qpshufb,4     %g28, %g19, %r24, %b[60]
          qpfmuls,5     %b[27], %b[52], %b[27]
        }
        {
          loop_mode
          qpshufb,0     %g23, %g16, %r24, %b[61]
          qpshufb,1     %b[37], %b[29], %r24, %b[62]
          qpfmuls,2     %b[44], %b[54], %b[44]
          qpshufb,3     %b[22], %b[28], %r24, %b[63]
          qpshufb,4     %g22, %b[26], %r24, %b[64]
          qpfmuls,5     %b[21], %b[55], %b[21]
        }
        {
          loop_mode
          qpshufb,0     %g25, %g21, %r24, %b[65]
          qpshufb,1     %b[24], %b[19], %r24, %b[66]
          qpfmuls,2     %b[45], %b[57], %b[45]
          qpfmuls,3     %b[31], %b[56], %b[31]
          qpfmuls,4     %b[32], %b[59], %b[32]
          qpfmuls,5     %b[33], %b[60], %b[33]
        }
        {
          loop_mode
          qpfmuls,0     %b[46], %b[58], %b[46]
          qpfmuls,1     %b[47], %b[61], %b[47]
          qpfmuls,2     %b[48], %b[62], %b[48]
          qpfmuls,3     %b[42], %b[64], %b[42]
          qpxor,4       %b[70], %r6, %b[71]
          qpfmuls,5     %b[39], %b[63], %b[39]
        }
        {
          loop_mode
          qpfmuls,0     %b[50], %b[66], %b[50]
          qpxor,1       %b[81], %r6, %b[76]
          qpfmuls,2     %b[49], %b[65], %b[49]
        }
        {
          loop_mode
          qpxor,4       %b[69], %r6, %b[77]
        }
        {
          loop_mode
          qpxor,1       %b[80], %r6, %b[78]
          qpxor,3       %b[88], %r6, %b[79]
          qpxor,4       %b[100], %r6, %b[84]
        }
        {
          loop_mode
          qpxor,0       %b[87], %r6, %b[89]
          qpxor,1       %b[99], %r6, %b[90]
          qpshufb,3     %g31, %g26, %r7, %g26
          qpshufb,4     %b[40], %b[34], %r7, %g31
        }
        {
          loop_mode
          qpshufb,0     %b[25], %g30, %r7, %g30
          qpshufb,1     %g17, %b[35], %r7, %g17
          qpshufb,3     %g28, %g19, %r7, %g19
          qpshufb,4     %b[37], %b[29], %r7, %g28
          qpfmuls,5     %b[79], %g31, %b[25]
        }
        {
          loop_mode
          qpshufb,0     %g22, %b[26], %r7, %g22
          qpshufb,1     %b[24], %b[19], %r7, %b[19]
          qpfmuls,2     %b[77], %g30, %b[26]
          qpfmuls,3     %b[76], %g19, %b[29]
          qpfmuls,4     %b[84], %g28, %b[34]
          qpfmuls,5     %b[71], %g26, %b[24]
        }
        {
          loop_mode
          qpfmuls,0     %b[90], %b[19], %b[37]
          qpfmuls,1     %b[89], %g17, %b[40]
          qpfmuls,2     %b[78], %g22, %b[35]
          qpshufb,3     %b[68], %b[68], %r25, %b[68]
          qpshufb,4     %b[67], %b[67], %r25, %b[67]
        }
        {
          loop_mode
          qpshufb,0     %b[73], %b[73], %r25, %b[71]
          qpshufb,1     %b[72], %b[72], %r25, %b[72]
          qpfmul_hadds,3        %b[67], %b[55], %b[21], %b[21]
          qpfmul_hadds,5        %b[68], %b[51], %b[23], %b[23]
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[72], %b[56], %b[31], %b[31]
          qpfmul_hadds,2        %b[71], %b[52], %b[27], %b[27]
          qpshufb,3     %b[75], %b[75], %r25, %b[51]
          qpshufb,4     %b[83], %b[83], %r25, %b[55]
        }
        {
          loop_mode
          qpshufb,0     %b[74], %b[74], %r25, %b[52]
          qpshufb,1     %b[82], %b[82], %r25, %b[56]
          qpshufb,3     %b[86], %b[86], %r25, %b[67]
          qpshufb,4     %b[94], %b[94], %r25, %b[68]
          qpfmul_hadds,5        %b[51], %b[59], %b[32], %b[32]
        }
        {
          loop_mode
          qpshufb,0     %b[85], %b[85], %r25, %b[51]
          qpshufb,1     %b[93], %b[93], %r25, %b[59]
          qpfmul_hadds,2        %b[52], %b[63], %b[39], %b[39]
          qpshufb,3     %b[96], %b[96], %r25, %b[71]
          qpshufb,4     %b[102], %b[102], %r25, %b[72]
          qpfmul_hadds,5        %b[55], %b[60], %b[33], %b[33]
        }
        {
          loop_mode
          qpshufb,0     %b[95], %b[95], %r25, %b[52]
          qpshufb,1     %b[101], %b[101], %r25, %b[55]
          qpfmul_hadds,2        %b[56], %b[64], %b[42], %b[42]
          qpfmul_hadds,3        %b[67], %b[53], %b[43], %b[43]
          qpfmul_hadds,4        %b[68], %b[54], %b[44], %b[44]
          qpfmul_hadds,5        %b[71], %b[61], %b[47], %b[47]
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[59], %b[58], %b[46], %b[46]
          qpfmul_hadds,1        %b[52], %b[65], %b[49], %b[49]
          qpfmul_hadds,2        %b[51], %b[57], %b[45], %b[45]
          qpshufb,3     %b[70], %b[70], %r25, %b[51]
          qpshufb,4     %b[69], %b[69], %r25, %b[52]
          qpfmul_hadds,5        %b[72], %b[62], %b[48], %b[48]
        }
        {
          loop_mode
          qpshufb,0     %b[81], %b[81], %r25, %b[53]
          qpshufb,1     %b[88], %b[88], %r25, %b[54]
          qpfmul_hadds,2        %b[55], %b[66], %b[50], %b[50]
          qpfmul_hadds,3        %b[52], %g30, %b[26], %g30
          qpshufb,4     %b[80], %b[80], %r25, %b[55]
          qpfmul_hadds,5        %b[51], %g26, %b[24], %g26
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[53], %g19, %b[29], %g19
          qpshufb,1     %b[87], %b[87], %r25, %b[24]
          qpfmul_hadds,2        %b[54], %g31, %b[25], %g31
          qpshufb,3     %b[100], %b[100], %r25, %b[26]
          qpshufb,4     %b[99], %b[99], %r25, %b[51]
          qpfmul_hadds,5        %b[55], %g22, %b[35], %g22
        }
        {
          loop_mode
          qpshufb,0     %b[20], %b[36], %r7, %b[20]
          qpshufb,1     %b[41], %g27, %r7, %g27
          qpfmul_hadds,2        %b[24], %g17, %b[40], %g17
          qpfmul_hadds,3        %b[51], %b[19], %b[37], %b[19]
          qpshufb,4     %b[23], %b[23], %r22, %b[23]
          qpfmul_hadds,5        %b[26], %g28, %b[34], %g28
        }
        {
          loop_mode
          qpshufb,0     %b[27], %b[27], %r22, %b[24]
          qpshufb,1     %b[31], %b[31], %r22, %b[25]
          qpshufb,3     %b[21], %b[21], %r22, %b[21]
          qpshufb,4     %g20, %g29, %r7, %g20
        }
        {
          loop_mode
          qpshufb,0     %g18, %b[38], %r7, %g18
          qpshufb,1     %g23, %g16, %r7, %g16
        }
        {
          loop_mode
          qpshufb,3     %b[32], %b[32], %r22, %g23
          qpshufb,4     %b[33], %b[33], %r22, %g29
        }
        {
          loop_mode
          qpshufb,0     %b[39], %b[39], %r22, %b[26]
          qpshufb,1     %b[42], %b[42], %r22, %b[27]
          qpshufb,4     %b[43], %b[43], %r22, %b[29]
        }
        {
          loop_mode
          qpfsubs,0     %b[26], %b[27], %b[36]
          qpshufb,1     %b[45], %b[45], %r22, %b[31]
          qpfsubs,2     %b[23], %b[24], %b[34]
          qpshufb,3     %b[44], %b[44], %r22, %b[32]
          qpshufb,4     %b[47], %b[47], %r22, %b[33]
          qpfsubs,5     %b[21], %b[25], %b[35]
        }
        {
          loop_mode
          qpshufb,0     %b[46], %b[46], %r22, %b[37]
          qpshufb,1     %b[49], %b[49], %r22, %b[38]
          qpfadds,2     %b[23], %b[24], %b[23]
          qpfsubs,3     %b[29], %b[32], %b[41]
          qpshufb,4     %b[48], %b[48], %r22, %b[39]
          qpfsubs,5     %g23, %g29, %b[40]
        }
        {
          loop_mode
          qpfadds,0     %b[21], %b[25], %g25
          qpshufb,1     %b[50], %b[50], %r22, %b[24]
          qpfsubs,2     %b[31], %b[37], %b[43]
          qpshufb,3     %g24, %b[30], %r7, %g24
          qpshufb,4     %g25, %g21, %r7, %g21
          qpfsubs,5     %b[33], %b[39], %b[42]
        }
        {
          loop_mode
          qpshufb,0     %b[22], %b[28], %r7, %b[22]
          qpshufb,1     %g26, %g26, %r22, %g26
          qpfsubs,2     %b[38], %b[24], %b[21]
          qpfadds,3     %g23, %g29, %g23
          qpshufb,4     %g30, %g30, %r22, %g30
          qpfadds,5     %b[26], %b[27], %g29
        }
        {
          loop_mode
          qpfadds,0     %b[38], %b[24], %b[24]
          qpshufb,1     %g17, %g17, %r22, %g17
          qpfadds,2     %b[31], %b[37], %b[25]
          qpshufb,3     %g22, %g22, %r22, %g22
          qpshufb,4     %g31, %g31, %r22, %g31
          qpfadds,5     %b[33], %b[39], %b[26]
        }
        {
          loop_mode
          qpshufb,0     %g19, %g19, %r22, %g19
          qpshufb,1     %g28, %g28, %r22, %g28
          qpfadds,2     %b[29], %b[32], %b[27]
          qpfadds,3     %g20, %g30, %b[28]
          qpshufb,4     %b[19], %b[19], %r22, %b[19]
          qpfsubs,5     %g20, %g30, %g20
        }
        {
          loop_mode
          qpfsubs,0     %g27, %g26, %g30
          qpfadds,1     %g27, %g26, %g26
          qpfadds,2     %g18, %g17, %b[20]
          qpfadds,3     %b[20], %g31, %g27
          qpfsubs,4     %b[20], %g31, %g31
          qpfsubs,5     %g21, %b[19], %b[29]
        }
        {
          loop_mode
          qpfsubs,0     %g18, %g17, %g17
          qpfsubs,1     %g24, %g19, %g18
          qpfadds,2     %g24, %g19, %g19
          qpfadds,3     %b[22], %g22, %g24
          qpfsubs,4     %b[22], %g22, %g22
          qpfadds,5     %g21, %b[19], %g21
        }
        {
          loop_mode
          qpfsubs,0     %g16, %g28, %g16
          qpfadds,2     %g16, %g28, %b[19]
        }
        {
          loop_mode
          qpfsubs,3     %b[28], %g25, %g28
          qpfadds,4     %b[28], %g25, %g25
        }
        {
          loop_mode
          qpfadds,0     %g26, %b[23], %b[31]
          qpshufb,1     %b[35], %b[35], %r25, %b[22]
          qpfsubs,2     %g26, %b[23], %g26
          qpshufb,3     %b[34], %b[34], %r25, %b[28]
          qpshufb,4     %b[40], %b[40], %r25, %b[30]
        }
        {
          loop_mode
          qpshufb,0     %b[36], %b[36], %r25, %b[23]
          qpshufb,1     %b[21], %b[21], %r25, %b[21]
          qpfsubs,2     %b[20], %b[25], %b[27]
          qpfadds,3     %g27, %b[27], %b[32]
          qpfsubs,4     %g27, %b[27], %g27
          qpfadds,5     %g21, %b[24], %b[33]
        }
        {
          loop_mode
          qpfadds,0     %b[20], %b[25], %b[20]
          qpshufb,1     %b[43], %b[43], %r25, %b[34]
          qpfadds,2     %g19, %g23, %b[25]
          qpshufb,3     %b[42], %b[42], %r25, %b[35]
          qpshufb,4     %b[41], %b[41], %r25, %b[36]
          qpfsubs,5     %g21, %b[24], %g21
        }
        {
          loop_mode
          qpxor,0       %b[22], %r23, %b[22]
          qpxor,1       %b[23], %r23, %b[23]
          qpfsubs,2     %b[19], %b[26], %g23
          qpfsubs,3     %g19, %g23, %g19
          qpfadds,4     %g24, %g29, %b[24]
          qpfsubs,5     %g24, %g29, %g24
        }
        {
          loop_mode
          qpfadds,0     %b[19], %b[26], %b[19]
          qpxor,1       %b[28], %r23, %g29
          qpfadds,2     %g20, %b[22], %b[26]
          qpxor,3       %b[30], %r23, %b[28]
          qpxor,4       %b[35], %r23, %b[30]
          stqp,5        %r33, %r0, %g28
        }
        {
          loop_mode
          qpxor,0       %b[21], %r23, %g28
          qpxor,1       %b[34], %r23, %b[21]
          qpfsubs,2     %g30, %g29, %b[34]
          qpfadds,3     %g18, %b[28], %b[35]
          qpfsubs,4     %g18, %b[28], %g18
          qpfadds,5     %g16, %b[30], %b[28]
        }
        {
          loop_mode
          qpfadds,0     %g30, %g29, %g29
          qpxor,1       %b[36], %r23, %b[36]
          qpfsubs,2     %g20, %b[22], %g20
          qpfsubs,3     %g16, %b[30], %g16
          stqp,5        %r18, %r0, %g26
        }
        {
          loop_mode
          qpfsubs,0     %g22, %b[23], %g26
          qpfadds,1     %g22, %b[23], %g22
          qpfadds,2     %b[29], %g28, %g30
          stqp,5        %r28, %r0, %g25
        }
        {
          loop_mode
          qpfsubs,0     %g17, %b[21], %g25
          qpfadds,1     %g17, %b[21], %g17
          qpfsubs,2     %b[29], %g28, %g28
          stqp,5        %r2, %r0, %b[31]
        }
        {
          loop_mode
          qpfsubs,0     %g31, %b[36], %b[21]
          qpfadds,1     %g31, %b[36], %g31
          stqp,2        %r16, %r0, %g27
          stqp,5        %r20, %r0, %b[32]
        }
        {
          loop_mode
          stqp,2        %r37, %r0, %b[27]
          stqp,5        %r26, %r0, %b[20]
        }
        {
          loop_mode
          stqp,2        %r13, %r0, %g19
          stqp,5        %r21, %r0, %b[25]
        }
        {
          loop_mode
          stqp,2        %r36, %r0, %g21
          stqp,5        %r3, %r0, %g23
        }
        {
          loop_mode
          stqp,2        %r9, %r0, %b[33]
          stqp,5        %r27, %r0, %b[24]
        }
        {
          loop_mode
          stqp,2        %r32, %r0, %g24
          stqp,5        %r17, %r0, %b[19]
        }
        {
          loop_mode
          stqp,2        %r35, %r0, %b[26]
          stqp,5        %r15, %r0, %g29
        }
        {
          loop_mode
          stqp,2        %r31, %r0, %g20
          stqp,5        %r19, %r0, %b[34]
        }
        {
          loop_mode
          stqp,2        %r1, %r0, %b[35]
          stqp,5        %r12, %r0, %g18
        }
        {
          loop_mode
          stqp,2        %r40, %r0, %g22
          stqp,5        %r30, %r0, %g26
        }
        {
          loop_mode
          stqp,2        %r38, %r0, %g30
          stqp,5        %r4, %r0, %b[28]
        }
        {
          loop_mode
          stqp,2        %r34, %r0, %g28
          stqp,5        %r39, %r0, %g17
        }
        {
          loop_mode
          stqp,2        %r14, %r0, %g16
          stqp,5        %r5, %r0, %g31
        }
        {
          loop_mode
          ct    %ctpr1 ? %NOT_LOOP_END
          alc   alcf=1, alct=1
          stqp,2        %r29, %r0, %g25
          addd,3,sm     %r0, _f16s,_lts0lo 0x20, %r0
          stqp,5        %r11, %r0, %b[21]
        }

Theoretical speed: 64 complex numbers per 115 cycles (64/115), i.e. 64 × 8 bytes / 115 cycles = 4.45 bytes/cycle
Quadruple theoretical speed: 17.81 bytes/cycle
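
The two figures above follow from simple arithmetic. A minimal sketch, assuming each complex number occupies 8 bytes (two `float` values, as stated in the problem statement); the helper name `bytes_per_cycle` is hypothetical and is not part of the article's code:

```c
#include <stddef.h>

/* Hypothetical helper: bytes of complex data processed per clock cycle.
 * Assumes one complex number = 2 floats (real part + imaginary part). */
static double bytes_per_cycle(unsigned complex_count, unsigned cycles)
{
    const double bytes_per_complex = 2.0 * sizeof(float);
    return complex_count * bytes_per_complex / cycles;
}

/* The "quadruple" figure is the same value multiplied by four,
 * which is the factor the article applies to this stage. */
static double quadruple_bytes_per_cycle(unsigned complex_count, unsigned cycles)
{
    return 4.0 * bytes_per_cycle(complex_count, cycles);
}
```

With 64 complex numbers per 115 cycles this gives 512 / 115 ≈ 4.45 bytes/cycle, and four times that is ≈ 17.81 bytes/cycle.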

Speed measurements

5. stage_radix4_2x_simd128_noConj

Here the stage_radix4_simd128_noConj algorithm is manually unrolled by a factor of 2.
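
For orientation, here is a scalar sketch of the single radix-4 butterfly that the SIMD listing below appears to compute on each group of four inputs. It is an illustration inferred from that listing, not the author's code: the function name `radix4_butterfly` and the type alias `cf` are hypothetical, and the sketch ignores the data layout, the vector packing and the second, fused pass that uses the `coefC_b`, `coefD_b`, `coefE_b` tables.

```c
#include <complex.h>

typedef float complex cf;  /* hypothetical alias for the article's complex type */

/* Scalar reference for one radix-4 butterfly.
 * x, y, z, w are the four inputs; c, d, e are the twiddle factors
 * applied to y, z and w respectively. */
static void radix4_butterfly(cf x, cf y, cf z, cf w,
                             cf c, cf d, cf e, cf out[4])
{
    cf cy = c * y;
    cf dz = d * z;
    cf ew = e * w;

    cf add02 = x + dz;
    cf sub02 = x - dz;
    cf add13 = cy + ew;
    cf sub13 = cy - ew;

    /* Multiply by i: swap the real and imaginary parts, then negate the
     * new real part. The SIMD code does this with a shuffle plus a
     * sign-bit XOR. */
    cf sub13i = -cimagf(sub13) + crealf(sub13) * I;

    out[0] = add02 + add13;
    out[1] = sub02 - sub13i;
    out[2] = add02 - add13;
    out[3] = sub02 + sub13i;
}
```

With all twiddle factors equal to 1 this reduces to a plain 4-point forward DFT, which makes it easy to check by hand.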

C code
void stage_radix4_2x_simd128_noConj(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b)
{
	__v2di *xy0_in = (__v2di*)&data_in[ 0];
	__v2di *zw0_in = (__v2di*)&data_in[ 2];
	__v2di *xy1_in = (__v2di*)&data_in[ 4];
	__v2di *zw1_in = (__v2di*)&data_in[ 6];
	__v2di *xy2_in = (__v2di*)&data_in[ 8];
	__v2di *zw2_in = (__v2di*)&data_in[10];
	__v2di *xy3_in = (__v2di*)&data_in[12];
	__v2di *zw3_in = (__v2di*)&data_in[14];
	__v2di *xy4_in = (__v2di*)&data_in[16];
	__v2di *zw4_in = (__v2di*)&data_in[18];
	__v2di *xy5_in = (__v2di*)&data_in[20];
	__v2di *zw5_in = (__v2di*)&data_in[22];
	__v2di *xy6_in = (__v2di*)&data_in[24];
	__v2di *zw6_in = (__v2di*)&data_in[26];
	__v2di *xy7_in = (__v2di*)&data_in[28];
	__v2di *zw7_in = (__v2di*)&data_in[30];
	__v2di *c0a_in = (__v2di*)&coefC_a[0];
	__v2di *c1a_in = (__v2di*)&coefC_a[2];
	__v2di *c2a_in = (__v2di*)&coefC_a[4];
	__v2di *c3a_in = (__v2di*)&coefC_a[6];
	__v2di *d0a_in = (__v2di*)&coefD_a[0];
	__v2di *d1a_in = (__v2di*)&coefD_a[2];
	__v2di *d2a_in = (__v2di*)&coefD_a[4];
	__v2di *d3a_in = (__v2di*)&coefD_a[6];
	__v2di *e0a_in = (__v2di*)&coefE_a[0];
	__v2di *e1a_in = (__v2di*)&coefE_a[2];
	__v2di *e2a_in = (__v2di*)&coefE_a[4];
	__v2di *e3a_in = (__v2di*)&coefE_a[6];
	__v2di *c0b_in = (__v2di*)&coefC_b[0*data_count/16];
	__v2di *c1b_in = (__v2di*)&coefC_b[1*data_count/16];
	__v2di *c2b_in = (__v2di*)&coefC_b[2*data_count/16];
	__v2di *c3b_in = (__v2di*)&coefC_b[3*data_count/16];
	__v2di *d0b_in = (__v2di*)&coefD_b[0*data_count/16];
	__v2di *d1b_in = (__v2di*)&coefD_b[1*data_count/16];
	__v2di *d2b_in = (__v2di*)&coefD_b[2*data_count/16];
	__v2di *d3b_in = (__v2di*)&coefD_b[3*data_count/16];
	__v2di *e0b_in = (__v2di*)&coefE_b[0*data_count/16];
	__v2di *e1b_in = (__v2di*)&coefE_b[1*data_count/16];
	__v2di *e2b_in = (__v2di*)&coefE_b[2*data_count/16];
	__v2di *e3b_in = (__v2di*)&coefE_b[3*data_count/16];

	__v2di *out_0  = (__v2di*)&data_out[ 0*data_count/16];
	__v2di *out_1  = (__v2di*)&data_out[ 1*data_count/16];
	__v2di *out_2  = (__v2di*)&data_out[ 2*data_count/16];
	__v2di *out_3  = (__v2di*)&data_out[ 3*data_count/16];
	__v2di *out_4  = (__v2di*)&data_out[ 4*data_count/16];
	__v2di *out_5  = (__v2di*)&data_out[ 5*data_count/16];
	__v2di *out_6  = (__v2di*)&data_out[ 6*data_count/16];
	__v2di *out_7  = (__v2di*)&data_out[ 7*data_count/16];
	__v2di *out_8  = (__v2di*)&data_out[ 8*data_count/16];
	__v2di *out_9  = (__v2di*)&data_out[ 9*data_count/16];
	__v2di *out_10 = (__v2di*)&data_out[10*data_count/16];
	__v2di *out_11 = (__v2di*)&data_out[11*data_count/16];
	__v2di *out_12 = (__v2di*)&data_out[12*data_count/16];
	__v2di *out_13 = (__v2di*)&data_out[13*data_count/16];
	__v2di *out_14 = (__v2di*)&data_out[14*data_count/16];
	__v2di *out_15 = (__v2di*)&data_out[15*data_count/16];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/32; ++i)
	{
		__v2di xy0 = xy0_in[16*i];
		__v2di zw0 = zw0_in[16*i];
		__v2di xy1 = xy1_in[16*i];
		__v2di zw1 = zw1_in[16*i];
		__v2di c0  = c0a_in[4*i];
		__v2di d0  = d0a_in[4*i];
		__v2di e0  = e0a_in[4*i];

		__v2di xy2 = xy2_in[16*i];
		__v2di zw2 = zw2_in[16*i];
		__v2di xy3 = xy3_in[16*i];
		__v2di zw3 = zw3_in[16*i];
		__v2di c1  = c1a_in[4*i];
		__v2di d1  = d1a_in[4*i];
		__v2di e1  = e1a_in[4*i];

		__v2di xy4 = xy4_in[16*i];
		__v2di zw4 = zw4_in[16*i];
		__v2di xy5 = xy5_in[16*i];
		__v2di zw5 = zw5_in[16*i];
		__v2di c2  = c2a_in[4*i];
		__v2di d2  = d2a_in[4*i];
		__v2di e2  = e2a_in[4*i];

		__v2di xy6 = xy6_in[16*i];
		__v2di zw6 = zw6_in[16*i];
		__v2di xy7 = xy7_in[16*i];
		__v2di zw7 = zw7_in[16*i];
		__v2di c3  = c3a_in[4*i];
		__v2di d3  = d3a_in[4*i];
		__v2di e3  = e3a_in[4*i];

		__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		__v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

		__v2di cy0_real = __builtin_e2k_qpfmuls(     c0, y0);
		__v2di cy1_real = __builtin_e2k_qpfmuls(     c1, y1);
		__v2di cy2_real = __builtin_e2k_qpfmuls(     c2, y2);
		__v2di cy3_real = __builtin_e2k_qpfmuls(     c3, y3);
		__v2di dz0_real = __builtin_e2k_qpfmuls(     d0, z0);
		__v2di dz1_real = __builtin_e2k_qpfmuls(     d1, z1);
		__v2di dz2_real = __builtin_e2k_qpfmuls(     d2, z2);
		__v2di dz3_real = __builtin_e2k_qpfmuls(     d3, z3);
		__v2di ew0_real = __builtin_e2k_qpfmuls(     e0, w0);
		__v2di ew1_real = __builtin_e2k_qpfmuls(     e1, w1);
		__v2di ew2_real = __builtin_e2k_qpfmuls(     e2, w2);
		__v2di ew3_real = __builtin_e2k_qpfmuls(     e3, w3);
		__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
		__v2di cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
		__v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
		__v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
		__v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
		__v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
		__v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
		__v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
		__v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
		__v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
		__v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);

		__v2di cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);
		__v2di cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
		__v2di cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real);
		__v2di cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real);
		__v2di dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real);
		__v2di dz1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz1_real);
		__v2di dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real);
		__v2di dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real);
		__v2di ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real);
		__v2di ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real);
		__v2di ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real);
		__v2di ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real);
		__v2di cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
		__v2di cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);
		__v2di cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag);
		__v2di cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag);
		__v2di dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag);
		__v2di dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag);
		__v2di dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag);
		__v2di dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag);
		__v2di ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag);
		__v2di ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag);
		__v2di ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag);
		__v2di ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag);

		__v2di cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di ew0 = __builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});

		__v2di add02_0 = __builtin_e2k_qpfadds( x0, dz0);
		__v2di add02_1 = __builtin_e2k_qpfadds( x1, dz1);
		__v2di add02_2 = __builtin_e2k_qpfadds( x2, dz2);
		__v2di add02_3 = __builtin_e2k_qpfadds( x3, dz3);
		__v2di sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
		__v2di sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
		__v2di sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
		__v2di sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
		__v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
		__v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
		__v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
		__v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
		__v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
		__v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
		__v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
		__v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);

		__v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});

		__v2di out0  = __builtin_e2k_qpfadds(add02_0, add13_0);
		__v2di out1  = __builtin_e2k_qpfadds(add02_1, add13_1);
		__v2di out2  = __builtin_e2k_qpfadds(add02_2, add13_2);
		__v2di out3  = __builtin_e2k_qpfadds(add02_3, add13_3);
		__v2di out4  = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
		__v2di out5  = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
		__v2di out6  = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
		__v2di out7  = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
		__v2di out8  = __builtin_e2k_qpfsubs(add02_0, add13_0);
		__v2di out9  = __builtin_e2k_qpfsubs(add02_1, add13_1);
		__v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2);
		__v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3);
		__v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
		__v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
		__v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
		__v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3);


		xy0 = out0;
		zw0 = out1;
		xy1 = out2;
		zw1 = out3;
		c0  = c0b_in[i];
		d0  = d0b_in[i];
		e0  = e0b_in[i];

		xy2 = out4;
		zw2 = out5;
		xy3 = out6;
		zw3 = out7;
		c1  = c1b_in[i];
		d1  = d1b_in[i];
		e1  = e1b_in[i];

		xy4 = out8;
		zw4 = out9;
		xy5 = out10;
		zw5 = out11;
		c2  = c2b_in[i];
		d2  = d2b_in[i];
		e2  = e2b_in[i];

		xy6 = out12;
		zw6 = out13;
		xy7 = out14;
		zw7 = out15;
		c3  = c3b_in[i];
		d3  = d3b_in[i];
		e3  = e3b_in[i];

		x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
		w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
		w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
		y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
		w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
		y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
		w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

		/* Element-wise products: c*y yields (re*re, im*im), swap_c*y yields (im*re, re*im). */
		cy0_real = __builtin_e2k_qpfmuls(     c0, y0);
		cy1_real = __builtin_e2k_qpfmuls(     c1, y1);
		cy2_real = __builtin_e2k_qpfmuls(     c2, y2);
		cy3_real = __builtin_e2k_qpfmuls(     c3, y3);
		dz0_real = __builtin_e2k_qpfmuls(     d0, z0);
		dz1_real = __builtin_e2k_qpfmuls(     d1, z1);
		dz2_real = __builtin_e2k_qpfmuls(     d2, z2);
		dz3_real = __builtin_e2k_qpfmuls(     d3, z3);
		ew0_real = __builtin_e2k_qpfmuls(     e0, w0);
		ew1_real = __builtin_e2k_qpfmuls(     e1, w1);
		ew2_real = __builtin_e2k_qpfmuls(     e2, w2);
		ew3_real = __builtin_e2k_qpfmuls(     e3, w3);
		cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
		cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
		cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
		dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
		dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
		dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
		dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
		ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
		ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
		ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
		ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);

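		/* Horizontal subtract of adjacent products gives the real parts (re*re - im*im),
		   horizontal add gives the imaginary parts (im*re + re*im). */
		cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);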
		cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
		cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real);
		cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real);
		dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real);
		dz1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz1_real);
		dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real);
		dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real);
		ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real);
		ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real);
		ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real);
		ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real);
		cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
		cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);
		cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag);
		cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag);
		dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag);
		dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag);
		dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag);
		dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag);
		ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag);
		ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag);
		ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag);
		ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag);

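		/* Interleave the real and imaginary parts (bytes 8..15 of each source) back into complex numbers. */
		cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});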
		cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		ew0 = __builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});

		/* Radix-4 butterfly, first level: sums and differences of x, d*z and of c*y, e*w. */
		add02_0 = __builtin_e2k_qpfadds( x0, dz0);
		add02_1 = __builtin_e2k_qpfadds( x1, dz1);
		add02_2 = __builtin_e2k_qpfadds( x2, dz2);
		add02_3 = __builtin_e2k_qpfadds( x3, dz3);
		sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
		sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
		sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
		sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
		add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
		add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
		add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
		add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
		sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
		sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
		sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
		sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);

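		/* Multiply sub13 by i: swap re/im, then flip the sign bit (bit 31) of the low float in each 64-bit half. */
		swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});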
		swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
		sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
		sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
		sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});

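		/* Butterfly, second level: out_0..3, out_4..7, out_8..11 and out_12..15 hold outputs 0..3 for groups 0..3. */
		out_0[i]  = __builtin_e2k_qpfadds(add02_0, add13_0);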
		out_1[i]  = __builtin_e2k_qpfadds(add02_1, add13_1);
		out_2[i]  = __builtin_e2k_qpfadds(add02_2, add13_2);
		out_3[i]  = __builtin_e2k_qpfadds(add02_3, add13_3);
		out_4[i]  = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
		out_5[i]  = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
		out_6[i]  = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
		out_7[i]  = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
		out_8[i]  = __builtin_e2k_qpfsubs(add02_0, add13_0);
		out_9[i]  = __builtin_e2k_qpfsubs(add02_1, add13_1);
		out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2);
		out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3);
		out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
		out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
		out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
		out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
	}
}
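To make the data flow of the intrinsics easier to follow, here is a scalar model of what one complex lane goes through. This is only a sketch in portable C, not the code that was measured: the type cpx and the helpers cmul_swap, mul_i, cadd, csub and radix4_butterfly are hypothetical names introduced for illustration, and the lane order of the horizontal operations is assumed to match the listing above.

```c
/* Scalar model of one complex lane (hypothetical helper type). */
typedef struct { float re, im; } cpx;

cpx cadd(cpx a, cpx b) { cpx r = { a.re + b.re, a.im + b.im }; return r; }
cpx csub(cpx a, cpx b) { cpx r = { a.re - b.re, a.im - b.im }; return r; }

/* Complex product c*y built the same way as in the SIMD listing:
   swap re/im of c, take two element-wise products, then a horizontal
   subtract for the real part and a horizontal add for the imaginary part. */
cpx cmul_swap(cpx c, cpx y)
{
    cpx swap_c = { c.im, c.re };                /* models the swapping qpshufb */
    float rr0 = c.re * y.re,      rr1 = c.im * y.im;      /* qpfmuls(c, y)      */
    float ii0 = swap_c.re * y.re, ii1 = swap_c.im * y.im; /* qpfmuls(swap_c, y) */
    cpx r = { rr0 - rr1,                        /* models qpfhsubs */
              ii0 + ii1 };                      /* models qpfhadds */
    return r;
}

/* Multiplication by i: swap re/im, then negate the new real part
   (the listing does the negation with qpxor on the sign bit). */
cpx mul_i(cpx a)
{
    cpx r = { -a.im, a.re };
    return r;
}

/* One radix-4 butterfly with twiddle factors c, d, e applied to y, z, w. */
void radix4_butterfly(cpx x, cpx y, cpx z, cpx w,
                      cpx c, cpx d, cpx e, cpx out[4])
{
    cpx cy = cmul_swap(c, y);
    cpx dz = cmul_swap(d, z);
    cpx ew = cmul_swap(e, w);

    cpx add02 = cadd(x, dz),  sub02 = csub(x, dz);
    cpx add13 = cadd(cy, ew), sub13 = csub(cy, ew);
    cpx sub13i = mul_i(sub13);

    out[0] = cadd(add02, add13);
    out[1] = csub(sub02, sub13i);
    out[2] = csub(add02, add13);
    out[3] = cadd(sub02, sub13i);
}
```

In the vector version every __v2di register carries two such lanes, and the four groups with suffixes 0..3 are independent copies of this butterfly.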
Main loop in assembly
.L15211:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=64
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=128
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=160
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=6, disp=192
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=6, disp=224
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=4, asz=1, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=4, asz=1, abs=8, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=3, asz=1, abs=10, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=3, asz=1, abs=10, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=1, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=1, abs=12, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=1, incr=0, ind=0, asz=1, abs=14, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=15, asz=1, abs=14, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=14, asz=1, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=13, asz=1, abs=16, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=12, asz=1, abs=18, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=11, asz=1, abs=18, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=10, asz=2, abs=20, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=9, asz=2, abs=20, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=8, asz=2, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=2, abs=24, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=2, abs=28, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=2, abs=28, disp=0
        }
.L11903:
        {
          loop_mode
          disp  %ctpr1, .L11903
          movaqp,0      area=0, ind=0, am=1, be=0, %g17
          movaqp,1      area=0, ind=16, am=0, be=0, %g16
          movaqp,2      area=0, ind=0, am=1, be=0, %g19
          movaqp,3      area=0, ind=16, am=0, be=0, %g18
        }
        {
          loop_mode
          movaqp,0      area=1, ind=0, am=1, be=0, %g21
          movaqp,1      area=1, ind=16, am=0, be=0, %g20
          movaqp,2      area=1, ind=0, am=1, be=0, %g23
          movaqp,3      area=1, ind=16, am=0, be=0, %g22
        }
        {
          loop_mode
          movaqp,0      area=2, ind=0, am=1, be=0, %g25
          movaqp,1      area=2, ind=16, am=0, be=0, %g24
          movaqp,2      area=2, ind=0, am=1, be=0, %g27
          movaqp,3      area=2, ind=16, am=0, be=0, %g26
        }
        {
          loop_mode
          movaqp,0      area=3, ind=0, am=1, be=0, %g29
          movaqp,1      area=3, ind=16, am=0, be=0, %g28
          movaqp,2      area=3, ind=0, am=1, be=0, %g31
          movaqp,3      area=3, ind=16, am=0, be=0, %g30
        }
        {
          loop_mode
          movaqp,0      area=4, ind=0, am=1, be=0, %r26
          movaqp,1      area=4, ind=16, am=0, be=0, %r9
          movaqp,2      area=4, ind=0, am=1, be=0, %r28
          movaqp,3      area=4, ind=16, am=0, be=0, %r27
        }
        {
          loop_mode
          qpshufb,0     %g19, %g17, %r1, %r33
          qpshufb,1     %g18, %g16, %r1, %r34
          qpshufb,3     %g18, %g16, %r6, %g16
          qpshufb,4     %g19, %g17, %r6, %g17
          movaqp,0      area=5, ind=0, am=1, be=0, %r30
          movaqp,1      area=5, ind=16, am=0, be=0, %r29
          movaqp,2      area=5, ind=0, am=1, be=0, %r32
          movaqp,3      area=5, ind=16, am=0, be=0, %r31
        }
        {
          loop_mode
          qpshufb,0     %g23, %g21, %r1, %r37
          qpshufb,1     %g22, %g20, %r1, %r38
          qpshufb,3     %g22, %g20, %r6, %g20
          qpshufb,4     %g23, %g21, %r6, %g21
          movaqp,0      area=6, ind=0, am=1, be=0, %g19
          movaqp,1      area=6, ind=16, am=0, be=0, %g18
          movaqp,2      area=6, ind=0, am=1, be=0, %r36
          movaqp,3      area=6, ind=16, am=0, be=0, %r35
        }
        {
          loop_mode
          qpshufb,0     %g27, %g25, %r1, %g22
          qpshufb,1     %g26, %g24, %r1, %g23
          qpshufb,3     %g26, %g24, %r6, %g24
          qpshufb,4     %g27, %g25, %r6, %g25
          movaqp,0      area=8, ind=0, am=1, be=0, %r39
          movaqp,1      area=7, ind=0, am=1, be=0, %g26
          movaqp,2      area=8, ind=0, am=1, be=0, %r40
          movaqp,3      area=7, ind=0, am=1, be=0, %g27
        }
        {
          loop_mode
          qpshufb,0     %g31, %g29, %r1, %r41
          qpshufb,1     %g30, %g28, %r1, %r42
          qpshufb,3     %g30, %g28, %r6, %g28
          qpshufb,4     %g31, %g29, %r6, %g29
          movaqp,0      area=10, ind=0, am=1, be=0, %r43
          movaqp,1      area=9, ind=0, am=1, be=0, %g30
          movaqp,2      area=10, ind=0, am=1, be=0, %r44
          movaqp,3      area=9, ind=0, am=1, be=0, %g31
        }
        {
          loop_mode
          qpshufb,0     %r26, %r26, %r5, %r45
          qpshufb,1     %r9, %r9, %r5, %r46
          qpfmul_hsubs,2        %r26, %r33, %r7, %r26
          qpshufb,3     %r28, %r28, %r5, %r47
          qpshufb,4     %r27, %r27, %r5, %r48
          movaqp,0      area=12, ind=0, am=1, be=0, %r51
          movaqp,1      area=11, ind=0, am=1, be=0, %r49
          movaqp,2      area=12, ind=0, am=1, be=0, %r52
          movaqp,3      area=11, ind=0, am=1, be=0, %r50
        }
        {
          loop_mode
          qpfmul_hadds,0        %r46, %r37, %r7, %r37
          qpfmul_hsubs,1        %r28, %g22, %r7, %r28
          qpfmul_hsubs,2        %r9, %r37, %r7, %r9
          qpshufb,3     %r30, %r30, %r5, %r46
          qpshufb,4     %r29, %r29, %r5, %r53
          qpfmul_hsubs,5        %r30, %g16, %r7, %r30
        }
        {
          loop_mode
          qpshufb,0     %g19, %g19, %r5, %r54
          qpshufb,1     %g18, %g18, %r5, %r55
          qpfmul_hsubs,2        %r27, %r41, %r7, %r27
          qpshufb,3     %r36, %r36, %r5, %r56
          qpshufb,4     %r35, %r35, %r5, %r57
          qpfmul_hsubs,5        %g19, %r34, %r7, %g19
        }
        {
          loop_mode
          qpfmul_hsubs,0        %g18, %r38, %r7, %g18
          qpfmul_hsubs,1        %r36, %g23, %r7, %r36
          qpfmul_hsubs,2        %r35, %r42, %r7, %r35
          qpfmul_hadds,3        %r47, %g22, %r7, %g22
          qpfmul_hadds,4        %r48, %r41, %r7, %r41
          qpfmul_hadds,5        %r56, %g23, %r7, %g23
        }
        {
          loop_mode
          qpfmul_hadds,0        %r54, %r34, %r7, %r34
          qpfmul_hadds,1        %r55, %r38, %r7, %r38
          qpfmul_hadds,2        %r45, %r33, %r7, %r33
          qpshufb,3     %r32, %r32, %r5, %r45
          qpshufb,4     %r31, %r31, %r5, %r47
          qpfmul_hadds,5        %r57, %r42, %r7, %r42
        }
        {
          loop_mode
          qpfmul_hsubs,0        %r29, %g20, %r7, %r29
          qpfmul_hsubs,1        %r32, %g24, %r7, %r32
          qpfmul_hsubs,2        %r31, %g28, %r7, %r31
          qpfmul_hadds,3        %r53, %g20, %r7, %g20
          qpfmul_hadds,4        %r45, %g24, %r7, %g24
          qpfmul_hadds,5        %r46, %g16, %r7, %g16
        }
        {
          loop_mode
          qpshufb,0     %r40, %r40, %r5, %r47
          qpshufb,1     %g31, %g31, %r5, %r48
          qpshufb,3     %g26, %g26, %r5, %r45
          qpshufb,4     %r39, %r39, %r5, %r46
          qpfmul_hadds,5        %r47, %g28, %r7, %g28
        }
        {
          loop_mode
          qpshufb,3     %r43, %r43, %r5, %r53
          qpshufb,4     %r50, %r50, %r5, %r54
        }
        {
          loop_mode
          nop 1
          qpshufb,0     %r49, %r49, %r5, %r55
          qpshufb,1     %r52, %r52, %r5, %r56
          qpshufb,3     %g27, %g27, %r5, %r57
          qpshufb,4     %g30, %g30, %r5, %r58
        }
        {
          loop_mode
          nop 1
          qpshufb,3     %r44, %r44, %r5, %r59
          qpshufb,4     %r51, %r51, %r5, %r60
        }
        {
          loop_mode
          qppermb,0     %r33, %r26, %r3, %r26
          qppermb,1     %r34, %g19, %r3, %g19
          qppermb,3     %r41, %r27, %r3, %r27
          qppermb,4     %r37, %r9, %r3, %r9
        }
        {
          loop_mode
          qppermb,0     %r38, %g18, %r3, %g18
          qppermb,1     %g22, %r28, %r3, %g22
          qpfsubs,2     %r26, %g19, %r33
          qppermb,3     %g23, %r36, %r3, %g23
          qppermb,4     %r42, %r35, %r3, %r28
        }
        {
          loop_mode
          qpfadds,0     %r26, %g19, %g19
          qppermb,3     %g16, %r30, %r3, %g16
          qpfadds,4     %r27, %r28, %r27
          qpfsubs,5     %r27, %r28, %r34
        }
        {
          loop_mode
          qppermb,0     %g20, %r29, %r3, %g20
          qppermb,1     %g24, %r32, %r3, %g24
          qppermb,3     %g28, %r31, %r3, %g28
          qpfadds,4     %g17, %g16, %r26
          qpfsubs,5     %g17, %g16, %g16
        }
        {
          loop_mode
          qpfsubs,0     %r9, %g18, %g17
          qpfadds,1     %r9, %g18, %g18
          qpfsubs,2     %g21, %g20, %r9
          qpfadds,3     %g29, %g28, %r28
          qpfsubs,4     %g29, %g28, %g28
        }
        {
          loop_mode
          qpfadds,0     %g21, %g20, %g20
          qpfsubs,1     %g25, %g24, %g21
          qpfsubs,2     %g22, %g23, %g29
          qpfadds,5     %g22, %g23, %g22
        }
        {
          loop_mode
          qpfadds,2     %g25, %g24, %g23
        }
        {
          loop_mode
          qpshufb,3     %r34, %r34, %r5, %g24
          qpshufb,4     %r33, %r33, %r5, %g25
        }
        {
          loop_mode
          qpshufb,0     %g17, %g17, %r5, %g17
          qpxor,3       %g24, %r4, %g24
          qpxor,4       %g25, %r4, %g25
          qpfadds,5     %r26, %g19, %r29
        }
        {
          loop_mode
          qpshufb,0     %g29, %g29, %r5, %g29
          qpxor,1       %g17, %r4, %g17
          qpfsubs,2     %r26, %g19, %g19
          qpfadds,3     %r28, %r27, %r26
          qpfsubs,4     %r28, %r27, %r27
          qpfsubs,5     %g16, %g25, %r28
        }
        {
          loop_mode
          qpxor,0       %g29, %r4, %g29
          qpfadds,1     %g20, %g18, %r30
          qpfsubs,2     %g20, %g18, %g18
          qpfadds,3     %g16, %g25, %g16
          qpfsubs,4     %g28, %g24, %g20
          qpfadds,5     %g28, %g24, %g24
        }
        {
          loop_mode
          qpfadds,0     %r9, %g17, %g25
          qpfadds,1     %g23, %g22, %g28
          qpfsubs,2     %g23, %g22, %g22
        }
        {
          loop_mode
          nop 2
          qpfsubs,0     %r9, %g17, %g17
          qpfsubs,1     %g21, %g29, %g23
          qpfadds,2     %g21, %g29, %g21
        }
        {
          loop_mode
          qpshufb,0     %r26, %r30, %r1, %g29
          qpshufb,1     %r27, %g18, %r1, %r9
        }
        {
          loop_mode
          qpshufb,0     %g28, %r29, %r1, %r31
          qpshufb,1     %g22, %g19, %r1, %r32
          qpfmul_hadds,2        %r55, %r9, %r7, %r33
          qpshufb,3     %r27, %g18, %r6, %g18
          qpshufb,4     %r26, %r30, %r6, %r26
        }
        {
          loop_mode
          qpshufb,0     %g23, %r28, %r1, %r27
          qpshufb,1     %g20, %g17, %r1, %r30
          qpfmul_hsubs,2        %r49, %r9, %r7, %r9
          qpshufb,3     %g24, %g25, %r1, %r34
          qpshufb,4     %g24, %g25, %r6, %g24
          qpfmul_hadds,5        %r59, %g18, %r7, %g25
        }
        {
          loop_mode
          qpshufb,0     %g21, %g16, %r1, %r35
          qpfmul_hsubs,1        %r40, %r27, %r7, %r36
          qpfmul_hadds,2        %r47, %r27, %r7, %r27
          qpfmul_hadds,3        %r56, %r34, %r7, %r34
          qpshufb,4     %g20, %g17, %r6, %g17
          qpfmul_hsubs,5        %r52, %r34, %r7, %r37
        }
        {
          loop_mode
          qpfmul_hsubs,0        %r39, %g29, %r7, %g20
          qpfmul_hadds,1        %r46, %g29, %r7, %g29
          qpfmul_hsubs,2        %r43, %r32, %r7, %r38
          qpfmul_hsubs,3        %r44, %g18, %r7, %g18
          qpfmul_hsubs,4        %g27, %r26, %r7, %g27
          qpfmul_hadds,5        %r57, %r26, %r7, %r26
        }
        {
          loop_mode
          qpfmul_hadds,0        %r45, %r31, %r7, %r39
          qpfmul_hadds,1        %r53, %r32, %r7, %r32
          qpfmul_hsubs,2        %g26, %r31, %r7, %g26
          qpfmul_hadds,3        %r58, %g17, %r7, %r40
          qpfmul_hadds,4        %r60, %g24, %r7, %g24
          qpfmul_hsubs,5        %r51, %g24, %r7, %r31
        }
        {
          loop_mode
          qpfmul_hadds,0        %r54, %r35, %r7, %r41
          qpfmul_hadds,1        %r48, %r30, %r7, %r30
          qpfmul_hsubs,2        %g31, %r30, %r7, %g31
          qpshufb,3     %g22, %g19, %r6, %g19
          qpshufb,4     %g28, %r29, %r6, %g22
          qpfmul_hsubs,5        %g30, %g17, %r7, %g17
        }
        {
          loop_mode
          nop 4
          qpshufb,0     %g23, %r28, %r6, %g23
          qpshufb,1     %g21, %g16, %r6, %g16
          qpfmul_hsubs,2        %r50, %r35, %r7, %g28
        }
        {
          loop_mode
          qppermb,3     %r33, %r9, %r3, %g21
          qppermb,4     %r34, %r37, %r3, %g30
        }
        {
          loop_mode
          qppermb,0     %r27, %r36, %r3, %r9
          qppermb,1     %g29, %g20, %r3, %g20
          qppermb,3     %r26, %g27, %r3, %g27
          qppermb,4     %g25, %g18, %r3, %g18
        }
        {
          loop_mode
          qppermb,0     %r39, %g26, %r3, %g25
          qppermb,1     %r32, %r38, %r3, %g26
          qppermb,3     %r40, %g17, %r3, %g17
          qppermb,4     %g24, %r31, %r3, %g24
          qpfsubs,5     %g19, %g18, %g29
        }
        {
          loop_mode
          qppermb,0     %r30, %g31, %r3, %g31
          qppermb,1     %r41, %g28, %r3, %g28
          qpfsubs,2     %g25, %g20, %r26
          qpfadds,3     %g22, %g27, %r27
          qpfadds,4     %g19, %g18, %g18
          qpfsubs,5     %g22, %g27, %g19
        }
        {
          loop_mode
          qpfsubs,0     %g26, %g21, %g22
          qpfsubs,1     %r9, %g31, %g27
          qpfsubs,2     %g28, %g30, %r28
          qpfadds,3     %g16, %g24, %r29
          qpfsubs,4     %g23, %g17, %r30
          qpfsubs,5     %g16, %g24, %g16
        }
        {
          loop_mode
          qpfadds,0     %g25, %g20, %g20
          qpfadds,1     %g26, %g21, %g21
          qpfadds,2     %r9, %g31, %g23
          qpfadds,3     %g23, %g17, %g17
        }
        {
          loop_mode
          nop 1
          qpfadds,0     %g28, %g30, %g24
        }
        {
          loop_mode
          qpshufb,1     %r26, %r26, %r5, %g25
        }
        {
          loop_mode
          qpshufb,0     %g22, %g22, %r5, %g22
          qpshufb,1     %g27, %g27, %r5, %g26
          qpfsubs,2     %r27, %g20, %g27
        }
        {
          loop_mode
          qpshufb,0     %r28, %r28, %r5, %g28
          qpxor,1       %g25, %r4, %g25
          qpfadds,2     %r27, %g20, %g20
        }
        {
          loop_mode
          qpxor,0       %g22, %r4, %g22
          qpxor,1       %g26, %r4, %g26
          qpfadds,2     %g18, %g21, %g30
          qpfsubs,3     %g18, %g21, %g18
          qpfadds,4     %g17, %g23, %g21
          qpfsubs,5     %g17, %g23, %g17
        }
        {
          loop_mode
          qpxor,0       %g28, %r4, %g23
          qpfadds,1     %r29, %g24, %g28
          qpfsubs,2     %r29, %g24, %g24
        }
        {
          loop_mode
          qpfsubs,0     %g19, %g25, %g31
          qpfadds,1     %g19, %g25, %g19
          qpfadds,2     %g29, %g22, %g25
        }
        {
          loop_mode
          qpfsubs,0     %g29, %g22, %g22
          qpfsubs,1     %r30, %g26, %g29
          qpfadds,2     %r30, %g26, %g26
        }
        {
          loop_mode
          qpfsubs,0     %g16, %g23, %r9
          qpfadds,1     %g16, %g23, %g16
          stqp,2        %r25, %r0, %g30
          stqp,5        %r23, %r0, %g27
        }
        {
          loop_mode
          stqp,2        %r2, %r0, %g20
          stqp,5        %r18, %r0, %g18
        }
        {
          loop_mode
          stqp,2        %r16, %r0, %g28
          stqp,5        %r19, %r0, %g17
        }
        {
          loop_mode
          stqp,2        %r22, %r0, %g21
          stqp,5        %r12, %r0, %g24
        }
        {
          loop_mode
          stqp,2        %r24, %r0, %g31
          stqp,5        %r17, %r0, %g19
        }
        {
          loop_mode
          stqp,2        %r15, %r0, %g22
          stqp,5        %r14, %r0, %g25
        }
        {
          loop_mode
          stqp,2        %r21, %r0, %g29
          stqp,5        %r11, %r0, %g26
        }
        {
          loop_mode
          ct    %ctpr1 ? %NOT_LOOP_END
          alc   alcf=1, alct=1
          addd,0,sm     0x10, %r0, %r0
          stqp,2        %r20, %r0, %r9
          stqp,5        %r13, %r0, %g16
        }

Theoretical speed: 32 complex numbers (8 bytes each) per 62 cycles: 32 × 8 / 62 ≈ 4.13 bytes/cycle
Quadrupled theoretical speed: 4 × 4.13 = 16.52 bytes/cycle
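The arithmetic behind these two figures can be reproduced with a few lines of C. This is a sketch only: the function name bytes_per_cycle is hypothetical, and the element size of 8 bytes assumes the float complex type from the problem statement.

```c
#include <stddef.h>

/* Bytes of input consumed per cycle by a loop body that processes
   `elements` values of `element_size` bytes in `cycles` cycles. */
double bytes_per_cycle(size_t elements, size_t element_size, unsigned cycles)
{
    return (double)(elements * element_size) / (double)cycles;
}
```

Calling bytes_per_cycle(32, 8, 62) gives about 4.13, and multiplying that value by 4 gives the quadrupled figure of about 16.52.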

Speed measurements

6. stage_radix4_2x_simd128_noConj_unroll2

Here the loop is unrolled by a factor of 2 using the unroll(2) pragma.
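As a reminder of what unrolling by 2 means, here is a hand-written sketch on a trivial loop. It is an illustration in portable C, not what lcc actually generates for the pragma; the functions scale_plain and scale_unrolled2 are hypothetical examples.

```c
#include <stddef.h>

/* Reference loop: one element per iteration. */
void scale_plain(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = k * src[i];
}

/* The same loop unrolled by hand by a factor of 2: each iteration
   handles two elements, so the loop overhead is paid half as often
   and the scheduler sees two independent bodies to interleave. */
void scale_unrolled2(float *dst, const float *src, float k, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        dst[i]     = k * src[i];
        dst[i + 1] = k * src[i + 1];
    }
    if (i < n)                      /* leftover element when n is odd */
        dst[i] = k * src[i];
}
```

Both functions produce identical results; only the shape of the loop body differs.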

C code
void stage_radix4_2x_simd128_noConj_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b)
{
	__v2di *xy0_in = (__v2di*)&data_in[ 0];
	__v2di *zw0_in = (__v2di*)&data_in[ 2];
	__v2di *xy1_in = (__v2di*)&data_in[ 4];
	__v2di *zw1_in = (__v2di*)&data_in[ 6];
	__v2di *xy2_in = (__v2di*)&data_in[ 8];
	__v2di *zw2_in = (__v2di*)&data_in[10];
	__v2di *xy3_in = (__v2di*)&data_in[12];
	__v2di *zw3_in = (__v2di*)&data_in[14];
	__v2di *xy4_in = (__v2di*)&data_in[16];
	__v2di *zw4_in = (__v2di*)&data_in[18];
	__v2di *xy5_in = (__v2di*)&data_in[20];
	__v2di *zw5_in = (__v2di*)&data_in[22];
	__v2di *xy6_in = (__v2di*)&data_in[24];
	__v2di *zw6_in = (__v2di*)&data_in[26];
	__v2di *xy7_in = (__v2di*)&data_in[28];
	__v2di *zw7_in = (__v2di*)&data_in[30];
	__v2di *c0a_in = (__v2di*)&coefC_a[0];
	__v2di *c1a_in = (__v2di*)&coefC_a[2];
	__v2di *c2a_in = (__v2di*)&coefC_a[4];
	__v2di *c3a_in = (__v2di*)&coefC_a[6];
	__v2di *d0a_in = (__v2di*)&coefD_a[0];
	__v2di *d1a_in = (__v2di*)&coefD_a[2];
	__v2di *d2a_in = (__v2di*)&coefD_a[4];
	__v2di *d3a_in = (__v2di*)&coefD_a[6];
	__v2di *e0a_in = (__v2di*)&coefE_a[0];
	__v2di *e1a_in = (__v2di*)&coefE_a[2];
	__v2di *e2a_in = (__v2di*)&coefE_a[4];
	__v2di *e3a_in = (__v2di*)&coefE_a[6];
	__v2di *c0b_in = (__v2di*)&coefC_b[0*data_count/16];
	__v2di *c1b_in = (__v2di*)&coefC_b[1*data_count/16];
	__v2di *c2b_in = (__v2di*)&coefC_b[2*data_count/16];
	__v2di *c3b_in = (__v2di*)&coefC_b[3*data_count/16];
	__v2di *d0b_in = (__v2di*)&coefD_b[0*data_count/16];
	__v2di *d1b_in = (__v2di*)&coefD_b[1*data_count/16];
	__v2di *d2b_in = (__v2di*)&coefD_b[2*data_count/16];
	__v2di *d3b_in = (__v2di*)&coefD_b[3*data_count/16];
	__v2di *e0b_in = (__v2di*)&coefE_b[0*data_count/16];
	__v2di *e1b_in = (__v2di*)&coefE_b[1*data_count/16];
	__v2di *e2b_in = (__v2di*)&coefE_b[2*data_count/16];
	__v2di *e3b_in = (__v2di*)&coefE_b[3*data_count/16];

	__v2di *out_0  = (__v2di*)&data_out[ 0*data_count/16];
	__v2di *out_1  = (__v2di*)&data_out[ 1*data_count/16];
	__v2di *out_2  = (__v2di*)&data_out[ 2*data_count/16];
	__v2di *out_3  = (__v2di*)&data_out[ 3*data_count/16];
	__v2di *out_4  = (__v2di*)&data_out[ 4*data_count/16];
	__v2di *out_5  = (__v2di*)&data_out[ 5*data_count/16];
	__v2di *out_6  = (__v2di*)&data_out[ 6*data_count/16];
	__v2di *out_7  = (__v2di*)&data_out[ 7*data_count/16];
	__v2di *out_8  = (__v2di*)&data_out[ 8*data_count/16];
	__v2di *out_9  = (__v2di*)&data_out[ 9*data_count/16];
	__v2di *out_10 = (__v2di*)&data_out[10*data_count/16];
	__v2di *out_11 = (__v2di*)&data_out[11*data_count/16];
	__v2di *out_12 = (__v2di*)&data_out[12*data_count/16];
	__v2di *out_13 = (__v2di*)&data_out[13*data_count/16];
	__v2di *out_14 = (__v2di*)&data_out[14*data_count/16];
	__v2di *out_15 = (__v2di*)&data_out[15*data_count/16];

	#pragma ivdep
	#pragma unroll(2)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/32; ++i)
	{
		/* Load 16 input vectors (32 complex numbers) and 12 coefficient vectors for this iteration. */
		__v2di xy0 = xy0_in[16*i];
		__v2di zw0 = zw0_in[16*i];
		__v2di xy1 = xy1_in[16*i];
		__v2di zw1 = zw1_in[16*i];
		__v2di c0  = c0a_in[4*i];
		__v2di d0  = d0a_in[4*i];
		__v2di e0  = e0a_in[4*i];

		__v2di xy2 = xy2_in[16*i];
		__v2di zw2 = zw2_in[16*i];
		__v2di xy3 = xy3_in[16*i];
		__v2di zw3 = zw3_in[16*i];
		__v2di c1  = c1a_in[4*i];
		__v2di d1  = d1a_in[4*i];
		__v2di e1  = e1a_in[4*i];

		__v2di xy4 = xy4_in[16*i];
		__v2di zw4 = zw4_in[16*i];
		__v2di xy5 = xy5_in[16*i];
		__v2di zw5 = zw5_in[16*i];
		__v2di c2  = c2a_in[4*i];
		__v2di d2  = d2a_in[4*i];
		__v2di e2  = e2a_in[4*i];

		__v2di xy6 = xy6_in[16*i];
		__v2di zw6 = zw6_in[16*i];
		__v2di xy7 = xy7_in[16*i];
		__v2di zw7 = zw7_in[16*i];
		__v2di c3  = c3a_in[4*i];
		__v2di d3  = d3a_in[4*i];
		__v2di e3  = e3a_in[4*i];

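		/* Split the loaded pairs: x and z take bytes 0..7 of each source register, y and w take bytes 8..15. */
		__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});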
		__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		/* Swap re and im inside every complex coefficient. */
		__v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

		/* Element-wise products: c*y yields (re*re, im*im), swap_c*y yields (im*re, re*im). */
		__v2di cy0_real = __builtin_e2k_qpfmuls(     c0, y0);
		__v2di cy1_real = __builtin_e2k_qpfmuls(     c1, y1);
		__v2di cy2_real = __builtin_e2k_qpfmuls(     c2, y2);
		__v2di cy3_real = __builtin_e2k_qpfmuls(     c3, y3);
		__v2di dz0_real = __builtin_e2k_qpfmuls(     d0, z0);
		__v2di dz1_real = __builtin_e2k_qpfmuls(     d1, z1);
		__v2di dz2_real = __builtin_e2k_qpfmuls(     d2, z2);
		__v2di dz3_real = __builtin_e2k_qpfmuls(     d3, z3);
		__v2di ew0_real = __builtin_e2k_qpfmuls(     e0, w0);
		__v2di ew1_real = __builtin_e2k_qpfmuls(     e1, w1);
		__v2di ew2_real = __builtin_e2k_qpfmuls(     e2, w2);
		__v2di ew3_real = __builtin_e2k_qpfmuls(     e3, w3);
		__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
		__v2di cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
		__v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
		__v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
		__v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
		__v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
		__v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
		__v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
		__v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
		__v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
		__v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);

		// Horizontal subtract gives re*re - im*im, horizontal add gives im*re + re*im.
		__v2di cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);
		__v2di cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
		__v2di cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real);
		__v2di cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real);
		__v2di dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real);
		__v2di dz1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz1_real);
		__v2di dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real);
		__v2di dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real);
		__v2di ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real);
		__v2di ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real);
		__v2di ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real);
		__v2di ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real);
		__v2di cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
		__v2di cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);
		__v2di cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag);
		__v2di cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag);
		__v2di dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag);
		__v2di dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag);
		__v2di dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag);
		__v2di dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag);
		__v2di ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag);
		__v2di ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag);
		__v2di ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag);
		__v2di ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag);

		// Interleave the real and imaginary results back into complex numbers.
		__v2di cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di ew0 = __builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});

		// Radix-4 butterfly, first level: sums and differences of (x, d*z) and of (c*y, e*w).
		__v2di add02_0 = __builtin_e2k_qpfadds( x0, dz0);
		__v2di add02_1 = __builtin_e2k_qpfadds( x1, dz1);
		__v2di add02_2 = __builtin_e2k_qpfadds( x2, dz2);
		__v2di add02_3 = __builtin_e2k_qpfadds( x3, dz3);
		__v2di sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
		__v2di sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
		__v2di sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
		__v2di sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
		__v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
		__v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
		__v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
		__v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
		__v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
		__v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
		__v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
		__v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);

		// Multiply (c*y - e*w) by i: swap re/im, then flip the sign bit of the new real part.
		__v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});

		// Radix-4 butterfly, second level: combine into the four outputs.
		__v2di out0  = __builtin_e2k_qpfadds(add02_0, add13_0);
		__v2di out1  = __builtin_e2k_qpfadds(add02_1, add13_1);
		__v2di out2  = __builtin_e2k_qpfadds(add02_2, add13_2);
		__v2di out3  = __builtin_e2k_qpfadds(add02_3, add13_3);
		__v2di out4  = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
		__v2di out5  = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
		__v2di out6  = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
		__v2di out7  = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
		__v2di out8  = __builtin_e2k_qpfsubs(add02_0, add13_0);
		__v2di out9  = __builtin_e2k_qpfsubs(add02_1, add13_1);
		__v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2);
		__v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3);
		__v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
		__v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
		__v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
		__v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3);


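		// Second pass: the first-pass outputs become the inputs, and a new set of twiddle factors is loaded.
		xy0 = out0;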
		zw0 = out1;
		xy1 = out2;
		zw1 = out3;
		c0  = c0b_in[i];
		d0  = d0b_in[i];
		e0  = e0b_in[i];

		xy2 = out4;
		zw2 = out5;
		xy3 = out6;
		zw3 = out7;
		c1  = c1b_in[i];
		d1  = d1b_in[i];
		e1  = e1b_in[i];

		xy4 = out8;
		zw4 = out9;
		xy5 = out10;
		zw5 = out11;
		c2  = c2b_in[i];
		d2  = d2b_in[i];
		e2  = e2b_in[i];

		xy6 = out12;
		zw6 = out13;
		xy7 = out14;
		zw7 = out15;
		c3  = c3b_in[i];
		d3  = d3b_in[i];
		e3  = e3b_in[i];

		// Split each xy/zw register pair into separate x, y, z and w vectors.
		x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
		w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
		w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
		y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
		w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
		y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
		w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		// Same complex multiplications and butterfly as above, now on the second-pass data.
		swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

		cy0_real = __builtin_e2k_qpfmuls(     c0, y0);
		cy1_real = __builtin_e2k_qpfmuls(     c1, y1);
		cy2_real = __builtin_e2k_qpfmuls(     c2, y2);
		cy3_real = __builtin_e2k_qpfmuls(     c3, y3);
		dz0_real = __builtin_e2k_qpfmuls(     d0, z0);
		dz1_real = __builtin_e2k_qpfmuls(     d1, z1);
		dz2_real = __builtin_e2k_qpfmuls(     d2, z2);
		dz3_real = __builtin_e2k_qpfmuls(     d3, z3);
		ew0_real = __builtin_e2k_qpfmuls(     e0, w0);
		ew1_real = __builtin_e2k_qpfmuls(     e1, w1);
		ew2_real = __builtin_e2k_qpfmuls(     e2, w2);
		ew3_real = __builtin_e2k_qpfmuls(     e3, w3);
		cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
		cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
		cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
		dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
		dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
		dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
		dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
		ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
		ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
		ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
		ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);

		cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);
		cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
		cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real);
		cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real);
		dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real);
		dz1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz1_real);
		dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real);
		dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real);
		ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real);
		ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real);
		ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real);
		ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real);
		cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
		cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);
		cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag);
		cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag);
		dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag);
		dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag);
		dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag);
		dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag);
		ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag);
		ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag);
		ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag);
		ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag);

		cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		ew0 = __builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});

		add02_0 = __builtin_e2k_qpfadds( x0, dz0);
		add02_1 = __builtin_e2k_qpfadds( x1, dz1);
		add02_2 = __builtin_e2k_qpfadds( x2, dz2);
		add02_3 = __builtin_e2k_qpfadds( x3, dz3);
		sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
		sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
		sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
		sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
		add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
		add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
		add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
		add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
		sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
		sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
		sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
		sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);

		swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
		sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
		sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
		sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});

		// Store the second-pass butterfly outputs.
		out_0[i]  = __builtin_e2k_qpfadds(add02_0, add13_0);
		out_1[i]  = __builtin_e2k_qpfadds(add02_1, add13_1);
		out_2[i]  = __builtin_e2k_qpfadds(add02_2, add13_2);
		out_3[i]  = __builtin_e2k_qpfadds(add02_3, add13_3);
		out_4[i]  = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
		out_5[i]  = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
		out_6[i]  = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
		out_7[i]  = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
		out_8[i]  = __builtin_e2k_qpfsubs(add02_0, add13_0);
		out_9[i]  = __builtin_e2k_qpfsubs(add02_1, add13_1);
		out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2);
		out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3);
		out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
		out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
		out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
		out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
	}
}
Main loop in assembly
.L19508:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=64
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=128
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=160
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=192
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=224
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=256
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=288
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=320
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=352
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=384
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=416
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=448
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=480
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=8, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=10, disp=64
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=10, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=12, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=14, disp=64
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=14, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=16, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=18, disp=64
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=18, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=1, incr=2, ind=0, asz=1, abs=20, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=15, asz=1, abs=20, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=14, asz=1, abs=22, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=13, asz=1, abs=22, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=12, asz=1, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=11, asz=1, abs=24, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=10, asz=1, abs=26, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=9, asz=1, abs=26, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=8, asz=1, abs=28, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=7, asz=1, abs=28, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=6, asz=1, abs=30, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=5, asz=1, abs=30, disp=0
        }
.L15504:
        {
          loop_mode
          disp  %ctpr1, .L15504
          movaqp,0      area=0, ind=0, am=1, be=0, %g17
          movaqp,1      area=0, ind=16, am=0, be=0, %g16
          movaqp,2      area=0, ind=0, am=1, be=0, %g19
          movaqp,3      area=0, ind=16, am=0, be=0, %g18
        }
        {
          loop_mode
          movaqp,0      area=1, ind=0, am=1, be=0, %g21
          movaqp,1      area=1, ind=16, am=0, be=0, %g20
          movaqp,2      area=1, ind=0, am=1, be=0, %g23
          movaqp,3      area=1, ind=16, am=0, be=0, %g22
        }
        {
          loop_mode
          movaqp,0      area=2, ind=0, am=1, be=0, %g25
          movaqp,1      area=2, ind=16, am=0, be=0, %g24
          movaqp,2      area=2, ind=0, am=1, be=0, %g27
          movaqp,3      area=2, ind=16, am=0, be=0, %g26
        }
        {
          loop_mode
          movaqp,0      area=3, ind=0, am=1, be=0, %g29
          movaqp,1      area=3, ind=16, am=0, be=0, %g28
          movaqp,2      area=3, ind=0, am=1, be=0, %g31
          movaqp,3      area=3, ind=16, am=0, be=0, %g30
        }
        {
          loop_mode
          movaqp,0      area=4, ind=0, am=1, be=0, %b[20]
          movaqp,1      area=4, ind=16, am=0, be=0, %b[19]
          movaqp,2      area=4, ind=0, am=1, be=0, %b[22]
          movaqp,3      area=4, ind=16, am=0, be=0, %b[21]
        }
        {
          loop_mode
          qpshufb,0     %g19, %g17, %r7, %b[27]
          qpshufb,1     %g18, %g16, %r7, %b[28]
          qpshufb,3     %g18, %g16, %r6, %g16
          qpshufb,4     %g19, %g17, %r6, %g17
          movaqp,0      area=5, ind=0, am=1, be=0, %b[24]
          movaqp,1      area=5, ind=16, am=0, be=0, %b[23]
          movaqp,2      area=5, ind=0, am=1, be=0, %b[26]
          movaqp,3      area=5, ind=16, am=0, be=0, %b[25]
        }
        {
          loop_mode
          qpshufb,0     %g23, %g21, %r7, %b[31]
          qpshufb,1     %g22, %g20, %r7, %b[32]
          qpshufb,3     %g22, %g20, %r6, %g20
          qpshufb,4     %g23, %g21, %r6, %g21
          movaqp,0      area=6, ind=0, am=1, be=0, %g19
          movaqp,1      area=6, ind=16, am=0, be=0, %g18
          movaqp,2      area=6, ind=0, am=1, be=0, %b[30]
          movaqp,3      area=6, ind=16, am=0, be=0, %b[29]
        }
        {
          loop_mode
          qpshufb,0     %g27, %g25, %r7, %b[35]
          qpshufb,1     %g26, %g24, %r7, %b[36]
          qpshufb,3     %g26, %g24, %r6, %g24
          qpshufb,4     %g27, %g25, %r6, %g25
          movaqp,0      area=7, ind=0, am=1, be=0, %g23
          movaqp,1      area=7, ind=16, am=0, be=0, %g22
          movaqp,2      area=7, ind=0, am=1, be=0, %b[34]
          movaqp,3      area=7, ind=16, am=0, be=0, %b[33]
        }
        {
          loop_mode
          qpshufb,0     %g31, %g29, %r7, %b[39]
          qpshufb,1     %g30, %g28, %r7, %b[40]
          qpshufb,3     %g30, %g28, %r6, %g28
          qpshufb,4     %g31, %g29, %r6, %g29
          movaqp,0      area=8, ind=0, am=1, be=0, %g27
          movaqp,1      area=8, ind=16, am=0, be=0, %g26
          movaqp,2      area=8, ind=0, am=1, be=0, %b[38]
          movaqp,3      area=8, ind=16, am=0, be=0, %b[37]
        }
        {
          loop_mode
          qpshufb,0     %b[22], %b[20], %r7, %b[43]
          qpshufb,1     %b[21], %b[19], %r7, %b[44]
          qpshufb,3     %b[21], %b[19], %r6, %b[19]
          qpshufb,4     %b[22], %b[20], %r6, %b[20]
          movaqp,0      area=9, ind=0, am=1, be=0, %g31
          movaqp,1      area=9, ind=16, am=0, be=0, %g30
          movaqp,2      area=9, ind=0, am=1, be=0, %b[42]
          movaqp,3      area=9, ind=16, am=0, be=0, %b[41]
        }
        {
          loop_mode
          qpshufb,0     %b[26], %b[24], %r7, %b[47]
          qpshufb,1     %b[25], %b[23], %r7, %b[48]
          qpshufb,3     %b[25], %b[23], %r6, %b[23]
          qpshufb,4     %b[26], %b[24], %r6, %b[24]
          movaqp,0      area=10, ind=0, am=1, be=0, %b[22]
          movaqp,1      area=10, ind=16, am=0, be=0, %b[21]
          movaqp,2      area=10, ind=0, am=1, be=0, %b[46]
          movaqp,3      area=10, ind=16, am=0, be=0, %b[45]
        }
        {
          loop_mode
          qpshufb,0     %b[30], %g19, %r7, %b[51]
          qpshufb,1     %b[29], %g18, %r7, %b[52]
          qpshufb,3     %b[29], %g18, %r6, %g18
          qpshufb,4     %b[30], %g19, %r6, %g19
          movaqp,0      area=11, ind=0, am=1, be=0, %b[26]
          movaqp,1      area=11, ind=16, am=0, be=0, %b[25]
          movaqp,2      area=11, ind=0, am=1, be=0, %b[50]
          movaqp,3      area=11, ind=16, am=0, be=0, %b[49]
        }
        {
          loop_mode
          qpshufb,0     %b[34], %g23, %r7, %b[55]
          qpshufb,1     %b[33], %g22, %r7, %b[56]
          qpshufb,3     %b[33], %g22, %r6, %g22
          qpshufb,4     %b[34], %g23, %r6, %g23
          movaqp,0      area=12, ind=0, am=1, be=0, %b[30]
          movaqp,1      area=12, ind=16, am=0, be=0, %b[29]
          movaqp,2      area=12, ind=0, am=1, be=0, %b[54]
          movaqp,3      area=12, ind=16, am=0, be=0, %b[53]
        }
        {
          loop_mode
          qpshufb,0     %g27, %g27, %r23, %b[59]
          qpshufb,1     %g26, %g26, %r23, %b[60]
          qpfmul_hsubs,2        %g27, %b[27], %r40, %g27
          qpshufb,3     %b[38], %b[38], %r23, %b[61]
          qpshufb,4     %b[37], %b[37], %r23, %b[62]
          qpfmul_hsubs,5        %g26, %b[31], %r40, %g26
          movaqp,0      area=13, ind=0, am=1, be=0, %b[34]
          movaqp,1      area=13, ind=16, am=0, be=0, %b[33]
          movaqp,2      area=13, ind=0, am=1, be=0, %b[58]
          movaqp,3      area=13, ind=16, am=0, be=0, %b[57]
        }
        {
          loop_mode
          qpshufb,0     %g31, %g31, %r23, %b[63]
          qpshufb,1     %g30, %g30, %r23, %b[64]
          qpfmul_hsubs,2        %b[38], %b[35], %r40, %b[38]
          qpshufb,3     %b[42], %b[42], %r23, %b[65]
          qpshufb,4     %b[41], %b[41], %r23, %b[66]
          qpfmul_hsubs,5        %b[37], %b[39], %r40, %b[37]
          movaqp,0      area=14, ind=0, am=1, be=0, %b[68]
          movaqp,1      area=14, ind=16, am=0, be=0, %b[67]
          movaqp,2      area=14, ind=0, am=1, be=0, %b[70]
          movaqp,3      area=14, ind=16, am=0, be=0, %b[69]
        }
        {
          loop_mode
          qpfmul_hsubs,0        %g31, %b[43], %r40, %g31
          qpfmul_hsubs,1        %g30, %b[47], %r40, %g30
          qpfmul_hsubs,2        %b[42], %b[51], %r40, %b[42]
          qpfmul_hadds,3        %b[61], %b[35], %r40, %b[35]
          qpfmul_hadds,4        %b[62], %b[39], %r40, %b[39]
          qpfmul_hadds,5        %b[65], %b[51], %r40, %b[51]
          movaqp,0      area=15, ind=0, am=1, be=0, %b[62]
          movaqp,1      area=15, ind=16, am=0, be=0, %b[61]
          movaqp,2      area=15, ind=0, am=1, be=0, %b[71]
          movaqp,3      area=15, ind=16, am=0, be=0, %b[65]
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[59], %b[27], %r40, %b[27]
          qpfmul_hadds,1        %b[64], %b[47], %r40, %b[47]
          qpfmul_hadds,2        %b[60], %b[31], %r40, %b[31]
          qpfmul_hadds,3        %b[66], %b[55], %r40, %b[55]
          qpshufb,4     %b[21], %b[21], %r23, %b[59]
          qpfmul_hsubs,5        %b[41], %b[55], %r40, %b[41]
          movaqp,0      area=16, ind=0, am=1, be=0, %b[64]
          movaqp,1      area=16, ind=16, am=0, be=0, %b[60]
          movaqp,2      area=16, ind=0, am=1, be=0, %b[72]
          movaqp,3      area=16, ind=16, am=0, be=0, %b[66]
        }
        {
          loop_mode
          qpshufb,0     %b[30], %b[30], %r23, %b[73]
          qpshufb,1     %b[29], %b[29], %r23, %b[74]
          qpfmul_hsubs,2        %b[29], %b[32], %r40, %b[29]
          qpshufb,3     %b[54], %b[54], %r23, %b[75]
          qpshufb,4     %b[53], %b[53], %r23, %b[76]
          qpfmul_hsubs,5        %b[30], %b[28], %r40, %b[30]
          movaqp,0      area=17, ind=0, am=1, be=0, %b[78]
          movaqp,1      area=17, ind=16, am=0, be=0, %b[77]
          movaqp,2      area=17, ind=0, am=1, be=0, %b[80]
          movaqp,3      area=17, ind=16, am=0, be=0, %b[79]
        }
        {
          loop_mode
          qpshufb,0     %b[34], %b[34], %r23, %b[81]
          qpshufb,1     %b[33], %b[33], %r23, %b[82]
          qpfmul_hadds,2        %b[63], %b[43], %r40, %b[43]
          qpshufb,3     %b[58], %b[58], %r23, %b[83]
          qpshufb,4     %b[57], %b[57], %r23, %b[84]
          qpfmul_hsubs,5        %b[54], %b[36], %r40, %b[54]
          movaqp,0      area=18, ind=0, am=1, be=0, %b[85]
          movaqp,1      area=18, ind=16, am=0, be=0, %b[63]
          movaqp,2      area=18, ind=0, am=1, be=0, %b[87]
          movaqp,3      area=18, ind=16, am=0, be=0, %b[86]
        }
        {
          loop_mode
          qpfmul_hsubs,0        %b[53], %b[40], %r40, %b[53]
          qpfmul_hsubs,1        %b[34], %b[44], %r40, %b[34]
          qpfmul_hsubs,2        %b[33], %b[48], %r40, %b[33]
          qpfmul_hsubs,3        %b[58], %b[52], %r40, %b[58]
          qpfmul_hsubs,4        %b[57], %b[56], %r40, %b[57]
          qpfmul_hadds,5        %b[75], %b[36], %r40, %b[36]
          movaqp,0      area=19, ind=0, am=1, be=0, %b[88]
          movaqp,1      area=19, ind=16, am=0, be=0, %b[75]
          movaqp,2      area=19, ind=0, am=1, be=0, %b[90]
          movaqp,3      area=19, ind=16, am=0, be=0, %b[89]
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[73], %b[28], %r40, %b[28]
          qpfmul_hadds,1        %b[74], %b[32], %r40, %b[32]
          qpfmul_hadds,2        %b[82], %b[48], %r40, %b[48]
          qpfmul_hadds,3        %b[84], %b[56], %r40, %b[56]
          qpfmul_hadds,4        %b[83], %b[52], %r40, %b[52]
          qpfmul_hadds,5        %b[76], %b[40], %r40, %b[40]
        }
        {
          loop_mode
          qpfmul_hsubs,0        %b[21], %g20, %r40, %b[21]
          qpfmul_hsubs,1        %b[22], %g16, %r40, %b[73]
          qpfmul_hadds,2        %b[81], %b[44], %r40, %b[44]
          qpfmul_hsubs,3        %b[46], %g24, %r40, %b[74]
          qpfmul_hsubs,4        %b[45], %g28, %r40, %b[76]
          qpfmul_hsubs,5        %b[25], %b[23], %r40, %b[81]
        }
        {
          loop_mode
          qpfmul_hsubs,0        %b[50], %g18, %r40, %b[82]
          qpfmul_hsubs,1        %b[49], %g22, %r40, %b[83]
          qpfmul_hsubs,2        %b[26], %b[19], %r40, %b[59]
          qpshufb,4     %b[22], %b[22], %r23, %b[22]
          qpfmul_hadds,5        %b[59], %g20, %r40, %g20
        }
        {
          loop_mode
          qpshufb,0     %b[46], %b[46], %r23, %b[46]
          qpshufb,1     %b[45], %b[45], %r23, %b[45]
          qpshufb,3     %b[26], %b[26], %r23, %b[26]
          qpshufb,4     %b[25], %b[25], %r23, %b[25]
          qpfmul_hadds,5        %b[22], %g16, %r40, %g16
        }
        {
          loop_mode
          qpshufb,0     %b[50], %b[50], %r23, %b[22]
          qpshufb,1     %b[49], %b[49], %r23, %b[49]
          qpfmul_hadds,2        %b[45], %g28, %r40, %g28
          qpfmul_hadds,3        %b[25], %b[23], %r40, %b[23]
          qppermb,4     %b[39], %b[37], %r24, %b[25]
          qpfmul_hadds,5        %b[26], %b[19], %r40, %b[19]
        }
        {
          loop_mode
          nop 2
          qpfmul_hadds,0        %b[22], %g18, %r40, %g18
          qpfmul_hadds,1        %b[49], %g22, %r40, %g22
          qpfmul_hadds,2        %b[46], %g24, %r40, %g24
        }
        {
          loop_mode
          qppermb,3     %b[27], %g27, %r24, %g27
          qppermb,4     %b[31], %g26, %r24, %g26
        }
        {
          loop_mode
          qppermb,0     %b[35], %b[38], %r24, %b[22]
          qppermb,1     %b[51], %b[42], %r24, %b[26]
          qppermb,3     %b[47], %g30, %r24, %g30
          qppermb,4     %b[55], %b[41], %r24, %b[27]
        }
        {
          loop_mode
          qppermb,0     %b[32], %b[29], %r24, %b[29]
          qppermb,1     %b[43], %g31, %r24, %g31
          qppermb,4     %b[28], %b[30], %r24, %b[28]
        }
        {
          loop_mode
          qppermb,3     %b[40], %b[53], %r24, %b[30]
          qppermb,4     %b[48], %b[33], %r24, %b[31]
          qpfsubs,5     %g27, %b[28], %b[32]
        }
        {
          loop_mode
          qppermb,0     %b[56], %b[57], %r24, %b[33]
          qppermb,1     %b[36], %b[54], %r24, %b[35]
          qpfsubs,2     %g26, %b[29], %b[37]
          qppermb,3     %b[44], %b[34], %r24, %b[34]
          qppermb,4     %b[52], %b[58], %r24, %b[36]
          qpfsubs,5     %b[25], %b[30], %b[38]
        }
        {
          loop_mode
          qpfsubs,0     %b[27], %b[33], %b[42]
          qppermb,1     %g16, %b[73], %r24, %g16
          qpfsubs,2     %b[22], %b[35], %b[39]
          qpfsubs,3     %b[26], %b[36], %b[41]
          qppermb,4     %g20, %b[21], %r24, %g20
          qpfsubs,5     %g30, %b[31], %b[40]
        }
        {
          loop_mode
          qppermb,0     %g28, %b[76], %r24, %g28
          qppermb,1     %g24, %b[74], %r24, %g24
          qpfadds,2     %g27, %b[28], %g27
          qppermb,3     %b[23], %b[81], %r24, %b[23]
          qppermb,4     %b[19], %b[59], %r24, %b[19]
          qpfsubs,5     %g31, %b[34], %b[21]
        }
        {
          loop_mode
          qpfadds,0     %g26, %b[29], %g26
          qppermb,1     %g22, %b[83], %r24, %g22
          qpfadds,2     %b[25], %b[30], %b[25]
          qpfadds,3     %g30, %b[31], %g30
          qppermb,4     %g18, %b[82], %r24, %g18
          qpfsubs,5     %g21, %g20, %b[28]
        }
        {
          loop_mode
          qpfadds,0     %b[27], %b[33], %b[27]
          qpfadds,1     %g17, %g16, %b[29]
          qpfsubs,2     %g17, %g16, %g16
          qpfadds,3     %b[22], %b[35], %g17
          qpfadds,4     %g31, %b[34], %g31
          qpfadds,5     %b[26], %b[36], %b[22]
        }
        {
          loop_mode
          qpfadds,0     %g21, %g20, %g20
          qpfadds,1     %g29, %g28, %g21
          qpfsubs,2     %g29, %g28, %g28
          qpfadds,3     %b[24], %b[23], %g29
          qpfsubs,4     %b[24], %b[23], %b[23]
          qpfadds,5     %b[20], %b[19], %b[24]
        }
        {
          loop_mode
          qpfadds,0     %g25, %g24, %b[26]
          qpfsubs,1     %g25, %g24, %g24
          qpfsubs,2     %b[20], %b[19], %g25
          qpfsubs,3     %g19, %g18, %g18
          qpfadds,5     %g19, %g18, %b[19]
        }
        {
          loop_mode
          qpfsubs,2     %g23, %g22, %g19
          qpfadds,5     %g23, %g22, %g22
        }
        {
          loop_mode
          qpfadds,0     %b[29], %g27, %g27
          qpfsubs,2     %b[29], %g27, %b[20]
          qpshufb,4     %b[32], %b[32], %r23, %g23
        }
        {
          loop_mode
          qpshufb,0     %b[37], %b[37], %r23, %b[29]
          qpshufb,1     %b[38], %b[38], %r23, %b[30]
          qpfsubs,2     %g20, %g26, %b[33]
          qpshufb,3     %b[40], %b[40], %r23, %b[31]
          qpshufb,4     %b[42], %b[42], %r23, %b[32]
          qpfadds,5     %g29, %g30, %b[34]
        }
        {
          loop_mode
          qpfadds,0     %g20, %g26, %g20
          qpshufb,1     %b[39], %b[39], %r23, %b[35]
          qpfadds,2     %b[26], %g17, %g26
          qpshufb,3     %b[21], %b[21], %r23, %b[21]
          qpshufb,4     %b[41], %b[41], %r23, %b[36]
          qpfsubs,5     %g29, %g30, %g29
        }
        {
          loop_mode
          qpxor,0       %b[29], %r22, %g30
          qpxor,1       %b[30], %r22, %b[29]
          qpfadds,2     %g21, %b[25], %b[31]
          qpxor,3       %g23, %r22, %g23
          qpxor,4       %b[31], %r22, %b[30]
          qpfsubs,5     %g21, %b[25], %g21
        }
        {
          loop_mode
          qpfsubs,0     %b[26], %g17, %g17
          qpxor,1       %b[35], %r22, %b[32]
          qpfsubs,2     %b[24], %g31, %b[26]
          qpxor,3       %b[32], %r22, %b[25]
          qpxor,4       %b[21], %r22, %b[21]
          qpfadds,5     %b[24], %g31, %g31
        }
        {
          loop_mode
          qpfsubs,0     %g22, %b[27], %b[35]
          qpfadds,1     %b[19], %b[22], %b[36]
          qpfadds,2     %g22, %b[27], %g22
          qpxor,3       %b[36], %r22, %b[24]
          qpfsubs,4     %b[19], %b[22], %b[19]
          qpfsubs,5     %g16, %g23, %b[22]
        }
        {
          loop_mode
          qpfsubs,0     %b[28], %g30, %b[27]
          qpfadds,1     %b[28], %g30, %g30
          qpfadds,2     %g28, %b[29], %g23
          qpfadds,3     %g16, %g23, %g16
          qpfsubs,4     %b[23], %b[30], %b[28]
          qpfadds,5     %b[23], %b[30], %b[23]
        }
        {
          loop_mode
          qpfsubs,0     %g28, %b[29], %g28
          qpfsubs,1     %g24, %b[32], %b[25]
          qpfadds,2     %g24, %b[32], %g24
          qpfadds,3     %g19, %b[25], %b[30]
          qpfsubs,4     %g19, %b[25], %g19
          qpfsubs,5     %g25, %b[21], %b[29]
        }
        {
          loop_mode
          nop 1
          qpfadds,2     %g25, %b[21], %g25
          qpfsubs,3     %g18, %b[24], %g18
          qpfadds,5     %g18, %b[24], %b[21]
        }
        {
          loop_mode
          qpshufb,0     %b[68], %b[68], %r23, %b[24]
          qpshufb,1     %b[67], %b[67], %r23, %b[32]
          qpshufb,4     %b[62], %b[62], %r23, %b[37]
        }
        {
          loop_mode
          qpshufb,0     %b[65], %b[65], %r23, %b[38]
          qpshufb,1     %b[61], %b[61], %r23, %b[39]
          qpshufb,3     %b[71], %b[71], %r23, %b[40]
          qpshufb,4     %b[72], %b[72], %r23, %b[41]
        }
        {
          loop_mode
          qpshufb,0     %b[77], %b[77], %r23, %b[42]
          qpshufb,1     %b[66], %b[66], %r23, %b[43]
          qpshufb,3     %b[78], %b[78], %r23, %b[44]
          qpshufb,4     %b[85], %b[85], %r23, %b[45]
        }
        {
          loop_mode
          qpshufb,0     %b[86], %b[86], %r23, %b[46]
          qpshufb,1     %b[63], %b[63], %r23, %b[47]
          qpshufb,3     %b[87], %b[87], %r23, %b[48]
          qpshufb,4     %b[90], %b[90], %r23, %b[49]
        }
        {
          loop_mode
          qpshufb,0     %b[89], %b[89], %r23, %b[50]
          qpshufb,1     %g21, %b[33], %r7, %b[51]
          qpshufb,3     %g26, %g27, %r7, %b[52]
          qpshufb,4     %b[31], %g20, %r7, %b[53]
        }
        {
          loop_mode
          qpshufb,0     %g17, %b[20], %r7, %b[54]
          qpshufb,1     %b[35], %g29, %r7, %b[55]
          qpfmul_hsubs,2        %b[85], %b[51], %r40, %b[58]
          qpshufb,3     %b[36], %g31, %r7, %b[56]
          qpshufb,4     %g22, %b[34], %r7, %b[57]
          qpfmul_hadds,5        %b[37], %b[53], %r40, %b[37]
        }
        {
          loop_mode
          qpshufb,0     %b[19], %b[26], %r7, %b[59]
          qpshufb,1     %g23, %g30, %r7, %b[73]
          qpfmul_hadds,2        %b[45], %b[51], %r40, %b[45]
          qpshufb,3     %g28, %b[27], %r7, %b[74]
          qpshufb,4     %g24, %g16, %r7, %b[76]
          qpfmul_hsubs,5        %b[62], %b[53], %r40, %b[51]
        }
        {
          loop_mode
          qpshufb,0     %g19, %b[28], %r7, %b[53]
          qpshufb,1     %b[30], %b[23], %r7, %b[62]
          qpfmul_hadds,2        %b[44], %b[54], %r40, %b[44]
          qpshufb,3     %b[25], %b[22], %r7, %b[81]
          qpshufb,4     %g18, %b[29], %r7, %b[82]
          qpfmul_hadds,5        %b[24], %b[52], %r40, %b[24]
        }
        {
          loop_mode
          qpshufb,0     %b[21], %g25, %r7, %b[83]
          qpfmul_hsubs,1        %b[68], %b[52], %r40, %b[52]
          qpfmul_hsubs,2        %b[63], %b[55], %r40, %b[63]
          qpfmul_hsubs,3        %b[67], %b[56], %r40, %b[67]
          qpfmul_hadds,4        %b[32], %b[56], %r40, %b[32]
          qpfmul_hadds,5        %b[39], %b[57], %r40, %b[39]
        }
        {
          loop_mode
          qpfmul_hsubs,0        %b[78], %b[54], %r40, %b[54]
          qpfmul_hadds,1        %b[47], %b[55], %r40, %b[47]
          qpfmul_hadds,2        %b[49], %b[73], %r40, %b[49]
          qpfmul_hsubs,3        %b[61], %b[57], %r40, %b[55]
          qpfmul_hsubs,4        %b[87], %b[76], %r40, %b[56]
          qpfmul_hsubs,5        %b[72], %b[74], %r40, %b[57]
        }
        {
          loop_mode
          qpfmul_hsubs,0        %b[77], %b[59], %r40, %b[61]
          qpfmul_hadds,1        %b[42], %b[59], %r40, %b[42]
          qpfmul_hsubs,2        %b[90], %b[73], %r40, %b[59]
          qpfmul_hadds,3        %b[48], %b[76], %r40, %b[48]
          qpfmul_hadds,4        %b[41], %b[74], %r40, %b[41]
          qpfmul_hsubs,5        %b[71], %b[81], %r40, %b[68]
        }
        {
          loop_mode
          qpfmul_hsubs,0        %b[66], %b[53], %r40, %b[66]
          qpfmul_hadds,1        %b[43], %b[53], %r40, %b[43]
          qpfmul_hsubs,2        %b[89], %b[62], %r40, %b[53]
          qpfmul_hadds,3        %b[50], %b[62], %r40, %b[50]
          qpfmul_hadds,4        %b[40], %b[81], %r40, %b[40]
          qpfmul_hsubs,5        %b[65], %b[82], %r40, %b[62]
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[46], %b[83], %r40, %b[46]
          qpshufb,1     %b[70], %b[70], %r23, %b[71]
          qpfmul_hsubs,2        %b[86], %b[83], %r40, %b[65]
          qpshufb,3     %b[64], %b[64], %r23, %b[72]
          qpshufb,4     %b[69], %b[69], %r23, %b[73]
          qpfmul_hadds,5        %b[38], %b[82], %r40, %b[38]
        }
        {
          loop_mode
          qpshufb,0     %b[60], %b[60], %r23, %b[74]
          qpshufb,1     %b[80], %b[80], %r23, %b[76]
          qpshufb,3     %b[79], %b[79], %r23, %b[77]
          qpshufb,4     %b[75], %b[75], %r23, %b[78]
        }
        {
          loop_mode
          nop 3
          qpshufb,0     %b[88], %b[88], %r23, %b[81]
        }
        {
          loop_mode
          qpshufb,1     %b[31], %g20, %r6, %g20
          qpshufb,3     %g21, %b[33], %r6, %g21
          qpshufb,4     %b[35], %g29, %r6, %g29
        }
        {
          loop_mode
          qpshufb,0     %g22, %b[34], %r6, %g22
          qpshufb,1     %g28, %b[27], %r6, %g28
          qpfmul_hadds,2        %b[71], %g20, %r40, %b[23]
          qpshufb,3     %g23, %g30, %r6, %g23
          qpshufb,4     %b[30], %b[23], %r6, %g30
          qpfmul_hadds,5        %b[76], %g21, %r40, %b[27]
        }
        {
          loop_mode
          qpshufb,0     %g19, %b[28], %r6, %g19
          qpfmul_hsubs,1        %b[70], %g20, %r40, %g20
          qpfmul_hadds,2        %b[73], %g22, %r40, %b[30]
          qpfmul_hsubs,3        %b[80], %g21, %r40, %g21
          qpfmul_hsubs,4        %b[79], %g29, %r40, %b[28]
          qpfmul_hadds,5        %b[77], %g29, %r40, %g29
        }
        {
          loop_mode
          qpfmul_hsubs,0        %b[69], %g22, %r40, %g22
          qpfmul_hadds,1        %b[72], %g28, %r40, %b[33]
          qpfmul_hsubs,2        %b[64], %g28, %r40, %g28
          qpfmul_hsubs,3        %b[88], %g23, %r40, %b[31]
          qpfmul_hadds,4        %b[81], %g23, %r40, %g23
          qpfmul_hsubs,5        %b[75], %g30, %r40, %b[34]
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[74], %g19, %r40, %g19
          qppermb,1     %b[45], %b[58], %r24, %b[45]
          qpfmul_hsubs,2        %b[60], %g19, %r40, %b[35]
          qppermb,3     %b[24], %b[52], %r24, %b[24]
          qppermb,4     %b[44], %b[54], %r24, %b[44]
          qpfmul_hadds,5        %b[78], %g30, %r40, %g30
        }
        {
          loop_mode
          qppermb,0     %b[37], %b[51], %r24, %b[37]
          qppermb,1     %b[32], %b[67], %r24, %b[32]
          qppermb,3     %b[42], %b[61], %r24, %b[42]
          qppermb,4     %b[39], %b[55], %r24, %b[39]
        }
        {
          loop_mode
          qppermb,0     %b[47], %b[63], %r24, %b[47]
          qppermb,1     %b[49], %b[59], %r24, %b[49]
          qppermb,3     %b[40], %b[68], %r24, %b[40]
          qppermb,4     %b[48], %b[56], %r24, %b[48]
        }
        {
          loop_mode
          qppermb,0     %b[43], %b[66], %r24, %b[43]
          qppermb,1     %b[50], %b[53], %r24, %b[50]
          qppermb,3     %b[38], %b[62], %r24, %b[38]
          qppermb,4     %b[41], %b[57], %r24, %b[41]
        }
        {
          loop_mode
          qppermb,0     %b[46], %b[65], %r24, %b[46]
          qpfsubs,1     %b[24], %b[37], %b[51]
          qpfsubs,3     %b[44], %b[45], %b[52]
          qpfsubs,4     %b[40], %b[41], %b[53]
        }
        {
          loop_mode
          qpfsubs,0     %b[42], %b[47], %b[55]
          qpfsubs,1     %b[46], %b[50], %b[56]
          qpfsubs,2     %b[32], %b[39], %b[54]
          qpfadds,3     %b[24], %b[37], %b[24]
          qpfadds,4     %b[44], %b[45], %b[37]
          qpfadds,5     %b[32], %b[39], %b[32]
        }
        {
          loop_mode
          qpfadds,0     %b[48], %b[49], %b[44]
          qpfadds,1     %b[42], %b[47], %b[42]
          qpfsubs,2     %b[48], %b[49], %b[39]
          qpfadds,3     %b[40], %b[41], %b[40]
        }
        {
          loop_mode
          qpfadds,0     %b[38], %b[43], %b[38]
          qpfadds,1     %b[46], %b[50], %b[43]
          qpfsubs,2     %b[38], %b[43], %b[41]
        }
        {
          loop_mode
          qpshufb,4     %g26, %g27, %r6, %g26
        }
        {
          loop_mode
          qpshufb,3     %g17, %b[20], %r6, %g17
          qpshufb,4     %b[36], %g31, %r6, %g27
        }
        {
          loop_mode
          qpshufb,0     %b[19], %b[26], %r6, %g31
          qpshufb,1     %g24, %g16, %r6, %g16
          qpshufb,3     %b[25], %b[22], %r6, %g24
          qpshufb,4     %g18, %b[29], %r6, %g18
        }
        {
          loop_mode
          qpshufb,0     %b[21], %g25, %r6, %g25
          qppermb,1     %b[23], %g20, %r24, %g20
          qppermb,3     %b[27], %g21, %r24, %g21
          qppermb,4     %g29, %b[28], %r24, %g29
        }
        {
          loop_mode
          qppermb,0     %b[30], %g22, %r24, %g22
          qppermb,1     %b[33], %g28, %r24, %g28
          qpfsubs,2     %g26, %g20, %b[19]
          qppermb,3     %g23, %b[31], %r24, %g23
          qppermb,4     %g30, %b[34], %r24, %g30
          qpfsubs,5     %g17, %g21, %b[20]
        }
        {
          loop_mode
          qppermb,0     %g19, %b[35], %r24, %g19
          qpfsubs,1     %g27, %g22, %g21
          qpfadds,2     %g26, %g20, %g20
          qpshufb,3     %b[51], %b[51], %r23, %g26
          qpshufb,4     %b[52], %b[52], %r23, %b[21]
          qpfadds,5     %g17, %g21, %g17
        }
        {
          loop_mode
          qpfadds,0     %g27, %g22, %g22
          qpfsubs,1     %g24, %g28, %g31
          qpfadds,2     %g24, %g28, %g24
          qpfadds,3     %g31, %g29, %b[22]
          qpfsubs,4     %g31, %g29, %g29
          qpfadds,5     %g16, %g23, %g27
        }
        {
          loop_mode
          nop 1
          qpfsubs,0     %g18, %g19, %g18
          qpfadds,2     %g18, %g19, %g28
          qpfadds,3     %g25, %g30, %g23
          qpfsubs,4     %g25, %g30, %g25
          qpfsubs,5     %g16, %g23, %g16
        }
        {
          loop_mode
          qpfadds,0     %g20, %b[24], %g30
          qpshufb,1     %b[54], %b[54], %r23, %g19
          qpfsubs,2     %g20, %b[24], %g20
          qpfadds,3     %g17, %b[37], %b[23]
          qpfsubs,4     %g17, %b[37], %g17
        }
        {
          loop_mode
          qpshufb,0     %b[39], %b[39], %r23, %b[24]
          qpshufb,1     %b[41], %b[41], %r23, %b[25]
          qpfsubs,2     %g22, %b[32], %b[28]
          qpshufb,3     %b[55], %b[55], %r23, %b[26]
          qpshufb,4     %b[53], %b[53], %r23, %b[27]
          qpfsubs,5     %b[22], %b[42], %b[29]
        }
        {
          loop_mode
          qpfadds,0     %g22, %b[32], %g22
          qpshufb,1     %b[56], %b[56], %r23, %b[30]
          qpfsubs,2     %g24, %b[40], %b[33]
          qpfadds,3     %b[22], %b[42], %b[22]
          qpfsubs,4     %g27, %b[44], %b[31]
          qpfadds,5     %g23, %b[43], %b[32]
        }
        {
          loop_mode
          qpxor,0       %g26, %r22, %g26
          qpxor,1       %g19, %r22, %g19
          qpfadds,2     %g27, %b[44], %g27
          qpxor,3       %b[21], %r22, %b[21]
          qpxor,4       %b[26], %r22, %b[26]
          qpfsubs,5     %g23, %b[43], %g23
        }
        {
          loop_mode
          qpfadds,0     %g24, %b[40], %g24
          qpxor,1       %b[25], %r22, %b[25]
          qpfsubs,2     %b[19], %g26, %b[34]
          qpfadds,3     %g28, %b[38], %b[35]
          qpfsubs,4     %g28, %b[38], %g28
          qpfsubs,5     %b[20], %b[21], %b[36]
        }
        {
          loop_mode
          qpxor,0       %b[24], %r22, %b[24]
          qpxor,1       %b[27], %r22, %b[27]
          qpfadds,2     %b[19], %g26, %g26
          qpfadds,3     %b[20], %b[21], %b[19]
          qpfadds,4     %g29, %b[26], %b[20]
          qpfsubs,5     %g29, %b[26], %g29
        }
        {
          loop_mode
          qpfsubs,0     %g21, %g19, %b[26]
          qpxor,1       %b[30], %r22, %b[21]
          qpfadds,2     %g21, %g19, %g19
          stqp,5        %r16, %r0, %g17
        }
        {
          loop_mode
          qpfadds,0     %g16, %b[24], %g17
          qpfsubs,1     %g18, %b[25], %g21
          qpfadds,2     %g18, %b[25], %g18
          stqp,5        %r20, %r0, %b[23]
        }
        {
          loop_mode
          qpfsubs,0     %g16, %b[24], %g16
          qpfsubs,1     %g31, %b[27], %b[23]
          qpfadds,2     %g31, %b[27], %g31
          stqp,5        %r18, %r0, %g20
        }
        {
          loop_mode
          qpfadds,0     %g25, %b[21], %g20
          qpfsubs,1     %g25, %b[21], %g25
          stqp,2        %r2, %r0, %g30
          stqp,5        %r36, %r0, %b[29]
        }
        {
          loop_mode
          stqp,2        %r25, %r0, %b[22]
          stqp,5        %r32, %r0, %b[28]
        }
        {
          loop_mode
          stqp,2        %r27, %r0, %g22
          stqp,5        %r3, %r0, %b[31]
        }
        {
          loop_mode
          stqp,2        %r9, %r0, %b[32]
          stqp,5        %r35, %r0, %g23
        }
        {
          loop_mode
          stqp,2        %r17, %r0, %g27
          stqp,5        %r13, %r0, %b[33]
        }
        {
          loop_mode
          stqp,2        %r21, %r0, %g24
          stqp,5        %r31, %r0, %g28
        }
        {
          loop_mode
          stqp,2        %r15, %r0, %g26
          stqp,5        %r26, %r0, %b[35]
        }
        {
          loop_mode
          stqp,2        %r19, %r0, %b[34]
          stqp,5        %r34, %r0, %g19
        }
        {
          loop_mode
          stqp,2        %r30, %r0, %b[26]
          stqp,5        %r39, %r0, %g18
        }
        {
          loop_mode
          stqp,2        %r5, %r0, %b[19]
          stqp,5        %r29, %r0, %g21
        }
        {
          loop_mode
          stqp,2        %r11, %r0, %b[36]
          stqp,5        %r4, %r0, %g17
        }
        {
          loop_mode
          stqp,2        %r38, %r0, %b[20]
          stqp,5        %r14, %r0, %g16
        }
        {
          loop_mode
          stqp,2        %r28, %r0, %g29
          stqp,5        %r1, %r0, %g31
        }
        {
          loop_mode
          stqp,2        %r12, %r0, %b[23]
          stqp,5        %r37, %r0, %g20
        }
        {
          loop_mode
          ct    %ctpr1 ? %NOT_LOOP_END
          alc   alcf=1, alct=1
          stqp,2        %r33, %r0, %g25
          addd,3,sm     %r0, _f16s,_lts0lo 0x20, %r0
        }

Theoretical speed: 64 complex numbers (64 × 8 = 512 bytes) in 106 cycles, i.e. 512/106 = 4.83 bytes/cycle
Quadruple theoretical speed (4 × 4.83): 19.32 bytes/cycle
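
The two figures above are plain arithmetic on the loop body length. As a minimal sketch (the function names `bytes_per_cycle` and `scaled_bytes_per_cycle` are hypothetical helpers, not part of the FFT code), the calculation could be written as follows, assuming one `float complex` element occupies 8 bytes:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes processed per cycle for a loop body that handles `elements`
 * items of `bytes_per_element` bytes each in `cycles` cycles. */
double bytes_per_cycle(size_t elements, size_t bytes_per_element, size_t cycles)
{
    assert(cycles > 0); /* a loop body always takes at least one cycle */
    return (double)(elements * bytes_per_element) / (double)cycles;
}

/* The same figure multiplied by a factor (the text above quotes a factor of 4). */
double scaled_bytes_per_cycle(double base, unsigned factor)
{
    return base * (double)factor;
}
```

With these helpers, `bytes_per_cycle(64, 8, 106)` gives roughly 4.83, and scaling that value by 4 gives roughly 19.32, matching the numbers quoted above.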

Speed measurements

7. stage_radix4_2x_simd128_noConj_unroll3

Here the loop is unrolled by a factor of 3 using the unroll pragma (#pragma unroll(3)).
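
For readers unfamiliar with unrolling, here is a minimal hand-written sketch of the transformation that the pragma asks the compiler to perform. It uses a trivial element-wise addition instead of the FFT butterfly, and the names `add_plain` and `add_unrolled3` are hypothetical. The real compiler output may be organized differently (for example, it can combine unrolling with software pipelining).

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop: one element per iteration. */
void add_plain(const float *a, const float *b, float *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* The same loop unrolled by 3 by hand: three independent operations
 * per iteration, then a remainder loop when n is not a multiple of 3. */
void add_unrolled3(const float *a, const float *b, float *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    size_t i = 0;
    for (; i + 3 <= n; i += 3) {
        out[i]     = a[i]     + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
    }
    for (; i < n; ++i) /* leftover 0..2 elements */
        out[i] = a[i] + b[i];
}
```

Both functions produce identical results. The unrolled variant simply gives the scheduler three independent iterations to pack into wide instructions, at the cost of a longer loop body and more registers.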

C code
void stage_radix4_2x_simd128_noConj_unroll3(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b)
{
	__v2di *xy0_in = (__v2di*)&data_in[ 0];
	__v2di *zw0_in = (__v2di*)&data_in[ 2];
	__v2di *xy1_in = (__v2di*)&data_in[ 4];
	__v2di *zw1_in = (__v2di*)&data_in[ 6];
	__v2di *xy2_in = (__v2di*)&data_in[ 8];
	__v2di *zw2_in = (__v2di*)&data_in[10];
	__v2di *xy3_in = (__v2di*)&data_in[12];
	__v2di *zw3_in = (__v2di*)&data_in[14];
	__v2di *xy4_in = (__v2di*)&data_in[16];
	__v2di *zw4_in = (__v2di*)&data_in[18];
	__v2di *xy5_in = (__v2di*)&data_in[20];
	__v2di *zw5_in = (__v2di*)&data_in[22];
	__v2di *xy6_in = (__v2di*)&data_in[24];
	__v2di *zw6_in = (__v2di*)&data_in[26];
	__v2di *xy7_in = (__v2di*)&data_in[28];
	__v2di *zw7_in = (__v2di*)&data_in[30];
	__v2di *c0a_in = (__v2di*)&coefC_a[0];
	__v2di *c1a_in = (__v2di*)&coefC_a[2];
	__v2di *c2a_in = (__v2di*)&coefC_a[4];
	__v2di *c3a_in = (__v2di*)&coefC_a[6];
	__v2di *d0a_in = (__v2di*)&coefD_a[0];
	__v2di *d1a_in = (__v2di*)&coefD_a[2];
	__v2di *d2a_in = (__v2di*)&coefD_a[4];
	__v2di *d3a_in = (__v2di*)&coefD_a[6];
	__v2di *e0a_in = (__v2di*)&coefE_a[0];
	__v2di *e1a_in = (__v2di*)&coefE_a[2];
	__v2di *e2a_in = (__v2di*)&coefE_a[4];
	__v2di *e3a_in = (__v2di*)&coefE_a[6];
	__v2di *c0b_in = (__v2di*)&coefC_b[0*data_count/16];
	__v2di *c1b_in = (__v2di*)&coefC_b[1*data_count/16];
	__v2di *c2b_in = (__v2di*)&coefC_b[2*data_count/16];
	__v2di *c3b_in = (__v2di*)&coefC_b[3*data_count/16];
	__v2di *d0b_in = (__v2di*)&coefD_b[0*data_count/16];
	__v2di *d1b_in = (__v2di*)&coefD_b[1*data_count/16];
	__v2di *d2b_in = (__v2di*)&coefD_b[2*data_count/16];
	__v2di *d3b_in = (__v2di*)&coefD_b[3*data_count/16];
	__v2di *e0b_in = (__v2di*)&coefE_b[0*data_count/16];
	__v2di *e1b_in = (__v2di*)&coefE_b[1*data_count/16];
	__v2di *e2b_in = (__v2di*)&coefE_b[2*data_count/16];
	__v2di *e3b_in = (__v2di*)&coefE_b[3*data_count/16];

	__v2di *out_0  = (__v2di*)&data_out[ 0*data_count/16];
	__v2di *out_1  = (__v2di*)&data_out[ 1*data_count/16];
	__v2di *out_2  = (__v2di*)&data_out[ 2*data_count/16];
	__v2di *out_3  = (__v2di*)&data_out[ 3*data_count/16];
	__v2di *out_4  = (__v2di*)&data_out[ 4*data_count/16];
	__v2di *out_5  = (__v2di*)&data_out[ 5*data_count/16];
	__v2di *out_6  = (__v2di*)&data_out[ 6*data_count/16];
	__v2di *out_7  = (__v2di*)&data_out[ 7*data_count/16];
	__v2di *out_8  = (__v2di*)&data_out[ 8*data_count/16];
	__v2di *out_9  = (__v2di*)&data_out[ 9*data_count/16];
	__v2di *out_10 = (__v2di*)&data_out[10*data_count/16];
	__v2di *out_11 = (__v2di*)&data_out[11*data_count/16];
	__v2di *out_12 = (__v2di*)&data_out[12*data_count/16];
	__v2di *out_13 = (__v2di*)&data_out[13*data_count/16];
	__v2di *out_14 = (__v2di*)&data_out[14*data_count/16];
	__v2di *out_15 = (__v2di*)&data_out[15*data_count/16];

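	// Compiler hints for the loop below: assume no loop-carried dependencies (ivdep) and unroll the body 3 times
	#pragma ivdep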
	#pragma unroll(3)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/32; ++i)
	{
		__v2di xy0 = xy0_in[16*i];
		__v2di zw0 = zw0_in[16*i];
		__v2di xy1 = xy1_in[16*i];
		__v2di zw1 = zw1_in[16*i];
		__v2di c0  = c0a_in[4*i];
		__v2di d0  = d0a_in[4*i];
		__v2di e0  = e0a_in[4*i];

		__v2di xy2 = xy2_in[16*i];
		__v2di zw2 = zw2_in[16*i];
		__v2di xy3 = xy3_in[16*i];
		__v2di zw3 = zw3_in[16*i];
		__v2di c1  = c1a_in[4*i];
		__v2di d1  = d1a_in[4*i];
		__v2di e1  = e1a_in[4*i];

		__v2di xy4 = xy4_in[16*i];
		__v2di zw4 = zw4_in[16*i];
		__v2di xy5 = xy5_in[16*i];
		__v2di zw5 = zw5_in[16*i];
		__v2di c2  = c2a_in[4*i];
		__v2di d2  = d2a_in[4*i];
		__v2di e2  = e2a_in[4*i];

		__v2di xy6 = xy6_in[16*i];
		__v2di zw6 = zw6_in[16*i];
		__v2di xy7 = xy7_in[16*i];
		__v2di zw7 = zw7_in[16*i];
		__v2di c3  = c3a_in[4*i];
		__v2di d3  = d3a_in[4*i];
		__v2di e3  = e3a_in[4*i];

		// Pass 1 (coefficients *_a): split each xy/zw pair of vectors into separate x, y, z, w vectors
		__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		__v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

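		// Complex multiplication c*y, d*z, e*w. Element-wise products: c*y holds (re*re, im*im), swap_c*y holds (im*re, re*im)
		__v2di cy0_real = __builtin_e2k_qpfmuls(     c0, y0);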
		__v2di cy1_real = __builtin_e2k_qpfmuls(     c1, y1);
		__v2di cy2_real = __builtin_e2k_qpfmuls(     c2, y2);
		__v2di cy3_real = __builtin_e2k_qpfmuls(     c3, y3);
		__v2di dz0_real = __builtin_e2k_qpfmuls(     d0, z0);
		__v2di dz1_real = __builtin_e2k_qpfmuls(     d1, z1);
		__v2di dz2_real = __builtin_e2k_qpfmuls(     d2, z2);
		__v2di dz3_real = __builtin_e2k_qpfmuls(     d3, z3);
		__v2di ew0_real = __builtin_e2k_qpfmuls(     e0, w0);
		__v2di ew1_real = __builtin_e2k_qpfmuls(     e1, w1);
		__v2di ew2_real = __builtin_e2k_qpfmuls(     e2, w2);
		__v2di ew3_real = __builtin_e2k_qpfmuls(     e3, w3);
		__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
		__v2di cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
		__v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
		__v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
		__v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
		__v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
		__v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
		__v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
		__v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
		__v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
		__v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);

		__v2di cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);
		__v2di cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
		__v2di cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real);
		__v2di cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real);
		__v2di dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real);
		__v2di dz1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz1_real);
		__v2di dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real);
		__v2di dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real);
		__v2di ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real);
		__v2di ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real);
		__v2di ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real);
		__v2di ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real);
		__v2di cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
		__v2di cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);
		__v2di cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag);
		__v2di cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag);
		__v2di dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag);
		__v2di dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag);
		__v2di dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag);
		__v2di dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag);
		__v2di ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag);
		__v2di ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag);
		__v2di ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag);
		__v2di ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag);

		__v2di cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di ew0 = __builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		__v2di ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});

		__v2di add02_0 = __builtin_e2k_qpfadds( x0, dz0);
		__v2di add02_1 = __builtin_e2k_qpfadds( x1, dz1);
		__v2di add02_2 = __builtin_e2k_qpfadds( x2, dz2);
		__v2di add02_3 = __builtin_e2k_qpfadds( x3, dz3);
		__v2di sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
		__v2di sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
		__v2di sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
		__v2di sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
		__v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
		__v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
		__v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
		__v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
		__v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
		__v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
		__v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
		__v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);

		// sub13i = i * sub13: swap re/im, then flip the sign bit of the new real part
		__v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});

		__v2di out0  = __builtin_e2k_qpfadds(add02_0, add13_0);
		__v2di out1  = __builtin_e2k_qpfadds(add02_1, add13_1);
		__v2di out2  = __builtin_e2k_qpfadds(add02_2, add13_2);
		__v2di out3  = __builtin_e2k_qpfadds(add02_3, add13_3);
		__v2di out4  = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
		__v2di out5  = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
		__v2di out6  = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
		__v2di out7  = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
		__v2di out8  = __builtin_e2k_qpfsubs(add02_0, add13_0);
		__v2di out9  = __builtin_e2k_qpfsubs(add02_1, add13_1);
		__v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2);
		__v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3);
		__v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
		__v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
		__v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
		__v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3);


		// Pass 2: feed the outputs of pass 1 back in as inputs and load the *_b coefficients
		xy0 = out0;
		zw0 = out1;
		xy1 = out2;
		zw1 = out3;
		c0  = c0b_in[i];
		d0  = d0b_in[i];
		e0  = e0b_in[i];

		xy2 = out4;
		zw2 = out5;
		xy3 = out6;
		zw3 = out7;
		c1  = c1b_in[i];
		d1  = d1b_in[i];
		e1  = e1b_in[i];

		xy4 = out8;
		zw4 = out9;
		xy5 = out10;
		zw5 = out11;
		c2  = c2b_in[i];
		d2  = d2b_in[i];
		e2  = e2b_in[i];

		xy6 = out12;
		zw6 = out13;
		xy7 = out14;
		zw7 = out15;
		c3  = c3b_in[i];
		d3  = d3b_in[i];
		e3  = e3b_in[i];

		// Unpack the xy*/zw* pairs into separate x, y, z, w vectors.
		x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
		w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
		w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
		y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
		w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
		y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
		w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

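		// Swap re/im in every twiddle factor; the swapped copies produce the imaginary parts of the products.
		swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});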
		swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

		// Element-wise products: c*y holds (c.re*y.re, c.im*y.im), swap_c*y holds (c.im*y.re, c.re*y.im).
		cy0_real = __builtin_e2k_qpfmuls(     c0, y0);
		cy1_real = __builtin_e2k_qpfmuls(     c1, y1);
		cy2_real = __builtin_e2k_qpfmuls(     c2, y2);
		cy3_real = __builtin_e2k_qpfmuls(     c3, y3);
		dz0_real = __builtin_e2k_qpfmuls(     d0, z0);
		dz1_real = __builtin_e2k_qpfmuls(     d1, z1);
		dz2_real = __builtin_e2k_qpfmuls(     d2, z2);
		dz3_real = __builtin_e2k_qpfmuls(     d3, z3);
		ew0_real = __builtin_e2k_qpfmuls(     e0, w0);
		ew1_real = __builtin_e2k_qpfmuls(     e1, w1);
		ew2_real = __builtin_e2k_qpfmuls(     e2, w2);
		ew3_real = __builtin_e2k_qpfmuls(     e3, w3);
		cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
		cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
		cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
		dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
		dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
		dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
		dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
		ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
		ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
		ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
		ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);

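		// Horizontal subtract gives the real parts, horizontal add gives the imaginary parts of the complex products.
		cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);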
		cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
		cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real);
		cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real);
		dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real);
		dz1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz1_real);
		dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real);
		dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real);
		ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real);
		ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real);
		ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real);
		ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real);
		cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
		cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);
		cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag);
		cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag);
		dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag);
		dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag);
		dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag);
		dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag);
		ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag);
		ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag);
		ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag);
		ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag);

		// Interleave the real and imaginary results back into packed complex values.
		cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		ew0 = __builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
		ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});

		// Second butterfly: sums and differences of elements 0/2 (x, d*z) and 1/3 (c*y, e*w).
		add02_0 = __builtin_e2k_qpfadds( x0, dz0);
		add02_1 = __builtin_e2k_qpfadds( x1, dz1);
		add02_2 = __builtin_e2k_qpfadds( x2, dz2);
		add02_3 = __builtin_e2k_qpfadds( x3, dz3);
		sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
		sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
		sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
		sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
		add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
		add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
		add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
		add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
		sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
		sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
		sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
		sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);

		swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
		sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
		sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
		sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});

		// Store the 16 result vectors of the second butterfly.
		out_0[i]  = __builtin_e2k_qpfadds(add02_0, add13_0);
		out_1[i]  = __builtin_e2k_qpfadds(add02_1, add13_1);
		out_2[i]  = __builtin_e2k_qpfadds(add02_2, add13_2);
		out_3[i]  = __builtin_e2k_qpfadds(add02_3, add13_3);
		out_4[i]  = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
		out_5[i]  = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
		out_6[i]  = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
		out_7[i]  = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
		out_8[i]  = __builtin_e2k_qpfsubs(add02_0, add13_0);
		out_9[i]  = __builtin_e2k_qpfsubs(add02_1, add13_1);
		out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2);
		out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3);
		out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
		out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
		out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
		out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
	}
}
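
To make the intrinsics above easier to check against plain arithmetic, here is a scalar sketch of what a single radix-4 butterfly in the listing appears to compute. It is a reading of the code rather than the author's reference implementation: the helper names `mul_by_i`, `cmul_split` and `radix4_butterfly_ref` are hypothetical, and the mapping of each step to a particular intrinsic is an assumption based on the shuffle masks and variable names.

```c
#include <complex.h>

/* Hypothetical helper: multiply by i the way the vector code seems to,
   i.e. swap re/im (qpshufb) and negate the new real part (qpxor). */
static float complex mul_by_i(float complex v)
{
    return -cimagf(v) + crealf(v) * I;
}

/* Hypothetical helper: complex product split into the same steps as the
   vector code: two element-wise products, then a horizontal subtract for
   the real part and a horizontal add for the imaginary part. */
static float complex cmul_split(float complex c, float complex y)
{
    float rr = crealf(c) * crealf(y) - cimagf(c) * cimagf(y);
    float ii = cimagf(c) * crealf(y) + crealf(c) * cimagf(y);
    return rr + ii * I;
}

/* Scalar model of one radix-4 butterfly with twiddles c, d, e applied to
   inputs y, z, w (x is used as is). Assumed to mirror the vector code. */
static void radix4_butterfly_ref(float complex x, float complex y,
                                 float complex z, float complex w,
                                 float complex c, float complex d,
                                 float complex e, float complex out[4])
{
    float complex cy = cmul_split(c, y);
    float complex dz = cmul_split(d, z);
    float complex ew = cmul_split(e, w);

    float complex add02 = x + dz;
    float complex sub02 = x - dz;
    float complex add13 = cy + ew;
    float complex sub13 = cy - ew;
    float complex sub13i = mul_by_i(sub13);

    out[0] = add02 + add13;
    out[1] = sub02 - sub13i;
    out[2] = add02 - add13;
    out[3] = sub02 + sub13i;
}
```

With all twiddles equal to 1, the function reduces to a 4-point forward DFT, which gives a quick way to compare a vectorized stage against a known result on small inputs.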
Main loop in assembly
.L24812:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=64
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=128
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=160
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=192
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=224
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=256
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=288
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=320
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=352
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=384
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=416
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=448
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=480
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=8, disp=512
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=8, disp=544
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=9, disp=576
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=9, disp=608
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=10, disp=640
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=10, disp=672
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=11, disp=704
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=11, disp=736
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=0, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=0, abs=12, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=0, abs=13, disp=64
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=0, abs=13, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=0, abs=14, disp=128
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=0, abs=14, disp=160
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=15, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=15, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=16, disp=64
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=16, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=17, disp=128
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=17, disp=160
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=18, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=18, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=19, disp=64
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=19, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=20, disp=128
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=20, disp=160
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=1, incr=2, ind=0, asz=0, abs=21, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=15, asz=0, abs=21, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=14, asz=0, abs=22, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=13, asz=0, abs=22, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=12, asz=0, abs=23, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=11, asz=0, abs=23, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=10, asz=0, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=9, asz=0, abs=24, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=8, asz=0, abs=25, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=7, asz=0, abs=25, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=6, asz=0, abs=26, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=5, asz=0, abs=26, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=1, incr=2, ind=0, asz=0, abs=27, disp=32
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=15, asz=0, abs=27, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=14, asz=0, abs=28, disp=32
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=13, asz=0, abs=28, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=12, asz=0, abs=29, disp=32
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=11, asz=0, abs=29, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=10, asz=0, abs=30, disp=32
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=9, asz=0, abs=30, disp=32
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=8, asz=0, abs=31, disp=32
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=7, asz=0, abs=31, disp=32
        }
.L19801:
        {
          loop_mode
          disp  %ctpr1, .L19801
          ldqp,0        %r58, %r0, %g20
          addd,1,sm     %r0, _f16s,_lts0lo 0x30, %g22
          ldqp,2        %r55, %r0, %g21
          movaqp,0      area=0, ind=0, am=1, be=0, %g17
          movaqp,1      area=0, ind=16, am=0, be=0, %g16
          movaqp,2      area=0, ind=0, am=1, be=0, %g19
          movaqp,3      area=0, ind=16, am=0, be=0, %g18
        }
        {
          loop_mode
          ldb,0,sm      %r58, %g22, %empty, mas=0x20
          ldb,2,sm      %r55, %g22, %empty, mas=0x20
          movaqp,0      area=1, ind=0, am=1, be=0, %g24
          movaqp,1      area=1, ind=16, am=0, be=0, %g23
          movaqp,2      area=1, ind=0, am=1, be=0, %g26
          movaqp,3      area=1, ind=16, am=0, be=0, %g25
        }
        {
          loop_mode
          movaqp,0      area=2, ind=0, am=1, be=0, %g27
          movaqp,1      area=2, ind=16, am=0, be=0, %g22
          movaqp,2      area=2, ind=0, am=1, be=0, %g29
          movaqp,3      area=2, ind=16, am=0, be=0, %g28
        }
        {
          loop_mode
          movaqp,0      area=3, ind=0, am=1, be=0, %g31
          movaqp,1      area=3, ind=16, am=0, be=0, %g30
          movaqp,2      area=3, ind=0, am=1, be=0, %b[38]
          movaqp,3      area=3, ind=16, am=0, be=0, %b[37]
        }
        {
          loop_mode
          movaqp,0      area=4, ind=0, am=1, be=0, %b[40]
          movaqp,1      area=4, ind=16, am=0, be=0, %b[39]
          movaqp,2      area=4, ind=0, am=1, be=0, %b[42]
          movaqp,3      area=4, ind=16, am=0, be=0, %b[41]
        }
        {
          loop_mode
          qpshufb,0     %g19, %g17, %r6, %b[47]
          qpshufb,1     %g18, %g16, %r6, %b[48]
          qpshufb,3     %g18, %g16, %r23, %g16
          qpshufb,4     %g19, %g17, %r23, %g17
          movaqp,0      area=5, ind=0, am=1, be=0, %b[44]
          movaqp,1      area=5, ind=16, am=0, be=0, %b[43]
          movaqp,2      area=5, ind=0, am=1, be=0, %b[46]
          movaqp,3      area=5, ind=16, am=0, be=0, %b[45]
        }
        {
          loop_mode
          qpshufb,0     %g26, %g24, %r6, %b[51]
          qpshufb,1     %g25, %g23, %r6, %b[52]
          qpshufb,3     %g25, %g23, %r23, %g23
          qpshufb,4     %g26, %g24, %r23, %g24
          movaqp,0      area=6, ind=0, am=1, be=0, %g19
          movaqp,1      area=6, ind=16, am=0, be=0, %g18
          movaqp,2      area=6, ind=0, am=1, be=0, %b[50]
          movaqp,3      area=6, ind=16, am=0, be=0, %b[49]
        }
        {
          loop_mode
          qpshufb,0     %g29, %g27, %r6, %b[55]
          qpshufb,1     %g28, %g22, %r6, %b[56]
          qpshufb,3     %g28, %g22, %r23, %g22
          qpshufb,4     %g29, %g27, %r23, %g27
          movaqp,0      area=7, ind=0, am=1, be=0, %g26
          movaqp,1      area=7, ind=16, am=0, be=0, %g25
          movaqp,2      area=7, ind=0, am=1, be=0, %b[54]
          movaqp,3      area=7, ind=16, am=0, be=0, %b[53]
        }
        {
          loop_mode
          qpshufb,0     %b[38], %g31, %r6, %b[59]
          qpshufb,1     %b[37], %g30, %r6, %b[60]
          qpshufb,3     %b[37], %g30, %r23, %g30
          qpshufb,4     %b[38], %g31, %r23, %g31
          movaqp,0      area=8, ind=0, am=1, be=0, %g29
          movaqp,1      area=8, ind=16, am=0, be=0, %g28
          movaqp,2      area=8, ind=0, am=1, be=0, %b[58]
          movaqp,3      area=8, ind=16, am=0, be=0, %b[57]
        }
        {
          loop_mode
          qpshufb,0     %b[42], %b[40], %r6, %b[63]
          qpshufb,1     %b[41], %b[39], %r6, %b[64]
          qpshufb,3     %b[41], %b[39], %r23, %b[39]
          qpshufb,4     %b[42], %b[40], %r23, %b[40]
          movaqp,0      area=9, ind=0, am=1, be=0, %b[38]
          movaqp,1      area=9, ind=16, am=0, be=0, %b[37]
          movaqp,2      area=9, ind=0, am=1, be=0, %b[62]
          movaqp,3      area=9, ind=16, am=0, be=0, %b[61]
        }
        {
          loop_mode
          qpshufb,0     %b[46], %b[44], %r6, %b[67]
          qpshufb,1     %b[45], %b[43], %r6, %b[68]
          qpshufb,3     %b[45], %b[43], %r23, %b[43]
          qpshufb,4     %b[46], %b[44], %r23, %b[44]
          movaqp,0      area=10, ind=0, am=1, be=0, %b[42]
          movaqp,1      area=10, ind=16, am=0, be=0, %b[41]
          movaqp,2      area=10, ind=0, am=1, be=0, %b[66]
          movaqp,3      area=10, ind=16, am=0, be=0, %b[65]
        }
        {
          loop_mode
          qpshufb,0     %b[50], %g19, %r6, %b[71]
          qpshufb,1     %b[49], %g18, %r6, %b[72]
          qpshufb,3     %b[49], %g18, %r23, %g18
          qpshufb,4     %b[50], %g19, %r23, %g19
          movaqp,0      area=11, ind=0, am=1, be=0, %b[46]
          movaqp,1      area=11, ind=16, am=0, be=0, %b[45]
          movaqp,2      area=11, ind=0, am=1, be=0, %b[70]
          movaqp,3      area=11, ind=16, am=0, be=0, %b[69]
        }
        {
          loop_mode
          qpshufb,0     %b[54], %g26, %r6, %b[75]
          qpshufb,1     %b[53], %g25, %r6, %b[76]
          qpshufb,3     %b[53], %g25, %r23, %g25
          qpshufb,4     %b[54], %g26, %r23, %g26
          movaqp,0      area=12, ind=0, am=1, be=0, %b[50]
          movaqp,1      area=12, ind=16, am=0, be=0, %b[49]
          movaqp,2      area=12, ind=0, am=1, be=0, %b[74]
          movaqp,3      area=12, ind=16, am=0, be=0, %b[73]
        }
        {
          loop_mode
          qpshufb,0     %b[58], %g29, %r6, %b[79]
          qpshufb,1     %b[57], %g28, %r6, %b[80]
          qpshufb,3     %b[57], %g28, %r23, %g28
          qpshufb,4     %b[58], %g29, %r23, %g29
          movaqp,0      area=13, ind=0, am=1, be=0, %b[54]
          movaqp,1      area=13, ind=16, am=0, be=0, %b[53]
          movaqp,2      area=13, ind=0, am=1, be=0, %b[78]
          movaqp,3      area=13, ind=16, am=0, be=0, %b[77]
        }
        {
          loop_mode
          qpshufb,0     %b[62], %b[38], %r6, %b[83]
          qpshufb,1     %b[61], %b[37], %r6, %b[84]
          qpshufb,3     %b[61], %b[37], %r23, %b[37]
          qpshufb,4     %b[62], %b[38], %r23, %b[38]
          movaqp,0      area=14, ind=0, am=1, be=0, %b[58]
          movaqp,1      area=14, ind=16, am=0, be=0, %b[57]
          movaqp,2      area=14, ind=0, am=1, be=0, %b[82]
          movaqp,3      area=14, ind=16, am=0, be=0, %b[81]
        }
        {
          loop_mode
          qpshufb,0     %b[66], %b[42], %r6, %b[87]
          qpshufb,1     %b[65], %b[41], %r6, %b[88]
          qpshufb,3     %b[65], %b[41], %r23, %b[41]
          qpshufb,4     %b[66], %b[42], %r23, %b[42]
          movaqp,0      area=15, ind=0, am=1, be=0, %b[62]
          movaqp,1      area=15, ind=16, am=0, be=0, %b[61]
          movaqp,2      area=15, ind=0, am=1, be=0, %b[86]
          movaqp,3      area=15, ind=16, am=0, be=0, %b[85]
        }
        {
          loop_mode
          qpshufb,0     %b[70], %b[46], %r6, %b[91]
          qpshufb,1     %b[69], %b[45], %r6, %b[92]
          qpshufb,3     %b[69], %b[45], %r23, %b[45]
          qpshufb,4     %b[70], %b[46], %r23, %b[46]
          movaqp,0      area=16, ind=0, am=1, be=0, %b[66]
          movaqp,1      area=16, ind=16, am=0, be=0, %b[65]
          movaqp,2      area=16, ind=0, am=1, be=0, %b[90]
          movaqp,3      area=16, ind=16, am=0, be=0, %b[89]
        }
        {
          loop_mode
          qpshufb,0     %b[50], %b[50], %r24, %b[95]
          qpshufb,1     %b[49], %b[49], %r24, %b[96]
          qpfmul_hsubs,2        %b[50], %b[47], %r25, %b[50]
          qpshufb,3     %b[74], %b[74], %r24, %b[97]
          qpshufb,4     %b[73], %b[73], %r24, %b[98]
          qpfmul_hsubs,5        %b[49], %b[51], %r25, %b[49]
          movaqp,0      area=17, ind=0, am=1, be=0, %b[70]
          movaqp,1      area=17, ind=16, am=0, be=0, %b[69]
          movaqp,2      area=17, ind=0, am=1, be=0, %b[94]
          movaqp,3      area=17, ind=16, am=0, be=0, %b[93]
        }
        {
          loop_mode
          qpshufb,0     %b[54], %b[54], %r24, %b[103]
          qpshufb,1     %b[53], %b[53], %r24, %b[104]
          qpfmul_hsubs,2        %b[74], %b[55], %r25, %b[74]
          qpshufb,3     %b[78], %b[78], %r24, %b[105]
          qpshufb,4     %b[77], %b[77], %r24, %b[106]
          qpfmul_hsubs,5        %b[73], %b[59], %r25, %b[73]
          movaqp,0      area=18, ind=0, am=1, be=0, %b[100]
          movaqp,1      area=18, ind=16, am=0, be=0, %b[99]
          movaqp,2      area=18, ind=0, am=1, be=0, %b[102]
          movaqp,3      area=18, ind=16, am=0, be=0, %b[101]
        }
        {
          loop_mode
          qpshufb,0     %b[58], %b[58], %r24, %b[111]
          qpshufb,1     %b[57], %b[57], %r24, %b[112]
          qpfmul_hsubs,2        %b[54], %b[63], %r25, %b[54]
          qpshufb,3     %b[82], %b[82], %r24, %b[113]
          qpshufb,4     %b[81], %b[81], %r24, %b[114]
          qpfmul_hsubs,5        %b[53], %b[67], %r25, %b[53]
          movaqp,0      area=19, ind=0, am=1, be=0, %b[108]
          movaqp,1      area=19, ind=16, am=0, be=0, %b[107]
          movaqp,2      area=19, ind=0, am=1, be=0, %b[110]
          movaqp,3      area=19, ind=16, am=0, be=0, %b[109]
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[96], %b[51], %r25, %b[51]
          qpfmul_hsubs,1        %b[78], %b[71], %r25, %b[78]
          qpfmul_hsubs,2        %b[77], %b[75], %r25, %b[77]
          qpfmul_hadds,3        %b[97], %b[55], %r25, %b[55]
          qpfmul_hadds,4        %b[98], %b[59], %r25, %b[59]
          qpfmul_hsubs,5        %b[58], %b[79], %r25, %b[58]
          movaqp,0      area=20, ind=0, am=1, be=0, %b[116]
          movaqp,1      area=20, ind=16, am=0, be=0, %b[115]
          movaqp,2      area=20, ind=0, am=1, be=0, %b[118]
          movaqp,3      area=20, ind=16, am=0, be=0, %b[117]
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[95], %b[47], %r25, %b[47]
          qpfmul_hsubs,1        %b[57], %b[83], %r25, %b[57]
          qpfmul_hsubs,2        %b[82], %b[87], %r25, %b[82]
          qpfmul_hadds,3        %b[105], %b[71], %r25, %b[71]
          qpfmul_hadds,4        %b[106], %b[75], %r25, %b[75]
          qpfmul_hsubs,5        %b[81], %b[91], %r25, %b[81]
          movaqp,0      area=21, ind=0, am=1, be=0, %b[96]
          movaqp,1      area=21, ind=16, am=0, be=0, %b[95]
          movaqp,2      area=21, ind=0, am=1, be=0, %b[98]
          movaqp,3      area=21, ind=16, am=0, be=0, %b[97]
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[103], %b[63], %r25, %b[63]
          qpfmul_hadds,1        %b[112], %b[83], %r25, %b[83]
          qpfmul_hadds,2        %b[104], %b[67], %r25, %b[67]
          qpfmul_hadds,3        %b[114], %b[91], %r25, %b[91]
          qpshufb,4     %b[62], %b[62], %r24, %b[103]
          qpfmul_hadds,5        %b[113], %b[87], %r25, %b[87]
          movaqp,0      area=22, ind=0, am=1, be=0, %b[105]
          movaqp,1      area=22, ind=16, am=0, be=0, %b[104]
          movaqp,2      area=22, ind=0, am=1, be=0, %b[112]
          movaqp,3      area=22, ind=16, am=0, be=0, %b[106]
        }
        {
          loop_mode
          qpshufb,0     %b[100], %b[100], %r24, %b[113]
          qpshufb,1     %b[99], %b[99], %r24, %b[114]
          qpfmul_hsubs,2        %b[99], %b[52], %r25, %b[99]
          qpshufb,3     %b[102], %b[102], %r24, %b[119]
          qpshufb,4     %b[101], %b[101], %r24, %b[120]
          qpfmul_hadds,5        %b[111], %b[79], %r25, %b[79]
          movaqp,0      area=23, ind=0, am=1, be=0, %b[121]
          movaqp,1      area=23, ind=16, am=0, be=0, %b[111]
          movaqp,2      area=23, ind=0, am=1, be=0, %b[123]
          movaqp,3      area=23, ind=16, am=0, be=0, %b[122]
        }
        {
          loop_mode
          qpshufb,0     %b[108], %b[108], %r24, %b[124]
          qpshufb,1     %b[107], %b[107], %r24, %b[125]
          qpfmul_hsubs,2        %b[100], %b[48], %r25, %b[100]
          qpshufb,3     %b[110], %b[110], %r24, %b[126]
          qpshufb,4     %b[109], %b[109], %r24, %b[127]
          qpfmul_hsubs,5        %b[101], %b[60], %r25, %b[101]
          movaqp,0      area=24, ind=0, am=1, be=0, %b[35]
          movaqp,1      area=24, ind=16, am=0, be=0, %b[36]
          movaqp,2      area=24, ind=0, am=1, be=0, %b[33]
          movaqp,3      area=24, ind=16, am=0, be=0, %b[34]
        }
        {
          loop_mode
          qpshufb,0     %b[116], %b[116], %r24, %b[14]
          qpshufb,1     %b[115], %b[115], %r24, %b[13]
          qpfmul_hsubs,2        %b[102], %b[56], %r25, %b[102]
          qpshufb,3     %b[118], %b[118], %r24, %b[12]
          qpshufb,4     %b[117], %b[117], %r24, %b[11]
          qpfmul_hsubs,5        %b[107], %b[68], %r25, %b[107]
          movaqp,0      area=25, ind=0, am=1, be=0, %b[31]
          movaqp,1      area=25, ind=16, am=0, be=0, %b[32]
          movaqp,2      area=25, ind=0, am=1, be=0, %b[29]
          movaqp,3      area=25, ind=16, am=0, be=0, %b[30]
        }
        {
          loop_mode
          qpfmul_hsubs,0        %b[108], %b[64], %r25, %b[108]
          qpfmul_hsubs,1        %b[110], %b[72], %r25, %b[110]
          qpfmul_hsubs,2        %b[109], %b[76], %r25, %b[109]
          qpfmul_hadds,3        %b[119], %b[56], %r25, %b[56]
          qpfmul_hadds,4        %b[120], %b[60], %r25, %b[60]
          qpfmul_hsubs,5        %b[115], %b[84], %r25, %b[115]
          movaqp,0      area=26, ind=0, am=1, be=0, %b[27]
          movaqp,1      area=26, ind=16, am=0, be=0, %b[28]
          movaqp,2      area=26, ind=0, am=1, be=0, %b[25]
          movaqp,3      area=26, ind=16, am=0, be=0, %b[26]
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[113], %b[48], %r25, %b[48]
          qpfmul_hadds,1        %b[114], %b[52], %r25, %b[52]
          qpfmul_hsubs,2        %b[116], %b[80], %r25, %b[113]
          qpfmul_hadds,3        %b[127], %b[76], %r25, %b[76]
          qpfmul_hsubs,4        %b[118], %b[88], %r25, %b[114]
          qpfmul_hsubs,5        %b[117], %b[92], %r25, %b[116]
          movaqp,0      area=28, ind=0, am=1, be=0, %b[22]
          movaqp,1      area=27, ind=0, am=1, be=0, %b[24]
          movaqp,2      area=28, ind=0, am=1, be=0, %b[21]
          movaqp,3      area=27, ind=0, am=1, be=0, %b[23]
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[124], %b[64], %r25, %b[64]
          qpfmul_hadds,1        %b[125], %b[68], %r25, %b[68]
          qpfmul_hadds,2        %b[126], %b[72], %r25, %b[72]
          qpfmul_hadds,3        %b[11], %b[92], %r25, %b[11]
          qpshufb,4     %b[61], %b[61], %r24, %b[88]
          qpfmul_hadds,5        %b[12], %b[88], %r25, %b[12]
          movaqp,0      area=30, ind=0, am=1, be=0, %b[18]
          movaqp,1      area=29, ind=0, am=1, be=0, %b[20]
          movaqp,2      area=30, ind=0, am=1, be=0, %b[17]
          movaqp,3      area=29, ind=0, am=1, be=0, %b[19]
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[13], %b[84], %r25, %b[13]
          qpshufb,1     %b[86], %b[86], %r24, %b[80]
          qpfmul_hadds,2        %b[14], %b[80], %r25, %b[14]
          qpshufb,3     %b[85], %b[85], %r24, %b[84]
          qpshufb,4     %b[66], %b[66], %r24, %b[92]
          qpfmul_hsubs,5        %b[61], %g23, %r25, %b[61]
          movaqp,1      area=31, ind=0, am=1, be=0, %b[16]
          movaqp,3      area=31, ind=0, am=1, be=0, %b[15]
        }
        {
          loop_mode
          qpshufb,0     %b[65], %b[65], %r24, %b[117]
          qpshufb,1     %b[90], %b[90], %r24, %b[118]
          qpfmul_hsubs,2        %b[62], %g16, %r25, %b[62]
          qpshufb,3     %b[89], %b[89], %r24, %b[119]
          qpshufb,4     %b[70], %b[70], %r24, %b[120]
          qpfmul_hsubs,5        %b[86], %g22, %r25, %b[86]
        }
        {
          loop_mode
          qpshufb,0     %b[69], %b[69], %r24, %b[124]
          qpshufb,1     %b[94], %b[94], %r24, %b[125]
          qpfmul_hsubs,2        %b[85], %g30, %r25, %b[85]
          qpshufb,3     %b[93], %b[93], %r24, %b[126]
          qpfmul_hsubs,4        %b[66], %b[39], %r25, %b[66]
          qpfmul_hsubs,5        %b[65], %b[43], %r25, %b[65]
        }
        {
          loop_mode
          qpfmul_hsubs,0        %b[90], %g18, %r25, %b[90]
          qpfmul_hsubs,1        %b[89], %g25, %r25, %b[89]
          qpfmul_hadds,2        %b[103], %g16, %r25, %g16
          qpfmul_hadds,3        %b[88], %g23, %r25, %g23
          qpfmul_hsubs,4        %b[70], %g28, %r25, %b[70]
          qpfmul_hsubs,5        %b[69], %b[37], %r25, %b[69]
        }
        {
          loop_mode
          qpfmul_hsubs,0        %b[94], %b[41], %r25, %b[88]
          qpfmul_hsubs,1        %b[93], %b[45], %r25, %b[93]
          qpfmul_hadds,2        %b[80], %g22, %r25, %g22
          qpfmul_hadds,3        %b[84], %g30, %r25, %g30
          qpfmul_hadds,4        %b[92], %b[39], %r25, %b[39]
          qpfmul_hadds,5        %b[119], %g25, %r25, %g25
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[117], %b[43], %r25, %b[43]
          qpfmul_hadds,1        %b[118], %g18, %r25, %g18
          qpfmul_hadds,2        %b[124], %b[37], %r25, %b[37]
          qpfmul_hadds,3        %b[126], %b[45], %r25, %b[45]
          qppermb,4     %b[51], %b[49], %r7, %b[49]
          qpfmul_hadds,5        %b[120], %g28, %r25, %g28
        }
        {
          loop_mode
          qppermb,1     %b[59], %b[73], %r7, %b[51]
          qpfmul_hadds,2        %b[125], %b[41], %r25, %b[41]
          qppermb,3     %b[47], %b[50], %r7, %b[47]
          qppermb,4     %b[55], %b[74], %r7, %b[50]
        }
        {
          loop_mode
          qppermb,0     %b[71], %b[78], %r7, %b[55]
          qppermb,1     %b[75], %b[77], %r7, %b[59]
          qppermb,3     %b[63], %b[54], %r7, %b[54]
          qppermb,4     %b[67], %b[53], %r7, %b[53]
        }
        {
          loop_mode
          nop 2
          qppermb,0     %b[79], %b[58], %r7, %b[58]
          qppermb,1     %b[83], %b[57], %r7, %b[57]
          qppermb,3     %b[87], %b[82], %r7, %b[63]
        }
        {
          loop_mode
          qppermb,4     %b[91], %b[81], %r7, %b[67]
        }
        {
          loop_mode
          qppermb,0     %b[56], %b[102], %r7, %b[56]
          qppermb,1     %b[60], %b[101], %r7, %b[60]
          qppermb,3     %b[48], %b[100], %r7, %b[48]
          qppermb,4     %b[52], %b[99], %r7, %b[52]
        }
        {
          loop_mode
          qppermb,0     %b[68], %b[107], %r7, %b[68]
          qppermb,1     %b[76], %b[109], %r7, %b[71]
          qpfsubs,2     %b[51], %b[60], %b[73]
          qppermb,3     %b[64], %b[108], %r7, %b[64]
          qppermb,4     %b[72], %b[110], %r7, %b[72]
          qpfsubs,5     %b[49], %b[52], %b[74]
        }
        {
          loop_mode
          qppermb,0     %b[12], %b[114], %r7, %b[12]
          qppermb,1     %b[11], %b[116], %r7, %b[11]
          qpfsubs,2     %b[50], %b[56], %b[75]
          qppermb,3     %b[14], %b[113], %r7, %b[14]
          qppermb,4     %b[13], %b[115], %r7, %b[13]
          qpfsubs,5     %b[47], %b[48], %b[76]
        }
        {
          loop_mode
          qpfsubs,0     %b[59], %b[71], %b[77]
          qpfsubs,1     %b[53], %b[68], %b[78]
          qpfsubs,2     %b[63], %b[12], %b[79]
          qpfsubs,3     %b[55], %b[72], %b[81]
          qpfsubs,4     %b[58], %b[14], %b[82]
          qpfsubs,5     %b[54], %b[64], %b[80]
        }
        {
          loop_mode
          qppermb,0     %g16, %b[62], %r7, %g16
          qppermb,1     %g23, %b[61], %r7, %g23
          qpfsubs,2     %b[67], %b[11], %b[83]
          qppermb,3     %g22, %b[86], %r7, %g22
          qppermb,4     %g30, %b[85], %r7, %g30
          qpfsubs,5     %b[57], %b[13], %b[84]
        }
        {
          loop_mode
          qpfadds,0     %b[51], %b[60], %b[51]
          qpfadds,1     %b[47], %b[48], %b[47]
          qpfadds,2     %b[49], %b[52], %b[48]
          qpfadds,3     %b[50], %b[56], %b[49]
          qpfadds,4     %b[59], %b[71], %b[50]
          qpfadds,5     %b[54], %b[64], %b[52]
        }
        {
          loop_mode
          qppermb,0     %b[39], %b[66], %r7, %b[39]
          qppermb,1     %b[43], %b[65], %r7, %b[43]
          qpfadds,2     %b[53], %b[68], %b[53]
          qppermb,3     %g18, %b[90], %r7, %g18
          qppermb,4     %g25, %b[89], %r7, %g25
          qpfadds,5     %b[55], %b[72], %b[54]
        }
        {
          loop_mode
          qpfadds,0     %b[67], %b[11], %b[11]
          qpfadds,1     %g24, %g23, %b[55]
          qpfsubs,2     %g17, %g16, %b[56]
          qpfadds,3     %b[58], %b[14], %b[14]
          qpfadds,4     %b[57], %b[13], %b[13]
          qpfadds,5     %b[63], %b[12], %b[12]
        }
        {
          loop_mode
          qppermb,0     %b[37], %b[69], %r7, %b[37]
          qppermb,1     %g28, %b[70], %r7, %g28
          qpfsubs,2     %g24, %g23, %g23
          qppermb,3     %b[45], %b[93], %r7, %b[45]
          qppermb,4     %b[41], %b[88], %r7, %b[41]
          qpfadds,5     %g17, %g16, %g16
        }
        {
          loop_mode
          qpfsubs,0     %g31, %g30, %g17
          qpfadds,1     %g27, %g22, %g24
          qpfadds,2     %g31, %g30, %g30
          qpfsubs,3     %g27, %g22, %g22
          qpfsubs,4     %g26, %g25, %g27
          qpfadds,5     %g26, %g25, %g25
        }
        {
          loop_mode
          qpfsubs,0     %b[44], %b[43], %g26
          qpfadds,1     %b[40], %b[39], %g31
          qpfadds,2     %b[44], %b[43], %b[43]
          qpfsubs,3     %b[40], %b[39], %b[39]
          qpfsubs,4     %g19, %g18, %b[40]
          qpfadds,5     %g19, %g18, %g18
        }
        {
          loop_mode
          qpfadds,0     %b[38], %b[37], %g19
          qpfsubs,1     %g29, %g28, %b[44]
          qpfsubs,2     %b[38], %b[37], %b[37]
          qpfsubs,3     %b[46], %b[45], %b[45]
          qpfadds,4     %b[42], %b[41], %b[46]
          qpfadds,5     %b[46], %b[45], %b[38]
        }
        {
          loop_mode
          qpfadds,0     %b[55], %b[48], %b[41]
          qpfsubs,1     %b[55], %b[48], %b[42]
          qpfadds,2     %g29, %g28, %g28
          qpfadds,3     %g16, %b[47], %b[48]
          qpfsubs,4     %g16, %b[47], %g16
          qpfsubs,5     %b[42], %b[41], %g29
        }
        {
          loop_mode
          qpfadds,0     %g30, %b[51], %b[47]
          qpfadds,1     %g24, %b[49], %b[51]
          qpfsubs,2     %g30, %b[51], %g30
          qpfsubs,3     %g25, %b[50], %g25
          qpfadds,5     %g25, %b[50], %b[55]
        }
        {
          loop_mode
          qpfsubs,0     %g24, %b[49], %g24
          qpfadds,1     %b[43], %b[53], %b[49]
          qpfsubs,2     %g31, %b[52], %b[50]
          qpfsubs,3     %g18, %b[54], %g18
          qpfadds,5     %g18, %b[54], %b[57]
        }
        {
          loop_mode
          qpfsubs,0     %b[43], %b[53], %b[43]
          qpfadds,1     %g31, %b[52], %g31
          qpfadds,2     %g19, %b[13], %b[52]
          qpfsubs,3     %b[38], %b[11], %b[53]
          qpshufb,4     %b[75], %b[75], %r24, %b[54]
          qpfadds,5     %b[46], %b[12], %b[58]
        }
        {
          loop_mode
          qpfadds,0     %g28, %b[14], %b[13]
          qpfsubs,1     %g28, %b[14], %g28
          qpfsubs,2     %g19, %b[13], %g19
          qpfsubs,3     %b[46], %b[12], %b[12]
          qpshufb,4     %b[73], %b[73], %r24, %b[59]
          qpfadds,5     %b[38], %b[11], %b[11]
        }
        {
          loop_mode
          qpshufb,4     %b[74], %b[74], %r24, %b[14]
        }
        {
          loop_mode
          qpshufb,4     %b[78], %b[78], %r24, %b[38]
        }
        {
          loop_mode
          qpshufb,0     %b[77], %b[77], %r24, %b[46]
          qpshufb,1     %b[83], %b[83], %r24, %b[60]
          qpshufb,3     %b[76], %b[76], %r24, %b[61]
          qpshufb,4     %b[80], %b[80], %r24, %b[62]
        }
        {
          loop_mode
          qpshufb,0     %b[81], %b[81], %r24, %b[63]
          qpshufb,1     %b[79], %b[79], %r24, %b[64]
          qpshufb,3     %b[82], %b[82], %r24, %b[65]
          qpshufb,4     %b[84], %b[84], %r24, %b[66]
        }
        {
          loop_mode
          qpxor,0       %b[54], %r22, %b[54]
          qpxor,1       %b[59], %r22, %b[59]
          qpxor,3       %b[14], %r22, %b[14]
          qpxor,4       %b[38], %r22, %b[38]
        }
        {
          loop_mode
          qpxor,0       %b[46], %r22, %b[46]
          qpxor,1       %b[60], %r22, %b[60]
          qpfadds,2     %g17, %b[59], %b[67]
          qpxor,3       %b[61], %r22, %b[61]
          qpxor,4       %b[62], %r22, %b[62]
          qpfadds,5     %g26, %b[38], %b[68]
        }
        {
          loop_mode
          qpxor,0       %b[63], %r22, %b[63]
          qpxor,1       %b[64], %r22, %b[64]
          qpfsubs,2     %g17, %b[59], %g17
          qpxor,3       %b[65], %r22, %b[65]
          qpxor,4       %b[66], %r22, %b[66]
          qpfsubs,5     %g23, %b[14], %b[59]
        }
        {
          loop_mode
          qpfadds,0     %g27, %b[46], %b[69]
          qpfsubs,1     %g22, %b[54], %b[70]
          qpfadds,2     %g22, %b[54], %g22
          qpfadds,3     %g23, %b[14], %g23
          qpfsubs,4     %g26, %b[38], %g26
          qpfsubs,5     %b[56], %b[61], %b[14]
        }
        {
          loop_mode
          qpfsubs,0     %g27, %b[46], %g27
          qpfsubs,1     %b[45], %b[60], %b[38]
          qpfadds,2     %g29, %b[64], %b[46]
          qpfadds,3     %b[56], %b[61], %b[54]
          qpfsubs,4     %b[39], %b[62], %b[56]
          qpfadds,5     %b[39], %b[62], %b[39]
        }
        {
          loop_mode
          qpfadds,0     %b[45], %b[60], %b[45]
          qpfadds,1     %b[40], %b[63], %b[60]
          qpfsubs,2     %g29, %b[64], %g29
          qpfadds,3     %b[37], %b[66], %b[37]
          qpfsubs,4     %b[44], %b[65], %b[62]
          qpfsubs,5     %b[37], %b[66], %b[61]
        }
        {
          loop_mode
          nop 1
          qpfsubs,2     %b[40], %b[63], %b[40]
          qpfadds,5     %b[44], %b[65], %b[44]
        }
        {
          loop_mode
          qpshufb,0     %b[95], %b[95], %r24, %b[63]
          qpshufb,1     %g20, %g20, %r24, %b[64]
          qpshufb,3     %b[96], %b[96], %r24, %b[65]
          qpshufb,4     %b[104], %b[104], %r24, %b[66]
        }
        {
          loop_mode
          qpshufb,0     %b[112], %b[112], %r24, %b[71]
          qpshufb,1     %b[105], %b[105], %r24, %b[72]
          qpshufb,3     %b[106], %b[106], %r24, %b[73]
          qpshufb,4     %b[122], %b[122], %r24, %b[74]
        }
        {
          loop_mode
          qpshufb,0     %b[35], %b[35], %r24, %b[75]
          qpshufb,1     %b[123], %b[123], %r24, %b[76]
          qpshufb,3     %b[36], %b[36], %r24, %b[77]
          qpshufb,4     %b[32], %b[32], %r24, %b[78]
        }
        {
          loop_mode
          qpshufb,0     %b[29], %b[29], %r24, %b[79]
          qpshufb,1     %b[31], %b[31], %r24, %b[80]
          qpshufb,3     %b[30], %b[30], %r24, %b[81]
          qpshufb,4     %b[26], %b[26], %r24, %b[82]
        }
        {
          loop_mode
          qpshufb,0     %b[25], %b[25], %r24, %b[83]
          qpshufb,1     %b[24], %b[24], %r24, %b[84]
          qpshufb,3     %b[21], %b[21], %r24, %b[85]
          qpshufb,4     %b[22], %b[22], %r24, %b[86]
        }
        {
          loop_mode
          qpshufb,0     %b[18], %b[18], %r24, %b[87]
          qpshufb,1     %b[15], %b[15], %r24, %b[88]
          qpshufb,3     %b[19], %b[19], %r24, %b[89]
          qpshufb,4     %b[16], %b[16], %r24, %b[90]
        }
        {
          loop_mode
          qpshufb,0     %b[51], %b[48], %r6, %b[91]
          qpshufb,1     %b[47], %b[41], %r6, %b[92]
          qpshufb,3     %g24, %g16, %r6, %b[93]
          qpshufb,4     %g30, %b[42], %r6, %b[94]
        }
        {
          loop_mode
          qpshufb,0     %b[57], %g31, %r6, %b[99]
          qpshufb,1     %b[55], %b[49], %r6, %b[100]
          qpfmul_hadds,2        %b[65], %b[91], %r25, %b[65]
          qpshufb,3     %g18, %b[50], %r6, %b[101]
          qpshufb,4     %g25, %b[43], %r6, %b[102]
          qpfmul_hadds,5        %b[75], %b[93], %r25, %b[75]
        }
        {
          loop_mode
          qpshufb,0     %b[58], %b[13], %r6, %b[103]
          qpshufb,1     %b[11], %b[52], %r6, %b[107]
          qpfmul_hadds,2        %b[72], %b[92], %r25, %b[72]
          qpshufb,3     %b[12], %g28, %r6, %b[108]
          qpshufb,4     %b[53], %g19, %r6, %b[109]
          qpfmul_hadds,5        %b[80], %b[94], %r25, %b[80]
        }
        {
          loop_mode
          qpshufb,0     %g17, %b[59], %r6, %b[110]
          qpshufb,1     %b[67], %g23, %r6, %b[113]
          qpfmul_hsubs,2        %b[96], %b[91], %r25, %b[91]
          qpshufb,3     %g27, %g26, %r6, %b[114]
          qpshufb,4     %b[69], %b[68], %r6, %b[115]
          qpfmul_hsubs,5        %b[35], %b[93], %r25, %b[35]
        }
        {
          loop_mode
          qpshufb,0     %b[70], %b[14], %r6, %b[93]
          qpshufb,1     %g22, %b[54], %r6, %b[96]
          qpfmul_hsubs,2        %b[31], %b[94], %r25, %b[31]
          qpshufb,3     %b[60], %b[39], %r6, %b[116]
          qpshufb,4     %b[45], %b[37], %r6, %b[117]
          qpfmul_hsubs,5        %b[105], %b[92], %r25, %b[92]
        }
        {
          loop_mode
          qpshufb,0     %b[40], %b[56], %r6, %b[94]
          qpshufb,1     %g29, %b[62], %r6, %b[105]
          qpfmul_hsubs,2        %b[32], %b[102], %r25, %b[32]
          qpshufb,3     %b[38], %b[61], %r6, %b[118]
          qpshufb,4     %b[46], %b[44], %r6, %b[119]
          qpfmul_hadds,5        %b[78], %b[102], %r25, %b[78]
        }
        {
          loop_mode
          qpfmul_hsubs,0        %b[95], %b[99], %r25, %b[95]
          qpfmul_hsubs,1        %b[36], %b[101], %r25, %b[36]
          qpfmul_hsubs,2        %b[104], %b[100], %r25, %b[102]
          qpfmul_hadds,3        %b[63], %b[99], %r25, %b[63]
          qpfmul_hadds,4        %b[77], %b[101], %r25, %b[77]
          qpfmul_hadds,5        %b[66], %b[100], %r25, %b[66]
        }
        {
          loop_mode
          qpfmul_hsubs,0        %b[18], %b[108], %r25, %b[18]
          qpfmul_hsubs,1        %b[22], %b[107], %r25, %b[22]
          qpfmul_hsubs,2        %b[16], %b[109], %r25, %b[16]
          qpfmul_hadds,3        %b[87], %b[108], %r25, %b[87]
          qpfmul_hadds,4        %b[86], %b[107], %r25, %b[86]
          qpfmul_hadds,5        %b[90], %b[109], %r25, %b[90]
        }
        {
          loop_mode
          qpfmul_hsubs,0        %b[24], %b[103], %r25, %b[24]
          qpfmul_hadds,1        %b[84], %b[103], %r25, %b[84]
          qpfmul_hsubs,2        %b[25], %b[113], %r25, %b[25]
          qpfmul_hadds,3        %b[83], %b[113], %r25, %b[83]
          qpfmul_hsubs,4        %b[26], %b[115], %r25, %b[26]
          qpfmul_hadds,5        %b[82], %b[115], %r25, %b[82]
        }
        {
          loop_mode
          qpfmul_hsubs,0        %b[29], %b[96], %r25, %b[29]
          qpfmul_hsubs,1        %b[123], %b[110], %r25, %b[99]
          qpfmul_hadds,2        %b[79], %b[96], %r25, %b[79]
          qpfmul_hadds,3        %b[76], %b[110], %r25, %b[76]
          qpfmul_hsubs,4        %b[122], %b[114], %r25, %b[96]
          qpfmul_hadds,5        %b[74], %b[114], %r25, %b[74]
        }
        {
          loop_mode
          qpfmul_hsubs,0        %b[112], %b[93], %r25, %b[100]
          qpfmul_hadds,1        %b[71], %b[93], %r25, %b[71]
          qpfmul_hsubs,2        %b[30], %b[116], %r25, %b[30]
          qpfmul_hadds,3        %b[81], %b[116], %r25, %b[81]
          qpfmul_hsubs,4        %g20, %b[117], %r25, %g20
          qpfmul_hadds,5        %b[64], %b[117], %r25, %b[64]
        }
        {
          loop_mode
          qpfmul_hsubs,0        %b[106], %b[94], %r25, %b[93]
          qpfmul_hadds,1        %b[73], %b[94], %r25, %b[73]
          qpfmul_hsubs,2        %b[15], %b[119], %r25, %b[15]
          qpfmul_hsubs,3        %b[19], %b[118], %r25, %b[19]
          qpfmul_hadds,4        %b[88], %b[119], %r25, %b[88]
          qpfmul_hadds,5        %b[89], %b[118], %r25, %b[89]
        }
        {
          loop_mode
          nop 5
          qpfmul_hadds,0        %b[85], %b[105], %r25, %b[85]
          qpfmul_hsubs,2        %b[21], %b[105], %r25, %b[21]
        }
        {
          loop_mode
          qpshufb,1     %b[97], %b[97], %r24, %b[94]
          qpshufb,3     %g21, %g21, %r24, %b[101]
          qpshufb,4     %b[98], %b[98], %r24, %b[103]
        }
        {
          loop_mode
          qpshufb,0     %b[121], %b[121], %r24, %b[104]
          qpshufb,1     %b[111], %b[111], %r24, %b[105]
          qpshufb,3     %b[34], %b[34], %r24, %b[106]
          qpshufb,4     %b[33], %b[33], %r24, %b[107]
        }
        {
          loop_mode
          qpshufb,0     %b[27], %b[27], %r24, %b[108]
          qpshufb,1     %b[28], %b[28], %r24, %b[109]
          qpshufb,3     %b[23], %b[23], %r24, %b[110]
          qpshufb,4     %b[20], %b[20], %r24, %b[112]
        }
        {
          loop_mode
          qpshufb,0     %b[17], %b[17], %r24, %b[113]
          qpshufb,1     %b[47], %b[41], %r23, %b[41]
          qpshufb,3     %g30, %b[42], %r23, %g30
          qpshufb,4     %b[55], %b[49], %r23, %b[42]
        }
        {
          loop_mode
          qpshufb,0     %g25, %b[43], %r23, %g25
          qpshufb,1     %b[11], %b[52], %r23, %b[11]
          qpfmul_hsubs,2        %b[98], %b[41], %r25, %b[43]
          qpshufb,3     %b[53], %g19, %r23, %g19
          qpshufb,4     %g17, %b[59], %r23, %g17
          qpfmul_hsubs,5        %b[33], %g30, %r25, %b[33]
        }
        {
          loop_mode
          qpshufb,0     %b[67], %g23, %r23, %g23
          qpshufb,1     %g27, %g26, %r23, %g26
          qpfmul_hadds,2        %b[106], %g25, %r25, %b[47]
          qpshufb,3     %b[69], %b[68], %r23, %g27
          qpshufb,4     %b[38], %b[61], %r23, %b[38]
          qpfmul_hsubs,5        %b[17], %g19, %r25, %b[17]
        }
        {
          loop_mode
          qpshufb,0     %b[45], %b[37], %r23, %b[37]
          qpfmul_hadds,1        %b[103], %b[41], %r25, %b[41]
          qpfmul_hsubs,2        %b[34], %g25, %r25, %g25
          qpfmul_hadds,3        %b[107], %g30, %r25, %g30
          qpfmul_hsubs,4        %b[97], %b[42], %r25, %b[34]
          qpfmul_hadds,5        %b[94], %b[42], %r25, %b[42]
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[110], %b[11], %r25, %b[45]
          qpfmul_hsubs,1        %b[23], %b[11], %r25, %b[11]
          qpfmul_hadds,2        %b[108], %g23, %r25, %b[49]
          qpfmul_hsubs,3        %b[121], %g17, %r25, %b[23]
          qpfmul_hadds,4        %b[104], %g17, %r25, %g17
          qpfmul_hadds,5        %b[113], %g19, %r25, %g19
        }
        {
          loop_mode
          qpfmul_hsubs,0        %b[111], %g26, %r25, %b[52]
          qpfmul_hadds,1        %b[105], %g26, %r25, %g26
          qpfmul_hsubs,2        %b[27], %g23, %r25, %g23
          qpfmul_hsubs,3        %b[28], %g27, %r25, %b[28]
          qpfmul_hadds,4        %b[109], %g27, %r25, %g27
          qpfmul_hsubs,5        %b[20], %b[38], %r25, %b[20]
        }
        {
          loop_mode
          qpfmul_hadds,0        %b[101], %b[37], %r25, %b[27]
          qppermb,1     %b[75], %b[35], %r7, %b[35]
          qpfmul_hsubs,2        %g21, %b[37], %r25, %g21
          qppermb,3     %b[80], %b[31], %r7, %b[31]
          qppermb,4     %b[65], %b[91], %r7, %b[38]
          qpfmul_hadds,5        %b[112], %b[38], %r25, %b[37]
        }
        {
          loop_mode
          qppermb,0     %b[72], %b[92], %r7, %b[53]
          qppermb,1     %b[63], %b[95], %r7, %b[55]
          qppermb,3     %b[66], %b[102], %r7, %b[59]
          qppermb,4     %b[78], %b[32], %r7, %b[32]
        }
        {
          loop_mode
          qppermb,0     %b[90], %b[16], %r7, %b[16]
          qppermb,1     %b[77], %b[36], %r7, %b[36]
          qppermb,3     %b[87], %b[18], %r7, %b[18]
          qppermb,4     %b[84], %b[24], %r7, %b[24]
        }
        {
          loop_mode
          qppermb,0     %b[86], %b[22], %r7, %b[22]
        }
        {
          loop_mode
          qpfsubs,1     %b[38], %b[53], %b[61]
          qpfsubs,3     %b[35], %b[31], %b[63]
          qpfadds,4     %b[35], %b[31], %b[31]
        }
        {
          loop_mode
          qpfsubs,0     %b[36], %b[32], %b[65]
          qpfadds,1     %b[38], %b[53], %b[38]
          qpfsubs,2     %b[55], %b[59], %b[35]
          qpfadds,3     %b[55], %b[59], %b[53]
        }
        {
          loop_mode
          qpfsubs,0     %b[24], %b[22], %b[59]
          qpfadds,1     %b[36], %b[32], %b[32]
          qpfsubs,2     %b[18], %b[16], %b[55]
          qpfadds,3     %b[18], %b[16], %b[16]
        }
        {
          loop_mode
          qpfadds,0     %b[24], %b[22], %b[22]
          qppermb,4     %b[83], %b[25], %r7, %b[18]
        }
        {
          loop_mode
          qppermb,4     %b[79], %b[29], %r7, %b[24]
        }
        {
          loop_mode
          qppermb,1     %b[76], %b[99], %r7, %b[25]
          qppermb,3     %b[71], %b[100], %r7, %b[29]
          qppermb,4     %b[74], %b[96], %r7, %b[36]
          qpfsubs,5     %b[24], %b[18], %b[66]
        }
        {
          loop_mode
          qppermb,0     %b[82], %b[26], %r7, %b[26]
          qppermb,1     %b[64], %g20, %r7, %g20
          qppermb,3     %b[81], %b[30], %r7, %b[30]
          qppermb,4     %b[85], %b[21], %r7, %b[21]
          qpfadds,5     %b[24], %b[18], %b[18]
        }
        {
          loop_mode
          qppermb,0     %b[88], %b[15], %r7, %b[15]
          qppermb,1     %b[89], %b[19], %r7, %b[19]
          qppermb,3     %b[73], %b[93], %r7, %b[24]
          qpshufb,4     %b[51], %b[48], %r23, %b[48]
        }
        {
          loop_mode
          qpshufb,0     %g24, %g16, %r23, %g16
          qpshufb,1     %g18, %b[50], %r23, %g18
          qpfsubs,2     %b[15], %g20, %b[64]
          qpshufb,3     %b[57], %g31, %r23, %g24
          qpshufb,4     %b[58], %b[13], %r23, %g31
          qpfsubs,5     %b[24], %b[36], %b[51]
        }
        {
          loop_mode
          qpshufb,0     %b[12], %g28, %r23, %g28
          qpshufb,1     %g22, %b[54], %r23, %g22
          qpfsubs,2     %b[29], %b[25], %b[13]
          qpshufb,3     %b[70], %b[14], %r23, %b[12]
          qpshufb,4     %b[40], %b[56], %r23, %b[14]
          qpfadds,5     %b[29], %b[25], %b[25]
        }
        {
          loop_mode
          qpfsubs,0     %b[21], %b[19], %b[40]
          qpshufb,1     %b[60], %b[39], %r23, %b[39]
          qpfsubs,2     %b[30], %b[26], %b[29]
          qpshufb,3     %b[46], %b[44], %r23, %b[44]
          qppermb,4     %b[41], %b[43], %r7, %b[41]
          qpfadds,5     %b[30], %b[26], %b[26]
        }
        {
          loop_mode
          qppermb,0     %b[47], %g25, %r7, %g25
          qpshufb,1     %g29, %b[62], %r23, %g29
          qpfadds,2     %b[15], %g20, %g20
          qppermb,3     %g30, %b[33], %r7, %g30
          qppermb,4     %b[42], %b[34], %r7, %b[30]
          qpfadds,5     %b[24], %b[36], %b[15]
        }
        {
          loop_mode
          qpfadds,0     %b[21], %b[19], %b[17]
          qppermb,1     %b[45], %b[11], %r7, %b[11]
          qpfadds,2     %g18, %g25, %b[21]
          qppermb,3     %g17, %b[23], %r7, %g17
          qppermb,4     %g19, %b[17], %r7, %g19
          qpfsubs,5     %b[48], %b[41], %b[19]
        }
        {
          loop_mode
          qppermb,0     %b[49], %g23, %r7, %g23
          qppermb,1     %g26, %b[52], %r7, %g26
          qpfsubs,2     %g18, %g25, %g18
          qppermb,3     %g27, %b[28], %r7, %g27
          qppermb,4     %b[37], %b[20], %r7, %b[20]
          qpfadds,5     %b[48], %b[41], %b[23]
        }
        {
          loop_mode
          qpfadds,0     %g31, %b[11], %b[27]
          qppermb,1     %b[27], %g21, %r7, %g21
          qpfsubs,2     %g31, %b[11], %g31
          qpfsubs,3     %g24, %b[30], %g25
          qpfsubs,4     %g16, %g30, %b[24]
          qpfadds,5     %g24, %b[30], %g24
        }
        {
          loop_mode
          qpfadds,0     %g16, %g30, %g16
          qpfsubs,1     %g22, %g23, %b[12]
          qpfadds,2     %b[14], %g26, %b[28]
          qpfadds,3     %b[12], %g17, %g30
          qpfsubs,4     %b[12], %g17, %g17
          qpfsubs,5     %b[39], %g27, %b[11]
        }
        {
          loop_mode
          qpfadds,0     %g28, %g19, %b[30]
          qpfsubs,1     %g28, %g19, %g19
          qpfsubs,2     %b[14], %g26, %g26
          qpfadds,3     %b[39], %g27, %g27
          qpfsubs,4     %g29, %b[20], %g29
          qpfadds,5     %g29, %b[20], %g28
        }
        {
          loop_mode
          qpfadds,0     %b[44], %g21, %g23
          qpfsubs,1     %b[44], %g21, %g21
          qpfadds,2     %g22, %g23, %g22
          qpfadds,3     %b[23], %b[38], %b[14]
          qpfsubs,4     %b[23], %b[38], %b[20]
        }
        {
          loop_mode
          qpfadds,0     %b[21], %b[32], %b[23]
          qpfsubs,1     %b[21], %b[32], %b[21]
          qpfadds,2     %b[27], %b[22], %b[32]
          qpfsubs,3     %g24, %b[53], %b[33]
          qpfadds,4     %g24, %b[53], %g24
        }
        {
          loop_mode
          qpfadds,0     %g16, %b[31], %b[34]
          qpfsubs,1     %g16, %b[31], %g16
          qpfsubs,2     %b[27], %b[22], %b[22]
          qpfsubs,3     %g30, %b[25], %b[27]
          qpfadds,4     %g30, %b[25], %g30
        }
        {
          loop_mode
          qpfadds,0     %b[30], %b[16], %b[25]
          qpfsubs,1     %b[30], %b[16], %b[16]
          qpfsubs,2     %b[28], %b[15], %b[31]
          qpfadds,3     %g27, %b[26], %b[30]
          qpfsubs,4     %g27, %b[26], %g27
          qpfsubs,5     %g28, %b[17], %b[26]
        }
        {
          loop_mode
          qpfadds,0     %g22, %b[18], %b[36]
          qpfsubs,1     %g22, %b[18], %g22
          qpfadds,2     %g23, %g20, %b[18]
          qpfadds,3     %b[28], %b[15], %b[15]
          qpfadds,4     %g28, %b[17], %g28
          stqp,5        %r18, %r0, %b[20]
        }
        {
          loop_mode
          qpfsubs,0     %g23, %g20, %g20
          stqp,2        %r36, %r0, %b[21]
          stqp,5        %r2, %r0, %b[14]
        }
        {
          loop_mode
          stqp,2        %r16, %r0, %g16
          stqp,5        %r30, %r0, %b[33]
        }
        {
          loop_mode
          qpshufb,1     %b[61], %b[61], %r24, %g16
          stqp,2        %r27, %r0, %b[23]
          qpshufb,3     %b[63], %b[63], %r24, %g23
          qpshufb,4     %b[35], %b[35], %r24, %b[14]
          stqp,5        %r29, %r0, %g24
        }
        {
          loop_mode
          qpshufb,0     %b[55], %b[55], %r24, %g24
          qpshufb,1     %b[65], %b[65], %r24, %b[17]
          stqp,2        %r51, %r0, %b[22]
          qpshufb,3     %b[59], %b[59], %r24, %b[20]
          qpshufb,4     %b[66], %b[66], %r24, %b[21]
          stqp,5        %r20, %r0, %b[34]
        }
        {
          loop_mode
          qpshufb,0     %b[13], %b[13], %r24, %b[13]
          qpshufb,1     %b[51], %b[51], %r24, %b[22]
          stqp,2        %r38, %r0, %b[32]
          qpshufb,3     %b[29], %b[29], %r24, %b[23]
          qpshufb,4     %b[64], %b[64], %r24, %b[28]
          stqp,5        %r21, %r0, %g30
        }
        {
          loop_mode
          qpshufb,0     %b[40], %b[40], %r24, %g30
          qpxor,1       %g16, %r22, %g16
          stqp,2        %r13, %r0, %b[27]
          qpxor,3       %g23, %r22, %g23
          qpxor,4       %b[14], %r22, %b[14]
          stqp,5        %r35, %r0, %g27
        }
        {
          loop_mode
          qpxor,0       %b[17], %r22, %g27
          qpxor,1       %g24, %r22, %g24
          qpfadds,2     %b[19], %g16, %b[21]
          qpxor,3       %b[20], %r22, %b[17]
          qpxor,4       %b[21], %r22, %b[20]
          qpfadds,5     %b[24], %g23, %b[27]
        }
        {
          loop_mode
          qpxor,0       %b[13], %r22, %b[13]
          qpxor,1       %b[22], %r22, %b[22]
          qpfsubs,2     %b[19], %g16, %g16
          qpxor,3       %b[23], %r22, %b[23]
          qpxor,4       %b[28], %r22, %b[28]
          qpfsubs,5     %b[24], %g23, %g23
        }
        {
          loop_mode
          qpxor,0       %g30, %r22, %g30
          qpfsubs,1     %g19, %g24, %b[14]
          qpfadds,2     %g19, %g24, %g19
          qpfsubs,3     %g25, %b[14], %b[19]
          qpfadds,4     %g25, %b[14], %g25
          qpfadds,5     %g31, %b[17], %b[24]
        }
        {
          loop_mode
          qpfsubs,0     %g18, %g27, %g24
          qpfadds,1     %g18, %g27, %g18
          qpfadds,2     %g17, %b[13], %b[17]
          qpfsubs,3     %g31, %b[17], %g27
          qpfsubs,4     %b[12], %b[20], %g31
          qpfadds,5     %b[12], %b[20], %b[12]
        }
        {
          loop_mode
          qpfsubs,0     %g17, %b[13], %g17
          qpfadds,1     %g26, %b[22], %b[20]
          qpfsubs,2     %g26, %b[22], %g26
          qpfsubs,3     %b[11], %b[23], %b[13]
          qpfadds,4     %b[11], %b[23], %b[11]
          qpfsubs,5     %g21, %b[28], %b[23]
        }
        {
          loop_mode
          qpfadds,0     %g21, %b[28], %g21
          qpfsubs,1     %g29, %g30, %b[22]
          qpfadds,2     %g29, %g30, %g29
          stqp,5        %r26, %r0, %b[30]
        }
        {
          loop_mode
          stqp,2        %r37, %r0, %b[31]
          stqp,5        %r44, %r0, %b[25]
        }
        {
          loop_mode
          stqp,2        %r49, %r0, %b[16]
          stqp,5        %r3, %r0, %g22
        }
        {
          loop_mode
          stqp,2        %r28, %r0, %b[15]
          stqp,5        %r17, %r0, %b[36]
        }
        {
          loop_mode
          stqp,2        %r54, %r0, %g20
          stqp,5        %r43, %r0, %b[18]
        }
        {
          loop_mode
          stqp,2        %r50, %r0, %b[26]
          stqp,5        %r5, %r0, %b[27]
        }
        {
          loop_mode
          stqp,2        %r45, %r0, %g28
          stqp,5        %r11, %r0, %g23
        }
        {
          loop_mode
          stqp,2        %r15, %r0, %b[21]
          stqp,5        %r19, %r0, %g16
        }
        {
          loop_mode
          stqp,2        %r34, %r0, %g25
          stqp,5        %r9, %r0, %b[19]
        }
        {
          loop_mode
          stqp,2        %r57, %r0, %g19
          stqp,5        %r47, %r0, %b[14]
        }
        {
          loop_mode
          stqp,2        %r53, %r0, %b[24]
          stqp,5        %r32, %r0, %g24
        }
        {
          loop_mode
          stqp,2        %r40, %r0, %g18
          stqp,5        %r42, %r0, %g27
        }
        {
          loop_mode
          stqp,2        %r4, %r0, %b[12]
          stqp,5        %r14, %r0, %g31
        }
        {
          loop_mode
          stqp,2        %r1, %r0, %b[17]
          stqp,5        %r12, %r0, %g17
        }
        {
          loop_mode
          stqp,2        %r56, %r0, %g21
          stqp,5        %r39, %r0, %b[11]
        }
        {
          loop_mode
          stqp,2        %r46, %r0, %b[23]
          stqp,5        %r31, %r0, %b[13]
        }
        {
          loop_mode
          stqp,2        %r41, %r0, %b[20]
          stqp,5        %r33, %r0, %g26
        }
        {
          loop_mode
          addd,0,sm     %r0, _f16s,_lts0lo 0x30, %r0
          stqp,2        %r52, %r0, %g29
          stqp,5        %r48, %r0, %b[22]
        }
        {
          loop_mode
          ct    %ctpr1 ? %NOT_LOOP_END
          alc   alcf=1, alct=1
        }
        {
          loop_mode
          ldd,0 %r8, _f16s,_lts0lo 0xff30, %b[11], mas=0x4
          ldd,2 %r8, _f16s,_lts0hi 0xff38, %b[12], mas=0x4
          ldd,3 %r8, _f16s,_lts1lo 0xff40, %b[13], mas=0x4
          ldd,5 %r8, _f16s,_lts1hi 0xff48, %b[14], mas=0x4
        }
        {
          loop_mode
          ldd,0 %r8, _f16s,_lts0lo 0xff50, %b[15], mas=0x4
          ldd,2 %r8, _f16s,_lts0hi 0xff58, %b[16], mas=0x4
          ldd,3 %r8, _f16s,_lts1lo 0xff60, %b[17], mas=0x4
          ldd,5 %r8, _f16s,_lts1hi 0xff68, %b[18], mas=0x4
        }
        {
          loop_mode
          ldd,0 %r8, _f16s,_lts0lo 0xff70, %b[19], mas=0x4
          ldd,2 %r8, _f16s,_lts0hi 0xff78, %b[20], mas=0x4
          ldd,3 %r8, _f16s,_lts1lo 0xff80, %b[21], mas=0x4
          ldd,5 %r8, _f16s,_lts1hi 0xff88, %b[22], mas=0x4
        }
        {
          loop_mode
          ldd,0 %r8, _f16s,_lts0lo 0xff90, %b[23], mas=0x4
          ldd,2 %r8, _f16s,_lts0hi 0xff98, %b[24], mas=0x4
          ldd,3 %r8, _f16s,_lts1lo 0xffa0, %b[25], mas=0x4
          ldd,5 %r8, _f16s,_lts1hi 0xffa8, %b[26], mas=0x4
        }
        {
          loop_mode
          ldd,0 %r8, _f16s,_lts0lo 0xffb0, %b[27], mas=0x4
          ldd,2 %r8, _f16s,_lts0hi 0xffb8, %b[28], mas=0x4
          ldd,3 %r8, _f16s,_lts1lo 0xffc0, %b[29], mas=0x4
          ldd,5 %r8, _f16s,_lts1hi 0xffc8, %b[30], mas=0x4
        }
        {
          loop_mode
          ldd,0 %r8, _f16s,_lts0lo 0xffd0, %b[31], mas=0x4
          ldd,2 %r8, _f16s,_lts0hi 0xffd8, %b[32], mas=0x4
          ldd,3 %r8, _f16s,_lts1lo 0xffe0, %b[33], mas=0x4
          ldd,5 %r8, _f16s,_lts1hi 0xffe8, %b[34], mas=0x4
        }
        {
          loop_mode
          ldd,0 %r8, _f16s,_lts0lo 0xfff0, %b[35], mas=0x4
          ldd,2 %r8, _f16s,_lts0hi 0xfff8, %b[36], mas=0x4
        }

Theoretical speed: 96 complex numbers (8 bytes each) in 158 cycles: 96 × 8 / 158 = 4.86 bytes/cycle
Quadrupled theoretical speed: 19.44 bytes/cycle
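
The theoretical speeds quoted after each assembly listing follow one formula: the number of complex values handled per loop body, times 8 bytes per float complex value, divided by the cycles the loop body takes. The sketch below reproduces that arithmetic. The function names are illustrative and are not part of the article's code.

```c
#include <assert.h>

/* Size of one float complex element: two 4-byte floats. */
enum { BYTES_PER_COMPLEX = 8 };

/* Bytes of complex data processed per cycle by a loop body that
   handles `complex_count` values in `cycles` cycles. */
double theoretical_bytes_per_cycle(int complex_count, int cycles)
{
    assert(complex_count > 0 && cycles > 0);
    return (double)complex_count * BYTES_PER_COMPLEX / cycles;
}

/* The same speed scaled by the number of radix-2 passes that one pass
   of the kernel replaces (2 for radix-4, 4 for the "2x" variants). */
double radix2_equivalent_speed(int complex_count, int cycles, int radix2_passes)
{
    assert(radix2_passes > 0);
    return theoretical_bytes_per_cycle(complex_count, cycles) * radix2_passes;
}
```

For the loop above, `theoretical_bytes_per_cycle(96, 158)` gives about 4.86 and `radix2_equivalent_speed(96, 158, 4)` about 19.44.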

Speed measurements

Summary for stage_radix4_2x

Speeds dropped compared with the original stage_radix4 versions.
The FFT plot is here.


stage_radix4_readConjSwap

One pass of stage_radix4_readConjSwap does the same work as two passes of stage_radix2. For easier comparison with stage_radix2, the speed of stage_radix4_readConjSwap is therefore multiplied by 2 (this is noted on the plot axis and in the console output).

1. stage_radix4_readConjSwap_simd64

The computation follows the same scheme as stage_radix2_readConjSwap_simd64.
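
As a reminder of that scheme, here is a portable sketch of a complex multiplication built from two element-wise products and one horizontal add. It assumes that each coefficient is stored twice, once as (re, -im) and once as (im, re), and that the horizontal add puts the sum of its first operand's halves into the low half and the sum of its second operand's halves into the high half. Both assumptions are inferred from how the kernel uses `pfmuls` and `pfhadds`; the type and function names are hypothetical.

```c
#include <assert.h>

/* Two packed floats; lo holds the real part, hi the imaginary part. */
typedef struct { float lo, hi; } pair_f32;

/* Element-wise product, modelling how pfmuls is used in the kernel. */
pair_f32 pair_mul(pair_f32 a, pair_f32 b)
{
    pair_f32 r = { a.lo * b.lo, a.hi * b.hi };
    return r;
}

/* Horizontal add as assumed here: a.lo + a.hi goes to the low half,
   b.lo + b.hi goes to the high half. */
pair_f32 pair_hadd(pair_f32 a, pair_f32 b)
{
    pair_f32 r = { a.lo + a.hi, b.lo + b.hi };
    return r;
}

/* c * y computed from the "conj" and "swap" forms of c. */
pair_f32 complex_mul_conj_swap(pair_f32 c, pair_f32 y)
{
    pair_f32 conj_c = { c.lo, -c.hi };          /* (re, -im) */
    pair_f32 swap_c = { c.hi,  c.lo };          /* (im,  re) */
    pair_f32 real_terms = pair_mul(conj_c, y);  /* (cr*yr, -ci*yi) */
    pair_f32 imag_terms = pair_mul(swap_c, y);  /* (ci*yr,  cr*yi) */
    return pair_hadd(real_terms, imag_terms);   /* (Re(c*y), Im(c*y)) */
}

/* Usage example: (1 + 2i) * (3 + 4i) = -5 + 10i. */
void check_complex_mul_conj_swap(void)
{
    pair_f32 c = { 1.0f, 2.0f }, y = { 3.0f, 4.0f };
    pair_f32 p = complex_mul_conj_swap(c, y);
    assert(p.lo == -5.0f && p.hi == 10.0f);
}
```

In the real kernel the two forms of each coefficient are precomputed and read from separate arrays, so the negation and the swap cost nothing inside the loop.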

C code
void stage_radix4_readConjSwap_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coefC, myComplex *conj_coefD, myComplex *conj_coefE, myComplex *swap_coefC, myComplex *swap_coefD, myComplex *swap_coefE)
{
	uint64_t *x_in = (uint64_t*)&data_in[0];
	uint64_t *y_in = (uint64_t*)&data_in[1];
	uint64_t *z_in = (uint64_t*)&data_in[2];
	uint64_t *w_in = (uint64_t*)&data_in[3];
	uint64_t *conj_c_in = (uint64_t*)conj_coefC;
	uint64_t *conj_d_in = (uint64_t*)conj_coefD;
	uint64_t *conj_e_in = (uint64_t*)conj_coefE;
	uint64_t *swap_c_in = (uint64_t*)swap_coefC;
	uint64_t *swap_d_in = (uint64_t*)swap_coefD;
	uint64_t *swap_e_in = (uint64_t*)swap_coefE;

	uint64_t *out_0 = (uint64_t*)&data_out[0*data_count/4];
	uint64_t *out_1 = (uint64_t*)&data_out[1*data_count/4];
	uint64_t *out_2 = (uint64_t*)&data_out[2*data_count/4];
	uint64_t *out_3 = (uint64_t*)&data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/4; ++i)
	{
		uint64_t x = x_in[4*i];
		uint64_t y = y_in[4*i];
		uint64_t z = z_in[4*i];
		uint64_t w = w_in[4*i];
		uint64_t conj_c = conj_c_in[i];
		uint64_t conj_d = conj_d_in[i];
		uint64_t conj_e = conj_e_in[i];
		uint64_t swap_c = swap_c_in[i];
		uint64_t swap_d = swap_d_in[i];
		uint64_t swap_e = swap_e_in[i];

		uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
		uint64_t dz_real = __builtin_e2k_pfmuls(conj_d, z);
		uint64_t ew_real = __builtin_e2k_pfmuls(conj_e, w);
		uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);
		uint64_t dz_imag = __builtin_e2k_pfmuls(swap_d, z);
		uint64_t ew_imag = __builtin_e2k_pfmuls(swap_e, w);

		uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);
		uint64_t dz = __builtin_e2k_pfhadds(dz_real, dz_imag);
		uint64_t ew = __builtin_e2k_pfhadds(ew_real, ew_imag);

		uint64_t add02 = __builtin_e2k_pfadds( x, dz);
		uint64_t sub02 = __builtin_e2k_pfsubs( x, dz);
		uint64_t add13 = __builtin_e2k_pfadds(cy, ew);
		uint64_t sub13 = __builtin_e2k_pfsubs(cy, ew);

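		// Multiply sub13 by i (assuming the real part is in the low 32 bits):
		// swap the two halves, then flip the sign bit (bit 31) of the new low half.
		// Commented-out alternative: flip the sign of the imaginary part first, then swap.
		//uint64_t conj_sub13 = __builtin_e2k_pxord(sub13, 1LL<<63);
		//uint64_t sub13i = __builtin_e2k_pshufb(0, conj_sub13, 0x0302010007060504);
		uint64_t swap_sub13 = __builtin_e2k_pshufb(0, sub13, 0x0302010007060504);
		uint64_t sub13i = __builtin_e2k_pxord(swap_sub13, 1LL<<31);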

		out_0[i] = __builtin_e2k_pfadds(add02, add13);
		out_1[i] = __builtin_e2k_pfsubs(sub02, sub13i);
		out_2[i] = __builtin_e2k_pfsubs(add02, add13);
		out_3[i] = __builtin_e2k_pfadds(sub02, sub13i);
	}
}
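
For checking the intrinsic version, a scalar reference of one butterfly can be handy. The sketch below mirrors the variable names of the listing above (cy, dz, ew, add02, sub02, add13, sub13, sub13i) but works on a plain struct. The `cplx` type and the helper names are hypothetical, and the code assumes c, d and e are the ordinary (not conjugated, not swapped) twiddle factors.

```c
#include <assert.h>

typedef struct { float re, im; } cplx;

cplx c_add(cplx a, cplx b) { cplx r = { a.re + b.re, a.im + b.im }; return r; }
cplx c_sub(cplx a, cplx b) { cplx r = { a.re - b.re, a.im - b.im }; return r; }
cplx c_mul(cplx a, cplx b)
{
    cplx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}
/* i * a = -a.im + i * a.re */
cplx c_mul_i(cplx a) { cplx r = { -a.im, a.re }; return r; }

/* One radix-4 butterfly: inputs x, y, z, w and twiddles c, d, e. */
void radix4_butterfly_ref(cplx x, cplx y, cplx z, cplx w,
                          cplx c, cplx d, cplx e, cplx out[4])
{
    cplx cy = c_mul(c, y);
    cplx dz = c_mul(d, z);
    cplx ew = c_mul(e, w);

    cplx add02  = c_add(x, dz);
    cplx sub02  = c_sub(x, dz);
    cplx add13  = c_add(cy, ew);
    cplx sub13  = c_sub(cy, ew);
    cplx sub13i = c_mul_i(sub13);

    out[0] = c_add(add02, add13);
    out[1] = c_sub(sub02, sub13i);
    out[2] = c_sub(add02, add13);
    out[3] = c_add(sub02, sub13i);
}

/* Usage example: a unit value on the second input with unit twiddles
   produces 1, -i, -1, i. */
void check_radix4_butterfly_ref(void)
{
    cplx one = { 1.0f, 0.0f }, zero = { 0.0f, 0.0f }, out[4];
    radix4_butterfly_ref(zero, one, zero, zero, one, one, one, out);
    assert(out[0].re ==  1.0f && out[0].im ==  0.0f);
    assert(out[1].re ==  0.0f && out[1].im == -1.0f);
    assert(out[2].re == -1.0f && out[2].im ==  0.0f);
    assert(out[3].re ==  0.0f && out[3].im ==  1.0f);
}
```

Comparing the outputs of this function with the four output arrays of the intrinsic kernel, element by element and with a small tolerance, is one way to validate the SIMD code on random inputs.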
Main loop in assembly
.L640:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=3, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=3, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=3, abs=8, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=3, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=16, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=3, abs=24, disp=0
        }
.L275:
        {
          loop_mode
          pfmul_hadds,0,sm      %b[43], %b[59], %b[80], %b[11]
          pfadd_adds,1,sm       %b[34], %b[17], %b[87], %b[67]
          pfadd_rsubs,2,sm      %b[34], %b[17], %b[87], %b[54]
          pfmuls,3,sm   %b[12], %b[55], %b[76]
          pfsubs,4,sm   %b[42], %b[25], %b[88]
          pfadds,5,sm   %b[44], %b[27], %b[83]
          movad,1       area=0, ind=0, am=1, be=0, %b[70]
          movad,2       area=2, ind=0, am=1, be=0, %b[1]
          movad,3       area=1, ind=0, am=1, be=0, %b[0]
        }
        {
          loop_mode
          pfsub_rsubs,0,sm      %b[34], %b[17], %b[99], %b[80]
          pfsub_adds,1,sm       %b[34], %b[17], %b[99], %b[59]
          staad,2       %b[73], %aad4[ %aasti11 ]
          incr,2        %aaincr0
          pfmuls,3,sm   %b[100], %b[64], %b[87]
          pshufb,4,sm   0x0, %b[90], %r21, %b[93]
          staad,5       %b[60], %aad2[ %aasti9 ]
          incr,5        %aaincr0
          movad,0       area=3, ind=0, am=1, be=0, %b[44]
          movad,1       area=2, ind=0, am=1, be=0, %b[27]
          movad,2       area=0, ind=0, am=0, be=0, %b[12]
          movad,3       area=0, ind=16, am=0, be=0, %b[43]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          pfmul_hadds,0,sm      %b[52], %b[68], %b[91], %b[17]
          pfmul_hadds,1,sm      %b[9], %b[81], %b[94], %b[34]
          staad,2       %b[86], %aad3[ %aasti10 ]
          incr,2        %aaincr0
          pfmuls,3,sm   %b[74], %b[77], %b[90]
          xord,4,sm     %b[95], %r9, %b[97]
          staad,5       %b[65], %aad1[ %aasti8 ]
          incr,5        %aaincr0
          movad,1       area=1, ind=0, am=1, be=0, %b[96]
          movad,2       area=0, ind=8, am=1, be=0, %b[73]
          movad,3       area=0, ind=24, am=0, be=0, %b[60]
        }

Theoretical speed: 4 complex numbers (8 bytes each) in 3 cycles: 4 × 8 / 3 = 10.67 bytes/cycle
Doubled theoretical speed: 21.33 bytes/cycle

Speed measurements

2. stage_radix4_readConjSwap_simd128

The computation follows the same scheme as stage_radix2_readConjSwap_simd128.

C code
void stage_radix4_readConjSwap_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coefC, myComplex *conj_coefD, myComplex *conj_coefE, myComplex *swap_coefC, myComplex *swap_coefD, myComplex *swap_coefE)
{
	__v2di *xy0_in = (__v2di*)&data_in[0];
	__v2di *zw0_in = (__v2di*)&data_in[2];
	__v2di *xy1_in = (__v2di*)&data_in[4];
	__v2di *zw1_in = (__v2di*)&data_in[6];
	__v2di *conj_c_in = (__v2di*)conj_coefC;
	__v2di *conj_d_in = (__v2di*)conj_coefD;
	__v2di *conj_e_in = (__v2di*)conj_coefE;
	__v2di *swap_c_in = (__v2di*)swap_coefC;
	__v2di *swap_d_in = (__v2di*)swap_coefD;
	__v2di *swap_e_in = (__v2di*)swap_coefE;

	__v2di *out_0 = (__v2di*)&data_out[0*data_count/4];
	__v2di *out_1 = (__v2di*)&data_out[1*data_count/4];
	__v2di *out_2 = (__v2di*)&data_out[2*data_count/4];
	__v2di *out_3 = (__v2di*)&data_out[3*data_count/4];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/8; ++i)
	{
		__v2di xy0 = xy0_in[4*i];
		__v2di zw0 = zw0_in[4*i];
		__v2di xy1 = xy1_in[4*i];
		__v2di zw1 = zw1_in[4*i];
		__v2di conj_c = conj_c_in[i];
		__v2di conj_d = conj_d_in[i];
		__v2di conj_e = conj_e_in[i];
		__v2di swap_c = swap_c_in[i];
		__v2di swap_d = swap_d_in[i];
		__v2di swap_e = swap_e_in[i];

		__v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		__v2di cy_real = __builtin_e2k_qpfmuls(conj_c, y);
		__v2di dz_real = __builtin_e2k_qpfmuls(conj_d, z);
		__v2di ew_real = __builtin_e2k_qpfmuls(conj_e, w);
		__v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
		__v2di dz_imag = __builtin_e2k_qpfmuls(swap_d, z);
		__v2di ew_imag = __builtin_e2k_qpfmuls(swap_e, w);

		__v2di cy_rrii = __builtin_e2k_qpfhadds(cy_real, cy_imag);
		__v2di dz_rrii = __builtin_e2k_qpfhadds(dz_real, dz_imag);
		__v2di ew_rrii = __builtin_e2k_qpfhadds(ew_real, ew_imag);

		__v2di dz = __builtin_e2k_qpshufb(dz_rrii, dz_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

		__v2di add02      = __builtin_e2k_qpfadds( x, dz);
		__v2di sub02      = __builtin_e2k_qpfsubs( x, dz);
		__v2di add13_rrii = __builtin_e2k_qpfadds(cy_rrii, ew_rrii);
		__v2di sub13_rrii = __builtin_e2k_qpfsubs(cy_rrii, ew_rrii);

		__v2di add13 = __builtin_e2k_qpshufb(add13_rrii, add13_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

		//__v2di conj_sub13 = __builtin_e2k_qpxor(sub13_rrii, (__v2di){(1LL<<63) + (1LL<<31), 0});
		//__v2di sub13i = __builtin_e2k_qpshufb(conj_sub13, conj_sub13, (__v2di){0x030201000B0A0908, 0x070605040F0E0D0C});
		__v2di swap_sub13 = __builtin_e2k_qpshufb(sub13_rrii, sub13_rrii, (__v2di){0x030201000B0A0908, 0x070605040F0E0D0C});
		__v2di sub13i = __builtin_e2k_qpxor(swap_sub13, (__v2di){1LL<<31, 1LL<<31});

		out_0[i] = __builtin_e2k_qpfadds(add02, add13);
		out_1[i] = __builtin_e2k_qpfsubs(sub02, sub13i);
		out_2[i] = __builtin_e2k_qpfsubs(add02, add13);
		out_3[i] = __builtin_e2k_qpfadds(sub02, sub13i);
	}
}
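
Both kernels in this section obtain sub13i from sub13 with a shuffle followed by an XOR. For a single complex value this amounts to a multiplication by i, which the portable sketch below reproduces on a 64-bit word. It assumes the real part occupies bits 0..31 and the imaginary part bits 32..63; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack a complex value into 64 bits: real part in bits 0..31,
   imaginary part in bits 32..63 (assumed layout). */
uint64_t pack_complex(float re, float im)
{
    uint32_t r, i;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&r, &re, sizeof r);
    memcpy(&i, &im, sizeof i);
    return (uint64_t)r | ((uint64_t)i << 32);
}

void unpack_complex(uint64_t v, float *re, float *im)
{
    uint32_t r = (uint32_t)v;
    uint32_t i = (uint32_t)(v >> 32);
    memcpy(re, &r, sizeof r);
    memcpy(im, &i, sizeof i);
}

/* Multiply by i: swap the two halves (the shuffle step), then flip the
   sign bit of the new low float (the XOR with 1 << 31).
   (re, im) -> (im, re) -> (-im, re). */
uint64_t mul_by_i_packed(uint64_t v)
{
    uint64_t swapped = (v >> 32) | (v << 32);
    return swapped ^ (UINT64_C(1) << 31);
}

/* Usage example: i * (3 + 4i) = -4 + 3i. */
void check_mul_by_i_packed(void)
{
    float re, im;
    unpack_complex(mul_by_i_packed(pack_complex(3.0f, 4.0f)), &re, &im);
    assert(re == -4.0f && im == 3.0f);
}
```

Flipping a sign bit with XOR avoids a floating-point multiplication, so the rotation costs only one shuffle and one logical operation per value.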
Main loop in assembly
.L1243:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=3, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=3, abs=8, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=3, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=16, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=3, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=3, abs=24, disp=0
        }
.L699:
        {
          loop_mode
          qpfmuls,0,sm  %b[59], %b[33], %b[48]
          qpfmul_hadds,1,sm     %b[32], %b[41], %b[80], %b[6]
          qpfadd_adds,2,sm      %b[24], %b[77], %b[62], %b[47]
          qpshufb,3,sm  %b[54], %b[54], %g17, %b[60]
          qpshufb,4,sm  %b[42], %b[42], %g16, %b[69]
          qpfadd_rsubs,5,sm     %b[24], %b[77], %b[62], %b[7]
          movaqp,1      area=0, ind=0, am=0, be=0, %b[1]
          movaqp,3      area=0, ind=0, am=0, be=0, %b[0]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[19], %b[38], %b[73], %b[72]
          qpfmul_hadds,1,sm     %b[29], %b[35], %b[50], %b[41]
          qpfsub_rsubs,2,sm     %b[24], %b[77], %b[81], %b[54]
          qpshufb,3,sm  %b[2], %b[3], %r22, %b[32]
          qpxor,4,sm    %b[69], %g18, %b[79]
          qpfsub_adds,5,sm      %b[24], %b[77], %b[81], %b[42]
          movaqp,1      area=0, ind=16, am=1, be=0, %b[62]
          movaqp,3      area=0, ind=16, am=1, be=0, %b[59]
        }
        {
          loop_mode
          qpfadds,0,sm  %b[76], %b[10], %b[50]
          qpfsubs,1,sm  %b[76], %b[10], %b[38]
          staaqp,2      %b[51], %aad4[ %aasti11 ]
          incr,2        %aaincr0
          qpshufb,3,sm  %b[61], %b[64], %r22, %b[35]
          qpshufb,4,sm  %b[63], %b[66], %g19, %b[29]
          staaqp,5      %b[11], %aad2[ %aasti9 ]
          incr,5        %aaincr0
          movaqp,1      area=3, ind=0, am=1, be=0, %b[19]
          movaqp,3      area=3, ind=0, am=1, be=0, %b[24]
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qpfmuls,0,sm  %b[70], %b[34], %b[69]
          qpfmuls,1,sm  %b[67], %b[37], %b[76]
          staaqp,2      %b[58], %aad3[ %aasti10 ]
          incr,2        %aaincr0
          qpshufb,3,sm  %b[4], %b[5], %g19, %b[10]
          qpshufb,4,sm  %b[45], %b[45], %g17, %b[73]
          staaqp,5      %b[46], %aad1[ %aasti8 ]
          incr,5        %aaincr0
          movaqp,0      area=2, ind=0, am=1, be=0, %b[63]
          movaqp,1      area=1, ind=0, am=1, be=0, %b[66]
          movaqp,2      area=2, ind=0, am=1, be=0, %b[11]
          movaqp,3      area=1, ind=0, am=1, be=0, %b[51]
        }

Theoretical speed: 8 complex numbers (8 bytes each) in 4 cycles: 8 × 8 / 4 = 16 bytes/cycle
Doubled theoretical speed: 32 bytes/cycle

Speed measurements

Summary for stage_radix4_readConjSwap

The FFT plot is here.


stage_radix4_readConjSwap_2x

One pass of stage_radix4_readConjSwap_2x does the same work as two passes of stage_radix4_readConjSwap, and one pass of stage_radix4_readConjSwap does the same work as two passes of stage_radix2. For easier comparison with stage_radix2, the speed of stage_radix4_readConjSwap_2x is therefore multiplied by 4 (this is noted on the plot axis and in the console output).

1. stage_radix4_readConjSwap_2x_simd64

Here the stage_radix4_readConjSwap_simd64 algorithm is manually unrolled by a factor of 2: two consecutive radix-4 stages are fused into a single pass.

C code
void stage_radix4_readConjSwap_2x_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coefC_a, myComplex *conj_coefD_a, myComplex *conj_coefE_a, myComplex *conj_coefC_b, myComplex *conj_coefD_b, myComplex *conj_coefE_b, myComplex *swap_coefC_a, myComplex *swap_coefD_a, myComplex *swap_coefE_a, myComplex *swap_coefC_b, myComplex *swap_coefD_b, myComplex *swap_coefE_b)
{
	uint64_t *x0_in = (uint64_t*)&data_in[ 0];
	uint64_t *y0_in = (uint64_t*)&data_in[ 1];
	uint64_t *z0_in = (uint64_t*)&data_in[ 2];
	uint64_t *w0_in = (uint64_t*)&data_in[ 3];
	uint64_t *x1_in = (uint64_t*)&data_in[ 4];
	uint64_t *y1_in = (uint64_t*)&data_in[ 5];
	uint64_t *z1_in = (uint64_t*)&data_in[ 6];
	uint64_t *w1_in = (uint64_t*)&data_in[ 7];
	uint64_t *x2_in = (uint64_t*)&data_in[ 8];
	uint64_t *y2_in = (uint64_t*)&data_in[ 9];
	uint64_t *z2_in = (uint64_t*)&data_in[10];
	uint64_t *w2_in = (uint64_t*)&data_in[11];
	uint64_t *x3_in = (uint64_t*)&data_in[12];
	uint64_t *y3_in = (uint64_t*)&data_in[13];
	uint64_t *z3_in = (uint64_t*)&data_in[14];
	uint64_t *w3_in = (uint64_t*)&data_in[15];
	uint64_t *conj_c0a_in = (uint64_t*)&conj_coefC_a[0];
	uint64_t *conj_c1a_in = (uint64_t*)&conj_coefC_a[1];
	uint64_t *conj_c2a_in = (uint64_t*)&conj_coefC_a[2];
	uint64_t *conj_c3a_in = (uint64_t*)&conj_coefC_a[3];
	uint64_t *conj_d0a_in = (uint64_t*)&conj_coefD_a[0];
	uint64_t *conj_d1a_in = (uint64_t*)&conj_coefD_a[1];
	uint64_t *conj_d2a_in = (uint64_t*)&conj_coefD_a[2];
	uint64_t *conj_d3a_in = (uint64_t*)&conj_coefD_a[3];
	uint64_t *conj_e0a_in = (uint64_t*)&conj_coefE_a[0];
	uint64_t *conj_e1a_in = (uint64_t*)&conj_coefE_a[1];
	uint64_t *conj_e2a_in = (uint64_t*)&conj_coefE_a[2];
	uint64_t *conj_e3a_in = (uint64_t*)&conj_coefE_a[3];
	uint64_t *conj_c0b_in = (uint64_t*)&conj_coefC_b[0*data_count/16];
	uint64_t *conj_c1b_in = (uint64_t*)&conj_coefC_b[1*data_count/16];
	uint64_t *conj_c2b_in = (uint64_t*)&conj_coefC_b[2*data_count/16];
	uint64_t *conj_c3b_in = (uint64_t*)&conj_coefC_b[3*data_count/16];
	uint64_t *conj_d0b_in = (uint64_t*)&conj_coefD_b[0*data_count/16];
	uint64_t *conj_d1b_in = (uint64_t*)&conj_coefD_b[1*data_count/16];
	uint64_t *conj_d2b_in = (uint64_t*)&conj_coefD_b[2*data_count/16];
	uint64_t *conj_d3b_in = (uint64_t*)&conj_coefD_b[3*data_count/16];
	uint64_t *conj_e0b_in = (uint64_t*)&conj_coefE_b[0*data_count/16];
	uint64_t *conj_e1b_in = (uint64_t*)&conj_coefE_b[1*data_count/16];
	uint64_t *conj_e2b_in = (uint64_t*)&conj_coefE_b[2*data_count/16];
	uint64_t *conj_e3b_in = (uint64_t*)&conj_coefE_b[3*data_count/16];
	uint64_t *swap_c0a_in = (uint64_t*)&swap_coefC_a[0];
	uint64_t *swap_c1a_in = (uint64_t*)&swap_coefC_a[1];
	uint64_t *swap_c2a_in = (uint64_t*)&swap_coefC_a[2];
	uint64_t *swap_c3a_in = (uint64_t*)&swap_coefC_a[3];
	uint64_t *swap_d0a_in = (uint64_t*)&swap_coefD_a[0];
	uint64_t *swap_d1a_in = (uint64_t*)&swap_coefD_a[1];
	uint64_t *swap_d2a_in = (uint64_t*)&swap_coefD_a[2];
	uint64_t *swap_d3a_in = (uint64_t*)&swap_coefD_a[3];
	uint64_t *swap_e0a_in = (uint64_t*)&swap_coefE_a[0];
	uint64_t *swap_e1a_in = (uint64_t*)&swap_coefE_a[1];
	uint64_t *swap_e2a_in = (uint64_t*)&swap_coefE_a[2];
	uint64_t *swap_e3a_in = (uint64_t*)&swap_coefE_a[3];
	uint64_t *swap_c0b_in = (uint64_t*)&swap_coefC_b[0*data_count/16];
	uint64_t *swap_c1b_in = (uint64_t*)&swap_coefC_b[1*data_count/16];
	uint64_t *swap_c2b_in = (uint64_t*)&swap_coefC_b[2*data_count/16];
	uint64_t *swap_c3b_in = (uint64_t*)&swap_coefC_b[3*data_count/16];
	uint64_t *swap_d0b_in = (uint64_t*)&swap_coefD_b[0*data_count/16];
	uint64_t *swap_d1b_in = (uint64_t*)&swap_coefD_b[1*data_count/16];
	uint64_t *swap_d2b_in = (uint64_t*)&swap_coefD_b[2*data_count/16];
	uint64_t *swap_d3b_in = (uint64_t*)&swap_coefD_b[3*data_count/16];
	uint64_t *swap_e0b_in = (uint64_t*)&swap_coefE_b[0*data_count/16];
	uint64_t *swap_e1b_in = (uint64_t*)&swap_coefE_b[1*data_count/16];
	uint64_t *swap_e2b_in = (uint64_t*)&swap_coefE_b[2*data_count/16];
	uint64_t *swap_e3b_in = (uint64_t*)&swap_coefE_b[3*data_count/16];

	uint64_t *out_0  = (uint64_t*)&data_out[ 0*data_count/16];
	uint64_t *out_1  = (uint64_t*)&data_out[ 1*data_count/16];
	uint64_t *out_2  = (uint64_t*)&data_out[ 2*data_count/16];
	uint64_t *out_3  = (uint64_t*)&data_out[ 3*data_count/16];
	uint64_t *out_4  = (uint64_t*)&data_out[ 4*data_count/16];
	uint64_t *out_5  = (uint64_t*)&data_out[ 5*data_count/16];
	uint64_t *out_6  = (uint64_t*)&data_out[ 6*data_count/16];
	uint64_t *out_7  = (uint64_t*)&data_out[ 7*data_count/16];
	uint64_t *out_8  = (uint64_t*)&data_out[ 8*data_count/16];
	uint64_t *out_9  = (uint64_t*)&data_out[ 9*data_count/16];
	uint64_t *out_10 = (uint64_t*)&data_out[10*data_count/16];
	uint64_t *out_11 = (uint64_t*)&data_out[11*data_count/16];
	uint64_t *out_12 = (uint64_t*)&data_out[12*data_count/16];
	uint64_t *out_13 = (uint64_t*)&data_out[13*data_count/16];
	uint64_t *out_14 = (uint64_t*)&data_out[14*data_count/16];
	uint64_t *out_15 = (uint64_t*)&data_out[15*data_count/16];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/16; ++i)
	{
		uint64_t x0 = x0_in[16*i];
		uint64_t y0 = y0_in[16*i];
		uint64_t z0 = z0_in[16*i];
		uint64_t w0 = w0_in[16*i];
		uint64_t conj_c0 = conj_c0a_in[4*i];
		uint64_t conj_d0 = conj_d0a_in[4*i];
		uint64_t conj_e0 = conj_e0a_in[4*i];
		uint64_t swap_c0 = swap_c0a_in[4*i];
		uint64_t swap_d0 = swap_d0a_in[4*i];
		uint64_t swap_e0 = swap_e0a_in[4*i];

		uint64_t x1 = x1_in[16*i];
		uint64_t y1 = y1_in[16*i];
		uint64_t z1 = z1_in[16*i];
		uint64_t w1 = w1_in[16*i];
		uint64_t conj_c1 = conj_c1a_in[4*i];
		uint64_t conj_d1 = conj_d1a_in[4*i];
		uint64_t conj_e1 = conj_e1a_in[4*i];
		uint64_t swap_c1 = swap_c1a_in[4*i];
		uint64_t swap_d1 = swap_d1a_in[4*i];
		uint64_t swap_e1 = swap_e1a_in[4*i];

		uint64_t x2 = x2_in[16*i];
		uint64_t y2 = y2_in[16*i];
		uint64_t z2 = z2_in[16*i];
		uint64_t w2 = w2_in[16*i];
		uint64_t conj_c2 = conj_c2a_in[4*i];
		uint64_t conj_d2 = conj_d2a_in[4*i];
		uint64_t conj_e2 = conj_e2a_in[4*i];
		uint64_t swap_c2 = swap_c2a_in[4*i];
		uint64_t swap_d2 = swap_d2a_in[4*i];
		uint64_t swap_e2 = swap_e2a_in[4*i];

		uint64_t x3 = x3_in[16*i];
		uint64_t y3 = y3_in[16*i];
		uint64_t z3 = z3_in[16*i];
		uint64_t w3 = w3_in[16*i];
		uint64_t conj_c3 = conj_c3a_in[4*i];
		uint64_t conj_d3 = conj_d3a_in[4*i];
		uint64_t conj_e3 = conj_e3a_in[4*i];
		uint64_t swap_c3 = swap_c3a_in[4*i];
		uint64_t swap_d3 = swap_d3a_in[4*i];
		uint64_t swap_e3 = swap_e3a_in[4*i];

		uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
		uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
		uint64_t cy2_real = __builtin_e2k_pfmuls(conj_c2, y2);
		uint64_t cy3_real = __builtin_e2k_pfmuls(conj_c3, y3);
		uint64_t dz0_real = __builtin_e2k_pfmuls(conj_d0, z0);
		uint64_t dz1_real = __builtin_e2k_pfmuls(conj_d1, z1);
		uint64_t dz2_real = __builtin_e2k_pfmuls(conj_d2, z2);
		uint64_t dz3_real = __builtin_e2k_pfmuls(conj_d3, z3);
		uint64_t ew0_real = __builtin_e2k_pfmuls(conj_e0, w0);
		uint64_t ew1_real = __builtin_e2k_pfmuls(conj_e1, w1);
		uint64_t ew2_real = __builtin_e2k_pfmuls(conj_e2, w2);
		uint64_t ew3_real = __builtin_e2k_pfmuls(conj_e3, w3);
		uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
		uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
		uint64_t cy2_imag = __builtin_e2k_pfmuls(swap_c2, y2);
		uint64_t cy3_imag = __builtin_e2k_pfmuls(swap_c3, y3);
		uint64_t dz0_imag = __builtin_e2k_pfmuls(swap_d0, z0);
		uint64_t dz1_imag = __builtin_e2k_pfmuls(swap_d1, z1);
		uint64_t dz2_imag = __builtin_e2k_pfmuls(swap_d2, z2);
		uint64_t dz3_imag = __builtin_e2k_pfmuls(swap_d3, z3);
		uint64_t ew0_imag = __builtin_e2k_pfmuls(swap_e0, w0);
		uint64_t ew1_imag = __builtin_e2k_pfmuls(swap_e1, w1);
		uint64_t ew2_imag = __builtin_e2k_pfmuls(swap_e2, w2);
		uint64_t ew3_imag = __builtin_e2k_pfmuls(swap_e3, w3);

		uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
		uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
		uint64_t cy2 = __builtin_e2k_pfhadds(cy2_real, cy2_imag);
		uint64_t cy3 = __builtin_e2k_pfhadds(cy3_real, cy3_imag);
		uint64_t dz0 = __builtin_e2k_pfhadds(dz0_real, dz0_imag);
		uint64_t dz1 = __builtin_e2k_pfhadds(dz1_real, dz1_imag);
		uint64_t dz2 = __builtin_e2k_pfhadds(dz2_real, dz2_imag);
		uint64_t dz3 = __builtin_e2k_pfhadds(dz3_real, dz3_imag);
		uint64_t ew0 = __builtin_e2k_pfhadds(ew0_real, ew0_imag);
		uint64_t ew1 = __builtin_e2k_pfhadds(ew1_real, ew1_imag);
		uint64_t ew2 = __builtin_e2k_pfhadds(ew2_real, ew2_imag);
		uint64_t ew3 = __builtin_e2k_pfhadds(ew3_real, ew3_imag);

		uint64_t add02_0 = __builtin_e2k_pfadds( x0, dz0);
		uint64_t add02_1 = __builtin_e2k_pfadds( x1, dz1);
		uint64_t add02_2 = __builtin_e2k_pfadds( x2, dz2);
		uint64_t add02_3 = __builtin_e2k_pfadds( x3, dz3);
		uint64_t sub02_0 = __builtin_e2k_pfsubs( x0, dz0);
		uint64_t sub02_1 = __builtin_e2k_pfsubs( x1, dz1);
		uint64_t sub02_2 = __builtin_e2k_pfsubs( x2, dz2);
		uint64_t sub02_3 = __builtin_e2k_pfsubs( x3, dz3);
		uint64_t add13_0 = __builtin_e2k_pfadds(cy0, ew0);
		uint64_t add13_1 = __builtin_e2k_pfadds(cy1, ew1);
		uint64_t add13_2 = __builtin_e2k_pfadds(cy2, ew2);
		uint64_t add13_3 = __builtin_e2k_pfadds(cy3, ew3);
		uint64_t sub13_0 = __builtin_e2k_pfsubs(cy0, ew0);
		uint64_t sub13_1 = __builtin_e2k_pfsubs(cy1, ew1);
		uint64_t sub13_2 = __builtin_e2k_pfsubs(cy2, ew2);
		uint64_t sub13_3 = __builtin_e2k_pfsubs(cy3, ew3);

		//uint64_t conj_sub13_0 = __builtin_e2k_pxord(sub13_0, 1LL<<63);
		//uint64_t conj_sub13_1 = __builtin_e2k_pxord(sub13_1, 1LL<<63);
		//uint64_t conj_sub13_2 = __builtin_e2k_pxord(sub13_2, 1LL<<63);
		//uint64_t conj_sub13_3 = __builtin_e2k_pxord(sub13_3, 1LL<<63);
		//uint64_t sub13i_0 = __builtin_e2k_pshufb(0, conj_sub13_0, 0x0302010007060504);
		//uint64_t sub13i_1 = __builtin_e2k_pshufb(0, conj_sub13_1, 0x0302010007060504);
		//uint64_t sub13i_2 = __builtin_e2k_pshufb(0, conj_sub13_2, 0x0302010007060504);
		//uint64_t sub13i_3 = __builtin_e2k_pshufb(0, conj_sub13_3, 0x0302010007060504);
		uint64_t swap_sub13_0 = __builtin_e2k_pshufb(0, sub13_0, 0x0302010007060504);
		uint64_t swap_sub13_1 = __builtin_e2k_pshufb(0, sub13_1, 0x0302010007060504);
		uint64_t swap_sub13_2 = __builtin_e2k_pshufb(0, sub13_2, 0x0302010007060504);
		uint64_t swap_sub13_3 = __builtin_e2k_pshufb(0, sub13_3, 0x0302010007060504);
		uint64_t sub13i_0 = __builtin_e2k_pxord(swap_sub13_0, 1LL<<31);
		uint64_t sub13i_1 = __builtin_e2k_pxord(swap_sub13_1, 1LL<<31);
		uint64_t sub13i_2 = __builtin_e2k_pxord(swap_sub13_2, 1LL<<31);
		uint64_t sub13i_3 = __builtin_e2k_pxord(swap_sub13_3, 1LL<<31);

		uint64_t out0  = __builtin_e2k_pfadds(add02_0, add13_0);
		uint64_t out1  = __builtin_e2k_pfadds(add02_1, add13_1);
		uint64_t out2  = __builtin_e2k_pfadds(add02_2, add13_2);
		uint64_t out3  = __builtin_e2k_pfadds(add02_3, add13_3);
		uint64_t out4  = __builtin_e2k_pfsubs(sub02_0, sub13i_0);
		uint64_t out5  = __builtin_e2k_pfsubs(sub02_1, sub13i_1);
		uint64_t out6  = __builtin_e2k_pfsubs(sub02_2, sub13i_2);
		uint64_t out7  = __builtin_e2k_pfsubs(sub02_3, sub13i_3);
		uint64_t out8  = __builtin_e2k_pfsubs(add02_0, add13_0);
		uint64_t out9  = __builtin_e2k_pfsubs(add02_1, add13_1);
		uint64_t out10 = __builtin_e2k_pfsubs(add02_2, add13_2);
		uint64_t out11 = __builtin_e2k_pfsubs(add02_3, add13_3);
		uint64_t out12 = __builtin_e2k_pfadds(sub02_0, sub13i_0);
		uint64_t out13 = __builtin_e2k_pfadds(sub02_1, sub13i_1);
		uint64_t out14 = __builtin_e2k_pfadds(sub02_2, sub13i_2);
		uint64_t out15 = __builtin_e2k_pfadds(sub02_3, sub13i_3);


		x0 = out0;
		y0 = out1;
		z0 = out2;
		w0 = out3;
		conj_c0 = conj_c0b_in[i];
		conj_d0 = conj_d0b_in[i];
		conj_e0 = conj_e0b_in[i];
		swap_c0 = swap_c0b_in[i];
		swap_d0 = swap_d0b_in[i];
		swap_e0 = swap_e0b_in[i];

		x1 = out4;
		y1 = out5;
		z1 = out6;
		w1 = out7;
		conj_c1 = conj_c1b_in[i];
		conj_d1 = conj_d1b_in[i];
		conj_e1 = conj_e1b_in[i];
		swap_c1 = swap_c1b_in[i];
		swap_d1 = swap_d1b_in[i];
		swap_e1 = swap_e1b_in[i];

		x2 = out8;
		y2 = out9;
		z2 = out10;
		w2 = out11;
		conj_c2 = conj_c2b_in[i];
		conj_d2 = conj_d2b_in[i];
		conj_e2 = conj_e2b_in[i];
		swap_c2 = swap_c2b_in[i];
		swap_d2 = swap_d2b_in[i];
		swap_e2 = swap_e2b_in[i];

		x3 = out12;
		y3 = out13;
		z3 = out14;
		w3 = out15;
		conj_c3 = conj_c3b_in[i];
		conj_d3 = conj_d3b_in[i];
		conj_e3 = conj_e3b_in[i];
		swap_c3 = swap_c3b_in[i];
		swap_d3 = swap_d3b_in[i];
		swap_e3 = swap_e3b_in[i];

		cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
		cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
		cy2_real = __builtin_e2k_pfmuls(conj_c2, y2);
		cy3_real = __builtin_e2k_pfmuls(conj_c3, y3);
		dz0_real = __builtin_e2k_pfmuls(conj_d0, z0);
		dz1_real = __builtin_e2k_pfmuls(conj_d1, z1);
		dz2_real = __builtin_e2k_pfmuls(conj_d2, z2);
		dz3_real = __builtin_e2k_pfmuls(conj_d3, z3);
		ew0_real = __builtin_e2k_pfmuls(conj_e0, w0);
		ew1_real = __builtin_e2k_pfmuls(conj_e1, w1);
		ew2_real = __builtin_e2k_pfmuls(conj_e2, w2);
		ew3_real = __builtin_e2k_pfmuls(conj_e3, w3);
		cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
		cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
		cy2_imag = __builtin_e2k_pfmuls(swap_c2, y2);
		cy3_imag = __builtin_e2k_pfmuls(swap_c3, y3);
		dz0_imag = __builtin_e2k_pfmuls(swap_d0, z0);
		dz1_imag = __builtin_e2k_pfmuls(swap_d1, z1);
		dz2_imag = __builtin_e2k_pfmuls(swap_d2, z2);
		dz3_imag = __builtin_e2k_pfmuls(swap_d3, z3);
		ew0_imag = __builtin_e2k_pfmuls(swap_e0, w0);
		ew1_imag = __builtin_e2k_pfmuls(swap_e1, w1);
		ew2_imag = __builtin_e2k_pfmuls(swap_e2, w2);
		ew3_imag = __builtin_e2k_pfmuls(swap_e3, w3);

		cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
		cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
		cy2 = __builtin_e2k_pfhadds(cy2_real, cy2_imag);
		cy3 = __builtin_e2k_pfhadds(cy3_real, cy3_imag);
		dz0 = __builtin_e2k_pfhadds(dz0_real, dz0_imag);
		dz1 = __builtin_e2k_pfhadds(dz1_real, dz1_imag);
		dz2 = __builtin_e2k_pfhadds(dz2_real, dz2_imag);
		dz3 = __builtin_e2k_pfhadds(dz3_real, dz3_imag);
		ew0 = __builtin_e2k_pfhadds(ew0_real, ew0_imag);
		ew1 = __builtin_e2k_pfhadds(ew1_real, ew1_imag);
		ew2 = __builtin_e2k_pfhadds(ew2_real, ew2_imag);
		ew3 = __builtin_e2k_pfhadds(ew3_real, ew3_imag);

		add02_0 = __builtin_e2k_pfadds( x0, dz0);
		add02_1 = __builtin_e2k_pfadds( x1, dz1);
		add02_2 = __builtin_e2k_pfadds( x2, dz2);
		add02_3 = __builtin_e2k_pfadds( x3, dz3);
		sub02_0 = __builtin_e2k_pfsubs( x0, dz0);
		sub02_1 = __builtin_e2k_pfsubs( x1, dz1);
		sub02_2 = __builtin_e2k_pfsubs( x2, dz2);
		sub02_3 = __builtin_e2k_pfsubs( x3, dz3);
		add13_0 = __builtin_e2k_pfadds(cy0, ew0);
		add13_1 = __builtin_e2k_pfadds(cy1, ew1);
		add13_2 = __builtin_e2k_pfadds(cy2, ew2);
		add13_3 = __builtin_e2k_pfadds(cy3, ew3);
		sub13_0 = __builtin_e2k_pfsubs(cy0, ew0);
		sub13_1 = __builtin_e2k_pfsubs(cy1, ew1);
		sub13_2 = __builtin_e2k_pfsubs(cy2, ew2);
		sub13_3 = __builtin_e2k_pfsubs(cy3, ew3);

		//conj_sub13_0 = __builtin_e2k_pxord(sub13_0, 1LL<<63);
		//conj_sub13_1 = __builtin_e2k_pxord(sub13_1, 1LL<<63);
		//conj_sub13_2 = __builtin_e2k_pxord(sub13_2, 1LL<<63);
		//conj_sub13_3 = __builtin_e2k_pxord(sub13_3, 1LL<<63);
		//sub13i_0 = __builtin_e2k_pshufb(0, conj_sub13_0, 0x0302010007060504);
		//sub13i_1 = __builtin_e2k_pshufb(0, conj_sub13_1, 0x0302010007060504);
		//sub13i_2 = __builtin_e2k_pshufb(0, conj_sub13_2, 0x0302010007060504);
		//sub13i_3 = __builtin_e2k_pshufb(0, conj_sub13_3, 0x0302010007060504);
		swap_sub13_0 = __builtin_e2k_pshufb(0, sub13_0, 0x0302010007060504);
		swap_sub13_1 = __builtin_e2k_pshufb(0, sub13_1, 0x0302010007060504);
		swap_sub13_2 = __builtin_e2k_pshufb(0, sub13_2, 0x0302010007060504);
		swap_sub13_3 = __builtin_e2k_pshufb(0, sub13_3, 0x0302010007060504);
		sub13i_0 = __builtin_e2k_pxord(swap_sub13_0, 1LL<<31);
		sub13i_1 = __builtin_e2k_pxord(swap_sub13_1, 1LL<<31);
		sub13i_2 = __builtin_e2k_pxord(swap_sub13_2, 1LL<<31);
		sub13i_3 = __builtin_e2k_pxord(swap_sub13_3, 1LL<<31);

		out_0[i]  = __builtin_e2k_pfadds(add02_0, add13_0);
		out_1[i]  = __builtin_e2k_pfadds(add02_1, add13_1);
		out_2[i]  = __builtin_e2k_pfadds(add02_2, add13_2);
		out_3[i]  = __builtin_e2k_pfadds(add02_3, add13_3);
		out_4[i]  = __builtin_e2k_pfsubs(sub02_0, sub13i_0);
		out_5[i]  = __builtin_e2k_pfsubs(sub02_1, sub13i_1);
		out_6[i]  = __builtin_e2k_pfsubs(sub02_2, sub13i_2);
		out_7[i]  = __builtin_e2k_pfsubs(sub02_3, sub13i_3);
		out_8[i]  = __builtin_e2k_pfsubs(add02_0, add13_0);
		out_9[i]  = __builtin_e2k_pfsubs(add02_1, add13_1);
		out_10[i] = __builtin_e2k_pfsubs(add02_2, add13_2);
		out_11[i] = __builtin_e2k_pfsubs(add02_3, add13_3);
		out_12[i] = __builtin_e2k_pfadds(sub02_0, sub13i_0);
		out_13[i] = __builtin_e2k_pfadds(sub02_1, sub13i_1);
		out_14[i] = __builtin_e2k_pfadds(sub02_2, sub13i_2);
		out_15[i] = __builtin_e2k_pfadds(sub02_3, sub13i_3);
	}
}
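
The data flow of the listing above is easier to follow in a scalar model that handles one group of 16 values. The sketch below is such a model under stated assumptions: the `cplx` type, the helper names and the twiddle layout (`tw[g]` holding c, d, e for butterfly g) are hypothetical, and the twiddles are ordinary complex factors rather than the precomputed conj/swap forms.

```c
#include <assert.h>

typedef struct { float re, im; } cplx;

cplx c_add(cplx a, cplx b) { cplx r = { a.re + b.re, a.im + b.im }; return r; }
cplx c_sub(cplx a, cplx b) { cplx r = { a.re - b.re, a.im - b.im }; return r; }
cplx c_mul(cplx a, cplx b)
{
    cplx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

/* One radix-4 butterfly on in[0..3] (x, y, z, w) with twiddles tw[0..2] (c, d, e). */
void radix4_bf(const cplx in[4], const cplx tw[3], cplx out[4])
{
    cplx cy = c_mul(tw[0], in[1]);
    cplx dz = c_mul(tw[1], in[2]);
    cplx ew = c_mul(tw[2], in[3]);
    cplx add02 = c_add(in[0], dz), sub02 = c_sub(in[0], dz);
    cplx add13 = c_add(cy, ew),    sub13 = c_sub(cy, ew);
    cplx sub13i = { -sub13.im, sub13.re };  /* i * sub13 */
    out[0] = c_add(add02, add13);
    out[1] = c_sub(sub02, sub13i);
    out[2] = c_sub(add02, add13);
    out[3] = c_add(sub02, sub13i);
}

/* Two fused radix-4 stages on one group of 16 values. */
void radix4_2x_ref(const cplx in[16], const cplx tw_a[4][3],
                   const cplx tw_b[4][3], cplx out[16])
{
    cplx mid[16], res[4];
    /* Stage A: butterfly g reads in[4g..4g+3]; its k-th result becomes
       mid[4k + g], matching out0..out15 in the listing. */
    for (int g = 0; g < 4; ++g) {
        radix4_bf(&in[4 * g], tw_a[g], res);
        for (int k = 0; k < 4; ++k) mid[4 * k + g] = res[k];
    }
    /* Stage B: butterfly g reads mid[4g..4g+3]; its k-th result goes to
       output stream 4k + g (out_0..out_15 in the listing). */
    for (int g = 0; g < 4; ++g) {
        radix4_bf(&mid[4 * g], tw_b[g], res);
        for (int k = 0; k < 4; ++k) out[4 * k + g] = res[k];
    }
}

/* Usage example: with unit twiddles, 16 ones collapse into a single 16
   in the first output stream. */
void check_radix4_2x_ref(void)
{
    cplx in[16], out[16], tw[4][3];
    cplx one = { 1.0f, 0.0f };
    for (int n = 0; n < 16; ++n) in[n] = one;
    for (int g = 0; g < 4; ++g)
        for (int t = 0; t < 3; ++t) tw[g][t] = one;
    radix4_2x_ref(in, tw, tw, out);
    assert(out[0].re == 16.0f && out[0].im == 0.0f);
    for (int n = 1; n < 16; ++n)
        assert(out[n].re == 0.0f && out[n].im == 0.0f);
}
```

The intermediate array `mid` corresponds to the values that stay in registers between the two stages in the intrinsic version, which is where the fused variant saves a store and a load per value.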
Main loop in assembly
.L2581:
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=1, asz=0, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=1, asz=0, abs=0, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=1, asz=0, abs=1, disp=64
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=1, asz=0, abs=1, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=7, asz=1, abs=2, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=6, asz=1, abs=2, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=5, asz=1, abs=4, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=4, asz=1, abs=4, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=3, asz=1, abs=6, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=2, asz=1, abs=6, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=16, incr=0, ind=0, asz=1, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=15, incr=0, ind=0, asz=1, abs=8, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=14, incr=0, ind=0, asz=1, abs=10, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=13, incr=0, ind=0, asz=1, abs=10, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=12, incr=0, ind=0, asz=1, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=11, incr=0, ind=0, asz=1, abs=12, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=10, incr=0, ind=0, asz=1, abs=14, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=9, incr=0, ind=0, asz=1, abs=14, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=8, incr=0, ind=0, asz=1, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=7, incr=0, ind=0, asz=1, abs=16, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=6, incr=0, ind=0, asz=1, abs=18, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=5, incr=0, ind=0, asz=1, abs=18, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=4, incr=0, ind=0, asz=1, abs=20, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=3, incr=0, ind=0, asz=1, abs=20, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=2, incr=0, ind=0, asz=1, abs=22, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=1, abs=22, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=1, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=1, abs=24, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=1, abs=26, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=1, abs=26, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=1, abs=28, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=1, abs=28, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=1, abs=30, disp=0
          fapb  dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=1, abs=30, disp=0
        }
.L1747:
        {
          loop_mode
          pfadd_rsubs,1,sm      %b[63], %b[62], %b[77], %b[86]
          pfmul_hadds,2,sm      %b[9], %b[117], %b[86], %b[68]
          pfmul_hadds,3,sm      %b[33], %b[68], %b[88], %b[88]
        }
        {
          loop_mode
          pfmuls,1,sm   %b[113], %b[103], %b[73]
          xord,2,sm     %b[107], %r3, %b[90]
          pfmul_hadds,3,sm      %b[71], %b[73], %b[65], %b[71]
          pfmul_hadds,4,sm      %b[40], %b[106], %b[101], %b[65]
          movad,0       area=16, ind=0, am=1, be=0, %b[1]
          movad,1       area=15, ind=0, am=1, be=0, %b[12]
          movad,2       area=16, ind=0, am=1, be=0, %b[8]
          movad,3       area=15, ind=0, am=1, be=0, %b[9]
        }
        {
          loop_mode
          pfsub_adds,2,sm       %b[63], %b[62], %b[90], %b[92]
          pfadd_adds,3,sm       %b[63], %b[62], %b[77], %b[95]
          pfmul_hadds,4,sm      %b[5], %b[97], %b[105], %b[77]
          movad,0       area=14, ind=0, am=1, be=0, %b[16]
          movad,1       area=13, ind=0, am=1, be=0, %b[5]
          movad,2       area=14, ind=0, am=1, be=0, %b[13]
          movad,3       area=13, ind=0, am=1, be=0, %b[17]
        }
        {
          loop_mode
          pfsubs,1,sm   %b[93], %b[82], %b[85]
          pfmul_hadds,2,sm      %b[32], %b[64], %b[85], %b[64]
          movad,0       area=12, ind=0, am=1, be=0, %b[25]
          movad,1       area=11, ind=0, am=1, be=0, %b[24]
          movad,2       area=12, ind=0, am=1, be=0, %b[20]
          movad,3       area=11, ind=0, am=1, be=0, %b[21]
        }
        {
          loop_mode
          pfsubs,0,sm   %b[91], %b[89], %b[96]
          pfadds,4,sm   %b[74], %b[87], %b[93]
          pfadds,5,sm   %b[93], %b[82], %b[82]
          movad,0       area=10, ind=0, am=1, be=0, %b[28]
          movad,1       area=9, ind=0, am=1, be=0, %b[33]
          movad,2       area=10, ind=0, am=1, be=0, %b[32]
          movad,3       area=9, ind=0, am=1, be=0, %b[29]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[80], %b[103], %b[73], %b[62]
          pfadds,1,sm   %b[91], %b[89], %b[72]
          pfmuls,2,sm   %b[72], %b[108], %b[73]
          pfsub_rsubs,3,sm      %b[63], %b[62], %b[90], %b[63]
          pfsubs,5,sm   %b[74], %b[87], %b[74]
          movad,0       area=8, ind=0, am=1, be=0, %b[41]
          movad,1       area=7, ind=0, am=1, be=0, %b[36]
          movad,2       area=8, ind=0, am=1, be=0, %b[37]
          movad,3       area=7, ind=0, am=1, be=0, %b[40]
        }
        {
          loop_mode
          pfsubs,0,sm   %b[75], %b[70], %b[80]
          pfsubs,5,sm   %b[81], %b[78], %b[87]
          movad,0       area=6, ind=0, am=1, be=0, %b[49]
          movad,1       area=5, ind=0, am=1, be=0, %b[48]
          movad,2       area=6, ind=0, am=1, be=0, %b[44]
          movad,3       area=5, ind=0, am=1, be=0, %b[45]
        }
        {
          loop_mode
          pshufb,0,sm   0x0, %b[85], %r25, %b[84]
          pfmuls,1,sm   %b[84], %b[102], %b[85]
          pfadds,4,sm   %b[81], %b[78], %b[90]
          pfsubs,5,sm   %b[83], %b[69], %b[89]
          movad,0       area=4, ind=0, am=0, be=0, %b[52]
          movad,1       area=4, ind=16, am=0, be=0, %b[78]
          movad,2       area=4, ind=0, am=0, be=0, %b[53]
          movad,3       area=4, ind=16, am=0, be=0, %b[81]
        }
        {
          loop_mode
          pshufb,1,sm   0x0, %b[96], %r25, %b[98]
          pfadds,3,sm   %b[83], %b[69], %b[99]
          pfsubs,5,sm   %b[88], %b[76], %b[97]
          movad,0       area=4, ind=24, am=0, be=0, %b[69]
          movad,1       area=4, ind=8, am=1, be=0, %b[83]
          movad,2       area=4, ind=24, am=0, be=0, %b[91]
          movad,3       area=4, ind=8, am=1, be=0, %b[96]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[55], %b[108], %b[73], %b[75]
          pfadds,1,sm   %b[75], %b[70], %b[100]
          pfadd_adds,2,sm       %b[79], %b[66], %b[72], %b[74]
          pshufb,3,sm   0x0, %b[74], %r25, %b[101]
          pfadd_rsubs,4,sm      %b[94], %b[71], %b[82], %b[76]
          pfadds,5,sm   %b[88], %b[76], %b[104]
          movad,0       area=3, ind=0, am=0, be=0, %b[70]
          movad,1       area=3, ind=16, am=0, be=0, %b[88]
          movad,2       area=3, ind=0, am=0, be=0, %b[56]
          movad,3       area=3, ind=16, am=0, be=0, %b[73]
        }
        {
          loop_mode
          pfadd_rsubs,0,sm      %b[79], %b[66], %b[72], %b[72]
          pshufb,1,sm   0x0, %b[80], %r25, %b[107]
          pfadd_adds,2,sm       %b[86], %b[68], %b[93], %b[103]
          pshufb,4,sm   0x0, %b[87], %r25, %b[109]
          pfadd_rsubs,5,sm      %b[86], %b[68], %b[93], %b[93]
          movad,0       area=3, ind=24, am=0, be=0, %b[105]
          movad,1       area=3, ind=8, am=1, be=0, %b[106]
          movad,2       area=3, ind=24, am=0, be=0, %b[80]
          movad,3       area=3, ind=8, am=1, be=0, %b[87]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[58], %b[102], %b[85], %b[84]
          pfadd_adds,1,sm       %b[94], %b[71], %b[82], %b[85]
          xord,2,sm     %b[84], %r3, %b[112]
          pfadd_adds,3,sm       %b[95], %b[65], %b[90], %b[102]
          pshufb,4,sm   0x0, %b[89], %r25, %b[111]
          pfadd_rsubs,5,sm      %b[95], %b[65], %b[90], %b[90]
          movad,0       area=2, ind=0, am=0, be=0, %b[82]
          movad,1       area=2, ind=16, am=0, be=0, %b[108]
          movad,2       area=2, ind=24, am=0, be=0, %b[110]
          movad,3       area=0, ind=24, am=0, be=0, %b[89]
        }
        {
          loop_mode
          xord,0,sm     %b[98], %r3, %b[116]
          pfsub_adds,1,sm       %b[94], %b[71], %b[112], %b[94]
          pfsub_rsubs,2,sm      %b[94], %b[71], %b[112], %b[97]
          pfadd_rsubs,3,sm      %b[92], %b[77], %b[99], %b[98]
          pshufb,4,sm   0x0, %b[97], %r25, %b[115]
          pfadd_adds,5,sm       %b[92], %b[77], %b[99], %b[99]
          movad,0       area=2, ind=24, am=0, be=0, %b[114]
          movad,1       area=2, ind=8, am=1, be=0, %b[113]
          movad,2       area=1, ind=16, am=0, be=0, %b[71]
          movad,3       area=0, ind=8, am=0, be=0, %b[112]
        }
        {
          loop_mode
          pfadd_adds,1,sm       %b[57], %b[62], %b[100], %b[104]
          pfsub_adds,2,sm       %b[79], %b[66], %b[116], %b[118]
          xord,3,sm     %b[101], %r3, %g16
          pfadd_rsubs,4,sm      %b[63], %b[64], %b[104], %b[117]
          pfadd_adds,5,sm       %b[63], %b[64], %b[104], %b[119]
          movad,0       area=1, ind=0, am=0, be=0, %b[55]
          movad,1       area=1, ind=16, am=0, be=0, %b[101]
          movad,2       area=1, ind=24, am=0, be=0, %g18
          movad,3       area=1, ind=8, am=0, be=0, %g17
        }
        {
          loop_mode
          xord,0,sm     %b[107], %r3, %b[116]
          pfmuls,1,sm   %b[67], %b[60], %g19
          pfsub_rsubs,2,sm      %b[79], %b[66], %b[116], %b[66]
          pfsub_rsubs,3,sm      %b[86], %b[68], %g16, %b[107]
          xord,4,sm     %b[109], %r3, %g16
          pfsub_adds,5,sm       %b[86], %b[68], %g16, %b[86]
          movad,0       area=1, ind=24, am=0, be=0, %b[68]
          movad,1       area=1, ind=8, am=1, be=0, %b[79]
          movad,2       area=2, ind=8, am=0, be=0, %b[109]
          movad,3       area=0, ind=16, am=0, be=0, %b[67]
        }
        {
          loop_mode
          pfsub_adds,0,sm       %b[57], %b[62], %b[116], %b[95]
          pfsub_rsubs,3,sm      %b[95], %b[65], %g16, %g16
          pfsub_adds,4,sm       %b[95], %b[65], %g16, %g21
          xord,5,sm     %b[111], %r3, %g20
          movad,0       area=0, ind=0, am=0, be=0, %b[59]
          movad,1       area=0, ind=16, am=0, be=0, %b[58]
          movad,2       area=2, ind=0, am=1, be=0, %b[65]
          movad,3       area=2, ind=16, am=0, be=0, %b[111]
        }
        {
          loop_mode
          pfmuls,0,sm   %b[106], %b[89], %g24
          pfadd_rsubs,2,sm      %b[57], %b[62], %b[100], %b[115]
          xord,3,sm     %b[115], %r3, %g23
          pfsub_rsubs,4,sm      %b[92], %b[77], %g20, %g20
          pfsub_adds,5,sm       %b[92], %b[77], %g20, %g22
          movad,0       area=0, ind=8, am=1, be=0, %b[100]
          movad,1       area=0, ind=24, am=0, be=0, %b[106]
          movad,2       area=1, ind=0, am=1, be=0, %b[92]
          movad,3       area=0, ind=0, am=1, be=0, %b[77]
        }
        {
          loop_mode
          pfmuls,0,sm   %b[113], %b[112], %b[113]
          pfmuls,3,sm   %b[110], %b[71], %b[63]
          pfsub_rsubs,4,sm      %b[63], %b[64], %g23, %b[110]
          pfsub_adds,5,sm       %b[63], %b[64], %g23, %b[64]
        }
        {
          loop_mode
          pfmuls,0,sm   %b[105], %g18, %b[114]
          pfmuls,1,sm   %b[114], %g17, %g19
          pfmul_hadds,2,sm      %b[54], %b[60], %g19, %b[60]
          pfmuls,4,sm   %b[27], %b[76], %b[105]
        }
        {
          loop_mode
          pfsub_rsubs,0,sm      %b[57], %b[62], %b[116], %b[62]
          pfsubs,1,sm   %b[84], %b[75], %b[116]
          pfmuls,2,sm   %b[51], %b[85], %g23
          pfmuls,5,sm   %b[109], %b[67], %b[109]
        }
        {
          loop_mode
          pfmuls,0,sm   %b[88], %b[68], %b[88]
          pfmuls,1,sm   %b[108], %b[79], %b[93]
          std,2 %r23, %b[6], %b[103]
          pfmuls,3,sm   %b[26], %b[72], %b[103]
          addd,4,sm     0x8, %b[6], %b[4] ? %pcnt0
          std,5 %r19, %b[6], %b[93]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[96], %b[89], %g24, %b[87]
          pfmul_hadds,1,sm      %b[87], %b[112], %b[113], %b[89]
          std,2 %r2, %b[6], %b[102]
          pfmuls,3,sm   %b[50], %b[74], %b[90]
          std,5 %r20, %b[6], %b[90]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[91], %g18, %b[114], %b[80]
          pfmul_hadds,1,sm      %b[80], %g17, %g19, %b[91]
          std,2 %r11, %b[6], %b[98]
          pfmuls,3,sm   %b[14], %b[94], %b[96]
          pfmuls,4,sm   %b[35], %b[97], %b[98]
          std,5 %r22, %b[6], %b[99]
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[42], %b[85], %g23, %b[76]
          pfmuls,1,sm   %b[47], %b[104], %b[99]
          std,2 %r16, %b[6], %b[117]
          pfmul_hadds,3,sm      %b[19], %b[76], %b[105], %b[85]
          pfmuls,4,sm   %b[18], %b[118], %b[102]
          std,5 %r24, %b[6], %b[119]
        }
        {
          loop_mode
          pfadds,0,sm   %b[84], %b[75], %b[75]
          pfmuls,1,sm   %b[23], %b[115], %b[84]
          std,2 %r14, %b[6], %b[107]
          pfmul_hadds,3,sm      %b[22], %b[72], %b[103], %b[72]
          pfmuls,4,sm   %b[43], %b[66], %b[86]
          std,5 %r13, %b[6], %b[86]
        }
        {
          loop_mode
          std,2 %r21, %b[6], %g16
          pshufb,3,sm   0x0, %b[116], %r25, %b[105]
          pfmuls,4,sm   %b[15], %b[95], %b[103]
          std,5 %r18, %b[6], %g21
        }
        {
          loop_mode
          pfmul_hadds,0,sm      %b[81], %b[68], %b[88], %b[68]
          pfmul_hadds,1,sm      %b[73], %b[79], %b[93], %b[73]
          std,2 %r17, %b[6], %g20
          pfmul_hadds,3,sm      %b[46], %b[74], %b[90], %b[79]
          pfmul_hadds,4,sm      %b[34], %b[97], %b[98], %b[74]
          std,5 %r12, %b[6], %g22
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          pfmul_hadds,0,sm      %b[83], %b[67], %b[109], %b[64]
          pfmuls,1,sm   %b[39], %b[62], %b[83]
          std,2 %r15, %b[6], %b[110]
          pfmul_hadds,3,sm      %b[10], %b[94], %b[96], %b[67]
          pfmul_hadds,4,sm      %b[11], %b[118], %b[102], %b[81]
          std,5 %r0, %b[6], %b[64]
        }

Theoretical speed: 16 complex numbers (128 bytes) per 28 cycles, 128/28 = 4.57 bytes/cycle
Quadrupled theoretical speed: 18.29 bytes/cycle
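The figures above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name is hypothetical, the 8 bytes per element come from the float complex type, and the 28 cycles correspond to the 28 wide instructions marked loop_mode in the listing.

```c
/* Bytes processed per cycle by a loop body (hypothetical helper for illustration). */
static double bytes_per_cycle(int complex_per_iter, int bytes_per_complex, int cycles_per_iter)
{
    return (double)(complex_per_iter * bytes_per_complex) / (double)cycles_per_iter;
}
```

For this loop, `bytes_per_cycle(16, 8, 28)` gives about 4.57, and four times that value is about 18.29.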

Speed measurements

2. stage_radix4_readConjSwap_2x_simd128

Here the stage_radix4_readConjSwap_simd128 algorithm is manually unrolled by a factor of 2.
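The readConjSwap variants take every twiddle factor in two precomputed forms. Below is a minimal scalar sketch of how such a pair can produce a complex product using only element-wise multiplies and a horizontal add. It assumes the conj form holds (re, -im) and the swap form holds (im, re); that layout, the `cpx` type and all helper names are assumptions for illustration, not taken from the article's code.

```c
/* Hypothetical scalar complex type (assumption: a pair of floats). */
typedef struct { float re, im; } cpx;

/* Assumed precomputed forms of a twiddle factor c. */
static cpx make_conj(cpx c) { return (cpx){c.re, -c.im}; }
static cpx make_swap(cpx c) { return (cpx){c.im,  c.re}; }

/* Complex product c*y from the two forms. */
static cpx mul_conj_swap(cpx conj_c, cpx swap_c, cpx y)
{
    /* Element-wise products, as a packed multiply would produce them. */
    float real_lo = conj_c.re * y.re;   /*  c.re * y.re */
    float real_hi = conj_c.im * y.im;   /* -c.im * y.im */
    float imag_lo = swap_c.re * y.re;   /*  c.im * y.re */
    float imag_hi = swap_c.im * y.im;   /*  c.re * y.im */

    /* Horizontal add within each product pair gives the real and imaginary parts. */
    return (cpx){real_lo + real_hi, imag_lo + imag_hi};
}
```

Under these assumptions the sign change and the re/im swap are paid once when the coefficient tables are built, so the inner loop needs no extra negation or shuffle on the coefficient side.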

C code
void stage_radix4_readConjSwap_2x_simd128(int data_count, myComplex *data_in, myComplex *data_out,
	myComplex *conj_coefC_a, myComplex *conj_coefD_a, myComplex *conj_coefE_a,
	myComplex *conj_coefC_b, myComplex *conj_coefD_b, myComplex *conj_coefE_b,
	myComplex *swap_coefC_a, myComplex *swap_coefD_a, myComplex *swap_coefE_a,
	myComplex *swap_coefC_b, myComplex *swap_coefD_b, myComplex *swap_coefE_b)
{
	__v2di *xy0_in = (__v2di*)&data_in[ 0];
	__v2di *zw0_in = (__v2di*)&data_in[ 2];
	__v2di *xy1_in = (__v2di*)&data_in[ 4];
	__v2di *zw1_in = (__v2di*)&data_in[ 6];
	__v2di *xy2_in = (__v2di*)&data_in[ 8];
	__v2di *zw2_in = (__v2di*)&data_in[10];
	__v2di *xy3_in = (__v2di*)&data_in[12];
	__v2di *zw3_in = (__v2di*)&data_in[14];
	__v2di *xy4_in = (__v2di*)&data_in[16];
	__v2di *zw4_in = (__v2di*)&data_in[18];
	__v2di *xy5_in = (__v2di*)&data_in[20];
	__v2di *zw5_in = (__v2di*)&data_in[22];
	__v2di *xy6_in = (__v2di*)&data_in[24];
	__v2di *zw6_in = (__v2di*)&data_in[26];
	__v2di *xy7_in = (__v2di*)&data_in[28];
	__v2di *zw7_in = (__v2di*)&data_in[30];
	__v2di *conj_c0a_in = (__v2di*)&conj_coefC_a[0];
	__v2di *conj_c1a_in = (__v2di*)&conj_coefC_a[2];
	__v2di *conj_c2a_in = (__v2di*)&conj_coefC_a[4];
	__v2di *conj_c3a_in = (__v2di*)&conj_coefC_a[6];
	__v2di *conj_d0a_in = (__v2di*)&conj_coefD_a[0];
	__v2di *conj_d1a_in = (__v2di*)&conj_coefD_a[2];
	__v2di *conj_d2a_in = (__v2di*)&conj_coefD_a[4];
	__v2di *conj_d3a_in = (__v2di*)&conj_coefD_a[6];
	__v2di *conj_e0a_in = (__v2di*)&conj_coefE_a[0];
	__v2di *conj_e1a_in = (__v2di*)&conj_coefE_a[2];
	__v2di *conj_e2a_in = (__v2di*)&conj_coefE_a[4];
	__v2di *conj_e3a_in = (__v2di*)&conj_coefE_a[6];
	__v2di *conj_c0b_in = (__v2di*)&conj_coefC_b[0*data_count/16];
	__v2di *conj_c1b_in = (__v2di*)&conj_coefC_b[1*data_count/16];
	__v2di *conj_c2b_in = (__v2di*)&conj_coefC_b[2*data_count/16];
	__v2di *conj_c3b_in = (__v2di*)&conj_coefC_b[3*data_count/16];
	__v2di *conj_d0b_in = (__v2di*)&conj_coefD_b[0*data_count/16];
	__v2di *conj_d1b_in = (__v2di*)&conj_coefD_b[1*data_count/16];
	__v2di *conj_d2b_in = (__v2di*)&conj_coefD_b[2*data_count/16];
	__v2di *conj_d3b_in = (__v2di*)&conj_coefD_b[3*data_count/16];
	__v2di *conj_e0b_in = (__v2di*)&conj_coefE_b[0*data_count/16];
	__v2di *conj_e1b_in = (__v2di*)&conj_coefE_b[1*data_count/16];
	__v2di *conj_e2b_in = (__v2di*)&conj_coefE_b[2*data_count/16];
	__v2di *conj_e3b_in = (__v2di*)&conj_coefE_b[3*data_count/16];
	__v2di *swap_c0a_in = (__v2di*)&swap_coefC_a[0];
	__v2di *swap_c1a_in = (__v2di*)&swap_coefC_a[2];
	__v2di *swap_c2a_in = (__v2di*)&swap_coefC_a[4];
	__v2di *swap_c3a_in = (__v2di*)&swap_coefC_a[6];
	__v2di *swap_d0a_in = (__v2di*)&swap_coefD_a[0];
	__v2di *swap_d1a_in = (__v2di*)&swap_coefD_a[2];
	__v2di *swap_d2a_in = (__v2di*)&swap_coefD_a[4];
	__v2di *swap_d3a_in = (__v2di*)&swap_coefD_a[6];
	__v2di *swap_e0a_in = (__v2di*)&swap_coefE_a[0];
	__v2di *swap_e1a_in = (__v2di*)&swap_coefE_a[2];
	__v2di *swap_e2a_in = (__v2di*)&swap_coefE_a[4];
	__v2di *swap_e3a_in = (__v2di*)&swap_coefE_a[6];
	__v2di *swap_c0b_in = (__v2di*)&swap_coefC_b[0*data_count/16];
	__v2di *swap_c1b_in = (__v2di*)&swap_coefC_b[1*data_count/16];
	__v2di *swap_c2b_in = (__v2di*)&swap_coefC_b[2*data_count/16];
	__v2di *swap_c3b_in = (__v2di*)&swap_coefC_b[3*data_count/16];
	__v2di *swap_d0b_in = (__v2di*)&swap_coefD_b[0*data_count/16];
	__v2di *swap_d1b_in = (__v2di*)&swap_coefD_b[1*data_count/16];
	__v2di *swap_d2b_in = (__v2di*)&swap_coefD_b[2*data_count/16];
	__v2di *swap_d3b_in = (__v2di*)&swap_coefD_b[3*data_count/16];
	__v2di *swap_e0b_in = (__v2di*)&swap_coefE_b[0*data_count/16];
	__v2di *swap_e1b_in = (__v2di*)&swap_coefE_b[1*data_count/16];
	__v2di *swap_e2b_in = (__v2di*)&swap_coefE_b[2*data_count/16];
	__v2di *swap_e3b_in = (__v2di*)&swap_coefE_b[3*data_count/16];

	__v2di *out_0  = (__v2di*)&data_out[ 0*data_count/16];
	__v2di *out_1  = (__v2di*)&data_out[ 1*data_count/16];
	__v2di *out_2  = (__v2di*)&data_out[ 2*data_count/16];
	__v2di *out_3  = (__v2di*)&data_out[ 3*data_count/16];
	__v2di *out_4  = (__v2di*)&data_out[ 4*data_count/16];
	__v2di *out_5  = (__v2di*)&data_out[ 5*data_count/16];
	__v2di *out_6  = (__v2di*)&data_out[ 6*data_count/16];
	__v2di *out_7  = (__v2di*)&data_out[ 7*data_count/16];
	__v2di *out_8  = (__v2di*)&data_out[ 8*data_count/16];
	__v2di *out_9  = (__v2di*)&data_out[ 9*data_count/16];
	__v2di *out_10 = (__v2di*)&data_out[10*data_count/16];
	__v2di *out_11 = (__v2di*)&data_out[11*data_count/16];
	__v2di *out_12 = (__v2di*)&data_out[12*data_count/16];
	__v2di *out_13 = (__v2di*)&data_out[13*data_count/16];
	__v2di *out_14 = (__v2di*)&data_out[14*data_count/16];
	__v2di *out_15 = (__v2di*)&data_out[15*data_count/16];

	#pragma ivdep
	#pragma unroll(1)
	#pragma prefetch
	for(int64_t i = 0; i < data_count/32; ++i)
	{
		/* First radix-4 pass: each iteration loads 32 complex inputs and the *_a coefficient set. */
		__v2di xy0 = xy0_in[16*i];
		__v2di zw0 = zw0_in[16*i];
		__v2di xy1 = xy1_in[16*i];
		__v2di zw1 = zw1_in[16*i];
		__v2di conj_c0 = conj_c0a_in[4*i];
		__v2di conj_d0 = conj_d0a_in[4*i];
		__v2di conj_e0 = conj_e0a_in[4*i];
		__v2di swap_c0 = swap_c0a_in[4*i];
		__v2di swap_d0 = swap_d0a_in[4*i];
		__v2di swap_e0 = swap_e0a_in[4*i];

		__v2di xy2 = xy2_in[16*i];
		__v2di zw2 = zw2_in[16*i];
		__v2di xy3 = xy3_in[16*i];
		__v2di zw3 = zw3_in[16*i];
		__v2di conj_c1 = conj_c1a_in[4*i];
		__v2di conj_d1 = conj_d1a_in[4*i];
		__v2di conj_e1 = conj_e1a_in[4*i];
		__v2di swap_c1 = swap_c1a_in[4*i];
		__v2di swap_d1 = swap_d1a_in[4*i];
		__v2di swap_e1 = swap_e1a_in[4*i];

		__v2di xy4 = xy4_in[16*i];
		__v2di zw4 = zw4_in[16*i];
		__v2di xy5 = xy5_in[16*i];
		__v2di zw5 = zw5_in[16*i];
		__v2di conj_c2 = conj_c2a_in[4*i];
		__v2di conj_d2 = conj_d2a_in[4*i];
		__v2di conj_e2 = conj_e2a_in[4*i];
		__v2di swap_c2 = swap_c2a_in[4*i];
		__v2di swap_d2 = swap_d2a_in[4*i];
		__v2di swap_e2 = swap_e2a_in[4*i];

		__v2di xy6 = xy6_in[16*i];
		__v2di zw6 = zw6_in[16*i];
		__v2di xy7 = xy7_in[16*i];
		__v2di zw7 = zw7_in[16*i];
		__v2di conj_c3 = conj_c3a_in[4*i];
		__v2di conj_d3 = conj_d3a_in[4*i];
		__v2di conj_e3 = conj_e3a_in[4*i];
		__v2di swap_c3 = swap_c3a_in[4*i];
		__v2di swap_d3 = swap_d3a_in[4*i];
		__v2di swap_e3 = swap_e3a_in[4*i];

		__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		__v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
		__v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		/* Twiddle multiplication: element-wise products with the conj_* and swap_* coefficients. */
		__v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
		__v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
		__v2di cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2);
		__v2di cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3);
		__v2di dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0);
		__v2di dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1);
		__v2di dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2);
		__v2di dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3);
		__v2di ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0);
		__v2di ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1);
		__v2di ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2);
		__v2di ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3);
		__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
		__v2di cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
		__v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
		__v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
		__v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
		__v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
		__v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
		__v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
		__v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
		__v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
		__v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);

		/* Horizontal adds combine the products into real and imaginary parts (rrii order). */
		__v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
		__v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
		__v2di cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag);
		__v2di cy3_rrii = __builtin_e2k_qpfhadds(cy3_real, cy3_imag);
		__v2di dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag);
		__v2di dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag);
		__v2di dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag);
		__v2di dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag);
		__v2di ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag);
		__v2di ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag);
		__v2di ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag);
		__v2di ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag);

		__v2di cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		__v2di ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

		/* Radix-4 butterfly: sums and differences of the (0, 2) and (1, 3) operand pairs. */
		__v2di add02_0 = __builtin_e2k_qpfadds( x0, dz0);
		__v2di add02_1 = __builtin_e2k_qpfadds( x1, dz1);
		__v2di add02_2 = __builtin_e2k_qpfadds( x2, dz2);
		__v2di add02_3 = __builtin_e2k_qpfadds( x3, dz3);
		__v2di sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
		__v2di sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
		__v2di sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
		__v2di sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
		__v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
		__v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
		__v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
		__v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
		__v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
		__v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
		__v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
		__v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);

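		/* Multiply sub13 by i: swap re and im, then flip the sign of the low float in each 64-bit half. */
		__v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});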
		__v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		__v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
		__v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});

		__v2di out0  = __builtin_e2k_qpfadds(add02_0, add13_0);
		__v2di out1  = __builtin_e2k_qpfadds(add02_1, add13_1);
		__v2di out2  = __builtin_e2k_qpfadds(add02_2, add13_2);
		__v2di out3  = __builtin_e2k_qpfadds(add02_3, add13_3);
		__v2di out4  = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
		__v2di out5  = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
		__v2di out6  = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
		__v2di out7  = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
		__v2di out8  = __builtin_e2k_qpfsubs(add02_0, add13_0);
		__v2di out9  = __builtin_e2k_qpfsubs(add02_1, add13_1);
		__v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2);
		__v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3);
		__v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
		__v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
		__v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
		__v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3);


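		/* Second radix-4 pass: reuse the first-pass outputs as inputs and load the *_b coefficient set. */
		xy0 = out0;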
		zw0 = out1;
		xy1 = out2;
		zw1 = out3;
		conj_c0 = conj_c0b_in[i];
		conj_d0 = conj_d0b_in[i];
		conj_e0 = conj_e0b_in[i];
		swap_c0 = swap_c0b_in[i];
		swap_d0 = swap_d0b_in[i];
		swap_e0 = swap_e0b_in[i];

		xy2 = out4;
		zw2 = out5;
		xy3 = out6;
		zw3 = out7;
		conj_c1 = conj_c1b_in[i];
		conj_d1 = conj_d1b_in[i];
		conj_e1 = conj_e1b_in[i];
		swap_c1 = swap_c1b_in[i];
		swap_d1 = swap_d1b_in[i];
		swap_e1 = swap_e1b_in[i];

		xy4 = out8;
		zw4 = out9;
		xy5 = out10;
		zw5 = out11;
		conj_c2 = conj_c2b_in[i];
		conj_d2 = conj_d2b_in[i];
		conj_e2 = conj_e2b_in[i];
		swap_c2 = swap_c2b_in[i];
		swap_d2 = swap_d2b_in[i];
		swap_e2 = swap_e2b_in[i];

		xy6 = out12;
		zw6 = out13;
		xy7 = out14;
		zw7 = out15;
		conj_c3 = conj_c3b_in[i];
		conj_d3 = conj_d3b_in[i];
		conj_e3 = conj_e3b_in[i];
		swap_c3 = swap_c3b_in[i];
		swap_d3 = swap_d3b_in[i];
		swap_e3 = swap_e3b_in[i];

		x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
		y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
		w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
		y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
		w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
		y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
		w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
		y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
		z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
		w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

		cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
		cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
		cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2);
		cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3);
		dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0);
		dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1);
		dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2);
		dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3);
		ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0);
		ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1);
		ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2);
		ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3);
		cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
		cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
		cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
		cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
		dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
		dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
		dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
		dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
		ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
		ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
		ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
		ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);

		cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
		cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
		cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag);
		cy3_rrii = __builtin_e2k_qpfhadds(cy3_real, cy3_imag);
		dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag);
		dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag);
		dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag);
		dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag);
		ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag);
		ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag);
		ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag);
		ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag);

		cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
		ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

		add02_0 = __builtin_e2k_qpfadds( x0, dz0);
		add02_1 = __builtin_e2k_qpfadds( x1, dz1);
		add02_2 = __builtin_e2k_qpfadds( x2, dz2);
		add02_3 = __builtin_e2k_qpfadds( x3, dz3);
		sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
		sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
		sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
		sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
		add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
		add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
		add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
		add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
		sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
		sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
		sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
		sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);

		swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
		sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
		sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
		sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
		sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});

		out_0[i]  = __builtin_e2k_qpfadds(add02_0, add13_0);
		out_1[i]  = __builtin_e2k_qpfadds(add02_1, add13_1);
		out_2[i]  = __builtin_e2k_qpfadds(add02_2, add13_2);
		out_3[i]  = __builtin_e2k_qpfadds(add02_3, add13_3);
		out_4[i]  = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
		out_5[i]  = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
		out_6[i]  = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
		out_7[i]  = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
		out_8[i]  = __builtin_e2k_qpfsubs(add02_0, add13_0);
		out_9[i]  = __builtin_e2k_qpfsubs(add02_1, add13_1);
		out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2);
		out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3);
		out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
		out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
		out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
		out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
	}
}
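A scalar sketch of what one butterfly in the loop body above computes, read off the visible intrinsics. The name radix4_butterfly_ref and the out[4] packing are assumptions made for illustration. In the vector code, each of the four results corresponds to a group of four stores (out_0..out_3, out_4..out_7, and so on), and every register holds two float complex values.

```c
#include <complex.h>

/* Scalar reference for one radix-4 butterfly (hypothetical helper).
   x is used as is; y, z, w are first multiplied by coefficients c, d, e.
   In the vector code each product is built from two qpfmuls (with the
   conj_* and swap_* forms of the coefficient), a horizontal add and a
   shuffle, which together amount to a complex multiplication. */
static void radix4_butterfly_ref(float complex x, float complex y,
                                 float complex z, float complex w,
                                 float complex c, float complex d,
                                 float complex e, float complex out[4])
{
    const float complex cy = c * y;
    const float complex dz = d * z;
    const float complex ew = e * w;

    const float complex add02 = x + dz;
    const float complex sub02 = x - dz;
    const float complex add13 = cy + ew;
    const float complex sub13 = cy - ew;

    /* Swap re/im and flip the sign of the new real part:
       this is multiplication by i (the qpshufb + qpxor pair). */
    const float complex sub13i = -cimagf(sub13) + crealf(sub13) * I;

    out[0] = add02 + add13;   /* out_0..out_3   */
    out[1] = sub02 - sub13i;  /* out_4..out_7   */
    out[2] = add02 - add13;   /* out_8..out_11  */
    out[3] = sub02 + sub13i;  /* out_12..out_15 */
}
```

With all three coefficients equal to 1, this reduces to a plain 4-point DFT, which makes it easy to check against a hand calculation.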
Main loop in assembly
.L6737:
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=0, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=0, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=1, disp=64
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=1, disp=96
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=2, disp=128
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=2, disp=160
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=3, disp=192
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=3, disp=224
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=7, asz=0, abs=4, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=7, asz=0, abs=4, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=6, asz=0, abs=5, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=6, asz=0, abs=5, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=5, asz=0, abs=6, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=5, asz=0, abs=6, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=4, asz=0, abs=7, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=4, asz=0, abs=7, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=3, asz=0, abs=8, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=3, asz=0, abs=8, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=2, asz=0, abs=9, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=2, asz=0, abs=9, disp=32
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=16, incr=0, ind=0, asz=0, abs=10, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=15, incr=0, ind=0, asz=0, abs=10, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=14, incr=0, ind=0, asz=0, abs=11, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=13, incr=0, ind=0, asz=0, abs=11, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=12, incr=0, ind=0, asz=1, abs=12, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=11, incr=0, ind=0, asz=1, abs=12, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=10, incr=0, ind=0, asz=1, abs=14, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=9, incr=0, ind=0, asz=1, abs=14, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=8, incr=0, ind=0, asz=1, abs=16, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=7, incr=0, ind=0, asz=1, abs=16, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=6, incr=0, ind=0, asz=1, abs=18, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=5, incr=0, ind=0, asz=1, abs=18, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=4, incr=0, ind=0, asz=1, abs=20, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=3, incr=0, ind=0, asz=1, abs=20, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=2, incr=0, ind=0, asz=1, abs=22, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=1, incr=0, ind=0, asz=1, abs=22, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=15, asz=1, abs=24, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=14, asz=1, abs=24, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=13, asz=1, abs=26, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=12, asz=1, abs=26, disp=0
        }
        {
          fapb  ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=11, asz=1, abs=28, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=10, asz=1, abs=28, disp=0
        }
        {
          fapb  ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=9, asz=1, abs=30, disp=0
          fapb  dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=8, asz=1, abs=30, disp=0
        }
.L2988:
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[92], %b[83], %b[98], %b[83]
          qpshufb,1,sm  %b[69], %b[69], %r4, %b[69]
          qpshufb,3,sm  %b[63], %b[66], %r0, %b[92]
          qpfmuls,4,sm  %b[48], %b[90], %b[98]
          qpfmuls,5,sm  %b[21], %b[86], %b[99]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[62], %b[62], %r4, %b[74]
          qpshufb,3,sm  %b[72], %b[74], %r3, %b[62]
          qpfmuls,5,sm  %b[29], %b[92], %b[72]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[97], %b[97], %r4, %b[97]
          qpfmul_hadds,2,sm     %b[32], %b[87], %b[102], %b[87]
          qpshufb,3,sm  %b[115], %b[118], %r3, %b[100]
          qpfmuls,5,sm  %b[45], %b[62], %b[103]
        }
        {
          loop_mode
          qpfmul_hadds,2,sm     %b[44], %b[101], %b[93], %b[71]
          qpshufb,3,sm  %b[71], %b[76], %r3, %b[76]
          qpfsubs,4,sm  %b[79], %b[77], %b[93]
          qpfmuls,5,sm  %b[28], %b[100], %b[101]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[49], %b[95], %b[84], %b[84]
          qpshufb,1,sm  %b[96], %b[96], %r4, %b[73]
          qpfmul_hadds,2,sm     %b[25], %b[91], %b[94], %b[89]
          qpshufb,3,sm  %b[73], %b[89], %r3, %b[91]
          qpfsubs,4,sm  %b[69], %b[70], %b[94]
          qpfmuls,5,sm  %b[13], %b[76], %b[95]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[108], %b[109], %r3, %b[80]
          qpfmul_hadds,2,sm     %b[41], %b[85], %b[88], %b[85]
          qpshufb,3,sm  %b[80], %b[80], %r4, %b[102]
          qpfsubs,4,sm  %b[74], %b[81], %b[96]
          qpfmuls,5,sm  %b[24], %b[91], %b[88]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[36], %b[90], %b[98], %b[81]
          qpshufb,1,sm  %b[82], %b[82], %r4, %b[64]
          qpfmul_hadds,2,sm     %b[17], %b[86], %b[99], %b[82]
          qpshufb,3,sm  %b[64], %b[67], %r3, %b[67]
          qpfadds,4,sm  %b[74], %b[81], %b[74]
          qpfsubs,5,sm  %b[102], %b[97], %b[86]
          movaqp,0      area=21, ind=0, am=1, be=0, %b[1]
          movaqp,1      area=20, ind=0, am=1, be=0, %b[17]
          movaqp,2      area=21, ind=0, am=1, be=0, %b[13]
          movaqp,3      area=20, ind=0, am=1, be=0, %b[8]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[75], %b[75], %r4, %b[72]
          qpfmul_hadds,2,sm     %b[12], %b[92], %b[72], %b[77]
          qpshufb,3,sm  %b[78], %b[105], %r3, %b[75]
          qpfadds,5,sm  %b[79], %b[77], %b[78]
          movaqp,0      area=19, ind=0, am=1, be=0, %b[25]
          movaqp,1      area=18, ind=0, am=1, be=0, %b[12]
          movaqp,2      area=19, ind=0, am=1, be=0, %b[24]
          movaqp,3      area=18, ind=0, am=1, be=0, %b[21]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[83], %b[83], %r4, %b[62]
          qpfmul_hadds,2,sm     %b[9], %b[62], %b[103], %b[69]
          qpshufb,3,sm  %b[65], %b[68], %r3, %b[65]
          qpfadds,5,sm  %b[69], %b[70], %b[68]
          movaqp,0      area=17, ind=0, am=1, be=0, %b[29]
          movaqp,1      area=16, ind=0, am=1, be=0, %b[33]
          movaqp,2      area=17, ind=0, am=1, be=0, %b[28]
          movaqp,3      area=16, ind=0, am=1, be=0, %b[9]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[93], %b[93], %r5, %b[83]
          qpfmul_hadds,2,sm     %b[5], %b[100], %b[101], %b[70]
          qpshufb,3,sm  %b[96], %b[96], %r5, %b[90]
          qpfadds,5,sm  %b[102], %b[97], %b[79]
          movaqp,0      area=15, ind=0, am=1, be=0, %b[5]
          movaqp,1      area=14, ind=0, am=1, be=0, %b[36]
          movaqp,2      area=15, ind=0, am=1, be=0, %b[37]
          movaqp,3      area=14, ind=0, am=1, be=0, %b[32]
        }
        {
          loop_mode
          qpshufb,1,sm  %b[94], %b[94], %r5, %b[92]
          qpfmul_hadds,2,sm     %b[16], %b[76], %b[95], %b[76]
          qpshufb,3,sm  %b[86], %b[86], %r5, %b[86]
          movaqp,0      area=13, ind=0, am=1, be=0, %b[44]
          movaqp,1      area=12, ind=0, am=1, be=0, %b[16]
          movaqp,2      area=13, ind=0, am=1, be=0, %b[41]
          movaqp,3      area=12, ind=0, am=1, be=0, %b[40]
        }
        {
          loop_mode
          qpxor,1,sm    %b[92], %r1, %b[88]
          qpfmul_hadds,2,sm     %b[20], %b[91], %b[88], %b[91]
          qpxor,3,sm    %b[86], %r1, %b[86]
          movaqp,0      area=11, ind=0, am=1, be=0, %b[48]
          movaqp,1      area=10, ind=0, am=1, be=0, %b[49]
          movaqp,2      area=11, ind=0, am=1, be=0, %b[45]
          movaqp,3      area=10, ind=0, am=1, be=0, %b[20]
        }
        {
          loop_mode
          qpfadd_adds,0,sm      %b[65], %b[62], %b[74], %b[87]
          qpshufb,1,sm  %b[89], %b[89], %r4, %b[95]
          qpshufb,3,sm  %b[87], %b[87], %r4, %b[96]
          movaqp,0      area=9, ind=0, am=0, be=0, %b[93]
          movaqp,1      area=9, ind=16, am=1, be=0, %b[94]
          movaqp,2      area=9, ind=0, am=0, be=0, %b[89]
          movaqp,3      area=9, ind=16, am=1, be=0, %b[92]
        }
        {
          loop_mode
          qpfadd_adds,0,sm      %b[75], %b[72], %b[78], %b[71]
          qpxor,1,sm    %b[90], %r1, %b[100]
          qpshufb,3,sm  %b[71], %b[71], %r4, %b[101]
          movaqp,0      area=8, ind=16, am=1, be=0, %b[90]
          movaqp,1      area=8, ind=0, am=0, be=0, %b[98]
          movaqp,2      area=8, ind=16, am=1, be=0, %b[97]
          movaqp,3      area=8, ind=0, am=0, be=0, %b[99]
        }
        {
          loop_mode
          qpfadd_rsubs,0,sm     %b[67], %b[64], %b[68], %b[52]
          qpshufb,1,sm  %b[82], %b[82], %r4, %b[103]
          qpshufb,3,sm  %b[84], %b[84], %r4, %b[106]
          movaqp,0      area=7, ind=16, am=1, be=0, %b[102]
          movaqp,1      area=7, ind=0, am=0, be=0, %b[104]
          movaqp,2      area=7, ind=0, am=0, be=0, %b[82]
          movaqp,3      area=7, ind=16, am=1, be=0, %b[84]
        }
        {
          loop_mode
          qpfadd_rsubs,0,sm     %b[80], %b[73], %b[79], %b[53]
          qpxor,1,sm    %b[83], %r1, %b[108]
          qpshufb,3,sm  %b[85], %b[85], %r4, %b[110]
          qpfadds,4,sm  %b[106], %b[101], %b[106]
          qpfsubs,5,sm  %b[106], %b[101], %b[107]
          movaqp,0      area=6, ind=0, am=0, be=0, %b[101]
          movaqp,1      area=6, ind=16, am=1, be=0, %b[105]
          movaqp,2      area=6, ind=0, am=0, be=0, %b[83]
          movaqp,3      area=6, ind=16, am=1, be=0, %b[85]
        }
        {
          loop_mode
          qpfadd_rsubs,0,sm     %b[65], %b[62], %b[74], %b[74]
          qpshufb,1,sm  %b[69], %b[69], %r4, %b[109]
          qpfadd_rsubs,2,sm     %b[75], %b[72], %b[78], %b[69]
          qpshufb,3,sm  %b[81], %b[81], %r4, %b[113]
          qpfadds,4,sm  %b[96], %b[95], %b[111]
          qpfsubs,5,sm  %b[96], %b[95], %b[112]
          movaqp,0      area=5, ind=16, am=0, be=0, %b[78]
          movaqp,1      area=1, ind=16, am=0, be=0, %b[95]
          movaqp,2      area=5, ind=16, am=0, be=0, %b[96]
          movaqp,3      area=1, ind=16, am=0, be=0, %b[81]
        }
        {
          loop_mode
          qpfadd_adds,0,sm      %b[67], %b[64], %b[68], %b[57]
          qpshufb,1,sm  %b[60], %b[61], %r3, %b[79]
          qpfadd_adds,2,sm      %b[80], %b[73], %b[79], %b[56]
          qpshufb,3,sm  %b[77], %b[77], %r4, %b[113]
          qpfadds,4,sm  %b[113], %b[110], %b[110]
          qpfsubs,5,sm  %b[113], %b[110], %b[114]
          movaqp,0      area=5, ind=0, am=1, be=0, %b[60]
          movaqp,1      area=0, ind=16, am=0, be=0, %b[68]
          movaqp,2      area=5, ind=0, am=1, be=0, %b[77]
          movaqp,3      area=0, ind=16, am=0, be=0, %b[61]
        }
        {
          loop_mode
          qpfsub_adds,0,sm      %b[65], %b[62], %b[100], %b[116]
          qpshufb,1,sm  %b[76], %b[76], %r4, %b[119]
          qpfsub_adds,2,sm      %b[75], %b[72], %b[108], %b[113]
          qpshufb,3,sm  %b[55], %b[54], %r3, %b[118]
          qpfadds,4,sm  %b[113], %b[103], %g17
          qpfsubs,5,sm  %b[113], %b[103], %g16
          movaqp,0      area=3, ind=16, am=1, be=0, %b[117]
          movaqp,1      area=3, ind=0, am=0, be=0, %b[103]
          movaqp,2      area=3, ind=16, am=1, be=0, %b[115]
          movaqp,3      area=3, ind=0, am=0, be=0, %b[76]
        }
        {
          loop_mode
          qpfsub_rsubs,0,sm     %b[65], %b[62], %b[100], %b[72]
          qpshufb,1,sm  %b[91], %b[91], %r4, %b[108]
          qpfsub_rsubs,2,sm     %b[75], %b[72], %b[108], %b[70]
          qpshufb,3,sm  %b[70], %b[70], %r4, %b[75]
          movaqp,0      area=4, ind=16, am=0, be=0, %b[91]
          movaqp,1      area=0, ind=0, am=1, be=0, %b[65]
          movaqp,2      area=4, ind=16, am=0, be=0, %b[100]
          movaqp,3      area=0, ind=0, am=1, be=0, %b[62]
        }
        {
          loop_mode
          qpfsub_rsubs,0,sm     %b[67], %b[64], %b[88], %b[59]
          qpshufb,1,sm  %b[58], %b[59], %r3, %g18
          qpfsub_rsubs,2,sm     %b[80], %b[73], %b[86], %b[58]
          qpshufb,3,sm  %b[63], %b[66], %r3, %g19
          movaqp,0      area=4, ind=0, am=1, be=0, %g20
          movaqp,1      area=1, ind=0, am=1, be=0, %b[66]
          movaqp,2      area=4, ind=0, am=1, be=0, %g21
          movaqp,3      area=1, ind=0, am=1, be=0, %b[63]
        }
        {
          loop_mode
          qpfadd_rsubs,0,sm     %g18, %b[108], %b[106], %b[112]
          qpshufb,1,sm  %b[107], %b[107], %r5, %g22
          qpfadd_adds,2,sm      %g18, %b[108], %b[106], %g24
          qpshufb,3,sm  %b[112], %b[112], %r5, %g23
          qpshufb,4,sm  %b[81], %b[95], %r0, %g25
          movaqp,0      area=2, ind=0, am=0, be=0, %b[107]
          movaqp,1      area=2, ind=16, am=1, be=0, %g26
          movaqp,2      area=2, ind=0, am=0, be=0, %b[106]
          movaqp,3      area=2, ind=16, am=1, be=0, %g27
        }
        {
          loop_mode
          qpfadd_rsubs,0,sm     %b[118], %b[119], %b[111], %b[105]
          qpxor,1,sm    %g22, %r1, %g22
          qpfadd_adds,2,sm      %b[118], %b[119], %b[111], %b[111]
          qpxor,3,sm    %g23, %r1, %g23
          qpshufb,4,sm  %b[61], %b[68], %r0, %g29
          qpfmuls,5,sm  %b[105], %g25, %g28
        }
        {
          loop_mode
          qpfsub_rsubs,0,sm     %g18, %b[108], %g22, %b[108]
          qpshufb,1,sm  %b[114], %b[114], %r5, %g30
          qpfsub_adds,2,sm      %g18, %b[108], %g22, %b[101]
          qpshufb,3,sm  %b[115], %b[117], %r0, %b[114]
          qpshufb,4,sm  %b[76], %b[103], %r0, %g22
          qpfmuls,5,sm  %b[101], %g29, %g18
        }
        {
          loop_mode
          qpfadd_adds,0,sm      %b[79], %b[109], %b[110], %b[100]
          qpshufb,1,sm  %g16, %g16, %r5, %g16
          qpfadd_adds,2,sm      %g19, %b[75], %g17, %b[85]
          qpshufb,3,sm  %b[62], %b[65], %r0, %g31
          qpfmuls,4,sm  %b[85], %b[114], %r7
          qpfmuls,5,sm  %b[100], %g22, %r6
        }
        {
          loop_mode
          qpfadd_rsubs,0,sm     %b[79], %b[109], %b[110], %g17
          qpxor,1,sm    %g16, %r1, %g16
          qpfadd_rsubs,2,sm     %g19, %b[75], %g17, %b[110]
          qpshufb,3,sm  %b[63], %b[66], %r0, %r9
          qpfmuls,5,sm  %g20, %g31, %g20
        }
        {
          loop_mode
          qpfsub_rsubs,0,sm     %b[118], %b[119], %g23, %b[118]
          qpxor,1,sm    %g30, %r1, %g30
          qpfsub_adds,2,sm      %b[118], %b[119], %g23, %b[91]
          qpshufb,3,sm  %g27, %g26, %r0, %r26
          qpfmuls,5,sm  %b[91], %r9, %g23
        }
        {
          loop_mode
          qpfsub_adds,0,sm      %g19, %b[75], %g16, %b[83]
          qpfsub_rsubs,2,sm     %b[79], %b[109], %g30, %r27
          qpshufb,3,sm  %b[106], %b[107], %r0, %r29
          qpshufb,4,sm  %g27, %g26, %r3, %g26
          qpfmuls,5,sm  %b[83], %r26, %r28
        }
        {
          loop_mode
          qpfsub_rsubs,0,sm     %g19, %b[75], %g16, %b[77]
          qpfmul_hadds,1,sm     %b[94], %g25, %g28, %b[79]
          qpfsub_adds,2,sm      %b[79], %b[109], %g30, %b[75]
          qpfmuls,4,sm  %b[77], %g26, %b[94]
          qpfmuls,5,sm  %g21, %r29, %b[109]
        }
        {
          loop_mode
          qpfsub_adds,0,sm      %b[67], %b[64], %b[88], %b[64]
          qpfmul_hadds,1,sm     %b[93], %g29, %g18, %b[68]
          qpfsub_adds,2,sm      %b[80], %b[73], %b[86], %b[61]
          qpshufb,3,sm  %b[61], %b[68], %r3, %b[80]
          qpshufb,4,sm  %b[115], %b[117], %r3, %b[73]
          qpfmul_hadds,5,sm     %b[104], %g31, %g20, %b[67]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[84], %g22, %r6, %b[86]
          qpfmul_hadds,2,sm     %b[92], %b[114], %r7, %b[84]
          qpfmuls,3,sm  %b[96], %b[73], %b[88]
          qpfmuls,4,sm  %b[60], %b[80], %b[92]
          qpfmul_hadds,5,sm     %b[102], %r9, %g23, %b[60]
        }
        {
          loop_mode
          addd,1,sm     0x10, %b[6], %b[4] ? %pcnt0
          stqp,2        %r21, %b[6], %b[112]
          stqp,5        %r2, %b[6], %g24
        }
        {
          loop_mode
          qpshufb,1,sm  %b[81], %b[95], %r3, %b[81]
          stqp,2        %r20, %b[6], %b[105]
          stqp,5        %r24, %b[6], %b[111]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[89], %r26, %r28, %b[95]
          qpshufb,1,sm  %b[69], %b[74], %r0, %b[89]
          stqp,2        %r22, %b[6], %b[108]
          qpshufb,3,sm  %b[56], %b[57], %r0, %b[93]
          stqp,5        %r19, %b[6], %b[101]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[82], %r29, %b[109], %b[78]
          qpfmuls,1,sm  %b[78], %b[81], %b[96]
          stqp,2        %r25, %b[6], %b[100]
          qpshufb,3,sm  %b[53], %b[52], %r0, %b[85]
          qpfmuls,4,sm  %b[51], %b[93], %b[82]
          stqp,5        %r23, %b[6], %b[85]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[99], %g26, %b[94], %b[94]
          stqp,2        %r17, %b[6], %g17
          qpshufb,3,sm  %b[71], %b[87], %r0, %b[99]
          qpfmuls,4,sm  %b[35], %b[85], %b[100]
          stqp,5        %r12, %b[6], %b[110]
        }
        {
          loop_mode
          qpfmul_hadds,0,sm     %b[98], %b[80], %b[92], %b[80]
          qpfmul_hadds,1,sm     %b[97], %b[73], %b[88], %b[73]
          stqp,2        %r15, %b[6], %b[118]
          qpshufb,3,sm  %b[58], %b[59], %r0, %b[88]
          qpfmuls,4,sm  %b[50], %b[99], %b[91]
          stqp,5        %r14, %b[6], %b[91]
        }
        {
          loop_mode
          qpshufb,0,sm  %b[79], %b[79], %r4, %b[79]
          qpshufb,1,sm  %b[68], %b[68], %r4, %b[68]
          stqp,2        %r13, %b[6], %b[83]
          qpshufb,3,sm  %b[70], %b[72], %r0, %b[83]
          qpfmuls,4,sm  %b[31], %b[89], %b[92]
          stqp,5        %r16, %b[6], %r27
        }
        {
          loop_mode
          alc   alcf=1, alct=1
          abn   abnf=1, abnt=1
          ct    %ctpr1 ? %NOT_LOOP_END
          qpshufb,0,sm  %b[84], %b[84], %r4, %b[75]
          qpshufb,1,sm  %b[86], %b[86], %r4, %b[77]
          stqp,2        %r18, %b[6], %b[77]
          qpshufb,3,sm  %b[113], %b[116], %r0, %b[84]
          qpfmuls,4,sm  %b[38], %b[83], %b[86]
          stqp,5        %r11, %b[6], %b[75]
        }

Theoretical throughput: 32 complex numbers (256 bytes) per 39 cycles = 6.56 bytes/cycle
Quadrupled theoretical throughput (4 × 256 / 39): 26.26 bytes/cycle
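The arithmetic behind these figures can be reproduced with a small helper. This is only a sketch: the function name bytes_per_cycle is made up for illustration. The inputs come from the listing above: the loop body contains 39 wide instructions and 16 stqp stores, each storing two float complex values of 8 bytes.

```c
#include <assert.h>

/* Bytes processed per cycle by one iteration of a pipelined loop body.
   (Hypothetical helper, not part of the FFT code.) */
static double bytes_per_cycle(unsigned complex_per_iter,
                              unsigned bytes_per_complex,
                              unsigned cycles_per_iter)
{
    assert(cycles_per_iter != 0);
    return (double)complex_per_iter * bytes_per_complex / cycles_per_iter;
}
```

For this loop, bytes_per_cycle(32, 8, 39) is 256 / 39, and multiplying that by 4 gives the quadrupled figure.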

Speed measurements

Summary for stage_radix4_readConjSwap_2x

Throughput dropped compared with the original stage_radix4_readConjSwap versions.
The FFT plot is here.


Assembling the FFT

fft_radix2

We assemble fft_radix2 from reverse_radix2_x32 and all the variants of stage_radix2.
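The overall shape of such an assembly (one Reverse pass followed by log2(N) Stage passes) can be sketched in portable scalar C. This is a textbook decimation-in-time layout, not the vectorized code from this article. The names reverse_ref, stage_ref and fft_radix2_ref are hypothetical, and the coefficient table w is assumed to be supplied by the caller.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Bit-reversal permutation: out[rev(i)] = in[i]. n must be a power of two
   and out must not alias in. */
static void reverse_ref(float complex *out, const float complex *in, size_t n)
{
    size_t bits = 0;
    while (((size_t)1 << bits) < n)
        ++bits;
    assert(((size_t)1 << bits) == n);
    for (size_t i = 0; i < n; ++i) {
        size_t r = 0;
        for (size_t b = 0; b < bits; ++b)
            if (i & ((size_t)1 << b))
                r |= (size_t)1 << (bits - 1 - b);
        out[r] = in[i];
    }
}

/* One in-place radix-2 stage over blocks of 2*half elements.
   Assumption: w[k] holds exp(-2*pi*i*k/n) for k < n/2. */
static void stage_ref(float complex *data, const float complex *w,
                      size_t n, size_t half)
{
    const size_t step = n / (2 * half);   /* stride in the coefficient table */
    for (size_t base = 0; base < n; base += 2 * half)
        for (size_t j = 0; j < half; ++j) {
            const float complex a = data[base + j];
            const float complex t = w[j * step] * data[base + j + half];
            data[base + j]        = a + t;
            data[base + j + half] = a - t;
        }
}

/* Reverse first, then every stage in turn. */
static void fft_radix2_ref(float complex *out, const float complex *in,
                           const float complex *w, size_t n)
{
    reverse_ref(out, in, n);
    for (size_t half = 1; half < n; half *= 2)
        stage_ref(out, w, n, half);
}
```

A reference like this is mainly useful for checking the optimized variants on small inputs, where the result can be verified by hand.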


fft_radix2_2x

We assemble fft_radix2_2x from reverse_radix2_x32 and all the variants of stage_radix2_2x.


fft_radix2_readConjSwap

We assemble fft_radix2_readConjSwap from reverse_radix2_x32 and all the variants of stage_radix2_readConjSwap.


fft_radix2_readConjSwap_2x

We assemble fft_radix2_readConjSwap_2x from reverse_radix2_x32 and all the variants of stage_radix2_readConjSwap_2x.


fft_radix4

We assemble fft_radix4 from reverse_radix4_x16 and all the variants of stage_radix4.


fft_radix4_2x

We assemble fft_radix4_2x from reverse_radix4_x16 and all the variants of stage_radix4_2x.


fft_radix4_readConjSwap

We assemble fft_radix4_readConjSwap from reverse_radix4_x16 and all the variants of stage_radix4_readConjSwap.


fft_radix4_readConjSwap_2x

We assemble fft_radix4_readConjSwap_2x from reverse_radix4_x16 and all the variants of stage_radix4_readConjSwap_2x.


Further optimizations of the existing code

Right now the coefficients live in several separate arrays (conj/swap, coefC/coefD/coefE).
They could be placed in a single array, interleaved with one another.
That would make reading them from memory more efficient.
It should not make things any worse.
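One possible shape for such an interleaved layout is sketched below. The types quad_t and coef_step_t and the function interleave_coefs are hypothetical. The sketch assumes that one loop step consumes one 128-bit value from each of the six coefficient arrays, and it has not been measured on Elbrus.

```c
#include <stddef.h>

/* One 128-bit register's worth of data: two float complex values. */
typedef struct { float v[4]; } quad_t;

/* Hypothetical interleaved record: everything one loop step needs,
   stored contiguously instead of in six separate arrays. */
typedef struct {
    quad_t conj_c, swap_c;
    quad_t conj_d, swap_d;
    quad_t conj_e, swap_e;
} coef_step_t;

/* Repack six separate coefficient arrays into one interleaved array. */
static void interleave_coefs(coef_step_t *dst,
                             const quad_t *conj_c, const quad_t *swap_c,
                             const quad_t *conj_d, const quad_t *swap_d,
                             const quad_t *conj_e, const quad_t *swap_e,
                             size_t steps)
{
    for (size_t i = 0; i < steps; ++i) {
        dst[i].conj_c = conj_c[i];
        dst[i].swap_c = swap_c[i];
        dst[i].conj_d = conj_d[i];
        dst[i].swap_d = swap_d[i];
        dst[i].conj_e = conj_e[i];
        dst[i].swap_e = swap_e[i];
    }
}
```

The repacking would be done once, when the coefficient tables are prepared, so that the Stage loop itself reads a single sequential stream.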


Further directions for new code

What to do next:

  • A 3x variant for radix2.
    Currently 2x in radix2 speeds up the computation, so a 3x variant (and beyond) is worth trying.
    Currently 2x in radix4 slows down the computation, so a 3x variant there is pointless.
    This is unlikely to yield anything interesting, but it could be done for completeness.

  • Stage variants in which the coefficients are computed on the fly instead of being read from memory.
    This should speed up the computation when the data does not fit in the cache.
    For radix4 there is one more case to consider: read only coefC from memory and compute coefD/coefE from coefC.

  • Stage variants that take a single constant coefficient as input (or a set of c/d/e coefficients for radix4).
    Note that the first Stage uses one and the same coefficient for the whole input array. The second Stage uses 2 different coefficients: one for the first half of the array, the other for the second half. The third Stage uses 4 different coefficients, and so on.
    The input array can be split into parts that are each processed with a single coefficient. Obtain a coefficient (from memory or by computing it on the fly), process a part of the array, obtain the next coefficient, process the next part.
    This approach works for the early Stages, where such parts are long enough. Later Stages need the usual ways of obtaining coefficients (reading from memory or computing on the fly at every step).
    The boundary between early and late Stages will have to be found experimentally.
    That is a separate task.

  • Merge Reverse with the first Stage.
    Several Stages are already merged into one loop (2x). Reverse and the first Stage could be merged in the same way (in the general case, Reverse and several of the first Stages).

  • Input data of type double complex.
    Only float complex is implemented so far.
    A double complex version is needed.
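The idea of computing coefficients on the fly can be sketched in scalar C. The relations d = c² and e = c³ are an assumption based on the usual radix-4 butterfly, where the three coefficients are successive powers of the same root of unity. The helpers derive_de_from_c and next_coef are hypothetical and do not come from the code in this article.

```c
#include <complex.h>

/* Assumption: coefD and coefE are the square and the cube of coefC. */
static void derive_de_from_c(float complex c,
                             float complex *d, float complex *e)
{
    *d = c * c;     /* second power */
    *e = *d * c;    /* third power  */
}

/* Advance to the next coefficient by rotating by a fixed step w1.
   Rounding error accumulates with every rotation, so a real
   implementation would probably resynchronize with exact values
   from time to time. */
static float complex next_coef(float complex c, float complex w1)
{
    return c * w1;
}
```

The trade-off is a few extra multiplications per step in exchange for fewer memory reads, which only pays off when the loop is limited by memory rather than by arithmetic.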

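Processing the array in parts that share one coefficient can be sketched as two nested loops. The data layout here is hypothetical: the two butterfly inputs come from arrays a and b of n elements each, split into equal contiguous parts, and part p uses coef[p] only. The name stage_const_coef_ref is made up for illustration.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Radix-2 stage in which every contiguous part uses a single coefficient.
   The coefficient is fetched once per part, so the inner loop reads
   only data and no coefficients. */
static void stage_const_coef_ref(float complex *sum, float complex *diff,
                                 const float complex *a,
                                 const float complex *b,
                                 const float complex *coef,
                                 size_t n, size_t parts)
{
    assert(parts != 0 && n % parts == 0);
    const size_t len = n / parts;
    for (size_t p = 0; p < parts; ++p) {
        const float complex c = coef[p];          /* one fetch per part */
        for (size_t j = p * len; j < (p + 1) * len; ++j) {
            const float complex t = c * b[j];
            sum[j]  = a[j] + t;
            diff[j] = a[j] - t;
        }
    }
}
```

The shorter the parts become, the more often the outer loop has to restart the inner one, which is why this scheme looks attractive only for the early Stages.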

Conclusion

The stage_radix4_2x example shows how much work the compiler does to pack instructions efficiently into cycles. Doing that by hand would take a long time.
I write the C code so that it is easy to read, and the compiler shuffles the operations so that they execute more efficiently. In assembly I would have to choose between readable code and fast code.

Assembly is pleasant for small tasks, where you fully understand what happens where. But increase the complexity a little, and your brain starts to tire quickly.
I am not sure I could have solved this problem in pure assembly.