About six months ago I got acquainted with the Elbrus-8SV (Эльбрус-8СВ) VLIW processor. By then I already had experience writing assembly code for another VLIW processor, the TMS320C66, so I wanted to write something similar for Elbrus: an FFT implemented in assembly. However, because documentation on the processor's instructions is scarce, I had to start by implementing a simple algorithm in C and studying the assembly output the compiler produced for it. The previous article was the result of that work.
After writing that article I decided to try implementing the FFT in C for Elbrus. The work is not finished yet, but there is already some progress (a comparison with EML is included). In this article I want to share the results obtained so far.
Contents:
Writing the Reverse function
reverse_radix2
Writing the Stage function
stage_radix2
stage_radix2_2x
stage_radix2_readConjSwap
stage_radix2_readConjSwap_2x
stage_radix4
stage_radix4_2x
stage_radix4_readConjSwap
Assembling the FFT
Problem statement
Given: a pointer to an input array of complex numbers, the number of elements in it, and a pointer to an output array.
Required: compute the FFT of the input data and write the result to the output array.
We assume that the complex numbers have type float complex (that is, the real and imaginary parts are of type float).
We assume that the number of array elements is a power of 2 (or of 4, where needed).
And, of course, let the arrays be aligned in memory on boundaries that are convenient for us.
Compiler flags
The following lcc flags were used for compilation:
-Wall -O3 -faligned -ffast-math -march=elbrus-v5
Why I compile for elbrus-v5
After writing the previous article I lost access to the elbrus-v5 machine (it was sent for repair). At first I worked out algorithms on paper, then I was given access to an elbrus-v6 machine. I planned to write code for elbrus-v5, so I added the -march=elbrus-v5 flag to the build script. Access to v5 was later restored, but by then I had become used to working on v6, because it was available around the clock.
Before writing this article I decided to remove the -march=elbrus-v5 flag. After recompiling, some of the code ran slower than with the flag. It turned out that this code compiles more efficiently for v5 than for v6 (the instructions are packed more densely). As a result, code compiled for v5 runs faster on v6 than the same code compiled for v6.
This behavior can be observed in the following functions:
So for now I have settled on compiling for v5.
The differences in compilation can be examined at ce.mentality.rip:
Paste one of the functions mentioned above into the left pane.
Add the following code before the pasted function:
    #include <e2kintrin.h>
    #include <stdint.h>
    typedef struct { float real; float imag; } myComplex;
In the "Compiler options" line, specify the compiler flags:
    -Wall -O3 -faligned -ffast-math -march=elbrus-v5
After that, count the number of cycles in the loop (search for "loop_mode").
With -march=elbrus-v5 the loop takes fewer cycles than with -march=elbrus-v6.
How time was measured
Time was measured with the clock_gettime() function:
    struct timespec t0, t1;
    clock_gettime(CLOCK_REALTIME, &t0);
    /*** code being measured ***/
    clock_gettime(CLOCK_REALTIME, &t1);
    int usec = (t1.tv_sec - t0.tv_sec)*1000000 + (t1.tv_nsec - t0.tv_nsec)/1000;
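The measurement pattern above can be wrapped into a small helper. This is only a sketch: the names elapsed_usec and measure_usec are hypothetical, and CLOCK_MONOTONIC is my substitution for CLOCK_REALTIME (it is not affected by wall-clock adjustments), not what the measurements in this article used. The difference is computed in 64-bit arithmetic so that long runs do not overflow an int.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* exposes clock_gettime under strict -std=c99/c11 */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* t1 - t0 in microseconds (hypothetical helper, 64-bit result). */
int64_t elapsed_usec(struct timespec t0, struct timespec t1)
{
    return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000
         + (t1.tv_nsec - t0.tv_nsec) / 1000;
}

/* Time a single call of fn() and return the elapsed microseconds.
   CLOCK_MONOTONIC is an assumption of this sketch, see the text above. */
int64_t measure_usec(void (*fn)(void))
{
    struct timespec t0, t1;
    assert(fn != 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn();                                /* code being measured */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_usec(t0, t1);
}
```

For short kernels it is usually worth calling the measured code several times and taking the minimum, since a single run includes cache warm-up.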
The processor cycle counter was also read:
    uint64_t get_clock_count()
    {
        uint64_t dst;
        #pragma asm_inline
        asm ("rrd %%clkr, %0" : "=r" (dst));
        return dst;
    }
    ...
    uint64_t ticks0 = get_clock_count();
    /*** code being measured ***/
    uint64_t ticks1 = get_clock_count();
    uint64_t ticks = ticks1 - ticks0;
About the pragma options
This time I used #pragma prefetch instead of #pragma loop_count(100).
As far as I understand, both variants enable the APB. The difference is that prefetch does not disable the preliminary data request issued before the loop, whereas loop_count does. This request slightly reduces the execution time of the loop. With a large number of iterations the speed-up is negligible, while the request still increases the code size.
There is almost no difference, but prefetch looks simpler than loop_count(100).
I also used #pragma ivdep, which tells the compiler that different loop iterations are independent of each other with respect to memory accesses, so the next iteration can start before the current one has finished.
This option is useful when the loop both reads from and writes to memory.
About loop unrolling
Sometimes unrolling a loop helps to speed it up. Up to 6 instructions can execute in one cycle, because this processor has 6 execution units. If, for example, a loop iteration consists of 9 instructions, it takes 2 cycles, and two such iterations take 4 cycles. After unrolling by a factor of 2, one iteration of the new loop consists of 2*9=18 instructions, and it can be expected to take 3 cycles. So two iterations of the original loop take 3 cycles instead of 4. Keep in mind, though, that for various reasons it is not always possible to fit 6 instructions into every cycle, for example because some instructions cannot execute on every execution unit.
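The arithmetic above, together with a toy unrolled loop, can be sketched as follows. This is an idealized illustration with hypothetical names (min_cycles, sum_plain, sum_x2): it assumes that any instruction can go to any of the execution units, which, as noted, does not always hold on real hardware. The unrolled variant handles an odd element count with a separate tail step.

```c
#include <assert.h>
#include <stddef.h>

/* Lower bound on cycles per loop iteration: ceil(instructions / units).
   Idealization: every instruction may be issued on every unit. */
int min_cycles(int instructions, int units)
{
    assert(instructions >= 0 && units > 0);
    return (instructions + units - 1) / units;
}

/* Plain loop: one element per iteration. */
long sum_plain(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* The same loop unrolled by 2 by hand: two independent accumulators,
   plus a tail step for the case when n is odd. */
long sum_x2(const int *a, size_t n)
{
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)          /* leftover element when n is not a multiple of 2 */
        s0 += a[i];
    return s0 + s1;
}
```

With 6 units, min_cycles(9, 6) gives 2 and min_cycles(18, 6) gives 3, which is the "4 cycles versus 3 cycles" comparison from the text.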
Ways to unroll a loop:
- Unrolling by the compiler (I append the suffix "_unroll2"/"_unroll3"/"_unroll4" to the names of such functions). To use it, write #pragma unroll(k) before the loop, where k is the unroll factor. The compiler then unrolls the loop by exactly k. It also adds code that checks whether the iteration count of the original loop is a multiple of k; if it is not, the remainder is handled by separate code. The unrolled code performs the same operations in the same order as the original code. With #pragma unroll(1) no unrolling is done. This can be useful, because by default the compiler tries to apply unroll(2).
- Manual unrolling by the programmer (I append the suffix "_x2"/"_x3"/"_x4" to the names of such functions). With manual unrolling the programmer writes the code so that one loop iteration performs k iterations of the algorithm. Checking that the number of algorithm iterations is a multiple of k is the programmer's responsibility. If it may not be a multiple of k, those cases must be handled separately. For simplicity, the code in this article does not handle them. Unlike compiler unrolling, which repeats all operations in their original order, manually unrolled code can be optimized by replacing operations with other ones and by reordering them.
An interesting case is manual unrolling of a loop whose body is another loop that processes the same array. Here k iterations of the inner loop can be merged. For example, if the original inner-loop iteration reads data from memory, processes it and stores the result back to memory, then after unrolling the outer loop by 2 we can drop the store at the end of the first iteration and the load of the same data before the second one. What remains is a load, processing, processing again and a store. An example of this optimization can be found in the functions marked "2x", for example stage_radix2_2x (not "x2", so that the files sort better by name).
What is FFT
Let a set of complex numbers $x_0, \dots, x_{N-1}$ be given.
The discrete Fourier transform (DFT) maps this set to another set of complex numbers $X_0, \dots, X_{N-1}$ by the following formula:
$$X_t = \sum_{s=0}^{N-1} x_s \, e^{-2\pi i \, s t / N}, \qquad t = 0, \dots, N-1.$$
Computing $X_0, \dots, X_{N-1}$ by this formula takes $O(N^2)$ operations.
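For reference, a direct O(N^2) DFT can be sketched as below. It is not part of the article's code: the function name dft_naive is hypothetical, and I assume that the caller supplies a table w[k] = e^(-2*pi*i*k/n) for k = 0..n-1 (how the table is filled is left open). The myComplex type is the one used throughout the article. Such a routine is mainly useful as a slow but simple reference for checking faster variants.

```c
#include <assert.h>

typedef struct { float real; float imag; } myComplex;

/* Complex multiply-accumulate: acc += a * b. */
static void cmac(myComplex *acc, myComplex a, myComplex b)
{
    acc->real += a.real * b.real - a.imag * b.imag;
    acc->imag += a.real * b.imag + a.imag * b.real;
}

/* Direct DFT: X[t] = sum over s of x[s] * e^(-2*pi*i*s*t/n).
   Assumption: w[k] holds e^(-2*pi*i*k/n) for k = 0..n-1. */
void dft_naive(int n, const myComplex *w, const myComplex *x, myComplex *X)
{
    assert(n > 0);
    for (int t = 0; t < n; ++t) {
        myComplex acc = {0.0f, 0.0f};
        for (int s = 0; s < n; ++s) {
            /* the exponent s*t only matters modulo n */
            int k = (int)(((long long)s * t) % n);
            cmac(&acc, x[s], w[k]);
        }
        X[t] = acc;
    }
}
```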
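The fast Fourier transform (FFT) is an algorithm that computes the DFT much faster (usually in $O(N \log N)$ operations).
The best-known FFT is the Cooley-Tukey algorithm. It computes a DFT recursively through DFTs of smaller size (the "divide-and-conquer" method).
For example, if the original array is split into two subarrays (the "radix-2" variant), the original formula can be rewritten as follows:
$$X_t = E_t + e^{-2\pi i \, t / N} \, O_t, \qquad X_{t+N/2} = E_t - e^{-2\pi i \, t / N} \, O_t, \qquad t = 0, \dots, N/2-1.$$
$E_t$ is the DFT of the N/2 even elements of the original array (indices of the form 2s+0).
$O_t$ is the DFT of the N/2 odd elements of the original array (indices of the form 2s+1).
The "radix-4" variant
By analogy, the "radix-4" variant rewrites the formula as follows (with $W = e^{-2\pi i / N}$ and $t = 0, \dots, N/4-1$):
$$X_{t} = EE_t + W^{t} EO_t + W^{2t} OE_t + W^{3t} OO_t$$
$$X_{t+N/4} = EE_t - i\,W^{t} EO_t - W^{2t} OE_t + i\,W^{3t} OO_t$$
$$X_{t+2N/4} = EE_t - W^{t} EO_t + W^{2t} OE_t - W^{3t} OO_t$$
$$X_{t+3N/4} = EE_t + i\,W^{t} EO_t - W^{2t} OE_t - i\,W^{3t} OO_t$$
$EE_t$ is the DFT of the N/4 elements of the original array with indices of the form 4s+0.
$EO_t$ is the DFT of the N/4 elements of the original array with indices of the form 4s+1.
$OE_t$ is the DFT of the N/4 elements of the original array with indices of the form 4s+2.
$OO_t$ is the DFT of the N/4 elements of the original array with indices of the form 4s+3.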
Let us write pseudocode for a recursive function that implements the "radix-2" variant:
    FFT(IN)  // IN - the input data set
    {
        N = IN.length
        if N == 1
            return IN
        E = FFT(IN[0:N:2])  // FFT of the even elements
        O = FFT(IN[1:N:2])  // FFT of the odd elements
        for t = 0:N/2
        {
            c = e^(-2*pi*i*t/N)
            OUT[t      ] = E[t] + c*O[t]
            OUT[t + N/2] = E[t] - c*O[t]
        }
        return OUT
    }
Now a little closer to real code:
    FFT(*IN, N, s)  // IN - pointer to the first element
                    // N  - number of elements
                    // s  - distance between elements
    {
        if N == 1
            return IN[0]
        E[0:N/2] = FFT(IN,     N/2, 2s)  // FFT of the even elements
        O[0:N/2] = FFT(IN + s, N/2, 2s)  // FFT of the odd elements
        for t = 0:N/2
        {
            c = e^(-2*pi*i*t/N)
            OUT[t      ] = E[t] + c*O[t]
            OUT[t + N/2] = E[t] - c*O[t]
        }
        return OUT[0:N]
    }
Let us place the elements $E_t$ in the first half of OUT and the elements $O_t$ in the second half of OUT:
    FFT(*IN, N, s)  // IN - pointer to the first element
                    // N  - number of elements
                    // s  - distance between elements
    {
        if N == 1
            return IN[0]
        OUT[0:N/2] = FFT(IN,     N/2, 2s)  // FFT of the even elements
        OUT[N/2:N] = FFT(IN + s, N/2, 2s)  // FFT of the odd elements
        for t = 0:N/2
        {
            x = OUT[t      ]
            y = OUT[t + N/2]
            c = e^(-2*pi*i*t/N)
            OUT[t      ] = x + c*y
            OUT[t + N/2] = x - c*y
        }
        return OUT[0:N]
    }
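The last pseudocode variant translates to C almost line by line. The sketch below is my own illustration rather than the article's code: the name fft_recursive is hypothetical, and instead of evaluating the exponent it reads coefficients from a table w that the caller is assumed to have filled with w[k] = e^(-2*pi*i*k/N) for k = 0..N/2-1, where N is the size of the top-level call.

```c
#include <assert.h>

typedef struct { float real; float imag; } myComplex;

static myComplex cmul(myComplex a, myComplex b)
{
    myComplex r = { a.real * b.real - a.imag * b.imag,
                    a.real * b.imag + a.imag * b.real };
    return r;
}

/* in  - pointer to the first element
   n   - number of elements
   s   - distance between elements (1 in the top-level call)
   w   - assumed table: w[k] = e^(-2*pi*i*k/N), N = n*s of the top-level call
   out - n contiguous results */
void fft_recursive(const myComplex *in, int n, int s,
                   const myComplex *w, myComplex *out)
{
    assert(n >= 1 && s >= 1);
    if (n == 1) {
        out[0] = in[0];
        return;
    }
    fft_recursive(in,     n / 2, 2 * s, w, out);          /* even elements */
    fft_recursive(in + s, n / 2, 2 * s, w, out + n / 2);  /* odd elements  */
    for (int t = 0; t < n / 2; ++t) {
        myComplex x  = out[t];
        myComplex y  = out[t + n / 2];
        myComplex cy = cmul(w[t * s], y);   /* w[t*s] = e^(-2*pi*i*t/n) */
        out[t].real         = x.real + cy.real;
        out[t].imag         = x.imag + cy.imag;
        out[t + n / 2].real = x.real - cy.real;
        out[t + n / 2].imag = x.imag - cy.imag;
    }
}
```

A top-level call looks like fft_recursive(in, N, 1, w, out).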
To simplify the implementation, before the recursive calls we can move all even elements to the beginning of the array and all odd elements to the end. The subarrays then occupy consecutive memory cells instead of being interleaved.
If the number of elements is $N = 2^n$, this permutation is equivalent to IN[k] → OUT[rotateRight(k, n)], where rotateRight(k, n) is the operation that "rotates right" the n low bits of k: it moves the lowest bit of k from position 0 to position n-1 and shifts the bits at positions n-1, …, 1 one position towards position 0.
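A small C sketch of this permutation, to make the equivalence concrete. The function names are taken from the text and the pseudocode, but the implementation is my own illustration; plain int elements are used instead of complex numbers to keep it short.

```c
#include <assert.h>

/* Rotate the n low bits of k right by one position:
   bit 0 moves to position n-1, bits n-1..1 move one position down. */
unsigned rotateRight(unsigned k, int n)
{
    assert(n >= 1 && n < 32 && k < (1u << n));
    return (k >> 1) | ((k & 1u) << (n - 1));
}

/* "Even elements to the beginning, odd elements to the end"
   for an array of 2^n elements, written as IN[k] -> OUT[rotateRight(k, n)]. */
void Even2Beginning_Odd2Ending(int n, const int *in, int *out)
{
    unsigned count = 1u << n;
    for (unsigned k = 0; k < count; ++k)
        out[rotateRight(k, n)] = in[k];
}
```

For 8 elements 0..7 this produces 0, 2, 4, 6, 1, 3, 5, 7.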
The pseudocode becomes:
    FFT(*IN, N)  // IN - pointer to the first element
                 // N  - number of elements
    {
        if N == 1
            return IN[0]
        // permutation "even elements to the beginning, odd elements to the end"
        IN = Even2Beginning_Odd2Ending(IN, N)
        OUT[0:N/2] = FFT(IN,       N/2)  // FFT of the first half of the array
        OUT[N/2:N] = FFT(IN + N/2, N/2)  // FFT of the second half of the array
        for t = 0:N/2
        {
            x = OUT[t      ]
            y = OUT[t + N/2]
            c = e^(-2*pi*i*t/N)
            OUT[t      ] = x + c*y
            OUT[t + N/2] = x - c*y
        }
        return OUT[0:N]
    }
If we now mentally follow the recursion, we can see that by the time the deepest level of recursive calls is reached, all elements of the original array have been rearranged into the order usually called "bit reversal". In this article I will simply call this permutation "Reverse".
What is Reverse
It is a permutation of elements of the form IN[k] → OUT[reverseNumber(k)], where reverseNumber(k) is the operation that reverses the order of the bits of k.
The result of reverseNumber(k) depends on the number of bits in k.
For example:
- with 3 bits, reverseNumber(6) = 3 (110 → 011)
- with 4 bits, reverseNumber(6) = 6 (0110 → 0110)
- with 5 bits, reverseNumber(6) = 12 (00110 → 01100)

In the "radix-4" case we arrive at a similar permutation, where base-4 digits (pairs of bits) are reversed instead of binary digits (bits).
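A possible C sketch of the radix-4 digit reversal, written by analogy with the bit-by-bit loop. The name reverseNumber_radix4 and the digit_count parameter are my own choice for illustration.

```c
#include <assert.h>

/* Reverse the order of the base-4 digits (pairs of bits) of number.
   digit_count is the number of base-4 digits, i.e. half the number of bits. */
unsigned reverseNumber_radix4(unsigned number, int digit_count)
{
    assert(digit_count >= 0 && digit_count <= 16);
    unsigned answer = 0;
    for (int i = 0; i < digit_count; ++i) {
        answer = (answer << 2) | (number & 3u);  /* append the lowest digit */
        number >>= 2;                            /* drop it from the source */
    }
    return answer;
}
```

For example, with 2 digits the number 6 (digits 1, 2) becomes 9 (digits 2, 1), whereas the binary reversal of 6 with 4 bits leaves it unchanged.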

So the descent into the recursion can be replaced by Reverse.
The way back out of the recursion consists of processing subarrays whose size grows as the recursion unwinds. One step of returning from the recursion, applied to all subarrays at once, will be called a "Stage".
    Stage(*OUT, N, stage_num)
    {
        m = 2^stage_num  // m - subarray size
        for k = 0:N/m    // k - subarray number
        {
            for t = 0:m/2
            {
                x = OUT[m*k + t      ]
                y = OUT[m*k + t + m/2]
                c = e^(-2*pi*i*t/m)
                OUT[m*k + t      ] = x + c*y
                OUT[m*k + t + m/2] = x - c*y
            }
        }
    }

The pseudocode of the algorithm now looks like this (without recursion):
    FFT(*IN, N, *OUT)
    {
        OUT = Reverse(IN, N)
        for stage_num = 1, ..., log2(N)
            Stage(OUT, N, stage_num)
    }
So FFT = Reverse + log2(N)*Stage.
To solve the problem we need to write two functions: Reverse and Stage.
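The "Reverse + log2(N)*Stage" scheme can be sketched in portable C as follows. This is a reference illustration under my own assumptions, not the optimized code from this article: the names stage and fft_iterative are hypothetical, and coefficients come from a caller-supplied table w[k] = e^(-2*pi*i*k/N), k = 0..N/2-1. Within a Stage with subarray size m the coefficient e^(-2*pi*i*t/m) is then w[t*(N/m)].

```c
#include <assert.h>

typedef struct { float real; float imag; } myComplex;

static myComplex cmul(myComplex a, myComplex b)
{
    myComplex r = { a.real * b.real - a.imag * b.imag,
                    a.real * b.imag + a.imag * b.real };
    return r;
}

static int reverseNumber(int number, int bit_count)
{
    int answer = 0;
    for (int i = 0; i < bit_count; ++i) {
        answer = (answer << 1) | (number & 1);
        number >>= 1;
    }
    return answer;
}

/* One in-place Stage over all subarrays of size m = 2^stage_num. */
static void stage(myComplex *out, int n, int stage_num, const myComplex *w)
{
    int m = 1 << stage_num;
    for (int k = 0; k < n / m; ++k) {          /* k - subarray number */
        for (int t = 0; t < m / 2; ++t) {
            myComplex *px = &out[m * k + t];
            myComplex *py = &out[m * k + t + m / 2];
            myComplex x  = *px;
            myComplex cy = cmul(w[t * (n / m)], *py);  /* e^(-2*pi*i*t/m) * y */
            px->real = x.real + cy.real;  px->imag = x.imag + cy.imag;
            py->real = x.real - cy.real;  py->imag = x.imag - cy.imag;
        }
    }
}

/* FFT of N = 2^bit_count elements: Reverse, then log2(N) Stages. */
void fft_iterative(int bit_count, const myComplex *w,
                   const myComplex *in, myComplex *out)
{
    assert(bit_count >= 0 && bit_count < 31);
    int n = 1 << bit_count;
    for (int k = 0; k < n; ++k)
        out[reverseNumber(k, bit_count)] = in[k];
    for (int s = 1; s <= bit_count; ++s)
        stage(out, n, s, w);
}
```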
Specifics of implementing the FFT on Elbrus
The Elbrus processor has the APB mechanism, which allows fast reading of data located in memory with a constant stride. The number of read streams in the APB is limited to 32.
In the last variant of the algorithm (without recursion) different Stages process subarrays of different sizes:
the first Stage processes subarrays of length 2 (this can be seen as 2 read streams with a constant stride: the even and the odd elements)
the second Stage processes subarrays of length 4 (this can be seen as 4 read streams with a constant stride)
the third Stage processes subarrays of length 8 (this can be seen as 8 read streams with a constant stride)
and so on
With a large enough number of Stages, the APB read streams run out. To use the APB efficiently, let us modify the Stage algorithm so that the elements are always read in pairs (as in the first Stage now).
This gives the following variant for Elbrus:
    Stage(*IN, N, *OUT, stage_num)
    {
        m = 2^stage_num
        for k = 0:m/2
        {
            c = e^(-2*pi*i*k/m)
            for t = 0:N/m
            {
                j = k*(N/m) + t  // pair number, runs over 0 .. N/2-1
                x = IN[2*j    ]
                y = IN[2*j + 1]
                OUT[j      ] = x + c*y
                OUT[j + N/2] = x - c*y
            }
        }
    }

If we mentally follow the operations, we can see that in this variant each Stage performs the same arithmetic operations on the same pairs of numbers with the same coefficients as the classical variant. The pairs are simply processed in a different order (on every Stage except the first). In addition, the "even elements to the beginning, odd elements to the end" permutation is applied at the end of each Stage. As noted above, this permutation is equivalent to IN[k] → OUT[rotateRight(k, n)], so after all Stages the numbers return to their original positions (the rotation is applied log2(N) times).
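A portable C sketch of this "always read in pairs" scheme is given below, as a slow reference and not as the optimized Elbrus code of this article. The names stage_pairs and fft_pairs are hypothetical. I assume a coefficient table w[k] = e^(-2*pi*i*k/N), k = 0..N/2-1, and a scratch buffer tmp of N elements; since each Stage reads one buffer and writes another, the two buffers are swapped after every Stage, and the starting buffer is chosen so that the final result lands in out.

```c
#include <assert.h>

typedef struct { float real; float imag; } myComplex;

static myComplex cmul(myComplex a, myComplex b)
{
    myComplex r = { a.real * b.real - a.imag * b.imag,
                    a.real * b.imag + a.imag * b.real };
    return r;
}

static int reverseNumber(int number, int bit_count)
{
    int answer = 0;
    for (int i = 0; i < bit_count; ++i) {
        answer = (answer << 1) | (number & 1);
        number >>= 1;
    }
    return answer;
}

/* One Stage: pair j is read from in[2j], in[2j+1] and written to
   out[j], out[j + n/2]; the coefficient depends only on k = j / (n/m). */
static void stage_pairs(int n, int stage_num, const myComplex *w,
                        const myComplex *in, myComplex *out)
{
    int m = 1 << stage_num;
    for (int k = 0; k < m / 2; ++k) {
        myComplex c = w[k * (n / m)];          /* e^(-2*pi*i*k/m) */
        for (int t = 0; t < n / m; ++t) {
            int j = k * (n / m) + t;
            myComplex x  = in[2 * j];
            myComplex cy = cmul(c, in[2 * j + 1]);
            out[j].real         = x.real + cy.real;
            out[j].imag         = x.imag + cy.imag;
            out[j + n / 2].real = x.real - cy.real;
            out[j + n / 2].imag = x.imag - cy.imag;
        }
    }
}

/* FFT of N = 2^bit_count elements; tmp is scratch space of N elements. */
void fft_pairs(int bit_count, const myComplex *w, const myComplex *in,
               myComplex *out, myComplex *tmp)
{
    assert(bit_count >= 0 && bit_count < 31);
    int n = 1 << bit_count;
    /* start in the buffer that makes the last Stage write into out */
    myComplex *src = (bit_count % 2 == 0) ? out : tmp;
    myComplex *dst = (bit_count % 2 == 0) ? tmp : out;
    for (int k = 0; k < n; ++k)
        src[reverseNumber(k, bit_count)] = in[k];
    for (int s = 1; s <= bit_count; ++s) {
        stage_pairs(n, s, w, src, dst);
        myComplex *swap = src; src = dst; dst = swap;
    }
}
```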
What Stage looks like for the "radix-4" version

This is the variant that will be implemented in this article.
A note on the coefficients in the Stage functions
To compute a Stage we need the input data and the coefficients. The input data are in memory from the start. The coefficients can either be read from memory together with the input data (computed in advance) or be computed on the fly.
Reading the coefficients from memory works well when there are few coefficients and they all fit into the cache.
Computing them on the fly pays off when there are many coefficients. In that case the bottleneck is the memory channel, and not reading the coefficients from memory leaves the whole channel for reading the input data. The total size of the coefficients (depending on the algorithm) can be several times larger than the size of the input data, so not reading them from memory can increase the input read rate several times. With few coefficients, computing on the fly slows things down, because the computation itself requires extra instructions.
The code in this article implements only the variant that reads the coefficients from memory.
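To make the two options concrete, here is one way a coefficient table could be filled in advance. This is a sketch under my own assumptions, not the article's code: the name fill_coefficients is hypothetical, and the primitive root w1 = e^(-2*pi*i/N) is assumed to be supplied by the caller (for example from cosf/sinf). The table is built by repeated multiplication, w[k] = w1^k. The same recurrence, kept in registers instead of being stored, is one possible form of computing coefficients on the fly. Note that rounding errors accumulate along the recurrence, so for large tables evaluating each entry directly is more accurate.

```c
#include <assert.h>

typedef struct { float real; float imag; } myComplex;

static myComplex cmul(myComplex a, myComplex b)
{
    myComplex r = { a.real * b.real - a.imag * b.imag,
                    a.real * b.imag + a.imag * b.real };
    return r;
}

/* Fill w[k] = w1^k for k = 0..count-1.
   Assumption: the caller passes w1 = e^(-2*pi*i/N). */
void fill_coefficients(int count, myComplex w1, myComplex *w)
{
    assert(count >= 0);
    myComplex c = {1.0f, 0.0f};      /* w1^0 */
    for (int k = 0; k < count; ++k) {
        w[k] = c;
        c = cmul(c, w1);             /* one extra complex multiply per coefficient */
    }
}
```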
Writing the Reverse function
reverse_radix2
1. reverse_radix2_etalon
A reference variant used to check correctness.
Here reverseNumber is computed with a loop.
    int reverseNumber_radix2(int number, int bit_count)
    {
        int answer = 0;
        for(int i = 0; i < bit_count; ++i) {
            answer <<= 1;
            answer |= number & 1;
            number >>= 1;
        }
        return answer;
    }

    void reverse_radix2_etalon(int bit_count, myComplex *data_in, myComplex *data_out)
    {
        int count = 1 << bit_count;
        for(int64_t i = 0; i < count; ++i) {
            int index = reverseNumber_radix2(i, bit_count);
            data_out[index] = data_in[i];
        }
    }
2. reverse_radix2
The processor has the bitrevd instruction, which performs the reverseNumber_radix2 operation on a 64-bit number. Let us replace reverseNumber_radix2() with __builtin_e2k_bitrevd().
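For readers without access to an Elbrus machine, the effect of reversing a full 64-bit word and then shifting can be reproduced in portable C. The functions bitrev64 and reverse_index below are my own stand-ins for illustration. I assume that bitrevd reverses all 64 bits, as described above, and that the following shift right by 64 - bit_count is a logical one.

```c
#include <assert.h>
#include <stdint.h>

/* Portable 64-bit bit reversal: swap neighbouring bits, then pairs,
   nibbles, bytes, 16-bit halves and finally the two 32-bit halves. */
uint64_t bitrev64(uint64_t x)
{
    x = ((x >> 1)  & 0x5555555555555555ULL) | ((x & 0x5555555555555555ULL) << 1);
    x = ((x >> 2)  & 0x3333333333333333ULL) | ((x & 0x3333333333333333ULL) << 2);
    x = ((x >> 4)  & 0x0F0F0F0F0F0F0F0FULL) | ((x & 0x0F0F0F0F0F0F0F0FULL) << 4);
    x = ((x >> 8)  & 0x00FF00FF00FF00FFULL) | ((x & 0x00FF00FF00FF00FFULL) << 8);
    x = ((x >> 16) & 0x0000FFFF0000FFFFULL) | ((x & 0x0000FFFF0000FFFFULL) << 16);
    return (x >> 32) | (x << 32);
}

/* Reverse only the bit_count low bits of i: after the full 64-bit reversal
   they sit at the top of the word, so shift them back down. */
uint64_t reverse_index(uint64_t i, int bit_count)
{
    assert(bit_count >= 1 && bit_count <= 64);
    return bitrev64(i) >> (64 - bit_count);
}
```

The results agree with the reverseNumber examples given earlier: 6 maps to 3, 6 and 12 for 3, 4 and 5 bits respectively.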
Diagram of data movement in memory

C code
    void reverse_radix2(int bit_count, myComplex *data_in, myComplex *data_out)
    {
        int count = 1 << bit_count;
        int shift = 64 - bit_count;
        #pragma ivdep
        #pragma unroll(1)
        #pragma prefetch
        for(int64_t i = 0; i < count; ++i) {
            int64_t index = __builtin_e2k_bitrevd(i) >> shift;
            data_out[index] = data_in[i];
        }
    }
Main loop in assembly
    .L1554:
    {
      fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=5, abs=0, disp=0
    }
    .L1385:
    {
      loop_mode
      alc alcf=1, alct=1
      abn abnf=1, abnt=1
      ct %ctpr1 ? %NOT_LOOP_END
      bitrevd,0,sm %b[9], %b[1]
      addd,2,sm %b[9], 0x1, %b[7]
      shrd,3,sm %b[5], %r0, %b[10]
      shld,4,sm %b[12], 0x3, %b[11]
      std,5 %r2, %b[13], %b[8]
      movad,1 area=0, ind=0, am=1, be=0, %b[0]
    }
Theoretical throughput: 1 complex number per cycle (1/1) = 8 bytes/cycle
Speed measurements

3. reverse_radix2_x2_bad
Let us try to speed it up by manually unrolling the loop by a factor of 2.
Diagram of data movement in memory

C code
    void reverse_radix2_x2_bad(int bit_count, myComplex *data_in, myComplex *data_out)
    {
        int count = 1 << bit_count;
        int shift = 64 - bit_count;
        myComplex *data_out_0 = &data_out[0];
        myComplex *data_out_1 = &data_out[count/2];
        #pragma ivdep
        #pragma unroll(1)
        #pragma prefetch
        for(int64_t i = 0; i < count; i += 2) {
            int64_t index = __builtin_e2k_bitrevd(i) >> shift;
            data_out_0[index] = data_in[i + 0];
            data_out_1[index] = data_in[i + 1];
        }
    }
Main loop in assembly
    .L1860:
    {
      fapb ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
    }
    .L1655:
    {
      loop_mode
      alc alcf=1, alct=1
      abn abnf=1, abnt=1
      ct %ctpr1 ? %NOT_LOOP_END
      bitrevd,0,sm %b[20], %b[12]
      addd,1,sm %b[20], 0x2, %b[18]
      std,2 %r2, %b[17], %b[10]
      shrd,3,sm %b[16], %r4, %b[19]
      shld,4,sm %b[21], 0x3, %b[13]
      std,5 %r0, %b[17], %b[11]
      movad,0 area=0, ind=0, am=0, be=0, %b[0]
      movad,1 area=0, ind=8, am=1, be=0, %b[1]
    }
Theoretical throughput: 2 complex numbers per cycle (2/1) = 16 bytes/cycle
Speed measurements

We see a speed-up at the beginning of the graph.
Here each iteration performs two reads from one place in memory and two writes to two different places in memory.
The lack of a speed-up over the rest of the graph is probably because the writes to memory are still performed one after another (into the same memory bank?).
4. reverse_radix2_x2_good
Let us try the opposite: read from two different places and write to adjacent locations.
Diagram of data movement in memory

C code
    void reverse_radix2_x2_good(int bit_count, myComplex *data_in, myComplex *data_out)
    {
        int count = 1 << bit_count;
        int shift = 64 - bit_count;
        myComplex *data_in_0 = &data_in[0];
        myComplex *data_in_1 = &data_in[count/2];
        #pragma ivdep
        #pragma unroll(1)
        #pragma prefetch
        for(int64_t i = 0; i < count/2; ++i) {
            int64_t index = __builtin_e2k_bitrevd(i) >> shift;
            data_out[index + 0] = data_in_0[i];
            data_out[index + 1] = data_in_1[i];
        }
    }
Main loop in assembly
    .L2162:
    {
      fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=5, abs=0, disp=0
      fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=5, abs=0, disp=0
    }
    .L1993:
    {
      loop_mode
      alc alcf=1, alct=1
      abn abnf=1, abnt=1
      ct %ctpr1 ? %NOT_LOOP_END
      bitrevd,0,sm %b[20], %b[12]
      addd,1,sm %b[20], 0x1, %b[18]
      std,2 %b[17], %r2, %b[11]
      shrd,3,sm %b[16], %r4, %b[19]
      shld,4,sm %b[21], 0x3, %b[13]
      std,5 %r0, %b[17], %b[10]
      movad,1 area=0, ind=0, am=1, be=0, %b[1]
      movad,3 area=0, ind=0, am=1, be=0, %b[0]
    }
Theoretical throughput: 2 complex numbers per cycle (2/1) = 16 bytes/cycle
Speed measurements

We see the desired speed-up along the whole graph.
Strictly speaking, this is not loop unrolling. The previous variant was an honest unrolling. Here the algorithm itself has changed (the data are processed in a different order). But I could not come up with a name for it ("stream2"?), so all further "unrollings" will be called x4/x8 and so on.
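That the two-stream variant still produces the same permutation can be checked with a portable sketch. The names reverse_plain and reverse_two_streams are hypothetical, and a simple loop-based reverseNumber replaces the bitrevd builtin. The key observation is that for i < count/2 the top bit of i is 0, so the reversed index is even, and the element i + count/2 goes to the next (odd) position.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { float real; float imag; } myComplex;

static uint64_t reverseNumber(uint64_t number, int bit_count)
{
    uint64_t answer = 0;
    for (int i = 0; i < bit_count; ++i) {
        answer = (answer << 1) | (number & 1u);
        number >>= 1;
    }
    return answer;
}

/* Reference: one element per iteration. */
void reverse_plain(int bit_count, const myComplex *in, myComplex *out)
{
    uint64_t count = 1ULL << bit_count;
    for (uint64_t i = 0; i < count; ++i)
        out[reverseNumber(i, bit_count)] = in[i];
}

/* Two read streams (first and second half of the input),
   two adjacent writes per iteration. */
void reverse_two_streams(int bit_count, const myComplex *in, myComplex *out)
{
    assert(bit_count >= 1 && bit_count < 31);
    uint64_t count = 1ULL << bit_count;
    const myComplex *in_0 = &in[0];
    const myComplex *in_1 = &in[count / 2];
    for (uint64_t i = 0; i < count / 2; ++i) {
        uint64_t index = reverseNumber(i, bit_count);  /* always even here */
        out[index + 0] = in_0[i];
        out[index + 1] = in_1[i];   /* reverse(i + count/2) = reverse(i) + 1 */
    }
}
```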
5. reverse_radix2_x2_best
Before moving on to larger unroll factors, let us see what happens if the two 64-bit stores are replaced with a single 128-bit store.
Diagram of data movement in memory

C code
    void reverse_radix2_x2_best(int bit_count, myComplex *data_in, myComplex *data_out)
    {
        int count = 1 << bit_count;
        int shift = 64 - bit_count;
        uint64_t *data_in_0 = (uint64_t*)&data_in[0];
        uint64_t *data_in_1 = (uint64_t*)&data_in[count/2];
        #pragma ivdep
        #pragma unroll(1)
        #pragma prefetch
        for(int64_t i = 0; i < count/2; ++i) {
            int64_t offset = 8 * (__builtin_e2k_bitrevd(i) >> shift);
            *(__v2du*)((void*)data_out + offset) = (__v2du){data_in_0[i], data_in_1[i]};
        }
    }
Main loop in assembly
    .L2350:
    {
      fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=5, abs=0, disp=0
      fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=5, abs=0, disp=0
    }
    .L2262:
    {
      loop_mode
      alc alcf=1, alct=1
      abn abnf=1, abnt=1
      ct %ctpr1 ? %NOT_LOOP_END
      bitrevd,0,sm %b[22], %b[21]
      qppackdl,1,sm %b[10], %b[11], %b[13]
      addd,2,sm %b[22], 0x1, %b[20]
      shrd,3,sm %b[25], %r0, %b[24]
      shld,4,sm %b[26], 0x3, %b[27]
      stqp,5 %r2, %b[29], %b[19]
      movad,1 area=0, ind=0, am=1, be=0, %b[1]
      movad,3 area=0, ind=0, am=1, be=0, %b[0]
    }
Theoretical throughput: 2 complex numbers per cycle (2/1) = 16 bytes/cycle
Speed measurements

We see a small speed-up.
From now on we will always write to memory in 128-bit chunks.
6. reverse_radix2_x4
Let us now do the same "pseudo-unrolling" by a factor of 4.
Diagram of data movement in memory

C code
    void reverse_radix2_x4(int bit_count, myComplex *data_in, myComplex *data_out)
    {
        int count = 1 << bit_count;
        int shift = 64 - bit_count;
        uint64_t *data_in_00 = (uint64_t*)&data_in[0 * count/4];
        uint64_t *data_in_01 = (uint64_t*)&data_in[1 * count/4];
        uint64_t *data_in_10 = (uint64_t*)&data_in[2 * count/4];
        uint64_t *data_in_11 = (uint64_t*)&data_in[3 * count/4];
        #pragma ivdep
        #pragma unroll(1)
        #pragma prefetch
        for(int64_t i = 0; i < count/4; ++i) {
            int64_t offset = 8 * (__builtin_e2k_bitrevd(i) >> shift);
            *(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_00[i], data_in_10[i]};
            *(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_01[i], data_in_11[i]};
        }
    }
Main loop in assembly
    .L2619:
    {
      fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=0, disp=0
      fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=0, disp=0
    }
    {
      fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
      fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=4, abs=16, disp=0
    }
    .L2503:
    {
      loop_mode
      qppackdl,1,sm %b[15], %b[18], %b[20]
      shrd,3,sm %b[21], %r4, %b[1]
      qppackdl,4,sm %b[9], %b[12], %b[22]
    }
    {
      loop_mode
      alc alcf=1, alct=1
      abn abnf=1, abnt=1
      ct %ctpr1 ? %NOT_LOOP_END
      addd,1,sm %b[4], 0x1, %b[2]
      stqp,2 %r2, %b[17], %b[20]
      bitrevd,3,sm %b[4], %b[19]
      shld,4,sm %b[1], 0x3, %b[15]
      stqp,5 %r0, %b[17], %b[22]
      movad,0 area=1, ind=0, am=1, be=0, %b[6]
      movad,1 area=0, ind=0, am=1, be=0, %b[12]
      movad,2 area=1, ind=0, am=1, be=0, %b[3]
      movad,3 area=0, ind=0, am=1, be=0, %b[9]
    }
Theoretical throughput: 4 complex numbers per 2 cycles (4/2) = 16 bytes/cycle
Speed measurements

We see a large speed-up.
However, the loop body no longer fits into one cycle. I would like to fix that.
7. reverse_radix2_x4_oneTickVersion
Let us rewrite the code so that it fits into one cycle (by getting rid of the shrd and shld instructions).
Diagram of data movement in memory

C code
    void reverse_radix2_x4_oneTickVersion(int bit_count, myComplex *data_in, myComplex *data_out)
    {
        int count = 1 << bit_count;
        int shift = 64 - bit_count;
        int64_t delta = (1LL << shift) / 8;
        uint64_t *data_in_00 = (uint64_t*)&data_in[0 * count/4];
        uint64_t *data_in_01 = (uint64_t*)&data_in[1 * count/4];
        uint64_t *data_in_10 = (uint64_t*)&data_in[2 * count/4];
        uint64_t *data_in_11 = (uint64_t*)&data_in[3 * count/4];
        #pragma ivdep
        #pragma unroll(1)
        #pragma prefetch
        for(int64_t i = 0, shifted_i = 0; i < count/4; ++i, shifted_i += delta) {
            int64_t offset = __builtin_e2k_bitrevd(shifted_i);
            *(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_00[i], data_in_10[i]};
            *(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_01[i], data_in_11[i]};
        }
    }
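The rewrite relies on the fact that shifting the argument left before the bit reversal has the same effect as shifting the result right after it, as long as no bits are lost. The sketch below checks this in portable C. The helper names are hypothetical, bitrev64 is a software stand-in for the bitrevd builtin, and the multiplication i * delta corresponds to the running sum shifted_i += delta in the loop above.

```c
#include <assert.h>
#include <stdint.h>

/* Software stand-in for a full 64-bit bit reversal. */
static uint64_t bitrev64(uint64_t x)
{
    x = ((x >> 1)  & 0x5555555555555555ULL) | ((x & 0x5555555555555555ULL) << 1);
    x = ((x >> 2)  & 0x3333333333333333ULL) | ((x & 0x3333333333333333ULL) << 2);
    x = ((x >> 4)  & 0x0F0F0F0F0F0F0F0FULL) | ((x & 0x0F0F0F0F0F0F0F0FULL) << 4);
    x = ((x >> 8)  & 0x00FF00FF00FF00FFULL) | ((x & 0x00FF00FF00FF00FFULL) << 8);
    x = ((x >> 16) & 0x0000FFFF0000FFFFULL) | ((x & 0x0000FFFF0000FFFFULL) << 16);
    return (x >> 32) | (x << 32);
}

/* Byte offset as in the x4 variant: reverse, shift right, multiply by 8. */
uint64_t offset_two_shifts(uint64_t i, int bit_count)
{
    assert(bit_count >= 3 && bit_count <= 60);
    return 8 * (bitrev64(i) >> (64 - bit_count));
}

/* Byte offset as in the one-cycle variant: pre-shifted argument, no shifts
   after the reversal. i * delta equals i << (61 - bit_count). */
uint64_t offset_pre_shifted(uint64_t i, int bit_count)
{
    assert(bit_count >= 3 && bit_count <= 60);
    uint64_t delta = (1ULL << (64 - bit_count)) / 8;
    return bitrev64(i * delta);
}
```

For every i below count = 2^bit_count both functions return the same offset, so the loop counter can be kept in the pre-shifted form and advanced by delta.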
Main loop in assembly
    .L2891:
    {
      fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=0, disp=0
      fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=0, disp=0
    }
    {
      fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
      fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=4, abs=16, disp=0
    }
    .L2777:
    {
      loop_mode
      alc alcf=1, alct=1
      abn abnf=1, abnt=1
      ct %ctpr1 ? %NOT_LOOP_END
      bitrevd,0,sm %b[34], %b[25]
      qppackdl,1,sm %b[22], %b[23], %b[31]
      stqp,2 %r2, %b[29], %b[33]
      qppackdl,3,sm %b[10], %b[11], %b[35]
      addd,4,sm %b[32], %r4, %b[30]
      stqp,5 %r0, %b[29], %b[37]
      movad,0 area=1, ind=0, am=1, be=0, %b[1]
      movad,1 area=0, ind=0, am=1, be=0, %b[13]
      movad,2 area=1, ind=0, am=1, be=0, %b[0]
      movad,3 area=0, ind=0, am=1, be=0, %b[12]
    }
Theoretical throughput: 4 complex numbers per cycle (4/1) = 32 bytes/cycle
Speed measurements

We see a speed-up at the beginning of the graph and a slowdown at the end.
It should have been either better than the previous variant or the same. I do not know how to explain this.
Around this point I thought that no further optimization was possible. But when I looked at the algorithm diagrams, a question came up: what if we unroll by another factor of 2 in the same way? Is it possible to read from 8 different places at once without losing speed?
8. reverse_radix2_x8
We continue the "pseudo-unrolling".
Diagram of data movement in memory

C code
    void reverse_radix2_x8(int bit_count, myComplex *data_in, myComplex *data_out)
    {
        int count = 1 << bit_count;
        int shift = 64 - bit_count;
        uint64_t *data_in_000 = (uint64_t*)&data_in[0 * count/8];
        uint64_t *data_in_001 = (uint64_t*)&data_in[1 * count/8];
        uint64_t *data_in_010 = (uint64_t*)&data_in[2 * count/8];
        uint64_t *data_in_011 = (uint64_t*)&data_in[3 * count/8];
        uint64_t *data_in_100 = (uint64_t*)&data_in[4 * count/8];
        uint64_t *data_in_101 = (uint64_t*)&data_in[5 * count/8];
        uint64_t *data_in_110 = (uint64_t*)&data_in[6 * count/8];
        uint64_t *data_in_111 = (uint64_t*)&data_in[7 * count/8];
        #pragma ivdep
        #pragma unroll(1)
        #pragma prefetch
        for(int64_t i = 0; i < count/8; ++i) {
            int64_t offset = 8 * (__builtin_e2k_bitrevd(i) >> shift);
            *(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_000[i], data_in_100[i]};
            *(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_010[i], data_in_110[i]};
            *(__v2du*)((void*)data_out + offset + 2*16) = (__v2du){data_in_001[i], data_in_101[i]};
            *(__v2du*)((void*)data_out + offset + 3*16) = (__v2du){data_in_011[i], data_in_111[i]};
        }
    }
Main loop in assembly
    .L3314:
    {
      fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=3, abs=0, disp=0
      fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=3, abs=0, disp=0
    }
    {
      fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=3, abs=8, disp=0
      fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=3, abs=8, disp=0
    }
    {
      fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=3, abs=16, disp=0
      fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=3, abs=16, disp=0
    }
    {
      fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=3, abs=24, disp=0
      fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=3, abs=24, disp=0
    }
    .L3146:
    {
      loop_mode
      bitrevd,0,sm %b[32], %b[29]
      qppackdl,1,sm %b[20], %b[21], %b[33]
      stqp,2 %r2, %b[42], %b[38]
      shld,3,sm %b[34], 0x3, %b[40]
      qppackdl,4,sm %b[6], %b[7], %b[37]
      stqp,5 %r0, %b[42], %b[36]
      movad,0 area=3, ind=0, am=1, be=0, %b[1]
      movad,1 area=2, ind=0, am=1, be=0, %b[15]
      movad,2 area=3, ind=0, am=1, be=0, %b[0]
      movad,3 area=2, ind=0, am=1, be=0, %b[14]
    }
    {
      loop_mode
      alc alcf=1, alct=1
      abn abnf=1, abnt=1
      ct %ctpr1 ? %NOT_LOOP_END
      qppackdl,0,sm %b[26], %b[27], %b[36]
      addd,1,sm %b[30], 0x1, %b[28]
      stqp,2 %r4, %b[42], %b[35]
      qppackdl,3,sm %b[12], %b[13], %b[34]
      shrd,4,sm %b[31], %r6, %b[32]
      stqp,5 %r5, %b[42], %b[39]
      movad,0 area=1, ind=0, am=1, be=0, %b[7]
      movad,1 area=0, ind=0, am=1, be=0, %b[21]
      movad,2 area=1, ind=0, am=1, be=0, %b[6]
      movad,3 area=0, ind=0, am=1, be=0, %b[20]
    }
Theoretical throughput: 8 complex numbers per 2 cycles (8/2) = 32 bytes/cycle
Speed measurements

Видим замедление в начале и ускорение в конце графика.
Однако же, система справилась с чтением из 8-ми разных мест.
9. reverse_radix2_x16
А если в 16 раз?
Схема перемещения данных в памяти

Код на Си
void reverse_radix2_x16(int bit_count, myComplex *data_in, myComplex *data_out) { int count = 1 << bit_count; int shift = 64 - bit_count; uint64_t *data_in_0000 = (uint64_t*)&data_in[ 0 * count/16]; uint64_t *data_in_0001 = (uint64_t*)&data_in[ 1 * count/16]; uint64_t *data_in_0010 = (uint64_t*)&data_in[ 2 * count/16]; uint64_t *data_in_0011 = (uint64_t*)&data_in[ 3 * count/16]; uint64_t *data_in_0100 = (uint64_t*)&data_in[ 4 * count/16]; uint64_t *data_in_0101 = (uint64_t*)&data_in[ 5 * count/16]; uint64_t *data_in_0110 = (uint64_t*)&data_in[ 6 * count/16]; uint64_t *data_in_0111 = (uint64_t*)&data_in[ 7 * count/16]; uint64_t *data_in_1000 = (uint64_t*)&data_in[ 8 * count/16]; uint64_t *data_in_1001 = (uint64_t*)&data_in[ 9 * count/16]; uint64_t *data_in_1010 = (uint64_t*)&data_in[10 * count/16]; uint64_t *data_in_1011 = (uint64_t*)&data_in[11 * count/16]; uint64_t *data_in_1100 = (uint64_t*)&data_in[12 * count/16]; uint64_t *data_in_1101 = (uint64_t*)&data_in[13 * count/16]; uint64_t *data_in_1110 = (uint64_t*)&data_in[14 * count/16]; uint64_t *data_in_1111 = (uint64_t*)&data_in[15 * count/16]; #pragma ivdep #pragma unroll(1) #pragma prefetch for(int64_t i = 0; i < count/16; ++i) { int64_t offset = 8 * (__builtin_e2k_bitrevd(i) >> shift); *(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_0000[i], data_in_1000[i]}; *(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_0100[i], data_in_1100[i]}; *(__v2du*)((void*)data_out + offset + 2*16) = (__v2du){data_in_0010[i], data_in_1010[i]}; *(__v2du*)((void*)data_out + offset + 3*16) = (__v2du){data_in_0110[i], data_in_1110[i]}; *(__v2du*)((void*)data_out + offset + 4*16) = (__v2du){data_in_0001[i], data_in_1001[i]}; *(__v2du*)((void*)data_out + offset + 5*16) = (__v2du){data_in_0101[i], data_in_1101[i]}; *(__v2du*)((void*)data_out + offset + 6*16) = (__v2du){data_in_0011[i], data_in_1011[i]}; *(__v2du*)((void*)data_out + offset + 7*16) = (__v2du){data_in_0111[i], data_in_1111[i]}; } }
Основной цикл на ассемблере
.L4040: { fapb ct=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=2, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=2, abs=0, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=2, abs=4, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=2, abs=4, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=2, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=2, abs=8, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=2, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=2, abs=12, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=2, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=2, abs=16, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=2, abs=20, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=2, abs=20, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=2, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=2, abs=24, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=2, abs=28, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=2, abs=28, disp=0 } .L3765: { loop_mode qppackdl,1,sm %b[21], %b[24], %b[40] stqp,2 %r2, %b[37], %b[41] qppackdl,4,sm %b[7], %b[10], %b[42] stqp,5 %r0, %b[37], %b[38] movad,0 area=7, ind=0, am=1, be=0, %b[4] movad,1 area=6, ind=0, am=1, be=0, %b[18] movad,2 area=7, ind=0, am=1, be=0, %b[1] movad,3 area=6, ind=0, am=1, be=0, %b[15] } { loop_mode qppackdl,1,sm %b[25], %b[28], %b[41] stqp,2 %r5, %b[37], %b[40] shrd,3,sm %b[31], %r12, %b[29] qppackdl,4,sm %b[11], %b[14], %b[38] stqp,5 %r6, %b[37], %b[42] movad,0 area=5, ind=0, am=1, be=0, %b[10] movad,1 area=4, ind=0, am=1, be=0, %b[24] movad,2 area=5, ind=0, am=1, be=0, %b[7] movad,3 area=4, ind=0, am=1, be=0, %b[21] } { loop_mode qppackdl,1,sm %b[17], %b[20], %b[32] stqp,2 %r7, %b[37], %b[41] shld,3,sm %b[29], 
0x3, %b[35] qppackdl,4,sm %b[3], %b[6], %b[31] stqp,5 %r9, %b[37], %b[38] movad,0 area=1, ind=0, am=1, be=0, %b[14] movad,1 area=0, ind=0, am=1, be=0, %b[28] movad,2 area=1, ind=0, am=1, be=0, %b[11] movad,3 area=0, ind=0, am=1, be=0, %b[25] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END qppackdl,0,sm %b[27], %b[30], %b[39] addd,1,sm %b[2], 0x1, %b[0] stqp,2 %r10, %b[37], %b[34] qppackdl,3,sm %b[13], %b[16], %b[36] bitrevd,4,sm %b[2], %b[29] stqp,5 %r11, %b[37], %b[33] movad,0 area=3, ind=0, am=1, be=0, %b[6] movad,1 area=2, ind=0, am=1, be=0, %b[20] movad,2 area=3, ind=0, am=1, be=0, %b[3] movad,3 area=2, ind=0, am=1, be=0, %b[17] }
Theoretical throughput: 16 complex numbers per 4 cycles (16/4 × 8 bytes) = 32 bytes/cycle
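The "theoretical throughput" figures in this article all follow the same arithmetic: complex numbers per iteration, times 8 bytes per number (two float values), divided by the cycles per iteration. A minimal sketch of that calculation, where the helper name peak_bytes_per_cycle is hypothetical and not part of the original code:

```c
#include <assert.h>

/* One complex sample is two 32-bit floats (re, im). */
enum { BYTES_PER_COMPLEX = 8 };

/* Peak bytes moved per cycle when one loop iteration handles
   `complex_per_iter` samples and takes `cycles_per_iter` cycles.
   Hypothetical helper, shown only to spell out the arithmetic. */
static double peak_bytes_per_cycle(int complex_per_iter, int cycles_per_iter)
{
    assert(cycles_per_iter > 0);
    return (double)complex_per_iter * BYTES_PER_COMPLEX / cycles_per_iter;
}
```

This is only the upper bound implied by the loop schedule; the measured speed also depends on the memory subsystem.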
Speed measurements

We see a strong speedup.
10. reverse_radix2_x32
By a factor of 32?
C code
void reverse_radix2_x32(int bit_count, myComplex *data_in, myComplex *data_out)
{
    int count = 1 << bit_count;
    int shift = 64 - bit_count;
    uint64_t *data_in_00000 = (uint64_t*)&data_in[ 0 * count/32];
    uint64_t *data_in_00001 = (uint64_t*)&data_in[ 1 * count/32];
    uint64_t *data_in_00010 = (uint64_t*)&data_in[ 2 * count/32];
    uint64_t *data_in_00011 = (uint64_t*)&data_in[ 3 * count/32];
    uint64_t *data_in_00100 = (uint64_t*)&data_in[ 4 * count/32];
    uint64_t *data_in_00101 = (uint64_t*)&data_in[ 5 * count/32];
    uint64_t *data_in_00110 = (uint64_t*)&data_in[ 6 * count/32];
    uint64_t *data_in_00111 = (uint64_t*)&data_in[ 7 * count/32];
    uint64_t *data_in_01000 = (uint64_t*)&data_in[ 8 * count/32];
    uint64_t *data_in_01001 = (uint64_t*)&data_in[ 9 * count/32];
    uint64_t *data_in_01010 = (uint64_t*)&data_in[10 * count/32];
    uint64_t *data_in_01011 = (uint64_t*)&data_in[11 * count/32];
    uint64_t *data_in_01100 = (uint64_t*)&data_in[12 * count/32];
    uint64_t *data_in_01101 = (uint64_t*)&data_in[13 * count/32];
    uint64_t *data_in_01110 = (uint64_t*)&data_in[14 * count/32];
    uint64_t *data_in_01111 = (uint64_t*)&data_in[15 * count/32];
    uint64_t *data_in_10000 = (uint64_t*)&data_in[16 * count/32];
    uint64_t *data_in_10001 = (uint64_t*)&data_in[17 * count/32];
    uint64_t *data_in_10010 = (uint64_t*)&data_in[18 * count/32];
    uint64_t *data_in_10011 = (uint64_t*)&data_in[19 * count/32];
    uint64_t *data_in_10100 = (uint64_t*)&data_in[20 * count/32];
    uint64_t *data_in_10101 = (uint64_t*)&data_in[21 * count/32];
    uint64_t *data_in_10110 = (uint64_t*)&data_in[22 * count/32];
    uint64_t *data_in_10111 = (uint64_t*)&data_in[23 * count/32];
    uint64_t *data_in_11000 = (uint64_t*)&data_in[24 * count/32];
    uint64_t *data_in_11001 = (uint64_t*)&data_in[25 * count/32];
    uint64_t *data_in_11010 = (uint64_t*)&data_in[26 * count/32];
    uint64_t *data_in_11011 = (uint64_t*)&data_in[27 * count/32];
    uint64_t *data_in_11100 = (uint64_t*)&data_in[28 * count/32];
    uint64_t *data_in_11101 = (uint64_t*)&data_in[29 * count/32];
    uint64_t *data_in_11110 = (uint64_t*)&data_in[30 * count/32];
    uint64_t *data_in_11111 = (uint64_t*)&data_in[31 * count/32];

    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < count/32; ++i)
    {
        int64_t offset = 8 * (__builtin_e2k_bitrevd(i) >> shift);
        *(__v2du*)((void*)data_out + offset +  0*16) = (__v2du){data_in_00000[i], data_in_10000[i]};
        *(__v2du*)((void*)data_out + offset +  1*16) = (__v2du){data_in_01000[i], data_in_11000[i]};
        *(__v2du*)((void*)data_out + offset +  2*16) = (__v2du){data_in_00100[i], data_in_10100[i]};
        *(__v2du*)((void*)data_out + offset +  3*16) = (__v2du){data_in_01100[i], data_in_11100[i]};
        *(__v2du*)((void*)data_out + offset +  4*16) = (__v2du){data_in_00010[i], data_in_10010[i]};
        *(__v2du*)((void*)data_out + offset +  5*16) = (__v2du){data_in_01010[i], data_in_11010[i]};
        *(__v2du*)((void*)data_out + offset +  6*16) = (__v2du){data_in_00110[i], data_in_10110[i]};
        *(__v2du*)((void*)data_out + offset +  7*16) = (__v2du){data_in_01110[i], data_in_11110[i]};
        *(__v2du*)((void*)data_out + offset +  8*16) = (__v2du){data_in_00001[i], data_in_10001[i]};
        *(__v2du*)((void*)data_out + offset +  9*16) = (__v2du){data_in_01001[i], data_in_11001[i]};
        *(__v2du*)((void*)data_out + offset + 10*16) = (__v2du){data_in_00101[i], data_in_10101[i]};
        *(__v2du*)((void*)data_out + offset + 11*16) = (__v2du){data_in_01101[i], data_in_11101[i]};
        *(__v2du*)((void*)data_out + offset + 12*16) = (__v2du){data_in_00011[i], data_in_10011[i]};
        *(__v2du*)((void*)data_out + offset + 13*16) = (__v2du){data_in_01011[i], data_in_11011[i]};
        *(__v2du*)((void*)data_out + offset + 14*16) = (__v2du){data_in_00111[i], data_in_10111[i]};
        *(__v2du*)((void*)data_out + offset + 15*16) = (__v2du){data_in_01111[i], data_in_11111[i]};
    }
}
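The order of the 32 input streams in the stores above is not arbitrary: output slot p of each 32-element group is fed by the stream whose number is the 5-bit reversal of p. The following portable sketch reproduces that order without the Elbrus builtin; the helper names reverse_low_bits and stream_order are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the low `bits` bits of x (portable stand-in for bitrevd + shift). */
static uint32_t reverse_low_bits(uint32_t x, int bits)
{
    uint32_t r = 0;
    for (int b = 0; b < bits; ++b) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* Fill order[p] with the number of the input stream whose element lands
   in output slot p of each group of 2^log2_streams adjacent elements. */
static void stream_order(int log2_streams, uint32_t *order)
{
    assert(log2_streams >= 0 && log2_streams < 32);
    uint32_t n = 1u << log2_streams;
    for (uint32_t p = 0; p < n; ++p)
        order[p] = reverse_low_bits(p, log2_streams);
}
```

With log2_streams = 5 the first slots come out as streams 0, 16, 8, 24, which matches the pairs {data_in_00000, data_in_10000} and {data_in_01000, data_in_11000} in the listing.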
Main loop in assembly
.L5406: { fapb ct=0, dcd=0, fmt=4, mrng=8, d=17, incr=0, ind=0, asz=1, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=16, incr=0, ind=0, asz=1, abs=0, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=15, incr=0, ind=0, asz=1, abs=2, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=14, incr=0, ind=0, asz=1, abs=2, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=13, incr=0, ind=0, asz=1, abs=4, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=12, incr=0, ind=0, asz=1, abs=4, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=11, incr=0, ind=0, asz=1, abs=6, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=10, incr=0, ind=0, asz=1, abs=6, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=9, incr=0, ind=0, asz=1, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=8, incr=0, ind=0, asz=1, abs=8, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=7, incr=0, ind=0, asz=1, abs=10, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=6, incr=0, ind=0, asz=1, abs=10, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=5, incr=0, ind=0, asz=1, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=4, incr=0, ind=0, asz=1, abs=12, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=3, incr=0, ind=0, asz=1, abs=14, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=2, incr=0, ind=0, asz=1, abs=14, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=1, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=1, abs=16, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=1, abs=18, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=1, abs=18, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=1, abs=20, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=1, abs=20, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=1, abs=22, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=1, abs=22, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=1, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, 
ind=7, asz=1, abs=24, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=1, abs=26, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=1, abs=26, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=1, abs=28, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=1, abs=28, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=1, abs=30, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=1, abs=30, disp=0 } .L4875: { loop_mode qppackdl,1,sm %b[46], %b[47], %b[63] stqp,2 %r2, %b[61], %b[32] qppackdl,4,sm %b[38], %b[39], %b[62] stqp,5 %r4, %b[61], %b[51] movad,0 area=15, ind=0, am=1, be=0, %b[5] movad,1 area=14, ind=0, am=1, be=0, %b[13] movad,2 area=15, ind=0, am=1, be=0, %b[1] movad,3 area=14, ind=0, am=1, be=0, %b[9] } { loop_mode qppackdl,1,sm %b[20], %b[26], %b[65] stqp,2 %r5, %b[61], %b[63] qppackdl,4,sm %b[8], %b[14], %b[64] stqp,5 %r6, %b[61], %b[62] movad,0 area=13, ind=0, am=1, be=0, %b[25] movad,1 area=12, ind=0, am=1, be=0, %b[32] movad,2 area=13, ind=0, am=1, be=0, %b[17] movad,3 area=12, ind=0, am=1, be=0, %b[33] } { loop_mode qppackdl,1,sm %b[59], %b[60], %b[38] stqp,2 %r7, %b[61], %b[65] qppackdl,4,sm %b[55], %b[56], %b[39] stqp,5 %r9, %b[61], %b[64] movad,0 area=11, ind=0, am=1, be=0, %b[14] movad,1 area=10, ind=0, am=1, be=0, %b[26] movad,2 area=11, ind=0, am=1, be=0, %b[8] movad,3 area=10, ind=0, am=1, be=0, %b[20] } { loop_mode qppackdl,1,sm %b[57], %b[58], %b[47] stqp,2 %r10, %b[61], %b[40] qppackdl,4,sm %b[54], %b[53], %b[46] stqp,5 %r11, %b[61], %b[41] movad,0 area=9, ind=0, am=1, be=0, %b[51] movad,1 area=8, ind=0, am=1, be=0, %b[56] movad,2 area=9, ind=0, am=1, be=0, %b[52] movad,3 area=8, ind=0, am=1, be=0, %b[55] } { loop_mode addd,0,sm %b[4], 0x1, %b[2] qppackdl,1,sm %b[22], %b[28], %b[40] stqp,2 %r12, %b[61], %b[49] bitrevd,3,sm %b[4], %b[59] qppackdl,4,sm %b[10], %b[16], %b[41] stqp,5 %r13, %b[61], %b[48] movad,0 area=7, ind=0, am=1, be=0, %b[54] movad,1 
area=6, ind=0, am=1, be=0, %b[58] movad,2 area=7, ind=0, am=1, be=0, %b[53] movad,3 area=6, ind=0, am=1, be=0, %b[57] } { loop_mode qppackdl,1,sm %b[35], %b[34], %b[48] stqp,2 %r14, %b[61], %b[42] shrd,3,sm %b[59], %r0, %b[49] qppackdl,4,sm %b[19], %b[27], %b[28] stqp,5 %r15, %b[61], %b[43] movad,0 area=5, ind=0, am=1, be=0, %b[10] movad,1 area=4, ind=0, am=1, be=0, %b[22] movad,2 area=5, ind=0, am=1, be=0, %b[4] movad,3 area=4, ind=0, am=1, be=0, %b[16] } { loop_mode qppackdl,1,sm %b[9], %b[13], %b[19] stqp,2 %r16, %b[61], %b[50] shld,3,sm %b[49], 0x3, %b[59] qppackdl,4,sm %b[1], %b[5], %b[27] stqp,5 %r17, %b[61], %b[30] movad,0 area=3, ind=0, am=1, be=0, %b[35] movad,1 area=2, ind=0, am=1, be=0, %b[43] movad,2 area=3, ind=0, am=1, be=0, %b[34] movad,3 area=2, ind=0, am=1, be=0, %b[42] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END qppackdl,0,sm %b[11], %b[15], %b[30] stqp,2 %r18, %b[61], %b[23] qppackdl,3,sm %b[3], %b[7], %b[49] stqp,5 %r19, %b[61], %b[31] movad,0 area=1, ind=0, am=1, be=0, %b[5] movad,1 area=0, ind=0, am=1, be=0, %b[13] movad,2 area=1, ind=0, am=1, be=0, %b[1] movad,3 area=0, ind=0, am=1, be=0, %b[9] }
Theoretical throughput: 32 complex numbers per 8 cycles (32/8 × 8 bytes) = 32 bytes/cycle
Speed measurements

We see a slowdown at the start of the graph and a speedup at the end.
An attempt at a 64x "pseudo-unroll" produces far less efficient code. The APB can read from at most 32 streams, so to read from 64 streams the compiler inserts ordinary ldd loads. As a result, the speed drops sharply.
Can it be made any faster?
The only thing that comes to mind is to read in 128-bit chunks instead of 64-bit ones.
11. reverse_radix2_x32x2
Let's try to increase the read throughput of reverse_radix2_x32.
In essence, this variant is a true 2x unroll.
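To make the idea concrete, here is a portable model reduced to two input streams instead of 32. It is a sketch under stated assumptions, not the author's code: the type cpx stands in for myComplex, bitrev64 is a software replacement for __builtin_e2k_bitrevd, and each "128-bit read" is modelled as fetching two neighbouring complex numbers from a stream.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { float re, im; } cpx;   /* hypothetical stand-in for myComplex */

/* Software 64-bit bit reversal (portable stand-in for bitrevd). */
static uint64_t bitrev64(uint64_t x)
{
    uint64_t r = 0;
    for (int b = 0; b < 64; ++b) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* Two streams, two elements (128 bits) read from each per iteration. */
static void reverse_radix2_x2x2_model(int bit_count, const cpx *in, cpx *out)
{
    assert(bit_count >= 2 && bit_count <= 62);
    uint64_t count = 1ull << bit_count;
    int shift = 64 - bit_count;
    const cpx *in0 = in;              /* first half of the input  */
    const cpx *in1 = in + count / 2;  /* second half of the input */

    for (uint64_t i = 0; i < count / 2 / 2; ++i) {
        /* One 128-bit read per stream: elements 2i and 2i+1. */
        cpx a0 = in0[2 * i], a1 = in0[2 * i + 1];
        cpx b0 = in1[2 * i], b1 = in1[2 * i + 1];

        /* Two independent output positions, one per element of the read. */
        uint64_t idx0 = bitrev64(2 * i + 0) >> shift;
        uint64_t idx1 = bitrev64(2 * i + 1) >> shift;

        /* Low halves go together, high halves go together. */
        out[idx0] = a0;  out[idx0 + 1] = b0;
        out[idx1] = a1;  out[idx1 + 1] = b1;
    }
}
```

Each iteration therefore halves the number of reads per stream, at the cost of computing two output offsets and splitting every 128-bit value into its low and high halves.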
C code
void reverse_radix2_x32x2(int bit_count, myComplex *data_in, myComplex *data_out) { int count = 1 << bit_count; int shift = 64 - bit_count; __v2di *data_in_00000 = (__v2di*)&data_in[ 0 * count/32]; __v2di *data_in_00001 = (__v2di*)&data_in[ 1 * count/32]; __v2di *data_in_00010 = (__v2di*)&data_in[ 2 * count/32]; __v2di *data_in_00011 = (__v2di*)&data_in[ 3 * count/32]; __v2di *data_in_00100 = (__v2di*)&data_in[ 4 * count/32]; __v2di *data_in_00101 = (__v2di*)&data_in[ 5 * count/32]; __v2di *data_in_00110 = (__v2di*)&data_in[ 6 * count/32]; __v2di *data_in_00111 = (__v2di*)&data_in[ 7 * count/32]; __v2di *data_in_01000 = (__v2di*)&data_in[ 8 * count/32]; __v2di *data_in_01001 = (__v2di*)&data_in[ 9 * count/32]; __v2di *data_in_01010 = (__v2di*)&data_in[10 * count/32]; __v2di *data_in_01011 = (__v2di*)&data_in[11 * count/32]; __v2di *data_in_01100 = (__v2di*)&data_in[12 * count/32]; __v2di *data_in_01101 = (__v2di*)&data_in[13 * count/32]; __v2di *data_in_01110 = (__v2di*)&data_in[14 * count/32]; __v2di *data_in_01111 = (__v2di*)&data_in[15 * count/32]; __v2di *data_in_10000 = (__v2di*)&data_in[16 * count/32]; __v2di *data_in_10001 = (__v2di*)&data_in[17 * count/32]; __v2di *data_in_10010 = (__v2di*)&data_in[18 * count/32]; __v2di *data_in_10011 = (__v2di*)&data_in[19 * count/32]; __v2di *data_in_10100 = (__v2di*)&data_in[20 * count/32]; __v2di *data_in_10101 = (__v2di*)&data_in[21 * count/32]; __v2di *data_in_10110 = (__v2di*)&data_in[22 * count/32]; __v2di *data_in_10111 = (__v2di*)&data_in[23 * count/32]; __v2di *data_in_11000 = (__v2di*)&data_in[24 * count/32]; __v2di *data_in_11001 = (__v2di*)&data_in[25 * count/32]; __v2di *data_in_11010 = (__v2di*)&data_in[26 * count/32]; __v2di *data_in_11011 = (__v2di*)&data_in[27 * count/32]; __v2di *data_in_11100 = (__v2di*)&data_in[28 * count/32]; __v2di *data_in_11101 = (__v2di*)&data_in[29 * count/32]; __v2di *data_in_11110 = (__v2di*)&data_in[30 * count/32]; __v2di *data_in_11111 = (__v2di*)&data_in[31 * count/32]; 
#pragma ivdep #pragma unroll(1) #pragma prefetch for(int64_t i = 0; i < count/32/2; ++i) { int64_t offset0 = 8 * (__builtin_e2k_bitrevd(2*i + 0) >> shift); __v2di mask0 = {0x0706050403020100, 0x0706050403020100}; *(__v2du*)((void*)data_out + offset0 + 0*16) = __builtin_e2k_qpshufb(data_in_10000[i], data_in_00000[i], mask0); *(__v2du*)((void*)data_out + offset0 + 1*16) = __builtin_e2k_qpshufb(data_in_11000[i], data_in_01000[i], mask0); *(__v2du*)((void*)data_out + offset0 + 2*16) = __builtin_e2k_qpshufb(data_in_10100[i], data_in_00100[i], mask0); *(__v2du*)((void*)data_out + offset0 + 3*16) = __builtin_e2k_qpshufb(data_in_11100[i], data_in_01100[i], mask0); *(__v2du*)((void*)data_out + offset0 + 4*16) = __builtin_e2k_qpshufb(data_in_10010[i], data_in_00010[i], mask0); *(__v2du*)((void*)data_out + offset0 + 5*16) = __builtin_e2k_qpshufb(data_in_11010[i], data_in_01010[i], mask0); *(__v2du*)((void*)data_out + offset0 + 6*16) = __builtin_e2k_qpshufb(data_in_10110[i], data_in_00110[i], mask0); *(__v2du*)((void*)data_out + offset0 + 7*16) = __builtin_e2k_qpshufb(data_in_11110[i], data_in_01110[i], mask0); *(__v2du*)((void*)data_out + offset0 + 8*16) = __builtin_e2k_qpshufb(data_in_10001[i], data_in_00001[i], mask0); *(__v2du*)((void*)data_out + offset0 + 9*16) = __builtin_e2k_qpshufb(data_in_11001[i], data_in_01001[i], mask0); *(__v2du*)((void*)data_out + offset0 + 10*16) = __builtin_e2k_qpshufb(data_in_10101[i], data_in_00101[i], mask0); *(__v2du*)((void*)data_out + offset0 + 11*16) = __builtin_e2k_qpshufb(data_in_11101[i], data_in_01101[i], mask0); *(__v2du*)((void*)data_out + offset0 + 12*16) = __builtin_e2k_qpshufb(data_in_10011[i], data_in_00011[i], mask0); *(__v2du*)((void*)data_out + offset0 + 13*16) = __builtin_e2k_qpshufb(data_in_11011[i], data_in_01011[i], mask0); *(__v2du*)((void*)data_out + offset0 + 14*16) = __builtin_e2k_qpshufb(data_in_10111[i], data_in_00111[i], mask0); *(__v2du*)((void*)data_out + offset0 + 15*16) = 
__builtin_e2k_qpshufb(data_in_11111[i], data_in_01111[i], mask0); int64_t offset1 = 8 * (__builtin_e2k_bitrevd(2*i + 1) >> shift); __v2di mask1 = {0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}; *(__v2du*)((void*)data_out + offset1 + 0*16) = __builtin_e2k_qpshufb(data_in_10000[i], data_in_00000[i], mask1); *(__v2du*)((void*)data_out + offset1 + 1*16) = __builtin_e2k_qpshufb(data_in_11000[i], data_in_01000[i], mask1); *(__v2du*)((void*)data_out + offset1 + 2*16) = __builtin_e2k_qpshufb(data_in_10100[i], data_in_00100[i], mask1); *(__v2du*)((void*)data_out + offset1 + 3*16) = __builtin_e2k_qpshufb(data_in_11100[i], data_in_01100[i], mask1); *(__v2du*)((void*)data_out + offset1 + 4*16) = __builtin_e2k_qpshufb(data_in_10010[i], data_in_00010[i], mask1); *(__v2du*)((void*)data_out + offset1 + 5*16) = __builtin_e2k_qpshufb(data_in_11010[i], data_in_01010[i], mask1); *(__v2du*)((void*)data_out + offset1 + 6*16) = __builtin_e2k_qpshufb(data_in_10110[i], data_in_00110[i], mask1); *(__v2du*)((void*)data_out + offset1 + 7*16) = __builtin_e2k_qpshufb(data_in_11110[i], data_in_01110[i], mask1); *(__v2du*)((void*)data_out + offset1 + 8*16) = __builtin_e2k_qpshufb(data_in_10001[i], data_in_00001[i], mask1); *(__v2du*)((void*)data_out + offset1 + 9*16) = __builtin_e2k_qpshufb(data_in_11001[i], data_in_01001[i], mask1); *(__v2du*)((void*)data_out + offset1 + 10*16) = __builtin_e2k_qpshufb(data_in_10101[i], data_in_00101[i], mask1); *(__v2du*)((void*)data_out + offset1 + 11*16) = __builtin_e2k_qpshufb(data_in_11101[i], data_in_01101[i], mask1); *(__v2du*)((void*)data_out + offset1 + 12*16) = __builtin_e2k_qpshufb(data_in_10011[i], data_in_00011[i], mask1); *(__v2du*)((void*)data_out + offset1 + 13*16) = __builtin_e2k_qpshufb(data_in_11011[i], data_in_01011[i], mask1); *(__v2du*)((void*)data_out + offset1 + 14*16) = __builtin_e2k_qpshufb(data_in_10111[i], data_in_00111[i], mask1); *(__v2du*)((void*)data_out + offset1 + 15*16) = __builtin_e2k_qpshufb(data_in_11111[i], data_in_01111[i], 
mask1); } }
Main loop in assembly
.L7238: { fapb ct=0, dcd=0, fmt=5, mrng=16, d=17, incr=0, ind=0, asz=1, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=16, incr=0, ind=0, asz=1, abs=0, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=15, incr=0, ind=0, asz=1, abs=2, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=14, incr=0, ind=0, asz=1, abs=2, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=13, incr=0, ind=0, asz=1, abs=4, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=12, incr=0, ind=0, asz=1, abs=4, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=11, incr=0, ind=0, asz=1, abs=6, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=10, incr=0, ind=0, asz=1, abs=6, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=9, incr=0, ind=0, asz=1, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=8, incr=0, ind=0, asz=1, abs=8, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=7, incr=0, ind=0, asz=1, abs=10, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=6, incr=0, ind=0, asz=1, abs=10, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=5, incr=0, ind=0, asz=1, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=4, incr=0, ind=0, asz=1, abs=12, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=3, incr=0, ind=0, asz=1, abs=14, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=2, incr=0, ind=0, asz=1, abs=14, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=1, incr=0, ind=0, asz=1, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=15, asz=1, abs=16, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=14, asz=1, abs=18, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=13, asz=1, abs=18, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=12, asz=1, abs=20, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=11, asz=1, abs=20, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=10, asz=1, abs=22, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=9, asz=1, abs=22, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=8, asz=1, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=5, 
mrng=16, d=0, incr=0, ind=7, asz=1, abs=24, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=1, abs=26, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=1, abs=26, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=1, abs=28, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=1, abs=28, disp=0 } { fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=1, abs=30, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=1, asz=1, abs=30, disp=0 } .L5673: { loop_mode qpshufb,1,sm %b[42], %b[41], %r0, %b[40] stqp,2 %r2, %b[4], %b[48] qpshufb,3,sm %b[40], %b[39], %r0, %b[39] stqp,5 %r4, %b[4], %b[47] } { loop_mode qpshufb,1,sm %b[46], %b[45], %r1, %b[40] stqp,2 %r2, %g16, %b[40] qpshufb,4,sm %b[44], %b[43], %r1, %b[39] stqp,5 %r4, %g16, %b[39] } { loop_mode qpshufb,1,sm %b[46], %b[45], %r0, %b[40] stqp,2 %r6, %b[4], %b[40] qpshufb,4,sm %b[44], %b[43], %r0, %b[39] stqp,5 %r7, %b[4], %b[39] } { loop_mode qpshufb,1,sm %b[38], %b[37], %r0, %b[38] stqp,2 %r6, %g16, %b[40] qpshufb,4,sm %b[38], %b[37], %r1, %b[37] stqp,5 %r7, %g16, %b[39] } { loop_mode qpshufb,1,sm %b[36], %b[35], %r0, %b[36] stqp,2 %r9, %g16, %b[38] qpshufb,4,sm %b[36], %b[35], %r1, %b[35] stqp,5 %r9, %b[4], %b[37] movaqp,0 area=15, ind=0, am=1, be=0, %b[5] movaqp,1 area=14, ind=0, am=1, be=0, %b[9] movaqp,2 area=15, ind=0, am=1, be=0, %b[1] movaqp,3 area=14, ind=0, am=1, be=0, %b[6] } { loop_mode qpshufb,1,sm %b[34], %b[33], %r0, %b[34] stqp,2 %r10, %g16, %b[36] qpshufb,4,sm %b[34], %b[33], %r1, %b[33] stqp,5 %r10, %b[4], %b[35] movaqp,0 area=13, ind=0, am=1, be=0, %b[13] movaqp,1 area=12, ind=0, am=1, be=0, %b[17] movaqp,2 area=13, ind=0, am=1, be=0, %b[10] movaqp,3 area=12, ind=0, am=1, be=0, %b[14] } { loop_mode qpshufb,1,sm %b[32], %b[30], %r0, %b[32] stqp,2 %r11, %g16, %b[34] qpshufb,4,sm %b[32], %b[30], %r1, %b[30] stqp,5 %r11, %b[4], %b[33] movaqp,0 area=11, ind=0, am=1, be=0, %b[21] movaqp,1 area=10, ind=0, am=1, be=0, %b[25] 
movaqp,2 area=11, ind=0, am=1, be=0, %b[18] movaqp,3 area=10, ind=0, am=1, be=0, %b[22] } { loop_mode qpshufb,1,sm %g17, %g18, %r0, %b[32] stqp,2 %r12, %g16, %b[32] qpshufb,4,sm %g17, %g18, %r1, %b[30] stqp,5 %r12, %b[4], %b[30] movaqp,0 area=9, ind=0, am=1, be=0, %b[29] movaqp,1 area=8, ind=0, am=1, be=0, %g17 movaqp,2 area=9, ind=0, am=1, be=0, %b[26] movaqp,3 area=8, ind=0, am=1, be=0, %g18 } { loop_mode qpshufb,1,sm %b[31], %b[28], %r0, %b[34] stqp,2 %r13, %g16, %b[32] qpshufb,4,sm %b[31], %b[28], %r1, %b[33] stqp,5 %r13, %b[4], %b[30] movaqp,0 area=7, ind=0, am=1, be=0, %b[30] movaqp,1 area=6, ind=0, am=1, be=0, %b[32] movaqp,2 area=7, ind=0, am=1, be=0, %b[28] movaqp,3 area=6, ind=0, am=1, be=0, %b[31] } { loop_mode qpshufb,1,sm %b[27], %b[24], %r0, %b[38] stqp,2 %r14, %g16, %b[34] qpshufb,4,sm %b[27], %b[24], %r1, %b[37] stqp,5 %r14, %b[4], %b[33] movaqp,0 area=5, ind=0, am=1, be=0, %b[34] movaqp,1 area=4, ind=0, am=1, be=0, %b[36] movaqp,2 area=5, ind=0, am=1, be=0, %b[33] movaqp,3 area=4, ind=0, am=1, be=0, %b[35] } { loop_mode qpshufb,1,sm %b[23], %b[20], %r0, %b[42] stqp,2 %r15, %g16, %b[38] qpshufb,4,sm %b[23], %b[20], %r1, %b[41] stqp,5 %r15, %b[4], %b[37] movaqp,0 area=1, ind=0, am=1, be=0, %b[38] movaqp,1 area=0, ind=0, am=1, be=0, %b[40] movaqp,2 area=1, ind=0, am=1, be=0, %b[37] movaqp,3 area=0, ind=0, am=1, be=0, %b[39] } { loop_mode addd,0,sm 0x2, %b[2], %b[0] qpshufb,1,sm %b[19], %b[16], %r0, %b[47] stqp,2 %r16, %g16, %b[42] addd,3,sm %b[2], 0x1, %b[45] qpshufb,4,sm %b[19], %b[16], %r1, %b[46] stqp,5 %r16, %b[4], %b[41] movaqp,0 area=3, ind=0, am=1, be=0, %b[42] movaqp,1 area=2, ind=0, am=1, be=0, %b[44] movaqp,2 area=3, ind=0, am=1, be=0, %b[41] movaqp,3 area=2, ind=0, am=1, be=0, %b[43] } { loop_mode bitrevd,0,sm %b[2], %b[46] qpshufb,1,sm %b[15], %b[12], %r0, %b[49] stqp,2 %r17, %g16, %b[47] bitrevd,3,sm %b[45], %b[45] qpshufb,4,sm %b[15], %b[12], %r1, %b[47] stqp,5 %r17, %b[4], %b[46] } { loop_mode shrd,0,sm %b[46], %r5, %b[45] qpshufb,1,sm 
%b[11], %b[8], %r0, %b[48] stqp,2 %r18, %g16, %b[49] shrd,3,sm %b[45], %r5, %b[46] qpshufb,4,sm %b[11], %b[8], %r1, %b[47] stqp,5 %r18, %b[4], %b[47] } { loop_mode qpshufb,1,sm %b[7], %b[3], %r0, %b[47] stqp,2 %r19, %g16, %b[48] shld,3,sm %b[46], 0x3, %b[2] qpshufb,4,sm %b[7], %b[3], %r1, %b[46] stqp,5 %r19, %b[4], %b[47] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END qpshufb,0,sm %b[40], %b[39], %r1, %b[46] stqp,2 %r20, %g16, %b[47] qpshufb,3,sm %b[38], %b[37], %r1, %b[45] shld,4,sm %b[45], 0x3, %g16 stqp,5 %r20, %b[4], %b[46] }
Theoretical throughput: 64 complex numbers per 16 cycles (64/16 × 8 bytes) = 32 bytes/cycle
Speed measurements

We see a slowdown in the middle of the graph.
Summary for reverse_radix2


The reverse_radix2_x32 variant can be considered the winner.
We will use it when implementing the Radix-2 FFT.
reverse_radix4
1. reverse_radix4_etalon
A reference variant used to check correctness.
Here reverseNumber is computed with a loop.
int reverseNumber_radix4(int number, int bit_count)
{
    int answer = 0;
    for(int i = 0; i < bit_count/2; ++i)
    {
        answer <<= 2;
        answer |= number & 3;
        number >>= 2;
    }
    return answer;
}

void reverse_radix4_etalon(int bit_count, myComplex *data_in, myComplex *data_out)
{
    int count = 1 << bit_count;
    for(int64_t i = 0; i < count; ++i)
    {
        int index = reverseNumber_radix4(i, bit_count);
        data_out[index] = data_in[i];
    }
}
2. reverse_radix4
The processor has no ready-made instruction that performs the reverseNumber_radix4 operation.
So we execute the bitrevd instruction and then swap adjacent bits.
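Why this works: a bit reversal puts the base-4 digits in reverse order, but it also flips the two bits inside every digit, and swapping each pair of adjacent bits undoes that second effect. The sketch below checks the trick against a plain loop in portable C. The function names are hypothetical, bitrev64 is a software stand-in for __builtin_e2k_bitrevd, and bit_count is assumed to be even.

```c
#include <assert.h>
#include <stdint.h>

/* Software 64-bit bit reversal (portable stand-in for bitrevd). */
static uint64_t bitrev64(uint64_t x)
{
    uint64_t r = 0;
    for (int b = 0; b < 64; ++b) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* Reference: reverse the base-4 digits one by one. */
static uint64_t digit_reverse4_loop(uint64_t number, int bit_count)
{
    uint64_t answer = 0;
    for (int i = 0; i < bit_count / 2; ++i) {
        answer = (answer << 2) | (number & 3u);
        number >>= 2;
    }
    return answer;
}

/* Same result via bit reversal followed by a swap of adjacent bits. */
static uint64_t digit_reverse4_swap(uint64_t number, int bit_count)
{
    assert(bit_count >= 2 && bit_count <= 64 && bit_count % 2 == 0);
    uint64_t rev = bitrev64(number) >> (64 - bit_count);
    return ((rev << 1) & 0xAAAAAAAAAAAAAAAAull)   /* even bits move up   */
         | ((rev >> 1) & 0x5555555555555555ull);  /* odd bits move down  */
}
```

For example, with bit_count = 4 the number 1 (digits 0,1) becomes 4 (digits 1,0) in both versions.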
Diagram of data movement in memory

C code
void reverse_radix4(int bit_count, myComplex *data_in, myComplex *data_out)
{
    int count = 1 << bit_count;
    int shift = 64 - bit_count;

    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < count; ++i)
    {
        uint64_t rev = __builtin_e2k_bitrevd(i) >> shift;
        int64_t index = ((rev<<1) & 0xAAAAAAAAAAAAAAAA) | ((rev>>1) & 0x5555555555555555);
        data_out[index] = data_in[i];
    }
}
Main loop in assembly
.L1601: { fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=5, abs=0, disp=0 } .L1398: { loop_mode shrd,2,sm %b[16], %r0, %b[1] shld,4,sm %b[17], 0x3, %b[18] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END bitrevd,0,sm %b[2], %b[14] shr_andd,1,sm %b[1], 0x1, %r5, %b[5] addd,2,sm %b[2], 0x1, %b[0] ord,3,sm %b[13], %b[9], %b[15] shl_andd,4,sm %b[3], 0x1, %r4, %b[11] std,5 %r2, %b[18], %b[12] movad,1 area=0, ind=0, am=1, be=0, %b[4] }
Theoretical throughput: 1 complex number per 2 cycles (1/2 × 8 bytes) = 4 bytes/cycle
Speed measurements

Note that the loop body can fit into a single cycle if the operations are rearranged slightly.
3. reverse_radix4_oneTickVersion
Let's rewrite the code so that it fits into one cycle by removing the shrd and shld instructions.
Diagram of data movement in memory

C code
void reverse_radix4_oneTickVersion(int bit_count, myComplex *data_in, myComplex *data_out)
{
    int count = 1 << bit_count;
    int shift = 64 - bit_count;

    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < count; ++i)
    {
        uint64_t rev = __builtin_e2k_bitrevd(i);
        int64_t offset = ((rev>>(shift-3-1)) & 0x5555555555555555) | ((rev>>(shift-3+1)) & 0xAAAAAAAAAAAAAAAA);
        *(myComplex*)((void*)data_out + offset) = data_in[i];
    }
}
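The rewritten expression folds three shifts into one per branch: the right shift by shift, the swap shift by one, and the multiplication by 8 (a left shift by 3) that converts an index into a byte offset. Because the total shift changes by an odd amount, the two masks trade places. A portable check of this equivalence, assuming hypothetical helper names and a software bitrev64 in place of the builtin:

```c
#include <assert.h>
#include <stdint.h>

/* Software 64-bit bit reversal (portable stand-in for bitrevd). */
static uint64_t bitrev64(uint64_t x)
{
    uint64_t r = 0;
    for (int b = 0; b < 64; ++b) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* Straightforward form: shift down, swap adjacent bits, scale by 8 bytes. */
static uint64_t offset_two_step(uint64_t i, int bit_count)
{
    int shift = 64 - bit_count;
    uint64_t rev = bitrev64(i) >> shift;
    uint64_t index = ((rev << 1) & 0xAAAAAAAAAAAAAAAAull)
                   | ((rev >> 1) & 0x5555555555555555ull);
    return index * 8;
}

/* Folded form: one shift and one mask per branch, no separate shrd/shld. */
static uint64_t offset_folded(uint64_t i, int bit_count)
{
    int shift = 64 - bit_count;
    assert(bit_count % 2 == 0 && shift >= 4);
    uint64_t rev = bitrev64(i);
    return ((rev >> (shift - 3 - 1)) & 0x5555555555555555ull)
         | ((rev >> (shift - 3 + 1)) & 0xAAAAAAAAAAAAAAAAull);
}
```

The folded form relies on i being smaller than 2^bit_count, so that all bits of rev below position shift are zero and nothing leaks into the low bits of the offset.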
Main loop in assembly
.L1873: { fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=5, abs=0, disp=0 } .L1686: { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END bitrevd,0,sm %b[21], %b[14] shr_andd,1,sm %b[16], %r0, %r6, %b[1] addd,2,sm %b[21], 0x1, %b[19] ord,3,sm %b[17], %b[9], %b[20] shr_andd,4,sm %b[18], %r4, %r5, %b[11] std,5 %r2, %b[22], %b[12] movad,1 area=0, ind=0, am=1, be=0, %b[0] }
Theoretical throughput: 1 complex number per 1 cycle (1/1 × 8 bytes) = 8 bytes/cycle
Speed measurements

We see a speedup at the start of the graph.
4. reverse_radix4_x4_bad
Let's try to speed things up by manually unrolling the loop 4 times.
Diagram of data movement in memory

C code
void reverse_radix4_x4_bad(int bit_count, myComplex *data_in, myComplex *data_out)
{
    int count = 1 << bit_count;
    int shift = 64 - bit_count;
    myComplex *data_out_0 = &data_out[0 * count/4];
    myComplex *data_out_1 = &data_out[1 * count/4];
    myComplex *data_out_2 = &data_out[2 * count/4];
    myComplex *data_out_3 = &data_out[3 * count/4];

    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < count; i += 4)
    {
        uint64_t rev = __builtin_e2k_bitrevd(i);
        int64_t index = ((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555);
        data_out_0[index] = data_in[i + 0];
        data_out_1[index] = data_in[i + 1];
        data_out_2[index] = data_in[i + 2];
        data_out_3[index] = data_in[i + 3];
    }
}
Main loop in assembly
.L2338: { fapb ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=16 } .L2049: { loop_mode bitrevd,0,sm %b[29], %b[23] std,2 %r0, %b[26], %b[18] addd,3,sm %b[29], 0x4, %b[27] shld,4,sm %b[30], 0x3, %b[24] std,5 %r5, %b[26], %b[19] movad,0 area=0, ind=0, am=0, be=0, %b[1] movad,1 area=0, ind=8, am=1, be=0, %b[11] movad,2 area=0, ind=0, am=0, be=0, %b[0] movad,3 area=0, ind=8, am=1, be=0, %b[10] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END shr_andd,1,sm %b[23], %r7, %r9, %b[18] std,2 %r4, %b[26], %b[8] ord,3,sm %b[21], %b[22], %b[28] shr_andd,4,sm %b[25], %r6, %r8, %b[19] std,5 %r2, %b[26], %b[9] }
Theoretical throughput: 4 complex numbers per 2 cycles (4/2 × 8 bytes) = 16 bytes/cycle
Speed measurements

We see a speedup at the start of the graph.
As we remember from reverse_radix2, writing to scattered memory locations performs worse than writing to adjacent ones.
5. reverse_radix4_x4_good
Let's try the opposite: read from different places and write to adjacent ones.
Diagram of data movement in memory

C code
void reverse_radix4_x4_good(int bit_count, myComplex *data_in, myComplex *data_out)
{
    int count = 1 << bit_count;
    int shift = 64 - bit_count;
    myComplex *data_in_0 = &data_in[0 * count/4];
    myComplex *data_in_1 = &data_in[1 * count/4];
    myComplex *data_in_2 = &data_in[2 * count/4];
    myComplex *data_in_3 = &data_in[3 * count/4];

    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < count/4; ++i)
    {
        uint64_t rev = __builtin_e2k_bitrevd(i);
        int64_t index = ((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555);
        data_out[index + 0] = data_in_0[i];
        data_out[index + 1] = data_in_1[i];
        data_out[index + 2] = data_in_2[i];
        data_out[index + 3] = data_in_3[i];
    }
}
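The "read from four places, write adjacent" scheme can be checked off the target machine with a portable model. This is a sketch under assumptions: cpx stands in for myComplex, bitrev64 replaces __builtin_e2k_bitrevd, and the pragmas are dropped because they only affect scheduling. The loop-based digit_reverse4 plays the role of the reference.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { float re, im; } cpx;   /* hypothetical stand-in for myComplex */

/* Software 64-bit bit reversal (portable stand-in for bitrevd). */
static uint64_t bitrev64(uint64_t x)
{
    uint64_t r = 0;
    for (int b = 0; b < 64; ++b) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* Reference base-4 digit reversal, used only for verification. */
static uint64_t digit_reverse4(uint64_t number, int bit_count)
{
    uint64_t answer = 0;
    for (int i = 0; i < bit_count / 2; ++i) {
        answer = (answer << 2) | (number & 3u);
        number >>= 2;
    }
    return answer;
}

/* Four input streams (quarters of the input), four adjacent writes. */
static void reverse_radix4_x4_model(int bit_count, const cpx *in, cpx *out)
{
    assert(bit_count >= 2 && bit_count <= 62 && bit_count % 2 == 0);
    uint64_t count = 1ull << bit_count;
    int shift = 64 - bit_count;
    const cpx *in0 = in + 0 * (count / 4);
    const cpx *in1 = in + 1 * (count / 4);
    const cpx *in2 = in + 2 * (count / 4);
    const cpx *in3 = in + 3 * (count / 4);

    for (uint64_t i = 0; i < count / 4; ++i) {
        uint64_t rev = bitrev64(i);
        /* i < count/4, so the two lowest bits of index are always zero. */
        uint64_t index = ((rev >> (shift - 1)) & 0xAAAAAAAAAAAAAAAAull)
                       | ((rev >> (shift + 1)) & 0x5555555555555555ull);
        out[index + 0] = in0[i];
        out[index + 1] = in1[i];
        out[index + 2] = in2[i];
        out[index + 3] = in3[i];
    }
}
```

The top base-4 digit of the input position selects the stream, and after reversal it becomes the lowest digit of the output position, which is why the four writes land next to each other.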
Main loop in assembly
.L2807: { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=0, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=4, abs=16, disp=0 } .L2518: { loop_mode bitrevd,0,sm %b[29], %b[23] std,2 %r5, %b[26], %b[19] addd,3,sm %b[29], 0x1, %b[27] shld,4,sm %b[30], 0x3, %b[24] std,5 %r0, %b[26], %b[18] movad,0 area=1, ind=0, am=1, be=0, %b[1] movad,1 area=0, ind=0, am=1, be=0, %b[0] movad,2 area=1, ind=0, am=1, be=0, %b[11] movad,3 area=0, ind=0, am=1, be=0, %b[10] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END shr_andd,1,sm %b[23], %r7, %r9, %b[18] std,2 %r4, %b[26], %b[9] ord,3,sm %b[21], %b[22], %b[28] shr_andd,4,sm %b[25], %r6, %r8, %b[19] std,5 %b[26], %r2, %b[8] }
Theoretical throughput: 4 complex numbers per 2 cycles (4/2 × 8 bytes) = 16 bytes/cycle
Speed measurements

We see the desired speedup along the entire graph.
Strictly speaking, this is not loop unrolling. The previous variant was a true unroll. Here the algorithm itself has changed (the data is processed in a different order). I could not come up with a name for this ("stream4"?), so all subsequent "unrollings" will be called x4/x16 and so on.
6. reverse_radix4_x4_best
Instead of four 64-bit stores to memory, let's do two 128-bit stores.
Diagram of data movement in memory

C code
void reverse_radix4_x4_best(int bit_count, myComplex *data_in, myComplex *data_out)
{
    int count = 1 << bit_count;
    int shift = 64 - bit_count;
    uint64_t *data_in_0 = (uint64_t*)&data_in[0 * count/4];
    uint64_t *data_in_1 = (uint64_t*)&data_in[1 * count/4];
    uint64_t *data_in_2 = (uint64_t*)&data_in[2 * count/4];
    uint64_t *data_in_3 = (uint64_t*)&data_in[3 * count/4];

    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < count/4; ++i)
    {
        uint64_t rev = __builtin_e2k_bitrevd(i);
        int64_t offset = 8 * (((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555));
        *(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_0[i], data_in_1[i]};
        *(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_2[i], data_in_3[i]};
    }
}
Main loop in assembly
.L3099: { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=0, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=4, abs=16, disp=0 } .L2975: { loop_mode qppackdl,0,sm %b[10], %b[16], %b[9] shr_andd,1,sm %b[23], %r5, %r7, %b[0] qppackdl,3,sm %b[21], %b[22], %b[5] shr_andd,4,sm %b[25], %r4, %r6, %b[13] ord,5,sm %b[15], %b[4], %b[26] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END bitrevd,0,sm %b[3], %b[21] stqp,2 %r0, %b[24], %b[11] addd,3,sm %b[3], 0x1, %b[1] shld,4,sm %b[26], 0x3, %b[22] stqp,5 %r2, %b[24], %b[7] movad,0 area=1, ind=0, am=1, be=0, %b[10] movad,1 area=0, ind=0, am=1, be=0, %b[16] movad,2 area=1, ind=0, am=1, be=0, %b[4] movad,3 area=0, ind=0, am=1, be=0, %b[15] }
Theoretical speed: 4 complex numbers (8 bytes each) per 2 cycles (4/2) = 16 bytes/cycle
Speed measurements

We see a strong speedup.
From now on we will always write to memory in 128-bit chunks.
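The offset expression used in these loops can look opaque, so here is a minimal portable sketch of what it computes. It assumes that bit_count is even. The helper bitrev64 is a hypothetical software stand-in for __builtin_e2k_bitrevd, and the function names are introduced here only for illustration. Reversing the bits and then swapping the two bits inside every pair gives the base-4 digit reversal of the index; the article's code additionally multiplies the result by 8 to turn an element index into a byte offset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable stand-in for __builtin_e2k_bitrevd:
   reverses all 64 bits of x. */
static uint64_t bitrev64(uint64_t x)
{
    uint64_t r = 0;
    for (int b = 0; b < 64; ++b) {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}

/* Base-4 digit reversal of i within bit_count bits, written with the
   same shift-and-mask expression as the loops above (without the
   final multiplication by 8). Assumes an even bit_count. */
static uint64_t digit_reverse4_fast(uint64_t i, int bit_count)
{
    assert(bit_count >= 2 && bit_count <= 62 && bit_count % 2 == 0);
    int shift = 64 - bit_count;
    uint64_t rev = bitrev64(i);
    /* odd result bits come from even bits of the bit-reversed index,
       even result bits come from its odd bits */
    return ((rev >> (shift - 1)) & 0xAAAAAAAAAAAAAAAAull)
         | ((rev >> (shift + 1)) & 0x5555555555555555ull);
}

/* Straightforward reference: peel off base-4 digits one by one. */
static uint64_t digit_reverse4_naive(uint64_t i, int bit_count)
{
    uint64_t r = 0;
    for (int d = 0; d < bit_count / 2; ++d) {
        r = (r << 2) | (i & 3);
        i >>= 2;
    }
    return r;
}
```

For example, with bit_count = 4 the index 6 has base-4 digits (1, 2), and both functions return 9, whose digits are (2, 1).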
7. reverse_radix4_x16
Let's continue the "pseudo-unrolling".
Diagram of data movement in memory

C code
void reverse_radix4_x16(int bit_count, myComplex *data_in, myComplex *data_out)
{
    int count = 1 << bit_count;
    int shift = 64 - bit_count;
    uint64_t *data_in_00 = (uint64_t*)&data_in[ 0 * count/16];
    uint64_t *data_in_01 = (uint64_t*)&data_in[ 1 * count/16];
    uint64_t *data_in_02 = (uint64_t*)&data_in[ 2 * count/16];
    uint64_t *data_in_03 = (uint64_t*)&data_in[ 3 * count/16];
    uint64_t *data_in_10 = (uint64_t*)&data_in[ 4 * count/16];
    uint64_t *data_in_11 = (uint64_t*)&data_in[ 5 * count/16];
    uint64_t *data_in_12 = (uint64_t*)&data_in[ 6 * count/16];
    uint64_t *data_in_13 = (uint64_t*)&data_in[ 7 * count/16];
    uint64_t *data_in_20 = (uint64_t*)&data_in[ 8 * count/16];
    uint64_t *data_in_21 = (uint64_t*)&data_in[ 9 * count/16];
    uint64_t *data_in_22 = (uint64_t*)&data_in[10 * count/16];
    uint64_t *data_in_23 = (uint64_t*)&data_in[11 * count/16];
    uint64_t *data_in_30 = (uint64_t*)&data_in[12 * count/16];
    uint64_t *data_in_31 = (uint64_t*)&data_in[13 * count/16];
    uint64_t *data_in_32 = (uint64_t*)&data_in[14 * count/16];
    uint64_t *data_in_33 = (uint64_t*)&data_in[15 * count/16];
    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < count/16; ++i) {
        uint64_t rev = __builtin_e2k_bitrevd(i);
        int64_t offset = 8 * (((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555));
        *(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_00[i], data_in_10[i]};
        *(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_20[i], data_in_30[i]};
        *(__v2du*)((void*)data_out + offset + 2*16) = (__v2du){data_in_01[i], data_in_11[i]};
        *(__v2du*)((void*)data_out + offset + 3*16) = (__v2du){data_in_21[i], data_in_31[i]};
        *(__v2du*)((void*)data_out + offset + 4*16) = (__v2du){data_in_02[i], data_in_12[i]};
        *(__v2du*)((void*)data_out + offset + 5*16) = (__v2du){data_in_22[i], data_in_32[i]};
        *(__v2du*)((void*)data_out + offset + 6*16) = (__v2du){data_in_03[i], data_in_13[i]};
        *(__v2du*)((void*)data_out + offset + 7*16) = (__v2du){data_in_23[i], data_in_33[i]};
    }
}
Main loop in assembly
.L3848: { fapb ct=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=2, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=2, abs=0, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=2, abs=4, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=2, abs=4, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=2, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=2, abs=8, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=2, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=2, abs=12, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=2, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=2, abs=16, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=2, abs=20, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=2, abs=20, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=2, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=2, abs=24, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=2, abs=28, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=2, abs=28, disp=0 } .L3565: { loop_mode qppackdl,0,sm %b[51], %b[58], %b[1] stqp,2 %r0, %b[57], %b[42] qppackdl,3,sm %b[35], %b[50], %b[6] shr_andd,4,sm %b[60], %r12, %r15, %b[29] stqp,5 %r2, %b[57], %b[43] movad,0 area=7, ind=0, am=1, be=0, %b[18] movad,1 area=6, ind=0, am=1, be=0, %b[26] movad,2 area=7, ind=0, am=1, be=0, %b[13] movad,3 area=6, ind=0, am=1, be=0, %b[21] } { loop_mode ord,0,sm %b[61], %b[31], %b[59] qppackdl,1,sm %b[54], %b[55], %b[35] stqp,2 %r5, %b[57], %b[5] qppackdl,4,sm %b[46], %b[47], %b[34] stqp,5 %r6, %b[57], %b[10] movad,0 area=5, ind=0, am=1, be=0, %b[43] movad,1 area=4, ind=0, am=1, be=0, %b[51] movad,2 area=5, ind=0, am=1, be=0, %b[42] movad,3 area=4, ind=0, am=1, be=0, %b[50] } { loop_mode shld,0,sm %b[59], 0x3, %b[55] 
qppackdl,1,sm %b[23], %b[28], %b[5] stqp,2 %r7, %b[57], %b[39] addd,3,sm %b[4], 0x1, %b[2] qppackdl,4,sm %b[15], %b[20], %b[10] stqp,5 %r9, %b[57], %b[38] movad,0 area=3, ind=0, am=1, be=0, %b[46] movad,1 area=2, ind=0, am=1, be=0, %b[54] movad,2 area=3, ind=0, am=1, be=0, %b[31] movad,3 area=2, ind=0, am=1, be=0, %b[47] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END qppackdl,0,sm %b[19], %b[24], %b[38] shr_andd,1,sm %b[60], %r13, %r14, %b[59] stqp,2 %r10, %b[57], %b[11] qppackdl,3,sm %b[27], %b[32], %b[39] bitrevd,4,sm %b[4], %b[58] stqp,5 %r11, %b[57], %b[16] movad,0 area=1, ind=0, am=1, be=0, %b[20] movad,1 area=0, ind=0, am=1, be=0, %b[28] movad,2 area=1, ind=0, am=1, be=0, %b[15] movad,3 area=0, ind=0, am=1, be=0, %b[23] }
Theoretical speed: 16 complex numbers per 4 cycles (16/4) = 32 bytes/cycle
Speed measurements

We see a strong speedup.
An attempt to "pseudo-unroll" by 64 produces drastically less efficient code. The APB can read from at most 32 streams, so to read from 64 streams the compiler inserts ordinary ldd load operations. As a result, the speed drops sharply.
Let's try reading in 128-bit chunks instead of 64-bit ones.
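The data movement of the x16 variant can also be modelled in portable scalar C. This is only a sketch for checking the index arithmetic, not the optimized code: it uses C99 float complex in place of myComplex, and the names digit_reverse4 and reverse_radix4_x16_model are hypothetical, introduced here for illustration. Each iteration reads one element from each of the 16 input streams and writes one contiguous block of 16 outputs.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Reverse the lowest `digits` base-4 digits of i. */
static size_t digit_reverse4(size_t i, int digits)
{
    size_t r = 0;
    for (int d = 0; d < digits; ++d) {
        r = (r << 2) | (i & 3);
        i >>= 2;
    }
    return r;
}

/* Scalar model of the x16 variant. Assumes bit_count is even and >= 4. */
static void reverse_radix4_x16_model(int bit_count,
                                     const float complex *data_in,
                                     float complex *data_out)
{
    assert(bit_count >= 4 && bit_count % 2 == 0);
    size_t count  = (size_t)1 << bit_count;
    size_t stride = count / 16;                 /* length of one input stream */
    for (size_t i = 0; i < stride; ++i) {
        /* the two top digits of i are zero, so block is a multiple of 16 */
        size_t block = digit_reverse4(i, bit_count / 2);
        for (size_t k = 0; k < 16; ++k) {
            /* two-digit base-4 reversal of k selects the input stream */
            size_t stream = 4 * (k & 3) + (k >> 2);
            data_out[block + k] = data_in[stream * stride + i];
        }
    }
}
```

The result is the same permutation as a plain element-by-element base-4 digit reversal; only the order in which memory is visited differs.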
8. reverse_radix4_x16x2
Let's try to increase the read speed of the reverse_radix4_x16 version.
In essence, this variant is a genuine 2x unroll.
C code
void reverse_radix4_x16x2(int bit_count, myComplex *data_in, myComplex *data_out)
{
    int count = 1 << bit_count;
    int shift = 64 - bit_count;
    __v2di *data_in_00 = (__v2di*)&data_in[ 0 * count/16];
    __v2di *data_in_01 = (__v2di*)&data_in[ 1 * count/16];
    __v2di *data_in_02 = (__v2di*)&data_in[ 2 * count/16];
    __v2di *data_in_03 = (__v2di*)&data_in[ 3 * count/16];
    __v2di *data_in_10 = (__v2di*)&data_in[ 4 * count/16];
    __v2di *data_in_11 = (__v2di*)&data_in[ 5 * count/16];
    __v2di *data_in_12 = (__v2di*)&data_in[ 6 * count/16];
    __v2di *data_in_13 = (__v2di*)&data_in[ 7 * count/16];
    __v2di *data_in_20 = (__v2di*)&data_in[ 8 * count/16];
    __v2di *data_in_21 = (__v2di*)&data_in[ 9 * count/16];
    __v2di *data_in_22 = (__v2di*)&data_in[10 * count/16];
    __v2di *data_in_23 = (__v2di*)&data_in[11 * count/16];
    __v2di *data_in_30 = (__v2di*)&data_in[12 * count/16];
    __v2di *data_in_31 = (__v2di*)&data_in[13 * count/16];
    __v2di *data_in_32 = (__v2di*)&data_in[14 * count/16];
    __v2di *data_in_33 = (__v2di*)&data_in[15 * count/16];
    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < count/16/2; ++i) {
        uint64_t rev0 = __builtin_e2k_bitrevd(2*i+0);
        int64_t offset0 = 8 * (((rev0>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev0>>(shift+1)) & 0x5555555555555555));
        __v2di mask0 = {0x0706050403020100, 0x0706050403020100};
        *(__v2du*)((void*)data_out + offset0 + 0*16) = __builtin_e2k_qpshufb(data_in_10[i], data_in_00[i], mask0);
        *(__v2du*)((void*)data_out + offset0 + 1*16) = __builtin_e2k_qpshufb(data_in_30[i], data_in_20[i], mask0);
        *(__v2du*)((void*)data_out + offset0 + 2*16) = __builtin_e2k_qpshufb(data_in_11[i], data_in_01[i], mask0);
        *(__v2du*)((void*)data_out + offset0 + 3*16) = __builtin_e2k_qpshufb(data_in_31[i], data_in_21[i], mask0);
        *(__v2du*)((void*)data_out + offset0 + 4*16) = __builtin_e2k_qpshufb(data_in_12[i], data_in_02[i], mask0);
        *(__v2du*)((void*)data_out + offset0 + 5*16) = __builtin_e2k_qpshufb(data_in_32[i], data_in_22[i], mask0);
        *(__v2du*)((void*)data_out + offset0 + 6*16) = __builtin_e2k_qpshufb(data_in_13[i], data_in_03[i], mask0);
        *(__v2du*)((void*)data_out + offset0 + 7*16) = __builtin_e2k_qpshufb(data_in_33[i], data_in_23[i], mask0);
        uint64_t rev1 = __builtin_e2k_bitrevd(2*i+1);
        int64_t offset1 = 8 * (((rev1>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev1>>(shift+1)) & 0x5555555555555555));
        __v2di mask1 = {0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908};
        *(__v2du*)((void*)data_out + offset1 + 0*16) = __builtin_e2k_qpshufb(data_in_10[i], data_in_00[i], mask1);
        *(__v2du*)((void*)data_out + offset1 + 1*16) = __builtin_e2k_qpshufb(data_in_30[i], data_in_20[i], mask1);
        *(__v2du*)((void*)data_out + offset1 + 2*16) = __builtin_e2k_qpshufb(data_in_11[i], data_in_01[i], mask1);
        *(__v2du*)((void*)data_out + offset1 + 3*16) = __builtin_e2k_qpshufb(data_in_31[i], data_in_21[i], mask1);
        *(__v2du*)((void*)data_out + offset1 + 4*16) = __builtin_e2k_qpshufb(data_in_12[i], data_in_02[i], mask1);
        *(__v2du*)((void*)data_out + offset1 + 5*16) = __builtin_e2k_qpshufb(data_in_32[i], data_in_22[i], mask1);
        *(__v2du*)((void*)data_out + offset1 + 6*16) = __builtin_e2k_qpshufb(data_in_13[i], data_in_03[i], mask1);
        *(__v2du*)((void*)data_out + offset1 + 7*16) = __builtin_e2k_qpshufb(data_in_33[i], data_in_23[i], mask1);
    }
}
Main loop in assembly
.L4839: { fapb ct=0, dcd=0, fmt=5, mrng=16, d=1, incr=0, ind=0, asz=2, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=15, asz=2, abs=0, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=14, asz=2, abs=4, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=13, asz=2, abs=4, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=12, asz=2, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=11, asz=2, abs=8, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=10, asz=2, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=9, asz=2, abs=12, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=8, asz=2, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=2, abs=16, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=2, abs=20, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=2, abs=20, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=2, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=2, abs=24, disp=0 } { fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=2, abs=28, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=1, asz=2, abs=28, disp=0 } .L3987: { loop_mode qpshufb,0,sm %b[26], %b[21], %r12, %b[51] shr_andd,1,sm %b[23], %r14, %r17, %b[1] stqp,2 %r2, %b[4], %b[13] qpshufb,3,sm %b[45], %b[42], %r12, %b[50] shr_andd,4,sm %b[23], %r15, %r16, %b[5] stqp,5 %r0, %b[4], %b[12] } { loop_mode qpshufb,0,sm %b[35], %b[32], %r12, %b[54] shr_andd,1,sm %b[3], %r14, %r17, %b[23] stqp,2 %r5, %b[4], %b[51] qpshufb,3,sm %b[29], %b[11], %r12, %b[53] ord,4,sm %b[7], %b[25], %b[52] stqp,5 %r6, %b[4], %b[50] movaqp,0 area=7, ind=0, am=1, be=0, %b[12] movaqp,1 area=6, ind=0, am=1, be=0, %b[18] movaqp,2 area=7, ind=0, am=1, be=0, %b[6] movaqp,3 area=6, ind=0, am=1, be=0, %b[13] } { loop_mode qpshufb,1,sm %b[22], %b[17], %r12, %b[55] stqp,2 %r7, %b[4], %b[54] qpshufb,3,sm %b[22], 
%b[17], %r13, %b[51] shld,4,sm %b[52], 0x3, %b[50] stqp,5 %r9, %b[4], %b[53] movaqp,0 area=5, ind=0, am=1, be=0, %b[25] movaqp,1 area=4, ind=0, am=1, be=0, %b[31] movaqp,2 area=5, ind=0, am=1, be=0, %b[7] movaqp,3 area=4, ind=0, am=1, be=0, %b[28] } { loop_mode qpshufb,1,sm %b[45], %b[42], %r13, %b[53] stqp,2 %r10, %b[4], %b[55] qpshufb,4,sm %b[35], %b[32], %r13, %b[52] stqp,5 %r10, %b[50], %b[51] movaqp,0 area=3, ind=0, am=1, be=0, %b[41] movaqp,1 area=2, ind=0, am=1, be=0, %b[22] movaqp,2 area=3, ind=0, am=1, be=0, %b[38] movaqp,3 area=2, ind=0, am=1, be=0, %b[17] } { loop_mode addd,0,sm %b[2], 0x1, %b[48] qpshufb,1,sm %b[49], %b[46], %r13, %b[54] stqp,2 %r6, %b[50], %b[53] addd,3,sm 0x2, %b[2], %b[0] qpshufb,4,sm %b[26], %b[21], %r13, %b[51] stqp,5 %r7, %b[50], %b[52] movaqp,0 area=1, ind=0, am=1, be=0, %b[45] movaqp,1 area=0, ind=0, am=1, be=0, %b[35] movaqp,2 area=1, ind=0, am=1, be=0, %b[42] movaqp,3 area=0, ind=0, am=1, be=0, %b[32] } { loop_mode bitrevd,0,sm %b[2], %b[21] qpshufb,1,sm %b[39], %b[36], %r13, %b[52] stqp,2 %r0, %b[50], %b[54] ord,3,sm %b[5], %b[1], %b[26] qpshufb,4,sm %b[16], %b[10], %r13, %b[49] stqp,5 %r5, %b[50], %b[51] } { loop_mode bitrevd,0,sm %b[48], %b[1] qpshufb,1,sm %b[16], %b[10], %r12, %b[53] stqp,2 %r2, %b[50], %b[52] shld,3,sm %b[26], 0x3, %b[2] qpshufb,4,sm %b[29], %b[11], %r13, %b[51] stqp,5 %r11, %b[50], %b[49] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END qpshufb,0,sm %b[37], %b[34], %r12, %b[11] stqp,2 %r11, %b[4], %b[53] qpshufb,3,sm %b[47], %b[44], %r12, %b[10] shr_andd,4,sm %b[3], %r15, %r16, %b[5] stqp,5 %r9, %b[50], %b[51] }
Theoretical speed: 32 complex numbers per 8 cycles (32/8) = 32 bytes/cycle
Speed measurements

We see a slowdown in the middle of the graph.
We can also unroll by 32. To do that, we write the 64x unrolled version and process one half of the rows in the first loop and the other half of the rows in a second loop. Each loop uses 32 APB read streams.
9. reverse_radix4_x32
Let's do a 32x "pseudo-unroll" using two loops.
C code
void reverse_radix4_x32(int bit_count, myComplex *data_in, myComplex *data_out)
{
    int count = 1 << bit_count;
    int shift = 64 - bit_count;
    uint64_t *data_in_000 = (uint64_t*)&data_in[ 0 * count/64];
    uint64_t *data_in_001 = (uint64_t*)&data_in[ 1 * count/64];
    uint64_t *data_in_002 = (uint64_t*)&data_in[ 2 * count/64];
    uint64_t *data_in_003 = (uint64_t*)&data_in[ 3 * count/64];
    uint64_t *data_in_010 = (uint64_t*)&data_in[ 4 * count/64];
    uint64_t *data_in_011 = (uint64_t*)&data_in[ 5 * count/64];
    uint64_t *data_in_012 = (uint64_t*)&data_in[ 6 * count/64];
    uint64_t *data_in_013 = (uint64_t*)&data_in[ 7 * count/64];
    uint64_t *data_in_020 = (uint64_t*)&data_in[ 8 * count/64];
    uint64_t *data_in_021 = (uint64_t*)&data_in[ 9 * count/64];
    uint64_t *data_in_022 = (uint64_t*)&data_in[10 * count/64];
    uint64_t *data_in_023 = (uint64_t*)&data_in[11 * count/64];
    uint64_t *data_in_030 = (uint64_t*)&data_in[12 * count/64];
    uint64_t *data_in_031 = (uint64_t*)&data_in[13 * count/64];
    uint64_t *data_in_032 = (uint64_t*)&data_in[14 * count/64];
    uint64_t *data_in_033 = (uint64_t*)&data_in[15 * count/64];
    uint64_t *data_in_100 = (uint64_t*)&data_in[16 * count/64];
    uint64_t *data_in_101 = (uint64_t*)&data_in[17 * count/64];
    uint64_t *data_in_102 = (uint64_t*)&data_in[18 * count/64];
    uint64_t *data_in_103 = (uint64_t*)&data_in[19 * count/64];
    uint64_t *data_in_110 = (uint64_t*)&data_in[20 * count/64];
    uint64_t *data_in_111 = (uint64_t*)&data_in[21 * count/64];
    uint64_t *data_in_112 = (uint64_t*)&data_in[22 * count/64];
    uint64_t *data_in_113 = (uint64_t*)&data_in[23 * count/64];
    uint64_t *data_in_120 = (uint64_t*)&data_in[24 * count/64];
    uint64_t *data_in_121 = (uint64_t*)&data_in[25 * count/64];
    uint64_t *data_in_122 = (uint64_t*)&data_in[26 * count/64];
    uint64_t *data_in_123 = (uint64_t*)&data_in[27 * count/64];
    uint64_t *data_in_130 = (uint64_t*)&data_in[28 * count/64];
    uint64_t *data_in_131 = (uint64_t*)&data_in[29 * count/64];
    uint64_t *data_in_132 = (uint64_t*)&data_in[30 * count/64];
    uint64_t *data_in_133 = (uint64_t*)&data_in[31 * count/64];
    uint64_t *data_in_200 = (uint64_t*)&data_in[32 * count/64];
    uint64_t *data_in_201 = (uint64_t*)&data_in[33 * count/64];
    uint64_t *data_in_202 = (uint64_t*)&data_in[34 * count/64];
    uint64_t *data_in_203 = (uint64_t*)&data_in[35 * count/64];
    uint64_t *data_in_210 = (uint64_t*)&data_in[36 * count/64];
    uint64_t *data_in_211 = (uint64_t*)&data_in[37 * count/64];
    uint64_t *data_in_212 = (uint64_t*)&data_in[38 * count/64];
    uint64_t *data_in_213 = (uint64_t*)&data_in[39 * count/64];
    uint64_t *data_in_220 = (uint64_t*)&data_in[40 * count/64];
    uint64_t *data_in_221 = (uint64_t*)&data_in[41 * count/64];
    uint64_t *data_in_222 = (uint64_t*)&data_in[42 * count/64];
    uint64_t *data_in_223 = (uint64_t*)&data_in[43 * count/64];
    uint64_t *data_in_230 = (uint64_t*)&data_in[44 * count/64];
    uint64_t *data_in_231 = (uint64_t*)&data_in[45 * count/64];
    uint64_t *data_in_232 = (uint64_t*)&data_in[46 * count/64];
    uint64_t *data_in_233 = (uint64_t*)&data_in[47 * count/64];
    uint64_t *data_in_300 = (uint64_t*)&data_in[48 * count/64];
    uint64_t *data_in_301 = (uint64_t*)&data_in[49 * count/64];
    uint64_t *data_in_302 = (uint64_t*)&data_in[50 * count/64];
    uint64_t *data_in_303 = (uint64_t*)&data_in[51 * count/64];
    uint64_t *data_in_310 = (uint64_t*)&data_in[52 * count/64];
    uint64_t *data_in_311 = (uint64_t*)&data_in[53 * count/64];
    uint64_t *data_in_312 = (uint64_t*)&data_in[54 * count/64];
    uint64_t *data_in_313 = (uint64_t*)&data_in[55 * count/64];
    uint64_t *data_in_320 = (uint64_t*)&data_in[56 * count/64];
    uint64_t *data_in_321 = (uint64_t*)&data_in[57 * count/64];
    uint64_t *data_in_322 = (uint64_t*)&data_in[58 * count/64];
    uint64_t *data_in_323 = (uint64_t*)&data_in[59 * count/64];
    uint64_t *data_in_330 = (uint64_t*)&data_in[60 * count/64];
    uint64_t *data_in_331 = (uint64_t*)&data_in[61 * count/64];
    uint64_t *data_in_332 = (uint64_t*)&data_in[62 * count/64];
    uint64_t *data_in_333 = (uint64_t*)&data_in[63 * count/64];
    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < count/32/2; ++i) {
        uint64_t rev = __builtin_e2k_bitrevd(i);
        int64_t offset = 8 * (((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555));
        *(__v2du*)((void*)data_out + offset +  0*16) = (__v2du){data_in_000[i], data_in_100[i]};
        *(__v2du*)((void*)data_out + offset +  1*16) = (__v2du){data_in_200[i], data_in_300[i]};
        *(__v2du*)((void*)data_out + offset +  2*16) = (__v2du){data_in_010[i], data_in_110[i]};
        *(__v2du*)((void*)data_out + offset +  3*16) = (__v2du){data_in_210[i], data_in_310[i]};
        *(__v2du*)((void*)data_out + offset +  4*16) = (__v2du){data_in_020[i], data_in_120[i]};
        *(__v2du*)((void*)data_out + offset +  5*16) = (__v2du){data_in_220[i], data_in_320[i]};
        *(__v2du*)((void*)data_out + offset +  6*16) = (__v2du){data_in_030[i], data_in_130[i]};
        *(__v2du*)((void*)data_out + offset +  7*16) = (__v2du){data_in_230[i], data_in_330[i]};
        *(__v2du*)((void*)data_out + offset +  8*16) = (__v2du){data_in_001[i], data_in_101[i]};
        *(__v2du*)((void*)data_out + offset +  9*16) = (__v2du){data_in_201[i], data_in_301[i]};
        *(__v2du*)((void*)data_out + offset + 10*16) = (__v2du){data_in_011[i], data_in_111[i]};
        *(__v2du*)((void*)data_out + offset + 11*16) = (__v2du){data_in_211[i], data_in_311[i]};
        *(__v2du*)((void*)data_out + offset + 12*16) = (__v2du){data_in_021[i], data_in_121[i]};
        *(__v2du*)((void*)data_out + offset + 13*16) = (__v2du){data_in_221[i], data_in_321[i]};
        *(__v2du*)((void*)data_out + offset + 14*16) = (__v2du){data_in_031[i], data_in_131[i]};
        *(__v2du*)((void*)data_out + offset + 15*16) = (__v2du){data_in_231[i], data_in_331[i]};
    }
    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < count/32/2; ++i) {
        uint64_t rev = __builtin_e2k_bitrevd(i);
        int64_t offset = 8 * (((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555));
        *(__v2du*)((void*)data_out + offset + 16*16) = (__v2du){data_in_002[i], data_in_102[i]};
        *(__v2du*)((void*)data_out + offset + 17*16) = (__v2du){data_in_202[i], data_in_302[i]};
        *(__v2du*)((void*)data_out + offset + 18*16) = (__v2du){data_in_012[i], data_in_112[i]};
        *(__v2du*)((void*)data_out + offset + 19*16) = (__v2du){data_in_212[i], data_in_312[i]};
        *(__v2du*)((void*)data_out + offset + 20*16) = (__v2du){data_in_022[i], data_in_122[i]};
        *(__v2du*)((void*)data_out + offset + 21*16) = (__v2du){data_in_222[i], data_in_322[i]};
        *(__v2du*)((void*)data_out + offset + 22*16) = (__v2du){data_in_032[i], data_in_132[i]};
        *(__v2du*)((void*)data_out + offset + 23*16) = (__v2du){data_in_232[i], data_in_332[i]};
        *(__v2du*)((void*)data_out + offset + 24*16) = (__v2du){data_in_003[i], data_in_103[i]};
        *(__v2du*)((void*)data_out + offset + 25*16) = (__v2du){data_in_203[i], data_in_303[i]};
        *(__v2du*)((void*)data_out + offset + 26*16) = (__v2du){data_in_013[i], data_in_113[i]};
        *(__v2du*)((void*)data_out + offset + 27*16) = (__v2du){data_in_213[i], data_in_313[i]};
        *(__v2du*)((void*)data_out + offset + 28*16) = (__v2du){data_in_023[i], data_in_123[i]};
        *(__v2du*)((void*)data_out + offset + 29*16) = (__v2du){data_in_223[i], data_in_323[i]};
        *(__v2du*)((void*)data_out + offset + 30*16) = (__v2du){data_in_033[i], data_in_133[i]};
        *(__v2du*)((void*)data_out + offset + 31*16) = (__v2du){data_in_233[i], data_in_333[i]};
    }
}
Main loop in assembly
.L7926: { fapb ct=0, dcd=0, fmt=4, mrng=8, d=17, incr=0, ind=0, asz=1, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=16, incr=0, ind=0, asz=1, abs=0, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=15, incr=0, ind=0, asz=1, abs=2, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=14, incr=0, ind=0, asz=1, abs=2, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=13, incr=0, ind=0, asz=1, abs=4, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=12, incr=0, ind=0, asz=1, abs=4, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=11, incr=0, ind=0, asz=1, abs=6, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=10, incr=0, ind=0, asz=1, abs=6, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=9, incr=0, ind=0, asz=1, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=8, incr=0, ind=0, asz=1, abs=8, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=7, incr=0, ind=0, asz=1, abs=10, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=6, incr=0, ind=0, asz=1, abs=10, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=5, incr=0, ind=0, asz=1, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=4, incr=0, ind=0, asz=1, abs=12, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=3, incr=0, ind=0, asz=1, abs=14, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=2, incr=0, ind=0, asz=1, abs=14, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=1, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=1, abs=16, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=1, abs=18, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=1, abs=18, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=1, abs=20, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=1, abs=20, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=1, abs=22, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=1, abs=22, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=1, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, 
ind=7, asz=1, abs=24, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=1, abs=26, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=1, abs=26, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=1, abs=28, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=1, abs=28, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=1, abs=30, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=1, abs=30, disp=0 } .L6604: { loop_mode qppackdl,0,sm %b[55], %b[56], %b[1] shr_andd,1,sm %b[62], %r0, %r21, %b[67] stqp,2 %r4, %b[58], %b[14] qppackdl,3,sm %b[26], %b[50], %b[14] shr_andd,4,sm %b[62], %r1, %r20, %b[68] stqp,5 %r2, %b[58], %b[19] movad,0 area=15, ind=0, am=1, be=0, %b[62] movad,1 area=14, ind=0, am=1, be=0, %b[66] movad,2 area=15, ind=0, am=1, be=0, %b[60] movad,3 area=14, ind=0, am=1, be=0, %b[64] } { loop_mode qppackdl,1,sm %b[44], %b[49], %b[31] stqp,2 %r5, %b[58], %b[3] qppackdl,4,sm %b[25], %b[31], %b[26] stqp,5 %r6, %b[58], %b[16] movad,0 area=13, ind=0, am=1, be=0, %b[16] movad,1 area=12, ind=0, am=1, be=0, %b[25] movad,2 area=13, ind=0, am=1, be=0, %b[3] movad,3 area=12, ind=0, am=1, be=0, %b[19] } { loop_mode qppackdl,1,sm %b[63], %b[65], %b[33] stqp,2 %r7, %b[58], %b[33] qppackdl,4,sm %b[59], %b[61], %b[28] stqp,5 %r9, %b[58], %b[28] movad,0 area=11, ind=0, am=1, be=0, %b[49] movad,1 area=10, ind=0, am=1, be=0, %b[55] movad,2 area=11, ind=0, am=1, be=0, %b[44] movad,3 area=10, ind=0, am=1, be=0, %b[50] } { loop_mode qppackdl,1,sm %g18, %g19, %b[37] stqp,2 %r10, %b[58], %b[37] qppackdl,4,sm %g16, %g17, %b[32] stqp,5 %r11, %b[58], %b[32] movad,0 area=9, ind=0, am=1, be=0, %g17 movad,1 area=8, ind=0, am=1, be=0, %g19 movad,2 area=9, ind=0, am=1, be=0, %g16 movad,3 area=8, ind=0, am=1, be=0, %g18 } { loop_mode qppackdl,1,sm %b[52], %b[57], %b[41] stqp,2 %r12, %b[58], %b[41] qppackdl,4,sm %b[46], %b[51], %b[36] stqp,5 %r13, %b[58], %b[36] movad,0 area=7, ind=0, am=1, be=0, 
%b[59] movad,1 area=6, ind=0, am=1, be=0, %b[63] movad,2 area=7, ind=0, am=1, be=0, %b[57] movad,3 area=6, ind=0, am=1, be=0, %b[61] } { loop_mode addd,0,sm %b[4], 0x1, %b[2] ? %pcnt2 qppackdl,1,sm %b[21], %b[27], %b[5] stqp,2 %r14, %b[58], %b[45] ord,3,sm %b[68], %b[67], %b[65] qppackdl,4,sm %b[5], %b[18], %b[18] stqp,5 %r15, %b[58], %b[40] movad,0 area=5, ind=0, am=1, be=0, %b[27] movad,1 area=4, ind=0, am=1, be=0, %b[45] movad,2 area=5, ind=0, am=1, be=0, %b[21] movad,3 area=4, ind=0, am=1, be=0, %b[40] } { loop_mode bitrevd,0,sm %b[4], %b[60] qppackdl,1,sm %b[64], %b[66], %b[9] stqp,2 %r16, %b[58], %b[9] shld,3,sm %b[65], 0x3, %b[56] qppackdl,4,sm %b[60], %b[62], %b[4] stqp,5 %r17, %b[58], %b[22] movad,0 area=3, ind=0, am=1, be=0, %b[46] movad,1 area=2, ind=0, am=1, be=0, %b[52] movad,2 area=3, ind=0, am=1, be=0, %b[22] movad,3 area=2, ind=0, am=1, be=0, %b[51] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END qppackdl,0,sm %g22, %g23, %b[10] stqp,2 %r18, %b[58], %b[15] qppackdl,3,sm %g20, %g21, %b[15] stqp,5 %r19, %b[58], %b[10] movad,0 area=1, ind=0, am=1, be=0, %g23 movad,1 area=0, ind=0, am=1, be=0, %g21 movad,2 area=1, ind=0, am=1, be=0, %g22 movad,3 area=0, ind=0, am=1, be=0, %g20 } ... 
.L7272: { fapb ct=0, dcd=0, fmt=4, mrng=8, d=17, incr=0, ind=0, asz=1, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=16, incr=0, ind=0, asz=1, abs=0, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=15, incr=0, ind=0, asz=1, abs=2, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=14, incr=0, ind=0, asz=1, abs=2, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=13, incr=0, ind=0, asz=1, abs=4, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=12, incr=0, ind=0, asz=1, abs=4, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=11, incr=0, ind=0, asz=1, abs=6, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=10, incr=0, ind=0, asz=1, abs=6, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=9, incr=0, ind=0, asz=1, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=8, incr=0, ind=0, asz=1, abs=8, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=7, incr=0, ind=0, asz=1, abs=10, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=6, incr=0, ind=0, asz=1, abs=10, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=5, incr=0, ind=0, asz=1, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=4, incr=0, ind=0, asz=1, abs=12, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=3, incr=0, ind=0, asz=1, abs=14, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=2, incr=0, ind=0, asz=1, abs=14, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=1, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=1, abs=16, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=1, abs=18, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=1, abs=18, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=1, abs=20, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=1, abs=20, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=1, abs=22, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=1, abs=22, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=1, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, 
ind=7, asz=1, abs=24, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=1, abs=26, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=1, abs=26, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=1, abs=28, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=1, abs=28, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=1, abs=30, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=1, abs=30, disp=0 } .L6518: { loop_mode qppackdl,0,sm %b[55], %b[56], %b[1] shr_andd,1,sm %b[62], %r0, %r21, %b[67] stqp,2 %r4, %b[58], %b[19] qppackdl,3,sm %b[26], %b[50], %b[14] shr_andd,4,sm %b[62], %r1, %r20, %b[68] stqp,5 %r2, %b[58], %b[14] movad,0 area=15, ind=0, am=1, be=0, %b[62] movad,1 area=14, ind=0, am=1, be=0, %b[66] movad,2 area=15, ind=0, am=1, be=0, %b[60] movad,3 area=14, ind=0, am=1, be=0, %b[64] } { loop_mode qppackdl,1,sm %b[44], %b[49], %b[26] stqp,2 %r5, %b[58], %b[3] qppackdl,4,sm %b[25], %b[31], %b[31] stqp,5 %r6, %b[58], %b[16] movad,0 area=13, ind=0, am=1, be=0, %b[16] movad,1 area=12, ind=0, am=1, be=0, %b[25] movad,2 area=13, ind=0, am=1, be=0, %b[3] movad,3 area=12, ind=0, am=1, be=0, %b[19] } { loop_mode qppackdl,1,sm %b[63], %b[65], %b[33] stqp,2 %r7, %b[58], %b[28] qppackdl,4,sm %b[59], %b[61], %b[28] stqp,5 %r9, %b[58], %b[33] movad,0 area=11, ind=0, am=1, be=0, %b[49] movad,1 area=10, ind=0, am=1, be=0, %b[55] movad,2 area=11, ind=0, am=1, be=0, %b[44] movad,3 area=10, ind=0, am=1, be=0, %b[50] } { loop_mode qppackdl,1,sm %g18, %g19, %b[37] stqp,2 %r10, %b[58], %b[37] qppackdl,4,sm %g16, %g17, %b[32] stqp,5 %r11, %b[58], %b[32] movad,0 area=9, ind=0, am=1, be=0, %g17 movad,1 area=8, ind=0, am=1, be=0, %g19 movad,2 area=9, ind=0, am=1, be=0, %g16 movad,3 area=8, ind=0, am=1, be=0, %g18 } { loop_mode qppackdl,1,sm %b[52], %b[57], %b[36] stqp,2 %r12, %b[58], %b[41] qppackdl,4,sm %b[46], %b[51], %b[41] stqp,5 %r13, %b[58], %b[36] movad,0 area=7, ind=0, am=1, be=0, 
%b[59] movad,1 area=6, ind=0, am=1, be=0, %b[63] movad,2 area=7, ind=0, am=1, be=0, %b[57] movad,3 area=6, ind=0, am=1, be=0, %b[61] } { loop_mode addd,0,sm %b[4], 0x1, %b[2] ? %pcnt2 qppackdl,1,sm %b[21], %b[27], %b[5] stqp,2 %r14, %b[58], %b[40] ord,3,sm %b[68], %b[67], %b[65] qppackdl,4,sm %b[5], %b[18], %b[18] stqp,5 %r15, %b[58], %b[45] movad,0 area=5, ind=0, am=1, be=0, %b[27] movad,1 area=4, ind=0, am=1, be=0, %b[45] movad,2 area=5, ind=0, am=1, be=0, %b[21] movad,3 area=4, ind=0, am=1, be=0, %b[40] } { loop_mode bitrevd,0,sm %b[4], %b[60] qppackdl,1,sm %b[64], %b[66], %b[4] stqp,2 %r16, %b[58], %b[9] shld,3,sm %b[65], 0x3, %b[56] qppackdl,4,sm %b[60], %b[62], %b[9] stqp,5 %r17, %b[58], %b[22] movad,0 area=3, ind=0, am=1, be=0, %b[46] movad,1 area=2, ind=0, am=1, be=0, %b[52] movad,2 area=3, ind=0, am=1, be=0, %b[22] movad,3 area=2, ind=0, am=1, be=0, %b[51] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END qppackdl,0,sm %g22, %g23, %b[15] stqp,2 %r18, %b[58], %b[10] qppackdl,3,sm %g20, %g21, %b[10] stqp,5 %r19, %b[58], %b[15] movad,0 area=1, ind=0, am=1, be=0, %g23 movad,1 area=0, ind=0, am=1, be=0, %g21 movad,2 area=1, ind=0, am=1, be=0, %g22 movad,3 area=0, ind=0, am=1, be=0, %g20 }
Theoretical speed: 32 complex numbers per 8 cycles (32/8) = 32 bytes/cycle
Speed measurements

We see a slowdown at the beginning of the graph and a speedup at the end.
The overhead of setting up the second loop keeps the speedup from showing along the whole length of the graph.
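The split into two loops can be modelled in portable scalar C as well. Again, this is a sketch for checking the indexing under stated assumptions, not the optimized code: float complex replaces myComplex, bit_count is assumed even and at least 6, and the names digit_reverse4 and reverse_radix4_x32_model are hypothetical. Each pass touches only 32 of the 64 input streams and fills one half of every 64-element output block.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Reverse the lowest `digits` base-4 digits of i. */
static size_t digit_reverse4(size_t i, int digits)
{
    size_t r = 0;
    for (int d = 0; d < digits; ++d) {
        r = (r << 2) | (i & 3);
        i >>= 2;
    }
    return r;
}

/* Scalar model of the two-loop variant: 64 input streams in total,
   32 of them read in each pass. */
static void reverse_radix4_x32_model(int bit_count,
                                     const float complex *data_in,
                                     float complex *data_out)
{
    assert(bit_count >= 6 && bit_count % 2 == 0);
    size_t count  = (size_t)1 << bit_count;
    size_t stride = count / 64;                 /* length of one input stream */
    for (int pass = 0; pass < 2; ++pass) {      /* the two separate loops */
        size_t k_begin = 32 * (size_t)pass;
        for (size_t i = 0; i < stride; ++i) {
            /* the three top digits of i are zero: block is a multiple of 64 */
            size_t block = digit_reverse4(i, bit_count / 2);
            for (size_t k = k_begin; k < k_begin + 32; ++k) {
                /* three-digit base-4 reversal of k selects the input stream */
                size_t stream = digit_reverse4(k, 3);
                data_out[block + k] = data_in[stream * stride + i];
            }
        }
    }
}
```

In this model the first pass reads the streams whose lowest base-4 digit is 0 or 1 and the second pass reads those with lowest digit 2 or 3, which matches the stream names used in the two loops of reverse_radix4_x32.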
Summary for reverse_radix4


The winner can be considered to be either reverse_radix4_x16 or reverse_radix4_x32.
The FFT algorithm consists of one run of Reverse and several runs of Stage. The more Stage runs there are, the smaller the contribution of the Reverse speed to the overall FFT speed. So the speed of Reverse matters more for shorter inputs, where there are fewer Stage runs.
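As a rough illustration only, suppose (hypothetically) that every pass over the data costs the same. A Radix-4 FFT of N = 4^S points then makes one Reverse pass and S Stage passes, so the share of Reverse in the total time would be about:

```latex
\frac{T_{\mathrm{Reverse}}}{T_{\mathrm{FFT}}} \approx \frac{1}{1 + S}, \qquad S = \log_4 N
```

Under that assumption Reverse takes about 1/3 of the time for N = 16 and about 1/7 for N = 4096. The real passes do not cost the same, so these numbers only show the trend.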
For the Radix-4 FFT implementation we will use reverse_radix4_x16.
Later it can be replaced with reverse_radix4_x32 to see how the FFT speed changes.
Writing the Stage function
stage_radix2
Diagram of the Stage algorithm for the "radix-2" version.

1. stage_radix2_etalon
A reference variant used to check the other variants for correctness.
C code
void stage_radix2_etalon(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
    myComplex *x_in = &data_in[0];
    myComplex *y_in = &data_in[1];
    myComplex *c_in = coef;
    myComplex *out_add = &data_out[0];
    myComplex *out_sub = &data_out[data_count/2];
    #pragma ivdep
    #pragma unroll(1)
    // #pragma prefetch
    for(int64_t i = 0; i < data_count/2; ++i) {
        myComplex x = x_in[2*i];
        myComplex y = y_in[2*i];
        myComplex c = c_in[i];
        myComplex cy = complex_mul(c, y);
        out_add[i] = complex_add(x, cy);
        out_sub[i] = complex_sub(x, cy);
    }
}
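To experiment with this stage on any machine, a minimal portable sketch is given below. It is an assumption-laden model rather than the article's code: C99 float complex and its built-in operators stand in for myComplex, complex_mul, complex_add and complex_sub, and the name stage_radix2_model is hypothetical.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* One radix-2 stage: adjacent input pairs (x, y) are combined with a
   coefficient c, sums go to the first half of the output and
   differences to the second half. Assumes an even data_count >= 2. */
static void stage_radix2_model(int data_count,
                               const float complex *data_in,
                               float complex *data_out,
                               const float complex *coef)
{
    assert(data_count >= 2 && data_count % 2 == 0);
    size_t half = (size_t)data_count / 2;
    for (size_t i = 0; i < half; ++i) {
        float complex x  = data_in[2 * i];        /* even element */
        float complex y  = data_in[2 * i + 1];    /* odd element */
        float complex cy = coef[i] * y;           /* multiply by the coefficient */
        data_out[i]        = x + cy;              /* first half of the output */
        data_out[half + i] = x - cy;              /* second half of the output */
    }
}
```

With the input {1, 2, 3, 4} and the coefficients {1, -i} this model produces {3, 3-4i, -1, 3+4i}.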
Основной цикл на ассемблере
.L444: { fapb ct=1, dcd=0, fmt=3, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=2, asz=5, abs=0, disp=0 } .L120: { loop_mode fmuls,0,sm %b[67], %b[6], %b[37] fsubs,1,sm %b[46], %b[24], %b[58] staaw,2 %b[62], %aad1[ %aasti3 + _f32s,_lts0 0x4 ] fmul_adds,3,sm %b[55], %b[13], %b[43], %b[14] fadds,4,sm %b[46], %b[24], %b[57] staaw,5 %b[61], %aad2[ %aasti4 + _f32s,_lts0 0x4 ] movaw,0 area=0, ind=8, am=0, be=0, %b[0] movaw,1 area=0, ind=12, am=0, be=0, %b[1] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END fmuls,0,sm %b[67], %b[7], %b[62] fsubs,1,sm %b[35], %b[56], %b[70] staaw,2 %b[74], %aad1[ %aasti3 ] incr,2 %aaincr3 fmul_rsubs,3,sm %b[55], %b[12], %b[68], %b[46] fadds,4,sm %b[35], %b[56], %b[69] staaw,5 %b[73], %aad2[ %aasti4 ] incr,5 %aaincr3 movaw,0 area=0, ind=0, am=0, be=0, %b[13] movaw,1 area=0, ind=4, am=1, be=0, %b[24] movaw,2 area=0, ind=4, am=1, be=0, %b[61] movaw,3 area=0, ind=0, am=0, be=0, %b[43] }
Theoretical speed: 2 complex numbers in 2 cycles (2/2 × 8 bytes) = 8 bytes/cycle
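The "theoretical speed" figures in this article all follow the same arithmetic: the number of complex numbers handled per loop iteration, times 8 bytes per float complex value, divided by the number of cycles (wide instructions) in the loop body. A minimal sketch of that calculation is shown below; the helper name bytes_per_cycle is made up for illustration and is not part of the article's code.

```c
/* One float complex value is two 32-bit floats. */
enum { BYTES_PER_COMPLEX = 8 };

/* Theoretical throughput of a loop body that processes
 * complex_per_iter complex numbers every cycles_per_iter cycles.
 * Hypothetical helper, used only to reproduce the article's numbers. */
static double bytes_per_cycle(int complex_per_iter, int cycles_per_iter)
{
    return (double)complex_per_iter * BYTES_PER_COMPLEX / cycles_per_iter;
}
```

For example, bytes_per_cycle(2, 2) gives 8 for the loop above, and bytes_per_cycle(12, 5) gives 19.2, the figure quoted for the last radix-2 variant.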
Speed measurements

2. stage_radix2_etalon_unroll2
This version came about by accident.
Here the loop is unrolled by a factor of 2 using the unroll pragma.
It shows that the compiler is able to use vector instructions.
C code
void stage_radix2_etalon_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
    myComplex *x_in = &data_in[0];
    myComplex *y_in = &data_in[1];
    myComplex *c_in = coef;
    myComplex *out_add = &data_out[0];
    myComplex *out_sub = &data_out[data_count/2];

    #pragma ivdep
    #pragma unroll(2)
//  #pragma prefetch
    for(int64_t i = 0; i < data_count/2; ++i)
    {
        myComplex x = x_in[2*i];
        myComplex y = y_in[2*i];
        myComplex c = c_in[i];

        myComplex cy = complex_mul(c, y);
        out_add[i] = complex_add(x, cy);
        out_sub[i] = complex_sub(x, cy);
    }
}
Main loop in assembly
.L1266: { fapb ct=0, dcd=0, fmt=3, mrng=12, d=0, incr=1, ind=1, asz=4, abs=0, disp=20 fapb dpl=0, dcd=0, fmt=3, mrng=20, d=0, incr=1, ind=1, asz=5, abs=0, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=4, abs=16, disp=0 } .L463: { loop_mode pfmuls,0,sm %b[51], %b[32], %b[53] insfd,1,sm %b[28], %r8, %b[54], %b[1] pfmuls,2,sm %b[51], %b[13], %b[45] insfd,3,sm %b[23], %r8, %b[50], %b[0] pshufb,4,sm %b[9], %b[19], %r0, %b[56] pfadds,5,sm %b[33], %b[39], %b[10] movaw,1 area=0, ind=8, am=0, be=0, %b[38] movaw,3 area=0, ind=12, am=0, be=0, %b[44] } { loop_mode pfmul_rsubs,0,sm %b[5], %b[15], %b[55], %b[39] insfd,1,sm %b[20], %r8, %b[24], %b[23] pfmul_adds,2,sm %b[5], %b[34], %b[47], %b[33] insfd,3,sm %b[40], %r8, %b[46], %b[28] pshufb,4,sm %b[12], %b[16], %r0, %b[54] staad,5 %b[56], %aad1[ %aasti3 + _f32s,_lts0 0x8 ] movad,1 area=1, ind=0, am=0, be=0, %b[50] } { loop_mode pfsubs,0,sm %b[8], %b[43], %b[15] insfd,1,sm %b[9], %r8, %b[19], %b[57] pfsubs,2,sm %b[31], %b[37], %b[5] insfd,3,sm %b[12], %r8, %b[16], %b[55] staad,5 %b[54], %aad2[ %aasti4 + _f32s,_lts0 0x8 ] movad,0 area=1, ind=8, am=1, be=0, %b[24] movaw,1 area=0, ind=4, am=0, be=0, %b[34] movaw,2 area=0, ind=4, am=0, be=0, %b[20] movaw,3 area=0, ind=8, am=0, be=0, %b[40] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END pfadds,0,sm %b[8], %b[43], %b[12] insfd,1,sm %b[36], %r8, %b[42], %b[9] staad,2 %b[57], %aad1[ %aasti3 ] incr,2 %aaincr3 pshufb,4,sm %b[26], %b[52], %r0, %b[47] staad,5 %b[55], %aad2[ %aasti4 ] incr,5 %aaincr3 movaw,1 area=0, ind=0, am=1, be=0, %b[16] movaw,2 area=0, ind=0, am=1, be=0, %b[46] movaw,3 area=0, ind=16, am=0, be=0, %b[19] }
Theoretical speed: 4 complex numbers in 4 cycles (4/4 × 8 bytes) = 8 bytes/cycle
Speed measurements

We see a speedup.
The theoretical speed is unchanged compared with the reference version, but the measured speed went up. The assembly shows that the compiler inserted vector instructions.
3. stage_radix2_simd64
We will not try direct vectorization now; it will be examined separately later.
For now, let's try using SIMD64 vector instructions to perform several multiplications with a single instruction.
The multiplication of two complex numbers c and y is done as follows:
read the complex numbers c and y from memory into 64-bit registers (the real part lands in one half of the register, the imaginary part in the other half)
flip the sign of the imaginary part of c with xor (giving conj_c) and multiply conj_c by y element-wise: this gives an intermediate result for the real part of cy (to finish the real part, the two halves of the register must be added together)
swap the real and imaginary parts of c with shuf (giving swap_c) and multiply swap_c by y element-wise: this gives an intermediate result for the imaginary part of cy (to finish the imaginary part, the two halves of the register must be added together)
add the halves of the two intermediate registers with the vector horizontal addition fhadd, which yields cy
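The steps above can be checked on an ordinary machine with a small scalar model. The sketch below does not use the e2k builtins; the types cplx and lanes2 and the helper names are stand-ins invented for illustration, and the lane layout (real part in the lower half, imaginary part in the upper half) is an assumption based on the sign bit 63 being flipped to negate the imaginary part.

```c
/* cplx stands in for the article's myComplex (assumed: two floats). */
typedef struct { float re, im; } cplx;

/* Model of a 64-bit register holding two float lanes:
 * lo = bits 0..31 (real part), hi = bits 32..63 (imaginary part). */
typedef struct { float lo, hi; } lanes2;

/* Element-wise multiply, the role played by pfmuls. */
static lanes2 lanes_mul(lanes2 a, lanes2 b)
{
    return (lanes2){ a.lo * b.lo, a.hi * b.hi };
}

/* Horizontal add, the role played by pfhadds:
 * lane 0 = sum of the halves of a, lane 1 = sum of the halves of b. */
static lanes2 lanes_hadd(lanes2 a, lanes2 b)
{
    return (lanes2){ a.lo + a.hi, b.lo + b.hi };
}

/* Complex product c * y built from the conj/swap/hadd steps. */
static cplx mul_conj_swap(cplx c, cplx y)
{
    lanes2 yv      = { y.re, y.im };
    lanes2 conj_c  = { c.re, -c.im };         /* xor with the sign bit of the upper lane */
    lanes2 swap_c  = { c.im, c.re };          /* shuffle: swap real and imaginary parts */
    lanes2 cy_real = lanes_mul(conj_c, yv);   /* (cr*yr, -ci*yi) */
    lanes2 cy_imag = lanes_mul(swap_c, yv);   /* (ci*yr,  cr*yi) */
    lanes2 cy      = lanes_hadd(cy_real, cy_imag);
    return (cplx){ cy.lo, cy.hi };
}
```

With c = 1 + 2i and y = 3 + 4i the model returns -5 + 10i, which matches the ordinary complex product.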
C code
void stage_radix2_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
    uint64_t *x_in = (uint64_t*)&data_in[0];
    uint64_t *y_in = (uint64_t*)&data_in[1];
    uint64_t *c_in = (uint64_t*)coef;
    uint64_t *out_add = (uint64_t*)&data_out[0];
    uint64_t *out_sub = (uint64_t*)&data_out[data_count/2];

    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/2; ++i)
    {
        uint64_t x = x_in[2*i];
        uint64_t y = y_in[2*i];
        uint64_t c = c_in[i];

        uint64_t conj_c = __builtin_e2k_pxord(c, 1LL<<63);                 // flip the sign of imag(c)
        uint64_t swap_c = __builtin_e2k_pshufb(0, c, 0x0302010007060504);  // swap real(c) and imag(c)
        uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
        uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);
        uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);             // add the halves: cy = c*y

        out_add[i] = __builtin_e2k_pfadds(x, cy);
        out_sub[i] = __builtin_e2k_pfsubs(x, cy);
    }
}
Main loop in assembly
.L1588: { fapb ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=5, abs=0, disp=0 } .L1388: { loop_mode pfmuls,0,sm %b[35], %b[28], %b[18] pfmul_hadds,1,sm %b[33], %b[32], %b[22], %b[0] pshufb,4,sm 0x0, %b[7], %r5, %b[29] pfadds,5,sm %b[27], %b[10], %b[12] movad,3 area=0, ind=0, am=1, be=0, %b[1] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END xord,0,sm %b[5], %r0, %b[33] pfsubs,1,sm %b[27], %b[10], %b[32] staad,2 %b[36], %aad1[ %aasti3 ] incr,2 %aaincr0 staad,5 %b[16], %aad2[ %aasti4 ] incr,5 %aaincr0 movad,0 area=0, ind=8, am=1, be=0, %b[22] movad,1 area=0, ind=0, am=0, be=0, %b[7] }
After compilation we see that the loop consists of 8 instructions: xor, shuf, fmul, fmul_fhadd, fadd, fsub, std, std. The fhadd instruction turned out to be "fused" with one of the fmul instructions (it turns out Elbrus can do that).
Theoretical speed: 2 complex numbers in 2 cycles (2/2 × 8 bytes) = 8 bytes/cycle
Speed measurements

We see a small speedup.
One cycle fits 6 instructions, and here we have 8, so the loop body occupies 8/6 of a cycle. Ideally, unrolling the loop by a factor of 3 gives the densest packing (3 * 8/6 = 4 cycles). We will unroll using the unroll pragma.
But first let's look at unrolling by a factor of 2 (2 * 8/6 ≈ 2.67, rounded up to 3 cycles).
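The packing estimate used here is a ceiling division: the instruction count of one iteration, times the unroll factor, divided by the 6 instructions that fit into one cycle and rounded up. The sketch below is only an illustration of that arithmetic under the "6 instructions per cycle" assumption stated in the text; the helper name cycles_for_unroll is hypothetical, and the real schedule is of course decided by the compiler.

```c
/* Assumption taken from the text: one cycle holds up to 6 instructions. */
enum { OPS_PER_CYCLE = 6 };

/* Best-case number of cycles for a loop body with ops_per_iter
 * instructions after unrolling it `unroll` times. */
static int cycles_for_unroll(int ops_per_iter, int unroll)
{
    int ops = ops_per_iter * unroll;
    return (ops + OPS_PER_CYCLE - 1) / OPS_PER_CYCLE;  /* ceiling division */
}
```

For the 8-instruction SIMD64 loop this gives 2, 3 and 4 cycles for unroll factors 1, 2 and 3, so only the factor of 3 fills every slot.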
4. stage_radix2_simd64_unroll2
Here the loop is unrolled by a factor of 2 using the unroll pragma.
C code
void stage_radix2_simd64_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
    uint64_t *x_in = (uint64_t*)&data_in[0];
    uint64_t *y_in = (uint64_t*)&data_in[1];
    uint64_t *c_in = (uint64_t*)coef;
    uint64_t *out_add = (uint64_t*)&data_out[0];
    uint64_t *out_sub = (uint64_t*)&data_out[data_count/2];

    #pragma ivdep
    #pragma unroll(2)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/2; ++i)
    {
        uint64_t x = x_in[2*i];
        uint64_t y = y_in[2*i];
        uint64_t c = c_in[i];

        uint64_t conj_c = __builtin_e2k_pxord(c, 1LL<<63);
        uint64_t swap_c = __builtin_e2k_pshufb(0, c, 0x0302010007060504);
        uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
        uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);
        uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);

        out_add[i] = __builtin_e2k_pfadds(x, cy);
        out_sub[i] = __builtin_e2k_pfsubs(x, cy);
    }
}
Main loop in assembly
.L2152: { fapb ct=1, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=5, abs=0, disp=0 } .L1710: { loop_mode pfmul_hadds,0,sm %b[51], %b[10], %b[36], %b[11] pfmuls,1,sm %b[57], %b[7], %b[44] pfsubs,2,sm %b[41], %b[25], %b[45] xord,4,sm %b[42], %r0, %b[52] pfadds,5,sm %b[41], %b[25], %b[48] movad,0 area=0, ind=24, am=0, be=0, %b[1] movad,1 area=0, ind=8, am=0, be=0, %b[0] } { loop_mode pshufb,0,sm 0x0, %b[32], %r9, %b[41] pfsubs,1,sm %b[26], %b[17], %b[51] staad,2 %b[47], %aad1[ %aasti3 + _f32s,_lts0 0x8 ] pfadds,3,sm %b[26], %b[17], %b[54] xord,4,sm %b[30], %r0, %b[55] staad,5 %b[50], %aad2[ %aasti4 + _f32s,_lts0 0x8 ] movad,0 area=0, ind=0, am=1, be=0, %b[10] movad,1 area=0, ind=16, am=0, be=0, %b[25] movad,3 area=0, ind=0, am=0, be=0, %b[36] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END pfmul_hadds,0,sm %b[43], %b[9], %b[46], %b[17] pfmuls,1,sm %b[52], %b[6], %b[32] staad,2 %b[53], %aad1[ %aasti3 ] incr,2 %aaincr3 pshufb,4,sm 0x0, %b[42], %r9, %b[47] staad,5 %b[56], %aad2[ %aasti4 ] incr,5 %aaincr3 movad,3 area=0, ind=8, am=1, be=0, %b[26] }
Theoretical speed: 4 complex numbers in 3 cycles (4/3 × 8 bytes) = 10.67 bytes/cycle
Speed measurements

We see a speedup.
Now let's look at unrolling by a factor of 3.
5. stage_radix2_simd64_unroll3
Here the loop is unrolled by a factor of 3 using the unroll pragma.
C code
void stage_radix2_simd64_unroll3(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
    uint64_t *x_in = (uint64_t*)&data_in[0];
    uint64_t *y_in = (uint64_t*)&data_in[1];
    uint64_t *c_in = (uint64_t*)coef;
    uint64_t *out_add = (uint64_t*)&data_out[0];
    uint64_t *out_sub = (uint64_t*)&data_out[data_count/2];

    #pragma ivdep
    #pragma unroll(3)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/2; ++i)
    {
        uint64_t x = x_in[2*i];
        uint64_t y = y_in[2*i];
        uint64_t c = c_in[i];

        uint64_t conj_c = __builtin_e2k_pxord(c, 1LL<<63);
        uint64_t swap_c = __builtin_e2k_pshufb(0, c, 0x0302010007060504);
        uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
        uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);
        uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);

        out_add[i] = __builtin_e2k_pfadds(x, cy);
        out_sub[i] = __builtin_e2k_pfsubs(x, cy);
    }
}
Main loop in assembly
.L2815: { fapb ct=0, dcd=0, fmt=4, mrng=24, d=0, incr=2, ind=2, asz=4, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=4, abs=16, disp=32 } .L2177: { loop_mode pfmuls,0,sm %b[61], %b[23], %b[41] pfmuls,1,sm %b[73], %b[12], %b[1] xord,2,sm %b[57], %r0, %b[59] pfmul_hadds,3,sm %b[78], %b[53], %b[28], %b[0] xord,4,sm %b[52], %r0, %b[66] pfadds,5,sm %b[48], %b[34], %b[58] } { loop_mode pfmul_hadds,0,sm %b[67], %b[25], %b[43], %b[28] pfsubs,1,sm %b[48], %b[34], %b[68] staad,2 %b[70], %aad1[ %aasti3 + _f32s,_lts0 0x10 ] pfsubs,3,sm %b[17], %b[4], %b[61] xord,4,sm %b[20], %r0, %b[71] staad,5 %b[60], %aad2[ %aasti4 + _f32s,_lts0 0x10 ] movad,1 area=0, ind=16, am=0, be=0, %b[53] } { loop_mode pfmul_hadds,0,sm %b[72], %b[14], %b[3], %b[60] pfsubs,1,sm %b[39], %b[64], %b[73] staad,2 %b[75], %aad1[ %aasti3 + _f32s,_lts0 0x8 ] pfadds,3,sm %b[17], %b[4], %b[67] pshufb,4,sm 0x0, %b[22], %r10, %b[70] staad,5 %b[63], %aad1[ %aasti3 ] incr,5 %aaincr3 movad,0 area=0, ind=0, am=0, be=0, %b[48] movad,1 area=1, ind=0, am=0, be=0, %b[34] movad,2 area=0, ind=16, am=0, be=0, %b[25] movad,3 area=0, ind=8, am=0, be=0, %b[43] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END pfmuls,0,sm %b[66], %b[47], %b[22] pfadds,1,sm %b[39], %b[64], %b[72] staad,2 %b[74], %aad2[ %aasti4 + _f32s,_lts0 0x8 ] pshufb,3,sm 0x0, %b[57], %r10, %b[63] pshufb,4,sm 0x0, %b[56], %r10, %b[76] staad,5 %b[69], %aad2[ %aasti4 ] incr,5 %aaincr3 movad,0 area=1, ind=8, am=1, be=0, %b[17] movad,1 area=0, ind=8, am=1, be=0, %b[14] movad,2 area=0, ind=0, am=1, be=0, %b[3] movad,3 area=0, ind=24, am=0, be=0, %b[4] }
Theoretical speed: 6 complex numbers in 4 cycles (6/4 × 8 bytes) = 12 bytes/cycle
Speed measurements

We see a speedup in the middle of the graph, but a slowdown at the start and at the end.
6. stage_radix2_simd128
Now let's try SIMD128 vector instructions, by analogy with SIMD64.
Unlike SIMD64, here the data has to be shuffled at the start and at the end of the loop body with the shuf instruction. This is needed so that one 128-bit register holds the data for two x values and another holds the data for two y values.
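The effect of the initial shuffle can be pictured with a scalar model. Two consecutive 128-bit loads bring in the complex numbers x0, y0 and x1, y1, and the shuffle regroups them into {x0, x1} and {y0, y1}. The sketch below models only this effect, not the exact semantics of the shuffle builtin; cplx and deinterleave are names invented for illustration.

```c
/* cplx stands in for the article's myComplex (assumed: two floats). */
typedef struct { float re, im; } cplx;

/* in[0..3] = x0, y0, x1, y1, i.e. two 128-bit loads of two complex
 * numbers each. After the call x = {x0, x1} and y = {y0, y1}. */
static void deinterleave(const cplx in[4], cplx x[2], cplx y[2])
{
    x[0] = in[0];   /* x0 comes from the first load  */
    x[1] = in[2];   /* x1 comes from the second load */
    y[0] = in[1];   /* y0 comes from the first load  */
    y[1] = in[3];   /* y1 comes from the second load */
}
```

Once the data is grouped this way, one vector multiply handles two butterflies at a time, which is where the factor of 2 over SIMD64 comes from.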
C code
void stage_radix2_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
    __v2di *xy0_in = (__v2di*)&data_in[0];
    __v2di *xy1_in = (__v2di*)&data_in[2];
    __v2di *c_in = (__v2di*)coef;
    __v2di *out_add = (__v2di*)&data_out[0];
    __v2di *out_sub = (__v2di*)&data_out[data_count/2];

    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/4; ++i)
    {
        __v2di xy0 = xy0_in[2*i];
        __v2di xy1 = xy1_in[2*i];
        __v2di c = c_in[i];

        // regroup the loaded data into two x values and two y values
        __v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

        __v2di conj_c = __builtin_e2k_qpxor(c, (__v2di){1LL<<63, 1LL<<63});
        __v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di cy_real = __builtin_e2k_qpfmuls(conj_c, y);
        __v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
        __v2di cy_rrii = __builtin_e2k_qpfhadds(cy_real, cy_imag);
        // final shuffle: put real and imaginary parts back next to each other
        __v2di cy = __builtin_e2k_qpshufb(cy_rrii, cy_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

        out_add[i] = __builtin_e2k_qpfadds(x, cy);
        out_sub[i] = __builtin_e2k_qpfsubs(x, cy);
    }
}
Main loop in assembly
.L3099: { fapb ct=1, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=5, abs=0, disp=0 } .L2840: { loop_mode qpshufb,0,sm %b[36], %b[45], %r6, %b[0] qpfmuls,1,sm %b[28], %b[5], %b[18] qpfsubs,2,sm %b[16], %b[47], %b[11] qpshufb,3,sm %b[34], %b[43], %r7, %b[1] qpxor,4,sm %b[23], %r0, %b[24] staaqp,5 %b[17], %aad1[ %aasti3 ] incr,5 %aaincr0 } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END qpshufb,0,sm %b[35], %b[35], %r5, %b[45] qpfmul_hadds,1,sm %b[42], %b[9], %b[22], %b[27] qpfadds,2,sm %b[16], %b[47], %b[44] qpshufb,4,sm %b[25], %b[25], %r9, %b[36] staaqp,5 %b[50], %aad2[ %aasti4 ] incr,5 %aaincr0 movaqp,0 area=0, ind=0, am=0, be=0, %b[37] movaqp,1 area=0, ind=16, am=1, be=0, %b[28] movaqp,3 area=0, ind=0, am=1, be=0, %b[17] }
After compilation we see that the loop consists of 11 instructions (the same 8 instructions as in the SIMD64 version, plus 3 extra shuf instructions).
Theoretical speed: 4 complex numbers in 2 cycles (4/2 × 8 bytes) = 16 bytes/cycle
Speed measurements

We see a speedup.
Before moving on to unrolling, note that the loop can be made one instruction shorter. Later this will allow more efficient unrolling.
7. stage_radix2_simd128_noConj
Let's take advantage of the fact that Elbrus can fuse certain instructions.
We drop the creation of conj_c (removing the xor instruction) and use fhsub to obtain the real part of cy from its intermediate result. The imaginary part is still obtained with fhadd, as before. Both of these instructions are fused with the two fmul instructions, so they come "for free". The final merge into a single complex number is done in the final shuf, together with the data shuffle that is already there.
This trick was not possible in the SIMD64 version, because there was no final shuf there.
(The final shuf had to be replaced with perm; the two instructions are analogous.)
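The idea can again be checked with a scalar model: the real part is a horizontal difference of the element-wise product c * y, the imaginary part is a horizontal sum of swap_c * y, and a final interleave puts the parts back together. The sketch below handles two complex numbers at once, as one 128-bit register would; cplx and mul_no_conj_x2 are hypothetical names, and the lane order inside the real registers is not modelled.

```c
/* cplx stands in for the article's myComplex (assumed: two floats). */
typedef struct { float re, im; } cplx;

/* Model of the noConj multiply for two complex pairs at once:
 * cy[k] = c[k] * y[k] for k = 0, 1, without negating imag(c) first. */
static void mul_no_conj_x2(const cplx c[2], const cplx y[2], cplx cy[2])
{
    float rr[2], ii[2];
    for (int k = 0; k < 2; ++k) {
        /* element-wise products c * y and swap_c * y */
        float cr_yr = c[k].re * y[k].re;
        float ci_yi = c[k].im * y[k].im;
        float ci_yr = c[k].im * y[k].re;
        float cr_yi = c[k].re * y[k].im;
        rr[k] = cr_yr - ci_yi;   /* horizontal subtract (fhsub), fused with fmul */
        ii[k] = ci_yr + cr_yi;   /* horizontal add (fhadd), fused with fmul */
    }
    /* final permute: interleave real and imaginary parts */
    for (int k = 0; k < 2; ++k)
        cy[k] = (cplx){ rr[k], ii[k] };
}
```

The subtraction in fhsub supplies the minus sign that the xor used to provide, which is why the separate conj_c step is no longer needed.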
C code
void stage_radix2_simd128_noConj(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
    __v2di *xy0_in = (__v2di*)&data_in[0];
    __v2di *xy1_in = (__v2di*)&data_in[2];
    __v2di *c_in = (__v2di*)coef;
    __v2di *out_add = (__v2di*)&data_out[0];
    __v2di *out_sub = (__v2di*)&data_out[data_count/2];

    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/4; ++i)
    {
        __v2di xy0 = xy0_in[2*i];
        __v2di xy1 = xy1_in[2*i];
        __v2di c = c_in[i];

        __v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

        __v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di cy_real = __builtin_e2k_qpfmuls(     c, y);
        __v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
        __v2di cy_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy_real);   // real parts via horizontal subtract
        __v2di cy_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy_imag);   // imaginary parts via horizontal add
        // perm merges real and imaginary parts into complex numbers
        __v2di cy = __builtin_e2k_qppermb(cy_ii, cy_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});

        out_add[i] = __builtin_e2k_qpfadds(x, cy);
        out_sub[i] = __builtin_e2k_qpfsubs(x, cy);
    }
}
Main loop in assembly
.L3385: { fapb ct=1, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=5, abs=0, disp=0 } .L3124: { loop_mode qpfmul_hsubs,0,sm %b[25], %b[28], %r9, %b[16] qpfmul_hadds,1,sm %b[27], %b[28], %r9, %b[1] qpfsubs,2,sm %b[14], %b[42], %b[37] qpshufb,3,sm %b[35], %b[36], %r6, %b[0] qppermb,4,sm %b[11], %b[26], %r7, %b[38] staaqp,5 %b[43], %aad1[ %aasti3 ] incr,5 %aaincr0 movaqp,0 area=0, ind=0, am=0, be=0, %b[30] movaqp,1 area=0, ind=16, am=1, be=0, %b[29] movaqp,3 area=0, ind=0, am=1, be=0, %b[19] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END qpshufb,0,sm %b[33], %b[34], %r0, %b[26] qpshufb,1,sm %b[23], %b[23], %r5, %b[25] qpfadds,2,sm %b[14], %b[42], %b[11] staaqp,5 %b[17], %aad2[ %aasti4 ] incr,5 %aaincr0 }
After compilation we see that the loop consists of 10 instructions (the xor instruction is gone).
Theoretical speed: 4 complex numbers in 2 cycles (4/2 × 8 bytes) = 16 bytes/cycle
Speed measurements

The speed did not change compared with the previous version.
Now we move on to unrolling.
Right now the loop body occupies 10/6 of a cycle. Unrolling by a factor of 2 gives 2 * 10/6 ≈ 3.33, rounded up to 4 cycles, which is nothing interesting (one loop iteration processes twice as much data in twice as many cycles).
So we go straight to unrolling by a factor of 3 (3 * 10/6 = 5 cycles).
8. stage_radix2_simd128_noConj_unroll3
Here the loop is unrolled by a factor of 3 using the unroll pragma.
C code
void stage_radix2_simd128_noConj_unroll3(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
    __v2di *xy0_in = (__v2di*)&data_in[0];
    __v2di *xy1_in = (__v2di*)&data_in[2];
    __v2di *c_in = (__v2di*)coef;
    __v2di *out_add = (__v2di*)&data_out[0];
    __v2di *out_sub = (__v2di*)&data_out[data_count/2];

    #pragma ivdep
    #pragma unroll(3)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/4; ++i)
    {
        __v2di xy0 = xy0_in[2*i];
        __v2di xy1 = xy1_in[2*i];
        __v2di c = c_in[i];

        __v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

        __v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di cy_real = __builtin_e2k_qpfmuls(     c, y);
        __v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
        __v2di cy_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy_real);
        __v2di cy_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy_imag);
        __v2di cy = __builtin_e2k_qppermb(cy_ii, cy_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});

        out_add[i] = __builtin_e2k_qpfadds(x, cy);
        out_sub[i] = __builtin_e2k_qpfsubs(x, cy);
    }
}
Main loop in assembly
.L3932: { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=8, disp=64 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=4, abs=16, disp=0 } { fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=2, asz=4, abs=16, disp=32 } .L3410: { loop_mode qpfmul_hsubs,0,sm %b[19], %b[55], %r14, %b[0] qpshufb,1,sm %b[26], %b[23], %r12, %b[15] qpfmul_hsubs,2,sm %b[31], %b[63], %r14, %b[1] qpshufb,3,sm %b[52], %b[53], %r0, %b[18] qpfadds,4,sm %b[66], %b[69], %b[6] qpfadds,5,sm %b[64], %b[42], %b[68] } { loop_mode qpfmul_hsubs,0,sm %b[34], %b[15], %r14, %b[52] qpshufb,1,sm %b[34], %b[34], %r11, %b[61] staaqp,2 %b[35], %aad1[ %aasti3 + _f32s,_lts0 0x10 ] qpshufb,3,sm %b[44], %b[45], %r12, %b[53] qpfsubs,4,sm %b[62], %b[40], %b[57] qpfsubs,5,sm %b[18], %b[65], %b[58] movaqp,0 area=0, ind=0, am=0, be=0, %b[43] movaqp,1 area=0, ind=16, am=1, be=0, %b[42] } { loop_mode qpfmul_hadds,0,sm %b[61], %b[15], %r14, %b[34] qpshufb,1,sm %b[31], %b[31], %r11, %b[66] staaqp,2 %b[60], %aad1[ %aasti3 ] qpshufb,3,sm %b[30], %b[27], %r0, %b[64] qppermb,4,sm %b[38], %b[56], %r13, %b[67] qpfadds,5,sm %b[18], %b[65], %b[35] } { loop_mode qpfmul_hadds,0,sm %b[66], %b[63], %r14, %b[18] qpshufb,1,sm %b[19], %b[19], %r11, %b[56] staaqp,2 %b[8], %aad2[ %aasti4 + _f32s,_lts0 0x10 ] qppermb,3,sm %b[22], %b[5], %r13, %b[38] qpfsubs,4,sm %b[64], %b[67], %b[31] staaqp,5 %b[37], %aad2[ %aasti4 ] movaqp,1 area=2, ind=0, am=1, be=0, %b[27] movaqp,2 area=1, ind=16, am=1, be=0, %b[30] movaqp,3 area=1, ind=0, am=0, be=0, %b[15] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? 
%NOT_LOOP_END qpfmul_hadds,0,sm %b[56], %b[55], %r14, %b[37] qpshufb,1,sm %b[10], %b[7], %r12, %b[61] staaqp,2 %b[59], %aad1[ %aasti3 + _f32s,_lts0 0x20 ] incr,2 %aaincr3 qpshufb,3,sm %b[16], %b[13], %r0, %b[60] qppermb,4,sm %b[41], %b[4], %r13, %b[63] staaqp,5 %b[68], %aad2[ %aasti4 + _f32s,_lts0 0x20 ] incr,5 %aaincr3 movaqp,0 area=1, ind=0, am=0, be=0, %b[5] movaqp,1 area=1, ind=16, am=1, be=0, %b[8] movaqp,2 area=0, ind=0, am=0, be=0, %b[19] movaqp,3 area=0, ind=16, am=1, be=0, %b[22] }
Theoretical speed: 12 complex numbers in 5 cycles (12/5 × 8 bytes) = 19.2 bytes/cycle
Speed measurements

We see a speedup in the middle of the graph, but a slowdown at the start and at the end.
Summary for stage_radix2


The FFT graph is located here.
stage_radix2_2x
Diagram of the Stage algorithm for the "radix-2" 2x version.

One pass of stage_radix2_2x does the same work as 2 passes of stage_radix2. For that reason the speed of stage_radix2_2x is multiplied by 2, to make it easier to compare with stage_radix2 (this is noted on the graph axis and in the console output).
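The claim that one fused pass equals two ordinary passes can be checked directly. The sketch below mirrors the indexing of stage_radix2_etalon and stage_radix2_2x_etalon in portable C and compares the two routes on small test data. The type cplx, the helpers c_add, c_sub, c_mul and the function names are stand-ins invented for illustration, and the coefficients are arbitrary values rather than real FFT twiddle factors.

```c
/* cplx stands in for the article's myComplex (assumed: two floats). */
typedef struct { float re, im; } cplx;

static cplx c_add(cplx a, cplx b) { return (cplx){ a.re + b.re, a.im + b.im }; }
static cplx c_sub(cplx a, cplx b) { return (cplx){ a.re - b.re, a.im - b.im }; }
static cplx c_mul(cplx a, cplx b)
{
    return (cplx){ a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
}

/* One radix-2 pass with the indexing of stage_radix2_etalon. */
static void pass_radix2(int n, const cplx *in, cplx *out, const cplx *coef)
{
    for (int i = 0; i < n / 2; ++i) {
        cplx x = in[2 * i], y = in[2 * i + 1];
        cplx cy = c_mul(coef[i], y);
        out[i]         = c_add(x, cy);
        out[n / 2 + i] = c_sub(x, cy);
    }
}

/* Fused double pass with the indexing of stage_radix2_2x_etalon. */
static void pass_radix2_2x(int n, const cplx *in, cplx *out,
                           const cplx *coef_a, const cplx *coef_b)
{
    for (int i = 0; i < n / 4; ++i) {
        cplx x0 = in[4 * i],     y0 = in[4 * i + 1];
        cplx x1 = in[4 * i + 2], y1 = in[4 * i + 3];
        cplx cy0 = c_mul(coef_a[2 * i],     y0);
        cplx cy1 = c_mul(coef_a[2 * i + 1], y1);
        cplx add0 = c_add(x0, cy0), sub0 = c_sub(x0, cy0);
        cplx add1 = c_add(x1, cy1), sub1 = c_sub(x1, cy1);
        /* the second butterfly layer works on the results of the first */
        cy0 = c_mul(coef_b[i],         add1);
        cy1 = c_mul(coef_b[n / 4 + i], sub1);
        out[0 * n / 4 + i] = c_add(add0, cy0);
        out[2 * n / 4 + i] = c_sub(add0, cy0);
        out[1 * n / 4 + i] = c_add(sub0, cy1);
        out[3 * n / 4 + i] = c_sub(sub0, cy1);
    }
}

/* Largest component-wise difference between the two routes on test data. */
static float max_diff_2x_vs_two_passes(void)
{
    enum { N = 16 };
    cplx in[N], mid[N], two[N], fused[N], ca[N / 2], cb[N / 2];
    for (int k = 0; k < N; ++k)
        in[k] = (cplx){ (float)(k + 1), (float)(N - k) };
    for (int k = 0; k < N / 2; ++k) {      /* arbitrary values, not real twiddles */
        ca[k] = (cplx){ 0.5f * (k + 1), -0.25f * k };
        cb[k] = (cplx){ 0.25f * k, 0.5f };
    }
    pass_radix2(N, in, mid, ca);           /* route 1: two ordinary passes */
    pass_radix2(N, mid, two, cb);
    pass_radix2_2x(N, in, fused, ca, cb);  /* route 2: one fused pass */
    float worst = 0.0f;
    for (int k = 0; k < N; ++k) {
        float dr = two[k].re - fused[k].re, di = two[k].im - fused[k].im;
        if (dr < 0) dr = -dr;
        if (di < 0) di = -di;
        if (dr > worst) worst = dr;
        if (di > worst) worst = di;
    }
    return worst;
}
```

The fused version never writes the intermediate array to memory, which is the point of the 2x variant: the results of the first butterfly layer stay in registers and feed the second layer directly.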
1. stage_radix2_2x_etalon
Here the stage_radix2_etalon algorithm is manually unrolled by a factor of 2.
C code
void stage_radix2_2x_etalon(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef_a, myComplex *coef_b)
{
    myComplex *x0_in = &data_in[0];
    myComplex *y0_in = &data_in[1];
    myComplex *x1_in = &data_in[2];
    myComplex *y1_in = &data_in[3];
    myComplex *c0a_in = &coef_a[0];
    myComplex *c1a_in = &coef_a[1];
    myComplex *c0b_in = &coef_b[0];
    myComplex *c1b_in = &coef_b[data_count/4];
    myComplex *out_add0 = &data_out[0*data_count/4];
    myComplex *out_add1 = &data_out[1*data_count/4];
    myComplex *out_sub0 = &data_out[2*data_count/4];
    myComplex *out_sub1 = &data_out[3*data_count/4];

    #pragma ivdep
    #pragma unroll(1)
//  #pragma prefetch
    for(int64_t i = 0; i < data_count/4; ++i)
    {
        // first butterfly layer
        myComplex x0 = x0_in[4*i];
        myComplex y0 = y0_in[4*i];
        myComplex c0 = c0a_in[2*i];
        myComplex x1 = x1_in[4*i];
        myComplex y1 = y1_in[4*i];
        myComplex c1 = c1a_in[2*i];

        myComplex cy0 = complex_mul(c0, y0);
        myComplex cy1 = complex_mul(c1, y1);
        myComplex add0 = complex_add(x0, cy0);
        myComplex sub0 = complex_sub(x0, cy0);
        myComplex add1 = complex_add(x1, cy1);
        myComplex sub1 = complex_sub(x1, cy1);

        // second butterfly layer, fed by the results of the first
        x0 = add0; y0 = add1; c0 = c0b_in[i];
        x1 = sub0; y1 = sub1; c1 = c1b_in[i];

        cy0 = complex_mul(c0, y0);
        cy1 = complex_mul(c1, y1);
        out_add0[i] = complex_add(x0, cy0);
        out_sub0[i] = complex_sub(x0, cy0);
        out_add1[i] = complex_add(x1, cy1);
        out_sub1[i] = complex_sub(x1, cy1);
    }
}
Main loop in assembly
.L965: { fapb ct=0, dcd=0, fmt=3, mrng=16, d=0, incr=3, ind=2, asz=3, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0 } { fapb ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=4, asz=3, abs=8, disp=0 } { fapb ct=1, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=3, asz=4, abs=16, disp=0 } .L259: { loop_mode fmul_adds,0,sm %b[27], %b[84], %b[93], %b[45] fsub_adds,1,sm %b[15], %b[79], %b[88], %b[0] fsub_rsubs,2,sm %b[15], %b[79], %b[88], %b[1] fmuls,3,sm %b[56], %b[74], %b[90] fmuls,4,sm %b[45], %b[34], %b[89] fmuls,5,sm %b[68], %b[65], %b[88] } { loop_mode fmul_rsubs,0,sm %b[24], %b[67], %b[95], %b[56] fadd_adds,1,sm %b[15], %b[79], %b[47], %b[24] fadd_rsubs,2,sm %b[15], %b[79], %b[47], %b[47] fmul_rsubs,3,sm %b[73], %b[74], %b[94], %b[15] fmuls,5,sm %b[63], %b[51], %b[91] } { loop_mode fmul_rsubs,0,sm %b[27], %b[53], %b[96], %b[58] fsub_adds,1,sm %b[14], %b[83], %b[58], %b[27] fsub_rsubs,2,sm %b[14], %b[83], %b[58], %b[53] fmuls,4,sm %b[54], %b[50], %b[92] fmuls,5,sm %b[68], %b[85], %b[93] } { loop_mode fadd_adds,0,sm %b[14], %b[83], %b[60], %b[67] fadd_rsubs,1,sm %b[14], %b[83], %b[60], %b[68] staaw,2 %b[2], %aad2[ %aasti6 + _f32s,_lts0 0x4 ] fsubs,3,sm %b[48], %b[17], %b[63] fmuls,4,sm %b[63], %b[82], %b[94] staaw,5 %b[3], %aad1[ %aasti5 + _f32s,_lts0 0x4 ] movaw,0 area=2, ind=0, am=0, be=0, %b[14] movaw,1 area=2, ind=4, am=1, be=0, %b[60] movaw,2 area=0, ind=0, am=0, be=0, %b[2] movaw,3 area=0, ind=4, am=0, be=0, %b[3] } { loop_mode staaw,2 %b[26], %aad4[ %aasti8 + _f32s,_lts0 0x4 ] fmul_adds,3,sm %b[73], %b[52], %b[90], %b[73] fadds,4,sm %b[48], %b[17], %b[49] staaw,5 %b[49], %aad3[ %aasti7 + _f32s,_lts0 0x4 ] movaw,0 area=1, ind=0, am=0, be=0, %b[17] movaw,1 area=0, ind=12, am=0, be=0, %b[52] movaw,2 area=0, ind=8, am=0, be=0, %b[26] movaw,3 area=0, ind=28, am=0, be=0, %b[48] } { loop_mode fmul_rsubs,1,sm %b[42], %b[34], %b[87], %b[79] staaw,2 %b[29], %aad2[ %aasti6 ] incr,2 %aaincr4 fsubs,4,sm %b[80], %b[75], %b[83] staaw,5 %b[55], 
%aad1[ %aasti5 ] incr,5 %aaincr4 movaw,0 area=1, ind=4, am=1, be=0, %b[55] movaw,1 area=0, ind=0, am=0, be=0, %b[34] movaw,2 area=0, ind=12, am=0, be=0, %b[29] movaw,3 area=0, ind=20, am=0, be=0, %b[74] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END fmul_adds,0,sm %b[42], %b[37], %b[89], %b[75] fmul_adds,1,sm %b[22], %b[85], %b[88], %b[84] staaw,2 %b[69], %aad4[ %aasti8 ] incr,2 %aaincr4 fmuls,3,sm %b[43], %b[35], %b[85] fadds,4,sm %b[80], %b[75], %b[80] staaw,5 %b[70], %aad3[ %aasti7 ] incr,5 %aaincr4 movaw,0 area=0, ind=4, am=1, be=0, %b[37] movaw,1 area=0, ind=8, am=0, be=0, %b[69] movaw,2 area=0, ind=16, am=1, be=0, %b[42] movaw,3 area=0, ind=24, am=0, be=0, %b[70] }
Theoretical speed: 4 complex numbers in 7 cycles (4/7 × 8 bytes) = 4.57 bytes/cycle
Doubled theoretical speed: 9.14 bytes/cycle
Speed measurements

2. stage_radix2_2x_etalon_unroll2
Here the loop is unrolled by a factor of 2 using the unroll pragma.
C code
void stage_radix2_2x_etalon_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef_a, myComplex *coef_b)
{
    myComplex *x0_in = &data_in[0];
    myComplex *y0_in = &data_in[1];
    myComplex *x1_in = &data_in[2];
    myComplex *y1_in = &data_in[3];
    myComplex *c0a_in = &coef_a[0];
    myComplex *c1a_in = &coef_a[1];
    myComplex *c0b_in = &coef_b[0];
    myComplex *c1b_in = &coef_b[data_count/4];
    myComplex *out_add0 = &data_out[0*data_count/4];
    myComplex *out_add1 = &data_out[1*data_count/4];
    myComplex *out_sub0 = &data_out[2*data_count/4];
    myComplex *out_sub1 = &data_out[3*data_count/4];

    #pragma ivdep
    #pragma unroll(2)
//  #pragma prefetch
    for(int64_t i = 0; i < data_count/4; ++i)
    {
        myComplex x0 = x0_in[4*i];
        myComplex y0 = y0_in[4*i];
        myComplex c0 = c0a_in[2*i];
        myComplex x1 = x1_in[4*i];
        myComplex y1 = y1_in[4*i];
        myComplex c1 = c1a_in[2*i];

        myComplex cy0 = complex_mul(c0, y0);
        myComplex cy1 = complex_mul(c1, y1);
        myComplex add0 = complex_add(x0, cy0);
        myComplex sub0 = complex_sub(x0, cy0);
        myComplex add1 = complex_add(x1, cy1);
        myComplex sub1 = complex_sub(x1, cy1);

        x0 = add0; y0 = add1; c0 = c0b_in[i];
        x1 = sub0; y1 = sub1; c1 = c1b_in[i];

        cy0 = complex_mul(c0, y0);
        cy1 = complex_mul(c1, y1);
        out_add0[i] = complex_add(x0, cy0);
        out_sub0[i] = complex_sub(x0, cy0);
        out_add1[i] = complex_add(x1, cy1);
        out_sub1[i] = complex_sub(x1, cy1);
    }
}
Main loop in assembly
.L2305: { fapb ct=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=3, mrng=16, d=0, incr=2, ind=2, asz=3, abs=0, disp=16 } { fapb ct=0, dcd=0, fmt=3, mrng=16, d=0, incr=2, ind=2, asz=3, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=3, abs=8, disp=32 } { fapb ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=3, ind=3, asz=4, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=3, ind=4, asz=4, abs=16, disp=0 } .L1020: { loop_mode pfmul_rsubs,0,sm %b[71], %b[17], %b[75], %b[1] insfd,1,sm %b[10], %r6, %b[11], %b[87] pfmul_rsubs,2,sm %b[101], %b[47], %b[96], %b[0] insfd,3,sm %b[92], %r6, %b[98], %b[86] insfd,4,sm %b[86], %r6, %b[87], %b[10] pfsubs,5,sm %b[29], %b[40], %b[11] } { loop_mode pfmul_adds,0,sm %b[101], %b[13], %b[80], %b[17] insfd,1,sm %b[63], %r6, %b[38], %b[12] pfmul_rsubs,2,sm %b[87], %b[93], %b[85], %b[47] pfmul_adds,3,sm %b[90], %b[12], %b[97], %b[38] insfd,4,sm %b[76], %r6, %b[52], %b[13] } { loop_mode pfadd_adds,0,sm %b[18], %b[3], %b[49], %b[49] insfd,1,sm %b[82], %r6, %b[95], %b[29] pfadd_rsubs,2,sm %b[18], %b[3], %b[49], %b[40] pfadds,3,sm %b[29], %b[40], %b[52] pshufb,4,sm %b[43], %b[57], %r0, %b[80] pfmuls,5,sm %b[91], %b[93], %b[76] } { loop_mode pfsub_rsubs,0,sm %b[18], %b[3], %b[2], %b[55] insfd,1,sm %b[94], %r6, %b[81], %b[82] pfsub_rsubs,2,sm %b[37], %b[28], %b[19], %b[41] pshufb,3,sm %b[24], %b[25], %r0, %b[85] pshufb,4,sm %b[34], %b[48], %r0, %b[90] pfmuls,5,sm %b[41], %b[15], %b[81] } { loop_mode pfsub_adds,0,sm %b[18], %b[3], %b[2], %b[46] pfmuls,1,sm %b[82], %b[10], %b[92] pfsub_adds,2,sm %b[37], %b[28], %b[19], %b[32] pfadds,3,sm %b[32], %b[46], %b[91] pshufb,4,sm %b[64], %b[42], %r0, %b[93] movad,0 area=2, ind=0, am=0, be=0, %b[19] movad,1 area=2, ind=8, am=1, be=0, %b[18] movad,2 area=2, ind=0, am=0, be=0, %b[3] movad,3 area=2, ind=8, am=1, be=0, %b[2] } { loop_mode pfadd_rsubs,0,sm %b[37], %b[28], %b[84], %b[62] insfd,1,sm %b[89], %r6, %b[100], %b[56] staad,2 %b[80], 
%aad1[ %aasti5 + _f32s,_lts0 0x8 ] pshufb,3,sm %b[8], %b[9], %r0, %b[89] pshufb,4,sm %b[70], %b[51], %r0, %b[95] pfmuls,5,sm %b[85], %b[11], %b[94] movaw,0 area=1, ind=0, am=0, be=0, %b[63] movaw,1 area=1, ind=4, am=0, be=0, %b[66] movaw,2 area=1, ind=4, am=0, be=0, %b[80] movaw,3 area=1, ind=0, am=0, be=0, %b[59] } { loop_mode pfadd_adds,0,sm %b[37], %b[28], %b[84], %b[68] insfd,1,sm %b[78], %r6, %b[99], %b[28] staad,2 %b[90], %aad2[ %aasti6 + _f32s,_lts0 0x8 ] pfmuls,3,sm %b[85], %b[45], %b[78] insfd,4,sm %b[83], %r6, %b[68], %b[37] pfmuls,5,sm %b[89], %b[52], %b[83] movaw,0 area=0, ind=24, am=0, be=0, %b[96] movaw,1 area=0, ind=28, am=0, be=0, %b[85] movaw,2 area=1, ind=24, am=0, be=0, %b[90] movaw,3 area=1, ind=28, am=0, be=0, %b[84] } { loop_mode insfd,1,sm %b[79], %r6, %g16, %b[88] staad,2 %b[93], %aad3[ %aasti7 + _f32s,_lts0 0x8 ] insfd,4,sm %b[88], %r6, %b[65], %b[65] pfmuls,5,sm %b[39], %b[58], %b[71] movaw,0 area=1, ind=8, am=0, be=0, %g16 movaw,1 area=1, ind=12, am=1, be=0, %b[79] movaw,2 area=1, ind=8, am=0, be=0, %b[72] movaw,3 area=1, ind=20, am=0, be=0, %b[75] } { loop_mode pfmul_adds,0,sm %b[87], %b[54], %b[76], %b[82] insfd,1,sm %b[43], %r6, %b[57], %b[98] staad,2 %b[95], %aad4[ %aasti8 + _f32s,_lts0 0x8 ] pfmuls,3,sm %b[82], %b[86], %b[95] insfd,4,sm %b[34], %r6, %b[48], %b[97] pfsubs,5,sm %b[30], %b[44], %b[43] movaw,0 area=0, ind=4, am=0, be=0, %b[93] movaw,1 area=0, ind=0, am=0, be=0, %b[34] movaw,2 area=1, ind=16, am=0, be=0, %b[76] movaw,3 area=1, ind=12, am=1, be=0, %b[87] } { loop_mode pfmul_rsubs,0,sm %b[88], %b[86], %b[92], %b[42] insfd,1,sm %b[70], %r6, %b[51], %b[97] staad,2 %b[98], %aad1[ %aasti5 ] incr,2 %aaincr4 insfd,4,sm %b[64], %r6, %b[42], %b[98] staad,5 %b[97], %aad2[ %aasti6 ] incr,5 %aaincr4 movaw,0 area=0, ind=8, am=0, be=0, %b[48] movaw,1 area=0, ind=20, am=0, be=0, %b[51] movaw,2 area=0, ind=0, am=0, be=0, %b[86] movaw,3 area=0, ind=12, am=0, be=0, %b[92] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? 
%NOT_LOOP_END pfmul_adds,0,sm %b[69], %b[60], %b[81], %b[24] insfd,1,sm %b[24], %r6, %b[25], %b[99] staad,2 %b[97], %aad4[ %aasti8 ] incr,2 %aaincr4 insfd,4,sm %b[77], %r6, %b[53], %b[25] staad,5 %b[98], %aad3[ %aasti7 ] incr,5 %aaincr4 movaw,0 area=0, ind=16, am=0, be=0, %b[97] movaw,1 area=0, ind=12, am=1, be=0, %b[98] movaw,2 area=0, ind=4, am=1, be=0, %b[81] movaw,3 area=0, ind=8, am=0, be=0, %b[77] }
As with stage_radix2_etalon_unroll2, the compiler inserted vector instructions on its own.
Theoretical speed: 8 complex numbers per 11 cycles (8/11) = 5.82 bytes/cycle
Doubled theoretical speed: 11.64 bytes/cycle
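The speed figures in this article all follow the same arithmetic. Below is a minimal sketch of it, assuming one complex number occupies 8 bytes (two float values, as in the problem statement). The function names are hypothetical and are not part of the article's code.

```c
#include <assert.h>

/* Hypothetical helper, not taken from the article's sources.
   Assumption: one complex number is 8 bytes (two floats). */
static double bytes_per_cycle(int complex_count, int cycles)
{
    assert(complex_count > 0 && cycles > 0);
    return 8.0 * complex_count / cycles;
}

/* A "2x" kernel performs two radix-2 stages in one pass over the data,
   so its figure is doubled to compare it with a single-stage kernel. */
static double doubled_bytes_per_cycle(int complex_count, int cycles)
{
    return 2.0 * bytes_per_cycle(complex_count, cycles);
}
```

For the loop above, bytes_per_cycle(8, 11) gives about 5.82 and doubled_bytes_per_cycle(8, 11) about 11.64.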
Speed measurements

3. stage_radix2_2x_simd64
Here stage_radix2_simd64 is manually unrolled by a factor of 2.
C code
void stage_radix2_2x_simd64(int data_count, myComplex *data_in, myComplex *data_out,
                            myComplex *coef_a, myComplex *coef_b)
{
    uint64_t *x0_in = (uint64_t*)&data_in[0];
    uint64_t *y0_in = (uint64_t*)&data_in[1];
    uint64_t *x1_in = (uint64_t*)&data_in[2];
    uint64_t *y1_in = (uint64_t*)&data_in[3];
    uint64_t *c0a_in = (uint64_t*)&coef_a[0];
    uint64_t *c1a_in = (uint64_t*)&coef_a[1];
    uint64_t *c0b_in = (uint64_t*)&coef_b[0];
    uint64_t *c1b_in = (uint64_t*)&coef_b[data_count/4];
    uint64_t *out_add0 = (uint64_t*)&data_out[0*data_count/4];
    uint64_t *out_add1 = (uint64_t*)&data_out[1*data_count/4];
    uint64_t *out_sub0 = (uint64_t*)&data_out[2*data_count/4];
    uint64_t *out_sub1 = (uint64_t*)&data_out[3*data_count/4];

    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/4; ++i)
    {
        uint64_t x0 = x0_in[4*i];
        uint64_t y0 = y0_in[4*i];
        uint64_t c0 = c0a_in[2*i];
        uint64_t x1 = x1_in[4*i];
        uint64_t y1 = y1_in[4*i];
        uint64_t c1 = c1a_in[2*i];

        uint64_t conj_c0 = __builtin_e2k_pxord(c0, 1LL<<63);
        uint64_t conj_c1 = __builtin_e2k_pxord(c1, 1LL<<63);
        uint64_t swap_c0 = __builtin_e2k_pshufb(0, c0, 0x0302010007060504);
        uint64_t swap_c1 = __builtin_e2k_pshufb(0, c1, 0x0302010007060504);
        uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
        uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
        uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
        uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
        uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
        uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
        uint64_t add0 = __builtin_e2k_pfadds(x0, cy0);
        uint64_t sub0 = __builtin_e2k_pfsubs(x0, cy0);
        uint64_t add1 = __builtin_e2k_pfadds(x1, cy1);
        uint64_t sub1 = __builtin_e2k_pfsubs(x1, cy1);

        x0 = add0;
        y0 = add1;
        c0 = c0b_in[i];
        x1 = sub0;
        y1 = sub1;
        c1 = c1b_in[i];

        conj_c0 = __builtin_e2k_pxord(c0, 1LL<<63);
        conj_c1 = __builtin_e2k_pxord(c1, 1LL<<63);
        swap_c0 = __builtin_e2k_pshufb(0, c0, 0x0302010007060504);
        swap_c1 = __builtin_e2k_pshufb(0, c1, 0x0302010007060504);
        cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
        cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
        cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
        cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
        cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
        cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
        out_add0[i] = __builtin_e2k_pfadds(x0, cy0);
        out_sub0[i] = __builtin_e2k_pfsubs(x0, cy0);
        out_add1[i] = __builtin_e2k_pfadds(x1, cy1);
        out_sub1[i] = __builtin_e2k_pfsubs(x1, cy1);
    }
}
Main loop in assembly
.L2998: { fapb ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=3, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=3, abs=8, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=16, disp=0 } .L2607: { loop_mode pfmul_hadds,0,sm %b[43], %b[20], %b[37], %b[24] pfmul_hadds,1,sm %b[41], %b[5], %b[39], %b[28] pfmuls,2,sm %b[75], %b[18], %b[35] xord,3,sm %b[44], %r0, %b[84] xord,4,sm %b[2], %r0, %b[81] pfsubs,5,sm %b[78], %b[49], %b[1] movad,1 area=0, ind=8, am=0, be=0, %b[0] } { loop_mode pfmul_hadds,1,sm %b[62], %b[9], %b[60], %b[20] pfmuls,2,sm %b[83], %b[3], %b[37] pshufb,3,sm 0x0, %b[71], %r6, %b[41] pshufb,4,sm 0x0, %b[58], %r6, %b[39] pfadds,5,sm %b[78], %b[49], %b[5] } { loop_mode pfmul_hadds,1,sm %b[73], %b[15], %b[53], %b[43] pfmuls,2,sm %b[84], %b[7], %b[58] pshufb,4,sm 0x0, %b[44], %r6, %b[60] pfmuls,5,sm %b[81], %b[11], %b[49] movad,3 area=0, ind=24, am=0, be=0, %b[9] } { loop_mode pfsub_adds,0,sm %b[33], %b[26], %b[30], %b[62] pfsub_rsubs,1,sm %b[33], %b[26], %b[30], %b[53] staad,2 %b[66], %aad2[ %aasti6 ] incr,2 %aaincr0 xord,3,sm %b[69], %r0, %b[73] pshufb,4,sm 0x0, %b[4], %r6, %b[71] staad,5 %b[57], %aad1[ %aasti5 ] incr,5 %aaincr0 movad,1 area=2, ind=0, am=1, be=0, %b[44] movad,3 area=0, ind=0, am=0, be=0, %b[15] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END pfadd_adds,0,sm %b[33], %b[26], %b[22], %b[78] pfadd_rsubs,1,sm %b[33], %b[26], %b[22], %b[75] staad,2 %b[82], %aad4[ %aasti8 ] incr,2 %aaincr0 xord,3,sm %b[56], %r0, %b[81] staad,5 %b[79], %aad3[ %aasti7 ] incr,5 %aaincr0 movad,0 area=1, ind=0, am=1, be=0, %b[30] movad,1 area=0, ind=0, am=1, be=0, %b[57] movad,2 area=0, ind=8, am=1, be=0, %b[4] movad,3 area=0, ind=16, am=0, be=0, %b[66] }
Theoretical speed: 4 complex numbers per 5 cycles (4/5) = 6.4 bytes/cycle
Doubled theoretical speed: 12.8 bytes/cycle
Speed measurements

4. stage_radix2_2x_simd128
Here stage_radix2_simd128 is manually unrolled by a factor of 2.
C code
void stage_radix2_2x_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef_a, myComplex *coef_b) { __v2di *xy0_in = (__v2di*)&data_in[0]; __v2di *xy1_in = (__v2di*)&data_in[2]; __v2di *xy2_in = (__v2di*)&data_in[4]; __v2di *xy3_in = (__v2di*)&data_in[6]; __v2di *c0a_in = (__v2di*)&coef_a[0]; __v2di *c1a_in = (__v2di*)&coef_a[2]; __v2di *c0b_in = (__v2di*)&coef_b[0]; __v2di *c1b_in = (__v2di*)&coef_b[data_count/4]; __v2di *out_add0 = (__v2di*)&data_out[0*data_count/4]; __v2di *out_add1 = (__v2di*)&data_out[1*data_count/4]; __v2di *out_sub0 = (__v2di*)&data_out[2*data_count/4]; __v2di *out_sub1 = (__v2di*)&data_out[3*data_count/4]; #pragma ivdep #pragma unroll(1) #pragma prefetch for(int64_t i = 0; i < data_count/8; ++i) { __v2di xy0 = xy0_in[4*i]; __v2di xy1 = xy1_in[4*i]; __v2di c0 = c0a_in[2*i]; __v2di xy2 = xy2_in[4*i]; __v2di xy3 = xy3_in[4*i]; __v2di c1 = c1a_in[2*i]; __v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63}); __v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0); __v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1); __v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0); __v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1); __v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag); __v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag); __v2di cy0 = 
__builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di add0 = __builtin_e2k_qpfadds(x0, cy0); __v2di sub0 = __builtin_e2k_qpfsubs(x0, cy0); __v2di add1 = __builtin_e2k_qpfadds(x1, cy1); __v2di sub1 = __builtin_e2k_qpfsubs(x1, cy1); xy0 = add0; xy1 = add1; c0 = c0b_in[i]; xy2 = sub0; xy3 = sub1; c1 = c1b_in[i]; x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100}); x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100}); y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63}); conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63}); swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0); cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1); cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0); cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1); cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag); cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag); cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); out_add0[i] = __builtin_e2k_qpfadds(x0, cy0); out_sub0[i] = __builtin_e2k_qpfsubs(x0, cy0); out_add1[i] = __builtin_e2k_qpfadds(x1, cy1); out_sub1[i] = __builtin_e2k_qpfsubs(x1, cy1); } }
Main loop in assembly
.L3790: { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=32 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=4, abs=16, disp=0 } { fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=4, abs=16, disp=0 } .L3059: { loop_mode qpfmul_hadds,0,sm %b[56], %b[54], %b[70], %b[1] qpshufb,1,sm %b[9], %b[32], %r15, %b[52] qpfadds,2,sm %b[88], %b[87], %b[0] qpshufb,3,sm %b[33], %b[33], %r0, %b[94] qpshufb,4,sm %b[6], %b[30], %r16, %b[95] qpfadds,5,sm %b[61], %b[71], %b[92] } { loop_mode qpfmul_hadds,0,sm %b[74], %b[69], %b[7], %b[33] qpshufb,1,sm %b[29], %b[29], %r14, %b[56] qpfsubs,2,sm %b[55], %b[90], %b[30] qpxor,3,sm %b[60], %r13, %b[54] qpxor,4,sm %b[27], %r13, %b[57] qpfsubs,5,sm %b[95], %b[94], %b[96] movaqp,3 area=1, ind=0, am=0, be=0, %b[6] } { loop_mode qpfmul_hadds,0,sm %b[56], %b[73], %b[93], %b[29] qpshufb,1,sm %b[13], %b[36], %r16, %b[59] qpfsubs,2,sm %b[86], %b[85], %b[7] qpshufb,4,sm %b[62], %b[62], %r14, %b[61] qpfadds,5,sm %b[95], %b[94], %b[97] } { loop_mode qpshufb,1,sm %b[3], %b[3], %r0, %b[69] qpfmuls,2,sm %b[53], %b[52], %b[68] qpshufb,3,sm %b[14], %b[43], %r15, %b[62] qpshufb,4,sm %b[75], %b[76], %r15, %b[63] staaqp,5 %b[72], %aad1[ %aasti5 ] incr,5 %aaincr0 movaqp,0 area=2, ind=0, am=1, be=0, %b[36] movaqp,1 area=1, ind=0, am=1, be=0, %b[13] movaqp,3 area=1, ind=16, am=1, be=0, %b[56] } { loop_mode qpfmuls,0,sm %b[41], %b[65], %b[3] qpshufb,1,sm %b[0], %b[24], %r15, %b[71] qpfsubs,2,sm %b[59], %b[69], %b[70] qpshufb,3,sm %b[12], %b[12], %r14, %b[72] qpshufb,4,sm %b[83], %b[84], %r16, %b[53] staaqp,5 %b[92], %aad2[ %aasti6 ] incr,5 %aaincr0 } { loop_mode qpfmuls,0,sm %b[54], %b[64], %b[87] qpshufb,1,sm %b[35], %b[35], %r0, %b[88] qpfmuls,2,sm %b[57], %b[71], %b[91] qpshufb,3,sm %b[22], %b[51], %r16, %b[84] qpshufb,4,sm %b[39], %b[39], %r0, %b[83] staaqp,5 %b[96], 
%aad3[ %aasti7 ] incr,5 %aaincr0 movaqp,0 area=0, ind=0, am=0, be=0, %b[41] movaqp,1 area=0, ind=16, am=1, be=0, %b[12] movaqp,2 area=0, ind=0, am=0, be=0, %b[74] movaqp,3 area=0, ind=16, am=1, be=0, %b[73] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END qpfmul_hadds,0,sm %b[61], %b[66], %b[89], %b[35] qpshufb,1,sm %b[50], %b[50], %r14, %b[54] qpfadds,2,sm %b[55], %b[90], %b[22] qpxor,3,sm %b[8], %r13, %b[39] qpxor,4,sm %b[48], %r13, %b[51] staaqp,5 %b[97], %aad4[ %aasti8 ] incr,5 %aaincr0 }
Theoretical speed: 8 complex numbers per 7 cycles (8/7) = 9.14 bytes/cycle
Doubled theoretical speed: 18.29 bytes/cycle
Speed measurements

5. stage_radix2_2x_simd128_noConj
Here stage_radix2_simd128_noConj is manually unrolled by a factor of 2.
C code
void stage_radix2_2x_simd128_noConj(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef_a, myComplex *coef_b) { __v2di *xy0_in = (__v2di*)&data_in[0]; __v2di *xy1_in = (__v2di*)&data_in[2]; __v2di *xy2_in = (__v2di*)&data_in[4]; __v2di *xy3_in = (__v2di*)&data_in[6]; __v2di *c0a_in = (__v2di*)&coef_a[0]; __v2di *c1a_in = (__v2di*)&coef_a[2]; __v2di *c0b_in = (__v2di*)&coef_b[0]; __v2di *c1b_in = (__v2di*)&coef_b[data_count/4]; __v2di *out_add0 = (__v2di*)&data_out[0*data_count/4]; __v2di *out_add1 = (__v2di*)&data_out[1*data_count/4]; __v2di *out_sub0 = (__v2di*)&data_out[2*data_count/4]; __v2di *out_sub1 = (__v2di*)&data_out[3*data_count/4]; #pragma ivdep #pragma unroll(1) #pragma prefetch for(int64_t i = 0; i < data_count/8; ++i) { __v2di xy0 = xy0_in[4*i]; __v2di xy1 = xy1_in[4*i]; __v2di c0 = c0a_in[2*i]; __v2di xy2 = xy2_in[4*i]; __v2di xy3 = xy3_in[4*i]; __v2di c1 = c1a_in[2*i]; __v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di cy0_real = __builtin_e2k_qpfmuls( c0, y0); __v2di cy1_real = __builtin_e2k_qpfmuls( c1, y1); __v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0); __v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1); __v2di cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real); __v2di cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real); __v2di cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag); __v2di cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag); __v2di cy0 = __builtin_e2k_qppermb(cy0_ii, 
cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di add0 = __builtin_e2k_qpfadds(x0, cy0); __v2di sub0 = __builtin_e2k_qpfsubs(x0, cy0); __v2di add1 = __builtin_e2k_qpfadds(x1, cy1); __v2di sub1 = __builtin_e2k_qpfsubs(x1, cy1); xy0 = add0; xy1 = add1; c0 = c0b_in[i]; xy2 = sub0; xy3 = sub1; c1 = c1b_in[i]; x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100}); x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100}); y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); cy0_real = __builtin_e2k_qpfmuls( c0, y0); cy1_real = __builtin_e2k_qpfmuls( c1, y1); cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0); cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1); cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real); cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real); cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag); cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag); cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); out_add0[i] = __builtin_e2k_qpfadds(x0, cy0); out_sub0[i] = __builtin_e2k_qpfsubs(x0, cy0); out_add1[i] = __builtin_e2k_qpfadds(x1, cy1); out_sub1[i] = __builtin_e2k_qpfsubs(x1, cy1); } }
Main loop in assembly
.L4577: { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=32 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=4, abs=16, disp=0 } { fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=4, abs=16, disp=0 } .L3851: { loop_mode qpfmul_hsubs,0,sm %b[53], %b[54], %r16, %b[5] qpshufb,1,sm %b[39], %b[39], %r15, %b[25] qpfmul_hadds,2,sm %b[92], %b[54], %r16, %b[1] qpshufb,3,sm %b[16], %b[8], %r14, %b[24] qpfsubs,4,sm %b[66], %b[22], %b[9] qpfsubs,5,sm %b[89], %b[69], %b[0] } { loop_mode qpfmul_hsubs,0,sm %b[39], %b[71], %r16, %b[58] qpshufb,1,sm %b[72], %b[75], %r0, %b[64] qpfmul_hadds,2,sm %b[25], %b[71], %r16, %b[54] qpshufb,3,sm %b[20], %b[20], %r15, %b[62] qpfsubs,4,sm %b[86], %b[59], %b[8] qpfadds,5,sm %b[24], %b[48], %b[61] movaqp,2 area=1, ind=0, am=0, be=0, %b[53] movaqp,3 area=1, ind=16, am=1, be=0, %b[16] } { loop_mode qpfmul_hsubs,0,sm %b[57], %b[64], %r16, %b[78] qpshufb,1,sm %b[57], %b[57], %r15, %b[88] staaqp,2 %b[87], %aad1[ %aasti5 ] incr,2 %aaincr0 qpshufb,3,sm %b[26], %b[34], %r0, %b[81] qpshufb,4,sm %b[32], %b[40], %r14, %b[84] qpfsubs,5,sm %b[24], %b[48], %b[85] movaqp,0 area=2, ind=0, am=1, be=0, %b[39] movaqp,1 area=1, ind=0, am=1, be=0, %b[25] movaqp,2 area=0, ind=0, am=0, be=0, %b[71] movaqp,3 area=0, ind=16, am=1, be=0, %b[68] } { loop_mode qpfmul_hadds,0,sm %b[88], %b[64], %r16, %b[48] qpshufb,1,sm %b[51], %b[51], %r15, %b[90] staaqp,2 %b[63], %aad2[ %aasti6 ] incr,2 %aaincr0 qpshufb,3,sm %b[76], %b[79], %r14, %b[87] qppermb,4,sm %b[15], %b[67], %r13, %b[57] qpfadds,5,sm %b[89], %b[69], %b[40] movaqp,0 area=0, ind=0, am=0, be=0, %b[32] movaqp,1 area=0, ind=16, am=1, be=0, %b[24] } { loop_mode qpfmul_hsubs,0,sm %b[20], %b[83], %r16, %b[63] qpshufb,1,sm %b[17], %b[42], %r0, %b[69] staaqp,2 %b[11], %aad3[ %aasti7 ] incr,2 %aaincr0 qppermb,3,sm %b[52], %b[82], 
%r13, %b[67] qpshufb,4,sm %b[21], %b[46], %r14, %b[64] qpfadds,5,sm %b[86], %b[59], %b[15] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END qpfmul_hadds,0,sm %b[62], %b[83], %r16, %b[11] qpshufb,1,sm %b[10], %b[2], %r0, %b[52] staaqp,2 %b[23], %aad4[ %aasti8 ] incr,2 %aaincr0 qppermb,3,sm %b[3], %b[7], %r13, %b[46] qppermb,4,sm %b[56], %b[60], %r13, %b[20] qpfadds,5,sm %b[66], %b[22], %b[21] }
Theoretical speed: 8 complex numbers per 6 cycles (8/6) = 10.67 bytes/cycle
Doubled theoretical speed: 21.33 bytes/cycle
Speed measurements

Summary for stage_radix2_2x


Speeds increased compared with the original stage_radix2 versions.
The FFT chart is here.
stage_radix2_readConjSwap
Let's return to the stage_radix2 algorithms. Note that conj_c and swap_c are derived directly from c, which is read from memory and is not used anywhere else.
Optimization: instead of computing conj_c and swap_c, read them straight from memory, so c itself no longer has to be read. This removes two instructions from the loop, xor and shuf, at the price of one extra load per iteration and a second coefficient table.
Let's see what comes out of it.
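To make the idea concrete, here is a minimal portable sketch of how the two tables could be prepared ahead of time, together with a scalar model of the product c*y built from them. It assumes a complex number is stored as {re, im} with the imaginary part in the upper 32 bits, so the XOR with 1<<63 flips the sign of im. The type cf32 and both function names are hypothetical. The article's kernels use myComplex and e2k intrinsics instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the article's myComplex: {re, im} as two floats. */
typedef struct { float re, im; } cf32;

/* Build the tables that the readConjSwap kernels read from memory:
   conj_coef[i] = (re, -im), swap_coef[i] = (im, re). */
static void make_conj_swap_tables(size_t n, const cf32 *coef,
                                  cf32 *conj_coef, cf32 *swap_coef)
{
    assert(coef && conj_coef && swap_coef);
    for (size_t i = 0; i < n; ++i) {
        conj_coef[i].re = coef[i].re;
        conj_coef[i].im = -coef[i].im;
        swap_coef[i].re = coef[i].im;
        swap_coef[i].im = coef[i].re;
    }
}

/* Scalar model of c*y computed from the two table entries. */
static cf32 mul_via_conj_swap(cf32 conj_c, cf32 swap_c, cf32 y)
{
    /* two element-wise products (the role of pfmuls in the kernels) */
    float rr = conj_c.re * y.re;   /*  c.re * y.re */
    float ri = conj_c.im * y.im;   /* -c.im * y.im */
    float ir = swap_c.re * y.re;   /*  c.im * y.re */
    float ii = swap_c.im * y.im;   /*  c.re * y.im */
    /* pairwise sums (the role of pfhadds) */
    cf32 cy = { rr + ri, ir + ii };
    return cy;
}
```

The tables are filled once, outside the hot loop, so the sign flip and the swap cost nothing per butterfly.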
1. stage_radix2_readConjSwap_simd64
A development of stage_radix2_simd64: conj and swap are read from memory instead of being computed.
C code
void stage_radix2_readConjSwap_simd64(int data_count, myComplex *data_in, myComplex *data_out,
                                      myComplex *conj_coef, myComplex *swap_coef)
{
    uint64_t *x_in = (uint64_t*)&data_in[0];
    uint64_t *y_in = (uint64_t*)&data_in[1];
    uint64_t *conj_c_in = (uint64_t*)conj_coef;
    uint64_t *swap_c_in = (uint64_t*)swap_coef;
    uint64_t *out_add = (uint64_t*)&data_out[0];
    uint64_t *out_sub = (uint64_t*)&data_out[data_count/2];

    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/2; ++i)
    {
        uint64_t x = x_in[2*i];
        uint64_t y = y_in[2*i];
        /* both coefficient forms come from memory, no xor/shuffle needed */
        uint64_t conj_c = conj_c_in[i];
        uint64_t swap_c = swap_c_in[i];

        /* complex product c*y: two element-wise multiplies, then a horizontal add */
        uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
        uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);
        uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);

        out_add[i] = __builtin_e2k_pfadds(x, cy);
        out_sub[i] = __builtin_e2k_pfsubs(x, cy);
    }
}
Main loop in assembly
.L326: { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0 } .L125: { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END pfmuls,0,sm %b[67], %b[33], %b[45] pfsubs,1,sm %b[42], %b[62], %b[62] staad,2 %b[70], %aad1[ %aasti4 ] incr,2 %aaincr0 pfmul_hadds,3,sm %b[23], %b[45], %b[57], %b[42] pfadds,4,sm %b[42], %b[62], %b[67] staad,5 %b[75], %aad2[ %aasti5 ] incr,5 %aaincr0 movad,0 area=0, ind=0, am=1, be=0, %b[57] movad,1 area=1, ind=0, am=1, be=0, %b[1] movad,2 area=0, ind=8, am=1, be=0, %b[23] movad,3 area=0, ind=0, am=0, be=0, %b[0] }
The loop used to contain 8 instructions; now it contains 6.
Six instructions fit exactly into 1 cycle, one per ALU channel.
Theoretical speed: 2 complex numbers per 1 cycle (2/1) = 16 bytes/cycle
Speed measurements

2. stage_radix2_readConjSwap_simd128
A development of stage_radix2_simd128: conj and swap are read from memory instead of being computed.
Developing stage_radix2_simd128_noConj in the same way leads to this same code.
C code
void stage_radix2_readConjSwap_simd128(int data_count, myComplex *data_in, myComplex *data_out,
                                       myComplex *conj_coef, myComplex *swap_coef)
{
    __v2di *xy0_in = (__v2di*)&data_in[0];
    __v2di *xy1_in = (__v2di*)&data_in[2];
    __v2di *conj_c_in = (__v2di*)conj_coef;
    __v2di *swap_c_in = (__v2di*)swap_coef;
    __v2di *out_add = (__v2di*)&data_out[0];
    __v2di *out_sub = (__v2di*)&data_out[data_count/2];

    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/4; ++i)
    {
        __v2di xy0 = xy0_in[2*i];
        __v2di xy1 = xy1_in[2*i];
        __v2di conj_c = conj_c_in[i];
        __v2di swap_c = swap_c_in[i];

        __v2di x = __builtin_e2k_qpshufb(xy1, xy0,
            (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di y = __builtin_e2k_qpshufb(xy1, xy0,
            (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

        __v2di cy_real = __builtin_e2k_qpfmuls(conj_c, y);
        __v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
        __v2di cy_rrii = __builtin_e2k_qpfhadds(cy_real, cy_imag);
        __v2di cy = __builtin_e2k_qpshufb(cy_rrii, cy_rrii,
            (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

        out_add[i] = __builtin_e2k_qpfadds(x, cy);
        out_sub[i] = __builtin_e2k_qpfsubs(x, cy);
    }
}
Main loop in assembly
.L599: { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=4, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0 } { fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=4, abs=16, disp=0 } .L353: { loop_mode qpshufb,1,sm %b[31], %b[40], %r0, %b[16] qpshufb,3,sm %b[33], %b[42], %r7, %b[0] qpfsubs,4,sm %b[14], %b[44], %b[21] staaqp,5 %b[25], %aad1[ %aasti4 ] incr,5 %aaincr0 movaqp,0 area=0, ind=0, am=1, be=0, %b[13] movaqp,1 area=1, ind=0, am=1, be=0, %b[1] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END qpfmuls,0,sm %b[19], %b[16], %b[33] qpfmul_hadds,2,sm %b[11], %b[20], %b[37], %b[22] qpshufb,3,sm %b[32], %b[32], %r6, %b[42] qpfadds,4,sm %b[14], %b[44], %b[39] staaqp,5 %b[43], %aad2[ %aasti5 ] incr,5 %aaincr0 movaqp,2 area=0, ind=0, am=0, be=0, %b[34] movaqp,3 area=0, ind=16, am=1, be=0, %b[25] }
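The loop used to contain 11 instructions; now it contains 9.
Theoretical speed: 4 complex numbers per 2 cycles (4/2) = 16 bytes/cycle
Speed measurements

Right now the loop body fills 9/6 of a cycle (9 instructions on 6 channels), which rounds up to 2 cycles. Unrolling by a factor of 2 gives 2 * 9/6 = 3 cycles for twice the work.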
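The estimate above is a ceiling division of the instruction count by the number of ALU channels. A minimal sketch, assuming six channels per wide instruction and counting only the operations that occupy those channels, the way the article's instruction counts do (the array loads are not included). The helper name is hypothetical.

```c
#include <assert.h>

enum { ALU_CHANNELS = 6 };  /* assumption: six ALU slots per wide instruction */

/* Hypothetical helper: lower bound on the cycles needed for one iteration
   of a loop body with alu_ops operations, unrolled `unroll` times. */
static int min_cycles(int alu_ops, int unroll)
{
    assert(alu_ops > 0 && unroll > 0);
    int total = alu_ops * unroll;
    return (total + ALU_CHANNELS - 1) / ALU_CHANNELS;  /* ceiling division */
}
```

With 9 operations this gives 2 cycles without unrolling and 3 cycles with unroll 2, so the second cycle is no longer half empty. This is only a slot-count bound; the real schedule also depends on latencies and on what the compiler manages to pipeline.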
3. stage_radix2_readConjSwap_simd128_unroll2
Here the loop is unrolled by a factor of 2 with the unroll(2) pragma.
C code
void stage_radix2_readConjSwap_simd128_unroll2(int data_count, myComplex *data_in, myComplex *data_out,
                                               myComplex *conj_coef, myComplex *swap_coef)
{
    __v2di *xy0_in = (__v2di*)&data_in[0];
    __v2di *xy1_in = (__v2di*)&data_in[2];
    __v2di *conj_c_in = (__v2di*)conj_coef;
    __v2di *swap_c_in = (__v2di*)swap_coef;
    __v2di *out_add = (__v2di*)&data_out[0];
    __v2di *out_sub = (__v2di*)&data_out[data_count/2];

    #pragma ivdep
    #pragma unroll(2)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/4; ++i)
    {
        __v2di xy0 = xy0_in[2*i];
        __v2di xy1 = xy1_in[2*i];
        __v2di conj_c = conj_c_in[i];
        __v2di swap_c = swap_c_in[i];

        __v2di x = __builtin_e2k_qpshufb(xy1, xy0,
            (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di y = __builtin_e2k_qpshufb(xy1, xy0,
            (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

        __v2di cy_real = __builtin_e2k_qpfmuls(conj_c, y);
        __v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
        __v2di cy_rrii = __builtin_e2k_qpfhadds(cy_real, cy_imag);
        __v2di cy = __builtin_e2k_qpshufb(cy_rrii, cy_rrii,
            (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

        out_add[i] = __builtin_e2k_qpfadds(x, cy);
        out_sub[i] = __builtin_e2k_qpfsubs(x, cy);
    }
}
Main loop in assembly
.L992: { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=4, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=4, abs=0, disp=32 } { fapb ct=1, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=3, asz=4, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=2, asz=4, abs=16, disp=0 } .L626: { loop_mode qpfmul_hadds,0,sm %b[27], %b[14], %b[92], %b[1] qpfmuls,1,sm %b[81], %b[4], %b[9] qpfsubs,2,sm %b[89], %b[88], %b[34] qpshufb,3,sm %b[52], %b[51], %r0, %b[0] qpshufb,4,sm %b[20], %b[33], %r0, %b[8] qpfadds,5,sm %b[89], %b[88], %b[13] } { loop_mode qpfmuls,0,sm %b[73], %b[10], %b[88] qpfsubs,1,sm %b[84], %b[85], %b[89] staaqp,2 %b[36], %aad1[ %aasti4 ] qpshufb,3,sm %b[66], %b[65], %r13, %b[80] qpshufb,4,sm %b[74], %b[74], %r14, %b[81] staaqp,5 %b[15], %aad2[ %aasti5 ] movaqp,0 area=0, ind=0, am=0, be=0, %b[27] movaqp,1 area=0, ind=16, am=1, be=0, %b[14] movaqp,2 area=0, ind=0, am=0, be=0, %b[47] movaqp,3 area=0, ind=16, am=1, be=0, %b[48] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END qpfmul_hadds,0,sm %b[46], %b[6], %b[11], %b[66] qpfadds,1,sm %b[82], %b[83], %b[74] staaqp,2 %b[91], %aad1[ %aasti4 + _f32s,_lts0 0x10 ] incr,2 %aaincr3 qpshufb,3,sm %b[32], %b[45], %r13, %b[85] qpshufb,4,sm %b[7], %b[7], %r14, %b[84] staaqp,5 %b[78], %aad2[ %aasti5 + _f32s,_lts0 0x10 ] incr,5 %aaincr3 movaqp,0 area=1, ind=0, am=0, be=0, %b[65] movaqp,1 area=1, ind=16, am=1, be=0, %b[73] movaqp,2 area=1, ind=0, am=0, be=0, %b[15] movaqp,3 area=1, ind=16, am=1, be=0, %b[36] }
Theoretical speed: 8 complex numbers per 3 cycles (8/3) = 21.33 bytes/cycle
Speed measurements

Summary for stage_radix2_readConjSwap


The FFT chart is here.
stage_radix2_readConjSwap_2x
One pass of stage_radix2_readConjSwap_2x does the same work as 2 passes of stage_radix2. For that reason the speed of stage_radix2_readConjSwap_2x is multiplied by 2, to make it easier to compare with stage_radix2 (this is noted on the chart axis and in the console output).
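The claim that one 2x pass equals two ordinary passes can be checked with a portable scalar model. The sketch below mirrors the indexing of the article's kernels but is not the author's code: cf32, stage_model and stage_2x_model are hypothetical names, and for brevity it uses plain coefficients rather than the conj/swap tables.

```c
#include <assert.h>

typedef struct { float re, im; } cf32;  /* hypothetical stand-in for myComplex */

static cf32 cmul(cf32 a, cf32 b)
{
    cf32 r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}
static cf32 cadd(cf32 a, cf32 b) { cf32 r = { a.re + b.re, a.im + b.im }; return r; }
static cf32 csub(cf32 a, cf32 b) { cf32 r = { a.re - b.re, a.im - b.im }; return r; }

/* One radix-2 stage: out[i] = x + c*y, out[i + n/2] = x - c*y. */
static void stage_model(int n, const cf32 *in, cf32 *out, const cf32 *coef)
{
    assert(n > 0 && n % 2 == 0);
    for (int i = 0; i < n / 2; ++i) {
        cf32 cy = cmul(coef[i], in[2 * i + 1]);
        out[i]         = cadd(in[2 * i], cy);
        out[i + n / 2] = csub(in[2 * i], cy);
    }
}

/* Two stages fused into one pass, with no intermediate array. */
static void stage_2x_model(int n, const cf32 *in, cf32 *out,
                           const cf32 *coef_a, const cf32 *coef_b)
{
    assert(n > 0 && n % 4 == 0);
    int q = n / 4;
    for (int i = 0; i < q; ++i) {
        /* first stage: two butterflies on four neighbouring inputs */
        cf32 cy0 = cmul(coef_a[2 * i],     in[4 * i + 1]);
        cf32 cy1 = cmul(coef_a[2 * i + 1], in[4 * i + 3]);
        cf32 add0 = cadd(in[4 * i],     cy0);
        cf32 sub0 = csub(in[4 * i],     cy0);
        cf32 add1 = cadd(in[4 * i + 2], cy1);
        cf32 sub1 = csub(in[4 * i + 2], cy1);
        /* second stage: applied directly to the first-stage results */
        cy0 = cmul(coef_b[i],     add1);
        cy1 = cmul(coef_b[i + q], sub1);
        out[i]         = cadd(add0, cy0);
        out[i + q]     = cadd(sub0, cy1);
        out[i + 2 * q] = csub(add0, cy0);
        out[i + 3 * q] = csub(sub0, cy1);
    }
}
```

Running stage_model twice (first with coef_a into a temporary array, then with coef_b) produces the same output as a single call to stage_2x_model, while the fused version reads and writes the data array only once.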
1. stage_radix2_readConjSwap_2x_simd64
Here stage_radix2_readConjSwap_simd64 is manually unrolled by a factor of 2.
C code
void stage_radix2_readConjSwap_2x_simd64(int data_count, myComplex *data_in, myComplex *data_out,
                                         myComplex *conj_coef_a, myComplex *conj_coef_b,
                                         myComplex *swap_coef_a, myComplex *swap_coef_b)
{
    uint64_t *x0_in = (uint64_t*)&data_in[0];
    uint64_t *y0_in = (uint64_t*)&data_in[1];
    uint64_t *x1_in = (uint64_t*)&data_in[2];
    uint64_t *y1_in = (uint64_t*)&data_in[3];
    uint64_t *conj_c0a_in = (uint64_t*)&conj_coef_a[0];
    uint64_t *conj_c1a_in = (uint64_t*)&conj_coef_a[1];
    uint64_t *conj_c0b_in = (uint64_t*)&conj_coef_b[0];
    uint64_t *conj_c1b_in = (uint64_t*)&conj_coef_b[data_count/4];
    uint64_t *swap_c0a_in = (uint64_t*)&swap_coef_a[0];
    uint64_t *swap_c1a_in = (uint64_t*)&swap_coef_a[1];
    uint64_t *swap_c0b_in = (uint64_t*)&swap_coef_b[0];
    uint64_t *swap_c1b_in = (uint64_t*)&swap_coef_b[data_count/4];
    uint64_t *out_add0 = (uint64_t*)&data_out[0*data_count/4];
    uint64_t *out_add1 = (uint64_t*)&data_out[1*data_count/4];
    uint64_t *out_sub0 = (uint64_t*)&data_out[2*data_count/4];
    uint64_t *out_sub1 = (uint64_t*)&data_out[3*data_count/4];

    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/4; ++i)
    {
        uint64_t x0 = x0_in[4*i];
        uint64_t y0 = y0_in[4*i];
        uint64_t conj_c0 = conj_c0a_in[2*i];
        uint64_t swap_c0 = swap_c0a_in[2*i];
        uint64_t x1 = x1_in[4*i];
        uint64_t y1 = y1_in[4*i];
        uint64_t conj_c1 = conj_c1a_in[2*i];
        uint64_t swap_c1 = swap_c1a_in[2*i];

        uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
        uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
        uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
        uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
        uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
        uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
        uint64_t add0 = __builtin_e2k_pfadds(x0, cy0);
        uint64_t sub0 = __builtin_e2k_pfsubs(x0, cy0);
        uint64_t add1 = __builtin_e2k_pfadds(x1, cy1);
        uint64_t sub1 = __builtin_e2k_pfsubs(x1, cy1);

        x0 = add0;
        y0 = add1;
        conj_c0 = conj_c0b_in[i];
        swap_c0 = swap_c0b_in[i];
        x1 = sub0;
        y1 = sub1;
        conj_c1 = conj_c1b_in[i];
        swap_c1 = swap_c1b_in[i];

        cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
        cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
        cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
        cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
        cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
        cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
        out_add0[i] = __builtin_e2k_pfadds(x0, cy0);
        out_sub0[i] = __builtin_e2k_pfsubs(x0, cy0);
        out_add1[i] = __builtin_e2k_pfadds(x1, cy1);
        out_sub1[i] = __builtin_e2k_pfsubs(x1, cy1);
    }
}
Main loop in assembly
.L723: { fapb ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=3, asz=3, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=1, asz=3, abs=0, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=2, asz=3, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=3, abs=8, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=3, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=16, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=3, abs=24, disp=0 } .L322: { loop_mode pfmuls,1,sm %b[60], %b[45], %b[34] pfmul_hadds,2,sm %b[108], %b[47], %b[36], %b[0] pfmul_hadds,3,sm %b[41], %b[97], %b[111], %b[21] pfsub_adds,4,sm %b[77], %b[103], %b[25], %b[6] pfsub_rsubs,5,sm %b[77], %b[103], %b[25], %b[1] } { loop_mode pfmul_hadds,3,sm %b[76], %b[102], %b[107], %b[53] pfadd_adds,4,sm %b[77], %b[103], %b[57], %b[50] pfadd_rsubs,5,sm %b[77], %b[103], %b[57], %b[47] movad,0 area=0, ind=8, am=0, be=0, %b[56] movad,1 area=3, ind=0, am=1, be=0, %b[36] movad,2 area=0, ind=24, am=0, be=0, %b[41] movad,3 area=2, ind=0, am=1, be=0, %b[25] } { loop_mode staad,2 %b[10], %aad2[ %aasti9 ] incr,2 %aaincr0 pfsubs,3,sm %b[32], %b[4], %b[91] pfmuls,4,sm %b[22], %b[17], %b[92] staad,5 %b[5], %aad1[ %aasti8 ] incr,5 %aaincr0 movad,0 area=2, ind=0, am=1, be=0, %b[77] movad,1 area=1, ind=0, am=0, be=0, %b[76] movad,2 area=1, ind=0, am=1, be=0, %b[60] movad,3 area=0, ind=0, am=0, be=0, %b[57] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END pfadds,0,sm %b[32], %b[4], %b[96] pfmuls,1,sm %b[89], %b[98], %b[103] staad,2 %b[54], %aad4[ %aasti11 ] incr,2 %aaincr0 pfmuls,3,sm %b[48], %b[93], %b[107] pfmul_hadds,4,sm %b[90], %b[19], %b[94], %b[97] staad,5 %b[51], %aad3[ %aasti10 ] incr,5 %aaincr0 movad,0 area=1, ind=8, am=1, be=0, %b[102] movad,1 area=0, ind=0, am=1, be=0, %b[10] movad,2 area=0, ind=8, am=1, be=0, %b[5] movad,3 area=0, ind=16, am=0, be=0, %b[22] }
Theoretical speed: 4 complex numbers per 4 cycles (4/4) = 8 bytes/cycle
Doubled theoretical speed: 16 bytes/cycle
Speed measurements

2. stage_radix2_readConjSwap_2x_simd64_unroll2
Here the loop is unrolled by a factor of 2 with the unroll(2) pragma.
C code
void stage_radix2_readConjSwap_2x_simd64_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef_a, myComplex *conj_coef_b, myComplex *swap_coef_a, myComplex *swap_coef_b) {
    uint64_t *x0_in = (uint64_t*)&data_in[0];
    uint64_t *y0_in = (uint64_t*)&data_in[1];
    uint64_t *x1_in = (uint64_t*)&data_in[2];
    uint64_t *y1_in = (uint64_t*)&data_in[3];
    uint64_t *conj_c0a_in = (uint64_t*)&conj_coef_a[0];
    uint64_t *conj_c1a_in = (uint64_t*)&conj_coef_a[1];
    uint64_t *conj_c0b_in = (uint64_t*)&conj_coef_b[0];
    uint64_t *conj_c1b_in = (uint64_t*)&conj_coef_b[data_count/4];
    uint64_t *swap_c0a_in = (uint64_t*)&swap_coef_a[0];
    uint64_t *swap_c1a_in = (uint64_t*)&swap_coef_a[1];
    uint64_t *swap_c0b_in = (uint64_t*)&swap_coef_b[0];
    uint64_t *swap_c1b_in = (uint64_t*)&swap_coef_b[data_count/4];
    uint64_t *out_add0 = (uint64_t*)&data_out[0*data_count/4];
    uint64_t *out_add1 = (uint64_t*)&data_out[1*data_count/4];
    uint64_t *out_sub0 = (uint64_t*)&data_out[2*data_count/4];
    uint64_t *out_sub1 = (uint64_t*)&data_out[3*data_count/4];
    #pragma ivdep
    #pragma unroll(2)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/4; ++i) {
        uint64_t x0 = x0_in[4*i];
        uint64_t y0 = y0_in[4*i];
        uint64_t conj_c0 = conj_c0a_in[2*i];
        uint64_t swap_c0 = swap_c0a_in[2*i];
        uint64_t x1 = x1_in[4*i];
        uint64_t y1 = y1_in[4*i];
        uint64_t conj_c1 = conj_c1a_in[2*i];
        uint64_t swap_c1 = swap_c1a_in[2*i];
        uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
        uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
        uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
        uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
        uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
        uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
        uint64_t add0 = __builtin_e2k_pfadds(x0, cy0);
        uint64_t sub0 = __builtin_e2k_pfsubs(x0, cy0);
        uint64_t add1 = __builtin_e2k_pfadds(x1, cy1);
        uint64_t sub1 = __builtin_e2k_pfsubs(x1, cy1);
        x0 = add0;
        y0 = add1;
        conj_c0 = conj_c0b_in[i];
        swap_c0 = swap_c0b_in[i];
        x1 = sub0;
        y1 = sub1;
        conj_c1 = conj_c1b_in[i];
        swap_c1 = swap_c1b_in[i];
        cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
        cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
        cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
        cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
        cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
        cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
        out_add0[i] = __builtin_e2k_pfadds(x0, cy0);
        out_sub0[i] = __builtin_e2k_pfsubs(x0, cy0);
        out_add1[i] = __builtin_e2k_pfadds(x1, cy1);
        out_sub1[i] = __builtin_e2k_pfsubs(x1, cy1);
    }
}
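The complex multiplication in the listing above is built from two lane-wise multiplies and one horizontal add. Below is a minimal portable sketch of that idea. It is not the e2k intrinsics themselves: `pair_f32`, `emu_pfmuls`, `emu_pfhadds` and `emu_complex_mul` are hypothetical stand-ins, and both the lane order and the layout of the conj/swap tables (`(re, -im)` and `(im, re)`) are assumptions inferred from the variable names.

```c
/* Two packed 32-bit floats standing in for one 64-bit register.
   Assumption: lane0 holds the real part, lane1 the imaginary part. */
typedef struct { float lane0, lane1; } pair_f32;

/* Hypothetical stand-in for a packed multiply: lane-wise product. */
static pair_f32 emu_pfmuls(pair_f32 a, pair_f32 b) {
    return (pair_f32){ a.lane0 * b.lane0, a.lane1 * b.lane1 };
}

/* Hypothetical stand-in for a packed horizontal add.
   Assumption: lane0 = sum of the lanes of a, lane1 = sum of the lanes of b. */
static pair_f32 emu_pfhadds(pair_f32 a, pair_f32 b) {
    return (pair_f32){ a.lane0 + a.lane1, b.lane0 + b.lane1 };
}

/* c * y computed the "conj/swap" way. In the article's code the two
   rearranged coefficients come from precomputed tables; here they are
   built on the fly for illustration. */
static pair_f32 emu_complex_mul(pair_f32 c, pair_f32 y) {
    pair_f32 conj_c = { c.lane0, -c.lane1 };     /* (re, -im) */
    pair_f32 swap_c = { c.lane1,  c.lane0 };     /* (im,  re) */
    pair_f32 re_parts = emu_pfmuls(conj_c, y);   /* (c.re*y.re, -c.im*y.im) */
    pair_f32 im_parts = emu_pfmuls(swap_c, y);   /* (c.im*y.re,  c.re*y.im) */
    return emu_pfhadds(re_parts, im_parts);      /* (Re(c*y), Im(c*y)) */
}
```

For example, with c = 1 + 2i and y = 3 + 4i the sketch returns (-5, 10), which is the product -5 + 10i. Storing the rearranged coefficients in tables moves the negation and the swap out of the loop, so each product costs two multiplies and one horizontal add.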
Main loop in assembly
.L1964: { fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=3, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=3, abs=0, disp=32 } { fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=3, asz=3, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=2, asz=3, abs=8, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=7, asz=3, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=6, asz=3, abs=16, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=5, asz=3, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=4, asz=3, abs=24, disp=0 } .L1045: { loop_mode pfmul_hadds,0,sm %b[71], %b[68], %b[118], %b[31] pfsub_adds,1,sm %b[85], %b[94], %b[107], %b[1] pfsub_rsubs,2,sm %b[85], %b[94], %b[107], %b[0] pfmuls,3,sm %b[54], %b[50], %b[119] pfmuls,4,sm %b[31], %b[18], %b[117] pfmuls,5,sm %b[62], %b[103], %b[115] } { loop_mode pfmul_hadds,0,sm %b[43], %b[116], %g16, %b[85] pfadd_adds,1,sm %b[85], %b[94], %b[33], %b[71] pfadd_rsubs,2,sm %b[85], %b[94], %b[33], %b[68] pfmul_hadds,3,sm %b[108], %b[38], %g17, %b[62] pfmuls,5,sm %b[95], %b[66], %b[116] movad,0 area=3, ind=0, am=0, be=0, %b[43] movad,1 area=3, ind=8, am=1, be=0, %b[54] movad,2 area=3, ind=0, am=0, be=0, %b[33] movad,3 area=3, ind=8, am=1, be=0, %b[38] } { loop_mode pfmul_hadds,0,sm %b[61], %b[76], %g18, %b[107] pfsub_adds,1,sm %b[23], %b[101], %b[87], %b[94] pfsub_rsubs,2,sm %b[23], %b[101], %b[87], %b[95] pfmuls,4,sm %g19, %b[36], %g17 pfmuls,5,sm %b[51], %b[114], %g16 movad,0 area=2, ind=0, am=0, be=0, %b[76] movad,1 area=2, ind=8, am=1, be=0, %b[87] movad,2 area=2, ind=0, am=0, be=0, %b[51] movad,3 area=2, ind=8, am=1, be=0, %b[61] } { loop_mode pfadd_adds,0,sm %b[23], %b[101], %b[109], %b[108] pfadd_rsubs,1,sm %b[23], %b[101], %b[109], %b[109] staad,2 %b[3], %aad2[ %aasti9 + _f32s,_lts0 0x8 ] pfsubs,3,sm %b[30], %b[64], %b[101] pfmuls,4,sm %b[84], %b[74], %g18 staad,5 %b[2], %aad1[ %aasti8 + _f32s,_lts0 0x8 ] movad,0 area=1, 
ind=16, am=0, be=0, %b[23] movad,1 area=1, ind=0, am=0, be=0, %b[84] movad,2 area=1, ind=16, am=0, be=0, %b[2] movad,3 area=1, ind=0, am=0, be=0, %b[3] } { loop_mode staad,2 %b[73], %aad4[ %aasti11 + _f32s,_lts0 0x8 ] pfmul_hadds,3,sm %b[34], %b[50], %b[119], %b[70] pfadds,4,sm %b[30], %b[64], %b[64] staad,5 %b[70], %aad3[ %aasti10 + _f32s,_lts0 0x8 ] movad,0 area=1, ind=8, am=1, be=0, %b[50] movad,1 area=1, ind=24, am=0, be=0, %g19 movad,2 area=1, ind=8, am=0, be=0, %b[30] movad,3 area=0, ind=24, am=0, be=0, %b[34] } { loop_mode pfmul_hadds,1,sm %b[11], %b[104], %b[113], %b[97] staad,2 %b[96], %aad2[ %aasti9 ] incr,2 %aaincr4 pfsubs,4,sm %b[24], %b[72], %b[112] staad,5 %b[97], %aad1[ %aasti8 ] incr,5 %aaincr4 movad,0 area=0, ind=0, am=0, be=0, %b[11] movad,1 area=0, ind=8, am=0, be=0, %b[96] movad,2 area=1, ind=24, am=1, be=0, %b[104] movad,3 area=0, ind=0, am=0, be=0, %b[73] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END pfmul_hadds,0,sm %b[10], %b[18], %b[117], %b[90] pfmul_hadds,1,sm %b[46], %b[103], %b[115], %b[103] staad,2 %b[110], %aad4[ %aasti11 ] incr,2 %aaincr4 pfmuls,3,sm %b[90], %b[102], %b[111] pfadds,4,sm %b[24], %b[72], %b[72] staad,5 %b[111], %aad3[ %aasti10 ] incr,5 %aaincr4 movad,0 area=0, ind=16, am=0, be=0, %b[18] movad,1 area=0, ind=24, am=1, be=0, %b[46] movad,2 area=0, ind=8, am=1, be=0, %b[10] movad,3 area=0, ind=16, am=0, be=0, %b[24] }
Theoretical speed: 8 complex numbers in 7 cycles (8/7) = 9.14 bytes/cycle
Doubled theoretical speed: 18.29 bytes/cycle
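The bytes/cycle figures follow from simple arithmetic: one `float complex` value takes 8 bytes, and the cycle count appears to be the number of wide instructions in the loop body. A small sketch of that calculation; `bytes_per_cycle` is a hypothetical helper, not part of the article's code:

```c
/* Bytes processed per cycle by a loop body that handles
   `complex_per_iter` complex numbers in `cycles_per_iter` cycles.
   Assumption: one complex number is two 32-bit floats, i.e. 8 bytes. */
static double bytes_per_cycle(int complex_per_iter, int cycles_per_iter) {
    const double bytes_per_complex = 8.0;
    return complex_per_iter * bytes_per_complex / cycles_per_iter;
}
```

Here `bytes_per_cycle(8, 7)` gives about 9.14, and doubling it gives the 18.29 figure quoted for a pass that does the work of two radix-2 stages.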
Speed measurements

3. stage_radix2_readConjSwap_2x_simd64_unroll4
Here the loop is unrolled 4 times using the unroll pragma.
C code
void stage_radix2_readConjSwap_2x_simd64_unroll4(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef_a, myComplex *conj_coef_b, myComplex *swap_coef_a, myComplex *swap_coef_b) {
    uint64_t *x0_in = (uint64_t*)&data_in[0];
    uint64_t *y0_in = (uint64_t*)&data_in[1];
    uint64_t *x1_in = (uint64_t*)&data_in[2];
    uint64_t *y1_in = (uint64_t*)&data_in[3];
    uint64_t *conj_c0a_in = (uint64_t*)&conj_coef_a[0];
    uint64_t *conj_c1a_in = (uint64_t*)&conj_coef_a[1];
    uint64_t *conj_c0b_in = (uint64_t*)&conj_coef_b[0];
    uint64_t *conj_c1b_in = (uint64_t*)&conj_coef_b[data_count/4];
    uint64_t *swap_c0a_in = (uint64_t*)&swap_coef_a[0];
    uint64_t *swap_c1a_in = (uint64_t*)&swap_coef_a[1];
    uint64_t *swap_c0b_in = (uint64_t*)&swap_coef_b[0];
    uint64_t *swap_c1b_in = (uint64_t*)&swap_coef_b[data_count/4];
    uint64_t *out_add0 = (uint64_t*)&data_out[0*data_count/4];
    uint64_t *out_add1 = (uint64_t*)&data_out[1*data_count/4];
    uint64_t *out_sub0 = (uint64_t*)&data_out[2*data_count/4];
    uint64_t *out_sub1 = (uint64_t*)&data_out[3*data_count/4];
    #pragma ivdep
    #pragma unroll(4)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/4; ++i) {
        uint64_t x0 = x0_in[4*i];
        uint64_t y0 = y0_in[4*i];
        uint64_t conj_c0 = conj_c0a_in[2*i];
        uint64_t swap_c0 = swap_c0a_in[2*i];
        uint64_t x1 = x1_in[4*i];
        uint64_t y1 = y1_in[4*i];
        uint64_t conj_c1 = conj_c1a_in[2*i];
        uint64_t swap_c1 = swap_c1a_in[2*i];
        uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
        uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
        uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
        uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
        uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
        uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
        uint64_t add0 = __builtin_e2k_pfadds(x0, cy0);
        uint64_t sub0 = __builtin_e2k_pfsubs(x0, cy0);
        uint64_t add1 = __builtin_e2k_pfadds(x1, cy1);
        uint64_t sub1 = __builtin_e2k_pfsubs(x1, cy1);
        x0 = add0;
        y0 = add1;
        conj_c0 = conj_c0b_in[i];
        swap_c0 = swap_c0b_in[i];
        x1 = sub0;
        y1 = sub1;
        conj_c1 = conj_c1b_in[i];
        swap_c1 = swap_c1b_in[i];
        cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
        cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
        cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
        cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
        cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
        cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
        out_add0[i] = __builtin_e2k_pfadds(x0, cy0);
        out_sub0[i] = __builtin_e2k_pfsubs(x0, cy0);
        out_add1[i] = __builtin_e2k_pfadds(x1, cy1);
        out_sub1[i] = __builtin_e2k_pfsubs(x1, cy1);
    }
}
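Stripped of the packed intrinsics, each iteration of the listing above performs two butterfly levels in a row. The following scalar sketch in C99 reproduces only that data flow and indexing. It is a reading aid under assumptions: `stage_radix2_2x_scalar` is a hypothetical name, and `coef_a` and `coef_b` are plain twiddle tables used here in place of the article's pre-arranged `conj_`/`swap_` tables.

```c
#include <complex.h>

/* Scalar sketch of one fused ("2x") radix-2 pass.
   coef_a needs data_count/2 entries, coef_b needs data_count/2 entries
   (two blocks of data_count/4). Both tables are hypothetical. */
static void stage_radix2_2x_scalar(int data_count,
                                   const float complex *data_in,
                                   float complex *data_out,
                                   const float complex *coef_a,
                                   const float complex *coef_b)
{
    const int quarter = data_count / 4;
    for (int i = 0; i < quarter; ++i) {
        /* First level: two independent radix-2 butterflies. */
        float complex x0 = data_in[4*i + 0];
        float complex y0 = data_in[4*i + 1];
        float complex x1 = data_in[4*i + 2];
        float complex y1 = data_in[4*i + 3];
        float complex cy0 = coef_a[2*i + 0] * y0;
        float complex cy1 = coef_a[2*i + 1] * y1;
        float complex add0 = x0 + cy0, sub0 = x0 - cy0;
        float complex add1 = x1 + cy1, sub1 = x1 - cy1;

        /* Second level: the sums pair up, and so do the differences. */
        float complex t0 = coef_b[i] * add1;
        float complex t1 = coef_b[quarter + i] * sub1;
        data_out[0*quarter + i] = add0 + t0;
        data_out[2*quarter + i] = add0 - t0;
        data_out[1*quarter + i] = sub0 + t1;
        data_out[3*quarter + i] = sub0 - t1;
    }
}
```

The intermediate values `add0`, `sub0`, `add1` and `sub1` never go to memory, which is the point of fusing the two passes: the data is loaded and stored once instead of twice.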
Main loop in assembly
.L3317: { fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=2, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=2, abs=0, disp=32 } { fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=2, abs=4, disp=64 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=2, abs=4, disp=96 } { fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=3, asz=2, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=3, asz=2, abs=8, disp=32 } { fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=2, asz=2, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=2, asz=2, abs=12, disp=32 } { fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=7, asz=3, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=6, asz=3, abs=16, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=5, asz=3, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=4, asz=3, abs=24, disp=0 } .L2286: { loop_mode pfmul_hadds,0,sm %b[13], %b[110], %g16, %b[110] pfadd_adds,1,sm %b[37], %b[72], %b[111], %b[115] pfmuls,2,sm %b[18], %b[116], %g17 pfadd_rsubs,3,sm %b[37], %b[72], %b[111], %b[111] pfmuls,4,sm %b[115], %g20, %g21 pfmuls,5,sm %g18, %b[107], %g19 } { loop_mode pfmul_hadds,0,sm %b[12], %b[103], %g24, %b[118] pfsub_adds,1,sm %b[30], %b[102], %b[118], %g23 pfmuls,2,sm %b[52], %b[114], %g25 pfsub_rsubs,3,sm %b[30], %b[102], %b[118], %g22 pfmuls,4,sm %g26, %b[99], %g27 pfmuls,5,sm %b[40], %b[80], %b[103] movad,0 area=5, ind=16, am=0, be=0, %b[12] movad,1 area=5, ind=0, am=0, be=0, %b[13] movad,2 area=5, ind=16, am=0, be=0, %b[0] movad,3 area=5, ind=0, am=0, be=0, %b[1] } { loop_mode pfmul_hadds,0,sm %b[7], %r0, %r1, %b[102] pfadd_adds,1,sm %b[30], %b[102], %g29, %g31 pfsubs,2,sm %b[92], %b[109], %r2 pfadd_rsubs,3,sm %b[30], %b[102], %g29, %g30 pfmuls,4,sm %b[21], %b[57], %r3 staad,5 %g28, %aad2[ %aasti9 + _f32s,_lts0 0x8 ] movad,0 area=5, ind=24, am=0, be=0, %b[21] movad,1 area=5, ind=8, am=1, be=0, %b[30] movad,2 area=5, ind=24, 
am=0, be=0, %b[7] movad,3 area=5, ind=8, am=1, be=0, %b[18] } { loop_mode pfmul_hadds,1,sm %b[46], %b[108], %r4, %b[109] pfadds,2,sm %b[92], %b[109], %r5 pfsubs,4,sm %b[81], %b[100], %b[108] staad,5 %b[119], %aad1[ %aasti8 + _f32s,_lts0 0x8 ] movad,0 area=4, ind=16, am=0, be=0, %b[46] movad,1 area=4, ind=0, am=0, be=0, %b[52] movad,2 area=4, ind=16, am=0, be=0, %b[37] movad,3 area=4, ind=0, am=0, be=0, %b[40] } { loop_mode pfmul_hadds,0,sm %b[49], %b[63], %r6, %b[100] pfmul_hadds,1,sm %b[6], %b[116], %g17, %b[116] pfmuls,2,sm %b[71], %b[88], %g17 pfadds,4,sm %b[81], %b[100], %b[101] staad,5 %b[101], %aad4[ %aasti11 + _f32s,_lts0 0x8 ] movad,0 area=4, ind=24, am=0, be=0, %b[63] movad,1 area=4, ind=8, am=1, be=0, %b[71] movad,2 area=4, ind=24, am=0, be=0, %b[6] movad,3 area=4, ind=8, am=1, be=0, %b[49] } { loop_mode pfmul_hadds,0,sm %b[43], %b[114], %g25, %g29 pfsub_adds,1,sm %b[68], %b[112], %b[117], %g28 pfsubs,2,sm %b[84], %r9, %b[114] pfsubs,4,sm %b[89], %b[106], %r0 staad,5 %r8, %aad3[ %aasti10 + _f32s,_lts0 0x8 ] movad,0 area=3, ind=16, am=0, be=0, %b[81] movad,1 area=3, ind=0, am=0, be=0, %b[93] movad,2 area=3, ind=0, am=0, be=0, %b[43] movad,3 area=3, ind=16, am=0, be=0, %b[72] } { loop_mode pfsub_rsubs,1,sm %b[68], %b[112], %b[117], %b[117] pfmuls,2,sm %b[34], %r2, %g25 pfadds,4,sm %b[89], %b[106], %b[106] staad,5 %r10, %aad2[ %aasti9 + _f32s,_lts0 0x18 ] movad,0 area=3, ind=8, am=1, be=0, %b[34] movad,1 area=3, ind=24, am=0, be=0, %b[96] movad,2 area=3, ind=8, am=1, be=0, %b[89] movad,3 area=3, ind=24, am=0, be=0, %b[92] } { loop_mode pfmul_hadds,0,sm %b[98], %b[99], %g27, %b[107] pfadd_adds,1,sm %b[68], %b[112], %r11, %b[99] pfmuls,2,sm %b[75], %r5, %r12 pfmul_hadds,3,sm %b[94], %b[107], %g19, %b[98] pfmuls,4,sm %b[25], %b[108], %g16 staad,5 %b[113], %aad1[ %aasti8 + _f32s,_lts0 0x18 ] movad,0 area=2, ind=8, am=0, be=0, %g19 movad,1 area=0, ind=24, am=0, be=0, %b[75] movad,2 area=2, ind=0, am=0, be=0, %b[25] movad,3 area=2, ind=8, am=0, be=0, %b[113] } { 
loop_mode pfmul_hadds,0,sm %b[91], %g20, %g21, %r9 pfadd_rsubs,1,sm %b[68], %b[112], %r11, %r8 staad,2 %r13, %aad4[ %aasti11 + _f32s,_lts0 0x18 ] pfadds,3,sm %b[84], %r9, %b[112] pfmuls,4,sm %b[67], %b[101], %g24 staad,5 %b[105], %aad3[ %aasti10 + _f32s,_lts0 0x18 ] movad,0 area=2, ind=0, am=0, be=0, %b[67] movad,1 area=1, ind=24, am=0, be=0, %g20 movad,2 area=2, ind=24, am=0, be=0, %g18 movad,3 area=1, ind=24, am=0, be=0, %b[105] } { loop_mode pfmul_hadds,0,sm %b[97], %b[88], %g17, %b[68] pfsub_adds,1,sm %b[62], %r16, %b[110], %r10 staad,2 %r15, %aad2[ %aasti9 ] pfmul_hadds,3,sm %b[36], %b[77], %b[104], %b[104] pfmuls,4,sm %b[17], %r0, %r1 staad,5 %r14, %aad1[ %aasti8 ] movad,0 area=2, ind=16, am=0, be=0, %b[17] movad,1 area=2, ind=24, am=1, be=0, %g26 movad,2 area=2, ind=16, am=1, be=0, %b[36] movad,3 area=0, ind=24, am=0, be=0, %b[97] } { loop_mode pfmul_hadds,0,sm %b[85], %b[57], %r3, %b[110] pfmul_hadds,1,sm %b[22], %r2, %g25, %b[115] staad,2 %b[115], %aad4[ %aasti11 ] pfsub_rsubs,3,sm %b[62], %r16, %b[110], %b[111] pfmuls,4,sm %b[56], %b[106], %r4 staad,5 %b[111], %aad3[ %aasti10 ] movad,0 area=1, ind=0, am=0, be=0, %b[22] movad,1 area=1, ind=8, am=0, be=0, %b[57] movad,2 area=1, ind=0, am=0, be=0, %b[56] movad,3 area=1, ind=16, am=0, be=0, %b[77] } { loop_mode pfmul_hadds,0,sm %b[76], %b[80], %b[103], %r16 pfadd_adds,1,sm %b[62], %r16, %b[118], %r13 staad,2 %g23, %aad2[ %aasti9 + _f32s,_lts0 0x10 ] incr,2 %aaincr4 pfadd_rsubs,3,sm %b[62], %r16, %b[118], %b[103] pfmuls,4,sm %b[29], %b[61], %r6 staad,5 %g22, %aad1[ %aasti8 + _f32s,_lts0 0x10 ] incr,5 %aaincr4 movad,0 area=1, ind=16, am=1, be=0, %b[80] movad,1 area=0, ind=0, am=0, be=0, %b[29] movad,2 area=1, ind=8, am=1, be=0, %b[76] movad,3 area=0, ind=0, am=0, be=0, %b[62] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? 
%NOT_LOOP_END pfmul_hadds,0,sm %b[53], %r5, %r12, %r11 pfsub_adds,1,sm %b[35], %b[70], %b[102], %r15 staad,2 %g31, %aad4[ %aasti11 + _f32s,_lts0 0x10 ] incr,2 %aaincr4 pfsub_rsubs,3,sm %b[35], %b[70], %b[102], %r14 pfmuls,4,sm %g19, %b[75], %b[102] staad,5 %g30, %aad3[ %aasti10 + _f32s,_lts0 0x10 ] incr,5 %aaincr4 movad,0 area=0, ind=16, am=0, be=0, %b[85] movad,1 area=0, ind=8, am=1, be=0, %b[84] movad,2 area=0, ind=8, am=1, be=0, %b[53] movad,3 area=0, ind=16, am=0, be=0, %b[88] }
Theoretical speed: 16 complex numbers in 13 cycles (16/13) = 9.85 bytes/cycle
Doubled theoretical speed: 19.69 bytes/cycle
Speed measurements

4. stage_radix2_readConjSwap_2x_simd128
Here the stage_radix2_readConjSwap_simd128 algorithm is manually unrolled 2 times.
C code
void stage_radix2_readConjSwap_2x_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef_a, myComplex *conj_coef_b, myComplex *swap_coef_a, myComplex *swap_coef_b) {
    __v2di *xy0_in = (__v2di*)&data_in[0];
    __v2di *xy1_in = (__v2di*)&data_in[2];
    __v2di *xy2_in = (__v2di*)&data_in[4];
    __v2di *xy3_in = (__v2di*)&data_in[6];
    __v2di *conj_c0a_in = (__v2di*)&conj_coef_a[0];
    __v2di *conj_c1a_in = (__v2di*)&conj_coef_a[2];
    __v2di *conj_c0b_in = (__v2di*)&conj_coef_b[0];
    __v2di *conj_c1b_in = (__v2di*)&conj_coef_b[data_count/4];
    __v2di *swap_c0a_in = (__v2di*)&swap_coef_a[0];
    __v2di *swap_c1a_in = (__v2di*)&swap_coef_a[2];
    __v2di *swap_c0b_in = (__v2di*)&swap_coef_b[0];
    __v2di *swap_c1b_in = (__v2di*)&swap_coef_b[data_count/4];
    __v2di *out_add0 = (__v2di*)&data_out[0*data_count/4];
    __v2di *out_add1 = (__v2di*)&data_out[1*data_count/4];
    __v2di *out_sub0 = (__v2di*)&data_out[2*data_count/4];
    __v2di *out_sub1 = (__v2di*)&data_out[3*data_count/4];
    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/8; ++i) {
        __v2di xy0 = xy0_in[4*i];
        __v2di xy1 = xy1_in[4*i];
        __v2di conj_c0 = conj_c0a_in[2*i];
        __v2di swap_c0 = swap_c0a_in[2*i];
        __v2di xy2 = xy2_in[4*i];
        __v2di xy3 = xy3_in[4*i];
        __v2di conj_c1 = conj_c1a_in[2*i];
        __v2di swap_c1 = swap_c1a_in[2*i];
        __v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
        __v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
        __v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
        __v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
        __v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
        __v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
        __v2di cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        __v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        __v2di add0 = __builtin_e2k_qpfadds(x0, cy0);
        __v2di sub0 = __builtin_e2k_qpfsubs(x0, cy0);
        __v2di add1 = __builtin_e2k_qpfadds(x1, cy1);
        __v2di sub1 = __builtin_e2k_qpfsubs(x1, cy1);
        xy0 = add0;
        xy1 = add1;
        conj_c0 = conj_c0b_in[i];
        swap_c0 = swap_c0b_in[i];
        xy2 = sub0;
        xy3 = sub1;
        conj_c1 = conj_c1b_in[i];
        swap_c1 = swap_c1b_in[i];
        x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
        x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
        y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
        cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
        cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
        cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
        cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
        cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
        cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        out_add0[i] = __builtin_e2k_qpfadds(x0, cy0);
        out_sub0[i] = __builtin_e2k_qpfsubs(x0, cy0);
        out_add1[i] = __builtin_e2k_qpfadds(x1, cy1);
        out_sub1[i] = __builtin_e2k_qpfsubs(x1, cy1);
    }
}
Main loop in assembly
.L4621: { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=3, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=3, abs=0, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=3, asz=3, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=2, asz=3, abs=8, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=3, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=3, abs=16, disp=0 } { fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=3, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=24, disp=0 } .L3922: { loop_mode qpfsubs,2,sm %b[65], %b[78], %b[95] qpfadds,3,sm %b[28], %b[3], %b[0] qpshufb,4,sm %b[69], %b[69], %r20, %b[1] qpfadds,5,sm %b[65], %b[78], %b[96] } { loop_mode qpfmul_hadds,0,sm %b[74], %b[30], %b[76], %b[17] qpshufb,1,sm %b[50], %b[55], %r18, %b[28] qpfsubs,2,sm %b[77], %b[92], %b[97] qpshufb,4,sm %b[88], %b[91], %r18, %b[21] qpfadds,5,sm %b[77], %b[92], %b[98] movaqp,0 area=3, ind=0, am=1, be=0, %b[8] movaqp,1 area=2, ind=0, am=1, be=0, %b[3] } { loop_mode qpfmul_hadds,0,sm %b[68], %b[89], %b[71], %b[65] qpshufb,1,sm %b[90], %b[93], %r19, %b[59] qpfmuls,2,sm %b[86], %b[28], %b[74] qpshufb,4,sm %b[62], %b[62], %r20, %b[76] qpfmuls,5,sm %b[82], %b[87], %b[69] movaqp,0 area=0, ind=0, am=0, be=0, %b[51] movaqp,1 area=0, ind=16, am=1, be=0, %b[46] movaqp,2 area=3, ind=0, am=1, be=0, %b[33] movaqp,3 area=2, ind=0, am=1, be=0, %b[30] } { loop_mode qpshufb,0,sm %b[4], %b[29], %r18, %b[85] qpshufb,1,sm %b[2], %b[81], %r19, %b[71] qpfadds,2,sm %b[58], %b[54], %b[77] qpfsubs,3,sm %b[58], %b[54], %b[89] qpshufb,4,sm %b[27], %b[27], %r20, %b[90] qpfsubs,5,sm %b[26], %b[1], %b[86] movaqp,0 area=1, ind=16, am=1, be=0, %b[78] movaqp,1 area=1, ind=0, am=0, be=0, %b[82] movaqp,2 area=1, ind=16, am=1, be=0, %b[62] movaqp,3 area=1, ind=0, am=0, be=0, %b[68] } { loop_mode qpfmul_hadds,0,sm %b[47], %b[23], %b[94], %b[58] qpshufb,1,sm %b[52], 
%b[57], %r19, %b[54] staaqp,2 %b[95], %aad1[ %aasti8 ] incr,2 %aaincr0 qpfmuls,3,sm %b[20], %b[21], %b[92] qpshufb,4,sm %b[0], %b[79], %r18, %b[81] staaqp,5 %b[96], %aad2[ %aasti9 ] incr,5 %aaincr0 movaqp,2 area=0, ind=0, am=0, be=0, %b[27] movaqp,3 area=0, ind=16, am=1, be=0, %b[2] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END qpfmul_hadds,0,sm %b[44], %b[83], %b[49], %b[23] qpshufb,1,sm %b[6], %b[31], %r19, %b[20] staaqp,2 %b[97], %aad3[ %aasti10 ] incr,2 %aaincr0 qpfmuls,3,sm %b[15], %b[81], %b[47] qpshufb,4,sm %b[19], %b[19], %r20, %b[52] staaqp,5 %b[98], %aad4[ %aasti11 ] incr,5 %aaincr0 }
Theoretical speed: 8 complex numbers in 6 cycles (8/6) = 10.67 bytes/cycle
Doubled theoretical speed: 21.33 bytes/cycle
Speed measurements

5. stage_radix2_readConjSwap_2x_simd128_v2
The code was reshuffled to reduce the number of instructions.
C code
void stage_radix2_readConjSwap_2x_simd128_v2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef_a, myComplex *conj_coef_b, myComplex *swap_coef_a, myComplex *swap_coef_b) {
    __v2di *xy0_in = (__v2di*)&data_in[0];
    __v2di *xy1_in = (__v2di*)&data_in[2];
    __v2di *xy2_in = (__v2di*)&data_in[4];
    __v2di *xy3_in = (__v2di*)&data_in[6];
    __v2di *conj_c0a_in = (__v2di*)&conj_coef_a[0];
    __v2di *conj_c1a_in = (__v2di*)&conj_coef_a[2];
    __v2di *conj_c0b_in = (__v2di*)&conj_coef_b[0];
    __v2di *conj_c1b_in = (__v2di*)&conj_coef_b[data_count/4];
    __v2di *swap_c0a_in = (__v2di*)&swap_coef_a[0];
    __v2di *swap_c1a_in = (__v2di*)&swap_coef_a[2];
    __v2di *swap_c0b_in = (__v2di*)&swap_coef_b[0];
    __v2di *swap_c1b_in = (__v2di*)&swap_coef_b[data_count/4];
    __v2di *out_add0 = (__v2di*)&data_out[0*data_count/4];
    __v2di *out_add1 = (__v2di*)&data_out[1*data_count/4];
    __v2di *out_sub0 = (__v2di*)&data_out[2*data_count/4];
    __v2di *out_sub1 = (__v2di*)&data_out[3*data_count/4];
    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/8; ++i) {
        __v2di xy0 = xy0_in[4*i];
        __v2di xy1 = xy1_in[4*i];
        __v2di conj_c0 = conj_c0a_in[2*i];
        __v2di swap_c0 = swap_c0a_in[2*i];
        __v2di xy2 = xy2_in[4*i];
        __v2di xy3 = xy3_in[4*i];
        __v2di conj_c1 = conj_c1a_in[2*i];
        __v2di swap_c1 = swap_c1a_in[2*i];
        __v2di x0_rrii = __builtin_e2k_qppermb(xy1, xy0, (__v2di){0x1312111003020100, 0x1716151407060504});
        __v2di x1_rrii = __builtin_e2k_qppermb(xy3, xy2, (__v2di){0x1312111003020100, 0x1716151407060504});
        __v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
        __v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
        __v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
        __v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
        __v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
        __v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
        __v2di add0_rrii = __builtin_e2k_qpfadds(x0_rrii, cy0_rrii);
        __v2di sub0_rrii = __builtin_e2k_qpfsubs(x0_rrii, cy0_rrii);
        __v2di add1_rrii = __builtin_e2k_qpfadds(x1_rrii, cy1_rrii);
        __v2di sub1_rrii = __builtin_e2k_qpfsubs(x1_rrii, cy1_rrii);
        __v2di xy0_rrii = add0_rrii;
        __v2di xy1_rrii = add1_rrii;
        conj_c0 = conj_c0b_in[i];
        swap_c0 = swap_c0b_in[i];
        __v2di xy2_rrii = sub0_rrii;
        __v2di xy3_rrii = sub1_rrii;
        conj_c1 = conj_c1b_in[i];
        swap_c1 = swap_c1b_in[i];
        __v2di x0 = __builtin_e2k_qppermb(xy1_rrii, xy0_rrii, (__v2di){0x0B0A090803020100, 0x1B1A191813121110});
        __v2di x1 = __builtin_e2k_qppermb(xy3_rrii, xy2_rrii, (__v2di){0x0B0A090803020100, 0x1B1A191813121110});
        y0 = __builtin_e2k_qppermb(xy1_rrii, xy0_rrii, (__v2di){0x0F0E0D0C07060504, 0x1F1E1D1C17161514});
        y1 = __builtin_e2k_qppermb(xy3_rrii, xy2_rrii, (__v2di){0x0F0E0D0C07060504, 0x1F1E1D1C17161514});
        cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
        cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
        cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
        cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
        cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
        cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
        __v2di cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        __v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        out_add0[i] = __builtin_e2k_qpfadds(x0, cy0);
        out_sub0[i] = __builtin_e2k_qpfsubs(x0, cy0);
        out_add1[i] = __builtin_e2k_qpfadds(x1, cy1);
        out_sub1[i] = __builtin_e2k_qpfsubs(x1, cy1);
    }
}
Main loop in assembly
.L5345: { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=3, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=3, abs=0, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=3, asz=3, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=2, asz=3, abs=8, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=3, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=3, abs=16, disp=0 } { fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=3, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=24, disp=0 } .L4696: { loop_mode qpfmul_hadds,0,sm %b[62], %b[79], %b[110], %b[62] qpshufb,1,sm %b[83], %b[83], %r11, %b[110] qpfsubs,2,sm %b[119], %b[118], %b[113] qpfmuls,3,sm %b[73], %b[103], %b[73] qppermb,4,sm %b[69], %b[12], %r9, %b[115] qpfadds,5,sm %b[119], %b[118], %b[114] movaqp,0 area=0, ind=0, am=0, be=0, %b[0] movaqp,1 area=1, ind=0, am=0, be=0, %b[69] movaqp,2 area=0, ind=0, am=0, be=0, %b[12] movaqp,3 area=0, ind=16, am=1, be=0, %b[1] } { loop_mode qpfmul_hadds,0,sm %b[44], %b[26], %b[108], %b[79] qppermb,1,sm %b[92], %b[76], %r0, %b[117] qpfmuls,2,sm %b[57], %b[77], %b[108] qpfmuls,3,sm %b[80], %b[109], %b[80] qpshufb,4,sm %b[66], %b[66], %r11, %b[116] qpfadds,5,sm %b[115], %b[25], %b[66] movaqp,0 area=0, ind=16, am=1, be=0, %b[57] movaqp,1 area=1, ind=16, am=1, be=0, %b[76] movaqp,2 area=3, ind=0, am=1, be=0, %b[26] movaqp,3 area=2, ind=0, am=1, be=0, %b[44] } { loop_mode qpfmul_hadds,0,sm %b[106], %b[111], %b[82], %b[92] qpshufb,1,sm %b[3], %b[14], %r12, %b[107] qpfmuls,2,sm %b[41], %b[24], %b[106] qpfadds,3,sm %g16, %b[98], %b[82] qpshufb,4,sm %b[59], %b[2], %r12, %b[101] qpfsubs,5,sm %b[115], %b[25], %b[83] movaqp,0 area=3, ind=0, am=1, be=0, %b[25] movaqp,1 area=2, ind=0, am=1, be=0, %b[41] movaqp,2 area=1, ind=0, am=0, be=0, %b[93] movaqp,3 area=1, ind=16, am=1, be=0, %b[100] } { loop_mode qpfsubs,0,sm %g18, %b[110], %g17 qppermb,1,sm 
%b[11], %b[22], %r9, %g16 staaqp,2 %g17, %aad1[ %aasti8 ] incr,2 %aaincr0 qpfsubs,3,sm %g16, %b[98], %b[11] qppermb,4,sm %b[13], %b[85], %r10, %b[22] staaqp,5 %b[112], %aad2[ %aasti9 ] incr,5 %aaincr0 } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END qpfmul_hadds,0,sm %b[99], %b[105], %b[75], %b[19] qppermb,1,sm %b[84], %b[68], %r10, %b[75] staaqp,2 %b[113], %aad3[ %aasti10 ] incr,2 %aaincr0 qpfadds,3,sm %g18, %b[110], %b[110] qppermb,4,sm %b[19], %b[91], %r0, %g18 staaqp,5 %b[114], %aad4[ %aasti11 ] incr,5 %aaincr0 }
Theoretical speed: 8 complex numbers in 5 cycles (8/5) = 12.8 bytes/cycle
Doubled theoretical speed: 25.6 bytes/cycle
Speed measurements

Summary for stage_radix2_readConjSwap_2x


Speeds increased compared with the original stage_radix2_readConjSwap versions.
The FFT chart can be found here.
stage_radix4
Diagram of the Stage algorithm for the radix-4 version.

One pass of stage_radix4 does the same work as 2 passes of stage_radix2. For that reason, the stage_radix4 speed is multiplied by 2 to make it directly comparable with stage_radix2 (this is noted on the chart axis and in the console output).
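One way to see why a radix-4 pass counts as two radix-2 passes: with all twiddle factors equal to 1, a single radix-4 butterfly is already a complete 4-point DFT, which a radix-2 decomposition needs log2(4) = 2 stages to produce. The sketch below uses C99 `float complex` rather than the article's `myComplex`, and `radix4_butterfly` is a hypothetical helper that mirrors the arithmetic of the stage_radix4_etalon listing further down.

```c
#include <complex.h>

/* One radix-4 butterfly. c, d and e are the twiddle factors
   applied to y, z and w respectively. */
static void radix4_butterfly(float complex x, float complex y,
                             float complex z, float complex w,
                             float complex c, float complex d,
                             float complex e, float complex out[4])
{
    float complex cy = c * y;
    float complex dz = d * z;
    float complex ew = e * w;
    float complex add02 = x + dz;
    float complex sub02 = x - dz;
    float complex add13 = cy + ew;
    float complex sub13 = cy - ew;
    float complex sub13i = I * sub13;   /* multiply by i: (re, im) -> (-im, re) */
    out[0] = add02 + add13;
    out[1] = sub02 - sub13i;
    out[2] = add02 - add13;
    out[3] = sub02 + sub13i;
}
```

Feeding it x, y, z, w = 1, 2, 3, 4 with c = d = e = 1 yields 10, -2+2i, -2, -2-2i, which is the 4-point DFT of that sequence.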
1. stage_radix4_etalon
Reference version, used to check the other versions for correctness.
C code
void stage_radix4_etalon(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE) {
    myComplex *x_in = &data_in[0];
    myComplex *y_in = &data_in[1];
    myComplex *z_in = &data_in[2];
    myComplex *w_in = &data_in[3];
    myComplex *c_in = coefC;
    myComplex *d_in = coefD;
    myComplex *e_in = coefE;
    myComplex *out_0 = &data_out[0*data_count/4];
    myComplex *out_1 = &data_out[1*data_count/4];
    myComplex *out_2 = &data_out[2*data_count/4];
    myComplex *out_3 = &data_out[3*data_count/4];
    #pragma ivdep
    #pragma unroll(1)
    // #pragma prefetch
    for(int64_t i = 0; i < data_count/4; ++i) {
        myComplex x = x_in[4*i];
        myComplex y = y_in[4*i];
        myComplex z = z_in[4*i];
        myComplex w = w_in[4*i];
        myComplex c = c_in[i];
        myComplex d = d_in[i];
        myComplex e = e_in[i];
        myComplex cy = complex_mul(c, y);
        myComplex dz = complex_mul(d, z);
        myComplex ew = complex_mul(e, w);
        myComplex add02 = complex_add( x, dz);
        myComplex sub02 = complex_sub( x, dz);
        myComplex add13 = complex_add(cy, ew);
        myComplex sub13 = complex_sub(cy, ew);
        myComplex sub13i = (myComplex){.real = -sub13.imag, .imag = sub13.real};
        out_0[i] = complex_add(add02, add13);
        out_1[i] = complex_sub(sub02, sub13i);
        out_2[i] = complex_sub(add02, add13);
        out_3[i] = complex_add(sub02, sub13i);
    }
}
Main loop in assembly
.L868: { fapb ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=4, asz=3, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0 } { fapb ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=3, asz=3, abs=8, disp=0 } { fapb ct=1, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=2, asz=4, abs=16, disp=0 } .L236: { loop_mode fmul_adds,0,sm %b[66], %b[72], %b[54], %b[1] fsub_rsubs,1,sm %b[12], %b[71], %b[55], %b[5] fsub_adds,2,sm %b[12], %b[71], %b[55], %b[45] fmuls,4,sm %b[59], %b[58], %b[52] fsubs,5,sm %b[79], %b[75], %b[53] movaw,3 area=0, ind=4, am=0, be=0, %b[0] } { loop_mode fmul_rsubs,0,sm %b[66], %b[60], %b[82], %b[71] fadd_adds,1,sm %b[24], %b[83], %b[90], %b[72] fadd_rsubs,2,sm %b[24], %b[83], %b[90], %b[76] fmuls,4,sm %b[59], %b[70], %b[80] fadds,5,sm %b[79], %b[75], %b[88] movaw,1 area=2, ind=4, am=0, be=0, %b[55] movaw,2 area=0, ind=0, am=0, be=0, %b[12] movaw,3 area=0, ind=24, am=0, be=0, %b[54] } { loop_mode fmul_rsubs,0,sm %b[42], %b[17], %b[84], %b[75] fmul_rsubs,1,sm %b[32], %b[67], %b[85], %b[79] staaw,2 %b[36], %aad3[ %aasti7 ] fmuls,3,sm %b[29], %b[46], %b[82] fmuls,4,sm %b[35], %b[23], %b[83] staaw,5 %b[39], %aad1[ %aasti5 ] movaw,0 area=2, ind=0, am=1, be=0, %b[60] movaw,1 area=1, ind=0, am=0, be=0, %b[24] movaw,2 area=0, ind=16, am=0, be=0, %b[59] movaw,3 area=0, ind=28, am=0, be=0, %b[66] } { loop_mode fmul_adds,0,sm %b[40], %b[46], %b[86], %b[39] fmul_adds,1,sm %b[32], %b[25], %b[87], %b[67] staaw,2 %b[11], %aad4[ %aasti8 + _f32s,_lts0 0x4 ] fmuls,3,sm %b[27], %b[13], %b[84] fmuls,4,sm %b[35], %b[65], %b[85] staaw,5 %b[51], %aad2[ %aasti6 + _f32s,_lts0 0x4 ] movaw,0 area=1, ind=4, am=1, be=0, %b[29] movaw,1 area=0, ind=0, am=0, be=0, %b[36] movaw,2 area=0, ind=20, am=0, be=0, %b[17] movaw,3 area=0, ind=12, am=0, be=0, %b[42] } { loop_mode fsub_rsubs,0,sm %b[22], %b[81], %b[48], %b[32] fsub_adds,1,sm %b[22], %b[81], %b[48], %b[35] staaw,2 %b[7], %aad3[ %aasti7 + _f32s,_lts0 0x4 ] incr,2 %aaincr3 fsubs,4,sm %b[3], %b[43], %b[46] 
staaw,5 %b[47], %aad1[ %aasti5 + _f32s,_lts0 0x4 ] incr,5 %aaincr3 movaw,1 area=0, ind=4, am=1, be=0, %b[25] movaw,3 area=0, ind=8, am=1, be=0, %b[11] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END fadd_adds,0,sm %b[10], %b[69], %b[50], %b[7] fadd_rsubs,1,sm %b[10], %b[69], %b[50], %b[47] staaw,2 %b[74], %aad4[ %aasti8 ] incr,2 %aaincr3 fadds,4,sm %b[43], %b[3], %b[48] staaw,5 %b[78], %aad2[ %aasti6 ] incr,5 %aaincr3 }
Theoretical speed: 4 complex numbers in 6 cycles (4/6) = 5.33 bytes/cycle
Doubled theoretical speed: 10.67 bytes/cycle
Speed measurements

2. stage_radix4_etalon_unroll2
Here the loop is unrolled 2 times using the unroll pragma.
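For readers unfamiliar with unrolling, the effect requested by the pragma can be written out by hand on a toy loop. This is only an illustration in portable C: `scale_unrolled2` is a hypothetical function unrelated to the FFT code, and what the compiler actually does with the pragma is visible only in its assembly output.

```c
#include <stddef.h>

/* Hand-written equivalent of a loop body unrolled 2 times.
   Assumption: count is even, which holds for power-of-two sizes. */
static void scale_unrolled2(size_t count, const float *in, float *out, float k)
{
    for (size_t i = 0; i < count; i += 2) {
        out[i + 0] = in[i + 0] * k;   /* original iteration i     */
        out[i + 1] = in[i + 1] * k;   /* original iteration i + 1 */
    }
}
```

Putting two independent iterations into one loop body gives the scheduler more operations to pack into each wide instruction, and it allows neighbouring scalar operations to be merged into packed ones.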
C code
void stage_radix4_etalon_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE) {
    myComplex *x_in = &data_in[0];
    myComplex *y_in = &data_in[1];
    myComplex *z_in = &data_in[2];
    myComplex *w_in = &data_in[3];
    myComplex *c_in = coefC;
    myComplex *d_in = coefD;
    myComplex *e_in = coefE;
    myComplex *out_0 = &data_out[0*data_count/4];
    myComplex *out_1 = &data_out[1*data_count/4];
    myComplex *out_2 = &data_out[2*data_count/4];
    myComplex *out_3 = &data_out[3*data_count/4];
    #pragma ivdep
    #pragma unroll(2)
    // #pragma prefetch
    for(int64_t i = 0; i < data_count/4; ++i) {
        myComplex x = x_in[4*i];
        myComplex y = y_in[4*i];
        myComplex z = z_in[4*i];
        myComplex w = w_in[4*i];
        myComplex c = c_in[i];
        myComplex d = d_in[i];
        myComplex e = e_in[i];
        myComplex cy = complex_mul(c, y);
        myComplex dz = complex_mul(d, z);
        myComplex ew = complex_mul(e, w);
        myComplex add02 = complex_add( x, dz);
        myComplex sub02 = complex_sub( x, dz);
        myComplex add13 = complex_add(cy, ew);
        myComplex sub13 = complex_sub(cy, ew);
        myComplex sub13i = (myComplex){.real = -sub13.imag, .imag = sub13.real};
        out_0[i] = complex_add(add02, add13);
        out_1[i] = complex_sub(sub02, sub13i);
        out_2[i] = complex_sub(add02, add13);
        out_3[i] = complex_add(sub02, sub13i);
    }
}
Main loop in assembly
.L2050: { fapb ct=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=32 } { fapb ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=4, asz=3, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=3, asz=4, abs=16, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=4, abs=16, disp=0 } .L913: { loop_mode insfd,0,sm %b[67], %r7, %b[70], %b[0] pfmul_rsubs,1,sm %b[78], %b[5], %g16, %b[30] pfadd_adds,2,sm %b[50], %b[32], %b[77], %b[5] pshufb,3,sm %b[22], %b[47], %r0, %b[67] insfd,4,sm %b[30], %r7, %b[31], %b[1] pfmuls,5,sm %b[83], %b[58], %g16 } { loop_mode insfd,0,sm %b[24], %r7, %b[49], %b[78] pfsub_adds,1,sm %b[50], %b[32], %b[81], %b[31] pfsub_adds,2,sm %b[6], %b[25], %b[71], %b[24] pshufb,3,sm %b[61], %b[8], %r0, %b[83] pshufb,4,sm %b[26], %b[33], %r0, %b[87] pfmuls,5,sm %b[83], %b[3], %b[70] } { loop_mode insfd,0,sm %b[63], %r7, %b[10], %b[71] pfsub_rsubs,1,sm %b[50], %b[32], %b[81], %b[36] pfsub_rsubs,2,sm %b[6], %b[25], %b[71], %b[35] insfd,3,sm %b[76], %r7, %b[79], %b[10] pshufb,4,sm %b[37], %b[38], %r0, %b[76] } { loop_mode insfd,0,sm %b[85], %r7, %b[86], %b[43] pfadd_rsubs,1,sm %b[50], %b[32], %b[77], %b[39] pfadd_rsubs,2,sm %b[6], %b[25], %b[75], %b[32] insfd,3,sm %b[80], %r7, %b[84], %b[40] pshufb,4,sm %b[34], %b[41], %r0, %b[77] } { loop_mode pfmuls,0,sm %b[67], %b[43], %b[25] pfadd_adds,1,sm %b[6], %b[25], %b[75], %b[49] staad,2 %b[87], %aad1[ %aasti5 + _f32s,_lts0 0x8 ] pfsubs,3,sm %b[64], %b[21], %b[79] pshufb,4,sm %b[51], %b[7], %r0, %b[84] pfadds,5,sm %b[11], %b[20], %b[75] movad,0 area=2, ind=0, am=0, be=0, %b[6] movaw,1 area=0, ind=24, am=0, be=0, %b[81] movaw,3 area=0, ind=24, am=0, be=0, %b[80] } { loop_mode pfmuls,0,sm %b[83], %b[40], %b[50] pfmul_adds,1,sm %b[78], %b[45], %b[69], %b[17] staad,2 %b[76], %aad3[ %aasti7 + _f32s,_lts0 0x8 ] pfsubs,3,sm %b[11], %b[20], %b[69] insfd,4,sm %b[59], %r7, %b[17], %b[76] movad,0 area=1, ind=0, 
am=0, be=0, %b[45] movad,1 area=1, ind=8, am=1, be=0, %b[20] movad,3 area=1, ind=0, am=0, be=0, %b[11] } { loop_mode insfd,0,sm %b[72], %r7, %b[82], %b[54] pfmul_adds,1,sm %b[71], %b[42], %g17, %b[60] staad,2 %b[77], %aad2[ %aasti6 + _f32s,_lts0 0x8 ] pfmuls,3,sm %b[83], %b[14], %g17 insfd,4,sm %b[73], %r7, %b[74], %b[42] pfmuls,5,sm %b[67], %b[10], %b[67] movad,0 area=2, ind=8, am=1, be=0, %b[59] movaw,1 area=0, ind=4, am=0, be=0, %b[66] movad,2 area=1, ind=8, am=1, be=0, %b[53] movaw,3 area=0, ind=4, am=0, be=0, %b[63] } { loop_mode insfd,0,sm %b[26], %r7, %b[33], %b[83] pfmul_rsubs,1,sm %b[71], %b[16], %b[52], %b[16] staad,2 %b[84], %aad4[ %aasti8 + _f32s,_lts0 0x8 ] insfd,4,sm %b[37], %r7, %b[38], %b[82] pfadds,5,sm %b[21], %b[64], %b[73] movaw,0 area=0, ind=0, am=0, be=0, %b[72] movaw,1 area=0, ind=8, am=0, be=0, %b[77] movaw,2 area=0, ind=0, am=0, be=0, %b[71] movaw,3 area=0, ind=8, am=0, be=0, %b[74] } { loop_mode insfd,0,sm %b[51], %r7, %b[7], %b[85] pfmul_rsubs,1,sm %b[78], %b[12], %b[27], %b[7] staad,2 %b[83], %aad1[ %aasti5 ] incr,2 %aaincr3 insfd,4,sm %b[34], %r7, %b[41], %b[86] staad,5 %b[82], %aad3[ %aasti7 ] incr,5 %aaincr3 movaw,0 area=0, ind=28, am=0, be=0, %b[82] movaw,1 area=0, ind=12, am=0, be=0, %b[84] movaw,2 area=0, ind=28, am=0, be=0, %b[78] movaw,3 area=0, ind=12, am=0, be=0, %b[83] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END pfmul_adds,1,sm %b[76], %b[58], %b[70], %b[21] staad,2 %b[85], %aad4[ %aasti8 ] incr,2 %aaincr3 insfd,3,sm %b[80], %r7, %b[81], %b[12] pshufb,4,sm %b[57], %b[15], %r0, %b[81] staad,5 %b[86], %aad2[ %aasti6 ] incr,5 %aaincr3 movaw,0 area=0, ind=16, am=0, be=0, %b[27] movaw,1 area=0, ind=20, am=1, be=0, %b[80] movaw,2 area=0, ind=16, am=0, be=0, %b[26] movaw,3 area=0, ind=20, am=1, be=0, %b[70] }
Theoretical speed: 8 complex numbers (8 bytes each) per 10 cycles (8/10) = 6.4 bytes/cycle
Doubled theoretical speed (radix-2 equivalent): 12.8 bytes/cycle
Speed measurements

We see a speedup.
3. stage_radix4_simd64
The computation is done the same way as in stage_radix2_simd64.
C code
void stage_radix4_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE)
{
    uint64_t *x_in = (uint64_t*)&data_in[0];
    uint64_t *y_in = (uint64_t*)&data_in[1];
    uint64_t *z_in = (uint64_t*)&data_in[2];
    uint64_t *w_in = (uint64_t*)&data_in[3];
    uint64_t *c_in = (uint64_t*)coefC;
    uint64_t *d_in = (uint64_t*)coefD;
    uint64_t *e_in = (uint64_t*)coefE;
    uint64_t *out_0 = (uint64_t*)&data_out[0*data_count/4];
    uint64_t *out_1 = (uint64_t*)&data_out[1*data_count/4];
    uint64_t *out_2 = (uint64_t*)&data_out[2*data_count/4];
    uint64_t *out_3 = (uint64_t*)&data_out[3*data_count/4];

    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/4; ++i)
    {
        uint64_t x = x_in[4*i];
        uint64_t y = y_in[4*i];
        uint64_t z = z_in[4*i];
        uint64_t w = w_in[4*i];
        uint64_t c = c_in[i];
        uint64_t d = d_in[i];
        uint64_t e = e_in[i];

        uint64_t conj_c = __builtin_e2k_pxord(c, 1LL<<63);
        uint64_t conj_d = __builtin_e2k_pxord(d, 1LL<<63);
        uint64_t conj_e = __builtin_e2k_pxord(e, 1LL<<63);
        uint64_t swap_c = __builtin_e2k_pshufb(0, c, 0x0302010007060504);
        uint64_t swap_d = __builtin_e2k_pshufb(0, d, 0x0302010007060504);
        uint64_t swap_e = __builtin_e2k_pshufb(0, e, 0x0302010007060504);

        uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
        uint64_t dz_real = __builtin_e2k_pfmuls(conj_d, z);
        uint64_t ew_real = __builtin_e2k_pfmuls(conj_e, w);
        uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);
        uint64_t dz_imag = __builtin_e2k_pfmuls(swap_d, z);
        uint64_t ew_imag = __builtin_e2k_pfmuls(swap_e, w);

        uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);
        uint64_t dz = __builtin_e2k_pfhadds(dz_real, dz_imag);
        uint64_t ew = __builtin_e2k_pfhadds(ew_real, ew_imag);

        uint64_t add02 = __builtin_e2k_pfadds( x, dz);
        uint64_t sub02 = __builtin_e2k_pfsubs( x, dz);
        uint64_t add13 = __builtin_e2k_pfadds(cy, ew);
        uint64_t sub13 = __builtin_e2k_pfsubs(cy, ew);

        //uint64_t conj_sub13 = __builtin_e2k_pxord(sub13, 1LL<<63);
        //uint64_t sub13i = __builtin_e2k_pshufb(0, conj_sub13, 0x0302010007060504);
        uint64_t swap_sub13 = __builtin_e2k_pshufb(0, sub13, 0x0302010007060504);
        uint64_t sub13i = __builtin_e2k_pxord(swap_sub13, 1LL<<31);

        out_0[i] = __builtin_e2k_pfadds(add02, add13);
        out_1[i] = __builtin_e2k_pfsubs(sub02, sub13i);
        out_2[i] = __builtin_e2k_pfsubs(add02, add13);
        out_3[i] = __builtin_e2k_pfadds(sub02, sub13i);
    }
}
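The conj/swap/multiply/horizontal-add sequence in the listing above can be modelled in portable C to see why it produces a complex product. This is a sketch under assumptions, not a description of the hardware: the semantics of the builtins are inferred from how the code uses them (in particular, that pfhadds(a, b) returns the pair (a.lo + a.hi, b.lo + b.hi)), and every type and function name below (pair32, model_*) is made up for this illustration.

```c
#include <assert.h>

/* Models one 64-bit register holding a complex number:
   lo = real part (low 32 bits), hi = imaginary part (high 32 bits). */
typedef struct { float lo, hi; } pair32;

/* pxord with 1LL<<63: flips the sign of the high float (conjugation). */
static pair32 model_conj(pair32 c) { return (pair32){ c.lo, -c.hi }; }

/* pshufb with mask 0x0302010007060504: swaps the two 32-bit halves. */
static pair32 model_swap(pair32 c) { return (pair32){ c.hi, c.lo }; }

/* pfmuls: element-wise product of two float pairs. */
static pair32 model_pfmuls(pair32 a, pair32 b)
{
    return (pair32){ a.lo * b.lo, a.hi * b.hi };
}

/* pfhadds, assumed semantics: (a.lo + a.hi, b.lo + b.hi). */
static pair32 model_pfhadds(pair32 a, pair32 b)
{
    return (pair32){ a.lo + a.hi, b.lo + b.hi };
}

/* Complex product c*y built from the four modelled operations. */
static pair32 model_complex_mul(pair32 c, pair32 y)
{
    pair32 re_parts = model_pfmuls(model_conj(c), y); /* (cr*yr, -ci*yi) */
    pair32 im_parts = model_pfmuls(model_swap(c), y); /* (ci*yr,  cr*yi) */
    return model_pfhadds(re_parts, im_parts);
}
```

For c = 1+2i and y = 3+4i the model returns -5+10i, the expected product.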
Main loop in assembly
.L2675: { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=3, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=3, abs=8, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0 } .L2317: { loop_mode pfadds,0,sm %b[70], %b[67], %b[73] pfadd_adds,1,sm %b[40], %b[47], %b[75], %b[1] pfadd_rsubs,2,sm %b[40], %b[47], %b[75], %b[0] pfsubs,3,sm %b[66], %b[63], %b[39] xord,4,sm %b[51], %r0, %b[58] xord,5,sm %b[33], %r0, %b[79] } { loop_mode pfmuls,0,sm %b[60], %b[21], %b[70] pfsub_rsubs,1,sm %b[40], %b[47], %b[81], %b[67] pfsub_adds,2,sm %b[40], %b[47], %b[81], %b[33] pshufb,3,sm 0x0, %b[18], %r8, %b[75] pshufb,4,sm 0x0, %b[57], %r8, %b[84] xord,5,sm %b[18], %r0, %b[82] } { loop_mode pfmuls,0,sm %b[79], %b[13], %b[85] pfmul_hadds,1,sm %b[78], %b[15], %b[87], %b[57] staad,2 %b[5], %aad4[ %aasti8 ] incr,2 %aaincr0 pfmul_hadds,3,sm %b[84], %b[25], %b[74], %b[60] pshufb,4,sm 0x0, %b[41], %r8, %b[81] staad,5 %b[4], %aad2[ %aasti6 ] incr,5 %aaincr0 movad,1 area=0, ind=0, am=1, be=0, %b[47] movad,2 area=0, ind=0, am=0, be=0, %b[18] movad,3 area=0, ind=16, am=0, be=0, %b[40] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END pfmuls,0,sm %b[82], %b[54], %b[78] pfmul_hadds,1,sm %b[77], %b[56], %b[80], %b[41] staad,2 %b[71], %aad3[ %aasti7 ] incr,2 %aaincr0 pshufb,3,sm 0x0, %b[31], %r8, %b[74] xord,4,sm %b[83], %r7, %b[79] staad,5 %b[37], %aad1[ %aasti5 ] incr,5 %aaincr0 movad,0 area=2, ind=0, am=1, be=0, %b[25] movad,1 area=1, ind=0, am=1, be=0, %b[4] movad,2 area=0, ind=24, am=0, be=0, %b[5] movad,3 area=0, ind=8, am=1, be=0, %b[15] }
Theoretical speed: 4 complex numbers (8 bytes each) per 4 cycles (4/4) = 8 bytes/cycle
Doubled theoretical speed (radix-2 equivalent): 16 bytes/cycle
Speed measurements

We see a speedup.
4. stage_radix4_simd128
The computation is done the same way as in stage_radix2_simd128.
C code
void stage_radix4_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE) { __v2di *xy0_in = (__v2di*)&data_in[0]; __v2di *zw0_in = (__v2di*)&data_in[2]; __v2di *xy1_in = (__v2di*)&data_in[4]; __v2di *zw1_in = (__v2di*)&data_in[6]; __v2di *c_in = (__v2di*)coefC; __v2di *d_in = (__v2di*)coefD; __v2di *e_in = (__v2di*)coefE; __v2di *out_0 = (__v2di*)&data_out[0*data_count/4]; __v2di *out_1 = (__v2di*)&data_out[1*data_count/4]; __v2di *out_2 = (__v2di*)&data_out[2*data_count/4]; __v2di *out_3 = (__v2di*)&data_out[3*data_count/4]; #pragma ivdep #pragma unroll(1) #pragma prefetch for(int64_t i = 0; i < data_count/8; ++i) { __v2di xy0 = xy0_in[4*i]; __v2di zw0 = zw0_in[4*i]; __v2di xy1 = xy1_in[4*i]; __v2di zw1 = zw1_in[4*i]; __v2di c = c_in[i]; __v2di d = d_in[i]; __v2di e = e_in[i]; __v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di conj_c = __builtin_e2k_qpxor(c, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_d = __builtin_e2k_qpxor(d, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_e = __builtin_e2k_qpxor(e, (__v2di){1LL<<63, 1LL<<63}); __v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d = __builtin_e2k_qpshufb(d, d, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e = __builtin_e2k_qpshufb(e, e, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di cy_real = __builtin_e2k_qpfmuls(conj_c, y); __v2di dz_real = __builtin_e2k_qpfmuls(conj_d, z); __v2di ew_real = __builtin_e2k_qpfmuls(conj_e, w); __v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y); __v2di dz_imag = __builtin_e2k_qpfmuls(swap_d, z); __v2di ew_imag = 
__builtin_e2k_qpfmuls(swap_e, w); __v2di cy_rrii = __builtin_e2k_qpfhadds(cy_real, cy_imag); __v2di dz_rrii = __builtin_e2k_qpfhadds(dz_real, dz_imag); __v2di ew_rrii = __builtin_e2k_qpfhadds(ew_real, ew_imag); __v2di cy = __builtin_e2k_qpshufb(cy_rrii, cy_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di dz = __builtin_e2k_qpshufb(dz_rrii, dz_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di ew = __builtin_e2k_qpshufb(ew_rrii, ew_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di add02 = __builtin_e2k_qpfadds( x, dz); __v2di sub02 = __builtin_e2k_qpfsubs( x, dz); __v2di add13 = __builtin_e2k_qpfadds(cy, ew); __v2di sub13 = __builtin_e2k_qpfsubs(cy, ew); //__v2di conj_sub13 = __builtin_e2k_qpxor(sub13, (__v2di){1LL<<63, 1LL<<63}); //__v2di sub13i = __builtin_e2k_qpshufb(conj_sub13, conj_sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_sub13 = __builtin_e2k_qpshufb(sub13, sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di sub13i = __builtin_e2k_qpxor(swap_sub13, (__v2di){1LL<<31, 1LL<<31}); out_0[i] = __builtin_e2k_qpfadds(add02, add13); out_1[i] = __builtin_e2k_qpfsubs(sub02, sub13i); out_2[i] = __builtin_e2k_qpfsubs(add02, add13); out_3[i] = __builtin_e2k_qpfadds(sub02, sub13i); } }
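The listing computes sub13·i by swapping the two floats and then flipping bit 31, while the commented-out variant conjugates first (bit 63) and swaps afterwards. A small bit-level model of one 64-bit lane suggests the two orders give the same result. The sketch assumes IEEE-754 32-bit floats with the real part in the low half; all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack a complex number into 64 bits: real in bits 0..31, imag in bits 32..63. */
static uint64_t pack_complex(float re, float im)
{
    uint32_t r, i;
    memcpy(&r, &re, sizeof r);
    memcpy(&i, &im, sizeof i);
    return (uint64_t)r | ((uint64_t)i << 32);
}

static float low_float(uint64_t v)
{
    uint32_t bits = (uint32_t)v;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

static float high_float(uint64_t v)
{
    uint32_t bits = (uint32_t)(v >> 32);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Models the swap shuffle: exchange the two 32-bit halves. */
static uint64_t swap_halves(uint64_t v) { return (v << 32) | (v >> 32); }

/* Variant used in the listing: swap, then negate the low float (bit 31). */
static uint64_t mul_i_swap_then_xor(uint64_t v)
{
    return swap_halves(v) ^ (1ULL << 31);
}

/* Commented-out variant: negate the high float (bit 63), then swap. */
static uint64_t mul_i_conj_then_swap(uint64_t v)
{
    return swap_halves(v ^ (1ULL << 63));
}
```

Both functions map 3+4i to -4+3i, that is (3+4i)·i.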
Main loop in assembly
.L3309: { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=4, abs=16, disp=0 } { fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=4, abs=16, disp=0 } .L2728: { loop_mode qpfadd_rsubs,0,sm %b[13], %b[38], %b[21], %b[0] qpshufb,1,sm %b[44], %b[44], %r9, %b[8] qpshufb,3,sm %b[16], %b[16], %r10, %b[25] qpshufb,4,sm %b[17], %b[17], %r10, %b[48] qpfmuls,5,sm %b[60], %b[49], %b[1] } { loop_mode qpfsub_rsubs,0,sm %b[13], %b[38], %b[58], %b[44] qpshufb,1,sm %b[36], %b[36], %r9, %b[61] qpshufb,3,sm %b[29], %b[52], %r12, %b[16] qpfsub_adds,4,sm %b[13], %b[38], %b[58], %b[21] qpfadds,5,sm %b[48], %b[25], %b[17] } { loop_mode qpfmul_hadds,0,sm %b[10], %b[51], %b[3], %b[13] qpshufb,1,sm %b[56], %b[56], %r10, %b[36] qpxor,3,sm %b[34], %r0, %b[38] qpshufb,4,sm %b[59], %b[59], %r9, %b[52] qpfmuls,5,sm %b[57], %b[45], %b[29] } { loop_mode qpfmul_hadds,0,sm %b[55], %b[47], %b[31], %b[10] qpshufb,1,sm %b[37], %b[43], %r12, %b[3] staaqp,2 %b[26], %aad4[ %aasti8 ] incr,2 %aaincr0 qpxor,4,sm %b[52], %r7, %b[56] qpfmuls,5,sm %b[38], %b[20], %b[51] } { loop_mode qpfmul_hadds,0,sm %b[61], %b[22], %b[53], %b[52] qpxor,1,sm %b[4], %r0, %b[55] staaqp,2 %b[2], %aad2[ %aasti6 ] incr,2 %aaincr0 qpshufb,3,sm %b[35], %b[41], %r11, %b[47] qpshufb,4,sm %b[27], %b[50], %r11, %b[43] qpfsubs,5,sm %b[48], %b[25], %b[57] movaqp,0 area=1, ind=0, am=1, be=0, %b[38] movaqp,1 area=0, ind=0, am=0, be=0, %b[37] movaqp,2 area=1, ind=0, am=1, be=0, %b[26] movaqp,3 area=0, ind=0, am=0, be=0, %b[31] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? 
%NOT_LOOP_END qpfadd_adds,0,sm %b[11], %b[36], %b[19], %b[22] qpshufb,1,sm %b[6], %b[6], %r9, %b[53] staaqp,2 %b[46], %aad3[ %aasti7 ] incr,2 %aaincr0 qpxor,3,sm %b[42], %r0, %b[58] staaqp,5 %b[23], %aad1[ %aasti5 ] incr,5 %aaincr0 movaqp,0 area=2, ind=0, am=1, be=0, %b[2] movaqp,1 area=0, ind=16, am=1, be=0, %b[48] movaqp,3 area=0, ind=16, am=1, be=0, %b[25] }
Theoretical speed: 8 complex numbers (8 bytes each) per 6 cycles (8/6) = 10.67 bytes/cycle
Doubled theoretical speed (radix-2 equivalent): 21.33 bytes/cycle
Speed measurements

We see a speedup.
5. stage_radix4_simd128_noConj
We reduce the number of instructions in the same way as in stage_radix2_simd128_noConj.
C code
void stage_radix4_simd128_noConj(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE) { __v2di *xy0_in = (__v2di*)&data_in[0]; __v2di *zw0_in = (__v2di*)&data_in[2]; __v2di *xy1_in = (__v2di*)&data_in[4]; __v2di *zw1_in = (__v2di*)&data_in[6]; __v2di *c_in = (__v2di*)coefC; __v2di *d_in = (__v2di*)coefD; __v2di *e_in = (__v2di*)coefE; __v2di *out_0 = (__v2di*)&data_out[0*data_count/4]; __v2di *out_1 = (__v2di*)&data_out[1*data_count/4]; __v2di *out_2 = (__v2di*)&data_out[2*data_count/4]; __v2di *out_3 = (__v2di*)&data_out[3*data_count/4]; #pragma ivdep #pragma unroll(1) #pragma prefetch for(int64_t i = 0; i < data_count/8; ++i) { __v2di xy0 = xy0_in[4*i]; __v2di zw0 = zw0_in[4*i]; __v2di xy1 = xy1_in[4*i]; __v2di zw1 = zw1_in[4*i]; __v2di c = c_in[i]; __v2di d = d_in[i]; __v2di e = e_in[i]; __v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d = __builtin_e2k_qpshufb(d, d, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e = __builtin_e2k_qpshufb(e, e, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di cy_real = __builtin_e2k_qpfmuls( c, y); __v2di dz_real = __builtin_e2k_qpfmuls( d, z); __v2di ew_real = __builtin_e2k_qpfmuls( e, w); __v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y); __v2di dz_imag = __builtin_e2k_qpfmuls(swap_d, z); __v2di ew_imag = __builtin_e2k_qpfmuls(swap_e, w); __v2di cy_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy_real); __v2di dz_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz_real); __v2di ew_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew_real); 
__v2di cy_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy_imag); __v2di dz_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz_imag); __v2di ew_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew_imag); __v2di cy = __builtin_e2k_qppermb(cy_ii, cy_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di dz = __builtin_e2k_qppermb(dz_ii, dz_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di ew = __builtin_e2k_qppermb(ew_ii, ew_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di add02 = __builtin_e2k_qpfadds( x, dz); __v2di sub02 = __builtin_e2k_qpfsubs( x, dz); __v2di add13 = __builtin_e2k_qpfadds(cy, ew); __v2di sub13 = __builtin_e2k_qpfsubs(cy, ew); //__v2di conj_sub13 = __builtin_e2k_qpxor(sub13, (__v2di){1LL<<63, 1LL<<63}); //__v2di sub13i = __builtin_e2k_qpshufb(conj_sub13, conj_sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_sub13 = __builtin_e2k_qpshufb(sub13, sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di sub13i = __builtin_e2k_qpxor(swap_sub13, (__v2di){1LL<<31, 1LL<<31}); out_0[i] = __builtin_e2k_qpfadds(add02, add13); out_1[i] = __builtin_e2k_qpfsubs(sub02, sub13i); out_2[i] = __builtin_e2k_qpfsubs(add02, add13); out_3[i] = __builtin_e2k_qpfadds(sub02, sub13i); } }
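The noConj variant drops the conjugation by taking a horizontal difference for the real parts instead of a horizontal sum. A portable model of that idea is sketched below. The semantics are assumptions inferred from the listing: that qpfhsubs(a, b) and qpfhadds(a, b) place the pairwise results of a in the lower two floats and those of b in the upper two, and that the qppermb mask interleaves the upper halves of its two sources. The quad32 type and all model_* names are hypothetical.

```c
#include <assert.h>

/* Models a 128-bit register with two complex numbers: (re0, im0, re1, im1). */
typedef struct { float f[4]; } quad32;

/* qpfmuls: element-wise product. */
static quad32 model_qpfmuls(quad32 a, quad32 b)
{
    return (quad32){{ a.f[0]*b.f[0], a.f[1]*b.f[1], a.f[2]*b.f[2], a.f[3]*b.f[3] }};
}

/* Swap real and imaginary parts inside each complex number. */
static quad32 model_swap(quad32 c)
{
    return (quad32){{ c.f[1], c.f[0], c.f[3], c.f[2] }};
}

/* Assumed semantics: pairwise differences of a, then pairwise differences of b. */
static quad32 model_qpfhsubs(quad32 a, quad32 b)
{
    return (quad32){{ a.f[0]-a.f[1], a.f[2]-a.f[3], b.f[0]-b.f[1], b.f[2]-b.f[3] }};
}

/* Assumed semantics: pairwise sums of a, then pairwise sums of b. */
static quad32 model_qpfhadds(quad32 a, quad32 b)
{
    return (quad32){{ a.f[0]+a.f[1], a.f[2]+a.f[3], b.f[0]+b.f[1], b.f[2]+b.f[3] }};
}

/* Models the qppermb step: interleave the upper halves of rr and ii. */
static quad32 model_interleave_upper(quad32 ii, quad32 rr)
{
    return (quad32){{ rr.f[2], ii.f[2], rr.f[3], ii.f[3] }};
}

/* Two complex products at once, with no conjugation step. */
static quad32 model_complex_mul2(quad32 c, quad32 y)
{
    quad32 zero = {{ 0.0f, 0.0f, 0.0f, 0.0f }};
    quad32 rr = model_qpfhsubs(zero, model_qpfmuls(c, y));             /* cr*yr - ci*yi */
    quad32 ii = model_qpfhadds(zero, model_qpfmuls(model_swap(c), y)); /* ci*yr + cr*yi */
    return model_interleave_upper(ii, rr);
}
```

With c = (1+2i, i) and y = (3+4i, 5+6i) the model returns (-5+10i, -6+5i). Compared with the previous version, each product needs one operation fewer because the sign change is absorbed into the horizontal subtraction.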
Main loop in assembly
.L3939: { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=4, abs=16, disp=0 } { fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=4, abs=16, disp=0 } .L3362: { loop_mode qpfmul_hsubs,0,sm %b[61], %b[63], %r12, %b[1] qpfmul_hadds,2,sm %b[64], %b[63], %r12, %b[0] qpshufb,3,sm %b[8], %b[9], %r9, %b[40] qpshufb,4,sm %b[19], %b[19], %r11, %b[31] qpfadds,5,sm %b[37], %b[30], %b[26] movaqp,0 area=1, ind=0, am=1, be=0, %b[17] movaqp,1 area=0, ind=0, am=0, be=0, %b[7] movaqp,3 area=0, ind=0, am=0, be=0, %b[6] } { loop_mode qpfmul_hsubs,0,sm %b[21], %b[42], %r12, %b[44] qpfmul_hadds,2,sm %b[33], %b[42], %r12, %b[41] qpshufb,3,sm %b[49], %b[52], %r9, %b[61] qpshufb,4,sm %b[59], %b[59], %r11, %b[62] qpfsubs,5,sm %b[37], %b[30], %b[58] movaqp,0 area=2, ind=0, am=1, be=0, %b[57] movaqp,1 area=0, ind=16, am=1, be=0, %b[50] movaqp,3 area=0, ind=16, am=1, be=0, %b[47] } { loop_mode qpfmul_hsubs,0,sm %b[16], %b[39], %r12, %b[21] qpshufb,1,sm %b[60], %b[60], %r11, %b[42] qpfadd_adds,2,sm %b[24], %b[56], %b[28], %b[30] qpshufb,3,sm %b[14], %b[14], %r11, %b[33] qpshufb,4,sm %b[51], %b[54], %r10, %b[37] staaqp,5 %b[34], %aad4[ %aasti8 ] incr,5 %aaincr0 } { loop_mode qpfmul_hadds,0,sm %b[35], %b[39], %r12, %b[34] qpxor,1,sm %b[42], %r7, %b[60] qpfadd_rsubs,2,sm %b[24], %b[56], %b[28], %b[51] qpshufb,3,sm %b[10], %b[11], %r10, %b[16] qppermb,4,sm %b[38], %b[25], %r0, %b[54] staaqp,5 %b[55], %aad2[ %aasti6 ] incr,5 %aaincr0 } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? 
%NOT_LOOP_END qpfsub_rsubs,0,sm %b[24], %b[56], %b[60], %b[25] qpfsub_adds,1,sm %b[24], %b[56], %b[60], %b[11] staaqp,2 %b[29], %aad3[ %aasti7 ] incr,2 %aaincr0 qppermb,3,sm %b[4], %b[5], %r0, %b[28] qppermb,4,sm %b[45], %b[48], %r0, %b[35] staaqp,5 %b[15], %aad1[ %aasti5 ] incr,5 %aaincr0 movaqp,3 area=1, ind=0, am=1, be=0, %b[10] }
Theoretical speed: 8 complex numbers (8 bytes each) per 5 cycles (8/5) = 12.8 bytes/cycle
Doubled theoretical speed (radix-2 equivalent): 25.6 bytes/cycle
Speed measurements

We see a speedup.
6. stage_radix4_simd128_noConj_unroll3
Here the loop is unrolled by a factor of 3 using the unroll pragma (#pragma unroll(3)).
C code
void stage_radix4_simd128_noConj_unroll3(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE) { __v2di *xy0_in = (__v2di*)&data_in[0]; __v2di *zw0_in = (__v2di*)&data_in[2]; __v2di *xy1_in = (__v2di*)&data_in[4]; __v2di *zw1_in = (__v2di*)&data_in[6]; __v2di *c_in = (__v2di*)coefC; __v2di *d_in = (__v2di*)coefD; __v2di *e_in = (__v2di*)coefE; __v2di *out_0 = (__v2di*)&data_out[0*data_count/4]; __v2di *out_1 = (__v2di*)&data_out[1*data_count/4]; __v2di *out_2 = (__v2di*)&data_out[2*data_count/4]; __v2di *out_3 = (__v2di*)&data_out[3*data_count/4]; #pragma ivdep #pragma unroll(3) #pragma prefetch for(int64_t i = 0; i < data_count/8; ++i) { __v2di xy0 = xy0_in[4*i]; __v2di zw0 = zw0_in[4*i]; __v2di xy1 = xy1_in[4*i]; __v2di zw1 = zw1_in[4*i]; __v2di c = c_in[i]; __v2di d = d_in[i]; __v2di e = e_in[i]; __v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d = __builtin_e2k_qpshufb(d, d, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e = __builtin_e2k_qpshufb(e, e, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di cy_real = __builtin_e2k_qpfmuls( c, y); __v2di dz_real = __builtin_e2k_qpfmuls( d, z); __v2di ew_real = __builtin_e2k_qpfmuls( e, w); __v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y); __v2di dz_imag = __builtin_e2k_qpfmuls(swap_d, z); __v2di ew_imag = __builtin_e2k_qpfmuls(swap_e, w); __v2di cy_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy_real); __v2di dz_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz_real); __v2di ew_rr = __builtin_e2k_qpfhsubs((__v2di){0}, 
ew_real); __v2di cy_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy_imag); __v2di dz_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz_imag); __v2di ew_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew_imag); __v2di cy = __builtin_e2k_qppermb(cy_ii, cy_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di dz = __builtin_e2k_qppermb(dz_ii, dz_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di ew = __builtin_e2k_qppermb(ew_ii, ew_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di add02 = __builtin_e2k_qpfadds( x, dz); __v2di sub02 = __builtin_e2k_qpfsubs( x, dz); __v2di add13 = __builtin_e2k_qpfadds(cy, ew); __v2di sub13 = __builtin_e2k_qpfsubs(cy, ew); //__v2di conj_sub13 = __builtin_e2k_qpxor(sub13, (__v2di){1LL<<63, 1LL<<63}); //__v2di sub13i = __builtin_e2k_qpshufb(conj_sub13, conj_sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_sub13 = __builtin_e2k_qpshufb(sub13, sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di sub13i = __builtin_e2k_qpxor(swap_sub13, (__v2di){1LL<<31, 1LL<<31}); out_0[i] = __builtin_e2k_qpfadds(add02, add13); out_1[i] = __builtin_e2k_qpfsubs(sub02, sub13i); out_2[i] = __builtin_e2k_qpfsubs(add02, add13); out_3[i] = __builtin_e2k_qpfadds(sub02, sub13i); } }
Main loop in assembly
.L5038: { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=0, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=4, disp=64 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=4, disp=96 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=8, disp=128 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=8, disp=160 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=4, asz=2, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=3, asz=2, abs=12, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=3, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=4, asz=3, abs=16, disp=32 } { fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=3, asz=3, abs=24, disp=32 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=2, asz=3, abs=24, disp=32 } .L3992: { loop_mode qpfsub_adds,0,sm %b[106], %b[103], %g16, %g17 qpfmul_hsubs,1,sm %b[10], %b[97], %r9, %b[0] qpfadds,2,sm %g18, %g19, %g20 qpshufb,3,sm %b[82], %b[81], %r0, %g22 qpshufb,4,sm %b[82], %b[81], %r5, %g21 qpfmul_hadds,5,sm %g23, %g24, %r9, %b[1] } { loop_mode qpfsub_rsubs,0,sm %g25, %g26, %g27, %b[117] qppermb,1,sm %g28, %b[117], %r1, %g29 qpfadds,2,sm %b[105], %b[108], %b[108] qpshufb,3,sm %b[54], %b[54], %r3, %g31 qpshufb,4,sm %b[13], %b[13], %r3, %g30 qpfmul_hsubs,5,sm %b[13], %g21, %r9, %b[105] } { loop_mode qpfsub_adds,0,sm %b[116], %b[109], %b[113], %r2 qpfmul_hadds,1,sm %r11, %b[43], %r9, %b[3] qpfsubs,2,sm %g29, %b[110], %r7 qppermb,3,sm %b[3], %b[64], %r1, %g19 qppermb,4,sm %b[118], %r10, %r1, %g18 qpfmul_hadds,5,sm %g30, %g21, %r9, %g21 } { loop_mode qpfsub_rsubs,0,sm %b[106], %b[103], %g16, %b[110] qpfmul_hsubs,1,sm %b[38], %b[114], %r9, %b[13] qpfadds,2,sm %g29, %b[110], %b[113] qppermb,3,sm %b[27], %b[16], %r1, %b[106] qppermb,4,sm %b[21], %b[17], %r1, %b[103] qpfsub_rsubs,5,sm %b[116], %b[109], %b[113], %b[109] 
} { loop_mode qpfmul_hsubs,0,sm %b[54], %g22, %r9, %r12 qpfmul_hsubs,1,sm %b[57], %b[41], %r9, %b[16] staaqp,2 %b[101], %aad2[ %aasti6 + _f32s,_lts0 0x10 ] qpshufb,3,sm %b[59], %b[42], %r5, %b[102] qppermb,4,sm %b[102], %r12, %r1, %b[101] qpfmul_hadds,5,sm %b[104], %b[114], %r9, %b[17] } { loop_mode qpfmul_hadds,0,sm %g31, %g22, %r9, %b[100] qpshufb,1,sm %b[74], %b[73], %r0, %b[104] staaqp,2 %b[115], %aad4[ %aasti8 + _f32s,_lts0 0x10 ] qpshufb,3,sm %b[100], %b[100], %r3, %b[115] qpshufb,4,sm %b[10], %b[10], %r3, %b[114] qpfmul_hsubs,5,sm %b[100], %b[102], %r9, %b[10] } { loop_mode qpfmul_hsubs,0,sm %b[51], %b[112], %r9, %b[115] qpfsubs,1,sm %g18, %g19, %b[27] staaqp,2 %r13, %aad2[ %aasti6 + _f32s,_lts0 0x20 ] qpshufb,3,sm %b[36], %b[36], %r3, %b[102] qpshufb,4,sm %b[88], %b[87], %r5, %b[116] qpfmul_hadds,5,sm %b[115], %b[102], %r9, %b[21] } { loop_mode qpfmul_hadds,0,sm %r14, %b[112], %r9, %g28 qpfsubs,1,sm %b[103], %b[106], %b[38] staaqp,2 %b[119], %aad2[ %aasti6 ] incr,2 %aaincr3 qpshufb,3,sm %b[31], %b[26], %r5, %b[112] qpshufb,4,sm %b[60], %b[60], %r3, %g22 qpfmul_hsubs,5,sm %b[60], %b[116], %r9, %r10 } { loop_mode qpfadd_rsubs,0,sm %b[104], %b[101], %b[113], %b[99] qppermb,1,sm %b[5], %b[20], %r1, %b[107] staaqp,2 %b[107], %aad4[ %aasti8 + _f32s,_lts0 0x20 ] qppermb,3,sm %b[99], %b[2], %r1, %g26 qpshufb,4,sm %b[39], %b[34], %r0, %g25 qpfmul_hadds,5,sm %g22, %b[116], %r9, %b[116] movaqp,1 area=5, ind=0, am=1, be=0, %b[2] } { loop_mode qpfmul_hadds,0,sm %b[114], %b[97], %r9, %b[97] qpshufb,1,sm %b[92], %b[91], %r0, %b[114] staaqp,2 %b[98], %aad4[ %aasti8 ] incr,2 %aaincr3 qpshufb,3,sm %b[96], %b[95], %r0, %b[39] qpshufb,4,sm %b[24], %b[24], %r3, %g23 qpfadd_adds,5,sm %b[104], %b[101], %b[113], %b[113] movaqp,0 area=4, ind=16, am=1, be=0, %b[5] movaqp,1 area=4, ind=0, am=0, be=0, %b[20] movaqp,2 area=5, ind=0, am=1, be=0, %b[98] movaqp,3 area=4, ind=0, am=1, be=0, %b[34] } { loop_mode qpfadd_adds,0,sm %b[114], %b[107], %g20, %b[96] qpshufb,1,sm %r7, %r7, %r3, 
%g17 staaqp,2 %g17, %aad1[ %aasti5 + _f32s,_lts0 0x10 ] qpshufb,3,sm %b[96], %b[95], %r5, %g24 qpshufb,4,sm %b[63], %b[46], %r0, %b[95] qpfadd_rsubs,5,sm %g25, %g26, %b[108], %r13 movaqp,0 area=3, ind=16, am=1, be=0, %b[43] movaqp,1 area=3, ind=0, am=0, be=0, %b[54] movaqp,2 area=3, ind=16, am=1, be=0, %b[46] movaqp,3 area=3, ind=0, am=0, be=0, %b[51] } { loop_mode qpfadd_rsubs,0,sm %b[114], %b[107], %g20, %b[117] qpshufb,1,sm %b[40], %b[40], %r3, %g22 staaqp,2 %b[117], %aad3[ %aasti7 + _f32s,_lts0 0x20 ] qpshufb,3,sm %b[57], %b[57], %r3, %r11 qpshufb,4,sm %b[29], %b[29], %r3, %g29 qpfmul_hsubs,5,sm %b[24], %g24, %r9, %b[60] movaqp,0 area=2, ind=0, am=0, be=0, %b[24] movaqp,1 area=2, ind=16, am=1, be=0, %b[40] movaqp,2 area=2, ind=0, am=0, be=0, %b[29] movaqp,3 area=2, ind=16, am=1, be=0, %b[57] } { loop_mode qpfadd_adds,0,sm %g25, %g26, %b[108], %b[105] qpxor,1,sm %g22, %r4, %g27 staaqp,2 %b[111], %aad1[ %aasti5 + _f32s,_lts0 0x20 ] qppermb,3,sm %g21, %b[105], %r1, %b[108] qpxor,4,sm %g29, %r4, %b[111] staaqp,5 %r2, %aad1[ %aasti5 ] incr,5 %aaincr3 movaqp,0 area=1, ind=16, am=1, be=0, %b[73] movaqp,1 area=1, ind=0, am=0, be=0, %b[63] movaqp,2 area=1, ind=16, am=1, be=0, %b[74] movaqp,3 area=1, ind=0, am=0, be=0, %b[64] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END qpfsub_adds,0,sm %g25, %g26, %g27, %b[109] qpxor,1,sm %g17, %r4, %g16 staaqp,2 %b[110], %aad3[ %aasti7 + _f32s,_lts0 0x10 ] qpshufb,3,sm %b[49], %b[49], %r3, %r14 qpshufb,4,sm %b[70], %b[69], %r5, %b[110] staaqp,5 %b[109], %aad3[ %aasti7 ] incr,5 %aaincr3 movaqp,0 area=0, ind=0, am=0, be=0, %b[81] movaqp,1 area=0, ind=16, am=1, be=0, %b[91] movaqp,2 area=0, ind=0, am=0, be=0, %b[82] movaqp,3 area=0, ind=16, am=1, be=0, %b[92] }
Theoretical speed: 24 complex numbers (8 bytes each) per 14 cycles (24/14) = 13.71 bytes/cycle
Doubled theoretical speed (radix-2 equivalent): 27.43 bytes/cycle
Speed measurements

We see a speedup in the middle of the graph, but a slowdown at its beginning and end.
Summary for stage_radix4


The FFT graph can be found here.
stage_radix4_2x
Diagram of the Stage algorithm for the "radix-4" 2x version.

One pass of stage_radix4_2x does the same work as 2 passes of stage_radix4, and one pass of stage_radix4 does the same work as 2 passes of stage_radix2. Therefore, to make the comparison with stage_radix2 easier, the speed of stage_radix4_2x is multiplied by 4 (this is noted on the graph axis and in the console output).
1. stage_radix4_2x_etalon
Here the stage_radix4_etalon algorithm is manually unrolled by a factor of 2.
C code
void stage_radix4_2x_etalon(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b) { myComplex *x0_in = &data_in[ 0]; myComplex *y0_in = &data_in[ 1]; myComplex *z0_in = &data_in[ 2]; myComplex *w0_in = &data_in[ 3]; myComplex *x1_in = &data_in[ 4]; myComplex *y1_in = &data_in[ 5]; myComplex *z1_in = &data_in[ 6]; myComplex *w1_in = &data_in[ 7]; myComplex *x2_in = &data_in[ 8]; myComplex *y2_in = &data_in[ 9]; myComplex *z2_in = &data_in[10]; myComplex *w2_in = &data_in[11]; myComplex *x3_in = &data_in[12]; myComplex *y3_in = &data_in[13]; myComplex *z3_in = &data_in[14]; myComplex *w3_in = &data_in[15]; myComplex *c0a_in = &coefC_a[0]; myComplex *c1a_in = &coefC_a[1]; myComplex *c2a_in = &coefC_a[2]; myComplex *c3a_in = &coefC_a[3]; myComplex *d0a_in = &coefD_a[0]; myComplex *d1a_in = &coefD_a[1]; myComplex *d2a_in = &coefD_a[2]; myComplex *d3a_in = &coefD_a[3]; myComplex *e0a_in = &coefE_a[0]; myComplex *e1a_in = &coefE_a[1]; myComplex *e2a_in = &coefE_a[2]; myComplex *e3a_in = &coefE_a[3]; myComplex *c0b_in = &coefC_b[0*data_count/16]; myComplex *c1b_in = &coefC_b[1*data_count/16]; myComplex *c2b_in = &coefC_b[2*data_count/16]; myComplex *c3b_in = &coefC_b[3*data_count/16]; myComplex *d0b_in = &coefD_b[0*data_count/16]; myComplex *d1b_in = &coefD_b[1*data_count/16]; myComplex *d2b_in = &coefD_b[2*data_count/16]; myComplex *d3b_in = &coefD_b[3*data_count/16]; myComplex *e0b_in = &coefE_b[0*data_count/16]; myComplex *e1b_in = &coefE_b[1*data_count/16]; myComplex *e2b_in = &coefE_b[2*data_count/16]; myComplex *e3b_in = &coefE_b[3*data_count/16]; myComplex *out_0 = &data_out[ 0*data_count/16]; myComplex *out_1 = &data_out[ 1*data_count/16]; myComplex *out_2 = &data_out[ 2*data_count/16]; myComplex *out_3 = &data_out[ 3*data_count/16]; myComplex *out_4 = &data_out[ 4*data_count/16]; myComplex *out_5 = &data_out[ 5*data_count/16]; myComplex *out_6 
        = &data_out[ 6*data_count/16];
    myComplex *out_7  = &data_out[ 7*data_count/16];
    myComplex *out_8  = &data_out[ 8*data_count/16];
    myComplex *out_9  = &data_out[ 9*data_count/16];
    myComplex *out_10 = &data_out[10*data_count/16];
    myComplex *out_11 = &data_out[11*data_count/16];
    myComplex *out_12 = &data_out[12*data_count/16];
    myComplex *out_13 = &data_out[13*data_count/16];
    myComplex *out_14 = &data_out[14*data_count/16];
    myComplex *out_15 = &data_out[15*data_count/16];

    #pragma ivdep
    #pragma unroll(1)
    // #pragma prefetch
    for(int64_t i = 0; i < data_count/16; ++i)
    {
        myComplex x0 = x0_in[16*i];
        myComplex y0 = y0_in[16*i];
        myComplex z0 = z0_in[16*i];
        myComplex w0 = w0_in[16*i];
        myComplex c0 = c0a_in[4*i];
        myComplex d0 = d0a_in[4*i];
        myComplex e0 = e0a_in[4*i];
        myComplex x1 = x1_in[16*i];
        myComplex y1 = y1_in[16*i];
        myComplex z1 = z1_in[16*i];
        myComplex w1 = w1_in[16*i];
        myComplex c1 = c1a_in[4*i];
        myComplex d1 = d1a_in[4*i];
        myComplex e1 = e1a_in[4*i];
        myComplex x2 = x2_in[16*i];
        myComplex y2 = y2_in[16*i];
        myComplex z2 = z2_in[16*i];
        myComplex w2 = w2_in[16*i];
        myComplex c2 = c2a_in[4*i];
        myComplex d2 = d2a_in[4*i];
        myComplex e2 = e2a_in[4*i];
        myComplex x3 = x3_in[16*i];
        myComplex y3 = y3_in[16*i];
        myComplex z3 = z3_in[16*i];
        myComplex w3 = w3_in[16*i];
        myComplex c3 = c3a_in[4*i];
        myComplex d3 = d3a_in[4*i];
        myComplex e3 = e3a_in[4*i];

        myComplex cy0 = complex_mul(c0, y0);
        myComplex cy1 = complex_mul(c1, y1);
        myComplex cy2 = complex_mul(c2, y2);
        myComplex cy3 = complex_mul(c3, y3);
        myComplex dz0 = complex_mul(d0, z0);
        myComplex dz1 = complex_mul(d1, z1);
        myComplex dz2 = complex_mul(d2, z2);
        myComplex dz3 = complex_mul(d3, z3);
        myComplex ew0 = complex_mul(e0, w0);
        myComplex ew1 = complex_mul(e1, w1);
        myComplex ew2 = complex_mul(e2, w2);
        myComplex ew3 = complex_mul(e3, w3);

        myComplex add02_0 = complex_add(x0, dz0);
        myComplex add02_1 = complex_add(x1, dz1);
        myComplex add02_2 = complex_add(x2, dz2);
        myComplex add02_3 = complex_add(x3, dz3);
        myComplex sub02_0 = complex_sub(x0, dz0);
        myComplex sub02_1 = complex_sub(x1, dz1);
        myComplex sub02_2 = complex_sub(x2, dz2);
        myComplex sub02_3 = complex_sub(x3, dz3);
        myComplex add13_0 = complex_add(cy0, ew0);
        myComplex add13_1 = complex_add(cy1, ew1);
        myComplex add13_2 = complex_add(cy2, ew2);
        myComplex add13_3 = complex_add(cy3, ew3);
        myComplex sub13_0 = complex_sub(cy0, ew0);
        myComplex sub13_1 = complex_sub(cy1, ew1);
        myComplex sub13_2 = complex_sub(cy2, ew2);
        myComplex sub13_3 = complex_sub(cy3, ew3);

        myComplex sub13i_0 = (myComplex){.real = -sub13_0.imag, .imag = sub13_0.real};
        myComplex sub13i_1 = (myComplex){.real = -sub13_1.imag, .imag = sub13_1.real};
        myComplex sub13i_2 = (myComplex){.real = -sub13_2.imag, .imag = sub13_2.real};
        myComplex sub13i_3 = (myComplex){.real = -sub13_3.imag, .imag = sub13_3.real};

        myComplex out0  = complex_add(add02_0, add13_0);
        myComplex out1  = complex_add(add02_1, add13_1);
        myComplex out2  = complex_add(add02_2, add13_2);
        myComplex out3  = complex_add(add02_3, add13_3);
        myComplex out4  = complex_sub(sub02_0, sub13i_0);
        myComplex out5  = complex_sub(sub02_1, sub13i_1);
        myComplex out6  = complex_sub(sub02_2, sub13i_2);
        myComplex out7  = complex_sub(sub02_3, sub13i_3);
        myComplex out8  = complex_sub(add02_0, add13_0);
        myComplex out9  = complex_sub(add02_1, add13_1);
        myComplex out10 = complex_sub(add02_2, add13_2);
        myComplex out11 = complex_sub(add02_3, add13_3);
        myComplex out12 = complex_add(sub02_0, sub13i_0);
        myComplex out13 = complex_add(sub02_1, sub13i_1);
        myComplex out14 = complex_add(sub02_2, sub13i_2);
        myComplex out15 = complex_add(sub02_3, sub13i_3);

        x0 = out0;  y0 = out1;  z0 = out2;  w0 = out3;
        c0 = c0b_in[i]; d0 = d0b_in[i]; e0 = e0b_in[i];
        x1 = out4;  y1 = out5;  z1 = out6;  w1 = out7;
        c1 = c1b_in[i]; d1 = d1b_in[i]; e1 = e1b_in[i];
        x2 = out8;  y2 = out9;  z2 = out10; w2 = out11;
        c2 = c2b_in[i]; d2 = d2b_in[i]; e2 = e2b_in[i];
        x3 = out12; y3 = out13; z3 = out14; w3 = out15;
        c3 = c3b_in[i]; d3 = d3b_in[i]; e3 = e3b_in[i];

        cy0 = complex_mul(c0, y0);
        cy1 = complex_mul(c1, y1);
        cy2 = complex_mul(c2, y2);
        cy3 = complex_mul(c3, y3);
        dz0 = complex_mul(d0, z0);
        dz1 = complex_mul(d1, z1);
        dz2 = complex_mul(d2, z2);
        dz3 = complex_mul(d3, z3);
        ew0 = complex_mul(e0, w0);
        ew1 = complex_mul(e1, w1);
        ew2 = complex_mul(e2, w2);
        ew3 = complex_mul(e3, w3);

        add02_0 = complex_add(x0, dz0);
        add02_1 = complex_add(x1, dz1);
        add02_2 = complex_add(x2, dz2);
        add02_3 = complex_add(x3, dz3);
        sub02_0 = complex_sub(x0, dz0);
        sub02_1 = complex_sub(x1, dz1);
        sub02_2 = complex_sub(x2, dz2);
        sub02_3 = complex_sub(x3, dz3);
        add13_0 = complex_add(cy0, ew0);
        add13_1 = complex_add(cy1, ew1);
        add13_2 = complex_add(cy2, ew2);
        add13_3 = complex_add(cy3, ew3);
        sub13_0 = complex_sub(cy0, ew0);
        sub13_1 = complex_sub(cy1, ew1);
        sub13_2 = complex_sub(cy2, ew2);
        sub13_3 = complex_sub(cy3, ew3);

        sub13i_0 = (myComplex){.real = -sub13_0.imag, .imag = sub13_0.real};
        sub13i_1 = (myComplex){.real = -sub13_1.imag, .imag = sub13_1.real};
        sub13i_2 = (myComplex){.real = -sub13_2.imag, .imag = sub13_2.real};
        sub13i_3 = (myComplex){.real = -sub13_3.imag, .imag = sub13_3.real};

        out_0[i]  = complex_add(add02_0, add13_0);
        out_1[i]  = complex_add(add02_1, add13_1);
        out_2[i]  = complex_add(add02_2, add13_2);
        out_3[i]  = complex_add(add02_3, add13_3);
        out_4[i]  = complex_sub(sub02_0, sub13i_0);
        out_5[i]  = complex_sub(sub02_1, sub13i_1);
        out_6[i]  = complex_sub(sub02_2, sub13i_2);
        out_7[i]  = complex_sub(sub02_3, sub13i_3);
        out_8[i]  = complex_sub(add02_0, add13_0);
        out_9[i]  = complex_sub(add02_1, add13_1);
        out_10[i] = complex_sub(add02_2, add13_2);
        out_11[i] = complex_sub(add02_3, add13_3);
        out_12[i] = complex_add(sub02_0, sub13i_0);
        out_13[i] = complex_add(sub02_1, sub13i_1);
        out_14[i] = complex_add(sub02_2, sub13i_2);
        out_15[i] = complex_add(sub02_3, sub13i_3);
    }
}
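The listing above repeats one pattern many times: a radix-4 butterfly with three twiddle multiplications. As a reading aid, here is a minimal sketch of a single butterfly. It assumes C99 `float complex` instead of the article's `myComplex` struct, and the function name `radix4_butterfly` is hypothetical; the temporaries (`cy`, `dz`, `ew`, `add02`, `sub13i`, and so on) follow the names used in the listing.

```c
#include <complex.h>

/* One radix-4 butterfly: y, z, w are first multiplied by the twiddle
   factors c, d, e, then combined with x exactly as in the listing above. */
static void radix4_butterfly(float complex x, float complex y,
                             float complex z, float complex w,
                             float complex c, float complex d,
                             float complex e, float complex out[4])
{
    float complex cy = c * y;
    float complex dz = d * z;
    float complex ew = e * w;

    float complex add02 = x + dz;
    float complex sub02 = x - dz;
    float complex add13 = cy + ew;
    float complex sub13 = cy - ew;

    /* Multiplication by i: (re, im) -> (-im, re). */
    float complex sub13i = -cimagf(sub13) + crealf(sub13) * I;

    out[0] = add02 + add13;
    out[1] = sub02 - sub13i;
    out[2] = add02 - add13;
    out[3] = sub02 + sub13i;
}
```

With all three twiddles equal to 1 this reduces to a plain 4-point DFT of (x, y, z, w), which is a convenient sanity check. The 2x variant in the listing runs four such butterflies, feeds their results into four more with the second coefficient set, and only then stores to memory.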
Main loop in assembly
.L1379: { fapb ct=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=3, mrng=16, d=0, incr=3, ind=2, asz=1, abs=0, disp=16 } { fapb ct=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=64 fapb dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=32 } { fapb ct=0, dcd=0, fmt=3, mrng=0, d=0, incr=3, ind=4, asz=1, abs=4, disp=0 fapb dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=96 } { fapb ct=0, dcd=0, fmt=3, mrng=16, d=0, incr=3, ind=2, asz=1, abs=6, disp=0 fapb dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=3, ind=3, asz=1, abs=6, disp=0 } { fapb ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=15, asz=2, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=3, mrng=8, d=1, incr=2, ind=0, asz=2, abs=8, disp=0 } { fapb ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=13, asz=2, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=14, asz=2, abs=12, disp=0 } { fapb ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=11, asz=2, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=12, asz=2, abs=16, disp=0 } { fapb ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=9, asz=2, abs=20, disp=0 fapb dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=10, asz=2, abs=20, disp=0 } { fapb ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=7, asz=2, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=8, asz=2, abs=24, disp=0 } { fapb ct=1, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=5, asz=2, abs=28, disp=0 fapb dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=6, asz=2, abs=28, disp=0 } .L285: { loop_mode disp %ctpr1, .L285 movaw,0 area=0, ind=24, am=0, be=0, %g17 movaw,1 area=0, ind=28, am=0, be=0, %g16 movaw,2 area=0, ind=8, am=0, be=0, %g19 movaw,3 area=0, ind=12, am=0, be=0, %g18 } { loop_mode movaw,0 area=0, ind=16, am=0, be=0, %g21 movaw,1 area=0, ind=20, am=0, be=0, %g20 movaw,2 area=0, ind=0, am=1, be=0, %g23 movaw,3 area=0, ind=4, am=0, be=0, %g22 } { loop_mode movaw,0 area=0, ind=8, am=0, be=0, %g25 movaw,1 
area=0, ind=12, am=0, be=0, %g24 movaw,2 area=1, ind=24, am=0, be=0, %g27 movaw,3 area=1, ind=28, am=0, be=0, %g26 } { loop_mode movaw,0 area=0, ind=0, am=1, be=0, %g29 movaw,1 area=0, ind=4, am=0, be=0, %g28 movaw,2 area=1, ind=16, am=0, be=0, %g31 movaw,3 area=1, ind=20, am=0, be=0, %g30 } { loop_mode movaw,0 area=1, ind=24, am=0, be=0, %r3 movaw,1 area=1, ind=28, am=0, be=0, %r1 movaw,2 area=1, ind=8, am=0, be=0, %r5 movaw,3 area=1, ind=12, am=0, be=0, %r4 } { loop_mode movaw,0 area=1, ind=16, am=0, be=0, %r9 movaw,1 area=1, ind=20, am=0, be=0, %r7 movaw,2 area=1, ind=0, am=1, be=0, %r42 movaw,3 area=1, ind=4, am=0, be=0, %r41 } { loop_mode movaw,0 area=1, ind=8, am=0, be=0, %r44 movaw,1 area=1, ind=12, am=0, be=0, %r43 movaw,2 area=2, ind=24, am=0, be=0, %r46 movaw,3 area=2, ind=28, am=0, be=0, %r45 } { loop_mode movaw,0 area=1, ind=0, am=1, be=0, %r48 movaw,1 area=1, ind=4, am=0, be=0, %r47 movaw,2 area=2, ind=16, am=0, be=0, %r50 movaw,3 area=2, ind=20, am=0, be=0, %r49 } { loop_mode movaw,0 area=2, ind=24, am=0, be=0, %r52 movaw,1 area=2, ind=28, am=0, be=0, %r51 movaw,2 area=2, ind=8, am=0, be=0, %r54 movaw,3 area=2, ind=12, am=0, be=0, %r53 } { loop_mode fmuls,0 %g23, %r3, %r57 fmuls,1 %g22, %r1, %r58 fmuls,2 %g23, %r1, %g23 fmuls,3 %g22, %r3, %g22 movaw,0 area=2, ind=16, am=0, be=0, %r56 movaw,1 area=2, ind=20, am=0, be=0, %r55 movaw,2 area=2, ind=0, am=1, be=0, %r3 movaw,3 area=2, ind=4, am=0, be=0, %r1 } { loop_mode movaw,0 area=2, ind=8, am=0, be=0, %r60 movaw,1 area=2, ind=12, am=0, be=0, %r59 movaw,2 area=3, ind=24, am=0, be=0, %r62 movaw,3 area=3, ind=28, am=0, be=0, %r61 } { loop_mode fmuls,0 %g19, %r46, %r63 fmuls,1 %g18, %r45, %b[0] fmuls,2 %g19, %r45, %g19 fmuls,3 %g18, %r46, %g18 movaw,0 area=2, ind=0, am=1, be=0, %r46 movaw,1 area=2, ind=4, am=0, be=0, %r45 movaw,2 area=3, ind=16, am=0, be=0, %b[2] movaw,3 area=3, ind=20, am=0, be=0, %b[1] } { loop_mode movaw,0 area=3, ind=8, am=0, be=0, %b[4] movaw,1 area=3, ind=12, am=0, be=0, %b[3] movaw,2 
area=3, ind=8, am=0, be=0, %b[6] movaw,3 area=3, ind=12, am=0, be=0, %b[5] } { loop_mode fmuls,0 %r52, %r54, %b[7] fmuls,1 %r51, %r53, %b[8] fmuls,2 %r52, %r53, %r52 fmuls,3 %r51, %r54, %r51 movaw,0 area=3, ind=0, am=1, be=0, %r54 movaw,1 area=3, ind=4, am=0, be=0, %r53 movaw,2 area=3, ind=0, am=1, be=0, %b[10] movaw,3 area=3, ind=4, am=0, be=0, %b[9] } { loop_mode fmuls,0 %r56, %r44, %b[11] fmuls,1 %r55, %r43, %b[12] fmuls,2 %r56, %r43, %r43 fmuls,3 %r55, %r44, %r44 movaw,0 area=4, ind=0, am=1, be=0, %r56 movaw,1 area=4, ind=4, am=0, be=0, %r55 movaw,2 area=4, ind=0, am=1, be=0, %b[14] movaw,3 area=4, ind=4, am=0, be=0, %b[13] } { loop_mode fmuls,0 %r60, %r5, %b[15] fmuls,1 %r59, %r4, %b[16] fmuls,2 %r60, %r4, %r4 fmuls,3 %r59, %r5, %r5 fmuls,4 %r61, %r49, %r59 fmuls,5 %r61, %r50, %r60 movaw,0 area=5, ind=0, am=1, be=0, %b[17] movaw,1 area=5, ind=4, am=0, be=0, %r61 movaw,2 area=5, ind=0, am=1, be=0, %b[19] movaw,3 area=5, ind=4, am=0, be=0, %b[18] } { loop_mode fmuls,0 %b[2], %r9, %b[20] fmuls,1 %b[1], %r7, %b[21] fmuls,2 %b[2], %r7, %r7 fmuls,3 %b[1], %r9, %r9 fmuls,4 %r62, %r50, %r50 fmuls,5 %r62, %r49, %r49 movaw,0 area=6, ind=0, am=1, be=0, %b[1] movaw,1 area=6, ind=4, am=0, be=0, %r62 movaw,2 area=6, ind=0, am=1, be=0, %b[22] movaw,3 area=6, ind=4, am=0, be=0, %b[2] } { loop_mode fmuls,0 %b[5], %g30, %b[23] fmuls,1 %b[5], %g31, %b[5] fmuls,2 %b[4], %g27, %b[24] fmuls,3 %b[3], %g26, %b[25] fmuls,4 %b[4], %g26, %g26 fmuls,5 %b[3], %g27, %g27 movaw,0 area=7, ind=0, am=1, be=0, %b[4] movaw,1 area=7, ind=4, am=0, be=0, %b[3] movaw,2 area=7, ind=0, am=1, be=0, %b[27] movaw,3 area=7, ind=4, am=0, be=0, %b[26] } { loop_mode fmuls,0 %b[6], %g31, %g31 fmuls,1 %b[6], %g30, %g30 fsubs,2 %r57, %r58, %r57 fadds,3 %g23, %g22, %g22 fsubs,4 %r63, %b[0], %g23 fadds,5 %g19, %g18, %g18 movaw,0 area=8, ind=0, am=1, be=0, %r58 movaw,1 area=8, ind=4, am=0, be=0, %g19 movaw,2 area=8, ind=0, am=1, be=0, %b[0] movaw,3 area=8, ind=4, am=0, be=0, %r63 } { loop_mode fsubs,0 %b[7], 
%b[8], %b[6] fadds,1 %r52, %r51, %r51 fsubs,2 %b[11], %b[12], %r52 fmuls,3 %r45, %g24, %b[7] fmuls,4 %r46, %g24, %g24 fmuls,5 %r45, %g25, %r45 movaw,0 area=9, ind=0, am=1, be=0, %b[11] movaw,1 area=9, ind=4, am=0, be=0, %b[8] movaw,2 area=9, ind=0, am=1, be=0, %b[28] movaw,3 area=9, ind=4, am=0, be=0, %b[12] } { loop_mode fadds,0 %r43, %r44, %r43 fsubs,1 %b[20], %b[21], %r44 fsubs,2 %b[15], %b[16], %b[15] fsubs,3 %r50, %r59, %r50 fadds,4 %r49, %r60, %r49 fmuls,5 %r46, %g25, %g25 } { loop_mode fadds,0 %r4, %r5, %r4 fmuls,1 %r53, %g16, %g27 fmuls,2 %r53, %g17, %r5 fadds,3 %g26, %g27, %g26 fmuls,4 %b[9], %g21, %r46 fmuls,5 %r54, %g17, %g17 } { loop_mode fadds,0 %r7, %r9, %r7 fsubs,1 %g31, %b[23], %g31 fadds,2 %g30, %b[5], %g30 fmuls,3 %r54, %g16, %g16 fmuls,4 %b[10], %g21, %g21 fmuls,5 %b[9], %g20, %r9 } { loop_mode fsubs,0 %b[24], %b[25], %r53 fmuls,1 %b[10], %g20, %g20 fadds,2 %r52, %r57, %r54 fadds,3 %g24, %r45, %g24 } { loop_mode fadds,0 %r48, %r44, %r45 fsubs,1 %r48, %r44, %r44 fsubs,2 %g22, %r43, %r48 fsubs,3 %r1, %r49, %r59 fadds,4 %r3, %r50, %r60 fadds,5 %r1, %r49, %r1 } { loop_mode fadds,0 %r43, %g22, %g22 fsubs,1 %g18, %r51, %r43 fadds,2 %b[6], %g23, %r49 fadds,3 %r51, %g18, %g18 fsubs,4 %b[6], %g23, %g23 fsubs,5 %r52, %r57, %r51 } { loop_mode fsubs,0 %r3, %r50, %r3 fadds,1 %r47, %r7, %r50 fsubs,2 %r47, %r7, %r7 fsubs,3 %g25, %b[7], %g25 fsubs,4 %g21, %r9, %g21 } { loop_mode fadds,0 %r4, %g26, %r9 fsubs,1 %g26, %r4, %g26 fadds,2 %r42, %g31, %r4 fsubs,3 %g17, %g27, %g17 fadds,4 %g16, %r5, %g16 } { loop_mode fadds,0 %r41, %g30, %g27 fsubs,1 %r42, %g31, %g31 fadds,2 %b[15], %r53, %r5 fsubs,3 %r41, %g30, %g30 } { loop_mode fsubs,0 %b[15], %r53, %r41 fadds,1 %g20, %r46, %g20 fsubs,2 %r44, %r48, %r42 fsubs,3 %r59, %g23, %r46 fadds,4 %r59, %g23, %g23 fadds,5 %r1, %g18, %r47 } { loop_mode fsubs,0 %r45, %r54, %r52 fadds,1 %r44, %r48, %r44 fadds,2 %r45, %r54, %r45 fsubs,3 %r1, %g18, %g18 fsubs,4 %g29, %g21, %r1 fadds,5 %g29, %g21, %g21 } { loop_mode fadds,0 %r50, 
%g22, %g29 fsubs,1 %r50, %g22, %g22 fadds,2 %r3, %r43, %r48 fadds,3 %r60, %r49, %r50 fsubs,4 %r60, %r49, %r49 fadds,5 %g24, %g16, %r53 } { loop_mode fsubs,0 %r3, %r43, %r3 fadds,1 %g27, %r9, %r43 fsubs,2 %r7, %r51, %r54 fadds,3 %r7, %r51, %r7 fsubs,4 %g25, %g17, %r51 fsubs,5 %g16, %g24, %g16 } { loop_mode fsubs,0 %g27, %r9, %g24 fadds,1 %g31, %g26, %g27 fadds,2 %r4, %r5, %r9 fadds,3 %g25, %g17, %g17 fmuls,4 %r62, %r46, %g25 fmuls,5 %b[1], %r46, %r46 } { loop_mode fadds,0 %g30, %r41, %r57 fsubs,1 %g31, %g26, %g26 fsubs,2 %g30, %r41, %g30 fsubs,3 %r4, %r5, %g31 fmuls,4 %b[8], %g23, %r4 fmuls,5 %b[11], %g23, %g23 } { loop_mode fadds,0 %g28, %g20, %r5 fsubs,1 %g28, %g20, %g20 fmuls,2 %b[22], %r42, %g28 fmuls,3 %b[2], %r42, %r41 fmuls,4 %b[18], %r47, %r42 fmuls,5 %b[19], %r47, %r47 } { loop_mode fmuls,0 %b[4], %r52, %r59 fmuls,1 %b[3], %r52, %r52 fmuls,2 %b[28], %r44, %r60 fmuls,3 %b[12], %r44, %r44 fmuls,4 %r56, %r45, %b[5] fmuls,5 %r55, %r45, %r45 } { loop_mode fmuls,0 %r63, %g18, %b[6] fmuls,1 %b[0], %g18, %g18 fmuls,2 %b[11], %r48, %b[7] fmuls,3 %b[8], %r48, %r48 fmuls,4 %r55, %g29, %r55 fmuls,5 %r56, %g29, %g29 } { loop_mode fmuls,0 %b[3], %g22, %r56 fmuls,1 %b[4], %g22, %g22 fmuls,2 %b[1], %r3, %b[1] fmuls,3 %r62, %r3, %r3 fmuls,4 %b[12], %r7, %r62 fmuls,5 %b[28], %r7, %r7 } { loop_mode fmuls,0 %b[19], %r50, %b[3] fmuls,1 %b[18], %r50, %r50 fmuls,2 %b[0], %r49, %b[0] fmuls,3 %r63, %r49, %r49 fmuls,4 %b[26], %g24, %r63 fmuls,5 %b[27], %g24, %g24 } { loop_mode fmuls,0 %g19, %r57, %b[4] fmuls,1 %r58, %r57, %r57 fmuls,2 %b[2], %r54, %b[2] fmuls,3 %b[22], %r54, %r54 fmuls,4 %b[13], %r43, %b[8] fmuls,5 %b[14], %r43, %r43 } { loop_mode fmuls,0 %r61, %g30, %b[9] fmuls,1 %b[17], %g30, %g30 fmuls,2 %r58, %g27, %r58 fmuls,3 %g19, %g27, %g19 fmuls,4 %b[14], %r9, %g27 fmuls,5 %b[13], %r9, %r9 } { loop_mode fmuls,0 %b[17], %g26, %b[10] fmuls,1 %r61, %g26, %g26 fmuls,2 %b[27], %g31, %r61 fmuls,3 %b[26], %g31, %g31 fsubs,4 %g20, %r51, %b[11] fadds,5 %g20, %r51, %g20 } { loop_mode 
fadds,0 %g21, %g17, %r51 fadds,1 %r5, %r53, %b[12] fsubs,2 %r1, %g16, %b[13] fsubs,3 %g21, %g17, %g17 fsubs,4 %r5, %r53, %g21 fadds,5 %r1, %g16, %g16 } { loop_mode fsubs,0 %b[5], %r55, %r1 fadds,1 %g29, %r45, %g29 fsubs,2 %r59, %r56, %r5 fadds,3 %g22, %r52, %g22 fsubs,4 %b[7], %r4, %r4 fadds,5 %g23, %r48, %g23 } { loop_mode fsubs,0 %b[3], %r42, %r42 fadds,1 %r47, %r50, %r45 fsubs,2 %b[1], %g25, %g25 fadds,3 %r46, %r3, %r3 fsubs,4 %b[0], %b[6], %r46 fadds,5 %g18, %r49, %g18 } { loop_mode fsubs,0 %r61, %r63, %r47 fsubs,1 %r58, %b[4], %r48 fsubs,2 %g28, %b[2], %g28 fadds,3 %r54, %r41, %r41 fsubs,4 %r60, %r62, %r49 fadds,5 %r7, %r44, %r7 } { loop_mode fsubs,0 %g27, %b[8], %g27 fadds,1 %r43, %r9, %r9 fadds,2 %g30, %g26, %g26 fadds,3 %g24, %g31, %g24 fadds,4 %r57, %g19, %g19 } { loop_mode fsubs,0 %b[10], %b[9], %g30 fsubs,1 %r51, %r1, %r43 fadds,2 %r51, %r1, %g22 fadds,3 %g21, %g22, %g31 fsubs,4 %g21, %g22, %g21 } { loop_mode fadds,0 %g17, %r5, %r1 fsubs,1 %g17, %r5, %g17 fadds,2 %b[12], %g29, %r5 } { loop_mode fsubs,0 %b[12], %g29, %g29 fadds,1 %b[13], %g28, %r44 fsubs,2 %b[13], %g28, %g28 fadds,3 %g20, %r7, %r50 fsubs,4 %g16, %r49, %r51 fsubs,5 %g20, %r7, %g20 } { loop_mode fsubs,0 %r48, %r4, %r7 fadds,1 %r47, %r46, %r49 fadds,2 %r48, %r4, %r4 fadds,3 %b[11], %r41, %r52 fadds,4 %g16, %r49, %g16 fsubs,5 %b[11], %r41, %r41 } { loop_mode fsubs,0 %r47, %r46, %r46 fadds,1 %r9, %r45, %r47 fsubs,2 %g27, %r42, %r53 fadds,3 %g19, %g23, %r48 fsubs,4 %g18, %g24, %r54 fsubs,5 %g23, %g19, %g19 } { loop_mode fsubs,0 %r45, %r9, %g23 fadds,1 %g27, %r42, %g27 fadds,2 %g26, %r3, %r9 fadds,3 %g24, %g18, %g18 fsubs,4 %r3, %g26, %g24 } { loop_mode fadds,0 %g30, %g25, %g26 fsubs,1 %g30, %g25, %g25 } { loop_mode fsubs,0 %r1, %r49, %g30 fadds,1 %r1, %r49, %r1 } { loop_mode fsubs,0 %g21, %r46, %r3 fadds,1 %g21, %r46, %g21 fsubs,2 %g20, %r7, %r42 fadds,3 %r51, %g19, %r45 fadds,4 %r50, %r48, %r46 fsubs,5 %r51, %g19, %g19 } { loop_mode fsubs,0 %r43, %g23, %r49 fadds,1 %r43, %g23, %g23 fadds,2 
%g20, %r7, %g20 fsubs,3 %r50, %r48, %r7 fadds,4 %g16, %r4, %r43 fsubs,5 %g17, %r54, %r48 } { loop_mode fadds,0 %r5, %r47, %r50 fsubs,1 %r5, %r47, %r5 fadds,2 %g29, %r53, %r47 fsubs,3 %g29, %r53, %g29 fsubs,4 %g16, %r4, %g16 fadds,5 %g17, %r54, %g17 } { loop_mode fsubs,0 %g22, %g27, %r4 fadds,1 %r52, %r9, %r51 fadds,2 %g31, %g18, %r53 fsubs,3 %g28, %g24, %r54 fsubs,4 %r52, %r9, %r9 fsubs,5 %g31, %g18, %g18 } { loop_mode fadds,0 %g28, %g24, %g24 fadds,1 %g22, %g27, %g22 fsubs,2 %r44, %g26, %g26 fadds,3 %r44, %g26, %g27 fsubs,4 %r41, %g25, %g28 fadds,5 %r41, %g25, %g25 } { loop_mode stw,2 %r23, %r0, %g21 stw,5 %r32, %r0, %g30 } { loop_mode stw,2 %r6, %r0, %r3 stw,5 %r39, %r0, %r1 } { loop_mode stw,2 %r22, %r0, %g20 stw,5 %r21, %r0, %r42 } { loop_mode stw,2 %r27, %r0, %r45 stw,5 %r38, %r0, %r49 } { loop_mode stw,2 %r31, %r0, %g23 stw,5 %r34, %r0, %g19 } { loop_mode stw,2 %r17, %r0, %r7 stw,5 %r13, %r0, %r46 } { loop_mode stw,2 %r20, %r0, %r5 stw,5 %r2, %r0, %r50 } { loop_mode stw,2 %r16, %r0, %r47 stw,5 %r29, %r0, %r48 } { loop_mode stw,2 %r28, %r0, %g17 stw,5 %r12, %r0, %g29 } { loop_mode stw,2 %r26, %r0, %g16 stw,5 %r30, %r0, %r43 } { loop_mode stw,2 %r37, %r0, %r4 stw,5 %r14, %r0, %r53 } { loop_mode stw,2 %r18, %r0, %g18 stw,5 %r35, %r0, %r54 } { loop_mode stw,2 %r25, %r0, %g24 stw,5 %r15, %r0, %r51 } { loop_mode stw,2 %r19, %r0, %r9 stw,5 %r40, %r0, %g22 } { loop_mode stw,2 %r24, %r0, %g25 stw,5 %r11, %r0, %g28 } { loop_mode ct %ctpr1 ? %NOT_LOOP_END alc alcf=1, alct=1 stw,2 %r33, %r0, %g26 addd,3,sm 0x8, %r0, %r0 stw,5 %r36, %r0, %g27 }
Theoretical throughput: 16 complex numbers (16 × 8 = 128 bytes) per 77 cycles ≈ 1.66 bytes/cycle
Quadrupled theoretical throughput: 6.65 bytes/cycle
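The arithmetic behind these two figures can be reproduced with a short helper. This is only a sketch: the function name `bytes_per_cycle` is hypothetical, the 8-byte element size follows from a complex number made of two `float` values, and the factor of 4 is taken from the text as given.

```c
#include <stddef.h>

/* Bytes of complex data processed per cycle, given how many complex
   numbers one loop iteration handles and how many cycles it takes. */
static double bytes_per_cycle(size_t complex_per_iter,
                              size_t bytes_per_complex,
                              unsigned cycles_per_iter)
{
    return (double)(complex_per_iter * bytes_per_complex) / cycles_per_iter;
}
```

For this loop, `bytes_per_cycle(16, 8, 77)` gives about 1.66, and multiplying that by 4 gives about 6.65.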
Speed measurements

2. stage_radix4_2x_simd64
Here the stage_radix4_simd64 algorithm is manually unrolled by a factor of 2.
C code
void stage_radix4_2x_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b) { uint64_t *x0_in = (uint64_t*)&data_in[ 0]; uint64_t *y0_in = (uint64_t*)&data_in[ 1]; uint64_t *z0_in = (uint64_t*)&data_in[ 2]; uint64_t *w0_in = (uint64_t*)&data_in[ 3]; uint64_t *x1_in = (uint64_t*)&data_in[ 4]; uint64_t *y1_in = (uint64_t*)&data_in[ 5]; uint64_t *z1_in = (uint64_t*)&data_in[ 6]; uint64_t *w1_in = (uint64_t*)&data_in[ 7]; uint64_t *x2_in = (uint64_t*)&data_in[ 8]; uint64_t *y2_in = (uint64_t*)&data_in[ 9]; uint64_t *z2_in = (uint64_t*)&data_in[10]; uint64_t *w2_in = (uint64_t*)&data_in[11]; uint64_t *x3_in = (uint64_t*)&data_in[12]; uint64_t *y3_in = (uint64_t*)&data_in[13]; uint64_t *z3_in = (uint64_t*)&data_in[14]; uint64_t *w3_in = (uint64_t*)&data_in[15]; uint64_t *c0a_in = (uint64_t*)&coefC_a[0]; uint64_t *c1a_in = (uint64_t*)&coefC_a[1]; uint64_t *c2a_in = (uint64_t*)&coefC_a[2]; uint64_t *c3a_in = (uint64_t*)&coefC_a[3]; uint64_t *d0a_in = (uint64_t*)&coefD_a[0]; uint64_t *d1a_in = (uint64_t*)&coefD_a[1]; uint64_t *d2a_in = (uint64_t*)&coefD_a[2]; uint64_t *d3a_in = (uint64_t*)&coefD_a[3]; uint64_t *e0a_in = (uint64_t*)&coefE_a[0]; uint64_t *e1a_in = (uint64_t*)&coefE_a[1]; uint64_t *e2a_in = (uint64_t*)&coefE_a[2]; uint64_t *e3a_in = (uint64_t*)&coefE_a[3]; uint64_t *c0b_in = (uint64_t*)&coefC_b[0*data_count/16]; uint64_t *c1b_in = (uint64_t*)&coefC_b[1*data_count/16]; uint64_t *c2b_in = (uint64_t*)&coefC_b[2*data_count/16]; uint64_t *c3b_in = (uint64_t*)&coefC_b[3*data_count/16]; uint64_t *d0b_in = (uint64_t*)&coefD_b[0*data_count/16]; uint64_t *d1b_in = (uint64_t*)&coefD_b[1*data_count/16]; uint64_t *d2b_in = (uint64_t*)&coefD_b[2*data_count/16]; uint64_t *d3b_in = (uint64_t*)&coefD_b[3*data_count/16]; uint64_t *e0b_in = (uint64_t*)&coefE_b[0*data_count/16]; uint64_t *e1b_in = (uint64_t*)&coefE_b[1*data_count/16]; uint64_t *e2b_in = 
(uint64_t*)&coefE_b[2*data_count/16]; uint64_t *e3b_in = (uint64_t*)&coefE_b[3*data_count/16]; uint64_t *out_0 = (uint64_t*)&data_out[ 0*data_count/16]; uint64_t *out_1 = (uint64_t*)&data_out[ 1*data_count/16]; uint64_t *out_2 = (uint64_t*)&data_out[ 2*data_count/16]; uint64_t *out_3 = (uint64_t*)&data_out[ 3*data_count/16]; uint64_t *out_4 = (uint64_t*)&data_out[ 4*data_count/16]; uint64_t *out_5 = (uint64_t*)&data_out[ 5*data_count/16]; uint64_t *out_6 = (uint64_t*)&data_out[ 6*data_count/16]; uint64_t *out_7 = (uint64_t*)&data_out[ 7*data_count/16]; uint64_t *out_8 = (uint64_t*)&data_out[ 8*data_count/16]; uint64_t *out_9 = (uint64_t*)&data_out[ 9*data_count/16]; uint64_t *out_10 = (uint64_t*)&data_out[10*data_count/16]; uint64_t *out_11 = (uint64_t*)&data_out[11*data_count/16]; uint64_t *out_12 = (uint64_t*)&data_out[12*data_count/16]; uint64_t *out_13 = (uint64_t*)&data_out[13*data_count/16]; uint64_t *out_14 = (uint64_t*)&data_out[14*data_count/16]; uint64_t *out_15 = (uint64_t*)&data_out[15*data_count/16]; #pragma ivdep #pragma unroll(1) #pragma prefetch for(int64_t i = 0; i < data_count/16; ++i) { uint64_t x0 = x0_in[16*i]; uint64_t y0 = y0_in[16*i]; uint64_t z0 = z0_in[16*i]; uint64_t w0 = w0_in[16*i]; uint64_t c0 = c0a_in[4*i]; uint64_t d0 = d0a_in[4*i]; uint64_t e0 = e0a_in[4*i]; uint64_t x1 = x1_in[16*i]; uint64_t y1 = y1_in[16*i]; uint64_t z1 = z1_in[16*i]; uint64_t w1 = w1_in[16*i]; uint64_t c1 = c1a_in[4*i]; uint64_t d1 = d1a_in[4*i]; uint64_t e1 = e1a_in[4*i]; uint64_t x2 = x2_in[16*i]; uint64_t y2 = y2_in[16*i]; uint64_t z2 = z2_in[16*i]; uint64_t w2 = w2_in[16*i]; uint64_t c2 = c2a_in[4*i]; uint64_t d2 = d2a_in[4*i]; uint64_t e2 = e2a_in[4*i]; uint64_t x3 = x3_in[16*i]; uint64_t y3 = y3_in[16*i]; uint64_t z3 = z3_in[16*i]; uint64_t w3 = w3_in[16*i]; uint64_t c3 = c3a_in[4*i]; uint64_t d3 = d3a_in[4*i]; uint64_t e3 = e3a_in[4*i]; uint64_t conj_c0 = __builtin_e2k_pxord(c0, 1LL<<63); uint64_t conj_c1 = __builtin_e2k_pxord(c1, 1LL<<63); uint64_t 
conj_c2 = __builtin_e2k_pxord(c2, 1LL<<63); uint64_t conj_c3 = __builtin_e2k_pxord(c3, 1LL<<63); uint64_t conj_d0 = __builtin_e2k_pxord(d0, 1LL<<63); uint64_t conj_d1 = __builtin_e2k_pxord(d1, 1LL<<63); uint64_t conj_d2 = __builtin_e2k_pxord(d2, 1LL<<63); uint64_t conj_d3 = __builtin_e2k_pxord(d3, 1LL<<63); uint64_t conj_e0 = __builtin_e2k_pxord(e0, 1LL<<63); uint64_t conj_e1 = __builtin_e2k_pxord(e1, 1LL<<63); uint64_t conj_e2 = __builtin_e2k_pxord(e2, 1LL<<63); uint64_t conj_e3 = __builtin_e2k_pxord(e3, 1LL<<63); uint64_t swap_c0 = __builtin_e2k_pshufb(0, c0, 0x0302010007060504); uint64_t swap_c1 = __builtin_e2k_pshufb(0, c1, 0x0302010007060504); uint64_t swap_c2 = __builtin_e2k_pshufb(0, c2, 0x0302010007060504); uint64_t swap_c3 = __builtin_e2k_pshufb(0, c3, 0x0302010007060504); uint64_t swap_d0 = __builtin_e2k_pshufb(0, d0, 0x0302010007060504); uint64_t swap_d1 = __builtin_e2k_pshufb(0, d1, 0x0302010007060504); uint64_t swap_d2 = __builtin_e2k_pshufb(0, d2, 0x0302010007060504); uint64_t swap_d3 = __builtin_e2k_pshufb(0, d3, 0x0302010007060504); uint64_t swap_e0 = __builtin_e2k_pshufb(0, e0, 0x0302010007060504); uint64_t swap_e1 = __builtin_e2k_pshufb(0, e1, 0x0302010007060504); uint64_t swap_e2 = __builtin_e2k_pshufb(0, e2, 0x0302010007060504); uint64_t swap_e3 = __builtin_e2k_pshufb(0, e3, 0x0302010007060504); uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0); uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1); uint64_t cy2_real = __builtin_e2k_pfmuls(conj_c2, y2); uint64_t cy3_real = __builtin_e2k_pfmuls(conj_c3, y3); uint64_t dz0_real = __builtin_e2k_pfmuls(conj_d0, z0); uint64_t dz1_real = __builtin_e2k_pfmuls(conj_d1, z1); uint64_t dz2_real = __builtin_e2k_pfmuls(conj_d2, z2); uint64_t dz3_real = __builtin_e2k_pfmuls(conj_d3, z3); uint64_t ew0_real = __builtin_e2k_pfmuls(conj_e0, w0); uint64_t ew1_real = __builtin_e2k_pfmuls(conj_e1, w1); uint64_t ew2_real = __builtin_e2k_pfmuls(conj_e2, w2); uint64_t ew3_real = __builtin_e2k_pfmuls(conj_e3, w3); 
uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0); uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1); uint64_t cy2_imag = __builtin_e2k_pfmuls(swap_c2, y2); uint64_t cy3_imag = __builtin_e2k_pfmuls(swap_c3, y3); uint64_t dz0_imag = __builtin_e2k_pfmuls(swap_d0, z0); uint64_t dz1_imag = __builtin_e2k_pfmuls(swap_d1, z1); uint64_t dz2_imag = __builtin_e2k_pfmuls(swap_d2, z2); uint64_t dz3_imag = __builtin_e2k_pfmuls(swap_d3, z3); uint64_t ew0_imag = __builtin_e2k_pfmuls(swap_e0, w0); uint64_t ew1_imag = __builtin_e2k_pfmuls(swap_e1, w1); uint64_t ew2_imag = __builtin_e2k_pfmuls(swap_e2, w2); uint64_t ew3_imag = __builtin_e2k_pfmuls(swap_e3, w3); uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag); uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag); uint64_t cy2 = __builtin_e2k_pfhadds(cy2_real, cy2_imag); uint64_t cy3 = __builtin_e2k_pfhadds(cy3_real, cy3_imag); uint64_t dz0 = __builtin_e2k_pfhadds(dz0_real, dz0_imag); uint64_t dz1 = __builtin_e2k_pfhadds(dz1_real, dz1_imag); uint64_t dz2 = __builtin_e2k_pfhadds(dz2_real, dz2_imag); uint64_t dz3 = __builtin_e2k_pfhadds(dz3_real, dz3_imag); uint64_t ew0 = __builtin_e2k_pfhadds(ew0_real, ew0_imag); uint64_t ew1 = __builtin_e2k_pfhadds(ew1_real, ew1_imag); uint64_t ew2 = __builtin_e2k_pfhadds(ew2_real, ew2_imag); uint64_t ew3 = __builtin_e2k_pfhadds(ew3_real, ew3_imag); uint64_t add02_0 = __builtin_e2k_pfadds( x0, dz0); uint64_t add02_1 = __builtin_e2k_pfadds( x1, dz1); uint64_t add02_2 = __builtin_e2k_pfadds( x2, dz2); uint64_t add02_3 = __builtin_e2k_pfadds( x3, dz3); uint64_t sub02_0 = __builtin_e2k_pfsubs( x0, dz0); uint64_t sub02_1 = __builtin_e2k_pfsubs( x1, dz1); uint64_t sub02_2 = __builtin_e2k_pfsubs( x2, dz2); uint64_t sub02_3 = __builtin_e2k_pfsubs( x3, dz3); uint64_t add13_0 = __builtin_e2k_pfadds(cy0, ew0); uint64_t add13_1 = __builtin_e2k_pfadds(cy1, ew1); uint64_t add13_2 = __builtin_e2k_pfadds(cy2, ew2); uint64_t add13_3 = __builtin_e2k_pfadds(cy3, ew3); uint64_t sub13_0 = 
__builtin_e2k_pfsubs(cy0, ew0); uint64_t sub13_1 = __builtin_e2k_pfsubs(cy1, ew1); uint64_t sub13_2 = __builtin_e2k_pfsubs(cy2, ew2); uint64_t sub13_3 = __builtin_e2k_pfsubs(cy3, ew3); //uint64_t conj_sub13_0 = __builtin_e2k_pxord(sub13_0, 1LL<<63); //uint64_t conj_sub13_1 = __builtin_e2k_pxord(sub13_1, 1LL<<63); //uint64_t conj_sub13_2 = __builtin_e2k_pxord(sub13_2, 1LL<<63); //uint64_t conj_sub13_3 = __builtin_e2k_pxord(sub13_3, 1LL<<63); //uint64_t sub13i_0 = __builtin_e2k_pshufb(0, conj_sub13_0, 0x0302010007060504); //uint64_t sub13i_1 = __builtin_e2k_pshufb(0, conj_sub13_1, 0x0302010007060504); //uint64_t sub13i_2 = __builtin_e2k_pshufb(0, conj_sub13_2, 0x0302010007060504); //uint64_t sub13i_3 = __builtin_e2k_pshufb(0, conj_sub13_3, 0x0302010007060504); uint64_t swap_sub13_0 = __builtin_e2k_pshufb(0, sub13_0, 0x0302010007060504); uint64_t swap_sub13_1 = __builtin_e2k_pshufb(0, sub13_1, 0x0302010007060504); uint64_t swap_sub13_2 = __builtin_e2k_pshufb(0, sub13_2, 0x0302010007060504); uint64_t swap_sub13_3 = __builtin_e2k_pshufb(0, sub13_3, 0x0302010007060504); uint64_t sub13i_0 = __builtin_e2k_pxord(swap_sub13_0, 1LL<<31); uint64_t sub13i_1 = __builtin_e2k_pxord(swap_sub13_1, 1LL<<31); uint64_t sub13i_2 = __builtin_e2k_pxord(swap_sub13_2, 1LL<<31); uint64_t sub13i_3 = __builtin_e2k_pxord(swap_sub13_3, 1LL<<31); uint64_t out0 = __builtin_e2k_pfadds(add02_0, add13_0); uint64_t out1 = __builtin_e2k_pfadds(add02_1, add13_1); uint64_t out2 = __builtin_e2k_pfadds(add02_2, add13_2); uint64_t out3 = __builtin_e2k_pfadds(add02_3, add13_3); uint64_t out4 = __builtin_e2k_pfsubs(sub02_0, sub13i_0); uint64_t out5 = __builtin_e2k_pfsubs(sub02_1, sub13i_1); uint64_t out6 = __builtin_e2k_pfsubs(sub02_2, sub13i_2); uint64_t out7 = __builtin_e2k_pfsubs(sub02_3, sub13i_3); uint64_t out8 = __builtin_e2k_pfsubs(add02_0, add13_0); uint64_t out9 = __builtin_e2k_pfsubs(add02_1, add13_1); uint64_t out10 = __builtin_e2k_pfsubs(add02_2, add13_2); uint64_t out11 = 
__builtin_e2k_pfsubs(add02_3, add13_3); uint64_t out12 = __builtin_e2k_pfadds(sub02_0, sub13i_0); uint64_t out13 = __builtin_e2k_pfadds(sub02_1, sub13i_1); uint64_t out14 = __builtin_e2k_pfadds(sub02_2, sub13i_2); uint64_t out15 = __builtin_e2k_pfadds(sub02_3, sub13i_3); x0 = out0; y0 = out1; z0 = out2; w0 = out3; c0 = c0b_in[i]; d0 = d0b_in[i]; e0 = e0b_in[i]; x1 = out4; y1 = out5; z1 = out6; w1 = out7; c1 = c1b_in[i]; d1 = d1b_in[i]; e1 = e1b_in[i]; x2 = out8; y2 = out9; z2 = out10; w2 = out11; c2 = c2b_in[i]; d2 = d2b_in[i]; e2 = e2b_in[i]; x3 = out12; y3 = out13; z3 = out14; w3 = out15; c3 = c3b_in[i]; d3 = d3b_in[i]; e3 = e3b_in[i]; conj_c0 = __builtin_e2k_pxord(c0, 1LL<<63); conj_c1 = __builtin_e2k_pxord(c1, 1LL<<63); conj_c2 = __builtin_e2k_pxord(c2, 1LL<<63); conj_c3 = __builtin_e2k_pxord(c3, 1LL<<63); conj_d0 = __builtin_e2k_pxord(d0, 1LL<<63); conj_d1 = __builtin_e2k_pxord(d1, 1LL<<63); conj_d2 = __builtin_e2k_pxord(d2, 1LL<<63); conj_d3 = __builtin_e2k_pxord(d3, 1LL<<63); conj_e0 = __builtin_e2k_pxord(e0, 1LL<<63); conj_e1 = __builtin_e2k_pxord(e1, 1LL<<63); conj_e2 = __builtin_e2k_pxord(e2, 1LL<<63); conj_e3 = __builtin_e2k_pxord(e3, 1LL<<63); swap_c0 = __builtin_e2k_pshufb(0, c0, 0x0302010007060504); swap_c1 = __builtin_e2k_pshufb(0, c1, 0x0302010007060504); swap_c2 = __builtin_e2k_pshufb(0, c2, 0x0302010007060504); swap_c3 = __builtin_e2k_pshufb(0, c3, 0x0302010007060504); swap_d0 = __builtin_e2k_pshufb(0, d0, 0x0302010007060504); swap_d1 = __builtin_e2k_pshufb(0, d1, 0x0302010007060504); swap_d2 = __builtin_e2k_pshufb(0, d2, 0x0302010007060504); swap_d3 = __builtin_e2k_pshufb(0, d3, 0x0302010007060504); swap_e0 = __builtin_e2k_pshufb(0, e0, 0x0302010007060504); swap_e1 = __builtin_e2k_pshufb(0, e1, 0x0302010007060504); swap_e2 = __builtin_e2k_pshufb(0, e2, 0x0302010007060504); swap_e3 = __builtin_e2k_pshufb(0, e3, 0x0302010007060504); cy0_real = __builtin_e2k_pfmuls(conj_c0, y0); cy1_real = __builtin_e2k_pfmuls(conj_c1, y1); cy2_real = 
__builtin_e2k_pfmuls(conj_c2, y2); cy3_real = __builtin_e2k_pfmuls(conj_c3, y3); dz0_real = __builtin_e2k_pfmuls(conj_d0, z0); dz1_real = __builtin_e2k_pfmuls(conj_d1, z1); dz2_real = __builtin_e2k_pfmuls(conj_d2, z2); dz3_real = __builtin_e2k_pfmuls(conj_d3, z3); ew0_real = __builtin_e2k_pfmuls(conj_e0, w0); ew1_real = __builtin_e2k_pfmuls(conj_e1, w1); ew2_real = __builtin_e2k_pfmuls(conj_e2, w2); ew3_real = __builtin_e2k_pfmuls(conj_e3, w3); cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0); cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1); cy2_imag = __builtin_e2k_pfmuls(swap_c2, y2); cy3_imag = __builtin_e2k_pfmuls(swap_c3, y3); dz0_imag = __builtin_e2k_pfmuls(swap_d0, z0); dz1_imag = __builtin_e2k_pfmuls(swap_d1, z1); dz2_imag = __builtin_e2k_pfmuls(swap_d2, z2); dz3_imag = __builtin_e2k_pfmuls(swap_d3, z3); ew0_imag = __builtin_e2k_pfmuls(swap_e0, w0); ew1_imag = __builtin_e2k_pfmuls(swap_e1, w1); ew2_imag = __builtin_e2k_pfmuls(swap_e2, w2); ew3_imag = __builtin_e2k_pfmuls(swap_e3, w3); cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag); cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag); cy2 = __builtin_e2k_pfhadds(cy2_real, cy2_imag); cy3 = __builtin_e2k_pfhadds(cy3_real, cy3_imag); dz0 = __builtin_e2k_pfhadds(dz0_real, dz0_imag); dz1 = __builtin_e2k_pfhadds(dz1_real, dz1_imag); dz2 = __builtin_e2k_pfhadds(dz2_real, dz2_imag); dz3 = __builtin_e2k_pfhadds(dz3_real, dz3_imag); ew0 = __builtin_e2k_pfhadds(ew0_real, ew0_imag); ew1 = __builtin_e2k_pfhadds(ew1_real, ew1_imag); ew2 = __builtin_e2k_pfhadds(ew2_real, ew2_imag); ew3 = __builtin_e2k_pfhadds(ew3_real, ew3_imag); add02_0 = __builtin_e2k_pfadds( x0, dz0); add02_1 = __builtin_e2k_pfadds( x1, dz1); add02_2 = __builtin_e2k_pfadds( x2, dz2); add02_3 = __builtin_e2k_pfadds( x3, dz3); sub02_0 = __builtin_e2k_pfsubs( x0, dz0); sub02_1 = __builtin_e2k_pfsubs( x1, dz1); sub02_2 = __builtin_e2k_pfsubs( x2, dz2); sub02_3 = __builtin_e2k_pfsubs( x3, dz3); add13_0 = __builtin_e2k_pfadds(cy0, ew0); add13_1 = 
__builtin_e2k_pfadds(cy1, ew1); add13_2 = __builtin_e2k_pfadds(cy2, ew2); add13_3 = __builtin_e2k_pfadds(cy3, ew3); sub13_0 = __builtin_e2k_pfsubs(cy0, ew0); sub13_1 = __builtin_e2k_pfsubs(cy1, ew1); sub13_2 = __builtin_e2k_pfsubs(cy2, ew2); sub13_3 = __builtin_e2k_pfsubs(cy3, ew3); //conj_sub13_0 = __builtin_e2k_pxord(sub13_0, 1LL<<63); //conj_sub13_1 = __builtin_e2k_pxord(sub13_1, 1LL<<63); //conj_sub13_2 = __builtin_e2k_pxord(sub13_2, 1LL<<63); //conj_sub13_3 = __builtin_e2k_pxord(sub13_3, 1LL<<63); //sub13i_0 = __builtin_e2k_pshufb(0, conj_sub13_0, 0x0302010007060504); //sub13i_1 = __builtin_e2k_pshufb(0, conj_sub13_1, 0x0302010007060504); //sub13i_2 = __builtin_e2k_pshufb(0, conj_sub13_2, 0x0302010007060504); //sub13i_3 = __builtin_e2k_pshufb(0, conj_sub13_3, 0x0302010007060504); swap_sub13_0 = __builtin_e2k_pshufb(0, sub13_0, 0x0302010007060504); swap_sub13_1 = __builtin_e2k_pshufb(0, sub13_1, 0x0302010007060504); swap_sub13_2 = __builtin_e2k_pshufb(0, sub13_2, 0x0302010007060504); swap_sub13_3 = __builtin_e2k_pshufb(0, sub13_3, 0x0302010007060504); sub13i_0 = __builtin_e2k_pxord(swap_sub13_0, 1LL<<31); sub13i_1 = __builtin_e2k_pxord(swap_sub13_1, 1LL<<31); sub13i_2 = __builtin_e2k_pxord(swap_sub13_2, 1LL<<31); sub13i_3 = __builtin_e2k_pxord(swap_sub13_3, 1LL<<31); out_0[i] = __builtin_e2k_pfadds(add02_0, add13_0); out_1[i] = __builtin_e2k_pfadds(add02_1, add13_1); out_2[i] = __builtin_e2k_pfadds(add02_2, add13_2); out_3[i] = __builtin_e2k_pfadds(add02_3, add13_3); out_4[i] = __builtin_e2k_pfsubs(sub02_0, sub13i_0); out_5[i] = __builtin_e2k_pfsubs(sub02_1, sub13i_1); out_6[i] = __builtin_e2k_pfsubs(sub02_2, sub13i_2); out_7[i] = __builtin_e2k_pfsubs(sub02_3, sub13i_3); out_8[i] = __builtin_e2k_pfsubs(add02_0, add13_0); out_9[i] = __builtin_e2k_pfsubs(add02_1, add13_1); out_10[i] = __builtin_e2k_pfsubs(add02_2, add13_2); out_11[i] = __builtin_e2k_pfsubs(add02_3, add13_3); out_12[i] = __builtin_e2k_pfadds(sub02_0, sub13i_0); out_13[i] = 
__builtin_e2k_pfadds(sub02_1, sub13i_1); out_14[i] = __builtin_e2k_pfadds(sub02_2, sub13i_2); out_15[i] = __builtin_e2k_pfadds(sub02_3, sub13i_3); } }
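The complex products in the listing above (cy, dz, ew) are each assembled from four steps: a sign flip, a lane swap, a lane-wise multiply, and a horizontal add. The sketch below emulates those steps in portable C so the arithmetic can be checked off the target machine. It is not the author's code: `pair_t` and every function name are hypothetical, and the lane layout (real part in the lower 32 bits, imaginary part in the upper 32 bits) and the behaviour of the horizontal add are assumptions inferred from the masks and from how the results are used.

```c
#include <assert.h>  /* only needed for the checks mentioned in the usage note */

/* Two packed floats. Assumed layout: lane[0] = real, lane[1] = imaginary. */
typedef struct { float lane[2]; } pair_t;

/* Negate the upper lane; plays the role of pxord(c, 1LL<<63). */
static inline pair_t conj_pair(pair_t c) {
    pair_t r = {{ c.lane[0], -c.lane[1] }};
    return r;
}

/* Exchange the two lanes; plays the role of pshufb with 0x0302010007060504. */
static inline pair_t swap_pair(pair_t c) {
    pair_t r = {{ c.lane[1], c.lane[0] }};
    return r;
}

/* Lane-wise product; plays the role of pfmuls. */
static inline pair_t mul_pair(pair_t a, pair_t b) {
    pair_t r = {{ a.lane[0] * b.lane[0], a.lane[1] * b.lane[1] }};
    return r;
}

/* Horizontal add (assumed semantics of pfhadds):
   lane 0 = sum of the lanes of a, lane 1 = sum of the lanes of b. */
static inline pair_t hadd_pair(pair_t a, pair_t b) {
    pair_t r = {{ a.lane[0] + a.lane[1], b.lane[0] + b.lane[1] }};
    return r;
}

/* Complex product c*y built from the four steps above. */
static inline pair_t cmul_pair(pair_t c, pair_t y) {
    pair_t re_terms = mul_pair(conj_pair(c), y); /* ( c.re*y.re, -c.im*y.im ) */
    pair_t im_terms = mul_pair(swap_pair(c), y); /* ( c.im*y.re,  c.re*y.im ) */
    return hadd_pair(re_terms, im_terms);        /* ( Re(c*y),    Im(c*y)   ) */
}
```

As a quick check, (1 + 2i)(3 - 4i) = 11 + 2i, so `cmul_pair` applied to the pairs (1, 2) and (3, -4) should return the pair (11, 2).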
Main loop in assembly
.L3676: { fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=1, abs=0, disp=16 } { fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=64 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=32 } { fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=4, asz=1, abs=4, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=96 } { fapb ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=1, abs=6, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=3, asz=1, abs=6, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=2, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=2, abs=8, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=2, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=2, abs=12, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=2, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=2, abs=16, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=2, abs=20, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=2, abs=20, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=2, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=2, abs=24, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=2, abs=28, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=2, abs=28, disp=0 } .L2949: { loop_mode pfmul_hadds,0,sm %b[111], %b[62], %b[104], %b[62] pfmuls,1,sm %b[63], %b[57], %b[63] pfmuls,2,sm %b[97], %b[82], %b[97] pshufb,3,sm 0x0, %b[29], %r26, %b[99] } { loop_mode pfsubs,1,sm %b[68], %b[67], %b[103] pfmul_hadds,2,sm %b[100], %b[52], %b[103], %b[100] pshufb,3,sm 0x0, %b[32], %r26, %b[104] xord,4,sm %b[86], %r0, %b[105] xord,5,sm %b[105], %r6, %b[106] } { loop_mode pfmul_hadds,0,sm %b[80], %b[53], %b[98], 
%b[80] pfsubs,1,sm %b[81], %b[79], %b[93] pfmul_hadds,2,sm %b[93], %b[96], %b[89], %b[89] pshufb,3,sm 0x0, %b[49], %r26, %b[98] pfmuls,5,sm %b[105], %b[102], %b[96] } { loop_mode pfsub_adds,0,sm %b[13], %b[64], %b[106], %b[87] pfsubs,1,sm %b[91], %b[87], %b[91] pfadds,2,sm %b[91], %b[87], %b[105] xord,5,sm %b[21], %r0, %b[108] } { loop_mode pfmul_hadds,0,sm %b[99], %b[82], %b[97], %b[82] pfsubs,1,sm %b[94], %b[84], %b[84] pfadds,2,sm %b[94], %b[84], %b[86] pshufb,3,sm 0x0, %b[48], %r26, %b[94] pshufb,4,sm 0x0, %b[86], %r26, %b[97] xord,5,sm %b[74], %r0, %b[99] } { loop_mode pfsub_rsubs,0,sm %b[13], %b[64], %b[106], %b[64] pfsubs,1,sm %b[69], %b[15], %b[106] pfmuls,3,sm %b[99], %b[77], %b[99] pfmuls,4,sm %b[108], %b[10], %b[1] xord,5,sm %b[83], %r0, %b[108] } { loop_mode pfmul_hadds,0,sm %b[104], %b[60], %b[90], %b[63] pfadds,1,sm %b[81], %b[79], %b[79] pfmul_hadds,2,sm %b[98], %b[57], %b[63], %b[60] pshufb,3,sm 0x0, %b[23], %r26, %b[90] pfmuls,4,sm %b[108], %b[72], %b[81] pfmul_hadds,5,sm %b[97], %b[102], %b[96], %b[13] } { loop_mode pfadd_adds,0,sm %b[107], %b[71], %b[105], %b[75] pfsubs,1,sm %b[85], %b[75], %b[85] pfadds,2,sm %b[85], %b[75], %b[96] xord,5,sm %b[78], %r0, %b[97] } { loop_mode pfmul_hadds,0,sm %b[94], %b[61], %b[88], %b[93] xord,1,sm %b[47], %r0, %b[61] pfadd_rsubs,2,sm %b[101], %b[70], %b[86], %b[94] pshufb,3,sm 0x0, %b[93], %r26, %b[97] xord,4,sm %b[30], %r0, %b[88] pfmuls,5,sm %b[97], %b[92], %b[98] } { loop_mode pfadd_rsubs,0,sm %b[107], %b[71], %b[105], %b[83] pfadds,1,sm %b[69], %b[15], %b[91] pfadds,2,sm %b[62], %b[73], %b[102] pshufb,3,sm 0x0, %b[91], %r26, %b[104] pshufb,4,sm 0x0, %b[83], %r26, %b[105] xord,5,sm %b[109], %r0, %b[69] } { loop_mode pfmul_hadds,0,sm %b[90], %b[12], %b[3], %b[62] pshufb,1,sm 0x0, %b[74], %r26, %b[90] pfadd_adds,2,sm %b[101], %b[70], %b[86], %b[73] pshufb,3,sm 0x0, %b[84], %r26, %b[74] pfsubs,4,sm %b[62], %b[73], %b[84] xord,5,sm %b[76], %r0, %b[86] } { loop_mode pfadd_adds,0,sm %b[41], %b[80], %b[79], %b[68] 
pfadds,1,sm %b[68], %b[67], %b[78] pfadd_rsubs,2,sm %b[87], %b[89], %b[96], %b[72] pshufb,3,sm 0x0, %b[106], %r26, %b[81] pshufb,4,sm 0x0, %b[78], %r26, %b[105] pfmul_hadds,5,sm %b[105], %b[72], %b[81], %b[67] } { loop_mode pfadd_rsubs,0,sm %b[41], %b[80], %b[79], %b[90] pfmul_hadds,1,sm %b[90], %b[77], %b[99], %b[77] pfadd_adds,2,sm %b[87], %b[89], %b[96], %b[96] pshufb,3,sm 0x0, %b[103], %r26, %b[98] xord,4,sm %b[110], %r0, %b[92] pfmul_hadds,5,sm %b[105], %b[92], %b[98], %b[79] } { loop_mode pfadd_adds,0,sm %b[20], %b[100], %b[91], %b[85] pfmuls,1,sm %b[86], %b[66], %b[86] pfadd_adds,2,sm %b[64], %b[82], %b[102], %b[99] pshufb,3,sm 0x0, %b[85], %r26, %b[105] xord,4,sm %b[28], %r0, %b[103] xord,5,sm %b[104], %r6, %b[104] } { loop_mode pfadd_rsubs,0,sm %b[20], %b[100], %b[91], %b[74] xord,1,sm %b[95], %r0, %b[108] pfadd_rsubs,2,sm %b[64], %b[82], %b[102], %b[84] pshufb,3,sm 0x0, %b[84], %r26, %b[106] xord,4,sm %b[45], %r0, %b[91] xord,5,sm %b[74], %r6, %b[102] } { loop_mode pfadd_rsubs,0,sm %b[56], %b[60], %b[78], %b[97] pfmuls,1,sm %b[108], %b[65], %b[111] pfsub_adds,2,sm %b[107], %b[71], %b[104], %b[108] xord,3,sm %b[46], %r0, %b[81] xord,4,sm %b[97], %r6, %b[112] xord,5,sm %b[81], %r6, %b[113] } { loop_mode pfadd_adds,0,sm %b[56], %b[60], %b[78], %b[109] pshufb,1,sm 0x0, %b[110], %r26, %b[78] pfsub_rsubs,2,sm %b[101], %b[70], %b[102], %b[110] pshufb,3,sm 0x0, %b[109], %r26, %b[98] xord,4,sm %b[98], %r6, %b[114] addd,5,sm 0x8, %b[8], %b[6] ? 
%pcnt0 } { loop_mode pfsub_rsubs,0,sm %b[20], %b[100], %b[113], %b[71] pfsubs,1,sm %b[93], %b[63], %b[75] pfsub_rsubs,2,sm %b[107], %b[71], %b[104], %b[76] pshufb,3,sm 0x0, %b[76], %r26, %b[104] xord,4,sm %b[105], %r6, %b[105] std,5 %r25, %b[8], %b[75] } { loop_mode pfsub_adds,0,sm %b[20], %b[100], %b[113], %b[63] pfadds,1,sm %b[93], %b[63], %b[93] pfsub_adds,2,sm %b[101], %b[70], %b[102], %b[70] pshufb,3,sm 0x0, %b[95], %r26, %b[94] xord,4,sm %b[106], %r6, %b[95] std,5 %r23, %b[8], %b[94] } { loop_mode pfsub_adds,0,sm %b[56], %b[60], %b[114], %b[83] pfmuls,1,sm %b[91], %b[68], %b[100] pfsub_rsubs,2,sm %b[87], %b[89], %b[105], %b[101] pshufb,3,sm 0x0, %b[36], %r26, %b[91] xord,4,sm %b[44], %r0, %b[102] std,5 %r18, %b[8], %b[83] } { loop_mode pfsub_rsubs,0,sm %b[56], %b[60], %b[114], %b[60] pfmuls,1,sm %b[103], %b[90], %b[103] pfsub_adds,2,sm %b[64], %b[82], %b[95], %b[73] pshufb,3,sm 0x0, %b[45], %r26, %b[106] xord,4,sm %b[24], %r0, %b[107] std,5 %r2, %b[8], %b[73] } { loop_mode pfsub_adds,0,sm %b[87], %b[89], %b[105], %b[64] pfmuls,1,sm %b[102], %b[85], %b[82] pfsub_rsubs,2,sm %b[64], %b[82], %b[95], %b[72] pshufb,3,sm 0x0, %b[44], %r26, %b[87] xord,4,sm %b[33], %r0, %b[89] std,5 %r12, %b[8], %b[72] } { loop_mode pfmul_hadds,0,sm %b[94], %b[65], %b[111], %b[65] pfmuls,1,sm %b[107], %b[74], %b[95] pfsub_adds,2,sm %b[41], %b[80], %b[112], %b[94] pshufb,3,sm 0x0, %b[24], %r26, %b[96] xord,4,sm %b[40], %r0, %b[102] std,5 %r16, %b[8], %b[96] movad,0 area=9, ind=0, am=1, be=0, %b[15] movad,1 area=8, ind=0, am=1, be=0, %b[12] movad,2 area=9, ind=0, am=1, be=0, %b[3] movad,3 area=8, ind=0, am=1, be=0, %b[20] } { loop_mode pfmul_hadds,0,sm %b[104], %b[66], %b[86], %b[66] pfmuls,1,sm %b[89], %b[97], %b[86] pshufb,3,sm 0x0, %b[33], %r26, %b[89] xord,4,sm %b[36], %r0, %b[99] std,5 %r22, %b[8], %b[99] movad,0 area=7, ind=0, am=1, be=0, %b[24] movad,1 area=6, ind=0, am=1, be=0, %b[32] movad,2 area=7, ind=0, am=1, be=0, %b[29] movad,3 area=6, ind=0, am=1, be=0, %b[23] } { 
loop_mode pfsub_rsubs,0,sm %b[41], %b[80], %b[112], %b[80] pfmuls,1,sm %b[102], %b[109], %b[84] pshufb,3,sm 0x0, %b[40], %r26, %b[102] xord,4,sm %b[19], %r0, %b[104] std,5 %r19, %b[8], %b[84] movad,0 area=5, ind=0, am=1, be=0, %b[33] movad,1 area=4, ind=0, am=1, be=0, %b[41] movad,2 area=5, ind=0, am=1, be=0, %b[40] movad,3 area=4, ind=0, am=1, be=0, %b[36] } { loop_mode pfadd_adds,0,sm %b[11], %b[62], %b[93], %b[99] pfmuls,1,sm %b[99], %b[71], %b[111] pfadd_rsubs,2,sm %b[11], %b[62], %b[93], %b[105] pshufb,3,sm 0x0, %b[28], %r26, %b[112] xord,4,sm %b[16], %r0, %b[113] std,5 %r14, %b[8], %b[108] movad,0 area=3, ind=8, am=1, be=0, %b[93] movad,1 area=3, ind=0, am=0, be=0, %b[28] movad,2 area=3, ind=16, am=0, be=0, %b[108] movad,3 area=3, ind=24, am=0, be=0, %b[107] } { loop_mode pfmul_hadds,0,sm %b[87], %b[85], %b[82], %b[82] pfmuls,1,sm %b[104], %b[63], %b[87] pfmul_hadds,2,sm %b[96], %b[74], %b[95], %b[85] pshufb,3,sm 0x0, %b[19], %r26, %b[95] xord,4,sm %b[37], %r0, %b[96] std,5 %r24, %b[8], %b[110] movad,0 area=2, ind=0, am=0, be=0, %b[44] movad,1 area=2, ind=8, am=0, be=0, %b[74] movad,2 area=3, ind=8, am=1, be=0, %b[45] movad,3 area=3, ind=0, am=0, be=0, %b[19] } { loop_mode pfmul_hadds,0,sm %b[89], %b[97], %b[86], %b[89] pfmuls,1,sm %b[81], %b[59], %b[86] pfmuls,2,sm %b[113], %b[83], %b[97] pshufb,3,sm 0x0, %b[16], %r26, %b[104] std,5 %r15, %b[8], %b[76] movad,0 area=2, ind=16, am=0, be=0, %b[76] movad,1 area=2, ind=24, am=1, be=0, %b[81] movad,2 area=2, ind=0, am=0, be=0, %b[16] movad,3 area=2, ind=16, am=0, be=0, %b[48] } { loop_mode pfmul_hadds,0,sm %b[102], %b[109], %b[84], %b[92] pfmuls,1,sm %b[92], %b[51], %b[96] pfmuls,2,sm %b[96], %b[60], %b[102] pshufb,3,sm 0x0, %b[37], %r26, %b[109] xord,4,sm %b[7], %r0, %b[110] std,5 %r17, %b[8], %b[70] movad,0 area=1, ind=0, am=0, be=0, %b[37] movad,1 area=1, ind=16, am=0, be=0, %b[49] movad,2 area=2, ind=8, am=0, be=0, %b[70] movad,3 area=0, ind=8, am=0, be=0, %b[84] } { loop_mode pfmul_hadds,0,sm %b[106], %b[68], 
%b[100], %b[68] pfmuls,1,sm %b[69], %b[50], %b[101] pfmul_hadds,2,sm %b[112], %b[90], %b[103], %b[69] pshufb,3,sm 0x0, %b[75], %r26, %b[103] std,5 %r20, %b[8], %b[101] movad,0 area=1, ind=8, am=1, be=0, %b[90] movad,1 area=1, ind=24, am=0, be=0, %b[75] movad,2 area=2, ind=24, am=1, be=0, %b[100] movad,3 area=1, ind=0, am=0, be=0, %b[52] } { loop_mode pfmul_hadds,0,sm %b[91], %b[71], %b[111], %b[71] pfmuls,1,sm %b[110], %b[94], %b[87] pfmul_hadds,2,sm %b[95], %b[63], %b[87], %b[73] pshufb,3,sm 0x0, %b[7], %r26, %b[91] xord,4,sm %b[27], %r0, %b[95] std,5 %r11, %b[8], %b[73] movad,0 area=0, ind=0, am=0, be=0, %b[7] movad,1 area=0, ind=24, am=0, be=0, %b[56] movad,2 area=1, ind=16, am=0, be=0, %b[53] movad,3 area=1, ind=24, am=0, be=0, %b[63] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END pfmul_hadds,0,sm %b[104], %b[83], %b[97], %b[83] pfmuls,1,sm %b[88], %b[58], %b[88] std,2 %r13, %b[8], %b[64] std,5 %r21, %b[8], %b[72] movad,0 area=0, ind=8, am=1, be=0, %b[57] movad,1 area=0, ind=16, am=0, be=0, %b[8] movad,2 area=1, ind=8, am=1, be=0, %b[64] movad,3 area=0, ind=0, am=1, be=0, %b[72] }
Theoretical throughput: 16 complex numbers per 32 cycles (16/32). Each complex number occupies 8 bytes, so this is 16 × 8 / 32 = 4 bytes/cycle
Quadrupled theoretical throughput: 16 bytes/cycle
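The arithmetic behind these two figures can be written out as a small helper. This is only an illustrative sketch: the function name `bytes_per_cycle` is hypothetical, the 8-byte element size follows from the problem statement (a complex value made of two `float` parts), and the factor of 4 is taken as given from the line above.

```c
#include <assert.h>

/* Bytes processed per cycle when `count` single-precision complex values
   (8 bytes each: two 4-byte floats) are handled in `cycles` cycles. */
static inline double bytes_per_cycle(int count, int cycles) {
    assert(cycles > 0);
    return (double)count * 8.0 / (double)cycles;
}

/* Figures for this loop: 16 complex values in 32 cycles. */
static inline double stage_throughput(void) {
    return bytes_per_cycle(16, 32);   /* 16 * 8 / 32 = 4 bytes/cycle */
}

/* The quadrupled figure quoted in the text. */
static inline double stage_throughput_x4(void) {
    return 4.0 * stage_throughput();  /* 4 * 4 = 16 bytes/cycle */
}
```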
Speed measurements

3. stage_radix4_2x_simd128
Here the stage_radix4_simd128 algorithm is manually unrolled by a factor of 2.
C code
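To make the idea of a 2x unrolled radix-4 stage easier to follow, here is a scalar sketch in portable C99. It is not the author's implementation: `cf32`, `mul_i`, `radix4_butterfly` and `radix4_2x` are hypothetical names, the butterfly reuses the add02/sub02/add13/sub13 structure seen in the listings, and the input and output indexing (four consecutive inputs per first-pass butterfly, output j of butterfly k stored at `out[4*j + k]`) is an assumption made for illustration. The point is only that the results of the first pass stay in local variables and feed the second pass directly, without a round trip through memory.

```c
#include <assert.h>   /* only needed for the checks mentioned in the usage note */
#include <complex.h>

typedef float complex cf32;

/* Multiply by i: (re, im) -> (-im, re). */
static inline cf32 mul_i(cf32 v) {
    return -cimagf(v) + crealf(v) * I;
}

/* One radix-4 butterfly on inputs x, y, z, w with twiddles c, d, e
   applied to y, z, w respectively. */
static inline void radix4_butterfly(cf32 x, cf32 y, cf32 z, cf32 w,
                                    cf32 c, cf32 d, cf32 e, cf32 out[4]) {
    cf32 cy = c * y, dz = d * z, ew = e * w;
    cf32 add02 = x + dz, sub02 = x - dz;
    cf32 add13 = cy + ew;
    cf32 sub13i = mul_i(cy - ew);
    out[0] = add02 + add13;
    out[1] = sub02 - sub13i;
    out[2] = add02 - add13;
    out[3] = sub02 + sub13i;
}

/* Two radix-4 passes fused into one call (the "2x" idea).
   tw_a[b] and tw_b[k] hold the twiddles (c, d, e) of each butterfly. */
static inline void radix4_2x(const cf32 in[16], cf32 tw_a[4][3],
                             cf32 tw_b[4][3], cf32 out[16]) {
    cf32 mid[4][4];  /* first-pass results, never written back to memory */
    for (int b = 0; b < 4; ++b)
        radix4_butterfly(in[4*b], in[4*b + 1], in[4*b + 2], in[4*b + 3],
                         tw_a[b][0], tw_a[b][1], tw_a[b][2], mid[b]);
    for (int k = 0; k < 4; ++k) {
        cf32 r[4];
        /* Second-pass butterfly k consumes output k of every first-pass one. */
        radix4_butterfly(mid[0][k], mid[1][k], mid[2][k], mid[3][k],
                         tw_b[k][0], tw_b[k][1], tw_b[k][2], r);
        for (int j = 0; j < 4; ++j)
            out[4*j + k] = r[j];
    }
}
```

With all twiddles set to 1, `radix4_butterfly` applied to (1, 2, 3, 4) gives the 4-point DFT (10, -2 + 2i, -2, -2 - 2i), and `radix4_2x` applied to a unit impulse at index 0 gives sixteen ones.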
void stage_radix4_2x_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b) { __v2di *xy0_in = (__v2di*)&data_in[ 0]; __v2di *zw0_in = (__v2di*)&data_in[ 2]; __v2di *xy1_in = (__v2di*)&data_in[ 4]; __v2di *zw1_in = (__v2di*)&data_in[ 6]; __v2di *xy2_in = (__v2di*)&data_in[ 8]; __v2di *zw2_in = (__v2di*)&data_in[10]; __v2di *xy3_in = (__v2di*)&data_in[12]; __v2di *zw3_in = (__v2di*)&data_in[14]; __v2di *xy4_in = (__v2di*)&data_in[16]; __v2di *zw4_in = (__v2di*)&data_in[18]; __v2di *xy5_in = (__v2di*)&data_in[20]; __v2di *zw5_in = (__v2di*)&data_in[22]; __v2di *xy6_in = (__v2di*)&data_in[24]; __v2di *zw6_in = (__v2di*)&data_in[26]; __v2di *xy7_in = (__v2di*)&data_in[28]; __v2di *zw7_in = (__v2di*)&data_in[30]; __v2di *c0a_in = (__v2di*)&coefC_a[0]; __v2di *c1a_in = (__v2di*)&coefC_a[2]; __v2di *c2a_in = (__v2di*)&coefC_a[4]; __v2di *c3a_in = (__v2di*)&coefC_a[6]; __v2di *d0a_in = (__v2di*)&coefD_a[0]; __v2di *d1a_in = (__v2di*)&coefD_a[2]; __v2di *d2a_in = (__v2di*)&coefD_a[4]; __v2di *d3a_in = (__v2di*)&coefD_a[6]; __v2di *e0a_in = (__v2di*)&coefE_a[0]; __v2di *e1a_in = (__v2di*)&coefE_a[2]; __v2di *e2a_in = (__v2di*)&coefE_a[4]; __v2di *e3a_in = (__v2di*)&coefE_a[6]; __v2di *c0b_in = (__v2di*)&coefC_b[0*data_count/16]; __v2di *c1b_in = (__v2di*)&coefC_b[1*data_count/16]; __v2di *c2b_in = (__v2di*)&coefC_b[2*data_count/16]; __v2di *c3b_in = (__v2di*)&coefC_b[3*data_count/16]; __v2di *d0b_in = (__v2di*)&coefD_b[0*data_count/16]; __v2di *d1b_in = (__v2di*)&coefD_b[1*data_count/16]; __v2di *d2b_in = (__v2di*)&coefD_b[2*data_count/16]; __v2di *d3b_in = (__v2di*)&coefD_b[3*data_count/16]; __v2di *e0b_in = (__v2di*)&coefE_b[0*data_count/16]; __v2di *e1b_in = (__v2di*)&coefE_b[1*data_count/16]; __v2di *e2b_in = (__v2di*)&coefE_b[2*data_count/16]; __v2di *e3b_in = (__v2di*)&coefE_b[3*data_count/16]; __v2di *out_0 = (__v2di*)&data_out[ 
0*data_count/16]; __v2di *out_1 = (__v2di*)&data_out[ 1*data_count/16]; __v2di *out_2 = (__v2di*)&data_out[ 2*data_count/16]; __v2di *out_3 = (__v2di*)&data_out[ 3*data_count/16]; __v2di *out_4 = (__v2di*)&data_out[ 4*data_count/16]; __v2di *out_5 = (__v2di*)&data_out[ 5*data_count/16]; __v2di *out_6 = (__v2di*)&data_out[ 6*data_count/16]; __v2di *out_7 = (__v2di*)&data_out[ 7*data_count/16]; __v2di *out_8 = (__v2di*)&data_out[ 8*data_count/16]; __v2di *out_9 = (__v2di*)&data_out[ 9*data_count/16]; __v2di *out_10 = (__v2di*)&data_out[10*data_count/16]; __v2di *out_11 = (__v2di*)&data_out[11*data_count/16]; __v2di *out_12 = (__v2di*)&data_out[12*data_count/16]; __v2di *out_13 = (__v2di*)&data_out[13*data_count/16]; __v2di *out_14 = (__v2di*)&data_out[14*data_count/16]; __v2di *out_15 = (__v2di*)&data_out[15*data_count/16]; #pragma ivdep #pragma unroll(1) #pragma prefetch for(int64_t i = 0; i < data_count/32; ++i) { __v2di xy0 = xy0_in[16*i]; __v2di zw0 = zw0_in[16*i]; __v2di xy1 = xy1_in[16*i]; __v2di zw1 = zw1_in[16*i]; __v2di c0 = c0a_in[4*i]; __v2di d0 = d0a_in[4*i]; __v2di e0 = e0a_in[4*i]; __v2di xy2 = xy2_in[16*i]; __v2di zw2 = zw2_in[16*i]; __v2di xy3 = xy3_in[16*i]; __v2di zw3 = zw3_in[16*i]; __v2di c1 = c1a_in[4*i]; __v2di d1 = d1a_in[4*i]; __v2di e1 = e1a_in[4*i]; __v2di xy4 = xy4_in[16*i]; __v2di zw4 = zw4_in[16*i]; __v2di xy5 = xy5_in[16*i]; __v2di zw5 = zw5_in[16*i]; __v2di c2 = c2a_in[4*i]; __v2di d2 = d2a_in[4*i]; __v2di e2 = e2a_in[4*i]; __v2di xy6 = xy6_in[16*i]; __v2di zw6 = zw6_in[16*i]; __v2di xy7 = xy7_in[16*i]; __v2di zw7 = zw7_in[16*i]; __v2di c3 = c3a_in[4*i]; __v2di d3 = d3a_in[4*i]; __v2di e3 = e3a_in[4*i]; __v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w0 = __builtin_e2k_qpshufb(zw1, zw0, 
(__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_c2 = __builtin_e2k_qpxor(c2, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_c3 = __builtin_e2k_qpxor(c3, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_d0 = __builtin_e2k_qpxor(d0, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_d1 = __builtin_e2k_qpxor(d1, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_d2 = __builtin_e2k_qpxor(d2, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_d3 = __builtin_e2k_qpxor(d3, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_e0 = __builtin_e2k_qpxor(e0, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_e1 = __builtin_e2k_qpxor(e1, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_e2 = __builtin_e2k_qpxor(e2, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_e3 = 
__builtin_e2k_qpxor(e3, (__v2di){1LL<<63, 1LL<<63}); __v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0); __v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1); __v2di cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2); __v2di cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3); __v2di dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0); __v2di dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1); __v2di dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2); __v2di dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3); __v2di ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0); __v2di ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1); __v2di ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2); __v2di ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3); __v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0); __v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1); __v2di 
cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2); __v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3); __v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0); __v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1); __v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2); __v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3); __v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0); __v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1); __v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2); __v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3); __v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag); __v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag); __v2di cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag); __v2di cy3_rrii = __builtin_e2k_qpfhadds(cy3_real, cy3_imag); __v2di dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag); __v2di dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag); __v2di dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag); __v2di dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag); __v2di ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag); __v2di ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag); __v2di ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag); __v2di ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag); __v2di cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 
0x0F0E0D0C07060504}); __v2di dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di add02_0 = __builtin_e2k_qpfadds( x0, dz0); __v2di add02_1 = __builtin_e2k_qpfadds( x1, dz1); __v2di add02_2 = __builtin_e2k_qpfadds( x2, dz2); __v2di add02_3 = __builtin_e2k_qpfadds( x3, dz3); __v2di sub02_0 = __builtin_e2k_qpfsubs( x0, dz0); __v2di sub02_1 = __builtin_e2k_qpfsubs( x1, dz1); __v2di sub02_2 = __builtin_e2k_qpfsubs( x2, dz2); __v2di sub02_3 = __builtin_e2k_qpfsubs( x3, dz3); __v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0); __v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1); __v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2); __v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3); __v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0); __v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1); __v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2); __v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3); __v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31}); __v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31}); __v2di sub13i_2 = 
__builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31}); __v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31}); __v2di out0 = __builtin_e2k_qpfadds(add02_0, add13_0); __v2di out1 = __builtin_e2k_qpfadds(add02_1, add13_1); __v2di out2 = __builtin_e2k_qpfadds(add02_2, add13_2); __v2di out3 = __builtin_e2k_qpfadds(add02_3, add13_3); __v2di out4 = __builtin_e2k_qpfsubs(sub02_0, sub13i_0); __v2di out5 = __builtin_e2k_qpfsubs(sub02_1, sub13i_1); __v2di out6 = __builtin_e2k_qpfsubs(sub02_2, sub13i_2); __v2di out7 = __builtin_e2k_qpfsubs(sub02_3, sub13i_3); __v2di out8 = __builtin_e2k_qpfsubs(add02_0, add13_0); __v2di out9 = __builtin_e2k_qpfsubs(add02_1, add13_1); __v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2); __v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3); __v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0); __v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1); __v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2); __v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3); xy0 = out0; zw0 = out1; xy1 = out2; zw1 = out3; c0 = c0b_in[i]; d0 = d0b_in[i]; e0 = e0b_in[i]; xy2 = out4; zw2 = out5; xy3 = out6; zw3 = out7; c1 = c1b_in[i]; d1 = d1b_in[i]; e1 = e1b_in[i]; xy4 = out8; zw4 = out9; xy5 = out10; zw5 = out11; c2 = c2b_in[i]; d2 = d2b_in[i]; e2 = e2b_in[i]; xy6 = out12; zw6 = out13; xy7 = out14; zw7 = out15; c3 = c3b_in[i]; d3 = d3b_in[i]; e3 = e3b_in[i]; x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100}); y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100}); w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100}); y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); z1 = __builtin_e2k_qpshufb(zw3, zw2, 
(__v2di){0x0706050403020100, 0x0706050403020100}); w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100}); y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100}); w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100}); y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100}); w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63}); conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63}); conj_c2 = __builtin_e2k_qpxor(c2, (__v2di){1LL<<63, 1LL<<63}); conj_c3 = __builtin_e2k_qpxor(c3, (__v2di){1LL<<63, 1LL<<63}); conj_d0 = __builtin_e2k_qpxor(d0, (__v2di){1LL<<63, 1LL<<63}); conj_d1 = __builtin_e2k_qpxor(d1, (__v2di){1LL<<63, 1LL<<63}); conj_d2 = __builtin_e2k_qpxor(d2, (__v2di){1LL<<63, 1LL<<63}); conj_d3 = __builtin_e2k_qpxor(d3, (__v2di){1LL<<63, 1LL<<63}); conj_e0 = __builtin_e2k_qpxor(e0, (__v2di){1LL<<63, 1LL<<63}); conj_e1 = __builtin_e2k_qpxor(e1, (__v2di){1LL<<63, 1LL<<63}); conj_e2 = __builtin_e2k_qpxor(e2, (__v2di){1LL<<63, 1LL<<63}); conj_e3 = __builtin_e2k_qpxor(e3, (__v2di){1LL<<63, 1LL<<63}); swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_d0 = __builtin_e2k_qpshufb(d0, 
d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0); cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1); cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2); cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3); dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0); dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1); dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2); dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3); ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0); ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1); ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2); ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3); cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0); cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1); cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2); cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3); dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0); dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1); dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2); dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3); ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0); ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1); ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2); ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3); cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag); cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag); cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag); cy3_rrii = 
__builtin_e2k_qpfhadds(cy3_real, cy3_imag); dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag); dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag); dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag); dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag); ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag); ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag); ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag); ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag); cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); add02_0 = __builtin_e2k_qpfadds( x0, dz0); add02_1 = __builtin_e2k_qpfadds( x1, dz1); add02_2 = __builtin_e2k_qpfadds( x2, dz2); add02_3 = __builtin_e2k_qpfadds( x3, dz3); sub02_0 = __builtin_e2k_qpfsubs( x0, dz0); sub02_1 = __builtin_e2k_qpfsubs( x1, dz1); sub02_2 = __builtin_e2k_qpfsubs( x2, dz2); sub02_3 = 
__builtin_e2k_qpfsubs( x3, dz3); add13_0 = __builtin_e2k_qpfadds(cy0, ew0); add13_1 = __builtin_e2k_qpfadds(cy1, ew1); add13_2 = __builtin_e2k_qpfadds(cy2, ew2); add13_3 = __builtin_e2k_qpfadds(cy3, ew3); sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0); sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1); sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2); sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3); swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31}); sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31}); sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31}); sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31}); out_0[i] = __builtin_e2k_qpfadds(add02_0, add13_0); out_1[i] = __builtin_e2k_qpfadds(add02_1, add13_1); out_2[i] = __builtin_e2k_qpfadds(add02_2, add13_2); out_3[i] = __builtin_e2k_qpfadds(add02_3, add13_3); out_4[i] = __builtin_e2k_qpfsubs(sub02_0, sub13i_0); out_5[i] = __builtin_e2k_qpfsubs(sub02_1, sub13i_1); out_6[i] = __builtin_e2k_qpfsubs(sub02_2, sub13i_2); out_7[i] = __builtin_e2k_qpfsubs(sub02_3, sub13i_3); out_8[i] = __builtin_e2k_qpfsubs(add02_0, add13_0); out_9[i] = __builtin_e2k_qpfsubs(add02_1, add13_1); out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2); out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3); out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0); out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1); out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2); out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3); } }
Main loop in assembly
.L7295: { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=64 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=96 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=128 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=160 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=6, disp=192 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=6, disp=224 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=4, asz=1, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=4, asz=1, abs=8, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=3, asz=1, abs=10, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=3, asz=1, abs=10, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=1, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=1, abs=12, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=1, incr=0, ind=0, asz=1, abs=14, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=15, asz=1, abs=14, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=14, asz=1, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=13, asz=1, abs=16, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=12, asz=1, abs=18, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=11, asz=1, abs=18, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=10, asz=2, abs=20, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=9, asz=2, abs=20, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=8, asz=2, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=2, abs=24, disp=0 } { fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=2, abs=28, disp=0 fapb dpl=0, dcd=0, fmt=5, 
mrng=16, d=0, incr=0, ind=5, asz=2, abs=28, disp=0 } .L3969: { loop_mode disp %ctpr1, .L3969 movaqp,0 area=0, ind=0, am=1, be=0, %g17 movaqp,1 area=0, ind=16, am=0, be=0, %g16 movaqp,2 area=0, ind=0, am=1, be=0, %g19 movaqp,3 area=0, ind=16, am=0, be=0, %g18 } { loop_mode movaqp,0 area=1, ind=0, am=1, be=0, %g21 movaqp,1 area=1, ind=16, am=0, be=0, %g20 movaqp,2 area=1, ind=0, am=1, be=0, %g23 movaqp,3 area=1, ind=16, am=0, be=0, %g22 } { loop_mode movaqp,0 area=2, ind=0, am=1, be=0, %g25 movaqp,1 area=2, ind=16, am=0, be=0, %g24 movaqp,2 area=2, ind=0, am=1, be=0, %g27 movaqp,3 area=2, ind=16, am=0, be=0, %g26 } { loop_mode movaqp,0 area=3, ind=0, am=1, be=0, %g29 movaqp,1 area=3, ind=16, am=0, be=0, %g28 movaqp,2 area=3, ind=0, am=1, be=0, %g31 movaqp,3 area=3, ind=16, am=0, be=0, %g30 } { loop_mode movaqp,0 area=4, ind=0, am=1, be=0, %r26 movaqp,1 area=4, ind=16, am=0, be=0, %r9 movaqp,2 area=4, ind=0, am=1, be=0, %r28 movaqp,3 area=4, ind=16, am=0, be=0, %r27 } { loop_mode qpshufb,0 %g19, %g17, %r1, %r33 qpshufb,1 %g18, %g16, %r1, %r34 qpshufb,3 %g18, %g16, %r7, %g16 qpshufb,4 %g19, %g17, %r7, %g17 movaqp,0 area=5, ind=0, am=1, be=0, %r30 movaqp,1 area=5, ind=16, am=0, be=0, %r29 movaqp,2 area=5, ind=0, am=1, be=0, %r32 movaqp,3 area=5, ind=16, am=0, be=0, %r31 } { loop_mode qpshufb,0 %g23, %g21, %r1, %r37 qpshufb,1 %g22, %g20, %r1, %r38 qpshufb,3 %g22, %g20, %r7, %g20 qpshufb,4 %g23, %g21, %r7, %g21 movaqp,0 area=6, ind=0, am=1, be=0, %g19 movaqp,1 area=6, ind=16, am=0, be=0, %g18 movaqp,2 area=6, ind=0, am=1, be=0, %r36 movaqp,3 area=6, ind=16, am=0, be=0, %r35 } { loop_mode qpshufb,0 %g27, %g25, %r1, %g22 qpshufb,1 %g26, %g24, %r1, %g23 qpshufb,3 %g26, %g24, %r7, %g24 qpshufb,4 %g27, %g25, %r7, %g25 movaqp,0 area=8, ind=0, am=1, be=0, %r39 movaqp,1 area=7, ind=0, am=1, be=0, %g26 movaqp,2 area=8, ind=0, am=1, be=0, %r40 movaqp,3 area=7, ind=0, am=1, be=0, %g27 } { loop_mode qpshufb,0 %g31, %g29, %r1, %r41 qpshufb,1 %g30, %g28, %r1, %r42 qpshufb,3 %g30, %g28, 
%r7, %g28 qpshufb,4 %g31, %g29, %r7, %g29 movaqp,0 area=10, ind=0, am=1, be=0, %r43 movaqp,1 area=9, ind=0, am=1, be=0, %g30 movaqp,2 area=10, ind=0, am=1, be=0, %r44 movaqp,3 area=9, ind=0, am=1, be=0, %g31 } { loop_mode qpxor,0 %r26, %r5, %r45 qpxor,1 %r9, %r5, %r46 qpshufb,3 %r26, %r26, %r6, %r26 qpshufb,4 %r9, %r9, %r6, %r9 movaqp,0 area=12, ind=0, am=1, be=0, %r49 movaqp,1 area=11, ind=0, am=1, be=0, %r47 movaqp,2 area=12, ind=0, am=1, be=0, %r50 movaqp,3 area=11, ind=0, am=1, be=0, %r48 } { loop_mode qpxor,0 %r28, %r5, %r51 qpxor,1 %r27, %r5, %r52 qpfmuls,2 %r45, %r33, %r33 qpshufb,3 %r28, %r28, %r6, %r28 qpshufb,4 %r27, %r27, %r6, %r27 qpfmuls,5 %r26, %r33, %r26 } { loop_mode qpxor,0 %g19, %r5, %r45 qpxor,1 %g18, %r5, %r53 qpfmuls,2 %r46, %r37, %r37 qpshufb,3 %g19, %g19, %r6, %g19 qpshufb,4 %g18, %g18, %r6, %g18 qpfmuls,5 %r9, %r37, %r9 } { loop_mode qpxor,0 %r36, %r5, %r46 qpxor,1 %r35, %r5, %r54 qpfmuls,2 %r51, %g22, %g22 qpshufb,3 %r36, %r36, %r6, %r36 qpshufb,4 %r35, %r35, %r6, %r35 qpfmuls,5 %r28, %g22, %r28 } { loop_mode qpfmuls,0 %r52, %r41, %r41 qpfmuls,1 %r45, %r34, %r45 qpfmuls,2 %r53, %r38, %r51 qpfmuls,3 %r27, %r41, %r27 qpfmuls,4 %g19, %r34, %g19 qpfmuls,5 %g18, %r38, %g18 } { loop_mode qpfmuls,0 %r54, %r42, %r38 qpxor,1 %r29, %r5, %r36 qpfmuls,2 %r46, %g23, %r34 qpfmuls,3 %r35, %r42, %r35 qpshufb,4 %r29, %r29, %r6, %r29 qpfmuls,5 %r36, %g23, %g23 } { loop_mode qpfmuls,2 %r36, %g20, %r36 qpfmuls,5 %r29, %g20, %g20 } { loop_mode qpxor,1 %r30, %r5, %r29 qpfhadds,2 %r33, %r26, %r26 qpxor,4 %r31, %r5, %r42 } { loop_mode qpshufb,0 %r30, %r30, %r6, %r30 qpshufb,1 %r31, %r31, %r6, %r31 qpfmuls,2 %r29, %g16, %r29 qpxor,3 %r32, %r5, %r33 qpshufb,4 %r32, %r32, %r6, %r32 qpfmuls,5 %r42, %g28, %r42 } { loop_mode qpfmuls,0 %r30, %g16, %g16 qpfhadds,1 %r37, %r9, %r9 qpfmuls,2 %r31, %g28, %g28 qpfmuls,3 %r32, %g24, %g24 qpfhadds,4 %g22, %r28, %g22 qpfmuls,5 %r33, %g24, %r30 } { loop_mode qpfhadds,0 %r45, %g19, %g19 qpfhadds,1 %r51, %g18, %g18 qpfhadds,2 %r41, 
%r27, %r27 qpshufb,3 %g26, %g26, %r6, %r28 qpxor,4 %g26, %r5, %g26 } { loop_mode qpfhadds,0 %r38, %r35, %r31 qpfhadds,2 %r34, %g23, %g23 } { loop_mode qpfhadds,2 %r36, %g20, %g20 qpshufb,3 %r39, %r39, %r6, %r32 qpxor,4 %r39, %r5, %r33 } { loop_mode qpshufb,1 %r26, %r26, %r3, %r26 qpfhadds,2 %r29, %g16, %g16 qpshufb,3 %g22, %g22, %r3, %g22 qpshufb,4 %r40, %r40, %r6, %r29 qpfhadds,5 %r30, %g24, %g24 } { loop_mode qpshufb,0 %r9, %r9, %r3, %r9 qpshufb,1 %g19, %g19, %r3, %g19 qpfhadds,2 %r42, %g28, %g28 qpxor,3 %r40, %r5, %r30 qpshufb,4 %g31, %g31, %r6, %r34 } { loop_mode qpshufb,0 %r27, %r27, %r3, %r27 qpshufb,1 %g18, %g18, %r3, %g18 qpfsubs,2 %r26, %g19, %r35 qpxor,3 %g31, %r5, %g31 qpxor,4 %r43, %r5, %r36 } { loop_mode qpshufb,0 %r31, %r31, %r3, %r31 qpshufb,1 %g23, %g23, %r3, %g23 qpfsubs,2 %r9, %g18, %r37 qpshufb,3 %r43, %r43, %r6, %r38 qpxor,4 %r48, %r5, %r39 } { loop_mode qpfsubs,0 %g22, %g23, %r41 qpshufb,1 %g16, %g16, %r3, %g16 qpfsubs,2 %r27, %r31, %r40 qpshufb,3 %g24, %g24, %r3, %g24 qpxor,4 %r47, %r5, %r42 } { loop_mode qpshufb,0 %g28, %g28, %r3, %g28 qpshufb,1 %g20, %g20, %r3, %g20 qpfadds,2 %r26, %g19, %g19 qpfsubs,3 %g25, %g24, %g24 qpshufb,4 %r48, %r48, %r6, %g25 qpfadds,5 %g25, %g24, %r26 } { loop_mode qpfadds,0 %r9, %g18, %g18 qpfadds,1 %r27, %r31, %r9 qpfadds,2 %g22, %g23, %g22 qpshufb,3 %r47, %r47, %r6, %g23 qpxor,4 %r50, %r5, %r27 } { loop_mode qpfadds,0 %g17, %g16, %r31 qpfadds,1 %g29, %g28, %g17 qpfsubs,2 %g17, %g16, %g16 qpshufb,4 %r50, %r50, %r6, %r43 } { loop_mode qpfadds,0 %g21, %g20, %g29 qpfsubs,1 %g21, %g20, %g20 qpfsubs,2 %g29, %g28, %g28 qpshufb,3 %r35, %r35, %r6, %g21 qpxor,4 %g27, %r5, %r35 } { loop_mode qpshufb,3 %r37, %r37, %r6, %r37 qpxor,4 %g21, %r4, %g21 } { loop_mode qpshufb,3 %r40, %r40, %r6, %r40 qpshufb,4 %r41, %r41, %r6, %r41 } { loop_mode qpfsubs,0 %r31, %g19, %r45 qpfadds,1 %r31, %g19, %g19 qpfadds,2 %g17, %r9, %r31 qpxor,3 %r37, %r4, %r37 qpxor,4 %r41, %r4, %r41 } { loop_mode qpfsubs,0 %g17, %r9, %g17 qpfadds,1 %g29, %g18, 
%r9 qpfsubs,2 %g29, %g18, %g18 qpxor,3 %r40, %r4, %r40 qpfadds,4 %r26, %g22, %g29 qpfsubs,5 %r26, %g22, %g22 } { loop_mode qpfsubs,2 %g16, %g21, %r26 qpfsubs,3 %g24, %r41, %g21 qpfadds,4 %g24, %r41, %g24 qpfadds,5 %g16, %g21, %g16 } { loop_mode qpfsubs,3 %g20, %r37, %r46 qpfadds,4 %g28, %r40, %g28 qpfsubs,5 %g28, %r40, %r41 } { loop_mode qpshufb,0 %g27, %g27, %r6, %g27 qpshufb,1 %g30, %g30, %r6, %r37 qpfadds,2 %g20, %r37, %g20 } { loop_mode qpshufb,0 %r31, %r9, %r1, %r40 qpshufb,1 %g17, %g18, %r1, %r47 } { loop_mode qpfmuls,0 %r33, %r40, %r33 qpfmuls,1 %r42, %r47, %r40 qpfmuls,2 %r32, %r40, %r32 qpshufb,3 %g29, %g19, %r1, %r48 qpshufb,4 %g22, %r45, %r1, %r50 } { loop_mode qpxor,0 %g30, %r5, %g30 qpshufb,1 %r44, %r44, %r6, %r47 qpfmuls,2 %g23, %r47, %g23 qpshufb,3 %r41, %r46, %r1, %r42 qpshufb,4 %g24, %g16, %r1, %r51 qpfmuls,5 %r38, %r50, %r38 } { loop_mode qpshufb,3 %g21, %r26, %r1, %r52 qpfmuls,4 %g31, %r42, %g31 qpfmuls,5 %r34, %r42, %r34 } { loop_mode qpshufb,0 %g28, %g20, %r1, %r42 qpshufb,1 %g17, %g18, %r7, %g17 qpfmuls,3 %g26, %r48, %g26 qpfmuls,4 %r36, %r50, %r36 qpfmuls,5 %r28, %r48, %r28 } { loop_mode qpfmuls,0 %r43, %r42, %r39 qpxor,1 %r44, %r5, %r42 qpfmuls,2 %r27, %r42, %r27 qpfmuls,3 %r39, %r51, %g18 qpfmuls,4 %g25, %r51, %g25 qpfmuls,5 %r29, %r52, %r29 } { loop_mode qpxor,0 %r49, %r5, %r43 qpshufb,1 %r49, %r49, %r6, %r44 qpfmuls,2 %r42, %g17, %r42 qpfmuls,5 %r30, %r52, %r30 } { loop_mode qpfhadds,0 %r33, %r32, %r31 qpshufb,1 %r31, %r9, %r7, %r9 qpfmuls,2 %r47, %g17, %g17 qpfhadds,5 %g31, %r34, %g31 } { loop_mode qpshufb,0 %g28, %g20, %r7, %g20 qpshufb,1 %r41, %r46, %r7, %g28 qpfmuls,2 %r35, %r9, %r32 qpfhadds,3 %r36, %r38, %r28 qpfhadds,4 %r40, %g23, %g23 qpfhadds,5 %g26, %r28, %g26 } { loop_mode qpfmuls,0 %r43, %g20, %r33 qpfmuls,1 %r44, %g20, %g20 qpfmuls,2 %g27, %r9, %g27 qpshufb,3 %g29, %g19, %r7, %g19 qpshufb,4 %g22, %r45, %r7, %g22 qpfhadds,5 %g18, %g25, %g18 } { loop_mode qpfmuls,0 %g30, %g28, %g28 qpfhadds,1 %r27, %r39, %g29 qpfmuls,2 %r37, 
%g28, %g25 qpfhadds,5 %r30, %r29, %g30 } { loop_mode qpfhadds,2 %r42, %g17, %g17 qpshufb,3 %g31, %g31, %r3, %g31 qpshufb,4 %g24, %g16, %r7, %g16 } { loop_mode qpshufb,3 %g26, %g26, %r3, %g24 qpshufb,4 %r28, %r28, %r3, %g26 } { loop_mode qpfhadds,0 %r32, %g27, %g27 qpshufb,1 %r31, %r31, %r3, %r9 qpfhadds,2 %r33, %g20, %g20 qpshufb,3 %g23, %g23, %r3, %g23 qpshufb,4 %g18, %g18, %r3, %g18 } { loop_mode qpshufb,0 %g29, %g29, %r3, %g28 qpshufb,1 %g21, %r26, %r7, %g21 qpfhadds,2 %g28, %g25, %g25 qpshufb,3 %g30, %g30, %r3, %g29 qpfadds,4 %g26, %g23, %g23 qpfsubs,5 %g26, %g23, %g30 } { loop_mode qpshufb,1 %g17, %g17, %r3, %g17 qpfadds,3 %g29, %g31, %g29 qpfsubs,5 %g29, %g31, %g26 } { loop_mode qpfsubs,0 %g24, %r9, %g31 qpfadds,1 %g24, %r9, %g24 qpfadds,2 %g22, %g17, %r9 } { loop_mode qpshufb,0 %g20, %g20, %r3, %g20 qpshufb,1 %g27, %g27, %r3, %g27 qpfsubs,2 %g18, %g28, %r26 } { loop_mode qpfadds,0 %g18, %g28, %g18 qpfsubs,1 %g16, %g20, %g16 qpfadds,2 %g16, %g20, %g28 qpshufb,3 %g30, %g30, %r6, %g20 } { loop_mode qpshufb,0 %g25, %g25, %r3, %g25 qpfadds,1 %g19, %g27, %g19 qpfsubs,2 %g19, %g27, %g30 qpshufb,3 %g26, %g26, %r6, %g22 qpxor,4 %g20, %r4, %g20 qpfsubs,5 %g22, %g17, %g17 } { loop_mode qpfsubs,0 %g21, %g25, %g21 qpfadds,1 %r9, %g23, %g25 qpfadds,2 %g21, %g25, %g26 qpxor,3 %g22, %r4, %g22 } { loop_mode qpshufb,0 %g31, %g31, %r6, %g27 qpfsubs,1 %r9, %g23, %g23 } { loop_mode qpfadds,0 %g28, %g18, %g31 qpfsubs,1 %g28, %g18, %g18 } { loop_mode qpshufb,0 %r26, %r26, %r6, %g28 qpfadds,1 %g19, %g24, %r9 qpfsubs,2 %g19, %g24, %g19 qpfsubs,3 %g17, %g20, %g24 qpfadds,4 %g17, %g20, %g17 } { loop_mode qpfadds,0 %g26, %g29, %g20 qpfsubs,1 %g26, %g29, %g26 qpfadds,2 %g21, %g22, %g29 } { loop_mode qpxor,0 %g27, %r4, %g27 qpfsubs,1 %g21, %g22, %g21 stqp,2 %r18, %r0, %g23 } { loop_mode qpfsubs,0 %g30, %g27, %g22 qpfadds,1 %g30, %g27, %g23 stqp,2 %r12, %r0, %g18 stqp,5 %r25, %r0, %g25 } { loop_mode qpxor,0 %g28, %r4, %g18 stqp,2 %r16, %r0, %g31 stqp,5 %r14, %r0, %g17 } { loop_mode 
qpfsubs,0 %g16, %g18, %g17 qpfadds,1 %g16, %g18, %g16 stqp,2 %r23, %r0, %g19 stqp,5 %r15, %r0, %g24 } { loop_mode stqp,2 %r2, %r0, %r9 } { loop_mode stqp,2 %r24, %r0, %g22 stqp,5 %r22, %r0, %g20 } { loop_mode stqp,2 %r19, %r0, %g26 stqp,5 %r21, %r0, %g21 } { loop_mode stqp,2 %r17, %r0, %g23 stqp,5 %r11, %r0, %g29 } { loop_mode stqp,2 %r20, %r0, %g17 } { loop_mode ct %ctpr1 ? %NOT_LOOP_END alc alcf=1, alct=1 addd,0,sm 0x10, %r0, %r0 stqp,2 %r13, %r0, %g16 }
Theoretical throughput: 32 complex numbers in 73 cycles (32 × 8 bytes / 73) = 3.51 bytes/cycle
Quadrupled theoretical throughput: 4 × 3.51 = 14.03 bytes/cycle
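The two figures above can be reproduced with a few lines of C. This is only a sketch of the arithmetic: the helper names are hypothetical, a complex sample is assumed to occupy 8 bytes (two floats), and the factor of 4 is assumed to reflect that one pass of the 2x radix-4 function does the work of four radix-2 stages.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of input consumed per clock cycle by one loop iteration.
   Assumption: one complex sample is two 32-bit floats, i.e. 8 bytes. */
static double bytes_per_cycle(size_t complex_per_iter, size_t cycles_per_iter)
{
    assert(cycles_per_iter != 0);
    const size_t bytes_per_complex = 2 * sizeof(float);
    return (double)(complex_per_iter * bytes_per_complex)
         / (double)cycles_per_iter;
}

/* Throughput expressed in radix-2 stage equivalents.
   Assumption: "stages" is how many radix-2 stages a single pass replaces. */
static double radix2_equivalent(double throughput, unsigned stages)
{
    return throughput * (double)stages;
}
```

Calling bytes_per_cycle(32, 73) gives roughly 3.51, and radix2_equivalent of that value with stages = 4 gives roughly 14.03, matching the numbers quoted in the text.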
Speed measurements

4. stage_radix4_2x_simd128_unroll2
Here the loop is unrolled by a factor of 2 with #pragma unroll(2).
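To make the effect of a 2x unroll concrete, here is a minimal sketch on a toy loop. It is not the compiler's actual output; the function names are hypothetical, and the unrolled variant assumes an even trip count (a compiler would normally emit a remainder loop for the odd case).

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: one element per loop trip. */
static void scale_rolled(float *dst, const float *src, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}

/* Hand-written equivalent of an unroll factor of 2:
   two copies of the loop body per trip, half as many trips.
   Assumption: n is even. */
static void scale_unrolled2(float *dst, const float *src, size_t n, float k)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        dst[i]     = src[i]     * k;
        dst[i + 1] = src[i + 1] * k;
    }
}
```

Both functions produce the same result. The unrolled body simply gives the scheduler two independent iterations to interleave within one wide-instruction sequence, which is the point of applying the pragma to the stage loop.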
C code
void stage_radix4_2x_simd128_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b) { __v2di *xy0_in = (__v2di*)&data_in[ 0]; __v2di *zw0_in = (__v2di*)&data_in[ 2]; __v2di *xy1_in = (__v2di*)&data_in[ 4]; __v2di *zw1_in = (__v2di*)&data_in[ 6]; __v2di *xy2_in = (__v2di*)&data_in[ 8]; __v2di *zw2_in = (__v2di*)&data_in[10]; __v2di *xy3_in = (__v2di*)&data_in[12]; __v2di *zw3_in = (__v2di*)&data_in[14]; __v2di *xy4_in = (__v2di*)&data_in[16]; __v2di *zw4_in = (__v2di*)&data_in[18]; __v2di *xy5_in = (__v2di*)&data_in[20]; __v2di *zw5_in = (__v2di*)&data_in[22]; __v2di *xy6_in = (__v2di*)&data_in[24]; __v2di *zw6_in = (__v2di*)&data_in[26]; __v2di *xy7_in = (__v2di*)&data_in[28]; __v2di *zw7_in = (__v2di*)&data_in[30]; __v2di *c0a_in = (__v2di*)&coefC_a[0]; __v2di *c1a_in = (__v2di*)&coefC_a[2]; __v2di *c2a_in = (__v2di*)&coefC_a[4]; __v2di *c3a_in = (__v2di*)&coefC_a[6]; __v2di *d0a_in = (__v2di*)&coefD_a[0]; __v2di *d1a_in = (__v2di*)&coefD_a[2]; __v2di *d2a_in = (__v2di*)&coefD_a[4]; __v2di *d3a_in = (__v2di*)&coefD_a[6]; __v2di *e0a_in = (__v2di*)&coefE_a[0]; __v2di *e1a_in = (__v2di*)&coefE_a[2]; __v2di *e2a_in = (__v2di*)&coefE_a[4]; __v2di *e3a_in = (__v2di*)&coefE_a[6]; __v2di *c0b_in = (__v2di*)&coefC_b[0*data_count/16]; __v2di *c1b_in = (__v2di*)&coefC_b[1*data_count/16]; __v2di *c2b_in = (__v2di*)&coefC_b[2*data_count/16]; __v2di *c3b_in = (__v2di*)&coefC_b[3*data_count/16]; __v2di *d0b_in = (__v2di*)&coefD_b[0*data_count/16]; __v2di *d1b_in = (__v2di*)&coefD_b[1*data_count/16]; __v2di *d2b_in = (__v2di*)&coefD_b[2*data_count/16]; __v2di *d3b_in = (__v2di*)&coefD_b[3*data_count/16]; __v2di *e0b_in = (__v2di*)&coefE_b[0*data_count/16]; __v2di *e1b_in = (__v2di*)&coefE_b[1*data_count/16]; __v2di *e2b_in = (__v2di*)&coefE_b[2*data_count/16]; __v2di *e3b_in = (__v2di*)&coefE_b[3*data_count/16]; __v2di *out_0 = (__v2di*)&data_out[ 
0*data_count/16]; __v2di *out_1 = (__v2di*)&data_out[ 1*data_count/16]; __v2di *out_2 = (__v2di*)&data_out[ 2*data_count/16]; __v2di *out_3 = (__v2di*)&data_out[ 3*data_count/16]; __v2di *out_4 = (__v2di*)&data_out[ 4*data_count/16]; __v2di *out_5 = (__v2di*)&data_out[ 5*data_count/16]; __v2di *out_6 = (__v2di*)&data_out[ 6*data_count/16]; __v2di *out_7 = (__v2di*)&data_out[ 7*data_count/16]; __v2di *out_8 = (__v2di*)&data_out[ 8*data_count/16]; __v2di *out_9 = (__v2di*)&data_out[ 9*data_count/16]; __v2di *out_10 = (__v2di*)&data_out[10*data_count/16]; __v2di *out_11 = (__v2di*)&data_out[11*data_count/16]; __v2di *out_12 = (__v2di*)&data_out[12*data_count/16]; __v2di *out_13 = (__v2di*)&data_out[13*data_count/16]; __v2di *out_14 = (__v2di*)&data_out[14*data_count/16]; __v2di *out_15 = (__v2di*)&data_out[15*data_count/16]; #pragma ivdep #pragma unroll(2) #pragma prefetch for(int64_t i = 0; i < data_count/32; ++i) { __v2di xy0 = xy0_in[16*i]; __v2di zw0 = zw0_in[16*i]; __v2di xy1 = xy1_in[16*i]; __v2di zw1 = zw1_in[16*i]; __v2di c0 = c0a_in[4*i]; __v2di d0 = d0a_in[4*i]; __v2di e0 = e0a_in[4*i]; __v2di xy2 = xy2_in[16*i]; __v2di zw2 = zw2_in[16*i]; __v2di xy3 = xy3_in[16*i]; __v2di zw3 = zw3_in[16*i]; __v2di c1 = c1a_in[4*i]; __v2di d1 = d1a_in[4*i]; __v2di e1 = e1a_in[4*i]; __v2di xy4 = xy4_in[16*i]; __v2di zw4 = zw4_in[16*i]; __v2di xy5 = xy5_in[16*i]; __v2di zw5 = zw5_in[16*i]; __v2di c2 = c2a_in[4*i]; __v2di d2 = d2a_in[4*i]; __v2di e2 = e2a_in[4*i]; __v2di xy6 = xy6_in[16*i]; __v2di zw6 = zw6_in[16*i]; __v2di xy7 = xy7_in[16*i]; __v2di zw7 = zw7_in[16*i]; __v2di c3 = c3a_in[4*i]; __v2di d3 = d3a_in[4*i]; __v2di e3 = e3a_in[4*i]; __v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w0 = __builtin_e2k_qpshufb(zw1, zw0, 
(__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_c2 = __builtin_e2k_qpxor(c2, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_c3 = __builtin_e2k_qpxor(c3, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_d0 = __builtin_e2k_qpxor(d0, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_d1 = __builtin_e2k_qpxor(d1, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_d2 = __builtin_e2k_qpxor(d2, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_d3 = __builtin_e2k_qpxor(d3, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_e0 = __builtin_e2k_qpxor(e0, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_e1 = __builtin_e2k_qpxor(e1, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_e2 = __builtin_e2k_qpxor(e2, (__v2di){1LL<<63, 1LL<<63}); __v2di conj_e3 = 
__builtin_e2k_qpxor(e3, (__v2di){1LL<<63, 1LL<<63}); __v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0); __v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1); __v2di cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2); __v2di cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3); __v2di dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0); __v2di dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1); __v2di dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2); __v2di dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3); __v2di ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0); __v2di ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1); __v2di ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2); __v2di ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3); __v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0); __v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1); __v2di 
cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2); __v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3); __v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0); __v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1); __v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2); __v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3); __v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0); __v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1); __v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2); __v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3); __v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag); __v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag); __v2di cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag); __v2di cy3_rrii = __builtin_e2k_qpfhadds(cy3_real, cy3_imag); __v2di dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag); __v2di dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag); __v2di dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag); __v2di dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag); __v2di ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag); __v2di ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag); __v2di ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag); __v2di ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag); __v2di cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 
0x0F0E0D0C07060504}); __v2di dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); __v2di add02_0 = __builtin_e2k_qpfadds( x0, dz0); __v2di add02_1 = __builtin_e2k_qpfadds( x1, dz1); __v2di add02_2 = __builtin_e2k_qpfadds( x2, dz2); __v2di add02_3 = __builtin_e2k_qpfadds( x3, dz3); __v2di sub02_0 = __builtin_e2k_qpfsubs( x0, dz0); __v2di sub02_1 = __builtin_e2k_qpfsubs( x1, dz1); __v2di sub02_2 = __builtin_e2k_qpfsubs( x2, dz2); __v2di sub02_3 = __builtin_e2k_qpfsubs( x3, dz3); __v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0); __v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1); __v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2); __v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3); __v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0); __v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1); __v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2); __v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3); __v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31}); __v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31}); __v2di sub13i_2 = 
__builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31}); __v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31}); __v2di out0 = __builtin_e2k_qpfadds(add02_0, add13_0); __v2di out1 = __builtin_e2k_qpfadds(add02_1, add13_1); __v2di out2 = __builtin_e2k_qpfadds(add02_2, add13_2); __v2di out3 = __builtin_e2k_qpfadds(add02_3, add13_3); __v2di out4 = __builtin_e2k_qpfsubs(sub02_0, sub13i_0); __v2di out5 = __builtin_e2k_qpfsubs(sub02_1, sub13i_1); __v2di out6 = __builtin_e2k_qpfsubs(sub02_2, sub13i_2); __v2di out7 = __builtin_e2k_qpfsubs(sub02_3, sub13i_3); __v2di out8 = __builtin_e2k_qpfsubs(add02_0, add13_0); __v2di out9 = __builtin_e2k_qpfsubs(add02_1, add13_1); __v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2); __v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3); __v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0); __v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1); __v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2); __v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3); xy0 = out0; zw0 = out1; xy1 = out2; zw1 = out3; c0 = c0b_in[i]; d0 = d0b_in[i]; e0 = e0b_in[i]; xy2 = out4; zw2 = out5; xy3 = out6; zw3 = out7; c1 = c1b_in[i]; d1 = d1b_in[i]; e1 = e1b_in[i]; xy4 = out8; zw4 = out9; xy5 = out10; zw5 = out11; c2 = c2b_in[i]; d2 = d2b_in[i]; e2 = e2b_in[i]; xy6 = out12; zw6 = out13; xy7 = out14; zw7 = out15; c3 = c3b_in[i]; d3 = d3b_in[i]; e3 = e3b_in[i]; x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100}); y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100}); w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100}); y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); z1 = __builtin_e2k_qpshufb(zw3, zw2, 
(__v2di){0x0706050403020100, 0x0706050403020100}); w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100}); y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100}); w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100}); y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100}); w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63}); conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63}); conj_c2 = __builtin_e2k_qpxor(c2, (__v2di){1LL<<63, 1LL<<63}); conj_c3 = __builtin_e2k_qpxor(c3, (__v2di){1LL<<63, 1LL<<63}); conj_d0 = __builtin_e2k_qpxor(d0, (__v2di){1LL<<63, 1LL<<63}); conj_d1 = __builtin_e2k_qpxor(d1, (__v2di){1LL<<63, 1LL<<63}); conj_d2 = __builtin_e2k_qpxor(d2, (__v2di){1LL<<63, 1LL<<63}); conj_d3 = __builtin_e2k_qpxor(d3, (__v2di){1LL<<63, 1LL<<63}); conj_e0 = __builtin_e2k_qpxor(e0, (__v2di){1LL<<63, 1LL<<63}); conj_e1 = __builtin_e2k_qpxor(e1, (__v2di){1LL<<63, 1LL<<63}); conj_e2 = __builtin_e2k_qpxor(e2, (__v2di){1LL<<63, 1LL<<63}); conj_e3 = __builtin_e2k_qpxor(e3, (__v2di){1LL<<63, 1LL<<63}); swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_d0 = __builtin_e2k_qpshufb(d0, 
d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0); cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1); cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2); cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3); dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0); dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1); dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2); dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3); ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0); ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1); ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2); ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3); cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0); cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1); cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2); cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3); dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0); dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1); dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2); dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3); ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0); ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1); ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2); ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3); cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag); cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag); cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag); cy3_rrii = 
__builtin_e2k_qpfhadds(cy3_real, cy3_imag); dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag); dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag); dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag); dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag); ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag); ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag); ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag); ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag); cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504}); add02_0 = __builtin_e2k_qpfadds( x0, dz0); add02_1 = __builtin_e2k_qpfadds( x1, dz1); add02_2 = __builtin_e2k_qpfadds( x2, dz2); add02_3 = __builtin_e2k_qpfadds( x3, dz3); sub02_0 = __builtin_e2k_qpfsubs( x0, dz0); sub02_1 = __builtin_e2k_qpfsubs( x1, dz1); sub02_2 = __builtin_e2k_qpfsubs( x2, dz2); sub02_3 = 
__builtin_e2k_qpfsubs( x3, dz3); add13_0 = __builtin_e2k_qpfadds(cy0, ew0); add13_1 = __builtin_e2k_qpfadds(cy1, ew1); add13_2 = __builtin_e2k_qpfadds(cy2, ew2); add13_3 = __builtin_e2k_qpfadds(cy3, ew3); sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0); sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1); sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2); sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3); swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31}); sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31}); sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31}); sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31}); out_0[i] = __builtin_e2k_qpfadds(add02_0, add13_0); out_1[i] = __builtin_e2k_qpfadds(add02_1, add13_1); out_2[i] = __builtin_e2k_qpfadds(add02_2, add13_2); out_3[i] = __builtin_e2k_qpfadds(add02_3, add13_3); out_4[i] = __builtin_e2k_qpfsubs(sub02_0, sub13i_0); out_5[i] = __builtin_e2k_qpfsubs(sub02_1, sub13i_1); out_6[i] = __builtin_e2k_qpfsubs(sub02_2, sub13i_2); out_7[i] = __builtin_e2k_qpfsubs(sub02_3, sub13i_3); out_8[i] = __builtin_e2k_qpfsubs(add02_0, add13_0); out_9[i] = __builtin_e2k_qpfsubs(add02_1, add13_1); out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2); out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3); out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0); out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1); out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2); out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3); } }
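For readers tracing the intrinsics above, the arithmetic of a single butterfly can be written in scalar form. This is only a reference sketch: the function name `radix4_butterfly_ref` and its argument layout are hypothetical and not taken from the article. It models what one complex lane of the SIMD code computes, not how the data is arranged in registers.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Scalar model of one radix-4 butterfly with twiddle factors.
 * x, y, z, w are the four inputs; c, d, e are the twiddles for y, z, w.
 * Hypothetical helper for illustration only. */
void radix4_butterfly_ref(float complex x, float complex y,
                          float complex z, float complex w,
                          float complex c, float complex d,
                          float complex e, float complex out[4])
{
    assert(out != NULL);

    float complex cy = c * y;
    float complex dz = d * z;
    float complex ew = e * w;

    float complex add02 = x + dz;
    float complex sub02 = x - dz;
    float complex add13 = cy + ew;
    float complex sub13 = cy - ew;

    /* Multiply by i: swap re/im and negate the new real part.
     * This mirrors the qpshufb swap followed by the qpxor sign flip. */
    float complex sub13i = -cimagf(sub13) + I * crealf(sub13);

    out[0] = add02 + add13;   /* corresponds to out_0..out_3   */
    out[1] = sub02 - sub13i;  /* corresponds to out_4..out_7   */
    out[2] = add02 - add13;   /* corresponds to out_8..out_11  */
    out[3] = sub02 + sub13i;  /* corresponds to out_12..out_15 */
}
```

With all twiddles set to 1 the function reduces to a plain 4-point DFT of (x, y, z, w), which makes it convenient for checking a SIMD implementation on small inputs.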
Main loop in assembly
.L11610: { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=64 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=96 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=128 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=160 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=192 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=224 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=256 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=288 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=320 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=352 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=384 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=416 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=448 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=480 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=8, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=10, disp=64 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=10, disp=96 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=12, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=14, disp=64 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=14, disp=96 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, 
incr=3, ind=2, asz=1, abs=16, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=18, disp=64 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=18, disp=96 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=1, incr=2, ind=0, asz=1, abs=20, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=15, asz=1, abs=20, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=14, asz=1, abs=22, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=13, asz=1, abs=22, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=12, asz=1, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=11, asz=1, abs=24, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=10, asz=1, abs=26, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=9, asz=1, abs=26, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=8, asz=1, abs=28, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=7, asz=1, abs=28, disp=0 } { fapb ct=1, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=6, asz=1, abs=30, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=5, asz=1, abs=30, disp=0 } .L7588: { loop_mode disp %ctpr1, .L7588 movaqp,0 area=0, ind=0, am=1, be=0, %g17 movaqp,1 area=0, ind=16, am=0, be=0, %g16 movaqp,2 area=0, ind=0, am=1, be=0, %g19 movaqp,3 area=0, ind=16, am=0, be=0, %g18 } { loop_mode movaqp,0 area=1, ind=0, am=1, be=0, %g21 movaqp,1 area=1, ind=16, am=0, be=0, %g20 movaqp,2 area=1, ind=0, am=1, be=0, %g23 movaqp,3 area=1, ind=16, am=0, be=0, %g22 } { loop_mode movaqp,0 area=2, ind=0, am=1, be=0, %g25 movaqp,1 area=2, ind=16, am=0, be=0, %g24 movaqp,2 area=2, ind=0, am=1, be=0, %g27 movaqp,3 area=2, ind=16, am=0, be=0, %g26 } { loop_mode movaqp,0 area=3, ind=0, am=1, be=0, %g29 movaqp,1 area=3, ind=16, am=0, be=0, %g28 movaqp,2 area=3, ind=0, am=1, be=0, %g31 movaqp,3 area=3, ind=16, am=0, be=0, %g30 } { loop_mode movaqp,0 area=4, ind=0, am=1, be=0, %b[20] movaqp,1 area=4, ind=16, am=0, be=0, %b[19] movaqp,2 area=4, 
ind=0, am=1, be=0, %b[22] movaqp,3 area=4, ind=16, am=0, be=0, %b[21] } { loop_mode qpshufb,0 %g19, %g17, %r24, %b[27] qpshufb,1 %g18, %g16, %r24, %b[28] qpshufb,3 %g18, %g16, %r7, %g16 qpshufb,4 %g19, %g17, %r7, %g17 movaqp,0 area=5, ind=0, am=1, be=0, %b[24] movaqp,1 area=5, ind=16, am=0, be=0, %b[23] movaqp,2 area=5, ind=0, am=1, be=0, %b[26] movaqp,3 area=5, ind=16, am=0, be=0, %b[25] } { loop_mode qpshufb,0 %g23, %g21, %r24, %b[31] qpshufb,1 %g22, %g20, %r24, %b[32] qpshufb,3 %g22, %g20, %r7, %g20 qpshufb,4 %g23, %g21, %r7, %g21 movaqp,0 area=6, ind=0, am=1, be=0, %g19 movaqp,1 area=6, ind=16, am=0, be=0, %g18 movaqp,2 area=6, ind=0, am=1, be=0, %b[30] movaqp,3 area=6, ind=16, am=0, be=0, %b[29] } { loop_mode qpshufb,0 %g27, %g25, %r24, %b[35] qpshufb,1 %g26, %g24, %r24, %b[36] qpshufb,3 %g26, %g24, %r7, %g24 qpshufb,4 %g27, %g25, %r7, %g25 movaqp,0 area=7, ind=0, am=1, be=0, %g23 movaqp,1 area=7, ind=16, am=0, be=0, %g22 movaqp,2 area=7, ind=0, am=1, be=0, %b[34] movaqp,3 area=7, ind=16, am=0, be=0, %b[33] } { loop_mode qpshufb,0 %g31, %g29, %r24, %b[39] qpshufb,1 %g30, %g28, %r24, %b[40] qpshufb,3 %g30, %g28, %r7, %g28 qpshufb,4 %g31, %g29, %r7, %g29 movaqp,0 area=8, ind=0, am=1, be=0, %g27 movaqp,1 area=8, ind=16, am=0, be=0, %g26 movaqp,2 area=8, ind=0, am=1, be=0, %b[38] movaqp,3 area=8, ind=16, am=0, be=0, %b[37] } { loop_mode qpshufb,0 %b[22], %b[20], %r24, %b[43] qpshufb,1 %b[21], %b[19], %r24, %b[44] qpshufb,3 %b[21], %b[19], %r7, %b[19] qpshufb,4 %b[22], %b[20], %r7, %b[20] movaqp,0 area=9, ind=0, am=1, be=0, %g31 movaqp,1 area=9, ind=16, am=0, be=0, %g30 movaqp,2 area=9, ind=0, am=1, be=0, %b[42] movaqp,3 area=9, ind=16, am=0, be=0, %b[41] } { loop_mode qpshufb,0 %b[26], %b[24], %r24, %b[47] qpshufb,1 %b[25], %b[23], %r24, %b[48] qpshufb,3 %b[25], %b[23], %r7, %b[23] qpshufb,4 %b[26], %b[24], %r7, %b[24] movaqp,0 area=10, ind=0, am=1, be=0, %b[22] movaqp,1 area=10, ind=16, am=0, be=0, %b[21] movaqp,2 area=10, ind=0, am=1, be=0, %b[46] movaqp,3 
area=10, ind=16, am=0, be=0, %b[45] } { loop_mode qpshufb,0 %b[30], %g19, %r24, %b[51] qpshufb,1 %b[29], %g18, %r24, %b[52] qpshufb,3 %b[29], %g18, %r7, %g18 qpshufb,4 %b[30], %g19, %r7, %g19 movaqp,0 area=11, ind=0, am=1, be=0, %b[26] movaqp,1 area=11, ind=16, am=0, be=0, %b[25] movaqp,2 area=11, ind=0, am=1, be=0, %b[50] movaqp,3 area=11, ind=16, am=0, be=0, %b[49] } { loop_mode qpshufb,0 %b[34], %g23, %r24, %b[55] qpshufb,1 %b[33], %g22, %r24, %b[56] qpshufb,3 %b[33], %g22, %r7, %g22 qpshufb,4 %b[34], %g23, %r7, %g23 movaqp,0 area=12, ind=0, am=1, be=0, %b[30] movaqp,1 area=12, ind=16, am=0, be=0, %b[29] movaqp,2 area=12, ind=0, am=1, be=0, %b[54] movaqp,3 area=12, ind=16, am=0, be=0, %b[53] } { loop_mode qpxor,0 %g27, %r6, %b[59] qpxor,1 %g26, %r6, %b[60] qpxor,3 %b[38], %r6, %b[61] qpxor,4 %b[37], %r6, %b[62] movaqp,0 area=13, ind=0, am=1, be=0, %b[34] movaqp,1 area=13, ind=16, am=0, be=0, %b[33] movaqp,2 area=13, ind=0, am=1, be=0, %b[58] movaqp,3 area=13, ind=16, am=0, be=0, %b[57] } { loop_mode qpxor,0 %g31, %r6, %b[63] qpxor,1 %g30, %r6, %b[64] qpfmuls,2 %b[60], %b[31], %b[60] qpxor,3 %b[42], %r6, %b[65] qpxor,4 %b[41], %r6, %b[66] qpfmuls,5 %b[62], %b[39], %b[62] movaqp,0 area=14, ind=0, am=1, be=0, %b[68] movaqp,1 area=14, ind=16, am=0, be=0, %b[67] movaqp,2 area=14, ind=0, am=1, be=0, %b[70] movaqp,3 area=14, ind=16, am=0, be=0, %b[69] } { loop_mode qpfmuls,0 %b[63], %b[43], %b[63] qpfmuls,1 %b[64], %b[47], %b[64] qpfmuls,2 %b[59], %b[27], %b[59] qpfmuls,3 %b[65], %b[51], %b[65] qpxor,4 %b[22], %r6, %b[71] qpfmuls,5 %b[61], %b[35], %b[61] movaqp,0 area=15, ind=0, am=1, be=0, %b[73] movaqp,1 area=15, ind=16, am=0, be=0, %b[72] movaqp,2 area=15, ind=0, am=1, be=0, %b[75] movaqp,3 area=15, ind=16, am=0, be=0, %b[74] } { loop_mode qpxor,0 %b[21], %r6, %b[76] qpxor,1 %b[45], %r6, %b[77] qpxor,3 %b[46], %r6, %b[78] qpxor,4 %b[26], %r6, %b[79] qpfmuls,5 %b[66], %b[55], %b[66] movaqp,0 area=16, ind=0, am=1, be=0, %b[81] movaqp,1 area=16, ind=16, am=0, be=0, 
%b[80] movaqp,2 area=16, ind=0, am=1, be=0, %b[83] movaqp,3 area=16, ind=16, am=0, be=0, %b[82] } { loop_mode qpfmuls,0 %b[77], %g28, %b[77] qpfmuls,2 %b[76], %g20, %b[76] qpfmuls,3 %b[78], %g24, %b[78] qpxor,4 %b[30], %r6, %b[84] qpfmuls,5 %b[71], %g16, %b[71] movaqp,0 area=17, ind=0, am=1, be=0, %b[86] movaqp,1 area=17, ind=16, am=0, be=0, %b[85] movaqp,2 area=17, ind=0, am=1, be=0, %b[88] movaqp,3 area=17, ind=16, am=0, be=0, %b[87] } { loop_mode qpxor,0 %b[29], %r6, %b[89] qpxor,1 %b[53], %r6, %b[90] qpxor,3 %b[54], %r6, %b[91] qpxor,4 %b[34], %r6, %b[92] qpfmuls,5 %b[84], %b[28], %b[84] movaqp,0 area=18, ind=0, am=1, be=0, %b[94] movaqp,1 area=18, ind=16, am=0, be=0, %b[93] movaqp,2 area=18, ind=0, am=1, be=0, %b[96] movaqp,3 area=18, ind=16, am=0, be=0, %b[95] } { loop_mode qpfmuls,0 %b[90], %b[40], %b[90] qpxor,1 %b[33], %r6, %b[97] qpfmuls,2 %b[89], %b[32], %b[89] qpfmuls,3 %b[92], %b[44], %b[92] qpxor,4 %b[58], %r6, %b[98] qpfmuls,5 %b[91], %b[36], %b[91] movaqp,0 area=19, ind=0, am=1, be=0, %b[100] movaqp,1 area=19, ind=16, am=0, be=0, %b[99] movaqp,2 area=19, ind=0, am=1, be=0, %b[102] movaqp,3 area=19, ind=16, am=0, be=0, %b[101] } { loop_mode qpxor,0 %b[57], %r6, %b[103] qpxor,1 %b[25], %r6, %b[104] qpfmuls,2 %b[97], %b[48], %b[97] qpxor,3 %b[49], %r6, %b[105] qpxor,4 %b[50], %r6, %b[106] qpfmuls,5 %b[98], %b[52], %b[98] } { loop_mode qpfmuls,0 %b[79], %b[19], %b[79] qpfmuls,1 %b[104], %b[23], %b[104] qpfmuls,2 %b[103], %b[56], %b[103] qpfmuls,3 %b[106], %g18, %b[106] qpshufb,4 %g27, %g27, %r25, %g27 qpfmuls,5 %b[105], %g22, %b[105] } { loop_mode qpshufb,0 %g26, %g26, %r25, %g26 qpshufb,1 %b[37], %b[37], %r25, %b[37] qpshufb,3 %b[38], %b[38], %r25, %b[38] qpshufb,4 %g31, %g31, %r25, %g31 qpfmul_hadds,5 %g27, %b[27], %b[59], %g27 } { loop_mode qpfmul_hadds,0 %b[37], %b[39], %b[62], %b[27] qpfmul_hadds,2 %g26, %b[31], %b[60], %g26 qpfmul_hadds,3 %g31, %b[43], %b[63], %g31 qpshufb,4 %g30, %g30, %r25, %g30 qpfmul_hadds,5 %b[38], %b[35], %b[61], %b[31] } { 
loop_mode qpshufb,0 %b[41], %b[41], %r25, %b[35] qpshufb,1 %b[30], %b[30], %r25, %b[30] qpshufb,3 %b[29], %b[29], %r25, %b[29] qpshufb,4 %b[42], %b[42], %r25, %b[37] qpfmul_hadds,5 %g30, %b[47], %b[64], %g30 } { loop_mode qpshufb,0 %b[54], %b[54], %r25, %b[38] qpshufb,1 %b[53], %b[53], %r25, %b[39] qpfmul_hadds,2 %b[35], %b[55], %b[66], %b[35] qpshufb,3 %b[34], %b[34], %r25, %b[34] qpshufb,4 %b[33], %b[33], %r25, %b[33] qpfmul_hadds,5 %b[37], %b[51], %b[65], %b[37] } { loop_mode qpshufb,0 %b[58], %b[58], %r25, %b[41] qpshufb,1 %b[57], %b[57], %r25, %b[42] qpfmul_hadds,2 %b[30], %b[28], %b[84], %b[28] qpfmul_hadds,3 %b[29], %b[32], %b[89], %b[29] qpfmul_hadds,4 %b[34], %b[44], %b[92], %b[32] qpfmul_hadds,5 %b[33], %b[48], %b[97], %b[30] } { loop_mode qpfmul_hadds,0 %b[38], %b[36], %b[91], %b[34] qpfmul_hadds,1 %b[41], %b[52], %b[98], %b[36] qpfmul_hadds,2 %b[39], %b[40], %b[90], %b[33] qpshufb,3 %b[21], %b[21], %r25, %b[21] qpshufb,4 %b[22], %b[22], %r25, %b[22] } { loop_mode qpshufb,0 %b[46], %b[46], %r25, %b[39] qpshufb,1 %b[45], %b[45], %r25, %b[40] qpfmul_hadds,2 %b[42], %b[56], %b[103], %b[38] qpshufb,3 %b[26], %b[26], %r25, %b[26] qpshufb,4 %b[25], %b[25], %r25, %b[25] qpfmul_hadds,5 %b[21], %g20, %b[76], %g20 } { loop_mode qpfmul_hadds,0 %b[40], %g28, %b[77], %g28 qpshufb,1 %b[50], %b[50], %r25, %b[21] qpfmul_hadds,2 %b[39], %g24, %b[78], %g24 qpfmul_hadds,3 %b[26], %b[19], %b[79], %b[19] qpshufb,4 %b[49], %b[49], %r25, %b[41] qpfmul_hadds,5 %b[22], %g16, %b[71], %g16 } { loop_mode qpxor,0 %b[67], %r6, %b[21] qpxor,1 %b[68], %r6, %b[23] qpfmul_hadds,2 %b[21], %g18, %b[106], %g18 qpfmul_hadds,3 %b[41], %g22, %b[105], %g22 qpshufb,4 %g27, %g27, %r22, %g27 qpfmul_hadds,5 %b[25], %b[23], %b[104], %b[22] } { loop_mode qpshufb,0 %g26, %g26, %r22, %g26 qpshufb,1 %b[27], %b[27], %r22, %b[26] qpshufb,3 %b[31], %b[31], %r22, %b[25] qpshufb,4 %g31, %g31, %r22, %g31 } { loop_mode qpxor,0 %b[73], %r6, %b[27] qpxor,1 %b[72], %r6, %b[31] } { loop_mode qpshufb,3 %g30, %g30, 
%r22, %g30 qpshufb,4 %b[37], %b[37], %r22, %b[37] } { loop_mode qpshufb,0 %b[35], %b[35], %r22, %b[35] qpshufb,1 %b[28], %b[28], %r22, %b[28] qpshufb,3 %b[29], %b[29], %r22, %b[29] qpshufb,4 %b[32], %b[32], %r22, %b[32] } { loop_mode qpfsubs,0 %g27, %b[28], %b[39] qpshufb,1 %b[34], %b[34], %r22, %b[34] qpfadds,2 %g27, %b[28], %g27 qpfsubs,3 %g26, %b[29], %b[40] qpshufb,4 %b[30], %b[30], %r22, %b[30] qpfsubs,5 %g31, %b[32], %b[41] } { loop_mode qpshufb,0 %b[33], %b[33], %r22, %b[28] qpshufb,1 %b[36], %b[36], %r22, %b[33] qpfsubs,2 %b[25], %b[34], %b[36] qpfadds,3 %g26, %b[29], %g26 qpshufb,4 %g20, %g20, %r22, %g20 qpfsubs,5 %g30, %b[30], %b[42] } { loop_mode qpfsubs,0 %b[37], %b[33], %b[43] qpshufb,1 %b[38], %b[38], %r22, %b[29] qpfsubs,2 %b[26], %b[28], %b[38] qpfadds,3 %g31, %b[32], %g31 qpshufb,4 %g16, %g16, %r22, %g16 qpfadds,5 %g30, %b[30], %g30 } { loop_mode qpshufb,0 %g28, %g28, %r22, %g28 qpshufb,1 %g24, %g24, %r22, %g24 qpfsubs,2 %b[35], %b[29], %b[30] qpfadds,3 %g21, %g20, %b[32] qpshufb,4 %b[22], %b[22], %r22, %b[22] qpfsubs,5 %g21, %g20, %g20 } { loop_mode qpfadds,0 %b[26], %b[28], %b[19] qpshufb,1 %b[19], %b[19], %r22, %g21 qpfadds,2 %b[25], %b[34], %b[25] qpfadds,3 %b[24], %b[22], %b[26] qpshufb,4 %g22, %g22, %r22, %g22 qpfsubs,5 %b[24], %b[22], %b[22] } { loop_mode qpshufb,0 %g18, %g18, %r22, %g18 qpfadds,1 %b[37], %b[33], %b[24] qpfadds,2 %b[35], %b[29], %b[28] qpfadds,3 %g17, %g16, %b[29] qpfsubs,4 %g17, %g16, %g16 qpfadds,5 %g23, %g22, %g17 } { loop_mode qpfadds,0 %g29, %g28, %b[33] qpfsubs,1 %g29, %g28, %g28 qpfadds,2 %b[20], %g21, %g29 qpshufb,4 %b[39], %b[39], %r25, %g23 qpfsubs,5 %g23, %g22, %g22 } { loop_mode qpfsubs,0 %g25, %g24, %g24 qpfsubs,1 %g19, %g18, %g25 qpfsubs,2 %b[20], %g21, %g21 qpfsubs,3 %b[32], %g26, %b[34] qpfadds,4 %b[32], %g26, %g26 qpfadds,5 %g25, %g24, %b[20] } { loop_mode qpfadds,2 %g19, %g18, %g18 qpshufb,3 %b[40], %b[40], %r25, %g19 qpshufb,4 %b[36], %b[36], %r25, %b[32] qpfsubs,5 %b[26], %g30, %b[35] } { loop_mode 
qpfadds,3 %b[26], %g30, %g30 qpfadds,4 %b[29], %g27, %g27 qpfsubs,5 %b[29], %g27, %b[36] } { loop_mode qpshufb,0 %b[38], %b[38], %r25, %b[26] qpshufb,1 %b[42], %b[42], %r25, %b[29] qpfsubs,2 %g29, %g31, %b[38] qpshufb,3 %b[41], %b[41], %r25, %b[37] qpshufb,4 %b[30], %b[30], %r25, %b[30] } { loop_mode qpshufb,0 %b[43], %b[43], %r25, %b[39] qpxor,1 %g23, %r23, %g23 qpfsubs,2 %b[33], %b[19], %b[40] qpfadds,3 %b[20], %b[25], %b[41] qpfadds,4 %g17, %b[28], %b[25] qpfsubs,5 %b[20], %b[25], %b[20] } { loop_mode qpxor,0 %g19, %r23, %g19 qpxor,1 %b[32], %r23, %b[32] qpfadds,2 %g29, %g31, %g29 qpxor,3 %b[37], %r23, %b[37] qpxor,4 %b[30], %r23, %b[30] qpfadds,5 %b[33], %b[19], %g31 } { loop_mode qpxor,0 %b[29], %r23, %b[19] qpxor,1 %b[26], %r23, %b[26] qpfadds,2 %g20, %g19, %b[29] qpfsubs,3 %g17, %b[28], %g17 qpfsubs,4 %g21, %b[37], %b[28] qpfadds,5 %g21, %b[37], %g21 } { loop_mode qpxor,0 %b[39], %r23, %b[33] qpfsubs,1 %g20, %g19, %g19 qpfadds,2 %g18, %b[24], %g20 qpfsubs,3 %g18, %b[24], %g18 qpfsubs,4 %g22, %b[30], %g22 qpfadds,5 %g22, %b[30], %b[24] } { loop_mode qpfsubs,0 %g16, %g23, %b[30] qpfadds,1 %g16, %g23, %g16 qpfadds,2 %g24, %b[32], %g23 } { loop_mode qpfadds,0 %g28, %b[26], %b[37] qpfsubs,1 %g24, %b[32], %g24 qpfsubs,2 %g28, %b[26], %g28 } { loop_mode qpfsubs,0 %g25, %b[33], %b[22] qpfadds,1 %g25, %b[33], %g25 qpfsubs,2 %b[22], %b[19], %b[26] qpxor,3 %b[75], %r6, %b[32] qpxor,4 %b[83], %r6, %b[33] qpfadds,5 %b[22], %b[19], %b[19] } { loop_mode qpxor,3 %b[74], %r6, %b[39] qpxor,4 %b[82], %r6, %b[42] } { loop_mode qpxor,3 %b[86], %r6, %b[43] qpxor,4 %b[94], %r6, %b[44] } { loop_mode qpxor,0 %b[85], %r6, %b[45] qpxor,1 %b[93], %r6, %b[46] qpxor,3 %b[96], %r6, %b[47] qpxor,4 %b[102], %r6, %b[48] } { loop_mode qpxor,0 %b[95], %r6, %b[49] qpxor,1 %b[101], %r6, %b[50] qpshufb,3 %b[41], %g27, %r24, %b[51] qpshufb,4 %g31, %g26, %r24, %b[52] } { loop_mode qpshufb,0 %b[20], %b[36], %r24, %b[53] qpshufb,1 %b[40], %b[34], %r24, %b[54] qpshufb,3 %g20, %g29, %r24, %b[55] 
qpshufb,4 %b[25], %g30, %r24, %b[56] qpfmuls,5 %b[23], %b[51], %b[23] } { loop_mode qpshufb,0 %g18, %b[38], %r24, %b[57] qpshufb,1 %g17, %b[35], %r24, %b[58] qpfmuls,2 %b[43], %b[53], %b[43] qpshufb,3 %g24, %b[30], %r24, %b[59] qpshufb,4 %g28, %g19, %r24, %b[60] qpfmuls,5 %b[27], %b[52], %b[27] } { loop_mode qpshufb,0 %g23, %g16, %r24, %b[61] qpshufb,1 %b[37], %b[29], %r24, %b[62] qpfmuls,2 %b[44], %b[54], %b[44] qpshufb,3 %b[22], %b[28], %r24, %b[63] qpshufb,4 %g22, %b[26], %r24, %b[64] qpfmuls,5 %b[21], %b[55], %b[21] } { loop_mode qpshufb,0 %g25, %g21, %r24, %b[65] qpshufb,1 %b[24], %b[19], %r24, %b[66] qpfmuls,2 %b[45], %b[57], %b[45] qpfmuls,3 %b[31], %b[56], %b[31] qpfmuls,4 %b[32], %b[59], %b[32] qpfmuls,5 %b[33], %b[60], %b[33] } { loop_mode qpfmuls,0 %b[46], %b[58], %b[46] qpfmuls,1 %b[47], %b[61], %b[47] qpfmuls,2 %b[48], %b[62], %b[48] qpfmuls,3 %b[42], %b[64], %b[42] qpxor,4 %b[70], %r6, %b[71] qpfmuls,5 %b[39], %b[63], %b[39] } { loop_mode qpfmuls,0 %b[50], %b[66], %b[50] qpxor,1 %b[81], %r6, %b[76] qpfmuls,2 %b[49], %b[65], %b[49] } { loop_mode qpxor,4 %b[69], %r6, %b[77] } { loop_mode qpxor,1 %b[80], %r6, %b[78] qpxor,3 %b[88], %r6, %b[79] qpxor,4 %b[100], %r6, %b[84] } { loop_mode qpxor,0 %b[87], %r6, %b[89] qpxor,1 %b[99], %r6, %b[90] qpshufb,3 %g31, %g26, %r7, %g26 qpshufb,4 %b[40], %b[34], %r7, %g31 } { loop_mode qpshufb,0 %b[25], %g30, %r7, %g30 qpshufb,1 %g17, %b[35], %r7, %g17 qpshufb,3 %g28, %g19, %r7, %g19 qpshufb,4 %b[37], %b[29], %r7, %g28 qpfmuls,5 %b[79], %g31, %b[25] } { loop_mode qpshufb,0 %g22, %b[26], %r7, %g22 qpshufb,1 %b[24], %b[19], %r7, %b[19] qpfmuls,2 %b[77], %g30, %b[26] qpfmuls,3 %b[76], %g19, %b[29] qpfmuls,4 %b[84], %g28, %b[34] qpfmuls,5 %b[71], %g26, %b[24] } { loop_mode qpfmuls,0 %b[90], %b[19], %b[37] qpfmuls,1 %b[89], %g17, %b[40] qpfmuls,2 %b[78], %g22, %b[35] qpshufb,3 %b[68], %b[68], %r25, %b[68] qpshufb,4 %b[67], %b[67], %r25, %b[67] } { loop_mode qpshufb,0 %b[73], %b[73], %r25, %b[71] qpshufb,1 %b[72], %b[72], 
%r25, %b[72] qpfmul_hadds,3 %b[67], %b[55], %b[21], %b[21] qpfmul_hadds,5 %b[68], %b[51], %b[23], %b[23] } { loop_mode qpfmul_hadds,0 %b[72], %b[56], %b[31], %b[31] qpfmul_hadds,2 %b[71], %b[52], %b[27], %b[27] qpshufb,3 %b[75], %b[75], %r25, %b[51] qpshufb,4 %b[83], %b[83], %r25, %b[55] } { loop_mode qpshufb,0 %b[74], %b[74], %r25, %b[52] qpshufb,1 %b[82], %b[82], %r25, %b[56] qpshufb,3 %b[86], %b[86], %r25, %b[67] qpshufb,4 %b[94], %b[94], %r25, %b[68] qpfmul_hadds,5 %b[51], %b[59], %b[32], %b[32] } { loop_mode qpshufb,0 %b[85], %b[85], %r25, %b[51] qpshufb,1 %b[93], %b[93], %r25, %b[59] qpfmul_hadds,2 %b[52], %b[63], %b[39], %b[39] qpshufb,3 %b[96], %b[96], %r25, %b[71] qpshufb,4 %b[102], %b[102], %r25, %b[72] qpfmul_hadds,5 %b[55], %b[60], %b[33], %b[33] } { loop_mode qpshufb,0 %b[95], %b[95], %r25, %b[52] qpshufb,1 %b[101], %b[101], %r25, %b[55] qpfmul_hadds,2 %b[56], %b[64], %b[42], %b[42] qpfmul_hadds,3 %b[67], %b[53], %b[43], %b[43] qpfmul_hadds,4 %b[68], %b[54], %b[44], %b[44] qpfmul_hadds,5 %b[71], %b[61], %b[47], %b[47] } { loop_mode qpfmul_hadds,0 %b[59], %b[58], %b[46], %b[46] qpfmul_hadds,1 %b[52], %b[65], %b[49], %b[49] qpfmul_hadds,2 %b[51], %b[57], %b[45], %b[45] qpshufb,3 %b[70], %b[70], %r25, %b[51] qpshufb,4 %b[69], %b[69], %r25, %b[52] qpfmul_hadds,5 %b[72], %b[62], %b[48], %b[48] } { loop_mode qpshufb,0 %b[81], %b[81], %r25, %b[53] qpshufb,1 %b[88], %b[88], %r25, %b[54] qpfmul_hadds,2 %b[55], %b[66], %b[50], %b[50] qpfmul_hadds,3 %b[52], %g30, %b[26], %g30 qpshufb,4 %b[80], %b[80], %r25, %b[55] qpfmul_hadds,5 %b[51], %g26, %b[24], %g26 } { loop_mode qpfmul_hadds,0 %b[53], %g19, %b[29], %g19 qpshufb,1 %b[87], %b[87], %r25, %b[24] qpfmul_hadds,2 %b[54], %g31, %b[25], %g31 qpshufb,3 %b[100], %b[100], %r25, %b[26] qpshufb,4 %b[99], %b[99], %r25, %b[51] qpfmul_hadds,5 %b[55], %g22, %b[35], %g22 } { loop_mode qpshufb,0 %b[20], %b[36], %r7, %b[20] qpshufb,1 %b[41], %g27, %r7, %g27 qpfmul_hadds,2 %b[24], %g17, %b[40], %g17 qpfmul_hadds,3 %b[51], 
%b[19], %b[37], %b[19] qpshufb,4 %b[23], %b[23], %r22, %b[23] qpfmul_hadds,5 %b[26], %g28, %b[34], %g28 } { loop_mode qpshufb,0 %b[27], %b[27], %r22, %b[24] qpshufb,1 %b[31], %b[31], %r22, %b[25] qpshufb,3 %b[21], %b[21], %r22, %b[21] qpshufb,4 %g20, %g29, %r7, %g20 } { loop_mode qpshufb,0 %g18, %b[38], %r7, %g18 qpshufb,1 %g23, %g16, %r7, %g16 } { loop_mode qpshufb,3 %b[32], %b[32], %r22, %g23 qpshufb,4 %b[33], %b[33], %r22, %g29 } { loop_mode qpshufb,0 %b[39], %b[39], %r22, %b[26] qpshufb,1 %b[42], %b[42], %r22, %b[27] qpshufb,4 %b[43], %b[43], %r22, %b[29] } { loop_mode qpfsubs,0 %b[26], %b[27], %b[36] qpshufb,1 %b[45], %b[45], %r22, %b[31] qpfsubs,2 %b[23], %b[24], %b[34] qpshufb,3 %b[44], %b[44], %r22, %b[32] qpshufb,4 %b[47], %b[47], %r22, %b[33] qpfsubs,5 %b[21], %b[25], %b[35] } { loop_mode qpshufb,0 %b[46], %b[46], %r22, %b[37] qpshufb,1 %b[49], %b[49], %r22, %b[38] qpfadds,2 %b[23], %b[24], %b[23] qpfsubs,3 %b[29], %b[32], %b[41] qpshufb,4 %b[48], %b[48], %r22, %b[39] qpfsubs,5 %g23, %g29, %b[40] } { loop_mode qpfadds,0 %b[21], %b[25], %g25 qpshufb,1 %b[50], %b[50], %r22, %b[24] qpfsubs,2 %b[31], %b[37], %b[43] qpshufb,3 %g24, %b[30], %r7, %g24 qpshufb,4 %g25, %g21, %r7, %g21 qpfsubs,5 %b[33], %b[39], %b[42] } { loop_mode qpshufb,0 %b[22], %b[28], %r7, %b[22] qpshufb,1 %g26, %g26, %r22, %g26 qpfsubs,2 %b[38], %b[24], %b[21] qpfadds,3 %g23, %g29, %g23 qpshufb,4 %g30, %g30, %r22, %g30 qpfadds,5 %b[26], %b[27], %g29 } { loop_mode qpfadds,0 %b[38], %b[24], %b[24] qpshufb,1 %g17, %g17, %r22, %g17 qpfadds,2 %b[31], %b[37], %b[25] qpshufb,3 %g22, %g22, %r22, %g22 qpshufb,4 %g31, %g31, %r22, %g31 qpfadds,5 %b[33], %b[39], %b[26] } { loop_mode qpshufb,0 %g19, %g19, %r22, %g19 qpshufb,1 %g28, %g28, %r22, %g28 qpfadds,2 %b[29], %b[32], %b[27] qpfadds,3 %g20, %g30, %b[28] qpshufb,4 %b[19], %b[19], %r22, %b[19] qpfsubs,5 %g20, %g30, %g20 } { loop_mode qpfsubs,0 %g27, %g26, %g30 qpfadds,1 %g27, %g26, %g26 qpfadds,2 %g18, %g17, %b[20] qpfadds,3 %b[20], %g31, %g27 
qpfsubs,4 %b[20], %g31, %g31 qpfsubs,5 %g21, %b[19], %b[29] } { loop_mode qpfsubs,0 %g18, %g17, %g17 qpfsubs,1 %g24, %g19, %g18 qpfadds,2 %g24, %g19, %g19 qpfadds,3 %b[22], %g22, %g24 qpfsubs,4 %b[22], %g22, %g22 qpfadds,5 %g21, %b[19], %g21 } { loop_mode qpfsubs,0 %g16, %g28, %g16 qpfadds,2 %g16, %g28, %b[19] } { loop_mode qpfsubs,3 %b[28], %g25, %g28 qpfadds,4 %b[28], %g25, %g25 } { loop_mode qpfadds,0 %g26, %b[23], %b[31] qpshufb,1 %b[35], %b[35], %r25, %b[22] qpfsubs,2 %g26, %b[23], %g26 qpshufb,3 %b[34], %b[34], %r25, %b[28] qpshufb,4 %b[40], %b[40], %r25, %b[30] } { loop_mode qpshufb,0 %b[36], %b[36], %r25, %b[23] qpshufb,1 %b[21], %b[21], %r25, %b[21] qpfsubs,2 %b[20], %b[25], %b[27] qpfadds,3 %g27, %b[27], %b[32] qpfsubs,4 %g27, %b[27], %g27 qpfadds,5 %g21, %b[24], %b[33] } { loop_mode qpfadds,0 %b[20], %b[25], %b[20] qpshufb,1 %b[43], %b[43], %r25, %b[34] qpfadds,2 %g19, %g23, %b[25] qpshufb,3 %b[42], %b[42], %r25, %b[35] qpshufb,4 %b[41], %b[41], %r25, %b[36] qpfsubs,5 %g21, %b[24], %g21 } { loop_mode qpxor,0 %b[22], %r23, %b[22] qpxor,1 %b[23], %r23, %b[23] qpfsubs,2 %b[19], %b[26], %g23 qpfsubs,3 %g19, %g23, %g19 qpfadds,4 %g24, %g29, %b[24] qpfsubs,5 %g24, %g29, %g24 } { loop_mode qpfadds,0 %b[19], %b[26], %b[19] qpxor,1 %b[28], %r23, %g29 qpfadds,2 %g20, %b[22], %b[26] qpxor,3 %b[30], %r23, %b[28] qpxor,4 %b[35], %r23, %b[30] stqp,5 %r33, %r0, %g28 } { loop_mode qpxor,0 %b[21], %r23, %g28 qpxor,1 %b[34], %r23, %b[21] qpfsubs,2 %g30, %g29, %b[34] qpfadds,3 %g18, %b[28], %b[35] qpfsubs,4 %g18, %b[28], %g18 qpfadds,5 %g16, %b[30], %b[28] } { loop_mode qpfadds,0 %g30, %g29, %g29 qpxor,1 %b[36], %r23, %b[36] qpfsubs,2 %g20, %b[22], %g20 qpfsubs,3 %g16, %b[30], %g16 stqp,5 %r18, %r0, %g26 } { loop_mode qpfsubs,0 %g22, %b[23], %g26 qpfadds,1 %g22, %b[23], %g22 qpfadds,2 %b[29], %g28, %g30 stqp,5 %r28, %r0, %g25 } { loop_mode qpfsubs,0 %g17, %b[21], %g25 qpfadds,1 %g17, %b[21], %g17 qpfsubs,2 %b[29], %g28, %g28 stqp,5 %r2, %r0, %b[31] } { loop_mode qpfsubs,0 
%g31, %b[36], %b[21] qpfadds,1 %g31, %b[36], %g31 stqp,2 %r16, %r0, %g27 stqp,5 %r20, %r0, %b[32] } { loop_mode stqp,2 %r37, %r0, %b[27] stqp,5 %r26, %r0, %b[20] } { loop_mode stqp,2 %r13, %r0, %g19 stqp,5 %r21, %r0, %b[25] } { loop_mode stqp,2 %r36, %r0, %g21 stqp,5 %r3, %r0, %g23 } { loop_mode stqp,2 %r9, %r0, %b[33] stqp,5 %r27, %r0, %b[24] } { loop_mode stqp,2 %r32, %r0, %g24 stqp,5 %r17, %r0, %b[19] } { loop_mode stqp,2 %r35, %r0, %b[26] stqp,5 %r15, %r0, %g29 } { loop_mode stqp,2 %r31, %r0, %g20 stqp,5 %r19, %r0, %b[34] } { loop_mode stqp,2 %r1, %r0, %b[35] stqp,5 %r12, %r0, %g18 } { loop_mode stqp,2 %r40, %r0, %g22 stqp,5 %r30, %r0, %g26 } { loop_mode stqp,2 %r38, %r0, %g30 stqp,5 %r4, %r0, %b[28] } { loop_mode stqp,2 %r34, %r0, %g28 stqp,5 %r39, %r0, %g17 } { loop_mode stqp,2 %r14, %r0, %g16 stqp,5 %r5, %r0, %g31 } { loop_mode ct %ctpr1 ? %NOT_LOOP_END alc alcf=1, alct=1 stqp,2 %r29, %r0, %g25 addd,3,sm %r0, _f16s,_lts0lo 0x20, %r0 stqp,5 %r11, %r0, %b[21] }
Theoretical throughput: 64 complex numbers (64 × 8 = 512 bytes) in 115 cycles, i.e. 512/115 ≈ 4.45 bytes/cycle
Quadrupled theoretical throughput: 17.81 bytes/cycle
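The two figures above follow from simple arithmetic, which can be reproduced with a small helper. This is a sketch: the name `bytes_per_cycle` is hypothetical, and it assumes that each `float complex` element occupies 8 bytes, as stated in the problem setup.

```c
#include <assert.h>

/* Bytes processed per cycle for a loop body that handles `complex_count`
 * float-complex values (8 bytes each) in `cycles` cycles.
 * Hypothetical helper for illustration only. */
double bytes_per_cycle(int complex_count, int cycles)
{
    assert(cycles > 0);
    return (double)complex_count * 8.0 / (double)cycles;
}
```

For the loop above, `bytes_per_cycle(64, 115)` gives roughly 4.45, and multiplying that value by four gives roughly 17.81.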
Speed measurements

5. stage_radix4_2x_simd128_noConj
Here the stage_radix4_simd128_noConj algorithm is manually unrolled by a factor of 2.
C code
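The code in this section multiplies by the coefficients without first building a conjugated copy of them: it takes elementwise products and then a horizontal subtract for the real part and a horizontal add for the imaginary part. In scalar terms the idea can be sketched as follows. The function name `cmul_no_conj` is hypothetical, and the sketch models only the arithmetic; the exact lane order of `qpfhsubs`/`qpfhadds` is not modelled here and should be treated as an assumption.

```c
#include <assert.h>

/* Complex multiply c*y structured like the SIMD code: elementwise products,
 * then a horizontal subtract (real part) and a horizontal add (imaginary
 * part). All arrays are {re, im} pairs. Hypothetical helper. */
void cmul_no_conj(const float c[2], const float y[2], float out[2])
{
    assert(c && y && out);

    float swap_c[2]    = { c[1], c[0] };                           /* {ci, cr}       */
    float prod_real[2] = { c[0] * y[0], c[1] * y[1] };             /* {cr*yr, ci*yi} */
    float prod_imag[2] = { swap_c[0] * y[0], swap_c[1] * y[1] };   /* {ci*yr, cr*yi} */

    out[0] = prod_real[0] - prod_real[1];   /* cr*yr - ci*yi */
    out[1] = prod_imag[0] + prod_imag[1];   /* ci*yr + cr*yi */
}
```

Compared with the variant that conjugates the coefficient first, this form trades the sign-flipping XOR for a subtract in the horizontal step.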
void stage_radix4_2x_simd128_noConj(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b) { __v2di *xy0_in = (__v2di*)&data_in[ 0]; __v2di *zw0_in = (__v2di*)&data_in[ 2]; __v2di *xy1_in = (__v2di*)&data_in[ 4]; __v2di *zw1_in = (__v2di*)&data_in[ 6]; __v2di *xy2_in = (__v2di*)&data_in[ 8]; __v2di *zw2_in = (__v2di*)&data_in[10]; __v2di *xy3_in = (__v2di*)&data_in[12]; __v2di *zw3_in = (__v2di*)&data_in[14]; __v2di *xy4_in = (__v2di*)&data_in[16]; __v2di *zw4_in = (__v2di*)&data_in[18]; __v2di *xy5_in = (__v2di*)&data_in[20]; __v2di *zw5_in = (__v2di*)&data_in[22]; __v2di *xy6_in = (__v2di*)&data_in[24]; __v2di *zw6_in = (__v2di*)&data_in[26]; __v2di *xy7_in = (__v2di*)&data_in[28]; __v2di *zw7_in = (__v2di*)&data_in[30]; __v2di *c0a_in = (__v2di*)&coefC_a[0]; __v2di *c1a_in = (__v2di*)&coefC_a[2]; __v2di *c2a_in = (__v2di*)&coefC_a[4]; __v2di *c3a_in = (__v2di*)&coefC_a[6]; __v2di *d0a_in = (__v2di*)&coefD_a[0]; __v2di *d1a_in = (__v2di*)&coefD_a[2]; __v2di *d2a_in = (__v2di*)&coefD_a[4]; __v2di *d3a_in = (__v2di*)&coefD_a[6]; __v2di *e0a_in = (__v2di*)&coefE_a[0]; __v2di *e1a_in = (__v2di*)&coefE_a[2]; __v2di *e2a_in = (__v2di*)&coefE_a[4]; __v2di *e3a_in = (__v2di*)&coefE_a[6]; __v2di *c0b_in = (__v2di*)&coefC_b[0*data_count/16]; __v2di *c1b_in = (__v2di*)&coefC_b[1*data_count/16]; __v2di *c2b_in = (__v2di*)&coefC_b[2*data_count/16]; __v2di *c3b_in = (__v2di*)&coefC_b[3*data_count/16]; __v2di *d0b_in = (__v2di*)&coefD_b[0*data_count/16]; __v2di *d1b_in = (__v2di*)&coefD_b[1*data_count/16]; __v2di *d2b_in = (__v2di*)&coefD_b[2*data_count/16]; __v2di *d3b_in = (__v2di*)&coefD_b[3*data_count/16]; __v2di *e0b_in = (__v2di*)&coefE_b[0*data_count/16]; __v2di *e1b_in = (__v2di*)&coefE_b[1*data_count/16]; __v2di *e2b_in = (__v2di*)&coefE_b[2*data_count/16]; __v2di *e3b_in = (__v2di*)&coefE_b[3*data_count/16]; __v2di *out_0 = (__v2di*)&data_out[ 
0*data_count/16]; __v2di *out_1 = (__v2di*)&data_out[ 1*data_count/16]; __v2di *out_2 = (__v2di*)&data_out[ 2*data_count/16]; __v2di *out_3 = (__v2di*)&data_out[ 3*data_count/16]; __v2di *out_4 = (__v2di*)&data_out[ 4*data_count/16]; __v2di *out_5 = (__v2di*)&data_out[ 5*data_count/16]; __v2di *out_6 = (__v2di*)&data_out[ 6*data_count/16]; __v2di *out_7 = (__v2di*)&data_out[ 7*data_count/16]; __v2di *out_8 = (__v2di*)&data_out[ 8*data_count/16]; __v2di *out_9 = (__v2di*)&data_out[ 9*data_count/16]; __v2di *out_10 = (__v2di*)&data_out[10*data_count/16]; __v2di *out_11 = (__v2di*)&data_out[11*data_count/16]; __v2di *out_12 = (__v2di*)&data_out[12*data_count/16]; __v2di *out_13 = (__v2di*)&data_out[13*data_count/16]; __v2di *out_14 = (__v2di*)&data_out[14*data_count/16]; __v2di *out_15 = (__v2di*)&data_out[15*data_count/16]; #pragma ivdep #pragma unroll(1) #pragma prefetch for(int64_t i = 0; i < data_count/32; ++i) { __v2di xy0 = xy0_in[16*i]; __v2di zw0 = zw0_in[16*i]; __v2di xy1 = xy1_in[16*i]; __v2di zw1 = zw1_in[16*i]; __v2di c0 = c0a_in[4*i]; __v2di d0 = d0a_in[4*i]; __v2di e0 = e0a_in[4*i]; __v2di xy2 = xy2_in[16*i]; __v2di zw2 = zw2_in[16*i]; __v2di xy3 = xy3_in[16*i]; __v2di zw3 = zw3_in[16*i]; __v2di c1 = c1a_in[4*i]; __v2di d1 = d1a_in[4*i]; __v2di e1 = e1a_in[4*i]; __v2di xy4 = xy4_in[16*i]; __v2di zw4 = zw4_in[16*i]; __v2di xy5 = xy5_in[16*i]; __v2di zw5 = zw5_in[16*i]; __v2di c2 = c2a_in[4*i]; __v2di d2 = d2a_in[4*i]; __v2di e2 = e2a_in[4*i]; __v2di xy6 = xy6_in[16*i]; __v2di zw6 = zw6_in[16*i]; __v2di xy7 = xy7_in[16*i]; __v2di zw7 = zw7_in[16*i]; __v2di c3 = c3a_in[4*i]; __v2di d3 = d3a_in[4*i]; __v2di e3 = e3a_in[4*i]; __v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w0 = __builtin_e2k_qpshufb(zw1, zw0, 
(__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e0 = 
__builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di cy0_real = __builtin_e2k_qpfmuls( c0, y0); __v2di cy1_real = __builtin_e2k_qpfmuls( c1, y1); __v2di cy2_real = __builtin_e2k_qpfmuls( c2, y2); __v2di cy3_real = __builtin_e2k_qpfmuls( c3, y3); __v2di dz0_real = __builtin_e2k_qpfmuls( d0, z0); __v2di dz1_real = __builtin_e2k_qpfmuls( d1, z1); __v2di dz2_real = __builtin_e2k_qpfmuls( d2, z2); __v2di dz3_real = __builtin_e2k_qpfmuls( d3, z3); __v2di ew0_real = __builtin_e2k_qpfmuls( e0, w0); __v2di ew1_real = __builtin_e2k_qpfmuls( e1, w1); __v2di ew2_real = __builtin_e2k_qpfmuls( e2, w2); __v2di ew3_real = __builtin_e2k_qpfmuls( e3, w3); __v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0); __v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1); __v2di cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2); __v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3); __v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0); __v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1); __v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2); __v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3); __v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0); __v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1); __v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2); __v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3); __v2di cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real); __v2di cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real); __v2di cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real); __v2di cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real); __v2di dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real); __v2di dz1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, 
dz1_real); __v2di dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real); __v2di dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real); __v2di ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real); __v2di ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real); __v2di ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real); __v2di ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real); __v2di cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag); __v2di cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag); __v2di cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag); __v2di cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag); __v2di dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag); __v2di dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag); __v2di dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag); __v2di dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag); __v2di ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag); __v2di ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag); __v2di ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag); __v2di ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag); __v2di cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di ew0 = 
__builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di add02_0 = __builtin_e2k_qpfadds( x0, dz0); __v2di add02_1 = __builtin_e2k_qpfadds( x1, dz1); __v2di add02_2 = __builtin_e2k_qpfadds( x2, dz2); __v2di add02_3 = __builtin_e2k_qpfadds( x3, dz3); __v2di sub02_0 = __builtin_e2k_qpfsubs( x0, dz0); __v2di sub02_1 = __builtin_e2k_qpfsubs( x1, dz1); __v2di sub02_2 = __builtin_e2k_qpfsubs( x2, dz2); __v2di sub02_3 = __builtin_e2k_qpfsubs( x3, dz3); __v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0); __v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1); __v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2); __v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3); __v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0); __v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1); __v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2); __v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3); __v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31}); __v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31}); __v2di sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31}); __v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31}); __v2di out0 = 
__builtin_e2k_qpfadds(add02_0, add13_0); __v2di out1 = __builtin_e2k_qpfadds(add02_1, add13_1); __v2di out2 = __builtin_e2k_qpfadds(add02_2, add13_2); __v2di out3 = __builtin_e2k_qpfadds(add02_3, add13_3); __v2di out4 = __builtin_e2k_qpfsubs(sub02_0, sub13i_0); __v2di out5 = __builtin_e2k_qpfsubs(sub02_1, sub13i_1); __v2di out6 = __builtin_e2k_qpfsubs(sub02_2, sub13i_2); __v2di out7 = __builtin_e2k_qpfsubs(sub02_3, sub13i_3); __v2di out8 = __builtin_e2k_qpfsubs(add02_0, add13_0); __v2di out9 = __builtin_e2k_qpfsubs(add02_1, add13_1); __v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2); __v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3); __v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0); __v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1); __v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2); __v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3); xy0 = out0; zw0 = out1; xy1 = out2; zw1 = out3; c0 = c0b_in[i]; d0 = d0b_in[i]; e0 = e0b_in[i]; xy2 = out4; zw2 = out5; xy3 = out6; zw3 = out7; c1 = c1b_in[i]; d1 = d1b_in[i]; e1 = e1b_in[i]; xy4 = out8; zw4 = out9; xy5 = out10; zw5 = out11; c2 = c2b_in[i]; d2 = d2b_in[i]; e2 = e2b_in[i]; xy6 = out12; zw6 = out13; xy7 = out14; zw7 = out15; c3 = c3b_in[i]; d3 = d3b_in[i]; e3 = e3b_in[i]; x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100}); y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100}); w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100}); y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100}); w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); x2 = 
__builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100}); y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100}); w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100}); y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100}); w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); cy0_real = __builtin_e2k_qpfmuls( c0, y0); cy1_real = __builtin_e2k_qpfmuls( c1, y1); cy2_real = __builtin_e2k_qpfmuls( c2, y2); cy3_real = __builtin_e2k_qpfmuls( c3, y3); dz0_real = __builtin_e2k_qpfmuls( 
d0, z0); dz1_real = __builtin_e2k_qpfmuls( d1, z1); dz2_real = __builtin_e2k_qpfmuls( d2, z2); dz3_real = __builtin_e2k_qpfmuls( d3, z3); ew0_real = __builtin_e2k_qpfmuls( e0, w0); ew1_real = __builtin_e2k_qpfmuls( e1, w1); ew2_real = __builtin_e2k_qpfmuls( e2, w2); ew3_real = __builtin_e2k_qpfmuls( e3, w3); cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0); cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1); cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2); cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3); dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0); dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1); dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2); dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3); ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0); ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1); ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2); ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3); cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real); cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real); cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real); cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real); dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real); dz1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz1_real); dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real); dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real); ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real); ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real); ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real); ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real); cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag); cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag); cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag); cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag); dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag); dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag); dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag); dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag); 
ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag); ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag); ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag); ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag); cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); ew0 = __builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); add02_0 = __builtin_e2k_qpfadds( x0, dz0); add02_1 = __builtin_e2k_qpfadds( x1, dz1); add02_2 = __builtin_e2k_qpfadds( x2, dz2); add02_3 = __builtin_e2k_qpfadds( x3, dz3); sub02_0 = __builtin_e2k_qpfsubs( x0, dz0); sub02_1 = __builtin_e2k_qpfsubs( x1, dz1); sub02_2 = __builtin_e2k_qpfsubs( x2, dz2); sub02_3 = __builtin_e2k_qpfsubs( x3, dz3); add13_0 = __builtin_e2k_qpfadds(cy0, ew0); add13_1 = __builtin_e2k_qpfadds(cy1, ew1); add13_2 = __builtin_e2k_qpfadds(cy2, ew2); add13_3 = __builtin_e2k_qpfadds(cy3, ew3); sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0); sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1); sub13_2 = 
__builtin_e2k_qpfsubs(cy2, ew2); sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3); swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31}); sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31}); sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31}); sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31}); out_0[i] = __builtin_e2k_qpfadds(add02_0, add13_0); out_1[i] = __builtin_e2k_qpfadds(add02_1, add13_1); out_2[i] = __builtin_e2k_qpfadds(add02_2, add13_2); out_3[i] = __builtin_e2k_qpfadds(add02_3, add13_3); out_4[i] = __builtin_e2k_qpfsubs(sub02_0, sub13i_0); out_5[i] = __builtin_e2k_qpfsubs(sub02_1, sub13i_1); out_6[i] = __builtin_e2k_qpfsubs(sub02_2, sub13i_2); out_7[i] = __builtin_e2k_qpfsubs(sub02_3, sub13i_3); out_8[i] = __builtin_e2k_qpfsubs(add02_0, add13_0); out_9[i] = __builtin_e2k_qpfsubs(add02_1, add13_1); out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2); out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3); out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0); out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1); out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2); out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3); } }
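The second half of the loop body is easier to follow as a scalar reference. The sketch below is one reading of the data flow in the listing above, not the author's code: the function names are hypothetical, and it assumes that myComplex is laid out like float complex with the real part first. Under that assumption, the swap shuffle followed by the XOR with 1LL<<31 turns (re, im) into (-im, re), which is a multiplication by i.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical helper: multiply by i, i.e. (re, im) -> (-im, re).
   This mirrors the swap + sign-bit XOR that produces sub13i_* in the listing. */
static float complex mul_by_i(float complex s)
{
    float re = -cimagf(s);
    float im = crealf(s);
    return re + im * I;
}

/* Hypothetical scalar reference for one radix-4 butterfly.
   x, y, z, w are the four inputs; c, d, e are the twiddle factors
   applied to y, z and w. out[0..3] correspond to the four output
   groups out0..3, out4..7, out8..11 and out12..15 of the vector code. */
static void radix4_butterfly_ref(float complex x, float complex y,
                                 float complex z, float complex w,
                                 float complex c, float complex d,
                                 float complex e, float complex out[4])
{
    assert(out != 0);

    float complex cy = c * y;            /* cy*_real / cy*_imag + permute */
    float complex dz = d * z;
    float complex ew = e * w;

    float complex add02 = x + dz;
    float complex sub02 = x - dz;
    float complex add13 = cy + ew;
    float complex sub13 = cy - ew;
    float complex sub13i = mul_by_i(sub13);

    out[0] = add02 + add13;
    out[1] = sub02 - sub13i;
    out[2] = add02 - add13;
    out[3] = sub02 + sub13i;
}
```

With all twiddle factors equal to 1, the function reduces to a plain 4-point forward DFT, which gives a quick sanity check for the signs.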
Main loop in assembly
.L15211: { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=64 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=96 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=128 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=160 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=6, disp=192 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=6, disp=224 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=4, asz=1, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=4, asz=1, abs=8, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=3, asz=1, abs=10, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=3, asz=1, abs=10, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=1, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=1, abs=12, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=1, incr=0, ind=0, asz=1, abs=14, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=15, asz=1, abs=14, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=14, asz=1, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=13, asz=1, abs=16, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=12, asz=1, abs=18, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=11, asz=1, abs=18, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=10, asz=2, abs=20, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=9, asz=2, abs=20, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=8, asz=2, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=2, abs=24, disp=0 } { fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=2, abs=28, disp=0 fapb dpl=0, dcd=0, fmt=5, 
mrng=16, d=0, incr=0, ind=5, asz=2, abs=28, disp=0 } .L11903: { loop_mode disp %ctpr1, .L11903 movaqp,0 area=0, ind=0, am=1, be=0, %g17 movaqp,1 area=0, ind=16, am=0, be=0, %g16 movaqp,2 area=0, ind=0, am=1, be=0, %g19 movaqp,3 area=0, ind=16, am=0, be=0, %g18 } { loop_mode movaqp,0 area=1, ind=0, am=1, be=0, %g21 movaqp,1 area=1, ind=16, am=0, be=0, %g20 movaqp,2 area=1, ind=0, am=1, be=0, %g23 movaqp,3 area=1, ind=16, am=0, be=0, %g22 } { loop_mode movaqp,0 area=2, ind=0, am=1, be=0, %g25 movaqp,1 area=2, ind=16, am=0, be=0, %g24 movaqp,2 area=2, ind=0, am=1, be=0, %g27 movaqp,3 area=2, ind=16, am=0, be=0, %g26 } { loop_mode movaqp,0 area=3, ind=0, am=1, be=0, %g29 movaqp,1 area=3, ind=16, am=0, be=0, %g28 movaqp,2 area=3, ind=0, am=1, be=0, %g31 movaqp,3 area=3, ind=16, am=0, be=0, %g30 } { loop_mode movaqp,0 area=4, ind=0, am=1, be=0, %r26 movaqp,1 area=4, ind=16, am=0, be=0, %r9 movaqp,2 area=4, ind=0, am=1, be=0, %r28 movaqp,3 area=4, ind=16, am=0, be=0, %r27 } { loop_mode qpshufb,0 %g19, %g17, %r1, %r33 qpshufb,1 %g18, %g16, %r1, %r34 qpshufb,3 %g18, %g16, %r6, %g16 qpshufb,4 %g19, %g17, %r6, %g17 movaqp,0 area=5, ind=0, am=1, be=0, %r30 movaqp,1 area=5, ind=16, am=0, be=0, %r29 movaqp,2 area=5, ind=0, am=1, be=0, %r32 movaqp,3 area=5, ind=16, am=0, be=0, %r31 } { loop_mode qpshufb,0 %g23, %g21, %r1, %r37 qpshufb,1 %g22, %g20, %r1, %r38 qpshufb,3 %g22, %g20, %r6, %g20 qpshufb,4 %g23, %g21, %r6, %g21 movaqp,0 area=6, ind=0, am=1, be=0, %g19 movaqp,1 area=6, ind=16, am=0, be=0, %g18 movaqp,2 area=6, ind=0, am=1, be=0, %r36 movaqp,3 area=6, ind=16, am=0, be=0, %r35 } { loop_mode qpshufb,0 %g27, %g25, %r1, %g22 qpshufb,1 %g26, %g24, %r1, %g23 qpshufb,3 %g26, %g24, %r6, %g24 qpshufb,4 %g27, %g25, %r6, %g25 movaqp,0 area=8, ind=0, am=1, be=0, %r39 movaqp,1 area=7, ind=0, am=1, be=0, %g26 movaqp,2 area=8, ind=0, am=1, be=0, %r40 movaqp,3 area=7, ind=0, am=1, be=0, %g27 } { loop_mode qpshufb,0 %g31, %g29, %r1, %r41 qpshufb,1 %g30, %g28, %r1, %r42 qpshufb,3 %g30, 
%g28, %r6, %g28 qpshufb,4 %g31, %g29, %r6, %g29 movaqp,0 area=10, ind=0, am=1, be=0, %r43 movaqp,1 area=9, ind=0, am=1, be=0, %g30 movaqp,2 area=10, ind=0, am=1, be=0, %r44 movaqp,3 area=9, ind=0, am=1, be=0, %g31 } { loop_mode qpshufb,0 %r26, %r26, %r5, %r45 qpshufb,1 %r9, %r9, %r5, %r46 qpfmul_hsubs,2 %r26, %r33, %r7, %r26 qpshufb,3 %r28, %r28, %r5, %r47 qpshufb,4 %r27, %r27, %r5, %r48 movaqp,0 area=12, ind=0, am=1, be=0, %r51 movaqp,1 area=11, ind=0, am=1, be=0, %r49 movaqp,2 area=12, ind=0, am=1, be=0, %r52 movaqp,3 area=11, ind=0, am=1, be=0, %r50 } { loop_mode qpfmul_hadds,0 %r46, %r37, %r7, %r37 qpfmul_hsubs,1 %r28, %g22, %r7, %r28 qpfmul_hsubs,2 %r9, %r37, %r7, %r9 qpshufb,3 %r30, %r30, %r5, %r46 qpshufb,4 %r29, %r29, %r5, %r53 qpfmul_hsubs,5 %r30, %g16, %r7, %r30 } { loop_mode qpshufb,0 %g19, %g19, %r5, %r54 qpshufb,1 %g18, %g18, %r5, %r55 qpfmul_hsubs,2 %r27, %r41, %r7, %r27 qpshufb,3 %r36, %r36, %r5, %r56 qpshufb,4 %r35, %r35, %r5, %r57 qpfmul_hsubs,5 %g19, %r34, %r7, %g19 } { loop_mode qpfmul_hsubs,0 %g18, %r38, %r7, %g18 qpfmul_hsubs,1 %r36, %g23, %r7, %r36 qpfmul_hsubs,2 %r35, %r42, %r7, %r35 qpfmul_hadds,3 %r47, %g22, %r7, %g22 qpfmul_hadds,4 %r48, %r41, %r7, %r41 qpfmul_hadds,5 %r56, %g23, %r7, %g23 } { loop_mode qpfmul_hadds,0 %r54, %r34, %r7, %r34 qpfmul_hadds,1 %r55, %r38, %r7, %r38 qpfmul_hadds,2 %r45, %r33, %r7, %r33 qpshufb,3 %r32, %r32, %r5, %r45 qpshufb,4 %r31, %r31, %r5, %r47 qpfmul_hadds,5 %r57, %r42, %r7, %r42 } { loop_mode qpfmul_hsubs,0 %r29, %g20, %r7, %r29 qpfmul_hsubs,1 %r32, %g24, %r7, %r32 qpfmul_hsubs,2 %r31, %g28, %r7, %r31 qpfmul_hadds,3 %r53, %g20, %r7, %g20 qpfmul_hadds,4 %r45, %g24, %r7, %g24 qpfmul_hadds,5 %r46, %g16, %r7, %g16 } { loop_mode qpshufb,0 %r40, %r40, %r5, %r47 qpshufb,1 %g31, %g31, %r5, %r48 qpshufb,3 %g26, %g26, %r5, %r45 qpshufb,4 %r39, %r39, %r5, %r46 qpfmul_hadds,5 %r47, %g28, %r7, %g28 } { loop_mode qpshufb,3 %r43, %r43, %r5, %r53 qpshufb,4 %r50, %r50, %r5, %r54 } { loop_mode nop 1 qpshufb,0 %r49, %r49, 
%r5, %r55 qpshufb,1 %r52, %r52, %r5, %r56 qpshufb,3 %g27, %g27, %r5, %r57 qpshufb,4 %g30, %g30, %r5, %r58 } { loop_mode nop 1 qpshufb,3 %r44, %r44, %r5, %r59 qpshufb,4 %r51, %r51, %r5, %r60 } { loop_mode qppermb,0 %r33, %r26, %r3, %r26 qppermb,1 %r34, %g19, %r3, %g19 qppermb,3 %r41, %r27, %r3, %r27 qppermb,4 %r37, %r9, %r3, %r9 } { loop_mode qppermb,0 %r38, %g18, %r3, %g18 qppermb,1 %g22, %r28, %r3, %g22 qpfsubs,2 %r26, %g19, %r33 qppermb,3 %g23, %r36, %r3, %g23 qppermb,4 %r42, %r35, %r3, %r28 } { loop_mode qpfadds,0 %r26, %g19, %g19 qppermb,3 %g16, %r30, %r3, %g16 qpfadds,4 %r27, %r28, %r27 qpfsubs,5 %r27, %r28, %r34 } { loop_mode qppermb,0 %g20, %r29, %r3, %g20 qppermb,1 %g24, %r32, %r3, %g24 qppermb,3 %g28, %r31, %r3, %g28 qpfadds,4 %g17, %g16, %r26 qpfsubs,5 %g17, %g16, %g16 } { loop_mode qpfsubs,0 %r9, %g18, %g17 qpfadds,1 %r9, %g18, %g18 qpfsubs,2 %g21, %g20, %r9 qpfadds,3 %g29, %g28, %r28 qpfsubs,4 %g29, %g28, %g28 } { loop_mode qpfadds,0 %g21, %g20, %g20 qpfsubs,1 %g25, %g24, %g21 qpfsubs,2 %g22, %g23, %g29 qpfadds,5 %g22, %g23, %g22 } { loop_mode qpfadds,2 %g25, %g24, %g23 } { loop_mode qpshufb,3 %r34, %r34, %r5, %g24 qpshufb,4 %r33, %r33, %r5, %g25 } { loop_mode qpshufb,0 %g17, %g17, %r5, %g17 qpxor,3 %g24, %r4, %g24 qpxor,4 %g25, %r4, %g25 qpfadds,5 %r26, %g19, %r29 } { loop_mode qpshufb,0 %g29, %g29, %r5, %g29 qpxor,1 %g17, %r4, %g17 qpfsubs,2 %r26, %g19, %g19 qpfadds,3 %r28, %r27, %r26 qpfsubs,4 %r28, %r27, %r27 qpfsubs,5 %g16, %g25, %r28 } { loop_mode qpxor,0 %g29, %r4, %g29 qpfadds,1 %g20, %g18, %r30 qpfsubs,2 %g20, %g18, %g18 qpfadds,3 %g16, %g25, %g16 qpfsubs,4 %g28, %g24, %g20 qpfadds,5 %g28, %g24, %g24 } { loop_mode qpfadds,0 %r9, %g17, %g25 qpfadds,1 %g23, %g22, %g28 qpfsubs,2 %g23, %g22, %g22 } { loop_mode nop 2 qpfsubs,0 %r9, %g17, %g17 qpfsubs,1 %g21, %g29, %g23 qpfadds,2 %g21, %g29, %g21 } { loop_mode qpshufb,0 %r26, %r30, %r1, %g29 qpshufb,1 %r27, %g18, %r1, %r9 } { loop_mode qpshufb,0 %g28, %r29, %r1, %r31 qpshufb,1 %g22, %g19, %r1, %r32 
qpfmul_hadds,2 %r55, %r9, %r7, %r33 qpshufb,3 %r27, %g18, %r6, %g18 qpshufb,4 %r26, %r30, %r6, %r26 } { loop_mode qpshufb,0 %g23, %r28, %r1, %r27 qpshufb,1 %g20, %g17, %r1, %r30 qpfmul_hsubs,2 %r49, %r9, %r7, %r9 qpshufb,3 %g24, %g25, %r1, %r34 qpshufb,4 %g24, %g25, %r6, %g24 qpfmul_hadds,5 %r59, %g18, %r7, %g25 } { loop_mode qpshufb,0 %g21, %g16, %r1, %r35 qpfmul_hsubs,1 %r40, %r27, %r7, %r36 qpfmul_hadds,2 %r47, %r27, %r7, %r27 qpfmul_hadds,3 %r56, %r34, %r7, %r34 qpshufb,4 %g20, %g17, %r6, %g17 qpfmul_hsubs,5 %r52, %r34, %r7, %r37 } { loop_mode qpfmul_hsubs,0 %r39, %g29, %r7, %g20 qpfmul_hadds,1 %r46, %g29, %r7, %g29 qpfmul_hsubs,2 %r43, %r32, %r7, %r38 qpfmul_hsubs,3 %r44, %g18, %r7, %g18 qpfmul_hsubs,4 %g27, %r26, %r7, %g27 qpfmul_hadds,5 %r57, %r26, %r7, %r26 } { loop_mode qpfmul_hadds,0 %r45, %r31, %r7, %r39 qpfmul_hadds,1 %r53, %r32, %r7, %r32 qpfmul_hsubs,2 %g26, %r31, %r7, %g26 qpfmul_hadds,3 %r58, %g17, %r7, %r40 qpfmul_hadds,4 %r60, %g24, %r7, %g24 qpfmul_hsubs,5 %r51, %g24, %r7, %r31 } { loop_mode qpfmul_hadds,0 %r54, %r35, %r7, %r41 qpfmul_hadds,1 %r48, %r30, %r7, %r30 qpfmul_hsubs,2 %g31, %r30, %r7, %g31 qpshufb,3 %g22, %g19, %r6, %g19 qpshufb,4 %g28, %r29, %r6, %g22 qpfmul_hsubs,5 %g30, %g17, %r7, %g17 } { loop_mode nop 4 qpshufb,0 %g23, %r28, %r6, %g23 qpshufb,1 %g21, %g16, %r6, %g16 qpfmul_hsubs,2 %r50, %r35, %r7, %g28 } { loop_mode qppermb,3 %r33, %r9, %r3, %g21 qppermb,4 %r34, %r37, %r3, %g30 } { loop_mode qppermb,0 %r27, %r36, %r3, %r9 qppermb,1 %g29, %g20, %r3, %g20 qppermb,3 %r26, %g27, %r3, %g27 qppermb,4 %g25, %g18, %r3, %g18 } { loop_mode qppermb,0 %r39, %g26, %r3, %g25 qppermb,1 %r32, %r38, %r3, %g26 qppermb,3 %r40, %g17, %r3, %g17 qppermb,4 %g24, %r31, %r3, %g24 qpfsubs,5 %g19, %g18, %g29 } { loop_mode qppermb,0 %r30, %g31, %r3, %g31 qppermb,1 %r41, %g28, %r3, %g28 qpfsubs,2 %g25, %g20, %r26 qpfadds,3 %g22, %g27, %r27 qpfadds,4 %g19, %g18, %g18 qpfsubs,5 %g22, %g27, %g19 } { loop_mode qpfsubs,0 %g26, %g21, %g22 qpfsubs,1 %r9, %g31, %g27 
qpfsubs,2 %g28, %g30, %r28 qpfadds,3 %g16, %g24, %r29 qpfsubs,4 %g23, %g17, %r30 qpfsubs,5 %g16, %g24, %g16 } { loop_mode qpfadds,0 %g25, %g20, %g20 qpfadds,1 %g26, %g21, %g21 qpfadds,2 %r9, %g31, %g23 qpfadds,3 %g23, %g17, %g17 } { loop_mode nop 1 qpfadds,0 %g28, %g30, %g24 } { loop_mode qpshufb,1 %r26, %r26, %r5, %g25 } { loop_mode qpshufb,0 %g22, %g22, %r5, %g22 qpshufb,1 %g27, %g27, %r5, %g26 qpfsubs,2 %r27, %g20, %g27 } { loop_mode qpshufb,0 %r28, %r28, %r5, %g28 qpxor,1 %g25, %r4, %g25 qpfadds,2 %r27, %g20, %g20 } { loop_mode qpxor,0 %g22, %r4, %g22 qpxor,1 %g26, %r4, %g26 qpfadds,2 %g18, %g21, %g30 qpfsubs,3 %g18, %g21, %g18 qpfadds,4 %g17, %g23, %g21 qpfsubs,5 %g17, %g23, %g17 } { loop_mode qpxor,0 %g28, %r4, %g23 qpfadds,1 %r29, %g24, %g28 qpfsubs,2 %r29, %g24, %g24 } { loop_mode qpfsubs,0 %g19, %g25, %g31 qpfadds,1 %g19, %g25, %g19 qpfadds,2 %g29, %g22, %g25 } { loop_mode qpfsubs,0 %g29, %g22, %g22 qpfsubs,1 %r30, %g26, %g29 qpfadds,2 %r30, %g26, %g26 } { loop_mode qpfsubs,0 %g16, %g23, %r9 qpfadds,1 %g16, %g23, %g16 stqp,2 %r25, %r0, %g30 stqp,5 %r23, %r0, %g27 } { loop_mode stqp,2 %r2, %r0, %g20 stqp,5 %r18, %r0, %g18 } { loop_mode stqp,2 %r16, %r0, %g28 stqp,5 %r19, %r0, %g17 } { loop_mode stqp,2 %r22, %r0, %g21 stqp,5 %r12, %r0, %g24 } { loop_mode stqp,2 %r24, %r0, %g31 stqp,5 %r17, %r0, %g19 } { loop_mode stqp,2 %r15, %r0, %g22 stqp,5 %r14, %r0, %g25 } { loop_mode stqp,2 %r21, %r0, %g29 stqp,5 %r11, %r0, %g26 } { loop_mode ct %ctpr1 ? %NOT_LOOP_END alc alcf=1, alct=1 addd,0,sm 0x10, %r0, %r0 stqp,2 %r20, %r0, %r9 stqp,5 %r13, %r0, %g16 }
Theoretical speed: 32 complex numbers in 62 cycles (32 × 8 bytes / 62 cycles) = 4.13 bytes/cycle
Fourfold theoretical speed: 16.52 bytes/cycle
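The arithmetic behind these two figures can be reproduced with a few lines of C. This is only a sketch: the function name is hypothetical, and it assumes 8 bytes per element (a float complex made of two 4-byte floats, as in the problem statement).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: throughput in bytes per cycle for a loop that
   processes `elements` items of `bytes_per_element` bytes each
   in `cycles` processor cycles. */
static double bytes_per_cycle(size_t elements, size_t bytes_per_element,
                              size_t cycles)
{
    assert(cycles > 0);
    return (double)(elements * bytes_per_element) / (double)cycles;
}
```

Calling bytes_per_cycle(32, 8, 62) gives roughly 4.13, and multiplying the result by 4 gives roughly 16.52, matching the numbers above.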
Speed measurements

6. stage_radix4_2x_simd128_noConj_unroll2
Here the loop is unrolled by a factor of 2 using the #pragma unroll(2) directive (the previous variant used #pragma unroll(1)).
C code
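What a factor-2 unroll means can be shown by hand on a toy loop. The functions below are hypothetical and unrelated to the FFT code. They only illustrate the general transformation: two iterations' worth of work per pass, so the scheduler has more independent operations to pack into wide instructions. How lcc actually applies the pragma, including remainder handling, is not shown here. The sketch assumes an even element count.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: one element per loop pass. */
static void scale_rolled(float *dst, const float *src, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = k * src[i];
}

/* Manually unrolled by 2: two independent elements per loop pass.
   Assumes n is even, so no remainder loop is needed. */
static void scale_unrolled2(float *dst, const float *src, size_t n, float k)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        dst[i]     = k * src[i];
        dst[i + 1] = k * src[i + 1];
    }
}
```

Both functions produce identical results; only the amount of work per loop pass differs.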
void stage_radix4_2x_simd128_noConj_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b) { __v2di *xy0_in = (__v2di*)&data_in[ 0]; __v2di *zw0_in = (__v2di*)&data_in[ 2]; __v2di *xy1_in = (__v2di*)&data_in[ 4]; __v2di *zw1_in = (__v2di*)&data_in[ 6]; __v2di *xy2_in = (__v2di*)&data_in[ 8]; __v2di *zw2_in = (__v2di*)&data_in[10]; __v2di *xy3_in = (__v2di*)&data_in[12]; __v2di *zw3_in = (__v2di*)&data_in[14]; __v2di *xy4_in = (__v2di*)&data_in[16]; __v2di *zw4_in = (__v2di*)&data_in[18]; __v2di *xy5_in = (__v2di*)&data_in[20]; __v2di *zw5_in = (__v2di*)&data_in[22]; __v2di *xy6_in = (__v2di*)&data_in[24]; __v2di *zw6_in = (__v2di*)&data_in[26]; __v2di *xy7_in = (__v2di*)&data_in[28]; __v2di *zw7_in = (__v2di*)&data_in[30]; __v2di *c0a_in = (__v2di*)&coefC_a[0]; __v2di *c1a_in = (__v2di*)&coefC_a[2]; __v2di *c2a_in = (__v2di*)&coefC_a[4]; __v2di *c3a_in = (__v2di*)&coefC_a[6]; __v2di *d0a_in = (__v2di*)&coefD_a[0]; __v2di *d1a_in = (__v2di*)&coefD_a[2]; __v2di *d2a_in = (__v2di*)&coefD_a[4]; __v2di *d3a_in = (__v2di*)&coefD_a[6]; __v2di *e0a_in = (__v2di*)&coefE_a[0]; __v2di *e1a_in = (__v2di*)&coefE_a[2]; __v2di *e2a_in = (__v2di*)&coefE_a[4]; __v2di *e3a_in = (__v2di*)&coefE_a[6]; __v2di *c0b_in = (__v2di*)&coefC_b[0*data_count/16]; __v2di *c1b_in = (__v2di*)&coefC_b[1*data_count/16]; __v2di *c2b_in = (__v2di*)&coefC_b[2*data_count/16]; __v2di *c3b_in = (__v2di*)&coefC_b[3*data_count/16]; __v2di *d0b_in = (__v2di*)&coefD_b[0*data_count/16]; __v2di *d1b_in = (__v2di*)&coefD_b[1*data_count/16]; __v2di *d2b_in = (__v2di*)&coefD_b[2*data_count/16]; __v2di *d3b_in = (__v2di*)&coefD_b[3*data_count/16]; __v2di *e0b_in = (__v2di*)&coefE_b[0*data_count/16]; __v2di *e1b_in = (__v2di*)&coefE_b[1*data_count/16]; __v2di *e2b_in = (__v2di*)&coefE_b[2*data_count/16]; __v2di *e3b_in = (__v2di*)&coefE_b[3*data_count/16]; __v2di *out_0 = 
(__v2di*)&data_out[ 0*data_count/16];
    __v2di *out_1 = (__v2di*)&data_out[ 1*data_count/16];
    __v2di *out_2 = (__v2di*)&data_out[ 2*data_count/16];
    __v2di *out_3 = (__v2di*)&data_out[ 3*data_count/16];
    __v2di *out_4 = (__v2di*)&data_out[ 4*data_count/16];
    __v2di *out_5 = (__v2di*)&data_out[ 5*data_count/16];
    __v2di *out_6 = (__v2di*)&data_out[ 6*data_count/16];
    __v2di *out_7 = (__v2di*)&data_out[ 7*data_count/16];
    __v2di *out_8 = (__v2di*)&data_out[ 8*data_count/16];
    __v2di *out_9 = (__v2di*)&data_out[ 9*data_count/16];
    __v2di *out_10 = (__v2di*)&data_out[10*data_count/16];
    __v2di *out_11 = (__v2di*)&data_out[11*data_count/16];
    __v2di *out_12 = (__v2di*)&data_out[12*data_count/16];
    __v2di *out_13 = (__v2di*)&data_out[13*data_count/16];
    __v2di *out_14 = (__v2di*)&data_out[14*data_count/16];
    __v2di *out_15 = (__v2di*)&data_out[15*data_count/16];

    #pragma ivdep
    #pragma unroll(2)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/32; ++i) {
        __v2di xy0 = xy0_in[16*i];
        __v2di zw0 = zw0_in[16*i];
        __v2di xy1 = xy1_in[16*i];
        __v2di zw1 = zw1_in[16*i];
        __v2di c0 = c0a_in[4*i];
        __v2di d0 = d0a_in[4*i];
        __v2di e0 = e0a_in[4*i];
        __v2di xy2 = xy2_in[16*i];
        __v2di zw2 = zw2_in[16*i];
        __v2di xy3 = xy3_in[16*i];
        __v2di zw3 = zw3_in[16*i];
        __v2di c1 = c1a_in[4*i];
        __v2di d1 = d1a_in[4*i];
        __v2di e1 = e1a_in[4*i];
        __v2di xy4 = xy4_in[16*i];
        __v2di zw4 = zw4_in[16*i];
        __v2di xy5 = xy5_in[16*i];
        __v2di zw5 = zw5_in[16*i];
        __v2di c2 = c2a_in[4*i];
        __v2di d2 = d2a_in[4*i];
        __v2di e2 = e2a_in[4*i];
        __v2di xy6 = xy6_in[16*i];
        __v2di zw6 = zw6_in[16*i];
        __v2di xy7 = xy7_in[16*i];
        __v2di zw7 = zw7_in[16*i];
        __v2di c3 = c3a_in[4*i];
        __v2di d3 = d3a_in[4*i];
        __v2di e3 = e3a_in[4*i];

        __v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

        __v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

        __v2di cy0_real = __builtin_e2k_qpfmuls(c0, y0);
        __v2di cy1_real = __builtin_e2k_qpfmuls(c1, y1);
        __v2di cy2_real = __builtin_e2k_qpfmuls(c2, y2);
        __v2di cy3_real = __builtin_e2k_qpfmuls(c3, y3);
        __v2di dz0_real = __builtin_e2k_qpfmuls(d0, z0);
        __v2di dz1_real = __builtin_e2k_qpfmuls(d1, z1);
        __v2di dz2_real = __builtin_e2k_qpfmuls(d2, z2);
        __v2di dz3_real = __builtin_e2k_qpfmuls(d3, z3);
        __v2di ew0_real = __builtin_e2k_qpfmuls(e0, w0);
        __v2di ew1_real = __builtin_e2k_qpfmuls(e1, w1);
        __v2di ew2_real = __builtin_e2k_qpfmuls(e2, w2);
        __v2di ew3_real = __builtin_e2k_qpfmuls(e3, w3);

        __v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
        __v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
        __v2di cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
        __v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
        __v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
        __v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
        __v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
        __v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
        __v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
        __v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
        __v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
        __v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);

        __v2di cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);
        __v2di cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
        __v2di cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real);
        __v2di cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real);
        __v2di dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real);
        __v2di dz1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz1_real);
        __v2di dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real);
        __v2di dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real);
        __v2di ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real);
        __v2di ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real);
        __v2di ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real);
        __v2di ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real);

        __v2di cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
        __v2di cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);
        __v2di cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag);
        __v2di cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag);
        __v2di dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag);
        __v2di dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag);
        __v2di dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag);
        __v2di dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag);
        __v2di ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag);
        __v2di ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag);
        __v2di ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag);
        __v2di ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag);

        __v2di cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        __v2di cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        __v2di cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        __v2di cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        __v2di dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        __v2di dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        __v2di dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        __v2di dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        __v2di ew0 = __builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        __v2di ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        __v2di ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        __v2di ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});

        __v2di add02_0 = __builtin_e2k_qpfadds(x0, dz0);
        __v2di add02_1 = __builtin_e2k_qpfadds(x1, dz1);
        __v2di add02_2 = __builtin_e2k_qpfadds(x2, dz2);
        __v2di add02_3 = __builtin_e2k_qpfadds(x3, dz3);
        __v2di sub02_0 = __builtin_e2k_qpfsubs(x0, dz0);
        __v2di sub02_1 = __builtin_e2k_qpfsubs(x1, dz1);
        __v2di sub02_2 = __builtin_e2k_qpfsubs(x2, dz2);
        __v2di sub02_3 = __builtin_e2k_qpfsubs(x3, dz3);
        __v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
        __v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
        __v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
        __v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
        __v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
        __v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
        __v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
        __v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);

        __v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
        __v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
        __v2di sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
        __v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});

        __v2di out0 = __builtin_e2k_qpfadds(add02_0, add13_0);
        __v2di out1 = __builtin_e2k_qpfadds(add02_1, add13_1);
        __v2di out2 = __builtin_e2k_qpfadds(add02_2, add13_2);
        __v2di out3 = __builtin_e2k_qpfadds(add02_3, add13_3);
        __v2di out4 = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
        __v2di out5 = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
        __v2di out6 = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
        __v2di out7 = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
        __v2di out8 = __builtin_e2k_qpfsubs(add02_0, add13_0);
        __v2di out9 = __builtin_e2k_qpfsubs(add02_1, add13_1);
        __v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2);
        __v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3);
        __v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
        __v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
        __v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
        __v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3);

        xy0 = out0;
        zw0 = out1;
        xy1 = out2;
        zw1 = out3;
        c0 = c0b_in[i];
        d0 = d0b_in[i];
        e0 = e0b_in[i];
        xy2 = out4;
        zw2 = out5;
        xy3 = out6;
        zw3 = out7;
        c1 = c1b_in[i];
        d1 = d1b_in[i];
        e1 = e1b_in[i];
        xy4 = out8;
        zw4 = out9;
        xy5 = out10;
        zw5 = out11;
        c2 = c2b_in[i];
        d2 = d2b_in[i];
        e2 = e2b_in[i];
        xy6 = out12;
        zw6 = out13;
        xy7 = out14;
        zw7 = out15;
        c3 = c3b_in[i];
        d3 = d3b_in[i];
        e3 = e3b_in[i];

        x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
        y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
        w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
        y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
        w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
        y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
        w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
        y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
        w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

        swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});

        cy0_real = __builtin_e2k_qpfmuls(c0, y0);
        cy1_real = __builtin_e2k_qpfmuls(c1, y1);
        cy2_real = __builtin_e2k_qpfmuls(c2, y2);
        cy3_real = __builtin_e2k_qpfmuls(c3, y3);
        dz0_real = __builtin_e2k_qpfmuls(d0, z0);
        dz1_real = __builtin_e2k_qpfmuls(d1, z1);
        dz2_real = __builtin_e2k_qpfmuls(d2, z2);
        dz3_real = __builtin_e2k_qpfmuls(d3, z3);
        ew0_real = __builtin_e2k_qpfmuls(e0, w0);
        ew1_real = __builtin_e2k_qpfmuls(e1, w1);
        ew2_real = __builtin_e2k_qpfmuls(e2, w2);
        ew3_real = __builtin_e2k_qpfmuls(e3, w3);

        cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
        cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
        cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
        cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
        dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
        dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
        dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
        dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
        ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
        ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
        ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
        ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);

        cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);
        cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
        cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real);
        cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real);
        dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real);
        dz1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz1_real);
        dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real);
        dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real);
        ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real);
        ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real);
        ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real);
        ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real);

        cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
        cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);
        cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag);
        cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag);
        dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag);
        dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag);
        dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag);
        dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag);
        ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag);
        ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag);
        ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag);
        ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag);

        cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        ew0 = __builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
        ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});

        add02_0 = __builtin_e2k_qpfadds(x0, dz0);
        add02_1 = __builtin_e2k_qpfadds(x1, dz1);
        add02_2 = __builtin_e2k_qpfadds(x2, dz2);
        add02_3 = __builtin_e2k_qpfadds(x3, dz3);
        sub02_0 = __builtin_e2k_qpfsubs(x0, dz0);
        sub02_1 = __builtin_e2k_qpfsubs(x1, dz1);
        sub02_2 = __builtin_e2k_qpfsubs(x2, dz2);
        sub02_3 = __builtin_e2k_qpfsubs(x3, dz3);
        add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
        add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
        add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
        add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
        sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
        sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
        sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
        sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);

        swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
        sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
        sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
        sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});

        out_0[i] = __builtin_e2k_qpfadds(add02_0, add13_0);
        out_1[i] = __builtin_e2k_qpfadds(add02_1, add13_1);
        out_2[i] = __builtin_e2k_qpfadds(add02_2, add13_2);
        out_3[i] = __builtin_e2k_qpfadds(add02_3, add13_3);
        out_4[i] = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
        out_5[i] = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
        out_6[i] = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
        out_7[i] = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
        out_8[i] = __builtin_e2k_qpfsubs(add02_0, add13_0);
        out_9[i] = __builtin_e2k_qpfsubs(add02_1, add13_1);
        out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2);
        out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3);
        out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
        out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
        out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
        out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
    }
}
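The arithmetic in the listing above is easier to check against a scalar model. The sketch below is not part of the original code. It is a plain C illustration of one radix-4 butterfly that follows the same steps: multiply by the twiddles, form the `add02`/`sub02`/`add13`/`sub13` terms, multiply the last one by i, and combine. The type `cpx` and the functions `cpx_mul_lanes`, `cpx_mul_i`, `radix4_butterfly_ref` and `radix4_self_check` are hypothetical names introduced for this example. The model also assumes that the real part sits in the lower 32-bit lane and that the horizontal subtract yields `re*re - im*im`.

```c
#include <assert.h>

/* One complex float as a pair of 32-bit lanes (assumption: re in the lower lane). */
typedef struct { float re, im; } cpx;

static cpx cpx_add(cpx a, cpx b) { return (cpx){a.re + b.re, a.im + b.im}; }
static cpx cpx_sub(cpx a, cpx b) { return (cpx){a.re - b.re, a.im - b.im}; }

/* Complex product assembled the way the vector code appears to do it:
   lane-wise products, a horizontal subtract for the real part, and a
   horizontal add on the lane-swapped twiddle for the imaginary part. */
static cpx cpx_mul_lanes(cpx c, cpx y)
{
    cpx swap_c = {c.im, c.re};          /* role of the swap shuffle        */
    float real0 = c.re * y.re;          /* lanes of c * y                  */
    float real1 = c.im * y.im;
    float imag0 = swap_c.re * y.re;     /* lanes of swap_c * y             */
    float imag1 = swap_c.im * y.im;
    return (cpx){real0 - real1, imag0 + imag1};
}

/* Multiply by i: swap the lanes, then flip the sign of the new real lane
   (the role of the swap shuffle followed by the XOR with the sign bit). */
static cpx cpx_mul_i(cpx a) { return (cpx){-a.im, a.re}; }

/* Scalar reference for one radix-4 butterfly with twiddles c, d, e
   applied to inputs y, z, w respectively. */
static void radix4_butterfly_ref(cpx x, cpx y, cpx z, cpx w,
                                 cpx c, cpx d, cpx e, cpx out[4])
{
    cpx cy = cpx_mul_lanes(c, y);
    cpx dz = cpx_mul_lanes(d, z);
    cpx ew = cpx_mul_lanes(e, w);

    cpx add02  = cpx_add(x, dz);
    cpx sub02  = cpx_sub(x, dz);
    cpx add13  = cpx_add(cy, ew);
    cpx sub13i = cpx_mul_i(cpx_sub(cy, ew));

    out[0] = cpx_add(add02, add13);
    out[1] = cpx_sub(sub02, sub13i);
    out[2] = cpx_sub(add02, add13);
    out[3] = cpx_add(sub02, sub13i);
}

/* Self-check against a hand-computed 4-point DFT of (1, 2, 3, 4) with unit twiddles. */
void radix4_self_check(void)
{
    const cpx one = {1.0f, 0.0f};
    cpx out[4];
    radix4_butterfly_ref((cpx){1.0f, 0.0f}, (cpx){2.0f, 0.0f},
                         (cpx){3.0f, 0.0f}, (cpx){4.0f, 0.0f},
                         one, one, one, out);
    assert(out[0].re == 10.0f && out[0].im ==  0.0f);
    assert(out[1].re == -2.0f && out[1].im ==  2.0f);
    assert(out[2].re == -2.0f && out[2].im ==  0.0f);
    assert(out[3].re == -2.0f && out[3].im == -2.0f);
}
```

A reference like this can be run on any machine and compared element by element with the output of the vector version, which helps when changing shuffle masks or operand order in the intrinsics.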
Main loop in assembly
.L19508: { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=64 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=96 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=128 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=160 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=192 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=224 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=256 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=288 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=320 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=352 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=384 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=416 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=448 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=480 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=8, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=10, disp=64 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=10, disp=96 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=12, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=14, disp=64 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=14, disp=96 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, 
incr=3, ind=2, asz=1, abs=16, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=18, disp=64 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=18, disp=96 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=1, incr=2, ind=0, asz=1, abs=20, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=15, asz=1, abs=20, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=14, asz=1, abs=22, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=13, asz=1, abs=22, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=12, asz=1, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=11, asz=1, abs=24, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=10, asz=1, abs=26, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=9, asz=1, abs=26, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=8, asz=1, abs=28, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=7, asz=1, abs=28, disp=0 } { fapb ct=1, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=6, asz=1, abs=30, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=5, asz=1, abs=30, disp=0 } .L15504: { loop_mode disp %ctpr1, .L15504 movaqp,0 area=0, ind=0, am=1, be=0, %g17 movaqp,1 area=0, ind=16, am=0, be=0, %g16 movaqp,2 area=0, ind=0, am=1, be=0, %g19 movaqp,3 area=0, ind=16, am=0, be=0, %g18 } { loop_mode movaqp,0 area=1, ind=0, am=1, be=0, %g21 movaqp,1 area=1, ind=16, am=0, be=0, %g20 movaqp,2 area=1, ind=0, am=1, be=0, %g23 movaqp,3 area=1, ind=16, am=0, be=0, %g22 } { loop_mode movaqp,0 area=2, ind=0, am=1, be=0, %g25 movaqp,1 area=2, ind=16, am=0, be=0, %g24 movaqp,2 area=2, ind=0, am=1, be=0, %g27 movaqp,3 area=2, ind=16, am=0, be=0, %g26 } { loop_mode movaqp,0 area=3, ind=0, am=1, be=0, %g29 movaqp,1 area=3, ind=16, am=0, be=0, %g28 movaqp,2 area=3, ind=0, am=1, be=0, %g31 movaqp,3 area=3, ind=16, am=0, be=0, %g30 } { loop_mode movaqp,0 area=4, ind=0, am=1, be=0, %b[20] movaqp,1 area=4, ind=16, am=0, be=0, %b[19] movaqp,2 area=4, 
ind=0, am=1, be=0, %b[22] movaqp,3 area=4, ind=16, am=0, be=0, %b[21] } { loop_mode qpshufb,0 %g19, %g17, %r7, %b[27] qpshufb,1 %g18, %g16, %r7, %b[28] qpshufb,3 %g18, %g16, %r6, %g16 qpshufb,4 %g19, %g17, %r6, %g17 movaqp,0 area=5, ind=0, am=1, be=0, %b[24] movaqp,1 area=5, ind=16, am=0, be=0, %b[23] movaqp,2 area=5, ind=0, am=1, be=0, %b[26] movaqp,3 area=5, ind=16, am=0, be=0, %b[25] } { loop_mode qpshufb,0 %g23, %g21, %r7, %b[31] qpshufb,1 %g22, %g20, %r7, %b[32] qpshufb,3 %g22, %g20, %r6, %g20 qpshufb,4 %g23, %g21, %r6, %g21 movaqp,0 area=6, ind=0, am=1, be=0, %g19 movaqp,1 area=6, ind=16, am=0, be=0, %g18 movaqp,2 area=6, ind=0, am=1, be=0, %b[30] movaqp,3 area=6, ind=16, am=0, be=0, %b[29] } { loop_mode qpshufb,0 %g27, %g25, %r7, %b[35] qpshufb,1 %g26, %g24, %r7, %b[36] qpshufb,3 %g26, %g24, %r6, %g24 qpshufb,4 %g27, %g25, %r6, %g25 movaqp,0 area=7, ind=0, am=1, be=0, %g23 movaqp,1 area=7, ind=16, am=0, be=0, %g22 movaqp,2 area=7, ind=0, am=1, be=0, %b[34] movaqp,3 area=7, ind=16, am=0, be=0, %b[33] } { loop_mode qpshufb,0 %g31, %g29, %r7, %b[39] qpshufb,1 %g30, %g28, %r7, %b[40] qpshufb,3 %g30, %g28, %r6, %g28 qpshufb,4 %g31, %g29, %r6, %g29 movaqp,0 area=8, ind=0, am=1, be=0, %g27 movaqp,1 area=8, ind=16, am=0, be=0, %g26 movaqp,2 area=8, ind=0, am=1, be=0, %b[38] movaqp,3 area=8, ind=16, am=0, be=0, %b[37] } { loop_mode qpshufb,0 %b[22], %b[20], %r7, %b[43] qpshufb,1 %b[21], %b[19], %r7, %b[44] qpshufb,3 %b[21], %b[19], %r6, %b[19] qpshufb,4 %b[22], %b[20], %r6, %b[20] movaqp,0 area=9, ind=0, am=1, be=0, %g31 movaqp,1 area=9, ind=16, am=0, be=0, %g30 movaqp,2 area=9, ind=0, am=1, be=0, %b[42] movaqp,3 area=9, ind=16, am=0, be=0, %b[41] } { loop_mode qpshufb,0 %b[26], %b[24], %r7, %b[47] qpshufb,1 %b[25], %b[23], %r7, %b[48] qpshufb,3 %b[25], %b[23], %r6, %b[23] qpshufb,4 %b[26], %b[24], %r6, %b[24] movaqp,0 area=10, ind=0, am=1, be=0, %b[22] movaqp,1 area=10, ind=16, am=0, be=0, %b[21] movaqp,2 area=10, ind=0, am=1, be=0, %b[46] movaqp,3 area=10, ind=16, 
am=0, be=0, %b[45] } { loop_mode qpshufb,0 %b[30], %g19, %r7, %b[51] qpshufb,1 %b[29], %g18, %r7, %b[52] qpshufb,3 %b[29], %g18, %r6, %g18 qpshufb,4 %b[30], %g19, %r6, %g19 movaqp,0 area=11, ind=0, am=1, be=0, %b[26] movaqp,1 area=11, ind=16, am=0, be=0, %b[25] movaqp,2 area=11, ind=0, am=1, be=0, %b[50] movaqp,3 area=11, ind=16, am=0, be=0, %b[49] } { loop_mode qpshufb,0 %b[34], %g23, %r7, %b[55] qpshufb,1 %b[33], %g22, %r7, %b[56] qpshufb,3 %b[33], %g22, %r6, %g22 qpshufb,4 %b[34], %g23, %r6, %g23 movaqp,0 area=12, ind=0, am=1, be=0, %b[30] movaqp,1 area=12, ind=16, am=0, be=0, %b[29] movaqp,2 area=12, ind=0, am=1, be=0, %b[54] movaqp,3 area=12, ind=16, am=0, be=0, %b[53] } { loop_mode qpshufb,0 %g27, %g27, %r23, %b[59] qpshufb,1 %g26, %g26, %r23, %b[60] qpfmul_hsubs,2 %g27, %b[27], %r40, %g27 qpshufb,3 %b[38], %b[38], %r23, %b[61] qpshufb,4 %b[37], %b[37], %r23, %b[62] qpfmul_hsubs,5 %g26, %b[31], %r40, %g26 movaqp,0 area=13, ind=0, am=1, be=0, %b[34] movaqp,1 area=13, ind=16, am=0, be=0, %b[33] movaqp,2 area=13, ind=0, am=1, be=0, %b[58] movaqp,3 area=13, ind=16, am=0, be=0, %b[57] } { loop_mode qpshufb,0 %g31, %g31, %r23, %b[63] qpshufb,1 %g30, %g30, %r23, %b[64] qpfmul_hsubs,2 %b[38], %b[35], %r40, %b[38] qpshufb,3 %b[42], %b[42], %r23, %b[65] qpshufb,4 %b[41], %b[41], %r23, %b[66] qpfmul_hsubs,5 %b[37], %b[39], %r40, %b[37] movaqp,0 area=14, ind=0, am=1, be=0, %b[68] movaqp,1 area=14, ind=16, am=0, be=0, %b[67] movaqp,2 area=14, ind=0, am=1, be=0, %b[70] movaqp,3 area=14, ind=16, am=0, be=0, %b[69] } { loop_mode qpfmul_hsubs,0 %g31, %b[43], %r40, %g31 qpfmul_hsubs,1 %g30, %b[47], %r40, %g30 qpfmul_hsubs,2 %b[42], %b[51], %r40, %b[42] qpfmul_hadds,3 %b[61], %b[35], %r40, %b[35] qpfmul_hadds,4 %b[62], %b[39], %r40, %b[39] qpfmul_hadds,5 %b[65], %b[51], %r40, %b[51] movaqp,0 area=15, ind=0, am=1, be=0, %b[62] movaqp,1 area=15, ind=16, am=0, be=0, %b[61] movaqp,2 area=15, ind=0, am=1, be=0, %b[71] movaqp,3 area=15, ind=16, am=0, be=0, %b[65] } { loop_mode 
qpfmul_hadds,0 %b[59], %b[27], %r40, %b[27] qpfmul_hadds,1 %b[64], %b[47], %r40, %b[47] qpfmul_hadds,2 %b[60], %b[31], %r40, %b[31] qpfmul_hadds,3 %b[66], %b[55], %r40, %b[55] qpshufb,4 %b[21], %b[21], %r23, %b[59] qpfmul_hsubs,5 %b[41], %b[55], %r40, %b[41] movaqp,0 area=16, ind=0, am=1, be=0, %b[64] movaqp,1 area=16, ind=16, am=0, be=0, %b[60] movaqp,2 area=16, ind=0, am=1, be=0, %b[72] movaqp,3 area=16, ind=16, am=0, be=0, %b[66] } { loop_mode qpshufb,0 %b[30], %b[30], %r23, %b[73] qpshufb,1 %b[29], %b[29], %r23, %b[74] qpfmul_hsubs,2 %b[29], %b[32], %r40, %b[29] qpshufb,3 %b[54], %b[54], %r23, %b[75] qpshufb,4 %b[53], %b[53], %r23, %b[76] qpfmul_hsubs,5 %b[30], %b[28], %r40, %b[30] movaqp,0 area=17, ind=0, am=1, be=0, %b[78] movaqp,1 area=17, ind=16, am=0, be=0, %b[77] movaqp,2 area=17, ind=0, am=1, be=0, %b[80] movaqp,3 area=17, ind=16, am=0, be=0, %b[79] } { loop_mode qpshufb,0 %b[34], %b[34], %r23, %b[81] qpshufb,1 %b[33], %b[33], %r23, %b[82] qpfmul_hadds,2 %b[63], %b[43], %r40, %b[43] qpshufb,3 %b[58], %b[58], %r23, %b[83] qpshufb,4 %b[57], %b[57], %r23, %b[84] qpfmul_hsubs,5 %b[54], %b[36], %r40, %b[54] movaqp,0 area=18, ind=0, am=1, be=0, %b[85] movaqp,1 area=18, ind=16, am=0, be=0, %b[63] movaqp,2 area=18, ind=0, am=1, be=0, %b[87] movaqp,3 area=18, ind=16, am=0, be=0, %b[86] } { loop_mode qpfmul_hsubs,0 %b[53], %b[40], %r40, %b[53] qpfmul_hsubs,1 %b[34], %b[44], %r40, %b[34] qpfmul_hsubs,2 %b[33], %b[48], %r40, %b[33] qpfmul_hsubs,3 %b[58], %b[52], %r40, %b[58] qpfmul_hsubs,4 %b[57], %b[56], %r40, %b[57] qpfmul_hadds,5 %b[75], %b[36], %r40, %b[36] movaqp,0 area=19, ind=0, am=1, be=0, %b[88] movaqp,1 area=19, ind=16, am=0, be=0, %b[75] movaqp,2 area=19, ind=0, am=1, be=0, %b[90] movaqp,3 area=19, ind=16, am=0, be=0, %b[89] } { loop_mode qpfmul_hadds,0 %b[73], %b[28], %r40, %b[28] qpfmul_hadds,1 %b[74], %b[32], %r40, %b[32] qpfmul_hadds,2 %b[82], %b[48], %r40, %b[48] qpfmul_hadds,3 %b[84], %b[56], %r40, %b[56] qpfmul_hadds,4 %b[83], %b[52], %r40, %b[52] 
qpfmul_hadds,5 %b[76], %b[40], %r40, %b[40] } { loop_mode qpfmul_hsubs,0 %b[21], %g20, %r40, %b[21] qpfmul_hsubs,1 %b[22], %g16, %r40, %b[73] qpfmul_hadds,2 %b[81], %b[44], %r40, %b[44] qpfmul_hsubs,3 %b[46], %g24, %r40, %b[74] qpfmul_hsubs,4 %b[45], %g28, %r40, %b[76] qpfmul_hsubs,5 %b[25], %b[23], %r40, %b[81] } { loop_mode qpfmul_hsubs,0 %b[50], %g18, %r40, %b[82] qpfmul_hsubs,1 %b[49], %g22, %r40, %b[83] qpfmul_hsubs,2 %b[26], %b[19], %r40, %b[59] qpshufb,4 %b[22], %b[22], %r23, %b[22] qpfmul_hadds,5 %b[59], %g20, %r40, %g20 } { loop_mode qpshufb,0 %b[46], %b[46], %r23, %b[46] qpshufb,1 %b[45], %b[45], %r23, %b[45] qpshufb,3 %b[26], %b[26], %r23, %b[26] qpshufb,4 %b[25], %b[25], %r23, %b[25] qpfmul_hadds,5 %b[22], %g16, %r40, %g16 } { loop_mode qpshufb,0 %b[50], %b[50], %r23, %b[22] qpshufb,1 %b[49], %b[49], %r23, %b[49] qpfmul_hadds,2 %b[45], %g28, %r40, %g28 qpfmul_hadds,3 %b[25], %b[23], %r40, %b[23] qppermb,4 %b[39], %b[37], %r24, %b[25] qpfmul_hadds,5 %b[26], %b[19], %r40, %b[19] } { loop_mode nop 2 qpfmul_hadds,0 %b[22], %g18, %r40, %g18 qpfmul_hadds,1 %b[49], %g22, %r40, %g22 qpfmul_hadds,2 %b[46], %g24, %r40, %g24 } { loop_mode qppermb,3 %b[27], %g27, %r24, %g27 qppermb,4 %b[31], %g26, %r24, %g26 } { loop_mode qppermb,0 %b[35], %b[38], %r24, %b[22] qppermb,1 %b[51], %b[42], %r24, %b[26] qppermb,3 %b[47], %g30, %r24, %g30 qppermb,4 %b[55], %b[41], %r24, %b[27] } { loop_mode qppermb,0 %b[32], %b[29], %r24, %b[29] qppermb,1 %b[43], %g31, %r24, %g31 qppermb,4 %b[28], %b[30], %r24, %b[28] } { loop_mode qppermb,3 %b[40], %b[53], %r24, %b[30] qppermb,4 %b[48], %b[33], %r24, %b[31] qpfsubs,5 %g27, %b[28], %b[32] } { loop_mode qppermb,0 %b[56], %b[57], %r24, %b[33] qppermb,1 %b[36], %b[54], %r24, %b[35] qpfsubs,2 %g26, %b[29], %b[37] qppermb,3 %b[44], %b[34], %r24, %b[34] qppermb,4 %b[52], %b[58], %r24, %b[36] qpfsubs,5 %b[25], %b[30], %b[38] } { loop_mode qpfsubs,0 %b[27], %b[33], %b[42] qppermb,1 %g16, %b[73], %r24, %g16 qpfsubs,2 %b[22], %b[35], %b[39] 
qpfsubs,3 %b[26], %b[36], %b[41] qppermb,4 %g20, %b[21], %r24, %g20 qpfsubs,5 %g30, %b[31], %b[40] } { loop_mode qppermb,0 %g28, %b[76], %r24, %g28 qppermb,1 %g24, %b[74], %r24, %g24 qpfadds,2 %g27, %b[28], %g27 qppermb,3 %b[23], %b[81], %r24, %b[23] qppermb,4 %b[19], %b[59], %r24, %b[19] qpfsubs,5 %g31, %b[34], %b[21] } { loop_mode qpfadds,0 %g26, %b[29], %g26 qppermb,1 %g22, %b[83], %r24, %g22 qpfadds,2 %b[25], %b[30], %b[25] qpfadds,3 %g30, %b[31], %g30 qppermb,4 %g18, %b[82], %r24, %g18 qpfsubs,5 %g21, %g20, %b[28] } { loop_mode qpfadds,0 %b[27], %b[33], %b[27] qpfadds,1 %g17, %g16, %b[29] qpfsubs,2 %g17, %g16, %g16 qpfadds,3 %b[22], %b[35], %g17 qpfadds,4 %g31, %b[34], %g31 qpfadds,5 %b[26], %b[36], %b[22] } { loop_mode qpfadds,0 %g21, %g20, %g20 qpfadds,1 %g29, %g28, %g21 qpfsubs,2 %g29, %g28, %g28 qpfadds,3 %b[24], %b[23], %g29 qpfsubs,4 %b[24], %b[23], %b[23] qpfadds,5 %b[20], %b[19], %b[24] } { loop_mode qpfadds,0 %g25, %g24, %b[26] qpfsubs,1 %g25, %g24, %g24 qpfsubs,2 %b[20], %b[19], %g25 qpfsubs,3 %g19, %g18, %g18 qpfadds,5 %g19, %g18, %b[19] } { loop_mode qpfsubs,2 %g23, %g22, %g19 qpfadds,5 %g23, %g22, %g22 } { loop_mode qpfadds,0 %b[29], %g27, %g27 qpfsubs,2 %b[29], %g27, %b[20] qpshufb,4 %b[32], %b[32], %r23, %g23 } { loop_mode qpshufb,0 %b[37], %b[37], %r23, %b[29] qpshufb,1 %b[38], %b[38], %r23, %b[30] qpfsubs,2 %g20, %g26, %b[33] qpshufb,3 %b[40], %b[40], %r23, %b[31] qpshufb,4 %b[42], %b[42], %r23, %b[32] qpfadds,5 %g29, %g30, %b[34] } { loop_mode qpfadds,0 %g20, %g26, %g20 qpshufb,1 %b[39], %b[39], %r23, %b[35] qpfadds,2 %b[26], %g17, %g26 qpshufb,3 %b[21], %b[21], %r23, %b[21] qpshufb,4 %b[41], %b[41], %r23, %b[36] qpfsubs,5 %g29, %g30, %g29 } { loop_mode qpxor,0 %b[29], %r22, %g30 qpxor,1 %b[30], %r22, %b[29] qpfadds,2 %g21, %b[25], %b[31] qpxor,3 %g23, %r22, %g23 qpxor,4 %b[31], %r22, %b[30] qpfsubs,5 %g21, %b[25], %g21 } { loop_mode qpfsubs,0 %b[26], %g17, %g17 qpxor,1 %b[35], %r22, %b[32] qpfsubs,2 %b[24], %g31, %b[26] qpxor,3 %b[32], %r22, 
%b[25] qpxor,4 %b[21], %r22, %b[21] qpfadds,5 %b[24], %g31, %g31 } { loop_mode qpfsubs,0 %g22, %b[27], %b[35] qpfadds,1 %b[19], %b[22], %b[36] qpfadds,2 %g22, %b[27], %g22 qpxor,3 %b[36], %r22, %b[24] qpfsubs,4 %b[19], %b[22], %b[19] qpfsubs,5 %g16, %g23, %b[22] } { loop_mode qpfsubs,0 %b[28], %g30, %b[27] qpfadds,1 %b[28], %g30, %g30 qpfadds,2 %g28, %b[29], %g23 qpfadds,3 %g16, %g23, %g16 qpfsubs,4 %b[23], %b[30], %b[28] qpfadds,5 %b[23], %b[30], %b[23] } { loop_mode qpfsubs,0 %g28, %b[29], %g28 qpfsubs,1 %g24, %b[32], %b[25] qpfadds,2 %g24, %b[32], %g24 qpfadds,3 %g19, %b[25], %b[30] qpfsubs,4 %g19, %b[25], %g19 qpfsubs,5 %g25, %b[21], %b[29] } { loop_mode nop 1 qpfadds,2 %g25, %b[21], %g25 qpfsubs,3 %g18, %b[24], %g18 qpfadds,5 %g18, %b[24], %b[21] } { loop_mode qpshufb,0 %b[68], %b[68], %r23, %b[24] qpshufb,1 %b[67], %b[67], %r23, %b[32] qpshufb,4 %b[62], %b[62], %r23, %b[37] } { loop_mode qpshufb,0 %b[65], %b[65], %r23, %b[38] qpshufb,1 %b[61], %b[61], %r23, %b[39] qpshufb,3 %b[71], %b[71], %r23, %b[40] qpshufb,4 %b[72], %b[72], %r23, %b[41] } { loop_mode qpshufb,0 %b[77], %b[77], %r23, %b[42] qpshufb,1 %b[66], %b[66], %r23, %b[43] qpshufb,3 %b[78], %b[78], %r23, %b[44] qpshufb,4 %b[85], %b[85], %r23, %b[45] } { loop_mode qpshufb,0 %b[86], %b[86], %r23, %b[46] qpshufb,1 %b[63], %b[63], %r23, %b[47] qpshufb,3 %b[87], %b[87], %r23, %b[48] qpshufb,4 %b[90], %b[90], %r23, %b[49] } { loop_mode qpshufb,0 %b[89], %b[89], %r23, %b[50] qpshufb,1 %g21, %b[33], %r7, %b[51] qpshufb,3 %g26, %g27, %r7, %b[52] qpshufb,4 %b[31], %g20, %r7, %b[53] } { loop_mode qpshufb,0 %g17, %b[20], %r7, %b[54] qpshufb,1 %b[35], %g29, %r7, %b[55] qpfmul_hsubs,2 %b[85], %b[51], %r40, %b[58] qpshufb,3 %b[36], %g31, %r7, %b[56] qpshufb,4 %g22, %b[34], %r7, %b[57] qpfmul_hadds,5 %b[37], %b[53], %r40, %b[37] } { loop_mode qpshufb,0 %b[19], %b[26], %r7, %b[59] qpshufb,1 %g23, %g30, %r7, %b[73] qpfmul_hadds,2 %b[45], %b[51], %r40, %b[45] qpshufb,3 %g28, %b[27], %r7, %b[74] qpshufb,4 %g24, %g16, 
%r7, %b[76] qpfmul_hsubs,5 %b[62], %b[53], %r40, %b[51] } { loop_mode qpshufb,0 %g19, %b[28], %r7, %b[53] qpshufb,1 %b[30], %b[23], %r7, %b[62] qpfmul_hadds,2 %b[44], %b[54], %r40, %b[44] qpshufb,3 %b[25], %b[22], %r7, %b[81] qpshufb,4 %g18, %b[29], %r7, %b[82] qpfmul_hadds,5 %b[24], %b[52], %r40, %b[24] } { loop_mode qpshufb,0 %b[21], %g25, %r7, %b[83] qpfmul_hsubs,1 %b[68], %b[52], %r40, %b[52] qpfmul_hsubs,2 %b[63], %b[55], %r40, %b[63] qpfmul_hsubs,3 %b[67], %b[56], %r40, %b[67] qpfmul_hadds,4 %b[32], %b[56], %r40, %b[32] qpfmul_hadds,5 %b[39], %b[57], %r40, %b[39] } { loop_mode qpfmul_hsubs,0 %b[78], %b[54], %r40, %b[54] qpfmul_hadds,1 %b[47], %b[55], %r40, %b[47] qpfmul_hadds,2 %b[49], %b[73], %r40, %b[49] qpfmul_hsubs,3 %b[61], %b[57], %r40, %b[55] qpfmul_hsubs,4 %b[87], %b[76], %r40, %b[56] qpfmul_hsubs,5 %b[72], %b[74], %r40, %b[57] } { loop_mode qpfmul_hsubs,0 %b[77], %b[59], %r40, %b[61] qpfmul_hadds,1 %b[42], %b[59], %r40, %b[42] qpfmul_hsubs,2 %b[90], %b[73], %r40, %b[59] qpfmul_hadds,3 %b[48], %b[76], %r40, %b[48] qpfmul_hadds,4 %b[41], %b[74], %r40, %b[41] qpfmul_hsubs,5 %b[71], %b[81], %r40, %b[68] } { loop_mode qpfmul_hsubs,0 %b[66], %b[53], %r40, %b[66] qpfmul_hadds,1 %b[43], %b[53], %r40, %b[43] qpfmul_hsubs,2 %b[89], %b[62], %r40, %b[53] qpfmul_hadds,3 %b[50], %b[62], %r40, %b[50] qpfmul_hadds,4 %b[40], %b[81], %r40, %b[40] qpfmul_hsubs,5 %b[65], %b[82], %r40, %b[62] } { loop_mode qpfmul_hadds,0 %b[46], %b[83], %r40, %b[46] qpshufb,1 %b[70], %b[70], %r23, %b[71] qpfmul_hsubs,2 %b[86], %b[83], %r40, %b[65] qpshufb,3 %b[64], %b[64], %r23, %b[72] qpshufb,4 %b[69], %b[69], %r23, %b[73] qpfmul_hadds,5 %b[38], %b[82], %r40, %b[38] } { loop_mode qpshufb,0 %b[60], %b[60], %r23, %b[74] qpshufb,1 %b[80], %b[80], %r23, %b[76] qpshufb,3 %b[79], %b[79], %r23, %b[77] qpshufb,4 %b[75], %b[75], %r23, %b[78] } { loop_mode nop 3 qpshufb,0 %b[88], %b[88], %r23, %b[81] } { loop_mode qpshufb,1 %b[31], %g20, %r6, %g20 qpshufb,3 %g21, %b[33], %r6, %g21 qpshufb,4 
%b[35], %g29, %r6, %g29 } { loop_mode qpshufb,0 %g22, %b[34], %r6, %g22 qpshufb,1 %g28, %b[27], %r6, %g28 qpfmul_hadds,2 %b[71], %g20, %r40, %b[23] qpshufb,3 %g23, %g30, %r6, %g23 qpshufb,4 %b[30], %b[23], %r6, %g30 qpfmul_hadds,5 %b[76], %g21, %r40, %b[27] } { loop_mode qpshufb,0 %g19, %b[28], %r6, %g19 qpfmul_hsubs,1 %b[70], %g20, %r40, %g20 qpfmul_hadds,2 %b[73], %g22, %r40, %b[30] qpfmul_hsubs,3 %b[80], %g21, %r40, %g21 qpfmul_hsubs,4 %b[79], %g29, %r40, %b[28] qpfmul_hadds,5 %b[77], %g29, %r40, %g29 } { loop_mode qpfmul_hsubs,0 %b[69], %g22, %r40, %g22 qpfmul_hadds,1 %b[72], %g28, %r40, %b[33] qpfmul_hsubs,2 %b[64], %g28, %r40, %g28 qpfmul_hsubs,3 %b[88], %g23, %r40, %b[31] qpfmul_hadds,4 %b[81], %g23, %r40, %g23 qpfmul_hsubs,5 %b[75], %g30, %r40, %b[34] } { loop_mode qpfmul_hadds,0 %b[74], %g19, %r40, %g19 qppermb,1 %b[45], %b[58], %r24, %b[45] qpfmul_hsubs,2 %b[60], %g19, %r40, %b[35] qppermb,3 %b[24], %b[52], %r24, %b[24] qppermb,4 %b[44], %b[54], %r24, %b[44] qpfmul_hadds,5 %b[78], %g30, %r40, %g30 } { loop_mode qppermb,0 %b[37], %b[51], %r24, %b[37] qppermb,1 %b[32], %b[67], %r24, %b[32] qppermb,3 %b[42], %b[61], %r24, %b[42] qppermb,4 %b[39], %b[55], %r24, %b[39] } { loop_mode qppermb,0 %b[47], %b[63], %r24, %b[47] qppermb,1 %b[49], %b[59], %r24, %b[49] qppermb,3 %b[40], %b[68], %r24, %b[40] qppermb,4 %b[48], %b[56], %r24, %b[48] } { loop_mode qppermb,0 %b[43], %b[66], %r24, %b[43] qppermb,1 %b[50], %b[53], %r24, %b[50] qppermb,3 %b[38], %b[62], %r24, %b[38] qppermb,4 %b[41], %b[57], %r24, %b[41] } { loop_mode qppermb,0 %b[46], %b[65], %r24, %b[46] qpfsubs,1 %b[24], %b[37], %b[51] qpfsubs,3 %b[44], %b[45], %b[52] qpfsubs,4 %b[40], %b[41], %b[53] } { loop_mode qpfsubs,0 %b[42], %b[47], %b[55] qpfsubs,1 %b[46], %b[50], %b[56] qpfsubs,2 %b[32], %b[39], %b[54] qpfadds,3 %b[24], %b[37], %b[24] qpfadds,4 %b[44], %b[45], %b[37] qpfadds,5 %b[32], %b[39], %b[32] } { loop_mode qpfadds,0 %b[48], %b[49], %b[44] qpfadds,1 %b[42], %b[47], %b[42] qpfsubs,2 %b[48], 
%b[49], %b[39] qpfadds,3 %b[40], %b[41], %b[40] } { loop_mode qpfadds,0 %b[38], %b[43], %b[38] qpfadds,1 %b[46], %b[50], %b[43] qpfsubs,2 %b[38], %b[43], %b[41] } { loop_mode qpshufb,4 %g26, %g27, %r6, %g26 } { loop_mode qpshufb,3 %g17, %b[20], %r6, %g17 qpshufb,4 %b[36], %g31, %r6, %g27 } { loop_mode qpshufb,0 %b[19], %b[26], %r6, %g31 qpshufb,1 %g24, %g16, %r6, %g16 qpshufb,3 %b[25], %b[22], %r6, %g24 qpshufb,4 %g18, %b[29], %r6, %g18 } { loop_mode qpshufb,0 %b[21], %g25, %r6, %g25 qppermb,1 %b[23], %g20, %r24, %g20 qppermb,3 %b[27], %g21, %r24, %g21 qppermb,4 %g29, %b[28], %r24, %g29 } { loop_mode qppermb,0 %b[30], %g22, %r24, %g22 qppermb,1 %b[33], %g28, %r24, %g28 qpfsubs,2 %g26, %g20, %b[19] qppermb,3 %g23, %b[31], %r24, %g23 qppermb,4 %g30, %b[34], %r24, %g30 qpfsubs,5 %g17, %g21, %b[20] } { loop_mode qppermb,0 %g19, %b[35], %r24, %g19 qpfsubs,1 %g27, %g22, %g21 qpfadds,2 %g26, %g20, %g20 qpshufb,3 %b[51], %b[51], %r23, %g26 qpshufb,4 %b[52], %b[52], %r23, %b[21] qpfadds,5 %g17, %g21, %g17 } { loop_mode qpfadds,0 %g27, %g22, %g22 qpfsubs,1 %g24, %g28, %g31 qpfadds,2 %g24, %g28, %g24 qpfadds,3 %g31, %g29, %b[22] qpfsubs,4 %g31, %g29, %g29 qpfadds,5 %g16, %g23, %g27 } { loop_mode nop 1 qpfsubs,0 %g18, %g19, %g18 qpfadds,2 %g18, %g19, %g28 qpfadds,3 %g25, %g30, %g23 qpfsubs,4 %g25, %g30, %g25 qpfsubs,5 %g16, %g23, %g16 } { loop_mode qpfadds,0 %g20, %b[24], %g30 qpshufb,1 %b[54], %b[54], %r23, %g19 qpfsubs,2 %g20, %b[24], %g20 qpfadds,3 %g17, %b[37], %b[23] qpfsubs,4 %g17, %b[37], %g17 } { loop_mode qpshufb,0 %b[39], %b[39], %r23, %b[24] qpshufb,1 %b[41], %b[41], %r23, %b[25] qpfsubs,2 %g22, %b[32], %b[28] qpshufb,3 %b[55], %b[55], %r23, %b[26] qpshufb,4 %b[53], %b[53], %r23, %b[27] qpfsubs,5 %b[22], %b[42], %b[29] } { loop_mode qpfadds,0 %g22, %b[32], %g22 qpshufb,1 %b[56], %b[56], %r23, %b[30] qpfsubs,2 %g24, %b[40], %b[33] qpfadds,3 %b[22], %b[42], %b[22] qpfsubs,4 %g27, %b[44], %b[31] qpfadds,5 %g23, %b[43], %b[32] } { loop_mode qpxor,0 %g26, %r22, %g26 
qpxor,1 %g19, %r22, %g19 qpfadds,2 %g27, %b[44], %g27 qpxor,3 %b[21], %r22, %b[21] qpxor,4 %b[26], %r22, %b[26] qpfsubs,5 %g23, %b[43], %g23 } { loop_mode qpfadds,0 %g24, %b[40], %g24 qpxor,1 %b[25], %r22, %b[25] qpfsubs,2 %b[19], %g26, %b[34] qpfadds,3 %g28, %b[38], %b[35] qpfsubs,4 %g28, %b[38], %g28 qpfsubs,5 %b[20], %b[21], %b[36] } { loop_mode qpxor,0 %b[24], %r22, %b[24] qpxor,1 %b[27], %r22, %b[27] qpfadds,2 %b[19], %g26, %g26 qpfadds,3 %b[20], %b[21], %b[19] qpfadds,4 %g29, %b[26], %b[20] qpfsubs,5 %g29, %b[26], %g29 } { loop_mode qpfsubs,0 %g21, %g19, %b[26] qpxor,1 %b[30], %r22, %b[21] qpfadds,2 %g21, %g19, %g19 stqp,5 %r16, %r0, %g17 } { loop_mode qpfadds,0 %g16, %b[24], %g17 qpfsubs,1 %g18, %b[25], %g21 qpfadds,2 %g18, %b[25], %g18 stqp,5 %r20, %r0, %b[23] } { loop_mode qpfsubs,0 %g16, %b[24], %g16 qpfsubs,1 %g31, %b[27], %b[23] qpfadds,2 %g31, %b[27], %g31 stqp,5 %r18, %r0, %g20 } { loop_mode qpfadds,0 %g25, %b[21], %g20 qpfsubs,1 %g25, %b[21], %g25 stqp,2 %r2, %r0, %g30 stqp,5 %r36, %r0, %b[29] } { loop_mode stqp,2 %r25, %r0, %b[22] stqp,5 %r32, %r0, %b[28] } { loop_mode stqp,2 %r27, %r0, %g22 stqp,5 %r3, %r0, %b[31] } { loop_mode stqp,2 %r9, %r0, %b[32] stqp,5 %r35, %r0, %g23 } { loop_mode stqp,2 %r17, %r0, %g27 stqp,5 %r13, %r0, %b[33] } { loop_mode stqp,2 %r21, %r0, %g24 stqp,5 %r31, %r0, %g28 } { loop_mode stqp,2 %r15, %r0, %g26 stqp,5 %r26, %r0, %b[35] } { loop_mode stqp,2 %r19, %r0, %b[34] stqp,5 %r34, %r0, %g19 } { loop_mode stqp,2 %r30, %r0, %b[26] stqp,5 %r39, %r0, %g18 } { loop_mode stqp,2 %r5, %r0, %b[19] stqp,5 %r29, %r0, %g21 } { loop_mode stqp,2 %r11, %r0, %b[36] stqp,5 %r4, %r0, %g17 } { loop_mode stqp,2 %r38, %r0, %b[20] stqp,5 %r14, %r0, %g16 } { loop_mode stqp,2 %r28, %r0, %g29 stqp,5 %r1, %r0, %g31 } { loop_mode stqp,2 %r12, %r0, %b[23] stqp,5 %r37, %r0, %g20 } { loop_mode ct %ctpr1 ? %NOT_LOOP_END alc alcf=1, alct=1 stqp,2 %r33, %r0, %g25 addd,3,sm %r0, _f16s,_lts0lo 0x20, %r0 }
Theoretical speed: 64 complex numbers (8 bytes each) in 106 cycles, that is 64 × 8 / 106 = 4.83 bytes/cycle
Quadruple theoretical speed: 19.32 bytes/cycle
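The arithmetic behind these two figures can be reproduced with a few lines of C. This is only a sketch: the helper name `bytes_per_cycle` is hypothetical and does not come from the article's code, and it assumes that `float complex` occupies 8 bytes, as stated in the problem statement.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical helper (not part of the article's code):
   bytes processed per cycle when `complex_count` values of type
   float complex are handled in `cycles` processor cycles. */
double bytes_per_cycle(unsigned complex_count, unsigned cycles)
{
    assert(cycles > 0);
    return (double)(complex_count * sizeof(float complex)) / (double)cycles;
}
```

Calling `bytes_per_cycle(64, 106)` gives roughly 4.83, and multiplying that result by 4 gives roughly 19.32, which matches the quadruple figure.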
Speed measurements

7. stage_radix4_2x_simd128_noConj_unroll3
Here the loop is unrolled by a factor of 3 with the `#pragma unroll(3)` directive.
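For readers who have not met loop unrolling before, here is a hand-written illustration of what "unrolled by 3" means at the source level. It is a sketch on a deliberately trivial loop: the function names `scale_simple` and `scale_unroll3` are hypothetical, and the code the compiler actually generates for the stage function (a software-pipelined loop) looks quite different.

```c
#include <stddef.h>

/* Reference loop: one element per iteration. */
void scale_simple(float *data, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i)
        data[i] *= k;
}

/* The same loop unrolled by 3 by hand: three independent bodies per
   iteration, followed by a tail loop for the remaining 0..2 elements. */
void scale_unroll3(float *data, size_t n, float k)
{
    size_t i = 0;
    for (; i + 3 <= n; i += 3) {
        data[i]     *= k;
        data[i + 1] *= k;
        data[i + 2] *= k;
    }
    for (; i < n; ++i)
        data[i] *= k;
}
```

Both functions produce the same result. The unrolled variant simply gives the scheduler three independent loop bodies to pack into wide instructions per iteration.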
C code
void stage_radix4_2x_simd128_noConj_unroll3(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b) { __v2di *xy0_in = (__v2di*)&data_in[ 0]; __v2di *zw0_in = (__v2di*)&data_in[ 2]; __v2di *xy1_in = (__v2di*)&data_in[ 4]; __v2di *zw1_in = (__v2di*)&data_in[ 6]; __v2di *xy2_in = (__v2di*)&data_in[ 8]; __v2di *zw2_in = (__v2di*)&data_in[10]; __v2di *xy3_in = (__v2di*)&data_in[12]; __v2di *zw3_in = (__v2di*)&data_in[14]; __v2di *xy4_in = (__v2di*)&data_in[16]; __v2di *zw4_in = (__v2di*)&data_in[18]; __v2di *xy5_in = (__v2di*)&data_in[20]; __v2di *zw5_in = (__v2di*)&data_in[22]; __v2di *xy6_in = (__v2di*)&data_in[24]; __v2di *zw6_in = (__v2di*)&data_in[26]; __v2di *xy7_in = (__v2di*)&data_in[28]; __v2di *zw7_in = (__v2di*)&data_in[30]; __v2di *c0a_in = (__v2di*)&coefC_a[0]; __v2di *c1a_in = (__v2di*)&coefC_a[2]; __v2di *c2a_in = (__v2di*)&coefC_a[4]; __v2di *c3a_in = (__v2di*)&coefC_a[6]; __v2di *d0a_in = (__v2di*)&coefD_a[0]; __v2di *d1a_in = (__v2di*)&coefD_a[2]; __v2di *d2a_in = (__v2di*)&coefD_a[4]; __v2di *d3a_in = (__v2di*)&coefD_a[6]; __v2di *e0a_in = (__v2di*)&coefE_a[0]; __v2di *e1a_in = (__v2di*)&coefE_a[2]; __v2di *e2a_in = (__v2di*)&coefE_a[4]; __v2di *e3a_in = (__v2di*)&coefE_a[6]; __v2di *c0b_in = (__v2di*)&coefC_b[0*data_count/16]; __v2di *c1b_in = (__v2di*)&coefC_b[1*data_count/16]; __v2di *c2b_in = (__v2di*)&coefC_b[2*data_count/16]; __v2di *c3b_in = (__v2di*)&coefC_b[3*data_count/16]; __v2di *d0b_in = (__v2di*)&coefD_b[0*data_count/16]; __v2di *d1b_in = (__v2di*)&coefD_b[1*data_count/16]; __v2di *d2b_in = (__v2di*)&coefD_b[2*data_count/16]; __v2di *d3b_in = (__v2di*)&coefD_b[3*data_count/16]; __v2di *e0b_in = (__v2di*)&coefE_b[0*data_count/16]; __v2di *e1b_in = (__v2di*)&coefE_b[1*data_count/16]; __v2di *e2b_in = (__v2di*)&coefE_b[2*data_count/16]; __v2di *e3b_in = (__v2di*)&coefE_b[3*data_count/16]; __v2di *out_0 = 
(__v2di*)&data_out[ 0*data_count/16]; __v2di *out_1 = (__v2di*)&data_out[ 1*data_count/16]; __v2di *out_2 = (__v2di*)&data_out[ 2*data_count/16]; __v2di *out_3 = (__v2di*)&data_out[ 3*data_count/16]; __v2di *out_4 = (__v2di*)&data_out[ 4*data_count/16]; __v2di *out_5 = (__v2di*)&data_out[ 5*data_count/16]; __v2di *out_6 = (__v2di*)&data_out[ 6*data_count/16]; __v2di *out_7 = (__v2di*)&data_out[ 7*data_count/16]; __v2di *out_8 = (__v2di*)&data_out[ 8*data_count/16]; __v2di *out_9 = (__v2di*)&data_out[ 9*data_count/16]; __v2di *out_10 = (__v2di*)&data_out[10*data_count/16]; __v2di *out_11 = (__v2di*)&data_out[11*data_count/16]; __v2di *out_12 = (__v2di*)&data_out[12*data_count/16]; __v2di *out_13 = (__v2di*)&data_out[13*data_count/16]; __v2di *out_14 = (__v2di*)&data_out[14*data_count/16]; __v2di *out_15 = (__v2di*)&data_out[15*data_count/16]; #pragma ivdep #pragma unroll(3) #pragma prefetch for(int64_t i = 0; i < data_count/32; ++i) { __v2di xy0 = xy0_in[16*i]; __v2di zw0 = zw0_in[16*i]; __v2di xy1 = xy1_in[16*i]; __v2di zw1 = zw1_in[16*i]; __v2di c0 = c0a_in[4*i]; __v2di d0 = d0a_in[4*i]; __v2di e0 = e0a_in[4*i]; __v2di xy2 = xy2_in[16*i]; __v2di zw2 = zw2_in[16*i]; __v2di xy3 = xy3_in[16*i]; __v2di zw3 = zw3_in[16*i]; __v2di c1 = c1a_in[4*i]; __v2di d1 = d1a_in[4*i]; __v2di e1 = e1a_in[4*i]; __v2di xy4 = xy4_in[16*i]; __v2di zw4 = zw4_in[16*i]; __v2di xy5 = xy5_in[16*i]; __v2di zw5 = zw5_in[16*i]; __v2di c2 = c2a_in[4*i]; __v2di d2 = d2a_in[4*i]; __v2di e2 = e2a_in[4*i]; __v2di xy6 = xy6_in[16*i]; __v2di zw6 = zw6_in[16*i]; __v2di xy7 = xy7_in[16*i]; __v2di zw7 = zw7_in[16*i]; __v2di c3 = c3a_in[4*i]; __v2di d3 = d3a_in[4*i]; __v2di e3 = e3a_in[4*i]; __v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w0 = 
__builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100}); __v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); __v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 
0x0B0A09080F0E0D0C}); __v2di swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di cy0_real = __builtin_e2k_qpfmuls( c0, y0); __v2di cy1_real = __builtin_e2k_qpfmuls( c1, y1); __v2di cy2_real = __builtin_e2k_qpfmuls( c2, y2); __v2di cy3_real = __builtin_e2k_qpfmuls( c3, y3); __v2di dz0_real = __builtin_e2k_qpfmuls( d0, z0); __v2di dz1_real = __builtin_e2k_qpfmuls( d1, z1); __v2di dz2_real = __builtin_e2k_qpfmuls( d2, z2); __v2di dz3_real = __builtin_e2k_qpfmuls( d3, z3); __v2di ew0_real = __builtin_e2k_qpfmuls( e0, w0); __v2di ew1_real = __builtin_e2k_qpfmuls( e1, w1); __v2di ew2_real = __builtin_e2k_qpfmuls( e2, w2); __v2di ew3_real = __builtin_e2k_qpfmuls( e3, w3); __v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0); __v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1); __v2di cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2); __v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3); __v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0); __v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1); __v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2); __v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3); __v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0); __v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1); __v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2); __v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3); __v2di cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real); __v2di cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real); __v2di cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real); __v2di cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real); __v2di dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real); __v2di dz1_rr = 
__builtin_e2k_qpfhsubs((__v2di){0}, dz1_real); __v2di dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real); __v2di dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real); __v2di ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real); __v2di ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real); __v2di ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real); __v2di ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real); __v2di cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag); __v2di cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag); __v2di cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag); __v2di cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag); __v2di dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag); __v2di dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag); __v2di dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag); __v2di dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag); __v2di ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag); __v2di ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag); __v2di ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag); __v2di ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag); __v2di cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); 
__v2di ew0 = __builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); __v2di add02_0 = __builtin_e2k_qpfadds( x0, dz0); __v2di add02_1 = __builtin_e2k_qpfadds( x1, dz1); __v2di add02_2 = __builtin_e2k_qpfadds( x2, dz2); __v2di add02_3 = __builtin_e2k_qpfadds( x3, dz3); __v2di sub02_0 = __builtin_e2k_qpfsubs( x0, dz0); __v2di sub02_1 = __builtin_e2k_qpfsubs( x1, dz1); __v2di sub02_2 = __builtin_e2k_qpfsubs( x2, dz2); __v2di sub02_3 = __builtin_e2k_qpfsubs( x3, dz3); __v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0); __v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1); __v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2); __v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3); __v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0); __v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1); __v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2); __v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3); __v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); __v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31}); __v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31}); __v2di sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31}); __v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31}); __v2di out0 = 
__builtin_e2k_qpfadds(add02_0, add13_0); __v2di out1 = __builtin_e2k_qpfadds(add02_1, add13_1); __v2di out2 = __builtin_e2k_qpfadds(add02_2, add13_2); __v2di out3 = __builtin_e2k_qpfadds(add02_3, add13_3); __v2di out4 = __builtin_e2k_qpfsubs(sub02_0, sub13i_0); __v2di out5 = __builtin_e2k_qpfsubs(sub02_1, sub13i_1); __v2di out6 = __builtin_e2k_qpfsubs(sub02_2, sub13i_2); __v2di out7 = __builtin_e2k_qpfsubs(sub02_3, sub13i_3); __v2di out8 = __builtin_e2k_qpfsubs(add02_0, add13_0); __v2di out9 = __builtin_e2k_qpfsubs(add02_1, add13_1); __v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2); __v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3); __v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0); __v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1); __v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2); __v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3); xy0 = out0; zw0 = out1; xy1 = out2; zw1 = out3; c0 = c0b_in[i]; d0 = d0b_in[i]; e0 = e0b_in[i]; xy2 = out4; zw2 = out5; xy3 = out6; zw3 = out7; c1 = c1b_in[i]; d1 = d1b_in[i]; e1 = e1b_in[i]; xy4 = out8; zw4 = out9; xy5 = out10; zw5 = out11; c2 = c2b_in[i]; d2 = d2b_in[i]; e2 = e2b_in[i]; xy6 = out12; zw6 = out13; xy7 = out14; zw7 = out15; c3 = c3b_in[i]; d3 = d3b_in[i]; e3 = e3b_in[i]; x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100}); y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100}); w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100}); y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100}); w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); x2 = 
__builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100}); y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100}); w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100}); y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100}); w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908}); swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); cy0_real = __builtin_e2k_qpfmuls( c0, y0); cy1_real = __builtin_e2k_qpfmuls( c1, y1); cy2_real = __builtin_e2k_qpfmuls( c2, y2); cy3_real = __builtin_e2k_qpfmuls( c3, y3); dz0_real = __builtin_e2k_qpfmuls( 
d0, z0); dz1_real = __builtin_e2k_qpfmuls( d1, z1); dz2_real = __builtin_e2k_qpfmuls( d2, z2); dz3_real = __builtin_e2k_qpfmuls( d3, z3); ew0_real = __builtin_e2k_qpfmuls( e0, w0); ew1_real = __builtin_e2k_qpfmuls( e1, w1); ew2_real = __builtin_e2k_qpfmuls( e2, w2); ew3_real = __builtin_e2k_qpfmuls( e3, w3); cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0); cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1); cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2); cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3); dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0); dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1); dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2); dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3); ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0); ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1); ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2); ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3); cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real); cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real); cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real); cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real); dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real); dz1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz1_real); dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real); dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real); ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real); ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real); ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real); ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real); cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag); cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag); cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag); cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag); dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag); dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag); dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag); dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag); 
ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag); ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag); ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag); ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag); cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); ew0 = __builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C}); add02_0 = __builtin_e2k_qpfadds( x0, dz0); add02_1 = __builtin_e2k_qpfadds( x1, dz1); add02_2 = __builtin_e2k_qpfadds( x2, dz2); add02_3 = __builtin_e2k_qpfadds( x3, dz3); sub02_0 = __builtin_e2k_qpfsubs( x0, dz0); sub02_1 = __builtin_e2k_qpfsubs( x1, dz1); sub02_2 = __builtin_e2k_qpfsubs( x2, dz2); sub02_3 = __builtin_e2k_qpfsubs( x3, dz3); add13_0 = __builtin_e2k_qpfadds(cy0, ew0); add13_1 = __builtin_e2k_qpfadds(cy1, ew1); add13_2 = __builtin_e2k_qpfadds(cy2, ew2); add13_3 = __builtin_e2k_qpfadds(cy3, ew3); sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0); sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1); sub13_2 = 
__builtin_e2k_qpfsubs(cy2, ew2); sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3); swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C}); sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31}); sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31}); sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31}); sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31}); out_0[i] = __builtin_e2k_qpfadds(add02_0, add13_0); out_1[i] = __builtin_e2k_qpfadds(add02_1, add13_1); out_2[i] = __builtin_e2k_qpfadds(add02_2, add13_2); out_3[i] = __builtin_e2k_qpfadds(add02_3, add13_3); out_4[i] = __builtin_e2k_qpfsubs(sub02_0, sub13i_0); out_5[i] = __builtin_e2k_qpfsubs(sub02_1, sub13i_1); out_6[i] = __builtin_e2k_qpfsubs(sub02_2, sub13i_2); out_7[i] = __builtin_e2k_qpfsubs(sub02_3, sub13i_3); out_8[i] = __builtin_e2k_qpfsubs(add02_0, add13_0); out_9[i] = __builtin_e2k_qpfsubs(add02_1, add13_1); out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2); out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3); out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0); out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1); out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2); out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3); } }
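The intrinsic sequence above is easier to follow next to a scalar version of a single butterfly. The sketch below is inferred from the variable names and the order of operations in the listing (`add02`, `sub02`, `add13`, `sub13`, `sub13i`). It assumes that the `qpshufb` / `qpfmuls` / `qpfhsubs` / `qpfhadds` / `qppermb` chain forms ordinary complex products and that the swap followed by `qpxor` with the sign bit is a multiplication by i. The function name `radix4_butterfly` is hypothetical.

```c
#include <complex.h>

/* Scalar sketch of one radix-4 butterfly, mirroring the names in the
   SIMD listing. x, y, z, w are the four inputs; c, d, e are the
   twiddle factors applied to y, z and w respectively. */
void radix4_butterfly(float complex x, float complex y,
                      float complex z, float complex w,
                      float complex c, float complex d, float complex e,
                      float complex out[4])
{
    float complex cy = c * y;   /* swap + 2 multiplies + hsub/hadd + permute */
    float complex dz = d * z;
    float complex ew = e * w;

    float complex add02 = x + dz;
    float complex sub02 = x - dz;
    float complex add13 = cy + ew;
    float complex sub13 = cy - ew;

    /* In the listing: swap re/im, then flip the sign of the new real part. */
    float complex sub13i = I * sub13;

    out[0] = add02 + add13;
    out[1] = sub02 - sub13i;
    out[2] = add02 - add13;
    out[3] = sub02 + sub13i;
}
```

With all twiddle factors equal to 1 this reduces to a plain 4-point DFT, which is a convenient sanity check. The vector version does the same thing on two complex numbers per register and runs two such stages back to back inside one loop iteration.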
Main loop in assembly
.L24812: { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=64 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=96 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=128 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=160 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=192 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=224 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=256 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=288 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=320 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=352 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=384 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=416 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=448 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=480 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=8, disp=512 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=8, disp=544 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=9, disp=576 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=9, disp=608 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=10, disp=640 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=10, disp=672 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=11, disp=704 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=11, disp=736 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=0, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=5, 
mrng=0, d=0, incr=3, ind=4, asz=0, abs=12, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=0, abs=13, disp=64 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=0, abs=13, disp=96 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=0, abs=14, disp=128 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=0, abs=14, disp=160 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=15, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=15, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=16, disp=64 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=16, disp=96 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=17, disp=128 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=17, disp=160 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=18, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=18, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=19, disp=64 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=19, disp=96 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=20, disp=128 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=20, disp=160 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=1, incr=2, ind=0, asz=0, abs=21, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=15, asz=0, abs=21, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=14, asz=0, abs=22, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=13, asz=0, abs=22, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=12, asz=0, abs=23, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=11, asz=0, abs=23, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=10, asz=0, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=9, asz=0, abs=24, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=8, 
asz=0, abs=25, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=7, asz=0, abs=25, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=6, asz=0, abs=26, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=5, asz=0, abs=26, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=1, incr=2, ind=0, asz=0, abs=27, disp=32 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=15, asz=0, abs=27, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=14, asz=0, abs=28, disp=32 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=13, asz=0, abs=28, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=12, asz=0, abs=29, disp=32 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=11, asz=0, abs=29, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=10, asz=0, abs=30, disp=32 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=9, asz=0, abs=30, disp=32 } { fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=8, asz=0, abs=31, disp=32 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=7, asz=0, abs=31, disp=32 } .L19801: { loop_mode disp %ctpr1, .L19801 ldqp,0 %r58, %r0, %g20 addd,1,sm %r0, _f16s,_lts0lo 0x30, %g22 ldqp,2 %r55, %r0, %g21 movaqp,0 area=0, ind=0, am=1, be=0, %g17 movaqp,1 area=0, ind=16, am=0, be=0, %g16 movaqp,2 area=0, ind=0, am=1, be=0, %g19 movaqp,3 area=0, ind=16, am=0, be=0, %g18 } { loop_mode ldb,0,sm %r58, %g22, %empty, mas=0x20 ldb,2,sm %r55, %g22, %empty, mas=0x20 movaqp,0 area=1, ind=0, am=1, be=0, %g24 movaqp,1 area=1, ind=16, am=0, be=0, %g23 movaqp,2 area=1, ind=0, am=1, be=0, %g26 movaqp,3 area=1, ind=16, am=0, be=0, %g25 } { loop_mode movaqp,0 area=2, ind=0, am=1, be=0, %g27 movaqp,1 area=2, ind=16, am=0, be=0, %g22 movaqp,2 area=2, ind=0, am=1, be=0, %g29 movaqp,3 area=2, ind=16, am=0, be=0, %g28 } { loop_mode movaqp,0 area=3, ind=0, am=1, be=0, %g31 movaqp,1 area=3, ind=16, am=0, be=0, %g30 movaqp,2 area=3, ind=0, am=1, be=0, %b[38] movaqp,3 area=3, ind=16, am=0, be=0, %b[37] } { loop_mode movaqp,0 
area=4, ind=0, am=1, be=0, %b[40] movaqp,1 area=4, ind=16, am=0, be=0, %b[39] movaqp,2 area=4, ind=0, am=1, be=0, %b[42] movaqp,3 area=4, ind=16, am=0, be=0, %b[41] } { loop_mode qpshufb,0 %g19, %g17, %r6, %b[47] qpshufb,1 %g18, %g16, %r6, %b[48] qpshufb,3 %g18, %g16, %r23, %g16 qpshufb,4 %g19, %g17, %r23, %g17 movaqp,0 area=5, ind=0, am=1, be=0, %b[44] movaqp,1 area=5, ind=16, am=0, be=0, %b[43] movaqp,2 area=5, ind=0, am=1, be=0, %b[46] movaqp,3 area=5, ind=16, am=0, be=0, %b[45] } { loop_mode qpshufb,0 %g26, %g24, %r6, %b[51] qpshufb,1 %g25, %g23, %r6, %b[52] qpshufb,3 %g25, %g23, %r23, %g23 qpshufb,4 %g26, %g24, %r23, %g24 movaqp,0 area=6, ind=0, am=1, be=0, %g19 movaqp,1 area=6, ind=16, am=0, be=0, %g18 movaqp,2 area=6, ind=0, am=1, be=0, %b[50] movaqp,3 area=6, ind=16, am=0, be=0, %b[49] } { loop_mode qpshufb,0 %g29, %g27, %r6, %b[55] qpshufb,1 %g28, %g22, %r6, %b[56] qpshufb,3 %g28, %g22, %r23, %g22 qpshufb,4 %g29, %g27, %r23, %g27 movaqp,0 area=7, ind=0, am=1, be=0, %g26 movaqp,1 area=7, ind=16, am=0, be=0, %g25 movaqp,2 area=7, ind=0, am=1, be=0, %b[54] movaqp,3 area=7, ind=16, am=0, be=0, %b[53] } { loop_mode qpshufb,0 %b[38], %g31, %r6, %b[59] qpshufb,1 %b[37], %g30, %r6, %b[60] qpshufb,3 %b[37], %g30, %r23, %g30 qpshufb,4 %b[38], %g31, %r23, %g31 movaqp,0 area=8, ind=0, am=1, be=0, %g29 movaqp,1 area=8, ind=16, am=0, be=0, %g28 movaqp,2 area=8, ind=0, am=1, be=0, %b[58] movaqp,3 area=8, ind=16, am=0, be=0, %b[57] } { loop_mode qpshufb,0 %b[42], %b[40], %r6, %b[63] qpshufb,1 %b[41], %b[39], %r6, %b[64] qpshufb,3 %b[41], %b[39], %r23, %b[39] qpshufb,4 %b[42], %b[40], %r23, %b[40] movaqp,0 area=9, ind=0, am=1, be=0, %b[38] movaqp,1 area=9, ind=16, am=0, be=0, %b[37] movaqp,2 area=9, ind=0, am=1, be=0, %b[62] movaqp,3 area=9, ind=16, am=0, be=0, %b[61] } { loop_mode qpshufb,0 %b[46], %b[44], %r6, %b[67] qpshufb,1 %b[45], %b[43], %r6, %b[68] qpshufb,3 %b[45], %b[43], %r23, %b[43] qpshufb,4 %b[46], %b[44], %r23, %b[44] movaqp,0 area=10, ind=0, am=1, be=0, 
%b[42] movaqp,1 area=10, ind=16, am=0, be=0, %b[41] movaqp,2 area=10, ind=0, am=1, be=0, %b[66] movaqp,3 area=10, ind=16, am=0, be=0, %b[65] } { loop_mode qpshufb,0 %b[50], %g19, %r6, %b[71] qpshufb,1 %b[49], %g18, %r6, %b[72] qpshufb,3 %b[49], %g18, %r23, %g18 qpshufb,4 %b[50], %g19, %r23, %g19 movaqp,0 area=11, ind=0, am=1, be=0, %b[46] movaqp,1 area=11, ind=16, am=0, be=0, %b[45] movaqp,2 area=11, ind=0, am=1, be=0, %b[70] movaqp,3 area=11, ind=16, am=0, be=0, %b[69] } { loop_mode qpshufb,0 %b[54], %g26, %r6, %b[75] qpshufb,1 %b[53], %g25, %r6, %b[76] qpshufb,3 %b[53], %g25, %r23, %g25 qpshufb,4 %b[54], %g26, %r23, %g26 movaqp,0 area=12, ind=0, am=1, be=0, %b[50] movaqp,1 area=12, ind=16, am=0, be=0, %b[49] movaqp,2 area=12, ind=0, am=1, be=0, %b[74] movaqp,3 area=12, ind=16, am=0, be=0, %b[73] } { loop_mode qpshufb,0 %b[58], %g29, %r6, %b[79] qpshufb,1 %b[57], %g28, %r6, %b[80] qpshufb,3 %b[57], %g28, %r23, %g28 qpshufb,4 %b[58], %g29, %r23, %g29 movaqp,0 area=13, ind=0, am=1, be=0, %b[54] movaqp,1 area=13, ind=16, am=0, be=0, %b[53] movaqp,2 area=13, ind=0, am=1, be=0, %b[78] movaqp,3 area=13, ind=16, am=0, be=0, %b[77] } { loop_mode qpshufb,0 %b[62], %b[38], %r6, %b[83] qpshufb,1 %b[61], %b[37], %r6, %b[84] qpshufb,3 %b[61], %b[37], %r23, %b[37] qpshufb,4 %b[62], %b[38], %r23, %b[38] movaqp,0 area=14, ind=0, am=1, be=0, %b[58] movaqp,1 area=14, ind=16, am=0, be=0, %b[57] movaqp,2 area=14, ind=0, am=1, be=0, %b[82] movaqp,3 area=14, ind=16, am=0, be=0, %b[81] } { loop_mode qpshufb,0 %b[66], %b[42], %r6, %b[87] qpshufb,1 %b[65], %b[41], %r6, %b[88] qpshufb,3 %b[65], %b[41], %r23, %b[41] qpshufb,4 %b[66], %b[42], %r23, %b[42] movaqp,0 area=15, ind=0, am=1, be=0, %b[62] movaqp,1 area=15, ind=16, am=0, be=0, %b[61] movaqp,2 area=15, ind=0, am=1, be=0, %b[86] movaqp,3 area=15, ind=16, am=0, be=0, %b[85] } { loop_mode qpshufb,0 %b[70], %b[46], %r6, %b[91] qpshufb,1 %b[69], %b[45], %r6, %b[92] qpshufb,3 %b[69], %b[45], %r23, %b[45] qpshufb,4 %b[70], %b[46], %r23, 
%b[46] movaqp,0 area=16, ind=0, am=1, be=0, %b[66] movaqp,1 area=16, ind=16, am=0, be=0, %b[65] movaqp,2 area=16, ind=0, am=1, be=0, %b[90] movaqp,3 area=16, ind=16, am=0, be=0, %b[89] } { loop_mode qpshufb,0 %b[50], %b[50], %r24, %b[95] qpshufb,1 %b[49], %b[49], %r24, %b[96] qpfmul_hsubs,2 %b[50], %b[47], %r25, %b[50] qpshufb,3 %b[74], %b[74], %r24, %b[97] qpshufb,4 %b[73], %b[73], %r24, %b[98] qpfmul_hsubs,5 %b[49], %b[51], %r25, %b[49] movaqp,0 area=17, ind=0, am=1, be=0, %b[70] movaqp,1 area=17, ind=16, am=0, be=0, %b[69] movaqp,2 area=17, ind=0, am=1, be=0, %b[94] movaqp,3 area=17, ind=16, am=0, be=0, %b[93] } { loop_mode qpshufb,0 %b[54], %b[54], %r24, %b[103] qpshufb,1 %b[53], %b[53], %r24, %b[104] qpfmul_hsubs,2 %b[74], %b[55], %r25, %b[74] qpshufb,3 %b[78], %b[78], %r24, %b[105] qpshufb,4 %b[77], %b[77], %r24, %b[106] qpfmul_hsubs,5 %b[73], %b[59], %r25, %b[73] movaqp,0 area=18, ind=0, am=1, be=0, %b[100] movaqp,1 area=18, ind=16, am=0, be=0, %b[99] movaqp,2 area=18, ind=0, am=1, be=0, %b[102] movaqp,3 area=18, ind=16, am=0, be=0, %b[101] } { loop_mode qpshufb,0 %b[58], %b[58], %r24, %b[111] qpshufb,1 %b[57], %b[57], %r24, %b[112] qpfmul_hsubs,2 %b[54], %b[63], %r25, %b[54] qpshufb,3 %b[82], %b[82], %r24, %b[113] qpshufb,4 %b[81], %b[81], %r24, %b[114] qpfmul_hsubs,5 %b[53], %b[67], %r25, %b[53] movaqp,0 area=19, ind=0, am=1, be=0, %b[108] movaqp,1 area=19, ind=16, am=0, be=0, %b[107] movaqp,2 area=19, ind=0, am=1, be=0, %b[110] movaqp,3 area=19, ind=16, am=0, be=0, %b[109] } { loop_mode qpfmul_hadds,0 %b[96], %b[51], %r25, %b[51] qpfmul_hsubs,1 %b[78], %b[71], %r25, %b[78] qpfmul_hsubs,2 %b[77], %b[75], %r25, %b[77] qpfmul_hadds,3 %b[97], %b[55], %r25, %b[55] qpfmul_hadds,4 %b[98], %b[59], %r25, %b[59] qpfmul_hsubs,5 %b[58], %b[79], %r25, %b[58] movaqp,0 area=20, ind=0, am=1, be=0, %b[116] movaqp,1 area=20, ind=16, am=0, be=0, %b[115] movaqp,2 area=20, ind=0, am=1, be=0, %b[118] movaqp,3 area=20, ind=16, am=0, be=0, %b[117] } { loop_mode qpfmul_hadds,0 
%b[95], %b[47], %r25, %b[47] qpfmul_hsubs,1 %b[57], %b[83], %r25, %b[57] qpfmul_hsubs,2 %b[82], %b[87], %r25, %b[82] qpfmul_hadds,3 %b[105], %b[71], %r25, %b[71] qpfmul_hadds,4 %b[106], %b[75], %r25, %b[75] qpfmul_hsubs,5 %b[81], %b[91], %r25, %b[81] movaqp,0 area=21, ind=0, am=1, be=0, %b[96] movaqp,1 area=21, ind=16, am=0, be=0, %b[95] movaqp,2 area=21, ind=0, am=1, be=0, %b[98] movaqp,3 area=21, ind=16, am=0, be=0, %b[97] } { loop_mode qpfmul_hadds,0 %b[103], %b[63], %r25, %b[63] qpfmul_hadds,1 %b[112], %b[83], %r25, %b[83] qpfmul_hadds,2 %b[104], %b[67], %r25, %b[67] qpfmul_hadds,3 %b[114], %b[91], %r25, %b[91] qpshufb,4 %b[62], %b[62], %r24, %b[103] qpfmul_hadds,5 %b[113], %b[87], %r25, %b[87] movaqp,0 area=22, ind=0, am=1, be=0, %b[105] movaqp,1 area=22, ind=16, am=0, be=0, %b[104] movaqp,2 area=22, ind=0, am=1, be=0, %b[112] movaqp,3 area=22, ind=16, am=0, be=0, %b[106] } { loop_mode qpshufb,0 %b[100], %b[100], %r24, %b[113] qpshufb,1 %b[99], %b[99], %r24, %b[114] qpfmul_hsubs,2 %b[99], %b[52], %r25, %b[99] qpshufb,3 %b[102], %b[102], %r24, %b[119] qpshufb,4 %b[101], %b[101], %r24, %b[120] qpfmul_hadds,5 %b[111], %b[79], %r25, %b[79] movaqp,0 area=23, ind=0, am=1, be=0, %b[121] movaqp,1 area=23, ind=16, am=0, be=0, %b[111] movaqp,2 area=23, ind=0, am=1, be=0, %b[123] movaqp,3 area=23, ind=16, am=0, be=0, %b[122] } { loop_mode qpshufb,0 %b[108], %b[108], %r24, %b[124] qpshufb,1 %b[107], %b[107], %r24, %b[125] qpfmul_hsubs,2 %b[100], %b[48], %r25, %b[100] qpshufb,3 %b[110], %b[110], %r24, %b[126] qpshufb,4 %b[109], %b[109], %r24, %b[127] qpfmul_hsubs,5 %b[101], %b[60], %r25, %b[101] movaqp,0 area=24, ind=0, am=1, be=0, %b[35] movaqp,1 area=24, ind=16, am=0, be=0, %b[36] movaqp,2 area=24, ind=0, am=1, be=0, %b[33] movaqp,3 area=24, ind=16, am=0, be=0, %b[34] } { loop_mode qpshufb,0 %b[116], %b[116], %r24, %b[14] qpshufb,1 %b[115], %b[115], %r24, %b[13] qpfmul_hsubs,2 %b[102], %b[56], %r25, %b[102] qpshufb,3 %b[118], %b[118], %r24, %b[12] qpshufb,4 %b[117], 
%b[117], %r24, %b[11] qpfmul_hsubs,5 %b[107], %b[68], %r25, %b[107] movaqp,0 area=25, ind=0, am=1, be=0, %b[31] movaqp,1 area=25, ind=16, am=0, be=0, %b[32] movaqp,2 area=25, ind=0, am=1, be=0, %b[29] movaqp,3 area=25, ind=16, am=0, be=0, %b[30] } { loop_mode qpfmul_hsubs,0 %b[108], %b[64], %r25, %b[108] qpfmul_hsubs,1 %b[110], %b[72], %r25, %b[110] qpfmul_hsubs,2 %b[109], %b[76], %r25, %b[109] qpfmul_hadds,3 %b[119], %b[56], %r25, %b[56] qpfmul_hadds,4 %b[120], %b[60], %r25, %b[60] qpfmul_hsubs,5 %b[115], %b[84], %r25, %b[115] movaqp,0 area=26, ind=0, am=1, be=0, %b[27] movaqp,1 area=26, ind=16, am=0, be=0, %b[28] movaqp,2 area=26, ind=0, am=1, be=0, %b[25] movaqp,3 area=26, ind=16, am=0, be=0, %b[26] } { loop_mode qpfmul_hadds,0 %b[113], %b[48], %r25, %b[48] qpfmul_hadds,1 %b[114], %b[52], %r25, %b[52] qpfmul_hsubs,2 %b[116], %b[80], %r25, %b[113] qpfmul_hadds,3 %b[127], %b[76], %r25, %b[76] qpfmul_hsubs,4 %b[118], %b[88], %r25, %b[114] qpfmul_hsubs,5 %b[117], %b[92], %r25, %b[116] movaqp,0 area=28, ind=0, am=1, be=0, %b[22] movaqp,1 area=27, ind=0, am=1, be=0, %b[24] movaqp,2 area=28, ind=0, am=1, be=0, %b[21] movaqp,3 area=27, ind=0, am=1, be=0, %b[23] } { loop_mode qpfmul_hadds,0 %b[124], %b[64], %r25, %b[64] qpfmul_hadds,1 %b[125], %b[68], %r25, %b[68] qpfmul_hadds,2 %b[126], %b[72], %r25, %b[72] qpfmul_hadds,3 %b[11], %b[92], %r25, %b[11] qpshufb,4 %b[61], %b[61], %r24, %b[88] qpfmul_hadds,5 %b[12], %b[88], %r25, %b[12] movaqp,0 area=30, ind=0, am=1, be=0, %b[18] movaqp,1 area=29, ind=0, am=1, be=0, %b[20] movaqp,2 area=30, ind=0, am=1, be=0, %b[17] movaqp,3 area=29, ind=0, am=1, be=0, %b[19] } { loop_mode qpfmul_hadds,0 %b[13], %b[84], %r25, %b[13] qpshufb,1 %b[86], %b[86], %r24, %b[80] qpfmul_hadds,2 %b[14], %b[80], %r25, %b[14] qpshufb,3 %b[85], %b[85], %r24, %b[84] qpshufb,4 %b[66], %b[66], %r24, %b[92] qpfmul_hsubs,5 %b[61], %g23, %r25, %b[61] movaqp,1 area=31, ind=0, am=1, be=0, %b[16] movaqp,3 area=31, ind=0, am=1, be=0, %b[15] } { loop_mode qpshufb,0 
%b[65], %b[65], %r24, %b[117] qpshufb,1 %b[90], %b[90], %r24, %b[118] qpfmul_hsubs,2 %b[62], %g16, %r25, %b[62] qpshufb,3 %b[89], %b[89], %r24, %b[119] qpshufb,4 %b[70], %b[70], %r24, %b[120] qpfmul_hsubs,5 %b[86], %g22, %r25, %b[86] } { loop_mode qpshufb,0 %b[69], %b[69], %r24, %b[124] qpshufb,1 %b[94], %b[94], %r24, %b[125] qpfmul_hsubs,2 %b[85], %g30, %r25, %b[85] qpshufb,3 %b[93], %b[93], %r24, %b[126] qpfmul_hsubs,4 %b[66], %b[39], %r25, %b[66] qpfmul_hsubs,5 %b[65], %b[43], %r25, %b[65] } { loop_mode qpfmul_hsubs,0 %b[90], %g18, %r25, %b[90] qpfmul_hsubs,1 %b[89], %g25, %r25, %b[89] qpfmul_hadds,2 %b[103], %g16, %r25, %g16 qpfmul_hadds,3 %b[88], %g23, %r25, %g23 qpfmul_hsubs,4 %b[70], %g28, %r25, %b[70] qpfmul_hsubs,5 %b[69], %b[37], %r25, %b[69] } { loop_mode qpfmul_hsubs,0 %b[94], %b[41], %r25, %b[88] qpfmul_hsubs,1 %b[93], %b[45], %r25, %b[93] qpfmul_hadds,2 %b[80], %g22, %r25, %g22 qpfmul_hadds,3 %b[84], %g30, %r25, %g30 qpfmul_hadds,4 %b[92], %b[39], %r25, %b[39] qpfmul_hadds,5 %b[119], %g25, %r25, %g25 } { loop_mode qpfmul_hadds,0 %b[117], %b[43], %r25, %b[43] qpfmul_hadds,1 %b[118], %g18, %r25, %g18 qpfmul_hadds,2 %b[124], %b[37], %r25, %b[37] qpfmul_hadds,3 %b[126], %b[45], %r25, %b[45] qppermb,4 %b[51], %b[49], %r7, %b[49] qpfmul_hadds,5 %b[120], %g28, %r25, %g28 } { loop_mode qppermb,1 %b[59], %b[73], %r7, %b[51] qpfmul_hadds,2 %b[125], %b[41], %r25, %b[41] qppermb,3 %b[47], %b[50], %r7, %b[47] qppermb,4 %b[55], %b[74], %r7, %b[50] } { loop_mode qppermb,0 %b[71], %b[78], %r7, %b[55] qppermb,1 %b[75], %b[77], %r7, %b[59] qppermb,3 %b[63], %b[54], %r7, %b[54] qppermb,4 %b[67], %b[53], %r7, %b[53] } { loop_mode nop 2 qppermb,0 %b[79], %b[58], %r7, %b[58] qppermb,1 %b[83], %b[57], %r7, %b[57] qppermb,3 %b[87], %b[82], %r7, %b[63] } { loop_mode qppermb,4 %b[91], %b[81], %r7, %b[67] } { loop_mode qppermb,0 %b[56], %b[102], %r7, %b[56] qppermb,1 %b[60], %b[101], %r7, %b[60] qppermb,3 %b[48], %b[100], %r7, %b[48] qppermb,4 %b[52], %b[99], %r7, %b[52] } { 
loop_mode qppermb,0 %b[68], %b[107], %r7, %b[68] qppermb,1 %b[76], %b[109], %r7, %b[71] qpfsubs,2 %b[51], %b[60], %b[73] qppermb,3 %b[64], %b[108], %r7, %b[64] qppermb,4 %b[72], %b[110], %r7, %b[72] qpfsubs,5 %b[49], %b[52], %b[74] } { loop_mode qppermb,0 %b[12], %b[114], %r7, %b[12] qppermb,1 %b[11], %b[116], %r7, %b[11] qpfsubs,2 %b[50], %b[56], %b[75] qppermb,3 %b[14], %b[113], %r7, %b[14] qppermb,4 %b[13], %b[115], %r7, %b[13] qpfsubs,5 %b[47], %b[48], %b[76] } { loop_mode qpfsubs,0 %b[59], %b[71], %b[77] qpfsubs,1 %b[53], %b[68], %b[78] qpfsubs,2 %b[63], %b[12], %b[79] qpfsubs,3 %b[55], %b[72], %b[81] qpfsubs,4 %b[58], %b[14], %b[82] qpfsubs,5 %b[54], %b[64], %b[80] } { loop_mode qppermb,0 %g16, %b[62], %r7, %g16 qppermb,1 %g23, %b[61], %r7, %g23 qpfsubs,2 %b[67], %b[11], %b[83] qppermb,3 %g22, %b[86], %r7, %g22 qppermb,4 %g30, %b[85], %r7, %g30 qpfsubs,5 %b[57], %b[13], %b[84] } { loop_mode qpfadds,0 %b[51], %b[60], %b[51] qpfadds,1 %b[47], %b[48], %b[47] qpfadds,2 %b[49], %b[52], %b[48] qpfadds,3 %b[50], %b[56], %b[49] qpfadds,4 %b[59], %b[71], %b[50] qpfadds,5 %b[54], %b[64], %b[52] } { loop_mode qppermb,0 %b[39], %b[66], %r7, %b[39] qppermb,1 %b[43], %b[65], %r7, %b[43] qpfadds,2 %b[53], %b[68], %b[53] qppermb,3 %g18, %b[90], %r7, %g18 qppermb,4 %g25, %b[89], %r7, %g25 qpfadds,5 %b[55], %b[72], %b[54] } { loop_mode qpfadds,0 %b[67], %b[11], %b[11] qpfadds,1 %g24, %g23, %b[55] qpfsubs,2 %g17, %g16, %b[56] qpfadds,3 %b[58], %b[14], %b[14] qpfadds,4 %b[57], %b[13], %b[13] qpfadds,5 %b[63], %b[12], %b[12] } { loop_mode qppermb,0 %b[37], %b[69], %r7, %b[37] qppermb,1 %g28, %b[70], %r7, %g28 qpfsubs,2 %g24, %g23, %g23 qppermb,3 %b[45], %b[93], %r7, %b[45] qppermb,4 %b[41], %b[88], %r7, %b[41] qpfadds,5 %g17, %g16, %g16 } { loop_mode qpfsubs,0 %g31, %g30, %g17 qpfadds,1 %g27, %g22, %g24 qpfadds,2 %g31, %g30, %g30 qpfsubs,3 %g27, %g22, %g22 qpfsubs,4 %g26, %g25, %g27 qpfadds,5 %g26, %g25, %g25 } { loop_mode qpfsubs,0 %b[44], %b[43], %g26 qpfadds,1 %b[40], %b[39], 
%g31 qpfadds,2 %b[44], %b[43], %b[43] qpfsubs,3 %b[40], %b[39], %b[39] qpfsubs,4 %g19, %g18, %b[40] qpfadds,5 %g19, %g18, %g18 } { loop_mode qpfadds,0 %b[38], %b[37], %g19 qpfsubs,1 %g29, %g28, %b[44] qpfsubs,2 %b[38], %b[37], %b[37] qpfsubs,3 %b[46], %b[45], %b[45] qpfadds,4 %b[42], %b[41], %b[46] qpfadds,5 %b[46], %b[45], %b[38] } { loop_mode qpfadds,0 %b[55], %b[48], %b[41] qpfsubs,1 %b[55], %b[48], %b[42] qpfadds,2 %g29, %g28, %g28 qpfadds,3 %g16, %b[47], %b[48] qpfsubs,4 %g16, %b[47], %g16 qpfsubs,5 %b[42], %b[41], %g29 } { loop_mode qpfadds,0 %g30, %b[51], %b[47] qpfadds,1 %g24, %b[49], %b[51] qpfsubs,2 %g30, %b[51], %g30 qpfsubs,3 %g25, %b[50], %g25 qpfadds,5 %g25, %b[50], %b[55] } { loop_mode qpfsubs,0 %g24, %b[49], %g24 qpfadds,1 %b[43], %b[53], %b[49] qpfsubs,2 %g31, %b[52], %b[50] qpfsubs,3 %g18, %b[54], %g18 qpfadds,5 %g18, %b[54], %b[57] } { loop_mode qpfsubs,0 %b[43], %b[53], %b[43] qpfadds,1 %g31, %b[52], %g31 qpfadds,2 %g19, %b[13], %b[52] qpfsubs,3 %b[38], %b[11], %b[53] qpshufb,4 %b[75], %b[75], %r24, %b[54] qpfadds,5 %b[46], %b[12], %b[58] } { loop_mode qpfadds,0 %g28, %b[14], %b[13] qpfsubs,1 %g28, %b[14], %g28 qpfsubs,2 %g19, %b[13], %g19 qpfsubs,3 %b[46], %b[12], %b[12] qpshufb,4 %b[73], %b[73], %r24, %b[59] qpfadds,5 %b[38], %b[11], %b[11] } { loop_mode qpshufb,4 %b[74], %b[74], %r24, %b[14] } { loop_mode qpshufb,4 %b[78], %b[78], %r24, %b[38] } { loop_mode qpshufb,0 %b[77], %b[77], %r24, %b[46] qpshufb,1 %b[83], %b[83], %r24, %b[60] qpshufb,3 %b[76], %b[76], %r24, %b[61] qpshufb,4 %b[80], %b[80], %r24, %b[62] } { loop_mode qpshufb,0 %b[81], %b[81], %r24, %b[63] qpshufb,1 %b[79], %b[79], %r24, %b[64] qpshufb,3 %b[82], %b[82], %r24, %b[65] qpshufb,4 %b[84], %b[84], %r24, %b[66] } { loop_mode qpxor,0 %b[54], %r22, %b[54] qpxor,1 %b[59], %r22, %b[59] qpxor,3 %b[14], %r22, %b[14] qpxor,4 %b[38], %r22, %b[38] } { loop_mode qpxor,0 %b[46], %r22, %b[46] qpxor,1 %b[60], %r22, %b[60] qpfadds,2 %g17, %b[59], %b[67] qpxor,3 %b[61], %r22, %b[61] qpxor,4 
%b[62], %r22, %b[62] qpfadds,5 %g26, %b[38], %b[68] } { loop_mode qpxor,0 %b[63], %r22, %b[63] qpxor,1 %b[64], %r22, %b[64] qpfsubs,2 %g17, %b[59], %g17 qpxor,3 %b[65], %r22, %b[65] qpxor,4 %b[66], %r22, %b[66] qpfsubs,5 %g23, %b[14], %b[59] } { loop_mode qpfadds,0 %g27, %b[46], %b[69] qpfsubs,1 %g22, %b[54], %b[70] qpfadds,2 %g22, %b[54], %g22 qpfadds,3 %g23, %b[14], %g23 qpfsubs,4 %g26, %b[38], %g26 qpfsubs,5 %b[56], %b[61], %b[14] } { loop_mode qpfsubs,0 %g27, %b[46], %g27 qpfsubs,1 %b[45], %b[60], %b[38] qpfadds,2 %g29, %b[64], %b[46] qpfadds,3 %b[56], %b[61], %b[54] qpfsubs,4 %b[39], %b[62], %b[56] qpfadds,5 %b[39], %b[62], %b[39] } { loop_mode qpfadds,0 %b[45], %b[60], %b[45] qpfadds,1 %b[40], %b[63], %b[60] qpfsubs,2 %g29, %b[64], %g29 qpfadds,3 %b[37], %b[66], %b[37] qpfsubs,4 %b[44], %b[65], %b[62] qpfsubs,5 %b[37], %b[66], %b[61] } { loop_mode nop 1 qpfsubs,2 %b[40], %b[63], %b[40] qpfadds,5 %b[44], %b[65], %b[44] } { loop_mode qpshufb,0 %b[95], %b[95], %r24, %b[63] qpshufb,1 %g20, %g20, %r24, %b[64] qpshufb,3 %b[96], %b[96], %r24, %b[65] qpshufb,4 %b[104], %b[104], %r24, %b[66] } { loop_mode qpshufb,0 %b[112], %b[112], %r24, %b[71] qpshufb,1 %b[105], %b[105], %r24, %b[72] qpshufb,3 %b[106], %b[106], %r24, %b[73] qpshufb,4 %b[122], %b[122], %r24, %b[74] } { loop_mode qpshufb,0 %b[35], %b[35], %r24, %b[75] qpshufb,1 %b[123], %b[123], %r24, %b[76] qpshufb,3 %b[36], %b[36], %r24, %b[77] qpshufb,4 %b[32], %b[32], %r24, %b[78] } { loop_mode qpshufb,0 %b[29], %b[29], %r24, %b[79] qpshufb,1 %b[31], %b[31], %r24, %b[80] qpshufb,3 %b[30], %b[30], %r24, %b[81] qpshufb,4 %b[26], %b[26], %r24, %b[82] } { loop_mode qpshufb,0 %b[25], %b[25], %r24, %b[83] qpshufb,1 %b[24], %b[24], %r24, %b[84] qpshufb,3 %b[21], %b[21], %r24, %b[85] qpshufb,4 %b[22], %b[22], %r24, %b[86] } { loop_mode qpshufb,0 %b[18], %b[18], %r24, %b[87] qpshufb,1 %b[15], %b[15], %r24, %b[88] qpshufb,3 %b[19], %b[19], %r24, %b[89] qpshufb,4 %b[16], %b[16], %r24, %b[90] } { loop_mode qpshufb,0 %b[51], 
%b[48], %r6, %b[91] qpshufb,1 %b[47], %b[41], %r6, %b[92] qpshufb,3 %g24, %g16, %r6, %b[93] qpshufb,4 %g30, %b[42], %r6, %b[94] } { loop_mode qpshufb,0 %b[57], %g31, %r6, %b[99] qpshufb,1 %b[55], %b[49], %r6, %b[100] qpfmul_hadds,2 %b[65], %b[91], %r25, %b[65] qpshufb,3 %g18, %b[50], %r6, %b[101] qpshufb,4 %g25, %b[43], %r6, %b[102] qpfmul_hadds,5 %b[75], %b[93], %r25, %b[75] } { loop_mode qpshufb,0 %b[58], %b[13], %r6, %b[103] qpshufb,1 %b[11], %b[52], %r6, %b[107] qpfmul_hadds,2 %b[72], %b[92], %r25, %b[72] qpshufb,3 %b[12], %g28, %r6, %b[108] qpshufb,4 %b[53], %g19, %r6, %b[109] qpfmul_hadds,5 %b[80], %b[94], %r25, %b[80] } { loop_mode qpshufb,0 %g17, %b[59], %r6, %b[110] qpshufb,1 %b[67], %g23, %r6, %b[113] qpfmul_hsubs,2 %b[96], %b[91], %r25, %b[91] qpshufb,3 %g27, %g26, %r6, %b[114] qpshufb,4 %b[69], %b[68], %r6, %b[115] qpfmul_hsubs,5 %b[35], %b[93], %r25, %b[35] } { loop_mode qpshufb,0 %b[70], %b[14], %r6, %b[93] qpshufb,1 %g22, %b[54], %r6, %b[96] qpfmul_hsubs,2 %b[31], %b[94], %r25, %b[31] qpshufb,3 %b[60], %b[39], %r6, %b[116] qpshufb,4 %b[45], %b[37], %r6, %b[117] qpfmul_hsubs,5 %b[105], %b[92], %r25, %b[92] } { loop_mode qpshufb,0 %b[40], %b[56], %r6, %b[94] qpshufb,1 %g29, %b[62], %r6, %b[105] qpfmul_hsubs,2 %b[32], %b[102], %r25, %b[32] qpshufb,3 %b[38], %b[61], %r6, %b[118] qpshufb,4 %b[46], %b[44], %r6, %b[119] qpfmul_hadds,5 %b[78], %b[102], %r25, %b[78] } { loop_mode qpfmul_hsubs,0 %b[95], %b[99], %r25, %b[95] qpfmul_hsubs,1 %b[36], %b[101], %r25, %b[36] qpfmul_hsubs,2 %b[104], %b[100], %r25, %b[102] qpfmul_hadds,3 %b[63], %b[99], %r25, %b[63] qpfmul_hadds,4 %b[77], %b[101], %r25, %b[77] qpfmul_hadds,5 %b[66], %b[100], %r25, %b[66] } { loop_mode qpfmul_hsubs,0 %b[18], %b[108], %r25, %b[18] qpfmul_hsubs,1 %b[22], %b[107], %r25, %b[22] qpfmul_hsubs,2 %b[16], %b[109], %r25, %b[16] qpfmul_hadds,3 %b[87], %b[108], %r25, %b[87] qpfmul_hadds,4 %b[86], %b[107], %r25, %b[86] qpfmul_hadds,5 %b[90], %b[109], %r25, %b[90] } { loop_mode qpfmul_hsubs,0 %b[24], 
%b[103], %r25, %b[24] qpfmul_hadds,1 %b[84], %b[103], %r25, %b[84] qpfmul_hsubs,2 %b[25], %b[113], %r25, %b[25] qpfmul_hadds,3 %b[83], %b[113], %r25, %b[83] qpfmul_hsubs,4 %b[26], %b[115], %r25, %b[26] qpfmul_hadds,5 %b[82], %b[115], %r25, %b[82] } { loop_mode qpfmul_hsubs,0 %b[29], %b[96], %r25, %b[29] qpfmul_hsubs,1 %b[123], %b[110], %r25, %b[99] qpfmul_hadds,2 %b[79], %b[96], %r25, %b[79] qpfmul_hadds,3 %b[76], %b[110], %r25, %b[76] qpfmul_hsubs,4 %b[122], %b[114], %r25, %b[96] qpfmul_hadds,5 %b[74], %b[114], %r25, %b[74] } { loop_mode qpfmul_hsubs,0 %b[112], %b[93], %r25, %b[100] qpfmul_hadds,1 %b[71], %b[93], %r25, %b[71] qpfmul_hsubs,2 %b[30], %b[116], %r25, %b[30] qpfmul_hadds,3 %b[81], %b[116], %r25, %b[81] qpfmul_hsubs,4 %g20, %b[117], %r25, %g20 qpfmul_hadds,5 %b[64], %b[117], %r25, %b[64] } { loop_mode qpfmul_hsubs,0 %b[106], %b[94], %r25, %b[93] qpfmul_hadds,1 %b[73], %b[94], %r25, %b[73] qpfmul_hsubs,2 %b[15], %b[119], %r25, %b[15] qpfmul_hsubs,3 %b[19], %b[118], %r25, %b[19] qpfmul_hadds,4 %b[88], %b[119], %r25, %b[88] qpfmul_hadds,5 %b[89], %b[118], %r25, %b[89] } { loop_mode nop 5 qpfmul_hadds,0 %b[85], %b[105], %r25, %b[85] qpfmul_hsubs,2 %b[21], %b[105], %r25, %b[21] } { loop_mode qpshufb,1 %b[97], %b[97], %r24, %b[94] qpshufb,3 %g21, %g21, %r24, %b[101] qpshufb,4 %b[98], %b[98], %r24, %b[103] } { loop_mode qpshufb,0 %b[121], %b[121], %r24, %b[104] qpshufb,1 %b[111], %b[111], %r24, %b[105] qpshufb,3 %b[34], %b[34], %r24, %b[106] qpshufb,4 %b[33], %b[33], %r24, %b[107] } { loop_mode qpshufb,0 %b[27], %b[27], %r24, %b[108] qpshufb,1 %b[28], %b[28], %r24, %b[109] qpshufb,3 %b[23], %b[23], %r24, %b[110] qpshufb,4 %b[20], %b[20], %r24, %b[112] } { loop_mode qpshufb,0 %b[17], %b[17], %r24, %b[113] qpshufb,1 %b[47], %b[41], %r23, %b[41] qpshufb,3 %g30, %b[42], %r23, %g30 qpshufb,4 %b[55], %b[49], %r23, %b[42] } { loop_mode qpshufb,0 %g25, %b[43], %r23, %g25 qpshufb,1 %b[11], %b[52], %r23, %b[11] qpfmul_hsubs,2 %b[98], %b[41], %r25, %b[43] qpshufb,3 
%b[53], %g19, %r23, %g19 qpshufb,4 %g17, %b[59], %r23, %g17 qpfmul_hsubs,5 %b[33], %g30, %r25, %b[33] } { loop_mode qpshufb,0 %b[67], %g23, %r23, %g23 qpshufb,1 %g27, %g26, %r23, %g26 qpfmul_hadds,2 %b[106], %g25, %r25, %b[47] qpshufb,3 %b[69], %b[68], %r23, %g27 qpshufb,4 %b[38], %b[61], %r23, %b[38] qpfmul_hsubs,5 %b[17], %g19, %r25, %b[17] } { loop_mode qpshufb,0 %b[45], %b[37], %r23, %b[37] qpfmul_hadds,1 %b[103], %b[41], %r25, %b[41] qpfmul_hsubs,2 %b[34], %g25, %r25, %g25 qpfmul_hadds,3 %b[107], %g30, %r25, %g30 qpfmul_hsubs,4 %b[97], %b[42], %r25, %b[34] qpfmul_hadds,5 %b[94], %b[42], %r25, %b[42] } { loop_mode qpfmul_hadds,0 %b[110], %b[11], %r25, %b[45] qpfmul_hsubs,1 %b[23], %b[11], %r25, %b[11] qpfmul_hadds,2 %b[108], %g23, %r25, %b[49] qpfmul_hsubs,3 %b[121], %g17, %r25, %b[23] qpfmul_hadds,4 %b[104], %g17, %r25, %g17 qpfmul_hadds,5 %b[113], %g19, %r25, %g19 } { loop_mode qpfmul_hsubs,0 %b[111], %g26, %r25, %b[52] qpfmul_hadds,1 %b[105], %g26, %r25, %g26 qpfmul_hsubs,2 %b[27], %g23, %r25, %g23 qpfmul_hsubs,3 %b[28], %g27, %r25, %b[28] qpfmul_hadds,4 %b[109], %g27, %r25, %g27 qpfmul_hsubs,5 %b[20], %b[38], %r25, %b[20] } { loop_mode qpfmul_hadds,0 %b[101], %b[37], %r25, %b[27] qppermb,1 %b[75], %b[35], %r7, %b[35] qpfmul_hsubs,2 %g21, %b[37], %r25, %g21 qppermb,3 %b[80], %b[31], %r7, %b[31] qppermb,4 %b[65], %b[91], %r7, %b[38] qpfmul_hadds,5 %b[112], %b[38], %r25, %b[37] } { loop_mode qppermb,0 %b[72], %b[92], %r7, %b[53] qppermb,1 %b[63], %b[95], %r7, %b[55] qppermb,3 %b[66], %b[102], %r7, %b[59] qppermb,4 %b[78], %b[32], %r7, %b[32] } { loop_mode qppermb,0 %b[90], %b[16], %r7, %b[16] qppermb,1 %b[77], %b[36], %r7, %b[36] qppermb,3 %b[87], %b[18], %r7, %b[18] qppermb,4 %b[84], %b[24], %r7, %b[24] } { loop_mode qppermb,0 %b[86], %b[22], %r7, %b[22] } { loop_mode qpfsubs,1 %b[38], %b[53], %b[61] qpfsubs,3 %b[35], %b[31], %b[63] qpfadds,4 %b[35], %b[31], %b[31] } { loop_mode qpfsubs,0 %b[36], %b[32], %b[65] qpfadds,1 %b[38], %b[53], %b[38] qpfsubs,2 
%b[55], %b[59], %b[35] qpfadds,3 %b[55], %b[59], %b[53] } { loop_mode qpfsubs,0 %b[24], %b[22], %b[59] qpfadds,1 %b[36], %b[32], %b[32] qpfsubs,2 %b[18], %b[16], %b[55] qpfadds,3 %b[18], %b[16], %b[16] } { loop_mode qpfadds,0 %b[24], %b[22], %b[22] qppermb,4 %b[83], %b[25], %r7, %b[18] } { loop_mode qppermb,4 %b[79], %b[29], %r7, %b[24] } { loop_mode qppermb,1 %b[76], %b[99], %r7, %b[25] qppermb,3 %b[71], %b[100], %r7, %b[29] qppermb,4 %b[74], %b[96], %r7, %b[36] qpfsubs,5 %b[24], %b[18], %b[66] } { loop_mode qppermb,0 %b[82], %b[26], %r7, %b[26] qppermb,1 %b[64], %g20, %r7, %g20 qppermb,3 %b[81], %b[30], %r7, %b[30] qppermb,4 %b[85], %b[21], %r7, %b[21] qpfadds,5 %b[24], %b[18], %b[18] } { loop_mode qppermb,0 %b[88], %b[15], %r7, %b[15] qppermb,1 %b[89], %b[19], %r7, %b[19] qppermb,3 %b[73], %b[93], %r7, %b[24] qpshufb,4 %b[51], %b[48], %r23, %b[48] } { loop_mode qpshufb,0 %g24, %g16, %r23, %g16 qpshufb,1 %g18, %b[50], %r23, %g18 qpfsubs,2 %b[15], %g20, %b[64] qpshufb,3 %b[57], %g31, %r23, %g24 qpshufb,4 %b[58], %b[13], %r23, %g31 qpfsubs,5 %b[24], %b[36], %b[51] } { loop_mode qpshufb,0 %b[12], %g28, %r23, %g28 qpshufb,1 %g22, %b[54], %r23, %g22 qpfsubs,2 %b[29], %b[25], %b[13] qpshufb,3 %b[70], %b[14], %r23, %b[12] qpshufb,4 %b[40], %b[56], %r23, %b[14] qpfadds,5 %b[29], %b[25], %b[25] } { loop_mode qpfsubs,0 %b[21], %b[19], %b[40] qpshufb,1 %b[60], %b[39], %r23, %b[39] qpfsubs,2 %b[30], %b[26], %b[29] qpshufb,3 %b[46], %b[44], %r23, %b[44] qppermb,4 %b[41], %b[43], %r7, %b[41] qpfadds,5 %b[30], %b[26], %b[26] } { loop_mode qppermb,0 %b[47], %g25, %r7, %g25 qpshufb,1 %g29, %b[62], %r23, %g29 qpfadds,2 %b[15], %g20, %g20 qppermb,3 %g30, %b[33], %r7, %g30 qppermb,4 %b[42], %b[34], %r7, %b[30] qpfadds,5 %b[24], %b[36], %b[15] } { loop_mode qpfadds,0 %b[21], %b[19], %b[17] qppermb,1 %b[45], %b[11], %r7, %b[11] qpfadds,2 %g18, %g25, %b[21] qppermb,3 %g17, %b[23], %r7, %g17 qppermb,4 %g19, %b[17], %r7, %g19 qpfsubs,5 %b[48], %b[41], %b[19] } { loop_mode qppermb,0 
%b[49], %g23, %r7, %g23 qppermb,1 %g26, %b[52], %r7, %g26 qpfsubs,2 %g18, %g25, %g18 qppermb,3 %g27, %b[28], %r7, %g27 qppermb,4 %b[37], %b[20], %r7, %b[20] qpfadds,5 %b[48], %b[41], %b[23] } { loop_mode qpfadds,0 %g31, %b[11], %b[27] qppermb,1 %b[27], %g21, %r7, %g21 qpfsubs,2 %g31, %b[11], %g31 qpfsubs,3 %g24, %b[30], %g25 qpfsubs,4 %g16, %g30, %b[24] qpfadds,5 %g24, %b[30], %g24 } { loop_mode qpfadds,0 %g16, %g30, %g16 qpfsubs,1 %g22, %g23, %b[12] qpfadds,2 %b[14], %g26, %b[28] qpfadds,3 %b[12], %g17, %g30 qpfsubs,4 %b[12], %g17, %g17 qpfsubs,5 %b[39], %g27, %b[11] } { loop_mode qpfadds,0 %g28, %g19, %b[30] qpfsubs,1 %g28, %g19, %g19 qpfsubs,2 %b[14], %g26, %g26 qpfadds,3 %b[39], %g27, %g27 qpfsubs,4 %g29, %b[20], %g29 qpfadds,5 %g29, %b[20], %g28 } { loop_mode qpfadds,0 %b[44], %g21, %g23 qpfsubs,1 %b[44], %g21, %g21 qpfadds,2 %g22, %g23, %g22 qpfadds,3 %b[23], %b[38], %b[14] qpfsubs,4 %b[23], %b[38], %b[20] } { loop_mode qpfadds,0 %b[21], %b[32], %b[23] qpfsubs,1 %b[21], %b[32], %b[21] qpfadds,2 %b[27], %b[22], %b[32] qpfsubs,3 %g24, %b[53], %b[33] qpfadds,4 %g24, %b[53], %g24 } { loop_mode qpfadds,0 %g16, %b[31], %b[34] qpfsubs,1 %g16, %b[31], %g16 qpfsubs,2 %b[27], %b[22], %b[22] qpfsubs,3 %g30, %b[25], %b[27] qpfadds,4 %g30, %b[25], %g30 } { loop_mode qpfadds,0 %b[30], %b[16], %b[25] qpfsubs,1 %b[30], %b[16], %b[16] qpfsubs,2 %b[28], %b[15], %b[31] qpfadds,3 %g27, %b[26], %b[30] qpfsubs,4 %g27, %b[26], %g27 qpfsubs,5 %g28, %b[17], %b[26] } { loop_mode qpfadds,0 %g22, %b[18], %b[36] qpfsubs,1 %g22, %b[18], %g22 qpfadds,2 %g23, %g20, %b[18] qpfadds,3 %b[28], %b[15], %b[15] qpfadds,4 %g28, %b[17], %g28 stqp,5 %r18, %r0, %b[20] } { loop_mode qpfsubs,0 %g23, %g20, %g20 stqp,2 %r36, %r0, %b[21] stqp,5 %r2, %r0, %b[14] } { loop_mode stqp,2 %r16, %r0, %g16 stqp,5 %r30, %r0, %b[33] } { loop_mode qpshufb,1 %b[61], %b[61], %r24, %g16 stqp,2 %r27, %r0, %b[23] qpshufb,3 %b[63], %b[63], %r24, %g23 qpshufb,4 %b[35], %b[35], %r24, %b[14] stqp,5 %r29, %r0, %g24 } { 
loop_mode qpshufb,0 %b[55], %b[55], %r24, %g24 qpshufb,1 %b[65], %b[65], %r24, %b[17] stqp,2 %r51, %r0, %b[22] qpshufb,3 %b[59], %b[59], %r24, %b[20] qpshufb,4 %b[66], %b[66], %r24, %b[21] stqp,5 %r20, %r0, %b[34] } { loop_mode qpshufb,0 %b[13], %b[13], %r24, %b[13] qpshufb,1 %b[51], %b[51], %r24, %b[22] stqp,2 %r38, %r0, %b[32] qpshufb,3 %b[29], %b[29], %r24, %b[23] qpshufb,4 %b[64], %b[64], %r24, %b[28] stqp,5 %r21, %r0, %g30 } { loop_mode qpshufb,0 %b[40], %b[40], %r24, %g30 qpxor,1 %g16, %r22, %g16 stqp,2 %r13, %r0, %b[27] qpxor,3 %g23, %r22, %g23 qpxor,4 %b[14], %r22, %b[14] stqp,5 %r35, %r0, %g27 } { loop_mode qpxor,0 %b[17], %r22, %g27 qpxor,1 %g24, %r22, %g24 qpfadds,2 %b[19], %g16, %b[21] qpxor,3 %b[20], %r22, %b[17] qpxor,4 %b[21], %r22, %b[20] qpfadds,5 %b[24], %g23, %b[27] } { loop_mode qpxor,0 %b[13], %r22, %b[13] qpxor,1 %b[22], %r22, %b[22] qpfsubs,2 %b[19], %g16, %g16 qpxor,3 %b[23], %r22, %b[23] qpxor,4 %b[28], %r22, %b[28] qpfsubs,5 %b[24], %g23, %g23 } { loop_mode qpxor,0 %g30, %r22, %g30 qpfsubs,1 %g19, %g24, %b[14] qpfadds,2 %g19, %g24, %g19 qpfsubs,3 %g25, %b[14], %b[19] qpfadds,4 %g25, %b[14], %g25 qpfadds,5 %g31, %b[17], %b[24] } { loop_mode qpfsubs,0 %g18, %g27, %g24 qpfadds,1 %g18, %g27, %g18 qpfadds,2 %g17, %b[13], %b[17] qpfsubs,3 %g31, %b[17], %g27 qpfsubs,4 %b[12], %b[20], %g31 qpfadds,5 %b[12], %b[20], %b[12] } { loop_mode qpfsubs,0 %g17, %b[13], %g17 qpfadds,1 %g26, %b[22], %b[20] qpfsubs,2 %g26, %b[22], %g26 qpfsubs,3 %b[11], %b[23], %b[13] qpfadds,4 %b[11], %b[23], %b[11] qpfsubs,5 %g21, %b[28], %b[23] } { loop_mode qpfadds,0 %g21, %b[28], %g21 qpfsubs,1 %g29, %g30, %b[22] qpfadds,2 %g29, %g30, %g29 stqp,5 %r26, %r0, %b[30] } { loop_mode stqp,2 %r37, %r0, %b[31] stqp,5 %r44, %r0, %b[25] } { loop_mode stqp,2 %r49, %r0, %b[16] stqp,5 %r3, %r0, %g22 } { loop_mode stqp,2 %r28, %r0, %b[15] stqp,5 %r17, %r0, %b[36] } { loop_mode stqp,2 %r54, %r0, %g20 stqp,5 %r43, %r0, %b[18] } { loop_mode stqp,2 %r50, %r0, %b[26] stqp,5 %r5, %r0, %b[27] 
} { loop_mode stqp,2 %r45, %r0, %g28 stqp,5 %r11, %r0, %g23 } { loop_mode stqp,2 %r15, %r0, %b[21] stqp,5 %r19, %r0, %g16 } { loop_mode stqp,2 %r34, %r0, %g25 stqp,5 %r9, %r0, %b[19] } { loop_mode stqp,2 %r57, %r0, %g19 stqp,5 %r47, %r0, %b[14] } { loop_mode stqp,2 %r53, %r0, %b[24] stqp,5 %r32, %r0, %g24 } { loop_mode stqp,2 %r40, %r0, %g18 stqp,5 %r42, %r0, %g27 } { loop_mode stqp,2 %r4, %r0, %b[12] stqp,5 %r14, %r0, %g31 } { loop_mode stqp,2 %r1, %r0, %b[17] stqp,5 %r12, %r0, %g17 } { loop_mode stqp,2 %r56, %r0, %g21 stqp,5 %r39, %r0, %b[11] } { loop_mode stqp,2 %r46, %r0, %b[23] stqp,5 %r31, %r0, %b[13] } { loop_mode stqp,2 %r41, %r0, %b[20] stqp,5 %r33, %r0, %g26 } { loop_mode addd,0,sm %r0, _f16s,_lts0lo 0x30, %r0 stqp,2 %r52, %r0, %g29 stqp,5 %r48, %r0, %b[22] } { loop_mode ct %ctpr1 ? %NOT_LOOP_END alc alcf=1, alct=1 } { loop_mode ldd,0 %r8, _f16s,_lts0lo 0xff30, %b[11], mas=0x4 ldd,2 %r8, _f16s,_lts0hi 0xff38, %b[12], mas=0x4 ldd,3 %r8, _f16s,_lts1lo 0xff40, %b[13], mas=0x4 ldd,5 %r8, _f16s,_lts1hi 0xff48, %b[14], mas=0x4 } { loop_mode ldd,0 %r8, _f16s,_lts0lo 0xff50, %b[15], mas=0x4 ldd,2 %r8, _f16s,_lts0hi 0xff58, %b[16], mas=0x4 ldd,3 %r8, _f16s,_lts1lo 0xff60, %b[17], mas=0x4 ldd,5 %r8, _f16s,_lts1hi 0xff68, %b[18], mas=0x4 } { loop_mode ldd,0 %r8, _f16s,_lts0lo 0xff70, %b[19], mas=0x4 ldd,2 %r8, _f16s,_lts0hi 0xff78, %b[20], mas=0x4 ldd,3 %r8, _f16s,_lts1lo 0xff80, %b[21], mas=0x4 ldd,5 %r8, _f16s,_lts1hi 0xff88, %b[22], mas=0x4 } { loop_mode ldd,0 %r8, _f16s,_lts0lo 0xff90, %b[23], mas=0x4 ldd,2 %r8, _f16s,_lts0hi 0xff98, %b[24], mas=0x4 ldd,3 %r8, _f16s,_lts1lo 0xffa0, %b[25], mas=0x4 ldd,5 %r8, _f16s,_lts1hi 0xffa8, %b[26], mas=0x4 } { loop_mode ldd,0 %r8, _f16s,_lts0lo 0xffb0, %b[27], mas=0x4 ldd,2 %r8, _f16s,_lts0hi 0xffb8, %b[28], mas=0x4 ldd,3 %r8, _f16s,_lts1lo 0xffc0, %b[29], mas=0x4 ldd,5 %r8, _f16s,_lts1hi 0xffc8, %b[30], mas=0x4 } { loop_mode ldd,0 %r8, _f16s,_lts0lo 0xffd0, %b[31], mas=0x4 ldd,2 %r8, _f16s,_lts0hi 0xffd8, %b[32], mas=0x4 
ldd,3 %r8, _f16s,_lts1lo 0xffe0, %b[33], mas=0x4 ldd,5 %r8, _f16s,_lts1hi 0xffe8, %b[34], mas=0x4 } { loop_mode ldd,0 %r8, _f16s,_lts0lo 0xfff0, %b[35], mas=0x4 ldd,2 %r8, _f16s,_lts0hi 0xfff8, %b[36], mas=0x4 }
Theoretical speed: 96 complex numbers in 158 cycles (96 × 8 bytes / 158) = 4.86 bytes/cycle
Quadrupled theoretical speed: 19.44 bytes/cycle
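The theoretical figures in this article follow from simple arithmetic: each `float complex` element takes 8 bytes, and the cycle count is read off the scheduled loop. A minimal sketch of that calculation is given below; the helper names `bytes_per_cycle` and `radix2_equivalent_speed` are hypothetical and are not part of the benchmark code.

```c
#include <assert.h>

/* One float complex element: two 4-byte floats. */
enum { BYTES_PER_COMPLEX = 8 };

/* Bytes processed per cycle when `count` complex numbers take `cycles` cycles. */
double bytes_per_cycle(unsigned count, unsigned cycles)
{
    assert(cycles > 0);
    return (double)count * BYTES_PER_COMPLEX / (double)cycles;
}

/* Speed rescaled for comparison with stage_radix2, assuming one pass of the
 * stage replaces `radix2_passes` passes of stage_radix2 (hypothetical helper). */
double radix2_equivalent_speed(unsigned count, unsigned cycles,
                               unsigned radix2_passes)
{
    return bytes_per_cycle(count, cycles) * radix2_passes;
}
```

For the loop above, `bytes_per_cycle(96, 158)` gives about 4.86, and `radix2_equivalent_speed(96, 158, 4)` gives about 19.44.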
Speed measurements

Summary for stage_radix4_2x


Speeds dropped compared with the original stage_radix4 versions.
The FFT plot is here.
stage_radix4_readConjSwap
One pass of stage_radix4_readConjSwap does the same work as 2 passes of stage_radix2. So, for easier comparison with stage_radix2, the speed of stage_radix4_readConjSwap is multiplied by 2 (this is noted on the plot axis and in the console output).
1. stage_radix4_readConjSwap_simd64
The computations follow the same scheme as stage_radix2_readConjSwap_simd64.
C code
void stage_radix4_readConjSwap_simd64(int data_count, myComplex *data_in, myComplex *data_out,
                                      myComplex *conj_coefC, myComplex *conj_coefD, myComplex *conj_coefE,
                                      myComplex *swap_coefC, myComplex *swap_coefD, myComplex *swap_coefE)
{
    uint64_t *x_in = (uint64_t*)&data_in[0];
    uint64_t *y_in = (uint64_t*)&data_in[1];
    uint64_t *z_in = (uint64_t*)&data_in[2];
    uint64_t *w_in = (uint64_t*)&data_in[3];

    uint64_t *conj_c_in = (uint64_t*)conj_coefC;
    uint64_t *conj_d_in = (uint64_t*)conj_coefD;
    uint64_t *conj_e_in = (uint64_t*)conj_coefE;
    uint64_t *swap_c_in = (uint64_t*)swap_coefC;
    uint64_t *swap_d_in = (uint64_t*)swap_coefD;
    uint64_t *swap_e_in = (uint64_t*)swap_coefE;

    uint64_t *out_0 = (uint64_t*)&data_out[0*data_count/4];
    uint64_t *out_1 = (uint64_t*)&data_out[1*data_count/4];
    uint64_t *out_2 = (uint64_t*)&data_out[2*data_count/4];
    uint64_t *out_3 = (uint64_t*)&data_out[3*data_count/4];

    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/4; ++i)
    {
        uint64_t x = x_in[4*i];
        uint64_t y = y_in[4*i];
        uint64_t z = z_in[4*i];
        uint64_t w = w_in[4*i];

        uint64_t conj_c = conj_c_in[i];
        uint64_t conj_d = conj_d_in[i];
        uint64_t conj_e = conj_e_in[i];
        uint64_t swap_c = swap_c_in[i];
        uint64_t swap_d = swap_d_in[i];
        uint64_t swap_e = swap_e_in[i];

        uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
        uint64_t dz_real = __builtin_e2k_pfmuls(conj_d, z);
        uint64_t ew_real = __builtin_e2k_pfmuls(conj_e, w);
        uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);
        uint64_t dz_imag = __builtin_e2k_pfmuls(swap_d, z);
        uint64_t ew_imag = __builtin_e2k_pfmuls(swap_e, w);

        uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);
        uint64_t dz = __builtin_e2k_pfhadds(dz_real, dz_imag);
        uint64_t ew = __builtin_e2k_pfhadds(ew_real, ew_imag);

        uint64_t add02 = __builtin_e2k_pfadds( x, dz);
        uint64_t sub02 = __builtin_e2k_pfsubs( x, dz);
        uint64_t add13 = __builtin_e2k_pfadds(cy, ew);
        uint64_t sub13 = __builtin_e2k_pfsubs(cy, ew);

        //uint64_t conj_sub13 = __builtin_e2k_pxord(sub13, 1LL<<63);
        //uint64_t sub13i = __builtin_e2k_pshufb(0, conj_sub13, 0x0302010007060504);
        uint64_t swap_sub13 = __builtin_e2k_pshufb(0, sub13, 0x0302010007060504);
        uint64_t sub13i = __builtin_e2k_pxord(swap_sub13, 1LL<<31);

        out_0[i] = __builtin_e2k_pfadds(add02, add13);
        out_1[i] = __builtin_e2k_pfsubs(sub02, sub13i);
        out_2[i] = __builtin_e2k_pfsubs(add02, add13);
        out_3[i] = __builtin_e2k_pfadds(sub02, sub13i);
    }
}
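To make the data flow of this function easier to follow, here is a portable scalar sketch of one loop iteration written with `<complex.h>`. It is a model, not the author's implementation. It assumes that `pfmuls` multiplies the two packed floats element-wise, that `pfhadds(a, b)` returns the pair (a.lo + a.hi, b.lo + b.hi), and that the real part sits in the low half. The names `pair_t`, `mul2`, `hadd2`, `mul_conj_swap` and `radix4_butterfly` are hypothetical.

```c
#include <assert.h>
#include <complex.h>

/* Two packed floats modelling one 64-bit register: lo = real, hi = imaginary. */
typedef struct { float lo, hi; } pair_t;

/* Assumed model of pfmuls: element-wise product. */
static pair_t mul2(pair_t a, pair_t b)
{
    return (pair_t){ a.lo * b.lo, a.hi * b.hi };
}

/* Assumed model of pfhadds: horizontal sum of each operand. */
static pair_t hadd2(pair_t a, pair_t b)
{
    return (pair_t){ a.lo + a.hi, b.lo + b.hi };
}

/* c*y computed from the two precomputed forms of the coefficient c. */
float complex mul_conj_swap(float complex c, float complex y)
{
    pair_t conj_c = { crealf(c), -cimagf(c) };  /* conjugated coefficient */
    pair_t swap_c = { cimagf(c),  crealf(c) };  /* re/im swapped coefficient */
    pair_t yy     = { crealf(y),  cimagf(y) };
    pair_t r = hadd2(mul2(conj_c, yy), mul2(swap_c, yy));
    return r.lo + r.hi * I;
}

/* One radix-4 butterfly with the same intermediate names as the SIMD code. */
void radix4_butterfly(float complex x, float complex y,
                      float complex z, float complex w,
                      float complex c, float complex d, float complex e,
                      float complex out[4])
{
    float complex cy = mul_conj_swap(c, y);
    float complex dz = mul_conj_swap(d, z);
    float complex ew = mul_conj_swap(e, w);
    float complex add02  = x + dz;
    float complex sub02  = x - dz;
    float complex add13  = cy + ew;
    float complex sub13i = (cy - ew) * I;   /* multiplication by i */
    out[0] = add02 + add13;
    out[1] = sub02 - sub13i;
    out[2] = add02 - add13;
    out[3] = sub02 + sub13i;
}
```

With all coefficients equal to 1 and the input (0, 1, 0, 0), the sketch returns (1, -i, -1, i), which matches a forward 4-point DFT of that input.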
Main loop in assembly
.L640: { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=3, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=3, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=3, abs=8, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=3, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=16, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=3, abs=24, disp=0 } .L275: { loop_mode pfmul_hadds,0,sm %b[43], %b[59], %b[80], %b[11] pfadd_adds,1,sm %b[34], %b[17], %b[87], %b[67] pfadd_rsubs,2,sm %b[34], %b[17], %b[87], %b[54] pfmuls,3,sm %b[12], %b[55], %b[76] pfsubs,4,sm %b[42], %b[25], %b[88] pfadds,5,sm %b[44], %b[27], %b[83] movad,1 area=0, ind=0, am=1, be=0, %b[70] movad,2 area=2, ind=0, am=1, be=0, %b[1] movad,3 area=1, ind=0, am=1, be=0, %b[0] } { loop_mode pfsub_rsubs,0,sm %b[34], %b[17], %b[99], %b[80] pfsub_adds,1,sm %b[34], %b[17], %b[99], %b[59] staad,2 %b[73], %aad4[ %aasti11 ] incr,2 %aaincr0 pfmuls,3,sm %b[100], %b[64], %b[87] pshufb,4,sm 0x0, %b[90], %r21, %b[93] staad,5 %b[60], %aad2[ %aasti9 ] incr,5 %aaincr0 movad,0 area=3, ind=0, am=1, be=0, %b[44] movad,1 area=2, ind=0, am=1, be=0, %b[27] movad,2 area=0, ind=0, am=0, be=0, %b[12] movad,3 area=0, ind=16, am=0, be=0, %b[43] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END pfmul_hadds,0,sm %b[52], %b[68], %b[91], %b[17] pfmul_hadds,1,sm %b[9], %b[81], %b[94], %b[34] staad,2 %b[86], %aad3[ %aasti10 ] incr,2 %aaincr0 pfmuls,3,sm %b[74], %b[77], %b[90] xord,4,sm %b[95], %r9, %b[97] staad,5 %b[65], %aad1[ %aasti8 ] incr,5 %aaincr0 movad,1 area=1, ind=0, am=1, be=0, %b[96] movad,2 area=0, ind=8, am=1, be=0, %b[73] movad,3 area=0, ind=24, am=0, be=0, %b[60] }
Theoretical speed: 4 complex numbers in 3 cycles (4 × 8 bytes / 3) = 10.67 bytes/cycle
Doubled theoretical speed: 21.33 bytes/cycle
Speed measurements

2. stage_radix4_readConjSwap_simd128
The computations follow the same scheme as stage_radix2_readConjSwap_simd128.
C code
void stage_radix4_readConjSwap_simd128(int data_count, myComplex *data_in, myComplex *data_out,
                                       myComplex *conj_coefC, myComplex *conj_coefD, myComplex *conj_coefE,
                                       myComplex *swap_coefC, myComplex *swap_coefD, myComplex *swap_coefE)
{
    __v2di *xy0_in = (__v2di*)&data_in[0];
    __v2di *zw0_in = (__v2di*)&data_in[2];
    __v2di *xy1_in = (__v2di*)&data_in[4];
    __v2di *zw1_in = (__v2di*)&data_in[6];

    __v2di *conj_c_in = (__v2di*)conj_coefC;
    __v2di *conj_d_in = (__v2di*)conj_coefD;
    __v2di *conj_e_in = (__v2di*)conj_coefE;
    __v2di *swap_c_in = (__v2di*)swap_coefC;
    __v2di *swap_d_in = (__v2di*)swap_coefD;
    __v2di *swap_e_in = (__v2di*)swap_coefE;

    __v2di *out_0 = (__v2di*)&data_out[0*data_count/4];
    __v2di *out_1 = (__v2di*)&data_out[1*data_count/4];
    __v2di *out_2 = (__v2di*)&data_out[2*data_count/4];
    __v2di *out_3 = (__v2di*)&data_out[3*data_count/4];

    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/8; ++i)
    {
        __v2di xy0 = xy0_in[4*i];
        __v2di zw0 = zw0_in[4*i];
        __v2di xy1 = xy1_in[4*i];
        __v2di zw1 = zw1_in[4*i];

        __v2di conj_c = conj_c_in[i];
        __v2di conj_d = conj_d_in[i];
        __v2di conj_e = conj_e_in[i];
        __v2di swap_c = swap_c_in[i];
        __v2di swap_d = swap_d_in[i];
        __v2di swap_e = swap_e_in[i];

        __v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di z = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di w = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

        __v2di cy_real = __builtin_e2k_qpfmuls(conj_c, y);
        __v2di dz_real = __builtin_e2k_qpfmuls(conj_d, z);
        __v2di ew_real = __builtin_e2k_qpfmuls(conj_e, w);
        __v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
        __v2di dz_imag = __builtin_e2k_qpfmuls(swap_d, z);
        __v2di ew_imag = __builtin_e2k_qpfmuls(swap_e, w);

        __v2di cy_rrii = __builtin_e2k_qpfhadds(cy_real, cy_imag);
        __v2di dz_rrii = __builtin_e2k_qpfhadds(dz_real, dz_imag);
        __v2di ew_rrii = __builtin_e2k_qpfhadds(ew_real, ew_imag);

        __v2di dz = __builtin_e2k_qpshufb(dz_rrii, dz_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

        __v2di add02 = __builtin_e2k_qpfadds( x, dz);
        __v2di sub02 = __builtin_e2k_qpfsubs( x, dz);
        __v2di add13_rrii = __builtin_e2k_qpfadds(cy_rrii, ew_rrii);
        __v2di sub13_rrii = __builtin_e2k_qpfsubs(cy_rrii, ew_rrii);

        __v2di add13 = __builtin_e2k_qpshufb(add13_rrii, add13_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

        //__v2di conj_sub13 = __builtin_e2k_qpxor(sub13_rrii, (__v2di){(1LL<<63) + (1LL<<31), 0});
        //__v2di sub13i = __builtin_e2k_qpshufb(conj_sub13, conj_sub13, (__v2di){0x030201000B0A0908, 0x070605040F0E0D0C});
        __v2di swap_sub13 = __builtin_e2k_qpshufb(sub13_rrii, sub13_rrii, (__v2di){0x030201000B0A0908, 0x070605040F0E0D0C});
        __v2di sub13i = __builtin_e2k_qpxor(swap_sub13, (__v2di){1LL<<31, 1LL<<31});

        out_0[i] = __builtin_e2k_qpfadds(add02, add13);
        out_1[i] = __builtin_e2k_qpfsubs(sub02, sub13i);
        out_2[i] = __builtin_e2k_qpfsubs(add02, add13);
        out_3[i] = __builtin_e2k_qpfadds(sub02, sub13i);
    }
}
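Both the simd64 and simd128 variants obtain `sub13i` (the difference multiplied by i) without a floating-point multiply: the real and imaginary halves are swapped, then the sign bit of the new real part is flipped. Below is a portable 64-bit model of that trick. It assumes the real part occupies bits 0..31 and the imaginary part bits 32..63, and that the shuffle mask 0x0302010007060504 simply swaps the two 32-bit halves. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack two floats into one 64-bit word: real in bits 0..31, imaginary in 32..63. */
uint64_t pack_complex(float re, float im)
{
    uint32_t r, i;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&r, &re, sizeof r);
    memcpy(&i, &im, sizeof i);
    return ((uint64_t)i << 32) | r;
}

/* Extract the real part (low 32 bits). */
float unpack_re(uint64_t v)
{
    uint32_t r = (uint32_t)v;
    float f;
    memcpy(&f, &r, sizeof f);
    return f;
}

/* Extract the imaginary part (high 32 bits). */
float unpack_im(uint64_t v)
{
    uint32_t i = (uint32_t)(v >> 32);
    float f;
    memcpy(&f, &i, sizeof f);
    return f;
}

/* (a + bi) * i = -b + ai: swap the halves, then negate the new real part. */
uint64_t mul_by_i(uint64_t v)
{
    uint64_t swapped = (v << 32) | (v >> 32);  /* role of pshufb, mask 0x0302010007060504 */
    return swapped ^ (UINT64_C(1) << 31);      /* role of pxord with 1LL<<31 */
}
```

For example, `mul_by_i(pack_complex(3.0f, 4.0f))` unpacks to -4 + 3i, and applying `mul_by_i` four times returns the original bit pattern.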
Main loop in assembly
.L1243: { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=3, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=3, abs=8, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=3, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=16, disp=0 } { fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=3, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=3, abs=24, disp=0 } .L699: { loop_mode qpfmuls,0,sm %b[59], %b[33], %b[48] qpfmul_hadds,1,sm %b[32], %b[41], %b[80], %b[6] qpfadd_adds,2,sm %b[24], %b[77], %b[62], %b[47] qpshufb,3,sm %b[54], %b[54], %g17, %b[60] qpshufb,4,sm %b[42], %b[42], %g16, %b[69] qpfadd_rsubs,5,sm %b[24], %b[77], %b[62], %b[7] movaqp,1 area=0, ind=0, am=0, be=0, %b[1] movaqp,3 area=0, ind=0, am=0, be=0, %b[0] } { loop_mode qpfmul_hadds,0,sm %b[19], %b[38], %b[73], %b[72] qpfmul_hadds,1,sm %b[29], %b[35], %b[50], %b[41] qpfsub_rsubs,2,sm %b[24], %b[77], %b[81], %b[54] qpshufb,3,sm %b[2], %b[3], %r22, %b[32] qpxor,4,sm %b[69], %g18, %b[79] qpfsub_adds,5,sm %b[24], %b[77], %b[81], %b[42] movaqp,1 area=0, ind=16, am=1, be=0, %b[62] movaqp,3 area=0, ind=16, am=1, be=0, %b[59] } { loop_mode qpfadds,0,sm %b[76], %b[10], %b[50] qpfsubs,1,sm %b[76], %b[10], %b[38] staaqp,2 %b[51], %aad4[ %aasti11 ] incr,2 %aaincr0 qpshufb,3,sm %b[61], %b[64], %r22, %b[35] qpshufb,4,sm %b[63], %b[66], %g19, %b[29] staaqp,5 %b[11], %aad2[ %aasti9 ] incr,5 %aaincr0 movaqp,1 area=3, ind=0, am=1, be=0, %b[19] movaqp,3 area=3, ind=0, am=1, be=0, %b[24] } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? 
%NOT_LOOP_END qpfmuls,0,sm %b[70], %b[34], %b[69] qpfmuls,1,sm %b[67], %b[37], %b[76] staaqp,2 %b[58], %aad3[ %aasti10 ] incr,2 %aaincr0 qpshufb,3,sm %b[4], %b[5], %g19, %b[10] qpshufb,4,sm %b[45], %b[45], %g17, %b[73] staaqp,5 %b[46], %aad1[ %aasti8 ] incr,5 %aaincr0 movaqp,0 area=2, ind=0, am=1, be=0, %b[63] movaqp,1 area=1, ind=0, am=1, be=0, %b[66] movaqp,2 area=2, ind=0, am=1, be=0, %b[11] movaqp,3 area=1, ind=0, am=1, be=0, %b[51] }
Theoretical speed: 8 complex numbers in 4 cycles (8 × 8 bytes / 4) = 16 bytes/cycle
Doubled theoretical speed: 32 bytes/cycle
Speed measurements

Summary for stage_radix4_readConjSwap


The FFT plot is here.
stage_radix4_readConjSwap_2x
One pass of stage_radix4_readConjSwap_2x does the same work as 2 passes of stage_radix4_readConjSwap, and one pass of stage_radix4_readConjSwap does the same work as 2 passes of stage_radix2. So, for easier comparison with stage_radix2, the speed of stage_radix4_readConjSwap_2x is multiplied by 4 (this is noted on the plot axis and in the console output).
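The scaling factors 2 and 4 reflect how many radix-2 stages a single pass over the data replaces. As a rough illustration, the sketch below counts the passes over the data for an N-point FFT, assuming N is a power of two and that the number of radix-2 stages (log2 N) divides evenly by the stages covered per pass. The function names `log2_pow2` and `passes_needed` are hypothetical.

```c
#include <assert.h>

/* log2 of n, for n a power of two. */
unsigned log2_pow2(unsigned n)
{
    unsigned k = 0;
    assert(n != 0 && (n & (n - 1)) == 0);
    while (n > 1) {
        n >>= 1;
        ++k;
    }
    return k;
}

/* Passes over the data for an n-point FFT when each pass does the work of
 * `radix2_per_pass` radix-2 stages: 1 for stage_radix2, 2 for stage_radix4,
 * 4 for the 2x-unrolled radix-4 stage. */
unsigned passes_needed(unsigned n, unsigned radix2_per_pass)
{
    unsigned stages = log2_pow2(n);
    assert(radix2_per_pass != 0 && stages % radix2_per_pass == 0);
    return stages / radix2_per_pass;
}
```

For N = 4096 this gives 12 passes with radix-2 stages, 6 with radix-4 stages and 3 with the 2x-unrolled radix-4 stage, which is why the per-pass speed is multiplied by the same factor when compared with stage_radix2.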
1. stage_radix4_readConjSwap_2x_simd64
Here the stage_radix4_readConjSwap_simd64 algorithm is manually unrolled by a factor of 2: two consecutive radix-4 passes are fused into a single loop iteration.
C code
void stage_radix4_readConjSwap_2x_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coefC_a, myComplex *conj_coefD_a, myComplex *conj_coefE_a, myComplex *conj_coefC_b, myComplex *conj_coefD_b, myComplex *conj_coefE_b, myComplex *swap_coefC_a, myComplex *swap_coefD_a, myComplex *swap_coefE_a, myComplex *swap_coefC_b, myComplex *swap_coefD_b, myComplex *swap_coefE_b) { uint64_t *x0_in = (uint64_t*)&data_in[ 0]; uint64_t *y0_in = (uint64_t*)&data_in[ 1]; uint64_t *z0_in = (uint64_t*)&data_in[ 2]; uint64_t *w0_in = (uint64_t*)&data_in[ 3]; uint64_t *x1_in = (uint64_t*)&data_in[ 4]; uint64_t *y1_in = (uint64_t*)&data_in[ 5]; uint64_t *z1_in = (uint64_t*)&data_in[ 6]; uint64_t *w1_in = (uint64_t*)&data_in[ 7]; uint64_t *x2_in = (uint64_t*)&data_in[ 8]; uint64_t *y2_in = (uint64_t*)&data_in[ 9]; uint64_t *z2_in = (uint64_t*)&data_in[10]; uint64_t *w2_in = (uint64_t*)&data_in[11]; uint64_t *x3_in = (uint64_t*)&data_in[12]; uint64_t *y3_in = (uint64_t*)&data_in[13]; uint64_t *z3_in = (uint64_t*)&data_in[14]; uint64_t *w3_in = (uint64_t*)&data_in[15]; uint64_t *conj_c0a_in = (uint64_t*)&conj_coefC_a[0]; uint64_t *conj_c1a_in = (uint64_t*)&conj_coefC_a[1]; uint64_t *conj_c2a_in = (uint64_t*)&conj_coefC_a[2]; uint64_t *conj_c3a_in = (uint64_t*)&conj_coefC_a[3]; uint64_t *conj_d0a_in = (uint64_t*)&conj_coefD_a[0]; uint64_t *conj_d1a_in = (uint64_t*)&conj_coefD_a[1]; uint64_t *conj_d2a_in = (uint64_t*)&conj_coefD_a[2]; uint64_t *conj_d3a_in = (uint64_t*)&conj_coefD_a[3]; uint64_t *conj_e0a_in = (uint64_t*)&conj_coefE_a[0]; uint64_t *conj_e1a_in = (uint64_t*)&conj_coefE_a[1]; uint64_t *conj_e2a_in = (uint64_t*)&conj_coefE_a[2]; uint64_t *conj_e3a_in = (uint64_t*)&conj_coefE_a[3]; uint64_t *conj_c0b_in = (uint64_t*)&conj_coefC_b[0*data_count/16]; uint64_t *conj_c1b_in = (uint64_t*)&conj_coefC_b[1*data_count/16]; uint64_t *conj_c2b_in = (uint64_t*)&conj_coefC_b[2*data_count/16]; uint64_t *conj_c3b_in = (uint64_t*)&conj_coefC_b[3*data_count/16]; 
uint64_t *conj_d0b_in = (uint64_t*)&conj_coefD_b[0*data_count/16]; uint64_t *conj_d1b_in = (uint64_t*)&conj_coefD_b[1*data_count/16]; uint64_t *conj_d2b_in = (uint64_t*)&conj_coefD_b[2*data_count/16]; uint64_t *conj_d3b_in = (uint64_t*)&conj_coefD_b[3*data_count/16]; uint64_t *conj_e0b_in = (uint64_t*)&conj_coefE_b[0*data_count/16]; uint64_t *conj_e1b_in = (uint64_t*)&conj_coefE_b[1*data_count/16]; uint64_t *conj_e2b_in = (uint64_t*)&conj_coefE_b[2*data_count/16]; uint64_t *conj_e3b_in = (uint64_t*)&conj_coefE_b[3*data_count/16]; uint64_t *swap_c0a_in = (uint64_t*)&swap_coefC_a[0]; uint64_t *swap_c1a_in = (uint64_t*)&swap_coefC_a[1]; uint64_t *swap_c2a_in = (uint64_t*)&swap_coefC_a[2]; uint64_t *swap_c3a_in = (uint64_t*)&swap_coefC_a[3]; uint64_t *swap_d0a_in = (uint64_t*)&swap_coefD_a[0]; uint64_t *swap_d1a_in = (uint64_t*)&swap_coefD_a[1]; uint64_t *swap_d2a_in = (uint64_t*)&swap_coefD_a[2]; uint64_t *swap_d3a_in = (uint64_t*)&swap_coefD_a[3]; uint64_t *swap_e0a_in = (uint64_t*)&swap_coefE_a[0]; uint64_t *swap_e1a_in = (uint64_t*)&swap_coefE_a[1]; uint64_t *swap_e2a_in = (uint64_t*)&swap_coefE_a[2]; uint64_t *swap_e3a_in = (uint64_t*)&swap_coefE_a[3]; uint64_t *swap_c0b_in = (uint64_t*)&swap_coefC_b[0*data_count/16]; uint64_t *swap_c1b_in = (uint64_t*)&swap_coefC_b[1*data_count/16]; uint64_t *swap_c2b_in = (uint64_t*)&swap_coefC_b[2*data_count/16]; uint64_t *swap_c3b_in = (uint64_t*)&swap_coefC_b[3*data_count/16]; uint64_t *swap_d0b_in = (uint64_t*)&swap_coefD_b[0*data_count/16]; uint64_t *swap_d1b_in = (uint64_t*)&swap_coefD_b[1*data_count/16]; uint64_t *swap_d2b_in = (uint64_t*)&swap_coefD_b[2*data_count/16]; uint64_t *swap_d3b_in = (uint64_t*)&swap_coefD_b[3*data_count/16]; uint64_t *swap_e0b_in = (uint64_t*)&swap_coefE_b[0*data_count/16]; uint64_t *swap_e1b_in = (uint64_t*)&swap_coefE_b[1*data_count/16]; uint64_t *swap_e2b_in = (uint64_t*)&swap_coefE_b[2*data_count/16]; uint64_t *swap_e3b_in = (uint64_t*)&swap_coefE_b[3*data_count/16]; uint64_t *out_0 = 
(uint64_t*)&data_out[ 0*data_count/16]; uint64_t *out_1 = (uint64_t*)&data_out[ 1*data_count/16]; uint64_t *out_2 = (uint64_t*)&data_out[ 2*data_count/16]; uint64_t *out_3 = (uint64_t*)&data_out[ 3*data_count/16]; uint64_t *out_4 = (uint64_t*)&data_out[ 4*data_count/16]; uint64_t *out_5 = (uint64_t*)&data_out[ 5*data_count/16]; uint64_t *out_6 = (uint64_t*)&data_out[ 6*data_count/16]; uint64_t *out_7 = (uint64_t*)&data_out[ 7*data_count/16]; uint64_t *out_8 = (uint64_t*)&data_out[ 8*data_count/16]; uint64_t *out_9 = (uint64_t*)&data_out[ 9*data_count/16]; uint64_t *out_10 = (uint64_t*)&data_out[10*data_count/16]; uint64_t *out_11 = (uint64_t*)&data_out[11*data_count/16]; uint64_t *out_12 = (uint64_t*)&data_out[12*data_count/16]; uint64_t *out_13 = (uint64_t*)&data_out[13*data_count/16]; uint64_t *out_14 = (uint64_t*)&data_out[14*data_count/16]; uint64_t *out_15 = (uint64_t*)&data_out[15*data_count/16]; #pragma ivdep #pragma unroll(1) #pragma prefetch for(int64_t i = 0; i < data_count/16; ++i) { uint64_t x0 = x0_in[16*i]; uint64_t y0 = y0_in[16*i]; uint64_t z0 = z0_in[16*i]; uint64_t w0 = w0_in[16*i]; uint64_t conj_c0 = conj_c0a_in[4*i]; uint64_t conj_d0 = conj_d0a_in[4*i]; uint64_t conj_e0 = conj_e0a_in[4*i]; uint64_t swap_c0 = swap_c0a_in[4*i]; uint64_t swap_d0 = swap_d0a_in[4*i]; uint64_t swap_e0 = swap_e0a_in[4*i]; uint64_t x1 = x1_in[16*i]; uint64_t y1 = y1_in[16*i]; uint64_t z1 = z1_in[16*i]; uint64_t w1 = w1_in[16*i]; uint64_t conj_c1 = conj_c1a_in[4*i]; uint64_t conj_d1 = conj_d1a_in[4*i]; uint64_t conj_e1 = conj_e1a_in[4*i]; uint64_t swap_c1 = swap_c1a_in[4*i]; uint64_t swap_d1 = swap_d1a_in[4*i]; uint64_t swap_e1 = swap_e1a_in[4*i]; uint64_t x2 = x2_in[16*i]; uint64_t y2 = y2_in[16*i]; uint64_t z2 = z2_in[16*i]; uint64_t w2 = w2_in[16*i]; uint64_t conj_c2 = conj_c2a_in[4*i]; uint64_t conj_d2 = conj_d2a_in[4*i]; uint64_t conj_e2 = conj_e2a_in[4*i]; uint64_t swap_c2 = swap_c2a_in[4*i]; uint64_t swap_d2 = swap_d2a_in[4*i]; uint64_t swap_e2 = swap_e2a_in[4*i]; 
uint64_t x3 = x3_in[16*i]; uint64_t y3 = y3_in[16*i]; uint64_t z3 = z3_in[16*i]; uint64_t w3 = w3_in[16*i]; uint64_t conj_c3 = conj_c3a_in[4*i]; uint64_t conj_d3 = conj_d3a_in[4*i]; uint64_t conj_e3 = conj_e3a_in[4*i]; uint64_t swap_c3 = swap_c3a_in[4*i]; uint64_t swap_d3 = swap_d3a_in[4*i]; uint64_t swap_e3 = swap_e3a_in[4*i]; uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0); uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1); uint64_t cy2_real = __builtin_e2k_pfmuls(conj_c2, y2); uint64_t cy3_real = __builtin_e2k_pfmuls(conj_c3, y3); uint64_t dz0_real = __builtin_e2k_pfmuls(conj_d0, z0); uint64_t dz1_real = __builtin_e2k_pfmuls(conj_d1, z1); uint64_t dz2_real = __builtin_e2k_pfmuls(conj_d2, z2); uint64_t dz3_real = __builtin_e2k_pfmuls(conj_d3, z3); uint64_t ew0_real = __builtin_e2k_pfmuls(conj_e0, w0); uint64_t ew1_real = __builtin_e2k_pfmuls(conj_e1, w1); uint64_t ew2_real = __builtin_e2k_pfmuls(conj_e2, w2); uint64_t ew3_real = __builtin_e2k_pfmuls(conj_e3, w3); uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0); uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1); uint64_t cy2_imag = __builtin_e2k_pfmuls(swap_c2, y2); uint64_t cy3_imag = __builtin_e2k_pfmuls(swap_c3, y3); uint64_t dz0_imag = __builtin_e2k_pfmuls(swap_d0, z0); uint64_t dz1_imag = __builtin_e2k_pfmuls(swap_d1, z1); uint64_t dz2_imag = __builtin_e2k_pfmuls(swap_d2, z2); uint64_t dz3_imag = __builtin_e2k_pfmuls(swap_d3, z3); uint64_t ew0_imag = __builtin_e2k_pfmuls(swap_e0, w0); uint64_t ew1_imag = __builtin_e2k_pfmuls(swap_e1, w1); uint64_t ew2_imag = __builtin_e2k_pfmuls(swap_e2, w2); uint64_t ew3_imag = __builtin_e2k_pfmuls(swap_e3, w3); uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag); uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag); uint64_t cy2 = __builtin_e2k_pfhadds(cy2_real, cy2_imag); uint64_t cy3 = __builtin_e2k_pfhadds(cy3_real, cy3_imag); uint64_t dz0 = __builtin_e2k_pfhadds(dz0_real, dz0_imag); uint64_t dz1 = __builtin_e2k_pfhadds(dz1_real, dz1_imag); 
uint64_t dz2 = __builtin_e2k_pfhadds(dz2_real, dz2_imag); uint64_t dz3 = __builtin_e2k_pfhadds(dz3_real, dz3_imag); uint64_t ew0 = __builtin_e2k_pfhadds(ew0_real, ew0_imag); uint64_t ew1 = __builtin_e2k_pfhadds(ew1_real, ew1_imag); uint64_t ew2 = __builtin_e2k_pfhadds(ew2_real, ew2_imag); uint64_t ew3 = __builtin_e2k_pfhadds(ew3_real, ew3_imag); uint64_t add02_0 = __builtin_e2k_pfadds( x0, dz0); uint64_t add02_1 = __builtin_e2k_pfadds( x1, dz1); uint64_t add02_2 = __builtin_e2k_pfadds( x2, dz2); uint64_t add02_3 = __builtin_e2k_pfadds( x3, dz3); uint64_t sub02_0 = __builtin_e2k_pfsubs( x0, dz0); uint64_t sub02_1 = __builtin_e2k_pfsubs( x1, dz1); uint64_t sub02_2 = __builtin_e2k_pfsubs( x2, dz2); uint64_t sub02_3 = __builtin_e2k_pfsubs( x3, dz3); uint64_t add13_0 = __builtin_e2k_pfadds(cy0, ew0); uint64_t add13_1 = __builtin_e2k_pfadds(cy1, ew1); uint64_t add13_2 = __builtin_e2k_pfadds(cy2, ew2); uint64_t add13_3 = __builtin_e2k_pfadds(cy3, ew3); uint64_t sub13_0 = __builtin_e2k_pfsubs(cy0, ew0); uint64_t sub13_1 = __builtin_e2k_pfsubs(cy1, ew1); uint64_t sub13_2 = __builtin_e2k_pfsubs(cy2, ew2); uint64_t sub13_3 = __builtin_e2k_pfsubs(cy3, ew3); //uint64_t conj_sub13_0 = __builtin_e2k_pxord(sub13_0, 1LL<<63); //uint64_t conj_sub13_1 = __builtin_e2k_pxord(sub13_1, 1LL<<63); //uint64_t conj_sub13_2 = __builtin_e2k_pxord(sub13_2, 1LL<<63); //uint64_t conj_sub13_3 = __builtin_e2k_pxord(sub13_3, 1LL<<63); //uint64_t sub13i_0 = __builtin_e2k_pshufb(0, conj_sub13_0, 0x0302010007060504); //uint64_t sub13i_1 = __builtin_e2k_pshufb(0, conj_sub13_1, 0x0302010007060504); //uint64_t sub13i_2 = __builtin_e2k_pshufb(0, conj_sub13_2, 0x0302010007060504); //uint64_t sub13i_3 = __builtin_e2k_pshufb(0, conj_sub13_3, 0x0302010007060504); uint64_t swap_sub13_0 = __builtin_e2k_pshufb(0, sub13_0, 0x0302010007060504); uint64_t swap_sub13_1 = __builtin_e2k_pshufb(0, sub13_1, 0x0302010007060504); uint64_t swap_sub13_2 = __builtin_e2k_pshufb(0, sub13_2, 0x0302010007060504); uint64_t 
swap_sub13_3 = __builtin_e2k_pshufb(0, sub13_3, 0x0302010007060504); uint64_t sub13i_0 = __builtin_e2k_pxord(swap_sub13_0, 1LL<<31); uint64_t sub13i_1 = __builtin_e2k_pxord(swap_sub13_1, 1LL<<31); uint64_t sub13i_2 = __builtin_e2k_pxord(swap_sub13_2, 1LL<<31); uint64_t sub13i_3 = __builtin_e2k_pxord(swap_sub13_3, 1LL<<31); uint64_t out0 = __builtin_e2k_pfadds(add02_0, add13_0); uint64_t out1 = __builtin_e2k_pfadds(add02_1, add13_1); uint64_t out2 = __builtin_e2k_pfadds(add02_2, add13_2); uint64_t out3 = __builtin_e2k_pfadds(add02_3, add13_3); uint64_t out4 = __builtin_e2k_pfsubs(sub02_0, sub13i_0); uint64_t out5 = __builtin_e2k_pfsubs(sub02_1, sub13i_1); uint64_t out6 = __builtin_e2k_pfsubs(sub02_2, sub13i_2); uint64_t out7 = __builtin_e2k_pfsubs(sub02_3, sub13i_3); uint64_t out8 = __builtin_e2k_pfsubs(add02_0, add13_0); uint64_t out9 = __builtin_e2k_pfsubs(add02_1, add13_1); uint64_t out10 = __builtin_e2k_pfsubs(add02_2, add13_2); uint64_t out11 = __builtin_e2k_pfsubs(add02_3, add13_3); uint64_t out12 = __builtin_e2k_pfadds(sub02_0, sub13i_0); uint64_t out13 = __builtin_e2k_pfadds(sub02_1, sub13i_1); uint64_t out14 = __builtin_e2k_pfadds(sub02_2, sub13i_2); uint64_t out15 = __builtin_e2k_pfadds(sub02_3, sub13i_3); x0 = out0; y0 = out1; z0 = out2; w0 = out3; conj_c0 = conj_c0b_in[i]; conj_d0 = conj_d0b_in[i]; conj_e0 = conj_e0b_in[i]; swap_c0 = swap_c0b_in[i]; swap_d0 = swap_d0b_in[i]; swap_e0 = swap_e0b_in[i]; x1 = out4; y1 = out5; z1 = out6; w1 = out7; conj_c1 = conj_c1b_in[i]; conj_d1 = conj_d1b_in[i]; conj_e1 = conj_e1b_in[i]; swap_c1 = swap_c1b_in[i]; swap_d1 = swap_d1b_in[i]; swap_e1 = swap_e1b_in[i]; x2 = out8; y2 = out9; z2 = out10; w2 = out11; conj_c2 = conj_c2b_in[i]; conj_d2 = conj_d2b_in[i]; conj_e2 = conj_e2b_in[i]; swap_c2 = swap_c2b_in[i]; swap_d2 = swap_d2b_in[i]; swap_e2 = swap_e2b_in[i]; x3 = out12; y3 = out13; z3 = out14; w3 = out15; conj_c3 = conj_c3b_in[i]; conj_d3 = conj_d3b_in[i]; conj_e3 = conj_e3b_in[i]; swap_c3 = swap_c3b_in[i]; swap_d3 = 
swap_d3b_in[i]; swap_e3 = swap_e3b_in[i]; cy0_real = __builtin_e2k_pfmuls(conj_c0, y0); cy1_real = __builtin_e2k_pfmuls(conj_c1, y1); cy2_real = __builtin_e2k_pfmuls(conj_c2, y2); cy3_real = __builtin_e2k_pfmuls(conj_c3, y3); dz0_real = __builtin_e2k_pfmuls(conj_d0, z0); dz1_real = __builtin_e2k_pfmuls(conj_d1, z1); dz2_real = __builtin_e2k_pfmuls(conj_d2, z2); dz3_real = __builtin_e2k_pfmuls(conj_d3, z3); ew0_real = __builtin_e2k_pfmuls(conj_e0, w0); ew1_real = __builtin_e2k_pfmuls(conj_e1, w1); ew2_real = __builtin_e2k_pfmuls(conj_e2, w2); ew3_real = __builtin_e2k_pfmuls(conj_e3, w3); cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0); cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1); cy2_imag = __builtin_e2k_pfmuls(swap_c2, y2); cy3_imag = __builtin_e2k_pfmuls(swap_c3, y3); dz0_imag = __builtin_e2k_pfmuls(swap_d0, z0); dz1_imag = __builtin_e2k_pfmuls(swap_d1, z1); dz2_imag = __builtin_e2k_pfmuls(swap_d2, z2); dz3_imag = __builtin_e2k_pfmuls(swap_d3, z3); ew0_imag = __builtin_e2k_pfmuls(swap_e0, w0); ew1_imag = __builtin_e2k_pfmuls(swap_e1, w1); ew2_imag = __builtin_e2k_pfmuls(swap_e2, w2); ew3_imag = __builtin_e2k_pfmuls(swap_e3, w3); cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag); cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag); cy2 = __builtin_e2k_pfhadds(cy2_real, cy2_imag); cy3 = __builtin_e2k_pfhadds(cy3_real, cy3_imag); dz0 = __builtin_e2k_pfhadds(dz0_real, dz0_imag); dz1 = __builtin_e2k_pfhadds(dz1_real, dz1_imag); dz2 = __builtin_e2k_pfhadds(dz2_real, dz2_imag); dz3 = __builtin_e2k_pfhadds(dz3_real, dz3_imag); ew0 = __builtin_e2k_pfhadds(ew0_real, ew0_imag); ew1 = __builtin_e2k_pfhadds(ew1_real, ew1_imag); ew2 = __builtin_e2k_pfhadds(ew2_real, ew2_imag); ew3 = __builtin_e2k_pfhadds(ew3_real, ew3_imag); add02_0 = __builtin_e2k_pfadds( x0, dz0); add02_1 = __builtin_e2k_pfadds( x1, dz1); add02_2 = __builtin_e2k_pfadds( x2, dz2); add02_3 = __builtin_e2k_pfadds( x3, dz3); sub02_0 = __builtin_e2k_pfsubs( x0, dz0); sub02_1 = __builtin_e2k_pfsubs( x1, dz1); sub02_2 = 
__builtin_e2k_pfsubs( x2, dz2); sub02_3 = __builtin_e2k_pfsubs( x3, dz3); add13_0 = __builtin_e2k_pfadds(cy0, ew0); add13_1 = __builtin_e2k_pfadds(cy1, ew1); add13_2 = __builtin_e2k_pfadds(cy2, ew2); add13_3 = __builtin_e2k_pfadds(cy3, ew3); sub13_0 = __builtin_e2k_pfsubs(cy0, ew0); sub13_1 = __builtin_e2k_pfsubs(cy1, ew1); sub13_2 = __builtin_e2k_pfsubs(cy2, ew2); sub13_3 = __builtin_e2k_pfsubs(cy3, ew3); //conj_sub13_0 = __builtin_e2k_pxord(sub13_0, 1LL<<63); //conj_sub13_1 = __builtin_e2k_pxord(sub13_1, 1LL<<63); //conj_sub13_2 = __builtin_e2k_pxord(sub13_2, 1LL<<63); //conj_sub13_3 = __builtin_e2k_pxord(sub13_3, 1LL<<63); //sub13i_0 = __builtin_e2k_pshufb(0, conj_sub13_0, 0x0302010007060504); //sub13i_1 = __builtin_e2k_pshufb(0, conj_sub13_1, 0x0302010007060504); //sub13i_2 = __builtin_e2k_pshufb(0, conj_sub13_2, 0x0302010007060504); //sub13i_3 = __builtin_e2k_pshufb(0, conj_sub13_3, 0x0302010007060504); swap_sub13_0 = __builtin_e2k_pshufb(0, sub13_0, 0x0302010007060504); swap_sub13_1 = __builtin_e2k_pshufb(0, sub13_1, 0x0302010007060504); swap_sub13_2 = __builtin_e2k_pshufb(0, sub13_2, 0x0302010007060504); swap_sub13_3 = __builtin_e2k_pshufb(0, sub13_3, 0x0302010007060504); sub13i_0 = __builtin_e2k_pxord(swap_sub13_0, 1LL<<31); sub13i_1 = __builtin_e2k_pxord(swap_sub13_1, 1LL<<31); sub13i_2 = __builtin_e2k_pxord(swap_sub13_2, 1LL<<31); sub13i_3 = __builtin_e2k_pxord(swap_sub13_3, 1LL<<31); out_0[i] = __builtin_e2k_pfadds(add02_0, add13_0); out_1[i] = __builtin_e2k_pfadds(add02_1, add13_1); out_2[i] = __builtin_e2k_pfadds(add02_2, add13_2); out_3[i] = __builtin_e2k_pfadds(add02_3, add13_3); out_4[i] = __builtin_e2k_pfsubs(sub02_0, sub13i_0); out_5[i] = __builtin_e2k_pfsubs(sub02_1, sub13i_1); out_6[i] = __builtin_e2k_pfsubs(sub02_2, sub13i_2); out_7[i] = __builtin_e2k_pfsubs(sub02_3, sub13i_3); out_8[i] = __builtin_e2k_pfsubs(add02_0, add13_0); out_9[i] = __builtin_e2k_pfsubs(add02_1, add13_1); out_10[i] = __builtin_e2k_pfsubs(add02_2, add13_2); out_11[i] = 
__builtin_e2k_pfsubs(add02_3, add13_3); out_12[i] = __builtin_e2k_pfadds(sub02_0, sub13i_0); out_13[i] = __builtin_e2k_pfadds(sub02_1, sub13i_1); out_14[i] = __builtin_e2k_pfadds(sub02_2, sub13i_2); out_15[i] = __builtin_e2k_pfadds(sub02_3, sub13i_3); } }
Main loop in assembly
.L2581: { fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=1, asz=0, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=1, asz=0, abs=0, disp=32 } { fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=1, asz=0, abs=1, disp=64 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=1, asz=0, abs=1, disp=96 } { fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=7, asz=1, abs=2, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=6, asz=1, abs=2, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=5, asz=1, abs=4, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=4, asz=1, abs=4, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=3, asz=1, abs=6, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=2, asz=1, abs=6, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=16, incr=0, ind=0, asz=1, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=15, incr=0, ind=0, asz=1, abs=8, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=14, incr=0, ind=0, asz=1, abs=10, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=13, incr=0, ind=0, asz=1, abs=10, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=12, incr=0, ind=0, asz=1, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=11, incr=0, ind=0, asz=1, abs=12, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=10, incr=0, ind=0, asz=1, abs=14, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=9, incr=0, ind=0, asz=1, abs=14, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=8, incr=0, ind=0, asz=1, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=7, incr=0, ind=0, asz=1, abs=16, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=6, incr=0, ind=0, asz=1, abs=18, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=5, incr=0, ind=0, asz=1, abs=18, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=4, incr=0, ind=0, asz=1, abs=20, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=3, incr=0, ind=0, asz=1, abs=20, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=2, incr=0, ind=0, asz=1, abs=22, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, 
asz=1, abs=22, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=1, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=1, abs=24, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=1, abs=26, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=1, abs=26, disp=0 } { fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=1, abs=28, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=1, abs=28, disp=0 } { fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=1, abs=30, disp=0 fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=1, abs=30, disp=0 } .L1747: { loop_mode pfadd_rsubs,1,sm %b[63], %b[62], %b[77], %b[86] pfmul_hadds,2,sm %b[9], %b[117], %b[86], %b[68] pfmul_hadds,3,sm %b[33], %b[68], %b[88], %b[88] } { loop_mode pfmuls,1,sm %b[113], %b[103], %b[73] xord,2,sm %b[107], %r3, %b[90] pfmul_hadds,3,sm %b[71], %b[73], %b[65], %b[71] pfmul_hadds,4,sm %b[40], %b[106], %b[101], %b[65] movad,0 area=16, ind=0, am=1, be=0, %b[1] movad,1 area=15, ind=0, am=1, be=0, %b[12] movad,2 area=16, ind=0, am=1, be=0, %b[8] movad,3 area=15, ind=0, am=1, be=0, %b[9] } { loop_mode pfsub_adds,2,sm %b[63], %b[62], %b[90], %b[92] pfadd_adds,3,sm %b[63], %b[62], %b[77], %b[95] pfmul_hadds,4,sm %b[5], %b[97], %b[105], %b[77] movad,0 area=14, ind=0, am=1, be=0, %b[16] movad,1 area=13, ind=0, am=1, be=0, %b[5] movad,2 area=14, ind=0, am=1, be=0, %b[13] movad,3 area=13, ind=0, am=1, be=0, %b[17] } { loop_mode pfsubs,1,sm %b[93], %b[82], %b[85] pfmul_hadds,2,sm %b[32], %b[64], %b[85], %b[64] movad,0 area=12, ind=0, am=1, be=0, %b[25] movad,1 area=11, ind=0, am=1, be=0, %b[24] movad,2 area=12, ind=0, am=1, be=0, %b[20] movad,3 area=11, ind=0, am=1, be=0, %b[21] } { loop_mode pfsubs,0,sm %b[91], %b[89], %b[96] pfadds,4,sm %b[74], %b[87], %b[93] pfadds,5,sm %b[93], %b[82], %b[82] movad,0 area=10, ind=0, am=1, be=0, %b[28] movad,1 area=9, ind=0, am=1, be=0, %b[33] movad,2 area=10, ind=0, am=1, be=0, 
%b[32] movad,3 area=9, ind=0, am=1, be=0, %b[29] } { loop_mode pfmul_hadds,0,sm %b[80], %b[103], %b[73], %b[62] pfadds,1,sm %b[91], %b[89], %b[72] pfmuls,2,sm %b[72], %b[108], %b[73] pfsub_rsubs,3,sm %b[63], %b[62], %b[90], %b[63] pfsubs,5,sm %b[74], %b[87], %b[74] movad,0 area=8, ind=0, am=1, be=0, %b[41] movad,1 area=7, ind=0, am=1, be=0, %b[36] movad,2 area=8, ind=0, am=1, be=0, %b[37] movad,3 area=7, ind=0, am=1, be=0, %b[40] } { loop_mode pfsubs,0,sm %b[75], %b[70], %b[80] pfsubs,5,sm %b[81], %b[78], %b[87] movad,0 area=6, ind=0, am=1, be=0, %b[49] movad,1 area=5, ind=0, am=1, be=0, %b[48] movad,2 area=6, ind=0, am=1, be=0, %b[44] movad,3 area=5, ind=0, am=1, be=0, %b[45] } { loop_mode pshufb,0,sm 0x0, %b[85], %r25, %b[84] pfmuls,1,sm %b[84], %b[102], %b[85] pfadds,4,sm %b[81], %b[78], %b[90] pfsubs,5,sm %b[83], %b[69], %b[89] movad,0 area=4, ind=0, am=0, be=0, %b[52] movad,1 area=4, ind=16, am=0, be=0, %b[78] movad,2 area=4, ind=0, am=0, be=0, %b[53] movad,3 area=4, ind=16, am=0, be=0, %b[81] } { loop_mode pshufb,1,sm 0x0, %b[96], %r25, %b[98] pfadds,3,sm %b[83], %b[69], %b[99] pfsubs,5,sm %b[88], %b[76], %b[97] movad,0 area=4, ind=24, am=0, be=0, %b[69] movad,1 area=4, ind=8, am=1, be=0, %b[83] movad,2 area=4, ind=24, am=0, be=0, %b[91] movad,3 area=4, ind=8, am=1, be=0, %b[96] } { loop_mode pfmul_hadds,0,sm %b[55], %b[108], %b[73], %b[75] pfadds,1,sm %b[75], %b[70], %b[100] pfadd_adds,2,sm %b[79], %b[66], %b[72], %b[74] pshufb,3,sm 0x0, %b[74], %r25, %b[101] pfadd_rsubs,4,sm %b[94], %b[71], %b[82], %b[76] pfadds,5,sm %b[88], %b[76], %b[104] movad,0 area=3, ind=0, am=0, be=0, %b[70] movad,1 area=3, ind=16, am=0, be=0, %b[88] movad,2 area=3, ind=0, am=0, be=0, %b[56] movad,3 area=3, ind=16, am=0, be=0, %b[73] } { loop_mode pfadd_rsubs,0,sm %b[79], %b[66], %b[72], %b[72] pshufb,1,sm 0x0, %b[80], %r25, %b[107] pfadd_adds,2,sm %b[86], %b[68], %b[93], %b[103] pshufb,4,sm 0x0, %b[87], %r25, %b[109] pfadd_rsubs,5,sm %b[86], %b[68], %b[93], %b[93] movad,0 area=3, 
ind=24, am=0, be=0, %b[105] movad,1 area=3, ind=8, am=1, be=0, %b[106] movad,2 area=3, ind=24, am=0, be=0, %b[80] movad,3 area=3, ind=8, am=1, be=0, %b[87] } { loop_mode pfmul_hadds,0,sm %b[58], %b[102], %b[85], %b[84] pfadd_adds,1,sm %b[94], %b[71], %b[82], %b[85] xord,2,sm %b[84], %r3, %b[112] pfadd_adds,3,sm %b[95], %b[65], %b[90], %b[102] pshufb,4,sm 0x0, %b[89], %r25, %b[111] pfadd_rsubs,5,sm %b[95], %b[65], %b[90], %b[90] movad,0 area=2, ind=0, am=0, be=0, %b[82] movad,1 area=2, ind=16, am=0, be=0, %b[108] movad,2 area=2, ind=24, am=0, be=0, %b[110] movad,3 area=0, ind=24, am=0, be=0, %b[89] } { loop_mode xord,0,sm %b[98], %r3, %b[116] pfsub_adds,1,sm %b[94], %b[71], %b[112], %b[94] pfsub_rsubs,2,sm %b[94], %b[71], %b[112], %b[97] pfadd_rsubs,3,sm %b[92], %b[77], %b[99], %b[98] pshufb,4,sm 0x0, %b[97], %r25, %b[115] pfadd_adds,5,sm %b[92], %b[77], %b[99], %b[99] movad,0 area=2, ind=24, am=0, be=0, %b[114] movad,1 area=2, ind=8, am=1, be=0, %b[113] movad,2 area=1, ind=16, am=0, be=0, %b[71] movad,3 area=0, ind=8, am=0, be=0, %b[112] } { loop_mode pfadd_adds,1,sm %b[57], %b[62], %b[100], %b[104] pfsub_adds,2,sm %b[79], %b[66], %b[116], %b[118] xord,3,sm %b[101], %r3, %g16 pfadd_rsubs,4,sm %b[63], %b[64], %b[104], %b[117] pfadd_adds,5,sm %b[63], %b[64], %b[104], %b[119] movad,0 area=1, ind=0, am=0, be=0, %b[55] movad,1 area=1, ind=16, am=0, be=0, %b[101] movad,2 area=1, ind=24, am=0, be=0, %g18 movad,3 area=1, ind=8, am=0, be=0, %g17 } { loop_mode xord,0,sm %b[107], %r3, %b[116] pfmuls,1,sm %b[67], %b[60], %g19 pfsub_rsubs,2,sm %b[79], %b[66], %b[116], %b[66] pfsub_rsubs,3,sm %b[86], %b[68], %g16, %b[107] xord,4,sm %b[109], %r3, %g16 pfsub_adds,5,sm %b[86], %b[68], %g16, %b[86] movad,0 area=1, ind=24, am=0, be=0, %b[68] movad,1 area=1, ind=8, am=1, be=0, %b[79] movad,2 area=2, ind=8, am=0, be=0, %b[109] movad,3 area=0, ind=16, am=0, be=0, %b[67] } { loop_mode pfsub_adds,0,sm %b[57], %b[62], %b[116], %b[95] pfsub_rsubs,3,sm %b[95], %b[65], %g16, %g16 
pfsub_adds,4,sm %b[95], %b[65], %g16, %g21 xord,5,sm %b[111], %r3, %g20 movad,0 area=0, ind=0, am=0, be=0, %b[59] movad,1 area=0, ind=16, am=0, be=0, %b[58] movad,2 area=2, ind=0, am=1, be=0, %b[65] movad,3 area=2, ind=16, am=0, be=0, %b[111] } { loop_mode pfmuls,0,sm %b[106], %b[89], %g24 pfadd_rsubs,2,sm %b[57], %b[62], %b[100], %b[115] xord,3,sm %b[115], %r3, %g23 pfsub_rsubs,4,sm %b[92], %b[77], %g20, %g20 pfsub_adds,5,sm %b[92], %b[77], %g20, %g22 movad,0 area=0, ind=8, am=1, be=0, %b[100] movad,1 area=0, ind=24, am=0, be=0, %b[106] movad,2 area=1, ind=0, am=1, be=0, %b[92] movad,3 area=0, ind=0, am=1, be=0, %b[77] } { loop_mode pfmuls,0,sm %b[113], %b[112], %b[113] pfmuls,3,sm %b[110], %b[71], %b[63] pfsub_rsubs,4,sm %b[63], %b[64], %g23, %b[110] pfsub_adds,5,sm %b[63], %b[64], %g23, %b[64] } { loop_mode pfmuls,0,sm %b[105], %g18, %b[114] pfmuls,1,sm %b[114], %g17, %g19 pfmul_hadds,2,sm %b[54], %b[60], %g19, %b[60] pfmuls,4,sm %b[27], %b[76], %b[105] } { loop_mode pfsub_rsubs,0,sm %b[57], %b[62], %b[116], %b[62] pfsubs,1,sm %b[84], %b[75], %b[116] pfmuls,2,sm %b[51], %b[85], %g23 pfmuls,5,sm %b[109], %b[67], %b[109] } { loop_mode pfmuls,0,sm %b[88], %b[68], %b[88] pfmuls,1,sm %b[108], %b[79], %b[93] std,2 %r23, %b[6], %b[103] pfmuls,3,sm %b[26], %b[72], %b[103] addd,4,sm 0x8, %b[6], %b[4] ? 
%pcnt0 std,5 %r19, %b[6], %b[93] } { loop_mode pfmul_hadds,0,sm %b[96], %b[89], %g24, %b[87] pfmul_hadds,1,sm %b[87], %b[112], %b[113], %b[89] std,2 %r2, %b[6], %b[102] pfmuls,3,sm %b[50], %b[74], %b[90] std,5 %r20, %b[6], %b[90] } { loop_mode pfmul_hadds,0,sm %b[91], %g18, %b[114], %b[80] pfmul_hadds,1,sm %b[80], %g17, %g19, %b[91] std,2 %r11, %b[6], %b[98] pfmuls,3,sm %b[14], %b[94], %b[96] pfmuls,4,sm %b[35], %b[97], %b[98] std,5 %r22, %b[6], %b[99] } { loop_mode pfmul_hadds,0,sm %b[42], %b[85], %g23, %b[76] pfmuls,1,sm %b[47], %b[104], %b[99] std,2 %r16, %b[6], %b[117] pfmul_hadds,3,sm %b[19], %b[76], %b[105], %b[85] pfmuls,4,sm %b[18], %b[118], %b[102] std,5 %r24, %b[6], %b[119] } { loop_mode pfadds,0,sm %b[84], %b[75], %b[75] pfmuls,1,sm %b[23], %b[115], %b[84] std,2 %r14, %b[6], %b[107] pfmul_hadds,3,sm %b[22], %b[72], %b[103], %b[72] pfmuls,4,sm %b[43], %b[66], %b[86] std,5 %r13, %b[6], %b[86] } { loop_mode std,2 %r21, %b[6], %g16 pshufb,3,sm 0x0, %b[116], %r25, %b[105] pfmuls,4,sm %b[15], %b[95], %b[103] std,5 %r18, %b[6], %g21 } { loop_mode pfmul_hadds,0,sm %b[81], %b[68], %b[88], %b[68] pfmul_hadds,1,sm %b[73], %b[79], %b[93], %b[73] std,2 %r17, %b[6], %g20 pfmul_hadds,3,sm %b[46], %b[74], %b[90], %b[79] pfmul_hadds,4,sm %b[34], %b[97], %b[98], %b[74] std,5 %r12, %b[6], %g22 } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END pfmul_hadds,0,sm %b[83], %b[67], %b[109], %b[64] pfmuls,1,sm %b[39], %b[62], %b[83] std,2 %r15, %b[6], %b[110] pfmul_hadds,3,sm %b[10], %b[94], %b[96], %b[67] pfmul_hadds,4,sm %b[11], %b[118], %b[102], %b[81] std,5 %r0, %b[6], %b[64] }
Theoretical throughput: 16 complex numbers per 28 cycles (16/28), i.e. 16 × 8 bytes / 28 cycles = 4.57 bytes/cycle
Quadruple theoretical throughput: 4 × 128/28 = 18.29 bytes/cycle
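The arithmetic behind these two figures can be reproduced with a few lines of C. This is only a sketch: the function names are hypothetical, the 8 bytes per element follow from the `float complex` element type, and the factor of 4 rests on the assumption that "quadruple" means one doubled radix-4 pass is credited with the work of four radix-2 passes.

```c
#include <assert.h>

/* One complex number with float real and imaginary parts takes 8 bytes. */
enum { BYTES_PER_COMPLEX = 2 * sizeof(float) };

/* Theoretical throughput of a loop body in bytes per cycle
   (hypothetical helper, not part of the article's code). */
double bytes_per_cycle(int complex_per_iter, int cycles_per_iter)
{
    assert(cycles_per_iter > 0);
    return (double)complex_per_iter * BYTES_PER_COMPLEX / cycles_per_iter;
}

/* Equivalent throughput when one pass is credited with the work of
   `radix2_passes` radix-2 passes (assumption: 4 for a doubled radix-4 stage). */
double equivalent_bytes_per_cycle(double per_pass, int radix2_passes)
{
    return per_pass * radix2_passes;
}
```

Calling `bytes_per_cycle(16, 28)` gives 128/28, which rounds to 4.57, and multiplying that by 4 gives the 18.29 quoted above.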
Speed measurements

2. stage_radix4_readConjSwap_2x_simd128
Here the stage_radix4_readConjSwap_simd128 algorithm is manually unrolled by a factor of 2.
C code
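As a rough scalar model of what a single butterfly of this stage computes, the sketch below mirrors the add02/sub02/add13/sub13 naming of the vector code. It is an illustration built on `<complex.h>`, not the article's implementation: the function names are hypothetical, and the twiddle factors `c`, `d`, `e` are passed in directly rather than read from precomputed tables.

```c
#include <assert.h>
#include <complex.h>

/* Multiply by i the way the vector code appears to do it:
   swap (re, im), then flip the sign of the new real part. */
float complex mul_by_i(float complex v)
{
    return -cimagf(v) + crealf(v) * I;
}

/* One radix-4 butterfly on x, y, z, w with twiddles c, d, e applied to y, z, w. */
void radix4_butterfly(float complex x, float complex y,
                      float complex z, float complex w,
                      float complex c, float complex d, float complex e,
                      float complex out[4])
{
    assert(out); /* caller supplies a 4-element array */

    float complex cy = c * y, dz = d * z, ew = e * w;
    float complex add02 = x + dz,  sub02 = x - dz;
    float complex add13 = cy + ew, sub13 = cy - ew;
    float complex sub13i = mul_by_i(sub13);

    out[0] = add02 + add13;
    out[1] = sub02 - sub13i;
    out[2] = add02 - add13;
    out[3] = sub02 + sub13i;
}

/* Self-check on inputs that are exact in float arithmetic. */
void radix4_butterfly_selfcheck(void)
{
    float complex out[4];

    /* A unit impulse transforms to a constant. */
    radix4_butterfly(1, 0, 0, 0, 1, 1, 1, out);
    for (int k = 0; k < 4; ++k)
        assert(crealf(out[k]) == 1.0f && cimagf(out[k]) == 0.0f);

    /* A constant transforms to a single non-zero bin. */
    radix4_butterfly(1, 1, 1, 1, 1, 1, 1, out);
    assert(crealf(out[0]) == 4.0f && cimagf(out[0]) == 0.0f);
    for (int k = 1; k < 4; ++k)
        assert(crealf(out[k]) == 0.0f && cimagf(out[k]) == 0.0f);
}
```

The doubled version chains two such butterfly layers inside one loop iteration, so the results of the first layer stay in registers instead of going through memory.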
void stage_radix4_readConjSwap_2x_simd128(int data_count, myComplex *data_in, myComplex *data_out,
        myComplex *conj_coefC_a, myComplex *conj_coefD_a, myComplex *conj_coefE_a,
        myComplex *conj_coefC_b, myComplex *conj_coefD_b, myComplex *conj_coefE_b,
        myComplex *swap_coefC_a, myComplex *swap_coefD_a, myComplex *swap_coefE_a,
        myComplex *swap_coefC_b, myComplex *swap_coefD_b, myComplex *swap_coefE_b)
{
    /* Input: 32 consecutive complex values per iteration, two per 128-bit register */
    __v2di *xy0_in = (__v2di*)&data_in[ 0];
    __v2di *zw0_in = (__v2di*)&data_in[ 2];
    __v2di *xy1_in = (__v2di*)&data_in[ 4];
    __v2di *zw1_in = (__v2di*)&data_in[ 6];
    __v2di *xy2_in = (__v2di*)&data_in[ 8];
    __v2di *zw2_in = (__v2di*)&data_in[10];
    __v2di *xy3_in = (__v2di*)&data_in[12];
    __v2di *zw3_in = (__v2di*)&data_in[14];
    __v2di *xy4_in = (__v2di*)&data_in[16];
    __v2di *zw4_in = (__v2di*)&data_in[18];
    __v2di *xy5_in = (__v2di*)&data_in[20];
    __v2di *zw5_in = (__v2di*)&data_in[22];
    __v2di *xy6_in = (__v2di*)&data_in[24];
    __v2di *zw6_in = (__v2di*)&data_in[26];
    __v2di *xy7_in = (__v2di*)&data_in[28];
    __v2di *zw7_in = (__v2di*)&data_in[30];

    /* Coefficient tables with suffix a (used by the first pass) */
    __v2di *conj_c0a_in = (__v2di*)&conj_coefC_a[0];
    __v2di *conj_c1a_in = (__v2di*)&conj_coefC_a[2];
    __v2di *conj_c2a_in = (__v2di*)&conj_coefC_a[4];
    __v2di *conj_c3a_in = (__v2di*)&conj_coefC_a[6];
    __v2di *conj_d0a_in = (__v2di*)&conj_coefD_a[0];
    __v2di *conj_d1a_in = (__v2di*)&conj_coefD_a[2];
    __v2di *conj_d2a_in = (__v2di*)&conj_coefD_a[4];
    __v2di *conj_d3a_in = (__v2di*)&conj_coefD_a[6];
    __v2di *conj_e0a_in = (__v2di*)&conj_coefE_a[0];
    __v2di *conj_e1a_in = (__v2di*)&conj_coefE_a[2];
    __v2di *conj_e2a_in = (__v2di*)&conj_coefE_a[4];
    __v2di *conj_e3a_in = (__v2di*)&conj_coefE_a[6];

    /* Coefficient tables with suffix b (used by the second pass) */
    __v2di *conj_c0b_in = (__v2di*)&conj_coefC_b[0*data_count/16];
    __v2di *conj_c1b_in = (__v2di*)&conj_coefC_b[1*data_count/16];
    __v2di *conj_c2b_in = (__v2di*)&conj_coefC_b[2*data_count/16];
    __v2di *conj_c3b_in = (__v2di*)&conj_coefC_b[3*data_count/16];
    __v2di *conj_d0b_in = (__v2di*)&conj_coefD_b[0*data_count/16];
    __v2di *conj_d1b_in = (__v2di*)&conj_coefD_b[1*data_count/16];
    __v2di *conj_d2b_in = (__v2di*)&conj_coefD_b[2*data_count/16];
    __v2di *conj_d3b_in = (__v2di*)&conj_coefD_b[3*data_count/16];
    __v2di *conj_e0b_in = (__v2di*)&conj_coefE_b[0*data_count/16];
    __v2di *conj_e1b_in = (__v2di*)&conj_coefE_b[1*data_count/16];
    __v2di *conj_e2b_in = (__v2di*)&conj_coefE_b[2*data_count/16];
    __v2di *conj_e3b_in = (__v2di*)&conj_coefE_b[3*data_count/16];

    __v2di *swap_c0a_in = (__v2di*)&swap_coefC_a[0];
    __v2di *swap_c1a_in = (__v2di*)&swap_coefC_a[2];
    __v2di *swap_c2a_in = (__v2di*)&swap_coefC_a[4];
    __v2di *swap_c3a_in = (__v2di*)&swap_coefC_a[6];
    __v2di *swap_d0a_in = (__v2di*)&swap_coefD_a[0];
    __v2di *swap_d1a_in = (__v2di*)&swap_coefD_a[2];
    __v2di *swap_d2a_in = (__v2di*)&swap_coefD_a[4];
    __v2di *swap_d3a_in = (__v2di*)&swap_coefD_a[6];
    __v2di *swap_e0a_in = (__v2di*)&swap_coefE_a[0];
    __v2di *swap_e1a_in = (__v2di*)&swap_coefE_a[2];
    __v2di *swap_e2a_in = (__v2di*)&swap_coefE_a[4];
    __v2di *swap_e3a_in = (__v2di*)&swap_coefE_a[6];

    __v2di *swap_c0b_in = (__v2di*)&swap_coefC_b[0*data_count/16];
    __v2di *swap_c1b_in = (__v2di*)&swap_coefC_b[1*data_count/16];
    __v2di *swap_c2b_in = (__v2di*)&swap_coefC_b[2*data_count/16];
    __v2di *swap_c3b_in = (__v2di*)&swap_coefC_b[3*data_count/16];
    __v2di *swap_d0b_in = (__v2di*)&swap_coefD_b[0*data_count/16];
    __v2di *swap_d1b_in = (__v2di*)&swap_coefD_b[1*data_count/16];
    __v2di *swap_d2b_in = (__v2di*)&swap_coefD_b[2*data_count/16];
    __v2di *swap_d3b_in = (__v2di*)&swap_coefD_b[3*data_count/16];
    __v2di *swap_e0b_in = (__v2di*)&swap_coefE_b[0*data_count/16];
    __v2di *swap_e1b_in = (__v2di*)&swap_coefE_b[1*data_count/16];
    __v2di *swap_e2b_in = (__v2di*)&swap_coefE_b[2*data_count/16];
    __v2di *swap_e3b_in = (__v2di*)&swap_coefE_b[3*data_count/16];

    /* Output: 16 streams, each starting data_count/16 complex values after the previous one */
    __v2di *out_0  = (__v2di*)&data_out[ 0*data_count/16];
    __v2di *out_1  = (__v2di*)&data_out[ 1*data_count/16];
    __v2di *out_2  = (__v2di*)&data_out[ 2*data_count/16];
    __v2di *out_3  = (__v2di*)&data_out[ 3*data_count/16];
    __v2di *out_4  = (__v2di*)&data_out[ 4*data_count/16];
    __v2di *out_5  = (__v2di*)&data_out[ 5*data_count/16];
    __v2di *out_6  = (__v2di*)&data_out[ 6*data_count/16];
    __v2di *out_7  = (__v2di*)&data_out[ 7*data_count/16];
    __v2di *out_8  = (__v2di*)&data_out[ 8*data_count/16];
    __v2di *out_9  = (__v2di*)&data_out[ 9*data_count/16];
    __v2di *out_10 = (__v2di*)&data_out[10*data_count/16];
    __v2di *out_11 = (__v2di*)&data_out[11*data_count/16];
    __v2di *out_12 = (__v2di*)&data_out[12*data_count/16];
    __v2di *out_13 = (__v2di*)&data_out[13*data_count/16];
    __v2di *out_14 = (__v2di*)&data_out[14*data_count/16];
    __v2di *out_15 = (__v2di*)&data_out[15*data_count/16];

    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/32; ++i)
    {
        /* ---- First pass: load data and the "a" coefficients ---- */
        __v2di xy0 = xy0_in[16*i];
        __v2di zw0 = zw0_in[16*i];
        __v2di xy1 = xy1_in[16*i];
        __v2di zw1 = zw1_in[16*i];
        __v2di conj_c0 = conj_c0a_in[4*i];
        __v2di conj_d0 = conj_d0a_in[4*i];
        __v2di conj_e0 = conj_e0a_in[4*i];
        __v2di swap_c0 = swap_c0a_in[4*i];
        __v2di swap_d0 = swap_d0a_in[4*i];
        __v2di swap_e0 = swap_e0a_in[4*i];

        __v2di xy2 = xy2_in[16*i];
        __v2di zw2 = zw2_in[16*i];
        __v2di xy3 = xy3_in[16*i];
        __v2di zw3 = zw3_in[16*i];
        __v2di conj_c1 = conj_c1a_in[4*i];
        __v2di conj_d1 = conj_d1a_in[4*i];
        __v2di conj_e1 = conj_e1a_in[4*i];
        __v2di swap_c1 = swap_c1a_in[4*i];
        __v2di swap_d1 = swap_d1a_in[4*i];
        __v2di swap_e1 = swap_e1a_in[4*i];

        __v2di xy4 = xy4_in[16*i];
        __v2di zw4 = zw4_in[16*i];
        __v2di xy5 = xy5_in[16*i];
        __v2di zw5 = zw5_in[16*i];
        __v2di conj_c2 = conj_c2a_in[4*i];
        __v2di conj_d2 = conj_d2a_in[4*i];
        __v2di conj_e2 = conj_e2a_in[4*i];
        __v2di swap_c2 = swap_c2a_in[4*i];
        __v2di swap_d2 = swap_d2a_in[4*i];
        __v2di swap_e2 = swap_e2a_in[4*i];

        __v2di xy6 = xy6_in[16*i];
        __v2di zw6 = zw6_in[16*i];
        __v2di xy7 = xy7_in[16*i];
        __v2di zw7 = zw7_in[16*i];
        __v2di conj_c3 = conj_c3a_in[4*i];
        __v2di conj_d3 = conj_d3a_in[4*i];
        __v2di conj_e3 = conj_e3a_in[4*i];
        __v2di swap_c3 = swap_c3a_in[4*i];
        __v2di swap_d3 = swap_d3a_in[4*i];
        __v2di swap_e3 = swap_e3a_in[4*i];

        /* Regroup the loaded pairs into x, y, z, w operands */
        __v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

        /* Complex products c*y, d*z, e*w: element-wise multiplies, then a horizontal add */
        __v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
        __v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
        __v2di cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2);
        __v2di cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3);
        __v2di dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0);
        __v2di dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1);
        __v2di dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2);
        __v2di dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3);
        __v2di ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0);
        __v2di ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1);
        __v2di ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2);
        __v2di ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3);

        __v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
        __v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
        __v2di cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
        __v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
        __v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
        __v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
        __v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
        __v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
        __v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
        __v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
        __v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
        __v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);

        __v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
        __v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
        __v2di cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag);
        __v2di cy3_rrii = __builtin_e2k_qpfhadds(cy3_real, cy3_imag);
        __v2di dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag);
        __v2di dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag);
        __v2di dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag);
        __v2di dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag);
        __v2di ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag);
        __v2di ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag);
        __v2di ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag);
        __v2di ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag);

        /* Reorder (re, re, im, im) into (re, im, re, im) */
        __v2di cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        __v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        __v2di cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        __v2di cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        __v2di dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        __v2di dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        __v2di dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        __v2di dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        __v2di ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        __v2di ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        __v2di ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        __v2di ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

        /* Radix-4 butterfly */
        __v2di add02_0 = __builtin_e2k_qpfadds( x0, dz0);
        __v2di add02_1 = __builtin_e2k_qpfadds( x1, dz1);
        __v2di add02_2 = __builtin_e2k_qpfadds( x2, dz2);
        __v2di add02_3 = __builtin_e2k_qpfadds( x3, dz3);
        __v2di sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
        __v2di sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
        __v2di sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
        __v2di sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
        __v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
        __v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
        __v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
        __v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
        __v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
        __v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
        __v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
        __v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);

        /* Swap re/im and flip one sign bit */
        __v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        __v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
        __v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
        __v2di sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
        __v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});

        __v2di out0  = __builtin_e2k_qpfadds(add02_0, add13_0);
        __v2di out1  = __builtin_e2k_qpfadds(add02_1, add13_1);
        __v2di out2  = __builtin_e2k_qpfadds(add02_2, add13_2);
        __v2di out3  = __builtin_e2k_qpfadds(add02_3, add13_3);
        __v2di out4  = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
        __v2di out5  = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
        __v2di out6  = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
        __v2di out7  = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
        __v2di out8  = __builtin_e2k_qpfsubs(add02_0, add13_0);
        __v2di out9  = __builtin_e2k_qpfsubs(add02_1, add13_1);
        __v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2);
        __v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3);
        __v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
        __v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
        __v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
        __v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3);

        /* ---- Second pass: reuse the first-pass results directly, load the "b" coefficients ---- */
        xy0 = out0;
        zw0 = out1;
        xy1 = out2;
        zw1 = out3;
        conj_c0 = conj_c0b_in[i];
        conj_d0 = conj_d0b_in[i];
        conj_e0 = conj_e0b_in[i];
        swap_c0 = swap_c0b_in[i];
        swap_d0 = swap_d0b_in[i];
        swap_e0 = swap_e0b_in[i];

        xy2 = out4;
        zw2 = out5;
        xy3 = out6;
        zw3 = out7;
        conj_c1 = conj_c1b_in[i];
        conj_d1 = conj_d1b_in[i];
        conj_e1 = conj_e1b_in[i];
        swap_c1 = swap_c1b_in[i];
        swap_d1 = swap_d1b_in[i];
        swap_e1 = swap_e1b_in[i];

        xy4 = out8;
        zw4 = out9;
        xy5 = out10;
        zw5 = out11;
        conj_c2 = conj_c2b_in[i];
        conj_d2 = conj_d2b_in[i];
        conj_e2 = conj_e2b_in[i];
        swap_c2 = swap_c2b_in[i];
        swap_d2 = swap_d2b_in[i];
        swap_e2 = swap_e2b_in[i];

        xy6 = out12;
        zw6 = out13;
        xy7 = out14;
        zw7 = out15;
        conj_c3 = conj_c3b_in[i];
        conj_d3 = conj_d3b_in[i];
        conj_e3 = conj_e3b_in[i];
        swap_c3 = swap_c3b_in[i];
        swap_d3 = swap_d3b_in[i];
        swap_e3 = swap_e3b_in[i];

        x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
        y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
        w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
        y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
        w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
        y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
        w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
        y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
        w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});

        cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
        cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
        cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2);
        cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3);
        dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0);
        dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1);
        dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2);
        dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3);
        ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0);
        ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1);
        ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2);
        ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3);

        cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
        cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
        cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
        cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
        dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
        dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
        dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
        dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
        ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
        ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
        ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
        ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);

        cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
        cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
        cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag);
        cy3_rrii = __builtin_e2k_qpfhadds(cy3_real, cy3_imag);
        dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag);
        dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag);
        dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag);
        dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag);
        ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag);
        ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag);
        ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag);
        ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag);

        cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});

        add02_0 = __builtin_e2k_qpfadds( x0, dz0);
        add02_1 = __builtin_e2k_qpfadds( x1, dz1);
        add02_2 = __builtin_e2k_qpfadds( x2, dz2);
        add02_3 = __builtin_e2k_qpfadds( x3, dz3);
        sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
        sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
        sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
        sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
        add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
        add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
        add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
        add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
        sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
        sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
        sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
        sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);

        swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
        sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
        sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
        sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
        sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});

        /* Store the second-pass results into the 16 output streams */
        out_0[i]  = __builtin_e2k_qpfadds(add02_0, add13_0);
        out_1[i]  = __builtin_e2k_qpfadds(add02_1, add13_1);
        out_2[i]  = __builtin_e2k_qpfadds(add02_2, add13_2);
        out_3[i]  = __builtin_e2k_qpfadds(add02_3, add13_3);
        out_4[i]  = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
        out_5[i]  = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
        out_6[i]  = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
        out_7[i]  = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
        out_8[i]  = __builtin_e2k_qpfsubs(add02_0, add13_0);
        out_9[i]  = __builtin_e2k_qpfsubs(add02_1, add13_1);
        out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2);
        out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3);
        out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
        out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
        out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
        out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
    }
}
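The `*_real` / `*_imag` / `*_rrii` sequence in the listing is a complex multiplication built from two element-wise products and a horizontal add. A scalar sketch of the idea follows. It assumes, going by the table names only, that the `conj_` tables hold (re, -im) and the `swap_` tables hold (im, re) of each twiddle factor; the `cplx` type and all function names here are hypothetical stand-ins, not the article's definitions.

```c
#include <assert.h>

/* Hypothetical stand-in for myComplex: one float pair. */
typedef struct { float re, im; } cplx;

/* Assumed table layouts for a twiddle factor c. */
cplx make_conj(cplx c) { return (cplx){ c.re, -c.im }; }
cplx make_swap(cplx c) { return (cplx){ c.im,  c.re }; }

/* c * y using only element-wise products and one pairwise add per part. */
cplx mul_conj_swap(cplx conj_c, cplx swap_c, cplx y)
{
    /* Element-wise products, as a packed float multiply would give. */
    float real_lo = conj_c.re * y.re;   /*  c.re * y.re */
    float real_hi = conj_c.im * y.im;   /* -c.im * y.im */
    float imag_lo = swap_c.re * y.re;   /*  c.im * y.re */
    float imag_hi = swap_c.im * y.im;   /*  c.re * y.im */

    /* Pairwise sums, as a horizontal add would give. */
    return (cplx){ real_lo + real_hi, imag_lo + imag_hi };
}

/* Self-check on values that are exact in float arithmetic. */
void mul_conj_swap_selfcheck(void)
{
    cplx c = { 1.0f, 2.0f }, y = { 3.0f, 4.0f };
    cplx r = mul_conj_swap(make_conj(c), make_swap(c), y);
    assert(r.re == -5.0f && r.im == 10.0f);   /* (1+2i)(3+4i) = -5+10i */
}
```

Precomputing both table forms moves the sign change and the re/im swap out of the loop, so the hot path needs only multiplies, one horizontal add and one shuffle per product.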
Main loop in assembly
.L6737: { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=0, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=0, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=1, disp=64 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=1, disp=96 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=2, disp=128 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=2, disp=160 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=3, disp=192 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=3, disp=224 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=7, asz=0, abs=4, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=7, asz=0, abs=4, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=6, asz=0, abs=5, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=6, asz=0, abs=5, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=5, asz=0, abs=6, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=5, asz=0, abs=6, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=4, asz=0, abs=7, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=4, asz=0, abs=7, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=3, asz=0, abs=8, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=3, asz=0, abs=8, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=2, asz=0, abs=9, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=2, asz=0, abs=9, disp=32 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=16, incr=0, ind=0, asz=0, abs=10, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=15, incr=0, ind=0, asz=0, abs=10, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=14, incr=0, ind=0, asz=0, abs=11, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=13, incr=0, ind=0, asz=0, abs=11, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=12, incr=0, ind=0, asz=1, abs=12, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=11, 
incr=0, ind=0, asz=1, abs=12, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=10, incr=0, ind=0, asz=1, abs=14, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=9, incr=0, ind=0, asz=1, abs=14, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=8, incr=0, ind=0, asz=1, abs=16, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=7, incr=0, ind=0, asz=1, abs=16, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=6, incr=0, ind=0, asz=1, abs=18, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=5, incr=0, ind=0, asz=1, abs=18, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=4, incr=0, ind=0, asz=1, abs=20, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=3, incr=0, ind=0, asz=1, abs=20, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=2, incr=0, ind=0, asz=1, abs=22, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=1, incr=0, ind=0, asz=1, abs=22, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=15, asz=1, abs=24, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=14, asz=1, abs=24, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=13, asz=1, abs=26, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=12, asz=1, abs=26, disp=0 } { fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=11, asz=1, abs=28, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=10, asz=1, abs=28, disp=0 } { fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=9, asz=1, abs=30, disp=0 fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=8, asz=1, abs=30, disp=0 } .L2988: { loop_mode qpfmul_hadds,0,sm %b[92], %b[83], %b[98], %b[83] qpshufb,1,sm %b[69], %b[69], %r4, %b[69] qpshufb,3,sm %b[63], %b[66], %r0, %b[92] qpfmuls,4,sm %b[48], %b[90], %b[98] qpfmuls,5,sm %b[21], %b[86], %b[99] } { loop_mode qpshufb,1,sm %b[62], %b[62], %r4, %b[74] qpshufb,3,sm %b[72], %b[74], %r3, %b[62] qpfmuls,5,sm %b[29], %b[92], %b[72] } { loop_mode qpshufb,1,sm %b[97], %b[97], %r4, %b[97] qpfmul_hadds,2,sm %b[32], %b[87], %b[102], %b[87] qpshufb,3,sm %b[115], %b[118], %r3, %b[100] qpfmuls,5,sm %b[45], %b[62], %b[103] 
} { loop_mode qpfmul_hadds,2,sm %b[44], %b[101], %b[93], %b[71] qpshufb,3,sm %b[71], %b[76], %r3, %b[76] qpfsubs,4,sm %b[79], %b[77], %b[93] qpfmuls,5,sm %b[28], %b[100], %b[101] } { loop_mode qpfmul_hadds,0,sm %b[49], %b[95], %b[84], %b[84] qpshufb,1,sm %b[96], %b[96], %r4, %b[73] qpfmul_hadds,2,sm %b[25], %b[91], %b[94], %b[89] qpshufb,3,sm %b[73], %b[89], %r3, %b[91] qpfsubs,4,sm %b[69], %b[70], %b[94] qpfmuls,5,sm %b[13], %b[76], %b[95] } { loop_mode qpshufb,1,sm %b[108], %b[109], %r3, %b[80] qpfmul_hadds,2,sm %b[41], %b[85], %b[88], %b[85] qpshufb,3,sm %b[80], %b[80], %r4, %b[102] qpfsubs,4,sm %b[74], %b[81], %b[96] qpfmuls,5,sm %b[24], %b[91], %b[88] } { loop_mode qpfmul_hadds,0,sm %b[36], %b[90], %b[98], %b[81] qpshufb,1,sm %b[82], %b[82], %r4, %b[64] qpfmul_hadds,2,sm %b[17], %b[86], %b[99], %b[82] qpshufb,3,sm %b[64], %b[67], %r3, %b[67] qpfadds,4,sm %b[74], %b[81], %b[74] qpfsubs,5,sm %b[102], %b[97], %b[86] movaqp,0 area=21, ind=0, am=1, be=0, %b[1] movaqp,1 area=20, ind=0, am=1, be=0, %b[17] movaqp,2 area=21, ind=0, am=1, be=0, %b[13] movaqp,3 area=20, ind=0, am=1, be=0, %b[8] } { loop_mode qpshufb,1,sm %b[75], %b[75], %r4, %b[72] qpfmul_hadds,2,sm %b[12], %b[92], %b[72], %b[77] qpshufb,3,sm %b[78], %b[105], %r3, %b[75] qpfadds,5,sm %b[79], %b[77], %b[78] movaqp,0 area=19, ind=0, am=1, be=0, %b[25] movaqp,1 area=18, ind=0, am=1, be=0, %b[12] movaqp,2 area=19, ind=0, am=1, be=0, %b[24] movaqp,3 area=18, ind=0, am=1, be=0, %b[21] } { loop_mode qpshufb,1,sm %b[83], %b[83], %r4, %b[62] qpfmul_hadds,2,sm %b[9], %b[62], %b[103], %b[69] qpshufb,3,sm %b[65], %b[68], %r3, %b[65] qpfadds,5,sm %b[69], %b[70], %b[68] movaqp,0 area=17, ind=0, am=1, be=0, %b[29] movaqp,1 area=16, ind=0, am=1, be=0, %b[33] movaqp,2 area=17, ind=0, am=1, be=0, %b[28] movaqp,3 area=16, ind=0, am=1, be=0, %b[9] } { loop_mode qpshufb,1,sm %b[93], %b[93], %r5, %b[83] qpfmul_hadds,2,sm %b[5], %b[100], %b[101], %b[70] qpshufb,3,sm %b[96], %b[96], %r5, %b[90] qpfadds,5,sm %b[102], %b[97], 
%b[79] movaqp,0 area=15, ind=0, am=1, be=0, %b[5] movaqp,1 area=14, ind=0, am=1, be=0, %b[36] movaqp,2 area=15, ind=0, am=1, be=0, %b[37] movaqp,3 area=14, ind=0, am=1, be=0, %b[32] } { loop_mode qpshufb,1,sm %b[94], %b[94], %r5, %b[92] qpfmul_hadds,2,sm %b[16], %b[76], %b[95], %b[76] qpshufb,3,sm %b[86], %b[86], %r5, %b[86] movaqp,0 area=13, ind=0, am=1, be=0, %b[44] movaqp,1 area=12, ind=0, am=1, be=0, %b[16] movaqp,2 area=13, ind=0, am=1, be=0, %b[41] movaqp,3 area=12, ind=0, am=1, be=0, %b[40] } { loop_mode qpxor,1,sm %b[92], %r1, %b[88] qpfmul_hadds,2,sm %b[20], %b[91], %b[88], %b[91] qpxor,3,sm %b[86], %r1, %b[86] movaqp,0 area=11, ind=0, am=1, be=0, %b[48] movaqp,1 area=10, ind=0, am=1, be=0, %b[49] movaqp,2 area=11, ind=0, am=1, be=0, %b[45] movaqp,3 area=10, ind=0, am=1, be=0, %b[20] } { loop_mode qpfadd_adds,0,sm %b[65], %b[62], %b[74], %b[87] qpshufb,1,sm %b[89], %b[89], %r4, %b[95] qpshufb,3,sm %b[87], %b[87], %r4, %b[96] movaqp,0 area=9, ind=0, am=0, be=0, %b[93] movaqp,1 area=9, ind=16, am=1, be=0, %b[94] movaqp,2 area=9, ind=0, am=0, be=0, %b[89] movaqp,3 area=9, ind=16, am=1, be=0, %b[92] } { loop_mode qpfadd_adds,0,sm %b[75], %b[72], %b[78], %b[71] qpxor,1,sm %b[90], %r1, %b[100] qpshufb,3,sm %b[71], %b[71], %r4, %b[101] movaqp,0 area=8, ind=16, am=1, be=0, %b[90] movaqp,1 area=8, ind=0, am=0, be=0, %b[98] movaqp,2 area=8, ind=16, am=1, be=0, %b[97] movaqp,3 area=8, ind=0, am=0, be=0, %b[99] } { loop_mode qpfadd_rsubs,0,sm %b[67], %b[64], %b[68], %b[52] qpshufb,1,sm %b[82], %b[82], %r4, %b[103] qpshufb,3,sm %b[84], %b[84], %r4, %b[106] movaqp,0 area=7, ind=16, am=1, be=0, %b[102] movaqp,1 area=7, ind=0, am=0, be=0, %b[104] movaqp,2 area=7, ind=0, am=0, be=0, %b[82] movaqp,3 area=7, ind=16, am=1, be=0, %b[84] } { loop_mode qpfadd_rsubs,0,sm %b[80], %b[73], %b[79], %b[53] qpxor,1,sm %b[83], %r1, %b[108] qpshufb,3,sm %b[85], %b[85], %r4, %b[110] qpfadds,4,sm %b[106], %b[101], %b[106] qpfsubs,5,sm %b[106], %b[101], %b[107] movaqp,0 area=6, ind=0, am=0, 
be=0, %b[101] movaqp,1 area=6, ind=16, am=1, be=0, %b[105] movaqp,2 area=6, ind=0, am=0, be=0, %b[83] movaqp,3 area=6, ind=16, am=1, be=0, %b[85] } { loop_mode qpfadd_rsubs,0,sm %b[65], %b[62], %b[74], %b[74] qpshufb,1,sm %b[69], %b[69], %r4, %b[109] qpfadd_rsubs,2,sm %b[75], %b[72], %b[78], %b[69] qpshufb,3,sm %b[81], %b[81], %r4, %b[113] qpfadds,4,sm %b[96], %b[95], %b[111] qpfsubs,5,sm %b[96], %b[95], %b[112] movaqp,0 area=5, ind=16, am=0, be=0, %b[78] movaqp,1 area=1, ind=16, am=0, be=0, %b[95] movaqp,2 area=5, ind=16, am=0, be=0, %b[96] movaqp,3 area=1, ind=16, am=0, be=0, %b[81] } { loop_mode qpfadd_adds,0,sm %b[67], %b[64], %b[68], %b[57] qpshufb,1,sm %b[60], %b[61], %r3, %b[79] qpfadd_adds,2,sm %b[80], %b[73], %b[79], %b[56] qpshufb,3,sm %b[77], %b[77], %r4, %b[113] qpfadds,4,sm %b[113], %b[110], %b[110] qpfsubs,5,sm %b[113], %b[110], %b[114] movaqp,0 area=5, ind=0, am=1, be=0, %b[60] movaqp,1 area=0, ind=16, am=0, be=0, %b[68] movaqp,2 area=5, ind=0, am=1, be=0, %b[77] movaqp,3 area=0, ind=16, am=0, be=0, %b[61] } { loop_mode qpfsub_adds,0,sm %b[65], %b[62], %b[100], %b[116] qpshufb,1,sm %b[76], %b[76], %r4, %b[119] qpfsub_adds,2,sm %b[75], %b[72], %b[108], %b[113] qpshufb,3,sm %b[55], %b[54], %r3, %b[118] qpfadds,4,sm %b[113], %b[103], %g17 qpfsubs,5,sm %b[113], %b[103], %g16 movaqp,0 area=3, ind=16, am=1, be=0, %b[117] movaqp,1 area=3, ind=0, am=0, be=0, %b[103] movaqp,2 area=3, ind=16, am=1, be=0, %b[115] movaqp,3 area=3, ind=0, am=0, be=0, %b[76] } { loop_mode qpfsub_rsubs,0,sm %b[65], %b[62], %b[100], %b[72] qpshufb,1,sm %b[91], %b[91], %r4, %b[108] qpfsub_rsubs,2,sm %b[75], %b[72], %b[108], %b[70] qpshufb,3,sm %b[70], %b[70], %r4, %b[75] movaqp,0 area=4, ind=16, am=0, be=0, %b[91] movaqp,1 area=0, ind=0, am=1, be=0, %b[65] movaqp,2 area=4, ind=16, am=0, be=0, %b[100] movaqp,3 area=0, ind=0, am=1, be=0, %b[62] } { loop_mode qpfsub_rsubs,0,sm %b[67], %b[64], %b[88], %b[59] qpshufb,1,sm %b[58], %b[59], %r3, %g18 qpfsub_rsubs,2,sm %b[80], %b[73], %b[86], 
%b[58] qpshufb,3,sm %b[63], %b[66], %r3, %g19 movaqp,0 area=4, ind=0, am=1, be=0, %g20 movaqp,1 area=1, ind=0, am=1, be=0, %b[66] movaqp,2 area=4, ind=0, am=1, be=0, %g21 movaqp,3 area=1, ind=0, am=1, be=0, %b[63] } { loop_mode qpfadd_rsubs,0,sm %g18, %b[108], %b[106], %b[112] qpshufb,1,sm %b[107], %b[107], %r5, %g22 qpfadd_adds,2,sm %g18, %b[108], %b[106], %g24 qpshufb,3,sm %b[112], %b[112], %r5, %g23 qpshufb,4,sm %b[81], %b[95], %r0, %g25 movaqp,0 area=2, ind=0, am=0, be=0, %b[107] movaqp,1 area=2, ind=16, am=1, be=0, %g26 movaqp,2 area=2, ind=0, am=0, be=0, %b[106] movaqp,3 area=2, ind=16, am=1, be=0, %g27 } { loop_mode qpfadd_rsubs,0,sm %b[118], %b[119], %b[111], %b[105] qpxor,1,sm %g22, %r1, %g22 qpfadd_adds,2,sm %b[118], %b[119], %b[111], %b[111] qpxor,3,sm %g23, %r1, %g23 qpshufb,4,sm %b[61], %b[68], %r0, %g29 qpfmuls,5,sm %b[105], %g25, %g28 } { loop_mode qpfsub_rsubs,0,sm %g18, %b[108], %g22, %b[108] qpshufb,1,sm %b[114], %b[114], %r5, %g30 qpfsub_adds,2,sm %g18, %b[108], %g22, %b[101] qpshufb,3,sm %b[115], %b[117], %r0, %b[114] qpshufb,4,sm %b[76], %b[103], %r0, %g22 qpfmuls,5,sm %b[101], %g29, %g18 } { loop_mode qpfadd_adds,0,sm %b[79], %b[109], %b[110], %b[100] qpshufb,1,sm %g16, %g16, %r5, %g16 qpfadd_adds,2,sm %g19, %b[75], %g17, %b[85] qpshufb,3,sm %b[62], %b[65], %r0, %g31 qpfmuls,4,sm %b[85], %b[114], %r7 qpfmuls,5,sm %b[100], %g22, %r6 } { loop_mode qpfadd_rsubs,0,sm %b[79], %b[109], %b[110], %g17 qpxor,1,sm %g16, %r1, %g16 qpfadd_rsubs,2,sm %g19, %b[75], %g17, %b[110] qpshufb,3,sm %b[63], %b[66], %r0, %r9 qpfmuls,5,sm %g20, %g31, %g20 } { loop_mode qpfsub_rsubs,0,sm %b[118], %b[119], %g23, %b[118] qpxor,1,sm %g30, %r1, %g30 qpfsub_adds,2,sm %b[118], %b[119], %g23, %b[91] qpshufb,3,sm %g27, %g26, %r0, %r26 qpfmuls,5,sm %b[91], %r9, %g23 } { loop_mode qpfsub_adds,0,sm %g19, %b[75], %g16, %b[83] qpfsub_rsubs,2,sm %b[79], %b[109], %g30, %r27 qpshufb,3,sm %b[106], %b[107], %r0, %r29 qpshufb,4,sm %g27, %g26, %r3, %g26 qpfmuls,5,sm %b[83], %r26, %r28 } 
{ loop_mode qpfsub_rsubs,0,sm %g19, %b[75], %g16, %b[77] qpfmul_hadds,1,sm %b[94], %g25, %g28, %b[79] qpfsub_adds,2,sm %b[79], %b[109], %g30, %b[75] qpfmuls,4,sm %b[77], %g26, %b[94] qpfmuls,5,sm %g21, %r29, %b[109] } { loop_mode qpfsub_adds,0,sm %b[67], %b[64], %b[88], %b[64] qpfmul_hadds,1,sm %b[93], %g29, %g18, %b[68] qpfsub_adds,2,sm %b[80], %b[73], %b[86], %b[61] qpshufb,3,sm %b[61], %b[68], %r3, %b[80] qpshufb,4,sm %b[115], %b[117], %r3, %b[73] qpfmul_hadds,5,sm %b[104], %g31, %g20, %b[67] } { loop_mode qpfmul_hadds,0,sm %b[84], %g22, %r6, %b[86] qpfmul_hadds,2,sm %b[92], %b[114], %r7, %b[84] qpfmuls,3,sm %b[96], %b[73], %b[88] qpfmuls,4,sm %b[60], %b[80], %b[92] qpfmul_hadds,5,sm %b[102], %r9, %g23, %b[60] } { loop_mode addd,1,sm 0x10, %b[6], %b[4] ? %pcnt0 stqp,2 %r21, %b[6], %b[112] stqp,5 %r2, %b[6], %g24 } { loop_mode qpshufb,1,sm %b[81], %b[95], %r3, %b[81] stqp,2 %r20, %b[6], %b[105] stqp,5 %r24, %b[6], %b[111] } { loop_mode qpfmul_hadds,0,sm %b[89], %r26, %r28, %b[95] qpshufb,1,sm %b[69], %b[74], %r0, %b[89] stqp,2 %r22, %b[6], %b[108] qpshufb,3,sm %b[56], %b[57], %r0, %b[93] stqp,5 %r19, %b[6], %b[101] } { loop_mode qpfmul_hadds,0,sm %b[82], %r29, %b[109], %b[78] qpfmuls,1,sm %b[78], %b[81], %b[96] stqp,2 %r25, %b[6], %b[100] qpshufb,3,sm %b[53], %b[52], %r0, %b[85] qpfmuls,4,sm %b[51], %b[93], %b[82] stqp,5 %r23, %b[6], %b[85] } { loop_mode qpfmul_hadds,0,sm %b[99], %g26, %b[94], %b[94] stqp,2 %r17, %b[6], %g17 qpshufb,3,sm %b[71], %b[87], %r0, %b[99] qpfmuls,4,sm %b[35], %b[85], %b[100] stqp,5 %r12, %b[6], %b[110] } { loop_mode qpfmul_hadds,0,sm %b[98], %b[80], %b[92], %b[80] qpfmul_hadds,1,sm %b[97], %b[73], %b[88], %b[73] stqp,2 %r15, %b[6], %b[118] qpshufb,3,sm %b[58], %b[59], %r0, %b[88] qpfmuls,4,sm %b[50], %b[99], %b[91] stqp,5 %r14, %b[6], %b[91] } { loop_mode qpshufb,0,sm %b[79], %b[79], %r4, %b[79] qpshufb,1,sm %b[68], %b[68], %r4, %b[68] stqp,2 %r13, %b[6], %b[83] qpshufb,3,sm %b[70], %b[72], %r0, %b[83] qpfmuls,4,sm %b[31], %b[89], 
%b[92] stqp,5 %r16, %b[6], %r27 } { loop_mode alc alcf=1, alct=1 abn abnf=1, abnt=1 ct %ctpr1 ? %NOT_LOOP_END qpshufb,0,sm %b[84], %b[84], %r4, %b[75] qpshufb,1,sm %b[86], %b[86], %r4, %b[77] stqp,2 %r18, %b[6], %b[77] qpshufb,3,sm %b[113], %b[116], %r0, %b[84] qpfmuls,4,sm %b[38], %b[83], %b[86] stqp,5 %r11, %b[6], %b[75] }
Theoretical speed: 32 complex numbers in 39 cycles (32 × 8 bytes / 39) = 6.56 bytes/cycle
Quadrupled theoretical speed: 26.26 bytes/cycle
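The arithmetic behind these two numbers can be checked with a minimal sketch. The helper name `bytes_per_cycle` is hypothetical and is not part of the FFT code; the only assumption is that one `float complex` element occupies 8 bytes, as stated in the problem setup.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical helper: bytes of float complex data processed per cycle
 * when `elems` elements are handled in `cycles` cycles of the loop body. */
double bytes_per_cycle(size_t elems, unsigned cycles)
{
    assert(cycles != 0);
    return (double)(elems * sizeof(float complex)) / (double)cycles;
}
```

Calling `bytes_per_cycle(32, 39)` gives roughly 6.56, and multiplying that by 4 gives roughly 26.26, which matches the figures quoted above.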
Speed measurements

Summary for stage_radix4_readConjSwap_2x


The speeds dropped compared to the original stage_radix4_readConjSwap versions.
The FFT graph is located here.
Assembling the FFT
fft_radix2
We assemble fft_radix2 from reverse_radix2_x32 and all the stage_radix2 variants.
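As a reference point, here is a plain scalar textbook sketch of how a Reverse pass followed by log2(N) Stage passes forms a radix-2 FFT. It is not the vectorized code from this article. The function name `fft_radix2_sketch`, the stage ordering, and the twiddle table layout (the caller supplies `twiddle[k] = exp(-2*pi*i*k/n)` for `k < n/2`) are assumptions made purely for illustration.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Reverse the lowest `bits` bits of x. */
static size_t bit_reverse(size_t x, unsigned bits)
{
    size_t r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}

/* Textbook radix-2 FFT: one Reverse pass, then log2(n) Stage passes.
 * twiddle[k] must hold exp(-2*pi*i*k/n) for k < n/2 (assumed layout). */
void fft_radix2_sketch(const float complex *in, float complex *out,
                       size_t n, const float complex *twiddle)
{
    assert(n >= 2 && (n & (n - 1)) == 0);   /* n is a power of two */

    unsigned bits = 0;
    while (((size_t)1 << bits) < n)
        ++bits;

    /* Reverse: scatter the input into bit-reversed order. */
    for (size_t i = 0; i < n; ++i)
        out[bit_reverse(i, bits)] = in[i];

    /* Stages: butterflies over spans of 2, 4, 8, ... elements. */
    for (size_t half = 1; half < n; half *= 2) {
        const size_t step = n / (2 * half);  /* stride in the twiddle table */
        for (size_t start = 0; start < n; start += 2 * half) {
            for (size_t k = 0; k < half; ++k) {
                const float complex a = out[start + k];
                const float complex b = out[start + k + half] * twiddle[k * step];
                out[start + k]        = a + b;
                out[start + k + half] = a - b;
            }
        }
    }
}
```

For the input {1, 2, 3, 4} with the twiddle table {1, -i}, this produces {10, -2+2i, -2, -2-2i}, which is the DFT of that sequence.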


fft_radix2_2x
We assemble fft_radix2_2x from reverse_radix2_x32 and all the stage_radix2_2x variants.


fft_radix2_readConjSwap
We assemble fft_radix2_readConjSwap from reverse_radix2_x32 and all the stage_radix2_readConjSwap variants.


fft_radix2_readConjSwap_2x
We assemble fft_radix2_readConjSwap_2x from reverse_radix2_x32 and all the stage_radix2_readConjSwap_2x variants.


fft_radix4
We assemble fft_radix4 from reverse_radix4_x16 and all the stage_radix4 variants.


fft_radix4_2x
We assemble fft_radix4_2x from reverse_radix4_x16 and all the stage_radix4_2x variants.


fft_radix4_readConjSwap
We assemble fft_radix4_readConjSwap from reverse_radix4_x16 and all the stage_radix4_readConjSwap variants.


fft_radix4_readConjSwap_2x
We assemble fft_radix4_readConjSwap_2x from reverse_radix4_x16 and all the stage_radix4_readConjSwap_2x variants.


Further optimizations of the existing code
Right now the coefficients are spread over several arrays (conj/swap, coefC/coefD/coefE).
They could be placed in a single array, interleaved with each other.
This should make reads from memory more efficient.
It should not make things any worse.
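One possible way to interleave three coefficient arrays into one is sketched below. The function name `interleave_cde` and the `group` parameter (how many consecutive elements of each array stay together, for example the number of complex values in one vector register) are assumptions for illustration; the layout that actually suits the Stage loops may differ.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical helper: merge coefC/coefD/coefE (n elements each) into one
 * array `out` of 3*n elements laid out as
 *   C[0..group) D[0..group) E[0..group) C[group..2*group) ...
 * so that one sequential read stream feeds all three coefficient sets. */
void interleave_cde(const float complex *coefC, const float complex *coefD,
                    const float complex *coefE, float complex *out,
                    size_t n, size_t group)
{
    assert(group != 0 && n % group == 0);
    for (size_t base = 0; base < n; base += group) {
        float complex *dst = out + 3 * base;
        for (size_t i = 0; i < group; ++i) {
            dst[i]             = coefC[base + i];
            dst[group + i]     = coefD[base + i];
            dst[2 * group + i] = coefE[base + i];
        }
    }
}
```

The interleaving is done once, when the coefficient tables are prepared, so its cost does not affect the Stage loops themselves.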
Directions for further code
What to do next:

The 3x variant in radix2.
Right now 2x in radix2 speeds up the computation, so 3x (and beyond) can be tried.
Right now 2x in radix4 slows down the computation, so 3x is pointless there.
This is unlikely to yield anything interesting, but it can be done for completeness.

Stage variants in which the coefficients are computed on the fly instead of being read from memory.
This should speed up the computation in cases where the data does not fit in the cache.
For radix4 there is one more case to consider: read only coefC from memory and compute coefD/coefE from coefC.

Stage variants that take a single constant coefficient as input (or a set of c/d/e coefficients for radix4).
Note that the first Stage uses one and the same coefficient to process the whole input array. The second Stage uses 2 different coefficients: one for the first half of the array and another for the second half. The third Stage uses 4 different coefficients, and so on.
The processing of the input array can be split into parts that are each processed with the same coefficient. Obtain a coefficient (from memory or by computing it on the fly), process a part of the array, obtain the next coefficient, process the next part of the array.
This approach will work for the early Stages, where such parts are long enough. The later Stages need the usual ways of obtaining coefficients (reading from memory or computing on the fly at every step).
The boundary between early and late Stages will have to be found experimentally.
This is a separate task.

Merging Reverse with the first Stage.
Right now several Stages can be merged into one loop (2x). Reverse and the first Stage can be merged in the same way (in the general case, Reverse and several first Stages).

Input data of type double complex.
Right now only float complex is implemented.
A version for double complex is needed.
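The idea of processing the array in parts that share one coefficient can be sketched roughly as follows. This only illustrates the loop structure: `process_one` is a hypothetical placeholder (a plain multiplication) standing in for the real Stage computation, and the name `stage_const_coef_sketch` is likewise an assumption.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Placeholder for the real per-element work of a Stage (hypothetical). */
static inline float complex process_one(float complex x, float complex c)
{
    return x * c;
}

/* Process `data` (n elements) in num_coef equal parts.
 * Each part uses one coefficient that is fetched once, outside the
 * inner loop, instead of being read on every iteration. */
void stage_const_coef_sketch(float complex *data, size_t n,
                             const float complex *coef, size_t num_coef)
{
    assert(num_coef != 0 && n % num_coef == 0);
    const size_t part_len = n / num_coef;
    for (size_t part = 0; part < num_coef; ++part) {
        const float complex c = coef[part];      /* obtained once per part */
        float complex *p = data + part * part_len;
        for (size_t i = 0; i < part_len; ++i)
            p[i] = process_one(p[i], c);
    }
}
```

With `num_coef` equal to 1, 2, 4, ... this mirrors the first, second, third, ... Stage described above; as `part_len` shrinks, the per-part overhead grows, which is why the boundary between early and late Stages has to be measured.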
Conclusion
The stage_radix4_2x example shows that the compiler does a lot of work to pack instructions into cycles efficiently. Doing this by hand would take a lot of time.
I write the C code so that it is easy to read, and the compiler shuffles the operations so that they execute more efficiently. In assembly I would have to choose between readable code and fast code.
Assembly is pleasant for small problems, where you fully understand what happens where. However, raise the complexity a little and the brain starts to tire quickly.
I am not sure I could have solved this problem in pure assembly.
