Artem Solopiy @EntityFX

IoT Solutions Developer


I don't have any other computers. Nothing changed in the other tests, so those workloads apparently fit in the cache.

Re-ran the test with 8 DIMMs; that's how many were installed initially. There are 8 memory channels, so in theory 16 DIMMs are possible.

Memory bandwidth on the Elbrus-16S, 8×8 GB DIMMs (DDR4-2400):
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 16
Number of Threads counted = 16
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1688 microseconds.
   (= 1688 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           70389.0     0.002306     0.002273     0.002349
Scale:          67256.8     0.002406     0.002379     0.002446
Add:            74444.1     0.003237     0.003224     0.003256
Triad:          77148.4     0.003144     0.003111     0.003185
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Elbrus-8SV, 4×8 GB DDR4-2400 DIMMs
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 200000000 (elements), Offset = 0 (elements)
Memory per array = 1525.9 MiB (= 1.5 GiB).
Total memory required = 4577.6 MiB (= 4.5 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 89796 microseconds.
   (= 89796 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           22845.9     0.140402     0.140069     0.140706
Scale:          22963.4     0.139740     0.139352     0.140096
Add:            25437.5     0.189290     0.188698     0.191004
Triad:          25591.7     0.188562     0.187561     0.188898
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Core i7-2600, 2×8 GB DDR3-1333 DIMMs
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 100000000 (elements), Offset = 0 (elements)
Memory per array = 762.9 MiB (= 0.7 GiB).
Total memory required = 2288.8 MiB (= 2.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 119145 microseconds.
   (= 59572 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           11514.5     0.147185     0.138955     0.160230
Scale:          11379.7     0.148755     0.140601     0.160839
Add:            12667.7     0.197619     0.189458     0.207608
Triad:          12650.8     0.198835     0.189712     0.213622
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

By the way, there are cases where code under binary translation runs faster than native code, because the translator applies dynamic optimizations that a static compiler cannot.

So I looked at the Elbrus HPL build (version 2.2).

MPI: /opt/mpich-3.1.4/, linked with the -lmpi flag
La: some in-house library of theirs, linked with -leml and built with -DHPL_CALL_CBLAS

No way to run HPL on the Core i7 yet: that requires booting into the OS locally, and I now work remotely. WSL is an option, but I'm not sure what the performance would be like there.

But here are the results on a Kunpeng 920 at 2.6 GHz with 48 cores:

T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR10L4L4       40800   240     1    48             233.34             1.9406e+02
HPL_pdgesv() start time Wed Jan 26 22:07:05 2022

HPL_pdgesv() end time   Wed Jan 26 22:10:58 2022

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   9.95497593e-04 ...... PASSED
================================================================================

I don't work with the media; I'm interested in Elbrus benchmarks and in the Elbrus machines themselves (I'm an enthusiast: working with them, programming for them. Honestly, I really like the Elbrus and would love to own one). I'm not responsible for what the media write; I'm responsible for the numbers and for flaws in the tests, and I'll try to fix any errors found in the tests or methodology.

Factors:

  1. Engineering sample (some cache lines are disabled)

  2. 2 RAM DIMMs running below full speed (2400 instead of 3200); 8 DIMMs are needed

  3. Compiler is not the latest version

Right, that's immediately obvious. On top of that, you need to dig into the code and optimize the hot spots.

Yes, I added it, because the media have already started writing nonsense about it.

STREAM
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 5202 microseconds.
   (= 5202 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           20090.1     0.007969     0.007964     0.007975
Scale:          19408.0     0.008253     0.008244     0.008262
Add:            21848.2     0.010992     0.010985     0.011005
Triad:          22284.0     0.010776     0.010770     0.010796
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

It's an engineering sample with some cache lines partially disabled, which affects the results. The memory runs at DDR4-2400; in the production version it should run at DDR4-3200.

I hit some compilation errors:
 line 7861: error #350:
          more than one operator "*" matches these operands:
            function template
                      "jtk::Expr<jtk::BinExprOp<jtk::Constant<T>, jtk::matrix<T, Container>::const_iterator, jtk::OpMul<jtk::gettype<T, T2>::ty>>> jtk::operator*(T2, const jtk::matrix<T, Container> &)"
            function template
                      "jtk::matrix<double, Container> jtk::symmetric_sparse_matrix_wrapper<T>::operator*(const jtk::matrix<double, Container> &) const [with T=double]"
            operand types are: const jtk::symmetric_sparse_matrix_wrapper<double> * jtk::matrix<double, std::vector<double, std::allocator<double>>>
      matrix<T, Container> r = b - A * out;
                                     ^
          detected during:
            instantiation of
                      "void jtk::preconditioned_conjugate_gradient(jtk::matrix<T, Container> &, T &, uint32_t &, const jtk::symmetric_sparse_matrix_wrapper<T> &, const TPreconditioner &, const jtk::matrix<T, Container> &, const jtk::matrix<T, Container> &, T) [with T=double, Container=std::vector<double, std::allocator<double>>, TPreconditioner=jtk::diagonal_preconditioner<double>]"
                      at line 7917
            instantiation of
                      "void jtk::conjugate_gradient(jtk::matrix<T, Container> &, T &, uint32_t &, const jtk::symmetric_sparse_matrix_wrapper<T> &, const jtk::matrix<T, Container> &, const jtk::matrix<T, Container> &, T) [with T=double, Container=std::vector<double, std::allocator<double>>]"
                      at line 4739 of "/root/vectorforth/jtk/jtk.tests/mat_tests.cpp"

lcc: "/root/vectorforth/jtk/jtk.tests/../jtk/mat.h", line 7861: error #350:
          more than one operator "*" matches these operands:
            function template
                      "jtk::Expr<jtk::BinExprOp<jtk::Constant<T>, jtk::matrix<T, Container>::const_iterator, jtk::OpMul<jtk::gettype<T, T2>::ty>>> jtk::operator*(T2, const jtk::matrix<T, Container> &)"
            function template
                      "jtk::matrix<float, Container> jtk::symmetric_sparse_matrix_wrapper<T>::operator*(const jtk::matrix<float, Container> &) const [with T=float]"
            operand types are: const jtk::symmetric_sparse_matrix_wrapper<float> * jtk::matrix<float, std::vector<float, std::allocator<float>>>
      matrix<T, Container> r = b - A * out;
                                     ^
          detected during instantiation of
                    "void jtk::preconditioned_conjugate_gradient(jtk::matrix<T, Container> &, T &, uint32_t &, const jtk::symmetric_sparse_matrix_wrapper<T> &, const TPreconditioner &, const jtk::matrix<T, Container> &, const jtk::matrix<T, Container> &, T) [with T=float, Container=std::vector<float, std::allocator<float>>, TPreconditioner=jtk::diagonal_preconditioner<float>]"
                    at line 4773 of "/root/vectorforth/jtk/jtk.tests/mat_tests.cpp"

lcc: "/opt/mcst/lcc-home/1.25.17/e2k-v5-linux/include/smmintrin.h", line 155: warning #1444:
          function "__builtin_ia32_dpps" (declared at line 3929 of
          "/opt/mcst/lcc-home/1.25.17/e2k-v5-linux/include/e2kbuiltin.h") was
          declared deprecated
          ("The function may be slow due to inefficient implementation, please try to avoid it")
          [-Wdeprecated-declarations]
    ((__m128) __builtin_ia32_dpps ((__v4sf)(__m128)(X),                   \
              ^
 in expansion of macro "_mm_dp_ps" at line 5318 of
           "/root/vectorforth/jtk/jtk.tests/../jtk/mat.h"
        __m128 d = _mm_dp_ps(v1, v2, 0xf1);
                   ^
          detected during instantiation of
                    "void jtk::preconditioned_conjugate_gradient(jtk::matrix<T, Container> &, T &, uint32_t &, const jtk::symmetric_sparse_matrix_wrapper<T> &, const TPreconditioner &, const jtk::matrix<T, Container> &, const jtk::matrix<T, Container> &, T) [with T=float, Container=std::vector<float, std::allocator<float>>, TPreconditioner=jtk::diagonal_preconditioner<float>]"
                    at line 4773 of "/root/vectorforth/jtk/jtk.tests/mat_tests.cpp"

2 errors detected in the compilation of "/root/vectorforth/jtk/jtk.tests/mat_tests.cpp".
make[2]: *** [jtk/jtk.tests/CMakeFiles/jtk.tests.dir/build.make:132: jtk/jtk.tests/CMakeFiles/jtk.tests.dir/mat_tests.c

The most powerful computer I have is a Core i7-2600; I've been working on it for 10 years and everything still flies (web, development; I don't play games).

Yes, I adore Dmitry; I've been watching all his episodes since 2014.

By the way, instead of their SPARC line, MCST could have switched to RISC-V. And the money would probably have followed.
