Discussion:
[eigen] Matrix multiplication much slower on MSVC than on g++/clang
Patrik Huber
2018-02-07 14:30:07 UTC
Permalink
Hello,

I noticed that code I'm using is around 2x slower on VS2017 (15.5.5 and
15.6.0 Preview) than on g++-7 and clang-6. After some digging, I found that
it is down to the matrix multiplication with Eigen.
The simple benchmark (see below) tests matrix multiplication with various
sizes m x n * n x p where m, n, p are between 1 and 2048, and MSVC is
consistently around 1.5-2x slower than g++ and clang, which is quite huge.

Here are some examples. I'm of course using optimised builds in both cases:

cl.exe gemm_test.cpp -I 3rdparty\eigen /EHsc /std:c++17 /arch:AVX2 /O2 /Ob2 /nologo

1124 1215 1465
col major (checksum: 0) elapsed_ms: 971
row major (checksum: 0) elapsed_ms: 976
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1771
row major (checksum: 0) elapsed_ms: 1778
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 819
row major (checksum: 0) elapsed_ms: 834
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 668
row major (checksum: 0) elapsed_ms: 666

And gcc:
g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o gcc7_gemm_test

1124 1215 1465
col major (checksum: 0) elapsed_ms: 696
row major (checksum: 0) elapsed_ms: 706
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1294
row major (checksum: 0) elapsed_ms: 1326
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 425
row major (checksum: 0) elapsed_ms: 418
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 321
row major (checksum: 0) elapsed_ms: 332

I fiddled around quite a lot with the MSVC flags but no other flag made
anything faster.

My CPU is an i7-7700HQ with AVX2.
Now, interestingly, I've run the same benchmark on an older i5-3550, which
has AVX but not AVX2.
The MSVC run time is nearly identical to that on the i7.
But now g++ (5.4) (again with -march=native) is nearly the same speed as
MSVC:

1124 1215 1465
col major (checksum: 0) elapsed_ms: 946
row major (checksum: 0) elapsed_ms: 944
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1798
row major (checksum: 0) elapsed_ms: 1816
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 687
row major (checksum: 0) elapsed_ms: 692
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 535
row major (checksum: 0) elapsed_ms: 551

This rather looks to me as if the MSVC optimiser cannot make use of AVX2
even when it is available on the CPU: it is just as slow as with AVX alone,
while g++ and clang really do exploit AVX2 and get a 1.5-2x speed-up.

Interestingly, if I use g++-7 on the i5, I get extremely bad results:
1124 1215 1465
col major (checksum: 0) elapsed_ms: 2007
row major (checksum: 0) elapsed_ms: 2019
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 3941
row major (checksum: 0) elapsed_ms: 3923
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 1625
row major (checksum: 0) elapsed_ms: 1624
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 1276
row major (checksum: 0) elapsed_ms: 1287

This looks like a performance regression in g++-7, so I don't think it is
relevant to the problem I'm seeing here. I am trying to report it on the
GCC bug tracker, but they make signing up extremely hard.


If I use MSVC without the /arch:AVX2 switch, and g++-5 with -march=core2,
then I get identical results. So it looks like at the SSE level, MSVC and
g++-5 are on par, but with AVX2, g++ and clang just blow MSVC away.
Again, I'm seeing the same performance regression with g++-7 and
-march=core2; it's around 50% slower than g++-5.
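For reference, the build configurations compared in this thread can be summarised as follows (commands as in the original post; output file names and the comments on what each flag implies are my additions):

```shell
# MSVC, AVX2 build (the slow case)
cl.exe gemm_test.cpp -I 3rdparty\eigen /EHsc /std:c++17 /arch:AVX2 /O2 /Ob2 /nologo

# MSVC, default /arch (SSE2 baseline on x64)
cl.exe gemm_test.cpp -I 3rdparty\eigen /EHsc /std:c++17 /O2 /Ob2 /nologo

# g++, full native ISA (AVX2 and FMA on the i7-7700HQ)
g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o gcc7_gemm_test

# g++, pre-AVX baseline for comparison
g++-5 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -o gcc5_gemm_test
```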

The Eigen version I used is 3.3.4.
Btw I realise the benchmark is a bit crude (and might better be done with
something like Google Benchmark), but I'm getting very consistent results.


So I guess my main question is:
Is there anything that the Eigen developers can do, either to enable AVX2
on MSVC, or to help the MSVC optimiser? Or is it purely a MSVC optimiser
problem?

FYI I reported this to MS:
https://developercommunity.visualstudio.com/content/problem/194955/vs-produces-code-that-is-15-2x-slower-than-gcc-and.html
(with code attached, but the code is not visible to non-MS-employees).

If you are interested in more background information and more benchmarks,
the whole thing originated here:
https://github.com/Dobiasd/frugally-deep/issues/9 (but it's quite a lengthy
thread).

Thank you and best wishes,

Patrik


Benchmark code:

// gemm_test.cpp
#include <array>
#include <chrono>
#include <iostream>
#include <random>
#include <string>
#include <vector>
#include <Eigen/Dense>

using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
using ColMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;

template <typename Mat>
void run_test(const std::string& name, int s1, int s2, int s3)
{
    using namespace std::chrono;
    float checksum = 0.0f; // to prevent the compiler from optimising everything away
    const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
    for (std::size_t i = 0; i < 10; ++i)
    {
        Mat a_rm(s1, s2); // note: entries are left uninitialised; the run time does not depend on the values
        Mat b_rm(s2, s3);
        const auto c_rm = a_rm * b_rm;
        checksum += c_rm(0, 0);
    }
    const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
    std::cout << name << " (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
}

int main()
{
    //std::random_device rd;
    //std::mt19937 gen(0);
    //std::uniform_int_distribution<> dis(1, 2048);
    std::vector<int> vals = { 1124, 1215, 1465, 1730, 1235, 1758, 1116, 1736, 868, 1278, 1323, 788 };
    for (std::size_t i = 0; i < 12; ++i)
    {
        int s1 = vals[i++]; //dis(gen);
        int s2 = vals[i++]; //dis(gen);
        int s3 = vals[i];   //dis(gen);
        std::cout << s1 << " " << s2 << " " << s3 << std::endl;
        run_test<ColMajorMatrixXf>("col major", s1, s2, s3);
        run_test<RowMajorMatrixXf>("row major", s1, s2, s3);
        std::cout << "--------" << std::endl;
    }
    return 0;
}
--
Dr. Patrik Huber
Centre for Vision, Speech and Signal Processing
University of Surrey
Guildford, Surrey GU2 7XH
United Kingdom

Web: www.patrikhuber.ch
Mobile: +44 (0)7482 633 934
Gael Guennebaud
2018-02-08 12:40:18 UTC
Permalink
Hi,

I did not read your email carefully, but it seems that on the MSVC build
you are missing FMA. Indeed, compared to AVX, AVX2 does not bring any gain
for matrix-matrix multiply; only FMA does (usually around 1.5x). In
contrast, with gcc/clang, -march=native activates all supported instruction
sets, including FMA on recent CPUs.

gael
Post by Patrik Huber
[...]
Patrik Huber
2018-02-08 13:27:52 UTC
Permalink
Hi Gael,

Thanks for the reply.
Information on the topic of MSVC and FMA seems a bit scarce, but this blog
post says that with /arch:AVX2, "the compiler will generate code that
includes AVX2 and FMA instructions"
(https://blogs.msdn.microsoft.com/vcblog/2014/02/28/avx2-support-in-visual-studio-c-compiler/).
So at least as far as the flags are concerned, the compiler should emit FMA
instructions if it can.
Since the difference I'm seeing is around 1.5x or more, it is indeed highly
likely that you are correct and the MSVC code is so much slower because it
doesn't emit FMA instructions, unlike gcc and clang. But if the compiler
flag is set, why does it not emit FMA instructions?
Is there an FMA code path for MSVC in Eigen, and do people in general see
MSVC using FMA when using Eigen?

Thank you and best wishes,

Patrik
Post by Gael Guennebaud
[...]
Oleg Shirokobrod
2018-02-08 14:35:58 UTC
Permalink
Hi Patrik,

Have a look at this link
https://developercommunity.visualstudio.com/content/problem/107145/inefficient-code-generation-for-fma-instructions.html

Best regards,

Oleg Shirokobrod
Post by Patrik Huber
[...]
Christoph Hertzberg
2018-02-08 15:23:51 UTC
Permalink
Clang and GCC seem(ed) to have some issues with FMA as well; we have
some inline assembly in the corresponding PacketMath header:

https://bitbucket.org/eigen/eigen/src/2355b229/Eigen/src/Core/arch/AVX/PacketMath.h?at=default&fileviewer=file-view-default#PacketMath.h-160


Christoph
Post by Oleg Shirokobrod
Hi Patrik,
Have a look at this link
https://developercommunity.visualstudio.com/content/problem/107145/inefficient-code-generation-for-fma-instructions.html
Best regards,
Oleg Shirokobrod
Post by Patrik Huber
Hi Gael,
Thanks for the reply.
Information on the topic of MSVC and FMA seems a bit scarce. But this blog
post says that with /arch:AVX2, " The compiler will generate code that
includes AVX2 and FMA instructions. "(https://blogs.msdn.
microsoft.com/vcblog/2014/02/28/avx2-support-in-visual-studio-c-compiler/).
So I think that at least concerning the flags, the compiler should emit FMA
instructions, if it can.
Since the difference I'm seeing is around 1.5+, it is indeed highly likely
that you are correct and the MSVC code is so much slower because it doesn't
emit FMA instructions, compared to gcc & clang. But if the compiler flags
are set - why does it not emit FMA instructions?
Is there an FMA code path for MSVC in Eigen and are people in general
seeing MSVC using FMA when using Eigen?
Thank you and best wishes,
Patrik
Post by Gael Guennebaud
Hi,
I did not read carefully your email, but it seems that on the MSVC build
you are missing FMA. Indeed, Compared to AVX, AVX2 does not bring any gain
for matrix-matrix multiply, only FMA does (usually around 1.5). In
contrast, with gcc/clang -march=native activate all supported instruction
sets, including FMA on recent CPUs.
gael
Post by Patrik Huber
Hello,
I noticed that code I'm using is around 2x slower on VS2017 (15.5.5 and
15.6.0 Preview) than on g++-7 and clang-6. After some digging, I found that
it is down to the matrix multiplication with Eigen.
The simple benchmark (see below) tests matrix multiplication with
various sizes m x n * n x p where m, n, p are between 1 and 2048, and MSVC
is consistently around 1.5-2x slower than g++ and clang, which is quite
huge.
cl.exe gemm_test.cpp -I 3rdparty\eigen /EHsc /std:c++17 /arch:AVX2 /O2
/Ob2 /nologo
1124 1215 1465
col major (checksum: 0) elapsed_ms: 971
row major (checksum: 0) elapsed_ms: 976
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1771
row major (checksum: 0) elapsed_ms: 1778
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 819
row major (checksum: 0) elapsed_ms: 834
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 668
row major (checksum: 0) elapsed_ms: 666
g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o
gcc7_gemm_test
1124 1215 1465
col major (checksum: 0) elapsed_ms: 696
row major (checksum: 0) elapsed_ms: 706
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1294
row major (checksum: 0) elapsed_ms: 1326
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 425
row major (checksum: 0) elapsed_ms: 418
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 321
row major (checksum: 0) elapsed_ms: 332
I fiddled around quite a lot with the MSVC flags but no other flag made
anything faster.
My CPU is an i7-7700HQ with AVX2.
Now interestingly, I've run the same benchmark on an older i5-3550,
which has AVX, but not AVX2.
The run time on MSVC is nearly identical.
But now g++ (5.4) (again with -march=native) is nearly the same speed as
1124 1215 1465
col major (checksum: 0) elapsed_ms: 946
row major (checksum: 0) elapsed_ms: 944
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1798
row major (checksum: 0) elapsed_ms: 1816
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 687
row major (checksum: 0) elapsed_ms: 692
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 535
row major (checksum: 0) elapsed_ms: 551
This sort-of looks to me as if the MSVC optimiser cannot make use of
AVX2, if it is available on the CPU. It's just as slow as only with AVX,
while g++ and clang can really make use of AVX2 and get a 1.5-2x speed-up.
1124 1215 1465
col major (checksum: 0) elapsed_ms: 2007
row major (checksum: 0) elapsed_ms: 2019
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 3941
row major (checksum: 0) elapsed_ms: 3923
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 1625
row major (checksum: 0) elapsed_ms: 1624
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 1276
row major (checksum: 0) elapsed_ms: 1287
I believe this looks like a performance regression in g++-7. So I don't
think this is relevant to the problem I'm seeing. I am trying to report
this to the GCC bugtracker but they make signing up extremely hard.
If I use MSVC without the /arch:AVX2 switch, and g++5 with -march=core2,
then I am getting identical results. So it looks like with SSE3, MSVC and
g++5 are on par, but with AVX2, g++ and clang just blow away MSVC.
Again I'm seeing the same performance regression with g++7 and -march=core2;
it's around 50% slower than g++-5.
The Eigen version I used is 3.3.4.
Btw I realise the benchmark is a bit crude (and might better be done
with something like Google Benchmark), but I'm getting very consistent
results.
Is there anything that the Eigen developers can do, either to enable
AVX2 on MSVC, or to help the MSVC optimiser? Or is it purely a MSVC
optimiser problem?
FYI I reported this to MS:
https://developercommunity.visualstudio.com/content/problem/194955/vs-produces-code-that-is-15-2x-slower-than-gcc-and.html
(with code attached, but the code is not visible to non-MS-employees).
If you are interested in more background information and more benchmarks, the
whole thing originated here: https://github.com/Dobiasd/frugally-deep/issues/9
(but it's quite a lengthy thread).
Thank you and best wishes,
Patrik
// gemm_test.cpp
#include <array>
#include <chrono>
#include <iostream>
#include <random>
#include <string>
#include <vector>
#include <Eigen/Dense>
using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic,
Eigen::Dynamic, Eigen::RowMajor>;
using ColMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic,
Eigen::Dynamic, Eigen::ColMajor>;
template <typename Mat>
void run_test(const std::string& name, int s1, int s2, int s3)
{
    using namespace std::chrono;
    float checksum = 0.0f; // to prevent the compiler from optimising everything away
    const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
    for (size_t i = 0; i < 10; ++i)
    {
        Mat a_rm(s1, s2);
        Mat b_rm(s2, s3);
        const auto c_rm = a_rm * b_rm;
        checksum += c_rm(0, 0);
    }
    const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
    std::cout << name << " (checksum: " << checksum << ") elapsed_ms: "
              << elapsed_ms << std::endl;
}
int main()
{
    //std::random_device rd;
    //std::mt19937 gen(0);
    //std::uniform_int_distribution<> dis(1, 2048);
    std::vector<int> vals = { 1124, 1215, 1465, 1730, 1235, 1758, 1116,
                              1736, 868, 1278, 1323, 788 };
    for (std::size_t i = 0; i < 12; ++i)
    {
        int s1 = vals[i++]; //dis(gen);
        int s2 = vals[i++]; //dis(gen);
        int s3 = vals[i];   //dis(gen);
        std::cout << s1 << " " << s2 << " " << s3 << std::endl;
        run_test<ColMajorMatrixXf>("col major", s1, s2, s3);
        run_test<RowMajorMatrixXf>("row major", s1, s2, s3);
        std::cout << "--------" << std::endl;
    }
    return 0;
}
--
Dr. Patrik Huber
Centre for Vision, Speech and Signal Processing
University of Surrey
Guildford, Surrey GU2 7XH
United Kingdom
Web: www.patrikhuber.ch
Mobile: +44 (0)7482 633 934 <+44%207482%20633934>
Edward Lam
2018-02-08 14:04:25 UTC
Permalink
Apparently, one also needs to supply /fp:fast in addition to /arch:AVX2 to
enable FMA code generation on MSVC.

However, even after I did this, I did not see a speed improvement in Patrik's
benchmark code when using Eigen 3.3.4, VS2017 15.5.5 and these compiler options:
/O2 /Fa /std:c++17 /arch:AVX2 /fp:fast
/D_SILENCE_CXX17_NEGATORS_DEPRECATION_WARNING

I've even confirmed this by grepping through the disassembly after compiling
gemm_test.cpp with these cl.exe options:
$ grep vfmadd gemm_test.asm | wc -l
125

So it doesn't appear to make a difference in this case.

-Edward
Post by Gael Guennebaud
Hi,
I did not read carefully your email, but it seems that on the MSVC build you
are missing FMA. Indeed, compared to AVX, AVX2 does not bring any gain for
matrix-matrix multiply; only FMA does (usually around 1.5x). In contrast, with
gcc/clang, -march=native activates all supported instruction sets, including
FMA on recent CPUs.
gael
Christoph Hertzberg
2018-02-08 15:19:01 UTC
Permalink
Could you try writing a small AVX program which uses the
`_mm256_fmadd_ps(a,b,c)` intrinsic and see if it compiles with MSVC?
Perhaps only our `#ifdef __FMA__` test does not work with MSVC. (It
would be interesting to know how to detect FMA support then)

Christoph
--
Dr.-Ing. Christoph Hertzberg

Besuchsadresse der Nebengeschäftsstelle:
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 5
28359 Bremen, Germany

Postadresse der Hauptgeschäftsstelle Standort Bremen:
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 1
28359 Bremen, Germany

Tel.: +49 421 178 45-4021
Zentrale: +49 421 178 45-0
E-Mail: ***@dfki.de

Weitere Informationen: http://www.dfki.de/robotik
-----------------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Straße 122, D-67663 Kaiserslautern
Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
(Vorsitzender) Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
Sitz der Gesellschaft: Kaiserslautern (HRB 2313)
USt-Id.Nr.: DE 148646973
Steuernummer: 19/672/50006
-----------------------------------------------------------------------
Edward Lam
2018-02-08 19:14:39 UTC
Permalink
Post by Christoph Hertzberg
Could you try writing a small AVX program which uses the
`_mm256_fmadd_ps(a,b,c)` intrinsic and see if it compiles with MSVC? Perhaps
only our `#ifdef __FMA__` test does not work with MSVC. (It would be interesting
to know how to detect FMA support then)
That works! For detection, the documentation at
https://msdn.microsoft.com/en-us/library/b0084kay.aspx suggests that perhaps
this will work:

#if defined(_MSC_VER) && defined(__AVX2__)
#define __FMA__
#endif

For reference, recompiling the earlier test with the best options plus -D__FMA__
produces:

$ ./gemm_test # 325 fmadd instructions produced
1124 1215 1465
col major (checksum: 0) elapsed_ms: 962
row major (checksum: 0) elapsed_ms: 1021
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 1798
row major (checksum: 0) elapsed_ms: 1805
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 712
row major (checksum: 0) elapsed_ms: 712
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 578
row major (checksum: 0) elapsed_ms: 584
--------

Compared to the same compiler options *without* -D__FMA__ :

$ ./gemm_test # 125 fmadd instructions produced
1124 1215 1465
col major (checksum: 0) elapsed_ms: 1245
row major (checksum: 0) elapsed_ms: 1160
--------
1730 1235 1758
col major (checksum: 0) elapsed_ms: 2071
row major (checksum: 0) elapsed_ms: 2066
--------
1116 1736 868
col major (checksum: 0) elapsed_ms: 905
row major (checksum: 0) elapsed_ms: 905
--------
1278 1323 788
col major (checksum: 0) elapsed_ms: 711
row major (checksum: 0) elapsed_ms: 720
--------


Cheers,
-Edward
Edward Lam
2018-02-08 19:17:51 UTC
Permalink
PS. I should note that adding whole program optimization (even for a single .cpp
file!) causes VS2017 to suddenly not generate FMA instructions again. So it's
important to *NOT* use /GL with cl.exe.
Patrik Huber
2018-02-08 20:08:08 UTC
Permalink
Hi all,

Thank you very much Oleg, Christoph and Edward! Absolutely fantastic that
you are able to help! :-)

Edward, this is brilliant. After compiling my benchmark on my machine with
the same flags but an added -D__FMA__ flag, I can see a 1.5-2x speed
increase, and MSVC is as fast as gcc! Wow.

Btw, I also noticed the speed drop with /GL. I reported this to MS
yesterday:
https://developercommunity.visualstudio.com/content/problem/194951/gl-results-in-15-2x-worse-run-time.html
It seems like you solved why this happens. I also think /GL may even
inhibit emitting AVX and AVX2 code.
Post by Edward Lam
Apparently, one also needs to supply /fp:fast in addition to /arch:AVX2
to enable FMA code generation on MSVC.

I think this is incorrect though. I think /fp:fast is not needed for MSVC to
generate FMA code. Also, gcc and clang can generate FMA code without
-ffast-math (which I guess is sort of equivalent to /fp:fast).

So, I think we solved this already. The speed-gain is amazing. Can we
include this detection mechanism for MSVC into the next Eigen release?

Thank you very much again to everyone,

Patrik
Edward Lam
2018-02-08 21:19:10 UTC
Permalink
Hi Patrik,
Post by Patrik Huber
I think this is incorrect though. I think /fp:fast is not needed for MSVC to
generate FMA code. Also, gcc and clang can generate FMA code without -ffast-math
(which I guess is sort of equivalent to /fp:fast).
Using /fp:fast is not necessary for the intrinsics, but without it, I can't get
this to generate a vfmadd instruction:
=========
//foo.cpp
//
// Test with: cl /Fa /O2 /arch:AVX2 /fp:fast foo.cpp
// Generates foo.exe and foo.asm

float mul_add(float a, float b, float c) {
return a*b + c;
}

int main()
{
return 0;
}
=========

Best regards,
-Edward
Patrik Huber
2018-02-08 21:40:13 UTC
Permalink
Hi Edward,

I see, that's good to know, thank you! So /fp:fast has the potential to let
the compiler generate even more FMA instructions. In practice I've observed
the same as you wrote earlier, though: adding /fp:fast in a few of my
applications didn't yield any performance benefit.
The FMA speed-up is huge though :-)))

Thank you again and best wishes,

Patrik
Gael Guennebaud
2018-02-09 08:16:48 UTC
Permalink
Post by Edward Lam
That works! For detection, the documentation at
https://msdn.microsoft.com/en-us/library/b0084kay.aspx suggests that
#if defined(_MSC_VER) && defined(__AVX2__)
#define __FMA__
#endif
To implement that, we need to make sure that AVX2 implies FMA on all
architectures. This seems to be true for Intel's, but I'm not sure about AMD.


gael
Edward Lam
2018-02-09 13:56:33 UTC
Permalink
Post by Gael Guennebaud
To implement that we need to make sure that on all architectures AVX2 => FMA.
This seems to be true for Intel's ones, but I'm not sure about AMD.
According to
https://stackoverflow.com/questions/16348909/how-do-i-know-if-i-can-compile-with-fma-instruction-sets,
all AMD processors which support AVX2 also support FMA. Unfortunately, I
couldn't easily confirm this through official online resources. The Wikipedia
page on Advanced_Vector_Extensions notes that only AMD Excavator processors
(and up) support AVX2, and those definitely support FMA (double-checked at
https://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf).

-Edward
Christoph Hertzberg
2018-02-09 15:19:40 UTC
Permalink
It seems that there are AMD architectures which support FMA3, but not
AVX2. This would make the check above safe, but not optimal.
https://en.wikipedia.org/wiki/Piledriver_(microarchitecture)

Bulldozer supports FMA4 (and AVX1), but not FMA3 (nor AVX2).
https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)

It seems Eigen generally only supports FMA3, even though supporting FMA4
should be relatively easy (just the name of the intrinsic is different).

That said, I'm ok with the hack suggested above. We should probably
still document how to make sure that FMA is enabled (if AVX2 is not
available).


Christoph