Discussion:
Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)
D***@dlr.de
2018-08-01 09:10:31 UTC
Permalink
Hello everyone,

with the recent release of 3.3.5 I've once again looked at upgrading from our currently used Eigen 3.2 to the current stable branch, but some performance regressions remain, which makes this a difficult decision: I'm unable to nail down the exact cause (probably because it isn't a single one), and I'd prefer not to slow down the overall performance.

I've attached a document with some performance measurements for different compilers, different Eigen versions, and 3 different test-cases for our code (tau, cgns, dg) that stress different areas / sizes.
The "vs best" column compares run-time against the overall best run-time, "vs same" only relative to shortest run-time with the same compiler (so essentially between different Eigen variants with the same compiler).
The Eigen 3.2 version used was 3.2.9 plus some backported improvements to AutoDiffScalar;
the Eigen 3.3 version used was 3.3.5.
The tests were run on a Xeon E3-1276 v3 (with our code doing its own multi-threading, and Eigen configured not to use any threading of its own). Reported times are the minimum of 4 runs.

We use Eigen in a CFD code for 3 roughly distinct subject areas (a short code sketch follows after the case list below):
1) fixed-size vectors (and some matrices) of doubles, direct access to individual values (with compile-time known indices) or segments, simple linear algebra, few matrix-vector products.
2) same as 1, but using Eigen::AutoDiffScalar instead of double (building up a Jacobian)
3) Fixed-size matrix-vector products (inside a Block-Jacobi iteration, not using any of Eigen's solvers)

For the different cases:
tau: Only uses 1), with vectors of sizes 5 and 8, matrices of size 5x5
cgns: Uses 1)-3), with vectors of sizes 6 and 13, matrices of size 6x6 (for both 1 and 3).
dg: Uses 1)-3), with vectors of sizes 5 and 8, matrices of size 5x5 (for 1) and 20x20 (for 3).
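For illustration, the three areas boil down to patterns roughly like the following. This is a simplified sketch only, not our actual code; the names and expressions are made up, and only the sizes and the kind of operations are representative:

#include <Eigen/Dense>
#include <unsupported/Eigen/AutoDiff>

// Area 1: fixed-size vectors/matrices of double, direct coefficient and
// segment access with compile-time indices, simple linear algebra.
// Area 2: the same code, instantiated with Scalar = Eigen::AutoDiffScalar<...>
// to build up a Jacobian.
template <typename Scalar>
Eigen::Matrix<Scalar, 5, 1> flux(const Eigen::Matrix<Scalar, 5, 1> &state)
{
    Eigen::Matrix<Scalar, 5, 1> f;
    f[0] = state[0] * state[1];                                          // fixed indices
    f.template segment<3>(1) = state[0] * state.template segment<3>(1);  // segment access
    f[4] = state[4] + state[1] * state[1];
    return f;
}

// Area 3: fixed-size matrix-vector products inside the Block-Jacobi sweep
// (5x5, 6x6 or 20x20 blocks depending on the test case).
inline Eigen::Matrix<double, 5, 1>
blockJacobiUpdate(const Eigen::Matrix<double, 5, 5> &blockInv,
                  const Eigen::Matrix<double, 5, 1> &residual)
{
    return blockInv * residual;
}

// Area 2 would call flux() with something like
//   using AD = Eigen::AutoDiffScalar<Eigen::Matrix<double, 5, 1>>;
//   Eigen::Matrix<AD, 5, 1> adState; ...; flux(adState);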

The outcomes seem to be
- clang is generally fastest
- the performance regression is more pronounced for gcc
- (partial) vectorization seems to "hurt" simple direct access (area 1); disabling it improves performance with clang, or at least reduces the impact of Eigen 3.3 with gcc

If we were only looking at clang, I'd be nearly willing to advocate moving to 3.3 (with default settings), because only a regression for the "tau" case remains.

Unfortunately, I'm at a loss as to how to pin-point these any further. Attempts at extracting a reduced test-case / example that exhibits the same behavior have not been fruitful, and profiling the actual code under Eigen 3.2 and 3.3 does not seem to yield directly actionable information.

If anyone has any ideas for things to try, I'm all ears. :)

Either way, thanks for your helpful (and nice to use) library!


Best regards

Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de

________________________________________
Von: Vollmer, Daniel
Gesendet: Donnerstag, 28. Juli 2016 12:46
An: ***@lists.tuxfamily.org
Betreff: RE: [eigen] 3.3-beta2 released!

Hi Gael,

> Fixed: https://bitbucket.org/eigen/eigen/commits/e35a38ad89fe/
> With float I get a nearly x2 speedup for the above 5x5 matrix-vector
> products (compared to 3.2), and x1.4 speedup with double.

I tried out this version (ca9bd08) and the results are as follows:
Note: the explicit solver pretty much only does residual evaluations,
whereas the implicit solver does a residual evaluation, followed by a
Jacobian computation (using AutoDiffScalar) and then a block-based
Gauss-Jacobi iteration where the blocks are 5x5 matrices to
approximately solve a linear system based on the Jacobian and the
residual.

Explicit solver:
----------------
eigen-3.3-ca9bd08 10.9s => 09% slower
eigen-3.3-beta2 11.1s => 11% slower
eigen-3.3-beta2 UNALIGNED_VEC=0 10.0s => 00% slower
eigen-3.2.9 10.0s => baseline

Implicit solver:
----------------
eigen-3.3-ca9bd08 34.2s => 06% faster
eigen-3.3-beta2 37.5s => 03% slower
eigen-3.3-beta2 UNALIGNED_VEC=0 38.2s => 05% slower
eigen-3.2.9 36.5s => baseline

So the change definitely helps for the implicit solver (which has lots
of 5x5 by 5x1 double multiplies), but for the explicit solver the
overhead of unaligned vectorization doesn't pay off. Maybe the use of
3D vectors (which are used for geometric normals and coordinates) is
problematic because it's such a borderline case for vectorization?

What I don't quite understand is the difference between 3.2.9 (which
doesn't vectorize for the given matrix sizes) and 3.3-beta2 without
vectorization: Something in 3.3 is slower under those conditions, but
maybe it's not the matrix-vector multiplies, as it could also be
AutoDiffScalar being slower.


Best regards

Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de
Marc Glisse
2018-08-01 10:06:02 UTC
Permalink
On Wed, 1 Aug 2018, ***@dlr.de wrote:

> I've attached a document with some performance measurements for
> different compilers, different Eigen versions, and 3 different
> test-cases for our code (tau, cgns, dg) that stress different areas /
> sizes.

Would it be hard to create small testcases that you could share and that
would still show a similar performance pattern? I assume that the easier
it is for the Eigen devs to reproduce, the more likely they are to investigate.

--
Marc Glisse
D***@dlr.de
2018-08-01 11:09:36 UTC
Permalink
Hi,

extracting the relevant code is rather difficult, unfortunately.

Attached you will find an extracted subset, which may point to one problematic area with partial vectorization: for this example, Eigen 3.3 is ~13% slower (unless I'm measuring incorrectly).

If I turn off partial vectorization, the run-time is the same. This is not completely representative (obviously), because in my "complete" runs disabling partial vectorization only decreases the run-time difference between 3.3 and 3.2, but does not eliminate it.
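(For completeness: by "turning off partial vectorization" I mean building with Eigen 3.3's unaligned-vectorization switch disabled, i.e. roughly the following before the first Eigen include, or the equivalent -D on the compile line.)

// Disable Eigen 3.3's unaligned ("partial") vectorization of small
// fixed-size objects; must come before the first Eigen header.
#define EIGEN_UNALIGNED_VECTORIZE 0
#include <Eigen/Dense>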


Best regards

Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de

Marc Glisse
2018-08-01 14:03:03 UTC
Permalink
On Wed, 1 Aug 2018, ***@dlr.de wrote:

> Attached you will find an extracted subset, which may point to one
> problematic area with partial vectorization as for this example Eigen
> 3.3 is ~13% slower (unless I'm measuring incorrectly).

Not what you are looking for, but the main difference between gcc and
clang here seems to be that gcc fails to turn the divisions by rho into
multiplications by 1/rho (computed once).
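In source terms the missed transformation is simply the following (schematic only; the variable names are not taken from the attached benchmark):

// As written: several divisions by the same rho.
void asWritten(double rho, const double m[3], double v[3])
{
    v[0] = m[0] / rho;
    v[1] = m[1] / rho;
    v[2] = m[2] / rho;
}

// What clang effectively emits under -ffast-math (and gcc apparently does
// not, here): one division, reused as multiplications.
void withReciprocal(double rho, const double m[3], double v[3])
{
    const double invRho = 1.0 / rho;
    v[0] = m[0] * invRho;
    v[1] = m[1] * invRho;
    v[2] = m[2] * invRho;
}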

--
Marc Glisse
Gael Guennebaud
2018-08-01 22:10:47 UTC
Permalink
Hi,

I tried your little benchmark with gcc 7 and I got no difference at all
(-O3 -DNDEBUG):

// 3.2: 7.75s
// 3.3: 7.68s

// 3.2 -march=native: 7.46s
// 3.3 -march=native: 7.48s

I ran each test 4 times and kept the best of each; the variations were
about 0.2s.

gael


D***@dlr.de
2018-08-02 10:25:02 UTC
Permalink
Hello Gael,

if I only use SSE2 (that is, without -march=native on a Haswell Xeon with AVX & FMA), then I also see no difference in the benchmark. The same is true for "-msse4.2" as well.

But if I use "-march=native" (which then enables AVX and FMA for Eigen-3.3) for *this particular example*, I see this ~13% slowdown. Can you confirm this?

For reference, the compilation command used was
g++ eigen_bench2.cpp -std=c++11 -Ofast -fno-finite-math-only -DNDEBUG -march=native -I eigen-3.2.9 / 3.3.5

I am no x86 assembly nor vectorization expert, so take the following with a lot of salt.

I have attached the generated assembly for the "hot loop". The first 49 lines are nearly the same, except that eigen-3.2 uses %rbp-relative addressing and 3.3 uses %rsp (with different offsets).

The remainder differs more. Eigen-3.3 doesn't use AVX registers as far as I can see, but it does use more "...packed-double" instructions (the Eigen 3.2 assembly doesn't seem to use any); even so, the sequence still seems (slightly) slower overall for Eigen-3.3.

eigen-3.2:
vmovsd -168(%rbp), %xmm3
vxorpd %xmm6, %xmm6, %xmm6
vmovsd .LC4(%rip), %xmm2
vmovsd -136(%rbp), %xmm8
vcvtsi2sd %eax, %xmm6, %xmm6
vfmadd132sd .LC3(%rip), %xmm2, %xmm6
vmulsd %xmm3, %xmm3, %xmm0
vmovsd -160(%rbp), %xmm2
vmovsd -176(%rbp), %xmm4
vmulsd %xmm8, %xmm3, %xmm1
vmovsd -144(%rbp), %xmm9
vmovsd -184(%rbp), %xmm5
vmovsd -152(%rbp), %xmm7
vfmadd231sd %xmm2, %xmm2, %xmm0
vfmadd231sd %xmm6, %xmm2, %xmm1
vfmadd231sd %xmm4, %xmm4, %xmm0
vmulsd .LC5(%rip), %xmm0, %xmm0
vfmadd231sd %xmm4, %xmm9, %xmm1
vdivsd %xmm5, %xmm0, %xmm0
vdivsd %xmm5, %xmm1, %xmm1
vsubsd %xmm0, %xmm7, %xmm0
vmulsd .LC6(%rip), %xmm0, %xmm0
vfmadd213sd -104(%rbp), %xmm1, %xmm4
vfmadd213sd -96(%rbp), %xmm1, %xmm3
vfmadd213sd -88(%rbp), %xmm1, %xmm2
vfmadd213sd -112(%rbp), %xmm1, %xmm5
vfmadd231sd %xmm9, %xmm0, %xmm4
vfmadd231sd %xmm8, %xmm0, %xmm3
vfmadd231sd %xmm6, %xmm0, %xmm2
vaddsd %xmm7, %xmm0, %xmm0
vmovsd %xmm5, -112(%rbp)
vfmadd213sd -80(%rbp), %xmm0, %xmm1
vmovsd %xmm4, -104(%rbp)
vmovsd %xmm3, -96(%rbp)
vmovsd %xmm2, -88(%rbp)
vmovsd %xmm1, -80(%rbp)
cmpl %ebx, %r13d
jl .L91

eigen-3.3:
vmovupd 120(%rsp), %xmm7
vmovapd 32(%rsp), %xmm8
vxorpd %xmm5, %xmm5, %xmm5
vcvtsi2sd %eax, %xmm5, %xmm5
vmovsd .LC4(%rip), %xmm4
vfmadd132sd .LC3(%rip), %xmm4, %xmm5
vmulpd %xmm7, %xmm7, %xmm1
vmovsd 16(%rsp), %xmm3
vmovsd 24(%rsp), %xmm4
vmovsd 8(%rsp), %xmm6
vunpckhpd %xmm1, %xmm1, %xmm2
vaddsd %xmm2, %xmm1, %xmm2
vmulpd %xmm8, %xmm7, %xmm1
vfmadd231sd %xmm3, %xmm3, %xmm2
vmulsd .LC5(%rip), %xmm2, %xmm2
vunpckhpd %xmm1, %xmm1, %xmm0
vaddsd %xmm0, %xmm1, %xmm0
vdivsd %xmm4, %xmm2, %xmm2
vfmadd231sd %xmm5, %xmm3, %xmm0
vdivsd %xmm4, %xmm0, %xmm0
vsubsd %xmm2, %xmm6, %xmm2
vmulsd .LC6(%rip), %xmm2, %xmm2
vfmadd213sd 88(%rsp), %xmm0, %xmm3
vfmadd213sd 64(%rsp), %xmm0, %xmm4
vmovddup %xmm0, %xmm1
vfmadd213pd 72(%rsp), %xmm1, %xmm7
vmovddup %xmm2, %xmm1
vfmadd231sd %xmm5, %xmm2, %xmm3
vaddsd %xmm6, %xmm2, %xmm2
vmovsd %xmm4, 64(%rsp)
vfmadd213sd 96(%rsp), %xmm2, %xmm0
vfmadd231pd %xmm1, %xmm8, %xmm7
vmovsd %xmm3, 88(%rsp)
vmovsd %xmm0, 96(%rsp)
vmovups %xmm7, 72(%rsp)
cmpl %ebx, %r12d
jl .L91



If there's anything else I could try to pin-point causes, I'm all ears. :)


Best regards

Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de
D***@dlr.de
2018-08-06 16:36:30 UTC
Permalink
Hello again,

I've been trying to understand a bit better what is happening with the performance regression I'm seeing, and at the moment I am under the impression that Eigen-3.3 makes it harder (impossible?) for gcc to recognize when no aliasing is happening.

I've further reduced my original example to essentially the following loop (see eigen_bench3.cpp for a self-contained version).
using Vec = Eigen::Matrix<double, 2, 1>;
Vec sum = Vec::Zero();
for (int i = 0; i < num; ++i)
{
    const Vec dirA = sum;
    const Vec dirB = dirA;

    sum += dirA.dot(dirB) * dirA;
}

When using Eigen-3.3, gcc-8 always spills back to memory when evaluating the expression, whereas when using Eigen-3.2 the loop runs completely without spilling. As there should be enough registers available, I assume this is due to "fear of aliasing".

This is independent of vectorization and all the other nice things, it happens with vectorization disabled as well.

Interestingly (?), when I change the size of Vec (from 2) to 1, then the generated code is identical for 3.3 and 3.2.
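(In case the attachment doesn't make it through the archive: the self-contained version is essentially just this loop wrapped in a main() with asm markers, something along the lines of the sketch below; the exact constants and the way num is obtained differ in the real eigen_bench3.cpp.)

#include <Eigen/Dense>
#include <cstdio>
#include <cstdlib>

int main(int argc, char **argv)
{
    using Vec = Eigen::Matrix<double, 2, 1>;
    // Take the trip count from the command line so the compiler cannot fold the loop away.
    const int num = (argc > 1) ? std::atoi(argv[1]) : 100000000;

    Vec sum = Vec::Zero();
    EIGEN_ASM_COMMENT("begin loop");
    for (int i = 0; i < num; ++i)
    {
        const Vec dirA = sum;
        const Vec dirB = dirA;

        sum += dirA.dot(dirB) * dirA;
    }
    EIGEN_ASM_COMMENT("end loop");
    std::printf("%g %g\n", sum[0], sum[1]);  // keep the result alive
    return 0;
}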

The following is the x86-64 assembly for the relevant loop (using only -O1 and with vectorization disabled for clarity; the problem remains at higher optimization levels):

Eigen-3.3:
# eigen_bench3.cpp:18: EIGEN_ASM_COMMENT("begin loop");
test ebx, ebx # iftmp.0_3
jle .L131 #,
# eigen_bench3.cpp:19: for (int i = 0; i < num; ++i)
mov edx, 0 # i,
.L132:
# eigen-3.3.5/Eigen/src/Core/DenseStorage.h:194: DenseStorage(const DenseStorage& other) : m_data(other.m_data) {
mov rsi, QWORD PTR [rsp+32] # MEM[(const struct DenseStorage &)&sum].m_data, MEM[(const struct DenseStorage &)&sum].m_data
mov rdi, QWORD PTR [rsp+40] # MEM[(const struct DenseStorage &)&sum].m_data, MEM[(const struct DenseStorage &)&sum].m_data
mov QWORD PTR [rsp], rsi # MEM[(struct DenseStorage *)&dirA].m_data, MEM[(const struct DenseStorage &)&sum].m_data
mov QWORD PTR [rsp+8], rdi # MEM[(struct DenseStorage *)&dirA].m_data, MEM[(const struct DenseStorage &)&sum].m_data
mov QWORD PTR [rsp+16], rsi # MEM[(struct DenseStorage *)&dirB].m_data, MEM[(const struct DenseStorage &)&sum].m_data
mov QWORD PTR [rsp+24], rdi # MEM[(struct DenseStorage *)&dirB].m_data, MEM[(const struct DenseStorage &)&sum].m_data
# eigen-3.3.5/Eigen/src/Core/GenericPacketMath.h:171: const Packet& b) { return a*b; }
vmovsd xmm0, QWORD PTR [rsp+8] # _26, MEM[(const double &)&dirA + 8]
vmovsd xmm1, QWORD PTR [rsp] # _28, MEM[(const double &)&dirA]
# eigen-3.3.5/Eigen/src/Core/GenericPacketMath.h:171: const Packet& b) { return a*b; }
vmulsd xmm2, xmm0, QWORD PTR [rsp+24] # tmp123, _26, MEM[(struct plain_array *)&dirB + 8B]
vmulsd xmm3, xmm1, QWORD PTR [rsp+16] # tmp124, _28, MEM[(struct plain_array *)&dirB]
# eigen-3.3.5/Eigen/src/Core/functors/BinaryFunctors.h:42: EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const result_type operator() (const LhsScalar& a, const RhsScalar& b) const { return a + b; }
vaddsd xmm2, xmm2, xmm3 # _30, tmp123, tmp124
# eigen-3.3.5/Eigen/src/Core/functors/BinaryFunctors.h:86: EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const result_type operator() (const LhsScalar& a, const RhsScalar& b) const { return a * b; }
vmulsd xmm1, xmm1, xmm2 # tmp125, _28, _30
# eigen-3.3.5/Eigen/src/Core/functors/AssignmentFunctors.h:49: EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void assignCoeff(DstScalar& a, const SrcScalar& b) const { a += b; }
vaddsd xmm1, xmm1, QWORD PTR [rsp+32] # tmp126, tmp125, MEM[(double &)&sum]
# eigen-3.3.5/Eigen/src/Core/functors/AssignmentFunctors.h:49: EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void assignCoeff(DstScalar& a, const SrcScalar& b) const { a += b; }
vmovsd QWORD PTR [rsp+32], xmm1 # MEM[(double &)&sum], tmp126
# eigen-3.3.5/Eigen/src/Core/functors/BinaryFunctors.h:86: EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const result_type operator() (const LhsScalar& a, const RhsScalar& b) const { return a * b; }
vmulsd xmm0, xmm0, xmm2 # tmp128, _26, _30
# eigen-3.3.5/Eigen/src/Core/functors/AssignmentFunctors.h:49: EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void assignCoeff(DstScalar& a, const SrcScalar& b) const { a += b; }
vaddsd xmm0, xmm0, QWORD PTR [rsp+40] # tmp129, tmp128, MEM[(double &)&sum + 8]
# eigen-3.3.5/Eigen/src/Core/functors/AssignmentFunctors.h:49: EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void assignCoeff(DstScalar& a, const SrcScalar& b) const { a += b; }
vmovsd QWORD PTR [rsp+40], xmm0 # MEM[(double &)&sum + 8], tmp129
# eigen_bench3.cpp:19: for (int i = 0; i < num; ++i)
inc edx # i
# eigen_bench3.cpp:19: for (int i = 0; i < num; ++i)
cmp ebx, edx # iftmp.0_3, i
jne .L132 #,
.L131:
# eigen_bench3.cpp:28: EIGEN_ASM_COMMENT("end loop");




And the same for Eigen-3.2

# eigen_bench3.cpp:18: EIGEN_ASM_COMMENT("begin loop");
test ebx, ebx # iftmp.0_3
jle .L131 #,
vmovsd xmm2, QWORD PTR [rsp+16] # sum__lsm.104, MEM[(const Scalar &)&sum]
vmovsd xmm1, QWORD PTR [rsp+24] # sum__lsm.105, MEM[(const Scalar &)&sum + 8]
# eigen_bench3.cpp:19: for (int i = 0; i < num; ++i)
mov edx, 0 # i,
vmovsd xmm4, QWORD PTR .LC6[rip] # tmp117,
.L132:
# eigen-3.2.9-mod/Eigen/src/Core/GenericPacketMath.h:114: const Packet& b) { return a*b; }
vmulsd xmm0, xmm1, xmm1 # tmp114, sum__lsm.105, sum__lsm.105
vmulsd xmm3, xmm2, xmm2 # tmp115, sum__lsm.104, sum__lsm.104
# eigen-3.2.9-mod/Eigen/src/Core/Functors.h:26: EIGEN_STRONG_INLINE const Scalar operator() (const Scalar& a, const Scalar& b) const { return a + b; }
vaddsd xmm0, xmm0, xmm3 # tmp116, tmp114, tmp115
vaddsd xmm0, xmm0, xmm4 # _52, tmp116, tmp117
vmulsd xmm2, xmm2, xmm0 # sum__lsm.104, sum__lsm.104, _52
vmulsd xmm1, xmm1, xmm0 # sum__lsm.105, sum__lsm.105, _52
# eigen_bench3.cpp:19: for (int i = 0; i < num; ++i)
inc edx # i
# eigen_bench3.cpp:19: for (int i = 0; i < num; ++i)
cmp ebx, edx # iftmp.0_3, i
jne .L132 #,
vmovsd QWORD PTR [rsp+16], xmm2 # MEM[(const Scalar &)&sum], sum__lsm.104
vmovsd QWORD PTR [rsp+24], xmm1 # MEM[(const Scalar &)&sum + 8], sum__lsm.105
.L131:
# eigen_bench3.cpp:28: EIGEN_ASM_COMMENT("end loop");


Best regards

Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de
Marc Glisse
2018-08-06 17:19:36 UTC
Permalink
On Mon, 6 Aug 2018, ***@dlr.de wrote:

> I've been trying to understand a bit better what is happening with the
> performance regression I'm seeing, and at the moment I am under the
> impression that Eigen-3.3 makes it harder (impossible?) for gcc to
> recognize when no aliasing is happening.

Nah, it is just gcc being silly.

> I've further reduced my original example to essentially the following loop (see eigen_bench3.cpp for a self-contained version).
> using Vec = Eigen::Matrix<double, 2, 1>;
> Vec sum = Vec::Zero();
> for (int i = 0; i < num; ++i)
> {
> const Vec dirA = sum;
> const Vec dirB = dirA;
>
> sum += dirA.dot(dirB) * dirA;
> }

Without vectors, the main loop at -O3 starts with

movdqu (%rax), %xmm0
addl $1, %edx
movaps %xmm0, -40(%rsp)
movsd -40(%rsp), %xmm1
movsd -32(%rsp), %xmm4
movaps %xmm0, -24(%rsp)
movsd -16(%rsp), %xmm0
movsd -24(%rsp), %xmm5

so: read from memory, write to memory and re-read piecewise, and do it a
second time just for the sake of it.

The corresponding internal representation at the end of the high-level
optimization phase is

MEM[(struct DenseStorage *)&dirA].m_data = MEM[(const struct DenseStorage &)sum_5(D)].m_data;
dirA_31 = MEM[(struct plain_array *)&dirA];
dirA$8_30 = MEM[(struct plain_array *)&dirA + 8B];
MEM[(struct DenseStorage *)&dirB].m_data = MEM[(const struct DenseStorage &)&dirA].m_data;
dirB_37 = MEM[(struct plain_array *)&dirB];
dirB$8_38 = MEM[(struct plain_array *)&dirB + 8B];

This involves some direct mem-to-mem assignments, which is something that
gcc handles super badly. If the copy was done piecewise, each element
would be an SSA variable and optimizations would work. Even if the copy was
done with memcpy there would be code to simplify it. But mem-to-mem...

I strongly encourage you to report this testcase to gcc's bugzilla.

(it doesn't mean that people can't work around it in eigen somehow, but
that will likely not be nice and not catch all cases)
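(To make "piecewise" concrete: a hypothetical user-side rewrite of the two copies along the lines below would give gcc plain scalar assignments instead of one aggregate mem-to-mem move; whether that actually rescues this particular loop I haven't checked.)

#include <Eigen/Dense>

using Vec = Eigen::Matrix<double, 2, 1>;

// Hypothetical sketch: copy coefficient by coefficient so that each element
// becomes an ordinary scalar assignment for gcc.
inline void piecewiseCopies(const Vec &sum, Vec &dirA, Vec &dirB)
{
    dirA[0] = sum[0];
    dirA[1] = sum[1];
    dirB[0] = dirA[0];
    dirB[1] = dirA[1];
}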

--
Marc Glisse
Gael Guennebaud
2018-08-15 21:17:27 UTC
Permalink
I still cannot reproduce; with -O2 -DNDEBUG I get the following assembly
with Eigen 3.3 and gcc 7:

L127:
movapd %xmm1, %xmm0
addl $1, %eax
mulpd %xmm1, %xmm0
cmpl %eax, %ebx
movapd %xmm0, %xmm2
unpckhpd %xmm0, %xmm2
addsd %xmm2, %xmm0
unpcklpd %xmm0, %xmm0
mulpd %xmm1, %xmm0
addpd %xmm0, %xmm1
jne L127

On my side, the only difference with 3.2 is the use of haddpd in 3.2, which
we disabled in 3.3 because it is pretty slow compared to
movapd+unpcklpd+addps.

Checking on compiler-explorer (https://godbolt.org/g/9bkjfu) it seems this
particular issue is only reproducible with gcc 8, so definitely a gcc
regression.

gael

Gael Guennebaud
2018-08-15 22:23:38 UTC
Permalink
ok, now regarding the difference between 3.2 and 3.3 with gcc 8, I found
that changing the copies to:

Vec dirA;
dirA = sum;
Vec dirB;
dirB = dirA;

works around the issue; see: https://godbolt.org/g/oJRHdd
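Applied to the reduced loop, that is (sketch; note the const on dirA/dirB has to go):

using Vec = Eigen::Matrix<double, 2, 1>;
Vec sum = Vec::Zero();
for (int i = 0; i < num; ++i)
{
    Vec dirA;       // default-construct + assign instead of copy-construct
    dirA = sum;
    Vec dirB;
    dirB = dirA;

    sum += dirA.dot(dirB) * dirA;
}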

The respective change-set is:
https://bitbucket.org/eigen/eigen/commits/45ceeab6e89287

gael

D***@dlr.de
2018-08-21 06:51:17 UTC
Permalink
Hi Gael,

unfortunately the example isn't *that* representative of the actual code-base I'm working with, so I cannot quite apply this transformation. Either way, I'll keep testing with gcc-7 (where I've also measured performance regressions between 3.2 and 3.3), and try to track down some examples.

I'm not quite sure how to report the gcc-8 problem to the gcc-maintainers. I've tried running creduce over the code, but the output (attached) seems slightly different (e.g. the copy in operator+=, which looks like the aliasing assignment was chosen).



Best regards

Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de
Marc Glisse
2018-08-21 07:21:31 UTC
Permalink
On Tue, 21 Aug 2018, ***@dlr.de wrote:

> I'm not quite sure how to report the gcc-8 problem to the
> gcc-maintainers. I've tried running creduce over the code, but the
> output (attached) seems slightly different (e.g. the copy in operator+=,
> which looks like the aliasing assignment was chosen).

I've reported https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87008
(the source may be large and unreduced, but internally after the first few
rounds of optimization it is rather clear what is going on so it should be
good enough for gcc devs)

There were other regressions in gcc-8 related to the same optimization
pass (SRA): 85459, 85762, etc. IIRC the author said he was trying to
rewrite that pass for gcc-9, but I could be wrong.

It might not be the only issue with gcc-8 though.

--
Marc Glisse
D***@dlr.de
2018-08-22 13:25:50 UTC
Permalink
Hello again,

I've tried extracting some pieces of the CFD code we're working on into independent benchmarks again to help nail down these regressions.

This time I've mainly used gcc-7.3.0 due to the problems discovered with gcc-8, running on a 2.9 GHz Intel Core i9 (15" MacBook Pro), and I've been using the Google Benchmark library (https://github.com/google/benchmark) to get more reliable measurements. The code was compiled with


g++-7 -march=native -O3 -DNDEBUG -Wno-deprecated-declarations -ffast-math -fno-finite-math-only eigen_bench4.cpp -I /usr/local/Cellar/google-benchmark/1.4.1 -L /usr/local/Cellar/google-benchmark/1.4.1/lib -l benchmark -l benchmark_main -I eigen

So this uses AVX and FMA. Note that I'm not actually interested in the sums of the result vectors that appear inside the benchmark::DoNotOptimize(...) calls; I just wanted something that depends on all values, to prevent the complete expression from being optimized away.
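
For reference, the benchmarks follow the usual Google Benchmark pattern; the skeleton below is only a sketch with made-up names and a made-up computation (it is not the attached eigen_bench4.cpp):

#include <benchmark/benchmark.h>
#include <Eigen/Core>
#include <cstdlib>

using Vec5 = Eigen::Matrix<double, 5, 1>;

static Vec5 randomInput()
{
  // Fresh pseudo-random input per iteration; the rand() calls dominate the
  // measured time, but that overhead is identical for 3.2 and 3.3.
  Vec5 v;
  for (int i = 0; i < v.size(); ++i)
    v[i] = std::rand() / double(RAND_MAX);
  return v;
}

static void BM_Example(benchmark::State& state)
{
  for (auto _ : state)
  {
    Vec5 v = randomInput();
    Vec5 w = 2.0 * v + v.cwiseProduct(v);
    // Depend on every result value so the whole expression cannot be dropped.
    benchmark::DoNotOptimize(w.sum());
  }
}
BENCHMARK(BM_Example);
// main() is provided by -lbenchmark_main in the compile command above.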

Using Eigen-3.2.9

***@feuerlilie ~/D/d/d/eigen-perf> ./bench_32 --benchmark_repetitions=10 --benchmark_report_aggregates_only=true
2018-08-22 15:08:04
Running ./bench_32
Run on (12 X 2900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 262K (x6)
  L3 Unified 12582K (x1)
----------------------------------------------------------------
Benchmark                        Time           CPU Iterations
----------------------------------------------------------------
BM_Augment_mean                 22 ns         22 ns   31277228
BM_Augment_median               22 ns         22 ns   31277228
BM_Augment_stddev                0 ns          0 ns   31277228
BM_ConvectionFlux_mean          34 ns         34 ns   20305570
BM_ConvectionFlux_median        34 ns         34 ns   20305570
BM_ConvectionFlux_stddev         0 ns          0 ns   20305570
BM_EigenMul_mean                23 ns         23 ns   30841624
BM_EigenMul_median              23 ns         23 ns   30841624
BM_EigenMul_stddev               0 ns          0 ns   30841624

Using Eigen-3.3.5

***@feuerlilie ~/D/d/d/eigen-perf> ./bench_33 --benchmark_repetitions=10 --benchmark_report_aggregates_only=true
2018-08-22 15:08:31
Running ./bench_33
Run on (12 X 2900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 262K (x6)
  L3 Unified 12582K (x1)
----------------------------------------------------------------
Benchmark                        Time           CPU Iterations
----------------------------------------------------------------
BM_Augment_mean                 26 ns         26 ns   26066589
BM_Augment_median               26 ns         26 ns   26066589
BM_Augment_stddev                0 ns          0 ns   26066589
BM_ConvectionFlux_mean          41 ns         41 ns   17082879
BM_ConvectionFlux_median        41 ns         41 ns   17082879
BM_ConvectionFlux_stddev         1 ns          1 ns   17082879
BM_EigenMul_mean                24 ns         24 ns   29441577
BM_EigenMul_median              24 ns         24 ns   29441577
BM_EigenMul_stddev               0 ns          0 ns   29441577

First off, most of the run-time comes from calling rand() a lot in each iteration, but that overhead should be the same for both versions. As an aside, if you compile the code with the FIXED_INPUT #define, you can see that the compiler has an "easier" time "seeing through" the older Eigen-3.2 and is able to hoist the computation out of the benchmark loop in more cases. This might also be part of the actual cause, because not all of the extra outputs of Augment() are always used, for example.
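
Extending the sketch above, FIXED_INPUT amounts to something like the following (again only illustrative; the attached file may differ):

static Vec5 makeInput()
{
#ifdef FIXED_INPUT
  // Constant input: the whole computation becomes loop-invariant, so a
  // compiler that can see through the Eigen expressions may hoist it out of
  // the benchmark loop entirely (which gcc manages more often with Eigen-3.2).
  return Vec5::Constant(0.5);
#else
  // Per-iteration rand()-based input, as in the sketch above.
  return randomInput();
#endif
}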

Generally, the run-time seems increased with Eigen-3.3 (unless I'm measuring wrongly once again), except for EigenMul. But I have difficulty pinning this down to any particular change in Eigen.

clang (Apple LLVM version 9.1.0 (clang-902.0.39.2)) seems less problematic, and using it I don't see any performance regressions (other than compile time ;)). Unfortunately, most people will probably be using gcc with our code.

If you have any ideas on causes or possible improvements (either through changing the code or through modifications to Eigen), please let me know. Eigen-3.3 has quite a few fixes we'd like to pick up (mainly to AutoDiffScalar, but also others), but at the moment it's a bit of a hard sell.


Best regards

Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de
________________________________
Von: Gael Guennebaud [***@gmail.com]
Gesendet: Donnerstag, 16. August 2018 0:23
An: eigen
Betreff: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)

ok, now regarding the difference between 3.2 and 3.3 with gcc 8, I found that changing the copies to:

Vec dirA;
dirA = sum;
Vec dirB;
dirB = dirA;

works around the issue, see: https://godbolt.org/g/oJRHdd

The respective change-set is: https://bitbucket.org/eigen/eigen/commits/45ceeab6e89287
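
Spelled out, the two patterns differ only in how the temporaries are created; a minimal side-by-side sketch (Vec being the small fixed-size vector from the reduced loop quoted further down):

#include <Eigen/Core>

using Vec = Eigen::Matrix<double, 2, 1>;

// Pattern from the reduced benchmark: copy-construct the temporaries.
// With Eigen 3.3 and gcc 8 this is the variant that spills to the stack.
double viaCopyConstruction(const Vec& sum)
{
  const Vec dirA = sum;
  const Vec dirB = dirA;
  return dirA.dot(dirB);
}

// Workaround from above: default-construct, then assign.
double viaAssignment(const Vec& sum)
{
  Vec dirA;
  dirA = sum;
  Vec dirB;
  dirB = dirA;
  return dirA.dot(dirB);
}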

gael

On Wed, Aug 15, 2018 at 11:17 PM Gael Guennebaud <***@gmail.com> wrote:

I still cannot reproduce; with -O2 -DNDEBUG I get the following assembly with Eigen 3.3 and gcc 7:

L127:
movapd %xmm1, %xmm0
addl $1, %eax
mulpd %xmm1, %xmm0
cmpl %eax, %ebx
movapd %xmm0, %xmm2
unpckhpd %xmm0, %xmm2
addsd %xmm2, %xmm0
unpcklpd %xmm0, %xmm0
mulpd %xmm1, %xmm0
addpd %xmm0, %xmm1
jne L127

On my side, the only difference with 3.2 is the use of haddpd in 3.2, which we disabled in 3.3 because it is pretty slow compared to movapd+unpcklpd+addps.

Checking on compiler-explorer (https://godbolt.org/g/9bkjfu) it seems this particular issue is only reproducible with gcc 8, so definitely a gcc regression.

gael

On Mon, Aug 6, 2018 at 6:37 PM <***@dlr.de> wrote:
Hello again,

I've been trying to understand a bit better what is happening with the performance regression I'm seeing, and at the moment I am under the impression that Eigen-3.3 makes it harder (impossible?) for gcc to recognize when no aliasing is happening.

I've further reduced my original example to essentially the following loop (see eigen_bench3.cpp for a self-contained version).
using Vec = Eigen::Matrix<double, 2, 1>;
Vec sum = Vec::Zero();
for (int i = 0; i < num; ++i)
{
  const Vec dirA = sum;
  const Vec dirB = dirA;

  sum += dirA.dot(dirB) * dirA;
}
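
For readers without the attachment, a self-contained harness along these lines shows the same pattern (a sketch only, not the actual eigen_bench3.cpp; the trip-count handling is made up):

#include <Eigen/Core>
#include <cstdio>

int main(int argc, char**)
{
  using Vec = Eigen::Matrix<double, 2, 1>;
  // Derive the trip count from argc so the compiler cannot fold the loop.
  const int num = 100000000 + argc;

  Vec sum = Vec::Zero();
  EIGEN_ASM_COMMENT("begin loop");
  for (int i = 0; i < num; ++i)
  {
    const Vec dirA = sum;
    const Vec dirB = dirA;
    sum += dirA.dot(dirB) * dirA;
  }
  EIGEN_ASM_COMMENT("end loop");

  // Print the result so the loop has an observable effect.
  std::printf("%g %g\n", sum[0], sum[1]);
  return 0;
}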

When using Eigen-3.3, gcc-8 always spills back to memory when evaluating the expression, whereas when using Eigen-3.2 the loop runs completely without spilling. As there should be enough registers available, I assume this is due to "fear of aliasing".

This is independent of vectorization and all the other nice things; it happens with vectorization disabled as well.

Interestingly (?), when I change the size of Vec (from 2) to 1, then the generated code is identical for 3.3 and 3.2.

The following is the x86-64 assembly for the relevant loop (using only -O1 and disabled vectorization for clarity; the problem remains at higher optimization levels):

Eigen-3.3:
# eigen_bench3.cpp:18: EIGEN_ASM_COMMENT("begin loop");
test ebx, ebx # iftmp.0_3
jle .L131 #,
# eigen_bench3.cpp:19: for (int i = 0; i < num; ++i)
mov edx, 0 # i,
.L132:
# eigen-3.3.5/Eigen/src/Core/DenseStorage.h:194: DenseStorage(const DenseStorage& other) : m_data(other.m_data) {
mov rsi, QWORD PTR [rsp+32] # MEM[(const struct DenseStorage &)&sum].m_data, MEM[(const struct DenseStorage &)&sum].m_data
mov rdi, QWORD PTR [rsp+40] # MEM[(const struct DenseStorage &)&sum].m_data, MEM[(const struct DenseStorage &)&sum].m_data
mov QWORD PTR [rsp], rsi # MEM[(struct DenseStorage *)&dirA].m_data, MEM[(const struct DenseStorage &)&sum].m_data
mov QWORD PTR [rsp+8], rdi # MEM[(struct DenseStorage *)&dirA].m_data, MEM[(const struct DenseStorage &)&sum].m_data
mov QWORD PTR [rsp+16], rsi # MEM[(struct DenseStorage *)&dirB].m_data, MEM[(const struct DenseStorage &)&sum].m_data
mov QWORD PTR [rsp+24], rdi # MEM[(struct DenseStorage *)&dirB].m_data, MEM[(const struct DenseStorage &)&sum].m_data
# eigen-3.3.5/Eigen/src/Core/GenericPacketMath.h:171: const Packet& b) { return a*b; }
vmovsd xmm0, QWORD PTR [rsp+8] # _26, MEM[(const double &)&dirA + 8]
vmovsd xmm1, QWORD PTR [rsp] # _28, MEM[(const double &)&dirA]
# eigen-3.3.5/Eigen/src/Core/GenericPacketMath.h:171: const Packet& b) { return a*b; }
vmulsd xmm2, xmm0, QWORD PTR [rsp+24] # tmp123, _26, MEM[(struct plain_array *)&dirB + 8B]
vmulsd xmm3, xmm1, QWORD PTR [rsp+16] # tmp124, _28, MEM[(struct plain_array *)&dirB]
# eigen-3.3.5/Eigen/src/Core/functors/BinaryFunctors.h:42: EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const result_type operator() (const LhsScalar& a, const RhsScalar& b) const { return a + b; }
vaddsd xmm2, xmm2, xmm3 # _30, tmp123, tmp124
# eigen-3.3.5/Eigen/src/Core/functors/BinaryFunctors.h:86: EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const result_type operator() (const LhsScalar& a, const RhsScalar& b) const { return a * b; }
vmulsd xmm1, xmm1, xmm2 # tmp125, _28, _30
# eigen-3.3.5/Eigen/src/Core/functors/AssignmentFunctors.h:49: EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void assignCoeff(DstScalar& a, const SrcScalar& b) const { a += b; }
vaddsd xmm1, xmm1, QWORD PTR [rsp+32] # tmp126, tmp125, MEM[(double &)&sum]
# eigen-3.3.5/Eigen/src/Core/functors/AssignmentFunctors.h:49: EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void assignCoeff(DstScalar& a, const SrcScalar& b) const { a += b; }
vmovsd QWORD PTR [rsp+32], xmm1 # MEM[(double &)&sum], tmp126
# eigen-3.3.5/Eigen/src/Core/functors/BinaryFunctors.h:86: EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const result_type operator() (const LhsScalar& a, const RhsScalar& b) const { return a * b; }
vmulsd xmm0, xmm0, xmm2 # tmp128, _26, _30
# eigen-3.3.5/Eigen/src/Core/functors/AssignmentFunctors.h:49: EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void assignCoeff(DstScalar& a, const SrcScalar& b) const { a += b; }
vaddsd xmm0, xmm0, QWORD PTR [rsp+40] # tmp129, tmp128, MEM[(double &)&sum + 8]
# eigen-3.3.5/Eigen/src/Core/functors/AssignmentFunctors.h:49: EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void assignCoeff(DstScalar& a, const SrcScalar& b) const { a += b; }
vmovsd QWORD PTR [rsp+40], xmm0 # MEM[(double &)&sum + 8], tmp129
# eigen_bench3.cpp:19: for (int i = 0; i < num; ++i)
inc edx # i
# eigen_bench3.cpp:19: for (int i = 0; i < num; ++i)
cmp ebx, edx # iftmp.0_3, i
jne .L132 #,
.L131:
# eigen_bench3.cpp:28: EIGEN_ASM_COMMENT("end loop");




And the same for Eigen-3.2

# eigen_bench3.cpp:18: EIGEN_ASM_COMMENT("begin loop");
test ebx, ebx # iftmp.0_3
jle .L131 #,
vmovsd xmm2, QWORD PTR [rsp+16] # sum__lsm.104, MEM[(const Scalar &)&sum]
vmovsd xmm1, QWORD PTR [rsp+24] # sum__lsm.105, MEM[(const Scalar &)&sum + 8]
# eigen_bench3.cpp:19: for (int i = 0; i < num; ++i)
mov edx, 0 # i,
vmovsd xmm4, QWORD PTR .LC6[rip] # tmp117,
.L132:
# eigen-3.2.9-mod/Eigen/src/Core/GenericPacketMath.h:114: const Packet& b) { return a*b; }
vmulsd xmm0, xmm1, xmm1 # tmp114, sum__lsm.105, sum__lsm.105
vmulsd xmm3, xmm2, xmm2 # tmp115, sum__lsm.104, sum__lsm.104
# eigen-3.2.9-mod/Eigen/src/Core/Functors.h:26: EIGEN_STRONG_INLINE const Scalar operator() (const Scalar& a, const Scalar& b) const { return a + b; }
vaddsd xmm0, xmm0, xmm3 # tmp116, tmp114, tmp115
vaddsd xmm0, xmm0, xmm4 # _52, tmp116, tmp117
vmulsd xmm2, xmm2, xmm0 # sum__lsm.104, sum__lsm.104, _52
vmulsd xmm1, xmm1, xmm0 # sum__lsm.105, sum__lsm.105, _52
# eigen_bench3.cpp:19: for (int i = 0; i < num; ++i)
inc edx # i
# eigen_bench3.cpp:19: for (int i = 0; i < num; ++i)
cmp ebx, edx # iftmp.0_3, i
jne .L132 #,
vmovsd QWORD PTR [rsp+16], xmm2 # MEM[(const Scalar &)&sum], sum__lsm.104
vmovsd QWORD PTR [rsp+24], xmm1 # MEM[(const Scalar &)&sum + 8], sum__lsm.105
.L131:
# eigen_bench3.cpp:28: EIGEN_ASM_COMMENT("end loop");


Best regards

Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de
________________________________________
Von: Gael Guennebaud [***@gmail.com]
Gesendet: Donnerstag, 2. August 2018 0:10
An: eigen
Betreff: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)

Hi,

I tried your little benchmark and with gcc 7, I got no difference at all (-O3 -DNDEBUG):

// 3.2: 7.75s
// 3.3: 7.68s

// 3.2 -march=native: 7.46s
// 3.3 -march=native: 7.48s

I ran each test 4 times and kept the best of each; the variations were about 0.2s.

gael


On Wed, Aug 1, 2018 at 1:28 PM <***@dlr.de> wrote:
Hi,

extracting the relevant code is rather difficult, unfortunately.

Attached you will find an extracted subset, which may point to one problematic area with partial vectorization as for this example Eigen 3.3 is ~13% slower (unless I'm measuring incorrectly).

If I turn off partial vectorization, the run-time is the same. This is not completely representative (obviously), because in my "complete" runs disabling partial vectorization only decreases the run-time difference between 3.3 and 3.2, but does not eliminate it.


Best regards

Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de

________________________________________
Von: Marc Glisse [***@inria.fr]
Gesendet: Mittwoch, 1. August 2018 12:06
An: ***@lists.tuxfamily.org
Betreff: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)

On Wed, 1 Aug 2018, ***@dlr.de wrote:

> I've attached a document with some performance measurements for
> different compilers, different Eigen versions, and 3 different
> test-cases for our code (tau, cgns, dg) that stress different areas /
> sizes.

Would it be hard to create small testcases that you could share and that
would still show a similar performance pattern? I assume that the easier
it is for Eigen dev to reproduce, the more likely they are to investigate.

--
Marc Glisse
D***@dlr.de
2018-08-01 12:14:23 UTC
Permalink
Hi Christoph,

(I cc'd the mailing list again.)

The compilation units are rather big, so directly comparing the resulting code is difficult.

I've run the test-cases for gcc-8.1 and clang-3.8 with -msse4.2 -mtune=native to disable AVX.

This improves the situation for gcc (except for the "tau" test-cases, where it's only "close") and results in the same performance as Eigen-3.2. Enabling or disabling partial vectorization no longer seems to make a difference for that setting.

For clang, disabling AVX is a slight win for "tau" vs. the default settings, but a slight loss for "cgns" (where the matrix-vector products and AD play a bigger role, see areas 2 & 3).


Best regards

Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de

________________________________________
Von: Christoph Hertzberg [***@informatik.uni-bremen.de]
Gesendet: Mittwoch, 1. August 2018 12:34
An: Vollmer, Daniel
Betreff: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)

Hi,

could you also try compiling with `-DEIGEN_UNALIGNED_VECTORIZE=0` and
with AVX disabled, e.g., using `-msse4.2 -mtune=native` -- alternatively
also by commenting out the corresponding detection inside Eigen/Core
(this would actually be nice, if it was controllable by command-line
options).
And of course, any combinations of these options would be interesting,
if they make a difference.
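
Concretely, the four combinations would be along the lines of (command lines only illustrative; file names and include paths are placeholders):

g++ -O3 -DNDEBUG -march=native bench.cpp -I eigen                                         # AVX + unaligned vectorization (3.3 default)
g++ -O3 -DNDEBUG -march=native -DEIGEN_UNALIGNED_VECTORIZE=0 bench.cpp -I eigen           # AVX, no unaligned vectorization
g++ -O3 -DNDEBUG -msse4.2 -mtune=native bench.cpp -I eigen                                # no AVX
g++ -O3 -DNDEBUG -msse4.2 -mtune=native -DEIGEN_UNALIGNED_VECTORIZE=0 bench.cpp -I eigen  # neither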

If you have sufficiently small compilation units, it might also be worth
having a look at the difference between the generated assembler code --
but that is usually more productive if you had singled out a reduced
test-case.


Cheers,
Christoph





On 2018-08-01 11:10, ***@dlr.de wrote:
> Hello everyone,
>
> with the recent release of 3.3.5 I've once again looked at upgrading from our currently used Eigen 3.2 to the current stable branch, but some performance regressions remain, which make this a difficult decision, as I'm unable to nail down the exact cause (probably because it's not a single one) and would prefer to not slow down the overall performance.
>
> I've attached a document with some performance measurements for different compilers, different Eigen versions, and 3 different test-cases for our code (tau, cgns, dg) that stress different areas / sizes.
> The "vs best" column compares run-time against the overall best run-time, "vs same" only relative to shortest run-time with the same compiler (so essentially between different Eigen variants with the same compiler).
> Eigen 3.2 version used was 3.2.9 + some backports of improvements to AutoDiffScalar
> Eigen 3.3 version used was 3.3.5.
> The tests were run on a Xeon E3-1276 v3 (with our code doing multi-threading, and Eigen configured to not use threading of its own). Minimum run-time of 4 runs.
>
> We use Eigen in a CFD code for 3 roughly distinct subject areas:
> 1) fixed-size vectors (and some matrices) of doubles, direct access to individual values (with compile-time known indices) or segments, simple linear algebra, few matrix-vector products.
> 2) same as 1, but using Eigen::AutoDiffScalar instead of double (building up a Jacobian)
> 3) Fixed-size matrix-vector products (inside of a Block-Jacobi iteration, not using any of Eigen's solvers)
>
> For the different cases:
> tau: Only uses 1), with vectors of sizes 5 and 8, matrices of size 5x5
> cgns: Uses 1)-3), with vectors of sizes 6 and 13, matrices of size 6x6 (for both 1 and 3).
> dg: Uses 1)-3), with vectors of sizes 5 and 8, matrices of size 5x5 (for 1) and 20x20 (for 3).
>
> The outcomes seem to be
> - clang is generally fastest
> - the performance regression is more pronounced for gcc
> - (partial) vectorization seems to "hurt" simple direct access (area 1), disabling it improves performance (clang) or at least reduces the impact of Eigen 3.3 (gcc)
>
> If we were only looking at clang, I'd be nearly willing to advocate moving to 3.3 (with default settings), because only a regression for the "tau" case remains.
>
> Unfortunately, I'm at a loss at how to pin-point these any more, and attempts at extracting a reduced test-case / example that exhibits the same behavior have not been fruitful, and some profiling of the actual code between Eigen 3.2 and 3.3 does not seem to directly yield actionable information.
>
> If anyone has any ideas for things to try, I'm all ears. :)
>
> Either way, thanks for your helpful (and nice to use) library!
>
>
> Best regards
>
> Daniel Vollmer
>
> --------------------------
> Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
> German Aerospace Center
> Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany
>
> Daniel Vollmer | AS C²A²S²E
> www.DLR.de
>
> ________________________________________
> Von: Vollmer, Daniel
> Gesendet: Donnerstag, 28. Juli 2016 12:46
> An: ***@lists.tuxfamily.org
> Betreff: RE: [eigen] 3.3-beta2 released!
>
> Hi Gael,
>
>> Fixed: https://bitbucket.org/eigen/eigen/commits/e35a38ad89fe/
>> With float I get a nearly x2 speedup for the above 5x5 matrix-vector
>> products (compared to 3.2), and x1.4 speedup with double.
>
> I tried out this version (ca9bd08) and the results are as follows:
> Note: the explicit solver pretty much only does residual evaluations,
> whereas the implicit solver does a residual evaluation, followed by a
> Jacobian computation (using AutoDiffScalar) and then a block-based
> Gauss-Jacobi iteration where the blocks are 5x5 matrices to
> approximately solve a linear system based on the Jacobian and the
> residual.
>
> Explicit solver:
> ----------------
> eigen-3.3-ca9bd08 10.9s => 09% slower
> eigen-3.3-beta2 11.1s => 11% slower
> eigen-3.3-beta2 UNALIGNED_VEC=0 10.0s => 00% slower
> eigen-3.2.9 10.0s => baseline
>
> Implicit solver:
> ----------------
> eigen-3.3-ca9bd08 34.2s => 06% faster
> eigen-3.3-beta2 37.5s => 03% slower
> eigen-3.3-beta2 UNALIGNED_VEC=0 38.2s => 05% slower
> eigen-3.2.9 36.5s => baseline
>
> So the change definitely helps for the implicit solver (which has lots
> of 5x5 by 5x1 double multiplies), but for the explicit solver the
> overhead of unaligned vectorization doesn't pay off. Maybe the use of
> 3D vectors (which used for geometric normals and coordinates) is
> problematic because it's such a borderline case for vectorization?
>
> What I don't quite understand is the difference between 3.2.9 (which
> doesn't vectorize the given matrix sizes) and 3.3-beta2 without
> vectorization: Something in 3.3 is slower under those conditions, but
> maybe it's not the matrix-vector multiplies, as it could also be
> AutoDiffScalar being slower.
>
>
> Best regards
>
> Daniel Vollmer
>
> --------------------------
> Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
> German Aerospace Center
> Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany
>
> Daniel Vollmer | AS C²A²S²E
> www.DLR.de
>


--
Dr.-Ing. Christoph Hertzberg

Besuchsadresse der Nebengeschäftsstelle:
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 5
28359 Bremen, Germany

Postadresse der Hauptgeschäftsstelle Standort Bremen:
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 1
28359 Bremen, Germany

Tel.: +49 421 178 45-4021
Zentrale: +49 421 178 45-0
E-Mail: ***@dfki.de

Weitere Informationen: http://www.dfki.de/robotik
-----------------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Straße 122, D-67663 Kaiserslautern
Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
(Vorsitzender) Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
Sitz der Gesellschaft: Kaiserslautern (HRB 2313)
USt-Id.Nr.: DE 148646973
Steuernummer: 19/672/50006
-----------------------------------------------------------------------
Christoph Hertzberg
2018-08-01 13:04:33 UTC
Permalink
Hi,

just to clear my confusion, by "partial vectorization" you mean
"unaligned vectorization"?

I guess Eigen's AVX-usage is a likely issue in some situations then --
but it is really hard to fix without anything concrete.
Could you (manually) disable the AVX-detection in Eigen/Core, but
compile with AVX enabled?
And does enabling/disabling AVX with Eigen3.2 make a difference? (That
version had no AVX support, but there may be issues with switching
between AVX and non-AVX instructions).

You could also try to make a diff between the assembly generated by gcc
and clang. This may involve cleaning up the assembly "somehow", or
actually disassembling the binary. Alternatively, just manually compare
some likely candidates -- you can mark them using
EIGEN_ASM_COMMENT("some label which is easy to find");
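
For example, bracketing a suspect kernel like this (a sketch; the function is just a stand-in for one of your kernels):

#include <Eigen/Core>

using Mat5 = Eigen::Matrix<double, 5, 5>;
using Vec5 = Eigen::Matrix<double, 5, 1>;

Vec5 applyBlock(const Mat5& A, const Vec5& x)
{
  EIGEN_ASM_COMMENT("applyBlock: begin 5x5 matrix-vector product");
  Vec5 y = A * x;
  EIGEN_ASM_COMMENT("applyBlock: end 5x5 matrix-vector product");
  return y;
}

Compiling that translation unit with -S and searching the generated assembly for the comment text then jumps straight to the interesting instructions.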

Christoph



On 2018-08-01 14:14, ***@dlr.de wrote:
> Hi Christoph,
>
> (I cc'd the mailing list again.)
>
> The compilation units are rather big, so directly comparing the resulting code is difficult.
>
> I've run the test-cases for gcc-8.1 and clang-3.8 with -msse4.2 -mtune=native to disable AVX.
>
> This improves the situation for gcc, (except for the "tau" test-cases where it's only "close") and results in the same performance as Eigen-3.2. Disabling partial vec or enabling it doesn't seem to make a difference for that setting anymore.
>
> For clang disabling AVX is a slight win for "tau" vs. default settings, but a slight loss for cgns (where the matrix-vector product and AD plays a bigger role, see area 2&3).
>
>
> Best regards
>
> Daniel Vollmer
>
> --------------------------
> Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
> German Aerospace Center
> Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany
>
> Daniel Vollmer | AS C²A²S²E
> www.DLR.de
>
> ________________________________________
> Von: Christoph Hertzberg [***@informatik.uni-bremen.de]
> Gesendet: Mittwoch, 1. August 2018 12:34
> An: Vollmer, Daniel
> Betreff: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)
>
> Hi,
>
> could you also try compiling with `-DEIGEN_UNALIGNED_VECTORIZE=0` and
> with AVX disabled, e.g., using `-msse4.2 -mtune=native` -- alternatively
> also by commenting out the corresponding detection inside Eigen/Core
> (this would actually be nice, if it was controllable by command-line
> options).
> And of course, any combinations of these options would be interesting,
> if they make a difference.
>
> If you have sufficiently small compilation units, it might also be worth
> having a look at the difference between the generated assembler code --
> but that is usually more productive if you had singled out a reduced
> test-case.
>
>
> Cheers,
> Christoph
>
>
>
>
>
> On 2018-08-01 11:10, ***@dlr.de wrote:
>> Hello everyone,
>>
>> with the recent release of 3.3.5 I've once again looked at upgrading from our currently used Eigen 3.2 to the current stable branch, but some performance regressions remain, which make this a difficult decision, as I'm unable to nail down the exact cause (probably because it's not a single one) and would prefer to not slow down the overall performance.
>>
>> I've attached a document with some performance measurements for different compilers, different Eigen versions, and 3 different test-cases for our code (tau, cgns, dg) that stress different areas / sizes.
>> The "vs best" column compares run-time against the overall best run-time, "vs same" only relative to shortest run-time with the same compiler (so essentially between different Eigen variants with the same compiler).
>> Eigen 3.2 version used was 3.2.9 + some backports of improvements to AutoDiffScalar
>> Eigen 3.3 version used was 3.3.5.
>> The tests were run on a Xeon E3-1276 v3 (with our code doing multi-threading, and Eigen configured to not use threading of its own). Minimum run-time of 4 runs.
>>
>> We use Eigen in a CFD code for 3 roughly distinct subject areas:
>> 1) fixed-size vectors (and some matrices) of doubles, direct access to individual values (with compile-time known indices) or segments, simple linear algebra, few matrix-vector products.
>> 2) same as 1, but using Eigen::AutoDiffScalar instead of double (building up a Jacobian)
>> 3) Fixed-size matrix-vector products (inside of a Block-Jacobi iteration, not using any of Eigen's solvers)
>>
>> For the different cases:
>> tau: Only uses 1), with vectors of sizes 5 and 8, matrices of size 5x5
>> cgns: Uses 1)-3), with vectors of sizes 6 and 13, matrices of size 6x6 (for both 1 and 3).
>> dg: Uses 1)-3), with vectors of sizes 5 and 8, matrices of size 5x5 (for 1) and 20x20 (for 3).
>>
>> The outcomes seem to be
>> - clang is generally fastest
>> - the performance regression is more pronounced for gcc
>> - (partial) vectorization seems to "hurt" simple direct access (area 1), disabling it improves performance (clang) or at least reduces the impact of Eigen 3.3 (gcc)
>>
>> If we were only looking at clang, I'd be nearly willing to advocate moving to 3.3 (with default settings), because only a regression for the "tau" case remains.
>>
>> Unfortunately, I'm at a loss at how to pin-point these any more, and attempts at extracting a reduced test-case / example that exhibits the same behavior have not been fruitful, and some profiling of the actual code between Eigen 3.2 and 3.3 does not seem to directly yield actionable information.
>>
>> If anyone has any ideas for things to try, I'm all ears. :)
>>
>> Either way, thanks for your helpful (and nice to use) library!
>>
>>
>> Best regards
>>
>> Daniel Vollmer
>>
>> --------------------------
>> Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
>> German Aerospace Center
>> Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany
>>
>> Daniel Vollmer | AS C²A²S²E
>> www.DLR.de
>>
>> ________________________________________
>> Von: Vollmer, Daniel
>> Gesendet: Donnerstag, 28. Juli 2016 12:46
>> An: ***@lists.tuxfamily.org
>> Betreff: RE: [eigen] 3.3-beta2 released!
>>
>> Hi Gael,
>>
>>> Fixed: https://bitbucket.org/eigen/eigen/commits/e35a38ad89fe/
>>> With float I get a nearly x2 speedup for the above 5x5 matrix-vector
>>> products (compared to 3.2), and x1.4 speedup with double.
>>
>> I tried out this version (ca9bd08) and the results are as follows:
>> Note: the explicit solver pretty much only does residual evaluations,
>> whereas the implicit solver does a residual evaluation, followed by a
>> Jacobian computation (using AutoDiffScalar) and then a block-based
>> Gauss-Jacobi iteration where the blocks are 5x5 matrices to
>> approximately solve a linear system based on the Jacobian and the
>> residual.
>>
>> Explicit solver:
>> ----------------
>> eigen-3.3-ca9bd08 10.9s => 09% slower
>> eigen-3.3-beta2 11.1s => 11% slower
>> eigen-3.3-beta2 UNALIGNED_VEC=0 10.0s => 00% slower
>> eigen-3.2.9 10.0s => baseline
>>
>> Implicit solver:
>> ----------------
>> eigen-3.3-ca9bd08 34.2s => 06% faster
>> eigen-3.3-beta2 37.5s => 03% slower
>> eigen-3.3-beta2 UNALIGNED_VEC=0 38.2s => 05% slower
>> eigen-3.2.9 36.5s => baseline
>>
>> So the change definitely helps for the implicit solver (which has lots
>> of 5x5 by 5x1 double multiplies), but for the explicit solver the
>> overhead of unaligned vectorization doesn't pay off. Maybe the use of
>> 3D vectors (which used for geometric normals and coordinates) is
>> problematic because it's such a borderline case for vectorization?
>>
>> What I don't quite understand is the difference between 3.2.9 (which
>> doesn't vectorize the given matrix sizes) and 3.3-beta2 without
>> vectorization: Something in 3.3 is slower under those conditions, but
>> maybe it's not the matrix-vector multiplies, as it could also be
>> AutoDiffScalar being slower.
>>
>>
>> Best regards
>>
>> Daniel Vollmer
>>
>> --------------------------
>> Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
>> German Aerospace Center
>> Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany
>>
>> Daniel Vollmer | AS C²A²S²E
>> www.DLR.de
>>
>
>
> --
> Dr.-Ing. Christoph Hertzberg
>
> Besuchsadresse der Nebengeschäftsstelle:
> DFKI GmbH
> Robotics Innovation Center
> Robert-Hooke-Straße 5
> 28359 Bremen, Germany
>
> Postadresse der Hauptgeschäftsstelle Standort Bremen:
> DFKI GmbH
> Robotics Innovation Center
> Robert-Hooke-Straße 1
> 28359 Bremen, Germany
>
> Tel.: +49 421 178 45-4021
> Zentrale: +49 421 178 45-0
> E-Mail: ***@dfki.de
>
> Weitere Informationen: http://www.dfki.de/robotik
> -----------------------------------------------------------------------
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Firmensitz: Trippstadter Straße 122, D-67663 Kaiserslautern
> Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
> (Vorsitzender) Dr. Walter Olthoff
> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
> Amtsgericht Kaiserslautern, HRB 2313
> Sitz der Gesellschaft: Kaiserslautern (HRB 2313)
> USt-Id.Nr.: DE 148646973
> Steuernummer: 19/672/50006
> -----------------------------------------------------------------------
>


--
Dr.-Ing. Christoph Hertzberg

Besuchsadresse der Nebengeschäftsstelle:
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 5
28359 Bremen, Germany

Postadresse der Hauptgeschäftsstelle Standort Bremen:
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 1
28359 Bremen, Germany

Tel.: +49 421 178 45-4021
Zentrale: +49 421 178 45-0
E-Mail: ***@dfki.de

Weitere Informationen: http://www.dfki.de/robotik
-----------------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Straße 122, D-67663 Kaiserslautern
Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
(Vorsitzender) Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
Sitz der Gesellschaft: Kaiserslautern (HRB 2313)
USt-Id.Nr.: DE 148646973
Steuernummer: 19/672/50006
-----------------------------------------------------------------------
D***@dlr.de
2018-08-01 16:57:10 UTC
Permalink
Hello Christoph,

unfortunately I have to revise the additional measurements I made for you earlier; it seems I was always using Eigen-3.2.9 during those measurements (which explains why they are nigh identical to the original measurements for Eigen-3.2). Sincere apologies. "Wer misst, misst Mist." (Roughly: he who measures, measures rubbish.)

Attached is a table with those runs remeasured. The original table was OK, except for clang's "no unaligned vec" rows (where unaligned vectorization was actually still enabled, because I had miscopied the name of the define).

Now it seems that the performance difference I'm seeing is fairly independent of AVX usage (on both the compiler and the Eigen side). Enabling AVX seems strictly better than not using it for Eigen 3.3 (no difference for the "tau" case, an improvement for "cgns" & "dg"). Still, there is a difference between Eigen 3.2 and 3.3; using AVX recoups some of that difference, but (for our peculiar usage) Eigen-3.3 is still a net loss, particularly when using gcc.

> just to clear my confusion, by "partial vectorization" you mean
> "unaligned vectorization"?

Yes, that's what I meant. Sorry for the confusion. I've renamed the entries in the table accordingly.


I don't know whether you noticed it, but I did extract a partial example that seems to reproduce part of the phenomenon (at least for some of the behavior of the "tau" case) and attached it to an earlier email from today (replying to Marc Glisse), called eigen_bench2.cpp.


> Could you (manually) disable the AVX-detection in Eigen/Core, but
> compile with AVX enabled?

For this I removed EIGEN_VECTORIZE_AVX from the #ifdef __AVX__ block in Eigen/Core, and I also had to change SSE/PacketMath.h, as that checked for __FMA__ and subsequently used intrinsics whose header wasn't included.

This didn't make a (measurable) difference compared to just -msse4.2 -mtune=native.
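
As a sanity check that such manual edits have the intended effect, a small program that prints Eigen's own report of the instruction sets in use can help (a sketch, compiled with the same flags as the real code):

#include <Eigen/Core>
#include <iostream>

int main()
{
  // Reports the SIMD instruction sets Eigen's packet code was compiled for;
  // after disabling the AVX detection this should no longer mention AVX.
  std::cout << Eigen::SimdInstructionSetsInUse() << std::endl;
  return 0;
}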


Best regards

Daniel Vollmer

--------------------------
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
German Aerospace Center
Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany

Daniel Vollmer | AS C²A²S²E
www.DLR.de

________________________________________
Von: Christoph Hertzberg [***@informatik.uni-bremen.de]
Gesendet: Mittwoch, 1. August 2018 15:04
An: ***@lists.tuxfamily.org
Betreff: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)

Hi,

just to clear my confusion, by "partial vectorization" you mean
"unaligned vectorization"?

I guess Eigen's AVX-usage is a likely issue in some situations then --
but it is really hard to fix without anything concrete.
Could you (manually) disable the AVX-detection in Eigen/Core, but
compile with AVX enabled?
And does enabling/disabling AVX with Eigen3.2 make a difference? (That
version had no AVX support, but there may be issues with switching
between AVX and non-AVX instructions).

You could also try to make a diff between the assembly generated by gcc
and clang. This may involve cleaning up the assembly "somehow", or
actually disassembling the binary. Alternatively, just manually compare
some likely candidates -- you can mark them using
EIGEN_ASM_COMMENT("some label which is easy to find");

Christoph



On 2018-08-01 14:14, ***@dlr.de wrote:
> Hi Christoph,
>
> (I cc'd the mailing list again.)
>
> The compilation units are rather big, so directly comparing the resulting code is difficult.
>
> I've run the test-cases for gcc-8.1 and clang-3.8 with -msse4.2 -mtune=native to disable AVX.
>
> This improves the situation for gcc, (except for the "tau" test-cases where it's only "close") and results in the same performance as Eigen-3.2. Disabling partial vec or enabling it doesn't seem to make a difference for that setting anymore.
>
> For clang disabling AVX is a slight win for "tau" vs. default settings, but a slight loss for cgns (where the matrix-vector product and AD plays a bigger role, see area 2&3).
>
>
> Best regards
>
> Daniel Vollmer
>
> --------------------------
> Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
> German Aerospace Center
> Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany
>
> Daniel Vollmer | AS C²A²S²E
> www.DLR.de
>
> ________________________________________
> Von: Christoph Hertzberg [***@informatik.uni-bremen.de]
> Gesendet: Mittwoch, 1. August 2018 12:34
> An: Vollmer, Daniel
> Betreff: Re: Eigen 3.3 vs 3.2 Performance (was RE: [eigen] 3.3-beta2 released!)
>
> Hi,
>
> could you also try compiling with `-DEIGEN_UNALIGNED_VECTORIZE=0` and
> with AVX disabled, e.g., using `-msse4.2 -mtune=native` -- alternatively
> also by commenting out the corresponding detection inside Eigen/Core
> (this would actually be nice, if it was controllable by command-line
> options).
> And of course, any combinations of these options would be interesting,
> if they make a difference.
>
> If you have sufficiently small compilation units, it might also be worth
> having a look at the difference between the generated assembler code --
> but that is usually more productive if you had singled out a reduced
> test-case.
>
>
> Cheers,
> Christoph
>
>
>
>
>
> On 2018-08-01 11:10, ***@dlr.de wrote:
>> Hello everyone,
>>
>> with the recent release of 3.3.5 I've once again looked at upgrading from our currently used Eigen 3.2 to the current stable branch, but some performance regressions remain, which make this a difficult decision, as I'm unable to nail down the exact cause (probably because it's not a single one) and would prefer to not slow down the overall performance.
>>
>> I've attached a document with some performance measurements for different compilers, different Eigen versions, and 3 different test-cases for our code (tau, cgns, dg) that stress different areas / sizes.
>> The "vs best" column compares run-time against the overall best run-time, "vs same" only relative to shortest run-time with the same compiler (so essentially between different Eigen variants with the same compiler).
>> Eigen 3.2 version used was 3.2.9 + some backports of improvements to AutoDiffScalar
>> Eigen 3.3 version used was 3.3.5.
>> The tests were run on a Xeon E3-1276 v3 (with our code doing multi-threading, and Eigen configured to not use threading of its own). Minimum run-time of 4 runs.
>>
>> We use Eigen in a CFD code for 3 roughly distinct subject areas:
>> 1) fixed-size vectors (and some matrices) of doubles, direct access to individual values (with compile-time known indices) or segments, simple linear algebra, few matrix-vector products.
>> 2) same as 1, but using Eigen::AutoDiffScalar instead of double (building up a Jacobian)
>> 3) Fixed-size matrix-vector products (inside of a Block-Jacobi iteration, not using any of Eigen's solvers)
>>
>> For the different cases:
>> tau: Only uses 1), with vectors of sizes 5 and 8, matrices of size 5x5
>> cgns: Uses 1)-3), with vectors of sizes 6 and 13, matrices of size 6x6 (for both 1 and 3).
>> dg: Uses 1)-3), with vectors of sizes 5 and 8, matrices of size 5x5 (for 1) and 20x20 (for 3).
>>
>> The outcomes seem to be
>> - clang is generally fastest
>> - the performance regression is more pronounced for gcc
>> - (partial) vectorization seems to "hurt" simple direct access (area 1), disabling it improves performance (clang) or at least reduces the impact of Eigen 3.3 (gcc)
>>
>> If we were only looking at clang, I'd be nearly willing to advocate moving to 3.3 (with default settings), because only a regression for the "tau" case remains.
>>
>> Unfortunately, I'm at a loss at how to pin-point these any more, and attempts at extracting a reduced test-case / example that exhibits the same behavior have not been fruitful, and some profiling of the actual code between Eigen 3.2 and 3.3 does not seem to directly yield actionable information.
>>
>> If anyone has any ideas for things to try, I'm all ears. :)
>>
>> Either way, thanks for your helpful (and nice to use) library!
>>
>>
>> Best regards
>>
>> Daniel Vollmer
>>
>> --------------------------
>> Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
>> German Aerospace Center
>> Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany
>>
>> Daniel Vollmer | AS C²A²S²E
>> www.DLR.de
>>
>> ________________________________________
>> Von: Vollmer, Daniel
>> Gesendet: Donnerstag, 28. Juli 2016 12:46
>> An: ***@lists.tuxfamily.org
>> Betreff: RE: [eigen] 3.3-beta2 released!
>>
>> Hi Gael,
>>
>>> Fixed: https://bitbucket.org/eigen/eigen/commits/e35a38ad89fe/
>>> With float I get a nearly x2 speedup for the above 5x5 matrix-vector
>>> products (compared to 3.2), and x1.4 speedup with double.
>>
>> I tried out this version (ca9bd08) and the results are as follows:
>> Note: the explicit solver pretty much only does residual evaluations,
>> whereas the implicit solver does a residual evaluation, followed by a
>> Jacobian computation (using AutoDiffScalar) and then a block-based
>> Gauss-Jacobi iteration where the blocks are 5x5 matrices to
>> approximately solve a linear system based on the Jacobian and the
>> residual.
>>
>> Explicit solver:
>> ----------------
>> eigen-3.3-ca9bd08 10.9s => 09% slower
>> eigen-3.3-beta2 11.1s => 11% slower
>> eigen-3.3-beta2 UNALIGNED_VEC=0 10.0s => 00% slower
>> eigen-3.2.9 10.0s => baseline
>>
>> Implicit solver:
>> ----------------
>> eigen-3.3-ca9bd08 34.2s => 06% faster
>> eigen-3.3-beta2 37.5s => 03% slower
>> eigen-3.3-beta2 UNALIGNED_VEC=0 38.2s => 05% slower
>> eigen-3.2.9 36.5s => baseline
>>
>> So the change definitely helps for the implicit solver (which has lots
>> of 5x5 by 5x1 double multiplies), but for the explicit solver the
>> overhead of unaligned vectorization doesn't pay off. Maybe the use of
>> 3D vectors (which used for geometric normals and coordinates) is
>> problematic because it's such a borderline case for vectorization?
>>
>> What I don't quite understand is the difference between 3.2.9 (which
>> doesn't vectorize the given matrix sizes) and 3.3-beta2 without
>> vectorization: Something in 3.3 is slower under those conditions, but
>> maybe it's not the matrix-vector multiplies, as it could also be
>> AutoDiffScalar being slower.
>>
>>
>> Best regards
>>
>> Daniel Vollmer
>>
>> --------------------------
>> Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
>> German Aerospace Center
>> Institute of Aerodynamics and Flow Technology | Lilienthalplatz 7 | 38108 Braunschweig | Germany
>>
>> Daniel Vollmer | AS C²A²S²E
>> www.DLR.de
>>
>
>
> --
> Dr.-Ing. Christoph Hertzberg
>
> Besuchsadresse der Nebengeschäftsstelle:
> DFKI GmbH
> Robotics Innovation Center
> Robert-Hooke-Straße 5
> 28359 Bremen, Germany
>
> Postadresse der Hauptgeschäftsstelle Standort Bremen:
> DFKI GmbH
> Robotics Innovation Center
> Robert-Hooke-Straße 1
> 28359 Bremen, Germany
>
> Tel.: +49 421 178 45-4021
> Zentrale: +49 421 178 45-0
> E-Mail: ***@dfki.de
>
> Weitere Informationen: http://www.dfki.de/robotik
> -----------------------------------------------------------------------
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Firmensitz: Trippstadter Straße 122, D-67663 Kaiserslautern
> Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
> (Vorsitzender) Dr. Walter Olthoff
> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
> Amtsgericht Kaiserslautern, HRB 2313
> Sitz der Gesellschaft: Kaiserslautern (HRB 2313)
> USt-Id.Nr.: DE 148646973
> Steuernummer: 19/672/50006
> -----------------------------------------------------------------------
>


--
Dr.-Ing. Christoph Hertzberg

Besuchsadresse der Nebengeschäftsstelle:
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 5
28359 Bremen, Germany

Postadresse der Hauptgeschäftsstelle Standort Bremen:
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 1
28359 Bremen, Germany

Tel.: +49 421 178 45-4021
Zentrale: +49 421 178 45-0
E-Mail: ***@dfki.de

Weitere Informationen: http://www.dfki.de/robotik
-----------------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Straße 122, D-67663 Kaiserslautern
Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
(Vorsitzender) Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
Sitz der Gesellschaft: Kaiserslautern (HRB 2313)
USt-Id.Nr.: DE 148646973
Steuernummer: 19/672/50006
-----------------------------------------------------------------------