Status of AVX support

Hi,

I don't know of any work in that direction.

The difficulty is that taking advantage of AVX without regressing
current performance from SSE will require us to handle *both* 16-byte
and 32-byte packet size and alignment. We will need 16-byte for e.g.
Vector4f, while we will of course want to use 32-byte wherever we can.

That part requires significant changes in Eigen.

On the other hand, if you can be satisfied with just a pure-32-byte
mode, not trying to fall back to 16-byte, that's a lot easier: you'd
just have to adjust existing code to allow for 32-byte alignment
instead of 16-byte (easy) and add arch/AVX/PacketMath.h (not hard if
you know AVX intrinsics). You're welcome to do so, but keep in mind
that such a pure-32-byte mode will regress performance for
applications that don't lend themselves well to 32-byte packets and
alignment, such as those using Vector4f. So it won't be possible to
enable it by default until it can properly fall back to 16-byte
packets and alignment.

Cheers,
Benoit

Post by Rohit Garg
Hi,
I just laid my hands on a Sandy Bridge machine and I was wondering
what is the current status of AVX development for Eigen.
Cheers,
--
Rohit Garg
http://rpg-314.blogspot.com/
Graduate Student
Applied and Engineering Physics
Cornell University

Rohit Garg

2011-12-06 22:09:52 UTC

Post by Benoit Jacob
Hi,
I don't know of any work in that direction.
The difficulty is that taking advantage of AVX without regressing
current performance from SSE will require us to handle *both* 16-byte
and 32-byte packet size and alignment. We will need 16-byte for e.g.
Vector4f, while we will of course want to use 32-byte wherever we can.
That part requires significant changes in Eigen.

But that will still need the AVX and SSE backend to choose on a per
object basis, right?

--
Rohit Garg

http://rpg-314.blogspot.com/

Graduate Student
Applied and Engineering Physics
Cornell University

Rohit Garg

2011-12-06 22:48:30 UTC

To clarify, I meant separate backends, just like with SSE and NEON.

But that will still need the AVX and SSE backend to choose on a per
object basis, right?

--
Rohit Garg
http://rpg-314.blogspot.com/
Graduate Student
Applied and Engineering Physics
Cornell University

Gael Guennebaud

2011-12-07 08:43:44 UTC

Well, one (AVX) will be built on top of the SSE backend. The main
difficulty is that for the same scalar type, e.g., float, we'll have
the possibility to choose between two backends, and to be future-proof
the logic we implement should be able to deal with an arbitrary
number of backends.
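
To make the "arbitrary number of backends" point concrete, here is a minimal sketch of picking a packet width per fixed size at compile time. The names widest_fitting and float_packet_width are hypothetical and are not part of Eigen; the widths 8 and 4 stand for floats per AVX and SSE register, and 1 is the scalar fallback.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical trait (not Eigen's API): among the candidate packet widths,
// pick the widest one that does not exceed the object size. Falls back to 1
// (scalar code) when no packet fits.
template <std::size_t Size, std::size_t... Widths>
struct widest_fitting;

// No candidates left: scalar fallback.
template <std::size_t Size>
struct widest_fitting<Size> {
    static constexpr std::size_t value = 1;
};

// Compare the first candidate W with the best choice among the rest.
template <std::size_t Size, std::size_t W, std::size_t... Rest>
struct widest_fitting<Size, W, Rest...> {
    static constexpr std::size_t rest = widest_fitting<Size, Rest...>::value;
    static constexpr std::size_t value = (W <= Size && W > rest) ? W : rest;
};

// Assumed backend list for float: 8 floats (32-byte packet) and 4 floats
// (16-byte packet). Adding a backend means adding one more width here.
template <std::size_t Size>
constexpr std::size_t float_packet_width() {
    return widest_fitting<Size, 8, 4>::value;
}

static_assert(float_packet_width<4>() == 4, "a 4-float vector uses the 16-byte packet");
static_assert(float_packet_width<16>() == 8, "a 16-float vector can use the 32-byte packet");
```

With this shape, a 4-float vector keeps using the 16-byte packet while larger fixed sizes get the 32-byte one, which is the per-object choice discussed above.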

gael

Christoph Hertzberg

2011-12-07 15:35:12 UTC

Post by Benoit Jacob
The difficulty is that taking advantage of AVX without regressing
current performance from SSE will require us to handle *both* 16-byte
and 32-byte packet size and alignment. We will need 16-byte for e.g.
Vector4f, while we will of course want to use 32-byte wherever we can.

I'm afraid you will lose ABI compatibility if you use 32-byte alignment
by default. E.g., the size of
    struct { char x; Vector4d y; };
will be 48 bytes with 16-byte alignment and 64 bytes with 32-byte alignment.
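
The size difference can be checked without Eigen. The sketch below uses hypothetical stand-in types (Vec4dAlign16, Vec4dAlign32) for a 4-double vector and assumes a mainstream ABI where double is 8 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical stand-ins for a 4-double vector (32 bytes of data) with
// 16-byte and 32-byte alignment. These are not Eigen types.
struct alignas(16) Vec4dAlign16 { double v[4]; };
struct alignas(32) Vec4dAlign32 { double v[4]; };

struct Holder16 { char x; Vec4dAlign16 y; };
struct Holder32 { char x; Vec4dAlign32 y; };

// y starts at the first multiple of its alignment after x, so the padding
// after x grows from 15 bytes to 31 bytes.
static_assert(sizeof(Holder16) == 48, "16 bytes (x + padding) + 32 bytes (y)");
static_assert(sizeof(Holder32) == 64, "32 bytes (x + padding) + 32 bytes (y)");

// Print the layouts; the offsets are 16 and 32 respectively.
inline void print_layouts() {
    assert(offsetof(Holder16, y) == 16);
    assert(offsetof(Holder32, y) == 32);
    std::printf("16-byte aligned: offset %zu, size %zu\n",
                offsetof(Holder16, y), sizeof(Holder16));
    std::printf("32-byte aligned: offset %zu, size %zu\n",
                offsetof(Holder32, y), sizeof(Holder32));
}
```

Two translation units that disagree on the alignment would therefore disagree on both the member offset and the struct size, which is the ABI break.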

W.r.t porting to AVX: Be aware that there might be some pitfalls with
AVX-performance:
http://www.agner.org/optimize/blog/read.php?i=142

Christoph
--
----------------------------------------------
Dipl.-Inf. Christoph Hertzberg
Cartesium 0.051
Universität Bremen
Enrique-Schmidt-Straße 5
28359 Bremen

Tel: (+49) 421-218-64252
----------------------------------------------

Benoit Jacob

2011-12-07 15:47:48 UTC

Very good point.

Benoit

Rhys Ulerich

2011-12-07 16:23:38 UTC

Post by Christoph Hertzberg
W.r.t porting to AVX: Be aware that there might be some pitfalls with
AVX-performance:
http://www.agner.org/optimize/blog/read.php?i=142

Interesting tidbit from that link: "If the programmer inadvertently
mixes AVX and non-AVX vector instructions in the same code then there
is a penalty of 70 clock cycles for each transition between the two
forms."

Thank you for the pointer to the blog,
Rhys

Benoit Jacob

2011-12-07 16:31:41 UTC

Post by Christoph Hertzberg
W.r.t porting to AVX: Be aware that there might be some pitfalls with
AVX-performance:
http://www.agner.org/optimize/blog/read.php?i=142

Between this, and the fact that we can't 32-byte-align Vector4d
without breaking the ABI, I'm starting to wonder if maybe we should
treat AVX as a dynamic-size-only thing and completely give up on AVX
for fixed-size objects? For dynamic-size objects, the situation is
much simpler: we can increase the alignment without breaking the ABI,
and we can assume that objects are large so that AVX is always better
than SSE.

In any case, I think we should start by doing AVX for dynamic-size
objects only; there will be time to think about fixed-size later.
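
As a rough illustration of why dynamic-size objects are the easy case: the object itself only stores a pointer and a size, so its layout does not depend on how the heap buffer is aligned. The helpers below (aligned_alloc_sketch, aligned_free_sketch, DynVecSketch) are hypothetical names for this sketch, not Eigen's allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: over-allocate with malloc, round the address up to a
// multiple of `alignment` (assumed to be a power of two >= sizeof(void*)),
// and stash the original pointer just before the aligned block.
inline void* aligned_alloc_sketch(std::size_t bytes, std::size_t alignment) {
    void* raw = std::malloc(bytes + alignment + sizeof(void*));
    if (!raw) return nullptr;
    const std::uintptr_t start =
        reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    const std::uintptr_t aligned =
        (start + alignment - 1) & ~(static_cast<std::uintptr_t>(alignment) - 1);
    reinterpret_cast<void**>(aligned)[-1] = raw;  // remember what to free
    return reinterpret_cast<void*>(aligned);
}

inline void aligned_free_sketch(void* p) {
    if (p) std::free(static_cast<void**>(p)[-1]);
}

// A dynamic-size vector holds only a pointer and a size; the alignment is a
// property of the heap buffer, not of the object.
template <std::size_t Alignment>
struct DynVecSketch {
    float* data;
    std::size_t size;

    explicit DynVecSketch(std::size_t n)
        : data(static_cast<float*>(
              aligned_alloc_sketch(n * sizeof(float), Alignment))),
          size(n) {}
    ~DynVecSketch() { aligned_free_sketch(data); }

    DynVecSketch(const DynVecSketch&) = delete;
    DynVecSketch& operator=(const DynVecSketch&) = delete;
};

static_assert(sizeof(DynVecSketch<16>) == sizeof(DynVecSketch<32>),
              "object layout is independent of buffer alignment");
```

Switching the buffer from 16-byte to 32-byte alignment costs a few extra bytes per allocation but leaves sizeof and member offsets of the containing types unchanged.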

Benoit

Gael Guennebaud

2011-12-07 17:33:17 UTC

Note that you can work on half of a register using AVX instructions.
So actually, the AVX backend will be completely separate from, and
mutually exclusive with, the SSE one. The main issue is the required
32-byte alignment.

An idea would be to replace (extend) the (Auto)Aligned keywords by
(Auto)Aligned16 and (Auto)Aligned32 keywords that could be used with
Map and Matrix. The default of Matrix will still be Aligned16 for ABI
compatibility and limited memory overhead.
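
A minimal sketch of what such an alignment option could look like on a fixed-size type. The names AlignmentOption, FixedVectorSketch, Vec4f16 and Vec4f32 are hypothetical and only borrow the Aligned16/Aligned32 spelling from the proposal; this is not Eigen's actual template interface.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical option values mirroring the proposed keyword names.
enum AlignmentOption : std::size_t { Aligned16 = 16, Aligned32 = 32 };

// Fixed-size vector whose alignment is chosen by a template option.
// The default stays at 16 bytes, so existing layouts do not change.
template <typename Scalar, std::size_t N, AlignmentOption Align = Aligned16>
struct alignas(static_cast<std::size_t>(Align)) FixedVectorSketch {
    Scalar data[N];
};

using Vec4f16 = FixedVectorSketch<float, 4>;             // default option
using Vec4f32 = FixedVectorSketch<float, 4, Aligned32>;  // explicit opt-in

static_assert(alignof(Vec4f16) == 16, "default keeps 16-byte alignment");
static_assert(alignof(Vec4f32) == 32, "opt-in raises alignment to 32 bytes");

// The memory overhead mentioned above: 16 bytes of data are padded to 32
// when the 32-byte option is requested.
static_assert(sizeof(Vec4f16) == 16, "no padding with 16-byte alignment");
static_assert(sizeof(Vec4f32) == 32, "padded up to the 32-byte alignment");
```

Keeping 16 as the default means only code that explicitly asks for Aligned32 pays the extra padding and changes layout.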

gael

Rohit Garg

2011-12-10 17:58:43 UTC

Post by Benoit Jacob
Between this, and the fact that we can't 32-byte-align Vector4d
without breaking the ABI, I'm starting to wonder if maybe we should
treat AVX as a dynamic-size-only thing and completely give up on AVX
for fixed-size objects?

That is a good idea. The fixed-size objects would be very small
anyway, so not using AVX for them wouldn't hurt much.

Would it be any easier to implement AVX for just the dynamic-size objects?

--
Rohit Garg

http://rpg-314.blogspot.com/

Graduate Student
Applied and Engineering Physics
Cornell University

Eamon Nerbonne

2012-02-22 13:55:01 UTC

I happened across
http://stackoverflow.com/questions/6546275/what-are-the-alignment-restrictions-on-the-new-haswell-avx-gather-instruction,
which notes the fact that most AVX instructions don't actually require
alignment. Might it not thus be possible to simply use AVX everywhere, and
opportunistically use 32-byte alignment only where easy (for the extra
performance)?
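
One way to picture "opportunistic" alignment is to find, at run time, where the first 32-byte boundary falls inside an array and treat the elements before it separately. The sketch below uses plain scalar loops as placeholders for where unaligned and aligned packet loads would go; first_aligned_index and sum_sketch are hypothetical names, not Eigen functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: index of the first element of `p` that sits on an
// `alignment`-byte boundary, or `n` if no element within the array does.
inline std::size_t first_aligned_index(const float* p, std::size_t n,
                                       std::size_t alignment) {
    const std::uintptr_t misalign =
        reinterpret_cast<std::uintptr_t>(p) % alignment;
    if (misalign == 0) return 0;
    const std::uintptr_t bytes_to_boundary = alignment - misalign;
    // If the pointer is not even float-aligned, no element hits the boundary.
    if (bytes_to_boundary % sizeof(float) != 0) return n;
    const std::size_t idx = bytes_to_boundary / sizeof(float);
    return idx < n ? idx : n;
}

// Sum with a scalar prologue, 8-float chunks starting on a 32-byte boundary,
// and a scalar tail. The inner chunk loop marks where aligned 32-byte loads
// could be used; the prologue and tail are where unaligned access is needed.
inline float sum_sketch(const float* p, std::size_t n) {
    const std::size_t head = first_aligned_index(p, n, 32);
    float total = 0.0f;
    std::size_t i = 0;
    for (; i < head; ++i) total += p[i];             // before the boundary
    for (; i + 8 <= n; i += 8)                       // aligned 8-float chunks
        for (std::size_t k = 0; k < 8; ++k) total += p[i + k];
    for (; i < n; ++i) total += p[i];                // leftover elements
    return total;
}
```

For a buffer that is already 32-byte aligned the prologue is empty, so the aligned path covers everything except the tail.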

(Sorry for the dead thread revival, if that's objectionable)

--Eamon

Gael Guennebaud

2012-02-23 12:07:56 UTC

Sure, if the performance penalty is really low, then that considerably
simplifies our work. So, the first thing to do would be to benchmark the
real performance penalty between 32-byte and 16-byte alignment.
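
A very rough sketch of such a benchmark, under several assumptions: it relies on the compiler auto-vectorizing a simple axpy loop (e.g. GCC with -O3 -mavx) rather than on hand-written intrinsics, and the helper names (axpy, time_axpy, pointer_with_offset) are made up for this example. It only shows how to obtain buffers that are 32-byte aligned versus 16-byte-but-not-32-byte aligned and time the same kernel on both; real measurements would need more care (warm-up, larger repeat counts, checking the generated assembly).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simple kernel that compilers can usually vectorize: y += a * x.
inline void axpy(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) y[i] += a * x[i];
}

// Run the kernel `repeats` times and return the elapsed time in seconds.
inline double time_axpy(float a, const float* x, float* y, std::size_t n,
                        int repeats) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) axpy(a, x, y, n);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Return a pointer inside `storage` whose address is `offset_bytes` past a
// 32-byte boundary. The caller must allocate at least 16 extra floats.
// offset_bytes = 0 gives a 32-byte aligned pointer; offset_bytes = 16 gives
// one that is 16-byte aligned but not 32-byte aligned.
inline float* pointer_with_offset(std::vector<float>& storage,
                                  std::size_t offset_bytes) {
    const std::uintptr_t addr =
        reinterpret_cast<std::uintptr_t>(storage.data());
    const std::uintptr_t aligned =
        (addr + 31) & ~static_cast<std::uintptr_t>(31);
    return reinterpret_cast<float*>(aligned + offset_bytes);
}
```

Calling time_axpy once with pointers from pointer_with_offset(storage, 0) and once with pointer_with_offset(storage, 16) gives the two timings to compare.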

gael

Rohit Garg

2011-12-07 17:05:20 UTC

If you compile with -mavx, GCC will not generate any legacy-encoded SSE
instructions for that code. The SSE intrinsics will compile to the
corresponding VEX-encoded (AVX) instructions. So this penalty is not
much of a bother.