Discussion:
Status of AVX support
Rohit Garg
2011-12-06 02:03:21 UTC
Hi,

I just laid my hands on a Sandy Bridge machine and I was wondering
what is the current status of AVX development for Eigen.

Cheers,
--
Rohit Garg

http://rpg-314.blogspot.com/

Graduate Student
Applied and Engineering Physics
Cornell University
Benoit Jacob
2011-12-06 13:01:28 UTC
Hi,

I don't know of any work in that direction.

The difficulty is that taking advantage of AVX without regressing
current performance from SSE will require us to handle *both* 16-byte
and 32-byte packet size and alignment. We will need 16-byte for e.g.
Vector4f, while we will of course want to use 32-byte wherever we can.

That part requires significant changes in Eigen.

On the other hand, if you can be satisfied with just a pure-32-byte
mode, not trying to fall back to 16-byte, that's a lot easier, you'd
just have to adjust existing code to allow for 32-byte alignment
instead of 16-byte (easy) and add arch/AVX/PacketMath.h (not hard if
you know AVX intrinsics). You're welcome to do so, but keep in mind
that such a pure-32-byte mode will regress performance for the
applications that don't lend themselves well to 32-byte packets and
alignment, such as Vector4f, so it won't be possible to enable it by
default until it can properly fall back to 16-byte packets and
alignment.
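
As a rough illustration of what the packet primitives in such an arch/AVX/PacketMath.h might look like, here is a minimal sketch modelled on the pload/padd/pstore naming of the existing SSE backend. The sketch namespace, the Packet8f typedef and add_aligned are hypothetical names, not Eigen code. The AVX path is only compiled when the compiler defines __AVX__ (e.g. with -mavx); otherwise a scalar loop is used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX__)
#include <immintrin.h>
#endif

namespace sketch {

#if defined(__AVX__)
// Hypothetical 8-float packet type, by analogy with the SSE backend's Packet4f.
typedef __m256 Packet8f;

// Aligned load: the pointer must be 32-byte aligned.
inline Packet8f pload(const float* from) { return _mm256_load_ps(from); }
inline Packet8f padd(Packet8f a, Packet8f b) { return _mm256_add_ps(a, b); }
// Aligned store: the pointer must be 32-byte aligned.
inline void pstore(float* to, Packet8f from) { _mm256_store_ps(to, from); }
#endif

// Computes c[i] = a[i] + b[i]. All three arrays are assumed 32-byte aligned.
inline void add_aligned(const float* a, const float* b, float* c, std::size_t n) {
  assert(reinterpret_cast<std::uintptr_t>(a) % 32 == 0);
  assert(reinterpret_cast<std::uintptr_t>(b) % 32 == 0);
  assert(reinterpret_cast<std::uintptr_t>(c) % 32 == 0);
  std::size_t i = 0;
#if defined(__AVX__)
  // Process full packets of 8 floats.
  for (; i + 8 <= n; i += 8) {
    pstore(c + i, padd(pload(a + i), pload(b + i)));
  }
#endif
  // Scalar tail (or the whole range when AVX is not enabled).
  for (; i < n; ++i) {
    c[i] = a[i] + b[i];
  }
}

}  // namespace sketch
```

The aligned load and store are what force the 32-byte alignment requirement discussed above; a 4-float object such as Vector4f cannot fill one of these packets at all.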

Cheers,
Benoit
Rohit Garg
2011-12-06 22:09:52 UTC
Post by Benoit Jacob
Hi,
I don't know of any work in that direction.
The difficulty is that taking advantage of AVX without regressing
current performance from SSE will require us to handle *both* 16-byte
and 32-byte packet size and alignment. We will need 16-byte for e.g.
Vector4f, while we will of course want to use 32-byte wherever we can.
That part requires significant changes in Eigen.
But that will still need the AVX and SSE backend to choose on a per
object basis, right?
--
Rohit Garg

http://rpg-314.blogspot.com/

Graduate Student
Applied and Engineering Physics
Cornell University
Rohit Garg
2011-12-06 22:48:30 UTC
To clarify, I meant separate backends, just like with SSE and NEON.
--
Rohit Garg

http://rpg-314.blogspot.com/

Graduate Student
Applied and Engineering Physics
Cornell University
Gael Guennebaud
2011-12-07 08:43:44 UTC
Well, one (AVX) will be built on top of the SSE backend. The main
difficulty is that for the same scalar type, e.g., float, we'll have
the possibility to choose between two backends, and to be future-proof
the logic we implement should be able to deal with an arbitrary
number of backends.
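
One way to picture "choosing between backends for the same scalar type" is a compile-time rule that picks a packet width per object size. The following is only a sketch under a simplifying assumption (use the widest packet whose width divides the number of coefficients); the sketch namespace, kWidths and best_width are hypothetical and do not exist in Eigen.

```cpp
#include <cstddef>

namespace sketch {

// Packet widths, in floats, offered by the hypothetical backends, widest
// first: AVX (8), SSE (4), scalar (1). More backends would simply add entries.
constexpr std::size_t kWidths[] = {8, 4, 1};

// Returns the widest packet width that evenly divides n coefficients.
// The final width of 1 divides everything, so the recursion always ends.
constexpr std::size_t best_width(std::size_t n, std::size_t i = 0) {
  return (n % kWidths[i] == 0) ? kWidths[i] : best_width(n, i + 1);
}

static_assert(best_width(4) == 4, "a 4-float vector falls back to the 4-wide backend");
static_assert(best_width(16) == 8, "16 floats can use the 8-wide backend");
static_assert(best_width(3) == 1, "3 floats fall back to scalar code");

}  // namespace sketch
```

A real selection rule would also have to account for alignment and for leftover coefficients, which this sketch ignores.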

gael
Christoph Hertzberg
2011-12-07 15:35:12 UTC
Post by Benoit Jacob
The difficulty is that taking advantage of AVX without regressing
current performance from SSE will require us to handle *both* 16-byte
and 32-byte packet size and alignment. We will need 16-byte for e.g.
Vector4f, while we will of course want to use 32-byte wherever we can.
I'm afraid, you will lose ABI-compatibility if you use 32-byte alignment
by default. E.g. the size of
struct { char x; Vector4d y; };
will be 48 bytes with 16 byte alignment and 64 bytes with 32 byte alignment.
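
The size difference can be checked without Eigen by using plain stand-in structs. This is a sketch only: Vec4d16 and Vec4d32 are hypothetical substitutes for a Vector4d (four doubles, 32 bytes of data) aligned to 16 and 32 bytes respectively, and the sizes assume the usual 8-byte double.

```cpp
#include <cstddef>

namespace sketch {

// Stand-ins for a 4-double vector; these are not Eigen types.
struct Vec4d16 { alignas(16) double d[4]; };
struct Vec4d32 { alignas(32) double d[4]; };

struct S16 { char x; Vec4d16 y; };
struct S32 { char x; Vec4d32 y; };

// With 16-byte alignment, y starts at offset 16 and the struct is 48 bytes.
static_assert(offsetof(S16, y) == 16, "y is padded to the next 16-byte boundary");
static_assert(sizeof(S16) == 48, "16 bytes for x plus padding, 32 bytes for y");

// With 32-byte alignment, y starts at offset 32 and the struct is 64 bytes.
static_assert(offsetof(S32, y) == 32, "y is padded to the next 32-byte boundary");
static_assert(sizeof(S32) == 64, "32 bytes for x plus padding, 32 bytes for y");

}  // namespace sketch
```

Two translation units that disagree on the alignment would therefore disagree on both the offset of y and the size of the struct, which is the ABI break in question.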

W.r.t porting to AVX: Be aware that there might be some pitfalls with
AVX-performance:
http://www.agner.org/optimize/blog/read.php?i=142

Christoph
--
----------------------------------------------
Dipl.-Inf. Christoph Hertzberg
Cartesium 0.051
Universität Bremen
Enrique-Schmidt-Straße 5
28359 Bremen

Tel: (+49) 421-218-64252
----------------------------------------------
Benoit Jacob
2011-12-07 15:47:48 UTC
Post by Benoit Jacob
The difficulty is that taking advantage of AVX without regressing
current performance from SSE will require us to handle *both* 16-byte
and 32-byte packet size and alignment. We will need 16-byte for e.g.
Vector4f, while we will of course want to use 32-byte wherever we can.
I'm afraid, you will lose ABI-compatibility if you use 32-byte alignment by
default. E.g. the size of
       struct { char x; Vector4d y; };
will be 48 bytes with 16 byte alignment and 64 bytes with 32 byte alignment.
Very good point.

Benoit
Rhys Ulerich
2011-12-07 16:23:38 UTC
Post by Christoph Hertzberg
W.r.t porting to AVX: Be aware that there might be some pitfalls with
AVX-performance:
http://www.agner.org/optimize/blog/read.php?i=142
Interesting tidbit from that link "If the programmer inadvertently
mixes AVX and non-AVX vector instructions in the same code then there
is a penalty of 70 clock cycles for each transition between the two
forms."

Thank you for the pointer to the blog,
Rhys
Benoit Jacob
2011-12-07 16:31:41 UTC
Post by Rhys Ulerich
Post by Christoph Hertzberg
W.r.t porting to AVX: Be aware that there might be some pitfalls with
AVX-performance:
http://www.agner.org/optimize/blog/read.php?i=142
Interesting tidbit from that link "If the programmer inadvertently
mixes AVX and non-AVX vector instructions in the same code then there
is a penalty of 70 clock cycles for each transition between the two
forms."
Between this, and the fact that we can't 32-byte-align Vector4d
without breaking the ABI, I'm starting to wonder if maybe we should
treat AVX as a dynamic-size-only thing and completely give up on AVX
for fixed-size objects? For dynamic-size objects, the situation is
much simpler, we can increase the alignment without breaking the ABI
and we can assume that objects are large so that AVX is always better
than SSE.

In any case, I think we should start by doing AVX for dynamic-size
objects only, it will be time to think about fixed-size later.
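
To illustrate why dynamic-size objects are the easy case: the object itself only holds a size and a pointer, so raising the alignment of the heap block does not change its layout. The sketch below assumes a C++17 compiler (aligned operator new, which did not exist when this thread was written); AlignedBuffer and kAlign are hypothetical names, not Eigen's allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

namespace sketch {

// Minimal dynamic-size float buffer whose heap storage is 32-byte aligned.
class AlignedBuffer {
 public:
  static constexpr std::size_t kAlign = 32;

  explicit AlignedBuffer(std::size_t n)
      : size_(n),
        data_(static_cast<float*>(
            ::operator new(n * sizeof(float), std::align_val_t(kAlign)))) {
    // The aligned allocation function guarantees this.
    assert(reinterpret_cast<std::uintptr_t>(data_) % kAlign == 0);
  }

  ~AlignedBuffer() { ::operator delete(data_, std::align_val_t(kAlign)); }

  // Non-copyable to keep the sketch short.
  AlignedBuffer(const AlignedBuffer&) = delete;
  AlignedBuffer& operator=(const AlignedBuffer&) = delete;

  float* data() { return data_; }
  std::size_t size() const { return size_; }

 private:
  std::size_t size_;  // number of floats
  float* data_;       // 32-byte aligned heap block
};

}  // namespace sketch
```

Changing kAlign from 16 to 32 affects only where the heap block lands, not the size or layout of AlignedBuffer itself.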

Benoit
Gael Guennebaud
2011-12-07 17:33:17 UTC
Note that you can work on half of a register using AVX instructions,
so the AVX backend will actually be completely separate from, and
mutually exclusive with, the SSE one. The main real issue is the
required 32-byte alignment.

An idea would be to replace (extend) the (Auto)Aligned keywords by
(Auto)Aligned16 and (Auto)Aligned32 keywords that could be used with
Map and Matrix. The default of Matrix will still be Aligned16 for ABI
compatibility and limited memory overhead.
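
A minimal sketch of what such an opt-in alignment option could look like for fixed-size storage. The enum values Aligned16 and Aligned32 follow the names proposed above, but FixedStorage and the sketch namespace are hypothetical and this is not Eigen's Matrix.

```cpp
#include <cstddef>

namespace sketch {

// Proposed alignment options, encoded directly as byte counts.
enum AlignmentOption { Aligned16 = 16, Aligned32 = 32 };

// Fixed-size storage whose alignment is chosen by a template option.
// The default stays at 16 bytes, as suggested for ABI compatibility.
template <typename Scalar, std::size_t N, AlignmentOption Align = Aligned16>
struct FixedStorage {
  alignas(static_cast<std::size_t>(Align)) Scalar data[N];
};

// Default: a 4-float vector stays 16 bytes, 16-byte aligned.
static_assert(alignof(FixedStorage<float, 4>) == 16, "default is 16-byte aligned");
static_assert(sizeof(FixedStorage<float, 4>) == 16, "no padding by default");

// Opt-in: the same vector grows to 32 bytes when 32-byte aligned.
static_assert(alignof(FixedStorage<float, 4, Aligned32>) == 32, "opt-in 32-byte alignment");
static_assert(sizeof(FixedStorage<float, 4, Aligned32>) == 32, "16 bytes of padding added");

}  // namespace sketch
```

The last assertion shows the memory overhead mentioned above: a 4-float object doubles in size once it is 32-byte aligned.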


gael
Rohit Garg
2011-12-10 17:58:43 UTC
Post by Benoit Jacob
Post by Rhys Ulerich
Post by Christoph Hertzberg
W.r.t porting to AVX: Be aware that there might be some pitfalls with
AVX-performance:
http://www.agner.org/optimize/blog/read.php?i=142
Interesting tidbit from that link "If the programmer inadvertently
mixes AVX and non-AVX vector instructions in the same code then there
is a penalty of 70 clock cycles for each transition between the two
forms."
Between this, and the fact that we can't 32-byte-align Vector4d
without breaking the ABI, I'm starting to wonder if maybe we should
treat AVX as a dynamic-size-only thing and completely give up on AVX
for fixed-size objects?  For dynamic-size objects, the situation is
much simpler, we can increase the alignment without breaking the ABI
and we can assume that objects are large so that AVX is always better
than SSE.
That is a good idea. The fixed size objects would be very small
anyway, so not using AVX wouldn't hurt much.

Would it be any easier to implement AVX for just the dynamic-size objects?
--
Rohit Garg

http://rpg-314.blogspot.com/

Graduate Student
Applied and Engineering Physics
Cornell University
Eamon Nerbonne
2012-02-22 13:55:01 UTC
I happened across
http://stackoverflow.com/questions/6546275/what-are-the-alignment-restrictions-on-the-new-haswell-avx-gather-instruction,
which notes the fact that most AVX instructions don't actually require
alignment. Might it not thus be possible to simply use AVX everywhere, and
opportunistically use 32-byte alignment only where easy (for the extra
performance)?
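
For reference, this is roughly what "AVX without an alignment requirement" looks like with intrinsics: the unaligned load _mm256_loadu_ps accepts any float pointer. The function below is a sketch with hypothetical names, not Eigen code; the AVX path is compiled only when __AVX__ is defined (e.g. with -mavx), and it says nothing about how fast unaligned loads are.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

namespace sketch {

// Sums n floats starting at p; p does not need to be 32-byte aligned.
inline float sum_unaligned(const float* p, std::size_t n) {
  float total = 0.0f;
  std::size_t i = 0;
#if defined(__AVX__)
  __m256 acc = _mm256_setzero_ps();
  // Unaligned 8-float loads work for any address.
  for (; i + 8 <= n; i += 8) {
    acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
  }
  // Spill the 8 partial sums and reduce them in scalar code.
  float lanes[8];
  _mm256_storeu_ps(lanes, acc);
  for (int k = 0; k < 8; ++k) {
    total += lanes[k];
  }
#endif
  // Scalar tail (or the whole range when AVX is not enabled).
  for (; i < n; ++i) {
    total += p[i];
  }
  return total;
}

}  // namespace sketch
```

Whether the unaligned form costs anything measurable compared with _mm256_load_ps on aligned data is exactly the open question here.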

(Sorry for the dead thread revival, if that's objectionable)

--Eamon
Gael Guennebaud
2012-02-23 12:07:56 UTC
Sure, if the performance penalty is really low, then that would
considerably simplify our work. So the first thing to do would be to
benchmark the real performance penalty of 16-byte versus 32-byte
alignment.

gael
Rohit Garg
2011-12-07 17:05:20 UTC
If you compile with -mavx, GCC will not generate any legacy SSE code:
the SSE intrinsics compile to the corresponding VEX-encoded AVX
instructions. So this penalty is not much of a bother.
--
Rohit Garg

http://rpg-314.blogspot.com/

Graduate Student
Applied and Engineering Physics
Cornell University