Discussion: [eigen] Performance colwise matrix mult vs. raw loop
Norbert Wenzel
2018-01-13 12:20:11 UTC
Hi,

I'm using Eigen in a program that reads a list of (millions of) 3d
points (in bulk), (rigidly) transforms these points and then processes
the transformed points. While optimizing the program (on Linux) I found
that I could gain a few percent of runtime by replacing a colwise matrix
multiplication with a raw loop that (from my perspective) does
essentially the same thing. Note that the runtime gain was measured for
the whole program, not only for the transformation part alone.

My first question is whether my assumption that the following two code
fragments do essentially the same thing, and are therefore comparable,
is correct:

using pointlist = Eigen::Matrix<double, 3, Eigen::Dynamic, Eigen::RowMajor>;
using transform = Eigen::Transform<double, 3, Eigen::Isometry>;

extern pointlist points;
extern const transform trans;

void eigen_matrix_transformation()
{
    points = (trans * points.colwise().homogeneous());
}

void raw_loop_transformation()
{
    for(auto c = points.cols(); c > 0; --c)
    {
        points.col(c-1) = trans * points.col(c-1).homogeneous();
    }
}

Although the number of points in the matrix is not known at compile
time, the dimensionality of the points is, which should be enough to
determine the sizes of any intermediate storage at compile time.

I've extracted the code[0] from my program and found that the runtime
difference between these implementations is ~3x. When comparing the
generated code[1] it seems that the colwise matrix multiplication
allocates dynamic memory whereas the raw loop does not. (At least the
colwise version contains a code path that can throw std::bad_alloc.)

Note that I could reproduce these results on different Linux machines,
but not on Windows. Interestingly, on my Windows laptop the overall
benchmark ran faster in a Linux virtual machine than on the Windows host.

I'm not sure whether I'm using a sub-optimal Eigen function for my task
or whether the Geometry module (or at least this part of it) is simply
less optimized because other parts of Eigen were more important to
optimize. So I'd like to hear from you whether you can reproduce my
benchmark results and/or consider this an issue. Is this perhaps already
known?

I've currently replaced the colwise matrix multiplication with a raw
loop in my code, so this is not a pressing issue for me at all. I was
just really surprised by the results I got.

I'd like to hear your opinions about this benchmark.

Thanks for your work on Eigen,
Norbert

[0] https://github.com/norbertwenzel/eigen-benchmark
[1] https://godbolt.org/g/atG6uA
Gael Guennebaud
2018-01-15 10:54:11 UTC
Hi,

The two pieces of code are not equivalent, because one applies the
transformation in-place whereas the other needs to allocate a temporary.
Nonetheless, the root of the "problem" does not lie there, and it is a
bit more complicated. Let's consider a simpler expression:

Matrix3d A;
Matrix3Xd pts(3,N);

pts = A * pts;

In this expression the product cannot be carried out in-place because of
aliasing issues, so a temporary has to be created. Of course, as you
realized, if we evaluate the result one column at a time, then instead
of allocating a whole 3xN temporary it is enough to allocate a 3x1
temporary vector, and since its size is known at compile time and is
very small, it can be "allocated" on the stack and even optimized away
by the compiler. Unfortunately there is no way for Eigen to figure this
out, especially at compile time. When there is no aliasing at all, the
user can tell Eigen so:

pts_bis.noalias() = A * pts;

In your case, we would need some kind of colwise/rowwise noalias to tell
Eigen that there is no aliasing across columns or rows. To be honest
I've never thought about that possibility, but given the significant
performance hit, and that this kind of aliasing is probably the most
frequent when talking about matrix products, it might be worth the
effort. I'm still not sure what would be good names for the API, though.

gael

On Sat, Jan 13, 2018 at 1:20 PM, Norbert Wenzel wrote:
Post by Norbert Wenzel
[...]
Norbert Wenzel
2018-01-16 20:40:00 UTC
Post by Gael Guennebaud
The two pieces of code are not equivalent, because one applies the
transformation in-place whereas the other needs to allocate a temporary. [...]
Thanks for the explanation.
Post by Gael Guennebaud
[...] the product cannot be carried out in-place because of aliasing
issues, so a temporary has to be created. Of course, as you realized, if
we evaluate the result one column at a time, then instead of allocating
a whole 3xN temporary it is enough to allocate a 3x1 temporary vector,
and since its size is known at compile time and is very small, it can be
"allocated" on the stack and even optimized away by the compiler.
Unfortunately there is no way for Eigen to figure this out, especially
at compile time. [...]
I understand that you cannot determine aliasing issues at compile time.
But I fail to see why there is no way to determine the size of the
temporary. Do you mean this cannot be done in Eigen as it is now, or
that this generally cannot be determined at compile time?
Post by Gael Guennebaud
In your case, we would need some kind of colwise/rowwise noalias [...].
To be honest I've never thought about that possibility, but [...] this
might be worth the effort.
Thanks for considering this possibility.

As I've already said, this issue was primarily surprising to me because
I thought colwise/rowwise would essentially boil down to the loop code
I've written manually. Your explanation shed some light on the issue,
but it clearly demonstrated that I need to understand Eigen's internals
better when I care about runtime performance.

Best regards,
Norbert
Gael Guennebaud
2018-01-17 22:26:32 UTC
On Tue, Jan 16, 2018 at 9:40 PM, Norbert Wenzel wrote:
Post by Norbert Wenzel
I understand that you cannot determine aliasing issues at compile time.
But I fail to see why there is no way to determine the size of the
temporary. Do you mean this cannot be done in Eigen as it is now, or
that this generally cannot be determined at compile time?
This is very difficult, because aliasing can appear in very different
ways, e.g.:

M = A * M.transpose();

with M square. Moreover, depending on the sizes of the matrices, a
column-by-column (or row-by-row) evaluation can be extremely slow, and
creating a whole temporary can be orders of magnitude faster.
Furthermore, in order not to kill the performance of small matrices, we
must decide on the evaluation strategy through very few runtime checks.

gael
Post by Norbert Wenzel
[...]
Norbert Wenzel
2018-01-18 21:43:10 UTC
Post by Gael Guennebaud
This is very difficult, because aliasing can appear in very different
ways, e.g.: [...]
Thanks for taking the time to explain this to me. This was very enlightening.

Norbert
