Discussion:
[eigen] RFC: making a deterministic and reproducible product codepath with Eigen
Jason Newton
2016-09-02 07:47:23 UTC
Permalink
Hey Gael et al,

I was trying to understand the underlying implementation of the General
Matrix Matrix and Matrix Vector products - I haven't succeeded there (is
there a write-up on it somewhere?), but I thought I'd inquire about making
a codepath whose results other software can match, provided they play by
the same rules. Maybe this means evaluating products the way the cblas/blas
reference does, maybe it means the naive approach - or maybe it means
letting the user switch between those two implementations. Maybe the MKL /
BLAS code path can be leveraged for this and enhanced (to support reference
BLAS and general inverse as well).

The idea for control is to set a macro up top to select this preference,
like the existing default storage order or parallelization/vectorization
knobs.
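For instance, something along these lines (EIGEN_USE_COEFF_BASED_PRODUCTS is
the hypothetical new knob; EIGEN_DONT_VECTORIZE and
EIGEN_DEFAULT_TO_ROW_MAJOR_STORAGE_ORDER are the existing macros I have in
mind as precedents):

// Hypothetical: select the reproducible/reference product path project-wide,
// in the same spirit as Eigen's existing configuration macros.
#define EIGEN_DONT_VECTORIZE
#define EIGEN_USE_COEFF_BASED_PRODUCTS  // not an existing Eigen macro
#include <Eigen/Dense>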

The advantage of doing this is that when porting code from one context to
another (be it GPUs, or different languages - like python/numpy) we can get
a 100% bit-exact match as long as both domains follow the same algorithms
(and deal with rounding the same way, another topic). That provides a
fairly strong guarantee that the ported code, or the code in the other
domain, is correct (provided a large enough input space is used for
coverage) - and when not performing that verification, we can go back to
the fast paths. Further, the result is by necessity reproducible across
machines (maybe not across architectures, but certainly across processor
configs) - this also has value to many people who are willing to take a
performance penalty to attain it. I think Eigen already achieves this when
OpenMP is disabled, and on mixed-generation machines potentially when
vectorization is disabled, but I thought I'd throw it out there too.

I can tell you that for small matrices on a GPU (small enough to fit in
local memory) the naive approach works very well for parallel computation,
and those are probably the most frequently used matrix sizes.

Python/numpy doesn't care and relies fully on the underlying BLAS
implementation; users can fairly easily inspect which one they're using,
and cblas is supported by default.

I thought of bringing this up on the ML after reading the following
(listed to show the property is desirable, not that this is my first time
encountering the issue):

http://stackoverflow.com/questions/22116553/why-result-of-matrix-multiplication-with-eigen-is-different-from-standard-vector
https://forum.kde.org/viewtopic.php?f=74&t=119907#p303319

-Jason
Jason Newton
2016-09-05 22:12:29 UTC
Permalink
Following up.

I missed the Goto paper reference for blocked matrix products that sits way
up above the implementation - so after studying that paper and some of
Eigen's source, I know what's happening now, or at least what should be
happening - the actual implementation in Eigen still loses me.

But I did find out about GeneralProduct and the product_type_selector, and
CoeffBasedProductMode - which seems to be the naive C implementation I was
after. So I intend to try things with the following patch - Gael, am I on
the right path in option and strategy?

diff -r c04c7f10894d Eigen/src/Core/GeneralProduct.h
--- a/Eigen/src/Core/GeneralProduct.h Mon Sep 05 17:14:20 2016 +0200
+++ b/Eigen/src/Core/GeneralProduct.h Mon Sep 05 18:00:29 2016 -0400
@@ -81,6 +81,16 @@
  * This is a compile time mapping from {1,Small,Large}^3 -> {product types}
  */
 // FIXME I'm not sure the current mapping is the ideal one.
 template<int M, int N> struct product_type_selector<M,N,1>                { enum { ret = OuterProduct }; };
+#ifdef EIGEN_USE_COEFF_BASED_PRODUCTS
+template<int M, int N, int Depth> struct product_type_selector<M,N,Depth> { enum { ret = CoeffBasedProductMode }; };
+template<int M> struct product_type_selector<M, 1, 1>                     { enum { ret = LazyCoeffBasedProductMode }; };
+template<int N> struct product_type_selector<1, N, 1>                     { enum { ret = LazyCoeffBasedProductMode }; };
+template<> struct product_type_selector<Small, Small, 1>                  { enum { ret = LazyCoeffBasedProductMode }; };
+template<> struct product_type_selector<Small, Large, 1>                  { enum { ret = LazyCoeffBasedProductMode }; };
+template<> struct product_type_selector<Large, Small, 1>                  { enum { ret = LazyCoeffBasedProductMode }; };
+template<int Depth> struct product_type_selector<1, 1, Depth>             { enum { ret = InnerProduct }; };
+template<> struct product_type_selector<1, 1, 1>                          { enum { ret = InnerProduct }; };
+#else
 template<int M> struct product_type_selector<M, 1, 1>                     { enum { ret = LazyCoeffBasedProductMode }; };
 template<int N> struct product_type_selector<1, N, 1>                     { enum { ret = LazyCoeffBasedProductMode }; };
 template<int Depth> struct product_type_selector<1, 1, Depth>             { enum { ret = InnerProduct }; };
@@ -104,6 +114,7 @@
 template<> struct product_type_selector<Large,Small,Small>                { enum { ret = CoeffBasedProductMode }; };
 template<> struct product_type_selector<Small,Large,Small>                { enum { ret = CoeffBasedProductMode }; };
 template<> struct product_type_selector<Large,Large,Small>                { enum { ret = GemmProduct }; };
+#endif

} // end namespace internal


There is also cblas's way of doing matrix mults - which is to distribute
each rhs coefficient to all the output locations it contributes to, in a
single pass. Maybe this becomes EIGEN_USE_DISTRIBUTE_BY_RHS_COEFF_PRODUCTS?
I'm not sure I can do a packetmath implementation of this, but its purpose
is validation with the present API, not speed.
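For illustration, a minimal standalone sketch of the two orderings
(hypothetical helper functions, column-major storage, no transposes or
blocking - not Eigen's actual kernels):

#include <cstddef>

// Naive "dot product" ordering: each C(i,j) is one left-to-right reduction over k.
void gemm_naive(const double* A, const double* B, double* C,
                std::size_t m, std::size_t n, std::size_t k)
{
  for (std::size_t j = 0; j < n; ++j)
    for (std::size_t i = 0; i < m; ++i) {
      double acc = 0.0;
      for (std::size_t l = 0; l < k; ++l)
        acc += A[i + l * m] * B[l + j * k];  // A is m x k, B is k x n
      C[i + j * m] = acc;
    }
}

// Reference-BLAS style: for each rhs coefficient B(l,j), do an axpy of column l
// of A into column j of C. Same arithmetic, different rounding order.
void gemm_axpy(const double* A, const double* B, double* C,
               std::size_t m, std::size_t n, std::size_t k)
{
  for (std::size_t j = 0; j < n; ++j) {
    for (std::size_t i = 0; i < m; ++i) C[i + j * m] = 0.0;
    for (std::size_t l = 0; l < k; ++l) {
      const double b = B[l + j * k];
      for (std::size_t i = 0; i < m; ++i)
        C[i + j * m] += A[i + l * m] * b;
    }
  }
}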


-Jason
Post by Jason Newton
Hey Gael et al,
I was trying to understand the underlying implementation of the General
Matrix Matrix and Matrix Vector products - I haven't succeeded there (is
there a write up on it somewhere?) but I thought I'd inquire about making a
codepath which other software can match the results of, provided they play
by the same rules. Maybe this means evaluating products like cblas/blas
reference does, maybe this means the naive approach - maybe allow the user
to switch between those 2 implementations. Maybe the MKL / blas code path
can be leveraged for this and enhanced (to support reference blas and
general inverse as well).
The idea for control is to set a macro up top to select this preference,
like the default storage order or palatalization/vectorization knobs.
The advantage of doing this is when porting code from one context to
another (be it GPUs, or different languages - like python/numpy) we can get
a 100% bit-exact match as long as both domains follow the same algorithms
(and deal with rounding the same way, another topic) which provides a
fairly strong guarantee that the ported code/code in another domain is
correct (provided a large enough input space is used for coverage) - and
when not performing that verification, we can go back to the fast paths.
Further, it is by necessity a reproducible result, across machines (maybe
not architectures, but certainly processor configs) - this also has value
to many people who are willing to take performance penalties to attain - I
think Eigen already achieves this when disabling openmp, and in mixed
generation machines, disabling vectorization potentially, but I thought I'd
throw it out there too.
I can tell you small matrices on a GPU (small enough to fit in local
memory) - the naive approach works very well for parallel computations and
probably are most frequently used matrices sizes.
Python/numpy doesn't care and fully relies on the underlying blas
implementation,, which users can fairly easily inspect what they're using,
and they support cblas by default.
Thought of bringing this up on the ML after reading (listed to show the
http://stackoverflow.com/questions/22116553/why-result-
of-matrix-multiplication-with-eigen-is-different-from-standard-vector
https://forum.kde.org/viewtopic.php?f=74&t=119907#p303319
-Jason
Gael Guennebaud
2016-09-07 20:48:03 UTC
Permalink
Post by Jason Newton
User pluggable products would also allow us to expand out further without
diminishing returns to the core, such as using parallel prefix sum based
reductions that one would do in GPUs for the accumulations - or other
reduction trees. The only thing this seems to start opening up is that the
user may want to start mixing which kind of product is being used - I'm not
yet sure how to deal with that, with all the compile time dispatch... any
ideas?
That's already possible, either using a customized BLAS backend [1] or
through template specialization for your favorite scalar type.

The latter can be done at different stages of the processing of the
expression and requires playing with Eigen's internals. For instance, one
can specialize internal::general_matrix_matrix_product as we do for calling
BLAS *gemm, or specialize generic_product_impl to work on the unprocessed
operands.
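To make the BLAS-backend route [1] concrete, here is a rough sketch of a
reference-ordering dgemm_ one could link instead of an optimized BLAS (only
the no-transpose, column-major case is handled, and Fortran ABI details such
as hidden string-length arguments are glossed over - a real backend must of
course provide the full *gemm/*gemv family):

// Reference-style DGEMM sketch: C = alpha*A*B + beta*C, column-major.
extern "C" void dgemm_(const char* transa, const char* transb,
                       const int* m, const int* n, const int* k,
                       const double* alpha, const double* A, const int* lda,
                       const double* B, const int* ldb,
                       const double* beta, double* C, const int* ldc)
{
  if (*transa != 'N' || *transb != 'N') return; // only NN handled in this sketch
  for (int j = 0; j < *n; ++j) {
    for (int i = 0; i < *m; ++i)
      C[i + j * *ldc] *= *beta;
    for (int l = 0; l < *k; ++l) {
      const double tmp = *alpha * B[l + j * *ldb];
      for (int i = 0; i < *m; ++i)
        C[i + j * *ldc] += tmp * A[i + l * *lda];
    }
  }
}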
I must also add that this will only work if you use the same Eigen
version everywhere. Different versions of Eigen might rewrite some
expressions in different ways.
Post by Jason Newton
Is this for more complex Eigen utilities, like JacobiSVD/inversion or do
you mean even for core ops - like product?
This of course concerns matrix decompositions, but also products, as most
optimizations and perf tuning on them will imply some changes to the
scheduling of the operations.

gael


[1] http://eigen.tuxfamily.org/dox-devel/TopicUsingBlasLapack.html
Jason Newton
2016-09-06 09:08:09 UTC
Permalink
Hi Peter,

For FMA/SSE and whatnot, you must ensure that part is the same in both
implementations, but what matters is that the reduction ordering is kept
the same once that part is fixed. It is one more thing, like rounding, that
will need attentiveness - gcc will let you escape the 80-bit registers, for
instance, by relying on SSE for everything. I was not able to see how
having more than one FPU could wriggle itself inside the reduction ordering
- at least on a GPU I know this won't happen, and there are many FPUs
there.

Itanium I'm not sure about at all, but those are going by the wayside, right?

I agree it is possible that, at least without managing compiler settings
carefully, there will be some situations where we cannot attain this
property, but most of the time it's achievable and of high value.

It also might be good to have a user-pluggable matrix product calculator -
this would let you fiddle with the reduction ordering, to use, say,
deterministic reduction trees or different blocking/tiling configurations.
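For illustration, a minimal sketch of one such deterministic reduction: a
fixed-shape pairwise dot product (hypothetical standalone code; the tree
depends only on the length, so the rounding order is the same wherever it is
executed serially):

#include <cstddef>

// Pairwise (binary-tree) dot product: the reduction tree is a function of n
// only, so the order of the floating-point additions is fixed for a given
// length, independent of blocking or vector width.
double dot_pairwise(const double* x, const double* y, std::size_t n)
{
  if (n == 0) return 0.0;
  if (n == 1) return x[0] * y[0];
  const std::size_t half = n / 2;
  return dot_pairwise(x, y, half) + dot_pairwise(x + half, y + half, n - half);
}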

-Jason
Dear Jason,
Post by Jason Newton
The advantage of doing this is when porting code from one context to
another (be it GPUs, or different languages - like python/numpy) we can get
a 100% bit-exact match as long as both domains follow the same algorithms
(and deal
with rounding the same way, another topic) which provides a fairly strong
guarantee that the ported code/code in another domain is correct (provided
a large enough input space is used for coverage)
I doubt that this is possible, even if the code is single threaded.
Just changing between machines with FMA and without will significantly
change the result.
x86 with the 80-bit registers gives you different results compared to
other architectures. If you have more
than one FPU you can't be sure of the ordering within the scalar products,
and reordering
can/will change the result. And if you happen to use an Itanium machine,
you never know what the compiler produces.
It may work in many cases, but at least it doesn't for my main application.
Best regards,
Peter
Jeff Hammond
2016-09-06 13:46:36 UTC
Permalink
Post by Jason Newton
Hi Peter,
For FMA/SSE and what not, you must ensure that part is the same on both
implementation, but what matters is the reduction ordering is kept the same
after that part is fixed. It is one more thing like rounding that will
need attentiveness - gcc will let you escape the 80-bit register for
instance by relying on SSE for everything. I was not able to understand
how more than one FPU could wriggle itself inside the reduction ordering -
at least on on a GPU, I know this won't happen and there are many FPUs
there.
For what it's worth, there is quite a bit online about this topic, e.g.

# Intel-specific

* https://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler
* https://software.intel.com/sites/default/files/article/326703/fp-control-2012-08.pdf (slides on the previous topic)
* https://software.intel.com/en-us/articles/differences-in-floating-point-arithmetic-between-intel-xeon-processors-and-the-intel-xeon

The first link has proven to be of great practical relevance in the context
of https://github.com/elemental/Elemental/issues/179, as one example.

# Blogs/similar on this topic

* https://blogs.msdn.microsoft.com/shawnhar/2009/03/25/is-floating-point-math-deterministic/
* http://christian-seiler.de/projekte/fpmath/
* http://stackoverflow.com/questions/20963419/cross-platform-floating-point-consistancy
* http://yosefk.com/blog/consistency-how-to-defeat-the-purpose-of-ieee-floating-point.html
Post by Jason Newton
Itanium not sure of at all but those are going by the wayside, right?
The most recent product was released in 2012Q4 (
http://ark.intel.com/products/family/451/Intel-Itanium-Processor). I
cannot comment on what Intel might do in the future.
Post by Jason Newton
I agree it is possible, at least without managing compiler settings
carefully, that there will be some situations where we cannot attain this
property, but most of the time it's achievable and high value.
Best,

Jeff (who works for Intel, should it matter)
--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/
Peter
2016-09-06 14:09:26 UTC
Permalink
Dear Jason,
Post by Jason Newton
For FMA/SSE and whatnot, you must ensure that part is the same in both implementations, but what matters is that the reduction ordering is kept the same once that part is fixed. It is one more thing, like rounding, that will need
attentiveness - gcc will let you escape the 80-bit registers for instance by relying on SSE for everything. I was not able to see how having more than one FPU could wriggle itself inside the reduction ordering - at least on a
GPU, I know this won't happen and there are many FPUs there.
GPUs are pretty simple; in contrast, some processors have branch prediction and out-of-order execution in hardware.
It's beyond my knowledge whether scalar products will always be scheduled in the same way on the FPUs by the hardware,
especially if the scalar product appears after an if-statement.
I'm just sceptical, but maybe I just got surprised too often.
Post by Jason Newton
Itanium I'm not sure about at all, but those are going by the wayside, right?
Yes it's dead, but it's a prime example where the compiler messes around a lot.
Post by Jason Newton
I agree it is possible, at least without managing compiler settings carefully,
o.k., with the help of the compiler one might get far.
Post by Jason Newton
It also might be good to have a user-pluggable matrix product calculator - this would let you fiddle with the reduction ordering, to use, say, deterministic reduction trees or different blocking/tiling configurations.
Could also be interesting to investigate performance issues.

Best regards,
Peter
Christoph Hertzberg
2016-09-08 17:06:43 UTC
Permalink
Post by Peter
It's beyond my knowledge, whether scalar products will always be
scheduled in the same way on the FPUs by the hardware,
especially if the scalar product appears after an if-statement.
I'm just sceptical, but maybe I just got surprised too often.
I would be very surprised if a CPU gave different results for the very
same machine instructions (given the same inputs and FPU configuration, of
course), solely based on internal scheduling.

Overall, I'm not sure your goal is really feasible to achieve. If you
don't care (too much) about performance, then maybe by providing an "always
use simple 3-loop matrix products" path, and always evaluating as if we
followed C++ operator precedence (which takes away many optimization
opportunities, e.g. for (Scalar * A * B + C)). Certainly, many element-wise
operations (R = A+B+C) could still be lazily evaluated (and even
vectorized), but allowing vectorization even for simple reductions would
make the result depend on the packet size.
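For illustration, a sketch of what "evaluate as if we followed C++ operator
precedence" could look like on the user side, using .eval() to pin the
grouping (a hypothetical usage pattern, not a proposed API; the inner
product kernel is still whatever Eigen selects):

#include <Eigen/Dense>

void example(double s, const Eigen::MatrixXd& A, const Eigen::MatrixXd& B,
             const Eigen::MatrixXd& C, Eigen::MatrixXd& R)
{
  // Fast path: Eigen is free to regroup, e.g. fold s into the GEMM call.
  R = s * A * B + C;

  // "Precedence" path: materialize each intermediate left to right,
  // so the expression is evaluated exactly as ((s*A)*B)+C.
  R = (((s * A).eval() * B).eval() + C).eval();
}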
Things like blocking, expression-tree reordering and vectorization will
likely change between Eigen versions (and depending on the CPU). Making all
of that user-configurable will likely become a maintenance nightmare over
time.

For the given use-case, I guess the simplest solution would be to
hand-craft a simple library with naive but stable implementations.

Alternatively, if you mostly care about hardware independence (but not
Eigen-version independence), it might be possible to compile to some
intermediate code which is then translated to equivalent code for each
architecture (which is one of the main ideas of LLVM, I guess).
Of course, you'd also need to get rid of any run-time dependent
decisions (like determining block-sizes from CPU cache sizes).



Christoph
--
Dipl. Inf., Dipl. Math. Christoph Hertzberg

Universität Bremen
FB 3 - Mathematik und Informatik
AG Robotik
Robert-Hooke-Straße 1
28359 Bremen, Germany

Zentrale: +49 421 178 45-6611

Besuchsadresse der Nebengeschäftsstelle:
Robert-Hooke-Straße 5
28359 Bremen, Germany

Tel.: +49 421 178 45-4021
Empfang: +49 421 178 45-6600
Fax: +49 421 178 45-4150
E-Mail: ***@informatik.uni-bremen.de

Weitere Informationen: http://www.informatik.uni-bremen.de/robotik
Peter
2016-09-08 19:45:21 UTC
Permalink
Dear Christoph,
Post by Peter
It's beyond my knowledge, whether scalar products will always be
scheduled in the same way on the FPUs by the hardware,
especially if the scalar product appears after an if-statement.
I'm just sceptical, but maybe I just got surprised too often.
I would be very surprised, if a CPU would give different results for the very same machine instructions (given the same inputs and FPU-configuration, of course), solely based on internal scheduling.
In case you are interested, there's e.g. HP's Dynamo project,
<http://www.hpl.hp.com/techreports/1999/HPL-1999-77.html>,
which messes around with binaries. And for scalar products, it's sufficient
to change the order of evaluation to lose bit-wise accuracy; e.g. the
scalar product of (1, 1e-50, 1) with (1, 1, -1) is a simple example.
I'm just not sure how far the processors mess around.
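For illustration, a tiny sketch of that example in double precision (the
left-to-right and reordered accumulations give 0 and 1e-50 respectively,
because 1 + 1e-50 rounds back to 1):

#include <cstdio>

int main()
{
  const double x[3] = {1.0, 1e-50, 1.0};
  const double y[3] = {1.0, 1.0, -1.0};

  // Left-to-right: ((1*1 + 1e-50*1) + 1*(-1)) = 1 - 1 = 0, since 1 + 1e-50 == 1.
  const double a = x[0] * y[0] + x[1] * y[1] + x[2] * y[2];

  // Reordered: ((1*1 + 1*(-1)) + 1e-50*1) = 0 + 1e-50 = 1e-50.
  const double b = (x[0] * y[0] + x[2] * y[2]) + x[1] * y[1];

  std::printf("%g vs %g\n", a, b); // prints: 0 vs 1e-50
  return 0;
}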
For the given use-case, I guess the simplest solution would be to hand-craft a simple library with naive but stable implementations.
I agree, using an F77 BLAS should be sufficient. Although I still don't understand what one learns from bypassing all optimizations.

If correctness is important one should switch to exact scalar product, like in C-XSC,
which removes the dependence on the order of evaluation and just _has_ to provide the same result everywhere.
BTW, exact scalar products could be an interesting extension to Eigen in some future version,
opening the door to verified computing.

Best regards,
Peter
Jason Newton
2016-09-08 20:45:55 UTC
Permalink
FYI, I came across ReproBLAS a few months ago
( http://bebop.cs.berkeley.edu/reproblas/ ) and it helped me sort out the
different goals one can have across the different cases of determinism and
reproducibility that exist - I've had pretty much all of them at some
point. The BLAS package is unfortunately incomplete, but I've been
interested in trying out a DIY C++ "scalar" for their index type and seeing
how Eigen might work with it - of course I've now learned a little more
about how the BLAS hookup is done in Eigen, and that might be a path to
trying that out. Something like 4x memory per scalar buys unlimited
parallel-reduction scalability, with higher precision coming along for the
ride - a pretty good tradeoff. Take a look at their papers sometime if the
topic interests you.
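As a rough sketch of the "DIY C++ scalar" idea (a hypothetical compensated-
summation type, not ReproBLAS's actual binned format - the point is just the
plumbing one needs, including an Eigen::NumTraits specialization; more
operators would be required for anything beyond sums and products):

#include <Eigen/Core>
#include <cmath>

// Hypothetical scalar: a value plus a running compensation term.
struct CompDouble {
  double v, err;
  CompDouble() : v(0), err(0) {}
  CompDouble(double x) : v(x), err(0) {}
  double value() const { return v + err; }
};

inline CompDouble operator*(const CompDouble& a, const CompDouble& b)
{ return CompDouble(a.value() * b.value()); }

inline CompDouble operator+(const CompDouble& a, const CompDouble& b)
{
  // Neumaier-style two-sum: capture the rounding error of s = a.v + b.v.
  CompDouble r;
  const double s = a.v + b.v;
  const double t = std::abs(a.v) >= std::abs(b.v) ? (a.v - s) + b.v
                                                  : (b.v - s) + a.v;
  r.v = s;
  r.err = a.err + b.err + t;
  return r;
}

inline CompDouble& operator+=(CompDouble& a, const CompDouble& b)
{ a = a + b; return a; }

namespace Eigen {
template<> struct NumTraits<CompDouble> : NumTraits<double> {
  typedef CompDouble Real;
  typedef CompDouble NonInteger;
  typedef CompDouble Nested;
  enum { IsComplex = 0, IsInteger = 0, IsSigned = 1, RequireInitialization = 1,
         ReadCost = 2, AddCost = 8, MulCost = 2 };
};
}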

One such goal is software that is obligated to always get the same results
- a difficult guarantee without such tools if you don't have fixed
reduction orderings, and it happens...

Apparently this is a re-emerging topic that several corporations and groups
are starting to take note of and discuss:
http://www.netlib.org/utk/people/JackDongarra/WEB-PAGES/Batched-BLAS-2016/

Part of the reason I brought this topic up is that I want strong confidence
that my software is correct relative to where it was first written
(python), and testing any other way is hard / of low value. I've seen it
happen with Kalman filters / state estimation algorithms - they just work
in the presence of errors introduced by the programmer (since those can be
treated as noise), that is until they don't. Debugging decisions made from
quantities that vary can also be a PITA if they aren't sufficiently
reproducible - though it looks like Eigen is in the clear for this
category, perhaps even for its OpenMP path.

Also, with how the BLAS paths in Eigen have been coded up, it looks like
they also disable lazy evaluation - or am I incorrect in that assessment?
Essentially they have to submit the entire problem at once to the
appropriate routines. Since that product path has that capability, a
user-defined product path could be added (without much being added to Eigen
itself - probably just by clearing out, via a macro, the product selector
specializations so the user can override them), to allow whatever other
types of products: perhaps reference BLAS's strategy (definitely a good
target along with the naive one), complex reduction trees, or a stable
blocking-based algorithm - while keeping all of Eigen's existing
constructs. Further, it might be possible to make the static dispatch
reroute to a dynamic dispatch that uses some flags to control which product
type is used, if the user sets that up appropriately. Perhaps a few of
these common ones could go into unsupported modules. Or maybe it should be
approached via BLAS, as Gael mentioned earlier. A shortcoming of that
strategy is that it is limited to the BLAS interfaces and to float/double -
I'm not quite decided whether that's an issue (it may be for integers near
special cases) - and then the user has to define a BLAS, which might
break/clash in larger software.
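A minimal sketch of the dynamic-dispatch idea at the user level
(hypothetical wrapper and flag; it just picks between Eigen's blocked GEMM
path and the lazy coefficient-based product, without touching the internal
selector):

#include <Eigen/Dense>

// Hypothetical switch a test harness could flip at runtime.
bool g_deterministic_products = false;

Eigen::MatrixXd product(const Eigen::MatrixXd& A, const Eigen::MatrixXd& B)
{
  if (g_deterministic_products)
    // lazyProduct evaluates coefficient-wise (no blocking/packing),
    // which is easier to mirror in another implementation.
    return A.lazyProduct(B);
  // Default: Eigen's optimized GEMM path.
  return A * B;
}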

-Jason
Christoph Hertzberg
2016-09-09 09:15:49 UTC
Permalink
Post by Peter
In case you are interested, there's e.g. HP's Dynamo project,
<http://www.hpl.hp.com/techreports/1999/HPL-1999-77.html>,
which messes around with binaries. And for scalar products, it's
sufficient to change the order of evaluation,
to lose bit-wise accuracy, e.g. the scalar product of ( 1, 1e-50, 1)
with ( 1, 1, -1 ) is a simple example.
Sure, I'm aware that IEEE math is non-associative ...
Post by Peter
I'm just not sure how far the processors mess around.
By design they should only execute things out of order if the
instructions (or the generated micro-instructions) are independent.
Everything beyond that would be insane, IMO. I'm not a CPU expert, but
I'm pretty sure a lot of people would have complained about it if CPUs
did that.
Post by Peter
I agree, using a F77 BLAS should be sufficient. Although I still don't
understand what one learns from bypassing all optimizations.
If you have a BLAS implementation that does exactly the same on every
target architecture, you should be fine as well, of course. I don't know
what the status of F77->GPU compilers is.
Post by Peter
If correctness is important one should switch to exact scalar product, like in C-XSC,
which removes the dependence on the order of evaluation and just _has_
to provide the same result everywhere.
The original RFC was just on reproducibility not on exactness, I think.
Post by Peter
BTW, exact scalar products could be an interesting extension to Eigen in
some future version,
opening the door to verified computing.
Sure, that would be interesting. I'm not sure how complicated this will
be to integrate though. And it will certainly be significantly slower.


Christoph
--
Dipl. Inf., Dipl. Math. Christoph Hertzberg

Universität Bremen
FB 3 - Mathematik und Informatik
AG Robotik
Robert-Hooke-Straße 1
28359 Bremen, Germany

Zentrale: +49 421 178 45-6611

Besuchsadresse der Nebengeschäftsstelle:
Robert-Hooke-Straße 5
28359 Bremen, Germany

Tel.: +49 421 178 45-4021
Empfang: +49 421 178 45-6600
Fax: +49 421 178 45-4150
E-Mail: ***@informatik.uni-bremen.de

Weitere Informationen: http://www.informatik.uni-bremen.de/robotik
Rasmus Larsen
2016-09-09 15:38:10 UTC
Permalink
Just to throw in my 2 cents (mostly in the context of linear algebra): I
understand that some perceive a great benefit in having 100% reproducible
and deterministic computations, but I don't think it is realistic or even
particularly useful in a parallel or multi-threaded environment, unless you
want to grossly sacrifice performance. In my practical experience, this
idea often comes from a desire to get predictable outcomes from regression
tests written without proper knowledge of the mathematics of the problem,
i.e. error analysis. I have found time and time again that looking up the
correct error bound in the literature (or deriving it) and possibly adding
a small fudge factor solves such problems in a way that might even yield
useful insights.
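For illustration, a sketch of that style of regression test (the bound used
here, roughly fudge * k * eps * ||A|| * ||B|| for a depth-k accumulation, is
a generic norm-wise forward-error shape with a hypothetical fudge factor,
not a quoted result):

#include <Eigen/Dense>
#include <limits>

// Compare a result C against Eigen's product using a norm-wise error bound
// instead of bit-exact equality.
bool gemm_close_enough(const Eigen::MatrixXd& A, const Eigen::MatrixXd& B,
                       const Eigen::MatrixXd& C, double fudge = 10.0)
{
  const double eps = std::numeric_limits<double>::epsilon();
  const double bound = fudge * A.cols() * eps * A.norm() * B.norm();
  return (C - A * B).norm() <= bound;
}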

Jason Newton
2016-09-26 20:46:26 UTC
Permalink
Rasmus,

It depends. For instance, are we talking about small/tiny matrices? If
that's the case, coefficient-based algorithms are already the fastest
implementations - vectorization is something to ponder having on or off,
along with the packet size's implications. You can have that in
parallel/multi-threaded settings as well, as a building block - however I'm
guessing you're thinking of just large matrix mults; performance can still
be good there, but yes, we're not going to get the highest theoretical
throughput. It's not a problem of understanding the mathematics in one's
solution, it's a problem of reproducibility - always getting the same
answer given the same inputs carries a lot more strength than error
analysis when we're talking about the computation part, in terms of fault
detection and system consistency. Error analysis qualifies how good and
useful the answers are to the application (which also matters a lot!) - but
the two shouldn't be expected to be substitutes for each other. Also,
tracking an error bound in large/complex software quickly becomes
intractable - it is reasonable to do for single routines, but it would be
very expensive (more so than reproducible software) to do for all the
software we'd want to understand in a larger system.
-Jason