Discussion:
[eigen] Intel (R) MKL IE SpBLAS support in Eigen
Zhukova, Maria
2018-04-03 21:39:16 UTC
Hello Eigen community,

My name is Maria Zhukova and I'm a software development engineer on the Intel(R) MKL Sparse team.

My team is interested in contributing to Eigen, so I've investigated our possibilities, and so far this is what I have:
Eigen supports different operations for sparse matrices stored in CSR and CSC formats, which can be implemented on the basis of IE SpBLAS kernels (please refer to https://software.intel.com/en-us/mkl-developer-reference-c-inspector-executor-sparse-blas-routines for the general idea of the interfaces).
Basically, we want to implement calls to our IE SpBLAS in the following operations:
SparseMatrix + SparseMatrix (mkl_sparse_?_add)
SparseMatrix * DenseVector (mkl_sparse_?_mv)
SparseMatrix * DenseMatrix (mkl_sparse_?_mm)
SparseMatrix * SparseMatrix (mkl_sparse_spmm),
and Triangular solve (mkl_sparse_?_trsv).
I've already started with the implementation of the sparse_time_dense_impl_mkl kernel, which is based on mkl_sparse_?_mv (included in the patch).

This is how it will look for the user:
#include <Eigen/SpBLASSupport> <-- NEW: IE SpBLAS include module

int main () {
  SparseMatrix<double, RowMajor> A;
  Matrix<double, Dynamic, 1> x, y;
  double alpha = 1.0, beta = 0.0;

  A.makeCompressed();     /* Convert matrix A into CSR/CSC format */
  A.createSparseHandle(); /* NEW: creates the handle required for all IE SpBLAS routines */

  // support of IE SpBLAS is here
  y = beta*y + alpha*A*x;             /* call to mkl_sparse_?_mv with operation = SPARSE_OPERATION_NON_TRANSPOSE */
  y = beta*y + alpha*A.transpose()*x; /* call to mkl_sparse_?_mv with operation = SPARSE_OPERATION_TRANSPOSE */
  y = beta*y + alpha*A.adjoint()*x;   /* call to mkl_sparse_?_mv with operation = SPARSE_OPERATION_CONJUGATE_TRANSPOSE */

  A.destroySparseHandle(); /* NEW: deletes the created handle */
}

I've attached a draft patch including all necessary changes and would like to hear your feedback.
Please let me know if you have any questions or comments.

Best regards,
Maria
Edward Lam
2018-04-03 23:27:50 UTC
Hi Maria,

I'm only an Eigen user, but I'm extremely happy about this effort to have more MKL
wrapping in Eigen! As our application depends on Intel TBB, the only real way to
use Eigen with multithreaded support is via MKL.
A.createSparseHandle(); /* NEW: is used to create handle required for all IE SpBLAS routines */
...
A.destroySparseHandle(); /* NEW: is used to delete created handle */
I think that this is very un-C++ like. I'm not sure why it was decided to be
done this way. Why wouldn't the sparse handle be allocated within A itself and
then deallocated in A's destructor? Note also that in C++, we need to worry
about what happens when the matrix is copied, so this state management needs to
be within the matrix itself or else the semantics of these create/destroy
handles are very troublesome.

Best Regards,
-Edward
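[Editor's note: the copy-semantics hazard Edward describes can be sketched with a small RAII wrapper. The MKL handle API is mocked here with hypothetical mock_create/mock_destroy counters; the real routines (mkl_sparse_?_create_csr, mkl_sparse_destroy) have different signatures. The point is only that each matrix copy must own its own handle, so no explicit create/destroy calls are needed and no double-free can occur.]

```cpp
#include <cassert>

// Hypothetical stand-ins for the MKL handle API; real signatures differ.
static int g_live_handles = 0;
struct mock_handle { int id; };

static mock_handle* mock_create() { ++g_live_handles; return new mock_handle{0}; }
static void mock_destroy(mock_handle* h) { --g_live_handles; delete h; }

// RAII owner: the handle lives and dies with the owning object, and a copy
// gets its own freshly created handle instead of sharing a raw pointer,
// so destruction of one copy never invalidates another (rule of three).
class SparseHandleOwner {
public:
  SparseHandleOwner() : h_(mock_create()) {}
  SparseHandleOwner(const SparseHandleOwner&) : h_(mock_create()) {}
  SparseHandleOwner& operator=(const SparseHandleOwner& o) {
    if (this != &o) { mock_destroy(h_); h_ = mock_create(); }
    return *this;
  }
  ~SparseHandleOwner() { mock_destroy(h_); }
private:
  mock_handle* h_;
};
```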
Zhukova, Maria
2018-04-04 00:31:17 UTC
Hello Edward, and thanks so much for your feedback!

This is a good point.
Actually, the correct place for creating the IE SpBLAS handle is one of the key points I want to discuss.
The thing is, all Intel(R) MKL IE SpBLAS functionality requires the matrix to be stored in a compressed row/column storage format. The necessary storage format can be obtained by calling the SparseMatrix::makeCompressed() function.
So I wonder whether the better decision would be to create the sparse handle inside makeCompressed() and destroy it in the SparseMatrix destructor. What do you think about this option?

Best regards,
Maria

Edward Lam
2018-04-04 00:58:07 UTC
Hi Maria,

I think your new proposal sounds much better. I haven't looked at your
patch, but please make sure the resource management is done properly when
the matrix is copied as well (i.e., its copy constructor and/or operator=()
method). Unfortunately, I'm a relative newcomer to Eigen and so don't
understand all the intricacies of its internals.

Thanks in advance for your hard work!

-Edward
Christoph Hertzberg
2018-04-04 09:43:36 UTC
Post by Edward Lam
Hi Maria,
I think your new proposal sounds much better. I haven't looked at your
patch but please make sure the resource management is done properly when
the matrix is copied as well (ie. it's copy constructor and/or operator=()
method). [...].
In other words, obey the rule of 5/3/0:
http://en.cppreference.com/w/cpp/language/rule_of_three

I.e., ideally have a helper struct which obeys the rule of 5 and don't
use custom copy/assignment implementations in the main class -- I'm
aware that Eigen does not strictly follow that everywhere, e.g.,
SparseMatrix only obeys the rule of 3 at the moment.
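[Editor's note: Christoph's "helper struct obeying the rule of five" suggestion can be sketched as follows. The handle API is again mocked with hypothetical counters, not real MKL calls; a copy deliberately starts without a handle and rebuilds it lazily, which matches the idea that the handle is a cache tied to the compressed storage.]

```cpp
#include <cassert>
#include <utility>

// Mock of an opaque C handle (stand-in for sparse_matrix_t); hypothetical.
static int g_handles = 0;
struct raw_handle { };
static raw_handle* create_handle()        { ++g_handles; return new raw_handle; }
static void destroy_handle(raw_handle* h) { if (h) { --g_handles; delete h; } }

// Helper struct obeying the rule of five: it alone manages the handle.
struct HandleGuard {
  raw_handle* h = nullptr;
  HandleGuard() = default;
  explicit HandleGuard(raw_handle* p) : h(p) {}
  HandleGuard(const HandleGuard&) : h(nullptr) {}  // a copy starts empty; the handle is rebuilt lazily
  HandleGuard(HandleGuard&& o) noexcept : h(std::exchange(o.h, nullptr)) {}
  HandleGuard& operator=(HandleGuard o) noexcept { std::swap(h, o.h); return *this; }
  ~HandleGuard() { destroy_handle(h); }
};

// The enclosing matrix-like class then follows the rule of zero: the
// compiler-generated copy/move/destructor are all correct automatically.
struct MatrixWithHandle {
  HandleGuard guard;
  void makeCompressed() { if (!guard.h) guard = HandleGuard(create_handle()); }
};
```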


Christoph
--
Dr.-Ing. Christoph Hertzberg

Besuchsadresse der Nebengeschäftsstelle:
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 5
28359 Bremen, Germany

Postadresse der Hauptgeschäftsstelle Standort Bremen:
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 1
28359 Bremen, Germany

Tel.: +49 421 178 45-4021
Zentrale: +49 421 178 45-0
E-Mail: ***@dfki.de

Weitere Informationen: http://www.dfki.de/robotik
-----------------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Straße 122, D-67663 Kaiserslautern
Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
(Vorsitzender) Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
Sitz der Gesellschaft: Kaiserslautern (HRB 2313)
USt-Id.Nr.: DE 148646973
Steuernummer: 19/672/50006
-----------------------------------------------------------------------
Zhukova, Maria
2018-04-05 00:11:13 UTC
Hi Christoph,

Thanks for your comments!
Regarding the sparse handle: the idea of the IE SpBLAS handle is that it should be created just once and contain internal data related to the matrix structure, different optimizations, etc., which are used in subsequent calls, so it would be very inefficient to create and destroy it on the fly.
That's why I suggested adding this handle as an additional member and including the creation routine in a place such as the makeCompressed() function (because it is required anyway to make the matrix format compatible with IE SpBLAS functionality).
I will take care of all the remarks in the next patch version once I gather more feedback.

Best regards,
Maria

Christoph Hertzberg
2018-04-04 09:37:13 UTC
Hi,

I wonder what the overhead of `mkl_sparse_?_create_cs?` and
`mkl_sparse_destroy` is. Ideally, this should be done on-the-fly using
some inline functions (have a look, e.g., at `viewAsCholmod` in
<Eigen/src/CholmodSupport/CholmodSupport.h> )

In general, avoid adding data members to Core types if possible (this
will break ABI compatibility). And if you call methods which return a
status, at least check the status and assert that it is successful
(making the user check that manually is very C-stylish).
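[Editor's note: an on-the-fly adapter in the spirit of viewAsCholmod, as Christoph suggests, might look like the sketch below. The CSR view API here is a hypothetical mock, not the real mkl_sparse_d_create_csr signature; the sketch shows the two points made above: the view wraps existing arrays without copying, and the status is asserted internally rather than pushed onto the user.]

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for a status-returning C handle API.
enum status_t { STATUS_SUCCESS = 0, STATUS_ALLOC_FAILED = 1 };
struct handle_t { const double* values; int nnz; };

static status_t create_csr_view(handle_t* h, const double* v, int nnz) {
  if (!v) return STATUS_ALLOC_FAILED;
  h->values = v; h->nnz = nnz;
  return STATUS_SUCCESS;
}
static status_t destroy_view(handle_t*) { return STATUS_SUCCESS; }

// On-the-fly adapter: wraps existing compressed-storage arrays without
// copying them, checks the status internally (so the caller never has to),
// and releases the handle when it goes out of scope.
class ScopedCsrView {
public:
  explicit ScopedCsrView(const std::vector<double>& values) {
    status_t s = create_csr_view(&h_, values.data(), (int)values.size());
    assert(s == STATUS_SUCCESS); (void)s;  // fail loudly instead of returning a status
  }
  ~ScopedCsrView() { destroy_view(&h_); }
  const handle_t& get() const { return h_; }
private:
  handle_t h_;
};
```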

Cheers,
Christoph
Gael Guennebaud
2018-04-05 12:01:03 UTC
Thank you for opening this discussion on the public mailing list.

So let's discuss the public API, which is currently not very convenient, as
already noticed by others. The issues are:

(i1) - Storing MKL's handle in SparseMatrix breaks the ABI and does not sound
very generic.
- We need a way to control:
(i2) - which operations are going to be analyzed/optimized,
(i3) - and specify the 'expected_calls' parameter.


In order to discuss these issues, let's consider the following typical
pattern (e.g., non-linear optimization, eigenvalues, ...):

SimplicialLDLT<SparseMatrix<double> > llt(A);

while(...) {
  ...
  x = llt.solve(b);
  ...
}

Here the triangular L factor is going to be used for triangular and
transposed-triangular solves dozens to hundreds of times, but only the user
of SimplicialLDLT knows that, not SimplicialLDLT, nor SparseMatrix.
Moreover, the user does not own the SparseMatrix that we want to
analyze/optimize. Other patterns are likely easier to handle, so let's
focus on this one for now.

Regarding (i1), I would suggest introducing a new type, say
MklSparseMatrix<>, that would enhance SparseMatrix<> through inheritance.
Then for (i2) and (i3) we could imagine something like:

MklSparseMatrix::beginAnalysis(Index expected_calls) const {
  // turn *this to compressed mode
  // create handle
  // store expected_calls
  // enable recording mode
}
MklSparseMatrix::endAnalysis() const {
  // disable recording mode
  // [optional] call mkl_sparse_optimize
}

All states in MklSparseMatrix would be mutable.

Between a pair of beginAnalysis()/endAnalysis() calls, each call to a
supported operation would trigger calls to
mkl_sparse_set_*_hint()/mkl_sparse_optimize().
Optionally, we could even add a "dry-run" mode in which no operation would
be performed; only mkl_sparse_set_*_hint() would be called, and then
mkl_sparse_optimize() would be called in endAnalysis(). This way
mkl_sparse_optimize() would be called only once.
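[Editor's note: the recording protocol described above can be sketched with mock hooks. set_mv_hint and sparse_optimize are hypothetical counters standing in for mkl_sparse_set_mv_hint/mkl_sparse_optimize; the mutable state matches Gael's proposal, since analysis happens through const references.]

```cpp
#include <cassert>

// Hypothetical counters standing in for the MKL hint/optimize calls.
static int g_hint_calls = 0, g_optimize_calls = 0;
static void set_mv_hint(int /*expected_calls*/) { ++g_hint_calls; }
static void sparse_optimize() { ++g_optimize_calls; }

class RecordingMatrix {
public:
  void beginAnalysis(int expected_calls) const {
    expected_calls_ = expected_calls;
    recording_ = true;
  }
  void endAnalysis() const {
    recording_ = false;
    sparse_optimize();  // optimize once, after all hints are recorded
  }
  // A supported operation records a hint only while in recording mode.
  void multiply() const {
    if (recording_) set_mv_hint(expected_calls_);
    // ... perform (or, in dry-run mode, skip) the actual product ...
  }
private:
  mutable bool recording_ = false;  // all analysis state is mutable
  mutable int expected_calls_ = 0;
};
```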

And that's it. Our example would look like:


SimplicialLDLT<MklSparseMatrix<double> > llt(A);
int it=0;
while(...) {
  ...
  if(it==0) llt.matrixL().beginAnalysis(100);
  x = llt.solve(b);
  if(it==0) llt.matrixL().endAnalysis();
  ...
  ++it;
}

or using a "dry-run" mode:

SimplicialLDLT<MklSparseMatrix<double> > llt(A);

llt.matrixL().beginAnalysis(100, DryRun);
x = llt.solve(b); // the permutation and the division by the diagonal matrix D
                  // would still be performed, but the calls to the actual
                  // triangular solves would be bypassed
llt.matrixL().endAnalysis();

while(...) {
  ...
  x = llt.solve(b);
  ...
}


If someone deals directly with the factor L, then we could follow the same
pattern or copy the SparseMatrix factor L into an MklSparseMatrix:


SimplicialLLT<SparseMatrix<double> > llt(A);

MklSparseMatrix<double> L(llt.matrixL());
L.beginAnalysis(100, DryRun);
y = L.triangularView<Lower>() * x;
L.endAnalysis();

while(...) {
  ...
  y = L.triangularView<Lower>() * x;
  ...
}


This design is quite general and extendable to any sparse optimizers, even
built-in ones in the future.

In contrast to the current proposal, only selected operations would be
passed to MKL (one needs to use an MklSparseMatrix plus a begin/end recording phase).

What do you think?


gael
Edward Lam
2018-04-05 13:31:57 UTC
Would it be useful to incorporate lambdas into the interface to avoid begin/end
pairs? So from the user side, we would write code like this instead:

1) Analyze and run

SimplicialLDLT<MklSparseMatrix<double>> llt(A);
for (int it = 0; ...; ++it) {
  if (it == 0)
    llt.matrixL().sparseAnalyzeAndRun(100, [&] { llt.solve(b); });
  else
    llt.solve(b);
  // ...
}

2) Analyze only

SimplicialLDLT<MklSparseMatrix<double>> llt(A);
llt.matrixL().sparseAnalyze(100, [&] { llt.solve(b); });
// and solve as usual

For more complicated algorithms, one can always outline the lambda and pass it
into the analysis.

Cheers,
-Edward
Post by Gael Guennebaud
Thank you for opening this discussion on the public mailing list.
So let's discuss the public API, which currently is not very convenient, as
already noticed by others. The issues are:
(i1) - Storing MKL's handle in SparseMatrix breaks ABI and does not sound very
generic.
- We need a way to control:
(i2)   - which operations are going to be analyzed/optimized,
(i3)   - and specify the 'expected_calls' parameter.
In order to discuss these issues, let's consider the following typical
pattern (e.g., non-linear optimization, eigenvalues, ...):
SimplicialLDLT<SparseMatrix<double> > llt(A);
while(...) {
    ...
x = llt.solve(b);
...
}
Here the triangular L factor is going to be used for triangular and
transposed-triangular solves dozens to hundreds of times, but only the user of
SimplicialLDLT knows that, not SimplicialLDLT, nor SparseMatrix. Moreover, the
user does not own the SparseMatrix that we want to analyze/optimize for. Other
patterns are likely easier to handle, so let's focus on this one for now.
Regarding (i1), I would suggest introducing a new type, say MklSparseMatrix<>,
that would enhance SparseMatrix<> through inheritance. Then for (i2) and (i3) we
could imagine something like:
MklSparseMatrix::beginAnalysis(Index expected_calls) const {
// turn *this to compressed mode
// create handle
// store expected_calls
// enable recording mode
}
MklSparseMatrix::endAnalysis() const {
// disable recording mode
// [optional] call mkl_sparse_optimize
}
All states in MklSparseMatrix would be mutable.
Between a pair of beginAnalysis/endAnalysis each call to a supported operation
would trigger calls to mkl_sparse_set_*_hint()/mkl_sparse_optimize.
Optionally, we could even add a "dry-run" mode in which no operation would be
performed; only calls to mkl_sparse_set_*_hint() would be made, and then
mkl_sparse_optimize would be called in endAnalysis(). This way
mkl_sparse_optimize() would be called only once.
And that's it. Our example would look like:
SimplicialLDLT<MklSparseMatrix<double> > llt(A);
int it=0;
while(...) {
    ...
    if(it==0) llt.matrixL().beginAnalysis(100);
x = llt.solve(b);
if(it==0) llt.matrixL().endAnalysis();
...
++it;
}
or using a "dry-run" mode:
SimplicialLDLT<MklSparseMatrix<double> > llt(A);
llt.matrixL().beginAnalysis(100, DryRun);
x = llt.solve(b); // permutation and division by the diagonal matrix D would
still be performed, but calls to actual triangular solves would be by-passed
llt.matrixL().endAnalysis();
while(...) {
    ...
    x = llt.solve(b);
...
}
If someone deals directly with the factor L, then we could follow the same
pattern, or copy the SparseMatrix factor L to an MklSparseMatrix:
SimplicialLLT<SparseMatrix<double> > llt(A);
MklSparseMatrix L(llt.matrixL());
L.beginAnalysis(100,DryRun);
y = L.triangularView<Lower>() * x;
L.endAnalysis();
while(...) {
    ...
y = L.triangularView<Lower>() * x;
...
}
This design is quite general and extendable to any sparse optimizers, even
built-in ones in the future.
In contrast to the current proposal, only selected operations would be passed to
MKL (one needs to use an MklSparseMatrix plus a begin/end recording phase).
What do you think?
gael
Christoph Hertzberg
2018-04-05 21:00:38 UTC
Permalink
Post by Edward Lam
Would it be useful to incorporate lambdas into the interface to avoid
begin/end pairs? So from the user side, we would write code like this instead:
1) Analyze and run
    SimplicialLDLT<MklSparseMatrix<double>> llt(A);
    int it = 0;
    for (int it = 0; ...; ++it) {
        if (it == 0)
            llt.matrixL().sparseAnalyzeAndRun(100, [&] { llt.solve(b); });
        else
            llt.solve(b);
        // ...
    }
2) Analyze only
    SimplicialLDLT<MklSparseMatrix<double>> llt(A);
    llt.matrixL().sparseAnalyze(100, [&] { llt.solve(b); });
    // and solve as usual
For more complicated algorithms, one can always outline the lambda and
pass it into the analysis.
That would certainly clean up things a lot. Having to call
A.beginXY();
doStuff();
A.endXY();
is a very C-stylish API, which is error-prone (e.g., one of the calls
might not get called, because it is masked by a wrong if-condition, or
due to an exception, which is caught only outside).
This can usually be avoided with proper C++ constructs.

However, if this is required, I would suggest adding this method not
only to matrices but also to the decompositions (and passing it through to
the internal matrices). Otherwise, this does not scale for
decompositions which use more than one matrix (like SparseLU). And we
could even let the sparseAnalyze() functions return a proxy to the
decomposition, which would allow writing something like:

x = llt.sparseAnalyzeAndRun(100)->solve(b);
// equivalent to
llt.sparseAnalyzeAndRun(100, [&]{x=llt.solve(b);});

or
{
auto llt_ = llt.sparseAnalyzeAndRun(100);
x = llt_-> solve(b);
y = llt_-> matrixL() * c;
} // destructor of llt_ calls `endAnalyze`

The naming of this method is still debatable, of course.
And I have no idea what actually happens inside MKL when it 'analyzes'
an operation (after how many iterations do you actually benefit from the
overhead of analyzing the operation?)


Christoph
--
Dr.-Ing. Christoph Hertzberg

Besuchsadresse der Nebengeschäftsstelle:
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 5
28359 Bremen, Germany

Postadresse der Hauptgeschäftsstelle Standort Bremen:
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 1
28359 Bremen, Germany

Tel.: +49 421 178 45-4021
Zentrale: +49 421 178 45-0
E-Mail: ***@dfki.de

Weitere Informationen: http://www.dfki.de/robotik
-----------------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Straße 122, D-67663 Kaiserslautern
Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
(Vorsitzender) Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
Sitz der Gesellschaft: Kaiserslautern (HRB 2313)
USt-Id.Nr.: DE 148646973
Steuernummer: 19/672/50006
-----------------------------------------------------------------------
Gael Guennebaud
2018-04-05 21:54:15 UTC
Permalink
On Thu, Apr 5, 2018 at 11:00 PM, Christoph Hertzberg <
Post by Christoph Hertzberg
Post by Edward Lam
Would it be useful to incorporate lambda's into the interface to avoid
begin/end pairs? So from the user side, we would write code like this
1) Analyze and run
SimplicialLDLT<MklSparseMatrix<double>> llt(A);
int it = 0;
for (int it = 0; ...; ++it) {
if (it == 0)
llt.matrixL().sparseAnalyzeAndRun(100, [&] { llt.solve(b); });
else
llt.solve(b);
// ...
}
2) Analyze only
SimplicialLDLT<MklSparseMatrix<double>> llt(A);
llt.matrixL().sparseAnalyze(100, [&] { llt.solve(b); });
// and solve as usual
For more complicated algorithms, one can always outline the lambda and
pass it into the analysis.
That would certainly clean up things a lot. Having to call
A.beginXY();
doStuff();
A.endXY();
is a very C-stylish API, which is error-prone (e.g., one of the calls
might not get called, because it is masked by a wrong if-condition, or due
to an exception, which is caught only outside).
This can usually be avoided with proper C++ constructs.
The advantages of the begin/end pair are that 1) it keeps the API
minimal, and 2) the user can easily wrap it to accommodate their taste and
usage. To avoid all sorts of redundancies, one could even write a free
function on top of the begin/end pair to do:

while(...) {
optimize_if(iter==0, 100, [&]{x=llt.solve(b);}, llt.matrixL());
}

and with variadic templates:

while(...) {
optimize_if(iter==0, 100, [&]{x=LU.solve(b);}, LU.matrixL(), LU.matrixU());
}

Actually, this last example is not conceivable at the moment because the
factors in SparseLU are not stored in a standard compressed format; we
would need to add an option telling SparseLU to convert them to regular
SparseMatrix objects and to use those for solving.

Anyways, all these variants are essentially syntactic sugar/implementation
details, and at this stage I'm more concerned about possible shortcomings
in the general approach. Is it flexible enough? I'm also not a fan of
introducing an MklSparseMatrix inheriting SparseMatrix, but I don't have a
better option for now that is generic and easily applicable to
decompositions.


gael
Zhukova, Maria
2018-04-05 22:49:55 UTC
Permalink
Hello Gael,
Thanks for your feedback! It’s greatly appreciated.

Regarding your list of issues:
(i1) Our initial suggestion was to add all code related to IE SpBLAS guarded by #ifdef EIGEN_USE_MKL_SPBLAS /*Code is here*/ #endif, so I believe this shouldn’t break ABI.
(i2) Calling mkl_sparse_set_?_hint() and mkl_sparse_optimize() on the fly with every operation won’t be a problem, and the overhead is expected to be minimal, because internally we check for previous optimizations and for what data can be reused. So adding them to the code just before calling any IE SpBLAS computational routine means that we will take care of the required optimizations automatically and it won’t be the user’s work.
(i3) Surely in addition we can implement any type of MKLOptimize(/*args TBD*/) routine which will be used to set user-specified hints, desired number of calls, required operations, etc.

The reason why we do not want to implement our own type is that this makes it much easier for customers to benefit from the Intel architecture: the only thing they have to take care of is to include the SpBLASSupport module and link MKL; all the optimization work will be hidden (together with the IE SpBLAS calls). So an existing application that already uses sparse algebra will automatically benefit from MKL.

Best regards,
Maria

From: Gael Guennebaud [mailto:***@gmail.com]
Sent: Thursday, April 5, 2018 2:54 PM
To: eigen <***@lists.tuxfamily.org>
Subject: Re: [eigen] Intel (R) MKL IE SpBLAS support in Eigen



On Thu, Apr 5, 2018 at 11:00 PM, Christoph Hertzberg <***@informatik.uni-bremen.de<mailto:***@informatik.uni-bremen.de>> wrote:
On 2018-04-05 15:31, Edward Lam wrote:
Would it be useful to incorporate lambda's into the interface to avoid begin/end pairs? So from the user side, we would write code like this instead:

1) Analyze and run

SimplicialLDLT<MklSparseMatrix<double>> llt(A);
int it = 0;
for (int it = 0; ...; ++it) {
if (it == 0)
llt.matrixL().sparseAnalyzeAndRun(100, [&] { llt.solve(b); });
else
llt.solve(b);
// ...
}

2) Analyze only

SimplicialLDLT<MklSparseMatrix<double>> llt(A);
llt.matrixL().sparseAnalyze(100, [&] { llt.solve(b); });
// and solve as usual

For more complicated algorithms, one can always outline the lambda and pass it into the analysis.

That would certainly clean up things a lot. Having to call
A.beginXY();
doStuff();
A.endXY();
is a very C-stylish API, which is error-prone (e.g., one of the calls might not get called, because it is masked by a wrong if-condition, or due to an exception, which is caught only outside).
This can usually be avoided with proper C++ constructs.

The advantages of the begin/end pair are that 1) this keeps the API minimal, and 2) the user can easily wrap them to accommodate its taste and usages. To avoid all sorts of redundancies, one could even write a free function on top the begin/end pair to do:

while(...) {
optimize_if(iter==0, 100, [&]{x=llt.solve(b);}, llt.matrixL());
}

and with variadic templates:

while(...) {
optimize_if(iter==0, 100, [&]{x=LU.solve(b);}, LU.matrixL(), LU.matrixU());
}

Actually, this last example is not conceivable at the moment because the factors in SparseLU are not stored in a standard compressed format, we would need to add an option to tell SparseLU to turn them to regular SparseMatrix and to use them for solving.

Anyways, all these variants are essentially syntactic sugar/implementation details, and at this stage I'm more concerned about possible shortcomings in the general approach. Is it flexible enough? I'm also not fan of introducing a MklSparseMatrix inheriting SparseMatrix, but I don't have a better offer for now that is generic and easily applicable to decompositions.


gael


However, if this is required, I would suggest to add this method not only to matrices but also to the decomposition (and pass it through to the internal matrices). Otherwise, this does not scale for decompositions which use more than one matrix (like SparseLU). And we could even let the sparseAnalyze() functions return a proxy to the decomposition which would allow writing something like:

x = llt.sparseAnalyzeAndRun(100)->solve(b);
// equivalent to
llt.sparseAnalyzeAndRun(100, [&]{x=llt.solve(b);});

or
{
auto llt_ = llt.sparseAnalyzeAndRun(100);
x = llt_-> solve(b);
y = llt_-> matrixL() * c;
} // destructor of llt_ calls `endAnalyze`

The naming of this method is still debatable, of course.
And I have no idea what actually happens inside MKL when it 'analyzes' an operation (after how many iterations do you actually benefit from the overhead of analyzing the operation?)


Christoph


Cheers,
-Edward

On 4/5/2018 8:01 AM, Gael Guennebaud wrote:
Thank you for opening this discussion on the public mailing list.

So let's discuss about the public API, which currently is not very convenient as already noticed by others. Issues are:

(i1) - Storing MKL's handle in SparseMatrix breaks ABI and does not sounds very generic.
- We need a way to control:
(i2) - which operations are going to be analyzed/optimized,
(i3) - and specify the 'expected_calls' parameter.


In order to discuss these issues, let's consider the following typical pattern: (e.g., non-linear optimization, eigenvalues, ...)

SimplicialLDLT<SparseMatrix<double> > llt(A);

while(...) {
...
x = llt.solve(b);
...
}

Here the triangular L factor is going to be used for triangular and transposed-triangular solves dozens to hundreds of time but only the user of SimplicialLDLT knowns that, not SimplicialLDLT, nor SparseMatrix. Moreover, the user does not own the SparseMatrix that we want to analyze/optimize for. Other patterns are likely easier to handle, so let's focus on it for now.

Regarding (i1), I would suggest to introduce a new type, say MklSparseMatrix<> that would enhance SparseMatrix<> through inheritance. Then for (i2) and (i3) we could imagine something like:

MklSparseMatrix::beginAnalysis(Index expected_calls) const {
// turn *this to compressed mode
// create handle
// store expected_calls
// enable recording mode
}
MklSparseMatrix::endAnalysis() const {
// disable recording mode
// [optional] call mkl_sparse_optimize
}

All states in MklSparseMatrix would be mutable.

Between a pair of beginAnalysis/endAnalysis each call to a supported operation would trigger calls to mkl_sparse_set_*_hint()/mkl_sparse_optimize.
Optionally, we could even add a "dryrun" mode for which no operation would be performed, only calls to mkl_sparse_set_*_hint() and then mkl_sparse_optimize would be called in endAnalysis(). This way mkl_sparse_optimize() would be called only once.

And that's it. Our example would look-like:


SimplicialLDLT<MklSparseMatrix<double> > llt(A);
int it=0;
while(...) {
...
if(it==0) llt.matrixL().beginAnalysis(100);
x = llt.solve(b);
if(it==0) llt.matrixL().endAnalysis();
...
++it;
}

or using a "dry-run" mode:

SimplicialLDLT<MklSparseMatrix<double> > llt(A);

llt.matrixL().beginAnalysis(100, DryRun);
x = llt.solve(b); // permutation and division by the diagonal matrix D would still be performed, but calls to actual triangular solves would be by-passed
llt.matrixL().endAnalysis();

while(...) {
...
x = llt.solve(b);
...
}


If someone directly deal with the factor L, then we could follow the same pattern or copy the SparseMatrix factor L to a MklSparseMatrix:


SimplicialLLT<SparseMatrix<double> > llt(A);

MklSparseMatrix L(llt.matrixL());
L.beginAnalysis(100,DryRun);
y = L.triangularView<Lower>() * x;
L.endAnalysis();
while(...) {
...
y = L.triangularView<Lower>() * x;
...
}


This design in quite general and expendable to any sparse-optimizers, even built-in ones in the future.

In contrast to the current proposal, only selected operations would be passed to MKL (need to use a MklSparseMatrix + begin/end recording phase).

What do you think?


gael


On Tue, Apr 3, 2018 at 11:39 PM, Zhukova, Maria <***@intel.com<mailto:***@intel.com> <mailto:***@intel.com<mailto:***@intel.com>>> wrote:

Hello Eigen community,

My name is Maria Zhukova and I’m a software development engineer at Intel ®
MKL Sparse team.

My team is interested in contributing into Eigen, so I’ve investigated our
possibilities and so far this is what I have:
Eigen support different operations for sparse matrices stored in CSR and CSC
format which can be implemented on a basis of IE SpBLAS kernels (please,
refer to
https://software.intel.com/en-us/mkl-developer-reference-c-inspector-executor-sparse-blas-routines
<https://software.intel.com/en-us/mkl-developer-reference-c-inspector-executor-sparse-blas-routines>
for the general idea of interfaces)
, basically we want to implement calls to our IE SpBLAS into next
operations:____

SparseMatrix + SparseMatrix (mkl_sparse_?_add)
SparseMatrix * DenseVector (mkl_sparse_?_mv)____

SparseMatrix * DenseMatrix (mkl_sparse_?_mm)____

SparseMatrix * SparseMatrix (mkl_sparse_spmm),
and Triangular solve (mkl_sparse_?_trsv).____

I’ve already started with implementation of sparse_time_dense_impl_mkl
kernel which is based on mkl_sparse_?_mv (included in patch).____

This is how it will look for the user:

#include <Eigen/SpBLASSupport> // NEW: IE SpBLAS include module

int main () {
  SparseMatrix<double, RowMajor> A;
  Matrix<double, Dynamic, 1> x, y;

  A.makeCompressed();      /* Convert matrix A into CSR/CSC format */
  A.createSparseHandle();  /* NEW: creates the handle required by all IE SpBLAS routines */

  // support of IE SpBLAS is here
  y = beta*y + alpha*A*x;             /* call to mkl_sparse_?_mv with operation = SPARSE_OPERATION_NON_TRANSPOSE */
  y = beta*y + alpha*A.transpose()*x; /* call to mkl_sparse_?_mv with operation = SPARSE_OPERATION_TRANSPOSE */
  y = beta*y + alpha*A.adjoint()*x;   /* call to mkl_sparse_?_mv with operation = SPARSE_OPERATION_CONJUGATE_TRANSPOSE */

  A.destroySparseHandle(); /* NEW: deletes the created handle */
}


I’ve attached a draft patch including all necessary changes and would like
to hear your feedback.
Please, let me know if you have any questions and comments.


Best regards,
Maria

--
Dr.-Ing. Christoph Hertzberg

Visiting address of the branch office:
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 5
28359 Bremen, Germany

Postal address of the head office, Bremen site:
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 1
28359 Bremen, Germany

Tel.: +49 421 178 45-4021
Switchboard: +49 421 178 45-0
E-Mail: ***@dfki.de

More information: http://www.dfki.de/robotik
-----------------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Registered office: Trippstadter Straße 122, D-67663 Kaiserslautern
Management: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
(Chairman), Dr. Walter Olthoff
Chairman of the Supervisory Board: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
Registered seat: Kaiserslautern (HRB 2313)
VAT ID: DE 148646973
Tax number: 19/672/50006
-----------------------------------------------------------------------
Edward Lam
2018-04-06 00:15:35 UTC
Permalink
It's not clear to me what the expected lifetimes are between the matrix,
the matrix algorithm, and the optimizer state for IE SpBLAS.

I liked the other idea here to separate out the optimizer as its own class,
but I'm not clear on the data dependency relationships between these.

For the ABI concerns, does it make sense (or is it possible) to store the
optimizer state within the solver itself? This would make its use less
flexible, but would satisfy the goal of transparently using it with just
different compile options. This of course doesn't preclude the possibility
of having a more general optimizer API as well.

Cheers,
-Edward
Post by Zhukova, Maria
Hello Gael,
Thanks for your feedback! It's greatly appreciated.
(i1) Our initial suggestion was to add all code related to IE SpBLAS guarded
by #ifdef EIGEN_USE_MKL_SPBLAS /* code here */ #endif, so I believe this
shouldn't break ABI.
(i2) Calling mkl_sparse_set_?_hint() and mkl_sparse_optimize() on-the-fly
with every operation won't be a problem and the overhead is expected to be
minimal, because internally we're checking for previous optimizations and
what data can be reused. So adding it into the code just before calling any
IE SpBLAS computational routine means that we will take care of the required
optimizations automatically and it won't be the user's work.
(i3) Surely in addition we can implement any type of MKLOptimize(/*args
TBD*/) routine which will be used to set user-specified hints, desired
number of calls, required operations, etc.
The reason why we do not want to implement our own type is that it will be
much easier for customers to benefit from Intel architecture: the only
thing they should take care of is to include the SpBLASSupport module and
link MKL; all the optimization work will be hidden (together with the IE
SpBLAS calls). So if there was already an application which used sparse
algebra, it will benefit automatically from MKL usage.
Best regards,
Maria
Sent: Thursday, April 5, 2018 2:54 PM
Subject: Re: [eigen] Intel (R) MKL IE SpBLAS support in Eigen
On Thu, Apr 5, 2018 at 11:00 PM, Christoph Hertzberg <
Would it be useful to incorporate lambdas into the interface to avoid
begin/end pairs? So from the user side, we would write code like this:
1) Analyze and run
SimplicialLDLT<MklSparseMatrix<double>> llt(A);
int it = 0;
for (int it = 0; ...; ++it) {
if (it == 0)
llt.matrixL().sparseAnalyzeAndRun(100, [&] { llt.solve(b); });
else
llt.solve(b);
// ...
}
2) Analyze only
SimplicialLDLT<MklSparseMatrix<double>> llt(A);
llt.matrixL().sparseAnalyze(100, [&] { llt.solve(b); });
// and solve as usual
For more complicated algorithms, one can always outline the lambda and
pass it into the analysis.
That would certainly clean up things a lot. Having to call
A.beginXY();
doStuff();
A.endXY();
is a very C-stylish API, which is error-prone (e.g., one of the calls
might not get called, because it is masked by a wrong if-condition, or due
to an exception, which is caught only outside).
This can usually be avoided with proper C++ constructs.
The advantages of the begin/end pair are that 1) it keeps the API
minimal, and 2) the user can easily wrap it to accommodate their taste and
usage. To avoid all sorts of redundancies, one could even write a free
function, e.g.:
while(...) {
optimize_if(iter==0, 100, [&]{x=llt.solve(b);}, llt.matrixL());
}
while(...) {
optimize_if(iter==0, 100,
[&]{x=LU.solve(b);}, LU.matrixL(), LU.matrixU());
}
Actually, this last example is not conceivable at the moment because the
factors in SparseLU are not stored in a standard compressed format, we
would need to add an option to tell SparseLU to turn them to regular
SparseMatrix and to use them for solving.
Anyways, all these variants are essentially syntactic sugar/implementation
details, and at this stage I'm more concerned about possible shortcomings
in the general approach. Is it flexible enough? I'm also not a fan of
introducing a MklSparseMatrix inheriting from SparseMatrix, but I don't
have a better offer for now that is generic and easily applicable to
decompositions.
gael
However, if this is required, I would suggest adding this method not only
to matrices but also to the decompositions (and passing it through to the
internal matrices). Otherwise, this does not scale for decompositions which
use more than one matrix (like SparseLU). And we could even let the
sparseAnalyze() functions return a proxy to the decomposition, which would
allow:
x = llt.sparseAnalyzeAndRun(100)->solve(b);
// equivalent to
llt.sparseAnalyzeAndRun(100, [&]{x=llt.solve(b);});
or
{
  auto llt_ = llt.sparseAnalyzeAndRun(100);
  x = llt_->solve(b);
  y = llt_->matrixL() * c;
} // destructor of llt_ calls `endAnalyze`
The naming of this method is still debatable, of course.
And I have no idea what actually happens inside MKL when it 'analyzes' an
operation (after how many iterations do you actually benefit from the
overhead of analyzing the operation?)
Christoph
Thank you for opening this discussion on the public mailing list.
So let's discuss the public API, which currently is not very satisfactory:
(i1) - Storing MKL's handle in SparseMatrix breaks ABI and does not sound very generic.
(i2) - which operations are going to be analyzed/optimized,
(i3) - and specify the 'expected_calls' parameter.
In order to discuss these issues, let's consider the following typical
pattern: (e.g., non-linear optimization, eigenvalues, ...)
SimplicialLDLT<SparseMatrix<double> > llt(A);
while(...) {
...
x = llt.solve(b);
...
}
Here the triangular L factor is going to be used for triangular and
transposed-triangular solves dozens to hundreds of times, but only the user
of SimplicialLDLT knows that; neither SimplicialLDLT nor SparseMatrix does.
Moreover, the user does not own the SparseMatrix that we want to
analyze/optimize for. Other patterns are likely easier to handle, so let's
focus on this one for now.
Regarding (i1), I would suggest introducing a new type, say
MklSparseMatrix<>, that would enhance SparseMatrix<> through inheritance.
MklSparseMatrix::beginAnalysis(Index expected_calls) const {
// turn *this to compressed mode
// create handle
// store expected_calls
// enable recording mode
}
MklSparseMatrix::endAnalysis() const {
// disable recording mode
// [optional] call mkl_sparse_optimize
}
All states in MklSparseMatrix would be mutable.
Between a pair of beginAnalysis/endAnalysis each call to a supported
operation would trigger calls to
mkl_sparse_set_*_hint()/mkl_sparse_optimize.
Optionally, we could even add a "dryrun" mode for which no operation would
be performed, only calls to mkl_sparse_set_*_hint() and then
mkl_sparse_optimize would be called in endAnalysis(). This way
mkl_sparse_optimize() would be called only once.
SimplicialLDLT<MklSparseMatrix<double> > llt(A);
int it=0;
while(...) {
...
if(it==0) llt.matrixL().beginAnalysis(100);
x = llt.solve(b);
if(it==0) llt.matrixL().endAnalysis();
...
++it;
}
SimplicialLDLT<MklSparseMatrix<double> > llt(A);
llt.matrixL().beginAnalysis(100, DryRun);
x = llt.solve(b); // permutation and division by the diagonal matrix D
would still be performed, but calls to actual triangular solves would be
by-passed
llt.matrixL().endAnalysis();
while(...) {
...
x = llt.solve(b);
...
}
If someone deals directly with the factor L, then we could follow the same pattern:
SimplicialLLT<SparseMatrix<double> > llt(A);
MklSparseMatrix L(llt.matrixL());
L.beginAnalysis(100,DryRun);
y = L.triangularView<Lower>() * x;
L.endAnalysis();
while(...) {
...
y = L.triangularView<Lower>() * x;
...
}
This design is quite general and extendable to any sparse optimizers, even
built-in ones in the future.
In contrast to the current proposal, only selected operations would be
passed to MKL (need to use a MklSparseMatrix + begin/end recording phase).
What do you think?
gael
Gael Guennebaud
2018-04-06 11:42:11 UTC
Permalink
Post by Zhukova, Maria
Hello Gael,
Thanks for your feedback! It’s greatly appreciated.
(i1) Our initial suggestion was to add all code related to IE SpBLAS guarded
by #ifdef EIGEN_USE_MKL_SPBLAS /* code here */ #endif, so I believe this
shouldn't break ABI.
This makes binaries compiled with and without MKL support incompatible, so
yes, it breaks ABI, and users have to be very careful to enable MKL support
in all translation units, all libraries, and dependencies...
Post by Zhukova, Maria
(i2) Calling mkl_sparse_set_?_hint() and mkl_sparse_optimize() on-the-fly
with every operation won’t be a problem and the overhead is expected to be
minimal, because internally we’re checking for previous optimizations and
what data can be reused. So adding it into the code just before calling any
IE SpBLAS computational routine means that we will take care of required
optimizations automatically and it won’t be user’s work.
(i3) Surely in addition we can implement any type of MKLOptimize(/*args
TBD*/) routine which will be used to set user-specified hints, desired
number of calls, required operations, etc.
The reason why we do not want to implement our own type is that it will be
much easier for customers to benefit from Intel architecture: the only
thing they should take care about is to include SpBLASSupport module and
link MKL, all the optimization work will be hidden (together with IE SpBLAS
calls). So if there was already an application which used sparse algebra
then it will benefit automatically from MKL usage.
I just tried your patch with:

vec1_dense.noalias() += row_major_sparse * vec2_dense

and I observe a very significant drop of performance (almost x10) compared
to pure Eigen (with -fopenmp) if this operation is performed only once, or
very few times... Changing the 'expected_calls' parameter from 10 to 1 does
not reduce the performance drop due to mkl_sparse_optimize. If I repeat
this operation hundreds of times, then both Eigen and MKL achieve the same
level of performance for this particular operation and the matrices I tried
(1M^2 Laplacian, bi-Laplacian, and tri-Laplacian matrices).

So I still believe that silently and unconditionally falling back to IE
SpBLAS is not the right approach. It could be that calling MKL's SpBLAS is
never a loss (we have to bench), but that's clearly not the case of the
optimize step. So, clearly, the optimize step must be explicitly activated
on user selected expressions only.

gael
Gael Guennebaud
2018-04-06 12:06:08 UTC
Permalink
I also tried:

vec1_dense.noalias() += column_major_sparse * vec2_dense

Since this operation is not supported by mkl_sparse_optimize, there is no
overhead here, but for some very sparse matrices (5 nnz/col) mkl_sparse_d_mv
is about 2x slower than Eigen, while for denser matrices both
implementations achieve the same speed.

gael.
Post by Gael Guennebaud
vec1_dense.noalias() += row_major_sparse * vec2_dense
and I observe a very significant drop of performance (almost x10) compared
to pure Eigen (with -fopenmp) if this operation is performed only once, or
very few times... Changing the 'expected_calls' parameter from 10 to 1
does not reduce the performance drop due to mkl_sparse_optimize. If I
repeat this operation hundreds of times, then both Eigen and MKL achieve
the same level of performance for this particular operation and the
matrices I tried (1M^2 Laplacian, bi-Laplacian, and tri-Laplacian matrices).
So I still believe that silently and unconditionally falling back to IE
SpBLAS is not the right approach. It could be that calling MKL's SpBLAS
is never a loss (we have to bench), but that's clearly not the case of the
optimize step. So, clearly, the optimize step must be explicitly
activated on user selected expressions only.
gael
Post by Zhukova, Maria
Best regards,
Maria
*Sent:* Thursday, April 5, 2018 2:54 PM
*Subject:* Re: [eigen] Intel (R) MKL IE SpBLAS support in Eigen
On Thu, Apr 5, 2018 at 11:00 PM, Christoph Hertzberg <
Would it be useful to incorporate lambda's into the interface to avoid
begin/end pairs? So from the user side, we would write code like this
1) Analyze and run
SimplicialLDLT<MklSparseMatrix<double>> llt(A);
int it = 0;
for (int it = 0; ...; ++it) {
if (it == 0)
llt.matrixL().sparseAnalyzeAndRun(100, [&] { llt.solve(b); });
else
llt.solve(b);
// ...
}
2) Analyze only
SimplicialLDLT<MklSparseMatrix<double>> llt(A);
llt.matrixL().sparseAnalyze(100, [&] { llt.solve(b); });
// and solve as usual
For more complicated algorithms, one can always outline the lambda and
pass it into the analysis.
That would certainly clean up things a lot. Having to call
A.beginXY();
doStuff();
A.endXY();
is a very C-stylish API, which is error-prone (e.g., one of the calls
might not get called, because it is masked by a wrong if-condition, or due
to an exception, which is caught only outside).
This can usually be avoided with proper C++ constructs.
The advantages of the begin/end pair are that 1) this keeps the API
minimal, and 2) the user can easily wrap them to accommodate its taste and
usages. To avoid all sorts of redundancies, one could even write a free
while(...) {
optimize_if(iter==0, 100, [&]{x=llt.solve(b);}, llt.matrixL());
}
while(...) {
optimize_if(iter==0, 100, [&]{x=LU.solve(b);}, LU.m
atrixL(), LU.matrixU());
}
Actually, this last example is not conceivable at the moment because the
factors in SparseLU are not stored in a standard compressed format, we
would need to add an option to tell SparseLU to turn them to regular
SparseMatrix and to use them for solving.
Anyways, all these variants are essentially syntactic
sugar/implementation details, and at this stage I'm more concerned about
possible shortcomings in the general approach. Is it flexible enough? I'm
also not fan of introducing a MklSparseMatrix inheriting SparseMatrix, but
I don't have a better offer for now that is generic and easily applicable
to decompositions.
gael
However, if this is required, I would suggest to add this method not only
to matrices but also to the decomposition (and pass it through to the
internal matrices). Otherwise, this does not scale for decompositions which
use more than one matrix (like SparseLU). And we could even let the
sparseAnalyze() functions return a proxy to the decomposition which would
x = llt.sparseAnalyzeAndRun(100)->solve(b);
// equivalent to
llt.sparseAnalyzeAndRun(100, [&]{x=llt.solve(b);});
or
{
auto llt_ = llt.sparseAnalyzeAndRun(100);
x = llt_-> solve(b);
y = llt_-> matrixL() * c;
} // destructor of llt_ calls `endAnalyze`
The naming of this method is still debatable, of course.
And I have no idea what actually happens inside MKL when it 'analyzes' an
operation (after how many iterations do you actually benefit from the
overhead of analyzing the operation?)
Christoph
Cheers,
-Edward
Thank you for opening this discussion on the public mailing list.
So let's discuss about the public API, which currently is not very
(i1) - Storing MKL's handle in SparseMatrix breaks ABI and does not sounds very generic.
(i2) - which operations are going to be analyzed/optimized,
(i3) - and specify the 'expected_calls' parameter.
In order to discuss these issues, let's consider the following typical
pattern: (e.g., non-linear optimization, eigenvalues, ...)
SimplicialLDLT<SparseMatrix<double> > llt(A);
while(...) {
...
x = llt.solve(b);
...
}
Here the triangular L factor is going to be used for triangular and
transposed-triangular solves dozens to hundreds of time but only the user
of SimplicialLDLT knowns that, not SimplicialLDLT, nor SparseMatrix.
Moreover, the user does not own the SparseMatrix that we want to
analyze/optimize for. Other patterns are likely easier to handle, so let's
focus on it for now.
Regarding (i1), I would suggest to introduce a new type, say
MklSparseMatrix<> that would enhance SparseMatrix<> through inheritance.
MklSparseMatrix::beginAnalysis(Index expected_calls) const {
// turn *this to compressed mode
// create handle
// store expected_calls
// enable recording mode
}
MklSparseMatrix::endAnalysis() const {
// disable recording mode
// [optional] call mkl_sparse_optimize
}
All states in MklSparseMatrix would be mutable.
Between a pair of beginAnalysis/endAnalysis each call to a supported
operation would trigger calls to mkl_sparse_set_*_hint()/mkl_sp
arse_optimize.
Optionally, we could even add a "dryrun" mode for which no operation
would be performed, only calls to mkl_sparse_set_*_hint() and then
mkl_sparse_optimize would be called in endAnalysis(). This way
mkl_sparse_optimize() would be called only once.
SimplicialLDLT<MklSparseMatrix<double> > llt(A);
int it=0;
while(...) {
...
if(it==0) llt.matrixL().beginAnalysis(100);
x = llt.solve(b);
if(it==0) llt.matrixL().endAnalysis();
...
++it;
}
SimplicialLDLT<MklSparseMatrix<double> > llt(A);
llt.matrixL().beginAnalysis(100, DryRun);
x = llt.solve(b); // permutation and division by the diagonal matrix D
would still be performed, but calls to actual triangular solves would be
by-passed
llt.matrixL().endAnalysis();
while(...) {
...
x = llt.solve(b);
...
}
If someone directly deal with the factor L, then we could follow the same
SimplicialLLT<SparseMatrix<double> > llt(A);
MklSparseMatrix L(llt.matrixL());
L.beginAnalysis(100,DryRun);
y = L.triangularView<Lower>() * x;
L.endAnalysis();
while(...) {
...
y = L.triangularView<Lower>() * x;
...
}
This design in quite general and expendable to any sparse-optimizers,
even built-in ones in the future.
In contrast to the current proposal, only selected operations would be
passed to MKL (need to use a MklSparseMatrix + begin/end recording phase).
What do you think?
gael
--
Dr.-Ing. Christoph Hertzberg

DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 5
28359 Bremen, Germany

DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 1
28359 Bremen, Germany

Tel.: +49 421 178 45-4021
Zentrale: +49 421 178 45-0
Weitere Informationen: http://www.dfki.de/robotik
-----------------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Straße 122, D-67663 Kaiserslautern
Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
(Vorsitzender) Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
Sitz der Gesellschaft: Kaiserslautern (HRB 2313)
USt-Id.Nr.: DE 148646973
Steuernummer: 19/672/50006
-----------------------------------------------------------------------
Kalinkin, Alexander A
2018-04-06 18:49:45 UTC
Permalink
Hi All,

First of all, we appreciate your interest in integrating sparse kernels into Eigen. Please see my comments below.

> I also tried:
>
> vec1_dense.noalias() += column_major_sparse * vec2_dense
>
> Since this operation is not supported by mkl_sparse_optimize, the overhead is null here, but for some very sparse matrices (5 nnz/col) mkl_sparse_d_mv is about x2 slower than Eigen, while for denser matrices both implementations achieve the same speed.
[akalinki] That sounds strange. May I ask which MKL version you used for your testing, and what the target architecture is? We implemented matvec support in IE Sparse BLAS a couple of years ago and have significantly improved its performance this year. May I also ask about the size of the matrix, so that I can check it on my side?

> I just tried your patch with:
>
> vec1_dense.noalias() += row_major_sparse * vec2_dense
>
> and I observe a very significant drop of performance (almost x10) compared to pure Eigen (with -fopenmp) if this operation is performed only once, or very few times... Changing the 'expected_calls' parameter from 10 to 1 does not reduce the performance drop due to mkl_sparse_optimize. If I repeat this operation hundreds of times, then both Eigen and MKL achieve the same level of performance for this particular operation and the matrices I tried (1M^2 Laplacian, bi-Laplacian, and tri-Laplacian matrices).
>
> So I still believe that silently and unconditionally falling back to IE SpBLAS is not the right approach. It could be that calling MKL's SpBLAS is never a loss (we have to bench), but that's clearly not the case of the optimize step. So, clearly, the optimize step must be explicitly activated on user selected expressions only.

[akalinki] This can be an issue with an older version of MKL, because initially we spent additional time doing hints/optimize on every call. Starting from MKL 2018u3 we're planning to remove this overhead – if you want, we can share an engineering build so you can check performance on the same binaries.
Also, about the poor performance of Sparse BLAS matvec functionality in comparison with Eigen's built-in one: we interpret all such cases as issues that need to be investigated and fixed on our side. So I don't think we need to add any dispatch at the Eigen level between the built-in kernels and ours – all this should be hidden inside the MKL libraries.

And back to the main question of this topic – we have customers who use Eigen on Intel processors and are interested in optimizing their code. Introducing a new SparseMatrix type would force them to rewrite their code, which can sometimes be impossible for various reasons. That's why we are asking to find a way to simply extend the current SparseMatrix type to support MKL functionality.

Thanks,
Alexander Kalinkin,
PhD, Team leader, Architect of Sparse components in Intel® MKL


From: Gael Guennebaud [mailto:***@gmail.com]
Sent: Thursday, April 5, 2018 2:54 PM
To: eigen <***@lists.tuxfamily.org>
Subject: Re: [eigen] Intel (R) MKL IE SpBLAS support in Eigen



On Thu, Apr 5, 2018 at 11:00 PM, Christoph Hertzberg <***@informatik.uni-bremen.de> wrote:
On 2018-04-05 15:31, Edward Lam wrote:
Would it be useful to incorporate lambdas into the interface to avoid begin/end pairs? From the user side, we would then write code like this instead:

1) Analyze and run

SimplicialLDLT<MklSparseMatrix<double>> llt(A);
for (int it = 0; ...; ++it) {
if (it == 0)
llt.matrixL().sparseAnalyzeAndRun(100, [&] { llt.solve(b); });
else
llt.solve(b);
// ...
}

2) Analyze only

SimplicialLDLT<MklSparseMatrix<double>> llt(A);
llt.matrixL().sparseAnalyze(100, [&] { llt.solve(b); });
// and solve as usual

For more complicated algorithms, one can always outline the lambda and pass it into the analysis.
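As a rough sketch (not actual Eigen code), such a lambda-taking interface could be shaped as follows; the MKL hint/optimize machinery is stubbed out with counters, and all names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of the lambda-based interface proposed above.
// begin()/end() stand in for the real analysis setup and
// mkl_sparse_optimize call; counters let us check the bracketing.
class AnalyzableMatrix {
public:
  // Record hints while running the callback, then optimize.
  template <typename Fn>
  void sparseAnalyzeAndRun(std::size_t expected_calls, Fn&& fn) {
    begin(expected_calls);
    fn();  // operations inside are recorded as hints
    end();
  }

  // Same, but in a dry-run mode: hints are recorded while the actual
  // kernels would be skipped.
  template <typename Fn>
  void sparseAnalyze(std::size_t expected_calls, Fn&& fn) {
    begin(expected_calls);
    dry_run_ = true;
    fn();
    dry_run_ = false;
    end();
  }

  std::size_t beginCount() const { return begins_; }
  std::size_t endCount() const { return ends_; }

private:
  void begin(std::size_t) { ++begins_; }  // would create handle, enable recording
  void end() { ++ends_; }                 // would call mkl_sparse_optimize
  bool dry_run_ = false;
  std::size_t begins_ = 0, ends_ = 0;
};
```

The key property is that the begin/end pair can never be left unbalanced by user code, since the matrix itself brackets the callback.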

That would certainly clean up things a lot. Having to call
A.beginXY();
doStuff();
A.endXY();
is a very C-stylish API, which is error-prone (e.g., one of the calls might not get called because it is masked by a wrong if-condition, or because an exception is thrown and only caught further outside).
This can usually be avoided with proper C++ constructs.

The advantages of the begin/end pair are that 1) it keeps the API minimal, and 2) the user can easily wrap it to accommodate their taste and usage. To avoid all sorts of redundancies, one could even write a free function on top of the begin/end pair:

while(...) {
optimize_if(iter==0, 100, [&]{x=llt.solve(b);}, llt.matrixL());
}

and with variadic templates:

while(...) {
optimize_if(iter==0, 100, [&]{x=LU.solve(b);}, LU.matrixL(), LU.matrixU());
}
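A minimal sketch of such an optimize_if helper, assuming only the proposed beginAnalysis/endAnalysis interface (stubbed here by a FakeAnalyzable type; all names are illustrative):

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for MklSparseMatrix: only the begin/endAnalysis interface
// matters for this sketch; counters record what was called.
struct FakeAnalyzable {
  std::size_t begins = 0, ends = 0;
  void beginAnalysis(std::size_t /*expected_calls*/) { ++begins; }
  void endAnalysis() { ++ends; }
};

// Runs `op`; if `condition` holds, brackets it with begin/endAnalysis
// on every passed matrix (variadic, so e.g. both LU factors can be
// handled in one call). Requires C++17 fold expressions.
template <typename Op, typename... Mats>
void optimize_if(bool condition, std::size_t expected_calls, Op op,
                 Mats&... mats) {
  if (condition) (mats.beginAnalysis(expected_calls), ...);
  op();
  if (condition) (mats.endAnalysis(), ...);
}
```

With this, the loop body reduces to `optimize_if(iter == 0, 100, [&]{ x = llt.solve(b); }, llt.matrixL());`, and the analysis phase runs exactly once.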

Actually, this last example is not conceivable at the moment because the factors in SparseLU are not stored in a standard compressed format; we would need to add an option telling SparseLU to turn them into regular SparseMatrix objects and to use those for solving.

Anyway, all these variants are essentially syntactic sugar/implementation details, and at this stage I'm more concerned about possible shortcomings in the general approach. Is it flexible enough? I'm also not a fan of introducing an MklSparseMatrix inheriting from SparseMatrix, but I don't have a better proposal for now that is generic and easily applicable to decompositions.


gael


However, if this is required, I would suggest adding this method not only to matrices but also to the decompositions (and passing it through to the internal matrices). Otherwise, this does not scale for decompositions which use more than one matrix (like SparseLU). And we could even let the sparseAnalyze() functions return a proxy to the decomposition, which would allow writing something like:

x = llt.sparseAnalyzeAndRun(100)->solve(b);
// equivalent to
llt.sparseAnalyzeAndRun(100, [&]{x=llt.solve(b);});

or
{
auto llt_ = llt.sparseAnalyzeAndRun(100);
x = llt_->solve(b);
y = llt_->matrixL() * c;
} // destructor of llt_ calls `endAnalyze`
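A sketch of that proxy idea: a non-copyable guard whose destructor ends the analysis phase even if an exception escapes the scope. The decomposition is stubbed, and all names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the decomposition; solve() is a dummy for llt.solve(b).
struct FakeDecomposition {
  bool analyzing = false;
  std::size_t ended = 0;
  void beginAnalysis(std::size_t /*expected_calls*/) { analyzing = true; }
  void endAnalysis() { analyzing = false; ++ended; }
  int solve(int b) const { return b + 1; }
};

// Scope guard returned by a hypothetical sparseAnalyzeAndRun(n):
// begins the analysis on construction, ends it on destruction, and
// forwards member access to the decomposition via operator->.
class AnalysisProxy {
public:
  AnalysisProxy(FakeDecomposition& d, std::size_t n) : d_(d) {
    d_.beginAnalysis(n);
  }
  ~AnalysisProxy() { d_.endAnalysis(); }  // runs even on exceptions
  AnalysisProxy(const AnalysisProxy&) = delete;
  AnalysisProxy& operator=(const AnalysisProxy&) = delete;
  FakeDecomposition* operator->() const { return &d_; }

private:
  FakeDecomposition& d_;
};
```

This is the usual RAII answer to the C-style begin/end concern raised earlier in the thread.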

The naming of this method is still debatable, of course.
And I have no idea what actually happens inside MKL when it 'analyzes' an operation (after how many iterations do you actually benefit from the overhead of analyzing the operation?)


Christoph


Cheers,
-Edward

On 4/5/2018 8:01 AM, Gael Guennebaud wrote:
Thank you for opening this discussion on the public mailing list.

So let's discuss the public API, which is currently not very convenient, as already noticed by others. The issues are:

(i1) - Storing MKL's handle in SparseMatrix breaks the ABI and does not sound very generic.
- We need a way to control:
(i2) - which operations are going to be analyzed/optimized,
(i3) - and to specify the 'expected_calls' parameter.


In order to discuss these issues, let's consider the following typical pattern (e.g., non-linear optimization, eigenvalue computations, ...):

SimplicialLDLT<SparseMatrix<double> > llt(A);

while(...) {
...
x = llt.solve(b);
...
}

Here the triangular L factor is going to be used for triangular and transposed-triangular solves dozens to hundreds of times, but only the user of SimplicialLDLT knows that; neither SimplicialLDLT nor SparseMatrix does. Moreover, the user does not own the SparseMatrix that we want to analyze/optimize. Other patterns are likely easier to handle, so let's focus on this one for now.

Regarding (i1), I would suggest introducing a new type, say MklSparseMatrix<>, that would enhance SparseMatrix<> through inheritance. Then for (i2) and (i3) we could imagine something like:

MklSparseMatrix::beginAnalysis(Index expected_calls) const {
// turn *this to compressed mode
// create handle
// store expected_calls
// enable recording mode
}
MklSparseMatrix::endAnalysis() const {
// disable recording mode
// [optional] call mkl_sparse_optimize
}

All states in MklSparseMatrix would be mutable.

Between a pair of beginAnalysis/endAnalysis each call to a supported operation would trigger calls to mkl_sparse_set_*_hint()/mkl_sparse_optimize.
Optionally, we could even add a "dryrun" mode for which no operation would be performed, only calls to mkl_sparse_set_*_hint() and then mkl_sparse_optimize would be called in endAnalysis(). This way mkl_sparse_optimize() would be called only once.
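As a stand-in illustration (not Eigen/MKL code), the recording state machine sketched above could look as follows; the actual MKL calls (mkl_sparse_create_csr, mkl_sparse_set_mv_hint, mkl_sparse_optimize) are replaced by counters so the begin/end logic can be exercised on its own, and all names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the proposed MklSparseMatrix analysis state, with MKL calls
// replaced by counters. Products/solves on a const matrix must be able
// to record hints, hence the mutable state.
class MklSparseMatrixSketch {
public:
  void beginAnalysis(std::size_t expected_calls) const {
    // would: makeCompressed(); mkl_sparse_create_csr(&handle_, ...);
    expected_calls_ = expected_calls;
    recording_ = true;
  }
  void endAnalysis() const {
    recording_ = false;
    ++optimize_calls_;  // would: mkl_sparse_optimize(handle_);
  }
  // Called from inside a supported operation (e.g. sparse * dense).
  void maybeRecordHint() const {
    if (recording_) ++hint_calls_;  // would: mkl_sparse_set_mv_hint(...)
  }

  bool recording() const { return recording_; }
  std::size_t hintCalls() const { return hint_calls_; }
  std::size_t optimizeCalls() const { return optimize_calls_; }

private:
  mutable bool recording_ = false;
  mutable std::size_t expected_calls_ = 0;
  mutable std::size_t hint_calls_ = 0;
  mutable std::size_t optimize_calls_ = 0;
};
```

Only operations issued between beginAnalysis and endAnalysis are recorded; everything outside the window goes through the ordinary code path untouched.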

And that's it. Our example would look like:


SimplicialLDLT<MklSparseMatrix<double> > llt(A);
int it=0;
while(...) {
...
if(it==0) llt.matrixL().beginAnalysis(100);
x = llt.solve(b);
if(it==0) llt.matrixL().endAnalysis();
...
++it;
}

or using a "dry-run" mode:

SimplicialLDLT<MklSparseMatrix<double> > llt(A);

llt.matrixL().beginAnalysis(100, DryRun);
x = llt.solve(b); // permutation and division by the diagonal matrix D would still be performed, but calls to actual triangular solves would be by-passed
llt.matrixL().endAnalysis();

while(...) {
...
x = llt.solve(b);
...
}


If someone deals directly with the factor L, then we could follow the same pattern, or copy the SparseMatrix factor L into an MklSparseMatrix:


SimplicialLLT<SparseMatrix<double> > llt(A);

MklSparseMatrix<double> L(llt.matrixL());
L.beginAnalysis(100,DryRun);
y = L.triangularView<Lower>() * x;
L.endAnalysis();
while(...) {
...
y = L.triangularView<Lower>() * x;
...
}


This design is quite general and extendable to other sparse optimizers, even built-in ones in the future.

In contrast to the current proposal, only selected operations would be passed to MKL (one needs to use an MklSparseMatrix plus a begin/end recording phase).

What do you think?


gael


On Tue, Apr 3, 2018 at 11:39 PM, Zhukova, Maria <***@intel.com> wrote:
Christoph Hertzberg
2018-04-06 19:16:31 UTC
Permalink
Hi,

your mail client seems to mess up quotations (it deletes all '>' at the beginnings of lines). That makes it a bit hard to read.

There is the possibility to modify SparseMatrix locally by defining the macro EIGEN_SPARSEMATRIX_PLUGIN. The mechanism is documented here for MatrixBase:
http://eigen.tuxfamily.org/dox-devel/TopicCustomizing_Plugins.html
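To illustrate the mechanism with a stand-in class (this is not actual Eigen code): Eigen textually includes a user-named plugin header at the end of the class body, which a macro can emulate here. In real code one would put the extra members in a header, e.g. "SparseMatrixAddons.h", and build with -DEIGEN_SPARSEMATRIX_PLUGIN='"SparseMatrixAddons.h"'.

```cpp
#include <cassert>

// Stand-in for the contents of a user plugin header: extra members
// injected into the class without forking it. In Eigen this text would
// live in its own file named by EIGEN_SPARSEMATRIX_PLUGIN.
#define SPARSEMATRIX_PLUGIN_MEMBERS                        \
  bool handle_created = false;                             \
  void createSparseHandle() { handle_created = true; }     \
  void destroySparseHandle() { handle_created = false; }

// Stand-in for Eigen::SparseMatrix; the macro expansion plays the role
// of `#include EIGEN_SPARSEMATRIX_PLUGIN` at the end of the class body.
class SparseMatrixStandIn {
 public:
  int rows = 0, cols = 0;
  SPARSEMATRIX_PLUGIN_MEMBERS
};
```

This shows why the plugin route adds members (and thus can change the ABI of the core type), which is exactly the concern raised here.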

But, as we both said, we think it is generally not a good idea to change
the ABI of core Eigen types.

Cheers,
Christoph
Edward Lam
2018-04-06 22:04:33 UTC
Permalink
Is it within reason to change the ABI of the solvers?

-Edward

On Fri, Apr 6, 2018, 4:33 PM Christoph Hertzberg, <
Post by Christoph Hertzberg
Hi,
your mail client seems to mess up quotations (it deletes all '>' at
the beginnings of lines). That makes it a bit hard to read.
There is the possibility to modify SparseMatrix locally by defining the
macro EIGEN_SPARSEMATRIX_PLUGIN
http://eigen.tuxfamily.org/dox-devel/TopicCustomizing_Plugins.html
But, as we both said, we think it is generally not a good idea to change
the ABI of core Eigen types.
Cheers,
Christoph
Post by Kalinkin, Alexander A
Hi All,
First of all appreciate your interest in integration of sparse kernels
in Eigen.
Please see my comments below
vec1_dense.noalias() += column_major_sparse * vec2_dense
Since this operation is not supported by mkl_sparse_optimize, the
overhead is null here, but for some very sparse matrices (5 nnz/col)
mkl_sparse_d_mv is about x2 slower than Eigen, while for denser matrices
both implementations achieve the same speed.
[akalinki] That sounds strange. Can I ask which MKL version you use
for your testing and what the target architecture is? We implemented
matvec support in IE Sparse BLAS a couple of years ago and have
significantly improved its performance this year. Could you also tell me
the size of the matrix, so that I can check it on my side?
vec1_dense.noalias() += row_major_sparse * vec2_dense
and I observe a very significant drop of performance (almost x10)
compared to pure Eigen (with -fopenmp) if this operation is performed only
once, or very few times... Changing the 'expected_calls' parameter from 10
to 1 does not reduce the performance drop due to mkl_sparse_optimize. If I
repeat this operation hundreds of times, then both Eigen and MKL achieve
the same level of performance for this particular operation and the
matrices I tried (1M^2 Laplacian, bi-Laplacian, and tri-Laplacian matrices).
So I still believe that silently and unconditionally falling back to IE
SpBLAS is not the right approach. It could be that calling MKL's SpBLAS is
never a loss (we have to bench), but that's clearly not the case of the
optimize step. So, clearly, the optimize step must be explicitly activated
on user selected expressions only.
[akalinki] This can be an issue with an older version of MKL, because
initially we spent additional time on the hints/optimize step every time
this was called. Starting from MKL 2018 Update 3 we plan to remove this
overhead; if you want, we can share an engineering build so you can check
performance on the same binaries.
[akalinki] Also, regarding the poor performance of the Sparse BLAS matvec
functionality in comparison with Eigen's built-in one: we interpret all
these cases as issues that need to be investigated and fixed on our side.
So I don't think we need to add any dispatch at the Eigen level between
the built-in kernels and ours; all this logic should be hidden inside the
MKL libraries.
And back to the main question of this topic – we have customers who use
Eigen on Intel processors and are interested in optimizing their code.
Introducing a new SparseMatrix type would force them to rewrite their
code, which can sometimes be impossible for various reasons. That's why
we are asking for a way to simply extend the current SparseMatrix type
to support MKL functionality.
Thanks,
Alexander Kalinkin,
PhD, Team Leader, Architect of Sparse components in Intel® MKL
Sent: Friday, April 6, 2018 5:06 AM
Subject: Re: [eigen] Intel (R) MKL IE SpBLAS support in Eigen
Sent: Thursday, April 5, 2018 2:54 PM
Subject: Re: [eigen] Intel (R) MKL IE SpBLAS support in Eigen
On Thu, Apr 5, 2018 at 11:00 PM, Christoph Hertzberg <
Would it be useful to incorporate lambdas into the interface to avoid
begin/end pairs? Then from the user side, we would write code like this:
1) Analyze and run
SimplicialLDLT<MklSparseMatrix<double>> llt(A);
for (int it = 0; ...; ++it) {
    if (it == 0)
        llt.matrixL().sparseAnalyzeAndRun(100, [&] { x = llt.solve(b); });
    else
        x = llt.solve(b);
    // ...
}
2) Analyze only
SimplicialLDLT<MklSparseMatrix<double>> llt(A);
llt.matrixL().sparseAnalyze(100, [&] { x = llt.solve(b); });
// and solve as usual
For more complicated algorithms, one can always outline the lambda and
pass it into the analysis.
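As a rough, self-contained illustration of this idea (not part of the actual patch; MklSparseMatrixSketch, sparseAnalyzeAndRun, beginAnalysis and endAnalysis are hypothetical names taken from this thread, with counters standing in for the real MKL handle), the lambda-based interface could wrap the begin/end pair so the analysis phase closes even if the callable throws:

```cpp
#include <cassert>

// Hypothetical stand-in for the proposed MklSparseMatrix; a real class
// would wrap an MKL sparse handle instead of these flags.
struct MklSparseMatrixSketch {
  bool analyzing = false;
  int expectedCalls = 0;

  void beginAnalysis(int expected_calls) {
    analyzing = true;            // real version: create handle, record hints
    expectedCalls = expected_calls;
  }
  void endAnalysis() {
    analyzing = false;           // real version: call mkl_sparse_optimize
  }

  // Wraps the begin/end pair around a user callable, so endAnalysis()
  // runs even if the callable throws.
  template <class F>
  void sparseAnalyzeAndRun(int expected_calls, F&& f) {
    beginAnalysis(expected_calls);
    try { f(); }
    catch (...) { endAnalysis(); throw; }
    endAnalysis();
  }
};
```

A RAII guard would give the same exception safety with even less boilerplate on the caller's side.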
That would certainly clean up things a lot. Having to call
A.beginXY();
doStuff();
A.endXY();
is a very C-stylish API, which is error-prone (e.g., one of the calls
might not get called, because it is masked by a wrong if-condition, or due
to an exception, which is caught only outside).
This can usually be avoided with proper C++ constructs.
The advantages of the begin/end pair are that 1) it keeps the API
minimal, and 2) the user can easily wrap it to accommodate their own
taste and usage. To avoid all sorts of redundancy, one could even write a
free function:
while(...) {
optimize_if(iter==0, 100, [&]{x=llt.solve(b);}, llt.matrixL());
}
while(...) {
optimize_if(iter==0, 100, [&]{x=LU.solve(b);}, LU.matrixL(), LU.matrixU());
}
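For illustration only, such a free helper could be sketched as below (the factor type and its beginAnalysis/endAnalysis members are assumptions from this thread, not existing Eigen API; a dummy type replaces llt.matrixL() / LU.matrixU()):

```cpp
#include <cassert>

// Dummy factor type standing in for llt.matrixL(), LU.matrixU(), etc.
struct FactorSketch {
  int begun = 0, ended = 0;
  void beginAnalysis(int /*expected_calls*/) { ++begun; }
  void endAnalysis() { ++ended; }
};

// Free helper: when `cond` holds, open an analysis phase on every passed
// factor, run the callable, then close the phase on all of them.
template <class F, class... Mats>
void optimize_if(bool cond, int expected_calls, F&& f, Mats&... mats) {
  if (cond) (mats.beginAnalysis(expected_calls), ...);  // C++17 fold
  f();
  if (cond) (mats.endAnalysis(), ...);
}
```

The variadic parameter pack is what lets the same helper cover both the single-factor SimplicialLDLT case and the two-factor SparseLU case.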
Actually, this last example is not conceivable at the moment, because the
factors in SparseLU are not stored in a standard compressed format; we
would need to add an option telling SparseLU to turn them into regular
SparseMatrix objects and to use those for solving.
Anyway, all these variants are essentially syntactic
sugar/implementation details, and at this stage I'm more concerned about
possible shortcomings in the general approach. Is it flexible enough? I'm
also not a fan of introducing an MklSparseMatrix inheriting from
SparseMatrix, but I don't have a better proposal for now that is generic
and easily applicable to decompositions.
gael
However, if this is required, I would suggest adding this method not
only to matrices but also to the decompositions (and passing it through to
the internal matrices). Otherwise, this does not scale to decompositions
which use more than one matrix (like SparseLU). And we could even let the
sparseAnalyze() functions return a proxy to the decomposition, which would
allow:
x = llt.sparseAnalyzeAndRun(100)->solve(b);
// equivalent to
llt.sparseAnalyzeAndRun(100, [&]{x=llt.solve(b);});
or
{
auto llt_ = llt.sparseAnalyzeAndRun(100);
x = llt_->solve(b);
y = llt_->matrixL() * c;
} // destructor of llt_ calls `endAnalyze`
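A minimal sketch of that proxy idea, using a stand-in decomposition type (none of these names exist in Eigen; they follow the naming used in this thread, and solve() is reduced to a placeholder):

```cpp
#include <cassert>

// Stand-in for a decomposition such as SimplicialLDLT.
struct DecompositionSketch {
  bool analyzing = false;
  void beginAnalysis(int /*expected_calls*/) { analyzing = true; }
  void endAnalysis() { analyzing = false; }
  int solve(int b) { return b; }  // placeholder for llt.solve(b)
};

// Proxy returned by sparseAnalyzeAndRun(n): forwards member access to the
// decomposition and closes the analysis phase in its destructor, so the
// phase ends even on early return or exception.
class AnalyzeGuard {
  DecompositionSketch& dec_;
public:
  AnalyzeGuard(DecompositionSketch& d, int expected_calls) : dec_(d) {
    dec_.beginAnalysis(expected_calls);
  }
  ~AnalyzeGuard() { dec_.endAnalysis(); }
  DecompositionSketch* operator->() { return &dec_; }
};

AnalyzeGuard sparseAnalyzeAndRun(DecompositionSketch& d, int expected_calls) {
  return AnalyzeGuard(d, expected_calls);  // copy elided (C++17)
}
```

The scoped variant from the example above then falls out naturally: the analysis window is exactly the lifetime of the guard object.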
The naming of this method is still debatable, of course.
And I have no idea what actually happens inside MKL when it 'analyzes'
an operation (after how many iterations do you actually benefit from the
overhead of analyzing the operation?)
Christoph
Cheers,
-Edward
Thank you for opening this discussion on the public mailing list.
So let's discuss the public API, which currently raises a few issues:
(i1) - Storing MKL's handle in SparseMatrix breaks the ABI and does not
sound very generic,
(i2) - which operations are going to be analyzed/optimized,
(i3) - and specify the 'expected_calls' parameter.
In order to discuss these issues, let's consider the following typical
pattern: (e.g., non-linear optimization, eigenvalues, ...)
SimplicialLDLT<SparseMatrix<double> > llt(A);
while(...) {
...
x = llt.solve(b);
...
}
Here the triangular L factor is going to be used for triangular and
transposed-triangular solves dozens to hundreds of times, but only the
user of SimplicialLDLT knows that; neither SimplicialLDLT nor SparseMatrix
does. Moreover, the user does not own the SparseMatrix that we want to
analyze/optimize for. Other patterns are likely easier to handle, so let's
focus on this one for now.
Regarding (i1), I would suggest introducing a new type, say
MklSparseMatrix<>, that would enhance SparseMatrix<> through inheritance.
MklSparseMatrix::beginAnalysis(Index expected_calls) const {
// turn *this to compressed mode
// create handle
// store expected_calls
// enable recording mode
}
MklSparseMatrix::endAnalysis() const {
// disable recording mode
// [optional] call mkl_sparse_optimize
}
All states in MklSparseMatrix would be mutable.
Between a pair of beginAnalysis/endAnalysis each call to a supported
operation would trigger calls to
mkl_sparse_set_*_hint()/mkl_sparse_optimize.
Optionally, we could even add a "dryrun" mode for which no operation
would be performed, only calls to mkl_sparse_set_*_hint() and then
mkl_sparse_optimize would be called in endAnalysis(). This way
mkl_sparse_optimize() would be called only once.
SimplicialLDLT<MklSparseMatrix<double> > llt(A);
int it=0;
while(...) {
...
if(it==0) llt.matrixL().beginAnalysis(100);
x = llt.solve(b);
if(it==0) llt.matrixL().endAnalysis();
...
++it;
}
SimplicialLDLT<MklSparseMatrix<double> > llt(A);
llt.matrixL().beginAnalysis(100, DryRun);
x = llt.solve(b); // permutation and division by the diagonal matrix D
                  // would still be performed, but calls to the actual
                  // triangular solves would be by-passed
llt.matrixL().endAnalysis();
while(...) {
...
x = llt.solve(b);
...
}
If someone deals directly with the factor L, then we could follow the
same approach:
SimplicialLLT<SparseMatrix<double> > llt(A);
MklSparseMatrix L(llt.matrixL());
L.beginAnalysis(100,DryRun);
y = L.triangularView<Lower>() * x;
L.endAnalysis();
while(...) {
...
y = L.triangularView<Lower>() * x;
...
}
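To make the intended dry-run semantics concrete, here is a toy model of the recording matrix (counters replace the real mkl_sparse_set_*_hint()/mkl_sparse_optimize calls; DryRun and all member names are assumptions taken from this proposal, not MKL's actual API):

```cpp
#include <cassert>

enum AnalysisMode { Record, DryRun };

// Toy model of the proposed recording matrix: between beginAnalysis and
// endAnalysis every supported operation registers a hint; in DryRun mode
// the actual kernel call is by-passed.
struct RecordingMatrixSketch {
  bool recording = false;
  bool dryRun = false;
  int hints = 0;       // stands in for mkl_sparse_set_*_hint() calls
  int kernelRuns = 0;  // stands in for actual triangular solves/products

  void beginAnalysis(int /*expected_calls*/, AnalysisMode m = Record) {
    recording = true;
    dryRun = (m == DryRun);
  }
  void endAnalysis() {
    recording = false;  // real version: single mkl_sparse_optimize call here
    dryRun = false;
  }
  void triangularSolve() {
    if (recording) ++hints;
    if (!(recording && dryRun)) ++kernelRuns;  // skip kernel in dry run
  }
};
```

In this model the dry-run pass records the hint without paying for the kernel, matching the description above where only the permutation and diagonal steps would still execute.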
This design is quite general and extendable to any sparse optimizer,
even built-in ones in the future.
In contrast to the current proposal, only selected operations would be
passed to MKL (one needs to use an MklSparseMatrix plus a begin/end
recording phase).
What do you think?
gael
Gael Guennebaud
2018-04-07 06:08:26 UTC
Permalink
Post by Edward Lam
Is it within reason to change the ABI of the solvers?
Yes, changing the ABI of solvers is acceptable, because they are very
unlikely to be passed across libraries/applications.

gael
Post by Edward Lam
-Edward
On Fri, Apr 6, 2018, 4:33 PM Christoph Hertzberg, <
Post by Christoph Hertzberg
Hi,
your mail-client seems to mess-up quotations (it is deleting all '>' at
the line beginnings) That makes it a bit hard to read.
There is the possibility to modify SparseMatrix locally by defining the
macro EIGEN_SPARSEMATRIX_PLUGIN
http://eigen.tuxfamily.org/dox-devel/TopicCustomizing_Plugins.html
But, as we both said, we think it is generally not a good idea to change
the ABI of core Eigen types.
Cheers,
Christoph
Post by Kalinkin, Alexander A
Hi All,
First of all appreciate your interest in integration of sparse kernels
in Eigen.
Post by Kalinkin, Alexander A
Please see my comments below
vec1_dense.noalias() += column_major_sparse * vec2_dense
Since this operation is not supported by mkl_sparse_optimize, the
overhead is null here, but for some very sparse matrices (5 nnz/col)
mkl_sparse_d_mv is about x2 slower than Eigen, while for denser matrices
both implementations achieve the same speed.
Post by Kalinkin, Alexander A
[akalinki] That’s sound strange. Can I ask you which MKL version do you
use for your testing and what architecture is the target? We implemented
matvec support in IE SparseBlas couple of years ago and significantly
improve its performance during this year. Can I ask you also about the size
of matrix, so I will be able to check it on my side?
Post by Kalinkin, Alexander A
vec1_dense.noalias() += row_major_sparse * vec2_dense
and I observe a very significant drop of performance (almost x10)
compared to pure Eigen (with -fopenmp) if this operation is performed only
once, or very few times... Changing the 'expected_calls' parameter from 10
to 1 does not reduce the performance drop due to mkl_sparse_optimize. If I
repeat this operation hundreds of times, then both Eigen and MKL achieve
the same level of performance for this particular operation and the
matrices I tried (1M^2 Laplacian, bi-Laplacian, and tri-Laplacian matrices).
Post by Kalinkin, Alexander A
So I still believe that silently and unconditionally falling back to IE
SpBLAS is not the right approach. It could be that calling MKL's SpBLAS is
never a loss (we have to bench), but that's clearly not the case of the
optimize step. So, clearly, the optimize step must be explicitly activated
on user selected expressions only.
Post by Kalinkin, Alexander A
[akalinki] This can be an issue for an older version of MKL, because
initially we spent additional time on doing hints/optimize when calling
this every time. Starting from MKL2018u3 we’re planning to remove this
overhead – if you want we can share eng. build to check performance on same
binaries.
Post by Kalinkin, Alexander A
Also about poor performance of SparseBlas matvec functionality in
comparison with Eigen built-in one: all these cases we interpret as issue
that need to be investigated and should be fixed on our side. So I don’t
think that we need to add any dispatch on Eigen level between built-in
kernels and ours – all this stuff should be hidden in MKL libraries.
Post by Kalinkin, Alexander A
And back to the main question of this topic – we have customers who use
Eigen on Intel processors and are interested in optimizing their code.
Implementation of new SparseMatrix type will forced them to rewrite the
code, that sometimes can be impossible due to different reasons. That’s why
we are asking to find the way how we can just extend current SparseMatrix
type to support MKL functionality.
Post by Kalinkin, Alexander A
Thanks,
Alexander Kalinkin,
PhD, Team leader, Architect or of Sparse components in Intel®MKL
Sent: Friday, April 6, 2018 5:06 AM
Subject: Re: [eigen] Intel (R) MKL IE SpBLAS support in Eigen
vec1_dense.noalias() += column_major_sparse * vec2_dense
Since this operation is not supported by mkl_sparse_optimize, the
overhead is null here, but for some very sparse matrices (5 nnz/col)
mkl_sparse_d_mv is about x2 slower than Eigen, while for denser matrices
both implementations achieve the same speed.
Post by Kalinkin, Alexander A
gael.
On Fri, Apr 6, 2018 at 1:42 PM, Gael Guennebaud <
vec1_dense.noalias() += row_major_sparse * vec2_dense
and I observe a very significant drop of performance (almost x10)
compared to pure Eigen (with -fopenmp) if this operation is performed only
once, or very few times... Changing the 'expected_calls' parameter from 10
to 1 does not reduce the performance drop due to mkl_sparse_optimize. If I
repeat this operation hundreds of times, then both Eigen and MKL achieve
the same level of performance for this particular operation and the
matrices I tried (1M^2 Laplacian, bi-Laplacian, and tri-Laplacian matrices).
Post by Kalinkin, Alexander A
So I still believe that silently and unconditionally falling back to IE
SpBLAS is not the right approach. It could be that calling MKL's SpBLAS is
never a loss (we have to bench), but that's clearly not the case of the
optimize step. So, clearly, the optimize step must be explicitly activated
on user selected expressions only.
Post by Kalinkin, Alexander A
gael
Best regards,
Maria
Sent: Thursday, April 5, 2018 2:54 PM
Subject: Re: [eigen] Intel (R) MKL IE SpBLAS support in Eigen
On Thu, Apr 5, 2018 at 11:00 PM, Christoph Hertzberg <
Would it be useful to incorporate lambda's into the interface to avoid
begin/end pairs? So from the user side, we would write code like this
Post by Kalinkin, Alexander A
1) Analyze and run
SimplicialLDLT<MklSparseMatrix<double>> llt(A);
int it = 0;
for (int it = 0; ...; ++it) {
if (it == 0)
llt.matrixL().sparseAnalyzeAndRun(100, [&] {
llt.solve(b); });
Post by Kalinkin, Alexander A
else
llt.solve(b);
// ...
}
2) Analyze only
SimplicialLDLT<MklSparseMatrix<double>> llt(A);
llt.matrixL().sparseAnalyze(100, [&] { llt.solve(b); });
// and solve as usual
For more complicated algorithms, one can always outline the lambda and
pass it into the analysis.
Post by Kalinkin, Alexander A
That would certainly clean up things a lot. Having to call
A.beginXY();
doStuff();
A.endXY();
is a very C-stylish API, which is error-prone (e.g., one of the calls
might not get called, because it is masked by a wrong if-condition, or due
to an exception, which is caught only outside).
Post by Kalinkin, Alexander A
This can usually be avoided with proper C++ constructs.
The advantages of the begin/end pair are that 1) this keeps the API
minimal, and 2) the user can easily wrap them to accommodate its taste and
usages. To avoid all sorts of redundancies, one could even write a free
Post by Kalinkin, Alexander A
while(...) {
optimize_if(iter==0, 100, [&]{x=llt.solve(b);}, llt.matrixL());
}
while(...) {
optimize_if(iter==0, 100, [&]{x=LU.solve(b);}, LU.matrixL(),
LU.matrixU());
Post by Kalinkin, Alexander A
}
Actually, this last example is not conceivable at the moment because
the factors in SparseLU are not stored in a standard compressed format, we
would need to add an option to tell SparseLU to turn them to regular
SparseMatrix and to use them for solving.
Post by Kalinkin, Alexander A
Anyways, all these variants are essentially syntactic
sugar/implementation details, and at this stage I'm more concerned about
possible shortcomings in the general approach. Is it flexible enough? I'm
also not fan of introducing a MklSparseMatrix inheriting SparseMatrix, but
I don't have a better offer for now that is generic and easily applicable
to decompositions.
Post by Kalinkin, Alexander A
gael
However, if this is required, I would suggest to add this method not
only to matrices but also to the decomposition (and pass it through to the
internal matrices). Otherwise, this does not scale for decompositions which
use more than one matrix (like SparseLU). And we could even let the
sparseAnalyze() functions return a proxy to the decomposition which would
Post by Kalinkin, Alexander A
x = llt.sparseAnalyzeAndRun(100)->solve(b);
// equivalent to
llt.sparseAnalyzeAndRun(100, [&]{x=llt.solve(b);});
or
{
auto llt_ = llt.sparseAnalyzeAndRun(100);
x = llt_-> solve(b);
y = llt_-> matrixL() * c;
} // destructor of llt_ calls `endAnalyze`
The naming of this method is still debatable, of course.
And I have no idea what actually happens inside MKL when it 'analyzes'
an operation (after how many iterations do you actually benefit from the
overhead of analyzing the operation?)
Post by Kalinkin, Alexander A
Christoph
Cheers,
-Edward
Thank you for opening this discussion on the public mailing list.
So let's discuss about the public API, which currently is not very
(i1) - Storing MKL's handle in SparseMatrix breaks ABI and does not
sounds very generic.
Post by Kalinkin, Alexander A
(i2) - which operations are going to be analyzed/optimized,
(i3) - and specify the 'expected_calls' parameter.
In order to discuss these issues, let's consider the following typical
pattern: (e.g., non-linear optimization, eigenvalues, ...)
Post by Kalinkin, Alexander A
SimplicialLDLT<SparseMatrix<double> > llt(A);
while(...) {
...
x = llt.solve(b);
...
}
Here the triangular L factor is going to be used for triangular and
transposed-triangular solves dozens to hundreds of time but only the user
of SimplicialLDLT knowns that, not SimplicialLDLT, nor SparseMatrix.
Moreover, the user does not own the SparseMatrix that we want to
analyze/optimize for. Other patterns are likely easier to handle, so let's
focus on it for now.
Post by Kalinkin, Alexander A
Regarding (i1), I would suggest to introduce a new type, say
MklSparseMatrix<> that would enhance SparseMatrix<> through inheritance.
Post by Kalinkin, Alexander A
MklSparseMatrix::beginAnalysis(Index expected_calls) const {
// turn *this to compressed mode
// create handle
// store expected_calls
// enable recording mode
}
MklSparseMatrix::endAnalysis() const {
// disable recording mode
// [optional] call mkl_sparse_optimize
}
All states in MklSparseMatrix would be mutable.
Between a pair of beginAnalysis/endAnalysis each call to a supported
operation would trigger calls to mkl_sparse_set_*_hint()/mkl_
sparse_optimize.
Post by Kalinkin, Alexander A
Optionally, we could even add a "dryrun" mode for which no operation
would be performed, only calls to mkl_sparse_set_*_hint() and then
mkl_sparse_optimize would be called in endAnalysis(). This way
mkl_sparse_optimize() would be called only once.
Post by Kalinkin, Alexander A
SimplicialLDLT<MklSparseMatrix<double> > llt(A);
int it=0;
while(...) {
...
if(it==0) llt.matrixL().beginAnalysis(100);
x = llt.solve(b);
if(it==0) llt.matrixL().endAnalysis();
...
++it;
}
SimplicialLDLT<MklSparseMatrix<double> > llt(A);
llt.matrixL().beginAnalysis(100, DryRun);
x = llt.solve(b); // permutation and division by the diagonal matrix D
would still be performed, but calls to actual triangular solves would be
by-passed
Post by Kalinkin, Alexander A
llt.matrixL().endAnalysis();
while(...) {
...
x = llt.solve(b);
...
}
If someone directly deal with the factor L, then we could follow the
SimplicialLLT<SparseMatrix<double> > llt(A);
MklSparseMatrix L(llt.matrixL());
L.beginAnalysis(100,DryRun);
y = L.triangularView<Lower>() * x;
L.endAnalysis();
while(...) {
...
y = L.triangularView<Lower>() * x;
...
}
This design in quite general and expendable to any sparse-optimizers,
even built-in ones in the future.
Post by Kalinkin, Alexander A
In contrast to the current proposal, only selected operations would be
passed to MKL (need to use a MklSparseMatrix + begin/end recording phase).
Post by Kalinkin, Alexander A
What do you think?
gael
On Tue, Apr 3, 2018 at 11:39 PM, Zhukova, Maria <
Hello Eigen community,
My name is Maria Zhukova and I’m a software development engineer
at Intel ®
Post by Kalinkin, Alexander A
MKL Sparse team.
My team is interested in contributing into Eigen, so I’ve
investigated our
Post by Kalinkin, Alexander A
Eigen support different operations for sparse matrices stored in
CSR and CSC
Post by Kalinkin, Alexander A
format which can be implemented on a basis of IE SpBLAS kernels
(please,
Post by Kalinkin, Alexander A
refer to
https://software.intel.com/en-us/mkl-developer-reference-c-
inspector-executor-sparse-blas-routines
Post by Kalinkin, Alexander A
<https://software.intel.com/en-us/mkl-developer-reference-
c-inspector-executor-sparse-blas-routines>
Post by Kalinkin, Alexander A
for the general idea of interfaces)
, basically we want to implement calls to our IE SpBLAS into next
operations:____
SparseMatrix + SparseMatrix (mkl_sparse_?_add)
SparseMatrix * DenseVector (mkl_sparse_?_mv)____
SparseMatrix * DenseMatrix (mkl_sparse_?_mm)____
SparseMatrix * SparseMatrix (mkl_sparse_spmm),
and Triangular solve (mkl_sparse_?_trsv).____
I’ve already started with implementation of
sparse_time_dense_impl_mkl
Post by Kalinkin, Alexander A
kernel which is based on mkl_sparse_?_mv (included in patch).____
*#include <Eigen/SpBLASSupport> *<-- *NEW:* IE SpBLAS include
module ____
Post by Kalinkin, Alexander A
void main () {
SparseMatrix<double, RowMajor> A;
Matrix<double, Dynamic, 1> x, y;
A.makeCompressed(); /* Convert matrix A into CSR/CSC format */
*A.createSparseHandle();*/* *NEW*: is used to create handle
required for all
Post by Kalinkin, Alexander A
IE SpBLAS routines */____
// support of IE SpBLAS is here
y = beta*y + alpha*A*x; /* call to mkl_sparse_?_mv with operation =
SPARSE_OPERATION_NON_TRANSPOSE */
y = beta*y + alpha*A.transpose()*x; /* call to mkl_sparse_?_mv with
operation = SPARSE_OPERATION_TRANSPOSE */
y = beta*y + alpha*A.adjoint()*x; /* call to mkl_sparse_?_mv with
operation
Post by Kalinkin, Alexander A
= SPARSE_OPERATION_CONJUGATE_TRANSPOSE */____
*A.destroySparseHandle();* /* *NEW*: is used to delete created
handle */
Post by Kalinkin, Alexander A
}____
__ __
I’ve attached a draft patch including all necessary changes and
would like
Post by Kalinkin, Alexander A
to hear your feedback.
Please, let me know if you have any questions and comments.____
__ __
Best regards,
Maria____
__ __
__ __
__ __
--
Dr.-Ing. Christoph Hertzberg
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 5
28359 Bremen, Germany
DFKI GmbH
Robotics Innovation Center
Robert-Hooke-Straße 1
28359 Bremen, Germany
Tel.: +49 421 178 45-4021
Zentrale: +49 421 178 45-0
Weitere Informationen: http://www.dfki.de/robotik
-----------------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Straße 122, D-67663 Kaiserslautern
Geschaeftsfuehrung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
(Vorsitzender) Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
Sitz der Gesellschaft: Kaiserslautern (HRB 2313)
USt-Id.Nr.: DE 148646973
Steuernummer: 19/672/50006
-----------------------------------------------------------------------
Gael Guennebaud
2018-04-06 23:02:37 UTC
Permalink
Hi,

to answer your questions, let me clarify that I'm using the 2017.5.220
version of MKL because the 2018 version is not compatible with my OS (OS X
10.11). The CPU is a quad-core Haswell (i7-4960HQ).

I tried various symmetric matrices of about 1M x 1M with 5M to 15M
nonzeros, and also some from the SuiteSparse collection such as 'bmw3_2'.
To be fair, some are also significantly faster (about 2x) when using MKL's
IE SpBLAS, for instance 'pkustk14', but again only when doing a dozen
products or more; otherwise calling mkl_sparse_optimize is counterproductive.


On Fri, Apr 6, 2018 at 8:49 PM, Kalinkin, Alexander A <
Post by Kalinkin, Alexander A
[akalinki] This can be an issue for older versions of MKL, because
initially we spent additional time on hints/optimize when calling
this every time. Starting from MKL 2018u3 we plan to remove this
overhead; if you want, we can share an engineering build so you can check
performance on the same binaries.
Also, regarding the poor performance of the SparseBlas matvec functionality
in comparison with Eigen's built-in one: we interpret all such cases as
issues that need to be investigated and fixed on our side. So I don't
think we need any dispatch at the Eigen level between built-in
kernels and ours; all this should be hidden inside the MKL libraries.
I have no doubt that SparseBlas can always beat or match Eigen
without calling mkl_sparse_optimize, but I have strong doubts that
calling mkl_sparse_optimize with expected_calls=10, even when the matrix
will be used only once for the given operation, can come with negligible
overhead.
Post by Kalinkin, Alexander A
And back to the main question of this topic: we have customers who use
Eigen on Intel processors and are interested in optimizing their code.
Introducing a new SparseMatrix type would force them to rewrite their
code, which can sometimes be impossible for various reasons. That's why
we are asking how we can simply extend the current SparseMatrix
type to support MKL functionality.
I agree that ideally it would be nice to have the same level of
transparency as we have for dense BLAS, but since here you have to
attach/store additional information with the matrices and (IMO) have to
control for which operations/matrices mkl_sparse_optimize is called,
this seems hardly doable.

If you want to stick with your approach of calling mkl_sparse_optimize
unconditionally and storing the required information within SparseMatrix,
then you can still ship a header-only "module" that does so without
touching Eigen. Your current proof of concept is actually very close to
that. You would need to:

- inject data members and some accessors into SparseCompressedBase through
the currently available plugin mechanism,
- access them through free functions with overloads for Transpose<> (better
than injecting code into Transpose<>),
- create sparse handles on demand (the first time one is needed),
- and delete them automatically when the matrix is destroyed (to this end,
wrap the handle within a tiny class that becomes a member of
SparseCompressedBase and calls mkl_sparse_destroy in its destructor).

The main problem with this approach is that if someone modifies a
SparseMatrix after having performed some operations on it, the
sparse handle becomes invalid, and you would need to ask users to update
their code to explicitly invalidate the sparse handle at adequate places.
This issue is also present in your proof of concept.

gael
Post by Kalinkin, Alexander A
Thanks,
Alexander Kalinkin,
PhD, Team leader, Architect of Sparse components in Intel® MKL
*Sent:* Friday, April 6, 2018 5:06 AM
*Subject:* Re: [eigen] Intel (R) MKL IE SpBLAS support in Eigen
vec1_dense.noalias() += column_major_sparse * vec2_dense
Since this operation is not supported by mkl_sparse_optimize, the
overhead is null here, but for some very sparse matrices (5 nnz/col)
mkl_sparse_d_mv is about 2x slower than Eigen, while for denser matrices
both implementations achieve the same speed.
gael.
vec1_dense.noalias() += row_major_sparse * vec2_dense
and I observe a very significant drop in performance (almost 10x) compared
to pure Eigen (with -fopenmp) if this operation is performed only once, or
very few times... Changing the 'expected_calls' parameter from 10 to 1
does not reduce the performance drop caused by mkl_sparse_optimize. If I
repeat this operation hundreds of times, then both Eigen and MKL achieve
the same level of performance for this particular operation and the
matrices I tried (1M^2 Laplacian, bi-Laplacian, and tri-Laplacian matrices).
So I still believe that silently and unconditionally falling back to IE
SpBLAS is not the right approach. It could be that calling MKL's SpBLAS is
never a loss (we would have to benchmark), but that's clearly not the case
for the optimize step. So, clearly, the optimize step must be explicitly
activated on user-selected expressions only.
gael
Best regards,
Maria
*Sent:* Thursday, April 5, 2018 2:54 PM
*Subject:* Re: [eigen] Intel (R) MKL IE SpBLAS support in Eigen
On Thu, Apr 5, 2018 at 11:00 PM, Christoph Hertzberg <
Would it be useful to incorporate lambdas into the interface to avoid
begin/end pairs? From the user side, we would then write code like this:
1) Analyze and run
SimplicialLDLT<MklSparseMatrix<double>> llt(A);
for (int it = 0; ...; ++it) {
if (it == 0)
llt.matrixL().sparseAnalyzeAndRun(100, [&] { llt.solve(b); });
else
llt.solve(b);
// ...
}
2) Analyze only
SimplicialLDLT<MklSparseMatrix<double>> llt(A);
llt.matrixL().sparseAnalyze(100, [&] { llt.solve(b); });
// and solve as usual
For more complicated algorithms, one can always define the lambda
separately and pass it into the analysis.
That would certainly clean things up a lot. Having to call
A.beginXY();
doStuff();
A.endXY();
is a very C-stylish API, which is error-prone (e.g., one of the calls
might not get called because it is masked by a wrong if-condition, or due
to an exception which is caught only outside).
This can usually be avoided with proper C++ constructs.
The advantages of the begin/end pair are that 1) it keeps the API
minimal, and 2) the user can easily wrap it to accommodate their own taste
and usage. To avoid all sorts of redundancies, one could even write a free
function:
while(...) {
optimize_if(iter==0, 100, [&]{x=llt.solve(b);}, llt.matrixL());
}
while(...) {
    optimize_if(iter==0, 100, [&]{x=LU.solve(b);}, LU.matrixL(), LU.matrixU());
}
Actually, this last example is not conceivable at the moment because the
factors in SparseLU are not stored in a standard compressed format; we
would need to add an option telling SparseLU to turn them into regular
SparseMatrix objects and to use those for solving.
Anyway, all these variants are essentially syntactic sugar/implementation
details, and at this stage I'm more concerned about possible shortcomings
in the general approach. Is it flexible enough? I'm also not a fan of
introducing an MklSparseMatrix inheriting from SparseMatrix, but I don't
have a better offer for now that is generic and easily applicable to
decompositions.
gael
However, if this is required, I would suggest adding this method not only
to matrices but also to the decompositions (and passing it through to the
internal matrices). Otherwise, this does not scale for decompositions which
use more than one matrix (like SparseLU). And we could even let the
sparseAnalyze() functions return a proxy to the decomposition, which would
allow:
x = llt.sparseAnalyzeAndRun(100)->solve(b);
// equivalent to
llt.sparseAnalyzeAndRun(100, [&]{x=llt.solve(b);});
or
{
auto llt_ = llt.sparseAnalyzeAndRun(100);
x = llt_->solve(b);
y = llt_->matrixL() * c;
} // destructor of llt_ calls `endAnalyze`
The naming of this method is still debatable, of course.
And I have no idea what actually happens inside MKL when it 'analyzes' an
operation (after how many iterations do you actually benefit, given the
overhead of analyzing the operation?)
Christoph
Cheers,
-Edward
Thank you for opening this discussion on the public mailing list.
So let's discuss the public API, which currently is not very satisfactory:
(i1) - storing MKL's handle in SparseMatrix breaks the ABI and does not sound very generic,
(i2) - there is no way to control which operations are going to be analyzed/optimized,
(i3) - nor to specify the 'expected_calls' parameter.
In order to discuss these issues, let's consider the following typical
usage pattern (e.g., non-linear optimization, eigenvalue computations, ...):
SimplicialLDLT<SparseMatrix<double> > llt(A);
while(...) {
...
x = llt.solve(b);
...
}
Here the triangular factor L is going to be used for triangular and
transposed-triangular solves dozens to hundreds of times, but only the user
of SimplicialLDLT knows that, not SimplicialLDLT nor SparseMatrix.
Moreover, the user does not own the SparseMatrix that we want to
analyze/optimize. Other patterns are likely easier to handle, so let's
focus on this one for now.
Regarding (i1), I would suggest introducing a new type, say
MklSparseMatrix<>, that would enhance SparseMatrix<> through inheritance.
MklSparseMatrix::beginAnalysis(Index expected_calls) const {
// turn *this to compressed mode
// create handle
// store expected_calls
// enable recording mode
}
MklSparseMatrix::endAnalysis() const {
// disable recording mode
// [optional] call mkl_sparse_optimize
}
All states in MklSparseMatrix would be mutable.
Between a pair of beginAnalysis/endAnalysis calls, each call to a supported
operation would trigger calls to mkl_sparse_set_*_hint()/mkl_sparse_optimize().
Optionally, we could even add a "dry-run" mode in which no operation would
be performed: only the calls to mkl_sparse_set_*_hint() would be made, and
mkl_sparse_optimize() would then be called in endAnalysis(). This way
mkl_sparse_optimize() would be called only once.
SimplicialLDLT<MklSparseMatrix<double> > llt(A);
int it=0;
while(...) {
...
if(it==0) llt.matrixL().beginAnalysis(100);
x = llt.solve(b);
if(it==0) llt.matrixL().endAnalysis();
...
++it;
}
SimplicialLDLT<MklSparseMatrix<double> > llt(A);
llt.matrixL().beginAnalysis(100, DryRun);
x = llt.solve(b); // permutation and division by the diagonal matrix D
would still be performed, but calls to actual triangular solves would be
by-passed
llt.matrixL().endAnalysis();
while(...) {
...
x = llt.solve(b);
...
}
If someone deals directly with the factor L, then we could follow the same
approach:
MklSparseMatrix<double> L(llt.matrixL());
L.beginAnalysis(100,DryRun);
y = L.triangularView<Lower>() * x;
L.endAnalysis();
while(...) {
...
y = L.triangularView<Lower>() * x;
...
}
This design is quite general and extendable to any sparse optimizer, even
built-in ones in the future.
In contrast to the current proposal, only selected operations would be
passed to MKL (one needs to use an MklSparseMatrix plus a begin/end
recording phase).
What do you think?
gael
Zhukova, Maria
2018-04-11 17:41:26 UTC
Permalink
Zhukova, Maria
2018-05-25 00:33:54 UTC
Permalink
Edward Lam
2018-04-06 01:06:46 UTC
Permalink
On Thu, Apr 5, 2018, 7:00 PM Gael Guennebaud wrote:
Post by Gael Guennebaud
The advantages of the begin/end pair are that 1) this keeps the API
minimal, and 2) the user can easily wrap them to accommodate its taste and
usages. To avoid all sorts of redundancies, one could even write a free
With regard to syntax, it's a strong reason why I'm using a C++ library in
the first place; otherwise I might as well be using MKL directly. My point
is that the API should be safe by default. I don't mind having additional
means for something more flexible, but the user should have to go out of
their way to use it.

I think that having a safe and easy API while accommodating more general
needs should be possible. It will likely require more implementation
effort, of course. As mentioned in my other post, there might be good ideas
in having a separate optimizer (optimizer_hint?) class, but that raises
questions about the lifetimes of this class, the algorithm, and the
matrix.

On a related topic, I have similar questions about the solver and matrix
classes too. What happens if the matrix is destroyed before the solver in
Eigen? I've always been careful that this doesn't happen when I write code,
because I'm not sure of the consequences.

-Edward


Gael Guennebaud
2018-04-06 11:53:49 UTC
Permalink
Post by Edward Lam
On a related topic, I have similar questions about this with the solver
and matrix class too. What happens if the matrix is destroyed before the
solver in Eigen? I've always been careful that doesn't happen when I write
code because I'm not sure of the consequences.
Basically, there is no dependency for direct solvers, whereas iterative
solvers store a reference. There are red warnings in the reference doc, e.g.:


https://eigen.tuxfamily.org/dox/classEigen_1_1ConjugateGradient.html#ac10f778fcd137eca1f6057c8ddd3d644

https://eigen.tuxfamily.org/dox/classEigen_1_1IterativeSolverBase.html#a7dfa55c55e82d697bde227696a630914

but I agree that this is not clear from the user manual pages.

gael