Re: [eigen] Re: RFC: making a deterministic and reproducible product codepath with Eigen


On Tue, Sep 6, 2016 at 10:17 AM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:


putting aside the problem of properly configuring your compiler, and neglecting speed, all you need to ensure is that any kind of reduction (sums, products, etc.) is always carried out in the same order regardless of the hardware. This implies that (1) vectorization within reductions (like vector.sum()) must be disabled, and that (2) block sizes are always the same.

Regarding (2), this is already possible through some internal mechanisms that we use for debugging and benchmarking purposes. You need to define the following macro prior to including any Eigen header:


This needs Eigen 3.3.
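The macro snippet itself did not survive in the archive. Given the mention of TEST just below, it is presumably Eigen's blocking-size override macros; a hedged sketch of how they would be used (the macro names are inferred rather than quoted from the original mail, and the specific sizes are arbitrary):

```cpp
// Assumption: these are Eigen's internal blocking-size overrides
// (names inferred from context, not quoted from the original mail).
// Defining them before any Eigen header pins the GEMM block sizes,
// so blocking no longer depends on the machine's cache sizes.
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZES 1
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_K 64
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_M 64
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_N 64
#include <Eigen/Dense>
```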

Hm, perhaps TEST should be replaced with USE?  That would allow people, e.g. in machine learning, or those trying to squeeze every drop out of their exact machines, to find the optimum blocking sizes.

Regarding (1), currently this can only be achieved by disabling Eigen's vectorization, i.e. by defining EIGEN_DONT_VECTORIZE=1 (and maybe disabling auto-vectorization on the compiler side).
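As a sketch, the define goes before the first Eigen include; the compiler flag mentioned in the comment is a GCC/Clang example and an assumption about one's toolchain:

```cpp
// Compile with e.g. -fno-tree-vectorize (GCC/Clang) to also suppress
// the compiler's own auto-vectorization of reduction loops.
#define EIGEN_DONT_VECTORIZE 1  // disable Eigen's explicit SIMD paths
#include <Eigen/Dense>
```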

If all this is for debugging purposes only, then I guess that's good enough.

Well, it's for validation, but also with reproducibility as a goal. The kind of reproducibility I'm ultimately after (the one of most value to me) doesn't just entail the same results run-to-run or across machines, but across different implementations (different matrix product libraries - like reference blas, or a common in-house library) - special bit patterns being an allowed/necessary exception to bit-exact reproducibility.  Of course at different times people want different things - so there are times when run-to-run reproducibility is all that's desired, which I also need, and what you just linked I think plays an important role in that.

User-pluggable products would also allow us to extend things further without burdening the core, such as using parallel prefix-sum based reductions for the accumulations, as one would do on GPUs - or other reduction trees.  The only thing this seems to open up is that the user may want to start mixing which kind of product is being used - I'm not yet sure how to deal with that, given all the compile-time dispatch... any ideas?

I must also add that this will only work if you use the same Eigen version everywhere. Different versions of Eigen might rewrite some expressions in different ways.

Is this for the more complex Eigen utilities, like JacobiSVD or inversion, or do you mean even for core ops, like products?


On Tue, Sep 6, 2016 at 12:12 AM, Jason Newton <nevion@xxxxxxxxx> wrote:
Following up.

I missed the Goto paper reference for blocking matrix products that was way up above the implementation - so after studying that paper and some of Eigen's source, I know what's happening now, or at least what should be happening - the actual implementation in Eigen still loses me.

But I did find out about GeneralProduct and the product_type_selector, and CoeffBasedProductMode - which seems to be the naive C implementation I was after.  So I intend to try things with the following patch - Gael, am I on the right path in option and strategy?

diff -r c04c7f10894d Eigen/src/Core/GeneralProduct.h
--- a/Eigen/src/Core/GeneralProduct.h   Mon Sep 05 17:14:20 2016 +0200
+++ b/Eigen/src/Core/GeneralProduct.h   Mon Sep 05 18:00:29 2016 -0400
@@ -81,6 +81,10 @@
  * This is a compile time mapping from {1,Small,Large}^3 -> {product types} */
 // FIXME I'm not sure the current mapping is the ideal one.
 template<int M, int N>  struct product_type_selector<M,N,1>              { enum { ret = OuterProduct }; };
+template<int M, int N, int Depth> struct product_type_selector<M, N, Depth> { enum { ret = CoeffBasedProductMode }; };
+template<>              struct product_type_selector<Small, Small, 1>    { enum { ret = LazyCoeffBasedProductMode }; };
+template<>              struct product_type_selector<Small, Large, 1>    { enum { ret = LazyCoeffBasedProductMode }; };
+template<>              struct product_type_selector<Large, Small, 1>    { enum { ret = LazyCoeffBasedProductMode }; };
 template<int M>         struct product_type_selector<M, 1, 1>            { enum { ret = LazyCoeffBasedProductMode }; };
 template<int N>         struct product_type_selector<1, N, 1>            { enum { ret = LazyCoeffBasedProductMode }; };
 template<int Depth>     struct product_type_selector<1,    1,    Depth>  { enum { ret = InnerProduct }; };
@@ -104,6 +108,7 @@
 template<>              struct product_type_selector<Large,Small,Small>  { enum { ret = CoeffBasedProductMode }; };
 template<>              struct product_type_selector<Small,Large,Small>  { enum { ret = CoeffBasedProductMode }; };
 template<>              struct product_type_selector<Large,Large,Small>  { enum { ret = GemmProduct }; };
 } // end namespace internal

But there is also cblas's way of doing matrix mults - which is to distribute each coefficient of the rhs to all the output locations it contributes to, in a single pass.  Maybe this becomes EIGEN_USE_DISTRIBUTE_BY_RHS_COEFF_PRODUCTS?  Not sure I can do a packetmath implementation of this, but its purpose is validation with the present API, not speed.


On Fri, Sep 2, 2016 at 3:47 AM, Jason Newton <nevion@xxxxxxxxx> wrote:
Hey Gael et al,

I was trying to understand the underlying implementation of the general matrix-matrix and matrix-vector products - I haven't succeeded there (is there a write-up on it somewhere?) - but I thought I'd inquire about making a codepath whose results other software can match, provided they play by the same rules.  Maybe this means evaluating products the way cblas/reference blas does, maybe this means the naive approach - maybe allow the user to switch between those 2 implementations.  Maybe the MKL/blas code path can be leveraged for this and enhanced (to support reference blas and general inverses as well).

The idea for control is to set a macro up top to select this preference, like the default storage order or the parallelization/vectorization knobs.

The advantage of doing this is that when porting code from one context to another (be it GPUs, or different languages, like python/numpy) we can get a 100% bit-exact match as long as both domains follow the same algorithms (and deal with rounding the same way - another topic).  That provides a fairly strong guarantee that the ported code, or the code in the other domain, is correct (provided a large enough input space is used for coverage) - and when not performing that verification, we can go back to the fast paths.  Further, the result is by necessity reproducible across machines (maybe not across architectures, but certainly across processor configs) - this also has value to many people who are willing to take performance penalties to attain it.  I think Eigen already achieves this when disabling openmp (and, on mixed-generation machines, potentially vectorization), but I thought I'd throw it out there too.

I can tell you that for small matrices on a GPU (small enough to fit in local memory), the naive approach works very well for parallel computations - and those are probably the most frequently used matrix sizes.

Python/numpy doesn't care and relies fully on the underlying blas implementation, which users can fairly easily inspect, and it supports cblas by default.

I thought of bringing this up on the ML after reading the following (listed to show the property is desirable, not that it's my first time encountering the issue):

