Re: [eigen] Parallel matrix multiplication causes heap allocation

On Dec 18, 2016, at 1:12 PM, Francois Fayard <fayard@xxxxxxxxxxxxx> wrote:

My mistake, OpenBLAS does the same. Here is an excerpt from their code (I have erased a few lines).

ICOPYB_OPERATION(min_l, min_i, a, lda, ls, is, sa);
for (xxx = range_n[current], bufferside = 0; xxx < range_n[current + 1]; xxx += div_n, bufferside ++) {
    KERNEL_OPERATION(min_i, MIN(range_n[current + 1] - xxx, div_n), min_l, ALPHA5, ALPHA6,
                     sa, (FLOAT *)job[current].working[mypos][CACHE_LINE_SIZE * bufferside],
                     c, ldc, is, xxx);
}

Gael, do you copy because you want the “small matrix” to have its rows stored one after the other (in C ordering), so that the leading dimension of the “small matrix” equals its number of columns? Or do you copy to get the alignment required by vector instructions?
If you know of any general reference on optimizing BLAS 3 routines, that would be nice.
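
To make the leading-dimension part of this question concrete, here is a minimal sketch in plain C++. It assumes a column-major source matrix and copies one block into a contiguous, row-by-row buffer; pack_block and pack_demo are hypothetical names used only for illustration, not functions from Eigen or OpenBLAS, and the sketch says nothing about how either library actually lays out its packed blocks or handles alignment.

```cpp
#include <cstddef>
#include <vector>

// Copy the rows x cols block starting at (row0, col0) of a column-major
// matrix `a` (leading dimension lda) into `packed`, stored row by row.
// After the copy the block's leading dimension is `cols`, not `lda`.
// pack_block is a hypothetical helper, not an Eigen or OpenBLAS function.
void pack_block(const double* a, std::size_t lda,
                std::size_t row0, std::size_t col0,
                std::size_t rows, std::size_t cols,
                double* packed) {
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            packed[i * cols + j] = a[(row0 + i) + (col0 + j) * lda];
}

// Build a 4 x 5 column-major matrix with a(i, j) = 10 * i + j and pack
// its 2 x 3 block starting at (1, 2).
std::vector<double> pack_demo() {
    const std::size_t lda = 4, n = 5;
    std::vector<double> a(lda * n);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < lda; ++i)
            a[i + j * lda] = 10.0 * static_cast<double>(i) + static_cast<double>(j);
    std::vector<double> packed(2 * 3);
    pack_block(a.data(), lda, 1, 2, 2, 3, packed.data());
    return packed;
}
```

In the source matrix, consecutive elements of one block row are lda entries apart; in the packed buffer they are adjacent, which is what lets an inner kernel read them with unit stride.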

http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf (http://ieeexplore.ieee.org/document/6877334/) and references therein, especially https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf (http://dl.acm.org/citation.cfm?id=1356053).

Jeff

Thanks

François Fayard
Founder & Consultant - Inside Loop
Applied Mathematics & High Performance Computing
Tel: +33 (0)6 01 44 06 93
Web: www.insideloop.io

On Dec 18, 2016, at 6:12 PM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:

On Sun, Dec 18, 2016 at 7:28 AM, François Fayard <fayard@xxxxxxxxxxxxx> wrote:
Hi Rene,

I recently skimmed through the matrix multiplication code. In order to be cache friendly, Eigen performs many smaller matrix multiplications, and it turns out that those smaller matrices are copied and rearranged in memory to speed up the multiplication process. So malloc is expected to happen in matrix multiplication.

Yes, I confirm.

As far as I know, other BLAS libraries such as OpenBLAS don't perform such copies. Is there any way to get rid of them in Eigen?

nope, OpenBLAS does the same, and I believe MKL does too. I'm convinced that for large enough matrices it is impossible to reach such high performance without those copies/repacking.
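
As a rough illustration of the copies/repacking discussed here, the following plain C++ sketch multiplies column-major matrices block by block and packs each block into scratch buffers first. The function names, the block sizes mc and kc, and the idea of passing the scratch buffers in from the caller are all assumptions made for this example; it is not Eigen's or OpenBLAS's implementation, and nothing in this thread establishes that Eigen lets a caller supply such buffers.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// C (m x n) += A (m x k) * B (k x n), all column-major and tightly stored.
// `a_pack` and `b_pack` are caller-provided scratch buffers of at least
// mc * kc and kc * n doubles; nothing is allocated inside this function.
// gemm_packed and its block sizes are hypothetical, not Eigen's API.
void gemm_packed(std::size_t m, std::size_t n, std::size_t k,
                 const double* a, const double* b, double* c,
                 double* a_pack, double* b_pack,
                 std::size_t mc = 64, std::size_t kc = 64) {
    for (std::size_t p0 = 0; p0 < k; p0 += kc) {
        const std::size_t kb = std::min(kc, k - p0);
        // Pack the kb x n panel of B so that each column is contiguous.
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t p = 0; p < kb; ++p)
                b_pack[p + j * kb] = b[(p0 + p) + j * k];
        for (std::size_t i0 = 0; i0 < m; i0 += mc) {
            const std::size_t mb = std::min(mc, m - i0);
            // Pack the mb x kb block of A row by row, so the inner
            // product below reads both operands with unit stride.
            for (std::size_t i = 0; i < mb; ++i)
                for (std::size_t p = 0; p < kb; ++p)
                    a_pack[p + i * kb] = a[(i0 + i) + (p0 + p) * m];
            for (std::size_t j = 0; j < n; ++j)
                for (std::size_t i = 0; i < mb; ++i) {
                    double sum = 0.0;
                    for (std::size_t p = 0; p < kb; ++p)
                        sum += a_pack[p + i * kb] * b_pack[p + j * kb];
                    c[(i0 + i) + j * m] += sum;
                }
        }
    }
}

// Reference triple loop, used only to check the blocked version.
void gemm_naive(std::size_t m, std::size_t n, std::size_t k,
                const double* a, const double* b, double* c) {
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t i = 0; i < m; ++i)
                c[i + j * m] += a[i + p * m] * b[p + j * k];
}
```

The two packing loops are the copies in question: they need scratch memory proportional to the block sizes, and a library that obtains that memory from the heap on each call will trigger an allocation check like the one reported at the start of this thread.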

gael

François

On 18 Dec 2016, at 01:06, Rene Ahlsdorf <ahlsdorf@xxxxxxxxxxxxxxxxxx> wrote:

Dear Eigen team,

first of all, thank you for all your effort to create such a great math library. I really love using it.

I’ve got a question about your parallelization routines. I want to compute a parallel (OpenMP-based) matrix multiplication (result: 500 x 250 matrix) without allocating any new memory in the meantime. So I have called “Eigen::internal::set_is_malloc_allowed(false)” to check that nothing goes wrong. However, my program crashes with the error message
“Assertion failed: (is_malloc_allowed() && "heap allocation is forbidden (EIGEN_RUNTIME_NO_MALLOC is defined and g_is_malloc_allowed is false)"), function check_that_malloc_is_allowed, file /Users/xxx//libs/eigen/Eigen/src/Core/util/Memory.h, line 143.” Is this behaviour intended? Should there be an allocation before doing parallel calculations? Or am I doing something wrong?

Thanks in advance.

Regards,
René Ahlsdorf

Eigen Version: 3.3.1 (commit f562a193118d)

My code: https://gist.github.com/anonymous/d57c835171b2068817b9f82493b43ea7

Attached: Screenshot showing the last function calls
<Screenshot 2016-12-18 01.01.42.png>