Re: [eigen] Parallel matrix multiplication causes heap allocation
My mistake: OpenBLAS does the same. Here is an excerpt from their code (I have erased a few lines).

    ICOPYB_OPERATION(min_l, min_i, a, lda, ls, is, sa);

    for (xxx = range_n[current], bufferside = 0; xxx < range_n[current + 1]; xxx += div_n, bufferside ++) {

        KERNEL_OPERATION(min_i, MIN(range_n[current + 1] - xxx, div_n), min_l, ALPHA5, ALPHA6,
                         sa, (FLOAT *)job[current].working[mypos][CACHE_LINE_SIZE * bufferside],
                         c, ldc, is, xxx);

Gael, do you copy because you want the "small matrix" to have its rows stored one after the other (in C ordering), so that the leading dimension of the "small matrix" is equal to its number of columns? Or do you copy because you want to guarantee alignment for vector instructions?

If you have any general reference on optimizing BLAS 3 routines, that would be nice.

Thanks,

François Fayard
Founder & Consultant - Inside Loop
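To make the two motivations in the question concrete, here is a minimal sketch in C of what such a copy could look like. It is not the packing routine of Eigen or OpenBLAS, whose packed layouts are more elaborate. The helper names `alloc_aligned` and `pack_block_row_major` are hypothetical, as is the 64-byte alignment, and the code assumes a C11 library that provides `aligned_alloc`. The block is copied out of a column-major matrix into a contiguous buffer whose leading dimension equals the block's number of columns, and that buffer starts on an aligned address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed alignment in bytes (one typical cache line); illustrative only. */
#define PACK_ALIGNMENT 64

/* Hypothetical helper: allocate room for `count` doubles on a
 * PACK_ALIGNMENT-byte boundary. C11 aligned_alloc expects the size to be
 * a multiple of the alignment, so the byte count is rounded up.
 * Release the buffer with free(). */
static double *alloc_aligned(size_t count)
{
    size_t bytes = count * sizeof(double);
    size_t rounded = (bytes + PACK_ALIGNMENT - 1) / PACK_ALIGNMENT * PACK_ALIGNMENT;
    return aligned_alloc(PACK_ALIGNMENT, rounded);
}

/* Hypothetical helper: copy the rows x cols block whose top-left element
 * is (i0, j0) out of the column-major matrix `a` (leading dimension lda)
 * into `packed`, stored row by row. In the packed copy, element (i, j)
 * lives at packed[i * cols + j], so its leading dimension is exactly cols. */
static void pack_block_row_major(const double *a, size_t lda,
                                 size_t i0, size_t j0,
                                 size_t rows, size_t cols,
                                 double *packed)
{
    assert(a != NULL && packed != NULL);
    assert(i0 + rows <= lda); /* the block must fit inside one column of a */

    for (size_t i = 0; i < rows; ++i) {
        for (size_t j = 0; j < cols; ++j) {
            packed[i * cols + j] = a[(i0 + i) + (j0 + j) * lda];
        }
    }
}
```

A caller would allocate `rows * cols` doubles with `alloc_aligned`, fill them with `pack_block_row_major`, hand the buffer to the inner kernel, and `free` it afterwards. Whichever of the two motivations applies, the kernel then reads memory with unit stride instead of striding by `lda`.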