Re: [eigen] Parallel matrix multiplication causes heap allocation
On Dec 18, 2016, at 1:12 PM, Francois Fayard <fayard@xxxxxxxxxxxxx> wrote:

> My mistake, OpenBLAS does the same. Here is an excerpt from their code (I have erased a few lines).
>
>     ICOPYB_OPERATION(min_l, min_i, a, lda, ls, is, sa);
>     for (xxx = range_n[current], bufferside = 0; xxx < range_n[current + 1]; xxx += div_n, bufferside ++) {
>         KERNEL_OPERATION(min_i, MIN(range_n[current + 1] - xxx, div_n), min_l, ALPHA5, ALPHA6,
>                          sa, (FLOAT *)job[current].working[mypos][CACHE_LINE_SIZE * bufferside], c, ldc, is, xxx);
>
> Gael, do you copy because you want the “small matrix” to have its rows stored one after the other (in C ordering), so that the leading dimension of the “small matrix” is equal to the number of columns? Or do you copy because you want to enforce alignment for vector instructions?
>
> If you have any general reference about optimizing BLAS 3 routines, that would be nice.

http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf (http://ieeexplore.ieee.org/document/6877334/) and references therein, especially https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper. (http://dl.acm.org/citation.cfm?id=1356053).
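
To make the quoted question concrete, here is a minimal sketch of the kind of copy being asked about: a sub-block of a row-major matrix is copied into a contiguous buffer whose leading dimension equals the block's number of columns. This is only an illustration, not the packing code of Eigen or OpenBLAS; `pack_block` and all of its parameter names are hypothetical, and production kernels generally use more elaborate packed layouts. Alignment is not handled here; it would depend on how the caller allocates `dst`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an Eigen or OpenBLAS function).
 * Copies the rows x cols block starting at (row0, col0) of the row-major
 * matrix `a`, whose leading dimension is `lda`, into `dst`.
 * After the copy, `dst` is contiguous: its leading dimension is `cols`,
 * so row i of the block starts at dst + i * cols. */
void pack_block(const double *a, size_t lda,
                size_t row0, size_t col0,
                size_t rows, size_t cols,
                double *dst)
{
    /* The block must fit inside one row span of the source matrix. */
    assert(col0 + cols <= lda);

    for (size_t i = 0; i < rows; ++i) {
        const double *src_row = a + (row0 + i) * lda + col0;
        for (size_t j = 0; j < cols; ++j) {
            dst[i * cols + j] = src_row[j];
        }
    }
}
```

With the block packed this way, a kernel can walk `dst` with unit stride and no longer depends on the stride `lda` of the original matrix.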