Re: [eigen] Parallel matrix multiplication causes heap allocation


*To*: eigen <eigen@xxxxxxxxxxxxxxxxxxx>
*Subject*: Re: [eigen] Parallel matrix multiplication causes heap allocation
*From*: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
*Date*: Sun, 18 Dec 2016 23:38:25 +0100

On Sun, Dec 18, 2016 at 11:05 PM, Jeff Hammond <jeff.science@xxxxxxxxx> wrote:

> On Dec 18, 2016, at 1:12 PM, Francois Fayard <fayard@xxxxxxxxxxxxx> wrote:
>
>> My mistake, OpenBLAS does the same. Here is an excerpt from their code (I have erased a few lines):
>>
>>     ICOPYB_OPERATION(min_l, min_i, a, lda, ls, is, sa);
>>     for (xxx = range_n[current], bufferside = 0; xxx < range_n[current + 1]; xxx += div_n, bufferside ++) {
>>       KERNEL_OPERATION(min_i, MIN(range_n[current + 1] - xxx, div_n), min_l, ALPHA5, ALPHA6,
>>                        sa, (FLOAT *)job[current].working[mypos][CACHE_LINE_SIZE * bufferside], c, ldc, is, xxx);
>>
>> Gael, do you copy because you want the "small matrix" to have its lines one after the other (in C ordering), such that the leading dimension of the "small matrix" is equal to the number of columns? Or do you copy because you want alignment for vector instructions?
>>
>> If you have any global reference about optimizing BLAS 3 routines, that would be nice.
>
> http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf (http://ieeexplore.ieee.org/document/6877334/) and references therein, especially https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf (http://dl.acm.org/citation.cfm?id=1356053).

Those are good references. Eigen's implementation was initially based on Goto's paper; in 3.3 it has slightly evolved, but the general ideas are the same. So basically, the left-hand side is first split into vertical panels, which are stored as horizontal micro-panels of size PxK, each micro-panel being stored sequentially in a column-major fashion. P depends on the hardware and the scalar type. For instance, with FMA and float, P = 3 * register-size = 24.
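To make that layout concrete, here is a minimal scalar sketch of such an LHS packing step in plain C++. This is not Eigen's actual packing routine; `packLhs` and its parameters are hypothetical names for illustration only.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch of LHS packing: a column-major M x K block of A
// (leading dimension lda) is copied into micro-panels of P rows. Panel p
// holds rows [p, p+P) for all K columns, laid out column-by-column so the
// kernel can later stream through it with purely sequential loads.
std::vector<double> packLhs(const double* A, int lda, int M, int K, int P) {
    std::vector<double> buf;
    buf.reserve(static_cast<std::size_t>(M) * K);
    for (int p = 0; p < M; p += P) {                 // one micro-panel per P rows
        const int rows = std::min(P, M - p);
        for (int k = 0; k < K; ++k)                  // column-major inside the panel
            for (int r = 0; r < rows; ++r)
                buf.push_back(A[(p + r) + k * lda]);
    }
    return buf;
}
```

With P = 24 (the FMA-and-float figure above), each column of a micro-panel would map directly onto three vector registers.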

The right-hand side is split into horizontal panels; each panel is stored as a sequence of Kx4 vertical micro-panels, each laid out in a row-major fashion.
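A matching hypothetical sketch for the right-hand side (again, made-up helper names rather than Eigen's internals; NR = 4 is the micro-panel width from the text):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch of RHS packing: a column-major K x N block of B
// (leading dimension ldb) is copied into micro-panels of NR = 4 columns,
// each panel stored row-major so that, for every depth index k, the 4
// entries the kernel needs next sit adjacent in memory.
std::vector<double> packRhs(const double* B, int ldb, int K, int N, int NR = 4) {
    std::vector<double> buf;
    buf.reserve(static_cast<std::size_t>(K) * N);
    for (int j = 0; j < N; j += NR) {                // one micro-panel per NR columns
        const int cols = std::min(NR, N - j);
        for (int k = 0; k < K; ++k)                  // row-major inside the panel
            for (int c = 0; c < cols; ++c)
                buf.push_back(B[k + (j + c) * ldb]);
    }
    return buf;
}
```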

With this storage, we can compute products between PxK and Kx4 panels very efficiently, entirely within registers and with purely sequential memory loads. With FMA and float, P = 24, so we reserve 12 registers to hold the 24x4 result block. Then it's only a matter of scheduling these micro-products to make optimal use of the caches.
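A scalar stand-in for that register-blocked micro-product (the real kernel keeps the accumulators in SIMD registers, e.g. 12 of them for the 24x4 float block; this hypothetical plain-C++ sketch only shows the data flow):

```cpp
#include <vector>

// Hypothetical scalar micro-kernel: C[0..P)x[0..4) += Apanel * Bpanel, where
// Apanel is one packed P x K LHS micro-panel (column-major) and Bpanel one
// packed K x 4 RHS micro-panel (row-major). Both panels are read strictly
// sequentially; each k step is a rank-1 update done with multiply-adds.
void microKernel(const double* Apanel, const double* Bpanel,
                 double* C, int ldc, int P, int K) {
    std::vector<double> acc(P * 4, 0.0);             // the "register block"
    for (int k = 0; k < K; ++k)
        for (int c = 0; c < 4; ++c)
            for (int r = 0; r < P; ++r)
                acc[r + c * P] += Apanel[r + k * P] * Bpanel[c + k * 4];
    for (int c = 0; c < 4; ++c)                      // write back to column-major C
        for (int r = 0; r < P; ++r)
            C[r + c * ldc] += acc[r + c * P];
}
```

Because the accumulators never leave the (simulated) register block during the k loop, memory traffic inside the kernel is exactly one sequential pass over each packed panel.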

hope that helps,

gael

**References**:
- **[eigen] Parallel matrix multiplication causes heap allocation**, *From:* Rene Ahlsdorf
- **Re: [eigen] Parallel matrix multiplication causes heap allocation**, *From:* François Fayard
- **Re: [eigen] Parallel matrix multiplication causes heap allocation**, *From:* Gael Guennebaud
- **Re: [eigen] Parallel matrix multiplication causes heap allocation**, *From:* Francois Fayard
- **Re: [eigen] Parallel matrix multiplication causes heap allocation**, *From:* Jeff Hammond
