Re: [eigen] Parallel matrix multiplication causes heap allocation

On Sun, Dec 18, 2016 at 11:05 PM, Jeff Hammond <jeff.science@xxxxxxxxx> wrote:

On Dec 18, 2016, at 1:12 PM, Francois Fayard <fayard@xxxxxxxxxxxxx> wrote:

My mistake, OpenBLAS, does the same. Here is an excerpt from their code (I have erased a few lines).

ICOPYB_OPERATION(min_l, min_i, a, lda, ls, is, sa);
for (xxx = range_n[current], bufferside = 0; xxx < range_n[current + 1]; xxx += div_n, bufferside ++) {
KERNEL_OPERATION(min_i, MIN(range_n[current + 1] - xxx, div_n), min_l, ALPHA5, ALPHA6,
sa, (FLOAT *)job[current].working[mypos][CACHE_LINE_SIZE * bufferside],
c, ldc, is, xxx);

Gael, do you copy because you want the “small matrix” to have its line one after the other (in C ordering) such that the leading dimension of the “small matrix" is equal to the number of columns? Or do you copy because you want to perform alignement for vector instructions?
If you have any global reference about optimizing BLAS 3 routines, that would be nice.

http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf (http://ieeexplore.ieee.org/document/6877334/) and references therein, especially https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf (http://dl.acm.org/citation.cfm?id=1356053).

Those are good references. Eigen's implementation was initially based on Goto's paper, in 3.3 it as slightly evolved but the general ideas are the same. So basically, the left-hand-side is first split into vertical panels that are stored by horizontal panels of size PxK, each micro panel being sequentially stored in a column-major fashion. P depends on the hardware and scalar type. For instance with FMA and float, P=3*register-size=24.

The right-hand-side is split into horizontal panels, each panel is stored per chunk of Kx4 vertical micro-panels that are stored in a row-major fashion.

With this storage, we can compute products between PxK and Kx4 panels very efficiently within registers and purely sequential memory loads. With FMA and float P=24, and we thus reserve 12 registers to hold the result. Then it's only a matter of scheduling these micro-products to optimally use caches.

hope that helps,

gael