Re: [eigen] patch to add ACML support to BTL


On Mon, Mar 23, 2009 at 8:30 PM, Ilya Baran <baran37@xxxxxxxxx> wrote:
> Hi,
> So I'm just learning about fast matrix multiplication and I'm somewhat
> confused.  My original implementation degrades for larger matrices, so
> there's no point posting it.  Now I'm looking at the inner kernel(s)
> and I do not understand the argument in the GOTO paper why the GEPDOT
> kernel is likely to be inefficient--I'm playing with the following and
> it doesn't have any writing to memory until the very end:

note that the general idea of the GEPDOT kernel is to work on a block
of C filling half the L2 cache, together with two panels of A and B;
for instance, for a 1000^3 matrix product, GEPDOT would perform a
250x250 = 250x1000 * 1000x250 sub-product. This is clearly not
optimal, since you would have to perform reads and writes to the L2
cache while streaming both A and B from main memory. It is much better
to place a block of either A or B in L2 and stream C.

> //computes [v10 , v11]^T * [v20 , v21] -- everything should be aligned
> and num should be divisible by PacketSize
> Matrix<Scalar, 2, 2> myDot2(const Scalar *v10, const Scalar *v11,
> const Scalar *v20, const Scalar *v21, int num)
> {
>  Packet total00 = ei_pset1<Scalar>(0);
>  Packet total01 = ei_pset1<Scalar>(0);
>  Packet total10 = ei_pset1<Scalar>(0);
>  Packet total11 = ei_pset1<Scalar>(0);
>  const Scalar *v1end = v10 + num;
>  asm("#dot2 start");
>  for(; v10 < v1end; ) { //not unrolled for readability
>    Packet p10, p11, p20, p21;
>    p10 = ei_pload(v10);
>    p20 = ei_pload(v20);
>    p21 = ei_pload(v21);
>    p11 = ei_pload(v11);
>    total00 = ei_pmadd(p10, p20, total00);
>    total01 = ei_pmadd(p10, p21, total01);
>    total10 = ei_pmadd(p11, p20, total10);
>    total11 = ei_pmadd(p11, p21, total11);
>    v10 += PacketSize, v11 += PacketSize;
>    v20 += PacketSize, v21 += PacketSize;
>  }
>  asm("#dot2 end");
>  Matrix<Scalar, 2, 2> out;
>  out(0, 0) = ei_predux(total00);
>  out(0, 1) = ei_predux(total01);
>  out(1, 0) = ei_predux(total10);
>  out(1, 1) = ei_predux(total11);
>  return out;
> }
> I'm not sure how it compares to the GEBP in Eigen (for the 8 register
> IA32 version)--it doesn't duplicate the B values like Eigen, but it
> doesn't make optimal use of the accumulation registers--I'd appreciate
> any insight.

actually, this corresponds exactly to the previous version (the one in
the stable branch), which is significantly slower. My motivation for
that version was to avoid duplicating the values of B; however, I
think the main issue with this approach is that you have to perform
four times as many (for float) read/write accesses to C, plus many
predux ops, which are quite involved.
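To make the contrast concrete, here is a scalar sketch of the
broadcast-B (outer-product) accumulation scheme: each accumulator holds
a vertical slice of C, so the final store is a plain vector write with
no horizontal reduction (predux) at all. This is a hand-written
illustration of the idea, not Eigen's actual kernel; the 4-wide inner
loop stands in for one packet madd:

```cpp
// Computes a 4x2 block of C += A * B, where a points to k consecutive
// 4-wide column slices of A and b to k rows of the two B columns.
// Each acc array plays the role of one SIMD accumulator register.
void outer_kernel4x2(const float* a, const float* b,
                     float* c0, float* c1, int k)
{
    float acc0[4] = {0, 0, 0, 0};    // column 0 of the C block
    float acc1[4] = {0, 0, 0, 0};    // column 1 of the C block
    for (int p = 0; p < k; ++p) {
        float b0 = b[2 * p];         // "broadcast" B(p,0)
        float b1 = b[2 * p + 1];     // "broadcast" B(p,1)
        for (int i = 0; i < 4; ++i) {            // one simulated packet madd each
            acc0[i] += a[4 * p + i] * b0;
            acc1[i] += a[4 * p + i] * b1;
        }
    }
    for (int i = 0; i < 4; ++i) { c0[i] += acc0[i]; c1[i] += acc1[i]; }
}
```

The price is the duplicated (broadcast) B values; the payoff is that C
is touched once per block with straight packet stores instead of one
predux per accumulator.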

> Perhaps unsurprisingly, a dumb matrix multiply based on
> this seems to work faster than Eigen on things like (20-by-100 *
> 100-by-20) and slower on things like (100-by-20 * 20-by-100).
> I'm now testing on a Core 2, using
> g++-4.2 -DNDEBUG -O3 -march=core2 -msse2 -msse3 [-arch x86_64]
> My benchmarks at this point are not terribly scientific--I'm not
> hooking into BTL.

for such small sizes, I get perf very similar to GOTO's, so I would
say the current state is ok.

> BTW, for 16 registers (and floats), the Eigen code inner kernel
> performs (8-by-4) += (8-by-1) * (1-by-4).  Would there be a slight
> speedup by doing (12-by-3) += (12-by-1) * (1-by-3)  (72 flops instead
> of 64 for the same number of memory accesses)?

indeed, theoretically it is better to use square blocks (at all
levels); currently we have blocks of 2x4 = 8 registers while you
suggest 3x3 = 9 registers. So I tried it, and in practice I obtained
no speed up at all. I think the problem is that this solution requires
one more register, and then there are not enough registers left to
optimally prefetch the other values and perform the computations.
Perhaps with hand-written asm...
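The back-of-the-envelope arithmetic behind the comparison (assuming 16
SSE registers, floats, 4 floats per packet) can be written down as a
tiny sketch; the names here are my own, not anything from Eigen:

```cpp
// An r-by-c block of packet accumulators performs r*c packet madds per
// k-step (2 flops per lane, 4 lanes), costs r packet loads of A plus
// c broadcast loads of B, and ties up r*c registers as accumulators.
struct KernelCost { int flops, loads, accumulators; };

KernelCost cost(int r, int c)    // r = packet rows of A, c = columns of B
{
    return { 2 * 4 * r * c,      // flops per k-step
             r + c,              // memory accesses per k-step
             r * c };            // accumulator registers
}
```

So cost(2, 4) gives 64 flops, 6 accesses, and 8 accumulators for the
current 8x4 kernel, versus 72 flops, the same 6 accesses, and 9
accumulators for the suggested 12x3 variant: more arithmetic per access
on paper, but one fewer free register.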

Anyway, I think that in the current state of the product, better perf
could only be achieved via hand-written asm (better use of the
registers), optimal block sizes, better prefetching, faster copies of
A and B into their respective blocks, a better-optimizing gcc (I get a
big speed up between 4.3 and 4.4, and I'm sure the compiler could
still do a better job on the low-level stuff), etc. More specifically,
the inner loop could be better peeled, and reads from the block of A
are costly (many cache misses). On the other hand, there is nothing to
optimize in the accesses to the blocks of B and C.

> Regarding the A^T * A, I think it would make sense to factor out
> things like the GEBP (or GEPDOT or whatever) into separate functions
> because both the general product and A^T * A should use them as
> subroutines.

yes, this is exactly what I thought.

> Thanks,
>   -Ilya
> On Wed, Mar 18, 2009 at 8:30 AM, Gael Guennebaud
> <gael.guennebaud@xxxxxxxxx> wrote:
>>> I'll see what I can put up, but after the weekend--I have a siggraph
>>> rebuttal to do now.
>> small world :)
> Yeah, I thought you might have one of those too :)
