Re: [eigen] patch to add ACML support to BTL


So I'm just learning about fast matrix multiplication and I'm somewhat
confused.  My original implementation degrades for larger matrices, so
there's no point posting it.  Now I'm looking at the inner kernel(s)
and I do not understand the argument in the Goto paper for why the
GEPDOT kernel is likely to be inefficient. I'm playing with the
following, and it doesn't write to memory until the very end:

// Computes [v10, v11]^T * [v20, v21]. Everything should be aligned
// and num should be divisible by PacketSize.
Matrix<Scalar, 2, 2> myDot2(const Scalar *v10, const Scalar *v11,
                            const Scalar *v20, const Scalar *v21, int num)
{
  // one accumulator packet per entry of the 2-by-2 result
  Packet total00 = ei_pset1<Scalar>(0);
  Packet total01 = ei_pset1<Scalar>(0);
  Packet total10 = ei_pset1<Scalar>(0);
  Packet total11 = ei_pset1<Scalar>(0);

  const Scalar *v1end = v10 + num;
  asm("#dot2 start");
  for(; v10 < v1end; ) { //not unrolled for readability
    Packet p10, p11, p20, p21;
    p10 = ei_pload(v10);
    p20 = ei_pload(v20);
    p21 = ei_pload(v21);
    p11 = ei_pload(v11);
    total00 = ei_pmadd(p10, p20, total00);
    total01 = ei_pmadd(p10, p21, total01);
    total10 = ei_pmadd(p11, p20, total10);
    total11 = ei_pmadd(p11, p21, total11);

    v10 += PacketSize, v11 += PacketSize;
    v20 += PacketSize, v21 += PacketSize;
  }
  asm("#dot2 end");

  // reduce each accumulator packet to a single scalar
  Matrix<Scalar, 2, 2> out;
  out(0, 0) = ei_predux(total00);
  out(0, 1) = ei_predux(total01);
  out(1, 0) = ei_predux(total10);
  out(1, 1) = ei_predux(total11);

  return out;
}

I'm not sure how it compares to the GEBP in Eigen (for the 8-register
IA32 version): it doesn't duplicate the B values the way Eigen does,
but it also doesn't make optimal use of the accumulation registers.
I'd appreciate any insight. Perhaps unsurprisingly, a dumb matrix
multiply based on
this seems to work faster than Eigen on things like (20-by-100 *
100-by-20) and slower on things like (100-by-20 * 20-by-100).
I'm now testing on a Core 2, using
g++-4.2 -DNDEBUG -O3 -march=core2 -msse2 -msse3 [-arch x86_64]
My benchmarks at this point are not terribly scientific--I'm not
hooking into BTL.
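
For illustration only, here is a minimal sketch of how a plain multiply
could be driven by a 2-by-2 dot kernel like the one above. It is not the
code that was benchmarked: it uses a scalar stand-in for the packet
kernel, and it assumes A is stored row-major and B column-major so that
every row of A and every column of B is contiguous. The names Block2x2,
dot2x2 and dumbMultiply are made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar stand-in for myDot2: four dot products that share
// their loads. Computes [a0, a1]^T * [b0, b1] over `num` entries.
struct Block2x2 { float c00, c01, c10, c11; };

inline Block2x2 dot2x2(const float* a0, const float* a1,
                       const float* b0, const float* b1, std::size_t num)
{
  Block2x2 t = {0.f, 0.f, 0.f, 0.f};
  for (std::size_t k = 0; k < num; ++k) {
    const float x0 = a0[k], x1 = a1[k];   // two loads from A
    const float y0 = b0[k], y1 = b1[k];   // two loads from B
    t.c00 += x0 * y0;  t.c01 += x0 * y1;  // four multiply-adds,
    t.c10 += x1 * y0;  t.c11 += x1 * y1;  // nothing stored to C yet
  }
  return t;
}

// C (m-by-n, row-major) = A (m-by-k, row-major) * B (k-by-n, column-major).
// Assumes m and n are even and k > 0; no blocking for cache at all.
inline std::vector<float> dumbMultiply(const std::vector<float>& A,
                                       const std::vector<float>& B,
                                       std::size_t m, std::size_t k,
                                       std::size_t n)
{
  assert(k > 0 && m % 2 == 0 && n % 2 == 0);
  assert(A.size() == m * k && B.size() == k * n);
  std::vector<float> C(m * n);
  for (std::size_t i = 0; i < m; i += 2)
    for (std::size_t j = 0; j < n; j += 2) {
      // rows i, i+1 of A against columns j, j+1 of B
      const Block2x2 t = dot2x2(&A[i * k], &A[(i + 1) * k],
                                &B[j * k], &B[(j + 1) * k], k);
      C[i * n + j] = t.c00;        C[i * n + j + 1] = t.c01;
      C[(i + 1) * n + j] = t.c10;  C[(i + 1) * n + j + 1] = t.c11;
    }
  return C;
}
```

In this arrangement each 2-by-2 block of C is written exactly once, after
a loop of length k. In the packet version the final reductions are a fixed
cost per block, so a long inner dimension amortizes them better, which may
be part of why the 20-by-100 * 100-by-20 case looks better than the
100-by-20 * 20-by-100 case.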

BTW, for 16 registers (and floats), the Eigen code inner kernel
performs (8-by-4) += (8-by-1) * (1-by-4).  Would there be a slight
speedup by doing (12-by-3) += (12-by-1) * (1-by-3)  (72 flops instead
of 64 for the same number of memory accesses)?
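
The arithmetic behind that question can be written down as a small
back-of-envelope helper. This is only a counting sketch under my own
assumptions: A is loaded as full packets, each B value costs one load
(whether it is broadcast or stored pre-duplicated), and every packet of
the result block sits in its own accumulator register. KernelCost and
rank1Cost are hypothetical names, not anything in Eigen.

```cpp
#include <cassert>

// Cost of one step (mr-by-nr) += (mr-by-1) * (1-by-nr) under the
// assumptions stated above.
struct KernelCost {
  int flops;         // multiplies plus adds per step
  int loads;         // packet loads of A plus loads of B
  int accumulators;  // vector registers holding the result block
};

inline KernelCost rank1Cost(int mr, int nr, int packetSize)
{
  assert(packetSize > 0 && mr % packetSize == 0);
  KernelCost c;
  c.flops = 2 * mr * nr;                    // one multiply and one add per entry
  c.loads = mr / packetSize + nr;           // A as packets, B one value at a time
  c.accumulators = (mr / packetSize) * nr;  // one register per result packet
  return c;
}
```

With packetSize = 4 (floats in SSE), rank1Cost(8, 4, 4) gives 64 flops,
6 loads and 8 accumulators, while rank1Cost(12, 3, 4) gives 72 flops,
6 loads and 9 accumulators. So under these assumptions the 12-by-3 shape
does more work per load, at the price of one more accumulator, leaving 7
of the 16 registers for the A and B operands instead of 8.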

Regarding the A^T * A, I think it would make sense to factor out
things like the GEBP (or GEPDOT or whatever) into separate functions
because both the general product and A^T * A should use them as



On Wed, Mar 18, 2009 at 8:30 AM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
>> I'll see what I can put up, but after the weekend--I have a siggraph
>> rebuttal to do now.
> small world :)

Yeah, I thought you might have one of those too :)
