Re: [eigen] patch to add ACML support to BTL
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] patch to add ACML support to BTL
- From: Ilya Baran <baran37@xxxxxxxxx>
- Date: Mon, 23 Mar 2009 15:30:01 -0400
Hi,
So I'm just learning about fast matrix multiplication and I'm somewhat
confused. My original implementation degrades for larger matrices, so
there's no point posting it. Now I'm looking at the inner kernel(s),
and I don't understand the argument in the Goto paper for why the
GEPDOT kernel is likely to be inefficient--I'm playing with the
following, and it doesn't write to memory until the very end:
//computes [v10 , v11]^T * [v20 , v21] -- everything should be aligned,
//and num should be divisible by PacketSize
Matrix<Scalar, 2, 2> myDot2(const Scalar *v10, const Scalar *v11,
                            const Scalar *v20, const Scalar *v21, int num)
{
    // one packet accumulator per entry of the 2x2 result
    Packet total00 = ei_pset1<Scalar>(0);
    Packet total01 = ei_pset1<Scalar>(0);
    Packet total10 = ei_pset1<Scalar>(0);
    Packet total11 = ei_pset1<Scalar>(0);
    const Scalar *v1end = v10 + num;
    asm("#dot2 start");
    for(; v10 < v1end; ) { //not unrolled for readability
        Packet p10, p11, p20, p21;
        p10 = ei_pload(v10);
        p20 = ei_pload(v20);
        p21 = ei_pload(v21);
        p11 = ei_pload(v11);
        total00 = ei_pmadd(p10, p20, total00);
        total01 = ei_pmadd(p10, p21, total01);
        total10 = ei_pmadd(p11, p20, total10);
        total11 = ei_pmadd(p11, p21, total11);
        v10 += PacketSize, v11 += PacketSize;
        v20 += PacketSize, v21 += PacketSize;
    }
    asm("#dot2 end");
    // horizontal sums of the accumulators give the four scalar results
    Matrix<Scalar, 2, 2> out;
    out(0, 0) = ei_predux(total00);
    out(0, 1) = ei_predux(total01);
    out(1, 0) = ei_predux(total10);
    out(1, 1) = ei_predux(total11);
    return out;
}
I'm not sure how it compares to the GEBP in Eigen (the 8-register IA32
version)--unlike Eigen's kernel, it doesn't duplicate the B values, but
it also doesn't make optimal use of the accumulation registers--I'd
appreciate any insight. Perhaps unsurprisingly, a dumb matrix multiply
based on this kernel seems to run faster than Eigen on things like
(20-by-100 * 100-by-20) and slower on things like (100-by-20 * 20-by-100).
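(In case it helps to see what I mean by "dumb": the driver is
essentially the sketch below--not my exact code, no blocking or
packing, A stored so its rows are contiguous, B and C column-major, and
all dimensions assumed to be multiples of 2 or PacketSize as needed.)

//Hypothetical driver sketch: tile C into 2-by-2 blocks and compute each
//block with myDot2.  A is rows-by-depth with contiguous rows, B is
//depth-by-cols column-major, C is rows-by-cols column-major.
void dumbProduct(const Scalar *A, const Scalar *B, Scalar *C,
                 int rows, int depth, int cols)
{
    for (int j = 0; j < cols; j += 2)
        for (int i = 0; i < rows; i += 2)
        {
            Matrix<Scalar, 2, 2> block =
                myDot2(A + i * depth, A + (i + 1) * depth,
                       B + j * depth, B + (j + 1) * depth, depth);
            C[i     +  j      * rows] = block(0, 0);
            C[i     + (j + 1) * rows] = block(0, 1);
            C[i + 1 +  j      * rows] = block(1, 0);
            C[i + 1 + (j + 1) * rows] = block(1, 1);
        }
}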
I'm now testing on a Core 2, using
g++-4.2 -DNDEBUG -O3 -march=core2 -msse2 -msse3 [-arch x86_64]
My benchmarks at this point are not terribly scientific--I'm not
hooking into BTL.
BTW, with 16 registers (and floats), Eigen's inner kernel performs
(8-by-4) += (8-by-1) * (1-by-4). Would there be a slight speedup from
doing (12-by-3) += (12-by-1) * (1-by-3) instead (72 flops instead of 64
for the same number of memory accesses)?
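(To spell out my count, assuming 4-float SSE packets, one broadcast
register per B value, and one accumulator register per 4-by-1 piece of
the destination block--please correct me if I'm miscounting:
  8-by-4:  2 A loads + 4 B broadcasts = 6 loads, 32 madds = 64 flops, 2+4+8 = 14 registers
  12-by-3: 3 A loads + 3 B broadcasts = 6 loads, 36 madds = 72 flops, 3+3+9 = 15 registers
so the flops per load go from 64/6 to 72/6 per iteration, at the cost
of one more register.)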
Regarding the A^T * A product, I think it would make sense to factor
out things like the GEBP (or GEPDOT or whatever) into separate
functions, because both the general product and A^T * A should use
them as subroutines.
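Just to make that concrete, I'm imagining an interface roughly like the
sketch below. The names and parameters are made up for illustration,
not Eigen's actual internals; the real signatures would depend on
whatever packing format the general product uses.

//Hypothetical factoring, illustration only (not Eigen's actual API).
void ei_pack_lhs(Scalar *blockA, const Scalar *lhs, int lhsStride,
                 int rows, int depth);   // copy/reorder a panel of A
void ei_pack_rhs(Scalar *blockB, const Scalar *rhs, int rhsStride,
                 int depth, int cols);   // copy/reorder a panel of B
void ei_gebp(Scalar *res, int resStride,
             const Scalar *blockA, const Scalar *blockB,
             int rows, int depth, int cols); // res += blockA * blockB on packed panels

The general product would then be pack_lhs + pack_rhs + gebp per block,
and A^T * A could pack a panel of A once and pass it in as both operands.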
Thanks,
-Ilya
On Wed, Mar 18, 2009 at 8:30 AM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
>> I'll see what I can put up, but after the weekend--I have a siggraph
>> rebuttal to do now.
>
> small world :)
Yeah, I thought you might have one of those too :)