Re: [eigen] generic unrollers
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] generic unrollers
- From: "Gael Guennebaud" <gael.guennebaud@xxxxxxxxx>
- Date: Wed, 18 Jun 2008 00:37:58 +0200
Seeing your numbers, it seems the perfect formula should take register
pressure into account: without vectorization and with only 8 floating
point registers, the pressure is much higher and the compiler cannot
cache the temporary there, so the evaluation becomes considerably more
expensive. With the new ei_assign_traits we can know in ei_nested
whether vectorization will occur, and the compiler defines
architecture macros (__i386__, __x86_64__, etc.) from which we can
infer the number of registers... so theoretically we can do very
sophisticated tests, but this sounds a bit overkill to me :) The
formula with "<=" sounds like a really good compromise to me.
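
For the record, here is a rough sketch of what the "<=" rule and a
register guess based on the architecture macros could look like. This
is only an illustration, not the actual ei_nested code: the names
should_evaluate_sketch and fp_registers_guess are made up, and I am
assuming that r is the number of times each coefficient is read, SC the
cost of reading a plain scalar, and RC the cost of reading one
coefficient of the unevaluated expression.

```cpp
// Hypothetical sketch, not Eigen's real ei_nested.
// Assumed meaning of the parameters:
//   R  = number of times each coefficient of the nested expression is read
//   SC = cost of reading a plain scalar (e.g. from an evaluated temporary)
//   RC = cost of reading one coefficient of the unevaluated expression
template <int R, int SC, int RC>
struct should_evaluate_sketch {
  enum {
    cost_eval = R * SC,           // the extra store is deliberately ignored
    cost_no_eval = (R - 1) * RC,
    // "<=" means: evaluate even when both sides are equal
    value = (int(cost_eval) <= int(cost_no_eval))
  };
};

// Rough guess at the number of SSE/floating point registers, based on
// macros that common compilers define for the target architecture.
// The value 0 stands for "unknown".
#if defined(__x86_64__) || defined(_M_X64)
enum { fp_registers_guess = 16 };
#elif defined(__i386__) || defined(_M_IX86)
enum { fp_registers_guess = 8 };
#else
enum { fp_registers_guess = 0 };
#endif
```

With illustrative costs SC = 1 and RC = 2 (one read plus one
multiplication, which is again an assumption), a coefficient read twice
gives 2*1 <= 1*2, so should_evaluate_sketch<2, 1, 2>::value is true and
the expression is evaluated; with a single read it is false. A more
sophisticated test could combine this with fp_registers_guess, but as
said above that is probably overkill.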
Gael.
On Tue, Jun 17, 2008 at 11:42 PM, Benoît Jacob <jacob@xxxxxxxxxxxxxxx> wrote:
> Hi,
>
> attached are my own measurements. Intel Core 1 (32bit), g++ 4.3.0, compiled
> with "-O3 -DNDEBUG", that is without vectorization, to make things more fun
> since you already measured with vectorization.
>
> As you can see, here, the overall winner is "<=" but not by a big margin. It
> looks like we have yet to find the definitive formula...
>
> Cheers,
> Benoit
>
> On Monday 16 June 2008 00:07:22 Gael Guennebaud wrote:
>> >> , for the a+b and 2*a cases I'll
>> >> write an exhaustive benchmark... If there is no obvious reason to eval
>> >> a+b for a 2x2 product then it might be better to not eval since this
>> >> allows the user to perform fine tuning for his specific case that is not
>> >> possible if we do (abusive?) evaluation.
>> >
>> > It's great if you do a benchmark, I don't see any other way of moving
>> > forward!
>>
>> here you go (see attached files). So M,N,K denotes the size of the
>> matrix product:
>>
>> MxN = MxK * KxN.
>>
>> I benchmarked both (a+b)*c and (2*a)*c, with 4 different conditions:
>> the current one with "<", the same with "<=", never evaluate, and
>> evaluate if N>1 (i.e. if a coeff is read at least twice). I compiled
>> with gcc-4.2, -O3 -DNDEBUG using float and vectorization enabled.
>>
>> So for this benchmark it is quite clear that, as expected, "<=" works
>> much better than the current "<". But surprisingly, N>1, which
>> implies the evaluation of (2*a) with N==2, works even slightly better!
>> This is probably because the compiler can cache the temporaries in
>> the registers (I have a 64-bit CPU, so 16 SSE registers). In that case
>> accounting for the extra loads and stores is wrong. So we could try
>> this one:
>>
>> r*SC <= (r-1) * RC
>>
>> which basically means let's forget the extra store and evaluate even
>> if it does not look really better (equality). In practice this should
>> give better results (at least for gcc-4.2 with a lot of floating point
>> registers).
>>
>> Gael.