- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Vectorization of complex
- From: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
- Date: Fri, 21 Jan 2011 13:02:41 +0100
Note that our matrix-matrix product kernel for complexes does not use
this pmul function, which is rather slow. The trick is to split the
products into their real and imaginary parts and to combine them only
at the end of a series of mul-adds.
Well, this pmul function is actually still used N^2 times, for the
multiplication by alpha. Recall that our kernel computes C += alpha
* A * B, and even if you only do C = A*B this product with alpha is
still performed, with alpha = 1.
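To make the idea concrete, here is a minimal scalar sketch, not the actual Eigen kernel. The function name gemm_split_sketch, the row-major n x n layout, and the use of std::complex<double> are assumptions made for illustration only. The inner loop multiplies each entry of A only by the real part and by the imaginary part of the matching entry of B, so it contains no full complex product. The two partial sums are combined once per entry of C, followed by a single complex multiply by alpha, which is where the N^2 complex products come from.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Hypothetical scalar sketch (not Eigen's kernel): C += alpha * A * B
// for n x n matrices stored in row-major order.
void gemm_split_sketch(std::size_t n, cd alpha,
                       const std::vector<cd>& A,
                       const std::vector<cd>& B,
                       std::vector<cd>& C) {
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            cd acc_re(0.0, 0.0);  // sum over k of A(i,k) * real(B(k,j))
            cd acc_im(0.0, 0.0);  // sum over k of A(i,k) * imag(B(k,j))
            for (std::size_t k = 0; k < n; ++k) {
                const cd a = A[i * n + k];
                const cd b = B[k * n + j];
                // Complex-times-real only: two real mul-adds per accumulator.
                acc_re += a * b.real();
                acc_im += a * b.imag();
            }
            // Combine once: (ar + i*ai)(br + i*bi)
            //   = (ar*br - ai*bi) + i*(ai*br + ar*bi)
            const cd sum(acc_re.real() - acc_im.imag(),
                         acc_re.imag() + acc_im.real());
            // The only full complex product: once per entry of C.
            C[i * n + j] += alpha * sum;
        }
    }
}
```

In a vectorized version the two accumulators would live in SIMD registers and the final combination would be a shuffle plus an add/sub, but the bookkeeping is the same as in this scalar form.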
On Fri, Jan 21, 2011 at 12:18 PM, Christoph Hertzberg wrote:
> On 20.01.2011 23:11, David Luitz wrote:
>> I then started testing the code and realized that unfortunately my
>> implementation is a bit slower than the SSE2 version. Even more
>> puzzling: the already existing SSE3 implementation is ALSO
>> SLOWER than the SSE2 code! Does anybody have an idea why my SSE4_1 code
>> is even slower than the SSE3 code?
> Just an uneducated guess:
> Especially on older processors, it could be that the CPU only emulates
> SSE3 and SSE4_* instructions, which would make them slower (I had a
> similar experience with an old AMD64 and SSE2 once). In more complex
> programs, though, they could still be faster due to smaller code size.
>> By the way, we are talking about something like 1 percent run time
>> difference in my tests, but still if the SSE3 and SSE4 codes are not
>> really faster than SSE2, I think they should be removed...
> At least this should be tested for different CPUs first ...
> Maybe also make general suggestions such as: "Don't enable SSE3 for ..."
> in the vectorization documentation.
> Dipl.-Inf. Christoph Hertzberg
> Cartesium 0.051
> Universität Bremen
> Enrique-Schmidt-Straße 5
> 28359 Bremen
> Tel: (+49) 421-218-64252