Re: [eigen] vectorization bug
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] vectorization bug
- From: "Gael Guennebaud" <gael.guennebaud@xxxxxxxxx>
- Date: Sun, 24 Aug 2008 16:09:23 +0200
- Cc: "Tim Vandermeersch" <tim.vandermeersch@xxxxxxxxx>
ah I found the vectorized loop of ones generated by gcc 4.3:
.L43:
movq 32(%rsp), %rax
movabsq $4607182418800017408, %r8
movabsq $4607182418800017408, %rdi
movq %r8, (%rax,%rdx)
movq %rdi, 8(%rax,%rdx)
addq $16, %rdx
cmpq $24000, %rdx
jne .L43
the movapd (move 2 aligned doubles) has been replaced *by gcc* with two
64-bit moves, and again we can see how poorly this loop is optimized:
the destination address and the constant are reloaded on every iteration....
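To make the constant in that loop less mysterious: 4607182418800017408 is 0x3FF0000000000000, which is the bit pattern of the double 1.0, assuming IEEE 754 binary64. So each movq stores one "1.0" as a plain 64-bit integer. Here is a small sketch that checks this; bits_of and store_two_ones are hypothetical helper names made up for this illustration, not anything from Eigen or from gcc's output.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Return the raw 64-bit pattern of a double.
// memcpy is used instead of a pointer cast to stay within the aliasing rules.
std::uint64_t bits_of(double x) {
  static_assert(sizeof(double) == sizeof(std::uint64_t),
                "this sketch assumes a 64-bit double");
  std::uint64_t u;
  std::memcpy(&u, &x, sizeof u);
  return u;
}

// Hypothetical helper: fill two adjacent doubles with 1.0 using two
// 64-bit integer stores, mimicking the pair of movq in the listing.
// Assumes IEEE 754 binary64 doubles.
void store_two_ones(double* dst) {
  assert(dst != nullptr);
  const std::uint64_t one = UINT64_C(4607182418800017408); // 0x3FF0000000000000
  std::memcpy(dst, &one, sizeof one);      // plays the role of movq %r8, (%rax,%rdx)
  std::memcpy(dst + 1, &one, sizeof one);  // plays the role of movq %rdi, 8(%rax,%rdx)
}
```

Calling bits_of(1.0) on such a platform should give back the same integer that appears in the two movabsq instructions.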
in your case it seems it did something similar, I guess the:
fstl (%eax,%edx)
fstpl 8(%eax,%edx)
are the two "store" instructions which correspond to the single movapd....
gael.
On Sun, Aug 24, 2008 at 3:57 PM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
> Hi Benoit,
>
> Secondly, with Eigen gcc 4.3 sucks compared to gcc 4.2. I observed
> that in all benchmarks.
> In your case here are my results (core2, 64-bit system):
>
>                 gcc 4.2    gcc 4.3
> explicit vec    0.38s      0.5s
> implicit vec    0.48s      0.46s
>
> the "implicit vec" line actually means no vectorization for gcc 4.2 and
> gcc's default auto-vectorization for gcc 4.3.
>
> so here is the core of the vector addition:
>
> gcc 4.3:
>
> .L57:
> movq 32(%rsp), %rax
> addl $2, %ecx
> movapd (%rax,%rdx), %xmm0
> movq 16(%rsp), %rax
> addpd (%rax,%rdx), %xmm0
> movq (%rsp), %rax
> movapd %xmm0, (%rax,%rdx)
> addq $16, %rdx
> cmpl %ecx, %r8d
> jg .L57
>
> as we can see, gcc should have moved the 3 movq instructions (which load
> the addresses of the data) out of the loop!
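For readers who prefer source to assembly, here is a rough C++ picture of what "moving the address loads out of the loop" means. This is only a sketch under my own assumptions: Buffer, add_reloading and add_hoisted are hypothetical names, and I am not claiming this is why gcc 4.3 emitted the reloads in the listing. The first function reads the data pointers through the structs inside the loop, so a compiler that cannot prove they stay unchanged may reload them each iteration; the second copies them into locals once, before the loop.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an object that owns a data pointer.
struct Buffer {
  double* data;
  std::size_t size;
};

// Pointers are read through the structs inside the loop body.
// Whether they are actually reloaded each iteration is up to the compiler.
void add_reloading(Buffer* dst, const Buffer* a, const Buffer* b) {
  assert(a->size == dst->size && b->size == dst->size);
  for (std::size_t i = 0; i < dst->size; ++i)
    dst->data[i] = a->data[i] + b->data[i];
}

// Pointers and size are loaded once into locals, so the loop body only
// needs the two loads, the add and the store.
void add_hoisted(Buffer* dst, const Buffer* a, const Buffer* b) {
  assert(a->size == dst->size && b->size == dst->size);
  double* const d = dst->data;
  const double* const x = a->data;
  const double* const y = b->data;
  const std::size_t n = dst->size;
  for (std::size_t i = 0; i < n; ++i)
    d[i] = x[i] + y[i];
}
```

Both functions compute the same result; the only difference is where the address loads sit, which is the difference between the .L57 and .L73 loops.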
>
> Now let's compare with gcc 4.2 code:
>
> .L73:
> movapd (%rax,%rbp), %xmm0
> addpd (%rax,%rbx), %xmm0
> movapd %xmm0, (%rax,%rdi)
> addq $16, %rax
> cmpq $24000, %rax
> jne .L73
>
> yeah much much better !!
>
> FYI current gcc trunk (future 4.4) generates code here, so let's not
> bother... also I'm using g++-4.3 (GCC) 4.3.0 20080215 (experimental)
> which is not the most recent one....
>
>
>
> About Ones, here it is well vectorized: (gcc 4.2 and 4.4)
>
> .L62:
> movapd %xmm0, (%rax,%rdx)
> addq $16, %rax
> cmpq $24000, %rax
> jne .L62
>
> and for some weird reasons, it seems gcc 4.3 drops the middle
> vectorized loop here.... very strange !
>
> cheers,
> gael.
>
>
> 2008/8/24 <jacob@xxxxxxxxxxxxxxx>:
>> Hi List,
>>
>> Here's a simple benchmark, a.cpp. It runs faster without vectorization than
>> with!
>>
>> Trying to understand this I added some asm comments in Assign.h, so my copy
>> looks like this:
>>
>> template<typename Derived1, typename Derived2>
>> struct ei_assign_impl<Derived1, Derived2, LinearVectorization, NoUnrolling>
>> {
>>   static void run(Derived1 &dst, const Derived2 &src)
>>   {
>>     asm("#begin");
>>     const int size = dst.size();
>>     const int packetSize = ei_packet_traits<typename Derived1::Scalar>::size;
>>     const int alignedStart =
>>       ei_assign_traits<Derived1,Derived2>::DstIsAligned ? 0
>>       : ei_alignmentOffset(&dst.coeffRef(0), size);
>>     const int alignedEnd = alignedStart +
>>       ((size-alignedStart)/packetSize)*packetSize;
>>
>>     asm("#unaligned start");
>>
>>     for(int index = 0; index < alignedStart; index++)
>>       dst.copyCoeff(index, src);
>>     asm("#aligned middle");
>>
>>     for(int index = alignedStart; index < alignedEnd; index += packetSize)
>>     {
>>       dst.template copyPacket<Derived2, Aligned,
>>         ei_assign_traits<Derived1,Derived2>::SrcAlignment>(index, src);
>>     }
>>
>>     asm("#unaligned end");
>>
>>     for(int index = alignedEnd; index < size; index++)
>>       dst.copyCoeff(index, src);
>>     asm("#end");
>>   }
>> };
>>
>> I attach the resulting assembly (a.s). Can you see what's wrong?
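The start/middle/end split in the code above can be tried outside Eigen with plain arrays. The following is a minimal sketch, not Eigen's implementation: alignment_offset, split_range, assign_sum, Ranges and kPacketSize are hypothetical names, the packet is assumed to be 2 doubles (16 bytes, as with SSE2), and the "packet" step is written as a small scalar loop standing in for one movapd/addpd/movapd group. It also assumes the pointer is at least aligned to sizeof(double).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const std::size_t kPacketSize = 2;  // assumption: 2 doubles per 16-byte packet

struct Ranges {
  std::size_t alignedStart;  // first index that sits on a packet boundary
  std::size_t alignedEnd;    // one past the last full packet
};

// Number of leading scalars before the first packet-aligned element,
// capped at size. Plays the role that ei_alignmentOffset has above.
std::size_t alignment_offset(const double* p, std::size_t size) {
  const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
  const std::size_t misalign = (addr / sizeof(double)) % kPacketSize;
  const std::size_t first = (kPacketSize - misalign) % kPacketSize;
  return first < size ? first : size;
}

// Same arithmetic as alignedStart / alignedEnd in the quoted code.
Ranges split_range(const double* dst, std::size_t size) {
  Ranges r;
  r.alignedStart = alignment_offset(dst, size);
  r.alignedEnd = r.alignedStart +
                 ((size - r.alignedStart) / kPacketSize) * kPacketSize;
  return r;
}

// dst = a + b, walked in three phases like the quoted run() function.
void assign_sum(double* dst, const double* a, const double* b,
                std::size_t size) {
  const Ranges r = split_range(dst, size);
  for (std::size_t i = 0; i < r.alignedStart; ++i)  // unaligned start
    dst[i] = a[i] + b[i];
  for (std::size_t i = r.alignedStart; i < r.alignedEnd; i += kPacketSize)
    for (std::size_t k = 0; k < kPacketSize; ++k)    // stands in for one packet op
      dst[i + k] = a[i + k] + b[i + k];
  for (std::size_t i = r.alignedEnd; i < size; ++i)  // unaligned end
    dst[i] = a[i] + b[i];
}
```

With a 16-byte aligned destination the first loop runs zero times, so for an even size such as the 3000 doubles in the benchmark only the middle loop does any work, which is why that loop is the one worth reading in the assembly.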
>>
>> Another thing. The ones() part compiles to this:
>>
>> xorl %edx, %edx
>> .L107:
>> movl -24(%ebp), %eax
>> fld1
>> fstl (%eax,%edx)
>> fstpl 8(%eax,%edx)
>> addl $16, %edx
>> cmpl $24000, %edx
>> jne .L107
>>
>> This is not vectorized, right??
>>
>> Cheers,
>> Benoit
>>
>