Re: [eigen] vectorization bug



ah I found the vectorized loop of ones generated by gcc 4.3:

.L43:
	movq	32(%rsp), %rax
	movabsq	$4607182418800017408, %r8
	movabsq	$4607182418800017408, %rdi
	movq	%r8, (%rax,%rdx)
	movq	%rdi, 8(%rax,%rdx)
	addq	$16, %rdx
	cmpq	$24000, %rdx
	jne	.L43

the movapd (move 2 aligned doubles) has been replaced *by gcc* with two
64-bit moves, and again we can see how poorly this loop is optimized,
with a lot of redundant moves....
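
A small check of the constant in the loop above: 4607182418800017408 is the IEEE-754 bit pattern of the double 1.0 (0x3FF0000000000000), so each movabsq/movq pair stores one 1.0 as a plain 64-bit integer. The sketch below assumes 64-bit IEEE-754 doubles; `bits_of`, `kLoopBytes` and `kStepBytes` are hypothetical names for illustration, not part of Eigen. The derived counts assume the 24000-byte loop bound covers the whole vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Return the raw bit pattern of a double as a 64-bit unsigned integer.
// Assumption: double is a 64-bit IEEE-754 type on this platform.
std::uint64_t bits_of(double x) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch expects 64-bit doubles");
    std::uint64_t u;
    std::memcpy(&u, &x, sizeof u);  // well-defined way to read the bytes
    return u;
}

// Values read off the assembly: the loop runs until the byte offset
// reaches 24000 and advances by 16 bytes (two doubles) per iteration.
constexpr std::size_t kLoopBytes = 24000;
constexpr std::size_t kStepBytes = 16;
```

With these, `bits_of(1.0)` equals 4607182418800017408, `kLoopBytes / kStepBytes` is 1500 iterations, and `kLoopBytes / sizeof(double)` is 3000 doubles.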

in your case it seems it did something similar, I guess the:

       fstl    (%eax,%edx)
       fstpl   8(%eax,%edx)

are the two "store" instructions which correspond to the single movapd....


gael.


On Sun, Aug 24, 2008 at 3:57 PM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
> Hi Benoit,
>
> Secondly, with Eigen gcc 4.3 sucks compared to gcc 4.2. I observed
> that in all benchmarks.
> In your case here are my results (core2, 64-bit system):
>
>                gcc 4.2   gcc 4.3
> explicit vec   0.38s     0.5s
> implicit vec   0.48s     0.46s
>
> the "implicit vec" line actually means no vectorization for gcc 4.2 and
> gcc's default vectorization for gcc 4.3.
>
> so here is the core of the vector addition:
>
> gcc 4.3:
>
> .L57:
>        movq    32(%rsp), %rax
>        addl    $2, %ecx
>        movapd  (%rax,%rdx), %xmm0
>        movq    16(%rsp), %rax
>        addpd   (%rax,%rdx), %xmm0
>        movq    (%rsp), %rax
>        movapd  %xmm0, (%rax,%rdx)
>        addq    $16, %rdx
>        cmpl    %ecx, %r8d
>        jg      .L57
>
> as we can see gcc should have moved the 3 movq instructions (which load
> the addresses of the data) out of the loop!
>
> Now let's compare with gcc 4.2 code:
>
> .L73:
>        movapd  (%rax,%rbp), %xmm0
>        addpd   (%rax,%rbx), %xmm0
>        movapd  %xmm0, (%rax,%rdi)
>        addq    $16, %rax
>        cmpq    $24000, %rax
>        jne     .L73
>
> yeah much much better !!
>
> FYI current gcc trunk (future 4.4) generates good code here, so let's not
> bother... also I'm using g++-4.3 (GCC) 4.3.0 20080215 (experimental)
> which is not the most recent one....
>
>
>
> About Ones, here it is well vectorized: (gcc 4.2 and 4.4)
>
> .L62:
>        movapd  %xmm0, (%rax,%rdx)
>        addq    $16, %rax
>        cmpq    $24000, %rax
>        jne     .L62
>
> and for some weird reason, it seems gcc 4.3 drops the middle
> vectorized loop here.... very strange !
>
> cheers,
> gael.
>
>
> 2008/8/24  <jacob@xxxxxxxxxxxxxxx>:
>> Hi List,
>>
>> Here's a simple benchmark, a.cpp. It runs faster without vectorization than
>> with!
>>
>> Trying to understand this I added some asm comments in Assign.h, so my copy
>> looks like this:
>>
>> template<typename Derived1, typename Derived2>
>> struct ei_assign_impl<Derived1, Derived2, LinearVectorization, NoUnrolling>
>> {
>>  static void run(Derived1 &dst, const Derived2 &src)
>>  {
>>    asm("#begin");
>>    const int size = dst.size();
>>    const int packetSize = ei_packet_traits<typename Derived1::Scalar>::size;
>>    const int alignedStart = ei_assign_traits<Derived1,Derived2>::DstIsAligned ? 0
>>                           : ei_alignmentOffset(&dst.coeffRef(0), size);
>>    const int alignedEnd = alignedStart + ((size-alignedStart)/packetSize)*packetSize;
>>
>>    asm("#unaligned start");
>>
>>    for(int index = 0; index < alignedStart; index++)
>>      dst.copyCoeff(index, src);
>>    asm("#aligned middle");
>>
>>    for(int index = alignedStart; index < alignedEnd; index += packetSize)
>>    {
>>      dst.template copyPacket<Derived2, Aligned, ei_assign_traits<Derived1,Derived2>::SrcAlignment>(index, src);
>>    }
>>
>>    asm("#unaligned end");
>>
>>    for(int index = alignedEnd; index < size; index++)
>>      dst.copyCoeff(index, src);
>>    asm("#end");
>>  }
>> };
>>
>> I attach the resulting assembly (a.s). Can you see what's wrong?
>>
>> Another thing. The ones() part compiles to this:
>>
>>        xorl    %edx, %edx
>> .L107:
>>        movl    -24(%ebp), %eax
>>        fld1
>>        fstl    (%eax,%edx)
>>        fstpl   8(%eax,%edx)
>>        addl    $16, %edx
>>        cmpl    $24000, %edx
>>        jne     .L107
>>
>> This is not vectorized, right??
>>
>> Cheers,
>> Benoit
>>
>>
>>
>
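
The index arithmetic in the quoted Assign.h code splits the assignment into an unaligned head, an aligned middle processed one packet at a time, and an unaligned tail. Here is a minimal sketch of that split in plain C++, assuming doubles, a packet size of 2 and 16-byte alignment. `alignment_offset`, `aligned_end` and `fill_ones` are hypothetical stand-ins, not Eigen's actual functions, and the middle loop uses two scalar stores where the real code would issue one packet store.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for ei_alignmentOffset (not Eigen's code):
// number of leading doubles to skip before ptr reaches a 16-byte
// boundary, capped at size. Assumes ptr is at least 8-byte aligned.
int alignment_offset(const double* ptr, int size) {
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(ptr);
    const std::uintptr_t misalign = addr % 16;
    const int skip =
        misalign == 0 ? 0 : static_cast<int>((16 - misalign) / sizeof(double));
    return std::min(skip, size);
}

// Same formula as alignedEnd in the quoted code: the largest index
// reachable from alignedStart in whole packets without passing size.
int aligned_end(int size, int alignedStart, int packetSize) {
    return alignedStart + ((size - alignedStart) / packetSize) * packetSize;
}

// Fill dst with 1.0 using the head / middle / tail split.
void fill_ones(double* dst, int size) {
    const int packetSize = 2;  // two doubles per 16-byte packet
    const int alignedStart = alignment_offset(dst, size);
    const int alignedEnd = aligned_end(size, alignedStart, packetSize);

    for (int i = 0; i < alignedStart; ++i)  // unaligned start
        dst[i] = 1.0;
    for (int i = alignedStart; i < alignedEnd; i += packetSize) {
        dst[i] = 1.0;      // these two scalar stores stand in for
        dst[i + 1] = 1.0;  // a single aligned packet store
    }
    for (int i = alignedEnd; i < size; ++i)  // unaligned end
        dst[i] = 1.0;
}
```

For a 16-byte-aligned buffer of 3000 doubles the head and tail loops are empty and the middle loop runs 1500 times, which matches the 24000-byte bound seen in the assembly listings.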


