Re: [eigen] vectorization bug



Hi Benoit,

Secondly, with Eigen, gcc 4.3 sucks compared to gcc 4.2: I have observed
that in all my benchmarks.
In your case, here are my results (Core 2, 64-bit system):

               gcc 4.2   gcc 4.3
explicit vec   0.38s     0.5s
implicit vec   0.48s     0.46s

The "implicit vec" line actually means no vectorization for gcc 4.2 and
gcc's default auto-vectorization for gcc 4.3.

so here is the core of the vector addition:

gcc 4.3:

.L57:
	movq	32(%rsp), %rax
	addl	$2, %ecx
	movapd	(%rax,%rdx), %xmm0
	movq	16(%rsp), %rax
	addpd	(%rax,%rdx), %xmm0
	movq	(%rsp), %rax
	movapd	%xmm0, (%rax,%rdx)
	addq	$16, %rdx
	cmpl	%ecx, %r8d
	jg	.L57

As we can see, gcc should move the 3 movq instructions (which load the
addresses of the data) out of the loop!
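To show at the source level what "moving the loads out of the loop" means, here is a minimal sketch. It is not Eigen code: VecView, add_reload and add_hoisted are hypothetical names made up for the illustration. Whether a given compiler really reloads the pointers in the first version depends on its alias analysis, so take it as a picture of the two loop shapes rather than a reproduction of the gcc 4.3 behaviour.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an expression object that holds a data pointer.
struct VecView {
  double* data;
};

// The base addresses are read through the views inside the loop. A compiler
// that cannot prove the stores leave the views untouched may reload them at
// every iteration, which is what the three movq instructions above do.
void add_reload(VecView& dst, const VecView& a, const VecView& b, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    dst.data[i] = a.data[i] + b.data[i];
}

// The base addresses are copied into locals once, before the loop, so the
// loop body only needs the loads, the add and the store.
void add_hoisted(VecView& dst, const VecView& a, const VecView& b, std::size_t n) {
  assert(dst.data && a.data && b.data);
  double* const d = dst.data;
  const double* const x = a.data;
  const double* const y = b.data;
  for (std::size_t i = 0; i < n; ++i)
    d[i] = x[i] + y[i];
}
```

Both functions compute the same result; the only difference is where the pointer reads sit, which is the difference between the gcc 4.3 and gcc 4.2 loops shown in this mail.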

Now let's compare with gcc 4.2 code:

.L73:
	movapd	(%rax,%rbp), %xmm0
	addpd	(%rax,%rbx), %xmm0
	movapd	%xmm0, (%rax,%rdi)
	addq	$16, %rax
	cmpq	$24000, %rax
	jne	.L73

yeah much much better !!

FYI current gcc trunk (future 4.4) generates code here, so let's not
bother... also I'm using g++-4.3 (GCC) 4.3.0 20080215 (experimental)
which is not the most recent one....



About Ones, here it is well vectorized: (gcc 4.2 and 4.4)

.L62:
	movapd	%xmm0, (%rax,%rdx)
	addq	$16, %rax
	cmpq	$24000, %rax
	jne	.L62

and for some weird reason, it seems gcc 4.3 drops the middle
vectorized loop here... very strange!
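For reference, the "middle vectorized loop" is the aligned packet loop of the Assign.h code quoted below. Here is a minimal scalar sketch of the same head / middle / tail split, assuming a packet of two doubles (16 bytes, as with SSE2) and a pointer aligned to at least sizeof(double). Split, split_range and fill_ones are hypothetical names, and the two scalar stores in the middle loop only stand in for one aligned packet store.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Split {
  std::size_t alignedStart;  // first index that sits on a 16-byte boundary
  std::size_t alignedEnd;    // one past the last index covered by full packets
};

// Same index arithmetic as in the quoted code, for packets of two doubles.
Split split_range(const double* p, std::size_t size) {
  const std::size_t packetSize = 2;
  const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
  // Offset of p inside its 16-byte block, counted in doubles: 0 or 1.
  const std::size_t offset = (addr / sizeof(double)) % packetSize;
  const std::size_t start = std::min((packetSize - offset) % packetSize, size);
  const std::size_t end = start + ((size - start) / packetSize) * packetSize;
  return Split{start, end};
}

void fill_ones(double* p, std::size_t size) {
  assert(p != nullptr);
  const Split s = split_range(p, size);
  for (std::size_t i = 0; i < s.alignedStart; ++i)  // unaligned start
    p[i] = 1.0;
  for (std::size_t i = s.alignedStart; i < s.alignedEnd; i += 2) {  // aligned middle
    p[i] = 1.0;      // these two stores would be a single movapd
    p[i + 1] = 1.0;  // in a vectorized build
  }
  for (std::size_t i = s.alignedEnd; i < size; ++i)  // unaligned end
    p[i] = 1.0;
}
```

If the compiler drops the middle loop, all the work falls to scalar code like the head and tail loops, which would be consistent with the non-vectorized x87 loop reported in the quoted message.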

cheers,
gael.


2008/8/24  <jacob@xxxxxxxxxxxxxxx>:
> Hi List,
>
> Here's a simple benchmark, a.cpp. It runs faster without vectorization than
> with!
>
> Trying to understand this I added some asm comments in Assign.h, so my copy
> looks like this:
>
> template<typename Derived1, typename Derived2>
> struct ei_assign_impl<Derived1, Derived2, LinearVectorization, NoUnrolling>
> {
>  static void run(Derived1 &dst, const Derived2 &src)
>  {
>    asm("#begin");
>    const int size = dst.size();
>    const int packetSize = ei_packet_traits<typename Derived1::Scalar>::size;
>    const int alignedStart = ei_assign_traits<Derived1,Derived2>::DstIsAligned ? 0
>                           : ei_alignmentOffset(&dst.coeffRef(0), size);
>    const int alignedEnd = alignedStart + ((size-alignedStart)/packetSize)*packetSize;
>
>    asm("#unaligned start");
>
>    for(int index = 0; index < alignedStart; index++)
>      dst.copyCoeff(index, src);
>    asm("#aligned middle");
>
>    for(int index = alignedStart; index < alignedEnd; index += packetSize)
>    {
>      dst.template copyPacket<Derived2, Aligned, ei_assign_traits<Derived1,Derived2>::SrcAlignment>(index, src);
>    }
>
>    asm("#unaligned end");
>
>    for(int index = alignedEnd; index < size; index++)
>      dst.copyCoeff(index, src);
>    asm("#end");
>  }
> };
>
> I attach the resulting assembly (a.s). Can you see what's wrong?
>
> Another thing. The ones() part compiles to this:
>
>        xorl    %edx, %edx
> .L107:
>        movl    -24(%ebp), %eax
>        fld1
>        fstl    (%eax,%edx)
>        fstpl   8(%eax,%edx)
>        addl    $16, %edx
>        cmpl    $24000, %edx
>        jne     .L107
>
> This is not vectorized, right??
>
> Cheers,
> Benoit


