Re: [eigen] vectorization bug



Hi,

Thanks for the explanations. I am also using gcc 4.3.0; I'll see if 4.3.2 is better.

Cheers,
Benoit


The line labeled "implicit" actually means no vectorization with gcc 4.2
and gcc's default vectorization with gcc 4.3.

So here is the core loop of the vector addition:

gcc 4.3:

.L57:
	movq	32(%rsp), %rax
	addl	$2, %ecx
	movapd	(%rax,%rdx), %xmm0
	movq	16(%rsp), %rax
	addpd	(%rax,%rdx), %xmm0
	movq	(%rsp), %rax
	movapd	%xmm0, (%rax,%rdx)
	addq	$16, %rdx
	cmpl	%ecx, %r8d
	jg	.L57

As we can see, gcc should have moved the three movq instructions (which
load the addresses of the data) out of the loop!
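To illustrate what hoisting means at the source level, here is a minimal sketch that copies the three data pointers into locals before the loop, so they do not have to be reloaded on each iteration. The names `Vec` and `add_hoisted` are hypothetical and not part of Eigen, and whether this changes what gcc 4.3 emits was not tested here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an object that owns a data pointer; not an Eigen type.
struct Vec {
    double* data;
    std::size_t size;
};

// Computes dst = a + b element-wise. The data pointers and the size are read
// once into locals, which is the effect the compiler should achieve itself by
// hoisting the loop-invariant loads.
void add_hoisted(Vec& dst, const Vec& a, const Vec& b) {
    assert(a.size == dst.size && b.size == dst.size);
    double* d = dst.data;
    const double* pa = a.data;
    const double* pb = b.data;
    const std::size_t n = dst.size;
    for (std::size_t i = 0; i < n; ++i)
        d[i] = pa[i] + pb[i];
}
```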

Now let's compare with the gcc 4.2 code:

.L73:
	movapd	(%rax,%rbp), %xmm0
	addpd	(%rax,%rbx), %xmm0
	movapd	%xmm0, (%rax,%rdi)
	addq	$16, %rax
	cmpq	$24000, %rax
	jne	.L73

Yeah, much, much better!

FYI, current gcc trunk (the future 4.4) generates good code here, so let's
not bother. Also, I'm using g++-4.3 (GCC) 4.3.0 20080215 (experimental),
which is not the most recent one.



About Ones: here it is well vectorized (gcc 4.2 and 4.4):

.L62:
	movapd	%xmm0, (%rax,%rdx)
	addq	$16, %rax
	cmpq	$24000, %rax
	jne	.L62

For some weird reason, gcc 4.3 seems to drop the vectorized middle loop
here. Very strange!

cheers,
gael.


2008/8/24  <jacob@xxxxxxxxxxxxxxx>:
Hi List,

Here's a simple benchmark, a.cpp. It runs faster without vectorization than
with!

Trying to understand this I added some asm comments in Assign.h, so my copy
looks like this:

template<typename Derived1, typename Derived2>
struct ei_assign_impl<Derived1, Derived2, LinearVectorization, NoUnrolling>
{
 static void run(Derived1 &dst, const Derived2 &src)
 {
   asm("#begin");
   const int size = dst.size();
   const int packetSize = ei_packet_traits<typename Derived1::Scalar>::size;
   const int alignedStart = ei_assign_traits<Derived1,Derived2>::DstIsAligned ? 0
                          : ei_alignmentOffset(&dst.coeffRef(0), size);
   const int alignedEnd = alignedStart + ((size-alignedStart)/packetSize)*packetSize;

   asm("#unaligned start");

   for(int index = 0; index < alignedStart; index++)
     dst.copyCoeff(index, src);
   asm("#aligned middle");

   for(int index = alignedStart; index < alignedEnd; index += packetSize)
   {
     dst.template copyPacket<Derived2, Aligned, ei_assign_traits<Derived1,Derived2>::SrcAlignment>(index, src);
   }

   asm("#unaligned end");

   for(int index = alignedEnd; index < size; index++)
     dst.copyCoeff(index, src);
   asm("#end");
 }
};

I attach the resulting assembly (a.s). Can you see what's wrong?
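For readers without the Eigen sources at hand, the three-phase structure of the loop above (scalar head, packet-sized middle, scalar tail) can be sketched on plain arrays. This is only an illustration under stated assumptions: 16-byte packets of two doubles, 8-byte doubles, and a pointer that is at least 8-byte aligned. The names `alignment_offset`, `split_ranges` and `copy_three_phase` are hypothetical and are not Eigen's `ei_alignmentOffset` or `copyPacket`; the middle loop uses plain scalar copies where a real implementation would issue one aligned packet store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed packet layout for this sketch: two doubles per 16-byte packet.
constexpr std::size_t kPacketSize = 2;
constexpr std::size_t kAlignBytes = 16;

// Number of leading elements to process one by one before ptr reaches a
// 16-byte boundary, capped at size. Assumes ptr is aligned to sizeof(double).
std::size_t alignment_offset(const double* ptr, std::size_t size) {
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(ptr);
    assert(addr % sizeof(double) == 0);
    const std::size_t misalign = (addr % kAlignBytes) / sizeof(double);
    const std::size_t skip = (kPacketSize - misalign) % kPacketSize;
    return skip < size ? skip : size;
}

struct Ranges {
    std::size_t alignedStart;  // first index handled by the packet loop
    std::size_t alignedEnd;    // one past the last index handled by the packet loop
};

// Same arithmetic as alignedStart / alignedEnd in the snippet above.
Ranges split_ranges(const double* dst, std::size_t size) {
    const std::size_t start = alignment_offset(dst, size);
    const std::size_t end = start + ((size - start) / kPacketSize) * kPacketSize;
    return Ranges{start, end};
}

void copy_three_phase(double* dst, const double* src, std::size_t size) {
    const Ranges r = split_ranges(dst, size);
    // Unaligned start: scalar copies until dst is 16-byte aligned.
    for (std::size_t i = 0; i < r.alignedStart; ++i)
        dst[i] = src[i];
    // Aligned middle: one packet per iteration (scalar stand-in for a packet store).
    for (std::size_t i = r.alignedStart; i < r.alignedEnd; i += kPacketSize)
        for (std::size_t k = 0; k < kPacketSize; ++k)
            dst[i + k] = src[i + k];
    // Unaligned end: leftover elements that do not fill a whole packet.
    for (std::size_t i = r.alignedEnd; i < size; ++i)
        dst[i] = src[i];
}
```

With a 16-byte-aligned destination and 7 elements, the middle loop covers indices 0 to 5 and the tail handles index 6; with the destination shifted by one double, index 0 goes to the head and the middle loop covers indices 1 to 6.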

Another thing. The ones() part compiles to this:

       xorl    %edx, %edx
.L107:
       movl    -24(%ebp), %eax
       fld1
       fstl    (%eax,%edx)
       fstpl   8(%eax,%edx)
       addl    $16, %edx
       cmpl    $24000, %edx
       jne     .L107

This is not vectorized, right??

Cheers,
Benoit

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.