The "implicit" line actually means no vectorization with gcc 4.2, and
gcc's default vectorization with gcc 4.3.
So here is the core of the vector addition:
gcc 4.3:
movq 32(%rsp), %rax
addl $2, %ecx
movapd (%rax,%rdx), %xmm0
movq 16(%rsp), %rax
addpd (%rax,%rdx), %xmm0
movq (%rsp), %rax
movapd %xmm0, (%rax,%rdx)
addq $16, %rdx
cmpl %ecx, %r8d
jg .L57
As we can see, gcc 4.3 keeps 3 movq instructions (which load the
addresses of the data) inside the loop. They should have been moved out of it!
Now let's compare with gcc 4.2 code:
movapd (%rax,%rbp), %xmm0
addpd (%rax,%rbx), %xmm0
movapd %xmm0, (%rax,%rdi)
addq $16, %rax
cmpq $24000, %rax
jne .L73
Yeah, much, much better: the three base addresses stay in registers!
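To make the difference concrete, here is a rough C++ sketch of what the two loops amount to. This is my own illustration, not Eigen code: the struct VecSum and both function names are made up. The first version reads the data pointers through the expression object inside the loop body, which is roughly what the gcc 4.3 output does on every iteration. The second copies them into locals first, which corresponds to the gcc 4.2 loop. A compiler is free to turn the first into the second, and that is exactly the optimization that seems to be missed here.

```cpp
#include <cstddef>

// Hypothetical stand-in for an expression object holding three data pointers.
struct VecSum {
    double* dst;
    const double* lhs;
    const double* rhs;
};

// The pointers are read through `e` inside the loop body. Whether they are
// actually reloaded on each iteration is up to the compiler.
void add_reloading(const VecSum* e, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        e->dst[i] = e->lhs[i] + e->rhs[i];
}

// The loop-invariant pointer loads are hoisted by hand into locals.
void add_hoisted(const VecSum* e, std::size_t n) {
    double* dst = e->dst;
    const double* lhs = e->lhs;
    const double* rhs = e->rhs;
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = lhs[i] + rhs[i];
}
```

Both functions compute the same result; only the place where the addresses are loaded differs.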
FYI, current gcc trunk (the future 4.4) generates good code here, so let's not
bother. Also, I'm using g++-4.3 (GCC) 4.3.0 20080215 (experimental),
which is not the most recent one.
As for Ones, it is well vectorized here (gcc 4.2 and 4.4):
movapd %xmm0, (%rax,%rdx)
addq $16, %rax
cmpq $24000, %rax
jne .L62
For some weird reason, gcc 4.3 seems to drop the middle
(vectorized) loop here. Very strange!
2008/8/24 <jacob@xxxxxxxxxxxxxxx>:
Hi List,
Here's a simple benchmark, a.cpp. It runs faster without vectorization than
Trying to understand this I added some asm comments in Assign.h, so my copy
looks like this:
template<typename Derived1, typename Derived2>
struct ei_assign_impl<Derived1, Derived2, LinearVectorization, NoUnrolling>
{
  static void run(Derived1 &dst, const Derived2 &src)
  {
    const int size = dst.size();
    const int packetSize = ei_packet_traits<typename Derived1::Scalar>::size;
    const int alignedStart =
      ei_assign_traits<Derived1,Derived2>::DstIsAligned ? 0
      : ei_alignmentOffset(&dst.coeffRef(0), size);
    const int alignedEnd = alignedStart +
    asm("#unaligned start");
    for(int index = 0; index < alignedStart; index++)
      dst.copyCoeff(index, src);
    asm("#aligned middle");
    for(int index = alignedStart; index < alignedEnd; index += packetSize)
      dst.template copyPacket<Derived2, Aligned,
        ei_assign_traits<Derived1,Derived2>::SrcAlignment>(index, src);
    asm("#unaligned end");
    for(int index = alignedEnd; index < size; index++)
      dst.copyCoeff(index, src);
  }
};
I attach the resulting assembly (a.s). Can you see what's wrong?
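For anyone reading without Assign.h at hand, the three-phase structure above can be sketched standalone roughly as follows. This is only an illustration under assumptions, not the Eigen implementation: first_aligned and copy_three_phase are made-up names, the packet size is fixed at 2 doubles (one SSE2 register), the "packet" copy is written as two scalar copies instead of a real aligned load and store, and the expression that rounds alignedEnd down to a whole number of packets is my guess at the intent.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many leading doubles to skip until `p` sits on a
// 16-byte boundary, capped at `size`. Assumes `p` is at least 8-byte aligned.
inline std::size_t first_aligned(const double* p, std::size_t size) {
    const std::size_t packetSize = 2;
    const std::size_t misalign =
        (reinterpret_cast<std::uintptr_t>(p) / sizeof(double)) % packetSize;
    const std::size_t skip = (packetSize - misalign) % packetSize;
    return skip < size ? skip : size;
}

// Copy `size` doubles in three phases: scalar head, packet-sized middle, scalar tail.
void copy_three_phase(double* dst, const double* src, std::size_t size) {
    const std::size_t packetSize = 2;
    const std::size_t alignedStart = first_aligned(dst, size);
    // Assumption: the middle loop covers a whole number of packets.
    const std::size_t alignedEnd =
        alignedStart + ((size - alignedStart) / packetSize) * packetSize;

    // Unaligned start: one coefficient at a time.
    for (std::size_t i = 0; i < alignedStart; ++i)
        dst[i] = src[i];
    // Aligned middle: one packet (2 doubles) per iteration; a vectorized
    // version would use an aligned store to dst here.
    for (std::size_t i = alignedStart; i < alignedEnd; i += packetSize) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
    }
    // Unaligned end: whatever is left over.
    for (std::size_t i = alignedEnd; i < size; ++i)
        dst[i] = src[i];
}
```

Only the middle loop is a candidate for packet instructions, so if the compiler drops or mangles that loop, the whole assignment runs at scalar speed.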
Another thing. The ones() part compiles to this:
xorl %edx, %edx
movl -24(%ebp), %eax
fstl (%eax,%edx)
fstpl 8(%eax,%edx)
addl $16, %edx
cmpl $24000, %edx
jne .L107
This is not vectorized, right??