[eigen] Re: small sums: vectorization not worth it
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: [eigen] Re: small sums: vectorization not worth it
- From: "Benoit Jacob" <jacob.benoit.1@xxxxxxxxx>
- Date: Sat, 17 Jan 2009 00:40:47 +0100
Act 1, 4th and last scene:
I wanted to see at which size vectorization starts to pay off...
Attached is one more example (sum.cpp), this time with size 64 and type
double, with unrolling forced.
Result: x87 is still 50% faster than SSE2.
The SSE2 asm is:
#APP
# 12 "dot.cpp" 1
#a
# 0 "" 2
#NO_APP
movapd (%esi), %xmm1
movapd -104(%ebp), %xmm0
addpd -88(%ebp), %xmm0
addpd -120(%ebp), %xmm0
addpd -136(%ebp), %xmm0
addpd -152(%ebp), %xmm0
addpd -168(%ebp), %xmm0
addpd -184(%ebp), %xmm0
addpd -200(%ebp), %xmm0
addpd -216(%ebp), %xmm0
addpd -232(%ebp), %xmm0
addpd -248(%ebp), %xmm0
addpd -264(%ebp), %xmm0
addpd -280(%ebp), %xmm0
addpd -296(%ebp), %xmm0
addpd -312(%ebp), %xmm0
addpd -328(%ebp), %xmm0
addpd -344(%ebp), %xmm0
addpd -360(%ebp), %xmm0
addpd -376(%ebp), %xmm0
addpd -392(%ebp), %xmm0
addpd -408(%ebp), %xmm0
addpd -424(%ebp), %xmm0
addpd -440(%ebp), %xmm0
addpd -456(%ebp), %xmm0
addpd -472(%ebp), %xmm0
addpd -488(%ebp), %xmm0
addpd -504(%ebp), %xmm0
addpd -520(%ebp), %xmm0
addpd -536(%ebp), %xmm0
addpd -552(%ebp), %xmm0
addpd -568(%ebp), %xmm0
addpd %xmm0, %xmm1
movapd %xmm1, %xmm2
unpckhpd %xmm1, %xmm2
addsd %xmm2, %xmm1
movapd %xmm1, %xmm2
movsd %xmm2, -584(%ebp)
#APP
# 14 "dot.cpp" 1
#b
# 0 "" 2
#NO_APP
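
For readers who prefer intrinsics to assembly, here is a rough sketch of the same shape of computation: a chain of packed adds into one accumulator (the addpd lines), followed by a horizontal reduction of the two lanes (the unpckhpd/addsd pair). The function name packed_sum is hypothetical and this is not Eigen's implementation; it uses unaligned loads for safety where the listing uses movapd, handles even sizes only, and falls back to a scalar loop when SSE2 is unavailable.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper, not part of Eigen: sums n doubles, n even.
double packed_sum(const double* data, std::size_t n)
{
    assert(n % 2 == 0 && "this sketch handles even sizes only");
#if defined(__SSE2__)
    // Packed adds: every add depends on the previous accumulator value,
    // like the run of addpd instructions in the listing above.
    __m128d acc = _mm_setzero_pd();
    for (std::size_t i = 0; i < n; i += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(data + i));
    // Horizontal reduction: move the high lane down, then one scalar add.
    __m128d high = _mm_unpackhi_pd(acc, acc);
    return _mm_cvtsd_f64(_mm_add_sd(acc, high));
#else
    // Scalar fallback so the sketch still compiles without SSE2.
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        acc += data[i];
    return acc;
#endif
}
```

Calling packed_sum(v, 64) on an array of 64 doubles mirrors the size-64 case measured here. Note that the reduction step costs extra instructions on top of the packed adds, which is the overhead discussed in the quoted messages below.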
2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
> and in case you wonder: the situation is almost the same for dot product
>
> (attached file dot.cpp)
>
> runs 2x slower with SSE...
>
> Cheers,
> Benoit
>
> 2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>> Note: i think the old heuristic was wrong anyway.
>>
>> Maybe take this occasion to introduce a EIGEN_COST_OF_PREDUX (since
>> this cost depends greatly on the simd platform) ?
>> And then do a natural heuristic rather than a quick hack like we used to have?
>>
>> Benoit
>>
>> 2009/1/16 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>> Hi Gael *cough* List,
>>>
>>> ei_predux is costly because it consists of >1 SIMD instruction.
>>>
>>> So until recently we had sum() only vectorize if the size was big enough.
>>> However this was recently changed.
>>>
>>> Attached is a benchmark that runs 2.5x slower with SSE (2 or 3) than
>>> without. It's just Vector2d::sum().
>>>
>>> So, revert to old behavior?
>>>
>>> Moreover: matrix product innerVectorization also uses a ei_predux. Same here?
>>>
>>> Cheers,
>>> Benoit
>>>
>>
>
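
The EIGEN_COST_OF_PREDUX idea quoted above could look roughly like the following compile-time cost comparison. This is only a sketch: the struct, the function names, and every cost value are hypothetical, none of it is existing Eigen code, and it counts instructions only.

```cpp
#include <cassert>

// All names and values below are assumptions made for illustration.
struct ReduxCosts {
    int add_cost;     // cost of one add, scalar or packet (assumed equal)
    int predux_cost;  // cost of the horizontal reduction of one packet
    int packet_size;  // scalars per packet, e.g. 2 for double with SSE2
};

// Estimated cost of summing `size` scalars one by one.
constexpr int scalar_cost(int size, ReduxCosts c)
{
    return (size - 1) * c.add_cost;
}

// Estimated cost of packet adds plus one final horizontal reduction.
constexpr int vector_cost(int size, ReduxCosts c)
{
    return (size / c.packet_size - 1) * c.add_cost + c.predux_cost;
}

// Vectorize only when the size is a multiple of the packet size and the
// estimated vectorized cost is strictly lower.
constexpr bool should_vectorize_sum(int size, ReduxCosts c)
{
    return size % c.packet_size == 0 && vector_cost(size, c) < scalar_cost(size, c);
}
```

With an assumed ReduxCosts of {1, 3, 2}, this rejects vectorization for size 2 and accepts it for size 64. The size-64 measurement in this message goes the other way on the tested machine, so a pure instruction count like this one would need platform-specific tuning, or additional terms, before it matched the benchmarks.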
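// Raise the unrolling limit so the size-64 sum is fully unrolled.
#define EIGEN_UNROLLING_LIMIT 10000
#include <Eigen/Core>

typedef Eigen::Matrix<double, 64, 1> T;

int main()
{
    T v; v.setZero(); v[0] = 1;
    for(int i = 0; i < 10000000; i++)
    {
        v = T::Ones() + v * 1e-10;
        asm("#a"); // marker: start of the sum in the generated assembly
        v[0] = v.sum();
        asm("#b"); // marker: end of the sum in the generated assembly
        v[1] = v.sum();
        //std::cout << v << "\n"; // check it's not inf...
    }
    return int(v[0]); // use the result so the loop is not optimized away
}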