Re: [eigen] unaligned or not unaligned vectorization ?

First of all, in this benchmark there are 2 loads for 1 store, which is
why unaligned stores seem to be more costly....

I realized that although I used the unaligned opcodes, my data were
still aligned. So here are some new results for c = a+b, where the data
are indeed not aligned:

Loop peeling + A/A : 1.14591s  1.39627 GFlops
Loop peeling + U/A : 2.63456s  0.607312 GFlops
Loop peeling + A/U : 2.7675s   0.57814 GFlops
Loop peeling + U/U : 2.94519s  0.543259 GFlops
No peeling + A/A   : 1.16711s  1.37091 GFlops
No peeling + U/A   : 2.80952s  0.569491 GFlops
No peeling + A/U   : 4.83049s  0.331229 GFlops
No peeling + U/U   : 4.80667s  0.332871 GFlops
No vec, no peeling : 2.08395s  0.767774 GFlops
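
To make the labels concrete, here is a minimal sketch of what a
"Loop peeling + U/A" kernel for c = a+b might look like. It is not the
code used for the timings above. The names first_aligned and add_peeled
are hypothetical, packets of 4 floats (16 bytes, SSE) are assumed, and a
scalar fallback is used when SSE is not available:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical helper: number of leading floats to process one by one so
// that dst + result is 16-byte aligned (capped at n). Assumes dst is at
// least aligned on sizeof(float).
inline std::size_t first_aligned(const float* dst, std::size_t n) {
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(dst);
    const std::size_t misalign = (addr / sizeof(float)) % 4;  // in floats
    const std::size_t peel = (4 - misalign) % 4;
    return peel < n ? peel : n;
}

// c = a + b with loop peeling: the scalar prologue makes the stores to c
// aligned; the loads from a and b stay unaligned (the "U/A" case).
inline void add_peeled(const float* a, const float* b, float* c, std::size_t n) {
    const std::size_t start = first_aligned(c, n);
    const std::size_t end = start + ((n - start) / 4) * 4;
    assert(start == end || reinterpret_cast<std::uintptr_t>(c + start) % 16 == 0);

    // Peeled prologue: coefficients before the first aligned store.
    for (std::size_t i = 0; i < start; ++i) c[i] = a[i] + b[i];

#if defined(__SSE__)
    // Main loop: unaligned loads, aligned stores.
    for (std::size_t i = start; i < end; i += 4) {
        const __m128 pa = _mm_loadu_ps(a + i);
        const __m128 pb = _mm_loadu_ps(b + i);
        _mm_store_ps(c + i, _mm_add_ps(pa, pb));
    }
#else
    // Scalar fallback so the sketch also builds without SSE.
    for (std::size_t i = start; i < end; ++i) c[i] = a[i] + b[i];
#endif

    // Epilogue: fewer than 4 coefficients left.
    for (std::size_t i = end; i < n; ++i) c[i] = a[i] + b[i];
}
```

The "No peeling" variants would skip the prologue and start the packet
loop at i = 0, so the stores to c are unaligned whenever c itself is.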


These results are for float, and I used various offsets to misalign
the data. Now unaligned stores appear to be more expensive than
unaligned loads, but the non-vectorized path is actually much faster
anyway.... Since unaligned data come from block expressions, we could
try to find a way to evaluate the block to an aligned temporary....
but we should do that only if the next evaluated expression is
vectorizable and sufficiently costly to get a high gain from the
vectorization.... This looks rather complicated.
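
As a rough illustration of the "aligned temporary" idea only (this is
not how Eigen does or would do it), a misaligned range could be copied
into over-allocated storage and addressed through a 16-byte-aligned
pointer. The function eval_to_aligned and its signature are
hypothetical, and the decision of when the copy pays off is left out:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: copies the (possibly misaligned) range
// [src, src + n) into `storage` and returns a 16-byte-aligned pointer to
// the copy. `storage` owns the memory and must outlive the returned pointer.
inline float* eval_to_aligned(const float* src, std::size_t n,
                              std::vector<float>& storage) {
    // Up to 3 extra floats are needed to reach the next 16-byte boundary.
    storage.resize(n + 3);
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(storage.data());
    const std::size_t skip = (4 - (addr / sizeof(float)) % 4) % 4;
    float* dst = storage.data() + skip;
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    std::memcpy(dst, src, n * sizeof(float));
    return dst;
}
```

The copy itself costs one pass over the data, which is why it would only
make sense in front of an expression that is expensive enough.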

gael.

On Thu, Jul 3, 2008 at 8:22 PM, Konstantinos Margaritis <markos@xxxxxxxx> wrote:
> Hi all,
> First, I agree: at least on Altivec, unaligned stores have a tremendous
> impact on performance (loads not that much).
> Regarding the benchmark, just one question, was it done using totally random
> alignment (non-aligned) for each iteration?
>
> Konstantinos
>
> On Thursday 03 July 2008 21:07:05 Gael Guennebaud wrote:
>> Hi,
>>
>> today we had a discussion about the usefulness of unaligned
>> vectorization. So here are some benchmarks for a += a.cwiseProduct(b),
>> where, e.g., U/A means Unaligned loads / Aligned stores:
>>
>>
>> float:
>>
>> eigen A/A : 1.2163s   1.31546 GFlops
>> eigen U/A : 1.71109s   0.935079 GFlops
>> eigen U/U : 2.16024s   0.74066 GFlops
>> Loop peeling + A/A : 0.932119s  1.71652 GFlops
>> Loop peeling + U/A : 1.48324s  1.07872 GFlops
>> Loop peeling + A/U : 1.1676s  1.37033 GFlops
>> Loop peeling + U/U : 1.68971s  0.946908 GFlops
>>
>>
>> float (no vectorization):
>>
>> eigen : 2.05874s   0.777173 GFlops
>> Loop peeling : 2.27903s  0.702053 GFlops
>>
>>
>>
>> double:
>>
>> eigen A/A : 2.70669s   0.591128 GFlops
>> eigen U/U : 2.75419s   0.580933 GFlops
>> eigen U/A : 2.82088s   0.567199 GFlops
>> Loop peeling + A/A : 1.98525s  0.805943 GFlops
>> Loop peeling + U/A : 3.07734s  0.51993 GFlops
>> Loop peeling + A/U : 2.44861s  0.653431 GFlops
>> Loop peeling + U/U : 3.48922s  0.458555 GFlops
>>
>>
>> double (no vectorization):
>>
>> eigen : 2.86233s   0.558985 GFlops
>> Loop peeling : 3.10623s  0.515094 GFlops
>>
>> So, at least for SSE, there is currently no gain from doing unaligned
>> vectorization, but it is worth removing the unaligned stores by first
>> processing the unaligned coefficients of the result. So let's do it!
>>
>>
>> cheers,
>> Gael.
>
>
>
>