Re: [eigen] a record for Eigen: 250 GFLOPS !!



here are some new results using ICC, and including MKL:

https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba#some_results


On Wed, Jun 23, 2010 at 1:29 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> 2010/6/23 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>> Hi,
>>
>> this morning I played with a 48-core AMD SMP server (8 processors
>> AMD-Opteron-8439-SE, 6 cores each @ 2.8 GHz) and a bi-processor made
>> of Intel X5570 @ 2.93GHz (4 multithreaded cores each => a total of 8
>> cores, 16 threads), and here are the results for a product of 2048^2
>> matrices of floats:
>
> Very interesting!
>

>> We can see that AMD's SSE implementation is half the speed of Intel's
>> one.
>
> Is it because MULPS and ADDPS don't pipeline as they do with Intel CPUs?

actually it seems that ICC performs better than gcc-4.3 on this kind
of architecture (see the plots linked above).

For a single thread, we now have (both CPUs run at 2.8 GHz):

          Eigen     MKL
AMD       14.59     19.2
INTEL     19.77     24.24

so maybe the difference comes more from the automatic prefetching
mechanisms, blocking strategies and the like than from pure SSE
performance.
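
To put numbers on that, the single-thread figures can be turned into ratios. This is only a rough Python sketch: the values are copied from the table above, while the dictionary layout and the function names are made up for the illustration.

```python
# Single-thread results copied from the table above
# (all four values are in the same unit, so only the ratios matter here).
results = {
    "AMD": {"Eigen": 14.59, "MKL": 19.2},
    "INTEL": {"Eigen": 19.77, "MKL": 24.24},
}


def eigen_to_mkl_ratio(cpu):
    """Eigen's result divided by MKL's result on the given CPU."""
    row = results[cpu]
    return row["Eigen"] / row["MKL"]


def intel_to_amd_ratio(library):
    """Intel's result divided by AMD's result for the given library."""
    return results["INTEL"][library] / results["AMD"][library]


if __name__ == "__main__":
    for cpu in results:
        print(f"{cpu}: Eigen/MKL = {eigen_to_mkl_ratio(cpu):.2f}")
    for library in ("Eigen", "MKL"):
        print(f"{library}: INTEL/AMD = {intel_to_amd_ratio(library):.2f}")
```

With these figures the Intel machine is roughly 1.3 times faster than the AMD one for both libraries, well short of a factor of two, and Eigen reaches about 75 to 80 percent of MKL on both CPUs.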

gael


>
>> This architecture seems to be tricky to control because the peak
>> performance is obtained with 32 threads, with a speedup factor of x23,
>> which is not bad. With more threads the performance drops significantly.
>> There is also a slowdown with 24 threads.
>
> Also, this is "just" 2048x2048, so there is only work to do for a
> finite number of threads... as is also seen in the fact that this job
> was completed in less than a tenth of a second. I wonder if a larger
> matrix product would scale better to large numbers of threads.
>
> Benoit
>
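
For reference, the run time and the GFLOPS rate are easy to convert into each other. The Python sketch below assumes the usual count of about 2*n^3 floating-point operations for a dense n x n product (n^3 multiplies plus n^3 adds) and takes the 250 GFLOPS from the subject line as the rate; the function names are only illustrative, and the larger size is a hypothetical case that assumes the same rate would hold.

```python
def gemm_flops(n):
    """Approximate flop count of a dense n x n by n x n matrix product."""
    return 2 * n**3


def seconds_at(n, gflops):
    """Time in seconds for an n x n product at a sustained rate in GFLOPS."""
    return gemm_flops(n) / (gflops * 1e9)


if __name__ == "__main__":
    # 2048 is the size used in this thread; 8192 is a hypothetical larger
    # product, assuming the same 250 GFLOPS rate could be sustained.
    for n in (2048, 8192):
        print(f"n = {n}: {gemm_flops(n):.3e} flops, "
              f"{seconds_at(n, 250.0):.3f} s at 250 GFLOPS")
```

At that rate the 2048x2048 product takes about 0.07 s, which matches the "less than a tenth of a second" remark, so a larger product would give each thread considerably more work per run.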
>>
>> that's all folks.
>>
>> gael
>>
>>
>>
>
>
>


