Re: [eigen] a record for Eigen: 250 GFLOPS !!
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] a record for Eigen: 250 GFLOPS !!
- From: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
- Date: Wed, 23 Jun 2010 15:00:08 +0200
Here are some new results using ICC, and including MKL:
On Wed, Jun 23, 2010 at 1:29 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> 2010/6/23 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>> this morning I played with a 48-core AMD SMP server (8 processors
>> AMD-Opteron-8439-SE, 6 cores each @ 2.8 GHz) and a bi-processor made
>> of Intel X5570 @ 2.93 GHz (4 multithreaded cores each => a total of 8
>> cores, 16 threads), and here are the results for a product of 2048^2
>> matrices of floats:
> Very interesting!
>> We can see that AMD's SSE implementation is half the speed of Intel's
> Is it because MULPS and ADDPS don't pipeline as they do with Intel CPUs?
Actually, it seems that ICC performs better than gcc-4.3 on this kind
of architecture (see the above plots).
For a single thread, we now have (both CPUs run at 2.8 GHz):
AMD 14.59 19.2
INTEL 19.77 24.24
So maybe the differences are due more to the automatic prefetching
mechanisms, blocking strategies and stuff like that than to pure SSE
>> This architecture seems to be tricky to control because the peak
>> performance is obtained with 32 threads with a speed-up factor of x23,
>> which is not bad. With more threads the performance drops significantly.
>> There is also a slowdown with 24 threads.
> Also, this is "just" 2048x2048, so there is only work to do for a
> finite number of threads... as is also seen in the fact that this job
> was completed in less than a tenth of a second. I wonder if a larger
> matrix product would scale better to large numbers of threads.
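
As a rough sanity check of the figures quoted in this thread, here is a small C++ sketch of the arithmetic. It assumes the conventional 2*n^3 operation count for an n x n matrix product, and it assumes that the 250 GFLOPS from the subject line refers to this same 2048x2048 product. The function names are hypothetical helpers for illustration only; they are not part of Eigen.

```cpp
#include <cstdint>

// Floating-point operations in an (n x n) * (n x n) matrix product, using the
// conventional 2*n^3 count (one multiply and one add per inner-loop step).
// This count is an assumption; the thread does not state how GFLOPS were computed.
constexpr std::uint64_t gemm_flops(std::uint64_t n) {
    return 2 * n * n * n;
}

// Seconds needed to perform `flops` operations at a sustained rate given in GFLOPS.
constexpr double seconds_at(std::uint64_t flops, double gflops) {
    return static_cast<double>(flops) / (gflops * 1e9);
}

// Parallel efficiency: observed speed-up divided by the number of threads.
constexpr double parallel_efficiency(double speedup, int threads) {
    return speedup / threads;
}
```

Under these assumptions, gemm_flops(2048) is about 1.7e10 operations, and seconds_at(gemm_flops(2048), 250.0) comes to roughly 0.069 s, which is consistent with the "less than a tenth of a second" remark. Likewise, parallel_efficiency(23.0, 32) is about 0.72 for the x23 speed-up reported with 32 threads.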
>> that's all folks.