Re: [eigen] patch to add ACML support to BTL
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] patch to add ACML support to BTL
- From: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
- Date: Wed, 18 Mar 2009 13:30:15 +0100
On Wed, Mar 18, 2009 at 1:22 PM, Ilya Baran <baran37@xxxxxxxxx> wrote:
> Hello,
>
> I'll see what I can put up, but after the weekend--I have a SIGGRAPH
> rebuttal to do now.
small world :)
> -Ilya
>
> On Wed, Mar 18, 2009 at 2:12 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>> And compilation options used should be put up as well.
>>
>> On Wed, Mar 18, 2009 at 12:36 AM, Gael Guennebaud
>> <gael.guennebaud@xxxxxxxxx> wrote:
>>> yes, A*A^T is actually implemented as a standard matrix product: A
>>> * A.transpose() for Eigen, and gemm for the BLAS backends. In
>>> practice, ATLAS automatically detects this case inside gemm and calls
>>> syrk, hence the odd results. I think we could easily do the same in
>>> Eigen, so I'd be happy to see your implementation!
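For illustration, here is a minimal sketch of the two BLAS-level
alternatives mentioned above, written against the standard CBLAS
interface. The function name and the square, column-major layout are
arbitrary choices for this example: dgemm is the generic product a BLAS
backend issues for A*A^T, while dsyrk is the symmetric rank-k update
that ATLAS dispatches to internally, writing only one triangle of C.

    #include <cblas.h>

    // C = A * A^T for an n-by-n, column-major matrix A (illustration only).
    void aat_via_blas(int n, const double* A, double* C)
    {
        // Generic path: a full matrix-matrix product, about 2*n^3 flops.
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                    n, n, n, 1.0, A, n, A, n, 0.0, C, n);

        // Specialized path: a symmetric rank-k update. Only the lower
        // triangle of C is written, roughly halving the flop count.
        cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                    n, n, 1.0, A, n, 0.0, C, n);
    }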
>>>
>>> On Tue, Mar 17, 2009 at 6:28 PM, Ilya Baran <baran37@xxxxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> One note about A*A^T and A^T*A -- I think the flop count in the BTL
>>>> code is exaggerated by a factor of two (because the result is
>>>> symmetric, it takes half the flops of a normal matrix multiply).
>>>> Additionally, the BLAS benchmark calls *gemm instead of *syrk; in my
>>>> test with MKL, *syrk is almost twice as fast. I ran some informal
>>>> tests on my Core Duo, and neither:
>>>>
>>>> n = m * m.transpose();
>>>>
>>>> nor the suggested
>>>>
>>>> n.part<Eigen::SelfAdjoint>() = (m*m.adjoint()).lazy();
>>>>
>>>> performs nearly as well as *syrk, but a simple vectorized, unrolled,
>>>> blocked, two-columns-at-a-time implementation I hacked up matches MKL
>>>> (single-threaded, of course). I think A*A^T and A^T*A are
>>>> sufficiently common to warrant a specialized implementation. I can
>>>> share what I wrote, but it would need a bit of work to be general
>>>> (e.g. the block size is hard-coded and it assumes that the matrix
>>>> dimensions are a multiple of it).
>>>>
>>>> -Ilya
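Ilya's implementation is not included in this thread. Purely as an
illustration of why the symmetric case needs only about half the flops,
here is a naive, unblocked, unvectorized sketch (names and the row-major
layout are arbitrary) that fills only the lower triangle of C = A*A^T:

    #include <vector>

    // Naive sketch: lower triangle of C = A * A^T for an n-by-k,
    // row-major matrix A. Only the n*(n+1)/2 entries with j <= i are
    // computed, i.e. about half the work of a full gemm.
    void lower_triangle_aat(int n, int k,
                            const std::vector<double>& A,  // size n*k
                            std::vector<double>& C)        // size n*n
    {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j <= i; ++j) {
                double s = 0.0;
                for (int p = 0; p < k; ++p)
                    s += A[i * k + p] * A[j * k + p];
                C[i * n + j] = s;  // the upper triangle is left untouched
            }
    }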
>>>>
>>>> On Tue, Mar 17, 2009 at 12:57 PM, Gael Guennebaud
>>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>>> Hi,
>>>>>
>>>>> just to say I updated the main benchmark page.
>>>>>
>>>>> Gael
>>>>>
>>>>> On Tue, Mar 17, 2009 at 10:27 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>>>> Thanks for that. After looking at these benches, I was thinking
>>>>>> that perhaps Eigen had become quite a bit slower with the new
>>>>>> versions!
>>>>>>
>>>>>> On Tue, Mar 17, 2009 at 2:52 PM, Gael Guennebaud
>>>>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>>>>> On Tue, Mar 17, 2009 at 9:20 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>>>>>> I think that with the new library versions, new Eigen versions, and
>>>>>>>> new gcc, we should put these results on the main benchmark page of
>>>>>>>> the Eigen website. BTW, looking at your Pentium D benchmarks, it
>>>>>>>> seems Eigen's performance has slipped considerably; or is it all
>>>>>>>> attributable to the Core 2 being a much better CPU?
>>>>>>>
>>>>>>> thanks for the benchmarks,
>>>>>>>
>>>>>>> a Core 2 is indeed much better than a Pentium D, and since I only
>>>>>>> have a Core 2, the critical parts (matrix-matrix products) are fine
>>>>>>> tuned only for the Core 2. Another reason is that gcc 4.3 generates
>>>>>>> slower code than 4.2: some constant expressions are not hoisted out
>>>>>>> of the inner loops, it is not optimal with block expressions, and by
>>>>>>> default 4.3 automatically generates vectorized code, which conflicts
>>>>>>> with Eigen's own vectorization. gcc 4.4 does not suffer from these
>>>>>>> issues, and sometimes its auto-vectorization is even better than
>>>>>>> Eigen's explicit one because it better understands what it is doing:
>>>>>>> an example is the rank-2 update, which is simply a series of
>>>>>>> "v += ax + by" operations. But Eigen's explicit vectorization is
>>>>>>> still worth it, because we can vectorize many more cases than gcc:
>>>>>>> "v = ax + by" (which gcc does not vectorize), matrix products,
>>>>>>> vectorization combined with explicit unrolling, and in the future
>>>>>>> sin, cos, pow, exp, etc.
>>>>>>>
>>>>>>> gael
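To make the rank-2 update example concrete, here is a rough sketch of
the two forms being compared (names and types are arbitrary): a plain
loop that a recent gcc can auto-vectorize, and the equivalent Eigen
expression, whose SIMD code comes from Eigen's own vectorization.

    #include <Eigen/Core>

    // Plain-loop form of "v += a*x + b*y": a candidate for gcc's
    // auto-vectorizer.
    void rank2_update_loop(int n, double a, const double* x,
                           double b, const double* y, double* v)
    {
        for (int i = 0; i < n; ++i)
            v[i] += a * x[i] + b * y[i];
    }

    // The same operation as an Eigen expression; here the vectorized
    // code comes from Eigen itself rather than from the compiler.
    void rank2_update_eigen(double a, const Eigen::VectorXd& x,
                            double b, const Eigen::VectorXd& y,
                            Eigen::VectorXd& v)
    {
        v += a * x + b * y;
    }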
>>>>>>>
>>>>>>>> On Tue, Mar 17, 2009 at 1:08 PM, Victor <flyaway1212@xxxxxxxxx> wrote:
>>>>>>>>> Hi all.
>>>>>>>>> It sure took a while to run all the benchmarks with all the libraries
>>>>>>>>> available to me... I wish I had read the instructions more carefully and
>>>>>>>>> hadn't wasted any time testing multithreaded libraries...
>>>>>>>>> Anyway, the results are on the wiki:
>>>>>>>>> http://eigen.tuxfamily.org/index.php?title=Benchmark_AMD_Intel_compare
>>>>>>>>>
>>>>>>>>> Gael Guennebaud wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Victor,
>>>>>>>>>>
>>>>>>>>>> thanks a lot for the patch.
>>>>>>>>>> applied in rev 935462; the syr2 header will follow in a second.
>>>>>>>>>>
>>>>>>>>>> so what's your conclusion, is ACML as good as MKL ?
>>>>>>>>> Unfortunately, no. ACML is not bad, though. It's hard to say once
>>>>>>>>> and for all, but most of the time MKL beats ACML; even on an AMD
>>>>>>>>> CPU, MKL is typically better. ACML shows decent performance (even
>>>>>>>>> on an Intel CPU), on average similar to ATLAS, but again the
>>>>>>>>> results differ from test to test. The good thing about ACML (and
>>>>>>>>> MKL, Goto and ATLAS) is that they can be used in multithreaded
>>>>>>>>> mode, which unfortunately can't be demonstrated with BTL as far as
>>>>>>>>> I can tell.
>>>>>>>>>
>>>>>>>>> Also, it looks like in comparison with other libs Eigen does better on
>>>>>>>>> Intel than on AMD.
>>>>>>>>>
>>>>>>>>> Out of curiosity, I have also run BTL with Eigen compiled with 4
>>>>>>>>> different compilers (well, 3 different gcc versions and Intel
>>>>>>>>> C++). See the results here:
>>>>>>>>> http://eigen.tuxfamily.org/index.php?title=Eigen2_benchmark_Intel
>>>>>>>>>
>>>>>>>>> I hope this might be useful to somebody.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Victor.
>>>>>>>>>