Re: [eigen] patch to add ACML support to BTL
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] patch to add ACML support to BTL
- From: Ilya Baran <baran37@xxxxxxxxx>
- Date: Wed, 18 Mar 2009 08:22:15 -0400
Hello,
I'll see what I can put up, but after the weekend--I have a SIGGRAPH
rebuttal to do now.
-Ilya
On Wed, Mar 18, 2009 at 2:12 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
> The compilation options used should be put up as well.
>
> On Wed, Mar 18, 2009 at 12:36 AM, Gael Guennebaud
> <gael.guennebaud@xxxxxxxxx> wrote:
>> yes, A*A^T is actually implemented as a standard matrix product, i.e. A
>> * A.transpose() for Eigen and gemm for BLAS. In practice, ATLAS
>> automatically detects this case in gemm and calls syrk, hence the
>> weird results. I think we could easily do the same in Eigen, so
>> I'd be happy to see your implementation!
>>
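For concreteness, the kind of detection described above could look roughly like the following. This is only a sketch under my own assumptions: `Operand`, `Kernel` and `choose_kernel` are made-up names, leading dimensions are ignored, and it is not how ATLAS or Eigen actually implement it.

```cpp
#include <cassert>
#include <cstddef>

// Which kernel a product would be routed to (hypothetical names).
enum class Kernel { Gemm, Syrk };

// Hypothetical operand descriptor: storage pointer, stored shape,
// and whether the operand is used transposed in the product.
struct Operand {
    const double* data;
    std::size_t rows, cols;  // shape as stored, before transposition
    bool transposed;
};

// A product is of the form A*A^T or A^T*A when both operands alias the
// same storage with the same shape and exactly one of them is transposed.
// In that case the result is symmetric and a syrk-style kernel applies.
Kernel choose_kernel(const Operand& lhs, const Operand& rhs) {
    assert(lhs.data != nullptr && rhs.data != nullptr);
    const bool same_matrix = lhs.data == rhs.data &&
                             lhs.rows == rhs.rows && lhs.cols == rhs.cols;
    const bool one_transposed = lhs.transposed != rhs.transposed;
    return (same_matrix && one_transposed) ? Kernel::Syrk : Kernel::Gemm;
}
```

With one buffer viewed both plain and transposed, `choose_kernel` picks `Kernel::Syrk` in either order; any other pair of operands falls back to `Kernel::Gemm`.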
>> On Tue, Mar 17, 2009 at 6:28 PM, Ilya Baran <baran37@xxxxxxxxx> wrote:
>>> Hi,
>>>
>>> One note about A*A^T and A^T*A -- I think the flop count in the BTL
>>> code is exaggerated by a factor of two (because the result is
>>> symmetric, it takes half the flops of a normal matrix multiply).
>>> Additionally, the BLAS benchmark calls *gemm instead of *syrk, which
>>> in my test with MKL is almost twice as fast. I ran some informal
>>> tests on my Core Duo and neither:
>>>
>>> n = m * m.transpose();
>>>
>>> nor the suggested
>>>
>>> n.part<Eigen::SelfAdjoint>() = (m*m.adjoint()).lazy();
>>>
>>> performs nearly as well as *syrk, but a simple vectorized, unrolled,
>>> blocked, two-columns-at-a-time implementation I hacked up matches MKL
>>> (single-threaded, of course). I think that A*A^T and A^T*A are
>>> sufficiently common to warrant a specialized implementation. I can
>>> share what I wrote, but it would need a bit of work to be general
>>> (e.g. the block size is hard-coded and it assumes that matrix
>>> dimensions are a multiple of it).
>>>
>>> -Ilya
>>>
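To make the factor of two in the flop count concrete, here is a minimal sketch in plain C++ (no Eigen, no BLAS, no blocking or vectorization). `Mat`, `row_dot`, `full_aat` and `sym_aat` are names invented for this illustration, and the code is not the implementation mentioned above. Each multiply-add is counted as 2 flops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal row-major matrix, for illustration only (not an Eigen type).
struct Mat {
    std::size_t rows, cols;
    std::vector<double> a;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), a(r * c, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) {
        assert(i < rows && j < cols);
        return a[i * cols + j];
    }
    double operator()(std::size_t i, std::size_t j) const {
        assert(i < rows && j < cols);
        return a[i * cols + j];
    }
};

// Dot product of rows i and j of A; adds 2 flops per term to `flops`.
static double row_dot(const Mat& A, std::size_t i, std::size_t j,
                      std::size_t& flops) {
    double s = 0.0;
    for (std::size_t k = 0; k < A.cols; ++k) s += A(i, k) * A(j, k);
    flops += 2 * A.cols;
    return s;
}

// gemm-style: every entry of C = A * A^T is computed independently.
Mat full_aat(const Mat& A, std::size_t& flops) {
    Mat C(A.rows, A.rows);
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t j = 0; j < A.rows; ++j)
            C(i, j) = row_dot(A, i, j, flops);
    return C;
}

// syrk-style: compute the lower triangle only, then mirror it.
Mat sym_aat(const Mat& A, std::size_t& flops) {
    Mat C(A.rows, A.rows);
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t j = 0; j <= i; ++j) {
            C(i, j) = row_dot(A, i, j, flops);
            C(j, i) = C(i, j);  // a copy, no arithmetic
        }
    return C;
}
```

For an n x n result with inner dimension k, the full version counts 2*n*n*k flops and the triangular one n*(n+1)*k, so the ratio approaches 2 as n grows. For a 3 x 2 input that is 36 versus 24.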
>>> On Tue, Mar 17, 2009 at 12:57 PM, Gael Guennebaud
>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> just to say I updated the main benchmark page.
>>>>
>>>> Gael
>>>>
>>>> On Tue, Mar 17, 2009 at 10:27 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>>> Thanks for that. After looking at these benchmarks, I was thinking that
>>>>> perhaps Eigen has become considerably slower with new versions!
>>>>>
>>>>> On Tue, Mar 17, 2009 at 2:52 PM, Gael Guennebaud
>>>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>>>> On Tue, Mar 17, 2009 at 9:20 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>>>>> I think that with new library versions, new Eigen versions, and a new
>>>>>>> gcc, we should put these results on the main benchmark page of the
>>>>>>> Eigen website. BTW, looking at your Pentium D benchmarks, I think
>>>>>>> Eigen's performance has slipped considerably. Or is it all
>>>>>>> attributable to core2 being a much better CPU?
>>>>>>
>>>>>> thanks for the benchmarks,
>>>>>>
>>>>>> core2 is indeed much better than a Pentium D, and since I only have a
>>>>>> core2, the critical parts (matrix-matrix products) are only fine-tuned
>>>>>> for the core2. Another reason is that gcc 4.3 generates slower code
>>>>>> than 4.2: some constant expressions are not hoisted out of the inner
>>>>>> loops, it is not optimal with block expressions, and by default 4.3
>>>>>> automatically generates vectorized code which conflicts with Eigen's
>>>>>> automatic vectorization. 4.4 does not suffer from any of these issues,
>>>>>> and sometimes gcc 4.4's auto-vectorization is even better than Eigen's
>>>>>> explicit one because it better understands what it is doing: an
>>>>>> example is the rank-2 update, which simply consists of a series of
>>>>>> "v += ax + by" ops. But Eigen's explicit vectorization is still worth
>>>>>> it because we are able to vectorize many more cases than gcc.
>>>>>> Examples: "v = ax + by" (which gcc does not vectorize), matrix
>>>>>> products, vectorization combined with explicit unrolling, and in the
>>>>>> future sin, cos, pow, exp, etc.
>>>>>>
>>>>>> gael
>>>>>>
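To spell out why a rank-2 update is just a series of "v += ax + by" ops: in A += x*y^T + y*x^T, column j of A receives y[j]*x + x[j]*y. A minimal plain C++ sketch, assuming A is stored as a vector of columns and the whole matrix is updated (BLAS syr2 touches only one triangle); `Vec`, `add_scaled_pair` and `rank2_update` are names made up for this illustration, not Eigen or BTL code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// v += a*x + b*y in a single pass over the data.
void add_scaled_pair(Vec& v, double a, const Vec& x, double b, const Vec& y) {
    assert(v.size() == x.size() && v.size() == y.size());
    for (std::size_t i = 0; i < v.size(); ++i) v[i] += a * x[i] + b * y[i];
}

// A += x*y^T + y*x^T, with A stored as a vector of columns.
// Column j gets y[j]*x + x[j]*y, i.e. one "v += ax + by" per column.
void rank2_update(std::vector<Vec>& A_cols, const Vec& x, const Vec& y) {
    assert(A_cols.size() == x.size() && x.size() == y.size());
    for (std::size_t j = 0; j < A_cols.size(); ++j)
        add_scaled_pair(A_cols[j], y[j], x, x[j], y);
}
```

The inner loop in `add_scaled_pair` is a simple, dependency-free loop of the kind a compiler's auto-vectorizer has a good chance of handling.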
>>>>>>> On Tue, Mar 17, 2009 at 1:08 PM, Victor <flyaway1212@xxxxxxxxx> wrote:
>>>>>>>> Hi all.
>>>>>>>> It sure took a while to run all the benchmarks with all the libraries
>>>>>>>> available to me... I wish I had read the instructions more carefully and
>>>>>>>> hadn't wasted any time testing multithreaded libraries...
>>>>>>>> Anyways, the results are on the wiki:
>>>>>>>> http://eigen.tuxfamily.org/index.php?title=Benchmark_AMD_Intel_compare
>>>>>>>>
>>>>>>>> Gael Guennebaud wrote:
>>>>>>>>>
>>>>>>>>> Hi Victor,
>>>>>>>>>
>>>>>>>>> thanks a lot for the patch.
>>>>>>>>> applied in rev 935462, the syr2 header will follow in a second.
>>>>>>>>>
>>>>>>>>> so what's your conclusion, is ACML as good as MKL?
>>>>>>>> Unfortunately, no. ACML is not bad though. It's hard to say once and for
>>>>>>>> all, but most of the time MKL beats ACML. Even on an AMD CPU, MKL is
>>>>>>>> typically better. ACML shows decent performance (even on an Intel CPU),
>>>>>>>> on average similar to ATLAS, but again results differ from test to test.
>>>>>>>> The good thing about ACML (and MKL, Goto and ATLAS) is that they can be
>>>>>>>> used in multithreaded mode, which unfortunately can't be demonstrated
>>>>>>>> with BTL as far as I can tell.
>>>>>>>>
>>>>>>>> Also, it looks like, compared with the other libs, Eigen does better on
>>>>>>>> Intel than on AMD.
>>>>>>>>
>>>>>>>> Out of curiosity, I have also run BTL with Eigen compiled with 4
>>>>>>>> different compilers: 3 different gcc versions and Intel C++. See
>>>>>>>> the results here:
>>>>>>>> http://eigen.tuxfamily.org/index.php?title=Eigen2_benchmark_Intel
>>>>>>>>
>>>>>>>> I hope this might be useful to somebody.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Victor.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Rohit Garg
>>>>>>>
>>>>>>> http://rpg-314.blogspot.com/
>>>>>>>
>>>>>>> Senior Undergraduate
>>>>>>> Department of Physics
>>>>>>> Indian Institute of Technology
>>>>>>> Bombay