Re: [eigen] patch to add ACML support to BTL



The compilation options used should be posted as well.

On Wed, Mar 18, 2009 at 12:36 AM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
> yes, A*A^T is currently implemented as a standard matrix product, i.e.
> A * A.transpose() for Eigen and gemm for the BLAS libraries. In practice,
> ATLAS automatically detects this case inside gemm and calls syrk, hence
> the odd results. I think we could easily do the same in Eigen, so I'd be
> happy to see your implementation!
>
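
For concreteness, a rough sketch of the two BLAS-level calls being compared
(assuming a CBLAS interface, i.e. cblas.h; the wrapper names aat_gemm and
aat_syrk are just for illustration). syrk writes only one triangle of C, so
it does roughly half the work of the equivalent gemm call:

  #include <cblas.h>

  // C (n x n) = A (n x k) * A^T computed as a full product: ~2*n*n*k flops.
  void aat_gemm(int n, int k, const double* A, double* C)
  {
      cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                  n, n, k, 1.0, A, n, A, n, 0.0, C, n);
  }

  // Only the lower triangle of C = A * A^T: ~n*n*k flops.
  void aat_syrk(int n, int k, const double* A, double* C)
  {
      cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                  n, k, 1.0, A, n, 0.0, C, n);
  }
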
> On Tue, Mar 17, 2009 at 6:28 PM, Ilya Baran <baran37@xxxxxxxxx> wrote:
>> Hi,
>>
>> One note about A*A^T and A^T*A -- I think the flop count in the BTL
>> code is exaggerated by a factor of two (because the result is
>> symmetric, it takes half the flops of a normal matrix multiply).
>> Additionally, the BLAS benchmark calls *gemm instead of *syrk, which
>> in my test with MKL is almost twice as fast.  I ran some informal
>> tests on my Core Duo, and neither
>>
>> n = m * m.transpose();
>>
>> nor the suggested
>>
>> n.part<Eigen::SelfAdjoint>() = (m*m.adjoint()).lazy();
>>
>> performs nearly as well as *syrk, but a simple vectorized, unrolled,
>> blocked two-columns-at-a-time implementation I hacked up matches MKL
>> (single-threaded, of course).  I think that A*A^T and A^T*A are
>> sufficiently common to warrant a specialized implementation.  I can
>> share what I wrote, but it would need a bit of work to be general
>> (e.g. the block size is hard-coded and it assumes that the matrix
>> dimensions are a multiple of it).
>>
>>   -Ilya
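
To make the halved flop count concrete, here is a very rough, unoptimized
sketch of computing only the lower triangle of C = A*A^T (the helper name
aat_lower_naive is made up; this is not Ilya's blocked, vectorized version,
only the basic idea of skipping the redundant half):

  #include <cstddef>

  // A is n x k, column-major with leading dimension n; C is n x n,
  // column-major. Only entries C(i, j) with i >= j are written.
  void aat_lower_naive(std::size_t n, std::size_t k,
                       const double* A, double* C)
  {
      for (std::size_t j = 0; j < n; ++j)
          for (std::size_t i = j; i < n; ++i) {
              double s = 0.0;
              for (std::size_t p = 0; p < k; ++p)   // dot product of rows i and j of A
                  s += A[i + p * n] * A[j + p * n];
              C[i + j * n] = s;                     // upper triangle left untouched
          }
  }
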
>>
>> On Tue, Mar 17, 2009 at 12:57 PM, Gael Guennebaud
>> <gael.guennebaud@xxxxxxxxx> wrote:
>>> Hi,
>>>
>>> just to say I updated the main benchmark page.
>>>
>>> Gael
>>>
>>> On Tue, Mar 17, 2009 at 10:27 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>> Thanks for that. After looking at these benchmarks, I was starting to
>>>> think that perhaps Eigen had become considerably slower with the newer
>>>> versions!
>>>>
>>>> On Tue, Mar 17, 2009 at 2:52 PM, Gael Guennebaud
>>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>>> On Tue, Mar 17, 2009 at 9:20 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>>>> I think that with the new library versions, new Eigen versions, and new
>>>>>> gcc, we should put these results on the main benchmark page of the
>>>>>> Eigen website. BTW, looking at your Pentium D benchmarks, has Eigen's
>>>>>> performance slipped considerably, or is it all attributable to the
>>>>>> Core 2 being a much better CPU?
>>>>>
>>>>> Thanks for the benchmarks,
>>>>>
>>>>> The Core 2 is indeed much better than a Pentium D, and since I only have
>>>>> a Core 2, the critical parts (the matrix-matrix products) are only
>>>>> fine-tuned for the Core 2. Another reason is that gcc 4.3 generates
>>>>> slower code than 4.2: some constant expressions are not hoisted out of
>>>>> the inner loops, it is not optimal with block expressions, and by
>>>>> default 4.3 automatically generates vectorized code which conflicts
>>>>> with Eigen's own vectorization. gcc 4.4 does not suffer from these
>>>>> issues, and sometimes its auto-vectorization is even better than
>>>>> Eigen's explicit one because it better understands what it is doing:
>>>>> an example is the rank-2 update, which simply consists of a series of
>>>>> "v += ax + by" ops. But Eigen's explicit vectorization is still worth
>>>>> it because we are able to vectorize many more cases than gcc does:
>>>>> "v = ax + by" (which gcc does not vectorize), matrix products,
>>>>> vectorization combined with explicit unrolling, and in the future sin,
>>>>> cos, pow, exp, etc.
>>>>>
>>>>> gael
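
For reference, the two loop shapes discussed above look roughly like this in
plain scalar C++ (the function names are illustrative; this is the pattern
the compilers see, not Eigen code):

  // Accumulating rank-2 update, "v += a*x + b*y":
  // the form gcc 4.4 reportedly auto-vectorizes well.
  void rank2_update(double* v, const double* x, const double* y,
                    double a, double b, int n)
  {
      for (int i = 0; i < n; ++i)
          v[i] += a * x[i] + b * y[i];
  }

  // Plain assignment, "v = a*x + b*y":
  // reportedly not auto-vectorized by gcc, but handled by Eigen's
  // explicit vectorization.
  void axpby(double* v, const double* x, const double* y,
             double a, double b, int n)
  {
      for (int i = 0; i < n; ++i)
          v[i] = a * x[i] + b * y[i];
  }
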
>>>>>
>>>>>> On Tue, Mar 17, 2009 at 1:08 PM, Victor <flyaway1212@xxxxxxxxx> wrote:
>>>>>>> Hi all.
>>>>>>> It sure took a while to run all the benchmarks with all the libraries
>>>>>>> available to me... I wish I had read the instructions more carefully and
>>>>>>> hadn't wasted any time testing multithreaded libraries...
>>>>>>> Anyway, the results are on the wiki:
>>>>>>> http://eigen.tuxfamily.org/index.php?title=Benchmark_AMD_Intel_compare
>>>>>>>
>>>>>>> Gael Guennebaud wrote:
>>>>>>>>
>>>>>>>> Hi Victor,
>>>>>>>>
>>>>>>>> Thanks a lot for the patch.
>>>>>>>> Applied in rev 935462; the syr2 header will follow in a second.
>>>>>>>>
>>>>>>>> So, what's your conclusion: is ACML as good as MKL?
>>>>>>> Unfortunately, no. ACML is not bad, though. It's hard to say once and
>>>>>>> for all, but most of the time MKL beats ACML; even on an AMD CPU, MKL
>>>>>>> is typically better. ACML shows decent performance (even on an Intel
>>>>>>> CPU), on average similar to ATLAS, but again the results differ from
>>>>>>> test to test. The good thing about ACML (and MKL, Goto and ATLAS) is
>>>>>>> that they can be used in multithreaded mode, which unfortunately
>>>>>>> can't be demonstrated with BTL as far as I can tell.
>>>>>>>
>>>>>>> Also, it looks like, in comparison with the other libraries, Eigen
>>>>>>> does better on Intel than on AMD.
>>>>>>>
>>>>>>> Out of curiosity, I have also run BTL with Eigen compiled with 4
>>>>>>> different compilers. Well, 3 different gcc versions and Intel C++. See
>>>>>>> the results here
>>>>>>> http://eigen.tuxfamily.org/index.php?title=Eigen2_benchmark_Intel
>>>>>>>
>>>>>>> I hope this might be useful to somebody.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Victor.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Rohit Garg
>>>>>>
>>>>>> http://rpg-314.blogspot.com/
>>>>>>
>>>>>> Senior Undergraduate
>>>>>> Department of Physics
>>>>>> Indian Institute of Technology
>>>>>> Bombay
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Rohit Garg
>>>>
>>>> http://rpg-314.blogspot.com/
>>>>
>>>> Senior Undergraduate
>>>> Department of Physics
>>>> Indian Institute of Technology
>>>> Bombay
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>



-- 
Rohit Garg

http://rpg-314.blogspot.com/

Senior Undergraduate
Department of Physics
Indian Institute of Technology
Bombay


