Re: [eigen] patch to add ACML support to BTL



Yes, A*A^T is actually implemented as a standard matrix product in BTL:
A * A.transpose() for Eigen, and gemm for BLAS. In practice, ATLAS
automatically detects this case in gemm and calls syrk, hence the
weird results. I think we could easily do the same in Eigen, so
I'd be happy to see your implementation!
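For reference, here is a minimal sketch of where a syrk-style kernel saves work compared with gemm. It assumes a row-major n x k matrix stored in a std::vector<double>; the function name aat_lower is hypothetical and is neither Eigen's nor BLAS's API. Only the lower triangle (j <= i) is evaluated and then mirrored, so n(n+1)/2 dot products of length k are computed instead of n*n. That is roughly n(n+1)k flops instead of 2n^2 k, which is the factor of two Ilya points out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: C = A * A^T for a row-major n x k matrix A.
// Like a syrk-style kernel, it evaluates only the lower triangle (j <= i)
// and mirrors each entry, performing n(n+1)/2 dot products instead of n*n.
std::vector<double> aat_lower(const std::vector<double>& a,
                              std::size_t n, std::size_t k) {
    assert(a.size() == n * k);
    std::vector<double> c(n * n, 0.0);  // row-major n x n result
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j <= i; ++j) {
            double sum = 0.0;
            for (std::size_t p = 0; p < k; ++p)
                sum += a[i * k + p] * a[j * k + p];  // row i . row j
            c[i * n + j] = sum;  // C(i, j)
            c[j * n + i] = sum;  // C(j, i): filled in by symmetry
        }
    }
    return c;
}
```

With a = {1, 2, 3, 4, 5, 6} (n = 2, k = 3), aat_lower returns {14, 32, 32, 77}. A real syrk additionally blocks and vectorizes the loops; this sketch only shows where the saving comes from.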

On Tue, Mar 17, 2009 at 6:28 PM, Ilya Baran <baran37@xxxxxxxxx> wrote:
> Hi,
>
> One note about A*A^T and A^T*A -- I think the flop count in the BTL
> code is exaggerated by a factor of two (because the result is
> symmetric, it takes half the flops of a normal matrix multiply).
> Additionally, the BLAS benchmark calls *gemm instead of *syrk, which
> in my test with MKL is almost twice as fast.  I ran some informal
> tests on my Core Duo and neither:
>
> n = m * m.transpose();
>
> nor the suggested
>
> n.part<Eigen::SelfAdjoint>() = (m*m.adjoint()).lazy();
>
> perform nearly as well as *syrk, but a simple vectorized unrolled
> blocked two-columns-at-a-time implementation I hacked up matches MKL
> (single-threaded, of course).  I think that A*A^T and A^T*A are
> sufficiently common to warrant a specialized implementation.  I can
> share what I wrote, but it would need a bit of work to be general
> (e.g. the block size is hard coded and it assumes that matrix
> dimensions are a multiple of it).
>
>   -Ilya
>
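Ilya's kernel was not posted in this thread, so the following is only a hypothetical illustration of the "two columns at a time" idea, without the blocking, unrolling or SIMD he mentions. It assumes a column-major n x k matrix in a std::vector<double> and computes the lower triangle of C = A * A^T two columns of C per pass, so that each A(i,p) loaded from memory feeds two multiply-adds. The name aat_lower_2cols is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch (not Ilya's code): lower triangle of C = A * A^T for a
// column-major n x k matrix A, two columns of C per pass, so every A(i, p)
// loaded feeds two multiply-adds. No blocking, unrolling or SIMD;
// C (column-major, n x n) keeps zeros in its strict upper triangle.
void aat_lower_2cols(const std::vector<double>& a, std::size_t n,
                     std::size_t k, std::vector<double>& c) {
    assert(a.size() == n * k);
    c.assign(n * n, 0.0);
    auto A = [&](std::size_t i, std::size_t p) { return a[p * n + i]; };
    std::size_t j = 0;
    for (; j + 1 < n; j += 2) {  // columns j and j+1 together
        for (std::size_t p = 0; p < k; ++p) {
            const double s0 = A(j, p), s1 = A(j + 1, p);
            c[j * n + j] += s0 * s0;  // diagonal entry C(j, j)
            for (std::size_t i = j + 1; i < n; ++i) {
                const double aip = A(i, p);      // loaded once, used twice
                c[j * n + i] += aip * s0;        // C(i, j)
                c[(j + 1) * n + i] += aip * s1;  // C(i, j+1)
            }
        }
    }
    if (j < n) {  // odd n: only the diagonal of the last column remains
        for (std::size_t p = 0; p < k; ++p) {
            const double s = A(j, p);
            c[j * n + j] += s * s;
        }
    }
}
```

For the 3 x 2 matrix with rows (1, 2), (3, 4), (5, 6), stored column-major as {1, 3, 5, 2, 4, 6}, the lower triangle comes out as 5; 11, 25; 17, 39, 61, and the upper triangle stays zero. The leftover-column branch handles odd n, the kind of edge case that a hard-coded block size skips.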
> On Tue, Mar 17, 2009 at 12:57 PM, Gael Guennebaud
> <gael.guennebaud@xxxxxxxxx> wrote:
>> Hi,
>>
>> just to say I updated the main benchmark page.
>>
>> Gael
>>
>> On Tue, Mar 17, 2009 at 10:27 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>> Thanks for that. After looking at these benchmarks, I was beginning to
>>> think that Eigen had become quite a bit slower with new versions!
>>>
>>> On Tue, Mar 17, 2009 at 2:52 PM, Gael Guennebaud
>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>> On Tue, Mar 17, 2009 at 9:20 AM, Rohit Garg <rpg.314@xxxxxxxxx> wrote:
>>>>> I think that with new library versions, new Eigen versions, and a new
>>>>> gcc, we should put these results on the main benchmark page of the Eigen
>>>>> website. BTW, looking at your Pentium D benchmarks, I think Eigen's
>>>>> performance has slipped considerably. Or is it all attributable to the
>>>>> Core 2 being a much better CPU?
>>>>
>>>> Thanks for the benchmarks.
>>>>
>>>> The Core 2 is indeed much better than a Pentium D, and since I only have
>>>> a Core 2, the critical parts (matrix-matrix products) are only fine-tuned
>>>> for the Core 2. Another reason is that gcc 4.3 generates slower code
>>>> than 4.2: some constant expressions are not hoisted out of the inner
>>>> loops, it is not optimal with block expressions, and by default 4.3
>>>> automatically generates vectorized code, which conflicts with Eigen's
>>>> own automatic vectorization. gcc 4.4 does not suffer from these issues,
>>>> and sometimes its auto-vectorization is even better than Eigen's explicit
>>>> one because it better understands what it is doing: an example is the
>>>> rank-2 update, which simply consists of a series of "v += ax + by" ops.
>>>> But Eigen's explicit vectorization is still worth it because we are able
>>>> to vectorize many more cases than gcc. Examples: "v = ax + by" (not
>>>> vectorized by gcc), matrix products, vectorization combined with explicit
>>>> unrolling, and in the future sin, cos, pow, exp, etc.
>>>>
>>>> gael
>>>>
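To make the rank-2 update example concrete, here is a plain-loop sketch; the function names are hypothetical. For the symmetric case, BLAS syr2 computes A := alpha*x*y^T + alpha*y*x^T + A. With alpha = 1 and u, v in place of x, y, column j of the update is v[j]*u + u[j]*v, so the whole operation is one "v += ax + by" loop per column, a simple pattern that a compiler's auto-vectorizer can handle. This version updates the full column-major matrix for simplicity, whereas syr2 only touches one triangle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One "v += a*x + b*y" operation over len elements: a simple loop that
// an auto-vectorizing compiler can turn into SIMD code.
void add_ax_by(double* v, double a, const double* x,
               double b, const double* y, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i)
        v[i] += a * x[i] + b * y[i];
}

// Rank-2 update M += u*v^T + v*u^T for a column-major n x n matrix M.
// Column j receives v[j]*u + u[j]*v, i.e. one add_ax_by call per column.
void rank2_update(std::vector<double>& m, const std::vector<double>& u,
                  const std::vector<double>& v) {
    const std::size_t n = u.size();
    assert(v.size() == n && m.size() == n * n);
    for (std::size_t j = 0; j < n; ++j)
        add_ax_by(&m[j * n], v[j], u.data(), u[j], v.data(), n);
}
```

For u = {1, 2} and v = {3, 4}, starting from a zero 2 x 2 matrix, m becomes {6, 10, 10, 16} (column-major), which is symmetric as expected.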
>>>>> On Tue, Mar 17, 2009 at 1:08 PM, Victor <flyaway1212@xxxxxxxxx> wrote:
>>>>>> Hi all.
>>>>>> It sure took a while to run all the benchmarks with all the libraries
>>>>>> available to me... I wish I had read the instructions more carefully and
>>>>>> hadn't wasted any time testing multithreaded libraries...
>>>>>> Anyways, the results are on the wiki:
>>>>>> http://eigen.tuxfamily.org/index.php?title=Benchmark_AMD_Intel_compare
>>>>>>
>>>>>> Gael Guennebaud wrote:
>>>>>>>
>>>>>>> Hi Victor,
>>>>>>>
>>>>>>> Thanks a lot for the patch.
>>>>>>> Applied in rev 935462; the syr2 header will follow in a second.
>>>>>>>
>>>>>>> So what's your conclusion, is ACML as good as MKL?
>>>>>> Unfortunately, no. ACML is not bad, though. It's hard to say once and for
>>>>>> all, but most of the time MKL beats ACML; even on an AMD CPU, MKL is
>>>>>> typically better. ACML shows decent performance (even on Intel CPUs), on
>>>>>> average similar to ATLAS, but again, results differ from test to test.
>>>>>> The good thing about ACML (and MKL, Goto and ATLAS) is that they can be
>>>>>> used in multithreaded mode, which unfortunately can't be demonstrated
>>>>>> with BTL as far as I can tell.
>>>>>>
>>>>>> Also, it looks like in comparison with other libs Eigen does better on
>>>>>> Intel than on AMD.
>>>>>>
>>>>>> Out of curiosity, I have also run BTL with Eigen compiled with 4
>>>>>> different compilers. Well, 3 different gcc versions and Intel C++. See
>>>>>> the results here
>>>>>> http://eigen.tuxfamily.org/index.php?title=Eigen2_benchmark_Intel
>>>>>>
>>>>>> I hope this might be useful to somebody.
>>>>>>
>>>>>> Cheers,
>>>>>> Victor.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Rohit Garg
>>>>>
>>>>> http://rpg-314.blogspot.com/
>>>>>
>>>>> Senior Undergraduate
>>>>> Department of Physics
>>>>> Indian Institute of Technology
>>>>> Bombay
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>


