Re: [eigen] BLAS backend



Hauke Heibel wrote:
> On Fri, Oct 16, 2009 at 2:06 PM, Thomas Capricelli
> <orzel@xxxxxxxxxxxxxxx <mailto:orzel@xxxxxxxxxxxxxxx>> wrote:
>     On Friday, 16 October 2009 at 11:08:48, Christian Mayer wrote:
>     > Parallelisation at the algorithm level (= expression template level)
>     > gives you the advantage of performing operations that have no
>     > dependency on each other. For example:
>     >
>     >   result = A*B + C*D    (A,B,C,D are big matrices)
>     >
>     > It's much better to have - on a two-core CPU - one thread
>     > calculating A*B and another doing C*D than both threads fighting
>     > each other (= locks) doing A*B and then doing C*D...
> Christian, did you mean to argue against OpenMP or for MPI with this
> example? I did not quite get it, sorry.

Well, my last OpenMP and MPI programming was a few years ago, so my
memories might be a bit faulty there... I remembered OpenMP as being
great for "little things in between", while MPI was better as soon as it
came to heavy stuff where you need complete control. That was the
background of my comment. Anyway, it boils down to the same thing as
usual: use the right tool for the job.

> Afaik, this example is perfectly
> possible with OpenMP.

Then it might be that OpenMP is the right tool in this case :)

> I
> have very limited knowledge of MPI, but when communication between
> different machines comes into play, estimating the cost/gain is
> probably rather difficult.

MPI can also be used for communication on a single machine. Being able
to do cluster computing is just an additional benefit (but that's
outside the scope of Eigen; it's the task of a higher-level program that
uses Eigen).

>     At the low level, maybe CUDA or multi-core stuff could be useful,
>     but I would be really surprised if MPI/OpenMP stuff turned out to
>     be really useful.
> I'm not totally convinced. There are so many possibilities for tuning
> OpenMP that I just don't see (depending on the expression) why low-level
> optimizations would not be possible with OpenMP.

IMHO, using threading (OpenMP, MPI, pthreads, whatever) at the low
level can be very useful, as can using CUDA. (And OpenMP would be very
useful at the low level, even with the considerations mentioned above.)
BUT I also think that parallelisation of the outermost loop (i.e. at the
algorithm level = expression template level) can give the biggest gains:
it can avoid many locks/barriers for CPU calculations, and it can avoid
many memory transfers for GPU (CUDA/OpenCL) calculations.

An A*B + C*D on a modern 8-core computer could, for example, be split
up into:
- 4 threads and cores for A*B (i.e. the matrix multiplication itself
  needs a low-level parallelisation)
- 4 threads and cores for C*D
- after that, 8 threads and cores for adding up 1/8 of the result each

On a 2-core machine it will most probably be a bit different, with less
parallelism within each operation:
- 1 thread and core for A*B
- 1 thread and core for C*D
- after that, 2 threads and cores for adding up 1/2 of the result each

And on a single-core machine no (additional) threads should be used at
all, and all calculations should be done by the main process.


