Re: [eigen] BLAS backend


To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: [eigen] BLAS backend
From: Christian Mayer <mail@xxxxxxxxxxxxxxxxx>
Date: Fri, 16 Oct 2009 16:08:30 +0200


Hauke Heibel wrote:
> On Fri, Oct 16, 2009 at 2:06 PM, Thomas Capricelli
> <orzel@xxxxxxxxxxxxxxx <mailto:orzel@xxxxxxxxxxxxxxx>> wrote:
> 
>     On Friday 16 October 2009 11:08:48, Christian Mayer wrote:
>     > Parallelisation at the algorithm level (= expression template level)
>     > gives you the advantage of performing operations that have no
>     > dependency on each other. For example:
>     >
>     >   result = A*B + C*D    (A,B,C,D are big matrices)
>     >
>     > It's much better to have - on a two-core CPU - one thread that's
>     > calculating A*B and another doing C*D than both threads fighting each
>     > other (= locks) doing A*B and then doing C*D...
> 
> 
> Christian, did you mean to argue against OpenMP or pro MPI with this
> example? I did not quite get it, sorry.

Well, my last OpenMP and MPI programming was a few years ago, so my
memories might be a bit faulty there... I remembered that OpenMP was
great for "little things in between" and that as soon as it comes to heavy
stuff where you need complete control, MPI was better. That was the
background of my comment. Anyway, it boils down to the same thing as
usual: use the right tool for the job.

> Afaik, this example is perfectly
> possible with OpenMP.

Then it might be that OpenMP is the right tool in this case :)
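
For illustration, here is a minimal sketch of how the quoted A*B + C*D example might look with OpenMP sections. The Mat type and the functions multiply() and sum_of_products_omp() are hypothetical stand-ins written for this example, not Eigen's API, and the naive triple loop is only a placeholder for a real product kernel. It assumes a compiler with OpenMP enabled (e.g. -fopenmp); without it the pragmas are ignored and the code runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal row-major matrix; a stand-in for illustration, not Eigen's API.
struct Mat {
    std::size_t rows, cols;
    std::vector<double> data;
    Mat(std::size_t r, std::size_t c, double v = 0.0)
        : rows(r), cols(c), data(r * c, v) {}
    double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Plain serial product; each call touches only its own output matrix.
Mat multiply(const Mat& a, const Mat& b) {
    assert(a.cols == b.rows);
    Mat c(a.rows, b.cols);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t k = 0; k < a.cols; ++k)
            for (std::size_t j = 0; j < b.cols; ++j)
                c(i, j) += a(i, k) * b(k, j);
    return c;
}

// result = A*B + C*D, with the two independent products in separate sections.
Mat sum_of_products_omp(const Mat& A, const Mat& B, const Mat& C, const Mat& D) {
    Mat ab(A.rows, B.cols), cd(C.rows, D.cols);

    // The two products share no output, so no locking is needed between them.
    #pragma omp parallel sections
    {
        #pragma omp section
        { ab = multiply(A, B); }
        #pragma omp section
        { cd = multiply(C, D); }
    }   // implicit barrier: both products are complete here

    assert(ab.rows == cd.rows && ab.cols == cd.cols);
    for (std::size_t i = 0; i < ab.data.size(); ++i)
        ab.data[i] += cd.data[i];
    return ab;
}
```

The only synchronisation point is the implicit barrier at the end of the sections block; the final addition is left serial here to keep the sketch short.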

> I have very limited knowledge of MPI, but when communication between
> different machines comes into play, the estimation of the cost gain is
> probably rather difficult.

MPI can also be used for communication on a single machine. Being able
to do cluster computing is just an additional benefit (but that's out of
the scope of Eigen; that's the task of a more high-level program that's
using Eigen).

>     At low-level, maybe cuda or multi-core stuff could be useful, but I
>     would be really surprised to have MPI/OpenMP stuff being really
>     useful.
> 
> 
> I'm not totally convinced. There are so many possibilities for tuning
> OpenMP that I just don't see (depending on the expression) why low-level
> optimizations could not be possible with OpenMP.

IMHO using threading (OpenMP, MPI, pthreads, ... whatever) at the low
level can be very useful, as can using CUDA. (And OpenMP would be
very useful at the low level, even with the considerations mentioned above.)
BUT I also think that parallelisation of the outermost loop (i.e. at
algorithm level = expression template level) can give the biggest gains:
it can avoid many locks / barriers for CPU calculations and it can
avoid many memory transfers for GPU (CUDA, OpenCL) calculations.

An A*B + C*D on a modern 8-core computer could e.g. be split up into
- 4 threads and cores for A*B (i.e. the matrix multiplication itself
  needs low-level parallelisation)
- 4 threads and cores for C*D
- after that, 8 threads and cores, each adding up 1/8 of the result

On a 2-core machine it'll most probably be a bit different, with fewer
threads:
- 1 thread and core for A*B
- 1 thread and core for C*D
- after that, 2 threads and cores, each adding up 1/2 of the result

And on a single-core machine no (additional) thread at all should be
used and all calculations should be done by the main process.
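
The core-count-dependent splitting described above could be sketched roughly as follows, using standard C++ threads. All names here (Matrix, multiply, add_in_place, sum_of_products) are hypothetical helpers for illustration, not part of Eigen. As a simplification, each product runs on a single thread, so the 8-core case with 4 threads per product is not reproduced; only the final addition is split into one chunk per core. In real use, `cores` would presumably come from std::thread::hardware_concurrency().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// Hypothetical stand-in: an n x n matrix stored row-major in a flat vector.
using Matrix = std::vector<double>;

// Serial product of two n x n matrices.
Matrix multiply(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// out += rhs, split into at most `parts` contiguous chunks, one thread per chunk.
void add_in_place(Matrix& out, const Matrix& rhs, unsigned parts) {
    assert(out.size() == rhs.size());
    auto add_range = [&out, &rhs](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) out[i] += rhs[i];
    };
    if (parts <= 1) {            // single core: no additional thread at all
        add_range(0, out.size());
        return;
    }
    const std::size_t chunk = (out.size() + parts - 1) / parts;
    std::vector<std::thread> workers;
    for (unsigned p = 0; p < parts; ++p) {
        const std::size_t begin = std::min(out.size(), p * chunk);
        const std::size_t end = std::min(out.size(), begin + chunk);
        if (begin < end) workers.emplace_back(add_range, begin, end);
    }
    for (auto& w : workers) w.join();
}

// result = A*B + C*D for n x n matrices, using up to `cores` threads.
Matrix sum_of_products(const Matrix& A, const Matrix& B,
                       const Matrix& C, const Matrix& D,
                       std::size_t n, unsigned cores) {
    if (cores <= 1) {            // everything on the calling thread
        Matrix result = multiply(A, B, n);
        add_in_place(result, multiply(C, D, n), 1);
        return result;
    }
    // C*D runs on its own thread while the calling thread computes A*B.
    auto cd = std::async(std::launch::async,
                         [&C, &D, n] { return multiply(C, D, n); });
    Matrix result = multiply(A, B, n);
    add_in_place(result, cd.get(), cores);  // get() waits for C*D to finish
    return result;
}
```

The two products never write to shared data, so the only waits are the cd.get() call and the joins after the chunked addition.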

CU,
Christian
