- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] BLAS backend
- From: Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx>
- Date: Fri, 16 Oct 2009 15:24:22 +0200
On Fri, Oct 16, 2009 at 2:06 PM, Thomas Capricelli <orzel@xxxxxxxxxxxxxxx> wrote:
On Friday, 16 October 2009 at 11:08:48, Christian Mayer wrote:
> Parallelisation at the algorithm level (= _expression_ template level)
> gives you the advantage of performing operations that have no dependency on
> each other. For example:
> result = A*B + C*D (A, B, C, D are big matrices)
> It's much better to have - on a two-core CPU - one thread that's
> calculating A*B and another doing C*D than both threads fighting each
> other (= locks) while doing A*B and then C*D...
Christian, did you mean to argue against OpenMP or for MPI with this example? I did not quite get it, sorry. As far as I know, this example is perfectly possible with OpenMP. If I remember correctly, you would do just this:
#pragma omp parallel sections
{
    #pragma omp section
    tmp1 = A*B;
    #pragma omp section
    tmp2 = C*D;
}   // implicit barrier at the end of the sections region
result = tmp1 + tmp2;
The choice between such a high-level algorithm (or a low-level one) could probably be made in a similar manner as is done right now with the _expression_ templates and the decision making regarding temporaries. I have very limited knowledge of MPI, but when communication between different machines comes into play, estimating the cost/gain is probably rather difficult.
> At the low level, maybe CUDA or multi-core stuff could be useful, but I would be
> really surprised to see MPI/OpenMP stuff being really useful.
I'm not totally convinced. There are so many possibilities for tuning OpenMP that I just don't see (depending on the _expression_) why low-level optimizations should not be possible with OpenMP.