Re: [eigen] BLAS backend


On Fri, Oct 16, 2009 at 2:06 PM, Thomas Capricelli <orzel@xxxxxxxxxxxxxxx> wrote:
On Friday, 16 October 2009 at 11:08:48, Christian Mayer wrote:
> Parallelisation at the algorithm level (= expression template level)
> gives you the advantage of performing operations that have no dependency
> on each other. For example:
>   result = A*B + C*D    (A,B,C,D are big matrices)
> On a two-core CPU it's much better to have one thread calculating A*B
> and another doing C*D than both threads fighting each other (= locks)
> while doing A*B and then doing C*D...

Christian, did you mean to argue against OpenMP or for MPI with this example? I did not quite get it, sorry. AFAIK, this example is perfectly possible with OpenMP. If I remember correctly, you would do just this:

MatrixXd tmp1, tmp2;
#pragma omp parallel sections
{
  #pragma omp section
  tmp1 = A*B;
  #pragma omp section
  tmp2 = C*D;
}
// implicit barrier at the end of the sections construct
result = tmp1 + tmp2;

The choice between such an algorithm-level parallelisation and a low-level one could probably be made in a similar manner to how it is done right now with the expression templates and the decision making regarding temporaries. I have very limited knowledge of MPI, but when communication between different machines comes into play, estimating the cost/gain trade-off is probably rather difficult.
At the low level, maybe CUDA or multi-core stuff could be useful, but I would be
really surprised to see MPI/OpenMP stuff being really useful.

I'm not totally convinced. There are so many possibilities for tuning OpenMP that I just don't see why (depending on the expression) low-level optimizations should not be possible with OpenMP.
