Re: [eigen] FLENS C++ expression template Library has excellent documentation


2009/4/18 Rohit Garg <rpg.314@xxxxxxxxx>:
>> So, in which area does Intel MKL still have a long-term lead? I would
>> say parallelization. We haven't started that yet and it is probably a
>> very, very tough one. It's what I have in mind when I say that a
>> BLAS/LAPACK wrapper is still welcome.
> Why do you think parallelization is very difficult? Do you mean
> parallelization infrastructure? AFAICS, using OpenMP will be cool. Let
> the compiler handle all the dirty business, etc. This is something I
> want to explore (time availability is of course important!) so I would
> like a heads-up.

I'm very ignorant of these matters. If parallelization can be done in
a generic way, then yes, I understand that it's just a matter of
OpenMP-ifying for loops. Although even so, there remain issues to
sort out: what if the user application doesn't want Eigen to launch more
than N threads, because it is already launching a lot of threads of
its own? OpenMP 2 didn't seem to help much in that situation; maybe
OpenMP 3 is better.

But can efficient parallelization really be done in a generic way? It
seems to me that different algorithms may require different strategies
for parallelization. For example, look at the matrix product.
Parallelizing it will probably mean computing the product by blocks, each
thread computing a different block. But there are many different ways
of splitting the product into blocks. Which one is most efficient? One
wants to minimize thread interdependency, but also to minimize memory
accesses... conflicting goals! So it's not obvious at all which block
product strategy to take. With other algorithms, another issue will be
load balancing. All this makes me doubt that efficient parallelization
can be obtained just by OpenMP-ifying some for loops!

So as I see it the work for parallelizing algorithms splits into two phases:

1) restructure the algorithm itself so that it parallelizes well
2) implement it using e.g. OpenMP (or something else entirely if
e.g. we want to leverage GPUs)

A good candidate for parallelization would be block Cholesky (LLt) or
block LU (with partial pivoting). (Both of these are on my todo list but
I'm lagging, so feel free to not wait for me ;) ). More generally, any
algorithm that works per-blocks.

>> appropriate. The downside is very long compilation times -- 3 seconds
>> for a trivial program and 10 seconds for a typical file, and remember
>> that this is only basic operations, since for the nontrivial stuff it
>> relies on LAPACK. Extrapolating, this suggests the order of magnitude
>> of 1 minute to compile any of our big linear algebra algorithms.
> As someone whose programs take days, aka lots of iterations (I kid you
> not), to run: a microsecond here and a millisecond there matter a LOT
> to me. Maybe that's why I don't care if I need an hour to compile just
> 500 lines of code. :) But still, why bother with compilation speed? I
> mean, runtime performance should matter more, right?

That's typical of scientific applications. You're not the only one to
think like that. Then at the other end of the spectrum there are many
people who consider compilation times to be very important for
productivity, and there are also large free software projects *cough*
KDE *cough* that take hours to compile already and where developers
have to recompile the whole thing often...

