Re: [eigen] FLENS C++ expression template Library has excellent documentation


On Sat, Apr 18, 2009 at 10:53 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> 2009/4/18 Rohit Garg <rpg.314@xxxxxxxxx>:
>>> So, in which area does Intel MKL still have a long-term lead? I would
>>> say parallelization. We haven't started that yet and it is probably a
>>> very, very tough one. It's what I have in mind when I say that a
>>> BLAS/LAPACK wrapper is still welcome.
>> Why do you think parallelization is very difficult? Do you mean
>> parallelization infrastructure? AFAICS, using OpenMP will be cool. Let
>> the compiler handle all the dirty business, etc. This is something I
>> want to explore (time availability is of course important!), so I would
>> like a heads up.
> I'm very ignorant of these matters. If parallelization can be done in
> a generic way, then yes I understand that it's just a matter of
> OpenMP-ifying for loops. Although even so, there remain issues to
> sort: what if the user application doesn't want Eigen to launch more
> than N threads, because it is already launching a lot of threads of
> its own? OpenMP 2 didn't seem to help much in that situation; maybe 3
> is better.

I was thinking of something like -DEIGEN_PARALLEL=4 (i.e. opt-in
parallelization), specified at compile time, to set the number of
threads to launch. BLAS1 should be trivial to parallelize, and BLAS2
shouldn't be too difficult either.

> But can efficient parallelization really be done in a generic way? It
> seems to me that different algorithms may require different strategies
> for parallelization. For example, look at matrix product.
> Parallelizing it will probably mean doing the product by blocks, each
> thread computing a different block. But there are many different ways
> of splitting the product into blocks. Which one is most efficient? One
> wants to minimize thread interdependency, but also to minimize memory
> accesses... conflicting goals! So it's not obvious at all which block
> product strategy to take. With other algorithms, another issue will be
> load balancing. All this makes me doubt that efficient parallelization
> can be obtained just by OpenMP-ifying some for loops!

BLAS3 is indeed tricky. Since most CPUs have both private and shared
caches (i.e. L1 is private per core while L2/L3 are shared), we don't
want too much inter-core cache coherency traffic either. We're going
to need some serious profiling to get the load balancing right.

> So as I see it the work for parallelizing algorithms splits into 2 phases,
> 1) change the algorithm itself to make it well parallelizable
> 2) implement using e.g. OpenMP (or that could also be another thing if
> e.g. we want to leverage GPUs)

GPUs too? Seriously? Then you've got to go the OpenCL route, and then
parallelism is even more fun. GPUs want nice sizes (multiples of
16/32), there's multi-GPU parallelism to think about, and you
seriously want zero copies across the PCI bus.

> A good candidate for parallelization would be Block Cholesky (LLt) or
> Block LU (with partial pivoting). (both of these are on my todo but
> i'm lagging, so feel free to not wait for me ;) ). More generally, any
> algorithm working per-blocks.

Actually I am interested in block Cholesky too, but I am lagging a bit
as well... :(

> That's typical of scientific applications. You're not the only one to
> think like that. Then at the other end of the spectrum there are many
> people who consider compilation times to be very important for
> productivity, and there are also large free software projects *cough*
> KDE *cough* that take hours to compile already and where developers
> have to recompile the whole thing often...

Ahh... is there a compile flag which lets GCC take its own sweet time
to compile?

Rohit Garg

Senior Undergraduate
Department of Physics
Indian Institute of Technology
