Re: [eigen] FLENS C++ expression template Library has excellent documentation
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] FLENS C++ expression template Library has excellent documentation
- From: Rohit Garg <rpg.314@xxxxxxxxx>
- Date: Sat, 18 Apr 2009 11:10:48 +0530
On Sat, Apr 18, 2009 at 10:53 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> 2009/4/18 Rohit Garg <rpg.314@xxxxxxxxx>:
>>> So, in which area does Intel MKL still have a long-term lead? I would
>>> say parallelization. We haven't started that yet and it is probably a
>>> very, very tough one. It's what I have in mind when I say that a
>>> BLAS/LAPACK wrapper is still welcome.
>> Why do you think parallelization is very difficult? Do you mean the
>> parallelization infrastructure? AFAICS, using OpenMP would be cool:
>> let the compiler handle all the dirty business. This is something I
>> want to explore (time availability is of course important!), so I
>> would like a heads-up.
> I'm very ignorant of these matters. If parallelization can be done in
> a generic way, then yes I understand that it's just a matter of
> OpenMP-ifying for loops. Although even so, there remain issues to
> sort: what if the user application doesn't want Eigen to launch more
> than N threads, because it is already launching a lot of threads of
> its own? OpenMP 2 didn't seem to help much in that situation; maybe 3
> is better.
I was thinking of something like -DEIGEN_PARALLEL=4 (i.e. opt-in
parallelization) at compile time, to launch that many threads. BLAS 1
should be trivial to parallelize, and BLAS 2 shouldn't be too
difficult either.
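A minimal sketch of what that opt-in could look like for a BLAS-1 kernel. EIGEN_PARALLEL is the hypothetical macro proposed above, not an existing Eigen flag, and the loop uses plain std::vector rather than Eigen types:

```cpp
// Sketch: a BLAS-1 axpy (y += alpha*x) parallelized with OpenMP,
// with a compile-time opt-in thread cap. Without -fopenmp the pragma
// is ignored and the code runs serially, so the opt-in is free.
#include <cassert>
#include <cstddef>
#include <vector>

#ifndef EIGEN_PARALLEL
#define EIGEN_PARALLEL 4  // hypothetical compile-time thread cap
#endif

void axpy(double alpha, const std::vector<double>& x, std::vector<double>& y)
{
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    // Each thread gets a contiguous chunk of the vectors; iterations
    // are fully independent, so no synchronization is needed.
    #pragma omp parallel for num_threads(EIGEN_PARALLEL)
    for (std::ptrdiff_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```

The num_threads clause also answers the "user app doesn't want more than N threads" concern: the cap is fixed where the library is built.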
> But can efficient parallelization really be done in a generic way? It
> seems to me that different algorithms may require different strategies
> for parallelization. For example, look at matrix product.
> Parallelizing it will probably mean doing the product by blocks, each
> thread computing a different block. But there are many different ways
> of splitting the product into blocks. Which one is most efficient? One
> wants to minimize thread interdependency, but also to minimize memory
> accesses... conflicting goals! So it's not obvious at all which block
> product strategy to take. With other algorithms, another issue will be
> load balancing. All this makes me doubt that efficient parallelization
> can be obtained just by OpenMP-ifying some for loops!
BLAS 3 is indeed tricky. Since most CPUs have both private and shared
caches, i.e. L1 is private per core while L2/L3 are shared, we don't
want too much inter-core cache-coherency traffic either. We're going
to need some serious profiling to get the load balancing right.
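To make the block-product idea concrete, here's a hedged sketch of a block-partitioned C = A*B where each OpenMP thread owns distinct output blocks, so cores never write to each other's cache lines. Flat row-major arrays stand in for Eigen matrices, and the block size bs is an assumption to be tuned by profiling:

```cpp
// Sketch: blocked matrix product with one output block per loop
// iteration. The (ib, jb) output-block loops are parallelized; the
// kb reduction stays serial inside each thread, so there are no
// write races and no inter-core coherency traffic on C.
#include <algorithm>
#include <cassert>
#include <vector>

void block_gemm(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, int n, int bs)
{
    #pragma omp parallel for collapse(2)
    for (int ib = 0; ib < n; ib += bs)
        for (int jb = 0; jb < n; jb += bs)
            for (int kb = 0; kb < n; kb += bs)       // serial reduction
                for (int i = ib; i < std::min(ib + bs, n); ++i)
                    for (int k = kb; k < std::min(kb + bs, n); ++k)
                        for (int j = jb; j < std::min(jb + bs, n); ++j)
                            C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```

The trade-off discussed above shows up directly in bs: small blocks balance load better across threads, big blocks reuse cache lines better.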
> So as I see it the work for parallelizing algorithms splits into 2 phases,
> 1) change the algorithm itself to make it well parallelizable
> 2) implement using e.g. OpenMP (or that could also be another thing if
> e.g. we want to leverage GPUs)
GPUs too? Seriously? Then you've got to go the OpenCL route, and then
parallelism is even more fun. GPUs want nice sizes (multiples of
16/32), and what about multi-GPU parallelism? And you seriously want
zero copies across the PCI bus.
> A good candidate for parallelization would be Block Cholesky (LLt) or
> Block LU (with partial pivoting). (both of these are on my todo but
> i'm lagging, so feel free to not wait for me ;) ). More generally, any
> algorithm working per-blocks.
Actually I am interested in block Cholesky too, but I am lagging a
bit as well... :(
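For reference, a hedged sketch of a right-looking LLt factorization. The trailing-matrix update is the O(n^3) part that a blocked version would split among threads; here it is OpenMP-ified directly as an illustration, and this is not Eigen's actual algorithm:

```cpp
// Sketch: right-looking Cholesky A = L*L^T on an n x n row-major
// matrix (lower triangle used, overwritten with L). The trailing
// update's columns are independent, so that loop parallelizes cleanly.
#include <cassert>
#include <cmath>
#include <vector>

void cholesky(std::vector<double>& A, int n)
{
    for (int k = 0; k < n; ++k) {
        A[k*n + k] = std::sqrt(A[k*n + k]);          // factor diagonal
        for (int i = k + 1; i < n; ++i)
            A[i*n + k] /= A[k*n + k];                // scale the panel
        #pragma omp parallel for                     // independent columns
        for (int j = k + 1; j < n; ++j)
            for (int i = j; i < n; ++i)
                A[i*n + j] -= A[i*n + k] * A[j*n + k];  // trailing update
    }
}
```

A blocked version replaces scalars with bs x bs blocks: a small serial Cholesky on the diagonal block, a triangular solve for the panel, then the same parallel trailing update done block-wise.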
> That's typical of scientific applications. You're not the only one to
> think like that. Then at the other end of the spectrum there are many
> people who consider compilation times to be very important for
> productivity, and there are also large free software projects *cough*
> KDE *cough* that take hours to compile already and where developers
> have to recompile the whole thing often...
Ahh... is there a compile flag which lets GCC take its own sweet time?
Department of Physics
Indian Institute of Technology