On Mon, Feb 22, 2010 at 11:43 AM, Hauke Heibel
<hauke.heibel@xxxxxxxxxxxxxx> wrote:
> Works like a charm - see attachment.
>
> - Hauke
>
> On Mon, Feb 22, 2010 at 11:28 AM, Gael Guennebaud
> <gael.guennebaud@xxxxxxxxxx> wrote:
>>
>> Hi,
>>
>> I have just created a fork here:
>>
>> http://bitbucket.org/ggael/eigen-smp
>>
>> to play with SMP support, and more precisely, with OpenMP.
>>
>> Currently only the general matrix-matrix product is parallelized. I've
>> implemented a general 1D parallelizer to factor out the parallelization
>> code. It is defined here:
>>
>> Eigen/src/Core/products/Parallelizer.h
>>
>> and used at the end of this file:
>> Eigen/src/Core/products/GeneralMatrixMatrix.h
>>
>> In the bench/ folder there are two bench_gemm*.cpp files to try it and
>> compare to BLAS.
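>>
>> (To build them, something along the lines of
>>   g++ -O2 -fopenmp -I/path/to/eigen-smp bench/bench_gemm<...>.cpp
>> should do; the include path is just an example, and you need the usual
>> flags to link your BLAS for the comparison. Without -fopenmp the omp
>> pragmas are simply ignored and you get the sequential version.)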
>>
>> On my Core 2 Duo, I've observed a speedup of around 1.9 for relatively
>> small matrices.
>>
>> At work I have an "Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz" but it
>> is currently too busy to run really meaningful experiments. Nevertheless, it
>> seems that GotoBLAS, which uses pthreads directly, reports more consistent
>> speedups. So perhaps OpenMP is trying to do some overly smart scheduling,
>> and it might be useful to deal with pthreads directly?
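>>
>> For instance, the same one-block-per-core split could be written with raw
>> pthreads along these lines (an untested sketch with made-up names, not
>> what GotoBLAS or Eigen actually do; link with -lpthread):
>>
>> #include <pthread.h>
>> #include <algorithm>
>> #include <vector>
>>
>> typedef void (*Kernel)(int start, int length);
>>
>> // one descriptor per block of the 1D range [0,size)
>> struct BlockTask { Kernel func; int start; int length; };
>>
>> static void* runBlock(void* arg)
>> {
>>   BlockTask* task = static_cast<BlockTask*>(arg);
>>   task->func(task->start, task->length);
>>   return 0;
>> }
>>
>> void parallel_run_1d(Kernel func, int size, int threads)
>> {
>>   std::vector<pthread_t> workers(threads);
>>   std::vector<BlockTask> tasks(threads);
>>   int blockSize = (size+threads-1)/threads;
>>   int spawned = 0;
>>   for(int i=0; i<threads; ++i)
>>   {
>>     tasks[i].func   = func;
>>     tasks[i].start  = i*blockSize;
>>     tasks[i].length = std::min(blockSize, size - tasks[i].start);
>>     if(tasks[i].length <= 0)
>>       break; // fewer blocks than threads
>>     pthread_create(&workers[i], 0, runBlock, &tasks[i]);
>>     ++spawned;
>>   }
>>   // unlike an omp parallel for, there is no implicit barrier, so join
>>   for(int i=0; i<spawned; ++i)
>>     pthread_join(workers[i], 0);
>> }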
>>
>> For the lazy but interested reader, the interesting piece of code is here:
>>
>> int threads = omp_get_num_procs();
>> // round up, otherwise the last size % threads elements would never be
>> // processed when size is not a multiple of threads
>> int blockSize = (size+threads-1) / threads;
>> #pragma omp parallel for schedule(static,1)
>> for(int i=0; i<threads; ++i)
>> {
>>   int blockStart = i*blockSize;
>>   // the last block may be smaller than blockSize (or even empty)
>>   int actualBlockSize = std::min(blockSize, size - blockStart);
>>   if(actualBlockSize > 0)
>>     func(blockStart, actualBlockSize);
>> }
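>>
>> And to make it concrete, here is a tiny standalone toy using exactly that
>> pattern, with a trivial kernel in place of the product (compile with
>> -fopenmp):
>>
>> #include <algorithm>
>> #include <cstdio>
>> #include <omp.h>
>>
>> int main()
>> {
>>   const int size = 1000;
>>   double data[size];
>>   for(int k=0; k<size; ++k) data[k] = 1.0;
>>
>>   int threads = omp_get_num_procs();
>>   int blockSize = (size+threads-1)/threads;
>>   #pragma omp parallel for schedule(static,1)
>>   for(int i=0; i<threads; ++i)
>>   {
>>     int blockStart = i*blockSize;
>>     int actualBlockSize = std::min(blockSize, size - blockStart);
>>     // the "func" of the snippet above: here, just scale one block in place
>>     for(int k=blockStart; k<blockStart+actualBlockSize; ++k)
>>       data[k] *= 2.0;
>>   }
>>   std::printf("data[0]=%g data[%d]=%g\n", data[0], size-1, data[size-1]);
>>   return 0;
>> }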
>>
>> Feel free to play with it and have fun!
>>
>> Gael.
>>
>> PS: to Aron, yesterday the main problem was that our BenchTimer reported
>> the total execution time, not the "real" one... the other issue was that my
>> system was a bit busy because of daemons and stuff like that...
>>
>