Re: [eigen] a branch for SMP (openmp) experimentations

[ Thread Index | Date Index | More Archives ]

Ok, after some experiments it seems that the naive splitting strategy I've implemented does not scale very well:

product sizes (rows,cols,depth): 2048x2048x2048


#cores:   1     2     4
Eigen:    19   35   60
GOTO:    20   39   75

So first let me start by briefly review of the way the product is implemented. Let's consider C = A * B, and to simplify I'll instantiate the product for matrices of size 2048x2048:

for each horizontal panel B_k of B of size 256x2048 do
  copy/pack B_k to B'
  let A_k be the corresponding 2048x256 vertical panel of A
  for each block A_ik of A_k of size 512x256 do
    copy/pack A_ji to A'
    compute C_i += A' * B' using an optimized kernel

For a quad cores, the actual parallelization strategy is:
- split B into B1 and B2 each of size 2048x1024
- split A into A1 and A2 each of size 1024x2048
- split C into C11 .. C22 each of size 1024x1024
- do the 4 non-dependent matrix products in parallel:
  C11 = A1*B1
  C12 = A1*B2
  C21 = A2*B1
  C22 = A2*B2

Now if I use this strategy with GOTO (i.e., I instantiate 4 different products in parallel using OpenMP and I make sure that GOTO does not try to parallelize),  then I get really bad performance: 35Gflops (while Eigen does the same at 60Gflops). On the other hand, if I do these 4 products sequentially, and let GOTO parallelize (4threads), then GOTO yields 70Gflops !

An important result is that unlike what I thought, *it is not always better to parallelize at the highest level*. For instance, if a user has to do 4 large matrix products, it is better to do them sequentially and let the underlying products manage the parallelization ! So this is a strong argument to enable good vectorization in Eigen as it cannot optimaly be done at the user level.

Then I tried to understand why this "naive" parallelization strategy does not scale very well. Indeed, in the case of 4 threads, all the data are fully independent and it is equalivalent to do 4 smaller products in parallel. So what ?

To this end, I replaced the copies/packings to B' and A' by calls to memset on B' and A' respectiveley => then I got a perfect x4 meaning the optimized kernel is already perfect and that the performance penalty is entirely due to the (non sequential) reads from B_j and A_ij. I tried to disable only one of the two and I confirm that both are equally responsible for this slowdown. (With a single thread they represent about 5% of the computation.)

If I replace my copy/pack function calls to the GOTO's ones (hand written assembly with more aggressive unrolling and prefetching), then I get a slight speedup but not that much. So we really have to work at a higher level, i.e., how to better distribute the works across the different threads ? But now, I'm clueless...

One final remark though: With 4 threads, the naive splitting strategy sounded really not too bad because all submatrices involved are 100% independent. On the other hand, with only 2 threads, it is pretty clear that simply splitting the product into two, e.g.:

C1 = A * B1
C2 = A * B2

is not very good because A is entirely shared between the two threads and so the packing of the A_ij is done twice => *it is pretty clear the the threads have to talk* :(

that's all for now.


On Tue, Feb 23, 2010 at 12:34 PM, Aron Ahmadia <aja2111@xxxxxxxxxxxx> wrote:
Oh very cool!  I am going to start playing with this very soon.  I
want to look into getting some automatic numbers up from each build so
that it's easy to see how the code is performing across several
different HPC architectures (BlueGene/P, Nehalem, Clovertown).  The
bench code should be what I'm looking for.

I'm curious about the difference between OpenMP and pthreads.  I know
there is a bit of overthread in OpenMP but I didn't realize it would
be so meaningful.  This will of course vary between architectures and
compilers, but I will query the IBM Researchers (Gunnels is more of an
expert with both pthreads and OpenMP and worked in van de Geijn's
group so may have more direct thoughts).

Warm Regards,

On Mon, Feb 22, 2010 at 11:43 AM, Hauke Heibel
<hauke.heibel@xxxxxxxxxxxxxx> wrote:
> Works like a charm - see attachement.
> - Hauke
> On Mon, Feb 22, 2010 at 11:28 AM, Gael Guennebaud
> <gael.guennebaud@xxxxxxxxxx> wrote:
>> Hi,
>> I have just created a fork there:
>> to play with SMP support, and more precisely, with OpenMP.
>> Currently only the general matrix-matrix product is parallelized. I've
>> implemented a general 1D parallelizer to factor the parallelization code. It
>> is defined there:
>> Eigen/src/Core/products/Parallelizer.h
>> and used at the end of this file:
>> Eigen/src/Core/products/GeneralMatrixMatrix.h
>> In the bench/ folder there are two bench_gemm*.cpp files to try it and
>> compare to BLAS.
>> On my core2 duo, I've observed a speed up around 1.9 for relatively small
>> matrices.
>> At work I have an "Intel(R) Core(TM)2 Quad CPU    Q9400  @ 2.66GHz" but it
>> is currently too busy to do really meaningfull experiments. Nevertheless, it
>> seems that gotoblas, which is directly using pthread, reports more
>> consistent speedups. So perhaps OpenMP is trying to do some too smart
>> scheduling and it might be useful to directly deal with pthread?
>> For the lazy but interested reader, the interesting piece of code is there:
>>   int threads = omp_get_num_procs();
>>   int blockSize = size / threads;
>>   #pragma omp parallel for schedule(static,1)
>>   for(int i=0; i<threads; ++i)
>>   {
>>     int blockStart = i*blockSize;
>>     int actualBlockSize = std::min(blockSize, size - blockStart);
>>     func(blockStart, actualBlockSize);
>>   }
>> feelfree to play with it and have fun!
>> Gael.
>> PS: to Aron, yesterday the main pb was that our benchtimer reported the
>> total execution time, not the "real" one... the other was that my system was
>> a bit buzy because of daemons, and stuff like that...

Mail converted by MHonArc 2.6.19+