Re: [eigen] a branch for SMP (openmp) experimentations


Very nice work! It is also beyond my skills for now, but this Intel paper talks
a bit about optimizing the OpenMP "parallel for" pragma:

It may help somehow.


On Mon, Feb 22, 2010 at 11:28 AM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:


I have just created a fork there:

to play with SMP support, and more precisely, with OpenMP.

Currently only the general matrix-matrix product is parallelized. I've implemented a general 1D parallelizer to factor out the parallelization code. It is defined there:


and used at the end of this file: Eigen/src/Core/products/GeneralMatrixMatrix.h

In the bench/ folder there are two bench_gemm*.cpp files to try it and compare to BLAS.

On my Core 2 Duo, I've observed a speedup of about 1.9x for relatively small matrices.

At work I have an "Intel(R) Core(TM)2 Quad CPU    Q9400  @ 2.66GHz", but it is currently too busy to run really meaningful experiments. Nevertheless, it seems that GotoBLAS, which uses pthreads directly, reports more consistent speedups. So perhaps OpenMP is trying to do some overly smart scheduling, and it might be worth dealing with pthreads directly?

For the lazy but interested reader, the interesting piece of code is there:

  int threads = omp_get_num_procs();
  // round up so the blocks cover every index, even when
  // size is not a multiple of the thread count
  int blockSize = (size + threads - 1) / threads;
  #pragma omp parallel for schedule(static,1)
  for(int i=0; i<threads; ++i)
  {
    int blockStart = i*blockSize;
    // the last block may be smaller than the others
    int actualBlockSize = std::min(blockSize, size - blockStart);
    func(blockStart, actualBlockSize);
  }

Feel free to play with it and have fun!


PS: to Aron, yesterday the main problem was that our bench timer reported the total execution time, not the "real" one... the other was that my system was a bit busy because of daemons and the like...
