Re: [eigen] a branch for SMP (openmp) experimentations



ok, so adding a barrier *before* the packing fixed the issue, even when the packing into B' is distributed across the threads:

for(k=0; k<nb_k; ++k)
{
   #pragma omp barrier   // make sure nobody is still reading the previous B'
   pack B_k
   #pragma omp barrier   // make sure B' is complete before anyone uses it

   // now it's safe :)
}
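
For reference, here is a small self-contained toy showing that pattern (nothing Eigen-specific: the panel size and the fake pack/consume work are just placeholders). Each thread writes its share of the shared B' between the two barriers, and only reads it once the second barrier has been passed:

#include <omp.h>
#include <vector>
#include <cstdio>

int main()
{
  const int nb_k = 4;                      // number of horizontal panels B_k
  const int panel_size = 1024;             // size of one packed panel B'
  std::vector<double> blockB(panel_size);  // B', shared by all threads

  #pragma omp parallel
  {
    const int tid = omp_get_thread_num();
    const int nthreads = omp_get_num_threads();

    for (int k = 0; k < nb_k; ++k)
    {
      // barrier #1: nobody may still be reading the previous B'
      #pragma omp barrier

      // distributed packing: each thread writes its own chunk of B'
      for (int i = tid; i < panel_size; i += nthreads)
        blockB[i] = k * panel_size + i;

      // barrier #2: B' must be complete before anyone uses it
      #pragma omp barrier

      // now it's safe: every thread reads the whole shared panel
      // (stand-in for the gebp kernel on that thread's slice of A and C)
      double sum = 0;
      for (int i = 0; i < panel_size; ++i)
        sum += blockB[i];
      std::printf("thread %d, panel %d, checksum %g\n", tid, k, sum);
    }
  }
  return 0;
}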

So I've committed this new version, and the performance for matrices of size 2048 using 4 cores is as follows (GFlops):

previous strategy: 57.7
new strategy: 62.2
new strategy using GOTO's low-level routines (for the packings and the gebp kernel): 67.4
new strategy without the two barriers (the results are then incorrect): 64.7
new strategy without the two barriers and using GOTO's routines: 68.

gael.



On Fri, Feb 26, 2010 at 11:58 AM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:


On Fri, Feb 26, 2010 at 10:44 AM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:
There is also something very strange: if I change the code so that all threads pack exactly the same B_k into the same shared B' and keep the barrier, then I still don't get a correct result... (if each thread has its own B', then it's fine)

arf, I'm too used to GPU computing, where all the threads of a warp follow the same execution path. Here I realized that even though all threads have to do exactly the same amount of work, they can be totally de-synchronized: the barrier occurs with threads working on different horizontal panels B_k of B! To be more precise, the outermost loop looks like this:

for(k=0; k<nb_k; ++k)
{
   pack_b(k);

   #pragma omp barrier

   // here some threads have k=0 while others have k=1...
}

I guess that means packing B is faster than creating a thread, so the first barrier occurs before all the threads have even been launched! So we really have to be careful about how we synchronize the threads.
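
To make the failure mode concrete, here is a small toy (again just placeholders, not the Eigen code) with only the single barrier after the packing. A fast thread that finishes consuming panel k immediately loops around and starts overwriting the shared B' with panel k+1, while a slower thread may still be reading panel k, hence the wrong results:

#include <omp.h>
#include <vector>
#include <cstdio>

int main()
{
  const int nb_k = 4;
  const int panel_size = 1024;
  std::vector<double> blockB(panel_size);  // shared B'

  #pragma omp parallel
  {
    const int tid = omp_get_thread_num();
    const int nthreads = omp_get_num_threads();

    for (int k = 0; k < nb_k; ++k)
    {
      // distributed packing of panel B_k into the shared buffer;
      // WITHOUT a barrier right before this write, it can clobber data
      // another thread is still reading from the previous iteration
      for (int i = tid; i < panel_size; i += nthreads)
        blockB[i] = k * panel_size + i;

      #pragma omp barrier  // only guarantees B' is complete before reading

      // consume B'; a thread that finishes early goes straight back to
      // the packing of k+1 above, so threads end up at different k
      double expected = 0, sum = 0;
      for (int i = 0; i < panel_size; ++i)
      {
        expected += k * panel_size + i;
        sum += blockB[i];
      }
      if (sum != expected)
        std::printf("thread %d, panel %d: corrupted B'\n", tid, k);
    }
  }
  return 0;
}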

gael


