On Fri, Feb 26, 2010 at 10:44 AM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
There is also something very strange: if I change the code so that all threads pack exactly the same B_k into the same shared B' and keep the barrier, I still don't get a correct result... (if each thread has its own B', then it's fine)
Arf, I'm too used to GPU computing, where all threads of a warp follow the same execution path. I now realize that even though all threads have exactly the same amount of work to do, they can be totally desynchronized: different threads reach the barrier while working on different horizontal panels B_k of B! To be more precise, the outermost loop looks like this:
for (k = 0; k < nb_k; ++k)
{
    pack_b(k);
    #pragma omp barrier
    // here some threads have k=0 while others have k=1....
}
I guess that means that packing B is faster than creating a thread, so the first barrier occurs before all threads have been launched! So we really have to be careful about how we synchronize the threads.
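For illustration, here is a minimal sketch of one way a shared B' could be kept consistent. It assumes the k loop runs inside a single parallel region and uses two barriers per panel: one after packing, one after the panel has been consumed. The names panel_product and packed_b, the sizes, and the row-times-vector product are all hypothetical and only stand in for the real kernel; this is not the actual code.

```c
/* Hypothetical sizes, chosen only for this sketch. */
enum { NB_K = 4, KC = 8, ROWS = 16 };

/* The shared B': one packed panel, reused for every k. */
static double packed_b[KC];

/* Computes c[i] = sum over all columns of a[i][col] * b[col],
 * processing b one panel of KC entries at a time.
 * a is ROWS x (NB_K*KC), row-major; b has NB_K*KC entries.
 * If OpenMP is not enabled, the pragmas are ignored and this runs serially. */
void panel_product(const double *a, const double *b, double *c)
{
    for (int i = 0; i < ROWS; ++i)
        c[i] = 0.0;

    #pragma omp parallel
    {
        /* k is declared inside the region, so each thread has its own copy. */
        for (int k = 0; k < NB_K; ++k) {
            /* All threads cooperate to pack panel k into the shared buffer. */
            #pragma omp for nowait
            for (int j = 0; j < KC; ++j)
                packed_b[j] = b[k * KC + j];

            /* Barrier 1: nobody reads packed_b before it is fully packed. */
            #pragma omp barrier

            /* Each thread handles its own rows but reads the whole panel. */
            #pragma omp for nowait
            for (int i = 0; i < ROWS; ++i) {
                double acc = 0.0;
                for (int j = 0; j < KC; ++j)
                    acc += a[i * NB_K * KC + k * KC + j] * packed_b[j];
                c[i] += acc;
            }

            /* Barrier 2: nobody repacks packed_b for k+1 while another
             * thread is still reading panel k. */
            #pragma omp barrier
        }
    }
}
```

With only the first barrier, a fast thread could start overwriting packed_b with panel k+1 while a slower one is still multiplying with panel k, which would be one way to get wrong results from a shared B' but correct ones from per-thread copies.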
gael