<gael.guennebaud@xxxxxxxxx> wrote:
>
> ok so to summarize, so far the parallel strategy duplicated the packing N
> times, where N is the number of threads. To overcome this shortcoming, I've
> implemented a new (still simple) strategy:
>
> - create a memory buffer B' that will be read/written by all threads
> - split A into A1...AN horizontal panels (do the same for C)
> - parallel loop:
>     compute Ci = Ai * B using the shared B'
>
> Let's recall that B' is used to pack a horizontal panel B_k of B. Then, to
> avoid packing the same data N times into B', B_k and B' are subdivided into
> N blocks, each thread packs one block, and a #pragma omp barrier syncs all
> threads.
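>
> To make this concrete, here is a rough sketch of the pattern (the helper
> names packBChunk/gebpKernel are made up, this is not the actual code):
>
>   #include <omp.h>
>
>   // Bp is the shared packed buffer B', allocated once before the loop
>   void sharedPackedProduct(float* C, const float* A, const float* B,
>                            float* Bp, int size, int panel, int N)
>   {
>     #pragma omp parallel num_threads(N)
>     {
>       int tid = omp_get_thread_num();
>       for(int k=0; k<size; k+=panel)
>       {
>         // each thread packs its 1/N-th share of the panel B_k into Bp
>         packBChunk(Bp, B, k, panel, tid, N, size);
>         #pragma omp barrier  // wait until the whole B_k has been packed
>         // Ci += Ai * B_k using the shared Bp (Ai/Ci owned by thread tid)
>         gebpKernel(C, A, Bp, k, panel, tid, N, size);
>         #pragma omp barrier  // don't repack Bp while others still read it
>       }
>     }
>   }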
>
> Of course, because of the L1 cache, the problem is that when thread 1 reads
> the data of B' written by thread 0, we cannot guarantee that it reads the
> correct data. So does anybody know how I can enforce that? I know that's a
> newbie question, but a newbie in this topic I am ;)
>
> I also know about the omp flush directive, but it is useless here since I'm
> working on a whole buffer.
>
> There is also something very strange: if I change the code so that all
> threads pack exactly the same B_k into the same shared B' and keep the
> barrier, then I still don't get a correct result... (if each thread has its
> own B', then it's fine)
>
> cheers,
>
> gael.
>
>
>
>
> On Thu, Feb 25, 2010 at 1:11 PM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:
>>
>>
>> On Thu, Feb 25, 2010 at 12:47 PM, Aron Ahmadia <aja2111@xxxxxxxxxxxx> wrote:
>>>
>>> Hi Gael,
>>>
>>> I think I follow what you did, but just to summarize...
>>>
>>> > for each horizontal panel B_k of B of size 256x2048 do
>>> >   copy/pack B_k to B'
>>> >   let A_k be the corresponding 2048x256 vertical panel of A
>>> >   for each block A_ik of A_k of size 512x256 do
>>> >     copy/pack A_ji to A'
>>> >     compute C_i += A' * B' using an optimized kernel
>>> >   endfor
>>> > endfor
>>>
>>> Do you mean copy A_ik to A' here instead of A_ji?
>>
>> arf yes, sorry.
>>
>>>
>>> > #cores:   1    2    4
>>> > Eigen:   19   35   60
>>>
>>> Those are some very impressive numbers, I wouldn't be kicking myself too
>>> hard.
>>>
>>> > An important result is that, unlike what I thought, *it is not always
>>> > better to parallelize at the highest level*. For instance, if a user has
>>> > to do 4 large matrix products, it is better to do them sequentially and
>>> > let the underlying products manage the parallelization! So this is a
>>> > strong argument to enable good parallelization in Eigen, as it cannot
>>> > optimally be done at the user level.
>>>
>>> I agree strongly with this statement. Of course I think this might
>>> mean I need to roll my sleeves up and start helping :)
>>
>> hehe!
>>
>>>
>>> > To this end, I replaced the copies/packings to B' and A' by calls to
>>> > memset on B' and A' respectively => then I got a perfect x4, meaning
>>> > that the optimized kernel is already perfect and that the performance
>>> > penalty is entirely due to the (non sequential) reads from B_j and A_ij.
>>> > I tried to disable only one of the two and I confirm that both are
>>> > equally responsible for this slowdown. (With a single thread they
>>> > represent about 5% of the computation.)
>>>
>>> Again, your notation for the sub-blocks of A is slightly confusing; I
>>> wish I had taken more notes on your approach. I'll review Goto's
>>> paper.
>>
>> typo again. If you prefer, I meant B_k and A_ik, where the indices are
>> sub-block indices, not coeff indices, but I know you got that right ;)
>>
>>>
>>> If I understand correctly, your scaling performance is suffering due
>>> to the increased time taken to copy into the 'optimal panels', not due
>>> to computation. If we can improve the efficiency of preparing the
>>> panels, we'll have solved the scalability problem. Of course, this is
>>> easier said than done.
>>>
>>> > C1 = A * B1
>>> > C2 = A * B2
>>> >
>>> > is not very good because A is entirely shared between the two threads,
>>> > and so the packing of the A_ij is done twice => *it is pretty clear
>>> > that the threads have to talk* :(
>>>
>>> Actually, this happens in the 2x2 case as well:
>>>
>>> > C11 = A1*B1
>>> > C12 = A1*B2
>>> > C21 = A2*B1
>>> > C22 = A2*B2
>>>
>>> Each thread needs one of A1 or A2, and one of B1 or B2. This implies
>>> that each of the sub-matrices from A and B gets packed twice. I agree
>>> that we have to be careful about avoiding redundant copies when we
>>> can.
>>
>> arf! I feel so stupid... somehow I got confused by all the experiments I
>> made. Anyway, that's good news, because at least now we know what the
>> problem is :)
>>
>>>
>>> What CPU architectures are you performing your tests on? I have
>>> access to two 8-core Intel architectures here (X5550 and E5420), and
>>> the 4-core IBM BlueGene architecture may have a chance at working with
>>> Eigen once we get XLC upgraded. Also, are you still using OpenMP or
>>> do you want to consider other options (pthreads, Boost.threads)? Is
>>> your code pushed to the branch?
>>
>> Quad core2 (Penryn). I have to check, but I'm pretty sure that around me
>> there are some 8-core machines. Regarding the underlying threading API, I'd
>> say that as long as OpenMP fulfills our needs we should stick to it, but I
>> fear it won't be flexible enough to allow the threads to talk well and
>> quickly. Well, we'll see.
>>
>>>
>>> My first thoughts are on getting set up with benchmarks on both
>>> machines so that I can verify performance, as I suspect things will
>>> vary between the X5550 and the E5420. I'm most concerned with new
>>> architectures such as Nehalem (i.e., the X5550), so I will probably be
>>> tuning and thinking in that direction.
>>>
>>> I also plan to stick with GNU compilers for now, as KAUST's setup for
>>> the Intel compilers is a bit sticky at the moment.
>>
>> Here gcc 4.4 and icc perform equally for this particular case.
>>
>>>
>>> Regards,
>>> Aron
>>>
>>> On Thu, Feb 25, 2010 at 2:00 PM, Gael Guennebaud
>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>> >
>>> > Ok, after some experiments, it seems that the naive splitting strategy
>>> > I've implemented does not scale very well:
>>> >
>>> > product sizes (rows,cols,depth): 2048x2048x2048
>>> >
>>> > GFLOPS:
>>> >
>>> > #cores:   1    2    4
>>> > Eigen:   19   35   60
>>> > GOTO:    20   39   75
>>> >
>>> > So let me first start with a brief review of the way the product is
>>> > implemented. Let's consider C = A * B, and to simplify, I'll instantiate
>>> > the product for matrices of size 2048x2048:
>>> >
>>> > for each horizontal panel B_k of B of size 256x2048 do
>>> >   copy/pack B_k to B'
>>> >   let A_k be the corresponding 2048x256 vertical panel of A
>>> >   for each block A_ik of A_k of size 512x256 do
>>> >     copy/pack A_ji to A'
>>> >     compute C_i += A' * B' using an optimized kernel
>>> >   endfor
>>> > endfor
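>>> >
>>> > In rough C++ form (the pack/kernel helpers below are just placeholders,
>>> > not the actual functions), that loop structure is:
>>> >
>>> >   const int kc = 256;   // depth of a B panel
>>> >   const int mc = 512;   // rows of an A block
>>> >   for(int k=0; k<size; k+=kc)           // horizontal panel B_k of B
>>> >   {
>>> >     packB(Bp, B, k, kc, size);          // copy/pack B_k into B'
>>> >     for(int i=0; i<size; i+=mc)         // block A_ik of the panel A_k
>>> >     {
>>> >       packA(Ap, A, i, mc, k, kc);       // copy/pack A_ik into A'
>>> >       gebp(C, i, Ap, Bp, mc, kc, size); // C_i += A' * B' (optimized kernel)
>>> >     }
>>> >   }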
>>> >
>>> > For a quad core, the actual parallelization strategy is:
>>> > - split B into B1 and B2 each of size 2048x1024
>>> > - split A into A1 and A2 each of size 1024x2048
>>> > - split C into C11 .. C22 each of size 1024x1024
>>> > - do the 4 non-dependent matrix products in parallel:
>>> > C11 = A1*B1
>>> > C12 = A1*B2
>>> > C21 = A2*B1
>>> > C22 = A2*B2
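>>> >
>>> > With OpenMP, this amounts to something like the following sketch
>>> > (assuming C11..C22, A1, A2, B1, B2 are ordinary Eigen blocks/expressions;
>>> > this is not the actual code):
>>> >
>>> >   // the four quarter products are fully independent, so one section
>>> >   // per thread on a quad core
>>> >   #pragma omp parallel sections num_threads(4)
>>> >   {
>>> >     #pragma omp section
>>> >     { C11 = A1*B1; }
>>> >     #pragma omp section
>>> >     { C12 = A1*B2; }
>>> >     #pragma omp section
>>> >     { C21 = A2*B1; }
>>> >     #pragma omp section
>>> >     { C22 = A2*B2; }
>>> >   }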
>>> >
>>> > Now if I use this strategy with GOTO (i.e., I instantiate 4 different
>>> > products in parallel using OpenMP and I make sure that GOTO does not try
>>> > to parallelize), then I get really bad performance: 35 Gflops (while
>>> > Eigen does the same at 60 Gflops). On the other hand, if I do these 4
>>> > products sequentially and let GOTO parallelize (4 threads), then GOTO
>>> > yields 70 Gflops!
>>> >
>>> > An important result is that, unlike what I thought, *it is not always
>>> > better to parallelize at the highest level*. For instance, if a user has
>>> > to do 4 large matrix products, it is better to do them sequentially and
>>> > let the underlying products manage the parallelization! So this is a
>>> > strong argument to enable good parallelization in Eigen, as it cannot
>>> > optimally be done at the user level.
>>> >
>>> > Then I tried to understand why this "naive" parallelization strategy
>>> > does not scale very well. Indeed, in the case of 4 threads, all the data
>>> > are fully independent and it is equivalent to doing 4 smaller products
>>> > in parallel. So what?
>>> >
>>> > To this end, I replaced the copies/packings to B' and A' by calls to
>>> > memset on B' and A' respectively => then I got a perfect x4, meaning
>>> > that the optimized kernel is already perfect and that the performance
>>> > penalty is entirely due to the (non sequential) reads from B_j and A_ij.
>>> > I tried to disable only one of the two and I confirm that both are
>>> > equally responsible for this slowdown. (With a single thread they
>>> > represent about 5% of the computation.)
>>> >
>>> > If I replace my copy/pack function calls with GOTO's (hand-written
>>> > assembly with more aggressive unrolling and prefetching), then I get a
>>> > slight speedup, but not that much. So we really have to work at a higher
>>> > level, i.e., on how to better distribute the work across the different
>>> > threads. But for now, I'm clueless...
>>> >
>>> > One final remark though: with 4 threads, the naive splitting strategy
>>> > did not sound too bad because all the submatrices involved are 100%
>>> > independent. On the other hand, with only 2 threads, it is pretty clear
>>> > that simply splitting the product into two, e.g.:
>>> >
>>> > C1 = A * B1
>>> > C2 = A * B2
>>> >
>>> > is not very good because A is entirely shared between the two threads,
>>> > and so the packing of the A_ij is done twice => *it is pretty clear
>>> > that the threads have to talk* :(
>>> >
>>> >
>>> > that's all for now.
>>> >
>>> >
>>> > gael
>>> >
>>> >
>>> > On Tue, Feb 23, 2010 at 12:34 PM, Aron Ahmadia <aja2111@xxxxxxxxxxxx>
>>> > wrote:
>>> >>
>>> >> Oh very cool! I am going to start playing with this very soon. I
>>> >> want to look into getting some automatic numbers up from each build so
>>> >> that it's easy to see how the code is performing across several
>>> >> different HPC architectures (BlueGene/P, Nehalem, Clovertown). The
>>> >> bench code should be what I'm looking for.
>>> >>
>>> >> I'm curious about the difference between OpenMP and pthreads. I know
>>> >> there is a bit of overhead in OpenMP but I didn't realize it would
>>> >> be so meaningful. This will of course vary between architectures and
>>> >> compilers, but I will query the IBM researchers (Gunnels is more of an
>>> >> expert with both pthreads and OpenMP and worked in van de Geijn's
>>> >> group, so he may have more direct thoughts).
>>> >>
>>> >> Warm Regards,
>>> >> Aron
>>> >>
>>> >> On Mon, Feb 22, 2010 at 11:43 AM, Hauke Heibel
>>> >> <hauke.heibel@xxxxxxxxxxxxxx> wrote:
>>> >> > Works like a charm - see attachment.
>>> >> >
>>> >> > - Hauke
>>> >> >
>>> >> > On Mon, Feb 22, 2010 at 11:28 AM, Gael Guennebaud
>>> >> > <gael.guennebaud@xxxxxxxxx> wrote:
>>> >> >>
>>> >> >> Hi,
>>> >> >>
>>> >> >> I have just created a fork there:
>>> >> >>
>>> >> >> http://bitbucket.org/ggael/eigen-smp
>>> >> >>
>>> >> >> to play with SMP support, and more precisely, with OpenMP.
>>> >> >>
>>> >> >> Currently only the general matrix-matrix product is parallelized.
>>> >> >> I've implemented a general 1D parallelizer to factor out the
>>> >> >> parallelization code. It is defined there:
>>> >> >>
>>> >> >> Eigen/src/Core/products/Parallelizer.h
>>> >> >>
>>> >> >> and used at the end of this file:
>>> >> >> Eigen/src/Core/products/GeneralMatrixMatrix.h
>>> >> >>
>>> >> >> In the bench/ folder there are two bench_gemm*.cpp files to try it
>>> >> >> and compare to BLAS.
>>> >> >>
>>> >> >> On my core2 duo, I've observed a speed-up of around 1.9 for
>>> >> >> relatively small matrices.
>>> >> >>
>>> >> >> At work I have an "Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz", but
>>> >> >> it is currently too busy to do really meaningful experiments.
>>> >> >> Nevertheless, it seems that gotoblas, which directly uses pthreads,
>>> >> >> reports more consistent speedups. So perhaps OpenMP is trying to do
>>> >> >> some overly smart scheduling, and it might be useful to deal with
>>> >> >> pthreads directly?
>>> >> >>
>>> >> >> For the lazy but interested reader, the interesting piece of code is
>>> >> >> there:
>>> >> >>
>>> >> >> int threads = omp_get_num_procs();
>>> >> >> int blockSize = size / threads;
>>> >> >> #pragma omp parallel for schedule(static,1)
>>> >> >> for(int i=0; i<threads; ++i)
>>> >> >> {
>>> >> >>   int blockStart = i*blockSize;
>>> >> >>   // let the last thread pick up the remainder when size is not a
>>> >> >>   // multiple of the number of threads
>>> >> >>   int actualBlockSize = (i+1==threads) ? size-blockStart : blockSize;
>>> >> >>
>>> >> >>   func(blockStart, actualBlockSize);
>>> >> >> }
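>>> >> >>
>>> >> >> Here `func` can be any callable taking a start index and a length. A
>>> >> >> purely hypothetical example (not the actual Eigen functor) would be:
>>> >> >>
>>> >> >>   struct ScaleRange {
>>> >> >>     double* data;
>>> >> >>     double factor;
>>> >> >>     // scale the contiguous range [blockStart, blockStart+blockSize)
>>> >> >>     void operator()(int blockStart, int blockSize) const {
>>> >> >>       for(int j=blockStart; j<blockStart+blockSize; ++j)
>>> >> >>         data[j] *= factor;
>>> >> >>     }
>>> >> >>   };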
>>> >> >>
>>> >> >> feel free to play with it and have fun!
>>> >> >>
>>> >> >> Gael.
>>> >> >>
>>> >> >> PS to Aron: yesterday the main problem was that our benchtimer
>>> >> >> reported the total execution time, not the "real" one... the other
>>> >> >> was that my system was a bit busy because of daemons and stuff like
>>> >> >> that...
>>> >> >>
>>> >> >
>>> >>
>>> >>
>>> >
>>> >
>>>
>>>
>>
>
>