<gael.guennebaud@xxxxxxxxx> wrote:
>
>
> On Fri, Feb 26, 2010 at 12:41 PM, Gael Guennebaud <gael.guennebaud@xxxxxxxxxx> wrote:
>>
>>
>> On Fri, Feb 26, 2010 at 11:28 AM, Aron Ahmadia <aja2111@xxxxxxxxxxxx> wrote:
>>>
>>> > Of course, because of the L1 cache, the problem is that when thread 1
>>> > reads the data of B' written by thread 0, we cannot guarantee that it
>>> > will read the correct data. So does anybody know how I can enforce that?
>>> > I know that's a newbie question, but newbie in this topic I am ;)
>>>
>>> Okay, this might be a bit tricky, so forgive me if I'm over-complicating
>>> things: can we introduce another subdivision?
>>>
>>> Create the memory buffer B' divided into N vertical panels (N = number of
>>> threads); call these B'1, B'2, ..., B'N.
>>>
>>> Each thread is responsible for packing its own panel B'j. Once all the B'j
>>> panels are packed, dive directly into the multiply Ci = Ai * B', but only
>>> for the section B'j that was packed by that thread, i.e. Ci_j += Ai_j*B'j.
>>> This should give the threads and caches time to synchronize and regain
>>> coherency.
>>
>> This is indeed a nice idea to trigger some computation right after the
>> partial packing of B'.
>>
>>>
>>> Now proceed with the rest of the multiply; it might make sense here to use
>>> the concept of a barrel shift, with each thread incrementing (mod N) the
>>> index j of the B' panel it is multiplying against:
>>>
>>> for k=1:N
>>> Ci_(j+k) += Ai_(j+k)*B'_(j+k)
>>> end
>>>
>>> The packing of B' only happens once, there's no wait (the threads
>>> immediately launch into computing), and the only synchronization would
>>> happen when they leave the parallel region. You could set explicit locks
>>> to absolutely ensure that the threads never read from a memory address
>>> that is still being written to (especially useful for debugging, since
>>> weird caching effects could cause one of the threads to finish far sooner
>>> than it should), but this code should run fairly safely.
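>>>
>>> To make the idea concrete, here is a rough sketch of that barrel shift
>>> (packB() and gebp() are just placeholders standing in for the real packing
>>> routine and the optimized kernel, not the actual Eigen internals):
>>>
>>>   #include <omp.h>
>>>
>>>   void packB(int j);        // pack vertical panel B'_j       (placeholder)
>>>   void gebp(int i, int j);  // C_i_j += A_i_j * B'_j          (placeholder)
>>>
>>>   void barrel_shift_product(int N)   // N = number of threads
>>>   {
>>>     #pragma omp parallel num_threads(N)
>>>     {
>>>       const int j = omp_get_thread_num();
>>>       packB(j);     // pack our own panel...
>>>       gebp(j, j);   // ...and consume it right away: C_j_j += A_j_j * B'_j
>>>
>>>       // barrel shift over the other panels, wrapping around mod N;
>>>       // by now the other threads should have finished their packing
>>>       for(int k=1; k<N; ++k)
>>>         gebp(j, (j+k) % N);
>>>     }
>>>   }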
>>
>> Oh, now this is going to be really tricky... I'll let you play with it :)
>>
>
> ok, this strategy is now implemented and committed. However, for the
> synchronization between the threads I had to use Qt's QAtomicInt class
> because I did not manage to achieve the same using OpenMP only :(
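>
> To give an idea of what that synchronization amounts to, here is a minimal
> sketch using std::atomic for illustration (the committed code uses Qt's
> QAtomicInt instead, and packBlock()/gebp() below are made-up placeholders,
> not the real routines):
>
>   #include <atomic>
>   #include <omp.h>
>
>   void packBlock(int j);    // pack block j of the shared B'   (placeholder)
>   void gebp(int i, int j);  // C_i_j += A_i_j * B'_j           (placeholder)
>
>   void atomic_sync_sketch(int N)   // N = number of threads
>   {
>     std::atomic<int>* ready = new std::atomic<int>[N];
>     for(int j=0; j<N; ++j) ready[j].store(0);   // 0 = not packed yet
>
>     #pragma omp parallel num_threads(N)
>     {
>       const int i = omp_get_thread_num();
>       packBlock(i);
>       ready[i].store(1, std::memory_order_release);  // publish our block
>
>       for(int k=0; k<N; ++k)
>       {
>         const int j = (i+k) % N;
>         // spin until block j of B' has been published by its packer
>         while(ready[j].load(std::memory_order_acquire) == 0) {}
>         gebp(i, j);
>       }
>     }
>     delete[] ready;
>   }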
>
> gael.
>
>
>>
>> Gael.
>>
>>>
>>> I'll take a look at implementing this some time this week (I'm feeling
>>> guilty that you're making so much progress without me!)
>>>
>>> A
>>>
>>> On Fri, Feb 26, 2010 at 12:44 PM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:
>>> >
>>> > ok so to summarize: so far the parallel strategy duplicated the packing
>>> > N times, where N is the number of threads. To overcome this shortcoming,
>>> > I've implemented a new (still simple) strategy:
>>> >
>>> > - create a memory buffer B' that will be read/written by all threads
>>> > - split A into A1...AN horizontal panels (do the same for C)
>>> > - parallel loop:
>>> >     compute Ci = Ai * B using the shared B'
>>> >
>>> > Let's recall that B' is used to pack a horizontal panel B_k of B. Then,
>>> > to avoid packing the same data N times into B', B_k and B' are
>>> > subdivided into N blocks, each thread packs one block, and a #pragma omp
>>> > barrier syncs all the threads.
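>>> >
>>> > In pseudo-C++, one outer step looks roughly like this (a sketch only,
>>> > with packBlockOfBk() and gebp() as placeholders for the actual packing
>>> > routine and the optimized kernel):
>>> >
>>> >   #include <omp.h>
>>> >
>>> >   void packBlockOfBk(int k, int i);  // thread i packs its block of B_k into B'
>>> >   void gebp(int i, int k);           // C_i += A_i_k * B'
>>> >
>>> >   void shared_Bprime_sketch(int numPanels, int N)
>>> >   {
>>> >     #pragma omp parallel num_threads(N)
>>> >     {
>>> >       const int i = omp_get_thread_num();
>>> >       for(int k=0; k<numPanels; ++k)
>>> >       {
>>> >         packBlockOfBk(k, i);  // each thread packs one block of B_k into B'
>>> >         #pragma omp barrier   // wait until the shared B' is fully packed
>>> >         gebp(i, k);           // C_i += A_i_k * B'
>>> >         #pragma omp barrier   // B' must not be repacked while others still read it
>>> >       }
>>> >     }
>>> >   }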
>>> >
>>> > Of course, because of the L1 cache, the problem is that when thread 1
>>> > reads the data of B' written by thread 0, we cannot guarantee that it
>>> > will read the correct data. So does anybody know how I can enforce that?
>>> > I know that's a newbie question, but newbie in this topic I am ;)
>>> >
>>> > I also know about the omp flush directive, but it is useless here since
>>> > I'm working on a whole buffer.
>>> >
>>> > There is also something very strange: if I change the code so that all
>>> > threads pack exactly the same B_k to the same shared B' and keep the
>>> > barrier, then I still don't get a correct result... (if each thread has
>>> > its own B', then it's fine)
>>> >
>>> > cheers,
>>> >
>>> > gael.
>>> >
>>> >
>>> >
>>> >
>>> > On Thu, Feb 25, 2010 at 1:11 PM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:
>>> >>
>>> >>
>>> >> On Thu, Feb 25, 2010 at 12:47 PM, Aron Ahmadia <aja2111@xxxxxxxxxxxx> wrote:
>>> >>>
>>> >>> Hi Gael,
>>> >>>
>>> >>> I think I follow what you did, but just to summarize...
>>> >>>
>>> >>> > for each horizontal panel B_k of B of size 256x2048 do
>>> >>> > copy/pack B_k to B'
>>> >>> > let A_k be the corresponding 2048x256 vertical panel of A
>>> >>> > for each block A_ik of A_k of size 512x256 do
>>> >>> > copy/pack A_ji to A'
>>> >>> > compute C_i += A' * B' using an optimized kernel
>>> >>> > endfor
>>> >>> > endfor
>>> >>>
>>> >>> Do you mean copy A_ik to A' here instead of A_ji?
>>> >>
>>> >> arf yes, sorry.
>>> >>
>>> >>>
>>> >>> > #cores: 1 2 4
>>> >>> > Eigen: 19 35 60
>>> >>>
>>> >>> Those are some very impressive numbers; I wouldn't be kicking myself
>>> >>> too hard.
>>> >>>
>>> >>> > An important result is that unlike what I thought, *it is not always
>>> >>> > better to parallelize at the highest level*. For instance, if a user
>>> >>> > has to do 4 large matrix products, it is better to do them
>>> >>> > sequentially and let the underlying products manage the
>>> >>> > parallelization! So this is a strong argument to enable good
>>> >>> > parallelization in Eigen as it cannot optimally be done at the user
>>> >>> > level.
>>> >>>
>>> >>> I agree strongly with this statement. Of course I think this might
>>> >>> mean I need to roll my sleeves up and start helping :)
>>> >>
>>> >> hehe!
>>> >>
>>> >>>
>>> >>> > To this end, I replaced the copies/packings to B' and A' by calls to
>>> >>> > memset on B' and A' respectively => then I got a perfect x4, meaning
>>> >>> > the optimized kernel is already perfect and that the performance
>>> >>> > penalty is entirely due to the (non-sequential) reads from B_j and
>>> >>> > A_ij. I tried to disable only one of the two and I confirm that both
>>> >>> > are equally responsible for this slowdown. (With a single thread they
>>> >>> > represent about 5% of the computation.)
>>> >>>
>>> >>> Again, your notation for the sub-block of A is slightly confusing; I
>>> >>> wish I had taken more notes on your approach. I'll review Goto's
>>> >>> paper.
>>> >>
>>> >> typo again. If you prefer, I meant B_k and A_ik, where j and k are
>>> >> sub-block indices, not coeff indices, but I know you got that right ;)
>>> >>
>>> >>>
>>> >>> If I understand correctly, your scaling performance is suffering due
>>> >>> to the increased time taken to copy into the 'optimal panels', not due
>>> >>> to computation. If we can improve the efficiency of preparing the
>>> >>> panels, we'll have solved the scalability problem. Of course, this is
>>> >>> easier said than done.
>>> >>>
>>> >>> > C1 = A * B1
>>> >>> > C2 = A * B2
>>> >>> >
>>> >>> > is not very good because A is entirely shared between the two threads
>>> >>> > and so the packing of the A_ij is done twice => *it is pretty clear
>>> >>> > that the threads have to talk* :(
>>> >>>
>>> >>> Actually, this happens in the 2x2 case as well:
>>> >>>
>>> >>> > C11 = A1*B1
>>> >>> > C12 = A1*B2
>>> >>> > C21 = A2*B1
>>> >>> > C22 = A2*B2
>>> >>>
>>> >>> Each thread needs one of A1 or A2, and one of B1 or B2. This implies
>>> >>> that each of the sub-matrices from A and B gets packed twice. I agree
>>> >>> that we have to be careful about avoiding redundant copies when we can.
>>> >>
>>> >> arf! I feel so stupid... somehow I got confused by all the experiments
>>> >> I made. Anyway, good news, because at least now we know what the
>>> >> problem is :)
>>> >>
>>> >>>
>>> >>> What CPU architectures are you performing your tests on? I have
>>> >>> access to two 8-core Intel architectures here (X5550 and E5420), and
>>> >>> the 4-core IBM BlueGene architecture may have a chance at working with
>>> >>> Eigen once we get XLC upgraded. Also, are you still using OpenMP or do
>>> >>> you want to consider other options (pthreads, Boost.threads)? Is your
>>> >>> code pushed to the branch?
>>> >>
>>> >> Quad core2 (penryn). I have to check, but I'm pretty sure that around
>>> >> me there are some 8-core machines. Regarding the underlying threading
>>> >> API, I'd say that as long as OpenMP fulfills our needs we should stick
>>> >> to it, but I fear it won't be flexible enough to allow the threads to
>>> >> talk well and quickly. Well, we'll see.
>>> >>
>>> >>>
>>> >>> My first thoughts are on getting set up with benchmarks on both
>>> >>> machines so that I can verify performance, as I suspect things will
>>> >>> vary between the X5550 and the E5420. I'm most concerned with new
>>> >>> architectures such as Nehalem (i.e., the X5550), so I will probably be
>>> >>> tuning and thinking in that direction.
>>> >>>
>>> >>> I also plan to stick with GNU compilers for now, as KAUST's setup for
>>> >>> the Intel compilers is a bit sticky at the moment.
>>> >>
>>> >> Here gcc 4.4 and icc perform equally for this particular case.
>>> >>
>>> >>>
>>> >>> Regards,
>>> >>> Aron
>>> >>>
>>> >>> On Thu, Feb 25, 2010 at 2:00 PM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:
>>> >>> >
>>> >>> > Ok, after some experiments it seems that the naive splitting
>>> >>> > strategy I've implemented does not scale very well:
>>> >>> >
>>> >>> > product sizes (rows,cols,depth): 2048x2048x2048
>>> >>> >
>>> >>> > GFLOPS:
>>> >>> >
>>> >>> > #cores: 1 2 4
>>> >>> > Eigen: 19 35 60
>>> >>> > GOTO: 20 39 75
>>> >>> >
>>> >>> > So let me start with a brief review of the way the product is
>>> >>> > implemented. Let's consider C = A * B, and to simplify I'll
>>> >>> > instantiate the product for matrices of size 2048x2048:
>>> >>> >
>>> >>> > for each horizontal panel B_k of B of size 256x2048 do
>>> >>> > copy/pack B_k to B'
>>> >>> > let A_k be the corresponding 2048x256 vertical panel of A
>>> >>> > for each block A_ik of A_k of size 512x256 do
>>> >>> > copy/pack A_ji to A'
>>> >>> > compute C_i += A' * B' using an optimized kernel
>>> >>> > endfor
>>> >>> > endfor
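>>> >>> >
>>> >>> > (As a C++ sketch, with the block sizes hard-coded and with
>>> >>> > packRhs/packLhs/gebp as placeholders for the packing routines and the
>>> >>> > optimized kernel -- these are not the actual Eigen function names:)
>>> >>> >
>>> >>> >   void packRhs(double* Bp, const double* B, int k, int kc);
>>> >>> >   void packLhs(double* Ap, const double* A, int i, int k, int mc, int kc);
>>> >>> >   void gebp(double* C, int i, const double* Ap, const double* Bp);
>>> >>> >
>>> >>> >   void gemm_sketch(double* C, const double* A, const double* B,
>>> >>> >                    double* Ap, double* Bp)   // Ap/Bp: packing buffers
>>> >>> >   {
>>> >>> >     const int size = 2048, kc = 256, mc = 512;
>>> >>> >     for(int k=0; k<size; k+=kc)        // horizontal panels B_k (256x2048)
>>> >>> >     {
>>> >>> >       packRhs(Bp, B, k, kc);           // B' <- pack B_k
>>> >>> >       for(int i=0; i<size; i+=mc)      // blocks A_ik of A_k (512x256)
>>> >>> >       {
>>> >>> >         packLhs(Ap, A, i, k, mc, kc);  // A' <- pack A_ik
>>> >>> >         gebp(C, i, Ap, Bp);            // C_i += A' * B' (optimized kernel)
>>> >>> >       }
>>> >>> >     }
>>> >>> >   }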
>>> >>> >
>>> >>> > For a quad core, the actual parallelization strategy is:
>>> >>> > - split B into B1 and B2 each of size 2048x1024
>>> >>> > - split A into A1 and A2 each of size 1024x2048
>>> >>> > - split C into C11 .. C22 each of size 1024x1024
>>> >>> > - do the 4 non-dependent matrix products in parallel:
>>> >>> > C11 = A1*B1
>>> >>> > C12 = A1*B2
>>> >>> > C21 = A2*B1
>>> >>> > C22 = A2*B2
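>>> >>> >
>>> >>> > (In OpenMP terms this boils down to something like the following
>>> >>> > sketch, where gemm() stands for the sequential single-threaded
>>> >>> > product:)
>>> >>> >
>>> >>> >   #pragma omp parallel sections num_threads(4)
>>> >>> >   {
>>> >>> >     #pragma omp section
>>> >>> >     gemm(C11, A1, B1);
>>> >>> >     #pragma omp section
>>> >>> >     gemm(C12, A1, B2);
>>> >>> >     #pragma omp section
>>> >>> >     gemm(C21, A2, B1);
>>> >>> >     #pragma omp section
>>> >>> >     gemm(C22, A2, B2);
>>> >>> >   }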
>>> >>> >
>>> >>> > Now if I use this strategy with GOTO (i.e., I instantiate 4
>>> >>> > different products in parallel using OpenMP and I make sure that
>>> >>> > GOTO does not try to parallelize), then I get really bad performance:
>>> >>> > 35 Gflops (while Eigen does the same at 60 Gflops). On the other
>>> >>> > hand, if I do these 4 products sequentially, and let GOTO parallelize
>>> >>> > (4 threads), then GOTO yields 70 Gflops!
>>> >>> >
>>> >>> > An important result is that unlike what I thought, *it is not always
>>> >>> > better to parallelize at the highest level*. For instance, if a user
>>> >>> > has to do 4 large matrix products, it is better to do them
>>> >>> > sequentially and let the underlying products manage the
>>> >>> > parallelization! So this is a strong argument to enable good
>>> >>> > parallelization in Eigen as it cannot optimally be done at the user
>>> >>> > level.
>>> >>> >
>>> >>> > Then I tried to understand why this "naive" parallelization
>>> >>> > strategy does not scale very well. Indeed, in the case of 4 threads,
>>> >>> > all the data are fully independent and it is equivalent to doing 4
>>> >>> > smaller products in parallel. So what?
>>> >>> >
>>> >>> > To this end, I replaced the copies/packings to B' and A' by calls to
>>> >>> > memset on B' and A' respectively => then I got a perfect x4, meaning
>>> >>> > the optimized kernel is already perfect and that the performance
>>> >>> > penalty is entirely due to the (non-sequential) reads from B_j and
>>> >>> > A_ij. I tried to disable only one of the two and I confirm that both
>>> >>> > are equally responsible for this slowdown. (With a single thread they
>>> >>> > represent about 5% of the computation.)
>>> >>> >
>>> >>> > If I replace my copy/pack functions with GOTO's ones (hand-written
>>> >>> > assembly with more aggressive unrolling and prefetching), then I get
>>> >>> > a slight speedup, but not that much. So we really have to work at a
>>> >>> > higher level, i.e., how to better distribute the work across the
>>> >>> > different threads? But for now, I'm clueless...
>>> >>> >
>>> >>> > One final remark though: with 4 threads, the naive splitting
>>> >>> > strategy sounded really not too bad because all the submatrices
>>> >>> > involved are 100% independent. On the other hand, with only 2
>>> >>> > threads, it is pretty clear that simply splitting the product into
>>> >>> > two, e.g.:
>>> >>> >
>>> >>> > C1 = A * B1
>>> >>> > C2 = A * B2
>>> >>> >
>>> >>> > is not very good because A is entirely shared between the two threads
>>> >>> > and so the packing of the A_ij is done twice => *it is pretty clear
>>> >>> > that the threads have to talk* :(
>>> >>> >
>>> >>> >
>>> >>> > that's all for now.
>>> >>> >
>>> >>> >
>>> >>> > gael
>>> >>> >
>>> >>> >
>>> >>> > On Tue, Feb 23, 2010 at 12:34 PM, Aron Ahmadia <aja2111@xxxxxxxxxxxx> wrote:
>>> >>> >>
>>> >>> >> Oh very cool! I am going to start playing with this very soon. I
>>> >>> >> want to look into getting some automatic numbers up from each build
>>> >>> >> so that it's easy to see how the code is performing across several
>>> >>> >> different HPC architectures (BlueGene/P, Nehalem, Clovertown). The
>>> >>> >> bench code should be what I'm looking for.
>>> >>> >>
>>> >>> >> I'm curious about the difference between OpenMP and pthreads. I
>>> >>> >> know there is a bit of overhead in OpenMP, but I didn't realize it
>>> >>> >> would be so meaningful. This will of course vary between
>>> >>> >> architectures and compilers, but I will query the IBM researchers
>>> >>> >> (Gunnels is more of an expert with both pthreads and OpenMP, and
>>> >>> >> worked in van de Geijn's group, so may have more direct thoughts).
>>> >>> >>
>>> >>> >> Warm Regards,
>>> >>> >> Aron
>>> >>> >>
>>> >>> >> On Mon, Feb 22, 2010 at 11:43 AM, Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx> wrote:
>>> >>> >> > Works like a charm - see attachment.
>>> >>> >> >
>>> >>> >> > - Hauke
>>> >>> >> >
>>> >>> >> > On Mon, Feb 22, 2010 at 11:28 AM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:
>>> >>> >> >>
>>> >>> >> >> Hi,
>>> >>> >> >>
>>> >>> >> >> I have just created a fork there:
>>> >>> >> >>
>>> >>> >> >> http://bitbucket.org/ggael/eigen-smp
>>> >>> >> >>
>>> >>> >> >> to play with SMP support, and more precisely, with OpenMP.
>>> >>> >> >>
>>> >>> >> >> Currently only the general matrix-matrix product is
>>> >>> >> >> parallelized. I've implemented a general 1D parallelizer to
>>> >>> >> >> factor the parallelization code. It is defined there:
>>> >>> >> >>
>>> >>> >> >> Eigen/src/Core/products/Parallelizer.h
>>> >>> >> >>
>>> >>> >> >> and used at the end of this file:
>>> >>> >> >> Eigen/src/Core/products/GeneralMatrixMatrix.h
>>> >>> >> >>
>>> >>> >> >> In the bench/ folder there are two bench_gemm*.cpp files to try
>>> >>> >> >> it and compare to BLAS.
>>> >>> >> >>
>>> >>> >> >> On my core2 duo, I've observed a speedup of around 1.9 for
>>> >>> >> >> relatively small matrices.
>>> >>> >> >>
>>> >>> >> >> At work I have an "Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz",
>>> >>> >> >> but it is currently too busy to do really meaningful experiments.
>>> >>> >> >> Nevertheless, it seems that gotoblas, which directly uses
>>> >>> >> >> pthreads, reports more consistent speedups. So perhaps OpenMP is
>>> >>> >> >> trying to do some overly smart scheduling, and it might be useful
>>> >>> >> >> to deal with pthreads directly?
>>> >>> >> >>
>>> >>> >> >> For the lazy but interested reader, the interesting piece of
>>> >>> >> >> code is there:
>>> >>> >> >>
>>> >>> >> >>   int threads = omp_get_num_procs();
>>> >>> >> >>   int blockSize = size / threads;
>>> >>> >> >>   #pragma omp parallel for schedule(static,1)
>>> >>> >> >>   for(int i=0; i<threads; ++i)
>>> >>> >> >>   {
>>> >>> >> >>     int blockStart = i*blockSize;
>>> >>> >> >>     int actualBlockSize = std::min(blockSize, size - blockStart);
>>> >>> >> >>
>>> >>> >> >>     func(blockStart, actualBlockSize);
>>> >>> >> >>   }
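>>> >>> >> >>
>>> >>> >> >> Caveat for the snippet above: when size is not a multiple of the
>>> >>> >> >> number of threads, the last size % threads entries are not
>>> >>> >> >> processed; one way to cover them would be to give the remainder
>>> >>> >> >> to the last chunk, e.g.:
>>> >>> >> >>
>>> >>> >> >>   int actualBlockSize = (i+1==threads) ? size - blockStart : blockSize;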
>>> >>> >> >>
>>> >>> >> >> feel free to play with it and have fun!
>>> >>> >> >>
>>> >>> >> >> Gael.
>>> >>> >> >>
>>> >>> >> >> PS to Aron: yesterday the main problem was that our benchtimer
>>> >>> >> >> reported the total execution time, not the "real" one... the
>>> >>> >> >> other was that my system was a bit busy because of daemons and
>>> >>> >> >> stuff like that...
>>> >>> >> >>
>>> >>> >> >
>>> >>> >>
>>> >>> >>
>>> >>> >
>>> >>> >
>>> >>>
>>> >>>
>>> >>
>>> >
>>> >
>>>
>>>
>>
>
>