<gael.guennebaud@xxxxxxxxx> wrote:
>
> ok so to summarize, so far the parallel strategy duplicated the packing N
> times, where N is the number of threads. To overcome this shortcoming, I've
> implemented a new (still simple) strategy:
>
> - create a memory buffer B' that will be read/written by all threads
> - split A into A1...AN horizontal panels (do the same for C)
> - parallel loop:
>     compute Ci = Ai * B using the shared B'
>
> Let's recall that B' is used to pack a horizontal panel B_k of B. Then, to
> avoid packing the same data N times into B', B_k and B' are subdivided into
> N blocks, each thread packs one block, and a #pragma omp barrier syncs all
> threads.
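>
> To make this concrete, here is a rough sketch of the pattern (the helper
> names packBChunk/gebpKernel are made up, this is not the actual code):
>
>   #include <omp.h>
>
>   // Bp is the shared packed buffer B', allocated once before the loop
>   void sharedPackedProduct(float* C, const float* A, const float* B,
>                            float* Bp, int size, int panel, int N)
>   {
>     #pragma omp parallel num_threads(N)
>     {
>       int tid = omp_get_thread_num();
>       for(int k=0; k<size; k+=panel)
>       {
>         // each thread packs its 1/N-th share of the panel B_k into Bp
>         packBChunk(Bp, B, k, panel, tid, N, size);
>         #pragma omp barrier  // wait until the whole B_k has been packed
>         // Ci += Ai * B_k using the shared Bp (Ai/Ci owned by thread tid)
>         gebpKernel(C, A, Bp, k, panel, tid, N, size);
>         #pragma omp barrier  // don't repack Bp while others still read it
>       }
>     }
>   }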
>
> Of course, because of the L1 cache, the problem is that when thread 1 reads
> the data of B' written by thread 0, we cannot guarantee that it reads the
> correct data. So does anybody know how I can enforce that? I know that's a
> newbie question, but a newbie in this topic I am ;)
>
> I also know about the omp flush directive, but it is useless here since I'm
> working on a whole buffer.
>
> There is also something very strange: if I change the code so that all
> threads pack exactly the same B_k into the same shared B' and keep the
> barrier, then I still don't get a correct result... (if each thread has its
> own B', then it's fine)
>
> cheers,
>
> gael.
>
>
>
>
> On Thu, Feb 25, 2010 at 1:11 PM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:
>>
>>
>> On Thu, Feb 25, 2010 at 12:47 PM, Aron Ahmadia <aja2111@xxxxxxxxxxxx> wrote:
>>>
>>> Hi Gael,
>>>
>>> I think I follow what you did, but just to summarize...
>>>
>>> > for each horizontal panel B_k of B of size 256x2048 do
>>> >   copy/pack B_k to B'
>>> >   let A_k be the corresponding 2048x256 vertical panel of A
>>> >   for each block A_ik of A_k of size 512x256 do
>>> >     copy/pack A_ji to A'
>>> >     compute C_i += A' * B' using an optimized kernel
>>> >   endfor
>>> > endfor
>>>
>>> Do you mean copy A_ik to A' here instead of A_ji?
>>
>> arf yes, sorry.
>>
>>>
>>> > #cores:   1    2    4
>>> > Eigen:   19   35   60
>>>
>>> Those are some very impressive numbers, I wouldn't be kicking myself too
>>> hard.
>>>
>>> > An important result is that, unlike what I thought, *it is not always
>>> > better to parallelize at the highest level*. For instance, if a user has
>>> > to do 4 large matrix products, it is better to do them sequentially and
>>> > let the underlying products manage the parallelization! So this is a
>>> > strong argument to enable good parallelization in Eigen, as it cannot
>>> > optimally be done at the user level.
>>>
>>> I agree strongly with this statement. Of course I think this might
>>> mean I need to roll my sleeves up and start helping :)
>>
>> hehe!
>>
>>>
>>> > To this end, I replaced the copies/packings to B' and A' by calls to
>>> > memset on B' and A' respectively => then I got a perfect x4, meaning
>>> > that the optimized kernel is already perfect and that the performance
>>> > penalty is entirely due to the (non sequential) reads from B_j and A_ij.
>>> > I tried to disable only one of the two and I confirm that both are
>>> > equally responsible for this slowdown. (With a single thread they
>>> > represent about 5% of the computation.)
>>>
>>> Again, your notation for the sub-blocks of A is slightly confusing; I
>>> wish I had taken more notes on your approach. I'll review Goto's
>>> paper.
>>
>> typo again. If you prefer, I meant B_k and A_ik, where the indices are
>> sub-block indices, not coeff indices, but I know you got that right ;)
>>
>>>
>>> If I understand correctly, your scaling performance is suffering due
>>> to the increased time taken to copy into the 'optimal panels', not due
>>> to computation. If we can improve the efficiency of preparing the
>>> panels, we'll have solved the scalability problem. Of course, this is
>>> easier said than done.
>>>
>>> > C1 = A * B1
>>> > C2 = A * B2
>>> >
>>> > is not very good because A is entirely shared between the two threads,
>>> > and so the packing of the A_ij is done twice => *it is pretty clear
>>> > that the threads have to talk* :(
>>>
>>> Actually, this happens in the 2x2 case as well:
>>>
>>> > C11 = A1*B1
>>> > C12 = A1*B2
>>> > C21 = A2*B1
>>> > C22 = A2*B2
>>>
>>> Each thread needs one of A1 or A2, and one of B1 or B2. This implies
>>> that each of the sub-matrices from A and B gets packed twice. I agree
>>> that we have to be careful about avoiding redundant copies when we
>>> can.
>>
>> arf! I feel so stupid... somehow I got confused by all the experiments I
>> made. Anyway, that's good news, because at least now we know what the
>> problem is :)
>>
>>>
>>> What CPU architectures are you performing your tests on? I have
>>> access to two 8-core Intel architectures here (X5550 and E5420), and
>>> the 4-core IBM BlueGene architecture may have a chance at working with
>>> Eigen once we get XLC upgraded. Also, are you still using OpenMP or
>>> do you want to consider other options (pthreads, Boost.threads)? Is
>>> your code pushed to the branch?
>>
>> Quad core2 (Penryn). I have to check, but I'm pretty sure that around me
>> there are some 8-core machines. Regarding the underlying threading API, I'd
>> say that as long as OpenMP fulfills our needs we should stick to it, but I
>> fear it won't be flexible enough to allow the threads to talk well and
>> quickly. Well, we'll see.
>>
>>>
>>> My first thoughts are on getting set up with benchmarks on both
>>> machines so that I can verify performance, as I suspect things will
>>> vary between the X5550 and the E5420. I'm most concerned with new
>>> architectures such as Nehalem (i.e., the X5550), so I will probably be
>>> tuning and thinking in that direction.
>>>
>>> I also plan to stick with GNU compilers for now, as KAUST's setup for
>>> the Intel compilers is a bit sticky at the moment.
>>
>> Here gcc 4.4 and icc perform equally for this particular case.
>>
>>>
>>> Regards,
>>> Aron
>>>
>>> On Thu, Feb 25, 2010 at 2:00 PM, Gael Guennebaud
>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>> >
>>> > Ok, after some experiments, it seems that the naive splitting strategy
>>> > I've implemented does not scale very well:
>>> >
>>> > product sizes (rows,cols,depth): 2048x2048x2048
>>> >
>>> > GFLOPS:
>>> >
>>> > #cores:   1    2    4
>>> > Eigen:   19   35   60
>>> > GOTO:    20   39   75
>>> >
>>> > So let me first start with a brief review of the way the product is
>>> > implemented. Let's consider C = A * B, and to simplify, I'll instantiate
>>> > the product for matrices of size 2048x2048:
>>> >
>>> > for each horizontal panel B_k of B of size 256x2048 do
>>> >   copy/pack B_k to B'
>>> >   let A_k be the corresponding 2048x256 vertical panel of A
>>> >   for each block A_ik of A_k of size 512x256 do
>>> >     copy/pack A_ji to A'
>>> >     compute C_i += A' * B' using an optimized kernel
>>> >   endfor
>>> > endfor
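>>> >
>>> > In rough C++ form (the pack/kernel helpers below are just placeholders,
>>> > not the actual functions), that loop structure is:
>>> >
>>> >   const int kc = 256;   // depth of a B panel
>>> >   const int mc = 512;   // rows of an A block
>>> >   for(int k=0; k<size; k+=kc)           // horizontal panel B_k of B
>>> >   {
>>> >     packB(Bp, B, k, kc, size);          // copy/pack B_k into B'
>>> >     for(int i=0; i<size; i+=mc)         // block A_ik of the panel A_k
>>> >     {
>>> >       packA(Ap, A, i, mc, k, kc);       // copy/pack A_ik into A'
>>> >       gebp(C, i, Ap, Bp, mc, kc, size); // C_i += A' * B' (optimized kernel)
>>> >     }
>>> >   }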
>>> >
>>> > For a quad core, the actual parallelization strategy is:
>>> > - split B into B1 and B2 each of size 2048x1024
>>> > - split A into A1 and A2 each of size 1024x2048
>>> > - split C into C11 .. C22 each of size 1024x1024
>>> > - do the 4 non-dependent matrix products in parallel:
>>> > C11 = A1*B1
>>> > C12 = A1*B2
>>> > C21 = A2*B1
>>> > C22 = A2*B2
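>>> >
>>> > With OpenMP, this amounts to something like the following sketch
>>> > (assuming C11..C22, A1, A2, B1, B2 are ordinary Eigen blocks/expressions;
>>> > this is not the actual code):
>>> >
>>> >   // the four quarter products are fully independent, so one section
>>> >   // per thread on a quad core
>>> >   #pragma omp parallel sections num_threads(4)
>>> >   {
>>> >     #pragma omp section
>>> >     { C11 = A1*B1; }
>>> >     #pragma omp section
>>> >     { C12 = A1*B2; }
>>> >     #pragma omp section
>>> >     { C21 = A2*B1; }
>>> >     #pragma omp section
>>> >     { C22 = A2*B2; }
>>> >   }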
>>> >
>>> > Now if I use this strategy with GOTO (i.e., I instantiate 4 different
>>> > products in parallel using OpenMP and I make sure that GOTO does not try
>>> > to parallelize), then I get really bad performance: 35 Gflops (while
>>> > Eigen does the same at 60 Gflops). On the other hand, if I do these 4
>>> > products sequentially and let GOTO parallelize (4 threads), then GOTO
>>> > yields 70 Gflops!
>>> >
>>> > An important result is that, unlike what I thought, *it is not always
>>> > better to parallelize at the highest level*. For instance, if a user has
>>> > to do 4 large matrix products, it is better to do them sequentially and
>>> > let the underlying products manage the parallelization! So this is a
>>> > strong argument to enable good parallelization in Eigen, as it cannot
>>> > optimally be done at the user level.
>>> >
>>> > Then I tried to understand why this "naive" parallelization strategy
>>> > does not scale very well. Indeed, in the case of 4 threads, all the data
>>> > are fully independent and it is equivalent to doing 4 smaller products
>>> > in parallel. So what?
>>> >
>>> > To this end, I replaced the copies/packings to B' and A' by calls to
>>> > memset on B' and A' respectively => then I got a perfect x4, meaning
>>> > that the optimized kernel is already perfect and that the performance
>>> > penalty is entirely due to the (non sequential) reads from B_j and A_ij.
>>> > I tried to disable only one of the two and I confirm that both are
>>> > equally responsible for this slowdown. (With a single thread they
>>> > represent about 5% of the computation.)
>>> >
>>> > If I replace my copy/pack function calls with GOTO's (hand-written
>>> > assembly with more aggressive unrolling and prefetching), then I get a
>>> > slight speedup, but not that much. So we really have to work at a higher
>>> > level, i.e., on how to better distribute the work across the different
>>> > threads. But for now, I'm clueless...
>>> >
>>> > One final remark though: with 4 threads, the naive splitting strategy
>>> > did not sound too bad because all the submatrices involved are 100%
>>> > independent. On the other hand, with only 2 threads, it is pretty clear
>>> > that simply splitting the product into two, e.g.:
>>> >
>>> > C1 = A * B1
>>> > C2 = A * B2
>>> >
>>> > is not very good because A is entirely shared between the two threads,
>>> > and so the packing of the A_ij is done twice => *it is pretty clear
>>> > that the threads have to talk* :(
>>> >
>>> >
>>> > that's all for now.
>>> >
>>> >
>>> > gael
>>> >
>>> >
>>> > On Tue, Feb 23, 2010 at 12:34 PM, Aron Ahmadia <aja2111@xxxxxxxxxxxx>
>>> > wrote:
>>> >>
>>> >> Oh very cool! I am going to start playing with this very soon. I
>>> >> want to look into getting some automatic numbers up from each build so
>>> >> that it's easy to see how the code is performing across several
>>> >> different HPC architectures (BlueGene/P, Nehalem, Clovertown). The
>>> >> bench code should be what I'm looking for.
>>> >>
>>> >> I'm curious about the difference between OpenMP and pthreads. I know
>>> >> there is a bit of overhead in OpenMP but I didn't realize it would
>>> >> be so meaningful. This will of course vary between architectures and
>>> >> compilers, but I will query the IBM researchers (Gunnels is more of an
>>> >> expert with both pthreads and OpenMP and worked in van de Geijn's
>>> >> group, so he may have more direct thoughts).
>>> >>
>>> >> Warm Regards,
>>> >> Aron
>>> >>
>>> >> On Mon, Feb 22, 2010 at 11:43 AM, Hauke Heibel
>>> >> <hauke.heibel@xxxxxxxxxxxxxx> wrote:
>>> >> > Works like a charm - see attachment.
>>> >> >
>>> >> > - Hauke
>>> >> >
>>> >> > On Mon, Feb 22, 2010 at 11:28 AM, Gael Guennebaud
>>> >> > <gael.guennebaud@xxxxxxxxx> wrote:
>>> >> >>
>>> >> >> Hi,
>>> >> >>
>>> >> >> I have just created a fork there:
>>> >> >>
>>> >> >> http://bitbucket.org/ggael/eigen-smp
>>> >> >>
>>> >> >> to play with SMP support, and more precisely, with OpenMP.
>>> >> >>
>>> >> >> Currently only the general matrix-matrix product is parallelized.
>>> >> >> I've implemented a general 1D parallelizer to factor out the
>>> >> >> parallelization code. It is defined there:
>>> >> >>
>>> >> >> Eigen/src/Core/products/Parallelizer.h
>>> >> >>
>>> >> >> and used at the end of this file:
>>> >> >> Eigen/src/Core/products/GeneralMatrixMatrix.h
>>> >> >>
>>> >> >> In the bench/ folder there are two bench_gemm*.cpp files to try it
>>> >> >> and compare to BLAS.
>>> >> >>
>>> >> >> On my core2 duo, I've observed a speed-up of around 1.9 for
>>> >> >> relatively small matrices.
>>> >> >>
>>> >> >> At work I have an "Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz", but
>>> >> >> it is currently too busy to do really meaningful experiments.
>>> >> >> Nevertheless, it seems that gotoblas, which directly uses pthreads,
>>> >> >> reports more consistent speedups. So perhaps OpenMP is trying to do
>>> >> >> some overly smart scheduling, and it might be useful to deal with
>>> >> >> pthreads directly?
>>> >> >>
>>> >> >> For the lazy but interested reader, the interesting piece of code is
>>> >> >> there:
>>> >> >>
>>> >> >> int threads = omp_get_num_procs();
>>> >> >> int blockSize = size / threads;
>>> >> >> #pragma omp parallel for schedule(static,1)
>>> >> >> for(int i=0; i<threads; ++i)
>>> >> >> {
>>> >> >>   int blockStart = i*blockSize;
>>> >> >>   // let the last thread pick up the remainder when size is not a
>>> >> >>   // multiple of the number of threads
>>> >> >>   int actualBlockSize = (i+1==threads) ? size-blockStart : blockSize;
>>> >> >>
>>> >> >>   func(blockStart, actualBlockSize);
>>> >> >> }
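>>> >> >>
>>> >> >> Here `func` can be any callable taking a start index and a length. A
>>> >> >> purely hypothetical example (not the actual Eigen functor) would be:
>>> >> >>
>>> >> >>   struct ScaleRange {
>>> >> >>     double* data;
>>> >> >>     double factor;
>>> >> >>     // scale the contiguous range [blockStart, blockStart+blockSize)
>>> >> >>     void operator()(int blockStart, int blockSize) const {
>>> >> >>       for(int j=blockStart; j<blockStart+blockSize; ++j)
>>> >> >>         data[j] *= factor;
>>> >> >>     }
>>> >> >>   };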
>>> >> >>
>>> >> >> feel free to play with it and have fun!
>>> >> >>
>>> >> >> Gael.
>>> >> >>
>>> >> >> PS to Aron: yesterday the main problem was that our benchtimer
>>> >> >> reported the total execution time, not the "real" one... the other
>>> >> >> was that my system was a bit busy because of daemons and stuff like
>>> >> >> that...
>>> >> >>
>>> >> >
>>> >>
>>> >>
>>> >
>>> >
>>>
>>>
>>
>
>