Re: [eigen] a branch for SMP (openmp) experimentations |

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]

*To*: eigen@xxxxxxxxxxxxxxxxxxx*Subject*: Re: [eigen] a branch for SMP (openmp) experimentations*From*: Aron Ahmadia <aja2111@xxxxxxxxxxxx>*Date*: Thu, 25 Feb 2010 14:47:38 +0300*Dkim-signature*: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:received:in-reply-to :references:date:x-google-sender-auth:message-id:subject:from:to :content-type:content-transfer-encoding; bh=SEbrjIdnsQXTdhBcwk0CbOonqT3ORjHxJrhCGqPuygU=; b=LB35fK+ZEiLQK+sOy1KlLTDs0FVaelpW9TJORJDB3y0aLOyXzwBH+27EoLGtQ/kALO Cx3UsVDKfg8HCWsK1SBN90TyMPDivKG94Yk7eqHh8QTV8pjUFkBP02yoyA5pjqLn9BLT cAnxMSkOvr20925fa6yjHU9/iLQRNjzz+89o4=*Domainkey-signature*: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:content-type :content-transfer-encoding; b=khEBWKNl9JVzKftEYD6JvbTwfKKq3thoI0iy9CECEeqtyvkyaKO1GlVv2VfkPJRaXU Md21T4gmf5EbJ9jA0XjT4moXpMTnj/Wt3BMOQSycaPYeNWiJkGT89PBZMtRDx0APxmIx QtMYNL4LxgCQkMkDQxXu+0jCpFCzpEF3qzWYg=

Hi Gael, I think I follow what you did, but just to summarize... > for each horizontal panel B_k of B of size 256x2048 do > copy/pack B_k to B' > let A_k be the corresponding 2048x256 vertical panel of A > for each block A_ik of A_k of size 512x256 do > copy/pack A_ji to A' > compute C_i += A' * B' using an optimized kernel > endfor > endfor Do you mean copy A_ik to A' here instead of A_ji? > #cores: 1 2 4 > Eigen: 19 35 60 Those are some very impressive numbers, I wouldn't be kicking myself too hard. > An important result is that unlike what I thought, *it is not always better > to parallelize at the highest level*. For instance, if a user has to do 4 > large matrix products, it is better to do them sequentially and let the > underlying products manage the parallelization ! So this is a strong > argument to enable good vectorization in Eigen as it cannot optimaly be done > at the user level. I agree strongly with this statement. Of course I think this might mean I need to roll my sleeves up and start helping :) > To this end, I replaced the copies/packings to B' and A' by calls to memset > on B' and A' respectiveley => then I got a perfect x4 meaning the optimized > kernel is already perfect and that the performance penalty is entirely due > to the (non sequential) reads from B_j and A_ij. I tried to disable only one > of the two and I confirm that both are equally responsible for this > slowdown. (With a single thread they represent about 5% of the computation.) Again, your notation for the sub-block of A is slightly confusing, I wish I had taken more notes on your approach, I'll review Goto's paper. If I understand correctly, your scaling performance is suffering due to the increased time taken to copy into the 'optimal panels', not due to computation. If we can improve the efficiency of preparing the panels, we'll have solved the scalability problem. Of course, this is easier said than done. > C1 = A * B1 > C2 = A * B2 > > is not very good because A is entirely shared between the two threads and so > the packing of the A_ij is done twice => *it is pretty clear the the threads > have to talk* :( Actually, this happens in the 2x2 case as well: > C11 = A1*B1 > C12 = A1*B2 > C21 = A2*B1 > C22 = A2*B2 Each thread needs one of A1 or A2, and one of B1 or B2. This implies that each of the sub-matrices from A and B gets packed twice. I agree that we have to be careful about avoiding redundant copies when we can. What CPU architectures are you performing your tests on? I have access to two 8-core Intel architectures here (X5550 and E5420), and the 4-core IBM BlueGene architecture may have a chance at working with Eigen once we get XLC upgraded. Also, are you still using OpenMP or do you want to consider other options (pthreads, Boost.threads)? Is your code pushed to the branch? My first thoughts are on getting set up with benchmarks on both machines so that I can verify performance, as I suspect things will vary between the X5550 and the E5420. I'm most concerned with new architectures such as Nehalem (i.e., the X5550), so I will probably be tuning and thinking in that direction. I also plan to stick with GNU compilers for now, as KAUST's setup for the Intel compilers is a bit sticky at the moment. Regards, Aron On Thu, Feb 25, 2010 at 2:00 PM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote: > > Ok, after some experiments it seems that the naive splitting strategy I've > implemented does not scale very well: > > product sizes (rows,cols,depth): 2048x2048x2048 > > GFLOPS: > > #cores: 1 2 4 > Eigen: 19 35 60 > GOTO: 20 39 75 > > So first let me start by briefly review of the way the product is > implemented. Let's consider C = A * B, and to simplify I'll instantiate the > product for matrices of size 2048x2048: > > for each horizontal panel B_k of B of size 256x2048 do > copy/pack B_k to B' > let A_k be the corresponding 2048x256 vertical panel of A > for each block A_ik of A_k of size 512x256 do > copy/pack A_ji to A' > compute C_i += A' * B' using an optimized kernel > endfor > endfor > > For a quad cores, the actual parallelization strategy is: > - split B into B1 and B2 each of size 2048x1024 > - split A into A1 and A2 each of size 1024x2048 > - split C into C11 .. C22 each of size 1024x1024 > - do the 4 non-dependent matrix products in parallel: > C11 = A1*B1 > C12 = A1*B2 > C21 = A2*B1 > C22 = A2*B2 > > Now if I use this strategy with GOTO (i.e., I instantiate 4 different > products in parallel using OpenMP and I make sure that GOTO does not try to > parallelize), then I get really bad performance: 35Gflops (while Eigen does > the same at 60Gflops). On the other hand, if I do these 4 products > sequentially, and let GOTO parallelize (4threads), then GOTO yields 70Gflops > ! > > An important result is that unlike what I thought, *it is not always better > to parallelize at the highest level*. For instance, if a user has to do 4 > large matrix products, it is better to do them sequentially and let the > underlying products manage the parallelization ! So this is a strong > argument to enable good vectorization in Eigen as it cannot optimaly be done > at the user level. > > Then I tried to understand why this "naive" parallelization strategy does > not scale very well. Indeed, in the case of 4 threads, all the data are > fully independent and it is equalivalent to do 4 smaller products in > parallel. So what ? > > To this end, I replaced the copies/packings to B' and A' by calls to memset > on B' and A' respectiveley => then I got a perfect x4 meaning the optimized > kernel is already perfect and that the performance penalty is entirely due > to the (non sequential) reads from B_j and A_ij. I tried to disable only one > of the two and I confirm that both are equally responsible for this > slowdown. (With a single thread they represent about 5% of the computation.) > > If I replace my copy/pack function calls to the GOTO's ones (hand written > assembly with more aggressive unrolling and prefetching), then I get a > slight speedup but not that much. So we really have to work at a higher > level, i.e., how to better distribute the works across the different threads > ? But now, I'm clueless... > > One final remark though: With 4 threads, the naive splitting strategy > sounded really not too bad because all submatrices involved are 100% > independent. On the other hand, with only 2 threads, it is pretty clear that > simply splitting the product into two, e.g.: > > C1 = A * B1 > C2 = A * B2 > > is not very good because A is entirely shared between the two threads and so > the packing of the A_ij is done twice => *it is pretty clear the the threads > have to talk* :( > > > that's all for now. > > > gael > > > On Tue, Feb 23, 2010 at 12:34 PM, Aron Ahmadia <aja2111@xxxxxxxxxxxx> wrote: >> >> Oh very cool! I am going to start playing with this very soon. I >> want to look into getting some automatic numbers up from each build so >> that it's easy to see how the code is performing across several >> different HPC architectures (BlueGene/P, Nehalem, Clovertown). The >> bench code should be what I'm looking for. >> >> I'm curious about the difference between OpenMP and pthreads. I know >> there is a bit of overthread in OpenMP but I didn't realize it would >> be so meaningful. This will of course vary between architectures and >> compilers, but I will query the IBM Researchers (Gunnels is more of an >> expert with both pthreads and OpenMP and worked in van de Geijn's >> group so may have more direct thoughts). >> >> Warm Regards, >> Aron >> >> On Mon, Feb 22, 2010 at 11:43 AM, Hauke Heibel >> <hauke.heibel@xxxxxxxxxxxxxx> wrote: >> > Works like a charm - see attachement. >> > >> > - Hauke >> > >> > On Mon, Feb 22, 2010 at 11:28 AM, Gael Guennebaud >> > <gael.guennebaud@xxxxxxxxx> wrote: >> >> >> >> Hi, >> >> >> >> I have just created a fork there: >> >> >> >> http://bitbucket.org/ggael/eigen-smp >> >> >> >> to play with SMP support, and more precisely, with OpenMP. >> >> >> >> Currently only the general matrix-matrix product is parallelized. I've >> >> implemented a general 1D parallelizer to factor the parallelization >> >> code. It >> >> is defined there: >> >> >> >> Eigen/src/Core/products/Parallelizer.h >> >> >> >> and used at the end of this file: >> >> Eigen/src/Core/products/GeneralMatrixMatrix.h >> >> >> >> In the bench/ folder there are two bench_gemm*.cpp files to try it and >> >> compare to BLAS. >> >> >> >> On my core2 duo, I've observed a speed up around 1.9 for relatively >> >> small >> >> matrices. >> >> >> >> At work I have an "Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz" but >> >> it >> >> is currently too busy to do really meaningfull experiments. >> >> Nevertheless, it >> >> seems that gotoblas, which is directly using pthread, reports more >> >> consistent speedups. So perhaps OpenMP is trying to do some too smart >> >> scheduling and it might be useful to directly deal with pthread? >> >> >> >> For the lazy but interested reader, the interesting piece of code is >> >> there: >> >> >> >> int threads = omp_get_num_procs(); >> >> int blockSize = size / threads; >> >> #pragma omp parallel for schedule(static,1) >> >> for(int i=0; i<threads; ++i) >> >> { >> >> int blockStart = i*blockSize; >> >> int actualBlockSize = std::min(blockSize, size - blockStart); >> >> >> >> func(blockStart, actualBlockSize); >> >> } >> >> >> >> feelfree to play with it and have fun! >> >> >> >> Gael. >> >> >> >> PS: to Aron, yesterday the main pb was that our benchtimer reported the >> >> total execution time, not the "real" one... the other was that my >> >> system was >> >> a bit buzy because of daemons, and stuff like that... >> >> >> > >> >> > >

**Follow-Ups**:**Re: [eigen] a branch for SMP (openmp) experimentations***From:*Gael Guennebaud

**References**:**[eigen] a branch for SMP (openmp) experimentations***From:*Gael Guennebaud

**Re: [eigen] a branch for SMP (openmp) experimentations***From:*Hauke Heibel

**Re: [eigen] a branch for SMP (openmp) experimentations***From:*Aron Ahmadia

**Re: [eigen] a branch for SMP (openmp) experimentations***From:*Gael Guennebaud

**Messages sorted by:**[ date | thread ]- Prev by Date:
**Re: [eigen] a branch for SMP (openmp) experimentations** - Next by Date:
**Re: [eigen] a branch for SMP (openmp) experimentations** - Previous by thread:
**Re: [eigen] a branch for SMP (openmp) experimentations** - Next by thread:
**Re: [eigen] a branch for SMP (openmp) experimentations**

Mail converted by MHonArc 2.6.19+ | http://listengine.tuxfamily.org/ |