<
gael.guennebaud@xxxxxxxxx> wrote:
>
> Ok, after some experiments it seems that the naive splitting strategy I've
> implemented does not scale very well:
>
> product sizes (rows,cols,depth): 2048x2048x2048
>
> GFLOPS:
>
> #cores: 1 2 4
> Eigen: 19 35 60
> GOTO: 20 39 75
>
> So first, let me briefly review the way the product is implemented. Let's
> consider C = A * B, and to simplify I'll instantiate the product for
> matrices of size 2048x2048:
>
> for each horizontal panel B_k of B of size 256x2048 do
>   copy/pack B_k to B'
>   let A_k be the corresponding 2048x256 vertical panel of A
>   for each block A_ik of A_k of size 512x256 do
>     copy/pack A_ik to A'
>     compute C_i += A' * B' using an optimized kernel
>   endfor
> endfor
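> In C++-like form, the loop structure above can be sketched as follows (a
> self-contained toy version, column-major storage, plain loops in place of
> the optimized kernel; the helper names are mine, not Eigen's):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy sketch of the packed/blocked product C += A*B described above.
// n x n column-major matrices; kc/mc play the role of the 256/512 block sizes.
void gemm_blocked(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, int n, int kc, int mc)
{
    std::vector<double> Bp(kc * n);   // B': packed horizontal panel B_k
    std::vector<double> Ap(mc * kc);  // A': packed block A_ik
    for (int k = 0; k < n; k += kc) {             // horizontal panels B_k
        int kb = std::min(kc, n - k);
        for (int j = 0; j < n; ++j)               // copy/pack B_k to B'
            for (int p = 0; p < kb; ++p)
                Bp[j*kb + p] = B[(k + p) + j*n];
        for (int i = 0; i < n; i += mc) {         // blocks A_ik of panel A_k
            int ib = std::min(mc, n - i);
            for (int p = 0; p < kb; ++p)          // copy/pack A_ik to A'
                for (int r = 0; r < ib; ++r)
                    Ap[r + p*ib] = A[(i + r) + (k + p)*n];
            for (int j = 0; j < n; ++j)           // C_i += A' * B' (kernel)
                for (int p = 0; p < kb; ++p)
                    for (int r = 0; r < ib; ++r)
                        C[(i + r) + j*n] += Ap[r + p*ib] * Bp[j*kb + p];
        }
    }
}

// Plain reference product, for checking the blocked version.
void gemm_naive(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, int n)
{
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k)
            for (int i = 0; i < n; ++i)
                C[i + j*n] += A[i + k*n] * B[k + j*n];
}
```

> The point of the two packing loops is that the kernel then only streams
> through the contiguous buffers A' and B'.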
>
> For a quad core, the actual parallelization strategy is:
> - split B into B1 and B2 each of size 2048x1024
> - split A into A1 and A2 each of size 1024x2048
> - split C into C11 .. C22 each of size 1024x1024
> - do the 4 non-dependent matrix products in parallel:
> C11 = A1*B1
> C12 = A1*B2
> C21 = A2*B1
> C22 = A2*B2
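> Schematically, this 2x2 splitting could look like the following toy code
> (plain triple loops standing in for the real product; the pragma degrades
> to a serial loop when OpenMP is disabled):

```cpp
#include <cassert>
#include <vector>

// n x n column-major matrices. gemm_block computes the partial product
// C[i0..i0+m, j0..j0+m] += A[i0..i0+m, :] * B[:, j0..j0+m].
void gemm_block(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, int n, int i0, int j0, int m)
{
    for (int j = j0; j < j0 + m; ++j)
        for (int k = 0; k < n; ++k)
            for (int i = i0; i < i0 + m; ++i)
                C[i + j*n] += A[i + k*n] * B[k + j*n];
}

// The 4-way split described above: task t owns block C_{bi,bj} with
// bi = t/2, bj = t%2; the four products are fully independent.
void gemm_2x2_parallel(const std::vector<double>& A,
                       const std::vector<double>& B,
                       std::vector<double>& C, int n)
{
    int h = n / 2;  // assumes n even, for brevity
    #pragma omp parallel for schedule(static,1)
    for (int t = 0; t < 4; ++t)
        gemm_block(A, B, C, n, (t / 2) * h, (t % 2) * h, h);
}
```

> Each of the 4 iterations writes to a disjoint quarter of C, so no
> synchronization is needed inside the loop.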
>
> Now if I use this strategy with GOTO (i.e., I instantiate 4 different
> products in parallel using OpenMP and I make sure that GOTO does not try
> to parallelize itself), then I get really bad performance: 35 Gflops
> (while Eigen does the same at 60 Gflops). On the other hand, if I do these
> 4 products sequentially and let GOTO parallelize (4 threads), then GOTO
> yields 70 Gflops!
>
> An important result is that, contrary to what I thought, *it is not always
> better to parallelize at the highest level*. For instance, if a user has
> to do 4 large matrix products, it is better to do them sequentially and
> let the underlying products manage the parallelization! So this is a
> strong argument for supporting parallelization inside Eigen, as it cannot
> be done optimally at the user level.
>
> Then I tried to understand why this "naive" parallelization strategy does
> not scale very well. After all, in the case of 4 threads all the data are
> fully independent, so it should be equivalent to doing 4 smaller products
> in parallel. So what is going on?
>
> To find out, I replaced the copies/packings into B' and A' by calls to
> memset on B' and A' respectively => then I got a perfect x4 speedup,
> meaning the optimized kernel already scales perfectly and the performance
> penalty is entirely due to the (non-sequential) reads from B_k and A_ik. I
> tried disabling only one of the two, and I confirm that both are equally
> responsible for this slowdown. (With a single thread these copies
> represent about 5% of the computation time.)
>
> If I replace my copy/pack function calls with GOTO's ones (hand-written
> assembly with more aggressive unrolling and prefetching), then I get a
> slight speedup, but not that much. So we really have to work at a higher
> level, i.e., on how to better distribute the work across the different
> threads. But for now, I'm clueless...
>
> One final remark though: with 4 threads, the naive splitting strategy did
> not sound too bad because all the submatrices involved are 100%
> independent. On the other hand, with only 2 threads, it is pretty clear
> that simply splitting the product into two, e.g.:
>
> C1 = A * B1
> C2 = A * B2
>
> is not very good, because A is entirely shared between the two threads and
> so the packing of the A_ik blocks is done twice => *it is pretty clear
> that the threads have to talk* :(
>
>
> that's all for now.
>
>
> gael
>
>
> On Tue, Feb 23, 2010 at 12:34 PM, Aron Ahmadia <
aja2111@xxxxxxxxxxxx> wrote:
>>
>> Oh very cool! I am going to start playing with this very soon. I
>> want to look into getting some automatic numbers up from each build so
>> that it's easy to see how the code is performing across several
>> different HPC architectures (BlueGene/P, Nehalem, Clovertown). The
>> bench code should be what I'm looking for.
>>
>> I'm curious about the difference between OpenMP and pthreads. I know
>> there is a bit of overhead in OpenMP but I didn't realize it would
>> be so meaningful. This will of course vary between architectures and
>> compilers, but I will query the IBM Researchers (Gunnels is more of an
>> expert with both pthreads and OpenMP and worked in van de Geijn's
>> group so may have more direct thoughts).
>>
>> Warm Regards,
>> Aron
>>
>> On Mon, Feb 22, 2010 at 11:43 AM, Hauke Heibel
>> <
hauke.heibel@xxxxxxxxxxxxxx> wrote:
>> > Works like a charm - see attachement.
>> >
>> > - Hauke
>> >
>> > On Mon, Feb 22, 2010 at 11:28 AM, Gael Guennebaud
>> > <
gael.guennebaud@xxxxxxxxx> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I have just created a fork there:
>> >>
>> >>
http://bitbucket.org/ggael/eigen-smp
>> >>
>> >> to play with SMP support, and more precisely with OpenMP.
>> >>
>> >> Currently only the general matrix-matrix product is parallelized. I've
>> >> implemented a general 1D parallelizer to factor out the
>> >> parallelization code. It is defined there:
>> >>
>> >> Eigen/src/Core/products/Parallelizer.h
>> >>
>> >> and used at the end of this file:
>> >> Eigen/src/Core/products/GeneralMatrixMatrix.h
>> >>
>> >> In the bench/ folder there are two bench_gemm*.cpp files to try it and
>> >> compare to BLAS.
>> >>
>> >> On my core2 duo, I've observed a speed up around 1.9 for relatively
>> >> small
>> >> matrices.
>> >>
>> >> At work I have an "Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz", but
>> >> it is currently too busy to do really meaningful experiments.
>> >> Nevertheless, it seems that gotoblas, which directly uses pthreads,
>> >> reports more consistent speedups. So perhaps OpenMP is trying to do
>> >> some overly clever scheduling, and it might be useful to deal with
>> >> pthreads directly?
>> >>
>> >> For the lazy but interested reader, the interesting piece of code is
>> >> there:
>> >>
>> >> int threads = omp_get_num_procs();
>> >> int blockSize = size / threads;
>> >> #pragma omp parallel for schedule(static,1)
>> >> for(int i=0; i<threads; ++i)
>> >> {
>> >>   int blockStart = i*blockSize;
>> >>   // let the last thread take the remainder when size % threads != 0
>> >>   int actualBlockSize = (i+1==threads) ? size - blockStart : blockSize;
>> >>   func(blockStart, actualBlockSize);
>> >> }
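>> >> A self-contained version of that dispatcher could look like the sketch
>> >> below (func is a stand-in for the per-chunk product; the last chunk
>> >> absorbs the remainder when size is not a multiple of threads):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Sketch of a 1D parallelizer: split [0, size) into one contiguous chunk
// per thread and call func(blockStart, actualBlockSize) on each chunk.
void parallel_1d(int size, int threads,
                 const std::function<void(int, int)>& func)
{
    int blockSize = size / threads;
    #pragma omp parallel for schedule(static,1)
    for (int i = 0; i < threads; ++i)
    {
        int blockStart = i * blockSize;
        // the last chunk also covers the size % threads remainder
        int actualBlockSize = (i + 1 == threads) ? size - blockStart
                                                 : blockSize;
        func(blockStart, actualBlockSize);
    }
}
```

>> >> Without OpenMP the loop simply runs serially, so the same code path
>> >> can be used for single-threaded builds.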
>> >>
>> >> Feel free to play with it and have fun!
>> >>
>> >> Gael.
>> >>
>> >> PS: to Aron, yesterday the main problem was that our bench timer
>> >> reported the total execution time, not the "real" (wall-clock) one...
>> >> the other issue was that my system was a bit busy because of daemons,
>> >> and stuff like that...
>> >>
>> >
>>
>>
>
>