Re: [eigen] a branch for SMP (openmp) experimentations

On Thu, Feb 25, 2010 at 12:47 PM, Aron Ahmadia <aja2111@xxxxxxxxxxxx> wrote:

Hi Gael,

I think I follow what you did, but just to summarize...

> for each horizontal panel B_k of B of size 256x2048 do
> copy/pack B_k to B'
> let A_k be the corresponding 2048x256 vertical panel of A
> for each block A_ik of A_k of size 512x256 do
> copy/pack A_ji to A'
> compute C_i += A' * B' using an optimized kernel
> endfor
> endfor

Do you mean copy A_ik to A' here instead of A_ji?

arf yes, sorry.

> #cores: 1 2 4
> Eigen: 19 35 60

Those are some very impressive numbers, I wouldn't be kicking myself too hard.

> An important result is that unlike what I thought, *it is not always better
> to parallelize at the highest level*. For instance, if a user has to do 4
> large matrix products, it is better to do them sequentially and let the
> underlying products manage the parallelization ! So this is a strong
> argument to enable good vectorization in Eigen as it cannot optimaly be done
> at the user level.

I agree strongly with this statement. Of course I think this might
mean I need to roll my sleeves up and start helping :)

hehe!

> To this end, I replaced the copies/packings to B' and A' by calls to memset
> on B' and A' respectiveley => then I got a perfect x4 meaning the optimized
> kernel is already perfect and that the performance penalty is entirely due
> to the (non sequential) reads from B_j and A_ij. I tried to disable only one
> of the two and I confirm that both are equally responsible for this
> slowdown. (With a single thread they represent about 5% of the computation.)

Again, your notation for the sub-block of A is slightly confusing, I
wish I had taken more notes on your approach, I'll review Goto's
paper.

typo again. If you prefer I meant B_k and A_ik where j and k are sub-block indices, not coeff indices but I know you got that right ;)

If I understand correctly, your scaling performance is suffering due
to the increased time taken to copy into the 'optimal panels', not due
to computation. If we can improve the efficiency of preparing the
panels, we'll have solved the scalability problem. Of course, this is
easier said than done.

> C1 = A * B1
> C2 = A * B2
>
> is not very good because A is entirely shared between the two threads and so
> the packing of the A_ij is done twice => *it is pretty clear the the threads
> have to talk* :(

Actually, this happens in the 2x2 case as well:

> C11 = A1*B1
> C12 = A1*B2
> C21 = A2*B1
> C22 = A2*B2

Each thread needs one of A1 or A2, and one of B1 or B2. This implies
that each of the sub-matrices from A and B gets packed twice. I agree
that we have to be careful about avoiding redundant copies when we
can.

arf! I feel so stupid... somehow I got confused by all the experiments I made. Anyway, so good news because at least now we know what is the problem :)

What CPU architectures are you performing your tests on? I have
access to two 8-core Intel architectures here (X5550 and E5420), and
the 4-core IBM BlueGene architecture may have a chance at working with
Eigen once we get XLC upgraded. Also, are you still using OpenMP or
do you want to consider other options (pthreads, Boost.threads)? Is
your code pushed to the branch?

Quad core2 (penryn). I have to check but I'm pretty sure that around me there are some 8 cores machines. Regarding the underlying threading API, I'd say that as long as OpenMP fulfills our needs we should stick to it, but I fear it won't be flexible enough to allow the threads to talk well and quick. Well, we'll see.

My first thoughts are on getting set up with benchmarks on both
machines so that I can verify performance, as I suspect things will
vary between the X5550 and the E5420. I'm most concerned with new
architectures such as Nehalem (i.e., the X5550), so I will probably be
tuning and thinking in that direction.

I also plan to stick with GNU compilers for now, as KAUST's setup for
the Intel compilers is a bit sticky at the moment.

Here gcc 4.4 and icc equally perform for this particular case.

Regards,
Aron

On Thu, Feb 25, 2010 at 2:00 PM, Gael Guennebaud

<gael.guennebaud@xxxxxxxxx> wrote:
>
> Ok, after some experiments it seems that the naive splitting strategy I've
> implemented does not scale very well:
>
> product sizes (rows,cols,depth): 2048x2048x2048
>
> GFLOPS:
>
> #cores:   1 2     4
> Eigen:    19   35   60
> GOTO:    20   39   75
>
> So first let me start by briefly review of the way the product is
> implemented. Let's consider C = A * B, and to simplify I'll instantiate the
> product for matrices of size 2048x2048:
>
> for each horizontal panel B_k of B of size 256x2048 do
> copy/pack B_k to B'
> let A_k be the corresponding 2048x256 vertical panel of A
> for each block A_ik of A_k of size 512x256 do
>     copy/pack A_ji to A'
>     compute C_i += A' * B' using an optimized kernel
> endfor
> endfor
>
> For a quad cores, the actual parallelization strategy is:
> - split B into B1 and B2 each of size 2048x1024
> - split A into A1 and A2 each of size 1024x2048
> - split C into C11 .. C22 each of size 1024x1024
> - do the 4 non-dependent matrix products in parallel:
> C11 = A1*B1
> C12 = A1*B2
> C21 = A2*B1
> C22 = A2*B2
>
> Now if I use this strategy with GOTO (i.e., I instantiate 4 different
> products in parallel using OpenMP and I make sure that GOTO does not try to
> parallelize), then I get really bad performance: 35Gflops (while Eigen does
> the same at 60Gflops). On the other hand, if I do these 4 products
> sequentially, and let GOTO parallelize (4threads), then GOTO yields 70Gflops
> !
>
> An important result is that unlike what I thought, *it is not always better
> to parallelize at the highest level*. For instance, if a user has to do 4
> large matrix products, it is better to do them sequentially and let the
> underlying products manage the parallelization ! So this is a strong
> argument to enable good vectorization in Eigen as it cannot optimaly be done
> at the user level.
>
> Then I tried to understand why this "naive" parallelization strategy does
> not scale very well. Indeed, in the case of 4 threads, all the data are
> fully independent and it is equalivalent to do 4 smaller products in
> parallel. So what ?
>
> To this end, I replaced the copies/packings to B' and A' by calls to memset
> on B' and A' respectiveley => then I got a perfect x4 meaning the optimized
> kernel is already perfect and that the performance penalty is entirely due
> to the (non sequential) reads from B_j and A_ij. I tried to disable only one
> of the two and I confirm that both are equally responsible for this
> slowdown. (With a single thread they represent about 5% of the computation.)
>
> If I replace my copy/pack function calls to the GOTO's ones (hand written
> assembly with more aggressive unrolling and prefetching), then I get a
> slight speedup but not that much. So we really have to work at a higher
> level, i.e., how to better distribute the works across the different threads
> ? But now, I'm clueless...
>
> One final remark though: With 4 threads, the naive splitting strategy
> sounded really not too bad because all submatrices involved are 100%
> independent. On the other hand, with only 2 threads, it is pretty clear that
> simply splitting the product into two, e.g.:
>
> C1 = A * B1
> C2 = A * B2
>
> is not very good because A is entirely shared between the two threads and so
> the packing of the A_ij is done twice => *it is pretty clear the the threads
> have to talk* :(
>
>
> that's all for now.
>
>
> gael
>
>
> On Tue, Feb 23, 2010 at 12:34 PM, Aron Ahmadia <aja2111@xxxxxxxxxxxx> wrote:
>>
>> Oh very cool! I am going to start playing with this very soon. I
>> want to look into getting some automatic numbers up from each build so
>> that it's easy to see how the code is performing across several
>> different HPC architectures (BlueGene/P, Nehalem, Clovertown). The
>> bench code should be what I'm looking for.
>>
>> I'm curious about the difference between OpenMP and pthreads. I know
>> there is a bit of overthread in OpenMP but I didn't realize it would
>> be so meaningful. This will of course vary between architectures and
>> compilers, but I will query the IBM Researchers (Gunnels is more of an
>> expert with both pthreads and OpenMP and worked in van de Geijn's
>> group so may have more direct thoughts).
>>
>> Warm Regards,
>> Aron
>>
>> On Mon, Feb 22, 2010 at 11:43 AM, Hauke Heibel
>> <hauke.heibel@xxxxxxxxxxxxxx> wrote:
>> > Works like a charm - see attachement.
>> >
>> > - Hauke
>> >
>> > On Mon, Feb 22, 2010 at 11:28 AM, Gael Guennebaud
>> > <gael.guennebaud@xxxxxxxxx> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I have just created a fork there:
>> >>
>> >> http://bitbucket.org/ggael/eigen-smp
>> >>
>> >> to play with SMP support, and more precisely, with OpenMP..
>> >>
>> >> Currently only the general matrix-matrix product is parallelized. I've
>> >> implemented a general 1D parallelizer to factor the parallelization
>> >> code. It
>> >> is defined there:
>> >>
>> >> Eigen/src/Core/products/Parallelizer.h
>> >>
>> >> and used at the end of this file:
>> >> Eigen/src/Core/products/GeneralMatrixMatrix.h
>> >>
>> >> In the bench/ folder there are two bench_gemm*.cpp files to try it and
>> >> compare to BLAS.
>> >>
>> >> On my core2 duo, I've observed a speed up around 1.9 for relatively
>> >> small
>> >> matrices.
>> >>
>> >> At work I have an "Intel(R) Core(TM)2 Quad CPU    Q9400 @ 2.66GHz" but
>> >> it
>> >> is currently too busy to do really meaningfull experiments.
>> >> Nevertheless, it
>> >> seems that gotoblas, which is directly using pthread, reports more
>> >> consistent speedups. So perhaps OpenMP is trying to do some too smart
>> >> scheduling and it might be useful to directly deal with pthread?
>> >>
>> >> For the lazy but interested reader, the interesting piece of code is
>> >> there:
>> >>
>> >> int threads = omp_get_num_procs();
>> >> int blockSize = size / threads;
>> >> #pragma omp parallel for schedule(static,1)
>> >> for(int i=0; i<threads; ++i)
>> >> {
>> >>     int blockStart = i*blockSize;
>> >>     int actualBlockSize = std::min(blockSize, size - blockStart);
>> >>
>> >>     func(blockStart, actualBlockSize);
>> >> }
>> >>
>> >> feelfree to play with it and have fun!
>> >>
>> >> Gael.
>> >>
>> >> PS: to Aron, yesterday the main pb was that our benchtimer reported the
>> >> total execution time, not the "real" one... the other was that my
>> >> system was
>> >> a bit buzy because of daemons, and stuff like that...
>> >>
>> >
>>
>>
>
>