Re: [eigen] Automatic cache and block size determination

2010/6/22 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
> Well, let me give you some numbers.
>
> The time to query the L1 and L2 cache sizes at runtime is 0.5ms. This
> is done only once per execution of your software (in case you perform
> matrix products on dynamic size matrices).

It is also the case if one performs matrix products on large enough
fixed-size matrices, where large enough depends on the architecture
but is 8x8 on ARM NEON. Imagine an embedded application that just
wants to do 8x8 fixed-size matrix products --- it shouldn't pay the
price of querying cache sizes.

Embedded developers are the kind of people who try not to
#include <iostream> because of the cost of initializing the std::cout
object...

A few options:
 - add an option as described by Bernard
 - or make sure to only use blocked product on actual Dynamic sizes
 - or introduce a second threshold under which CPU cache sizes aren't
queried and that level of blocking isn't performed. After all, your
matrix product code has 2 levels of blocking, and for 8x8 matrices,
which obviously fit many times over in the CPU cache, only the first
level of blocking is performed.
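
That third option could be sketched roughly like this (all names here
are illustrative stand-ins, not actual Eigen internals, and the
threshold value is an assumption):

```cpp
#include <algorithm>
#include <cstddef>

// Assumed cutoff: below this size, skip the runtime cache query entirely.
const std::ptrdiff_t kCacheQueryThreshold = 16;

std::ptrdiff_t queryL1CacheSize()
{
    // Stand-in for the one-time (~0.5ms) runtime CPUID query.
    return 32 * 1024;
}

std::ptrdiff_t blockSizeFor(std::ptrdiff_t rows, std::ptrdiff_t cols)
{
    if (rows <= kCacheQueryThreshold && cols <= kCacheQueryThreshold)
        return std::max(rows, cols); // small product: no query, single-level blocking

    static const std::ptrdiff_t l1 = queryL1CacheSize(); // queried once, cached
    return std::min<std::ptrdiff_t>(rows, l1 / 8);       // kdiv = 8: one bit shift
}
```

An 8x8 product would then never touch the cache-size machinery, while
large dynamic-size products pay the query exactly once.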

Benoit

>
> The overhead of testing whether these queries have already been done
> is only one "if(something==0)". This is completely negligible compared
> to all the other computations which have to be carried out before
> doing the actual matrix product. They include the computation of the
> block sizes (which depend on the sizes of the matrices):
>
> std::ptrdiff_t l1, l2;
> ei_manage_caching_sizes(GetAction, &l1, &l2); // cost = 1 "if" (cheap)
> k = std::min<std::ptrdiff_t>(k, l1/kdiv); // kdiv is a power of 2 => 1 bit shift (cheap)
> std::ptrdiff_t _m = l2/(4 * sizeof(LhsScalar) * k); // this integer division cannot be avoided even if L1 and L2 are known at compile time
> if(_m<m) m = _m & mr_mask;
>
> then we have several allocations of the blocks:
>
> Scalar* blockA = ei_aligned_stack_new(Scalar, kc*mc);
> std::size_t sizeB = kc*Blocking::PacketSize*Blocking::nr + kc*cols;
> Scalar* allocatedBlockB = ei_aligned_stack_new(Scalar, sizeB);
> Scalar* blockB = allocatedBlockB + kc*Blocking::PacketSize*Blocking::nr;
>
> then the data are copied into these blocks,
>
> etc.
>
> So really, I think this little "if" is totally negligible.
>
> What could be really useful, however, is a way to instantiate a
> "matrix product object" with some information on the maximal and/or
> typical matrix sizes we are considering such that all the above
> initialization cost can be avoided when doing many matrix products on
> matrices having the same sizes. For instance, this could be useful for
> blocked decompositions.
>
> gael
>
>
> On Tue, Jun 22, 2010 at 9:39 AM,  <bernard.hugueney@xxxxxxxxxx> wrote:
>>
>> Hi,
>>
>> On Tue, 22 Jun 2010 00:46:22 +0200, Thomas Capricelli wrote:
>>
>>> I guess there is a very small runtime cost of 'checking if the cache
>>> sizes have already been computed' for every product, right ?
>>> And also, computations involving cache sizes were previously done at
>>> compile time and not anymore.. ? Nothing to worry about ?
>>
>> If possible, it would be best to have a #define L2_CACHE_SIZE that
>> would default to a runtime query at static initialization time but
>> could be set when compiling.
>>
>> When set at compile time, a typed value (as in [0]) would enable
>> meta-programming unrolling; when set at runtime, doing the query at
>> static initialization time would avoid polluting other code with a
>> check that runs only once.
>>
>> My .2€
>>
>> Best regards,
>>
>> Bernard
>>
>> [0] http://www.boost.org/doc/libs/1_43_0/libs/mpl/doc/refmanual/int.html
>>
>
>
>
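
The "matrix product object" Gael describes could look something like
this sketch, which precomputes the blocking parameters once for a fixed
problem size and reuses them across many products (all names are
hypothetical, not Eigen API; the arithmetic mirrors the snippet quoted
above with kdiv = 8 and mr = 4 assumed):

```cpp
#include <algorithm>
#include <cstddef>

struct ProductBlocking
{
    std::ptrdiff_t kc, mc; // cached block sizes, computed once

    ProductBlocking(std::ptrdiff_t rows, std::ptrdiff_t depth,
                    std::ptrdiff_t l1, std::ptrdiff_t l2,
                    std::ptrdiff_t scalarSize)
    {
        // Same arithmetic as in Gael's snippet, done once at construction
        // instead of on every product call.
        kc = std::min<std::ptrdiff_t>(depth, l1 / 8);  // kdiv = 8: bit shift
        std::ptrdiff_t m = l2 / (4 * scalarSize * kc); // the unavoidable division
        mc = std::min(rows, m & ~std::ptrdiff_t(3));   // round down to mr = 4
    }
};
```

A blocked decomposition working on fixed-size panels could construct
one such object up front and hand it to every inner product call,
skipping both the cache query check and the block-size arithmetic.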


