Re: [eigen] Automatic cache and block size determination



2010/6/22 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
> 2010/6/22 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>> Well, let me give you some numbers.
>>
>> The time to query the L1 and L2 cache sizes at runtime is 0.5ms. This
>> is done only once per execution of your software (in case you perform
>> matrix products on dynamic size matrices).
>
> It is also the case if one performs matrix products on large enough
> fixed-size matrices, where large enough depends on the architecture
> but is 8x8 on ARM NEON.

By the way, this makes me think that the TODO item "make sure that
products of fixed-size matrices don't cause malloc" is still to be
done, and is really important! Since the low-level cache-friendly
functions take as few template parameters as possible (only the Scalar
type), they need to take an additional workspace pointer argument, so
that they don't have to allocate the workspace themselves...
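Concretely, I'm thinking of a signature along these lines (just a
sketch; the exact name and parameters here are made up):

template<typename Scalar>
void ei_cache_friendly_product(
    std::ptrdiff_t rows, std::ptrdiff_t cols, std::ptrdiff_t depth,
    const Scalar* lhs, const Scalar* rhs, Scalar* res,
    Scalar* workspace); // provided by the caller; for small fixed
                        // sizes this can be a buffer on the stack,
                        // so no malloc ever happens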

> Imagine an embedded application that just
> wants to do 8x8 fixed-size matrix products --- it shouldn't pay the
> price of querying cache sizes.
>
> Embedded people are the kind of people who try not to #include <iostream>
> because of the cost of initializing the std::cout object...
>
> A few options:
>  - add an option as described by Bernard
>  - or make sure to only use the blocked product on actual Dynamic sizes
>  - or introduce a second threshold under which CPU cache sizes aren't
> queried and that level of blocking isn't performed. After all, your
> matrix product code has 2 levels of blocking, and for 8x8 matrices,
> which obviously fit many times over in the CPU cache, only the first
> level of blocking is performed.
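That third option could be a small compile-time helper; a sketch (the
name and the 8x8 threshold are invented for illustration):

// decide at compile time whether the cache-size query is worth it;
// Dynamic is Eigen's usual "unknown at compile time" marker
template<int MaxSize>
struct ei_product_needs_cache_query
{
  enum { ret = (MaxSize == Dynamic) || (MaxSize > 8) };
};

// in the product: only large or dynamic sizes pay for the query
// if (ei_product_needs_cache_query<MatrixType::MaxSizeAtCompileTime>::ret)
//   ei_manage_caching_sizes(GetAction, &l1, &l2);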
>
> Benoit
>
>>
>> The overhead of checking that these queries have already been done is
>> only one "if(something==0)". This is completely negligible compared to
>> all the other computations which have to be carried out before doing the
>> actual matrix product. They include the computation of the block sizes
>> (which depend on the sizes of the matrices):
>>
>> std::ptrdiff_t l1, l2;
>> ei_manage_caching_sizes(GetAction, &l1, &l2); // cost = 1 "if" (cheap)
>> // kdiv is a power of 2 => this is 1 bit shift (cheap)
>> k = std::min<std::ptrdiff_t>(k, l1/kdiv);
>> // this integer division cannot be avoided even if L1 and L2 are
>> // known at compile time
>> std::ptrdiff_t _m = l2/(4 * sizeof(LhsScalar) * k);
>> if(_m<m) m = _m & mr_mask;
>>
>> then we have several allocations of the blocks:
>>
>> Scalar* blockA = ei_aligned_stack_new(Scalar, kc*mc);
>> std::size_t sizeB = kc*Blocking::PacketSize*Blocking::nr + kc*cols;
>> Scalar* allocatedBlockB = ei_aligned_stack_new(Scalar, sizeB);
>> Scalar* blockB = allocatedBlockB + kc*Blocking::PacketSize*Blocking::nr;
>>
>> then the data are copied into these blocks,
>>
>> etc.
>>
>> So really, I think this little "if" is totally negligible.
>>
>> What could be really useful, however, is a way to instantiate a
>> "matrix product object" with some information on the maximal and/or
>> typical matrix sizes we are considering, so that all of the above
>> initialization cost can be avoided when doing many matrix products on
>> matrices having the same sizes. For instance, this could be useful for
>> blocked decompositions.
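Something like this, maybe (a rough sketch; this class and its use of
the allocation helpers are hypothetical, not existing API):

template<typename Scalar>
struct ei_product_blocking
{
  std::ptrdiff_t kc, mc; // block sizes along the depth and the rows
  Scalar* blockA;        // packing buffer for the lhs, allocated once

  ei_product_blocking(std::ptrdiff_t rows, std::ptrdiff_t depth)
  {
    std::ptrdiff_t l1, l2;
    ei_manage_caching_sizes(GetAction, &l1, &l2); // paid once per object
    kc = std::min<std::ptrdiff_t>(depth, l1/(2*sizeof(Scalar)));
    mc = std::min<std::ptrdiff_t>(rows, l2/(4*sizeof(Scalar)*kc));
    blockA = static_cast<Scalar*>(ei_aligned_malloc(sizeof(Scalar)*kc*mc));
  }
  ~ei_product_blocking() { ei_aligned_free(blockA); }
};

A blocked decomposition could then build one of these up front and
reuse it for every panel-times-panel product of the same shape.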
>>
>> gael
>>
>>
>> On Tue, Jun 22, 2010 at 9:39 AM,  <bernard.hugueney@xxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> On Tue, 22 Jun 2010 00:46:22 +0200, Thomas Capricelli wrote:
>>>
>>>> I guess there is a very small runtime cost of 'checking if the cache
>>>> sizes have already been computed' for every product, right?
>>>> And also, computations involving cache sizes were previously done at
>>>> compile time and are not anymore... Nothing to worry about?
>>>
>>> If possible, it would be best to have a #define L2_CACHE_SIZE that
>>> would default to a runtime query at static initialization time but
>>> could be set when compiling.
>>>
>>> When set at compile time, a typed value (as in [0]) would enable
>>> meta-programming unrolling; when left to runtime, doing the query at
>>> static initialization time would avoid polluting other code with a
>>> check that is run only once.
>>>
>>> My .2€
>>>
>>> Best regards,
>>>
>>> Bernard
>>>
>>> [0] http://www.boost.org/doc/libs/1_43_0/libs/mpl/doc/refmanual/int.html
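For reference, Bernard's suggestion could look roughly like this
(EIGEN_L2_CACHE_SIZE is a made-up macro name, just to illustrate):

#ifdef EIGEN_L2_CACHE_SIZE
  // set by the user at compile time: a constant the compiler can fold,
  // and which meta-programming could use to pick block sizes statically
  const std::ptrdiff_t l2 = EIGEN_L2_CACHE_SIZE;
#else
  // otherwise, query once at runtime (at static initialization time,
  // as Bernard suggests, so later code never needs the "if" check)
  std::ptrdiff_t l1dummy, l2;
  ei_manage_caching_sizes(GetAction, &l1dummy, &l2);
#endif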


