Re: [eigen] Automatic cache and block size determination



On Tue, Jun 22, 2010 at 2:12 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> 2010/6/22 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>> 2010/6/22 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>>> Well, let me give you some numbers.
>>>
>>> The time to query the L1 and L2 cache sizes at runtime is 0.5ms. This
>>> is done only once per execution of your software (in case you perform
>>> matrix products on dynamic size matrices).
>>
>> It is also the case if one performs matrix products on large enough
>> fixed-size matrices, where large enough depends on the architecture
>> but is 8x8 on ARM NEON.
>
> By the way, this makes me think that the TODO item "make sure that
> products of fixed-size matrices don't cause a malloc" is still to be
> done, and is really important! Since the low-level cache-friendly
> functions take as few template params as possible (only the Scalar
> type), they need to take an additional workspace pointer argument, so
> that they don't need to allocate the workspace themselves...

This is actually more complicated than we initially thought, mainly
because parallelization, blocking strategies, and the storage order of
the result matrix are all interdependent. For instance, the block of
the rhs is shared between the threads, while we need one block of the
lhs per thread. If the result matrix is row major, then the rhs becomes
the lhs... Those are just a few examples to show the complexity of
the problem.
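
To make the storage-order point concrete, here is a minimal standalone
sketch (not Eigen's actual code, and assuming for simplicity that all
three matrices share the same storage order): a column-major-only
kernel handles a row-major problem by swapping its operands, which is
exactly why "the rhs becomes the lhs".

#include <cstddef>

// Toy column-major kernel:
// C (rows x cols) += A (rows x depth) * B (depth x cols), all column-major.
void gemm_col_major(const double* A, const double* B, double* C,
                    std::ptrdiff_t rows, std::ptrdiff_t depth,
                    std::ptrdiff_t cols)
{
  for(std::ptrdiff_t j = 0; j < cols; ++j)
    for(std::ptrdiff_t k = 0; k < depth; ++k)
      for(std::ptrdiff_t i = 0; i < rows; ++i)
        C[i + j*rows] += A[i + k*rows] * B[k + j*depth];
}

// A row-major product is the transposed column-major product
// C^T = B^T * A^T, so the operands get swapped: the rhs becomes the
// lhs, and with it the choice of which block is shared across threads.
void gemm(const double* A, const double* B, double* C,
          std::ptrdiff_t rows, std::ptrdiff_t depth, std::ptrdiff_t cols,
          bool row_major)
{
  if(row_major) gemm_col_major(B, A, C, cols, depth, rows);
  else          gemm_col_major(A, B, C, rows, depth, cols);
}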

So, is it OK to assume that when the three dimensions of the product
are all bounded at compile time, the matrices are too small for
blocking and/or for parallelization? On the other hand, if at least
one dimension is not bounded, then we proceed as we currently do.
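
In code, such a compile-time test could look like the following sketch
(product_path_selector and its enum entries are made-up names; Dynamic
and the MaxXxxAtCompileTime constants are the real Eigen ones):

#include <Eigen/Core>
using Eigen::Dynamic;

template<typename Lhs, typename Rhs>
struct product_path_selector
{
  enum {
    // the three dimensions of the product: rows, depth, cols
    AllBoundedAtCompileTime =
         int(Lhs::MaxRowsAtCompileTime) != Dynamic
      && int(Lhs::MaxColsAtCompileTime) != Dynamic
      && int(Rhs::MaxColsAtCompileTime) != Dynamic,
    // all three bounded => assume the product is too small to be worth
    // blocking or parallelization
    UseStaticPath = AllBoundedAtCompileTime
  };
};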

I know this is not ideal, but as I said, an ideal solution might be
very, very complicated to implement. I also think there is no general
"ideal" solution anyway.

Basically, the idea is to add a third implementation for such
relatively large (from 8^2 to ~32^2) fixed-size matrices, in between
the coefficient-based product and the fully blocked/parallelized one.
It would of course still reuse the packing routines and product
kernels, but skip all the runtime blocking and parallelization logic
and employ purely static allocation.
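
A rough sketch of that middle path (ei_pack_lhs, ei_pack_rhs and
ei_gebp are stand-ins for the existing packing routines and product
kernel, not their real signatures):

template<typename Scalar, int Rows, int Depth, int Cols>
void ei_product_static_blocking(const Scalar* lhs, const Scalar* rhs,
                                Scalar* res)
{
  // All sizes are compile-time constants, so the packed copies of the
  // operands fit into fixed-size stack buffers: no cache-size query,
  // no runtime blocking logic, no malloc.
  EIGEN_ALIGN16 Scalar blockA[Rows * Depth];
  EIGEN_ALIGN16 Scalar blockB[Depth * Cols];

  ei_pack_lhs(blockA, lhs, Rows, Depth); // reuse the existing packing...
  ei_pack_rhs(blockB, rhs, Depth, Cols);
  ei_gebp(res, blockA, blockB, Rows, Depth, Cols); // ...and the kernel
}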


cheers,

gael

>
>> Imagine an embedded application that just
>> wants to do 8x8 fixed-size matrix products --- it shouldn't pay the
>> price of querying cache sizes.
>>
>> Embedded people are those people who try not to #include <iostream>
>> because of the cost of initializing the std::cout object...
>>
>> A few options:
>>  - add an option as described by Bernard;
>>  - or make sure to only use the blocked product on actual Dynamic sizes;
>>  - or introduce a second threshold under which CPU cache sizes aren't
>> queried and that level of blocking isn't performed. After all, your
>> matrix product code has 2 levels of blocking, and for 8x8 matrices,
>> which obviously fit many times in the CPU cache, only the first level
>> of blocking is performed.
>>
>> Benoit
>>
>>>
>>> The overhead of testing whether these queries have already been done
>>> is only one "if(something==0)". This is completely negligible compared
>>> to all the other computations which have to be carried out before
>>> doing the actual matrix product. These include the computation of the
>>> block sizes (which depends on the sizes of the matrices):
>>>
>>> std::ptrdiff_t l1, l2;
>>> ei_manage_caching_sizes(GetAction, &l1, &l2); // cost = 1 "if" (cheap)
>>> // kdiv is a power of 2 => 1 bit shift (cheap)
>>> k = std::min<std::ptrdiff_t>(k, l1/kdiv);
>>> // this integer division cannot be avoided even if L1 and L2 are
>>> // known at compile time
>>> std::ptrdiff_t _m = l2/(4 * sizeof(LhsScalar) * k);
>>> if(_m<m) m = _m & mr_mask;
>>>
>>> then we have several allocations of the blocks:
>>>
>>> Scalar* blockA = ei_aligned_stack_new(Scalar, kc*mc);
>>> std::size_t sizeB = kc*Blocking::PacketSize*Blocking::nr + kc*cols;
>>> Scalar* allocatedBlockB = ei_aligned_stack_new(Scalar, sizeB);
>>> Scalar* blockB = allocatedBlockB + kc*Blocking::PacketSize*Blocking::nr;
>>>
>>> then the data are copied into these blocks,
>>>
>>> etc.
>>>
>>> So really, I think this little "if" is totally negligible.
>>>
>>> What could be really useful, however, is a way to instantiate a
>>> "matrix product object" with some information on the maximal and/or
>>> typical matrix sizes we are considering such that all the above
>>> initialization cost can be avoided when doing many matrix products on
>>> matrices having the same sizes. For instance, this could be useful for
>>> blocked decompositions.
>>>
>>> gael
>>>
>>>
>>> On Tue, Jun 22, 2010 at 9:39 AM, <bernard.hugueney@xxxxxxxxxx> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On Tue, 22 Jun 2010 00:46:22 +0200, Thomas Capricelli wrote:
>>>>
>>>>> I guess there is a very small runtime cost of 'checking if the cache
>>>>> sizes have already been computed' for every product, right ?
>>>>> And also, computations involving cache sizes were previously done at
>>>>> compile time and not anymore... ? Nothing to worry about ?
>>>>
>>>> If possible, it would be best to have a #define L2_CACHE_SIZE that
>>>> would default to a runtime query at static initialization time but
>>>> could be set when compiling.
>>>>
>>>> When set at compile time, a typed value (as in [0]) would enable
>>>> meta-programming unrolling; when set at runtime, doing the query at
>>>> static initialization time would avoid polluting other code with a
>>>> check that is run only once.
>>>>
>>>> My .2€
>>>>
>>>> Best regards,
>>>>
>>>> Bernard
>>>>
>>>> [0] http://www.boost.org/doc/libs/1_43_0/libs/mpl/doc/refmanual/int.html


