Re: [eigen] Automatic cache and block size determination
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Automatic cache and block size determination
- From: Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
- Date: Tue, 22 Jun 2010 17:51:38 -0400
2010/6/22 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
> On Tue, Jun 22, 2010 at 2:12 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>> 2010/6/22 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>> 2010/6/22 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>>>> Well, let me give you some numbers.
>>>>
>>>> The time to query the L1 and L2 cache sizes at runtime is 0.5ms. This
>>>> is done only once per execution of your software (in case you perform
>>>> matrix products on dynamic-size matrices).
>>>
>>> It is also done if one performs matrix products on large enough
>>> fixed-size matrices, where "large enough" depends on the architecture
>>> but is only 8x8 on ARM NEON.
>>
>> By the way, this makes me think that the TODO item "make sure that
>> products of fixed-size matrices don't cause malloc" is still to be
>> done, and is really important! Since the low-level cache-friendly
>> functions take as few template parameters as possible (only the Scalar
>> type), they need to take an additional workspace pointer argument, so
>> that they don't allocate the workspace themselves...
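For illustration, such a workspace-taking kernel could be declared as in
the sketch below; gemm_kernel and its parameter list are hypothetical
names, not Eigen's actual internals:

template <typename Scalar>
void gemm_kernel(const Scalar* lhs, const Scalar* rhs, Scalar* res,
                 int rows, int cols, int depth,
                 Scalar* workspace); // packing buffers, owned by the caller

// A fixed-size caller can then hand in a stack buffer, so no malloc:
//   float ws[2 * 8 * 8]; // room for packed copies of lhs and rhs
//   gemm_kernel(lhsPtr, rhsPtr, resPtr, 8, 8, 8, ws);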
>
> This is actually more complicated than we initially thought, mainly
> because parallelization, blocking strategies, and the storage order of
> the result matrix are all interdependent. For instance, the block on
> the rhs is shared between the threads, while we need one block on the
> lhs per thread. If the result matrix is row-major, then the rhs becomes
> the lhs... These are just a few examples to show the complexity of
> the problem.
>
> So, is it OK to assume that when the three dimensions of the product
> are all bounded at compile time, the matrices are too small for
> blocking and/or for parallelization? On the other hand, if at least
> one dimension is not bounded, then we proceed as we currently do.
Yes, this is OK. In the worst case we lose some performance, and that
case (large fixed sizes) is rare and discouraged anyway.
On the other hand, the current situation seems much worse to me: we
don't currently honor our promise that fixed sizes never cause
mallocs!
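For illustration, the dispatch Gael describes could be expressed with a
small compile-time predicate; the name use_blocked_product is made up,
but Eigen does encode unknown dimensions as Dynamic:

#include <Eigen/Core>

// If any of the three product dimensions (rows, cols, depth) is
// Dynamic, take the blocked/parallelized path; if all three are
// bounded at compile time, take the static, allocation-free path.
template <int Rows, int Cols, int Depth>
struct use_blocked_product
{
  enum { value = (Rows == Eigen::Dynamic
               || Cols == Eigen::Dynamic
               || Depth == Eigen::Dynamic) };
};

// use_blocked_product<8, 8, 8>::value              -> 0 (static path)
// use_blocked_product<Eigen::Dynamic, 8, 8>::value -> 1 (blocked path)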
>
> I know this is not ideal, but as I said, an ideal solution might be
> very very complicated to implement. I also think there is no general
> "ideal" solution anyway.
Right. What you propose above is very close to ideal, IMO. What's the
point of a 100x100 fixed size, anyway?
> Basically, the idea is to add a third implementation for such
> relatively large (from 8^2 to ~32^2) fixed-size matrices, in between
> the coefficient-based product and the fully blocked/parallelized one.
Is it worth it? Why not just say that fixed sizes are optimized for
small matrices (smaller than 16x16 at the very most)? If people want
large matrices without mallocs, they have other options, like
specifying Max sizes or using Map() on a pre-allocated buffer.
Benoit
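Both alternatives are already expressible with Eigen's public API; for
instance (the Max sizes are the 5th and 6th template parameters of
Matrix):

#include <Eigen/Core>

// Dynamic sizes with compile-time upper bounds: the storage is a
// fixed-size array embedded in the object, never on the heap.
typedef Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, 0, 32, 32>
        Matrix32f;

void noHeapStorage()
{
  Matrix32f a(20, 20); // runtime size 20x20, storage embedded

  // Map() over a buffer that the caller manages:
  float buffer[16 * 16];
  Eigen::Map<Eigen::MatrixXf> m(buffer, 16, 16);
  m.setZero();
}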
> Of course it would still reuse the packing and product kernels, but
> skip all the runtime logic of the blocking and parallelization and
> employ purely static allocation.
>
>
> cheers,
>
> gael
>
>>
>>> Imagine an embedded application that just
>>> wants to do 8x8 fixed-size matrix products --- it shouldn't pay the
>>> price of querying cache sizes.
>>>
>>> Embedded people are the people who try not to #include <iostream>
>>> because of the cost of initializing the std::cout object...
>>>
>>> A few options:
>>> - add an option as described by Bernard
>>> - or make sure to only use the blocked product on actual Dynamic sizes
>>> - or introduce a second threshold under which CPU cache sizes aren't
>>> queried and that level of blocking isn't performed. After all, your
>>> matrix product code has 2 levels of blocking, and for 8x8 matrices,
>>> which obviously fit many times over in the CPU cache, only the first
>>> level of blocking is performed.
>>>
>>> Benoit
>>>
>>>>
>>>> The overhead of testing whether these queries have already been done
>>>> is a single "if(something==0)". This is completely negligible compared
>>>> to all the other computations which have to be carried out before
>>>> doing the actual matrix product. They include the computation of the
>>>> block sizes (which depend on the sizes of the matrices):
>>>>
>>>> std::ptrdiff_t l1, l2;
>>>> ei_manage_caching_sizes(GetAction, &l1, &l2); // cost = 1 "if" (cheap)
>>>> // kdiv is a power of 2, so this division is 1 bit shift (cheap)
>>>> k = std::min<std::ptrdiff_t>(k, l1/kdiv);
>>>> // this integer division cannot be avoided even if L1 and L2 are
>>>> // known at compile time
>>>> std::ptrdiff_t _m = l2/(4 * sizeof(LhsScalar) * k);
>>>> if(_m<m) m = _m & mr_mask;
>>>>
>>>> then we have several allocations of the blocks:
>>>>
>>>> Scalar* blockA = ei_aligned_stack_new(Scalar, kc*mc);
>>>> std::size_t sizeB = kc*Blocking::PacketSize*Blocking::nr + kc*cols;
>>>> Scalar* allocatedBlockB = ei_aligned_stack_new(Scalar, sizeB);
>>>> Scalar* blockB = allocatedBlockB + kc*Blocking::PacketSize*Blocking::nr;
>>>>
>>>> then the data are copied into these blocks,
>>>>
>>>> etc.
>>>>
>>>> So really, I think this little "if" is totally negligible.
>>>>
>>>> What could be really useful, however, is a way to instantiate a
>>>> "matrix product object" with some information on the maximal and/or
>>>> typical matrix sizes we are considering, such that all the above
>>>> initialization cost can be avoided when doing many matrix products on
>>>> matrices having the same sizes. For instance, this could be useful for
>>>> blocked decompositions.
>>>>
>>>> gael
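Such a reusable "product object" could look roughly like the sketch
below; the class name and members are invented for illustration:

#include <cstddef>

// Hypothetical: cache queries, block-size computation and workspace
// allocation happen once in the constructor, then many same-sized
// products (e.g. inside a blocked decomposition) reuse them.
template <typename Scalar>
class ProductContext
{
public:
  ProductContext(std::ptrdiff_t rows, std::ptrdiff_t cols,
                 std::ptrdiff_t depth);
  ~ProductContext(); // frees the packing buffers once

  void run(const Scalar* lhs, const Scalar* rhs, Scalar* res);

private:
  std::ptrdiff_t kc, mc; // block sizes derived from L1/L2
  Scalar* blockA;        // packed lhs block (one per thread)
  Scalar* blockB;        // packed rhs block (shared between threads)
};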
>>>>
>>>>
>>>> On Tue, Jun 22, 2010 at 9:39 AM, <bernard.hugueney@xxxxxxxxxx> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On Tue, 22 Jun 2010 00:46:22 +0200, Thomas Capricelli wrote:
>>>>>
>>>>>> I guess there is a very small runtime cost of 'checking if the
>>>>>> cache sizes have already been computed' for every product, right?
>>>>>> And also, computations involving cache sizes were previously done
>>>>>> at compile time and not anymore? Nothing to worry about?
>>>>>
>>>>> If possible, it would be best to have a #define L2_CACHE_SIZE that
>>>>> would default to a runtime query at static initialization time but
>>>>> could be set when compiling.
>>>>>
>>>>> When set at compile time, a typed value (as in [0]) would enable
>>>>> meta-programming unrolling; when set at runtime, doing the query at
>>>>> static initialization time would avoid polluting other code with a
>>>>> check that is run only once.
>>>>>
>>>>> My .2€
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Bernard
>>>>>
>>>>> [0] http://www.boost.org/doc/libs/1_43_0/libs/mpl/doc/refmanual/int.html
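A sketch of what Bernard suggests; the macro name L2_CACHE_SIZE comes
from his message, while queryL2CacheSize() is a hypothetical
placeholder for the one-time runtime query:

#include <cstddef>

// If L2_CACHE_SIZE is set on the compiler command line, it becomes a
// compile-time constant usable for meta-programming unrolling;
// otherwise fall back to one runtime query at static initialization.
#ifdef L2_CACHE_SIZE
enum { l2_cache_size = L2_CACHE_SIZE };
#else
std::ptrdiff_t queryL2CacheSize(); // hypothetical runtime CPU query
static const std::ptrdiff_t l2_cache_size = queryL2CacheSize();
#endif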