Re: [eigen] Automatic cache and block size determination
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Automatic cache and block size determination
- From: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
- Date: Tue, 22 Jun 2010 16:25:17 +0200
On Tue, Jun 22, 2010 at 2:01 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> 2010/6/22 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>> Well, let me give you some numbers.
>>
>> The time to query the L1 and L2 cache sizes at runtime is 0.5ms. This
>> is done only once per execution of your software (in case you perform
>> matrix products on dynamic size matrices).
>
> It is also the case if one performs matrix products on large enough
> fixed-size matrices, where large enough depends on the architecture
> but is 8x8 on ARM NEON. Imagine an embedded application that just
> wants to do 8x8 fixed-size matrix products --- it shouldn't pay the
> price of querying cache sizes.
fair enough, I'll try to address this issue asap.
gael
> Embedded people are those people who try to not #include<iostream>
> because of the cost of initializing the std::cout object...
>
> A few options:
> - add an option as described by Bernard
> - or make sure to only use blocked product on actual Dynamic sizes
> - or introduce a second threshold under which CPU cache sizes aren't
> queried and that level of blocking isn't performed. After all your
> matrix product code has 2 levels of blocking, and for 8x8 matrices
> that obviously fit many times in the CPU cache, only the first level
> of blocking is performed.
>
> Benoit
>
>>
>> The overhead to test whether these queries have already been done is only
>> one "if(something==0)". This is completely negligible compared to all the
>> other computations which have to be carried out before doing the actual
>> matrix product. These include the computation of the block sizes (which
>> depends on the sizes of the matrices):
>>
>> std::ptrdiff_t l1, l2;
>> ei_manage_caching_sizes(GetAction, &l1, &l2); // cost = 1 "if" (cheap)
>> k = std::min<std::ptrdiff_t>(k, l1/kdiv); // kdiv is a power of 2 => 1 bit shift (cheap)
>> std::ptrdiff_t _m = l2/(4 * sizeof(LhsScalar) * k); // this integer division cannot be avoided even if L1 and L2 are known at compile time
>> if(_m<m) m = _m & mr_mask;
>>
>> then we have several allocations of the blocks:
>>
>> Scalar* blockA = ei_aligned_stack_new(Scalar, kc*mc);
>> std::size_t sizeB = kc*Blocking::PacketSize*Blocking::nr + kc*cols;
>> Scalar* allocatedBlockB = ei_aligned_stack_new(Scalar, sizeB);
>> Scalar* blockB = allocatedBlockB + kc*Blocking::PacketSize*Blocking::nr;
>>
>> then the data are copied into these blocks,
>>
>> etc.
>>
>> So really, I think this little "if" is totally negligible.
>>
>> What could be really useful, however, is a way to instantiate a
>> "matrix product object" with some information on the maximal and/or
>> typical matrix sizes we are considering such that all the above
>> initialization cost can be avoided when doing many matrix products on
>> matrices having the same sizes. For instance, this could be useful for
>> blocked decompositions.
>>
>> gael
>>
>>
>> On Tue, Jun 22, 2010 at 9:39 AM, <bernard.hugueney@xxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> On Tue, 22 Jun 2010 00:46:22 +0200, Thomas Capricelli wrote:
>>>
>>>> I guess there is a very small runtime cost of 'checking if the cache
>>>> sizes have already been computed' for every product, right ?
>>>> And also, computations involving cache sizes were previously done at
>>>> compile time and not anymore.. ? Nothing to worry about ?
>>>
>>>
>>> If possible, it would be best to have a #define L2_CACHE_SIZE that would
>>> default to a runtime query at static initialization time but could be set
>>> when compiling.
>>>
>>> When set at compile time, a typed value (as in [0]) would enable
>>> meta-programming unrolling; when set at runtime, doing the query at static
>>> initialization time would avoid polluting other code with a check run
>>> only once.
>>>
>>>
>>>
>>> My .2€
>>>
>>>
>>>
>>> Best regards,
>>>
>>>
>>>
>>> Bernard
>>>
>>>
>>>
>>> [0] http://www.boost.org/doc/libs/1_43_0/libs/mpl/doc/refmanual/int.html
>>>
>>>
>>>