- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Automatic cache and block size determination
- From: Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
- Date: Tue, 22 Jun 2010 08:01:30 -0400
2010/6/22 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
> Well, let me give you some numbers.
>
> The time to query the L1 and L2 cache sizes at runtime is 0.5ms. This
> is done only once per execution of your software (in case you perform
> matrix products on dynamic size matrices).
It is also the case if one performs matrix products on large enough
fixed-size matrices, where large enough depends on the architecture
but is 8x8 on ARM NEON. Imagine an embedded application that just
wants to do 8x8 fixed-size matrix products --- it shouldn't pay the
price of querying cache sizes.
Embedded developers are the kind of people who try not to #include
<iostream> because of the cost of initializing the std::cout object...
A few options:
- add an option as described by Bernard
- or make sure to only use blocked product on actual Dynamic sizes
- or introduce a second threshold under which CPU cache sizes aren't
queried and that level of blocking isn't performed. After all, your
matrix product code has two levels of blocking, and for 8x8 matrices,
which obviously fit many times over in the CPU cache, only the first
level of blocking is performed.
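The third option might look roughly like this (the threshold value and
the function name are made up for illustration, not Eigen's actual
implementation):

```cpp
#include <cstddef>

// Illustrative threshold below which the product is considered small
// enough that querying cache sizes and doing the second level of
// blocking is not worth it.
const std::ptrdiff_t kBlockingThreshold = 16;

// Decide whether a product is big enough to justify the cache-size
// query and cache-level blocking at all.
bool needs_cache_blocking(std::ptrdiff_t rows, std::ptrdiff_t cols,
                          std::ptrdiff_t depth) {
  return rows > kBlockingThreshold || cols > kBlockingThreshold ||
         depth > kBlockingThreshold;
}
```

With this, an 8x8 fixed-size product never touches the cache-size query,
while large dynamic-size products keep the current behavior.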
Benoit
>
> The overhead of testing whether these queries have already been done is
> only one "if(something==0)". This is completely negligible compared to
> all the other computations which have to be carried out before doing
> the actual matrix product. They include the computation of the block
> sizes (which depend on the sizes of the matrices):
>
> std::ptrdiff_t l1, l2;
> ei_manage_caching_sizes(GetAction, &l1, &l2); // cost = 1 "if" (cheap)
> // kdiv is a power of 2 => 1 bit shift (cheap):
> k = std::min<std::ptrdiff_t>(k, l1/kdiv);
> // this integer division cannot be avoided even if L1 and L2 are known
> // at compile time:
> std::ptrdiff_t _m = l2/(4 * sizeof(LhsScalar) * k);
> if(_m<m) m = _m & mr_mask;
>
> then we have several allocations of the blocks:
>
> Scalar* blockA = ei_aligned_stack_new(Scalar, kc*mc);
> std::size_t sizeB = kc*Blocking::PacketSize*Blocking::nr + kc*cols;
> Scalar* allocatedBlockB = ei_aligned_stack_new(Scalar, sizeB);
> Scalar* blockB = allocatedBlockB + kc*Blocking::PacketSize*Blocking::nr;
>
> then the data are copied into these blocks,
>
> etc.
>
> So really, I think this little "if" is totally negligible.
>
> What could be really useful, however, is a way to instantiate a
> "matrix product object" with some information on the maximal and/or
> typical matrix sizes being considered, so that all of the above
> initialization cost can be avoided when doing many matrix products on
> matrices of the same sizes. For instance, this could be useful for
> blocked decompositions.
>
> gael
>
>
> On Tue, Jun 22, 2010 at 9:39 AM, <bernard.hugueney@xxxxxxxxxx> wrote:
>> Hi,
>>
>> On Tue, 22 Jun 2010 00:46:22 +0200, Thomas Capricelli
>>
>>> I guess there is a very small runtime cost of 'checking if the cache
>>> sizes have already been computed' for every product, right ?
>>> And also, computation involving cache sizes were previously done at
>>> compile time and not anymore.. ? Nothing to worry about ?
>>
>> If possible, it would be best to have a #define L2_CACHE_SIZE that
>> would default to a runtime query at static initialization time but
>> could be set when compiling.
>>
>> When set at compile time, a typed value (as in [0]) would enable
>> meta-programming unrolling; when set at runtime, doing the query at
>> static initialization time would avoid polluting other code with a
>> check run only once.
>>
>> My .2€
>>
>> Best regards,
>>
>> Bernard
>>
>> [0] http://www.boost.org/doc/libs/1_43_0/libs/mpl/doc/refmanual/int.html