Re: [eigen] Indexes: why signed instead of unsigned?
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Indexes: why signed instead of unsigned?
- From: leon zadorin <leonleon77@xxxxxxxxx>
- Date: Mon, 17 May 2010 02:07:42 +1000
On 5/17/10, leon zadorin <leonleon77@xxxxxxxxx> wrote:
> On 5/17/10, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>> 2010/5/16 leon zadorin <leonleon77@xxxxxxxxx>:
> [...]
>>> but also for the purposes of speed
>>> such as better utilization of CPU cache-lines (esp. for the
>>> frequently-used portions of the code).
> [...]
>>> sizeof(uint_fast32_t) is still 4 (i.e. != sizeof(uint_fast64_t))
>>>
>>> thereby implying to me that if I am running a 64 bit platform and I
>>> *don't* need > 4billion(s) elements in my matrices et al, then using
>>> (u)int_fast32_t may be faster/more-efficient (as per vendors
>>> evaluation of 'fast') than the implied 64-bit variant...
>>
>> On x86-64, 64bit integers are really just as fast as 32bit integers. I
>> tried benchmarking that to be sure, see attached file a.cpp. Results:
>>
>> ##### 10:33:12 ~/cuisine$ g++ -O2 a.cpp -o a -I ../eigen-clone/ -lrt && ./a 2>/dev/null
>> op: add
>> index type of size: 4
>> time: 0.431198
>> index type of size: 8
>> time: 0.430843
> [...]
>
> Well, OK, but what was a.cpp actually testing and making sure of?
>
> From this excerpt:
>
> [...]
> index_type j = 1234, k = 5678;
> for(index_type i = 0; i < 500000000; ++i)
> {
> j += i;
> j = op::run(j, i);
> ++k;
> k = op::run(j, k);
> }
> [...]
>
> we can make at least two observations:
>
> 1) the CPU-cache utilization is not being tested at all -- only a very
> few variables (and thus memory locations) are involved (about 2 or 3?),
> so the speed test mostly measures register thrashing. Once more
> variables are accessed (in combination with other data occupying the
> cache), the cost of evicting extra cache lines may become more
> noticeable (a cache line is larger than 1 byte, but the point still
> stands) -- see the sketch after point 2.
>
> 2) Not so important, but worth mentioning: the compile-time-known loop
> length and constant folding may affect the compiler's final code, i.e.
> how much of the 32- vs 64-bit arithmetic actually gets folded into
> constants, and how much of the loop is partially unrolled, depending on
> which compilation options a given project uses, etc...
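(A purely hypothetical sketch to illustrate point 1 -- the array sizes,
the access pattern and the CLOCK_PROF clock id are arbitrary assumptions
of mine, not something a.cpp does. The only point is that the indices
themselves come from memory, so their width decides how many of them fit
per cache line:)

#include <cstddef>
#include <iostream>
#include <time.h>
#include <vector>

// hypothetical variant: the indices live in a large in-memory array
template<typename index_type> double bench_indexed_loads()
{
    std::vector<index_type> idx(1 << 24);              // ~16M indices, well past L2
    for (std::size_t n = 0; n < idx.size(); ++n)
        idx[n] = static_cast<index_type>(n & 0xffff);  // stay within data's bounds
    std::vector<double> data(0x10000, 1.0);            // small table being indexed

    ::timespec start, end;
    ::clock_gettime(CLOCK_PROF, &start);               // FreeBSD; CLOCK_MONOTONIC elsewhere
    double sum = 0;
    for (int pass = 0; pass != 16; ++pass)
        for (std::size_t n = 0; n < idx.size(); ++n)
            sum += data[idx[n]];                       // every index is a memory load
    ::clock_gettime(CLOCK_PROF, &end);

    std::cerr << " ignore this: " << sum << std::endl; // keep sum observable
    return end.tv_sec - start.tv_sec
         + (end.tv_nsec - start.tv_nsec) * 1e-9;
}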
>
> Having said this -- I'm sorry for not re-testing your code as much as
> I'd like (no time, you see -- it's very late-night/early-morning and my
> head is not working :-), but a somewhat ugly and extremely rushed hack
> indicates slightly different results (especially for multiplication):
>
> template<typename op> void bench_op_and_type()
> {
> typedef typename op::index_type index_type;
> std::cout << " index type of size: " << sizeof(index_type) << std::endl;
> // printf("") returns 0 at run time, so neither the initial values nor
> // the loop limit can be folded into compile-time constants
> index_type j = printf(""), k = printf("") - 2;
> int64_t const limit(3000000000 + printf(""));
> ::timespec start;
> ::clock_gettime(CLOCK_PROF, &start);
minor clarification: CLOCK_PROF is used (the equivalent of realtime was
also tried, with the same observations) because FreeBSD (7.2) does not
support your timer's clock mode (in BenchTimer) for clock_gettime...
insofar as I could spend time finding out...
> for(index_type i = 0; i < limit; ++i)
> {
> j += i;
> j = op::run(j, i);
> ++k;
> k = op::run(j, k);
> }
> ::timespec end;
> ::clock_gettime(CLOCK_PROF, &end);
> // keep k observable so the optimizer cannot discard the loop
> std::cerr << " ignore this: " << k << std::endl;
> double const rv(end.tv_sec - start.tv_sec + (end.tv_nsec - start.tv_nsec) * static_cast<double>(1e-9));
> std::cout << " time: " << rv << std::endl;
> }
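(On the clock id itself: a tiny, illustrative sketch of how one could
fall back through whichever POSIX clock ids the platform's headers
happen to define -- this is an assumption of mine, not what Eigen's
BenchTimer does:)

#include <time.h>

// purely illustrative clock selection
static inline clockid_t bench_clock_id()
{
#if defined(CLOCK_MONOTONIC)
    return CLOCK_MONOTONIC;   // Linux and newer BSDs
#elif defined(CLOCK_PROF)
    return CLOCK_PROF;        // e.g. FreeBSD 7.2
#else
    return CLOCK_REALTIME;    // required by POSIX as a last resort
#endif
}

// usage: ::clock_gettime(bench_clock_id(), &start);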
>
> c++ -O3 a.cpp
> ran as su, nice -20 ./a.out
>
> on FreeBSD 7.2-RELEASE sys/GENERIC amd64
> on an Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz (albeit with CPU
> throttling enabled -- I don't have the time right now to turn it off,
> and besides, some projects might need to run on such a system anyway...)
> on gcc built with a single-threaded thread model, v 4.4.2
> and also on gcc built with the standard posix-threads thread model, v 4.2.1
> (done on different gccs because the thread model affects the kinds of
> optimizations applied across things like opaque function calls... and I
> don't have time to find out which calls are deemed opaque right now)...
> also tested with -O2 and -O3 (yielding similar observations)
>
> shows for int vs ptrdiff_t:
> [...]
> op: mul
> index type of size: 4
> time: 6.45414
> index type of size: 8
> time: 12.3436
> [...]
> running again and again still shows 2:1 mul slowdown ratio or
> thereabouts...
>
> being silly and using uint32_t instead of int
>
> op: mul
> index type of size: 4
> ignore this: 1291724011
> time: 5.1793
> index type of size: 8
> ignore this: -5677676258390039317
> time: 12.8464
>
> (Disclaimer -- my code could suck at these late night hours :-)
>
> I am not saying that the test case is 100% correct (there is, in fact,
> 32-bit wraparound w.r.t. the 3000000000 limit, etc., although reducing
> the limit to within index_type's range (and casting it to index_type)
> showed the same ratios -- a rough sketch of that clamping follows
> below), but it does point out the "specificity" of any benchmark w.r.t.
> a given deployment context (i.e. a project etc.), and so, I think, at
> the very least it illuminates the need for customization as per the
> user's needs for data/metadata types...
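(Roughly what I mean by reducing the limit to within index_type's range
-- a hypothetical helper, not part of the original a.cpp; the name
clamped_limit is mine:)

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <limits>

// clamp the trip count to what index_type can represent,
// so a 32-bit index_type cannot wrap around
template<typename index_type> index_type clamped_limit()
{
    // printf("") returns 0, again only to defeat constant folding
    std::int64_t const wanted = 3000000000LL + std::printf("");
    std::int64_t const max_ok =
        static_cast<std::int64_t>(std::numeric_limits<index_type>::max());
    return static_cast<index_type>(std::min(wanted, max_ok));
}

// then, called once before the loop:
//   index_type const limit = clamped_limit<index_type>();
//   for (index_type i = 0; i < limit; ++i) ...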
>
> ... not to mention that we still have not even looked at the effects
> on cacheline utilisation...
>
> But anyway -- as I've mentioned, if you reckon that Eigen does not do
> much integral referencing (reading/writing) and arithmetic, then it's
> cool -- I'll leave you alone :-)
>
> And sorry if the above code is not quite correct -- I need to get some
> sleep :-) and time... I am very seldom in front of the computer when I
> can indulge in playing with these kinds of scenarios :-) :-) :-)
>
> Kind regards
> Leon.
>