Re: [eigen] Indexes: why signed instead of unsigned?
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Indexes: why signed instead of unsigned?
- From: Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
- Date: Sun, 16 May 2010 12:44:37 -0400
OK, so I dumped the assembly for my loop in the mul case.
For type int:
.L76:
#APP
# 28 "a.cpp" 1
#begin
# 0 "" 2
#NO_APP
leal (%rdx,%rax), %eax
addl $1, %ebx
imull %edx, %eax
imull %eax, %ebx
#APP
# 33 "a.cpp" 1
#end
# 0 "" 2
#NO_APP
addl $1, %edx
cmpl $500000000, %edx
jne .L76
For type std::ptrdiff_t:
.L91:
#APP
# 28 "a.cpp" 1
#begin
# 0 "" 2
#NO_APP
leaq (%rdx,%rax), %rax
addq $1, %rbx
imulq %rdx, %rax
imulq %rax, %rbx
#APP
# 33 "a.cpp" 1
#end
# 0 "" 2
#NO_APP
addq $1, %rdx
cmpq $500000000, %rdx
jne .L91
I don't see anything here biasing the results, and the two versions
are similar, but I'm no assembly-language expert; perhaps you see
something I don't. The two versions run in the same amount of time
here, so I'm tempted to believe that imulq is as fast as imull.
Could you dump your assembly to see if some optimization happened on your side?
Benoit
2010/5/16 leon zadorin <leonleon77@xxxxxxxxx>:
> On 5/17/10, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>> 2010/5/16 leon zadorin <leonleon77@xxxxxxxxx>:
> [...]
>>> but also for the purposes of speed
>>> such as better utilization of CPU cache-lines (esp. for the
>>> frequently-used portions of the code).
> [...]
>>> sizeof(uint_fast32_t) is still 4 (i.e. != sizeof(uint_fast64_t))
>>>
>>> thereby implying to me that if I am running a 64-bit platform and I
>>> *don't* need > 4 billion elements in my matrices et al., then using
>>> (u)int_fast32_t may be faster/more efficient (as per the vendor's
>>> evaluation of 'fast') than the implied 64-bit variant...
>>
>> On x86-64, 64-bit integers really are just as fast as 32-bit integers.
>> I tried benchmarking that to be sure; see attached file a.cpp. Results:
>>
>> ##### 10:33:12 ~/cuisine$ g++ -O2 a.cpp -o a -I ../eigen-clone/ -lrt
>> && ./a 2>/dev/null
>> op: add
>> index type of size: 4
>> time: 0.431198
>> index type of size: 8
>> time: 0.430843
> [...]
>
> Well, OK, but what was a.cpp testing and making sure of?
>
> From this excerpt:
>
> [...]
> index_type j = 1234, k = 5678;
> for(index_type i = 0; i < 500000000; ++i)
> {
>     j += i;
>     j = op::run(j, i);
>     ++k;
>     k = op::run(j, k);
> }
> [...]
>
> we can make 2 observations at least:
>
> 1) CPU-cache utilization is not being tested at all -- there are only
> very few variables (and thus memory locations) involved (about 2 or
> 3?), so the speed test mostly measures register thrashing. When more
> variables are accessed (in combination with other data occupying the
> cache), the price of evicting more cache lines may become noticeable
> (the cache-line size is not 1 byte, but the point still stands).
>
> 2) Not so important, but worth mentioning I guess: compile-time
> loop-length and constant-folding may affect the compiler's final code
> w.r.t. how much of the 32- vs 64-bit code is actually folded into
> constants, and how much of the loop is partially unrolled depending on
> the compilation options used by a given project, etc...
>
> Having said this -- I'm sorry for not re-testing your code as much as
> I'd like (no time, you see -- it's very late night/early morning and
> my head is not working :-), but a somewhat ugly and extremely rushed
> hack indicates slightly different results (especially for
> multiplication):
>
> template<typename op> void bench_op_and_type()
> {
>     typedef typename op::index_type index_type;
>     std::cout << " index type of size: " << sizeof(index_type) << std::endl;
>     index_type j = printf(""), k = printf("") - 2;
>     int64_t const limit(3000000000 + printf(""));
>     ::timespec start;
>     ::clock_gettime(CLOCK_PROF, &start);
>     for(index_type i = 0; i < limit; ++i)
>     {
>         j += i;
>         j = op::run(j, i);
>         ++k;
>         k = op::run(j, k);
>     }
>     std::cerr << " ignore this: " << k << std::endl;
>     ::timespec end;
>     ::clock_gettime(CLOCK_PROF, &end);
>     double const rv(end.tv_sec - start.tv_sec + (end.tv_nsec - start.tv_nsec) * static_cast<double>(1e-9));
>     std::cout << " time: " << rv << std::endl;
> }
>
> c++ -O3 a.cpp
> ran as su, nice -20 ./a.out
>
> on FreeBSD 7.2-RELEASE sys/GENERIC amd64
> on an Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz (CPU throttling is
> enabled, but I don't have time right now to turn it off; besides, some
> projects might need to run on such a system anyway...)
> on gcc built as single-threaded mem-model, v 4.4.2
> and also on gcc built as standard posix-threads mem-model, v 4.2.1
> (done on different gccs because the memory model affects the kinds of
> optimizations across things like opaque function calls... and I don't
> have time to find out which calls are deemed opaque right now)...
> also tested with -O2 and -O3 (yielding similar observations)
>
> shows for int vs ptrdiff_t:
> [...]
> op: mul
> index type of size: 4
> time: 6.45414
> index type of size: 8
> time: 12.3436
> [...]
> running again and again still shows a 2:1 mul slowdown ratio or thereabouts....
>
> Being silly and using uint32_t instead of int:
>
> op: mul
> index type of size: 4
> ignore this: 1291724011
> time: 5.1793
> index type of size: 8
> ignore this: -5677676258390039317
> time: 12.8464
>
> (Disclaimer -- my code could suck at these late night hours :-)
>
> I am not saying that the test case is 100% correct (there are in fact
> wraparounds for 32 bit w.r.t. 3000000000, etc., although reducing the
> limit to within the range of (and casting it to) index_type showed the
> same ratios), but it does point out the "specificity" of any benchmark
> w.r.t. a given deployment context (i.e. a project, etc.), and so, I
> think, at the very least it illuminates the need for customization as
> per the user's needs for data/metadata types...
>
> ... not to mention that we still have not even looked at the effects
> on cache-line utilisation...
>
> But anyway -- as I've mentioned, if you reckon that Eigen does not do
> much integral referencing (reading/writing) and arithmetic, then it's
> cool -- I'll leave you alone :-)
>
> And sorry if the above code is not quite correct -- I need to get some
> sleep :-) and time... I am very seldom in front of the computer when I
> can indulge in playing with these kinds of scenarios :-) :-) :-)
>
> Kind regards
> Leon.