Re: [eigen] Indexes: why signed instead of unsigned?
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Indexes: why signed instead of unsigned?
- From: leon zadorin <leonleon77@xxxxxxxxx>
- Date: Mon, 17 May 2010 12:32:47 +1000
On 5/17/10, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> ok so I dumped the assembly for my loop in the mul case.
>
>
> For type int:
>
> .L76:
> #APP
> # 28 "a.cpp" 1
> #begin
> # 0 "" 2
> #NO_APP
> leal (%rdx,%rax), %eax
> addl $1, %ebx
> imull %edx, %eax
> imull %eax, %ebx
> #APP
> # 33 "a.cpp" 1
> #end
> # 0 "" 2
> #NO_APP
> addl $1, %edx
> cmpl $500000000, %edx
> jne .L76
>
>
> For type std::ptrdiff_t:
>
> .L91:
> #APP
> # 28 "a.cpp" 1
> #begin
> # 0 "" 2
> #NO_APP
> leaq (%rdx,%rax), %rax
> addq $1, %rbx
> imulq %rdx, %rax
> imulq %rax, %rbx
> #APP
> # 33 "a.cpp" 1
> #end
> # 0 "" 2
> #NO_APP
> addq $1, %rdx
> cmpq $500000000, %rdx
> jne .L91
>
>
> I don't see anything here biasing the results, and the two versions
> are similar, but I'm no assembly language expert -- perhaps you see
> something I don't. The two versions run in the same amount of time
> here, so I'm tempted to believe that imulq is as fast as imull.
>
> Could you dump your assembly to see if some optimization happened on your
> side?
Sure thing. In fact -- I should learn to get some sleep before
posting -- I think the int64_t part of my code was insane, or at the
very least inappropriate for this test. I shall readjust the code and
dump the assembly as soon as I get a chance. Sorry for the noise in
the meantime.
Kind regards
Leon.
> Benoit
>
>
>
>
> 2010/5/16 leon zadorin <leonleon77@xxxxxxxxx>:
>> On 5/17/10, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>>> 2010/5/16 leon zadorin <leonleon77@xxxxxxxxx>:
>> [...]
>>>> but also for the purposes of speed
>>>> such as better utilization of CPU cache-lines (esp. for the
>>>> frequently-used portions of the code).
>> [...]
>>>> sizeof(uint_fast32_t) is still 4 (i.e. != sizeof(uint_fast64_t))
>>>>
>>>> thereby implying to me that if I am running a 64-bit platform and I
>>>> *don't* need more than 4 billion elements in my matrices et al.,
>>>> then using (u)int_fast32_t may be faster/more efficient (as per the
>>>> vendor's evaluation of 'fast') than the implied 64-bit variant...
>>>
>>> On x86-64, 64bit integers are really just as fast as 32bit integers. I
>>> tried benchmarking that to be sure, see attached file a.cpp. Results:
>>>
>>> ##### 10:33:12 ~/cuisine$ g++ -O2 a.cpp -o a -I ../eigen-clone/ -lrt
>>> && ./a 2>/dev/null
>>> op: add
>>> index type of size: 4
>>> time: 0.431198
>>> index type of size: 8
>>> time: 0.430843
>> [...]
>>
>> Well ok, but what was the a.cpp testing and making sure of?
>>
>> From this excerpt:
>>
>> [...]
>> index_type j = 1234, k = 5678;
>> for(index_type i = 0; i < 500000000; ++i)
>> {
>> j += i;
>> j = op::run(j, i);
>> ++k;
>> k = op::run(j, k);
>> }
>> [...]
>>
>> we can make 2 observations at least:
>>
>> 1) CPU-cache utilization is not being tested at all -- only a very
>> few variables (and thus memory locations) are involved (about 2 or
>> 3?), so the speed test mostly measures register thrashing. When more
>> variables are accessed (in combination with other data occupying the
>> cache), the price of blowing out more cache lines may become
>> noticeable (and although a cache line is not 1 byte in size, the
>> point still stands).
>>
>> 2) Not so important, but worth mentioning I guess: compile-time
>> loop-length and constant-folding may affect the compiler's final
>> code w.r.t. how much of the 32- vs 64-bit code is actually folded
>> into constants, and how much of the loop is partially unrolled,
>> depending on which compilation options a given project uses etc...
>>
>> Having said this -- I'm sorry for not re-testing your code as much as
>> I'd like (no time you see -- it's very late-night/early-morning and my
>> head is not working :-), but a somewhat ugly and an extremely-rushed
>> hack indicates slightly different results (esp for multiplication):
>>
>> template<typename op> void bench_op_and_type()
>> {
>>     typedef typename op::index_type index_type;
>>     std::cout << " index type of size: " << sizeof(index_type) << std::endl;
>>     // printf("") returns 0 only at run time, defeating constant folding
>>     index_type j = printf(""), k = printf("") - 2;
>>     int64_t const limit(3000000000 + printf(""));
>>     ::timespec start;
>>     ::clock_gettime(CLOCK_PROF, &start);
>>     for (index_type i = 0; i < limit; ++i)
>>     {
>>         j += i;
>>         j = op::run(j, i);
>>         ++k;
>>         k = op::run(j, k);
>>     }
>>     ::timespec end;
>>     ::clock_gettime(CLOCK_PROF, &end);
>>     // print k after stopping the clock, so the I/O is not timed
>>     std::cerr << " ignore this: " << k << std::endl;
>>     double const rv(end.tv_sec - start.tv_sec
>>         + (end.tv_nsec - start.tv_nsec) * 1e-9);
>>     std::cout << " time: " << rv << std::endl;
>> }
>>
>> c++ -O3 a.cpp
>> ran as su, nice -20 ./a.out
>>
>> on FreeBSD 7.2-RELEASE sys/GENERIC amd64
>> on Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz (albeit cpu
>> throttling enabled, but I don't have the time right now to turn it
>> off, besides some projects might need to run on such a system
>> anyway...)
>> on gcc built as single-threaded mem-model, v 4.4.2
>> and also on gcc built as standard posix-threads mem-model, v 4.2.1
>> (done on different gccs as mem-models affect kinds of optimizations
>> across things like opaque functions... and I don't have time to find
>> out which calls are deemed opaque right now)...
>> also tested with -O2 and -O3 (yielding similar observations)
>>
>> shows for int vs ptrdiff_t:
>> [...]
>> op: mul
>> index type of size: 4
>> time: 6.45414
>> index type of size: 8
>> time: 12.3436
>> [...]
>> running it again and again still shows roughly a 2:1 mul slowdown
>> ratio...
>>
>> being silly and using uint32_t instead of int
>>
>> op: mul
>> index type of size: 4
>> ignore this: 1291724011
>> time: 5.1793
>> index type of size: 8
>> ignore this: -5677676258390039317
>> time: 12.8464
>>
>> (Disclaimer -- my code could suck at these late night hours :-)
>>
>> I am not saying the test case is 100% correct (in fact the 32-bit
>> types wrap around w.r.t. the 3000000000 limit, although reducing the
>> limit to within index_type's range, and casting to index_type, showed
>> the same ratios). But it does point out how specific any benchmark is
>> to a given deployment context (i.e. a given project), and so, I
>> think, it at the very least illuminates the need to let users
>> customize the data/metadata types to their needs...
>>
>> ... not to mention that we still have not even looked at the effects
>> on cache-line utilization...
>>
>> But anyway -- as I've mentioned, if you reckon that Eigen does not
>> do much integral indexing (reading/writing) and arithmetic, then
>> it's cool -- I'll leave you alone :-)
>>
>> And sorry if the above code is not quite correct -- I need to get
>> some sleep :-) and time... I am very seldom in front of the computer
>> when I can indulge in playing with these kinds of scenarios :-) :-) :-)
>>
>> Kind regards
>> Leon.