Re: [eigen] Indexes: why signed instead of unsigned?



OK, so I dumped the assembly for my loop in the mul case.


For type int:

.L76:
#APP
# 28 "a.cpp" 1
	#begin
# 0 "" 2
#NO_APP
	leal	(%rdx,%rax), %eax
	addl	$1, %ebx
	imull	%edx, %eax
	imull	%eax, %ebx
#APP
# 33 "a.cpp" 1
	#end
# 0 "" 2
#NO_APP
	addl	$1, %edx
	cmpl	$500000000, %edx
	jne	.L76


For type std::ptrdiff_t:

.L91:
#APP
# 28 "a.cpp" 1
	#begin
# 0 "" 2
#NO_APP
	leaq	(%rdx,%rax), %rax
	addq	$1, %rbx
	imulq	%rdx, %rax
	imulq	%rax, %rbx
#APP
# 33 "a.cpp" 1
	#end
# 0 "" 2
#NO_APP
	addq	$1, %rdx
	cmpq	$500000000, %rdx
	jne	.L91


I don't see anything here biasing the results, and the two versions
are very similar, but I'm no assembly-language expert, so perhaps you
see something I don't. The two versions run in the same amount of
time here, so I'm tempted to believe that imulq is just as fast as
imull.

Could you dump your assembly to see if some optimization happened on your side?
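
In case it helps: the #begin/#end markers in the dumps above are just
inline asm statements bracketing the loop in a.cpp (that is what the
#APP/#NO_APP blocks are), so the loop is easy to find in the -S
output. A minimal sketch of the idea -- not the exact a.cpp, the file
name and the loop body below are made up and merely stand in for
op::run -- would be:

// markers.cpp: empty inline-asm markers to locate a loop in the
// generated assembly; the mul-like body stands in for op::run
#include <cstdio>

int main()
{
    long j = 1234, k = 5678;
    for (long i = 0; i < 500000000; ++i)
    {
        asm volatile ("#begin");  // emitted verbatim between #APP/#NO_APP
        j += i;
        j = j * i;                // stand-in for op::run(j, i), op = mul
        ++k;
        k = j * k;                // stand-in for op::run(j, k)
        asm volatile ("#end");
    }
    std::printf("%ld %ld\n", j, k);  // keep the results observable
    return 0;
}

Compile with "g++ -O2 -S markers.cpp" and grep for #begin to land on
the loop body.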

Benoit




2010/5/16 leon zadorin <leonleon77@xxxxxxxxx>:
> On 5/17/10, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>> 2010/5/16 leon zadorin <leonleon77@xxxxxxxxx>:
> [...]
>>> but also for the purposes of speed
>>> such as better utilization of CPU cache-lines (esp. for the
>>> frequently-used portions of the code).
> [...]
>>> sizeof(uint_fast32_t) is still 4 (i.e. != sizeof(uint_fast64_t))
>>>
>>> thereby implying to me that if I am running a 64 bit platform and I
>>> *don't* need > 4 billion elements in my matrices et al., then using
>>> (u)int_fast32_t may be faster/more efficient (as per the vendor's
>>> evaluation of 'fast') than the implied 64-bit variant...
>>
>> On x86-64, 64bit integers are really just as fast as 32bit integers. I
>> tried benchmarking that to be sure, see attached file a.cpp. Results:
>>
>> ##### 10:33:12 ~/cuisine$ g++ -O2 a.cpp -o a -I ../eigen-clone/ -lrt
>> && ./a 2>/dev/null
>> op: add
>>   index type of size: 4
>>     time: 0.431198
>>   index type of size: 8
>>     time: 0.430843
> [...]
>
> Well, OK, but what was a.cpp actually testing and making sure of?
>
> From this excerpt:
>
> [...]
>  index_type j = 1234, k = 5678;
>  for(index_type i = 0; i < 500000000; ++i)
>  {
>    j += i;
>    j = op::run(j, i);
>    ++k;
>    k = op::run(j, k);
>  }
> [...]
>
> we can make at least two observations:
>
> 1) CPU-cache utilization is not being tested at all -- i.e. there
> are only a very few variables (and thus memory locations) involved
> (about 2 or 3?), so the speed test mostly measures register-only
> arithmetic rather than memory traffic. However, when it comes to
> accessing more variables (in combination with other data occupying
> the cache), the price of more cache-line evictions may become
> noticeable (the cache line is not 1 byte in size, but the point
> still stands).
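>
> A rough, hypothetical sketch of the kind of effect I mean (this is
> not from a.cpp; the names and sizes are made up): the same sequential
> walk over many stored indices touches twice as many cache lines when
> the indices are 8 bytes wide instead of 4:
>
> #include <cstdint>
> #include <cstdio>
> #include <vector>
>
> template<typename index_type>
> std::uint64_t walk(std::size_t n)
> {
>   std::vector<index_type> idx(n);
>   for (std::size_t i = 0; i != n; ++i)
>     idx[i] = static_cast<index_type>(i % 1000);
>   std::uint64_t sum = 0;
>   for (std::size_t i = 0; i != n; ++i)
>     sum += idx[i];  // memory-bound once idx outgrows the caches
>   return sum;
> }
>
> int main()
> {
>   std::size_t const n = 1u << 24;  // 64 MiB of 4-byte vs 128 MiB of 8-byte indices
>   std::printf("%llu %llu\n",
>               (unsigned long long)walk<std::uint32_t>(n),
>               (unsigned long long)walk<std::uint64_t>(n));
> }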
>
> 2) Not so important, but worth mentioning I guess: the compile-time
> loop length and constant folding may affect the compiler's final
> code w.r.t. how much of the 32- vs 64-bit code is actually rolled
> into constants... and how much of the loop is partially unrolled,
> depending on which compilation options a given project uses, etc.
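>
> A throwaway illustration of what I mean by folding (hypothetical,
> not from a.cpp; cf. the printf("") calls in the hack below, which
> serve to keep values opaque to the compiler):
>
> #include <cstdio>
>
> long folded()               // literal bound: gcc may reduce the whole
> {                           // loop to a constant at compile time
>   long s = 0;
>   for (int i = 0; i < 1000; ++i)
>     s += i;
>   return s;
> }
>
> long not_folded(int bound)  // run-time bound: the loop survives
> {
>   long s = 0;
>   for (int i = 0; i < bound; ++i)
>     s += i;
>   return s;
> }
>
> int main(int argc, char **)
> {
>   // argc is only known at run time, so not_folded() keeps its loop
>   std::printf("%ld %ld\n", folded(), not_folded(argc * 1000));
> }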
>
> Having said this -- I'm sorry for not re-testing your code as much as
> I'd like (no time, you see -- it's very late-night/early-morning and
> my head is not working :-), but a somewhat ugly and extremely rushed
> hack indicates slightly different results (especially for
> multiplication):
>
> template<typename op> void bench_op_and_type()
> {
>  typedef typename op::index_type index_type;
>  std::cout << "  index type of size: " << sizeof(index_type) << std::endl;
>  index_type j = printf(""), k = printf("") - 2;
>  int64_t const limit(3000000000 + printf(""));
>  ::timespec start;
>  ::clock_gettime(CLOCK_PROF, &start);
>  for(index_type i = 0; i < limit; ++i)
>  {
>    j += i;
>    j = op::run(j, i);
>    ++k;
>    k = op::run(j, k);
>  }
>  std::cerr << "    ignore this: " << k << std::endl;
>  ::timespec end;
>  ::clock_gettime(CLOCK_PROF, &end);
>  double const rv(end.tv_sec - start.tv_sec + (end.tv_nsec -
> start.tv_nsec) * static_cast<double>(1e-9));
>  std::cout << "    time: " << rv << std::endl;
> }
>
> c++ -O3 a.cpp
> ran as su, nice -20 ./a.out
>
> on FreeBSD 7.2-RELEASE sys/GENERIC  amd64
> on Intel(R) Core(TM)2 Duo CPU     E6550  @ 2.33GHz (albeit with CPU
> throttling enabled -- I don't have the time right now to turn it
> off, and besides, some projects might need to run on such a system
> anyway...)
> on gcc built with the single-threaded memory model, v 4.4.2
> and also on gcc built with the standard posix-threads memory model,
> v 4.2.1 (done on different gccs since memory models affect the kinds
> of optimizations possible across things like opaque function
> calls... and I don't have time to find out which calls are deemed
> opaque right now)...
> also tested with -O2 and -O3 (yielding similar observations)
>
> shows for int vs ptrdiff_t:
> [...]
> op: mul
>  index type of size: 4
>    time: 6.45414
>  index type of size: 8
>    time: 12.3436
> [...]
> running again and again still shows a 2:1 mul slowdown ratio, or thereabouts....
>
> being silly and using uint32_t instead of int
>
> op: mul
>  index type of size: 4
>    ignore this: 1291724011
>    time: 5.1793
>  index type of size: 8
>    ignore this: -5677676258390039317
>    time: 12.8464
>
> (Disclaimer -- my code could suck at these late night hours :-)
>
> I am not saying that the test case is 100% correct (indeed there are
> wraparounds for the 32-bit types w.r.t. 3000000000 etc., although
> reducing the limit to within the range of (and casting it to)
> index_type showed the same ratios), but it does point out how
> specific any benchmark is to a given deployment context (i.e. a
> project etc.) and so, I think, at the very least it illuminates the
> need for customization as per the user's needs for data/metadata
> types...
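>
> (To spell out the wraparound I mean -- a hypothetical two-multiply
> example, nothing from the hack above: the loop arithmetic overflows
> for the 32-bit index types, which is presumably why the "ignore
> this" outputs differ between the 4- and 8-byte runs; for a signed
> 32-bit int that overflow is, strictly speaking, undefined
> behaviour.)
>
> #include <cstdint>
> #include <cstdio>
>
> int main()
> {
>   std::uint32_t a = 3000000000u;  // the value itself fits in 32 bits...
>   std::uint64_t b = 3000000000u;
>   a *= 3u;  // ...but the product wraps modulo 2^32: prints 410065408
>   b *= 3u;  // exact in 64 bits: prints 9000000000
>   std::printf("%u %llu\n", a, (unsigned long long)b);
> }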
>
> ... not to mention that we still have not even looked at the effects
> on cacheline utilisation...
>
> But anyway -- as I've mentioned, if you reckon that Eigen does not do
> much integral referencing (reading/writing) and arithmetic, then it's
> cool -- I'll leave you alone :-)
>
> And sorry if the above code is not quite correct -- I need to get
> some sleep :-) and time... I am very seldom in front of the computer
> when I can indulge in playing with these kinds of scenarios :-) :-) :-)
>
> Kind regards
> Leon.
>
>
>


