Re: [eigen] Indexes: why signed instead of unsigned?



On 5/17/10, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> OK, so I dumped the assembly for my loop in the mul case.
>
>
> For type int:
>
> .L76:
> #APP
> # 28 "a.cpp" 1
> 	#begin
> # 0 "" 2
> #NO_APP
> 	leal	(%rdx,%rax), %eax
> 	addl	$1, %ebx
> 	imull	%edx, %eax
> 	imull	%eax, %ebx
> #APP
> # 33 "a.cpp" 1
> 	#end
> # 0 "" 2
> #NO_APP
> 	addl	$1, %edx
> 	cmpl	$500000000, %edx
> 	jne	.L76
>
>
> For type std::ptrdiff_t:
>
> .L91:
> #APP
> # 28 "a.cpp" 1
> 	#begin
> # 0 "" 2
> #NO_APP
> 	leaq	(%rdx,%rax), %rax
> 	addq	$1, %rbx
> 	imulq	%rdx, %rax
> 	imulq	%rax, %rbx
> #APP
> # 33 "a.cpp" 1
> 	#end
> # 0 "" 2
> #NO_APP
> 	addq	$1, %rdx
> 	cmpq	$500000000, %rdx
> 	jne	.L91
>
>
> I don't see anything here biasing the results, and the two versions
> are essentially identical apart from operand width. I'm no assembly
> language expert, though, so perhaps you see something I don't. The
> two versions run in the same amount of time here, so I'm tempted to
> believe that imulq is as fast as imull.
>
> Could you dump your assembly to see if some optimization happened on your
> side?

Sure thing. In fact -- I should learn to get some sleep before
posting -- I think the int64_t part of my code was insane, or at the
very least inappropriate for this test. I shall readjust the code and
dump the assembly as soon as I get a chance. Sorry for the noise in
the meantime.

Kind regards
Leon.

> Benoit
>
>
>
>
> 2010/5/16 leon zadorin <leonleon77@xxxxxxxxx>:
>> On 5/17/10, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>>> 2010/5/16 leon zadorin <leonleon77@xxxxxxxxx>:
>> [...]
>>>> but also for the purposes of speed
>>>> such as better utilization of CPU cache-lines (esp. for the
>>>> frequently-used portions of the code).
>> [...]
>>>> sizeof(uint_fast32_t) is still 4 (i.e. != sizeof(uint_fast64_t))
>>>>
>>>> thereby implying to me that if I am running a 64-bit platform and I
>>>> *don't* need more than 4 billion elements in my matrices et al., then
>>>> using (u)int_fast32_t may be faster/more efficient (as per the
>>>> vendor's evaluation of 'fast') than the implied 64-bit variant...
>>>
>>> On x86-64, 64bit integers are really just as fast as 32bit integers. I
>>> tried benchmarking that to be sure, see attached file a.cpp. Results:
>>>
>>> ##### 10:33:12 ~/cuisine$ g++ -O2 a.cpp -o a -I ../eigen-clone/ -lrt
>>> && ./a 2>/dev/null
>>> op: add
>>>   index type of size: 4
>>>     time: 0.431198
>>>   index type of size: 8
>>>     time: 0.430843
>> [...]
>>
>> Well, OK, but what exactly was a.cpp testing and making sure of?
>>
>> From this excerpt:
>>
>> [...]
>>  index_type j = 1234, k = 5678;
>>  for(index_type i = 0; i < 500000000; ++i)
>>  {
>>    j += i;
>>    j = op::run(j, i);
>>    ++k;
>>    k = op::run(j, k);
>>  }
>> [...]
>>
>> we can make 2 observations at least:
>>
>> 1) CPU-cache utilization is not being tested at all -- only a very
>> few variables (and thus memory locations) are involved (about 2 or
>> 3?), so the speed test mostly exercises register thrashing. However,
>> when more variables are accessed (in combination with other data
>> occupying the cache), the price of more cache-line evictions may
>> become noticeable (the cache-line size is not 1 byte, but the point
>> still stands).
>>
>> 2) Not so important, but worth mentioning I guess: compile-time
>> loop-length and constant folding may affect the compiler's final code
>> w.r.t. how much of the 32- vs 64-bit arithmetic is actually rolled
>> into constants, and how much of the loop is partially unrolled, based
>> on which compilation options a given project may use etc...
>>
>> Having said this -- I'm sorry for not re-testing your code as much
>> as I'd like (no time, you see -- it's very late night/early morning
>> and my head is not working :-) -- but a somewhat ugly and extremely
>> rushed hack indicates slightly different results (especially for
>> multiplication):
>>
>> template<typename op> void bench_op_and_type()
>> {
>>  typedef typename op::index_type index_type;
>>  std::cout << "  index type of size: " << sizeof(index_type) << std::endl;
>>  // printf("") returns 0 but is opaque to the optimizer, so these
>>  // values cannot be constant-folded:
>>  index_type j = printf(""), k = printf("") - 2;
>>  int64_t const limit(3000000000 + printf(""));
>>  ::timespec start;
>>  ::clock_gettime(CLOCK_PROF, &start);  // CLOCK_PROF is FreeBSD-specific
>>  for(index_type i = 0; i < limit; ++i)
>>  {
>>    j += i;
>>    j = op::run(j, i);
>>    ++k;
>>    k = op::run(j, k);
>>  }
>>  ::timespec end;
>>  ::clock_gettime(CLOCK_PROF, &end);
>>  // printing k keeps the compiler from optimizing the loop away; it is
>>  // done after the end timestamp so the I/O stays outside the timing:
>>  std::cerr << "    ignore this: " << k << std::endl;
>>  double const rv(end.tv_sec - start.tv_sec
>>                  + (end.tv_nsec - start.tv_nsec) * 1e-9);
>>  std::cout << "    time: " << rv << std::endl;
>> }
>>
>> c++ -O3 a.cpp
>> run as su, nice -20 ./a.out
>>
>> on FreeBSD 7.2-RELEASE sys/GENERIC amd64
>> on an Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz (albeit with CPU
>> throttling enabled -- I don't have the time right now to turn it off,
>> and besides, some projects might need to run on such a system
>> anyway...)
>> on gcc built with the single-threaded memory model, v4.4.2,
>> and also on gcc built with the standard posix-threads memory model,
>> v4.2.1 (different gccs because memory models affect the kinds of
>> optimizations across things like opaque function calls... and I don't
>> have time to find out which calls are deemed opaque right now)...
>> also tested with -O2 and -O3 (yielding similar observations)
>>
>> shows for int vs ptrdiff_t:
>> [...]
>> op: mul
>>  index type of size: 4
>>    time: 6.45414
>>  index type of size: 8
>>    time: 12.3436
>> [...]
>> running it again and again still shows a 2:1 mul slowdown ratio or
>> thereabouts...
>>
>> being silly and using uint32_t instead of int:
>>
>> op: mul
>>  index type of size: 4
>>    ignore this: 1291724011
>>    time: 5.1793
>>  index type of size: 8
>>    ignore this: -5677676258390039317
>>    time: 12.8464
>>
>> (Disclaimer -- my code could suck at these late night hours :-)
>>
>> I am not saying that the test case is 100% correct (in fact there are
>> wraparound issues for 32-bit types w.r.t. 3000000000 etc., although
>> reducing the limit to within the range of (and casting it to)
>> index_type showed the same ratios), but it does point out how
>> specific any benchmark is to a given deployment context (i.e. a
>> project etc.), and so, I think, at the very least it illuminates the
>> need to let users customize the data/metadata types to their needs...
>>
>> ... not to mention that we still have not even looked at the effects
>> on cacheline utilisation...
>>
>> But anyway -- as I've mentioned, if you reckon that Eigen does not do
>> much integral referencing (reading/writing) and arithmetic, then it's
>> cool -- I'll leave you alone :-)
>>
>> And sorry if the above code is not quite correct -- I need to get
>> some sleep :-) and time... I am very seldom in front of the computer
>> when I can indulge in playing with these kinds of scenarios :-) :-) :-)
>>
>> Kind regards
>> Leon.
>>
>>
>>
>
>
>


