Re: [eigen] Indexes: why signed instead of unsigned?



On 5/17/10, leon zadorin <leonleon77@xxxxxxxxx> wrote:
> On 5/17/10, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>> 2010/5/16 leon zadorin <leonleon77@xxxxxxxxx>:
> [...]
>>> but also for the purposes of speed
>>> such as better utilization of CPU cache-lines (esp. for the
>>> frequently-used portions of the code).
> [...]
>>> sizeof(uint_fast32_t) is still 4 (i.e. != sizeof(uint_fast64_t))
>>>
>>> thereby implying to me that if I am running a 64 bit platform and I
>>> *don't* need > 4billion(s) elements in my matrices et al, then using
>>> (u)int_fast32_t may be faster/more-efficient (as per vendors
>>> evaluation of 'fast') than the implied 64-bit variant...
>>
>> On x86-64, 64bit integers are really just as fast as 32bit integers. I
>> tried benchmarking that to be sure, see attached file a.cpp. Results:
>>
>> ##### 10:33:12 ~/cuisine$ g++ -O2 a.cpp -o a -I ../eigen-clone/ -lrt
>> && ./a 2>/dev/null
>> op: add
>>   index type of size: 4
>>     time: 0.431198
>>   index type of size: 8
>>     time: 0.430843
> [...]
>
> Well ok, but what was the a.cpp testing and making sure of?
>
> From this excerpt:
>
> [...]
>   index_type j = 1234, k = 5678;
>   for(index_type i = 0; i < 500000000; ++i)
>   {
>     j += i;
>     j = op::run(j, i);
>     ++k;
>     k = op::run(j, k);
>   }
> [...]
>
> we can make 2 observations at least:
>
> 1) the CPU-cache utilization is not being tested at all -- only a
> very few variables (and thus memory locations) are involved (about 2
> or 3?), so the speed test mostly measures register pressure/thrashing.
> However, when more variables are accessed (in combination with other
> data occupying the cache), the price of evicting more cache lines may
> become noticeable (a cache line is not 1 byte, but the point still
> stands).
>
> 2) Not so important, but worth mentioning I guess: compile-time loop
> length and constant folding may affect the compiler's final code
> w.r.t. how much of the 32- vs 64-bit code is actually folded into
> constants, and how much of the loop is partially unrolled, depending
> on which compilation options a given project may use etc...
>
> Having said this -- I'm sorry for not re-testing your code as much as
> I'd like (no time you see -- it's very late-night/early-morning and my
> head is not working :-), but a somewhat ugly and extremely rushed
> hack indicates slightly different results (esp. for multiplication):
>
> template<typename op> void bench_op_and_type()
> {
>   typedef typename op::index_type index_type;
>   std::cout << "  index type of size: " << sizeof(index_type) << std::endl;
>   // printf("") returns 0 at runtime; used to defeat constant folding
>   index_type j = printf(""), k = printf("") - 2;
>   int64_t const limit(3000000000 + printf(""));
>   ::timespec start;
>   ::clock_gettime(CLOCK_PROF, &start);

minor clarification: CLOCK_PROF is used because FreeBSD (7.2) does not
support the clock mode your timer (in BenchTimer) passes to
clock_gettime -- as far as I could tell in the time I had; the
equivalent of the realtime clock was also tried, with the same
observations.

>   for(index_type i = 0; i < limit; ++i)
>   {
>     j += i;
>     j = op::run(j, i);
>     ++k;
>     k = op::run(j, k);
>   }
>   std::cerr << "    ignore this: " << k << std::endl;
>   ::timespec end;
>   ::clock_gettime(CLOCK_PROF, &end);
>   double const rv(end.tv_sec - start.tv_sec
>       + (end.tv_nsec - start.tv_nsec) * 1e-9);
>   std::cout << "    time: " << rv << std::endl;
> }
>
> c++ -O3 a.cpp
> ran as su, nice -20 ./a.out
>
> on FreeBSD 7.2-RELEASE sys/GENERIC  amd64
> on Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz (CPU throttling is
> enabled, but I don't have time right now to turn it off; besides,
> some projects might need to run on such a system anyway...)
> on gcc built as single-threaded mem-model, v 4.4.2
> and also on gcc built as standard posix-threads mem-model, v 4.2.1
> (different gccs were used because the memory model affects which
> optimizations apply around things like opaque function calls... and I
> don't have time right now to find out which calls are deemed opaque)...
> also tested with -O2 and -O3 (yielding similar observations)
>
> shows for int vs ptrdiff_t:
> [...]
> op: mul
>   index type of size: 4
>     time: 6.45414
>   index type of size: 8
>     time: 12.3436
> [...]
> running again and again still shows 2:1 mul slowdown ratio or
> thereabouts...
>
> being silly and using uint32_t instead of int
>
> op: mul
>   index type of size: 4
>     ignore this: 1291724011
>     time: 5.1793
>   index type of size: 8
>     ignore this: -5677676258390039317
>     time: 12.8464
>
> (Disclaimer -- my code could suck at these late night hours :-)
>
> I am not saying that the test case is 100% correct (indeed there are
> 32-bit wraparounds w.r.t. 3000000000 etc., although reducing the
> limit to within index_type's range (and casting to index_type) showed
> the same ratios), but it does point out how specific any benchmark is
> to a given deployment context (i.e. a project etc.), and so, I think,
> it at the very least illuminates the need to let users customize the
> data/metadata types to their needs...
>
> ... not to mention that we still have not even looked at the effects
> on cacheline utilisation...
>
> But anyway -- as I've mentioned, if you reckon that Eigen does not do
> much integral referencing (reading/writing) and arithmetic, then
> that's cool -- I'll leave you alone :-)
>
> And sorry if the above code is not quite correct -- I need to get
> some sleep :-) and time... I am very seldom in front of the computer
> when I can indulge in playing with these kinds of scenarios :-) :-) :-)
>
> Kind regards
> Leon.
>


