[...]
index_type j = 1234, k = 5678;
for(index_type i = 0; i < 500000000; ++i)
{
    j += i;
    j = op::run(j, i);
    ++k;
    k = op::run(j, k);
}
[...]
we can make at least two observations:
1) CPU-cache utilization is not being tested at all -- only a very
few variables (hence memory locations) are involved (about 2 or 3?),
so the speed test mostly measures register thrashing. When more
variables are accessed (in combination with other data occupying the
cache), the price of additional cache-line evictions may become much
more noticeable (the cache-line size is not 1 byte, admittedly, but
the point still stands) -- see the sketch after this list.
2) Less important, but worth mentioning I guess: the compile-time
loop length and constant folding may affect the compiler's final code
w.r.t. how much of the 32- vs 64-bit code is actually folded into
constants, and how much of the loop is partially unrolled, depending
on which compilation options a given project uses etc...
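For what it's worth, here is a minimal sketch (mine, untested, and
all names made up) of the kind of loop that would bring the cache
into play -- the index array is sized well above L2, so the index
width directly determines how many indices fit per cache line:

#include <cstddef>
#include <cstdint>
#include <vector>

// Walk a large array of indices so that loads, not registers, dominate.
// n should be chosen well above the cache size, e.g. 1 << 24.
template<typename index_type>
int64_t cache_walk(std::size_t n)
{
    std::vector<index_type> idx(n);
    for(std::size_t i = 0; i < n; ++i)
        idx[i] = static_cast<index_type>((i * 2654435761u) % n); // scatter
    int64_t sum = 0;
    for(std::size_t i = 0; i < n; ++i)
        sum += idx[idx[i] % n]; // dependent loads stress cache lines/TLB
    return sum;
}

With 64-bit indices only half as many fit per 64-byte cache line, so
one would expect roughly twice the cache-line traffic for the same n.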
Having said this -- I'm sorry for not re-testing your code as much as
I'd like (no time, you see -- it's very late night/early morning and
my head is not working :-), but a somewhat ugly and extremely rushed
hack indicates slightly different results (especially for multiplication):
#include <cstdint>   // int64_t
#include <cstdio>    // printf
#include <ctime>     // clock_gettime, timespec
#include <iostream>

template<typename op> void bench_op_and_type()
{
    typedef typename op::index_type index_type;
    std::cout << " index type of size: " << sizeof(index_type) << std::endl;
    // printf("") is an opaque call returning 0, so the initial values
    // and the loop bound cannot be constant-folded away:
    index_type j = printf(""), k = printf("") - 2;
    int64_t const limit(3000000000 + printf(""));
    ::timespec start;
    ::clock_gettime(CLOCK_PROF, &start); // CLOCK_PROF is FreeBSD-specific
    for(index_type i = 0; i < limit; ++i)
    {
        j += i;
        j = op::run(j, i);
        ++k;
        k = op::run(j, k);
    }
    // printing k keeps the compiler from eliminating the loop as dead code
    std::cerr << " ignore this: " << k << std::endl;
    ::timespec end;
    ::clock_gettime(CLOCK_PROF, &end);
    double const rv(end.tv_sec - start.tv_sec
        + (end.tv_nsec - start.tv_nsec) * static_cast<double>(1e-9));
    std::cout << " time: " << rv << std::endl;
}
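(For context: the template above expects a small op policy type; the
following reconstruction -- my guess, not the original harness --
would reproduce the "op: mul" runs below:)

#include <cstddef>   // std::ptrdiff_t

// Hypothetical policy: index_type names the integer under test,
// run() applies the operation being benchmarked.
template<typename T>
struct mul_op
{
    typedef T index_type;
    static index_type run(index_type a, index_type b) { return a * b; }
};

int main()
{
    std::cout << "op: mul" << std::endl;
    bench_op_and_type< mul_op<int> >();            // size 4 on amd64
    bench_op_and_type< mul_op<std::ptrdiff_t> >(); // size 8 on amd64
    return 0;
}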
Compiled with c++ -O3 a.cpp,
run as su via nice -20 ./a.out,
on FreeBSD 7.2-RELEASE sys/GENERIC amd64,
on an Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz (albeit with CPU
throttling enabled -- I don't have the time right now to turn it off,
and besides, some projects might need to run on such a system
anyway...),
on gcc v4.4.2 built with the single-threaded memory model,
and also on gcc v4.2.1 built with the standard POSIX-threads memory
model (different gcc builds because the memory model affects the
kinds of optimizations performed across things like opaque function
calls... and I don't have time to find out which calls are deemed
opaque right now)...
Also tested with -O2 as well as -O3, yielding similar observations.
For int vs ptrdiff_t this shows:
[...]
op: mul
index type of size: 4
time: 6.45414
index type of size: 8
time: 12.3436
[...]
Running it again and again still shows a mul slowdown ratio of about 2:1...
Being silly and using uint32_t instead of int:
op: mul
index type of size: 4
ignore this: 1291724011
time: 5.1793
index type of size: 8
ignore this: -5677676258390039317
time: 12.8464
(Disclaimer -- my code could suck at these late-night hours :-)
I am not saying that the test case is 100% correct -- in fact the
32-bit types wrap around w.r.t. the 3000000000 limit etc., although
reducing the limit to within the range of (and casting it to)
index_type showed the same ratios (see the sketch below) -- but it
does point out the "specificity" of any benchmark w.r.t. a given
deployment context (i.e. a project etc.), and so, I think, it at the
very least illuminates the need for customization as per the user's
needs for data/metadata types...
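(By "reducing the limit and casting" I mean roughly the following --
a sketch, not the exact code I ran, and the 2000000000 figure is
illustrative:)

// Keep the trip count representable in index_type so a 32-bit counter
// never wraps; printf("") again defeats constant-folding:
index_type const limit(static_cast<index_type>(2000000000) + printf(""));
for(index_type i = 0; i < limit; ++i)
{
    // ... same loop body as above ...
}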
... not to mention that we still haven't even looked at the effects
on cache-line utilisation...
But anyway -- as I've mentioned, if you reckon that Eigen does not do
much integral referencing (reading/writing) and arithmetic, then it's
cool -- I'll leave you alone :-)
And sorry if the above code is not quite correct -- I need to get some
sleep :-) and time... I am very seldom in front of the computer when I
can indulge in playing with these kinds of scenarios :-) :-) :-)
Kind regards
Leon.