[eigen] Benchmark for blocking sizes

Hi List,

Thanks a lot for your feedback and results on the first iteration of this benchmark.

A new version is checked in the repo: bench/benchmark-blocking-sizes.cpp

https://bitbucket.org/eigen/eigen/src/tip/bench/benchmark-blocking-sizes.cpp

Changes:

1. Uses Eigen's BenchTimer.h, so it should run everywhere.

2. Tries to empty caches by default. You can override/tweak this behavior by playing with the --min-working-set-size command-line parameter.

3. Measures sizes up to 2048, up from 1024. Gael rightly pointed out to me that 1024 was not quite large enough to show all of the impact of M/N blocking sizes.

4. Displays progress info on stderr -- so it's fun to watch and you'll want to run it on lots of machines for me!

5. Doesn't try to do any analysis anymore. Just dumps a raw easy-to-parse table of GFlops for each combo of product size and blocking params.

Downside: it now takes longer to run --- typically 3 hours on a PC, 9 hours on an Android device.

Example compilation command line (note -mfma for Haswell):

$ c++ -O3 -DNDEBUG -mavx -mfma ../eigen/bench/benchmark-blocking-sizes.cpp -o b -I ../eigen --std=c++11

You'll want to redirect stdout to a file, while stderr is all you want to watch in your terminal, so there's no need for 'tee'.

$ ./b foo > log-benchmark-blocking-sizes

Caveats:

- Still only single-threaded. At some point we'll want a multithreaded benchmark.

Note to helpful people who already sent me data with the previous version:

I have a program that converts the existing logs to the new format, so your data isn't lost. Thanks again for it! The new benchmark generates better data, though, so if you have a chance to run it, that would be great. Otherwise, I can still make use of the old data, which I've already converted.

Useful things to test:

- try with --min-working-set-size=1 to disable the emptying of caches. That runs faster; what I don't know is if that leads to a significantly different choice of blocking strategy.
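
To illustrate why a tiny minimum working set amounts to no cache emptying, here is a rough sketch of one common way to keep operands cold: cycle through a pool of independent (A, B, C) matrix sets whose total size is at least the requested working set. This is not the code in benchmark-blocking-sizes.cpp; the function names are hypothetical, and treating the parameter as a byte count is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Bytes touched by one product C(m x n) += A(m x k) * B(k x n).
// Hypothetical helper, for illustration only.
std::size_t bytes_per_product(std::size_t m, std::size_t n, std::size_t k,
                              std::size_t scalar_size) {
  return (m * k + k * n + m * n) * scalar_size;
}

// Number of independent (A, B, C) sets to cycle through so that their total
// size is at least min_working_set_bytes (assumed to be a byte count).
// The result is never less than 1: a minimum of 1 byte means a single set
// is reused on every iteration, so it stays warm in cache.
std::size_t pool_size(std::size_t min_working_set_bytes,
                      std::size_t product_bytes) {
  assert(product_bytes > 0);
  // Ceiling division.
  const std::size_t sets =
      (min_working_set_bytes + product_bytes - 1) / product_bytes;
  return sets > 0 ? sets : 1;
}
```

With float matrices of size 256, one set takes 786432 bytes, so a 16 MiB minimum needs 22 sets, while a minimum of 1 needs just one set.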

What's next:

The new data format doesn't contain analysis, just raw GFlops measurements. Once we have data from several different machines, we can start to generate tables of blocking parameters that are "least bad" across multiple machines. Of course, users who want absolutely optimal perf on their machine will have to generate a table for it, but Eigen should default to parameters that minimize the worst-case loss of efficiency across all machines. So the next step is to generate such a table of parameters by aggregating data from multiple logs.
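
As a minimal sketch of what "least bad" could mean for one product size: take each candidate's efficiency on a machine to be its GFlops divided by the best GFlops seen on that machine, then pick the candidate whose worst efficiency across machines is highest. The in-memory table layout and the function name are hypothetical, log parsing is left out, and the actual aggregation may well use a different criterion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: gflops[machine][candidate] is the measured GFlops of
// one blocking-parameter candidate on one machine, for one product size.
using GflopsTable = std::vector<std::vector<double>>;

// Returns the index of the candidate that maximizes the minimum efficiency
// across machines, where efficiency = GFlops / best GFlops on that machine.
// Assumes all measurements are positive.
std::size_t pick_least_bad_candidate(const GflopsTable& gflops) {
  assert(!gflops.empty() && !gflops[0].empty());
  const std::size_t num_candidates = gflops[0].size();
  std::size_t best_candidate = 0;
  double best_worst_efficiency = -1.0;
  for (std::size_t c = 0; c < num_candidates; ++c) {
    double worst_efficiency = 1.0;
    for (const auto& machine : gflops) {
      assert(machine.size() == num_candidates);
      const double machine_best =
          *std::max_element(machine.begin(), machine.end());
      worst_efficiency = std::min(worst_efficiency, machine[c] / machine_best);
    }
    if (worst_efficiency > best_worst_efficiency) {
      best_worst_efficiency = worst_efficiency;
      best_candidate = c;
    }
  }
  return best_candidate;
}
```

For example, with made-up numbers {10, 9, 5} on one machine and {4, 7.2, 8} on another, the middle candidate is optimal on neither machine but reaches 90% of the best on both, so it is the one selected.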

Cheers!

Benoit