> Did someone have a look at what blaze [1] does? They seem to be pretty
> advanced regarding parallelism -- they also parallelize "simple" things
> like vector addition, which if we trust their benchmarks [2] seems to be
> beneficial starting at something like 50000 doubles

You should not trust benchmarks, especially when they were run by the people who wrote the software :-)

Here is my attempt at this game: multiplying 10 000 x 10 000 "double" matrices on a dual Xeon 2660-v4 (Broadwell, 2x14 cores, 2400 MHz memory), with the latest release of every library (as of today):

- MKL: 740 GFlops
- OpenBLAS: 540 GFlops
- Eigen: 440 GFlops
- Blaze: 440 GFlops

This machine is capable of:

2 (sockets) x 14 (cores) x 2 (FMA ports) x 2 (flops per FMA) x 4 (doubles per AVX2 register) x 2 (GHz) = 896 GFlops.

Hyperthreading is turned off and the CPU frequency is locked at 2 GHz.

On a MacBook Pro (2014) with a 4-core Haswell, I get:

- MKL: 170 GFlops
- Eigen: 130 GFlops

As for Blaze, for things such as vector addition they also use streaming stores to speed up the process.

François
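
For reference, the peak arithmetic above can be written as a minimal C++ sketch. The function names (`peak_gflops`, `efficiency`, `gemm_gflops`) are hypothetical helpers made up for this illustration, not part of MKL, OpenBLAS, Eigen, or Blaze. The `gemm_gflops` helper assumes the usual 2·n³ flop count for an n x n matrix product; the message does not say which convention the measurements used.

```cpp
#include <cassert>
#include <cmath>

// Theoretical double-precision peak in GFlops, following the formula above:
// sockets x cores x FMA ports x 2 flops per FMA x doubles per SIMD register x GHz.
constexpr double peak_gflops(int sockets, int cores_per_socket, int fma_ports,
                             int doubles_per_register, double ghz) {
    return sockets * cores_per_socket * fma_ports * 2.0 * doubles_per_register * ghz;
}

// Fraction of the theoretical peak reached by a measured rate.
constexpr double efficiency(double measured_gflops, double peak) {
    return measured_gflops / peak;
}

// GFlops achieved by an n x n matrix product that took `seconds`,
// ASSUMING the conventional count of 2*n^3 floating-point operations.
constexpr double gemm_gflops(double n, double seconds) {
    return 2.0 * n * n * n / seconds / 1e9;
}

// Dual-socket, 14 cores each, 2 FMA ports, AVX2 (4 doubles), locked at 2 GHz.
static_assert(peak_gflops(2, 14, 2, 4, 2.0) == 896.0,
              "peak should match the 896 GFlops figure quoted above");
```

With these helpers, `efficiency(740.0, peak_gflops(2, 14, 2, 4, 2.0))` comes out at roughly 0.83 for MKL, and the same call with 540 and 440 gives roughly 0.60 and 0.49 for OpenBLAS and for Eigen/Blaze. Under the 2·n³ assumption, 740 GFlops on a 10 000 x 10 000 product corresponds to about 2.7 seconds of wall time.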

