Re: [eigen] Parallel matrix multiplication causes heap allocation


On Mon, Dec 19, 2016 at 3:24 PM, François Fayard <fayard@xxxxxxxxxxxxx> wrote:

> Did someone have a look at what blaze [1] does? They seem to be pretty advanced regarding parallelism -- they also parallelize "simple" things like vector addition, which if we trust their benchmarks [2] seems to be beneficial starting at something like 50000 doubles

You should not trust benchmarks, especially where they have been done by people who wrote the software :-)

Here is my attempt at this game: multiplying 10 000 x 10 000 "double" matrices on a dual Xeon 2660-v4 (Broadwell, 2 x 14 cores, 2400 MHz memory), with the latest release of each library as of today:
- MKL: 740 GFlops
- OpenBLAS: 540 GFlops
- Eigen: 440 GFlops
- Blaze: 440 GFlops
This machine is capable of: 2 (sockets) x 14 (cores) x 2 (FMA ports) x 2 (flops per FMA) x 4 (doubles per AVX2 register) x 2 (GHz) = 896 GFlops. Hyperthreading is turned off and the CPU frequency is locked at 2 GHz.
On a MacBook Pro (2014) with a 4-core Haswell, I get:
- MKL: 170 GFlops
- Eigen: 130 GFlops
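
To make the peak figure for the dual Xeon easier to check, here is a small sketch of the arithmetic. The function names are made up for illustration, and the 2*n^3 flop count for an n x n product is the usual convention, which I am assuming is what the numbers above are based on.

```cpp
#include <cassert>

// Theoretical double-precision peak in GFlops:
// sockets x cores x FMA ports x flops per FMA x doubles per SIMD register x GHz.
inline double peak_gflops(int sockets, int cores_per_socket, int fma_ports,
                          int flops_per_fma, int doubles_per_register, double ghz) {
    return sockets * cores_per_socket * fma_ports * flops_per_fma *
           doubles_per_register * ghz;
}

// Flop count of an (n x n) * (n x n) product, assuming the usual 2*n^3 convention.
inline double gemm_flops(double n) {
    return 2.0 * n * n * n;
}

// Time in seconds for that product at a sustained rate given in GFlops.
inline double gemm_seconds(double n, double gflops) {
    assert(gflops > 0.0);
    return gemm_flops(n) / (gflops * 1e9);
}

// Fraction of the theoretical peak that a measured rate reaches.
inline double efficiency(double measured_gflops, double peak) {
    assert(peak > 0.0);
    return measured_gflops / peak;
}
```

With the figures above, peak_gflops(2, 14, 2, 2, 4, 2.0) gives 896, so MKL reaches about 83% of peak, OpenBLAS about 60%, and Eigen and Blaze about 49%. Under the 2*n^3 assumption, one 10 000 x 10 000 product takes roughly 2.7 s at 740 GFlops and roughly 4.5 s at 440 GFlops.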

For operations such as vector addition, Blaze also uses streaming stores to speed things up.
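
For readers who have not seen streaming stores, here is a minimal sketch of the idea on x86 using SSE2 intrinsics. This is not Blaze's code: the function name add_streaming is hypothetical, and I use 128-bit SSE2 only because it is available on every x86-64 compiler without extra flags (a real implementation would presumably use the widest SIMD set available).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_load_pd, _mm_add_pd, _mm_stream_pd, _mm_sfence

// c[i] = a[i] + b[i], writing c with non-temporal (streaming) stores.
// Assumption: a, b and c are all 16-byte aligned. n can be any length.
inline void add_streaming(const double* a, const double* b, double* c, std::size_t n) {
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(c) % 16 == 0);

    std::size_t i = 0;
    // Main loop: two doubles per iteration, result stored without going through the cache.
    for (; i + 2 <= n; i += 2) {
        const __m128d va = _mm_load_pd(a + i);
        const __m128d vb = _mm_load_pd(b + i);
        _mm_stream_pd(c + i, _mm_add_pd(va, vb));
    }
    // Scalar tail for an odd n.
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
    // Make the streaming stores globally visible before returning.
    _mm_sfence();
}
```

The point of the non-temporal store is that the result vector is written once and not read again soon, so keeping it out of the cache leaves more room for the inputs. That only pays off when the vectors are too large to fit in cache anyway.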

Speaking of Blaze, when benchmarking this library it is important to specify which BLAS backend is enabled (for large enough matrices you are mostly benchmarking the underlying BLAS), and whether padding is enabled. Indeed, by default Blaze matrices are padded so that each row (or column) is aligned on a 32-byte boundary. This makes a significant difference for some small to medium matrix sizes, but it also wastes memory. In my opinion, it's kind of cheating to compare "blaze+padding" versus "otherlib+no-padding", and this is what their benchmarks report.
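
To give an idea of the memory cost of such padding, here is a small sketch that rounds a row length up to a 32-byte boundary and reports the overhead. The helper names are hypothetical and this is only the arithmetic, not how Blaze implements it.

```cpp
#include <cassert>
#include <cstddef>

// Number of elements stored per row (or column) once the length is rounded up
// so that every row starts on an align_bytes boundary.
inline std::size_t padded_length(std::size_t n, std::size_t elem_size,
                                 std::size_t align_bytes) {
    assert(elem_size > 0 && align_bytes % elem_size == 0);
    const std::size_t per_block = align_bytes / elem_size;  // 4 doubles per 32 bytes
    return (n + per_block - 1) / per_block * per_block;
}

// Extra memory as a fraction of the unpadded size, for rows of n doubles
// padded to 32 bytes.
inline double padding_overhead(std::size_t n) {
    assert(n > 0);
    const std::size_t padded = padded_length(n, sizeof(double), 32);
    return static_cast<double>(padded - n) / static_cast<double>(n);
}
```

For example, a row of 5 doubles is stored as 8 (60% extra memory) and a row of 3 doubles as 4 (about 33% extra), whereas a row of 10 000 doubles needs no padding at all. So the cost, like the benefit, is concentrated in the small sizes.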


