Re: [eigen] Parallel matrix multiplication causes heap allocation


On Mon, Dec 19, 2016 at 6:24 AM, François Fayard <fayard@xxxxxxxxxxxxx> wrote:

> Did someone have a look at what blaze [1] does? They seem to be pretty advanced regarding parallelism -- they also parallelize "simple" things like vector addition, which, if we trust their benchmarks [2], seems to be beneficial starting at something like 50000 doubles.

You should not trust benchmarks, especially when they have been run by the people who wrote the software :-)

More than just that, OpenMP runtimes are nontrivial beasts to control and any multithreaded performance data that does not include a complete list of compiler and runtime versions, affinity information, complete processor details, and OS+distro version should be viewed with skepticism.

For example, most OpenMP runtimes do not set affinity by default, and I've seen this reduce performance by ~2x in DGEMM, and once affinity is enabled, breadth- vs depth-first placement makes a large difference in some cases.
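As a rough illustration of recording some of that information alongside a benchmark, here is a minimal sketch. The function names proc_bind_name and benchmark_environment are made up for this example; the OpenMP calls (omp_get_proc_bind, omp_get_max_threads) and the _OPENMP macro come from the OpenMP specification, and __VERSION__ is a GCC/Clang predefined macro. It only covers compiler, OpenMP version, thread count, and binding policy; processor details and OS/distro version would still have to be recorded separately.

```cpp
#include <sstream>
#include <string>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: name of the binding policy the OpenMP runtime will
// apply to the next parallel region (controlled by OMP_PROC_BIND).
std::string proc_bind_name() {
#ifdef _OPENMP
    // Integer values as listed for omp_proc_bind_t in the OpenMP specification.
    switch (static_cast<int>(omp_get_proc_bind())) {
        case 0:  return "false (threads may migrate)";
        case 1:  return "true";
        case 2:  return "master/primary";
        case 3:  return "close";
        case 4:  return "spread";
        default: return "unknown";
    }
#else
    return "n/a (built without OpenMP)";
#endif
}

// Hypothetical helper: a few lines of context worth printing next to any
// multithreaded timing.
std::string benchmark_environment() {
    std::ostringstream out;
#ifdef __VERSION__
    out << "compiler: " << __VERSION__ << '\n';
#else
    out << "compiler: unknown\n";
#endif
#ifdef _OPENMP
    out << "openmp_spec_date: " << _OPENMP << '\n'
        << "max_threads: " << omp_get_max_threads() << '\n';
#else
    out << "openmp_spec_date: none\n";
#endif
    out << "proc_bind: " << proc_bind_name() << '\n';
    return out.str();
}
```

Affinity itself is usually requested from the environment, e.g. OMP_PROC_BIND=spread OMP_PLACES=cores (or OMP_PROC_BIND=close), which corresponds only roughly to the breadth- vs depth-first placement mentioned above.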

Blaze also uses streaming stores to speed up operations such as vector addition.

Streaming stores are definitely a worthwhile optimization in the appropriate circumstances.  I haven't had time to experiment much, but for STREAM on Knights Landing they make a big difference (Intel and probably other compilers auto-generate them, or emit an optimized memcpy call that uses them, at least in simple loops).
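For readers who have not seen one written by hand, a streaming-store vector addition might look roughly like the sketch below. This is not what Blaze or any compiler actually emits; add_streaming is a hypothetical name, and the code assumes an x86 target where the compiler defines __SSE2__ (GCC and Clang on x86-64 do), falling back to a plain loop otherwise. Whether it is faster than ordinary stores depends on vector length and hardware, so it should be measured rather than assumed.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_loadu_pd, _mm_add_pd, _mm_stream_pd, _mm_sfence
#endif

// Hypothetical example: c[i] = a[i] + b[i], writing c with non-temporal
// (streaming) stores where possible so the output does not displace other
// data in the cache.
void add_streaming(const double* a, const double* b, double* c, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE2__)
    // _mm_stream_pd needs a 16-byte-aligned destination, so peel scalar
    // iterations until c + i is aligned.
    while (i < n && reinterpret_cast<std::uintptr_t>(c + i) % 16 != 0) {
        c[i] = a[i] + b[i];
        ++i;
    }
    // Main loop: two doubles per iteration, unaligned loads, streaming store.
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_stream_pd(c + i, _mm_add_pd(va, vb));
    }
    // Streaming stores are weakly ordered; fence before other code relies on c.
    _mm_sfence();
#endif
    // Scalar tail (or the whole loop when SSE2 is not available).
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

The destination is written once and not read back inside the loop, which is the situation where bypassing the cache for stores is expected to help.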

I am particularly interested in where they pay off for storing the C matrix in GEMM with k << m,n...

Full disclosure: I work for Intel but my comments here are not official statements of any kind.


