Re: [eigen] Parallel matrix multiplication causes heap allocation


- *To*: eigen@xxxxxxxxxxxxxxxxxxx
- *Subject*: Re: [eigen] Parallel matrix multiplication causes heap allocation
- *From*: Jeff Hammond <jeff.science@xxxxxxxxx>
- *Date*: Mon, 19 Dec 2016 09:31:34 -0800

On Mon, Dec 19, 2016 at 6:24 AM, François Fayard <fayard@xxxxxxxxxxxxx> wrote:

> Did someone have a look at what blaze [1] does? They seem to be pretty advanced regarding parallelism -- they also parallelize "simple" things like vector addition, which if we trust their benchmarks [2] seems to be beneficial starting at something like 50000 doubles

You should not trust benchmarks, especially where they have been done by people who wrote the software :-)

More than just that, OpenMP runtimes are nontrivial beasts to control and any multithreaded performance data that does not include a complete list of compiler and runtime versions, affinity information, complete processor details, and OS+distro version should be viewed with skepticism.
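As a rough illustration of the kind of record that should accompany timing numbers, here is a minimal sketch that collects a few of those facts at run time. The function name `describe_environment` is hypothetical, and the sketch only covers the compiler, the OpenMP specification date and thread counts; processor details, affinity settings and OS/distro version would still have to be captured separately.

```cpp
#include <sstream>
#include <string>
#include <thread>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: gather build/runtime facts to print next to benchmark results.
std::string describe_environment() {
    std::ostringstream out;
    // Compiler identification from predefined macros (clang is checked first
    // because it also defines __GNUC__).
#if defined(__clang__)
    out << "compiler: clang " << __clang_major__ << '.' << __clang_minor__ << '\n';
#elif defined(__GNUC__)
    out << "compiler: gcc-compatible " << __GNUC__ << '.' << __GNUC_MINOR__ << '\n';
#elif defined(_MSC_VER)
    out << "compiler: msvc " << _MSC_VER << '\n';
#else
    out << "compiler: unknown\n";
#endif
#ifdef _OPENMP
    // _OPENMP expands to the yyyymm date of the supported OpenMP specification.
    out << "openmp spec date: " << _OPENMP << '\n';
    out << "omp max threads: " << omp_get_max_threads() << '\n';
#else
    out << "openmp: not enabled at compile time\n";
#endif
    // May report 0 if the value cannot be determined.
    out << "hardware threads: " << std::thread::hardware_concurrency() << '\n';
    return out.str();
}
```

Printing this string at the start of a benchmark run makes it harder to lose track of which build produced which numbers.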

For example, most OpenMP runtimes do not set affinity by default, and I've seen this reduce performance by ~2x in DGEMM, and once affinity is enabled, breadth- vs depth-first placement makes a large difference in some cases.
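To make the affinity point concrete, the sketch below reports what binding policy the OpenMP runtime thinks it is using. It relies on the standard `OMP_PROC_BIND` and `OMP_PLACES` environment variables and on `omp_get_proc_bind()` from OpenMP 4.0; the helper names are hypothetical. Reading `spread` as roughly breadth-first and `close` as roughly depth-first placement is my interpretation, not something stated in this thread.

```cpp
#include <cstdlib>
#include <string>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: value of an environment variable, or "(unset)".
std::string env_or_unset(const char* name) {
    const char* value = std::getenv(name);
    return value ? std::string(value) : std::string("(unset)");
}

// Hypothetical helper: name of the thread binding policy seen by the runtime.
std::string binding_policy() {
#if defined(_OPENMP) && _OPENMP >= 201307
    switch (omp_get_proc_bind()) {
        case omp_proc_bind_false:  return "false";   // threads may migrate freely
        case omp_proc_bind_true:   return "true";
        case omp_proc_bind_close:  return "close";   // pack threads near the parent
        case omp_proc_bind_spread: return "spread";  // distribute threads across places
        default:                   return "other (e.g. master/primary)";
    }
#else
    return "unavailable (needs OpenMP 4.0 or later)";
#endif
}

// One-line summary suitable for a benchmark log.
std::string describe_affinity() {
    return "OMP_PROC_BIND=" + env_or_unset("OMP_PROC_BIND") +
           " OMP_PLACES=" + env_or_unset("OMP_PLACES") +
           " effective=" + binding_policy();
}
```

Running the same benchmark with, say, `OMP_PROC_BIND=spread` and then `OMP_PROC_BIND=close` and logging this line each time is one way to check how sensitive a result is to placement.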

For Blaze, for things such as vector addition, they also use streaming stores to speed up the process.

Streaming stores are definitely a worthwhile optimization in the appropriate circumstances. I haven't had time to experiment much, but for STREAM on Knights Landing they make a big difference (Intel and probably other compilers auto-generate them, or emit an optimized memcpy call that uses them, at least in simple loops).
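For readers who have not seen streaming (non-temporal) stores written out by hand, here is a minimal sketch of a vector addition that uses the SSE2 intrinsic `_mm_stream_pd` for the output. This is not Blaze's or Eigen's implementation, and `add_streaming` is a hypothetical name. It assumes the compiler defines `__SSE2__` when SSE2 is available; otherwise it falls back to an ordinary loop. Whether it is actually faster depends on the vector length and the machine, so measure before relying on it.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical sketch: c[i] = a[i] + b[i] for i in [0, n), writing c with
// non-temporal stores where SSE2 is available so the output bypasses the cache.
void add_streaming(const double* a, const double* b, double* c, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE2__)
    // _mm_stream_pd requires a 16-byte-aligned destination, so peel scalar
    // elements until c + i is aligned.
    while (i < n && reinterpret_cast<std::uintptr_t>(c + i) % 16 != 0) {
        c[i] = a[i] + b[i];
        ++i;
    }
    // Main loop: two doubles per iteration, unaligned loads, streaming store.
    for (; i + 2 <= n; i += 2) {
        const __m128d va = _mm_loadu_pd(a + i);
        const __m128d vb = _mm_loadu_pd(b + i);
        _mm_stream_pd(c + i, _mm_add_pd(va, vb));
    }
    // Order the non-temporal stores before any later stores.
    _mm_sfence();
#endif
    // Remainder, or the whole range when SSE2 is not available.
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

The benefit comes from not reading the destination into cache before overwriting it, which only matters when the output is too large to be reused from cache soon afterwards.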

I am particularly interested in where they pay off for storing the C matrix in GEMM with k << m,n...

Full disclosure: I work for Intel but my comments here are not official statements of any kind.

Jeff

