Re: [eigen] Parallel matrix multiplication causes heap allocation

To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: [eigen] Parallel matrix multiplication causes heap allocation
From: Jeff Hammond <jeff.science@xxxxxxxxx>
Date: Mon, 19 Dec 2016 09:31:34 -0800

On Mon, Dec 19, 2016 at 6:24 AM, François Fayard <fayard@xxxxxxxxxxxxx> wrote:

> Did someone have a look at what blaze [1] does? They seem to be pretty advanced regarding parallelism -- they also parallelize "simple" things like vector addition, which if we trust their benchmarks [2] seems to be beneficial starting at something like 50000 doubles

You should not trust benchmarks, especially where they have been done by people who wrote the software :-)

More than just that, OpenMP runtimes are nontrivial beasts to control and any multithreaded performance data that does not include a complete list of compiler and runtime versions, affinity information, complete processor details, and OS+distro version should be viewed with skepticism.

For example, most OpenMP runtimes do not set affinity by default, and I've seen this reduce performance by ~2x in DGEMM, and once affinity is enabled, breadth- vs depth-first placement makes a large difference in some cases.
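One way to make benchmark reports less ambiguous is to log the OpenMP settings next to the timings. The following is a minimal sketch, not something from Blaze or Eigen: `describe_openmp_config` is a hypothetical helper name, and it assumes an OpenMP 4.0 or later runtime for `omp_get_proc_bind()`. It does not capture compiler, processor, or OS details, which still have to be recorded separately.

```cpp
#include <sstream>
#include <string>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: summarize the OpenMP settings to log with a benchmark.
std::string describe_openmp_config() {
    std::ostringstream out;
#ifdef _OPENMP
    // _OPENMP expands to the yyyymm date of the supported specification.
    out << "openmp_spec_date=" << _OPENMP
        << " max_threads=" << omp_get_max_threads();
#if _OPENMP >= 201307  // omp_get_proc_bind() was added in OpenMP 4.0
    const char* policy = "unknown";
    // Integer values of omp_proc_bind_t as defined by the specification.
    switch (static_cast<int>(omp_get_proc_bind())) {
        case 0: policy = "false (no binding requested)"; break;
        case 1: policy = "true"; break;
        case 2: policy = "master/primary"; break;
        case 3: policy = "close"; break;
        case 4: policy = "spread"; break;
    }
    out << " proc_bind=" << policy;
#endif
#else
    out << "openmp=disabled";
#endif
    return out.str();
}
```

The binding policy itself is usually chosen through the standard `OMP_PROC_BIND` and `OMP_PLACES` environment variables, for example `OMP_PROC_BIND=spread OMP_PLACES=cores`. As an assumption about terminology, `spread` and `close` correspond roughly to the breadth-first and depth-first placements mentioned above.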

> For Blaze, for things such as vector addition, they also use streaming stores to speed up the process.

Streaming stores are definitely a worthwhile optimization in the appropriate circumstances. I haven't had time to experiment much, but for STREAM on Knights Landing they make a big difference (Intel and probably other compilers auto-generate them, or emit an optimized memcpy call that uses them, at least in simple loops).

I am particularly interested in where they pay off for storing the C matrix in GEMM with k << m,n...
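A rough way to see why the k << m,n case is the interesting one is to count memory traffic. The sketch below is a deliberately simple model, not a measurement: `c_traffic_fraction` is a hypothetical name, and it assumes C is overwritten (beta = 0), each matrix moves to or from memory exactly once, and a regular store to C costs an extra read of the line (write-allocate) that a streaming store avoids.

```cpp
// Hypothetical model: fraction of memory traffic attributable to C in
// C = A * B, with A m-by-k and B k-by-n, counted in matrix elements.
double c_traffic_fraction(double m, double n, double k, bool write_allocate) {
    const double ab = m * k + k * n;  // A and B are each read once
    // With write-allocate, each line of C is read before it is written.
    const double c = m * n * (write_allocate ? 2.0 : 1.0);
    return c / (ab + c);
}
```

Under these assumptions, for m = n = 1000 and k = 10 nearly all traffic is due to C, and avoiding the write-allocate read roughly halves the total. For square matrices the effect is much smaller, and real GEMM kernels with blocking will deviate from this model.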

Full disclosure: I work for Intel but my comments here are not official statements of any kind.

Jeff

--

Jeff Hammond
jeff.science@xxxxxxxxx
http://jeffhammond.github.io/
