Re: [eigen] RFC: making a deterministic and reproducible product codepat

On Tue, Sep 6, 2016 at 2:08 AM, Jason Newton <nevion@xxxxxxxxx> wrote:

Hi Peter,

For FMA, SSE, and the like, you must ensure that part is the same in both implementations; what matters is that the reduction ordering stays the same once that part is fixed. Like rounding, it is one more thing that needs attention: gcc, for instance, lets you avoid the 80-bit x87 registers by relying on SSE for everything. I could not see how having more than one FPU could affect the reduction ordering; on a GPU, at least, I know this does not happen, and there are many FPUs there.
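To make the point about reduction ordering concrete, here is a minimal sketch (not Eigen code; the function names are hypothetical) that sums the same values in two different orders. It assumes IEEE-754 double arithmetic without excess precision (for example SSE2 on x86-64) and no value-changing compiler flags such as -ffast-math.

```cpp
#include <cstddef>
#include <vector>

// Left-to-right reduction: ((x0 + x1) + x2) + ...
// (hypothetical helper, for illustration only)
inline double sum_forward(const std::vector<double>& x) {
    double acc = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) acc += x[i];
    return acc;
}

// Right-to-left reduction: ((x[n-1] + x[n-2]) + ...) + x0
// Same inputs, same operations, different ordering.
inline double sum_backward(const std::vector<double>& x) {
    double acc = 0.0;
    for (std::size_t i = x.size(); i-- > 0;) acc += x[i];
    return acc;
}
```

Under those assumptions, the input {1.0, 1e17, -1e17} gives 0.0 with sum_forward (the 1.0 is absorbed when added to 1e17) and 1.0 with sum_backward (the large terms cancel first). Two implementations can only agree bit for bit if they commit to the same ordering.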

For what it's worth, there is quite a bit online about this topic, e.g.

# Intel-specific

* https://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler

* https://software.intel.com/sites/default/files/article/326703/fp-control-2012-08.pdf (slides on the previous topic)

* https://software.intel.com/en-us/articles/differences-in-floating-point-arithmetic-between-intel-xeon-processors-and-the-intel-xeon

The first link has proven to be of great practical relevance in the context of https://github.com/elemental/Elemental/issues/179, as one example.

# Blogs/similar on this topic

* https://blogs.msdn.microsoft.com/shawnhar/2009/03/25/is-floating-point-math-deterministic/

* http://christian-seiler.de/projekte/fpmath/

* http://stackoverflow.com/questions/20963419/cross-platform-floating-point-consistancy

* http://yosefk.com/blog/consistency-how-to-defeat-the-purpose-of-ieee-floating-point.html

I am not sure about Itanium at all, but those are going by the wayside, right?

The most recent product was released in 2012Q4 (http://ark.intel.com/products/family/451/Intel-Itanium-Processor). I cannot comment on what Intel might do in the future.

I agree that, at least without careful management of compiler settings, there will be some situations where we cannot attain this property, but most of the time it is achievable and of high value.

Best,

Jeff (who works for Intel, should it matter)

It might also be good to have a user-pluggable matrix product calculator. This would let you experiment with the reduction ordering, e.g. deterministic reduction trees or different blocking/tiling configurations.
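As a rough sketch of what such a pluggable product could look like (this is not an existing Eigen interface; SequentialReduce, PairwiseReduce, and product are hypothetical names), the reduction over the inner dimension can be passed in as a policy type. The pairwise policy uses a fixed tree whose split points depend only on the length, so the ordering is the same on every platform that implements the same policy.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical policy: plain left-to-right accumulation.
struct SequentialReduce {
    static double run(const double* t, std::size_t n) {
        double acc = 0.0;
        for (std::size_t i = 0; i < n; ++i) acc += t[i];
        return acc;
    }
};

// Hypothetical policy: fixed binary reduction tree.
// The split point depends only on n, never on threads or hardware.
struct PairwiseReduce {
    static double run(const double* t, std::size_t n) {
        if (n == 0) return 0.0;
        if (n == 1) return t[0];
        const std::size_t half = n / 2;
        return run(t, half) + run(t + half, n - half);
    }
};

// C = A * B for row-major A (m x k) and B (k x n).
// The Reduce policy decides how the k partial products are summed.
template <class Reduce>
std::vector<double> product(const std::vector<double>& A,
                            const std::vector<double>& B,
                            std::size_t m, std::size_t k, std::size_t n) {
    std::vector<double> C(m * n);
    std::vector<double> terms(k);  // partial products for one entry of C
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            for (std::size_t p = 0; p < k; ++p)
                terms[p] = A[i * k + p] * B[p * n + j];
            C[i * n + j] = Reduce::run(terms.data(), k);
        }
    }
    return C;
}
```

A call such as product<PairwiseReduce>(A, B, m, k, n) then selects the ordering at compile time. The sketch ignores blocking and performance entirely, and compiler settings that change rounding (FMA contraction, excess precision) would still have to be fixed separately.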

-Jason

On Tue, Sep 6, 2016 at 3:54 AM, Peter <list@xxxxxxxxxxxxxxxxx> wrote:
Dear Jason,

On 02.09.2016 at 09:47, Jason Newton wrote:

The advantage of doing this is that when porting code from one context to another (be it GPUs, or different languages, like python/numpy) we can get a 100% bit-exact match as long as both domains follow the same algorithms (and deal with rounding the same way, another topic), which provides a fairly strong guarantee that the ported code/code in another domain is correct (provided a large enough input space is used for coverage).

I doubt that this is possible, even if the code is single-threaded.
Just switching between machines with and without FMA will significantly change the result.
x86 with its 80-bit registers gives you different results compared to other architectures. If you have more
than one FPU, you can't be sure of the ordering within the scalar products, and reordering
can/will change the result. And if you happen to use an Itanium machine, you never know what the compiler produces.
It may work in many cases, but at least it doesn't for my main application.
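The FMA effect can be illustrated with a small sketch (hypothetical helper names, not Eigen code). A fused multiply-add rounds once, while a separate multiply and add round twice. The volatile temporary is only there to keep the compiler from contracting the unfused version into an FMA; std::fma is the standard library function from <cmath>.

```cpp
#include <cmath>

// Multiply, round to double, then add and round again (two roundings).
// The volatile store forces the product to be rounded to double and
// prevents the compiler from fusing the two operations.
inline double mul_add_unfused(double a, double b, double c) {
    volatile double p = a * b;
    return p + c;
}

// Fused multiply-add: a * b + c computed exactly, then rounded once.
inline double mul_add_fused(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

With a = b = 1 + 2^-27 and c = -(1 + 2^-26), the exact product is 1 + 2^-26 + 2^-54. The unfused version rounds the product to 1 + 2^-26 and returns 0, while the fused version returns 2^-54. Whether the difference matters depends on the application, but the results are not bit-identical unless both sides make the same choice.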

Best regards,
Peter

Jeff Hammond
jeff.science@xxxxxxxxx
http://jeffhammond.github.io/