Dear Jason,

On 02.09.2016 at 09:47, Jason Newton wrote:

> The advantage of doing this is when porting code from one context to another (be it GPUs, or different languages - like python/numpy) we can get a 100% bit-exact match as long as both domains follow the same algorithms (and deal
> with rounding the same way, another topic) which provides a fairly strong guarantee that the ported code/code in another domain is correct (provided a large enough input space is used for coverage)

I doubt that this is possible, even if the code is single-threaded.
Simply switching between machines with and without FMA (fused multiply-add) can significantly change the result.
x86 code using the 80-bit x87 registers gives different results than other platforms. If you have more
than one FPU, you can't be sure of the order in which the terms of a scalar product are summed, and reordering
can, and will, change the result. And if you happen to use an Itanium machine, you never know what the compiler produces.
It may work in many cases, but at least it doesn't for my main application.

Best regards,

