For FMA/SSE and the like, you must ensure that part is the same in both implementations, but once that part is fixed, what matters is that the reduction ordering is kept the same. It is one more thing, like rounding, that needs attention: gcc will let you avoid the 80-bit x87 registers, for instance, by relying on SSE for everything (e.g. -mfpmath=sse). I could not see how having more than one FPU could work its way into the reduction ordering; at least on a GPU, I know this won't happen, and there are many FPUs there.
I am not sure about Itanium at all, but those are going by the wayside, right?
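To make the reduction-ordering point concrete, here is a minimal sketch in plain Python (double precision, no FMA or extended registers involved). The function names are mine, purely for illustration. It sums the same values once left to right and once with a fixed binary tree:

```python
def sequential_sum(values):
    """Left-to-right reduction: ((v0 + v1) + v2) + ..."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values):
    """Fixed binary-tree reduction: split in half, sum each half, add the two."""
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return values[0]
    mid = n // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


values = [0.1, 0.2, 0.3]
seq = sequential_sum(values)   # computes (0.1 + 0.2) + 0.3
tree = pairwise_sum(values)    # computes 0.1 + (0.2 + 0.3)
print(repr(seq), repr(tree), seq == tree)
```

Each ordering is perfectly repeatable on its own; the two just do not agree with each other, which is why both implementations have to commit to the same one.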
I agree that, at least without managing compiler settings carefully, there may be some situations where we cannot attain this property, but most of the time it is achievable and of high value.
It might also be good to have a user-pluggable matrix product calculator: this would let you control the reduction ordering, for example with deterministic reduction trees or different blocking/tiling configurations.
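A rough sketch of what I mean by pluggable, in plain Python. The `matmul` signature and the reducer names are hypothetical, not an existing API. The product takes a caller-supplied function that decides the order in which the per-element terms are added up:

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]
Reducer = Callable[[Sequence[float]], float]  # hypothetical plug-in type


def left_to_right(terms: Sequence[float]) -> float:
    """Reduce in index order: ((t0 + t1) + t2) + ..."""
    total = 0.0
    for t in terms:
        total += t
    return total


def fixed_tree(terms: Sequence[float]) -> float:
    """Reduce with a fixed binary tree, so the order depends only on len(terms)."""
    n = len(terms)
    if n == 0:
        return 0.0
    if n == 1:
        return terms[0]
    mid = n // 2
    return fixed_tree(terms[:mid]) + fixed_tree(terms[mid:])


def matmul(a: Matrix, b: Matrix, reduce_terms: Reducer = left_to_right) -> Matrix:
    """Naive matrix product; reduce_terms fixes the summation order per entry."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0]) if b else 0
    return [
        [reduce_terms([row[k] * b[k][j] for k in range(inner)]) for j in range(cols)]
        for row in a
    ]


a = [[0.1, 0.2, 0.3]]
b = [[1.0], [1.0], [1.0]]
print(matmul(a, b, left_to_right), matmul(a, b, fixed_tree))
```

A blocked or tiled variant would plug in the same way, as long as the reducer's order is a function of the problem shape only and not of thread count or scheduling.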