- Actually generate prefetch instructions on ARM64. On a Pixel XL Android device, running on one big core (Kryo @ 2.15 GHz), this change speeds up 1024x1024 matrix multiplication (rowmajor * colmajor -> colmajor, which is what we tend to use in NN applications) by ~10%.
- On ARM32, stop needlessly clobbering "cc" (the condition codes) in the asm statement. Nothing in the ARM assembler reference suggests that this instruction touches the condition codes:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0068b/Chdjffbi.html
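
For illustration only, the two points above might look roughly like the sketch below. It is not the code from this change: `PrefetchSketch`, `SumWithPrefetchSketch`, and the lookahead distance of 16 elements are hypothetical names and values, and it assumes the instructions in question are `prfm pldl1keep` on ARM64 and `pld` on ARM32. Neither asm statement lists "cc" in its clobbers, and other targets fall back to `__builtin_prefetch` or a no-op.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the name used in this change): hint that the cache
// line containing `ptr` will be read soon.
inline void PrefetchSketch(const void* ptr) {
#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
  // ARM64: prefetch for load into L1, keep in cache. No "cc" clobber.
  asm volatile("prfm pldl1keep, [%[ptr]]\n" : : [ptr] "r"(ptr));
#elif defined(__arm__) && (defined(__GNUC__) || defined(__clang__))
  // ARM32: preload data. Assumed not to touch condition codes, so no "cc".
  asm volatile("pld [%[ptr]]\n" : : [ptr] "r"(ptr));
#elif defined(__GNUC__) || defined(__clang__)
  // Other targets with GCC/Clang: let the compiler pick the instruction.
  __builtin_prefetch(ptr);
#else
  // Unknown compiler: the prefetch is only a hint, so doing nothing is valid.
  (void)ptr;
#endif
}

// Hypothetical usage: sum a buffer while prefetching a fixed distance ahead.
// The lookahead of 16 elements is an arbitrary assumption, not a tuned value.
inline std::int64_t SumWithPrefetchSketch(const std::vector<std::int32_t>& data) {
  constexpr std::size_t kLookahead = 16;
  std::int64_t sum = 0;
  for (std::size_t i = 0; i < data.size(); ++i) {
    // Only prefetch addresses that lie inside the buffer.
    if (i + kLookahead < data.size()) {
      PrefetchSketch(&data[i + kLookahead]);
    }
    sum += data[i];
  }
  return sum;
}
```

The prefetch is purely a performance hint, so the result of `SumWithPrefetchSketch` is the same whichever branch of `PrefetchSketch` is compiled in; only the timing can differ.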