> W.r.t porting to AVX: Be aware that there might be some pitfalls with
> AVX-performance:

Interesting tidbit from that link: "If the programmer inadvertently
mixes AVX and non-AVX vector instructions in the same code then there
is a penalty of 70 clock cycles for each transition between the two
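
For what it's worth, here is a minimal sketch of the usual way to avoid
that transition cost, as far as I understand it: clear the upper halves
of the YMM registers with _mm256_zeroupper() before control returns to
code that may still use legacy (non-VEX) SSE instructions. The function
names sum_floats_avx and sum_floats are made up for illustration, and the
target attribute and __builtin_cpu_supports are GCC/Clang-specific, so
treat this as an assumption-laden example rather than a recipe. Compilers
often insert vzeroupper themselves when building with AVX enabled; the
explicit call just makes the intent visible.

```c
#include <immintrin.h>  /* AVX intrinsics: __m256, _mm256_* */
#include <stddef.h>

/* Hypothetical helper: sums n floats with 256-bit AVX adds.
 * The target attribute (GCC/Clang only) lets this one function use AVX
 * even if the rest of the file is compiled without -mavx. */
__attribute__((target("avx")))
static float sum_floats_avx(const float *data, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;

    /* Main loop: 8 floats per iteration, unaligned loads. */
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));

    /* Spill the 8 partial sums and reduce them in scalar code. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);

    float total = 0.0f;
    for (int k = 0; k < 8; ++k)
        total += lanes[k];

    /* Scalar tail for the remaining n % 8 elements. */
    for (; i < n; ++i)
        total += data[i];

    /* Zero the upper 128 bits of all YMM registers before returning,
     * so that legacy SSE code in the caller does not pay the
     * AVX/non-AVX transition penalty. */
    _mm256_zeroupper();
    return total;
}

/* Hypothetical entry point: uses the AVX path only if the CPU reports
 * AVX support, otherwise falls back to a plain scalar loop. */
float sum_floats(const float *data, size_t n)
{
    if (__builtin_cpu_supports("avx"))
        return sum_floats_avx(data, n);

    float total = 0.0f;
    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}
```

The point of the split is that callers only ever see sum_floats, so the
256-bit code and its cleanup stay confined to one function.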

Thank you for the pointer to the blog.

