On Mon, Apr 21, 2014 at 6:10 PM, Christoph Hertzberg <chtz@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:

* If you have a limited range of block-sizes (known at compile-time), you may consider implementing it using a virtual class with specializations for each occurring block size. You'll need some kind of non-virtual intermediate class, which will introduce additional overhead, so overall it might not be worth the effort.

A much less painful trick is to use A.lazyProduct(B) instead of A*B and keep vectorization ON. Of course this will be a bit less efficient than fixed-size specialization, but it should be enough to see a nice speedup for a change that takes two seconds to make.

gael.

