'mul' being orders of magnitude slower for 64-bit ints
Same here. I believe Benoit did not notice any speed difference because his CPU is more recent than ours. Anyway, this benchmark does not represent typical use cases of sizes and indexes. In theory the compiler should not have to issue integer multiplications inside inner loops, but simply increment some registers. I verified that for simple expressions. Unfortunately, for a more complex one involving more complex addressing like:
it seems that GCC fails to generate good code and issues 3 imull instructions in the innermost vectorized loop. The non-vectorized loops (which deal with boundaries, or run when vectorization is disabled) are fine though... ahhhh, compiler weirdness...
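To make the "increment some registers" point concrete, here is a minimal sketch. The function names are made up for illustration, and this is not the expression that triggered the imull issue. Both functions walk the same strided data: the first recomputes the offset with a multiplication as written, the second is the hand-reduced form that an optimizer would ideally produce from the first.

```cpp
#include <cstddef>

// Hypothetical helper: sums every stride-th element, recomputing the
// offset i * stride on each iteration (one integer multiply as written).
template <typename Index>
double sum_strided_mul(const double* data, Index n, Index stride) {
    double acc = 0.0;
    for (Index i = 0; i < n; ++i)
        acc += data[i * stride];
    return acc;
}

// Same traversal with the multiply replaced by a running offset that is
// only incremented. This is the strength-reduced form written by hand.
template <typename Index>
double sum_strided_inc(const double* data, Index n, Index stride) {
    double acc = 0.0;
    Index offset = 0;
    for (Index i = 0; i < n; ++i, offset += stride)
        acc += data[offset];
    return acc;
}
```

Instantiating both with int and with std::ptrdiff_t and comparing the generated assembly is one way to check whether the multiply survives in the inner loop for a given compiler and flags.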
Anyway, the change Benoit is making is the necessary step to offer more flexibility; then we can benchmark real-world expressions, and finally decide on the default index/size type of Matrix. Depending on the benchmark results, I see two options:
1) always use ptrdiff_t
2) default to int, plus a Matrix option to use int64
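
For the second option, a rough sketch of what such a switch could look like. Everything here (TinyMatrix, IndexOption, LargeIndex) is hypothetical and only meant to show the mechanism; it is not an existing Eigen option or the way Matrix is actually declared.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical option flags, for illustration only.
enum IndexOption { DefaultIndex = 0, LargeIndex = 1 };

template <typename Scalar, int Options = DefaultIndex>
struct TinyMatrix {
    // Index is int by default and std::int64_t when LargeIndex is set.
    using Index = typename std::conditional<(Options & LargeIndex) != 0,
                                            std::int64_t, int>::type;
    Index rows;
    Index cols;

    // Computed in Index, so it only holds large sizes with LargeIndex.
    Index size() const { return rows * cols; }
};
```

With this kind of design the default stays on int, and only code that needs very large sizes opts into 64-bit indexes, e.g. TinyMatrix<float, LargeIndex>. The first option would instead amount to fixing Index to std::ptrdiff_t unconditionally.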