On Wed, Sep 8, 2010 at 1:18 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> 2010/9/8 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>> On Tue, Sep 7, 2010 at 12:57 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>>> Here, "most efficiently" depends on what you're doing. If you want to
>>> apply this transformation to a vector, it's going to be faster if you
>>> have a matrix representation of your transform, as the Transform class
>>> does. This is one of the most performance-critical use cases...
>>
>> Some numbers for transforming N 3D vectors stored in a 3xN column-major
>> matrix, using a 3x3 matrix, a quaternion (via the quaternion x single
>> vector product), and a quaternion converted on the fly to a 3x3 matrix.
>> The times are in seconds for 100000 runs (in the last case the
>> quaternion is converted to a matrix 100000 times).
>>
>> N            1          2          3         4         5         6         7         8
>> matrix 3x3   0.0007521  0.0008807  0.001357  0.002339  0.002869  0.003583  0.004301  0.02684
>> quaternion   0.001332   0.002183   0.003098  0.004002  0.004913  0.005945  0.007081  0.007997
>> quat-mat     0.001165   0.00152    0.001822  0.002925  0.003396  0.003964  0.004615  0.02727
>>
>> As expected, the matrix product is significantly faster. What is
>> surprising is that even for transforming a single vector (N=1), it is
>> faster to convert the quaternion to a matrix and then perform the
>> matrix product than to use the optimized quaternion-vector product
>> directly, given that the costs are respectively:
>>
>> 3x3 matrix : 9 mul + 6 add = 15 ops
>> quaternion : 15 mul + 15 add = 30 ops
>> quat-mat   : 18 mul + 21 add = 39 ops
>>
>> These numbers come directly from the assembly, where we can see that
>> gcc replaced "2 * v" with "v+v".
>>
>> Also, Daniel, you might be interested to know that this benchmark is in
>> bench/quaternion.cpp (in trunk).
>
> Thanks a lot for these numbers!
>
> Do you think that quaternion*vector3D has room to be improved by
> copying the vector3d into a vector4d and applying the vectorizable
> quaternion*vector4D product? I am worried about the 4th component: if
> it would be required to divide by it, that could kill the benefit.

It is even worse. I simply tried copying the input into a vector4f
and using the vectorized cross3 function. The result is 5 pmul and 5
padd only:

N            1
matrix 3x3   0.0007607
quaternion   0.002226
quat-mat     0.001178

The problem is not the copies, which are well optimized away by gcc, but
the 5 extra shuffles. Maybe some of the shuffling can be removed by
directly vectorizing the quaternion * vector product (we currently
vectorize quat*quat only).

gael

> Benoit
>
>> gael
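
For readers who want to see the two quaternion paths side by side, here
is a minimal sketch in plain C++ (no Eigen). All names (Quat, aboutZ,
rotateDirect, toMatrix, apply) are hypothetical and exist only for this
illustration. The formulas are the textbook ones and are not claimed to
be what Eigen does internally, although the direct path as written does
add up to 15 mul + 15 add, and the matrix-vector product to 9 mul +
6 add, as in the counts quoted above. The quaternion is assumed to be
unit length.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>;  // row-major: m[row][col]

// Hypothetical minimal quaternion w + xi + yj + zk, assumed normalized.
struct Quat { double w, x, y, z; };

// Helper for the example: rotation by `angle` radians about the z axis.
Quat aboutZ(double angle) {
    return {std::cos(angle / 2), 0.0, 0.0, std::sin(angle / 2)};
}

Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]};
}

// "quaternion" path: v' = v + w*t + u x t, with u = (x,y,z), t = 2*(u x v).
Vec3 rotateDirect(const Quat& q, const Vec3& v) {
    const Vec3 u = {q.x, q.y, q.z};
    Vec3 t = cross(u, v);
    for (double& c : t) c += c;  // "2 * t" written as "t + t"
    const Vec3 ut = cross(u, t);
    return {v[0] + q.w * t[0] + ut[0],
            v[1] + q.w * t[1] + ut[1],
            v[2] + q.w * t[2] + ut[2]};
}

// First half of the "quat-mat" path: unit quaternion -> 3x3 rotation matrix.
Mat3 toMatrix(const Quat& q) {
    const double tx = q.x + q.x, ty = q.y + q.y, tz = q.z + q.z;
    const double twx = tx * q.w, twy = ty * q.w, twz = tz * q.w;
    const double txx = tx * q.x, txy = ty * q.x, txz = tz * q.x;
    const double tyy = ty * q.y, tyz = tz * q.y, tzz = tz * q.z;
    Mat3 m{};
    m[0] = {1 - (tyy + tzz), txy - twz, txz + twy};
    m[1] = {txy + twz, 1 - (txx + tzz), tyz - twx};
    m[2] = {txz - twy, tyz + twx, 1 - (txx + tyy)};
    return m;
}

// "matrix 3x3" path: plain matrix-vector product.
Vec3 apply(const Mat3& m, const Vec3& v) {
    Vec3 r{};
    for (int i = 0; i < 3; ++i)
        r[i] = m[i][0] * v[0] + m[i][1] * v[1] + m[i][2] * v[2];
    return r;
}
```

Calling rotateDirect(q, v) and apply(toMatrix(q), v) with the same unit
quaternion should give the same vector up to rounding. When many vectors
share one quaternion, toMatrix(q) can be computed once and reused, which
is the case where the matrix path pays off most in the timings above.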

