packetwise complex format for vectorization, was Re: [eigen] FFT for Eigen

Just random thoughts on maintaining ABI:

Is the ei_matrix_storage::data() function the only way to access the raw data buffer?
If so, a checkpoint could be placed there that would convert packetwise storage back to the interleaved format, e.g. r0,r1,r2,r3,i0,i1,i2,i3 back to r0,i0,...,r3,i3.
Any complex-valued functions that could benefit from packetwise storage would have to jump through hoops and use a dedicated API that bypasses the automatic conversion to interleaved format.
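A minimal sketch of the conversion such a checkpoint would perform. It assumes 128-bit SIMD (4 floats per packet) and a buffer made of whole 8-float blocks. The function names are hypothetical helpers, not existing Eigen API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: 128-bit SIMD, so one packet holds 4 floats.
constexpr std::size_t kPacket = 4;

// Packetwise:  r0 r1 r2 r3 i0 i1 i2 i3 | r4 r5 r6 r7 i4 ...
// Interleaved: r0 i0 r1 i1 r2 i2 r3 i3 | r4 i4 ...
// Both functions are hypothetical helpers, not existing Eigen API.
std::vector<float> packetwise_to_interleaved(const std::vector<float>& src) {
    assert(src.size() % (2 * kPacket) == 0);  // whole blocks only
    std::vector<float> dst(src.size());
    for (std::size_t base = 0; base < src.size(); base += 2 * kPacket) {
        for (std::size_t k = 0; k < kPacket; ++k) {
            dst[base + 2 * k] = src[base + k];                // real part
            dst[base + 2 * k + 1] = src[base + kPacket + k];  // imaginary part
        }
    }
    return dst;
}

std::vector<float> interleaved_to_packetwise(const std::vector<float>& src) {
    assert(src.size() % (2 * kPacket) == 0);  // whole blocks only
    std::vector<float> dst(src.size());
    for (std::size_t base = 0; base < src.size(); base += 2 * kPacket) {
        for (std::size_t k = 0; k < kPacket; ++k) {
            dst[base + k] = src[base + 2 * k];                // gather real parts
            dst[base + kPacket + k] = src[base + 2 * k + 1];  // gather imaginary parts
        }
    }
    return dst;
}
```

Because the two functions are exact inverses, a checkpoint in data() could convert on the way out, and the dedicated API could convert back.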

On a related side note, if you want to access the scrambled elements of a float buffer by index, you can permute the low-order 3 bits of the index:
base + 0,1,2,3,4,5,6,7  => base + 0,2,4,6,1,3,5,7

Something like: (i & ~7) | ((i >> 2) & 1) | ((i & 3) << 1)

But it might be faster to use a lookup table for the index scrambling, e.g.:
i + lut[i & 7]
where lut = { 0, 1, 2, 3, -3, -2, -1, 0 }
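A quick check that the bit-twiddling and lookup-table forms agree. This assumes 8-float blocks, which means 4 complex floats per packet. Both forms map 0..7 to 0,2,4,6,1,3,5,7, that is, a position in packetwise order to the matching interleaved offset. The hypothetical unscramble_bits below is the inverse (0,4,1,5,2,6,3,7). That inverse is what you would need to read a packetwise buffer at a given interleaved index.

```cpp
#include <cassert>
#include <cstddef>

// Bit-twiddling form: keep the block base (i & ~7) and permute the low
// 3 bits so that 0..7 -> 0,2,4,6,1,3,5,7.
constexpr std::size_t scramble_bits(std::size_t i) {
    return (i & ~std::size_t{7}) | ((i >> 2) & 1) | ((i & 3) << 1);
}

// Lookup-table form: i + lut[i & 7].
constexpr int kLut[8] = {0, 1, 2, 3, -3, -2, -1, 0};
constexpr std::size_t scramble_lut(std::size_t i) {
    return static_cast<std::size_t>(static_cast<long long>(i) + kLut[i & 7]);
}

// Hypothetical inverse: 0..7 -> 0,4,1,5,2,6,3,7
// (interleaved offset -> position in the packetwise block).
constexpr std::size_t unscramble_bits(std::size_t i) {
    return (i & ~std::size_t{7}) | ((i & 1) << 2) | ((i >> 1) & 3);
}

// Checks, for every index below n, that both forms agree and that the
// inverse undoes the forward permutation.
void check_scramble(std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        assert(scramble_bits(i) == scramble_lut(i));
        assert(scramble_bits(unscramble_bits(i)) == i);
    }
}
```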

-- Mark

Gael Guennebaud wrote:
On Tue, May 19, 2009 at 7:07 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
I'd be (pleasantly) surprised if this could be done without
very big (not worth it) changes in Eigen...
Definitely!

Anyway, I just wanted to add that such a storage scheme would also be
optimal for vectorizing code that deals with many Vector3s or other
data types that are hard to vectorize...
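To illustrate the point, here is a sketch of the same packetwise idea applied to Vector3. It assumes 4-float packets and uses a hypothetical Vec3Block type that stores four x, then four y, then four z components. A batch of dot products then becomes the same lane-wise multiply/add on every packet, with no shuffling.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 4;  // assumption: 4 floats per SIMD packet

// Hypothetical packetwise layout for Vector3, mirroring r0..r3,i0..i3:
// x0 x1 x2 x3 | y0 y1 y2 y3 | z0 z1 z2 z3
struct Vec3Block {
    float x[kLanes], y[kLanes], z[kLanes];
};

// Dot products of corresponding vectors in a and b. The inner loop applies
// the same operation to all 4 lanes, which a compiler can map to packet ops.
std::vector<float> dots(const std::vector<Vec3Block>& a,
                        const std::vector<Vec3Block>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size() * kLanes);
    for (std::size_t blk = 0; blk < a.size(); ++blk) {
        for (std::size_t l = 0; l < kLanes; ++l) {
            out[blk * kLanes + l] = a[blk].x[l] * b[blk].x[l]
                                  + a[blk].y[l] * b[blk].y[l]
                                  + a[blk].z[l] * b[blk].z[l];
        }
    }
    return out;
}
```

With an array-of-Vector3 layout, the same loop would waste one lane per vector or need horizontal adds.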



2009/5/19 Rohit Garg <rpg.314@xxxxxxxxx>:
This is probably not a good idea. I believe that they should be stored
in the interleaved format. I'll be happy to pitch in with SSE2/3
intrinsics code for complex multiplication and division if necessary. I
think we should stick with the standard layout, since std::complex and
many other libraries use it.

So far in this discussion, the only reason I have seen for not using the
interleaved format is that multiplication is tricky with it. Is there
any other reason?
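Here is a portable sketch of why interleaved multiplication is the tricky case. It uses std::array<float, 4> to stand in for a 128-bit register. The helpers dup_even, dup_odd, swap_pairs and addsub are hypothetical emulations modeled on SSE3's _mm_moveldup_ps, _mm_movehdup_ps, a pair-swapping shuffle, and _mm_addsub_ps. The packetwise version needs only lane-wise multiply, add and subtract.

```cpp
#include <array>

// A 4-lane float "register"; plain std::array, so this runs anywhere.
using Packet = std::array<float, 4>;

// Lane-wise arithmetic (a single SIMD instruction each on real hardware).
Packet mul(const Packet& a, const Packet& b) {
    Packet r{};
    for (int l = 0; l < 4; ++l) r[l] = a[l] * b[l];
    return r;
}
Packet add(const Packet& a, const Packet& b) {
    Packet r{};
    for (int l = 0; l < 4; ++l) r[l] = a[l] + b[l];
    return r;
}
Packet sub(const Packet& a, const Packet& b) {
    Packet r{};
    for (int l = 0; l < 4; ++l) r[l] = a[l] - b[l];
    return r;
}

// Hypothetical emulations of the SSE3-style shuffles interleaved data needs.
Packet dup_even(const Packet& a) { return {a[0], a[0], a[2], a[2]}; }    // cf. _mm_moveldup_ps
Packet dup_odd(const Packet& a) { return {a[1], a[1], a[3], a[3]}; }     // cf. _mm_movehdup_ps
Packet swap_pairs(const Packet& a) { return {a[1], a[0], a[3], a[2]}; }  // pair swap
Packet addsub(const Packet& a, const Packet& b) {                         // cf. _mm_addsub_ps
    return {a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]};
}

// Interleaved: a = [ar0, ai0, ar1, ai1], two complex values per packet.
Packet cmul_interleaved(const Packet& a, const Packet& b) {
    Packet t1 = mul(dup_even(a), b);             // [ar*br, ar*bi, ...]
    Packet t2 = mul(dup_odd(a), swap_pairs(b));  // [ai*bi, ai*br, ...]
    return addsub(t1, t2);                       // [ar*br - ai*bi, ar*bi + ai*br, ...]
}

// Packetwise: four real parts in one packet, four imaginary parts in another.
void cmul_packetwise(const Packet& ar, const Packet& ai,
                     const Packet& br, const Packet& bi,
                     Packet& cr, Packet& ci) {
    cr = sub(mul(ar, br), mul(ai, bi));
    ci = add(mul(ar, bi), mul(ai, br));
}
```

The arithmetic per complex product is the same in both sketches. The difference is the three shuffles per packet that the interleaved version needs.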

IMHO, we shouldn't lose compatibility with ~90% of other complex
libraries/formats just to simplify multiplication.
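On the compatibility point: since C++11, the standard guarantees that an array of std::complex<T> can be accessed as an array of T. Real parts sit at even indices and imaginary parts at odd ones. An interleaved buffer can therefore be passed straight to C-style routines. The sketch below assumes conjugate_interleaved as a hypothetical stand-in for such a routine.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a C-style routine from another library that
// expects n interleaved complex values: re0, im0, re1, im1, ...
void conjugate_interleaved(float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    for (std::size_t k = 0; k < n; ++k) data[2 * k + 1] = -data[2 * k + 1];
}

void conjugate_all(std::vector<std::complex<float>>& v) {
    // std::complex<float> arrays are guaranteed to have this interleaved
    // layout, so the buffer is handed over without copying or reordering.
    conjugate_interleaved(reinterpret_cast<float*>(v.data()), v.size());
}
```

A packetwise layout would need a conversion pass or a temporary copy before calling code like this.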

On Tue, May 19, 2009 at 5:49 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxxx> wrote:
I can believe this is a very efficient storage scheme.
We could offer this as an option if it's really not too hard to
implement (I haven't started thinking about this).

The default should remain the current format for many reasons, but as
an option, why not?


2009/5/19 Márton Danóczy <marton78@xxxxxxxxx>:
I concur: I don't think it would be very useful to have complex
matrices with the real and imaginary parts stored separately. Most
operations, including the more costly ones, would run slower in such a
scheme. The basic issue here is memory locality.
What about storing them packet by packet? That is, in the case of floats,
four real parts followed by four imaginary parts. That would not be
too hard to implement, and vectorization of component-wise operations
would be trivial. And I think even FFTW can handle that using the guru
interface, by setting up a split FFT plan with a stride of
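Here is a sketch of element addressing and a component-wise operation in the packet-by-packet layout. It assumes 4-float packets, and real_offset and abs2_packetwise are hypothetical names. Element k's real part sits at (k / 4) * 8 + k % 4 and its imaginary part 4 floats later, and |z|^2 is a plain lane-wise re*re + im*im.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kP = 4;  // assumption: 4 floats per packet

// Offset of the real part of complex element k in a packetwise buffer
// (blocks of r0 r1 r2 r3 i0 i1 i2 i3); the imaginary part is kP floats later.
constexpr std::size_t real_offset(std::size_t k) {
    return (k / kP) * 2 * kP + k % kP;
}

// Component-wise |z|^2: per block, a lane-wise re*re + im*im, no shuffling.
std::vector<float> abs2_packetwise(const std::vector<float>& buf) {
    assert(buf.size() % (2 * kP) == 0);
    std::vector<float> out(buf.size() / 2);
    for (std::size_t base = 0; base < buf.size(); base += 2 * kP) {
        for (std::size_t l = 0; l < kP; ++l) {
            float re = buf[base + l];
            float im = buf[base + kP + l];
            out[base / 2 + l] = re * re + im * im;
        }
    }
    return out;
}
```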




Rohit Garg

Senior Undergraduate
Department of Physics
Indian Institute of Technology



