Re: [eigen] vectorization of complex


I've just merged the vectorization of complex<float> and
complex<double> into the devel branch. The merge also includes some
optimizations for SSE3. All tests pass with gcc 4.4 in both 32-bit and
64-bit builds. For the rest, well, let's see ;)

The support for mixing types will continue on this fork.

On Wed, Jul 7, 2010 at 12:14 AM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> Awesome!
> Did you find my old email?

Ah thanks, I had forgotten about this one, so let's see:

> 1) Everywhere in Eigen when we have PacketSize==1 conditions, examine
> if we really mean that or if that is a way of asking if the scalar
> type has vectorization. Hint: in 95% of cases it is the latter. An
> exception might be in ei_first_aligned, need to check.

actually, there were only 2 or 3 occurrences...

> 3) Introduce new SIMD functions ei_pconj (conjugate a packet) and
> ei_pmulconj (compute x*conj(y), useful in dot products etc.).
> For real numbers, ei_pconj(x) returns x and ei_pmulconj is just like
> ei_pmul. Implement them for complex<float> and complex<double>. At
> first, do it only for SSE using instructions like SHUFPS, then we'll
> see if stuff can factor out with AltiVec...

This is achieved via a more general ei_conj_helper<T0,T1,bool
Conj0,bool Conj1> object that covers all conjugation configurations
and, in the future, products mixing real and complex types.
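
To illustrate the idea, here is a minimal scalar sketch of such a
helper. It is not the actual Eigen code: the names conj_helper_sketch
and maybe_conj are made up for this example, and the real helper
operates on packets rather than on scalars. The two bool flags simply
select which operand gets conjugated before the product, and
conjugating a real number is a no-op.

```cpp
#include <cassert>
#include <complex>

// Hypothetical scalar analogue of ei_conj_helper<T0,T1,Conj0,Conj1>.
// Names are illustrative only; the real helper works on SIMD packets.

// Real scalars: conjugation is the identity.
template <typename T>
inline T maybe_conj(const T& x, bool /*do_conj*/) { return x; }

// Complex scalars: conjugate only when requested.
template <typename T>
inline std::complex<T> maybe_conj(const std::complex<T>& x, bool do_conj) {
  return do_conj ? std::conj(x) : x;
}

template <typename T0, typename T1, bool Conj0, bool Conj1>
struct conj_helper_sketch {
  // Returns (Conj0 ? conj(a) : a) * (Conj1 ? conj(b) : b).
  auto pmul(const T0& a, const T1& b) const -> decltype(a * b) {
    return maybe_conj(a, Conj0) * maybe_conj(b, Conj1);
  }
};
```

With this, conj_helper_sketch<T,T,false,true>().pmul(x, y) plays the
role of ei_pmulconj(x, y), i.e. x*conj(y), and the same template also
accepts a real first operand with a complex second one.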

Some details:

SSE3 provides a nice instruction called addsub which is perfect for
computing the product of complexes. Using this instruction yields a
significant speedup. So far so good. The problem is that this
instruction is useless for the conjugated multiplications. If you want
to use it without explicitly conjugating the arguments, you need one
more shuffle, which kills the performance. So with SSE3 it is actually
faster to simply do ei_pmul(ei_pconj(a),b). For matrix-matrix
products, it is even faster to let the conjugations happen during the
packing of the blocks, such that we always do basic multiplications.

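
For readers who want to see why addsub fits so well, here is a small
sketch of the data flow. It is an assumption-laden emulation in plain
C++ (a std::array stands in for an SSE register, and all function
names are made up), not the code in Complex.h. A packet holds two
complex<float> as (re0, im0, re1, im1); each helper mimics what one
SSE/SSE3 instruction would do.

```cpp
#include <array>
#include <cassert>

// Emulated 4-float packet: (re0, im0, re1, im1). Illustrative only.
typedef std::array<float, 4> Packet4f;

// Like movsldup: duplicate the even lanes (the real parts).
inline Packet4f dup_even(const Packet4f& p) {
  return Packet4f{{p[0], p[0], p[2], p[2]}};
}
// Like movshdup: duplicate the odd lanes (the imaginary parts).
inline Packet4f dup_odd(const Packet4f& p) {
  return Packet4f{{p[1], p[1], p[3], p[3]}};
}
// Like a shuffle swapping re and im inside each complex.
inline Packet4f swap_pairs(const Packet4f& p) {
  return Packet4f{{p[1], p[0], p[3], p[2]}};
}
// Lane-wise product, like mulps.
inline Packet4f mul(const Packet4f& a, const Packet4f& b) {
  return Packet4f{{a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]}};
}
// Like addsubps: subtract in even lanes, add in odd lanes.
inline Packet4f addsub(const Packet4f& a, const Packet4f& b) {
  return Packet4f{{a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]}};
}

// Plain complex product a*b of two packed complexes:
// re = ar*br - ai*bi, im = ar*bi + ai*br.
inline Packet4f pmul_complex(const Packet4f& a, const Packet4f& b) {
  return addsub(mul(dup_even(a), b), mul(dup_odd(a), swap_pairs(b)));
}

// Conjugate: flip the sign of the imaginary lanes
// (with SSE this would be a single xor with a sign mask).
inline Packet4f pconj(const Packet4f& p) {
  return Packet4f{{p[0], -p[1], p[2], -p[3]}};
}
```

The conjugated product conj(a)*b needs the opposite sign pattern (add
in even lanes, subtract in odd lanes), which addsub cannot produce
directly; hence pmul_complex(pconj(a), b) as described above.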
Still about matrix products, higher performance could be achieved by
explicitly writing the code to perform multiple multiplications at
once. Indeed, each factor is used twice, and some intermediate results
(the shuffles) could be reused. Maybe the compiler does that for us; I
did not check, but I doubt it!

> 5) The puzzle: What to do about ei_pabs() ? In the same vein it would
> be nice to introduce an ei_pabs2()... but we need to solve the
> question: what should they return, a half of a packet of reals???

Yes, that's still an open issue! There are also cases where we would
like to pick two consecutive floats (a1, a2) and put them in a
4-component packet (a1 a1 a2 a2), to then multiply it by a packet of
two complex<float>. This occurs for coefficient-wise products, some
configurations of the diagonal product, etc.
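
As a rough sketch of that last operation (again emulating a packet
with a std::array; dup_reals and real_times_complex are hypothetical
names, not existing Eigen functions): once the two reals are
duplicated, a plain lane-wise product scales both the real and the
imaginary part of each complex.

```cpp
#include <array>
#include <cassert>

// Emulated 4-float packet. Illustrative only.
typedef std::array<float, 4> Packet4f;

// Spread two consecutive reals over four lanes: (a1 a1 a2 a2).
// With SSE this could be one unpack or shuffle of a register with itself.
inline Packet4f dup_reals(float a1, float a2) {
  return Packet4f{{a1, a1, a2, a2}};
}

// Multiply two packed complex<float> (re0, im0, re1, im1) by two reals:
// result is (a1*re0, a1*im0, a2*re1, a2*im1).
inline Packet4f real_times_complex(float a1, float a2, const Packet4f& z) {
  const Packet4f r = dup_reals(a1, a2);
  return Packet4f{{r[0] * z[0], r[1] * z[1], r[2] * z[2], r[3] * z[3]}};
}
```

The open question is only how to express the "two floats" input as a
packet type; the arithmetic itself is a single multiplication.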


> Not that you need it :-) It just has a couple of small ideas that
> might still be relevant.
> Benoit
> 2010/7/6 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>> Hi all,
>> everything is in the title, and this is happening there:
>> complex<float> are already vectorized: speedup factor 4.6x compared to
>> beta1 for a large matrix product :)
>> road-map:
>> 1 - complex<double>
>> 2 - mixed real-complex products
>> 3 - merge
>> 4 - optimized implementation for SSE3 and SSE4 (can be done in
>> parallel to the rest)
>> Please let me handle items 1 and 2 because they might require some
>> non-trivial changes deep inside Eigen, but if some of you want to have
>> fun playing with SSE intrinsics you are very welcome to help with item 4.
>> Everything is in Eigen/src/Core/arch/SSE/Complex.h.
>> cheers,
>> gael.
