Re: [eigen] A complex FFT for Eigen

On Tue, Dec 2, 2008 at 5:25 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:

It seems to me that the optimal solution is to patch <complex> to
vectorize std::complex<double>.
What do you think?
I could live of the idea of saying "yeah i've submitted a patch or two
to GCC" in a party.

Another option (basically, if they don't accept our patches) is to
have our own <complex> having the same name. If the user hasn't
already #included<complex> (as we can tell from the preprocessor
symbol guarding it) then we include our own instead, and we define all
possible preprocessor symbols to make sure that subsequent inclusion
of <complex> by the user is blocked.

Cheers,
Benoit

2008/12/2 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>
> On Tue, Dec 2, 2008 at 4:51 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
> wrote:
>>
>> Yes I agree, I too prefer the 1st solution: own specialization of
>> complex<double>. Actually I hadn't thought of that possibility.
>>
>> Unfortunately, at least here in gcc 4.3, <complex> does specialize
>> complex for float and double. Not just as an external, but the actual
>> specialization is in the header. Which prevents us from doing our own.
>>
>> Moreover, comments refer to what seem to be sections in the C++ spec,
>> so this is probably standard.
>
> I see, that's a pity.
>
>> So at this point I want to say: OK so the c++ standard expects the std
>> lib implementors to fully take care of optimizing complex for float
>> and double, fine! So the next thing to do is to check the assembly
>> with some test program: enable SSE2 and check that the compiler really
>> produces vectorized code. An addition of complex<double> should be
>> vectorized for example. Looking at the <complex> header, there doesn't
>> seem to be any explicit vectorization, but perhaps they know that the
>> compiler does a good job here, or perhaps i don't understand what
>> these __real__, __imag__ pseudo keywords mean.
>
> I've just checked, and g++ does not vectorize at all complex<double> :(
>
>
> gael.
>
>
>>
>> Cheers,
>> Benoit
>>
>>
>> > yes of course we first need to get the vectorization of complex done the
>> > right way, and I still don't know what's the best option. We can either
>> > provide our own specialization of std::complex<double> implementation or
>> > use
>> > the ei_p* functions:
>> >
>> > Own specialization of std::complex<double>
>> > * pro:
>> > - enable its optimization everywhere
>> > * cons:
>> > - we and the user have to make sure a std::complex<double> is always
>> > correctly aligned (easy to do unless the user starts doing fancy casts)
>> > - on the long term, the next gen CPUs will have 256bit registers, so
>> > a
>> > packet of std::complex<double> won't be made of a single coeff
>> > anymore...
>> >
>> > Using the ei_p* with packet of 1 coeff
>> > * pro:
>> > - no need to specialize std::complex<double>
>> > * cons:
>> > - need to enable the vectorization path (to use the ei_p*) even for
>> > packets of one element, so we need another mechanism to specify whether
>> > a
>> > scalar type is vectorizable or not
>> > - usually, the vectorized path does not really make sens for packet
>> > size
>> > of 1 coeff, (on the other hand, the products are implemented on this
>> > model,
>> > so why not)
>> >
>> > all in all, currently I don't see any argument which would strongly kill
>> > one
>> > option in favor of the other, and so I would vote for the first one (own
>> > specialization of std::complex<double>) simply because it enables more
>> > optimizations and it is probably easier to do.
>> >
>> > gael.
>> >
>> >
>> >>
>> >> Cheers,
>> >> Benoit
>> >>
>> >> 2008/12/2 Gael Guennebaud <gael.guennebaud@xxxxxxxxx>:
>> >> >
>> >> > all of this looks great,
>> >> >
>> >> > On Mon, Dec 1, 2008 at 11:31 PM, Benoit Jacob
>> >> > <jacob.benoit.1@xxxxxxxxx>
>> >> > wrote:
>> >> >>
>> >> >> 2008/12/1 Tim Molteno <tim@xxxxxxxxxxxxxxxxxxx>:
>> >> >>
>> >> >> for (unsigned i=0; i<N2; i++)
>> >> >> {
>> >> >> std::complex<T> temp(data[i+N2]*w);
>> >> >> data[i+N2] = data[i]-temp;
>> >> >> data[i] += temp;
>> >> >> w += w*wp;
>> >> >> }
>> >> >>
>> >> >> And I don't see any way to use Eigen expressions here, or to
>> >> >> vectorize
>> >> >> that loop as such (again, no contiguous access).
>> >> >
>> >> > what about the following (let's assume the size of a packet is 2):
>> >> >
>> >> > const int PacketSize = 2;
>> >> > Scalar _w[PacketSize];
>> >> > _w[0] = 0;
>> >> > for (int k=1; k<PacketSize; ++k)
>> >> > _w[k] = _w[k-1] + _w[k-1] * wp;
>> >> > Packet w = ei_pload(_w);
>> >> > Packet wp2 = ei_pset1(wp);
>> >> > for (int i=0; i<N2; i+=PacketSize)
>> >> > {
>> >> > Packet tmp = ei_pmul(ei_pload(data+i+N2), w);
>> >> > Packet di = ei_pload(data+i);
>> >> > ei_store(data+i+N2, ei_psub(di, tmp));
>> >> > ei_store(data+i, ei_padd(di, tmp));
>> >> > w = ei_madd(w, wp2, w);
>> >> > }
>> >> >
>> >> > I think this should work whatever PacketSize is (including the case
>> >> > when
>> >> > the
>> >> > vectorization is disabled, i.e., PacketSize = 1)
>> >> >
>> >> > gael.
>> >> >
>> >>
>> >> ---
>> >>
>> >
>> >
>>
>> ---
>>
>
>

---