Re: [eigen] No vectorization in presence of .cast<T>() calls


2010/12/18 Christoph Hertzberg <chtz@xxxxxxxxxxxxxxxxxxxxxxxx>:
> On 18.12.2010 06:23, Benoit Jacob wrote:
>> Tough case.
>> Casting from unsigned char to float is expanding 1 byte to 4 bytes,
>> which means going from 16 to 4 scalars per 16-byte packet. This change
>> in the number of scalars per packet is what's troublesome for our
>> vectorization system.
>> In general, that's quite hard, but it seems that we can easily
>> overcome this in your particular case. Since you're only casting from a
>> small type to a bigger type, the expression returned by cast() could
>> be vectorizable by implementing packet() by reading LESS THAN a packet
>> from the original uchar expression, and expanding it to float.
>> something like this (pseudo code):
>> packet4f cast<float>(Index i)
>> {
>>    return packet4f(float(src.coeff(i)), float(src.coeff(i+1)),
>> float(src.coeff(i+2)), float(src.coeff(i+3)));
>> }
> Hm, this would use a slow unvectorized FPU-Cast, I guess ...
>> This is only going to be beneficial if this is used in a complex
>> enough expression to pay for the cost of this packet() method. We must
>> make sure not to introduce a performance regression on a simple
>> dst=src.cast<float>() example.
> Just thinking out loud here ...
> First of all, I'd say that casting from a bigger to smaller type should
> be possible with the current vectorization system. Something like this:
> packet4f some_double_expression::cast<float>(Index i){
>        return swizzle(
>                _mm_cvtpd_ps(src.packet(i)),
>                _mm_cvtpd_ps(src.packet(i+2)),
>                /* some index set here */);
> }

Yes, that's possible too. Indeed, it is more generally efficient, as it
uses only packet() on the source expression.
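
To make the narrowing direction concrete, here is a minimal SSE2 sketch.
The helper name cast4_double_to_float and the raw-pointer source are
assumptions for illustration only; a real implementation would call
packet() on the source expression instead of loading from memory. In
place of a general swizzle it uses _mm_movelh_ps to join the two
half-filled conversion results.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE ones)

// Hypothetical helper, not part of Eigen: converts 4 consecutive doubles
// into one packet of 4 floats.
static inline __m128 cast4_double_to_float(const double* src) {
    // Each conversion fills only the low two lanes; the high two are zero.
    __m128 lo = _mm_cvtpd_ps(_mm_loadu_pd(src));      // (f0, f1, 0, 0)
    __m128 hi = _mm_cvtpd_ps(_mm_loadu_pd(src + 2));  // (f2, f3, 0, 0)
    // Move the low half of 'hi' into the high half of the result.
    return _mm_movelh_ps(lo, hi);                     // (f0, f1, f2, f3)
}
```

Two source packets go in and one destination packet comes out, so the
destination still drives the loop one packet at a time.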

> For the other way around I admit that it's very tricky. Especially if
> the smaller type also comes from a complex expression. Ideally you would
> need to read a single packet of the smaller type and return 2, 4 or 8
> consecutive packets of the bigger type -- I have no idea how this could
> be done easily and efficiently ...

That's right, so the pseudocode I proposed could only be a good idea
when CoeffReadCost==1.
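
For that cheap-read case, where the uchar source is plain memory, the
expansion itself does not need a scalar cast. The following is only a
sketch under that assumption: the helper name cast4_uchar_to_float and
the raw-pointer interface are made up for illustration, and it assumes
a little-endian x86 target with SSE2. It reads 4 bytes (less than a
packet) and widens them with unpack instructions.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstring>      // std::memcpy

// Hypothetical helper, not part of Eigen: expands 4 consecutive unsigned
// chars into one packet of 4 floats.
static inline __m128 cast4_uchar_to_float(const unsigned char* src) {
    int bits;
    std::memcpy(&bits, src, 4);                        // read only 4 bytes
    const __m128i zero = _mm_setzero_si128();
    __m128i bytes  = _mm_cvtsi32_si128(bits);          // 4 bytes in the low lane
    __m128i words  = _mm_unpacklo_epi8(bytes, zero);   // zero-extend 8 -> 16 bits
    __m128i dwords = _mm_unpacklo_epi16(words, zero);  // zero-extend 16 -> 32 bits
    return _mm_cvtepi32_ps(dwords);                    // int32 -> float, 4 lanes
}
```

Whether this pays off on a simple dst=src.cast<float>() would still
have to be measured.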

> Another point: Did someone already think about how to support AVX in the
> future? -- Last time I checked there were some hard-coded 16 in the code ....

Sure, as soon as AVX hardware is actually available, that's a useful
thing to do :-)

Yes, there are still hardcoded 16's, but that's not a big problem.
They should be easy to get rid of, as they mostly don't affect the API.

The only thing that worries me a little bit is when this impacts the
API itself. There are 2 such places:
 - the Aligned option to Map
 - the AlignedBit flag on expressions (that one is largely internal though).

What we can do in Eigen 3.1 is:
 - replace Aligned by a template<int N=16> struct Aligned. So Aligned
would mean 16-byte aligned, and you could do Aligned<32> if you want
to explicitly specify 32-byte alignment. This preserves the API, but
changes the ABI of class Map. I think that's OK: we're only making ABI
stability guarantees on plain objects, not expressions, so especially
in Eigen 3.0 this doesn't have to include class Map.
 - similarly, kill AlignedBit and replace it by an Aligned<N> typedef.
This is a quite radical change, so I would propose to declare that
Flags is completely internal in 3.0. To limit application trouble, we
can keep the AlignedBit around in 3.1, mark it as deprecated, and make
sure its value is always 0 instead of being undefined (which would
give undefined app crashes).
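
A rough sketch of what the Aligned<N> idea could look like follows.
Everything here is hypothetical: MapSketch is a stand-in and not the
real class Map, and the member names are invented for illustration.
Note that in this form the 16-byte default has to be spelled Aligned<>
when it is used as a type.

```cpp
#include <cstdint>  // std::uintptr_t

// Hypothetical alignment tag: N is the promised alignment in bytes.
template <int N = 16>
struct Aligned {
    static_assert(N > 0 && (N & (N - 1)) == 0,
                  "alignment must be a power of two");
    static constexpr int Bytes = N;
};

// Hypothetical stand-in for a Map-like view that carries its alignment
// promise as a type instead of a hardcoded 16.
template <typename Scalar, typename Alignment = Aligned<> >
struct MapSketch {
    const Scalar* data;

    // True if the pointer satisfies the alignment promised by the tag.
    bool pointer_is_aligned() const {
        return reinterpret_cast<std::uintptr_t>(data) % Alignment::Bytes == 0;
    }
};
```

With this shape, MapSketch<float> keeps the 16-byte meaning and
MapSketch<float, Aligned<32> > asks for 32-byte alignment explicitly.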


> Christoph
> --
> ----------------------------------------------
> Dipl.-Inf. Christoph Hertzberg
> Cartesium 0.051
> Universität Bremen
> Enrique-Schmidt-Straße 5
> 28359 Bremen
> Tel: (+49) 421-218-64252
> ----------------------------------------------
