Re: [eigen] New(?) way to make using SIMD easier


2009/11/24 Mark Borgerding <mark@xxxxxxxxxxxxxx>:
> Just an idea:
> What if the user could write code like:
>
>   VectorOperator( std::plus<SomeDataType>() , dstPtr , srcPtr1, srcPtr2 ,
> num );
>
> which would use a SIMD-optimized call if one exists, and use a generic
> algorithm otherwise.

There seem to be 2 aspects in your proposal:

 1) providing a uniform interface using functors

  ---> we already have that: we have little functors encapsulating all
sorts of operations and they provide a packetOp() method that does
just that. See Functors.h. However they don't take care of
loading/storing to memory, see next point (they take pre-loaded
packets/registers).

2) providing functions that do load+operation+store instead of
requiring one to call ei_pload and ei_pstore.

  ---> but I don't think that's a good idea, because it means that when
complex expressions are compiled, a lot of redundant loads/stores happen.
For example consider what would happen if we did implement operator+
using such a function. Then compiling
   u+v+w
would result in basically:

1. load u from memory
2. load v from memory
3. add them, store to memory (temporary)
4. load that temporary again from memory
5. load w from memory
6. add them
7. store

here, steps 3-4 are redundant. Using expression templates allows us to
avoid compiling them. This is why we do not use combined functions for
load+op+store.
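
To make the difference concrete, here is a minimal sketch of the expression-template idea. It is not Eigen's actual code: the names Vec, Sum, eager_add and the g_stores counter are all made up for illustration, and it works on plain floats rather than packets. The point is only that `r = u + v + w` is evaluated in a single pass with one store per element, while the eager load+op+store version has to write and re-read a temporary.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical counter: number of element stores to memory, for comparison only.
static std::size_t g_stores = 0;

struct Vec {
    std::vector<float> data;
    explicit Vec(std::size_t n, float value = 0.0f) : data(n, value) {}
    std::size_t size() const { return data.size(); }
    float operator[](std::size_t i) const { return data[i]; }

    // Assigning from an expression evaluates it in one loop: one store per element.
    template <typename Expr>
    Vec& operator=(const Expr& e) {
        assert(e.size() == size());
        for (std::size_t i = 0; i < size(); ++i) {
            data[i] = e[i];
            ++g_stores;
        }
        return *this;
    }
};

// Lazy node for lhs + rhs: nothing is computed until operator[] is called.
template <typename L, typename R>
struct Sum {
    const L& lhs;
    const R& rhs;
    std::size_t size() const { return lhs.size(); }
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

// Building an expression only records references to the operands.
inline Sum<Vec, Vec> operator+(const Vec& a, const Vec& b) { return {a, b}; }

template <typename L, typename R>
Sum<Sum<L, R>, Vec> operator+(const Sum<L, R>& a, const Vec& b) { return {a, b}; }

// Eager counterpart: a combined load+add+store, so chaining needs a temporary.
inline Vec eager_add(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    Vec out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out.data[i] = a[i] + b[i];
        ++g_stores;
    }
    return out;
}
```

With three vectors of size 4, `r = u + v + w` performs 4 stores, whereas `eager_add(eager_add(u, v), w)` performs 8, the extra 4 being the temporary from step 3 above.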

Just fyi here is basically the code that eigen emits for u+v+w,
assuming for example Vector4f (so you also see how these ei_p*
functions are used):

Packet4f pu = ei_pload(u.data());
Packet4f pv = ei_pload(v.data());
Packet4f t = ei_padd(pu,pv);
Packet4f pw = ei_pload(w.data());
Packet4f result = ei_padd(t,pw);

Here this Packet4f type is a typedef for a built-in type (e.g. __m128
on SSE) that the compiler recognizes as a SIMD packet and knows how to
store as a SIMD register (e.g. xmm0).

See Core/arch/SSE/PacketMath.h
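
For readers who have not looked at that file, the following is a rough sketch of the same u+v+w sequence written directly with SSE intrinsics, extended with the final store and a loop over a longer array. It assumes that the ei_p* wrappers boil down to a load intrinsic and an add intrinsic on SSE; the helper name add3 is hypothetical, and the sketch uses unaligned loads/stores so that it works with any pointer. A scalar fallback is included so it still compiles where SSE is not available.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical helper, not part of Eigen: out[i] = u[i] + v[i] + w[i].
// n must be a multiple of 4 to keep the sketch short (no tail handling).
inline void add3(const float* u, const float* v, const float* w,
                 float* out, std::size_t n) {
    assert(n % 4 == 0);
#if defined(__SSE__)
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 pu = _mm_loadu_ps(u + i);    // load 4 floats of u
        __m128 pv = _mm_loadu_ps(v + i);    // load 4 floats of v
        __m128 t = _mm_add_ps(pu, pv);      // intermediate is never stored to memory here
        __m128 pw = _mm_loadu_ps(w + i);    // load 4 floats of w
        __m128 result = _mm_add_ps(t, pw);
        _mm_storeu_ps(out + i, result);     // the single store
    }
#else
    // Scalar fallback with the same result.
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = u[i] + v[i] + w[i];
    }
#endif
}
```

Calling `add3(u, v, w, out, 4)` on four-element float arrays corresponds to the Vector4f case above: three loads, two adds, one store.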

> It might even be used to detect special conditions.  e.g. If CUDA processing
> is enabled and the source pointers are device memory  It performs all
> calculations on the device.  Then brings the result back to the host only if
> the destination resides in host memory.

We wouldn't want to do any runtime branching in such a small and
frequently-called function; if one wants to detect these things at
runtime, one needs to do it at a wider level, otherwise too much time
would be wasted on the branches.
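
One possible reading of "at a wider level", sketched below: make the runtime decision once per whole array and then run a branch-free kernel, instead of testing inside the per-element (or per-packet) operation. Everything here is hypothetical (AddKernel, add_generic, add_unrolled, select_add_kernel, add_arrays, and the fast_path_available flag); the "fast" kernel is just an unrolled loop standing in for a SIMD or device path.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel signature: dst[i] = a[i] + b[i] for i in [0, n).
using AddKernel = void (*)(float*, const float*, const float*, std::size_t);

inline void add_generic(float* dst, const float* a, const float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}

// Stand-in for a specialized path: handles 4 elements per iteration, then the tail.
inline void add_unrolled(float* dst, const float* a, const float* b, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i] = a[i] + b[i];
        dst[i + 1] = a[i + 1] + b[i + 1];
        dst[i + 2] = a[i + 2] + b[i + 2];
        dst[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}

// The only runtime branch: executed once per call, not once per element.
inline AddKernel select_add_kernel(bool fast_path_available) {
    return fast_path_available ? &add_unrolled : &add_generic;
}

inline void add_arrays(float* dst, const float* a, const float* b,
                       std::size_t n, bool fast_path_available) {
    assert(dst != nullptr && a != nullptr && b != nullptr);
    AddKernel kernel = select_add_kernel(fast_path_available);
    kernel(dst, a, b, n);  // inner loops contain no capability checks
}
```

The cost of the check is then amortized over n elements, which is negligible for large arrays but would still dominate for something as small as a Vector4f.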

If there is something that you wanted to do and didn't see how to do
using Eigen's current infrastructure, can we help?

Cheers,
Benoit


