Re: [eigen] AVX/LRB and SSE with SoA support

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]


Hi,

I've attached a modified version of your code for Eigen3 as well as
the corresponding ASM generated with gcc 4.5.


On Fri, Jul 23, 2010 at 6:45 AM, keller <christoph-keller@xxxxxx> wrote:
> Hello,
>
> I implemented a function where a sphere cuts a Ray Bundle (Most problems can
> be demonstrated there). It may not be optimally implemented, but i think it
> works as a simple example.
>
> I did not yet put the data into blocks, as a few other problems occurred.
>
> I used eigen2 for this.
> -using VS2008 and the Intel compiler it compiles and runs correctly.
> -g++ prints out errors with cwise()

Yes this is because when you call a template function on a template
type you have to use the template keyword, e.g.:

mat.template block<rows,cols>(i,j)


> The resulting code does not use SSE-Instructions that often. I do not know
> if this is because i used eigen2 in the wrong way. The compiler(VS2008) has
> SSE-Instructions enabled.
>
> Also the _mm_or_ps is avaiable through ei_por in Eigen3, but not via a
> normal interface function.

This is something that would make sense to have upstream in the Array
API. The question is shall we use operator|  (i.e., A | B) or a very
explicit A.bitwiseOr(B) function. Since operator| is not defined for
non integral type I would go for the explicit bitwiseOr.

> _mm_movemask_ps is not avaiable at all.

This one is more difficult to expose in a useful and portable way.
Perhaps something like:

ei_pallpositive(v) / ei_pallnegative(v)

for the low level functions, and perhaps in second time we could have
shortcuts for:

(A > 0).all() and (A < 0).all()

like

A.allPositive() / A.allNegative()

which would be built on top of the respective ei_p* functions, but I'm
not 100% sure and this is really specific use cases. I doubt that's
really critical even in your case.

gael

> Is it a good idea at all to use Eigen, if one does not want to be more than
> about 20% slower than with a naive Intrinsics-implementation?

For this kind of low level stuff you can attack Eigen at two level:
using Array<> API, or directly using the ei_p* functions. The former
has the advantage to allow you to use "blocks" of multiple packets
with automatic unrolling, while the later offers more flexibility for
fine tuning...



> Greetings,
> Christoph
>
>
> On 07/03/2010 09:11 AM, Gael Guennebaud wrote:
>>
>> Hi,
>>
>> first things first, supporting a new vector engine in eigen is
>> relatively easy. However, you won't be able to right code
>> "templatized" on the vector engine, and even with a custom library I
>> doubt that's doable. The choice of the vector engine has to be
>> compiler options. So, the way to support multiple vector engines at
>> runtime (I guess this is what you want?), is to have a trivial
>> function foo(...) selecting the right implementation function e.g.,
>> foo_sse(...), foo_sse4(...), etc. The foo_* functions are implemented
>> only once, by putting it in its own file which will be compiled
>> multiple times with different compiler options (e.g., -msse, -msse4,
>> etc.). The actual of the function (foo_sse) will be built using the
>> preprocessor, e.g.:
>>
>> EIGEN_CAT(foo_,VECTOR_ENGINE_SUFFIX)(...) { .... }
>>
>> Now regarding the special interleaved packing of the data, there is
>> currently no such thing in eigen, however, I think you can easily add
>> that on top of, e.g., a
>> Array<float,Dynamic,ei_packet_traits<float>::size,RowMajor>. It will
>> be initialized with dimension *
>>
>> ((size/ei_packet_traits<float>::size)+((size%ei_packet_traits<float>::size)==0
>> ? 0 : 1) rows where dimension is three in your example, and size is
>> the number of element. You can easily get the i-th element as follow:
>>
>>
>> underlying_array.block<dimension,1>((i/ei_packet_traits<float>::size)*dimension,
>> i%ei_packet_traits<float>::size) = Vector3f(x,y,z);
>>
>> The idea would be to add a class warping this underlying_array to make
>> it convenient to use. The main questions are what is the set of
>> features which have to be supported? How do we want to use it? through
>> a manual loop over the (e.g., 3x4) blocks? through functors? through
>> high level expression template code? etc.
>>
>> To finish, this is definitely something I planed to do the future, so
>> I'd be glad to discuss with you its design and help you to get it
>> right regarding Eigen.
>>
>> cheers,
>>
>> gael
>>
>> On Fri, Jul 2, 2010 at 11:56 PM, keller<christoph-keller@xxxxxx>  wrote:
>>
>>>
>>> Hello,
>>>
>>> I have to implement a raytracer that uses the Vector units of modern x86
>>> cpus.
>>>
>>> Intel will introduce AVX soon. I have to support AVX and SSE and i do not
>>> want all of my code replicated. I currently use the Intels intrinsics. If
>>> Intel adds the Larrabee extensions into normal x86 CPUs one would even
>>> have
>>> to write three versions.
>>>
>>> As i use Eigen in parts of the software i wonder if one could support
>>> this
>>> with Eigen. The typical situation is like this.
>>> -One has a ray bundle with 16 rays that are arranged in SoA format:
>>> x1,..,x4
>>> y1,...,y4
>>> z1,...,z4
>>> .....
>>> x13,..,x16
>>> y13,...,y16
>>> z13,...,z16
>>> This format has to be changed when using AVX/LRB of course.
>>> -One source of the rays and an object like a sphere
>>>
>>> Ideally i have a function like testCut(Bundles, Sphere) that uses a
>>> template
>>> parameter to set the Vector extension to use. I do not want to partially
>>> specialize this function but use some structures and functions in testCut
>>> that depend on the template parameter.
>>>
>>> The question for me is: Develop a small library of my own, or use Eigen
>>> (where i can contribute this functionality). I know i can do this with a
>>> custom library, but i dont know if it is easy to add this functionality
>>> to
>>> Eigen.
>>>
>>> I think a lot of people will have the same problem soon.
>>>
>>> Greetings,
>>> Christoph
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
// EigenTest.cpp : Definiert den Einstiegspunkt ___ die Konsolenanwendung.
//

#include <Eigen/Core>
#include <Eigen/Geometry>
#include <iostream>
#include <xmmintrin.h>
#include <emmintrin.h>

using namespace Eigen;

#ifndef EIGEN_VECTORIZE
  #error
#endif

typedef Vector4f Vec;

template <int h, int w> 
EIGEN_STRONG_INLINE Array<float , h, 1>
SOAm (Array<float, h, w>& l, Array<float, h, w>& r) {
  EIGEN_ASM_COMMENT("begin SOAm");
	Array<float , h, 1>  ret = Array<float , h, 1>::Zero();
	for(int i=0; i < w; i++) {
	  ret += l.col(i) * r.col(i);
	}
  EIGEN_ASM_COMMENT("end SOAm");
	return ret;
}

/* cut sphere (r, pos) with rays (from, step)
 * only test if cut occurs */
template <int len, int vLen> 
EIGEN_DONT_INLINE void cut(float& r, 
		 Vec& pos, 
		 Array<float, len, 3>& from, 
		 Array<float, len, 3>& step,
		 Array<float, len, 1>& mask) {
	typedef Array<float , vLen, 1> cT;
	typedef Array<float , vLen, 3> mT;
    
    mT SOAp;
	cT SOArr = cT::Constant(r*r);
	/* obviously for sphere:
	 * (from+t*step)^2 = r*r
	 * from*step*2*t+from^2+t*t*step^2 = r*r
	 * (t*t*step^2) + (2*from*step*t) + (from^2-r*r);
	 * a*t* t       +  b*t            +  c
	 */

	for(int i=0; i < 3; i++) //this may be set into the constructor
		SOAp.col(i).fill(pos[i]);

	for(int i=0; i < len; i += vLen) {
    // Use Map to tell Eigen the alignment is preserved
		mT frS = Map<mT,Aligned,OuterStride<len> >(&from.coeff(i,0));
    mT stS = Map<mT,Aligned,OuterStride<len> >(&step.coeff(i,0));

		mT diff = frS - stS;		

    EIGEN_ASM_COMMENT("three calls to SOAm");
		//This does not get inlined!
		cT aInv = SOAm<vLen,3>(stS, stS).inverse();
		cT b = 2*SOAm<vLen,3>(stS, frS);
		cT c = SOAm<vLen,3>(frS, frS) - SOArr;

		cT p = b * aInv;
		cT q = b * aInv;

		cT sqrVal = (p*p)*0.25f - q;
		if(_mm_movemask_ps(_mm_load_ps(sqrVal.data())) == 15) continue;

    EIGEN_ASM_COMMENT("call to sqrt");
		sqrVal = sqrVal.sqrt(); /* does this use _mm_sqrt_ps()? */
		cT bigSol = -p*0.5f + sqrVal;
	
    EIGEN_ASM_COMMENT("manual SSE begins here");
		/*may be better to do this in the add unit*/
		__m128 doesCut = _mm_or_ps( /*case some sqrVal < 0 not considered here*/
				_mm_cmpgt_ps(_mm_load_ps(bigSol.data()),_mm_setzero_ps()),
				_mm_load_ps(mask.data()+i));
		_mm_store_ps(mask.data()+i, doesCut);
	}
	EIGEN_ASM_COMMENT("end");
}
int main(int argc, char** argv)
{
	using namespace std;
#define LEN 16

	float r = 4;
	Vec pos = Vec::Constant(1);
	pos[2] = 2;

	/* I am not sure if sth. like
	 * Array<float, 3, 4>[4] from, step;
	 * would be better for the memory access in cut*/

	Array<float, LEN, 3> from = Array<float, LEN, 3>::Zero();
	Array<float, LEN, 3> step = Array<float, LEN, 3>::Ones();


	/*use float, as the floating point unit is used*/
	Array<float, LEN, 1> hitmask = Array<float, LEN, 1>::Zero();

	/*all ray should cut the sphere*/
	cut<LEN, 4>(r, pos, from, step, hitmask);

	return 0;
}

Attachment: rayb.asm
Description: Binary data



Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/