Re: [eigen] matrix::linspace

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]


I tried using sequential packet access. Unfortunately, it did not
help. Maybe I am still missing something. The required patch (indexed
packet access) is attached to this mail as well as my test code.

- Hauke

On Sat, Jan 23, 2010 at 4:03 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> 2010/1/23 Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx>:
>> Hi there,
>>
>> I am working on some functionality which would allow something like this:
>>
>>  VectorXd a = VectorXd::linspace(low,high,num_steps)
>>
>> which should be corresponding to:
>>
>>  const double step_size = (high-low)/(num_steps-1);
>>  VectorXd a(num_steps);
>>  for (int i=0; i<num_steps; ++i)
>>   a(i) = low+i*step_size;
>>
>> In Eigen, this is best implemented based on NullaryExpressions, i.e.
>> Matrices, where each element can be computed based on some functor.
>
> I agree!
>
>> In
>> theory, this should not only be nice from a usability point of view
>> but I also hoped of gaining from vectorization. It tuns out, that this
>> is not that easy - it maybe due to MSVC but maybe also due to my lack
>> of SSE skills. For the most simple example
>>
>>  for (int i=0; i<num_steps; ++i) a(i) = i;
>>
>> I managed to implement the functor's method ei_linspace_op::packetOp
>> such that we do actually gain from vectorization as (note that I added
>> integer parameters to it):
>>
>>  // for doubles only
>>  ei_linspace_op::packetOp(int i, int = 0)
>>  {
>>    return ei_padd(ei_pset1<Scalar>(i),ei_pset<Scalar>(0,1));
>>  }
>>
>> For me the first surprise was that this performs way better than just
>> returning ei_pset<Scalar>(i,i+1).
>
> Notice that reading from memory is costly, and ei_pset1 is costly
> because it needs a SHUFPD instruction, so naively I would think that
> the best that can happen is that you make once and for all a packet
> (0,1) and a packet (1,1) and at each step you only add the latter into
> the former. Perhaps your code is fast because it allowed the compiler
> to do that optimization?
>
>> Next, I looked into the case:
>>
>>  for (int i=0; i<num_steps; ++i) a(i) = low+i
>>
>> and here, I totally failed. Returning
>>
>>  return ei_pset<Scalar>(m_low+i,m_low+i+1);
>>
>> resulted in no statistically meaningful gain (~0.06%) speedup. The
>> same holds for
>>
>>  return ei_padd(ei_pset1<Scalar>(i),ei_padd(ei_pset1(m_low),ei_pset<Scalar>(0,1)));
>>
>> which is even sometimes slower and permuting the operators did not help either.
>>
>> So, I am sort of puzzled about what I could improve. Any ideas? Any
>> SSE operation, I am not aware of for creating (a, a+1, a+2, ...)
>
> Same remark as above. I would initially store the packet (low,low+1)
> in a SSE register, store (1,1) into another SSE register, and only do
> a ei_padd at each iteration. Naively I would think that that is much
> faster?
>
>> Finally, note that ei_pset is not yet in PacketMath. Over here, I
>> added forward delarations into GenericPacketMath and the actual SSE
>> implementation in the corresponding PacketMath header (no Altivec
>> yet). They GenericPacketMath declarations (as the name says) have no
>> implementations since I don't see any meaningful way of providing them
>> nor do I think they are useful.
>
> Indeed I think they're only used in the matrix product code, which
> doesn't use your new ei_pset function.
>
>> The second change in the lib was
>> mentioned above - I added indices to the packetOp functions.
>
> Oh right, you need to fit your stuff with our packetOp infrastructure.
> Hm, not easy indeed, as I was assuming sequential, not random access.
>
>>
>> If everything regarding the vectorization fails, we could still
>> provide the functionality (with a specialization for the trivial case
>> where we gain from vectorizing) because at least, I think it is nifty.
>
> We might fail to vectorize random access efficiently. But 95% of use
> cases will be sequential access. This could then be a use case for a
> ReturnByValue function. When nesting this into a larger expression,
> this would have the drawback of introducing a temporary, and the
> advantage of enabling full vectorization instead of preventing
> vectorization of the larger expression.
>
> Benoit
>
>
>

Attachment: main.cpp
Description: Binary data

Attachment: cwisenullary.patch
Description: Binary data



Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/