Re: [eigen] matrix::linspace

[ Thread Index | Date Index | More Archives ]

Hi Hauke,

You've been trapped in a subtlety of gael's BenchTimer: when you use a
single BenchTimer object more than 1 time, it returns the best time.

I'd propose to rename it to something more explicit, and letting the
BenchTimer name refer to something much simpler.

That is why, when I ran your program, I got twice the same time.

Now I edited main.cpp as attached, and I get:

vec: 893.094 ms
lin: 6745.94 ms


2010/1/24 Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx>:
> I tried using sequential packet access. Unfortunately, it did not
> help. Maybe I am still missing something. The required patch (indexed
> packet access) is attached to this mail as well as my test code.
> - Hauke
> On Sat, Jan 23, 2010 at 4:03 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>> 2010/1/23 Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx>:
>>> Hi there,
>>> I am working on some functionality which would allow something like this:
>>>  VectorXd a = VectorXd::linspace(low,high,num_steps)
>>> which should be corresponding to:
>>>  const double step_size = (high-low)/(num_steps-1);
>>>  VectorXd a(num_steps);
>>>  for (int i=0; i<num_steps; ++i)
>>>   a(i) = low+i*step_size;
>>> In Eigen, this is best implemented based on NullaryExpressions, i.e.
>>> Matrices, where each element can be computed based on some functor.
>> I agree!
>>> In
>>> theory, this should not only be nice from a usability point of view
>>> but I also hoped of gaining from vectorization. It tuns out, that this
>>> is not that easy - it maybe due to MSVC but maybe also due to my lack
>>> of SSE skills. For the most simple example
>>>  for (int i=0; i<num_steps; ++i) a(i) = i;
>>> I managed to implement the functor's method ei_linspace_op::packetOp
>>> such that we do actually gain from vectorization as (note that I added
>>> integer parameters to it):
>>>  // for doubles only
>>>  ei_linspace_op::packetOp(int i, int = 0)
>>>  {
>>>    return ei_padd(ei_pset1<Scalar>(i),ei_pset<Scalar>(0,1));
>>>  }
>>> For me the first surprise was that this performs way better than just
>>> returning ei_pset<Scalar>(i,i+1).
>> Notice that reading from memory is costly, and ei_pset1 is costly
>> because it needs a SHUFPD instruction, so naively I would think that
>> the best that can happen is that you make once and for all a packet
>> (0,1) and a packet (1,1) and at each step you only add the latter into
>> the former. Perhaps your code is fast because it allowed the compiler
>> to do that optimization?
>>> Next, I looked into the case:
>>>  for (int i=0; i<num_steps; ++i) a(i) = low+i
>>> and here, I totally failed. Returning
>>>  return ei_pset<Scalar>(m_low+i,m_low+i+1);
>>> resulted in no statistically meaningful gain (~0.06%) speedup. The
>>> same holds for
>>>  return ei_padd(ei_pset1<Scalar>(i),ei_padd(ei_pset1(m_low),ei_pset<Scalar>(0,1)));
>>> which is even sometimes slower and permuting the operators did not help either.
>>> So, I am sort of puzzled about what I could improve. Any ideas? Any
>>> SSE operation, I am not aware of for creating (a, a+1, a+2, ...)
>> Same remark as above. I would initially store the packet (low,low+1)
>> in a SSE register, store (1,1) into another SSE register, and only do
>> a ei_padd at each iteration. Naively I would think that that is much
>> faster?
>>> Finally, note that ei_pset is not yet in PacketMath. Over here, I
>>> added forward delarations into GenericPacketMath and the actual SSE
>>> implementation in the corresponding PacketMath header (no Altivec
>>> yet). They GenericPacketMath declarations (as the name says) have no
>>> implementations since I don't see any meaningful way of providing them
>>> nor do I think they are useful.
>> Indeed I think they're only used in the matrix product code, which
>> doesn't use your new ei_pset function.
>>> The second change in the lib was
>>> mentioned above - I added indices to the packetOp functions.
>> Oh right, you need to fit your stuff with our packetOp infrastructure.
>> Hm, not easy indeed, as I was assuming sequential, not random access.
>>> If everything regarding the vectorization fails, we could still
>>> provide the functionality (with a specialization for the trivial case
>>> where we gain from vectorizing) because at least, I think it is nifty.
>> We might fail to vectorize random access efficiently. But 95% of use
>> cases will be sequential access. This could then be a use case for a
>> ReturnByValue function. When nesting this into a larger expression,
>> this would have the drawback of introducing a temporary, and the
>> advantage of enabling full vectorization instead of preventing
>> vectorization of the larger expression.
>> Benoit
#include <Eigen/Dense>
#include <bench/BenchTimer.h>

using namespace Eigen;

template <typename Scalar, bool RandomAccess> struct ei_linspace_op_impl;

// double, sequential
template <> 
struct ei_linspace_op_impl<double,false>
  typedef ei_packet_traits<double>::type PacketScalar;
  ei_linspace_op_impl(double low, double step) : m_base(ei_pset(low-2*step,low-step)), m_step(ei_pset1(2*step)) {}
  EIGEN_STRONG_INLINE const PacketScalar packetOp() const { return m_base = ei_padd(m_base,m_step); }
  mutable PacketScalar m_base;
  PacketScalar m_step;

// ----- Linspace functor ----------------------------------------------------------------

// Forward declaration
template <typename Scalar, bool RandomAccess = true> struct ei_linspace_op;
namespace Eigen{
// Functor traits
template <typename Scalar, bool RandomAccess> struct ei_functor_traits< ei_linspace_op<Scalar,RandomAccess> >
{ enum { Cost = 1, PacketAccess = ei_packet_traits<Scalar>::size>1, IsRepeatable = true }; };
// Functor interface
template <typename Scalar, bool RandomAccess> struct ei_linspace_op 
  typedef typename ei_packet_traits<Scalar>::type PacketScalar;
  ei_linspace_op(Scalar low, Scalar high, int num_steps) : impl(low,(high-low)/(num_steps-1)), m_low(low), m_step((high-low)/(num_steps-1)) {}
  EIGEN_STRONG_INLINE const Scalar operator() (int i, int = 0) const { return m_low+i*m_step; }
  EIGEN_STRONG_INLINE const PacketScalar packetOp(int, int = 0) const { return impl.packetOp(); }

  ei_linspace_op_impl<Scalar,RandomAccess> impl;
  Scalar m_low;
  Scalar m_step;

int main()
    BenchTimer t1,t2; double e1,e2;

    double low = 5.0;
    double high = 7.0;
    const int size = 200000;
    double step = (high-low)/(size-1);
      ArrayXd test;      
      for (int i=0; i<5000; ++i)
        test = ArrayXd::NullaryExpr(size, ei_linspace_op<double,false>(low,high,size));
    e1 = t1.value();
      ArrayXd test(size);
      for (int i=0; i<5000; ++i)
        for (int i=0; i<size; ++i)
          test(i) = low+i*step;
    e2 = t2.value();
    std::cout << "vec: " << e1*1000.0 << " ms" << std::endl;
    std::cout << "lin: " << e2*1000.0 << " ms" << std::endl;

Mail converted by MHonArc 2.6.19+