Re: [eigen] matrix::linspace
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] matrix::linspace
- From: Benoit Jacob <jacob.benoit.1@xxxxxxxxx>
- Date: Sun, 24 Jan 2010 12:17:03 -0500
Hi Hauke,
You've been caught by a subtlety of Gael's BenchTimer: when you use a
single BenchTimer object more than once, it reports the best time over
all runs. That is why, when I ran your program, I got the same time
twice. I'd propose renaming it to something more explicit, and letting
the BenchTimer name refer to something much simpler.
Now I edited main.cpp as attached, and I get:
vec: 893.094 ms
lin: 6745.94 ms
Cheers,
Benoit
2010/1/24 Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx>:
> I tried using sequential packet access. Unfortunately, it did not
> help. Maybe I am still missing something. The required patch (indexed
> packet access) is attached to this mail as well as my test code.
>
> - Hauke
>
> On Sat, Jan 23, 2010 at 4:03 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>> 2010/1/23 Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx>:
>>> Hi there,
>>>
>>> I am working on some functionality which would allow something like this:
>>>
>>> VectorXd a = VectorXd::linspace(low,high,num_steps)
>>>
>>> which should correspond to:
>>>
>>> const double step_size = (high-low)/(num_steps-1);
>>> VectorXd a(num_steps);
>>> for (int i=0; i<num_steps; ++i)
>>> a(i) = low+i*step_size;
>>>
>>> In Eigen, this is best implemented with a NullaryExpression, i.e. a
>>> matrix where each element is computed by a functor.
>>
>> I agree!
>>
>>> In theory, this should not only be nice from a usability point of
>>> view, but I also hoped to gain from vectorization. It turns out
>>> that this is not that easy - maybe due to MSVC, but maybe also due
>>> to my lack of SSE skills. For the most simple example
>>>
>>> for (int i=0; i<num_steps; ++i) a(i) = i;
>>>
>>> I managed to implement the functor's method ei_linspace_op::packetOp
>>> such that we actually gain from vectorization (note that I added
>>> integer parameters to it):
>>>
>>> // for doubles only
>>> ei_linspace_op::packetOp(int i, int = 0)
>>> {
>>> return ei_padd(ei_pset1<Scalar>(i),ei_pset<Scalar>(0,1));
>>> }
>>>
>>> For me the first surprise was that this performs way better than just
>>> returning ei_pset<Scalar>(i,i+1).
>>
>> Notice that reading from memory is costly, and ei_pset1 is costly
>> because it needs a SHUFPD instruction, so naively I would think that
>> the best that can happen is that you make once and for all a packet
>> (0,1) and a packet (1,1) and at each step you only add the latter into
>> the former. Perhaps your code is fast because it allowed the compiler
>> to do that optimization?
>>
>>> Next, I looked into the case:
>>>
>>> for (int i=0; i<num_steps; ++i) a(i) = low+i
>>>
>>> and here, I totally failed. Returning
>>>
>>> return ei_pset<Scalar>(m_low+i,m_low+i+1);
>>>
>>> resulted in no statistically meaningful speedup (~0.06%). The same
>>> holds for
>>>
>>> return ei_padd(ei_pset1<Scalar>(i),ei_padd(ei_pset1(m_low),ei_pset<Scalar>(0,1)));
>>>
>>> which is sometimes even slower, and permuting the operands did not help either.
>>>
>>> So, I am sort of puzzled about what I could improve. Any ideas? Is
>>> there any SSE operation I am not aware of for creating (a, a+1, a+2, ...)?
>>
>> Same remark as above. I would initially store the packet (low,low+1)
>> in a SSE register, store (1,1) into another SSE register, and only do
>> a ei_padd at each iteration. Naively I would think that that is much
>> faster?
>>
>>> Finally, note that ei_pset is not yet in PacketMath. Over here, I
>>> added forward declarations in GenericPacketMath and the actual SSE
>>> implementation in the corresponding PacketMath header (no AltiVec
>>> yet). The GenericPacketMath declarations (as the name says) have no
>>> implementations, since I don't see any meaningful way of providing
>>> them, nor do I think they are useful.
>>
>> Indeed I think they're only used in the matrix product code, which
>> doesn't use your new ei_pset function.
>>
>>> The second change in the lib was
>>> mentioned above - I added indices to the packetOp functions.
>>
>> Oh right, you need to fit your stuff into our packetOp infrastructure.
>> Hm, not easy indeed, as I was assuming sequential, not random, access.
>>
>>>
>>> If everything regarding the vectorization fails, we could still
>>> provide the functionality (with a specialization for the trivial case
>>> where we gain from vectorizing) because at least, I think it is nifty.
>>
>> We might fail to vectorize random access efficiently. But 95% of use
>> cases will be sequential access. This could then be a use case for a
>> ReturnByValue function. When nesting this into a larger expression,
>> this would have the drawback of introducing a temporary, and the
>> advantage of enabling full vectorization instead of preventing
>> vectorization of the larger expression.
>>
>> Benoit
>>
>>
>>
>
#include <iostream>
#include <Eigen/Dense>
#include <bench/BenchTimer.h>

using namespace Eigen;

template <typename Scalar, bool RandomAccess> struct ei_linspace_op_impl;

// double, sequential: keeps the current packet in m_base and advances it
// by one ei_padd per call -- no per-element ei_pset1 needed.
template <>
struct ei_linspace_op_impl<double,false>
{
  typedef ei_packet_traits<double>::type PacketScalar;

  ei_linspace_op_impl(double low, double step)
    : m_base(ei_pset(low-2*step,low-step)), m_step(ei_pset1(2*step)) {}

  EIGEN_STRONG_INLINE const PacketScalar packetOp() const
  { return m_base = ei_padd(m_base,m_step); }

  mutable PacketScalar m_base;
  PacketScalar m_step;
};

// ----- Linspace functor ----------------------------------------------------------------

// Forward declaration
template <typename Scalar, bool RandomAccess = true> struct ei_linspace_op;

namespace Eigen {
  // Functor traits
  template <typename Scalar, bool RandomAccess>
  struct ei_functor_traits< ei_linspace_op<Scalar,RandomAccess> >
  { enum { Cost = 1, PacketAccess = ei_packet_traits<Scalar>::size>1, IsRepeatable = true }; };
}

// Functor interface
template <typename Scalar, bool RandomAccess> struct ei_linspace_op
{
  typedef typename ei_packet_traits<Scalar>::type PacketScalar;

  ei_linspace_op(Scalar low, Scalar high, int num_steps)
    : impl(low,(high-low)/(num_steps-1)),
      m_low(low), m_step((high-low)/(num_steps-1)) {}

  EIGEN_STRONG_INLINE const Scalar operator() (int i, int = 0) const
  { return m_low+i*m_step; }

  EIGEN_STRONG_INLINE const PacketScalar packetOp(int, int = 0) const
  { return impl.packetOp(); }

  ei_linspace_op_impl<Scalar,RandomAccess> impl;
  Scalar m_low;
  Scalar m_step;
};

int main()
{
  {
    BenchTimer t1,t2; double e1,e2;
    double low = 5.0;
    double high = 7.0;
    const int size = 200000;
    double step = (high-low)/(size-1);

    t1.start();
    {
      ArrayXd test;
      for (int i=0; i<5000; ++i)
        test = ArrayXd::NullaryExpr(size, ei_linspace_op<double,false>(low,high,size));
    }
    t1.stop();
    e1 = t1.value();

    t2.start();
    {
      ArrayXd test(size);
      for (int run=0; run<5000; ++run)  // was also named i, shadowing the inner index
        for (int i=0; i<size; ++i)
          test(i) = low+i*step;
    }
    t2.stop();
    e2 = t2.value();

    std::cout << "vec: " << e1*1000.0 << " ms" << std::endl;
    std::cout << "lin: " << e2*1000.0 << " ms" << std::endl;
  }
}