Re: [eigen] matrix::linspace



I put the function calls in EIGEN_DONT_INLINE functions and ran
everything again, but I still see no improvement... However, I've found
something. In

ei_assign_impl<Derived1, Derived2, LinearVectorizedTraversal,
NoUnrolling>::run()

there are three parts - assigning the unaligned start, assigning the
aligned middle, and finally assigning the unaligned end. When I comment
out the unaligned sections, I get

vec: 791.646 ms
lin: 1796.26 ms

which is nice and reasonable. So MSVC has an optimization issue here.
In my particular case, the unaligned sections are empty anyway, so
skipping them does not hurt.
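
For reference, the run() I am talking about has roughly this shape
(simplified from memory, not the exact Eigen source):

  static void run(Derived1& dst, const Derived2& src)
  {
    const int size = dst.size();
    const int packetSize =
      ei_packet_traits<typename Derived1::Scalar>::size;
    const int alignedStart = ei_first_aligned(&dst.coeffRef(0), size);
    const int alignedEnd = alignedStart
      + ((size - alignedStart)/packetSize)*packetSize;

    // unaligned start - one of the sections I commented out
    for (int index = 0; index < alignedStart; ++index)
      dst.copyCoeff(index, src);

    // aligned middle - the vectorized part
    for (int index = alignedStart; index < alignedEnd; index += packetSize)
      dst.template copyPacket<Derived2, Aligned, Unaligned>(index, src);

    // unaligned end - the other commented-out section
    for (int index = alignedEnd; index < size; ++index)
      dst.copyCoeff(index, src);
  }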

- Hauke

On Sun, Jan 24, 2010 at 7:30 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
> My results with GCC 4.4 should not be over-interpreted, though:
> everything is in the main function, and we know that GCC doesn't
> optimize code in main() properly. On top of that, stuff may have been
> optimized away. So the only way to be sure is to put the code in
> non-inlined functions.
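>
> E.g. something like this (a sketch - bench_lin is a placeholder name,
> and linspace stands for whatever is being benchmarked):
>
>  EIGEN_DONT_INLINE void bench_lin(VectorXd& a, double low, double high)
>  {
>    a = VectorXd::linspace(low, high, a.size());
>  }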
>
> Benoit
>
> 2010/1/24 Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx>:
>> I am still getting results as bad as these:
>>
>> vec: 2141.66 ms
>> lin: 1950.84 ms
>>
>> MSVC seems to fail to optimize properly. I will try to look at the assembly.
>>
>> - Hauke
>>
>> On Sun, Jan 24, 2010 at 6:17 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>>> Hi Hauke,
>>>
>>> You've been trapped in a subtlety of Gael's BenchTimer: when you use a
>>> single BenchTimer object more than once, it returns the best time.
>>>
>>> I'd propose renaming it to something more explicit, and letting the
>>> BenchTimer name refer to something much simpler.
>>>
>>> That is why, when I ran your program, I got the same time twice.
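>>>
>>> Concretely, the pitfall looks like this (a sketch, assuming the
>>> start()/stop()/value() interface; run_vec and run_lin are placeholders):
>>>
>>>  BenchTimer timer;
>>>  timer.start(); run_vec(); timer.stop();
>>>  std::cout << "vec: " << timer.value() << std::endl;
>>>  timer.start(); run_lin(); timer.stop();
>>>  // value() still reports the best of all runs so far, hence the
>>>  // identical numbers; a second BenchTimer object avoids this.
>>>  std::cout << "lin: " << timer.value() << std::endl;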
>>>
>>> Now I edited main.cpp as attached, and I get:
>>>
>>> vec: 893.094 ms
>>> lin: 6745.94 ms
>>>
>>> Cheers,
>>> Benoit
>>>
>>> 2010/1/24 Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx>:
>>>> I tried using sequential packet access. Unfortunately, it did not
>>>> help; maybe I am still missing something. The required patch (indexed
>>>> packet access) is attached to this mail, together with my test code.
>>>>
>>>> - Hauke
>>>>
>>>> On Sat, Jan 23, 2010 at 4:03 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>>>>> 2010/1/23 Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx>:
>>>>>> Hi there,
>>>>>>
>>>>>> I am working on some functionality which would allow something like this:
>>>>>>
>>>>>>  VectorXd a = VectorXd::linspace(low,high,num_steps);
>>>>>>
>>>>>> which should correspond to:
>>>>>>
>>>>>>  const double step_size = (high-low)/(num_steps-1);
>>>>>>  VectorXd a(num_steps);
>>>>>>  for (int i=0; i<num_steps; ++i)
>>>>>>   a(i) = low+i*step_size;
>>>>>>
>>>>>> In Eigen, this is best implemented based on NullaryExpressions, i.e.
>>>>>> matrices where each element is computed by some functor.
>>>>>
>>>>> I agree!
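>>>>>
>>>>> For reference, the functor could look roughly like this (a sketch with
>>>>> placeholder names, not the final API):
>>>>>
>>>>>  template<typename Scalar> struct ei_linspace_op
>>>>>  {
>>>>>    ei_linspace_op(Scalar low, Scalar step) : m_low(low), m_step(step) {}
>>>>>    Scalar operator()(int i) const { return m_low + i*m_step; }
>>>>>    const Scalar m_low, m_step;
>>>>>  };
>>>>>
>>>>> plugged into a nullary expression so that coefficient i evaluates to
>>>>> low + i*step.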
>>>>>
>>>>>> In
>>>>>> theory, this should not only be nice from a usability point of view;
>>>>>> I also hoped to gain from vectorization. It turns out that this is
>>>>>> not that easy - maybe due to MSVC, but maybe also due to my lack of
>>>>>> SSE skills. For the most simple example
>>>>>>
>>>>>>  for (int i=0; i<num_steps; ++i) a(i) = i;
>>>>>>
>>>>>> I managed to implement the functor's method ei_linspace_op::packetOp
>>>>>> such that we actually gain from vectorization (note that I added
>>>>>> integer parameters to it):
>>>>>>
>>>>>>  // for doubles only; Packet is the SSE packet type for Scalar
>>>>>>  Packet ei_linspace_op::packetOp(int i, int = 0) const
>>>>>>  {
>>>>>>    // (i,i) + (0,1) yields the packet (i, i+1)
>>>>>>    return ei_padd(ei_pset1<Scalar>(i),ei_pset<Scalar>(0,1));
>>>>>>  }
>>>>>>
>>>>>> For me the first surprise was that this performs way better than just
>>>>>> returning ei_pset<Scalar>(i,i+1).
>>>>>
>>>>> Notice that reading from memory is costly, and ei_pset1 is costly
>>>>> because it needs a SHUFPD instruction. So, naively, I would think that
>>>>> the best that can happen is that you build, once and for all, a packet
>>>>> (0,1) and a packet (1,1), and at each step you only add the latter into
>>>>> the former. Perhaps your code is fast because it allowed the compiler
>>>>> to do that optimization?
>>>>>
>>>>>> Next, I looked into the case:
>>>>>>
>>>>>>  for (int i=0; i<num_steps; ++i) a(i) = low+i;
>>>>>>
>>>>>> and here, I totally failed. Returning
>>>>>>
>>>>>>  return ei_pset<Scalar>(m_low+i,m_low+i+1);
>>>>>>
>>>>>> resulted in no statistically meaningful speedup (~0.06%). The
>>>>>> same holds for
>>>>>>
>>>>>>  return ei_padd(ei_pset1<Scalar>(i),ei_padd(ei_pset1(m_low),ei_pset<Scalar>(0,1)));
>>>>>>
>>>>>> which is sometimes even slower; permuting the operands did not help either.
>>>>>>
>>>>>> So I am sort of puzzled about what I could improve. Any ideas? Is
>>>>>> there any SSE operation I am not aware of for creating (a, a+1, a+2, ...)?
>>>>>
>>>>> Same remark as above. I would initially store the packet (low,low+1)
>>>>> in an SSE register, store (1,1) into another SSE register, and only do
>>>>> an ei_padd at each iteration. Naively, I would think that would be much
>>>>> faster?
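>>>>>
>>>>> Concretely, the loop I have in mind would look roughly like this (raw
>>>>> SSE2 for doubles, just a sketch of the idea; the ei_p* wrappers would
>>>>> do the same). Note the increment has to be (2,2) rather than (1,1),
>>>>> since each iteration advances one full packet of two doubles:
>>>>>
>>>>>  __m128d acc  = _mm_set_pd(low + 1.0, low); // packet (low, low+1)
>>>>>  __m128d incr = _mm_set1_pd(2.0);           // packet (2, 2)
>>>>>  for (int i = 0; i + 2 <= n; i += 2)        // assumes a is 16-byte aligned
>>>>>  {
>>>>>    _mm_store_pd(a + i, acc);                // writes (low+i, low+i+1)
>>>>>    acc = _mm_add_pd(acc, incr);             // a single padd per iteration
>>>>>  }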
>>>>>
>>>>>> Finally, note that ei_pset is not yet in PacketMath. Over here, I
>>>>>> added forward declarations to GenericPacketMath and the actual SSE
>>>>>> implementation to the corresponding PacketMath header (no AltiVec
>>>>>> yet). The GenericPacketMath declarations (as the name says) have no
>>>>>> implementations, since I don't see any meaningful way of providing
>>>>>> them, nor do I think they would be useful.
>>>>>
>>>>> Indeed, I think the generic implementations are only used in the matrix
>>>>> product code, which doesn't use your new ei_pset function anyway.
>>>>>
>>>>>> The second change in the lib was
>>>>>> mentioned above - I added indices to the packetOp functions.
>>>>>
>>>>> Oh right, you need to fit your stuff into our packetOp infrastructure...
>>>>> Hmm, not easy indeed, as I was assuming sequential, not random, access.
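>>>>>
>>>>> For comparison, the signature change as I understand your patch
>>>>> (a sketch):
>>>>>
>>>>>  // before: sequential access only, no position information
>>>>>  Packet packetOp() const;
>>>>>  // after: the expression passes the coefficient index, so the
>>>>>  // functor can produce the packet for an arbitrary position
>>>>>  Packet packetOp(int i, int j = 0) const;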
>>>>>
>>>>>>
>>>>>> If everything regarding the vectorization fails, we could still
>>>>>> provide the functionality (with a specialization for the trivial case
>>>>>> where we do gain from vectorizing) because, at the least, I think it is
>>>>>> nifty.
>>>>>
>>>>> We might fail to vectorize random access efficiently, but 95% of use
>>>>> cases will be sequential access. This could then be a use case for a
>>>>> ReturnByValue function. When nesting this into a larger expression,
>>>>> this would have the drawback of introducing a temporary, and the
>>>>> advantage of enabling full vectorization, instead of preventing
>>>>> vectorization of the larger expression.
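>>>>>
>>>>> For instance (hypothetical API):
>>>>>
>>>>>  VectorXd a = VectorXd::linspace(low, high, n);     // sequential, vectorized
>>>>>  VectorXd b = x + VectorXd::linspace(low, high, n); // nested: with
>>>>>  // ReturnByValue, linspace is first evaluated into a temporary, and both
>>>>>  // that evaluation and the sum remain fully vectorizable.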
>>>>>
>>>>> Benoit
>>>>>
>>>>>
>>>>
>>>
>>
>


