Re: [eigen] matrix::linspace



On the other hand, my results with GCC 4.4 should not be over-interpreted:
everything is in the main function, and we know that GCC doesn't optimize
code in main() well. Then again, some of the work may have been optimized
away entirely. So the only way to be sure is to put the code in
non-inlined functions.
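
For instance, a minimal sketch of putting the benchmarked code in a
non-inlined function; the noinline attribute is GCC-specific and the
function name is illustrative:

    #include <Eigen/Dense>

    // Keeping the benchmarked kernel out of main() and preventing inlining
    // makes it harder for the compiler to constant-fold or discard the work.
    __attribute__((noinline))
    void fill_linspace(Eigen::VectorXd& a, double low, double step)
    {
      for (int i = 0; i < a.size(); ++i)
        a(i) = low + i * step;
    }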

Benoit

2010/1/24 Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx>:
> I am still getting results this bad here:
>
> vec: 2141.66 ms
> lin: 1950.84 ms
>
> MSVC seems to fail to optimize properly. I will try to look at the assembly.
>
> - Hauke
>
> On Sun, Jan 24, 2010 at 6:17 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>> Hi Hauke,
>>
>> You've been trapped by a subtlety of Gael's BenchTimer: when you use a
>> single BenchTimer object more than once, it returns the best time.
>>
>> I'd propose renaming it to something more explicit, and letting the
>> name BenchTimer refer to something much simpler.
>>
>> That is why, when I ran your program, I got the same time twice.
>>
>> Now I edited main.cpp as attached, and I get:
>>
>> vec: 893.094 ms
>> lin: 6745.94 ms
>>
>> Cheers,
>> Benoit
>>
>> 2010/1/24 Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx>:
>>> I tried using sequential packet access. Unfortunately, it did not
>>> help. Maybe I am still missing something. The required patch (indexed
>>> packet access) is attached to this mail, as is my test code.
>>>
>>> - Hauke
>>>
>>> On Sat, Jan 23, 2010 at 4:03 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
>>>> 2010/1/23 Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx>:
>>>>> Hi there,
>>>>>
>>>>> I am working on some functionality which would allow something like this:
>>>>>
>>>>>  VectorXd a = VectorXd::linspace(low,high,num_steps);
>>>>>
>>>>> which should correspond to:
>>>>>
>>>>>  const double step_size = (high-low)/(num_steps-1);
>>>>>  VectorXd a(num_steps);
>>>>>  for (int i=0; i<num_steps; ++i)
>>>>>   a(i) = low+i*step_size;
>>>>>
>>>>> In Eigen, this is best implemented based on NullaryExpressions, i.e.
>>>>> matrices where each element is computed by some functor.
>>>>
>>>> I agree!
>>>>
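
A minimal sketch of the nullary-functor approach described above, written
against the Eigen 3 NullaryExpr API for illustration; the names LinspaceOp
and linspace are placeholders, not the actual patch discussed in this thread:

    #include <Eigen/Dense>

    // Illustrative functor: element i of the vector is low + i*step.
    struct LinspaceOp {
      double low, step;
      LinspaceOp(double low_, double high_, Eigen::Index n)
        : low(low_), step(n > 1 ? (high_ - low_) / double(n - 1) : 0.0) {}
      double operator()(Eigen::Index i) const { return low + double(i) * step; }
    };

    // a = linspace(low, high, n), expressed as a nullary (functor-generated)
    // expression and evaluated into a plain vector.
    Eigen::VectorXd linspace(double low, double high, Eigen::Index n) {
      return Eigen::VectorXd::NullaryExpr(n, LinspaceOp(low, high, n));
    }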
>>>>> In theory, this should not only be nice from a usability point of view,
>>>>> but I also hoped to gain from vectorization. It turns out that this
>>>>> is not that easy - maybe due to MSVC, but maybe also due to my lack
>>>>> of SSE skills. For the simplest example
>>>>>
>>>>>  for (int i=0; i<num_steps; ++i) a(i) = i;
>>>>>
>>>>> I managed to implement the functor's method ei_linspace_op::packetOp
>>>>> such that we actually do gain from vectorization (note that I added
>>>>> integer parameters to it):
>>>>>
>>>>>  // for doubles only
>>>>>  ei_linspace_op::packetOp(int i, int = 0)
>>>>>  {
>>>>>    return ei_padd(ei_pset1<Scalar>(i),ei_pset<Scalar>(0,1));
>>>>>  }
>>>>>
>>>>> For me the first surprise was that this performs way better than just
>>>>> returning ei_pset<Scalar>(i,i+1).
>>>>
>>>> Notice that reading from memory is costly, and ei_pset1 is costly
>>>> because it needs a SHUFPD instruction. So, naively, I would think that
>>>> the best that can happen is that you build, once and for all, a packet
>>>> (0,1) and a packet (1,1), and at each step you only add the latter into
>>>> the former. Perhaps your code is fast because it allowed the compiler
>>>> to do that optimization?
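
Benoit's suggestion of hoisting the constant packets out of the loop could
look roughly like this in raw SSE2 intrinsics; this is only a sketch of the
idea, not the Eigen patch under discussion. Note that when each iteration
produces a packet of two doubles, the per-iteration increment packet is
(2*step, 2*step):

    #include <emmintrin.h>

    // Fill a[i] = low + i*step for n doubles, keeping the running packet and
    // the constant increment in registers so the loop body is only a store
    // and an add.
    void linspace_fill_sse(double* a, double low, double step, int n)
    {
      __m128d acc  = _mm_set_pd(low + step, low);  // lanes (low, low + step)
      __m128d incr = _mm_set1_pd(2.0 * step);      // advance both lanes by 2*step
      int i = 0;
      for (; i + 1 < n; i += 2) {
        _mm_storeu_pd(a + i, acc);
        acc = _mm_add_pd(acc, incr);
      }
      for (; i < n; ++i)                           // scalar tail
        a[i] = low + i * step;
    }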
>>>>
>>>>> Next, I looked into the case:
>>>>>
>>>>>  for (int i=0; i<num_steps; ++i) a(i) = low+i;
>>>>>
>>>>> and here, I totally failed. Returning
>>>>>
>>>>>  return ei_pset<Scalar>(m_low+i,m_low+i+1);
>>>>>
>>>>> resulted in no statistically meaningful gain (~0.06% speedup). The
>>>>> same holds for
>>>>>
>>>>>  return ei_padd(ei_pset1<Scalar>(i),ei_padd(ei_pset1(m_low),ei_pset<Scalar>(0,1)));
>>>>>
>>>>> which is sometimes even slower, and permuting the operators did not help either.
>>>>>
>>>>> So, I am sort of puzzled about what I could improve. Any ideas? Is there
>>>>> any SSE operation I am not aware of for creating (a, a+1, a+2, ...)?
>>>>
>>>> Same remark as above: I would initially store the packet (low, low+1)
>>>> in an SSE register, store (1,1) in another SSE register, and only do
>>>> an ei_padd at each iteration. Naively, I would think that that is much
>>>> faster?
>>>>
>>>>> Finally, note that ei_pset is not yet in PacketMath. Over here, I
>>>>> added forward declarations to GenericPacketMath and the actual SSE
>>>>> implementation in the corresponding PacketMath header (no AltiVec
>>>>> yet). The GenericPacketMath declarations (as the name says) have no
>>>>> implementations, since I don't see any meaningful way of providing
>>>>> them, nor do I think they would be useful.
>>>>
>>>> Indeed I think they're only used in the matrix product code, which
>>>> doesn't use your new ei_pset function.
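
For illustration, the forward declaration plus SSE specialization Hauke
describes might look roughly like this; a sketch assuming the Eigen 2-era
internal naming (ei_pset itself is the new function proposed in this thread,
not an existing Eigen primitive):

    // In GenericPacketMath: declaration only, no generic implementation.
    template<typename Scalar>
    typename ei_packet_traits<Scalar>::type ei_pset(const Scalar& a, const Scalar& b);

    // In the SSE PacketMath header: build the packet (a, b) for doubles.
    inline __m128d ei_pset(const double& a, const double& b)
    {
      return _mm_set_pd(b, a);  // _mm_set_pd takes the high lane first
    }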
>>>>
>>>>> The second change in the lib was
>>>>> mentioned above - I added indices to the packetOp functions.
>>>>
>>>> Oh right, you need to fit your stuff into our packetOp infrastructure.
>>>> Hm, not easy indeed, as I was assuming sequential, not random, access.
>>>>
>>>>>
>>>>> If everything regarding the vectorization fails, we could still
>>>>> provide the functionality (with a specialization for the trivial case
>>>>> where we gain from vectorizing) because, if nothing else, I think it is nifty.
>>>>
>>>> We might fail to vectorize random access efficiently. But 95% of use
>>>> cases will be sequential access. This could then be a use case for a
>>>> ReturnByValue function. When nested into a larger expression, this
>>>> would have the drawback of introducing a temporary, but the advantage
>>>> of enabling full vectorization instead of preventing vectorization of
>>>> the larger expression.
>>>>
>>>> Benoit
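
A usage-level sketch of the trade-off described above, reusing the
illustrative LinspaceOp functor from the earlier sketch (hypothetical code,
not the API that was eventually committed):

    #include <Eigen/Dense>
    using Eigen::VectorXd;

    void example(const VectorXd& b)
    {
      const Eigen::Index n = b.size();
      // (a) Evaluate into a temporary first: the fill is a plain sequential
      //     traversal, which is the case a vectorized packetOp could handle,
      //     at the cost of one temporary.
      VectorXd t  = VectorXd::NullaryExpr(n, LinspaceOp(0.0, 1.0, n));
      VectorXd y1 = b + t;
      // (b) Nest the expression lazily: no temporary, but the enclosing sum
      //     now asks the functor for coefficients/packets by index, which is
      //     the part that is hard to vectorize efficiently.
      VectorXd y2 = b + VectorXd::NullaryExpr(n, LinspaceOp(0.0, 1.0, n));
    }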
>>>>
>>>>
>>>>
>>>
>>
>
>
>


