Re: [eigen] Signed or unsigned indexing


To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: [eigen] Signed or unsigned indexing
From: Martin Beeger <martin.beeger@xxxxxxxxx>
Date: Fri, 20 Jan 2017 19:51:15 +0100

Some more thoughts on this topic:

* Jon Kalb (signed integer faction) has pointed out that obnoxious bugs caused by unsigned integer underflow and bounds wraparound when subtracting sizes have repeatedly gone unnoticed for a very long time, even in libraries like the STL and Boost.

* Andrei Alexandrescu pointed out that whether indices or pointers are faster in algorithmic loops seems to change every five years or so. That means that using indices instead of pointers is a valid strategy and may increase your performance.

* The specified wraparound behaviour of unsigned loop indices inhibits a class of optimizations, so there is an inherent performance hit from using unsigned, especially for fixed but unknown loop lengths. See CppCon 2016, Michael Spencer, "My Little Optimizer: undefined behaviour is magic": https://www.youtube.com/watch?v=g7entxbQOCc

* When pressed to answer why std::size_t became unsigned, Stroustrup replied: someone had a case for needing to reserve more than half the address space in a vector and for the number of elements to be strictly positive (sadly, I could not find the reference). Which is funny, because you cannot actually use more than half, as Mark pointed out.
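
The size-subtraction pitfall from the first point can be sketched as follows. This is only an illustration: the names unsigned_gap and signed_gap are made up for this example and come from no library.

```cpp
#include <cstddef>
#include <vector>

// Difference of two sizes computed directly in std::size_t.
// When b is longer than a, the result wraps around to a huge positive value.
std::size_t unsigned_gap(const std::vector<int>& a, const std::vector<int>& b)
{
    return a.size() - b.size();
}

// Same difference computed after converting to a signed type.
// (Hypothetical helper; assumes both sizes fit in std::ptrdiff_t.)
std::ptrdiff_t signed_gap(const std::vector<int>& a, const std::vector<int>& b)
{
    return static_cast<std::ptrdiff_t>(a.size()) - static_cast<std::ptrdiff_t>(b.size());
}
```

With a.size() == 2 and b.size() == 3, signed_gap yields -1, while unsigned_gap yields the largest value of std::size_t, which a later bounds check such as "gap < limit" will silently get wrong.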


On 20.01.2017 at 18:48, Mark Borgerding wrote:

On the other side, I understand people who use std::size_t for indexing because it is consistent with the standard library.

* Eric Niebler's STL2 project plans to use signed. See the discussion here: https://github.com/ericniebler/stl2/issues/182. So if Eigen now chooses unsigned for compatibility and the STL2 project ever flies, we might look like fools ;)

For me, the argument that loop optimization suffers from using unsigned, and that this is inherent in the language (and cannot be fully solved by compiler vendors), is the strongest one, as performance matters for Eigen's loops and indices are likely to end up being used in their exit conditions.
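
A rough sketch of why defined wraparound gets in the way of reasoning about loop exit conditions. It uses std::uint8_t purely as a small stand-in for wider unsigned index types so that the effect is easy to reproduce; trip_count_u8 is a hypothetical name.

```cpp
#include <cstdint>

// Counts iterations of `for (i = 0; i < n; i += 2)` with an 8-bit unsigned
// index, giving up after `cap` iterations.
int trip_count_u8(std::uint8_t n, int cap)
{
    int trips = 0;
    for (std::uint8_t i = 0; i < n; i = static_cast<std::uint8_t>(i + 2)) {
        if (++trips == cap) {
            break; // cap reached: stop counting
        }
    }
    return trips;
}
```

For n = 10 the loop runs 5 times. For n = 255 the index goes from 254 back to 0 and the exit condition is never met, so only the cap stops it. A compiler therefore cannot assume that the trip count is simply n/2 rounded up. With a signed index, overflow would be undefined behaviour, so the compiler is allowed to assume it does not happen.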


Regards, Martin


On 20.01.2017 at 18:48, Mark Borgerding wrote:

Well made points. These greatly expand on points I made on this list in May 2010.
From the linked video...
"I think one of the sad things about the standard library is that the indices are unsigned whereas array indices are signed. You are sort of doomed to have confusion and problems with that. [...] It is a major bug source. Which is why I'm saying, 'stay as simple as you can. Use [signed] integers until you really, really need something else.' "
-- Bjarne Stroustrup
This opinion was echoed by several of the esteemed panel and opposed by none: Bjarne Stroustrup, Andrei Alexandrescu, Herb Sutter, Scott Meyers, Chandler Carruth, Sean Parent, Michael Wong, and Stephan T. Lavavej.
"They [the unsigned ordinals in the STL] are wrong. We're sorry. We were young."
I would add...
It is human nature to forget that these "integer" data types we're discussing are only approximations of integers over a finite range. The approximation is most horribly broken at the point of over/underflow. For signed types, that point is often comfortably far from where our calculations happen.
For unsigned types, that boundary point is at the busiest spot, zero.


-- Mark






On 01/20/2017 10:54 AM, Francois Fayard wrote:
Let me give you my 2 cents on this signed/unsigned problem:
- The core C++ language is using signed integers for indexing. If p is a pointer, p[k] is defined for k being a std::ptrdiff_t, which is the Index type used by Eigen.
- The C++ standard library is using unsigned integers for indexing: std::size_t.
So, if you want to choose: (core C++) > (STL) > (Eigen). You have to go and use signed integers. Most people on the standard committee think that using unsigned integers for indexing was a mistake, including Bjarne Stroustrup, Herb Sutter and Chandler Carruth. You can see their argument in this video, at 42:38 and 1:02:50 (https://www.youtube.com/watch?v=Puio5dly9N8).
Using unsigned integers is just error-prone. For instance, if you have an array v and you are looking for the first element k, when you go from b down to a, such that v[k] == 0, with signed integers you just need to write:
for (int k = b; k >= a; --k) {
   if (v[k] == 0) {
     return k;
   }
}
If you do that with unsigned integers, good luck writing something simple that does not go crazy when a = 0. (It was a real bug I found in an image program that was searching for something from right to left inside a rectangle of interest.)
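
For comparison, here is a sketch of the same search written with both index types. The function names and the "not found" return values are assumptions made for this example. The unsigned variant uses one possible idiom for counting down to zero safely; there are others.

```cpp
#include <cstddef>
#include <vector>

// Signed version, as in the loop above: returns the last k in [a, b] with
// v[k] == 0, or -1 if there is none. Assumes 0 <= a and b < v.size().
std::ptrdiff_t last_zero_signed(const std::vector<int>& v, std::ptrdiff_t a, std::ptrdiff_t b)
{
    for (std::ptrdiff_t k = b; k >= a; --k) {
        if (v[static_cast<std::size_t>(k)] == 0) {
            return k;
        }
    }
    return -1;
}

// Unsigned version. `k >= a` would always be true for a == 0, so the loop
// tests and decrements in one step and runs k over b, b-1, ..., a.
// Returns v.size() if no zero is found. Assumes a <= b < v.size().
std::size_t last_zero_unsigned(const std::vector<int>& v, std::size_t a, std::size_t b)
{
    for (std::size_t k = b + 1; k-- > a;) {
        if (v[k] == 0) {
            return k;
        }
    }
    return v.size();
}
```

The unsigned loop is correct, but the `k-- > a` condition is easy to get wrong and harder to read than the signed `k >= a`.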
In terms of performance, the compiler can also do more things with signed integers, as overflow for signed integers is undefined behavior. This is not the case for unsigned integers (they wrap). As a consequence, a compiler can assume that p[k] and p[k + 1] are next to each other in memory if k is signed. It can't do that if k is unsigned (think of the mess it can lead to for people working on auto-vectorization). With signed integers, 10 * k / 2 = 5 * k. This is not the case with unsigned integers. I believe that "unsigned integers" should be renamed "modulo integers" because unsigned integers are just Z/(2^n Z). And I have taught enough Mathematics to know that most people are (or should be) scared of Z/(2^n Z) :-)
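
A small sketch of the 10 * k / 2 versus 5 * k point, assuming 32-bit unsigned arithmetic. The function names are made up for the example.

```cpp
#include <cstdint>

// 10 * k / 2 evaluated in 32-bit unsigned arithmetic: the product wraps
// modulo 2^32 before the division.
std::uint32_t ten_k_over_two(std::uint32_t k)
{
    return static_cast<std::uint32_t>(10u * k) / 2u;
}

// 5 * k in the same arithmetic.
std::uint32_t five_k(std::uint32_t k)
{
    return static_cast<std::uint32_t>(5u * k);
}
```

For k = 500000000, 5 * k = 2500000000 still fits in 32 bits, but 10 * k = 5000000000 wraps to 705032704, so 10 * k / 2 gives 352516352. Because this result is well defined, the compiler is not allowed to rewrite one expression into the other. With a signed k the overflowing case is undefined behaviour, so the simplification is allowed.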
The only advantage of unsigned integers from a performance point of view is division: it is faster with unsigned integers. But I never had this bottleneck in real programs.
An argument used by "unsigned" people is that you get an extra bit for array indices. This is just nonsense. First, it only happens on 32-bit systems with char arrays of more than 2 billion elements. Have you ever considered *char* arrays that long? Now, the funniest part is that if you look at the implementation of std::vector<T>, it stores 3 pointers, including the begin and the end of the array. To get its size, the vector returns static_cast<std::size_t>(end - begin). And end - begin is just a std::ptrdiff_t, so if the vector has more than 2 billion elements, you just get undefined behavior. So std::vector can't even use that bit!!! By the way, this is why std::vector<T>::max_size() is 2 billion with libc++ (clang). And nobody seems to complain.
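
The size computation described above can be sketched like this. It is a simplified illustration, not the code of any particular standard library, and size_from_pointers is a hypothetical name.

```cpp
#include <cstddef>
#include <limits>

// Size computed as a signed pointer difference converted to std::size_t.
template <typename T>
std::size_t size_from_pointers(const T* begin, const T* end)
{
    const std::ptrdiff_t n = end - begin; // the difference must fit in std::ptrdiff_t
    return static_cast<std::size_t>(n);
}

// std::ptrdiff_t cannot represent every std::size_t value, so the "extra bit"
// is lost as soon as a size goes through a pointer difference.
static_assert(static_cast<std::size_t>(std::numeric_limits<std::ptrdiff_t>::max())
                  < std::numeric_limits<std::size_t>::max(),
              "ptrdiff_t covers only part of the size_t range");
```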
On the other side, I understand people who use std::size_t for indexing because it is consistent with the standard library.
François Fayard
Founder & Consultant - Inside Loop
Applied Mathematics & High Performance Computing
On Jan 20, 2017, at 3:33 PM, Benoit Jacob <jacob.benoit.1@xxxxxxxxx> wrote:
It's complicated. The use of signed indexing has a long history specifically in numerical linear algebra, see FORTRAN/BLAS/LAPACK. Also some large C++ shops such as Google have entirely turned away from unsigned indexing:
https://google.github.io/styleguide/cppguide.html#Integer_Types says:
You should not use the unsigned integer types such as uint32_t, unless there is a valid reason such as representing a bit pattern rather than a number, or you need defined overflow modulo 2^N. In particular, do not use unsigned types to say a number will never be negative. Instead, use assertions for this.
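
As a hedged illustration of the "use assertions" advice in the quoted guideline (sum_first is a hypothetical function, not taken from the style guide):

```cpp
#include <cassert>
#include <vector>

// Sum of the first `count` elements. The count is a plain signed int and the
// "never negative" requirement is stated with an assertion rather than
// encoded in an unsigned type.
int sum_first(const std::vector<int>& v, int count)
{
    assert(count >= 0);
    assert(count <= static_cast<int>(v.size()));
    int total = 0;
    for (int i = 0; i < count; ++i) {
        total += v[static_cast<std::vector<int>::size_type>(i)];
    }
    return total;
}
```

A caller that passes -1 by mistake trips the assertion in a debug build, instead of having the value silently converted to a huge unsigned count.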
I only mention this to say, there are valid points on both sides of this argument. Neither choice will make much more than half of users happy. (I actually don't like the Google C++ style that much, and I wouldn't mind unsigned personally).
The only thing that would be really bad would be to try to make both signednesses fully supported. That would mean keeping only the intersection of the sets of advantages of either signedness, having to deal with the union of the sets of constraints.
Benoit
2017-01-20 8:25 GMT-05:00 Henrik Mannerström <henrik.mannerstrom@xxxxxxxxx>:
This issue seems to warrant a separate message thread.
I'd like to offer my two cents: As much as I like Eigen, I think there is a strict ordering "c++ language" > "stl" > "Eigen". More than once have I developed something and only at some later point brought in Eigen. Code written in std::size_t fashion has then needed refactoring. So, if there were a vote, I'd vote for size_t indexing. I think smooth interoperability with the stl is valuable.
Best,
Henrik
On Thu, Jan 19, 2017 at 10:58 PM, Márton Danóczy <marton78@xxxxxxxxx> wrote:
Hi all,
while I would not want to argue with Gael nor with the numerous C++ experts advocating signed integers, today's reality is different. Some libraries, the most prominent being the standard library, use unsigned integers as indices. Therefore, mixing signed and unsigned types is sadly unavoidable when using Eigen, which is a major annoyance (at least for me, working with -pedantic -Werror).
In my opinion, using Eigen with -DEIGEN_DEFAULT_DENSE_INDEX_TYPE=size_t should just work, regardless of the developers' personal preference for signed integers.
Best,
Marton
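
One way to keep sign-conversion warnings under control in mixed signed/STL code is to confine the conversions to a couple of well-defined places. The sketch below is only an assumption about how that might look: signed_size and count_negative are hypothetical helpers, not part of Eigen (C++20's std::ssize is similar in spirit).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: a container's size as a signed value.
// Assumes the size fits in std::ptrdiff_t.
template <typename Container>
std::ptrdiff_t signed_size(const Container& c)
{
    return static_cast<std::ptrdiff_t>(c.size());
}

// Counts negative entries using a signed loop index; the only casts are the
// one inside signed_size and the one at the subscript.
std::ptrdiff_t count_negative(const std::vector<double>& v)
{
    std::ptrdiff_t n = 0;
    for (std::ptrdiff_t i = 0; i < signed_size(v); ++i) {
        if (v[static_cast<std::size_t>(i)] < 0.0) {
            ++n;
        }
    }
    return n;
}
```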
On 19 January 2017 at 13:00, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:
On Thu, Jan 19, 2017 at 11:31 AM, Andrew Fitzgibbon <awf@xxxxxxxxxxxxx> wrote:
I wonder if a rethink of reshape could allow a move to unsigned index types, assuming I understand correctly that Dynamic would be of another type. It's always been a bit clunky getting "size_t-correctness" right for mixed Eigen/STL code, and compilers complain increasingly nowadays. Perhaps now might be a time to give it a try?
See also: http://eigen.tuxfamily.org/index.php?title=FAQ#Why_Eigen.27s_API_is_using_signed_integers_for_sizes.2C_indices.2C_etc..3F
Maybe one day we'll get a "fixed" std-v2 that would be more compatible with libraries that made the right choice of using signed types.
For sparse matrices, I agree that we might try to allow for unsigned types as the StorageIndex type. This should be doable while keeping signed 64-bit integers for the API (rows, cols, nonZeros, etc.)
We might also think about solutions to ease the mix of Eigen/STL code...
gael
I see the "downcounting" argument at https://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2009/03/msg00099.html,
but that appears fairly strongly to be a special case where
one would anyway want to benchmark, check sizes etc.


Finally, I think we are in a world where sparse arrays with
entries in the 2-4 billion range are reasonably common,
and one could conceivably be pleased to get the extra bit
back…


Thanks again for a great library!


A.


Dr Andrew Fitzgibbon FREng FBCS FIAPR

Partner Scientist
Microsoft HoloLens, Cambridge, UK

http://aka.ms/awf


From: Gael Guennebaud [mailto:gael.guennebaud@xxxxxxxxx]
Sent: 13 January 2017 12:26
To: eigen <eigen@xxxxxxxxxxxxxxxxxxx>
Subject: Re: [eigen] Let's get reshape in shape for 3.4





On Fri, Jan 13, 2017 at 6:14 AM, Jason Newton <nevion@xxxxxxxxx> wrote:

Also, regarding the RowMajor/ColMajor  int/type issue - perhaps stuff
them in a new namespace or class - storage ?  Too bad StorageOrder is
already used in so many places.   Honestly I'm all for you making them
types and things working uniformly from there.  I have used them
myself as integers with the flags bitset, but only for enable_if logic
which would be rendered obsolete if you had a collection of C++11
inspired type traits (instead they get repeated on the web a few
places).  Sorry if I'm not being very detailed, it's been a while
since I've needed these, but my point is that it was basically a flaw
to use them as int's in the first place, in user code - and so I
encourage you to change things so it all works fluidly in the new api
without fear of upsetting users.  Although perhaps that is a daunting
task...
I think you are mixing Eigen::RowMajor with Eigen::RowMajorBit. I agree that the bit flags could be managed differently using individual type traits, but regarding Eigen::RowMajor, it is currently used as a template parameter to Matrix, Array, SparseMatrix, etc.:
Matrix<...., RowMajor|DontAlign>
which is pretty convenient to write compared to having to subclass some default_matrix_traits class to customize the options. With RowMajor|DontAlign, RowMajor could still be an instance of an integral_constant-like type with operator | overloaded.... Actually I've started to think about such an approach for Eigen::Dynamic, so that one can write:
M = N*2+1
and get M==Dynamic if N==Dynamic. Currently we always have to write: M = N==Dynamic ? Dynamic : 2*N+1, which is error prone because it's easy to forget about checking for Dynamic, especially when combining multiple compile-time identifiers.
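
A minimal sketch of what such an integral_constant-like approach might look like. Everything here (Size, the overloaded operators, TwoNPlusOne) is hypothetical and not Eigen's API; only the value -1 for Dynamic mirrors Eigen::Dynamic.

```cpp
// Stand-in for Eigen::Dynamic, which is -1 in Eigen.
constexpr int Dynamic = -1;

// Hypothetical integral_constant-like wrapper for a compile-time size.
template <int N>
struct Size {
    static constexpr int value = N;
};

// Arithmetic on wrapped sizes: the result is Dynamic as soon as one operand is.
template <int A, int B>
constexpr Size<((A == Dynamic || B == Dynamic) ? Dynamic : A * B)>
operator*(Size<A>, Size<B>) { return {}; }

template <int A, int B>
constexpr Size<((A == Dynamic || B == Dynamic) ? Dynamic : A + B)>
operator+(Size<A>, Size<B>) { return {}; }

// M = N*2+1 written once, with no explicit check for Dynamic.
template <int N>
struct TwoNPlusOne {
    static constexpr int value = decltype(Size<N>{} * Size<2>{} + Size<1>{})::value;
};
```

With this, TwoNPlusOne<3>::value is 7 and TwoNPlusOne<Dynamic>::value is Dynamic, because the check lives in the operators rather than at every call site.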
gael



-Jason
On Thu, Jan 12, 2017 at 11:56 PM, Jason Newton <nevion@xxxxxxxxx> wrote:
Hi Gael,

Glad to see all the new api's you're moving in for the new year.

I actually prefer C if C is a superset of B - that is the way it works
in Numpy - order is overridable in several places, but mainly things
follow the matrix types you are working with (which would be
expressions here).

I haven't thought about the details but is there any reason
A.reshaped(4, n/2) couldn't work via constexprs or something on the 4?  I
imagine even if it did you're trying to cover for C++98 though, but I
think fix<4> is a fair bit ugly.

As for the placeholder for a solvable dimension - the matlab
convention is the empty matrix and I welcome that notion (wrapped as a
type) - how about any of, with no priorities:
Null, Nil, Empty, Filled, DontCare, Placeholder,  CalcSize (this and
the next are more explicit), or AutoSized


-Jason

On Thu, Jan 12, 2017 at 10:35 AM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
Hi everyone,
just after generic indexing/slicing, another long-standing missing feature is reshape. So let's make it for 3.4.
This is not the first time we discuss it. There is an old bug report entry [1] and an old pull-request with various discussions [2]. The Tensor module also supports reshape [3].
However, the feature is still not there because we never converged on how to properly handle the ambiguity between col-major / row-major orders, also called Fortran versus C style orders (e.g., in numpy doc [4]).

We have several options:
A) Interpret the indices in column major only, regardless of the storage order.
   - used in MatLab and Armadillo
   - pros: simple strategy
   - cons: not very friendly for row-major inputs (needs to transpose twice)
B) Follows the storage order of the given expression
   - used by the Tensor module
   - pros: easiest implementation
   - cons:
      * results depend on storage order (need to be careful in generic code)
      * not all expressions have a natural storage order (e.g., a+a^T, a*b)
      * needs a hard copy if, e.g., the user wants to stack columns of a row-major input
C) Give the user an option to decide which order to use between: ColMajor, RowMajor, Auto
   - used by numpy [4] with default to RowMajor (aka C-like order)
   - pros: give full control to the user
   - cons: the API is a bit more complicated
At this stage, option C) seems to be the only reasonable one. However, we have yet to specify how to pass this option at compile-time, what Auto means, and what is the default strategy.
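
To make the col-major / row-major ambiguity concrete, here is a small standalone sketch that does not use Eigen at all. Mat, Order and this free function reshaped are stand-ins invented for the illustration, and the reshape is done by copying rather than as a view.

```cpp
#include <cstddef>
#include <vector>

enum class Order { ColMajor, RowMajor };

// Tiny dense matrix for illustration only. Elements are kept row by row
// internally; this storage choice does not affect the results below.
struct Mat {
    std::ptrdiff_t rows;
    std::ptrdiff_t cols;
    std::vector<int> data;

    int& at(std::ptrdiff_t i, std::ptrdiff_t j)
    {
        return data[static_cast<std::size_t>(i * cols + j)];
    }
    int at(std::ptrdiff_t i, std::ptrdiff_t j) const
    {
        return data[static_cast<std::size_t>(i * cols + j)];
    }
};

// Copying reshape: the k-th element of `a` in the chosen traversal order
// becomes the k-th element of the result in the same order.
// Assumes rows * cols == a.rows * a.cols.
Mat reshaped(const Mat& a, std::ptrdiff_t rows, std::ptrdiff_t cols, Order order)
{
    Mat out{rows, cols, std::vector<int>(static_cast<std::size_t>(rows * cols))};
    for (std::ptrdiff_t k = 0; k < rows * cols; ++k) {
        if (order == Order::ColMajor) {
            out.at(k % rows, k / rows) = a.at(k % a.rows, k / a.rows);
        } else {
            out.at(k / cols, k % cols) = a.at(k / a.cols, k % a.cols);
        }
    }
    return out;
}
```

For the 2x3 matrix [1 2 3; 4 5 6], reshaping to 3x2 gives [1 5; 4 3; 2 6] with Order::ColMajor and [1 2; 3 4; 5 6] with Order::RowMajor, which is why the order has to be pinned down somewhere in the API.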
Regarding 'Auto', it is similar to option (B) above. However, as I already mentioned, some expressions do not have any natural storage order. We could address this issue by limiting the use of 'Auto' to expressions for which the storage order is "strongly" defined, where "strong" could mean:
- Any expression with the DirectAccessBit flag (it means we are dealing with a Matrix, Map, sub-matrix, Ref, etc. but not with a generic expression)
- Any expression with the LinearAccessBit flag: it means the expression can be efficiently processed as a 1D vector.

Any other situation would raise a static_assert.
But what if I really don't care and just want to, e.g., get a linear view with no constraints on the stacking order? Then we could add a fourth option meaning 'IDontCare', perhaps 'AnyOrder'?
For the default behavior, I would propose 'ColMajor', which is perhaps the most common and predictable choice given that the default storage is column major too.


Then, for the API, nothing fancy (I use c++11 for brevity):
template<typename RowsType=Index, typename ColType=Index, typename Order=Xxxx>
DenseBase::reshaped(RowsType rows,ColType cols,Order = Order());

with one variant to output a 1D array/vector:

template<typename Order= Xxxx >
DenseBase::reshaped(Order = Order());

Note that I used "reshaped" with a "d" on purpose.
The storage order of the resulting expression would match the optional order.
Then for the name of the options we cannot use "RowMajor"/"ColMajor" because they already are defined as "static const int" and we need objects with different types here. Moreover, col-major/row-major does not extend well to multi-dimension tensors. I also don't really like the reference to Fortran/C as in numpy. "Forward"/"Backward" are confusing too. Any ideas?
The rows/cols parameters could also be a mix of compile-time & runtime values, like:

A.reshaped(fix<4>,n/2);
And maybe we could even allow a placeholder to automatically compute one of the dimensions to match the given matrix size. We cannot reuse "Auto" here because that would be too confusing:

A.reshaped(5,Auto);
Again, any ideas for a good placeholder name? (numpy uses -1 but we need a compile-time identifier)
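
One possible shape for such a placeholder, sketched outside of Eigen. The names AutoSize, Shape and resolve_shape are hypothetical; the point is only that a distinct tag type lets overload resolution pick the "deduce this dimension" path at compile time.

```cpp
#include <cstddef>

// Hypothetical placeholder type, not an existing Eigen identifier.
struct AutoSize {};

struct Shape {
    std::ptrdiff_t rows;
    std::ptrdiff_t cols;
};

// Both dimensions given explicitly.
inline Shape resolve_shape(std::ptrdiff_t /*total*/, std::ptrdiff_t rows, std::ptrdiff_t cols)
{
    return Shape{rows, cols};
}

// Column count deduced from the total number of coefficients.
// Assumes rows > 0 and that total is a multiple of rows.
inline Shape resolve_shape(std::ptrdiff_t total, std::ptrdiff_t rows, AutoSize)
{
    return Shape{rows, total / rows};
}

// Row count deduced. Assumes cols > 0 and that total is a multiple of cols.
inline Shape resolve_shape(std::ptrdiff_t total, AutoSize, std::ptrdiff_t cols)
{
    return Shape{total / cols, cols};
}
```

For a matrix with 20 coefficients, resolve_shape(20, 5, AutoSize{}) yields a 5x4 shape. A real implementation would also need to reject totals that are not divisible by the given dimension.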


cheers,

gael

[1] http://eigen.tuxfamily.org/bz/show_bug.cgi?id=437
[2] https://bitbucket.org/eigen/eigen/pull-requests/41
[3] https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/README.md?fileviewer=file-view-default#markdown-header-operation-reshapeconst-dimensions-new_dims
[4] https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.reshape.html
