Re: [eigen] On a flexible API for submatrices, slicing, indexing, maskin

On Fri, Dec 30, 2016 at 10:15 AM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:

On Fri, Dec 30, 2016 at 2:54 AM, Yuanchen Zhu <yuanchen.zhu@xxxxxxxxx> wrote:

First, using c<> to encode a compile time parameter is a rather novel convention, foreign to most programmers, engineers, matlab users, or basically anyone.

Yes, I'm also not 100% sure about it, but at least it has the great advantage of not disturbing the parameter orders. The implementation is also compatible with integral_constant defined with literals (using namespace boot::hana::literals): 3_c

Interesting. User literal changes the equation. To me range(1, n, 3_c) does look cleaner than range(1, n, c<3>). And most importantly the name space of user literals is significantly less populated than that of types. I would even argue it should be the default and preferred notation.

The downside is of course it's still a new convention.

There are more downsides:
- c++14 only
- does not work with template parameters as in: template<class T> void foo() { span(i,c<T::Bar>); }

True

- much more costly on the compiler (need to parse a char* using compile-time meta programming)

I believe integer user literals are fed a long long instead of char* by the compiler.

- I think it's as unusual as c<3> as literals are mainly tailored to convey units.

So it's just one possibility offered to user for free on Eigen's side, but that cannot be the default recommended approach!

Then, to be honest, I don't think we can define such a Eigen::c, as importing Eigen namespace would be too polluting. So we need to find another default name anyway. Some (not great) possibilities:

- IndexC<N>
- CIndex<N>
- CstIndex<N>
- CTIndex<N>

Right, I really don't like c<> from the start due to the namespace pollution and the unusual convention. What I meant was if we really had to adopt c<>, then I find it more sensible to make the user literal version the default publicly exposed one, and hide c<> within a namespace or under a longer name for power users.

For the "start" argument, currently I don't really plan to allow for compile-time value, though the case start=0 is sometimes useful to figure-out alignment or for sparse matrices. This case is currently part of the Block API using specialized members (middleRows, leftCols, etc.)

I think the user will expect a compile-time start/stop to be valid usage, especially since span allows TWO of its parameters to be static. It is going to be a real hassle to remember what can be static and which cannot.

OK, for more consistency we can easily accept compile-time value for each parameter.

Also, sometimes people may just want to provide a pair of compile-time (first, last), and get in return a compile time sized matrix, especially in unit tests. In that case, my order of preference is:

M(range<2, 10, 2>) // pretty succinct

but does not work if not all parameters are compile-time values.

M(range(2_c, 10_c, 2_c)) // acceptable, especially if editor color code 2_c's as literals.
M(range(c<2>, c<10>, c<2>)) // Ugh! So much noise for so simple an _expression_

but so far, that's still the prettiest option that satisfies all constraints.

Still, when you read old code, or debug a hard to track memory bug, do you ever get that nagging suspicion that you have the order of last and step wrong somewhere? I know I do.

yes, I've already acknowledged that I'm never 100% sure about the argument order of LinSpaced and Constant, but I think that's mainly because there are different kind of optional arguments. For range/span, I think it's ok as there is only one optional argument.

A more subtle but more important question though, is whether the "stop" argument in range(start,stop) should be inclusive (=last, as in MatLab, symmetric with the start argument) or exclusive (=end, as in C++ STL). I'm tempted to prefer the inclusive variant but since the exclusive one is more standard in C++, I'm puzzled...

Yeah. I think we should use (first, last) convention, since those are used by the technical computing community in general. Even boost::range's irange use the (first, last) convention. Also, when negative step comes into question, do you want to write range(n-1, -1, step<-1>) as opposed to the simpler range(n-1, 0, step<-1>), which is also symmetrical to range(0, n-1, step<1>).

Actually, I just though about another case that is not covered by the current proposal, that is templated code where the length (and/or step) could be either a compile-time or run-time value. For instance, with the block API we have:

A.block<Rows,Cols>(i,j,rows=Rows,cols=Cols)

this way one may write:

A.block<Dynamic,ColsAtCompileTime>(i,j,rows,actual_cols)

with ColsAtCompileTime being either Eigen::Dynamic (=-1) or a positive compile-time integer.

If we want to support this flexibility for both lengths and steps (and I think we do), then we first need to introduce a Eigen::DynamicStep=0 value because -1 is valid step whereas 0 is not, and then, following the current proposal, we could imagine something like:

span(start, c<LEN>(actual_len), c<STEP>(actual_step))

and let the span function figure-out whether the compile-time value is truly a compile-time value, or whether it should get the run-time value passed as argument.

Yeah I agree you probably need something like c<STEP>(actual_step) for generic code. Although I'd argue you should still allow c<STEP> and c(step) to work.

yes, of course!

Here is another related idea: Instead of using a generic c<N> to provide step or length, make both of them named parameters. This way we even remove artificial conceptual separation of range and span (which is yet another convention to memorize). Otherwise, when you see span(3, 10), you will always have a nagging suspicion: is this 3..10 or 3..12?

Here're some examples:

V(span(i, j)) // first..last
V(span(i, j, step(s))) // dynamic step
V(span(i, j, step<S>)) // static step
V(span(i, j, step<S>(s))) // support both dynamic and static step

V(span(i, len(n))) // dynamic span length
V(span(i, len<N>)) // static span length
V(span(i, len<N>(n)) // support both dynamic and static span length
V(span(i, len<N>, step(s))) // static size, dynamic step
V(span(i, len(n), step<S>(s))) // dynamic size, static/dynamic step.
...

We now have a simple consistent rule to remember: non-index parameters, len and step, are always named, and can be dynamic, or static, or both. This syntax adds a little verbosity, but avoids lots of confusion and greatly improves readability, imo.

At some points I've considered such an approach and did not pushed it too much because of verbosity, but after all, as we probably need a less intrusive name for c<N> anyway, this option seems worth to be thoroughly considered. Some remarks:

- you kept "span" (one less character?), but I think "range" is more intuitive (minor)

There are several advantages:

- Yes one fewer character! Shorter without adding confusion is always better.

- Consistency with Armadillo (minor advantage).

- The word "range" is overloaded much more than span. There's boost::range, the WIP range v3, and the "range-based" for loops. These big shots can have different conventions, some times [first, last) (range-based for), some times [begin, end) (boost's irange). For span, the only conflict is gsl::span right now (which should really be named array_view anyway, akin to std::string_view. It also takes in (base_ptr, length), so it's categorically different from our use).

- Thorough Eigen we use the word "size", so we should probably use Size<N>(n). Note the upper case "S" to convey that Size does not return the size of an object (as std::size), but rather constructs an object representing a size. Same for "step" -> "Step". This also reduces name collisions.

Yes I understand Eigen uses "size". I have some slight reservations regarding using size here. As a member function name, it's fine. But as a global function, it reminds me too much of std::size(), and even sizeof(). When I see Size(n), my first mental response is always that it refers to the size of n.

There is also the lesser issue where matlab's definition of size is [height, width], but the Eigen already adopted the C++ convention so this issue is moot. The word "len" conveys the idea that it is 1D, e.g., string::length.

Alternatives are "count", "extent", "length" (spelled in full). I used "len" since it is the shortest.

I actually liked the word "count" a lot, since it answers a small question at the back of my head: when step is >1, does the length parameter refer to the number of elements in the range, or the number of indices the range straddles across, i.e. (last - first + 1). "Count" unequivocally indicates the former, whereas "size", "extent", and "length" can be interpreted both ways. Another example of API with similar distinction is GL/Vulkan, where "size" is usually size in bytes, "extent" refers to (last - first + 1), and "count" is the number of elements.

So I'll use "count" from now on.

Regarding capitalization, I kind of feel that "step" and "count" should be in lower case. Mixing extra capital letters in a short range _expression_ adds syntax noises. An _expression_ like M(all, span(1, count(10)) looks cleaner than M(all, span(1, Count(10))) to me. Also "step" is likely implemented as an overloaded free-standing function, similar to span. Free-standing functions in Eigen starts with lower case by convention, right? If you type "range" and "all" in lower case, why is "count" capitalized? Of course, you could argue that it is a special naming convention for "named argument", but I hate to memorize new conventions...

Actually, this leads to another of my axe to grind. :) I so much prefer the STL/Boost naming convention where basically everything is in snake_case, except (future) concepts and template argument are in PascalCase. Maybe add an suffix "_" for non-public member variables, and macros are of course still ALL_UPPERCASE, but that's all the rules. With a library that uses lots of generic programming, so many templated types are used as compile-time functions or variables. Distinguishing them as types using PascalCase just feels a bit wrong. Also, long descriptor identifier names (like MaxRowsAtCompileTime) look so much better and cleaner in snake case (max_rows_at_compile_time), especially when you come back to read old code months or years later. Anyway I can probably write an essay on this, so I'll stop the rambling (till Eigen 4.0 or std::eigen or boost::eigen, maybe? ;)

Another random idea: it is possible, using operator= overloading, to have the following syntax for named parameter:

M(all, span(1, count = 10))

See e.g., BOOST_PARAMETER_NAME. I kind of like this more than M(all, span(1, count(10))), though in this case, we cannot use the same identifier for a static count, since count is a variable and count<C> won't compile. Maybe span(1, count_ = 10) and span(1, count<10>)?

- I think we could allow unnamed steps without much ambiguity: range(i, j, s) and range(i, Size(n), s), in particular the second version is unambiguous.

I totally agree.

We may or may not want support static indices. If we do, i and j can be supplied as compile time constants using the user suffix syntax, e.g., 3_c, as you described before. Alternatively, we can have the rule that indices can be supplied as template parameters to span. Consider:

V(span<3, 6>) vs V(span(3_c, 6_c))
V(span<3, 6>(step(s)) vs V(span(3_c, 6_c, step(s)))
V(span<3, 6>(step<2>)) vs V(span(3_c, 6_c, step<2>))
V(span<4>(len(n)) vs V(span(4_c, len(n))
V(span<6>(len<8>, step(s))) vs V(span(6_c, len<8>, step(s));

All except the first row look too noisy to me, so maybe we should just not support static indices....

If we go with the span/len/step (or rather range/Size/Step) approach and want compile-time indices, then the left column seems preferable to me. I see different use cases:

1- for sparse matrices, knowing that start=0 or start=last (with step=-1) is useful. For this we could imagine something else:

range("",j)
range(Eigen::last,j,Step<-1>)

We need a Eigen::last or Eigen::end "keyword" anyway to index relatively from the end. Using "" as a default start=0 might sounds odd, but it has the advantage of not having to introduce one more "keyword" Eigen::zero or Eigen::first.

I feel strongly that using "" to mean "begin" or "all" is a step towards obscurity. It's like those operator abusing math libraries where "&" means dot product and "^" means cross product. It feels clever when writing, but hampers compression when you come back to read it months later. Hell, if we had to commit that sin, I feel using the underscore "_" would be a better replacement for "all", i.e., M(_, i) = N(_, i).

I currently don't have a better proposal if we want to support static index and dynamic count/step... Need to think about it.

2 - Knowing the start at compile-time might also help to compute alignment, especially in the common case of start==0, which can be handled as in the previous point. Of course, in theory for that purpose we are rather interested in knowing whether start is a multiple of some power of two numbers... but that's too far fetched!

3 - Knowing both start and stop are known at compile-time to figure-out the size at compile-time, for this we only need:

range<START,STOP>([optional step]);
range<START,STOP>(start,stop[,optional step]);

that would be aliases for range(start, Size<STOP-START(+1?)>(stop-start(+1?)) [,optional step]);

gael