OK, I see. Then it is not really what people call skyline storage (http://www.netlib.org/linalg/html_templates/node96.html), but more like an extension of the tridiagonal storage.
However, I don't see how your storage can lead to a more efficient implementation of the LU decomposition than the standard skyline storage. For instance, here is a basic algorithm without pivoting for the standard skyline storage:
for(int k = 0; k+1 < rows; ++k)
{
  int rrows = rows-k-1;
  int rsize = size-k-1;
  // line 1: scale the sub-column below the pivot
  lu.col(k).end(rrows) /= lu.coeff(k,k);
  // line 2: rank-1 update of the trailing bottom-right block
  lu.corner(BottomRight,rrows,rsize).noalias() -= lu.col(k).end(rrows) * lu.row(k).end(rsize);
}
where lu is the working matrix and lu.col(k) is assumed to return a "range vector" as I described in my previous email.
Here line 1 would be trivially optimized (i.e., vectorized) since lu.col(k).end(rrows) is just a small dense vector.
Line 2 is an outer product, which again is trivially/automatically vectorized since it is implemented as a sequence of: "col_range_vector_i -= scalar * col_range_vector".
Here the locality is pretty good because the vector "lu.col(k).end(rrows)", which is reused multiple times, is stored sequentially in memory.
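To make the access pattern concrete, here is a minimal sketch of the same two steps written as explicit loops in plain C++. Note the assumptions: it works on a dense column-major array rather than on skyline storage, it is not Eigen code, it assumes every pivot is nonzero, and the name lu_no_pivot is made up for illustration. With range vectors the inner loops would simply run over each column's stored range instead of over all rows below the pivot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place LU without pivoting of an n x n matrix stored column-major
// in a (entry (i,j) lives at a[i + j*n]). On return, the strict lower
// triangle holds L (unit diagonal implied) and the upper triangle holds U.
// Hypothetical helper for illustration; pivots are assumed to be nonzero.
void lu_no_pivot(std::vector<double>& a, std::size_t n) {
    assert(a.size() == n * n);
    for (std::size_t k = 0; k + 1 < n; ++k) {
        double* colk = &a[k * n];      // column k, contiguous in memory
        const double pivot = colk[k];

        // line 1: scale the part of column k below the pivot
        for (std::size_t i = k + 1; i < n; ++i)
            colk[i] /= pivot;

        // line 2: outer product update, done one column at a time as
        // "col_j -= scalar * col_k", where scalar is the entry (k,j)
        for (std::size_t j = k + 1; j < n; ++j) {
            double* colj = &a[j * n];
            const double s = colj[k];
            for (std::size_t i = k + 1; i < n; ++i)
                colj[i] -= s * colk[i];
        }
    }
}
```

Both inner loops walk contiguous memory, and the scaled part of column k is read once per trailing column, which is the reuse mentioned above.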
But perhaps there exists a special algorithm which perfectly fits your storage? Is it possible to see your code somewhere? Finally, if it is really more efficient, then it would make sense to have it in Eigen.