Hi,
trying to write a cache-friendly implementation directly is probably a bit too difficult. I'd recommend first getting a working implementation based on vector-vector operations, and then seeing how to leverage more efficient matrix-vector or even matrix-matrix operations.
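To make the "vector-vector first" step concrete, here is a minimal sketch. It uses a Cholesky factorization purely as a stand-in, since that is an assumption about the kind of decomposition involved, and it is written in plain C++ rather than against Eigen types. The names `Mat`, `row_dot` and `cholesky_unblocked` are hypothetical helpers made up for this illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major dense n x n matrix. Hypothetical helper type for this sketch only.
struct Mat {
    std::size_t n;
    std::vector<double> d;
    explicit Mat(std::size_t n_) : n(n_), d(n_ * n_, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) { return d[i * n + j]; }
    double operator()(std::size_t i, std::size_t j) const { return d[i * n + j]; }
};

// Vector-vector kernel: dot product of the first k entries of rows i and j.
static double row_dot(const Mat& L, std::size_t i, std::size_t j, std::size_t k) {
    double s = 0.0;
    for (std::size_t p = 0; p < k; ++p) s += L(i, p) * L(j, p);
    return s;
}

// Unblocked Cholesky A = L * L^T built only from dot products.
// L must be a zero-initialized matrix of the same size as A; only its
// lower triangle is written. Returns false on a non-positive pivot.
bool cholesky_unblocked(const Mat& A, Mat& L) {
    const std::size_t n = A.n;
    for (std::size_t j = 0; j < n; ++j) {
        const double diag = A(j, j) - row_dot(L, j, j, j);
        if (diag <= 0.0) return false;
        L(j, j) = std::sqrt(diag);
        for (std::size_t i = j + 1; i < n; ++i)
            L(i, j) = (A(i, j) - row_dot(L, i, j, j)) / L(j, j);
    }
    return true;
}
```

A version like this is slow but easy to check, which makes it a useful reference when validating a faster variant later.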
Moreover, it would be better to write a high-level blocking strategy, as in the PartialPivLU and LLT solvers, and let the existing triangular solver and matrix products deal with the nasty details. Such an approach should lead to much simpler code with less redundancy, and the result will be more future-proof, since the internal matrix product kernels are subject to change from one version to the next.
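A rough sketch of what such a high-level blocking strategy looks like, again using Cholesky only as an assumed example and plain C++ instead of Eigen types. The three helpers (`factor_diag_block`, `solve_panel`, `update_trailing`) and the `Mat` type are hypothetical names for this illustration. In real code the last two would not be hand-written loops: they are the parts to hand over to the library's triangular solver and matrix product, so that only the short driver loop has to be maintained.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major dense n x n matrix. Hypothetical helper type for this sketch only.
struct Mat {
    std::size_t n;
    std::vector<double> d;
    explicit Mat(std::size_t n_) : n(n_), d(n_ * n_, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) { return d[i * n + j]; }
    double operator()(std::size_t i, std::size_t j) const { return d[i * n + j]; }
};

// Step 1: unblocked factorization of the b x b diagonal block at (k, k).
static bool factor_diag_block(Mat& A, std::size_t k, std::size_t b) {
    for (std::size_t j = k; j < k + b; ++j) {
        double diag = A(j, j);
        for (std::size_t p = k; p < j; ++p) diag -= A(j, p) * A(j, p);
        if (diag <= 0.0) return false;
        A(j, j) = std::sqrt(diag);
        for (std::size_t i = j + 1; i < k + b; ++i) {
            double s = A(i, j);
            for (std::size_t p = k; p < j; ++p) s -= A(i, p) * A(j, p);
            A(i, j) = s / A(j, j);
        }
    }
    return true;
}

// Step 2: triangular solve A21 <- A21 * L11^{-T} on the panel below the block.
static void solve_panel(Mat& A, std::size_t k, std::size_t b) {
    for (std::size_t i = k + b; i < A.n; ++i)
        for (std::size_t j = k; j < k + b; ++j) {
            double s = A(i, j);
            for (std::size_t p = k; p < j; ++p) s -= A(i, p) * A(j, p);
            A(i, j) = s / A(j, j);
        }
}

// Step 3: matrix-matrix update A22 <- A22 - A21 * A21^T (lower triangle only).
static void update_trailing(Mat& A, std::size_t k, std::size_t b) {
    for (std::size_t i = k + b; i < A.n; ++i)
        for (std::size_t j = k + b; j <= i; ++j) {
            double s = 0.0;
            for (std::size_t p = k; p < k + b; ++p) s += A(i, p) * A(j, p);
            A(i, j) -= s;
        }
}

// Driver: overwrites the lower triangle of A with L such that A = L * L^T.
// The strictly upper triangle is left untouched. block_size must be > 0.
bool cholesky_blocked(Mat& A, std::size_t block_size) {
    for (std::size_t k = 0; k < A.n; k += block_size) {
        const std::size_t b = std::min(block_size, A.n - k);
        if (!factor_diag_block(A, k, b)) return false;
        solve_panel(A, k, b);
        update_trailing(A, k, b);
    }
    return true;
}
```

The point of this layout is that nearly all of the arithmetic sits in steps 2 and 3, so any improvement to the underlying solver or product kernels benefits the decomposition without touching the driver.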
cheers,
gael