Trying to write a cache-friendly implementation directly is probably a bit too difficult. I'd recommend first getting a working implementation based on vector-vector operations, and then seeing how to leverage more efficient matrix-vector or even matrix-matrix operations.
Moreover, it would be better to write a high-level blocking strategy, as in the PartialPivLU and LLT solvers, and let the existing triangular solver and matrix products deal with the nasty details. Such an approach should lead to much simpler code with less redundancy, and the result will be more future-proof, since the internal matrix product kernels are subject to change from one version to the next.
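To make the idea of a high-level blocking strategy concrete, here is a minimal sketch of a right-looking blocked Cholesky factorization in plain C++. It is not Eigen's actual code: the `Mat` type and the helper names (`factorDiagonalBlock`, `solvePanel`, `updateTrailing`, `blockedCholesky`) are hypothetical, and the naive loops inside the helpers only stand in for the existing triangular solver and matrix product that the library would presumably provide. The point is that the driver only sequences three block operations and contains no cache-level detail itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical dense row-major square matrix; only the lower triangle is used.
struct Mat {
    std::size_t n;
    std::vector<double> a;
    explicit Mat(std::size_t n_) : n(n_), a(n_ * n_, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) { return a[i * n + j]; }
    double operator()(std::size_t i, std::size_t j) const { return a[i * n + j]; }
};

// Step 1: unblocked Cholesky of the diagonal block A11 = A[k:k+bs, k:k+bs].
void factorDiagonalBlock(Mat& A, std::size_t k, std::size_t bs) {
    for (std::size_t j = k; j < k + bs; ++j) {
        for (std::size_t p = k; p < j; ++p) A(j, j) -= A(j, p) * A(j, p);
        assert(A(j, j) > 0.0 && "matrix must be positive definite");
        A(j, j) = std::sqrt(A(j, j));
        for (std::size_t i = j + 1; i < k + bs; ++i) {
            for (std::size_t p = k; p < j; ++p) A(i, j) -= A(i, p) * A(j, p);
            A(i, j) /= A(j, j);
        }
    }
}

// Step 2: triangular solve A21 <- A21 * L11^{-T} on the panel below A11.
// In a library this would be delegated to the existing triangular solver.
void solvePanel(Mat& A, std::size_t k, std::size_t bs) {
    for (std::size_t i = k + bs; i < A.n; ++i)
        for (std::size_t j = k; j < k + bs; ++j) {
            for (std::size_t p = k; p < j; ++p) A(i, j) -= A(i, p) * A(j, p);
            A(i, j) /= A(j, j);
        }
}

// Step 3: trailing update A22 <- A22 - L21 * L21^T (lower triangle only).
// In a library this would be delegated to the optimized matrix product.
void updateTrailing(Mat& A, std::size_t k, std::size_t bs) {
    for (std::size_t i = k + bs; i < A.n; ++i)
        for (std::size_t j = k + bs; j <= i; ++j)
            for (std::size_t p = k; p < k + bs; ++p) A(i, j) -= A(i, p) * A(j, p);
}

// High-level driver: overwrites the lower triangle of A with L, where A = L * L^T.
void blockedCholesky(Mat& A, std::size_t blockSize) {
    assert(blockSize > 0);
    for (std::size_t k = 0; k < A.n; k += blockSize) {
        const std::size_t bs = std::min(blockSize, A.n - k);
        factorDiagonalBlock(A, k, bs);
        solvePanel(A, k, bs);
        updateTrailing(A, k, bs);
    }
}
```

With this structure, swapping the bodies of `solvePanel` and `updateTrailing` for calls into an optimized triangular solver and matrix product changes the performance but not the driver, which is where the simplicity and future-proofing come from.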