Re: [eigen] [RFC] Tensor module: class hierarchy



I have been working on extending the Tensor class quite a bit, and I am reaching the point where the code is almost ready to be useful.

What I have:
 * Support for dynamically sized and statically sized tensors, as well as tensor maps
 * Support for many tensor expressions (unary cwise, binary cwise, select, contractions, and convolutions)
 * Support for C++11, with a fallback to C++03 for compilers such as nvcc that don't support C++11
 * Support for automatic parallelization across multiple CPU cores and/or offloading to a GPU through devices. The expression A = B + C; can be offloaded to a GPU by calling A.device(my_gpu) = B + C; or parallelized over multiple CPU cores with A.device(my_thread_pool) = B + C;
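To make the device mechanism concrete, here is a minimal standalone sketch of the idea (the names DeviceProxy, assignSum, etc. are hypothetical illustrations, not the module's actual internals): the destination is wrapped in a proxy carrying a device, and assignment dispatches evaluation on that device.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of the device mechanism: A.device(d) returns a proxy
// that remembers the destination and the device, and assignment evaluates
// the right-hand side on that device via tag dispatch.
struct DefaultDevice {};
struct ThreadPoolDevice { int num_threads; };

template <typename Device>
struct DeviceProxy {
  std::vector<float>& dst;
  Device dev;

  // Evaluate dst = a + b on the chosen device (overload chosen by Device).
  void assignSum(const std::vector<float>& a, const std::vector<float>& b) {
    eval(dev, a, b);
  }

  void eval(DefaultDevice, const std::vector<float>& a,
            const std::vector<float>& b) {
    for (std::size_t i = 0; i < dst.size(); ++i) dst[i] = a[i] + b[i];
  }
  void eval(ThreadPoolDevice d, const std::vector<float>& a,
            const std::vector<float>& b) {
    // Split the range into per-thread chunks; a real implementation would
    // hand each chunk to a worker thread, here they run sequentially.
    std::size_t n = dst.size();
    std::size_t chunk = (n + d.num_threads - 1) / d.num_threads;
    for (std::size_t start = 0; start < n; start += chunk)
      for (std::size_t i = start; i < n && i < start + chunk; ++i)
        dst[i] = a[i] + b[i];
  }
};

// Free function standing in for A.device(d): returns the assignment proxy.
template <typename Device>
DeviceProxy<Device> device(std::vector<float>& dst, Device d) {
  return DeviceProxy<Device>{dst, d};
}
```

The point of the proxy is that A.device(d) = B + C; keeps the familiar assignment syntax while routing the evaluation loop through the device.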

I have reused the existing Eigen codebase as much as possible, which gives me for free:
  * Vectorization leveraging SSE/AVX
  * Primitives for tensor operations (addition, exponentiation, ...)
I didn't reuse the Eigen expression mechanism, since it assumes that arrays/matrices are 2D objects, but I tried to follow the pattern as much as possible.
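To illustrate why the 2D assumption matters, here is a minimal rank-generic expression-template sketch (all names hypothetical, not the module's actual code): the rank is a template parameter and coefficients are addressed by a flat index, so the same expression node works for any number of dimensions.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Illustrative rank-generic tensor: dimensions are an std::array of Rank
// sizes rather than fixed rows/cols, and storage is a flat buffer.
template <std::size_t Rank>
struct SimpleTensor {
  std::array<std::size_t, Rank> dims;
  std::vector<float> data;

  explicit SimpleTensor(std::array<std::size_t, Rank> d) : dims(d) {
    std::size_t n = 1;
    for (std::size_t s : d) n *= s;
    data.resize(n);
  }
  float coeff(std::size_t i) const { return data[i]; }
  std::size_t size() const { return data.size(); }
};

// Binary cwise node, analogous in spirit to Eigen's CwiseBinaryOp but with
// no notion of rows and columns.
template <typename Op, typename Lhs, typename Rhs>
struct CwiseBinary {
  Op op;
  const Lhs& lhs;
  const Rhs& rhs;
  float coeff(std::size_t i) const { return op(lhs.coeff(i), rhs.coeff(i)); }
};

struct SumOp { float operator()(float a, float b) const { return a + b; } };

template <typename Lhs, typename Rhs>
CwiseBinary<SumOp, Lhs, Rhs> sum(const Lhs& l, const Rhs& r) {
  return {SumOp{}, l, r};  // lazy: nothing is computed yet
}

// Assignment triggers evaluation, one coefficient at a time.
template <std::size_t Rank, typename Expr>
void assign(SimpleTensor<Rank>& dst, const Expr& e) {
  for (std::size_t i = 0; i < dst.size(); ++i) dst.data[i] = e.coeff(i);
}
```

As in Eigen proper, the expression tree is built lazily and only evaluated on assignment, which is what allows the device mechanism to choose where that loop runs.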

On Sun, Jun 8, 2014 at 4:26 AM, Christian Seiler <christian@xxxxxxxx> wrote:
Hi there,

I am currently working on the tensor module to try to make it more
useful. (In case you are not familiar with it: I put some introductory
words in the wiki: [1])

The current state is: it basically only stores data for now.

My goal for the immediate future is to implement a class hierarchy
similar to the one Eigen itself uses. This will then allow me to
introduce some basic expressions, such as adding two tensors, which
will immediately make the tensor module quite a bit more useful than
it is now.

My problem here is that I have some conceptual questions I'd like to
discuss before proceeding. I've documented the details under:

Basically I have three immediate things I'd like to have feedback on:

1. the names of the new class hierarchy: do you agree or do you have
   better ideas?

2. packet access for tensors: should I just deviate from Eigen by
   changing the argument order of writePacket for tensors or should
   writePacket() for tensors only accept std::array for index
   specification? (see wiki page for an explanation)
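To make the std::array alternative from question 2 concrete, here is a small sketch (the linearIndex helper is hypothetical, not the module's actual API) of how an std::array of per-dimension coordinates would be mapped to a flat storage offset, using column-major strides as in Eigen's default storage order; the packet itself is elided.

```cpp
#include <array>
#include <cstddef>

// Hypothetical illustration for question 2: if writePacket() took the index
// as an std::array of per-dimension coordinates, the implementation would
// linearize them into a flat offset. Column-major convention: the first
// dimension varies fastest, matching Eigen's default storage order.
template <std::size_t Rank>
std::size_t linearIndex(const std::array<std::size_t, Rank>& dims,
                        const std::array<std::size_t, Rank>& idx) {
  std::size_t offset = 0, stride = 1;
  for (std::size_t d = 0; d < Rank; ++d) {
    offset += idx[d] * stride;
    stride *= dims[d];
  }
  return offset;
}
```

The std::array signature sidesteps the argument-order question entirely, at the cost of a slightly heavier call syntax at each access site.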

3. expression template names: is it possible to somehow reuse the
   CwiseBinaryOp etc. names for tensors (obviously with a different
   implementation behind it) or do I have to go the way of

Any other general comments / ideas / suggestions are of course also welcome.

Thank you!



