Re: [eigen] CUDA vectorization status

On Wed, Oct 12, 2016 at 11:23 AM, Robert Lukierski <r.lukierski12@xxxxxxxxxxxxxx> wrote:

Hi,

I first asked this on the eigen-core-team list, but this is probably
a more appropriate place. Backstory first.

At the beginning of 2013 I noticed the eigen-nvcc effort and quickly
forked that repo, as I thought it was a great idea. I was doing my PhD
(computer vision, SLAM, robotics) and I always hated that people
handcrafted small matrix classes to use inside CUDA device code; I hate
reinventing the wheel.

So I forked the eigen-nvcc repo and started adding EIGEN_DEVICE_FUNC
in thousands of places. Then I started benchmarking against the built-in
types (e.g. float4) and a DIY CUDA-compatible matrix class written by
one of the guys in our group. That's how I found the lack of
vectorization and improvised packet math using make_float4 to hint the
compiler to do the right thing. It worked awesomely, with almost
identical performance to float4, but it wasn't well received :/

https://bitbucket.org/lukier/eigen-nvcc/commits/4cf24b0cd4fd71ec93dffa76b016166d2dc16a24
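
For readers unfamiliar with the trick, here is a minimal sketch of the
two ideas mentioned above: a function qualifier macro in the spirit of
EIGEN_DEVICE_FUNC, and a 4-float "packet" shaped like CUDA's float4.
The names SKETCH_DEVICE_FUNC, Packet4f, make_packet4f and padd are
hypothetical and chosen for illustration only; this is not the code
from the linked commit. It assumes that giving the type the same 16-byte
size and alignment as float4 is what lets the compiler treat it as a
single 128-bit value.

```cpp
// Hypothetical stand-in for EIGEN_DEVICE_FUNC: expands to the CUDA
// qualifiers under nvcc and to nothing under a plain host compiler.
#if defined(__CUDACC__)
#define SKETCH_DEVICE_FUNC __host__ __device__
#else
#define SKETCH_DEVICE_FUNC
#endif

// Hypothetical 4-float packet with the same size and alignment as
// CUDA's built-in float4 (16 bytes, 16-byte aligned).
struct alignas(16) Packet4f {
    float x, y, z, w;
};

static_assert(sizeof(Packet4f) == 16, "packet should be 16 bytes");
static_assert(alignof(Packet4f) == 16, "packet should be 16-byte aligned");

// Analogue of make_float4: build a packet from four scalars.
SKETCH_DEVICE_FUNC constexpr Packet4f make_packet4f(float x, float y,
                                                    float z, float w) {
    return Packet4f{x, y, z, w};
}

// Element-wise packet addition, usable from host and device code.
SKETCH_DEVICE_FUNC constexpr Packet4f padd(const Packet4f& a,
                                           const Packet4f& b) {
    return make_packet4f(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}
```

With a plain host compiler the macro expands to nothing, so the same
header can be shared between .cpp and .cu files.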

Anyway, I kept my fork and used one copy of Eigen for the host C++ code
and a separate one for CUDA files (CUDA_PROPAGATE_HOST_FLAGS OFF in
CMake and cuda_include_directories). It worked great for years. I've
used libraries based on Eigen (like https://github.com/lukier/Sophus for
Lie groups) in CUDA. I also EIGEN_DEVICE_FUNC-ed the jet.h header from
ceres-solver (automatic differentiation) so I could template the SE3
group class (internally based on Quaternion and Vector3) with
ceres::Jet<float,N>, and I had all my Jacobians calculated
automatically: from my templated camera models
(https://github.com/lukier/camera_models), through the Lie group
transformation and, in the end, chain-ruled with numerical derivatives
from the image/depth map, all done by the compiler. I haven't
calculated a single Jacobian by hand in my entire PhD thanks to that.
Cool stuff.
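
To illustrate why templating on a Jet-like scalar yields derivatives
for free, here is a minimal sketch of forward-mode automatic
differentiation with a single-derivative dual number. Dual and
cubic_plus_linear are hypothetical names; this is a simplified stand-in
for the idea, not the implementation of ceres::Jet, which carries N
partial derivatives.

```cpp
// Hypothetical dual number: a value plus one derivative component.
struct Dual {
    double v;  // function value
    double d;  // derivative with respect to the seeded input
    constexpr Dual(double value, double deriv = 0.0) : v(value), d(deriv) {}
};

// Sum rule: (a + b)' = a' + b'
constexpr Dual operator+(Dual a, Dual b) {
    return Dual(a.v + b.v, a.d + b.d);
}

// Product rule: (a * b)' = a' * b + a * b'
constexpr Dual operator*(Dual a, Dual b) {
    return Dual(a.v * b.v, a.d * b.v + a.v * b.d);
}

// Code templated on its scalar type works for double and Dual alike.
// f(x) = x^3 + 2x, so f'(x) = 3x^2 + 2.
template <typename T>
constexpr T cubic_plus_linear(T x) {
    return x * x * x + T(2.0) * x;
}
```

Calling cubic_plus_linear(Dual(3.0, 1.0)) seeds the input derivative
with 1 and returns both f(3) and f'(3) in one pass, while
cubic_plus_linear(3.0) evaluates the same code on plain doubles.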

To the point. I couldn't be bothered to maintain my fork, and it was
based on 3.2, so with C++11 and newer compilers (including newer, better
CUDA compilers) it started generating more and more warnings. Now I've
finished my PhD, I'm a research engineer, and I want others to use
Eigen-CUDA and friends instead of reinventing the wheel. And Eigen 3.3
supports CUDA in the mainline, which makes me very happy.

But it seems the support is not complete yet. The Geometry module is
missing EIGEN_DEVICE_FUNC annotations (I'm working on a pull request
now).

However, the most important question is: what is the state of
vectorization on CUDA? It seems somebody implemented the same trick I
did in 2013:

https://bitbucket.org/eigen/eigen/src/1a24287c6c133b46f8929cf5a4550e270ab66025/Eigen/src/Core/arch/CUDA/PacketMath.h?fileviewer=file-view-default

but it looks like it is not enabled. In the Core file,
EIGEN_DONT_VECTORIZE is defined for __CUDACC__, which overrides
subsequent stuff like:
#if defined __CUDACC__
#define EIGEN_VECTORIZE_CUDA
....
and therefore no alignment is set. Also, inclusion of the PacketMath
header depends on EIGEN_USE_GPU, and I don't think this is set anywhere.
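
The interaction described above can be reproduced in isolation with a
small preprocessor sketch. All SKETCH_* names are hypothetical
stand-ins, and the nesting shown is an assumption based on the
description in this message, not a copy of Eigen's Core header: an early
block sets a "don't vectorize" flag for the CUDA compiler, a later block
that would enable the CUDA packet path is guarded by that flag, and the
packet header is gated on a second macro that nothing defines.

```cpp
// Stand-in for nvcc's predefined __CUDACC__ (hypothetical name).
#define SKETCH_CUDACC 1

// Step 1: an early block disables vectorization for the CUDA compiler.
#if defined(SKETCH_CUDACC)
#define SKETCH_DONT_VECTORIZE
#endif

// Step 2: a later block would enable the CUDA packet path, but it is
// only reached when the flag from step 1 is NOT set.
#if !defined(SKETCH_DONT_VECTORIZE)
#if defined(SKETCH_CUDACC)
#define SKETCH_VECTORIZE_CUDA
#endif
#endif

// Step 3: the packet header is gated on a separate macro that is
// never defined in this sketch.
#if defined(SKETCH_USE_GPU)
constexpr bool kPacketHeaderIncluded = true;
#else
constexpr bool kPacketHeaderIncluded = false;
#endif

#if defined(SKETCH_VECTORIZE_CUDA)
constexpr bool kCudaPacketsEnabled = true;
#else
constexpr bool kCudaPacketsEnabled = false;
#endif

// Both paths end up disabled even though the "CUDA compiler" is active.
static_assert(!kCudaPacketsEnabled, "packet path is switched off");
static_assert(!kPacketHeaderIncluded, "packet header is never pulled in");
```

Under these assumptions, removing step 1 and defining SKETCH_USE_GPU
would flip both constants to true.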

Let me know if I can help. I could rerun my performance tests from 2013
along with the Sophus unit tests, which exercise fixed-size Eigen (via
Geometry) quite well. We have a wide range of NVIDIA GPUs in the lab, so
I can test on pretty much everything from GTX480 to GTX1080/TitanX.

With best regards,
Robert Lukierski.

--
Robert Lukierski
Dyson Robotics Lab
Imperial College London
r.lukierski12@xxxxxxxxxxxxxx

Benoit