Re: [eigen] CUDA vectorization status


I use the CUDA packet primitives in the tensor module. Loading and writing floats in packets of 4 whenever possible helps speed things up. Processing 16-bit floats 2 by 2 also doubles the throughput on GPUs that fully support them (e.g. Tegra X1 and Pascal P100).
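As a rough illustration of what "packets of 4" means, here is a minimal host-side sketch. The names Packet4f and scale_in_packets are made up for this example and are not Eigen's packet primitives; in device code the built-in float4 would normally play the role of Packet4f. The loop handles four floats per iteration and finishes any leftover elements one at a time.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical 4-float packet (illustration only, not an Eigen type).
struct alignas(16) Packet4f {
    float x, y, z, w;
};

// Multiply n floats by s, four at a time where possible.
// memcpy keeps the load/store well-defined in standard C++;
// compilers typically turn it into a single 16-byte access.
inline void scale_in_packets(const float* in, float* out, std::size_t n, float s) {
    const std::size_t packed_end = n - (n % 4);

    // Packet part: one load, four multiplies, one store per iteration.
    for (std::size_t i = 0; i < packed_end; i += 4) {
        Packet4f p;
        std::memcpy(&p, in + i, sizeof(p));
        p.x *= s;
        p.y *= s;
        p.z *= s;
        p.w *= s;
        std::memcpy(out + i, &p, sizeof(p));
    }

    // Scalar tail for the last n % 4 elements.
    for (std::size_t i = packed_end; i < n; ++i) {
        out[i] = in[i] * s;
    }
}
```

Calling scale_in_packets(in, out, 6, 2.0f) would process the first four elements as one packet and the remaining two as scalars.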

I haven't tried to vectorize the matrix code and don't foresee doing so any time soon, so you're welcome to give it a try and see how much this helps.

On Wed, Oct 12, 2016 at 11:23 AM, Robert Lukierski <r.lukierski12@xxxxxxxxxxxxxx> wrote:

I asked this on the eigen-core-team list first, but this is
probably a more appropriate place. Backstory first.

At the beginning of 2013 I noticed the eigen-nvcc efforts and quickly
forked that repo, as I thought it was a great idea. I was doing my PhD
(computer vision, SLAM, robotics) and I always hated that people
handcrafted small matrix classes to use inside CUDA device code; I hate
reinventing the wheel.

So I forked the eigen-nvcc repo and started adding EIGEN_DEVICE_FUNC
in thousands of places. Then I started benchmarking against the built-in
types (e.g. float4) and a DIY CUDA-compatible matrix class by one of
the guys in our group. That's how I found the lack of vectorization and
improvised packet math using make_float4 to hint the compiler to do the
right thing. It worked awesomely, with almost identical performance to
float4, but it wasn't well received :/
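For readers unfamiliar with the trick, the following is a minimal sketch of packet math written on top of make_float4. It is not the code from the fork or from Eigen: the sketch namespace, the SKETCH_DEVICE_FUNC macro and the function bodies are assumptions made for illustration, with names merely styled after Eigen's padd/pmul/pset1. When compiled without nvcc, small host stand-ins replace the CUDA built-ins so the example can still be tried.

```cpp
#if defined(__CUDACC__)
  #include <cuda_runtime.h>  // provides float4 and make_float4
  #define SKETCH_DEVICE_FUNC __host__ __device__
#else
  // Host-only stand-ins so the sketch compiles without nvcc.
  struct alignas(16) float4 { float x, y, z, w; };
  inline float4 make_float4(float x, float y, float z, float w) {
      return float4{x, y, z, w};
  }
  #define SKETCH_DEVICE_FUNC
#endif

namespace sketch {

// Broadcast one scalar to all four lanes.
SKETCH_DEVICE_FUNC inline float4 pset1(float v) {
    return make_float4(v, v, v, v);
}

// Lane-wise addition of two packets.
SKETCH_DEVICE_FUNC inline float4 padd(const float4& a, const float4& b) {
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}

// Lane-wise multiplication of two packets.
SKETCH_DEVICE_FUNC inline float4 pmul(const float4& a, const float4& b) {
    return make_float4(a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w);
}

}  // namespace sketch
```

The point of returning values built with make_float4 is that the data stays in a 16-byte-aligned built-in type, which gives the compiler the chance to use wide loads and stores.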

Anyway, I kept my fork and used a separate Eigen for the host C++ and a
separate one for CUDA (cuda_include_directories). It worked great for
years. I've used libraries based on Eigen (e.g. for Lie groups) in CUDA,
and I've also EIGEN_DEVICE_FUNC-ed the jet.h header from ceres-solver
(automatic differentiation) so I could template an SE3 group class
(internally based on Quaternion and Vector3) with ceres::Jet<float,N>
and have all my Jacobians automatically calculated, from my templated
camera models, through the Lie group transformation and, in the end,
chain-ruled with numerical derivatives from the image/depth map, all
done by the compiler. I haven't calculated a single Jacobian in my
entire PhD thanks to that. Cool stuff.
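To show the idea behind templating on a jet type, here is a minimal dual-number sketch. MiniJet, make_variable and example_residual are hypothetical names invented for this example; this is not ceres::Jet and it omits almost everything the real header provides. A function written once as a template can be evaluated with plain floats for the value alone, or with MiniJet to get the value and its partial derivatives in one pass.

```cpp
#include <array>

// Hypothetical minimal jet: a value plus N partial derivatives.
template <typename T, int N>
struct MiniJet {
    T value{};
    std::array<T, N> grad{};
};

// Create the index-th independent variable (derivative 1 w.r.t. itself).
template <typename T, int N>
MiniJet<T, N> make_variable(T value, int index) {
    MiniJet<T, N> j;
    j.value = value;
    j.grad[index] = T(1);
    return j;
}

// Sum rule: derivatives add.
template <typename T, int N>
MiniJet<T, N> operator+(const MiniJet<T, N>& x, const MiniJet<T, N>& y) {
    MiniJet<T, N> r;
    r.value = x.value + y.value;
    for (int i = 0; i < N; ++i) r.grad[i] = x.grad[i] + y.grad[i];
    return r;
}

// Product rule: d(xy) = dx * y + x * dy.
template <typename T, int N>
MiniJet<T, N> operator*(const MiniJet<T, N>& x, const MiniJet<T, N>& y) {
    MiniJet<T, N> r;
    r.value = x.value * y.value;
    for (int i = 0; i < N; ++i) r.grad[i] = x.grad[i] * y.value + x.value * y.grad[i];
    return r;
}

// Written once; works for float and for MiniJet<float, N>.
template <typename S>
S example_residual(const S& x, const S& y) {
    return x * y + x;
}
```

Evaluating example_residual on make_variable<float, 2>(3.0f, 0) and make_variable<float, 2>(4.0f, 1) yields the value together with both partial derivatives, without any hand-written Jacobian.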

To the point. I couldn't be bothered maintaining my fork and it was
based on 3.2 so with C++11 and newer compilers (also newer better CUDA
compilers) it started generating more and more warnings. Now I've
finished my PhD and I'm a research engineer and I want others to use
Eigen-CUDA and friends instead of reinventing the wheel. And Eigen 3.3
supports CUDA in the mainline which made me very happy.

But it seems the support is not complete yet. The Geometry module is
missing EIGEN_DEVICE_FUNC (I'm working on the pull request now).

However, the most important question is what is the state of
vectorization on CUDA? It seems somebody implemented the same trick I
did in 2013:

but it looks like it is not enabled. In the Core file, EIGEN_DONT_VECTORIZE
is defined for __CUDACC__, which overrides subsequent stuff like:
#if defined __CUDACC__
and therefore there is no alignment set. Also, the PacketMath header
inclusion depends on EIGEN_USE_GPU and I don't think this is set anywhere.
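The effect described here, an early "don't vectorize" switch making a later GPU-specific branch unreachable, can be reproduced in a few lines of preprocessor code. Every macro below (SKETCH_COMPILING_FOR_GPU, SKETCH_DONT_VECTORIZE, SKETCH_ALIGN_BYTES) is a made-up stand-in; this is a simplified model of the pattern, not the actual layout of Eigen's Core header.

```cpp
// Stand-in for the compiler-provided __CUDACC__, so the sketch can be tried on the host.
#define SKETCH_COMPILING_FOR_GPU 1

// Early in the hypothetical core header: vectorization is switched off for GPU builds.
#if defined(SKETCH_COMPILING_FOR_GPU)
  #define SKETCH_DONT_VECTORIZE
#endif

// Later in the same header: the GPU branch sits behind the switch above,
// so it is never taken and no alignment gets set.
#if !defined(SKETCH_DONT_VECTORIZE)
  #if defined(SKETCH_COMPILING_FOR_GPU)
    #define SKETCH_ALIGN_BYTES 16
  #endif
#endif

// Fallback when nothing requested an alignment.
#ifndef SKETCH_ALIGN_BYTES
  #define SKETCH_ALIGN_BYTES 0
#endif

#if defined(SKETCH_DONT_VECTORIZE)
constexpr bool sketch_vectorization_enabled = false;
#else
constexpr bool sketch_vectorization_enabled = true;
#endif

constexpr int sketch_align_bytes = SKETCH_ALIGN_BYTES;
```

With the first define in place, sketch_align_bytes ends up as 0 even though a GPU-specific value exists further down, which is the same kind of shadowing as described above.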

Let me know if I can help. I could run my performance tests from 2013
again, and the Sophus unit tests, which exercise fixed-size Eigen (via
Geometry) quite well. We have a wide range of NVIDIA GPUs in the lab, so
I can test on pretty much everything from GTX480 to GTX1080/TitanX.

With best regards,
Robert Lukierski.

Robert Lukierski
Dyson Robotics Lab
Imperial College London

