Re: [eigen] CUDA vectorization status

On 10/13/2016 06:17 PM, Benoit Steiner wrote:
> I use the CUDA packet primitives in the tensor module. Loading and
> writing floats in packets of 4 whenever possible helps speed things
> up. Processing 16-bit floats 2 by 2 also doubles the throughput on GPUs
> that fully support them (e.g. Tegra X1 and Pascal P100).
> I haven't tried to vectorize the matrix code and don't foresee doing so
> any time soon, so you're welcome to give it a try and see how much this helps.

Thanks for the info. I've started reading some test code and I've
noticed EIGEN_USE_GPU there, so mystery solved.

I'll experiment with matrix vectorization using the current CUDA/PacketMath.
I think the biggest problem may be the struct alignment/padding
differences between SSE/AVX and CUDA, as the dense structs are often
passed verbatim through memcpy-like mechanisms (kernel arguments,
cudaMemcpyToSymbol, CUDA/host buffers allocated with cudaMalloc, etc.).
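
To make the alignment/padding concern concrete, here is a minimal
host-only C++ sketch. The struct names (Vec3Plain, Vec3Aligned,
ParamsPlain, ParamsAligned) and the convert() helper are hypothetical
stand-ins, not Eigen types; the 16-byte alignment is only an assumption
about what a vectorized build might request. It shows how the same three
floats can end up with a different size and offset, which is what would
break a byte-wise copy between two differently configured builds:

```cpp
#include <cassert>   // for the runtime checks in the usage note
#include <cstddef>   // offsetof
#include <cstring>   // std::memcpy

// Hypothetical stand-ins (not Eigen types): the same three floats, once
// with natural alignment and once forced to 16 bytes.
struct Vec3Plain {
    float v[3];
};

struct alignas(16) Vec3Aligned {
    float v[3];
};

// A parameter struct embedding the vector after a scalar. The offset of
// 'p' depends on the alignment of the embedded type.
struct ParamsPlain {
    float scale;
    Vec3Plain p;
};

struct ParamsAligned {
    float scale;
    Vec3Aligned p;
};

// Layout facts a raw byte copy would silently rely on
// (assuming 4-byte floats, as on the usual host and CUDA targets).
static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");
static_assert(sizeof(Vec3Plain) == 12, "three floats, no padding");
static_assert(sizeof(Vec3Aligned) == 16, "size rounded up to alignment");
static_assert(offsetof(ParamsPlain, p) == 4, "directly after the scalar");
static_assert(offsetof(ParamsAligned, p) == 16, "12 bytes of padding first");

// Field-by-field conversion stays correct even though the layouts differ;
// a memcpy of the whole struct would not.
inline ParamsAligned convert(const ParamsPlain& in) {
    ParamsAligned out{};
    out.scale = in.scale;
    std::memcpy(out.p.v, in.p.v, sizeof(in.p.v));
    return out;
}
```

For example, converting ParamsPlain{2.0f, {{1.0f, 2.0f, 3.0f}}} with
convert() preserves every field, while sizeof(ParamsPlain) and
sizeof(ParamsAligned) differ, so the two cannot be exchanged as raw bytes.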

But that's a bigger task. First, I want to CUDA-enable one more
module: AutoDiffScalar. I've been using a modified ceres::Jet class from
ceres-solver, but I don't want to bother Sameer Agarwal; their library
has nothing to do with CUDA, so I think AutoDiffScalar is a better
candidate for adding EIGEN_DEVICE_FUNC. I'll write host unit tests first,
possibly in a similar manner to Jet's:

as currently AutoDiffScalar is not well tested. Then do the same tests
on the GPU similarly to how Tensors are tested (run_and_compare_to_cuda

I'll do a pull request when I finish AutoDiffScalar and its tests.
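
As a rough idea of the kind of host test I have in mind, here is a
self-contained sketch. It does not use AutoDiffScalar or ceres::Jet; the
Dual struct, the SKETCH_DEVICE_FUNC macro, f() and approx_equal() are all
hypothetical names made up for illustration. The macro only mimics the
idea of marking functions as callable from both host and device when
compiled by nvcc, and expands to nothing on a plain host compiler. The
test compares the propagated derivative against the analytic one:

```cpp
#include <cassert>   // for the runtime checks in the usage note
#include <cmath>

// Hypothetical macro, in the spirit of a host/device function marker.
#ifdef __CUDACC__
#define SKETCH_DEVICE_FUNC __host__ __device__
#else
#define SKETCH_DEVICE_FUNC
#endif

// Hypothetical dual number: a value and its derivative with respect to
// a single input variable (forward-mode differentiation).
struct Dual {
    double val;
    double der;
};

SKETCH_DEVICE_FUNC inline Dual operator+(Dual a, Dual b) {
    return {a.val + b.val, a.der + b.der};
}

// Product rule: (ab)' = a'b + ab'.
SKETCH_DEVICE_FUNC inline Dual operator*(Dual a, Dual b) {
    return {a.val * b.val, a.der * b.val + a.val * b.der};
}

// Chain rule: sin(a)' = cos(a) * a'.
SKETCH_DEVICE_FUNC inline Dual sin(Dual a) {
    return {std::sin(a.val), std::cos(a.val) * a.der};
}

// Function under test, f(x) = x*x + sin(x), written once for any scalar.
template <typename T>
SKETCH_DEVICE_FUNC T f(T x) {
    using std::sin;  // plain doubles use std::sin, Dual is found via ADL
    return x * x + sin(x);
}

// Tolerance-based comparison for floating-point results.
inline bool approx_equal(double a, double b, double tol = 1e-12) {
    return std::fabs(a - b) <= tol;
}
```

Seeding the input with derivative 1, i.e. f(Dual{0.5, 1.0}), should give a
value matching f(0.5) and a derivative matching 2 * 0.5 + cos(0.5). The
same checks could then be repeated inside a kernel and compared against
the host result.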


Robert Lukierski
Dyson Robotics Lab
Imperial College London
