Re: [eigen] CUDA vectorization status


To: eigen@xxxxxxxxxxxxxxxxxxx
Subject: Re: [eigen] CUDA vectorization status
From: Robert Lukierski <r.lukierski12@xxxxxxxxxxxxxx>
Date: Thu, 13 Oct 2016 18:35:24 +0100

On 10/13/2016 06:17 PM, Benoit Steiner wrote:
> I use the CUDA packet primitives in the tensor module. Loading and
> writing floats in packets of 4 whenever possible helps speed things
> up. Processing 16-bit floats 2 by 2 also doubles the throughput on GPUs
> that fully support them (e.g. Tegra X1 and Pascal P100).
> 
> I haven't tried to vectorize the matrix code and don't foresee doing so
> any time soon, so you're welcome to give it a try and see how much this helps.

Thanks for the info. I've started reading some test code and I've
noticed EIGEN_USE_GPU there, so mystery solved.

I'll experiment with matrix vectorization using current CUDA/PacketMath.
I think the biggest problem may be the struct alignment/padding
differences between SSE/AVX and CUDA, as the dense structs are often
passed verbatim by memcpy-like mechanisms (kernel arguments,
cudaMemcpyToSymbol, host/device buffers allocated with the cudaMalloc
family, etc.).

But that's a bigger task. First, I want to CUDA-enable one more
module: AutoDiffScalar. I've been using a modified ceres::Jet class from
ceres-solver, but I don't want to bother Sameer Agarwal, since their
library has nothing to do with CUDA. I think AutoDiffScalar is a better
candidate for adding EIGEN_DEVICE_FUNC annotations. I'll write host unit
tests first, possibly in a similar manner to Jet's:

https://github.com/ceres-solver/ceres-solver/blob/master/internal/ceres/jet_test.cc

as AutoDiffScalar is currently not well tested. Then I'll run the same
tests on the GPU, similar to how Tensors are tested
(run_and_compare_to_cuda etc.).

I'll do a pull request when I finish AutoDiffScalar and its tests.

Thanks!
Robert.

-- 
Robert Lukierski
Dyson Robotics Lab
Imperial College London
r.lukierski12@xxxxxxxxxxxxxx
