Re: [eigen] CUDA vectorization status
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] CUDA vectorization status
- From: Robert Lukierski <r.lukierski12@xxxxxxxxxxxxxx>
- Date: Thu, 13 Oct 2016 18:35:24 +0100
On 10/13/2016 06:17 PM, Benoit Steiner wrote:
> I use the CUDA packet primitives in the tensor module. Loading and
> writing floats by packets of 4 whenever possible helps speed things
> up. Processing 16-bit floats 2 by 2 also doubles the throughput on GPUs
> that fully support them (e.g. Tegra X1 and Pascal P100).
> I haven't tried to vectorize the matrix code and don't foresee doing so
> any time soon, so you're welcome to give it a try and see how much this helps.
Thanks for the info. I've started reading some test code and I've
noticed EIGEN_USE_GPU there, so mystery solved.
I'll experiment with matrix vectorization using current CUDA/PacketMath.
I think the biggest problem may be the struct alignment/padding
differences between SSE/AVX and CUDA, as dense structs are often
passed verbatim by pure memcpy-like mechanisms (kernel arguments,
cudaMemcpyToSymbol, host/device buffers allocated with cudaMalloc, etc.).
But that's a bigger task. First, I want to CUDA-enable one more
module: AutoDiffScalar. I've been using a modified ceres::Jet class from
ceres-solver, but I don't want to bother Sameer Agarwal; their library
has nothing to do with CUDA, so I think AutoDiffScalar is a better
candidate for adding EIGEN_DEVICE_FUNCs. I'll write host unit tests
first, possibly in a similar manner to Jet's, as AutoDiffScalar is
currently not well tested. Then I'll run the same tests on the GPU,
similarly to how Tensors are tested (run_and_compare_to_cuda).
I'll do a pull request when I finish AutoDiffScalar and its tests.
Dyson Robotics Lab
Imperial College London