Re: [eigen] Using Eigen in CUDA kernels
- To: eigen <eigen@xxxxxxxxxxxxxxxxxxx>
- Subject: Re: [eigen] Using Eigen in CUDA kernels
- From: Gael Guennebaud <gael.guennebaud@xxxxxxxxx>
- Date: Tue, 12 Feb 2013 09:33:03 +0100
On Mon, Feb 11, 2013 at 2:10 PM, Benjamin Schindler wrote:
> Do you plan on merging this to master at some point?
Sure, but let's make sure we have something quite complete with proper
unit tests before asking for a merge.
> "assembly looks reasonable" - meaning, somewhat optimal, could be
> better? Do you have some specifics?
The problem is that the actual register allocation seems to take place
after the .ptx stage, so all I can tell is that on the examples I
tested only meaningful instructions were generated; I cannot tell
whether the register use is good or not.
> On 02/07/2013 07:21 PM, Gael Guennebaud wrote:
>> Hi list,
>> good news, NVCC (the CUDA compiler from NVidia) version 5 is now able
>> to parse and compile Eigen :) That means using Eigen in our cuda
>> kernels is "just" a matter of qualifying all our functions callable
>> from CUDA by __device__ when compiled with nvcc. As a proof of concept
>> I've created a fork there:
>> This port is far from complete, but I've been able to do stupid stuff like:
>> Matrix3f M(data1+tid*9);
>> Vector3f v = M.inverse() * Map<Vector3f>(data+tid*3);
>> output[tid] = v.cwiseAbs().maxCoeff();
>> where data1, data, and output are device buffers. The generated
>> assembly looked reasonable.
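For context, the quoted snippet would sit inside an ordinary CUDA kernel along these lines. This is only a sketch: the kernel name, the launch-bounds check, and the `n` parameter are illustrative and not from the original post.

```cuda
#include <Eigen/Dense>
using namespace Eigen;

// Hypothetical kernel wrapping the snippet above; each thread
// solves one independent 3x3 system taken from device buffers.
__global__ void solve_per_thread(const float* data1, float* data,
                                 float* output, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    Matrix3f M(data1 + tid * 9);      // copies 9 floats into registers/local mem
    Vector3f v = M.inverse() * Map<Vector3f>(data + tid * 3);
    output[tid] = v.cwiseAbs().maxCoeff();
}
```

Every Eigen member function reached here (the constructor from a pointer, inverse(), Map, cwiseAbs(), maxCoeff()) would need the __device__ qualifier when compiled with nvcc, which is exactly what the fork adds.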
>> So no more crappy home-made fixed-size matrix and vector classes!
>> This also means that evaluating very large *coefficient-wise
>> expressions* on CUDA should not be very difficult. We just need a
>> special DeviceMatrix storing its data in device memory, and specialize
>> its assignment operator to call a CUDA kernel templated with the
>> destination and source expression...
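The templated-assignment idea above could be sketched as follows. All names here (evalCoeffwise, the launch configuration, the coeff/coeffRef access pattern) are my own illustration of the design, not code from the fork.

```cuda
// One coefficient per thread: the kernel is templated on the
// destination and source expression types, so the whole expression
// tree is inlined into the kernel body by nvcc.
template <typename Dst, typename Src>
__global__ void evalCoeffwise(Dst dst, const Src src, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size)
        dst.coeffRef(i) = src.coeff(i);  // linear coefficient access
}

// DeviceMatrix::operator= would launch this kernel instead of
// running a CPU evaluation loop, e.g.:
//   evalCoeffwise<<<numBlocks, blockSize>>>(dstView, srcExpr, size);
```

The key point is that only coefficient-wise access is needed on the device, so arbitrarily large expressions work as long as the leaf storage lives in device memory.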
>> have fun,