[eigen] Broadcasting Slow on CPU?

Hi Eigen Folks,

Sorry about the two mails in quick succession.

I'm playing around with broadcasting in the tensor library, and it seems that while broadcasting works quite well on the GPU, on the CPU it's significantly slower than writing a for loop that processes the columns one by one. To give a concrete example, here is code for calculating "logsumexp" over each column of a 2-dimensional tensor:

---------------

int NUM_ROWS, NUM_COLS;
Eigen::TensorMap<Eigen::Tensor<float,2>> x;     // NUM_ROWS x NUM_COLS input
Eigen::TensorMap<Eigen::Tensor<float,1>> z, m;  // per-column result and per-column maximum
Eigen::array<int, 1> red_axis({0});             // reduce over the row dimension
Eigen::array<int, 2> m_shape({1, NUM_COLS});    // view m as a 1 x NUM_COLS tensor
Eigen::array<int, 2> bcast({NUM_ROWS, 1});      // replicate it down the rows

// Fast on GPU, slow on CPU: reduce, then broadcast the per-column max
m.device(my_device) = x.maximum(red_axis);
z.device(my_device) = (x - m.reshape(m_shape).broadcast(bcast)).exp().sum(red_axis);
z.device(my_device) = z.log() + m;

// Fast on CPU, slow on GPU: loop over the columns one by one
m.device(my_device) = x.maximum(red_axis);
std::vector<float> mvals = as_vector(m);  // as_vector: helper (not shown) that copies m into a std::vector
for (int b = 0; b < NUM_COLS; b++) {
  z.chip<0>(b).device(my_device) = (x.chip<1>(b) - mvals[b]).exp().sum();
  z.chip<0>(b).device(my_device) = z.chip<0>(b).log() + mvals[b];
}

---------------
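
(For reference, here is a minimal, self-contained version of the CPU setup the snippet above assumes; the buffer names, the fixed sizes, and the use of Eigen::DefaultDevice are just placeholders for illustration.)

---------------

#include <unsupported/Eigen/CXX11/Tensor>
#include <iostream>
#include <vector>

int main() {
  const int NUM_ROWS = 4, NUM_COLS = 3;

  // Raw buffers viewed by the TensorMaps (default Tensor layout is column-major).
  std::vector<float> x_data(NUM_ROWS * NUM_COLS, 1.0f);
  std::vector<float> m_data(NUM_COLS), z_data(NUM_COLS);

  Eigen::TensorMap<Eigen::Tensor<float,2>> x(x_data.data(), NUM_ROWS, NUM_COLS);
  Eigen::TensorMap<Eigen::Tensor<float,1>> m(m_data.data(), NUM_COLS);
  Eigen::TensorMap<Eigen::Tensor<float,1>> z(z_data.data(), NUM_COLS);

  Eigen::DefaultDevice my_device;  // single-threaded CPU device

  Eigen::array<int, 1> red_axis({0});           // reduce over rows
  Eigen::array<int, 2> m_shape({1, NUM_COLS});  // view m as 1 x NUM_COLS
  Eigen::array<int, 2> bcast({NUM_ROWS, 1});    // replicate it down the rows

  // logsumexp per column: z(j) = log(sum_i exp(x(i,j) - m(j))) + m(j)
  m.device(my_device) = x.maximum(red_axis);
  z.device(my_device) = (x - m.reshape(m_shape).broadcast(bcast)).exp().sum(red_axis);
  z.device(my_device) = z.log() + m;

  for (int j = 0; j < NUM_COLS; ++j) std::cout << z_data[j] << "\n";
  return 0;
}

---------------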

Is this known behavior? And if so, is there any way around it? Ideally I'd like to just use the first version because it's simpler, but the speed difference is significant enough that I've been forced to maintain separate code paths depending on whether I'm running on CPU or GPU.

Graham

