Hi,
it seems that what you're looking for is a means to merge multiple evaluation loops of the same size into a single one (the fact that they run on the GPU is not really important here). Actually, this need already shows up for things like:
a = vec.minCoeff();
b = vec.maxCoeff();
which currently requires two traversals of vec. I remember discussing this with Benoit S., and I don't think a general solution is implemented in the Tensor module yet.
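For the record, a manually fused version of these two reductions would look like this (just a sketch, assuming vec is, e.g., a non-empty VectorXd and that <algorithm> is included):
// One shared traversal updating both reductions at once:
double a = vec[0], b = vec[0];
for (Eigen::Index i = 1; i < vec.size(); ++i) {
  a = std::min(a, vec[i]);
  b = std::max(b, vec[i]);
}
The goal would be to get this kind of fusion automatically instead of writing it by hand.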
Technically, I don't think that's very difficult though. The main difficulty is perhaps on the API side. We could imagine something like:
auto E1 = (R1.deferred() = expr1);
auto E2 = (R2.deferred() = expr2);
...
merged_eval(E1, E2, ...);
that would essentially generate:
(parallel/GPU/whatever) for loop {
  R1[i] = expr1.coeff(i);
  R2[i] = expr2.coeff(i);
  ...
}
In Eigen/Core, "R.deferred().operator=(expr)" would return an Eigen::internal::Assignment expression (without calling run) that would be merged by the merged_eval function.
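To make the idea more concrete, here is a rough sketch of what merged_eval could look like using variadic templates. All names here (DeferredAssignment, defer, merged_eval) are hypothetical and don't exist in Eigen today; a real implementation would go through the evaluator machinery rather than raw coeff()/coeffRef(), and would have to handle vectorization, etc.:
#include <Eigen/Core>

// Hypothetical: holds a pending "dst = src" assignment without executing it.
template<typename Dst, typename Src>
struct DeferredAssignment {
  Dst& dst;
  const Src& src;
  void step(Eigen::Index i) { dst.coeffRef(i) = src.coeff(i); }
};

template<typename Dst, typename Src>
DeferredAssignment<Dst, Src> defer(Dst& dst, const Src& src) { return {dst, src}; }

// One shared loop performing every pending assignment per index (C++17 fold).
template<typename... As>
void merged_eval(Eigen::Index size, As... as) {
  for (Eigen::Index i = 0; i < size; ++i)
    (as.step(i), ...);
}

// Usage: merged_eval(R1.size(), defer(R1, expr1), defer(R2, expr2));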
gael