[eigen] Implementation of a TensorMultiMap

Hi,

I would like to implement a "TensorMultiMap" that takes as input an array of pointers to multiple Tensors, so that one can perform operations on the combined array of Tensors instead of having to allocate new memory and perform a series of `concatenate` operations beforehand.

The problem I am running into is that I have two arrays of Tensors (of dynamic length, but equal in size), e.g., `std::vector<Tensor> tensor1` and `std::vector<Tensor> tensor2`, where I would like to efficiently multiply each pair of corresponding Tensors and sum the results into a single output Tensor. I can accomplish this easily enough using a for loop, but because I am not able to use `auto`, the expression must be evaluated at each iteration of the loop. With CUDA, each evaluation results in a new kernel launch, which drastically impacts performance.
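To make the desired computation concrete, here is a minimal sketch in plain C++ (deliberately independent of Eigen) of the kind of single fused pass a "TensorMultiMap" would enable. It assumes the multiplication is element-wise and that all Tensors hold the same number of `float` coefficients. The names `MultiMapView` and `fusedMultiplySum` are hypothetical and are not part of Eigen's API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Eigen): a non-owning view over several
// equally sized buffers, e.g. the data() pointers of the Tensors in tensor1.
struct MultiMapView {
    std::vector<const float*> buffers;  // one pointer per Tensor
    std::size_t size;                   // number of coefficients in each buffer
};

// Single fused pass: out[i] = sum over k of a.buffers[k][i] * b.buffers[k][i].
// Assumes element-wise multiplication. On a GPU, the outer loop would map to
// the threads of one kernel, rather than one kernel launch per Tensor pair.
void fusedMultiplySum(const MultiMapView& a, const MultiMapView& b, float* out) {
    assert(a.buffers.size() == b.buffers.size());
    assert(a.size == b.size);
    for (std::size_t i = 0; i < a.size; ++i) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < a.buffers.size(); ++k) {
            acc += a.buffers[k][i] * b.buffers[k][i];
        }
        out[i] = acc;
    }
}
```

The point of the sketch is only that the loop over Tensor pairs moves inside the per-coefficient computation, so no intermediate result has to be materialized between pairs.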

I have experimented with using a recursive function, but unfortunately this does not work with CUDA 11 (the code compiles, but the stream never syncs).

Is a "TensorMultiMap" possible? If so, how best could it be implemented?

Best,

Douglas McCloskey, PhD

Group Leader, AutoFlow
Liaison, Information Services/Computational Biology

DTU Biosustain

Technical University of Denmark

Novo Nordisk Foundation Center for Biosustainability

Kemitorvet

Building 220, Room 218

2800 Kgs. Lyngby

domccl@xxxxxxxxxxxxxxxxx