Re: [eigen] Intermediate Packet Storage


On 12/18/19 1:08 PM, Christoph Hertzberg wrote:
On 18/12/2019 12.20, Joel Holdsworth wrote:
const Packet<uint32_t> E = a * b; // Some expensive calculation
const Packet<uint32_t> x = E + c;
const Packet<uint32_t> y = E + d;

This results in the value of E being stored onto the stack, and then reloaded twice to calculate x and y.

But of course, I just want E to stay in register.

I'm pretty sure that as long as `E` fits into a (set of) register(s) no reasonable compiler will store this on the stack, unless it actually runs out of register space, see e.g.:

I don't know if you think ARM GCC 5.4 counts as reasonable, but you can see the problem occurring here:

As I mentioned, my project requires GCC 5. I would be interested to know whether newer versions of ARM GCC have the same issue, but Eigen seems to have some problem with those versions, because godbolt is giving me errors.

Interestingly, x86 GCC 5.4 seems to do the right thing.

Even if the small intermediate was stored on the stack, I assume the overhead should be negligible.

It's all just cycles that I'd like to eliminate.

My algorithm has enough cross-linking in the overall evaluation graph that the loads and stores account for ~30% of all my instructions, once you include the extra instructions needed to calculate the stack-pointer addresses.

The problem is different if you want to apply your expressions at once to a set of large arrays. Something like the following will very likely require `E` to be stored or evaluated twice (unless the compiler is really smart at detecting duplicated code or a load after a store).

     ArrayXi a,b,c,d; // input from somewhere

     ArrayXi E = a*b; // some expensive operations
     ArrayXi x = E+c, y=E+d;
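
To make the contrast concrete, here is a minimal sketch of the fused alternative, written as a plain loop over `std::vector<int>` rather than with Eigen's `ArrayXi`. The function name `fused_xy` and the use of `std::vector` are assumptions made for illustration; nothing here is Eigen API. Each element of `E` lives only in a local variable, so no array-sized temporary has to be written to memory and read back.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Eigen: computes x = a*b + c and
// y = a*b + d in a single pass over the inputs.
void fused_xy(const std::vector<int>& a, const std::vector<int>& b,
              const std::vector<int>& c, const std::vector<int>& d,
              std::vector<int>& x, std::vector<int>& y) {
    const std::size_t n = a.size();
    // All inputs are assumed to have the same length.
    assert(b.size() == n && c.size() == n && d.size() == n);
    x.resize(n);
    y.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        const int e = a[i] * b[i];  // shared intermediate, computed once per element
        x[i] = e + c[i];
        y[i] = e + d[i];
    }
}
```

Whether the compiler vectorises such a loop as well as Eigen's own kernels depends on the target and the flags, so this is a reference point rather than a drop-in replacement.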

For solving that problem you may be interested in:

That certainly seems like a related concept. But is there much prospect of this getting implemented any time soon?

Here are some examples of things I would like to do in a single evaluation pass:

auto alpha = ...
blend = alpha * x + (1.0f - alpha) * y;

limit = (x > 0.f).select(log(x), 0);

auto condition = X(...) && Y(...) && Z(...);
result1 =, b);
result1 =, d);

At the moment, there is a choice between having an ArrayX intermediate, which will incur a lot of RAM bandwidth and cache eviction, or calculating the intermediate value twice.
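
For reference, this is roughly what a single evaluation pass looks like when the first two examples are written out by hand. It is only a sketch in plain C++: `blend_and_limit` and the `alpha_at` callable are hypothetical names standing in for however `alpha` is really computed, and `std::vector<float>` stands in for Eigen arrays.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Eigen. alpha_at(i) stands in for the
// (possibly expensive) expression that produces alpha for element i.
template <typename AlphaFn>
void blend_and_limit(AlphaFn alpha_at,
                     const std::vector<float>& x,
                     const std::vector<float>& y,
                     std::vector<float>& blend,
                     std::vector<float>& limit) {
    const std::size_t n = x.size();
    assert(y.size() == n);  // x and y are assumed to have the same length
    blend.resize(n);
    limit.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float a = alpha_at(i);  // evaluated once, used in both blend terms
        blend[i] = a * x[i] + (1.0f - a) * y[i];
        // Scalar counterpart of (x > 0.f).select(log(x), 0)
        limit[i] = x[i] > 0.0f ? std::log(x[i]) : 0.0f;
    }
}
```

The intermediate `a` never leaves the loop body, which is the behaviour I would like to get from an Eigen expression without giving up its vectorised kernels.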

Thanks for your advice.

