Re: [eigen] Intermediate Packet Storage
On 12/18/19 1:08 PM, Christoph Hertzberg wrote:
On 18/12/2019 12.20, Joel Holdsworth wrote:
const Packet<uint32_t> E = a * b; // Some expensive calculation
const Packet<uint32_t> x = E + c;
const Packet<uint32_t> y = E + d;
This results in the value of E being stored onto the stack, and then
reloaded twice to calculate x and y.
But of course, I just want E to stay in register.
I'm pretty sure that as long as `E` fits into a (set of) register(s) no
reasonable compiler will store this on the stack, unless it actually
runs out of register space, see e.g.:
I don't know if you think ARM GCC 5.4 counts as reasonable, but you can
see the problem occurring here: https://godbolt.org/z/fpvqat
As I mentioned, my project requires GCC 5. I would be interested to know
if newer versions of ARM GCC have the same issue - but there seems to be
some issue with Eigen on newer versions, because godbolt is giving me
Interestingly, x86 GCC 5.4 seems to do the right thing.
Even if the small intermediate was stored on the stack, I assume the
overhead should be negligible.
It's all just cycles that I'd like to eliminate.
My algorithm has enough cross-linking in the overall evaluation graph,
that the loads and stores account for ~30% of all my instructions when
you include the extra instructions needed to calculate the stack-pointer
The problem is different if you want to apply your expressions at once
to a set of large arrays. Something like the following will very
likely require `E` to be stored or evaluated twice (unless the compiler
is really smart at detecting duplicated code or load-after-store).
ArrayXi a,b,c,d; // input from somewhere
ArrayXi E = a*b; // some expensive operations
ArrayXi x = E+c, y=E+d;
For solving that problem you may be interested in:
That certainly seems like a related concept. But is there much prospect
of this getting implemented any time soon?
Here are some examples of things I would like to do in a single
auto alpha = ...
blend = alpha * x + (1.0f - alpha) * y;
limit = (x > 0.f).select(log(x), 0);
auto condition = X(...) && Y(...) && Z(...);
result1 = condition.select(a, b);
result2 = condition.select(c, d);
At the moment, there is a choice between having an ArrayX intermediate,
which will incur a lot of RAM bandwidth and cache eviction, or
calculating the intermediate value twice.
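To make the trade-off concrete, here is a minimal sketch of the hand-fused alternative in plain C++, without Eigen. The helper name `fused_two_outputs` and the use of `std::vector` are assumptions made for illustration only; this is not an Eigen API. It computes the shared intermediate once per element and keeps it in a scalar local, so no full-size temporary array is written to memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Eigen): computes x = a*b + c and
// y = a*b + d in a single pass over the data. The shared product is
// evaluated once per element and held in a local variable, so there is
// no array-sized intermediate.
inline void fused_two_outputs(const std::vector<std::uint32_t>& a,
                              const std::vector<std::uint32_t>& b,
                              const std::vector<std::uint32_t>& c,
                              const std::vector<std::uint32_t>& d,
                              std::vector<std::uint32_t>& x,
                              std::vector<std::uint32_t>& y)
{
    const std::size_t n = a.size();
    assert(b.size() == n && c.size() == n && d.size() == n);
    x.resize(n);
    y.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint32_t e = a[i] * b[i]; // shared intermediate
        x[i] = e + c[i];
        y[i] = e + d[i];
    }
}
```

Whether the compiler vectorises this loop and keeps `e` in a register depends on the target and compiler version, so the generated assembly would still need checking.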
Thanks for your advice.