Re: [eigen] Intermediate Packet Storage


On 18/12/2019 14.45, Joel Holdsworth wrote:
On 12/18/19 1:08 PM, Christoph Hertzberg wrote:
On 18/12/2019 12.20, Joel Holdsworth wrote:
const Packet<uint32_t> E = a * b; // Some expensive calculation
const Packet<uint32_t> x = E + c;
const Packet<uint32_t> y = E + d;

This results in the value of E being stored onto the stack, and then reloaded twice to calculate x and y.

But of course, I just want E to stay in a register.

I'm pretty sure that as long as `E` fits into a (set of) register(s) no reasonable compiler will store this on the stack, unless it actually runs out of register space, see e.g.:

I don't know if you think ARM GCC 5.4 counts as reasonable, but you can see the problem occurring here:

Hm, interesting/unfortunate ...

GCC seems to store a value and then immediately reload what it just stored (even after removing the ASM_COMMENT line, or after replacing the calculations with much simpler ones).

I'm no ARM expert, but I assume {d{2k}-d{2k+1}} is an alias for `q{k}`

As I mentioned, my project requires GCC 5. I would be interested to know whether newer versions of ARM GCC have the same problem, but Eigen seems to have some issue with newer versions, because godbolt is giving me errors.

Yes, I have no idea what causes this -- maybe some ARM expert can chip in.

Interestingly, x86 GCC 5.4 seems to do the right thing.

Even if the small intermediate was stored on the stack, I assume the overhead should be negligible.

It's all just cycles that I'd like to eliminate.

My algorithm has enough cross-linking in the overall evaluation graph that the loads and stores account for ~30% of all my instructions, once you include the extra instructions needed to calculate the stack-pointer addresses.

Ok, fair enough.

The problem is different if you want to apply your expressions to a set of large arrays at once. Something like the following will very likely require `E` to be either stored or evaluated twice (unless the compiler is really smart at detecting duplicated code or a load after a store).

     ArrayXi a,b,c,d; // input from somewhere

     ArrayXi E = a*b; // some expensive operations
     ArrayXi x = E+c, y=E+d;
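
One way to avoid both the stored temporary and the duplicated work is to fuse the two expressions by hand. The following is only a minimal sketch in plain C++ rather than Eigen: `fused_xy` is a hypothetical helper name, not part of any library, and it assumes all inputs have the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an Eigen API): computes x = a*b + c and
// y = a*b + d in one pass, so the shared product only lives in a local.
void fused_xy(const std::vector<int>& a, const std::vector<int>& b,
              const std::vector<int>& c, const std::vector<int>& d,
              std::vector<int>& x, std::vector<int>& y)
{
    const std::size_t n = a.size();
    assert(b.size() == n && c.size() == n && d.size() == n);
    x.resize(n);
    y.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        const int e = a[i] * b[i]; // stands in for the expensive part
        x[i] = e + c[i];
        y[i] = e + d[i];
    }
}
```

The cost of this approach is that the loop has to be written by hand for every combination of outputs, which is exactly what expression templates are meant to avoid.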

For solving that problem you may be interested in:

That certainly seems like a related concept. But is there much prospect of this getting implemented any time soon?

No promises when (or if) this will be finished. But it seems this would not directly fix your immediate issue anyway.

Here are some examples of things I would like to do in a single evaluation pass:

auto alpha = ...
blend = alpha * x + (1.0f - alpha) * y;

limit = (x > 0.f).select(log(x), 0);

auto condition = X(...) && Y(...) && Z(...);
result1 =, b);
result1 =, d);

At the moment, there is a choice between having an ArrayX intermediate, which will incur a lot of RAM bandwidth and cache eviction, or calculating the intermediate value twice.
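
A possible middle ground, not proposed in this thread, is to evaluate in small fixed-size blocks, so that the intermediate is a short stack buffer that stays in cache rather than a full-length array. The sketch below is plain C++ and not Eigen code: `blocked_blend` and `kBlock` are hypothetical names, the block size of 256 is an arbitrary assumption, and `alpha = a*b` merely stands in for whatever the expensive intermediate really is.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kBlock = 256; // assumed block size, tune per target

// Hypothetical helper: blend = alpha*x + (1 - alpha)*y, where alpha = a*b
// is a placeholder for an expensive intermediate. alpha is materialised
// only one block at a time, in a small stack buffer.
void blocked_blend(const std::vector<float>& a, const std::vector<float>& b,
                   const std::vector<float>& x, const std::vector<float>& y,
                   std::vector<float>& blend)
{
    const std::size_t n = a.size();
    assert(b.size() == n && x.size() == n && y.size() == n);
    blend.resize(n);
    float alpha[kBlock];
    for (std::size_t start = 0; start < n; start += kBlock) {
        const std::size_t len = std::min(kBlock, n - start);
        // Pass 1 over the block: evaluate the intermediate once.
        for (std::size_t i = 0; i < len; ++i)
            alpha[i] = a[start + i] * b[start + i];
        // Pass 2 over the block: reuse it while it is still cache-resident.
        for (std::size_t i = 0; i < len; ++i)
            blend[start + i] = alpha[i] * x[start + i]
                             + (1.0f - alpha[i]) * y[start + i];
    }
}
```

This still stores and reloads the intermediate, so it does not remove the load/store instructions themselves; it only keeps that traffic out of main memory.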

Are you able to implement the above (or something similar) with pure intrinsics, in a way that GCC 5 properly inlines without storing intermediates? If that is not possible, I see no way at all to do this with that compiler. If it is possible, I see some hope in implementing the previously mentioned Meta-Packets.
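
As a starting point for that experiment, a pure-intrinsics version of the original three-line example might look like the sketch below. `mul_add_pair4` is a hypothetical name. The NEON path uses `vld1q_u32`, `vmulq_u32`, `vaddq_u32` and `vst1q_u32` from `<arm_neon.h>`; the scalar fallback is only there so the code also builds on other targets. Whether GCC 5 actually keeps `e` in a register here would have to be checked in the generated assembly.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: processes exactly four uint32_t lanes.
// x = a*b + c and y = a*b + d, with the shared product e held in a local.
void mul_add_pair4(const std::uint32_t* a, const std::uint32_t* b,
                   const std::uint32_t* c, const std::uint32_t* d,
                   std::uint32_t* x, std::uint32_t* y)
{
    assert(a && b && c && d && x && y);
#if defined(__ARM_NEON)
    const uint32x4_t e = vmulq_u32(vld1q_u32(a), vld1q_u32(b));
    vst1q_u32(x, vaddq_u32(e, vld1q_u32(c)));
    vst1q_u32(y, vaddq_u32(e, vld1q_u32(d)));
#else
    // Portable fallback so the sketch also builds on non-NEON targets.
    for (int i = 0; i < 4; ++i) {
        const std::uint32_t e = a[i] * b[i];
        x[i] = e + c[i];
        y[i] = e + d[i];
    }
#endif
}
```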


Thanks for your advice.


 Dr.-Ing. Christoph Hertzberg

