Re: [eigen] Performance difference icc <-> gcc, EIGEN_STRONG

Hi Michael,

Can you provide some details on the compiler flags you used? If you can give me the commands required to reproduce your results, I will submit a bug report to ICC (I work for Intel).

I suspect that ICC is using a different (more conservative) inlining heuristic than GCC. Failure to inline probably explains the 13x slowdown. However, that doesn't mean that GCC is right and ICC is wrong, as there is no perfect inlining heuristic (see e.g. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194 for a contrary position on GCC heuristics). Nonetheless, it may be useful to make the ICC inlining heuristic aware of Eigen use cases, since Eigen is rather popular.
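One rough way to check the inlining hypothesis is to time the same tiny kernel once behind a real call and once forced inline. The sketch below is not taken from Eigen or mbsolve: the names DEMO_NOINLINE, DEMO_FORCEINLINE, axpy_call, axpy_inl, and compare_inlining are made up for illustration, and it assumes a compiler that accepts either the GNU attributes or the MSVC keywords.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Portable spellings of "never inline" and "always inline".
// DEMO_NOINLINE / DEMO_FORCEINLINE are hypothetical names for this sketch.
#if defined(_MSC_VER)
#  define DEMO_NOINLINE    __declspec(noinline)
#  define DEMO_FORCEINLINE __forceinline
#else  // assumes GNU-style attributes (GCC, Clang, ICC on Linux)
#  define DEMO_NOINLINE    __attribute__((noinline))
#  define DEMO_FORCEINLINE inline __attribute__((always_inline))
#endif

// The same kernel twice: once behind a real call, once forced inline.
DEMO_NOINLINE double axpy_call(double a, double x, double y) { return a * x + y; }
DEMO_FORCEINLINE double axpy_inl(double a, double x, double y) { return a * x + y; }

struct Timing {
    double sum_call;  // result accumulated through the non-inlined kernel
    double sum_inl;   // result accumulated through the inlined kernel
    double sec_call;  // wall time of the non-inlined loop, in seconds
    double sec_inl;   // wall time of the inlined loop, in seconds
};

// Runs both variants over n elements and reports results and timings.
Timing compare_inlining(std::size_t n) {
    using steady = std::chrono::steady_clock;
    std::vector<double> x(n, 1.5);
    Timing t{};  // all fields start at zero

    const auto t0 = steady::now();
    for (std::size_t i = 0; i < n; ++i) t.sum_call = axpy_call(2.0, x[i], t.sum_call);
    const auto t1 = steady::now();
    for (std::size_t i = 0; i < n; ++i) t.sum_inl = axpy_inl(2.0, x[i], t.sum_inl);
    const auto t2 = steady::now();

    t.sec_call = std::chrono::duration<double>(t1 - t0).count();
    t.sec_inl  = std::chrono::duration<double>(t2 - t1).count();
    return t;
}
```

Both loops must produce the same sum; only the timings are expected to differ, and by how much depends on the compiler, the flags, and the kernel. A toy like this only shows the direction of the effect and says nothing definitive about the 13x seen with Eigen.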

Fortunately, it seems that there is already a good solution: using EIGEN_STRONG_INLINE, which evidently causes ICC to inline much more aggressively.
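For reference, the override in the original report follows a common pattern: user code defines the macro before the library header is seen, and the header supplies its own default only when none exists. Here is a minimal sketch of that pattern, using a hypothetical DEMO_STRONG_INLINE in place of the real macro. Whether a given Eigen release guards its definition with #ifndef in this way is an assumption to verify against its Macros.h; dot3 and demo_dot are illustrative names only.

```cpp
// Step 1 (user code): choose the expansion before the library header is seen.
// DEMO_STRONG_INLINE is a hypothetical stand-in for EIGEN_STRONG_INLINE.
#define DEMO_STRONG_INLINE inline

// Step 2 (library header, assumed shape): supply a default only if the
// user has not already defined the macro.
#ifndef DEMO_STRONG_INLINE
#  if defined(_MSC_VER) || defined(__INTEL_COMPILER)
#    define DEMO_STRONG_INLINE __forceinline
#  else
#    define DEMO_STRONG_INLINE inline
#  endif
#endif

// Step 3 (library code): small hot functions are declared through the macro,
// so the user's choice from step 1 applies to all of them at once.
DEMO_STRONG_INLINE double dot3(const double* a, const double* b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Small driver so the sketch can be exercised.
double demo_dot() {
    const double a[3] = {1.0, 2.0, 3.0};
    const double b[3] = {4.0, 5.0, 6.0};
    return dot3(a, b);  // 1*4 + 2*5 + 3*6
}
```

With the real library, the equivalent would presumably be to place the #define before the first Eigen include, or to pass it on the command line with -DEIGEN_STRONG_INLINE=inline, so that every translation unit sees the same definition.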

Best,

Jeff

PS In the unlikely event that EIGEN_STRONG_INLINE isn't sufficient, you may find the following ICC options useful.

$ icpc -help inline

Inlining
--------

-inline-level=<n>
          control inline expansion:
            n=0  disable inlining
            n=1  inline functions declared with __inline, and perform C++
                 inlining
            n=2  inline any function, at the compiler's discretion
-f[no-]inline
          inline functions declared with __inline, and perform C++ inlining
-f[no-]inline-functions
          inline any function at the compiler's discretion
-finline-limit=<n>
          set maximum number of statements a function can have and still be
          considered for inlining
-fgnu89-inline
          use C89 semantics for "inline" functions when in C99 mode
-inline-min-size=<n>
          set size limit for inlining small routines
-no-inline-min-size
          no size limit for inlining small routines
-inline-max-size=<n>
          set size limit for inlining large routines
-no-inline-max-size
          no size limit for inlining large routines
-inline-max-total-size=<n>
          maximum increase in size for inline function expansion
-no-inline-max-total-size
          no size limit for inline function expansion
-inline-max-per-routine=<n>
          maximum number of inline instances in any function
-no-inline-max-per-routine
          no maximum number of inline instances in any function
-inline-max-per-compile=<n>
          maximum number of inline instances in the current compilation
-no-inline-max-per-compile
          no maximum number of inline instances in the current compilation
-inline-factor=<n>
          set inlining upper limits by n percentage
-no-inline-factor
          do not set inlining upper limits
-inline-forceinline
          treat inline routines as forceinline
-inline-calloc
          directs the compiler to inline calloc() calls as malloc()/memset()
-inline-min-caller-growth=<n>
          set lower limit on caller growth due to inlining a single routine
-no-inline-min-caller-growth
          no lower limit on caller growth due to inlining a single routine

On Wed, Mar 13, 2019 at 11:34 AM Michael Riesch <michael.riesch@xxxxxx> wrote:

Hello all,

Thank you very much for your work on Eigen. We found it very useful for
our simulation software mbsolve [1] (BTW maybe you would like to add it
to the projects list that uses the Eigen library).

The code I am working on at the moment consists mostly of dense
matrix-matrix and matrix-vector multiplications. I compiled the code
with both Intel compiler 19 and gcc 6.3.0 and found that there is a
strange performance difference. Unless I define

#define EIGEN_STRONG_INLINE inline

the binary compiled by icc is ~13x slower. The gcc binary performance
remains the same, as inline seems to be the standard setting of this
macro for gcc.

Why can this behavior occur? Or, alternatively, which possible
anti-pattern could be the cause of this performance difference?

Any hints are welcome. If you need more information, please let me know.

Thanks in advance and best regards,
Michael

[1] https://github.com/mriesch-tum/mbsolve

Jeff Hammond
jeff.science@xxxxxxxxx
http://jeffhammond.github.io/