Re: [eigen] Slow matrix-matrix multiply

Hi Guys,

Here are some more numbers from my MacBook (i7, 2.3 GHz). The machine was not completely quiet, but the numbers all show a steady trend, so I think they can be trusted.

First, some explanation. Old refers to commit 222ca20 in the Ceres tree. New refers to HEAD. We made a number of changes between the two versions. Some of them concern the way we use Eigen, manage memory, etc., and the others have to do with optionally using new BLAS routines instead of Eigen. The suffix eigen/blas indicates whether Eigen or our custom BLAS routines are used for the small-block gemm and gemv operations in the Schur eliminator.
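
For anyone who has not looked at the code, here is a minimal sketch of what a hand-written small-block gemm can look like. This is not the code in the Ceres tree: the names SmallGemmAdd and kDynamic are made up for illustration, and row-major storage is an assumption. Each dimension is either fixed at compile time or, when the sentinel is passed, taken from a run-time argument.

```cpp
#include <cassert>

// Hypothetical sentinel meaning "size only known at run time".
constexpr int kDynamic = -1;

// Computes C += A * B for small row-major blocks.
// A is rows_a x cols_a, B is cols_a x cols_b, C is rows_a x cols_b.
// Each template argument is a compile-time size or kDynamic; with kDynamic
// the matching run-time argument is used instead.
template <int kRowsA, int kColsA, int kColsB>
inline void SmallGemmAdd(const double* A, int rows_a, int cols_a,
                         const double* B, int cols_b, double* C) {
  // A fixed template size must agree with the run-time size passed in.
  assert(kRowsA == kDynamic || kRowsA == rows_a);
  assert(kColsA == kDynamic || kColsA == cols_a);
  assert(kColsB == kDynamic || kColsB == cols_b);

  // With fixed sizes these are compile-time constants, so the compiler is
  // free to unroll the loops below.
  const int m = (kRowsA != kDynamic) ? kRowsA : rows_a;
  const int k = (kColsA != kDynamic) ? kColsA : cols_a;
  const int n = (kColsB != kDynamic) ? kColsB : cols_b;

  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      double sum = 0.0;
      for (int p = 0; p < k; ++p) {
        sum += A[i * k + p] * B[p * n + j];
      }
      C[i * n + j] += sum;
    }
  }
}
```

A fully static call would be SmallGemmAdd<2, 3, 2>(A, 2, 3, B, 2, C), while SmallGemmAdd<2, kDynamic, 2>(A, 2, 3, B, 2, C) leaves the inner dimension to run time.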

I tested both Clang 4.2 and GCC 4.2.1 with problems from the UW BAL dataset. I am only reporting the time spent in the Ceres SPARSE_SCHUR linear solver.

The first thing to note is that for both compilers there is a substantial improvement in performance from old-eigen to new-eigen, roughly 1.3x to 2.1x on the problems below. Clang seems to be generally a bit worse than GCC.

There does not seem to be much difference between Eigen and custom BLAS with GCC, except on two problems. With Clang, however (despite the improved inlining flags), performance improves pretty consistently with the custom routines.

Problem 1. 16 cameras 22106 points

        old-eigen  new-eigen  new-blas
gcc     2.1        1.0        1.0
clang   2.1        1.0        1.0

Problem 2. 49 cameras 7776 points

        old-eigen  new-eigen  new-blas
gcc     5.0        2.6        2.6
clang   5.0        2.6        2.5

Problem 3. 245 cameras 198739 points

        old-eigen  new-eigen  new-blas
gcc     47         31         31.5
clang   50.5       32.3       30.2

Problem 4. 257 cameras 65132 points

        old-eigen  new-eigen  new-blas
gcc     15         8.2        8.3
clang   15         8.3        7.6

Problem 5. 356 cameras 226730 points

        old-eigen  new-eigen  new-blas
gcc     54         36         36
clang   56         37         34

Problem 6. 744 cameras 543562 points

        old-eigen  new-eigen  new-blas
gcc     199        155        151
clang   210        156        145

Problem 7. 1031 cameras 110968 points

        old-eigen  new-eigen  new-blas
gcc     57         42         43
clang   57         42         40

The thing which is common to all the problems above is that the matrices in question are all statically sized. Another problem we are interested in involves semi-statically sized matrices, and there the performance improvements are much more dramatic.

Problem 8. 8 cameras, 1 shared calibration, 2190 points

        old-eigen  new-eigen  new-blas
gcc     0.17       0.15       0.04
clang   0.16       0.13       0.03

In summary, for fixed-size matrices Clang and Eigen have some talking to do. For dynamic/semi-static matrices it seems (based on one example) that the custom routines beat Eigen with both GCC and Clang.
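
If you want to turn the tables above into ratios, a small helper like the following is enough. This is only a convenience sketch; the name Speedup is made up for illustration and is not part of Ceres or Eigen.

```cpp
#include <cassert>

// Ratio of two timings for the same problem. A value above 1 means the
// second configuration is faster than the first.
inline double Speedup(double before_time, double after_time) {
  assert(before_time > 0.0 && after_time > 0.0);
  return before_time / after_time;
}
```

For instance, Speedup(0.15, 0.04) on the GCC row of problem 8 (new-eigen versus new-blas) comes out at about 3.75, while Speedup(156, 145) on the Clang row of problem 6 is only about 1.08.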

Sameer

On Wed, Apr 3, 2013 at 9:10 AM, Sameer Agarwal <sameeragarwal@xxxxxxxxxx> wrote:

Gael,

I need to run some experiments on my laptop and desktop, and I will post what I think are statistically meaningful numbers (with multiple problems), but it's going to take me a bit of time.

Sameer

On Wed, Apr 3, 2013 at 1:19 AM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:

I also have difficulty observing significant differences with
Apple's default compiler:

-- default - clang - macbookpro --

Time (in seconds):
Preprocessor 0.043

Residual Evaluations 0.076
Jacobian Evaluations 0.866
Linear Solver 0.740
Minimizer 1.816

Postprocessor 0.002
Total 1.906

-- CERES_NO_CUSTOM_BLAS - clang-inline-threshold - macbookpro --

Time (in seconds):
Preprocessor 0.043

Residual Evaluations 0.070
Jacobian Evaluations 0.859
Linear Solver 0.779
Minimizer 1.837

Postprocessor 0.002
Total 1.926

-- CERES_NO_CUSTOM_BLAS - clang - macbookpro --

Time (in seconds):
Preprocessor 0.043

Residual Evaluations 0.075
Jacobian Evaluations 0.863
Linear Solver 0.896
Minimizer 1.970

Postprocessor 0.002
Total 2.060

On Wed, Apr 3, 2013 at 9:42 AM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:
> still cannot reproduce with gcc:
>
> -- default - gcc47 - Core2 Q9400 @2.66GHz --
>
> Time (in seconds):
> Preprocessor 0.093
>
> Residual Evaluations 0.117
> Jacobian Evaluations 1.067
> Linear Solver 0.809
> Minimizer 2.237
>
> Postprocessor 0.005
> Total 2.371
>
> -- CERES_NO_CUSTOM_BLAS - gcc47 - Core2 Q9400 @2.66GHz --
>
> Time (in seconds):
> Preprocessor 0.089
>
> Residual Evaluations 0.108
> Jacobian Evaluations 1.054
> Linear Solver 0.803
> Minimizer 2.206
>
> Postprocessor 0.005
> Total 2.335
>
>
> -- default - gcc47 - Xeon X5570 @2.93GHz --
>
> Time (in seconds):
> Preprocessor 0.067
>
> Residual Evaluations 0.085
> Jacobian Evaluations 0.720
> Linear Solver 0.600
> Minimizer 1.557
>
> Postprocessor 0.001
> Total 1.645
>
> -- CERES_NO_CUSTOM_BLAS - gcc47 - Xeon X5570 @2.93GHz --
>
> Time (in seconds):
> Preprocessor 0.067
>
> Residual Evaluations 0.085
> Jacobian Evaluations 0.734
> Linear Solver 0.599
> Minimizer 1.570
>
> Postprocessor 0.001
> Total 1.658
>
> gael
>
> On Wed, Apr 3, 2013 at 5:58 AM, Sameer Agarwal <sameeragarwal@xxxxxxxxxx> wrote:
>> In case there is still interest, the change has been merged into the master
>> branch.
>> Sameer
>>
>>
>>
>>
>> On Tue, Apr 2, 2013 at 12:25 PM, Sameer Agarwal <sameeragarwal@xxxxxxxxxx>
>> wrote:
>>>
>>> On Keir's suggestion, I have updated this CL to optionally compile Eigen
>>> based routines in and out.
>>>
>>> Passing -DCUSTOM_BLAS=ON/OFF to cmake switches between custom loops and
>>> Eigen inside blas.h.
>>>
>>> Sameer
>>>
>>>
>>>
>>> On Tue, Apr 2, 2013 at 11:42 AM, Sameer Agarwal <sameeragarwal@xxxxxxxxxx>
>>> wrote:
>>>>
>>>> Here is the gerrit CL that is used for generating these numbers
>>>>
>>>> https://ceres-solver-review.googlesource.com/#/c/2870/
>>>>
>>>> Sameer
>>>>
>>>>
>>>>
>>>> On Tue, Apr 2, 2013 at 11:34 AM, Sameer Agarwal
>>>> <sameeragarwal@xxxxxxxxxx> wrote:
>>>>>
>>>>> Gael and Christoph,
>>>>>
>>>>> Thank you for looking into this.
>>>>>
>>>>> Yes, adding -mllvm -inline-threshold=600 makes the timing of Eigen
>>>>> comparable to CUSTOM_GEMM.
>>>>>
>>>>> However, I went ahead and replaced all uses of small block operations in
>>>>> the eliminator with simple gemm and gemv implementations, and the time has
>>>>> dropped even further, which would not be the case if inlining were the
>>>>> only thing at work here.
>>>>>
>>>>> With the increased inlining 1.02s
>>>>> With custom blas 0.634s
>>>>>
>>>>> I get roughly similar numbers with g++ 4.2 on macOS. I also tested this on
>>>>> Linux with g++ 4.6.3, where the linear solver time goes from 0.8 to 0.5
>>>>> seconds.
>>>>>
>>>>> Sameer
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Apr 2, 2013 at 5:23 AM, Gael Guennebaud
>>>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>>>>
>>>>>> On Tue, Apr 2, 2013 at 1:58 PM, Gael Guennebaud
>>>>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>>>> > After adding a few always_inline attributes
>>>>>>
>>>>>> An alternative is to add the following compiler option:
>>>>>>
>>>>>> -mllvm -inline-threshold=600
>>>>>>
>>>>>> gael
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>