Re: [eigen] Slow matrix-matrix multiply



Hi Guys,

Here are some more numbers from my MacBook (i7, 2.3 GHz). The machine was not completely quiet, but these numbers all show a steady trend, so I think they can be trusted.

First, some explanation. "old" refers to commit 222ca20 in the Ceres tree; "new" refers to HEAD. We made a number of changes between the two versions. Some of them concern the way we use Eigen, manage memory, etc.; the others have to do with optionally using new BLAS routines instead of Eigen. The suffix (eigen/blas) indicates whether Eigen or our custom BLAS routines are used for the small block gemm and gemv operations in the Schur eliminator.
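
For readers who have not seen this kind of routine, here is a minimal sketch of what "custom" small block gemm and gemv loops could look like. The function names and the row-major raw-pointer layout are assumptions made for illustration; this is not the actual code in Ceres' blas.h.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers, not the Ceres interface.
// All matrices are dense, row-major, and stored in caller-owned arrays.

// C += A * B, where A is m x k, B is k x n and C is m x n.
inline void SmallGemmAdd(const double* A, const double* B, double* C,
                         std::size_t m, std::size_t k, std::size_t n) {
  assert(A != nullptr && B != nullptr && C != nullptr);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t p = 0; p < k; ++p) {
      const double a = A[i * k + p];  // reused across the whole row of B
      for (std::size_t j = 0; j < n; ++j) {
        C[i * n + j] += a * B[p * n + j];
      }
    }
  }
}

// y += A * x, where A is m x n, x has n entries and y has m entries.
inline void SmallGemvAdd(const double* A, const double* x, double* y,
                         std::size_t m, std::size_t n) {
  assert(A != nullptr && x != nullptr && y != nullptr);
  for (std::size_t i = 0; i < m; ++i) {
    double sum = 0.0;
    for (std::size_t j = 0; j < n; ++j) {
      sum += A[i * n + j] * x[j];
    }
    y[i] += sum;
  }
}
```

Plain loops like these involve no expression templates, so their speed depends much less on the compiler's inlining decisions.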

I tested both Clang 4.2 and GCC 4.2.1 with problems from the UW BAL dataset. I am only reporting the time spent in the Ceres SPARSE_SCHUR linear solver. 

The first thing to note is that both compilers show a substantial performance improvement from old-eigen to new-eigen. Clang, however, seems to be generally a bit slower than GCC.

With GCC there does not seem to be much difference between Eigen and the custom BLAS routines, except on two problems. With Clang, despite the improved inlining flags, the custom routines improve performance fairly consistently.

Problem 1. 16 cameras 22106 points
          old-eigen  new-eigen new-blas
  gcc         2.1        1.0     1.0
clang         2.1        1.0     1.0

Problem 2. 49 cameras 7776 points
          old-eigen  new-eigen new-blas
  gcc         5.0        2.6     2.6
clang         5.0        2.6     2.5

Problem 3. 245 cameras 198739 points
          old-eigen  new-eigen new-blas
  gcc         47         31      31.5
clang         50.5       32.3    30.2

Problem 4. 257 cameras 65132 points
          old-eigen  new-eigen new-blas
  gcc         15         8.2     8.3
clang         15         8.3     7.6

Problem 5. 356 cameras 226730 points
          old-eigen  new-eigen new-blas
  gcc         54         36      36
clang         56         37      34

Problem 6. 744 cameras 543562 points
          old-eigen  new-eigen new-blas
  gcc         199       155      151
clang         210       156      145

Problem 7. 1031 cameras 110968 points
          old-eigen  new-eigen new-blas
  gcc         57       42        43
clang         57       42        40

What all the problems above have in common is that the matrices in question are statically sized. Another problem we are interested in involves semi-statically sized matrices, and there the performance improvements are much more dramatic.

Problem 8. 8 cameras, 1 shared calibration, 2190 points
          old-eigen new-eigen new-blas
  gcc        0.17     0.15      0.04
clang        0.16     0.13      0.03
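
The static versus semi-static distinction can be illustrated with a small sketch. It assumes "semi-static" means that some block dimensions are fixed at compile time while others are only known at run time. `kDynamic` and `BlockGemm` are hypothetical names chosen for this example, loosely in the spirit of Eigen::Dynamic; they are not taken from Ceres or Eigen.

```cpp
#include <cassert>

// Hypothetical sentinel marking a dimension that is only known at run time.
constexpr int kDynamic = -1;

// C = A * B for row-major A (rows x inner) and B (inner x cols).
// Each template parameter is either a compile-time size or kDynamic;
// for kDynamic the corresponding run-time argument is used instead.
template <int kRows, int kInner, int kCols>
inline void BlockGemm(const double* A, const double* B, double* C,
                      int rows, int inner, int cols) {
  const int m = (kRows != kDynamic) ? kRows : rows;
  const int k = (kInner != kDynamic) ? kInner : inner;
  const int n = (kCols != kDynamic) ? kCols : cols;
  // The run-time arguments must agree with any compile-time sizes.
  assert(m == rows && k == inner && n == cols);
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      double sum = 0.0;
      for (int p = 0; p < k; ++p) {
        sum += A[i * k + p] * B[p * n + j];
      }
      C[i * n + j] = sum;
    }
  }
}
```

With `BlockGemm<2, 3, 2>(A, B, C, 2, 3, 2)` every loop bound is a compile-time constant, which may let the compiler unroll the loops. With `BlockGemm<2, kDynamic, 2>(A, B, C, 2, 3, 2)` the inner dimension is read at run time while the outer two stay fixed.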

In summary, for fixed-size matrices, Clang and Eigen have some talking to do. For dynamic/semi-static matrices, it seems (based on one example) that the custom routines beat Eigen with both GCC and Clang.

Sameer



On Wed, Apr 3, 2013 at 9:10 AM, Sameer Agarwal <sameeragarwal@xxxxxxxxxx> wrote:
Gael,

I need to run some experiments on my laptop and desktop, and I will post what I think are statistically meaningful numbers (with multiple problems), but it's going to take me a bit of time.

Sameer



On Wed, Apr 3, 2013 at 1:19 AM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:
I also have difficulty observing significant differences with
Apple's default compiler:

-- default - clang - macbookpro --

Time (in seconds):
Preprocessor                            0.043

  Residual Evaluations                  0.076
  Jacobian Evaluations                  0.866
  Linear Solver                         0.740
Minimizer                               1.816

Postprocessor                           0.002
Total                                   1.906


-- CERES_NO_CUSTOM_BLAS - clang-inline-threshold - macbookpro --

Time (in seconds):
Preprocessor                            0.043

  Residual Evaluations                  0.070
  Jacobian Evaluations                  0.859
  Linear Solver                         0.779
Minimizer                               1.837

Postprocessor                           0.002
Total                                   1.926


-- CERES_NO_CUSTOM_BLAS - clang - macbookpro --

Time (in seconds):
Preprocessor                            0.043

  Residual Evaluations                  0.075
  Jacobian Evaluations                  0.863
  Linear Solver                         0.896
Minimizer                               1.970

Postprocessor                           0.002
Total                                   2.060

On Wed, Apr 3, 2013 at 9:42 AM, Gael Guennebaud
<gael.guennebaud@xxxxxxxxx> wrote:
> I still cannot reproduce this with gcc:
>
> -- default - gcc47 - Core2 Q9400 @2.66GHz --
>
> Time (in seconds):
> Preprocessor                            0.093
>
>   Residual Evaluations                  0.117
>   Jacobian Evaluations                  1.067
>   Linear Solver                         0.809
> Minimizer                               2.237
>
> Postprocessor                           0.005
> Total                                   2.371
>
> -- CERES_NO_CUSTOM_BLAS - gcc47 - Core2 Q9400 @2.66GHz --
>
> Time (in seconds):
> Preprocessor                            0.089
>
>   Residual Evaluations                  0.108
>   Jacobian Evaluations                  1.054
>   Linear Solver                         0.803
> Minimizer                               2.206
>
> Postprocessor                           0.005
> Total                                   2.335
>
>
> -- default - gcc47 - Xeon X5570 @2.93GHz --
>
> Time (in seconds):
> Preprocessor                            0.067
>
>   Residual Evaluations                  0.085
>   Jacobian Evaluations                  0.720
>   Linear Solver                         0.600
> Minimizer                               1.557
>
> Postprocessor                           0.001
> Total                                   1.645
>
> -- CERES_NO_CUSTOM_BLAS - gcc47 - Xeon X5570 @2.93GHz --
>
> Time (in seconds):
> Preprocessor                            0.067
>
>   Residual Evaluations                  0.085
>   Jacobian Evaluations                  0.734
>   Linear Solver                         0.599
> Minimizer                               1.570
>
> Postprocessor                           0.001
> Total                                   1.658
>
> gael
>
> On Wed, Apr 3, 2013 at 5:58 AM, Sameer Agarwal <sameeragarwal@xxxxxxxxxx> wrote:
>> In case there is still interest, the change has been merged into the master
>> branch.
>> Sameer
>>
>>
>>
>>
>> On Tue, Apr 2, 2013 at 12:25 PM, Sameer Agarwal <sameeragarwal@xxxxxxxxxx>
>> wrote:
>>>
>>> On Keir's suggestion, I have updated this CL to optionally compile Eigen
>>> based routines in and out.
>>>
>>> Passing -DCUSTOM_BLAS=ON/OFF to cmake switches between custom loops and
>>> Eigen inside blas.h.
>>>
>>> Sameer
>>>
>>>
>>>
>>> On Tue, Apr 2, 2013 at 11:42 AM, Sameer Agarwal <sameeragarwal@xxxxxxxxxx>
>>> wrote:
>>>>
>>>> Here is the gerrit CL that is used for generating these numbers
>>>>
>>>> https://ceres-solver-review.googlesource.com/#/c/2870/
>>>>
>>>> Sameer
>>>>
>>>>
>>>>
>>>> On Tue, Apr 2, 2013 at 11:34 AM, Sameer Agarwal
>>>> <sameeragarwal@xxxxxxxxxx> wrote:
>>>>>
>>>>> Gael and Christoph,
>>>>>
>>>>> Thank you for looking into this.
>>>>>
>>>>> Yes, adding -mllvm -inline-threshold=600 makes the timing of Eigen
>>>>> comparable to CUSTOM_GEMM.
>>>>>
>>>>> However, I went ahead and replaced all uses of small block operations in
>>>>> the eliminator with simple gemm and gemv implementations, and the time
>>>>> dropped even further, which would not be the case if inlining were the
>>>>> only thing at work here.
>>>>>
>>>>> With the increased inlining 1.02s
>>>>> With custom blas            0.634s
>>>>>
>>>>> I get roughly similar numbers with g++ 4.2 on Mac OS X. I also tested this on
>>>>> Linux with g++ 4.6.3, where the linear solver time goes from 0.8 to 0.5
>>>>> seconds.
>>>>>
>>>>> Sameer
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Apr 2, 2013 at 5:23 AM, Gael Guennebaud
>>>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>>>>
>>>>>> On Tue, Apr 2, 2013 at 1:58 PM, Gael Guennebaud
>>>>>> <gael.guennebaud@xxxxxxxxx> wrote:
>>>>>> > After adding a few always_inline attributes
>>>>>>
>>>>>> An alternative is to add the following compiler option:
>>>>>>
>>>>>> -mllvm -inline-threshold=600
>>>>>>
>>>>>> gael
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>





