Re: [eigen] BDCSVD update

Hi, some quick updates about D&C SVD.

In the evaluator branch [1], I cleaned a bit the code and implemented the application of HouseholderSequence block-wise. Now, the performance for a 1000^2 square matrix is as follow:

BDCSVD: 0.447019

BDCSVD U+V: 1.27355

dgesdd_: 0.48619 (Lapack)

dgesdd_ U+V: 1.20698 (Lapack)

When computing the SVD with both the U and V factors, the computation cost is roughly distributed as follows:

(1) - 25% for the bidiagonalization

(2) - 33% to update the m_naiveU and m_naiveV factors in the conquer step as matrix-matrix products [2].

(3) - 35% to apply the U and V factors of the bidiagonalization as HouseholderSequence to get the final U and V factors.

The rest represents only a small fraction. So, as you can see we reached a point where further optimizations are not straightforward as all these three parts are already implemented in a cache friendly manner (if we forgot about parallelism).

Nonetheless, I observed that in part (2), the m_naiveU and m_naiveV matrices are rather sparse: before the last update, about 50% of the coefficients are still zeros. Moreover, these zeros seems to be well structured, as shown by the final m_naiveU matrix for a 20x20 random matrix:

100100100101011101110
100100100101011101110
100100100101011101110
100100100101011101110
100100100101011101110
100100100101011101110
100100100101011101110
100100100101011101110
100100100101011101110
100100100101011101110
100100100101011101110
011011011010100010000
011011011010100010000
011011011010100010000
011011011010100010000
011011011010100010000
011011011010100010000
011011011010100010000
011011011010100010000
011011011010100010000
000000000000000000001

(here '1' represents any non zero value)

So in theory, we should be able to get a 2x speedup for part (2) which is quite significant.

cheers,

gael

[1] https://bitbucket.org/ggael/eigen-evaluators

[2] https://bitbucket.org/ggael/eigen-evaluators/src/e4c964526d5d269f9d80d0a8f2532c391a7bdced/unsupported/Eigen/src/BDCSVD/BDCSVD.h?at=default#cl-418

On Tue, Aug 27, 2013 at 5:49 PM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:

Congrats, with your new solver I now get:

(versus MKL)
BDCSVD: 0.61278
BDCSVD U+V: 5.05408
dgesdd_: 0.486337
dgesdd_ U+V: 1.20376

(versus Lapack)
BDCSVD: 1.94448
BDCSVD U+V: 11.7599
dgesdd_: 1.82727
dgesdd_ U+V: 3.36355

FYI, here, the bidiagonalization step takes about 1.7s.

So as you guessed, there is major room for improvement to compute U and V.

gael.

On Tue, Aug 27, 2013 at 5:44 PM, Gael Guennebaud <gael.guennebaud@xxxxxxxxx> wrote:

On Tue, Aug 27, 2013 at 4:54 PM, Jitse Niesen <jitse@xxxxxxxxxxxxxxxxx> wrote:

Gael, thanks for your work on bidiagonalization. Hopefully we are getting close to lapack now.

yes. The bidiagonalization step is now as fast as Lapack or intel MKL (with a single thread) even though it is still dominated by matrix-vector products.

Before your new solving method, I got these numbers (for a 1000 x 1000 random matrix):

on a xeon X5570 versus Intel MKL:
BDCSVD: 1.03396
BDCSVD U+V: 5.39393
dgesdd_(MKL): 0.53134
dgesdd_(MKL) U+V: 1.25072

on a core2 versus Lapack:
BDCSVD: 2.35741
BDCSVD U+V: 12.1212
dgesdd_: 1.81886
dgesdd_ U+V: 3.46479

gael