Hi, some quick updates about D&C SVD.
In the evaluator branch [1], I cleaned a bit the code and implemented the application of HouseholderSequence block-wise. Now, the performance for a 1000^2 square matrix is as follow:
BDCSVD: 0.447019
BDCSVD U+V: 1.27355
dgesdd_: 0.48619 (Lapack)
dgesdd_ U+V: 1.20698 (Lapack)
When computing the SVD with both the U and V factors, the computation cost is roughly distributed as follows:
(1) - 25% for the bidiagonalization
(2) - 33% to update the m_naiveU and m_naiveV factors in the conquer step as matrix-matrix products [2].
(3) - 35% to apply the U and V factors of the bidiagonalization as HouseholderSequence to get the final U and V factors.
The rest represents only a small fraction. So, as you can see we reached a point where further optimizations are not straightforward as all these three parts are already implemented in a cache friendly manner (if we forgot about parallelism).
Nonetheless, I observed that in part (2), the m_naiveU and m_naiveV matrices are rather sparse: before the last update, about 50% of the coefficients are still zeros. Moreover, these zeros seems to be well structured, as shown by the final m_naiveU matrix for a 20x20 random matrix:
100100100101011101110
100100100101011101110
100100100101011101110
100100100101011101110
100100100101011101110
100100100101011101110
100100100101011101110
100100100101011101110
100100100101011101110
100100100101011101110
100100100101011101110
011011011010100010000
011011011010100010000
011011011010100010000
011011011010100010000
011011011010100010000
011011011010100010000
011011011010100010000
011011011010100010000
011011011010100010000
000000000000000000001
(here '1' represents any non zero value)
So in theory, we should be able to get a 2x speedup for part (2) which is quite significant.
cheers,