[eigen] OpenMP implementation of Matrix*Vector operation
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: [eigen] OpenMP implementation of Matrix*Vector operation
- From: gr x <xgrchn@xxxxxxxxx>
- Date: Tue, 15 May 2012 14:52:49 +0000
Hi everyone:
I'm trying to use Eigen to solve some linear algebra equations
iteratively, so the Matrix*Vector operation is very common. As far as I know,
in Eigen the OpenMP parallelization is only implemented for matrix*matrix
multiplication (please tell me if I'm wrong).
However, in my case the matrix is often moderately large (typically
several thousand rows, or even hundreds of thousands in the sparse case),
so it is quite necessary to take advantage of a multicore CPU.
I've heard this is on the schedule, so how is it going now? Are there
any benchmark results with respect to matrix size?
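
For reference, the way I currently use the existing parallelization is
roughly the following (just a sketch of my understanding; I believe
Eigen::setNbThreads is the relevant call, and the products are only
multithreaded when the code is built with -fopenmp):

    // g++ -O2 -fopenmp -march=native
    #include <Eigen/Dense>
    using namespace Eigen;

    int main()
    {
        setNbThreads(4);                          // optional: cap the number of threads Eigen uses
        MatrixXd A = MatrixXd::Random(2000, 2000);
        MatrixXd B = MatrixXd::Random(2000, 2000);
        MatrixXd C = A * B;                       // dense matrix*matrix product: runs multithreaded
        VectorXd y = A * VectorXd::Random(2000);  // matrix*vector product: still single-threaded
        return 0;
    }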
Thanks~
PS: Actually I've implemented a simple version by partitioning the matrix
into several "blocks", but it turns out to work well only for sizes around
100~2000; for larger sizes it is much slower (i.e., no better than the
serial code). Here is my code snippet:
(built with g++, flags: -O2 -fopenmp -march=native)
    // needs <Eigen/Dense> and <omp.h>
    int N = 1000; // scale
    Matrix<double,Dynamic,Dynamic,RowMajor> m =
        Matrix<double,Dynamic,Dynamic,RowMajor>::Random(N,N);
    VectorXd v = VectorXd::Random(N);
    VectorXd s(N);

    #pragma omp parallel
    {
        // each thread computes one contiguous block of rows of the result
        int nthreads = omp_get_num_threads();
        int rank     = omp_get_thread_num();
        int chunk    = (N + nthreads - 1) / nthreads;             // rows per thread, rounded up
        int i0 = rank * chunk;                                    // first row of this thread's block
        int i1 = (rank + 1) * chunk < N ? (rank + 1) * chunk : N; // one past the last row
        if (i0 < N)
            s.segment(i0, i1 - i0) = m.block(i0, 0, i1 - i0, N) * v;
    }
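
One alternative I've been considering (just a sketch; the helper name
parallel_matvec and the chunk size of 256 rows are placeholders I made up,
not something I've tuned) is a plain OpenMP for over fixed-size row blocks
instead of one big block per thread:

    #include <Eigen/Dense>
    #include <omp.h>
    #include <algorithm>
    using namespace Eigen;

    void parallel_matvec(const Matrix<double,Dynamic,Dynamic,RowMajor>& m,
                         const VectorXd& v, VectorXd& s)
    {
        const int N = m.rows();
        const int chunk = 256;                   // rows per iteration; arbitrary, needs tuning
        #pragma omp parallel for schedule(static)
        for (int i0 = 0; i0 < N; i0 += chunk)
        {
            int rows = std::min(chunk, N - i0);
            // each iteration writes an independent slice of s, so no synchronization is needed
            s.segment(i0, rows).noalias() = m.block(i0, 0, rows, N) * v;
        }
    }

Smaller chunks should at least balance the work better across cores, but the
product is mostly memory-bound, so I'm not sure how much it can help.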
Is there a better way to do this?
Thanks very much!