|Re: [eigen] Advanced vectorization|
[ Thread Index |
| More lists.tuxfamily.org/eigen Archives
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Advanced vectorization
- From: Márton Danóczy <marton78@xxxxxxxxx>
- Date: Mon, 6 Jun 2011 20:44:38 +0200
- Dkim-signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:in-reply-to:references:from:date :message-id:subject:to:content-type:content-transfer-encoding; bh=sN/t47evYQ+FFNKhJHTCXabH5EkDNsz06N/TeBZs4gU=; b=Oa2jaib5Kqc1o/L+Qh6BsztcjHYx9i35gpoCHuXLx+W0nW0HZIh+oIIaNo9bRbZIC0 MeerJ+Ft4k8J0Q7KTnxTg65l/BWnpuzKGEdw73dLjp09IfqF/kO6UhVWOzFTEWA3vMfD G/On7HbjElom3lhe9WjLzQq4KGot4w4LAhr+Q=
- Domainkey-signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type:content-transfer-encoding; b=YIrB6j5M2Q1UywRIOeSlozCG0yBGynzVy2snfg0iIfFVxx5pzZTk7NY1FiN6x8sInZ 29qYvle2Onta+15Hj1cr4qtM7ydiXagTz/EqgGY6+nMMehsAII5Rc0P35LiNH/n+vapP Auf9PqKO404iEJJi7QndNzwwPmkbGMk+Egn8Y=
>> If A is of size n*n, then just reading it is n^2 memory accesses,
>> while you trick would be saving only n memory accesses (since e is a
>> vector of size n), so I'd say it's going to be negligible unless n is
>> very small. Also, by computing e coefficient-wise, you lose the
>> benefit of Eigen's cache-friendly matrix-vector product
>> implementation, which is important for large n.
> I guess Márton's suggestion would first and foremost reduce the read-cost of
> If A is stored row-major, each row could be multiplied by x, and the
> resulting scalar times that row transposed can be added to g. Assuming that
> g and at least one row of A fit into the cache this could reduce RAM access
> by a factor of 2.
I did exactly that and benchmarked it:
Scalar objfunc(const Matrix<Scalar, Dynamic, 1>& x, Matrix<Scalar,
Dynamic, 1>& g)
for (ptrdiff_t i=0; i<AT.cols(); ++i)
typename Matrix<Scalar, Dynamic, Dynamic>::ColXpr ai = AT.col(i);
Scalar e = ai.dot(x) - b(i);
g += e*ai;
f += e*e;
return Scalar(0.5) * f;
Benchmarking reveals that regardless of the size of A (1024x1024,
1024x4096, 4096x1024, 4096x4096), the original code is faster.
Conclusion: don't try to outsmart Gael (at least if your matrices
don't fit into the cache)...