[eigen] Proposal to remove aligned alloc paths

[ Thread Index | Date Index | More lists.tuxfamily.org/eigen Archives ]


Hi List,

This is to propose that we drop two paths from aligned_malloc: the "malloc is already aligned" path, and the mm_malloc path.

Thus, we would be left with just these paths:
 1) The preferred system-specific aligned allocation (posix_memalign on POSIX, _aligned_malloc on Windows).
 2) handmade_aligned_malloc as the only fallback.

See attached patches.

Motivation for these changes, aside from removing code:

The "malloc is already aligned" path relies on the assumption that malloc's nonstandard properties are only a function of the platform, which we tried to detect using preprocessor tokens. But malloc is not a system feature, it's a C library feature and many applications override it. Thus, we can't know if malloc is already aligned based on platform checks.

Part of the motivation for the "malloc is already aligned" path was that on my linux system at the time, malloc was faster than posix_memalign. If you think that this is a reason to keep this path, please run the attached benchmark and share results. On my current machine running Mac OSX, I get:

malloc+free for random sizes between 0 and 10...
Real:5.15e-08 s, CPU: 5.15e-08 s
posix_memalign+free with 16-byte alignment for random sizes between 0 and 10...
Real:5.34e-08 s, CPU: 5.34e-08 s
posix_memalign+free with 32-byte alignment for random sizes between 0 and 10...
Real:2.34e-07 s, CPU: 2.34e-07 s
malloc+free for random sizes between 0 and 1000...
Real:1.01e-07 s, CPU: 1.01e-07 s
posix_memalign+free with 16-byte alignment for random sizes between 0 and 1000...
Real:1.06e-07 s, CPU: 1.06e-07 s
posix_memalign+free with 32-byte alignment for random sizes between 0 and 1000...
Real:2.33e-07 s, CPU: 2.33e-07 s
malloc+free for random sizes between 0 and 100000...
Real:1.07e-07 s, CPU: 1.07e-07 s
posix_memalign+free with 16-byte alignment for random sizes between 0 and 100000...
Real:1.13e-07 s, CPU: 1.13e-07 s
posix_memalign+free with 32-byte alignment for random sizes between 0 and 100000...
Real:1.14e-07 s, CPU: 1.14e-07 s
malloc+free for random sizes between 0 and 10000000...
Real:5.9e-07 s, CPU: 5.9e-07 s
posix_memalign+free with 16-byte alignment for random sizes between 0 and 10000000...
Real:6.08e-07 s, CPU: 6.08e-07 s
posix_memalign+free with 32-byte alignment for random sizes between 0 and 10000000...
Real:5.78e-07 s, CPU: 5.78e-07 s

So for me, posix_memalign with 16-byte alignment is just 5% slower than malloc. Would that be enough to bother?

I added 32-byte alignment for completeness but of course malloc is not an option for that.

Finally, there also was the point that it's better to use malloc rather than obscure other functions because it works better with instrumentation that people might use, that hook into or override standard allocation functions. That's a fair point, but such instrumentation, in order to be correct, has to instrument the platform's primary aligned-alloc function (posix_memalign or aligned_malloc on Windows) anyway. Out of the functions that we used, only mm_malloc stood out as relatively obscure. Since we shouldn't ever need it, and the handmade fallback isn't too terrible anyway, here is a second patch that removes it.

Thoughts?

Benoit

Attachment: no-mm-malloc
Description: Binary data

Attachment: malloc-not-already-aligned
Description: Binary data

#include <bench/BenchTimer.h>
#include <iostream>
#include <vector>

using namespace std;
using namespace Eigen;

const size_t iters = 1000000;

void test(size_t maxsize, size_t alignment) {
	if (alignment) {
		cout << "posix_memalign+free with " << alignment << "-byte alignment ";
	} else {
		cout << "malloc+free ";
	}
	cout << "for random sizes between 0 and " << maxsize << "..." << endl;
	vector<size_t> sizes;
	for (size_t i = 0; i < iters; ++i) {
		// only test sizes that are multiples of 4 bytes to be realistic wrt Eigen
		sizes.push_back((rand() % maxsize) & ~3);
	}
	BenchTimer timer;
	double startrealtime = timer.getRealTime(),
	       startcputime = timer.getCpuTime();
	for (size_t i = 0; i < iters; ++i) {
		void *p;
		if (alignment) {
			posix_memalign(&p, alignment, sizes[i]);
		} else {
			p = malloc(sizes[i]);
		}
		free(p);
	}
	double endrealtime = timer.getRealTime(),
	       endcputime = timer.getCpuTime();
	cout << "Real:" <<
	  (endrealtime - startrealtime) / iters << " s, CPU: " <<
	  (endcputime - startcputime) / iters << " s" << endl;
}

int main()
{
	cout.precision(3);
	test(10, 0);
    test(10, 16);
	test(10, 32);
	test(1000, 0);
    test(1000, 16);
	test(1000, 32);
	test(100000, 0);
    test(100000, 16);
	test(100000, 32);
	test(10000000, 0);
    test(10000000, 16);
	test(10000000, 32);
}


Mail converted by MHonArc 2.6.19+ http://listengine.tuxfamily.org/