[eigen] Status of building Eigen with Emscripten (running Eigen code in

Hi List,

TL;DR: Emscripten currently runs scalar Eigen code at ~40% of native speed in multiple browsers. SIMD support improves this considerably in browsers that implement it, but that does not include any current stable browser; in current stable browsers, enabling SIMD makes things much *worse*.

I just took another look at running Eigen MatrixXf multiplications in the browser; here is what I found.

Emscripten is now very easy to get started with. Compiling the attached testcase is as easy as:

em++ ~/vrac/eigen-benchmark.cc -I $HOME/eigen -O3 --std=c++11 -Wextra -s TOTAL_MEMORY=30000000 -o eigen-benchmark.html

That is, aside from specifying the memory size or growth policy, there is nothing particular to do. You can then simply point your browser to the resulting eigen-benchmark.html.

I was interested in performance, and in the status of SIMD.

By default, Emscripten emulates a 32-bit arch with no SIMD. For 1024x1024 MatrixXf multiplication, I get:

Native with -m32 -mno-sse: 6.0 GFlop/s
Emscripten'd code in Firefox: 2.6 GFlop/s
Emscripten'd code in Chrome: 2.2 GFlop/s

So we're at roughly 40% of native performance with plain scalar code.

Next, I was interested in SIMD status. Emscripten is gaining the ability to target SIMD.js, simply by passing -msse2 as usual. Unfortunately, this seems to be only supported in Firefox Nightly at the moment, with other browsers at the "intent to implement" stage according to Mozilla documentation. Emscripten generates a polyfill so that SIMD code still "works" everywhere, but that fallback is very, very slow.

Results with SSE2:
Native with -m32 -msse2: 20 GFlop/s
Native with -m64 -msse2: 25 GFlop/s
Emscripten'd code in Firefox Nightly: 11.8 GFlop/s
Emscripten'd code in stable Firefox: 0.0015 GFlop/s
Emscripten'd code in stable Chrome: did not complete benchmark

So the good news is that when SIMD.js is supported (in Firefox Nightly), it runs at 60% of native speed (since we should compare to -m32). The bad news is that enabling SIMD makes things unbearably slow when the fallback is used.

Emscripten bug to track for making the SIMD fallback better: issue 3783

Cheers,

Benoit

#include <cstdint>
#include <iostream>
#include <unistd.h>
#include <sys/time.h>
#include <Eigen/Core>

using Eigen::MatrixXf;

// Wall-clock time in seconds, with microsecond resolution.
double current_time_in_seconds() {
  timeval t;
  gettimeofday(&t, nullptr);
  return t.tv_sec + 1e-6 * t.tv_usec;
}

// Make this function non-inlinable to prevent the compiler from optimizing
// away repeated calls.
EIGEN_DONT_INLINE
void do_one_benchmark_iter(const MatrixXf& a, const MatrixXf& b, MatrixXf* c) {
  *c = a * b;
}

double benchmark_sgemm_gflops(int size) {
  // Minimum duration for this benchmark to run. If the workload finishes
  // sooner, we retry with double the number of iterations.
  static const double min_benchmark_time_in_seconds = 0.1;

  MatrixXf a = MatrixXf::Random(size, size);
  MatrixXf b = MatrixXf::Random(size, size);
  MatrixXf c = MatrixXf::Zero(size, size);

  uint64_t iters_at_a_time = 1;
  while (true) {
    double t_start = current_time_in_seconds();
    for (uint64_t i = 0; i < iters_at_a_time; i++) {
      do_one_benchmark_iter(a, b, &c);
    }
    double t_end = current_time_in_seconds();
    double elapsed = t_end - t_start;
    if (elapsed > min_benchmark_time_in_seconds) {
      // Standard GEMM algorithm on NxN matrices requires 2N^3 operations
      double gflop_per_iter = 2.0 * size * size * size * 1e-9;
      return gflop_per_iter * iters_at_a_time / elapsed;
    }
    iters_at_a_time *= 2;
  }
}

int main() {
  for (int size = 16; size <= 1024; size *= 2) {
    std::cout << "Multiplying " << size << "x" << size << " matrices at "
              << benchmark_sgemm_gflops(size) << " GFlop/s" << std::endl;
  }
}