“WASM is faster than JavaScript.” I’ve heard this so many times, and it drives me nuts — not because it’s wrong, but because it’s incomplete. WASM can be faster than JavaScript. It can also be slower. The difference depends on what you’re doing, how you’re crossing the JS↔WASM boundary, and whether you’ve hit the specific scenarios where WASM’s architecture actually gives you an advantage.
I ran benchmarks for months to figure out where the real boundaries are. Let me show you the data.
Why WASM Can Be Faster
Before the benchmarks, let’s understand the architectural differences:
JavaScript execution:
1. Parse source → AST
2. Compile to bytecode (baseline JIT)
3. Execute bytecode, profile hot paths
4. Recompile hot paths with optimizing JIT (TurboFan in V8)
5. Execute optimized machine code
6. Deoptimize if type assumptions break
7. Repeat from step 3
WASM execution:
1. Decode binary format (much faster than parsing text)
2. Validate the module
3. Compile to machine code (can be single-pass)
4. Execute machine code
WASM skips the warmup phase. There’s no JIT profiling, no speculative optimization, no deoptimization. You get consistent performance from the first call. For JavaScript, the optimizing JIT needs hundreds or thousands of calls before it kicks in, and a single unexpected type can cause deoptimization.
The Benchmarking Setup
I’m using criterion on the Rust side and a custom harness on the JavaScript side. All benchmarks run in Chrome, on the same machine, measuring wall-clock time after warmup.
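The custom harness isn't listed here, but the methodology is easy to sketch: warm up first, time several runs, keep the best. A minimal version in plain Rust (the names and iteration counts are illustrative, not the actual harness code):

```rust
use std::time::Instant;

// Minimal warmup-then-measure harness (illustrative, not the actual benchmark code).
// Runs `f` a few times to warm caches, then times `iters` runs and keeps the best,
// which is less noisy than the mean. The checksum keeps the result live so the
// call can't be optimized away.
fn bench<F: FnMut() -> u64>(mut f: F, warmup: u32, iters: u32) -> (u64, f64) {
    let mut checksum = 0u64;
    for _ in 0..warmup {
        checksum = checksum.wrapping_add(f());
    }
    let mut best = f64::INFINITY;
    for _ in 0..iters {
        let t0 = Instant::now();
        checksum = checksum.wrapping_add(f());
        let dt = t0.elapsed().as_secs_f64() * 1000.0; // milliseconds
        if dt < best {
            best = dt;
        }
    }
    (checksum, best)
}

fn main() {
    let (sum, ms) = bench(|| (0..1_000u64).map(|i| i * i).sum(), 3, 10);
    println!("checksum = {sum}, best run = {ms:.3} ms");
}
```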
# Cargo.toml
[package]
name = "wasm-bench"
version = "0.1.0"
edition = "2021"
[lib]
crate-type = ["cdylib", "rlib"]
[dependencies]
wasm-bindgen = "0.2"
js-sys = "0.3"
Benchmark 1: Tight Numeric Loops
The classic case — and where WASM wins convincingly.
use wasm_bindgen::prelude::*;
#[wasm_bindgen]
pub fn sum_squares(n: u32) -> f64 {
let mut sum: f64 = 0.0;
for i in 0..n {
sum += (i as f64) * (i as f64);
}
sum
}
#[wasm_bindgen]
pub fn mandelbrot_iterations(cx: f64, cy: f64, max_iter: u32) -> u32 {
let mut zx: f64 = 0.0;
let mut zy: f64 = 0.0;
let mut i: u32 = 0;
while i < max_iter && zx * zx + zy * zy < 4.0 {
let tmp = zx * zx - zy * zy + cx;
zy = 2.0 * zx * zy + cy;
zx = tmp;
i += 1;
}
i
}
#[wasm_bindgen]
pub fn render_mandelbrot(width: u32, height: u32, max_iter: u32) -> Vec<u8> {
let mut pixels = vec![0u8; (width * height * 4) as usize];
for py in 0..height {
for px in 0..width {
let cx = (px as f64 / width as f64) * 3.5 - 2.5;
let cy = (py as f64 / height as f64) * 2.0 - 1.0;
let iter = mandelbrot_iterations(cx, cy, max_iter);
let idx = ((py * width + px) * 4) as usize;
if iter == max_iter {
pixels[idx] = 0;
pixels[idx + 1] = 0;
pixels[idx + 2] = 0;
pixels[idx + 3] = 255;
} else {
let t = iter as f64 / max_iter as f64;
pixels[idx] = (9.0 * (1.0 - t) * t * t * t * 255.0) as u8;
pixels[idx + 1] = (15.0 * (1.0 - t) * (1.0 - t) * t * t * 255.0) as u8;
pixels[idx + 2] = (8.5 * (1.0 - t) * (1.0 - t) * (1.0 - t) * t * 255.0) as u8;
pixels[idx + 3] = 255;
}
}
}
pixels
}
The JavaScript equivalent:
function renderMandelbrot(width, height, maxIter) {
const pixels = new Uint8Array(width * height * 4);
for (let py = 0; py < height; py++) {
for (let px = 0; px < width; px++) {
const cx = (px / width) * 3.5 - 2.5;
const cy = (py / height) * 2.0 - 1.0;
let zx = 0, zy = 0, i = 0;
while (i < maxIter && zx * zx + zy * zy < 4.0) {
const tmp = zx * zx - zy * zy + cx;
zy = 2.0 * zx * zy + cy;
zx = tmp;
i++;
}
const idx = (py * width + px) * 4;
if (i === maxIter) {
pixels[idx] = 0;
pixels[idx + 1] = 0;
pixels[idx + 2] = 0;
pixels[idx + 3] = 255;
} else {
const t = i / maxIter;
pixels[idx] = (9.0 * (1 - t) * t * t * t * 255) | 0;
pixels[idx + 1] = (15.0 * (1 - t) * (1 - t) * t * t * 255) | 0;
pixels[idx + 2] = (8.5 * (1 - t) * (1 - t) * (1 - t) * t * 255) | 0;
pixels[idx + 3] = 255;
}
}
}
return pixels;
}
Results (800x600, 1000 iterations):
| Implementation | Time |
|---|---|
| JavaScript | ~85ms |
| WASM (Rust) | ~32ms |
| WASM speedup | 2.7x |
WASM wins because this is pure numeric computation with predictable types. The V8 JIT can optimize the JavaScript pretty well (those `| 0` integer-truncation hints help), but WASM still wins because there’s zero overhead from type checking and no risk of deoptimization.
Benchmark 2: String Processing
Here’s where things get more nuanced:
#[wasm_bindgen]
pub fn count_words(text: &str) -> usize {
text.split_whitespace().count()
}
#[wasm_bindgen]
pub fn to_title_case(text: &str) -> String {
text.split_whitespace()
.map(|word| {
let mut chars = word.chars();
match chars.next() {
None => String::new(),
Some(c) => {
let upper: String = c.to_uppercase().collect();
upper + &chars.as_str().to_lowercase()
}
}
})
.collect::<Vec<_>>()
.join(" ")
}
#[wasm_bindgen]
pub fn find_all_urls(text: &str) -> Vec<JsValue> {
let mut urls = Vec::new();
let mut start = 0;
while let Some(pos) = text[start..].find("https://") {
let abs_pos = start + pos;
let end = text[abs_pos..]
.find(|c: char| c.is_whitespace() || c == ')' || c == ']' || c == '"')
.map(|e| abs_pos + e)
.unwrap_or(text.len());
urls.push(JsValue::from_str(&text[abs_pos..end]));
start = end;
}
urls
}
Results (1MB of text, 100 iterations):
| Operation | JavaScript | WASM | Ratio |
|---|---|---|---|
| Word count | 8ms | 5ms | 1.6x faster |
| Title case | 15ms | 22ms | 0.7x (slower!) |
| URL extraction | 12ms | 9ms | 1.3x faster |
Wait — title case is slower in WASM? Yes. Because every string returned from WASM has to be copied across the boundary with a UTF-8 → UTF-16 conversion. When you’re creating lots of small strings and returning them, the boundary cost dominates the computation cost.
The lesson: WASM wins at string processing when you do the work in bulk and return a single result. It loses when you’re creating and returning many individual strings.
Benchmark 3: The Boundary Cost
Let’s measure the boundary overhead directly:
#[wasm_bindgen]
pub fn noop() {}
#[wasm_bindgen]
pub fn identity_i32(x: i32) -> i32 { x }
#[wasm_bindgen]
pub fn identity_string(x: &str) -> String { x.to_string() }
#[wasm_bindgen]
pub fn identity_bytes(x: &[u8]) -> Vec<u8> { x.to_vec() }
// Benchmark: call each function 1,000,000 times
const t0 = performance.now();
for (let i = 0; i < 1_000_000; i++) noop();
console.log("noop:", performance.now() - t0);
Results (1M calls):
| Function | Time | Cost per call |
|---|---|---|
| `noop()` | 12ms | ~12ns |
| `identity_i32(42)` | 15ms | ~15ns |
| `identity_string("hello")` | 280ms | ~280ns |
| `identity_bytes(1KB)` | 850ms | ~850ns |
The numeric boundary is essentially free. String and byte array crossings are 20-60x more expensive due to copying and encoding conversion. This is the number one performance pitfall in Rust WASM code.
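These numbers also give you a back-of-the-envelope rule for batch sizing: a fixed per-call boundary cost becomes negligible once each call does enough work. A sketch of that arithmetic (the 1% budget and per-item cost are illustrative; only the ~850ns figure comes from the table above):

```rust
// Estimate how many items each WASM call must process so that a fixed
// per-call boundary cost stays below `max_overhead` of total call time.
// Illustrative model only; plug in your own measured numbers.
fn min_batch_size(boundary_ns: f64, work_per_item_ns: f64, max_overhead: f64) -> u64 {
    // overhead fraction = boundary / (boundary + n * work) <= max_overhead
    // => n >= boundary * (1 - max_overhead) / (max_overhead * work)
    (boundary_ns * (1.0 - max_overhead) / (max_overhead * work_per_item_ns)).ceil() as u64
}

fn main() {
    // ~850ns to pass a byte array across, assume ~2ns of real work per item,
    // and a 1% overhead budget
    let n = min_batch_size(850.0, 2.0, 0.01);
    println!("process at least {n} items per call");
}
```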
Optimization Strategies
Strategy 1: Minimize Boundary Crossings
Bad:
#[wasm_bindgen]
pub fn process_item(item: &str) -> String {
// Process one item
item.to_uppercase()
}
// Called 10,000 times from JS — 10,000 boundary crossings
Good:
#[wasm_bindgen]
pub fn process_batch(items: &str, delimiter: &str) -> String {
// Process all items in one call
items
.split(delimiter)
.map(|item| item.to_uppercase())
.collect::<Vec<_>>()
.join(delimiter)
}
// Called once — 1 boundary crossing
Strategy 2: Use Shared Memory for Large Data
Instead of copying data across the boundary, let JavaScript read directly from WASM memory:
use wasm_bindgen::prelude::*;
use std::cell::RefCell;
thread_local! {
static BUFFER: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}
#[wasm_bindgen]
pub fn process_image(width: u32, height: u32) -> *const u8 {
BUFFER.with(|buf| {
let mut buffer = buf.borrow_mut();
buffer.resize((width * height * 4) as usize, 0);
// Do expensive image processing...
for chunk in buffer.chunks_exact_mut(4) {
chunk[0] = 255; // R
chunk[1] = 128; // G
chunk[2] = 0; // B
chunk[3] = 255; // A
}
buffer.as_ptr()
})
}
#[wasm_bindgen]
pub fn buffer_len() -> usize {
BUFFER.with(|buf| buf.borrow().len())
}
const ptr = process_image(800, 600);
const len = buffer_len();
// Create a view into WASM memory — zero copy!
const pixels = new Uint8Array(wasm.memory.buffer, ptr, len);
// Use directly with Canvas
const imageData = new ImageData(
new Uint8ClampedArray(pixels.buffer, pixels.byteOffset, pixels.byteLength),
800, 600
);
ctx.putImageData(imageData, 0, 0);
This is how high-performance WASM applications work. The pixel data never leaves WASM memory — JavaScript just creates a view into it. One caveat: if WASM memory grows (any allocation can trigger this), the underlying ArrayBuffer is detached and existing views become invalid, so recreate the view after each call rather than caching it.
Strategy 3: SIMD
Rust can emit WASM SIMD instructions, which process 4 floats or 16 bytes simultaneously:
use std::arch::wasm32::*;
use wasm_bindgen::prelude::*;
#[wasm_bindgen]
pub fn dot_product_simd(a: &[f32], b: &[f32]) -> f32 {
assert_eq!(a.len(), b.len());
let chunks = a.len() / 4;
let mut sum = f32x4_splat(0.0);
for i in 0..chunks {
let offset = i * 4;
let va = f32x4(a[offset], a[offset + 1], a[offset + 2], a[offset + 3]);
let vb = f32x4(b[offset], b[offset + 1], b[offset + 2], b[offset + 3]);
sum = f32x4_add(sum, f32x4_mul(va, vb));
}
let result = f32x4_extract_lane::<0>(sum)
+ f32x4_extract_lane::<1>(sum)
+ f32x4_extract_lane::<2>(sum)
+ f32x4_extract_lane::<3>(sum);
// Handle remaining elements
let remainder_start = chunks * 4;
let mut scalar_sum = result;
for i in remainder_start..a.len() {
scalar_sum += a[i] * b[i];
}
scalar_sum
}
#[wasm_bindgen]
pub fn dot_product_scalar(a: &[f32], b: &[f32]) -> f32 {
a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}
Enable SIMD in your build:
RUSTFLAGS="-C target-feature=+simd128" wasm-pack build --target web
Results (dot product, 1M elements):
| Implementation | Time |
|---|---|
| JavaScript | 3.2ms |
| WASM scalar | 2.1ms |
| WASM SIMD | 0.8ms |
SIMD gives you another 2-3x on top of the baseline WASM advantage. Browser support for WASM SIMD is excellent — Chrome, Firefox, Safari, and Edge all support it.
Strategy 4: Profile Before You Optimize
Use the browser’s built-in profiler. WASM functions show up in Chrome DevTools Performance tab with their actual names (if you build with debug info or use a name section):
# Keep function names in the binary for profiling
[profile.release]
debug = 1 # Include some debug info
You can also use console.time / console.timeEnd from Rust:
use wasm_bindgen::prelude::*;
#[wasm_bindgen]
extern "C" {
    #[wasm_bindgen(js_namespace = console)]
    fn time(label: &str);
    #[wasm_bindgen(js_namespace = console, js_name = timeEnd)]
    fn time_end(label: &str);
}
#[wasm_bindgen]
pub fn expensive_operation(data: &[f64]) -> Vec<f64> {
    time("rust_processing");
    let result: Vec<f64> = data
        .windows(3)
        .map(|w| (w[0] + w[1] + w[2]) / 3.0)
        .collect();
    time_end("rust_processing");
    result
}
The Real-World Performance Picture
Here’s my honest assessment after a year of shipping Rust WASM in production:
WASM crushes JavaScript at:
- Image and video processing (2-5x faster)
- Physics simulations (3-10x faster, especially with SIMD)
- Cryptography and hashing (2-4x faster)
- Compression/decompression (2-3x faster)
- Parsing binary formats (2-4x faster)
- Anything with complex data structures (trees, graphs) — no GC pauses
WASM is roughly equal to JavaScript at:
- JSON parsing (V8’s JSON.parse is heavily optimized native code)
- Simple array operations (V8 optimizes these well)
- Regex matching (V8’s regex engine is native C++)
- Basic string operations (when boundary costs are factored in)
WASM is slower than JavaScript at:
- DOM-heavy operations (every DOM call crosses the boundary)
- Creating many small strings
- Tasks with lots of JS interop
- Anything where the computation time is small relative to boundary cost
Memory Management
One area where Rust WASM has a less obvious advantage: predictable memory usage. JavaScript’s garbage collector can cause unpredictable pauses — usually small, but they add up in frame-sensitive applications. Rust doesn’t have a GC, so memory cleanup happens deterministically when values go out of scope.
#[wasm_bindgen]
pub fn process_large_dataset(data: &[f64]) -> Vec<f64> {
    // Temporary allocation for the first pass
    let sines: Vec<f64> = data.iter().map(|x| x.sin()).collect();
    let result: Vec<f64> = data
        .iter()
        .zip(sines.iter())
        .map(|(a, b)| a + b)
        .collect();
    // `sines` is freed here when it goes out of scope — deterministic, no GC pause
    result
}
For real-time applications — games, audio processing, visualizations — this predictability matters more than raw speed. At 60fps you have roughly 16.7ms per frame, so a GC pause of even 5ms can cause a visible frame drop.
Practical Takeaway
Don’t reach for WASM because “it’s faster.” Reach for it when you’ve profiled your JavaScript, identified a CPU-bound bottleneck, and the computation is large enough that the boundary crossing cost is negligible compared to the work being done.
The ideal WASM workload looks like this: receive a large chunk of data, do expensive computation, return a result. One boundary crossing in, one out, lots of work in between. That’s where the 2-5x speedups live.
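That shape can be sketched in a few lines. Here is a hypothetical moving-average filter, shown as plain Rust (in a real module it would be a `#[wasm_bindgen]` export): one slice crosses the boundary in, one `Vec` crosses out, and all the looping happens inside.

```rust
// The ideal WASM workload shape: one large input in, one result out,
// with all per-element work on the WASM side of the boundary.
// (Hypothetical example, not from the benchmarks above.)
fn moving_average(samples: &[f64], window: usize) -> Vec<f64> {
    assert!(window > 0, "window must be non-zero");
    samples
        .windows(window)
        .map(|w| w.iter().sum::<f64>() / window as f64)
        .collect()
}

fn main() {
    let smoothed = moving_average(&[1.0, 2.0, 3.0, 4.0], 2);
    println!("{smoothed:?}");
}
```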
Next up: multi-threaded WASM. Because single-threaded performance improvements are nice, but parallelism is how you get order-of-magnitude gains.