I had a batch processing job — reading 50,000 JSON files, parsing them, running validation, writing results. Single-threaded, it took 12 minutes. I added Rayon, changed .iter() to .par_iter(), and it dropped to 90 seconds. Four characters added to my code. That’s it.
Rayon is probably the most impressive crate in the Rust ecosystem for the effort-to-impact ratio. If you have CPU-bound work that operates on collections, Rayon makes parallelism trivial.
The Problem: Manual Thread Management
Parallelizing work by hand is tedious and error-prone:
use std::thread;
fn process(item: &str) -> String {
// expensive computation
std::thread::sleep(std::time::Duration::from_millis(10));
item.to_uppercase()
}
fn main() {
let items: Vec<String> = (0..1000).map(|i| format!("item-{}", i)).collect();
let num_threads = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
let chunk_size = (items.len() / num_threads).max(1);
let mut handles = vec![];
for chunk in items.chunks(chunk_size) {
let owned: Vec<String> = chunk.to_vec();
handles.push(thread::spawn(move || {
owned.iter().map(|item| process(item)).collect::<Vec<_>>()
}));
}
let results: Vec<String> = handles
.into_iter()
.flat_map(|h| h.join().unwrap())
.collect();
println!("Processed {} items", results.len());
}
That’s a lot of boilerplate. You’re managing chunks, spawning threads, collecting results, dealing with ownership. And what if the chunks are uneven? What if some items take longer than others?
The Solution: par_iter
use rayon::prelude::*;
fn process(item: &str) -> String {
std::thread::sleep(std::time::Duration::from_millis(10));
item.to_uppercase()
}
fn main() {
let items: Vec<String> = (0..1000).map(|i| format!("item-{}", i)).collect();
let results: Vec<String> = items
.par_iter()
.map(|item| process(item))
.collect();
println!("Processed {} items", results.len());
}
Add rayon to your Cargo.toml, change iter() to par_iter(), done. Rayon handles thread pool management, work stealing, load balancing — all of it.
[dependencies]
rayon = "1.10"
How Rayon Works
Rayon uses a work-stealing thread pool. Here’s the basic idea:
- Rayon maintains a global thread pool (by default, one thread per CPU core)
- When you call par_iter(), the work is split recursively into chunks
- Each thread grabs a chunk and processes it
- If a thread finishes early, it steals work from other threads’ queues
- Results are combined and returned
Work stealing is what makes Rayon handle uneven workloads well. If 90% of the work is in 10% of the items, threads that finish the easy items steal from the busy thread’s queue. No idle cores.
Parallel Iterator Methods
Most iterator methods have parallel equivalents:
use rayon::prelude::*;
fn main() {
let numbers: Vec<i64> = (0..1_000_000).collect();
// par_iter — parallel immutable iteration
let sum: i64 = numbers.par_iter().sum();
println!("Sum: {}", sum);
// map + collect
let squares: Vec<i64> = numbers.par_iter().map(|&n| n * n).collect();
// filter + map + collect
let even_squares: Vec<i64> = numbers
.par_iter()
.filter(|&&n| n % 2 == 0)
.map(|&n| n * n)
.collect();
println!("Even squares: {} items", even_squares.len());
// any / all
let has_negative = numbers.par_iter().any(|&n| n < 0);
let all_positive = numbers.par_iter().all(|&n| n >= 0);
println!("has_negative: {}, all_positive: {}", has_negative, all_positive);
// find_any — returns Some(&T), but not necessarily the first match
let found = numbers.par_iter().find_any(|&&n| n == 500_000);
println!("Found: {:?}", found);
// for_each — parallel side effects
numbers.par_iter().for_each(|&n| {
if n % 100_000 == 0 {
println!("Processing: {}", n);
}
});
// reduce
let product: i64 = (1i64..=20)
.into_par_iter()
.reduce(|| 1, |a, b| a * b);
println!("20! = {}", product);
}
Important: find_any vs find_first. find_any returns any matching element (whichever thread finds one first). find_first returns the first matching element in iteration order, which is slower because it can’t short-circuit as aggressively.
par_iter vs into_par_iter vs par_iter_mut
use rayon::prelude::*;
fn main() {
let data = vec![1, 2, 3, 4, 5];
// par_iter: borrows — &T
let sum: i32 = data.par_iter().sum();
// par_iter_mut: mutable borrows — &mut T
let mut data = vec![1, 2, 3, 4, 5];
data.par_iter_mut().for_each(|x| *x *= 2);
println!("{:?}", data); // [2, 4, 6, 8, 10]
// into_par_iter: takes ownership — T
let data = vec![String::from("a"), String::from("b"), String::from("c")];
let upper: Vec<String> = data.into_par_iter().map(|s| s.to_uppercase()).collect();
// data is consumed, can't use it anymore
println!("{:?}", upper);
// Ranges work too
let vals: Vec<i32> = (0..100).into_par_iter().map(|n| n * n).collect();
println!("{} items", vals.len());
}
Parallel Sorting
use rayon::prelude::*;
fn main() {
let mut data: Vec<i32> = (0..10_000_000).rev().collect();
// Parallel sort — significantly faster for large collections
data.par_sort();
// Parallel unstable sort — even faster, doesn't preserve order of equal elements
data.par_sort_unstable();
// With custom comparator
data.par_sort_unstable_by(|a, b| b.cmp(a)); // descending
println!("First 5: {:?}", &data[..5]);
}
Parallel sorting shines with millions of elements. For a few thousand, the overhead of parallelization isn’t worth it.
Custom Thread Pool
By default, Rayon uses a global thread pool. You can customize it or create your own:
use rayon::prelude::*;
fn main() {
// Configure the global pool
rayon::ThreadPoolBuilder::new()
.num_threads(4)
.thread_name(|idx| format!("rayon-worker-{}", idx))
.build_global()
.unwrap();
let result: Vec<i32> = (0..100).into_par_iter().map(|n| n * 2).collect();
println!("{:?}", &result[..10]);
}
Or create a separate pool for isolation:
use rayon::prelude::*;
fn main() {
let pool = rayon::ThreadPoolBuilder::new()
.num_threads(2)
.build()
.unwrap();
let result = pool.install(|| {
(0..1000).into_par_iter().map(|n| n * n).sum::<i64>()
});
println!("Result: {}", result);
}
pool.install() runs the closure on the custom pool instead of the global one. Useful when you need different parallelism levels for different workloads.
When Rayon Hurts
Rayon isn’t free. Task splitting, scheduling, and work stealing all cost something, and for small amounts of cheap work that overhead can exceed the benefit:
use rayon::prelude::*;
use std::time::Instant;
fn main() {
let data: Vec<i32> = (0..100).collect();
// Sequential — probably faster for tiny workloads
let start = Instant::now();
let _: Vec<i32> = data.iter().map(|&n| n + 1).collect();
println!("Sequential: {:?}", start.elapsed());
// Parallel — overhead exceeds benefit for trivial operations on small data
let start = Instant::now();
let _: Vec<i32> = data.par_iter().map(|&n| n + 1).collect();
println!("Parallel: {:?}", start.elapsed());
}
Rules of thumb:
- Fewer than ~1000 elements with cheap operations — stick with sequential
- Cheap operations on large data — parallel helps but gains are modest
- Expensive operations on any size data — parallel almost always wins
- IO-bound work — Rayon is for CPU work. Use async for IO.
Real Example: Image Processing
use rayon::prelude::*;
#[derive(Clone)]
struct Pixel {
r: u8,
g: u8,
b: u8,
}
impl Pixel {
fn grayscale(&self) -> Pixel {
let gray = (0.299 * self.r as f64 + 0.587 * self.g as f64 + 0.114 * self.b as f64) as u8;
Pixel { r: gray, g: gray, b: gray }
}
}
fn main() {
// Simulate a 4K image: 3840 x 2160 pixels
let width = 3840;
let height = 2160;
let image: Vec<Pixel> = (0..width * height)
.map(|i| Pixel {
r: (i % 256) as u8,
g: ((i / 256) % 256) as u8,
b: ((i / 65536) % 256) as u8,
})
.collect();
let start = std::time::Instant::now();
let _gray: Vec<Pixel> = image.iter().map(|p| p.grayscale()).collect();
println!("Sequential: {:?}", start.elapsed());
let start = std::time::Instant::now();
let _gray: Vec<Pixel> = image.par_iter().map(|p| p.grayscale()).collect();
println!("Parallel: {:?}", start.elapsed());
}
For 8 million pixels, the parallel version typically runs 4-6x faster on 8 cores.
Parallel Chunks
For row-based processing where you need the index:
use rayon::prelude::*;
fn main() {
let width = 1920;
let height = 1080;
let mut pixels = vec![0u8; width * height * 3]; // RGB
// Process each row in parallel
pixels
.par_chunks_mut(width * 3)
.enumerate()
.for_each(|(row, chunk)| {
for x in 0..width {
let idx = x * 3;
chunk[idx] = (x % 256) as u8; // R
chunk[idx + 1] = (row % 256) as u8; // G
chunk[idx + 2] = 128; // B
}
});
println!("Filled {} rows", height);
}
par_chunks_mut splits a slice into mutable chunks and processes them in parallel. Each chunk is exclusively owned by one thread — no races possible.
Combining Rayon with Other Patterns
Rayon pairs well with channels for streaming results:
use rayon::prelude::*;
use std::sync::mpsc;
fn main() {
let (tx, rx) = mpsc::channel();
// Spawn a thread to consume results as they're produced
let consumer = std::thread::spawn(move || {
let mut count = 0;
for result in rx {
count += 1;
if count % 1000 == 0 {
println!("Consumed {} results, latest: {}", count, result);
}
}
count
});
// Parallel processing, streaming results via channel
let items: Vec<i32> = (0..10_000).collect();
items.par_iter().for_each_with(tx, |tx, &item| {
let result = item * item; // expensive computation
tx.send(result).unwrap();
});
let total = consumer.join().unwrap();
println!("Total processed: {}", total);
}
for_each_with clones the sender for each parallel task Rayon creates, not just once per thread — exactly what a multi-producer channel needs.
Next — Crossbeam, with scoped threads and lock-free data structures that go beyond what std provides.