
Atharva Pandey/Lesson 19: Thread-Local Storage — Per-thread state

Created Wed, 11 Dec 2024 09:55:00 +0000 Modified Wed, 11 Dec 2024 09:55:00 +0000

I was optimizing a JSON serializer that allocated a buffer for every call. Under profiling, those allocations were 30% of the cost. The fix? A thread-local buffer that gets reused across calls on the same thread. No synchronization needed. No contention. Each thread has its own buffer. Throughput doubled.

Thread-local storage is the ultimate escape hatch from synchronization overhead. If each thread has its own copy, there’s nothing to synchronize.

The Problem: Unnecessary Sharing

Not all state needs to be shared. Per-request allocators, cached computations, RNG state — these are inherently per-thread. Putting them behind a mutex adds overhead for zero benefit.

use std::sync::{Arc, Mutex};

// BAD: sharing a buffer pool across threads for no reason
struct SharedBufferPool {
    buffers: Mutex<Vec<Vec<u8>>>,
}

impl SharedBufferPool {
    fn get_buffer(&self) -> Vec<u8> {
        self.buffers.lock().unwrap().pop().unwrap_or_else(|| Vec::with_capacity(4096))
    }

    fn return_buffer(&self, buf: Vec<u8>) {
        self.buffers.lock().unwrap().push(buf);
    }
}

Every buffer checkout locks a mutex. With 32 threads serializing data at high rates, this becomes a bottleneck fast.

thread_local! Macro

Rust’s thread_local! macro creates a value that’s unique to each thread:

use std::cell::RefCell;

thread_local! {
    static BUFFER: RefCell<Vec<u8>> = RefCell::new(Vec::with_capacity(4096));
}

fn serialize_something(data: &[u8]) -> Vec<u8> {
    BUFFER.with(|buf| {
        let mut buf = buf.borrow_mut();
        buf.clear();
        buf.extend_from_slice(data);
        // Process buf...
        buf.clone() // return a copy of the result
    })
}

fn main() {
    use std::thread;

    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                for i in 0..1000 {
                    let data = format!("item-{}", i);
                    let _result = serialize_something(data.as_bytes());
                }
            });
        }
    });

    println!("Done — each thread reused its own buffer");
}

BUFFER looks like a global, but each thread gets its own Vec<u8>. No synchronization, no contention. The with() method gives you access to the current thread’s value.

How It Works

Thread-local storage uses OS and compiler facilities (pthread keys on Unix, TLS slots on Windows, or native ELF TLS where the platform supports it) to give each thread its own copy of the variable. Each copy is initialized lazily, the first time its thread accesses it.

use std::cell::Cell;

thread_local! {
    static COUNTER: Cell<u32> = Cell::new(0);
}

fn increment() -> u32 {
    COUNTER.with(|c| {
        let val = c.get() + 1;
        c.set(val);
        val
    })
}

fn main() {
    use std::thread;

    // Main thread
    println!("Main: {}", increment()); // 1
    println!("Main: {}", increment()); // 2

    // Spawned thread — gets its own counter starting at 0
    thread::spawn(|| {
        println!("Thread: {}", increment()); // 1 (not 3!)
        println!("Thread: {}", increment()); // 2
    })
    .join()
    .unwrap();

    // Main thread's counter is independent
    println!("Main: {}", increment()); // 3
}

Each thread starts with a fresh counter. There’s no sharing, no interference.

Practical Patterns

Per-Thread RNG

use std::cell::RefCell;

// Simple (bad) RNG for illustration — use `rand` crate in real code
struct SimpleRng {
    state: u64,
}

impl SimpleRng {
    fn new(seed: u64) -> Self {
        SimpleRng { state: seed }
    }

    fn next(&mut self) -> u64 {
        self.state = self.state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        self.state
    }
}

thread_local! {
    static RNG: RefCell<SimpleRng> = RefCell::new({
        // Seed from the thread ID for per-thread uniqueness.
        // (ThreadId::as_u64 is unstable, so hash the ID instead.)
        use std::hash::{Hash, Hasher};
        let mut hasher = std::collections::hash_map::DefaultHasher::new();
        std::thread::current().id().hash(&mut hasher);
        SimpleRng::new(hasher.finish())
    });
}

fn random_u64() -> u64 {
    RNG.with(|rng| rng.borrow_mut().next())
}

fn main() {
    use std::thread;

    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                let vals: Vec<u64> = (0..5).map(|_| random_u64()).collect();
                println!("{:?}: {:?}", thread::current().id(), vals);
            });
        }
    });
}

Each thread gets its own RNG with a unique seed. No lock contention, no coordination. This is essentially how rand::thread_rng() works under the hood: a lazily initialized, thread-local generator.

Per-Thread Metrics

use std::cell::Cell;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Thread-local accumulators, periodically flushed to global counters
thread_local! {
    static LOCAL_REQUESTS: Cell<u64> = Cell::new(0);
    static LOCAL_BYTES: Cell<u64> = Cell::new(0);
}

struct Metrics {
    total_requests: AtomicU64,
    total_bytes: AtomicU64,
}

impl Metrics {
    fn new() -> Self {
        Metrics {
            total_requests: AtomicU64::new(0),
            total_bytes: AtomicU64::new(0),
        }
    }

    fn record_request(&self, bytes: u64) {
        LOCAL_BYTES.with(|c| c.set(c.get() + bytes));
        let requests = LOCAL_REQUESTS.with(|c| {
            let v = c.get() + 1;
            c.set(v);
            v
        });

        // Flush every 1000 requests to avoid frequent atomics
        if requests >= 1000 {
            self.flush();
        }
    }

    fn flush(&self) {
        let reqs = LOCAL_REQUESTS.with(|c| {
            let v = c.get();
            c.set(0);
            v
        });
        let bytes = LOCAL_BYTES.with(|c| {
            let v = c.get();
            c.set(0);
            v
        });
        self.total_requests.fetch_add(reqs, Ordering::Relaxed);
        self.total_bytes.fetch_add(bytes, Ordering::Relaxed);
    }

    fn snapshot(&self) -> (u64, u64) {
        (
            self.total_requests.load(Ordering::Relaxed),
            self.total_bytes.load(Ordering::Relaxed),
        )
    }
}

fn main() {
    use std::thread;

    let metrics = Arc::new(Metrics::new());
    let mut handles = vec![];

    for _ in 0..8 {
        let m = Arc::clone(&metrics);
        handles.push(thread::spawn(move || {
            for _ in 0..10_000 {
                m.record_request(256);
            }
            m.flush(); // flush remaining
        }));
    }

    for h in handles {
        h.join().unwrap();
    }

    let (reqs, bytes) = metrics.snapshot();
    println!("Total requests: {}, bytes: {}", reqs, bytes);
}

This pattern — accumulate locally, flush globally — dramatically reduces contention. With 8 threads doing 10,000 requests each, naive atomic counters would mean 80,000 atomic increments per counter (one per request). With batched flushes, each counter is touched about 88 times (ten threshold flushes plus one final flush per thread): roughly a 1,000x reduction in cross-thread synchronization.

Thread-Local Allocator/Arena

use std::cell::RefCell;

struct Arena {
    storage: Vec<u8>,
    offset: usize,
}

impl Arena {
    fn new(capacity: usize) -> Self {
        Arena {
            storage: vec![0u8; capacity],
            offset: 0,
        }
    }

    fn alloc(&mut self, size: usize) -> &mut [u8] {
        let start = self.offset;
        let end = start + size;
        // Check capacity before committing the bump, so a failed
        // allocation doesn't leave the offset pointing past the end.
        if end > self.storage.len() {
            panic!("Arena out of memory");
        }
        self.offset = end;
        &mut self.storage[start..end]
    }

    fn reset(&mut self) {
        self.offset = 0;
    }
}

thread_local! {
    static ARENA: RefCell<Arena> = RefCell::new(Arena::new(1024 * 1024)); // 1MB per thread
}

fn with_arena<F, R>(f: F) -> R
where
    F: FnOnce(&mut Arena) -> R,
{
    ARENA.with(|arena| {
        let mut arena = arena.borrow_mut();
        arena.reset();
        f(&mut arena)
    })
}

fn main() {
    use std::thread;

    thread::scope(|s| {
        for id in 0..4 {
            s.spawn(move || {
                with_arena(|arena| {
                    let buf = arena.alloc(256);
                    buf[0] = id as u8;
                    println!("Thread {} allocated 256 bytes from arena", id);
                });
            });
        }
    });
}

Limitations

Thread-local storage has some gotchas:

1. Destructors run in unpredictable order:

use std::cell::RefCell;

thread_local! {
    static A: RefCell<String> = RefCell::new(String::from("hello"));
    static B: RefCell<String> = RefCell::new(String::from("world"));
}

// When the thread exits, A and B are dropped — but the order isn't guaranteed.
// If B's destructor tries to access A, it might find A already dropped.

2. Accessing thread-locals during thread shutdown can panic:

with() panics if the thread-local has already been destroyed. This happens during thread cleanup if one thread-local’s destructor tries to access another.

3. Thread-locals don’t work well with thread pools:

If you’re using a thread pool (Rayon, etc.), thread-locals persist across tasks. A value set by task A will still be there when task B runs on the same thread. This can cause subtle bugs if you expect fresh state per task.

use std::cell::Cell;

thread_local! {
    static LOCAL_COUNTER: Cell<u32> = Cell::new(0);
}

// With thread pools: always reset state at the start of each task
fn process_task(_task_id: usize) {
    LOCAL_COUNTER.with(|c| c.set(0)); // reset!
    // ... do work ...
}


4. No way to iterate over all thread-locals:

You can’t “collect all thread-local counters” from the main thread. You need an explicit flush mechanism (like the metrics pattern above) or a registry where threads register their values.

with_borrow and with_borrow_mut (Rust 1.73+)

Newer Rust versions added convenience methods:

use std::cell::RefCell;

thread_local! {
    static DATA: RefCell<Vec<i32>> = RefCell::new(Vec::new());
}

fn main() {
    // Before Rust 1.73
    DATA.with(|d| d.borrow_mut().push(42));
    let val = DATA.with(|d| d.borrow()[0]);

    // Rust 1.73+
    DATA.with_borrow_mut(|d| d.push(43));
    let val = DATA.with_borrow(|d| d[0]);
    println!("{}", val);
}

Less nesting, same semantics.

When to Use Thread-Local Storage

  • Caches — Per-thread caches that avoid synchronization
  • Buffers — Reusable scratch space (serialization, formatting)
  • RNG — Random number generators
  • Metrics — Local accumulators flushed to global counters
  • Context — Request context in server applications (but be careful with async: work-stealing runtimes can move a task between threads, so thread-locals don't follow the task — use task-local storage such as tokio's task_local! instead)

The pattern is always the same: “This state is per-thread, and sharing it would just add overhead.” When you find yourself wrapping something in Arc<Mutex<T>> that never actually needs to be shared, thread-local might be the answer.


Next — the actor model, where we turn message passing into a full architecture.