Here’s a number that should change how you think about data structures: reading from L1 cache takes about 1 nanosecond. Reading from main memory takes about 100 nanoseconds. That’s a 100x penalty for a cache miss. On a modern CPU running at 4 GHz, a single cache miss stalls the processor for roughly 400 cycles. Four hundred cycles where your CPU is sitting there, doing nothing, waiting for data to arrive from RAM.
I spent a month optimizing a particle simulation once. The algorithm was perfect — O(n log n) neighbor finding, spatial hashing, the works. But it was crawling. Profiling showed 35% of cycles wasted on cache misses. The data layout was the bottleneck, not the algorithm. Restructuring the data — same algorithm, same logic — gave me a 4x speedup.
This lesson is about how to think about data layout.
How CPU Caches Work (The Short Version)
Your CPU has a hierarchy of caches:
L1 cache: ~32 KB, ~1 ns latency, ~4 cycles
L2 cache: ~256 KB, ~5 ns latency, ~12 cycles
L3 cache: ~8 MB, ~20 ns latency, ~40 cycles
Main RAM: ~GB, ~100 ns latency, ~200+ cycles
When you access memory, the CPU doesn’t fetch a single byte. It fetches a cache line — typically 64 bytes. This means that after you access one byte, the next 63 bytes are already in cache. If your data is laid out so that you access things sequentially, you get those cache fills for free. If your data is scattered across memory, every access is a potential cache miss.
The CPU also has a prefetcher that detects sequential access patterns and speculatively loads the next few cache lines before you ask for them. This is why iterating a Vec is incredibly fast — the prefetcher keeps the pipeline fed.
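To make this concrete, here is a minimal sketch (not from the original lesson) contrasting two traversals of the same buffer. The sequential walk lets the prefetcher stream cache lines ahead of you; the strided walk, with a stride of one full 64-byte line, touches a new cache line on every single access:

```rust
// Two ways to sum the same buffer. Both compute the same answer; when timed
// on a buffer larger than cache, the strided version is dramatically slower
// because every access lands on a fresh cache line.
fn sum_sequential(data: &[f32]) -> f32 {
    data.iter().sum()
}

fn sum_strided(data: &[f32], stride: usize) -> f32 {
    let mut total = 0.0;
    for start in 0..stride {
        let mut i = start;
        while i < data.len() {
            total += data[i];
            i += stride;
        }
    }
    total
}

fn main() {
    let data = vec![1.0f32; 1 << 20]; // 4 MiB — larger than typical L1/L2
    let seq = sum_sequential(&data);
    let strided = sum_strided(&data, 16); // 16 f32s = one 64-byte cache line
    assert_eq!(seq, strided); // same result; time them to see the difference
}
```

Wrap each call in a timer (or a criterion benchmark) to observe the gap; the arithmetic is identical, only the access pattern differs.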
Array of Structs vs Struct of Arrays
This is the most impactful layout decision you’ll make. Let me show you with a game-engine example.
Array of Structs (AoS) — The Default
// Array of Structs — how most people write it
struct Particle {
    position: [f32; 3], // 12 bytes
    velocity: [f32; 3], // 12 bytes
    color: [f32; 4],    // 16 bytes
    mass: f32,          // 4 bytes
    lifetime: f32,      // 4 bytes
    is_active: bool,    // 1 byte
    _padding: [u8; 3],  // explicit padding to 4-byte alignment
}
// Total: 52 bytes per particle
let particles: Vec<Particle> = Vec::with_capacity(100_000);
Memory layout:
[pos vel color mass life active | pos vel color mass life active | ...]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     particle 0 (52 bytes)            particle 1 (52 bytes)
Now imagine you want to update just the positions based on velocities. You iterate all particles and touch only the position and velocity fields. But each 64-byte cache line barely fits one particle. You’re loading 52 bytes per particle but using only 24 of them (position + velocity). That’s roughly 54% wasted bandwidth.
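You don’t have to take the byte counts on faith — a quick sketch with `std::mem::size_of` (sizes as produced by current rustc) confirms them:

```rust
// Verifying the AoS struct size quoted above.
#[allow(dead_code)]
struct Particle {
    position: [f32; 3],
    velocity: [f32; 3],
    color: [f32; 4],
    mass: f32,
    lifetime: f32,
    is_active: bool,
}

fn main() {
    // 12 + 12 + 16 + 4 + 4 + 1 = 49 bytes of payload,
    // padded to 52 for 4-byte alignment.
    assert_eq!(std::mem::size_of::<Particle>(), 52);
    // The position update reads only 24 of those bytes.
    println!("used per particle: 24 / {}", std::mem::size_of::<Particle>());
}
```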
Struct of Arrays (SoA) — Cache Optimized
// Struct of Arrays — data-oriented design
struct Particles {
    positions: Vec<[f32; 3]>,  // all positions contiguous
    velocities: Vec<[f32; 3]>, // all velocities contiguous
    colors: Vec<[f32; 4]>,     // all colors contiguous
    masses: Vec<f32>,
    lifetimes: Vec<f32>,
    active: Vec<bool>,
}

impl Particles {
    fn new(capacity: usize) -> Self {
        Self {
            positions: Vec::with_capacity(capacity),
            velocities: Vec::with_capacity(capacity),
            colors: Vec::with_capacity(capacity),
            masses: Vec::with_capacity(capacity),
            lifetimes: Vec::with_capacity(capacity),
            active: Vec::with_capacity(capacity),
        }
    }

    fn update_positions(&mut self, dt: f32) {
        // Now we're iterating two contiguous arrays.
        // Each cache line holds ~5 positions — perfect utilization.
        for (pos, vel) in self.positions.iter_mut()
            .zip(self.velocities.iter())
        {
            pos[0] += vel[0] * dt;
            pos[1] += vel[1] * dt;
            pos[2] += vel[2] * dt;
        }
    }
}
Memory layout:
positions: [pos0 pos1 pos2 pos3 pos4 | pos5 pos6 pos7 pos8 pos9 | ...]
velocities: [vel0 vel1 vel2 vel3 vel4 | vel5 vel6 vel7 vel8 vel9 | ...]
When updating positions, every byte we load is data we actually use. The prefetcher has a trivially predictable sequential pattern. And LLVM can auto-vectorize this with SIMD instructions because the data is contiguous and uniformly typed.
Benchmark: AoS vs SoA
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// AoS version
struct ParticleAoS {
    position: [f32; 3],
    velocity: [f32; 3],
    color: [f32; 4],
    mass: f32,
    lifetime: f32,
}

fn update_aos(particles: &mut [ParticleAoS], dt: f32) {
    for p in particles.iter_mut() {
        p.position[0] += p.velocity[0] * dt;
        p.position[1] += p.velocity[1] * dt;
        p.position[2] += p.velocity[2] * dt;
    }
}

// SoA version
struct ParticlesSoA {
    positions: Vec<[f32; 3]>,
    velocities: Vec<[f32; 3]>,
}

fn update_soa(particles: &mut ParticlesSoA, dt: f32) {
    for (pos, vel) in particles.positions.iter_mut()
        .zip(particles.velocities.iter())
    {
        pos[0] += vel[0] * dt;
        pos[1] += vel[1] * dt;
        pos[2] += vel[2] * dt;
    }
}

fn bench(c: &mut Criterion) {
    let n = 100_000;

    let mut aos: Vec<ParticleAoS> = (0..n)
        .map(|i| ParticleAoS {
            position: [i as f32, 0.0, 0.0],
            velocity: [1.0, 0.5, 0.25],
            color: [1.0, 1.0, 1.0, 1.0],
            mass: 1.0,
            lifetime: 10.0,
        })
        .collect();

    let mut soa = ParticlesSoA {
        positions: (0..n).map(|i| [i as f32, 0.0, 0.0]).collect(),
        velocities: vec![[1.0, 0.5, 0.25]; n],
    };

    c.bench_function("update_aos_100k", |b| {
        b.iter(|| update_aos(black_box(&mut aos), 0.016))
    });
    c.bench_function("update_soa_100k", |b| {
        b.iter(|| update_soa(black_box(&mut soa), 0.016))
    });
}

criterion_group!(benches, bench);
criterion_main!(benches);
// Typical results:
// update_aos_100k: ~310 µs
// update_soa_100k: ~85 µs
3.6x faster. Same algorithm, same computation, different data layout. The SoA version is faster because:
- Better cache utilization (100% of loaded data is used)
- Prefetcher works perfectly (sequential access)
- LLVM can vectorize the SoA version with SIMD
Hot/Cold Splitting
Sometimes full SoA is overkill. A simpler approach: split your struct into hot (frequently accessed) and cold (rarely accessed) parts:
use std::collections::HashMap;

// Instead of one big struct...
struct Entity {
    // Hot — accessed every frame
    position: [f32; 3],
    velocity: [f32; 3],
    health: f32,
    // Cold — accessed rarely
    name: String,
    description: String,
    creation_time: u64,
    metadata: HashMap<String, String>,
}

// Split into hot and cold
struct EntityHot {
    position: [f32; 3],
    velocity: [f32; 3],
    health: f32,
}

struct EntityCold {
    name: String,
    description: String,
    creation_time: u64,
    metadata: HashMap<String, String>,
}

struct World {
    hot: Vec<EntityHot>,   // iterated every frame
    cold: Vec<EntityCold>, // accessed on demand
}
Now the hot data is tightly packed. Each cache line holds multiple EntityHot structs. The cold data doesn’t pollute the cache during the update loop.
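Here is a minimal sketch of what the frame loop looks like with this split; the `update` and `name_of` methods are hypothetical helpers added for illustration, not part of the original:

```rust
// Per-frame work streams only the tightly packed hot array; cold data is
// reached by index only when something actually asks for it.
#[allow(dead_code)]
struct EntityHot {
    position: [f32; 3],
    velocity: [f32; 3],
    health: f32,
}

struct EntityCold {
    name: String,
}

struct World {
    hot: Vec<EntityHot>,
    cold: Vec<EntityCold>, // parallel to `hot`: same index, same entity
}

impl World {
    fn update(&mut self, dt: f32) {
        // Touches only hot data — cold fields never enter the cache here
        for e in &mut self.hot {
            for i in 0..3 {
                e.position[i] += e.velocity[i] * dt;
            }
        }
    }

    fn name_of(&self, idx: usize) -> &str {
        &self.cold[idx].name // cold lookup, on demand
    }
}

fn main() {
    let mut world = World {
        hot: vec![EntityHot { position: [0.0; 3], velocity: [1.0, 0.0, 0.0], health: 100.0 }],
        cold: vec![EntityCold { name: "player".into() }],
    };
    world.update(0.5);
    assert_eq!(world.hot[0].position[0], 0.5);
    assert_eq!(world.name_of(0), "player");
}
```

Keeping the two Vecs index-aligned is the one invariant you must maintain; in exchange, the update loop never drags strings and hash maps through the cache.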
Pointer Chasing — The Cache Killer
Every time you follow a pointer, you potentially pay a cache miss. Linked lists are the classic example:
// Linked list: every node is a separate allocation
// Following ->next is a potential cache miss
// DO NOT use LinkedList in performance-sensitive code
// Vec: contiguous memory, sequential access
// No pointer chasing, prefetcher-friendly
let data: Vec<Item> = items.collect();
But it’s not just linked lists. Any Box, String, or Vec inside your struct is a pointer to heap memory:
struct Record {
    id: u64,
    name: String,      // pointer → heap
    tags: Vec<String>, // pointer → heap → more pointers → more heap
}
Iterating a Vec<Record> means: read Record from Vec (sequential, fast), then dereference name (cache miss), then dereference each tag (more cache misses). The Vec iteration is fast, but the data it points to is scattered.
Mitigation strategies:
- Use CompactString (Lesson 6) to keep short strings inline
- Use SmallVec to keep short arrays inline
- Use flat indices instead of pointers when possible
// Instead of Box<Node> for tree nodes, use a Vec + indices
struct Arena {
    nodes: Vec<Node>,
}

struct Node {
    value: u64,
    left: Option<u32>,  // index into Arena.nodes
    right: Option<u32>, // index into Arena.nodes
}
// All nodes are contiguous in memory.
// "Following a pointer" is just indexing into the Vec.
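To show the arena in use, here is a small sketch; the `add` and `sum` helpers are hypothetical additions for the example:

```rust
// Building and walking a three-node tree stored in a contiguous Vec.
struct Node {
    value: u64,
    left: Option<u32>,  // index into Arena.nodes
    right: Option<u32>, // index into Arena.nodes
}

struct Arena {
    nodes: Vec<Node>,
}

impl Arena {
    // Push a node and return its index — the "pointer" in this scheme
    fn add(&mut self, value: u64, left: Option<u32>, right: Option<u32>) -> u32 {
        self.nodes.push(Node { value, left, right });
        (self.nodes.len() - 1) as u32
    }

    // Recursive traversal: every "dereference" is an index into one Vec
    fn sum(&self, idx: u32) -> u64 {
        let node = &self.nodes[idx as usize];
        let mut total = node.value;
        if let Some(l) = node.left {
            total += self.sum(l);
        }
        if let Some(r) = node.right {
            total += self.sum(r);
        }
        total
    }
}

fn main() {
    let mut arena = Arena { nodes: Vec::new() };
    let left = arena.add(2, None, None);
    let right = arena.add(3, None, None);
    let root = arena.add(1, Some(left), Some(right));
    assert_eq!(arena.sum(root), 6); // all three nodes live side by side
}
```

As a bonus, `u32` indices are half the size of 64-bit pointers, so twice as many links fit per cache line.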
Struct Size and Alignment
Rust lays out struct fields with padding for alignment. This can waste space:
// Poor layout — lots of padding (shown with #[repr(C)], which keeps field order)
#[repr(C)]
struct Wasteful {
    a: u8,  // 1 byte + 7 padding
    b: u64, // 8 bytes
    c: u8,  // 1 byte + 3 padding
    d: u32, // 4 bytes
}
// Size: 24 bytes (only 14 bytes of data)
// Rust's default repr reorders fields for optimal packing (16 bytes here)
// But #[repr(C)] preserves your order — avoid it unless needed
By default, Rust is free to reorder struct fields for better packing. But if you’ve used #[repr(C)] (for FFI compatibility), you’ve forced the compiler to use your field order, which might be suboptimal.
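A quick sketch makes the difference visible (the 16-byte figure is what current rustc produces; the default layout is not guaranteed by the language):

```rust
// Same four fields, two layouts. ReprC is pinned to declaration order;
// the default repr lets the compiler repack.
#[allow(dead_code)]
struct Reordered {
    a: u8,
    b: u64,
    c: u8,
    d: u32,
}

#[allow(dead_code)]
#[repr(C)]
struct ReprC {
    a: u8,
    b: u64,
    c: u8,
    d: u32,
}

fn main() {
    assert_eq!(std::mem::size_of::<ReprC>(), 24); // guaranteed by repr(C)
    // Default repr is free to reorder; current rustc packs this to 16 bytes.
    assert!(std::mem::size_of::<Reordered>() <= std::mem::size_of::<ReprC>());
    println!("default: {} bytes, repr(C): {} bytes",
        std::mem::size_of::<Reordered>(),
        std::mem::size_of::<ReprC>());
}
```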
You can check struct sizes and alignment:
println!("Size: {}", std::mem::size_of::<MyStruct>());
println!("Align: {}", std::mem::align_of::<MyStruct>());
For cache performance, try to keep frequently-accessed structs within 64 bytes (one cache line) or within 128 bytes (two cache lines).
False Sharing
In concurrent code, false sharing happens when two threads modify different variables that happen to be on the same cache line. The CPU’s cache coherency protocol forces both cores to invalidate and reload the line, even though they’re not actually sharing data.
use std::sync::atomic::AtomicU64;

// BAD: both counters on the same cache line
struct Counters {
    counter_a: AtomicU64, // thread 1 writes this
    counter_b: AtomicU64, // thread 2 writes this
    // These are 8 bytes apart — same cache line!
}

// GOOD: pad to separate cache lines
#[repr(C)]
struct PaddedCounters {
    counter_a: AtomicU64,
    _pad_a: [u8; 56], // fill rest of 64-byte cache line
    counter_b: AtomicU64,
    _pad_b: [u8; 56],
}
Or use crossbeam_utils::CachePadded:
use crossbeam_utils::CachePadded;
use std::sync::atomic::AtomicU64;

struct Counters {
    counter_a: CachePadded<AtomicU64>,
    counter_b: CachePadded<AtomicU64>,
}
False sharing can cause 10-50x slowdowns in concurrent code. If you’re seeing unexpectedly bad scaling with more threads, check for this.
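You can verify that the manual padding actually works with a small sketch that checks the distance between the two counters in memory:

```rust
use std::sync::atomic::AtomicU64;

// With this repr(C) layout, counter_b starts 64 bytes after counter_a,
// so the two counters can never share a 64-byte cache line.
#[repr(C)]
struct PaddedCounters {
    counter_a: AtomicU64,
    _pad_a: [u8; 56], // 8-byte counter + 56 bytes of padding = 64
    counter_b: AtomicU64,
    _pad_b: [u8; 56],
}

fn main() {
    let c = PaddedCounters {
        counter_a: AtomicU64::new(0),
        _pad_a: [0; 56],
        counter_b: AtomicU64::new(0),
        _pad_b: [0; 56],
    };
    let a = &c.counter_a as *const AtomicU64 as usize;
    let b = &c.counter_b as *const AtomicU64 as usize;
    assert!(b - a >= 64); // at least one full cache line apart
}
```

Note that `CachePadded` does the same thing more portably: it reads the target's cache-line size instead of hard-coding 64.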
Measuring Cache Performance
Use perf stat to measure cache behavior:
perf stat -e cache-misses,cache-references,L1-dcache-load-misses,LLC-load-misses \
./target/release/my_program
# Sample output:
# 1,234,567 cache-misses # 3.45% of all cache refs
# 35,789,012 cache-references
# 456,789 L1-dcache-load-misses
# 23,456 LLC-load-misses # Last-level cache misses = main memory trips
A cache miss rate above 5% usually means your data layout needs work. Below 1% is excellent.
Practical Guidelines
Use Vec for almost everything. Contiguous memory is king. Even when you need a “tree” or “graph,” consider storing nodes in a Vec and using indices.
Profile before restructuring. SoA is more complex to work with than AoS. Only switch if profiling shows cache misses are your bottleneck.
Keep hot data small. The less data you touch per iteration, the more elements fit in cache. Remove unused fields from hot structs.
Avoid unnecessary indirection. Every Box, String, or Vec field is a pointer chase. Use inline alternatives when the data is small.
Sort your data. If you’re doing lookups, sorted data has better cache behavior than random-order data because the prefetcher can predict access patterns.
Batch your operations. Process all items for operation A, then all items for operation B. Don’t interleave — it thrashes the cache.
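The batching guideline can be sketched on SoA data like this; the field names and the per-pass operations are illustrative, not from the original:

```rust
// Batched passes: pass A streams only `positions`, pass B streams only
// `lifetimes`. An interleaved per-particle loop would pull cache lines
// from both arrays in alternation, halving effective locality.
struct Particles {
    positions: Vec<[f32; 3]>,
    lifetimes: Vec<f32>,
}

fn step(p: &mut Particles, dt: f32) {
    // Pass A over every particle: touches only the positions array
    for pos in p.positions.iter_mut() {
        pos[0] += dt;
    }
    // Pass B over every particle: touches only the lifetimes array
    for life in p.lifetimes.iter_mut() {
        *life -= dt;
    }
}

fn main() {
    let mut p = Particles {
        positions: vec![[0.0; 3]; 4],
        lifetimes: vec![1.0; 4],
    };
    step(&mut p, 0.5);
    assert_eq!(p.positions[0][0], 0.5);
    assert_eq!(p.lifetimes[0], 0.5);
}
```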
The Takeaway
Cache-friendly data layout can give you 2-10x speedups with zero algorithmic changes. The key insight: CPUs are fast, memory is slow, and caches bridge the gap — but only if your data is laid out to exploit them.
Think about what data you access together and keep it together. Think about what data you access sequentially and make it contiguous. The hardware will reward you.