I spent two years writing Go services before I genuinely understood why iterating over a two-dimensional slice in the wrong order could tank my throughput by 5x. It wasn’t a bug. It wasn’t a bad algorithm. It was cache lines.
Arrays are the first data structure everyone learns and the last one most engineers actually understand. This is my attempt to fix that — not with theory, but with the reasoning that makes you a better systems engineer.
How It Actually Works
An array is a contiguous block of memory. That’s it. When you declare [8]int64 in Go, you’re reserving 64 bytes in a straight line — no pointers, no indirection, just raw memory addresses laid out sequentially.
```go
package main

import (
	"fmt"
	"unsafe"
)

func main() {
	arr := [4]int64{10, 20, 30, 40}
	for i := 0; i < 4; i++ {
		// address of element i = base address + i * sizeof(int64)
		ptr := uintptr(unsafe.Pointer(&arr[0])) + uintptr(i)*8
		fmt.Printf("arr[%d] = %d, address = 0x%x\n", i, arr[i], ptr)
	}
}

// Output (exact addresses will vary between runs):
// arr[0] = 10, address = 0xc0000b4000
// arr[1] = 20, address = 0xc0000b4008
// arr[2] = 30, address = 0xc0000b4010
// arr[3] = 40, address = 0xc0000b4018
```
Each element is exactly 8 bytes apart because int64 is 8 bytes. This predictability is the whole point.
Now here’s the thing your CS professor glossed over: your CPU doesn’t read one value at a time from RAM. It reads in cache lines — chunks of 64 bytes on virtually every modern processor. When you touch arr[0], the CPU fetches all of arr[0] through arr[7] (8 × 8 bytes = 64 bytes) into L1 cache in a single operation. The next 7 accesses are essentially free — they’re already sitting in cache.
This is called spatial locality, and it’s why an O(n) scan over an array routinely beats an O(log n) walk through a pointer-chasing structure: big-O counts operations, not cache misses.
When to Use It
Use arrays (or slices backed by arrays) when:
- You know the size upfront or can bound it reasonably
- You’re iterating sequentially — reading all elements, computing sums, filtering
- You’re doing numeric work: matrix operations, time-series data, sensor readings
- You’re implementing other data structures (the backing store for queues, hash tables, heaps)
Avoid arrays when:
- You need frequent insertions or deletions in the middle (O(n) shifts)
- You’re building something inherently pointer-linked (trees, graphs)
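The last bullet in the “use” list is worth making concrete. Here’s a minimal sketch of a fixed-capacity FIFO queue backed by a plain array — a ring buffer. The `RingBuffer` type and its method names are my own for illustration, not from any library:

```go
package main

import "fmt"

// RingBuffer is a fixed-capacity FIFO queue backed by a plain array.
// Enqueue and Dequeue are O(1); no element ever moves after it's written,
// and the whole buffer fits in one or two cache lines.
type RingBuffer struct {
	buf        [8]int64
	head, tail int // head: next slot to read, tail: next slot to write
	count      int
}

// Enqueue adds v to the back of the queue; returns false if full.
func (r *RingBuffer) Enqueue(v int64) bool {
	if r.count == len(r.buf) {
		return false
	}
	r.buf[r.tail] = v
	r.tail = (r.tail + 1) % len(r.buf)
	r.count++
	return true
}

// Dequeue removes and returns the front element; ok is false if empty.
func (r *RingBuffer) Dequeue() (v int64, ok bool) {
	if r.count == 0 {
		return 0, false
	}
	v = r.buf[r.head]
	r.head = (r.head + 1) % len(r.buf)
	r.count--
	return v, true
}

func main() {
	var q RingBuffer
	for i := int64(1); i <= 3; i++ {
		q.Enqueue(i * 10)
	}
	for {
		v, ok := q.Dequeue()
		if !ok {
			break
		}
		fmt.Println(v) // prints 10, 20, 30
	}
}
```

The head and tail indices wrap around instead of shifting elements, which is exactly how arrays sidestep their own O(n) insertion cost when access happens only at the ends.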
Production Example
Here’s the cache line effect in practice. Consider two ways to sum a 2D matrix:
```go
package main

import (
	"fmt"
	"time"
)

const N = 4096

var matrix [N][N]int64

func sumRowMajor() int64 {
	var total int64
	for i := 0; i < N; i++ {
		for j := 0; j < N; j++ {
			total += matrix[i][j] // row by row — sequential access
		}
	}
	return total
}

func sumColumnMajor() int64 {
	var total int64
	for j := 0; j < N; j++ {
		for i := 0; i < N; i++ {
			total += matrix[i][j] // column by column — stride-N access
		}
	}
	return total
}

func main() {
	start := time.Now()
	sumRowMajor()
	fmt.Println("Row-major:", time.Since(start))

	start = time.Now()
	sumColumnMajor()
	fmt.Println("Column-major:", time.Since(start))
}
```
On my machine, row-major runs in roughly 20ms. Column-major takes 120ms on the same data. Same number of additions, same algorithmic complexity, 6x difference in wall time. The column-major version jumps 4096 elements on each access, blowing the cache every single time.
This matters in production. If you’re building a metrics aggregation service that sums across time-series data, layout determines throughput — not the algorithm.
The Tradeoffs
Fixed size is a real constraint. Go’s built-in slices handle growth by allocating a new backing array and copying. When a slice doubles from capacity 512 to 1024, every element is copied. If you’re appending in a hot path, pre-allocate:
```go
// Bad: triggers multiple reallocations as the slice grows
result := []int64{}
for _, v := range source {
	result = append(result, process(v))
}
```

```go
// Good: a single allocation up front
result := make([]int64, 0, len(source))
for _, v := range source {
	result = append(result, process(v))
}
```
Middle insertions are expensive. Inserting at index i requires shifting everything from i to len-1 one position right — O(n) work. In practice, if you’re doing a lot of middle insertions, you either need a different data structure or you should rethink your data model.
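To make that cost concrete, here’s what a middle insertion looks like for a slice — the `copy` call is where the O(n) shift lives. `insertAt` is a name I made up for illustration; in recent Go versions the standard library’s `slices.Insert` does the same job:

```go
package main

import "fmt"

// insertAt inserts v at index i, shifting s[i:] one position right.
// The copy is the O(n) part: on average n/2 elements move per insert.
func insertAt(s []int64, i int, v int64) []int64 {
	s = append(s, 0)     // grow by one (may reallocate the backing array)
	copy(s[i+1:], s[i:]) // shift the tail right by one slot
	s[i] = v
	return s
}

func main() {
	s := []int64{10, 20, 40, 50}
	s = insertAt(s, 2, 30)
	fmt.Println(s) // [10 20 30 40 50]
}
```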
False sharing in concurrent code. If two goroutines write to different elements in the same cache line, they thrash each other’s caches even though they’re technically writing different memory. This is called false sharing and it’s a source of subtle performance degradation in concurrent services.
```go
// Dangerous: counters[0] and counters[1] likely share a cache line
var counters [2]int64
```

```go
// Better: pad each counter out to its own cache line
type PaddedCounter struct {
	value int64
	_     [56]byte // 8 + 56 = 64 bytes, one full cache line
}

var counters [2]PaddedCounter
```
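Here’s a rough way to observe the effect. Two goroutines each hammer their own counter; the only difference between the two runs is the memory layout. The `run` helper and the iteration count are my own scaffolding, and the timings are machine-dependent — on many CPUs the padded layout is noticeably faster, but treat this as a sketch, not a rigorous benchmark:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

const iters = 10_000_000

// PaddedCounter pads each value out to a full 64-byte cache line.
type PaddedCounter struct {
	value int64
	_     [56]byte
}

// run spins up two goroutines, each atomically incrementing its own
// counter iters times, and returns the elapsed wall time.
func run(c0, c1 *int64) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for _, p := range []*int64{c0, c1} {
		wg.Add(1)
		go func(c *int64) {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				atomic.AddInt64(c, 1)
			}
		}(p)
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	var shared [2]int64 // adjacent int64s — almost certainly one cache line
	var padded [2]PaddedCounter

	fmt.Println("unpadded:", run(&shared[0], &shared[1]))
	fmt.Println("padded:  ", run(&padded[0].value, &padded[1].value))
}
```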
Key Takeaway
Arrays aren’t interesting because they’re simple. They’re interesting because they’re the only data structure where the hardware does you a favor — prefetching, cache lines, and spatial locality all conspire to make sequential access blazingly fast. Every other data structure you’ll learn in this series trades away some of that locality for flexibility. Understanding what you’re trading away starts here.
When a senior engineer tells you “just use a slice,” they’re usually right. But now you know why they’re right, and you’ll know the specific situations where it stops being true.
Next: Lesson 2: Linked Lists — Almost never the right choice