
Atharva Pandey/Lesson 5: Profiling in Production — pprof is not just for development

Created Sun, 01 Dec 2024 00:00:00 +0000 Modified Sun, 01 Dec 2024 00:00:00 +0000

I used to think profiling was something you did when you had a performance problem: reproduce it locally, run pprof, stare at the flame graph, fix the hot path. A fire-fighting tool, not an always-on system.

Then we had a memory leak in production that we couldn’t reproduce locally. The heap grew by about 50 MB per hour under real traffic patterns, causing an OOM every eight hours and a rolling restart across the fleet. Our staging environment used synthetic load that didn’t trigger the leak. We had no profile data from when the leak was building — only from after the crash, when the heap had already been cleared.

We deployed continuous profiling. Three days later a similar leak happened in a different service. This time we had heap profiles from the entire growth period, timestamped and correlated to deployment markers. The fix took 20 minutes: a goroutine was holding a reference to a growing slice in a closure that was never garbage collected because the goroutine itself never exited.

The Problem

The common mental model: pprof is a development tool you use with benchmarks. You run go test -bench . -cpuprofile cpu.out, look at it once, and move on. This model misses the cases where performance problems only manifest under real production traffic — specific user behavior, specific data shapes, cardinality that your synthetic load doesn’t cover.

The other mistake is leaving pprof disabled in production out of concern for overhead:

// people add this and then comment it out "for performance"
// import _ "net/http/pprof"

The pprof HTTP handler adds approximately 5% CPU overhead when actively sampling, and essentially zero overhead when idle. The default handlers are idle unless you hit the /debug/pprof/ endpoint. Leaving them enabled costs nothing in normal operation and gives you everything when you need it.

The third problem is running a one-off profile at the wrong moment. If your service leaks memory over 8 hours, a 30-second heap profile taken 2 minutes after startup tells you nothing useful. You need continuous, time-series profiling data.

The Idiomatic Way

Step 1: Enable the pprof endpoints — and lock them down

package main

import (
    "log/slog"
    "net/http"
    _ "net/http/pprof"  // registers handlers on http.DefaultServeMux
    "os"
)

func main() {
    // serve pprof on a separate port, not your public API port
    // bind to localhost or an internal network interface only
    go func() {
        addr := os.Getenv("PPROF_ADDR")
        if addr == "" {
            addr = "localhost:6060"
        }
        if err := http.ListenAndServe(addr, nil); err != nil {
            slog.Error("pprof server failed", "err", err)
        }
    }()

    // your main API server on a different port
    if err := http.ListenAndServe(":8080", apiRouter()); err != nil {
        slog.Error("api server failed", "err", err)
        os.Exit(1)
    }
}

Never expose pprof on your public port. Bind it to localhost or an internal network range, then access it via kubectl port-forward or an internal load balancer. Exposing pprof to the internet allows any caller to CPU-starve your service by requesting a 30-second CPU profile.

Step 2: Take targeted profiles

The pprof HTTP endpoints accept query parameters that control duration and depth:

# CPU profile for 30 seconds (quote the URL so the shell doesn't interpret ?)
curl -o cpu.prof "http://localhost:6060/debug/pprof/profile?seconds=30"

# heap profile (current allocations)
curl -o heap.prof http://localhost:6060/debug/pprof/heap

# goroutine profile — shows all running goroutines and their stacks
curl -o goroutine.prof http://localhost:6060/debug/pprof/goroutine

# mutex contention — shows where goroutines block on mutexes
curl -o mutex.prof http://localhost:6060/debug/pprof/mutex

# analyze locally
go tool pprof -http=:8081 cpu.prof

The goroutine profile is underrated. When you suspect a goroutine leak, download it and look at the count over time:

# download goroutine profile every 60 seconds for 10 minutes
for i in $(seq 1 10); do
    curl -o "goroutine-$i.prof" "http://localhost:6060/debug/pprof/goroutine?debug=1"
    sleep 60
done

# check goroutine counts
grep "^goroutine" goroutine-*.prof

If the goroutine count grows monotonically, you have a leak.

Step 3: Continuous profiling with Pyroscope

For always-on profiling that gives you time-series data, integrate Pyroscope (or Grafana Alloy with the profiling feature):

import (
    "log/slog"
    "os"

    "github.com/grafana/pyroscope-go"
)

func initProfiling(serviceName string) {
    _, err := pyroscope.Start(pyroscope.Config{
        ApplicationName: serviceName,
        ServerAddress:   "http://pyroscope:4040",
        Logger:          pyroscope.StandardLogger,
        ProfileTypes: []pyroscope.ProfileType{
            pyroscope.ProfileCPU,
            pyroscope.ProfileAllocObjects,
            pyroscope.ProfileAllocSpace,
            pyroscope.ProfileInuseObjects,
            pyroscope.ProfileInuseSpace,
            pyroscope.ProfileGoroutines,
        },
        // tag with deployment metadata for correlation
        Tags: map[string]string{
            "version": os.Getenv("APP_VERSION"),
            "region":  os.Getenv("AWS_REGION"),
        },
    })
    if err != nil {
        slog.Error("pyroscope start failed", "err", err)
    }
}

Pyroscope samples the CPU profile at 100Hz and streams it continuously. You can query the flame graph for any time window, compare before and after a deploy, and correlate a memory growth event with the specific function that started allocating.

In The Wild

The workflow I use in practice:

When alerted that heap usage is growing, I first look at the Pyroscope dashboard to see which function started allocating more after the last deployment. If it’s a new deployment, I compare the flame graph from the hour before and the hour after.

For diagnosing goroutine leaks, I wrote a small monitoring goroutine that emits the goroutine count as a metric every 30 seconds:

func monitorRuntimeMetrics(ctx context.Context) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            var ms runtime.MemStats
            runtime.ReadMemStats(&ms)

            // goroutineCount etc. are Prometheus gauges registered elsewhere
            n := runtime.NumGoroutine()
            goroutineCount.Set(float64(n))
            heapAllocBytes.Set(float64(ms.HeapAlloc))
            heapInuseBytes.Set(float64(ms.HeapInuse))
            gcPauseTotalNs.Set(float64(ms.PauseTotalNs))
            nextGCBytes.Set(float64(ms.NextGC))

            // alert threshold: if goroutines exceed 10,000, something is leaking
            if n > 10000 {
                slog.Warn("goroutine count high — possible leak", "count", n)
            }
        }
    }
}

When the goroutine metric spikes in Prometheus, I take a goroutine profile at that moment and look for goroutines blocked on the same operation — that’s always where the leak is.

For CPU spikes during traffic bursts, the trace tool (distinct from distributed tracing) is the right instrument:

import (
    "log"
    "os"
    "runtime/trace"
    "time"
)

// capture a 5-second execution trace
f, err := os.Create("trace.out")
if err != nil {
    log.Fatal(err)
}
defer f.Close()
if err := trace.Start(f); err != nil {
    log.Fatal(err)
}
time.Sleep(5 * time.Second)
trace.Stop()

// analyze
// go tool trace trace.out

The execution trace shows goroutine scheduling, syscall blocks, network waits, and GC stop-the-world events on a timeline with microsecond resolution. It answers the question pprof cannot: “why did my goroutine wait 200ms before getting scheduled?”

The Gotchas

runtime.ReadMemStats stops the world. Calling it frequently introduces GC-like pauses: 30-second intervals are safe, while anything more frequent than every 5 seconds starts to show up in latency distributions.

Mutex and block profiling must be explicitly enabled.

// in main(), before starting your server
runtime.SetMutexProfileFraction(5)  // sample 1/5 mutex contentions
runtime.SetBlockProfileRate(1)      // sample all blocking events (expensive — use sparingly)

SetBlockProfileRate(1) is very expensive in production — it profiles every goroutine block. Use it briefly during an incident, not continuously. SetMutexProfileFraction at 5–10 is safe to leave on.

Flame graphs show hot paths, not slow paths. A CPU flame graph shows where CPU time is being spent. A function that’s slow because it waits on I/O won’t show up on a CPU profile because it’s not consuming CPU while waiting. For I/O latency, use distributed traces. For CPU saturation, use pprof. They answer different questions.

Profile samples are not deterministic across runs. Don't compare absolute sample counts between profiles; compare the proportional distribution of where time is spent, and anchor it to throughput. A function that drops from 30% of samples to 5% has shrunk relative to the rest of the program; a function that holds steady at 10% while total throughput doubles is actually doing twice as much absolute work.

Key Takeaway

pprof endpoints have negligible idle cost — run them in production. Use them on demand for immediate diagnosis: CPU profile for hot paths, heap profile for memory growth, goroutine profile for leak detection. Add continuous profiling with Pyroscope for time-series flame graphs that let you compare before and after a deployment. Emit goroutine count and heap size as Prometheus metrics with thresholds to get alerted before a leak becomes an OOM. The execution trace tool (runtime/trace) is the right instrument when you need to understand scheduling latency, not just CPU consumption. Between these tools, almost any Go performance problem in production becomes diagnosable without a reproduction environment.

