We deployed a new checkout flow on a Friday afternoon. It had passed code review, passed testing, passed staging load tests. By Friday evening, the error rate was climbing. The new flow had a race condition that only manifested under specific mobile browser timing that we hadn’t tested. Without a feature flag, the fix would have required an emergency deployment — 25 minutes of build time, deployment, and validation. With a feature flag, the rollback was turning off a switch. Thirty seconds.
Feature flags are a safety mechanism, not just a release tool. I’ve changed my opinion on them: they’re now mandatory for any user-facing change in production.
How It Works
A feature flag is a conditional in your code that evaluates at runtime:
```go
if flags.IsEnabled(ctx, "new_checkout_flow") {
	return newCheckoutHandler(w, r)
}
return legacyCheckoutHandler(w, r)
```
The flag value comes from a flag service — not from a config file that requires a deployment to change. The flag service evaluates the flag at runtime, usually with:
- Boolean: On or off for everyone
- Percentage rollout: On for X% of users (consistent per user)
- User targeting: On for specific user IDs, emails, or attributes
- Rule-based: On for users matching certain criteria (country, plan tier, beta users)
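As a sketch of the rule-based case (the `Rule` type and field names here are illustrative, not from any particular flag service), evaluation is just matching user attributes against a set of conditions:

```go
package main

import "fmt"

// Rule matches users on a single attribute; names are illustrative.
type Rule struct {
	Attribute string
	Values    []string // user matches if their attribute is in this set
}

func matches(rule Rule, attrs map[string]string) bool {
	got, ok := attrs[rule.Attribute]
	if !ok {
		return false
	}
	for _, v := range rule.Values {
		if v == got {
			return true
		}
	}
	return false
}

// isEnabled returns true if the user matches every rule (AND semantics).
func isEnabled(rules []Rule, attrs map[string]string) bool {
	for _, r := range rules {
		if !matches(r, attrs) {
			return false
		}
	}
	return true
}

func main() {
	rules := []Rule{
		{Attribute: "country", Values: []string{"DE", "FR"}},
		{Attribute: "plan", Values: []string{"pro"}},
	}
	fmt.Println(isEnabled(rules, map[string]string{"country": "DE", "plan": "pro"})) // true
	fmt.Println(isEnabled(rules, map[string]string{"country": "US", "plan": "pro"})) // false
}
```

Real flag services support OR groups, numeric comparisons, and segment references on top of this, but the core is the same attribute matching.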
Flag Lifecycle
Flags have types that determine their lifecycle:
- Release flag: short-lived. Gates a new feature during rollout. Remove within 1-2 sprints of reaching 100%.
- Ops (kill switch): long-lived. Disables a feature in emergencies. Keep indefinitely; the value is the ability to turn it off.
- Experiment flag: A/B test. Runs for the duration of the experiment. Analyze, decide, remove.
- Permission flag: permanent. Controls access by user tier or org. These never go away.
The most common mistake: treating all flags as release flags and never removing them. Code with 30 unremoved flags is a maintenance nightmare — every code path branches, tests must cover all combinations, and no one remembers what each flag does.
Consistent Assignment
For a percentage rollout, the user should get the same experience on every request. You don’t want a user who got the new checkout flow on page load to get the old one when they submit the form. Consistent hashing by user ID achieves this:
```go
func (f *Flags) rolloutPercentage(flagName, userID string) bool {
	h := fnv.New32a()
	h.Write([]byte(flagName + ":" + userID))
	// Hash to 0-99 range
	bucket := h.Sum32() % 100
	return int(bucket) < f.config[flagName].Rollout
}
```
Same flag name + user ID always produces the same bucket, so the same decision.
OpenFeature Standard
OpenFeature is an emerging standard for feature flag APIs, similar to how OpenTelemetry standardized observability. It lets you swap flag providers (LaunchDarkly, Unleash, Flagsmith, custom) without changing application code.
```go
import (
	"github.com/open-feature/go-sdk/openfeature"
	ldprovider "github.com/launchdarkly/openfeature-go-provider"
)

// Initialize once
provider := ldprovider.NewProvider(ldClient)
openfeature.SetProvider(provider)

// Use the standard API anywhere
client := openfeature.NewClient("my-app")
enabled, _ := client.BooleanValue(ctx, "new_checkout_flow", false,
	openfeature.NewEvaluationContext("user-123", map[string]interface{}{
		"email": "user@example.com",
		"plan":  "pro",
	}))
```
If you later switch from LaunchDarkly to Unleash, you change one line (the provider initialization) and nothing else.
Why It Matters
Feature flags decouple deployment from release. This changes the release cycle fundamentally:
- Code can be merged and deployed while incomplete (behind an off flag)
- Releases can be rolled out to 1% → 10% → 50% → 100%, with validation at each step
- Rollbacks are instant (turn off the flag) instead of slow (deploy old code)
- A/B tests are first-class, not hacks
For teams practicing continuous deployment — deploying multiple times per day — feature flags are the mechanism that makes this safe for users. You deploy often; you release on your terms.
Production Example
A complete feature flag setup in Go with a local fallback cache:
```go
package flags

import (
	"context"
	"encoding/json"
	"hash/fnv"
	"log/slog"
	"sync"
	"time"

	"github.com/redis/go-redis/v9"
)

type FlagConfig struct {
	Enabled     bool     `json:"enabled"`
	Rollout     int      `json:"rollout"`    // 0-100, percentage
	AllowList   []string `json:"allow_list"` // specific user IDs
	DenyList    []string `json:"deny_list"`
	Description string   `json:"description"`
	Owner       string   `json:"owner"`
	ExpiresAt   string   `json:"expires_at"` // for release flags
}

type Service struct {
	redis    *redis.Client
	cache    sync.Map
	cacheTTL time.Duration
}

// cacheEntry pairs a cached config with its local expiry time.
type cacheEntry struct {
	config    FlagConfig
	expiresAt time.Time
}

func (s *Service) IsEnabledForUser(ctx context.Context, flagName, userID string) bool {
	config, err := s.getConfig(ctx, flagName)
	if err != nil {
		slog.WarnContext(ctx, "flag lookup failed, defaulting to false",
			"flag", flagName, "error", err)
		return false // fail closed — safe default
	}

	// Explicit deny list check first
	for _, id := range config.DenyList {
		if id == userID {
			return false
		}
	}

	// Explicit allow list
	for _, id := range config.AllowList {
		if id == userID {
			return true
		}
	}

	// Global on/off
	if !config.Enabled {
		return false
	}

	// Percentage rollout
	if config.Rollout >= 100 {
		return true
	}
	if config.Rollout <= 0 {
		return false
	}
	h := fnv.New32a()
	h.Write([]byte(flagName + ":" + userID))
	return int(h.Sum32()%100) < config.Rollout
}

func (s *Service) getConfig(ctx context.Context, flagName string) (FlagConfig, error) {
	// Check local cache
	if cached, ok := s.cache.Load(flagName); ok {
		entry := cached.(cacheEntry)
		if time.Now().Before(entry.expiresAt) {
			return entry.config, nil
		}
	}

	// Fetch from Redis
	data, err := s.redis.Get(ctx, "flag:"+flagName).Bytes()
	if err == redis.Nil {
		return FlagConfig{}, nil // Flag doesn't exist → disabled
	}
	if err != nil {
		return FlagConfig{}, err
	}

	var config FlagConfig
	if err := json.Unmarshal(data, &config); err != nil {
		return FlagConfig{}, err
	}

	s.cache.Store(flagName, cacheEntry{
		config:    config,
		expiresAt: time.Now().Add(s.cacheTTL),
	})
	return config, nil
}
```
Progressive rollout in practice:
```shell
# Week 1: Internal testing only
redis-cli SET flag:new_checkout_flow '{"enabled":true,"rollout":0,"allow_list":["user-atharva","user-priya"]}'

# Week 2: 5% of users
redis-cli SET flag:new_checkout_flow '{"enabled":true,"rollout":5}'

# Week 3: 25%
redis-cli SET flag:new_checkout_flow '{"enabled":true,"rollout":25}'

# Week 4: 100%
redis-cli SET flag:new_checkout_flow '{"enabled":true,"rollout":100}'

# After stable for 1 sprint: remove flag from code, remove from Redis
```
Ops kill switch for an external dependency:
```go
func (h *PaymentHandler) Authorize(w http.ResponseWriter, r *http.Request) {
	// AuthorizeRequest is the request payload type (definition omitted here)
	var req AuthorizeRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "invalid request body", http.StatusBadRequest)
		return
	}

	// Kill switch: if Stripe is having issues, fall back to the backup processor
	if h.flags.IsEnabled(r.Context(), "payment_processor_stripe") {
		result, err := h.stripe.Authorize(r.Context(), req)
		if err != nil {
			// Can turn off this flag immediately from the ops console,
			// without any deployment
			writeError(w, err)
			return
		}
		writeJSON(w, result)
		return
	}

	// Fallback to backup processor
	result, err := h.backup.Authorize(r.Context(), req)
	if err != nil {
		writeError(w, err)
		return
	}
	writeJSON(w, result)
}
```
Adding flag observability — knowing which flag variant users see:
```go
// Log flag evaluations for analysis
slog.InfoContext(ctx, "flag evaluated",
	"flag", "new_checkout_flow",
	"user_id", userID,
	"result", enabled,
	"reason", reason, // "allow_list", "percentage", "disabled"
)

// Track in metrics (flagEvaluations is a prometheus CounterVec with
// "flag" and "enabled" labels, registered at startup)
flagEvaluations.WithLabelValues(flagName, strconv.FormatBool(enabled)).Inc()
```
The Tradeoffs
Flag evaluation latency: Every flagged code path makes a flag service call (or cache read). With a local cache, this is sub-millisecond. Without caching, a Redis round trip adds ~1ms per flag check. For high-frequency code paths (every HTTP request), cache aggressively. For low-frequency code (batch jobs), a direct lookup is fine.
Technical debt from flag accumulation: Every active flag doubles the number of code paths to test. 10 flags = up to 1,024 combinations. In practice, not all flags interact, but the complexity is real. Enforce a flag lifecycle policy: every release flag has an expiry date, and clearing old flags is part of sprint planning.
Testing with flags: Your CI should test with flag combinations, not just the default state. At minimum: test the flag-off path (legacy behavior) and the flag-on path (new behavior). For flags that interact, test the critical combinations.
Flag state in tests: Tests that depend on flag state need flag state to be deterministic. Inject a test flag provider that returns predetermined values:
```go
// In tests:
testFlags := flags.NewStatic(map[string]bool{
	"new_checkout_flow":   true,
	"legacy_payment_path": false,
})
handler := NewCheckoutHandler(testFlags, ...)
```
Multivariate flags: Boolean flags are simple. Multivariate flags (string or numeric values — “which algorithm version?”, “what timeout value?”) are more powerful but harder to reason about. Start with boolean flags. Reach for multivariate only when you have a clear need.
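To make the "harder to reason about" concrete, here is a sketch of multivariate evaluation (the `Variant` type is illustrative): instead of a bool, the flag resolves to one of several named values, with users assigned to variants by weight via the same consistent hashing as before:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Variant pairs a value (e.g. an algorithm version) with a rollout weight.
type Variant struct {
	Value  string
	Weight int // weights should sum to 100
}

// variantFor assigns a user to a variant: hash to a 0-99 bucket, then walk
// the cumulative weights until the bucket falls inside a variant's range.
func variantFor(flagName, userID string, variants []Variant) string {
	h := fnv.New32a()
	h.Write([]byte(flagName + ":" + userID))
	bucket := int(h.Sum32() % 100)

	cumulative := 0
	for _, v := range variants {
		cumulative += v.Weight
		if bucket < cumulative {
			return v.Value
		}
	}
	return variants[len(variants)-1].Value // fallback if weights sum to < 100
}

func main() {
	variants := []Variant{
		{Value: "ranking_v1", Weight: 50},
		{Value: "ranking_v2", Weight: 50},
	}
	// Same user always lands in the same variant.
	fmt.Println(variantFor("search_ranking", "user-42", variants) ==
		variantFor("search_ranking", "user-42", variants)) // true
}
```

The extra complexity is everything downstream: each variant is a distinct code path to test, monitor, and eventually delete, which is why boolean flags should be the default.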
Key Takeaway
Feature flags decouple deployment from release, making continuous deployment safe and rollbacks instant. The core pattern is simple: a runtime boolean per feature, evaluated against user context, with consistent assignment for percentage rollouts. Maintain a flag lifecycle policy — release flags expire and must be removed. Use kill switches for any integration with external services or high-risk code paths. The investment in a flag system pays back on the first Friday deployment you don’t have to roll back with a 25-minute build cycle.