
Atharva Pandey/Lesson 7: Service Mesh — Sidecar proxies and mTLS without code

Created Sat, 03 Aug 2024 00:00:00 +0000 Modified Sat, 03 Aug 2024 00:00:00 +0000

A team I worked with had seventeen microservices. Each service had its own implementation of retry logic, circuit breaking, timeout handling, and mutual TLS. Some used libraries, some rolled their own. When we needed to update the TLS certificate rotation policy, it touched eleven different code repositories, four different languages, and took two months. Then we introduced Linkerd and moved all of that to the infrastructure layer. The services still did their jobs. The networking became someone else’s problem — specifically, the platform team’s.

How It Works

A service mesh is an infrastructure layer for service-to-service communication. It handles observability, traffic management, and security at the network level — without requiring application code changes.

The Sidecar Pattern

The core mechanism is a sidecar proxy: a separate process that runs alongside each service instance in the same pod (in Kubernetes). All network traffic to and from the service is transparently redirected through this sidecar.

Pod A                                    Pod B
+----------------------------------+     +----------------------------------+
|   [App] <---> [Sidecar Proxy]    |     |   [Sidecar Proxy] <---> [App]    |
+----------------------------------+     +----------------------------------+
              |                                       ^
              +--- encrypted, observed, retried ------+
                   (mTLS, metrics, traces, retries)

The redirection happens via iptables rules injected at pod startup by the mesh’s CNI plugin or init container. The application opens a connection to the upstream service’s address as normal, but iptables intercepts the packets and routes them through the sidecar’s port (e.g., 15001 for outbound traffic with Envoy/Istio).

The application developer doesn’t change any code. The service “thinks” it’s making a direct connection.
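Conceptually, the injected rules look something like this. This is a heavily simplified sketch: the real Istio chains (ISTIO_OUTPUT, ISTIO_INBOUND, ISTIO_REDIRECT, and friends) carry many more exclusions, e.g. for the proxy’s own traffic and health checks.

```shell
# Outbound: TCP connections the app originates are redirected
# to the sidecar's outbound listener (15001 in Istio)
iptables -t nat -A OUTPUT -p tcp -j REDIRECT --to-ports 15001

# Inbound: TCP arriving at the pod is redirected
# to the sidecar's inbound listener (15006 in Istio)
iptables -t nat -A PREROUTING -p tcp -j REDIRECT --to-ports 15006
```

Because REDIRECT operates at the NAT layer, neither the application nor its peers see any of this.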

Envoy: The Proxy Behind Istio

Istio uses Envoy as its sidecar proxy. Envoy is a C++ L7 proxy with a comprehensive feature set:

  • Load balancing: Round-robin, least-request, ring hash, random
  • Circuit breaking: Outlier detection, panic threshold
  • Retry policies: Retry on 5xx, with exponential backoff
  • Timeouts: Per-route, per-cluster
  • Observability: Prometheus metrics, distributed traces (Zipkin, Jaeger), access logs
  • TLS termination: Including mTLS between sidecars

Envoy is configured via xDS APIs — a set of discovery services that Istiod (the Istio control plane) uses to push configuration to each proxy:

  • CDS (Cluster Discovery Service): Upstream service definitions
  • EDS (Endpoint Discovery Service): Service instance IPs and ports
  • LDS (Listener Discovery Service): Port listeners
  • RDS (Route Discovery Service): Request routing rules

The control plane talks to each Envoy sidecar and keeps their configuration synchronized. This is how you can update a retry policy or traffic split across an entire fleet in seconds, without redeploying any application.

mTLS Without Code

Mutual TLS between services is a zero-trust security model: every service proves its identity to every other service on every connection. Without a service mesh, this requires each service to manage client certificates.

With Istio:

  1. Istiod acts as a certificate authority (CA).
  2. On pod startup, each sidecar requests a certificate from Istiod and receives a SPIFFE SVID (SPIFFE Verifiable Identity Document; SPIFFE is the Secure Production Identity Framework for Everyone).
  3. The certificate encodes the workload identity: spiffe://cluster.local/ns/production/sa/payment-service.
  4. All sidecar-to-sidecar communication is automatically mTLS — each side presents its certificate and verifies the other.
  5. Certificates are rotated automatically (every 24 hours by default).

The application service never sees this. It sends plaintext HTTP as if it were talking directly to the other service; the sidecar intercepts the connection and encrypts it with mTLS before it leaves the pod.
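By default Istio runs mTLS in permissive mode, where plaintext from non-mesh clients is still accepted. Enforcing strict mTLS is a single small manifest; a sketch, assuming a production namespace:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT   # reject any plaintext traffic to workloads in this namespace
```

Applying this to the namespace flips every workload in it to mTLS-only, again with no application changes.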

Traffic Management

A VirtualService in Istio lets you control how traffic is routed:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
  - payment-service
  http:
  - route:
    - destination:
        host: payment-service
        subset: v2
      weight: 10  # 10% of traffic to v2
    - destination:
        host: payment-service
        subset: v1
      weight: 90  # 90% to v1
    timeout: 5s
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: 5xx,reset,connect-failure

This canary deployment is entirely in Kubernetes manifests. No code change in the payment service. No feature flag system. The mesh routes traffic at the proxy level.
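The v1 and v2 subsets referenced above are not defined in the VirtualService itself; they come from a companion DestinationRule that maps subset names to pod labels. A sketch, assuming the deployments carry a version label:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  subsets:
  - name: v1
    labels:
      version: v1   # matches pods labeled version=v1
  - name: v2
    labels:
      version: v2   # the canary pods
```

Without this DestinationRule, the VirtualService’s subset references have nothing to resolve to.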

Linkerd: The Simpler Alternative

Linkerd is lighter than Istio. It uses a Rust-based micro-proxy called linkerd2-proxy instead of Envoy. It automatically provides:

  • mTLS
  • Prometheus metrics (golden signals: success rate, request rate, latency)
  • Retries and timeouts via ServiceProfile resources
  • Traffic splitting via TrafficSplit

What it doesn’t provide (compared to Istio): fine-grained traffic management, WebAssembly extensions, Envoy’s full feature set. For most production Kubernetes clusters, Linkerd is often the right choice — simpler to operate, lower resource overhead.
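Linkerd’s traffic splitting uses the TrafficSplit resource from the SMI spec. A sketch with illustrative names: it assumes separate payment-service-v1 and payment-service-v2 Services already exist and select the respective pods.

```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: payment-service-split
  namespace: production
spec:
  service: payment-service        # the apex service clients actually call
  backends:
  - service: payment-service-v1   # existing version
    weight: 90
  - service: payment-service-v2   # canary
    weight: 10
```

Same canary idea as the Istio VirtualService earlier, expressed against concrete backend Services instead of label-based subsets.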

Why It Matters

The service mesh pattern solves the “distributed systems cross-cutting concerns” problem at the infrastructure layer. Security (mTLS), observability (traces, metrics, logs), and resilience (retries, circuit breaking, timeouts) are all things every service needs, and none of them is business logic. Moving them out of application code means:

  • Language-agnostic: Go, Java, Python, Ruby — all get the same capabilities.
  • Consistent behavior: One retry policy configuration, not N library configurations.
  • Operational leverage: Change behavior across the fleet by updating one manifest.

The cost is operational complexity: the mesh itself needs to be maintained, upgraded, and monitored.

Production Example

Enabling Linkerd on a service:

# Install Linkerd
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
linkerd check

# Inject Linkerd sidecar into a deployment
kubectl get deploy payment-service -o yaml \
  | linkerd inject - \
  | kubectl apply -f -

# Or annotate the namespace for automatic injection
kubectl annotate namespace production linkerd.io/inject=enabled

Defining retry and timeout policy in Linkerd:

apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: payment-service.production.svc.cluster.local
  namespace: production
spec:
  routes:
  - name: POST /v1/payments
    condition:
      method: POST
      pathRegex: /v1/payments
    responseClasses:
    - condition:
        status:
          min: 500
          max: 599
      isFailure: true
    timeout: 5s
    # Note: don't retry POST — it's not idempotent. Only use retries for GET.
  - name: GET /v1/payments/{id}
    condition:
      method: GET
      pathRegex: /v1/payments/[^/]*
    isRetryable: true
    timeout: 2s

For Istio, observing the mesh:

# View mesh traffic in real time
istioctl dashboard kiali

# Check sidecar proxy configuration
istioctl proxy-config cluster payment-service-7d9fb6b9c8-xkj2p

# Verify mTLS is in effect for a workload
# (the older `istioctl authn tls-check` has been removed)
istioctl x describe pod payment-service-7d9fb6b9c8-xkj2p

# Watch Envoy stats for a pod
kubectl exec -n production payment-service-7d9fb6b9c8-xkj2p \
  -c istio-proxy -- curl localhost:15000/stats | grep upstream_rq_retry

Resource requirements matter: each Envoy sidecar uses roughly 50-100MB RAM and 0.5 CPU cores at moderate traffic. For a cluster with 200 pods, that’s 10-20GB of RAM for sidecars alone. Linkerd’s proxy is much leaner (~10MB per proxy).

The Tradeoffs

Service mesh vs library approach: A mesh gives you language-agnostic cross-cutting concerns. Libraries like resilience4j (Java) or go-kit give you more flexibility and zero infrastructure dependency. The mesh approach wins for polyglot environments. For a single-language shop, a shared library might be simpler.

Latency overhead: Every request now passes through two sidecar proxies (egress + ingress). Envoy adds ~0.5-1ms per hop under normal conditions. For most services this is fine. For ultra-low-latency paths (high-frequency trading, real-time gaming), it’s a real consideration.

Debugging complexity: When a request fails, is it the application, the sidecar proxy, the service mesh control plane, or the underlying network? The mesh adds layers to your mental model. Good observability is essential.

Ambient mesh (Istio 1.18+): Istio now supports “ambient mode” — no sidecars. Traffic is intercepted at the node level by a shared proxy (ztunnel) with L4 enforcement, and L7 features are provided by a separate “waypoint” proxy only where needed. This dramatically reduces the per-pod overhead. Worth watching as it matures.

Egress traffic: Meshes handle east-west (service-to-service) traffic well. North-south (ingress/egress with external services) is often handled separately by an API gateway. Don’t conflate the two.
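That said, Istio can at least make egress visible to the mesh by registering external destinations with a ServiceEntry, so sidecars can observe the traffic and apply policy to it. A sketch for a hypothetical external dependency (the hostname is illustrative):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: external-payments-api
spec:
  hosts:
  - api.example.com        # hypothetical external API
  ports:
  - number: 443
    name: https
    protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL  # outside the mesh: no sidecar on the other end
```

This does not replace an API gateway; it just stops external traffic from being an invisible blind spot in mesh telemetry.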

Key Takeaway

A service mesh moves networking concerns — mTLS, retries, timeouts, observability, traffic shaping — from application code into infrastructure. Every service gets these capabilities without writing a line of networking code. The sidecar proxy intercepts traffic transparently via iptables, the control plane distributes configuration, and certificates rotate automatically. The cost is operational complexity and resource overhead. For polyglot microservice deployments at scale, the mesh is often worth it. For a small number of services in a single language, a well-designed library might be simpler. Understand the tradeoffs before committing.




🎓 Course Complete! You’ve finished “Networking for Backend Engineers.” You now understand the full stack of protocols your services run on — from TCP’s three-way handshake up through HTTP/2 multiplexing, TLS handshakes, DNS caching, WebSocket framing, gRPC’s binary protocol, and service mesh infrastructure.