
Lesson Outline: Episode 7 - Observability: Metrics, Logging, Tracing

Metadata

  • Course: Kubernetes Production Mastery
  • Episode: 7 of 10
  • Duration: 15 minutes
  • Type: Core Concept
  • Prerequisites:
    • Episode 1 (Production Mindset - monitoring checklist item)
    • Episode 4 (Troubleshooting - systematic debugging)
    • Episode 5 (StatefulSets - Prometheus storage)
    • Episode 6 (Networking - monitoring targets)
  • Template: Core Concept (Episodes 3-N-2)

Learning Objectives

By the end of the lesson, learners will:

  1. Deploy Prometheus with persistent storage and configure service discovery for automatic target detection
  2. Design actionable alerts using the four golden signals (latency, traffic, errors, saturation) to prevent alert fatigue
  3. Determine when to use metrics vs logs vs traces for specific debugging scenarios

Success criteria:

  • Can explain why a default Prometheus install fails in production and what's needed
  • Can write PromQL queries for the four golden signals
  • Can choose between Loki and ELK based on requirements and cost
  • Can design a distributed tracing strategy with appropriate sampling rates

Spaced Repetition Plan

Introduced (repeat later):

  • Four golden signals → Episode 8 (cost optimization metrics), Episode 10 (multi-cluster observability)
  • Prometheus architecture → Episode 8 (monitoring cost metrics), Episode 10 (federation)
  • PromQL basics → Episode 8 (resource utilization queries), Episode 10 (cross-cluster queries)
  • Log aggregation patterns → Episode 10 (centralized logging at scale)

Reinforced (from earlier):

  • Production readiness checklist from Episode 1 (item #5: monitoring) → Applied in Prometheus setup section
  • StatefulSets from Episode 5 → Applied to Prometheus persistent storage (needs PVC, stable identity)
  • Services/Ingress from Episode 6 → Monitoring targets (ServiceMonitors, endpoint scraping)
  • Troubleshooting workflow from Episode 4 → Enhanced with observability tools

Lesson Flow

1. Recall & Connection (1.5 min)

Callback to Episode 6: "Last episode: networking - Services, Ingress, CNI. You can now route traffic. But here's the question: is it working? What's the latency? Are requests failing? Without observability, you're flying blind."

The Production Blindness Problem:

  • Your Ingress is routing 10,000 requests/minute - but how many are errors?
  • Pod-to-pod latency across your service mesh - what's normal vs concerning?
  • Database connections maxed out - when did it start? Which service?

Today's Preview: "Three pillars of observability. Metrics for trends. Logs for details. Traces for flows. Plus the four golden signals that prevent 90% of production surprises."

2. Learning Objectives (30 sec)

By the end, you'll:

  • Deploy production-ready Prometheus with persistent storage and service discovery
  • Design actionable alerts using golden signals to avoid alert fatigue
  • Choose metrics vs logs vs traces for specific debugging scenarios

3. The Observability Mental Model (3 min)

Analogy: Airplane cockpit instruments

  • Not this: Single engine temperature gauge (you're monitoring, but inadequately)
  • This: Comprehensive dashboard - altitude, speed, fuel, heading, engine temp, hydraulics
  • Why? Because production failures are multi-dimensional. CPU might be fine, but memory leaking. Network healthy, but disk full.

Three Pillars Explained:

  1. Metrics (vital signs): Numbers over time. Request rate, error percentage, CPU usage, memory. Answers "What's happening now? What's the trend?"
  2. Logs (security camera footage): Event records. Error messages, stack traces, audit trails. Answers "What exactly happened at 3:42 AM?"
  3. Traces (GPS package tracking): Request flow across services. One user request touches 5 microservices - trace shows the path and timing. Answers "Where did this slow request spend its time?"

When to Use Each (decision framework):

  • Metrics: "Is there a problem?" (detection)
  • Logs: "What exactly went wrong?" (diagnosis)
  • Traces: "Which service in the chain caused it?" (distributed diagnosis)

Visual: Imagine debugging a 502 error.

  • Metrics tell you error rate spiked at 3:42 AM (detection)
  • Logs show "connection timeout to backend-api" (which service)
  • Traces show backend-api spent 29 seconds waiting for database (root cause)

Why This Matters: Without all three, you're debugging with missing puzzle pieces. Episode 4's troubleshooting workflow? Observability gives you the data to actually execute it.

4. Prometheus: Production-Ready Metrics (7 min)

4.1 Why Default Install Fails (1 min)

The Mistake: helm install prometheus → works in dev, fails in production

What's Missing:

  1. No persistent storage - pod restart = lose all metrics
  2. No retention policy - fills disk in 2 weeks
  3. No security - metrics exposed to cluster
  4. No service discovery - manual target configuration

Callback to Episode 5: Remember StatefulSets? Prometheus needs persistent storage just like databases. PVC for data, stable identity for federation.

4.2 Production Prometheus Architecture (2 min)

Setup Pattern:

  • Prometheus Server: StatefulSet with PVC (500GB-1TB), retention 15-30 days
  • Service Discovery: ServiceMonitors (CRDs that tell Prometheus what to scrape)
  • Alertmanager: Handles alert routing, grouping, silencing
  • Grafana: Visualization (queries Prometheus via PromQL)

Deployment Choice:

  • Prometheus Operator (recommended): CRDs for declarative config, automatic target discovery
  • Helm chart: Simpler, fewer features, manual ServiceMonitor creation
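
If you go the Operator route, here is a minimal sketch of the Prometheus custom resource, assuming the Operator CRDs are installed; the name, storage class, and sizes are illustrative, not prescriptive:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prod
  namespace: monitoring
spec:
  replicas: 2
  retention: 30d                    # matches the 15-30 day pattern above
  retentionSize: 450GB              # leave headroom below the PVC size
  serviceMonitorSelector: {}        # empty selector = pick up all ServiceMonitors
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: fast-ssd  # illustrative storage class name
        resources:
          requests:
            storage: 500Gi
  resources:
    requests:
      cpu: "1"
      memory: 8Gi

The Operator renders this into the StatefulSet-with-PVC pattern from the setup list above.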

Example: E-commerce platform

Prometheus scrapes:
- kube-state-metrics (pod/deployment/node states)
- node-exporter (host metrics: CPU, memory, disk)
- Ingress controller (request rates, latencies)
- Application pods (custom metrics: cart abandonment rate)
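
For the application pods in that list, a ServiceMonitor sketch; the labels, namespace, and port name are assumptions for illustration:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: shop-api
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: shop-api          # must match the Service's labels from Episode 6
  namespaceSelector:
    matchNames: ["shop"]
  endpoints:
    - port: metrics          # named port on the Service
      path: /metrics
      interval: 30s

If the selector doesn't match the Service labels, Prometheus silently scrapes nothing - a common first-deploy surprise.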

4.3 The Four Golden Signals (2 min)

Framework (from Google SRE book):

  1. Latency: How long requests take (p50, p95, p99)
  2. Traffic: Requests per second, bandwidth
  3. Errors: Request failure rate (4xx, 5xx)
  4. Saturation: How full your system is (CPU, memory, disk, connections)

Why These Four: Cover 90% of production issues. Everything else is noise.

PromQL Examples:

# Latency (p95)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Traffic
sum(rate(http_requests_total[5m]))

# Errors
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation
container_memory_usage_bytes / container_spec_memory_limit_bytes

Callback to Episode 6: These queries monitor your Ingress endpoints, Services, pod-to-pod latency. The networking you built last episode.

4.4 Label Design & Cardinality Trap (1.5 min)

The Problem: High cardinality kills Prometheus

What is Cardinality: Unique combinations of label values

  • Low: {service="api", environment="prod"} → 10 services × 3 envs = 30 combinations
  • High: {user_id="12345", request_id="abc-def-ghi"} → millions of combinations

The Trap: Adding user_id as label = Prometheus runs out of memory. Why? Every unique combo = separate time series.

Best Practice:

  • ✅ Labels: service, environment, instance, job (low cardinality, high value)
  • ❌ Labels: user_id, request_id, email, session_id (high cardinality, kills performance)

Rule: If a label has >100 unique values, reconsider. If >10,000, definitely wrong.
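
If an exporter you don't control already emits such labels, one defensive option is dropping them at scrape time; a sketch using Prometheus metric_relabel_configs (or metricRelabelings on a ServiceMonitor), with the label names from above:

metric_relabel_configs:
  - action: labeldrop                       # drop any label whose name matches the regex
    regex: "user_id|request_id|session_id"

Dropping labels is damage control; the real fix is not emitting them from the application in the first place.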

4.5 Retention & Storage Costs (0.5 min)

Production Pattern:

  • Short-term: 15-30 days in Prometheus (recent, high resolution)
  • Long-term: 1+ years in Thanos/Cortex (downsampled, cheap storage)

Cost Awareness: 1TB retention × 3 replicas = 3TB storage. Plan accordingly.

5. Logging: Aggregation & Cost Management (4 min)

5.1 Why Log Aggregation Matters (1 min)

The Problem Without It:

  • 50 services × 5 pods each = 250 log sources
  • Pod restarts = logs gone
  • Debugging: kubectl exec into 20 pods, grep, pray

The Solution: Centralized log aggregation

  • All logs → single query interface
  • Persistent storage (logs survive pod restarts)
  • Correlation (filter by request ID across services)

5.2 Loki vs ELK Decision (1.5 min)

Loki (Grafana's log system):

  • ✅ Lightweight, low cost (indexes labels, not content)
  • ✅ Integrates with Prometheus/Grafana
  • ✅ Good for Kubernetes (tail logs much like kubectl logs)
  • ❌ Limited full-text search (no Elasticsearch query power)
  • Best for: Teams already using Prometheus/Grafana, cost-conscious

ELK Stack (Elasticsearch, Logstash, Kibana):

  • ✅ Powerful full-text search, complex queries
  • ✅ Advanced analytics, dashboards
  • ❌ Resource-heavy (Elasticsearch needs 4GB+ memory)
  • ❌ Higher cost (indexes everything)
  • Best for: Large teams, compliance requirements, advanced log analysis

My Decision Framework:

  • Start with Loki (simpler, cheaper)
  • Migrate to ELK if you need: advanced search, compliance, dedicated log analytics team

5.3 Structured Logging (1 min)

Unstructured (bad):

2025-01-12 14:32:10 ERROR User login failed for john@example.com from IP 203.0.113.5

Hard to query: "Show me all login failures from this IP range"

Structured (good):

{"timestamp":"2025-01-12T14:32:10Z","level":"ERROR","event":"login_failed","user":"john@example.com","ip":"203.0.113.5"}

Easy to query: {event="login_failed", ip=~"203.0.113.*"}

Best Practice: JSON logging from application, parsed by Fluent Bit, stored in Loki/ELK.
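
A minimal sketch of that pipeline, assuming Fluent Bit's YAML config format and its loki output plugin; the log path, host, and label are illustrative:

pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      tag: kube.*
  filters:
    - name: kubernetes
      match: kube.*
      merge_log: true            # lift the JSON payload into structured fields
  outputs:
    - name: loki
      match: "*"
      host: loki.monitoring.svc  # assumed in-cluster Loki address
      port: 3100
      labels: job=fluent-bit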

5.4 Log Retention & Cost (0.5 min)

The Cost Explosion:

  • Logging everything at DEBUG level = 10TB/month
  • 7-day retention = manageable
  • 90-day retention = massive cost

Production Pattern:

  • INFO level in production (ERROR/WARN always)
  • 7-day retention for general logs
  • 30-90 days for audit logs (compliance)
  • Archive to S3/GCS for long-term (cheap, searchable if needed)
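
A hedged sketch of how that retention split can look in a Loki config (exact keys vary by Loki version; the audit label is an assumption):

limits_config:
  retention_period: 168h              # 7 days for general logs
  retention_stream:
    - selector: '{log_type="audit"}'  # assumed label on audit log streams
      priority: 1
      period: 2160h                   # ~90 days for compliance
compactor:
  retention_enabled: true             # the compactor enforces retention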

6. Alerting: Actionable, Not Noisy (3 min)

6.1 The Alert Fatigue Problem (1 min)

Bad Alerting (what NOT to do):

  • Alert on every error (100 alerts/day, ignored)
  • Alert on symptoms not causes ("disk usage 70%" - so what?)
  • No grouping (same issue = 50 separate alerts)

Good Alerting:

  • Alert on customer impact (error rate >1%, latency p95 >500ms)
  • Alert on imminent failure (disk 90%, will fill in 2 hours)
  • Group related alerts (all pods down = one alert, not 20)

6.2 Golden Signals Alerts (1 min)

Latency Alert:

alert: HighLatency
expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
for: 5m
annotations:
  summary: "API latency p95 > 500ms for 5 minutes"

Error Rate Alert:

alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
for: 5m
annotations:
  summary: "Error rate > 1% for 5 minutes"

Saturation Alert:

alert: HighMemoryUsage
expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
for: 10m
annotations:
  summary: "Pod using > 90% memory for 10 minutes"

Pattern: Alert on rate of change + duration threshold (avoids transient spikes)
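
If you run the Prometheus Operator from section 4.2, these rules ship as a PrometheusRule resource; a sketch wrapping the error-rate alert, with an illustrative severity label that the routing in 6.3 will use:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: golden-signals
  namespace: monitoring
spec:
  groups:
    - name: golden-signals
      rules:
        - alert: HighErrorRate
          expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
          for: 5m
          labels:
            severity: critical       # drives Alertmanager routing below
          annotations:
            summary: "Error rate > 1% for 5 minutes"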

6.3 Alertmanager Routing (1 min)

Smart Routing:

  • Critical alerts (error rate >5%) → PagerDuty → wake up on-call
  • Warning alerts (disk 80%) → Slack → check during business hours
  • Info alerts (deployment completed) → Log only

Grouping:

  • 20 pods crash → one alert "Deployment X unhealthy", not 20 separate

Silencing:

  • Planned maintenance? Silence alerts for 2 hours
  • Known issue? Silence until fix deployed
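
Putting routing and grouping together, a sketch of an Alertmanager config, assuming your alert rules carry a severity label; receiver names, channel, and keys are placeholders:

route:
  receiver: slack-warnings            # default for anything unmatched
  group_by: ["alertname", "namespace"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall
    - matchers:
        - severity="info"
      receiver: "null"                # log-only, no notification
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <placeholder>
  - name: slack-warnings
    slack_configs:
      - channel: "#k8s-alerts"
        api_url: <placeholder>
  - name: "null"

Silences themselves are created at runtime (UI or amtool), not in this file.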

Callback to Episode 4: Troubleshooting workflow enhanced. Alerts tell you WHAT failed. Metrics/logs tell you WHY.

7. Distributed Tracing: When You Need It (1.5 min)

7.1 The Microservice Debugging Problem (0.5 min)

Scenario: User reports "checkout is slow"

  • Request touches: frontend → API gateway → cart service → inventory service → payment service → order service
  • Which service is slow? Metrics show all services "healthy"
  • This is where tracing saves you

7.2 Tracing Architecture (0.5 min)

Components:

  • Instrumentation: Application code generates spans (timing for each service call)
  • Collector: Aggregates spans from all services
  • Storage: Jaeger, Tempo (stores trace data)
  • UI: Visualize request flow

Example Trace:

Checkout request (2.3s total):
├─ API Gateway (50ms)
├─ Cart Service (100ms)
├─ Inventory Service (80ms)
├─ Payment Service (1900ms) ← THE PROBLEM
└─ Order Service (170ms)

Decision: Payment service slow. Check payment service logs/metrics for root cause.

7.3 When to Implement Tracing (0.5 min)

You Need Tracing If:

  • 10+ microservices
  • Complex request flows (can't easily follow the path)
  • Latency issues you can't diagnose with metrics alone

You DON'T Need Tracing If:

  • Monolith or under 5 services
  • Simple architectures (can follow request path easily)
  • Metrics + logs sufficient

Sampling Strategy:

  • Don't trace every request (cost/performance)
  • Sample 1% of successful requests (trend visibility)
  • Sample 100% of errors (debug issues)
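
One way to implement that split, assuming traces flow through the OpenTelemetry Collector (contrib build) and its tail_sampling processor; the policy names are arbitrary:

processors:
  tail_sampling:
    decision_wait: 10s               # wait for all spans of a trace before deciding
    policies:
      - name: keep-all-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-one-percent
        type: probabilistic
        probabilistic:
          sampling_percentage: 1

Policies are OR'd together, so every error trace is kept plus roughly 1% of everything else.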

Callback to Episode 6: Service mesh (Istio) includes tracing automatically. If you implemented service mesh, tracing is already enabled.

8. Common Pitfalls (1.5 min)

Mistake 1: No Prometheus retention policy

  • What happens: Metrics fill disk, Prometheus crashes, lose all data
  • Fix: Set retention (15-30 days) and storage size in StatefulSet PVC
  • Command: --storage.tsdb.retention.time=30d --storage.tsdb.retention.size=450GB

Mistake 2: High cardinality labels

  • What happens: Prometheus OOM, slow queries, can't scrape targets
  • Example: Adding user_id as label → millions of time series
  • Fix: Use labels for dimensions (service, env), not identifiers (user, request)

Mistake 3: Logging everything at DEBUG level

  • What happens: Log storage costs explode (10TB/month), slow queries
  • Fix: INFO level in production, ERROR/WARN always, DEBUG only during troubleshooting

Mistake 4: No alerting or too many alerts

  • What happens: Miss critical issues OR ignore all alerts (fatigue)
  • Fix: Alert on customer impact (golden signals), group related alerts, route by severity

Mistake 5: Tracing at 100% sampling rate

  • What happens: Performance overhead, storage costs, no benefit vs 1% sampling for trends
  • Fix: 1% sampling for successful requests, 100% for errors

9. Active Recall (1.5 min)

"Pause and answer these questions:"

Question 1: What are the four golden signals, and why do they matter? [PAUSE 5 sec]

Question 2: You're debugging a slow checkout flow in a microservice architecture. Which observability pillar would you use first, and why? [PAUSE 5 sec]

Question 3: Your Prometheus is running out of memory. What are the two most likely causes? [PAUSE 5 sec]

Answers:

  1. Four golden signals: Latency (how long), Traffic (how much), Errors (failure rate), Saturation (how full). They cover 90% of production issues - if these look healthy, your system is probably fine.

  2. Slow checkout debugging: Start with traces (see which service in the chain is slow), then metrics (check that service's latency/errors), then logs (find specific error messages). Traces give you the "where", metrics give you "how bad", logs give you "why".

  3. Prometheus OOM: (1) High cardinality labels - too many unique time series, or (2) No retention policy - storing too much data. Check label design and set retention limits.

10. Recap & Integration (1.5 min)

Takeaways:

  1. Three pillars: Metrics (trends), Logs (details), Traces (flows) - you need all three
  2. Prometheus production setup: StatefulSet with PVC, retention policy, ServiceMonitors for discovery, avoid high cardinality
  3. Four golden signals: Latency, Traffic, Errors, Saturation - alert on these, not everything
  4. Logging choice: Loki for cost/simplicity, ELK for power/compliance
  5. Tracing: Only for complex microservices (10+), sample smartly (1% success, 100% errors)

Connection to Previous Episodes:

  • Episode 4's troubleshooting workflow now has data sources (metrics/logs/traces)
  • Episode 5's StatefulSets apply to Prometheus storage (PVC, stable identity)
  • Episode 6's networking (Services, Ingress) are your monitoring targets

Bigger Picture: You can now build (Episodes 2-3), secure (Episode 3), persist (Episode 5), route traffic (Episode 6), AND see what's happening in production (Episode 7). Next: optimize the cost.

Callback Preview: Episode 1's production readiness checklist item #5 was monitoring. Today you learned how to implement it properly.

11. Next Episode Preview (30 sec)

"Next episode: Cost Optimization at Scale. You've built production-ready infrastructure. But what's it costing? The five sources of Kubernetes cost waste. Right-sizing strategies. FinOps tools like Kubecost. And the senior engineer problem - over-engineering costs money. We'll use Episode 7's Prometheus to monitor resource utilization and Episode 2's resource concepts to right-size workloads. See you then."


Supporting Materials

Code Examples:

  1. Prometheus StatefulSet manifest - shows PVC, retention, resources
  2. ServiceMonitor CRD - automatic target discovery
  3. PromQL queries - four golden signals implementation
  4. Alert rules - latency/errors/saturation with proper thresholds
  5. Loki stack deployment - Fluent Bit → Loki → Grafana

Tech References:

  • prometheus.io/docs - official architecture docs
  • grafana.com/docs/loki - Loki setup guide
  • sre.google/sre-book - original four golden signals chapter
  • prometheus.io/docs/prometheus/latest/querying/basics - PromQL tutorial

Analogies:

  1. Cockpit instruments - comprehensive dashboard vs single gauge
  2. Vital signs - metrics like pulse/blood pressure/temperature
  3. Security camera - logs as event footage
  4. GPS package tracking - traces showing request journey
  5. Alert fatigue - like car alarm that goes off constantly (ignored)

Quality Checklist

Structure:

  • Clear beginning/middle/end (recall → teach → practice → recap)
  • Logical flow (mental model → Prometheus → logging → alerting → tracing)
  • Time allocations per section are realistic, but the total runs long (1.5+0.5+3+7+4+3+1.5+1.5+1.5+1.5+0.5 ≈ 25.5 min against the 15 min target; will tighten in script)
  • All sections specific with concrete examples

Pedagogy:

  • Objectives specific/measurable (deploy Prometheus with storage, design alerts using golden signals, choose observability pillar)
  • Spaced repetition integrated (callbacks to Episodes 1,4,5,6; forward to Episodes 8,10)
  • Active recall included (3 retrieval questions with pauses)
  • Signposting clear (section transitions, "first/second/third")
  • 5 analogies (cockpit, vital signs, camera, GPS, car alarm)
  • Pitfalls addressed (5 common mistakes with fixes)

Content:

  • Addresses pain points (production blindness, alert fatigue, high cardinality, cost explosion)
  • Production-relevant examples (e-commerce platform, 502 debugging, checkout tracing)
  • Decision frameworks (metrics vs logs vs traces, Loki vs ELK, when to trace, sampling strategy)
  • Troubleshooting included (debugging with observability tools)
  • Appropriate depth for senior engineers (PromQL, cardinality, retention policies)

Engagement:

  • Strong hook (flying blind, can't answer basic questions about production)
  • Practice/pause moments (active recall with 5-sec pauses)
  • Variety in techniques (analogies, examples, think-alouds, decision frameworks)
  • Preview builds anticipation (cost optimization using observability data)