
Lesson 12: Disaster Recovery: Failover Procedures & Chaos Engineering

Multi-Region Platform Engineering: AWS, Kubernetes, and Aurora at Scale

Episode 12 of 16 | Duration: 17 minutes

Target Audience: Senior platform engineers, SREs, DevOps engineers (5+ years experience)




What You'll Learn

  • 6-phase DR runbook: Detection, validation, approval gates, execution, verification, rollback procedures
  • Chaos engineering schedule: Quarterly tests with escalating severity levels, plus gameday exercises
  • Failure scenarios: Region-wide outage, database promotion failures, DNS propagation delays, split-brain
  • Human factors: Decision-making under pressure, runbook clarity, escalation paths
  • Automation vs. manual steps: Automatic detection paired with manual approval, balancing speed against safety
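
To make the runbook shape concrete, here is a minimal sketch of the six phases with automatic detection and a manual approval gate. It is an illustration only, not the lesson's actual tooling: the names `Phase`, `run_failover`, `steps`, and `approved_by_human` are hypothetical, and each step is assumed to be a callable that returns True on success.

```python
from enum import Enum
from typing import Callable, Dict, List


class Phase(Enum):
    DETECTION = "detection"
    VALIDATION = "validation"
    APPROVAL = "approval"
    EXECUTION = "execution"
    VERIFICATION = "verification"
    ROLLBACK = "rollback"


def run_failover(
    steps: Dict[Phase, Callable[[], bool]],
    approved_by_human: Callable[[], bool],
) -> List[Phase]:
    """Walk the runbook phases in order and return the phases that ran.

    Hypothetical sketch: each step returns True on success. Detection and
    validation run automatically; execution starts only after a human
    approves. Rollback runs only if execution or verification fails.
    """
    ran: List[Phase] = []

    # Automated phases: confirm the outage is real before asking for approval.
    for phase in (Phase.DETECTION, Phase.VALIDATION):
        ran.append(phase)
        if not steps[phase]():
            return ran  # outage not confirmed, so stop without failing over

    # Manual gate: a person decides whether to fail over.
    ran.append(Phase.APPROVAL)
    if not approved_by_human():
        return ran

    # Execute and verify; a failure in either phase triggers rollback.
    for phase in (Phase.EXECUTION, Phase.VERIFICATION):
        ran.append(phase)
        if not steps[phase]():
            ran.append(Phase.ROLLBACK)
            steps[Phase.ROLLBACK]()
            return ran

    return ran
```

Calling `run_failover` with steps that all succeed and an approver that returns True runs the first five phases and skips rollback. Keeping approval as a separate callable means the gate can be backed by a page, a chat prompt, or a ticket without changing the automated phases around it.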
