The Platform Engineering Playbook Podcast
Welcome to The Platform Engineering Playbook Podcast — where everything you hear is built, reviewed, and improved by AI, by me, and by you, the listener.
This show keeps me — and hopefully you — up to speed on the latest in platform engineering, SRE, DevOps, and production engineering. It's a living experiment in how AI can help us track, explain, and debate the fast-moving world of infrastructure.
Every episode is open source. If you've got something to add, correct, or challenge, head to GitHub — open a pull request, join the conversation, and make the Playbook smarter.
Target Audience: Senior platform engineers, SREs, and DevOps engineers with 5+ years of experience seeking strategic insights on technology choices, market dynamics, and skill optimization.
🎥 Latest Episode: Kubernetes Production Mastery - Lesson 02
Courses
Structured, multi-episode educational series designed for deep learning and skill mastery. Each course uses a single-presenter lecture format optimized for retention through learning-science principles (spaced repetition, active recall, progressive complexity).
Available Courses:
📖 Kubernetes Production Mastery
Transform from a Kubernetes user into a production Kubernetes engineer. Learn how to run Kubernetes at scale with confidence through real-world failure patterns, systematic debugging, and battle-tested best practices.
- Episodes: 10 lessons (2 published, 8 coming soon)
- Duration: ~3 hours total
- Level: Intermediate to Advanced
- Prerequisites: Basic Kubernetes knowledge (pods, deployments, services)
What You'll Learn:
- Production mindset: Think in failure modes, not just success cases
- Resource management: Prevent OOMKilled and cascading failures
- RBAC, secrets, and security for multi-tenant clusters
- Systematic debugging workflow for production incidents
- Stateful workloads, networking, and observability
- Cluster operations, multi-tenancy, and advanced patterns
Published Lessons:
- 📖 Lesson 01: Production Mindset (17 min) - Learn the 5 failure patterns that break systems at scale and the 6-item production readiness checklist
- 📖 Lesson 02: Resource Management (19 min) - Master requests vs limits, QoS classes, and the 5-step debugging workflow for OOMKilled pods
📖 Multi-Region Platform Engineering: AWS, Kubernetes, and Aurora at Scale
Master the LEGO architecture approach to multi-region systems. Learn the real 2.5-7.5x cost multiplier, compose building blocks (Aurora Global Database, EKS, Transit Gateway, DynamoDB Global Tables), and build production-grade architectures that match your actual needs—not aspirational ones.
- Episodes: 16 lessons (ALL PUBLISHED ✅)
- Duration: ~4 hours total
- Level: Advanced
- Prerequisites: 5+ years production AWS/Kubernetes experience, distributed systems knowledge
What You'll Learn:
- Cost reality: True multi-region costs and when single-region wins
- Architecture patterns: Hot-hot, hot-warm, hot-cold—what actually works
- AWS building blocks: Aurora, EKS, Transit Gateway, DynamoDB as composable pieces
- Data strategies: Consistency models, replication, conflict resolution
- Compliance: SEC SCI, MiFID II, crypto regulations (NY BitLicense, EU MiCA)
- Decision frameworks: Calculate your actual requirements vs marketing hype
The complete course is available with video lectures and detailed transcripts. All scripts have been validated and enhanced.
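As a rough illustration of the 2.5-7.5x cost multiplier mentioned in the course description, the sketch below turns a single-region monthly bill into a multi-region budget range. The function name and the example bill are hypothetical and not taken from the course. Only the 2.5 and 7.5 bounds come from the text above.

```python
from typing import Tuple


def multi_region_cost_range(
    single_region_monthly: float,
    low_multiplier: float = 2.5,   # lower bound quoted in the course description
    high_multiplier: float = 7.5,  # upper bound quoted in the course description
) -> Tuple[float, float]:
    """Return a (low, high) monthly cost estimate for a multi-region setup.

    This is only a back-of-the-envelope range; real costs depend on the
    architecture pattern, data transfer, and replication choices.
    """
    if single_region_monthly < 0:
        raise ValueError("single_region_monthly must be non-negative")
    return (
        single_region_monthly * low_multiplier,
        single_region_monthly * high_multiplier,
    )


# Hypothetical example: a $10,000/month single-region bill.
low, high = multi_region_cost_range(10_000)
print(f"Estimated multi-region cost: ${low:,.0f} to ${high:,.0f} per month")
```

A range like this is a starting point for deciding whether single-region is the better fit, not a substitute for a per-service cost model.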
Latest Episodes
Latest content includes course lessons and standalone podcast episodes. Course lessons are organized in the Courses section above.
Multi-Region Platform Engineering Course - Complete Series:
- 📖 #028: Lesson 16 - 90-Day Implementation Roadmap (18 min) - 4-phase rollout with go-no-go gates, abort criteria, and risk mitigation strategies
- 📖 #027: Lesson 15 - Anti-Patterns: What Breaks Multi-Region (15 min) - Six anti-patterns, real cost impact, and recovery strategies
- 📖 #026: Lesson 14 - Security Architecture (18 min) - Encryption at-rest/in-transit/in-use, key management trade-offs, zero-trust networking
- 📖 #025: Lesson 13 - Compliance-Driven Architecture (16 min) - SEC SCI, MiFID II, BitLicense, MiCA regulatory requirements
- 📖 #024: Lesson 12 - Disaster Recovery & Chaos Engineering (17 min) - 6-phase runbook, failover procedures, chaos testing strategies
- 📖 #023: Lesson 11 - Advanced Kubernetes Patterns (15 min) - Sub-second service mesh failover vs 2+ minutes for DNS-based failover, operational complexity trade-offs
- 📖 #022: Lesson 10 - Data Consistency Models (13 min) - CAP theorem, Aurora vs DynamoDB consistency, split-brain prevention
- 📖 #021: Lesson 09 - Cost Management (16 min) - Seven optimization strategies, locality-aware routing saves 90%, $112K→$36K real example
- 📖 #020: Lesson 08 - DNS & Traffic Management (14 min) - Route53 health checks, Global Accelerator, failover detection
- 📖 #019: Lesson 07 - Observability at Scale (16 min) - Centralized logging, distributed tracing, cross-region metrics
- 📖 #018: Lesson 06 - DynamoDB Global Tables (15 min) - Active-active replication, conflict resolution, cost comparison with Aurora
- 📖 #017: Lesson 05 - Network Architecture (17 min) - Transit Gateway, VPC peering, PrivateLink, Global Accelerator
- 📖 #016: Lesson 04 - Kubernetes Multi-Cluster (18 min) - EKS as regional boundary, independent clusters, cross-cluster discovery
- 📖 #015: Lesson 03 - Aurora Global Database (14 min) - Active-passive replication, promotion procedures, 45-85ms lag
- 📖 #014: Lesson 02 - Production Patterns (16 min) - Hot-hot, hot-warm, hot-cold, cold-standby patterns and RTO/RPO trade-offs
- 📖 #013: Lesson 01 - Multi-Region Mental Model (15 min) - Cost, Complexity, Capability triangle and when single-region actually wins
Latest Standalone Episodes:
- 🎙️ #012: Platform Engineering ROI Calculator (15 min) - Prove platform value to executives: ROI formula, DORA→business translation, and stakeholder templates that saved teams from disbandment
- 🎙️ #011: Why 70% of Platform Engineering Teams Fail (12 min) - The critical PM gap, metrics blindness, and the 5 predictive metrics that separate success from $3.75M failures
- 📖 #010: Kubernetes Production Mastery - Lesson 02 (19 min) - Master requests vs limits, QoS classes, and the 5-step debugging workflow for OOMKilled pods
- 📖 #009: Kubernetes Production Mastery - Lesson 01 (17 min) - Learn the 5 failure patterns that break systems at scale and the 6-item production readiness checklist
- 🎙️ #008: GCP State of the Union 2025 (17 min) - When depth beats breadth: GCP's 32% growth vs AWS's 17%
- 🎙️ #007: AWS Outage October 2025 (16 min) - The $75M/hour lesson: DNS race condition in DynamoDB
- 🎙️ #006: AWS State of the Union 2025 (29 min) - Navigate 200+ AWS services with strategic clarity and career frameworks
- 🎙️ #005: Platform Tools Tier List 2025 (13 min) - Which skills command $24K+ higher salaries?
- 🎙️ #004: PaaS Showdown 2025 (14 min) - Flightcontrol vs Vercel vs Railway vs Render vs Fly.io
- 🎙️ #003: Platform Economics (18 min) - Hidden costs and ROI of platform engineering
- 🎙️ #002: Cloud Providers (20 min) - AWS vs Azure vs GCP deep dive
- 🎙️ #001: AI Platform Engineering (15 min) - Shadow AI and governance
Podcast Episodes
Episode 12: Platform Engineering ROI Calculator
🎙️ Platform Engineering ROI Calculator: Prove Value to Executives
45% of platform teams measure nothing and get disbanded when they can't prove ROI. Learn the exact ROI calculation framework that saved three platform teams from disbandment—with real numbers from startups (233% ROI) to enterprises (380% ROI). Discover how to translate DORA metrics into dollars executives understand, from deployment frequency to revenue impact, MTTR to SLA penalties. Includes CFO, CTO, and VP Eng stakeholder templates that speak their language.
Duration: 15 minutes
📝 Read the full blog post with detailed spreadsheets, stakeholder templates, and real-world case studies.
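For readers who want a feel for the arithmetic, here is a minimal sketch of a generic ROI calculation. It uses the standard formula ROI = (benefit - cost) / cost and is not necessarily the framework from the episode. The function names, the benefit model (engineer hours saved), and every input number are hypothetical assumptions for illustration only.

```python
def annual_time_savings(
    engineers: int,
    hours_saved_per_week: float,
    loaded_hourly_rate: float,
    working_weeks: int = 48,  # assumption: working weeks per year
) -> float:
    """Estimate the yearly dollar value of developer time saved by the platform."""
    return engineers * hours_saved_per_week * working_weeks * loaded_hourly_rate


def roi_percent(annual_benefit: float, annual_cost: float) -> float:
    """Standard ROI: net benefit divided by cost, expressed as a percentage."""
    if annual_cost <= 0:
        raise ValueError("annual_cost must be positive")
    return (annual_benefit - annual_cost) * 100 / annual_cost


# Hypothetical inputs: 50 engineers each save 2 hours/week at $100/hour,
# and the platform team costs $300,000 per year.
benefit = annual_time_savings(engineers=50, hours_saved_per_week=2, loaded_hourly_rate=100)
print(f"Annual benefit: ${benefit:,.0f}")
print(f"ROI: {roi_percent(benefit, 300_000):.0f}%")
```

In practice the benefit side would also include items such as reduced incident cost or avoided SLA penalties; time savings is just the simplest term to model.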
Episode 11: Why 70% of Platform Engineering Teams Fail
🎙️ Why 70% of Platform Engineering Teams Fail (And the 5 Metrics That Predict Success)
60-70% of platform engineering teams fail to deliver impact, with 45% disbanded within 18 months. We investigate why technically excellent teams with senior engineers and big budgets consistently fail—and uncover the shocking truth: it's not about technology. Learn the 5 predictive metrics that separate successful platforms from expensive failures, including the critical PM gap that explains Spotify's 99% adoption vs the industry's 10% average.
Duration: 12 minutes
📝 Read the full blog post with the 90-day playbook and comprehensive decision framework.
Episode 10: Kubernetes Production Mastery - Lesson 02
📖 Kubernetes Production Mastery - Lesson 02: Resource Management
Exit code 137 (OOMKilled) is the #1 production failure in Kubernetes—67% of teams have experienced it. Master the critical difference between resource requests and limits, learn the 5-step debugging workflow for OOMKilled errors, and discover how to right-size containers using P95/P99 metrics and Quality of Service principles. Real incidents: $94K lost in 47 minutes from missing limits, $2,400/month wasted from over-provisioning.
Duration: 19 minutes
Part of: Kubernetes Production Mastery Course - Episode 2 of 10
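To make the link between requests, limits, and Quality of Service concrete, the following sketch classifies a pod into one of the three Kubernetes QoS classes (Guaranteed, Burstable, BestEffort) from its containers' CPU and memory settings. It is a simplified illustration, not code from the lesson: the `qos_class` function and the dictionary layout are hypothetical, and quantities are compared as plain strings, so equivalent values written differently (for example `500m` and `0.5`) are treated as different.

```python
from typing import Dict, List

# Hypothetical layout: {"requests": {"cpu": "500m"}, "limits": {"cpu": "500m"}}
ContainerResources = Dict[str, Dict[str, str]]

TRACKED = ("cpu", "memory")


def qos_class(containers: List[ContainerResources]) -> str:
    """Return the QoS class a pod with these containers would receive."""
    any_set = False      # does any container set a CPU/memory request or limit?
    guaranteed = True    # do all containers have request == limit for both?

    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in TRACKED:
            if resource in requests or resource in limits:
                any_set = True
            limit = limits.get(resource)
            # When only a limit is set, Kubernetes defaults the request to it.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False

    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"requests": {"memory": "128Mi"}}]))
print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))
```

Under node memory pressure, BestEffort pods are generally evicted first and Guaranteed pods last, which is why the class a pod lands in matters for OOMKilled debugging.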
Episode 9: Kubernetes Production Mastery - Lesson 01
📖 Kubernetes Production Mastery - Lesson 01: Production Mindset
Learn the mental shift from development to production Kubernetes. Understand the 5 critical failure patterns that break systems at scale (OOMKilled, RBAC, health checks, storage, networking) and the 6-item production readiness checklist you must apply before any deployment. Stop following tutorials—start thinking like a production engineer who anticipates failure and designs for reliability.
Duration: 17 minutes
Part of: Kubernetes Production Mastery Course - Episode 1 of 10
Episode 8: GCP State of the Union 2025
🎙️ GCP State of the Union 2025: When Depth Beats Breadth
GCP grows at 32% while AWS manages 17%—nearly 2x faster despite having half the services. We break down why Google's depth-over-breadth strategy is winning AI/ML and data workloads in 2025. Learn about 3x network performance advantages, automatic sustained use discounts (no Reserved Instance forecasting!), and when GCP's specialist positioning beats AWS's generalist approach.
Duration: 17 minutes
Episode 7: AWS Outage October 2025
🎙️ The $75 Million Per Hour Lesson: Inside the 2025 AWS US-EAST-1 Outage
October 19, 2025. 11:48 PM. Ring doorbells stop. Robinhood freezes trading. Roblox goes dark. 6.5 million outage reports globally. A DNS race condition in DynamoDB cascaded into outages across 70+ AWS services, affecting 1,000+ companies from gaming to government. We dissect the technical failure, the $75M/hour cost, and what it reveals about single-region control plane dependencies. Essential listening for multi-cloud strategy and resilience planning.
Duration: 16 minutes
📝 Read the full blog post with timeline, cost breakdown, and decision frameworks.
Episode 6: AWS State of the Union 2025
🎙️ AWS State of the Union 2025: Navigate 200+ Services with Strategic Clarity
You're an experienced platform engineer. AWS has over 200 services. Where do you start? We cut through the complexity: which 20 services matter, how AWS specialization ($127K) stacks up against specialized tools on AWS ($135-139K), and practical guidance for engineers returning to AWS or migrating from Azure/GCP. Includes career tier frameworks, cost optimization strategies, and service selection playbooks.
Duration: 28 minutes
Episode 5: Platform Tools Tier List 2025
🎙️ The Platform Engineering Tools Tier List 2025: Which Skills Command $24K+ Higher Salaries
Which skills command $24K+ higher salaries? We analyze 220+ tools from the Dice 2025 report, break down the commoditization trap (Git -3%, Docker -2%, Kubernetes -1%), reveal S-tier specializations earning $130K-152K, and provide practical 18-month career roadmaps from B-tier to S-tier compensation.
Duration: 13 minutes
Episode 4: PaaS Showdown 2025
🎙️ PaaS Showdown 2025: Flightcontrol vs Vercel vs Railway vs Render vs Fly.io
A deep dive into the 2025 PaaS landscape. We break down pricing models, compare real-world costs for the same workload, and give you a decision framework for choosing the right platform for your team size and technical expertise.
Duration: 12-15 minutes
Episode 3: Platform Economics
🎙️ Platform Economics - The Hidden Costs of Infrastructure Decisions
We explore the economic realities of platform engineering — from cloud costs to engineering time, from build vs buy decisions to the opportunity cost of DIY infrastructure. Learn how to make financially sound technical decisions.
Duration: 8-10 minutes
Episode 2: Cloud Providers Deep Dive
🎙️ Cloud Providers - The Real Story Behind AWS, Azure, and GCP
A comprehensive comparison of the big three cloud providers — AWS, Azure, and GCP. We discuss their strengths, weaknesses, pricing models, and how to choose the right one for your organization's needs.
Duration: 18-20 minutes
Episode 1: AI Platform Engineering
🎙️ AI Platform Engineering - The Real Story Behind Shadow AI and Developer Productivity
We dive into the AI platform engineering crisis that 85% of organizations are facing right now. Shadow AI, governance that actually works, AIOps that delivers real ROI, and how to build platforms that support AI workloads without losing your mind.
Duration: 12-15 minutes
Subscribe & Listen
The Platform Engineering Playbook Podcast is available on all major podcast platforms. Episodes are also available directly on this site.
Contribute
Every topic, transcript, and summary you hear lives out in the open. If you've got thoughts, fixes, or new ideas, open a PR on GitHub.
And if you enjoyed the show, give the project a ⭐ star on GitHub — it helps others find and contribute to the Platform Engineering Playbook.