The Platform Engineering Playbook Podcast

Welcome to The Platform Engineering Playbook Podcast — where everything you hear is built, reviewed, and improved by AI, by me, and by you, the listener.

This show keeps me — and hopefully you — up to speed on the latest in platform engineering, SRE, DevOps, and production engineering. It's a living experiment in how AI can help us track, explain, and debate the fast-moving world of infrastructure.

Every episode is open source. If you've got something to add, correct, or challenge, head to GitHub — open a pull request, join the conversation, and make the Playbook smarter.

Target Audience: Senior platform engineers, SREs, and DevOps engineers with 5+ years of experience seeking strategic insights on technology choices, market dynamics, and skill optimization.

🎥 Latest Episode: Kubernetes Production Mastery - Lesson 02


Courses

Structured, multi-episode educational series designed for deep learning and skill mastery. Each course uses a single-presenter lecture format optimized for retention and built on learning-science principles (spaced repetition, active recall, progressive complexity).

Available Courses:

📖 Kubernetes Production Mastery

Transform from a Kubernetes user into a production Kubernetes engineer. Learn how to run Kubernetes at scale with confidence through real-world failure patterns, systematic debugging, and battle-tested best practices.

  • Episodes: 10 lessons (2 published, 8 coming soon)
  • Duration: ~3 hours total
  • Level: Intermediate to Advanced
  • Prerequisites: Basic Kubernetes knowledge (pods, deployments, services)

What You'll Learn:

  • Production mindset: Think in failure modes, not just success cases
  • Resource management: Prevent OOMKilled and cascading failures
  • RBAC, secrets, and security for multi-tenant clusters
  • Systematic debugging workflow for production incidents
  • Stateful workloads, networking, and observability
  • Cluster operations, multi-tenancy, and advanced patterns

📖 Multi-Region Platform Engineering: AWS, Kubernetes, and Aurora at Scale

Master the LEGO architecture approach to multi-region systems. Learn the real 2.5-7.5x cost multiplier, compose building blocks (Aurora Global Database, EKS, Transit Gateway, DynamoDB Global Tables), and build production-grade architectures that match your actual needs—not aspirational ones.
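
To make the multiplier concrete, here is a minimal arithmetic sketch. Only the 2.5x and 7.5x bounds come from the course description; the function name and the $10,000 single-region baseline are hypothetical placeholders, so substitute your own monthly spend.

```python
def multi_region_cost_range(single_region_monthly: float,
                            low: float = 2.5,
                            high: float = 7.5) -> tuple[float, float]:
    """Return the (low, high) monthly cost implied by the 2.5x-7.5x multiplier."""
    if single_region_monthly < 0:
        raise ValueError("single_region_monthly must be non-negative")
    return single_region_monthly * low, single_region_monthly * high


# Hypothetical baseline: $10,000/month for the single-region setup.
low_cost, high_cost = multi_region_cost_range(10_000)
print(f"Estimated multi-region spend: ${low_cost:,.0f} to ${high_cost:,.0f} per month")
```

Where a given architecture lands inside that range depends on the building blocks you compose, which is the subject of the course itself.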

  • Episodes: 16 lessons (ALL PUBLISHED ✅)
  • Duration: ~4 hours total
  • Level: Advanced
  • Prerequisites: 5+ years production AWS/Kubernetes experience, distributed systems knowledge

What You'll Learn:

  • Cost reality: True multi-region costs and when single-region wins
  • Architecture patterns: Hot-hot, hot-warm, hot-cold—what actually works
  • AWS building blocks: Aurora, EKS, Transit Gateway, DynamoDB as composable pieces
  • Data strategies: Consistency models, replication, conflict resolution
  • Compliance: SEC SCI, MiFID II, crypto regulations (NY BitLicense, EU MiCA)
  • Decision frameworks: Calculate your actual requirements vs marketing hype

Complete course available with video lectures and detailed transcripts. All scripts validated, enhanced, and ready for learning!


Latest Episodes

Latest content includes course lessons and standalone podcast episodes. Course lessons are organized in the Courses section above.


Podcast Episodes

Episode 12: Platform Engineering ROI Calculator

🎙️ Platform Engineering ROI Calculator: Prove Value to Executives

45% of platform teams measure nothing and get disbanded when they can't prove ROI. Learn the exact ROI calculation framework that saved three platform teams from disbandment—with real numbers from startups (233% ROI) to enterprises (380% ROI). Discover how to translate DORA metrics into dollars executives understand, from deployment frequency to revenue impact, MTTR to SLA penalties. Includes CFO, CTO, and VP Eng stakeholder templates that speak their language.
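
As a rough illustration of the arithmetic behind an ROI figure, the sketch below uses the standard definition, ROI = (benefit - cost) / cost. This is not the episode's own framework. Every input (hours saved, headcount, hourly rate, team cost) is a hypothetical placeholder, not a number from the show.

```python
def roi_percent(annual_benefit: float, annual_cost: float) -> float:
    """Return ROI as a percentage: (benefit - cost) / cost * 100."""
    if annual_cost <= 0:
        raise ValueError("annual_cost must be positive")
    return (annual_benefit - annual_cost) / annual_cost * 100


# Hypothetical inputs, chosen only to show the calculation.
hours_saved_per_dev_per_week = 2      # time the platform gives back to each developer
developers = 50
loaded_hourly_rate = 100.0            # fully loaded cost per developer hour, in dollars
working_weeks_per_year = 48
platform_team_annual_cost = 300_000.0

annual_benefit = (hours_saved_per_dev_per_week * developers
                  * loaded_hourly_rate * working_weeks_per_year)

print(f"Annual benefit: ${annual_benefit:,.0f}")
print(f"ROI: {roi_percent(annual_benefit, platform_team_annual_cost):.0f}%")
```

Developer time saved is only one benefit line. The same function works once revenue impact or avoided SLA penalties are added to `annual_benefit`.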

Duration: 15 minutes

📝 Read the full blog post with detailed spreadsheets, stakeholder templates, and real-world case studies.


Episode 11: Why 70% of Platform Engineering Teams Fail

🎙️ Why 70% of Platform Engineering Teams Fail (And the 5 Metrics That Predict Success)

60-70% of platform engineering teams fail to deliver impact, with 45% disbanded within 18 months. We investigate why technically excellent teams with senior engineers and big budgets consistently fail—and uncover the shocking truth: it's not about technology. Learn the 5 predictive metrics that separate successful platforms from expensive failures, including the critical PM gap that explains Spotify's 99% adoption vs the industry's 10% average.

Duration: 12 minutes

📝 Read the full blog post with the 90-day playbook and comprehensive decision framework.


Episode 10: Kubernetes Production Mastery - Lesson 02

📖 Kubernetes Production Mastery - Lesson 02: Resource Management

Exit code 137 (OOMKilled) is the #1 production failure in Kubernetes—67% of teams have experienced it. Master the critical difference between resource requests and limits, learn the 5-step debugging workflow for OOMKilled errors, and discover how to right-size containers using P95/P99 metrics and Quality of Service principles. Real incidents: $94K lost in 47 minutes from missing limits, $2,400/month wasted from over-provisioning.
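
The right-sizing idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the lesson's exact workflow: the P95-for-request and P99-plus-headroom-for-limit rule, the 1.2 headroom factor, and the function names are all illustrative choices. The QoS check follows the Kubernetes rules for a single-container pod and compares quantity strings literally, so `"1"` and `"1000m"` would not be treated as equal.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_memory(samples_mib, headroom=1.2):
    """Assumed rule of thumb: request = P95, limit = P99 plus headroom."""
    request = percentile(samples_mib, 95)
    limit = math.ceil(percentile(samples_mib, 99) * headroom)
    return {"request_mib": request, "limit_mib": limit}


def qos_class(requests, limits):
    """QoS class for a single-container pod, given cpu/memory dicts."""
    if not requests and not limits:
        return "BestEffort"
    for resource in ("cpu", "memory"):
        limit = limits.get(resource)
        # Kubernetes defaults an unset request to the limit.
        request = requests.get(resource, limit)
        if limit is None or request != limit:
            return "Burstable"
    return "Guaranteed"


# Hypothetical memory usage samples in MiB, e.g. exported from your metrics system.
usage = [210, 220, 215, 230, 225, 240, 260, 235, 228, 310]
print(suggest_memory(usage))
print(qos_class({"cpu": "500m", "memory": "256Mi"},
                {"cpu": "500m", "memory": "256Mi"}))
```

Setting requests equal to limits for both CPU and memory yields the Guaranteed class, whose pods are the last to be evicted under node memory pressure. Setting nothing yields BestEffort, whose pods are evicted first.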

Duration: 19 minutes

Part of: Kubernetes Production Mastery Course - Episode 2 of 10


Episode 9: Kubernetes Production Mastery - Lesson 01

📖 Kubernetes Production Mastery - Lesson 01: Production Mindset

Learn the mental shift from development to production Kubernetes. Understand the 5 critical failure patterns that break systems at scale (OOMKilled, RBAC, health checks, storage, networking) and the 6-item production readiness checklist you must apply before any deployment. Stop following tutorials—start thinking like a production engineer who anticipates failure and designs for reliability.

Duration: 17 minutes

Part of: Kubernetes Production Mastery Course - Episode 1 of 10


Episode 8: GCP State of the Union 2025

🎙️ GCP State of the Union 2025: When Depth Beats Breadth

GCP grows at 32% while AWS manages 17%—nearly 2x faster despite having half the services. We break down why Google's depth-over-breadth strategy is winning AI/ML and data workloads in 2025. Learn about 3x network performance advantages, automatic sustained use discounts (no Reserved Instance forecasting!), and when GCP's specialist positioning beats AWS's generalist approach.

Duration: 17 minutes


Episode 7: AWS Outage October 2025

🎙️ The $75 Million Per Hour Lesson: Inside the 2025 AWS US-EAST-1 Outage

October 19, 2025. 11:48 PM. Ring doorbells stop. Robinhood freezes trading. Roblox goes dark. 6.5 million outage reports globally. A DNS race condition in DynamoDB cascaded into 70+ AWS services down, affecting 1000+ companies from gaming to government. We dissect the technical failure, the $75M/hour cost, and what it reveals about single-region control plane dependencies. Essential listening for multi-cloud strategy and resilience planning.

Duration: 16 minutes

📝 Read the full blog post with timeline, cost breakdown, and decision frameworks.


Episode 6: AWS State of the Union 2025

🎙️ AWS State of the Union 2025: Navigate 200+ Services with Strategic Clarity

You're an experienced platform engineer. AWS has over 200 services. Where do you start? We cut through the complexity: which 20 services matter, how AWS specialization ($127K) stacks against specialized tools on AWS ($135-139K), and practical guidance for engineers returning to AWS or migrating from Azure/GCP. Includes career tier frameworks, cost optimization strategies, and service selection playbooks.

Duration: 28 minutes


Episode 5: Platform Tools Tier List 2025

🎙️ The Platform Engineering Tools Tier List 2025: Which Skills Command $24K+ Higher Salaries

Which skills command $24K+ higher salaries? We analyze 220+ tools from the Dice 2025 report, break down the commoditization trap (Git -3%, Docker -2%, Kubernetes -1%), reveal S-tier specializations earning $130K-152K, and provide practical 18-month career roadmaps from B-tier to S-tier compensation.

Duration: 13 minutes


Episode 4: PaaS Showdown 2025

🎙️ PaaS Showdown 2025: Flightcontrol vs Vercel vs Railway vs Render vs Fly.io

A deep dive into the 2025 PaaS landscape. We break down pricing models, compare real-world costs for the same workload, and give you a decision framework for choosing the right platform for your team size and technical expertise.

Duration: 12-15 minutes


Episode 3: Platform Economics

🎙️ Platform Economics - The Hidden Costs of Infrastructure Decisions

We explore the economic realities of platform engineering — from cloud costs to engineering time, from build vs buy decisions to the opportunity cost of DIY infrastructure. Learn how to make financially sound technical decisions.

Duration: 8-10 minutes


Episode 2: Cloud Providers Deep Dive

🎙️ Cloud Providers - The Real Story Behind AWS, Azure, and GCP

A comprehensive comparison of the big three cloud providers — AWS, Azure, and GCP. We discuss their strengths, weaknesses, pricing models, and how to choose the right one for your organization's needs.

Duration: 18-20 minutes


Episode 1: AI Platform Engineering

🎙️ AI Platform Engineering - The Real Story Behind Shadow AI and Developer Productivity

We dive into the AI platform engineering crisis that 85% of organizations are facing right now. Shadow AI, governance that actually works, AIOps that delivers real ROI, and how to build platforms that support AI workloads without losing your mind.

Duration: 12-15 minutes


Subscribe & Listen

The Platform Engineering Playbook Podcast is available on all major podcast platforms. Episodes are also available directly on this site.

Contribute

Every topic, transcript, and summary you hear lives out in the open. If you've got thoughts, fixes, or new ideas, open a PR on GitHub.

And if you enjoyed the show, give the project a ⭐ star on GitHub — it helps others find and contribute to the Platform Engineering Playbook.