The Platform Engineering Playbook Podcast

Welcome to The Platform Engineering Playbook Podcast — where everything you hear is built, reviewed, and improved by AI, by me, and by you, the listener.

This show keeps me — and hopefully you — up to speed on the latest in platform engineering, SRE, DevOps, and production engineering. It's a living experiment in how AI can help us track, explain, and debate the fast-moving world of infrastructure.

Every episode is open source. If you've got something to add, correct, or challenge, head to GitHub — open a pull request, join the conversation, and make the Playbook smarter.

Target Audience: Senior platform engineers, SREs, and DevOps engineers with 5+ years of experience seeking strategic insights on technology choices, market dynamics, and skill optimization.

🎥 Latest Episode: #087 - Kubernetes Upcoming Features Deep Dive

Daily Episode with News • 41 minutes • January 10, 2026 • Jordan and Alex

The Scheduler Finally Learns Math

Kubernetes 1.35 introduces Extended Toleration Operators (Gt, Lt) for threshold-based scheduling—finally enabling numeric comparisons like "schedule on nodes with at least 4 GPUs" or "tolerate up to 5% spot failure rate."

Two Alpha Features: Extended Toleration Operators (KEP-5471) for threshold-based scheduling with taints, and Mutable PersistentVolume Node Affinity (KEP-5381) for dynamic storage topology adaptation.

Key Finding: Current taints/tolerations are binary (Equal/Exists). The new Gt/Lt operators add numeric thresholds to tolerations. NodeAffinity already offers Gt/Lt on node labels, but it cannot evict running pods; tolerations can, via the NoExecute effect.

Platform Engineering Implications: GPU clusters, spot instance management, cost optimization, performance tiering—all benefit from threshold-based scheduling without complex NodeAffinity workarounds.

Action Items: Enable feature gates in test clusters, audit current taints for numeric threshold candidates, check CSI driver support for mutableNodeAffinity.
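
As a rough illustration of the threshold matching described above, the sketch below models how a toleration might be checked against a taint. It is written in Python for readability and is not taken from Kubernetes. It rests on several assumptions: that Gt matches when the taint's integer value is greater than the toleration's value (and Lt when it is less), that both values parse as integers, and that the taint key example.com/gpu-count is purely hypothetical. Check KEP-5471 for the actual semantics before relying on either direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # e.g. "NoSchedule" or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str
    operator: str  # "Equal", "Exists", "Gt", or "Lt"
    value: str = ""
    effect: str = ""  # empty string is treated as "any effect"


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint (simplified model)."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    if tol.operator == "Equal":
        return tol.value == taint.value
    if tol.operator in ("Gt", "Lt"):
        # ASSUMPTION: numeric comparison of taint value against toleration value.
        try:
            taint_n, tol_n = int(taint.value), int(tol.value)
        except ValueError:
            return False  # non-numeric values never match in this sketch
        return taint_n > tol_n if tol.operator == "Gt" else taint_n < tol_n
    return False


# Hypothetical taint: a node advertising 8 GPUs.
gpu_node = Taint(key="example.com/gpu-count", value="8", effect="NoSchedule")
# "At least 4 GPUs" expressed as "greater than 3".
needs_four = Toleration(key="example.com/gpu-count", operator="Gt", value="3")

print(tolerates(needs_four, gpu_node))
```

The model omits details of the real scheduler, such as an empty key with Exists matching every taint, so treat it as a way to reason about thresholds when auditing taints, not as a reference implementation.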

Previous Episode: #086 - Cloudspecs: The End of Moore's Law for Cloud Computing

All Episodes

Reverse-chronological list of all podcast episodes and published course lessons (newest first).

🎙️ #086: Cloudspecs - The End of Moore's Law for Cloud Computing (20 min) - TUM research reveals AWS i3 instances from 2016 still deliver best NVMe performance per dollar by ~2x. CPU cores increased 10x but cost-adjusted only 2-3x. Memory has "effectively flatlined." Network is the only bright spot (10x improvement). Interactive tool at cloudspecs.fyi using DuckDB-WASM. Authors: Till Steinert, Maximilian Kuschewski, Viktor Leis from CIDR 2026. News: AI coding tool DevOps challenges, Kubernetes Dashboard archived, Windows Secure Boot certs expiring June 2026, AWS Lambda .NET 10, Amazon MQ mTLS, MCP criticism, NVIDIA Rubin announcement.
🎙️ #085: Iran IPv6 Blackout - When Governments Weaponize Protocol Transitions (22 min) - Deep technical analysis of Iran's January 2026 IPv6 blackout. How governments weaponize protocol transitions to target mobile users (98.5% IPv6 drop, 12%→1.8% traffic share). BGP mechanics of selective blocking. Why mobile carriers depend on IPv6 (NAT exhaustion). Engineered degradation vs total blackout—maintaining economic function while disrupting protests. Starlink as a resilience layer. Platform engineering implications: protocol-specific monitoring, Happy Eyeballs testing, dual-stack resilience. Data residency vs availability tradeoffs. News: Kubernetes 1.35 CSI SA tokens, HashiCorp non-human identity, CoreDNS 1.14.0, OpenTelemetry Slack analysis, Route 53 Global Resolver, kernel bug hide times.
🎙️ #084: Venezuela BGP Anomaly - Deep Technical Analysis (28 min) - Special deep dive into the January 2026 Venezuela BGP route leak incident. Was it a cyberattack? The technical evidence says no: 10x AS-path prepending proves misconfiguration (prepending REPELS traffic, the opposite of what a MITM attack needs). Type 1 Hairpin Route Leak (RFC 7908), valley-free routing violations, 11 similar events from AS8048 since December 2025. RPKI at 54% global adoption validates origin but not path. RFC 9234 OTC and ASPA defenses explained. Why platform engineers should add BGP monitoring to their observability stack. Check your providers at isbgpsafeyet.com.
🎙️ #083: HolmesGPT - AI-Powered Root Cause Analysis for Kubernetes (24 min) - Deep dive into HolmesGPT, the CNCF Sandbox AI agent for cloud-native troubleshooting. Agentic architecture that creates investigation task lists, queries Prometheus/Grafana/Kubernetes/ArgoCD, and synthesizes findings. 40+ built-in toolsets, privacy-first design (bring your own LLM keys), read-only access, respects RBAC. End-to-end automation with AlertManager webhooks for automatic investigation. Installation via pip, Homebrew, or Helm for production Kubernetes. News: AirFrance-KLM automation platform, AWS ECS tmpfs mounts on Fargate, Qwen 30B on Raspberry Pi, AWS European Sovereign Cloud.
🎙️ #082: Docker Kanvas - Infrastructure as Design (27 min) - Docker launched Kanvas, a visual tool that turns architecture diagrams into deployable infrastructure. Built on Meshery (CNCF's 6th highest-velocity project), it converts Docker Compose to Kubernetes manifests. Designer Mode (GA) for drag-and-drop design with 300+ K8s operators, AWS/Azure/GCP services. Operator Mode (Beta) for live cluster management. Import existing Docker Compose, Helm charts, Kustomize configs. Export to GitOps-compatible formats for ArgoCD/Flux. Decision framework: when to use Kanvas vs Helm vs Kustomize. Practical adoption strategies for platform teams.
🎙️ #081: Remote MCP Architecture - Running AI Tool Servers on Kubernetes (24 min) - The MCP server registry hit 10,000+ integrations, but most teams are running these servers on laptops. Google, Red Hat, and AWS are converging on remote MCP servers deployed on Kubernetes. Three deployment patterns: local stdio (dev only), remote HTTP/SSE (team scale), managed remote (Google's GKE, BigQuery, GCE endpoints). Native vs wrapper architecture: Red Hat's Go-based server uses client-go directly, no kubectl subprocess parsing. Defense-in-depth security: dedicated ServiceAccounts, TokenRequest API (2-hour tokens), RBAC, --read-only mode, audit logging. Platform team ownership: sidecar per namespace, central gateway, or hybrid approach. Implementation roadmap: Q1 experiment (read-only dev), Q2 adopt (HTTP/SSE staging), Q3 scale (production RBAC).
🎙️ #080: AWS DevOps Agent - Promises vs Reality (26 min) - AWS launched DevOps Agent at re:Invent 2025 as an "autonomous on-call engineer." Agent Spaces provide isolated containers with automatic topology building (42 resources discovered in demo). Integrates with CloudWatch, Datadog, Dynatrace, New Relic, Splunk, GitHub Actions, GitLab CI/CD, ServiceNow, PagerDuty, MCP for custom tools. Critical limitation: cannot execute fixes—generates detailed mitigation plans but humans must approve every action. MTTR improvement 45→18 minutes when properly configured with 2-4 weeks training. Preview limits: 10 Agent Spaces, 20 incident hours/month, US-East-1 only, English-only, no SOC 2/ISO 27001. 5-question evaluation framework and ideal vs wait-and-see scenarios by cloud footprint. News: KubeCon Europe 2026 schedule (March 23-26, Amsterdam, 224 sessions), Platform Engineering 2026 predictions ("agentic infrastructure becomes standard").
🎙️ #079: AWS Graviton5 - ARM Takes Over the Data Center (11 min) - AWS doubled the core count on Graviton5 to 192 cores in a single socket with 180MB L3 cache (5.3x larger than Graviton4). Customer benchmarks: Atlassian 30% higher Jira performance, Honeycomb 20-25% lower latency, SAP 35-60% OLTP improvement. Single-socket design eliminates NUMA overhead with 33% lower inter-core latency. Nitro Isolation Engine with formal verification—mathematical proofs for security, not just testing. 98% of top 1,000 EC2 customers already on Graviton. Migration framework: audit dependencies, start with stateless services, set up dual-arch CI/CD. News: State of Platform Engineering 2026 report shows platform engineering "shifting down" to mid-market companies.
🎙️ #078: Can OpenTelemetry Save Observability in 2026? (18 min) - OpenTelemetry has 95% adoption predicted for 2026, but 43% of organizations haven't seen cost savings. Netflix processes 1M+ trace spans per episode using Flink stream processing—treating observability as data engineering. OTel collector enables "cost-control chokepoint" for sampling, filtering, and routing decisions. 40% targeting autonomous remediation by end 2026 (agent-first observability). SLOs becoming business conversations—error budgets as budget conversations for engineering-business alignment. Platform engineers becoming translators between telemetry and business impact. News: GitHub Actions 39% pricing reduction, Jaeger v2.14.0 legacy removal.
🎙️ #077: When Serverless Fails - Unkey's 6x Performance Migration to Containers (16 min) - Unkey rebuilt their entire API key management service from Cloudflare Workers to AWS Fargate—achieving 6x performance improvement. Root cause: 30ms p99 cache latency from serverless statelessness when they needed sub-10ms. "Zero network requests are always faster than one network request." Decision framework: Where in request path? What's p99 target? How hot is data? Complexity budget? Can you self-host? Unexpected bonus: container architecture made self-hosting trivial. News: Kubernetes 1.35 z-pages now support structured JSON responses with HTTP content negotiation for compliance automation.
🎙️ #067: Anthropic Blocks Third-Party CLI Tools - The AI Platform Control Paradox (18 min) - On January 9, 2026, Anthropic blocked third-party Claude Code wrappers overnight, breaking thousands of developer workflows. DHH called it "customer hostile." Technical deep dive: header spoofing, rate limit bypass, the $200 vs $1,000+ arbitrage that made this inevitable. Framework for evaluating AI platform dependencies: build abstraction layers, question unlimited claims, evaluate lock-in risk, check ToS, calculate true costs. "Unlimited" AI subscriptions are economically impossible without rate limiting—the friction IS the product.
🎙️ #066: CNPE Certification Study Guide - The Complete Deep Dive (18 min) - CNPE (Certified Cloud Native Platform Engineer) launched November 11, 2025 at KubeCon Atlanta—the first hands-on platform engineering exam in five years. Deep dive into exam format (17 tasks, 2 hours, 64% pass), all five domains (GitOps 25%, Platform APIs 25%, Observability 20%, Architecture 15%, Security 15%), BACK stack (Backstage, Argo CD, Crossplane, Kyverno), and complete study guide. Beta testers report 29% scores—harder than CKS. Golden Kubestronaut requires CNPE after March 2026. Career impact: platform engineers $160K-$220K, senior $250K+. News: Decathlon Spark→Polars migration, State of Platform Engineering 2026 survey, Business SLOs.
🎙️ #065: Kubernetes 1.35 Timbernetes Deep Dive (20 min) - Kubernetes 1.35 "Timbernetes" dropped December 17, 2025. 60 enhancements, 3 breaking changes: cgroup v1 REMOVED (kubelet won't start), containerd 1.x EOL, IPVS deprecated. In-Place Pod Resize GA (KEP-1287), 6 years from proposal to stable: resize CPU/memory without pod restart. Pod Certificates Beta (KEP-4317) for native mTLS. Gang Scheduling Alpha (KEP-4671) for AI workloads. DRA feature gate locked to enabled. Node Declared Features for version-skew scheduling. Practical upgrade checklist: verify cgroup v2, containerd 2.0+, test nftables mode.
🎙️ #064: Terraform Stacks + Native Monorepo Support (17 min) 🎬 - HashiCorp released native monorepo support and Terraform Stacks GA (September 2025). Component-based architecture with .tfstack.hcl files replaces copy-paste configurations. Deployments provide isolated state files per environment/region. Orchestration rules enable automated approvals with context-aware conditions. Linked stacks handle cross-stack dependencies declaratively. Workspace-to-stacks migration tool in beta—start with greenfield or non-critical workspaces. Advanced orchestration rules require HCP Terraform Plus Edition. News: Pulumi IaC supports Terraform/HCL directly (GA Q1 2026), vLLM v0.13.0 (NVIDIA Blackwell Ultra, DeepSeek optimizations), EC2 AZ ID API support, GPT-5.2-Codex (56.4% SWE-Bench Pro).
🎙️ #063: Docker Hardened Images - Free Security for Every Developer (11 min) - Docker released 1,000+ hardened container images under Apache 2.0 license—95% CVE reduction validated by SRLabs. Distroless runtime, complete SBOM, SLSA Level 3 provenance, 7-day patch SLA. Includes hardened MCP server images for AI agent infrastructure. Migration guide: multi-stage builds, test thoroughly for distroless constraints. Enterprise tier adds FIPS, STIG, 5-year ELS. News: First Linux Kernel Rust CVE (CVE-2025-68260), GitHub Actions pricing changes (39% reduction, self-hosted billing postponed indefinitely).
🎙️ #062: Kubernetes 1.35 "Timbernetes" - The End of the Pod Restart Era (16 min) - In-Place Pod Vertical Scaling goes GA in Kubernetes 1.35—change CPU/memory without restarting pods. Breaking changes: cgroup v1 REMOVED (not deprecated), containerd 1.x EOL, IPVS mode deprecated. Pod Certificates for Workload Identity (beta) enables native mTLS without cert-manager. PreferSameNode traffic distribution (GA), Gang Scheduling for AI (alpha), DRA feature gate locked. 60 enhancements: 17 Stable, 19 Beta, 22 Alpha. News: Docker Hardened Images free (1,000+ images, 95% CVE reduction), GitHub Actions pricing changes, First Linux Kernel Rust CVE, KubeVirt security audit complete.
🎙️ #061: 40,000x Fewer Deployment Failures: How Netflix Adopted Temporal (17 min) - Netflix reduced deployment failures from 4% to 0.0001% (40,000x improvement) using Temporal. Deep dive into durable execution: write code as if failures don't exist. Comparison: Temporal vs AWS Step Functions vs Apache Airflow vs Cadence. Netflix's Spinnaker/Clouddriver implementation with 2-hour fix-forward window. Lessons learned: avoid unnecessary child workflows, use single argument objects, separate business from workflow failures. When Temporal is (and isn't) right for your organization. News: Temporal $2.5B valuation (183K developers), K8s v1.35 security features, Shai-Hulud npm attack postmortem.
🎙️ #060: Helm Is Too Simple. Crossplane Is Too Complex. Is kro Just Right? (22 min) 🎬 - 48% of Kubernetes users struggle with tool choice (up from 29% in 2023). The Goldilocks problem of Kubernetes composition: Helm (too simple?), Crossplane (too complex?), kro (just right?). Decision framework included. kro vs krew confusion cleared (completely different tools!). Viktor Farcic's criticism addressed honestly: "no compelling improvement" - fair, but fills a gap. AWS/Google/Microsoft co-developed kro. News: Shai-Hulud npm attack (500+ packages), ingress-nginx retirement (March 2026), Netflix Maestro 100x rewrite.
🎙️ #059: Platform Engineering 2025 Year in Review (25 min) - 2025 was the year platform engineering grew up and got a reality check. This comprehensive year-in-review covers the 10 defining stories: AI-Native Kubernetes (DRA GA, AI Conformance v1.0), Platform Consensus (3 principles but 70% still fail), Infrastructure Concentration Risk (AWS $75M/hour outage, Cloudflare's 6 outages), IngressNightmare (CVE-2025-1974, 43% vulnerable), Agentic AI (80% unintended actions), Open Source Sustainability (60% maintainers unpaid), GPU Waste (13% utilization), Service Mesh Evolution (Istio Ambient GA), IaC Consolidation (IBM+HashiCorp, CDKTF deprecated), Gateway API Standard. Key takeaways: AI infrastructure standardized, platform engineering has a definition, concentration risk is real, fund open source ($2K/dev/year), GPU waste is the new cloud waste. Action items: Migrate to Gateway API before March 2026, implement DRA, audit agent policies. Plus news on Meta's BPF Jailer, "too big to fail" challenged, sustainable platform design.
🎙️ #058: Okta's GitOps Journey - Scaling ArgoCD from 12 to 1,000 Clusters (15 min) - In five years, Okta scaled Auth0's private cloud from 12 to 1,000+ Kubernetes clusters using ArgoCD. At KubeCon 2025, engineers Jérémy Albuixech and Kahou Lei shared their hard-won lessons in "One Dozen To One Thousand Clusters." The journey: 83x cluster growth over 5 years. The challenges: controller performance degradation (10-minute sync times), centralized bottlenecks, application explosion, global latency, observability gaps. The solutions: controller sharding (horizontal scaling), ArgoCD Agent hub-spoke model, Application Sets templating, progressive rollouts. Six key lessons: GitOps doesn't solve organizational problems, start small and scale incrementally (12→50→200→1,000), load testing is non-negotiable, observability unlocks confidence, ArgoCD isn't the only tool (Helm/Kustomize/External Secrets/OPA), plan for Day 2 operations. Practical guidance by scale: 10-50 clusters (single instance), 100-500 clusters (warning zone - plan sharding), 500+ clusters (Okta territory - dedicated team required). News: Helm v4.0.4 and v3.19.4, Zero Trust in CI/CD Pipelines, 1B row database migration, Azure HorizonDB, Platform Engineering State 2026.
🎙️ #057: Platform Engineering Team Structures That Work (18 min) - DORA 2025 shows 90% of organizations have platform initiatives, but most just renamed their ops team. Optimal team size is 6-12 people (Spotify squads). Dedicated platform leader at 100+ engineers shields from competing priorities. Team Topologies interaction patterns: Collaboration→X-as-a-Service evolution. Success metrics: self-service rate >90%, developer happiness, DORA metrics for consuming teams. Anti-patterns: rebranding without role change, underinvestment after launch, skill concentration trap, Field of Dreams building. 8% individual and 10% team productivity boost when done right. News: Sim (Apache 2.0 n8n alternative), Docker Hub credential leak (10K+ images), Meta BPF-LSM replacing SELinux, Litestream VFS, GitHub login failures, GPT-5.2.
🎙️ #056: CDKTF Deprecated - The End of HashiCorp's Programmatic IaC Experiment (14 min) - HashiCorp (IBM) archived CDK for Terraform on December 10, 2025, ending a five-year experiment in programmatic infrastructure-as-code. CDKTF had 243K weekly NPM downloads vs Pulumi's 1.1M (4-5x gap). Four failure factors: Pulumi's head start, JSII complexity, HCL "good enough", IBM acquisition timing. Migration paths: HCL (cdktf synth --hcl), Pulumi, OpenTofu, AWS CDK. Key lesson: adoption metrics are leading indicators of tool risk. News: Envoy CVE-2025-0913 (CVSS 8.6), Google MCP servers, OpenTofu 1.11, pgAdmin 4 v9.11, Lima v2.0, Amazon ECS custom stop signals.
📖 #055: AudioDocs - stern v1.32.0 - AI-narrated documentation for stern, the multi-pod log tailing tool. Tail logs from multiple pods simultaneously with color-coded output.
📖 #054: AudioDocs - CoreDNS v1.13.1 - AI-narrated documentation for CoreDNS, the flexible DNS server and CNCF graduated project that serves as Kubernetes cluster DNS.
📖 #053: AudioDocs - kubectx v0.9.5 - AI-narrated documentation for kubectx, the utility to manage and switch between kubectl contexts and namespaces.
🎙️ #052: AWS re:Invent 2025 - Data & AI Wrap-Up (Series Finale) (24 min) - Part 4 of 4 in our AWS re:Invent 2025 series (finale). S3 Tables GA with Intelligent-Tiering (80% cost savings) and automatic cross-region replication for Iceberg tables. Aurora DSQL uses GPS atomic clocks for global consistency, 4x faster than other distributed SQL, built 100% in Rust. S3 Vectors supports 2B vectors per index (40x preview increase), 90% cheaper than Pinecone/Weaviate/Qdrant. Clean Rooms ML generates privacy-enhanced synthetic datasets for collaborative ML. Database Savings Plans: up to 35% savings, flexible across engines/regions. Comprehensive series wrap-up connecting 50+ announcements: agents, chips, Kubernetes at scale, data services. Theme: AWS wants to make infrastructure boring. News: Envoy CVE-2025-0913, Rust in Linux kernel permanent, Let's Encrypt 10 years.
🎙️ #051: AWS re:Invent 2025 - EKS & Cloud Operations (18 min) - Part 3 of 4 in our AWS re:Invent 2025 series. EKS Ultra Scale supports 100,000 nodes per cluster (vs 15K GKE, 5K AKS), enabling 1.6 million Trainium accelerators or 800K GPUs. AWS replaced etcd Raft with an internal "journal" system and in-memory storage for 500 pods/second at 100K scale. Anthropic is using it for Claude training (35%→90%+ latency KPIs). EKS Capabilities brings managed Argo CD, ACK (200+ CRDs for 50+ services), and kro. EKS MCP Server enables natural language Kubernetes ("show me all pods not running"). Provisioned Control Plane with XL/2XL/4XL tiers ($1.65-$6.90/hr). CloudWatch gen AI observability for LangChain/CrewAI. DevOps Agent as autonomous on-call engineer (Kindle: 80% time savings). News: cert-manager CVE patches, Canonical K8s 15-year LTS, OpenTofu 1.11 ephemeral resources.
🎙️ #050: AWS re:Invent 2025 - Infrastructure & Developer Experience (14 min) - Part 2 of 4 in our AWS re:Invent 2025 series. Graviton5 delivers 192 cores (double the previous generation) with 40% better price-performance vs x86. Trainium 3 offers 4.4x AI training performance at 50% lower cost with NeuronLink eliminating 50% network overhead. Lambda Durable Functions enable year-long workflows with context.step and context.wait primitives. Werner Vogels introduces the "Renaissance Developer" framework: five qualities for thriving in the AI era. News: BellSoft hardened Java images (95% fewer CVEs), GitHub Actions package manager security gaps (54% have weaknesses), Proxmox DCM 1.0 (VMware escape hatch).
🎙️ #049: AWS re:Invent 2025 - The Agentic AI Revolution (17 min) - Part 1 of 4 in our AWS re:Invent 2025 series. AWS announces autonomous AI agents that can work for days without human intervention. DevOps Agent (86% root cause identification), Security Agent (context-aware from design to deployment), Kiro (250,000+ developers). All frontier agents stop at approval stage: humans review and decide. Werner Vogels introduces "verification debt" concept. 40% of agentic AI projects predicted to fail by 2027 (Gartner). Nova Act achieves 90% browser automation reliability. News: Model Context Protocol wins AI integration standard, Oxide publishes LLM code policy.
🎙️ #048: Developer Experience Metrics Beyond DORA (13 min) - DORA metrics revolutionized DevOps measurement, but they're not the complete picture. This episode explains DORA from the ground up—the four key metrics (Deployment Frequency, Lead Time, Change Failure Rate, MTTR), benchmarks (elite vs low performers), and why throughput correlates with stability. Then we explore what DORA misses: developer satisfaction, cognitive load, flow state. Covers SPACE framework (2021), DevEx (2023), and DX Core 4. Practical guidance on which framework to use and mistakes to avoid. News: Iterate.ai AgentOne for AI code security, AWS Lambda Durable Functions, Capital One OpenTelemetry optimization.
🎙️ #047: Cloudflare's Trust Crisis - December 2025 Outage and the Human Cost (12 min) - Three weeks after their November outage, Cloudflare went down AGAIN on December 5, 2025, with 28% of HTTP traffic impacted for 25 minutes. This is their SIXTH major outage of 2025. Beyond the technical postmortem (Lua killswitch bug), we examine the pattern of repeated failures, community reactions ("below 99.9% uptime"), and the human cost to on-call engineers. 67% IT burnout rate. Multi-CDN strategies, external monitoring, on-call wellness programs.
🎙️ #046: Cloud Cost Quick Wins for Year-End (12 min) - Global cloud spend hits $720B in 2025, with 20-30% wasted on unused resources. Six quick wins you can implement this week: scheduling non-prod (70% savings), right-sizing (25-40% per instance), Reserved Instances (up to 72% off), Spot instances (60-90%), storage tiering, and zombie hunting ($500-2K/month per account). Monday checklist included. News: Envoy v1.36.3 CVEs, Loki Operator 0.9.0, AWS Graviton5 M9g preview.
🎙️ #045: Platform Engineering vs DevOps vs SRE - The Identity Crisis (17 min) - Platform Engineer roles pay 20% more than DevOps roles, but job descriptions are 90% identical. Is this title inflation? We trace the origin stories: DevOps (2009) was a movement, not a job title. SRE (2003/2016) added Google's 50% engineering time rule. Platform Engineering (2018-2020) brought product thinking. Decision framework: DevOps culture first, then SRE for reliability pain, then Platform Engineering for cognitive load. The 20% premium pays for product thinking, not the title. Includes December 3, 2025 news: PgBouncer CVE-2025-12819, MinIO Docker CVE controversy, GitHub CI/CD OTel guide.
🎙️ #044: Platform Engineering Certification Tier List 2025 (30 min) - Are certifications worth it? We rank 25+ certifications using a data-driven 60/40 framework (60% skill-building, 40% market signal). CKA ($445, 66% pass rate, 45K+ job postings) remains the gold standard. Platform engineers earn $172K vs DevOps $152K (13% premium). AWS SA Associate overrated (500K+ holders). CNPE early adopters get 12-18 month advantage. Optimal stack: CKA + one cloud Professional + one specialty cert (~$1,200, 7-9 months). Includes AWS re:Invent 2025 news segment.
🎙️ #043: Kubernetes AI Conformance - The End of AI Infrastructure Chaos (17 min) - CNCF launched the Certified Kubernetes AI Conformance Program at KubeCon Atlanta (November 11, 2025)—the first vendor-neutral standard for AI workloads on Kubernetes. Five core certification requirements: Dynamic Resource Allocation (DRA), intelligent autoscaling, rich accelerator metrics, AI operator support (Kubeflow, Ray), and gang scheduling via Kueue/Volcano. 11+ vendors certified including AWS EKS, Google GKE, Microsoft Azure, Red Hat OpenShift, CoreWeave. DRA improves GPU utilization from 45-60% to 70-85%, reducing monthly GPU costs by 30-40%. Decision framework for when certification is critical vs less critical. ISO 42001 comparison (governance vs technical). v2.0 roadmap includes topology-aware scheduling and cost attribution.
🎙️ #042: Helm 4 Deep Dive - The Complete Guide to the Biggest Update in 6 Years (24 min) - Helm 4.0 dropped at KubeCon Atlanta 2025—the first major version in 6 years. Server-Side Apply replaces three-way merge, ending GitOps ownership conflicts. SSA delivers 40-60% faster deployments. WASM plugins via Extism bring sandboxed security but require post-renderer migration. 12-month runway with Helm 3 support until November 2026. Breaking changes: CLI flag renames (--dry-run=server, --force-replace), annotation changes. Complete migration guide with SSA testing, WASM plugin porting, and staged rollout strategy.
🎙️ #041: CNPE Certification Guide - The First Platform Engineering Credential (15 min) - Complete guide to CNCF's new Certified Cloud Native Platform Engineer exam. Five domains: GitOps/CD (25%), Platform APIs (25%), Observability (20%), Architecture (15%), Security (15%). Beta testers report 29% scores—no Killer.sh simulator until Q1 2026. Platform engineers earn $219K average (20% more than DevOps). Three certification paths: Traditional (CKA→CKS→CNPA→CNPE), Fast-track (CNPA→CNPE), Full Coverage (Kubestronaut→CNPE). CNPE required for Golden Kubestronaut after March 1, 2026. Tools to know: ArgoCD, Flux, Backstage, Crossplane, OpenTelemetry, Kyverno.
🎙️ #040: 10 Platform Engineering Anti-Patterns That Kill Developer Productivity (13 min) - DORA 2024 found organizations with platform teams saw throughput decrease by 8% and stability decrease by 14%. Why are so many platform investments backfiring? 10 anti-patterns: Ticket Ops (bottleneck factory), Ivory Tower Platform (disconnected from reality), Platform as Bucket (scope creep), Mandatory Adoption (forced usage), Golden Cage (over-standardization), Over-Engineered Monolith (complexity enemy), Front-End First (35% still use spreadsheets), Biggest Bang Trap (starting hard), Day 1 Obsession (under 1% of lifecycle), Build It And They Will Come (no marketing). What successful teams do: Spotify's Backstage users 2.3x more GitHub active, Zalando's first step was cultural, teams with stable priorities face 40% less burnout, adoption strategies yield 30% higher ROI. Audit checklist: devs waiting >1 day? platform team pair-programmed with devs? scope grown 3x? one-size-fits-all templates? beautiful portal but Slack for help?
🎙️ #039: Black Friday War Stories: Lessons from E-Commerce's Worst Days (12 min) - Black Friday special diving into the graveyard of e-commerce outages. Hall of Fame crashes: J.Crew ($775K lost in 5 hours, 323,000 shoppers), Walmart ($9M before Black Friday started), Best Buy 2014 (78% mobile traffic surprise), Cloudflare 2024 (99.3% of Shopify stores frozen). Famous non-Black-Friday disasters: AWS S3 2017 ($150M typo, 4+ hours, 100,000+ sites), GitLab 2017 (5 backup systems none working, 300GB deleted). k8s.af Kubernetes failure stories. Platform engineer's playbook: load test at 5-10x (not 2x), multi-CDN/multi-cloud, monthly restore tests, chaos practice, mobile-first design, dangerous command safeguards.
🎙️ #038: Giving Thanks to Your Dependencies: A Platform Engineer's Gratitude Guide (10 min) - Thanksgiving special on thanking open source maintainers. 60% of maintainers unpaid. 60% have left or considered leaving. Gratitude tools: npx thanks, npm fund, cargo-thanks, thanks-stars. Happiness Packets for anonymous thank-you notes. Beyond stars: why specific use case emails matter more. Company-level: Open Source Pledge ($2K/dev/year), GitHub Sponsors, Maintainer Month. Your 5-minute Thanksgiving challenge: run npx thanks, pick one dependency, send a thank-you email, donate $5-10, star the repos.
🎙️ #037: KubeCon Atlanta 2025 Part 3: Community at 10 Years - The Sustainability Question (14 min) - CNCF celebrates 10 years with 300,000 contributors—but the sustainability crisis is real. 60% of maintainers unpaid. 60% have left or considered leaving. XZ Utils backdoor showed what happens when isolated maintainers burn out. Han Kang tribute reminds us of the human cost. Technical sessions revealed: CiliumCon (TikTok IPv6 migration, 60K node clusters), in-toto graduation for supply chain attestation, Gateway API convergence, OpenTelemetry eBPF maturity. Open Source Pledge ($2,000/developer/year minimum) and Kubernetes governance improvements (6→4 subteams) offer hope. Framework: audit dependencies for maintainer health, join Open Source Pledge, invest in the people who write the code.
🎙️ #036: KubeCon Atlanta 2025 Part 2: Platform Engineering Consensus and Community Reality Check (17 min) - After years of definitional chaos, platform engineering reached consensus at KubeCon 2025: three principles (API-first self-service, business relevance, managed service approach), real-world adoption at Intuit/Bloomberg/ByteDance scale, and honest burnout conversations. The "puppy for Christmas" anti-pattern explains 70% platform team failure. Intuit migrated Mailchimp's 11M users invisibly. Bloomberg ran K8s for 9 years. ByteDance's AI Brix is 80% external contributors. EU CRA clarified: individuals NOT liable, Dec 2027 deadline manageable. Cat Cosgrove (K8s Steering Committee) reveals "ready to abandon ship" from work overload. CNCF's 200+ projects raise sustainability questions. Kubernetes reduced dependencies 416→247 through discipline. Framework for platform teams: 3-5 year timeline, managed service commitment, business metrics, SBOM generation, community health investment.
🎙️ #035: KubeCon Atlanta 2025 Part 1: AI Goes Native and the 30K Core Lesson (19 min) - Google donates a GPU driver live on stage. OpenAI saves $2.16M/month with one line of code. Kubernetes rollback finally works after 10 years. What changed at KubeCon Atlanta 2025 that proves Kubernetes isn't adapting to AI—it's being rebuilt for it. Dynamic Resource Allocation reaches GA in Kubernetes 1.34, preventing 10-40% GPU performance loss from NUMA misalignment ($200K/day waste at 100-node scale). Workload API arrives in alpha for gang-scheduling multi-pod AI training. OpenAI freed 30,000 CPU cores by disabling inotify in Fluent Bit after profiling revealed 35% CPU time on fstat64. Skip-version upgrades now supported with 99.99% success rate. Monday action plan: test DRA in development, profile your highest-CPU service with perf or eBPF, check for NUMA misalignment in GPU workloads.
🎙️ #034: The $4,350/Month GPU Waste Problem (28 min) - Your H100 costs $5,000/month but runs at 13% utilization—wasting $4,350 monthly per GPU. Analysis of 4,000+ Kubernetes clusters reveals why Kubernetes treats GPUs as atomic resources, and the five-layer optimization framework (MIG, time-slicing, VPA, Spot, regional arbitrage) that recovers 75-93% of lost capacity in 90 days. Real case study: 20 H100s → 7 H100s ($100K → $35K/month, 65% reduction). Multi-Instance GPU enables 84% savings for multi-tenant SaaS workloads. AWS EKS Split Cost Allocation launched Sept 2025 for pod-level GPU tracking. Complete 90-day implementation playbook with $780K annual savings target for 20-GPU clusters.
🎙️ #033: Service Mesh Showdown: Why User-Space Beat eBPF (20 min) - Kernel-level eBPF should beat user-space proxies—but Istio Ambient delivers 8% mTLS overhead while Cilium shows 99%. Academic benchmarks reveal why architecture boundaries matter more than execution location. 50,000-pod stability testing shows Cilium's distributed control plane crashed the API server under churn while Istio's centralized architecture handled it. Decision framework for choosing based on cluster size, traffic patterns (L4 vs L7), and cost analysis ($186K/year savings for 2,000-pod clusters).
🎙️ #032: The Terraform vs OpenTofu Debate (17 min) - HashiCorp's license change and IBM's $6.4B acquisition created the "you must migrate" narrative—but 70% of teams using Terraform in-house aren't legally affected. Fidelity's 50,000 state file migration case study, three-factor decision framework (Cloud lock-in, compliance, vendor tolerance), and why migration is 90% organizational change management. OpenTofu 1.7+ delivers state encryption after 5+ years of Terraform community requests.
🎙️ #031: Agentic DevOps - GitHub Agent HQ (18 min) - GitHub Universe 2025 announced Agent HQ—mission control for orchestrating AI agents from OpenAI, Anthropic, Google, and more. Azure SRE Agent saved Microsoft 20,000+ engineering hours. But 80% of companies report agents executing unintended actions, and only 44% have agent-specific security policies. Tiered adoption framework for deploying agents without creating catastrophic risk.
🎙️ #030: Cloudflare Outage November 2025 (13 min) - A routine database permissions change triggered Cloudflare's worst outage since 2019, taking down ChatGPT, X, Shopify, Discord, and 20% of the internet for 6 hours. Technical chain reaction from ClickHouse metadata exposure to an FL2 Rust proxy panic when ~60 features became >200 and exceeded a hardcoded limit. Third major cloud outage in 30 days raises infrastructure concentration risk questions.
🎙️ #029: Ingress NGINX Retirement (13 min) - The de facto standard Kubernetes ingress controller is being retired in March 2026, with no security patches afterward. Only 1-2 maintainers for years, InGate replacement failed, and platform teams have four months to migrate. Four-phase migration framework to Gateway API with controller comparison and immediate actions.
🎙️ #028: OpenTelemetry eBPF Instrumentation (14 min) - Complete observability without code changes sounds too good to be true—but kernel-level eBPF delivers under 2% CPU overhead. How Grafana's May 2025 Beyla donation to OpenTelemetry makes this mainstream, the TLS encryption catch nobody talks about, and decision framework for eBPF vs SDK instrumentation.
🎙️ #027: The Open Source Observability Showdown (20 min) - When "free" Prometheus costs $6-12K/month in engineer time. Shopify's dedicated observability team, VictoriaMetrics' 40-60% storage wins, Loki's 10x cheaper storage with 3-5x slower queries, and the three-tier operational maturity framework for Prometheus/Grafana/Loki/Tempo build vs buy decisions. $2B+ Datadog revenue explained.
🎙️ #026: The Kubernetes Complexity Backlash (13 min) - 92% market share meets 88% cost increases and 25% shrinking deployments. The 3-5x cost underestimation problem, 200-node rule, and when Docker Swarm/Nomad/ECS/PaaS beat Kubernetes. 37signals saved $10M+ leaving AWS. Teams finally did the math.
🎙️ #025: SRE Reliability Principles - The 26% Problem (15 min) - Only 26% of organizations use SLOs despite 49% saying they're more relevant. Error budgets remain timeless, Platform Engineering and SRE are complementary, and AI/ML needs adapted reliability principles. Practical playbook for starting from zero or fixing ignored SLOs
🎙️ #024: Internal Developer Portal Showdown 2025 (15 min) - Backstage costs $150K per 20 developers in hidden engineering time. Commercial platforms are 8-16x cheaper for most teams. Real pricing, timelines, and decision framework by team size
🎙️ #023: DNS for Platform Engineering (23 min) - A forty-year-old protocol keeps taking down billion-dollar infrastructure. October 2025 AWS outage: 15 hours from a DNS race condition. CoreDNS, ndots:5 trap, and the five-layer defensive playbook
🎙️ #022: eBPF in Kubernetes (25 min) - Your Kubernetes cluster is a black box—Prometheus shows symptoms, not causes. eBPF turns the Linux kernel into a programmable platform for observability, networking, and security
🎙️ #021: Time Series Language Models (20 min) - AI that reads your infrastructure metrics like language, explains anomalies in plain English, and predicts failures without training on your data. This technology exists now, but companies won't deploy it to production yet. Why?
🎙️ #020: Kubernetes IaC & GitOps - The Workflow Paradox (20 min) - 77% GitOps adoption yet deployments still take days. Why workflow design beats tool selection—ArgoCD vs Flux is a false choice, successful teams run both
🎙️ #019: The FinOps AI Paradox (12 min) - Companies invest $500K in AI FinOps tools, identify $3M in savings, but implement only 6%. Why sophisticated AI fails to reduce cloud waste and what the successful 6% actually do differently
📖 #016: Kubernetes Production Mastery - Lesson 03 (43 min) - Implement namespace-scoped RBAC roles, secure secrets management with Sealed Secrets/External Secrets, and remediate the 5 most common RBAC misconfigurations
🎙️ #015: The Cloud Repatriation Debate (13 min) - AWS charges 10-100x more than it should? Real companies saving millions by leaving the cloud, hidden costs exposed, decision frameworks for when cloud makes sense
🎙️ #014: Kubernetes in 2025: The Maturity Paradox (15 min) - 92% market share meets "do we need this?" backlash. Service mesh revolution, AI/ML integration, when to skip K8s for simpler alternatives
🎙️ #013: Backstage in Production: The 10% Adoption Problem (16 min) - The real $1M+ cost, why adoption stalls at 10%, and honest comparison with Port, Cortex, and custom portals
🎙️ #012: Platform Engineering ROI Calculator (15 min) - Prove platform value to executives: ROI formula, DORA→business translation, and stakeholder templates that saved teams from disbandment
🎙️ #011: Why 70% of Platform Engineering Teams Fail (12 min) - The critical PM gap, metrics blindness, and the 5 predictive metrics that separate success from $3.75M failures
📖 #010: Kubernetes Production Mastery - Lesson 02 (19 min) - Master requests vs limits, QoS classes, and the 5-step debugging workflow for OOMKilled pods
📖 #009: Kubernetes Production Mastery - Lesson 01 (17 min) - Learn the 5 failure patterns that break systems at scale and the 6-item production readiness checklist
🎙️ #008: GCP State of the Union 2025 (17 min) - When depth beats breadth: GCP's 32% growth vs AWS's 17%. 3x network performance advantages and automatic sustained use discounts
🎙️ #007: AWS Outage October 2025 (16 min) - The $75M/hour lesson: DNS race condition in DynamoDB cascaded into 70+ AWS services down, affecting 1000+ companies
🎙️ #006: AWS State of the Union 2025 (29 min) - Navigate 200+ AWS services with strategic clarity. Which 20 services matter, career tier frameworks, cost optimization strategies
🎙️ #005: Platform Tools Tier List 2025 (13 min) - Which skills command $24K+ higher salaries? Analyze 220+ tools, commoditization trap, S-tier specializations earning $130K-152K
🎙️ #004: PaaS Showdown 2025 (14 min) - Flightcontrol vs Vercel vs Railway vs Render vs Fly.io. Deep dive into 2025 PaaS landscape with pricing models and decision frameworks
🎙️ #003: Platform Economics (18 min) - Hidden costs and ROI of platform engineering. From cloud costs to engineering time, build vs buy decisions and opportunity costs
🎙️ #002: Cloud Providers (20 min) - AWS vs Azure vs GCP deep dive. Comprehensive comparison of strengths, weaknesses, pricing models, and decision frameworks
🎙️ #001: AI Platform Engineering (15 min) - Shadow AI and governance. The AI platform engineering crisis 85% of organizations face right now and how to build platforms that support AI workloads
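
The dollar figures quoted for episode #034 above can be checked with a few lines of arithmetic. This is a rough sketch, not something taken from the episode: it assumes a flat $5,000/month price per H100, and the function names are made up for illustration.

```python
def monthly_gpu_waste(cost_per_month: int, utilization_pct: int) -> float:
    """Dollars per month spent on idle capacity for one GPU (assumes flat pricing)."""
    return cost_per_month * (100 - utilization_pct) / 100


def consolidation_savings(gpus_before: int, gpus_after: int, cost_per_month: int):
    """Return (monthly cost before, monthly cost after, fractional reduction)."""
    before = gpus_before * cost_per_month
    after = gpus_after * cost_per_month
    return before, after, (before - after) / before


# One H100 at 13% utilization (figures from the #034 summary).
waste = monthly_gpu_waste(5000, 13)

# Case study from the same summary: 20 H100s consolidated to 7.
before, after, reduction = consolidation_savings(20, 7, 5000)
annual_savings = (before - after) * 12

print(f"Idle spend per GPU: ${waste:,.0f}/month")
print(f"Fleet cost: ${before:,} -> ${after:,}/month ({reduction:.0%} reduction)")
print(f"Annualized savings: ${annual_savings:,}")
```

Under those assumptions the numbers line up with the summary: $4,350 of idle spend per GPU, a 65% reduction, and $780K in annual savings for the 20-GPU case.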

Subscribe & Listen

The Platform Engineering Playbook Podcast is available on all major podcast platforms. Episodes are also available directly on this site.

Contribute

Every topic, transcript, and summary you hear lives out in the open. If you've got thoughts, fixes, or new ideas, open a PR on GitHub.

And if you enjoyed the show, give the project a ⭐ star on GitHub — it helps others find and contribute to the Platform Engineering Playbook.