
Lesson 4: Troubleshooting Crashes - CrashLoopBackOff & Beyond

Kubernetes Production Mastery Course

Course: Kubernetes Production Mastery
Episode: 4 of 10
Duration: 15 minutes
Target Audience: Senior platform engineers, SREs, and DevOps engineers with 5+ years of experience

Learning Objectives

By the end of this lesson, you'll be able to:

  • Execute a systematic troubleshooting workflow for pod failures (describe → logs → events)
  • Diagnose CrashLoopBackOff, ImagePullBackOff, and Pending states
  • Configure effective health checks (liveness and readiness probes) that prevent false failures

Prerequisites


Video Lesson

Watch on YouTube: Kubernetes Troubleshooting - CrashLoopBackOff & Beyond


Topics Covered

The Systematic Troubleshooting Workflow

  • kubectl describe → logs → events workflow
  • Building team runbooks for common failures
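
As a starting point for a runbook, the describe → logs → events order can be captured as a small helper that emits the three kubectl invocations for a given pod. This is a minimal sketch, not a prescribed tool: the function name `triage_commands` and the pod and namespace values are hypothetical, while the kubectl subcommands and flags are standard ones.

```python
from typing import List


def triage_commands(pod: str, namespace: str = "default") -> List[List[str]]:
    """Return kubectl invocations in describe -> logs -> events order."""
    return [
        # 1. describe: container states, last termination reason, restart count
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        # 2. logs: --previous reads the crashed container instance,
        #    not the freshly restarted one
        ["kubectl", "logs", pod, "-n", namespace, "--previous"],
        # 3. events: messages recorded for this pod, sorted by timestamp
        [
            "kubectl", "get", "events", "-n", namespace,
            "--field-selector", f"involvedObject.name={pod}",
            "--sort-by=.lastTimestamp",
        ],
    ]


if __name__ == "__main__":
    # Hypothetical pod and namespace names, for illustration only.
    for argv in triage_commands("checkout-7d9f", "shop"):
        print(" ".join(argv))
```

Returning argument lists rather than one shell string keeps the commands safe to pass to `subprocess.run` without quoting issues, should a team choose to automate the runbook.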

CrashLoopBackOff Deep Dive

  • Application crashes vs infrastructure issues
  • Exit codes (137 = 128 + 9, killed by SIGKILL, most often OOMKilled; 1 = application error)
  • Understanding backoff delay patterns
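
The exit-code arithmetic and the backoff pattern are both easy to sketch. The block below assumes Linux signal numbers and the kubelet's documented default restart backoff (starting at 10 seconds, doubling on each restart, capped at 300 seconds); the function names are hypothetical, and clusters with non-default settings may behave differently.

```python
from typing import List

# Linux signal numbers for the signals most often seen in container exits.
SIGNALS = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code: int) -> str:
    """Translate a container exit code into a first troubleshooting hint."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        # Codes above 128 mean "terminated by signal (code - 128)".
        signum = code - 128
        name = SIGNALS.get(signum, f"signal {signum}")
        if signum == 9:
            return f"terminated by {name}; check whether the reason is OOMKilled"
        return f"terminated by {name}"
    return "application error (non-zero exit)"


def backoff_delays(restarts: int, base: int = 10, cap: int = 300) -> List[int]:
    """Delay in seconds before each of the first `restarts` restarts."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]


if __name__ == "__main__":
    print(explain_exit_code(137))
    print(backoff_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

A SIGKILL does not always mean an out-of-memory kill, so the hint points back to the termination reason shown by `kubectl describe` rather than asserting a cause.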

ImagePullBackOff

  • Registry authentication issues
  • Image not found and tag problems
  • Common registry misconfigurations
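
Many tag and registry problems become obvious once the image reference is split into its parts. The following is a simplified sketch of Docker-style reference parsing; `parse_image` and `ImageRef` are hypothetical names, the example hosts are placeholders, and the sketch skips details such as Docker Hub's implicit `library/` prefix.

```python
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    registry: str
    repository: str
    tag: Optional[str]
    digest: Optional[str]


def parse_image(ref: str) -> ImageRef:
    """Split an image reference into registry, repository, tag, and digest."""
    name, _, digest = ref.partition("@")
    # A tag colon can only appear after the last slash;
    # an earlier colon belongs to a registry port.
    last_slash = name.rfind("/")
    colon = name.find(":", last_slash + 1)
    tag = None
    if colon != -1:
        name, tag = name[:colon], name[colon + 1:]
    first, sep, rest = name.partition("/")
    # Convention: the first component is a registry host only if it
    # contains "." or ":" or is exactly "localhost".
    if sep and ("." in first or ":" in first or first == "localhost"):
        registry, repository = first, rest
    else:
        registry, repository = "docker.io", name
    if tag is None and not digest:
        tag = "latest"  # implicit default when nothing is pinned
    return ImageRef(registry, repository, tag, digest or None)


if __name__ == "__main__":
    for ref in ("nginx", "registry.example.com:5000/team/app:1.2.3"):
        print(ref, "->", parse_image(ref))
```

Seeing that a bare `nginx` resolves to `docker.io` with tag `latest` helps explain pulls that hit the wrong registry or an unpinned tag.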

Pending Pods

  • Scheduling failures and resource constraints
  • Node selectors and affinity rules
  • Diagnosing why pods won't schedule
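
To reason about why a pod stays Pending, it can help to model the two checks named above: resource requests against what each node has left, and node selector labels. This is a toy model under stated assumptions, not the real scheduler, which also weighs taints, affinity rules, and more; all class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    free_cpu_millis: int   # allocatable CPU minus CPU already requested
    free_memory_mib: int   # allocatable memory minus memory already requested
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class PodSpec:
    cpu_millis: int        # sum of container CPU requests
    memory_mib: int        # sum of container memory requests
    node_selector: Dict[str, str] = field(default_factory=dict)


def unschedulable_reasons(pod: PodSpec, nodes: List[Node]) -> Dict[str, str]:
    """Map node name -> first reason it rejects the pod ("" means it fits)."""
    reasons = {}
    for node in nodes:
        if any(node.labels.get(k) != v for k, v in pod.node_selector.items()):
            reasons[node.name] = "node selector mismatch"
        elif pod.cpu_millis > node.free_cpu_millis:
            reasons[node.name] = "insufficient cpu"
        elif pod.memory_mib > node.free_memory_mib:
            reasons[node.name] = "insufficient memory"
        else:
            reasons[node.name] = ""
    return reasons


if __name__ == "__main__":
    nodes = [
        Node("node-a", 500, 2048, {"disk": "ssd"}),
        Node("node-b", 4000, 8192, {"disk": "hdd"}),
    ]
    pod = PodSpec(1000, 1024, {"disk": "ssd"})
    print(unschedulable_reasons(pod, nodes))
```

In this example every node rejects the pod for a different reason, which mirrors the per-node breakdown that scheduling events typically report. Note that the model compares requests, not live usage: a node can look idle and still have no unrequested capacity.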

Health Checks That Actually Work

  • Liveness probes: Restart unhealthy containers
  • Readiness probes: Remove the pod from Service endpoints (and so from load balancing) while it is not ready
  • Startup probes: Handle slow-starting applications
  • Common mistakes: aggressive timeouts, wrong endpoints
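
One way to catch aggressive timeouts before they reach production is to compute how long a container may fail a probe before the kubelet acts, roughly `periodSeconds × failureThreshold`. The sketch below builds the three probes as a Python dict using the real Kubernetes field names; the paths `/healthz` and `/ready`, the port, and the chosen values are illustrative assumptions, not recommendations from the lesson.

```python
import json

# Probe settings for one container, using Kubernetes probe field names.
# Endpoints, port, and numbers are hypothetical examples.
container_probes = {
    "startupProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "periodSeconds": 10,
        "failureThreshold": 30,   # slow starters get up to 30 * 10 s
    },
    "livenessProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "periodSeconds": 10,
        "timeoutSeconds": 2,
        "failureThreshold": 3,
    },
    "readinessProbe": {
        "httpGet": {"path": "/ready", "port": 8080},
        "periodSeconds": 5,
        "failureThreshold": 2,
    },
}


def failure_window_seconds(probe: dict) -> int:
    """Approximate seconds of continuous failure before the kubelet reacts.

    Falls back to the Kubernetes defaults (periodSeconds=10,
    failureThreshold=3) when a field is omitted.
    """
    return probe.get("periodSeconds", 10) * probe.get("failureThreshold", 3)


if __name__ == "__main__":
    for name, probe in container_probes.items():
        print(f"{name}: ~{failure_window_seconds(probe)}s before action")
    # kubectl accepts JSON manifests, so this fragment can be pasted
    # into a container spec.
    print(json.dumps(container_probes, indent=2))
```

With these values the startup probe allows about 300 seconds for initialization, while the liveness probe restarts a hung container after roughly 30 seconds. A liveness window shorter than a normal garbage-collection pause or dependency hiccup is a common source of false restarts.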
