
Kubernetes Production Mastery - Lesson 01: Production Mindset

The Platform Engineering Playbook Podcast


  • Duration: 17 minutes
  • Presenter: Autonoe
  • Target Audience: Senior platform engineers, SREs, and DevOps engineers with 5+ years of experience


Learning Objectives

By the end of this lesson, you'll be able to:

  • Understand the mental shift required from development to production Kubernetes
  • Identify the 5 common failure patterns that break systems at scale
  • Apply the 6-item production readiness checklist before deploying

Prerequisites

  • Basic Kubernetes knowledge (pods, deployments, services)
  • Experience deploying applications to Kubernetes
  • Familiarity with YAML manifests


Transcript

Welcome to Kubernetes Production Mastery, a 10-part series designed for senior engineers who already have a firm grasp of the fundamentals. Throughout this course, we'll review the essential Kubernetes concepts and explore advanced topics that matter in your day-to-day production operations.

Here's what we'll cover. Episode 1: Production Mindset. Episode 2: Resource Management. Episode 3: Security and Secrets. Episode 4: Debugging Workflows. Episode 5: Stateful Workloads. Episode 6: Networking Deep Dive. Episode 7: Observability. Episode 8: Cost Optimization. Episode 9: GitOps. Episode 10: Multi-Cluster.

Let's begin with Episode 1: Production Mindset.

This lesson will prepare you to think like a production engineer, not a tutorial follower. You'll recognize the five critical failure patterns before they hit your clusters. And you'll use a production readiness framework to evaluate any workload.

Let's activate what you already know with a quick refresher. If you're nodding along, good. We're not here to re-teach basics. We're here to show you what breaks when this scales to production.

Kubernetes is a control loop system. You declare desired state in YAML, Kubernetes reconciles actual state to match it. That's it. That's the entire mental model. Declarative, not imperative. Reconciliation, not execution.

Think of it like a thermostat. You set the desired temperature to 70 degrees. The thermostat continuously checks the actual temperature and adjusts heating or cooling to match. Kubernetes does this for your containers.

Now, the core objects. Lightning round.

Pods. Smallest deployable unit. One or more containers sharing network and storage. Every container runs in a pod. Pods are ephemeral - they die, new ones replace them.
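
As a quick sketch, a minimal Pod manifest looks something like this (name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web
spec:
  containers:
    - name: web
      image: nginx:1.27        # any container image
      ports:
        - containerPort: 80
```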

Deployments. They manage ReplicaSets. Handle rolling updates. Maintain desired pod count. If you want 5 replicas, the Deployment ensures 5 pods are always running.
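
The "5 replicas" idea, sketched as a Deployment (names and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 5                  # desired state: 5 pods, always
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.27
```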

Services. Stable network endpoint for pods. Three types you care about: ClusterIP for internal only, that's the default. NodePort exposes on every node. LoadBalancer provisions a cloud load balancer. Services route traffic to healthy pods.
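
A minimal ClusterIP Service for those pods might look like this (ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: ClusterIP              # the default; internal-only
  selector:
    app: web                   # traffic goes to healthy pods with this label
  ports:
    - port: 80                 # service port
      targetPort: 80           # container port
```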

ConfigMaps and Secrets. Configuration data and sensitive data. Mounted as volumes or environment variables.
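
A sketch of each (keys and values are placeholders; Episode 3 covers handling secrets properly):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: web-config
data:
  LOG_LEVEL: "info"
---
apiVersion: v1
kind: Secret
metadata:
  name: web-secret
type: Opaque
stringData:
  DB_PASSWORD: "change-me"     # placeholder; never commit real credentials
```

Either one can be pulled into a container through env or envFrom, or mounted as a volume.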

Namespaces. Logical cluster subdivision. Isolation boundary.

The kubectl basics. Four commands handle 80% of troubleshooting. kubectl get pods shows what's running. kubectl describe pod shows why it's failing. kubectl logs shows what it said. kubectl apply deploys it.

But here's the biggest misconception. If I understand these objects, I understand Kubernetes. No. You understand development Kubernetes. Production Kubernetes is a different beast entirely. Let me show you why.

Development Kubernetes is like driving a car in an empty parking lot. Production Kubernetes is like driving on a Los Angeles freeway at rush hour in the rain. The vehicle is the same, but the stakes, complexity, and failure modes are entirely different.

In a dev cluster, you're running a single cluster, maybe minikube on your laptop. Resources feel infinite. You're the only user. When something breaks, you restart it. Security? It's just local. You have one namespace, maybe two. Monitoring? Kubectl describe is fine.

In production? Organizations commonly run dozens of clusters. Resources are constrained and cost real money. You have thousands of pods, dozens of teams. When something breaks, customers notice and revenue stops. Security breaches make headlines. You've got namespace sprawl with complex RBAC. Monitoring is mission-critical. Alerts wake you at 2 AM.

Here's the core principle. In development, you optimize for speed of iteration. In production, you optimize for reliability, security, and cost. These goals are often in tension. Production engineering is the art of balancing these trade-offs.

When I review a cluster configuration, here's what I'm thinking. First: What happens WHEN this fails? Not if, when. Second: Can this survive a node dying? A zone outage? Third: Will this cost spiral when traffic spikes? Fourth: If compromised, what's the blast radius? Fifth: Can I debug this at 2 AM with limited context?

That's the production mindset. Always asking what breaks, and how do I handle it.

Now let's talk about the five things that consistently break at scale. I'm giving you names and quick previews. The next six episodes will dive deep into each one.

Pattern 1: The OOMKilled Surprise.

Your pods work fine in dev. In production under load: exit code 137, OOMKilled, CrashLoopBackOff, service degrades. Why do tutorials miss this? Dev traffic is toy traffic. You never hit memory limits. The reality? No resource limits means nodes can be overcommitted. Memory pressure leads to OOM kills. You need both requests AND limits, properly sized. Episode 2 fixes this completely.
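
Here's a sketch of what properly declared requests and limits look like (names, image, and numbers are placeholders; Episode 2 covers how to size them):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example.com/api:1.0   # illustrative image
          resources:
            requests:
              cpu: 250m                # what the scheduler reserves on a node
              memory: 256Mi
            limits:
              cpu: 500m                # CPU is throttled above this
              memory: 512Mi            # exceeding this => exit code 137, OOMKilled
```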

Pattern 2: The RBAC Blindspot.

Everything has cluster-admin access. An intern's script deletes production namespaces. Or worse, an attacker gets service account tokens and pivots across your entire infrastructure. Why do tutorials miss this? RBAC seems unnecessary when you're the only user. The reality? RBAC misconfigurations are among the most common Kubernetes security issues. Least privilege isn't optional. Episode 3 covers RBAC and secrets management.
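
A sketch of least-privilege RBAC for a single workload (namespace, names, and verbs are illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api
  namespace: prod
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: api-read-config
  namespace: prod
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]   # only what the app actually needs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-read-config
  namespace: prod
subjects:
  - kind: ServiceAccount
    name: api
    namespace: prod
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: api-read-config
```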

Pattern 3: Health Checks That Lie.

Your app is deadlocked, not serving requests, but the pod shows Running. Load balancer keeps sending traffic. Users get timeouts. Why do tutorials miss this? Simple apps just start and work. The reality? Liveness probes prevent zombies. Readiness probes prevent traffic to unready pods. Get these wrong, service degrades silently. Episode 4 teaches troubleshooting and health checks.
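
A sketch of both probes on a pod (endpoints, port, and timings are placeholders; Episode 4 goes deeper on tuning them):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      image: example.com/api:1.0   # illustrative image
      ports:
        - containerPort: 8080
      livenessProbe:               # failing repeatedly => kubelet restarts the container
        httpGet:
          path: /healthz           # assumed health endpoint
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:              # failing => pod is removed from Service endpoints
        httpGet:
          path: /ready             # assumed readiness endpoint
          port: 8080
        periodSeconds: 5
```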

Pattern 4: Storage and State.

Your database pod gets rescheduled. All data is gone. Or your PVC is stuck Pending and pods won't start. Why do tutorials miss this? Stateless apps don't need storage. The reality? Every real production system needs persistence. StatefulSets, PVs, PVCs, Storage Classes - you need to understand all of it. Episode 5 covers stateful workloads.
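
A sketch of a PersistentVolumeClaim, the building block behind all of it (storage class and size are placeholders; Episode 5 covers the moving parts):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  accessModes:
    - ReadWriteOnce              # one node mounts it read-write
  storageClassName: gp3          # assumed StorageClass; must exist in your cluster
  resources:
    requests:
      storage: 20Gi
```

A StatefulSet typically declares this under volumeClaimTemplates so each replica gets its own PVC that survives rescheduling.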

Pattern 5: The Networking Mystery.

Pods can't reach services. Or they can, but shouldn't for security. Network policies? What are those? Service mesh adding 200 milliseconds of latency? Why do tutorials miss this? Default networking is flat and permissive. The reality? Multi-tenancy requires network policies. Ingress controllers need TLS. Service mesh is powerful but complex. Episodes 6 and 7 cover networking and observability.
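
A sketch of the default-deny-plus-allow pattern for network policies (namespace and labels are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: prod
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes:
    - Ingress                  # no ingress rules listed => all inbound traffic denied
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend    # only frontend pods may reach the api pods
      ports:
        - protocol: TCP
          port: 8080
```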

Notice a pattern? Tutorials optimize for "hello world." Production optimizes for "survives the Friday afternoon deploy." That's what we're learning here.

Here's your production readiness checklist. Memorize these six items. They're your guardrails.

One. Resource Limits and Requests. Every container must have both. Requests are what you need for scheduling. Limits are the maximum allowed to prevent noisy neighbors. Missing these? OOMKilled is in your future.

Two. Health Checks Configured. Liveness and readiness probes. Liveness restarts if unhealthy. Readiness removes from the load balancer if not ready. Missing these? Silent failures.

Three. RBAC Least Privilege. No cluster-admin for workloads. Service accounts per application. Namespace-scoped roles, not cluster-wide. Missing this? Security incident waiting.

Four. Multi-Replica Deployments. Never single replicas in production. Minimum 2 to 3 replicas for availability. Pod disruption budgets. Missing this? Downtime during deploys.
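
A sketch of the pod disruption budget piece (names and numbers are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2              # voluntary disruptions (node drains, upgrades) keep at least 2 pods up
  selector:
    matchLabels:
      app: api                 # must match the Deployment's pod labels
```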

Five. Observability Enabled. Metrics, logs, alerts. Prometheus or equivalent. Log aggregation. Actionable alerts. Missing this? Debugging nightmare.

Six. Security Baseline. Non-root containers. Image vulnerability scanning. Network policies. Missing this? Compliance failures.
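
A sketch of the non-root part of that baseline (UID and settings are illustrative; image scanning and network policies live outside the manifest):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  securityContext:
    runAsNonRoot: true               # refuse to start containers running as UID 0
    runAsUser: 10001                 # illustrative non-root UID
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: api
      image: example.com/api:1.0     # illustrative image; scan it in CI
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
```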

For every workload, ask: Does this pass all 6 checks? If no, it's not production-ready. Period.

In other words, this checklist is your definition of done. Not deployed, done. Not running, production-ready.

Two common pitfalls to avoid.

Mistake 1: "It works in dev" equals "production ready." Why this fails? Dev doesn't replicate scale, load, or failure modes. The fix? Load test, chaos engineering, assume everything breaks. Reality check: If you haven't tested failures, they'll test you.

Mistake 2: Copy-paste from tutorials. Why this fails? Tutorials optimize for demo, not production. The fix? Understand every line of your manifests. Reality check: That GitHub example probably lacks resource limits.

Before we wrap up, pause and test yourself. Don't look back. Answer from memory.

What are the five production failure patterns we covered? Name three items from the production readiness checklist. What's the key difference between development and production mindset?

Give yourself a few seconds.

Here are the answers. The five patterns: OOMKilled, RBAC misconfig, health checks, storage and state, networking. Three checklist items - any three work: Resource limits and requests, health checks, RBAC, multi-replica, observability, security baseline. The key difference? Dev optimizes for speed. Production optimizes for reliability, security, and cost.

Let's recap what we covered.

One. Dev does not equal production. Different stakes, scale, and failure modes require different thinking.

Two. Five failure patterns. OOMKilled, RBAC, health checks, storage, networking. Know them.

Three. Six-item checklist. Resources, health checks, RBAC, replicas, observability, security. Memorize it.

Four. Production mindset. Always ask: What happens WHEN this fails?

This is your foundation. Every episode builds on this.

This 10-episode course is your production survival guide. Each episode tackles one of these patterns in depth with debugging workflows and decision frameworks. We're building you into a production Kubernetes engineer.

Here's why this matters. Senior engineers aren't judged by whether they can deploy a pod. They're judged by whether their systems stay up under pressure, scale efficiently, and fail gracefully. That's what we're building toward.

Next time: Episode 2, Resource Management: Preventing OOMKilled.

You'll learn the difference between requests and limits, and why both matter. How to debug OOMKilled errors from symptoms to root cause. Load testing and right-sizing strategies for production workloads.

We're taking the number one production failure pattern and giving you a complete debugging and prevention playbook. You'll never see OOMKilled the same way again.

I'll see you in Episode 2.