Lesson 9: GitOps & Deployment Automation
Kubernetes Production Mastery Course
Course: Kubernetes Production Mastery
Episode: 9 of 10
Duration: ~18 minutes
Target Audience: Senior platform engineers, SREs, DevOps engineers with 5+ years experience
Learning Objectives
By the end of this lesson, you'll be able to:
- Implement GitOps principles with ArgoCD or Flux based on team size and requirements
- Choose between Helm and Kustomize for configuration management based on templating needs
- Configure canary deployments with Argo Rollouts for progressive validation
Prerequisites
- Lesson 1: Production Mindset
- Lesson 2: Resource Management
- Lesson 3: Security Foundations
- Lesson 4: Health Checks & Probes
- Lesson 5: Stateful Workloads
- Lesson 6: Networking & Services
- Lesson 7: Observability
- Lesson 8: Cost Optimization
Transcript
Welcome to Episode 9 of Kubernetes Production Mastery. Last episode, we optimized costs. Right-sizing workloads, cluster autoscaling, resource quotas. You're paying for what you need. But here's what happens next: your team needs to deploy an update. To twenty clusters. Dev, staging, three production regions, disaster recovery, per-team clusters. How do you deploy consistently?
Here's the nightmare scenario. You kubectl apply to cluster one. Then cluster two. Then cluster three. Midway through cluster seven, you realize there's a typo in the YAML. Some clusters have the old version. Some have the new version. Some have the broken version. Configuration drift begins. Clusters that should be identical aren't. No audit trail. Who deployed what? When? Why? No rollback strategy. How do you undo this mess across twenty clusters?
Episode One called manual kubectl in production an anti-pattern. Today you'll see why. GitOps makes Git your source of truth. ArgoCD or Flux automates deployment. Helm or Kustomize manages configuration. And deployment strategies go beyond rolling updates to canary releases and blue-green deployments.
By the end, you'll implement GitOps principles with ArgoCD or Flux. You'll choose between Helm and Kustomize based on your templating needs. And you'll configure canary deployments that validate progressively instead of all-or-nothing rollouts.
Why Manual kubectl Breaks Down
Let me show you why manual kubectl breaks down. At one cluster, kubectl apply works fine. Quick, simple, direct. At five clusters, it's still manageable with shell scripts. Loop through clusters, apply manifests. Tedious but doable. At twenty-plus clusters? Completely breaks down.
Time becomes your enemy. Twenty clusters times five minutes per deployment equals one hundred minutes. That's nearly two hours to deploy a single change. Errors multiply. Manual process means typos, wrong files, skipped clusters. Drift accelerates. Clusters slowly diverge. Different versions. Different configurations. Nobody can explain why dev cluster three has different settings than dev cluster four. Audit trail doesn't exist. No record of who changed what when. Rollback during an incident? Manual rollback to twenty clusters is too slow. You're trying to stop the bleeding while the system hemorrhages.
GitOps Solution
GitOps solves all of this. Commit your change to Git. Takes thirty seconds. GitOps tool deploys to all twenty clusters automatically. Takes two to five minutes. Git history shows you who, what, when, why. Complete audit trail. Rollback is git revert. GitOps redeploys the previous state to all clusters automatically. Drift prevention happens continuously. GitOps reconciles cluster state to Git every few minutes. Someone manually edits a deployment? GitOps reverts it. Git is always right.
Think of Git as your building blueprints. The cluster is the actual building. If someone makes unauthorized changes to the building, you rebuild it from the blueprints. That's GitOps. Continuous reconciliation ensures the building matches the blueprint.
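As a concrete sketch, here's roughly what an ArgoCD Application with automated reconciliation looks like. The repo URL, path, and names are illustrative assumptions, not from this course:

```yaml
# Hypothetical ArgoCD Application: repo URL, paths, and names are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-config.git
    targetRevision: main
    path: apps/myapp
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual kubectl edits back to the Git state
```

The `selfHeal: true` line is the "rebuild from the blueprints" behavior: any manual drift gets reconciled back to what Git declares.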
GitOps Benefits
The benefits go beyond just deployment speed. Version control means all changes are tracked. Branching for features. Pull requests for review. Merge commits create an approval trail. Disaster recovery becomes trivial. Lose an entire cluster? Recreate it from Git in minutes, not days. Security improves. Developers don't need direct kubectl access to production. Git pull request approval is required. The GitOps tool has credentials, not humans. Consistency across environments. Dev, staging, production all derive from the same Git structure. Differences are explicit through environment-specific configurations.
Remember Episode One's production readiness checklist? Repeatable deployments. GitOps achieves this. No manual steps. No "works on my machine" problems. Git defines production.
ArgoCD vs Flux
Now the question becomes: which GitOps tool? ArgoCD or Flux? Both are CNCF graduated. Both are production-proven. Both are actively maintained. The choice depends on your team culture and requirements.
ArgoCD
ArgoCD is the full-featured option. Web UI shows you all applications, their sync status, health checks, deployment history. Multi-tenancy through projects that separate teams, namespaces, Git repositories with granular role-based access control. Sync waves let you control deployment order. Deploy the database before the application. Hooks enable pre-sync and post-sync scripts. Run database migrations. Execute smoke tests. App-of-apps pattern means ArgoCD can deploy ArgoCD applications. Bootstrap entire environments from a single manifest. Health assessments go beyond Kubernetes readiness. Custom health checks for your specific applications.
The strengths are clear. UI reduces the learning curve. Developers can see deployment status without touching the command line. Enterprise features come out of the box. Single sign-on, audit logs, comprehensive role-based access control. Larger ecosystem with plugins, integrations, extensive community support. Multi-cluster support from day one.
The weaknesses? Heavier weight. More components mean more resource usage. Higher operational complexity. More to configure, more to troubleshoot. Opinionated structure that may not fit all workflows.
ArgoCD is best for large teams. Fifty-plus engineers needing multi-tenancy. Organizations requiring a UI for visibility and debugging. Multi-cluster deployments from day one. Teams wanting comprehensive role-based access control and audit trails.
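Sync waves, for instance, are just an annotation; lower waves sync first. A minimal sketch with illustrative names and the specs elided:

```yaml
# Wave 0 syncs before wave 1: the database comes up before the application.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  annotations:
    argocd.argoproj.io/sync-wave: "0"
# ... spec omitted ...
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  annotations:
    argocd.argoproj.io/sync-wave: "1"
# ... spec omitted ...
```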
Flux
Flux takes a different approach. Lightweight with minimal footprint and less resource usage. GitOps-native design means Git is your interface. No web dashboard by default. Built-in Kustomize integration. Native Helm chart deployment through the Helm controller. Notification system sends webhook alerts to Slack, PagerDuty, wherever you need them. OCI support lets you store manifests in OCI registries, not just Git.
Flux's strengths are simplicity. Fewer moving parts. Lower resource usage, good for smaller clusters and edge deployments. Command-line first, which is automation-friendly and fits the GitOps philosophy. Flexible toolkit approach. Use the components you need.
The weaknesses? No built-in UI. Third-party UIs exist like Weave GitOps, but they're separate. Less opinionated means more setup required. Steeper learning curve for teams wanting a graphical interface.
One note about Flux: Weaveworks, the company that created Flux, shut down in twenty twenty-four. But Flux is CNCF-backed and actively maintained by the community. It's not going anywhere.
Flux is best for small to medium teams comfortable with command-line and Git-centric workflows. Automation-first organizations where CI/CD pipelines and infrastructure as code are the norm. Resource-constrained environments. Teams wanting a lightweight, composable toolkit.
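To make Flux's toolkit approach concrete, here's a sketch of its two core objects: a GitRepository source and a Kustomization that applies a path from it on an interval. URL and paths are illustrative assumptions:

```yaml
# Hypothetical Flux setup: poll the repo every minute, apply ./apps every five.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform-config
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/example/platform-config
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: platform-config
  path: ./apps
  prune: true   # remove cluster resources deleted from Git
```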
Decision Framework: ArgoCD vs Flux
Here's my decision framework. Choose ArgoCD if you have a large team needing visibility and self-service. If a UI is required for developers, operations, or management. If multi-tenancy with strict role-based access control is needed. If enterprise features like single sign-on and audit logs are required from day one.
Choose Flux if you have a small to medium team comfortable with CLI and Git. If you have an automation-first culture built around CI/CD pipelines and infrastructure as code. If lightweight footprint is desired for edge or resource-constrained deployments. If you prefer a toolkit approach that's composable and flexible.
Both are good. This isn't a right-or-wrong choice. Choose based on team culture and requirements. And you can switch later. Migration from ArgoCD to Flux or vice versa is possible. Git manifests are portable. Don't overthink the initial choice.
Helm vs Kustomize
Now let's talk about configuration management. Helm versus Kustomize. Helm is Kubernetes' package manager with a templating engine. Charts are packages containing templates. Values files customize deployments. Helm renders templates using your values, generates manifests, applies them to the cluster.
Helm's strengths are templating. Variables, conditionals, loops. Don't repeat yourself. Package manager capabilities. Install, upgrade, rollback applications as units. Charts ecosystem with thousands of pre-built charts from Bitnami and official sources. Versioning. Charts have versions that track application releases. Dependency management. Charts can depend on other charts.
Use Helm when you have complex applications with many configuration options. When you need to install third-party software. PostgreSQL charts, Redis charts, monitoring stacks. When you want versioned releases with rollback capability. When you have multiple environments with significant configuration differences.
Example: deploy PostgreSQL with different sizes per environment. One Helm chart, with a values-dev.yaml specifying small resources and a values-prod.yaml specifying high availability and large resources. Same chart, different values.
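Those values files might look something like this. The keys follow the common Bitnami PostgreSQL chart conventions, but treat them as illustrative:

```yaml
# values-dev.yaml -- small footprint, no HA
architecture: standalone
primary:
  resources:
    requests:
      cpu: 250m
      memory: 256Mi
---
# values-prod.yaml -- replication and larger resources
architecture: replication
readReplicas:
  replicaCount: 2
primary:
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
```

Then `helm install db bitnami/postgresql -f values-dev.yaml` in dev, and the same command with `-f values-prod.yaml` in production.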
Kustomize
Kustomize is template-free configuration management using overlays. Base manifests contain common configuration. Overlays contain environment-specific patches. Kustomize merges base plus overlay to generate final manifests.
Kustomize's strengths are no templating. Raw YAML that's easier to read and understand. Overlays let you patch base config for different environments. Dev, staging, production. Native to kubectl. kubectl apply -k works without additional tools. Composable. You can overlay on overlay. Base, then team overlay, then environment overlay. Simpler mental model. "Patch these fields" versus "template logic."
Use Kustomize when you have simpler applications. Typical microservices, standard deployments. When you prefer declarative patches over templating. When environment differences are minor. Replica counts, resource limits, image tags. When you want no dependencies. Kustomize is built into kubectl.
Example: base deployment specifies three replicas. Dev overlay patches to one replica. Production overlay patches to five replicas with more resources.
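That base-plus-overlay structure is two small files. A sketch, with illustrative names:

```yaml
# base/kustomization.yaml
resources:
  - deployment.yaml   # declares 3 replicas
---
# overlays/prod/kustomization.yaml -- patch the base up to 5 replicas
resources:
  - ../../base
patches:
  - target:
      kind: Deployment
      name: myapp
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 5
```

Apply with `kubectl apply -k overlays/prod`. No templates, just patched raw YAML.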
Decision Framework: Helm vs Kustomize
Here's the decision. Use Helm when you have complex apps with many configuration knobs. Databases, monitoring stacks. When you're installing third-party charts. Don't reinvent the wheel. When you need versioned releases and rollback. When your team is comfortable with templating logic.
Use Kustomize when you have simpler apps. Typical microservices. When you prefer raw YAML over templates. When environment differences are minor. When you want no additional tools beyond kubectl.
You can use both. Helm chart as your base. Kustomize overlay for environment-specific patches. This isn't either-or.
Many teams start with Kustomize because it's simpler. They add Helm when complexity demands it. Databases, complex applications.
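The combined pattern can live in a single kustomization: Kustomize renders the Helm chart, then patches the result. This requires building with `--enable-helm`; the chart name, version, and repo here are illustrative assumptions:

```yaml
# overlays/prod/kustomization.yaml -- inflate a Helm chart, then patch it.
helmCharts:
  - name: postgresql
    repo: https://charts.bitnami.com/bitnami
    version: 16.0.0          # illustrative pin
    releaseName: db
    valuesFile: values-prod.yaml
patches:
  - target:
      kind: StatefulSet
      name: db-postgresql
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 3
```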
Deployment Strategies
Now deployment strategies. Rolling updates are the default. Gradually replace old pods with new pods. maxSurge controls how many extra pods may run during the rollout. maxUnavailable controls how many pods may be down during the rollout. Pros: built-in, simple, zero downtime. Cons: old and new versions coexist during rollout. No traffic control. All-or-nothing. You can't shift just ten percent of traffic. Use rolling updates for most deployments. Simple and reliable.
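In a Deployment spec, those two knobs sit under the strategy block. A minimal sketch with illustrative names:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2          # up to 2 extra pods during rollout (12 total)
      maxUnavailable: 1    # at most 1 pod below desired count at any time
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:v1.2.3
```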
Blue-Green Deployments
Blue-green deployments run two identical environments. Blue is current production. Green is the new version. You deploy to green. Test it. Switch traffic from blue to green instantly. Update Service selector or Ingress routing. Keep blue running for quick rollback if needed.
Pros: instant cutover. Easy rollback by switching back to blue. Test new version in production-like environment before it receives traffic. Cons: double resources during transition. You're running blue plus green simultaneously. Instant switch means instant impact if there are issues.
Implementation uses two Deployments. app-blue and app-green. Service selector switches between them. Or Ingress uses weighted routing.
Use blue-green for high-risk changes. When you need instant rollback capability. When you can afford double resources during the transition. Database migrations are a classic use case. Deploy green, test thoroughly, then switch.
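The Service-selector implementation makes the cutover a one-line change. A sketch, with illustrative labels:

```yaml
# Service points at the "blue" Deployment's pods. Cutover: flip the version
# label to "green" (via a Git commit in a GitOps workflow). Rollback: flip back.
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue   # app-blue and app-green Deployments label pods accordingly
  ports:
    - port: 80
      targetPort: 8080
```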
Canary Deployments
Canary deployments deploy the new version to a small subset. The canary. Shift ten percent of traffic to the canary. Monitor metrics. Latency, errors, saturation. If healthy, increase to twenty-five percent. Then fifty percent. Then one hundred percent. If errors spike, roll back immediately.
Pros: progressive validation. Limited blast radius. Only ten percent of users affected if something breaks. Metrics-driven decision making. Cons: more complex setup. Requires traffic splitting through Ingress, service mesh, or Argo Rollouts.
Implementation options include Ingress weighted routing, but this gives limited control. Service mesh like Istio or Linkerd for fine-grained traffic splitting. Or Argo Rollouts, which I recommend. It's a progressive delivery controller designed for canary deployments. Automated traffic shifting from zero percent to ten to twenty-five to fifty to one hundred. Metrics analysis that queries Prometheus and aborts if errors spike. Pause points for manual approval before increasing traffic. Automated rollback on failure.
Use canary deployments for critical services where you need validation before full rollout. When you have metrics infrastructure. Episode Seven's Prometheus provides the data you need.
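An Argo Rollouts canary is declared as a Rollout resource with explicit steps. A hedged sketch with illustrative names and durations:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10           # send 10% of traffic to the canary
        - pause: {duration: 5m}   # watch metrics before continuing
        - setWeight: 25
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {}               # indefinite pause: manual approval to finish
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:v1.2.3
```

In practice you'd also attach an AnalysisTemplate that queries Prometheus, so the rollout aborts and rolls back automatically when error rates spike.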
Deployment Strategy Decision Framework
Let me give you a decision framework. Use rolling updates as your default for most applications. Simple, reliable, built-in. Use blue-green for high-risk changes needing instant rollback. When you can afford double resources. Database migrations, major architecture changes. Use canary for critical services. When you need progressive validation. When you have metrics infrastructure. Use Argo Rollouts to automate the process.
CI/CD with GitOps
CI/CD integration changes with GitOps. Traditional CI/CD has the CI pipeline build the image, push to registry, then kubectl apply manifest to cluster. The problem? CI has cluster credentials. CI must know about all clusters. Deployment is tied to CI success.
GitOps CI/CD works differently. CI pipeline builds image, pushes to registry, then updates Git manifest with the new image tag and commits to Git. ArgoCD or Flux detects the Git change and deploys to clusters. Benefits? CI doesn't need cluster access. Single workflow works for all clusters. Deployment is decoupled from CI.
Image promotion pattern uses Git branches. Dev Git branch, ArgoCD deploys to dev cluster. Merge to staging branch, ArgoCD deploys to staging. Tag a release, update production branch, ArgoCD deploys to production. Git structure enforces your promotion flow.
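The build-then-commit step can be sketched as a small shell script. The repo layout, image name, and tags are illustrative assumptions, and a real pipeline would push to a remote instead of using a throwaway local repo:

```shell
# GitOps CI step sketch: instead of kubectl apply, CI rewrites the image tag
# in the Git-tracked manifest and commits. All names here are placeholders.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "ci@example.com"
git config user.name "CI"

# Manifest that ArgoCD/Flux watches (simplified)
cat > deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:v1.2.3
EOF
git add deployment.yaml
git commit -qm "initial manifest"

# CI step: after a successful build, bump the image tag and commit
NEW_TAG="v1.2.4"
sed -i "s|\(image: registry.example.com/myapp:\).*|\1${NEW_TAG}|" deployment.yaml
git commit -qam "deploy myapp ${NEW_TAG}"
# (a real pipeline would now `git push`; ArgoCD or Flux picks up the commit)

grep "myapp:${NEW_TAG}" deployment.yaml
```

Note what's absent: no kubectl, no kubeconfig, no cluster credentials. CI only ever touches Git.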
Remember Episode Seven's Prometheus metrics? Canary deployments need those metrics for validation. Prometheus provides latency and error data for automated rollout decisions.
Common Mistakes
Common mistakes. Mistake one: storing secrets in Git. What happens? Credentials exposed, security breach. The fix: use sealed secrets that encrypt before Git. Use external secret management like Vault or cloud providers. Use secret generators from ArgoCD or Flux.
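With sealed secrets, what lands in Git looks like this. The structure follows the Bitnami SealedSecret CRD; the names are illustrative and the ciphertext is a placeholder:

```yaml
# A SealedSecret is safe to commit: only the controller inside the cluster
# holds the key to decrypt it. In practice the ciphertext is produced by
# piping a normal Secret manifest through `kubeseal`.
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: db-credentials
  namespace: myapp
spec:
  encryptedData:
    password: AgBx...   # placeholder ciphertext from kubeseal
```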
Mistake two: no drift prevention. What happens? Someone kubectl edits a deployment. Cluster diverges from Git. The fix: enable auto-sync in ArgoCD or Flux. GitOps tool reverts manual changes automatically.
Mistake three: deploying to production without testing the GitOps workflow. What happens? GitOps misconfiguration. Production deployment fails. No rollback plan. The fix: test GitOps in dev and staging first. Verify sync works. Test rollback with git revert.
Mistake four: overly complex Helm templates. What happens? Unmaintainable templates. Debugging nightmares. Team avoids making changes. The fix: keep templates simple. Prefer Kustomize overlays for simple differences. Use Helm only when complexity is justified.
Mistake five: no environment parity. What happens? Dev works, staging works, production fails because configurations differ. The fix: same base manifests across environments. Environment differences only through Kustomize overlays or Helm values.
Active Recall Quiz
Let's pause for active recall. First question: why is manual kubectl to twenty clusters an anti-pattern? How does GitOps solve this? Second question: ArgoCD or Flux—which would you choose for a fifty-person engineering team that wants deployment visibility and multi-tenancy? Why? Third question: when would you use a canary deployment instead of a rolling update? What infrastructure is required?
Answers:
Manual kubectl doesn't scale. Twenty clusters times five minutes equals one hundred minutes. Prone to errors like typos and skipped clusters. No audit trail. No rollback strategy. Configuration drift happens. GitOps solution: commit to Git takes thirty seconds. ArgoCD or Flux deploys to all clusters automatically in two to five minutes. Git history provides audit trail. Rollback is git revert. Continuous reconciliation prevents drift.
For the fifty-person team, choose ArgoCD. Large team benefits from UI for visibility. Multi-tenancy with role-based access control needed for team isolation. ArgoCD provides web UI, projects for multi-tenancy, extensive RBAC out of the box. Flux would work but requires third-party UI and more setup for multi-tenancy.
Canary deployment when you have critical services where progressive validation is needed. Can't risk all-or-nothing rollout. Required infrastructure: metrics from Episode Seven's Prometheus to validate canary health. Traffic splitting through Argo Rollouts or service mesh like Istio. Time to monitor progressive rollout. Not for urgent hotfixes.
Key Takeaways
Let's recap. GitOps principles: Git as source of truth, declarative configuration, automated sync, self-healing. Manual kubectl doesn't scale to twenty-plus clusters. ArgoCD versus Flux: ArgoCD for large teams wanting UI and multi-tenancy. Flux for automation-first teams wanting lightweight. Both are CNCF graduated and production-proven. Helm versus Kustomize: Helm for complex apps, templating, third-party charts. Kustomize for simpler apps, overlays, no templating. You can use both. Deployment strategies: rolling is default. Blue-green for instant switch and easy rollback. Canary for progressive validation using Argo Rollouts. CI/CD pattern: build image, update Git, GitOps deploys. CI doesn't need cluster credentials.
This connects to previous episodes. Episode One: GitOps achieves repeatable deployments and production readiness. Episode Seven: canary deployments use Prometheus metrics for validation. Episode Eight: GitOps prevents configuration drift that causes cost waste.
You can now build infrastructure, observe it, optimize costs, and automate deployments. Final episode: managing all this at multi-cluster scale. Twenty-plus clusters across dev, staging, production, disaster recovery. How do you manage them without losing your mind? Hub-and-spoke GitOps architecture. Policy enforcement with OPA and Kyverno. Disaster recovery with Velero. And synthesizing all ten episodes into a cohesive production operations framework. We're bringing everything together. See you then.
Navigation
⬅️ Previous: Lesson 8: Cost Optimization | Next: Lesson 10: Production Checklists ➡️