Kubernetes Mastery for Platform Engineers

An in-depth guide to Kubernetes architecture, operations, and platform engineering on K8s.

📚 Essential Resources

📖 Must-Read Books & Guides

Kubernetes in Action - Marko Lukša (2nd Edition)
Production Kubernetes - Josh Rosso & Rich Lander
Kubernetes Patterns - Bilgin Ibryam & Roland Huß
Kubernetes Up & Running - Kelsey Hightower, Brendan Burns, Joe Beda
The Kubernetes Book - Nigel Poulton

🎥 Video Resources

Kubernetes Course - Full Beginners Tutorial - TechWorld with Nana
CNCF Kubernetes Course - Complete playlist
KubeCon Talks - Latest K8s innovations
Kubernetes Deconstructed - Carson Anderson
Life of a Packet - Michael Rubin (Google)

🎓 Courses & Certifications

CKA (Certified Kubernetes Administrator) - CNCF Certification
CKAD (Certified Kubernetes Application Developer) - CNCF Certification
CKS (Certified Kubernetes Security Specialist) - CNCF Certification
Kubernetes the Hard Way - Kelsey Hightower
KodeKloud Kubernetes Courses - Hands-on labs

📰 Blogs & Articles

Kubernetes Blog - Official Kubernetes blog
Learnk8s Blog - In-depth tutorials
ITNEXT Kubernetes - Community articles
The New Stack - K8s ecosystem coverage
Container Journal - Container & K8s news

🔧 Essential Tools & Platforms

K9s - Terminal UI for Kubernetes
Lens - Kubernetes IDE
kubectl Cheat Sheet - Official reference
Kustomize - Kubernetes native configuration
Artifact Hub (formerly Helm Hub) - Find and share Helm charts

💬 Communities & Forums

Kubernetes Slack - Official Slack (get invite at slack.k8s.io)
r/kubernetes - Reddit community
Stack Overflow - Kubernetes - Q&A
CNCF Community - Cloud Native community
Kubernetes Forum - Official forum

🎮 Interactive Learning

Killercoda - Free K8s scenarios
Play with Kubernetes - Browser-based K8s
Katacoda Kubernetes - Interactive tutorials (Katacoda was shut down in 2022)
Kubernetes Playground - Practice environment

📖 Documentation & References

Kubernetes Documentation - Official docs
Kubernetes API Reference - API documentation
kubectl Reference - Command reference
Kubernetes Examples - Official examples
Awesome Kubernetes - Curated resources

Kubernetes Architecture Deep Dive

Control Plane Components

API Server (kube-apiserver)

Central management point
RESTful API interface
Authentication and authorization
Admission controllers

# API server key features
- RBAC (Role-Based Access Control)
- Admission webhooks
- API aggregation
- OpenAPI schema validation

etcd

Distributed key-value store
Cluster state storage
Consistency via Raft consensus
Watch functionality for changes

# etcd operations
etcdctl get / --prefix --keys-only
etcdctl snapshot save backup.db
etcdctl member list

Scheduler (kube-scheduler)

Pod placement decisions
Resource requirements evaluation
Affinity/anti-affinity rules
Custom schedulers

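# Scheduling example: require an SSD node and keep replicas on separate hosts
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  nodeSelector:
    disktype: ssd
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:          # which pods to stay away from
          matchLabels:
            app: myapp
        topologyKey: kubernetes.io/hostname
  containers:
  - name: myapp
    image: nginx   # placeholder image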
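The placement decision described above can be pictured as a filter step followed by a scoring step. The following Go sketch is a deliberately simplified model, not the kube-scheduler's actual code: the `node` type, the `fits` and `pickNode` functions, and the "most free CPU wins" scoring rule are all assumptions made for illustration. The real scheduler runs many filter and score plugins.

```go
package main

import "fmt"

// node is a simplified, hypothetical view of what a scheduler knows about a node.
type node struct {
	Name         string
	Labels       map[string]string
	FreeMilliCPU int64 // allocatable CPU minus CPU already requested
}

// fits is the filter step: every nodeSelector label must be present on the
// node, and the node must have enough unrequested CPU for the pod.
func fits(n node, nodeSelector map[string]string, requestMilliCPU int64) bool {
	for k, v := range nodeSelector {
		if got, ok := n.Labels[k]; !ok || got != v {
			return false
		}
	}
	return n.FreeMilliCPU >= requestMilliCPU
}

// pickNode filters the nodes, then scores the survivors by free CPU and
// returns the best one. The boolean is false when no node fits, which is
// the situation that leaves a pod in Pending.
func pickNode(nodes []node, nodeSelector map[string]string, requestMilliCPU int64) (string, bool) {
	best, found := "", false
	var bestFree int64
	for _, n := range nodes {
		if !fits(n, nodeSelector, requestMilliCPU) {
			continue
		}
		if !found || n.FreeMilliCPU > bestFree {
			best, bestFree, found = n.Name, n.FreeMilliCPU, true
		}
	}
	return best, found
}

func main() {
	nodes := []node{
		{Name: "node-a", Labels: map[string]string{"disktype": "hdd"}, FreeMilliCPU: 4000},
		{Name: "node-b", Labels: map[string]string{"disktype": "ssd"}, FreeMilliCPU: 500},
		{Name: "node-c", Labels: map[string]string{"disktype": "ssd"}, FreeMilliCPU: 2000},
	}
	name, ok := pickNode(nodes, map[string]string{"disktype": "ssd"}, 250)
	fmt.Println(name, ok)
}
```

Here node-a is filtered out by the `disktype: ssd` selector even though it has the most free CPU, which mirrors how a nodeSelector narrows the candidate set before any scoring happens.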

Controller Manager

Runs controller loops
Maintains desired state
Built-in controllers:
- ReplicaSet
- Deployment
- StatefulSet
- DaemonSet
- Job/CronJob
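Each controller follows the same pattern: observe the current state, compare it with the desired state, and act on the difference. The Go sketch below illustrates that loop for a replica count. It is a minimal model, not the controller manager's implementation; the `action` type and `reconcileReplicas` function are hypothetical names.

```go
package main

import "fmt"

// action describes what a controller would do to converge on the desired state.
type action struct {
	Create int // pods to create
	Delete int // pods to delete
}

// reconcileReplicas compares desired and observed replica counts and returns
// the corrective action. Once the action has been applied, calling it again
// yields a no-op, which is what makes the loop safe to run repeatedly.
func reconcileReplicas(desired, observed int) action {
	switch {
	case observed < desired:
		return action{Create: desired - observed}
	case observed > desired:
		return action{Delete: observed - desired}
	default:
		return action{}
	}
}

func main() {
	desired, observed := 3, 1
	for i := 0; i < 3; i++ {
		a := reconcileReplicas(desired, observed)
		fmt.Printf("observed=%d create=%d delete=%d\n", observed, a.Create, a.Delete)
		observed += a.Create - a.Delete // pretend the action succeeded
	}
}
```

The first pass creates two pods; later passes find nothing to do. Real controllers are driven by watch events rather than a fixed loop, but the comparison at the core is the same.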

Data Plane Components

kubelet

Node agent
Pod lifecycle management
Container runtime interface (CRI)
Health checking

# kubelet debugging
journalctl -u kubelet -f
kubectl get --raw /api/v1/nodes/<node>/proxy/stats/summary

kube-proxy

Network proxy
Service abstraction implementation
iptables/IPVS modes
Connection tracking

# kube-proxy modes
kubectl get configmap kube-proxy -n kube-system -o yaml
# Check iptables rules
iptables-save | grep KUBE-SERVICES

Container Runtime

Docker Engine (dockershim removed in v1.24; needs the cri-dockerd adapter)
containerd
CRI-O
Runtime classes

# RuntimeClass example
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc

Advanced Networking

Network Models

Cluster Networking Requirements:

All pods can communicate without NAT
All nodes can communicate with pods without NAT
A pod sees itself with the same IP address that other pods use to reach it

CNI Plugins Comparison:

Plugin	Mode	Performance	Features
Calico	L3	High	Network policies, BGP
Flannel	Overlay	Medium	Simple, VXLAN
Cilium	eBPF	Very High	L7 policies, observability
Weave	Overlay	Medium	Encryption, multicast

Service Types and Ingress

Service Types:

# ClusterIP - Internal only
apiVersion: v1
kind: Service
spec:
  type: ClusterIP
  selector:
    app: myapp
  ports:
  - port: 80

# LoadBalancer - Cloud provider LB
spec:
  type: LoadBalancer
  # loadBalancerIP is deprecated since v1.24; prefer provider-specific annotations
  loadBalancerIP: 1.2.3.4

# NodePort - External access via node ports
spec:
  type: NodePort
  ports:
  - port: 80
    nodePort: 30080

Ingress Controllers:

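# Ingress with path-based routing (ingress-nginx)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress          # placeholder name
  annotations:
    # Strip the /api prefix: /api/users reaches the backend as /users.
    # A bare "/" target would rewrite every request under /api to "/".
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx    # assumes the controller's class is named nginx
  rules:
  - host: example.com
    http:
      paths:
      - path: /api(/|$)(.*)
        pathType: ImplementationSpecific
        backend:
          service:
            name: api-service
            port:
              number: 80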

Network Policies

# Deny all ingress traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress

# Allow specific traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - port: 8080
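The two policies above combine additively: a pod becomes isolated for ingress as soon as any policy selects it, and traffic is then allowed only if at least one selecting policy permits it. The Go sketch below models that evaluation under simplifying assumptions (same namespace, pod selectors only, ports ignored). The `labels` and `policy` types and the `ingressAllowed` function are hypothetical names for illustration, not part of any Kubernetes API.

```go
package main

import "fmt"

type labels map[string]string

// matches implements matchLabels semantics: every key/value pair in the
// selector must be present on the pod. An empty selector matches every pod.
func matches(selector, pod labels) bool {
	for k, v := range selector {
		if got, ok := pod[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// policy is a simplified, ingress-only stand-in for a NetworkPolicy.
type policy struct {
	PodSelector labels   // pods the policy applies to
	AllowFrom   []labels // source pod selectors; nil allows nothing
}

// ingressAllowed reports whether src may connect to dst. A destination that
// no policy selects is not isolated, so all traffic to it is allowed.
func ingressAllowed(policies []policy, src, dst labels) bool {
	isolated := false
	for _, p := range policies {
		if !matches(p.PodSelector, dst) {
			continue
		}
		isolated = true
		for _, from := range p.AllowFrom {
			if matches(from, src) {
				return true
			}
		}
	}
	return !isolated
}

func main() {
	policies := []policy{
		{PodSelector: labels{}}, // deny-all-ingress
		{PodSelector: labels{"app": "backend"}, AllowFrom: []labels{{"app": "frontend"}}}, // allow-frontend
	}
	backend := labels{"app": "backend"}
	fmt.Println(ingressAllowed(policies, labels{"app": "frontend"}, backend))
	fmt.Println(ingressAllowed(policies, labels{"app": "batch"}, backend))
}
```

The frontend pod is admitted by the allow-frontend rule, while any other source is rejected because deny-all-ingress isolates the backend and no rule lets that source in.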

Storage Architecture

Storage Classes and Dynamic Provisioning

# StorageClass definition
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com   # EBS CSI driver; replaces the removed in-tree kubernetes.io/aws-ebs plugin
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
volumeBindingMode: WaitForFirstConsumer

Persistent Volumes and Claims

# PVC with specific requirements
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: database-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd
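  # No selector here: a claim with a non-empty selector cannot be dynamically
  # provisioned. Use selector.matchLabels only to bind to pre-created PVs.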

CSI (Container Storage Interface)

CSI Driver Implementation:

Identity Service
Controller Service
Node Service

# CSI Driver deployment
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: csi.example.com
spec:
  attachRequired: true
  podInfoOnMount: true
  volumeLifecycleModes:
  - Persistent
  - Ephemeral

Security Best Practices

RBAC (Role-Based Access Control)

# ClusterRole for platform team
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: platform-engineer
rules:
- apiGroups: [""]   # core API group, where nodes and persistentvolumes live
  resources: ["nodes", "persistentvolumes"]
  verbs: ["*"]
- apiGroups: ["apps"]
  resources: ["deployments", "daemonsets", "statefulsets"]
  verbs: ["get", "list", "watch"]

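# ClusterRoleBinding: nodes and persistentvolumes are cluster-scoped, so a
# namespaced RoleBinding would not grant access to them. Note that this also
# makes the workload read access apply to all namespaces.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-engineer-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: platform-engineer
subjects:
- kind: User
  name: john@example.com
  apiGroup: rbac.authorization.k8s.io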

Pod Security Standards

# PodSecurityPolicy was deprecated in v1.21 and removed in v1.25.
# Pod Security Standards are now enforced per namespace through
# Pod Security Admission labels. The "restricted" profile covers what the
# old restricted PSP did: no privileged containers, no privilege escalation,
# all capabilities dropped, non-root users, and a limited set of volume types.
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Secrets Management

# External Secrets Operator
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: "https://vault.example.com"
      path: "secret"
      version: "v2"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "demo"

Observability and Monitoring

Metrics Architecture

# Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

Logging Architecture

Logging Patterns:

Node-level logging
Sidecar container pattern
DaemonSet collectors

# Fluentd DaemonSet config
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
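      <parse>
        # json matches Docker's json-file log format; containerd and CRI-O
        # write CRI-format lines, which need a CRI parser plugin instead
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%N%z
      </parse>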
    </source>

Distributed Tracing

# OpenTelemetry Collector
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  otel-collector-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch:
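    exporters:
      # The dedicated jaeger exporter has been removed from the Collector;
      # Jaeger accepts OTLP directly (gRPC on port 4317).
      otlp/jaeger:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true   # assumes plaintext traffic inside the cluster
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp/jaeger]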

Advanced Deployment Patterns

GitOps with ArgoCD

# ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/k8s-manifests
    targetRevision: HEAD
    path: production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Progressive Delivery with Flagger

# Canary deployment
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  progressDeadlineSeconds: 60
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
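The analysis settings above determine how traffic shifts and how long a rollout takes: the canary weight grows by `stepWeight` each `interval` until it reaches `maxWeight`, and the rollout is aborted after `threshold` failed checks. The Go sketch below works through that arithmetic for the values in the example. The `canaryWeights` function is a hypothetical helper, not part of Flagger.

```go
package main

import (
	"fmt"
	"time"
)

// canaryWeights lists the traffic percentage the canary receives at each
// analysis step: it grows by stepWeight until it reaches maxWeight.
func canaryWeights(stepWeight, maxWeight int) []int {
	var weights []int
	if stepWeight <= 0 {
		return weights
	}
	for w := stepWeight; w <= maxWeight; w += stepWeight {
		weights = append(weights, w)
	}
	return weights
}

func main() {
	interval := time.Minute
	threshold := 5

	weights := canaryWeights(10, 50)
	fmt.Println("canary weights:", weights)

	// Minimum time before promotion: one interval per weight step.
	fmt.Println("minimum analysis time:", time.Duration(len(weights))*interval)

	// Time until rollback if every check fails: one interval per failed check.
	fmt.Println("time to rollback:", time.Duration(threshold)*interval)
}
```

With `stepWeight: 10` and `maxWeight: 50` the canary passes through five steps, so a fully healthy rollout needs at least five minutes of analysis at a one-minute interval.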

Platform Engineering on Kubernetes

Multi-Tenancy Patterns

Namespace Isolation:

# ResourceQuota per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    persistentvolumeclaims: "10"

Network Isolation:

# Default deny all NetworkPolicy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

Custom Resource Definitions (CRDs)

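# Platform service CRD
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: platformservices.platform.io   # must be <plural>.<group>
spec:
  group: platform.io
  scope: Namespaced
  names:
    plural: platformservices
    singular: platformservice
    kind: PlatformService
  versions:
  - name: v1
    served: true
    storage: true
    subresources:
      status: {}   # enables the /status endpoint used by operators
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              tier:
                type: string
                enum: ["production", "staging", "development"]
              autoscaling:
                type: boolean
              monitoring:
                type: boolean
          status:
            type: object
            properties:
              ready:   # field name assumed to match the operator's Status.Ready
                type: boolean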

Operator Development

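// Operator reconciliation loop (controller-runtime).
// v1 is the package holding the generated PlatformService types; its import
// path depends on your module and is not shown here.
import (
    "context"
    "time"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

func (r *PlatformServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var platformService v1.PlatformService
    if err := r.Get(ctx, req.NamespacedName, &platformService); err != nil {
        // A NotFound error means the object was deleted; nothing to do.
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Ensure the Deployment exists. Reconcile runs repeatedly, so an
    // AlreadyExists error from Create is expected and must not fail the loop.
    deployment := r.deploymentForPlatformService(&platformService)
    if err := r.Create(ctx, deployment); err != nil && !apierrors.IsAlreadyExists(err) {
        return ctrl.Result{}, err
    }

    // Update status through the status subresource
    platformService.Status.Ready = true
    if err := r.Status().Update(ctx, &platformService); err != nil {
        return ctrl.Result{}, err
    }

    // Re-check periodically even without new events
    return ctrl.Result{RequeueAfter: time.Minute}, nil
}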

Troubleshooting Guide

Common Issues and Solutions

1. Pod Stuck in Pending:

kubectl describe pod <pod-name>
kubectl get events --sort-by='.lastTimestamp'
kubectl get nodes -o wide
kubectl describe node <node-name>

2. CrashLoopBackOff:

kubectl logs <pod-name> --previous
kubectl describe pod <pod-name>
kubectl exec -it <pod-name> -- /bin/sh

3. Service Discovery Issues:

kubectl get endpoints
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- bash
nslookup kubernetes.default

4. Performance Issues:

kubectl top nodes
kubectl top pods --all-namespaces
kubectl get hpa
kubectl describe hpa <hpa-name>

Debugging Tools

# Node debugging (kubectl debug is built into kubectl; no plugin is needed)
kubectl debug node/<node-name> -it --image=ubuntu

# Network debugging
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot
# From a second terminal, while the tmp-shell session is still open:
kubectl exec -it tmp-shell -- tcpdump -i any -w trace.pcap

# Resource analysis
# Requests are grouped per container so multi-container pods are not reported as duplicated rows
kubectl get pods --all-namespaces -o json | jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name, containers: [.spec.containers[] | {name, cpu: .resources.requests.cpu, memory: .resources.requests.memory}]}'

Production Best Practices

High Availability

Control Plane HA:
- Multiple API servers behind LB
- etcd cluster with odd number of nodes
- Leader election for controllers
Data Plane HA:
- Multiple nodes across AZs
- Pod disruption budgets
- Node affinity rules
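The recommendation to run an odd number of etcd members follows from Raft's majority rule: a write commits only when more than half of the members agree. The Go sketch below works through the arithmetic. The `quorum` and `faultTolerance` helpers are hypothetical names for illustration.

```go
package main

import "fmt"

// quorum is the number of members that must agree for a write to commit.
func quorum(members int) int { return members/2 + 1 }

// faultTolerance is how many members can fail while quorum is still reachable.
func faultTolerance(members int) int { return members - quorum(members) }

func main() {
	for _, n := range []int{1, 2, 3, 4, 5} {
		fmt.Printf("members=%d quorum=%d tolerates=%d failures\n", n, quorum(n), faultTolerance(n))
	}
}
```

Going from three members to four raises the quorum from two to three but still tolerates only one failure, so the extra member adds cost without adding resilience. Five members are needed to survive two failures.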

# PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: myapp

Resource Management

# Resource limits and requests
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"

# Vertical Pod Autoscaler
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Auto"

Certification and Learning Path

Certifications

Learning Resources

📚 Kubernetes in Action
📚 Production Kubernetes
🎥 Kubernetes Course - TechWorld with Nana
🎮 Kubernetes the Hard Way
📖 Kubernetes Documentation
🎯 Killercoda Interactive Scenarios

Community Resources

Remember: Kubernetes is rapidly evolving. Stay updated with the latest releases and best practices through official documentation and community resources.
