Kubernetes Mastery for Platform Engineers
An in-depth guide to Kubernetes architecture, operations, and platform engineering on K8s.
Essential Resources
Must-Read Books & Guides
- Kubernetes in Action - Marko Lukša (2nd Edition)
- Production Kubernetes - Josh Rosso & Rich Lander
- Kubernetes Patterns - Bilgin Ibryam & Roland Huß
- Kubernetes Up & Running - Kelsey Hightower, Brendan Burns, Joe Beda
- The Kubernetes Book - Nigel Poulton
Video Resources
- Kubernetes Course - Full Beginners Tutorial - TechWorld with Nana
- CNCF Kubernetes Course - Complete playlist
- KubeCon Talks - Latest K8s innovations
- Kubernetes Deconstructed - Carson Anderson
- Life of a Packet - Michael Rubin (Google)
Courses & Certifications
- CKA (Certified Kubernetes Administrator) - CNCF Certification
- CKAD (Certified Kubernetes Application Developer) - CNCF Certification
- CKS (Certified Kubernetes Security Specialist) - CNCF Certification
- Kubernetes the Hard Way - Kelsey Hightower
- KodeKloud Kubernetes Courses - Hands-on labs
Blogs & Articles
- Kubernetes Blog - Official Kubernetes blog
- Learnk8s Blog - In-depth tutorials
- ITNEXT Kubernetes - Community articles
- The New Stack - K8s ecosystem coverage
- Container Journal - Container & K8s news
Essential Tools & Platforms
- K9s - Terminal UI for Kubernetes
- Lens - Kubernetes IDE
- kubectl Cheat Sheet - Official reference
- Kustomize - Kubernetes native configuration
- Artifact Hub (formerly Helm Hub) - Find and share Helm charts
Communities & Forums
- Kubernetes Slack - Official Slack (get invite at slack.k8s.io)
- r/kubernetes - Reddit community
- Stack Overflow - Kubernetes - Q&A
- CNCF Community - Cloud Native community
- Kubernetes Forum - Official forum
Interactive Learning
- Killercoda - Free K8s scenarios
- Play with Kubernetes - Browser-based K8s
- Katacoda Kubernetes - Interactive tutorials (the public Katacoda site was shut down in 2022)
- Kubernetes Playground - Practice environment
Documentation & References
- Kubernetes Documentation - Official docs
- Kubernetes API Reference - API documentation
- kubectl Reference - Command reference
- Kubernetes Examples - Official examples
- Awesome Kubernetes - Curated resources
Kubernetes Architecture Deep Dive
Control Plane Components
API Server (kube-apiserver)
- Central management point
- RESTful API interface
- Authentication and authorization
- Admission controllers
# API server key features
- RBAC (Role-Based Access Control)
- Admission webhooks
- API aggregation
- OpenAPI schema validation
etcd
- Distributed key-value store
- Cluster state storage
- Consistency via Raft consensus
- Watch functionality for changes
# etcd operations
etcdctl get / --prefix --keys-only
etcdctl snapshot save backup.db
etcdctl member list
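Because etcd relies on Raft, a write is committed only once a majority of members acknowledge it. The following is a minimal sketch of that majority arithmetic, not etcd code; the function names are illustrative.

```python
def quorum(members: int) -> int:
    """Smallest majority of a Raft cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can fail while a majority stays reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"members={n} quorum={quorum(n)} tolerates={fault_tolerance(n)}")
```

Going from 3 to 4 members raises the quorum without raising the number of tolerated failures, which is why etcd clusters are normally sized with an odd member count.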
Scheduler (kube-scheduler)
- Pod placement decisions
- Resource requirements evaluation
- Affinity/anti-affinity rules
- Custom schedulers
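# Scheduling example: pin to SSD nodes and spread replicas across hosts
# (the name, label and image below are illustrative placeholders)
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  nodeSelector:
    disktype: ssd
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:          # without a labelSelector the term matches no pods
          matchLabels:
            app: myapp
        topologyKey: kubernetes.io/hostname
  containers:
  - name: myapp
    image: nginx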
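To make the filtering step concrete, here is a simplified sketch of how a scheduler could narrow down candidate nodes for the Pod above. It models only `nodeSelector` and a hostname-scoped required anti-affinity term; the `Node` type and function names are hypothetical, and the real kube-scheduler also scores nodes and checks resources, taints and more.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: dict
    # Label sets of the pods already running on this node.
    pod_labels: list = field(default_factory=list)


def matches(selector: dict, labels: dict) -> bool:
    """True when every key/value pair of the selector is present in labels."""
    return all(labels.get(k) == v for k, v in selector.items())


def feasible_nodes(nodes, node_selector, anti_affinity_selector):
    """Keep nodes that satisfy nodeSelector and host no pod matching the
    anti-affinity selector (topologyKey kubernetes.io/hostname, so per node)."""
    result = []
    for node in nodes:
        if not matches(node_selector, node.labels):
            continue  # wrong disk type, for example
        if any(matches(anti_affinity_selector, p) for p in node.pod_labels):
            continue  # a matching pod already runs on this host
        result.append(node.name)
    return result


if __name__ == "__main__":
    nodes = [
        Node("node-a", {"disktype": "ssd"}, [{"app": "myapp"}]),
        Node("node-b", {"disktype": "ssd"}, [{"app": "other"}]),
        Node("node-c", {"disktype": "hdd"}),
    ]
    print(feasible_nodes(nodes, {"disktype": "ssd"}, {"app": "myapp"}))
```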
Controller Manager
- Runs controller loops
- Maintains desired state
- Built-in controllers:
  - ReplicaSet
  - Deployment
  - StatefulSet
  - DaemonSet
  - Job/CronJob
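Each of these controllers follows the same pattern: observe the current state, compare it with the desired state, and act on the difference. The sketch below illustrates one pass of such a loop for a ReplicaSet-like replica count. It is a toy model with hypothetical names, not the controller-manager implementation.

```python
def reconcile(desired_replicas: int, running: list) -> list:
    """One pass of a ReplicaSet-style control loop.

    Returns the actions that move the observed state (names of running pods)
    toward the desired replica count.
    """
    diff = desired_replicas - len(running)
    if diff > 0:
        # Too few pods: request one creation per missing replica.
        return [("create", None)] * diff
    if diff < 0:
        # Too many pods: real controllers rank victims; here we take the last ones.
        return [("delete", name) for name in running[diff:]]
    return []  # already converged, nothing to do


if __name__ == "__main__":
    print(reconcile(3, ["pod-a"]))
    print(reconcile(1, ["pod-a", "pod-b", "pod-c"]))
    print(reconcile(2, ["pod-a", "pod-b"]))
```

Because the function derives its actions from the observed state alone, running it again after the actions are applied yields an empty list, which is the idempotency property control loops rely on.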
Data Plane Components
kubelet
- Node agent
- Pod lifecycle management
- Container runtime interface (CRI)
- Health checking
# kubelet debugging
journalctl -u kubelet -f
kubectl get --raw /api/v1/nodes/<node>/proxy/stats/summary
kube-proxy
- Network proxy
- Service abstraction implementation
- iptables/IPVS modes
- Connection tracking
# kube-proxy modes
kubectl get configmap kube-proxy -n kube-system -o yaml
# Check iptables rules
iptables-save | grep KUBE-SERVICES
Container Runtime
- Docker Engine (dockershim was removed in Kubernetes 1.24; Docker Engine now needs the cri-dockerd adapter)
- containerd
- CRI-O
- Runtime classes
# RuntimeClass example
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
Advanced Networking
Network Models
Cluster Networking Requirements:
- Every pod can communicate with every other pod without NAT
- Every node can communicate with every pod without NAT
- The IP address a pod sees for itself is the same address other pods see for it
CNI Plugins Comparison:
| Plugin | Mode | Performance | Features |
|---|---|---|---|
| Calico | L3 | High | Network policies, BGP |
| Flannel | Overlay | Medium | Simple, VXLAN |
| Cilium | eBPF | Very High | L7 policies, observability |
| Weave | Overlay | Medium | Encryption, multicast |
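Service Types and Ingress
Service Types:
# ClusterIP - Internal only
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  type: ClusterIP
  selector:
    app: myapp
  ports:
  - port: 80
# LoadBalancer - Cloud provider LB (spec fragment)
spec:
  type: LoadBalancer
  loadBalancerIP: 1.2.3.4   # deprecated since Kubernetes 1.24; prefer provider-specific annotations
# NodePort - External access via node ports (spec fragment)
spec:
  type: NodePort
  ports:
  - port: 80
    nodePort: 30080   # must fall inside the node port range (30000-32767 by default)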
Ingress Controllers:
# Ingress with path-based routing
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress   # illustrative name
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx   # assumes an IngressClass named "nginx" (the ingress-nginx default)
  rules:
  - host: example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80
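The `Prefix` path type matches on whole path elements rather than on raw string prefixes, which is easy to get wrong. Here is a small sketch of that rule as described in the Ingress documentation. The function name is illustrative, and it ignores details such as precedence between multiple matching paths.

```python
def prefix_match(rule_path: str, request_path: str) -> bool:
    """Ingress pathType Prefix: compare '/'-separated elements, not raw strings."""
    rule = [part for part in rule_path.split("/") if part]
    request = [part for part in request_path.split("/") if part]
    # The request matches when its leading elements equal the rule's elements.
    return request[: len(rule)] == rule


if __name__ == "__main__":
    for path in ("/api", "/api/", "/api/v1/users", "/apiv1", "/other"):
        print(path, prefix_match("/api", path))
```

Under this rule `/api` covers `/api/v1/users` but not `/apiv1`, even though both start with the same characters.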
Network Policies
# Deny all ingress traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress
---
# Allow specific traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - port: 8080
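Network policies are additive: a pod selected by no policy accepts all traffic, while a pod selected by at least one ingress policy accepts only what some rule allows. The sketch below models the two policies above as plain dictionaries to show how they combine. It is a deliberate simplification (same-namespace `podSelector` peers and `matchLabels` only), and the data layout is an assumption made for illustration, not the API schema.

```python
def selects(selector: dict, labels: dict) -> bool:
    """matchLabels semantics; an empty selector selects every pod."""
    return all(labels.get(k) == v for k, v in selector.items())


def ingress_allowed(policies, dst_labels, src_labels, port) -> bool:
    """Decide whether traffic from src to dst on `port` is admitted."""
    applicable = [p for p in policies if selects(p["podSelector"], dst_labels)]
    if not applicable:
        return True  # pod is not isolated for ingress
    for policy in applicable:
        for rule in policy.get("ingress", []):
            from_ok = any(selects(peer, src_labels) for peer in rule["from"])
            if from_ok and port in rule["ports"]:
                return True
    return False


POLICIES = [
    {"name": "deny-all-ingress", "podSelector": {}, "ingress": []},
    {
        "name": "allow-frontend",
        "podSelector": {"app": "backend"},
        "ingress": [{"from": [{"app": "frontend"}], "ports": [8080]}],
    },
]

if __name__ == "__main__":
    print(ingress_allowed(POLICIES, {"app": "backend"}, {"app": "frontend"}, 8080))
    print(ingress_allowed(POLICIES, {"app": "backend"}, {"app": "batch"}, 8080))
```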
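Storage Architecture
Storage Classes and Dynamic Provisioning
# StorageClass definition
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
# gp3 volumes with tunable iops/throughput need the AWS EBS CSI driver;
# the legacy in-tree kubernetes.io/aws-ebs provisioner does not accept these parameters
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
volumeBindingMode: WaitForFirstConsumer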
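Persistent Volumes and Claims
# PVC with specific requirements
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: database-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd
  # A non-empty selector disables dynamic provisioning: the claim then binds only to a
  # pre-created PV labeled tier=production. Drop the selector to let fast-ssd provision the volume.
  selector:
    matchLabels:
      tier: production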
CSI (Container Storage Interface)
CSI Driver Implementation:
- Identity Service
- Controller Service
- Node Service
# CSIDriver object
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: csi.example.com
spec:
  attachRequired: true
  podInfoOnMount: true
  volumeLifecycleModes:
  - Persistent
  - Ephemeral
Resources:
- Kubernetes Storage
- Storage Deep Dive
- CSI Specification
Security Best Practices
RBAC (Role-Based Access Control)
# ClusterRole for platform team
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: platform-engineer
rules:
- apiGroups: ["*"]
  resources: ["nodes", "persistentvolumes"]
  verbs: ["*"]
- apiGroups: ["apps"]
  resources: ["deployments", "daemonsets", "statefulsets"]
  verbs: ["get", "list", "watch"]
---
# ClusterRoleBinding (nodes and persistentvolumes are cluster-scoped, so a
# namespaced RoleBinding could not grant access to them)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-engineer-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: platform-engineer
subjects:
- kind: User
  name: john@example.com
  apiGroup: rbac.authorization.k8s.io
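RBAC rules are purely additive: a request is allowed when any single rule covers its API group, resource and verb, with `*` acting as a wildcard. The following sketch evaluates the platform-engineer rules above in that way. It is a simplified model with hypothetical function names and omits resourceNames, subresources and non-resource URLs.

```python
def rule_allows(rule: dict, api_group: str, resource: str, verb: str) -> bool:
    """True when one rule covers the group, resource and verb of a request."""
    def covers(allowed: list, value: str) -> bool:
        return "*" in allowed or value in allowed

    return (
        covers(rule["apiGroups"], api_group)
        and covers(rule["resources"], resource)
        and covers(rule["verbs"], verb)
    )


def authorized(rules: list, api_group: str, resource: str, verb: str) -> bool:
    """RBAC has no deny rules: any matching rule grants the request."""
    return any(rule_allows(r, api_group, resource, verb) for r in rules)


PLATFORM_ENGINEER = [
    {"apiGroups": ["*"], "resources": ["nodes", "persistentvolumes"], "verbs": ["*"]},
    {
        "apiGroups": ["apps"],
        "resources": ["deployments", "daemonsets", "statefulsets"],
        "verbs": ["get", "list", "watch"],
    },
]

if __name__ == "__main__":
    # The core API group is written as the empty string.
    print(authorized(PLATFORM_ENGINEER, "", "nodes", "delete"))
    print(authorized(PLATFORM_ENGINEER, "apps", "deployments", "delete"))
```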
Pod Security Standards
# PodSecurityPolicy (removed in Kubernetes 1.25; shown for legacy clusters only.
# Use Pod Security Admission with the Pod Security Standards on current clusters.)
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
  - ALL
  volumes:
  - 'configMap'
  - 'emptyDir'
  - 'projected'
  - 'secret'
  - 'downwardAPI'
  - 'persistentVolumeClaim'
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:   # required field, missing from the original example
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
Secrets Management
# External Secrets Operator
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: "https://vault.example.com"
      path: "secret"
      version: "v2"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "demo"
Resources:
- Kubernetes Security
- Kubernetes Security Best Practices
- Kubernetes Security - Operating Kubernetes Clusters and Applications Safely
Observability and Monitoring
Metrics Architecture
# Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
Logging Architecture
Logging Patterns:
- Node-level logging
- Sidecar container pattern
- DaemonSet collectors
# Fluentd DaemonSet config
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      <parse>
        # json matches Docker's json-file log format; containerd and CRI-O
        # write CRI-format lines, which need a CRI-aware parser instead
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%N%z
      </parse>
    </source>
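Distributed Tracing
# OpenTelemetry Collector
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  otel-collector-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch:
    exporters:
      # The dedicated jaeger exporter was removed from recent Collector releases;
      # Jaeger accepts OTLP natively, so traces are exported over OTLP gRPC instead.
      otlp:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true   # assumes plaintext in-cluster traffic
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]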
Advanced Deployment Patterns
GitOps with ArgoCD
# ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/k8s-manifests
    targetRevision: HEAD
    path: production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
Progressive Delivery with Flagger
# Canary deployment
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  progressDeadlineSeconds: 60
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
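The `interval`, `stepWeight`, `maxWeight` and `threshold` settings together determine how long a rollout takes. The sketch below works through that arithmetic for the values above, following the timing formulas given in the Flagger documentation (minimum promotion time is the interval times the number of weight steps; rollback time is the interval times the threshold). The function names are illustrative.

```python
def canary_weights(step_weight: int, max_weight: int) -> list:
    """Traffic percentages the canary receives at each analysis step."""
    return list(range(step_weight, max_weight + 1, step_weight))


def min_promotion_minutes(interval_min: int, step_weight: int, max_weight: int) -> int:
    """Shortest time to promotion when every check passes."""
    return interval_min * len(canary_weights(step_weight, max_weight))


def rollback_minutes(interval_min: int, threshold: int) -> int:
    """Time until rollback when checks keep failing."""
    return interval_min * threshold


if __name__ == "__main__":
    print(canary_weights(10, 50))
    print(min_promotion_minutes(1, 10, 50))
    print(rollback_minutes(1, 5))
```

With a 1-minute interval the canary steps through 10, 20, 30, 40 and 50 percent, so a healthy release needs at least five minutes of analysis before promotion.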
Platform Engineering on Kubernetes
Multi-Tenancy Patterns
Namespace Isolation:
# ResourceQuota per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    persistentvolumeclaims: "10"
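A ResourceQuota is enforced at admission time: a new pod is rejected when its requests, added to what the namespace already uses, would exceed the hard limit. Here is a minimal sketch of that check for `requests.cpu` only. The helper names are hypothetical, and the parser handles just plain cores and the `m` (millicore) suffix.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "250m", "2" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def fits_quota(hard_cpu: str, used_cpu: str, requested_cpu: str) -> bool:
    """True when the new request still fits under the namespace's hard limit."""
    total = cpu_to_millicores(used_cpu) + cpu_to_millicores(requested_cpu)
    return total <= cpu_to_millicores(hard_cpu)


if __name__ == "__main__":
    # team-a has a hard limit of 100 cores and already requests 99.5 of them.
    print(fits_quota("100", "99500m", "500m"))
    print(fits_quota("100", "99500m", "501m"))
```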
Network Isolation:
# Default deny all NetworkPolicy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
Custom Resource Definitions (CRDs)
# Platform service CRD
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: platformservices.platform.io   # must be <plural>.<group>
spec:
  group: platform.io
  scope: Namespaced
  names:
    plural: platformservices
    singular: platformservice
    kind: PlatformService
  versions:
  - name: v1
    served: true
    storage: true
    subresources:
      status: {}   # lets controllers write status through the /status subresource
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              tier:
                type: string
                enum: ["production", "staging", "development"]
              autoscaling:
                type: boolean
              monitoring:
                type: boolean
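Operator Development
// Operator reconciliation loop (controller-runtime).
// Assumed imports: "context", "time", ctrl "sigs.k8s.io/controller-runtime",
// "sigs.k8s.io/controller-runtime/pkg/client",
// apierrors "k8s.io/apimachinery/pkg/api/errors", and the project's own v1 API package.
func (r *PlatformServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var platformService v1.PlatformService
	if err := r.Get(ctx, req.NamespacedName, &platformService); err != nil {
		// The object may have been deleted after the request was queued.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Ensure deployment exists. Reconcile runs repeatedly, so an already
	// existing Deployment is expected and must not be treated as a failure.
	deployment := r.deploymentForPlatformService(&platformService)
	if err := r.Create(ctx, deployment); err != nil && !apierrors.IsAlreadyExists(err) {
		return ctrl.Result{}, err
	}
	// Update status
	platformService.Status.Ready = true
	if err := r.Status().Update(ctx, &platformService); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{RequeueAfter: time.Minute}, nil
}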
Troubleshooting Guide
Common Issues and Solutions
1. Pod Stuck in Pending:
kubectl describe pod <pod-name>
kubectl get events --sort-by='.lastTimestamp'
kubectl get nodes -o wide
kubectl describe node <node-name>
2. CrashLoopBackOff:
kubectl logs <pod-name> --previous
kubectl describe pod <pod-name>
kubectl exec -it <pod-name> -- /bin/sh
3. Service Discovery Issues:
kubectl get endpoints
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- bash
nslookup kubernetes.default
4. Performance Issues:
kubectl top nodes
kubectl top pods --all-namespaces
kubectl get hpa
kubectl describe hpa <hpa-name>
Debugging Tools
# Node debugging (kubectl debug ships with kubectl; no krew plugin is required)
kubectl debug node/<node-name> -it --image=ubuntu
# Network debugging
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot
kubectl exec -it tmp-shell -- tcpdump -i any -w trace.pcap
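# Resource analysis (requests are collected as arrays, one entry per container,
# so multi-container pods do not produce duplicated rows)
kubectl get pods --all-namespaces -o json | jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name, cpu: [.spec.containers[].resources.requests.cpu], memory: [.spec.containers[].resources.requests.memory]}'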
Production Best Practices
High Availability
- Control Plane HA:
  - Multiple API servers behind LB
  - etcd cluster with odd number of nodes
  - Leader election for controllers
- Data Plane HA:
  - Multiple nodes across AZs
  - Pod disruption budgets
  - Node affinity rules
# PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: myapp
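With `minAvailable: 2`, a voluntary eviction (for example during a node drain) is permitted only while more than two matching pods are healthy. The arithmetic can be sketched as follows; the function name is illustrative and the model ignores percentage values and `maxUnavailable`.

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions a PodDisruptionBudget currently permits."""
    return max(0, healthy_pods - min_available)


if __name__ == "__main__":
    for healthy in (4, 3, 2, 1):
        print(f"healthy={healthy} allowed={disruptions_allowed(healthy, 2)}")
```

A Deployment with exactly two replicas and this budget therefore blocks every drain, so the replica count should exceed `minAvailable` by at least one.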
Resource Management
# Resource limits and requests
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
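How requests relate to limits determines a pod's QoS class, which in turn influences eviction order under node pressure. The sketch below classifies a pod from its containers' resource settings. It is a simplified model with a hypothetical function name, and it compares quantity strings literally, so `"1"` and `"1000m"` would count as different.

```python
def qos_class(containers: list) -> str:
    """Classify a pod as Guaranteed, Burstable or BestEffort.

    Each container is a dict with optional "requests" and "limits" mappings.
    """
    if not any(c.get("requests") or c.get("limits") for c in containers):
        return "BestEffort"  # nothing is set on any container
    for c in containers:
        requests, limits = c.get("requests", {}), c.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An unset request defaults to the limit.
            if limit is None or requests.get(resource, limit) != limit:
                return "Burstable"
    return "Guaranteed"


if __name__ == "__main__":
    example = {
        "requests": {"memory": "256Mi", "cpu": "250m"},
        "limits": {"memory": "512Mi", "cpu": "500m"},
    }
    print(qos_class([example]))
```

The settings shown above have requests below limits, so the pod is classified as Burstable; setting requests equal to limits for both CPU and memory would make it Guaranteed.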
# Vertical Pod Autoscaler
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Auto"
Certification and Learning Path
Certifications
- CKA (Certified Kubernetes Administrator)
- CKAD (Certified Kubernetes Application Developer)
- CKS (Certified Kubernetes Security Specialist)
Learning Resources
- Kubernetes in Action
- Production Kubernetes
- Kubernetes Course - TechWorld with Nana
- Kubernetes the Hard Way
- Kubernetes Documentation
- Killercoda Interactive Scenarios
Community Resources
- Kubernetes Slack
- CNCF Blog
- KubeCon Talks
Remember: Kubernetes is rapidly evolving. Stay updated with the latest releases and best practices through official documentation and community resources.