Troubleshooting & Debugging Guide for Platform Engineers
Production issues don't wait for convenient times. This guide covers systematic approaches to troubleshooting, essential tools, and real-world scenarios you'll face as a platform engineer.
The Troubleshooting Mindset
Systematic Approach
- Observe: What are the symptoms?
- Orient: What changed recently?
- Hypothesize: What could cause this?
- Test: Verify or eliminate hypotheses
- Act: Implement fixes
- Monitor: Ensure the fix works
Golden Rules
- Don't panic: Stay calm and methodical
- Preserve evidence: Don't destroy logs or state
- Document everything: Future you will thank you
- Know when to escalate: Some issues need more eyes
- Learn from incidents: Every outage is a learning opportunity
Essential Troubleshooting Tools
System Level Tools
Process and Resource Monitoring:
# Real-time system overview
htop # Interactive process viewer
atop # Advanced system monitor
glances # Cross-platform monitoring
# Quick system health check
uptime # Load average
free -h # Memory usage
df -h # Disk usage
iostat -x 1 # I/O statistics
vmstat 1 # Virtual memory stats
Network Troubleshooting:
# Connection testing
ping -c 4 google.com
traceroute google.com
mtr google.com # Combines ping and traceroute
# DNS debugging
dig example.com
nslookup example.com
host example.com
# Port and connection analysis
netstat -tulpn # All listening ports
ss -tulpn # Modern netstat replacement
lsof -i :80 # What's using port 80
nc -zv host 80 # Test port connectivity
# Packet analysis
tcpdump -i any -w capture.pcap
tcpdump -i eth0 host 10.0.0.1
tcpdump -i any port 80 -A # Show payload as ASCII (useful for plaintext protocols only)
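If a minimal container image lacks nc or curl, a TCP connect test in Python (often already present) covers the same ground as `nc -zv`; a small sketch, host and port below are placeholders:
import socket

def check_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, DNS failures
        return False

print(check_port("10.0.0.1", 443))  # hypothetical target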
Advanced Debugging Tools:
# System call tracing
strace -p <pid>
strace -e trace=network command
strace -c command # Summary statistics
# Library call tracing
ltrace command
# Kernel tracing
perf record -g command
perf report
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }' # modern libcs call openat, not open
Container and Kubernetes Debugging
Docker Troubleshooting:
# Container inspection
docker inspect <container>
docker logs --tail 50 -f <container>
docker exec -it <container> /bin/bash
# Resource usage
docker stats
docker system df
docker system events
# Network debugging
docker network ls
docker network inspect bridge
docker port <container>
Kubernetes Debugging:
# Pod troubleshooting
kubectl describe pod <pod>
kubectl logs <pod> --previous
kubectl logs <pod> -c <container>
kubectl exec -it <pod> -- /bin/bash
# Events and resources
kubectl get events --sort-by='.lastTimestamp'
kubectl top nodes
kubectl top pods --all-namespaces
# Advanced debugging
kubectl debug node/<node> -it --image=ubuntu
kubectl run debug --image=nicolaka/netshoot -it --rm
# Cluster diagnostics
kubectl cluster-info dump --output-directory=/tmp/cluster-dump
kubectl get pods --all-namespaces -o wide
kubectl get svc --all-namespaces
Application Performance Monitoring
APM Tools:
# JVM applications
jstack <pid> # Thread dump
jmap -heap <pid> # Heap summary
jstat -gcutil <pid> 1000 # GC statistics every 1000 ms
jconsole # GUI monitoring
# Python applications
py-spy record -o profile.svg --pid <pid>
python -m cProfile script.py
python -m trace --trace script.py
# Go applications
go tool pprof http://localhost:6060/debug/pprof/heap
go tool pprof http://localhost:6060/debug/pprof/profile
go tool trace trace.out
Common Troubleshooting Scenarios
Scenario 1: High CPU Usage
Symptoms:
- System slowness
- High load average
- Unresponsive applications
Investigation Steps:
# 1. Identify the culprit
top -H # Show threads
ps aux --sort=-cpu | head -10
# 2. Analyze the process
strace -p <pid> -c # System call summary
perf top -p <pid> # CPU profiling
# 3. Check for CPU throttling (path depends on cgroup version)
cat /sys/fs/cgroup/cpu/cpu.stat | grep throttled # cgroup v1
cat /sys/fs/cgroup/cpu.stat | grep throttled # cgroup v2
# 4. Thread analysis
ps -eLf | grep <pid> # All threads
pstack <pid> # Stack trace
Common Causes:
- Infinite loops
- Inefficient algorithms
- Garbage collection
- CPU limits/throttling
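To capture the first investigation step programmatically (for example, snapshotting top CPU consumers from an incident script), a sketch using the third-party psutil package (an assumption; install with pip) mirrors `ps aux --sort=-cpu`:
import time
import psutil

# The first cpu_percent() call primes per-process counters; after a short
# sleep, the second call reports utilization over that interval.
procs = list(psutil.process_iter(['pid', 'name']))
for p in procs:
    try:
        p.cpu_percent(interval=None)
    except psutil.Error:
        pass  # process may have exited

time.sleep(1.0)

usage = []
for p in procs:
    try:
        usage.append((p.cpu_percent(interval=None), p.info['pid'], p.info['name']))
    except psutil.Error:
        pass

for cpu, pid, name in sorted(usage, reverse=True)[:10]:
    print(f"{cpu:5.1f}% {pid:6d} {name}")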
Scenario 2: Memory Leak
Symptoms:
- Increasing memory usage
- OOM kills
- Swap usage increasing
Investigation Steps:
# 1. Memory overview
free -h
cat /proc/meminfo
vmstat 1
# 2. Find memory hogs
ps aux --sort=-rss | head -10
smem -rs rss -p # Sorted by RSS, shown as percentages
# 3. Process memory analysis
pmap -x <pid>
cat /proc/<pid>/status | grep Vm
cat /proc/<pid>/smaps
# 4. Heap analysis (Java example)
jmap -histo:live <pid>
jmap -dump:live,format=b,file=heap.bin <pid>
Memory Leak Detection:
# Python memory profiling
# Python memory profiling
from memory_profiler import profile

@profile
def memory_intensive_function():
    # Your code here
    pass

# Run with: python -m memory_profiler script.py
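If pulling in memory_profiler isn't an option, the standard library's tracemalloc can diff heap snapshots between two points in time; a minimal sketch:
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# ... run the suspected leaky workload here ...
leak = [list(range(1000)) for _ in range(100)]  # stand-in allocation

after = tracemalloc.take_snapshot()
# Top allocation growth, attributed to source lines
for stat in after.compare_to(before, 'lineno')[:10]:
    print(stat)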
Scenario 3: Disk I/O Issues
Symptoms:
- Slow application response
- High I/O wait
- Disk errors in logs
Investigation Steps:
# 1. I/O statistics
iostat -x 1
iotop -o
dstat -d --disk-util
# 2. File system usage
df -h
df -i # Inode usage
du -sh /* | sort -hr
# 3. Find I/O intensive processes
pidstat -d 1
iotop -b -o
# 4. Trace I/O operations
blktrace -d /dev/sda -o trace
blkparse trace
Scenario 4: Network Connectivity Issues
Symptoms:
- Connection timeouts
- Intermittent failures
- Slow response times
Investigation Steps:
# 1. Basic connectivity
ping -c 10 target.com
mtr --report target.com
# 2. DNS resolution
dig target.com
resolvectl status # formerly: systemd-resolve --status
# 3. Connection analysis
ss -tan | grep ESTABLISHED
netstat -s # Protocol statistics
# 4. Packet loss detection
ping -f -c 1000 target.com # flood ping (requires root)
iperf3 -c target.com
# 5. Firewall and routing
iptables -L -n -v
ip route show
traceroute -T -p 443 target.com
Scenario 5: Database Performance Issues
Symptoms:
- Slow queries
- Connection pool exhaustion
- Lock contention
PostgreSQL Troubleshooting:
-- Active queries
SELECT pid, age(clock_timestamp(), query_start), usename, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start DESC;
-- Lock analysis (PostgreSQL 9.6+): who is blocked, and by whom
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
-- Largest tables (candidates for bloat investigation)
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
MySQL Troubleshooting:
-- Active processes
SHOW FULL PROCESSLIST;
-- Lock information (MySQL 5.7; on 8.0 use performance_schema.data_locks / data_lock_waits)
SELECT * FROM information_schema.innodb_locks;
SELECT * FROM information_schema.innodb_lock_waits;
-- Query analysis
EXPLAIN SELECT * FROM table WHERE condition;
SHOW STATUS LIKE 'Handler_%';
Production Debugging Strategies
Safe Debugging in Production
1. Read-Only First (see the sketch after this list):
- Start with non-intrusive commands
- Avoid modifying state
- Use read replicas when possible
2. Safety Timeouts:
# Implement safety mechanisms
import signal
import sys

def timeout_handler(signum, frame):
    print("Debug operation timed out")
    sys.exit(1)

# Set 5-minute timeout for debug operations
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(300)
3. Canary Debugging:
- Test fixes on one instance first
- Monitor impact before full rollout
- Have rollback plan ready
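To make the read-only rule mechanical rather than a matter of discipline, you can open database sessions that the server itself refuses to let write; a sketch assuming psycopg2 and a hypothetical replica hostname:
import psycopg2

# Connect to a read replica (placeholder DSN) and force a read-only session;
# any accidental write then raises an error on the server side.
conn = psycopg2.connect("dbname=app host=replica.example.internal user=debug")
conn.set_session(readonly=True, autocommit=True)

with conn.cursor() as cur:
    cur.execute("SELECT state, count(*) FROM pg_stat_activity GROUP BY state;")
    for row in cur.fetchall():
        print(row)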
Distributed Tracing
OpenTelemetry Setup:
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter  # pip: opentelemetry-exporter-jaeger
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)
span_processor = BatchSpanProcessor(jaeger_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Use in code
with tracer.start_as_current_span("process_request"):
    # Your code here
    pass
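When one specific request is misbehaving, attach debugging context to the active span; set_attribute, add_event, and record_exception are all part of the OpenTelemetry Python API (attribute names and values below are illustrative):
from opentelemetry import trace

def risky_operation():
    raise RuntimeError("simulated failure")  # stand-in for the code under investigation

span = trace.get_current_span()
span.set_attribute("request.size_bytes", 1024)    # illustrative attribute
span.add_event("cache_miss", {"key": "user:42"})  # hypothetical event payload
try:
    risky_operation()
except RuntimeError as exc:
    span.record_exception(exc)  # attaches the exception and stack trace to the span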
Log Analysis and Correlation
Centralized Logging
Log Aggregation Pipeline:
# Filebeat configuration
filebeat.inputs:
- type: container
  paths:
    - '/var/lib/docker/containers/*/*.log'
  processors:
    - add_kubernetes_metadata:
        host: ${NODE_NAME}
        matchers:
        - logs_path:
            logs_path: "/var/lib/docker/containers/"

output.elasticsearch:
  hosts: ['${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
  index: "filebeat-%{[agent.version]}-%{+yyyy.MM.dd}"
Log Correlation Queries:
# Elasticsearch query for an error spike in the last hour
# ("fixed_interval" requires ES 7.x+; older versions use "interval")
curl -X GET "localhost:9200/logs-*/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        {"match": {"level": "ERROR"}},
        {"range": {"@timestamp": {"gte": "now-1h"}}}
      ]
    }
  },
  "aggs": {
    "errors_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "5m"
      }
    }
  }
}'
Pattern Recognition
Common Log Patterns:
# Log anomaly detection
import re
from collections import Counter

def analyze_logs(log_file):
    error_patterns = Counter()
    patterns = [
        (r'OutOfMemoryError', 'OOM'),
        (r'Connection refused', 'Connection Error'),
        (r'Timeout|timed out', 'Timeout'),
        (r'NullPointerException', 'NPE'),
        (r'database is locked', 'DB Lock'),
    ]
    with open(log_file, 'r') as f:
        for line in f:
            for pattern, category in patterns:
                if re.search(pattern, line, re.I):
                    error_patterns[category] += 1
    return error_patterns.most_common(10)
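Usage is a one-liner; the log path is a placeholder:
for category, count in analyze_logs('/var/log/app/app.log'):
    print(f"{count:6d}  {category}")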
Interview Scenarios
Common Troubleshooting Questions
1. "How would you debug a memory leak in production?"
   - Start with monitoring metrics
   - Use heap dumps carefully
   - Analyze with minimal impact
   - Have a rollback strategy
2. "A service is experiencing intermittent timeouts. How do you investigate?"
   - Check the network path
   - Analyze connection pools
   - Review timeout settings
   - Look for patterns
3. "The database is slow. What's your approach?"
   - Check slow query logs
   - Analyze execution plans
   - Review indexes
   - Monitor connections
4. "How do you handle cascading failures?"
   - Circuit breakers (a minimal sketch follows this list)
   - Bulkheads
   - Timeout tuning
   - Graceful degradation
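Since circuit breakers come up so often, here is a minimal sketch of the pattern, a deliberately simplified illustration rather than a production implementation:
import time

class CircuitBreaker:
    """Open the circuit after max_failures consecutive errors;
    retry (half-open) after reset_timeout seconds."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit
        return result

# Usage: breaker = CircuitBreaker(); breaker.call(some_flaky_function)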
Hands-On Scenarios
Scenario Setup for Practice:
# Create a problematic container (image name is a placeholder)
docker run -d --name buggy-app \
  --memory="50m" \
  --cpus="0.5" \
  your/buggy-app:latest
# Introduce network latency (undo with: tc qdisc del dev eth0 root netem)
tc qdisc add dev eth0 root netem delay 100ms
# Simulate disk pressure
stress-ng --io 4 --timeout 60s
# Generate load
ab -n 10000 -c 100 http://localhost:8080/
Building a Troubleshooting Toolkit
Essential Scripts
System Health Check:
#!/bin/bash
# health_check.sh - Quick system health assessment
echo "=== System Health Check ==="
echo "Date: $(date)"
echo
echo "--- Load Average ---"
uptime
echo -e "\n--- Memory Usage ---"
free -h
echo -e "\n--- Disk Usage ---"
df -h | grep -vE '^Filesystem|tmpfs|cdrom'
echo -e "\n--- Top CPU Processes ---"
ps aux --sort=-cpu | head -5
echo -e "\n--- Network Connections ---"
echo "Established connections: $(ss -tan | grep -c ESTAB)"
echo -e "\n--- Recent Errors ---"
journalctl -p err -n 10 --no-pager
Documentation Templates
Incident Report Template:
# Incident Report: [Title]
**Date**: [YYYY-MM-DD]
**Duration**: [Start time - End time]
**Severity**: [P1/P2/P3]
**Services Affected**: [List services]
## Summary
[Brief description of the incident]
## Timeline
- HH:MM - [Event description]
- HH:MM - [Event description]
## Root Cause
[Detailed explanation of what caused the incident]
## Resolution
[Steps taken to resolve the issue]
## Impact
- [Customer impact]
- [Business impact]
## Lessons Learned
1. [What went well]
2. [What could be improved]
## Action Items
- [ ] [Action item with owner]
- [ ] [Action item with owner]
Resources for Continuous Learning
Books
- 📚 Site Reliability Engineering - Google
- 📚 Debugging: The 9 Indispensable Rules - David Agans
- 📚 Effective Debugging - Diomidis Spinellis
Online Resources
- 📖 Brendan Gregg's Performance Site
- 🎥 SREcon Talks
- 📖 Production Readiness Checklist
- 🎮 Troubleshooting Scenarios
Tools to Master
- Monitoring: Prometheus, Grafana, Datadog
- Tracing: Jaeger, Zipkin, AWS X-Ray
- Logging: ELK Stack, Fluentd, Splunk
- Profiling: pprof, Java Flight Recorder, perf
Remember: The best troubleshooters combine systematic thinking, deep technical knowledge, and excellent communication skills. Every incident is an opportunity to improve your systems and processes.