System Design for Platform Engineers
System design interviews for platform engineering roles focus on building reliable, scalable, and maintainable infrastructure. Unlike product-focused interviews, they expect you to demonstrate a deep understanding of distributed systems, infrastructure components, and operational excellence.
Platform Engineering System Design Focus Areas
What Makes Platform Engineering Design Different
- Infrastructure First: Design systems that other teams build upon
- Operational Excellence: Consider monitoring, debugging, and maintenance from day one
- Cost Optimization: Balance performance with resource efficiency
- Multi-tenancy: Design for multiple teams and applications
- Security and Compliance: Build security into the platform layer
System Design Interview Framework
1. Requirements Gathering (5-10 minutes)
Functional Requirements:
- What services need to be supported?
- What are the SLAs/SLOs?
- What scale are we designing for?
- What are the integration points?
Non-Functional Requirements:
- Availability targets (99.9%, 99.99%?)
- Latency requirements
- Throughput expectations
- Security and compliance needs
- Cost constraints
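To make an availability target concrete, it helps to convert it into a downtime budget. The sketch below assumes a 30-day window; the function name `downtime_budget_minutes` is illustrative, not part of any standard library.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of allowed downtime in a window for a given availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% over 30 days allows about {budget:.2f} minutes of downtime")
```

Under these assumptions, 99.9% leaves roughly 43 minutes per 30 days and 99.99% roughly 4 minutes, which is why each extra nine changes the design (and the cost) so sharply.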
Example Questions to Ask:
- "What's our target availability?"
- "What's the expected request rate?"
- "What regions do we need to support?"
- "What's our budget constraint?"
- "What compliance requirements exist?"
2. Capacity Estimation (5 minutes)
Calculate:
- Requests per second (RPS)
- Storage requirements
- Bandwidth needs
- Server/container count
- Cost estimates
Example Calculation:
Daily Active Users: 10M
Requests per user: 100/day
Total requests: 1B/day ÷ 86,400 s/day ≈ 11,574 RPS
Peak traffic: 3x average ≈ 34,722 RPS
With 20% headroom: 34,722 x 1.2 ≈ 41,667 RPS needed
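The same arithmetic can be written as a small helper so the inputs are easy to vary during an interview. This is a sketch: the 3x peak factor and 20% headroom come from the example above, while the per-server throughput of 500 RPS is a purely illustrative assumption.

```python
import math

SECONDS_PER_DAY = 86_400


def required_rps(dau, requests_per_user_per_day, peak_factor=3.0, headroom=0.20):
    """Return (average, peak, provisioned) requests per second."""
    average = dau * requests_per_user_per_day / SECONDS_PER_DAY
    peak = average * peak_factor
    return average, peak, peak * (1 + headroom)


def servers_needed(target_rps, rps_per_server):
    """Round up: a fraction of a server still means one more server."""
    return math.ceil(target_rps / rps_per_server)


if __name__ == "__main__":
    avg, peak, target = required_rps(10_000_000, 100)
    print(f"average={avg:,.0f} peak={peak:,.0f} provisioned={target:,.0f} RPS")
    # 500 RPS per server is an assumed figure, not a measured one.
    print("servers:", servers_needed(target, rps_per_server=500))
```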
3. High-Level Design (10-15 minutes)
Start with major components:
- Load balancers
- API gateways
- Service mesh
- Data stores
- Message queues
- Caching layers
- Monitoring stack
4. Detailed Design (15-20 minutes)
Deep dive into:
- Data flow
- API design
- Database schema
- Caching strategy
- Security measures
- Monitoring and alerting
5. Scale and Optimize (10 minutes)
Discuss:
- Bottlenecks
- Scaling strategies
- Performance optimization
- Cost optimization
- Disaster recovery
Common Platform Engineering System Design Questions
1. Design a CI/CD Platform
Requirements:
- Support 1000+ developers
- Multiple programming languages
- 10,000 builds/day
- Artifact storage
- Security scanning
Key Components:
┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  Git Repo   │────▶│ Webhook/API  │────▶│ Build Queue  │
└─────────────┘     └──────────────┘     └──────────────┘
                                                │
                                                ▼
┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  Builders   │◀────│  Scheduler   │     │ Orchestrator │
└─────────────┘     └──────────────┘     └──────────────┘
       │
       ▼
┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  Artifacts  │     │ Test Results │     │  Deployment  │
└─────────────┘     └──────────────┘     └──────────────┘
Design Considerations:
- Build isolation (containers/VMs)
- Queue management (Kafka/RabbitMQ)
- Artifact storage (S3/Artifactory)
- Secret management
- Build caching
- Monitoring and metrics
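Build caching usually hinges on a deterministic cache key: identical inputs must map to the same key, and any changed input must produce a different one. The following is a minimal sketch of one way to derive such a key; the function name and the choice of inputs (toolchain version, lockfile, source files) are assumptions for illustration, not how any particular CI system does it.

```python
import hashlib
from typing import Mapping


def build_cache_key(toolchain: str, lockfile: bytes, sources: Mapping[str, bytes]) -> str:
    """Hash every build input into a single hex key.

    Paths are sorted so the key does not depend on iteration order.
    """
    h = hashlib.sha256()
    h.update(toolchain.encode("utf-8"))
    h.update(b"\0")
    h.update(hashlib.sha256(lockfile).digest())
    for path in sorted(sources):
        h.update(path.encode("utf-8"))
        h.update(b"\0")  # separator so "ab" + "c" differs from "a" + "bc"
        h.update(hashlib.sha256(sources[path]).digest())
    return h.hexdigest()


if __name__ == "__main__":
    key = build_cache_key("go1.22", b"lock-v1", {"main.go": b"package main"})
    print(key)
```

A builder would look up this key in artifact storage before running the build and skip the work on a hit.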
2. Design a Container Orchestration Platform
Requirements:
- Manage 10,000+ containers
- Multi-region deployment
- Auto-scaling
- Service discovery
- Zero-downtime deployments
Key Components:
 ┌─────────────────┐
 │  Control Plane  │
 ├─────────────────┤
 │ • API Server    │
 │ • Scheduler     │
 │ • Controller    │
 │ • etcd          │
 └────────┬────────┘
          │
     ┌────┴────┐
     ▼         ▼
┌────────┐ ┌────────┐
│ Node 1 │ │ Node 2 │
├────────┤ ├────────┤
│ Kubelet│ │ Kubelet│
│ Proxy  │ │ Proxy  │
│Runtime │ │Runtime │
└────────┘ └────────┘
Design Considerations:
- Control plane high availability
- Network policies and CNI
- Storage orchestration
- Resource allocation
- Security policies
- Multi-tenancy
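Resource allocation in a scheduler is often explained as two phases: filter out nodes that cannot fit the workload, then score the remaining ones. The sketch below illustrates that idea only; it is not the Kubernetes scheduler, and the `Node`, `Pod`, and `schedule` names plus the "most free capacity wins" scoring rule are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    cpu_free: float  # cores
    mem_free: int    # MiB


@dataclass
class Pod:
    name: str
    cpu: float  # requested cores
    mem: int    # requested MiB


def schedule(pod: Pod, nodes: List[Node]) -> Optional[Node]:
    """Place a pod on a node, or return None if nothing fits."""
    # Phase 1: filter out nodes without enough free CPU or memory.
    feasible = [n for n in nodes if n.cpu_free >= pod.cpu and n.mem_free >= pod.mem]
    if not feasible:
        return None
    # Phase 2: score by capacity left after placement (cores plus GiB),
    # which spreads load toward the emptiest node.
    best = max(feasible, key=lambda n: (n.cpu_free - pod.cpu) + (n.mem_free - pod.mem) / 1024)
    best.cpu_free -= pod.cpu
    best.mem_free -= pod.mem
    return best


if __name__ == "__main__":
    cluster = [Node("node-1", 2.0, 4096), Node("node-2", 8.0, 16384)]
    placed = schedule(Pod("web", 1.0, 1024), cluster)
    print(placed.name if placed else "unschedulable")
```

Swapping the scoring function (for example, to pack nodes tightly instead of spreading) is a good trade-off to discuss: spreading improves resilience, packing improves cost.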
3. Design a Monitoring and Observability Platform
Requirements:
- 1M metrics/second
- 100TB logs/day
- Distributed tracing
- 99.9% availability
- 30-day retention
Key Components:
┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│   Agents    │────▶│  Collectors  │────▶│   Storage    │
└─────────────┘     └──────────────┘     └──────────────┘
                            │                    │
                            ▼                    ▼
                    ┌──────────────┐      ┌──────────────┐
                    │   Querying   │◀─────│ Aggregation  │
                    └──────────────┘      └──────────────┘
                            │
                            ▼
                    ┌──────────────┐
                    │Visualization │
                    └──────────────┘
Design Considerations:
- Time-series database selection
- Data retention policies
- Sampling strategies
- Query performance
- Alert management
- Data compression
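Retention and compression decisions are easier to defend with a rough storage estimate. The sketch below uses the requirements stated above (1M metrics/second, 100TB logs/day, 30-day retention); the 2 bytes per compressed sample and the 10:1 log compression ratio are assumptions chosen for illustration, and replication is not included.

```python
SECONDS_PER_DAY = 86_400
BYTES_PER_TB = 10**12


def metrics_storage_tb(samples_per_sec, retention_days, bytes_per_sample):
    """Raw time-series storage in TB, before replication."""
    total_samples = samples_per_sec * SECONDS_PER_DAY * retention_days
    return total_samples * bytes_per_sample / BYTES_PER_TB


def logs_storage_tb(tb_per_day, retention_days, compression_ratio):
    """Compressed log storage in TB, before replication."""
    return tb_per_day * retention_days / compression_ratio


if __name__ == "__main__":
    # bytes_per_sample=2 and compression_ratio=10 are assumed values.
    print(f"metrics: {metrics_storage_tb(1_000_000, 30, 2):.2f} TB")
    print(f"logs:    {logs_storage_tb(100, 30, 10):.0f} TB")
```

With these inputs, logs dominate storage by roughly two orders of magnitude, which suggests focusing retention tiers and sampling on logs first.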
4. Design a Service Mesh
Requirements:
- Handle 100K RPS
- mTLS between services
- Traffic management
- Circuit breaking
- Observability
Key Components:
┌────────────────┐
│ Control Plane  │
├────────────────┤
│ • Config Mgmt  │
│ • Cert Mgmt    │
│ • Policy Engine│
└───────┬────────┘
        │
┌───────▼────────┐
│   Data Plane   │
├────────────────┤
│ Sidecar Proxies│
│    (Envoy)     │
└────────────────┘
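Traffic management in a mesh often means splitting requests between versions by weight, for example during a canary rollout. The sketch below shows one deterministic way a proxy could do this by hashing a request identifier; it is an illustration, not Envoy's algorithm, and `pick_backend` and the backend names are hypothetical.

```python
import hashlib
from typing import Dict


def pick_backend(request_id: str, weights: Dict[str, int]) -> str:
    """Map a request to a backend in proportion to integer weights.

    Hashing the request ID keeps the choice stable across retries.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    cumulative = 0
    for backend, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return backend
    raise AssertionError("unreachable: bucket is always below the total weight")


if __name__ == "__main__":
    split = {"checkout-v1": 90, "checkout-v2": 10}
    print(pick_backend("req-12345", split))
```

In a real mesh the weights would be pushed from the control plane to the sidecars, so a rollout is a configuration change rather than a redeploy.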
5. Design a Multi-Region Database Platform
Requirements:
- Global distribution
- Strong consistency options
- 99.99% availability
- Automatic failover
- Compliance with data residency
Key Components:
┌─────────────────────────────────────┐
│         Global Coordinator          │
└─────────────┬───────────────────────┘
              │
     ┌────────┴───────┬─────────────┐
     ▼                ▼             ▼
┌─────────┐      ┌─────────┐   ┌─────────┐
│Region 1 │      │Region 2 │   │Region 3 │
├─────────┤      ├─────────┤   ├─────────┤
│ Primary │◀────▶│ Replica │◀─▶│ Replica │
└─────────┘      └─────────┘   └─────────┘
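One common way to offer "strong consistency options" is quorum replication, where a read quorum R and a write quorum W over N replicas overlap whenever R + W > N. Quorums are an assumption here (the design above does not mandate them); the sketch simply shows the trade-off for the three regions in the diagram.

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if every read quorum intersects every write quorum."""
    return r + w > n


def write_failures_tolerated(n: int, w: int) -> int:
    """Replicas that can be down while writes still reach quorum."""
    return n - w


if __name__ == "__main__":
    n = 3  # one replica per region, as in the diagram
    for w, r in [(3, 1), (2, 2), (1, 1)]:
        print(
            f"N={n} W={w} R={r}: overlap={quorums_overlap(n, w, r)}, "
            f"writes survive {write_failures_tolerated(n, w)} replica failure(s)"
        )
```

W=2, R=2 keeps reads consistent while tolerating one region outage; W=1, R=1 is faster but only eventually consistent.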
Platform-Specific Design Patterns
1. Reliability Patterns
Circuit Breaker
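class CircuitBreaker:
    """State holder for a circuit breaker; call and transition logic not shown."""

    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold  # failures before the circuit opens
        self.recovery_timeout = recovery_timeout    # seconds to wait before a trial call
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED -> OPEN -> HALF_OPEN -> CLOSED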
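The constructor above only stores state. A minimal sketch of how the state transitions might be wired around a call is shown below. It reuses the same field names; the `call` method, the `CircuitOpenError` exception, the injectable `clock`, and the `HALF_OPEN` trial behaviour are assumptions added for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SketchCircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # injectable so tests do not need to sleep
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if self.clock() - self.last_failure_time >= self.recovery_timeout:
                self.state = 'HALF_OPEN'  # allow one trial call through
            else:
                raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = self.clock()
        # A failed trial call, or too many failures, opens the circuit.
        if self.state == 'HALF_OPEN' or self.failure_count >= self.failure_threshold:
            self.state = 'OPEN'

    def _on_success(self):
        self.failure_count = 0
        self.state = 'CLOSED'
```

Usage is a thin wrapper around any remote call, for example `breaker.call(fetch_user, user_id)`, where `fetch_user` stands in for whatever dependency is being protected.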
Bulkhead Pattern
- Isolate resources to prevent cascade failures
- Use connection pools with limits
- Implement thread pool isolation
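A bulkhead can be sketched as a hard cap on concurrent calls to one dependency, so a slow dependency cannot exhaust every worker. The example below uses a semaphore and rejects immediately when the compartment is full; the `Bulkhead` and `BulkheadFullError` names are hypothetical, and rejecting instead of queueing is a design choice made for the example.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when no concurrency slot is available."""


class Bulkhead:
    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: shed load rather than queue behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("bulkhead is at capacity")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


if __name__ == "__main__":
    payments = Bulkhead(max_concurrent=10)
    print(payments.call(lambda: "charged"))
```

Giving each downstream dependency its own `Bulkhead` instance is what provides the isolation: one saturated compartment does not affect the others.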
2. Scalability Patterns
Horizontal Scaling
- Stateless services
- Shared-nothing architecture
- Database sharding
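For database sharding, the simplest routing rule is a stable hash of the key modulo the shard count. The sketch below assumes string keys and a fixed number of shards; `shard_for` is an illustrative name.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Return a shard index in [0, num_shards) that is stable across processes."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # Python's built-in hash() is randomized per process, so use a real digest.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    print(shard_for("tenant-42", 8))
```

The trade-off worth mentioning is resharding: with plain modulo, changing `num_shards` remaps most keys, which is why consistent hashing or a lookup directory is often preferred when the shard count is expected to grow.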
Caching Strategies
- Cache-aside
- Write-through
- Write-behind
- Refresh-ahead
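Of the strategies listed, cache-aside is the one most often asked about: the application checks the cache first and falls back to the data store on a miss. The sketch below is an in-memory illustration with a TTL; the `CacheAside` class, its `loader` callback, and the injectable `clock` are assumptions for the example.

```python
import time


class CacheAside:
    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader      # fetches from the source of truth on a miss
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]        # cache hit
        value = self._loader(key)  # miss or expired: read through to the store
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Call after writing to the store so the next read reloads."""
        self._entries.pop(key, None)


if __name__ == "__main__":
    cache = CacheAside(loader=lambda k: f"row-for-{k}", ttl_seconds=30)
    print(cache.get("user:1"))
    print(cache.get("user:1"))  # served from the cache
```

Write-through and write-behind differ in who updates the cache on writes; here the writer only invalidates, which keeps the cache simple at the cost of one extra miss after each write.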
3. Security Patterns
Zero Trust Architecture
- Verify explicitly
- Least privilege access
- Assume breach
Secret Management
- Centralized secret store
- Dynamic secret generation
- Secret rotation
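Dynamic secrets and rotation can be illustrated with short-lived leases: each credential is generated on demand, expires on its own, and is replaced on rotation. The sketch below is an in-memory toy, not a real secret store; `DynamicSecretIssuer` and its methods are hypothetical names.

```python
import hmac
import secrets
import time


class DynamicSecretIssuer:
    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._leases = {}  # lease_id -> (secret, expires_at)

    def issue(self):
        """Generate a fresh random credential with a bounded lifetime."""
        lease_id = secrets.token_hex(8)
        secret = secrets.token_urlsafe(32)
        self._leases[lease_id] = (secret, self._clock() + self._ttl)
        return lease_id, secret

    def is_valid(self, lease_id, secret):
        lease = self._leases.get(lease_id)
        if lease is None:
            return False
        stored, expires_at = lease
        if self._clock() >= expires_at:
            del self._leases[lease_id]  # expired leases are dropped lazily
            return False
        # Constant-time comparison avoids leaking how many characters matched.
        return hmac.compare_digest(stored.encode("utf-8"), secret.encode("utf-8"))

    def rotate(self, lease_id):
        """Revoke the old lease and hand back a new credential."""
        self._leases.pop(lease_id, None)
        return self.issue()
```

Short TTLs limit the blast radius of a leaked credential, which ties directly back to the "assume breach" principle above.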
System Design Resources
Books
- 📚 Designing Data-Intensive Applications - Martin Kleppmann
- 📚 Site Reliability Engineering - Google
- 📚 The System Design Interview - Alex Xu
- 📚 Building Microservices - Sam Newman
Online Courses
- 🎓 System Design Interview - An Insider's Guide
- 🎓 Designing Distributed Systems
- 🎥 System Design Playlist - Gaurav Sen
Practice Resources
- 🎯 System Design Primer
- 🎯 High Scalability - Real World Architectures
- 📖 AWS Architecture Center
- 📖 Google Cloud Architecture Framework
Common Pitfalls to Avoid
- Over-engineering: Don't design for 1B users when asked for 1M
- Ignoring constraints: Always consider cost, team size, timeline
- Missing monitoring: Every design needs observability
- Forgetting security: Security should be built-in, not bolted-on
- Neglecting operations: Consider how the system will be maintained
Tips for Success
- Think out loud: Verbalize your thought process
- Start simple: Begin with a basic design and iterate
- Draw diagrams: Visual representations help communicate ideas
- Consider trade-offs: Every decision has pros and cons
- Ask questions: Clarify requirements and constraints
- Know your numbers: Memorize common latency and capacity figures
Latency Numbers Every Platform Engineer Should Know
L1 cache reference                           0.5 ns
Branch mispredict                              5 ns
L2 cache reference                             7 ns
Mutex lock/unlock                             25 ns
Main memory reference                        100 ns
Compress 1K bytes with Zippy               3,000 ns
Send 1K bytes over 1 Gbps network         10,000 ns
Read 4K randomly from SSD                150,000 ns
Read 1 MB sequentially from memory       250,000 ns
Round trip within same datacenter        500,000 ns
Read 1 MB sequentially from SSD        1,000,000 ns
Disk seek                             10,000,000 ns
Read 1 MB sequentially from disk      20,000,000 ns
Send packet CA->Netherlands->CA      150,000,000 ns
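These figures are most useful for quick back-of-envelope comparisons. The sketch below takes a few rows from the table and scales them linearly, which is itself a simplifying assumption (real systems batch, pipeline, and parallelize); the dictionary keys and function names are illustrative.

```python
# Values copied from the table above, in nanoseconds.
LATENCY_NS = {
    "read_1mb_memory": 250_000,
    "read_1mb_ssd": 1_000_000,
    "read_1mb_disk": 20_000_000,
    "datacenter_round_trip": 500_000,
}


def sequential_read_seconds(size_mb: float, medium: str) -> float:
    """Estimated time to read size_mb sequentially from 'memory', 'ssd', or 'disk'."""
    return size_mb * LATENCY_NS[f"read_1mb_{medium}"] / 1e9


def serial_round_trips_seconds(count: int) -> float:
    """Estimated time for `count` back-to-back round trips inside one datacenter."""
    return count * LATENCY_NS["datacenter_round_trip"] / 1e9


if __name__ == "__main__":
    for medium in ("memory", "ssd", "disk"):
        print(f"1 GB from {medium}: {sequential_read_seconds(1000, medium):.2f} s")
    print(f"1,000 serial round trips: {serial_round_trips_seconds(1000):.2f} s")
```

The takeaway for design discussions is the ratio rather than the absolute value: a chatty protocol with a thousand serial round trips costs about as much as reading half a gigabyte from SSD.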
Remember: System design for platform engineering is about building the foundation that enables other teams to succeed. Focus on reliability, scalability, and operational excellence in every design decision.