The Cloud Repatriation Debate: When AWS Costs 10-100x More Than It Should [2025 Platform Engineering Guide]
Quick Answer (TL;DR)
Problem: AWS and major cloud providers charge 7-18x markups on compute, with hidden egress fees and unpredictable billing driving total cost 10-100x higher than bare metal alternatives for many workloads.
Movement: 86% of CIOs planned cloud repatriation in 2025 (up from 43% in 2020), but only 8-9% pursue full exit—most do selective workload optimization.
Real Savings:
- 37signals: $2M/year saved, $10M projected over 5 years
- Dropbox: $74.6M saved over 2 years moving 90% of data
- Typical enterprises: 15-30% infrastructure cost reduction
Key Trade-offs:
- Bare Metal Wins: Predictable workloads at scale (>50 servers, >12 months stable), high bandwidth needs, limited managed service usage
- Cloud Wins: True burst scaling requirements, pre-PMF startups, global multi-region presence, heavy managed service reliance, compliance constraints
Timeline: Break-even at 50-100 sustained servers or 12-24 months of stable usage patterns.
Decision Framework: Evaluate actual elasticity needs, egress costs, managed service dependency, team expertise, and total cost of ownership—not just sticker price.
🎙️ Listen to the podcast episode: The Cloud Repatriation Debate - Jordan and Alex discuss real companies saving millions by leaving the cloud, expose hidden costs like egress fees and NAT gateways, and debate when cloud makes sense versus when it's "highway robbery."
The Repatriation Wave Reaches Critical Mass
David Heinemeier Hansson (DHH) from 37signals stood before a $3.2 million annual AWS bill in 2022 and asked the question that would spark a movement: "Why are we paying this much?"
By 2024, 37signals had completely exited AWS. Their new bill: $1.3 million annually—saving nearly $2 million per year, with projections exceeding $10 million in savings over five years.
They're not alone. 86% of CIOs planned to repatriate at least some cloud workloads in 2025, the highest figure on record in Barclays' CIO Survey and double the 43% who said the same in late 2020.
But here's what the headlines don't tell you: this isn't a wholesale rejection of cloud computing. Only 8-9% of companies plan full workload repatriation. The real story is nuanced, complex, and far more interesting than "cloud bad, bare metal good."
Let's examine the economics, break down the 10-100x markup claim with real data, and build decision frameworks platform engineering teams can actually use.
Key Statistics (2024-2025 Data)
| Statistic | Value | Source | Context |
|---|---|---|---|
| CIOs planning repatriation | 86% in 2025 (vs 43% in 2020) | Barclays CIO Survey 2024 | Highest on record, but mostly selective rather than wholesale |
| Full workload repatriation | Only 8-9% of companies | IDC 2024 | Most do selective optimization, not complete exit |
| Actual workloads repatriated | 21% of workloads and data | IDC Server and Storage Survey | Net cloud growth continues despite exits |
| AWS compute markup | 7-18x vs bare metal | Hetzner vs AWS pricing comparison | 80-core server: €190/mo vs $2,500-$3,500/mo |
| 37signals savings | $2M annually, $10M+ over 5 years | 37signals public disclosure | $700K hardware investment, existing team |
| Dropbox savings | $74.6M over 2 years | Dropbox S-1 filing 2018 | Moved 90% of data, custom infrastructure |
| AWS egress pricing | $0.09/GB after 100GB free | AWS pricing 2025 | Free tier raised from 1GB to 100GB, but egress is still a major cost |
| Hetzner included traffic | 20TB free with entry plans | Hetzner pricing | No egress charges vs AWS $900 for 10TB transfer |
| Cloud waste | 15-30% of cloud spend | FinOps Foundation 2024 | Idle resources, oversized instances, unused services |
| Break-even point | 50-100 sustained servers | Industry analysis | Or 12-24 months stable workload patterns |
| Infrastructure team cost | $2,500-$6,000/month overhead | 10-20 hrs/week at $60-75/hr | Hidden cost of bare metal management |
| Typical ROI threshold | 40-60% cost reduction at scale | Multiple case studies | Above 200 servers with predictable usage |
The 10-100x Markup Reality
The Bold Claim
Víctor Martínez's viral article made a provocative assertion: cloud providers charge 10-100x what infrastructure actually costs. Is this hyperbole or documented reality?
Let's break down the numbers with specific examples.
Real Cost Comparison: AWS vs Hetzner (Bare Metal)
Equivalent 80-Core Configuration:
| Provider | Specs | Monthly Cost | Markup vs Bare Metal |
|---|---|---|---|
| Hetzner Bare Metal | 80 cores, dedicated hardware | €190 (~$207) | Baseline (1x) |
| AWS EC2 On-Demand | Comparable C5/C6 instances | $2,500-$3,500 | 13-18x |
| AWS Reserved (3-year) | Same instances, 3-year commit | ~$1,300/mo + $46K upfront | 7x (plus lock-in) |
Reality Check: The 7-18x markup claim is verifiable and conservative for compute-intensive workloads.
💡 Key Takeaway
AWS compute costs 7-18x more than bare metal for equivalent resources. Reserved instances reduce this to 7x but require $46K upfront and 3-year commitment. The markup is real, not marketing hype.
VPS Middle Ground: AWS vs DigitalOcean vs Hetzner
For teams not ready for bare metal, VPS providers offer middle ground:
| Provider | 8-core, 32GB RAM | Monthly Cost | Bandwidth Included | Egress Overage |
|---|---|---|---|---|
| Hetzner VPS | 8 vCPU, 32GB RAM, 160GB SSD | €50 (~$55) | 20TB | Free beyond 20TB |
| DigitalOcean | 8 vCPU, 32GB RAM, 200GB SSD | ~$120 | Shared pool | $0.01/GB |
| AWS EC2 | m5.2xlarge (8 vCPU, 32GB) | ~$280/mo | 100GB | $0.09/GB after 100GB |
Markup Analysis:
- AWS costs 5x more than Hetzner for equivalent VPS
- DigitalOcean sits at 2.2x Hetzner pricing
- Bandwidth costs can double or triple total bills for data-heavy workloads
The Egress Fee Multiplier
Compute markup is only half the story. Egress fees can turn a manageable cloud bill into a financial nightmare.
Scenario: 10TB Monthly Data Transfer
| Provider | Egress Cost Formula | Monthly Egress Cost | Annual Impact |
|---|---|---|---|
| AWS | 100GB free + 9,900GB × $0.09 | $891/month | $10,692/year |
| DigitalOcean | Shared pool + $0.01/GB overage | ~$100/month | $1,200/year |
| Hetzner | 20TB included free | $0 | $0 |
For a workload transferring 10TB/month, AWS charges $10,692 annually just for bandwidth that Hetzner includes free.
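The AWS row in the table can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming a flat $0.09/GB rate after the 100GB free tier (volume tiers and regional differences are ignored) and treating 10TB as 10,000GB; the function and constant names are illustrative, not part of any AWS tooling.

```python
AWS_FREE_GB = 100        # monthly free tier, as quoted in the table above
AWS_RATE_PER_GB = 0.09   # assumption: flat rate, higher-volume tiers ignored


def aws_egress_monthly(gb_out: float) -> float:
    """Estimated monthly AWS egress cost in USD under the flat-rate assumption."""
    billable_gb = max(0.0, gb_out - AWS_FREE_GB)
    return round(billable_gb * AWS_RATE_PER_GB, 2)


monthly = aws_egress_monthly(10_000)  # 10TB expressed as 10,000GB
annual = round(monthly * 12, 2)
print(monthly, annual)  # 891.0 10692.0
```

Plugging in your own monthly transfer volume gives a quick first estimate before pulling exact figures from a billing export.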
💡 Key Takeaway
Egress fees are the "hidden tax" that pushes cloud costs from 7-18x markup to 10-100x total cost of ownership for data-intensive workloads. A single 10TB/month transfer costs $10,692/year on AWS versus $0 on Hetzner.
Storage: The S3 Lock-In Tax
Scenario: 18 Petabytes of Storage (37signals' use case)
| Provider | Storage Model | Annual Cost | Egress Penalty |
|---|---|---|---|
| AWS S3 | $0.023/GB storage + egress | ~$1.5M/year | +$0.09/GB out |
| Pure Storage (on-prem) | Upfront hardware + maintenance | Hardware CAPEX | No egress fees |
37signals' S3 bill alone: $1.5 million annually. AWS waived $250K in egress fees for their migration—a telling admission that exit costs are punitive.
The 100x Cases: Where It Actually Happens
The 100x markup isn't typical, but it does occur in specific scenarios:
100x Markup Scenarios:
- Managed Service Abuse: AWS Elasticsearch vs self-hosted on bare metal (markup 40-80x)
- Function-as-a-Service Overuse: Lambda costs vs containerized equivalents on owned hardware (50-100x at scale)
- Serverless Databases: Aurora/DynamoDB vs PostgreSQL on bare metal for predictable workloads (30-80x)
- Data Transfer Heavy: Video streaming/CDN workloads with high egress (10-100x when egress dominates)
Where Markup is Lower (2-5x):
- Managed Kubernetes (EKS) vs self-hosted k8s on VPS
- Basic compute instances vs VPS (especially with reserved instances)
- Blob storage vs cold storage arrays (when egress is minimal)
The Counterargument: What You're Actually Paying For
Cloud defenders argue the markup pays for:
- Elasticity: Scale from 10 to 10,000 servers in minutes
- Managed Services: RDS, Lambda, Kinesis—services that require expertise to build
- Global Footprint: 30+ regions, edge locations, compliance certifications
- Operational Burden: No hardware failures, no datacenter contracts, no capacity planning
- Innovation Velocity: New services quarterly, no infrastructure blocking product development
These are legitimate value propositions—for workloads that actually need them.
The problem? Most workloads don't need them, yet pay the 10-100x markup anyway.
The Case Studies: Who Left and What They Saved
37signals: The Poster Child of Cloud Exit
Timeline:
- 2022: DHH discovers $3.2M annual AWS bill, begins exit planning
- 2023: Migrates compute workloads, saves $1M in first year
- 2024: First "clean year" post-migration, saves $2M annually
- 2025: Completes S3 exit (18 petabytes), deletes AWS account entirely
Investment vs Savings:
- Hardware purchase: $700K (Dell systems)
- Annual savings: $2M/year
- Payback period: 4.2 months
- 5-year projection: $10M+ saved
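The payback figure follows directly from the two numbers above. A minimal sketch, assuming savings accrue evenly across the year and ignoring financing or depreciation; `payback_months` is a hypothetical helper name.

```python
def payback_months(upfront_cost: float, annual_savings: float) -> float:
    """Months until cumulative savings cover the upfront hardware spend."""
    if annual_savings <= 0:
        raise ValueError("annual_savings must be positive")
    monthly_savings = annual_savings / 12
    return round(upfront_cost / monthly_savings, 1)


# 37signals' figures as quoted above: $700K hardware, $2M/year saved
print(payback_months(700_000, 2_000_000))  # 4.2
```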
Team Impact:
- Zero team expansion required
- Same DevOps team manages infrastructure
- "No hidden dragons" of operational burden
Key Quote (DHH):
"We've been pleasantly surprised that savings have been even better than originally estimated. The team managing everything is still the same—there were no hidden workloads that required us to balloon the team."
What 37signals Got Right:
- Workload Fit: Predictable SaaS traffic patterns, not spiky consumer apps
- Minimal Managed Services: Limited AWS service lock-in beyond compute/storage
- Existing Expertise: Team already capable of infrastructure management
- Executive Commitment: DHH personally championed the migration
- Data-Driven: Measured costs rigorously before and after
💡 Key Takeaway
37signals' success hinged on predictable workloads, minimal managed service dependency, existing infrastructure expertise, and executive commitment. They paid back $700K hardware investment in 4.2 months and project $10M+ savings over 5 years.
Dropbox: The $75 Million Migration
Timeline:
- 2015: Begins "Infrastructure Optimization" project
- 2016: Moves 90% of user data off AWS to custom "Magic Pocket" infrastructure
- 2017-2018: Completes migration, goes public
Financial Impact:
- 2016 savings: $39.5M decrease in infrastructure costs
- 2017 savings: $35.1M decrease in infrastructure costs
- Total 2-year savings: $74.6M
- Hardware investment: $53M+
- Gross margin improvement: 33% to 67% (2015-2017)
Infrastructure Details:
- Three colocated datacenters: California, Virginia, Texas
- Custom-built "Magic Pocket" storage system
- Custom hardware and software design
- Retained 10% of workloads on AWS for flexibility
Critical Context: Dropbox is an outlier, not a template. Their entire business model is storage—they were competing with AWS's core service. Most companies don't have:
- Deep storage engineering expertise
- Scale to justify $53M+ hardware investment
- Business model centered on infrastructure efficiency
What Made Dropbox Special:
- Competing directly with AWS S3 (negative margin paying AWS)
- Sufficient scale (~500M users) to justify custom datacenter investment
- World-class infrastructure engineering team
- Storage workload perfectly suited to owned hardware
💡 Key Takeaway
Dropbox saved $74.6M over 2 years but invested $53M+ in custom infrastructure with world-class engineering teams. They're an outlier—most companies lack the scale, expertise, or business model alignment to replicate this success.
GEICO: The Repatriation Regret Story
The Cautionary Tale:
- Spent a decade migrating 600+ applications to public cloud
- Cloud costs increased 2.5x after migration
- Now repatriating workloads to private cloud (OpenStack/Kubernetes)
- Investing in on-premises infrastructure to optimize costs
What Went Wrong:
- Lift-and-Shift Mentality: Moved apps without redesign for cloud efficiency
- Lack of Cost Visibility: Didn't monitor per-app cloud costs during migration
- Over-reliance on Managed Services: Locked into expensive AWS-specific services
- No FinOps Practice: No cost optimization culture during migration phase
The Lesson: A bad cloud migration is worse than staying on-premises. GEICO's story isn't "cloud failed"—it's "unoptimized cloud migration failed."
The 86% Who Aren't Making Headlines
37signals and Dropbox dominate the headlines, but they are the exceptions. Of the 86% of CIOs who planned repatriation in 2025, most are not pursuing full exits:
Typical Repatriation Patterns:
- Selective Workload Optimization: Move predictable baseline workloads to bare metal, keep burst capacity in cloud
- Cold Storage Exit: Archive data to cheaper on-prem/colocation storage, keep hot data in cloud
- Dev/Test Environment Repatriation: Move non-production environments to Hetzner/DigitalOcean
- Database Repatriation: Self-host PostgreSQL/MySQL on bare metal, keep stateless compute in cloud
- Hybrid Cloud Architecture: Strategic placement of workloads based on economics, not ideology
Why Partial Repatriation Dominates:
- Lower risk than full migration
- Preserves cloud benefits for workloads that need them
- Incremental cost savings without operational upheaval
- Easier to justify to risk-averse executives
The Decision Framework: When Cloud Makes Sense vs When It Doesn't
The Core Question
The cloud vs bare metal debate isn't "which is better?"—it's "which workloads belong where?"
Platform engineering teams need frameworks to answer this question without ideology.
Decision Matrix: Cloud vs Bare Metal
| Factor | Stay in Cloud | Consider Bare Metal/VPS | Weight |
|---|---|---|---|
| Workload Elasticity | Traffic spikes 10x+ within hours | Predictable baseline with <2x variation | ⭐⭐⭐ Critical |
| Scale | <50 sustained servers | >100 sustained servers, stable 12+ months | ⭐⭐⭐ Critical |
| Managed Service Dependency | Heavy use of RDS, Lambda, managed k8s | Primarily compute/storage, self-managed services | ⭐⭐⭐ Critical |
| Geographic Distribution | Multi-region presence required | Single region or 2-3 strategic locations | ⭐⭐ High |
| Team Expertise | No infrastructure specialists on staff | 3+ engineers with datacenter/bare metal experience | ⭐⭐ High |
| Egress Requirements | Low data transfer (<1TB/month) | High bandwidth (>10TB/month outbound) | ⭐⭐ High |
| Compliance Needs | Require SOC2/HIPAA/FedRAMP certifications | Standard security, no specialized compliance | ⭐ Medium |
| Growth Stage | Pre-PMF, unpredictable growth trajectory | Post-PMF, predictable growth patterns | ⭐ Medium |
| Capital Availability | Cannot commit $100K+ upfront for hardware | Can invest 6-12 months OpEx upfront for CAPEX | ⭐ Medium |
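One way to make the matrix actionable is to turn it into a weighted score. The sketch below is an illustration only: mapping the star ratings to numeric weights of 3, 2, and 1 is an assumption, as are the factor keys and the `bare_metal_score` name. Each factor is answered with a simple yes/no on whether your situation matches the "Consider Bare Metal/VPS" column.

```python
# Assumed mapping: Critical = 3, High = 2, Medium = 1
WEIGHTS = {
    "workload_elasticity": 3,
    "scale": 3,
    "managed_service_dependency": 3,
    "geographic_distribution": 2,
    "team_expertise": 2,
    "egress_requirements": 2,
    "compliance_needs": 1,
    "growth_stage": 1,
    "capital_availability": 1,
}


def bare_metal_score(leans_bare_metal: dict[str, bool]) -> float:
    """Weighted share (0.0 to 1.0) of factors matching the bare metal column."""
    total = sum(WEIGHTS.values())
    favoring = sum(
        weight for name, weight in WEIGHTS.items()
        if leans_bare_metal.get(name, False)  # unanswered factors count as cloud
    )
    return round(favoring / total, 2)


# Hypothetical mature SaaS: every factor leans bare metal except compliance
answers = {name: True for name in WEIGHTS}
answers["compliance_needs"] = False
print(bare_metal_score(answers))  # 0.94
```

A score near 1.0 suggests a repatriation evaluation is worth the effort, while a score near 0.0 suggests staying put. Where to draw the line in between is a judgment call the matrix does not settle.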
The "Stay in Cloud" Profile
You should probably stay in AWS/GCP/Azure if:
- Startup, Pre-Product Market Fit
  - Unpredictable scaling needs
  - Team focused on product, not infrastructure
  - Runway concerns make OpEx flexibility critical
  - May pivot or fail; don't want hardware commitments
- True Burst Scaling Requirements
  - Black Friday-like traffic spikes (10-100x)
  - Event-driven workloads (launches, campaigns)
  - Geographic traffic shifting (follow-the-sun)
  - Need to scale from 50 to 5,000 instances in minutes
- Heavy Managed Service Users
  - Core business logic runs on Lambda, Step Functions
  - Rely on managed ML services (SageMaker, Vertex AI)
  - Use proprietary services (DynamoDB, Aurora Serverless)
  - Migration would require app rewrites (6-12 month distraction)
- Global, Compliance-Heavy Operations
  - Need presence in 10+ geographic regions
  - Require specific compliance certifications (FedRAMP, HIPAA)
  - Customers demand cloud-native architectures
  - Multi-cloud strategy for risk management
- Small Engineering Teams (<100 total engineers)
  - No dedicated infrastructure specialists
  - Platform team smaller than 3 FTE
  - Limited ops expertise in-house
  - Can't spare 10-20 hours/week for infrastructure management
Example Profile: E-commerce startup (Series A)
- 30 engineers, 20 in product
- Traffic varies 3-50x (normal vs sales events)
- Heavy AWS managed services (RDS, Lambda, S3)
- Global customer base (US, EU, APAC)
- Verdict: Stay in cloud (markup justified by elasticity needs)
The "Consider Repatriation" Profile
You should evaluate bare metal/VPS alternatives if:
- Predictable SaaS or Internal Tooling
  - Traffic patterns stable within 2x variation
  - 50-100+ sustained baseline servers
  - 12+ months of workload history showing consistency
  - Capacity planning is feasible
- High Bandwidth Requirements
  - Egress costs >$5K/month
  - Video streaming, file transfer, CDN origin
  - Data analytics with frequent large exports
  - Backup/DR with multi-TB daily transfers
- Minimal Managed Service Lock-in
  - Primarily compute and block storage users
  - Self-host databases (PostgreSQL, MySQL, Redis)
  - Kubernetes workloads (portable across infrastructure)
  - Use open-source tools (Prometheus, Grafana, etc.)
- Mature Engineering Organization
  - 100+ total engineers
  - Dedicated infrastructure/platform team (3+ FTE)
  - Existing ops expertise (Linux, networking, storage)
  - Culture of infrastructure ownership
- Cost Optimization Mandate
  - Cloud bill >$500K annually with 70%+ fixed workloads
  - Executive pressure to reduce infrastructure spend
  - Comfortable with 6-12 month migration timeline
  - Can commit $100K-$1M upfront for hardware/migration
Example Profile: Mature B2B SaaS (Series C+)
- 200 engineers, 12-person platform team
- 150 sustained EC2 instances, spikes to 180 in peak hours
- Self-hosted PostgreSQL, Kafka, Redis on EC2
- $1.2M annual AWS bill, mostly compute and egress
- Verdict: Strong repatriation candidate (save 40-60% = $480-720K/year)
The Hybrid Approach: Best of Both Worlds
Most sophisticated platform teams don't choose cloud XOR bare metal—they architect hybrid systems optimizing cost and capability.
Hybrid Architecture Pattern:
- Baseline Workload → Bare Metal/Colocation
  - Predictable compute: self-hosted k8s cluster on Hetzner bare metal
  - Databases: PostgreSQL on dedicated servers (high I/O performance)
  - Object storage: MinIO or Ceph on-prem for bulk storage
  - Cost: 40-60% lower than cloud equivalent
- Burst Capacity → Cloud
  - Auto-scaling application tier in AWS (scale 1.5-10x on demand)
  - Spot instances for batch processing
  - Lambda for event-driven workloads
  - Cost: Pay cloud premium only for actual burst usage
- Managed Services → Cloud (Selectively)
  - Keep high-value managed services (e.g., Route53, CloudFront CDN)
  - Use cloud for capabilities you can't replicate (e.g., SageMaker for ML)
  - Avoid managed services you can self-host efficiently (RDS → self-hosted PostgreSQL)
- DR/Backup → Cheap Cloud Storage
  - Glacier/Deep Archive for cold backups
  - Cross-region replication for critical data
  - Geographic diversity without operating multiple datacenters
Real-World Hybrid Example:
Company: Mid-size SaaS, 500 employees, 200 engineers
- Before (all AWS): $2.4M/year
- After (hybrid):
  - Bare metal (Hetzner): 100 servers baseline = $600K/year
  - AWS burst capacity: 20-50 instances on-demand = $400K/year
  - AWS managed services: Route53, CloudFront, S3 (hot data) = $300K/year
  - Total: $1.3M/year
- Savings: $1.1M annually (46% reduction)
- Team overhead: +10 hours/week infrastructure management
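The totals in this example are easy to check, and the same few lines work as a template for your own before/after comparison. A minimal sketch using only the figures quoted above; the dictionary keys are illustrative labels.

```python
before_all_aws = 2_400_000  # USD per year, from the example above

after_hybrid = {
    "hetzner_bare_metal_baseline": 600_000,
    "aws_burst_capacity": 400_000,
    "aws_managed_services": 300_000,
}

after_total = sum(after_hybrid.values())
savings = before_all_aws - after_total
savings_pct = round(100 * savings / before_all_aws)

print(after_total, savings, savings_pct)  # 1300000 1100000 46
```

Note that the extra 10 hours/week of infrastructure management is not priced in here. Adding a labor line to `after_hybrid` would give a more honest comparison.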
💡 Key Takeaway
The optimal architecture isn't cloud OR bare metal—it's strategically hybrid. Place predictable baseline workloads on owned infrastructure (40-60% savings), reserve cloud for burst capacity and high-value managed services. Most mature platforms save 30-50% with hybrid approaches.
The Hidden Costs Nobody Talks About
Cloud's Hidden Costs
The sticker shock of AWS bills is obvious. Less obvious: the hidden costs that inflate true cloud spending.
1. Egress Fees (The Bandwidth Tax)
- Often 20-40% of total bill for data-heavy workloads
- Difficult to predict or forecast accurately
- Creates lock-in (expensive to move data out)
- Impact: 10TB/month minus the 100GB free tier = 9,900GB × $0.09/GB = $891/month, or $10,692/year
2. Cross-AZ and Cross-Region Traffic
- $0.01/GB for data between availability zones
- $0.02/GB for cross-region transfers
- Chatty microservices architecture can generate massive internal traffic
- Impact: High-frequency trading firm spent $40K/month on internal AWS traffic
3. NAT Gateway and Network Costs
- NAT Gateway: $0.045/hour + $0.045/GB processed = ~$33/month + data charges
- Transit Gateway: $0.05/hour + $0.02/GB = ~$36/month + data charges per attachment
- Load balancers: ALB/NLB at ~$20-30/month + LCU charges
- Impact: Network infrastructure can cost $5-15K/month before any compute
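These line items share the same shape: a fixed hourly charge plus a per-GB processing charge. The sketch below applies the rates quoted above, assuming a 730-hour average month; `hourly_service_monthly` is a hypothetical helper, and real bills vary by region and by the number of gateways or attachments.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def hourly_service_monthly(hourly_rate: float, per_gb_rate: float,
                           gb_processed: float) -> float:
    """Monthly USD cost of a service billed per hour plus per GB processed."""
    fixed = hourly_rate * HOURS_PER_MONTH
    variable = per_gb_rate * gb_processed
    return round(fixed + variable, 2)


# One NAT Gateway at the quoted rates, idle vs pushing 10,000GB (10TB)
print(hourly_service_monthly(0.045, 0.045, 0))       # 32.85
print(hourly_service_monthly(0.045, 0.045, 10_000))  # 482.85

# One Transit Gateway attachment at the quoted rates, idle
print(hourly_service_monthly(0.05, 0.02, 0))         # 36.5
```

The fixed charge is small on its own. The per-GB processing fee, multiplied across gateways in every availability zone, is what pushes network costs into the thousands.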
4. Managed Service Lock-in Premium
- RDS costs 2-3x self-hosted PostgreSQL on EC2
- Aurora costs 4-5x self-hosted PostgreSQL on bare metal
- Elasticsearch Service costs 3-4x self-hosted on EC2
- Impact: $500/month self-hosted DB becomes $2,000/month RDS, $2,500/month Aurora
5. Idle Resource Waste
- FinOps Foundation: 15-30% of cloud spend is waste
- Orphaned EBS volumes, unused load balancers, forgotten test environments
- Dev/staging environments running 24/7 (used 40 hours/week)
- Impact: $2M cloud bill → $300-600K pure waste
6. Reserved Instance and Savings Plan Complexity
- Require accurate 1-3 year capacity forecasting
- Wrong guess = wasted prepayment or continued on-demand premium
- Management overhead: which instances to reserve, when to renew
- Impact: 10-20 hours/quarter reserved instance optimization
7. Multi-Account and Organizational Complexity
- Dozens or hundreds of AWS accounts for isolation
- Centralized billing, IAM complexity, cross-account access
- Security and compliance overhead
- Impact: 1-2 FTE dedicated to cloud account management at enterprise scale
Bare Metal's Hidden Costs
Bare metal advocates downplay the real operational burden of self-managed infrastructure.
1. Upfront Capital Expenditure
- Hardware purchase: $10-30K per server (enterprise-grade)
- Datacenter setup or colocation contracts
- Network equipment, switches, firewalls
- Impact: $700K (37signals), $53M+ (Dropbox) upfront investment
2. Infrastructure Team Overhead
- 10-20 hours/week minimum for infrastructure management
- At $60-75/hour senior DevOps rates: $2,500-$6,000/month
- Scales with infrastructure complexity
- Impact: $30-72K annually in hidden labor cost
3. Capacity Planning Risk
- Over-provision → wasted hardware investment
- Under-provision → emergency hardware procurement (weeks of lead time)
- Hardware refresh cycles (3-5 years)
- Impact: 20-30% over-provisioning typical to avoid capacity emergencies
4. Hardware Failure and Redundancy
- Servers fail: plan for 2-5% annual failure rate
- Need N+1 or N+2 redundancy for HA
- Spare parts inventory and RMA processes
- Impact: 15-25% additional hardware for redundancy
5. Datacenter and Power Costs
- Colocation: $1,000-$5,000/month per rack
- Power: $0.10-$0.30/kWh (80-150W per server × 24/7)
- Cooling: adds 30-50% to power costs
- Impact: 200 servers = $4-8K/month power + $10-30K/month colocation
6. Network Bandwidth and Transit Costs
- Datacenter bandwidth not always "free"
- Transit providers charge for high bandwidth (95th percentile billing)
- DDoS protection and network security
- Impact: 10Gbps commit = $2-5K/month bandwidth costs at colocation
7. Compliance and Security Overhead
- Physical security: datacenter access controls
- Compliance audits: SOC2, ISO27001 for self-managed infrastructure
- Security patching: OS, firmware, hardware vulnerabilities
- Impact: SOC2 audit: $50-150K annually, ongoing compliance overhead
8. Opportunity Cost and Team Distraction
- Engineering time spent on infrastructure ≠ time spent on product
- 6-12 month migration timeline with team focus shift
- Delayed feature development during migration
- Impact: Hard to quantify but potentially millions in delayed revenue
💡 Key Takeaway
Both cloud and bare metal have hidden costs. Cloud hides costs in egress fees, cross-AZ traffic, managed service premiums, and waste (15-30% of spend). Bare metal hides costs in upfront CAPEX, infrastructure team overhead ($30-72K/year), capacity planning risk, and opportunity cost of team distraction. Calculate total cost of ownership, not just sticker price.
The FinOps Response: Optimizing Cloud Before Exiting
Before committing to cloud repatriation, platform engineering teams should exhaust cloud cost optimization strategies. Many organizations discover 30-50% savings are possible without leaving AWS.
The FinOps Maturity Model
Crawl Phase (0-6 months):
- Establish cost visibility: tag all resources, enable Cost Explorer
- Identify waste: unused resources, orphaned volumes, idle instances
- Quick wins: rightsize obvious over-provisioned instances
- Typical savings: 15-20% of cloud spend
Walk Phase (6-18 months):
- Automate waste cleanup: scheduled shutdown of dev/staging environments
- Reserved Instance / Savings Plan strategy for predictable workloads
- Implement FinOps policies: budgets, alerts, approval workflows
- Typical savings: 25-35% of cloud spend
Run Phase (18+ months):
- FinOps as Code: policy-driven cost optimization in CI/CD
- Unit economics: cost per customer, per transaction, per feature
- Culture shift: engineers own cost as quality metric
- Typical savings: 35-50% of cloud spend
Top 10 Cloud Cost Optimization Strategies
1. Eliminate Waste (Quickest ROI)
- Action: Identify and terminate idle resources (unused instances, orphaned volumes, forgotten load balancers)
- Typical Savings: 10-15% of total cloud spend
- Tools: AWS Trusted Advisor, CloudHealth, Spot.io
- Example: $2M cloud bill → $200-300K saved by deleting waste
2. Rightsize Over-Provisioned Instances
- Action: Match instance types to actual CPU/memory utilization
- Typical Savings: 20-30% on compute spend
- Tools: AWS Compute Optimizer, CloudWatch metrics analysis
- Example: m5.4xlarge (16 vCPU) at 30% utilization → m5.2xlarge (8 vCPU) saves 50%
3. Reserved Instances and Savings Plans
- Action: Commit 1-3 years for predictable baseline workloads
- Typical Savings: 30-70% vs on-demand pricing
- Risk: Wrong forecast = wasted prepayment or continued on-demand premium
- Best Practice: Reserve 60-70% of baseline, keep 30-40% flexible on-demand
4. Spot Instances for Fault-Tolerant Workloads
- Action: Use spot instances (up to 90% off) for batch jobs, CI/CD, dev/test
- Typical Savings: 60-90% on applicable workloads
- Limitation: Can be terminated with 2-minute notice
- Use Cases: Data processing, ML training, rendering, test environments
5. Auto-Scaling and Scheduled Shutdown
- Action: Scale down non-production environments outside business hours
- Typical Savings: 50-70% on dev/staging costs (running 40hrs/week vs 168hrs/week)
- Tools: AWS Instance Scheduler, custom Lambda functions
- Example: 50 dev/test instances × $100/mo = $5K → $1.75K (65% savings)
6. Storage Lifecycle Policies
- Action: Auto-transition infrequent-access data to cheaper storage tiers
- Typical Savings: 40-90% on storage costs
- Strategy: S3 Standard → S3-IA (30 days) → Glacier (90 days) → Deep Archive (365 days)
- Example: 100TB in S3 Standard ($2,300/mo) → move 80TB to Glacier ($320/mo) and keep 20TB in Standard ($460/mo) = $780/mo, saving $1,520/mo
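The tiering strategy above maps onto an S3 lifecycle configuration. The sketch below only builds the configuration as a plain dictionary in the shape S3's lifecycle API expects. The rule ID and the empty prefix (meaning the whole bucket) are assumptions, and applying it would need an AWS SDK call such as boto3's `put_bucket_lifecycle_configuration`, which is not executed here.

```python
import json


def build_lifecycle_config(prefix: str = "") -> dict:
    """Lifecycle rules: Standard -> Standard-IA -> Glacier -> Deep Archive."""
    return {
        "Rules": [
            {
                "ID": "tier-down-cold-data",   # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # "" applies to every object
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


config = build_lifecycle_config()
print(json.dumps(config, indent=2))
```

Scoping the rule with a prefix (for example, a logs or backups path) is usually safer than applying it bucket-wide, since retrieval from the archive tiers is slow and billed separately.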
7. Egress Cost Reduction
- Action: Use CloudFront CDN (free egress to CloudFront), cache aggressively, compress data
- Typical Savings: 30-50% on data transfer costs
- Strategy: CloudFront ↔ S3 egress free, CloudFront → internet cheaper than S3 → internet
- Example: 10TB/month S3 egress ($900) → CloudFront ($600) + caching (3TB actual) = $180
8. Commitment to Architecture Optimization
- Action: Refactor chatty microservices, reduce cross-AZ traffic, optimize database queries
- Typical Savings: 20-40% on network and data transfer
- Investment: Requires engineering time, not just configuration
- Example: Colocate services in a single AZ (trade redundancy for cost where acceptable)
9. Managed Service Alternatives
- Action: Replace expensive managed services with self-hosted equivalents
- Typical Savings: 50-70% on database and service costs
- Trade-off: Operational burden increases
- Example: RDS PostgreSQL ($500/mo) → PostgreSQL on EC2 ($150/mo instance + management overhead)
10. FinOps Culture and Accountability
- Action: Make cost visibility real-time for engineers, assign budgets per team/product
- Typical Savings: 10-20% through behavior change
- Tools: CloudZero, Vantage, Kubecost (for Kubernetes)
- Strategy: Unit economics (cost per customer), cost as quality metric, showback/chargeback
💡 Key Takeaway
Before repatriating to bare metal, exhaust cloud optimization strategies. Most organizations achieve 30-50% cloud cost reduction through waste elimination, rightsizing, reserved instances, auto-scaling, and storage lifecycle policies—without the operational complexity of leaving cloud.
When Optimization Isn't Enough
Repatriation can still make sense after you've optimized cloud spend if:
- Post-optimization costs still 3-5x bare metal equivalent
  - After removing waste, rightsizing, and commitments, still paying massive premium
  - Example: $1.2M optimized AWS bill vs $400K bare metal equivalent
- Egress costs dominate and can't be reduced
  - High bandwidth workloads (video, large file transfers) with unavoidable egress
  - Example: $300K/year egress fees for data analytics platform (Hetzner = $0)
- Managed services provide minimal value
  - Using AWS primarily for compute and block storage
  - Self-hosting PostgreSQL, Redis, Kafka already (no managed service value)
  - Kubernetes portable across infrastructure
- Predictable workload eliminates elasticity value
  - 12+ months data shows <2x traffic variation
  - Capacity planning is feasible and accurate
  - Don't need cloud's burst scaling capabilities
- Team has infrastructure expertise and capacity
  - Dedicated platform team (3+ FTE) with datacenter experience
  - Bandwidth to manage 10-20 hours/week infrastructure overhead
  - Culture of infrastructure ownership
Decision Point: If you've optimized cloud spend by 30-50% and still meet criteria above, repatriation economics likely favor bare metal. If optimization closed the cost gap significantly or you rely on managed services, stay in cloud.
Platform Engineering Team Recommendations
For Startups (<100 Engineers, Pre-PMF)
Recommendation: Stay in cloud
Rationale:
- Unpredictable scaling needs
- Team should focus on product, not infrastructure
- Runway concerns make OpEx flexibility critical
- Cloud markup is "insurance premium" for flexibility
Cost Optimization Focus:
- Aggressive waste cleanup (unused resources)
- Schedule dev/staging environment shutdown (nights/weekends)
- Use spot instances for CI/CD, batch jobs
- Don't over-engineer: default to smallest instances that work
When to Revisit:
- Reach 50-100 sustained servers with predictable patterns
- 12+ months of stable workload data
- Cloud bill exceeds $500K annually
- Post-PMF with clear growth trajectory
For Growth Companies (100-500 Engineers, Series B-C)
Recommendation: Evaluate hybrid architecture
Rationale:
- Sufficient scale to justify infrastructure investment
- Likely have predictable baseline workload
- Platform team exists or can be built
- Cost savings materially impact burn rate
Evaluation Checklist:
- Cloud bill >$500K annually with 70%+ predictable workload
- Minimal managed service dependency (or can self-host equivalents)
- 3+ engineers with infrastructure expertise
- Executive support for 6-12 month migration
- Can commit $100K-$500K upfront for hardware/migration
Recommended Approach:
- Phase 1 (Months 1-3): Optimize cloud spend (target 30% reduction)
- Phase 2 (Months 4-6): Migrate dev/test environments to Hetzner/DigitalOcean (low risk)
- Phase 3 (Months 7-12): Migrate baseline production workloads to bare metal/colocation
- Phase 4 (Ongoing): Hybrid architecture—bare metal baseline, cloud burst capacity
Expected Savings: 30-40% total infrastructure cost
For Enterprises (500+ Engineers, Series D+)
Recommendation: Strategic hybrid with selective repatriation
Rationale:
- Massive scale justifies infrastructure investment
- Likely already have infrastructure specialists
- Cost optimization is board-level priority
- Risk tolerance for multi-datacenter operations
Strategic Framework:
Workload Classification:
- Class 1: Repatriate to Bare Metal
  - Predictable baseline compute (web servers, API servers)
  - Self-hosted databases (PostgreSQL, MySQL, Redis, Kafka)
  - Batch processing and analytics
  - Target: 40-60% of compute workload
- Class 2: Keep in Cloud
  - Burst capacity for traffic spikes
  - Global multi-region presence
  - Managed services providing high value (e.g., SageMaker, Kinesis)
  - Target: 20-30% of compute workload
- Class 3: Move to Cheap Cloud (DigitalOcean, Hetzner)
  - Dev, staging, QA environments
  - CI/CD infrastructure
  - Internal tooling
  - Target: 20-30% of compute workload
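The three classes above can be expressed as a small rule-based classifier. This is a sketch under stated assumptions: the `Workload` fields are hypothetical, internal tooling is treated as non-production, and the rule order (non-production first, then cloud-specific needs, then bare metal by default) is one reasonable reading of the framework rather than something the framework specifies.

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    BARE_METAL = "Class 1: repatriate to bare metal"
    CLOUD = "Class 2: keep in cloud"
    CHEAP_CLOUD = "Class 3: move to cheap cloud"


@dataclass
class Workload:
    """Hypothetical description of a workload; fields are illustrative."""
    name: str
    is_production: bool                    # False for dev/staging/QA, CI/CD, internal tools
    needs_burst_or_multi_region: bool
    uses_high_value_managed_service: bool  # e.g. SageMaker, Kinesis


def classify(w: Workload) -> Placement:
    # Assumed precedence: non-production workloads go to cheap cloud first.
    if not w.is_production:
        return Placement.CHEAP_CLOUD
    # Production workloads with cloud-specific needs stay in cloud.
    if w.needs_burst_or_multi_region or w.uses_high_value_managed_service:
        return Placement.CLOUD
    # Everything else is treated as predictable baseline.
    return Placement.BARE_METAL


for w in [
    Workload("api-servers", True, False, False),
    Workload("ml-training", True, False, True),
    Workload("ci-runners", False, False, False),
]:
    print(f"{w.name}: {classify(w).value}")
```

Running an inventory through a classifier like this gives a first-pass split that can be compared against the 40-60% / 20-30% / 20-30% targets.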
Expected Savings: 40-60% total infrastructure cost
Investment Required:
- Hardware: $1-5M upfront (depends on scale)
- Migration team: 4-8 engineers for 12-18 months
- Platform team expansion: +2-4 FTE for ongoing management
Break-Even Timeline: 12-24 months
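The break-even timeline follows from simple arithmetic: upfront investment divided by monthly savings. The sketch below assumes hypothetical figures ($3M upfront, $500K/month cloud spend, $300K/month bare metal run cost including the added platform headcount); none of these numbers come from a real company.

```python
import math
from typing import Optional


def break_even_months(
    upfront_usd: float,
    monthly_cloud_usd: float,
    monthly_bare_metal_usd: float,
) -> Optional[int]:
    """Months until cumulative savings cover the upfront investment.

    Returns None when bare metal is not cheaper to run than cloud.
    """
    monthly_savings = monthly_cloud_usd - monthly_bare_metal_usd
    if monthly_savings <= 0:
        return None
    return math.ceil(upfront_usd / monthly_savings)


# Hypothetical enterprise: $3M upfront, $500K/month cloud vs $300K/month bare metal.
print(break_even_months(3_000_000, 500_000, 300_000))  # prints 15
```

With these inputs the investment pays back in 15 months, inside the 12-24 month range above. The result is sensitive to the bare metal run cost, so staffing and colocation fees should be included in that figure.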
The "Never Repatriate" Scenarios
Don't repatriate if:
- Heavy AWS Managed Service Lock-In
  - Core business logic in Lambda, Step Functions, proprietary services
  - Migration requires 6-12+ month app rewrites
  - Alternative: Optimize managed service usage, negotiate enterprise discounts
- True Burst Scaling Requirements
  - E-commerce with Black Friday-like spikes (10-100x traffic)
  - News/media sites with viral traffic unpredictability
  - Alternative: Hybrid with bare metal baseline + cloud burst capacity
- Global Multi-Region Compliance
  - Must operate in 10+ geographic regions
  - Compliance certifications require cloud infrastructure
  - Alternative: Negotiate volume discounts, optimize within cloud
- Small Team Without Infrastructure Expertise
  - No dedicated platform team (<3 FTE)
  - No infrastructure specialists on staff
  - Alternative: Aggressive cloud cost optimization, consider managed Kubernetes
- Fast-Growing, Unpredictable Scaling
  - 2-5x YoY growth with unpredictable patterns
  - Risk of under-provisioning bare metal capacity
  - Alternative: Stay in cloud until growth stabilizes, then revisit
The 2025-2027 Outlook: Regulatory and Market Forces
Regulatory Pressure on Egress Fees
EU Data Act (Effective September 2025):
- Targets "unfair contractual terms" in cloud contracts
- Aims to reduce switching barriers between cloud providers
- Limits switching charges (including egress fees for data moved when changing providers) to cost, with a full ban from January 12, 2027
- Impact: AWS, Azure, GCP must eliminate or reduce egress charges in EU
AWS Response (2021-2024):
- Increased free egress tier from 1GB to 100GB/month (effective December 2021)
- Since March 2024, waives egress fees for customers migrating off AWS (credits granted on approval, with a 60-day window to complete the transfer)
- Interpretation: Regulatory pressure is working, but changes have been incremental so far
What This Means for Platform Teams:
- Egress fee reduction likely continues through 2027
- EU-based operations may see significant egress savings
- Cloud lock-in concerns diminishing (easier to migrate data out)
- Strategy: Monitor regulatory developments, plan migrations for post-egress-fee era
Market Dynamics: Cloud Growth Despite Repatriation
The Paradox: 86% of CIOs plan repatriation, yet cloud spending is forecast to grow 21.5% in 2025.
Explanation:
- Selective Repatriation: 21% of workloads repatriated, but 30%+ new workloads to cloud
- Net Cloud Growth: Migration to cloud outpaces repatriation for most enterprises
- Hybrid Strategies: Companies optimize workload placement, not binary cloud exit
Gartner Forecast:
- 2024 cloud spending: $595.7 billion
- 2025 cloud spending: $723.4 billion (21.5% growth)
- Cloud remains dominant despite repatriation trend
What This Tells Us:
- Repatriation is workload optimization, not wholesale cloud rejection
- Cloud will remain dominant for appropriate use cases
- Sophisticated teams optimize placement rather than picking ideological sides
The Rise of "Bare Metal Cloud" and Hybrid Solutions
New Market Entrants:
- Hetzner: Traditional bare metal provider, now offering cloud flexibility
- Latitude.sh: Bare Metal as a Service (BMaaS) with Terraform/API provisioning
- OpenMetal: On-demand private clouds with cloud-like provisioning
- Vultr, Linode (Akamai): VPS providers offering bare metal options
What's Changing:
- Bare metal now has cloud-like provisioning (Terraform, APIs, automation)
- "Physical servers as easily as VMs" (platform engineering integration)
- Hybrid architectures become operationally feasible
Platform Engineering Impact:
- IaC (Terraform, Pulumi) works across cloud and bare metal
- Kubernetes portability enables seamless hybrid deployments
- CI/CD pipelines provision bare metal as easily as AWS EC2
The Future: Workload placement becomes a continuous optimization problem, not a one-time migration decision.
Conclusion: Optimize for Economics, Not Ideology
The cloud repatriation debate is polarizing, but the data is clear:
The Truths:
- ✅ AWS charges 7-18x markup on compute vs bare metal (verifiable)
- ✅ Egress fees and other hidden charges can push total cost 10-100x above bare metal for data-intensive workloads (real at scale)
- ✅ 86% of CIOs plan some repatriation (highest on record, up from 43% in 2020)
- ✅ Real companies save millions (37signals: $2M/year, Dropbox: $74.6M over 2 years)
- ✅ Most repatriation is selective, not wholesale (21% of workloads, not 100%)
- ✅ Cloud spending is still forecast to grow 21.5% in 2025 despite the repatriation trend
- ✅ Bare metal has hidden costs: CAPEX, team overhead, capacity planning risk
- ✅ Cloud has hidden costs: egress, cross-AZ traffic, managed service premiums, 15-30% waste
The Framework:
Stay in Cloud If:
- Startup pre-PMF with unpredictable growth
- True burst scaling requirements (10x+ spikes)
- Heavy managed service dependency
- Small team (<100 engineers, <3 infrastructure FTE)
- Global multi-region compliance requirements
Consider Repatriation If:
- Predictable workload (>50 servers sustained 12+ months)
- High egress costs (>$5K/month)
- Minimal managed service lock-in
- Mature engineering org (100+ engineers, 3+ infrastructure FTE)
- Post-optimization cloud bill still 3-5x bare metal equivalent
Optimal Strategy for Most: Hybrid architecture
- Bare metal for predictable baseline workloads (40-60% cost savings)
- Cloud for burst capacity and high-value managed services
- Cheap VPS (Hetzner, DigitalOcean) for dev/test environments
- Result: 30-50% total infrastructure cost reduction without sacrificing flexibility
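To see how per-segment savings roll up into a total, the blended figure can be computed as a spend-weighted average. This is a sketch with hypothetical inputs: the spend shares (60% baseline, 25% burst and managed services, 15% dev/test) and the 60% dev/test savings rate are assumptions chosen for illustration, while the 50% baseline rate is the midpoint of the 40-60% range above.

```python
def blended_savings(segments: list[tuple[float, float]]) -> float:
    """Spend-weighted savings across (share_of_spend, savings_rate) segments."""
    total_share = sum(share for share, _ in segments)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("spend shares must sum to 1.0")
    return sum(share * rate for share, rate in segments)


# Hypothetical split of current cloud spend (assumed, not from the article):
segments = [
    (0.60, 0.50),  # predictable baseline moved to bare metal
    (0.25, 0.00),  # burst capacity and managed services stay in cloud
    (0.15, 0.60),  # dev/test moved to a cheaper provider (assumed rate)
]
print(f"Total reduction: {blended_savings(segments):.0%}")  # prints "Total reduction: 39%"
```

Under these assumptions the total lands near the middle of the 30-50% range. The share of spend that is truly predictable baseline is the main driver of the result.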
📚 Learning Resources
📖 Essential Cost Comparison Data
- AWS vs DigitalOcean vs Hetzner: 2025 Cost Comparison - Comprehensive pricing breakdown with performance benchmarks
- AWS Egress Costs in 2025: How to Reduce Them - Deep dive on egress pricing and optimization strategies
- Cloud vs. Bare Metal: A Comprehensive Cost Analysis - ROI framework for infrastructure decisions
📝 Real-World Case Studies
- 37signals Cloud Repatriation Journey - $2M annual savings, complete AWS exit
- Dropbox Infrastructure Optimization Analysis - $75M saved over 2 years
- Cloud Repatriation Statistics 2025 - 86% of CIOs planning repatriation, industry trends
🎥 Platform Engineering Perspectives
- Why Platform Engineering Needs Bare Metal in 2025 - BMaaS and hybrid architecture strategies
- The Cost of Cloud: A Trillion Dollar Paradox - Andreessen Horowitz analysis on cloud economics
📚 FinOps and Cost Optimization
- FinOps Best Practices for 2025 - 8 strategies for cloud cost management
- Unlocking Cloud Cost Optimization: A Guide to Cloud FinOps - Google Cloud's FinOps framework
- Top 15 Cloud Cost Optimization Strategies - Comprehensive tactics and tools
- FinOps as Code: Managing Cloud Costs - McKinsey on automated cost optimization
🛠️ Cost Calculation and Decision Tools
- AWS Pricing Calculator - Official AWS cost estimation
- Hetzner Price Calculator - Compare VPS and bare metal pricing
- Cloud Migration Cost Calculator - Economic framework for migration decisions
🌐 Community and Discussion
- Hacker News: Cloud Repatriation Discussions - 37signals migration discussion with practitioner insights
- Cloud Repatriation: The New Stack Analysis - Industry perspectives on the trend
- FinOps Foundation Community - Cloud financial management best practices and community
📡 Stay Updated
Cloud Provider Pricing: AWS Pricing • GCP Pricing • Azure Pricing
Alternative Providers: Hetzner • DigitalOcean • Vultr • Linode
Industry Analysis: FinOps Foundation Blog • The Register Cloud Coverage • The New Stack Infrastructure
Cost Optimization Tools: nOps • CloudZero • Vantage • Ternary