The Cloud Repatriation Debate: When AWS Costs 10-100x More Than It Should [2025 Platform Engineering Guide]

· 30 min read

Quick Answer (TL;DR)

Problem: AWS and the major cloud providers charge 7-18x markups on compute, and hidden egress fees plus unpredictable billing push total cost of ownership to 10-100x bare metal alternatives for many workloads.

Movement: 86% of CIOs planned cloud repatriation in 2025 (up from 43% in 2020), but only 8-9% pursue full exit—most do selective workload optimization.

Real Savings:

  • 37signals: $2M/year saved, $10M projected over 5 years
  • Dropbox: $74.6M saved over 2 years moving 90% of data
  • Typical enterprises: 15-30% infrastructure cost reduction

Key Trade-offs:

  • Bare Metal Wins: Predictable workloads at scale (>50 servers, >12 months stable), high bandwidth needs, limited managed service usage
  • Cloud Wins: True burst scaling requirements, pre-PMF startups, global multi-region presence, heavy managed service reliance, compliance constraints

Timeline: Break-even at 50-100 sustained servers or 12-24 months of stable usage patterns.

Decision Framework: Evaluate actual elasticity needs, egress costs, managed service dependency, team expertise, and total cost of ownership—not just sticker price.

Full analysis below ↓

🎙️ Listen to the podcast episode: The Cloud Repatriation Debate - Jordan and Alex discuss real companies saving millions by leaving the cloud, expose hidden costs like egress fees and NAT gateways, and debate when cloud makes sense versus when it's "highway robbery."

🎥 Watch on YouTube


The Repatriation Wave Reaches Critical Mass

David Heinemeier Hansson (DHH) from 37signals stood before a $3.2 million annual AWS bill in 2022 and asked the question that would spark a movement: "Why are we paying this much?"

By 2024, 37signals had moved its compute workloads off AWS (the S3 exit followed in 2025). Their new bill: $1.3 million annually—saving nearly $2 million per year, with projections exceeding $10 million in savings over five years.

They're not alone. 86% of CIOs planned to repatriate at least some cloud workloads in 2025—the highest on record in Barclays' CIO Survey, nearly double the 43% who said the same in late 2020.

But here's what the headlines don't tell you: this isn't a wholesale rejection of cloud computing. Only 8-9% of companies plan full workload repatriation. The real story is nuanced, complex, and far more interesting than "cloud bad, bare metal good."

Let's examine the economics, break down the 10-100x markup claim with real data, and build decision frameworks platform engineering teams can actually use.


Key Statistics (2024-2025 Data)

| Statistic | Value | Source | Context |
|---|---|---|---|
| CIOs planning repatriation | 86% in 2025 (vs 43% in 2020) | Barclays CIO Survey 2024 | Highest on record, but most selective, not wholesale |
| Full workload repatriation | Only 8-9% of companies | IDC 2024 | Most do selective optimization, not complete exit |
| Actual workloads repatriated | 21% of workloads and data | IDC Server and Storage Survey | Net cloud growth continues despite exits |
| AWS compute markup | 7-18x vs bare metal | Hetzner vs AWS pricing comparison | 80-core server: €190/mo vs $2,500-$3,500/mo |
| 37signals savings | $2M annually, $10M+ over 5 years | 37signals public disclosure | $700K hardware investment, existing team |
| Dropbox savings | $74.6M over 2 years | Dropbox S-1 filing 2018 | Moved 90% of data, custom infrastructure |
| AWS egress pricing | $0.09/GB after 100GB free | AWS pricing 2025 | Free tier increased from 1GB but still a major cost |
| Hetzner included traffic | 20TB free with entry plans | Hetzner pricing | No egress charges vs AWS $900 for 10TB transfer |
| Cloud waste | 15-30% of cloud spend | FinOps Foundation 2024 | Idle resources, oversized instances, unused services |
| Break-even point | 50-100 sustained servers | Industry analysis | Or 12-24 months of stable workload patterns |
| Infrastructure team cost | $2,500-$6,000/month overhead | 10-20 hrs/week at $60-75/hr | Hidden cost of bare metal management |
| Typical ROI threshold | 40-60% cost reduction at scale | Multiple case studies | Above 200 servers with predictable usage |

The 10-100x Markup Reality

The Bold Claim

Víctor Martínez's viral article made a provocative assertion: cloud providers charge 10-100x what infrastructure actually costs. Is this hyperbole or documented reality?

Let's break down the numbers with specific examples.

Real Cost Comparison: AWS vs Hetzner (Bare Metal)

Equivalent 80-Core Configuration:

| Provider | Specs | Monthly Cost | Markup vs Bare Metal |
|---|---|---|---|
| Hetzner Bare Metal | 80 cores, dedicated hardware | €190 (~$207) | Baseline (1x) |
| AWS EC2 On-Demand | Comparable C5/C6 instances | $2,500-$3,500 | 13-18x |
| AWS Reserved (3-year) | Same instances, 3-year commit | ~$1,300/mo + $46K upfront | 7x (plus lock-in) |

Reality Check: The 7-18x markup claim is verifiable and conservative for compute-intensive workloads.

💡 Key Takeaway

AWS compute costs 7-18x more than bare metal for equivalent resources. Reserved instances reduce this to 7x but require $46K upfront and 3-year commitment. The markup is real, not marketing hype.

VPS Middle Ground: AWS vs DigitalOcean vs Hetzner

For teams not ready for bare metal, VPS providers offer middle ground:

| Provider | 8-core, 32GB RAM Config | Monthly Cost | Bandwidth Included | Egress Overage |
|---|---|---|---|---|
| Hetzner VPS | 8 vCPU, 32GB RAM, 160GB SSD | €50 (~$55) | 20TB | Free beyond 20TB |
| DigitalOcean | 8 vCPU, 32GB RAM, 200GB SSD | ~$120 | Shared pool | $0.01/GB |
| AWS EC2 | m5.2xlarge (8 vCPU, 32GB) | ~$280/mo | 100GB | $0.09/GB after 100GB |

Markup Analysis:

  • AWS costs 5x more than Hetzner for equivalent VPS
  • DigitalOcean sits at 2.2x Hetzner pricing
  • Bandwidth costs can double or triple total bills for data-heavy workloads

The Egress Fee Multiplier

Compute markup is only half the story. Egress fees can turn a manageable cloud bill into a financial nightmare.

Scenario: 10TB Monthly Data Transfer

| Provider | Egress Cost Formula | Monthly Egress Cost | Annual Impact |
|---|---|---|---|
| AWS | 100GB free + 9,900GB × $0.09 | $891/month | $10,692/year |
| DigitalOcean | Shared pool + $0.01/GB overage | ~$100/month | $1,200/year |
| Hetzner | 20TB included free | $0 | $0 |

For a workload transferring 10TB/month, AWS charges $10,692 annually just for bandwidth that Hetzner includes free.
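
The arithmetic behind that table is simple enough to script and reuse with your own transfer volumes. A minimal sketch in Python, using the list prices cited in this article (actual rates vary by region and negotiated discount):

```python
# Egress cost sanity check using the list prices cited above.
# Rates vary by region and contract; treat this as a rough estimator.

def aws_egress_monthly(gb_out: float, free_gb: float = 100, rate: float = 0.09) -> float:
    """AWS internet egress: first 100GB/month free, then $0.09/GB."""
    return max(gb_out - free_gb, 0) * rate

def hetzner_egress_monthly(gb_out: float, included_gb: float = 20_000) -> float:
    """Hetzner entry plans include 20TB; overage beyond that is not modeled here."""
    return 0.0 if gb_out <= included_gb else float("nan")

monthly_gb = 10_000  # 10TB/month
aws = aws_egress_monthly(monthly_gb)
print(f"AWS:     ${aws:,.0f}/month, ${aws * 12:,.0f}/year")          # $891/month, $10,692/year
print(f"Hetzner: ${hetzner_egress_monthly(monthly_gb):,.0f}/month")  # $0/month
```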

💡 Key Takeaway

Egress fees are the "hidden tax" that pushes cloud costs from 7-18x markup to 10-100x total cost of ownership for data-intensive workloads. A single 10TB/month transfer costs $10,692/year on AWS versus $0 on Hetzner.

Storage: The S3 Lock-In Tax

Scenario: 18 Petabytes of Storage (37signals' use case)

| Provider | Storage Model | Annual Cost | Egress Penalty |
|---|---|---|---|
| AWS S3 | $0.023/GB storage + egress | ~$1.5M/year | +$0.09/GB out |
| Pure Storage (on-prem) | Upfront hardware + maintenance | Hardware CAPEX | No egress fees |

37signals' S3 bill alone: $1.5 million annually. AWS waived $250K in egress fees for their migration—a telling admission that exit costs are punitive.

The 100x Cases: Where It Actually Happens

The 100x markup isn't typical, but it does occur in specific scenarios:

100x Markup Scenarios:

  1. Managed Service Abuse: AWS Elasticsearch vs self-hosted on bare metal (markup 40-80x)
  2. Function-as-a-Service Overuse: Lambda costs vs containerized equivalents on owned hardware (50-100x at scale)
  3. Serverless Databases: Aurora/DynamoDB vs PostgreSQL on bare metal for predictable workloads (30-80x)
  4. Data Transfer Heavy: Video streaming/CDN workloads with high egress (10-100x when egress dominates)

Where Markup is Lower (2-5x):

  • Managed Kubernetes (EKS) vs self-hosted k8s on VPS
  • Basic compute instances vs VPS (especially with reserved instances)
  • Blob storage vs cold storage arrays (when egress is minimal)

The Counterargument: What You're Actually Paying For

Cloud defenders argue the markup pays for:

  1. Elasticity: Scale from 10 to 10,000 servers in minutes
  2. Managed Services: RDS, Lambda, Kinesis—services that require expertise to build
  3. Global Footprint: 30+ regions, edge locations, compliance certifications
  4. Operational Burden: No hardware failures, no datacenter contracts, no capacity planning
  5. Innovation Velocity: New services quarterly, no infrastructure blocking product development

These are legitimate value propositions—for workloads that actually need them.

The problem? Most workloads don't need them, yet pay the 10-100x markup anyway.


The Case Studies: Who Left and What They Saved

37signals: The Poster Child of Cloud Exit

Timeline:

  • 2022: DHH discovers $3.2M annual AWS bill, begins exit planning
  • 2023: Migrates compute workloads, saves $1M in first year
  • 2024: First "clean year" post-migration, saves $2M annually
  • 2025: Completes S3 exit (18 petabytes), deletes AWS account entirely

Investment vs Savings:

  • Hardware purchase: $700K (Dell systems)
  • Annual savings: $2M/year
  • Payback period: 4.2 months
  • 5-year projection: $10M+ saved
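
Those figures are easy to verify from 37signals' publicly disclosed numbers. A quick check:

```python
# Quick check of the publicly disclosed 37signals figures cited above.
hardware_investment = 700_000   # one-time Dell hardware purchase
annual_savings = 2_000_000      # reported savings per "clean year" post-migration

payback_months = hardware_investment / annual_savings * 12
five_year_gross = annual_savings * 5

print(f"Payback period: {payback_months:.1f} months")        # 4.2 months
print(f"5-year gross savings: ${five_year_gross:,.0f}")      # $10,000,000
print(f"5-year net of hardware: ${five_year_gross - hardware_investment:,.0f}")  # $9,300,000
```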

Team Impact:

  • Zero team expansion required
  • Same DevOps team manages infrastructure
  • "No hidden dragons" of operational burden

Key Quote (DHH):

"We've been pleasantly surprised that savings have been even better than originally estimated. The team managing everything is still the same—there were no hidden workloads that required us to balloon the team."

What 37signals Got Right:

  1. Workload Fit: Predictable SaaS traffic patterns, not spiky consumer apps
  2. Minimal Managed Services: Limited AWS service lock-in beyond compute/storage
  3. Existing Expertise: Team already capable of infrastructure management
  4. Executive Commitment: DHH personally championed the migration
  5. Data-Driven: Measured costs rigorously before and after

💡 Key Takeaway

37signals' success hinged on predictable workloads, minimal managed service dependency, existing infrastructure expertise, and executive commitment. They paid back $700K hardware investment in 4.2 months and project $10M+ savings over 5 years.

Dropbox: The $75 Million Migration

Timeline:

  • 2015: Begins "Infrastructure Optimization" project
  • 2016: Moves 90% of user data off AWS to custom "Magic Pocket" infrastructure
  • 2017-2018: Completes migration, goes public

Financial Impact:

  • 2016 savings: $39.5M decrease in infrastructure costs
  • 2017 savings: $35.1M decrease in infrastructure costs
  • Total 2-year savings: $74.6M
  • Hardware investment: $53M+
  • Gross margin improvement: 33% to 67% (2015-2017)

Infrastructure Details:

  • Three colocated datacenters: California, Virginia, Texas
  • Custom-built "Magic Pocket" storage system
  • Custom hardware and software design
  • Retained 10% of workloads on AWS for flexibility

Critical Context: Dropbox is an outlier, not a template. Their entire business model is storage—they were competing with AWS's core service. Most companies don't have:

  • Deep storage engineering expertise
  • Scale to justify $53M+ hardware investment
  • Business model centered on infrastructure efficiency

What Made Dropbox Special:

  • Competing directly with AWS S3 (negative margin paying AWS)
  • Sufficient scale (~500M users) to justify custom datacenter investment
  • World-class infrastructure engineering team
  • Storage workload perfectly suited to owned hardware

💡 Key Takeaway

Dropbox saved $74.6M over 2 years but invested $53M+ in custom infrastructure with world-class engineering teams. They're an outlier—most companies lack the scale, expertise, or business model alignment to replicate this success.

GEICO: The Cloud Migration Regret Story

The Cautionary Tale:

  • Spent a decade migrating 600+ applications to public cloud
  • Cloud costs increased 2.5x after migration
  • Now repatriating workloads to private cloud (OpenStack/Kubernetes)
  • Investing in on-premises infrastructure to optimize costs

What Went Wrong:

  1. Lift-and-Shift Mentality: Moved apps without redesign for cloud efficiency
  2. Lack of Cost Visibility: Didn't monitor per-app cloud costs during migration
  3. Over-reliance on Managed Services: Locked into expensive AWS-specific services
  4. No FinOps Practice: No cost optimization culture during migration phase

The Lesson: A bad cloud migration is worse than staying on-premises. GEICO's story isn't "cloud failed"—it's "unoptimized cloud migration failed."

The 86% Who Aren't Making Headlines

While 37signals and Dropbox dominate the headlines, the 86% of CIOs planning repatriation in 2025 mostly aren't pursuing full exits:

Typical Repatriation Patterns:

  1. Selective Workload Optimization: Move predictable baseline workloads to bare metal, keep burst capacity in cloud
  2. Cold Storage Exit: Archive data to cheaper on-prem/colocation storage, keep hot data in cloud
  3. Dev/Test Environment Repatriation: Move non-production environments to Hetzner/DigitalOcean
  4. Database Repatriation: Self-host PostgreSQL/MySQL on bare metal, keep stateless compute in cloud
  5. Hybrid Cloud Architecture: Strategic placement of workloads based on economics, not ideology

Why Partial Repatriation Dominates:

  • Lower risk than full migration
  • Preserves cloud benefits for workloads that need them
  • Incremental cost savings without operational upheaval
  • Easier to justify to risk-averse executives

The Decision Framework: When Cloud Makes Sense vs When It Doesn't

The Core Question

The cloud vs bare metal debate isn't "which is better?"—it's "which workloads belong where?"

Platform engineering teams need frameworks to answer this question without ideology.

Decision Matrix: Cloud vs Bare Metal

| Factor | Stay in Cloud | Consider Bare Metal/VPS | Weight |
|---|---|---|---|
| Workload Elasticity | Traffic spikes 10x+ within hours | Predictable baseline with <2x variation | ⭐⭐⭐ Critical |
| Scale | <50 sustained servers | >100 sustained servers, stable 12+ months | ⭐⭐⭐ Critical |
| Managed Service Dependency | Heavy use of RDS, Lambda, managed k8s | Primarily compute/storage, self-managed services | ⭐⭐⭐ Critical |
| Geographic Distribution | Multi-region presence required | Single region or 2-3 strategic locations | ⭐⭐ High |
| Team Expertise | No infrastructure specialists on staff | 3+ engineers with datacenter/bare metal experience | ⭐⭐ High |
| Egress Requirements | Low data transfer (<1TB/month) | High bandwidth (>10TB/month outbound) | ⭐⭐ High |
| Compliance Needs | Require SOC2/HIPAA/FedRAMP certifications | Standard security, no specialized compliance | ⭐ Medium |
| Growth Stage | Pre-PMF, unpredictable growth trajectory | Post-PMF, predictable growth patterns | ⭐ Medium |
| Capital Availability | Cannot commit $100K+ upfront for hardware | Can invest 6-12 months OpEx upfront for CAPEX | ⭐ Medium |
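
One way to make this matrix operational is to turn it into a weighted score your platform team can rerun as conditions change. A minimal sketch below; the weights mirror the table (Critical = 3, High = 2, Medium = 1), while the scoring threshold is an illustrative assumption, not an industry standard:

```python
# Weighted scoring sketch for the cloud-vs-bare-metal decision matrix above.
# Each factor contributes +weight if it favors bare metal, -weight if it favors cloud.
# Weights follow the table (Critical=3, High=2, Medium=1); the threshold below is
# an illustrative assumption, not a published benchmark.

WEIGHTS = {
    "workload_elasticity": 3, "scale": 3, "managed_service_dependency": 3,
    "geographic_distribution": 2, "team_expertise": 2, "egress_requirements": 2,
    "compliance_needs": 1, "growth_stage": 1, "capital_availability": 1,
}

def repatriation_score(favors_bare_metal: dict) -> int:
    """Sum +weight for factors favoring bare metal, -weight for those favoring cloud."""
    return sum(w if favors_bare_metal[factor] else -w for factor, w in WEIGHTS.items())

# Example: roughly the mature B2B SaaS profile described later in this section.
profile = {
    "workload_elasticity": True,         # <2x traffic variation
    "scale": True,                       # 150 sustained servers, 12+ months stable
    "managed_service_dependency": True,  # self-hosted PostgreSQL, Kafka, Redis
    "geographic_distribution": True,     # single primary region
    "team_expertise": True,              # 12-person platform team
    "egress_requirements": True,         # egress-heavy AWS bill
    "compliance_needs": False,           # SOC2 required
    "growth_stage": True,                # post-PMF, predictable growth
    "capital_availability": True,        # can fund hardware upfront
}

score = repatriation_score(profile)
print(score, "→ evaluate repatriation" if score >= 6 else "→ stay in cloud / go hybrid")  # 16
```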

The "Stay in Cloud" Profile

You should probably stay in AWS/GCP/Azure if:

  1. Startup, Pre-Product Market Fit

    • Unpredictable scaling needs
    • Team focused on product, not infrastructure
    • Runway concerns make OpEx flexibility critical
    • May pivot or fail—don't want hardware commitments
  2. True Burst Scaling Requirements

    • Black Friday-like traffic spikes (10-100x)
    • Event-driven workloads (launches, campaigns)
    • Geographic traffic shifting (follow-the-sun)
    • Need to scale from 50 to 5,000 instances in minutes
  3. Heavy Managed Service Users

    • Core business logic runs on Lambda, Step Functions
    • Rely on managed ML services (SageMaker, Vertex AI)
    • Use proprietary services (DynamoDB, Aurora Serverless)
    • Migration would require app rewrites (6-12 month distraction)
  4. Global, Compliance-Heavy Operations

    • Need presence in 10+ geographic regions
    • Require specific compliance certifications (FedRAMP, HIPAA)
    • Customers demand cloud-native architectures
    • Multi-cloud strategy for risk management
  5. Small Engineering Teams (<100 total engineers)

    • No dedicated infrastructure specialists
    • Platform team smaller than 3 FTE
    • Limited ops expertise in-house
    • Can't spare 10-20 hours/week for infrastructure management

Example Profile: E-commerce startup (Series A)

  • 30 engineers, 20 in product
  • Traffic varies 3-50x (normal vs sales events)
  • Heavy AWS managed services (RDS, Lambda, S3)
  • Global customer base (US, EU, APAC)
  • Verdict: Stay in cloud (markup justified by elasticity needs)

The "Consider Repatriation" Profile

You should evaluate bare metal/VPS alternatives if:

  1. Predictable SaaS or Internal Tooling

    • Traffic patterns stable within 2x variation
    • 50-100+ sustained baseline servers
    • 12+ months of workload history showing consistency
    • Capacity planning is feasible
  2. High Bandwidth Requirements

    • Egress costs >$5K/month
    • Video streaming, file transfer, CDN origin
    • Data analytics with frequent large exports
    • Backup/DR with multi-TB daily transfers
  3. Minimal Managed Service Lock-in

    • Primarily compute and block storage users
    • Self-host databases (PostgreSQL, MySQL, Redis)
    • Kubernetes workloads (portable across infrastructure)
    • Use open-source tools (Prometheus, Grafana, etc.)
  4. Mature Engineering Organization

    • 100+ total engineers
    • Dedicated infrastructure/platform team (3+ FTE)
    • Existing ops expertise (Linux, networking, storage)
    • Culture of infrastructure ownership
  5. Cost Optimization Mandate

    • Cloud bill >$500K annually with 70%+ fixed workloads
    • Executive pressure to reduce infrastructure spend
    • Comfortable with 6-12 month migration timeline
    • Can commit $100K-$1M upfront for hardware/migration

Example Profile: Mature B2B SaaS (Series C+)

  • 200 engineers, 12-person platform team
  • 150 sustained EC2 instances, spikes to 180 in peak hours
  • Self-hosted PostgreSQL, Kafka, Redis on EC2
  • $1.2M annual AWS bill, mostly compute and egress
  • Verdict: Strong repatriation candidate (save 40-60% = $480-720K/year)

The Hybrid Approach: Best of Both Worlds

Most sophisticated platform teams don't choose cloud XOR bare metal—they architect hybrid systems optimizing cost and capability.

Hybrid Architecture Pattern:

  1. Baseline Workload → Bare Metal/Colocation

    • Predictable compute: self-hosted k8s cluster on Hetzner bare metal
    • Databases: PostgreSQL on dedicated servers (high I/O performance)
    • Object storage: MinIO or Ceph on-prem for bulk storage
    • Cost: 40-60% lower than cloud equivalent
  2. Burst Capacity → Cloud

    • Auto-scaling application tier in AWS (scale 1.5-10x on demand)
    • Spot instances for batch processing
    • Lambda for event-driven workloads
    • Cost: Pay cloud premium only for actual burst usage
  3. Managed Services → Cloud (Selectively)

    • Keep high-value managed services (e.g., Route53, CloudFront CDN)
    • Use cloud for capabilities you can't replicate (e.g., SageMaker for ML)
    • Avoid managed services you can self-host efficiently (RDS → self-hosted PostgreSQL)
  4. DR/Backup → Cheap Cloud Storage

    • Glacier/Deep Archive for cold backups
    • Cross-region replication for critical data
    • Geographic diversity without operating multiple datacenters

Real-World Hybrid Example:

Company: Mid-size SaaS, 500 employees, 200 engineers

  • Before (all AWS): $2.4M/year
  • After (hybrid):
    • Bare metal (Hetzner): 100 servers baseline = $600K/year
    • AWS burst capacity: 20-50 instances on-demand = $400K/year
    • AWS managed services: Route53, CloudFront, S3 (hot data) = $300K/year
    • Total: $1.3M/year
  • Savings: $1.1M annually (46% reduction)
  • Team overhead: +10 hours/week infrastructure management
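
The arithmetic behind this example is worth capturing once so you can substitute your own numbers. A minimal sketch using the illustrative figures above:

```python
# Hybrid TCO comparison using the illustrative figures above; plug in your own costs.
all_cloud_annual = 2_400_000

hybrid_annual = {
    "bare metal baseline (Hetzner, ~100 servers)": 600_000,
    "AWS burst capacity (20-50 on-demand instances)": 400_000,
    "AWS managed services (Route53, CloudFront, hot S3)": 300_000,
}

hybrid_total = sum(hybrid_annual.values())
savings = all_cloud_annual - hybrid_total

print(f"Hybrid total: ${hybrid_total:,.0f}/year")                                # $1,300,000/year
print(f"Savings:      ${savings:,.0f}/year ({savings / all_cloud_annual:.0%})")  # $1,100,000/year (46%)
```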

💡 Key Takeaway

The optimal architecture isn't cloud OR bare metal—it's strategically hybrid. Place predictable baseline workloads on owned infrastructure (40-60% savings), reserve cloud for burst capacity and high-value managed services. Most mature platforms save 30-50% with hybrid approaches.


The Hidden Costs Nobody Talks About

Cloud's Hidden Costs

The sticker shock of AWS bills is obvious. Less obvious: the hidden costs that inflate true cloud spending.

1. Egress Fees (The Bandwidth Tax)

  • Often 20-40% of total bill for data-heavy workloads
  • Difficult to predict or forecast accurately
  • Creates lock-in (expensive to move data out)
  • Impact: 10TB/month of egress ≈ $891/month, or $10,692/year, at $0.09/GB after the 100GB free tier

2. Cross-AZ and Cross-Region Traffic

  • $0.01/GB for data between availability zones
  • $0.02/GB for cross-region transfers
  • Chatty microservices architecture can generate massive internal traffic
  • Impact: High-frequency trading firm spent $40K/month on internal AWS traffic

3. NAT Gateway and Network Costs

  • NAT Gateway: $0.045/hour + $0.045/GB processed = ~$33/month per gateway plus data charges
  • Transit Gateway: $0.05/hour + $0.02/GB = ~$36/month + data charges per attachment
  • Load balancers: ALB/NLB at ~$20-30/month + LCU charges
  • Impact: Network infrastructure can cost $5-15K/month before any compute

4. Managed Service Lock-in Premium

  • RDS costs 2-3x self-hosted PostgreSQL on EC2
  • Aurora costs 4-5x self-hosted PostgreSQL on bare metal
  • Elasticsearch Service costs 3-4x self-hosted on EC2
  • Impact: a $500/month self-hosted DB becomes $1,000-$1,500/month on RDS or $2,000-$2,500/month on Aurora

5. Idle Resource Waste

  • FinOps Foundation: 15-30% of cloud spend is waste
  • Orphaned EBS volumes, unused load balancers, forgotten test environments
  • Dev/staging environments running 24/7 (used 40 hours/week)
  • Impact: $2M cloud bill → $300-600K pure waste

6. Reserved Instance and Savings Plan Complexity

  • Require accurate 1-3 year capacity forecasting
  • Wrong guess = wasted prepayment or continued on-demand premium
  • Management overhead: which instances to reserve, when to renew
  • Impact: 10-20 hours/quarter reserved instance optimization

7. Multi-Account and Organizational Complexity

  • Dozens or hundreds of AWS accounts for isolation
  • Centralized billing, IAM complexity, cross-account access
  • Security and compliance overhead
  • Impact: 1-2 FTE dedicated to cloud account management at enterprise scale

Bare Metal's Hidden Costs

Bare metal advocates downplay the real operational burden of self-managed infrastructure.

1. Upfront Capital Expenditure

  • Hardware purchase: $10-30K per server (enterprise-grade)
  • Datacenter setup or colocation contracts
  • Network equipment, switches, firewalls
  • Impact: $700K (37signals), $53M+ (Dropbox) upfront investment

2. Infrastructure Team Overhead

  • 10-20 hours/week minimum for infrastructure management
  • At $60-75/hour senior DevOps rates: $2,500-$6,000/month
  • Scales with infrastructure complexity
  • Impact: $30-72K annually in hidden labor cost

3. Capacity Planning Risk

  • Over-provision → wasted hardware investment
  • Under-provision → emergency hardware procurement (weeks of lead time)
  • Hardware refresh cycles (3-5 years)
  • Impact: 20-30% over-provisioning typical to avoid capacity emergencies

4. Hardware Failure and Redundancy

  • Servers fail: plan for 2-5% annual failure rate
  • Need N+1 or N+2 redundancy for HA
  • Spare parts inventory and RMA processes
  • Impact: 15-25% additional hardware for redundancy

5. Datacenter and Power Costs

  • Colocation: $1,000-$5,000/month per rack
  • Power: $0.10-$0.30/kWh (80-150W per server × 24/7)
  • Cooling: adds 30-50% to power costs
  • Impact: 200 servers = $4-8K/month power + $10-30K/month colocation

6. Network Bandwidth and Transit Costs

  • Datacenter bandwidth not always "free"
  • Transit providers charge for high bandwidth (95th percentile billing)
  • DDoS protection and network security
  • Impact: 10Gbps commit = $2-5K/month bandwidth costs at colocation

7. Compliance and Security Overhead

  • Physical security: datacenter access controls
  • Compliance audits: SOC2, ISO27001 for self-managed infrastructure
  • Security patching: OS, firmware, hardware vulnerabilities
  • Impact: SOC2 audit: $50-150K annually, ongoing compliance overhead

8. Opportunity Cost and Team Distraction

  • Engineering time spent on infrastructure ≠ time spent on product
  • 6-12 month migration timeline with team focus shift
  • Delayed feature development during migration
  • Impact: Hard to quantify but potentially millions in delayed revenue

💡 Key Takeaway

Both cloud and bare metal have hidden costs. Cloud hides costs in egress fees, cross-AZ traffic, managed service premiums, and waste (15-30% of spend). Bare metal hides costs in upfront CAPEX, infrastructure team overhead ($30-72K/year), capacity planning risk, and opportunity cost of team distraction. Calculate total cost of ownership, not just sticker price.


The FinOps Response: Optimizing Cloud Before Exiting

Before committing to cloud repatriation, platform engineering teams should exhaust cloud cost optimization strategies. Many organizations discover 30-50% savings are possible without leaving AWS.

The FinOps Maturity Model

Crawl Phase (0-6 months):

  • Establish cost visibility: tag all resources, enable Cost Explorer
  • Identify waste: unused resources, orphaned volumes, idle instances
  • Quick wins: rightsize obvious over-provisioned instances
  • Typical savings: 15-20% of cloud spend

Walk Phase (6-18 months):

  • Automate waste cleanup: scheduled shutdown of dev/staging environments
  • Reserved Instance / Savings Plan strategy for predictable workloads
  • Implement FinOps policies: budgets, alerts, approval workflows
  • Typical savings: 25-35% of cloud spend

Run Phase (18+ months):

  • FinOps as Code: policy-driven cost optimization in CI/CD
  • Unit economics: cost per customer, per transaction, per feature
  • Culture shift: engineers own cost as quality metric
  • Typical savings: 35-50% of cloud spend

Top 10 Cloud Cost Optimization Strategies

1. Eliminate Waste (Quickest ROI)

  • Action: Identify and terminate idle resources (unused instances, orphaned volumes, forgotten load balancers)
  • Typical Savings: 10-15% of total cloud spend
  • Tools: AWS Trusted Advisor, CloudHealth, Spot.io
  • Example: $2M cloud bill → $200-300K saved by deleting waste

2. Rightsize Over-Provisioned Instances

  • Action: Match instance types to actual CPU/memory utilization
  • Typical Savings: 20-30% on compute spend
  • Tools: AWS Compute Optimizer, CloudWatch metrics analysis
  • Example: m5.4xlarge (16 vCPU) at 30% utilization → m5.2xlarge (8 vCPU) saves 50%

3. Reserved Instances and Savings Plans

  • Action: Commit 1-3 years for predictable baseline workloads
  • Typical Savings: 30-70% vs on-demand pricing
  • Risk: Wrong forecast = wasted prepayment or continued on-demand premium
  • Best Practice: Reserve 60-70% of baseline, keep 30-40% flexible on-demand

4. Spot Instances for Fault-Tolerant Workloads

  • Action: Use spot instances (up to 90% off) for batch jobs, CI/CD, dev/test
  • Typical Savings: 60-90% on applicable workloads
  • Limitation: Can be terminated with 2-minute notice
  • Use Cases: Data processing, ML training, rendering, test environments

5. Auto-Scaling and Scheduled Shutdown

  • Action: Scale down non-production environments outside business hours
  • Typical Savings: 50-70% on dev/staging costs (running 40hrs/week vs 168hrs/week)
  • Tools: AWS Instance Scheduler, custom Lambda functions
  • Example: 50 dev/test instances × $100/mo = $5K → $1.75K (65% savings)
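
As a concrete example of the scheduled-shutdown pattern, here is a minimal Lambda-style handler sketch that stops EC2 instances tagged as development outside business hours. The `Environment=dev` tag and the evening EventBridge trigger are assumptions for illustration; AWS's packaged Instance Scheduler covers the same ground with more options:

```python
# Minimal sketch: stop running EC2 instances tagged Environment=dev.
# Intended to run from an evening EventBridge schedule, with a matching
# "start" job each morning. Tag key/value and schedule are assumptions.
import boto3

ec2 = boto3.client("ec2")

def handler(event, context):
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": "tag:Environment", "Values": ["dev"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        instance["InstanceId"]
        for page in pages
        for reservation in page["Reservations"]
        for instance in reservation["Instances"]
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return {"stopped": instance_ids}
```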

6. Storage Lifecycle Policies

  • Action: Auto-transition infrequent-access data to cheaper storage tiers
  • Typical Savings: 40-90% on storage costs
  • Strategy: S3 Standard → S3-IA (30 days) → Glacier (90 days) → Deep Archive (365 days)
  • Example: 100TB in S3 Standard costs $2,300/mo; aging 80TB into Glacier ($320/mo) while 20TB stays in Standard ($460/mo) brings the bill to ~$780/mo—roughly $1,500/mo saved
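
The tiering strategy above maps directly onto an S3 lifecycle configuration. A minimal boto3 sketch follows; the bucket name and prefix are placeholders, and the transition days should be checked against your actual access patterns and Glacier retrieval costs before applying:

```python
# Minimal sketch of the Standard → IA → Glacier → Deep Archive tiering described above.
# Bucket and prefix are placeholders; tune transition days to real access patterns.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-archive",        # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-cold-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "exports/"},  # placeholder prefix
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```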

7. Egress Cost Reduction

  • Action: Use CloudFront CDN (free egress to CloudFront), cache aggressively, compress data
  • Typical Savings: 30-50% on data transfer costs
  • Strategy: CloudFront ↔ S3 egress free, CloudFront → internet cheaper than S3 → internet
  • Example: 10TB/month of direct S3 egress (~$900) drops to roughly $600 when served through CloudFront, and to ~$180 when compression and caching shrink billable transfer to ~3TB

8. Commitment to Architecture Optimization

  • Action: Refactor chatty microservices, reduce cross-AZ traffic, optimize database queries
  • Typical Savings: 20-40% on network and data transfer
  • Investment: Requires engineering time, not just configuration
  • Example: Colocate services in a single AZ (trading redundancy for cost where acceptable)

9. Managed Service Alternatives

  • Action: Replace expensive managed services with self-hosted equivalents
  • Typical Savings: 50-70% on database and service costs
  • Trade-off: Operational burden increases
  • Example: RDS PostgreSQL ($500/mo) → PostgreSQL on EC2 ($150/mo instance + management overhead)

10. FinOps Culture and Accountability

  • Action: Make cost visibility real-time for engineers, assign budgets per team/product
  • Typical Savings: 10-20% through behavior change
  • Tools: CloudZero, Vantage, Kubecost (for Kubernetes)
  • Strategy: Unit economics (cost per customer), cost as quality metric, showback/chargeback

💡 Key Takeaway

Before repatriating to bare metal, exhaust cloud optimization strategies. Most organizations achieve 30-50% cloud cost reduction through waste elimination, rightsizing, reserved instances, auto-scaling, and storage lifecycle policies—without the operational complexity of leaving cloud.

When Optimization Isn't Enough

You've optimized cloud and still need repatriation if:

  1. Post-optimization costs still 3-5x bare metal equivalent

    • After removing waste, rightsizing, and commitments, still paying massive premium
    • Example: $1.2M optimized AWS bill vs $400K bare metal equivalent
  2. Egress costs dominate and can't be reduced

    • High bandwidth workloads (video, large file transfers) with unavoidable egress
    • Example: $300K/year egress fees for data analytics platform (Hetzner = $0)
  3. Managed services provide minimal value

    • Using AWS primarily for compute and block storage
    • Self-hosting PostgreSQL, Redis, Kafka already (no managed service value)
    • Kubernetes portable across infrastructure
  4. Predictable workload eliminates elasticity value

    • 12+ months data shows <2x traffic variation
    • Capacity planning is feasible and accurate
    • Don't need cloud's burst scaling capabilities
  5. Team has infrastructure expertise and capacity

    • Dedicated platform team (3+ FTE) with datacenter experience
    • Bandwidth to manage 10-20 hours/week infrastructure overhead
    • Culture of infrastructure ownership

Decision Point: If you've optimized cloud spend by 30-50% and still meet criteria above, repatriation economics likely favor bare metal. If optimization closed the cost gap significantly or you rely on managed services, stay in cloud.


Platform Engineering Team Recommendations

For Startups (<100 Engineers, Pre-PMF)

Recommendation: Stay in cloud

Rationale:

  • Unpredictable scaling needs
  • Team should focus on product, not infrastructure
  • Runway concerns make OpEx flexibility critical
  • Cloud markup is "insurance premium" for flexibility

Cost Optimization Focus:

  1. Aggressive waste cleanup (unused resources)
  2. Schedule dev/staging environment shutdown (nights/weekends)
  3. Use spot instances for CI/CD, batch jobs
  4. Don't over-engineer: default to smallest instances that work

When to Revisit:

  • Reach 50-100 sustained servers with predictable patterns
  • 12+ months of stable workload data
  • Cloud bill exceeds $500K annually
  • Post-PMF with clear growth trajectory

For Growth Companies (100-500 Engineers, Series B-C)

Recommendation: Evaluate hybrid architecture

Rationale:

  • Sufficient scale to justify infrastructure investment
  • Likely have predictable baseline workload
  • Platform team exists or can be built
  • Cost savings materially impact burn rate

Evaluation Checklist:

  • Cloud bill >$500K annually with 70%+ predictable workload
  • Minimal managed service dependency (or can self-host equivalents)
  • 3+ engineers with infrastructure expertise
  • Executive support for 6-12 month migration
  • Can commit $100K-$500K upfront for hardware/migration

Recommended Approach:

  1. Phase 1 (Months 1-3): Optimize cloud spend (target 30% reduction)
  2. Phase 2 (Months 4-6): Migrate dev/test environments to Hetzner/DigitalOcean (low risk)
  3. Phase 3 (Months 7-12): Migrate baseline production workloads to bare metal/colocation
  4. Phase 4 (Ongoing): Hybrid architecture—bare metal baseline, cloud burst capacity

Expected Savings: 30-40% total infrastructure cost

For Enterprises (500+ Engineers, Series D+)

Recommendation: Strategic hybrid with selective repatriation

Rationale:

  • Massive scale justifies infrastructure investment
  • Likely already have infrastructure specialists
  • Cost optimization is board-level priority
  • Risk tolerance for multi-datacenter operations

Strategic Framework:

Workload Classification:

  1. Class 1: Repatriate to Bare Metal

    • Predictable baseline compute (web servers, API servers)
    • Self-hosted databases (PostgreSQL, MySQL, Redis, Kafka)
    • Batch processing and analytics
    • Target: 40-60% of compute workload
  2. Class 2: Keep in Cloud

    • Burst capacity for traffic spikes
    • Global multi-region presence
    • Managed services providing high value (e.g., SageMaker, Kinesis)
    • Target: 20-30% of compute workload
  3. Class 3: Move to Cheap Cloud (DigitalOcean, Hetzner)

    • Dev, staging, QA environments
    • CI/CD infrastructure
    • Internal tooling
    • Target: 20-30% of compute workload

Expected Savings: 40-60% total infrastructure cost

Investment Required:

  • Hardware: $1-5M upfront (depends on scale)
  • Migration team: 4-8 engineers for 12-18 months
  • Platform team expansion: +2-4 FTE for ongoing management

Break-Even Timeline: 12-24 months

The "Never Repatriate" Scenarios

Don't repatriate if:

  1. Heavy AWS Managed Service Lock-In

    • Core business logic in Lambda, Step Functions, proprietary services
    • Migration requires 6-12+ month app rewrites
    • Alternative: Optimize managed service usage, negotiate enterprise discounts
  2. True Burst Scaling Requirements

    • E-commerce with Black Friday-like spikes (10-100x traffic)
    • News/media sites with viral traffic unpredictability
    • Alternative: Hybrid with bare metal baseline + cloud burst capacity
  3. Global Multi-Region Compliance

    • Must operate in 10+ geographic regions
    • Compliance certifications require cloud infrastructure
    • Alternative: Negotiate volume discounts, optimize within cloud
  4. Small Team Without Infrastructure Expertise

    • No dedicated platform team (<3 FTE)
    • No infrastructure specialists on staff
    • Alternative: Aggressive cloud cost optimization, consider managed Kubernetes
  5. Fast-Growing, Unpredictable Scaling

    • 2-5x YoY growth with unpredictable patterns
    • Risk of under-provisioning bare metal capacity
    • Alternative: Stay in cloud until growth stabilizes, then revisit

The 2025-2027 Outlook: Regulatory and Market Forces

Regulatory Pressure on Egress Fees

EU Data Act (Effective September 2025):

  • Targets "unfair contractual terms" in cloud contracts
  • Aims to reduce switching barriers between cloud providers
  • Bans profit-generating egress fees by January 12, 2027
  • Impact: AWS, Azure, GCP must eliminate or reduce egress charges in EU

AWS Response (as of 2025):

  • Increased free egress tier from 1GB to 100GB/month
  • Waives egress fees for time-bound migrations (60-day credits with approval)
  • Interpretation: Regulatory pressure is working, but the changes are incremental so far

What This Means for Platform Teams:

  • Egress fee reduction likely continues through 2027
  • EU-based operations may see significant egress savings
  • Cloud lock-in concerns diminishing (easier to migrate data out)
  • Strategy: Monitor regulatory developments, plan migrations for post-egress-fee era

Market Dynamics: Cloud Growth Despite Repatriation

The Paradox: 86% of CIOs plan repatriation, yet cloud spending grows 21.5% annually.

Explanation:

  1. Selective Repatriation: 21% of workloads repatriated, while 30%+ of new workloads still go to cloud
  2. Net Cloud Growth: Migration to cloud outpaces repatriation for most enterprises
  3. Hybrid Strategies: Companies optimize workload placement, not binary cloud exit

Gartner Forecast:

  • 2024 cloud spending: $595.7 billion
  • 2025 cloud spending: $723.4 billion (21.5% growth)
  • Cloud remains dominant despite repatriation trend

What This Tells Us:

  • Repatriation is workload optimization, not wholesale cloud rejection
  • Cloud will remain dominant for appropriate use cases
  • Sophisticated teams optimize placement, don't pick ideological sides

The Rise of "Bare Metal Cloud" and Hybrid Solutions

New Market Entrants:

  1. Hetzner: Traditional bare metal provider, now offering cloud flexibility
  2. Latitude.sh: Bare Metal as a Service (BMaaS) with Terraform/API provisioning
  3. OpenMetal: On-demand private clouds with cloud-like provisioning
  4. Vultr, Linode (Akamai): VPS providers offering bare metal options

What's Changing:

  • Bare metal now has cloud-like provisioning (Terraform, APIs, automation)
  • "Physical servers as easily as VMs" (platform engineering integration)
  • Hybrid architectures become operationally feasible

Platform Engineering Impact:

  • IaC (Terraform, Pulumi) works across cloud and bare metal
  • Kubernetes portability enables seamless hybrid deployments
  • CI/CD pipelines provision bare metal as easily as AWS EC2

The Future: Workload placement becomes a continuous optimization problem, not a one-time migration decision.


Conclusion: Optimize for Economics, Not Ideology

The cloud repatriation debate is polarizing, but the data is clear:

The Truths:

  1. ✅ AWS charges 7-18x markup on compute vs bare metal (verifiable)
  2. ✅ Egress fees add 10-100x total cost for data-intensive workloads (real at scale)
  3. ✅ 86% of CIOs plan some repatriation (highest on record, up from 43% in 2020)
  4. ✅ Real companies save millions (37signals: $2M/year, Dropbox: $75M over 2 years)
  5. ✅ Most repatriation is selective, not wholesale (21% of workloads, not 100%)
  6. ✅ Cloud spending still grows 21.5% annually despite repatriation trend
  7. ✅ Bare metal has hidden costs: CAPEX, team overhead, capacity planning risk
  8. ✅ Cloud has hidden costs: egress, cross-AZ traffic, managed service premiums, 15-30% waste

The Framework:

Stay in Cloud If:

  • Startup pre-PMF with unpredictable growth
  • True burst scaling requirements (10x+ spikes)
  • Heavy managed service dependency
  • Small team (<100 engineers, <3 infrastructure FTE)
  • Global multi-region compliance requirements

Consider Repatriation If:

  • Predictable workload (>50 servers sustained 12+ months)
  • High egress costs (>$5K/month)
  • Minimal managed service lock-in
  • Mature engineering org (100+ engineers, 3+ infrastructure FTE)
  • Post-optimization cloud bill still 3-5x bare metal equivalent

Optimal Strategy for Most: Hybrid architecture

  • Bare metal for predictable baseline workloads (40-60% cost savings)
  • Cloud for burst capacity and high-value managed services
  • Cheap VPS (Hetzner, DigitalOcean) for dev/test environments
  • Result: 30-50% total infrastructure cost reduction without sacrificing flexibility

📚 Learning Resources

📖 Essential Cost Comparison Data

📝 Real-World Case Studies

🎥 Platform Engineering Perspectives

📚 FinOps and Cost Optimization

🛠️ Cost Calculation and Decision Tools

🌐 Community and Discussion


📡 Stay Updated

Cloud Provider Pricing: AWS Pricing · GCP Pricing · Azure Pricing

Alternative Providers: Hetzner · DigitalOcean · Vultr · Linode

Industry Analysis: FinOps Foundation Blog · The Register Cloud Coverage · The New Stack Infrastructure

Cost Optimization Tools: nOps · CloudZero · Vantage · Ternary