Cloud Platforms Deep Dive for Platform Engineers
A comprehensive guide to AWS, GCP, and Azure from a platform engineering perspective, focusing on infrastructure automation, reliability, and scale.
📚 Essential Resources
📖 Must-Read Books & Guides
- AWS Well-Architected Framework - AWS best practices
- Google Cloud Architecture Framework - GCP best practices
- Azure Architecture Center - Azure patterns & practices
- Cloud Native Infrastructure - Justin Garrison & Kris Nova
- Architecting the Cloud - Michael Kavis
🎥 Video Resources
- AWS re:Invent Videos - Latest AWS innovations
- Google Cloud Next - GCP conference talks
- Microsoft Build - Azure updates
- A Cloud Guru YouTube - Multi-cloud tutorials
- freeCodeCamp Cloud Computing - Full course
🎓 Courses & Certifications
AWS Certifications
- AWS Solutions Architect - Associate & Professional
- AWS DevOps Engineer - Professional
- AWS Advanced Networking - Specialty
- AWS Security - Specialty
GCP Certifications
- Google Cloud Architect - Professional
- Google Cloud DevOps Engineer - Professional
- Google Cloud Network Engineer - Professional
- Google Cloud Security Engineer - Professional
Azure Certifications
- Azure Solutions Architect - Expert
- Azure DevOps Engineer - Expert
- Azure Security Engineer - Associate
- Azure Network Engineer - Associate
📰 Blogs & Articles
- AWS News Blog - Official AWS updates
- Google Cloud Blog - GCP announcements
- Azure Updates - Azure service updates
- High Scalability - Architecture case studies
- The Morning Paper - CS papers explained
🔧 Essential Tools & Platforms
- Terraform - Multi-cloud IaC
- Pulumi - Modern infrastructure as code
- CloudFormation - AWS native IaC
- Cloud Deployment Manager - GCP IaC
- Azure Resource Manager - Azure IaC
💬 Communities & Forums
- r/aws - AWS community
- r/googlecloud - GCP community
- r/azure - Azure community
- Cloud Native Computing Foundation - Multi-cloud community
- AWS Developer Forums - Official AWS forums
🎮 Interactive Learning
- AWS Free Tier - Hands-on AWS practice
- Google Cloud Free Tier - GCP free resources
- Azure Free Account - Azure free tier
- A Cloud Guru Playground - Multi-cloud sandboxes
- Qwiklabs - Hands-on cloud labs
📖 Documentation & References
- AWS Documentation - Comprehensive AWS docs
- Google Cloud Docs - GCP documentation
- Azure Documentation - Azure docs
- Cloud Service Comparison - Compare services across clouds
- Cloud Pricing Comparison - Cost calculator
🏆 Architecture Resources
- AWS Architecture Center - Reference architectures
- Google Cloud Solutions - GCP architectures
- Azure Architecture Diagrams - Azure patterns
- This is My Architecture - Real-world architectures
- Cloud Design Patterns - Common patterns
Cloud Platform Comparison
Key Services Mapping
Service Type | AWS | GCP | Azure |
---|---|---|---|
Compute | EC2 | Compute Engine | Virtual Machines |
Containers | ECS/EKS | GKE | AKS |
Serverless | Lambda | Cloud Functions | Functions |
Object Storage | S3 | Cloud Storage | Blob Storage |
Block Storage | EBS | Persistent Disk | Managed Disks |
Database (SQL) | RDS | Cloud SQL | SQL Database |
NoSQL | DynamoDB | Firestore/Bigtable | Cosmos DB |
Message Queue | SQS | Pub/Sub | Service Bus |
CDN | CloudFront | Cloud CDN | CDN |
Load Balancer | ELB/ALB/NLB | Cloud Load Balancing | Load Balancer |
VPC | VPC | VPC | Virtual Network |
IAM | IAM | Cloud IAM | Azure AD/RBAC |
Monitoring | CloudWatch | Cloud Monitoring | Monitor |
Logs | CloudWatch Logs | Cloud Logging | Log Analytics |
AWS Deep Dive
Compute and Networking
EC2 Instance Optimization:
# Instance metadata service v2 (IMDSv2)
aws ec2 modify-instance-metadata-options \
--instance-id i-1234567890abcdef0 \
--http-tokens required \
--http-endpoint enabled
# Spot instance management
aws ec2 request-spot-fleet \
--spot-fleet-request-config file://config.json
VPC Design Patterns:
# Multi-AZ VPC with public/private subnets
Resources:
VPC:
Type: AWS::EC2::VPC
Properties:
CidrBlock: 10.0.0.0/16
EnableDnsHostnames: true
EnableDnsSupport: true
PublicSubnet1:
Type: AWS::EC2::Subnet
Properties:
VpcId: !Ref VPC
CidrBlock: 10.0.1.0/24
AvailabilityZone: !Select [0, !GetAZs '']
MapPublicIpOnLaunch: true
PrivateSubnet1:
Type: AWS::EC2::Subnet
Properties:
VpcId: !Ref VPC
CidrBlock: 10.0.11.0/24
AvailabilityZone: !Select [0, !GetAZs '']
Advanced Networking:
# Transit Gateway for multi-VPC
aws ec2 create-transit-gateway \
--description "Central hub for VPC connectivity" \
--options=AmazonSideAsn=64512,DefaultRouteTableAssociation=enable
# VPC Peering
aws ec2 create-vpc-peering-connection \
--vpc-id vpc-1a2b3c4d \
--peer-vpc-id vpc-5e6f7g8h \
--peer-region us-west-2
Resources:
- 📖 AWS Well-Architected Framework
- 📚 AWS Certified Solutions Architect Study Guide
- 🎥 AWS re:Invent Videos
- 📖 AWS Architecture Center
Container Services
EKS Platform Engineering:
# EKS cluster with managed node groups
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: production-cluster
region: us-west-2
version: "1.27"
vpc:
subnets:
private:
us-west-2a: { id: subnet-1 }
us-west-2b: { id: subnet-2 }
us-west-2c: { id: subnet-3 }
managedNodeGroups:
- name: general-workloads
instanceType: m5.large
minSize: 3
maxSize: 10
desiredCapacity: 5
volumeSize: 100
volumeType: gp3
privateNetworking: true
iam:
withAddonPolicies:
imageBuilder: true
autoScaler: true
ebs: true
efs: true
cloudWatch: true
ECS Service Mesh:
{
"family": "service-task",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "256",
"memory": "512",
"proxyConfiguration": {
"type": "APPMESH",
"containerName": "envoy",
"properties": [
{"name": "ProxyIngressPort", "value": "15000"},
{"name": "ProxyEgressPort", "value": "15001"},
{"name": "AppPorts", "value": "8080"},
{"name": "EgressIgnoredIPs", "value": "169.254.170.2,169.254.169.254"}
]
}
}
Serverless Platform
Lambda Best Practices:
# Lambda with connection pooling
import boto3
from botocore.config import Config
# Initialize outside handler for connection reuse
config = Config(
region_name='us-west-2',
retries={'max_attempts': 2}
)
dynamodb = boto3.resource('dynamodb', config=config)
table = dynamodb.Table('my-table')
def lambda_handler(event, context):
# Reuse connection
response = table.get_item(
Key={'id': event['id']}
)
return response['Item']
Step Functions for Orchestration:
{
"Comment": "ETL Pipeline State Machine",
"StartAt": "ExtractData",
"States": {
"ExtractData": {
"Type": "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT:function:extract-data",
"Next": "TransformData",
"Retry": [{
"ErrorEquals": ["States.TaskFailed"],
"IntervalSeconds": 2,
"MaxAttempts": 3,
"BackoffRate": 2
}]
},
"TransformData": {
"Type": "Parallel",
"Branches": [
{
"StartAt": "TransformBatch1",
"States": {
"TransformBatch1": {
"Type": "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT:function:transform",
"End": true
}
}
}
],
"Next": "LoadData"
}
}
}
Storage and Databases
S3 Lifecycle Management:
{
"Rules": [{
"Id": "ArchiveOldLogs",
"Status": "Enabled",
"Prefix": "logs/",
"Transitions": [{
"Days": 30,
"StorageClass": "STANDARD_IA"
}, {
"Days": 90,
"StorageClass": "GLACIER"
}],
"Expiration": {
"Days": 365
}
}]
}
DynamoDB Design Patterns:
# Single table design
table_design = {
"TableName": "platform-table",
"KeySchema": [
{"AttributeName": "PK", "KeyType": "HASH"},
{"AttributeName": "SK", "KeyType": "RANGE"}
],
"GlobalSecondaryIndexes": [{
"IndexName": "GSI1",
"Keys": [
{"AttributeName": "GSI1PK", "KeyType": "HASH"},
{"AttributeName": "GSI1SK", "KeyType": "RANGE"}
],
"Projection": {"ProjectionType": "ALL"}
}],
"BillingMode": "PAY_PER_REQUEST"
}
Security and Compliance
IAM Best Practices:
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::ACCOUNT:role/platform-engineer"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "unique-external-id"
},
"IpAddress": {
"aws:SourceIp": ["10.0.0.0/8"]
}
}
}]
}
Resources:
GCP Deep Dive
Compute and Networking
Compute Engine Automation:
# Create instance template with startup script
gcloud compute instance-templates create platform-template \
--machine-type=n2-standard-4 \
--network-interface=network=default,network-tier=PREMIUM \
--metadata=startup-script='#!/bin/bash
apt-get update
apt-get install -y monitoring-agent
' \
--maintenance-policy=MIGRATE \
--provisioning-model=STANDARD \
--scopes=https://www.googleapis.com/auth/cloud-platform
# Managed instance group with autoscaling
gcloud compute instance-groups managed create platform-mig \
--template=platform-template \
--size=3 \
--zones=us-central1-a,us-central1-b,us-central1-c
VPC Design:
# Shared VPC configuration
resources:
- name: host-vpc
type: compute.v1.network
properties:
autoCreateSubnetworks: false
- name: prod-subnet
type: compute.v1.subnetwork
properties:
network: $(ref.host-vpc.selfLink)
ipCidrRange: 10.0.1.0/24
region: us-central1
privateIpGoogleAccess: true
secondaryIpRanges:
- rangeName: pods
ipCidrRange: 10.1.0.0/16
- rangeName: services
ipCidrRange: 10.2.0.0/16
Global Load Balancing:
# Cloud CDN with Cloud Armor
gcloud compute backend-services create web-backend \
--protocol=HTTP \
--port-name=http \
--health-checks=http-health-check \
--global \
--enable-cdn \
--cache-mode=CACHE_ALL_STATIC
gcloud compute security-policies create block-bad-actors \
--description "Block known malicious IPs"
gcloud compute security-policies rules create 1000 \
--security-policy block-bad-actors \
--expression "origin.region_code == 'XX'" \
--action "deny-403"
Resources:
- 📖 Google Cloud Architecture Framework
- 📚 Google Cloud Certified Professional Cloud Architect
- 🎥 Google Cloud Next Videos
- 📖 GCP Best Practices
GKE Platform Engineering
GKE Autopilot:
# Optimized GKE cluster
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
name: platform-cluster
spec:
location: us-central1
autopilot:
enabled: true
releaseChannel:
channel: REGULAR
workloadIdentityConfig:
workloadPool: PROJECT_ID.svc.id.goog
addonsConfig:
httpLoadBalancing:
disabled: false
horizontalPodAutoscaling:
disabled: false
Workload Identity:
# Configure workload identity
kubectl create serviceaccount gcs-access \
--namespace production
gcloud iam service-accounts create gcs-access-sa \
--display-name "GCS Access Service Account"
gcloud iam service-accounts add-iam-policy-binding \
gcs-access-sa@PROJECT_ID.iam.gserviceaccount.com \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:PROJECT_ID.svc.id.goog[production/gcs-access]"
kubectl annotate serviceaccount gcs-access \
--namespace production \
iam.gke.io/gcp-service-account=gcs-access-sa@PROJECT_ID.iam.gserviceaccount.com
Data Platform
BigQuery for Analytics:
-- Partitioned table with clustering
CREATE TABLE platform.events
PARTITION BY DATE(timestamp)
CLUSTER BY user_id, event_type
AS
SELECT * FROM platform.raw_events;
-- Materialized view for real-time dashboards
CREATE MATERIALIZED VIEW platform.hourly_metrics
PARTITION BY DATE(hour)
CLUSTER BY service_name
AS
SELECT
TIMESTAMP_TRUNC(timestamp, HOUR) as hour,
service_name,
COUNT(*) as request_count,
APPROX_QUANTILES(latency_ms, 100)[OFFSET(95)] as p95_latency
FROM platform.events
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY hour, service_name;
Dataflow Streaming:
// Streaming pipeline with windowing
Pipeline pipeline = Pipeline.create(options);
pipeline
.apply("ReadFromPubSub",
PubsubIO.readMessages().fromSubscription(subscription))
.apply("ParseEvents",
ParDo.of(new ParseEventFn()))
.apply("WindowEvents",
Window.<Event>into(
FixedWindows.of(Duration.standardMinutes(5)))
.triggering(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(
AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(30))))
.withAllowedLateness(Duration.standardMinutes(2))
.discardingFiredPanes())
.apply("AggregateMetrics",
Combine.perKey(new MetricsAggregator()))
.apply("WriteToBigQuery",
BigQueryIO.writeTableRows()
.to(table)
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(WriteDisposition.WRITE_APPEND));
Azure Deep Dive
Compute and Networking
Virtual Machine Scale Sets:
{
"type": "Microsoft.Compute/virtualMachineScaleSets",
"apiVersion": "2025-01-01",
"name": "platform-vmss",
"location": "[resourceGroup().location]",
"sku": {
"name": "Standard_D4s_v3",
"capacity": 3
},
"properties": {
"overprovision": true,
"upgradePolicy": {
"mode": "Rolling",
"rollingUpgradePolicy": {
"maxBatchInstancePercent": 20,
"maxUnhealthyInstancePercent": 20,
"maxUnhealthyUpgradedInstancePercent": 5,
"pauseTimeBetweenBatches": "PT10S"
}
},
"virtualMachineProfile": {
"storageProfile": {
"imageReference": {
"publisher": "Canonical",
"offer": "UbuntuServer",
"sku": "18.04-LTS",
"version": "latest"
}
},
"osProfile": {
"computerNamePrefix": "platform",
"adminUsername": "azureuser",
"customData": "[base64(variables('cloudInit'))]"
},
"networkProfile": {
"networkInterfaceConfigurations": [{
"name": "nic",
"properties": {
"primary": true,
"ipConfigurations": [{
"name": "ipconfig",
"properties": {
"subnet": {
"id": "[resourceId('Microsoft.Network/virtualNetworks/subnets', 'vnet', 'subnet')]"
},
"loadBalancerBackendAddressPools": [{
"id": "[resourceId('Microsoft.Network/loadBalancers/backendAddressPools', 'lb', 'backend')]"
}]
}
}]
}
}]
}
}
}
}
Virtual Network Architecture:
// Hub-spoke network topology
resource hubVnet 'Microsoft.Network/virtualNetworks@2025-01-01' = {
name: 'hub-vnet'
location: resourceGroup().location
properties: {
addressSpace: {
addressPrefixes: [
'10.0.0.0/16'
]
}
subnets: [
{
name: 'GatewaySubnet'
properties: {
addressPrefix: '10.0.0.0/27'
}
}
{
name: 'AzureFirewallSubnet'
properties: {
addressPrefix: '10.0.1.0/26'
}
}
]
}
}
resource spokeVnet 'Microsoft.Network/virtualNetworks@2025-01-01' = {
name: 'spoke-vnet'
location: resourceGroup().location
properties: {
addressSpace: {
addressPrefixes: [
'10.1.0.0/16'
]
}
}
}
resource vnetPeering 'Microsoft.Network/virtualNetworks/virtualNetworkPeerings@2025-01-01' = {
parent: hubVnet
name: 'hub-to-spoke'
properties: {
remoteVirtualNetwork: {
id: spokeVnet.id
}
allowVirtualNetworkAccess: true
allowForwardedTraffic: true
allowGatewayTransit: true
}
}
Resources:
- 📖 Azure Architecture Center
- 📚 Azure Solutions Architect Expert Guide
- 🎥 Azure Friday
- 📖 Azure Well-Architected Framework
AKS Platform Engineering
AKS with Azure Arc:
# Create AKS cluster with advanced networking
az aks create \
--resource-group platform-rg \
--name platform-aks \
--node-count 3 \
--enable-managed-identity \
--network-plugin azure \
--network-policy calico \
--enable-addons monitoring,azure-policy \
--generate-ssh-keys \
--attach-acr platform-acr \
--enable-cluster-autoscaler \
--min-count 3 \
--max-count 10
# Enable GitOps
az k8s-configuration flux create \
--name platform-gitops \
--cluster-name platform-aks \
--resource-group platform-rg \
--namespace flux-system \
--cluster-type managedClusters \
--scope cluster \
--url https://github.com/company/k8s-config \
--branch main \
--kustomization name=infra path=./infrastructure prune=true \
--kustomization name=apps path=./apps/production prune=true dependsOn=infra
Azure DevOps and Automation
Azure DevOps Pipelines:
# Multi-stage pipeline with approval gates
trigger:
branches:
include:
- main
stages:
- stage: Build
jobs:
- job: BuildContainer
pool:
vmImage: 'ubuntu-latest'
steps:
- task: Docker@2
inputs:
containerRegistry: 'platform-acr'
repository: 'platform/app'
command: 'buildAndPush'
Dockerfile: '**/Dockerfile'
tags: |
$(Build.BuildId)
latest
- stage: DeployDev
dependsOn: Build
jobs:
- deployment: DeployToDev
pool:
vmImage: 'ubuntu-latest'
environment: 'development'
strategy:
runOnce:
deploy:
steps:
- task: KubernetesManifest@0
inputs:
action: 'deploy'
kubernetesServiceConnection: 'dev-aks'
manifests: |
manifests/deployment.yaml
manifests/service.yaml
containers: |
platform-acr.azurecr.io/platform/app:$(Build.BuildId)
- stage: DeployProd
dependsOn: DeployDev
jobs:
- deployment: DeployToProd
pool:
vmImage: 'ubuntu-latest'
environment: 'production'
strategy:
runOnce:
preDeploy:
steps:
- task: ManualValidation@0
timeoutInMinutes: 1440
inputs:
notifyUsers: 'platform-team@company.com'
instructions: 'Validate dev deployment before promoting to production'
deploy:
steps:
- task: KubernetesManifest@0
inputs:
action: 'deploy'
kubernetesServiceConnection: 'prod-aks'
manifests: |
manifests/deployment.yaml
manifests/service.yaml
containers: |
platform-acr.azurecr.io/platform/app:$(Build.BuildId)
Multi-Cloud Strategies
Cloud-Agnostic Architecture
Terraform Multi-Cloud Module:
# Main module supporting multiple providers
variable "cloud_provider" {
type = string
validation {
condition = contains(["aws", "gcp", "azure"], var.cloud_provider)
error_message = "Cloud provider must be aws, gcp, or azure."
}
}
module "compute" {
source = "./modules/${var.cloud_provider}/compute"
instance_count = var.instance_count
instance_type = local.instance_types[var.cloud_provider]
network_id = module.network.network_id
}
locals {
instance_types = {
aws = "m5.large"
gcp = "n2-standard-4"
azure = "Standard_D4s_v3"
}
}
Cost Optimization Across Clouds
Cost Monitoring and Optimization:
# Multi-cloud cost analyzer
import boto3
from google.cloud import billing_v1
from azure.mgmt.consumption import ConsumptionManagementClient
class MultiCloudCostAnalyzer:
def get_aws_costs(self, start_date, end_date):
ce_client = boto3.client('ce')
response = ce_client.get_cost_and_usage(
TimePeriod={
'Start': start_date,
'End': end_date
},
Granularity='DAILY',
Metrics=['UnblendedCost'],
GroupBy=[
{'Type': 'DIMENSION', 'Key': 'SERVICE'}
]
)
return response['ResultsByTime']
def get_gcp_costs(self, billing_account):
client = billing_v1.CloudBillingClient()
# Implementation for GCP billing API
pass
def get_azure_costs(self, subscription_id):
consumption_client = ConsumptionManagementClient(
credential, subscription_id
)
# Implementation for Azure consumption API
pass
Platform Engineering Interview Scenarios
Common Cloud Platform Questions
-
Design a multi-region deployment
- Active-active vs active-passive
- Data replication strategies
- Failover mechanisms
-
Implement disaster recovery
- RTO and RPO requirements
- Backup strategies
- Cross-region replication
-
Optimize cloud costs
- Reserved instances vs spot/preemptible
- Auto-scaling strategies
- Storage tiering
-
Secure cloud infrastructure
- Network segmentation
- Identity and access management
- Encryption at rest and in transit
-
Build CI/CD pipelines
- Multi-cloud deployment
- Blue-green deployments
- Canary releases
Hands-On Labs
AWS:
GCP:
Azure:
Certifications and Learning Paths
AWS Certifications
- 🎓 AWS Solutions Architect Associate
- 🎓 AWS DevOps Engineer Professional
- 🎓 AWS Advanced Networking Specialty
GCP Certifications
- 🎓 Google Cloud Professional Cloud Architect
- 🎓 Google Cloud Professional DevOps Engineer
- 🎓 Google Cloud Professional Cloud Network Engineer
Azure Certifications
- 🎓 Azure Solutions Architect Expert
- 🎓 Azure DevOps Engineer Expert
- 🎓 Azure Network Engineer Associate
Key Takeaways
- Know the equivalents: Understand how services map across clouds
- Focus on patterns: Architectural patterns apply across all platforms
- Automation first: Use IaC and APIs for everything
- Cost awareness: Understand pricing models and optimization strategies
- Security by design: Implement security at every layer
- Stay current: Cloud services evolve rapidly - keep learning
Remember: Platform engineering in the cloud is about building reliable, scalable, and cost-effective infrastructure that empowers development teams to deliver value quickly and safely.