
AWS re:Invent 2025: The Complete Platform Engineering Guide

· 38 min read
VibeSRE
Platform Engineering Contributor


TL;DR

AWS re:Invent 2025 delivered the most significant platform engineering announcements in years.

  • Agentic AI became the defining theme: AWS DevOps Agent achieves 86% root cause identification, Kiro has 250,000+ developers, and Gartner predicts 40% of agentic AI projects will fail by 2027 due to data foundation gaps.

  • Infrastructure hit new scale: EKS Ultra Scale supports 100K nodes (vs 15K on GKE and 5K on AKS), Graviton5 delivers 192 cores with 25% better performance, and Trainium3 cuts AI training costs by 50%.

  • Developer experience evolved: Lambda Durable Functions enable year-long workflows, EKS Capabilities bring managed Argo CD/ACK, and the EKS MCP Server enables natural language cluster management.

  • Werner Vogels coined "verification debt" in his final keynote, warning that AI generates code faster than humans can understand it.

For platform teams, this isn't about AI replacing engineers; it's about evolving skills from writing runbooks to evaluating AI-generated mitigation plans.


Key Statistics

| Metric | Value | Source |
| --- | --- | --- |
| Agentic AI & Automation | | |
| Kiro autonomous agent users globally | 250,000+ | AWS |
| AWS DevOps Agent root cause identification | 86% | AWS |
| Nova Act browser automation reliability | 90%+ | AWS |
| Bedrock AgentCore evaluation frameworks | 13 | AWS |
| Agentic AI projects predicted to fail by 2027 | 40%+ | Gartner |
| Day-to-day decisions by agentic AI by 2028 | 15% | Gartner |
| Kindle team time savings with DevOps Agent | 80% | AWS |
| Infrastructure & Compute | | |
| EKS Ultra Scale max nodes per cluster | 100,000 | AWS |
| GKE max nodes (standard cluster) | 15,000 | AWS |
| AKS max nodes | 5,000 | AWS |
| Max Trainium accelerators per EKS cluster | 1.6 million | AWS |
| Anthropic Claude latency KPI improvement with EKS Ultra Scale | 35% → 90%+ | AWS |
| EKS scheduler throughput at 100K scale | 500 pods/sec | AWS |
| Graviton5 cores per chip | 192 | AWS |
| Graviton5 performance improvement vs Graviton4 | 25% | AWS |
| Top 1,000 AWS customers using Graviton | 98% | AWS |
| Trainium3 performance vs Trainium2 | 4.4x | AWS |
| Trainium3 cost reduction for AI training | 50% | AWS |
| Trainium3 energy efficiency improvement | 4x | AWS |
| Trainium3 PFLOPs per UltraServer (FP8) | 362 | AWS |
| Developer Experience | | |
| Lambda Durable Functions max workflow duration | 1 year | AWS |
| Database Savings Plans max savings (serverless) | 35% | AWS |
| Database Savings Plans savings (provisioned) | 20% | AWS |
| AWS Controllers for Kubernetes (ACK) CRDs | 200+ | AWS |
| ACK supported AWS services | 50+ | AWS |
| EKS Provisioned Control Plane (4XL) max nodes | 40,000 | AWS |
| EKS Provisioned Control Plane (4XL) max pods | 640,000 | AWS |
| Data Services | | |
| S3 Tables query performance improvement | Up to 3x | AWS |
| S3 Tables TPS improvement | Up to 10x | AWS |
| S3 Tables Intelligent-Tiering cost savings | Up to 80% | AWS |
| S3 Tables created since launch | 400,000+ | AWS |
| Aurora DSQL performance vs competitors | 4x faster | AWS |
| Aurora DSQL availability (multi-region) | 99.999% | AWS |

Executive Summary: What Matters Most

AWS re:Invent 2025 was dominated by three strategic themes:

  1. Agentic AI everywhere: From frontier agents (DevOps Agent, Security Agent, Kiro) to platform capabilities (Bedrock AgentCore) to browser automation (Nova Act), AWS is betting that autonomous AI will fundamentally change how software is built and operated.

  2. Scale as a competitive moat: EKS Ultra Scale's 100K-node support creates a 6-20x advantage over GKE and AKS. Combined with custom silicon (Graviton5, Trainium3), AWS is positioning itself as the only cloud that can handle next-generation AI training workloads.

  3. Developer experience simplification: Lambda Durable Functions eliminate Step Functions complexity, EKS Capabilities remove operational toil, natural language interfaces (EKS MCP Server) lower the barrier to Kubernetes operations.

For platform engineering teams, the message is clear: AI will handle operational toil (triage, analysis, routine fixes), humans will handle judgment calls (architecture, approval, verification). The teams that master this hybrid model will deliver 5-10x productivity gains. The teams that resist will struggle with mounting operational debt.


Part 1: The Agentic AI Revolution

The Shift from Assistants to Agents

AWS CEO Matt Garman set the tone in his keynote: "AI assistants are starting to give way to AI agents that can perform tasks and automate on your behalf."

The distinction matters:

AI Assistants are reactive. They wait for you to ask a question, then provide an answer. You drive the interaction.

AI Agents are autonomous. They observe systems, identify problems, analyze root causes, and either fix issues or propose fixes. They work for hours or days without constant human intervention. They navigate complex, multi-step workflows across multiple systems.

AWS announced three "frontier agents"—so named because they represent the cutting edge of what autonomous AI can do today.

💡 Key Takeaway: The agent paradigm fundamentally changes how platform teams interact with AI. Instead of asking questions, you delegate tasks. Instead of getting answers, you review proposed actions. The skill shifts from prompt engineering to evaluation and approval.

AWS DevOps Agent: 86% Root Cause Identification

The AWS DevOps Agent acts as an autonomous on-call engineer, working 24/7 without sleep or context-switching.

How it works:

  • Integrates with CloudWatch (metrics/logs), GitHub (deployment history), ServiceNow (incident management)
  • Correlates signals across sources that would take humans 30 minutes to gather
  • Identifies root causes in 86% of incidents based on AWS internal testing
  • Generates detailed mitigation plans with expected outcomes and risks
  • Humans approve before execution—the agent stops at the approval stage

Real-world impact: The Kindle team reported 80% time savings using CloudWatch Investigations, the underlying technology powering DevOps Agent.

Availability: Public preview in US East (N. Virginia), free during preview.

The critical insight: DevOps Agent handles triage and analysis—the tasks that consume the first 20-40 minutes of any incident. You make the decision with full context instead of spending that time gathering information. The role evolves from first responder to decision-maker.

💡 Key Takeaway: Start mapping how DevOps Agent fits with your existing incident management tools (PagerDuty, OpsGenie). Define approval processes now while it's in preview. Who can approve AI-generated fixes? What's the review bar? How do you handle disagreement with an agent's recommendation?

AWS Security Agent: Context-Aware Application Security

The AWS Security Agent goes beyond pattern matching to understand your application architecture.

Key capabilities:

  • AI-powered design reviews: Catches security issues in architecture decisions before code is written
  • Contextual code analysis: Understands data flow across your entire application, not just individual files
  • Intelligent penetration testing: Creates customized attack plans informed by security requirements, design documents, and source code

What makes it different: Traditional static analysis tools flag patterns ("this code uses eval"). Security Agent understands intent and context ("this admin endpoint uses eval for configuration, but it's protected by IAM and only accessible from VPC endpoints").

Availability: Public preview in US East (N. Virginia), free during preview. All data remains private—never used to train models.

💡 Key Takeaway: Security Agent shifts security left in a practical way. Instead of handing developers a list of CVEs to fix after code review, the agent participates earlier in the process—understanding context rather than just matching patterns.

Kiro: 250,000+ Developers Building with Autonomous Agents

Kiro is the autonomous developer agent that navigates across multiple repositories to fix bugs and submit pull requests. Over 250,000 developers are already using it globally.

Key differentiators:

  • Persistent context: Unlike chat-based assistants, Kiro maintains context across sessions for hours or days
  • Team learning: Understands your coding standards, test patterns, deployment workflows
  • Multi-repository navigation: Works across your entire codebase, not just single files
  • Pull request workflow: Submits proposed changes for human review before merge

Amazon made Kiro the official development tool across the company, using it internally at scale.

Startup incentive: Free Kiro Pro+ credits available through AWS startup program.

💡 Key Takeaway: Kiro represents the "developer agent" category—autonomous systems that can take development tasks and execute them across your codebase. The human review step remains critical, treating AI-generated code the same way you'd treat code from any new team member.

Amazon Bedrock AgentCore: Building Production-Ready Agents

Amazon Bedrock AgentCore is the platform for building custom AI agents. At re:Invent 2025, AWS announced major enhancements:

Policy in AgentCore (Preview): Set explicit boundaries using natural language. "This agent can read from this database but not write." "This agent can access production logs but not customer PII." Deterministic controls that operate outside agent code.

AgentCore Evaluations: 13 pre-built evaluation systems for monitoring agent quality—correctness, safety, tool selection accuracy. Continuous assessment for AI agent quality in production.

AgentCore Memory: Agents develop a log of information on users over time and use that information to inform future decisions. Episodic functionality allows agents to learn from past experiences.

Framework agnostic: Supports CrewAI, LangGraph, LlamaIndex, Google ADK, OpenAI Agents SDK, Strands Agents.

Adoption: In just five months since preview, AgentCore has seen 2 million+ downloads. Organizations include PGA TOUR (1,000% content writing speed improvement, 95% cost reduction), Cohere Health, Cox Automotive, Heroku, MongoDB, Thomson Reuters, Workday, and Swisscom.

💡 Key Takeaway: If you're building custom agents, AgentCore provides the production infrastructure—policy controls, memory, evaluations—that enterprises require. The framework-agnostic approach means you're not locked into AWS-specific patterns.

Amazon Nova Act: 90% Browser Automation Reliability

Amazon Nova Act is a service for building browser automation agents, powered by a custom Nova 2 Lite model optimized for UI interactions.

The 90% reliability claim: Nova Act achieves over 90% task reliability on early customer workflows, trained through reinforcement learning on hundreds of simulated web environments.

Use cases:

  • Form filling and data extraction
  • Shopping and booking flows
  • QA testing of web applications
  • CRM and ERP automation

Real-world results:

  • Hertz: Accelerated software delivery by 5x, eliminated QA bottleneck using Nova Act for end-to-end testing
  • Sola Systems: Automated hundreds of thousands of workflows per month
  • 1Password: Reduced manual steps for users accessing logins

What makes it work: Nova Act departs from standard training methods, using reinforcement learning inside synthetic "web gyms": simulated environments where agents practice against realistic UI scenarios.

💡 Key Takeaway: Browser automation has traditionally been fragile (Selenium tests breaking on minor UI changes). Nova Act's 90% reliability suggests a step-change in what's possible. Consider it for QA automation, internal tool workflows, and data extraction tasks.

The 40% Failure Warning: Why Agentic AI Projects Fail

Gartner predicts that over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear business value, or inadequate risk controls.

Primary causes:

  1. Inadequate data foundations: Agents need high-quality, timely, contextualized data. When agents act on outdated or incomplete data, results range from inefficiencies to outright failures.

  2. Data silos: Agents need to access information across systems, but most enterprises have data locked in disconnected silos without API access.

  3. Trust in data quality: If the data an agent uses is stale, incomplete, or inaccurate, the agent's outputs will be too.

  4. Cross-organizational governance: Who's responsible when an agent accesses data from multiple teams? What are the audit requirements?

  5. Data consumption patterns: Agents consume data differently than humans—they need APIs, not dashboards.

  6. "Agent washing": Many vendors rebrand existing RPA tools, chatbots, and AI assistants without substantial agentic capabilities. Gartner estimates only about 130 of thousands of agentic AI vendors are real.

The opportunity: Despite the high failure rate, Gartner predicts 15% of day-to-day work decisions will be made autonomously through agentic AI by 2028 (up from virtually none in 2024), and 33% of enterprise software applications will embed agentic AI by 2028 (vs less than 1% today).

💡 Key Takeaway: Platform teams thinking about agentic AI should start with a data readiness assessment. Are the systems these agents need to access actually accessible via API? Is the data fresh and accurate? Do you have governance frameworks in place? Without solid data foundations, even the most sophisticated agents will fail.

Werner Vogels' Verification Debt Concept

In his final re:Invent keynote after 14 years, Werner Vogels introduced a concept every platform engineer should internalize: verification debt.

The problem: AI generates code faster than humans can comprehend it. This creates a dangerous gap between what gets written and what gets understood. Every time you accept AI-generated code without fully understanding it, you're taking on verification debt. That debt accumulates until something breaks in production.

The solution: Code reviews become "the control point to restore balance."

Vogels was emphatic: "We all hate code reviews. It's like being a twelve-year-old and standing in front of the class. But the review is where we bring human judgment back into the loop."

His answer to the question on everyone's mind: "Will AI take my job? Maybe. Will AI make me obsolete? Absolutely not—if you evolve."

The Renaissance Developer framework (5 qualities):

  1. Be curious: AI lowers the barrier to learning—explore any technology in hours, not months
  2. Think in systems: Architecture matters more than ever—AI writes code, you design systems
  3. Communicate precisely: AI amplifies unclear thinking—vague prompts produce vague code
  4. Own your work: "Vibe coding is fine, but only if you pay close attention to what is being built"
  5. Become a polymath: Cross-disciplinary skills differentiate—breadth plus depth equals competitive advantage

💡 Key Takeaway: Organizations like Oxide Computer Company are already building verification debt into policy. Their internal LLM policy states: "Wherever LLM-generated code is used, it becomes the responsibility of the engineer." Engineers must self-review all LLM code before peer review. The closer code is to production, the greater care required.


Part 2: Infrastructure at Unprecedented Scale

EKS Ultra Scale: 100,000 Nodes per Cluster

Amazon EKS Ultra Scale now supports up to 100,000 worker nodes per cluster—a 6-20x advantage over competitors:

  • EKS: 100,000 nodes
  • GKE (standard): 15,000 nodes
  • AKS: 5,000 nodes

What this enables: Up to 1.6 million AWS Trainium accelerators or 800,000 NVIDIA GPUs in a single cluster. This is the scale required for training trillion-parameter models, whose training jobs can't easily be split across multiple clusters.

The technical breakthrough: The bottleneck at scale has always been etcd, Kubernetes' core data store. Etcd uses Raft consensus for replication, which works great at normal scale but becomes limiting at 100K nodes.

AWS's solution:

  1. Replaced etcd's Raft backend with "journal": An internal AWS component built over a decade that provides ultra-fast, ordered data replication with multi-AZ durability
  2. Moved etcd to in-memory storage (tmpfs): Order-of-magnitude performance wins—higher read/write throughput, predictable latencies, faster maintenance
  3. Doubled max database size to 20GB: More headroom for cluster state
  4. Partitioned key-space: Split hot resource types into separate etcd clusters, achieving 5x write throughput
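
The key-space partitioning idea can be shown with a toy sketch: route the "hot" resource types to dedicated stores and leave everything else on the main one. The shard names and mapping below are invented for illustration, not EKS internals:

```python
# Toy sketch of key-space partitioning: send high-churn Kubernetes resource
# types to their own etcd shards, as EKS does at ultra scale.
# PARTITIONS and the shard names are hypothetical.

PARTITIONS = {
    "pods": "etcd-pods",      # highest write volume at 100K nodes
    "events": "etcd-events",  # a classic etcd hot spot
}

def shard_for(resource_type: str) -> str:
    """Return the store responsible for a given resource type."""
    return PARTITIONS.get(resource_type, "etcd-main")

print(shard_for("pods"))        # etcd-pods
print(shard_for("configmaps"))  # etcd-main
```

Splitting write traffic this way is what lets each shard stay within its own throughput envelope, yielding the 5x aggregate write throughput claimed above.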

Performance results:

  • 500 pods/second scheduling throughput at 100K scale
  • Cluster contains 10+ million Kubernetes objects (100K nodes, 900K pods)
  • Aggregate etcd database size: 32GB across partitions
  • API latencies remain within Kubernetes SLO targets

Real-world adoption: Anthropic uses EKS Ultra Scale to train Claude. Their key latency KPI, the percentage of write API calls completing within 15ms, improved from an average of 35% to consistently above 90%.

💡 Key Takeaway: EKS Ultra Scale isn't just about bragging rights—it's about enabling AI workloads that simply can't run on other clouds. If your organization is training large models or running massive batch inference workloads, EKS is now the only Kubernetes platform that can handle it at scale.

Graviton5: 192 Cores, 25% Better Performance

AWS Graviton5 is AWS's most powerful and efficient CPU:

Specifications:

  • 192 cores per chip (up from 96 in Graviton4)
  • 25% better compute performance vs Graviton4
  • 33% lower inter-core latency
  • 5x larger L3 cache
  • Built on Arm Neoverse V3 architecture using TSMC's 3nm process

Adoption: 98% of AWS's top 1,000 customers are already using Graviton. For the third year in a row, more than half of new CPU capacity added to AWS is powered by Graviton.

Real-world results:

  • SAP: 35-60% performance improvement for S/4HANA workloads
  • Atlassian: 30% higher performance with significant cost reduction
  • Honeycomb: 36% better throughput for observability workloads

New instance types: M9g (general purpose), C9g (compute-optimized), R9g (memory-optimized) launching in 2026.

Price-performance advantage: Graviton5 delivers 40% better price-performance vs x86 equivalents, according to AWS benchmarks.

💡 Key Takeaway: Most container workloads compile seamlessly for ARM64. If you're not running Graviton, you're leaving 25-40% price-performance on the table. The migration patterns are well-established now—this is no longer experimental.

Trainium3: 4.4x Performance, 50% Cost Reduction

AWS Trainium3 UltraServers are AWS's answer to GPU supply constraints and high AI training costs:

Performance metrics:

  • 4.4x more compute performance vs Trainium2
  • 50% cost reduction for AI training
  • 362 FP8 petaflops per UltraServer
  • 144 Trainium3 chips per UltraServer
  • 4x better energy efficiency

Technical innovation: Built on TSMC's 3nm process, Trainium3 is AWS's first 3nm AI chip. EC2 UltraClusters 3.0 can connect thousands of UltraServers, scaling up to 1 million chips total.

Real-world adoption:

  • Anthropic: Using Trainium for Claude training, scaling to over 1 million Trainium2 chips by end of 2025, achieving 60% tensor engine utilization on Trainium2 and over 90% on Trainium3
  • Decart: Achieved 4x faster inference for real-time generative video at half the cost of GPUs
  • Metagenomi: Using for genomics research AI models
  • Ricoh: Using for document processing AI

Future roadmap: AWS announced Trainium4 on the roadmap, which will be NVIDIA NVLink compatible, signaling long-term commitment to custom AI silicon.

💡 Key Takeaway: Trainium3 changes AI economics for organizations willing to optimize for AWS's custom silicon. If you're evaluating AI infrastructure and can adapt your training pipelines, Trainium is now a serious alternative to NVIDIA at half the cost.

Lambda Durable Functions: Year-Long Workflows

AWS Lambda Durable Functions fundamentally changed what serverless can do.

The old constraint: Lambda timeout is 15 minutes. Complex workflows required Step Functions.

The new capability: Build stateful workflows directly in Lambda that run from seconds to 1 full year.

Two new primitives:

  1. context.step(): Creates durable checkpoints. Your function executes some code, checkpoints the result, and if anything fails, it resumes from that checkpoint.

  2. context.wait(): Suspends execution and resumes when an event arrives. You can wait for human approval, external API callbacks, timer expirations—all natively in Lambda.

How it works: Lambda keeps a running log of all durable operations (steps, waits) as your function executes. When your function needs to pause or encounters an interruption, Lambda saves this checkpoint log and stops execution. When it's time to resume, Lambda invokes your function again from the beginning and replays the checkpoint log, substituting stored values for completed operations.

Example use case: A data pipeline that fetches data, waits up to 7 days for human approval, then processes the data after approval. In the old world: Step Functions state machine, callback patterns, state store management. Now: 3 lines of code with context.step() and context.wait().

Additional operations: create_callback() (await external events or human approvals), wait_for_condition() (pause until specific condition met), parallel() and map() for advanced concurrency.
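
The checkpoint-and-replay mechanism can be sketched in a few lines of Python. This is an illustration of the replay idea only; the DurableContext class and its step() method here are hypothetical stand-ins, not the AWS SDK:

```python
# Minimal sketch of checkpoint-log replay, the mechanism behind Lambda
# Durable Functions as described above. DurableContext is hypothetical.

class DurableContext:
    def __init__(self, checkpoint_log=None):
        # Results of steps completed in a previous (interrupted) run.
        self.log = list(checkpoint_log or [])
        self.pos = 0

    def step(self, fn):
        if self.pos < len(self.log):
            result = self.log[self.pos]  # replay: reuse the stored result
        else:
            result = fn()                # first run: execute and checkpoint
            self.log.append(result)
        self.pos += 1
        return result

calls = []

def workflow(ctx):
    a = ctx.step(lambda: calls.append("fetch") or 42)
    b = ctx.step(lambda: calls.append("transform") or a * 2)
    return b

# First run executes both steps and records checkpoints.
ctx1 = DurableContext()
assert workflow(ctx1) == 84

# A "resumed" run replays the log: same result, no side effects re-run.
ctx2 = DurableContext(checkpoint_log=ctx1.log)
assert workflow(ctx2) == 84
assert calls == ["fetch", "transform"]  # each step executed exactly once
```

This replay trick is why your function can be re-invoked from the beginning after a pause yet behave as if it had simply continued.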

Timeout settings:

  • Lambda function timeout (max 15 minutes): Limits each individual invocation
  • Durable execution timeout (max 1 year): Limits total time from start to completion

Availability: Generally available in US East (Ohio) with support for Python 3.13/3.14 and Node.js 22/24 runtimes.

💡 Key Takeaway: If you're using Step Functions for straightforward state management, Lambda Durable might be simpler. It's not replacing Step Functions for complex orchestration, but it eliminates a lot of boilerplate for common patterns like human approval workflows, long-running data pipelines, and event-driven orchestration.

Database Savings Plans: Up to 35% Savings

AWS Database Savings Plans offer a flexible pricing model:

Savings breakdown:

  • Serverless deployments: Up to 35% savings
  • Provisioned instances: Up to 20% savings
  • DynamoDB/Keyspaces on-demand: Up to 18% savings
  • DynamoDB/Keyspaces provisioned: Up to 12% savings

Coverage: Aurora, RDS, DynamoDB, ElastiCache, DocumentDB, Neptune, Keyspaces, Timestream, and AWS Database Migration Service across all regions (except China).

Flexibility: Commitment automatically applies regardless of engine, instance family, size, deployment option, or Region. You can change between Aurora db.r7g and db.r8g instances, shift workloads from EU (Ireland) to US (Ohio), modernize from RDS for Oracle to Aurora PostgreSQL, or from RDS to DynamoDB—and still benefit from discounted pricing.

Commitment: One-year term with no upfront payment required (at launch).

Limitations: Excludes SimpleDB, Timestream LiveAnalytics, Neptune Analytics, Redis, MemoryDB, Memcached, China regions, and AWS Outposts. Only covers instance and serverless usage—storage, backup, IO not included.
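The ROI math is simple enough to sanity-check in a few lines. The spend figure below is hypothetical, and the discounts shown are the maximum rates from the table above; actual rates vary by engine, deployment option, and region:

```python
# Back-of-the-envelope Database Savings Plans math.
# $10K/month is a made-up example; 35% and 20% are the maximum
# serverless and provisioned discounts respectively.

def annual_savings(monthly_spend: float, discount: float) -> float:
    """Savings over a one-year term at a given discount rate."""
    return monthly_spend * 12 * discount

print(annual_savings(10_000, 0.35))  # 42000.0 for serverless at 35%
print(annual_savings(10_000, 0.20))  # 24000.0 for provisioned at 20%
```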

💡 Key Takeaway: This is an easy cost optimization lever. If your database spend is stable and predictable, commit today. Stack it with Reserved Instances where applicable. The ROI calculation is straightforward: stable spend equals immediate savings.


Part 3: Kubernetes Evolution and Cloud Operations

EKS Capabilities: Managed Argo CD, ACK, and KRO

Amazon EKS Capabilities eliminates operational toil for platform teams:

The problem: Platform teams have been running Argo CD for GitOps and ACK for managing AWS resources from Kubernetes. But maintaining these systems is real work—patching, upgrading, ensuring compatibility, handling scaling.

AWS's solution: EKS Capabilities makes all of that AWS's problem. These capabilities run in AWS service-owned accounts that are fully abstracted from you. AWS handles infrastructure scaling, patching, updates, and compatibility analysis.

Three capabilities:

  1. Managed Argo CD: Fully managed Argo CD instance that can deploy applications across multiple clusters. Git becomes your source of truth, Argo automatically remediates drift. The CNCF 2024 survey showed 45% of Kubernetes users are running Argo CD in production or planning to.

  2. AWS Controllers for Kubernetes (ACK): Manage AWS resources using Kubernetes CRDs. Provides over 200 CRDs for more than 50 AWS services. Create S3 buckets, RDS databases, IAM roles—all from YAML. No need to install or maintain controllers yourself.

  3. Kube Resource Orchestrator (KRO): Platform teams create reusable resource bundles that hide complexity. Developers consume these abstractions without needing to understand the underlying details. This is how you build your internal developer platform on Kubernetes.
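
For instance, with ACK an S3 bucket is just a manifest. The field names below follow the public ACK s3-controller Bucket CRD, but verify against the CRD version in your cluster; the bucket names are placeholders:

```yaml
# Hypothetical ACK manifest: declares an S3 bucket as a Kubernetes resource.
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: analytics-artifacts        # Kubernetes object name (placeholder)
spec:
  name: analytics-artifacts-prod   # actual S3 bucket name to create
```

Apply it with `kubectl apply -f bucket.yaml` and the controller reconciles the bucket in your AWS account, the same way it would a Deployment.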

Multi-cluster architecture: Run all three capabilities in a centrally managed cluster. Argo CD on that management cluster deploys applications to workload clusters across different regions or accounts. ACK provisions AWS resources for all clusters. KRO creates portable platform abstractions that work everywhere.

Pricing: Per-capability, per-hour billing with no upfront commitments. Additional charges for specific Kubernetes resources managed by the capabilities.

💡 Key Takeaway: GitOps becomes turnkey with EKS Capabilities. The maintenance burden of running Argo CD and ACK disappears. That's real operational toil that goes away, freeing platform teams to focus on higher-value work like building abstractions and improving developer experience.

EKS MCP Server: Natural Language Kubernetes Management

The EKS MCP Server lets you manage Kubernetes clusters using natural language instead of kubectl.

What is MCP?: Model Context Protocol is an open-source standard that gives AI models secure access to external tools and data sources. Think of it as a standardized interface that enriches AI applications with real-time, contextual knowledge.

What the EKS MCP Server does:

  • Say "show me all pods not in running state" → it just works
  • Say "create a new EKS cluster named demo-cluster with VPC and Auto Mode" → it does it
  • Get logs, check deployments, create clusters—all through conversation
  • No kubectl, no kubeconfig required

Enterprise features:

  • Hosted in AWS cloud: No local installation or maintenance
  • Automatic updates and patching
  • AWS IAM integration for security
  • CloudTrail integration for audit logging
  • Knowledge base built from AWS operational experience managing millions of Kubernetes clusters

AI tool integrations: Works with Kiro (AWS's IDE and CLI), Cursor, Cline, Amazon Q Developer, or custom agents you build.
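Wiring an MCP server into one of these clients is typically a small JSON entry. As an illustration only: the open-source awslabs EKS MCP server (a locally run variant, distinct from the hosted preview described above) is configured roughly like this; the package name, launch command, and environment variable are taken from the awslabs MCP project and may differ for the hosted server:

```json
{
  "mcpServers": {
    "awslabs.eks-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.eks-mcp-server@latest"],
      "env": { "AWS_REGION": "us-east-1" }
    }
  }
}
```

The region value is a placeholder; the client then routes "show me all pods not in running state"-style requests through the server's tools.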

Availability: Preview release.

💡 Key Takeaway: The MCP Server changes who can operate Kubernetes clusters. AWS is betting that conversational AI turns multi-step manual tasks into simple requests. The barrier to Kubernetes operations just dropped significantly—which has implications for team structure, skill requirements, and developer self-service.

EKS Provisioned Control Plane: Guaranteed Performance

Amazon EKS Provisioned Control Plane provides guaranteed SLAs for production workloads:

The problem: Standard EKS control planes have variable performance. Under burst loads, you can get unpredictable behavior.

The solution: Pre-allocate control plane capacity with well-defined performance characteristics.

T-shirt sizing:

| Tier | API Request Concurrency | Pod Scheduling Rate | Cluster Database Size | Stress Test Results | Pricing |
| --- | --- | --- | --- | --- | --- |
| XL | 1,700 concurrent requests | 100 pods/sec | 5GB | 10,000 nodes, 160K pods | $1.65/hr |
| 2XL | 3,400 concurrent requests | 200 pods/sec | 10GB | 20,000 nodes, 320K pods | $3.30/hr |
| 4XL | 6,800 concurrent requests | 400 pods/sec | 20GB | 40,000 nodes, 640K pods | $6.90/hr |

When to use: Enterprises needing guaranteed SLAs for production workloads, especially those with burst traffic patterns or large-scale deployments.

Flexibility: You can switch tiers as workloads change, or revert to standard control plane during quieter periods.

💡 Key Takeaway: For mission-critical workloads where control plane performance SLAs matter, Provisioned Control Plane provides predictable capacity. The 4XL tier's ability to handle 40,000 nodes and 640,000 pods (8x improvement over standard) makes it suitable for large enterprises consolidating multiple clusters.

CloudWatch Generative AI Observability

CloudWatch Gen AI Observability provides comprehensive monitoring for AI applications and agents:

What it does: Built-in insights into latency, token usage, and errors across your AI stack—no custom instrumentation required.

Framework support:

  • Amazon Bedrock AgentCore (native integration)
  • LangChain, LangGraph, CrewAI (open-source agentic frameworks)

Why it matters: Agent observability has been a gap. You deploy an agent, and when something goes wrong, you're debugging in the dark. Now you have proper tracing and metrics out of the box.

Additional CloudWatch updates:

  1. MCP Servers for CloudWatch: Bridge AI assistants to observability data—standardized access to metrics, logs, alarms, traces, and service health data

  2. Unified Data Store: Automates collection from AWS and third-party sources (CrowdStrike, Microsoft 365, SentinelOne). Everything stored in S3 Tables with OCSF and Apache Iceberg support. First copy of centralized logs incurs no additional ingestion charges.

  3. Application Signals GitHub Action: Provides observability insights during pull requests and CI/CD pipelines. Developers can identify performance regressions without leaving their development environment.

  4. Database Insights: Cross-account and cross-region monitoring for RDS, Aurora, and DynamoDB from a single monitoring account.

💡 Key Takeaway: As more teams deploy AI agents, observability becomes critical. CloudWatch's native support for agentic frameworks (LangChain, CrewAI) and end-to-end tracing means you can monitor agent performance, identify bottlenecks, and debug failures—just like you do for traditional applications.


Part 4: Data Services for AI Workloads

S3 Tables with Apache Iceberg: 3x Faster Queries

Amazon S3 Tables is AWS's first cloud object store with built-in Apache Iceberg support:

Performance improvements:

  • Up to 3x faster query performance
  • Up to 10x higher transactions per second (TPS)
  • Automated table maintenance for analytics workloads

Adoption: Over 400,000 tables created since launch.

Key updates at re:Invent 2025:

  1. Intelligent-Tiering support: Automatically optimizes table data across three access tiers (Frequent Access, Infrequent Access, Archive Instant Access) based on access patterns—delivering up to 80% storage cost savings without performance impact or operational overhead. S3 Intelligent-Tiering has saved customers over $6 billion to date.

  2. Automatic replication across AWS Regions and accounts: Simplifies disaster recovery and multi-region analytics.

Use cases:

  • Data lakes requiring ACID transactions
  • Analytics workloads with high query concurrency
  • Change data capture (CDC) from Aurora Postgres/MySQL for near real-time analytics
  • Multi-engine access (Athena, Redshift, EMR, Spark)

💡 Key Takeaway: S3 Tables simplifies data lake management with native Apache Iceberg support and ACID transactions. If you're building data lakes or analytics platforms, the combination of 10x TPS improvement and 80% cost savings via Intelligent-Tiering is compelling.

Aurora DSQL: Distributed SQL with 99.999% Availability

Amazon Aurora DSQL is a new serverless, distributed SQL database:

Key features:

  • Effectively unlimited horizontal scaling: Independent scaling of reads, writes, compute, and storage
  • PostgreSQL-compatible: Supports common PostgreSQL drivers, tools, and core relational features (ACID transactions, SQL queries, secondary indexes, joins)
  • 99.999% multi-region availability: Strong consistency across regions
  • 4x faster than competitors: According to AWS benchmarks

Technical innovation: DSQL decouples transaction processing from storage and uses optimistic concurrency control: statements execute without cross-region coordination, and conflicts are checked only once, at commit time. This architectural separation enables the performance and scalability improvements.

Deployment: Create new clusters with a single API call, begin using a PostgreSQL-compatible database within minutes.

Coming soon: Native integrations on Vercel Marketplace and v0—developers can connect to Aurora PostgreSQL, Aurora DSQL, or DynamoDB in seconds.

💡 Key Takeaway: Aurora DSQL addresses the distributed SQL challenge for SaaS applications that need strong consistency across regions. The ability to maintain ACID guarantees while scaling horizontally has traditionally required complex coordination—DSQL makes it turnkey.
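
Databases that defer conflict checks to commit time push retries onto the client, so application code should wrap transactions in a retry loop. A database-agnostic sketch; `ConflictError` and `run_txn` here are stand-ins, not the DSQL driver API:

```python
import random
import time

class ConflictError(Exception):
    """Stand-in for a serialization/commit conflict raised by the driver."""

def with_retries(run_txn, attempts: int = 5, base_delay: float = 0.05):
    """Retry a transaction on commit conflicts with jittered exponential backoff."""
    for attempt in range(attempts):
        try:
            return run_txn()
        except ConflictError:
            if attempt == attempts - 1:
                raise
            # Full jitter keeps concurrent retriers from colliding again.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

# Toy transaction that conflicts twice, then succeeds.
calls = {"n": 0}
def flaky_txn():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConflictError()
    return "committed"

print(with_retries(flaky_txn))  # prints "committed" after two retries
```

The same pattern applies to any optimistic-concurrency store; only the exception type caught changes per driver.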


What This Means for Your Team: Decision Frameworks​

Framework 1: Should You Adopt AWS DevOps Agent?​

Evaluate if you answer YES to 3+:

  • Your team handles 10+ incidents per week
  • Mean time to identify (MTTI) is >20 minutes
  • You have multiple observability tools (CloudWatch, GitHub, ServiceNow)
  • On-call engineers spend >30% time on triage
  • You're willing to invest in defining approval processes

If YES: Start with preview in non-production environment. Map integration points with existing incident management tools. Define approval workflows. Train team on evaluating AI-generated mitigation plans.

If NO: Wait for GA and customer case studies showing production results.

Framework 2: Should You Migrate to EKS Ultra Scale?​

Evaluate if you answer YES to 2+:

  • You're training AI models requiring 10,000+ GPUs
  • You need >15,000 nodes in a single cluster (GKE limit)
  • Your workloads can't be easily distributed across multiple clusters
  • You're hitting etcd performance limits in existing clusters
  • You're willing to run on Trainium or large-scale GPU instances

If YES: EKS Ultra Scale is the only Kubernetes platform that can handle your scale. Start planning migration.

If NO: Standard EKS is sufficient. Monitor your node count growth—plan migration when you cross 10K nodes.

Framework 3: Should You Adopt EKS Capabilities?​

Evaluate if you answer YES to 3+:

  • You're running Argo CD or planning GitOps adoption
  • You manage AWS resources from Kubernetes (or want to)
  • Your team spends >8 hours/month on Argo CD/ACK maintenance
  • You operate multi-cluster environments
  • You want to build internal developer platform abstractions

If YES: EKS Capabilities eliminates operational toil. The per-capability hourly pricing is likely cheaper than the engineering time spent on maintenance.

If NO: Continue self-hosting if you need deep customization or have existing automation that works well.

Framework 4: Should You Use Lambda Durable Functions?​

Evaluate if you answer YES to 2+:

  • You have workflows requiring human approval steps
  • You need workflows that run longer than 15 minutes but less than 1 year
  • Your Step Functions state machines are mostly linear (not complex branching)
  • You want to reduce state management boilerplate
  • You're willing to use Python 3.13+/Node.js 22+

If YES: Lambda Durable simplifies common state management patterns. Start migrating straightforward Step Functions workflows.

If NO: Keep using Step Functions for complex orchestration with parallel branches, error handling, and integration with 200+ AWS services.
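
The pattern Lambda Durable targets, checkpointed steps that survive pauses and restarts, can be illustrated with a toy in-memory version. Every name here is illustrative; this is not the AWS SDK surface:

```python
# Toy illustration of the durable-execution pattern behind services like
# Lambda Durable Functions: each step's result is checkpointed, so a
# restarted workflow replays from stored results instead of redoing work.
# All names here are illustrative; this is not the AWS SDK.
checkpoints: dict[str, object] = {}
executions: list[str] = []

def step(name, fn):
    if name in checkpoints:        # already done: replay from checkpoint
        return checkpoints[name]
    result = fn()                  # first time: do the work
    checkpoints[name] = result     # persist before moving on
    executions.append(name)
    return result

def workflow():
    order = step("create_order", lambda: {"id": 42})
    approval = step("await_approval", lambda: "approved")  # could pause for days
    return step("fulfill", lambda: f"order {order['id']} {approval}")

first = workflow()   # runs all three steps
second = workflow()  # "restart": replays entirely from checkpoints
print(first, second)
```

On the second call nothing re-executes: each step replays from its checkpoint, which is what lets a real durable function pause for a human approval and resume days later without holding compute.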

Framework 5: Should You Invest in Trainium3?​

Evaluate if you answer YES to 3+:

  • You're training or fine-tuning large language models
  • AI training costs are >$100K/month
  • You can adapt training pipelines to AWS custom silicon
  • You're willing to invest in optimization for 50% cost reduction
  • You're planning multi-year AI infrastructure commitments

If YES: Trainium3's 4.4x performance and 50% cost reduction justify the optimization investment. Follow Anthropic's playbook—they achieved 60% utilization on Trainium2 and 90%+ on Trainium3.

If NO: Stick with NVIDIA GPUs if you need maximum ecosystem compatibility and existing training pipelines work well.
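
Since all five frameworks are yes/no checklists with a threshold, they are easy to encode for a team survey. A trivial sketch (the function and labels are mine, not from AWS):

```python
def evaluate(answers: list[bool], threshold: int) -> str:
    """Return the recommendation for a yes/no checklist with a YES threshold."""
    yes = sum(answers)
    return f"ADOPT ({yes} yes)" if yes >= threshold else f"WAIT ({yes} yes)"

# Framework 1 (DevOps Agent) needs 3+ YES answers out of 5.
devops_agent = [True, True, False, True, False]   # example survey results
print(evaluate(devops_agent, threshold=3))        # prints "ADOPT (3 yes)"

# Framework 2 (EKS Ultra Scale) needs only 2+ YES answers.
ultra_scale = [False, True, False, False, False]
print(evaluate(ultra_scale, threshold=2))         # prints "WAIT (1 yes)"
```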


Comparison: AWS vs GCP vs Azure for Platform Engineering​

| Capability | AWS (re:Invent 2025) | GCP | Azure |
|---|---|---|---|
| Kubernetes Scale | EKS: 100,000 nodes | GKE: 15,000 nodes (standard) | AKS: 5,000 nodes |
| Custom AI Chips | Trainium3 (4.4x, 50% cost reduction) | TPU v5p/v6e | Azure Maia 100 (preview) |
| Custom CPUs | Graviton5 (192 cores, 25% faster) | Axion (Arm, preview) | Cobalt 100 (Arm, preview) |
| Serverless Workflows | Lambda Durable (1 year max) | Cloud Run/Workflows (no native durable) | Durable Functions (unlimited) |
| Managed GitOps | EKS Capabilities (Argo CD managed) | Config Sync, Anthos | Flux (self-managed) |
| AI Agents | DevOps Agent (86% accuracy), Security Agent, Kiro (250K users) | Gemini Code Assist, Duet AI | GitHub Copilot integration |
| Database Savings | 35% (serverless), 20% (provisioned) | Committed Use Discounts (CUDs) | Reserved Capacity (35%) |
| Data Lakes | S3 Tables (Iceberg, 3x faster, 10x TPS) | BigLake (Iceberg support) | OneLake (Fabric, Delta Lake) |

Where AWS leads:

  • Kubernetes scale (6-20x advantage)
  • Custom silicon maturity (98% of top 1000 customers on Graviton)
  • Agentic AI breadth (3 frontier agents + AgentCore platform)
  • Managed GitOps (EKS Capabilities vs self-managed alternatives)

Where competitors lead:

  • Azure: Durable Functions unlimited duration (vs Lambda's 1 year)
  • GCP: BigQuery performance for analytics, Cloud Run simplicity
  • Azure: GitHub integration (Microsoft ownership), native AD/Entra ID

đź’ˇ Key Takeaway: AWS is positioning itself as the platform for AI-scale workloads. If your organization is training large models, running massive batch inference, or building agentic AI applications, AWS has the most comprehensive stack. For traditional web/mobile workloads, the differences are less pronounced.


Action Plan for Platform Engineering Teams​

Immediate Actions (Next 30 Days)​

  1. Data readiness assessment: Before investing in agentic AI, audit your data foundations. Are systems accessible via API? Is data fresh and accurate? Do you have governance frameworks?

  2. Test DevOps Agent in preview: Integrate with one non-production environment. Map how it fits with PagerDuty/OpsGenie. Define approval processes.

  3. Evaluate Database Savings Plans: If database spend is stable, commit today for immediate 20-35% savings.

  4. Audit Graviton readiness: Identify which workloads can migrate to ARM64. Most containers work seamlessly—you're leaving 25-40% price-performance on the table.

  5. Review Lambda workflows: Identify Step Functions state machines that are mostly linear. Migrate to Lambda Durable for reduced boilerplate.

Medium-term (Next 90 Days)​

  1. Define verification debt protocols: Establish code review processes for AI-generated code. Who can approve? What's the review bar? Document expectations.

  2. Experiment with EKS Capabilities: If you're running Argo CD or ACK, test managed versions. Calculate time savings from eliminating maintenance toil.

  3. Build agent evaluation framework: If you're developing custom agents, implement AgentCore Evaluations. Define quality metrics (correctness, safety, tool selection accuracy).

  4. Map EKS scale requirements: Project node count growth over next 24 months. If you'll exceed 15K nodes, plan EKS Ultra Scale migration.

  5. Pilot natural language ops: Test EKS MCP Server with subset of team. Evaluate impact on developer self-service and support ticket volume.

Long-term (Next 12 Months)​

  1. Skill evolution plan: Shift team skills from writing runbooks to evaluating AI mitigation plans. This is a different skillset—invest in training.

  2. Platform abstraction strategy: Use KRO (Kube Resource Orchestrator) to build internal developer platform abstractions. Hide infrastructure complexity.

  3. AI infrastructure evaluation: If you're training large models, run cost comparison between Trainium3 and NVIDIA GPUs. Anthropic's 50% cost reduction at 90% utilization is the benchmark.

  4. Renaissance Developer framework: Adopt Werner Vogels' 5 qualities. Invest in system thinking, precise communication, polymath skills.

  5. Agent-first architecture: Design new systems assuming AI agents will interact with them. Provide APIs, not dashboards. Implement policy controls, audit logging, explicit boundaries.


The 2026 Outlook: Three Predictions​

Prediction 1: Human-in-the-Loop Becomes Industry Standard​

AWS's frontier agents all stop at the approval stage. This pattern will become the industry standard for mission-critical systems. Organizations that automate too aggressively (removing human approval) will suffer high-profile failures that set the industry back.

Why it matters: Platform teams should invest in approval workflows, not full automation. The skill evolution is from first responder to decision-maker with AI-generated context.

Prediction 2: Data Foundations Separate Winners from Losers​

Gartner's 40% failure prediction will prove accurate. The primary differentiator won't be which AI models you use—it'll be whether your data is accessible, accurate, and governed. Organizations with strong data foundations will see 5-10x productivity gains. Organizations with data silos will struggle.

Why it matters: Data readiness assessment should be your first step before any agentic AI investment. Without solid foundations, even the most sophisticated agents will fail.

Prediction 3: Kubernetes Scale Becomes a Competitive Moat​

EKS's 100K-node support creates a 6-20x advantage over GKE and AKS. As AI training workloads require increasingly large single-cluster deployments, organizations will consolidate on AWS. Google and Microsoft will respond, but AWS has a 12-24 month head start.

Why it matters: If your organization is building AI-first products requiring large-scale training, AWS is the only cloud that can handle it today. Make architectural decisions accordingly.


Conclusion: The AI-Native Platform Era​

AWS re:Invent 2025 marked the transition from cloud-native to AI-native platform engineering.

The key shifts:

  1. From reactive to autonomous: AI agents (DevOps Agent, Security Agent, Kiro) handle operational toil, humans handle judgment calls
  2. From limited scale to unlimited scale: EKS Ultra Scale's 100K nodes enables workloads that simply can't run elsewhere
  3. From generic hardware to purpose-built silicon: Graviton5 and Trainium3 deliver 25-50% cost advantages through vertical integration
  4. From complex orchestration to simple primitives: Lambda Durable Functions eliminate Step Functions boilerplate for common patterns
  5. From manual operations to natural language: EKS MCP Server enables conversational cluster management

Werner Vogels' verification debt warning should be internalized by every platform engineer. AI speed creates new risks. Code reviews are more important than ever. Organizations that embrace the Renaissance Developer framework—curious, systems-thinking, precise communication, ownership, polymath—will thrive. Organizations that resist will accumulate technical debt faster than they can pay it down.

The teams that master the hybrid model—AI handles triage and analysis, humans handle architecture and approval—will deliver 5-10x productivity gains. The teams that resist will struggle with mounting operational burden as systems grow more complex.

The autonomous DevOps future isn't coming. It's already here. The question isn't whether to engage with it. It's how to shape it for your team.


Sources​

AWS Official Announcements​

Industry Analysis​

AWS re:Invent 2025: The Agentic AI Revolution for Platform Engineering Teams

· 15 min read
VibeSRE
Platform Engineering Contributor

🎙️ Listen to the podcast episode: Episode #049: AWS re:Invent 2025 - The Agentic AI Revolution - A deep dive into AWS's frontier agents and what they mean for platform engineering teams.

TL;DR​

AWS re:Invent 2025 marked a fundamental shift from AI assistants to autonomous AI agents. Three "frontier agents" were announced: DevOps Agent for incident response, Security Agent for application security, and Kiro for autonomous development. Werner Vogels coined "verification debt" to warn about AI generating code faster than humans can understand it. Gartner predicts 40% of agentic AI projects will fail by 2027 due to inadequate data foundations. Platform teams should focus on integration readiness, trust protocols, and skill evolution—not wholesale replacement.


Key Statistics​

| Metric | Value | Source |
|---|---|---|
| Kiro developers globally | 250,000+ | AWS |
| AWS DevOps Agent root cause identification | 86% | AWS |
| Nova Act browser automation reliability | 90% | AWS |
| Agentic AI projects predicted to fail by 2027 | 40%+ | Gartner |
| Bedrock AgentCore downloads | 2 million+ | AWS |
| AgentCore Evaluations frameworks | 13 | AWS |
| PGA TOUR content speed improvement with AgentCore | 1,000% | AWS |
| Day-to-day decisions by agentic AI by 2028 | 15% | Gartner |

The Shift from Assistants to Agents​

AWS CEO Matt Garman set the tone in his re:Invent 2025 keynote: "AI assistants are starting to give way to AI agents that can perform tasks and automate on your behalf."

This isn't just marketing. The distinction matters:

AI Assistants are reactive. They wait for you to ask a question, then provide an answer. You drive the interaction.

AI Agents are autonomous. They observe systems, identify problems, analyze root causes, and either fix issues or propose fixes. They work for hours or days without constant human intervention. They navigate complex, multi-step workflows across multiple systems.

AWS announced three "frontier agents" at re:Invent 2025—so named because they represent the cutting edge of what autonomous AI can do today. These aren't simple chatbots. They're designed to handle enterprise-scale complexity.

đź’ˇ Key Takeaway: The agent paradigm fundamentally changes how platform teams interact with AI. Instead of asking questions, you delegate tasks. Instead of getting answers, you review proposed actions. The skill shifts from prompt engineering to evaluation and approval.


AWS DevOps Agent: Your Autonomous On-Call Engineer​

The AWS DevOps Agent is designed to accelerate incident response and improve system reliability. Think of it as an autonomous on-call engineer that works 24/7—no sleep, no coffee breaks, no context-switching.

How It Works​

The DevOps Agent integrates with your existing observability stack:

  • CloudWatch for metrics and logs
  • GitHub for code and deployment history
  • ServiceNow for incident management
  • Other tools via API integrations

When an incident occurs, the agent pulls data from all sources simultaneously. It correlates signals that a human might take 30 minutes to gather: error rates spiking in CloudWatch, a recent deployment in GitHub, similar incidents in ServiceNow history.

According to AWS, internal use of the DevOps Agent identified root causes in 86% of incidents.

The Critical Limitation​

The DevOps Agent stops short of making fixes automatically. Once it identifies the root cause, it generates a detailed mitigation plan:

  • The specific change to make
  • Expected outcomes
  • Associated risks

An engineer reviews that plan and approves it before anything gets executed.

AWS documentation states explicitly: "To keep frontier agents from breaking critical systems, humans remain the gatekeepers."

Availability​

AWS DevOps Agent is available in public preview in US East (N. Virginia) at no additional cost during preview.

💡 Key Takeaway: The DevOps Agent handles triage and analysis—the tasks that consume the first chunk of any incident. You make the decision with full context instead of spending 30 minutes gathering that context yourself. The role evolves, but it doesn't disappear.


AWS Security Agent: Context-Aware Application Security​

The AWS Security Agent secures applications from design through deployment. What makes it different from traditional security tools is that it's context-aware—it actually understands your application architecture.

Beyond Pattern Matching​

Traditional static analysis tools look for patterns: "This code uses eval, that's potentially dangerous." "This SQL query isn't parameterized, that's a risk."

The Security Agent goes deeper. It understands what your application is trying to accomplish:

  • AI-powered design reviews: Catches security issues in architecture decisions before code is written
  • Contextual code analysis: Understands data flow across your entire application
  • Intelligent penetration testing: Creates customized attack plans informed by security requirements, design documents, and source code

AWS says customers report receiving penetration testing results "within hours compared to what would have taken weeks of scheduling and back-and-forth communication between teams."

How Organizations Use It​

Security teams define organizational security requirements once: approved encryption libraries, authentication frameworks, logging standards. The Security Agent then automatically validates these requirements throughout development, providing specific guidance when violations are detected.

Availability​

AWS Security Agent is available in public preview in US East (N. Virginia), free during preview. All data remains private—queries and data are never used to train models.

💡 Key Takeaway: The Security Agent shifts security left in a practical way. Instead of handing developers a list of CVEs to fix, the agent participates earlier in the process—understanding context rather than just matching patterns.


Kiro: The Autonomous Developer Agent​

Kiro is the autonomous developer agent that navigates across multiple code repositories to fix bugs and submit pull requests. Over 250,000 developers are already using it globally.

What Makes Kiro Different​

Amazon has made Kiro the official development tool across the company. It learns from your team's specific processes and practices:

  • Understands your coding standards
  • Learns your test patterns
  • Adapts to your deployment workflows

When it submits work, it comes as a proposed pull request. A human reviews the code before it gets merged.

Amazon describes it as "another member of your team"—but a team member whose work you always review before it ships.

Persistent Context​

Unlike chat-based AI assistants, Kiro maintains persistent context across sessions. It doesn't run out of memory and forget what it was supposed to do. It can be handed tasks and work on its own for hours or days with minimal human intervention.

For teams, the Kiro autonomous agent is a shared resource that builds a collective understanding of your codebase, products, and standards. It connects to repos, pipelines, and tools like Jira, GitHub, and Slack to maintain context as work progresses.

Startup Incentive​

Amazon is offering free Kiro Pro+ credits to qualified startups through the AWS startup program.

💡 Key Takeaway: Kiro represents the "developer agent" category—autonomous systems that can take development tasks and execute them across your codebase. The human review step remains critical, treating AI-generated code the same way you'd treat code from any new team member.


Amazon Bedrock AgentCore: Building Your Own Agents​

Amazon Bedrock AgentCore is the platform for building production-ready AI agents. At re:Invent 2025, AWS announced major enhancements addressing the biggest challenges enterprises face.

Key New Capabilities​

Policy in AgentCore (Preview): Set explicit boundaries for what agents can and cannot do using natural language. "This agent can read from this database but not write." "This agent can access production logs but not customer PII." These are deterministic controls that operate outside agent code.
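
The Policy examples above reduce to deterministic, deny-by-default permission checks enforced outside the agent's code. A minimal sketch of that pattern, not the AgentCore API:

```python
# Deterministic, deny-by-default permission check of the kind a policy
# layer enforces outside agent code. Not the AgentCore API, just the pattern.
POLICY = {
    "ops-agent": {("orders_db", "read"), ("prod_logs", "read")},
    # note: no ("orders_db", "write") and no ("customer_pii", "read")
}

def is_allowed(agent: str, resource: str, action: str) -> bool:
    return (resource, action) in POLICY.get(agent, set())

assert is_allowed("ops-agent", "orders_db", "read")
assert not is_allowed("ops-agent", "orders_db", "write")    # writes denied
assert not is_allowed("ops-agent", "customer_pii", "read")  # PII denied
```

The key property is that the check runs outside the model: no matter what the agent generates, the boundary holds.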

AgentCore Evaluations: 13 pre-built evaluation systems for monitoring agent quality—correctness, safety, tool selection accuracy. Continuous assessment for AI agent quality in production.

AgentCore Memory: Agents can now develop a log of information on users over time and use that information to inform future decisions. The new "episodic functionality" allows agents to learn from past experiences.

Framework Agnostic​

AgentCore supports any framework (CrewAI, LangGraph, LlamaIndex, Google ADK, OpenAI Agents SDK, Strands Agents) or model while handling critical agentic AI infrastructure needs.

Adoption Numbers​

In just five months since preview:

  • 2 million+ downloads
  • Organizations including PGA TOUR (1,000% content writing speed improvement, 95% cost reduction), Cohere Health, Cox Automotive, Heroku, MongoDB, Thomson Reuters, Workday, and Swisscom

đź’ˇ Key Takeaway: AgentCore addresses enterprise concerns about agent governance. Policy controls let you set guardrails, Evaluations let you monitor quality, and Memory lets agents learn without retraining. This infrastructure layer is crucial for production deployments.


Nova Act: 90% Reliable Browser Automation​

Amazon Nova Act is now generally available—a new service for building browser automation agents with enterprise reliability.

Why This Matters​

Browser automation has traditionally been fragile. Selenium scripts break when UIs change. RPA tools require constant maintenance. Nova Act achieves 90% reliability on enterprise workflows—a significant improvement.

Technical Approach​

Nova Act uses reinforcement learning with agents running inside custom synthetic environments ("web gyms") that simulate real-world UIs. This vertical integration across model, orchestrator, tools, and SDK—all trained together—unlocks higher completion rates at scale.

Powered by a custom Amazon Nova 2 Lite model optimized specifically for browser interactions.

Use Cases​

  • Form filling
  • Search and extract
  • Shopping and booking flows
  • QA testing (Amazon Leo reduced weeks of engineering effort to minutes)

Pricing Innovation​

Nova Act uses an hourly pricing model—you pay for time the agent is active, not tokens or API calls. This makes costs more predictable for automation workflows.

Launch Partner​

1Password is a launch partner, bringing credential security management directly into agentic AI automation.

đź’ˇ Key Takeaway: Nova Act targets workflows that still require humans to click through web interfaces. The 90% reliability benchmark and hourly pricing model make it practical for production use cases like QA testing and data entry.


Werner Vogels' Warning: Verification Debt​

Werner Vogels delivered his final re:Invent keynote after 14 years and introduced a concept every platform engineer should understand: verification debt.

What Is Verification Debt?​

AI generates code faster than humans can comprehend it. This creates a dangerous gap between what gets written and what gets understood.

Every time you accept AI-generated code without fully understanding it, you're taking on verification debt. That debt accumulates. And at some point, something breaks in production that nobody on the team actually understands.

"Vibe Coding" Is Gambling​

Vogels was direct: "Vibe coding is fine, but only if you pay close attention to what is being built. We can't just pull a lever on your IDE and hope that something good comes out. That's not software engineering. That's gambling."

The Solution: Code Reviews as Control Points​

Vogels called code reviews "the control point to restore balance."

"We all hate code reviews. It's like being a twelve-year-old standing in front of the class. But the review is where we bring human judgment back into the loop."

This aligns with thoughtful policies from organizations like Oxide Computer Company, whose public LLM policy states: "Wherever LLM-generated code is used, it becomes the responsibility of the engineer." Engineers must conduct personal review of all LLM-generated code before it even goes to peer review.

The Renaissance Developer Framework​

Vogels' parting framework for the AI era emphasizes:

  • Being curious
  • Thinking in systems
  • Communicating precisely
  • Owning your work
  • Becoming a polymath

His core message: "The work is yours, not the tools."

đź’ˇ Key Takeaway: Verification debt is technical debt's dangerous cousin. As AI generates more code, the gap between generation and understanding widens. Code reviews become more important, not less. Organizations serious about AI are also the ones emphasizing engineer responsibility and ownership.


The 40% Failure Prediction​

Gartner's sobering prediction: Over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear business value, or inadequate risk controls.

Why Projects Are Failing​

"Most agentic AI projects right now are early stage experiments or proof of concepts that are mostly driven by hype and are often misapplied," said Anushree Verma, Senior Director Analyst at Gartner. "This can blind organizations to the real cost and complexity of deploying AI agents at scale."

The Four Data Barriers​

AWS addressed this at re:Invent 2025, identifying four specific barriers:

  1. Data silos: Agents need to access information across systems, but most enterprises have data locked in disconnected silos
  2. Trust in data: If data is stale, incomplete, or inaccurate, agent outputs will be too
  3. Cross-organizational governance: Who's responsible when an agent accesses data from multiple teams? What are the audit requirements?
  4. Data consumption patterns: Agents consume data differently than humans. They need APIs, not dashboards.

The "Agent Washing" Problem​

Gartner identified widespread "agent washing"—vendors rebranding existing AI assistants, chatbots, or RPA tools as "agentic AI" without delivering true agentic capabilities. Of thousands of vendors claiming agentic solutions, Gartner estimates only about 130 actually offer genuine agentic features.

The Positive Outlook​

Despite the high failure rate, Gartner sees long-term potential:

  • 15% of day-to-day work decisions will be made autonomously by agentic AI by 2028 (up from 0% in 2024)
  • 33% of enterprise software applications will include agentic AI by 2028 (up from less than 1% in 2024)

đź’ˇ Key Takeaway: Data readiness might be your biggest blocker. Before investing heavily in agents, assess your data foundations. Are systems accessible via API? Is data fresh and accurate? Do you have governance frameworks in place?


What Platform Teams Should Prepare For​

Based on the re:Invent 2025 announcements, here's what platform engineering teams should focus on:

1. Integration Readiness​

Map how the DevOps Agent fits with your existing incident management tools. Understand the handoff between PagerDuty/OpsGenie and AWS's agent. Start thinking about this now while the agent is in preview.

2. Trust Protocols​

Establish clear processes for approving AI-generated fixes:

  • Who can approve? Senior engineers only, or anyone on-call?
  • What's the review bar for different severity levels?
  • How do you handle disagreement with an agent's recommendation?
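
These questions can be captured as an explicit severity-to-approver mapping, making the policy testable rather than tribal knowledge. A sketch with made-up roles and severity levels:

```python
# Sketch of a severity-based approval policy for AI-proposed fixes.
# Roles and severity levels are illustrative; encode your own runbook.
APPROVERS = {
    "sev1": {"senior_engineer", "engineering_manager"},  # highest bar
    "sev2": {"senior_engineer", "on_call"},
    "sev3": {"on_call"},                                 # anyone on-call
}

def can_approve(role: str, severity: str) -> bool:
    return role in APPROVERS.get(severity, set())

assert can_approve("on_call", "sev3")
assert not can_approve("on_call", "sev1")  # sev1 needs a senior reviewer
```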

3. Skill Evolution​

Your job shifts from writing runbooks to evaluating AI mitigation plans. That's a different skill. It requires understanding both the systems and the AI's reasoning. Start building that capability now.

4. Embrace the Hybrid Model​

AI handles triage and analysis. Humans handle judgment calls and approvals. This isn't about replacement—it's about augmentation.

The agent does the initial analysis. It pulls the data. It proposes a plan. You make the decision with full context instead of spending 30 minutes gathering that context yourself.

5. Address Data Foundations First​

Given the 40% failure prediction, prioritize data readiness before agent deployment:

  • Audit API availability for systems agents need to access
  • Assess data freshness and accuracy
  • Establish cross-team governance for agent data access
  • Document data consumption patterns for automation

💡 Key Takeaway: The autonomous DevOps future is being built right now. The question isn't whether to engage with it—it's how to shape it for your team. Start with preview access, build the muscle memory, and train your team on evaluating AI-generated plans.



Key Takeaways Summary​

  1. Frontier agents are available now: DevOps Agent and Security Agent in public preview, Kiro GA with 250,000+ developers
  2. Humans remain gatekeepers: All agents stop at approval stage—you review, you decide
  3. Integration is everything: Success depends on fitting agents into existing workflows, not replacing them
  4. Verification debt is real: AI speed creates new risks; code reviews more important than ever
  5. Data readiness may be your biggest blocker: 40% of projects fail due to data issues—assess foundations first
  6. Start experimenting now: Preview access is the time to learn before these go GA

This analysis is part of our AWS re:Invent 2025 coverage series. Stay tuned for Episode #050: AWS Infrastructure Revolution covering Graviton 5, Trainium 3, and Lambda Durable Functions.

CNCF Kubernetes AI Conformance Program: The Complete Guide for Platform Teams

· 11 min read
VibeSRE
Platform Engineering Contributor

The "Wild West" of AI infrastructure just ended. At KubeCon Atlanta on November 11, 2025, CNCF launched the Certified Kubernetes AI Conformance Program—establishing the first industry standard for running AI workloads on Kubernetes. With 82% of organizations building custom AI solutions and 58% using Kubernetes for those workloads, the fragmentation risk was real. Now there's a baseline.

TL;DR​

  • What: CNCF certification program establishing minimum capabilities for running AI/ML workloads on Kubernetes
  • When: v1.0 launched November 11, 2025 at KubeCon Atlanta; v2.0 roadmap started for 2026
  • Who: 11+ vendors certified including AWS, Google, Microsoft, Red Hat, Oracle, CoreWeave
  • Core Requirements: Dynamic Resource Allocation (DRA), GPU autoscaling, accelerator metrics, AI operator support, gang scheduling
  • Impact: Reduces vendor lock-in, guarantees interoperability, enables multi-cloud AI strategies
  • Action: Check if your platform is certified before selecting AI infrastructure

🎙️ Listen to the podcast episode: Episode #043: Kubernetes AI Conformance - The End of AI Infrastructure Chaos - Jordan and Alex break down the new CNCF certification and what it means for platform teams.

Key Statistics​

| Metric | Value | Source |
|---|---|---|
| Organizations building custom AI | 82% | Linux Foundation Sovereign AI Research, Nov 2025 |
| Enterprises using K8s for AI | 58% | Linux Foundation Sovereign AI Research, Nov 2025 |
| Open source critical to AI strategy | 90% | Linux Foundation Sovereign AI Research, Nov 2025 |
| Initial certified vendors | 11+ | CNCF Announcement, Nov 2025 |
| AI/ML workload growth on K8s (next 12mo) | 90% expect increase | Spectro Cloud State of K8s 2025 |
| GPU utilization improvement (DRA vs device plugins) | 45-60% → 70-85% | The New Stack DRA Guide |
| Existing certified K8s distributions | 100+ | CNCF Conformance Program |

The Problem: AI Infrastructure Fragmentation​

Before this program, every cloud provider and Kubernetes distribution implemented AI capabilities differently. GPU scheduling worked one way on GKE, another way on EKS, and a third way on OpenShift. Training a model on one platform and deploying for inference on another meant rewriting infrastructure code.

The consequences for platform teams were significant:

  1. Vendor Lock-in: Once you optimized for one platform's GPU scheduling, migration became expensive
  2. Unpredictable Behavior: AI frameworks like Kubeflow and Ray behaved differently across environments
  3. Resource Waste: Without standardized DRA, GPU utilization hovered at 45-60%
  4. Skill Fragmentation: Teams needed platform-specific expertise rather than portable Kubernetes skills

💡 Key Takeaway: The Kubernetes AI Conformance Program does for AI workloads what the original Kubernetes Conformance Program did for container orchestration—it guarantees that certified platforms behave identically for core capabilities.

What the Program Certifies​

The certification validates five core capabilities that every AI-capable Kubernetes platform must implement consistently.

1. Dynamic Resource Allocation (DRA)​

DRA is the foundation of the conformance program. Traditional Kubernetes device plugins offer limited resource requests—you ask for "2 GPUs" and get whatever's available. DRA enables complex requirements:

```yaml
# Traditional device plugin (limited): an opaque GPU count is all you can request
resources:
  limits:
    nvidia.com/gpu: 2

# DRA-enabled (rich requirements): a simplified sketch of a claim; the GA
# resource.k8s.io API expresses these constraints via DeviceClasses and CEL selectors
resourceClaims:
- name: gpu-claim
  spec:
    deviceClassName: nvidia-gpu
    requests:
    - count: 2
      constraints:
      - interconnect: nvlink
      - memory: {min: "40Gi"}
      - locality: same-node
```

According to The New Stack, DRA reaching GA in Kubernetes 1.34 improves GPU utilization from 45-60% with device plugins to 70-85%, reduces job queue times from 15-45 minutes to 3-10 minutes, and cuts monthly GPU costs by 30-40%.

2. Intelligent Autoscaling​

Certified platforms must implement two-level autoscaling for AI workloads:

  • Cluster Autoscaling: Automatically adjusts node pools with accelerators based on pending pods
  • Horizontal Pod Autoscaling: Scales workloads based on custom metrics like GPU utilization

This matters because AI workloads have bursty resource requirements. Training jobs need massive GPU clusters for hours, then nothing. Inference services need to scale from zero to thousands of replicas based on traffic.
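The workload-level half can be sketched with the standard `autoscaling/v2` HorizontalPodAutoscaler API. This assumes a metrics adapter (such as the Prometheus adapter) publishes a per-pod `gpu_utilization` custom metric; the metric name and the `llm-inference` Deployment are hypothetical:

```yaml
# Sketch: scale an inference Deployment on a custom per-pod GPU metric.
# Assumes a metrics adapter exposes "gpu_utilization" via the custom
# metrics API; metric and Deployment names are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 1
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "70"   # add replicas when average GPU utilization exceeds 70
```

The cluster-level half is then handled by the platform's node autoscaler, which provisions accelerator nodes when these replicas go Pending.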

3. Rich Accelerator Metrics​

Platforms must expose detailed performance metrics for GPUs, TPUs, and other accelerators. Generic "utilization percentage" isn't sufficient—conformant platforms provide:

  • Memory usage and bandwidth
  • Compute utilization by workload
  • Temperature and power consumption
  • NVLink/interconnect statistics for multi-GPU jobs

Without standardized metrics, autoscaling decisions and capacity planning become guesswork.
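As one way to consume such metrics, a platform running NVIDIA's dcgm-exporter can feed per-GPU data into Prometheus with a Prometheus Operator PodMonitor; the label selector and port name below are assumptions about how the exporter is deployed:

```yaml
# Sketch: scrape per-GPU metrics (memory, utilization, power, NVLink) from
# NVIDIA's dcgm-exporter via the Prometheus Operator's PodMonitor API.
# Label selector and port name are assumptions about your deployment.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  podMetricsEndpoints:
  - port: metrics      # named port on the exporter pod
    interval: 15s
```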

4. AI Operator Support​

Complex AI frameworks like Kubeflow and Ray run as Kubernetes Operators using Custom Resource Definitions (CRDs). The conformance program ensures these operators function correctly by validating:

  • CRD installation and lifecycle management
  • Operator webhook functionality
  • Resource quota enforcement for operator-managed resources

If the core platform isn't robust, AI operators fail in unpredictable ways.
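To make this concrete, here is a sketch of a Kubeflow Training Operator PyTorchJob; a conformant platform must install the CRD, run the operator's webhooks, and enforce quotas on the pods it creates. The image name and replica counts are illustrative:

```yaml
# Sketch: distributed training managed by the Kubeflow Training Operator
# (kubeflow.org/v1 PyTorchJob CRD). Image and replica counts are illustrative;
# the operator creates and supervises the underlying pods.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: bert-finetune
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: example.com/train:latest   # hypothetical training image
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: pytorch
            image: example.com/train:latest
            resources:
              limits:
                nvidia.com/gpu: 1
```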

5. Gang Scheduling​

Distributed AI training jobs require all worker pods to start simultaneously. If only 7 of the 8 required GPUs are available, traditional Kubernetes scheduling starts 7 pods that sit idle waiting for the 8th, burning GPU-hours. Gang scheduling (via Kueue or Volcano) ensures jobs only start when all resources are available.
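With Volcano, for example, the all-or-nothing guarantee is expressed as a PodGroup (resource names below are illustrative):

```yaml
# Sketch: gang scheduling with Volcano. minMember ensures no worker pod
# starts until all 8 can be scheduled together. Names are illustrative.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: training-gang
spec:
  minMember: 8              # all-or-nothing: schedule only when 8 pods fit
  minResources:
    nvidia.com/gpu: "8"     # total accelerator capacity the gang needs
```

Worker pods then opt in with `schedulerName: volcano` and the `scheduling.k8s.io/group-name: training-gang` annotation.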

Key Takeaway

Gang scheduling prevents resource deadlocks in distributed training. Without it, partially-scheduled jobs waste expensive GPU time waiting for stragglers.

Certified Vendors (November 2025)​

The v1.0 release certifies these platforms:

| Vendor | Product | Notes |
|---|---|---|
| AWS | Amazon EKS | Full DRA support, integrated with EC2 GPU instances |
| Google Cloud | GKE | First mover, detailed implementation blog |
| Microsoft | Azure Kubernetes Service | Integrated with Azure ML |
| Red Hat | OpenShift | Enterprise focus, RHEL AI integration |
| Oracle | OCI Kubernetes Engine | OCI GPU shapes supported |
| Broadcom/VMware | vSphere Kubernetes Service | On-premises AI workloads |
| CoreWeave | CoreWeave Kubernetes | GPU cloud specialist |
| Akamai | Akamai Inference Cloud | Edge AI inference |
| Giant Swarm | Giant Swarm Platform | Managed K8s provider |
| Kubermatic | KKP | Multi-cluster management |
| Sidero Labs | Talos Linux | Secure, immutable K8s |

Notable Absence: NVIDIA​

NVIDIA isn't on the certified list, but that's expected. Chris Aniszczyk (CNCF CTO) clarified to TechTarget: "They're not on the list, but they don't really have a product that would qualify. They don't have a Kubernetes-as-a-Service product similar to those being certified."

NVIDIA participates in the working group and their ComputeDomains feature integrates with conformant platforms, but the certification targets platform providers, not hardware vendors.

How This Differs from ISO 42001​

A common question: "How does this relate to ISO 42001 AI management certification?"

| Aspect | Kubernetes AI Conformance | ISO 42001 |
|---|---|---|
| Focus | Technical capabilities | Management & governance |
| Validates | APIs, configurations, workload behavior | Policies, processes, documentation |
| Target | Platform infrastructure | Organizational AI practices |
| Scope | Kubernetes-specific | Technology-agnostic |

ISO 42001 certifies that your organization manages AI responsibly. Kubernetes AI Conformance certifies that your infrastructure runs AI workloads correctly. You likely need both for enterprise AI deployments.

Key Takeaway

ISO 42001 answers "Do we manage AI responsibly?" Kubernetes AI Conformance answers "Does our infrastructure run AI correctly?" These are complementary, not competing standards.

Practical Implications for Platform Teams​

Vendor Selection​

The certification changes how you evaluate AI infrastructure. Instead of detailed POCs testing GPU scheduling behavior across vendors, you can trust that conformant platforms handle core capabilities identically. Selection criteria shift to:

  • Price: GPU instance costs vary significantly across providers
  • Ecosystem: Integration with your existing tools (MLflow, Weights & Biases, etc.)
  • Support: SLAs and enterprise support options
  • Geography: Data residency requirements

Multi-Cloud AI Strategy​

The program enables genuine multi-cloud AI deployments:

  • Training: Use the cheapest GPU cloud (often CoreWeave or Lambda Labs)
  • Inference: Deploy to whichever cloud serves your users fastest
  • Burst: Overflow to alternative providers during peak demand

This was previously difficult because workload manifests needed platform-specific modifications. With conformance, the same Kubernetes resources work everywhere.

Migration Planning​

If your current platform isn't certified, the conformance gap identifies specific capabilities to evaluate:

  1. Does your platform support DRA or only legacy device plugins?
  2. Can you request GPUs with specific interconnect requirements?
  3. Are gang scheduling solutions (Kueue, Volcano) supported?
  4. Do AI operators (Kubeflow, Ray) function correctly?

Non-conformant platforms may still work for simple use cases, but expect friction as workloads become more sophisticated.

Decision Framework: When Conformance Matters​

Certification is critical when:

  • Running distributed training jobs across multiple GPUs/nodes
  • Deploying AI workloads across multiple clouds or regions
  • Using complex AI frameworks (Kubeflow, Ray, KServe)
  • GPU cost optimization is a priority
  • Portability between platforms is required

Certification is less critical when:

  • Running single-GPU inference workloads
  • Locked into a single cloud provider for other reasons
  • Using managed AI services (SageMaker, Vertex AI) rather than raw Kubernetes
  • Workloads don't require GPU/TPU acceleration

What's Coming in v2.0​

CNCF announced that v2.0 roadmap development has started, with an expected 2026 release. Based on working group discussions, likely additions include:

  • Topology-aware scheduling: Requirements for NUMA node, PCIe root, and network fabric alignment
  • Multi-node NVLink: Standardized support for NVIDIA's ComputeDomains
  • Model serving standards: Common interfaces for inference workloads
  • Cost attribution: Standardized GPU cost tracking and chargeback

The v1.0 program intentionally started with fundamentals. As Chris Aniszczyk noted: "It starts with a simple focus on the kind of things you really need to make AI workloads work well on Kubernetes."

Key Takeaway

Don't wait for v2.0 to adopt conformant platforms. The v1.0 capabilities address the most common AI infrastructure pain points. Additional features will extend the standard, not replace it.

Getting Your Platform Certified​

If you provide a Kubernetes platform with AI capabilities, certification is straightforward:

  1. Review requirements: Check the GitHub repository for current test criteria
  2. Run conformance tests: Automated test suite validates capability implementation
  3. Submit results: Pull request to the CNCF repository with test output
  4. Review process: CNCF bot verifies results, human review for edge cases

The process mirrors the existing Kubernetes Conformance Program that has certified 100+ distributions since 2017.

Actions for Platform Teams​

Immediate (This Week)​

  1. Check if your current platform is AI conformant
  2. Inventory AI workloads by capability requirements (DRA, gang scheduling, etc.)
  3. Identify gaps between current platform and conformance requirements

Short-Term (This Quarter)​

  1. If non-conformant: Evaluate migration to certified platform
  2. If conformant: Validate that conformance capabilities are enabled
  3. Update internal platform documentation with conformance status

Long-Term (2025-2026)​

  1. Build vendor selection criteria around conformance certification
  2. Develop multi-cloud AI strategy leveraging platform portability
  3. Track v2.0 requirements for topology-aware scheduling

Learn More​

Official Resources​

Technical Deep Dives​

Industry Analysis​


The Kubernetes AI Conformance Program represents the maturation of AI infrastructure. For the first time, platform teams have a vendor-neutral standard to evaluate AI capabilities. As Chris Aniszczyk put it: "Teams need consistent infrastructure they can rely on." Now they have it.