CNCF Kubernetes AI Conformance Program: The Complete Guide for Platform Teams
The "Wild West" of AI infrastructure just ended. At KubeCon Atlanta on November 11, 2025, CNCF launched the Certified Kubernetes AI Conformance Program—establishing the first industry standard for running AI workloads on Kubernetes. With 82% of organizations building custom AI solutions and 58% using Kubernetes for those workloads, the fragmentation risk was real. Now there's a baseline.
TL;DR
- What: CNCF certification program establishing minimum capabilities for running AI/ML workloads on Kubernetes
- When: v1.0 launched November 11, 2025 at KubeCon Atlanta; v2.0 roadmap started for 2026
- Who: 11+ vendors certified including AWS, Google, Microsoft, Red Hat, Oracle, CoreWeave
- Core Requirements: Dynamic Resource Allocation (DRA), GPU autoscaling, accelerator metrics, AI operator support, gang scheduling
- Impact: Reduces vendor lock-in, guarantees interoperability, enables multi-cloud AI strategies
- Action: Check if your platform is certified before selecting AI infrastructure
Key Statistics
| Metric | Value | Source |
|---|---|---|
| Organizations building custom AI | 82% | Linux Foundation Sovereign AI Research, Nov 2025 |
| Enterprises using K8s for AI | 58% | Linux Foundation Sovereign AI Research, Nov 2025 |
| Open source critical to AI strategy | 90% | Linux Foundation Sovereign AI Research, Nov 2025 |
| Initial certified vendors | 11+ | CNCF Announcement, Nov 2025 |
| AI/ML workload growth on K8s (next 12mo) | 90% expect increase | Spectro Cloud State of K8s 2025 |
| GPU utilization improvement (DRA vs device plugins) | 45-60% → 70-85% | The New Stack DRA Guide |
| Existing certified K8s distributions | 100+ | CNCF Conformance Program |
The Problem: AI Infrastructure Fragmentation
Before this program, every cloud provider and Kubernetes distribution implemented AI capabilities differently. GPU scheduling worked one way on GKE, another way on EKS, and a third way on OpenShift. Training a model on one platform and deploying for inference on another meant rewriting infrastructure code.
The consequences for platform teams were significant:
- Vendor Lock-in: Once you optimized for one platform's GPU scheduling, migration became expensive
- Unpredictable Behavior: AI frameworks like Kubeflow and Ray behaved differently across environments
- Resource Waste: Without standardized DRA, GPU utilization hovered at 45-60%
- Skill Fragmentation: Teams needed platform-specific expertise rather than portable Kubernetes skills
The Kubernetes AI Conformance Program does for AI workloads what the original Kubernetes Conformance Program did for container orchestration—it guarantees that certified platforms behave identically for core capabilities.
What the Program Certifies
The certification validates five core capabilities that every AI-capable Kubernetes platform must implement consistently.
1. Dynamic Resource Allocation (DRA)
DRA is the foundation of the conformance program. Traditional Kubernetes device plugins offer limited resource requests—you ask for "2 GPUs" and get whatever's available. DRA enables complex requirements:
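```yaml
# Traditional device plugin: a bare count, with no way to describe the device
resources:
  limits:
    nvidia.com/gpu: 2
---
# DRA (resource.k8s.io/v1): a count plus per-device requirements.
# Attribute names are driver-specific; "gpu.example.com" and "interconnect"
# are placeholders for whatever the installed DRA driver publishes.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: gpus
      exactly:
        deviceClassName: nvidia-gpu
        allocationMode: ExactCount
        count: 2   # one claim: both GPUs are allocated together for the consuming Pod
        selectors:
        - cel:
            expression: >-
              device.attributes["gpu.example.com"].interconnect == "nvlink" &&
              device.capacity["gpu.example.com"].memory.compareTo(quantity("40Gi")) >= 0
```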
According to The New Stack, DRA (GA as of Kubernetes 1.34) raises GPU utilization from the 45-60% typical of device plugins to 70-85%, shortens job queue times from 15-45 minutes to 3-10 minutes, and cuts monthly GPU costs by 30-40%.
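As a rough sketch of how a workload might consume a DRA claim, the manifests below define a ResourceClaimTemplate and a Pod that references it, using the `resource.k8s.io/v1` API. The names `single-gpu` and `trainer`, the container image, and the `nvidia-gpu` DeviceClass are illustrative assumptions; a real cluster needs a DRA driver installed that publishes a matching DeviceClass.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu                      # hypothetical name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: nvidia-gpu   # assumed DeviceClass; depends on the driver
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer                         # hypothetical name
spec:
  restartPolicy: Never
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # placeholder image
    resources:
      claims:
      - name: gpu                       # binds the container to the Pod-level claim
```

Each Pod created from this spec gets its own ResourceClaim generated from the template, and the scheduler only places the Pod on a node where that claim can be satisfied.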
2. Intelligent Autoscaling
Certified platforms must implement two-level autoscaling for AI workloads:
- Cluster Autoscaling: Automatically adjusts node pools with accelerators based on pending pods
- Horizontal Pod Autoscaling: Scales workloads based on custom metrics like GPU utilization
This matters because AI workloads have bursty resource requirements. Training jobs need massive GPU clusters for hours, then nothing. Inference services need to scale from zero to thousands of replicas based on traffic.
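The pod-level half can be sketched with a standard `autoscaling/v2` HorizontalPodAutoscaler. This assumes a custom metrics adapter already exposes a per-pod metric to the HPA; the metric name `gpu_utilization`, the Deployment name `inference-server`, and the target of 70 are placeholders, not values defined by the conformance program.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server         # hypothetical Deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization      # assumed metric served by a custom metrics adapter
      target:
        type: AverageValue
        averageValue: "70"         # scale out when the per-pod average exceeds 70
```

The HPA computes desired replicas as ceil(currentReplicas × currentValue / targetValue), so 4 replicas averaging 90 against this target would be scaled to ceil(4 × 90 / 70) = 6. This sketch always keeps at least one replica running; scale-from-zero needs additional tooling or configuration that is not shown here.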
3. Rich Accelerator Metrics
Platforms must expose detailed performance metrics for GPUs, TPUs, and other accelerators. Generic "utilization percentage" isn't sufficient—conformant platforms provide:
- Memory usage and bandwidth
- Compute utilization by workload
- Temperature and power consumption
- NVLink/interconnect statistics for multi-GPU jobs
Without standardized metrics, autoscaling decisions and capacity planning become guesswork.
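As one illustration of what consistent metrics make possible, the Prometheus rule file below records cluster-wide GPU utilization and memory use and alerts on sustained high temperature. It assumes NVIDIA's dcgm-exporter is the metrics source, which is an assumption about the cluster rather than something the program description above specifies; the rule names and the 85 °C threshold are arbitrary choices for the sketch.

```yaml
groups:
- name: gpu-capacity               # hypothetical rule group
  rules:
  # Average GPU utilization (percent) across all scraped GPUs
  - record: cluster:gpu_utilization:avg
    expr: avg(DCGM_FI_DEV_GPU_UTIL)
  # Framebuffer memory in use (MiB), summed across GPUs
  - record: cluster:gpu_memory_used_mib:sum
    expr: sum(DCGM_FI_DEV_FB_USED)
  # Fire when any GPU stays above the (arbitrary) temperature threshold
  - alert: GpuTemperatureHigh
    expr: DCGM_FI_DEV_GPU_TEMP > 85
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "GPU temperature above 85C for 10 minutes"
```

Recording rules like these give capacity dashboards and autoscalers one stable series name to query, regardless of which exporter produces the raw data.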
4. AI Operator Support
Complex AI frameworks like Kubeflow and Ray run as Kubernetes Operators using Custom Resource Definitions (CRDs). The conformance program ensures these operators function correctly by validating:
- CRD installation and lifecycle management
- Operator webhook functionality
- Resource quota enforcement for operator-managed resources
If the core platform isn't robust, AI operators fail in unpredictable ways.
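Quota enforcement for operator-managed resources can be sketched with a namespaced ResourceQuota. The example assumes the KubeRay operator is installed (so the `rayclusters.ray.io` resource exists) and that GPUs are exposed as the extended resource `nvidia.com/gpu`; the namespace `ml-team` and both limits are made up for illustration.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-team-quota              # hypothetical name
  namespace: ml-team               # hypothetical namespace
spec:
  hard:
    # Cap total GPUs requested by all Pods in the namespace,
    # including Pods created by operators on the team's behalf
    requests.nvidia.com/gpu: "8"
    # Cap how many RayCluster custom resources the team can create
    count/rayclusters.ray.io: "4"
```

Because the quota applies to the Pods an operator creates as well as to the custom resources themselves, a RayCluster that would push the namespace past 8 GPUs has its excess Pods rejected at admission.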
5. Gang Scheduling
Distributed AI training jobs require all worker pods to start simultaneously. If 7 of 8 GPUs are available but the 8th isn't, traditional Kubernetes scheduling starts 7 pods that sit idle waiting for the 8th. Gang scheduling (via Kueue or Volcano) ensures jobs only start when all resources are available.
Gang scheduling also prevents resource deadlocks in distributed training: without it, two partially scheduled jobs can each hold GPUs the other needs, and both burn expensive GPU time waiting for workers that never arrive.
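With Kueue, this kind of all-or-nothing start can be approximated by submitting the training Job to a queue. The sketch below assumes Kueue is installed and that a LocalQueue named `training-queue` exists in the namespace; the Job name, image, and worker count are hypothetical.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-8gpu                                # hypothetical name
  labels:
    kueue.x-k8s.io/queue-name: training-queue     # assumed LocalQueue
spec:
  suspend: true          # stays suspended until Kueue admits the whole Job
  parallelism: 8
  completions: 8
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/trainer:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1                     # one GPU per worker, 8 in total
```

Kueue admits on quota: the Job is unsuspended only once the queue has room for all eight workers, so no pods are created while only seven GPUs' worth of quota is free. Stricter pod-level guarantees depend on how Kueue or Volcano is configured on the platform.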
Certified Vendors (November 2025)
The v1.0 release certifies these platforms:
| Vendor | Product | Notes |
|---|---|---|
| AWS | Amazon EKS | Full DRA support, integrated with EC2 GPU instances |
| Google Cloud | GKE | First mover, detailed implementation blog |
| Microsoft | Azure Kubernetes Service | Integrated with Azure ML |
| Red Hat | OpenShift | Enterprise focus, RHEL AI integration |
| Oracle | OCI Kubernetes Engine | OCI GPU shapes supported |
| Broadcom/VMware | vSphere Kubernetes Service | On-premises AI workloads |
| CoreWeave | CoreWeave Kubernetes | GPU cloud specialist |
| Akamai | Akamai Inference Cloud | Edge AI inference |
| Giant Swarm | Giant Swarm Platform | Managed K8s provider |
| Kubermatic | KKP | Multi-cluster management |
| Sidero Labs | Talos Linux | Secure, immutable K8s |
Notable Absence: NVIDIA
NVIDIA isn't on the certified list, but that's expected. Chris Aniszczyk (CNCF CTO) clarified to TechTarget: "They're not on the list, but they don't really have a product that would qualify. They don't have a Kubernetes-as-a-Service product similar to those being certified."
NVIDIA participates in the working group, and its ComputeDomains feature integrates with conformant platforms, but the certification targets platform providers, not hardware vendors.
How This Differs from ISO 42001
A common question: "How does this relate to ISO 42001 AI management certification?"
| Aspect | Kubernetes AI Conformance | ISO 42001 |
|---|---|---|
| Focus | Technical capabilities | Management & governance |
| Validates | APIs, configurations, workload behavior | Policies, processes, documentation |
| Target | Platform infrastructure | Organizational AI practices |
| Scope | Kubernetes-specific | Technology-agnostic |
ISO 42001 answers "Do we manage AI responsibly?" by certifying organizational practices. Kubernetes AI Conformance answers "Does our infrastructure run AI correctly?" by certifying platform behavior. The two standards are complementary, not competing, and enterprise AI deployments will likely need both.
Practical Implications for Platform Teams
Vendor Selection
The certification changes how you evaluate AI infrastructure. Instead of detailed POCs testing GPU scheduling behavior across vendors, you can trust that conformant platforms handle core capabilities identically. Selection criteria shift to:
- Price: GPU instance costs vary significantly across providers
- Ecosystem: Integration with your existing tools (MLflow, Weights & Biases, etc.)
- Support: SLAs and enterprise support options
- Geography: Data residency requirements
Multi-Cloud AI Strategy
The program enables genuine multi-cloud AI deployments:
- Training: Use the cheapest GPU cloud (often CoreWeave or Lambda Labs)
- Inference: Deploy to whichever cloud serves your users fastest
- Burst: Overflow to alternative providers during peak demand
This was previously difficult because workload manifests needed platform-specific modifications. With conformance, the same Kubernetes resources work everywhere.
Migration Planning
If your current platform isn't certified, use the conformance requirements as a checklist of capabilities to evaluate:
- Does your platform support DRA or only legacy device plugins?
- Can you request GPUs with specific interconnect requirements?
- Are gang scheduling solutions (Kueue, Volcano) supported?
- Do AI operators (Kubeflow, Ray) function correctly?
Non-conformant platforms may still work for simple use cases, but expect friction as workloads become more sophisticated.
Decision Framework: When Conformance Matters
Certification is critical when:
- Running distributed training jobs across multiple GPUs/nodes
- Deploying AI workloads across multiple clouds or regions
- Using complex AI frameworks (Kubeflow, Ray, KServe)
- GPU cost optimization is a priority
- Portability between platforms is required
Certification is less critical when:
- Running single-GPU inference workloads
- Locked into a single cloud provider for other reasons
- Using managed AI services (SageMaker, Vertex AI) rather than raw Kubernetes
- Workloads don't require GPU/TPU acceleration
What's Coming in v2.0
CNCF announced that v2.0 roadmap development has started, with an expected 2026 release. Based on working group discussions, likely additions include:
- Topology-aware scheduling: Requirements for NUMA node, PCIe root, and network fabric alignment
- Multi-node NVLink: Standardized support for NVIDIA's ComputeDomains
- Model serving standards: Common interfaces for inference workloads
- Cost attribution: Standardized GPU cost tracking and chargeback
The v1.0 program intentionally started with fundamentals. As Chris Aniszczyk noted: "It starts with a simple focus on the kind of things you really need to make AI workloads work well on Kubernetes."
Don't wait for v2.0 to adopt conformant platforms. The v1.0 capabilities address the most common AI infrastructure pain points. Additional features will extend the standard, not replace it.
Getting Your Platform Certified
If you provide a Kubernetes platform with AI capabilities, certification is straightforward:
- Review requirements: Check the GitHub repository for current test criteria
- Run conformance tests: Automated test suite validates capability implementation
- Submit results: Pull request to the CNCF repository with test output
- Review process: CNCF bot verifies results, human review for edge cases
The process mirrors the existing Kubernetes Conformance Program that has certified 100+ distributions since 2017.
Actions for Platform Teams
Immediate (This Week)
- Check if your current platform is AI conformant
- Inventory AI workloads by capability requirements (DRA, gang scheduling, etc.)
- Identify gaps between current platform and conformance requirements
Short-Term (This Quarter)
- If non-conformant: Evaluate migration to certified platform
- If conformant: Validate that conformance capabilities are enabled
- Update internal platform documentation with conformance status
Long-Term (2025-2026)
- Build vendor selection criteria around conformance certification
- Develop multi-cloud AI strategy leveraging platform portability
- Track v2.0 requirements for topology-aware scheduling
Learn More
Official Resources
- CNCF Kubernetes AI Conformance Program - Official announcement
- GitHub: cncf/k8s-ai-conformance - Test suite and certified products
- Working Group Charter - How the program was developed
Technical Deep Dives
- GKE AI Conformance Implementation - Google's technical details
- Dynamic Resource Allocation Guide - DRA explained
- Topology-Aware Scheduling - v2.0 preview
Industry Analysis
- TechTarget Analysis - Analyst perspectives on the program
- Cloud Native Now Coverage - Industry reaction
The Kubernetes AI Conformance Program represents the maturation of AI infrastructure. For the first time, platform teams have a vendor-neutral standard to evaluate AI capabilities. As Chris Aniszczyk put it: "Teams need consistent infrastructure they can rely on." Now they have it.
