
CNCF Kubernetes AI Conformance Program: The Complete Guide for Platform Teams

· 11 min read
VibeSRE
Platform Engineering Contributor

The "Wild West" of AI infrastructure just ended. At KubeCon Atlanta on November 11, 2025, CNCF launched the Certified Kubernetes AI Conformance Program—establishing the first industry standard for running AI workloads on Kubernetes. With 82% of organizations building custom AI solutions and 58% using Kubernetes for those workloads, the fragmentation risk was real. Now there's a baseline.

TL;DR

  • What: CNCF certification program establishing minimum capabilities for running AI/ML workloads on Kubernetes
  • When: v1.0 launched November 11, 2025 at KubeCon Atlanta; v2.0 roadmap work is under way, targeting 2026
  • Who: 11+ vendors certified including AWS, Google, Microsoft, Red Hat, Oracle, CoreWeave
  • Core Requirements: Dynamic Resource Allocation (DRA), GPU autoscaling, accelerator metrics, AI operator support, gang scheduling
  • Impact: Reduces vendor lock-in, guarantees interoperability, enables multi-cloud AI strategies
  • Action: Check if your platform is certified before selecting AI infrastructure

🎙️ Listen to the podcast episode: Episode #043: Kubernetes AI Conformance - The End of AI Infrastructure Chaos - Jordan and Alex break down the new CNCF certification and what it means for platform teams.

Key Statistics

| Metric | Value | Source |
|---|---|---|
| Organizations building custom AI | 82% | Linux Foundation Sovereign AI Research, Nov 2025 |
| Enterprises using K8s for AI | 58% | Linux Foundation Sovereign AI Research, Nov 2025 |
| Open source critical to AI strategy | 90% | Linux Foundation Sovereign AI Research, Nov 2025 |
| Initial certified vendors | 11+ | CNCF Announcement, Nov 2025 |
| AI/ML workload growth on K8s (next 12 mo) | 90% expect increase | Spectro Cloud State of K8s 2025 |
| GPU utilization improvement (DRA vs device plugins) | 45-60% → 70-85% | The New Stack DRA Guide |
| Existing certified K8s distributions | 100+ | CNCF Conformance Program |

The Problem: AI Infrastructure Fragmentation

Before this program, every cloud provider and Kubernetes distribution implemented AI capabilities differently. GPU scheduling worked one way on GKE, another way on EKS, and a third way on OpenShift. Training a model on one platform and deploying for inference on another meant rewriting infrastructure code.

The consequences for platform teams were significant:

  1. Vendor Lock-in: Once you optimized for one platform's GPU scheduling, migration became expensive
  2. Unpredictable Behavior: AI frameworks like Kubeflow and Ray behaved differently across environments
  3. Resource Waste: Without standardized DRA, GPU utilization hovered at 45-60%
  4. Skill Fragmentation: Teams needed platform-specific expertise rather than portable Kubernetes skills

Key Takeaway

The Kubernetes AI Conformance Program does for AI workloads what the original Kubernetes Conformance Program did for container orchestration—it guarantees that certified platforms behave identically for core capabilities.

What the Program Certifies

The certification validates five core capabilities that every AI-capable Kubernetes platform must implement consistently.

1. Dynamic Resource Allocation (DRA)

DRA is the foundation of the conformance program. Traditional Kubernetes device plugins offer limited resource requests—you ask for "2 GPUs" and get whatever's available. DRA enables complex requirements:

```yaml
# Traditional device plugin (limited)
resources:
  limits:
    nvidia.com/gpu: 2

# DRA-enabled (rich requirements; schematic illustration,
# not the exact shape of the GA resource.k8s.io API)
resourceClaims:
  - name: gpu-claim
    spec:
      deviceClassName: nvidia-gpu
      requests:
        - count: 2
      constraints:
        - interconnect: nvlink
        - memory: {min: "40Gi"}
        - locality: same-node
```

According to The New Stack, DRA reaching GA in Kubernetes 1.34 improves GPU utilization from 45-60% with device plugins to 70-85%, reduces job queue times from 15-45 minutes to 3-10 minutes, and cuts monthly GPU costs by 30-40%.
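On the workload side, a Pod consumes a DRA allocation through the pod-level `resourceClaims` field. A minimal sketch — the template name and image are hypothetical placeholders:

```yaml
# Sketch: a Pod consuming a DRA claim via a ResourceClaimTemplate.
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  containers:
    - name: train
      image: ghcr.io/example/trainer:latest   # hypothetical image
      resources:
        claims:
          - name: gpus            # references the pod-level claim below
  resourceClaims:
    - name: gpus
      resourceClaimTemplateName: gpu-claim-template   # assumed to exist
```

Because the claim logic lives in the template rather than the Pod spec, the same workload manifest stays portable across conformant platforms.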

2. Intelligent Autoscaling

Certified platforms must implement two-level autoscaling for AI workloads:

  • Cluster Autoscaling: Automatically adjusts node pools with accelerators based on pending pods
  • Horizontal Pod Autoscaling: Scales workloads based on custom metrics like GPU utilization

This matters because AI workloads have bursty resource requirements. Training jobs need massive GPU clusters for hours, then nothing. Inference services need to scale from zero to thousands of replicas based on traffic.
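The second level can be sketched with a standard `autoscaling/v2` HorizontalPodAutoscaler, assuming a metrics adapter exposes the DCGM GPU-utilization metric per pod (the Deployment name and target value are illustrative):

```yaml
# Sketch: HPA scaling an inference Deployment on GPU utilization.
# Assumes a metrics adapter publishes DCGM_FI_DEV_GPU_UTIL as a Pods metric.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference        # hypothetical workload
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "70"   # scale out above ~70% average GPU utilization
```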

3. Rich Accelerator Metrics

Platforms must expose detailed performance metrics for GPUs, TPUs, and other accelerators. Generic "utilization percentage" isn't sufficient—conformant platforms provide:

  • Memory usage and bandwidth
  • Compute utilization by workload
  • Temperature and power consumption
  • NVLink/interconnect statistics for multi-GPU jobs

Without standardized metrics, autoscaling decisions and capacity planning become guesswork.
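One common way these metrics surface is NVIDIA's dcgm-exporter scraped by Prometheus. A sketch, assuming the Prometheus Operator's ServiceMonitor CRD is installed and the labels/namespace shown are illustrative:

```yaml
# Sketch: scraping accelerator metrics from a dcgm-exporter Service.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring    # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
    - port: metrics    # exposes DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED,
      interval: 15s    # DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_POWER_USAGE, etc.
```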

4. AI Operator Support

Complex AI frameworks like Kubeflow and Ray run as Kubernetes Operators using Custom Resource Definitions (CRDs). The conformance program ensures these operators function correctly by validating:

  • CRD installation and lifecycle management
  • Operator webhook functionality
  • Resource quota enforcement for operator-managed resources

If the core platform isn't robust, AI operators fail in unpredictable ways.
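Quota enforcement for operator-managed resources can use Kubernetes object-count quotas alongside extended-resource quotas. A sketch, assuming the Ray operator's RayJob CRD is installed in the cluster:

```yaml
# Sketch: capping operator-managed custom resources and GPU requests per team.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ai-operator-quota
  namespace: ml-team           # hypothetical team namespace
spec:
  hard:
    count/rayjobs.ray.io: "10"       # at most 10 RayJob objects
    requests.nvidia.com/gpu: "16"    # at most 16 GPUs requested in total
```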

5. Gang Scheduling

Distributed AI training jobs require all worker pods to start simultaneously. If 7 of 8 GPUs are available but the 8th isn't, traditional Kubernetes scheduling starts 7 pods that sit idle waiting for the 8th. Gang scheduling (via Kueue or Volcano) ensures jobs only start when all resources are available.
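A minimal gang-scheduling sketch using Volcano's PodGroup — no pod in the group starts until all eight can be placed (names and image are illustrative):

```yaml
# Sketch: all-or-nothing admission for an 8-worker training job.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: dist-training
spec:
  minMember: 8                 # gang size: schedule all 8 or none
---
# Workers opt in via the group annotation and the Volcano scheduler:
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  annotations:
    scheduling.k8s.io/group-name: dist-training
spec:
  schedulerName: volcano
  containers:
    - name: worker
      image: ghcr.io/example/trainer:latest   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1
```

Kueue achieves the same all-or-nothing behavior with queue-based admission instead of PodGroups; the conformance program accepts either approach.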

Key Takeaway

Gang scheduling prevents resource deadlocks in distributed training. Without it, partially-scheduled jobs waste expensive GPU time waiting for stragglers.

Certified Vendors (November 2025)

The v1.0 release certifies these platforms:

| Vendor | Product | Notes |
|---|---|---|
| AWS | Amazon EKS | Full DRA support, integrated with EC2 GPU instances |
| Google Cloud | GKE | First mover, detailed implementation blog |
| Microsoft | Azure Kubernetes Service | Integrated with Azure ML |
| Red Hat | OpenShift | Enterprise focus, RHEL AI integration |
| Oracle | OCI Kubernetes Engine | OCI GPU shapes supported |
| Broadcom/VMware | vSphere Kubernetes Service | On-premises AI workloads |
| CoreWeave | CoreWeave Kubernetes | GPU cloud specialist |
| Akamai | Akamai Inference Cloud | Edge AI inference |
| Giant Swarm | Giant Swarm Platform | Managed K8s provider |
| Kubermatic | KKP | Multi-cluster management |
| Sidero Labs | Talos Linux | Secure, immutable K8s |

Notable Absence: NVIDIA

NVIDIA isn't on the certified list, but that's expected. Chris Aniszczyk (CNCF CTO) clarified to TechTarget: "They're not on the list, but they don't really have a product that would qualify. They don't have a Kubernetes-as-a-Service product similar to those being certified."

NVIDIA participates in the working group and their ComputeDomains feature integrates with conformant platforms, but the certification targets platform providers, not hardware vendors.

How This Differs from ISO 42001

A common question: "How does this relate to ISO 42001 AI management certification?"

| Aspect | Kubernetes AI Conformance | ISO 42001 |
|---|---|---|
| Focus | Technical capabilities | Management & governance |
| Validates | APIs, configurations, workload behavior | Policies, processes, documentation |
| Target | Platform infrastructure | Organizational AI practices |
| Scope | Kubernetes-specific | Technology-agnostic |

ISO 42001 certifies that your organization manages AI responsibly. Kubernetes AI Conformance certifies that your infrastructure runs AI workloads correctly. You likely need both for enterprise AI deployments.

Key Takeaway

ISO 42001 answers "Do we manage AI responsibly?" Kubernetes AI Conformance answers "Does our infrastructure run AI correctly?" These are complementary, not competing standards.

Practical Implications for Platform Teams

Vendor Selection

The certification changes how you evaluate AI infrastructure. Instead of detailed POCs testing GPU scheduling behavior across vendors, you can trust that conformant platforms handle core capabilities identically. Selection criteria shift to:

  • Price: GPU instance costs vary significantly across providers
  • Ecosystem: Integration with your existing tools (MLflow, Weights & Biases, etc.)
  • Support: SLAs and enterprise support options
  • Geography: Data residency requirements

Multi-Cloud AI Strategy

The program enables genuine multi-cloud AI deployments:

  • Training: Use the cheapest GPU cloud (often CoreWeave or Lambda Labs)
  • Inference: Deploy to whichever cloud serves your users fastest
  • Burst: Overflow to alternative providers during peak demand

This was previously difficult because workload manifests needed platform-specific modifications. With conformance, the same Kubernetes resources work everywhere.

Migration Planning

If your current platform isn't certified, the conformance gap identifies specific capabilities to evaluate:

  1. Does your platform support DRA or only legacy device plugins?
  2. Can you request GPUs with specific interconnect requirements?
  3. Are gang scheduling solutions (Kueue, Volcano) supported?
  4. Do AI operators (Kubeflow, Ray) function correctly?

Non-conformant platforms may still work for simple use cases, but expect friction as workloads become more sophisticated.

Decision Framework: When Conformance Matters

Certification is critical when:

  • Running distributed training jobs across multiple GPUs/nodes
  • Deploying AI workloads across multiple clouds or regions
  • Using complex AI frameworks (Kubeflow, Ray, KServe)
  • GPU cost optimization is a priority
  • Portability between platforms is required

Certification is less critical when:

  • Running single-GPU inference workloads
  • Locked into a single cloud provider for other reasons
  • Using managed AI services (SageMaker, Vertex AI) rather than raw Kubernetes
  • Workloads don't require GPU/TPU acceleration

What's Coming in v2.0

CNCF announced that v2.0 roadmap development has started, with an expected 2026 release. Based on working group discussions, likely additions include:

  • Topology-aware scheduling: Requirements for NUMA node, PCIe root, and network fabric alignment
  • Multi-node NVLink: Standardized support for NVIDIA's ComputeDomains
  • Model serving standards: Common interfaces for inference workloads
  • Cost attribution: Standardized GPU cost tracking and chargeback

The v1.0 program intentionally started with fundamentals. As Chris Aniszczyk noted: "It starts with a simple focus on the kind of things you really need to make AI workloads work well on Kubernetes."

Key Takeaway

Don't wait for v2.0 to adopt conformant platforms. The v1.0 capabilities address the most common AI infrastructure pain points. Additional features will extend the standard, not replace it.

Getting Your Platform Certified

If you provide a Kubernetes platform with AI capabilities, certification is straightforward:

  1. Review requirements: Check the GitHub repository for current test criteria
  2. Run conformance tests: Automated test suite validates capability implementation
  3. Submit results: Pull request to the CNCF repository with test output
  4. Review process: CNCF bot verifies results, human review for edge cases

The process mirrors the existing Kubernetes Conformance Program that has certified 100+ distributions since 2017.

Actions for Platform Teams

Immediate (This Week)

  1. Check if your current platform is AI conformant
  2. Inventory AI workloads by capability requirements (DRA, gang scheduling, etc.)
  3. Identify gaps between current platform and conformance requirements

Short-Term (This Quarter)

  1. If non-conformant: Evaluate migration to certified platform
  2. If conformant: Validate that conformance capabilities are enabled
  3. Update internal platform documentation with conformance status

Long-Term (2025-2026)

  1. Build vendor selection criteria around conformance certification
  2. Develop multi-cloud AI strategy leveraging platform portability
  3. Track v2.0 requirements for topology-aware scheduling


The Kubernetes AI Conformance Program represents the maturation of AI infrastructure. For the first time, platform teams have a vendor-neutral standard to evaluate AI capabilities. As Chris Aniszczyk put it: "Teams need consistent infrastructure they can rely on." Now they have it.