Navos runs on unmodified upstream Kubernetes and Cluster API. Learn why that architectural choice matters for portability, multi-cloud, and no-lock-in cluster management.
Every infrastructure decision is a bet. When we built Navos, we made one deliberately: we bet on Cluster API. Not a fork. Not a proprietary lifecycle engine. Not custom automation stitched together with shell scripts and drift. Cluster API, the same upstream CNCF project that underpins cluster lifecycle management at some of the largest infrastructure organizations on the planet.
This post explains that decision. Not as a marketing exercise, but as an honest account of the architectural choices that shaped Navos, and what those choices mean for the engineers who run clusters on top of it.
The Problem We Were Solving
Managing Kubernetes clusters is tractable at small scale. One team, one cluster, one environment. The scaffolding barely matters.
It stops being tractable the moment you have multiple clusters, mixed infrastructure, and the expectation that everything stays up during upgrades, scaling events, and node replacements. That's where lifecycle management either holds together or quietly falls apart.
The Kubernetes ecosystem has accumulated a long tail of ways to bootstrap clusters. Most of them solve the Day 1 problem well enough. The pain surfaces on Day 2 and beyond: upgrades that require manual sequencing, scaling that breaks predictability, audit trails that amount to "someone ran a script." At that scale, teams need a lifecycle model that treats clusters as first-class Kubernetes objects, not artifacts assembled outside the API boundary.
That's the problem Cluster API was built to solve.
What Cluster API Actually Is
Cluster API (CAPI) is a Kubernetes subproject sponsored by SIG Cluster Lifecycle. It brings declarative, Kubernetes-style APIs to cluster creation, configuration, and management. The core idea is direct: if Kubernetes manages application workloads as objects, it should also manage Kubernetes clusters as objects.
In practice, infrastructure components traditionally managed outside of Kubernetes (virtual machines, control plane nodes, worker pools) become Kubernetes resources governed by the same kind of reconciliation loop that manages any other workload. A KubeadmControlPlane resource defines your control plane machines. MachineDeployment resources define your worker pools. The cluster topology, and with it the entire lifecycle of a cluster, is declared, version-controlled, and continuously reconciled.
The management cluster watches desired state. When actual state drifts from it (a node fails, an upgrade is triggered, a machine pool scales), the controller acts. Not a webhook. Not a Helm hook. The actual Kubernetes control loop, running continuously.
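To make the control loop concrete, here is a minimal sketch of a single reconcile pass in Python. It is not Cluster API code: PoolState and reconcile are hypothetical names, and a real controller acts on Machine objects rather than returning strings. The shape is the same, though: compare desired state with observed state and emit whatever closes the gap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolState:
    """Simplified state of one worker pool (hypothetical model, not a CAPI type)."""
    replicas: int
    version: str


def reconcile(desired: PoolState, actual: PoolState) -> list[str]:
    """Return the actions needed to move actual state toward desired state."""
    actions = []
    if actual.version != desired.version:
        actions.append(f"roll machines to {desired.version}")
    if actual.replicas < desired.replicas:
        actions.append(f"create {desired.replicas - actual.replicas} machine(s)")
    elif actual.replicas > desired.replicas:
        actions.append(f"delete {actual.replicas - desired.replicas} machine(s)")
    return actions


desired = PoolState(replicas=3, version="v1.30.2")
# A node has failed, so only two machines are observed.
observed = PoolState(replicas=2, version="v1.30.2")

print(reconcile(desired, observed))  # ['create 1 machine(s)']
print(reconcile(desired, desired))   # [] (nothing to do once states match)
```

The second call is the important one: when observed state already matches desired state, the loop does nothing, which is what makes it safe to run continuously.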
This is not a proprietary abstraction sitting on top of Kubernetes. Cluster API is the upstream project that organizations including Google, Microsoft, Apple, IBM, and Red Hat all actively contribute to. It carries broad infrastructure provider support: AWS, Azure, GCP, OpenStack, bare metal, and more. Building on Cluster API means building on the same primitives the largest cloud operators use to manage clusters at scale.
Why We Chose It for Navos
Lifecycle management shouldn't be a custom codebase.
The alternative to Cluster API is writing your own cluster lifecycle logic or inheriting someone else's. That logic includes provisioning, upgrade sequencing, node rolling, version skew enforcement, drain coordination, machine health checks, and auto-healing. Write that yourself, and you own it indefinitely. Every Kubernetes release, every API deprecation, every CNI compatibility concern – yours to track, yours to patch.
Cluster API addresses the underlying problem as a shared project maintained by engineers across dozens of organizations. When a new Kubernetes version ships, the CAPI maintainers address compatibility. When upgrade sequencing logic changes, the community validates it across provider implementations. VEXXHOST doesn't need to reinvent that. We need to run it reliably and support the teams depending on it.
The operator model is the right model.
Cluster API uses the same operator pattern that governs every other Kubernetes workload. That's architecturally significant. Cluster lifecycle runs inside the same control plane logic that platform teams already understand. No separate tooling layer to learn. No separate failure mode to isolate. The system reconciles continuously, not only when a manual operation is triggered.
In Navos, a cluster upgrade begins by changing a single field: spec.topology.version on the Cluster resource. Cluster API takes it from there – upgrading control plane nodes first, strictly ordered per the Kubernetes version skew policy, then worker nodes one at a time, with workload disruption minimized at every step. The process is reproducible because it's declared. It runs the same way every time because the same controller drives it every time.
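As a rough illustration of what "changing a single field" looks like, the Python sketch below models a Cluster object as a plain dictionary and updates spec.topology.version. The cluster name, class name, and version numbers are placeholders, and request_upgrade is a hypothetical helper, not part of Cluster API. The check it performs mirrors the Kubernetes rule that a control plane moves forward one minor version at a time.

```python
import copy
import re


def minor_of(version: str) -> tuple[int, int]:
    """Parse a version such as 'v1.30.2' into (major, minor)."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unexpected version format: {version!r}")
    return int(match.group(1)), int(match.group(2))


def request_upgrade(cluster: dict, target: str) -> dict:
    """Return a copy of the Cluster with spec.topology.version set to target.

    Rejects minor-version downgrades and skipped minor versions.
    """
    current = cluster["spec"]["topology"]["version"]
    cur_major, cur_minor = minor_of(current)
    new_major, new_minor = minor_of(target)
    if new_major != cur_major or not (0 <= new_minor - cur_minor <= 1):
        raise ValueError(f"cannot go from {current} to {target} in one step")
    updated = copy.deepcopy(cluster)
    updated["spec"]["topology"]["version"] = target
    return updated


# Placeholder Cluster definition; only the fields relevant here are shown.
cluster = {
    "apiVersion": "cluster.x-k8s.io/v1beta1",
    "kind": "Cluster",
    "metadata": {"name": "demo"},
    "spec": {"topology": {"class": "example-class", "version": "v1.29.6"}},
}

upgraded = request_upgrade(cluster, "v1.30.2")
print(upgraded["spec"]["topology"]["version"])  # v1.30.2
```

In a real workflow the edited object would be committed to Git or applied through the Kubernetes API, and the controllers would carry out the rollout from there.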
Upstream means no fork tax.
Navos runs unmodified upstream Kubernetes and unmodified upstream Cluster API. No forks. No proprietary layers bolted on top. No custom CRDs that exist only inside our platform.
That choice has a direct operational consequence: everything deployed on Navos is portable. Your manifests, your Helm charts, your kubectl workflows — none of them depend on Navos-specific abstractions.
Deploy on VEXXHOST infrastructure, on your own private cloud via Atmosphere, or on bare metal. The cluster lifecycle API looks the same. Your workloads move with you.
What It Means for Multi-Cloud and Portability
Cluster API maintains a deliberate separation between its core lifecycle logic and the infrastructure provider layer. The core resources (Machine, MachineSet, MachineDeployment) and the kubeadm control plane provider's KubeadmControlPlane are infrastructure-agnostic. Provider implementations handle the infrastructure-specific work: creating VMs, configuring load balancers, and managing networks. The contract between them is well-defined and standardized across the ecosystem.
In practice: if you're running Navos clusters on VEXXHOST today and decide tomorrow to run some of them on your own hardware or in a different region, the cluster topology doesn't change. The provider reference changes. The lifecycle model doesn't. Your CI/CD pipelines stay intact. Your GitOps configuration stays valid.
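A small Python sketch of that idea, under stated assumptions: the Cluster definition below is illustrative, the names and CIDR are placeholders, and with_provider and changed_spec_keys are hypothetical helpers. OpenStackCluster and Metal3Cluster are used only as examples of infrastructure provider kinds. The sketch produces a second definition aimed at a different provider and shows which part of the spec differs.

```python
import copy

# Illustrative Cluster definition. Names and the CIDR are placeholders.
cluster_hosted = {
    "apiVersion": "cluster.x-k8s.io/v1beta1",
    "kind": "Cluster",
    "metadata": {"name": "demo"},
    "spec": {
        "clusterNetwork": {"pods": {"cidrBlocks": ["10.244.0.0/16"]}},
        "controlPlaneRef": {
            "apiVersion": "controlplane.cluster.x-k8s.io/v1beta1",
            "kind": "KubeadmControlPlane",
            "name": "demo-control-plane",
        },
        "infrastructureRef": {
            "apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
            "kind": "OpenStackCluster",
            "name": "demo",
        },
    },
}


def with_provider(cluster: dict, infra_kind: str) -> dict:
    """Return a copy of the definition that points at another provider kind."""
    retargeted = copy.deepcopy(cluster)
    retargeted["spec"]["infrastructureRef"]["kind"] = infra_kind
    return retargeted


def changed_spec_keys(before: dict, after: dict) -> list[str]:
    """List the top-level spec keys whose values differ between two definitions."""
    return sorted(
        key for key in before["spec"] if before["spec"][key] != after["spec"].get(key)
    )


cluster_bare_metal = with_provider(cluster_hosted, "Metal3Cluster")
print(changed_spec_keys(cluster_hosted, cluster_bare_metal))  # ['infrastructureRef']
```

This is a simplification. A real move also needs the provider-specific objects that the reference points to (the infrastructure cluster and its machine templates), and the provider's API version may differ. The Cluster-level shape is what stays put.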
This matters differently depending on what you're building. For organizations with data sovereignty requirements (regulated industries, public sector workloads, compliance-heavy environments), the ability to relocate clusters without rewriting lifecycle tooling isn't a nice-to-have. It's a prerequisite for treating infrastructure location as an ongoing decision rather than a permanent commitment.
For platform teams managing clusters across environments (development, staging, production, on-premises, hosted), a consistent lifecycle API means one operational model across all of them. No context-switching depending on where the cluster physically runs.
What It Means Operationally
Upgrades are declared, not scripted. As covered in the zero-downtime upgrade deep-dive, Navos cluster upgrades follow a strict, controller-driven sequence. Control plane nodes upgrade first, in rolling fashion, one at a time, with the API server load balancer maintaining availability throughout. Worker nodes follow via MachineDeployment rollouts (cordon, drain, reschedule, replace), with Pod Disruption Budgets respected at every step. No scripts, no manual sequencing, no intervention required.
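The worker rollout order can be sketched as a small Python simulation. It assumes a surge-by-one strategy, where a replacement machine is created before an old one is drained, and it does not model Pod Disruption Budgets or readiness waits. rollout_plan and min_ready_during are hypothetical helpers, not Cluster API functions.

```python
def rollout_plan(old_machines: list[str], target_version: str) -> list[tuple[str, str]]:
    """Return ordered (action, machine) steps for a surge-by-one rolling replacement."""
    steps = []
    for old in old_machines:
        new = f"{old}-{target_version}"
        steps.append(("create", new))   # bring up the replacement first
        steps.append(("cordon", old))   # stop scheduling new pods on the old node
        steps.append(("drain", old))    # evict pods so they reschedule elsewhere
        steps.append(("delete", old))   # remove the old machine
    return steps


def min_ready_during(old_machines: list[str], steps: list[tuple[str, str]]) -> int:
    """Return the lowest number of machines serving workloads during the plan."""
    serving = set(old_machines)
    lowest = len(serving)
    for action, name in steps:
        if action == "create":
            serving.add(name)
        elif action == "drain":
            serving.discard(name)
        lowest = min(lowest, len(serving))
    return lowest


workers = ["worker-0", "worker-1", "worker-2"]
plan = rollout_plan(workers, "v1.30.2")

print(plan[:4])
print(min_ready_during(workers, plan))  # 3
```

Because each replacement is created before its predecessor is drained, the number of machines serving workloads never drops below the original three in this model.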
Auto-healing is native to the model. Machine health checks are a Cluster API primitive. If a node becomes unreachable or fails defined health criteria, the controller replaces it declaratively, without paging an engineer. The cluster reconciles toward desired state automatically.
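The decision behind a machine health check can be sketched in a few lines of Python. This mirrors the general idea of matching a node condition against a rule with a timeout. The class names, field names, and the 300-second timeout are illustrative assumptions, not the Cluster API schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnhealthyRule:
    """A condition that marks a node unhealthy once it has lasted long enough."""
    condition_type: str
    status: str
    timeout_seconds: int


@dataclass(frozen=True)
class ObservedCondition:
    """A node condition as observed, with how long it has held its status."""
    condition_type: str
    status: str
    seconds_in_status: int


def needs_remediation(rules: list[UnhealthyRule], observed: list[ObservedCondition]) -> bool:
    """Return True if any observed condition matches a rule past its timeout."""
    for rule in rules:
        for cond in observed:
            if (
                cond.condition_type == rule.condition_type
                and cond.status == rule.status
                and cond.seconds_in_status >= rule.timeout_seconds
            ):
                return True
    return False


# Illustrative rules: a node that is not Ready for five minutes gets replaced.
rules = [
    UnhealthyRule("Ready", "Unknown", 300),
    UnhealthyRule("Ready", "False", 300),
]

unreachable = [ObservedCondition("Ready", "Unknown", 400)]
briefly_not_ready = [ObservedCondition("Ready", "False", 100)]

print(needs_remediation(rules, unreachable))        # True
print(needs_remediation(rules, briefly_not_ready))  # False
```

The timeout is what keeps a brief blip from triggering a replacement. Only a condition that persists past it marks the machine for remediation.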
Cluster state is auditable. Every change to cluster topology goes through the Kubernetes API. kubectl works. GitOps works. RBAC works. If you need to know what changed, when, and what triggered it, the audit trail lives in the API server, not a separate system with its own access model.
The tooling ecosystem is shared. Because Navos runs upstream Cluster API, every tool built to work with CAPI works with Navos. Fleet management tooling, GitOps integrations, provider-specific automation: none of it requires porting or adaptation.
What This Means for You
If you're evaluating Navos or any managed Kubernetes platform, the question worth asking isn't only "what features does it have?" It's "what does it cost me to leave?"
If the platform owns your lifecycle tooling, leaving means rewriting it. If the platform forks Kubernetes, staying means accepting drift from upstream. If lifecycle logic lives in proprietary automation, debugging a failed upgrade requires the platform team, not your own engineers.
Navos is built on Cluster API specifically so that none of those costs apply. Your clusters are defined as standard Cluster API objects. Your workloads run on upstream Kubernetes. Your manifests are yours. Leave any time; nothing you've built is stranded.
That's not an accidental property of the architecture. It's the point of the architecture.