Orchestrating Multi-Cloud Failover with Cluster API and OpenStack for Zero Downtime
How Cluster API and OpenStack enable automated cross-region failover with zero downtime. A technical walkthrough using CAPO, Ceph, and Atmosphere.
High availability is no longer optional. AI inference, customer-facing applications, and mission-critical services are expected to remain available even when an entire region fails. The business impact of getting this wrong is significant: according to Uptime Institute's 2025 Annual Outage Analysis, 54% of organizations said their most recent major outage cost more than $100,000, while one in five reported losses exceeding $1 million.
Most organizations rely on their cloud provider's native disaster recovery capabilities. The problem is that those capabilities are designed to work within a single cloud ecosystem. Your failover strategy depends on proprietary services, provider-specific APIs, and infrastructure that isn't easily portable.
Cluster API (CAPI) offers a different approach. It extends Kubernetes' declarative model to cluster lifecycle management, providing a consistent API for provisioning, upgrading, and managing Kubernetes clusters across infrastructure providers. CAPO (Cluster API Provider OpenStack) brings those capabilities to OpenStack, enabling Kubernetes clusters to span independent OpenStack regions while being managed from a single control plane.
VEXXHOST took this further by building the magnum-cluster-api driver, which uses CAPO under the hood but wraps it behind Magnum's familiar OpenStack API. Users create and manage Kubernetes clusters through standard OpenStack commands while CAPI and CAPO handle the lifecycle operations underneath. This makes Cluster API accessible to teams already working within OpenStack without requiring them to interact with Kubernetes APIs directly.
Instead of relying on manual disaster recovery procedures, the desired state of your infrastructure is defined in Kubernetes manifests. If a cluster becomes unavailable, the management cluster reconciles that desired state, automating cluster lifecycle operations across regions using the same declarative APIs.
In this post, we explore how Cluster API and CAPO enable resilient Kubernetes deployments on OpenStack, how cross-region architectures can be designed using upstream technologies, and why an open, portable control plane built on Atmosphere by VEXXHOST provides a practical alternative to provider-specific disaster recovery solutions.
Cluster API (CAPI) brings declarative, Kubernetes-style APIs to cluster lifecycle management. Instead of provisioning clusters through scripts or templates, you define the desired state in manifests, and CAPI controllers continuously reconcile the infrastructure to match.
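The reconcile idea can be pictured with a small sketch. This is a toy model, not Cluster API's controller code: the `ClusterState` type and `plan_actions` function are hypothetical names, and real controllers track far more than a node count and a version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterState:
    """Simplified cluster state: a node count and a Kubernetes version."""
    nodes: int
    version: str


def plan_actions(desired: ClusterState, actual: ClusterState) -> list[str]:
    """Return the actions needed to move `actual` toward `desired`.

    An empty list means the two states already match, so a repeated
    reconcile pass is a no-op.
    """
    actions = []
    if actual.version != desired.version:
        actions.append(f"upgrade {actual.version} -> {desired.version}")
    if actual.nodes < desired.nodes:
        actions.append(f"create {desired.nodes - actual.nodes} node(s)")
    elif actual.nodes > desired.nodes:
        actions.append(f"delete {actual.nodes - desired.nodes} node(s)")
    return actions


if __name__ == "__main__":
    desired = ClusterState(nodes=3, version="v1.30.2")
    observed = ClusterState(nodes=1, version="v1.30.2")
    print(plan_actions(desired, observed))  # two nodes are missing
    print(plan_actions(desired, desired))   # nothing to do
```

The point of the design is that the operator only ever edits the desired state; the difference between desired and observed state is computed, never scripted by hand.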
CAPO (Cluster API Provider OpenStack) is the infrastructure provider that translates Cluster API resources into OpenStack services such as Nova, Neutron, and Cinder. A dedicated management cluster runs the Cluster API controllers and is responsible for provisioning and operating workload clusters, which are the Kubernetes environments where applications run.
VEXXHOST built the magnum-cluster-api driver to modernize Kubernetes provisioning in OpenStack. Rather than relying on Magnum's legacy implementation, the driver turns Magnum into an API layer while delegating cluster lifecycle management to the actively maintained Cluster API project. It supports modern operating systems such as Flatcar and Ubuntu and eliminates the need for Heat templates.
For OpenStack users, this provides a familiar experience. Clusters can still be managed through the Magnum API and OpenStack CLI, while Cluster API and CAPO handle provisioning and lifecycle management behind the scenes. The result is a simpler architecture that can provision Kubernetes clusters in less than five minutes.
Within Atmosphere, VEXXHOST's deployment platform for OpenStack, a Kubernetes management cluster already exists to deploy and operate the cloud itself. Adding Cluster API controllers therefore requires minimal additional infrastructure. The Magnum driver translates Magnum API requests into native Cluster API resources, allowing Kubernetes clusters to be managed using upstream APIs instead of bespoke orchestration logic.
This architecture also lays the foundation for multi-region operations. A single management cluster can provision and manage workload clusters across multiple independent OpenStack regions. Because Atmosphere deploys the same upstream OpenStack and Kubernetes stack everywhere, Cluster API interacts with each region in a consistent way, making cross-region lifecycle management and recovery significantly simpler.
For more details on the implementation, see the original announcement: Cluster API Driver for OpenStack Magnum.
The architecture has three components: a management cluster, workload clusters across independent regions, and declarative manifests that define the desired state.
The management cluster runs the Cluster API controllers in a neutral location, isolated from the regions it manages. It doesn't run application workloads. Its only responsibility is monitoring cluster state and reconciling when reality drifts from the desired state.
Workload clusters run across independent OpenStack regions. Each region is fully autonomous with its own Nova, Neutron, Cinder, and Ceph services. There are no stretched control planes or shared databases. If one region goes offline, the others continue operating independently.
Declarative manifests tie everything together. Node counts, machine types, networking, and Kubernetes versions are all defined in YAML. Cluster API continuously compares the desired state with the actual state and automatically reconciles any differences.
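As a rough illustration of what such manifests contain, the sketch below renders a worker pool for two regions from one function. It assumes the `v1beta1` API versions of Cluster API and CAPO; field names should be checked against the versions you actually run. The cluster names, flavor name, and `render_workers` helper are hypothetical, and the manifests are trimmed (image, network, identity, and selector settings are left out). The output is JSON, which Kubernetes accepts in place of YAML.

```python
import json

CAPI = "cluster.x-k8s.io/v1beta1"
CAPO = "infrastructure.cluster.x-k8s.io/v1beta1"
BOOTSTRAP = "bootstrap.cluster.x-k8s.io/v1beta1"


def render_workers(region: str, replicas: int, flavor: str, version: str) -> list[dict]:
    """Render a trimmed worker pool for one region as plain dicts."""
    cluster = f"workload-{region}"   # hypothetical naming scheme
    template = f"{cluster}-workers"
    machine_template = {
        "apiVersion": CAPO,
        "kind": "OpenStackMachineTemplate",
        "metadata": {"name": template},
        # Machine type: the OpenStack flavor used for each worker node.
        "spec": {"template": {"spec": {"flavor": flavor}}},
    }
    deployment = {
        "apiVersion": CAPI,
        "kind": "MachineDeployment",
        "metadata": {"name": template},
        "spec": {
            "clusterName": cluster,
            "replicas": replicas,  # node count
            "template": {"spec": {
                "clusterName": cluster,
                "version": version,  # Kubernetes version
                "bootstrap": {"configRef": {
                    "apiVersion": BOOTSTRAP,
                    "kind": "KubeadmConfigTemplate",
                    "name": template,
                }},
                "infrastructureRef": {
                    "apiVersion": CAPO,
                    "kind": "OpenStackMachineTemplate",
                    "name": template,
                },
            }},
        },
    }
    return [machine_template, deployment]


if __name__ == "__main__":
    for region in ("montreal", "amsterdam"):
        for doc in render_workers(region, replicas=3,
                                  flavor="example-flavor", version="v1.30.2"):
            print(json.dumps(doc, indent=2))
```

Because every region is rendered from the same function, the only differences between regions are the inputs, which is what keeps the desired state comparable across locations.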
On Atmosphere, this architecture maps naturally. VEXXHOST operates regions in Montreal, Santa Clara, and Amsterdam, each running the same upstream OpenStack and Kubernetes stack. The management cluster interacts with every region in exactly the same way. Add on-premises or colocation deployments, and the list of failover targets expands while still using the same Cluster API interface.
The result is a multi-region architecture built on open technologies rather than proprietary disaster recovery tooling or vendor-specific multi-region services.
For a detailed breakdown of multi-region deployment patterns, including active-active, active-passive, and federated architectures, see Running Multi-Region Kubernetes on OpenStack.
Failover for stateless workloads is relatively straightforward. New pods can be started in the surviving region and traffic redirected. Stateful AI workloads are far more complex.
Training checkpoints, model artifacts, feature stores, inference session data, and vector databases all contain state that must survive a regional outage. If the workload is restored but the data is stale, incomplete, or unavailable, recovery is only partial.
Ceph provides the storage foundation. Within Atmosphere, it delivers block and object storage with support for replication across regions. RBD mirroring keeps persistent volumes synchronized so that when a workload cluster is brought up in another region, it can mount storage containing the latest replicated data. Because replication is typically asynchronous, writes made shortly before an unexpected outage may not have reached the secondary region, so the achievable recovery point objective (RPO) is small but not zero. The acceptable balance between consistency and availability should be determined by the requirements of each workload.
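To make the RPO trade-off concrete, here is a back-of-the-envelope estimate. It assumes snapshot-based mirroring on a fixed schedule, where a write made just after one snapshot is only protected once the next snapshot has been taken and fully transferred. The function name and all numbers are illustrative assumptions, not measurements from Atmosphere.

```python
def worst_case_rpo_seconds(snapshot_interval_s: float,
                           changed_gb_per_interval: float,
                           link_gbit_per_s: float) -> float:
    """Estimate the worst-case data-loss window for scheduled async mirroring.

    Worst case = one full snapshot interval plus the time needed to ship
    that interval's changed data to the secondary region.
    """
    if link_gbit_per_s <= 0:
        raise ValueError("link bandwidth must be positive")
    transfer_s = changed_gb_per_interval * 8 / link_gbit_per_s  # GB -> Gbit
    return snapshot_interval_s + transfer_s


if __name__ == "__main__":
    # Illustrative inputs: 5-minute schedule, 10 GB changed, 1 Gbit/s link.
    rpo = worst_case_rpo_seconds(300, 10, 1)
    print(f"worst-case RPO is about {rpo:.0f} s")
```

Shortening the interval or adding bandwidth lowers the estimate; if the transfer time approaches the interval itself, replication cannot keep up and the window grows.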
At the application layer, additional synchronization may still be necessary. Model registries must be available in every target region. Feature stores require consistent snapshots. Inference services that maintain session state need a shared or replicated backend. These decisions depend on the application, but the infrastructure must provide the building blocks. Ceph supplies the storage layer, Neutron provides the networking foundation, and Keystone delivers a consistent identity service across environments.
The key principle is simple: recovering compute without recovering data is not true disaster recovery. Storage, networking, and identity all need to be designed for multi-region operation from the beginning.
A regional outage doesn't have to become a manual recovery exercise. With Cluster API, the management cluster continuously compares the desired state of the infrastructure with what is actually running. If an entire OpenStack region becomes unavailable, the same cluster definitions can be applied against a healthy region: CAPO provisions replacement infrastructure there, Kubernetes schedules workloads onto the new nodes, and replicated Ceph volumes provide access to the latest available data.
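Choosing which healthy region receives the replacement cluster is a policy decision that sits outside Cluster API itself. One way to picture it is the sketch below. The `Region` type, its fields, and `pick_failover_target` are hypothetical; a real policy would likely also weigh latency, cost, and data-residency rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Region:
    name: str
    healthy: bool        # region API endpoints are reachable
    free_vcpus: int      # spare compute capacity
    has_replica: bool    # replicated volumes for the workload exist here


def pick_failover_target(regions: list[Region], failed: str,
                         required_vcpus: int) -> Optional[Region]:
    """Pick the eligible region with the most spare capacity, or None."""
    candidates = [
        r for r in regions
        if r.name != failed
        and r.healthy
        and r.has_replica
        and r.free_vcpus >= required_vcpus
    ]
    return max(candidates, key=lambda r: r.free_vcpus, default=None)


if __name__ == "__main__":
    regions = [
        Region("montreal", healthy=False, free_vcpus=0, has_replica=True),
        Region("amsterdam", healthy=True, free_vcpus=256, has_replica=True),
        Region("santa-clara", healthy=True, free_vcpus=512, has_replica=False),
    ]
    target = pick_failover_target(regions, failed="montreal", required_vcpus=64)
    print(target.name if target else "no eligible region")
```

Requiring `has_replica` encodes the principle that compute without data is not recovery: a region with more capacity but no replicated volumes is skipped.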
From an operator's perspective, recovery follows the same declarative workflow used for day-to-day cluster management. Rather than executing a disaster recovery runbook or rebuilding infrastructure by hand, the management cluster reconciles the environment until it matches the desired state defined in Kubernetes manifests.
Recovery time depends on several factors, including failure detection, machine provisioning, image availability, and storage replication. These operational parameters can be tuned to meet the recovery objectives of different workloads, balancing recovery speed with consistency requirements.
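A simple way to reason about these factors is to add them up as a budget and compare the total with the recovery time objective (RTO). The sketch below treats the phases as strictly sequential, which gives an upper bound because some phases overlap in practice. The phase names and durations are placeholder assumptions, not measured values.

```python
def recovery_time_seconds(phases: dict[str, float]) -> float:
    """Sum sequential recovery phases into a total recovery time estimate."""
    for name, seconds in phases.items():
        if seconds < 0:
            raise ValueError(f"phase {name!r} has a negative duration")
    return sum(phases.values())


def meets_objective(phases: dict[str, float], rto_seconds: float) -> bool:
    """True when the estimated recovery time fits within the RTO."""
    return recovery_time_seconds(phases) <= rto_seconds


if __name__ == "__main__":
    # Placeholder durations in seconds; replace with values you measure.
    phases = {
        "failure_detection": 60,
        "machine_provisioning": 300,
        "storage_promotion": 120,
        "workload_startup": 90,
    }
    total = recovery_time_seconds(phases)
    print(f"estimated recovery time: {total} s")
    print("within 15-minute RTO:", meets_objective(phases, 900))
```

Breaking the estimate into phases shows where tuning pays off: pre-staging images shortens provisioning, while tighter health checks shorten detection at the cost of more false positives.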
What makes this approach particularly effective on Atmosphere is consistency. Every region runs the same upstream OpenStack, Kubernetes, and Ceph stack, allowing Cluster API to manage infrastructure predictably regardless of location. Whether the target is Montreal, Amsterdam, or an on-premises deployment, the same APIs, controllers, and provisioning workflows apply.
Instead of relying on proprietary disaster recovery tooling or cloud-specific failover services, organizations can build a portable, multi-region architecture using upstream open source technologies. The result is infrastructure that behaves consistently across environments while giving operators the flexibility to recover workloads wherever capacity is available.
Hyperscalers offer multi-region failover, but it only works within their own ecosystem. AWS to AWS. Azure to Azure. GCP to GCP. Your resilience strategy is only as portable as the provider allows.
CAPI on OpenStack changes the boundary. Failover targets aren't limited to regions within a single provider. They can span independent OpenStack deployments, different data centers, different geographies, or entirely different infrastructure providers. The management cluster doesn't care what's underneath as long as the CAPI provider interface is consistent.
This is where Atmosphere's architecture matters. Every deployment runs upstream OpenStack and CNCF-certified Kubernetes. No forks. No vendor extensions. A CAPI management cluster targeting one Atmosphere region can target any other Atmosphere deployment with identical behavior, whether it is on-premises, in colocation, hosted, or on a different continent. The reconciliation loop works the same way everywhere.
That means failover strategies aren't constrained by provider boundaries. An organization can fail over from a hosted Atmosphere region to an on-premises deployment, from one country to another for sovereignty reasons, or from colocation to cloud during a capacity event. The architecture supports all of it because the infrastructure is open and consistent.
Provider-locked failover gives you resilience within one ecosystem. CAPI on open infrastructure gives you resilience across all of them.
For more on why open infrastructure provides this level of flexibility, read Why OpenStack and Kubernetes Are Better Together for AI.
Zero downtime isn't a feature you buy from a provider. It's an architecture you build on infrastructure you control.
Cluster API provides the declarative control plane. CAPO provides the OpenStack integration. VEXXHOST's magnum-cluster-api driver makes it operational within Atmosphere. Ceph provides the stateful replication. And upstream OpenStack and Kubernetes ensure consistent behavior across every failover target.
The result: automated, cross-region failover on open infrastructure. No vendor lock-in. No proprietary disaster recovery tooling. No resilience strategy that stops at one provider's boundary.
Explore Atmosphere and build infrastructure that doesn't go down when a region does.