Multi-Cloud Failover with Cluster API and OpenStack

How Cluster API and OpenStack enable automated cross-region failover with zero downtime. A technical walkthrough using CAPO, Ceph, and Atmosphere.

§1 When the Region Goes Dark

High availability is no longer optional. AI inference, customer-facing applications, and mission-critical services are expected to remain available even when an entire region fails. The business impact of getting this wrong is significant: according to Uptime Institute's 2025 Annual Outage Analysis, 54% of organizations said their most recent major outage cost more than $100,000, while one in five reported losses exceeding $1 million.

Most organizations rely on their cloud provider's native disaster recovery capabilities. The problem is that those capabilities are designed to work within a single cloud ecosystem. As a result, your failover strategy ends up depending on proprietary services, provider-specific APIs, and infrastructure that isn't easily portable.

Cluster API (CAPI) offers a different approach. It extends Kubernetes' declarative model to cluster lifecycle management, providing a consistent API for provisioning, upgrading, and managing Kubernetes clusters across infrastructure providers. CAPO (Cluster API Provider OpenStack) brings those capabilities to OpenStack, so that Kubernetes clusters running in independent OpenStack regions can all be managed from a single management cluster.
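To make the declarative model concrete, here is a minimal Python sketch that builds the pair of objects Cluster API links together for an OpenStack-backed cluster: a generic Cluster and the provider-specific OpenStackCluster it references. It is an illustration under stated assumptions, not a complete or authoritative manifest. The v1beta1 API versions and the identityRef layout follow recent Cluster API and CAPO releases as we understand them and may differ in your installed version. The cluster names, cloud names, and the "-cloud-config" secret suffix are hypothetical, and required pieces such as the control plane reference and networking are deliberately left out.

```python
import json


def cluster_manifests(name: str, cloud_name: str, namespace: str = "default") -> list[dict]:
    """Build a Cluster and the OpenStackCluster it points at, as plain dicts.

    Assumption: v1beta1 API groups; adjust to the versions your
    management cluster actually serves.
    """
    infra = {
        "apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
        "kind": "OpenStackCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Hypothetical secret name; cloudName selects an entry in clouds.yaml,
            # which is where the target OpenStack region is configured.
            "identityRef": {"name": f"{name}-cloud-config", "cloudName": cloud_name},
        },
    }
    cluster = {
        "apiVersion": "cluster.x-k8s.io/v1beta1",
        "kind": "Cluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # The generic Cluster delegates infrastructure to the provider object.
            "infrastructureRef": {
                "apiVersion": infra["apiVersion"],
                "kind": infra["kind"],
                "name": name,
            },
        },
    }
    return [cluster, infra]


if __name__ == "__main__":
    # One definition per region, both applied to the same management cluster.
    for name, cloud in [("prod-east", "region-east"), ("prod-west", "region-west")]:
        for doc in cluster_manifests(name, cloud):
            print(json.dumps(doc, indent=2))
```

The same function produces both regional definitions, which is the point of the model: the regions differ only in the data passed in, not in the procedure used to create them.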

VEXXHOST took this further by building the magnum-cluster-api driver, which uses CAPO under the hood but wraps it behind Magnum's familiar OpenStack API. Users create and manage Kubernetes clusters through standard OpenStack commands while CAPI and CAPO handle the lifecycle operations underneath. This makes Cluster API accessible to teams already working within OpenStack without requiring them to interact with Kubernetes APIs directly.

Instead of relying on manual disaster recovery procedures, you define the desired state of your infrastructure in Kubernetes manifests. If a cluster becomes unavailable, the management cluster reconciles toward that desired state, automating cluster lifecycle operations across regions using the same declarative APIs.
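The reconciliation idea can be sketched in a few lines of Python. This is a toy model of the pattern only, not Cluster API's controller code: the DesiredCluster type, the reconcile function, and the action strings are all hypothetical names introduced for illustration. It compares what is declared against what is observed and returns the actions needed to close the gap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesiredCluster:
    """Hypothetical stand-in for a declared cluster definition."""
    name: str
    region: str
    workers: int


def reconcile(desired: list[DesiredCluster], observed: dict[str, int]) -> list[str]:
    """Return the actions needed to move observed state toward desired state.

    `observed` maps cluster name to the number of ready workers; a cluster
    missing from the mapping is treated as unavailable.
    """
    actions = []
    for cluster in desired:
        ready = observed.get(cluster.name)
        if ready is None:
            # Cluster is gone (for example, its region is down): recreate it.
            actions.append(f"create {cluster.name} in {cluster.region}")
        elif ready < cluster.workers:
            # Cluster exists but is below its declared capacity.
            actions.append(f"scale {cluster.name} from {ready} to {cluster.workers}")
    return actions


if __name__ == "__main__":
    desired = [
        DesiredCluster("app-east", "east", 3),
        DesiredCluster("app-west", "west", 3),
    ]
    # app-west is missing from the observed state, so one action is produced.
    print(reconcile(desired, {"app-east": 3}))
```

Because the function is driven only by the difference between declared and observed state, running it again after the actions complete yields an empty list. That idempotence is what lets a controller loop run continuously instead of relying on a one-off recovery runbook.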

In this post, we explore how Cluster API and CAPO enable resilient Kubernetes deployments on OpenStack, how cross-region architectures can be designed using upstream technologies, and why an open, portable control plane built on Atmosphere by VEXXHOST provides a practical alternative to provider-specific disaster recovery solutions.


