For platform engineers running a serious evaluation of Atmosphere and OpenStack.
The reputation OpenStack carries is partly earned and partly dated. Bare OpenStack deployments, often maintained without tooling discipline and running releases that fell years behind, have caused real pain. That history is legitimate.
But evaluating OpenStack in 2026 against that history is like evaluating Linux against its early 1990s installation experience.
The more useful question for a PoC is narrower: can your team operate this platform at production quality, with your existing toolchain, without specialising in OpenStack internals?
That question has a concrete answer. In this post, we map the technical concerns worth pressure-testing during an evaluation.
Complexity and Day 2 Operations
The core operational problem with bare OpenStack was the absence of a repeatable, opinionated assembly. When Nova, Neutron, Cinder, Keystone, and Glance are wired together differently in every deployment, there's no institutional knowledge that transfers, no runbooks that generalise, and no upgrade path that doesn't require the engineer who originally built the environment.
Atmosphere solves this at the architectural level rather than the tooling level. The full stack — compute, storage, networking, identity, monitoring — is deployed via versioned Ansible playbooks and Helm charts.
More significantly, OpenStack services in Atmosphere run as Kubernetes workloads. This changes the operational model substantially. Upgrades are rolling deployments with health-gate checkpoints. Observability comes from the same Prometheus and Grafana stack the Kubernetes ecosystem already standardised on. There's no separate monitoring system to operate for the infrastructure layer.
PoC validation: At environment standup, Grafana should surface live metrics for compute node health, Ceph cluster status, and network throughput with no additional configuration. If answering "what is my current storage utilisation?" requires SSH access to a node, that's a gap worth flagging before going further.
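As a concrete example of the kind of question the bundled monitoring should answer out of the box, here is a Prometheus alerting rule sketch for Ceph health. It assumes the Ceph mgr Prometheus module's `ceph_health_status` metric (0 = HEALTH_OK) is being scraped; the group and alert names are illustrative:

```yaml
# Hypothetical Prometheus rule: fire when Ceph leaves HEALTH_OK for 5 minutes.
groups:
  - name: poc-validation          # illustrative group name
    rules:
      - alert: CephHealthDegraded
        expr: ceph_health_status > 0   # 0 = HEALTH_OK, 1 = WARN, 2 = ERR
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ceph cluster health is not OK"
```

If a rule like this can be evaluated against the standup environment without wiring up a new exporter, the observability claim holds.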
Upgrades
This warrants separate treatment because it's where OpenStack deployments have historically accumulated the most technical debt. Major version upgrades across a large service surface with complex inter-service dependencies, database schema migrations, and API version negotiation are high-risk operations when done against running services. The operational response in many organisations was simply to defer upgrades, which compounded the problem.
Atmosphere's approach is structurally different. Because OpenStack services are Kubernetes pods, a version upgrade is a standard rolling deployment. New versions are deployed alongside the current ones; traffic shifts; unhealthy pods trigger rollback before the old versions are terminated. The upgrade path is the same mechanism as any other Kubernetes workload update. This means it's testable, observable, and reversible in a way that legacy OpenStack upgrades were not.
VEXXHOST tracks the upstream OpenStack release cadence (two releases per year) and maintains Atmosphere against it. The intent is that staying current is a routine operation rather than a project.
PoC validation: Run an actual upgrade during the evaluation period, not a patch. Instrument what happens when a component fails mid-upgrade. The rollback behaviour should be automatic and the resulting state should be consistent — if it requires manual intervention to recover, that's the real operational cost of upgrades at 2am.
Toolchain Integration
The OpenStack Terraform provider and Ansible OpenStack collection cover major OpenStack services and work against any OpenStack-compatible deployment without modification. Code written against Atmosphere works against every other OpenStack cloud. There's no provider-specific abstraction — openstack_compute_instance_v2, openstack_networking_network_v2, and openstack_networking_secgroup_v2 are the same resources regardless of who deployed the cluster.
This portability is the material difference from hyperscaler IaC. Terraform modules written against AWS or GCP use provider-specific resources that have no equivalent elsewhere. OpenStack Terraform is effectively vendor-neutral by design.
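To make the portability concrete, here is a minimal sketch using the resources named above. Names, CIDR, image, and flavor are illustrative; the same HCL targets any OpenStack cloud:

```hcl
# Minimal, vendor-neutral sketch; identifiers below are placeholders.
provider "openstack" {
  cloud = "atmosphere" # entry in clouds.yaml
}

resource "openstack_networking_network_v2" "poc" {
  name           = "poc-net"
  admin_state_up = true
}

resource "openstack_networking_subnet_v2" "poc" {
  name       = "poc-subnet"
  network_id = openstack_networking_network_v2.poc.id
  cidr       = "10.0.10.0/24"
  ip_version = 4
}

resource "openstack_networking_secgroup_v2" "poc" {
  name        = "poc-secgroup"
  description = "PoC security group"
}

resource "openstack_compute_instance_v2" "poc" {
  name            = "poc-vm"
  image_name      = "ubuntu-22.04" # assumption: image available in the cloud
  flavor_name     = "m1.small"     # assumption: flavor name
  security_groups = [openstack_networking_secgroup_v2.poc.name]

  network {
    uuid = openstack_networking_network_v2.poc.id
  }
}
```

Nothing in this module is specific to Atmosphere; repointing `cloud` at a different OpenStack deployment is the whole migration.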
GitOps workflows compose naturally with this. Because the infrastructure is fully API-driven and Terraform-native, Flux or ArgoCD managing infrastructure state against an Atmosphere environment works the same way it would against any other Terraform-backed platform.
PoC validation: Translate an existing production Terraform module — not a toy example, something representative of your real infrastructure — to the OpenStack provider. The friction in that translation is signal about the API learning curve and the completeness of provider coverage for your use cases.
Kubernetes Integration
Atmosphere provides CNCF-conformant Kubernetes via Cluster API. The distinction worth understanding is that these are real clusters with full control plane access — not a managed service where the provider controls the API server. You get kubeconfig, you have cluster-admin access, and you operate the cluster yourself with VEXXHOST as a support layer.
The operational properties:
- Block storage attaches via the standard CSI interface using Ceph-backed Cinder volumes. No custom storage classes, no driver maintenance.
- Cluster Autoscaler integration is supported. Node pool scaling is driven by workload demand.
- Rolling Kubernetes version upgrades are handled without manual node replacement.
- Clusters are provisioned in isolated OpenStack tenant networks.
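The block-storage point can be sketched as a standard StorageClass and claim. The provisioner name is the upstream OpenStack Cinder CSI driver's; the object names and size are illustrative:

```yaml
# Hypothetical StorageClass backed by Ceph via the Cinder CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
provisioner: cinder.csi.openstack.org
allowVolumeExpansion: true
# No "type" parameter: falls back to the cloud's default Cinder volume type.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: poc-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ceph-rbd
  resources:
    requests:
      storage: 10Gi
```

Binding this claim and mounting it into a pod exercises the Cinder and Ceph path end to end without any custom driver work.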
One architectural detail worth understanding before the PoC: Atmosphere runs OpenStack control-plane services as Kubernetes workloads (enabling the upgrade model described above), and then provisions Kubernetes clusters on top of OpenStack for user workloads. The two layers serve different purposes and shouldn't be conflated operationally.
PoC validation: Provision a cluster, then deliberately terminate a worker node while a deployment is running. The time to workload recovery is your real RTO for node failures.
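One way to put a number on the drill above is a small poller against the workload's health endpoint, started before terminating the node. This is a generic sketch, not an Atmosphere tool; the URL, timeout, and interval are illustrative:

```python
import time
import urllib.error
import urllib.request


def wait_for_recovery(url: str, timeout: float = 600.0, interval: float = 2.0) -> float:
    """Poll `url` until it answers HTTP 200; return elapsed seconds.

    Run this against the workload's endpoint while the node is down:
    the returned value is the observed recovery time for the drill.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return time.monotonic() - start
        except (urllib.error.URLError, OSError):
            pass  # endpoint still down; keep polling
        time.sleep(interval)
    raise TimeoutError(f"workload did not recover within {timeout}s")
```

Logging the returned value for several repetitions of the drill gives a defensible RTO figure rather than a single anecdote.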
Identity Federation
Atmosphere uses Keycloak as an identity broker in front of OpenStack Keystone in Hosted and On-Premise editions. Keycloak supports LDAP, SAML 2.0, and OpenID Connect natively. Your existing directory — Active Directory, Okta, Azure AD, or any standards-compliant IdP — connects without custom middleware.
From an operational standpoint, this means users authenticate via your existing SSO and their role assignments in the IdP map directly to OpenStack project roles. Application credentials for service accounts are scoped to specific projects and don't require embedding user passwords in IaC or CI/CD pipelines.
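In practice that looks like a `clouds.yaml` entry authenticating with an application credential instead of a user password, using the standard OpenStack client configuration keys. The endpoint, region, and credential values are placeholders:

```yaml
# clouds.yaml: service-account auth with a scoped application credential.
clouds:
  atmosphere-ci:                          # illustrative cloud name
    auth_type: v3applicationcredential
    auth:
      auth_url: https://keystone.example.com/v3   # placeholder endpoint
      application_credential_id: "<credential-id>"
      application_credential_secret: "<credential-secret>"
    region_name: "RegionOne"              # illustrative
```

The same entry drives the `openstack` CLI, the Terraform provider (`cloud = "atmosphere-ci"`), and openstacksdk-based tooling, with no user password anywhere in the pipeline.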
VEXXHOST publishes a Vault/OpenBao plugin that generates short-lived OpenStack application credentials on demand, eliminating static API keys for automated workloads entirely.
PoC validation: Complete the Keycloak federation to your IdP in the first week. Verify that a user with a specific directory group receives the correct OpenStack project role. Verify that a service account can authenticate using a scoped application credential without any user credential exposure. Identity integration that requires more than a few hours of configuration at this stage is worth documenting as a risk.
Performance
The relevant hardware capabilities in Atmosphere: SR-IOV, DPDK, and ASAP2 for hardware-accelerated networking (line-rate throughput up to 100Gbps for instances that require it), NVMe-backed Ceph RBD pools for block storage, and dedicated GPU instances via PCI passthrough using enterprise NVIDIA accelerators. CPU and GPU resources in instances are not oversold.
For private cloud deployments, compute is single-tenant by design. The noisy neighbour problem is an architectural characteristic of oversubscribed multi-tenant environments. A dedicated private cloud doesn't have it.
Synthetic benchmarks are a poor proxy for production behaviour. The evaluation worth running is your actual workload, or the closest representative subset, against both the current platform and Atmosphere during the overlap period. That produces data with organisational credibility.
PoC validation: Run your real workload. If you're migrating from VMware, run the same job on both environments simultaneously to compare wall-clock time and resource utilisation. That comparison is the evidence that drives internal alignment.
Operating Model
Atmosphere is available in three deployment models, with the platform built on the same open-source stack across all of them:
Atmosphere Cloud
Multi-tenant cloud hosted out of VEXXHOST's global datacenters, billed per minute. The right option for teams that want public cloud economics with OpenStack API compatibility.
Atmosphere Hosted
Single-tenant dedicated cloud hosted out of VEXXHOST's global datacenters, billed per month. The right option for teams that need private cloud properties without the operational burden of running the infrastructure themselves.
Atmosphere On-Premise
The open source platform deployed in your own datacenter. VEXXHOST offers either full remote operations or a support-only engagement where your team executes and VEXXHOST provides the backstop.
For teams evaluating private cloud specifically, the Hosted and On-Premise editions are the most relevant comparisons. Because the underlying software stack is shared across all three models, IaC and operational knowledge transfer between them. VEXXHOST's support is staffed by engineers who operate and contribute to the upstream projects, so your tickets aren't going to a tiered helpdesk.
Clients tell us that distinction shows up most clearly during incidents and post-upgrade debugging, where the difference between a team that can read and modify source code and a team that files upstream tickets is measured in hours of downtime.
Structuring the PoC
A PoC that only validates the happy path validates nothing operationally useful. These are the scenarios worth running deliberately:
Compute node failure. Terminate a node while workloads are running. Measure time to recovery. This is your real HA baseline, not the SLA document.
Storage node failure. Remove a Ceph OSD while workloads are performing I/O. Measure performance degradation during rebalance and time to full recovery. Ceph's rebalance behaviour under load is what your storage resilience story actually looks like.
Version upgrade. Run a full version upgrade in the PoC environment. Instrument what happens when a component fails mid-process. Automatic rollback to a consistent state is the expected behaviour — manual intervention to recover from a failed upgrade is a significant operational risk.
IaC portability. Bring production Terraform modules, not examples. The translation effort is real signal, and running infrastructure you actually understand produces more useful evaluation data.
On the Lock-In Question
This comes up in every evaluation and is worth addressing directly, because it's structurally different for OpenStack than for any proprietary platform.
The OpenStack API is supported across every OpenStack-powered cloud globally. VM images use standard formats. Networking configuration is Neutron, portable to any Neutron deployment. IaC written against the OpenStack Terraform provider works anywhere.
Starting Point
Architecture review, PoC setup, and ongoing technical questions are all part of the engagement. Talk to an engineer about your infrastructure requirements and what a scoped evaluation looks like for your environment.