Protecting the App Inside the VM with Heat

Ruchi Roy

A running VM does not equal uptime. See how Heat-driven automation, Prometheus alerts, and Ceph-backed recovery turn OpenStack apps into self-healing, auto-scaling services.


Most high availability conversations in OpenStack circles focus on control planes, database quorum, and replication strategies for infrastructure components. But if you're a DevOps engineer responsible for the full application lifecycle, you know that a resilient app doesn't stop at “VM is up.”

A running VM doesn’t guarantee a healthy application. And the further up the stack you go, from Nova to the actual workload, the easier it is for that assumption to fail.

The application has to survive traffic spikes, VM crashes, misconfigured updates, and network anomalies. That’s why application-level HA needs to live inside your orchestration logic.

Heat helps with that by executing predefined orchestration logic in response to external events or alarms. It’s not a continuous control plane like Kubernetes, but rather an event-driven engine that responds predictably when triggered. Tools like Prometheus and AlertManager work together to detect conditions, evaluate thresholds, and instruct Heat when and how to react.

Application Resilience, Written as Code

With Heat, you define infrastructure behavior declaratively: if a metric crosses a threshold, spin up more VMs. If a resource fails a health check, replace it with a new instance. These orchestration behaviors are codified in templates and executed on demand, ensuring reproducibility, transparency, and version control.
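
As a rough illustration, the pattern above might look like the following Heat (HOT) sketch: an auto-scaling group of servers plus a scaling policy whose pre-signed URL an external alarm system can POST to. The image, flavor, and network names are placeholders, not values shipped by Atmosphere.

```yaml
heat_template_version: 2018-08-31

description: Minimal auto-scaling sketch; all names are illustrative only.

resources:
  app_group:
    type: OS::Heat::AutoScalingGroup
    properties:
      min_size: 2
      max_size: 10
      resource:
        type: OS::Nova::Server
        properties:
          image: ubuntu-22.04        # placeholder image name
          flavor: v3-standard-2      # placeholder flavor name
          networks:
            - network: app-net       # placeholder tenant network

  scale_out:
    type: OS::Heat::ScalingPolicy
    properties:
      auto_scaling_group_id: { get_resource: app_group }
      adjustment_type: change_in_capacity
      scaling_adjustment: 1          # add one instance per signal
      cooldown: 300                  # ignore further signals for 5 minutes

outputs:
  scale_out_url:
    description: Pre-signed URL; POSTing to it triggers the policy
    value: { get_attr: [scale_out, alarm_url] }
```

A matching scale-in policy with scaling_adjustment: -1 would complete the loop.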

Atmosphere builds on this by providing real-world HA patterns out of the box. For example, you can define a service that scales out under load, recovers IPs and volumes after a crash, and maintains traffic distribution through Octavia. These are pre-integrated patterns covering compute, storage, networking, and monitoring.

With Atmosphere, you don’t need to stitch systems together manually. You define the logic. The platform wires the rest.

Scaling Decisions Based on Reality

Auto-scaling is where HA meets real-world usage. When load increases, new resources come online. When it drops, they’re retired. But the quality of these decisions depends entirely on your telemetry and how you respond to it.

Atmosphere integrates Prometheus for metric collection and rule evaluation, and AlertManager for alert routing. Prometheus gathers performance data such as CPU, memory, latency, and disk I/O, and evaluates alert rules against it; AlertManager deduplicates and routes the resulting alerts. Additionally, Atmosphere supports scaling policies that combine multiple metrics.

For example, you could configure scaling to occur only when CPU usage exceeds 75% and API latency surpasses 300ms for 5 minutes.
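
Expressed as a Prometheus alerting rule, that compound condition might look like the sketch below. The metric names are assumptions: node_cpu_seconds_total comes from node_exporter, and http_request_duration_seconds_bucket stands in for whatever latency histogram your application exposes.

```yaml
groups:
  - name: app-scaling
    rules:
      - alert: AppUnderPressure
        # Fires only if both conditions hold continuously for 5 minutes
        expr: |
          avg(1 - rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.75
          and on()
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.3
        for: 5m
        labels:
          action: scale_out          # used later to route the alert to Heat
        annotations:
          summary: CPU above 75% and p95 API latency above 300ms for 5 minutes
```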

This ensures decisions reflect actual workload stress, and not just transient spikes or isolated indicators.

Lack of Recovery is an Issue

What separates robust systems from fragile ones is how they respond when downtime happens.

Atmosphere treats failures as recoverable events. If a VM dies, that failure is a trigger: a new VM spins up, its Ceph-backed RBD volume reattaches, the floating IP is reassigned, and Octavia resumes traffic flow once the new instance passes health checks.
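
That flow can be automatic because the relationships are declared up front. A sketch of how the pieces might be tied together in HOT follows; the volume type, networks, and Octavia pool name are hypothetical, and in practice the server would usually sit inside an auto-scaling or resource group so Heat can replace it.

```yaml
resources:
  app_port:
    type: OS::Neutron::Port
    properties:
      network: app-net                 # hypothetical tenant network

  app_server:
    type: OS::Nova::Server
    properties:
      image: ubuntu-22.04              # placeholder image
      flavor: v3-standard-4            # placeholder flavor
      networks:
        - port: { get_resource: app_port }

  data_volume:
    type: OS::Cinder::Volume
    properties:
      size: 100
      volume_type: ceph-replicated     # hypothetical Ceph RBD-backed type

  data_attachment:
    type: OS::Cinder::VolumeAttachment
    properties:
      instance_uuid: { get_resource: app_server }
      volume_id: { get_resource: data_volume }

  public_ip:
    type: OS::Neutron::FloatingIP
    properties:
      floating_network: public         # hypothetical external network
      port_id: { get_resource: app_port }

  lb_member:
    type: OS::Octavia::PoolMember
    properties:
      pool: app-pool                   # hypothetical existing Octavia pool
      address: { get_attr: [app_port, fixed_ips, 0, ip_address] }
      protocol_port: 443
      subnet: app-subnet               # hypothetical subnet
```

If the server resource is replaced, Heat re-creates the volume attachment, while the port, floating IP, and pool membership persist, so traffic resumes as soon as the new instance passes its health checks.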

Ceph underpins this speed and reliability. RBD (RADOS Block Device) volumes provide snapshot and replication capabilities, ensuring data consistency across failures. Optional erasure coding improves storage efficiency without sacrificing durability. These features shorten recovery time and guard against data loss under hardware failure or host-level interruptions.

Monitoring is the Other Half of the Equation

You can’t recover from what you can’t detect. Atmosphere provides a fully integrated observability stack. Prometheus scrapes metrics from both infrastructure and apps. AlertManager routes alerts. Grafana visualizes performance over time. These systems are built into the platform with operational defaults ready from day one.
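
Infrastructure targets come pre-wired; for your own application VMs you would typically add scrape jobs as well. A minimal, hypothetical example (addresses and ports are placeholders) might be:

```yaml
scrape_configs:
  # Guest OS metrics, assuming node_exporter runs on each application VM
  - job_name: app-vms
    static_configs:
      - targets: ['10.0.0.11:9100', '10.0.0.12:9100']

  # Application-level metrics, assuming the app serves /metrics on port 8080
  - job_name: app-api
    metrics_path: /metrics
    static_configs:
      - targets: ['10.0.0.11:8080', '10.0.0.12:8080']
```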

You can define alert rules based on any metric and map them to orchestration logic.

Want to scale based on CPU and network I/O? Or rebuild instances showing high swap usage and failed health checks? You can.
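
One common way to wire an alert to orchestration (a sketch, not Atmosphere-specific configuration) is an AlertManager route whose webhook receiver POSTs to the pre-signed alarm URL exposed by a Heat scaling policy:

```yaml
route:
  receiver: default
  routes:
    # Alerts carrying action="scale_out" are forwarded straight to Heat
    - matchers:
        - action="scale_out"
      receiver: heat-scale-out

receivers:
  - name: default
  - name: heat-scale-out
    webhook_configs:
      # Paste the alarm_url output of your scaling policy here (truncated placeholder)
      - url: 'https://heat.example.com:8000/v1/signal/arn%3Aopenstack%3Aheat%3A...'
```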

Stratometrics, Atmosphere’s Usage Service, adds longer-term visibility by tracking per-project consumption over time. This data supports predictive scaling, capacity planning, and cost allocation, making HA more than an operational safety net.

A Real-World Example

Let’s take a real deployment scenario. You’re running a SaaS platform with a Kubernetes frontend and a PostgreSQL backend. The frontend needs elastic scaling. The backend needs high durability and fast failover.

Atmosphere orchestrates both in a single Heat template. The Kubernetes cluster is launched via OpenStack Magnum, using the Cluster API driver to manage scaling and node health. Prometheus tracks API latency.
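
In the template, the cluster itself can be declared as a Magnum resource. A minimal sketch, where the cluster template and keypair names are hypothetical:

```yaml
resources:
  frontend_cluster:
    type: OS::Magnum::Cluster
    properties:
      name: saas-frontend
      cluster_template: k8s-capi-template   # hypothetical Magnum cluster template
      master_count: 3
      node_count: 3
      keypair: ops-key                      # hypothetical keypair
```

Heat hands scaling and node health down to Magnum and the Cluster API driver, while Prometheus watches latency on the frontend.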

Meanwhile, the database runs on a VM backed by a replicated Ceph volume. If the VM fails, a new one is launched, the RBD volume is reattached, and Octavia updates the routing pool without any human intervention required.

The orchestration is declarative, version-controlled, and portable. Whether you’re scaling, recovering, or rebalancing, it can all happen through code, not manual configuration.

Deployment Models That Match Your Needs

Atmosphere supports this HA architecture across three delivery models: Cloud, Hosted, and On-Premise. Teams can choose how much control or support they need.

In the Cloud edition, you get elastic workloads with minute-based billing, perfect for spiky demand, dev/test environments, or high-scale CI pipelines.

In the Hosted edition, VEXXHOST delivers a dedicated private cloud with full operational coverage. You write the orchestration logic; we monitor it, scale it, patch it, and recover it.

In the On-Premise edition, you get Atmosphere deployed inside your data center. For these environments, VEXXHOST offers professional services to design, deploy, and manage HA architectures. This includes remote operations, upgrades, cluster tuning, and 24/7 support for your critical services.

Hybrid Stacks, Orchestrated Seamlessly

Few infrastructures are uniform. Most teams run some workloads on VMs and others on Kubernetes. Some workloads stay in the cloud; others move on-prem for compliance, latency, or cost.

Atmosphere was designed for this. You can define hybrid stacks spanning compute, storage, and Kubernetes in a single Heat template. Whether you’re deploying GPUs for ML inference or legacy apps with attached volumes, you use the same orchestration model.

That consistency extends across editions. You can move a workload from Cloud to On-Premise without rewriting orchestration logic. The networking, scaling, and recovery behaviors all stay intact, because the abstraction lives in the Heat template, not the environment.

Rethinking What HA Actually Means

If your HA strategy stops at “the VM is still running,” you’re not protecting your application. If orchestration doesn’t respond to real-time telemetry, it’s just scheduled automation. And if recovery depends on someone waking up and fixing things manually, it can’t be called HA.

Heat gives you the orchestration tools. Atmosphere makes them practical, bundling monitoring, scaling, and recovery into a single declarative stack you can deploy anywhere.

Thinking About Application-Level HA? Try a Proof-of-Concept

If you're exploring how to improve HA for your VM-based apps without migrating everything to containers, a Heat-powered deployment is a great place to start.

If you're already working with Heat, Prometheus, or OpenStack-based infrastructure, Atmosphere gives you a way to bring those pieces together without reinventing the control logic or stitching systems by hand.

Let us help you scope out a proof of concept. We’ll walk through your use case, identify HA patterns that work, and show how Atmosphere can make them real without starting from scratch.


