Security in OpenStack is a set of decisions distributed across Keystone, Neutron, Nova, and Barbican, where a reasonable default in development becomes an unacceptable exposure in production. This guide covers the specific changes that matter, why they matter, and what happens if you skip them.
The Admin Token is a shared secret used to bootstrap Keystone. It carries no user context and no scope. It grants unrestricted access to your entire Keystone deployment to anyone who has it.
In production, it must not exist. Remove AdminTokenAuthMiddleware from your paste application pipelines in keystone-paste.ini. Every hour your deployment runs with the admin token active is an hour that token can be used from anywhere on your network without attribution to any user or project.
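One way to audit this is to scan the paste configuration for pipelines that still reference the middleware. The sketch below assumes the filter appears in pipelines under the name admin_token_auth, as in older keystone-paste.ini samples; the sample text is a simplified, hypothetical file, not a real deployment's configuration.

```python
import configparser

# Assumption: the middleware is referenced in pipelines as "admin_token_auth".
ADMIN_TOKEN_FILTER = "admin_token_auth"


def pipelines_with_admin_token(paste_ini_text: str) -> list:
    """Return the pipeline sections that still include the admin token filter."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(paste_ini_text)
    offenders = []
    for section in parser.sections():
        if not section.startswith("pipeline:"):
            continue
        filters = parser.get(section, "pipeline", fallback="").split()
        if ADMIN_TOKEN_FILTER in filters:
            offenders.append(section)
    return offenders


# Simplified, hypothetical sample for illustration only.
SAMPLE = """
[pipeline:public_api]
pipeline = cors sizelimit admin_token_auth json_body public_service

[pipeline:api_v3]
pipeline = cors sizelimit json_body service_v3
"""

print(pipelines_with_admin_token(SAMPLE))
```

An empty list is the result you want from a production configuration.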
Fernet key rotation has a failure mode that is easy to miss and silently breaks authentication. In a multi-node Keystone deployment, if you rotate keys without first distributing them to all nodes, a token created with the new primary key on one node will fail validation on every other node that hasn't received the update yet.
The upstream Keystone docs state this directly: "If the rotation and distribution are not lock-step, a single keystone node in the deployment will create tokens with a primary key that no other node has as a staged key. This will cause tokens generated from one keystone node to fail validation on other keystone nodes."
The staged key (key 0) exists specifically to handle this window. It can decrypt tokens but is never used to create them. This means a node that has the staged key — but has not yet received the new primary — can still validate tokens created on another node, as long as distribution happens before the next rotation.
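The window described above can be illustrated with a toy model of two key repositories. This is a simplified sketch, not Keystone's implementation: keys are plain labels instead of key material, pruning is not modeled, and the function names are hypothetical.

```python
import itertools

_key_ids = itertools.count(1)


def new_key() -> str:
    """Stand-in for generating key material; returns a unique label."""
    return f"key-{next(_key_ids)}"


def new_repo() -> dict:
    """A repository right after setup: staged key 0 and primary key 1."""
    return {0: new_key(), 1: new_key()}


def rotate(repo: dict) -> dict:
    """Promote the staged key to primary and create a fresh staged key."""
    rotated = dict(repo)
    rotated[max(repo) + 1] = rotated[0]  # old staged key becomes the new primary
    rotated[0] = new_key()               # brand-new staged key
    return rotated


def issue_token(repo: dict) -> str:
    """Tokens are always created with the primary (highest-index) key."""
    return repo[max(repo)]


def can_validate(repo: dict, token: str) -> bool:
    """Any key in the repository, staged included, can validate a token."""
    return token in repo.values()


node_a = new_repo()
node_b = dict(node_a)  # both nodes start with the same repository

node_a = rotate(node_a)  # first rotation on A, not yet distributed
first = can_validate(node_b, issue_token(node_a))   # B holds that key as staged

node_a = rotate(node_a)  # second rotation before any distribution
second = can_validate(node_b, issue_token(node_a))  # B has never seen this key

node_b = dict(node_a)    # distribution brings B back in step
third = can_validate(node_b, issue_token(node_a))

print(first, second, third)
```

In this model the first rotation is survivable because node B already holds the promoted key as its staged key. The second rotation without distribution is what breaks validation.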
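The correct sequence is: rotate on a single node, distribute the updated key repository to every other Keystone node, and only then let the next rotation run.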
The formula for setting max_active_keys is:
max_active_keys = (token_expiration_hours / rotation_frequency_hours) + 2
The two additional keys account for the staged key and a buffer. For example: 24-hour token validity with 6-hour rotation requires (24 / 6) + 2 = 6 active keys. Setting this too low means Keystone prunes secondary keys that are still needed to validate unexpired tokens, silently breaking authentication. Treat Fernet keys with the same care as SSL private keys. Any node joining the cluster must have the same key repository before it starts issuing or validating tokens.
If your deployment uses service token authentication (where services may need to validate expired tokens), adjust the formula to account for allow_expired_window: max_active_keys = ((token_expiration + allow_expired_window) / rotation_frequency) + 2.
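Both forms of the formula can be expressed as one small helper. This is a minimal sketch: the function name is hypothetical, rounding up for ratios that do not divide evenly is an assumption the text does not state, and the 48-hour window in the second call is only an illustrative value.

```python
import math


def max_active_keys(token_expiration_hours: float,
                    rotation_frequency_hours: float,
                    allow_expired_window_hours: float = 0) -> int:
    """Apply the sizing formula; rounding up is an assumption for uneven ratios."""
    lifetime = token_expiration_hours + allow_expired_window_hours
    return math.ceil(lifetime / rotation_frequency_hours) + 2


print(max_active_keys(24, 6))      # the 24-hour / 6-hour example from the text: 6
print(max_active_keys(24, 6, 48))  # same, with an illustrative 48-hour window: 14
```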
Since the Newton release, Keystone encrypts all credentials stored in the SQL backend using Fernet. This requires a separate key repository configured explicitly in keystone.conf:
[credential]
provider = fernet
key_repository = /etc/keystone/credential-keys/
This is a separate key repository from the token keys, with its own rotation lifecycle. If this section is absent from your configuration, confirm that keystone-manage credential_setup has been run and that keystone-manage credential_migrate has been completed after upgrades from older deployments. Do not infer plaintext storage merely from the absence of an explicit [credential] section in keystone.conf.
All PCI-DSS compliance controls in Keystone live in the [security_compliance] section of keystone.conf. The configurable parameters:
[security_compliance]
change_password_upon_first_use = true
disable_user_account_days_inactive = 90
lockout_duration = 1800
lockout_failure_attempts = 6
minimum_password_age = 1
password_expires_days = 90
password_regex = (?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&*]).{8,}
password_regex_description = Must be 8+ chars with uppercase, lowercase, digit, and special character
unique_last_password_count = 5
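Before rolling out a password_regex, it is worth checking it against a few sample passwords. The sketch below exercises the pattern shown above with Python's re module. Matching from the start of the string is an assumption about how the check is applied, and the candidate passwords are made up.

```python
import re

# The pattern from the configuration above.
PASSWORD_REGEX = r"(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&*]).{8,}"


def meets_policy(password: str) -> bool:
    # Assumption: the pattern is matched from the start of the string.
    return re.match(PASSWORD_REGEX, password) is not None


# Made-up candidates: one valid, then missing digit, too short, missing special.
for candidate in ["Passw0rd!", "password", "Sh0rt!A", "NoSpecial123"]:
    print(candidate, meets_policy(candidate))
```

Only the first candidate passes. The others fail on a missing character class or on length, which is the behavior password_regex_description promises to users.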
The PCI-DSS mappings:
Two caveats the upstream docs are explicit about. First, these controls apply only to Keystone's SQL identity backend. If you use LDAP, federated identity, or any non-SQL driver, PCI-DSS compliance for authentication is entirely the responsibility of that external system — OpenStack cannot enforce it. Second, in most HA deployments, TLS is terminated at the public endpoint, and traffic between the load balancer and backend Keystone nodes on the private network may be unencrypted. If your private network is considered at risk, the load balancer must be configured for TLS on the internal network. OpenStack does not manage this; your deployment tooling does.
Account lockout is correct policy for users. For service accounts it can take down services. If a service user gets locked out of Keystone, the corresponding service stops working.
Exclude service accounts via the CLI:
openstack user set --ignore-lockout-failure-attempts <service-user-id>
Or via the REST API:
curl -X PATCH \
  -H "X-Auth-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"user": {"options": {"ignore_lockout_failure_attempts": true}}}' \
  https://keystone.example.com/v3/users/<user_id>
Apply this to all service users before enabling lockout globally.
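With more than a handful of service users, the same PATCH can be prepared in a loop. The sketch below only builds the requests and does not send them. The endpoint, token, and user IDs are placeholders, and the helper name is hypothetical.

```python
import json
import urllib.request

# Placeholder endpoint, matching the example URL above.
KEYSTONE_URL = "https://keystone.example.com/v3"


def lockout_exemption_request(user_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the PATCH that sets ignore_lockout_failure_attempts."""
    body = {"user": {"options": {"ignore_lockout_failure_attempts": True}}}
    return urllib.request.Request(
        url=f"{KEYSTONE_URL}/users/{user_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Auth-Token": token, "Content-Type": "application/json"},
        method="PATCH",
    )


service_user_ids = ["nova-user-id", "neutron-user-id"]  # placeholders
pending = [lockout_exemption_request(uid, "TOKEN") for uid in service_user_ids]
for req in pending:
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req) would send it; deliberately left out here.
```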
Changes to policy.json do not require a service restart. They take effect the moment the file is saved. Test policy changes thoroughly in staging first.
Historically, OpenStack used a Linux bridge between each instance and the OVS integration bridge br-int because OVS could not interact directly with iptables. Security group rules lived in iptables on that intermediate bridge. Every VM had its own bridge with its own iptables chain.
OVN replaces this entirely. It implements security group rules as OpenFlow flows evaluated in kernel space, eliminating the Linux bridge and iptables dependency. Beyond simplifying the architecture, OVN introduces Port Groups: instead of creating separate ACL flows for every port in a security group, OVN groups ports with identical security group membership and applies one set of rules to the group. A security group shared by 100 instances creates one ACL set instead of 100. The performance and scalability improvement is significant and increases as VM count grows.
Neutron security groups are stateful. Inbound TCP on port 443 automatically allows the corresponding response traffic without a separate egress rule. The underlying mechanism is Linux connection tracking (conntrack), and it has limits.
The default conntrack table can be exhausted on high-throughput compute nodes handling large numbers of concurrent connections, causing new connections to fail. Check your current state:
sudo conntrack -C                                       # current entry count
sudo sysctl net.netfilter.nf_conntrack_max              # current limit
sudo sysctl -w net.netfilter.nf_conntrack_max=262144    # increase if needed
Set this in /etc/sysctl.d/ to persist across reboots and monitor table utilization as part of your standard compute node metrics.
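For the monitoring side, table utilization can be derived from the two values the kernel exposes under /proc. This is a minimal sketch: it assumes the nf_conntrack module is loaded (the files are absent otherwise), and the 80% alert threshold is a hypothetical starting point, not a recommendation from the upstream docs.

```python
from pathlib import Path
from typing import Optional

# These procfs files exist only when the nf_conntrack module is loaded.
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")

ALERT_THRESHOLD = 0.8  # hypothetical threshold; tune per environment


def utilization(count: int, maximum: int) -> float:
    """Fraction of the conntrack table in use."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum


def read_utilization() -> Optional[float]:
    """Return current utilization, or None if conntrack data is unavailable."""
    try:
        return utilization(int(COUNT_PATH.read_text()), int(MAX_PATH.read_text()))
    except (OSError, ValueError):
        return None


current = read_utilization()
if current is None:
    print("conntrack data not available on this host")
else:
    status = "ALERT" if current >= ALERT_THRESHOLD else "ok"
    print(f"conntrack table {current:.1%} full ({status})")
```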
OVN deployments can use stateless security groups, which bypass connection tracking entirely. Support for the allow-stateless ACL action was added in OVN 21.06, so deployments running an older OVN cannot use stateless security groups. The allow_stateless_action_supported configuration option that previously controlled this was marked deprecated for removal in the 2023.1 (Antelope) release notes and was removed in the 2025.1 (Epoxy) release. On any supported release, stateless security groups work if your OVN version meets the minimum.
The tradeoff is explicit: stateless groups do not automatically allow return traffic. A rule allowing outbound TCP requires a corresponding rule explicitly allowing inbound replies. Stateless mode is also the only viable option when offloading OpenFlow actions to hardware.
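The difference in return-traffic handling can be shown with a toy packet filter. This is an illustrative model only, not how OVN or conntrack is implemented: rules match on direction and destination port, the addresses are documentation examples, and the ephemeral port range 32768-60999 is an assumption based on common Linux defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    direction: str  # "ingress" or "egress", relative to the instance
    src: str
    dst: str
    sport: int
    dport: int


@dataclass(frozen=True)
class Rule:
    direction: str
    port_min: int  # destination port range, inclusive
    port_max: int

    def matches(self, pkt: Packet) -> bool:
        return (pkt.direction == self.direction
                and self.port_min <= pkt.dport <= self.port_max)


class Firewall:
    def __init__(self, rules, stateful: bool):
        self.rules = list(rules)
        self.stateful = stateful
        self._tracked = set()  # stand-in for conntrack entries

    def allow(self, pkt: Packet) -> bool:
        flow = (pkt.src, pkt.sport, pkt.dst, pkt.dport)
        reverse = (pkt.dst, pkt.dport, pkt.src, pkt.sport)
        if self.stateful and reverse in self._tracked:
            return True  # reply to a connection that was already allowed
        if any(rule.matches(pkt) for rule in self.rules):
            if self.stateful:
                self._tracked.add(flow)
            return True
        return False


request = Packet("egress", "10.0.0.5", "203.0.113.10", 50000, 443)
reply = Packet("ingress", "203.0.113.10", "10.0.0.5", 443, 50000)
egress_https = Rule("egress", 443, 443)
ingress_replies = Rule("ingress", 32768, 60999)  # assumed ephemeral range

stateful = Firewall([egress_https], stateful=True)
stateless = Firewall([egress_https], stateful=False)
stateless_fixed = Firewall([egress_https, ingress_replies], stateful=False)

for name, fw in [("stateful", stateful), ("stateless", stateless),
                 ("stateless + reply rule", stateless_fixed)]:
    print(name, fw.allow(request), fw.allow(reply))
```

In the stateless case the reply is dropped until an explicit ingress rule covers it, which is the extra rule-writing burden the tradeoff refers to.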
For DPDK-based deployments, stateless NAT for floating IPs is available via [ovn] stateless_nat_enabled in ml2_conf.ini. It is disabled by default. Enabling it avoids conntrack OVN actions for floating IP traffic. The option lives in ml2_conf.ini because it is a configuration option for the ML2/OVN mechanism driver, not for Neutron's core service.
The default Neutron security group allows all egress traffic. Most operators tighten ingress rules carefully and leave egress entirely open. Egress filtering is what prevents a compromised instance from initiating unauthorized outbound connections. Default-deny egress with explicit rules for traffic you expect is the correct model, not the exception.
Neutron's new secure RBAC defaults, including a service role for port policies, are not enabled automatically for Neutron 2023.1 and older deployments. To enable them:
[oslo_policy]
enforce_new_defaults = true
For newer OpenStack releases, check the release-specific Neutron policy defaults rather than assuming these options are opt-in (the behavior may differ from what is described here).
One important constraint: setting enforce_scope = true will cause 403 Forbidden responses to any API calls made with a system-scoped token, because all Neutron APIs are currently project-scoped. Do not enable scope enforcement in Neutron until the project has completed the necessary scoping work in your deployment.
The live_migration_tunnelled option has two significant limitations the upstream Nova docs acknowledge directly: it cannot handle block migration (live migration with non-shared storage), and it has substantial performance overhead due to increased data copying on both source and destination hosts.
QEMU-native TLS solves both problems. It encrypts all migration streams (guest RAM, device state, and disk data over NBD for block migration) with significantly lower overhead. It requires at least libvirt 4.4.0 and QEMU 2.11.
On every compute node, add to /etc/libvirt/qemu.conf:
default_tls_x509_cert_dir = "/etc/pki/qemu"
default_tls_x509_verify = 1
Setting both default_tls_x509_cert_dir and default_tls_x509_verify means there is no need to specify any of the other individual _tls config options. Then in nova.conf:
[libvirt]
live_migration_with_native_tls = true
live_migration_scheme = tls
Both lines are required. Omitting either one produces a silent failure, with no indication that migrations are unencrypted.
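Because the failure is silent, a configuration check is a cheap safeguard. The sketch below parses nova.conf-style text with Python's configparser and reports which of the two options is missing or set to something else. It assumes the file is plain INI that configparser can read, and the helper name is hypothetical.

```python
import configparser

# The two [libvirt] options discussed above and their required values.
REQUIRED = {
    "live_migration_with_native_tls": "true",
    "live_migration_scheme": "tls",
}


def missing_tls_settings(nova_conf_text: str) -> list:
    """Return the [libvirt] options that are absent or not set as required."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(nova_conf_text)
    problems = []
    for option, expected in REQUIRED.items():
        actual = parser.get("libvirt", option, fallback=None)
        if actual is None or actual.strip().lower() != expected:
            problems.append(option)
    return problems


# Hypothetical sample with one of the two lines left out.
INCOMPLETE = """
[libvirt]
live_migration_with_native_tls = true
"""

print(missing_tls_settings(INCOMPLETE))
```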
Ensure TCP ports 16514 and 49152–49215 are open between compute nodes.
Note on VNC consoles: VNC settings allow clients from any IP address to connect to instance consoles. When hardening compute hosts, restrict VNC access to trusted networks or protect it with firewalls independently.
The MDS vulnerabilities (RIDL, Fallout, ZombieLoad, disclosed May 2019) affect Intel x86_64 CPUs and have a specific OpenStack mitigation path.
With cpu_mode=host-model (the default when virt_type=kvm or virt_type=qemu), the md-clear CPU flag passes through to guests automatically. The same applies to cpu_mode=host-passthrough.
With cpu_mode=custom, you must explicitly add md-clear along with flags for prior vulnerabilities. The Nova docs example:
[libvirt]
cpu_mode = custom
cpu_models = IvyBridge
cpu_model_extra_flags = spec-ctrl,ssbd,md-clear
After updating all vulnerable compute nodes, running guests must be fully powered down and cold-booted (an explicit stop followed by a start) to activate the new CPU model. A live migration is not sufficient. Validate the mitigation is active on the host:
cat /sys/devices/system/cpu/vulnerabilities/mds
If the output includes "SMT vulnerable", Hyper-Threading may still expose you, depending on workload. For multi-tenant deployments running untrusted workloads, review whether disabling Hyper-Threading is warranted.
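Across a fleet of compute nodes, the same check is easier to automate than to read by eye. This is a minimal sketch: the sample strings follow the format described in the Linux kernel's MDS documentation as best we recall it and should be treated as assumptions, and the function name is hypothetical.

```python
from pathlib import Path

MDS_PATH = Path("/sys/devices/system/cpu/vulnerabilities/mds")


def assess_mds(status: str) -> dict:
    """Classify one line of MDS status text from sysfs."""
    text = status.strip()
    return {
        "affected": text != "Not affected",
        "mitigated": text.startswith("Mitigation:"),
        "smt_exposed": "SMT vulnerable" in text,
    }


# Assumed sample outputs, for illustration only.
SAMPLES = [
    "Mitigation: Clear CPU buffers; SMT vulnerable",
    "Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable",
    "Not affected",
]
for sample in SAMPLES:
    print(sample, "->", assess_mds(sample))

# On a real host, read the live value if the file is present.
if MDS_PATH.exists():
    print("this host:", assess_mds(MDS_PATH.read_text()))
```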
The default Barbican configuration uses the simple crypto plugin, which encrypts all secrets with a single symmetric key stored in plaintext in barbican.conf. A single compromised key exposes every secret for every tenant in the database, and key rotation requires re-encrypting all stored secrets. The simple crypto plugin is appropriate for development only.
The production configuration for HSM-backed deployments uses a three-tier key hierarchy: a Master KEK (MKEK) stored in and never extracted from the HSM, per-project wrapped KEKs stored encrypted in the Barbican database, and per-secret encrypted blobs also stored in the database.
The MKEK never leaves the HSM. All wrapping and unwrapping operations for project KEKs happen within the HSM's memory. Different tenants have different per-project KEKs, so a compromised pKEK for one tenant does not expose others. When the MKEK is rotated, only the project KEKs need to be re-wrapped — not every stored secret.
The MKEK model also solves an HSM capacity problem. The naive approach creates one KEK per project directly on the HSM. HSMs have limited storage and in a multi-tenant deployment this will eventually fail to create KEKs for new projects. The MKEK model keeps a minimum number of keys on the HSM while still maintaining per-project encryption.
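The rotation property of the three-tier hierarchy can be shown with a bookkeeping model. No real cryptography is performed here: each object only records which key wraps it, and all labels and class names are hypothetical. It is a sketch of the structure, not of Barbican's PKCS#11 plugin.

```python
from dataclasses import dataclass, field


@dataclass
class Wrapped:
    """Something encrypted under a key; only the wrapping key label is tracked."""
    wrapped_by: str
    payload: str


@dataclass
class KeyStore:
    mkek_label: str  # the MKEK itself stays in the HSM; only its label is kept
    project_keks: dict = field(default_factory=dict)
    secrets: dict = field(default_factory=dict)

    def add_project(self, project: str) -> None:
        self.project_keks[project] = Wrapped(self.mkek_label, f"pkek-{project}")

    def store_secret(self, project: str, name: str) -> None:
        if project not in self.project_keks:
            self.add_project(project)
        self.secrets[name] = Wrapped(f"pkek-{project}", name)

    def rotate_mkek(self, new_label: str) -> int:
        """Re-wrap every project KEK under the new MKEK; secrets are untouched."""
        for project, pkek in list(self.project_keks.items()):
            self.project_keks[project] = Wrapped(new_label, pkek.payload)
        self.mkek_label = new_label
        return len(self.project_keks)


store = KeyStore("mkek-old")
for project in ["alpha", "beta", "gamma"]:
    for i in range(100):
        store.store_secret(project, f"{project}-secret-{i}")

rewrapped = store.rotate_mkek("mkek-new")
print(rewrapped, len(store.secrets))  # objects re-wrapped vs. secrets stored
```

With three projects and 300 secrets, rotation touches three wrapped keys and no secrets. The HSM also holds one master key regardless of how many projects exist.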
Rotation uses barbican-manage. Perform steps in order:
barbican-manage hsm gen_mkek \
  --library-path /path/to/pkcs11.so \
  --passphrase <hsm-pin> \
  --slot-id 1 \
  --label <unique-mkek-label> \
  --length 32

barbican-manage hsm gen_hmac \
  --library-path /path/to/pkcs11.so \
  --passphrase <hsm-pin> \
  --slot-id 1 \
  --label <unique-hmac-label> \
  --length 32
barbican-manage hsm rewrap_pkek
The upstream docs note that both the new MKEK and HMAC key must already be generated, their labels set in barbican.conf, and Barbican restarted before running rewrap_pkek. The --dry-run flag is available to preview the operation without committing changes.
Creating an encrypted volume type, as shown in current upstream Cinder documentation:
openstack volume type create \
  --encryption-provider luks \
  --encryption-cipher aes-xts-plain64 \
  --encryption-key-size 256 \
  --encryption-control-location front-end \
  LUKS
The Cinder docs are explicit about access control: non-admin users need the creator role to store secrets in Barbican and to create encrypted volumes. Grant it:
openstack role add --project PROJECT --user USER creator
If migrating from the legacy ConfKeyManager (fixed key stored in configuration files), do not remove the fixed_key value from nova.conf and cinder.conf until you have verified no volumes still depend on it — volumes encrypted with the fixed key will become inaccessible if it is removed prematurely.
Atmosphere upgraded the nginx ingress controller from 1.10.1 to 1.12.1 to address CVE-2025-1097, CVE-2025-1098, CVE-2025-1974, CVE-2025-24513, and CVE-2025-24514. These CVEs were patched in both v1.11.5 and v1.12.1. If you are running any version below 1.11.5, upgrading to a patched version is an immediate priority, but treat it as short-term risk reduction only.
The Kubernetes ingress-nginx controller reached end-of-life on March 24, 2026. The repository is now read-only. There will be no new features, no bug fixes, and no further CVE patches. The Ingress API itself (networking.k8s.io/v1) is not deprecated; only this controller implementation is. EOL software in the L7 data path is an automatic finding in SOC 2, PCI-DSS, ISO 27001, and HIPAA audits. Migration to a supported ingress controller is the required long-term remediation.
All container images in Atmosphere now use external repositories with independent versioning per component. Images for OVN, Open vSwitch, libvirt, and all OpenStack services have dedicated repositories with independent release cycles. This removes image build infrastructure from Atmosphere itself and allows precise tracking of which version of each component is running.
Prometheus monitoring, Grafana dashboards, log aggregation, and vulnerability scanning are included in every Atmosphere deployment. Security observability is built in, not a separate configuration step.
For teams auditing an existing deployment, the priority order based on blast radius:
VEXXHOST has been contributing to the OpenStack community since 2011 and operates production clouds across the OpenStack, Kubernetes, and Ceph ecosystems. We build Atmosphere, an open-source private cloud platform. We are a Gold Member of the OpenInfra Foundation. If you are working through a security review of your OpenStack deployment, we are happy to discuss what we have seen and what has worked.