Building a Proxmox HA cluster: architecture and pitfalls

One of the most common misconceptions among teams new to Proxmox is that high availability is an option you enable in the interface. In reality, it's a system that relies on several independent mechanisms, each with its own conditions and its own failure points.

You can enable "HA" and still have a cluster that doesn't fail over during a real incident. Here's what you need to understand before finding that out in production.

Corosync and quorum: the underestimated foundation

Corosync is the cluster communication component. It maintains state consistency between nodes and is the basis for failover decisions.

Quorum defines how many nodes must be communicating for the cluster to be "healthy" and authorized to make decisions: a strict majority of the votes. A 3-node cluster therefore tolerates the loss of one node (2 out of 3 form the quorum). A 2-node cluster has a fundamental problem: if one node is lost, the survivor holds only 1 of 2 votes and can't tell whether it's the sole survivor or whether there's been a network partition.

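For small environments that don't want a full third node, Proxmox supports a QDevice (a lightweight external vote provider that can run on a small VM or a Raspberry Pi outside the cluster). It's a valid solution, but it must be documented and integrated into maintenance procedures: while the QDevice is unavailable, a 2-node cluster keeps quorum only as long as both nodes stay up, so losing a node at that moment blocks cluster operations.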
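
The vote arithmetic behind these statements is short enough to sketch. The example below assumes one vote per node and one vote for the QDevice (the usual setup for an even number of nodes); the function names are illustrative, not part of any Proxmox tool.

```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def is_quorate(votes_present: int, total_votes: int) -> bool:
    """True if the visible votes are enough to make cluster decisions."""
    return votes_present >= quorum_threshold(total_votes)


def tolerated_losses(total_votes: int) -> int:
    """How many votes can disappear before quorum is lost."""
    return total_votes - quorum_threshold(total_votes)


if __name__ == "__main__":
    layouts = [("2 nodes", 2), ("2 nodes + QDevice", 3), ("3 nodes", 3), ("4 nodes", 4)]
    for label, total in layouts:
        print(f"{label}: need {quorum_threshold(total)} of {total} votes, "
              f"tolerates {tolerated_losses(total)} lost vote(s)")
```

Note that a fourth node tolerates no more failures than a third: the majority threshold rises with it.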

Fencing: misunderstood, often misconfigured

Fencing is the mechanism that guarantees a failed node is really stopped ("killed") before its VMs are restarted on another node. Without fencing, the cluster risks starting VMs on a new node while they're still running on the failed node, which leads to data corruption.

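In enterprise production, fencing must rest on hardware. Proxmox VE's HA stack fences through watchdog timers: a node that loses quorum stops refreshing its watchdog and is reset, after which its VMs can be recovered elsewhere. An out-of-band management interface (IPMI, iDRAC, iLO, or equivalent) often provides the hardware watchdog and also lets an operator send a physical power-off command to a failed node.

Without a configured hardware watchdog, Proxmox falls back to the kernel's software watchdog (softdog). It works, but it depends on the very kernel it is supposed to reset, and should not be retained as the primary solution in critical production.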
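
The watchdog principle can be illustrated with a toy model: a node keeps refreshing a timer while it has quorum, and is reset once the timer has gone unrefreshed for too long. This is only a sketch of the idea in simulated time; the 60-second timeout and 10-second refresh interval are illustrative values, not settings read from a Proxmox installation.

```python
from typing import Optional


class WatchdogModel:
    """Toy watchdog timer; time is a plain number of seconds."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_refresh = 0.0

    def refresh(self, now: float) -> None:
        self.last_refresh = now

    def expired(self, now: float) -> bool:
        return now - self.last_refresh >= self.timeout_s


def simulate_self_fence(quorum_lost_at: float,
                        timeout_s: float = 60.0,
                        step_s: float = 10.0,
                        horizon_s: float = 300.0) -> Optional[float]:
    """Return the simulated time at which the node resets itself, or None."""
    watchdog = WatchdogModel(timeout_s)
    now = 0.0
    while now <= horizon_s:
        if now < quorum_lost_at:
            watchdog.refresh(now)      # quorate: keep the timer alive
        elif watchdog.expired(now):
            return now                 # not quorate and timer ran out: reset
        now += step_s
    return None


if __name__ == "__main__":
    print(simulate_self_fence(quorum_lost_at=100.0))
```

The point of the model is the delay: recovery of the VMs elsewhere can only begin after the isolated node is certain to have reset, which is why HA restart is never instantaneous.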

What to validate before declaring HA functional:

Test the out-of-band power-off command (IPMI, iDRAC, iLO) manually against every node
Physically cut power to a node and verify that fencing executes
Verify that HA VMs restart correctly on surviving nodes with the proper delay

Shared storage: the most critical component

Proxmox HA only works if VMs are on storage accessible from all cluster nodes. A local disk cannot host an HA VM — if the node hosting that disk goes down, the VM cannot migrate.

Common options:

Ceph: distributed, native in Proxmox, no SPOF if properly configured
NFS/iSCSI shared: depends on an external server (potential SPOF)
Proxmox replication: snapshot-based replication between nodes (requires ZFS), but not true shared storage. HA can restart the VM from the last replicated snapshot, so anything written since the last replication run is lost
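
A quick inventory check makes the consequence concrete: a VM is only as recoverable as its weakest disk. The sketch below works on a hand-written inventory; the `Disk` structure and the storage names (`ceph-vm`, `local-lvm`, `local-zfs`) are hypothetical and would have to be filled from your own cluster data.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set

# Lower rank = better recoverability after a node failure.
_RANK = {"shared": 0, "replicated": 1, "local-only": 2}


@dataclass(frozen=True)
class Disk:
    vmid: int
    storage: str


def classify_vms(disks: Iterable[Disk],
                 shared_storages: Set[str],
                 replicated_vmids: Set[int]) -> Dict[int, str]:
    """Map each VM ID to the recoverability level of its weakest disk."""
    result: Dict[int, str] = {}
    for disk in disks:
        if disk.storage in shared_storages:
            level = "shared"
        elif disk.vmid in replicated_vmids:
            level = "replicated"
        else:
            level = "local-only"
        # Keep the worst level seen so far for this VM.
        if disk.vmid not in result or _RANK[level] > _RANK[result[disk.vmid]]:
            result[disk.vmid] = level
    return result


if __name__ == "__main__":
    inventory = [Disk(100, "ceph-vm"), Disk(101, "local-zfs"),
                 Disk(102, "local-lvm"), Disk(103, "ceph-vm"), Disk(103, "local-lvm")]
    for vmid, level in classify_vms(inventory, {"ceph-vm"}, {101}).items():
        print(vmid, level)
```

Any VM reported as "local-only" cannot be recovered by HA, whatever the HA configuration says.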

Ceph on Proxmox HA: what's not said often enough

Ceph is excellent when properly sized. It can be painful when it isn't.

Watch points in enterprise deployment:

Minimum replica count: by default, a replicated Ceph pool keeps 3 copies of the data (size=3) and needs at least 2 of them available to accept I/O (min_size=2). With one copy missing, the pool is "degraded" but still serves data; with two missing, I/O is blocked. Too many people deploy a 3-node cluster with 1 OSD per node and discover that losing one node during maintenance of the second renders storage unavailable.
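
The relationship between size, min_size, and availability can be written as a small decision function. This is a simplified sketch assuming a replicated pool with one copy per node (host-level failure domain) and the common defaults size=3, min_size=2; real placement groups can be in more nuanced states.

```python
def pool_io_state(size: int, min_size: int, replicas_available: int) -> str:
    """Simplified view of whether a replicated pool can serve I/O."""
    if not 0 <= replicas_available <= size:
        raise ValueError("replicas_available must be between 0 and size")
    if replicas_available == size:
        return "healthy"
    if replicas_available >= min_size:
        return "degraded"   # still serving I/O, but with reduced redundancy
    return "blocked"        # below min_size: I/O stops to protect the data


if __name__ == "__main__":
    # 3-node cluster, 1 OSD per node, default size=3 / min_size=2.
    for nodes_up in (3, 2, 1):
        print(f"{nodes_up} node(s) up:", pool_io_state(3, 2, nodes_up))
```

With exactly three nodes, "degraded" also means Ceph has nowhere to rebuild the missing copy until the node returns.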

The network is critical: Ceph distinguishes two networks, public (for clients) and cluster (for internal replication and recovery). Merging them, or running them over the same switch as VM traffic, can generate performance and stability problems that are difficult to diagnose.
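
A network plan can at least be checked for overlapping subnets before deployment. The sketch below uses Python's standard `ipaddress` module; the network names and CIDR ranges are made up for the example.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple


def overlapping_networks(plan: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return every pair of named networks whose address ranges overlap."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(parsed.items(), 2)
        if net_a.overlaps(net_b)
    ]


if __name__ == "__main__":
    plan = {
        "ceph_public": "10.10.10.0/24",
        "ceph_cluster": "10.10.20.0/24",
        "corosync": "10.10.30.0/24",
        "vm_traffic": "10.10.10.0/25",   # deliberately overlaps ceph_public
    }
    print(overlapping_networks(plan))
```

Distinct subnets are necessary but not sufficient: two subnets carried by the same physical switch or the same NIC still share a failure point and bandwidth.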

Rebuild after failure: when an OSD fails, Ceph starts a recovery (backfill) process that re-creates the missing copies on the remaining OSDs to restore the tolerance level. For large volumes, this process can take several hours and degrade client performance while it runs. In production, this must be anticipated.
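
A back-of-the-envelope estimate is enough to anticipate the order of magnitude. The sketch below simply divides the data to re-create by a sustained recovery throughput; both inputs are assumptions you must measure on your own cluster, and real recovery is rarely this linear.

```python
def recovery_hours(data_to_move_tb: float, recovery_throughput_mb_s: float) -> float:
    """Rough recovery duration: data volume divided by sustained throughput."""
    if recovery_throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    bytes_to_move = data_to_move_tb * 1e12                 # decimal terabytes
    seconds = bytes_to_move / (recovery_throughput_mb_s * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    # Example: 4 TB stored on the failed OSD, 200 MB/s sustained recovery rate.
    print(f"{recovery_hours(4, 200):.1f} hours")
```

With these example figures the rebuild takes between five and six hours, during which the cluster runs with reduced redundancy.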

Stretched clusters: where it gets genuinely complex

A stretched Proxmox cluster — two sites with nodes on each side — is technically feasible but architecturally complex.

Problems to solve:

Inter-site network latency: Corosync is latency-sensitive. Beyond a few milliseconds between nodes, cluster stability can degrade.
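Split-brain: if the inter-site link is cut, each site sees the other as "dead." With the same number of nodes on each side and no tiebreaker on a third site, neither partition holds a majority, so both lose quorum; if quorum is then forced by hand on both sides, both partitions act as "master" simultaneously.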
Stretched Ceph: possible but with specific configurations (stretch mode, failure zones, rack awareness). This isn't a standard configuration and requires deep mastery.
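
The split-brain case can be worked through with the same majority rule that governs any Corosync cluster. The sketch below assumes one vote per node and an optional single tiebreaker vote hosted on a third site; the function and its parameters are illustrative only.

```python
from typing import List, Optional


def quorate_sites_after_link_cut(nodes_a: int, nodes_b: int,
                                 tiebreaker_with: Optional[str] = None) -> List[str]:
    """Return which site(s) keep quorum once the inter-site link is down.

    tiebreaker_with: None if there is no third-site vote, otherwise "A" or "B",
    the site the tiebreaker ends up siding with.
    """
    if tiebreaker_with not in (None, "A", "B"):
        raise ValueError('tiebreaker_with must be None, "A" or "B"')
    total_votes = nodes_a + nodes_b + (1 if tiebreaker_with else 0)
    needed = total_votes // 2 + 1              # strict majority
    votes = {"A": nodes_a, "B": nodes_b}
    if tiebreaker_with:
        votes[tiebreaker_with] += 1
    return [site for site, count in votes.items() if count >= needed]


if __name__ == "__main__":
    print("2+2, no tiebreaker:", quorate_sites_after_link_cut(2, 2))
    print("2+2, tiebreaker sides with A:", quorate_sites_after_link_cut(2, 2, "A"))
    print("3+2, no tiebreaker:", quorate_sites_after_link_cut(3, 2))
```

An asymmetric 3+2 layout survives a link cut, but only in one direction: if the larger site is the one that fails, the smaller site cannot take over on its own.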

For most multisite DR needs, Proxmox replication (snapshot-based) combined with manual failover procedures is more robust than a stretched cluster managed by automatic HA.

What to test before declaring HA production-ready

Minimum test list for a production HA cluster:

Clean node shutdown (normal reboot) → verify that HA VMs follow the configured shutdown policy (migration to another node, or stop and restart on the same node)
Hard node failure (power cut) → verify that fencing executes and VMs restart
Cluster network loss → verify behavior under partition
Ceph OSD failure → verify that storage remains operational and rebuild starts correctly
Restore from PBS → verify the real RTO of a complete restoration
Failover and failback → verify that VMs return to their preferred node after recovery

These tests shouldn't only happen at startup. They must be repeated after each major update and integrated into a quarterly validation schedule.
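
To turn these tests into measured figures, the recovery time has to be timed rather than estimated. A minimal sketch, assuming the VM exposes a TCP service you can probe: start the function at the moment you trigger the failure, and it returns the number of seconds until the service answers again. The host and port are placeholders for your own test VM.

```python
import socket
import time
from typing import Optional


def seconds_until_reachable(host: str, port: int,
                            timeout_s: float = 600.0,
                            interval_s: float = 1.0) -> Optional[float]:
    """Poll a TCP port; return elapsed seconds until it accepts a connection.

    Returns None if the port is still unreachable after timeout_s.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return time.monotonic() - start
        except OSError:
            time.sleep(interval_s)   # not reachable yet: wait and retry
    return None


if __name__ == "__main__":
    # Placeholder target: replace with the address and port of an HA test VM.
    elapsed = seconds_until_reachable("127.0.0.1", 22, timeout_s=5.0)
    print("not reachable" if elapsed is None else f"reachable after {elapsed:.1f} s")
```

An open port only proves the VM has booted; for a meaningful RTO, probe the application itself, and record the result with each quarterly run.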

The difference between an HA cluster and one that just looks like it

High availability in enterprise production is not binary. It's a continuum between "the cluster eventually restarts" and "the RTO is measured, documented, and met."

The difference lies in the details: is the fencing network on the same switch as the production network? Have HA VMs been individually tested? Is the manual failover decision process documented and known to the night team?

Those details are what make the difference at 2 AM.