DRP overhaul on ageing infrastructure

The problem with untested DRPs

It wasn't that the DRP was bad. It was that it only existed on paper.

The VMware infrastructure was 4 years old. The DRP had been documented at initial deployment, formally validated by an external service provider, and never rehearsed since. Over 4 years, the environment had evolved: new VMs, new application traffic flows, network changes. The DRP documented a state that no longer existed.

:::Critical point A documented DRP is not a functional DRP. It is a declarative DRP. The difference shows up at the first production incident. :::

What the audit revealed

The audit lasted 5 days. Five days to produce a short but precise list of problems:

1. Two undocumented single points of failure on the production network path
2. The failover process assumed the availability of an administrator with physical access to the secondary site, which was not guaranteed outside business hours
3. The decision deadline for activating the DRP was undefined: who decides, on what criterion, within what timeframe?
4. Three critical VMs did not have application-consistent snapshots (pre-snapshot flush not configured)
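
The undefined activation decision can be made concrete by writing the rule down as data. The sketch below is one hypothetical way to encode who decides, on what criterion, and within what timeframe. The role names, the 10-minute outage threshold and the 5-minute decision window are illustrative assumptions, not values from the audit.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class ActivationPolicy:
    """Hypothetical encoding of a DRP activation rule."""
    primary_decider: str          # who decides first
    backup_decider: str           # who decides if the primary stays silent
    outage_threshold: timedelta   # criterion: outage length that opens the decision
    decision_window: timedelta    # timeframe granted to the primary decider


def current_decider(policy: ActivationPolicy, outage: timedelta) -> Optional[str]:
    """Return who must decide now, or None if the criterion is not met yet."""
    if outage < policy.outage_threshold:
        return None
    if outage < policy.outage_threshold + policy.decision_window:
        return policy.primary_decider
    # The primary decider's window has elapsed: escalate.
    return policy.backup_decider


# Illustrative values only (assumptions, not audit data).
POLICY = ActivationPolicy(
    primary_decider="on-call infrastructure lead",
    backup_decider="IT director",
    outage_threshold=timedelta(minutes=10),
    decision_window=timedelta(minutes=5),
)

print(current_decider(POLICY, timedelta(minutes=12)))
```

The point of the design is that the escalation path is explicit: once the window elapses, the decision moves to a named backup instead of waiting on one person.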

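:::Field observation The RTO declared in the SLAs was 20 minutes. The RTO measured during the simulated test was 64 minutes, more than three times the declared value. The gap came primarily from points 2 and 3: the dependence on an administrator with physical access to the secondary site, and the undefined activation decision. :::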
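
A gap like this is easier to attribute when the RTO is broken into phases. The sketch below assumes four event timestamps recorded during a test (incident start, detection, activation decision, service restored). The event names and the sample timeline are hypothetical and are not the audit's measurements. Only the 20-minute target comes from the SLA figure quoted above.

```python
from datetime import datetime, timedelta

DECLARED_RTO = timedelta(minutes=20)  # SLA figure quoted in the text


def rto_breakdown(incident, detected, decided, restored):
    """Split total recovery time into detection, decision and execution phases."""
    return {
        "detection": detected - incident,
        "decision": decided - detected,
        "execution": restored - decided,
        "total": restored - incident,
    }


# Hypothetical timeline, for illustration only.
t0 = datetime(2024, 1, 1, 2, 0)
phases = rto_breakdown(
    incident=t0,
    detected=t0 + timedelta(minutes=4),
    decided=t0 + timedelta(minutes=25),
    restored=t0 + timedelta(minutes=40),
)

for name, duration in phases.items():
    print(f"{name:<10} {duration}")
print("within declared RTO:", phases["total"] <= DECLARED_RTO)
```

In this made-up timeline the decision phase alone exceeds the declared RTO, which is the kind of pattern an undefined decision deadline tends to produce.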

HA architecture chosen

The infrastructure was redesigned as a multi-node Proxmox VE cluster built on the following principles:

Automatic failover without human intervention for critical VMs (Proxmox HA)
Synchronous replication to a secondary node on the secondary site (dedicated network)
Proxmox Backup Server on the secondary site with automatic integrity verification
Failover procedure executable from any workstation with VPN access (physical access prerequisite removed)
Tested runbooks: every procedure is validated by the internal team before being signed off

:::Decision retained Synchronous replication only for the 8 critical SLA-bound VMs. The rest on asynchronous replication. This was not a budget decision — it was a design decision: not all workloads carry the same risk profile. :::
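
The tiering rule can be stated as a small classification. This is a sketch of the principle only, not Proxmox configuration: the `Vm` type, the VM names and the `sla_bound` flag are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vm:
    name: str
    sla_bound: bool  # True if the VM backs a contractual SLA (assumed flag)


def replication_mode(vm: Vm) -> str:
    """Critical SLA-bound VMs replicate synchronously, all others asynchronously."""
    return "synchronous" if vm.sla_bound else "asynchronous"


# Hypothetical inventory, for illustration only.
inventory = [
    Vm("erp-db", sla_bound=True),
    Vm("client-portal", sla_bound=True),
    Vm("build-runner", sla_bound=False),
]

plan = {vm.name: replication_mode(vm) for vm in inventory}
for name, mode in plan.items():
    print(f"{name:<15} {mode}")
```

Keeping the rule in one function makes the design decision reviewable: a VM changes tier because its risk profile changed, not because someone edited a replication job by hand.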

Real tests

Three rounds of testing under production conditions:

1. Node failure simulation: forced power-cut of the primary node, measurement of automatic recovery time
2. Primary site loss simulation: full failover to the secondary site, full RTO measurement including decision time
3. Restore from backup: restore of a critical VM from PBS, measurement of real RPO
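
For the node failure simulation, automatic recovery time can be measured from the outside by probing a service until it stops answering and then until it answers again. The following is a minimal sketch, assuming an open TCP port is a sufficient health signal. The function names, polling interval and timeout are assumptions, and the result is only as precise as the polling interval.

```python
import socket
import time
from typing import Callable, Optional


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def measure_downtime(
    probe: Callable[[], bool],
    interval: float = 1.0,
    max_wait: float = 3600.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds between the first failed probe and the next successful one.

    Returns None if no complete outage is observed within max_wait seconds.
    """
    start = clock()
    down_since = None
    while clock() - start < max_wait:
        up = probe()
        now = clock()
        if not up and down_since is None:
            down_since = now            # outage begins
        elif up and down_since is not None:
            return now - down_since     # service is answering again
        sleep(interval)
    return None
```

A run against a real VM might look like `measure_downtime(lambda: tcp_probe("10.0.0.10", 443))`, where the address and port are placeholders. The clock and sleep functions are injectable so the logic can be checked without waiting for a real outage.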

Tests were executed by the internal team after training, not by VSHIFT. This was deliberate: if the team cannot execute the procedures on their own, the DRP is not operational.

:::Production reality The first test in round 2 (primary site loss simulation) detected a network configuration issue on the secondary site that was not visible under normal operation. Better to find it in a test than during a real incident. :::

Result

Effective RTO below 6 minutes on critical workloads, measured under real conditions. RPO below 2 minutes via synchronous replication. Automatic failover tested and validated each quarter by the internal team. Single points of failure eliminated. Up-to-date DRP documentation, auditable by clients under their service contracts.