DRP overhaul on ageing infrastructure
The problem with untested DRPs
It wasn't that the DRP was bad. It was that it only existed on paper.
The VMware infrastructure was 4 years old. The DRP had been documented at initial deployment, formally validated by an external service provider, and never rehearsed since. Over 4 years, the environment had evolved: new VMs, new application traffic flows, network changes. The DRP documented a state that no longer existed.
:::Critical point A documented DRP is not a functional DRP. It is a declarative DRP. The difference shows up at the first production incident. :::
What the audit revealed
The audit lasted 5 days. Five days to produce a short but precise list of problems:
- Two undocumented single points of failure on the production network path
- The failover process assumed the availability of an administrator with physical access to the secondary site — not guaranteed outside business hours
- The decision process for activating the DRP was undefined: who decides, on what criterion, within what timeframe?
- Three critical VMs did not have application-consistent snapshots (pre-snapshot flush not configured)
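:::Field observation The RTO declared in the SLAs was 20 minutes. The RTO measured during the simulated test was 64 minutes, more than three times the commitment. The gap came primarily from the second and third findings above: the dependency on an administrator with physical access to the secondary site, and the undefined activation decision. :::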
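The three open questions from the audit (who decides, on what criterion, within what timeframe) can be written down as data rather than left implicit. The sketch below is one possible encoding, not the policy actually adopted: the class `ActivationPolicy`, its field names, the role names and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ActivationPolicy:
    """Hypothetical encoding of the DRP activation decision."""
    deciders: Tuple[str, ...]    # who decides, in escalation order
    outage_threshold_min: int    # criterion: outage length that forces a go/no-go
    decision_deadline_min: int   # timeframe each decider has before escalation

    def must_decide(self, outage_minutes: int) -> bool:
        """True once the outage has lasted long enough to require a decision."""
        return outage_minutes >= self.outage_threshold_min

    def decider_at(self, minutes_since_threshold: int) -> str:
        """Return who holds the decision after the given delay past the threshold."""
        step = max(0, minutes_since_threshold // self.decision_deadline_min)
        # The last name keeps the decision once the escalation list is exhausted.
        return self.deciders[min(step, len(self.deciders) - 1)]


# Illustrative values only; none of these come from the audit.
policy = ActivationPolicy(
    deciders=("on-call engineer", "infrastructure lead", "IT director"),
    outage_threshold_min=5,
    decision_deadline_min=10,
)

print(policy.must_decide(7))   # the outage is past the 5-minute criterion
print(policy.decider_at(12))   # the first deadline has elapsed, so the decision escalates
```

Making the deadline explicit means that decision time becomes a bounded, measurable part of the RTO instead of an open-ended wait.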
HA architecture chosen
The platform was redesigned as a multi-node Proxmox VE cluster, built on the following principles:
- Automatic failover without human intervention for critical VMs (Proxmox HA)
- Replication to a node on the secondary site over a dedicated network, synchronous for the critical VMs
- Proxmox Backup Server on the secondary site with automatic integrity verification
- Failover procedure executable from any workstation with VPN access (physical access prerequisite removed)
- Tested runbooks: every procedure is validated by the internal team before being signed off
:::Decision retained Synchronous replication only for the 8 critical SLA-bound VMs. The rest on asynchronous replication. This was not a budget decision — it was a design decision: not all workloads carry the same risk profile. :::
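The tiering rule described above is simple enough to express as a function over the VM inventory. This is a minimal sketch assuming a hypothetical inventory format: the `Vm` class, its `critical` and `sla_bound` flags and the VM names are invented for illustration and do not reflect the actual tooling.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Vm:
    name: str
    critical: bool    # hypothetical flag: classified as critical
    sla_bound: bool   # hypothetical flag: covered by a client SLA


def replication_mode(vm: Vm) -> str:
    """Synchronous only for critical SLA-bound VMs, asynchronous for the rest."""
    return "sync" if (vm.critical and vm.sla_bound) else "async"


def replication_plan(vms: Iterable[Vm]) -> Dict[str, List[str]]:
    """Group VM names by the replication mode the rule assigns them."""
    plan: Dict[str, List[str]] = {"sync": [], "async": []}
    for vm in vms:
        plan[replication_mode(vm)].append(vm.name)
    return plan


# Invented inventory, for illustration only.
inventory = [
    Vm("erp-db", critical=True, sla_bound=True),
    Vm("intranet-wiki", critical=False, sla_bound=False),
    Vm("build-runner", critical=True, sla_bound=False),
]
plan = replication_plan(inventory)
print(plan)
```

Keeping the rule in one place makes the risk-profile decision reviewable: a VM changes tier only when its classification changes.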
Real tests
Three rounds of testing under production conditions:
- Node failure simulation — forced power-cut of the primary node, measurement of automatic recovery time
- Primary site loss simulation — full failover to the secondary site, full RTO measurement including decision time
- Restore from backup — restore of a critical VM from PBS, measurement of real RPO
Tests were executed by the internal team after training, not by VSHIFT. This was deliberate: if the team cannot execute the procedures on their own, the DRP is not operational.
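The RTO and RPO measurements taken in these tests reduce to differences between timestamps recorded during the exercise. The sketch below shows that arithmetic under stated assumptions: the function names are hypothetical, and the timestamps are made up so that the result matches the 64-minute figure from the audit, compared against the 20-minute target.

```python
from datetime import datetime, timedelta


def measured_rto(failure_at: datetime, service_restored_at: datetime) -> timedelta:
    """Observed RTO: time from the failure until the service answers again.

    Includes detection and decision time, not only the technical failover.
    """
    return service_restored_at - failure_at


def measured_rpo(failure_at: datetime, last_recovered_write_at: datetime) -> timedelta:
    """Observed RPO: age of the most recent data that survived the failure."""
    return failure_at - last_recovered_write_at


def within_target(measured: timedelta, target_minutes: float) -> bool:
    """Compare a measured duration against a target expressed in minutes."""
    return measured <= timedelta(minutes=target_minutes)


# Hypothetical timestamps, chosen to reproduce the 64-minute audit measurement.
failure = datetime(2024, 1, 15, 10, 0)
restored = datetime(2024, 1, 15, 11, 4)
last_write = datetime(2024, 1, 15, 9, 59)

rto = measured_rto(failure, restored)
rpo = measured_rpo(failure, last_write)
print(rto, within_target(rto, 20))
print(rpo, within_target(rpo, 2))
```

Recording the failure time, the decision time and the restoration time separately during each test is what allows the decision delay to be told apart from the technical failover time.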
:::Production reality The first test of the second round (primary site loss simulation) detected a network configuration issue on the secondary site that was not visible under normal operation. Better to find it in a test than during a real incident. :::
Result
Effective RTO below 6 minutes on critical workloads, measured under real conditions. RPO below 2 minutes via synchronous replication. Automatic failover tested and validated each quarter by the internal team. Single points of failure eliminated. DRP documentation kept up to date and auditable by clients under their service contracts.