DRP overhaul on ageing infrastructure

A VMware infrastructure with a documented but never-executed DRP. Measured RTO was more than 3× the contractual target (64 minutes against 20). Full redesign to multi-node Proxmox HA with quarterly DRP testing.

  • Effective RTO: < 6 min
  • RPO: < 2 min
  • Failover: automatic
  • SLA: met

The problem with untested DRPs

It wasn't that the DRP was bad. It was that it only existed on paper.

The VMware infrastructure was 4 years old. The DRP had been documented at initial deployment, formally validated by an external service provider, and never rehearsed since. Over 4 years, the environment had evolved: new VMs, new application traffic flows, network changes. The DRP documented a state that no longer existed.

:::Critical point A documented DRP is not a functional DRP. It is a declarative DRP. The difference shows up at the first production incident. :::

What the audit revealed

The audit lasted 5 days. Five days to produce a short but precise list of problems:

  • Two undocumented single points of failure on the production network path
  • The failover process assumed the availability of an administrator with physical access to the secondary site — not guaranteed outside business hours
  • The decision deadline for activating the DRP was undefined: who decides, on what criterion, within what timeframe?
  • Three critical VMs did not have application-consistent snapshots (pre-snapshot flush not configured)

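:::Field observation The RTO declared in the SLAs was 20 minutes. The RTO measured during the simulated test was 64 minutes. The gap came primarily from the second and third findings above: the dependency on an administrator with physical access to the secondary site, and the undefined decision process for activating the DRP. :::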
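
To make the size of the gap explicit, the arithmetic can be written out as a minimal sketch. The two figures come from the audit; the variable names are illustrative only.

```python
# Figures from the audit: contractual RTO vs. RTO measured in the simulated test.
declared_rto_min = 20
measured_rto_min = 64

# Minutes spent beyond the contractual target.
overrun_min = measured_rto_min - declared_rto_min

# Measured RTO expressed as a multiple of the contractual target.
ratio = measured_rto_min / declared_rto_min

print(f"Overrun: {overrun_min} min, ratio: {ratio:.1f}x")
# prints "Overrun: 44 min, ratio: 3.2x"
```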

HA architecture chosen

Redesign to multi-node Proxmox VE with the following principles:

  • Automatic failover without human intervention for critical VMs (Proxmox HA)
  • Synchronous replication to a secondary node on the secondary site (dedicated network)
  • Proxmox Backup Server on the secondary site with automatic integrity verification
  • Failover procedure executable from any workstation with VPN access (physical access prerequisite removed)
  • Tested runbooks: every procedure is validated by the internal team before being signed off

:::Decision taken Synchronous replication only for the 8 critical SLA-bound VMs; the rest use asynchronous replication. This was not a budget decision but a design decision: not all workloads carry the same risk profile. :::
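
The tiering rule above can be sketched as a simple classification over a VM inventory. This is an illustration under assumptions, not the configuration used on the project: the `VM` class, the `sla_bound` flag and the VM names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    name: str
    sla_bound: bool  # hypothetical flag: True if the VM is covered by a contractual SLA


def replication_mode(vm: VM) -> str:
    """Return the replication tier for a VM: synchronous only when SLA-bound."""
    return "synchronous" if vm.sla_bound else "asynchronous"


# Hypothetical inventory; the real one had 8 SLA-bound VMs.
inventory = [
    VM("erp-db", sla_bound=True),
    VM("billing-api", sla_bound=True),
    VM("intranet-wiki", sla_bound=False),
]

plan = {vm.name: replication_mode(vm) for vm in inventory}
sync_count = sum(1 for mode in plan.values() if mode == "synchronous")
```

Keeping the rule in one function makes the criterion auditable: a VM moves tier only when its SLA status changes, not as a case-by-case decision.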

Real tests

Three rounds of testing under production conditions:

  1. Node failure simulation — forced power-cut of the primary node, measurement of automatic recovery time
  2. Primary site loss simulation — full failover to the secondary site, full RTO measurement including decision time
  3. Restore from backup — restore of a critical VM from PBS, measurement of real RPO
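
One way to time the technical part of a recovery during such tests is to poll a service until it answers again. The sketch below is an assumption about how this could be done, not the tooling used on the project; `measure_recovery` and `tcp_probe` are hypothetical helpers, and the host and port in the usage comment are placeholders. It does not capture human decision time, which round 2 measured separately as part of the full RTO.

```python
import socket
import time
from typing import Callable


def tcp_probe(host: str, port: int, timeout_s: float = 1.0) -> Callable[[], bool]:
    """Build a probe that returns True when a TCP connection to host:port succeeds."""
    def probe() -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                return True
        except OSError:
            return False
    return probe


def measure_recovery(
    is_up: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll is_up() until it returns True and return the elapsed seconds.

    Start the measurement at the moment the failure is triggered.
    Raises TimeoutError if the service is still down after timeout_s.
    """
    start = clock()
    while True:
        if is_up():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"service still down after {timeout_s} s")
        sleep(interval_s)


# Usage (placeholder address), started right after the forced power-cut:
#   elapsed = measure_recovery(tcp_probe("192.0.2.10", 443), timeout_s=1800)
#   print(f"Recovered in {elapsed / 60:.1f} min")
```

The clock and sleep functions are injectable so the timing logic can be checked without waiting in real time; the resolution of the measurement is bounded by `interval_s`.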

Tests were executed by the internal team after training, not by VSHIFT. This was deliberate: if the team cannot execute the procedures on their own, the DRP is not operational.

:::Production reality The first test in round 2 (primary site loss simulation) detected a network configuration issue on the secondary site that was not visible under normal operation. Better to find it in a test than during an incident. :::

Result

Effective RTO below 6 minutes on critical workloads, measured under real conditions. RPO below 2 minutes via synchronous replication. Automatic failover tested and validated each quarter by the internal team. Single points of failure eliminated. Up-to-date DRP documentation auditable by clients within service contracts.