Site Recovery
Site Recovery is available exclusively with an Enterprise license. The required feature flag is ceph_replication. Learn more about licensing.
Site Recovery provides disaster recovery (DR) capabilities for your Proxmox environments. It manages data replication between nodes or clusters, orchestrates recovery plans, and supports failover, failback, and emergency DR operations -- giving you confidence that critical workloads can be restored quickly when disaster strikes.
Overview
Site Recovery is built around two core concepts:
- Replication Jobs -- Continuous or scheduled data replication from a source node/cluster to a target, so that a recent copy of your VMs is available on the target. How recent depends on the job's schedule and its last successful sync.
- Recovery Plans -- Predefined sequences of actions that describe how to restore a set of VMs on a target cluster in case of failure.
Together, these allow you to protect workloads, test your DR strategy regularly, and execute real failovers with minimal downtime.
Interface Tabs
The Site Recovery page is organized into four tabs: Dashboard, Protection, Recovery Plans, and Emergency.
Dashboard
The Dashboard tab provides a high-level view of your replication health:
- Overall replication status -- healthy, degraded, or critical
- Active replication job count and their current states
- Error count -- jobs in an error state are flagged immediately
- Recovery plan status overview
Use this tab as a daily check-in to verify that your DR posture is healthy.
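One way to think about the overall status is as a roll-up of the individual job states. The sketch below is illustrative only: the state names and the rule (no errors is healthy, some errors is degraded, all jobs in error is critical) are assumptions, not ProxCenter's documented logic.

```python
from collections import Counter


def overall_status(job_states):
    """Roll per-job states up into 'healthy', 'degraded', or 'critical'.

    job_states is a list of strings such as "ok", "syncing", "paused",
    or "error". These names are hypothetical.
    """
    errors = Counter(job_states)["error"]
    if errors == 0:
        return "healthy"
    # Assumed rule: every job failing is critical, otherwise degraded.
    if errors == len(job_states):
        return "critical"
    return "degraded"


print(overall_status(["ok", "syncing", "ok"]))
print(overall_status(["ok", "error"]))
```

Under these assumptions, the first call reports a healthy posture and the second a degraded one.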
Protection
The Protection tab manages replication jobs. Each job defines what data is replicated, from where, and to where.
Creating a Replication Job
Click Create Job to open the creation dialog. You can configure:
- Source connection -- the Proxmox cluster containing the VMs to protect
- Target connection -- the destination cluster for replicated data
- VMs to replicate -- select individual VMs from the source cluster
- Schedule -- how often replication runs
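The four settings above can be pictured as a small record that must be complete before the job is created. This is a minimal sketch: the class, field names, and connection name are hypothetical and do not reflect ProxCenter's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicationJobDraft:
    """Hypothetical in-memory form of the Create Job dialog's fields."""

    source_connection: str = ""
    target_connection: str = ""
    vm_ids: list[int] = field(default_factory=list)  # Proxmox VM IDs
    schedule: str = ""  # format is an assumption, e.g. "hourly"

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are still empty."""
        values = {
            "source_connection": self.source_connection,
            "target_connection": self.target_connection,
            "vm_ids": self.vm_ids,
            "schedule": self.schedule,
        }
        return [name for name, value in values.items() if not value]


draft = ReplicationJobDraft(source_connection="pve-prod", vm_ids=[101, 102])
print(draft.missing_fields())  # ['target_connection', 'schedule']
```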
Managing Replication Jobs
For each job, the following actions are available:
| Action | Description |
|---|---|
| Sync | Trigger an immediate replication sync |
| Pause | Temporarily suspend replication |
| Resume | Resume a paused replication job |
| Delete | Remove the replication job entirely |
Selecting a job displays its execution logs in a detail panel, showing the history of sync operations with timestamps and results.
Run a manual sync after making significant changes to a protected VM to ensure the latest state is replicated before relying on it for recovery.
Recovery Plans
Recovery Plans define the procedure for restoring services on a target cluster. The tab lists all existing plans and lets you create new ones.
Creating a Recovery Plan
Click Create Plan to define:
- Plan name and description
- Source and target clusters
- Associated replication jobs -- which replication jobs feed into this plan
- VM startup order and dependencies
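Startup order with dependencies is a topological ordering problem: a VM can start only after everything it depends on is running. The sketch below shows the idea with Python's standard `graphlib` module. The VM names and the dictionary shape are illustrative assumptions, not how ProxCenter stores plans internally.

```python
from graphlib import CycleError, TopologicalSorter


def startup_order(dependencies):
    """Return VM names in an order that respects their dependencies.

    dependencies maps each VM name to the set of VMs it must wait for.
    Raises graphlib.CycleError if the dependencies form a loop.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical three-tier application: database, then app, then web.
deps = {
    "db": set(),
    "app": {"db"},
    "web": {"app"},
}
print(startup_order(deps))  # ['db', 'app', 'web']
```

A circular dependency (for example, `app` waiting on `web` while `web` waits on `app`) has no valid startup order, which is why the sketch lets `CycleError` propagate instead of guessing.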
Recovery Plan Operations
Each recovery plan supports three operations:
| Operation | Description |
|---|---|
| Test Failover | Executes the recovery plan in a network-isolated environment, so production workloads are not affected. Use this to validate your DR strategy regularly. |
| Failover | Activates the recovery plan for real. VMs are started on the target cluster using the most recent replicated data. Use this during an actual disaster. |
| Failback | After the primary site is restored, failback reverses the direction -- migrating workloads back from the DR site to the original production cluster. |
While an operation runs, ProxCenter tracks its progress in near real time by polling the execution status every 3 seconds and displaying step-by-step updates.
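The polling behavior can be sketched as a simple loop. Only the 3-second interval comes from the text. The function names, the status dictionary shape, and the terminal state names are assumptions made for illustration, and the status source is passed in as a function, so no real API is implied.

```python
import time

POLL_INTERVAL_SECONDS = 3  # interval stated in the text
TERMINAL_STATES = {"completed", "failed"}  # hypothetical state names


def watch_execution(fetch_status, on_update, sleep=time.sleep, max_polls=1000):
    """Poll fetch_status() until it reports a terminal state.

    fetch_status is a caller-supplied function returning a dict such as
    {"state": "running", "step": "Starting VM 101"} (assumed shape).
    on_update receives every status so a UI can show step-by-step progress.
    """
    for _ in range(max_polls):
        status = fetch_status()
        on_update(status)
        if status["state"] in TERMINAL_STATES:
            return status
        sleep(POLL_INTERVAL_SECONDS)
    raise TimeoutError("execution did not finish within max_polls polls")


# Example with canned responses instead of a real status endpoint.
responses = iter([
    {"state": "running", "step": "Starting VM 101"},
    {"state": "completed", "step": "All VMs started"},
])
final = watch_execution(lambda: next(responses), print, sleep=lambda seconds: None)
```

Passing `sleep` as a parameter keeps the loop testable without waiting, and `max_polls` bounds the loop if an execution never reaches a terminal state.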
Failover is a disruptive operation. Ensure the source site is truly unavailable before initiating a production failover, as running the same VMs on both sites simultaneously can cause data corruption.
Test Cleanup
After running a test failover, use the Cleanup action to tear down the test environment and release resources on the target cluster. This ensures that test artifacts do not consume storage or interfere with future tests.
Execution History
Select a recovery plan to view its execution history -- a chronological list of all test, failover, and failback operations with their outcomes, timestamps, and any errors encountered.
Emergency
The Emergency tab is designed for critical situations where you need to act fast without going through the full recovery plan workflow.
Emergency DR Mode allows you to:
- Start individual VMs on a target cluster directly from their most recent replication snapshot
- Execute immediate failover of an entire recovery plan
- Execute failback to restore services to the original site
This tab aggregates all replication jobs and recovery plans with quick-action buttons, giving operators a single view to manage a crisis.
Emergency operations bypass the normal validation steps. Use them only when time is critical and you understand the implications of starting replicated VMs without a full plan execution.
Workflow Example
A typical Site Recovery workflow looks like this:
- Set up replication: Create replication jobs for your critical VMs, pointing to a secondary Proxmox cluster
- Create a recovery plan: Group the replication jobs into a recovery plan with the correct startup order
- Test regularly: Run test failovers monthly to validate that recovery works as expected, then clean up
- Respond to incidents: If the primary site fails, execute a failover from the Emergency tab or Recovery Plans tab
- Restore normal operations: Once the primary site is back, perform a failback to return workloads to production
Permissions
| Permission | Description |
|---|---|
| vm.config | Required to access Site Recovery and manage replication jobs and recovery plans |
Users without the vm.config permission will not see the Site Recovery entry in the navigation sidebar.
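The visibility rule amounts to a permission check applied when the sidebar is built. A minimal sketch, assuming permissions arrive as a set of strings. The other sidebar entries are hypothetical placeholders.

```python
REQUIRED_PERMISSION = "vm.config"  # permission named in the table above


def sidebar_entries(all_entries, user_permissions):
    """Drop the Site Recovery entry for users who lack vm.config."""
    allowed = REQUIRED_PERMISSION in user_permissions
    return [
        entry for entry in all_entries
        if entry != "Site Recovery" or allowed
    ]


# "Inventory" and "Backups" are made-up entries for illustration.
entries = ["Inventory", "Backups", "Site Recovery"]
print(sidebar_entries(entries, {"vm.config"}))
print(sidebar_entries(entries, set()))
```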