DR Criteria in the Multicloud World

May 28

Part I: The Criteria That Matter

Disaster Recovery products, deployments, and strategies have evolved extensively over the last 15 years or so. With Cloud technology, robust virtualization options, vastly improved wide area networking, and new products and methods to choose from, planning and implementing a modern Disaster Recovery solution can be a daunting task. Legacy backup and DR products with limited support for hyperscalers and emerging Cloud technologies can complicate things even more.

It's tempting, and common, to focus on a few highly quantitative metrics such as RPO and RTO. As important as they are, there are other equally important elements to evaluate for a successful DR plan and implementation. This is the first in a series of blogs that attempts to itemize the evaluation criteria and explore the considerations for each.

Evaluation criteria should include:

DR site location and network infrastructure
Personnel plan
Application categorization
Wave planning
Handling production IT changes
Boot order and post-boot delay
Recovery Point Objective (RPO)
Recovery Time Objective (RTO) and Data Consistency
DR drill planning
Converged or separate backup and DR
Hybrid and Multi-Cloud
Product selection

Subsequent blogs will address each of these criteria in more detail. To complete this one, I'd like to discuss how RackWare concluded that its approach was the best architecture for addressing all of the above.

When RackWare was founded with the vision of building the most sophisticated mobility and DR solution, we evaluated several different types of solutions. Engineers very often design a product around their own skillset rather than around what end users really need. If you have storage expertise, you will build a storage-based solution. If you have hypervisor expertise, you will build a hypervisor virtual appliance. As the adage goes, if your only tool is a hammer, every problem looks like a nail.

So we built three functioning prototypes and evaluated all of them against a wide range of criteria. Along the way there was some friendly debate among us as to which would prove superior.

The first prototype was a VMDK solution. At the time these were fairly new and not much was known about them. In a VMDK solution, a VMDK file, designed to operate in one specific hypervisor context, is transformed to run in a different hypervisor. In many instances this works well, especially for smaller, simpler Virtual Machines running in isolation. This prototype was productized far enough that it was even used a few times for production migration projects. These were successful, but it became clear that a VMDK solution was not optimal when considering the broader scope of criteria.

The second prototype was a storage solution whereby a kernel filter driver accesses and replicates storage at the sector level. Storage solutions can work well, especially when you have identical hardware and storage on both sides and support from storage appliances. However, we quickly discovered that once the requirement for identical hardware and storage was removed, this approach became very troublesome, particularly given the coming Cloud age, in which hardware platforms and storage would differ between sites. We did a couple of special projects with it but quickly deprecated this approach.

The idea behind the third prototype was to replicate and delta sync through the Operating System: among other techniques, filesystems and logical volumes were replicated and subsequently delta sync'ed. The theory was that applications see the world through the eyes of the Operating System. The OS presents storage and networking to applications, as well as any other hardware or platform services. So replicating through the OS provides the widest scope and the most consistent experience for both end users and applications. If you come from a VMDK or storage background, this is a little counterintuitive, and you might ask why anyone would do it. Not surprisingly, it was the most difficult solution to build. But it was almost magical how it solved all the inadequacies of the other two approaches, especially for Cloud, Hybrid Cloud and Cross Cloud use cases.
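To make the idea of syncing through the OS concrete, here is a minimal Python sketch of a file-level delta sync. It is an illustration only, not RackWare's implementation: the function names `file_digest` and `delta_sync` are hypothetical, both sides are assumed to be locally mounted directories, and content hashing stands in for whatever change detection a real product uses. It also ignores deletions, permissions, open files, and application consistency.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def delta_sync(source: Path, target: Path) -> List[str]:
    """Copy only new or changed files from source to target.

    Works on files as the OS presents them, so it does not care what
    disk format, hypervisor, or storage hardware sits underneath.
    Returns the sorted relative paths of the files that were copied.
    """
    copied = []
    for src in source.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        dst = target / rel
        # Skip files whose content is already identical on the target.
        if dst.is_file() and file_digest(dst) == file_digest(src):
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(rel.as_posix())
    return sorted(copied)
```

The first call against an empty target copies every file, which is the initial replication; later calls copy only what changed, which is the delta sync. Because the comparison happens at the file level, the source and target can sit on entirely different storage.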

Some of the advantages include:

Highly flexible DR policies with granularity of frequency and selected data
Far superior data consistency ensuring applications achieve production operational state with no or minimum intervention
Permits DR test without bringing up a second site
Converged backup features such as retention policies and single file restore
Flexible and better storage options on the target side
Supports in-memory applications (e.g., SAP HANA) as part of standard operations with no special or additional configuration or product extensions
Initial replication is more efficient as it only replicates used data as opposed to the entire disk
Not sensitive to network outages; never requires a complete re-replication of data no matter how long the outage
Selective sync at the drive, directory and file level
Not sensitive to disk defragmentation
Single file restore and restore anywhere
BIOS → UEFI and UEFI → BIOS conversion
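As a rough illustration of what policy granularity and selective sync could look like, the Python sketch below models a DR policy with a sync frequency and include/exclude paths. The `SyncPolicy` class, its field names, and the example paths are all hypothetical and are not taken from any RackWare product or API.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import List


@dataclass
class SyncPolicy:
    """Hypothetical DR policy: how often to sync and which paths to cover."""
    interval_minutes: int
    include: List[str] = field(default_factory=lambda: ["/"])
    exclude: List[str] = field(default_factory=list)

    @staticmethod
    def _under(path: PurePosixPath, root: str) -> bool:
        """True if path is root itself or lies somewhere beneath it."""
        r = PurePosixPath(root)
        return path == r or r in path.parents

    def selects(self, path: str) -> bool:
        """True if path falls under an include root and under no exclude root."""
        p = PurePosixPath(path)
        included = any(self._under(p, r) for r in self.include)
        excluded = any(self._under(p, r) for r in self.exclude)
        return included and not excluded


# Example: sync a database directory and /etc every 15 minutes,
# but leave the database's scratch space out of replication.
policy = SyncPolicy(
    interval_minutes=15,
    include=["/var/lib/db", "/etc"],
    exclude=["/var/lib/db/tmp"],
)
```

With this policy, `policy.selects("/etc/hosts")` is true and `policy.selects("/var/lib/db/tmp/scratch")` is false. Selection at the directory and file level like this is natural when replication works through the filesystem, because paths are the unit the OS already exposes.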

The superiority of our Image Replication was rapidly proven in the Enterprise market. RackWare earned a reputation as the solution of choice when other products failed or were hard to use. Along the way we secured 4 patents and hundreds of happy customers, with over 1 million servers successfully replicated.
