Evaluating DR Criteria - Part 3

2 days ago

Part III: Not All Servers Are Equal

In Part 2 we addressed DR site location and the personnel plan, and how these often under-considered criteria can make or break the efficacy of your DR strategy. By now you have studied your geographic risk and understand the realistic, practical implications of your personnel plan.

The next issues to address are application categorization and how it feeds into an intelligent plan for Recovery Point Objective (RPO) and Recovery Time Objective (RTO). Beyond defining acceptable data loss and acceptable downtime, these objectives matter because they dramatically affect costs, ongoing maintenance, and DR drills.

Application Categorization

Enterprises can have hundreds or even thousands of applications spread across a huge number of servers. Each performs a different function with a different value to the business. Even if your environment consists of dozens of servers rather than thousands, your servers still differ in important ways that a recovery plan must account for.

Some servers are required infrastructure, necessary for anything to function at all. Other servers are mission critical. It's tempting to group infrastructure servers with mission critical ones, but in reality they often require different RPOs and RTOs. For example, infrastructure servers typically have the tightest RTO, because they need to be up and functional before anything else, but they often don't need an overly aggressive RPO. Even within infrastructure servers, there can be different categorizations for both RPO and RTO.

There should be a well-thought-out definition of what mission critical means to the business. A typical definition is that the business cannot function at all unless those systems are operational. As with infrastructure servers, mission critical servers may share the same RTO but need different RPOs. For high-transaction applications that require a zero RPO, you should strongly consider an application-specific mechanism. Not only will that mechanism achieve the lowest data loss, it will also ensure the fastest reliable recovery. In reality, very few servers require an application-specific mechanism, but when they do, it is highly advised.

Beyond infrastructure and mission critical, there can be many other categories. To be clear, there are tradeoffs between having fewer and more categories. More categories create more complexity but meet your needs very specifically. Fewer categories are less complex but may compromise on requirements, leaving certain servers with a policy that is either unnecessarily aggressive or sub-optimal. These decisions should be well thought out and documented. Importantly, DR testing (a future topic) will help you make those determinations and allows for fine-tuning to balance cost against function. It's wise to select a product that lets you easily change policies at runtime, without disrupting production operations.
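To make the idea of categories concrete, here is a minimal sketch of how a small set of categories could be written down as policies. The category names, server names, and RPO/RTO values are all hypothetical and only illustrate the shape of the data, not recommended targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPolicy:
    """RPO/RTO targets for one category of servers (illustrative values only)."""
    name: str
    rpo_minutes: int  # maximum acceptable data loss
    rto_minutes: int  # maximum acceptable downtime


# Hypothetical categories; real ones come from your own business analysis.
POLICIES = {
    "infrastructure": RecoveryPolicy("infrastructure", rpo_minutes=240, rto_minutes=15),
    "mission_critical": RecoveryPolicy("mission_critical", rpo_minutes=5, rto_minutes=30),
    "standard": RecoveryPolicy("standard", rpo_minutes=1440, rto_minutes=480),
}

# Hypothetical server inventory mapped to categories.
SERVER_CATEGORY = {
    "dns01": "infrastructure",
    "erp-db01": "mission_critical",
    "dev-build01": "standard",
}


def policy_for(server: str) -> RecoveryPolicy:
    """Return the policy for a server, defaulting to the least aggressive category."""
    return POLICIES[SERVER_CATEGORY.get(server, "standard")]
```

In this sketch the infrastructure category has the tightest RTO but a relaxed RPO, while the mission critical category has the tightest RPO. Keeping the mapping in one place also makes it easy to adjust a category after a DR test.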

Implications for RPO

So, why should RPOs and RTOs differ among servers? A more aggressive RPO requires more resources and hence more cost. Unnecessary delta syncs impact the Origin environment, consuming more storage bandwidth, computing power, and local network bandwidth. This likely isn't an issue during off-peak periods, but during peak periods, which are your most important consideration, resources usually run close to maximum capacity. Will an overly aggressive RPO for all your servers impact production and force you to add resources? And how many more resources, at what cost, will be required in the DR site, which isn't even running production?
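A rough way to see the cost of a blanket RPO is to count delta syncs per day. The sketch below assumes, purely for illustration, that each server syncs once per RPO interval. The function names and server counts are hypothetical, and the sync count is only a proxy, since the data actually transferred depends on each server's change rate.

```python
from typing import Mapping


def syncs_per_day(rpo_minutes: int) -> float:
    """Number of delta syncs per day if a server syncs once per RPO interval."""
    return 24 * 60 / rpo_minutes


def daily_sync_count(server_rpos: Mapping[str, int]) -> float:
    """Total delta syncs per day across all servers (server name -> RPO in minutes)."""
    return sum(syncs_per_day(rpo) for rpo in server_rpos.values())


# Hypothetical fleet of 100 servers, all on a 5-minute RPO.
uniform = {f"server{i}": 5 for i in range(100)}

# Same fleet, but only 10 servers keep the 5-minute RPO; the rest sync every 4 hours.
tiered = {f"server{i}": (5 if i < 10 else 240) for i in range(100)}

print(daily_sync_count(uniform))  # 28800.0
print(daily_sync_count(tiered))   # 3420.0
```

Under these assumptions the categorized fleet runs roughly one eighth as many syncs per day, which is load removed from the Origin during peak periods.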

Additionally, networks have a tipping point for reliability and optimal bandwidth. If your network starts dropping packets to handle congestion, retransmissions and synchronization retries can greatly impact both Origin and Target systems and environments. The difference between 90% reliability and 60% reliability can be as little as a 5% increase in traffic. Frustratingly, after you report the problem, the network team measures bandwidth and reliability during off-peak hours and finds no issues. You don't want to sync huge log files or massive code updates on your development servers during peak hours. While that's an extreme example to illustrate the point, the same likely applies to web and application server updates: they can be synced during off-peak periods.
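Keeping low-priority syncs out of peak hours can be expressed as a blackout window check. This is a minimal sketch under stated assumptions: the function names and the 08:00 to 18:00 window are hypothetical, and a real product would expose its own scheduling controls.

```python
from datetime import time
from typing import Iterable, Tuple


def in_window(now: time, start: time, end: time) -> bool:
    """True if now falls in [start, end), including windows that wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def sync_allowed(now: time, blackouts: Iterable[Tuple[time, time]]) -> bool:
    """A sync may run only when the current time is outside every blackout window."""
    return not any(in_window(now, start, end) for start, end in blackouts)


# Hypothetical peak-hours blackout for development servers.
dev_blackouts = [(time(8, 0), time(18, 0))]

print(sync_allowed(time(12, 0), dev_blackouts))  # False: inside peak hours
print(sync_allowed(time(22, 0), dev_blackouts))  # True: off-peak
```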

Implications for RTO

There are two major points to consider when evaluating Recovery Time.

First, Recovery Time implications don't usually impact the Origin environment, but they definitely affect costs in the Target. A more aggressive RTO requires more pre-provisioned resources in the Target environment. Servers and other infrastructure with an aggressive RTO should not require any provisioning at failover time; those resources should already be present and available.

Second, for your mission critical servers to be functioning in the DR site as early as possible, they need to be given that priority. Disaster Recovery and High Availability are not the same thing. DR can be highly automated, but there will always be a human element when recovering from a total outage at an Origin site. As servers recover, they compete for resources and for attention from personnel. If all your servers have the same recovery time policy, then none of your servers have priority. In fact, it's highly likely that your simpler, smaller, lower-priority servers will end up recovering first.
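The priority point can be illustrated with an explicit recovery order. The sketch below assumes each server carries a priority number (lower recovers first) and an RTO in minutes. The server names, priorities, and values are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Server:
    name: str
    priority: int     # lower number recovers first (hypothetical convention)
    rto_minutes: int  # target recovery time


def recovery_order(servers: Iterable[Server]) -> List[Server]:
    """Order servers by priority, then by tightest RTO, then by name for stability."""
    return sorted(servers, key=lambda s: (s.priority, s.rto_minutes, s.name))


# Hypothetical inventory, listed in no particular order.
fleet = [
    Server("dev-build01", priority=3, rto_minutes=480),
    Server("erp-db01", priority=2, rto_minutes=30),
    Server("dns01", priority=1, rto_minutes=15),
    Server("crm-db01", priority=2, rto_minutes=60),
]

for server in recovery_order(fleet):
    print(server.name)
```

If every server had the same priority and RTO, this ordering would fall back to the server name alone, which is another way of saying that no server has priority.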

Final Thoughts

From a product selection standpoint, it's highly advantageous to have a platform that can easily configure and handle a myriad of different policies for RPO (both frequency and schedule), schedule blackout times, and even execute selective delta syncs (e.g., include or exclude specific files, directories, or drives).
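As an illustration of what a selective delta sync decides, here is a minimal sketch of an include/exclude filter built on Python's standard fnmatch module. The function name, the rule that excludes win over includes, and the example patterns are assumptions for this sketch and do not describe any particular product.

```python
from fnmatch import fnmatchcase
from typing import Sequence


def should_sync(path: str, include: Sequence[str], exclude: Sequence[str]) -> bool:
    """Decide whether a path belongs in a delta sync.

    Exclude patterns win over include patterns. An empty include list
    means everything not excluded is synced.
    """
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    return not include or any(fnmatchcase(path, pattern) for pattern in include)


# Hypothetical policy: sync application data, but skip log files.
include = ["/srv/*"]
exclude = ["*.log"]

print(should_sync("/srv/app/config.yml", include, exclude))  # True
print(should_sync("/srv/app/debug.log", include, exclude))   # False
print(should_sync("/home/user/notes.txt", include, exclude)) # False
```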

Likewise, having a platform that can easily accommodate different recovery mechanisms with different priorities and cost structures will greatly facilitate getting your business up and running smoothly and reliably.

It's also vital to conduct regular DR drills to evaluate your categories against their recovery mechanisms so you can make important adjustments. Of course, there's more to DR drills, which we'll cover in Part 4.
