Besides cost, the key business continuity drivers for a recovery solution are the Recovery Point Objective and the
Recovery Time Objective.
Recovery Point Objective
The Recovery Point Objective (RPO) refers to the point in time to which data must be recoverable. Another way to think of this is that the RPO specifies the maximum allowable delay between a data commit on the production side and the replication
of that data to the recovery site.
It is probably easiest to think of RPO in terms of the amount of allowable data loss. The RPO is frequently
expressed relative to the time at which replication stops, as in “less than 5 minutes of data loss.”
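As a concrete illustration, the short Python sketch below shows how a monitoring check might compare the current replication lag against an RPO target. The function name, the timestamps, and the five-minute threshold are illustrative assumptions, not drawn from any particular replication product.

```python
# Minimal sketch: is the replication lag within the RPO target?
# Threshold and timestamps are illustrative assumptions.
import time

RPO_SECONDS = 5 * 60  # business requirement: "less than 5 minutes of data loss"

def rpo_breached(last_commit_ts: float, last_replicated_ts: float) -> bool:
    """Return True if data committed on the production side has not yet
    reached the recovery site within the RPO window."""
    lag = last_commit_ts - last_replicated_ts
    return lag > RPO_SECONDS

# Example: production committed data just now, but the recovery site has only
# applied data up to 8 minutes ago -> the 5-minute RPO is breached.
now = time.time()
print(rpo_breached(last_commit_ts=now, last_replicated_ts=now - 480))  # True
```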
Recovery Time Objective
The second major business driver is the Recovery Time Objective (RTO). This is the amount of time it will take us
to recover from a disaster. Depending on the context, this may refer only to the technical steps required to bring up
services on the recovery system. Usually, however, it refers to the total amount of time that the service will be
unavailable, including the time to discover that an outage has occurred, the time to decide to fail over, the time
to get staff in place to perform the recovery, and the time to bring up services at the recovery site.
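To make that distinction concrete, the hypothetical Python sketch below adds up the phases named above into a total RTO. The figures are invented for illustration; real numbers come from the business requirements and from timed recovery tests.

```python
# Illustrative RTO breakdown covering more than the technical cutover alone.
# All durations are hypothetical placeholders.
rto_phases_minutes = {
    "discover the outage": 15,
    "decide to fail over": 30,
    "get staff in place": 30,
    "bring up services at the recovery site": 45,
}

total_rto = sum(rto_phases_minutes.values())
print(f"Projected RTO: {total_rto} minutes")  # 120 minutes in this example
```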
The costs associated with different RPO and RTO values will be determined by the type of application and its
business purpose. Some applications may be able to tolerate unplanned outages of up to days without incurring
substantial costs. Other applications may cause significant business-side problems with even minor amounts of
unscheduled downtime.
Different applications and environments have different tolerances for RPO and RTO. Some applications might be
able to tolerate a potential data loss of days or even weeks; some may not be able to tolerate any data loss at all.
Some applications can remain unavailable long enough for us to purchase a new system and restore from tape; some
cannot.
Recovery Strategies
There are several different strategies for recovering an application. Choosing a strategy will almost always involve
an investment in hardware, software, and implementation time. If a strategy is chosen that does not support the
business RPO and RTO requirements, an expensive re-tooling may be necessary.
Many types of replication solutions can be implemented at a server, disk storage, or storage network level. Each has
unique advantages and disadvantages. Server replication tends to be cheapest, but also involves using server cycles
to manage the replication. Storage network replication is extremely flexible, but can be more difficult to configure.
Disk storage replication tends to be rock solid, but is usually limited in terms of supported hardware for the
replication target.
Regardless of where we choose to implement our data replication solution, we will still face many of the same issues.
One issue that needs to be addressed is re-silvering of a replication solution that has been partitioned for some
amount of time. Ideally, only the changed sections of the disks will need to be re-replicated. Some less sophisticated
solutions require a re-silvering of the entire storage area, which can take a long time and soak up a lot of bandwidth.
Re-silvering is an issue that needs to be investigated during the product evaluation.
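As an illustration of the difference, here is a simplified Python sketch of delta re-silvering based on per-block checksums. The block size and checksum scheme are assumptions for the example; a real product would typically track dirty regions rather than rescanning both sides.

```python
# Simplified sketch of delta re-silvering: only blocks whose checksums differ
# between source and target are queued for re-replication. Block size and
# hashing scheme are illustrative assumptions.
import hashlib

BLOCK_SIZE = 1 << 20  # 1 MiB blocks (illustrative)

def block_checksums(path: str) -> list[bytes]:
    """Checksum each fixed-size block of a file or device image."""
    sums = []
    with open(path, "rb") as f:
        while chunk := f.read(BLOCK_SIZE):
            sums.append(hashlib.sha256(chunk).digest())
    return sums

def blocks_to_resilver(source_path: str, target_path: str) -> list[int]:
    """Indices of blocks that changed while the replica was partitioned."""
    src = block_checksums(source_path)
    dst = block_checksums(target_path)
    # A block missing on the target, or with a different checksum, must be re-sent.
    return [i for i, s in enumerate(src) if i >= len(dst) or dst[i] != s]
```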
Continuity Planning
Continuity planning should be done during the initial architecture and design phases for each service. If the service
is not designed to accommodate a natural recovery, it will be expensive and difficult to retrofit a recovery
mechanism.
The type of recovery that is appropriate for each service will depend on the importance of the service and its
tolerance for downtime.
There are five generally recognized approaches to recovery architecture:
- Server Replacement: Some services are run on standard server images with very little local customization.
Such servers may most easily be recovered by replacing them with standard hardware and standard server
images.
- Backup and Restore: Where there is a fair amount of tolerance for downtime on a service, it may be
acceptable to rely on hardware replacement combined with restores from backups.
- Shared Nothing Failover: Some services are largely data-independent and do not require frequent data
replication. In such cases, it might make sense to have an appropriately configured replacement at a recovery
site. (One example may be an application server that pulls its data from a database. Aside from copying
configuration changes, replication of the main server may not be necessary.)
- Replication and Failover: Several different replication technologies exist, each with different strengths and
weaknesses. Array-based, SAN-based, file system-based or file-based technologies allow replication of data on
a targeted basis. Synchronous replication techniques prevent data loss at the cost of performance and
geographic dispersion. Asynchronous replication techniques permit relatively small amounts of data loss in
order to preserve performance or allow replication across large distances. Failover techniques range from nearly
instantaneous automated solutions to administrator-invoked scripts (a sketch of such a script follows this list) to involved manual checklists.
- Live Active-Active Stretch Clusters: Some services can be provided by active servers in multiple locations,
with failover handled by client configuration. Some examples include DNS services (failover by resolv.conf
lists), SMTP gateway servers (failover by MX record), web servers (failover by DNS load balancing), and some
market data services (failover by client configuration). Such services should almost never be down. (Stretch
clusters are clusters where the members are located at geographically dispersed locations.)
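As promised above, the following is a hypothetical sketch of an administrator-invoked failover script for the replication-and-failover approach. The host names, the promote-to-primary and update-dns commands, and the service name are all placeholders; a real script would call the replication product's own promotion tooling and the organization's DNS or load-balancer API.

```python
# Hypothetical administrator-invoked failover script. Every host name and
# command below is a placeholder, not a real tool.
import subprocess
import sys

PRIMARY = "db-primary.example.com"
REPLICA = "db-replica.dr-site.example.com"

def run(cmd: list[str]) -> None:
    """Echo and execute a command, stopping the failover on any error."""
    print("+ " + " ".join(cmd))
    subprocess.run(cmd, check=True)

def fail_over() -> None:
    # 1. Stop replication and promote the replica to read/write.
    run(["ssh", REPLICA, "promote-to-primary"])      # placeholder command
    # 2. Repoint clients at the recovery site (DNS, load balancer, etc.).
    run(["update-dns", "db.example.com", REPLICA])   # placeholder command
    # 3. Start the application services at the recovery site.
    run(["ssh", REPLICA, "systemctl", "start", "app.service"])

if __name__ == "__main__":
    answer = input(f"Fail over from {PRIMARY} to {REPLICA}? [y/N] ")
    if answer.strip().lower() != "y":
        sys.exit("Failover aborted.")
    fail_over()
```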
Which of these recovery approaches is appropriate to a given situation will depend on the cost of downtime on the
service, as well as the particular characteristics of the service's architecture.
Causes of Recovery Failure
Janco released a study outlining the most frequent causes of recovery failure:
- Failure of the backup or replication solution. If a copy of the data is not available, we will not be able to recover.
- Unidentified failure modes. The recovery plan does not cover the type of failure that actually occurs.
- Failure to train staff in recovery procedure. If people don't know how to carry out the plan, the work is wasted.
- Lack of a communication plan. How do you communicate when your usual infrastructure is not available?
- Insufficient backup power. Do you have enough capacity? How long will it run?
- Failure to prioritize. What needs to be restored first? If you don't lay that out in advance, you will waste valuable time on recovering less critical services.
- Unavailable disaster documentation. If your documentation is only available on the systems that have failed, you are stuck. Keep physical copies available in recovery locations.
- Inadequate testing. Tests reveal weaknesses in the plan and also train staff to deal with a recovery situation in a timely way.
- Unavailable passwords or access. If the recovery team does not have the permissions necessary to carry out the recovery, it will fail.
- Plan is out of date. If the plan is not updated to reflect changes in the environment, the recovery will not succeed.
Recovery Business Practices
Janco also suggested several key business practices to improve the likelihood that a recovery will succeed:
- Eliminate single points of failure.
- Regularly update staff contact information, including assigned responsibilities.
- Stay abreast of current events, such as weather and other emergency situations.
- Plan for the worst case.
- Document your plans and keep updated copies available in well-known, available locations.
- Script what you can, and test your scripts.
- Define priorities and thresholds.
- Perform regular tests and make sure you can meet your RTO and RPO requirements.