About Me

Experienced Information Technology leader, author, system administrator, and systems architect.

Wednesday, June 12, 2013

Recovery Strategies

Besides cost, the key business continuity drivers for a recovery solution are the Recovery Point Objective and the Recovery Time Objective.

Recovery Point Objective

The Recovery Point Objective (RPO) is the point in time to which data must be recoverable after a failure. Another way to think of this is that the RPO specifies the maximum allowable time delay between a data commit on the production side and the replication of this data to the recovery site.

It is probably easiest to think of RPO in terms of the amount of allowable data loss. The RPO is frequently expressed in terms of its relation to the time at which replication stops, as in “less than 5 minutes of data loss.”
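To make the "allowable data loss" reading concrete, here is a minimal sketch in Python. The function names and timestamps are hypothetical illustrations, not part of any replication product; the sketch assumes we know the time of the last commit that reached the recovery site.

```python
from datetime import datetime, timedelta


def data_loss_window(last_replicated_commit: datetime,
                     failure_time: datetime) -> timedelta:
    """Commits made after the last replicated one are lost when production fails."""
    return max(failure_time - last_replicated_commit, timedelta(0))


def meets_rpo(last_replicated_commit: datetime,
              failure_time: datetime,
              rpo: timedelta) -> bool:
    """True if the data lost at failure_time stays within the RPO."""
    return data_loss_window(last_replicated_commit, failure_time) <= rpo


if __name__ == "__main__":
    last_ok = datetime(2013, 6, 12, 12, 0, 0)
    failure = last_ok + timedelta(minutes=3)
    print(meets_rpo(last_ok, failure, rpo=timedelta(minutes=5)))
```

With an RPO of "less than 5 minutes of data loss," a replication lag of 3 minutes at the moment of failure is acceptable, while a lag of 7 minutes is not.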

Recovery Time Objective

The second major business driver is the Recovery Time Objective (RTO). This is the amount of time it will take us to recover from a disaster. Depending on the context, this may refer only to the technical steps required to bring up services on the recovery system. Usually, however, it refers to the amount of time that the service will be unavailable, including time to discover that an outage has occurred, the time required to decide to fail over, the time to get staff in place to perform the recovery, and then the amount of time to bring up services at the recovery site.
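The broader reading of RTO is a sum of phases, which is easy to overlook when only the technical steps are timed. The sketch below is illustrative only; the phase names and the durations in the usage example are assumptions, not measurements.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class OutagePhases:
    detect: timedelta     # time to discover that an outage has occurred
    decide: timedelta     # time to decide to fail over
    mobilize: timedelta   # time to get staff in place
    restore: timedelta    # technical steps to bring up services

    def total(self) -> timedelta:
        """Total time the service is unavailable."""
        return self.detect + self.decide + self.mobilize + self.restore

    def meets_rto(self, rto: timedelta) -> bool:
        return self.total() <= rto


if __name__ == "__main__":
    phases = OutagePhases(
        detect=timedelta(minutes=15),
        decide=timedelta(minutes=30),
        mobilize=timedelta(minutes=60),
        restore=timedelta(minutes=45),
    )
    print(phases.total(), phases.meets_rto(timedelta(hours=2)))
```

In this made-up example the technical restore fits comfortably inside a two-hour RTO, but the full outage does not, because detection, decision, and staffing consume most of the budget.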

The costs associated with different RPO and RTO values will be determined by the type of application and its business purpose. Some applications may be able to tolerate unplanned outages of up to days without incurring substantial costs. Other applications may cause significant business-side problems with even minor amounts of unscheduled downtime.

Different applications and environments have different tolerances for RPO and RTO. Some applications might be able to tolerate a potential data loss of days or even weeks; some may not be able to tolerate any data loss at all. Some applications can remain unavailable long enough for us to purchase a new system and restore from tape; some cannot.

Recovery Strategies

There are several different strategies for recovering an application. Choosing a strategy will almost always involve an investment in hardware, software, and implementation time. If a strategy is chosen that does not support the business RPO and RTO requirements, an expensive re-tooling may be necessary.

Many types of replication solutions can be implemented at a server, disk storage, or storage network level. Each has unique advantages and disadvantages. Server replication tends to be cheapest, but also involves using server cycles to manage the replication. Storage network replication is extremely flexible, but can be more difficult to configure. Disk storage replication tends to be rock solid, but is usually limited in terms of supported hardware for the replication target.

Regardless of where we choose to implement our data replication solution, we will still face many of the same issues. One issue that needs to be addressed is re-silvering of a replication solution that has been partitioned for some amount of time. Ideally, only the changed sections of the disks will need to be re-replicated. Some less sophisticated solutions require a re-silvering of the entire storage area, which can take a long time and soak up a lot of bandwidth. Re-silvering is an issue that needs to be investigated during the product evaluation.
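One common way to re-replicate only the changed sections is to keep a log of which regions were written while the link was down. The following is a simplified sketch of that idea, assuming a hypothetical `DirtyRegionLog` class and fixed-size regions; real products differ in how they track and persist this information.

```python
class DirtyRegionLog:
    """Tracks which fixed-size regions changed while replication was partitioned."""

    def __init__(self, volume_blocks: int, region_blocks: int) -> None:
        if volume_blocks <= 0 or region_blocks <= 0:
            raise ValueError("sizes must be positive")
        self.volume_blocks = volume_blocks
        self.region_blocks = region_blocks
        self.dirty = set()

    def record_write(self, block: int) -> None:
        """Mark the region containing this block as needing re-replication."""
        if not 0 <= block < self.volume_blocks:
            raise ValueError("block outside volume")
        self.dirty.add(block // self.region_blocks)

    def regions_to_resync(self) -> list:
        return sorted(self.dirty)

    def blocks_to_copy(self) -> int:
        """Blocks an incremental re-silver would send (last region may be short)."""
        total = 0
        for region in self.dirty:
            start = region * self.region_blocks
            total += min(self.region_blocks, self.volume_blocks - start)
        return total

    def full_resync_blocks(self) -> int:
        """Blocks a full re-silver would send, for comparison."""
        return self.volume_blocks


if __name__ == "__main__":
    log = DirtyRegionLog(volume_blocks=1000, region_blocks=100)
    for block in (5, 50, 950):
        log.record_write(block)
    print(log.regions_to_resync(), log.blocks_to_copy(), log.full_resync_blocks())
```

The region size is a trade-off: larger regions make the log smaller but cause more unchanged data to be copied during the re-silver.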

Continuity Planning

Continuity planning should be done during the initial architecture and design phases for each service. If the service is not designed to accommodate a natural recovery, it will be expensive and difficult to retrofit a recovery mechanism.

The type of recovery that is appropriate for each service will depend on the importance of the service and its tolerance for downtime.

There are five generally-recognized approaches to recovery architecture:

  • Server Replacement: Some services are run on standard server images with very little local customization. Such servers may most easily be recovered by replacing them with standard hardware and standard server images.
  • Backup and Restore: Where there is a fair amount of tolerance for downtime on a service, it may be acceptable to rely on hardware replacement combined with restores from backups.
  • Shared Nothing Failover: Some services are largely data-independent and do not require frequent data replication. In such cases, it might make sense to have an appropriately configured replacement at a recovery site. (One example may be an application server that pulls its data from a database. Aside from copying configuration changes, replication of the main server may not be necessary.)
  • Replication and Failover: Several different replication technologies exist, each with different strengths and weaknesses. Array-based, SAN-based, file system-based or file-based technologies allow replication of data on a targeted basis. Synchronous replication techniques prevent data loss at the cost of performance and geographic dispersion. Asynchronous replication techniques permit relatively small amounts of data loss in order to preserve performance or allow replication across large distances. Failover techniques range from nearly instantaneous automated solutions to administrator-invoked scripts to involved manual checklists.
  • Live Active-Active Stretch Clusters: Some services can be provided by active servers in multiple locations, where failover is handled by client configuration. Some examples include DNS services (failover by resolv.conf lists), SMTP gateway servers (failover by MX record), web servers (failover by DNS load balancing), and some market data services (failover by client configuration). Such services should almost never be down. (Stretch clusters are clusters where the members are located at geographically dispersed locations.)

Which of these recovery approaches is appropriate to a given situation will depend on the cost of downtime on the service, as well as the particular characteristics of the service's architecture.
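For the active-active case, client-side failover usually amounts to trying an ordered list of endpoints until one answers, much as a resolver works through its resolv.conf entries. Here is a minimal sketch of that pattern; the `first_available` helper and the host names are hypothetical, and the actual network call is passed in as a function so the logic can be exercised without a network.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_available(endpoints: Sequence[str],
                    request: Callable[[str], T]) -> T:
    """Try each endpoint in order and return the first successful response."""
    errors = []
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except OSError as exc:  # covers refused connections, timeouts, etc.
            errors.append(f"{endpoint}: {exc}")
    raise ConnectionError("all endpoints failed: " + "; ".join(errors))


if __name__ == "__main__":
    def fake_request(endpoint: str) -> str:
        # Stand-in for a real query; pretend the primary site is down.
        if endpoint == "ns1.example.com":
            raise ConnectionRefusedError("primary site unreachable")
        return f"answer from {endpoint}"

    print(first_available(["ns1.example.com", "ns2.example.com"], fake_request))
```

Because the failover decision lives in the client, no administrator action is needed when one site goes away, which is why such services should almost never be down.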

Causes of Recovery Failure

Janco released a study outlining the most frequent causes of a recovery failure:
  • Failure of the backup or replication solution. If a copy of the data is not available, we will not be able to recover.
  • Unidentified failure modes. The recovery plan does not cover a type of failure.
  • Failure to train staff in recovery procedure. If people don't know how to carry out the plan, the work is wasted.
  • Lack of a communication plan. How do you communicate when your usual infrastructure is not available?
  • Insufficient backup power. Do you have enough capacity? How long will it run?
  • Failure to prioritize. What needs to be restored first? If you don't lay that out in advance, you will waste valuable time on recovering less critical services.
  • Unavailable disaster documentation. If your documentation is only available on the systems that have failed, you are stuck. Keep physical copies available in recovery locations.
  • Inadequate testing. Tests reveal weaknesses in the plan and also train staff to deal with a recovery situation in a timely way.
  • Unavailable passwords or access. If the recovery team does not have the permissions necessary to carry out the recovery, it will fail.
  • Plan is out of date. If the plan is not updated to reflect changes in the environment, the recovery will not succeed.
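The prioritization item in the list above lends itself to being written down in a machine-readable form ahead of time. As a rough sketch, assuming a hypothetical convention where tier 1 is the most critical and ties are broken by the tighter RTO, a restore order could be derived like this (the service names and values are invented for illustration):

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Service:
    name: str
    tier: int         # 1 = most critical (hypothetical convention)
    rto: timedelta    # agreed Recovery Time Objective


def restore_order(services):
    """Most critical tier first; within a tier, tightest RTO first."""
    return sorted(services, key=lambda s: (s.tier, s.rto, s.name))


if __name__ == "__main__":
    inventory = [
        Service("intranet-wiki", tier=3, rto=timedelta(days=2)),
        Service("order-database", tier=1, rto=timedelta(hours=1)),
        Service("email-gateway", tier=2, rto=timedelta(hours=4)),
        Service("dns", tier=1, rto=timedelta(minutes=15)),
    ]
    for service in restore_order(inventory):
        print(service.tier, service.name, service.rto)
```

Keeping such an inventory under version control alongside the recovery plan also helps with the "plan is out of date" failure mode, since changes to the environment show up as changes to the list.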

Recovery Business Practices

Janco also suggested several key business practices to improve the likelihood of a successful recovery:
  • Eliminate single points of failure.
  • Regularly update staff contact information, including assigned responsibilities.
  • Stay abreast of current events, such as weather and other emergency situations.
  • Plan for the worst case.
  • Document your plans and keep updated copies available in well-known, available locations.
  • Script what you can, and test your scripts.
  • Define priorities and thresholds.
  • Perform regular tests and make sure you can meet your RTO and RPO requirements.

Tuesday, June 11, 2013

Insourcing Picks Up Steam

I recently read an interesting report on insourcing by Pace Harmon. We have previously discussed some of the elements that should go into a decision on whether to outsource or offshore. Some major companies such as GM are re-insourcing operations that had previously been outsourced offshore.

Outsourcing typically works best with commodity IT activities. If complex activities are outsourced over the long run, an organization runs the risk of losing the insight and expertise needed to leverage new opportunities as the technology landscape evolves.

Pace Harmon reports on several factors that are driving the insourcing trend:

  • Wage inflation in India and other prime offshoring locations has led to an erosion in the wage differential between onshore and offshore talent.
  • Management costs associated with maintaining an offshore or outsourced relationship. Tracking and resolving quality issues can be especially expensive.
  • Lack of provider agility and flexibility. When you purchase an offering from another company, you are limited to either purchasing their standard offering or paying a premium for premium service.

Organizations that are considering insourcing need to keep several factors in mind:

  • Make sure that you have accounted for all of the costs of the re-insourcing operation. This will include direct costs, such as staffing costs and termination penalties, as well as indirect costs such as those associated with reduced stability during the migration.
  • Does your outsourcing contract specify that the vendor is required to provide you assistance with the insourcing, including training and process documentation? If not, find out what it will cost to get that assistance from your vendor.
  • Is your organization up to handling the level of complexity that your environment demands?
  • Will you be able to attract and retain the right staff? You may be able to re-badge some of your vendor's staff, but would you be able to retain them?
  • Are your organization's processes mature enough to be able to manage your team's technical responsibilities properly? If your organization does not have the maturity to collect requirements and track progress properly, you may not be ready for this transition.