Tuesday, July 22, 2008

The Seven Tiers of DR


The Seven Tiers concept of DR was originally put forward by SHARE (http://www.share.org/) to help DR specialists define the various methods of recovering mission-critical IT systems. From the perspective of RTO (Recovery Time Objective) and RPO (Recovery Point Objective), the tiers offer a set of service levels that trade off risk against cost.
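To make the two terms concrete, here is a minimal sketch, assuming RPO bounds the data loss window (time between the last usable copy and the failure) and RTO bounds the downtime (time between the failure and service being restored). The function names and the incident timestamps are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta

def recovery_metrics(last_good_copy, failure, service_restored):
    """Return (data_loss_window, downtime) for a single incident."""
    if not (last_good_copy <= failure <= service_restored):
        raise ValueError("timestamps must be in chronological order")
    data_loss_window = failure - last_good_copy   # compared against the RPO
    downtime = service_restored - failure         # compared against the RTO
    return data_loss_window, downtime

def meets_objectives(data_loss_window, downtime, rpo, rto):
    """True when the incident stayed within both objectives."""
    return data_loss_window <= rpo and downtime <= rto

# Hypothetical incident: nightly backup at 23:00, failure the next afternoon,
# service restored the following morning.
last_backup = datetime(2008, 7, 21, 23, 0)
failure = datetime(2008, 7, 22, 15, 30)
restored = datetime(2008, 7, 23, 9, 30)

data_loss, downtime = recovery_metrics(last_backup, failure, restored)
print("data loss window:", data_loss)
print("downtime:", downtime)
print("meets RPO=24h, RTO=24h:",
      meets_objectives(data_loss, downtime,
                       timedelta(hours=24), timedelta(hours=24)))
```

With a daily backup this made-up incident fits a 24-hour RPO, but it would miss an RPO measured in hours or minutes, which is what the higher tiers are meant to address.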


What is a disaster? Anyone who was in China during the first half of 2008 knows exactly what it means.

Tier 0 means no recovery at all when data loss occurs. There is no backup plan and no remote backup media such as tapes or disks. The only way to protect against data loss in Tier 0 is praying to God :-)

In Tier 1 you have an offsite backup plan, possibly transporting the tapes through what is called PTAM (Pickup Truck Access Method). When data loss occurs, you retrieve the backup tapes by car or any other means of transportation. A very old-fashioned way to recover :) What happens if there is a car accident or a horrible traffic jam during recovery? The RTO and RPO are not under your control.

But it is still an effective complement to data replication when bandwidth is limited or data de-duplication is not available. Think about transferring 2 TB of data at only 80 KB/s... Why wouldn't you just get in a car and deliver two LTO-4 tapes by hand?
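The arithmetic behind that example can be checked with a short sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 KB = 10^3 bytes) and a link that sustains the full 80 KB/s with no protocol overhead; the function name is just for illustration.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12  # decimal terabyte (assumption)
KB = 10**3   # decimal kilobyte (assumption)

def transfer_days(size_bytes, rate_bytes_per_s):
    """Days needed to move size_bytes over a link sustaining rate_bytes_per_s."""
    return size_bytes / rate_bytes_per_s / SECONDS_PER_DAY

days = transfer_days(2 * TB, 80 * KB)
print(f"2 TB at 80 KB/s takes about {days:.0f} days")  # about 289 days
```

At roughly 289 days on the wire, a car trip with two tapes wins easily, even counting the traffic jam.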

When we move to Tier 2, we are lucky enough to have a hot site in which to recover our facility, in addition to PTAM tape delivery. So, no more traffic jam concerns :) The RTO is under control, but the recovery point is limited to the last daily backup: we can only roll back to yesterday and lose everything we did today.

Time to solve the car accident problem. In Tier 3, we have an electronic link between the production site and the hot backup site, such as ATM or an IP network link. Mission-critical or private data can be transferred over TCP/IP rather than in a car that may be driven by a drunk driver. It was reported last year that CITI lost tapes containing millions of bank account records. That loss is worth much more than the car :-)

Suppose you are running an SAP ECC 6 ERP system and an earthquake occurs. How could you recover the business if all you have at the recovery site is backup data? Is your backup specialist also an SAP Basis expert? The recovery time will depend on how soon you can set up the same infrastructure at the hot site. That is why we introduce Tier 4, with an active secondary site running. Often we use host-based replication software to achieve asynchronous remote volume copy. Data loss should be limited to hours.

How about using disk-array-based snapshot/clone technology locally to provide a clone for remote data replication, for example TimeFinder with SRDF/A (the asynchronous SRDF mode)? In Tier 5, we integrate the data replication technology tightly with the application, for example Oracle Data Guard together with SRDF. Asynchronous redo log transfer between the production and backup sites helps us restart the business in minutes.

But there are still some industries that require zero data loss and zero downtime. This tough service level, Tier 6, demands that both the application and the data be replicated between two sites synchronously. Many customers adopt the three-datacenter methodology to reduce the performance impact on the production site when the recovery site is 300 km or more away. That is, we use FC over DWDM within the same city (usually 60-80 km) for synchronous replication between the production site and backup site 1.


Then we set up asynchronous replication between backup site 1 and backup site 2 (500 km away) over IP networks, with practically no physical distance limit. This design is widely used by EMC and HDS when implementing business continuity solutions.
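Why keep the synchronous leg inside the city? A rough sketch of the propagation delay helps. It assumes light travels through fiber at about 200,000 km/s (roughly two thirds of its speed in vacuum) and counts only one round trip per write; real links add switching and protocol overhead, and a write may need more than one round trip, so treat these numbers as a lower bound.

```python
FIBER_KM_PER_S = 200_000  # approximate speed of light in optical fiber (assumption)

def round_trip_ms(distance_km):
    """Minimum round-trip propagation delay in milliseconds over fiber."""
    return 2 * distance_km / FIBER_KM_PER_S * 1000

# Distances taken from the post: metro DWDM link, then the long-haul cases.
for km in (80, 300, 500):
    print(f"{km:>3} km: at least {round_trip_ms(km):.1f} ms added per synchronous write")
```

Under these assumptions an 80 km link adds under a millisecond per write, while 300 km or 500 km adds several milliseconds to every write the production site has to wait for. That is why the long leg is replicated asynchronously.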
