Article

On the road to recovery: restoring data after disasters

Authors:
Kimberly Keeton, Dirk Beyer, Ernesto Brau, Arif Merchant, Cipriano Santos, Alex Zhang
Hewlett-Packard Labs, Palo Alto, CA

EuroSys '06: Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006, April 2006, Pages 235–248
https://doi.org/10.1145/1217935.1217958

Published: 18 April 2006

ABSTRACT

Restoring data operations after a disaster is a daunting task: how should recovery be performed to minimize data loss and application downtime? Administrators are under considerable pressure to recover quickly, so they lack time to make good scheduling decisions. They schedule recovery based on rules of thumb, or on pre-determined orders that might not be best for the failure occurrence. With multiple workloads and recovery techniques, the number of possibilities is large, so the decision process is not trivial.

This paper makes several contributions to the area of data recovery scheduling. First, we formalize the description of potential recovery processes by defining recovery graphs. Recovery graphs explicitly capture alternative approaches for recovering workloads, including their recovery tasks, operational states, timing information and precedence relationships. Second, we formulate the data recovery scheduling problem as an optimization problem, where the goal is to find the schedule that minimizes the financial penalties due to downtime, data loss and vulnerability to subsequent failures. Third, we present several methods for finding optimal or near-optimal solutions, including priority-based, randomized and genetic algorithm-guided ad hoc heuristics. We quantitatively evaluate these methods using realistic storage system designs and workloads, and compare the quality of the algorithms' solutions to optimal solutions provided by a math programming formulation and to the solutions from a simple heuristic that emulates the choices made by human administrators. We find that our heuristics' solutions improve on the administrator heuristic's solutions, often approaching or achieving optimality.
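
To make the scheduling problem concrete, the following is a minimal illustrative sketch, not the paper's formulation. It assumes a single shared recovery resource, so workloads are restored one at a time, and it charges only a downtime penalty; data loss, vulnerability penalties, alternative recovery paths and precedence constraints are all ignored. The names `Workload`, `total_penalty`, `priority_schedule` and `optimal_schedule`, and every number in the example, are hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Workload:
    name: str
    restore_hours: float      # assumed time needed to restore this workload
    penalty_per_hour: float   # assumed downtime penalty rate


def total_penalty(order):
    """Sum of downtime penalties when workloads are restored one at a time."""
    clock = 0.0
    penalty = 0.0
    for w in order:
        clock += w.restore_hours              # w stays down until its restore finishes
        penalty += clock * w.penalty_per_hour
    return penalty


def priority_schedule(workloads):
    """Priority heuristic: highest penalty rate per restore hour goes first."""
    return sorted(
        workloads,
        key=lambda w: w.penalty_per_hour / w.restore_hours,
        reverse=True,
    )


def optimal_schedule(workloads):
    """Exhaustive search over all orders; only feasible for a few workloads."""
    return min(permutations(workloads), key=total_penalty)


# A pre-determined order that ignores penalty rates (illustrative values only).
workloads = [
    Workload("web-cache", restore_hours=1.0, penalty_per_hour=50.0),
    Workload("analytics", restore_hours=2.0, penalty_per_hour=100.0),
    Workload("orders-db", restore_hours=4.0, penalty_per_hour=500.0),
]

fixed_order = total_penalty(workloads)
heuristic = total_penalty(priority_schedule(workloads))
best = total_penalty(optimal_schedule(workloads))
print(fixed_order, heuristic, best)
```

In this simplified single-resource model the ratio rule happens to be optimal, so the heuristic matches the exhaustive search. Once alternative recovery techniques, shared devices and precedence relationships enter the picture, as the abstract describes, a simple priority rule no longer guarantees optimality, which is the motivation for comparing heuristics against a math programming solution.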

Published in
EuroSys '06: Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
April 2006, 420 pages
ISBN: 1595933220
DOI: 10.1145/1217935
Conference Chair: Yolande Berbers, K. U. Leuven, Belgium
Program Chair: Willy Zwaenepoel, EPFL

ACM SIGOPS Operating Systems Review, Volume 40, Issue 4
Proceedings of the 2006 EuroSys conference
October 2006, 383 pages
ISSN: 0163-5980
DOI: 10.1145/1218063

Copyright © 2006 Authors
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
backup/restore
data storage
disaster recovery
genetic algorithms
management
math programming
optimization
scheduling
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 241 of 1,308 submissions, 18%