ABSTRACT
Evolutionary Algorithms (EAs), and Genetic Programming (GP) in particular, are frequently employed to solve difficult real-life problems, which can require days or even months of computation. One way to reduce the time to solution is to use parallel computing on distributed platforms. Such platforms are prone to failures, and when they are large and/or low-cost, failures are expected events rather than catastrophic exceptions, so fault tolerance and recovery techniques often become necessary. It turns out that Parallel GP (PGP) applications have an inherent ability to tolerate failures. This ability is quantified for two well-known GP problems via simulation experiments driven by failure traces from real-world distributed platforms, namely desktop grids (DGs). A simple technique is then proposed by which PGP applications can better tolerate the different, and often high, failure rates seen on different platforms.
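As a rough illustration of trace-driven failure simulation, the following minimal sketch runs a toy population-based search in which whole sub-populations are lost according to a failure trace, with no recovery. It is not the simulator used in the paper: the island layout, the bit-counting fitness (used here in place of a GP problem), the function names `evolve_island` and `run`, and the trace format (generation number mapped to a list of failed island ids) are all assumptions made for the example.

```python
import random


def evolve_island(pop, rng, genome_len):
    """One generation: binary tournament selection plus a single-bit mutation."""
    new_pop = []
    for _ in range(len(pop)):
        a, b = rng.choice(pop), rng.choice(pop)
        parent = a if sum(a) >= sum(b) else b
        child = parent[:]
        i = rng.randrange(genome_len)
        child[i] = 1 - child[i]
        # Keep the child only if it is at least as fit as its parent.
        new_pop.append(child if sum(child) >= sum(parent) else parent)
    return new_pop


def run(failure_trace, n_islands=4, island_size=10, genome_len=32,
        generations=40, seed=0):
    """Run the toy search; failure_trace maps generation -> failed island ids.

    Returns (best fitness among surviving individuals, surviving island count).
    """
    rng = random.Random(seed)
    islands = {
        k: [[rng.randint(0, 1) for _ in range(genome_len)]
            for _ in range(island_size)]
        for k in range(n_islands)
    }
    for gen in range(generations):
        # A failed node takes its individuals with it (assumed: no recovery).
        for k in failure_trace.get(gen, []):
            islands.pop(k, None)
        for k in list(islands):
            islands[k] = evolve_island(islands[k], rng, genome_len)
    survivors = [ind for pop in islands.values() for ind in pop]
    best = max((sum(ind) for ind in survivors), default=0)
    return best, len(islands)


if __name__ == "__main__":
    print("no failures:  ", run({}))
    print("with failures:", run({5: [0], 10: [1, 2]}))
```

Comparing the best fitness reached with and without the trace gives a crude measure of how much solution quality degrades when part of the population is lost, which is the kind of question the simulation experiments address.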