Towards a verifiable real-time, autonomic, fault mitigation framework for large scale real-time systems

  • ORIGINAL PAPER
Innovations in Systems and Software Engineering

Abstract

Designing autonomic fault responses is difficult, particularly in large-scale systems, as there is no single ‘perfect’ fault mitigation response to a given failure. The design of appropriate mitigation actions depends upon the goals and state of the application and environment. Strict time deadlines in real-time systems further exacerbate this problem. Any autonomic behavior in such systems must not only be functionally correct but must also conform to properties of liveness, safety and bounded-time responsiveness. This paper details a real-time fault-tolerant framework that uses a reflex and healing architecture to provide fault mitigation capabilities for large-scale real-time systems. At the heart of this architecture is a real-time reflex engine, whose state-based failure management logic can respond to both event- and time-based triggers. We also present a semantic domain for verifying properties of systems that use this framework of real-time reflex engines. Lastly, we present a case study that examines the details of this approach.
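
To make the idea of state-based failure management with event- and time-based triggers concrete, the following is a minimal Python sketch. It is not the paper's reflex engine: the class name, state names, action names and the 2.0-second bound are all hypothetical, chosen only to show one state machine that reacts both to an incoming event and to the passage of time.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReflexEngine:
    """Toy reflex engine; all names and the time bound are illustrative assumptions."""
    state: str = "NOMINAL"
    deadline: Optional[float] = None          # absolute time at which a timeout fires
    actions: List[str] = field(default_factory=list)

    def on_event(self, event: str, now: float) -> None:
        # Event-based trigger: a fault report starts a time-bounded mitigation.
        if self.state == "NOMINAL" and event == "fault":
            self.state = "MITIGATING"
            self.deadline = now + 2.0         # assumed bound, not from the paper
            self.actions.append("restart_task")
        elif self.state == "MITIGATING" and event == "recovered":
            self.state = "NOMINAL"
            self.deadline = None

    def on_tick(self, now: float) -> None:
        # Time-based trigger: escalate if mitigation has not finished by the deadline.
        if (self.state == "MITIGATING"
                and self.deadline is not None
                and now >= self.deadline):
            self.state = "ESCALATED"
            self.deadline = None
            self.actions.append("notify_manager")
```

In this sketch a `fault` event moves the engine from `NOMINAL` to `MITIGATING`, and a later call to `on_tick` past the deadline moves it to `ESCALATED`. Keeping the deadline as explicit state is one simple way to make the bounded-time behavior inspectable, which is the kind of property the abstract says the framework aims to verify.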

Author information

Corresponding author

Correspondence to Abhishek Dubey.

About this article

Cite this article

Dubey, A., Nordstrom, S., Keskinpala, T. et al. Towards a verifiable real-time, autonomic, fault mitigation framework for large scale real-time systems. Innovations Syst Softw Eng 3, 33–52 (2007). https://doi.org/10.1007/s11334-006-0015-7

  • DOI: https://doi.org/10.1007/s11334-006-0015-7
