Testing, Checking, and Hardware Syndrome

Software Design for Resilient Computer Systems

Abstract

In previous chapters we introduced the processes of checking and testing, the first of the three main processes of the generalized algorithm of fault tolerance (GAFT). In this chapter we discuss the process of checking hardware in more detail: first software-based hardware checking, then hardware-based checking. For software-based hardware checking, we show what a software-based test should include, when such tests are preferable to hardware-based checking schemes, and especially how they can be scheduled in the system without interfering with ongoing real-time tasks. To support the handling of hardware-based checking, we introduce a new system condition descriptor, a so-called syndrome, and illustrate how it can be used as a mechanism for signaling the hardware condition to the operating system, including the manifestation of a detected error. We then show the steps the run-time system performs to eliminate the fault and, in the case of permanent errors, how the software can reconfigure the hardware to exclude the faulty element. We also explain in which cases software has to adapt to the new hardware topology. We start by explaining how software-based checks can be used to detect hardware faults. Run-time systems use online or off-line scheduling mechanisms to manage the tasks of both their own system software and user applications. Since Kirby et al. (Softw Pract Exp 15(1):87–103, 1985, [2]), Serlin (Comput C 7(8):19–30, 1984, [3]), Blazewicz et al. (Handbook on scheduling, from theory to applications, 2007, [4]), Ingo (Linux kernel archive, 2002, [8]), it has been expected that the run-time system provides a special session of task scheduling (off-line, or online during execution) for diagnosing hardware conditions; recall the start-up delays of Apple and Microsoft systems.
Later, for systems that operate in the domain of real-time monitoring, the scheduling of time-critical tasks became crucial, especially with respect to hardware availability and the efficiency of process scheduling. In turn, testing itself becomes "hot" in terms of the time it requires and the hardware coverage it must achieve. Thus in this chapter we first analyze simple sequences of testing the hardware elements of computer systems. We then introduce the concept of a hardware testing procedure that is transparent to user applications. This makes it possible to verify the integrity of the computer system hardware, and to guarantee it within a reasonable time, without delaying the execution of user tasks.
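The idea of testing hardware without delaying user tasks can be illustrated with a minimal sketch. It assumes that the idle gaps left by the real-time schedule are already known and that a full hardware test has been split into short slices; the names `TestSlice` and `place_test_slices`, the unit names, and the tick values are hypothetical and are not taken from the chapter. The sketch uses a simple in-order greedy placement, which is only one possible policy.

```python
from dataclasses import dataclass


@dataclass
class TestSlice:
    unit: str  # hardware element under test (hypothetical name)
    cost: int  # execution time of this slice, in ticks


def place_test_slices(idle_slots, slices):
    """Greedily place test slices into idle gaps, preserving slice order.

    idle_slots: list of (start_tick, length) gaps left free by real-time tasks.
    Returns (placements, unplaced): placements is a list of (start_tick, unit);
    unplaced holds the slices that did not fit into any gap.
    """
    placements = []
    pending = list(slices)
    for start, length in idle_slots:
        offset = 0
        # Fill the current gap with as many pending slices as fit.
        while pending and offset + pending[0].cost <= length:
            test = pending.pop(0)
            placements.append((start + offset, test.unit))
            offset += test.cost
    return placements, pending


# Assumed example: two idle gaps and three test slices.
idle_slots = [(3, 2), (8, 4)]
slices = [TestSlice("alu", 2), TestSlice("registers", 1), TestSlice("ram_bank0", 3)]
placements, unplaced = place_test_slices(idle_slots, slices)
print(placements)  # [(3, 'alu'), (8, 'registers'), (9, 'ram_bank0')]
print(unplaced)    # []
```

Because test slices occupy only ticks in which no real-time task is scheduled, the user tasks keep their original timing; slices that do not fit are returned so they can be retried in a later scheduling window.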

References

  1. Bogdanov J, Schagaev I (1990) Sliding slotting diagnosis in multiprocessors. In: IMECO congress proceedings, pp 141–150
  2. Kirby W et al (1985) The NMFECC Cray time-sharing system. Softw Pract Exp 15(1):87–103
  3. Serlin O (1984) Fault-tolerant systems in commercial applications. Comput C 7(8):19–30
  4. Blazewicz J et al (2007) Handbook on scheduling, from theory to applications. Springer, Berlin
  5. Garey M, Johnson D (1979) Computers and intractability: a guide to the theory of NP-completeness. W.H. Freeman and Company, New York
  6. Knuth D (1998) The art of computer programming, vol 3: sorting and searching. Addison-Wesley Longman, Amsterdam
  7. Johannes M (2002) The active object system-design and multiprocessor implementation. ETH Zurich, Zurich
  8. Ingo M (2002) Linux kernel archive. World Wide Web electronic publication
  9. Liu CL, Layland J (1973) Scheduling algorithms for multiprogramming in a hard-real-time environment. J ACM 20(1):46–61
  10. Castano V, Schagaev I (2014) Resilient computer system design. Springer, Berlin. ISBN 978-3-319-15069-7
  11. Blaeser L, Monkman S, Schagaev I (2014) Evolving systems worldcomp2014. In: Proceedings of the international conference on foundations of computer science FCS’14. CSREA Press. ISBN 1-60132-270-4
  12. Monkman S, Schagaev I (2013) Redundancy + Reconfigurability = Recoverability. Electronics 2:212–233. doi:10.3390/electronics2030212, ISSN 2079-9292
  13. Buhanova G, Schagaev I (2001) Comparative study of fault tolerant RAM structures. In: Proceedings of the IEEE dependable systems and networks conference, Göteborg. https://www.academia.edu/7140850/. Accessed 10 July 2001

Author information

Corresponding author

Correspondence to Igor Schagaev.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Schagaev, I., Kaegi-Trachsel, T. (2016). Testing, Checking, and Hardware Syndrome. In: Software Design for Resilient Computer Systems. Springer, Cham. https://doi.org/10.1007/978-3-319-29465-0_7

  • DOI: https://doi.org/10.1007/978-3-319-29465-0_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-29463-6

  • Online ISBN: 978-3-319-29465-0

  • eBook Packages: Engineering, Engineering (R0)
