
Optimal Checkpointing Strategy for Real-time Systems with Both Logical and Timing Correctness

Published: 24 July 2023

Abstract

Real-time systems are susceptible to adversarial factors such as faults and attacks, which can lead to severe consequences. This paper presents an optimal checkpoint scheme to bolster fault resilience in real-time systems, addressing both logical consistency and timing correctness. First, we partition message-passing processes into a directed acyclic graph (DAG) based on their dependencies, ensuring checkpoint logical consistency. Then, we identify the DAG’s critical path, which is its longest sequential path, and analyze the optimal checkpoint strategy along this path to minimize overall execution time, including checkpointing overhead. Upon fault detection, the system rolls back to the nearest valid checkpoints for recovery. Our algorithm derives the optimal checkpoint count and intervals, and we evaluate its performance through extensive simulations and a case study. Results show reductions in execution time of 99.97% in simulations and 67.86% in the case study, compared to checkpoint-free systems. Moreover, our proposed strategy outperforms prior work and baseline methods, increasing deadline achievement rates by 31.41% and 2.92% for small-scale tasks and 78.53% and 4.15% for large-scale tasks.
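
To make the critical-path step concrete, here is a minimal sketch of finding the longest sequential path in a dependency DAG. It uses a standard topological-order relaxation (Kahn's algorithm) and is not taken from the paper. The function name `critical_path`, the `durations` and `edges` inputs, and the use of per-node execution times as weights are all assumptions made for illustration.

```python
from collections import deque


def critical_path(durations, edges):
    """Return (length, path) for the longest-duration path in a DAG.

    durations: dict mapping node -> execution time (hypothetical units)
    edges: iterable of (u, v) pairs meaning u must finish before v starts
    Both inputs are illustrative assumptions, not the paper's data model.
    """
    succ = {n: [] for n in durations}
    indeg = {n: 0 for n in durations}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    # best[n] is the largest total duration of any path ending at n.
    best = dict(durations)
    pred = {n: None for n in durations}

    # Kahn's algorithm: visit nodes only after all their predecessors.
    queue = deque(n for n in durations if indeg[n] == 0)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in succ[u]:
            if best[u] + durations[v] > best[v]:
                best[v] = best[u] + durations[v]
                pred[v] = u
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)

    if visited != len(durations):
        raise ValueError("dependency graph contains a cycle")

    # Walk predecessor links back from the node with the largest total.
    node = max(best, key=best.get)
    length = best[node]
    path = []
    while node is not None:
        path.append(node)
        node = pred[node]
    path.reverse()
    return length, path


if __name__ == "__main__":
    # Hypothetical four-process example: a -> b -> d and a -> c -> d.
    durations = {"a": 2, "b": 3, "c": 1, "d": 4}
    edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
    print(critical_path(durations, edges))
```

In this hypothetical example the path through `b` dominates the path through `c`, so a checkpoint analysis restricted to the critical path would consider only `a`, `b`, and `d`. How checkpoint counts and intervals are then chosen along that path is the paper's contribution and is not reproduced here.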

Published in

ACM Transactions on Embedded Computing Systems, Volume 22, Issue 4
July 2023
551 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3610418
Editor: Tulika Mitra

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 16 November 2022
• Revised: 24 April 2023
• Accepted: 12 May 2023
• Online AM: 1 June 2023
• Published: 24 July 2023