Abstract
Real-time systems are susceptible to adversarial factors such as faults and attacks, which can lead to severe consequences. This paper presents an optimal checkpointing scheme that bolsters fault resilience in real-time systems while addressing both logical consistency and timing correctness. First, we partition message-passing processes into a directed acyclic graph (DAG) based on their dependencies, which ensures that checkpoints are logically consistent. Then, we identify the DAG’s critical path, that is, its longest sequential path, and analyze the optimal checkpointing strategy along this path to minimize overall execution time, including checkpointing overhead. Upon fault detection, the system rolls back to the nearest valid checkpoints for recovery. Our algorithm derives the optimal checkpoint count and intervals, and we evaluate its performance through extensive simulations and a case study. Results show a 99.97% and 67.86% reduction in execution time compared to checkpoint-free systems in simulations and the case study, respectively. Moreover, our proposed strategy outperforms prior work and baseline methods, increasing deadline achievement rates by 31.41% and 2.92% for small-scale tasks and 78.53% and 4.15% for large-scale tasks.
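To make the critical-path step concrete, the following is a minimal sketch of finding the longest path in a dependency DAG, assuming each process has a known execution time. It is not the paper's algorithm: the function name `critical_path`, the `durations` and `edges` inputs, and the example values are all hypothetical, and checkpointing overhead is not modeled.

```python
from collections import defaultdict, deque


def critical_path(durations, edges):
    """Return (length, path) for the longest path in a DAG.

    durations: dict mapping each node to its execution time (assumed input).
    edges: iterable of (u, v) pairs meaning v depends on u (assumed input).
    """
    successors = defaultdict(list)
    indegree = {node: 0 for node in durations}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    # Earliest finish time of each node, refined in topological order.
    finish = dict(durations)
    predecessor = {node: None for node in durations}

    # Kahn's algorithm: process a node only after all its dependencies.
    queue = deque(node for node in durations if indegree[node] == 0)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in successors[u]:
            candidate = finish[u] + durations[v]
            if candidate > finish[v]:
                finish[v] = candidate
                predecessor[v] = u
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)

    if visited != len(durations):
        raise ValueError("dependency graph contains a cycle")

    # Walk back from the node that finishes last to recover the path.
    node = max(finish, key=finish.get)
    length = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = predecessor[node]
    return length, path[::-1]


# Hypothetical example: four processes with a diamond-shaped dependency.
example_durations = {"A": 3, "B": 2, "C": 4, "D": 1}
example_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
length, path = critical_path(example_durations, example_edges)
```

In this example the path through `C` dominates the one through `B`, so a checkpoint placement analysis of the kind the abstract describes would focus on `A`, `C`, and `D`.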
Index Terms
- Optimal Checkpointing Strategy for Real-time Systems with Both Logical and Timing Correctness