Synthesizing and Optimizing FDIR Recovery Strategies From Fault Trees

Redundancy concepts are an integral part of the design of space systems. Deciding when to activate which redundancy and which component should be replaced can be a difficult task. In this paper, we refine a methodology where recovery strategies are synthesized from a model of non-deterministic dynamic fault trees. The synthesis is performed by transforming non-deterministic dynamic fault trees into Markov Automata. From the optimized scheduler, an optimal recovery strategy can then be derived and represented by a model we call Recovery Automaton. We discuss techniques on how this Recovery Automaton can be further optimized to contain fewer states and transitions and show the effectiveness of our approach on two case studies.


Introduction
Reliability engineering is an important discipline in the design of any safety critical system, in particular in the domain of aerospace systems and spacecraft. No matter how well designed a system is, it still has to deal with the presence of faults to some extent. Faults in this context can be events such as equipment failure, wrong sensor readings, external interferences and many more. To raise trust in handling system failures, reliability engineering tries to embed Failure Detection, Isolation and Recovery (FDIR) concepts. These concepts are derived using various tools and methodologies such as Fault Tree Analysis (FTA) [9].
FTA is a methodology commonly used in the industry for performing state-of-the-art failure analysis [13]. The resulting Fault Trees (FT) describe how faults propagate through components and subsystems of a system and eventually lead to a top-level system failure. Graphical representations of these trees are intuitive and easy to understand. On the one hand, FTs can be used to analyze the system qualitatively in terms of fault combinations that lead to system failure. On the other hand, they also enable quantitative analysis of important computable measures such as reliability. Dynamic Fault Trees (DFT) are an extension introducing temporal dependencies and new features to analyze redundancy concepts known as spare management. However, there are challenges arising from non-deterministic behavior of DFTs such as spare races. An example of such race behavior can be seen in a system of two operative memories together with a pool of two spare memories. If both operative memories fail at the same time, it is unclear which backup memory takes over the role of which operational one.
To overcome this shortcoming, a new methodology was presented in [11]. It introduces a model of Non-deterministic Dynamic Fault Trees (NdDFT) as an extension to DFTs. In contrast to the latter, the new NdDFT does not impose a fixed, rigid order on the spares to be used. As a next step, the methodology foresees transforming this NdDFT model into a Markov Automaton (MA), which is suitable for resolving the aforementioned non-deterministic decisions on spare activations. By optimizing the scheduling of the MA model in terms of reliability of the system, a recovery strategy for the NdDFT can be synthesized. This recovery strategy defines which spare has to be used in which failure state of the system and can therefore guarantee an optimal reliability at all times.
The goal of the present paper is to refine the methodology presented in [11] by further developing an automata model that formalizes the decision process underlying a recovery strategy, a so-called Recovery Automaton (RA). We give its formal definition and show how it can be minimized in order to obtain an efficient implementation of recovery strategies for FDIR. This paper is structured as follows. Section 2 of this paper summarizes the related work relevant to the topic of FTs, MA and synthesis of recovery strategies. Further background on the theory of FTs including their (non-deterministic) dynamic variants is given in Section 3. Section 4 describes the process of synthesizing recovery strategies from a given NdDFT as well as a model to represent such strategies, which is further optimized in Section 5. Section 6 then evaluates the technique on a use case example. Finally, the paper concludes in Section 7 and provides some outlook to future work.

Related Work
The goal of FDIR lies in keeping a system in a stable and operational state, even in the presence of faults. While some of the following steps may be omitted in some cases, performing FDIR generally means applying the following procedural approach [15]:
- Monitor the system to detect the occurrence of faults.
- Identify the fault and localize it within the system.
- Isolate the fault and prevent further propagation into other parts of the system.
- Perform recovery actions to reconfigure the system and return it into a stable state.
In order to derive how faults relate to each other and eventually lead to a system-wide failure, failure analysis techniques such as FTA can be employed. The most basic type of FT is the Static Fault Tree (SFT). SFTs employ Boolean algebra to combine different failure events by AND and OR operations, often graphically represented as gates, until they combine into the overall system failure. The failure events are usually related to faulty components of the system. Applying this methodology, statements such as "The system fails if component A and component B fail" can be modeled and refined to arbitrary levels of precision. The probability of the top-level failure after time t (reliability) can be computed from a given DFT, for example by transforming the DFT into a Continuous-Time Markov Chain (CTMC) [5].
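For the static AND/OR case, the top-level failure probability can be computed bottom-up without a Markov model, provided the basic events fail independently. The following Python sketch illustrates this under that independence assumption and with exponentially distributed failure times; the function names and rates are illustrative and not taken from the paper, and the sketch does not cover DFTs, which require the CTMC-based analysis of [5].

```python
import math

def be_failure_probability(rate: float, t: float) -> float:
    """P(basic event has failed by time t) for an exponential failure rate."""
    return 1.0 - math.exp(-rate * t)

def and_gate(probabilities):
    """All inputs have failed (inputs assumed independent)."""
    return math.prod(probabilities)

def or_gate(probabilities):
    """At least one input has failed (inputs assumed independent)."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)

# "The system fails if component A and component B fail", refined by a
# hypothetical single point of failure C: TLE = OR(AND(A, B), C).
t = 1.0
p_a = be_failure_probability(math.log(2), t)      # 0.5 by choice of rate
p_b = be_failure_probability(math.log(2), t)      # 0.5 by choice of rate
p_c = be_failure_probability(math.log(4 / 3), t)  # 0.25 by choice of rate
p_tle = or_gate([and_gate([p_a, p_b]), p_c])
```

With these rates, `p_tle` evaluates to 1 - (1 - 0.25)(1 - 0.25) = 0.4375.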
Markov Automata [6] are an extension to CTMCs. They are state-based transition systems with two types of transitions: They can contain continuous-time transitions (also called Markovian transitions) that are labeled with rates, that is, non-negative real values as well as immediate, non-deterministic transitions labeled by actions. In the latter case, transitions have to be chosen by a so-called scheduler. The computation of optimal schedulers for Markov Automata with respect to various quantitative objectives, such as state reachability, is discussed in [7].
Computing strategies for recovery purposes from a given fault model has been researched in other contexts. In [1], a similar approach is taken for repairable fault trees. The authors consider non-deterministic repair policies where the repair order is not fixed. Optimal repair policies are then computed by converting the repairable fault tree to a Markov decision process, a time-discrete version of Markov Automata. However, the authors do not consider DFT models. In [4], Dynamic Decision Networks (DDN) are employed and their inference capabilities are exploited to create autonomous on-board FDIR systems for spacecraft that can select reactive and preventive recovery actions during run-time. In [12], the authors propose creating the DDN from an extension of the DFT model. Timed Failure Propagation Graphs are used in [2] to synthesize FDIR components, namely monitors for the purpose of fault detection and recovery plans for every specified combination of fault and mode. Here, the recovery components are created using a planning based approach on predefined actions.

Fault Trees
FTs are graphs consisting of two types of nodes respectively representing events and gates. The root node, or top level event (TLE), usually represents the event of a system failure whereas the leaves of the tree model the event of individual components failing. The leaves are also called basic events (BE). They correspond to a Boolean variable where false represents the initial state of no failure. The variable is considered true in case of a failure event. We consider here only the case of permanent failure, i.e. once a BE has failed, it remains in a failed state for all future points in time. The branches of the trees are represented by the gates performing operations on the events. FTs are directed acyclic graphs starting from the BEs pointing over the gates towards the system failure event.

Static Fault Trees
Fig. 1 shows the gates and events used in the SFT notation. SFTs use Boolean operations represented by AND and OR gates. There also exist other gates such as the k-VOTE gate, which propagates if at least k inputs have failed. Observe that a 1-VOTE gate corresponds to an OR gate and a k-VOTE gate with k inputs to an AND gate. Implementation-wise, all gates can therefore be considered as k-VOTE gates for some appropriate k. Some other extensions also introduce a NOT gate. However, this allows the construction of fault trees where the TLE can change from having failed to working again as new failures occur. These fault trees are known as non-coherent fault trees and have been dismissed as being a sign of modeling errors [14].
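The observation that every static gate is a k-VOTE gate can be made concrete with a small evaluator. The Python sketch below is illustrative only; the class and function names are assumptions of this example and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Set, Union

@dataclass
class BasicEvent:
    name: str

@dataclass
class VoteGate:
    k: int                 # threshold: propagate once k inputs have failed
    inputs: List["Node"]

Node = Union[BasicEvent, VoteGate]

def or_gate(*inputs: Node) -> VoteGate:
    """An OR gate is a 1-VOTE gate."""
    return VoteGate(1, list(inputs))

def and_gate(*inputs: Node) -> VoteGate:
    """An AND gate is a k-VOTE gate with exactly k inputs."""
    return VoteGate(len(inputs), list(inputs))

def has_failed(node: Node, failed_events: Set[str]) -> bool:
    """Evaluate a static fault tree for a given set of failed basic events."""
    if isinstance(node, BasicEvent):
        return node.name in failed_events
    failed_inputs = sum(has_failed(child, failed_events) for child in node.inputs)
    return failed_inputs >= node.k

a, b, c = BasicEvent("A"), BasicEvent("B"), BasicEvent("C")
tle = or_gate(and_gate(a, b), c)        # fails if (A and B) or C
two_of_three = VoteGate(2, [a, b, c])   # fails if any two inputs fail
```

For example, `has_failed(tle, {"A"})` is false, while `has_failed(tle, {"A", "B"})` and `has_failed(two_of_three, {"A", "C"})` are true.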

Dynamic Fault Trees
Many extensions have been proposed to the formalism of FTs [13] to increase its expressiveness and enhance its features. A particular extension is the notion of Dynamic Fault Trees (DFT). It introduces temporal understanding and new features to analyze redundancy concepts known as spare management. In DFTs, a node can be either failed, active (operational) or dormant (operational). A node that is an unactivated spare is dormant. All other nodes are activated. Together with this state, failure rates for failing actively and failing dormantly can be defined for every BE.  The SPARE gate is connected to a primary event and a set of spare events. It propagates a failure if the primary input failed and all spares are either claimed or failed themselves. The spare events can be shared with another SPARE gate, therefore a spare can be claimed by either the one or the other SPARE gate. But there may be no shared elements between the primary input and any spare. The order in which such a spare is chosen is deterministic and defined at design time by the reliability engineer.
The FDEP (functional dependency) has a trigger event on the left hand side and any number of dependent events functionally dependent on the triggering event. When the trigger event occurs, the dependent events are set to fail as well. The output of an FDEP gate only indicates to which tree it belongs and has no further semantical meaning.
In the following, we give an example to illustrate the DFT notation. Fig. 3 shows a system consisting of two memory components which are covered by two spare memories for failures. The two spares are shared among the two SPARE gates. According to DFT semantics, Memory3 will be used before Memory4 in case of a failure of Memory1 or Memory2. In addition, the system has two hot redundant, always active power sources, Power1 and Power2. Both primaries Memory1 and Memory2 are powered by Power1 and the redundancies Memory3 and Memory4 are powered by the second power source Power2. Using FDEPs, the failure of a power source is propagated to the respective memory components. In the figure, FDEP dependent events are marked by an arrow and dashed lines indicate the parent of an FDEP.
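The fixed claiming order of the DFT semantics amounts to picking the first spare in design-time order that is neither failed nor already claimed. The following Python sketch mirrors the memory example; the function name and the flat set-based representation are assumptions made for illustration.

```python
from typing import Iterable, Optional, Set

def claim_next_spare(spare_order: Iterable[str],
                     failed: Set[str],
                     claimed: Set[str]) -> Optional[str]:
    """Return the first usable spare in the fixed design-time order, if any."""
    for spare in spare_order:
        if spare not in failed and spare not in claimed:
            return spare
    return None  # no spare left: the SPARE gate propagates its failure

spare_order = ["Memory3", "Memory4"]

# Memory1 fails first: Memory3 is claimed before Memory4.
first = claim_next_spare(spare_order, failed={"Memory1"}, claimed=set())

# Memory2 fails afterwards: only Memory4 is left.
second = claim_next_spare(spare_order,
                          failed={"Memory1", "Memory2"},
                          claimed={first})

# If Power2 fails, the FDEP fails both spares and nothing can be claimed.
after_power2_loss = claim_next_spare(
    spare_order,
    failed={"Memory1", "Power2", "Memory3", "Memory4"},
    claimed=set(),
)
```

The sketch deliberately has no notion of choice: the order is an input, which is exactly the restriction the NdDFT model lifts.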

Non-Deterministic Dynamic Fault Trees
As described before, DFTs require spares to be activated in a fixed and rigid order. This order cannot be adapted depending on faults that have previously occurred. Additionally, in cases of spare races it is not semantically clear which SPARE gate claims the actual redundancy. To relax this semantic restriction of the DFT model, [11] introduces an inherently non-deterministic DFT model (NdDFT, following the naming in [1]). The syntax and notation of the NdDFT are adopted completely from the DFT. Semantically, the NdDFT drops the requirement that spares are always activated from left to right. Moreover, the new non-deterministic semantics allows a SPARE gate to leave spares available for more important SPARE gates by not claiming. Whenever BEs occur in an NdDFT, the new semantics allows valid recovery actions of the following form to be performed:
Definition 1 (Recovery Action). A recovery action r in an NdDFT T is an action of the form We

Synthesizing Recovery Strategies
Here we describe the essential steps; details can be found in [11]. First, the NdDFT model is transformed into a Markov Automaton (MA) that represents all possible (non-deterministic) decisions on spare activations. By optimizing the scheduling of the MA model in terms of reliability of the system, a recovery strategy for the NdDFT can be synthesized. This strategy is represented by a Recovery Automaton (RA) that defines which spare has to be used in which failure state of the system and can therefore guarantee an optimal reliability at all times. The latter can be computed by a quantitative analysis of the Markov Chain that is obtained from the RA, enriched by the failure rates of basic events as determined by the original NdDFT. Fig. 4 visualizes the procedure.

Recovery Strategies and Automata
In the NdDFT, the actual recovery action r that is applied is defined by a given recovery strategy. In the following, transitions of Recovery Automata are labeled by recovery action sequences. Given the observed basic events, a recovery strategy is then a mapping that returns the recovery action sequence that should be taken accordingly. The NdDFT considers recovery strategies that only have recovery actions as given in Def. 1. They are defined as follows:
Definition 2 (Recovery Strategy). A recovery strategy for an NdDFT T is a mapping $Recovery : BES(T)^* \to RS(T)$ with $Recovery(B_1, \ldots, B_n) = rs_n$ and $rs_n \in RS(T)$.
As each basic event can occur at most once, the recovery strategy only needs to be defined for sequences of pairwise disjoint sets of basic events, i.e., $B_i \cap B_j = \emptyset$ for $i \neq j$. A finite automaton that represents a recovery strategy will be called Recovery Automaton.

Definition 3 (Recovery Automaton).
A Recovery Automaton (RA) $R_T = (Q, \delta, q_0)$ of an NdDFT T is an automaton where
- $Q$ is a finite set of states,
- $q_0 \in Q$ is an initial state, and
- $\delta : Q \times BES(T) \to Q \times RS(T)$ is a deterministic transition function that maps the current state and an observed set of faults to the successor state and a recovery action sequence.
The recovery strategy induced by a Recovery Automaton R is denoted by $Recovery_R$. An example of a Recovery Automaton for a simple Fault Tree consisting of a SPARE gate with a cold redundant spare is given in Fig. 5.
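A Recovery Automaton can be represented as a transition table keyed by state and observed basic event set. The Python sketch below is a minimal illustration for a SPARE gate with one cold spare. The action label `CLAIM(SPARE_GATE, Spare)`, the state names, and the convention that missing table entries are self-loops with empty output are assumptions of this example, not notation from the paper.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

BESet = FrozenSet[str]
ActionSeq = Tuple[str, ...]

class RecoveryAutomaton:
    def __init__(self, initial: str,
                 delta: Dict[Tuple[str, BESet], Tuple[str, ActionSeq]]):
        self.initial = initial
        self.delta = delta

    def run(self, observed: Iterable[BESet]) -> List[ActionSeq]:
        """Return the recovery action sequence emitted for each observed set."""
        state, seen, outputs = self.initial, set(), []
        for events in observed:
            if seen & events:
                raise ValueError("basic events can occur at most once")
            seen |= events
            # Assumption of this sketch: inputs without an explicit entry are
            # self-loops that emit the empty recovery action sequence.
            state, actions = self.delta.get((state, events), (state, ()))
            outputs.append(actions)
        return outputs

PRIMARY = frozenset({"Primary"})
SPARE = frozenset({"Spare"})

# Hypothetical RA: when the primary fails, claim the cold spare.
ra = RecoveryAutomaton(
    "q0",
    {("q0", PRIMARY): ("q1", ("CLAIM(SPARE_GATE, Spare)",))},
)
```

The last element of `ra.run(history)` is the recovery action sequence to apply after the most recent observation, which is how the automaton induces a recovery strategy.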

Non-Deterministic Dynamic Fault Trees to Markov Automata
Transforming an NdDFT to a Markov Automaton can be done by adapting traditional algorithms for transforming DFTs to CTMCs. As base algorithm, we use the one given in [5]. The adapted algorithm operates by memorizing two sets of data in each of its states: first, the history of occurred basic event sets $(B_1, B_2, \ldots, B_n)$; second, a mapping from spare gates to the currently claimed spare. The initial, empty history of the algorithm is denoted by (). Starting with this initial state, all active basic events, i.e. those that are not associated with an unactivated spare, are used to compute Markovian successors for each of them while extending the history accordingly.
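The two pieces of data memorized per state can be captured in an immutable, hashable record so that states can be stored in sets and dictionaries during state space generation. The Python sketch below is one possible representation; the class name `GenState`, its methods, and the gate name `SPARE1` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple

@dataclass(frozen=True)
class GenState:
    # History of occurred basic event sets (B_1, ..., B_n); () initially.
    history: Tuple[FrozenSet[str], ...] = ()
    # Mapping spare gate -> claimed spare, stored as sorted pairs so that
    # equal mappings compare and hash equally.
    claims: Tuple[Tuple[str, str], ...] = ()

    def observe(self, events: Iterable[str]) -> "GenState":
        """Successor after a Markovian transition: extend the history."""
        return GenState(self.history + (frozenset(events),), self.claims)

    def claim(self, gate: str, spare: str) -> "GenState":
        """Successor after a recovery action that claims a spare."""
        mapping = dict(self.claims)
        mapping[gate] = spare
        return GenState(self.history, tuple(sorted(mapping.items())))

    def failed_events(self) -> FrozenSet[str]:
        """All basic events that have occurred so far."""
        return frozenset().union(*self.history)

initial = GenState()
after_failure = initial.observe({"Memory1"})
after_recovery = after_failure.claim("SPARE1", "Memory3")
```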
The respective basic event set is obtained by taking the active basic event and computing all basic events that transitively fail due to FDEPs. The transitions are labeled with the respective failure rate of the basic event causing the transition. All transitions that would lead to a state that implies that the top-level event (system failure) has occurred, are connected to a special FAIL state instead. For each target state of a Markovian transition, the algorithm generates successors using non-deterministic transitions. Each non-deterministic transition is labeled by a valid recovery action.
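Computing the basic event set of a Markovian transition is a reachability problem over the FDEP relation. The following Python sketch assumes, for simplicity, that FDEP triggers are basic events and that the dependencies are given as a dictionary; both the function name and this representation are illustrative.

```python
from typing import Dict, FrozenSet, Set

def fdep_closure(event: str, fdeps: Dict[str, Set[str]]) -> FrozenSet[str]:
    """Return the event together with everything that transitively fails with it."""
    failed = {event}
    stack = [event]
    while stack:
        trigger = stack.pop()
        for dependent in fdeps.get(trigger, ()):
            if dependent not in failed:
                failed.add(dependent)
                stack.append(dependent)
    return frozenset(failed)

# FDEPs of the memory example: each power source feeds two memories.
fdeps = {
    "Power1": {"Memory1", "Memory2"},
    "Power2": {"Memory3", "Memory4"},
}
power1_set = fdep_closure("Power1", fdeps)
memory1_set = fdep_closure("Memory1", fdeps)
```

In the memory example, a failure of Power1 therefore yields the basic event set {Power1, Memory1, Memory2}, whereas a failure of Memory1 yields only {Memory1}.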

Synthesizing Recovery Automata from Markov Automata
Using existing techniques for optimizing the scheduling of a Markov Automaton, the optimal non-deterministic transitions for maximizing the system reliability can be computed. The Recovery Automaton model is then used to represent the underlying decision process of the scheduler.
Extracting a Recovery Automaton from a scheduler for a Markov Automaton is achieved by replacing sequences of transitions for states $s_0, s_1, \ldots, s_n$ of the form $(s_0, B : \lambda, s_1), (s_1, r_1, s_2), \ldots, (s_{n-1}, r_{n-1}, s_n)$, where $B$ is a basic event set, $\lambda$ a failure rate and $r_1, \ldots, r_{n-1}$ recovery actions, by the transition $\delta(s_0, B) = (s_n, r_1 \ldots r_{n-1})$, where empty recovery actions are ignored. This applies to all transitions where $s_2, \ldots, s_n$ are the successors chosen by the optimized scheduler of the Markov Automaton. All other non-deterministic transitions are then discarded. Finally, the algorithm discards all unreachable states.
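The extraction step can be sketched as follows in Python. The input encoding is an assumption of this example: Markovian transitions are given as a dictionary from (state, basic event set) to the target state with rates omitted, the scheduler maps each state with a non-deterministic choice to the chosen (action, successor) pair, and the string "ε" stands for the empty recovery action. The action label is hypothetical.

```python
def extract_ra(initial, markovian, scheduler, empty="ε"):
    """Collapse each Markovian step plus the scheduled actions into one RA transition."""
    delta = {}
    for (source, events), state in markovian.items():
        actions, visited = [], set()
        while state in scheduler:          # follow the scheduler's choices
            if state in visited:
                raise ValueError("scheduler must not cycle")
            visited.add(state)
            action, state = scheduler[state]
            if action != empty:            # empty recovery actions are ignored
                actions.append(action)
        delta[(source, events)] = (state, tuple(actions))

    # Discard transitions of states that are unreachable from the initial state.
    reachable, frontier = {initial}, [initial]
    while frontier:
        current = frontier.pop()
        for (source, _events), (target, _actions) in delta.items():
            if source == current and target not in reachable:
                reachable.add(target)
                frontier.append(target)
    return {key: value for key, value in delta.items() if key[0] in reachable}

A, B = frozenset({"A"}), frozenset({"B"})
markovian = {("s0", A): "s1", ("s2", B): "s5", ("s3", B): "s4"}
# In s1 the scheduler claims a spare (reaching s2); the alternative s3 is not chosen.
scheduler = {"s1": ("CLAIM(G, S1)", "s2"), "s5": ("ε", "s6")}
ra_delta = extract_ra("s0", markovian, scheduler)
```

Here the branch through `s3` is dropped because the scheduler never selects it, and the empty action taken in `s5` leaves no trace in the output.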

Further Optimization of Recovery Automata
Complex systems usually exhibit a large number of faults that may occur. This means that NdDFTs describing such systems may be very large and correspondingly synthesized Recovery Automata may contain redundant states. In this section, we refine the given synthesis procedure by discussing some techniques for reducing the state space and the transition count of a synthesized Recovery Automaton. This leads to the task of finding an automaton with the same "behavior" that contains a smaller number of states. To capture this notion of having the same behavior, we introduce the concept of recovery equivalence between Recovery Automata as follows:
Definition 4 (RA Recovery Equivalence). Let $R_1 = (Q_1, \delta_1, q_{01})$ and $R_2 = (Q_2, \delta_2, q_{02})$ be two RAs. We define a binary relation $\approx_R$ such that $R_1 \approx_R R_2$ iff for any sequence of sets of basic events $B_1, \ldots, B_n$ with $B_i \cap B_j = \emptyset$ for any $i \neq j$ it holds that: $Recovery_{R_1}(B_1, \ldots, B_n) = Recovery_{R_2}(B_1, \ldots, B_n)$.
Given a Recovery Automaton as an input, the task of minimization involves obtaining an equivalent Recovery Automaton with as few states as possible. The standard problem of automata minimization is well-known and has been studied extensively. In this work, we apply the usual definition of trace equivalence and lift it to states of Recovery Automata:
Definition 5 (Trace Equivalence). Let $R_T = (Q, \delta, q_0)$ be an RA. A trace equivalence $\approx \subseteq Q \times Q$ is a maximal binary relation such that for any states $q_1, q_2 \in Q$ it holds that $q_1 \approx q_2$ iff for any $B \in BES(T)$: $\delta(q_1, B) = (q_1', rs_1)$ and $\delta(q_2, B) = (q_2', rs_2)$ with $q_1' \approx q_2'$ and $rs_1 = rs_2$.
Equivalent states in automata can be computed using the Partition Refinement algorithm [8], and a minimized automaton can then be obtained by merging all equivalent states.
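The computation of trace-equivalent states can be illustrated by a simple signature-based refinement. The Python sketch below uses a Moore-style fixed-point iteration, which is easier to read than the more efficient algorithm of [8] but is not that algorithm. It assumes a total transition table, and the state names, guard labels and action labels are hypothetical.

```python
def trace_equivalence_blocks(states, alphabet, delta):
    """Map every state to the index of its trace-equivalence class."""
    states, alphabet = list(states), list(alphabet)
    block_of = {q: 0 for q in states}   # start with a single block
    while True:
        signatures, refined = {}, {}
        for q in states:
            # States stay together only if they agree on their old block and,
            # for every input, on the output and the block of the successor.
            signature = (block_of[q],) + tuple(
                (delta[(q, b)][1], block_of[delta[(q, b)][0]]) for b in alphabet
            )
            refined[q] = signatures.setdefault(signature, len(signatures))
        if len(signatures) == len(set(block_of.values())):
            return refined              # no block was split: fixed point
        block_of = refined

def quotient(initial, delta, block_of):
    """Merge all states of a block into one state of the minimized RA."""
    merged = {
        (block_of[q], b): (block_of[target], actions)
        for (q, b), (target, actions) in delta.items()
    }
    return block_of[initial], merged

alphabet = ("B1", "B2")
delta = {
    ("q0", "B1"): ("q1", ("r1",)), ("q0", "B2"): ("q2", ("r2",)),
    ("q1", "B1"): ("q1", ()),      ("q1", "B2"): ("q3", ("r2",)),
    ("q2", "B1"): ("q4", ("r1",)), ("q2", "B2"): ("q2", ()),
    ("q3", "B1"): ("q3", ()),      ("q3", "B2"): ("q3", ()),
    ("q4", "B1"): ("q4", ()),      ("q4", "B2"): ("q4", ()),
}
blocks = trace_equivalence_blocks(["q0", "q1", "q2", "q3", "q4"], alphabet, delta)
start, minimized = quotient("q0", delta, blocks)
```

In this example only the two sink states `q3` and `q4` are trace equivalent, so the quotient has four states; `q1` and `q2` disagree on their outputs and stay separate.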
In the setting of Recovery Automata, we can go even further and merge pairs of states that are not trace equivalent as long as the behavior of the automaton does not change. A simple example for a case where merging non-equivalent states yields a Recovery Automaton that induces an equivalent recovery strategy, can be seen in Fig. 6.
In the following, we present the main contribution of this work: rules that allow merging states that are not trace equivalent, yet yield implementations of equivalent recovery strategies. We identified two cases where merging non-equivalent states does not change the induced recovery strategy. In both cases, the key to minimization that we exploit is the fact that the inputs of the automaton are produced by an FT. Hence, each basic event can occur at most once. As a consequence, certain traces in the RA are not valid inputs for the induced recovery strategy. This gives us additional freedom for merging states that would not be allowed to be merged in a standard automaton model.

Merging Orthogonal States
In the first rule, the idea is to identify states that may have transitions with disagreeing outputs, but where we can guarantee for certain that those transitions can never be taken, as their necessary inputs can no longer be produced. As mentioned before, the key to this idea lies in the exploitation of the property that basic events can only occur at most once in an FT. This gives us the following observation: If a basic event occurs on every path leading to a state in an RA, then it is guaranteed that in the future no transition listing this basic event in its guards can be taken. Note that Recovery Automata are deterministic automata, meaning that unlike non-deterministic automata they always have a transition defined for every possible input. Fig. 7 abstractly illustrates the application of this merging rule.
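The observation above suggests a simple fixed-point computation: a basic event is guaranteed in a state if it is guaranteed in, or produced by the transition from, every predecessor. The Python sketch below computes this set per state. It illustrates the observation only and is not necessarily identical to the paper's formal definition of guaranteed inputs; all names in it are hypothetical.

```python
def guaranteed_events(states, initial, delta):
    """For each state, the basic events that occur on every path leading to it."""
    all_events = frozenset().union(*(guard for (_state, guard) in delta))
    # Start optimistically and shrink until nothing changes (greatest fixed point).
    must = {q: all_events for q in states}
    must[initial] = frozenset()   # nothing has occurred at the start
    changed = True
    while changed:
        changed = False
        for (source, guard), (target, _actions) in delta.items():
            if target == initial:
                continue
            narrowed = must[target] & (must[source] | guard)
            if narrowed != must[target]:
                must[target] = narrowed
                changed = True
    return must

def can_still_fire(guard, state, must):
    """A guard is dead in a state if it lists an event that has already occurred."""
    return not (guard & must[state])

B1, B2 = frozenset({"a"}), frozenset({"b"})
delta = {
    ("q0", B1): ("q1", ("r1",)), ("q0", B2): ("q2", ("r2",)),
    ("q1", B1): ("q1", ()),      ("q1", B2): ("q3", ("r2",)),
    ("q2", B1): ("q3", ("r1",)), ("q2", B2): ("q2", ()),
    ("q3", B1): ("q3", ()),      ("q3", B2): ("q3", ()),
}
must = guaranteed_events(["q0", "q1", "q2", "q3"], "q0", delta)
```

In this example `q1` is only reachable after event `a`, so its loop guarded by `B1` can never be taken, while its transition guarded by `B2` is still live.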
For the purpose of formalizing the intuitively given notion, we now introduce the concept of orthogonal states. To capture the basic event sets that can no longer be produced by an FT upon having reached a state in the RA, we define the set of guaranteed inputs of a state q as a function GI : Q → Q with:
Definition 6 (Orthogonal States). Let $R_T = (Q, \delta, q_0)$ be an RA. Let further $p, q \in Q$ be two non-initial distinct states and $B \in BES(T)$. Then p, q are orthogonal with respect to B iff
To illustrate the definition of orthogonality, we consider as an example the Recovery Automaton depicted in Fig. 8. The RA we consider there reacts to two distinct basic event sets $B_1$ and $B_2$ and performs a corresponding recovery action $r_1$ or $r_2$ accordingly. An NdDFT that would produce such an RA would be, for example, a system consisting of two parallel spare gates running independently from each other, e.g. spare gates with no shared spare. For the guaranteed inputs we have:
Thus, by Def. 6 it holds that $q_1$ and $q_2$ are orthogonal with respect to basic event sets $B_1$ and $B_2$. Observe that $q_1$ has an outgoing loop transition labeled with $B_2 : \varepsilon$ that cannot occur, where $\varepsilon$ denotes the empty recovery action sequence. Similarly, $q_2$ has an outgoing loop transition labeled with $B_1 : \varepsilon$ that cannot occur. In the merged RA, these transitions are eliminated and all the other incoming and outgoing transitions are redirected to end and start at the merged state, respectively.
We are now ready to incorporate the orthogonality concept into an equivalence definition. We extend the basic trace equivalence definition as follows:
Definition 7 (RA State Recovery Equivalence). Let $R_T = (Q, \delta, q_0)$ be an RA. A state-based recovery equivalence $\approx_R \subseteq Q \times Q$ is a maximal relation such that for any states $q_1, q_2 \in Q$ it holds that $q_1 \approx_R q_2$ iff for any $B \in BES(T)$ either: $\delta(q_1, B) = (q_1', rs_1)$ and $\delta(q_2, B) = (q_2', rs_2)$ with $q_1' \approx q_2'$ and $rs_1 = rs_2$, or $q_1, q_2$ are orthogonal with respect to $B$.
We now prove the correctness of our approach. The following theorem states that merging two recovery equivalent states yields a recovery equivalent RA.
Theorem 1. Let $R_1 = (Q_1, \delta_1, q_{01})$ be an RA with a pair of states $q_1$ and $q_2$ such that $q_1 \approx_R q_2$. Let further $R_2 = (Q_2, \delta_2, q_{02})$ be an RA that contains the same states and transitions as $R_1$, apart from merging $q_1$ and $q_2$ into a single state $q_{12}$, redirecting the incoming transitions of $q_1$ and $q_2$ to $q_{12}$ and copying the outgoing transitions from $q_1$ with guard $B \notin GI(q_1)$ and from $q_2$ with guard $B \notin GI(q_2)$. Then $R_1 \approx_R R_2$.
Proof. Let $\beta := B_1, \ldots, B_n \in BES(T)^*$ be a sequence of basic event sets produced by an NdDFT. Then $B_i \cap B_j = \emptyset$ for any $i \neq j$. We distinguish two cases:
- Assume $R_1$ never visits $q_1$ or $q_2$. By definition of $R_2$ we then have that $R_2$ does not visit $q_{12}$ either. Again by definition of $R_2$, we thus immediately have that $Recovery_{R_1}(\beta) = Recovery_{R_2}(\beta)$.
- Assume $R_1$ visits $q_1$ (the case of visiting $q_2$ is analogous) upon reading $B_i$ for some $i < n$. Now consider $B_{i+1}$. Let $q_1', q_{12}'$ and $rs_1, rs_{12}$ be such that $\delta_1(q_1, B_{i+1}) = (q_1', rs_1)$ and $\delta_2(q_{12}, B_{i+1}) = (q_{12}', rs_{12})$. By Def. 7 this means that we have either:
  • $rs_1 = rs_{12}$ and $q_1' \approx q_{12}'$. By correctness of merging trace equivalent states we hence obtain $Recovery_{R_1}(\beta) = Recovery_{R_2}(\beta)$.
  • $q_1, q_2$ are orthogonal with respect to $B_{i+1}$. Then by Def. 6 it holds that: If $B_{i+1} \in GI(q_1)$ then there exists by construction of GI an index $j < i + 1$ such that $B_{i+1} \cap B_j \neq \emptyset$, which contradicts the definition of $\beta$. Therefore we conclude $B_{i+1} \in GI(q_2)$. By construction of $R_2$ this implies that the transition of $q_2$ is not copied and the transition of $q_1$ is chosen instead. Thus, $rs_1 = rs_{12}$ and $q_1' = q_{12}'$. Hence we can conclude $Recovery_{R_1}(\beta) = Recovery_{R_2}(\beta)$.
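The construction of the merged automaton used in the theorem can be sketched in Python as follows. The guaranteed inputs are passed in as a precomputed dictionary `gi` from states to the guards that can no longer occur there; this encoding, the function names and the example automaton are assumptions made for illustration.

```python
def merge_states(delta, q1, q2, gi, merged):
    """Merge q1 and q2 into `merged`, dropping transitions whose guard is in GI."""
    def rename(state):
        return merged if state in (q1, q2) else state

    result = {}
    for (source, guard), (target, actions) in delta.items():
        if source in (q1, q2):
            if guard in gi[source]:
                continue   # this guard can never fire in `source`
            # If both states keep a transition for the same guard, recovery
            # equivalence requires them to agree, so the first one is kept.
            result.setdefault((merged, guard), (rename(target), actions))
        else:
            result[(source, guard)] = (rename(target), actions)
    return result

def run(initial, delta, inputs):
    """Outputs emitted by an RA for a sequence of guards."""
    state, outputs = initial, []
    for guard in inputs:
        state, actions = delta[(state, guard)]
        outputs.append(actions)
    return outputs

# Hypothetical RA for two independent spare gates.
delta = {
    ("q0", "B1"): ("q1", ("r1",)), ("q0", "B2"): ("q2", ("r2",)),
    ("q1", "B1"): ("q1", ()),      ("q1", "B2"): ("q3", ("r2",)),
    ("q2", "B1"): ("q3", ("r1",)), ("q2", "B2"): ("q2", ()),
    ("q3", "B1"): ("q3", ()),      ("q3", "B2"): ("q3", ()),
}
gi = {"q1": {"B1"}, "q2": {"B2"}}
merged_delta = merge_states(delta, "q1", "q2", gi, "q12")
valid_inputs = [["B1"], ["B2"], ["B1", "B2"], ["B2", "B1"]]
```

On every valid input sequence, i.e. one in which no guard repeats, the original and the merged automaton emit the same outputs, although `q1` and `q2` are not trace equivalent.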

Merging the FAIL State to Predecessors
The idea of the second case is to identify FAIL states that do not contribute new recovery action sequences when a set of faults occurs. If a state only leads to a FAIL state, the transition can be turned into a self-loop. Should the FAIL state then no longer be reachable, it can be eliminated. This rule is abstractly illustrated in Fig. 9. We first introduce the concept of a FAIL state.
Definition 8 (FAIL State). Let $R_T = (Q, \delta, q_0)$ be an RA and $q \in Q$ a state. Then q is a FAIL state iff for any $B \in BES(T)$, all transitions from q are of the form $\delta(q, B) = (q, \varepsilon)$, where $\varepsilon$ denotes the empty recovery action sequence.
The formalized merging rule can then be captured by the following theorem:
Theorem 2. Let $R_1 = (Q_1, \delta_1, q_{01})$ be an RA with a pair of states $q_1$ and $q_2$ such that $q_2$ is a FAIL state and all transitions of $q_1$ are $\varepsilon$-loops, i.e. self-loops with the empty recovery action sequence $\varepsilon$, except for one transition being of the form $\delta_1(q_1, B) = (q_2, rs)$, such that rs = . Let further $R_2 = (Q_2, \delta_2, q_{02})$ be an RA with the same states and transitions as $R_1$, except for turning outgoing transitions of $q_1$ into loop transitions. Then $R_1 \approx_R R_2$.
Proof. Let $\beta := B_1, \ldots, B_n \in BES(T)^*$ be a sequence of basic event sets with $B_i \cap B_j = \emptyset$ for any $i \neq j$. We distinguish two cases:
- Assume $R_1$ never visits $q_1$. Then by definition of $R_2$, it also never visits $q_1$. As both automata are defined to be equal otherwise, we then immediately have that $Recovery_{R_1}(\beta) = Recovery_{R_2}(\beta)$.
- Assume $R_1$ visits $q_1$ upon reading $B_i$ for some $i < n$. Then by definition, $R_2$ also visits $q_1$ upon reading $B_i$. Now consider $B_{i+1}$. By the construction of $R_2$ it holds that $\delta_1(q_1, B_{i+1}) = (q_2, rs)$ and $\delta_2(q_1, B_{i+1}) = (q_1, rs)$ for some recovery action sequence rs. Since $q_2$ is a FAIL state, we obtain from Def. 8 that $\delta_1(q_2, B_j) = (q_2, \varepsilon)$ for any $j > i + 1$, where $\varepsilon$ denotes the empty recovery action sequence. Moreover, since also $B_j \cap B_{i+1} = \emptyset$ for any $j > i + 1$, we have by definition of $q_1$ and $R_2$ that $\delta_2(q_1, B_j) = (q_1, \varepsilon)$.
In total, we can therefore conclude that in all cases $Recovery_{R_1}(\beta) = Recovery_{R_2}(\beta)$. Hence, $R_1 \approx_R R_2$ by Def. 4.
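The second merging rule can be sketched in Python as a local rewrite followed by a reachability pass. The sketch assumes a total transition table and represents the empty recovery action sequence by the empty tuple. It keeps whatever output the redirected transition carries. The function names and the example automaton are hypothetical.

```python
def is_fail_state(state, alphabet, delta):
    """A FAIL state only has self-loops with empty output."""
    return all(delta[(state, b)] == (state, ()) for b in alphabet)

def absorb_fail_successor(state, alphabet, delta):
    """Turn the single exit of `state` into a self-loop if it leads to a FAIL state."""
    exits = [b for b in alphabet if delta[(state, b)] != (state, ())]
    if len(exits) != 1:
        return delta   # rule not applicable
    target, actions = delta[(state, exits[0])]
    if target == state or not is_fail_state(target, alphabet, delta):
        return delta   # rule not applicable
    rewritten = dict(delta)
    rewritten[(state, exits[0])] = (state, actions)
    return rewritten

def prune_unreachable(initial, delta):
    """Drop all transitions of states that cannot be reached from `initial`."""
    reachable, frontier = {initial}, [initial]
    while frontier:
        current = frontier.pop()
        for (source, _guard), (target, _actions) in delta.items():
            if source == current and target not in reachable:
                reachable.add(target)
                frontier.append(target)
    return {key: value for key, value in delta.items() if key[0] in reachable}

def run(initial, delta, inputs):
    state, outputs = initial, []
    for guard in inputs:
        state, actions = delta[(state, guard)]
        outputs.append(actions)
    return outputs

alphabet = ("B1", "B2")
delta = {
    ("q0", "B1"): ("q1", ("r1",)), ("q0", "B2"): ("q0", ()),
    ("q1", "B1"): ("q1", ()),      ("q1", "B2"): ("FAIL", ()),
    ("FAIL", "B1"): ("FAIL", ()),  ("FAIL", "B2"): ("FAIL", ()),
}
reduced = prune_unreachable("q0", absorb_fail_successor("q1", alphabet, delta))
valid_inputs = [["B1"], ["B2"], ["B1", "B2"], ["B2", "B1"]]
```

After the rewrite, the FAIL state is no longer reachable and disappears, while the outputs on all valid input sequences are unchanged.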

Case Studies
In order to evaluate the presented techniques, we apply the synthesis methodology to two use cases, including the newly described merging rules for further optimizing the created RA models.

Multiprocessor Computing System
Target System. We consider the literature example of a Multiprocessor Computing System (MCS) based on the model given in [3]. The MCS consists of two main components: the Bus and the Computing Module (CM). The CM is hot redundant and consists of two further CMs, $CM_1$ and $CM_2$. Each of these CMs requires a disk, a processor and a memory unit. Each CM has a warm redundant backup disk. Furthermore, a shared redundant memory unit MS is available to both CMs in case their own memory unit fails. Finally, both processors are powered by a common power source PS. The common power source itself is again hot redundant and consists of the two power units $PS_1$ and $PS_2$. Fig. 10 shows an NdDFT that describes the MCS.
Experimental Results. The described synthesis algorithm was performed to obtain a Recovery Automaton from the described NdDFT. The RA was then optimized by merging trace-equivalent and recovery-equivalent states and by eliminating redundant transitions. Table 1 shows the results after minimizing the synthesized RA. Observe that initially the RA contained a large number of states and transitions. After performing the Partition Refinement algorithm based on the trace-equivalence definition, the number of states and transitions was significantly reduced. After performing both the Partition Refinement and the merging of non-trace-equivalent states according to the described merging rules, the number of states was reduced by 95.86% and the number of transitions by 95.05%. Thus, merging non-trace-equivalent states reduced the number of states obtained by merging trace-equivalent states by an additional 38.81% and the number of transitions by an additional 32.38%. This indicates the effectiveness of the proposed approach of merging non-trace-equivalent states where this yields an equivalent Recovery Automaton with the same behavior.

Memory System with N Redundancies
Target System. To assess the state space reduction for Recovery Automata in terms of increasing DFT complexity, we consider a family of DFTs based on the previous memory system use case given in Fig. 3. The model family is depicted in Fig. 11a. As before, the system consists of two main memory units Memory1 and Memory2. However, instead of a fixed number of redundant memory units, they now share a variable pool of cold redundancies of size N.
Experimental Results. Fig. 11b shows how the state space sizes increase with varying number of redundancies N for both the raw Markov Automaton of the NdDFT and the finally resulting minimized RA. Note that the y-axis is scaled logarithmically. It can be seen that the RA state space grows significantly slower, but still at an exponential pace. However, it can also be seen that the state space reduction remains consistent over the course of the increasing number of redundancies.

Conclusion
In this paper, we investigated the problem of optimizing Recovery Automata that represent recovery strategies synthesized from NdDFTs. New algorithms to minimize an RA by additionally eliminating non-trace-equivalent states and redundant transitions were provided. In particular, we extended the notion of recovery equivalence between states by introducing the notion of orthogonal states and a rule for merging them. In addition, we introduced the concept of FAIL states and a rule for merging them with predecessor states. A formal proof showing that an equivalent RA is produced was given for each case. A case study using the described approach was provided, and the evaluated results showed that it yields a more efficient implementation of recovery strategies for FDIR than solely eliminating trace-equivalent states.
In the future, we would like to extend the Recovery Automata model to deal with input of Fault Trees with transient and repairable faults and consider how the merging rules can be transferred.