Fault Tolerant Suffix Trees

Classical algorithms and data structures assume that the underlying memory is reliable and that data remain intact during and after processing. This assumption is perilous: several studies have shown that large, inexpensive memories are vulnerable to bit flips, so a few memory faults can threaten the correctness of a classical algorithm's output. Fault tolerant data structures and resilient algorithms are designed to tolerate a limited number of faults and to produce correct output based on the uncorrupted part of the data. The suffix tree is an important data structure with widespread applications, including substring search, the superstring problem and data compression. The fault tolerant suffix tree presented in the literature relies on complex techniques: encodable and decodable error-correcting codes, blocked data structures and fault-resistant tries. In this work, we use the natural approach of data replication to develop a fault tolerant suffix tree based on the faulty-memory random access machine model. The proposed data structure stores copies of the indices to withstand memory faults injected by an adversary. We develop a resilient version of Ukkonen's algorithm for constructing the fault tolerant suffix tree and derive an upper bound on the number of corrupt suffixes.


Introduction
Computer memory plays an important role in all Turing-based computational platforms. Modern systems use memory at multiple levels, which increases the need for its correctness [1]. However, memory at any level is fragile and error-prone, so some recovery mechanism is necessary. A fault is the underlying cause of an error, such as a stuck-at bit or a high-energy particle strike. Faults can be active (causing errors) or dormant (not causing errors) [1]. An error is an incorrect portion of state resulting from an active fault, such as an incorrect value in memory. Faults are predicted to become more frequent in future systems, which will contain many times more DRAM and SRAM than current systems.
Memory faults can compromise the correctness of results. During the execution of an algorithm or the lifetime of a data structure, it is assumed that the data stored and the data structures built in memory are correct, and that processing proceeds exactly as the algorithm directs. When memory faults occur, data items stored in memory are altered and the algorithm is compelled to take wrong steps. Consequently, the results can differ substantially from those expected. This can lead to financial loss or to even more disastrous outcomes, for example in avionics applications [2]. The errors can be contained by using error detection and correction circuitry, but this solution is expensive in both price and performance, since it adds computational overhead. Another solution is to use high-performance, reliable memory such as registers; however, this is too expensive for large applications. Hence these faults need to be handled at the application level instead of the hardware level, and some recovery or tolerance mechanism is required to guard against memory faults and ensure reliable processing. For this purpose, resilient algorithms and fault-tolerant data structures are designed. A resilient algorithm is one that can compute a correct output based on the uncorrupted values [3]. During the last fifty years, several resilient algorithms and fault-tolerant data structures have been presented by the algorithmic community [2].
The suffix tree is an important data structure for performing substring queries on a given string. Several classical algorithms are available for building the suffix tree of a string, but they fail when even a few memory faults occur. Therefore, in this work, we design a fault tolerant suffix tree based on the natural technique of data replication. We use the faulty-memory random access machine model introduced by Finocchi and Italiano [3] to construct a Fault-Tolerant Suffix Tree (FTST). We use \(\sqrt{r}\) as the duplication function, where \(r\) is an upper bound on memory faults. Instead of duplicating the actual characters of the string, only the starting index of each edge is duplicated \(\sqrt{r} + 1\) times. The proposed FTST can tolerate \([(\sqrt{r} + 1)/2] - 1\) faults on each edge of the suffix tree.
The rest of the paper is organized as follows. Section 2 provides a brief overview of the state of the art. The formal problem statement is presented in Section 3. Section 4 presents the proposed Fault Tolerant Suffix Tree (FTST) model together with its analysis. Section 5 concludes the work and provides directions for future research.

Literature Review
Considerable work has been done in the field of fault-tolerant data structures and resilient algorithms. Large-scale applications process massive data, which requires large amounts of low-cost memory. Hence fault tolerance is an important consideration in large systems, safety-critical systems and the systems of financial institutions. Computing with unreliable information is a new and interesting area of research, and the problem has been investigated in a variety of settings, for example the liar model [4], fault-tolerant persistent memory programming [5], fault tolerant main memory file systems [6], parallel models of computation with faulty memories [7], and others [8][9][10][11]. In the following, we summarize key works in the area of fault tolerant algorithms and systems.
Zhang [5] presented Pangolin, a fault tolerant persistent object library for non-volatile main memory (NVMM). The library uses a variety of techniques, such as parity, checksums and micro-buffering, to protect objects from errors introduced by media failure or software bugs. Xu et al. [6] proposed an NVMM-based file system called NOVA-Fortis, which is capable of performing well even in the presence of faulty storage (errors introduced by corruption) or software bugs. Wang et al. [11] considered the problem of massive health data generation, transmission and collection using various devices, and proposed an interest-matching mechanism for efficient, fault tolerant data collection. Jia [12] criticised the stability of traditional systems in the era of big data processing and proposed a fault tolerant model using the Spark application framework for big data clustering in a distributed environment.
Finocchi and Italiano [3] introduced the faulty-memory random access machine model, in which storage locations may suffer from faults. In this model, an adversary can alter up to σ storage locations throughout the execution of an algorithm. The model is used to produce correct output based on uncorrupted values [3,8]. Aumann [9] proposed a fault-tolerant stack, linked list and general tree based on a reconstruction technique: when faults are detected, the data structure is reconstructed. Finocchi et al. [10] presented resilient search trees that obtain optimal time and space bounds while tolerating up to \(O(\sqrt{\log n})\) memory faults, where n is the size of the search tree.
Weidendorfer et al. [13] proposed LAIK (Lightweight Application-Integration data distribution for parallel worKers) to support fault tolerance features in parallel programming. LAIK has access to data and can perform various operations, such as freeing up a node before failure and assisting in replication for rollback schemes. The authors presented an example by integrating LAIK with another application.
Wang et al. [14] considered the high computational and I/O costs incurred during the archiving phase of large graph computation systems and proposed dynamically adjusting the checkpoint interval. The resulting procedure not only achieves the required key property of fault tolerance but also significantly reduces the high I/O costs.
Traditionally, wireless sensor network nodes have low energy storage capacity, which can cause routing and communication protocols to fail. The resulting power outages in some sensor networks lead to loss of communication among nodes. Lu [15] addressed the problem by proposing a new algorithm that uses the structured directional de Bruijn graph to improve the fault tolerance of communication protocols in wireless sensor networks. The intuition behind the algorithm is to deploy high-energy nodes as super nodes responsible for forming redundant routing tables.

Problem Statement
We have a string S containing n characters, including the $ symbol as its last character. S is drawn from an alphabet of size c. A pattern P is another string, of length m, and the task is to check whether P is a substring of S. The suffix tree data structure supports such substring queries. The suffix tree built for S contains n suffixes (leaf nodes) and c root-originating edges. The total number of edges of the suffix tree is denoted by e. We use the faulty-memory random access machine model to construct the Fault Tolerant Suffix Tree. In this model, resilient variables provide the resilience capability. A resilient variable stores duplicate copies of its value for safety. The amount of duplication is a function of the number of corruptions, such as \(2d\) or \(d^2\), where \(d\) is an upper bound on the number of corruptions that an adversary can introduce. The duplication function df, if not selected cautiously, can degrade the space and running time of resilient data structures and algorithms. Therefore, we have carefully selected df for our FTST model as \(\sqrt{d}\).
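A resilient variable of this kind can be sketched in a few lines. The following Python is a minimal illustration, not the authors' implementation: the class name `ResilientInt`, the use of `math.isqrt` to round \(\sqrt{d}\) down, and the choice to return `None` when no strict majority exists are all assumptions made for this example.

```python
import math
from collections import Counter


class ResilientInt:
    """Integer stored as df + 1 copies (hypothetical helper, not from the paper)."""

    def __init__(self, value, d):
        df = math.isqrt(d)                 # assumption: sqrt(d) rounded down
        self.copies = [value] * (df + 1)   # df duplicates plus the original

    def read(self):
        """Return the strict-majority value, or None if the copies are irreparable."""
        value, count = Counter(self.copies).most_common(1)[0]
        return value if 2 * count > len(self.copies) else None


start = ResilientInt(17, d=9)      # sqrt(9) = 3 duplicates, so 4 copies
start.copies[2] = 99               # the adversary corrupts one copy
one_fault = start.read()           # the majority still recovers 17
start.copies[0] = 99               # a second fault corrupts half of the copies
two_faults = start.read()          # no strict majority remains, so None
```

With four copies, one fault is absorbed while two faults (half of the copies) leave the value irreparable, which is the behaviour the replication scheme relies on.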
Fact 1: The minimum number of edges e in a suffix tree is given by \(e \geq n\).

Fact 2: The maximum number of edges e in a suffix tree is given by \(e \leq 2n - c\).
Lemma 1: \(\lfloor (df + 1)/2 \rfloor\) corruptions on a single edge of a suffix tree will leave the value of that edge irreparable.
Proof: To sustain adversarial faults on an edge, the edge value is duplicated so that if some copies are altered, some correct copies still remain. But if half or more of the copies are altered, the correct value is no longer in the majority and hence the edge value cannot be repaired. ∎
Theorem 2: The number of corruptions d must stay below a maximum value, as otherwise an adversary can corrupt all edges beyond repair.
Proof: From Lemma 1, we know that to corrupt the value of an edge, an adversary needs to corrupt \(\lfloor (df + 1)/2 \rfloor\) copies. Therefore, to corrupt all e edges, the number of corruptions required is \(e \cdot \lfloor (df + 1)/2 \rfloor\). Solving this for the number of corruptions gives the bound of the theorem.
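The counting argument in this proof can be made concrete with a small calculation. The sketch below assumes that the per-edge cost from Lemma 1 is read as \(\lfloor (df + 1)/2 \rfloor\) and that the adversary must pay it independently on each of the e edges; the function names are illustrative only.

```python
def corruptions_per_edge(df):
    """Copies an adversary must alter to make one edge irreparable (Lemma 1 as read here)."""
    return (df + 1) // 2


def corruptions_for_all_edges(e, df):
    """Total corruptions needed to make every one of the e edges irreparable."""
    return e * corruptions_per_edge(df)


# Example: df = 3 duplicates (4 copies per start index) and e = 10 edges.
per_edge = corruptions_per_edge(3)
total = corruptions_for_all_edges(10, 3)
```

For this example the adversary needs 2 corruptions per edge and 20 in total, so any fault budget below 20 leaves at least one edge repairable.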

Proposed Fault Tolerant Suffix Tree Algorithms
In this section, we present the proposed fault tolerant suffix tree construction and traversing algorithms. The construction algorithm (Algorithm 1) is used to construct a fault tolerant suffix tree for string S of length n. Algorithm 2 and Algorithm 3 are used to find a pattern P in string S.

Construction of Fault Tolerant Suffix Tree
Algorithm 1 describes the steps required to construct a fault tolerant suffix tree. It receives as input a string S that contains the $ symbol as its terminating character. Here n is the size of S and c is the size of its alphabet. These two values are used to calculate the possible number of edges e of the FTST by using Fact 1. By using Theorem 2, the maximum number of corruptions, max, is calculated, beyond which all edges can be corrupted by the adversary. The actual number of corruptions d is then selected at random between 1 and max. The number of actual corruptions d determines the value of our duplication function df. Data replication can be very costly in terms of running time and space; therefore, we have carefully selected df as \(\sqrt{d}\). The value r is the number of duplicate copies plus the original value, and it determines the size of the array that stores the start index of every node of the FTST.
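The parameter setup at the start of Algorithm 1 can be sketched as follows. This is a hedged illustration: the function name and the returned dictionary are hypothetical, \(\sqrt{d}\) is rounded down with `math.isqrt` (the text does not say how a non-square d is handled), and d is passed in directly instead of being drawn at random below the Theorem 2 maximum.

```python
import math


def ftst_parameters(S, d):
    """Derive the quantities Algorithm 1 needs before building the tree (illustrative only)."""
    if not S.endswith("$") or S.count("$") != 1:
        raise ValueError("S must end with a unique '$' terminator")
    n = len(S)                 # number of characters, including '$'
    c = len(set(S))            # alphabet size, counting '$'
    df = math.isqrt(d)         # assumption: duplication function sqrt(d), rounded down
    return {
        "n": n,
        "c": c,
        "e_min": n,            # Fact 1: e >= n
        "e_max": 2 * n - c,    # Fact 2: e <= 2n - c
        "df": df,
        "r": df + 1,           # duplicates plus the original value
    }


params = ftst_parameters("banana$", d=9)
```

For `banana$` with d = 9 this gives n = 7, c = 4, an edge count between 7 and 10, and r = 4 copies of every start index.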
\(FTST_i\) is the Fault Tolerant Suffix Tree of S that contains the characters of S from the first to the i-th character. \(FTST_1\) is the Fault Tolerant Suffix Tree that contains only the first character of S. The whole FTST is then constructed by using all five heuristics of Ukkonen's algorithm. Each child node stores the start and end index of the edge that links it to its parent node. This start-end pair gives the length of that edge. Each start index is a resilient variable that stores its value r times to provide the resilience capability. The value held by the majority of the copies is taken as the correct start index.
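The node layout described here, with a replicated start index on every edge, can be sketched as follows. To keep the example short it builds the tree with a naive quadratic insertion of each suffix instead of Ukkonen's algorithm, so it illustrates only the storage scheme, not the construction heuristics. The names `Node`, `majority`, `build_naive` and `count_edges` are hypothetical, indices are 0-based, and S is assumed to end with a unique `$`.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    start: list                      # r copies of the edge's start index in S
    end: int                         # end index (inclusive) of the edge label
    children: dict = field(default_factory=dict)


def majority(copies):
    """Value held by a strict majority of the copies, else None."""
    value, count = Counter(copies).most_common(1)[0]
    return value if 2 * count > len(copies) else None


def build_naive(S, r):
    """Insert every suffix of S one by one (NOT Ukkonen's algorithm)."""
    n = len(S)
    root = Node([0] * r, -1)         # the root carries an empty edge
    for i in range(n):
        node, j = root, i            # j: next unmatched character of suffix i
        while True:
            child = node.children.get(S[j])
            if child is None:        # no edge starts with S[j]: add a leaf
                node.children[S[j]] = Node([j] * r, n - 1)
                break
            t = majority(child.start)
            k = t
            while k <= child.end and S[k] == S[j]:
                k, j = k + 1, j + 1
            if k > child.end:        # whole edge matched: descend
                node = child
                continue
            # Mismatch inside the edge: split it at position k.
            mid = Node([t] * r, k - 1)
            node.children[S[t]] = mid
            child.start = [k] * r
            mid.children[S[k]] = child
            mid.children[S[j]] = Node([j] * r, n - 1)
            break
    return root


def count_edges(node):
    """Number of edges in the subtree below node."""
    return sum(1 + count_edges(ch) for ch in node.children.values())


root = build_naive("banana$", r=4)   # r = 4 copies, e.g. d = 9
edges = count_edges(root)
```

For `banana$` the tree has 10 edges, which lies within the range given by Facts 1 and 2 (7 to 10), and every edge carries four copies of its start index.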

Searching for Pattern P in Fault Tolerant Suffix Tree
Algorithm 2 and Algorithm 3 are proposed to search for pattern P in the fault tolerant suffix tree constructed by Algorithm 1. The algorithms and the required explanation are provided in the text below.
Algorithm 2 (Traverse Node) receives three parameters: a node N of the FTST, a pattern P and an integer x that is used as an index into P. P is a string of length m that is globally accessible in all algorithms and is to be searched for within S. The Traverse Node algorithm traverses the edge of N and stores the returned value in result. A value of 1 indicates success, while -1 indicates failure of the search. Variable x is incremented by the length of the edge. The character of P at x is used as the index for selecting among the children of N. The algorithm then calls itself recursively on the matching child and proceeds.
The Traverse Node algorithm calls Traverse Edge (Algorithm 3) to explore the edge that links N to its parent node. During this exploration, the corresponding characters of the pattern P are compared with the characters stored on this edge. If Traverse Edge returns 1, pattern P has been found in string S. If it returns 0, the comparison succeeded on this edge but is incomplete, and the remaining part of P needs to be explored further. To explore the deeper edge, which links N to one of its children, the Traverse Node algorithm calls itself recursively. Any value other than 0 or 1 indicates that pattern P is not a substring of S.
Algorithm 3 (Traverse Edge) receives four parameters: the pattern P to be searched, an integer x that is the index into P, an array of integers start[ ] holding the copies of the starting index of the node, and an integer end holding the end index of the node. Integer variable k is used as the index into string S, and x as the index into P. If corresponding characters are not equal, -1 is returned, indicating failure of the search. Every node of the FTST stores a pair of indices, start and end, which are the start and end index of the edge linking this node to its parent node, respectively. To achieve the resilience capability, the FTST stores the start index value r times in an array start[ ] of size r. Variable t holds the start index value, which it receives by calling the Check Index algorithm, which returns the value held by the majority of the copies.
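The interplay of Traverse Node, Traverse Edge and Check Index can be sketched as below. This is an illustrative reading of the description, not the paper's pseudocode: indices are 0-based, the root is given an empty edge, the tree for `aba$` is built by hand, and raising `ValueError` when no majority exists is an assumption, since the text does not say how an irreparable start index is handled.

```python
from collections import Counter

S = "aba$"


class Node:
    def __init__(self, start, end, r=4):
        self.start = [start] * r     # r copies of the start index
        self.end = end               # end index (inclusive) in S
        self.children = {}           # first edge character -> child node


def check_index(copies):
    """Return the value stored in a strict majority of the copies."""
    value, count = Counter(copies).most_common(1)[0]
    if 2 * count <= len(copies):
        raise ValueError("start index is corrupted beyond repair")
    return value


def traverse_edge(P, x, start, end):
    """1: P fully matched, 0: edge matched but P not exhausted, -1: mismatch."""
    k = check_index(start)
    while k <= end and x < len(P):
        if S[k] != P[x]:
            return -1
        k, x = k + 1, x + 1
    return 1 if x == len(P) else 0


def traverse_node(N, P, x=0):
    result = traverse_edge(P, x, N.start, N.end)
    if result != 0:
        return result
    x += N.end - check_index(N.start) + 1    # skip the characters on this edge
    child = N.children.get(P[x])
    return -1 if child is None else traverse_node(child, P, x)


# Hand-built suffix tree of "aba$": root -> "a" -> {"ba$", "$"}, "ba$", "$".
root = Node(0, -1)
a = root.children["a"] = Node(0, 0)
a.children["b"] = Node(1, 3)
a.children["$"] = Node(3, 3)
root.children["b"] = Node(1, 3)
root.children["$"] = Node(3, 3)

root.children["b"].start[0] = 42     # one corrupted copy out of four
found = traverse_node(root, "ba")    # the query crosses the corrupted edge
missing = traverse_node(root, "bb")
```

Here `found` is 1 and `missing` is -1: the single corrupted copy on the `ba$` edge is outvoted by the three intact copies, so the query still succeeds.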
The proposed FTST construction and substring query algorithms can be used in a variety of applications where fault tolerance is a key design requirement. The proposed technique can also be used to design other fault tolerant data structures, such as AVL trees, Red-Black trees and Splay trees.
Funding Statement: The authors received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.