Smart Routing: Towards Proactive Fault-Handling in Software-Defined Networks

Software-defined networking offers numerous benefits over legacy networking systems by simplifying network management and reducing the cost of network configuration. Currently, the management of failures in the data plane is limited to two mechanisms: proactive and reactive. Both recovery techniques are activated only after a failure has occurred. Packet loss is therefore highly likely as a result of service disruption and unavailability. This issue stems not only from the slow speed of recovery mechanisms but also from the delay caused by the failure detection process. In this paper, we define a new approach to the management of fault tolerance in software-defined networks where the goal is to eliminate the convergence process altogether, rather than to speed up failure detection and recovery. We propose a new framework, called Smart Routing, which uses forewarning signs of failures to compute alternative paths and to isolate risky links from the routing tables of the data plane devices. We validate our framework through a set of experiments that demonstrate how the underlying model runs.


I. INTRODUCTION
THE concern about Internet ossification, a consequence of the growing number and variety of networks (e.g. Internet of Things, wireless sensor, Cloud, etc.) that serve a huge number of clients (currently estimated at about 9 billion) around the globe, has prompted a rethink of whether the existing rigid network infrastructure can be replaced by a programmable one [1]. In this context, Software-Defined Networking (SDN) has emerged as a promising solution to tackle the inflexibility of legacy networking systems. Unlike traditional IP networks, SDN architectures consist of two layers: a control plane and a data plane. The control plane, sometimes called the controller, represents the network brain and maintains a global view of the network. The data plane comprises the network forwarding elements, i.e. switches and routers, that constitute the network topology. All data plane elements are directed by the network controller, and all nodes have to report their status periodically to the controller, which is how the global view is obtained. So far, OpenFlow [2] is the most widely used protocol that enables the controller to govern the SDN data plane by carrying the forwarding rules and by facilitating the exchange of signals between the two planes. Nowadays, communication networks play a vital role in everyday human activities, as they form the backbone of most modern technologies. Since networking equipment is failure prone, aspects such as availability measurement, fault management and reliability become very important. This paper focuses mainly on the availability attribute in terms of fault tolerance and failure forecasting in SDNs. Despite the benefits of SDN, new challenges such as recovery from failure still require investigation in order to maximise its utility [4], [5].
This paper presents a complementary approach that minimises the percentage of service unavailability by utilising an online failure prediction mechanism. This allows the network controller to perform the necessary reconfiguration prior to failure incidents. Although a number of works on SDN fault management have been proposed, none of them has exploited the SDN global view for failure prediction purposes.
The rest of the paper is organised as follows. Section II provides an overview of the literature on SDN fault management techniques. Section III defines the problem statement and the novelty of our work. We then present our model and framework in Section IV. Sections VII and VIII present the experimental procedure, observed results and comparison. Finally, Section IX summarises the paper and outlines some future research directions.
II. RELATED WORK
Link failures often occur as part of everyday network operations. Due to their negative impact on network Quality of Service (QoS), a considerable amount of research has been conducted to analyse, characterise, evaluate and recover from frequent link failures. Such failures can either be unintentional (i.e. unplanned), due to causes such as human error, natural disasters, overload, software bugs or cable cuts, or intentional (i.e. planned), caused by maintenance [6]. Failure recovery is a necessary requirement for networking systems to ensure reliability and service availability. Generally, failure recovery mechanisms of carrier-grade networks are categorised into two types: protection and restoration. In protection, which is also known as proactive recovery, alternative solutions are pre-planned and reserved in advance (i.e. before a failure occurs). By contrast, in restoration, which is also called reactive recovery, possible solutions are not pre-planned and are calculated dynamically when failures occur. Both approaches have pros and cons.
For example, the authors in [8] implemented an OpenFlow monitoring function for achieving a fast data plane recovery.
In [9], another protection method was proposed using the OpenFlow-based Segment Protection (OSP) scheme. The main disadvantage of these approaches is that they consume the storage capacity of the data plane: the more flow entries (i.e. rules) that need to be stored, the more storage space is used. Current OpenFlow appliances on the market can accommodate only up to 8000 flow entries, due to known limitations of Ternary Content-Addressable Memory (TCAM), which makes this kind of solution costly [7], [10]. Installing many entries in the OpenFlow forwarding elements could also degrade the match-and-action process of the data plane nodes. Moreover, there is no guarantee that the preserved backups are failure-free; the backup path might fail before the primary one.
Following the restoration approach, the authors in [11] and [12] presented OpenFlow restoration methods to recover from single link failures. Experiments were conducted on small scale network topologies that did not exceed 14 nodes. In [13], the authors demonstrated, through extensive experiments, that OpenFlow restoration is not easily attainable within a time of 50ms, especially for large-scale networks, unless using protection techniques. In the same context, some works have utilised the concept of multiple disjoint paths to be employed as a backup. For example, CORONET [14] is presented as a fault-tolerance system for SDNs, in which multiple link failures can be resolved. The ADaptive Multi-Path Computation Framework (ADMPCF) [15] and HiQoS [16] for large scale OpenFlow networks were produced as traffic engineering tools that are capable of holding two or more disjoint paths to be utilised when some network events (e.g. link failure) occur. Most of the existing works do not take into account the processing time of flow entries, i.e. insert, delete and modify of rules. Although the performance of OpenFlow devices is associated with their vendors, in [17] the authors stated that each single flow entry insertion ranges from 0.5ms to 10ms. However, 11ms is the minimum duration required to modify a single rule, since each modification process includes both deletion (of old rules) and insertion (of new ones) [18].
Unlike existing works, the authors in [19] considered the problem of minimising the flow entry update time required when diverting from an affected primary path to a backup one. Although the presented algorithms do not guarantee the shortest end-to-end path, they open a new direction that is worth exploring. Within the same context, the authors in [20] produced new algorithms for minimising the time required to update rules by reducing the solution search space between the source and the destination of the affected path. Similarly, in [21], an approach that divides the network topology into non-overlapping cliques was introduced to tackle failures in a localised manner, rather than taking a global view of the network. Both [20] and [21] took into account the time required to compute the alternative route in order to speed up the update operation. The main issue with these three works is that they do not guarantee a shortest path from source to destination.
In summary, the previous studies demonstrated different methods to tackle the problem of data plane recovery from link failure incidents. A more recent survey [22] outlines in detail more contributions to the area of fault management in SDNs. One can conclude that protection approaches are not ideal due to the TCAM space exhaustion problem, whereas the latency issue is the major drawback of the existing restoration approaches. As a result, we believe that more research is needed in terms of achieving efficient SDN resilience, which is the main aim of this work.

III. PROBLEM STATEMENT AND CONTRIBUTIONS
Current SDN fault tolerance mechanisms inescapably lead to a certain amount of packet loss as well as to a certain probability of service unavailability. This is due to the delay of the convergence scheme, $T_C$. We define $T_C$ as the time taken by the OpenFlow controller to amend a path in response to a failure scenario. Typically, the convergence time in SDNs can be defined in terms of three factors:
• Failure detection time ($T_D$): This is the time required to detect a failure incident. Compared with conventional networking systems, the centralised management and global view of an SDN ease this task by continuously monitoring network status and obtaining notifications upon failure. However, the speed of receiving a notification is sometimes associated with the nature of the network design and the mode of communication (i.e. in-band or out-of-band) [23], [24]. According to [25], link failure detection time ranges from tens to hundreds of milliseconds, depending on the type of commercial OpenFlow switch being used.
• New route computation time ($T_{SP}$): This is the time spent when the network controller runs a nominated shortest path routing algorithm (e.g. Dijkstra [26]) to compute the backup path (usually for the reactive fault tolerance strategies). $T_{SP}$ could reach tens of milliseconds [20], depending on the size of the network.
• Flow entries update time ($T_{Update}$): This is the time required to update the relevant switches (i.e. the nodes that are involved in the affected path). Again, this factor depends on how many forwarding rules need to be updated after the failure scenario, where the time for a single rule may exceed 10ms.
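To make the three delay components concrete, the following minimal Python sketch adds them up. It assumes that the components are incurred one after another and that rules are updated sequentially; the function name and the numeric values are illustrative choices that fall within the ranges quoted above, not measurements.

```python
def convergence_time_ms(t_detect_ms, t_path_ms, n_rules, t_rule_ms):
    """Total delay if detection, path computation and rule updates happen in sequence."""
    # Assumption: the affected rules are updated one by one.
    t_update_ms = n_rules * t_rule_ms
    return t_detect_ms + t_path_ms + t_update_ms


# Illustrative values only: 50 ms detection, 10 ms path computation,
# 5 rules to update at 10 ms each.
total_ms = convergence_time_ms(50, 10, 5, 10)
print(total_ms)  # 110
```

Even with these modest values the total exceeds 100ms, and none of the three terms can be removed by a mechanism that only starts after the failure.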
Accordingly, the resulting convergence time can be calculated through the following equation:

$$T_C = T_D + T_{SP} + T_{Update} \tag{1}$$

Currently, the classical SDN fault management methods aim to tackle a failure after its occurrence. The recovery mechanism is therefore activated after the moment of failure, and hence all the previous proposals incur a certain amount of delay according to (1). The only way to overcome the three factors of (1) altogether is to handle the failure before it occurs. Therefore, failure prediction is required to provide awareness of potential future incidents and to allow the controller to reconfigure the network so as to override failures before they cause damage on some paths. Although a number of studies have put effort into the area of failure prediction, none of these (except [27]) has exploited the information that a prediction method can provide. The work in [27] is the only realistic study that discussed the advantages of failure prediction, by producing a risk-aware routing method for legacy IP networks. Our work differs from theirs in that we build a framework for proactive failure management in SDNs. Our work combines the concept of online failure prediction with risk analysis in order to maximise network service availability. With this context in mind, we can summarise the main contributions of this paper as follows:
• A new network model that allows for the forecasting of link failures by predicting their characteristics in an online fashion. This model also combines the predictive capability with the decision-making process using risk analysis.
• An implementation of the new model in terms of two fault tolerance algorithms. We use simulation techniques to test the efficiency of these algorithms. Our simulation results show that the proposed model and algorithms improve the service availability of SDNs.

IV. THE PROPOSED MODEL
Anticipating failures before they occur is a promising way to further enhance the existing SDN failure management techniques, i.e. the proactive and reactive ones, in which the controller responds to failures only when they take place. This section presents the proposed SDN model for anticipating link failure events. We start by outlining some notation that will be used throughout this paper, as shown in Table I.

The network topology is modelled as an undirected graph $G = (V, E)$, where $V$ is the finite set of vertices (i.e. routers) in $G$, ranged over by $\{v_i, v_j, \ldots, v_z\}$ with $\{i, j, \ldots, z\} \subset \{1, \ldots, n\}$ for $n \in \mathbb{N}$, and $E$ is the finite set of bidirectional edges (i.e. links) in $G$, denoted $\{e_{ij}\}$, where each $e_{ij} \in E$ is an edge that connects $v_i$ and $v_j$. We define the following operational test function, $OP$, over a link, which reflects whether the link is working or not:

$$OP(e_{ij}) = \begin{cases} 1 & \text{the link is operational} \\ 0 & \text{otherwise} \end{cases}$$

Therefore, the set of failed links, $F$, can be defined as follows:

$$F = \{ e_{ij} \in E \mid OP(e_{ij}) = 0 \}$$

Based on $G$, we define a path $P$ as a sequence of vertices representing routers in the network. Each path starts from a source router, $src$, and ends with a destination router, $dst$:

$$P = (v_{src}, \ldots, v_{dst})$$

We define the set $Flow$ to represent all demand traffic flows that need to be serviced. Each $flow \in Flow$ is an instance of $P$ that is associated with particular traffic defined by a unique $src$ and $dst$ pair. We consider $flow_{set}$ to be the set of all the possible paths between $src$ and $dst$ that can be derived from $G$, which is defined as follows:

$$flow_{set} = \{ P \in P_{set} \mid first(P) = src \wedge last(P) = dst \}$$

where $first$ and $last$ are functions on any general sequence $(a_1, \ldots, a_n)$:

$$first((a_1, \ldots, a_n)) = a_1, \qquad last((a_1, \ldots, a_n)) = a_n$$

We also consider $P_{set}$ as the set that contains all the admissible paths that can be constructed from $G$, so $P \in P_{set}$ and therefore $Flow \subset P_{set}$.

When a link failure is reported in $G$, the affected routes are the flows in $Flow$ that traverse at least one link in $F$. Similarly, we consider the case in which the system receives, at time $t$, a link failure prediction message $m_i \in M$. The links reported in such messages form the set $PFL$; we write $\bar{e}_{ij}$ as shorthand for $e_{ij} \in PFL$, i.e. a link that is in a potential-to-fail state and hence does not belong to $F$. We can now define the potential-to-fail route set as follows:

$$PFR = \{ flow \in Flow \mid flow \cap PFL \neq \emptyset \}$$

We write $\overline{flow}$ for a member of $PFR$, i.e. a $flow$ that has at least one $\bar{e}_{ij}$.
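The set-based definitions of this section can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: links are represented as unordered pairs (`frozenset`), each flow is stored as a single vertex sequence, and the helper names `links_of` and `routes_through` as well as the example topology are hypothetical.

```python
def links_of(path):
    """Return the undirected links (as frozensets) traversed by a vertex sequence."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def routes_through(flows, links):
    """Names of the flows that traverse at least one link in `links`."""
    return {name for name, path in flows.items() if links_of(path) & links}


# Hypothetical demand set: one configured path per flow.
flows = {
    "a": ["v1", "v2", "v3"],
    "b": ["v1", "v4", "v3"],
    "c": ["v2", "v3", "v5"],
}
F = {frozenset(("v1", "v4"))}    # failed links, i.e. OP = 0
PFL = {frozenset(("v2", "v3"))}  # links predicted to fail, not in F

failed_routes = routes_through(flows, F)  # flows hit by an actual failure
PFR = routes_through(flows, PFL)          # flows with at least one risky link

print(sorted(failed_routes))  # ['b']
print(sorted(PFR))            # ['a', 'c']
```

The same intersection test serves both the reactive case (links in $F$) and the predictive case (links in $PFL$); only the input set changes.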

A. SDN Predictive Model
All the previous efforts that dealt with data plane failures have succeeded in mitigating the impact of failures (e.g. reducing the downtime) rather than obviating their effect, such as service unavailability. The network incidents that cause routing instability, i.e. flaps, and lead to significant degradation of network service availability vary [28], [29]; however, we are concerned only with data link failures. By relying on monitoring techniques, some failures can be predicted through failure tracking, syndrome monitoring, and error reporting [30]. Consequently, a set of conditions can be defined as a basis for triggering a failure warning when at least one of the predefined conditions is satisfied, as follows:

if condition then warning trigger

Online failure prediction strategies vary and include machine learning techniques (e.g. the k-nearest neighbor algorithm [31]) and statistical analysis methods (e.g. time series [30], Kalman and Wiener filters [32]). Such techniques can be used to predict incoming events by relying on the past and current state information of a system. However, in this paper we do not intend to propose a failure prediction solution, as extensive studies have been conducted in this field with remarkable achievements. Instead, employing online failure prediction as a technique to enrich current SDN fault management is one of the main aims of this work. A generic overview of the time relations of online failure prediction is presented in Figure 1.
• $\Delta t_d$: represents the past (historical) data upon which the predictor forecasts upcoming failure events.
• $\Delta t_l$: represents the lead time with which a failure alarm is generated. It can also be defined as the minimum duration between the prediction and the failure.
• $\Delta t_w$: represents the warning time in which an action may be required to find a new solution based on the predicted event. Therefore, $\Delta t_l$ must be greater than $\Delta t_w$ for the prediction information to be serviceable. In SDN, $\Delta t_w$ should be at least as long as the time required to set up the longest shortest path in the given $G$.
• $\Delta t_p$: represents the time for which the prediction is assumed to be valid. This should be defined carefully by the network operator so that true and false alarms can be identified after a certain time window.
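The constraint between the lead time and the warning time can be checked with a short sketch. It assumes, purely for illustration, that setting up a path requires one rule installation per switch and that rules are installed sequentially; both function names and all numbers are hypothetical.

```python
def min_warning_time_ms(longest_path_hops, rule_install_ms):
    """Lower bound on the warning time: set up the longest shortest path in G.

    Assumption: one rule per switch, installed sequentially. A path with
    h hops (links) crosses h + 1 switches.
    """
    return (longest_path_hops + 1) * rule_install_ms


def is_serviceable(lead_time_ms, warning_time_ms):
    """A prediction is only useful if the lead time exceeds the warning time."""
    return lead_time_ms > warning_time_ms


# Illustrative values: longest shortest path of 6 hops, 10 ms per rule.
warning_ms = min_warning_time_ms(6, 10)
print(warning_ms)                       # 70
print(is_serviceable(500, warning_ms))  # True
print(is_serviceable(50, warning_ms))   # False
```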
The quality of failure prediction is usually evaluated through two error counts, false positives ($FP$) and false negatives ($FN$), whereas Recall and Precision are the two well-known metrics used to measure the overall performance.
Recall is defined as the ratio of correctly predicted failures to the total number of failures that actually occurred, whereas Precision is defined as the ratio of correctly predicted failures to the total number of positive predictions. Correspondingly, SDN controller actions are now associated with predicted and unpredicted situations, as listed in Table II. On one hand, every false failure alarm leads to an unnecessary reconfiguration of a particular set of routes in $Flow$, which causes unintended network instability. On the other hand, the controller needs to deal with undetected failures in a similar way to the classical methods. Consequently, the more precise the prediction, the higher the network stability and service availability that will be gained. The relationship between the network model and the predictive model is summarised in Figure 2.
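The two metrics follow directly from the counts of true positives ($TP$), false positives and false negatives. The sketch below uses the standard definitions; the counts in the example are made up for illustration and are not results from this paper.

```python
def precision(tp, fp):
    """Correct alarms divided by all alarms raised."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Correct alarms divided by all failures that actually occurred."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 8 failures predicted correctly, 2 false alarms,
# 2 failures missed by the predictor.
print(precision(8, 2))  # 0.8
print(recall(8, 2))     # 0.8
```

In terms of Table II, a low Precision translates into unnecessary route changes, while a low Recall means that more failures fall back to the classical recovery path.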

B. Failure Event Model
Because it is very difficult to find a public network dataset that includes useful details such as failures, we adopted an alternative approach and developed our own failure event model. This work intends to enhance SDN fault tolerance and resilience by maximising network service availability. Two basic metrics are exploited in this model: Mean Time Between Failures ($MTBF$) and Mean Time To Recover ($MTTR$), which are essential for calculating the availability and reliability of each repairable network component [3], [33]. $MTBF$ is defined as the average time for which a particular component functions before failing, i.e.

$$MTBF = \frac{\text{total operational time}}{\text{number of failures}}$$

while $MTTR$ is the average time required to repair a failed component. Each component (i.e. link) is characterised by its own values of $MTBF$ and $MTTR$, which are commonly independent of the other components in the network. In the absence of real data, some metrics (such as cable length and the cable-cut metric $CC$) can alternatively be used to derive the two availability metrics. According to [33], $MTBF$ can be calculated as follows:

$$MTBF\,(\text{hours}) = \frac{CC \times 365 \times 24}{\text{cable length (km)}} \tag{3}$$

For instance, when $CC$ is equal to 100 km, it means that per 100 km there will be on average one cut per year. Besides this, the $MTTR$ of a link is influenced by its length [34], which expresses the fact that a longer link has a higher $MTTR$ value. On this basis, we have designed the following formula for calculating the $MTTR$ value of each link in the network:

$$MTTR\,(\text{hours}) = \gamma \times \text{cable length (km)} \tag{4}$$

where $\gamma$ is a parameter indicating the time required to fix the cable, measured in hours per kilometre. Because links are physically distributed across different locations and environments, $\gamma$ differs from one link to another. In other words, even if some links have the same length, their $\gamma$ could be different, as it depends on the physical location and the ambient conditions. We will discuss the use of these two values in Section VI.
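As a worked illustration of the two link metrics, the sketch below assumes that $MTBF$ in hours is the cable-cut metric $CC$ times the hours in a year divided by the cable length, and that $MTTR$ is $\gamma$ times the cable length. These forms are inferred from the surrounding description and should be treated as assumptions; the steady-state availability ratio is the standard textbook one, and all numbers are illustrative.

```python
HOURS_PER_YEAR = 365 * 24


def mtbf_hours(cc_km, length_km):
    """Assumed form: one cut per year per `cc_km` kilometres of cable."""
    return cc_km * HOURS_PER_YEAR / length_km


def mttr_hours(gamma_h_per_km, length_km):
    """Assumed form: repair time grows linearly with the link length."""
    return gamma_h_per_km * length_km


def availability(mtbf, mttr):
    """Standard steady-state availability of a repairable component."""
    return mtbf / (mtbf + mttr)


# Illustrative link: 200 km long, CC = 100 km, gamma = 0.1 h/km.
mtbf = mtbf_hours(100, 200)  # 4380.0 hours between failures
mttr = mttr_hours(0.1, 200)  # 20.0 hours to repair
print(mtbf, mttr, round(availability(mtbf, mttr), 4))
```

With $CC$ fixed, doubling the length of a link halves its $MTBF$ and doubles its $MTTR$, so longer links are penalised twice in terms of availability.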

V. RISK ANALYSIS
According to [35], risk can be defined in terms of the following three questions: What scenario could occur? What is the likelihood that the scenario would occur? What is the consequence if the scenario does occur? We next consider these questions in order to formulate failure risk in SDNs.
What scenario could occur? We define the scenario as any undesirable event, such as a failure, that breaks the service down and therefore requires a solution (e.g. a path change). According to [36], there are three main types of failure scenarios that could affect an SDN networking system, namely controller failure (including hardware and software), communication component failure (i.e. node and link) and application failure (e.g. bugs in application code). We define the set of all scenarios as $S$, ranged over by variables $s_1, s_2, \ldots, s_n \in S$.
What is the likelihood a scenario would occur? The likelihood that a failure scenario disrupts the network services is conditional on the occurrence of the scenario. We address this question by the aid of online failure prediction that in our case works based on a scenario's failure probability, p ∈ [0, 1].
What is the consequence if the scenario does occur? We address this question by computing the percentage of loss, or consequence, $c$, that might potentially happen when a failure scenario is predicted at an early stage. Each failure scenario might lead to some disconnections and service disruption, and the severity of these adverse effects varies from one scenario to another. For instance, $c_1$ caused by $s_1$ might be different from $c_2$ caused by $s_2$, reflecting the outage costs that would result from disrupting some of the network connections.
Over a period of time, these questions produce a list of outcomes in the form of triplets $\langle s_i, p_i, c_i \rangle$. Utilising such information, risk can then be formulated as a set of triples:

$$Risk = \{ \langle s_i, p_i, c_i \rangle \}, \quad i = 1, \ldots, n \tag{5}$$

Failure scenarios may have many causes and different origins. However, in this paper we focus on only one type, i.e. link failure scenarios that hit the data plane. Because we consider only link failure scenarios, $s_{(e_{ij})}$, we refine the definition of risk in (5). Accordingly, we redefine the risk of damage as the combination of the probability of link failure and its consequence:

$$Risk(e_{ij}) = p(e_{ij}) \times c(e_{ij}) \tag{6}$$
To deduce the risk value, the two factors of (6), i.e. $p$ and $c$, can be assessed independently. On one hand, the probability, $p$, depends on the efficacy of the online failure predictor at determining the likelihood of incoming failure scenarios, which in this study is governed by a selected failure probability threshold value, $T_\Omega$. On the other hand, $c$ can be measured as the percentage of routes that would be affected by the anticipated scenario. The consequence score can be identified by utilising global topological characteristics of the network, such as Edge Betweenness Centrality (EBC). The edge betweenness centrality of a link $e_{ij}$ measures the shortest paths between pairs of nodes that traverse the edge $e_{ij}$ [37], and can be formulated as follows:

$$EBC(e_{ij}) = \sum_{v_i \neq v_j \in V} \frac{\Gamma_{v_i,v_j}(e_{ij})}{\Gamma_{v_i,v_j}} \tag{7}$$

where $\Gamma_{v_i,v_j}$ denotes the number of shortest paths between nodes $v_i$ and $v_j$, while $\Gamma_{v_i,v_j}(e_{ij})$ denotes the number of shortest paths between nodes $v_i$ and $v_j$ that go through $e_{ij} \in E$. For instance, Figure 3 shows an example topology with an EBC value for each link in the network, calculated with Ulrik Brandes' algorithm [38]. The network controller knows the demand traffic matrix between all pairs in the network, i.e. $Flow$. Therefore, equation (7) is in our case congruent with the following:

$$EBC(e_{ij}) = \frac{\Gamma_{flow}(e_{ij})}{\Gamma_{flow}} \tag{8}$$

where $\Gamma_{flow}$ denotes the total number of paths in the $Flow$ set, while $\Gamma_{flow}(e_{ij})$ denotes the number of paths in the $Flow$ set that pass through $e_{ij} \in M$. With the above context in mind, the higher the EBC value of $e_{ij}$, which is a normalised value between 0 and 1, the more critical the link is and therefore the higher the consequence score. This is because the failure of a link with a high EBC leads to a large number of path failures and therefore to a greater negative impact on the availability of network services.
Our goal in this analysis is to gauge the percentage of possible loss and provide such information to the concerned decision-making mechanism, i.e. the routing mechanism in our case. For more details about the existing risk analysis methods that fit SDNs, we refer the interested readers to [39].
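The risk score of a single link can be sketched in a few lines of Python. The sketch assumes that risk is the product of the predicted failure probability and a flow-based EBC, taken as the share of configured flows that traverse the link; the helper names and the example flows are hypothetical.

```python
def links_of(path):
    """Undirected links (as frozensets) traversed by a vertex sequence."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def flow_ebc(flows, link):
    """Share of configured flows whose path traverses `link` (between 0 and 1)."""
    if not flows:
        return 0.0
    hit = sum(1 for path in flows if link in links_of(path))
    return hit / len(flows)


def link_risk(p_fail, flows, link):
    """Assumed form of the risk score: probability times consequence."""
    return p_fail * flow_ebc(flows, link)


# Hypothetical configured paths, one per flow.
flows = [
    ["v1", "v2", "v3"],
    ["v1", "v4", "v3"],
    ["v2", "v3", "v5"],
    ["v4", "v3"],
]
link = frozenset(("v2", "v3"))

print(flow_ebc(flows, link))                  # 0.5
print(round(link_risk(0.8, flows, link), 2))  # 0.4
```

A link that carries no configured flow has a consequence of zero, so a prediction for it would never trigger a reconfiguration regardless of its failure probability.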

VI. FRAMEWORK DESIGN
From a high-level point of view, Figure 4 illustrates the main components of our proposed framework, where the Smart Routing and Prediction modules are the primary contribution of our work. We next discuss in more detail the components we used to develop this framework.

(a) SDN Controller
Our framework currently supports the POX controller [40], which is an open source SDN controller written in Python and is more suitable for fast prototyping than other available controllers such as [41]. The standard OpenFlow protocol is used for establishing the communication between the data and control planes, whereas the set of POX APIs can be used for developing various network control applications.

(b) Smart Routing
Firstly, this module is responsible for maintaining and parsing the underlying network topology. Topology parameters such as the number of nodes and links, the way they are connected and the port status can be detected via the Link Layer Discovery Protocol (LLDP) [42], which is one of the vital features of the current OpenFlow specification. The openflow.discovery component (footnote 1) is an already developed component that sends crafted LLDP messages out of OpenFlow nodes so that the topological view of the data plane layer can be constructed. This module then converts the discovered network topology into a graph $G$ representation for efficient management. To do so, we utilised the NetworkX tool [43], which is a pure Python package with a set of powerful functions for manipulating network graphs. When the network starts working and after the data plane topology has been shaped, the shortest path for each $flow \in Flow$ is configured by the appointed $SP_x$ algorithm and is thereafter stored in the Operational Routes table, which is specified to contain all the desired working (healthy) paths.

1 https://github.com/att/pox/blob/master/pox/openflow/discovery.py

To show how a link failure incident could affect the configured paths from the perspective of service availability and convergence time, we provide a simple example in Table III, in which the service deterioration of $flow_x$ due to a link failure incident is highlighted. To maintain the Operational Routes table, two algorithms have been implemented, each with its own strategy for keeping $Flow$ maintained.
Algorithm 1 depicts the default shortest path routing strategy that is performed by the network controller. We specify Dijkstra's algorithm [26], with complexity $O(|V| + |E| \log |V|)$, as the shortest path finder for Algorithm 1, which we denote by $SP_D$ instead of $SP_x$. Thus, $SP_D$ is a Dijkstra function that can be applied to any $flow_{set}$ to return a single shortest path. When the OpenFlow controller reports a link failure event, every path suffering from that failure is detected and two operations are issued by the controller. First, a Remove command, denoted by $OF_{Remove}$, is sent to all the routers that belong to each failed path in $Flow$ in order to remove the entries that no longer work correctly; then an alternative route is computed for every affected $flow$. The new flow entries of the alternative path are then forwarded to the relevant routers of each $flow$ through the Install command, denoted by $OF_{Install}$. Each modified $flow$, i.e. one assigned to an alternative path, is stored in a special set called the Labeled Flow ($LF$), where $LF \subset Flow$ and has length $n$. This indicates that each $flow \in LF$ is in a sub-optimal state. The procedure for recovering from a link failure is given in lines 1-13. The algorithm also includes the reversion procedure that is activated after a failed link recovers (lines 15-32), which is no less important than the recovery process [44]. This procedure is required to take into account the percentage of routing flaps, which is necessary for the experimental analysis. In fact, we developed this algorithm only for comparison against Algorithm 2. Therefore, it does not represent a contribution of this paper.

Algorithm 2 is one of the main contributions of this work; it exploits the prediction information to enhance the service availability and fault tolerance of SDNs.
Algorithm 2 depends on Bhandari's algorithm for finding $K$ edge-disjoint paths [45], which is used as a complementary building block of the smart routing strategy. We denote Bhandari's algorithm by $SP_B$ in place of $SP_x$. We consider $SP_B$ as a function that computes two link-disjoint paths with the least total cost for any given pair of nodes (i.e. $src$ and $dst$) or $flow_{set}$. To distinguish between the two paths returned by $SP_B$, we denote the first path as $flow_{b_1}$ and the second, disjoint one as $flow_{b_2}$. The time complexity of $SP_B$ differs from that of $SP_D$; it is polynomial and equivalent to $O((K + 1) \cdot |E| + |V| \log |V|)$.
The pseudo code of Smart Routing (SR) is given in Algorithm 2, in which $flow_{b_1}$ is initially selected as the primary path for each $flow$ in the network. The network controller then starts listening to the prediction module, which is discussed in the next section, for potential future incidents. When a new message ($m$) is received, the controller first identifies the potential failed list, which contains the information about the link that is expected to fail in the near future (lines 2-4). Secondly, the route (or routes) that might be affected according to the predicted failure message is computed as a preparatory step to replacing it (lines 5-7). After identifying the routes that may fail, the $EBC$ for the predicted link is calculated as a step towards measuring the risk (lines 8-10). If the risk value is below the threshold, the prediction information is ignored and no action is taken. Otherwise, the flow entries of the newly computed disjoint path from the second step are installed using the Install command. This is done by giving the disjoint path rules a lower priority than those of the primary path, to avoid conflicts in the matching and action processes.
Following this step, the forwarding rules of the risky primary paths need to be deleted in order to use TCAM resources efficiently. This is done with a procedure similar to the installation, but with the Remove command (lines 11-14). After swapping the primary, $flow_{b_1}$, with the disjoint, $flow_{b_2}$, this action is considered the correct decision for a certain period of time (i.e. $\Delta t_p$), as indicated in line 15. To examine whether the decision to change routes was justified, the link that was anticipated to go down within $\Delta t_l$ is compared against the failure set $F$. On one hand, if the link exists in $F$, the prediction is marked as $TP$. In addition, each $flow \in PFR$ is labelled as sub-optimal and stored in $LF$ (lines 16-18). On the other hand, if the link does not exist in $F$, the prediction is considered an $FP$. In such a case, it is necessary to reset the primary path to its initial (i.e. optimal) state, as described in lines 19-25. However, when there is a failure that is not captured by the prediction module, it is considered an $FN$, and such failures are tackled by calling Algorithm 1 (lines 28-30). Finally, Algorithm 1 is also invoked when a failed link is repaired (lines 32-34).
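The decision logic described above can be sketched as follows. This is not the authors' Algorithm 2: it assumes that the two disjoint paths per flow have already been computed (no implementation of Bhandari's algorithm is included), that risk is the product of the predicted probability and the flow-based EBC, and that installing and removing rules can be modelled as swapping two dictionary entries. All names and values are hypothetical.

```python
def links_of(path):
    """Undirected links (as frozensets) traversed by a vertex sequence."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def handle_prediction(active, backup, link, p_fail, risk_threshold):
    """Swap at-risk flows onto their disjoint path; return the swapped flow keys."""
    at_risk = {k for k, path in active.items() if link in links_of(path)}
    ebc = len(at_risk) / len(active) if active else 0.0
    if p_fail * ebc < risk_threshold:
        return set()  # risk too low: the message is ignored
    for k in at_risk:
        # Stands in for installing the disjoint path and removing the risky one.
        active[k], backup[k] = backup[k], active[k]
    return at_risk


def validate(active, backup, swapped, link, failed_links):
    """After the validity window: keep the swap on a TP, revert it on an FP."""
    if link in failed_links:
        return "TP"
    for k in swapped:
        active[k], backup[k] = backup[k], active[k]
    return "FP"


# Hypothetical state: primary (active) and disjoint (backup) path per flow.
active = {("v1", "v3"): ["v1", "v2", "v3"], ("v1", "v4"): ["v1", "v4"]}
backup = {("v1", "v3"): ["v1", "v4", "v3"], ("v1", "v4"): ["v1", "v2", "v3", "v4"]}
risky = frozenset(("v2", "v3"))

swapped = handle_prediction(active, backup, risky, p_fail=0.9, risk_threshold=0.3)
rerouted = list(active[("v1", "v3")])  # path in use during the validity window
verdict = validate(active, backup, swapped, risky, failed_links=set())

print(rerouted)                       # ['v1', 'v4', 'v3']
print(verdict, active[("v1", "v3")])  # FP ['v1', 'v2', 'v3']
```

In the example the link does not actually fail, so the prediction is classified as a false positive and the flow returns to its optimal primary path, at the cost of two route changes.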

(c) Prediction Module
Because historical failure data is lacking, this module is placed on top of the parsed network topology state obtained from the network controller. We consider each link in the network as an independent object of the link class. The link class currently contains eight attributes, as shown in Figure 5, which are used to control the up and down events. In the current implementation, we used a priority queue, Q, as a pool that holds all the non-faulty links. On one hand, equations (3) and (4) are essential for computing the two static attributes (MTBF and MTTR) of each link. For (3), we rely on the topology information in Section VII-C and assume that CC equals the minimum cable length in the network. For (4), we used the uniform distribution to generate γ for each link independently. On the other hand, the six remaining attributes are described as follows:
• ID: a unique numerical value (i.e. 1, . . . , n) assigned to the link as its identification number.
• F_Count: registers the number of times the link has failed.
• Length: represents the link's length in km, which is derived from the topology specification.
• Next_F: refers to the next time to failure of the link, which controls the enqueue and dequeue operations of the link. In other words, this attribute determines the link's life span in Q, where the link is dequeued when Next_F = 0.
• Probability_F: registers the current failure probability, p, of the link. For instance, the Probability_F of link j is defined as: where n is the length of Q.
• Status: reflects the current state of the link as either operational or faulty.
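As an illustration of the link object and the queue Q described above, the following Python sketch keeps links in a heap ordered by Next_F. The field names, the `dequeue_failed` helper and the use of `heapq` are assumptions made for this example; the source does not specify how its priority queue is implemented, and the computation of Probability_F is omitted.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Link:
    next_f: float                                  # time to next failure (queue key)
    link_id: int = field(compare=False)
    length_km: float = field(compare=False)
    mtbf: float = field(compare=False)             # static attribute
    mttr: float = field(compare=False)             # static attribute
    f_count: int = field(default=0, compare=False)
    probability_f: float = field(default=0.0, compare=False)
    status: str = field(default="operational", compare=False)


def dequeue_failed(queue, elapsed):
    """Advance time by `elapsed` and pop every link whose Next_F reaches zero."""
    for link in queue:
        link.next_f = max(0.0, link.next_f - elapsed)
    heapq.heapify(queue)                           # restore the heap after the update
    failed = []
    while queue and queue[0].next_f == 0.0:
        link = heapq.heappop(queue)
        link.status = "faulty"
        link.f_count += 1
        failed.append(link)
    return failed
```

Only `next_f` takes part in comparisons, so the heap always exposes the link that is closest to failing.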
On this basis, we have placed our online predictor scheme, as defined by Algorithm 3, on top of the priority queue in order to send encapsulated messages about the links that satisfy the following two conditions (lines 2-9): first, the probability of failure is greater than or equal to the threshold T_Ω; second, the leading time (i.e. ∆t_l) is less than or equal to the next time to failure.
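The two alarm conditions can be sketched as a simple filter. This is an illustrative reading of the text rather than Algorithm 3 itself; the function name `links_to_report` and the dictionary layout of a link are hypothetical.

```python
def links_to_report(links, t_omega, delta_t_l):
    """Select the links that satisfy both alarm conditions."""
    alarms = []
    for link in links:
        likely = link["probability_f"] >= t_omega       # condition 1: probability threshold
        enough_time = delta_t_l <= link["next_f"]       # condition 2: lead time fits
        if likely and enough_time:
            alarms.append({"link_id": link["id"], "next_f": link["next_f"]})
    return alarms
```

With a lead time of 120 s, for example, a link whose next failure is only 60 s away is not reported, since the controller would not have enough time to prepare.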

VII. EXPERIMENTAL SETUP AND DESIGN
Since Smart Routing aims to enhance SDN fault tolerance in terms of network service availability, we have implemented several metrics to allow a fair comparison between the traditional SDN and the proposed system. This section also presents the network topologies used in our experiments.

A. Availability Measurements
The convergence time required to shift from a failed or non-operational path to an alternative or backup one conforms with Equation (1). This convergence process damages the availability of some paths, as shown in Table III. To identify the serviceable flows, denoted by "Yes", and the unserviceable flows, denoted by "No", with respect to given failure events, we formulated this problem as follows: where "Yes" and "No" are obtained by intersecting each flow ∈ Flow against Q. A flow is labelled "Yes" when all of its constituent edges reside in Q; otherwise, the flow is considered unserviceable and labelled "No". By knowing the number of serviceable and unserviceable flows, the service unavailability, and thus the service availability, can be measured. The service unavailability of SDN (U_SDN) over a given time interval with a certain number of failure events, denoted by ev, can be arrived at as follows: For Smart Routing, it is important to further consider the impact of the Recall value. Hence, the service unavailability of SR (U_SR) can be arrived at through the following equation: Consequently, the availability A_x, with x = SDN or SR, can be arrived at through the following:
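The "Yes"/"No" labelling can be sketched as a subset test between the edges of a flow and the links currently held in Q. The function names are hypothetical, and the ratio returned by `unserviceable_fraction` is only an illustrative per-event quantity; it is not claimed to be the paper's definition of U_SDN or U_SR.

```python
def classify_flows(flows, operational_links):
    """Label each flow "Yes" if all of its links are operational, else "No"."""
    up = set(operational_links)
    return {name: "Yes" if set(path) <= up else "No"
            for name, path in flows.items()}


def unserviceable_fraction(labels):
    """Share of flows labelled "No" for one failure event (illustrative only)."""
    if not labels:
        return 0.0
    return sum(1 for v in labels.values() if v == "No") / len(labels)
```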

B. Routing Instability Measurements
In traditional networks, routing protocols (e.g. IGP [46]) perform two routing changes in reaction to every single failure: one when a failure occurs and another when the failure is repaired. Both changes are essential for the QoS: the first preserves service availability, while the second returns traffic from the backup (i.e. sub-optimal) path to the primary (i.e. optimal) path. In contrast, the SDN architecture brings centralisation and programmability to the scene; therefore, traditional distributed protocols are independent of the SDN architecture. Maintaining the optimal path (e.g. minimum hops in our case) of each flow requires a continuously adaptive strategy that replaces each sub-optimal flow with the optimal one once it becomes serviceable. To do so, we assume that each alternative flow is additionally stored in LF, as mentioned in Section VI. For SDN, the routing flaps (denoted by RF) can be measured by means of link up (denoted by u_f) and link down (denoted by d_f) events as follows: On one hand, according to (12), after each link down event a new route is required for each flow ∈ FR, which leads to a first routing change for each flow. On the other hand, after each link up announcement, the controller needs to check the state of each labelled flow in LF to determine whether it is still the optimal choice. If so, no change is made; otherwise, rerouting is required, which results in another routing change. For the Smart Routing mechanism, however, it is necessary to also consider the three prediction parameters (i.e. FN, TP and FP) as follows: According to (13), FN_f is equivalent to d_f in (12), as it reflects the actual failure events that have not been captured by the prediction module, while the remaining terms are as follows:
• Each true prediction leads to a first reroute flap that gives the advantage of avoiding an upcoming failure event. The second flap is similar to the scenario of RF_SDN: the flow is inserted into LF and the next flap builds upon the link restoration u_f.
• Each false prediction leads to two useless flaps. The first occurs when the prediction triggers an alarm, in which case each potential flow is added to the temporary labelled flow set (TLF) as a transient step before the prediction is recognised as false. The second flap is performed when ∆t_p expires.
We provide an overview of the process of measuring the number of routing flaps in the flow chart of Figure 6, which also shows how LF is adjusted under both algorithms, i.e. Algorithms 1 and 2.
Since all actions are associated with the link state, we utilise the OpenFlow protocol to reflect state changes of the data plane links by relying on the Link-State Advertisement (LSA), in addition to the proposed prediction module, which produces additional information about potential failures. Both the LSA and the prediction information are delivered to the controller through the Updater so that the appropriate action can be applied, as illustrated in the flow chart.
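The flap accounting described in this subsection can be approximated with a small counter. This is a simplified sketch and not a transcription of equations (12) and (13): it assumes that every event is given together with the number of flows it reroutes, and that only the two flaps caused by a false prediction count as useless. The function name and event labels are hypothetical.

```python
def count_flaps(events):
    """Count total and useless routing flaps from (kind, n_flows) events.

    kind is one of "down", "up", "FN", "TP" or "FP"; n_flows is the number
    of flows rerouted by that event (an assumption of this sketch).
    """
    total = useless = 0
    for kind, n_flows in events:
        if kind in ("down", "up", "FN", "TP"):
            total += n_flows          # one routing change per rerouted flow
        elif kind == "FP":
            total += 2 * n_flows      # reconfiguration plus reversion
            useless += 2 * n_flows    # both changes were unnecessary
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return total, useless
```

Dividing the second value by the first gives the share of useless flaps for a simulated run.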

C. Simulated network topologies
In order to evaluate the proposed method, we have modelled three core network topologies, as illustrated in Table IV. Both janos-us and germany50 represent real network topology instances defined in [47], while the synthetic waxman topology is created by the Internet topology generator Brite [48] using the well-known Waxman model [49]. Waxman's model is a geographical approach that connects routers distributed in a plane on the basis of the distance between them, given by the following probabilistic formula:

$$P(v_i, v_j) = \alpha \, e^{-d / (\beta L)}$$

where 0 < α, β ≤ 1, d represents the distance between v_i and v_j, and L represents the maximum distance between any two given nodes. The number of links among the generated nodes is directly proportional to the value of α, while the edge distance increases as the value of β increases. We used Brite to generate a topology that is large in scale compared to the others (e.g. with the number of edges or nodes ≥ 100). The characteristics of all the modelled topologies are detailed in Table IV.

Fig. 6. Flow chart of routing flaps
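To make the roles of α and β concrete, the following Python sketch samples a small Waxman-style graph. It assumes the commonly cited form of the model, in which the edge probability is α·exp(−d/(βL)), and places nodes uniformly in the unit square; it does not reproduce Brite's implementation, and the function names are hypothetical.

```python
import math
import random


def waxman_probability(d, max_d, alpha, beta):
    """Edge probability alpha * exp(-d / (beta * L)) for two nodes at distance d."""
    return alpha * math.exp(-d / (beta * max_d))


def waxman_graph(n, alpha, beta, seed=0):
    """Place n >= 2 nodes uniformly in the unit square and sample Waxman edges."""
    rng = random.Random(seed)
    pos = [(rng.random(), rng.random()) for _ in range(n)]
    dist = {(i, j): math.dist(pos[i], pos[j])
            for i in range(n) for j in range(i + 1, n)}
    max_d = max(dist.values())                     # L: largest pairwise distance
    edges = [pair for pair, d in dist.items()
             if rng.random() < waxman_probability(d, max_d, alpha, beta)]
    return pos, edges
```

A larger α raises the probability of every candidate edge, whereas a larger β slows the decay with distance and therefore admits longer edges.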

D. Experimental Design and Implementation
In order to validate our approach, the proposed framework is built on top of the POX controller. We evaluated our framework prototype using the container-based emulator Mininet [50]. Mininet is a widely used emulation system, as evidenced in a recent survey [10], for evaluating and prototyping SDN protocols and applications. It can also be used to create realistic virtual networks, running real kernel, switch and application code, on a single machine (VM, cloud or native). Our experiments were designed based on the topologies described in the preceding section. Since one of our experimental topologies was generated with Brite, we utilised the Fast Network Simulation Setup (FNSS) [51]. FNSS is a Python-based toolchain that facilitates the setup of network experiments. It provides a wide range of functions and adapters that allow network researchers to parse graphs from different topology generators, such as Brite, so that they are compatible with, or can interface with, other simulator and emulator tools, such as Mininet.
Based on the failure event model (Section IV-B), the general reliability theory [52] has been utilised to generate failure events using the exponential distribution (mean = MTBF) for the next time to failure of each link, and a lognormal distribution with parameters µ and σ, where $\sigma = \log\left(1 + (0.6 \times MTTR)^2 / MTTR^2\right)$, for the time to recover. Regarding failure anticipation, false and true positives were generated during the simulated time using the uniform distribution according to the specified threshold value. Figure 7 summarises the simulated link queuing system, which is driven by the two reliability metrics, i.e. MTBF and MTTR. In order to dispatch the prediction information required by the Smart Routing module, the distributed messaging framework ZeroMQ [53] was used to carry the alarm messages, M, from the prediction module to the network controller interface. Under some network flow conditions, such a message activates the Smart Routing module to begin a possible reconfiguration. In the emulation environment, we employed two servers: one acts as the OpenFlow controller and the other simulates the network topologies. Each server ran Ubuntu 14.04 LTS on an Intel Core-i5 processor equipped with 8 GB RAM.
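The sampling of failure and repair times can be sketched with the Python standard library. The sketch assumes the textbook moment-matching parametrisation of the lognormal distribution, i.e. σ² = ln(1 + s²/m²) and µ = ln(m) − σ²/2 for a target mean m = MTTR and standard deviation s = 0.6 × MTTR; this may differ from the exact parametrisation used in the experiments. The function names are hypothetical.

```python
import math
import random


def lognormal_params(mean, std):
    """Return (mu, sigma) of a lognormal with the given mean and standard deviation."""
    sigma2 = math.log(1.0 + (std / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)


def sample_failure_cycle(mtbf, mttr, rng):
    """Draw one (time_to_failure, time_to_recover) pair for a link."""
    ttf = rng.expovariate(1.0 / mtbf)                # exponential with mean MTBF
    mu, sigma = lognormal_params(mttr, 0.6 * mttr)   # std assumed to be 0.6 * MTTR
    ttr = rng.lognormvariate(mu, sigma)
    return ttf, ttr
```

Passing a seeded `random.Random` instance as `rng` makes a simulated failure trace reproducible.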

VIII. KEY ADVANTAGES OF SMART ROUTING
In this section, we present a comparison and evaluation of the proposed method against the default SDN technique. The study was conducted on the three topologies summarised in Table IV. To simulate the three topologies, we ran the emulator for 144 hours in total, i.e. each experimental topology was simulated for 48 hours. Figure 8 shows the results obtained from the three topologies with parameter settings of T_Ω = 0.25, T_ω = 0.1, ∆t_l = 120 s and ∆t_p = 30 s.
As discussed earlier, the T_Ω and T_ω values can be selected by the network operator or by using additional algorithms (e.g. machine learning) to identify near-optimal values. Since the main goal of Smart Routing is to enhance network service availability, we plot, for each network, the service availability percentage (Y-axis) against the rate of routing flaps (X-axis) for the default SDN and SR mechanisms. Furthermore, for SR, the performance of the online failure predictor, represented by the values of Recall and Precision, is reported for each topology. The Recall value has a crucial impact on the service availability in the SR scheme, whereas the Precision value has an impact on the unnecessary routing changes. It can be clearly observed that SR outperformed the default SDN in providing network service availability in all test cases. In spite of the low Recall values (i.e. 0.2-0.3), there is still a gain in service availability. Moreover, janos-us gained the highest percentage improvement in service availability because its Recall value is greater than that of the other topologies.
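For reference, Recall and Precision can be computed from the prediction outcome counts using their standard definitions, Precision = TP / (TP + FP) and Recall = TP / (TP + FN). The helper below is a generic sketch with a hypothetical name and made-up counts in the usage note; it does not reproduce any measured value.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from prediction outcome counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # share of alarms that were right
    recall = tp / (tp + fn) if tp + fn else 0.0      # share of failures that were predicted
    return precision, recall
```

For instance, 3 true predictions, 1 false alarm and 9 missed failures would give a Precision of 0.75 and a Recall of 0.25.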
On the other hand, the rate of routing flaps generated by SR is always higher than that of SDN. This disadvantage comes as a trade-off for improving the network service availability. Given that routing instability in the form of unnecessary flaps is correlated with the Precision value, we measured only the useless flaps that were generated during the simulation time for each topology, as shown in Figure 9. Figure 9(a) shows the unnecessary routing changes that were reported based on the FP rate of each topology, where every single FP is associated with two useless flaps, that is, one for the reconfiguration and the other for the reversion. Figure 9(b) shows the percentage of useless routing flaps for each topology in comparison with the total number of flaps. In the worst case, the useless routing flaps did not exceed 25%. Although the janos-us topology has the highest Precision value, it yielded a relatively high percentage of useless flaps because the number of links in the topology is low; hence, it is highly likely that each single link is associated with a large number of routes, in contrast to the other two topologies. It is also evident that the online failure prediction plays a significant role in both service availability (through TP) and routing flaps (through FP). Based upon the experiments and simulations, we make the following observations:
• Some alternative routes are considered optimal after receiving an updater message, even though the received update does not involve their constituent paths. The reason for this is that the current system defines the optimal path based on the number of hops. Therefore, each alternative path that has the same number of hops as the optimal one is considered to be an optimal path. This might not be the case if the adopted mechanism does not rely on the number of hops, e.g. if it uses a cost function with other parameters such as bandwidth, congestion or energy.
• In some cases the algorithm is barely able to find two disjoint paths; therefore, if a path faces two successive predictions on its links, no change is made. Hence, we used (≈) instead of (=) in the output of Algorithm 2 to imply that an entirely empty PFR cannot always be guaranteed.
• It is also possible that a flow ∈ LF may face one or more risky links; in such a case, the state of the affected flow remains the same (i.e. sub-optimal).
• In some cases, when Next_F < 2 min, the controller ignores a generated prediction because ∆t_l is not satisfied, so the controller would not have enough time for the preparation process.
IX. CONCLUSION AND FUTURE WORK
This paper has demonstrated how to use online failure prediction to enhance SDN service availability. We presented a new model for SDNs that tackles the problem of data plane link failures. Our work differs from existing contributions by giving SDN controllers a time window in which to reconfigure the network before the anticipated failure occurs, thereby avoiding interruption to the availability of network services. The proposed model was implemented using a pair of new algorithms that extract the risky links from paths. Hence, when such risky links fail, no path is affected. Our experiments were performed over a number of network topologies in conjunction with the link failure event model. The experimental findings demonstrate the effectiveness of the proposed method in enhancing SDN service availability. A major drawback of this approach is the rate of routing flaps that results from the failure prediction process, which may lead to network instability, especially when it reaches high rates. For this reason, we measured the percentage of unnecessary routing changes; in the worst case it was 25%, which we consider to require improvement in future research.
As further future work, we will position the study in the setting of machine learning algorithms in order to achieve more flexibility in the decision-making process and to allow it to be gauged against optimal threshold values. We are also planning to extend this work to consider disaster situations, which involve multiple link failures.