Survivable Resource Orchestration for Optically Interconnected Data Center Networks

We propose resource orchestration schemes in overlay networks enabled by optical network virtualization. Based on the information from underlying optical networks, our proposed schemes provision the fewest data centers to guarantee K-connect survivability, thus maintaining resource availability for cloud applications under any failure.

Mo.3.E.3.pdf


Introduction
As more applications and workloads move to the cloud, geographically distributed data centers (DCs) are being deployed across optical networks. Cloud applications rely on distributed DCs for improved user experience [1]. However, cloud providers may not own optical network infrastructure and must rely on network providers to optically interconnect distributed DCs. One example is the combination of IBM SmartCloud and AT&T virtual private networking for global cloud services. Usually, network providers are unwilling to expose their full network topology information to cloud providers. Hence, it is critical to investigate an overlay framework that enables cloud providers to control cloud network connections and optimize resource orchestration without detailed network information.
Many cloud applications in distributed DCs are arranged in an aggregation communication pattern [2], whereby an aggregation DC collects data processed at distributed DCs and outputs final results to users. Cloud applications can make physically dispersed virtual machines (VMs) operate logically as one DC by collecting results from the dispersed VMs at an aggregation DC. Applications such as cloud search and data backup can allocate VMs close to data stored in distributed DCs and provide results at an aggregation DC for access by users. Complicated communication patterns can be constructed by scheduling a sequence of data aggregations [2,3].
Due to the reliance on distributed DCs and aggregation DCs, survivability becomes an important issue for cloud applications. K-connect survivability is defined as follows: at least K DCs (out of M original working DCs) remain connected with an aggregation DC (DC_a) under any failure. When a failure occurs, additional VMs can be allocated at the surviving K DCs to maintain resource availability, taking advantage of the mobility of VMs. We consider a single shared risk group (SRG) failure, which can result in multiple failures at DC sites and in networks (due to fiber cuts, power outages, natural disasters, etc.). We assume that DC_a cannot fail; otherwise, a new cloud request should be initiated.
In this paper, we present an overlay framework that interconnects distributed data centers by virtualized optical networks. We propose survivable resource orchestration schemes based on the network information provided by virtualized optical networks, such as SRG and delay. Our proposed schemes provision the fewest working DCs to guarantee K-connect survivability. Previous research [4] provides reliable anycast or manycast by disjoint routing on physical network topologies. Our resource orchestration schemes differ from existing work in that they provision the fewest working DCs based on overlay network information, where the physical network topology may be unavailable and routing for connections may not be possible.

Framework for Resource Orchestration
An overlay framework based on optical network virtualization is presented in Fig. 1. The overlay network comprises point-to-point connections between DCs, where connection bandwidth is adjustable by optical network virtualization technologies [5]. The underlying optical network can adopt the optical transport network technology or flexible optical data planes (e.g., flexible transceivers) to adjust connection bandwidth.
Cloud providers have a centralized controller that manages DCs interconnected by the overlay network. The controller obtains connection information, such as delay and SRG, and requests connection bandwidth through network application programming interfaces (APIs) provided by software defined networks (SDN) [5]. Such a framework spares network providers from exposing their physical network topologies, while allowing cloud providers to easily set up cloud services, perform resource orchestration, and flexibly adjust connection bandwidth without considering intermediate network devices along connection paths.

The controller receives cloud requests and performs resource orchestration. A basic aggregation request is shown in Fig. 2(a) [2]. Complicated requests can be generated using a combination of basic requests [2,3]. Each request must satisfy K-connect survivability. When a failure occurs, a request with K-connect survivability can allocate additional VMs at the surviving K DCs (out of M original working DCs) in order to maintain a certain computing capacity. Assume that each DC is allocated the same number of VMs for a request and that each request needs to maintain V virtual machines under any failure. Each DC must then be able to host V/K VMs, so that the K surviving DCs together provide V, and the total number of VMs required for a request with K-connect survivability is V·M/K. Finding the smallest M that satisfies K-connect survivability therefore results in the fewest VMs required for a request.
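The V·M/K relation above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `total_vms` is hypothetical, and rounding V/K up to a whole number of VMs per DC is an assumption that the text does not state.

```python
from math import ceil


def total_vms(v_required: int, k: int, m: int) -> int:
    """Total VMs provisioned over M working DCs so that any K survivors host V VMs.

    Each DC is sized for V/K VMs; rounding up to an integer is an assumption.
    """
    if not 1 <= k <= m:
        raise ValueError("need 1 <= K <= M")
    per_dc = ceil(v_required / k)
    return per_dc * m


# With V = 12 and K = 3, every extra working DC costs 4 more VMs.
for m in (4, 5, 6):
    print(m, total_vms(v_required=12, k=3, m=m))
```

For fixed V and K the total grows linearly in M, which is why the schemes aim for the fewest working DCs.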
Guaranteeing K-connect survivability can save network cost by jointly considering failures at network connections and DCs. In Fig. 2(b), where s_i indicates SRG i, network connections are blindly protected by providing disjoint paths (dotted lines), and 2-connect survivability is guaranteed by protecting against failures at DCs separately from failures at network connections. In Fig. 2(c), with SRG information on overlay networks, 2-connect survivability is also guaranteed, but by finding connections and DCs that can be jointly protected. This allows for significant savings in network resources (three fewer protection connections than in Fig. 2(b)).
When multiple subsets of DCs satisfy K-connect survivability, the subset with minimum delay can be chosen. The delay of a request is the total delay of the connections between the subset of DCs and the aggregation DC (DC_a). DC_a can be allocated to a DC that is relatively near to users or relatively near to a particular subset of DCs, depending on the application.

K-Connect Survivability Problem Description
Given: An overlay network with N DC sites and a set of L SRGs, S = {s_1, s_2, …, s_l, …, s_L}. In the overlay network, each connection E_ij between DC_i and DC_j has network information including its delay, d_ij, and a vector of associated SRGs, A_ij. Similarly, each DC_i is associated with a set of SRGs, A_ii. Also given is a request that requires K DCs to remain connected to an aggregation DC, DC_a, under any failure. Joint protection is considered.

Find: the least number M of working DCs such that
1. min ∑ d_aj, where 1 ≤ j ≤ M, which minimizes the total delay of the request; and
2. K DCs remain connected to DC_a under any failure, which guarantees K-connect survivability.

The K-connect survivability (KCS) problem can be proven NP-complete by a reduction from Weighted Set Cover (WSC) to KCS.
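For very small instances, the problem statement above can be made concrete with an exhaustive search. The following is a sketch under stated assumptions, not one of the proposed schemes: the names `is_k_connect` and `solve_kcs_exact`, the example delays, and the example SRG sets are all hypothetical, and `srgs_of[j]` is assumed to hold the SRGs of both the connection to DC j and DC j itself (joint protection).

```python
from itertools import combinations


def is_k_connect(chosen, srgs_of, k, all_srgs):
    """True if at least k chosen DCs stay connected to DC_a under any single SRG failure."""
    return all(
        sum(1 for j in chosen if s not in srgs_of[j]) >= k
        for s in all_srgs
    )


def solve_kcs_exact(delay, srgs_of, k):
    """Return (subset, total_delay) with the fewest DCs, ties broken by delay, or None."""
    all_srgs = set().union(*srgs_of.values())
    candidates = sorted(delay)
    for m in range(k, len(candidates) + 1):  # try the smallest M first
        feasible = [
            c for c in combinations(candidates, m)
            if is_k_connect(c, srgs_of, k, all_srgs)
        ]
        if feasible:
            best = min(feasible, key=lambda c: sum(delay[j] for j in c))
            return best, sum(delay[j] for j in best)
    return None  # request is blocked: no subset satisfies K-connect survivability


# Hypothetical instance: delay d_aj and SRGs of each pair (connection plus DC).
delay = {2: 5, 3: 7, 4: 9, 5: 4}
srgs_of = {2: {"s1"}, 3: {"s1", "s2"}, 4: {"s3"}, 5: {"s2"}}
print(solve_kcs_exact(delay, srgs_of, k=2))  # ((2, 4, 5), 18)
```

The search is exponential in N, so it only serves as a reference for checking heuristic results on toy inputs.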

Proposed Solutions
Two novel heuristic algorithms are proposed for solving the KCS problem in optically interconnected distributed DC networks. In both algorithms, a matrix is constructed for each aggregation DC_a. For each s_l, the matrix records 1 if a pair p_ij, consisting of a connection E_ij and a DC_j, is associated with s_l. Table 1 shows a matrix constructed for DC_1 in Fig. 3. A parallel matrix (#p_l) records the number of currently chosen pairs that are associated with s_l. For example, in Table 1 [...]

RiskBased: In the DelayBased scheme, pairs chosen early may be associated with many SRGs, so more working DCs are required. Hence, RiskBased sorts the p_aj pairs in increasing order of the total frequency of the SRGs that are associated with each p_aj. The frequency of an SRG is defined as the number of p_aj pairs that are associated with that SRG. The other steps are similar to DelayBased.
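The RiskBased ordering can be sketched as follows. The SRG frequency and the sort key follow the definitions above. The selection loop is an assumption, since the step-by-step procedure is not reproduced here: it adds pairs in sorted order and stops once M minus the largest #p_l is at least K, which is exactly the condition that K chosen pairs survive any single SRG failure. All function names and the example data are hypothetical.

```python
from collections import Counter


def srg_frequency(pair_srgs):
    """Frequency of an SRG = number of p_aj pairs associated with it."""
    return Counter(s for srgs in pair_srgs.values() for s in srgs)


def risk_based_order(pair_srgs):
    """Sort pairs by the total frequency of their SRGs (ties broken by DC index)."""
    freq = srg_frequency(pair_srgs)
    return sorted(pair_srgs, key=lambda j: (sum(freq[s] for s in pair_srgs[j]), j))


def greedy_select(order, pair_srgs, k):
    """Assumed selection loop: add pairs in order until M - max_l #p_l >= K."""
    chosen, count = [], Counter()  # count[s] plays the role of #p_l
    for j in order:
        chosen.append(j)
        count.update(pair_srgs[j])
        if len(chosen) - max(count.values(), default=0) >= k:
            return chosen
    return None  # not enough pairs to satisfy the K-connect constraint


pair_srgs = {2: {"s1"}, 3: {"s1", "s2"}, 4: {"s3"}, 5: {"s2"}}
delay = {2: 5, 3: 7, 4: 9, 5: 4}

risk_order = risk_based_order(pair_srgs)   # [4, 2, 5, 3]
delay_order = sorted(delay, key=delay.get)  # [5, 2, 3, 4], assumed DelayBased order
print(greedy_select(risk_order, pair_srgs, k=2))   # [4, 2, 5]
print(greedy_select(delay_order, pair_srgs, k=2))  # [5, 2, 3, 4]
```

In this toy instance the risk-sorted order reaches 2-connect survivability with three working DCs, whereas the delay-sorted order needs four because DC 3, chosen early, shares SRGs with two other pairs.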

Simulation Results
We simulate the 75-node CORONET topology [6]. Overlay networks are generated whose DCs are located at randomly chosen nodes in CORONET. The shortest paths are used for connections between DCs. Requests are generated by assigning aggregation DCs to each DC in the generated overlay networks until 10^5 requests are successfully allocated. We assume that an arbitrary amount of bandwidth and VMs can be requested from the underlying optical infrastructure.

Fig. 4 compares the least M working DCs required and the average delay of requests as K increases. In this comparison, the total number of SRGs in a network is 60. Each physical link or DC is associated with R randomly chosen SRGs (R = 2 or 3). The total number of DCs (N) in an overlay network is 10. Fig. 4(a) shows that RiskBased requires up to 12% fewer working DCs than DelayBased. The least number of working DCs increases to satisfy the increasing K-connect constraint. When K = 6 (or 5) and R = 2 (or 3), the least number of working DCs is close to the total number of DCs, N. Fig. 4(b) shows that, as K increases, the average request delay increases because more working DCs are required, and the difference in delay between the two schemes shrinks. When K ≤ 5 (or 4) and R = 2 (or 3), RiskBased results in longer delay than DelayBased, even with fewer working DCs, since a connection with lower SRG frequency may have longer delay. When K = 6 (or 5) and R = 2 (or 3), the number of required working DCs is close to N, so the choices of working DCs are limited and both schemes perform similarly.

Fig. 5 compares the least M and the average delay of requests as N increases. Here, K = 4. RiskBased requires fewer working DCs than DelayBased for different N. When R = 3 and N ≤ 12, a higher N needs more working DCs, since a higher N results in more successful requests, each of which requires more working DCs. Both schemes have similar delay due to the limited choices of working DCs. When N > 12, the least number of working DCs decreases as N increases, since there are more than enough DCs to satisfy the K-connect constraint and solutions with fewer working DCs can be found. RiskBased results in longer delay since a connection with lower SRG frequency may have longer delay. Note that both schemes have the same request blocking, because M is iteratively incremented until resource allocation for a request is successful.
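The random SRG association used in the simulation setup can be sketched as below. This is an illustration only: the function names, the seed, and the link and DC labels are hypothetical, and deriving the SRGs of an overlay pair as the union of the SRGs of the physical links on its shortest path and of its DC is an assumption that the setup description does not spell out.

```python
import random


def assign_srgs(items, num_srgs=60, r=2, seed=0):
    """Associate each physical link or DC with r distinct, randomly chosen SRGs."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return {item: set(rng.sample(range(1, num_srgs + 1), r)) for item in items}


def overlay_pair_srgs(path_links, dc, srgs):
    """SRGs of an overlay pair: union over its path's links and its DC (assumption)."""
    result = set(srgs[dc])
    for link in path_links:
        result |= srgs[link]
    return result


# Hypothetical labels for two physical links and one DC site.
srgs = assign_srgs(["link-1", "link-2", "dc-7"], num_srgs=60, r=2)
pair = overlay_pair_srgs(["link-1", "link-2"], "dc-7", srgs)
print(len(pair) <= 6)  # at most R SRGs per link or DC, fewer if some coincide
```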

Conclusions
Resource orchestration schemes are proposed for provisioning the fewest data centers to guarantee K-connect survivability on virtualized optical overlay networks. Compared to DelayBased, RiskBased requires fewer working DCs but incurs longer request delay. DelayBased is therefore suitable for delay-sensitive cloud applications. Future work will investigate potential savings in both network and VM resources by applying K-connect survivability.