Programmable OPS/OCS hybrid data centre network

On the basis of a thorough understanding of data center (DC) traffic demands and optical switching technologies, we present a hybrid optical network design for future data center networks (DCNs). The design integrates optical circuit switching (OCS) and optical packet switching (OPS) schemes via hybrid Top-of-the-Rack (ToR) switches, which provide flexible function switchover between different traffic patterns in the DCN. Simulations of network behavior under such DC traffic loads indicate that the proposed OPS/OCS DCN can effectively improve network performance. Moreover, a preliminary analysis of OCS and OPS network configurations shows that building a hybrid DCN is the most cost- and energy-efficient way to upgrade a DCN while offering a promised quality of service. An experimental demonstration shows a complete implementation of data center virtualization in the proposed hybrid data center network.


Introduction
Recent trends show network applications moving from private clouds to public cloud data centers. As shown in the Cisco Global Cloud Index 2016 [1], annual global cloud IP traffic will reach 14.1 ZB by the end of 2020, up from 3.9 ZB in 2015. Around 68 percent of cloud workloads will be in public cloud data centers by 2020. Data, logic and applications are migrating to the cloud. The increasing need for data center and cloud resources will further drive the development of large-scale public cloud data centers, i.e., hyperscale data centers. A hyperscale data center, usually operated by a large Internet-driven company such as Google or Facebook, can reduce capital expenditure (CapEx) with a sophisticated operation and maintenance team. Large-volume purchases of key equipment, including switches and transceiver modules, give hyperscale operators strong negotiating power in the market, which pushes CapEx down further. More and more hyperscale data centers are expected to be built.
The traditional multi-tier DCN architectures face great challenges in supporting ever-expanding, large-scale co-located DCs [2]. Firstly, the over-subscription ratio may exceed 20:1 in core switches and 4:1 in aggregation switches [3], which becomes the bottleneck for the dominant traffic inside DCNs, i.e., the east-west traffic. According to the Cisco report, east-west traffic will represent around 86% of total data center traffic by 2020 [1]. Secondly, latency in multi-tier DCNs increases dramatically at large scale, as the queueing and processing time at each hop grows when DCNs scale up. Such high latency makes the DCN unable to provide latency-sensitive services, especially for emerging 5G applications, and it also affects the satisfaction of current users. The third challenge of the multi-tier DCN architecture comes from the increased power consumption and total cost of high-radix electrical switches. Thus, the multi-tier DCN architecture is less than ideal for supporting today's low-latency, virtualized applications. In current deployments, a swift and dramatic shift to the "leaf-and-spine" architecture is happening [4]. The leaf-and-spine architecture promises better support for east-west traffic. However, the switch fabric requires a large number of fiber connections and high-radix electrical switches. Currently, it is very challenging to build a single electrical switching chip with both a high radix and high per-port bandwidth, due to the limitations on bandwidth at the edge of chips and power constraints. Thus many low-radix switching chips are connected in a Clos topology to build a high-radix chassis switch [5], which results in high power consumption and low port density. The ITRS (International Technology Roadmap for Semiconductors) predicts only modest growth in per-pin bandwidth and pin count over the next decade [6]. Moreover, the leaf-and-spine DCN architecture treats all leaf switches equally and cannot efficiently handle traffic flow locality, such as hot Top-of-Rack (ToR) switches or servers. It is therefore very challenging to build hyperscale DCNs with the leaf-and-spine architecture.
Currently, optical signaling is mainly used for point-to-point interconnections in DCs. The 400 Gigabit Ethernet (400G, 400GbE) and 200 Gigabit Ethernet (200G, 200GbE) standards were officially released by the IEEE P802.3bs Task Force in December 2017 [7]. Optical fiber-based transmission technologies can provide huge bandwidths for future hyperscale DCs.
Regarding switching technologies in DCNs, optical switching technologies, such as optical circuit switching (OCS), optical packet switching (OPS), and optical time division multiplexing (TDM), could potentially provide low-cost and power-efficient optical switching for intra-DCN communications. Compared with electrical switching technologies, optical switching is less flexible because of the unavailability of optical random-access memory (RAM). However, optical switching technologies offer some advantages that are very attractive for DCN applications. Firstly, optical switching technologies could provide high-radix network switches with silicon and nanophotonic technologies. A high-radix optical switch could reduce hop and switch counts and help to reduce latency and power consumption in DCNs [5]. Secondly, optical switching technologies are transparent to optical signal formats, which provides better compatibility with different transmission standards. Thirdly, advances in fiber-optic technology, such as wavelength division multiplexing (WDM) and space division multiplexing (SDM), could increase fiber-link capacities with multiple parallel links. For future hyperscale data centers, optical switching and transmission technologies will play a significant role in supporting the ever-growing traffic demands.
In order to take full advantage of optical switching technologies in data center networks, new data center architectures are required for future hyperscale DCNs. As mentioned before, the unavailability of optical RAM makes an all-optical DCN impossible. Traditional electrical switching will be a good complement to optical switching technologies for DCN interconnections. The ToR switch needs to be redesigned to support optical switching technologies in DCNs. In the past years, there have been many explorations of optical switching solutions in DCNs [8][9][10][11][12][13][14]. However, these proposed optical DCN solutions focused on leveraging a single optical technology, e.g., OCS or OPS. The lack of flexibility of these network architectures leads to relatively poor performance for dynamic traffic in DCNs [15][16][17][18]. The tremendous variation and diversity in the communication matrix over space and time in DCNs require a more flexible and dynamic DCN architecture. In addition, another challenge for future hyperscale DCNs comes from data center virtualization. In the future, cloud providers will require the capability to dynamically allocate cloud resources to multiple virtual DCNs. DCN embedding is one of the mandatory features for future hyperscale DCNs. However, DCN virtualization becomes even more challenging in DCNs that involve optical switching [19].
In this paper, we summarize our recent research on a programmable OCS/OPS hybrid DCN architecture. The proposed hybrid data center architecture offers several features: 1) dynamic deployment of DCN network functions for unicast and broadcast traffic; 2) OPS/OCS switch-over, enabled by FPGA-based NICs/ToRs, that supports variable link bandwidth with fine granularity; 3) topology adaptation in the packet-switching based sub-network. The function programmability is based on the architecture-on-demand (AoD) concept [20]. The network topologies and network functions can be reconfigured according to network traffic estimation or prediction. The key enabling technologies include an FPGA-based OCS/OPS reconfigurable network interface card (NIC) [21], synchronized TDM switching [22], and parallel interconnections based on space division multiplexing (SDM) or WDM [23]. A complete software stack is developed to enable virtual DCs in the test platform [24].
This paper is organized as follows. In Section II, we review the applications of optical switching technologies in recently proposed optical DCN solutions and outline the benefits of a hybrid DCN architecture. The proposed architecture of the programmable optical circuit/packet switching hybrid DCN is presented in detail in Section III. In Section IV, simulation work on the proposed programmable DCN architecture is presented. In Section V, a recent experimental demonstration of data center virtualization is reported. Conclusions of our work are given in Section VI.

Review of optical switching technologies in DCN
Optical switching technologies promise a better solution for future hyperscale DCNs. The available optical switching technologies can be categorized as circuit switching or packet switching. Here we review the benefits and limitations of OCS and OPS schemes by summarizing their applications in recently proposed optical DCN solutions.
For most OCS-based DCN solutions, Micro-Electro-Mechanical Systems (MEMS) or beam-steering based large-port-count fiber switches (LPFS) are utilized as central switches which connect all the ToRs to form a flattened network infrastructure [8][9][10]. The scalability of the OCS-based DCN architecture is ensured by the high radix of the LPFS, which reaches thousands of ports [25]. Optical links can be set up between ToR pairs directly through the fiber switches and variable capacity can be assigned to each link by leveraging wavelength division multiplexing (WDM) technologies (e.g., WDM transceivers, spectrum selective switches (SSS), etc.). Capacity of up to terabits per second for each link is feasible with low-speed electronics on the ToR by assembling multiple optical channels. High-capacity smooth data flows between different racks can be accommodated with low latency once the optical circuit link is set up. The degree of connectivity provided by a ToR can be enhanced by dense fiber connections or advanced space division multiplexing (SDM) fiber technology [26]. Thus, long-lived bulk data transfers with bounded degrees (e.g., data migrations, backups, inter-processor communications) can be accommodated by the OCS network.
However, to accommodate short-lived traffic patterns with a high communication radix, an OCS-based DCN needs to either reconfigure the DCN topology frequently [8,9] or send traffic through multi-hop indirect connections to remote servers [10]. The former solution suffers from the long reconfiguration time of fiber switches, which is on the order of a few milliseconds. The latter solution requires Optical-Electrical-Optical (O-E-O) conversions at each hop, which introduces channel congestion at intermediate switches as well as significant latency on the multi-hop path.
On the other hand, OPS-based DCN solutions provide packet-level switching, which is better suited to dynamic traffic patterns. Two major approaches have been reported for OPS-based DCN realization: the Arrayed-Waveguide Grating Router (AWGR) based scheme [11,12,14] and the Semiconductor Optical Amplifier (SOA) based scheme [13,27]. The former approach allocates different wavelengths to different connections in the DCN by connecting all ToRs to AWGR switches. Tunable Wavelength Converters (TWCs) or Fast Tunable Lasers (FTLs) are deployed at each ToR to address a specific destination port of the AWGR by selecting the appropriate wavelength. The latter approach relies on SOA-based fast switches which can reconfigure the DCN in nanoseconds [28]. This scheme provides higher flexibility, as each connection is not limited by the channel grid of an AWGR. The capacity it can support is therefore adaptive, since different wavelengths can be aggregated into a single connection, which is the so-called "waveband switching" in the OPS scheme.
The challenge of the OPS scheme lies in system complexity and scalability. Due to the lack of optical RAM, packets blocked by congestion are generally stored in electronic buffers [11,14] or optical delay fiber arrays [12] for retransmission. Such buffering solutions, together with the use of TWCs, FTLs or SOAs, require a large number of wire connections from the controller to the buffer and switch components, and this complexity increases as the DC grows. For higher scalability and resiliency, multi-stage topologies have to be exploited for the OPS-based DCN architecture [29]. The end-to-end latency of each packet is thus mainly caused by congestion and buffering at each stage. In addition, a 5%-20% overhead is required for OPS transmission, including inter-slot guard time, time for synchronization, time for clock recovery in burst-mode receivers [30], etc.
The comparison between OCS- and OPS-based DCNs is summarized in Table 1. An OCS-based DCN can provide direct connections with large capacity and the system can be scaled up without compromising performance or adding complexity. By contrast, an OPS-based DCN works better with fast-changing and unpredictable traffic but requires expensive and power-hungry components as well as complicated control management. Therefore, a hybrid OPS/OCS DCN architecture becomes an attractive solution as OCS and OPS can complement each other: the OPS system offers flexible connectivity for dynamic traffic and the OCS system efficiently handles long-lived bulk data transfers.

Proposed programmable optical/electrical data center networking
The proposed programmable optical/electrical data center network architecture is shown in Fig. 1 with several key technologies, including the FPGA-based programmable switch and interface card (SIC), the OPS/OCS hybrid ToR switch, and programmable OCS and OPS network configurations. The main design consideration is to divide the hyperscale DCN into several clusters. Each cluster consists of tens or hundreds of racks. All the clusters are connected together through an LPFS-based inter-cluster switch. Multiple SMFs or MCFs can be used to connect all clusters to the inter-cluster switch. The inter-cluster switch configures the connection matrix between all clusters and provides adaptable link capacity between different clusters. Thus, single-hop OCS is used to serve the long-lived, large-capacity data flows of inter-cluster communications.
In each cluster, a centralized LPFS, acting as the cluster switch, interconnects ToRs via fiber bundles or SDM links. Inside clusters, OCS and OPS are used for different traffic patterns. The OCS, implemented with an LPFS, requires a relatively long setup time but adds no extra latency to the communication once a link is established. Thus, elephant flows are transmitted through OCS connections. The OPS network is realized with OPS/OCS hybrid ToRs and OPS switches that are connected to the cluster LPFS. The OPS switches are regarded as sub-functions to be deployed in the DCN. The topology of the OPS network can be configured through the OCS network. The OPS network carries most of the mice flows, due to its fast setup time. In addition, OPS can offer more simultaneous connections, complementing the OCS network. Other optical functional elements, such as PLZT-based TDM/OPS switches, EDFAs, couplers and combiners, are also connected to the cluster switch. According to the traffic requests, different network functions can be deployed by configuring the cluster switch, which enables network function programmability.
The key enabling technologies are introduced as follows:

FPGA-based programmable switch and interface card
The programmable switch and interface card (SIC), which is designed to replace the traditional network interface card (NIC), can be plugged into the server directly and enables intense intra-rack blade-to-blade communication [31]. Compared to a traditional NIC, the SIC provides a switching function to the server, which enables server-centric architectures (e.g., BCube [32]) and also simplifies the implementation of the ToR switch. With growing concern about data security [33], the SIC design is attracting more interest from industry, as the SIC makes data encryption easier in DCNs. The SIC also supports flexible OCS/OPS function switchover for each optical channel. Servers in the same rack send/receive Ethernet frames through the SIC on each server to/from the intra-rack access on the FPGA board.
Fig. 1. Architecture of programmable optical/electrical DCN.
The SIC design and implementation architecture are shown in Fig. 2. The SIC is capable of copying data between the memories of the blades and the SIC, and of processing the data and sending it out on a particular port in TDM/OPS or WDM/OCS mode, based on the instructions of the control plane. The SIC also acts as an OCS/WDM switch, an OPS/TDM switch, or a Layer 2 switch according to the commands of the control plane. With these switch functions, the SIC can work as a hop, providing maximum flexibility and programmability in the DCN architecture.
The SIC supports both intra-rack blade-to-blade communication and blade-to-optical-ToR-switch communication, with a view to extending high-performance intra-rack communication to inter-rack communication. The TDM-based SIC supports link virtualization, which enables network virtualization in data center networks [24]. With this programmable SIC, an all-optical programmable disaggregated data center network was proposed and successfully demonstrated [21]. With scheduling algorithms designed for disaggregated computing, the data center architecture can satisfy high-capacity and low-latency requirements [34].

OPS/OCS hybrid ToR switch
Another application of the developed OPS/OCS programmable SIC is the ToR switch [35]. The novel programmable hybrid ToR switch enables flexible OCS/OPS function switchover for each optical channel. As shown in Fig. 3, the FPGA platform performs traffic processing and traffic aggregation. For inter-rack traffic loads, the FPGA platform classifies them as either OCS or OPS traffic with application-aware classification following the commands from the control plane. Then the FPGA platform loads/unloads variable traffic onto different wavelength channels for DCN interconnections. Extra traffic monitoring can be introduced to classify the traffic in real time [36]. Optical transceivers on the ToRs can be 10 Gb/s SFP+, 40 Gb/s QSFP+ or even transceivers enabling advanced modulation formats, depending on the hardware supported on the FPGA platform. An m × n SSS is utilized as the interface between the transceivers and the hybrid DCN. Circulators are adopted to connect the transmitter (Tx) and receiver (Rx) to the SSS and enable bidirectional communication through it. Different optical channels can be aggregated by the SSS so that multi-granularity capacity is enabled for each link from the ToR. A proportion of these links are connected to the OPS system according to the link requirement from each ToR to a specific OPS network configuration in the hybrid DCN. The remaining links are connected to the OCS system. The maximum bandwidth a ToR can support is given by the total capacity provided on the FPGA platform, which is evolving rapidly; the node degree (link number) of each ToR is decided by the radix of the SSS.
Thus, traffic switching is enabled at the ToR level: hybrid OPS/OCS switchover functions can be implemented hitless; adaptive capacity can be assigned to different links which enables flexible capacity assignment for different services in a DC and the isolation between them.

OCS network configurations
The OCS network is constructed based on programmable optical networks with the AoD concept [20]. As shown in Fig. 4, both the inter-cluster and intra-cluster switches are implemented with an LPFS (e.g., Polatis beam-steering fiber switches) based on the AoD programmable switch. Regarding inter-cluster communication, the AoD-based inter-cluster switch provides connections between different clusters based on OCS. The link capacity can be dynamically programmed by offering variable numbers of connection links.
For intra-cluster communication, the AoD-based cluster switches provide OCS connections within each cluster. In addition, the cluster switches also connect various optical modules (e.g., AWGs, splitters, EDFAs, etc.) to achieve network function programmability. Depending on the number of OCS-enabled links from each ToR, several AoD nodes are utilized in parallel to construct the OCS network and each of them connects one OCS link from each ToR. To make full use of the ports on the LPFS, each OCS link is set to work bidirectionally, which also saves circulators between the ToRs and the backplane.
With such a configuration, arbitrary network topologies and functionalities can be delivered on demand by setting appropriate cross-connections between hybrid ToRs and optical modules in the optical backplane: required OCS links can be constructed directly between relevant ToR pairs; optical channels aggregated in the same OCS link can be separated or reassembled through AWGs; OCS broadcasting can be accomplished by utilizing splitters. All the connections can be dynamically reconfigured according to variations in the traffic pattern.

OPS network configurations
As mentioned before, OPS switching technology is only used for intra-cluster communications. All the OPS modules are connected to the cluster switch in an AoD approach. The interconnection between the OPS modules can be configured to form different network topologies. Given the practical difficulty of building high-radix OPS modules, multi-stage topologies are exploited for OPS-based intra-cluster communications. For convenience of study, we assume that there are 75 racks (ToRs) in a cluster and that the size of an OPS module is no bigger than 16 × 16. Fig. 5 illustrates OPS network configurations with different topologies: single-rooted tree, multi-rooted tree and butterfly. The architecture of each configuration is summarized in Table 2. Here we assume only unidirectional operation for OPS modules. In other words, each OPS link works in simplex mode. Thus, circulators are required to interface the links between hybrid ToRs and the OPS-based network.
Fig. 5(a) shows a non-blocking OPS network architecture built by cascading OPS nodes following the classic single-rooted tree topology: five 16 × 16 OPS modules work as branch nodes (P1-P5); a 5 × 5 OPS root node R1 connects all branch nodes. Each ToR provides only one OPS link with variable capacity. Each branch node provides connectivity among 15 ToRs (from both Tx and Rx sides) and their access to other branches of the tree. The capacity between different branches is oversubscribed for high connectivity demand with an oversubscription rate of 15:1, which leads to a capacity bottleneck for delivering the majority of "east-west traffic" in the DC clusters.
By introducing more redundancy together with intelligent multi-path routing strategies, DCN architectures with full bisection bandwidth can be constructed. Fig. 5(b) gives an example leveraging the multi-rooted tree topology: each ToR provides one OPS link; fifteen 10 × 10 OPS modules work as branch nodes (P1-P15) and each of them connects 5 ToRs (from both Tx and Rx sides); five 15 × 15 OPS modules (R1-R5) connect all branch nodes, so oversubscription is avoided by the overprovision of root nodes. Similar topologies such as Fat-tree [37], D-Cell [38], or BCube [32] can be constructed as well, depending on the provision of OPS modules. Multi-path routing protocols, such as the equal-cost multi-path (ECMP) algorithm [39], are required in these cases to efficiently allocate workloads among different root nodes in order to optimize network performance.
Moreover, a butterfly topology with 25 OPS nodes of size 15 × 15 (P1-P25) is shown in Fig. 5(c). Each OPS node connects the Tx of 15 ToRs with the Rx of another 15 ToRs. With 5 OPS links provided on each ToR, the Tx on each ToR is connected to 5 different OPS nodes, and so is the Rx on each ToR. Therefore, all the ToRs are fully connected and a unique "one hop" OPS path is set for each ToR pair. In such a configuration, collocated traffic loads from the same ToR are split, loaded on different links, and sent to different OPS nodes according to their destinations. In all the cases described above, the OCS network described in the last subsection is constructed in parallel as a supplement to the OPS network. By setting OCS links between ToR pairs where augmented capacity or tight latency is required, the network performance can be further improved.
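The port counts and oversubscription figures quoted for the three configurations can be cross-checked with simple arithmetic. The following is only an illustrative sketch: the function and field names are assumptions introduced here, and the oversubscription ratio is taken simply as the number of ToR-facing ports divided by the number of root-facing ports on a branch node.

```python
from fractions import Fraction

N_TORS = 75  # racks (ToRs) per cluster, as assumed in the text


def tree_config(branches, tors_per_branch, roots):
    """Port budget of a tree of OPS modules with one OPS link per ToR."""
    assert branches * tors_per_branch == N_TORS
    return {
        # each branch node needs ToR-facing ports plus one port per root
        "branch_radix": tors_per_branch + roots,
        # each root node connects every branch node once
        "root_radix": branches,
        # ToR-facing ports divided by root-facing ports on a branch node
        "oversubscription": Fraction(tors_per_branch, roots),
    }


def butterfly_config(links_per_tor, node_radix):
    """Node count and path coverage of the one-hop butterfly layout."""
    nodes, remainder = divmod(N_TORS * links_per_tor, node_radix)
    assert remainder == 0
    return {
        "nodes": nodes,
        # every node joins node_radix Tx ports to node_radix Rx ports
        "paths_offered": nodes * node_radix * node_radix,
        # ordered (Tx ToR, Rx ToR) combinations, own Rx included
        "paths_needed": N_TORS * N_TORS,
    }


single_rooted = tree_config(branches=5, tors_per_branch=15, roots=1)
multi_rooted = tree_config(branches=15, tors_per_branch=5, roots=5)
butterfly = butterfly_config(links_per_tor=5, node_radix=15)
```

Under these assumptions the single-rooted tree gives 16-port branch nodes with a 15:1 ratio, the multi-rooted tree gives 10-port branch nodes with a 1:1 ratio, and the butterfly needs exactly 25 nodes to offer one path per ToR combination.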

Transmission media for DCN
In order to provide low-latency interconnections in large-scale DCNs, flat DCN architectures with a reduced number of hops are preferred. Compared to the traditional hierarchical DCN, the flat-structured DCN requires more connections, as each ToR needs to connect directly to more ToRs. Connectivity will be one of the big challenges for future large-scale DCNs with a flat architecture. Thanks to recent advances in fiber technologies, space division multiplexing (SDM) is now possible, allowing a large number of signals to be multiplexed not only in wavelength but also in space, and to be transmitted along a single optical fiber at the same time. Several SDM technological alternatives have been reported in the literature, including multimode fibers (MMFs), multicore fibers (MCFs) [40], multi-element fibers (MEFs) [41] and even their combinations. Using SDM, a spatial multiplicity as high as 36 has been demonstrated in fibers with dimensions not too dissimilar to those of a typical SMF [42]. The use of SDM technologies in DCNs can help simplify the connectivity between ToRs and the centralized LPFS. In combination with wavelength division multiplexing, a dramatic increase in the number of connections can be achieved to provide more connectivity in DCNs. In [23], we demonstrated the use of SDM in a DCN for the first time.
Another way to reduce communication latency is to use hollow-core bandgap fiber, which could reduce propagation delay by 30%. In combination with a flat DCN architecture, ultra-low-latency communications could be offered for chip-level access in a disaggregated data center [43].

Simulation scenarios and parameters
We use simulated DC traffic patterns [44] as traffic demands in the hybrid DCN and examine the network behavior under such traffic loads. Two typical DC traffic patterns are simulated. In Case I, we assume that inter-processor-like traffic dominates the whole DC, whereas in Case II, hot-spot-like traffic is the major traffic pattern. Fig. 6(a) illustrates the modeled DC traffic pattern in Case I via the heat map of the inter-rack traffic matrix in log10(bytes) over a simulated 1 s interval. We can see that the communication degree of each ToR is bounded and hot ToRs exchange much of their data with only a few other ToRs (see dark and red dots in the heat map). By contrast, Fig. 6(b) illustrates the simulated 1 s traffic pattern in Case II, where hot ToRs communicate with most ToRs in the DC following a "fan-in/fan-out" pattern while the "cold traffic" pattern is common among cold ToRs.
Firstly, we assume each ToR can provide 12 OPS/OCS programmable channels with 10 Gb/s capacity per channel. The maximum bandwidth a ToR can offer is 120 Gb/s and the capacity of each link from this ToR can vary from 10 Gb/s to 120 Gb/s. An OPS emulator is developed and programmed in Matlab, as illustrated in Fig. 7. A 2-µs slot size is selected for synchronous OPS operation and a 15% overhead is assumed for each packet. Each flow transmitted by the OPS network in the simulation is first divided into a queue of equal-size optical packets (or frames), instead of Ethernet packets with uncertain sizes. To take full advantage of such flexible capacity, we assume that each OPS node is enabled for optical waveband switching. OPS modules are transparent to optical wavelength (e.g., OPS based on semiconductor optical amplifiers (SOAs)), so multiple wavelengths can be utilized for a single optical packet. Thus, the optical packet size (in bytes) is decided by the effective OPS link bandwidth.
In case of congestion, a random priority is given to each packet competing for an output port. Blocked packets are immediately buffered at the corresponding input of the OPS module to wait for retransmission. The capacity of each electrical buffer is set to 200 KB. The latency of each received byte caused by buffering and the number of bytes dropped due to buffer overflow are counted in the simulation.
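The dependence of the optical packet size on the link bandwidth can be illustrated numerically. The sketch below rests on an assumption that is not confirmed by the text: the payload of one optical packet is taken as the number of bytes the aggregated link carries in one slot, reduced by the overhead fraction. The function name and its parameters are hypothetical.

```python
def optical_packet_bytes(channels, channel_gbps=10, slot_us=2, overhead=0.15):
    """Payload bytes of one optical packet on a link of `channels` wavelengths.

    Assumed relation (not taken from the source):
        payload = link_rate * slot_duration * (1 - overhead) / 8
    """
    if not 1 <= channels <= 12:
        raise ValueError("a ToR offers between 1 and 12 channels")
    # Gb/s multiplied by microseconds gives kilobits, hence the factor 1000
    slot_bits = channels * channel_gbps * slot_us * 1000
    return slot_bits * (1.0 - overhead) / 8.0


single_channel = optical_packet_bytes(1)    # one 10 Gb/s wavelength
full_waveband = optical_packet_bytes(12)    # all 12 wavelengths aggregated
```

With the stated parameters this reading gives about 2125 bytes per slot on a single 10 Gb/s channel and about 25500 bytes when all 12 channels are aggregated into one waveband.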
We assume that all the OPS emulators in our model switch simultaneously. Each optical packet transferred through a multi-stage OPS link is switched in a different time slot at each hop. Moreover, for the multi-rooted tree network, a simple multi-path routing protocol is used to distribute traffic loads efficiently among the root nodes: each packet is always switched to the root node with the least buffer occupation.
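The contention rule and the least-occupied-root rule described above can be sketched in a few lines. This is an illustrative Python sketch, not the Matlab emulator itself; the function names, the data layout (a mapping from input port to requested output port) and the tie-breaking by lowest index are assumptions.

```python
import random


def resolve_slot(requests, rng):
    """Resolve output contention for one time slot.

    requests maps each input port to the output port requested by its
    head-of-line packet. One contender per output wins at random; the
    others stay buffered at their inputs for retransmission.
    """
    contenders = {}
    for inp, out in requests.items():
        contenders.setdefault(out, []).append(inp)
    winners = {}   # output port -> input port that gets switched
    blocked = []   # input ports whose packet stays in the buffer
    for out, inputs in contenders.items():
        winner = rng.choice(inputs)  # random priority among contenders
        winners[out] = winner
        blocked.extend(i for i in inputs if i != winner)
    return winners, sorted(blocked)


def pick_root(buffer_occupancy):
    """Index of the root node with the least buffer occupation."""
    return min(range(len(buffer_occupancy)), key=lambda r: buffer_occupancy[r])


# Inputs 0 and 1 both request output 2; input 3 requests output 1.
winners, blocked = resolve_slot({0: 2, 1: 2, 3: 1}, random.Random(0))
chosen_root = pick_root([5, 2, 2, 9])  # ties resolve to the lowest index
```

Passing an explicit `random.Random` instance keeps a simulation run reproducible, which is convenient when comparing topologies under identical traffic.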
Apart from the OPS emulator, traffic transmission through OCS links is also simulated. Variable-capacity channels can be assigned to the OCS network while the overall channel number from each ToR is fixed at 12. Given that the required capacity between any ToR pair never exceeds 10 Gb/s in our traffic model, we always assign a single 10 Gb/s channel to each OCS link so that the degree of connectivity in the OCS network, i.e., the number of OCS links from each ToR, is maximized. Once an OCS link is set between two ToRs, all the data exchanged between these two ToRs goes through the OCS link and is counted without any delay or loss in our simulation. Some extra parameters are listed in Table 3.
It is worth noting that the parameters used in our simulation were chosen in view of our limited computation capability. Future DCs with optical interconnects should be equipped with higher channel capacity and shorter slot sizes to support even heavier traffic workloads.

Function-topology management strategy
We simulate network behavior for each 1 s time slot within a continuous 30 s period. The simulation process is outlined in Fig. 8, where the function-topology management is composed of two steps: topology optimization for the OPS network, and traffic load separation between the OCS and OPS networks.
Topology optimization for the OPS network aims to derive the objective matrix TM_OPT from the original traffic matrix TM_ORI by column switching and row switching. In the single-/multi-rooted tree or butterfly topologies, the objective matrix should balance traffic distributions among OPS branch nodes. Regarding the butterfly topology, the objective matrix needs to distribute more of the traffic among the OPS branch nodes. Hence, we present a greedy heuristic algorithm for constructing the objective matrix, as outlined in Algorithm 1.
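Firstly, all the source ToRs and destination ToRs are sorted from the hottest to the coldest. The matrix TM_ORI is thereby transformed into TM_sort with rescheduled source list S_sort and destination list D_sort, in which the traffic loads are in non-increasing order along both lists. Here N is the rack number in the DC, and the load of a ToR is the traffic transferred by it as a source or received by it as a destination, i.e., the sum of the entries of TM_ORI belonging to that source or to that destination, respectively.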
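The sorting step can be sketched as follows. This is a minimal illustration, assuming that rows of the traffic matrix index source ToRs and columns index destination ToRs; the function name and the stable tie-breaking (equal loads keep their original order) are choices made here, not details from the original algorithm.

```python
def sort_hot_to_cold(tm):
    """Reorder a square traffic matrix so that hot ToRs come first.

    Returns (tm_sort, s_sort, d_sort), where s_sort and d_sort hold the
    original ToR labels in their new row and column order.
    """
    n = len(tm)
    sent = [sum(tm[s]) for s in range(n)]                           # per source
    received = [sum(tm[s][d] for s in range(n)) for d in range(n)]  # per destination
    s_sort = sorted(range(n), key=lambda s: -sent[s])
    d_sort = sorted(range(n), key=lambda d: -received[d])
    tm_sort = [[tm[s][d] for d in d_sort] for s in s_sort]
    return tm_sort, s_sort, d_sort


tm_example = [
    [0, 1, 0],
    [5, 0, 7],
    [2, 1, 0],
]
tm_sort, s_sort, d_sort = sort_hot_to_cold(tm_example)
```

In this small example source ToR 1 sends the most traffic and is moved to the first row, while destinations 0 and 2 receive equal loads and keep their relative order.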
Taking the inter-rack traffic in the 25th second as an example, Fig. 9(a) illustrates the heat map of TM_sort with the rescheduled ToR sequences, which are marked with the original ToR labels. We can see that this matrix is exactly the objective matrix for the butterfly topology optimization, where most traffic loads are distributed among branch nodes.
Algorithm 1. Heuristic algorithm for optimizing traffic load distribution in OPS networks: TopologyOptimization(TM_ORI, topology), where TM_ORI is the original traffic matrix, and topology is the topological name of the OPS network.
[Algorithm 1 listing, lines 1-46, ending with: return TM_OPT]
For other topologies, a matrix with a more balanced traffic distribution needs to be constructed. In our heuristic algorithm, sub-lists S_g(1)-S_g(k) and D_g(1)-D_g(k) are set to represent the different groups of source ToRs and destination ToRs connected to each branch node of the OPS network, where k is the number of branch nodes. The overall traffic load T_g(i) on each sub-list is calculated as the sum of the traffic loads of the ToRs assigned to it (Eq. (5)). Then, we distribute source ToRs and destination ToRs one by one from S_sort and D_sort onto the sub-lists while minimizing the difference in traffic load among them: during the ToR assignment, the sub-list with the least overall traffic load always receives the hottest unassigned ToR unless that sub-list is full, i.e., if i satisfies T_g(i) = min{T_g} and |S_g(i)| < N/k, then the hottest unassigned source ToR is added to S_g(i). Destination ToRs are assigned in the same way. Fig. 9(b) illustrates the traffic distribution in the simulated 25th second after such matrix operations for the single-rooted tree topology optimization, where hot ToRs are separated onto different branch nodes to reduce traffic congestion on them.
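The greedy assignment rule described above can be sketched as follows. This is an illustration of the stated rule only, not a reconstruction of the Algorithm 1 listing: the function name is hypothetical, the per-ToR loads are passed in directly for one side (sources or destinations), and a full sub-list is simply skipped in favor of the least-loaded sub-list that still has room, which is an assumption.

```python
def balance_groups(loads, k):
    """Greedily split ToRs into k sub-lists with similar total load.

    loads[t] is the traffic load of ToR t (sent or received).
    Returns (groups, totals): ToR labels per sub-list and their loads.
    """
    n = len(loads)
    capacity = -(-n // k)  # ceil(N / k) ToRs per branch node
    hottest_first = sorted(range(n), key=lambda t: -loads[t])
    groups = [[] for _ in range(k)]
    totals = [0] * k
    for tor in hottest_first:
        # only sub-lists that are not yet full may take another ToR
        open_groups = [g for g in range(k) if len(groups[g]) < capacity]
        target = min(open_groups, key=lambda g: totals[g])
        groups[target].append(tor)
        totals[target] += loads[tor]
    return groups, totals


groups, totals = balance_groups([9, 7, 6, 5, 2, 1], k=3)
```

For the six example loads the hot ToRs 0, 1 and 2 end up on three different sub-lists, and the resulting totals (10, 9 and 11) differ by at most 2.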
Next, depending on the capacity assigned to the OCS network, we simulate OCS links set up between certain ToR pairs. The OCS network is constructed by establishing OCS links directly between the relevant ToR pairs. To maximize the throughput of the OCS network, the ToR pairs with OCS interconnections are selected by the Edmonds algorithm [45], which takes the inter-rack traffic matrix as a weighted graph G(V, E, W) and selects source-to-destination ToR pairs from it. The heuristic algorithm for the OCS network construction is outlined in Algorithm 2, where OCS traffic loads are picked out and separated from OPS traffic loads. To make each OCS link bi-directional, the Edmonds algorithm is applied to a symmetric matrix TM_sym, derived from the inter-rack traffic matrix TM_Rack and its transpose, so that the constructed OCS network is symmetric as well.
As assumed, the OPS/OCS hybrid ToR switch can configure each of the 12 channels on a ToR for either the OCS network or the OPS network. This configuration affects the network performance in the DCN. Fig. 10 illustrates the variation of the data drop rate for each configuration with different numbers of channels assigned to the OCS network. For both traffic patterns in Case I and Case II, simulation results indicate that hybrid OCS/OPS networks can perform better than either of the homogeneous networks: the traffic drop rate can be reduced by at least an order of magnitude when appropriate OCS links are introduced in parallel with the OPS networks. The detailed simulation results are shown in Fig. 12. The reason is that the distribution of DC traffic is highly skewed. Thus, even a small number of OCS links can effectively offload the bulk traffic and reduce its competition with the other traffic at the hot ToRs in the OPS network. Fig. 13 shows the share of traffic loads taken by the OCS and OPS networks with different OCS channel numbers in Case I and Case II. As expected, the more skewed the traffic pattern is, the more effectively the OCS network performs.
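The OCS construction loop of Algorithm 2 can be sketched in Python as below. Two points are assumptions of this sketch rather than statements about the simulation platform: the maximum-weight matching is computed by a brute-force dynamic program (a stand-in for the Edmonds algorithm that yields the same optimum but is only practical for a handful of ToRs), and an OCS link is assumed to carry all the traffic of its ToR pair. The function names are hypothetical.

```python
from functools import lru_cache


def max_weight_matching(w):
    """Exact maximum-weight matching on a small symmetric matrix (stand-in for Edmonds)."""
    n = len(w)

    @lru_cache(maxsize=None)
    def best(mask):
        # mask marks the vertices that are still unmatched.
        if mask == 0:
            return 0, ()
        i = (mask & -mask).bit_length() - 1  # lowest unmatched vertex
        rest = mask & ~(1 << i)
        result = best(rest)  # option 1: leave vertex i unmatched
        for j in range(i + 1, n):
            if (rest >> j) & 1 and w[i][j] > 0:  # option 2: match i with j
                sub_weight, sub_pairs = best(rest & ~(1 << j))
                if sub_weight + w[i][j] > result[0]:
                    result = (sub_weight + w[i][j], ((i, j),) + sub_pairs)
        return result

    return list(best((1 << n) - 1)[1])


def ocs_construction(tm_rack, n_channels):
    """Pick OCS links round by round and split the traffic into OCS and OPS matrices."""
    n = len(tm_rack)
    # Symmetric weights so that every selected link is bi-directional.
    tm_sym = [[tm_rack[i][j] + tm_rack[j][i] for j in range(n)] for i in range(n)]
    links = []
    for _ in range(n_channels):  # one matching per OCS channel on each ToR
        mates = max_weight_matching(tm_sym)
        if not mates:
            break
        links.extend(mates)
        for i, j in mates:  # remove the chosen pairs from the graph
            tm_sym[i][j] = tm_sym[j][i] = 0
    tm_ocs = [[0] * n for _ in range(n)]
    for i, j in links:
        tm_ocs[i][j] = tm_rack[i][j]
        tm_ocs[j][i] = tm_rack[j][i]
    tm_ops = [[tm_rack[i][j] - tm_ocs[i][j] for j in range(n)] for i in range(n)]
    return tm_ocs, tm_ops, links


# Toy 4-rack example (values are made up).
tm = [[0, 8, 1, 0],
      [2, 0, 0, 3],
      [1, 0, 0, 9],
      [0, 1, 5, 0]]
tm_ocs, tm_ops, links = ocs_construction(tm, n_channels=1)
print(links)
```

For a matrix of realistic size (e.g. the 75 ToRs of the simulation), a polynomial-time blossom implementation would be needed in place of the brute-force matching.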
Besides, the bandwidth of the OPS network decreases as the number of OCS channels grows, since both networks share the same capacity provided by each ToR. Thus, increasing the OCS channels increases the OPS buffer capacity in the simulation: the size of each electrical buffer in the OPS network is fixed at 200 KB, but the size of an optical packet shrinks with the OPS bandwidth (see Eq. (1)), which means that more optical packets can be stored in each buffer. Fig. 11 illustrates the variation of the optical packet size and of the number of packets that can be buffered with different OCS channel numbers. Meanwhile, the performance of the OPS network starts to degrade if its bandwidth drops faster than the traffic load it carries. Fig. 14 illustrates the normalized OPS traffic loads for the hottest ToR and the top 10 hot ToRs in the OPS network. Thus, the overall network behavior benefits only when the gain in network utilization brought by the OCS links exceeds the loss due to the reduction of OPS capacity. A tradeoff has to be made when allocating capacity between the OCS and OPS systems in order to optimize the network performance, which depends not only on the network topology but also on the distribution of the traffic pattern, as summarized in Table 4.
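The buffer effect can be illustrated with a small calculation. The 200 KB buffer and the 12 channels per ToR come from the text; everything else is an assumption of this sketch: the optical packet size is taken to scale linearly with the number of OPS channels, the constant BYTES_PER_CHANNEL is a hypothetical value, and 1 KB is taken as 1000 bytes.

```python
BUFFER_BYTES = 200_000     # 200 KB electrical buffer per OPS port (from the text)
TOTAL_CHANNELS = 12        # channels per ToR shared by OCS and OPS (from the text)
BYTES_PER_CHANNEL = 1_250  # hypothetical bytes that one channel adds to an optical packet


def ops_packet_bytes(ocs_channels):
    """Optical packet size when ocs_channels of the 12 channels are given to OCS."""
    return (TOTAL_CHANNELS - ocs_channels) * BYTES_PER_CHANNEL


def packets_per_buffer(ocs_channels):
    """Number of whole optical packets that fit into one buffer."""
    size = ops_packet_bytes(ocs_channels)
    if size == 0:
        raise ValueError("no OPS channel left, so there is no OPS packet to buffer")
    return BUFFER_BYTES // size


for ocs in (0, 3, 6, 9):
    print(ocs, ops_packet_bytes(ocs), packets_per_buffer(ocs))
```

Under these assumptions the number of bufferable packets grows as channels move to OCS, while the OPS bandwidth available to drain the buffer shrinks by the same factor, which is the tradeoff discussed above.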

Cost and power consumption
The comparison of network performance illustrated in Fig. 12 indicates two alternative ways to improve the DCN architecture: a) upgrading the OPS network topology by introducing more redundant OPS nodes; b) constructing the hybrid DCN by adding the OCS network in parallel with the OPS network. A preliminary analysis of the cost and power consumption of these improvements is provided in this section.
The cost estimation of the OCS network is based on the price of the commercially available 320-port 3D-MEMS in [46], from which 0.17/port can be derived as the cost of each OCS port in arbitrary units. Similarly, the power consumption of the OCS network is assumed to be 0.47 W/port according to the power requirement of the 3D-MEMS quoted in [46].
The estimation for the OPS network is tricky since no mature technology has been developed for such a scheme so far. Thus, we build our estimation model on the switching fabric of each OPS node. Given that there are a number of other key elements in each OPS node, such as buffers, label processors and controllers, the switch fabric is assumed to represent at most 50% of the total cost and power consumption of the whole OPS node, which is a rather conservative estimation compared to present-day packet switching schemes [47]. For example, we use the price of the PLZT switch in [46] to calculate the per-port cost of the switching fabric in arbitrary units, so that the cost of an OPS node is 2/port in arbitrary units (we assume that the fast switch adopted in OPS nodes should be competitive with the PLZT switch on price). The power consumption of the OPS switching fabric is estimated based on the Benes switch with 2 × 2 SOA gate arrays, and 0.4 W is used as the power required for each SOA gate working in the "ON" state [48]. Thus, the power consumption of such a switching fabric rises exponentially with the port number n.
Table 5 summarizes the cost and power consumption model for each OCS and OPS configuration. The comparisons of the overall cost and power consumption between different hybrid DCN configurations are illustrated in Fig. 15. Together with the performance comparison shown in Fig. 12, we can see that the construction of the hybrid DCN is more efficient than the overprovisioning of OPS nodes in terms of cost and power consumption, while offering a similar or even better improvement in network performance.
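The per-port figures quoted above lend themselves to a quick back-of-the-envelope comparison. The sketch below uses only the constants stated in the text (0.17 and 2 arbitrary units per port, 0.47 W per OCS port); the port counts in the example are illustrative and do not correspond to the configurations of Table 5, and the OPS power model is left out because its expression is not reproduced here.

```python
OCS_COST_PER_PORT = 0.17   # arbitrary units per OCS port (from the text)
OPS_COST_PER_PORT = 2.0    # arbitrary units per OPS port (from the text)
OCS_POWER_PER_PORT = 0.47  # watts per OCS port (from the text)


def hybrid_cost(ocs_ports, ops_ports):
    """Cost of a configuration with the given numbers of OCS and OPS ports."""
    return ocs_ports * OCS_COST_PER_PORT + ops_ports * OPS_COST_PER_PORT


def ocs_power_watts(ocs_ports):
    """Power drawn by the OCS part of a configuration."""
    return ocs_ports * OCS_POWER_PER_PORT


# Illustrative upgrade: one extra port for each of 75 ToRs, either OCS or OPS.
extra_ocs = hybrid_cost(ocs_ports=75, ops_ports=0)
extra_ops = hybrid_cost(ocs_ports=0, ops_ports=75)
print(round(extra_ocs, 2), round(extra_ops, 2), round(ocs_power_watts(75), 2))
```

With these per-port values, an added OCS port costs less than a tenth of an added OPS port, which is the intuition behind the comparison in Fig. 15.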

Scalability
To scale up the proposed OPS/OCS hybrid DCN, the main challenges are the limited port counts of the LPFS and of the OPS switch modules. In the proposed OPS/OCS hybrid DCN, the LPFS is used to implement the OCS network and also to manage the OPS switch modules based on architecture-on-demand. Currently, high-radix optical switches can offer over 384 × 384 switching [49]. Recent developments in silicon photonics provide a potential candidate for low-loss, high-radix switches [50]. Furthermore, several methods exist to cascade multiple LPFSs into a larger, higher-radix optical switch [46]. However, the port count of the LPFS is still not enough for future hyperscale DCNs. Thus, in the proposed hybrid DCN, the hyperscale DCN is first divided into clusters. Each cluster can be implemented with the proposed OPS/OCS hybrid DCN solution. A cluster-based DCN architecture relaxes the requirements on the LPFS.
Regarding OPS switching, multiple technologies, such as SOAs and PLZT, have been adopted with a limited number of ports [51]. OPS switching modules are still at an early stage, and efficient OPS requires precise time synchronization. Thus, OPS is only used for intra-cluster communications at a limited scale.

Experimental demonstration
We demonstrated virtual data center (VDC) provisioning in an OCS/OPS hybrid data center. Due to the limited scale, no cluster switches are used. Fig. 16 shows the experimental setup of the hybrid data center. The hybrid data center deploys our time-shared optical networking (TSON) technology [52], FPGA-based OPS/OCS SIC and optical circuit elements [24]. Compared to OPS technology, TSON provides a similar but simpler solution for optical slot switching, and no extra headers are required. In the experiments, TSON is used to offer connectivity similar to that of OPS; we treat TSON as a simplified OPS network that is used only for intra-cluster communications. On top of the data plane, a software stack that consists of the Orchestrator and the OpenFlow agents is developed for the hybrid data center. The software stack enables the provisioning of VDC instances over the optical data layer of the hybrid OPS/OCS DCN.
The data plane consists of two kinds of fiber switches. The beam-steering 2 × 2 4-core MCF switch provides optical switching over 4-core MCFs. This MCF device offers a 300% increase in fibre capacity over single-mode fibre (SMF), and we envision its usage for inter-DC traffic. Another LPFS is used for the OCS system and to manage the network.
The OpenStack (DevStack) implementation dynamically provides TSON and OCS resources via an extended and optically-enabled SDN controller. A new algorithm module is developed to determine the logical instances, such as the IP network, subnetworks and ports, that enable traffic exchange along the VDC instance. To map the VMs and create the logical resources, the algorithm interacts with the core orchestrator services via the OpenStack Heat service. In addition to the physical route and the necessary timeslots, the algorithm also determines the particular VLAN to be employed when encapsulating the traffic of each virtual link. On each OSK compute node, an Open vSwitch (OVS) is programmed to control flows between the VM instances.
The performance of the TSON network was measured in terms of throughput and latency against the allocated timeslots. The results demonstrate a sustainable maximum data rate of up to 8.6 Gbps, as shown in Fig. 17. Circuit and TSON switching are combined to offer flexible and granular bandwidth provisioning. As can be observed, higher throughput and lower latency can be achieved with interleaved (or distributed) slot allocations, because interleaving reduces the maximum delay between data transmissions. Therefore, it is recommended to avoid contiguous allocation for best performance. Similarly, in Fig. 18, the maximum and mean latency measurements converge as the number of timeslots increases, because the largest gap between transmission slots shrinks. The interleaved minimum is greater because, unlike the contiguous case, there is always a no-transmit slot between transmissions.
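The effect of the allocation pattern on the largest transmission gap can be illustrated with a toy model. The sketch assumes a cyclic frame of equal-length timeslots and an evenly spread interleaved pattern; the frame length of 12 slots and the function names are hypothetical and do not describe the actual TSON frame structure.

```python
def contiguous(frame, k):
    """Allocate the first k slots of the frame."""
    return list(range(k))


def interleaved(frame, k):
    """Spread k slots as evenly as possible over the frame."""
    return [(i * frame) // k for i in range(k)]


def max_gap(slots, frame):
    """Largest run of idle slots between consecutive transmissions in a cyclic frame."""
    s = sorted(slots)
    gaps = [(s[(i + 1) % len(s)] - s[i] - 1) % frame for i in range(len(s))]
    return max(gaps)


FRAME = 12  # hypothetical number of timeslots per frame
for k in (2, 4, 6):
    print(k, max_gap(contiguous(FRAME, k), FRAME), max_gap(interleaved(FRAME, k), FRAME))
```

In this model the worst-case wait for the next transmit slot is always smaller with interleaving, and both patterns approach the same gap as more slots are allocated, which matches the convergence described for the latency measurements.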
The switching latency of the OXS was measured at both the circuit and the application level. The switching time at the circuit level was measured electronically at around 25 ns. Using a ping-flood method, we tested the effective reconfiguration time from an application perspective. The mean value was measured over five reconfigurations. The end-to-end buffering and serialisation for the TSON scheme takes around 38.7 µs. The same measurement conducted for the NIC in Ethernet mode gives around 8.3 µs. An overhead of 30.1 µs is required for the extra buffering, logic and negotiation (key characters) involved in the TSON implementation. A similar experiment measured the mean reconfiguration time of the MCF switch over several iterations as 121 µs.

Conclusion
In this paper, we present the design of a programmable OCS/OPS DCN architecture, a hybrid optical network solution for future hyperscale DCs. The design combines the advantages of the OCS and OPS schemes through FPGA-based hybrid ToR switches. Thus, DC traffic loads with various patterns can be effectively accommodated by different network topologies on demand.
We simulate the network behavior of different hybrid DCN configurations under different traffic demands and evaluate the benefits brought by the flexibility of the hybrid scheme. The results indicate that the network performance can be significantly improved by configuring the hybrid OPS/OCS network topologies according to the skewed nature of DC traffic. Besides, a preliminary comparison of the cost and power consumption of different network topologies is presented, which shows that the hybrid DCN architecture is more cost- and energy-efficient than the homogeneous network under the same quality of service provision. Finally, data center virtualization is demonstrated successfully based on the proposed hybrid data center architecture.

Fig. 3. Schematic of the programmable hybrid ToR switch: a high-speed FPGA platform provides the processing of both intra- and inter-rack traffic; inter-rack traffic is sorted into OCS/OPS traffic and sent to the DCN via m × n SSS interfaces.

Fig. 4. Schematic of the OCS network configuration: an AoD-based optical programmable system where hybrid ToRs and a variety of optical function modules are connected via several large port-count optical backplanes.

Fig. 6. Inter-rack traffic distributions among 75 ToRs in a simulated 1 s interval are illustrated by heat maps of traffic matrices of log10(Bytes) for (a) Case I; (b) Case II.

Fig. 8. Schematic of network behaviour simulation with the function-topology management for each 1 s time slot.

Algorithm 2. Heuristic algorithm for constructing the OCS network and separating traffic loads between the OCS and OPS networks: OCS_construction(TM_Rack, n), where TM_Rack is the original inter-rack traffic matrix and n is the number of channels enabled for OCS on each ToR.
Begin
    OCS_link ← ∅;
    TM_sym = TM_Rack + TM_Rack^T;
    Transfer matrix TM_sym into graph G(V, E, W);
    while n ≠ 0 do
        n ← n − 1;
        Apply Edmonds algorithm to graph G, which returns mates;
        OCS_link ← OCS_link ∪ mates;
        Remove mates from graph G;
    end
    Transfer OCS_link to OCS traffic matrix TM_OCS;
    TM_OPS ← TM_Rack − TM_OCS;
end
return TM_OCS
return TM_OPS

Network performance for OPS/OCS hybrid DCN
The OPS/OCS hybrid DCN can be configured to implement different network topologies. A MATLAB-based DCN simulation platform is implemented based on the previous assumptions. Network performance has been evaluated for different network topologies in terms of traffic drop rate and average latency. The traffic drop rate and average delay are computed per byte, rather than per packet, to describe the network behavior.

Fig. 9. Matrix transformation for topology optimization in the 25th second: (a) traffic matrix TM_sort with rescheduled source and destination sequences from the hottest to the coldest; (b) traffic matrix TM_OPT with balanced traffic load distribution among branches in the single-rooted tree.

Fig. 11. The size of the optical packet decreases as the OCS capacity grows, which increases the OPS buffer capacity.

Fig. 12. Simulated traffic drop rates with different capacity allocations between OCS and OPS networks under the traffic demands in: (a) Case I; (b) Case II.

Fig. 13. Traffic loads shared by the OCS and OPS networks with different capacity allocations between them in the network simulation for Case I and Case II.

Fig. 14. The traffic load of the hottest ToR and the average traffic load of the top 10 hot ToRs in the OPS network, normalized by the OPS bandwidth of each ToR, vary with the number of OCS channels assigned.

Fig. 15. Comparison of the overall cost and power consumption for different configurations of the hybrid DCN. ST: single-rooted tree; MT: multi-rooted tree; BF: butterfly.

Fig. 16. Architecture and control flow for virtual data center provisioning.

Fig. 17. Contiguous and interleaved allocated time slots vs. throughput for the TSON data plane.

Table 1
Comparison of different optical network technologies for DCN solutions.

Table 2
Summary of OPS network configurations.

Table 3
Summary of extra simulation parameters.

Table 4
Summary of the network performance for various configurations.
* The OPS network with butterfly topology requires 5 OPS links from each ToR, thus the maximum number of OCS links that each ToR can support is 7.

Table 5
Model of cost and power consumption for hybrid DCN.