Traffic Generation for Benchmarking Data Centre Networks

Benchmarking is commonly used in research fields, such as computer architecture design and machine learning, as a powerful paradigm for rigorously assessing, comparing, and developing novel technologies. However, the data centre networking community lacks a standard open-access benchmark. This is curtailing the community's understanding of existing systems and hindering the ease with which novel technologies can be developed, compared, and tested. We present TrafPy: an open-access framework for generating both realistic and custom data centre network traffic traces. TrafPy is compatible with any simulation, emulation, or experimentation environment, and can be used for standardised benchmarking and for investigating the properties and limitations of network systems such as schedulers, switches, routers, and resource managers. To demonstrate the efficacy of TrafPy, we use it to conduct a thorough investigation into the sensitivity of 4 canonical scheduling algorithms (shortest remaining processing time, fair share, first fit, and random) to varying traffic trace characteristics. We show how the fundamental scheduler performance insights revealed by these tests translate to 4 realistic data centre network types: University, Private Enterprise, Commercial Cloud, and Social Media Cloud. We then draw conclusions as to which types of scheduling policies are most suited to which types of network load conditions and traffic characteristics, leading to the possibility of application-informed decision making at the design stage and new dynamically adaptable scheduling policies. TrafPy is open-sourced via GitHub and all data associated with this manuscript via RDR.


Introduction
A benchmark is a series of experiments performed within some standard framework to measure the performance of an object. Researching data centre network (DCN) systems and objects such as networks, resource managers, and topologies involves understanding which types of mechanisms, principles or architectures are generalisable, scalable and performant when deployed in real world environments. Benchmarking is a powerful paradigm for investigating such questions, and has proved to be a strong driving force behind innovation in a variety of fields [3]. A famous example of a successful benchmark is the ImageNet project [4], which has facilitated a range of significant discoveries in the field of deep learning over the last decade.
In order to benchmark a DCN system, a traffic trace with which to load the network is required. This presents several challenges: (i) Data related to DCNs are often considered privacy-sensitive and proprietary to the owner, therefore few DCN traffic traces are openly available; (ii) when a real DCN trace is made available, it is often specific to a particular DCN and possibly not representative of current and future systems, too limited for cutting-edge data-hungry applications such as reinforcement learning, and not sufficient for stress-testing different loads in networks with arbitrary capacities to understand system limitations and vulnerabilities to future workloads; (iii) even if an attempt is made to make a real DCN available for live testing, deploying experimental systems in such large-scale production environments is often too expensive and time consuming; and (iv) reducing or approximating DCN traffic down to small-scale experiments is often unfruitful since many DCN application traffic patterns only emerge at large scales.
For these reasons, most DCN researchers revert to simulating DCN traffic in order to conduct their experiments. However, synthetic DCN traffic generation is often plagued by numerous inadequacies. A common simplification approach is to assume uniform or 'named' (Gaussian, Pareto, log-normal, etc.) distributions from which to sample DCN traffic characteristics. However, such distributions often ignore fluctuations caused by the short bursty nature of real DCN traffic, rendering the simulation unrealistically simple. Sometimes researchers will try to implement their own unique distributions to better describe real DCN traffic; however, this brings difficulties with reproducing and benchmarking against literature reports, since there is no standard framework for doing so. Another common approach is to focus only on the temporal (arrival time) dependence of DCN traffic characteristics and assume uniform spatial (server-to-server) dependencies. However, this fails to capture the spatial variations in server-to-server communication which are needed to accurately mimic real traffic. Works by Alizadeh et al. [5,6] and Bai et al. [7] introduced important DCN systems, but the traffic generators released with their papers fall short of addressing the issues of fidelity, reproducibility, and compatibility with generic network architectures (see Section 2).
These difficulties with simulating DCN traffic have meant that no traffic generation framework, and subsequently no universal DCN system benchmark, has emerged as the networking research field's tool-of-choice. The lack of a rigorous benchmarking framework has been a major issue in DCN literature since individual researchers have often used their own tests without adhering to the aforementioned requirements. This has limited reproducibility, stifled network object prototype benchmarking, and hindered training data supply for novel machine learning systems. Without benchmarking, it is difficult to systematically and consistently test and validate new heuristics for specific tasks such as flow scheduling. Furthermore, without sufficient training data, state-of-the-art machine learning models are less able to replace existing heuristics.
To address the lack of openly available traffic data sets, the aforementioned problems with simulation, and the absence of a system benchmark, a common DCN traffic generation framework is needed. We introduce TrafPy: an open-source Python API for realistic and custom DCN traffic generation for any network under arbitrary loads, which can in turn be used for investigating a variety of network objects such as networks, schedulers, buffer managers, switch/router architectures, and topologies. TrafPy is open-access via GitHub [1] and all data associated with this manuscript via RDR [2]. TrafPy contributes two key novel ideas to traffic generation, which we detail in this paper:
1. Reproducibility guarantee: A novel method for providing a distribution reproducibility guarantee when generating traffic, based on the Jensen-Shannon distance metric (see Section 3.3).
2. Traffic generation algorithm: A novel method for efficiently creating reproducible flow-level traffic with granular control over both spatial and temporal characteristics (see Section 3.5).
In addition to the above, TrafPy also contains the following features which, when combined with these novel aspects, make TrafPy a useful tool for benchmark workload generation:
⋅ Interactivity: A distribution shaping tool for the rapid creation of complex distributions which accurately mimic realistic workloads given only high-level characteristic descriptions (see Appendix C);
⋅ Compatibility: Compatibility with any simulation, emulation, or experimentation environment by exporting traffic into universally compatible file formats; and
⋅ Accessibility: Open-source code and documentation with a low barrier to entry.

Related work
While there is limited literature on DCN traffic generation, data sets, and benchmarking for the reasons outlined in Section 1, there have been notable works striving towards their creation.
Real workloads There is a collection of publicly available DCN workload traces and job computation graph data sets. However, almost all of these stem from Hadoop clusters and are limited to data mining applications [14]; their use is therefore primarily suited to application-specific testing and evaluation, rather than as a generic tool for generating arbitrary loads and for testing and designing DCN systems as TrafPy is proposed for. Additionally, many of them lack flow-level data, which is needed to accurately benchmark network systems.
Real workload characteristics There is a limited body of work, primarily from private corporations, aiming to characterise real DCN workloads without open-accessing the underlying proprietary raw data. Benson et al. [30] built on work done by Kandula et al. [31] and Benson et al. [32] by characterising DCN traffic into one of three categories: university, private enterprise, and commercial cloud DCNs. They identified that each of these categories serviced different applications and therefore had different traffic patterns. University DCNs serviced applications such as database backups, distributed file system hosting (e.g. email servers, web services for faculty portals, etc.), and multicast video streams. Private enterprise DCNs hosted the same applications as university DCNs but additionally serviced a significant number of custom applications and development test beds. Commercial cloud DCNs focused more on internet-facing applications (e.g. search indexing, webmail, video, etc.), and intensive data mining and MapReduce-style jobs. They also went further than prior works by quantifying the number of hot spots and characterising the flow-level properties of DCN traffic.
The above cloud DCN studies came almost exclusively from Microsoft, who primarily service MapReduce-style applications. Roy et al. [33] broke this homogeneous view of cloud traffic by reporting the traffic characteristics of Facebook's DCNs, thereby introducing a fourth DCN category: social media cloud DCNs. Social media cloud applications include generating responses to web requests (email, messenger, etc.), MySQL database storage and cache querying, and newsfeed assembly. This results in network traffic being more uniform and, in contrast to Microsoft's commercial cloud DCNs, having a much lower proportion (12.9%) of traffic being intra-rack.
Note that the above examples did not open-access the full data sets, but rather provided quantitative characterisations of their nature for other researchers to inform their own traffic generators.
Traffic generators In their seminal pFabric work, Alizadeh et al. [6] provided open-access traffic generation code which loosely replicated web search and data mining DCN workloads by following a Poisson flow inter-arrival time distribution whose arrival rate was adjusted to meet a required target load, with a mix of small and large characteristically heavy-tailed flow sizes. Additionally, the same authors [5] released a simple generator which used a server application to create many-to-one flow requests from 9 servers, again following a load-adjustable Poisson arrival time distribution with 80% of flows having a size of 1 kB (a single packet) and 20% being 10 MB. As the authors noted themselves, these workloads were not intended to be realistic, but rather were designed to demonstrate clear impact comparisons between different DCN design schemes handling small latency-sensitive and large bandwidth-sensitive flows. TrafPy, on the other hand, can facilitate the shaping of complex inter-arrival and flow size distributions with one-to-one, many-to-one, and one-to-many non-uniform server-server distributions with ease. Furthermore, TrafPy enables the generation of traffic with the same characteristics as Alizadeh et al. [5,6], but for any network topology with an arbitrary number of servers and link capacities, allowing for the straightforward comparison of novel DCN fabrics with pre-established benchmark workloads.
Similarly, Bai et al. [7] conducted an extensive experiment into the trade-off between throughput, latency, and weighted fair sharing in scenarios where each switch port had multiple queues. Alongside their study they released an open-access traffic generator which could take a configuration file as input and generate both uniform and non-uniform server-server flow requests from pre-defined discrete probability distributions. However, to produce traffic, users had to manually enter numbers into a configuration file, which made the code difficult to use. Furthermore, Bai et al.'s generator had no mechanism for ensuring distribution reproducibility when sampling from a pre-defined probability distribution; a feat achieved by TrafPy via the Jensen-Shannon distance method (see Section 3.3).
The key objective of TrafPy is to augment DCN research projects such as those cited above [5][6][7]. Unlike our work, the primary focus of such projects was not on the traffic generator itself, but rather on using traffic generation as a means of benchmarking innovative ideas. We posit that the fidelity, generality, reproducibility, and compatibility of TrafPy, achieved by generating custom server-level flow traffic, would make such works easier to conduct and to compare against as baselines in future projects.

Design objectives
Designing successful network object benchmarks requires a flexible, modular, and reproducible traffic generation framework. The framework should enable fair comparisons between different systems whilst maintaining a rigorous experimental setting. In light of the issues highlighted in Section 1, the following criteria are required of such a framework:

TrafPy overview
An overview of the TrafPy API user experience is given in Fig. 1 and further elaborated on throughout this manuscript, with Table A.1 summarising the notation used and some API examples given in Appendix C. The core component of TrafPy is the Generator, which can be used for generating custom, literature, or standard benchmark network traffic traces. These traces can be saved in standard formats (e.g. JSON, CSV, pickle, etc.) and imported into any script or network simulator. Researchers can therefore design their systems and experiments independently of TrafPy and use their own programming languages, making TrafPy compatible with already-developed research projects and future network objects. This also means that TrafPy can be used with any simulation, emulation, or experimentation test bed. The Generator has an optional interactive visual tool for shaping and reproducing distributions, therefore little to no programming experience is required to use it to generate and save traffic data in standard formats. As the nature of DCN traffic changes, new traffic distributions can be generated with TrafPy and state-of-the-art benchmarks established.
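For example, a trace saved in a standard format such as JSON can be re-imported by any external script or simulator, in any language. The sketch below uses illustrative flow fields of our own choosing, not TrafPy's exact schema:

```python
import json

# A hypothetical TrafPy-style flow trace; the field names here are
# illustrative assumptions, not the exact schema exported by the TrafPy API.
trace = {
    "flow_0": {"size": 1500, "arrival_time": 0.0, "src": "server_3", "dst": "server_7"},
    "flow_1": {"size": 4096, "arrival_time": 0.2, "src": "server_1", "dst": "server_4"},
}

# Export to a universally compatible format...
with open("trace.json", "w") as f:
    json.dump(trace, f)

# ...and re-import from any external test bed or analysis script.
with open("trace.json") as f:
    loaded = json.load(f)
```

Because the trace is plain data, the test bed and analysis tooling can be designed entirely independently of TrafPy.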

Flow traffic
The flow-centric paradigm considers a single demand as a flow: a task demanding that some information be sent from a source node to a destination node in the network. Flow characteristics include size (how much information to send), arrival time (the time at which the flow arrives ready to be transported through the network, derived from the network-level inter-arrival time, which is the time between a flow's arrival and its predecessor's), and source-destination node pair (which machine the flow is queued at and where it is requesting to be sent). Together, these characteristics form a network-level source-destination node pair distribution (how much, as measured by either probability or load, each machine tends to be requested by arriving flows). When a new flow arrives at a source and requests to be sent to a destination, it can be stored in a buffer until completed (all information fully transferred) or, if the buffer is full, dropped. Once dropped or completed, the flow is not re-used.
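A minimal sketch of this paradigm (the record fields are our own naming, not TrafPy's) shows a flow record and how arrival times follow from inter-arrival times:

```python
from dataclasses import dataclass

# Sketch of a flow-centric demand record; field names are illustrative.
@dataclass
class Flow:
    size: int            # information units to transfer
    arrival_time: float  # when the flow becomes ready to transport
    src: str             # source machine the flow is queued at
    dst: str             # destination machine it requests to reach
    completed: bool = False
    dropped: bool = False

# Each arrival time is its predecessor's arrival plus the inter-arrival gap,
# i.e. the cumulative sum of the inter-arrival times.
inter_arrival_times = [0.5, 0.2, 0.9]
arrivals = []
t = 0.0
for gap in inter_arrival_times:
    t += gap
    arrivals.append(t)

flows = [Flow(size=1500, arrival_time=a, src="s0", dst="s1") for a in arrivals]
```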

TrafPy distributions
At the heart of TrafPy are two key notions: that no raw data should be required to produce network traffic, and that every aspect of the API should be parameterised for reproducibility. To achieve the first, rather than using clustering and autoregressive models to fit distributions to data [34,35], TrafPy provides an interactive tool for visually shaping distributions. This way, researchers need only have either a written (e.g. 'the data followed a Pareto distribution with 90% of the values less than 1') or visual description of a traffic trace's characteristics in order to produce it. To achieve the second, all distributions are parameterised by a handful of parameters (termed D′; see Appendix B for an example of the parameters used in this paper), and a third party need only see D′ in order to reproduce the original distribution. As such, TrafPy traces are discrete distributions in the form of hash tables, which can be sampled at run-time to generate flows. These tables map each possible value taken by all flow characteristics to fractional values which represent either the 'probability of occurring' for size and time distributions, or the 'fraction of the overall traffic load requested' for node distributions. This enables traffic traces to be generated from common TrafPy benchmarks for custom network systems in a reproducible manner without needing to reformat the original data to make it compatible with new systems and topologies, as would be needed if the benchmarks were hard-coded request data sets instead of distributions.

Fig. 1. TrafPy API user experience for using custom or benchmark TrafPy parameters D′ to make flow traffic trace D with maximum Jensen-Shannon distance threshold √JSD_threshold and minimum flow arrival duration t_t,min for m loads {ρ_1, …, ρ_m}. The generated trace D can then be used to benchmark a DCN system test object (e.g. a scheduler) in a test bed (a simulation, emulation, or experimentation environment) to measure the key performance indicators P_KPI. The user need only use TrafPy to generate the traffic; all other tasks can be done externally to TrafPy in any programming language.
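A hash-table distribution of the kind described above can be sampled at run-time with nothing more than the standard library. The flow size values and probabilities below are made up for illustration:

```python
import random

# Sketch: a TrafPy-style discrete distribution as a hash table mapping each
# possible flow size (bytes) to its probability of occurring (made-up values).
flow_size_dist = {1500: 0.6, 64000: 0.3, 10_000_000: 0.1}

random.seed(0)
sizes = random.choices(
    population=list(flow_size_dist.keys()),
    weights=list(flow_size_dist.values()),
    k=10_000,
)

# With enough samples, the empirical fractions approach the table's values.
frac_small = sizes.count(1500) / len(sizes)
assert abs(frac_small - 0.6) < 0.03
```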

Accuracy and reproducibility of distributions
All TrafPy distributions are summarised by a set of parameters D′. Once D′ has been established (by e.g. the community as a benchmark, or a researcher as a custom stress-test or future workload trace), TrafPy must be able to reliably and accurately reproduce (via sampling) the 'original' distribution parameterised by D′ each time a new set of traffic data is generated. Therefore, a guarantee that the sampled distribution will be close to the original is required to ensure reproducibility. TrafPy utilises the Jensen-Shannon Divergence (JSD) [36,37] to quantify how distinguishable discrete probability distributions are from one another. Given a set of n probability distributions {P_1, …, P_n}, a corresponding set of weights {π_1, …, π_n} to quantify the contribution of each distribution's entropy to the overall similarity metric, and the entropy H(P_i) of a discrete distribution with m random variables, the JSD between the distributions can be calculated as in Equation (1):

JSD(P_1, …, P_n) = H(Σ_{i=1}^{n} π_i P_i) − Σ_{i=1}^{n} π_i H(P_i)    (1)

In the context of TrafPy, the P_i distributions are the hash tables of variable value-fraction pairs and the weights are simply set to 1.
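Equation (1) can be sketched directly in Python for the common two-distribution case. Note that we use the conventional equal weights π_1 = π_2 = 1/2 (a simplification of the general n-distribution form) so that, with base-2 logarithms, the resulting distance is bounded by 1:

```python
from math import log2, sqrt

def entropy(p):
    """Shannon entropy (base 2) of a discrete distribution given as a dict."""
    return -sum(v * log2(v) for v in p.values() if v > 0)

def jsd(p, q):
    """Jensen-Shannon Divergence of two discrete distributions, per Equation (1):
    the entropy of the (equally weighted) mixture minus the mean of the entropies."""
    support = set(p) | set(q)
    m = {x: 0.5 * p.get(x, 0.0) + 0.5 * q.get(x, 0.0) for x in support}
    return max(0.0, entropy(m) - 0.5 * entropy(p) - 0.5 * entropy(q))

def js_distance(p, q):
    """Jensen-Shannon distance: the square root of the JSD, a metric in [0, 1]."""
    return sqrt(jsd(p, q))

identical = {1: 0.5, 2: 0.5}
disjoint_a, disjoint_b = {1: 1.0}, {2: 1.0}
assert js_distance(identical, identical) < 1e-9          # identical -> 0
assert abs(js_distance(disjoint_a, disjoint_b) - 1.0) < 1e-9  # disjoint -> 1
```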
The square root of the Jensen-Shannon Divergence gives the Jensen-Shannon distance [37], which is a metric between 0 and 1 used to describe the similarity between distributions (0 being exactly the same, 1 being completely different). The TrafPy API enables users to specify their own maximum √JSD threshold, √JSD_threshold, when sampling data from a set of original distributions to create their own data set(s). A lower distance requires that the sampled distributions be more similar to the original distributions. TrafPy will automatically sample more demands until, by the law of large numbers, the user-specified √JSD threshold is met. Fig. 2 shows how, for an example benchmark's flow size and inter-arrival time distribution, the √JSD between the original and the sampled distributions changes with the number of samples (number of demands). As shown, most characteristic parameters (mean, minimum, maximum, and standard deviation) of the sampled distributions converge at √JSD ≈ 0.1; a threshold reached after 137,435 demands for the flow size distribution and 27,194 for the inter-arrival times. The greater the number of possible random variable values and the complexity of the original distribution, the more demands will be needed to lower the √JSD. The distribution which requires the most demands to meet the √JSD threshold determines the minimum number of demands needed for the generated flow data set to accurately reproduce the original set from which it is sampled.
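The sample-until-threshold behaviour described above can be sketched as follows. The 'original' distribution values are made up, and the `js_distance` helper is a minimal equal-weight implementation rather than TrafPy's internal one:

```python
import random
from collections import Counter
from math import log2, sqrt

def js_distance(p, q):
    """√JSD between two discrete distributions given as value->probability dicts."""
    def H(d):
        return -sum(v * log2(v) for v in d.values() if v > 0)
    xs = set(p) | set(q)
    m = {x: 0.5 * p.get(x, 0.0) + 0.5 * q.get(x, 0.0) for x in xs}
    return sqrt(max(0.0, H(m) - 0.5 * H(p) - 0.5 * H(q)))

# 'Original' benchmark distribution (illustrative values, not a real benchmark).
original = {1500: 0.2, 9000: 0.3, 64000: 0.4, 1_000_000: 0.1}
threshold = 0.1  # user-specified maximum √JSD

random.seed(0)
samples = []
while True:
    # Keep sampling demands in batches until, by the law of large numbers,
    # the empirical distribution is close enough to the original.
    samples += random.choices(list(original), weights=list(original.values()), k=1000)
    counts = Counter(samples)
    sampled = {x: c / len(samples) for x, c in counts.items()}
    if js_distance(original, sampled) <= threshold:
        break  # reproducibility guarantee met
```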

Node distributions
'Node distributions' are a mapping of how much each machine (network node) pair tends to be requested by arriving flows, as measured by the pair's load (flow information arriving per unit time), to form a source-destination pair matrix. These distributions can be defined explicitly on a per-node basis. However, explicit mappings would result in D′ being defined for a specific topology (since each topology might have a different number of machines and/or a different machine labelling convention). Therefore, TrafPy node distributions can also be implicitly defined by high-level parameters. These parameters are the fraction of the nodes and/or node pairs which account for some proportion of the overall traffic load and, optionally, the fraction of the traffic which is intra- vs. inter-cluster (where 'clusters' are usually considered as 'racks' in the context of DCNs). In this way, node distributions can be defined independently of the network topology, enabling greater generality and the use of custom topologies with traffic traces and benchmarks parameterised by D′, even if D′ was originally defined for a different topology. Furthermore, this allows individual or groups of network nodes to be set as 'hot', 'cold', or any combination of hot and cold as desired by the user. Note that this formalism also enables both incast (many-to-one) and outcast (one-to-many) traffic patterns, since any node(s) can have multiple outcast and incast flow demands generated at a given point in time when sampling from the node distribution.
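A minimal sketch of such an implicit definition is given below. The parameter names (`hot_pair_frac`, `hot_load_frac`) and the helper itself are our own illustration, not TrafPy's actual API: a specified fraction of node pairs ('hot' pairs) accounts for a disproportionate share of the overall load, independently of the topology's size or labelling:

```python
import itertools
import random

def node_pair_load_fractions(nodes, hot_pair_frac=0.1, hot_load_frac=0.8, seed=0):
    """Illustrative implicit node distribution: hot_pair_frac of the ordered
    node pairs carry hot_load_frac of the overall traffic load."""
    pairs = list(itertools.permutations(nodes, 2))  # excludes self-pairs
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_hot = max(1, int(hot_pair_frac * len(pairs)))
    hot, cold = pairs[:n_hot], pairs[n_hot:]
    dist = {p: hot_load_frac / len(hot) for p in hot}
    dist.update({p: (1 - hot_load_frac) / len(cold) for p in cold})
    return dist  # fractions of the overall load, summing to 1

# Works for any topology size without redefining D'.
dist = node_pair_load_fractions([f"s{i}" for i in range(8)])
```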

Traffic generation algorithm
Given the distributions of flow sizes, inter-arrival times, and node pairs P(B_s), P(B_t), and P(B_n) of a benchmark B, TrafPy can generate traffic at an (optionally) specified target load fraction (fraction of overall network capacity being requested for a given time period) ρ_target ∈ [0, 1] with maximum Jensen-Shannon distance threshold √JSD_threshold for an arbitrary topology T with n_n server nodes, n_c channels (light paths) per communication link, and C_c capacity per server node link channel (divided equally between the source and destination ports such that each machine may simultaneously transmit and receive data), forming tuple ⟨n_n, n_c, C_c⟩ with total network capacity per direction (maximum information units transported per unit time) C_t = n_n·C_c·n_c/2. Since load rate is defined as information arriving per unit time, in order to generate traffic at arbitrary loads, either the amount of information (flow sizes) or the rate of arrival (flow inter-arrival times) must be adjusted in order to change the load rate. Since DCNs tend to handle particular types of applications and jobs which result in particular flow sizes, we posit that a reasonable assumption is that changing loads are the result of changing rates of demand arrivals rather than changing flow sizes (which remain fixed for a given application type). Therefore, if a target load is specified, TrafPy automatically adjusts the scale of the inter-arrival time distribution values in P(B_t) by a constant factor to meet the target load whilst keeping the same general shape of the P(B_t) distribution that was initially input to the generator.
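The load-adjustment step above can be illustrated with a toy calculation (all numbers are made up): a single constant factor rescales every inter-arrival time, changing the offered load without changing the shape of the inter-arrival distribution.

```python
# Sketch of the load-adjustment step: scale all inter-arrival times by one
# constant so the offered load matches rho_target. Numbers are illustrative.
n_n, n_c, C_c = 64, 1, 1250.0        # servers, channels per link, capacity per channel
C_t = n_n * C_c * n_c / 2            # total network capacity per direction

flow_sizes = [1500.0] * 900 + [64000.0] * 100
gaps = [0.01] * len(flow_sizes)      # raw sampled inter-arrival times

duration = sum(gaps)
current_load = sum(flow_sizes) / (duration * C_t)  # info per unit time / capacity

rho_target = 0.5
scale = current_load / rho_target    # stretch (or shrink) every gap equally
gaps = [g * scale for g in gaps]     # distribution shape is preserved

achieved = sum(flow_sizes) / (sum(gaps) * C_t)
```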
The following 3-step traffic generation process (summarised in Algorithm 1) is used to achieve the above.

Step 1 (generate n_f flows with size and arrival time characteristics {b_s, b_a}): First, n_{b_s} flow sizes and n_{b_t} inter-arrival times are independently sampled from P(B_s) and P(B_t) to form vectors b_s and b_t respectively, where n_{b_s} and n_{b_t} are incrementally increased by a constant factor until, by the law of large numbers, √JSD(P(B_s), P(b_s)) ≤ √JSD_threshold and √JSD(P(B_t), P(b_t)) ≤ √JSD_threshold. Whichever distribution needed fewer samples to meet the threshold is then continually sampled such that there are n_f flow sizes and inter-arrival times, where n_f = max({n_{b_s}, n_{b_t}}). Then, b_t (whose order is arbitrary from the previous random sampling process) can be converted to an equivalent arrival time vector b_a by initialising a zero array of length n_f and setting each element to the cumulative sum of the preceding inter-arrival times.

Step 2 ('pack the flows' → generate n_f flows with size, arrival time, and source-destination node pair characteristics {b_s, b_a, b_p}): Next, to meet the source-destination node pair load fractions specified by P(B_n), the flows are packed into node pairs with a simple packing algorithm. First, a vector of n_n² − n_n node pairs b_p (which do not include self-similar pairs) and their corresponding load pair fractions b_n are extracted from P(B_n). Next, these 'target' load pair fractions b_n are converted into a hash table mapping each pair p of the [1, …, n_n² − n_n] pairs to their current 'distance' from their respective target total information request magnitudes d = ϱ·b_n·t_t.
In other words, we take the load fractions (fraction of overall information requested) of each node pair b_n and multiply them by the total simulation load rate (information units arriving per unit time) ϱ and the total simulation time t_t to create a vector d which, when first initialised, represents the total amount of information requested by each source-destination pair across the whole simulation. The task of the packer is therefore to assign source-destination pairs to each flow such that d_p ≈ 0 ∀p ∈ [1, …, n_n² − n_n]. For each sequential ith flow ∀i ∈ [1, …, n_f], after sorting the pairs in descending d_p order (with any pairs with equal d_p randomly shuffled amongst one another), the packer will try to 'pack the flow' (given its size b_s^i) into a source-destination pair in two passes. For the first pass, the packer loops through each sorted pth pair ∀p ∈ [1, …, n_n² − n_n] and checks that assigning the flow to this pair would not result in d_p < 0. If this condition is met, the packer sets b_p^i = p and d_p := d_p − b_s^i before moving to the next flow. However, if the condition is violated for all pairs, the packer moves to the second pass, where it again loops through each sorted pair p but now, rather than ensuring d_p ≥ 0, only ensures that assigning the pair would not exceed the maximum server link's source/destination port capacity C_c/2 before setting b_p^i = p and d_p := d_p − b_s^i. In other words, the first pass attempts to achieve d_p ≈ 0 ∀p ∈ [1, …, n_n² − n_n] to try to match P(B_n) but, failing that, the second pass ensures that no server link load exceeds 1.0 of the link capacity. Consequently, as ρ_target approaches 1.0, so too will the load on the resultant packed node distribution's server links, thereby converging on a uniform distribution no matter what the original skewness of P(B_n) was, as shown in Fig. 3 and further elaborated on in Appendix E.
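A toy sketch of the two-pass packer follows. It is deliberately simplified relative to the description above: the second-pass C_c/2 port-capacity check is omitted (the fallback simply overfills the most-starved pair), and tied pairs are not randomly shuffled:

```python
def pack_flows(flow_sizes, target_info_per_pair):
    """Assign each flow a node pair, driving each pair's remaining
    information budget d_p towards ~0 (simplified two-pass packer)."""
    d = dict(target_info_per_pair)  # remaining info budget per node pair
    assignment = []
    for size in flow_sizes:
        # First pass: most-starved pair whose budget still fits the flow.
        for pair in sorted(d, key=d.get, reverse=True):
            if d[pair] - size >= 0:
                break
        else:
            # Second pass (fallback, simplified): overfill the most-starved pair.
            pair = max(d, key=d.get)
        d[pair] -= size
        assignment.append(pair)
    return assignment

# Two pairs, one requesting 3x the load of the other.
targets = {("s0", "s1"): 300.0, ("s2", "s3"): 100.0}
assignment = pack_flows([50.0] * 8, targets)
```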
Once this packing process is complete, a set {b_s, b_a, b_p} of n_f flows, each with size b_s, arrival time b_a, and source-destination node pair b_p, with an overall load ρ_target on network T and a flow size, inter-arrival time, and node distribution of approximately P(B_s), P(B_t), and P(B_n), will have been fully initialised.
Step 3 (ensure b_a^{n_f} − b_a^0 ≥ t_t,min): The final stage of the flow generation process is to ensure that the flow arrival duration t_t is greater than or equal to some minimum duration t_t,min (a parameter often required for test bed measurement reliability) specified by the user. This is done by simply duplicating the set {b_s, b_a, b_p} of flows β = ⌈t_t,min/t_t⌉ times to make an updated set of n_f := β·n_f flows with t_t ≥ t_t,min and the same distribution and load statistics as before.
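Step 3 can be sketched as below. Note that the per-copy time shift is our assumption of how the duplicated copies are laid out so that the arrival duration actually grows; the size, pair, and inter-arrival statistics are unchanged:

```python
from math import ceil

# Sketch of Step 3: duplicate the flow set beta = ceil(t_min / t_t) times,
# time-shifting each copy (an assumption of ours) so that the total arrival
# duration reaches t_min. Flow tuples: (size, arrival_time, pair).
flows = [
    (1500, 0.0, ("s0", "s1")),
    (9000, 2.0, ("s2", "s3")),
    (1500, 5.0, ("s0", "s2")),
]
t_t = flows[-1][1] - flows[0][1]   # current arrival duration
t_min = 12.0                       # required minimum duration
beta = ceil(t_min / t_t)           # number of copies needed

extended = [
    (size, arrival + k * t_t, pair)
    for k in range(beta)
    for (size, arrival, pair) in flows
]
```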

Traffic generation guidelines
Given a user- or benchmark-specified set of distribution parameters D′, TrafPy generates traffic trace D. As such, whenever using TrafPy to generate D, D′ should always be reported to help others reproduce the same trace (as done in Table B.2 of Appendix B for this manuscript). For the same reason, all traffic traces D generated from D′ should have a maximum √JSD_threshold of 0.1, as outlined in Section 3.3. Enough demands should be generated so as to have a last demand arrival time t_t larger than the time needed to complete the largest demands in the user-defined network T under the test conditions used; not doing so would result in all large flows being dropped regardless of what decisions were made. This would unfairly punish systems optimised for large demands, since such systems would allocate network resources to requests which ultimately could never be completed during the experiment. TrafPy conveniently generates and saves traffic data sets in a range of formats, including JSON, CSV, and pickle. Therefore, if desired, users may generate traffic in TrafPy and then use their own custom test bed and analysis scripts written in any programming language by simply importing the TrafPy-generated traffic. For result reliability, each trace D should be generated R times from D′ and used to test the network object, where R should be sufficiently large so as to give a satisfactory confidence interval (which might vary from project to project but should be reported regardless).
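Reporting such a confidence interval over R repeated traces might look as follows; this sketch uses a normal approximation and made-up KPI values purely for illustration:

```python
from statistics import mean, stdev

# Sketch: 95% confidence interval over R repeated traces generated from the
# same D'. The KPI values below (e.g. mean flow completion time per run)
# are made up for illustration.
fct_means = [3.1, 2.9, 3.4, 3.0, 3.2]
R = len(fct_means)
m, s = mean(fct_means), stdev(fct_means)
half_width = 1.96 * s / R ** 0.5   # normal approximation to the 95% CI

print(f"FCT = {m:.2f} +/- {half_width:.2f} (95% CI, R={R})")
```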

Optical networks
The key purpose of TrafPy is to serve as a tool for exploring novel areas of DCN research. One such area of particular importance is that of optical DCNs, which strive to replace electronically interconnected networks with optical systems in order to improve performance whilst reducing power consumption.

Limitations of current electronic packet switched networks
The servers of traditional multi-tier data centre and high performance computing (HPC) systems are interconnected by electronic packet switched (EPS) networks. Such 'electronic DCNs' have poor scalability, bandwidth, latency, and power consumption. Data centres now consume 2% of the world's electricity; more than the entire aviation industry, and estimated to increase to 15% by 2030, with the network sometimes accounting for >50% of total power consumption [38]. Furthermore, the sensitivity of electronic switches to workloads limits their computational and application performance. Compounding this, the slowing of Moore's Law coinciding with new data-hungry demands means that electronic switches are unable to keep up with emerging applications (internet-of-things, artificial intelligence, genome processing, etc.) which follow data-heavy trends [39,40]. Although the compute power of DCN server nodes, as measured by flops per second, has increased by a factor of 65 over the last 18 years, the bandwidth of the DCN network facilitating communication between these nodes has only increased by a factor of 4.8, resulting in an 8-factor decrease in bytes communicated per flop. This has created a performance bottleneck not in the server nodes themselves, but rather in the network connecting them. As a result, management systems such as machine placers, schedulers, and topology controllers are being forced to minimise data movement and constrain applications to operate locally, even when those applications would otherwise benefit from utilising more distributed architectures. Further degrading system and application performance, these networks also suffer from high median and 99th-percentile latencies on the order of 100 μs and 100 ms respectively.

Optical circuit switched networks
DCNs with optical interconnects have the potential to offer orders-of-magnitude improvements in performance and energy efficiency, and thereby address the limitations of EPS networks [48,64-66]. Optical circuit switched (OCS) networks offer a promising avenue with which to realise optical DCNs, and have been used in many DCN solutions as they offer stable non-blocking circuit configurations with high capacity and scalability [41]. In contrast to optical packet switching, they are simpler to implement and they eliminate the need for in-switch buffering, queuing, and addressing. They establish single-hop connections with a wide range of circuit establishment times, lasting from orders of magnitude less than a second to hours or days. Leveraging stable circuit establishments, they can employ wavelength division multiplexing (WDM) and advanced modulation formats to reach higher capacities. OCS switches are readily available [41] and are being used as part of many existing networks. They are mainly employed as part of a hybrid network, as in Ref. [42], in order to cater to specific types of traffic. However, they cannot be used on their own as they suffer from two key limitations: long reconfiguration times (the time taken to switch) and long circuit computation times (the time taken to compute the schedule), as shown by Fig. 4, which plots the circuit computation and reconfiguration times of the key state-of-the-art OCS technologies. In summary, slow beam steering and light guiding technologies (millisecond OCS) were assisted by slow software-based circuit computation to provide reconfiguration, also in milliseconds (HELIOS, Firefly, and OSA) [42-44]. More recent work has shown microsecond-speed WSS-based OCS reliant on FPGA-based control (REACToR, Mordia) [45,46]. Rotor switches and fast SOA-based switches with schedule-less control were also explored for fast OCS in RotorNet [47] and Sirius [48] respectively.
Although schedule-less architectures simplified the control plane, they result in performance-inefficient networks as network resources are allocated uniformly even in dynamic and skewed traffic environments.
However, with transceiver rates growing at a staggering pace, already reaching 100 Gbps [49] (and trending towards 400G and 800G) and switch bandwidth increasing beyond 6.4 Tbps [50], the increased data rate makes OCS 5-6 orders of magnitude too slow. This ever-increasing gap between OCS switching/control speed and transceiver data rate makes OCS unsuitable as a standalone solution. Hence, PULSE (indicated by a star in Fig. 4) [51] proposed a two-fold solution: the first is the use of SOA-aided widely tunable switching methods to minimise the reconfiguration time to sub-nanoseconds [52]; the second is a custom-made ASIC controller or scheduler that reduces the circuit computation time to nanoseconds. PULSE matches OCS switching times to packet-level granularity, making OCS suitable and adaptable to modern high-capacity, high-bandwidth, high-speed data centre networks.
However, the performance of PULSE is heavily reliant on the performance of the scheduling heuristic employed. TrafPy can therefore be used as a tool with which to evaluate the performance of different design choices and resource management systems in novel OCS networks, such as PULSE (an OCS DCN system which was developed with the help of TrafPy [53]), and thereby help to realise future all-optical DCNs.

Experiment
Here we conduct a brief experiment into the sensitivity of 4 schedulers to different traffic traces. Specifically, we look at shortest remaining processing time (SRPT) [6,54,55], fair share [54], first fit (FF) [56], and random DCN flow scheduling.

Network
All experiments assume an optical TDM-based circuit switched network architecture with a 64-server folded-Clos (spine-leaf) topology made up of 2 core switches, 4 top-of-rack (ToR) switches, and 64 servers (16 servers per rack) with bidirectional links, as shown in Fig. 5. The server-to-ToR and ToR-to-core links each have 1 channel with 10 Gbps and 80 Gbps capacity respectively, leading to a 1:1 subscription ratio and a total network capacity of 640 Gbps (320 Gbps bisection bandwidth). Flows are mapped to TDM circuits, and we assume ideal server-level time multiplexing of the flows' packets such that the bandwidth of each channel can be fully utilised. The core switch performs link/fibre switching. There are various ways to perform packet/TDM aggregation of flows at the server and to realise such networks, but neither is the focus of this paper.
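The topology above can be sketched programmatically. The following is an illustrative construction (the node names and dict-based link representation are our own, not a TrafPy API) that confirms the 64-server count, 640 Gbps total capacity, and 1:1 subscription ratio:

```python
def build_spine_leaf(num_cores=2, num_tors=4, servers_per_rack=16,
                     server_gbps=10, tor_core_gbps=80):
    """Sketch of the 64-server folded-Clos (spine-leaf) topology of Fig. 5,
    as a dict of bidirectional links -> capacity in Gbps."""
    links = {}
    for t in range(num_tors):
        # every ToR connects to every core switch (folded Clos)
        for c in range(num_cores):
            links[(f"tor_{t}", f"core_{c}")] = tor_core_gbps
        # 16 servers hang off each ToR via 10 Gbps bidirectional links
        for s in range(servers_per_rack):
            links[(f"server_{t}_{s}", f"tor_{t}")] = server_gbps
    return links

net = build_spine_leaf()
server_caps = [cap for (a, _), cap in net.items() if a.startswith("server")]
assert len(server_caps) == 64      # 64 servers
assert sum(server_caps) == 640     # 640 Gbps total network capacity
assert 2 * 80 == 16 * 10           # per-rack uplink = per-rack server capacity (1:1)
```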

Traffic traces
We use TrafPy to generate 2 categories of traffic with which to investigate our schedulers: DCN traces based on real-world application data, and custom skewed node and rack data for testing system performance under extreme conditions. We use a maximum √JSD threshold of 0.1, setting t_t,min = 3.20 × 10^5 μs (≈10 times larger than the time taken to complete the largest ≈20 × 10^6 B flow amongst our benchmarks), and generating traffic of loads 0.1-0.9 for each data set. We generate each set R = 5 times to run 5 repeats of our experiments and thereby ensure reliability. All TrafPy parameters D′ used to generate the traffic are reported in Table B.2 of Appendix B for reproducibility.
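The √JSD acceptance criterion can be checked directly. Below is a minimal sketch (the histogram values are illustrative placeholders, not the paper's data) computing the Jensen-Shannon distance between a target and a generated distribution, using base-2 logarithms so the distance is bounded in [0, 1]:

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance sqrt(JSD) with base-2 logs (bounded in [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability bins
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

# Hypothetical target (measured) vs generated flow-size histograms over shared bins.
target = [0.50, 0.25, 0.15, 0.10]
generated = [0.48, 0.27, 0.14, 0.11]
assert js_distance(target, generated) <= 0.1  # accept under the paper's threshold
```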

'Realistic' DCN traces
Four types of DCN and their network flow demand distributions are explored: University [30], Private Enterprise [57], Commercial Cloud [31], and Social Media Cloud [33]. Each DCN type services different applications and therefore has a different traffic pattern. Using TrafPy, flow distributions for each of these categories were generated to establish a set of open-source traffic traces for the DCN benchmark. The tuned TrafPy parameters D′ of each flow characteristic are summarised in Table B.2. The resultant distributions are shown in Fig. 6, and a quantitative summary of each distribution's characteristics is given in Table B.3 of Appendix B.

'Extreme' skewed node and rack sensitivity traces
We generated two additional traces: the skewed nodes sensitivity benchmark and the rack sensitivity benchmark. These were not based on realistic data, but rather designed to test and better understand our systems under extreme conditions. Both use the same flow size and inter-arrival time distributions as the commercial cloud data set in Fig. 6; however, the node distribution is adjusted. Specifically, the skewed nodes benchmark is made up of 5 sets with uniform, 5%, 10%, 20%, and 40% of the server nodes being 'skewed' by accounting for 55% of the total overall traffic load, named skewed_nodes_sensitivity_uniform, 0.05, 0.1, 0.2, and 0.4 respectively (see Appendix E for further justification and analysis of these values). Similarly, the rack sensitivity benchmark is made up of 5 sets with uniform, 20%, 40%, 60%, and 80% of the traffic being intra-rack (and the rest inter-rack), named rack_sensitivity_uniform, 0.2, 0.4, 0.6, and 0.8 respectively. These distributions therefore allow for investigations into DCN system sensitivity to i) the number of skewed nodes and ii) the ratio of intra- vs. inter-rack traffic. They are plotted in Fig. 7.
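As a rough sketch of how such a skewed node distribution can be constructed (a simplified illustration, not the TrafPy implementation):

```python
def skewed_node_dist(num_nodes, skew_frac, skew_load=0.55):
    """Per-node probability of being chosen as a flow end point, with
    `skew_frac` of the nodes jointly accounting for `skew_load` of the traffic."""
    n_skew = max(1, round(num_nodes * skew_frac))
    hot = skew_load / n_skew                       # 'hot' nodes share 55% of load
    cold = (1 - skew_load) / (num_nodes - n_skew)  # remaining nodes share 45%
    return [hot] * n_skew + [cold] * (num_nodes - n_skew)

dist = skewed_node_dist(64, 0.05)            # cf. skewed_nodes_sensitivity_0.05
assert abs(sum(dist) - 1.0) < 1e-9           # valid probability distribution
assert abs(sum(dist[:3]) - 0.55) < 1e-9      # round(64 * 0.05) = 3 hot nodes
```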

Simulation details
We use a time-driven simulator where scheduling decisions are made at fixed intervals. The time between decisions is the 'slot size'; smaller slot sizes result in greater scheduling decision and measurement metric granularity, but at the cost of longer simulation times and the need for scheduler and switch hardware optimisation [52,53,58-60,63]. We use a slot size of 1 ms. We assume perfect packet time-multiplexing whereby the scheduler is allowed to schedule as many flow packets for the next time slot as the channel bandwidth of the rate-limiting link in its chosen path will allow. We run 9 simulations (loads 0.1-0.9) for each benchmark data set, terminating the simulation when the last demand arrives at t = t_t (which is ≥ t_t,min = 3.20 × 10^5 μs). We set the warm-up time as 10% of the simulation time t_t, before which no collected data contribute to the final performance metrics. Similarly, since the simulation is terminated at t_t, we exclude any cool-down period from measurement. For each experiment, we then record: (1) mean flow completion time (FCT); (2) 99th percentile (p99) FCT; (3) maximum (max) FCT; (4) absolute throughput (total number of information units transported per unit time); (5) relative throughput (fraction of arrived information successfully transported); and (6) fraction of arrived flows accepted. We report each metric's mean across the R = 5 runs and the corresponding 95% confidence intervals.

Fig. 7. TrafPy node distribution plots for the skewed nodes sensitivity benchmark with (a) uniform, (b) 5%, (c) 10%, (d) 20%, and (e) 40% of nodes accounting for 55% of the overall traffic load, and for the rack sensitivity benchmark with (f) uniform, (g) 20%, (h) 40%, (i) 60%, and (j) 80% of traffic being intra-rack and the rest inter-rack.
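As an illustration of the warm-up exclusion and confidence-interval reporting described above, the following sketch (the data layout and function name are our own assumptions, not the simulator's interface) computes the mean FCT across repeats:

```python
import statistics

def summarise_fct(runs, warmup_frac=0.10, t_end=3.2e5):
    """Mean FCT per run, excluding flows completed during the 10% warm-up,
    reported as the mean across runs with a 95% confidence interval.
    `runs` is a list of [(completion_time_us, fct_us), ...] lists, one per repeat."""
    warmup = warmup_frac * t_end
    run_means = [
        statistics.mean(fct for t, fct in run if t >= warmup) for run in runs
    ]
    mean = statistics.mean(run_means)
    # 95% CI assuming approximately normally distributed run means
    ci95 = 1.96 * statistics.stdev(run_means) / len(run_means) ** 0.5
    return mean, ci95

runs = [[(1.0e4, 100.0), (2.0e5, 200.0)],   # first flow falls inside warm-up
        [(5.0e4, 150.0), (3.0e5, 250.0)]]
mean, ci = summarise_fct(runs)
assert (mean, ci) == (200.0, 0.0)  # both run means are 200.0 after warm-up filter
```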

Results
To begin the investigation into the sensitivity of different schedulers, we first input TrafPy-generated traffic with heavily skewed nodes and racks (see Section 5.2.2) into our simulator to understand how the four schedulers considered behave at the extremes. We then test the same schedulers under traces for different DCN types to see how the results from the 'extreme' condition investigation translate into more realistic scenarios. For brevity, we provide the full results in Appendix F and a summary in this section.
Extreme Rack Conditions
As shown in Table F.17, as the rack distribution becomes heavily skewed to intra-rack, the completion time metrics of FS become increasingly superior to SRPT. This suggests that real DCNs which have heavy intra-rack traffic (e.g. social media cloud DCNs) would benefit from deploying pure FS scheduling policies, at least at higher loads, whereas DCNs with heavy inter-rack traffic (e.g. university DCNs) would benefit from deploying FS at medium loads and SRPT at low and high loads.
In terms of throughput and demands accepted, FF is competitive with SRPT and FS at low intra-rack traffic levels, but as the DCN becomes more heavily intra-rack (e.g. social media cloud DCNs), SRPT and FS are preferable, with FS achieving the best performances at higher loads. Again, a preferable strategy would likely be to utilise SRPT strategies at low loads before switching to FS at loads about 0.3-0.5 (depending on the level of intra-rack traffic).
Extreme Node Conditions
As shown in Table F.18, at the two extremes of heavily skewed and uniform traffic, scheduler completion time performances are similar in that SRPT outperforms FS at low and high loads, but FS performs well at medium loads. However, in between these two extremes (around 40% of nodes requesting 55% of overall traffic), there is a point where FS becomes the dominant scheduler in terms of completion time.
In terms of throughput and demands accepted, under heavily skewed conditions (5% of nodes requesting 55% of traffic), FF and/or Rand beat SRPT and FS across all 0.1-0.9 loads in terms of throughput and fraction of information accepted. This suggests that SRPT and FS are strained under high skews with respect to these two metrics. However, as observed with the uniform distribution, this comes at the cost of the fraction of arrived flows accepted, where SRPT and FS outperform FF and Rand across all loads. As the proportion of nodes requesting 55% of traffic is increased to 10%, 20%, and 40%, relative scheduler performances converge to those seen with the uniform distribution, with FS and SRPT being mostly dominant except at high 0.8 and 0.9 loads, where FF often has the better throughput and fraction of information accepted.
Realistic Conditions
Table F.19 summarises the results for the four schedulers on each of the four 'realistic' DCN benchmarks considered. As shown, the SRPT scheduler tends to achieve the best completion time metrics when loads are low (≤ 0.7) and where traffic is primarily inter-rack (the University and Private Enterprise DCNs). This is to be expected, since a policy which prioritises completion of the smallest flows as soon as possible will keep its completion time averages low. However, as traffic reaches higher loads (> 0.7), the fair share policy achieves the best completion time metrics. This indicates that networks would benefit from scheduling policies which can dynamically adapt to changing traffic loads. Moreover, for networks with characteristically intra-rack traffic (the Commercial Cloud and Social Media Cloud DCNs), the fair share policy attains the best completion time and throughput metrics. These results therefore validate the predictions made by the rack distribution sensitivity study; namely, that completion time metrics in real DCN traces with heavily intra-rack traffic (e.g. Commercial Cloud and Social Media Cloud) benefit from FS scheduling strategies, whereas, at least at low loads, low intra-rack DCN traces (e.g. University and Private Enterprise) benefit from SRPT scheduling strategies.
These results suggest that not only should scheduling policies be adapted to changing traffic loads, but also to changing characteristics such as the level of inter-vs. intra-rack communication. Note that, as expected, the fair share policy provides the best worst-case completion time (max FCT), the greatest network utilisation (throughput), and the strongest service guarantee (number of flow requests satisfied) across most loads and DCN types.
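These observations could be distilled into a simple, purely illustrative policy-selection rule. The thresholds below (0.7 load, 0.5 intra-rack fraction) are indicative of the trends reported above, not tuned values:

```python
def choose_scheduler(load, intra_rack_frac):
    """Hypothetical traffic-informed policy selection, sketched from the results:
    FS dominates under heavy intra-rack traffic and at high loads; SRPT is
    preferable for inter-rack traffic at low/medium loads."""
    if intra_rack_frac > 0.5:
        return "fair_share"                            # heavily intra-rack: FS
    return "fair_share" if load > 0.7 else "srpt"      # inter-rack: SRPT until high load

assert choose_scheduler(0.3, 0.2) == "srpt"        # e.g. University-like trace
assert choose_scheduler(0.9, 0.2) == "fair_share"  # same trace at high load
```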

Conclusion & further work
In conclusion, we have introduced TrafPy; an API for generating custom and realistic DCN traffic and a standardised protocol for benchmarking DCN systems which is compatible with any simulation, emulation, or experimentation test bed. These systems can be any combination of networked devices or methods such as schedulers, switches, routers, admission control policies, management protocols, topologies, buffering methods, and so on. TrafPy has been developed with a focus on having a high level of fidelity, generality, scalability, reproducibility, repeatability, replicability, compatibility, and comparability in the context of DCN research, which in turn will aid in accelerating innovation.
We have demonstrated the efficacy of TrafPy by briefly investigating the sensitivity of four canonical schedulers to varying traffic loads and characteristics. The scheduler performances were heavily dependent on the level of intra-rack traffic and the overall network load. We found that SRPT was generally the dominant scheduler for low intra-rack traffic (particularly at low loads), but that FS became superior across all loads at high intra-rack levels. These insights were then found to translate to realistic DCN traces, with low intra-rack users such as University and Private Enterprise DCNs benefiting from SRPT policies at low and medium loads, and high intra-rack traces such as Commercial Cloud and Social Media Cloud being more suited to FS strategies. This shows that there is no 'one size fits all' strategy for scheduling different types of DCNs, and that there would be great value in the development of traffic-informed and dynamic DCN systems. With its standardised traffic generation and benchmark protocol, TrafPy is an ideal tool for developing such systems via the benchmark paradigm described throughout this manuscript.
The space of potential research areas from this work is vast. We hope presently unforeseeable avenues will be pursued with the support of TrafPy's standardised traffic generation and rigorous benchmarking framework. For our own work, based on the preliminary results of scheduler sensitivity to varying load conditions and traffic trace characteristics, we expect to develop new scheduling heuristics and learning algorithms which can dynamically adapt to network traffic states and outperform literature baselines in open-source TrafPy benchmarks. The 2.5 TB of open-access simulation data from this manuscript open some interesting offline reinforcement learning opportunities. We also anticipate adding more sensitivity-testing and realistic DCN traffic traces to the suite of TrafPy benchmarks. Furthermore, there are some exciting features which could enhance TrafPy. For example, although TrafPy can generate traces without any raw data given whatever characteristic distributions the user provides, it would be useful to be able to input real data (e.g. Ref. [7]) and have TrafPy automatically characterise the traffic in order to generate realistic data. Additionally, we plan to include a computation graph view of DCN network traffic in the TrafPy API, unifying the flow-centric paradigm from the networking community with the job-centric perspective from computer science. This could lead to exciting novel research, such as network-and job-aware DCN scheduling.

Author statement
We declare that all authors made notable contributions to this manuscript, and that this work is our own original work. This work was completed with the support of the following funders: EPSRC Distributed Quantum Computing and Applications EP/W032643/1; the Innovate UK Project on Quantum Data Centres and the Future 10004793; Opto-Cloud EP/T026081/1; TRANSNET EP/R035342/1; the Engineering and Physical Sciences Research Council EP/R041792/1 and EP/L015455/1; and the Alan Turing Institute.

Declaration of competing interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Christopher Parsonson reports financial support was provided by the Engineering and Physical Sciences Research Council, OptoCloud, TRANSNET, and the Innovate UK Project on Quantum Data Centres for the Future. Christopher Parsonson reports a relationship with The Alan Turing Institute that includes: funding grants.

Appendix A. Table of Notation

Appendix C.1.1. Interactively & Visually Shaping a 'Named' Distribution in a Jupyter Notebook
Example of interactively and visually shaping a Weibull distribution's parameters to achieve a target distribution for some random variable in a Jupyter Notebook (output in Fig. C.8). This same distribution can then be reproduced by using the same parameters.

Appendix C.1.2. Interactively & Visually Shaping a Custom 'Multimodal' Distribution in a Jupyter Notebook
To generate a multimodal distribution, first shape each mode individually (output in Fig. C.9). Then combine the distributions, filling the distribution with a tuneable amount of 'background noise' (output in Fig. C.10). This same distribution can be reproduced using the same parameters. N.B. An equivalent function can be used for generating custom skew distributions with a single mode which do not fall under one of the canonical 'named' distributions.
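Although the original notebook code listings are not reproduced here, the multimodal-shaping idea can be sketched with the standard library (this is not the TrafPy API; the parameter values are illustrative):

```python
import random

random.seed(0)

# Shape each mode as a Weibull, then combine modes and mix in a tuneable
# amount of uniform 'background noise' -- an illustrative sketch only.
def weibull_mode(shape, scale, n):
    return [random.weibullvariate(scale, shape) for _ in range(n)]

mode_a = weibull_mode(shape=1.5, scale=1e4, n=9_000)    # e.g. many small flows
mode_b = weibull_mode(shape=5.0, scale=1e6, n=900)      # e.g. few large flows
noise = [random.uniform(1e3, 1e6) for _ in range(100)]  # ~1% background noise
flow_sizes = mode_a + mode_b + noise
assert len(flow_sizes) == 10_000
```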

Appendix E. Traffic Skew Convergence
A constraint of any traffic matrix is that the load on each end point (the fraction of the end point's capacity being requested) cannot exceed 1.0. Consequently, certain traffic skews become infeasible at higher loads (for example, it is impossible for an n > 1 network to have 1 node requesting 100% of the traffic if the overall network is under a 1.0 load). As shown in Fig. 3, this results in all traffic matrices tending towards uniform (i.e. having no skew) as the overall network load tends to 1.0.
A question users of traffic trace generators may ask is: for a given load, what combination of i) number of skewed nodes, ii) corresponding fraction of the arriving network traffic the skewed nodes request, and iii) overall network load results in the traffic matrix being skewed or not skewed? To answer this question, we make the following assumptions:
⋅ All network end points have equal bandwidth capacities.
⋅ All end points are either 'skewed' or 'not skewed' by the same amount.
⋅ 'Skew' is defined by a skew factor, which is the fractional difference between the load rate per skewed node and the load rate per non-skewed node (the highest being the numerator, and the lowest being the denominator).
⋅ For a given combination of skewed nodes and the load rate they request of some overall network load, any excess load (exceeding 1.0) on a given end point is distributed equally amongst all other end points whose loads are < 1.0.
With the above assumptions, we can calculate the skew factor for each combination of skewed nodes, corresponding traffic requested, and overall network load. Doing this for 0-100% of the network nodes being skewed and requesting 0-100% of the overall network load under network loads 0.1-0.9, we can construct a look-up table of skew factors for each of these combinations before generating any actual traffic. Fig. E.12 shows a high resolution (0.1%) heat map of these combinations, with any skew factors ≥ 2.0 set to the same colour for visual clarity. Fig. E.13 shows the corresponding plots with lower resolution (5%) but with the skew factors labelled. As expected, above 0.6 network loads, certain combinations of number of skewed nodes and traffic requested become restricted as to how much skew there can be in the matrix, with many combinations tending towards uniform (skew factor 1.0) at 0.9 loads. Using the skew factor data from Figs. E.12 and E.13, we can be confident that, with 5%, 10%, 20%, and 40% of the network nodes requesting 55% of the overall network traffic, the skew factor will be > 1.0 across loads 0.1-0.9. Fig. E.14 shows the skew factor as a function of load for these combinations. Therefore, these were the combinations chosen for the skewed nodes sensitivity benchmark defined in Section 5 of this manuscript.
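Under these assumptions, the skew factor for a given combination can be computed as in the following sketch (our own implementation of the stated rules, with excess load above 1.0 redistributed once for simplicity):

```python
def skew_factor(num_nodes, frac_skewed, frac_traffic, load):
    """Skew factor (per-node load of 'hot' vs 'cold' nodes) under the
    Appendix E assumptions: equal node capacities, excess load above 1.0
    redistributed equally amongst the under-loaded nodes."""
    k = round(num_nodes * frac_skewed)
    total = load * num_nodes                       # total requested load
    hot = frac_traffic * total / k                 # load per skewed node
    cold = (1 - frac_traffic) * total / (num_nodes - k)
    if hot > 1.0:                                  # redistribute excess to cold nodes
        cold += (hot - 1.0) * k / (num_nodes - k)
        hot = 1.0
    elif cold > 1.0:                               # or the reverse
        hot += (cold - 1.0) * (num_nodes - k) / k
        cold = 1.0
    return max(hot, cold) / min(hot, cold)

# 5% of 64 nodes requesting 55% of the traffic stays skewed (> 1.0) even as
# excess load is redistributed, consistent with the chosen benchmark combinations.
assert skew_factor(64, 0.05, 0.55, 0.5) > 1.0
assert skew_factor(64, 0.40, 0.55, 0.9) > 1.0
```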

Appendix F.3. Performance Metric Tables
The below performance tables summarise the schedulers' mean performances (averaged across 5 runs, with 95% confidence intervals reported) for each KPI, each load, and each benchmark.

Appendix G. A Note on the Flow-vs. Job-Centric Traffic Paradigms
Common DCN jobs include search queries, generating social media feeds, and performing machine learning tasks such as inference and backpropagation. These jobs are directed acyclic graphs composed of operations (nodes) and dependencies (edges) [61]. The dependencies are either control dependencies (where the child operation can only begin once the parent operation has been completed) or data dependencies (where ≥ 1 tensors are output from the parent operation as required input for the child operation). In the context of DCNs, when a job arrives, each operation in the job is placed onto some machine to execute it. These operations might all be placed onto one machine or, as is often the case, distributed across different machines in the network [62]. The DCN is then used to pass the tensors around between machines executing the operations. Job data dependencies whose parent and child operations are placed onto different machines have their tensors become DCN flows.
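A minimal sketch of how data dependencies translate into network flows under a given placement (the operation names, tensor sizes, and placements below are hypothetical):

```python
# A job's data dependencies become DCN flows only when the parent and child
# operations are placed on different machines.
job_deps = [
    # (parent_op, child_op, tensor_size_bytes)
    ("read", "map", 8e6),
    ("map", "reduce", 2e6),
]
placement = {"read": "server_0", "map": "server_0", "reduce": "server_5"}

flows = [
    {"src": placement[p], "dst": placement[c], "size": size}
    for p, c, size in job_deps
    if placement[p] != placement[c]  # same-machine dependencies need no flow
]
assert flows == [{"src": "server_0", "dst": "server_5", "size": 2e6}]
```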
There are therefore two paradigms when considering traffic demand generation in DCNs: the flow-centric paradigm, which is agnostic to the overall computation graph being executed in the DCN when servicing an application, and the job-centric paradigm, which does consider the computation graph when generating network flows. For this manuscript, we considered the flow-centric paradigm, where a single demand is a flow: a task demanding that some information be sent from a source node to a destination node in the network. Flow characteristics include size (how much information to send), arrival time (the time at which the flow arrives ready to be transported through the network, as derived from the network-level inter-arrival time, i.e. the time between a flow's arrival and its predecessor's), and source-destination node pair (the machine at which the flow is queued and the machine to which it requests to be sent). Together, these node-pair characteristics form a network-level source-destination node pair distribution ('how much', as measured by either probability or load, each machine tends to be requested by arriving flows).
In real DCNs, traffic flows can be correlated with one another since they may be part of the same job and therefore share similar characteristics. An interesting area of future work will be to develop TrafPy to support the job-centric paradigm and have this type of inter-flow correlation. However, this is beyond the scope of this manuscript.