INRFlow: An interconnection networks research flow-level simulation framework

This paper presents INRFlow, a mature, frugal, flow-level simulation framework for modelling large-scale networks and computing systems. INRFlow is designed to carry out performance-related studies of interconnection networks for both high performance computing systems and datacentres. It features a completely modular design in which adding new topologies, routings or traffic models requires minimum effort. Moreover, INRFlow includes two different simulation engines: a static engine that is able to scale to tens of millions of nodes and a dynamic one that captures temporal and causal relationships to provide more realistic simulations. We describe the main aspects of the simulator, including system models, traffic models and the large variety of topologies and routings implemented so far. We conclude the paper with a case study that analyses the scalability of several typical topologies. INRFlow has been used to conduct a variety of studies including evaluation of novel topologies and routings (both in the context of graph theory and optimization), analysis of storage and bandwidth alloca-

Highlights
• INRFlow is a mature, flexible and efficient tool for simulating large-scale systems
• It models network, storage, scheduler and applications
• It has been used extensively in our past research
• It is open source and programmed in C

Introduction
For most activities, our society has come to depend on information and computer systems as a way to improve productivity and, thus, competitiveness. This has both driven forward the development of a plethora of IT technologies and motivated the construction of increasingly large computing facilities. For instance, in the context of business-centric computing systems, companies require ever more computing power to support operations such as mining data from service records, offering on-line services and handling increasingly large amounts of data. The world's largest companies are speculated to have hundreds of thousands of servers sustaining their infrastructure. For example, Google may have around one million servers scattered among 13 datacentres worldwide, whereas Amazon may have roughly half a million servers in 7 datacentres around the world.
An analogous trend occurs in the scientific community, where we can see systems of similar sizes and an ever-increasing appetite for computing power. In this context, the typical applications are computer simulations of various kinds (molecular dynamics, finite elements or weather modelling, to cite a few), which are carried out using increasingly fine-grained models that are expected to be more and more accurate but also to require more and more computing power. Recently, the introduction of new technologies for data analytics has opened a new form of exploitation of scientific computing sites by making it possible to analyse data collected from empirical experimentation. The prime example of data analytics in science is the Large Hadron Collider at CERN, which generates data at a rate of 50 Petabytes per year. To analyse all the generated data, a Grid-like system with over 150 computing centres all over the world is used (see the Worldwide LHC Computing Grid website¹). This data generation rate will be dwarfed by the Square Kilometre Array project, which is expected to generate more than an Exabyte of data per day once it is built by 2020 (see the Square Kilometre Array website²). At any rate, the advent of data analytics within the scientific community has motivated the convergence of datacentre and HPC architectures. Consequently, simulation tools that can cope with both models are becoming increasingly important. As we will explain in this paper, INRFlow offers capabilities for both data-centric and computation-centric systems and covers the gap between simulators specially designed for only one of these models.
The interconnection network (IN, for short), a special-purpose network that allows compute nodes to exchange messages with high throughput and low latency, is a key element of these large-scale computing platforms because its performance has a definite impact on the overall execution time of parallel applications, especially those that are fine-grained and communication-intensive. Indeed, INs have been widely acknowledged (e.g., [1,2,3,4]) to be one of the limiting factors when it comes to scaling up computing systems, essentially because the communication and synchronisation penalties suffered by applications increase with the size of the system.
Current trends show that the number of nodes used in datacentre networks or supercomputers can reach hundreds of thousands [5,6,7,8], and these numbers are expected to grow into the millions in the next decade [3]. This is why the network that interconnects compute nodes in extreme-scale computing sites should not be chosen lightly. The evaluation of an IN is a complex task that requires, among other concerns, deep knowledge about how parallel applications make use of the network. Our interest revolves exactly around this topic: the modelling, simulation and evaluation of large-scale computing systems with special emphasis on the IN.
The main features of INRFlow are its flexibility, its low resource consumption and the modularity of its design. As will be seen throughout the paper, INRFlow can be used to simulate a plethora of topologies and traffic models, each with different degrees of fidelity. We believe this flexibility may tip the scale in favour of INRFlow when it comes to selecting a simulation tool. Furthermore, the requirements to build and use INRFlow, in terms of memory and CPU speed, are frugal, and therefore it may be the environment of choice for quick deployment and for obtaining results fast. It provides the capability to simulate, on a desktop computer, systems composed of millions of nodes in reasonable time. The most extreme configurations modelled used the static engine with over 1M servers and the dynamic engine with over 64K servers (see Section 4). The main limiting factor is normally the amount of RAM, as simulations complete quite fast (typically within hours). The amount of required memory varies depending on the characteristics of the simulation (number of endpoints, switches, complexity of traffic, etc.).
INRFlow is coded in C and can be built with any compliant compiler in both POSIX and Microsoft Windows environments. Most simulation parameters are given at execution time, so only a few decisions have to be taken at compilation time, which, in turn, greatly simplifies compilation.
It currently runs single-threaded, as the runtime is acceptable for our needs (a few hours for tens of thousands of nodes in dynamic mode), but extending it to perform parallel execution should be relatively simple. The source code of INRFlow (released under the GPL), together with the information required for its operation (user manual), can be found on GitLab³.

INRFlow has been the backbone of a large part of our recent research:
In [9] we developed a novel routing for the recursively-defined server-centric networks DCell and FiConn. This was later extended to the HCN/BCN networks [10]. In both cases, significantly improved practical routing algorithms were obtained. In [11] we provided a minimal-path routing for DPillar. In [12] we established the stellar dual-port server-centric design methodology. Any graph can be chosen as the base, and judicious choices result in networks with beneficial properties. Using generalized hypercubes as the base graphs, we constructed GQ* and compared it with the state-of-the-art FiConn and DPillar. In [13], we proposed a multi-objective optimization framework to automate the selection of topologies for a large-scale, exascalable computing system, ExaNeSt [14]. In [15], we analysed the effect of data-storage policies on the interference between storage and application traffic and, in turn, its effect on system performance.
To highlight the capabilities of the simulator, we conclude the paper with an example case study where we analyse the scalability of a number of state-of-the-art topologies for HPC systems and the effects of multipath routing on some of them.

Related Work
The networking community has developed a large variety of network simulation tools with different approaches and objectives. Let us review a small selection, pinpointing the main differences with INRFlow. Note that, as explained before, the main characteristics of INRFlow are its flexibility and its low resource requirements, and in these two aspects it outperforms most of the tools reviewed here.
INSEE [16] is a flexible, lightweight, cycle-driven functional simulator which is also being developed and maintained by our group. INSEE models router functionality in detail and provides a more accurate alternative to INRFlow for simulating mid- to large-scale networks. INSEE is able to simulate a wide variety of router models and topologies and shares part of its code base with INRFlow. HPC-NetSim [17] is a simulator developed to model the Tianhe-2 supercomputer, which occupies second place in the November 2017 Top500 list. Its authors provide an accurate cycle-driven simulation and show the precision of their framework by comparing it with the real system. However, their evaluation is restricted to a 32-endpoint system, so its scalability is difficult to assess. TOPAZ [18], developed at the University of Cantabria, is a cycle-accurate simulator for supercomputer INs with detailed models of the components, which allows very accurate performance measurements to be obtained. It has the ability to interface with GEMS⁵ to perform full-system simulation. TOPAZ is implemented in C++ and offers parallel execution to speed up simulations.
BigNetSim [19], developed at the University of Illinois at Urbana-Champaign, is a trace-driven parallel discrete event simulator. It simulates, with reasonable detail, an integrated model for computation (processors) and communication (network). The simulator allows different levels of detail to evaluate the IN: from simple latency models to detailed models of the network, including k-ary n-cubes and k-ary n-trees. One of the main advantages of this system is its extreme modularity, with easy mechanisms to model new topologies and routing algorithms. BigNetSim has a parallel implementation that allows carrying out large simulations of current and future systems and studying the behaviour of applications developed for those systems. In contrast with INRFlow, in which the system configuration is given as parameters at execution time, BigNetSim is configured at compilation time, in such a way that any change in the models requires recompiling the target modules.
MARS [20] is a simulator of parallel systems developed at IBM and based on the OMNeT++ simulation framework. Its design is oriented to the evaluation of parallel systems and parallel applications, and to that purpose it includes detailed models of both the communication side and the compute nodes. MARS supports several multistage topologies and a variety of switching and routing functions. In addition, it supports multicore configurations in which each processing core has its own MPI stack.
The main strengths of MARS are its conformity to MPI semantics and its ability to run in parallel. However, its scalability seems to be limited to a few thousand endpoints.
MINSimulate [21], developed at the Technical University of Berlin, is a simulator designed to evaluate multistage INs. It implements Clos and Delta networks and supports both wormhole and store-and-forward switching.
Note that INRFlow does not currently support these kinds of networks, but including them would require little effort, as it would simply require implementing connection and routing functions.
The NS-2 simulator [22], from the University of Southern California, is designed for research on wired and wireless TCP-based communication networks. Although high performance computing systems used to rely on high performance interconnects such as InfiniBand or proprietary interconnects for parallel computing workloads, Ethernet is gaining a significant share of the Top500 list, as new versions of 10G Ethernet or 100G Ethernet are leveraged as low-cost INs. However, TCP-based networks are not a good alternative for HPC, so most of these systems are relegated to the lower positions of that list.
The COTSon infrastructure for system-level simulation by HP Labs [23] provides a full-system simulation environment based on AMD's SimNow⁴.
The tool was open-sourced in January 2010 and is able to simulate clusters of many-core processing systems using a functional simulator of a network switch. However, this way of modelling the network is extremely simplistic and incapable of modelling the complexity of traffic interaction within a full-fledged network.
Dimemas [24], developed and maintained at the Barcelona Supercomputing Center, was designed with the evaluation of application behaviour in mind. It can reconstruct the execution of a parallel application on any supported architecture using a trace of that application. Dimemas models computing elements with accuracy but models the INs in a rather simplistic way: as a collection of buses. The workloads used with Dimemas are modelled in detail, with many significant states available for each application thread.
A drawback of this workload complexity is that obtaining traces with a sufficient level of detail requires an instrumented kernel. Dimemas is designed to search for bottlenecks and/or imbalances that may harm the performance of parallel applications; however, its basic network model cannot identify bottlenecks occurring at the network level.

INRFlow design
INRFlow is a flexible, lightweight flow-level simulator focused on modelling large interconnects. To extend scalability, it models the network at the link level without a detailed model of router architecture. In addition to the interconnect, INRFlow models several subsystems of a supercomputer or datacentre, such as the scheduling of applications, the allocation of resources and the mapping of tasks and data sources to computing nodes. It is also able to simulate both the novel storage network of ExaNeSt, in which the storage devices are attached to computing elements, and more classical approaches that rely on a Storage Area Network (SAN). Extending INRFlow is very easy because all the subsystems are independent.

Simulation Engines
INRFlow works at flow level. This high level of abstraction was chosen in order to scale to the system sizes we are targeting. Note, however, that raising the abstraction level also reduces the accuracy of the results, because the particularities of the components and the fine-grain interactions between them are not covered by the model. For us, a workload is a set of pairs of source and destination nodes. INRFlow constructs the network topology and workload at runtime, routes the flows specified by the workload using the specified routing algorithm and, finally, reports statistics.
INRFlow implements two different simulation engines: static and dynamic. In static mode, flows are routed simultaneously and a link's capacity is assumed to be shared among all the flows routed through it. Static mode can handle very large networks and serves to report raw performance metrics for which the causal relationships between flows are not important, such as the mean hop-length of a routing algorithm or preliminary estimations of the throughput.
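The static-mode idea described above can be illustrated with a minimal sketch. It assumes that every flow has already been routed to a path, that all links have the same capacity, and that each link's capacity is split equally among the flows crossing it; the function name `static_flow_shares` and the data layout are hypothetical and are not taken from the INRFlow code base.

```python
from collections import Counter

def static_flow_shares(routes, capacity=1.0):
    """Estimate per-flow bandwidth when all flows are routed at once.

    routes: list of paths, each path being a list of node identifiers.
    Assumption: a link's capacity is divided equally among its flows.
    """
    load = Counter()
    for path in routes:
        for link in zip(path, path[1:]):      # consecutive nodes form a directed link
            load[link] += 1
    shares = []
    for path in routes:
        links = list(zip(path, path[1:]))
        # A flow is limited by the most heavily shared link on its path.
        shares.append(min(capacity / load[l] for l in links) if links else capacity)
    return shares, load

routes = [[0, 1, 2], [3, 1, 2], [0, 1]]
shares, load = static_flow_shares(routes)
mean_hops = sum(len(p) - 1 for p in routes) / len(routes)
print(shares, mean_hops)
```

Because no temporal information is needed, a single pass over the routed flows is enough to obtain metrics such as the mean hop-length or a rough throughput estimate.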
While the static engine works similarly to most simulators used by the datacentre networks community [25,26,27,28], we argue that static analysis does not always accurately reflect network performance due to its lack of temporal and causal modelling. For this reason, INRFlow features a dynamic engine that is able to deal with temporal and causal aspects of the execution. In dynamic mode, the links of the network have capacities and each flow is specified with a weight reflecting the amount of data that must be routed. In addition, the workloads prescribe computing phases and causal relationships among flows, so that some flows must finish before others begin. Dynamic mode provides a more realistic, flow-level simulation of general real-world workloads, as well as a good estimation of the completion times of a collection of application-inspired workloads.
INRFlow is also able to incorporate failures into the description of the system. This is useful to ascertain the fault tolerance of a system.
When failures are present, INRFlow randomly disconnects a number of network links, given either as a percentage of the total number of links or as an absolute number of failures. This can be used to measure the connectivity of the network after the failures occur by counting the number of flows that the network is able to deliver to their destinations, see, e.g., [12]. This is particularly useful to evaluate routing algorithms and the effectiveness of multipath policies. To make full use of this mode, fault-tolerant routing functions are needed; if no such facility is implemented, INRFlow keeps operating by simply dropping the flows that cannot be routed.
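The failure experiment described above can be sketched as follows. This is only an illustration: the helper names (`fail_links`, `delivered_fraction`) are hypothetical, links are assumed to be undirected, and a plain breadth-first reachability test stands in for whatever fault-tolerant routing function is actually under evaluation.

```python
import random
from collections import deque

def fail_links(links, fraction, rng):
    """Return the links that survive after a random fraction of them fails."""
    n_fail = round(len(links) * fraction)
    failed = set(rng.sample(links, n_fail))
    return [link for link in links if link not in failed]

def reachable(adj, src, dst):
    """Breadth-first search: is there still any path from src to dst?"""
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False

def delivered_fraction(links, flows):
    """Fraction of (source, destination) flows that can still be delivered."""
    adj = {}
    for u, v in links:                 # undirected links (assumption)
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return sum(reachable(adj, s, d) for s, d in flows) / len(flows)

ring = [(i, (i + 1) % 8) for i in range(8)]            # 8-node ring
flows = [(s, d) for s in range(8) for d in range(8) if s != d]
surviving = fail_links(ring, 0.25, random.Random(0))   # 25% of the links fail
print(delivered_fraction(ring, flows), delivered_fraction(surviving, flows))
```

Replacing `reachable` with a specific routing function would measure how many flows that routing can deliver, rather than how many are deliverable in principle.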
The INRFlow dynamic engine also supports flow priorities at the link level. Currently, we consider two priority levels, App (high) and I/O (low), and have implemented the following policies, which are depicted in Figure 1 and analysed in detail in [15]:
• No priorities (NP): This is the baseline policy, in which all the flows have the same priority. As we can see in Figure 1a, link bandwidth is shared fairly among the flows.
• Bandwidth Apportion (BA): This policy establishes the proportion of the link bandwidth that will be used for each type of traffic. The number of flow types can vary; in the example of Figure 1b, we have assigned 50% of the bandwidth to inter-process traffic and 50% to storage traffic. In this case, half of the bandwidth will be shared by the flows of each type.
• Full Priority (FP): App traffic has full priority over I/O traffic. Thus, I/O traffic will only use the network resources that are not employed by applications. An example of this policy is depicted in Figure 1c, where all the bandwidth is used to transmit application flows. When these finish, storage flows are transmitted.
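A small sketch may help to compare the three policies on a single link. The function `link_shares` and its parameters are hypothetical; in particular, the sketch assumes that under BA a traffic class with no active flows leaves its share to the other class, which the text does not specify.

```python
def link_shares(n_app, n_io, policy="NP", app_fraction=0.5, capacity=1.0):
    """Return (bandwidth per App flow, bandwidth per I/O flow) on one link."""
    if n_app == 0 and n_io == 0:
        return 0.0, 0.0
    if policy == "NP":                      # all flows share the link equally
        share = capacity / (n_app + n_io)
        return (share if n_app else 0.0, share if n_io else 0.0)
    if policy == "BA":                      # fixed proportion per traffic type
        if n_io == 0:                       # assumption: unused share is reassigned
            return capacity / n_app, 0.0
        if n_app == 0:
            return 0.0, capacity / n_io
        return (capacity * app_fraction / n_app,
                capacity * (1.0 - app_fraction) / n_io)
    if policy == "FP":                      # I/O only uses what App leaves free
        if n_app:
            return capacity / n_app, 0.0
        return 0.0, capacity / n_io
    raise ValueError(f"unknown policy: {policy}")

for policy in ("NP", "BA", "FP"):
    print(policy, link_shares(1, 3, policy))
```

With one App flow and three I/O flows, NP gives every flow a quarter of the link, BA gives the App flow half of it, and FP gives the App flow the whole link until it finishes.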

Topologies
INRFlow implements a large variety of network topologies from both the datacentre and the HPC communities; see Fig. 2 for a few examples. In particular, it is able to simulate server-centric networks, in which both nodes and switches act as routing elements, and switch-centric networks, in which the nodes do not perform any kind of routing. Examples of the datacentre topologies implemented in INRFlow are:
• DPillar: A server-centric datacentre network somewhat inspired by the classic butterfly topology [29]. DPillar provides several desirable properties, such as scalability, network performance and cost efficiency, which make it suitable for building large-scale future datacentres.
• BCube: Another server-centric network architecture, specifically designed for shipping-container-based modular datacentres [28]. BCube offers graceful performance degradation as the server and/or switch failure rate increases.
• Gdcficonn: This is the generalisation of DCell [30] and FiConn [31], a family of recursive topologies which eliminates the requirement for any switches other than the lowest-level commodity ones. It scales to hundreds of thousands of servers while keeping a low diameter.
• HCN/BCN: A recursively-defined family of networks, where the BCN construct is built using (copies of) HCNs by including an additional layer of interconnecting links [32].
• Jellyfish: This is a random network designed to provide high connectivity in datacentres. One of its main characteristics is that, unlike most common topologies, it is incrementally expandable [33].
INRFlow also includes more HPC-focused networks such as the following:
• Torus: A well-known topology that has been historically used to interconnect massively parallel processors. Nodes in a torus are arranged in a d-dimensional grid with wrap-around links.
• Dragonfly: A large-radix, low-diameter, recursive family of topologies that uses a group of high-radix routers (a network group) as a virtual router to increase the effective radix of the network. It then interconnects large numbers of these network groups in an all-to-all fashion using a given connection rule [35]. Currently, we have 5 different connection rules implemented.
• General graph-based topology: Sometimes we do not have the mathematical description of a topology (connection rules). For this reason, we have implemented a loader that is able to load the definition of the topology from a file. This file contains the number of nodes and switches and how they are interconnected. This was developed for our topological optimization framework [13].
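As an illustration of what a connection rule looks like, the torus from the list above can be generated in a few lines. This is a sketch only: `torus_links` is a hypothetical helper, nodes are identified by their grid coordinates, and links are treated as undirected.

```python
from itertools import product

def torus_links(dims):
    """Undirected links of a torus with the given number of nodes per dimension."""
    links = set()
    for coord in product(*(range(k) for k in dims)):
        for axis, k in enumerate(dims):
            nxt = list(coord)
            nxt[axis] = (coord[axis] + 1) % k      # wrap-around neighbour
            nxt = tuple(nxt)
            if nxt != coord:                       # skip dimensions of size 1
                links.add(frozenset((coord, nxt)))
    return links

print(len(torus_links((4, 4))))      # 2D torus with 16 nodes
print(len(torus_links((3, 3, 3))))   # 3D torus with 27 nodes
```

Only the "+1" neighbour is generated in each dimension; the "-1" neighbour is covered when the loop reaches the node on the other side of the link.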
We can implement arithmetic routings, which calculate a path when it is requested, or pre-computed routings, in which all the paths between pairs of nodes are calculated before the simulation starts.
However, implementing a new routing algorithm is not required for every topology, since we have implemented generic routing algorithms that can work with arbitrary topologies:
• Breadth-first search (BFS) routing: This is a single-path routing policy that looks for one of the possible shortest paths between each pair of nodes.
• Equal Cost Multiple Paths (ECMP) routing: This is a multi-path routing algorithm that balances load among all shortest paths between each pair of nodes.
• K-Shortest Path (KSP) routing: This is a multi-path routing algorithm that looks for the K shortest paths between each pair of nodes, where K is an arbitrary parameter.
• AllPath-d (AP) routing: This is a multi-path routing algorithm that balances load over all paths whose length is at most that of the shortest path plus an arbitrary parameter, d. It aims to increase path diversity.
These are very useful when we want to evaluate new topologies or routing functions, since they can be used as a baseline for comparison. However, they involve a relatively high cost in terms of computing time and memory consumption, so for consolidated topologies it is advisable to provide specific routing algorithms.
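To make the difference between single-path BFS routing and ECMP concrete, the following sketch enumerates shortest paths on an arbitrary topology given as an adjacency list. The function names are hypothetical and the code is not meant to reflect how INRFlow implements these routings; it also hints at why generic routings are costly, since the number of shortest paths can grow quickly with network size.

```python
from collections import deque

def bfs_distances(adj, src):
    """Hop distance from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def all_shortest_paths(adj, src, dst):
    """All shortest paths from src to dst (the path set used by ECMP)."""
    dist = bfs_distances(adj, dst)          # distances measured towards dst
    if src not in dist:
        return []
    paths = []

    def extend(path):
        u = path[-1]
        if u == dst:
            paths.append(path)
            return
        for v in adj[u]:
            if dist.get(v) == dist[u] - 1:  # only move one hop closer to dst
                extend(path + [v])

    extend([src])
    return paths

# 4-node ring: two equal-cost paths exist between opposite nodes.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
ecmp_paths = all_shortest_paths(adj, 0, 2)
bfs_path = ecmp_paths[0]                    # single-path routing keeps just one
print(ecmp_paths, bfs_path)
```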

Workload Generator
In INRFlow, nodes are modelled in a rather simplistic way: as traffic generators/consumers. However, they can use a large variety of traffic generators, from purely synthetic traffic patterns to traces extracted from real applications, as well as realistic traffic generators developed by analysing real application traces.

Synthetic Traffic Patterns
INRFlow provides a broad range of synthetic traffic patterns that can be used to measure the performance of the communication infrastructure and that only consider the spatial distribution of traffic. The following are some of the traffic types that can be generated by INRFlow:
• Random: When a packet is generated at a node (the source), the destination is randomly selected following a given probability distribution. The built-in modes are uniform, in which all the nodes have the same probability of being selected as destination, and the non-uniform hot spot and hot region, where a given node or group of nodes, respectively, has a higher probability of being selected as destination, increasing the risk of generating congestion in some regions of the network. Finally, with local traffic, the probability of selecting a destination node decreases with the distance (so that most packets are sent to nearby nodes).
• Permutations: Given a source node, the destination node is always the same and is computed as a permutation of the source node identifier (generally a bit permutation). INRFlow supports classical permutations such as Perfect Shuffle, Bit Reversal, Bit Transpose and Bit Complement.
• Bisection: The network is split uniformly at random into two halves and every server in each half sends a flow to every server in the other half.
• All-to-one: A unique root server is chosen, uniformly at random, and every server sends a flow to the root.
• All-to-all: Every server sends a flow to every other server.
• Many-all-to-all: For a given size s, the network is partitioned uniformly at random into g = N/s groups of servers, each of size at most s. Each server sends a flow to all other servers in its group.
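The bit permutations named in the list above have compact definitions. The sketch below uses their textbook forms on identifiers of a fixed bit width (perfect shuffle as a one-bit left rotation, bit transpose as a swap of the two halves of the identifier); the function names are hypothetical and INRFlow's own implementation may differ in detail.

```python
def bit_reversal(src, bits):
    """Destination is the source identifier with its bits in reverse order."""
    return int(format(src, f"0{bits}b")[::-1], 2)

def bit_complement(src, bits):
    """Destination is the source identifier with every bit flipped."""
    return ~src & ((1 << bits) - 1)

def perfect_shuffle(src, bits):
    """Destination is the source identifier rotated left by one bit."""
    mask = (1 << bits) - 1
    return ((src << 1) | (src >> (bits - 1))) & mask

def bit_transpose(src, bits):
    """Destination swaps the upper and lower halves (bits must be even)."""
    half, mask = bits // 2, (1 << bits) - 1
    return ((src << half) | (src >> half)) & mask

bits = 4                                   # 16-node network
for pattern in (bit_reversal, bit_complement, perfect_shuffle, bit_transpose):
    flows = [(src, pattern(src, bits)) for src in range(1 << bits)]
    print(pattern.__name__, flows[:4])
```

Each pattern is a true permutation: every node appears exactly once as a destination, which is what makes these workloads useful for exposing structural bottlenecks of a topology.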

Real Applications Traces
Synthetic traffic sources provide very useful insights into a network's potential. However, the performance metrics obtained can be unrealistic, as applications use more sophisticated communication patterns than synthetic models. For this reason, INRFlow can also use traces from applications to perform trace-driven simulation. To reproduce the causal relationships between events in the trace files, INRFlow requires a special data structure to store past and future events, shown in Fig. 3. Each node of the simulated applications has an event queue, which is fed from the trace file. A packet is sent through the network when an S (send) event is at the head of the queue.
If an R (receive) event is at the head, it is necessary to access the pending notifications queue to check whether the expected event has already happened; otherwise, processing of events is blocked until the network notifies the awaited reception. The pending notifications queue at each node thus stores reception events that arrive before the application requests them, and it is a crucial element for preserving event causality. The complete process of trace-driven simulation is akin to the one we presented in [36] and works as follows:
1. Enqueue in each node's event queue all the events it has to execute.
2. Initialize the pending notifications list as an empty list. Nodes sequentially execute the events in their event queue.
3. If the first event is a send, remove the event and inject the corresponding message into the network.
4. If it is a reception, check whether a corresponding message (matching origin, destination, tag and size) is in the pending notifications list. If it is there, remove both entries. Otherwise, remain in this state until the required message is received by the node and is accordingly found in the list.
5. If it is a computation event, put the node on hold for the required period of time, using a selected CPU-scale factor.
6. When the network delivers a message, put it in the pending notifications list.
An example of this procedure is depicted in Fig. complying with the causal order between a reception and the subsequent sends it may trigger, see [36].
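The six steps of the trace-driven procedure can be sketched in a few lines. The sketch makes strong simplifying assumptions that are not in the original: the network delivers messages instantly, messages are matched on origin and tag only, and the event encoding (`"S"`, `"R"`, `"C"` tuples) and the function `replay` are hypothetical.

```python
from collections import deque

def replay(trace, cpu_scale=1.0):
    """Replay per-node event lists while respecting send/receive causality.

    trace: {node: [("S", dst, tag) | ("R", src, tag) | ("C", duration), ...]}
    Returns the final clock of every node and the nodes left blocked.
    """
    queues = {n: deque(events) for n, events in trace.items()}   # step 1
    pending = {n: [] for n in trace}                             # step 2
    clock = {n: 0.0 for n in trace}
    progress = True
    while progress:
        progress = False
        for node, queue in queues.items():
            while queue:
                event = queue[0]
                if event[0] == "S":                              # step 3
                    _, dst, tag = event
                    # Assumption: zero-latency network, so step 6 happens here.
                    pending[dst].append((node, tag, clock[node]))
                elif event[0] == "R":                            # step 4
                    _, src, tag = event
                    match = next((m for m in pending[node]
                                  if m[:2] == (src, tag)), None)
                    if match is None:
                        break                                    # blocked on reception
                    pending[node].remove(match)
                    clock[node] = max(clock[node], match[2])
                else:                                            # step 5
                    clock[node] += event[1] * cpu_scale
                queue.popleft()
                progress = True
    blocked = [n for n, queue in queues.items() if queue]
    return clock, blocked

trace = {
    0: [("C", 5), ("S", 1, 0), ("R", 1, 1)],
    1: [("R", 0, 0), ("C", 2), ("S", 0, 1)],
}
print(replay(trace))
```

In this example node 1 cannot start computing until node 0 has sent its message, and node 0 in turn waits for the reply, so both nodes finish at the same simulated time.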

Pseudo-Applications Traffic
INRFlow is also able to generate many application-inspired workloads. These workloads cover some representative network traffic scenarios that can be found in existing datacentres and HPC systems and use the same data structures as the trace-based simulations. A non-exhaustive list of these workloads includes:
• Scientific applications: This group includes models that mimic scientific codes traditionally used by the HPC community, in particular 2D and 3D stencil and sweep codes, i.e., applications that communicate following 2D and 3D grid patterns (similar to what we find in many scientific codes). In addition, we have incorporated an nbody application, a code used to solve the n-body problem, which involves predicting the individual motions of a group of objects interacting with each other.
• Datacentre applications: This group includes models that mimic the traffic that appears in datacentres, including the popular MapReduce, in which, after a data scatter phase, the tasks of the application communicate using an alltoall traffic pattern and finish with a gather phase. We also emulate unstructured applications, such as graph analytics, using causal random traffic, and a model that we call dcntraffic, in which all the nodes of the network communicate with each other using 80% short flows and 20% long flows, as reported in [37]. This last model emulates not only the traffic of the applications but also the management traffic present in datacentres.
• Benchmarking patterns: This group contains causality-enhanced versions of the traffic patterns traditionally used in the evaluation of network topologies (similar to the ones in Section 3.3.1). To enforce causality, flows are generated in phases. Each phase has a fixed number of flows and requires all the flows from the previous phases to be delivered before it begins. The smaller the phase size, the more tightly-coupled the application, i.e., the higher the causality.
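The effect of the phase size on causality can be seen with a toy calculation. The sketch assumes, unrealistically, that each flow takes a fixed time regardless of contention, so that a phase lasts as long as its slowest flow; `phased_completion_time` is a hypothetical helper.

```python
def phased_completion_time(durations, phase_size):
    """Total time when each phase must finish before the next one starts.

    durations: time needed by each flow, in generation order.
    Assumption: flows within a phase run concurrently without contention.
    """
    total = 0.0
    for start in range(0, len(durations), phase_size):
        phase = durations[start:start + phase_size]
        total += max(phase)          # a phase ends when its slowest flow ends
    return total

durations = [3, 1, 2, 4]
for size in (1, 2, 4):
    print(size, phased_completion_time(durations, size))
```

A phase size of 1 serialises all flows, whereas a single phase containing every flow is equivalent to the non-causal pattern.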

Markov-chain-based Application Model
Given the wide variety of applications that we need to consider (HPC from several scientific domains, big data analytics for scientific, engineering
• Init: This state represents the moment in which an application gets scheduled into the system and the required resources (i.e., processing nodes) are assigned to it, including all source data preparation (caching).
• End: When this state is reached, the application will finalise. Transitioning to this state will free all the computing resources of the application and will also trigger updating the data origins with the results of the application.
• Comp: A computation-intensive phase without any data movement. It is the first one after Init and the last one before End, to model the creation/destruction of the application's data structures.
• Comm: A communication-intensive phase in which computing nodes will communicate and/or synchronise with each other. This phase can model different patterns, in particular any of the ones discussed above.
• Read: During this phase, the processing nodes will read data from storage according to the storage policies implemented in INRFlow.
• Write: During this phase, the application writes data to storage, covering the storage of execution results, updates of data, snapshots of the application status or checkpointing of the application.
The transition between phases is performed using a transition probability matrix. The value of each element of the matrix, M[a][b], indicates the probability of a transition from phase a to phase b. Therefore, each row of the matrix has to sum to 1. These probabilities are fully configurable and make it possible to emulate several types of applications, such as I/O-intensive, computation-intensive, communication-intensive or mixes of them. Additionally, there are many other parameters that can be configured, such as the application size, the communication pattern during the Comm phase, the lambda parameter for the (exponentially distributed) duration of the Comp phase, or the storage servers and transfer sizes for storage traffic.
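A minimal sketch of such a Markov-chain application model is given below. The probability values in `M` are invented purely for illustration, and the names `STATES`, `M` and `run_application` are hypothetical; only the six states and the meaning of M[a][b] come from the text.

```python
import random

STATES = ["Init", "Comp", "Comm", "Read", "Write", "End"]

# M[a][b] = probability of moving from phase a to phase b (illustrative values).
M = [
    # Init Comp Comm Read Write End
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],   # Init is always followed by Comp
    [0.0, 0.0, 0.3, 0.2, 0.2, 0.3],   # Comp may communicate, do I/O or end
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],   # Comm returns to Comp
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],   # Read returns to Comp
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],   # Write returns to Comp
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # End is absorbing
]

def run_application(matrix, rng, max_steps=10_000):
    """Draw one sequence of phases, from Init until End is reached."""
    state = STATES.index("Init")
    end = STATES.index("End")
    phases = [STATES[state]]
    while state != end and len(phases) < max_steps:
        state = rng.choices(range(len(STATES)), weights=matrix[state])[0]
        phases.append(STATES[state])
    return phases

rng = random.Random(1)
phases = run_application(M, rng)
comp_time = rng.expovariate(2.0)      # Comp duration with lambda = 2.0 (assumed)
print(phases, comp_time)
```

Raising the Comp-to-Read and Comp-to-Write probabilities would emulate an I/O-intensive application, while raising Comp-to-Comm would emulate a communication-intensive one.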

Scheduling Model
In large-scale multi-tenant systems, applications need to follow several steps upon submission before they are actually executed.The piece of software in charge of that is called the scheduler and performs the following stages: Job Selection, Resource Allocation and Task Mapping (see Fig. 5).Because most large-scale computing systems are reliant on this kind of tools, we have implemented a model of the scheduler in INRflow to be used in our research [15].
First, the Job Selection stage in which the next application to be executed is selected.At the moment the following policies are implemented: • The simplest and most common policy is the First-Come First-Serve policy (FCFS) [38], which imposes a strict order in the execution of jobs.These are arranged by their arrival time and order violations are not allowed.The main drawback of this policy is that it severely reduces system utilization.When the job at the head of the queue cannot be put to run because the required resources are not available, all the jobs in the queue must wait due to the sequentially ordered execution of jobs.As a result, many processors could remain idle, even when other waiting jobs could be eligible to use them.
• Aggressive Backfilling (BF) [38] tries to overcome this drawback of FCFS by allowing the head job to be overtaken by other application(s) that fit in the available resources. BF is a variant of FCFS based on the idea of advancing jobs through the queue. If the job at the head cannot be launched due to resource constraints, a reservation time for it is calculated using the estimated termination time of the currently running jobs. Using this policy, system utilization is improved because more jobs can run without delaying the expected starting time of other jobs. Note that there exists an alternative called Conservative Backfilling, in which all the jobs in the queue receive a reservation, but it is too strict and is barely used in practice, so it is not yet implemented in INRFlow.
• Shortest Job First (SJF) [38] selects the jobs in order of estimated execution time, with the shortest being executed first. The idea behind this policy is to avoid short jobs having to wait for much longer jobs, in order to reduce the average waiting time of the jobs in the queue. The jobs are ordered using the expected runtime of each job.
Note that both the BF and SJF policies require the expected runtime of the jobs being scheduled. As this depends on many variables, such as the problem to solve, the type of hardware assigned or the status of the network, it is impossible to provide an accurate value. For this reason these times are either predicted using simple models based on the execution history of similar jobs [39] or, more recently, using models based on machine learning techniques [40]. However, in real production systems these times are provided by the users when submitting the jobs [41]. This is the approach that we use in INRFlow: the estimated runtime needs to be provided for each application, and an application without an estimate is not able to overtake previous jobs.
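The three selection policies can be sketched as follows. This is a simplified illustration rather than the INRFlow implementation: the structures and function names are hypothetical, a negative est_runtime stands for "no estimate provided", the SJF tie-breaking and fallback rules are assumptions, and the backfilling test only checks that a candidate fits now and is expected to finish before the head job's reservation.

```c
#include <assert.h>
#include <stdlib.h>

struct job     { int nodes; double est_runtime; }; /* est_runtime < 0: unknown */
struct running { int nodes; double est_end; };     /* estimated termination */

static int by_est_end(const void *a, const void *b)
{
    double x = ((const struct running *)a)->est_end;
    double y = ((const struct running *)b)->est_end;
    return (x > y) - (x < y);
}

/* FCFS: only the head of the queue may start. Returns a queue index or -1. */
int select_fcfs(const struct job *q, int n, int free_nodes)
{
    assert(n >= 0);
    return (n > 0 && q[0].nodes <= free_nodes) ? 0 : -1;
}

/* SJF: the job with the smallest estimate goes first (assumed rule: jobs
 * without an estimate are only considered if no job has one). */
int select_sjf(const struct job *q, int n, int free_nodes)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (q[i].est_runtime < 0.0)
            continue;
        if (best < 0 || q[i].est_runtime < q[best].est_runtime)
            best = i;
    }
    if (best < 0 && n > 0)
        best = 0;
    return (best >= 0 && q[best].nodes <= free_nodes) ? best : -1;
}

/* Earliest time at which the head job is expected to fit; -1 if it never
 * fits. Sorts `run` in place by estimated termination time. */
double head_reservation(const struct job *head, struct running *run,
                        int n_run, int free_nodes, double now)
{
    int avail = free_nodes;
    if (avail >= head->nodes)
        return now;
    qsort(run, (size_t)n_run, sizeof *run, by_est_end);
    for (int i = 0; i < n_run; i++) {
        avail += run[i].nodes;
        if (avail >= head->nodes)
            return run[i].est_end;
    }
    return -1.0;
}

/* Backfilling: start the head if it fits; otherwise pick the first later job
 * that fits now and is expected to finish before the head's reservation. */
int select_backfill(const struct job *q, int n, struct running *run,
                    int n_run, int free_nodes, double now)
{
    if (n <= 0)
        return -1;
    if (q[0].nodes <= free_nodes)
        return 0;
    double resv = head_reservation(&q[0], run, n_run, free_nodes, now);
    for (int i = 1; i < n; i++) {
        if (q[i].est_runtime < 0.0)
            continue;               /* no estimate: cannot overtake */
        if (q[i].nodes <= free_nodes && now + q[i].est_runtime <= resv)
            return i;
    }
    return -1;
}
```

With an 8-node head job, 4 free nodes and two running 4-node jobs expected to end at times 20 and 30, the head's reservation is 20, so a queued 2-node job with an estimate of 10 can be backfilled whereas one with an estimate of 50 cannot.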
Once the application is selected, the Resource Allocation stage selects the physical resources (servers) to execute it. This stage tends to be guided by application requirements such as memory, storage capacity, processor architecture, OS, etc. However, many authors argue that exploiting application locality, by placing applications in a set of nodes which maintain some form of contiguity, provides a more efficient utilization by reducing network latency and interference between jobs [42,43,44].
Finally, the Task Mapping stage assigns each task of the application to one of the allocated servers. This stage can have a high impact on the performance of the applications [45] and, to be effective, it should consider the specifics of the communication patterns used, the amount of data exchanged, etc. For this reason, there is a large body of research and many approaches on how to improve this stage of the scheduling process [46,47,48,49,50].
INRFlow implements two simple strategies which are valid for any network: (1) consecutive, which assigns the tasks to the set of reserved nodes in sequential order, and (2) random, which assigns the tasks to the reserved nodes randomly.
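Both strategies are simple enough to sketch in a few lines. The code below is an illustration under the assumption of one task per reserved node; the function names are hypothetical, and the random variant is realised here as a Fisher-Yates shuffle of the consecutive mapping, which is one reasonable reading of "randomly" rather than a description of the INRFlow code.

```c
#include <assert.h>
#include <stdlib.h>

/* Consecutive: task i runs on the i-th reserved node. */
void map_consecutive(const int *nodes, int n_tasks, int *task_to_node)
{
    assert(n_tasks >= 0);
    for (int i = 0; i < n_tasks; i++)
        task_to_node[i] = nodes[i];
}

/* Random: a random permutation of the reserved nodes (Fisher-Yates). */
void map_random(const int *nodes, int n_tasks, int *task_to_node,
                unsigned seed)
{
    map_consecutive(nodes, n_tasks, task_to_node);
    srand(seed);
    for (int i = n_tasks - 1; i > 0; i--) {
        int j = rand() % (i + 1);  /* slight modulo bias, fine for a sketch */
        int tmp = task_to_node[i];
        task_to_node[i] = task_to_node[j];
        task_to_node[j] = tmp;
    }
}
```

Passing the seed explicitly keeps a random mapping reproducible across runs, which is convenient when comparing topologies under the same placement.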
In INRFlow we have implemented each of the stages as an independent module. This way it is very easy to implement new policies for any of the three stages. Fig. 5 depicts a scheduling example for one application, from its arrival at the queue to its execution.
The quality of the scheduling process is typically measured using a set of specific metrics, which are implemented in INRFlow:
• Waiting time: The time that a job spends in the queue, that is, the time from its submission to the system until it is selected to be executed.
• Runtime: It is the time required by a job to be executed.
• Total time: This time is the combination of the waiting time and the runtime. It represents the time since a job is submitted until it finishes.
• System Utilization: This metric represents the proportion of computing resources that is used during a time period.
• Throughput: This is the number of jobs that finish per unit time.
• Makespan: This is the total time required to process the whole sequence of jobs since the first is submitted to the queue until all of them finish.
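The metrics above can be computed from per-job timestamps as in the following sketch. The job_record structure and the function names are hypothetical, and two conventions are assumptions of the example rather than statements about INRFlow: throughput and utilization are measured over the makespan, and utilization is expressed as busy node-time over available node-time.

```c
#include <assert.h>

struct job_record {
    double submit, start, finish;  /* timestamps, in the same time unit */
    int nodes;                     /* servers used by the job */
};

double waiting_time(const struct job_record *j) { return j->start - j->submit; }
double runtime(const struct job_record *j)      { return j->finish - j->start; }
double total_time(const struct job_record *j)   { return j->finish - j->submit; }

/* Time from the first submission until the last job finishes. */
double makespan(const struct job_record *jobs, int n)
{
    assert(n > 0);
    double first = jobs[0].submit, last = jobs[0].finish;
    for (int i = 1; i < n; i++) {
        if (jobs[i].submit < first) first = jobs[i].submit;
        if (jobs[i].finish > last)  last = jobs[i].finish;
    }
    return last - first;
}

/* Jobs finished per unit time (assumes a non-zero makespan). */
double throughput(const struct job_record *jobs, int n)
{
    return n / makespan(jobs, n);
}

/* Fraction of the available node-time that was actually used. */
double utilization(const struct job_record *jobs, int n, int total_nodes)
{
    double busy = 0.0;
    for (int i = 0; i < n; i++)
        busy += jobs[i].nodes * runtime(&jobs[i]);
    return busy / (total_nodes * makespan(jobs, n));
}
```

For example, on an 8-node system, a 4-node job running from time 0 to 10 followed by an 8-node job running from 10 to 20 gives a makespan of 20 and a utilization of 120/160 = 0.75.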

Storage Subsystem Model
INRFlow also incorporates a full model of the storage subsystem (see Fig. 6 for a detailed diagram and [15] for a concrete use case). In the
• Data mapped locally in memory is accessed immediately.
• Reads and writes to the local storage device are limited by the PCI-e controller or the device (configurable independently).
• Access to data mapped in remote nodes is limited by the IN.
• Data in the centralized storage requires using the SAN network.
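The four access cases can be summarised by which resource bounds the transfer. The sketch below is only illustrative: the enum, the structure fields and the function name are hypothetical, the bandwidth values would come from configuration, and it returns an uncontended lower bound, whereas in a flow-level simulation the IN and SAN transfers additionally compete with other traffic.

```c
#include <assert.h>

enum data_location { LOC_MEMORY, LOC_LOCAL_DEVICE, LOC_REMOTE_NODE, LOC_CENTRAL };

/* Bandwidths in bytes per second; all fields are assumptions of this sketch. */
struct storage_cfg {
    double pcie_bw;    /* PCI-e controller */
    double device_bw;  /* local storage device */
    double in_bw;      /* interconnection network (IN) */
    double san_bw;     /* storage area network (SAN) */
};

static double min2(double a, double b) { return a < b ? a : b; }

/* Uncontended lower bound, in seconds, on the time to access `bytes`. */
double access_time(enum data_location loc, double bytes,
                   const struct storage_cfg *c)
{
    assert(bytes >= 0.0);
    switch (loc) {
    case LOC_MEMORY:       return 0.0;  /* accessed immediately */
    case LOC_LOCAL_DEVICE: return bytes / min2(c->pcie_bw, c->device_bw);
    case LOC_REMOTE_NODE:  return bytes / c->in_bw;
    case LOC_CENTRAL:      return bytes / c->san_bw;
    }
    return -1.0;  /* unknown location */
}
```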
Once an application has been selected to run, the allocator selects a set of computing nodes on which to place the tasks of the application. At that moment, the application requests access to the required data. We currently have three possible policies to perform storage assignment in the local devices.
• Local: All the local storage devices are available to load the data for the application. This is the ideal scenario, where all the storage traffic remains local within the computing elements. As a consequence there is no interference with other traffic.
• Internal: In this case only some of the local storage devices are available. This situation could happen if other applications have requested
• Overall Throughput per Port: This is calculated from the two previous metrics, taking into consideration the number of switch ports (links), so as to ascertain how efficiently resources are used.
Afterwards, we use INRFlow's dynamic engine to analyse instances of the topologies with around 64K nodes handling some realistic workloads. In this set of experiments the figure of interest is the execution time of the applications, and the experiments focus on assessing how the above raw gains translate into application speed-up. We consider a broad range of application models, as explained in Section 3.3.3.

Scalability Results
Fig. 7a and 7b show, respectively, the Non-Restricted and Restricted Throughput of the different topologies as the number of endpoints is increased. First, we can see that the 6D torus provides the highest throughput for relatively small networks. However, it does not scale as well as the other topologies. Indeed, it eventually gets outperformed by the fattree as the systems scale up, and the best dragonfly (1:2) would also outperform it if we extended our experimental space a little further. It can be noted as well that all the topologies within a family have similar trends (gradient in the plots), with the tori showing the lowest slope. The trees and the dragonflies have similarly good trends, with the dragonflies having a slightly higher one. Jellyfish requires special consideration, as it behaves differently in terms of restricted and non-restricted throughput. In the non-restricted case it follows a trend similar to the trees, whereas in the restricted case it scales poorly, like the torus. This is because the random nature of the topology generates quite a lot of bottlenecks when static routing is used. This is a known limitation of the topology, but it can be alleviated by means of multipath routing; we will explore this issue later. With regard to the Overall Throughput per Port, shown in Fig. 7c, we can see how all of the topologies, except for the tori, have a nearly flat scalability in terms of throughput per port. While there is little difference among the topologies, it is worth highlighting that Jellyfish is the one with the best relation between throughput and links. This motivated our study of less structured topologies, such as the optimisation framework proposed in [13].

Applications Execution Time
Let us now analyse the results for the different groups of applications using the dynamic engine of INRFlow. For simplicity we assume all links run at 10 Gbps, but mixed link bandwidths are supported. Fig. 8 summarizes the results with the realistic workloads. Given the wide range of execution times, the results are normalized, so as to show how many times slower execution could be if an inadequate network is chosen. The first group of workloads is formed by those traditionally used in HPC environments. The Stencil workloads are executed much faster in the tori because the topology perfectly matches the communication pattern, hence there is no contention at the network level. Among the rest of the topologies the best results are achieved by the fattree and the thintree with a 1:2 oversubscription ratio. As expected, the worst results were obtained using Jellyfish, due to the use of an unstructured topology to execute a completely structured workload. In the case of n-body, the high causality of its communication pattern minimises the differences between topologies.
The second group of workloads mimics the behaviour of applications in datacentres. If we look at the Unstructured workload, the best performance is achieved using the 6D torus. The rest of the topologies perform similarly, except the thintree with a 1:4 ratio, in which the reduced bandwidth in the upper tiers severely affects the performance of the application. Special attention should be paid to the good performance achieved by the dragonfly and Jellyfish topologies, due to the fact that both have a random component: Valiant routing and the random construction, respectively. Regarding MapReduce, the best performance is achieved using the 3D and 6D tori, with the remaining topologies performing similarly. Finally, dcntraffic works best with the 6D torus and the fattree and worst, again, with the thintree with a 1:4 ratio.
The third group evaluates the topologies using more traditional patterns. In this case the patterns running in the 6D torus and the fattree require the lowest execution time. Let us remark that the worst topology is again, in all cases except Shift, the thintree with an oversubscription ratio of 1:4.
Overall, it is worth noticing that the effect the topology may have on the execution time of applications is considerable, with up to one order of magnitude slower execution if run in an inadequate topology. In the results, we observe the great potential of torus-like and tree-like topologies. Although high oversubscription in thintrees negatively affects the performance of the applications, these kinds of networks are still strong candidates for larger networks when locality can be achieved. This will be studied thoroughly in future work. As occurred with the static experiments above, the low performance achieved by Jellyfish requires extra consideration. Again, the culprit is that single-path routing cannot take advantage of the high connectivity of the network. In order to assess how much of the low performance can be attributed to that reason, we next evaluate both types of topologies, tree-like and Jellyfish, using multipath routing.
Fig. 9 shows how the availability of multipath routing affects the performance of the fattree, the Jellyfish and the thintree. We can see that, in the case of the fattree, the multipath schemes not only fail to improve the performance greatly but can actually harm it significantly for adversarial traffic patterns. This is because the large interconnection resources provided by the topology can be shared evenly with a static, single-path routing, whereas the use of multipath can generate some areas of contention that would not appear otherwise. With Jellyfish, on the other hand, we can see that applying multipath routing algorithms can be considerably beneficial (up to 2-3× faster), with KSP being generally better than ECMP.
Note that, although the fattree cannot really benefit from multipath routing in many cases, oversubscribed trees are able to benefit from it up to a certain level. There we can see that the slightly oversubscribed thintree 2:1 can achieve speed-ups in the range of 2-4× for many of the workloads considered here. This is because, with relatively small oversubscription ratios, contention is more likely to arise in the topology, but there is still a large variety of paths that can be exploited by the multipath algorithm to distribute the traffic more evenly across the higher levels of the topology.
On the other hand, the more aggressively oversubscribed topology, thintree 4:1, extracts little benefit from the multipath scheme because the low availability of paths means that there are not many occasions in which the traffic can be spread more evenly across the great bottleneck that is the last level of the interconnect.

Conclusions
On top of the description of the framework, we complete the paper with a case study in which we investigate the scalability of typical interconnection networks with up to one million nodes. There we see that high-dimensionality tori can offer the best raw performance, as well as exploit it appropriately to obtain the fastest execution times of applications. We also show some examples of topologies where multipath routing can be necessary in order to speed up the execution of applications.
Finally, we want to remark that INRFlow is an open source platform and we would like to invite all researchers in the area of interconnection networks and related ones, to try it and use it for their own purposes as well as to contribute to its design and development.
This paper presents INRFlow, an Interconnection Networks Research Flow-level simulation framework we have been developing since 2014 to support our experimental work.

Figure 1: Examples of traffic prioritization policies. Time flows left-to-right and bandwidth is represented vertically.

Figure 2: Examples of some of the topologies supported by INRFlow.
The definition of a topology in INRFlow normally involves implementing routing algorithms that can be used with it. There are many algorithms already implemented and it is very easy to add new ones. INRFlow supports both single-path routing, where the same path is always used for a given source and destination node, and multipath routing, where many parallel paths can be used. When INRFlow needs to perform the routing of a flow, it will use all the paths provided by the routing algorithm. The design of INRFlow is very flexible, allowing the implementation of different kinds of routing policies.

Figure 3: Diagram of the data structures needed to support trace-driven simulation.

Figure 4: Representation of the Markov chain used to generate synthetic applications and parameters used to generate traffic in our experiments.
List of jobs in the queue and communication pattern of Job 1.
Job selection. Job 1 is selected among all the jobs in the queue.
Task mapping. Each task of the job is allocated to a specific server.

Figure 5: Examples of the different stages of the scheduling process.

Figure 6: Visual representation of the ExaNeSt storage architecture. The local NVMs are attached to the computing nodes sharing the main IN (solid). An Ethernet network is provided for central data storage (dashed).

Figure 8: Execution time of the realistic workloads.
This paper has described exhaustively the design of the INRFlow simulation framework for large-scale networks and computing systems. INRFlow is a mature, flexible and frugal tool that has shown its capabilities in a wide range of previous research work within the areas of interconnection networks for datacentres and HPC systems. It models many aspects of such systems, including the scheduling process, the storage subsystem, the interconnection network and the application traffic. Our description includes the large number of topologies and routings already implemented, the wide variety of traffic generators and the different subsystems included in our models.