Energy efficient HPC network topologies with on/off links

Energy efficiency is a must in today's HPC systems. To achieve this goal, a holistic design based on power-aware components should be performed. One of the key components of an HPC system is the high-speed interconnect. In this paper, we compare and evaluate several design options for the interconnection network of an HPC system, including tori, fat-trees and dragonflies. State-of-the-art low-power modes are also used in the interconnection networks. The paper considers energy efficiency not only at the interconnection network level but also at the level of the system as a whole. The analysis is performed using a simple yet realistic power model of the system, which has been adjusted with actual power consumption values measured on a real system. Using this model, realistic multi-job trace-based workloads have been evaluated, obtaining the execution time and the energy consumed. The results are presented so as to ease choosing a system depending on which parameter, performance or energy consumption, receives the most importance.


Introduction
High Performance Computing (HPC) systems are based on the aggregation of multiple computing nodes (processors and memory banks) to provide the best possible computing support for computational problems that cannot be addressed by readily available commercial computers. The interconnection network is one of the basic building blocks of an HPC system, providing communication among potentially many thousands or even millions of computing nodes.
Interconnection networks are a significant performance-limiting factor of HPC systems. For this reason, they are an active area of research where new topologies, routing algorithms and switch architectures are proposed. The goal is to offer high communication bandwidth, low latency and good scalability, making it possible to interconnect all the nodes.
In 2015, the United Nations approved the 17 Sustainable Development Goals (SDG) to transform our world. Energy is the main contributor to climate change, accounting for around 60% of all global greenhouse gas emissions [1]. Thus, the efficient use of energy is one of the key aspects to consider in order to help combat climate change and global warming. HPC systems cannot be left out and must be energy efficient in order to contribute to energy savings that facilitate a more sustainable planet. While significant contributions have been made to increase interconnection network performance, much less attention has been paid to its energy consumption. HPC systems are significant energy consumers: the current most powerful supercomputer, Fugaku, draws nearly 30 MW [2,3]. This is enough to supply energy to more than 33,000 homes, according to U.S. Energy Information Administration figures. The most energy-efficient system, NVIDIA DGX SuperPOD, achieves 26.195 GFlops/W [2,4]. A projection of that metric to an exascale supercomputer with equivalent efficiency predicts that power consumption will peak at 473 MW, soon reaching gigawatt figures. In order to mitigate this trend, energy-proportional computing nodes are being deployed. The strategy is to modulate energy consumption at the server level according to processor utilization. Under these conditions, the contribution of the interconnection network to overall system energy consumption increases significantly, ranging from 12% to 50% depending on server utilization (a higher share for lower server utilization) [5,6].
As a consequence, a trade-off between performance and cost, which is greatly conditioned by power consumption, must be considered in the design of an HPC system and its interconnection network. From this perspective, as network performance is greatly conditioned by its topology, an appropriate topology selection will contribute to the overall system performance. Nevertheless, that decision also impacts the ability of the network to show an energy-proportional behavior, contributing to an energy-efficient design.
Historically, several topologies for interconnection networks have been proposed. Nevertheless, almost all high-performance interconnection networks in practical use over the last two decades use topologies derived from three families: tori (k-ary n-cubes) [7], fat-trees (k-ary n-trees) [8] and, more recently, dragonflies [9]. For any given network size, any topology can be chosen using similar link technology and applying different power-saving strategies (pursuing energy proportionality).
Selecting the topology that offers the best trade-off between performance and energy consumption is not straightforward. The energy consumed by a computing system depends on power consumption over time. Energy-efficiency strategies may reduce power during low-utilization periods but increase the overall execution time. Eventually this might cause an unacceptable performance degradation and/or an undesired increase in the energy consumption of the system.
In this paper, we present a comparative performance-wise and energy-efficiency-wise analysis of different network topologies. We characterize network behavior and its impact on the overall HPC system when applying dynamic power saving techniques. The performance of the different configurations is determined by using real application execution traces. The main contribution of this paper is a thorough comparative evaluation of three families of topologies: tori, fat-trees and dragonflies. In all cases, several topology parameters are considered, leading to a representative set of evaluated topologies. In addition, the Low Power Idle mechanism proposed in the IEEE Energy Efficient Ethernet standard is used as the dynamic power saving mechanism. Being a standard, the obtained results can be more easily generalized to commercial interconnection networks. Evaluation results are shown using combined performance-energy plots which allow an easy comparison and selection of the proper configuration according to either the criterion of performance-first or energy-first.
The rest of the document is organized as follows. Section 2 describes the power consumption model. Section 3 presents the system and evaluation model. After that, we analyze and discuss network performance and energy evaluation results in Section 4. Other works studying energy consumption in high-speed interconnection networks are reviewed in Section 5. Finally, in Section 6 we outline the conclusions and future work.

Power model
As mentioned in Section 1, the main goal of this work is to evaluate the energy consumption of an HPC platform for different configurations of the interconnection network topology.
In previous works, we defined a power consumption model to compare networks with different numbers of switches/ports [10,11]. This model selects one of the compared networks as a reference network and normalizes all the power results according to the switch/port configuration of the reference network. This approach allows us to compare different network configurations when we do not have power measurements of the components of the HPC system. However, in this work we have taken power measurements of some current HPC nodes and HPC switches in order to characterize the power model parameters (Section 2.4). As we now have these power values in Watts, the model has been slightly modified to work directly with absolute power measurements. For this reason, and for the paper to be self-contained, we include a set of definitions and a brief explanation of the modified power model.

Initial hypotheses
Our model considers the power consumption of the computing nodes and the power consumption of the network switches separately. Moreover, the link power consumption and the switch logic power consumption are also studied separately, due to the relevance of the links to the performance and energy efficiency of the network. According to the state of the art, our model makes the following general hypotheses:
• The switch power consumption increases linearly with the number of ports [12], which has been experimentally verified in [13].
• A power-saving mechanism like Low Power Idle (LPI), proposed in the IEEE Energy-Efficient Ethernet (EEE) standard (IEEE 802.3az) [14], is assumed, and therefore two switch port states are considered: wake-up (or turned on) and sleep (or turned off). Figure 1a illustrates how a link with LPI operates under our hypotheses:
- Since the transceiver is working regardless of whether the port is transmitting data or not, the port power consumption is 100% when it is turned on.
- When the transmission of a packet ends and no more packets are available for transmission, the link is switched to the low-power mode. After a certain amount of time (T_s), the link is in LPI mode.
- When the port is in sleep mode, the transceiver is frozen and it consumes only a small fraction of its maximum power consumption [15]. As can be seen in Figure 1, this fraction of power consumption is specified in our model by the ω^S_port parameter.
- When packets need to be transmitted, the link is woken up to the active mode. This change of state also requires a certain amount of time (T_w) to complete.
-During the transitions from one state to another, the port power consumption is 100%.
• As shown in other studies [16,11], using only LPI in the context of HPC applications considerably increases the execution time of these applications, in some cases by more than 100%. Not only does this degrade the obtained performance, but the total system consumption can even be higher than without using the LPI technique. However, LPI becomes an attractive energy-saving technique when it is combined with a Power-Down Threshold (PDT). Thus, in this work the PDT technique [16] is also implemented. Each link has a timer configured with the Power-Down Threshold value. After a transmission finishes, the timer starts. If a message is received, the timer is canceled; since the link is still active, it can immediately start the transmission of the message. Once the new transmission finishes, the timer is restarted. Only when the timer expires without new messages being received does the link start the transition to sleep mode.
• In order to simplify the model, we consider that the power consumption of the switch control logic is fixed regardless of the switch utilization. This behavior has also been observed in the experiments performed in [13].
• The nodes always consume a fraction of their maximum power consumption, even if there are no running processes in the node (idle node). Hence, a fraction of the node power consumption is fixed and the remaining power depends on the node utilization. This hypothesis has also been experimentally proven in [13].
Table 1 introduces the parameters used to quantify the main components of the system, their contribution to power consumption and total energy, and the variables calculated by the power model. Moreover, Figure 2 graphically shows the main power parameters of the model, in order to help the reader understand them. The figure shows a node connected by a port to a switch, showing the power consumption when the hardware is active (Figure 2a) and when it is inactive (Figure 2b).
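As an illustration of these hypotheses, the following sketch estimates the average power fraction and the extra latency caused by a single idle gap on one link. The timing and power values are the ones used later in the paper; the function itself is our own simplification for illustration, not the simulator's implementation.

```python
# Illustrative sketch of the LPI + Power-Down Threshold (PDT) hypotheses.
# Timing values follow the EEE figures used in this paper; the function
# is a simplification of the behavior described in the bullets above.

T_S = 2.88    # time to enter sleep mode, in microseconds
T_W = 4.16    # time to wake the link up, in microseconds
PDT = 10.0    # power-down threshold, in microseconds
W_SLEEP = 0.1 # fraction of full port power consumed while asleep

def idle_gap_cost(gap_us, pdt=PDT, t_s=T_S, t_w=T_W, w_sleep=W_SLEEP):
    """Average power fraction and added latency for an idle gap between
    two transmissions on one link."""
    if gap_us <= pdt + t_s:
        # The PDT timer is cancelled (or the sleep transition would not
        # complete): the link stays at 100% power and adds no delay.
        return 1.0, 0.0
    sleep_time = gap_us - pdt - t_s
    # Full power during the PDT wait and both transitions; reduced
    # power only while the link is actually asleep.
    energy = (pdt + t_s + t_w) * 1.0 + sleep_time * w_sleep
    avg_power = energy / (gap_us + t_w)
    return avg_power, t_w  # waking up delays the next packet by t_w

# A long idle gap saves energy but delays the next transmission:
print(idle_gap_cost(100.0))
```

A gap shorter than the threshold costs nothing in latency but saves no power; long gaps trade a fixed wake-up delay for a large power reduction.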

Definitions
In Figure 2a we can see the power consumption when all the hardware is active. The node power consumption is Ω_nodes W, while the switch power consumption per port is Ω_port W. Moreover, we can distinguish two power components within Ω_port: the power of the port itself and the power of the switch logic. The contribution of the ports to the total switch power consumption is indicated by the ω_ports parameter; therefore, the port power consumption is Ω_port × ω_ports W, while the switch logic consumption is Ω_port × (1 − ω_ports) W.
Finally, Figure 2b shows the power consumption when the node is idle and the port is sleeping. Both components consume only a fraction of their maximum power consumption. While ω^S_port indicates the fraction of power consumed by a sleeping port, ω^S_nodes indicates the fraction consumed by an idle node. Therefore, the power consumption of the port is Ω_port × ω_ports × ω^S_port W and the power consumption of the node is Ω_nodes × ω^S_nodes W. The power consumption of the switch logic remains the same as in Figure 2a.

Power consumption model
Let W^s_ports be the fraction of the power consumption that all the ports of a switch s consume with respect to the maximum power consumption of those ports. When a port p is in sleep mode, it only consumes ω^S_port of its maximum power consumption; when it is turned on, it consumes ω^S_port plus (1 − ω^S_port) · U^p_port, where U^p_port is the fraction of time port p is awake. Since the ports in a switch have the same characteristics, ω^S_port is the same for all ports (ω^S_port will be used hereafter), and the relative ports power consumption of a switch s with k_s ports is:

W^s_ports = ω^S_port + (1 − ω^S_port) · (1/k_s) · Σ_{p=1}^{k_s} U^p_port

Once we have the ports power consumption, we can calculate the switch power consumption. According to the initial hypotheses, the switch logic always consumes (1 − ω^s_ports) of the total switch power consumption. Therefore, the fraction of the maximum power consumed by a switch is (1 − ω^s_ports) + ω^s_ports · W^s_ports, and the absolute switch power consumption, measured in Watts, is:

P^s_switch = k_s · Ω_port · [(1 − ω^s_ports) + ω^s_ports · W^s_ports]

By considering all switches in the network, we obtain the network power consumption. Without loss of generality, and reasoning at network level in the same way as at switch level, we consider that the contribution of the switch ports to the total switch power consumption is the same for all switches, i.e. ω^s_ports is the same for all switches, and we use ω_ports hereafter. Then:

P_net = Σ_s k_s · Ω_port · [(1 − ω_ports) + ω_ports · W^s_ports]

In order to obtain the total energy of the HPC platform, we need to consider the power consumption of the computing part, mainly due to the compute nodes. Taking into account the definitions above and the initial hypotheses, a node always consumes ω^S_nodes of its total power consumption. Therefore, the power consumption of a node n with CPU utilization U^n_cpu is Ω_nodes · [ω^S_nodes + (1 − ω^S_nodes) · U^n_cpu], and the power consumption of the cluster nodes is:

P_nodes = Σ_n Ω_nodes · [ω^S_nodes + (1 − ω^S_nodes) · U^n_cpu]

Then, considering the nodes and the network power consumption, we can calculate the HPC platform power consumption, P_cluster = P_net + P_nodes, and the energy consumed by an application running for time T:

E = ∫_0^T P_cluster(t) dt
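For illustration, the model can be sketched in a few lines of Python. Parameter names mirror the definitions above and the values are those characterized in Section 2.4; this is a simplified sketch using average utilizations, not the simulator's code.

```python
# Minimal sketch of the power model; symbol names mirror the definitions.
# Parameter values are the ones characterized in Section 2.4.

OMEGA_PORT = 5.0     # max power per switch port (W), logic share included
OMEGA_NODES = 300.0  # max power per compute node (W)
W_PORTS = 0.816      # fraction of switch power due to the ports
W_S_PORT = 0.1       # fraction consumed by a sleeping port
W_S_NODES = 0.5      # fraction consumed by an idle node

def switch_power(port_on_fractions):
    """Power (W) of one switch given, per port, the fraction of time the
    port is awake (U_port in the model)."""
    k = len(port_on_fractions)
    # Relative ports power: a port always draws w_S_port, plus the
    # remaining (1 - w_S_port) while it is awake.
    w_ports_rel = sum(W_S_PORT + (1 - W_S_PORT) * u
                      for u in port_on_fractions) / k
    # The switch logic share is constant; the ports share scales.
    return k * OMEGA_PORT * ((1 - W_PORTS) + W_PORTS * w_ports_rel)

def node_power(u_cpu):
    """Power (W) of one compute node given its CPU utilization."""
    return OMEGA_NODES * (W_S_NODES + (1 - W_S_NODES) * u_cpu)

def cluster_energy(switch_port_u, node_u, runtime_s):
    """Energy (J) of the platform over an execution, assuming the
    utilizations are averages over the whole run."""
    p_net = sum(switch_power(ports) for ports in switch_port_u)
    p_nodes = sum(node_power(u) for u in node_u)
    return (p_net + p_nodes) * runtime_s

# A 36-port switch with every port always awake draws its full power:
print(switch_power([1.0] * 36))
```

With all ports awake the switch reaches its maximum power; with all ports asleep only the logic share plus the residual port power remains.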

Characterization of power model parameters
To use the power model defined in this section, values must be given to the parameters that characterize the model, e.g. the fraction of power dissipated by a port in sleep mode, the switch port power consumption or the compute node power consumption.
There are numerous documents that provide data on the power consumption of the main components of an HPC system or of the system as a whole. However, these data, even for the same type of components, show significant differences, and thus it is not easy to choose one in particular. Therefore, we have decided to collect power consumption data on a real, modern HPC cluster, using a power meter for this purpose. Except for ω^S_port, all other power model parameters have been selected according to the results obtained. In [13] we have included the characteristics of the measuring instrument, the tests carried out and the results obtained.
The power model parameters are summarized in Table 2. According to the EEE standard, the power consumption of an idle link is estimated at 10% of the link power consumption [14,15], and therefore ω^S_port is set to 0.1. The experiments in [13] show that a modern 36-port Mellanox switch consumes approximately 33 W with no connected ports, and 180 W with all its ports connected. Therefore, ω_ports is set to (180 − 33)/180 = 0.816 and Ω_port is set to 180/36 = 5 W. Regarding the compute nodes, the power consumption of a fully-loaded node is 300 W, while a node with a single running thread consumes only 150 W. Then, Ω_nodes is set to 300 W and ω^S_nodes is 150/300 = 0.5.
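The derivations above can be reproduced directly from the measurements; a trivial worked example of the characterization (variable names are ours):

```python
# Worked example reproducing the parameter values derived above from
# the measurements reported in [13].

switch_idle_w, switch_full_w, n_ports = 33.0, 180.0, 36
node_full_w, node_single_w = 300.0, 150.0

w_ports = (switch_full_w - switch_idle_w) / switch_full_w  # ~ 0.816
omega_port = switch_full_w / n_ports                       # 5 W per port
omega_nodes = node_full_w                                  # 300 W
w_s_nodes = node_single_w / node_full_w                    # 0.5
w_s_port = 0.1  # taken from the EEE standard [14,15], not measured
```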

System model
After presenting the power model, we describe the system model used in the evaluation. Section 3.1 outlines the network model, and the following subsections describe the case studies, the network load and the trace scheduler.

Network model
The network has been modeled using the simulation tool Hiperion (HIgh PERformance InterconnectiOn Network) [17].
Hiperion is an open-source simulation tool available to researchers and companies. The simulator's main goal is to perform comparative studies, and it has a large range of configurable parameters, e.g. topology, routing, queue sizes, output scheduling algorithms, etc. Hiperion can run simulations using synthetic traffic (e.g. random uniform, bit-reversal, bit-complement, etc.) and MPI traffic using the VEF trace framework [18,19]. The Hiperion configuration used to perform all the experiments in the performance study is detailed below.
The architecture modeled by Hiperion is realistic and representative of current state-of-the-art HPC platforms. The chosen design parameters are based on several commercial networks [20,21,22,23,24].
Network switches are IQ (Input Queued) [25], with virtual cut-through [26], credit-based flow control and the three-stage allocation algorithm implemented in the IBM Blue Gene L [27].
Flit size is 16 bytes and packets are 8 flits long. The switch clock frequency is 625 MHz (i.e. a clock cycle of 1.6 ns). Since the switch crossbar can deliver one flit per cycle, each switch port offers a peak bandwidth of 10 Gbytes/s. We assume that all switch components have a fixed delay. The latency per hop is approximately 50 ns, varying slightly as a function of the number of ports. The only switch component with a variable latency is the switch allocator. The allocator comprises several stages of round-robin arbiters, whose latency increases logarithmically with the number of arbiter entries [28]. Therefore, the allocator latency increases logarithmically with the number of ports.
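The logarithmic scaling of the per-hop latency can be sketched as follows. The fixed delay and the per-stage delay below are our own assumptions, chosen only to land near the ~50 ns figure cited above; they are not the simulator's actual constants.

```python
import math

# Illustrative sketch of the per-hop latency scaling described above.
# FIXED_HOP_NS and STAGE_NS are assumptions, not the simulator's values.

FIXED_HOP_NS = 48.0  # assumed fixed delay of link, buffers and crossbar
STAGE_NS = 0.4       # assumed delay of one round-robin arbiter stage

def hop_latency_ns(num_ports):
    """Fixed pipeline delay plus an allocator whose depth grows as
    log2 of the number of arbiter entries (here, the port count)."""
    stages = max(1, math.ceil(math.log2(num_ports)))
    return FIXED_HOP_NS + stages * STAGE_NS
```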
Each physical channel is multiplexed into 4 virtual channels (VCs). Each input port has an input buffer of 1024 flits, or 16 Kbytes, statically split among the VCs. The use of the VCs depends on the modeled topology:
• In the torus topologies, the VCs are employed to avoid deadlocks and to enable the use of adaptive routing. In this case, the switch implements a fully-adaptive routing algorithm [29]. Two of the four VCs are fully-adaptive VCs, while the remaining VCs are used as escape paths, implementing the DOR algorithm to avoid deadlocks [30].
• In the fat-tree topologies, since VCs are not required to avoid deadlocks or to provide adaptiveness, they are used to implement DBBM [31] and reduce Head-of-Line blocking, mapping packets to VCs using the function Destination % Number of VCs.
• In the dragonfly topology, the UGAL-L routing algorithm [9] is used. The number of VCs has been modified, since this algorithm requires 3 VCs on local channels and 2 VCs on global channels to avoid deadlock.
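The DBBM mapping used in the fat-tree is a one-liner; a minimal sketch (the function name is ours):

```python
# Minimal sketch of the DBBM VC mapping used in the fat-tree: packets
# are spread over the VCs by destination to cut Head-of-Line blocking.

NUM_VCS = 4

def fat_tree_vc(destination, num_vcs=NUM_VCS):
    """Destination modulo the number of VCs, as in DBBM [31]."""
    return destination % num_vcs

# Packets for destinations 0, 4, 8, ... share VC 0; destinations
# 1, 5, 9, ... share VC 1, and so on.
```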
Although our switch does not model EEE or any other specific technology, we have implemented the Low Power Idle mechanism [14] and the Power-Down Threshold (PDT) [16] to save energy on the links. We have used the time values specified in the IEEE Energy-Efficient Ethernet standard [15] to configure the delays for turning on (4.16 µs) and turning off (2.88 µs) a link.
In a previous work, we evaluated the impact of the PDT value on torus and fat-tree topologies. The results showed that the greatest power savings with negligible performance penalties were obtained when the adaptive routing prioritizes awake links and PDT is set to 10 µs [11]. Therefore, this value has been chosen for the experiments. However, the dragonfly topology does not use such adaptive routing: its routing algorithm selects a path without any information about the power state of the links on the chosen path. This may cause poor performance when PDT is set to 10 µs. For this reason, the dragonfly topologies are also tested with PDT set to 100 µs.

Case studies
To perform the evaluation, the first idea was to compare different systems with a fixed number of computing nodes (64 and 256). These system sizes are possible in fat-tree and torus topologies, but not in the dragonfly topology. For this reason, we have chosen dragonfly configurations that: i) have a number of computing nodes as close as possible to 64 and 256; and ii) fulfill the conditions a ≥ 2h and 2p ≥ 2h to balance the network load [9]. Moreover, in the second case study (256 nodes), we have also added extra configurations for fat-tree and torus topologies whose system size is closer to the dragonfly configurations. Table 3 and Table 4 show the topologies evaluated in both case studies and other useful information, such as the number of nodes, switches, ports per switch and total network ports. Note that the torus topologies marked as Trunk in both tables use trunk links in each torus direction. Each trunk link comprises four independent links transmitting independent packets.
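A small helper can enumerate the dragonfly sizes closest to a target node count under the stated balance conditions. The helper and its search bounds are our own illustration; the group count g = a·h + 1 assumes fully connected groups as in [9].

```python
# Hypothetical helper (names and search bounds are ours) to enumerate
# dragonfly configurations near a target node count. With p nodes per
# switch, a switches per group and h global links per switch, a
# dragonfly with fully connected groups has g = a*h + 1 groups and
# N = a * p * (a*h + 1) computing nodes.

def dragonfly_candidates(target, max_h=4):
    out = []
    for h in range(1, max_h + 1):
        for a in range(2 * h, 4 * h + 1):   # a >= 2h (load balance)
            for p in range(h, 2 * h + 1):   # 2p >= 2h (load balance)
                n = a * p * (a * h + 1)
                out.append((abs(n - target), n, p, a, h))
    out.sort()
    return out[:3]  # the three configurations closest to the target

# The configuration closest to 64 nodes has 72 nodes (p=2, a=4, h=2):
print(dragonfly_candidates(64))
```

Running the same search for 256 nodes yields 272 nodes as the closest balanced configuration, consistent with the dragonfly sizes used in the case studies.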

Network load
The load on the system plays an important role in the evaluation, since the energy consumed is obtained from the execution time and the power consumed. We have decided to use an open-access trace-driven traffic model, the VEF trace framework [18,19]. The parallel applications inject MPI traffic that is captured in a trace file, which we then use to generate the traffic in the network simulator. The VEF traces model both point-to-point communications and MPI collective communication primitives, making use of the collective communication algorithms implemented in Open MPI [32].
To model the network traffic, we have run the following applications on the GALGO supercomputer [33], trying to make the scenarios as realistic as possible:
• HPCC Linpack [34] is used to solve a dense system of linear equations. This application follows a specific pattern in which a given task always communicates with the same tasks, following a ping-pong traffic pattern.
• Namd is a parallel application for simulating large biomolecular systems [35]. The application logically maps the tasks onto a 3D grid, and the tasks communicate mainly with their neighbor tasks in the grid. For this reason, its traffic pattern shows great spatial locality. Our traces correspond to the apoa1 benchmark.
• Gromacs [36] is a scientific application to perform molecular dynamics. Similarly to the previous one, this application shows a great spatial locality. We generated the trace using the input "d.poly-ch2" available in the Gromacs benchmark 1 .
• Graph500 benchmark using the replicated-csr implementation, a scale factor of 20 and an edge factor of 16 [37]. All the communications are generated by MPI collective primitives, which produce a great exchange of data among tasks. This application generates the highest network load of all the applications tested.
• HPCC MPI Random Access [34] (or MPIRA). Most of the communications are performed by MPI point-to-point primitives. The messages are uniformly distributed among all the tasks, making the traffic pattern very close to uniform traffic.

Trace scheduler evaluation
Since the number of tasks per trace is limited by the size of the GALGO supercomputer, we have developed an oblivious trace scheduler to overcome this limitation and to be able to evaluate more realistic network traffic.
Given a set of traces, the scheduler operates as follows: i) it checks the number of available cores; ii) it verifies the number of tasks in each trace and marks a trace as eligible if its number of tasks is lower than the number of available cores; iii) it randomly chooses an eligible trace and allocates it to the first free nodes. This process is repeated until all the traces are mapped or there are no free nodes left to map another trace. When a trace finishes and its resources are released, the scheduler runs again.
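The steps above can be sketched as follows. The data structures and names are ours, not the simulator's actual API, and the sketch covers a single scheduling pass with nodes abstracted to a pool of cores.

```python
import random

# Sketch of one pass of the oblivious trace scheduler described above.

def schedule(traces, free_cores, rng=random):
    """traces: list of (name, num_tasks) pairs. Launches eligible
    traces at random until none fits; returns the launch order and the
    cores left over. Pending traces would wait for resources to free."""
    pending = list(traces)
    launched = []
    while True:
        eligible = [t for t in pending if t[1] <= free_cores]
        if not eligible:
            break  # rerun the scheduler when a running trace finishes
        choice = rng.choice(eligible)
        pending.remove(choice)
        free_cores -= choice[1]  # allocate the first free nodes
        launched.append(choice)
    return launched, free_cores

# e.g. a 64-node system with 8 cores per node offers 512 cores:
order, left = schedule([("graph500", 512), ("namd", 128)], 512)
```

Because the choice among eligible traces is random, each run with a different seed yields a different placement, which is why the evaluation averages over 30 executions.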
We consider that all nodes have 8 cores in the following scheduler evaluation, e.g. the 64-node networks have 512 cores, the 256-node networks have 2048 cores, etc.
We have evaluated three different sets of traces, combining traces from the five applications shown in Section 3.3 with three different number of tasks: 128, 256 and 512 tasks.
All trace sets have the same number of applications (15 traces: three task sizes per application). The trace sets have been configured to provide increasing communication requirements. This is achieved by removing Namd and Gromacs in sets Mid and High, respectively, while including additional instances of Graph500, which generates a higher network load.
The three trace sets are the following: • Set Low: Namd, Gromacs, Linpack, MPI Random Access and Graph500. This trace set generates approximately 715 Gbytes and injects 645 Gbytes into the network 2 .
The amount of traffic generated by each trace and its contribution to the total traffic generated by each set can be found in Table 5.
Finally, it should be noted that this scheduler is oblivious, i.e., it does not apply any strategy to optimize the use of the nodes or the network: it randomly selects an application to be run, taking into account only the number of available cores. Designing a scheduler that optimizes the available resources is beyond the scope of this work. Therefore, for each set of traces and each topology, we have carried out 30 different executions varying the random seed, and the average value and the 95% confidence interval are shown in the evaluation figures.
2 Note that the traffic between two tasks in the same multicore processor is not injected into the network and is internally managed by the VEF trace framework. For this reason, the amount of traffic injected into the network is slightly lower than the amount of traffic generated by the traces.

Evaluation
This section presents the results of the performance and energy consumption evaluation. Section 4.1 shows the results for the 64-node networks, while Section 4.2 shows the results for the 256-node networks.

64-node network evaluation
Figure 3 shows the Runtime, E_net and E_cluster results for each topology connecting at least 64 computing nodes. Every plot depicts three groups of contiguous bars, each one corresponding to the same topology family (namely torus, fat-tree and Dragonfly). Systems based on the different network topologies have been tested with a power saving mechanism based on Low Power Idle and Power-Down Threshold, indicated with the -P or -PH suffixes as described in Section 3.2.
Results on energy consumption at network level (E_net plots) indicate that the power saving mechanisms provide reductions for every configuration. All network topologies are highly sensitive to power saving solutions based on dynamically switching links on and off. Even though the network load may take longer to be delivered, energy savings are always achieved: the increase in time is compensated by the power reduction.
Results on energy dissipation at system level (E_cluster plots) indicate that energy savings at network level have a positive impact on the energy consumption of the entire system. The exception is the Dragonfly topology when the default PDT is used, as indicated in Section 3.2. The poor performance achieved when links are placed into low-power mode with a short PDT (10 µs in our experiments, labeled with the -P suffix) is due to the lack of adaptivity of the Dragonfly routing algorithm. Since the routing algorithm cannot choose a path based on the power status of the network links, the number of asleep ports in the chosen path increases, degrading system performance and increasing system energy consumption due to the longer execution times. That limitation is solved by using a longer PDT for the Dragonfly-based systems (100 µs in our experiments, labeled with the -PH suffix). With this less aggressive PDT value, the performance degradation is lower and therefore the system energy consumption decreases, although the links are turned off fewer times than with a more aggressive PDT value.
Runtime results (Runtime) indicate, as expected, that energy saving solutions based on powering down links increase the execution time. In our evaluation, all application mixes increase their runtime when network links are moved to a low-power state during idle periods. The time required to change the link status, during which data cannot be sent, increases the runtime. Nevertheless, the impact on energy consumption is always favorable: although applications require longer to complete, the overall system energy decreases.
To ease choosing the best configuration, Figure 4 shows the runtime-energy relationship for the three analyzed scenarios on a 64-node system. This plot shows the trade-off between both figures of merit. The best configuration is the one that achieves the lowest execution time and consumes the least energy; in the figure, it would be the bottom-left point of the plot. If several configurations achieve a similar execution time (i.e. they are at the same value on the Y axis), the one with the lowest energy consumption (i.e. the lowest X value) is the best choice. If several configurations consume a similar amount of energy (i.e. they are at the same value on the X axis), the one with the lowest execution time (i.e. the lowest Y value) is the best choice.
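This selection rule (prefer the point that is lowest and leftmost) amounts to keeping only the non-dominated, i.e. Pareto-optimal, configurations. A minimal sketch, with made-up example values:

```python
# Sketch of the selection rule used with the runtime-energy plots:
# keep only the non-dominated (Pareto-optimal) points. The example
# configuration values below are made up for illustration.

def pareto_front(points):
    """points: dict name -> (energy, runtime). A point is kept if no
    other point is at least as good in both metrics."""
    front = {}
    for name, (e, t) in points.items():
        dominated = any(e2 <= e and t2 <= t and (e2, t2) != (e, t)
                        for n2, (e2, t2) in points.items() if n2 != name)
        if not dominated:
            front[name] = (e, t)
    return front

configs = {"3D-torus-P": (10.0, 5.0),
           "fat-tree-P": (12.0, 6.0),   # dominated by the torus
           "dragonfly-PH": (11.0, 4.5)}
print(pareto_front(configs))
```

Among the surviving points, the final choice depends on whether performance or energy receives the most importance.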
Let us apply this analysis to the results shown in Figure 4. For all the problem sizes analyzed (Set Low, Set Mid and Set High), the best configuration from the network energy point of view is the 80-node power-saving Dragonfly: it clearly has the lowest network energy consumption and an acceptable execution time. However, if we consider the system energy, the winner is the 3D torus with power saving. It provides one of the lowest execution times, the shortest for sets Mid and High, and the best whole-system energy figures.
The points of the low-arity fat-tree (k = 4) are in the upper right area of the plots, and therefore this topology should be discarded. The fat-tree configuration with higher arity (k = 8) provides system energy results close to the torus despite its longer runtime, due to its lower network energy consumption.
In general, the Dragonfly provides the worst energy consumption. As can be seen in Figure 4, the Dragonfly points are the rightmost points of the plot, especially when the PDT value is 10 µs. This happens even though the runtime is acceptable: it is the fastest system for Set Low and has a shorter runtime than the fat-tree in the remaining sets. The reason for those poor energy metrics is the excess of computing nodes required by the Dragonfly (72 or 80) compared to the torus and the fat-tree. Despite the Dragonfly using more computing nodes, the overall runtime is not reduced enough, and energy consumption increases.

256-node network evaluation
Figure 5 shows the Runtime, E_net and E_cluster results for each topology connecting at least 256 computing nodes. As in Figure 3, every plot depicts three groups of contiguous bars, each one corresponding to the same topology family (namely torus, fat-tree and Dragonfly). Systems based on the different network topologies have been tested with a power saving mechanism based on Low Power Idle and Power-Down Threshold, indicated with the -P or -PH suffixes as described in Section 3.2. For the Dragonfly configurations, only the results for 100 µs PDT are shown (-PH suffix), since the results for 10 µs PDT are significantly worse.
As in the experiments with 64 computing nodes, all network topologies benefit from the power saving mechanism, since E_net is significantly reduced in all cases. The measured contribution of the interconnection network to the overall system energy consumption, across all tested configurations, ranges from 7% to 22% when no power-saving mechanism is used. This result matches estimations previously reported by Abts et al. [38].
The impact of the power saving mechanisms on runtime is low, and valuable energy reductions are achieved at system level (E_cluster) for all our case studies. Indeed, in several cases, such as the 4D and 3D tori or the 16-ary fat-tree with set Low, the runtime marginally decreases when applying power saving mechanisms. In these low-load cases, powering down several network links can actually reduce runtime: the change in topology caused by disconnecting some links limits the adaptivity of the routing algorithm, concentrating the traffic on fewer links and thus reducing conflicts between packets, which reduces packet latency and, in turn, the execution time.
At system level, the torus topology with the highest number of dimensions again provides the minimum energy consumption for all the configurations tested. Only for Set Low do the fat-tree with k = 16 and the Dragonfly with 272 nodes provide similar results. For the other sets, where the network load is higher, the torus significantly outperforms the fat-tree and the Dragonfly in energy requirements.
To ease choosing the best configuration, Figure 6 shows the runtime-energy relationship for the three analyzed scenarios for a 256-node system. For the execution time versus network energy relationship and the Set Mid and Set High problem sizes, the 342-node Dragonfly is the best choice. Although there are faster networks, their increased energy consumption makes them unattractive. This network also obtains good results for Set Low. The fat-tree with k = 18 also achieves a good trade-off between runtime and network energy.
Conversely, for the more interesting execution time versus system energy relationship, the 256-node 4D torus with power saving is the absolute winner for Set Mid and Set High. Although the 342-node Dragonfly and the 320-node 4D torus are faster (around 5∼6% faster), the energy reduction of the 256-node 4D torus is much more significant (around 10∼15%). However, the 342-node Dragonfly and the 320-node 4D torus with power saving also offer a good trade-off between system energy consumption and execution time. Interestingly, for low loads (Set Low) there is much more variability in the results, and the best choice depends on which parameter receives the most importance. To improve performance, the best choices are the 18-ary fat-tree and the 342-node Dragonfly, both with power saving, whereas the most interesting options to reduce energy consumption are the 272-node Dragonfly and the 256-node 4D torus, again with power saving.
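The performance-first versus energy-first selection illustrated by Figure 6 amounts to keeping the configurations on the runtime-energy Pareto front. A minimal sketch (the configuration names and the normalized numbers below are made up for illustration, not the paper's measurements):

```python
def pareto_front(configs):
    """Return the names of configurations not dominated in
    (runtime, energy): a config is dominated if another one is no
    worse in both metrics and strictly better in at least one."""
    front = []
    for name, rt, e in configs:
        dominated = any(
            (rt2 <= rt and e2 <= e) and (rt2 < rt or e2 < e)
            for _, rt2, e2 in configs
        )
        if not dominated:
            front.append(name)
    return front

# Illustrative (runtime, energy) values, normalized; not real results.
configs = [
    ("4D-torus-256-P",   1.00, 0.85),
    ("dragonfly-342-PH", 0.95, 1.00),
    ("fat-tree-18-P",    1.05, 0.95),
]
# pareto_front(configs) -> ["4D-torus-256-P", "dragonfly-342-PH"]
```

Any point off the front, like the hypothetical fat-tree entry above, is a worse choice regardless of whether performance or energy receives the most importance.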

Related work
Various research works investigate energy consumption in high-speed interconnection networks, aiming to improve the energy efficiency of HPC systems and data centers.
Kim et al. [39] evaluate dynamic voltage scaling and on/off techniques for network link energy consumption. For a mesh topology, they show that shutting links down performs significantly better. Soteriou et al. [40] propose a dynamic power management (DPM) mechanism for mesh topologies. Their proposal relies on a deadlock-free fully-adaptive routing algorithm that allows traffic redirection when links are turned off. These results show that link shutdown, supported by an efficient network switch design, has the potential to provide significant power savings with moderate impact on performance. Alonso et al. [41,42,43] target torus interconnect topologies based on aggregated links. Their works propose dynamically turning network links on and off as a function of traffic, while maintaining at least one active link per aggregated link. For very low loads on a single active link, the link bandwidth is reduced to obtain additional power savings. An advantage over previous works is that neither the network topology nor the routing algorithm is modified. Similarly, additional work [44] extends the approach of shutting down redundant links to fat-tree systems, demonstrating that a significant reduction in network power consumption can be obtained with limited impact on performance.
Gunaratne et al. [45] estimate the potential energy savings achievable by using Adaptive Link Rate (ALR) on Ethernet links. Their results, obtained by simulation with synthetic traffic patterns, show that an Ethernet link can operate at a low data rate most of the time, yielding significant energy savings with a small impact on packet delay. However, ALR keeps consuming power during idle periods, although at a reduced rate due to the slower link speed. Low Power Idle (LPI), proposed in the IEEE Energy-Efficient Ethernet (EEE) standard (IEEE 802.3az) [14], saves power by switching off transceiver components when a link is idle. LPI improves on ALR because it requires shorter transition delays (microseconds versus milliseconds) and offers the maximum power savings: up to 90% less power consumption than the fully active mode [46]. Although some studies have combined ALR with LPI, other works show that this combination introduces a significant performance degradation in terms of the QoS the application receives (higher latency and jitter), which is not acceptable for end users [47].
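The advantage of LPI over ALR described above can be seen with a back-of-the-envelope average-power model as a function of link utilization. The power figures are illustrative assumptions, except for the ~90% LPI saving cited from the EEE literature:

```python
def avg_power_alr(util, full_w=1.0, low_rate_w=0.4):
    """ALR: idle time is spent at a lower data rate, which still
    draws a substantial fraction of full power (assumed 40%)."""
    return util * full_w + (1 - util) * low_rate_w

def avg_power_lpi(util, full_w=1.0, lpi_w=0.1):
    """LPI: idle time is spent with transceiver components off,
    drawing ~10% of full power (the 'up to 90% saving' figure)."""
    return util * full_w + (1 - util) * lpi_w

# At 10% utilization (typical of lightly loaded links):
# avg_power_alr(0.1) -> 0.46 W
# avg_power_lpi(0.1) -> 0.19 W
```

Under these assumptions, the lower the utilization, the larger the gap in favor of LPI, which is why LPI suits the mostly idle links of an HPC interconnect.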
D. Abts et al. [38] show that high-performance interconnects consume a significant fraction of total system energy, in particular when servers operate at a fraction of their maximum utilization. Since servers typically run at low utilization levels and are increasingly becoming energy-proportional, this work highlights the need for more energy-proportional networks. Based on an analytical consumption model, they propose reducing link data rates during periods of inactivity to save energy, but do not provide solutions. Saravanan et al. [48] propose a mechanism to reduce link energy consumption for Energy Efficient Ethernet. Their proposal relies on a variable-length stall timer, used by links to enter low-power states, that is dynamically set according to link statistics collected by NICs and switches and a performance overhead bound. Simulation results using traces for a hierarchical fat-tree-based topology show significant energy savings at link level with limited performance degradation.
Zahn et al. [49] perform a comparative evaluation of different strategies for obtaining energy savings on torus, fat-tree and Dragonfly networks. They compare mechanisms that set link power states as a function of link utilization. However, this proposal only presents energy metrics at link level and does not discuss overall system performance. In a previous work [10], we presented a comparative performance and energy evaluation of different torus network configurations when link on/off power reduction mechanisms are used. Our simulations, using real traces, evaluate total system energy consumption (including the energy consumed by the network fabric and the compute nodes) and application execution time. We show that using a power-saving mechanism always pays off and that aggregated-link torus topologies provide the best trade-off between performance and energy consumption. In this paper, we extend that previous work by considering not only torus but also fat-tree and Dragonfly topologies, thus allowing a comparison of the most widely used topologies. Indeed, the combined performance-energy plots presented in this paper allow an easy comparison of results and the selection of the best configuration according to either a performance-first or an energy-first criterion. Moreover, this paper adopts as power-saving mechanism the Low Power Idle proposed in the IEEE Energy Efficient Ethernet standard, making the obtained results applicable to commercial interconnection networks.

Conclusion
We present an energy/performance study of HPC systems based on energy-efficient interconnects for realistic, multi-job trace-based workloads. To the best of our knowledge, this is the first research work in which actual power consumption metrics measured on a real system are presented and used to configure the simulator. We conduct experiments on three widely used network topologies, namely torus, fat-tree and Dragonfly, applying low-power modes based on the Energy Efficient Ethernet standard.
Results show that all network topologies provide significant energy savings at network level when low-power modes are applied. At system level, overall energy consumption always decreases when the interconnection network implements Low Power Idle and Power Down Threshold, despite any increase in runtime. This demonstrates that implementing power-saving strategies in interconnection networks pays off.
For the evaluated case studies, the torus topology with the highest number of dimensions provides the lowest energy consumption at system level and generally one of the shortest runtimes, achieving in some cases the best performance (e.g. Sets High and Low in the 64-node systems). Although the torus topology consumes significantly more energy at network level, when low-power mechanisms are applied this energy is drastically reduced, contributing to excellent overall system energy consumption. Fat-trees show good efficiency for low loads, but for medium and high loads they provide the worst results both in terms of performance and energy. The Dragonfly offers results close to those of the torus. Overall, we find that the torus topology provides the best energy-performance trade-off in all cases, followed by the Dragonfly.