1 Introduction

High-performance computing (HPC) is a type of computing that uses high-end computing components to cooperatively address large-scale tasks that cannot be solved easily by ordinary computers. The computing components are connected by HPC networks to achieve better efficiency.

An HPC network differs from other networks in that it often seeks to synchronize communication and computation so that communication interrupts computation as little as possible, which increases efficiency. An HPC network also tends to use homogeneous computing hardware, such as the same model of switches (with an equal number of ports), CPUs, and accelerators across the entire implementation. Homogeneous components lower the price of each part through mass production and make restoration more straightforward, because a failed part can be replaced promptly.

Hwang et al. have shown the potential of the Equality network compared with a few popular HPC network topologies [1,2,3,4] such as 2-tier fat-tree, 3-tier fat-tree, 3D torus, and 5D torus. In this work, we further analyze the performance of Equality networks at different scales and compare them with Slim Fly, Dragonfly, and two popular network topologies, fat-tree and tori. We also extend the focus to applying Equality networks to machines capable of reaching multi-exaflops with current hardware technology.

The main contributions of the current work that are different from previous works include the following:

  • The development and implementation of the systematic routing tables for Equality networks,

  • The modified routing algorithm bottleneck-UGAL, which avoids over-subscribed paths,

  • The introduction and explanation of a new measure called bisection ratio in addition to bisection bandwidth,

  • The analysis of the resulting network properties (diameter, average distance, latency, and throughput) of various scales of the Equality networks and the comparison to other existing publications,

  • The strategies of finding a suitable configuration for a future HPC system utilizing Equality network topology, and

  • The largest cycle-accurate simulation ever calculated by BookSim (a 1 M-endpoint system).

2 Network architecture

Network topologies are often designed in advance to fit specific workloads. To judge the quality of a network and whether it suits the target application workloads, one can inspect the performance measures of the network and additionally perform simulations on it. The standard network measures used in this paper include the network diameter d and average distance a. The standard communication measures are the message latency and the network’s overall throughput under different traffic patterns and injection intensities.

A well-balanced topology should have a reasonable network diameter and be accompanied by tailored routing algorithms that reduce the latency and increase the throughput. Nevertheless, for any application, a given network has an effective diameter, \(d_{\text {eff}}\), if all communication patterns of that application use no more than \(d_{\text {eff}}\) hops in the network, regardless of the actual network diameter. The same idea applies to the average hop count.

Under low injection rates, when a packet never contends with other packets for resources, the latency is called the ‘zero-load latency,’ \(l_0\), which is the sum of the latency contributions when no queuing or blocking is involved. Under high injection rates, when the network is saturated under the specific load, traffic pattern, and routing algorithm, the latency rises sharply and the throughput approaches the saturation throughput, \(t_\text {sat}\).

Predominantly, the latency can be reduced if the network diameter d is reduced; however, the network latency and throughput still depend on the traffic patterns of real applications. Several traffic patterns [5] are devised based on communication patterns triggered by real-world applications. For instance, the transpose traffic mode is encountered in corner-turn [6] and matrix transpose applications. Choosing a network that outperforms others on most traffic patterns contributes to a better system.

2.1 Switching delays

The port-to-port latency of contemporary high-end switches ranges from tens of nanoseconds to a few \(\mu\)s. The switching delay depends on many parameters, including the cable length, cable material, optical-electrical conversion, buffer size, switching logic cycle frequency, switching memory access time, routing table size, routing calculation complexity, hop count, etc. The hardware performance can be calculated by summing up all the instruction cycles contributed by each logic element, multiplied by the time in nanoseconds per cycle. Hardware issues, such as memory access time and logic frequency, are outside the scope of the current paper.

If the latency per hop is fixed once the switching hardware is selected, the only way to reduce the overall latency of the system is to fine-tune the topology. In reality, messages may wait in channel buffers in packet-based routing architectures or for the channel to be freed in wormhole routing architectures. In the event of network congestion, the hopping distance may have a lesser impact on the network latency; therefore, the design of the topology and routing algorithm should not focus only on the diameter of the network.

2.2 Network selection

Many existing topologies have inherent shortcomings. Torus networks suffer from long hopping distances and over-subscribed paths, which make routing difficult. Although fat-tree networks perform well on all permutation traffic patterns, the cost for large fat-tree networks is considerably higher, and the number of layers determines the zero-load latency (\(l_0\)) of a fat-tree network.

Most topologies have a total router count that is the product of integers, meaning the number of routers cannot be changed easily if the budget is modified. Dragonfly networks suffer from low global bandwidth, which becomes the bottleneck of global traffic. The irregularity of Slim Fly network links makes routing intricate when the network diameter is more than two. According to its mathematical expression, Slim Fly MMS requires very high-radix routers when it scales up. With 80-port routers, the largest configuration it can offer is the \(k'=53\), \(\delta =-1\) MMS, which can hold about 2450 routers and 63,700 endpoints. Once a router model (in this case, 80-port) is selected for Slim Fly MMS, the number of endpoints is almost fixed. The other closest solutions in this range are:

  • \(\delta =0\), holding 2048 routers and 49,152 endpoints.

  • \(\delta =1\), holding 2402 routers and 52,272 endpoints.

and one can see these solutions differ significantly from the first solution.

Daryin and Korzh [7] searched for low-diameter topologies with optimal performance for the Russian supercomputer manufactured by T-Platforms. They compared a hybrid of Slim Fly and Flattened Butterfly with Dragonfly, tori, hypercube, and Flattened Butterfly. In the end, they chose Flattened Butterfly for their system, #22 in the November 2014 Top500 list [8] with an Rmax of 1.8 petaflops. This shows that, from the perspective of a system designer, the window for finding an appropriate system size can be very narrow. The comparison of our results with this petascale system is shown in Fig. 5 (Sect. 5).

Hwang et al. have compared the performance of Equality network against 2-tier fat-tree [1], 3-tier fat-tree [2], and 3D torus [3]. They also show that Equality can be used to design many-core computer chips [4].

3 Methods

In the following sections, we recapitulate the construction of an Equality network (Sect. 3.1). Section 3.2 addresses how we describe Equality networks and some related prior art. Further, we detail the optimization procedures (Sect. 3.3) and the routing table construction procedures (Sect. 3.4). Section 3.5 briefly describes the three routing algorithms utilized in this work, and Sect. 3.6 describes the routine for cycle-accurate simulation.

The concentration p depends on what application will run on the network, while the system is being designed. However, finding a proper p under saturated traffic is essential. The balance of network radix K and concentration p, together with the system’s networking cost and bisection bandwidth, is discussed in Sect. 4.

To get the estimated performance of our designs, we utilize the open-source BookSim [9] for cycle-accurate simulation of our networks. The package we use is downloaded from its GitHub site (https://github.com/booksim) and later modified locally to include our home-brewed routing procedures and algorithms. We implemented our topology, routing algorithms, and a few extra traffic models into BookSim. The results in various system scales are discussed in Sect. 5.

Symbols and notations used in the current paper are listed in Table 1.

Table 1 Symbols and notations used in the paper

3.1 Connection rules

Since our previous works are short conference papers, we take this opportunity to elaborate on the initial conception of the Equality network topology.

Upon constructing a network, one of the intuitive approaches is to link all the nodes into a ring. A ring topology has a network radix of 2. At the dawn of the current study, we looked at Hamiltonian cycles and sought a way to reduce the network’s diameter. Since we focus only on Hamiltonian networks that have equal radix on all router nodes, adding a link on one router means that the same link has to be applied on all routers. To keep the routing identical for all routers, we tried many strategies for adding connections on all nodes.

To make the interconnects, every router makes links according to the following rules. An Equality network has N routers, where N is an even number. The routers are sequentially numbered from zero to \(N-1\), i.e., \(r_0, r_1, \ldots , r_{N-1}\). The routers are connected to form a ring and later to other ring members, just like chordal ring topologies.

A set of positive integers, \({ C }\), ranging from 2 to \(N-3\) and excluding any even numbers greater than N/2, serves as the candidates for making the physical links (see Footnote 1). From \({ C }\), a subset of integers, \({ S }\), composed of \(S_\text {A}\) (a collection of odd positive integers) and \(S_\text {B}\) (a collection of even positive integers), is selected for the interconnections. For every number \(S_j\) in \({ S }\), every router \(r_i\) is connected with \(r_{(i + S_j) \bmod N}\) if i is even, or with \(r_{(i - S_j) \bmod N}\) if i is odd.

Fig. 1
A simple Equality network named N14K6[\(-1\),1,3,9](4). The bold lines mark the links originating from \(r_0\). The solid lines mark odd links, and the dashed lines mark even links
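
The connection rule can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used for the results: the function names `equality_links` and `degrees` are hypothetical, and treating the offsets \(-1\) and 1 listed in the square brackets as the ring links is an assumed reading of the notation.

```python
def equality_links(n, s_a, s_b):
    """Return the undirected links of an Equality network as (low, high) pairs.

    n   : number of routers (must be even)
    s_a : odd hop offsets (assumed to include -1 and 1 for the ring)
    s_b : even hop offsets
    """
    if n % 2 != 0:
        raise ValueError("N must be even")
    links = set()
    for i in range(n):
        for s in list(s_a) + list(s_b):
            # Even routers hop forward by s, odd routers hop backward by s.
            j = (i + s) % n if i % 2 == 0 else (i - s) % n
            if j != i:
                links.add((min(i, j), max(i, j)))
    return links


def degrees(n, links):
    """Count the inter-router connections of every router."""
    deg = [0] * n
    for a, b in links:
        deg[a] += 1
        deg[b] += 1
    return deg


# The network of Fig. 1: N14K6[-1,1,3,9](4)
links = equality_links(14, [-1, 1, 3, 9], [4])
print(len(links), set(degrees(14, links)))  # 42 {6}
```

Every router ends up with six inter-router connections, which agrees with the radix 6 stated in the name of the network.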

3.2 Syntax

The general notation of an Equality network consists of an ‘n’ followed by the total number of routers, a ‘k’ followed by the network radix, and a ‘p’ followed by the number of endpoints attached to each router. The ‘p’ part can be omitted if the number of attached endpoints is not yet specified. Both uppercase and lowercase letters are allowed for ‘n,’ ‘k,’ and ‘p’ as long as they appear in this sequence to describe the constraints of the target network. The notation gives designers a rough idea of the configuration the network follows rather than the full specification.

A pair of square brackets and a pair of parentheses enclosing comma-separated numbers give the detailed specification of an Equality network, where \(S_\text {A}\) is listed in the square brackets and \(S_\text {B}\) in the parentheses. For instance, an Equality network named N14K6[\(-1\),1,3,9](4) is presented in Fig. 1. As described above, the ‘N’ and ‘K’ mark the number of router nodes and network radix constraints, respectively. Hence, the number 14 means there are 14 routers in the network, and the number 6 indicates that the network radix of the routers is six. This specification details the hops but does not say how many endpoints are attached.

Remark 1

For the general configuration of a network, one can also use the shorthand notation n14k6p3 to represent all Equality networks with 14 switches, network radix 6, and 3 endpoints per switch.

An Equality network has an equal number of inter-router connections in all switches. The number of inter-router connections, K, can be evaluated by the following equation:

$$\begin{aligned} K = \text {len} ( S_\text {A} ) + {\left\{ \begin{array}{ll} 2\,\text {len} ( S_\text {B} ) - 1, & \text {if } N/2 \in S_\text {B} \\ 2\,\text {len} ( S_\text {B} ), & \text {if } N/2 \notin S_\text {B} \end{array}\right. } \end{aligned}$$
(1)

A direct explanation of Eq. 1 is that each odd number adds one inter-router connection to each router, and each even number adds two. The exception is the diameter link of the ring (hop N/2), which, like an odd number, adds only one inter-router connection to each router. The diameter link of the ring can be either an odd or an even link.

The breakthrough of Equality in chordal ring topologies [10,11,12,13,14] is in the mixing of multiple even and odd links in a ring of even nodes while keeping the alternating nature of odd links. In addition, systematic routing rules are provided for derived networks.

For instance, in 2016, Faraha et al. [13] discussed degree-six 3-modified chordal ring networks, where the total number of nodes N must be divisible by 3, and every three nodes are grouped into a class. Zabłudowski et al. [15] discussed modified network double ring structures. The only publication we found that shows a likeness to the Equality networks is [12] (modified chordal ring CHR5_a(20; 3,7), CHR5_c(16; 3,5,7) and CHR5_d(16; 3,5,8)); however, it is not based on the same construction rules and applies only to radix-5 networks.

3.3 Network optimization

The Equality topology offers a plethora of networks to be assessed and put into practice. One needs to set a goal to find the candidates.

3.3.1 Assignment of \(\mathbf {S_A}\) and \(\mathbf {S_B}\)

Once the network radix is decided, one follows Eq. 1 to constrain the lengths of \(S_\text {A}\) and \(S_\text {B}\) to achieve a fixed K. For instance, suppose one would like to construct a network of 1840 routers with network radix 17. If four even numbers are selected as \(S_\text {B}\) and hop 920 (i.e., N/2) is not among them, they contribute eight connections to each router. If \(S_\text {B}\) includes 920, they consume seven connections on each router. Assuming 920 is not in \(S_\text {B}\), nine additional numbers can be added to \(S_\text {A}\) to obtain an Equality network of radix 17.
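
The bookkeeping of Eq. 1 for this example can be written as a short Python sketch. The helper names `network_radix` and `odd_links_needed` are hypothetical; only the lengths of the two sets and the presence of the N/2 hop matter, so no concrete hop values are assumed.

```python
def network_radix(len_a, len_b, half_ring_in_b):
    """Eq. 1: K from the sizes of S_A and S_B.

    half_ring_in_b is True when the hop N/2 is a member of S_B,
    in which case that hop contributes one connection instead of two.
    """
    return len_a + 2 * len_b - (1 if half_ring_in_b else 0)


def odd_links_needed(k, len_b, half_ring_in_b):
    """Number of odd hops S_A must hold to reach radix k for a given S_B."""
    need = k - network_radix(0, len_b, half_ring_in_b)
    if need < 0:
        raise ValueError("S_B alone already exceeds the requested radix")
    return need


# Radix-17 example with four even hops
print(odd_links_needed(17, 4, half_ring_in_b=False))  # 9
print(odd_links_needed(17, 4, half_ring_in_b=True))   # 10
```

With 920 in \(S_\text {B}\), the same radix would therefore require ten odd hops instead of nine.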

3.3.2 Optimization

We optimize Equality networks with a genetic algorithm. Ideally, an initial random seed is given for the generation of \(S_\text {A}\) and \(S_\text {B}\) from \({ C }\), with the constraint of a predefined radix K. We then select the population size and other simulation parameters, such as the maximum number of generations, and mutation rate. The goal of the optimization is to minimize the product of the average distance a and network diameter d. At the end of the optimization, a series of best results from generations of evolution are reported. If the search space being explored is large, the optimized results are not necessarily the global minimum; however, the results are usually low enough for application. If a sufficient amount of evolution is conducted, one can get results close to the global minimum.
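
The objective evaluated for each candidate can be sketched as follows. This is a minimal sketch under stated assumptions, not the optimizer used in this work: the function names are hypothetical, the average distance is assumed to be taken over the \(N-1\) other routers, and a single breadth-first search from \(r_0\) is assumed to suffice because every router sees the same structure (Sect. 3.4). The genetic operators themselves are not shown.

```python
from collections import deque


def neighbours(i, n, s_a, s_b):
    """Routers directly linked to r_i under the Equality connection rule."""
    sign = 1 if i % 2 == 0 else -1
    out = {(i + sign * s) % n for s in s_a}   # odd hops: one link each
    for s in s_b:                             # even hops: links in both directions
        out.add((i + s) % n)
        out.add((i - s) % n)
    out.discard(i)
    return out


def hop_counts_from_r0(n, s_a, s_b):
    """Breadth-first search from r_0, returning the hop count to each reachable router."""
    dist = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in neighbours(u, n, s_a, s_b):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def fitness(n, s_a, s_b):
    """Return (a * d, d, a); disconnected candidates get an infinite score."""
    dist = hop_counts_from_r0(n, s_a, s_b)
    if len(dist) < n:
        inf = float("inf")
        return inf, inf, inf
    d = max(dist.values())
    a = sum(dist.values()) / (n - 1)
    return a * d, d, a


score, d, a = fitness(14, [-1, 1, 3, 9], [4])
print(d, round(a, 3))  # 2 1.538
```

A genetic algorithm would call `fitness` on each member of the population and keep the candidates with the lowest score.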

For large systems, the empirical approach can involve optimizing only \(S_\text {A}\) while keeping \(S_\text {B}\) fixed. For instance, some of our systems are derived by handpicking \(S_\text {B}\) (and possibly a part of \(S_\text {A}\)) and optimizing the remaining \(S_\text {A}\); otherwise, the search space would be too large to explore. The designer has to adjust \(S_\text {B}\) many times in length and composition and optimize \(S_\text {A}\) to reach a better set of answers.

3.4 Routing table

Routing is an essential part of communication. In the current work, the routing procedures include a universal routing table and three routing algorithms.

Remark 2

In an Equality network, an even node and an odd node see an identical network structure, but in the reverse direction.

Starting from \(r_0\) in an Equality network, one can define a set of paths \({ P }(r_0,r_j)\) to any other node \(r_j\) in the network. From \({ P }(r_0,r_j)\), one can also find the shortest paths \(\overrightarrow{{ P }}(r_0,r_j)\) from \(r_0\) to \(r_j\) and save the paths as the first routing table \(\widehat{{ R }}\) of the network.

From Remark 2, one can then derive the shortest paths \(\overrightarrow{{ P }}(r_i,r_j)\) between \(r_i\) and \(r_j\) by simple conversion. Since the network is symmetric, the shortest paths (in fact, any paths) between two nodes depend only on the difference of their IDs modulo the total node number N, as described in Eq. 2.

$$\begin{aligned} { P }(r_i,r_j) \equiv {\left\{ \begin{array}{ll} { P }(r_0,r_{(j-i) \bmod N}), & \text {if } i \bmod 2 = 0 \\ { P }(r_0,r_{(i-j) \bmod N}), & \text {if } i \bmod 2 \ne 0 \end{array}\right. } \end{aligned}$$
(2)
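
The conversion of Eq. 2 can be sketched in Python as below. The helper names `relative_target` and `translate_path` are hypothetical, and representing a path as a list of router IDs is an assumption made for illustration.

```python
def relative_target(i, j, n):
    """Eq. 2: index j' such that paths r_i -> r_j correspond to paths r_0 -> r_j'."""
    return (j - i) % n if i % 2 == 0 else (i - j) % n


def translate_path(path_from_r0, i, n):
    """Map a path stored for r_0 onto the equivalent path that starts at r_i.

    Even sources shift the stored IDs forward; odd sources mirror them,
    reflecting that odd routers see the network in the reverse direction.
    """
    sign = 1 if i % 2 == 0 else -1
    return [(i + sign * x) % n for x in path_from_r0]


n = 14
# r_2 -> r_5 and r_3 -> r_0 both look up the entry stored for r_0 -> r_3
print(relative_target(2, 5, n), relative_target(3, 0, n))  # 3 3
print(translate_path([0, 3], 2, n))  # [2, 5]
print(translate_path([0, 3], 3, n))  # [3, 0]
```

Only the table for \(r_0\) needs to be stored; every other router reuses it through this index arithmetic.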

Apart from the shortest paths, \(\overrightarrow{{ P }}(r_0,r_j)\), we are also interested in the paths that are slightly longer, which are the sub-shortest paths \(\overrightarrow{{ P }}'(r_0,r_j)\), from \(r_0\) to \(r_j\).

For each target node \(r_j\), the sub-shortest paths \(\overrightarrow{{ P }}'(r_0,r_j)\) can be evaluated by checking all neighbors \(r_k\) that are not on the shortest paths \(\overrightarrow{{ P }}(r_0,r_j)\), finding all the distances \(d_{0k}\), picking the nodes \(r_k\) with \(d_{0k} \ge d_{0j}\), and collecting the list of nodes \(r_k\) for which \(d_{0k}\) is minimal. The collected list \(L_{0j}\) of nodes \(r_k\) can then be combined with the shortest-path routing table \(\widehat{{ R }}\) to form another routing table \(\widehat{{ R }'}\).

\(\widehat{{ R }'}\) thus contains the shortest paths and the alternative targets for sub-shortest paths.

The definition of \(\widehat{{ R }'}\) requires only the total router number N and the interconnection configuration, not the number of endpoints p for each router. Therefore, the complexity of the routing table of an Equality network is O(N) instead of the O(\((Np)^2\)) of an irregular network.

3.5 Routing algorithms

The routing algorithms including the adaptive minimal (abbreviated amin), non-minimal UGAL (global adaptive routing using local information, abbreviated ugal), and bottleneck-UGAL (abbreviated bgal) are provided in this work to assess the performance of the Equality networks under consideration.

3.5.1 amin

The adaptive minimal routing algorithm routes the packets through the shortest paths, where there may be one or more shortest paths.

From the universal routing table \(\widehat{{ R }'}\) the alternative intermediate router IDs mentioned above as \(L_{0j}\) are used for ugal and bgal routing algorithms.

3.5.2 ugal

The ugal routing algorithm, originally defined in [16], is implemented and simulated for comparison. In the current work, the packet is routed so that all the provided shortest and sub-shortest paths are considered and weighted based on the products of the paths’ queue length and hop distance.

3.5.3 bgal

The bottleneck-UGAL routing is based on the ugal algorithm; however, only the bottleneck pairs among all pair-wise relationships are allowed to use the sub-shortest paths. It differs from ugal in that the available sub-shortest paths are suppressed when the number of shortest paths is higher than a threshold value of 2 (fixed in this work, but adjustable). That is, if the number of shortest paths exceeds the threshold value, only the shortest paths are included in the routing paths.
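
The difference between ugal and bgal can be illustrated with the following Python sketch. It is not the BookSim implementation: the function names are hypothetical, paths are represented as tuples of router IDs, and the queue length of a path is assumed to be the occupancy of the output queue toward its first hop. The router IDs and queue values in the example are made up for illustration.

```python
def hops(path):
    """Hop count of a path given as a tuple of router IDs."""
    return len(path) - 1


def ugal_candidates(shortest, sub_shortest):
    """ugal considers every provided shortest and sub-shortest path."""
    return list(shortest) + list(sub_shortest)


def bgal_candidates(shortest, sub_shortest, threshold=2):
    """bgal drops the sub-shortest paths when enough shortest paths exist."""
    if len(shortest) > threshold:
        return list(shortest)
    return list(shortest) + list(sub_shortest)


def pick_path(candidates, queue_len):
    """Choose the candidate with the smallest queue length x hop count product."""
    return min(candidates, key=lambda p: queue_len[p[1]] * hops(p))


# Hypothetical congestion state: queue occupancy per next-hop router
queue_len = {3: 10, 4: 10, 9: 10, 1: 2}
sub = [(0, 1, 2, 7)]

one_shortest = [(0, 3, 7)]
three_shortest = [(0, 3, 7), (0, 4, 7), (0, 9, 7)]

# A bottleneck pair: bgal may detour, exactly like ugal.
print(pick_path(bgal_candidates(one_shortest, sub), queue_len))    # (0, 1, 2, 7)
# Three shortest paths exceed the threshold of 2: bgal stays minimal.
print(pick_path(bgal_candidates(three_shortest, sub), queue_len))  # (0, 3, 7)
print(pick_path(ugal_candidates(three_shortest, sub), queue_len))  # (0, 1, 2, 7)
```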

3.6 Cycle-accurate simulation conditions

3.6.1 Traffic models

Ten traffic models, including uniform random, asymmetric, random permutation, neighbor, tornado, bit rotation [5], bit complement, bit reverse, bit complement, and bit shuffle, are implemented in BookSim to evaluate the Equality networks. Of the ten traffic models, only bit rotation is implemented locally; it applies the relation \(d_i = s_{(i+1) \bmod b}\) to set a target endpoint for each source.
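
A minimal Python sketch of the bit rotation pattern is given below, reading the relation as "bit i of the destination equals bit \((i+1) \bmod b\) of the source", i.e., a right rotation of the b-bit source address by one position. The function name `bit_rotation` is hypothetical, and the sketch assumes that the number of endpoints is \(2^b\).

```python
def bit_rotation(src, b):
    """Destination address for a b-bit source address under bit rotation.

    Bit i of the result is bit (i + 1) mod b of src, so the lowest source bit
    wraps around to become the highest destination bit.
    """
    if not 0 <= src < (1 << b):
        raise ValueError("source address does not fit in b bits")
    return (src >> 1) | ((src & 1) << (b - 1))


b = 4
for src in (0b0110, 0b0001, 0b1011):
    print(f"{src:04b} -> {bit_rotation(src, b):04b}")
# 0110 -> 0011
# 0001 -> 1000
# 1011 -> 1101
```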

3.6.2 Deadlock freedom

Deadlocks happen when several turning points wait for buffers of cyclic dependency. Deadlock-freedom is achieved with what was described in [17], which guarantees the deadlock-freedom by either limiting the routing to ensure cycle-freedom in the channel dependency graph [18] or utilizing virtual channels (VCs) to break such cycles into different sets of buffers [19].

The routing strategy used herein is similar to that introduced in [20, 21]. For minimal routing of a packet sent from router \(r_i\) to \(r_j\), we use a number of virtual channels equal to the diameter. If \(r_i\) and \(r_j\) are directly connected (i.e., one hop), the packet is routed using VC0. If the path between \(r_i\) and \(r_j\) consists of two hops, VC0 is used for the first hop and VC1 for the second hop. Consider, for example, a network of diameter two: only one turn can be taken on the path, and therefore the maximum number of required VCs is two [22]. That way, a flit depends only on the virtual channel one above its current virtual channel to make progress. Thus, no cyclic dependency can happen.

For adaptive routing, on the other hand, the number of virtual channels is equal to the maximum hop of paths in \(\widehat{{ R }'}\). For \(d=2\) systems, the maximum hop in \(\widehat{{ R }'}\) is 4, and for \(d\ge 3\) systems, the maximum hop in \(\widehat{{ R }'}\) is \(d+1\). BookSim gives an error when the number of VCs is insufficient. To generalize the algorithm above, we use a VCk (\(0\le k<n\)) on a hop k for an n-hop path between \(r_i\) and \(r_j\).
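
The virtual channel assignment described above can be summarized in a short Python sketch. The function names `vcs_required` and `vc_schedule` are hypothetical, and the sketch only restates the rules given in the text; it is not the BookSim code.

```python
def vcs_required(diameter, adaptive=False):
    """Number of virtual channels needed for a network of the given diameter."""
    if diameter < 1:
        raise ValueError("diameter must be at least 1")
    if not adaptive:
        return diameter              # minimal routing: one VC per hop
    if diameter < 2:
        raise ValueError("the adaptive rule is only stated for d >= 2")
    # Longest path stored in the routing table with sub-shortest paths
    return 4 if diameter == 2 else diameter + 1


def vc_schedule(path_hops, available_vcs):
    """VC index used on each hop of a path: hop k travels on VC k."""
    if path_hops > available_vcs:
        raise ValueError("not enough virtual channels for this path")
    return list(range(path_hops))


print(vcs_required(2), vcs_required(2, adaptive=True))  # 2 4
print(vc_schedule(3, vcs_required(3, adaptive=True)))   # [0, 1, 2]
```

Because the VC index increases strictly along a path, a flit never waits on a channel with a lower or equal index, which is what breaks the cyclic dependency.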

The general parameters in all simulations are the same as described in [17], which are:

figure a

Single-flit packets are used to avoid flow control issues as described in [17] and [22], where the virtual channel buffer size is set to 64 flit entries, and the number of virtual channels is set as described above depending on the network diameter.

We then collected latency and throughput benchmarks under various traffic patterns and injection rates until the values converged.

4 Cost and balance

4.1 The balance of K, p and a

The most intuitive interpretation of K/p is the ratio of ports used for inter-router connections to the number of ports used for endpoints on each router. It is easy to see that if K/p is lower, the interconnect’s price would be lower for a fixed number of endpoints.

In general, if budget is of principal consideration, Equality networks can be restricted to the range of \(p \cdot (d+1) \ge P > p \cdot (a+1)\); alternatively, \(pd \ge K > pa\). The K/p value can be relaxed if performance is of principal concern. The design can be fine-tuned depending on what applications will be run on the final system. Equality is probably the only topology that does not need to sacrifice any port to adjust this ratio. If the number of endpoints is reduced (reduced p), the empty ports can be used for larger K values.

On fat-tree systems, the value of K/pd is always one (and a is very close to d). For instance, a 3-tier fat-tree network has twice as many inter-router links as endpoint links, which on average gives each router four times as many inter-router ports as endpoint ports (each inter-router link consumes two ports). Coincidentally, the diameter (maximum inter-router hop count) of a 3 L fat-tree network is four. The same idea applies to 2 L and 4 L fat-trees.

The power model behaves similarly to the networking cost, because the ratio of the number of SerDes for the inter-router links to the number of SerDes for the endpoint links is the same as K/p.

We calculated the total networking cost of all Equality systems with a model similar to what is described in [17], where each of the cabinets is assumed to be 1 m × 1 m without aisle in the cluster (see Footnote 2). Each of the endpoints and routers is assumed to be 1U in size. The overhead cable pathways are 1 m above the cabinet. The endpoints and the routers are allocated sequentially on the cabinet until the cabinet is filled to 42U standard cabinet size. All routers are situated on the top of the cabinet. Depending on the remaining cabinet space, endpoints can be allocated in the same or adjacent cabinet to the router.

Manhattan distance is calculated for each cable to include the distance from each module to the overhead cable pathway, the space on the aisle, and an additional 0.5 m horizontal distance from the side of the cabinet to the port. Therefore, from this model, a cable to the adjacent module is 104.45 cm, i.e., 2\(\times\)0.5 m + 44.45 mm. Cables longer than 8 m are fiber, otherwise copper. The results are summarized in Sect. 5.2.

4.2 Cost per node

From the result of the cost model presented by Besta et al. (Fig. 11(c) of [17]) and Kim et al. (Fig. 19 of [23]), it is evident that all topologies follow an almost steady cost/endpoint ratio. While the copper cable reduces the networking cost in small systems, the effect is insignificant in larger systems. In practice, this ratio will still depend on the cable length; i.e., larger computing nodes consume larger space, leading to a higher proportion of inter-router connection cost. For any two non-torus networks with computing nodes of a given form factor (for instance, 0.5U or 2U per endpoint), regardless of the topology used, the copper cable prices of the two networks should be equal if the numbers of routers and servers are the same. The analysis is reported in Sect. 5.3.

4.3 Bisection ratio

By definition, the bisection bandwidth, B, is evaluated by cutting a minimal number of cables to separate the system into two parts.

Instead of discussing the bisection bandwidth, we introduce two variables named network bisection ratio (\(b_r\)) and topology bisection ratio (\(B_r\)) to take the networking cost into account. To clarify, the network bisection ratio is the bisection bandwidth divided by total network bandwidth (including all links to endpoints), i.e., \(b_r= B/\Phi\), whereas the topology bisection ratio, \(B_r\), is the bisection bandwidth divided by total inter-router bandwidth (excluding all links to endpoints). The total network bandwidth, \(\Phi\), for a direct network is the total number of cables in the system multiplied by the channel bandwidth \(\phi\), i.e., \(\Phi = N\phi (\frac{K}{2}+p)\).

For a fully connected 12-port 10 Gb/s Ethernet switch, the bisection bandwidth is 60 Gb/s, which is half the total cable number multiplied by the channel bandwidth. It is easy to see that the bisection ratio for this fully connected switch is half, i.e., \(b_r= 1/2\). For a two-tier fat-tree, this value becomes 1/4, as the number of cables in the network doubles, whereas for a three-tier fat-tree, the value of \(b_r\) is 1/6. The lower the bisection ratio, the higher the cost of networking hardware, because a higher fraction of the networking cost is attributed to the inter-router links.

For symmetric chordal ring networks like the Equality networks, the bisection bandwidth is equal to the minimum number of links that are cut if one splits the network into two semicircles. During the generation of new networks, the topology bisection ratio \(B_r\) is evaluated by counting the minimal number of links being cut when splitting the network into two equivalent halves (see Footnote 3).

The relation of \(B_r\) and network bisection ratio \(b_r\) is shown in Eq. 3.

$$\begin{aligned} b_r = \frac{B}{\Phi } = \frac{\phi B_r \frac{NK}{2}}{\frac{N\phi (K+2p)}{2}} = \frac{B_r K}{K+2p} = B_r \frac{\frac{K}{p}}{\frac{K}{p}+2} \end{aligned}$$
(3)
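
Equation 3 and the definition of \(\Phi\) can be checked numerically with the Python sketch below. The function names are hypothetical; the fat-tree, Slim Fly, and Dragonfly parameters are those quoted in the caption of Fig. 4, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def total_bandwidth(n, k, p, phi):
    """Phi = N * phi * (K/2 + p): all cables times the channel bandwidth."""
    return n * phi * (Fraction(k, 2) + p)


def network_bisection_ratio(topology_ratio, k, p):
    """Eq. 3: b_r = B_r * K / (K + 2p)."""
    return topology_ratio * Fraction(k) / (k + 2 * p)


# Fat-trees expressed through their effective K/p (p normalized to 1)
print(network_bisection_ratio(Fraction(1, 4), 4, 1))  # 1/6  (FT-3 L)
print(network_bisection_ratio(Fraction(1, 6), 6, 1))  # 1/8  (FT-4 L)

# Total network bandwidth in Gb/s with phi = 10 Gb/s
print(total_bandwidth(1058, 34, 18, 10))  # 370300 (Slim Fly)
print(total_bandwidth(2064, 23, 8, 10))   # 402480 (Dragonfly)
```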

The data are collected and discussed in Sect. 5.4.

5 Results

We have selected a few publications as targets for comparison purposes. To make a sensible comparison, most networks utilize switches with the port number available in the market.

5.1 Targeted systems

Table 2 Latency (lat@0.9) of uniform traffic under injection rate 0.9 flit/cycle using adaptive_min routing for networks demonstrated for a set of selected Equality networks. The detailed selected sets of links, \(S_\text {A}\) and \(S_\text {B}\), are not included in this table for brevity. All throughput values of Equality networks in this table are to 0.9 flit/cycle. 3FT44 is the half-span fat-tree system simulated in [17] using 44-port switches. For the three fat-tree systems (3FT36, 3FT48, 3FT80), we assume the simulated results from BookSim also apply to full-span fat-tree systems. Equality networks with ID marked as boldface in the table are systems mentioned in the main text

Equality can be used to achieve reasonably good ratios of the Moore bounds. We include a column in Table 2 that gives the ratio of the network size to the Moore bound. We obtained networks reaching ratios of 39% of the diameter-2 Moore bound, 9.66% of the diameter-3 Moore bound, 1.64% of the diameter-4 Moore bound, and 0.18% of the diameter-5 Moore bound.
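
For reference, the Moore bound for network radix K and diameter d is \(1 + K\sum _{i=0}^{d-1}(K-1)^i\). The ratio can be computed with the short Python sketch below; the function names are hypothetical, and the small network of Fig. 1 is used only as an illustration.

```python
def moore_bound(k, d):
    """Largest possible number of routers for network radix k and diameter d."""
    return 1 + k * sum((k - 1) ** i for i in range(d))


def moore_ratio(n, k, d):
    """Fraction of the Moore bound reached by a network of n routers."""
    return n / moore_bound(k, d)


print(moore_bound(6, 2))                # 37
print(round(moore_ratio(14, 6, 2), 3))  # 0.378
```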

We simulated networks of scales from small to 1,024,000 endpoints in this work with various router counts and radices. The listed systems, named according to the radix of the routers (i.e., E361 represents the first system using 36-port routers), are selected networks for each configuration. The listed networks are included in Appendix 1.

We have chosen carefully by optimizing the configuration as discussed in Sect. 3.3 and hand-picked one network with better performance in all traffic modes for each configuration. Some of the networks in the table are designed for exascale systems. The values of the torus, 3-tier fat-tree, Dragonfly, and Slim Fly networks are there for reference.

A Slim Fly network denoted by SF14 is compared with three diameter-3 Equality networks (E441, E442, and E443) in the table to show that, with a bit of relaxation in router number, Equality allows better throughput using an identical hardware specification. In this comparison, the latency of Slim Fly under injection rate 0.9 flit/cycle is over 50 cycles, whereas all three Equality systems are under 50 cycles.

Fig. 2
a Various sizes of systems under consideration. The palette on the right-hand side denotes the network diameter of the system. b Zoom in the range where the router number is less than 10,000. c Latency against K/pa under injection rate 0.9 flit/cycle. Fitting function \(f(x) = ax + b\), where \(a=-20.3211\) (asymptotic standard error ±2.736(13.46%)) and \(b=53.255\) (asymptotic standard error ±3.577(6.717%)) fits against all Equality networks in Table 2. The Slim Fly results are shown as inverted triangles, whereas the fat-tree results are shown as diamonds. For all systems listed in the above sub-figures, the numbers next to the symbol represent the identification of the system in their respective router radix category. For instance, a triangle with six next to it represents the sixth system using 80-port routers; therefore, the system ID for this system is E806, as listed in Table 2

Figures 2a and 2b show the comparison of scales (number of routers and endpoints) of the listed systems in the table. It can be seen that the maximum sizes that 3-tier fat-trees can achieve are far smaller than those of Equality.

5.2 The balance of K, p and a

Figure 2(c) shows that the latency at the packet injection rate of 0.9 flit/cycle is negatively correlated with K/pa under the amin routing algorithm. The value K/pa stands for the switch’s balance of upward and downward flow ratios. The higher the a value, the more often a message has to consult the switches. By adjusting the weight of K/p, one can counterbalance the weight of a.

5.3 Cost per node

Fig. 3

The line \(f(x)=644.5x-333.1\) is drawn to fit cost-per-node against K/p for networks with Slim Fly, Dragonfly, FBF-3, FT-3L, and DLN. The tori follow another trend as only copper cables are used. Both trends are considered to depend on K/p. The cost-per-node values of these topologies are only a rough estimation for systems under 40K endpoints, whereas the Equality systems are plotted as circles of different sizes depending on the number of endpoints. The largest circle is the E806 system with 1M endpoints. While other topologies use different symbols of point size 1 for clarity, the point sizes of the Equality system circles are scaled by \(\sqrt{n/40000}\)

Figure 3 shows that, with K/p as the variable, the cost-per-endpoint values for all topologies (shown here are Slim Fly, Dragonfly, FBF-3, FT-3L, and DLN) follow the same trend if fiber cables are used for longer links. On the other hand, if only copper cables are used, the cost behaves like that of the tori. Equality systems are shown as circles, which follow the same trend as the fitted line, except that the system sizes are much larger. A higher number of endpoints slightly increases the average networking cost per node, but the effect is not as significant as that of the ratio K/p. The cluster model can be adjusted to include hot/cold aisles and different rack sizes, but this is outside the scope of the current study.
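To see how strongly the fitted line in Fig. 3 depends on K/p, it can be evaluated for a few configurations. This is a rough sketch: the line was fitted to the other topologies, so applying it to Equality configurations is an extrapolation, and taking K and p from the k and p fields of the configuration labels (e.g., n2048k28p8) is an assumption. The cost unit is whatever Fig. 3 uses and is not restated here.

```python
# Coefficients of the line f(x) = 644.5*x - 333.1 from the Fig. 3 caption.
COST_SLOPE = 644.5
COST_INTERCEPT = -333.1


def fitted_cost_per_node(k_over_p: float) -> float:
    """Cost per node given by the fitted line at x = K/p."""
    return COST_SLOPE * k_over_p + COST_INTERCEPT


# (K, p) pairs read from the configuration labels quoted in the text;
# interpreting the 'k' field as K is an assumption.
CONFIGS = {
    "E361 (n2048k28p8)": (28, 8),
    "E481 (n4800k38p10)": (38, 10),
    "E807 (n20000k64p16)": (64, 16),
}

if __name__ == "__main__":
    for name, (K, p) in CONFIGS.items():
        x = K / p
        print(f"{name}: K/p={x:.2f}, fitted cost per node={fitted_cost_per_node(x):.1f}")
```

Because the line is linear in K/p, moving from K/p = 3.5 to K/p = 4 raises the fitted value by about 322 cost units, independent of the system size.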

5.4 Bisection ratio

Fig. 4

The relation between \(b_r\) and \(B_r\) for all Equality systems listed in Table 2. The color of each circle follows the palette of K/p. A higher K/p gives a higher \(b_r\) for a given \(B_r\) value. The circle sizes of Equality systems are scaled by \(\sqrt{n/3000}\). The FT-2L diamond falls on the line of \(K/p = 2\), as do the Slim Fly \(d=2\) systems. The FT-3L has effective \(K/p = 4\), \(B_r = 0.25\), and \(b_r = 1/6\). The FT-4L has effective \(K/p = 6\), \(B_r = 1/6\), and \(b_r = 0.125\). The Slim Fly point takes the value \(b_r = B/\Phi\), where \(\phi = 10\) [Gb/s], \(B=60736\) [Gb/s] (approx., from Fig. 5(c) of [17]), and \(\Phi =N\phi (K/2+p)=370300\) [Gb/s] (\(K=34\), \(p=18\), and \(N=1058\)). The Dragonfly point takes the values \(B=41666\) [Gb/s] (approx.) and \(\Phi =N\phi (K/2+p)=402480\) [Gb/s], where \(N=2064\), \(K=23\), and \(p=8\). The 3D-torus point takes the value \(B=12654\) (approx.), where \(N=15625\) (i.e., \(25^3\)), \(K=6\), and \(p=1\). The 5D-torus point takes the value \(B=48006\) (approx.), where \(N=16807\) (i.e., \(7^5\)), \(K=10\), and \(p=1\)

Figure 4 shows the distribution of the Equality networks compared with 3D-torus, 5D-torus, Slim Fly, Dragonfly, and fat-tree networks. The location of a network in this graph depends on its K/p and \(b_r\) values. Since \(b_r/B_r\) is a function of K/p, the Slim Fly network sits close to the line of \(K/p=2\), and two of the Equality networks (E369 and E487) fall on that line (near FT-2L). The distribution of \(\mathbf {S_A}\) and \(\mathbf {S_B}\) defines \(B_r\) in Equality networks. The fat-tree networks have good \(b_r\) with two tiers, but \(b_r\) degrades as the number of layers increases. The \(b_r\) values of the tori also explain why uniform traffic is the nightmare of torus networks.

The introduction of the bisection ratios \(b_r\) and \(B_r\) gives a new viewpoint on the bisection bandwidth of a network. The question becomes "Given the constraint of the networking budget, what percentage of the budget contributes to the bisection bandwidth?" instead of "How large is the bisection bandwidth of the network?" The point is not to obtain a large bisection ratio, as one needs a proper balance between communicating with nearby and remote routers. Our experiments found that \(B_r\) around 0.4 to 0.5 and \(b_r\) around 0.3 are suitable for most traffic models, where the global and local traffic ratios are in balance.
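The reference points in Fig. 4 can be reproduced from the numbers in its caption. The sketch below evaluates \(\Phi = N\phi (K/2+p)\) and \(b_r = B/\Phi\) for the four reference networks. Two parts are assumptions rather than statements from the caption: applying the same \(\Phi\) formula with \(\phi = 10\) Gb/s to the tori, and the relation \(b_r/B_r = (K/p)/(K/p+2)\), which is inferred here because it reproduces the FT-3L and FT-4L values quoted in the caption. All function names are hypothetical.

```python
def port_bandwidth_budget(N: int, K: float, p: float, phi: float = 10.0) -> float:
    """Phi = N * phi * (K/2 + p) in Gb/s, as written in the Fig. 4 caption."""
    return N * phi * (K / 2 + p)


def bisection_ratio(B: float, N: int, K: float, p: float, phi: float = 10.0) -> float:
    """b_r = B / Phi."""
    return B / port_bandwidth_budget(N, K, p, phi)


def br_over_Br(k_over_p: float) -> float:
    """Assumed relation b_r/B_r = (K/p)/(K/p + 2), inferred from the fat-tree values."""
    return k_over_p / (k_over_p + 2)


# B, N, K, p as quoted in the caption (B values are approximate there).
NETWORKS = {
    "Slim Fly": dict(B=60736, N=1058, K=34, p=18),
    "Dragonfly": dict(B=41666, N=2064, K=23, p=8),
    "3D-torus": dict(B=12654, N=15625, K=6, p=1),   # same Phi formula assumed
    "5D-torus": dict(B=48006, N=16807, K=10, p=1),  # same Phi formula assumed
}

if __name__ == "__main__":
    for name, v in NETWORKS.items():
        phi_total = port_bandwidth_budget(v["N"], v["K"], v["p"])
        print(f"{name}: Phi={phi_total:.0f} Gb/s, b_r={bisection_ratio(**v):.4f}")
    # Consistency check against the fat-tree values in the caption.
    print("FT-3L b_r:", 0.25 * br_over_Br(4))     # caption: 1/6
    print("FT-4L b_r:", (1 / 6) * br_over_Br(6))  # caption: 0.125
```

Under these assumptions the tori come out with \(b_r\) of a few percent, well below the Slim Fly and Dragonfly points, which is in line with the remark about uniform traffic on torus networks.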

5.5 Individual scenarios

Fig. 5

Throughput comparison chart for networks of 16,384 endpoints. The columns in the gray box show three routing algorithms (amin, ugal, and bgal) running on the target Equality E361 system with the configuration n2048k28p8 using 36-port switches. The data in the right panel are the best results of the respective networks from [7]

Fig. 6

Comparison of throughput and latency of the Equality E481 system with the configuration n4800k38p10 using 48-port switches in 10 traffic modes. In total, there are 48,000 endpoints in this system. The transpose traffic uses 16,384 endpoints for calculation, whereas the other bit permutation traffic patterns use 32,768 endpoints

Fig. 7

Comparison of throughput and latency of the Equality E807 system with the configuration n20000k64p16 using 80-port switches in 10 traffic modes. In total, there are 320,000 endpoints in this system. All bit permutation traffic patterns use 262,144 endpoints for calculation

The 16,384-endpoint scenario is prepared for comparison with [7]. Figure 5 shows the throughput of the E361 network alongside the best results reported in [7]. For the Equality network, three routing algorithms (amin, ugal, and bgal) are shown inside the gray box on the left-hand side. The other seven blocks (FlatFly, SFxFF-1, SFxFF-2, Dragonfly, Hypercube, 4D-torus, and 3D-torus) are the best values taken directly from that paper. In almost all traffic modes, the E361 network with the three routing algorithms, using the same number of switches of the same radix, performs better than all networks presented in [7].

Figure 5 also shows that the ugal algorithm performs akin to FlatFly in the bit complement traffic model, where all other topologies fail. Although fat-tree performs well in permutation traffic, we do not include it in this comparison because the largest 3-layer fat-tree that can be built with 36-port routers has only 11,664 endpoints.
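The size limit quoted for the fat-tree follows from the standard capacity of a full-bisection fat-tree with \(L\) layers of radix-\(k\) switches, \(2(k/2)^L\) endpoints. The sketch below checks that number and also derives endpoint counts from the Equality configuration labels. Reading a label such as n2048k28p8 as n routers, k network ports, and p endpoints per router is an assumption that matches the endpoint counts and switch radices quoted in the captions; the helper names are hypothetical.

```python
import re


def fat_tree_max_endpoints(radix: int, layers: int) -> int:
    """Endpoints of a full-bisection fat-tree: 2 * (radix/2)**layers."""
    return 2 * (radix // 2) ** layers


def parse_config(label: str) -> tuple[int, int, int]:
    """Split a label like 'n2048k28p8' into (n, k, p); the format is assumed."""
    m = re.fullmatch(r"n(\d+)k(\d+)p(\d+)", label.lower())
    if m is None:
        raise ValueError(f"unrecognized configuration label: {label!r}")
    n, k, p = (int(g) for g in m.groups())
    return n, k, p


def equality_endpoints(label: str) -> int:
    """Total endpoints = routers * endpoints per router (assumed reading)."""
    n, _, p = parse_config(label)
    return n * p


def switch_radix(label: str) -> int:
    """Ports per router = network ports + endpoint ports (assumed reading)."""
    _, k, p = parse_config(label)
    return k + p


if __name__ == "__main__":
    print("3-layer fat-tree, 36-port:", fat_tree_max_endpoints(36, 3))
    for label in ["n2048k28p8", "n4800k38p10", "n20000k64p16", "n64000k64p16"]:
        print(label, equality_endpoints(label), "endpoints,",
              switch_radix(label), "ports per router")
```

With 36-port routers the 3-layer fat-tree tops out at 11,664 endpoints, below the 16,384 endpoints of the E361 configuration built from routers of the same radix.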

The 48,000-endpoint scenario is prepared for the situation where the InfiniBand LID limit is of major concern. Detailed simulations of ten traffic models are presented in graphs containing both latency and throughput results.

All the injection processes are simulated at various injection rates to cover the spectrum of latency values under 50 cycles. We focus on latencies lower than 50 cycles as they reflect the usable range of the system that is reliable under the given injection process. Figure 6 presents all ten traffic processes.

The 1E scenario features systems for future applications with multi-exaflops peak performance, depending on the computing power of each endpoint. Figure 7 shows the ten traffic models. The E807 network tested here has the configuration n20000k64p16. The 48,000-endpoint and 1E networks, which have close K/pa values, are included to show the scalability of Equality networks. Despite a significant increase in the system size, the performance is consistent across the two networks.

The \(10^6\)-endpoint system is simulated at a single point, an injection rate of 0.9 flit/cycle, to reach a million endpoints. We only show the throughput and latency of the amin routing algorithm at this injection rate for the E806 system, which has the configuration n64000k64p16 in Table 2 and 1,024,000 endpoints. Running this million-endpoint system in the BookSim package used up to 272 GB of memory on a single core in bitcomp traffic. As the only large-memory resource on our site is an IBM box, the calculation took around 20 days to complete. At this point, the BookSim package fails with an integer error, presumably due to the range limitation of the integer type of the _cur_pid variable, but all simulations are reasonably converged before the job ends.
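A back-of-the-envelope estimate illustrates why an integer range limit becomes relevant at this scale. The sketch below assumes, purely for illustration, that the packet identifier is a signed 32-bit counter incremented once per generated packet and that each packet consists of a single flit; neither assumption is confirmed by the text, and the function name is hypothetical.

```python
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer


def cycles_until_id_overflow(endpoints: int,
                             injection_rate: float,
                             flits_per_packet: int = 1,
                             max_id: int = INT32_MAX) -> float:
    """Cycles until a per-packet counter exceeds max_id.

    Assumes every endpoint injects `injection_rate` flits per cycle and
    that one identifier is consumed per packet (both are assumptions).
    """
    packets_per_cycle = endpoints * injection_rate / flits_per_packet
    return max_id / packets_per_cycle


if __name__ == "__main__":
    cycles = cycles_until_id_overflow(endpoints=1_024_000, injection_rate=0.9)
    print(f"about {cycles:.0f} cycles until a 32-bit packet counter is exhausted")
```

Under these assumptions the counter would be exhausted after a few thousand cycles, and longer packets would extend that proportionally.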

All of the above systems show the flexibility and performance of the Equality networks.

6 Conclusion and future work

This paper shows that the low-memory-footprint routing logic of Equality networks is naturally suited to routing algorithms that use minimal adaptive and sub-shortest paths. With the topology setting and routing logic implemented in BookSim, we demonstrate simulation results from small to large scales. We also perform the first million-node cycle-accurate calculations with the BookSim package.

Many have advocated [17, 24, 25, 26] the use of low-diameter networks to realize enormous networks with high-radix routers. Table 2 shows that excellent performance persists in Equality networks with reasonably low diameters and ordinary router radix under high injection rates. Conversely, extremely low-diameter topologies usually involve networks that are not flexible or that need very high-radix routers. For the 1M-endpoint network, the current work provides a solution that builds a network of 64,000 80-port routers (shown in Table 2) as a replacement for the network solution of 53,138 routers, each with 264 ports, reported in [27]. Moreover, the resulting network performance is plausible. We have shown that Equality networks have similar or better performance compared with many other topologies, including tori, Dragonfly, Slim Fly, Flat Fly, and fat-tree.

Equality does not handle permutation traffic (e.g., bit complement and bit shuffle) quite as well as fat-tree. Still, it generally trades zero-load latency for throughput while keeping moderate to high usability.

Equality networks can be used with the low- or high-radix routers [28] available on the market. The results of this paper also show excellent performance of Equality networks under uniform random traffic even with an adaptive minimal routing algorithm, which suggests good performance using commodity hardware for general-purpose clusters. We have also implemented routing for a small Equality network on 12 HPE 5945 48SFP28 8QSFP28 switches for real applications. The routing of this small cluster is accomplished with multi-protocol label switching (MPLS), where all shortest paths and some sub-shortest paths are assigned between every pair of routers. On InfiniBand, we expect the network to work well with the Nue routing introduced by Domke et al. [29].

The adjustability and outstanding performance of Equality networks give the industry a new network topology option for HPC systems. The system designers will have more flexibility in picking commodity hardware. It is also an opportunity for data centers to reorganize for better efficiency.

In the future, we plan to run simulations using ROSS-CODES and TraceR as described in [30] to reflect the effects of real application traffic, especially the subtle effects that arise when many jobs run on the cluster [31].

To summarize, the current work reiterates the construction of networks based on a novel class of network topology, allowing routing with simple logic to achieve strong scalability from small to large systems. The work shows the performance of networks by comparing the cycle-accurate BookSim benchmarks against many previous works. The results show significant benefits of utilizing this new class of network topology for future high-performance computing applications.