Disruption-Free Load Balancing for Aerial Access Network

A fundamental issue of 6G networks with aerial access networks (AAN) as a core component is that user devices will send high-volume traffic via AAN to backend servers. As such, it is critical to load balance such traffic so that it does not cause network congestion or disruption and affect users’ experience in 6G networks. Motivated by the success of software-defined networking-based load balancing, this paper proposes a novel system called Tigris to load balance high-volume AAN traffic in 6G networks. Different from existing load balancing solutions in traditional networks, Tigris tackles the fundamental disruption-resistant challenge in 6G networks of avoiding disruption of continuing flows, and the control-path update challenge posed by the limited throughput for updating load balancing instructions. Tigris achieves disruption-free and low-control-path-cost load balancing for AAN traffic by developing an online algorithm to compute disruption-resistant, per-flow load balancing policies and a novel bottom-up algorithm to compile the per-flow policies into a highly compact rule set, which remains disruption-resistant and has a low control-path cost. We use extensive evaluation to demonstrate the efficiency and efficacy of Tigris in achieving zero disruption of continuing AAN flows and an extremely low control-path update overhead, while existing load balancing techniques in traditional networks such as ECMP cause high load variance and disrupt almost 100% of continuing AAN flows.


Introduction
The next-generation cellular networks (i.e., 6G networks) can interconnect devices in space, air, and ground networks with the help of aerial access networks (AAN) to provide users Internet access with substantially higher coverage than traditional network technologies (e.g., 5G networks), and hence have drawn much attention from both academia and industry [1][2][3][4][5]. Figure 1 gives an abstract architecture of 6G networks with AAN as a key component. After receiving data traffic from end devices, AAN forwards the traffic to edge servers, which process and forward it to backend servers.
A fundamental issue of this 6G architecture is that, with higher coverage of user devices and more ultra-high-bandwidth, ultra-low-latency applications such as VR/AR and remote medicine, AAN needs to deliver high-volume data traffic comprising a large number of flows from user devices to the corresponding backend servers via edge and backbone networks. As such, it is critical to load balance such traffic before it enters the backbone networks, so that it does not cause any congestion or network disruption in 6G networks.
Motivated by the recent success of flexible load balancing using software-defined networking (SDN), e.g., [6][7][8][9][10][11][12][13][14], in this paper we investigate the feasibility and benefits of load balancing AAN traffic in 6G networks using a software-defined edge (SDE), which we term SDE-LB. Specifically, continuing the trend of moving from dedicated load balancing appliances to native network switches for load balancing (e.g., [6-8, 12, 15-19]), SDE-LB takes advantage of the flexibility of a logically centralized controller to collect load statistics from both the network and the servers, compute load balancing flow rules, and install them on programmable switches (also called load balancing switches), whose TCAM flow tables allow high-speed packet processing [20][21][22] to forward packets to different backend servers based on the matching results [12,16,18,23,24].
Despite the potential of SDE-LB, key challenges remain. In particular, we found that directly applying existing SDN load balancing solutions to SDE-LB encounters substantial challenges under dynamicity, which can arise either on the data path or on the control path.
The key dynamicity challenge on the data path is the disruption-free challenge. Specifically, as the load of AAN from the flows matched by each flow rule changes over time, imbalance occurs and rebalancing is needed. Although one may apply previous studies [16,18,25] to conduct rebalancing, since they do not consider the existing assignments of continuing flows from AAN, i.e., flows with open TCP connections, they can result in unacceptable disruptions of continuing flows. Although one may use migration to reduce the impact of disruption, migration is considered to add substantial system complexity and hence is not preferred by many operators. Another possibility is to install exact-match rules to pin the assignment of continuing flows, but flow table size constraints make this approach infeasible. For example, one commodity edge server equipped with a Dell Z9100 switch has only 2304 flow entries [26], while the number of data flows with different source IP addresses is much larger. Utilizing TCAM wildcard rules to aggregate flow rules may reduce the number of rules (e.g., [16, 18, 21, 24, 27-29]) but would again result in massive disruption of continuing flows.
The key dynamicity challenge on the control path is the control-path update challenge. Specifically, rebalancing AAN traffic load among servers requires an SDE-LB controller to send flow-mod instructions to update flow tables on the load balancing switches. Unfortunately, state-of-the-art SDN controllers have limited throughput in sending flow-mod instructions, putting a limit on control-path update frequencies.
In this paper, we cope with both dynamicity challenges by proposing Tigris, the first disruption-resistant, low-control-path-cost, dynamic SDE load balancer for AAN traffic in 6G networks. Tigris provides two key novel insights for addressing the dynamicity challenges: (1) it shifts a small number of incoming flows among servers to achieve load balancing without disrupting continuing flows, and (2) it leverages the small number of shifted flows, the decomposition of the aggregation of a large rule set into parallel aggregation of multiple smaller rule sets, and the cached intermediate rule aggregation results from previous time slots to substantially improve the efficiency of flow rule aggregation and reduce the control-path update cost. This work sheds light on future research on system and protocol design in AAN and 6G networks, such as traffic engineering, resource orchestration, and network-application integration [30].
The main contributions of this paper are as follows.
(i) We study a fundamental problem for 6G networks with AAN, the disruption-free load balancing problem for AAN traffic, identify the disruption-resistant and the control-path-cost challenges, and design a novel dynamic load balancer at the edge called Tigris to address these challenges. (ii) We design DR-LB, an online algorithm, as the first phase of Tigris to compute per-flow disruption-resistant load balancing policies and prove that DR-LB achieves a competitive ratio of (2 − 1/N) for load balancing, where N is the number of backend servers. (iii) We develop Tree-Agg, an incremental, bottom-up rule aggregation algorithm, as the second phase of Tigris to compile per-flow load balancing policies into a highly compact, disruption-resistant rule set that also has a low control-path update cost. (iv) We conduct extensive evaluations to show that Tigris achieves zero disruption of continuing flows in AAN of 6G networks, a close-to-1 competitive ratio on load balancing, a less than 5% rule update ratio between time slots, and up to 53x flow rule compression.

The remainder of this paper is organized as follows. We discuss related work on different load balancing approaches in Section 2. We present the system settings and problem definition in Section 3. In Section 4, we propose the Tigris load balancer, introduce its overall workflow and the details of its key components, and discuss its generality and overhead. We evaluate the performance of Tigris with extensive simulation in Section 5 and finally conclude our work in Section 6.

Related Work
Data traffic load balancing is a well-studied problem for which not only have many algorithms been designed and thoroughly analyzed [31][32][33] but also many systems have been developed to support scalable, reliable network services [6-8, 12, 15-19]. Modern networks increasingly rely on switches to implement load balancing solutions. In this section, we give a brief review of existing studies on load balancing and discuss the key limitations of these techniques for load balancing traffic from AAN.

Hash-Based Load Balancing.
Among various load balancing systems, hash-based solutions such as Equal-Cost Multipath (ECMP) [25] and Weighted-Cost Multipath (WCMP) [7] are the most widely used. ECMP-based systems [25] evenly partition the flow space into hash buckets and use a hardware switch to redirect incoming packets to different software load balancers based on the hash value of the packet header. WCMP-based systems [7] achieve an unevenly weighted partition of the flow space by repeating the same software load balancer as the next hop multiple times in an ECMP group. However, these hash-based solutions split the traffic based on the size of the flow space rather than the actual volume of flows, resulting in load imbalance due to the unequally distributed and dynamically changing traffic contributions from different flow spaces. Because ECMP hash functions are usually proprietary [18,25], users have limited customization capability in rebalancing to adapt to such dynamic flow statistics.
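The flow-space splitting that ECMP performs can be illustrated with a minimal Python sketch; the hash function and the 5-tuple layout below are illustrative assumptions, not the proprietary switch implementation:

```python
import hashlib

def ecmp_bucket(five_tuple, n_servers):
    """Map a flow's 5-tuple to one of n_servers equal-size hash buckets
    (illustrative hash; real switch hash functions are proprietary)."""
    key = "|".join(str(field) for field in five_tuple).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % n_servers

flow = ("10.0.0.1", 5432, "192.0.2.10", 80, "tcp")
server = ecmp_bucket(flow, 4)
assert 0 <= server < 4
# The same flow always lands in the same bucket, regardless of its volume --
# ECMP balances flow *space*, not flow *load*.
assert ecmp_bucket(flow, 4) == server
```

This determinism is exactly why an elephant flow can overload one bucket while others stay idle, as the paragraph above points out.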

SDN Load Balancing.
Compared with hash-based load balancing, SDN load balancing is a more powerful and flexible technique. As such, it is a more suitable solution for AAN load balancing, because balancing the dynamic traffic in AAN and 6G networks requires more flexible load balancing policies. SDN supports the programming of the flow rule table on switches using wildcards and hence enables a more flexible, fine-grained load balancing service [12,16,18,23,24]. The select group table defined in the OpenFlow specification [21] can be used for load sharing, but it requires the controller's guidance to respond to events not detectable by the switch, e.g., server failures. Solutions in [23,24] send the first packet of every flow back to the controller for calculating the flow rules, which adds extra delay for these packets. Benson et al. [12] integrate per-flow load balancing rule generation with WCMP to minimize traffic congestion. Kang et al. [18] and Wang et al. [16] study the TCAM size-constrained traffic splitting problem and design algorithms to generate efficient flow rules that partition the flow space into weighted subspaces for different servers. In addition, TCAM wildcards are also utilized to reduce the number of flow rules used for expressing network policies (e.g., TCAMRazor [27] and CacheFlow [28]), including load balancing policies.
One important limitation of applying existing SDN load balancing solutions to AAN traffic at an SDE of 6G networks is that they do not consider the key dynamic SDE load balancing challenges, i.e., the disruption-resistant challenge and the control-path update challenge, leading to unacceptable disruption of continuing flows and a high control-path update cost. Miao et al. [34] design a stateful data plane load balancer, but it requires the support of expensive, customized programmable hardware. In contrast, Tigris addresses the above challenges to provide the first disruption-resistant, low-control-path-cost SDE load balancer using commodity SDN switches.

Dynamic SDE Load Balancing: System Settings and Problem Definition
We consider a dynamic SDE-LB system as shown in Figure 2: it provides a service using multiple servers, indexed by i = 1, 2, ⋯, N. Clients access the service through a single public IP address, reachable via AAN. For any incoming packet, the load balancing switch at the SDE finds the matching flow rule based on the packet's source IP address, rewrites its destination IP address to that of the assigned server, and forwards it correspondingly to achieve load balancing. One may extend our system to multiple switches and to matching on other fields for more complex load balancing scenarios with better scalability and fault tolerance, but we focus on a single load balancing switch and source IPv4 address matching for clarity. We consider that our system operates in discrete time, divided into slots indexed by t = 1, 2, ⋯, T. Different time slots can have different durations to support a combination of periodic operations and event-driven operations, where events can include servers going up or down and bursts of data flows. At the beginning of each slot t, the SDE controller computes a set of load balancing flow rules, denoted R(t), and updates the flow table of the load balancing switch accordingly.
Given a flow rule r, it has three basic attributes: the set of source IP addresses it matches, the forwarding action, and the priority, denoted r.match, r.act, and r.pri, respectively. Given a packet, its source IP may be matched by multiple rules. In this case, the switch will rewrite its destination IP and forward it following the action of the rule with the highest priority, as specified in the OpenFlow specification [21]. We use an attribute named r.load to denote the total estimated load of the flows that followed the action of r. Our system allows the load metric to be a generic metric, considering factors such as data volume and CPU. The controller collects related data from servers and switches to compute the load metric.
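The highest-priority-wins matching semantics described above can be sketched in Python; the rule dictionary layout and the helper names (`ip_to_int`, `lookup`) are our own illustrative assumptions:

```python
def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to a 32-bit integer."""
    a, b, c, d = (int(x) for x in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def matches(rule, src_ip):
    """Prefix match: rule['match'] is (network_int, prefix_len)."""
    prefix, plen = rule["match"]
    shift = 32 - plen
    return (ip_to_int(src_ip) >> shift) == (prefix >> shift)

def lookup(rules, src_ip):
    """Among all matching rules, the highest-priority one wins,
    per the OpenFlow semantics described above."""
    hits = [r for r in rules if matches(r, src_ip)]
    return max(hits, key=lambda r: r["pri"]) if hits else None

# Priority = number of exact matching bits, as the paper later sets it.
rules = [
    {"match": (ip_to_int("10.0.0.0"), 8),  "pri": 8,  "act": "server_A"},
    {"match": (ip_to_int("10.0.0.0"), 24), "pri": 24, "act": "server_B"},
]
assert lookup(rules, "10.0.0.7")["act"] == "server_B"   # more specific rule wins
assert lookup(rules, "10.0.9.9")["act"] == "server_A"
```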
A set of flow rules R can be aggregated into a single rule r_a using wildcards. We call R the source rules of r_a, denoted r_a.src. We also define some parameters to assist rule aggregation. We use r.lev to record the aggregation level of rule r, where the first (32 − r.lev) matching bits of r must not contain any wildcard. As we will show in Section 4.3, r.lev can be increased even if r fails to be aggregated with another rule.
An efficient SDE-LB system must satisfy multiple constraints, which we separate into two categories: (1) traditional constraints (i.e., table size, full coverage, and zero ambiguity) and (2) dynamicity constraints (i.e., disruption resistance and low control-path update cost).
3.1. Traditional Constraints. The first traditional constraint is the table size constraint. Specifically, the set of flow rules installed on the load balancing switch in any time slot must not exceed its flow table capacity C, as expressed in Inequality (1):

|R(t)| ≤ C, ∀t. (1)

Secondly, for any incoming packet, there must be at least one flow rule r that matches its source IP address and does not have punt as its action. This is the full-coverage constraint. It ensures that no packet will be punted back to the controller, eliminating the switch-to-controller delay for all incoming data traffic, and is expressed as

∪_{r ∈ R(t), r.act ≠ punt} r.match = the whole source IP space, ∀t. (2)

Thirdly, for any incoming packet, if there exists more than one flow rule matching its source IP, only one rule's action will be taken. This is the zero-ambiguity constraint. We use r.dom to denote the set of source IP addresses that belong to r.match but will not follow r.act, and express this constraint as

∀r, r′ ∈ R(t), r ≠ r′: ¬((s ∈ r.match \ r.dom) ∧ (s ∈ r′.match \ r′.dom)) for any source IP s, ∀t, (3)

where ∧ is the logical conjunction.

Dynamicity Constraints.
Other than the traditional constraints, SDE-LB under dynamicity also introduces additional constraints on both the data path and the control path. For the data path, any flow whose TCP connection to a server is currently open should not be shifted to another server unless the current server fails. We call this the disruption-resistant constraint. Denoting a continuing flow as f_a and its server in time slot t as f_a.s(t), this constraint can be expressed as

f_a.s(t) = f_a.s(t − 1), for every continuing flow f_a, ∀t. (4)

For the control path, the number of rules updated (i.e., deleted or inserted) in each time slot should be within the capacity of the controller, which is called the control-path update constraint. Using D to denote this capacity, we have

|R(t) \ R(t − 1)| + |R(t − 1) \ R(t)| ≤ D, ∀t. (5)

Denoting the load of data flows forwarded to server i in time slot t as L_i(t) and the number of time slots that server i is working till time slot t as T_i(t), we formulate the disruption-resistant, low-control-path-cost, dynamic load balancing problem as

minimize max_i (1/T_i(T)) Σ_{t=1}^{T} L_i(t), (6)

subject to Constraints (1)-(5). We prove the NP-hardness of this problem via a transformation from the classic multicore balancing problem [31]. Note that in Equation (6), we use the time-averaged load balancing as the system objective. This is the typical objective of load balancing systems [31][32][33], and it reflects the requirement of dynamic SDE-LB, i.e., long-term, online load balancing. As we will discuss in Section 4.4, we design the Tigris system to be modular so that it provides users the flexibility to define different load balancing objectives and methods while maintaining the benefits of Tigris, e.g., low control-path cost and disruption resistance.
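Since the objective is only described in words here ("time-averaged load balancing"), the following Python sketch shows one plausible reading of it, under the assumption that it minimizes the maximum per-server time-averaged load:

```python
def time_averaged_objective(load, uptime):
    """
    Hypothetical rendering of the time-averaged objective: the maximum,
    over servers, of total load served divided by slots of availability.
    load[i][t]: load served by server i in slot t; uptime[i]: slots i was up.
    """
    return max(sum(load[i]) / uptime[i] for i in range(len(load)))

# Two servers over three slots, perfectly balanced on average
# even though individual slots are imbalanced.
assert time_averaged_objective([[3, 1, 2], [1, 3, 2]], [3, 3]) == 2.0
```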

Tigris: A Disruption-Resistant, Low-Control-Path-Cost, Dynamic SDE Load Balancer
We now fully specify Tigris, in which a controller computes disruption-resistant load balancing policies and generates compact flow rules with low control-path cost correspondingly. We will start with an overview of Tigris in Section 4.1. Then, we will give the design details of its key components: policy generation in Section 4.2 and rule aggregation in Section 4.3. We also discuss the generalization of Tigris in Section 4.4.

Overview of Tigris.
In practice, Tigris operates in both periodic mode and event-driven mode in response to events such as the up/down of servers and the burst of data flows.
For simplicity of presentation, we assume a periodic operation mode, where the controller executes Tigris at the beginning of every slot t. Figure 3 presents the workflow of Tigris. Specifically, each invocation of Tigris can be divided into two phases: (1) policy generation and (2) rule compilation.

Policy Generation.
During the policy generation phase, Tigris first collects load statistics and related flow information, e.g., TCP connection status, from each server i and estimates the target load of i, i.e., the estimated load of incoming flows that should be assigned to i in time slot t for load balancing purposes (Step 1). A flow is identified by a TCP/IP 5-tuple. Tigris then uses the DR-LB algorithm to compute a set of disruption-resistant load balancing policies for all servers available in time slot t and generates a set of per-flow rules R_pf(t) to express the computed load balancing policies (Step 2). In particular, DR-LB keeps the forwarding action of continuing flows on available servers and adopts an online, greedy approach to shift incoming flows from overloaded servers to underloaded servers based on the target load of each server and the similarity of source IPs between continuing flows and incoming flows. This design is disruption-resistant. It achieves load balancing by shifting a small number of flows among servers and increases the chance of reusing rule aggregation results from the last slot t − 1 during the rule aggregation phase, substantially increasing the aggregation efficiency and reducing the control-path update cost. We prove that it achieves a (2 − 1/N) competitive ratio for Equation (6) in certain cases.

Rule Compilation.
In the second phase, Tigris adopts a bottom-up rule aggregation algorithm, Tree-Agg, to iteratively aggregate the large per-flow rule set into a highly compact rule set R(t) along a binary IP tree (Step 3). Tree-Agg has three novel design points: (1) only traversing from the IP leaves of shifted flows to the root of the IP tree, (2) decomposing the aggregation of a large rule set into parallel aggregation of multiple smaller sets, and (3) using cached intermediate rule aggregation results from slot t − 1 during aggregation. These design points substantially increase the efficiency of rule aggregation and yield a compact rule set R(t) with high similarity to R(t − 1), which reduces the control-path cost for updating the data plane. After obtaining R(t), Tigris updates the data plane by deleting the set of obsolete rules from the switch and installing the set of new rules (Step 4).

Addressing the SDE-LB Constraints.
We show in the next subsections that the compact rule set R(t) expresses the same policies as the per-flow rule set R_pf(t) and satisfies the disruption-resistant constraint, the full-coverage constraint, and the zero-ambiguity constraint. Because it is NP-hard to decide whether a rule set can be aggregated into a smaller set of a given size [35], Tigris uses a software switch as a safety measure to install extra rules if the compact rule set still exceeds the table size of the hardware switch. In this way, when packets arrive, the hardware switch first tries to find a matching rule. If no matching rule is found, the packet is forwarded to the software switch (e.g., Open vSwitch) for matching and processing. We show through extensive evaluation in Section 5 that this measure is rarely needed in practice, since the aggregated rule set computed by Tigris is highly compact and outperforms the state-of-the-art rule aggregation solution, i.e., TCAMRazor. Furthermore, we also show that the compact rule set computed by Tigris has an extremely low control-path update cost.

Online, Disruption-Resistant Policy Generation.
We now give the details of the policy generation phase of Tigris. It involves two steps: statistics collection and target load adjustment and online, disruption-resistant load balancing, marked as Step 1 and Step 2 in Figure 3, respectively.

Figure 3: Workflow of Tigris.

Wireless Communications and Mobile Computing
Step 1 (statistics collection and target load adjustment). At the beginning of time slot t, Tigris first collects L_i(t − 1), the actual load of each server i in slot t − 1, and related information about its flows, e.g., TCP connection status. In practice, such statistics can be retrieved from the logs or the monitor process of the servers. It then estimates the total incoming load for all servers in slot t as L_total(t), the load of continuing flows (i.e., flows with open TCP connections) at each server i in slot t as L_i^a(t), and the load of incoming flows at each server i in slot t as L_i^in(t) when the load balancing policy stays the same in t as in t − 1, using methods adopted in [21,28]. Next, Tigris calculates a key metric L_av, the "target share" of each available server in each slot till the current time slot t:

L_av = (Σ_{k=1}^{t} L_total(k)) / (Σ_{i=1}^{N} T_i(t)),

where T_i(t) is the number of time slots server i is available among all t slots. Note that one can extend it to the case where different servers have different capacities.
With L_av, Tigris then computes the target load assigned to each individual server i that is available in slot t, denoted L_i(t):

L_i(t) = L_av · T_i(t) − Σ_{k=1}^{t−1} L_i(k) − L_i^a(t).

Specifically, according to L_av, i should serve L_av · T_i(t) in all slots until t. Since it has already served Σ_{k=1}^{t−1} L_i(k) and will serve L_i^a(t) in slot t for continuing flows, these "credits" are deducted. Note that the way Tigris computes L_i(t) depends on the load balancing objective; this paper focuses on Equation (6).
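The target-load bookkeeping described above can be sketched as follows; the function and parameter names are illustrative:

```python
def target_load(L_av, T_i, past_loads_i, L_cont_i):
    """
    Target load of server i in slot t, as described in Step 1: its fair
    share L_av over the T_i slots it has been available, minus the load it
    already served in earlier slots and the load its continuing flows will
    contribute in slot t.
    """
    return L_av * T_i - sum(past_loads_i) - L_cont_i

# Fair share 10/slot over 3 slots; served 8 and 9 in earlier slots;
# continuing flows pin 4 units of load in the current slot.
assert target_load(10, 3, [8, 9], 4) == 9
```

A server that has already served more than its share (or has heavy continuing-flow load) gets a small or even negative target, which is exactly what makes DR-LB classify it as overloaded in Step 2.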
Step 2 (online, disruption-resistant load balancing, DR-LB). With the target load L_i(t) for every server i, we design the DR-LB algorithm, summarized in Algorithm 1, to compute a set of disruption-resistant, per-flow load balancing policies and the corresponding per-flow rule set R_pf(t).

Basic Idea.
We design DR-LB as an online algorithm that achieves load balancing by shifting a small number of incoming flows among servers. This design ensures disruption resistance for continuing flows. It also increases the chance of reusing cached aggregation results from the last time slot t − 1 during the rule aggregation phase (Section 4.3), substantially increasing the aggregation efficiency and reducing the control-path cost of updating the flow table on the switch.
At system initialization, i.e., t = 0, DR-LB sets the load balancing policies to evenly divide flows from the whole source IP space among all N servers and generates a per-flow rule set R_pf(0) (Line 1). At each time slot t = 1, 2, ⋯, it takes the collected load statistics from Step 1 and R_pf(t − 1), the per-flow rule set for slot t − 1, as input (Line 3). It then iteratively shifts flows from overloaded servers to underloaded servers to achieve the target load for each server (Lines 4-18). During this process, it replaces the per-flow rule for every shifted flow with a new per-flow rule, eventually obtaining the new per-flow rule set R_pf(t) (Lines 19-23).

Categorizing Overloaded and Underloaded Servers.
DR-LB first categorizes servers into overloaded and underloaded based on their target load (Lines 4-11). An available server i at time slot t is overloaded (underloaded) if its target load L_i(t) is smaller (larger) than the estimated load of its incoming flows when the load balancing policies stay the same in slot t as in t − 1 (Lines 4-8). For any server i that is unavailable in slot t, we set its target load to 0 and treat all the flows destined to i as incoming flows (Lines 9-10). Hence, every unavailable server is an overloaded server (Line 11).

Shifting Incoming Flows from Overloaded Servers to Underloaded Servers.
Next, DR-LB adopts a greedy approach to balance the load among servers. For every overloaded server i, it iteratively finds the incoming flow f* with the largest estimated load and shifts it to the underloaded server j that is the assigned server of the flow with the longest IP prefix match with f*.srcIP (Lines 14-18). This process stops when i is no longer overloaded or there is no underloaded server (Line 13). The rationale of this approach is that (1) it achieves load balancing among servers by shifting a small number of flows and (2) it increases the chance of rule aggregation for the flow rules of shifted data flows. For instance, suppose f* has the source IP 10.0.0.0 and was forwarded to an overloaded server A, and there exists a flow f′ whose source IP is 10.0.0.1 and was forwarded to an underloaded server B. DR-LB will make the load balancing decision to forward f* to B and generate a flow rule 10.0.0.0 → B to express this decision. In this way, the rules for f* and f′ can be aggregated into a single rule 10.0.0.0000000* → B (i.e., 10.0.0.0/31 → B) during the rule aggregation process.
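The longest-prefix-match heuristic for choosing the destination of a shifted flow can be sketched in Python; the helper names and the per-server flow dictionary are illustrative, and IPs are handled as 32-bit integers:

```python
def common_prefix_len(a, b):
    """Length of the common leading bits of two 32-bit IPv4 addresses."""
    diff = a ^ b
    return 32 if diff == 0 else 32 - diff.bit_length()

def pick_target(flow_ip, underloaded):
    """
    Pick the underloaded server whose continuing flow shares the longest
    IP prefix with the flow being shifted, so the resulting per-flow rules
    aggregate well in the next phase.
    underloaded: {server_name: [IPs of its continuing flows]}
    """
    best, best_len = None, -1
    for srv, ips in underloaded.items():
        for ip in ips:
            plen = common_prefix_len(flow_ip, ip)
            if plen > best_len:
                best, best_len = srv, plen
    return best

A  = int.from_bytes(bytes([10, 0, 0, 0]), "big")    # 10.0.0.0, the flow to shift
B1 = int.from_bytes(bytes([10, 0, 0, 1]), "big")    # continuing flow on server B
C1 = int.from_bytes(bytes([172, 16, 0, 1]), "big")  # continuing flow on server C
assert pick_target(A, {"B": [B1], "C": [C1]}) == "B"  # 31-bit common prefix wins
```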
During flow shifting, DR-LB also generates the per-flow rule set R_pf(t) for slot t. It initializes R_pf(t) as the per-flow rule set R_pf(t − 1) of the last slot t − 1 (Line 3). For each shifted data flow, DR-LB generates a flow rule r with a priority of 32 and a level of 0 (Lines 19-22). r is also assigned a property r.new = 1 to indicate that r represents a newly generated load balancing policy. Then, it finds the flow rule in R_pf(t) that has the same match field as r and replaces it with r (Line 23). In the end, we get the per-flow rule set R_pf(t) representing the load balancing policies for slot t.

Performance Analysis of DR-LB.
It is easy to see that R_pf(t) satisfies the disruption-resistant and the full-coverage constraints. We then propose the following proposition on the performance of DR-LB.

Proposition 2.
When all servers are available, repeatedly generating load balancing policies via DR-LB at every time slot achieves a competitive ratio of (2 − 1/N) on the objective function in Equation (6).

Proof. When all servers are available, any instance of our load balancing problem can be transformed into an instance of the classic load balancing problem, which aims to minimize the maximal task completion time across servers. In the transformed instance, all tasks arrive at the same time. With this transformation, we can see that the load balancing policy computed by DR-LB is in the set of all possible policies computed by Graham's greedy algorithm [31]. Hence, the (2 − 1/N) competitive ratio of DR-LB for Equation (6) is a direct result of applying the technique used to prove the competitive ratio of Graham's algorithm.
Although the complexity of DR-LB depends on the number of arriving flows, which can be large due to mouse flows, in practice we can focus only on elephant flows when computing the load balancing decisions and assign mouse flows evenly across the servers.

Incremental, Bottom-Up Rule Compilation.
We now give the details of the rule compilation phase of Tigris. It involves two steps: incremental, bottom-up rule aggregation and data plane update, marked as Step 3 and Step 4 in Figure 3, respectively.
Step 3 (incremental, bottom-up rule aggregation (Tree-Agg)). Directly installing the initial disruption-resistant per-flow rule set R_pf(t) computed by DR-LB into the load balancing switch is infeasible because the size of R_pf(t) is much larger than the size of the switch flow table: tens of thousands of rules vs. a few hundred or thousand entries. We therefore develop the Tree-Agg algorithm, which adopts a bottom-up approach to aggregate R_pf(t) into a highly compact flow rule set R(t) expressing the same load balancing policies as R_pf(t).

Basic Idea.
We design Tree-Agg to iteratively aggregate rules in R_pf(t) along the 32-level binary tree for IPv4 addresses, starting from the leaf nodes, i.e., level 0. At first glance, this approach is impractical since the complete IPv4 address tree has 2^32 leaf nodes, an extremely large number to traverse. However, we propose three novel design points in Tree-Agg. First, R_pf(t) and R_pf(t − 1) typically differ only in a small number of flow rules, i.e., rules whose r.new is set to 1 in DR-LB, due to the online nature of DR-LB. Hence, at each time slot t, Tree-Agg only needs to traverse from the leaves representing the new rules in R_pf(t) to the root of the binary tree. Secondly, Tree-Agg decomposes the aggregation of a large rule set into parallel aggregation on multiple disjoint subsets. Thirdly, Tree-Agg uses the cached intermediate aggregation results from the last time slot t − 1 to aggregate with the new rules from R_pf(t). These design points have two major benefits: (1) they substantially reduce the traversal and aggregation scale, hence significantly increasing the efficiency of aggregation, and (2) they reuse the cached intermediate aggregation results from t − 1 as much as possible to ensure that R(t) has a high similarity with R(t − 1), reducing the control-path update cost.

Algorithm 1: DR-LB (excerpt).
1: R_pf(0): a set of per-flow rules which approximately evenly divide the whole IP space among all N servers, where each rule r has r.pri ← 32 and r.lev ← 0
2: for each t ← 1, 2, ⋯ do
3:   S_ul, S_ol ← ∅; R_pf(t) ← R_pf(t − 1)

Rule Aggregation Operations.
Before we present the details of the Tree-Agg algorithm, we first introduce some basic operations for rule aggregation, which are summarized in Algorithm 2. To aggregate a set of flow rules R into one rule r_a, we first need to decide the matching field of r_a. This is computed in the matchFieldAgg function, where a wildcard * is used at every matching bit that is not the same across every rule r ∈ R, and an exact bit 0 or 1 is used otherwise (Lines 1-7). In this way, it is guaranteed that every flow that matches at least one rule in R will also match r_a. After computing the matching field of r_a, we use the ruleAgg function to set the other properties of r_a, including action and priority. Note that during the whole rule aggregation process, we only consider aggregating rules that have the same forwarding action, to avoid altering the original forwarding policies (Line 10). The priority of the aggregated rule r_a is set as the number of exact matching bits in its matching field (Line 11). The aggregated source set r_a.src and load r_a.load are set based on the definitions in Section 3, and the aggregation level of r_a is increased by 1 (Lines 12-15).
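A minimal Python sketch of the matchFieldAgg and priority computations described above, assuming matches are represented as bit strings over {'0', '1', '*'}:

```python
WILDCARD = "*"

def match_field_agg(matches):
    """
    matchFieldAgg, as described in the text: a bit position keeps its exact
    value if it agrees across all rules, and becomes a wildcard otherwise.
    Each match is a string over {'0', '1', '*'} (32 chars for real IPv4 rules).
    """
    return "".join(bits[0] if len(set(bits)) == 1 else WILDCARD
                   for bits in zip(*matches))

def priority(match):
    """Priority of an aggregated rule = number of exact (non-wildcard) bits."""
    return sum(1 for b in match if b != WILDCARD)

# Two sibling /32 leaves differing only in the last bit aggregate into a /31.
r1 = "00001010" + "0" * 23 + "0"   # 10.0.0.0
r2 = "00001010" + "0" * 23 + "1"   # 10.0.0.1
agg = match_field_agg([r1, r2])
assert agg == "00001010" + "0" * 23 + "*"
assert priority(agg) == 31
```

Note that any flow matching r1 or r2 also matches the aggregated rule, which is the covering guarantee the paragraph above states.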
Given an aggregated rule r_a and a set of rules R′, we can also compute the set of rules that direct a subset of the flow space of r_a to a different server, i.e., r_a.dom, using the findDom function. This process is straightforward: we check the matching field, the action, and the priority of every rule r ∈ R′ (Lines 19-22). In addition, as we will show next in the Tree-Agg algorithm, if an aggregated rule r_a cannot be inserted into the flow rule set due to a violation of the zero-ambiguity constraint, the aggregation level of every rule r in r_a.src still needs to be increased by 1, using the ruleLevUp function.

Bottom-Up Rule Aggregation along an IP Tree.
Having introduced the basic operations for aggregating rules, we now present the details of the Tree-Agg algorithm. Tree-Agg iteratively aggregates the rules in R_pf(t) along the 32-level binary tree of IPv4 addresses, starting from the leaf nodes, i.e., l = 0. As stated earlier, Tree-Agg is an incremental aggregation algorithm. At each time slot t, it only traverses from the leaves representing the new rules in R_pf(t), i.e., rules with r.new = 1, to the root of the binary tree. During the traversal, it aggregates the new rules from R_pf(t) with R_cache(t−1), the cached intermediate aggregation results from the last slot t−1.
The pseudocode of Tree-Agg is shown in Algorithm 3. In the beginning, we construct a rule set R_c containing all newly generated rules in R_pf(t) (Lines 2-4). This initialization of R_c ensures that the rule aggregation process only traverses from the leaves representing the new rules to the root of the binary IP tree. Because the load balancing policies for time slot t−1 differ from those for slot t, intermediate aggregated rules from R_cache(t−1) that conflict with the load balancing policies of slot t need to be removed (Line 5). For instance, suppose a rule r: 10.0.0.1 → A with r.lev = 0 is in R_cache(t−1) and a rule r′: 10.0.0.1 → B with r′.lev = 0 is in R_c. Because r and r′ occupy the same location on the IP tree, i.e., the same leaf node, r needs to be removed from R_cache(t−1). We define the function removeRule(R_cache(t−1), R_c, k) to remove all rules with r.lev = k that have the same location in the IP tree as rules in R_c.
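The removeRule step can be sketched as follows (a hypothetical helper over the same dict-based rule encoding; we identify a rule's IP-tree position by its level together with its matching field):

```python
def remove_rule(cache, r_c, k):
    """removeRule(R_cache(t-1), R_c, k): drop every cached level-k rule whose
    IP-tree location (level plus matching field) collides with a rule in R_c."""
    taken = {r['match'] for r in r_c if r['lev'] == k}
    return [r for r in cache if not (r['lev'] == k and r['match'] in taken)]
```

In the 10.0.0.1 example above, the cached rule 10.0.0.1 → A and the new rule 10.0.0.1 → B share the same leaf node, so the cached rule is dropped.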

Decomposition of Rule Set into Disjoint Subsets Using 3-Node Subtree Representation.
In each iteration of the main loop (Lines 6-49), we aggregate the set of flow rules at the current level l of the IP tree into a more compact set and send it to the next level. In particular, each time we randomly select a remaining flow rule r0 from the current rule set R_c (Line 9). We define the getCache(R_cache(t−1) ∪ R_c, r0, k) function to return, from the union of R_c and R_cache(t−1), all rules whose first k matching bits are the same as r0's but whose (k+1)-th bit is different. Using this function, we construct two subsets R1 and R2 and remove their rules from R_c and R_cache(t−1) (Lines 10-12). We can view R1 and R2 as two nodes with the same parent in the IP tree; Figure 4(a) gives an example of this subtree. Denoting the aggregated rule set of R1 and R2 as R_t and placing it in the parent node of this 3-node subtree, we see that the aggregation process in each iteration of the main loop decomposes into the aggregation of multiple 3-node subtrees. We prove the following property on the IP tree.

Proof. For any r_a1 ∈ R_a1 and r_b1 ∈ R_b1, they share only the same first (31 − r_a1.lev) bits and differ on the (32 − r_a1.lev)-th bit. Moreover, r_a1.dom and r_a2.dom only contain rules with lower levels. Therefore, it is impossible for a rule in r_a1.dom ∩ r_a2.dom to intersect the flow space of both r1 and r2.

With this proposition, the aggregation of the subtrees on the same level of the IP tree can be performed in parallel without violating the no-ambiguity constraint. Hence, this decomposition substantially reduces the problem scale and improves the efficiency of rule aggregation. Next, we describe the aggregation process for a 3-node subtree.
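The decomposition into 3-node subtrees can be sketched as follows (hypothetical helpers over dict-based rules; k plays the role of the shared-prefix length in getCache):

```python
def get_cache(rules, r0, k):
    """getCache: rules agreeing with r0 on the first k matching bits but
    differing on the (k+1)-th bit, i.e. the sibling node of r0's subtree."""
    return [r for r in rules
            if r['match'][:k] == r0['match'][:k]
            and r['match'][k] != r0['match'][k]]

def split_subtree(rules, r0, k):
    """Build the two sibling sets R1 (r0's side) and R2 of one 3-node
    subtree; disjoint subtrees can then be aggregated in parallel."""
    r1 = [r for r in rules if r['match'][:k + 1] == r0['match'][:k + 1]]
    r2 = get_cache(rules, r0, k)
    return r1, r2
```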

Aggregating a 3-Node Subtree When R2 Is Empty.
If the set R2 is empty, i.e., R1 has no sibling with the same parent in the IP tree, then no flows with nonzero estimated load in the sibling flow space will arrive at the load balancing switch in the next time slot. In this case, we generate a new set of aggregated rules by changing the (32 − l)-th bit of the matching field of all rules in R1 to the wildcard (Lines 14-15). It is straightforward to see that the newly generated aggregated rules do not increase the size of the rule set, cause no violation of the no-ambiguity constraint, and help ensure the full-coverage constraint by covering the flow space with zero expected load. As a result, the aggregation succeeds and we insert the updated R1 with an increased level into the aggregated rule set R_t (Line 16).

Aggregating a 3-Node Subtree When R2 Is Nonempty.
If the set R2 is not empty, we move on to aggregate rules with the same forwarding server in R1 and R2 (Lines 18-45). To do this, we randomly select r1 from R1 and r2 from R2 whose forwarding actions are the same and generate the aggregated rule r* (Line 22). Note that in Tree-Agg, we select r1 and r2 from different sets. This is because we can prove the following.
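The two branches above can be sketched as follows (hypothetical dict-based rules; the (32 − l)-th bit sits at 0-indexed position 31 − l in our string encoding):

```python
def lift_when_sibling_empty(r1_rules, l):
    """Empty-R2 branch (Lines 14-16): wildcard bit (32 - l) of every rule in
    R1 so the aggregate also covers the zero-load sibling flow space."""
    i = 31 - l
    return [{'match': r['match'][:i] + '*' + r['match'][i + 1:],
             'action': r['action'],
             'pri': r['pri'] - 1,
             'lev': r['lev'] + 1} for r in r1_rules]

def candidate_pairs(r1_rules, r2_rules):
    """Nonempty-R2 branch (Line 22): Tree-Agg only tries to aggregate pairs
    (r1, r2) drawn from the two sibling sets with the same forwarding action."""
    return [(a, b) for a in r1_rules for b in r2_rules
            if a['action'] == b['action']]
```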

Proposition 6.
Given any two rules r′ and r′′ both from R1 or both from R2 whose forwarding actions are the same, they cannot be aggregated to reduce the number of rules without violating the no-ambiguity constraint.
Proof. Without loss of generality, assume that there exist two such rules r′ and r′′ with the same forwarding action, both from R1. If they could be aggregated to reduce the number of flow rules, they would have been aggregated during the aggregation of the 3-node subtree rooted at R1. The only reason they were not aggregated is that doing so cannot reduce the number of flow rules, i.e., aggregating them requires extra rules to eliminate ambiguity; hence they should not be aggregated at R1 itself. Now, assume that there is another rule r ∈ R2 that shares the same forwarding action. Aggregating r′, r′′, and r together would cause the same issue, because r only shares the same first (31 − r.lev) bits with r′ and r′′. Hence, r′ and r′′ cannot be aggregated to decrease the number of flow rules.
After generating r*, we compute r*.dom using the findDom function defined in Algorithm 2 (Lines 23-24). Leveraging Proposition 4, we only search the union of R1, R2, and R_t, instead of the whole rule set R_c, to find r*.dom efficiently. Having computed r*.dom, we can check whether directly inserting r* would violate the no-ambiguity constraint. To this end, we first check if there exists a rule r in R1 ∪ R2 that violates this constraint with r* and r*.src. If so, r* cannot be inserted into the aggregated rule set, since such ambiguity cannot be removed even if r is later aggregated with another rule (Lines 25-26). If no such r exists, we then check whether any ambiguity would arise between r* and other newly aggregated rules r⋄ in R_t (Line 27) and take different actions in the following cases.

Case 1.
If no such rule r⋄ exists, r* can be directly inserted into R_t, as this insertion introduces no ambiguity (Lines 28-31).

Case 2.
If there are two or more such rules r⋄, we do not insert r* into R_t, because at least two rules in R_t would have to be unaggregated to avoid ambiguity, which would increase the size of the rule set (Lines 33-34).

Case 3.
If there is only one such r⋄ in R_t, we compare the sizes of r⋄.dom and r*.dom and keep only the rule with the smaller set of flow-space-intersecting rules (Lines 35-39). The rationale of this strategy is that an aggregated rule with a smaller dom set has a smaller chance of causing ambiguity with future aggregations from R1 and R2.
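The three cases can be summarized in a small dispatch sketch (hypothetical; `conflicts` stands for the list of rules r⋄ in R_t that are ambiguous with r*):

```python
def resolve_insertion(r_star, conflicts):
    """Cases 1-3 (Lines 28-39): decide how the candidate aggregate r_star
    enters R_t given its ambiguity conflicts with newly aggregated rules."""
    if not conflicts:
        return 'insert'                    # Case 1: no ambiguity introduced
    if len(conflicts) >= 2:
        return 'reject'                    # Case 2: >= 2 rules would unaggregate
    r_dia = conflicts[0]                   # Case 3: keep the smaller-dom rule
    if len(r_dia['dom']) > len(r_star['dom']):
        return 'replace_conflicting_rule'  # unaggregate r_dia, insert r_star
    return 'reject'
```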
If the aggregated rule r* eventually cannot be inserted into the aggregated rule set, we move on to select the next pair of rules from R1 and R2 for an aggregation trial. For any rules r1 ∈ R1 and r2 ∈ R2 that have failed all possible aggregations, we increase their aggregation levels by 1 and insert them directly into R_t (Lines 40-45). The aggregation process of a 3-node subtree stops when both R1 and R2 are empty. We insert the resulting aggregated rule set R_t for this subtree into a temporary set R_new and move on to select the next 3-node subtree for aggregation. The aggregation process for the current level l stops when every nonempty 3-node subtree on this level has been aggregated. We then repeat the whole aggregation process for the next level on the complete aggregated set R_new (Lines 46-49) until we reach the root of the IP tree, i.e., l = 31. Note that before each iteration of the main loop, the removeRule function is invoked to remove obsolete, conflicting rules from R_cache(t−1) (Line 48). We also cache R_new to assist future rule aggregation at time slot t+1 (Line 49). In the end, Tree-Agg returns R_c as the compact rule set R(t) (Line 50).

4.3.7. An Example.
We use the example in Figure 4 to illustrate the whole flow rule aggregation process on a subtree. The original 3-node subtree is shown in Figure 4(a); we omit the first 29 bits for simplicity. We start by aggregating the two rules forwarding to server A and obtain *1* → A with priority 1. Because this rule does not cause any ambiguity violation with either R1 ∪ R2 or R_t, we insert it into R_t in Figure 4(b). Next, we generate another aggregated rule for server B, i.e., **0 → B with priority 1. Though this rule does not cause any ambiguity violation with R1 ∪ R2, it conflicts with the newly inserted rule *1* → A in R_t. To decide which rule to keep, we compare the sizes of their dom sets. The dom set of *1* → A contains both **0 → B and 0*1 → C from R1, while the dom set of **0 → B contains only *1* → A. Therefore, we keep the latter to reduce the probability of ambiguity conflicts with future aggregated rules, unaggregate *1* → A back to its source rules, and obtain the new R_t shown in Figure 4(c). Last, we generate an aggregated rule for server C and find that it has no ambiguity conflict with other rules. In this way, we obtain the minimal aggregated rule set R_t with only 5 rules in Figure 4(d). Readers may notice that if we kept *1* → A in R_t, the minimal size of R_t would increase to 6, causing unnecessary waste of the limited flow rule space.

Performance Analysis of Tree-Agg.
From the previous propositions, we see that Tree-Agg does not change the action of any rule in R_pf(t) when compressing it into R(t). Hence, R(t) expresses the same load balancing policies as R_pf(t) does, i.e., it satisfies the disruption-resistant and full-coverage constraints. Propositions 4 and 6 ensure that R(t) satisfies the no-ambiguity constraint. One may notice that this algorithm has a complexity polynomial in the number of shifted flows. However, because the number of shifted flows is usually small, the decomposition of flow rule aggregation and the cached intermediate rule aggregation results from slot t−1 ensure that Tree-Agg is computationally efficient and that R(t) has a high similarity to R(t−1), substantially reducing the control-path update cost of Tigris.
Step 4 (data plane update). After computing the compact rule set R(t), Tigris deletes the set of obsolete rules R(t−1) \ R(t) from the switch and then installs the set of new rules R(t) \ R(t−1). Deciding whether a given set of flow rules can be aggregated into a smaller set of a given size is proved to be NP-hard [35]. Hence, Tigris uses a software switch as a safety measure to install extra rules if the compact rule set still exceeds the flow table size of the hardware switch. As we will show in Section 5, however, this is rarely needed in practice, since R(t) is highly compact. We will also show that R(t) is highly similar to R(t−1), i.e., a low rule update ratio, yielding an extremely low control-path update cost.
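Step 4 amounts to a set difference on the two rule sets, which a controller-side sketch can express as follows (hypothetical; rules are represented as hashable (match, action) pairs):

```python
def plan_update(prev_rules, new_rules):
    """Compute the data plane update for slot t: delete R(t-1) \ R(t),
    install R(t) \ R(t-1); the rule update ratio measures the churn."""
    prev, new = set(prev_rules), set(new_rules)
    to_delete = prev - new
    to_install = new - prev
    update_ratio = len(to_install) / len(new) if new else 0.0
    return to_delete, to_install, update_ratio
```

A high similarity between R(t) and R(t−1) keeps both difference sets, and hence the control-path update cost, small.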

Supporting Heterogeneous Servers and Other Load Balancing Objectives.
For simplicity, we assume identical server machines in this paper, and the DR-LB algorithm makes online, disruption-resistant load balancing policies that achieve a good competitive ratio on the load balancing objective in Equation (6). However, Tigris can also be applied to scenarios where servers have different computation resources, e.g., CPU and memory. This is because Tigris adopts a modular design that separates the load balancing decision process from the rule compression process. With this design, users have the flexibility to define and implement different load balancing algorithms while still leveraging the high rule set compression capability of Tree-Agg.

Figure 4: An example of aggregating a 3-node subtree: (a) original R1, R2, and R_t; (b) ambiguity conflict between r* and r⋄; (c) r⋄ is unaggregated to satisfy the no-ambiguity constraint; (d) the optimal R_t.

Prefix vs. Suffix and Per-Flow Load Balancing vs. Per-Network Load Balancing.
Though in the Tree-Agg algorithm we assume an aggregation process based on the IP prefix, it is straightforward to apply it for suffix-based aggregation. In addition, Tigris supports not only per-flow load balancing but also per-network load balancing. In the latter case, the DR-LB algorithm makes load balancing decisions for each network, i.e., traffic from the same network is grouped and forwarded to the same server. With this type of load balancing decision, the Tree-Agg algorithm starts the aggregation from the network level, instead of from the leaf level of the IP tree. The computation overhead of per-network load balancing can hence be substantially reduced, at the cost of less fine-grained load balancing policies.

Evaluations
We implement a prototype of Tigris and carry out extensive simulations to evaluate its efficiency in achieving size-constrained, disruption-resistant load balancing. The evaluation is performed on a MacBook Pro with four 2.2 GHz Intel i7 cores and 16 GB of memory.

Methodology.
In our evaluations, we assume the setting of Figure 1, where a load balancing switch directs clients' requests to a set of N servers. We generate M flows per time slot. Each flow has a source IP address chosen uniformly at random from the whole 2^32 IPv4 address space. Among the M flows, a ratio α of flows are continuing flows with randomly chosen forwarding servers from the previous slot and hence cannot be disrupted (i.e., cannot be redirected to another server). We assume that the load of each flow is its traffic volume and that the volume is uniformly distributed. We evaluate the following load balancing metrics:

(i) Load variance: the standard deviation of the servers' loads over the load balancing target V_av

(ii) Disruption ratio: the percentage of disrupted continuing flows

We also compare Tigris with the state-of-the-art TCAM Razor system [27] on rule aggregation metrics such as the aggregated rule set size and the rule compression ratio. For these metrics, we run Tigris for T = 10 time slots and report the average results. In addition, we study the following metric to measure the control-path cost of Tigris:

(i) Rule update ratio: the ratio of compact flow rules in slot t that are different from slot t − 1
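The two load balancing metrics can be computed as in the sketch below. Note the hedge: we read "load variance" as the root-mean-square deviation of per-server load from the target V_av, which is one plausible interpretation of the definition above, not necessarily the paper's exact formula.

```python
def load_variance(server_loads, v_av):
    """RMS deviation of each server's load from the balancing target V_av
    (one plausible reading of the 'load variance' metric)."""
    n = len(server_loads)
    return (sum((x - v_av) ** 2 for x in server_loads) / n) ** 0.5

def disruption_ratio(prev_server, new_server, continuing_flows):
    """Percentage of continuing flows redirected to a different server."""
    if not continuing_flows:
        return 0.0
    moved = sum(1 for f in continuing_flows if prev_server[f] != new_server[f])
    return 100.0 * moved / len(continuing_flows)
```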

5.2. Results.
Due to limited space, we only present the results with N = 30 servers, which are more representative given the larger load balancing solution space and the more stringent requirement on rule aggregation, and omit the similar results for N = 10 and 20 servers. Figure 5 summarizes the load balancing performance of Tigris under different numbers of flows and ratios of continuing flows. From Figure 5(a), we see that Tigris achieves a stable, small load variance, i.e., between 1000 MB and 2500 MB, and the competitive ratio of the Tigris algorithm is close to 1 in all settings, as shown in Figure 5(b). This is consistent with our theoretical finding in Proposition 2. Taking the case of 8000 flows as an example, we also compare the load variance of Tigris with that of ECMP in Figure 5(c). We see that the load variance of ECMP is 2-4x higher than that of Tigris. These observations demonstrate the difficulty hash-based load balancing solutions have in adapting to the dynamics of flow statistics, the necessity of a per-flow-based load balancing system, and the efficiency of Tigris in generating dynamic load balancing policies. We then plot the ratio of disrupted flows of ECMP and Tigris in Figure 5(d). We observe that Tigris achieves a zero disruption ratio of continuing flows in all evaluated cases, while ECMP has a close-to-100% disruption ratio in all cases. This is because continuing flows are randomly distributed across all flows with random current servers, whereas ECMP simply divides the flow space evenly among the servers. This large difference in disruption ratio between Tigris and ECMP demonstrates the efficacy of Tigris for disruption-resistant load balancing.
We then study the capability of Tigris to generate highly compact flow rule sets in Figure 6. We see from Figure 6(a) that in most cases, Tigris is able to compute an aggregated rule set of fewer than 400 rules. Even in the worst case, with 2000 continuing flows out of a total of 8000 flows, Tigris yields an aggregated rule set of fewer than 800 rules. Figure 6(b) shows that Tigris achieves a high rule compression ratio, i.e., between 8 and 52. These observations show that Tigris is capable of computing a highly compact flow rule set that fits into typical commodity switches without truncating any rules, and hence achieves efficient utilization of the limited TCAM resources on commodity switches. Taking the case of 8000 flows as an example, we also compare the compression ratio of Tigris with that of TCAM Razor in Figure 6(c). We see that when the ratio of continuing flows is small, i.e., 0.05, Tigris and TCAM Razor give almost the same compression ratio. As the ratio of continuing flows increases, Tigris outperforms TCAM Razor by yielding a higher rule compression ratio. This is because Tigris shifts only a small number of flows during rebalancing and leverages the cached rule aggregation results from the previous time slot for better rule aggregation performance, while TCAM Razor has to start from the per-flow rule set for a clean-slate aggregation. Furthermore, we plot the rule update ratio of Tigris from t = 2 to t = 10 for the setting of M = 8000 flows and α = 0.25 in Figure 6(d). We observe that Tigris yields a rule update ratio of less than 5%, implying an extremely low control-path update cost.
In summary, we demonstrate the efficiency of Tigris in providing disruption-resistant, low-control-path-cost, dynamic load balancing service. Results show that Tigris achieves zero disruption of continuing flows and a close-to-1 competitive ratio on the load balancing objective. It also achieves a high flow rule compression ratio, up to 53x, higher than that of TCAM Razor, with an extremely low control-path update cost, while the state-of-the-art hash-based solution ECMP causes 2-4x higher load variance and disrupts almost 100% of continuing flows.

Conclusion and Future Work
In this paper, we explored the key challenges of dynamic SDN-based load balancing of AAN traffic in 6G networks, i.e., the disruption-resistant challenge and the control-path update challenge. We address these challenges by designing Tigris, the first disruption-resistant, low-control-path-cost dynamic SDN load balancer. Tigris computes disruption-resistant load balancing policies on a per-IP basis and transforms the original large flow rule set into a highly compact rule set with a small size and a low control-path cost, while remaining disruption-resistant. Evaluation results show that Tigris simultaneously achieves close-to-optimal load balancing performance, a high rule compression ratio, a low control-path update cost, and zero disruption of continuing flows. This work sheds light on future research on traffic management in AAN and 6G networks, such as traffic engineering, resource orchestration, and application-aware networking.

Data Availability
The synthesized and real evaluation data used to support the findings of this study have not been made available because they are proprietary under nondisclosure agreements with research partners.

Conflicts of Interest
The authors declare that they have no conflicts of interest.