SNF: synthesizing high performance NFV service chains

In this paper we introduce SNF, a framework that synthesizes (S) network function (NF) service chains by eliminating redundant I/O and repeated elements, while consolidating stateful cross-layer pa ...


INTRODUCTION
network-wide service chains, driven by a controller. Slick avoids redundant operations and shares common elements; however, its decentralized consolidation still realizes a chain of NFs as distributed processes. Most recently, E2 (Palkar et al., 2015) showed how to schedule NFs across a cluster of machines for high throughput. Also, OpenBox (Bremler-Barr et al., 2016) introduced an algorithm that merges processing graphs from different NFs into a single processing graph. Contemporaneously with E2 and OpenBox, our work implements the mechanisms fully specified in (Enguehard, 2016) and represents the next logical step of high-performance NFV research*.

In the case of network-wide deployments, chains suffer from the latency imposed by interconnecting different machines, processes, and switches, along with potential virtualization overheads. In the case of single-server deployments, where the NFs are pinned to a specific (set of) core(s), throughput is bounded by the increasing number of context switches as the length of the chain increases. Based on our measurements, context switches cause a domino effect on cache utilization because of continuous data invalidations and the number of CPU cycles spent forwarding packets along the chain. This leads to increased end-to-end packet latency and considerable variation in latency (jitter).

In this paper, we describe the design and implementation of the Synthesized Network Function (SNF), our approach for dramatically increasing the performance of NFV service chains. The idea in SNF is simple: create spatial correlation to execute service chains as close as possible to the speed of CPU cores operating on the fastest, L1 cache of modern multi-core machines. SNF leverages the ever-continuing increases in core counts of modern machines and the recent advances in user-space networking. We show that (i) SNF increases the throughput of long NF chains and achieves low latency, and (ii) it does so while preserving the functionality of the original service chains.

We implemented the SNF design principles in an appropriately modified version of the Click modular router (Kohler et al., 2000).

Once a processing core acquires a frame, it executes SNF as shown in Figure 1. First, the core classifies the frame (green rectangles in Figure 1) into one of the chain's TCUs and then applies the required synthesized modifications (blue rounded rectangle in Figure 1) that correspond to this TCU. Both the classification and modification processes are highly parallelized, as different cores can simultaneously drive frames that belong to different TCUs out of the chain. We detail both processes in § 3.2.
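Conceptually, each core therefore runs a classify-then-modify loop over the frames it receives. A minimal sketch of this per-core execution model follows (Python, with hypothetical names such as TCU, classify, and run_core; the actual implementation consists of Click/FastClick elements):

    from collections import namedtuple

    # A TCU pairs a packet filter with the synthesized write operations and the
    # egress interface of that traffic class (all names here are illustrative).
    TCU = namedtuple("TCU", ["packet_filter", "operations", "egress_interface"])

    def run_core(frames, tcus, transmit):
        """Per-core loop: each core polls its own queue, so different cores can
        drive frames that belong to different TCUs out of the chain in parallel."""
        for frame in frames:
            tcu = classify(frame, tcus)           # one read pass over the header
            if tcu is None:
                continue                          # no matching traffic class: drop early
            for write in tcu.operations:          # synthesized per-field write operations
                write(frame)
            transmit(frame, tcu.egress_interface)

    def classify(frame, tcus):
        """Return the first TCU whose packet filter matches the frame."""
        for tcu in tcus:
            if tcu.packet_filter(frame):
                return tcu
        return None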

However, the key point of Figure 1 ...

Similarly to header space analysis (Kazemian et al., 2012), we represent each packet as a vector in a multidimensional space. However, we follow a protocol-aware approach by dividing a packet according to the unsigned integer value of its different header fields. Thus, if p is an IPv4/TCP packet, we represent it as a vector p = (p_1, ..., p_n), where each p_i is the unsigned integer value of one header field (e.g., IP source/destination address, TCP source/destination port). From now on, we call P the space of all possible packets.

For a given header field f of length l bits, we define a field filter F_f as a union of disjoint intervals of [0, 2^l − 1]:

F_f = ⋃_i [a_i, b_i], with the [a_i, b_i] ⊆ [0, 2^l − 1] pairwise disjoint.

This allows grouping packets into a data structure that we call a packet filter, defined as a logical expression of the form:

φ = (p_1 ∈ F_1) ∧ ... ∧ (p_n ∈ F_n),

where (F_1, ..., F_n) are field filters. The space of all possible packet filters is Φ. Then the mapping (F_1, ..., F_n) ↦ φ is a bijection, and we can assimilate φ to (F_1, ..., F_n).
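To make these definitions concrete, here is a minimal Python sketch (names are ours, not the paper's implementation) of a field filter as a union of disjoint intervals and of a packet filter as the conjunction of field filters, together with the membership test and the interval intersection used later during synthesis:

    class FieldFilter:
        """A union of disjoint closed intervals over [0, 2^l - 1] for one header field."""
        def __init__(self, intervals):
            self.intervals = list(intervals)          # list of (low, high) pairs, disjoint

        def contains(self, value):
            return any(low <= value <= high for low, high in self.intervals)

        def intersect(self, other):
            out = []
            for a_lo, a_hi in self.intervals:
                for b_lo, b_hi in other.intervals:
                    lo, hi = max(a_lo, b_lo), min(a_hi, b_hi)
                    if lo <= hi:
                        out.append((lo, hi))
            return FieldFilter(out)

        def is_empty(self):
            return not self.intervals


    class PacketFilter:
        """One FieldFilter per header field; a packet matches if every field does."""
        def __init__(self, field_filters):
            self.field_filters = field_filters        # dict: field name -> FieldFilter

        def matches(self, packet):
            # packet: dict mapping field names to unsigned integer field values
            return all(ff.contains(packet[f]) for f, ff in self.field_filters.items())


    # Example: UDP datagrams (IPv4 protocol 17) destined to port 1234
    udp_1234 = PacketFilter({"ip_proto": FieldFilter([(17, 17)]),
                             "dst_port": FieldFilter([(1234, 1234)])})
    print(udp_1234.matches({"ip_proto": 17, "src_port": 53, "dst_port": 1234}))  # True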

Network functions typically apply read and write operations to traffic. While our packet unit representation allows us to compose complex read operations across the entire header space, we still need the means to modify traffic. For this, we define an operation as a function ω : P → Φ that associates a set of possible outputs to a packet. We add the additional constraint that for any given operation ω, there exist per-field functions ω_1, ..., ω_n (each mapping a field value to a field filter) such that:

∀p = (p_1, ..., p_n) ∈ P, ω(p) = (ω_1(p_1), ..., ω_n(p_n)).

Note that we use sets of possible values (instead of fixed values) to model cases where the actual value is only determined at run time (e.g., a source port chosen by a stateful NAPT).

If we define Ω as the space of all possible operations, we can express a processing unit (PU) as a conditional function that maps packet filters to operations:

PU(p) = ω_i(p) if p ∈ φ_i, for i ∈ {1, ..., m},

where (ω_1, ..., ω_m) ∈ Ω^m are operations and (φ_1, ..., φ_m) ∈ Φ^m are mutually distinct packet filters.

An NF is simply a DAG of PUs. For instance, SNF can express a simplified router as an NF with four PUs: an IP lookup PU followed by PUs that decrement the IP TTL, update the IP checksum, and modify the source and destination MAC addresses.
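The sketch below (hypothetical Python, reusing the PacketFilter sketch above) mirrors these definitions: an operation applies one function per header field, and a PU maps mutually distinct packet filters to operations; for brevity, a single value stands in for the set of possible output values, and the MAC addresses are arbitrary placeholders:

    def make_operation(per_field):
        """per_field: dict mapping header field names to functions over that field's value."""
        def op(packet):
            # fields without an entry in per_field are left unchanged
            return {f: per_field.get(f, lambda v: v)(v) for f, v in packet.items()}
        return op

    # Two of the simplified router's operations from the example above
    decrement_ttl = make_operation({"ip_ttl": lambda v: v - 1})
    rewrite_macs  = make_operation({"eth_src": lambda v: 0x001122334455,
                                    "eth_dst": lambda v: 0x66778899AABB})

    class PU:
        """A conditional function: apply the operation of the first matching packet filter."""
        def __init__(self, rules):
            self.rules = rules                 # list of (PacketFilter, operation) pairs
        def process(self, packet):
            for pf, op in self.rules:
                if pf.matches(packet):
                    return op(packet)
            return None                        # no matching filter: the packet is dropped

    print(decrement_ttl({"ip_ttl": 64, "dst_port": 80}))   # {'ip_ttl': 63, 'dst_port': 80}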

The Synthesized Network Function

In the previous section we laid the foundation to construct NFs as graphs of PUs. Now, at the service level, where multiple NFs can be chained, we define a TCU as a set of packets/flows, represented by disjoint unions of packet filters, that are processed in the same fashion (i.e., that undergo the same set of synthesized operations). This definition allows us to construct the service chain's SynthesizedNF function (in short, SNF) as a DAG of PUs or, equivalently, as a map of TCUs that associates operations to their packet filters:

SynthesizedNF(p) = ω_i(p) if p ∈ φ_i, for i ∈ {1, ..., k},

where each pair (φ_i, ω_i) is a TCU, the φ_i are mutually distinct packet filters, and the ω_i are the corresponding synthesized operations.

Leveraging the abstractions introduced in § 3.1, we detail the steps that translate a set of NFs into an equivalent SNF. The SNF architecture comprises three modules (shown in Figure 2). We describe each module in the following sections.

The top left box in Figure 2 is the Service Chain Configurator: the interface that a network operator uses to specify a service chain to be synthesized by SNF. Two inputs are required: a set of service components (i.e., NFs) and their topology. SNF abstracts packet processing by using graph theory.

That said, a chain is described as a DAG of interconnected NFs (i.e., the chain-level DAG), where each NF is itself a DAG of abstract packet processing elements (i.e., the NF DAG). The NF DAG is implementation-agnostic. The network operator enters these inputs in a configuration file using the following notation: we interconnect two NFs as NF1[interface1] → [interface0]NF2.

No loops: Since the chain-level DAG must be acyclic by construction, SNF prevents loops (e.g., two interfaces of the same NF cannot be connected to each other).

... PUs, according to § 3.1.2. Next, the parser considers the next entry point until all are exhausted.
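As an illustration only (hypothetical file format and names; the arrow is written in ASCII here), the following Python sketch parses interconnection lines written in the above notation and rejects configurations whose chain-level graph contains a loop:

    import re
    from collections import defaultdict

    LINK = re.compile(r"(\w+)\[(\w+)\]\s*->\s*\[(\w+)\](\w+)")

    def parse_chain(lines):
        """Parse lines such as "NAPT[1] -> [0]FW" into a chain-level graph."""
        graph = defaultdict(list)                # NF name -> list of successor NF names
        for line in lines:
            m = LINK.match(line.strip())
            if not m:
                raise ValueError("cannot parse: %r" % line)
            src, _, _, dst = m.groups()
            if src == dst:
                raise ValueError("%s: an NF cannot be connected to itself" % src)
            graph[src].append(dst)
        return graph

    def has_cycle(graph):
        """Depth-first search for a back edge in the chain-level graph."""
        WHITE, GREY, BLACK = 0, 1, 2
        color = defaultdict(int)
        def visit(u):
            color[u] = GREY
            for v in graph[u]:
                if color[v] == GREY or (color[v] == WHITE and visit(v)):
                    return True
            color[u] = BLACK
            return False
        return any(color[u] == WHITE and visit(u) for u in list(graph))

    # Example: a NAPT -> FW -> LB chain (interface numbers are illustrative)
    chain = parse_chain(["NAPT[1] -> [0]FW", "FW[1] -> [0]LB"])
    assert not has_cycle(chain)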

The final output of the Service Chain Parser is a large Synthesized-DAG of PUs that models the behavior of the entire input service chain.
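One plausible way to picture this stitching step (a guess at the mechanics, not the paper's actual code): the parser walks each NF DAG and replaces every edge that leaves an NF through an output interface with an edge to the entry PU of the NF that the chain-level DAG connects to that interface:

    def build_synthesized_dag(chain_links, nf_dags, entry_pus):
        """chain_links: {(nf, out_iface): (next_nf, in_iface)} from the chain-level DAG
        nf_dags:     {nf: {pu_name: [successor pu_name or ('OUT', iface)]}}
        entry_pus:   {(nf, in_iface): pu_name of that NF's entry element}
        PU names are assumed unique across the chain.
        Returns one DAG over the PUs of the whole chain."""
        synthesized = {}
        for nf, dag in nf_dags.items():
            for pu, successors in dag.items():
                stitched = []
                for succ in successors:
                    if isinstance(succ, tuple) and succ[0] == 'OUT':
                        link = chain_links.get((nf, succ[1]))
                        # keep edges that exit the chain; otherwise jump to the next NF
                        stitched.append(succ if link is None else entry_pus[link])
                    else:
                        stitched.append(succ)
                synthesized[pu] = stitched
        return synthesized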

After building the Synthesized-DAG, our next target is to create the SynthesizedNF introduced in § 3.1.3. To do so, we need to derive the SNF's TCUs. To build a TCU, we execute the following steps: from each entry port of the Synthesized-DAG, we start from the identity TCU tcu_0 ∈ Φ × Ω defined as tcu_0 = (P, id_P), where id_P is the identity function of P, i.e., ∀x ∈ P, id_P(x) = x. Conceptually, tcu_0 represents an empty packet filter and no operations, which is equivalent to a transparent NF. Then, we traverse the Synthesized-DAG from that entry port: at each PU, we intersect the current TCU's packet filter with the PU's packet filters (Algorithm 2), compose the corresponding operations, and recurse on the PU's successors (Algorithm 1).

The recursive algorithm terminates in two cases: (i) when the packet filter of the current TCU is the empty set, in which case the function does not return anything, or (ii) when the PU U does not have any successors, in which case it returns the current TCUs. In the latter case, the returned TCUs comprise the final SynthesizedNF function.

Algorithm 1: Building the SNF TCUs.
Algorithm 2: Intersecting a TCU with a filter.
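The listings of Algorithms 1 and 2 are not reproduced above; the sketch below (hypothetical Python, reusing the FieldFilter sketch from § 3.1 and extending each PU rule with a successor pointer) follows their textual description: start from the identity TCU, intersect its packet filter with each PU's filters, compose the operations, and stop on an empty filter or at a PU with no successors. In the real SNF the composed operations are further collapsed so that each header field is written at most once; here they are simply accumulated:

    def build_tcus(pu, tcu_filter, tcu_ops):
        """Traverse the Synthesized-DAG from `pu`, carrying the current TCU.
        pu.rules is a list of (packet_filter, operation, successor) triples,
        where successor is the next PU or None at a chain exit.
        The identity TCU corresponds to tcu_filter = {} (matches everything)
        and tcu_ops = []. Yields (filter, operations) pairs: the SNF's TCUs."""
        for pf, op, successor in pu.rules:
            narrowed = intersect(tcu_filter, pf)                 # Algorithm 2: TCU ∩ filter
            if any(ff.is_empty() for ff in narrowed.values()):
                continue                                         # empty filter: dead branch
            ops = tcu_ops + [op]                                 # compose this PU's operation
            if successor is None:
                yield narrowed, ops                              # chain exit: a final TCU
            else:
                yield from build_tcus(successor, narrowed, ops)  # Algorithm 1: recurse

    def intersect(f1, f2):
        """Field-wise intersection of two packet filters (dicts of FieldFilter)."""
        merged = {}
        for field in set(f1) | set(f2):
            if field in f1 and field in f2:
                merged[field] = f1[field].intersect(f2[field])
            else:
                merged[field] = f1[field] if field in f1 else f2[field]
        return merged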

A difficulty when synthesizing NF chains is managing successive stateful functions. It is crucial to ensure that states are properly located in a synthesized NF and that every packet is matched against the correct state table. At the same time, SNF must keep its promise that NFV service chains are realized without redundancy; hence, single-read and single-write operations must be applied per packet.

To highlight the challenges of maintaining state in a chain of NFs, consider the example topology shown in Figure 3. In this example, a large network operator has run out of private IPv4 addresses in the 10.0/8 prefix and has been forced to share the same network prefix between two distinct zones (i.e., zones 1 and 2), using a chain of NAPTs. This is not unlikely to happen, as an 8-bit network prefix contains fewer than 17 million addresses, and recent surveys have predicted that 50 billion devices will be connected to the Internet by 2020 (Evans, 2011).

Consolidating this chain of NFs into a single SNF instance poses a problem. That is, traffic originating from zones 1 and 2 shares the same source IP address and port range, but to ensure that all the traffic is translated properly, the corresponding synthesized chain must share a single NAPT table. However, since the traffic also shares the same destination prefix (i.e., towards the same Internet gateway), a host from the outside world cannot possibly distinguish the zone from which the traffic originates.

Obviously, the question that SNF has to address in general, and particularly in this example, is: "How can we synthesize a chain of NFs, ensuring that (i) traffic mappings are unique and (ii) no redundant operations will be applied?" To solve this conundrum, the SNF design respects two properties.

To generalize the state management problem, Figure 4 shows how SNF handles stateful configurations with, e.g., three egress interfaces. We apply Property 1 by having exactly one stateful (re)write element (denoted as Stateful RW) per egress interface. We apply Property 2 by having one input port in each of these (re)write elements, associated with an ingress interface. Therefore, a state table in SNF not only contains flow-related information, but also links each flow entry to its origin interface.
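A minimal sketch of this stateful design (hypothetical Python; the real implementation is a set of modified Click/FastClick rewriter elements): one stateful rewrite element serves exactly one egress interface (Property 1), and its table keys every flow entry with the ingress interface the flow arrived on (Property 2), so that two zones sharing the 10.0/8 prefix never collide and reply traffic can be steered back to the correct zone:

    import itertools

    class StatefulRW:
        """Hypothetical stateful (re)write element; instantiate one per egress interface."""
        def __init__(self, public_ip):
            self.public_ip = public_ip
            self.table = {}                     # (ingress_iface, 5-tuple) -> public source port
            self.reverse = {}                   # public source port -> (ingress_iface, 5-tuple)
            self.ports = itertools.count(1024)  # naive port allocator, for illustration only

        def outbound(self, ingress_iface, flow):
            """flow = (src_ip, src_port, dst_ip, dst_port, proto); returns the rewritten flow."""
            key = (ingress_iface, flow)         # state is keyed by the flow's origin interface
            if key not in self.table:
                port = next(self.ports)
                self.table[key] = port
                self.reverse[port] = key
            _, _, dst_ip, dst_port, proto = flow
            return (self.public_ip, self.table[key], dst_ip, dst_port, proto)

        def inbound(self, flow):
            """Map a reply back to its original flow and to the zone it belongs to."""
            src_ip, src_port, _, dst_port, proto = flow
            ingress_iface, (o_src_ip, o_src_port, _, _, _) = self.reverse[dst_port]
            return ingress_iface, (src_ip, src_port, o_src_ip, o_src_port, proto)

    # Example (documentation-reserved addresses): zone 1 traffic enters on interface 0
    napt = StatefulRW("198.51.100.1")
    print(napt.outbound(0, ("10.0.0.5", 4321, "203.0.113.9", 53, "udp")))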
We use Click to specify the NF DAGs of this example, but SNF is applicable to other frameworks.

The example chain consists of a NAPT, an L4 firewall (FW), and an L3 load balancer (LB) that process transmission control protocol (TCP) and user datagram protocol (UDP) traffic, as shown in Figure 5. The TCP traffic is NAPT'ed in the first NF and then leaves the chain, while UDP traffic is filtered at the FW (the second NF) and the UDP datagrams with destination port 1234 are load balanced across two servers by the last NF. For simplicity, we discuss only the traffic going in the direction from the NAPT to the LB.
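Under these assumptions, and for this direction only, the chain's traffic classes can be summarized roughly as follows (an illustrative sketch reusing the interval notation from § 3.1; only destination port 1234 is stated explicitly in the text, and we assume the FW drops the remaining UDP traffic):

    # Rough sketch of this example's traffic classes (one direction only).
    # 6 = TCP and 17 = UDP in the IPv4 protocol field; the complementary port
    # ranges are spelled out so that the filters stay mutually distinct.
    example_tcus = [
        ("TCP traffic",
         {"ip_proto": [(6, 6)]},
         "rewritten by the NAPT, leaves the chain after the first NF"),
        ("UDP to destination port 1234",
         {"ip_proto": [(17, 17)], "dst_port": [(1234, 1234)]},
         "passes the FW, load balanced across two servers by the LB"),
        ("other UDP traffic",
         {"ip_proto": [(17, 17)], "dst_port": [(0, 1233), (1235, 65535)]},
         "dropped by the FW"),
    ]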

The rectangular operations in Figure 5 are interface-dependent: e.g., an "Encapsulate Ethernet" operation depends on the egress interface, and the "IP Fragmentation" should only be applied before the final Ethernet encapsulation.

The remaining operations (illustrated as rounded rectangles) of the three processing stages are those that (i) make decisions based upon the contents of specific packet fields (read operations with a solid round outline, e.g., "Classify IP Traffic" and "Filter IP Traffic") or (ii) modify the packet header (rewrite operations with a blue dashed outline, e.g., "Rewrite Flow" and "Decrement IP TTL"). The synthesized classifier (rectangles with a solid outline in Figure 6) encodes all the read operations by composing paths that begin from a specific interface and traverse the three traffic classes of this chain, until a packet is output or dropped. Each path keeps a union of filters that represents the header space that matches the respective traffic class. In this example, the filter for, e.g., the allowed UDP packets combines the protocol and destination port numbers. Such a filter is part of a classifier whose output port is linked with a set of write operations (dashed vertices in Figure 6) associated with this traffic class (right-most part of the graph).

As shown in Figure 6, with SNF a packet passes through all the read operations once (guaranteeing a single read) and is either discarded early or has each header field written once (ensuring a single write) before exiting the chain.

Synthesizing the counterpart of this example implies several code modifications to avoid the redundancy caused by the design of each NF. To apply a per-flow, per-field single-write operation, we ensure that the "Rewrite Flow" element calculates the checksums only once the IP addresses, ports, and IP TTL fields have been written. Therefore, in this example we saved four unnecessary operations (3 "Decrement IP TTL" and 1 "Rewrite Flow") and four checksum calculations (3 IP and 1 IP/UDP). Moreover, integrating all decisions (i.e., routing, filtering) in one classifier made this operation slightly heavier, but saved another two redundant function calls to "Destination IP LookUp" and "Filter IP Traffic", respectively.

In production service chains, where packets arrive at high rates, this overhead can play a major role in limiting the throughput of the chain and the imposed latency; therefore, the advantages of synthesizing more complex service chains than this simple use case are expected to be even greater.
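To make the single-write idea concrete, the sketch below (hypothetical Python; the actual code lives in modified FastClick elements) applies all accumulated header modifications of a TCU first and recomputes the IPv4 header checksum exactly once afterwards; the UDP/TCP checksum would be handled analogously:

    import struct

    def ipv4_checksum(header: bytes) -> int:
        """Standard IPv4 header checksum: one's complement of the one's complement
        sum of the header's 16-bit words (checksum field assumed zeroed)."""
        if len(header) % 2:
            header += b"\x00"
        total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
        while total >> 16:
            total = (total & 0xFFFF) + (total >> 16)
        return ~total & 0xFFFF

    def synthesized_write(ip_header: bytearray, new_src: bytes, new_dst: bytes, ttl_decrement: int):
        """Apply every accumulated field modification first, then fix the checksum once."""
        ip_header[12:16] = new_src                           # source IPv4 address (4 bytes)
        ip_header[16:20] = new_dst                           # destination IPv4 address (4 bytes)
        ip_header[8] = max(ip_header[8] - ttl_decrement, 0)  # TTL, decremented a single time
        ip_header[10:12] = b"\x00\x00"                       # zero the checksum field
        ihl = (ip_header[0] & 0x0F) * 4                      # header length in bytes
        ip_header[10:12] = struct.pack("!H", ipv4_checksum(bytes(ip_header[:ihl])))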

As we stated earlier, SNF's basic assumption is that each input service component (i.e., NF) is expressed as a graph (i.e., the NF DAG), composed of individual packet processing elements. This allows SNF to parse the NF DAG and infer the internal operations of each NF, producing a synthesized equivalent.

Among the several candidate platforms that allow such a representation, we developed our prototype atop Click because it is the most widely used NFV platform in academia. Many earlier efforts have built upon it to improve its performance and scalability; hence, we believe that this choice will maximize SNF's impact.

The integration with FastClick required another 1500 lines of code (modifications and extensions).

Although FastClick improves a router's throughput and latency, it lacks features required for broader NFV applications; therefore, we made the following extensions to target a service-oriented platform.

Extension 1: ... these elements were not designed to be thread-safe; hence, they could cause race conditions when accessed by multiple CPU cores at the same time. We designed thread-safe data structures for these elements, while also applying the necessary modifications to equip them with the FastClick optimizations.

Extension 2: We tailored several packet-modification FastClick elements to comply with the synthesis principles, as we found that their implementation was not aligned with our single-write approach. For instance, we improved the IP/UDP/TCP checksum calculations by calling the respective functions only once all the header field modifications have been applied. Moreover, we extended the IP/UDP/TCPRewriter elements with additional input arguments. These arguments extend the elements' packet modification capabilities (e.g., decrementing the IP TTL field to avoid unnecessary element calls) and guarantee that a packet entering these elements undergoes a single-write operation per header field.

Another common use case for an ISP is to deploy a service chain of an FW, a router, and a NAPT, as depicted in Figure 9. The FW of such a chain may contain thousands of rules in its ACL, causing serious performance issues for software-based NF implementations.

In this section we measure the performance of SNF using actual FW configurations of increasing cardinality and complexity, while exploring the limits of software-based packet processing on our hardware. We utilize a set of three actual ACLs (Taylor and Turner, 2007), taken from several ISPs, to deploy the FW of this chain.

We use the above ACLs to generate traces of 64-byte frames that systematically exercise all of their entries. The generated packets emulate intra-ISP, inbound, and outbound Internet traffic (see Figure 9).

... a deployment in small subnets (e.g., using links with capacity equal to or less than 10 Gbps) may not fully benefit from SNF. As depicted in Figure 10b, the latency is also bounded below 100 µs.

In contrast, SNF effectively synthesizes the large ACLs (i.e., 713 and 8550 rules), maintaining high throughput despite their increasing complexity. In the case of 713 rules, the synthesis is so effective that it leads to better throughput than the 251-rule case. Regarding latency, SNF demonstrates 1.1-10x lower median latency (bounded below 500 µs) and 2-3.5x lower latency variance (slightly above 1 ms in some cases). SNF achieves up to 8.5x greater throughput than the FastClick chains.

Hardware-accelerated SNF

The results presented in the previous section show that software-based SNF cannot handle packet processing at a high enough rate when the NFs are complex. We analyzed the root cause and concluded that the packet classifier (which dispatches incoming packets to synthesized NFs) is the bottleneck. To overcome this problem, we run additional experiments in which we offload packet classification to a hardware OpenFlow switch (since commodity NICs do not offer sufficient programmability). By doing so, we showcase SNF's ability to scale to high data rates with realistic NFs. In addition, we hint at the performance that is potentially achievable by offloading packet classification to a programmable interface.

... (Figure 11a). We observe that throughput depends mostly on the frame size. The system can operate at almost 20 Gbps for small frames (i.e., 64 bytes), and it reaches full line rate for 256-byte frames. Interestingly, the rule set size does not affect the throughput.

In the real data sets, the second bar in each pair is almost as high as the first one, which shows that the software part of SNF does not limit the performance. Finally, with simple forwarding rules in the switch (the first pair of bars in Figure 11a), the overall throughput is high even for small frames, which confirms that packet processing at the switch is the bottleneck of the whole system. To further prove this point, we ...

... OpenBox cannot merge the "Rewrite Flow" element (which rewrites fields such as the source port of UDP packets) with the 3 "Decrement IP TTL" elements, since these elements do not belong to the same type. This means that the final OpenBox graph will have 2 distinct packet modification elements (i.e., 1 "Rewrite Flow" and 1 "Decrement IP TTL") and each element has to compute the IP and UDP checksums separately. Therefore, OpenBox does not completely eliminate redundant operations.

In contrast, SNF effectively synthesized the operations of all these elements into a single element (see Figure 6) that computes the IP and UDP checksums only once. Consequently, SNF produces both a shorter processing graph and a synthesized chain with no redundancy, hence achieving lower latency.

... but SNF aims to make them more efficient. Concretely, an SNF TCU is not processed by a DAG of NFs, but rather by a highly optimized piece of code (produced by the synthesizer) that directly applies a set of operations to this specific traffic class.

Impact. E2 can use SNF to fit more service chains into one machine, and hence postpone its elastic scaling.

Existing approaches can transparently use our extensions to provide services such as (i) lightweight ...

... we parse the chained NFs and build a classification graph whose leaves represent unique traffic class units.

In each leaf, we perform a set of packet header modifications to generate an equivalent configuration that implements the same functionality as the initial chain using a minimal set of elements.

SNF synthesizes stateful chains that appear in production ISP-level networks, achieving high throughput and low latency while outperforming state-of-the-art works.