Benchmarking Methodology for IPv4aaS Technologies: Comparison of the Scalability of the Jool Implementation of 464XLAT and MAP-T

Abstract: A novel method is proposed for the performance and scalability measurements of the IPv4-as-a-Service (IPv4aaS) technologies. It works according to the dual Device Under Test (DUT) setup of RFC 8219 and is suitable for benchmarking any of the five IPv4aaS technologies: Combination of Stateful and Stateless Translation (464XLAT), Dual-Stack Lite (DS-Lite), Lightweight 4over6 (Lw4o6), Mapping of Address and Port with Encapsulation (MAP-E), and Mapping of Address and Port using Translation (MAP-T). The method is based on the reduction of the aggregate of Customer Edge (CE) and Provider Edge (PE) devices to a stateful network address translation from IPv4 to IPv4 (stateful NAT44) gateway. The most important advantage of the novel method is that a stateful NAT44 tester can be used instead of a technology-specific tester, which usually does not exist. The proposed method is validated by the examination of the performance and scalability of the Jool implementation of 464XLAT and MAP-T. Scalability is defined by both (1) how performance increases with the number of active Central Processing Unit (CPU) cores; and (2) how performance decreases with the increasing number of concurrent sessions. Maximum connection establishment rate and throughput are used as performance metrics. The scalability of 464XLAT and MAP-T is measured from 1 to 16 CPU cores and from 1 to 256 million connections. The measurement details and results are fully disclosed and discussed.


INTRODUCTION
Gábor Lencse is with the Department of Telecommunications, Faculty of Mechanical Engineering, Informatics and Electrical Engineering, Széchenyi István University, Egyetem tér 1, Győr, H-9026, Hungary (e-mail: lencse@sze.hu).
Ádám Bazsó is with the Cybersecurity and Network Technologies Research Group, Faculty of Mechanical Engineering, Informatics and Electrical Engineering, Széchenyi István University, Egyetem tér 1, Győr, H-9026, Hungary.

Even though the public IPv4 address pool of the Internet Assigned Numbers Authority (IANA) was depleted in 2011 [1], the transition of the Internet from Internet Protocol version 4 (IPv4) to Internet Protocol version 6 (IPv6) has not yet been completed. Several Internet Service Providers (ISPs) use Carrier-grade NAT (CGN) to mitigate the shortage of public IPv4 addresses, whereas others decided to go ahead and eliminate IPv4 from their access and core networks. However, there are some IPv4-only applications and some users stick to them. Therefore, ISPs still need to provide their customers with IPv4 Internet access. To that end, five IPv4-as-a-Service (IPv4aaS) technologies have been developed [2]. In terms of the technology used for access and core network traversal, they can be classified into two categories:
- Combination of Stateful and Stateless Translation (464XLAT) and Mapping of Address and Port using Translation (MAP-T), which use double translation (first, from IPv4 to IPv6 and then from IPv6 to IPv4);
- Dual-Stack Lite (DS-Lite), Lightweight 4over6 (Lw4o6), and Mapping of Address and Port with Encapsulation (MAP-E), which encapsulate the IPv4 packets into IPv6 packets and then de-encapsulate them.
The five IPv4aaS technologies have various similarities and differences (e.g., their stateful or stateless nature in the core of the ISP network) and thus they have several advantages and disadvantages as discussed in [3]. The performance and scalability of the IPv4aaS technologies are important decision factors when network operators select the most suitable IPv4aaS technology for their specific needs. However, the performance analysis of the five IPv4aaS technologies is rather uncharted territory. Al-hamadani in [4] surveyed several research papers regarding the performance analysis of IPv6 transition technologies and, according to Table 2 of that paper, only one of the surveyed research papers, [5], addressed any of the five IPv4aaS technologies. The authors of [5] covered all of them except Lw4o6. They measured the performance of 464XLAT, DS-Lite, MAP-T, and MAP-E by means of round-trip delay, jitter, throughput, and packet loss using the Asamap Vyatta software executed by Cisco UCS C200 M2 servers. Whereas the authors of this paper acknowledge the significance of their pioneering work, they contend that instead of a performance comparison using some specific hardware (which is rather obsolete at the time of writing this paper), network operators would benefit much more from the scalability comparison of the various IPv4aaS solutions. Scalability is defined by both (1) how performance increases with the number of active Central Processing Unit (CPU) cores; and (2) how performance decreases with the increasing number of concurrent sessions. Such measurements were already performed regarding the Jool implementation of the 464XLAT and MAP-T technologies in [6]. However, that measurement method had several limitations and it also did not comply with RFC 8219 [7]. (Please refer to Section 3.2 for the details.)
When this research began, the authors were faced with the following key technical challenges:
1. No RFC 8219-compliant Testers existed for benchmarking any of the IPv4aaS technologies other than 464XLAT.

2. Although all five IPv4aaS technologies can carry IPv4 traffic, they cannot be benchmarked using legacy RFC 2544 [9] compliant Testers, as shown in Section 2.2.
3. No RFC 8219-compliant methodology has been defined for measuring the scalability of the IPv4aaS IPv6 transition technologies, as pointed out in Section 3.
It should be noted that RFC 8219 reflects the state of the art in the benchmarking of IPv6 transition technologies. Any measurements that do not comply with it can give valuable insight into the performance of the examined IPv6 transition technologies, but cannot provide its full benefits.
To address these challenges, the authors set the following goals:
1. To provide an RFC 8219-compliant methodology that does not require a technology-specific tester for each technology and is suitable for the performance and scalability measurements of each of the five IPv4aaS technologies.
2. To validate the methodology by actual measurements with two different IPv4aaS technologies.
3. To compare the performance and scalability of the Jool implementation of the 464XLAT and MAP-T technologies in an RFC 8219-compliant way.
To this end, this paper proposes a new methodology that makes it possible to benchmark any of the five IPv4aaS technologies in an RFC 8219-compliant way without the need for technology-specific testers. This is achieved by the reduction of the dual Device Under Test (DUT) setup of RFC 8219 to a single DUT setup of a stateful NAT44 gateway, which is then benchmarked according to the Internet-Draft [8]. For measuring scalability, the proposed method is the one defined in the Internet-Draft. The authors believe that the proposed measurement methodology for benchmarking IPv4aaS technologies is novel and has never been used by anyone else before. Moreover, by benchmarking the Jool implementation of the 464XLAT and MAP-T technologies, the proposed benchmarking methodology is validated.
The remainder of this paper is structured as follows. In Section 2, the methodological issues of benchmarking the five IPv4aaS technologies are discussed: first, a short high-level summary of the operation of the five IPv4aaS technologies is given and one of their common properties that must be taken into consideration is highlighted; then the relevant requirements of RFC 8219 are mentioned and the methodological gap regarding stateful technologies is pointed out; finally, a novel RFC 8219-compliant benchmarking methodology for the performance and scalability measurement of IPv4aaS technologies is outlined. In Section 3, the results of two previous papers are summarized and the shortcomings of the applied measurement methods are detailed. The rest of the current paper is a case study that demonstrates and validates the proposed methodology. In Section 4, the hardware and software measurement environment and baseline measurements (IPv4 and IPv6 packet forwarding tests) are introduced. In Section 5, the scalability measurements of the Jool implementation of 464XLAT are disclosed, including results and their discussion. In Section 6, the same is done with MAP-T using two different measurement setups. Section 7 contains a discussion of the results and the authors' plans for future research. Section 8 is the conclusion of the paper.

High-level Operation of the Five IPv4aaS Technologies
In this section, a brief overview of the operation of the five IPv4aaS technologies is given to show why their RFC 8219-compliant benchmarking is a problem to be solved. A common property of all of them is that they use a Customer Edge (CE) device and a Provider Edge (PE) device to facilitate the traversal of the IPv4 traffic of the user over the IPv6-only access and core network of the ISP. Moreover, they have their technology-specific names and operations, which are described as follows.

464XLAT
The customer-side translator (CLAT) of 464XLAT [10] performs a stateless network address translation from IPv4 to IPv6 (stateless NAT46, also called stateless IP/ICMP translation [SIIT] [11]) to send the IPv4 traffic of the user over the IPv6-only access and core network of the ISP. When the packets arrive at the provider-side translator (PLAT), it performs a stateful network address and protocol translation from IPv6 clients to IPv4 servers (stateful NAT64) [12] and the packets are forwarded to the IPv4 Internet (Fig. 1). The reply packets are translated in the reverse way. As for PLAT, the reverse translation uses the information stored in its so-called connection tracking table. Thus, there is a state in the central element (PLAT) of 464XLAT, which can be a problem regarding its scalability.

DS-Lite
The Basic Bridging BroadBand (B4) element of DS-Lite [13] encapsulates the IPv4 traffic of the user into IPv6 packets. The Address Family Transition Router (AFTR) decapsulates the IPv4 packet of the user from the IPv6 packet and (with some simplification) it performs a stateful NAT44 translation. Following this, the packet is forwarded to the IPv4 Internet, as shown in Fig. 2. To be precise, this is not simply a stateful NAT44 translation, because the IPv6 address of the B4 device (called softwire-ID) is also stored in the connection tracking table of the AFTR so that it can distinguish the packets of different users that accidentally have the same five-tuple (source IPv4 address, source port number, destination IPv4 address, destination port number, and protocol number). Similar to 464XLAT, there is a state in the central element (AFTR) of DS-Lite.
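The role of the softwire-ID can be illustrated with a minimal sketch. The dictionary-based table, the port allocation, and the names ConnKey and lookup_or_create are hypothetical and are not taken from any AFTR implementation; the sketch only shows why extending the lookup key with the IPv6 address of the B4 keeps apart two subscribers whose flows have identical five-tuples.

```python
from typing import Dict, NamedTuple


class ConnKey(NamedTuple):
    """Hypothetical lookup key: softwire-ID plus the classic five-tuple."""
    softwire_id: str  # IPv6 address of the B4 element
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: int        # protocol number, e.g. 17 for UDP


# Toy connection tracking table: key -> allocated public source port.
table: Dict[ConnKey, int] = {}
next_port = 1024


def lookup_or_create(key: ConnKey) -> int:
    """Return the public port of an existing connection or allocate a new one."""
    global next_port
    if key not in table:
        table[key] = next_port
        next_port += 1
    return table[key]


# Two subscribers use the same private addressing and the same five-tuple.
five_tuple = ("192.168.1.2", 5000, "198.51.100.7", 53, 17)
port_a = lookup_or_create(ConnKey("2001:db8::a", *five_tuple))
port_b = lookup_or_create(ConnKey("2001:db8::b", *five_tuple))
```

Without the softwire-ID in the key, the second lookup would hit the entry of the first subscriber and the reply packets could not be delivered to the right B4.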

Lw4o6
Lw4o6 [14] is an extension of DS-Lite. The motivation of its design was to remove the state from the central element and thus make the solution more scalable. It is achieved in a way that each subscriber is given only a specific limited source port range of a public IPv4 address. The lightweight B4 (lwB4) element of Lw4o6 first performs a stateful NAT44 translation on the IPv4 packet of the user, transforming its IPv4 source address and source port number to the specific public IPv4 address and the limited source port range assigned to the given subscriber. It then encapsulates the resulting IPv4 packet into an IPv6 packet. The lightweight AFTR (lwAFTR) decapsulates the IPv4 packet from the IPv6 packet and forwards it to the IPv4 Internet, as shown in Fig. 3. Thus, the operation of the central element is stateless. In the reverse direction, the packets are forwarded to the proper lwB4 device using Address plus Port (A+P) routing [15].
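The idea of A+P routing in the reverse direction can be sketched as a lookup that depends only on provisioned per-subscriber data and not on per-connection state. The binding table, its contiguous port ranges, the addresses, and the function name route_downstream are assumptions made for illustration; a real lwAFTR derives the port set from its own provisioning format.

```python
from typing import List, NamedTuple, Optional


class Binding(NamedTuple):
    """Hypothetical provisioned binding of one subscriber."""
    public_ipv4: str
    port_min: int
    port_max: int
    lwb4_ipv6: str


# Two subscribers share one public IPv4 address with disjoint port ranges.
bindings: List[Binding] = [
    Binding("203.0.113.1", 1024, 2047, "2001:db8::1"),
    Binding("203.0.113.1", 2048, 3071, "2001:db8::2"),
]


def route_downstream(dst_ipv4: str, dst_port: int) -> Optional[str]:
    """Select the lwB4 from the destination address and port of an inbound packet."""
    for b in bindings:
        if b.public_ipv4 == dst_ipv4 and b.port_min <= dst_port <= b.port_max:
            return b.lwb4_ipv6
    return None  # no subscriber owns this address and port


target = route_downstream("203.0.113.1", 2500)
```

The table grows with the number of subscribers, not with the number of connections, which is the sense in which the central element is called stateless.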

MAP-E
It can be said that MAP-E [16] is a kind of generalization of Lw4o6, although it uses a unique ruleset, called MAP rules. Not considering the mapping rules, its high-level operation is rather similar to that of Lw4o6. The CE element of MAP-E first performs a stateful NAT44 translation on the IPv4 packet of the user. This means that the IPv4 source address and source port number are translated to the specific public IPv4 address and the limited source port range assigned to the CE device. Subsequently, the CE device encapsulates the resulting IPv4 packet into an IPv6 packet. The Border Relay (BR) router decapsulates the IPv4 packet from the IPv6 packet and forwards it to the IPv4 Internet, as shown in Fig. 4. Thus, the operation of the central element is stateless.

MAP-T
The operation of MAP-T [17] is similar to that of MAP-E but it uses double translation (like 464XLAT) for access and core network traversal instead of encapsulation and decapsulation. Thus, the CE element of MAP-T first performs a stateful NAT44 translation on the IPv4 packet of the user. (The IPv4 source address and source port number are translated to the specific public IPv4 address and the limited source port range assigned to the CE device.) Then the CE device performs a stateless NAT46 translation to transform the IPv4 packet into an IPv6 packet. The BR router performs a stateless NAT64 translation (the reverse of the stateless NAT46 translation performed by the CE device) to transform the IPv6 packet into an IPv4 packet. It then forwards the IPv4 packet to the IPv4 Internet (Fig. 5). Thus, the operation of the central element is stateless.

Summary
The high-level operation of the five IPv4aaS technologies is summarized in Table I. The 464XLAT and DS-Lite technologies are called "stateful" as they have a state in their PE device. The other three technologies are called "stateless" because they do not have a state in the PE device; however, they also have a state, but it is in the CE device. Regarding the scalability of the technologies, a state close to the end-user is generally not seen as problematic as a state in the middle of the network [3]; however, if a state exists anywhere in the system, it causes hardship for benchmarking, as discussed in Section 2.1.

Some Important Requirements of RFC 8219
Fig. 5. Overview of the MAP-T architecture [3].
A comprehensive benchmarking methodology was defined for all kinds of network interconnect devices by RFC 2544 [9] in 1999. Moreover, it still determines how commercial network performance testers work today. Theoretically, it was an IP version independent solution, but its approach and examples reflected IPv4. As time passed, its methods were updated in different ways. Originally, RFC 2544 used a fixed test frame format including port numbers. In 2007, RFC 4814 [18] recommended using pseudorandom port numbers. An upgrade for IPv6 specificities was given by RFC 5180 [19] in 2008. It explicitly declared that IPv6 transition technologies were outside its scope. RFC 8219 [7] defined a benchmarking methodology for the IPv6 transition technologies in 2017.
RFC 8219 has been built on its predecessors:
- it has reused several measurement procedures, e.g., throughput, frame loss rate, etc. (unmodified);
- it has also kept the requirement of testing with bidirectional traffic (using the same speed in both directions), although it has added an optional testing with unidirectional traffic.
To be able to handle the high number of IPv6 transition technologies [20] efficiently, RFC 8219 classified them into a small number of categories regarding the method used for access and core network traversal and then defined the appropriate benchmarking methodology for each category. For this research, the relevant categories are double translation and encapsulation technologies. For these categories, the Dual DUT setup is recommended. As shown in Fig. 6, there is a single Tester and there are two DUTs. In the current case, DUT 1 and DUT 2 were the CE and PE devices, which were benchmarked together. As the usage of this test setup may hide potential asymmetries, the usage of the Single DUT test setup is also recommended (Fig. 7). Regarding the benchmarking of the five IPv4aaS technologies, the problem is that the Single DUT test setup requires a Tester that can handle the specific traffic of the given implementation. For example, for testing a MAP-E BR device, the Tester should send IPv6 packets that contain embedded IPv4 packets using IP addresses complying with the mapping rules in one direction, and it should send IPv4 traffic in the other direction. It should also be able to decode the packets to check if they arrived correctly. Not having such a specific tester, the only feasible solution could be to use the Dual DUT setup, where CE and PE devices are benchmarked together, as shown in Fig. 8.
Thus, the problem of benchmarking any of the IPv4aaS technologies was simplified to the problem of benchmarking a stateful NAT44 gateway. Moreover, this brings up the problem mentioned in Section 2.1.6; however, in this case, the Tester is expected to use only IPv4 traffic (the aggregate of CE and PE devices is a stateful system and it performs stateful NAT44). Furthermore, this is the same in the case of all five IPv4aaS technologies, as they are all stateful somewhere (either in the CE or PE device). Therefore, testing with bidirectional traffic will not work unless some preliminary arrangements are made and special care is taken (as described in Section 2.3). This is the reason why the aggregate of CE and PE devices cannot be benchmarked by using legacy RFC 2544 compliant Testers, even if it can be called an IPv4 system when it is observed from outside.
Regarding the standard frame sizes, RFC 8219 follows the approach of its predecessors and also takes care that translation and encapsulation change the frame size.

Benchmarking Methodology for Stateful NAT44 Gateways
The Internet-Draft [8] proposed a methodology of how stateful NATxy (x, y are in {4, 6}) gateways might be benchmarked in compliance with RFC 8219 and RFC 4814. Here, only a very brief overview of the method is given focusing on stateful NAT44, which is relevant in the current case. The test setup for benchmarking stateful NAT44 gateways is shown in Fig. 9.
[...] numbers in the public to private direction would result in a drop of test frames that do not belong to an existing connection, that is, the vast majority of the test frames. Therefore, special care must be taken as described below.

The Basic Ideas of the Solution
To avoid the above-mentioned DoS attack, the source and destination port number ranges for the private to public direction are limited.Their sizes were used as a parameter, which is explained later in Section 5.4.
The method uses two test phases. To acquire valid four-tuples (source IP address, source port number, destination IP address, destination port number), which belong to a connection that is present in the connection tracking table, test phase 1 is used.
Test phase 1 serves two purposes:
- The connection tracking table of the DUT is filled.
- The state table of the Responder is filled with valid four-tuples.
Test phase 1 can be used without test phase 2 to measure the maximum connection establishment rate, which is a new performance metric, specific to stateful devices.
Test phase 2 must always be preceded by test phase 1. The "classic" measurement procedures of RFC 8219 (throughput, frame loss rate, latency, etc.) can be performed in test phase 2.
To ensure clear and repeatable measurements, testing under the following conditions is recommended: 1 (There is a separate measurement for the connection tear-down rate.) Condition 1 is ideal for measuring the maximum connection establishment rate and is also optimal for decreasing the duration of test phase 1 when it is followed by test phase 2. Conditions 2 and 3 are ideal for the throughput, latency, frame loss rate, etc. tests. These conditions can be easily achieved by:
- using a sufficiently large and empty connection tracking table for each test;
- using pseudorandom enumeration of all possible port number combinations (with the used source and destination port number ranges) in test phase 1;
- using a properly high timeout value in the DUT.
Please refer to the Internet-Draft [8] for all the details. The stateful extension of the siitperf measurement tool was developed in parallel with the Internet-Draft and its version used for the measurements in this paper is documented in [21].
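The interplay of the two test phases can be illustrated with a toy model. It is only a sketch under simplifying assumptions: address translation is ignored, Python sets and lists stand in for the connection tracking table of the DUT and the state table of the Responder, and the function name enumerate_port_pairs, the addresses, and the small port ranges are hypothetical. It does not reproduce the algorithm of siitperf.

```python
import itertools
import random


def enumerate_port_pairs(src_ports, dst_ports, seed=0):
    """Return every (source port, destination port) pair once, in pseudorandom order."""
    pairs = list(itertools.product(src_ports, dst_ports))
    random.Random(seed).shuffle(pairs)
    return pairs


SRC_IP, DST_IP = "198.18.0.2", "198.19.0.2"  # illustrative addresses

conntrack = set()   # stands in for the connection tracking table of the DUT
state_table = []    # stands in for the state table of the Responder
all_new = True      # did every phase 1 frame create a new connection?

# Test phase 1: each frame uses a port pair that has not been used before.
for sport, dport in enumerate_port_pairs(range(1, 41), range(1, 11)):
    four_tuple = (SRC_IP, sport, DST_IP, dport)
    all_new = all_new and four_tuple not in conntrack
    conntrack.add(four_tuple)
    state_table.append(four_tuple)

# Test phase 2: frames reuse four-tuples that are already known to be valid.
rng = random.Random(1)
phase2 = [rng.choice(state_table) for _ in range(1000)]
created_in_phase2 = sum(ft not in conntrack for ft in phase2)
```

Because every combination is enumerated exactly once into an initially empty table, each frame of test phase 1 creates a connection, whereas no frame of test phase 2 does, provided that no entry times out in the meantime.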
The proposed measurement method was validated by performing its tests with three radically different stateful NAT64 implementations [22].

Method for measuring scalability
Scalability regarding how performance increases with the number of active CPU cores can be expressed by the relative scale-up defined by (1), where the numerator can express any performance characteristic measured using n active CPU cores.
It should be noted that this definition requires a measurement series to be performed, that is, the number n of active CPU cores should be increased from 1 to the maximum available number of CPU cores. Using powers of 2 (1, 2, 4, 8, 16, etc.) as the values of n may effectively reduce the number of tests necessary.
Scalability regarding how performance decreases with the increasing number of concurrent sessions is expressed by (2), where the numerator can express any performance characteristic measured using n_i connections, and the denominator is the same performance characteristic measured using n_0 connections.
It should be noted that n_0 is the lowest realistic number of connections, which is typically several orders of magnitude higher than 1. The highest value of n_i depends on the range of interest. In the Internet-Draft [23], it is increased until the hardware limit is reached. The policy of doubling the number of connections in each step can be a suitable approach when fine-grain analysis is needed. In [22], it was increased 10-fold to reduce the number of tests necessary.
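The two scalability measures can be sketched in a few lines. As equations (1) and (2) are not reproduced in this text, the form used for the relative scale-up, P(n) / (n * P(1)), is an assumption; the ratio for the number of connections follows the verbal description above. The function names and the starting value of the connection series are hypothetical, and the sample throughput values are the IPv4 packet forwarding medians quoted for 1 and 2 CPU cores in the baseline measurements of this paper.

```python
def relative_scale_up(perf_n, perf_1, n):
    """Assumed form of (1): performance with n cores relative to n times the single-core value."""
    return perf_n / (n * perf_1)


def connection_scalability(perf_ni, perf_n0):
    """Ratio described for (2): performance with n_i connections over that with n_0 connections."""
    return perf_ni / perf_n0


# Measurement series: powers of 2 for the CPU cores, doubling for the connections.
core_counts = [2 ** k for k in range(5)]                    # 1, 2, 4, 8, 16
n0 = 1_000_000                                              # hypothetical starting point
connection_counts = [n0 * 2 ** k for k in range(9)]         # n0 ... 256 * n0

# Baseline IPv4 forwarding medians (frames per second) at 1 and 2 cores.
scale_up_2 = relative_scale_up(1_540_032, 883_553, 2)
print(f"relative scale-up at 2 cores: {scale_up_2:.3f}")
```

Under the assumed definition, a value of 1 would mean perfectly linear scale-up, so values below 1 quantify the cost of multi-core operation.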

FINDINGS AND SHORTCOMINGS OF THE PREVIOUS TESTS
Here, a summary is given about two previous efforts.

Measuring the scalability of four IPv4aaS technologies
Georgescu et al. [24] claim that no previous studies dealt with the scalability analysis of IPv6 transition technologies.Their paper covers four of the five most important IPv4aaS technologies, namely 464XLAT, DS-Lite, MAP-E, and MAP-T.

Measurement Method
In its Section 3, the paper contains a survey of methods for measuring scalability using different definitions. The authors call their choice "load scalability" and they measured how the performance of the examined systems degrades with the increase of the load. They measured four performance characteristics: round-trip delay, jitter, throughput, and packet loss. The distributed Internet traffic generator (D-ITG) [25] was used for packet generation in two different setups. The first setup contained 4 servers to execute the "ITGSend" function to generate traffic, the CE function of the examined technology, the PE function of the same technology, and the "ITGRecv" function to receive the traffic and send it back to the server executing the "ITGSend" function. The second setup contained 31 servers: except for the PE function, the number of servers executing the other three functions was increased tenfold.

Limitations of the Method
Although the results can give an important insight into the "load scalability" of the examined IPv4aaS implementations, they give no information about how their performance scales up with the number of CPU cores.
However, as the ongoing development in the hardware sector favors an increasing number of processing units over an increasing speed of a single unit [26], the authors consider it important to measure how the performance of the examined IPv4aaS implementations scales up with the number of CPU cores.

Measuring the scalability of 464XLAT and MAP-T
The authors' team has made a previous attempt towards the scalability comparison of the Jool implementation of 464XLAT and MAP-T technologies as presented in [6].

Measurement Method
As for the measurement tool, the dns64perf++ [27] program was used, which was originally developed for benchmarking Domain Name System (DNS) extensions for network address translation from IPv6 clients to IPv4 servers (DNS64) servers at moderate query rates [28]. However, later its performance was significantly increased and it was made suitable for benchmarking authoritative DNS servers up to 3,000,000 queries per second (qps) [29]. To ensure reply packets, a Knot DNS server was set up, as it could produce answers at a sufficiently high rate [29].
As for measuring scalability, the number of active CPU cores was set to 1, 2, 4, 8, and 16 in both CE and PE devices and the performance of the system was measured (by means of the number of successfully forwarded DNS queries and replies).
To determine if the CE or PE device was the bottleneck, their CPU utilization was also measured (in further tests).

Summary of Findings
It was found that the Jool implementation of MAP-T scaled up better than the Jool implementation of 464XLAT.
It was discovered from the CPU utilization results that the PLAT was the bottleneck when 464XLAT was tested, and the CE was the bottleneck when MAP-T was tested.

Limitations of the Tests and how to Overcome them
The measurements had several limitations:
1. Scalability could not be measured regarding the number of concurrent sessions.
2. The maximum connection establishment rate and throughput could not be clearly measured but rather their certain combination.
3. The standard frame sizes required by RFC 8219 could not be used.
4. The authors did not try influencing which device (CE or PE) was the bottleneck, that is, the performance of which device was ultimately measured.
The authors of the current paper believe that these limitations deserve some discussion, especially how serious they are and how they could be eliminated.
Limitation 1 came from the fact that the destination port number of the DNS queries was always the same fixed value (53) due to the nature of the measurement tool. (The source port range could not be widened due to the nature of MAP-T.) The authors consider this limitation as the most serious one, as the scalability regarding the number of sessions is very important for the ISPs, as it depends on the number of active users and the nature of their applications. By allowing the user to specify both the source and the destination port ranges to be used, siitperf can fully eliminate this limitation.
Limitation 2 also came from the nature of the measurement tool. Likely, it is not very serious from an ISP point of view, but it is rather annoying from an analytical point of view. Of course, one can easily measure the maximum connection establishment rate and throughput separately using siitperf.
Limitation 3 also came from the nature of the measurement tool; the DNS queries and replies have their specific lengths. Whereas this one seems to be a serious shortcoming at first glance, the first author's experience shows that if the bottleneck is the CPU capacity and not the speed of the network, then packet length does not make a significant impact on the number of transferred packets per second. (This was experienced with various SIIT implementations using 128 bytes and 1280 bytes frame sizes [30] and the Jool stateful NAT64 implementation using 64 bytes and 1024 bytes frame sizes [21].) As for siitperf, it supports all Ethernet frame sizes from 64 bytes to 1518 bytes.
Limitation 4 partially came from the decision of the authors of [6] to set the same number of active CPU cores for CE and PE devices. As one PE should serve a high number of CEs in an ISP scenario, the scalability of the PE device (in this case: PLAT of 464XLAT and BR of MAP-T) is the important question. Thus, setting the number of CPU cores to the maximum value in the CE device and changing the number of the CPU cores from 1 to 16 in the PE device is a better choice for the current experiments. However, it is still possible that the CE of MAP-T becomes the bottleneck when both devices have 16 active cores, thus it can be said that this limitation partially comes from the Dual DUT setup. This issue is revisited in Section 5.3.

Hardware and Software Measurement Environment
The measurements were carried out remotely using the resources of the NICT Hokuriku StarBED Technology Center, Japan. In all, seven so-called "P" series nodes were used, which were Dell PowerEdge R430 servers with the following relevant main parameters:
- two Intel Xeon E5-2683v4 2.1 GHz CPUs with 16 cores each;
- twelve 32 GB, 2400 MHz DDR4 RAM modules (in all 384 GB);
- Intel X540 dual-port 10 Gbps NIC (for experimenting).
Based on the previous benchmarking experiments of the authors ([28], [29], and [30]), Hyper-Threading and Turbo Boost were switched off in the BIOS of the servers to avoid scattered measurement results. This time the authors went one step further and set the clock frequency of all servers to a fixed 2.1 GHz using the tlp Linux package.
The servers were interconnected by a 10 Gbps switch using VLANs. Debian 9.13 GNU/Linux operating system with its kernel version 4.9.0-14-amd64 was used on those servers that functioned as Testers because the compilation of siitperf required the Data Plane Development Kit (DPDK) version 16 contained in Debian 9.
Debian 10.11 GNU/Linux operating system with its kernel version 4.19.0-18-amd64 was used on those servers that functioned as DUTs because it was needed for Jool.

Baseline Measurements
Some "baseline" measurements were performed to check the performance of the measurement environment itself, thus avoiding a bottleneck other than the examined IPv4aaS implementations.
[...] were enabled in the Linux kernel of DUT1 and DUT2. The Receive-Side Scaling (RSS) [31], also called multi-queue receiving, was set on all interfaces of the DUTs, so that the port numbers could also be taken into consideration when distributing the interrupts of packet arrivals among the active CPU cores. This was done using the appropriate versions of the following four commands (please refer to the two brace expansions to get them):
ethtool -N enp5s0f{0,1} rx-flow-hash udp{4,6} sdfn
The packet forwarding tests were executed using the following number of active CPU cores: 1, 2, 4, 8, and 16. It should be noted that 32 cores were not used because the authors found that the used 4.x Linux kernels did not distribute the interrupts evenly to the second 16 cores.
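The four commands denoted by the two brace expansions can be listed explicitly with a short script. It only builds and prints the command strings (it does not execute them); the interface names and the flag string are taken from the text, whereas the variable names are arbitrary.

```python
import itertools

# The two brace expansions of the text: enp5s0f{0,1} and udp{4,6}.
interfaces = [f"enp5s0f{i}" for i in (0, 1)]
flow_types = [f"udp{version}" for version in (4, 6)]

# "sdfn" selects the source and destination IP addresses (s, d) and the
# source and destination port numbers (f, n) as input of the receive hash.
commands = [
    f"ethtool -N {iface} rx-flow-hash {flow} sdfn"
    for iface, flow in itertools.product(interfaces, flow_types)
]

for command in commands:
    print(command)
```

Including the port numbers in the hash matters here because the test traffic uses very few IP addresses, so hashing on the addresses alone would direct nearly all packets to the same CPU core.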
During the binary search for determining the throughput, 0.01 % packet loss was allowed, that is, the acceptance criterion was 99.99 %.Sometimes the same criterion was used in [28] and [29] too.For widespread usage of the non-zero acceptance criterion and its rationale, please refer to [32].
As required by RFC 8219, bi-directional traffic was used.It should be noted that siitperf reports the throughput results as the number of forwarded frames per second per direction.Thus, its results were multiplied by 2 to give the number of all forwarded frames per second as network performance testers usually do.
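The throughput search with a non-zero loss tolerance can be sketched as follows. This is a generic binary search and not the code of siitperf; the function names, the search bounds, and the simulated DUT with a hypothetical capacity of 800,000 frames per second per direction are assumptions chosen for illustration.

```python
def throughput_binary_search(trial, lo, hi, acceptance=0.9999, precision=1):
    """Find the highest rate (frames/s per direction) whose delivery ratio meets the criterion.

    trial(rate) must return (sent, received); lo is assumed to pass and hi to fail.
    """
    while hi - lo > precision:
        mid = (lo + hi) // 2
        sent, received = trial(mid)
        if received >= acceptance * sent:
            lo = mid   # criterion met: try higher rates
        else:
            hi = mid   # too many frames lost: try lower rates
    return lo


def simulated_dut(rate, capacity=800_000, duration=60):
    """Toy DUT that forwards at most `capacity` frames per second and drops the rest."""
    sent = rate * duration
    received = min(rate, capacity) * duration
    return sent, received


per_direction = throughput_binary_search(simulated_dut, 0, 2_000_000)
total = 2 * per_direction  # bidirectional traffic: frames of both directions are counted
```

With the 99.99% acceptance criterion, the search settles slightly above the capacity of the simulated DUT, which shows how the tolerance trades a small, bounded loss for a result that is less sensitive to sporadic frame drops.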
As for the IPv4 packet forwarding tests, a 64-byte frame size was used, all experiments were performed 20 times, and the median, minimum and maximum of the 20 results were calculated. The results are shown in Table II. The results are consistent (minimum and maximum values are quite close to each other) and the throughput scaled up quite well with the number of CPU cores. The moderate increase of the median at 2 cores (from 883,553 fps to 1,540,032 fps) can be explained by the cost of multi-core operation and especially by the NUMA¹ architecture of the CPUs: the even-numbered CPU cores (0, 2, 4, etc.) belong to NUMA node 0, and the odd-numbered CPU cores (1, 3, 5, etc.) belong to NUMA node 1. (Please refer to Section 4.2.1 of [29] for a comparison of the scale-up of different NUMA architecture CPUs.)
IPv6 packet forwarding tests were also performed with the only difference being that an 84-byte frame size was used. The results are shown in Table III. From 1 to 8 cores, the IPv6 results were similar to the IPv4 results (they were somewhat lower and the scale-up was also somewhat poorer), but there was a significant drop in the IPv6 results at 16 cores. Its root cause would be interesting for Linux kernel developers, but this is beyond the scope of the current paper. For this research, the point is that the values were sufficiently high in all cases and thus it was ensured that the performance of the test system did not limit the results of the examined IPv4aaS solutions.

Measurement Setup and Tests
The measurement system followed the topology and settings shown in Fig. 11. The setup of Jool was quite straightforward. As for implementing CLAT, the SIIT kernel module of Jool (jool-siit) was used. Explicit Address Mapping was applied for the source address, and an IPv4-embedded IPv6 address was prepared using the NAT64 Well-Known Prefix (WKP), i.e., 64:ff9b::/96, for the destination address (in the IPv4 to IPv6 direction). The PLAT implemented stateful NAT64 using the NAT64 WKP. All commands are shown in Fig. 11. There is only one thing that needs an explanation: Jool has two operation modes, called "netfilter Jool" and "iptables Jool". In short, a "netfilter Jool" instance attempts to translate everything it can, whereas the packets to be translated must be explicitly given over to an "iptables Jool" instance by iptables. Both of them were tested in a few working points and no significant performance differences were found, thus "netfilter Jool" was used because of its somewhat simpler configuration (no need for iptables rules).
¹ NUMA is a memory system design where the memory access time depends on the location of the memory. A CPU can access its local memory faster than non-local memory. Please refer to [33] for an in-depth explanation.
As for performance metrics, maximum connection establishment rate and throughput were used.

Scalability against the Number of CPU Cores
It was examined how the performance of the system scaled up with the number of active CPU cores of the PLAT from 1 to 16, whereas the number of active CPU cores of the CLAT was always 32. However, only the first 16 were used, as mentioned in Section 3.
It was known from the preliminary tests that the performance of the system significantly depended on the number of connections in the connection tracking table of the PLAT. For the scalability test regarding the number of active CPU cores, 4,000,000 connections were chosen based on the recommendations of Vyacheslav Gapon for a high-loaded NAT server [34]. As the authors wanted to test 464XLAT and MAP-T under identical conditions and the source port number range of MAP-T had to be limited due to its nature, 1-4,000 and 1-1,000 were used as the source and destination port number pools, respectively.
The parameters for this and all following measurements were the same as for the baseline measurements: 64 bytes IPv4 frame size (that is, 84 bytes for IPv6), 99.99 % acceptance criterion, and 20 executions of the tests.
The maximum connection establishment rate and throughput measurement results of the Jool implementation of 464XLAT are shown in Table IV. Both quantities show a moderate scale-up from one to four CPU cores; the maximum connection establishment rate increased from 210,344 connections per second (cps) to 423,903 cps (relative scale-up: 0.504), whereas the throughput increased from 217,946 fps to 472,135 fps (relative scale-up: 0.542). Thus, the scale-up is significantly lower than that of the baseline measurements. Above 4 CPU cores, the addition of further active CPU cores did not result in a significant increase in the performance of the system. This is bad news for the scalability of the Jool implementation of 464XLAT.
The somewhat better scalability of the throughput than that of the maximum connection establishment rate (e.g., 0.142 vs. 0.139 relative scale-up at 16 cores) can be explained by the difference that during the throughput measurements, the connection tracking table is mainly read when a test frame is processed (except for the update of the timeout time of the given connection).However, a new connection is registered into the connection tracking table for each test frame during the maximum connection establishment rate measurement.
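The relative scale-up values quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes that the relative scale-up is the performance at n cores divided by n times the single-core performance; this definition is inferred from the reported numbers, and the function name is hypothetical.

```python
def relative_scale_up(perf_1_core, perf_n_cores, n_cores):
    """Performance at n cores divided by n times the single-core performance."""
    return perf_n_cores / (n_cores * perf_1_core)


# Median values of 464XLAT at 1 and 4 CPU cores, as quoted in the text.
cps = relative_scale_up(210_344, 423_903, 4)  # max. connection establishment rate
fps = relative_scale_up(217_946, 472_135, 4)  # throughput
print(f"{cps:.3f} {fps:.3f}")
```

A value of 1.0 would mean perfectly linear scaling, so the values of about 0.5 at four cores mean that roughly half of the added capacity was converted into performance.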

Checking the Bottleneck
To be able to tell whether the CLAT or PLAT was the bottleneck, in the previous paper [6], the CPU utilization was measured during tests re-executed at the maximum rate that had been found. Due to some fluctuations in CPU utilization during the preliminary tests, the authors performed the measurements 100 times and calculated the average for each second. Thus, quite smooth graphs were produced. Subsequently, those graphs were used to determine that the PLAT was the bottleneck.
However, when the authors studied the CPU usage graphs of the individual experiments of the current measurement, they decided not to use averaging (this will be explained later). The CPU idle time of the PLAT as a function of time using 1 CPU core and the measured maximum frame rate of the throughput measurement (218,778 fps) is shown in Fig. 12. It is visible that the idle time varied between 0 % and 20 % during the 60 s long interval of the throughput measurement. The close-to-zero values indicate that the PLAT would not have been able to transfer more packets per second. However, averaging the results of several measurements would produce about 10 % idle time, which could not prove that the PLAT was the bottleneck.
The authors' question is whether CPU utilization can be used to determine if the CLAT or PLAT is the bottleneck when both of them have 16 active CPU cores. To that end, one needs to check the CPU utilization of every single CPU core because an uneven distribution of the load can cause frame loss, even if some of the cores have free capacity. The CPU idle time of the CLAT and PLAT as a function of time using 16 CPU cores and the measured maximum frame rate of the throughput measurement (500,398 fps) is shown in Fig. 13 and Fig. 14, respectively. The fact that the idle time of all CPU cores of the CLAT was above 80 % during the entire test and the idle time of all CPU cores of the PLAT was around 60 % during the entire test makes it likely that the bottleneck was not the CLAT. However, this result does not explain why the
It should be noted that the stateful operation does not necessarily lead to such mediocre scalability. For example, the iptables stateful NAT44 implementation scales up much better than Jool. According to [23], the median throughput of iptables scaled up from 414,900 fps at 1 CPU core to 4,557,000 fps at 16 CPU cores, which is about an 11-fold increase and a 0.686 relative scale-up, in contrast to the current 0.136 relative scale-up. Thus, the authors believe that a design or implementation issue caused the poor scalability of Jool.
The CPU utilization of the PLAT was also examined during the maximum connection establishment rate measurement using a single CPU core and 212,501 cps during test phase 1. The CPU idle time of the PLAT as a function of time is displayed in Fig. 15. The filling of 4 million connections into the connection tracking table lasted somewhat less than 20 seconds. The idle time exhibited a decreasing tendency, which can be explained by the fact that the insertion of a new connection requires more work when there are more connections in the connection tracking table.
It is important to note that the CPU utilization measurements were done separately after finishing the maximum connection establishment rate and throughput measurements, as the execution of the used dstat command also consumed CPU power. The CPU utilization of each DUT was measured separately and the output of the Tester program produced during the CPU utilization measurement was discarded.
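The two pitfalls discussed in this section (averaging over several runs and looking only at aggregate utilization) can be illustrated with a short sketch. The numbers below are synthetic and are not measurement results; the function names are hypothetical, and the input is assumed to be per-second idle time percentages that have already been extracted from the CPU utilization logs.

```python
from statistics import mean


def average_across_runs(runs):
    """Per-second mean of several equally long single-core idle time series."""
    return [mean(values) for values in zip(*runs)]


def min_idle_per_core(samples):
    """samples[t][c] is the idle time (%) of core c in second t.

    Returns the minimum idle time observed on each core, so that a single
    saturated core is not hidden by idle capacity on the other cores.
    """
    return [min(column) for column in zip(*samples)]


# Synthetic data: two runs whose idle time alternates between 0 % and 20 %
# with opposite phase.
run_a = [0, 20, 0, 20]
run_b = [20, 0, 20, 0]
print(average_across_runs([run_a, run_b]))  # the average never gets near 0 %
print(min(run_a), min(run_b))               # each individual run does reach 0 %

# Synthetic two-core example: core 1 is saturated in the second sample.
print(min_idle_per_core([[85, 40], [90, 0], [88, 35]]))
```

In the synthetic example, the averaged series stays at 10 % although each run repeatedly reaches 0 %, which mirrors the reason given above for evaluating individual experiments instead of averages.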

Scalability against the Number of Connections
It was examined how the performance of the 464XLAT system depends on the number of connections in the connection tracking table of the PLAT. To perform the measurements with several different numbers of connections, the source port number range was always 1-4,000, and the destination port number range was first 1-250 and its upper limit was doubled 8 times, resulting in a final range of 1-64,000. Therefore, the number of connections during the nine consecutive measurement series was increased from 1,000,000 to 256,000,000. The filling up of the connection tracking table with the high number of connections took a significant amount of time, thus the timeout was modified from the default value (300 seconds) to a value higher than the duration of the entire experiment (test phase 1, gap between the two phases, test phase 2). The number of active CPU cores was always 16.
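The nine working points follow directly from the port number ranges. The sketch below enumerates them under the assumption that every (source port, destination port) combination yields exactly one connection, i.e., that a single source and destination IP address pair is used; the names are hypothetical.

```python
SOURCE_PORTS = 4_000  # source port number range 1-4,000


def connection_series(first_dst_limit=250, doublings=8):
    """Yield (destination port upper limit, number of connections) pairs."""
    for k in range(doublings + 1):
        dst_limit = first_dst_limit * 2 ** k
        yield dst_limit, SOURCE_PORTS * dst_limit


series = list(connection_series())
for dst_limit, connections in series:
    print(f"destination ports 1-{dst_limit:,}: {connections:,} connections")
```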
The results are disclosed in Table V. Both maximum connection establishment rate and throughput show continuous degradation with an increase in the number of connections. Due to the increase in the number of connections from 1 million to 256 million, the maximum connection establishment rate decreased from 576,790 cps to 272,061 cps, whereas the throughput decreased from 611,141 fps to 279,051 fps. The good news is that there was no sudden drop in the performance of the system at any working point, thus it complies with the graceful degradation principle. The bad news is that both performance characteristics decreased to less than half.

6. SCALABILITY OF THE JOOL IMPLEMENTATION OF MAP-T

Original Measurement Setup and Tests
Building a MAP-T system is a much more complex task than building a 464XLAT system. It requires a preliminary design, which includes address calculations. The used design was inspired by the example from the MAP-T setup documentation of Jool [35], which can be recommended as an easy-to-follow introduction to MAP-T for those readers not familiar with MAP-T. It could not be fully followed because it uses port sets with 2,048 source port numbers, whereas the authors wanted to use at least 4,000 source port numbers to achieve 256 million connections when using 64,000 destination port numbers. Finally, port sets of size 8,192 were chosen because, during the preliminary tests, the authors experienced serious performance degradation when they used 4,000 source port numbers and thus nearly exhausted a port set of size 4,096. The port set size of 8,192 allowed 8 users to share a single IPv4 address, and 3 bits were needed to identify a port set. The 192.0.2.0/24 public IPv4 address range was chosen for the measurements, thus the IPv4 suffix was 8 bits. Hence, the number of EA (Embedded Address) bits was 8+3=11. From among the possible 2^11 = 2,048 CEs, the authors chose the one that can be identified by the binary number "00100011011" (this number will be later referred to as the "11b" hexadecimal number). The first 8 bits (00100011) are the IPv4 suffix (35 decimal) and the last 3 bits (011) are the PSID (Port Set ID, 3 decimal). Thus, the port range is 24,576-32,767. (This means that the subscriber could use 192.0.2.35:24,576-32,767.) For an easy calculation, the authors chose 2001:db8:ce::/53 as the IPv6 prefix for the Basic Mapping Rule (BMR), hence the BMR is as follows: IPv6 prefix: 2001:db8:ce::/53; IPv4 prefix: 192.0.2.0/24; and number of EA bits: 11. The Default Mapping Rule (DMR) is 64:ff9b::/96.
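The address calculation above can be checked with a short sketch. It assumes that the PSID forms the most significant bits of the port number (contiguous port sets), which matches the port range given in the text, and that the end-user IPv6 prefix is the BMR IPv6 prefix followed by the EA bits. The function and key names are hypothetical and are not part of Jool.

```python
import ipaddress


def map_t_ce_parameters(bmr_ipv6, bmr_ipv4, ea_len, ea_value):
    """Derive the CE parameters from a Basic Mapping Rule and the EA bits.

    Assumption: the PSID occupies the most significant bits of the port
    number, so every port set is one contiguous range.
    """
    v6 = ipaddress.IPv6Network(bmr_ipv6)
    v4 = ipaddress.IPv4Network(bmr_ipv4)
    suffix_len = 32 - v4.prefixlen        # IPv4 suffix bits (8 here)
    psid_len = ea_len - suffix_len        # Port Set ID bits (3 here)
    port_set_size = 2 ** (16 - psid_len)  # ports per subscriber (8,192 here)

    ipv4_suffix = ea_value >> psid_len
    psid = ea_value & (2 ** psid_len - 1)
    first_port = psid * port_set_size

    end_user_len = v6.prefixlen + ea_len  # 53 + 11 = 64
    prefix_int = int(v6.network_address) | (ea_value << (128 - end_user_len))
    return {
        "ipv4_address": str(v4.network_address + ipv4_suffix),
        "psid": psid,
        "sharing_ratio": 2 ** psid_len,
        "port_range": (first_port, first_port + port_set_size - 1),
        "end_user_ipv6_prefix": str(
            ipaddress.IPv6Network((prefix_int, end_user_len))
        ),
    }


ce = map_t_ce_parameters("2001:db8:ce::/53", "192.0.2.0/24", 11, 0b00100011011)
print(ce)
```

Under these assumptions, the EA bits fill the last 11 bits of the fourth 16-bit group of the IPv6 prefix, which is why the chosen CE can be referred to by the hexadecimal number 11b.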
The topology of the measurement system is shown in Fig. 16. However, this is only the high-level topology of the system. The implementation of the CE device required the usage of namespaces for technical reasons specific to the Linux kernel and its Netfilter framework. In short, as described in Section 2.1.5, the stateful NAT44 translation should happen before the stateless NAT46 translation, but the Netfilter framework performs stateful NAT44 in POSTROUTING, and thus the order of the two translations would be incorrect. The required order can be ensured by using two namespaces, as elaborated in [35]. The internal topology of the CE device is shown in Fig. 17. Here, enp5s0f0 and enp5s0f1 are the physical network interfaces of the p106 server, whereas to_global and to_napt are virtual Ethernet interfaces for interconnecting the "napt" namespace and the global namespace. The Network Address and Port Translation (NAPT) function, also called stateful NAT44, is implemented by using iptables.
The detailed settings of the test system are disclosed in Appendix A.1 to support the reproducibility of the measurements. However, one important aspect needs to be mentioned here: the interrupts caused by packet arrivals on the virtual Ethernet interfaces were directed to CPU cores 16-31 using a method other than ethtool.

Scalability Against the Number of CPU Cores
The same tests (using the same parameters) were performed as with 464XLAT.
The maximum connection establishment rate and throughput measurement results of the Jool implementation of MAP-T are presented in Table VI. The scale-up of the two quantities was different. Maximum connection establishment rate exhibited a relatively good scale-up from 1 to 4 cores (from 443,388 cps to 1,054,225 cps), where it reached its maximum. (The median shows a slight decrease at 8 cores [to 1,050,862 cps] but it was within measurement error.) Throughput showed a relatively good scale-up from 1 to 8 cores (from 434,220 fps to 2,081,068 fps) and there was only a marginal (less than 4 %) increase at 16 cores (2,162,170 fps).
It should be recalled that, from among the two devices used to implement MAP-T, only the CE is stateful, and the BR is stateless. Additionally, the maximum connection establishment rate is a new metric that was introduced in the methodology for benchmarking stateful NATxy gateways [8]. Thus, it was visible that the CE device limited the maximum connection establishment rate performance of the system at 8 and 16 cores. Additionally, it is possible that the CE device also limited the throughput of the system at 16 cores, but this has not been confirmed. The authors were more interested in the scalability of the BR device, as stated in Section 3.2.3. Therefore, the measurements were repeated using two separate servers to implement the two sub-functions of the CE device. However, the original setup was not abandoned, to have a basis for comparison.

Scalability against the Number of Connections
The same tests (using the same parameters) were performed as with 464XLAT.
The results are presented in Table VII. The maximum connection establishment rate results showed two anomalies:
• The median values exhibited a significant increase when the number of connections was raised from 1 million to 8 million.
• The difference between the minimum and maximum values was extremely high at certain numbers of connections (the situation was the worst at 16 million connections), thus the results were rather unreliable.
By studying the measurement log files, it was found that both anomalies had a common root cause, which was that significant frame loss happened during test phase 1. This was in the order of magnitude of 0.01 %. Thus, the increase in the number of connections allowed more frames to be lost while the given step of the binary search was still considered successful. Although the authors could produce "better looking" results using a laxer acceptance criterion (e.g., 99.9 %, which allowed a 0.1 % loss), they did not do so because the maximum connection establishment rate characterizes the CE device and they focused on the scalability of the BR device.
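The effect of the acceptance criterion on the absolute number of tolerated lost frames can be made explicit with a small calculation. The sketch assumes that test phase 1 sends exactly one test frame per connection to be established; the function name is hypothetical.

```python
def allowed_lost_frames(frames_sent, accept_num=9999, accept_den=10000):
    """Largest number of lost frames with which a test step still passes."""
    return frames_sent * (accept_den - accept_num) // accept_den


for millions in (1, 8, 16, 256):
    frames = millions * 1_000_000  # assumed: one test frame per connection
    lost = allowed_lost_frames(frames)
    print(f"{millions:>3} million connections: up to {lost:,} lost frames")
```

With the 99.99 % criterion, the tolerated loss grows linearly with the number of connections, so a fixed amount of loss that fails a step at 1 million connections may pass at 8 million.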
As for the throughput, it showed practically no decrease when the number of connections was raised from 1 million to 256 million. This indicates excellent scalability.

Modified Measurement Setup and Tests
To resolve the potential problem that the insufficient performance of the CE device could be the bottleneck, separate servers were used to implement its two sub-functions. Hence, the usage of namespaces and virtual Ethernet devices was also eliminated.
The topology of the modified measurement system is disclosed in Fig. 18. As for the actual implementation, nearly the same commands were used as for the original measurement setup. However, the namespaces were omitted and the physical interface names were always used. In addition to this, the appropriate ethtool command was always used to distribute the interrupts among the CPU cores of the servers. The same tests (using the same parameters) were performed as with the original setup.

Scalability against the Number of CPU Cores
The maximum connection establishment rate and throughput measurement results of the Jool implementation of MAP-T measured with the modified measurement setup are presented in Table VIII. The results are fundamentally quite similar to the results using the original setup:
• The maximum connection establishment rate scaled up from 1 to 4 cores, as before. (The results were somewhat higher for all numbers of CPU cores.)
• The throughput scaled up quite well from 1 to 8 cores, as before (from 447,262 fps to 1,996,529 fps), and there was only a marginal (less than 3 %) increase at 16 cores (2,051,404 fps).
• Interestingly, the throughput results of the modified setup were somewhat higher than those of the original setup at a single core (as expected), but they were somewhat lower at 2-16 cores (which was unexpected, but the highest difference, at 16 cores, was only about 5 %).
Unfortunately, the improvement of the scale-up of the modified test system did not meet the expectations of the authors. In the search for root causes of this phenomenon, the following issues were identified:
• When the RSS of the original MAP-T test system was configured, the authors could not handle the distribution of the interrupts caused by packet arrivals to the virtual Ethernet interfaces among the CPU cores using ethtool. Furthermore, the solution that was used facilitated the assignment of the interrupts to CPU cores 16-31. Thus, the CE of the original test system used all 32 cores of a single server, whereas the two sub-functions of the CE of the modified test system used the first 16 CPU cores of the servers. The moderate increase in the maximum connection establishment rate of the modified test system can be attributed to the fact that the usage of the namespaces and virtual Ethernet interfaces was eliminated.
• The most annoying difference was that the throughput of the modified test system was somewhat lower than that of the original system. This difference could be attributed to a phenomenon that the authors had already experienced during DNS server testing: using different instances of theoretically identical test systems (built up by Dell PowerEdge C6220 servers with the same hardware and software configuration), somewhat different results were produced. For example, the authoritative DNS performance of Yet Another DNS Implementation For All (YADIFA) was 180,140 qps or 163,641 qps (please refer to Table I of [28]). To mitigate the problem, the same computers were used for the performance comparison of various DNS64 server and authoritative DNS server implementations in [28] and [29], respectively.
The most important achievement of the modified test setup for benchmarking MAP-T was that the authors managed to eliminate the namespaces, which eased CPU utilization measurements and was crucial for checking the bottleneck.

Checking the Bottleneck
As with 464XLAT, the CPU utilization of the servers was measured to point out the bottleneck. The CPU idle time of the BR device as a function of time using 1 CPU core and the measured maximum frame rate of the throughput measurement (448,438 fps) is shown in Fig. 19. The graph is ideal as it shows that the single CPU core was fully utilized during the throughput measurement.
The real question that needs to be asked is what happens when the BR has 16 active cores. The CPU idle time of the CE1 (performing stateful NAT44), CE2 (performing stateless NAT46), and BR devices as a function of time using 16 CPU cores and the measured maximum frame rate of the throughput measurement (2,052,198 fps) is shown in Fig. 20, Fig. 21, and Fig. 22, respectively. The fact that the idle time of all the CPU cores of the BR was 0 % proves that the BR could not process more packets.
Of course, the question remains open of why the addition of 8 cores to the 8-core BR results in only a 3 % performance increase. The investigation of this question exceeds the scope of the current paper but could be very useful for the developers of Jool.

Scalability against the Number of Connections
The maximum connection establishment rate and throughput measurement results of the Jool implementation of MAP-T measured with the modified measurement setup are shown in Table IX. The results are fundamentally quite similar to the results using the original setup. The most important differences are:
• The maximum connection establishment rate did not show a drop at 256 million connections.
• As expected, based on the CPU scale-up results, the throughput values were 5-6 % lower than in the case of the original test system.

Achievements
By performing the detailed scalability analysis of two different IPv4aaS technologies (464XLAT and MAP-T), the operation and practical usability of the novel method the authors proposed in Section 2 were successfully demonstrated.
It should be noted that the method does not build on any technology-specific properties. It handles the aggregate of CE and PE devices as a stateful NAT44 gateway, which is true in the case of all five IPv4aaS technologies. Therefore, the method can be used with any of them.
There are two important advantages of the proposed method:
1. It is RFC 8219 compliant and thus it can support all the required measurements, providing their full benefits.
2. It works according to the Dual DUT setup; therefore, it does not need a tester specific to the examined IPv4aaS technology, but a stateful NAT44 tester can be used to investigate any of them.
In Section 6.2 it was found (and also confirmed in Section 6.5) that the MAP-T test system showed a significantly different scale-up regarding the maximum connection establishment rate (good scale-up from 1 to 4 cores) compared to the throughput (good scale-up from 1 to 8 cores). This difference justifies the approach of the Internet-Draft [8] to distinguish maximum connection establishment rate and throughput.
In this paper, two performance metrics were used: the maximum connection establishment rate, which is a new, stateful-specific metric, and the throughput, which is one of the "classic" metrics of RFC 8219. In [22], the proposed measurement method for stateful NATxy gateways was validated by performing tests with three radically different stateful NAT64 implementations. It was demonstrated that the further "classic" tests of RFC 8219, e.g., latency, frame loss rate, packet delay variation (PDV), etc., can also be executed in test phase 2. As the aggregate of CE and PE devices is a stateful NAT44 gateway, the same applies to any of the five IPv4aaS technologies: the above-mentioned measurements can be executed with them. (Please refer to Section 7.2 for the limitations regarding their results.)

Limitations of the Method
The Dual DUT setup also has its limitations in that the performance of the CE and PE devices of the IPv4aaS technologies is measured together. It has multiple consequences, including the following:
1. Additional work is needed to find the bottleneck. To that end, a possible solution based on CPU utilization measurement was also provided.
2. If the bottleneck is not the PE device and one would like to make it the bottleneck, as its scalability is usually the focal point of the research, then an extra attempt is needed to make the PE device the bottleneck. For example, a more powerful CE device or multiple CE devices may be used.
3. The results characterize the aggregate of CE and PE devices, and the metrics for individual CE or PE devices are not trivial to derive.
As for the third limitation, for example, when the latency is measured, the user does not know in what proportion the latency result should be divided between the CE and the PE devices, as there is significant asymmetry in them. However, the latency of the aggregate of the CE and the PE devices can serve as an upper bound for the latency of the individual devices. The same applies to the frame loss rate, too.
Regarding the equivalence and difference between the dual DUT setup and the single DUT setup of RFC 8219 in the case of stateless, identical devices, theoretical considerations were made and measurements were performed in the case of IPv6 routers and SIIT gateways in [36].

Potential Alternative Solutions
In Section 2, the problem of benchmarking the five IPv4aaS technologies according to the Dual DUT setup of RFC 8219 was simplified into the problem of benchmarking a stateful NAT44 gateway according to the single DUT setup of RFC 8219. To solve this simpler problem, the usage of the benchmarking methodology proposed in Internet-Draft [8] was recommended. As far as the authors know, it is the only RFC 8219-compliant solution for the problem. However, there are other solutions for benchmarking a stateful NAT44 gateway. In [37], it was surveyed how other researchers measured the performance of the iptables stateful NAT44 solution. In Section 3.2.2 of [37], it is written that "researchers were creative enough to accommodate to the limitation of NAT44 that connections may be initiated only from the private side. They put the iperf or D-ITG server on the public side and thus the measurement was feasible." Of course, these solutions are also applicable when IPv4aaS technologies are benchmarked. Moreover, the results can be used to compare their performance. However, using these solutions will result in losing the benefits of the standard RFC 8219 compliant measurements as well as the benefits of the Internet-Draft [8], e.g., distinguishing maximum connection establishment rate and throughput.

Comparison with other Methods
The support for all the "classic" tests of RFC 8219 is the most important distinguishing factor of the proposed method, but it is not the only one.
Similar to its predecessors, RFC 8219 requires testing with bidirectional traffic; however, it also adds optional testing with unidirectional traffic. In the case of the IPv4aaS technologies, the traffic volume in the "download" direction is usually significantly higher than the traffic in the "upload" direction. As demonstrated in [22], the proposed method can be used with unidirectional traffic in either of the two directions. The alternative solutions mentioned in Sections 3.1, 3.2, and 7.3 cannot use unidirectional traffic in the download direction, because that traffic comes from a "reflector" device that sends back the received packets (or sends answers to the received DNS queries, as elaborated in Section 3.2).
Another distinguishing factor of the proposed method is that it satisfies the requirement of RFC 4814 for pseudorandom port numbers and also provides a clear way for scalability testing regarding the number of connections (often called network flows).

Other Aspects than Performance
This paper has focused on performance and scalability; however, when network operators select the most suitable IPv4aaS solution for their purposes, several other factors need to be considered, including security, reliability, documentation, support, experience with the software or hardware solution, references of the vendor, hardware requirements (if a software implementation is used), and price (if a free software solution is not used).
As for the security analysis of the IPv4aaS solutions, there are numerous recent publications of Al-Azzawi covering the security analysis of 464XLAT [38], DS-Lite [39], and Lw4o6 [40] from both theoretical and practical aspects.
D'yab has conducted a comprehensive survey of the IPv4aaS technologies including their implementations [41].

Plans for Future Research
In this paper, siitperf was used for stateful NAT44 measurements, but it can do SIIT (stateless NAT46 or stateless NAT64) and stateful NAT64 measurements too. Thus, it can be used to benchmark the two subcomponents of 464XLAT (CLAT and PLAT) separately following the Single DUT setup of RFC 8219. As for the SIIT performance of Jool, the first author has already performed some tests with another measurement tool called nat64tester; however, it did not support RFC 4814 pseudorandom port numbers. Instead, it always sent the very same test frames, thus only two CPU cores could be used to process the interrupts of packet arrivals (one for each direction). Therefore, it could not be used for measuring the scalability of the SIIT performance of Jool with the number of active CPU cores [30]. The authors are currently working on it using siitperf. The scalability of the stateful NAT64 performance of Jool was already benchmarked to validate the methodology for benchmarking stateful NAT64 gateways [22].
If the two sub-functions of the MAP-T CE are implemented by two separate devices (stateful NAT44 and stateless NAT46) then they can be benchmarked one by one following the Single DUT setup of RFC 8219 and using siitperf as the measurement tool.
At the time of performing the above measurements, the authors did not have a MAP-T BR tester, but its implementation was in progress [42]. Since then, Al-hamadani has finished the implementation of maptperf, the world's first free software RFC 8219 compliant MAP-T BR tester. It is available under the GPLv3 license from GitHub [43] and it is documented in [44]. The benchmarking of the scalability of the Jool implementation of MAP-T BR using maptperf is already in progress.
The research plans of the team of the first author include the extension of maptperf for benchmarking MAP-E BR, as well as the implementation of a Tester for benchmarking the lwAFTR component of Lw4o6.

CONCLUSION
An RFC 8219-compliant benchmarking method was proposed for the performance and scalability analysis of the five most important IPv4aaS technologies. The method works according to the dual DUT setup of RFC 8219 and it uses a stateful NAT44 tester and not a technology-specific tester.
The proposed method was validated and its operation was demonstrated in the example of the Jool implementation of the 464XLAT and MAP-T IPv4aaS solutions.
464XLAT showed a moderate scale-up with the number of CPU cores from one to four cores; the maximum connection establishment rate increased from 210,344 cps to 423,903 cps (relative scale-up: 0.504), whereas the throughput increased from 217,946 fps to 472,135 fps (relative scale-up: 0.542). Above 4 CPU cores, the addition of further active CPU cores did not result in a significant increase in the performance of the system. Due to the increase in the number of connections from 1 million to 256 million, the maximum connection establishment rate decreased from 576,790 cps to 272,061 cps, whereas the throughput decreased from 611,141 fps to 279,051 fps. It was checked that these numbers characterize the PLAT part of the system.
As for MAP-T, the scale-up of the maximum connection establishment rate and throughput was different. Maximum connection establishment rate exhibited a relatively good scale-up from 1 to 4 cores (from 443,388 cps to 1,054,225 cps), where it reached its maximum. Throughput showed a relatively good scale-up from 1 to 8 cores (from 434,220 fps to 2,081,068 fps) and there was only a marginal (less than 4 %) increase at 16 cores (2,162,170 fps). However, the maximum connection establishment rate characterizes the CE device, and not the BR device, which is the focal point of the scalability of the system. As for the scalability of the MAP-T system against the number of connections, the throughput exhibited a constant high performance in the entire range, more than 2 million fps, which means both an excellent performance and an excellent scalability.
All in all, it was found that the Jool implementation of MAP-T scaled up much better than that of 464XLAT.

A.1. Setup of the Original MAP-T Test System
The CE and BR devices were configured using the commands shown in Fig. 23 and Fig. 24, respectively. The interrupts caused by packet arrivals on the physical interfaces were distributed among CPU cores 0-15 as before, but the virtual Ethernet interfaces required a different method. The commands are shown in Fig. 25. Using this method, the interrupts of the packets arriving at the to_global and to_napt virtual Ethernet interfaces were processed by CPU
Ádám Bazsó received his BSc in electrical engineering from Széchenyi István University, Győr, Hungary, in 2022. He is a member of the Cybersecurity and Network Technologies Research Group of the Faculty of Mechanical Engineering, Informatics and Electrical Engineering of the Széchenyi István University.

Fig. 11. Test setup for scalability measurements of the Jool implementation of 464XLAT.

Fig. 12. CPU idle time of the PLAT using 1 CPU core and maximum frame rate of the throughput measurement.

Fig. 13. CPU idle time of the CLAT using 16 CPU cores and maximum frame rate of the throughput measurement.

Fig. 14. CPU idle time of the PLAT using 16 CPU cores and maximum frame rate of the throughput measurement.

Fig. 15. CPU idle time of the PLAT using 1 CPU core and maximum frame rate of the maximum connection establishment rate measurement.

Fig. 16. Original test setup for scalability measurements of the Jool implementation of MAP-T.

Fig. 19. CPU idle time of the BR using 1 CPU core and maximum frame rate of the throughput measurement.

Fig. 20. CPU idle time of the CE1 (stateful NAT44) using 16 CPU cores and maximum frame rate of the throughput measurement.

Fig. 21. CPU idle time of the CE2 (stateless NAT46) using 16 CPU cores and maximum frame rate of the throughput measurement.
1. During test phase 1, all test frames should create a new connection.
2. During test phase 2, test frames should never create a new connection.
3. Connections should never be deleted (due to timeout or replacement) during test phase 1 or test phase 2.
These were IPv4 and IPv6 packet forwarding tests. The test setup shown in Fig. 10 was used. The MAC addresses are presented in the figure because they had to be set explicitly for siitperf, as it was not able to reply to Address Resolution Protocol (ARP) or Neighbor Discovery Protocol (NDP) requests.

TABLE V. PERFORMANCE OF THE JOOL IMPLEMENTATION OF 464XLAT AS A FUNCTION OF THE NUMBER OF CONNECTIONS, 16 CPU CORES

TABLE VI. PERFORMANCE OF THE JOOL IMPLEMENTATION OF MAP-T AS A FUNCTION OF THE ACTIVE CPU CORES (ORIGINAL SETUP), 4M CONNECTIONS
Fig. 22. CPU idle time of the BR using 16 CPU cores and maximum frame rate of the throughput measurement. (The line expressing the utilization of CPU core 15 hides the other lines, as all 16 are nearly identical.)