NATRA: Network ACK-Based Traffic Reduction Algorithm

Traffic monitoring involves packet capturing and processing at a very high rate of packets per second. Typically, flow records are generated from the packet traffic, such as TCP flow records that feature the number of bytes and packets in each direction, flow duration, number of different ports, and other metrics. Delivering such flow records for network traffic flowing at tens of Gbps is rather challenging in terms of processing power. To address this problem, traffic thinning can be applied to reduce the input load by swiftly discarding useless packets at the sniffer NIC or driver level, which effectively reduces the load on the software layers that handle traffic processing. This work proposes an algorithm that drops empty ACK packets from TCP traffic, thus achieving a significant reduction in the packets per second that must be handled by each traffic module. The tests discussed below show that the algorithm achieves a 25% decrease in the packets-per-second rate with minimal information loss.


I. INTRODUCTION
To meet increasing bandwidth demands, Internet providers are deploying high-speed lines across the network. Core providers such as Internet2 and ESnet have upgraded their link rates up to 100 Gbps, and the same applies to regional networks [38].
Traffic monitoring probes are typically deployed in such networks to assess quality of service and reinforce network security. Such traffic probes produce enriched flow records on the fly, which can then be used to deliver Key Performance Indicators (KPIs) about networks, systems, and applications. Because TCP traffic accounts for most Internet traffic [10], [15], [16], [21], [25], a great number of quality-of-service KPIs are related to TCP connection parameters. For example, a high number of duplicated ACKs may be indicative of packet loss. Thus, enriched TCP connection records that include information about retransmissions and other features are needed.
In this challenging scenario, capturing and analysing every packet in the network for traffic analysis is no longer feasible [34]. Capturing every single packet flowing through the network will result in a practically intractable volume of data. Furthermore, the cost of sniffers will be very high, in terms of both processing power and storage requirements [42]. Such a large data rate and information volume calls for a radical departure from the traditional approach of capturing every piece of data and analysing it later. Selective packet capture is sure to be foundational in the analysis of traffic rates of 40 Gbps and beyond [6], [17], [22].
Selective packet capture implies that some information will be lost in flow records; accuracy and performance are in a trade-off relationship. We note that the performance metric for nearly all packet sniffers is the data rate that they can handle in packets per second. At the same bits-per-second (bps) rate, smaller packet sizes increase the load in packets per second. Each incoming packet is an event that the packet sniffer must handle. As packets per second increase, the sniffer must work harder to cope with the traffic load.
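As a purely illustrative calculation (the 800-byte average packet size below is an assumption for the example, not a value measured in this work), the relationship between bit rate and packet rate is:

    40\ \text{Gbps} \,/\, (800 \times 8\ \text{bits per packet}) = 6.25\ \text{Mpps}

A 25% reduction in packets per second, such as the one reported later in this paper, would therefore lower the load at the same bit rate to roughly 0.75 \times 6.25 \approx 4.7 Mpps, and the saving grows as the average packet size shrinks.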
In light of the above concerns, we advocate for traffic thinning techniques that decrease the sniffer load in terms of packets per second, without serious information loss. Such traffic thinning techniques can be implemented in the sniffer's Network Interface Card (NIC), immediately after the packet is captured and before it is passed to upper layers for traffic analysis. Alternatively, thinning can be implemented in network processor hardware, which allows for a degree of programmability at the hardware level [4], [7], [8], [12]. The latter techniques can also be implemented in the device drivers. Indeed, modifications to device driver software have proven to be useful in achieving 10-Gbps capture rates with commodity hardware [9], [17], [27], [40].
This paper proposes an algorithm that selectively removes ACK packets to reduce the sniffer data workload. By doing so, we can reduce the packets per second by a remarkable 25%. Some information might be lost, as ACKs are useful for quantifying retransmissions, especially in the case of duplicate ACKs. Nevertheless, retransmissions can also be estimated from the retransmitted packets themselves, which will appear in the traffic stream in the reverse direction. With this workaround, the overall information loss is minimal and is worth the resulting gains in performance.
We also note that the thinning process itself adds processing load to the sniffer, potentially leading to packet loss. Our proposed technique was therefore designed to minimize machine instructions and to avoid moving data out of the CPU cache to main memory.
We have named our proposed traffic thinning algorithm NATRA, for ''Network ACK-based Traffic Reduction Algorithm''. We propose several options for NATRA implementation, such as placing it between the NIC and the Storage Performance Development Kit (SPDK) libraries [35], [36]. This option can be used to store packets in high-speed hard disks such as non-volatile memory express (NVMe) disks. The results of this research can help designers of high-speed packet sniffers to develop cost-effective implementations. We also discuss a prototype software implementation that is ready for practical use.
The rest of this paper is organized as follows. In Section II we give an overview of the design of high-speed traffic probes and discuss potential bottlenecks in the traffic flow. Then, in Section III, we introduce the NATRA algorithm. Section IV discusses performance evaluation and implementation. Section IV-A presents an analytical model that allows us to extrapolate results to arbitrarily large data rates. Section V presents the NATRA performance evaluation results and discussion, and Section VI draws the conclusions.

II. PACKET SNIFFER SYSTEMS
This section discusses the architecture of typical packet sniffer systems, to highlight the bottlenecks that limit system performance. We conclude that the packet arrival rate, in packets per second, has a great effect on overall system performance. Thus, selective packet capture can help to alleviate the packet sniffer's workload and to increase the range of network bandwidths for which the sniffer is effective.
Packet capture takes place in the NIC, which can either be a network processor [19] or a general-purpose device [28]. The NIC usually receives traffic from a SPAN port or splitter. The driver manages the load, balancing the traffic analysis among several processing cores to generate flow records in the form of time series or aggregate statistics. Finally, the outcome of traffic analysis (and/or the raw packet trace) is passed to permanent storage, typically using a disk array (RAID), with SATA, SSD or even NVMe disks for traffic storage at 40 Gbps or more. We note that load balancing in cores can take advantage of Receive Side Scaling (RSS) techniques [37], which split the packet stream into several queues at the NIC. Then, the traffic from each queue is handled by a different processing core, achieving parallel traffic processing.
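As a rough illustration of how RSS spreads flows over queues and cores, the sketch below selects a queue from the packet addressing fields. Real NICs typically use a Toeplitz hash over the five-tuple with a configurable key, so the XOR-fold hash here is only a simplified stand-in for the idea, and all names are hypothetical.

    #include <stdint.h>

    /* Simplified RSS-style queue selection: packets of the same flow always
     * map to the same queue, so each core sees a consistent subset of flows. */
    static unsigned rss_queue(uint32_t src_ip, uint32_t dst_ip,
                              uint16_t src_port, uint16_t dst_port,
                              unsigned num_queues)
    {
        uint32_t h = src_ip ^ dst_ip ^ (((uint32_t)src_port << 16) | dst_port);
        h ^= h >> 16;              /* fold upper bits into the lower bits */
        return h % num_queues;     /* index of the receive queue / core */
    }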
The state of the art in high-speed packet capturing shows significant advances in the area, especially in SmartNICs. Such NICs provide spare processing resources so as to alleviate the processing load on the host CPU [23], [32]. Trumpet [29] leverages DPDK technology to provide fast triggers and statistics on 10 Gbps networks. Flowatcher [41] is also based on DPDK and provides packet- and flow-level statistics at speeds of up to 10 Gbps. Flowscope [14] enables storing high-speed traffic at up to 100 Gbps for subsequent processing. We note that all these approaches are complementary to the research presented in this paper, as they focus on high-speed traffic capturing, but not thinning. Indeed, NATRA could also be implemented at the SmartNIC level.
In any case, we distinguish several steps in the process for packet capture and analysis, as seen in Fig. 1:
• First, the NIC must copy the packet from the network into its internal memory and sometimes the NIC will apply pre-processing like RSS or packet filtering. These steps are usually performed in hardware at the line rate, and no packet loss should happen at that point.
• Second, the driver handles NIC interrupts and relays the packets from the NIC internal memory to kernel memory through DMA cycles. Clearly, as the interrupt rate increases, this copying task becomes more difficult. Hence, interrupt-coalescence techniques are used to reduce the processing burden at the interface between the NIC and the driver [24]. If the interrupt rate exceeds the driver capacity, packets will be lost.
• Third, the processing cores copy the packets from kernel memory into user space and perform the analysis. The first step is extracting protocol fields such as the IP addresses or VLAN tags. This step is time-consuming, especially with headers containing variable fields, and requires a high CPU speed. To achieve greater throughput, CPU parallelism is normally exploited using receive-side scaling (RSS). Thus, many cores are required at the sniffer, which increases hardware costs. Because the CPU is limited in terms of both speed and number of cores, packet loss may also happen at this stage.
• Fourth, the outcome of this analysis (flow records, packet captures, etc.) must be dumped to permanent storage. Record storage is also limited by read/write speed, even though this speed has increased dramatically with the recent introduction of SSD and NVMe technologies. Ultra-fast storage is very expensive, however, and the sniffer storage needs are large. Thus, speed and storage volume are in a trade-off relationship, and both are required for high-speed packet capture. Indeed, storage is the most serious bottleneck in the process and may also lead to packet loss.
At any of the above stages, traffic thinning can reduce the load at the subsequent stage and thereby reduce the cost of sniffers. The next section presents the NATRA algorithm, which is specifically designed to reduce the sniffer workload.

A. TCP CONNECTION RECORDS
Before further discussion of the algorithm design and implementation, we briefly discuss the dynamics of flow record generation in a packet sniffer. TCP packets are used by the packet sniffer to obtain the flow records that are required to present KPIs in dashboards. These TCP flow records include several fields, which can be grouped into the following categories:
• Volume: TCP flow size and duration, which are helpful for calculating the traffic per port, per IP address, and other indicators.
• Congestion indicators: Zero window announcements from the client and server can indicate a possibly saturated host.
• Flags: SYN or RST flags help to identify connection attempts with no response, abrupt resets, and other interruptions.
• RTT: can be measured as the time difference between sending a SYN packet and receiving the first ACK packet, as shown in Figure 2.
• Loss indication: can be estimated from the number of observed retransmissions and the number of duplicated ACKs.
To obtain the above indicators, a hash table that contains one entry per TCP flow (or two entries for the two unidirectional flows that comprise a connection) is stored in memory. These entries are updated with all incoming TCP packets that belong to the flow. Most importantly, the processing cost is not equal for all the above performance indicators, which practically means that not all the indicators can be obtained online.
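A minimal C sketch of what one such hash-table entry might look like is given below; the field names and types are hypothetical, and only the categories listed above (volume, congestion indicators, flags, RTT, loss indication) come from the text.

    #include <stdint.h>

    /* Illustrative per-flow record kept in the sniffer's hash table. */
    struct tcp_flow_record {
        uint32_t src_ip, dst_ip;          /* flow key */
        uint16_t src_port, dst_port;
        uint64_t bytes, packets;          /* volume */
        uint64_t first_ts_us, last_ts_us; /* duration = last - first */
        uint32_t zero_window_events;      /* congestion indicator */
        uint32_t syn_count, rst_count;    /* connection attempts and resets */
        uint32_t handshake_rtt_us;        /* RTT from SYN to first ACK */
        uint32_t retransmissions;         /* loss indication */
        uint32_t duplicate_acks;
    };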
Allocating memory to keep track of ACKs at 40 Gbps or more is a challenge in itself. Complicating the issue, the interarrival time between a packet and its matching ACK can reach 500 ms [5] when delayed ACKs are in use (200 ms in Windows operating systems [39]). Consequently, the number of records in memory grows accordingly, and the lookup time needed to find a matching packet severely degrades sniffer performance.

III. NATRA TRAFFIC THINNING ALGORITHM
The proposed traffic thinning algorithm achieves a significant reduction of the TCP packet arrival rate by removing ACK packets without payload. We focus on thinning TCP traffic because most Internet traffic uses TCP, as can be seen in Figure 3. This figure shows the percentage of TCP traffic over total traffic recorded in CAIDA data from 2002 to 2018 [10]. The alternative transport protocol QUIC [20] works over the UDP layer and improves the communication strategy for connection-oriented web applications, but TCP remains the main protocol used in network trunk lines.
By thinning TCP traffic, we expect to obtain significant savings in the overall packet arrival rate. In what follows, we use the term ACK to denote packets that carry no data payload and serve only to acknowledge the arrival of packets from the other end of the connection.

A. INFORMATION LOST WHEN REMOVING ACK PACKETS
This section discusses the information that is lost when removing ACKs and assesses the importance of such information loss compared to the resulting performance gains.
Next, we discuss how KPIs will be affected by the removal of ACKs. We note that the two main affected parameters are the estimation of retransmissions (including duplicated ACKs) and RTTs. As for the former, ACKs are useful for determining whether a packet has been lost, which is typically indicated when several ACKs arrive with acknowledgment numbers smaller than the last sequence number seen in the same TCP flow. As an alternative, we can use the retransmission itself, which will eventually be triggered by the transmitter. Packet loss can also be detected from holes in the stream of sequence numbers. In this case, we should consider that such gaps can also arise if packets arrive out of order.
We further note that in practical analysis only TCP connections with a significant number of retransmissions are relevant. For retransmitted packets, the sequence number interval (sequence number, sequence number +length) will overlap with that of the original transmitted packets.
Therefore, the corresponding TCP connections can be detected easily by counting retransmitted packets instead of duplicated ACKs [26].
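A minimal C sketch of this check is shown below, under the assumption that the highest sequence number observed so far is tracked per flow; sequence-number wrap-around is ignored for brevity, and, as noted above, out-of-order arrivals can also trigger the test.

    #include <stdint.h>

    /* Flag a likely retransmission: the incoming data packet covers bytes
     * [seq, seq + payload_len), so if it starts below the highest byte
     * already observed for the flow, its interval overlaps earlier data. */
    static int is_retransmission(uint32_t seq, uint32_t payload_len,
                                 uint32_t highest_seq_seen)
    {
        return payload_len > 0 && seq < highest_seq_seen;
    }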
RTT is a significant KPI that should not be lost. RTT can help in detecting network paths subject to considerable latency. Clearly, the estimation of RTT will be seriously affected by the removal of ACKs. At this point, we propose an alternative that allows us to estimate the RTT per connection by saving only the first ACK-only packet of the connection. The implementation of this alternative design involves tracking the connection state only during the connection establishment phase, which can be expensive in terms of processing. Even so, we have obtained a remarkable speed of 31.45 Gbps per core with an i5-4570 3.20-GHz CPU in our software implementation for tracking the connection establishment phase.
This technique presents a trade-off between the accuracy of RTT estimation and performance gains in terms of packets per second due to ACK removal. In practice, the online estimation of all the TCP RTTs in a connection is very processing intensive, and the problem is usually solved with offline analysis at speeds of many Gbps. Moreover, RTT estimation can also be performed using the ACK packets that do carry payload, which will not be removed. The proposed technique for RTT estimation in the connection handshake, together with the samples produced by ACK packets with payload, is sufficient for retaining the RTT KPI.

B. NATRA THINNING PROCESS
Our proposed thinning process drops all packets that carry the ACK flag and no data, except those that participate in the handshake that establishes the TCP connection (see Fig. 2). To this end, we only keep track of the state of the connection establishment phase during the handshake for all TCP flows. This phase includes the first SYN packet from the client, the SYN+ACK packet from the server, and the first ACK from the client. The handshake RTT is then calculated from this information.
We will use different real traffic traces to measure the reduction provided by the proposed method. For the sake of generality, six different traces were used, as shown in Table 1. The first one was taken from a lab in our university; it has a low flow rate and a heterogeneous mix of traffic. This trace mostly includes traffic generated by the academic web applications used by students and teachers, in addition to traffic involved in standard Internet navigation. The second trace was recorded from the traffic at a large company; it includes traffic from office and business applications. In addition, we used three more traces from the CAIDA repository [11]. These were recorded by the CAIDA group from a central segment of the Internet and are publicly available, which ensures the reproducibility of the results. Hence, the traces give a balanced mix of academic, industrial, and public Internet traffic. The three CAIDA traces cover an hour's worth of traffic through the Equinix-Chicago link.
Finally, the last trace is from a university Internet link. To be able to replay real network traffic, we need traces with a well-defined MAC layer, full packets (no caplen truncation), non-anonymized addresses, and a bandwidth that can be replayed. Table 1 shows the reduction achieved when using NATRA. The total reduction in packets ranges between 18.7% and 34.54%, depending on the trace. The difference in reduction ratios between traces depends on the number of ACK packets without payload. We note that traces that include applications with a high level of interactivity or, generally speaking, TCP flows with data transfer in both directions have fewer ACK packets. Figure 4 shows the reduction in traffic that we could achieve during a full hour in each trace. We selected one hour from each dataset so that all traces are treated in a similar form, because the CAIDA traces contain one hour of data. Data was aggregated in intervals of 5 minutes. Different datasets provide different reduction rates. We obtained rates close to 40% when testing the private-company trace, which is the best performance overall. The packet reduction is stable over time for all of the networks.
Additionally, we have also assessed the impact on traffic from a large datacenter of a banking company. The total packet reduction is very similar to the reduction observed in the previous traces. We observe a potential reduction of 36.5% over the TCP traffic and 27.36% over total traffic. The percentage of TCP packets over the total is 71%, also similar to that observed in the CAIDA traces.
Overall, the experimental results indicate that NATRA shows promising performance in a variety of traffic scenarios, reducing traffic by nearly 25%. Traffic reduction can reach nearly 40%, depending on the traffic characteristics and the specific hour in which the data was recorded.
As it turns out, QUIC is gaining importance in datacenter traffic, as it will be adopted for HTTP/3. Even though traffic thinning for QUIC is out of the scope of this paper, we perform an exploratory analysis of packet sizes in Figure 5, which shows the QUIC packet size CDF from traffic on our university access link. Interestingly, 30% of packets are shorter than 90 bytes, and 56% of packets are larger than 1378 bytes. Although QUIC data is encrypted, packet sizes could be used to differentiate between data and acknowledgement packets. This finding opens new research avenues for QUIC traffic thinning, which are left for future work.
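A speculative sketch of such a size-based heuristic follows; the thresholds mirror the CDF discussed above, but the mapping from packet size to packet type is an assumption rather than a validated classifier.

    #include <stdint.h>

    enum quic_guess { QUIC_LIKELY_ACK, QUIC_LIKELY_DATA, QUIC_UNKNOWN };

    /* Classify a QUIC/UDP packet by payload size only, since the payload
     * itself is encrypted. Thresholds are taken from the CDF of Figure 5. */
    static enum quic_guess classify_quic_by_size(uint16_t udp_payload_len)
    {
        if (udp_payload_len < 90)
            return QUIC_LIKELY_ACK;    /* short packets: acknowledgement-like */
        if (udp_payload_len >= 1378)
            return QUIC_LIKELY_DATA;   /* near-MTU packets: data-bearing */
        return QUIC_UNKNOWN;
    }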

IV. NATRA DESIGN AND IMPLEMENTATION
In this section, we provide insights into the design and implementation of NATRA. Briefly, NATRA drops all ACK packets that carry no data payload, except the first ACK sent from the server to the client (which also carries the SYN flag) and the subsequent ACK sent from the client to the server. To identify the first ACK, a connection record must be kept in memory that includes the connection five-tuple (source and destination IP addresses, ports, and sequence number) and the state of the connection establishment handshake. Whenever a packet with an empty data payload arrives, the list of connection records is searched for the corresponding match, and the algorithm decides whether to drop or retain the packet.
More specifically, a static table is used to store all the unfinished TCP handshakes, which are indexed by a hash function. The space in which we store the information is called a slot, and by using different indices we can access the information stored in different slots. The hash function is the XOR of the five-tuple modulo the theoretical maximum number of concurrent flows. The latter is estimated from the connection interarrival time and the maximum RTT value, as will be described in section IV-A.
All the necessary input data can be obtained from the IP and TCP headers (source and destination IP addresses and ports, and the TCP sequence number) using a simple structure. The timestamp can be obtained from a call to the system time libraries. Ideally, the timestamp should be kept in cache memory, as it is accessed every time a new packet arrives.
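A possible C layout for one slot of this static table, together with the XOR hash described above, is sketched below; the exact field set, types, and names are assumptions made for illustration.

    #include <stdint.h>

    /* One slot of the static handshake table. */
    struct handshake_slot {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint32_t expected_seq;   /* client ISN + 1, matched by the final ACK */
        uint8_t  state;          /* 0 = free, 1 = SYN seen, 2 = SYN+ACK seen */
        uint64_t syn_ts_us;      /* SYN timestamp, for RTT and expiration */
    };

    /* XOR of the flow identifiers and sequence number, modulo the table
     * size M (the estimated maximum number of concurrent handshakes). */
    static uint32_t slot_index(uint32_t src_ip, uint32_t dst_ip,
                               uint16_t src_port, uint16_t dst_port,
                               uint32_t seq, uint32_t M)
    {
        uint32_t h = src_ip ^ dst_ip ^ (((uint32_t)src_port << 16) | dst_port) ^ seq;
        return h % M;
    }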
Pseudocode 1 presents the algorithm that executes traffic thinning. The Check_packet function takes a packet and a timestamp as input and returns a boolean that represents the decision about dropping the current packet. When a SYN packet arrives, the corresponding connection record is updated with the new connection establishment state (lines 11-14). The hash used to update the information is computed with the TCP sequence number incremented by one unit (line 3). Furthermore, when an ACK packet with an empty data payload arrives, the algorithm checks whether it corresponds to the connection establishment phase, in order to update the RTT calculation (line 25). If the ACK-only packet belongs to the handshake, the hash is calculated with its original sequence number, so the hash is the same as the one computed previously for the SYN-only packet that opened the same handshake (line 19). If any of the previous connection establishment phase packets have been lost, the connection record is garbage and must be removed after an expiration timeout. This expiration timeout should be as close as possible to the maximum RTT observed in the trace (line 5).
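The following C sketch illustrates the decision logic described above. It is not a reproduction of Pseudocode 1: it reuses the hypothetical handshake_slot and slot_index sketches given earlier, assumes the caller has already decoded the TCP header into the fields shown, and omits the collision handling discussed in section IV-A.

    /* Returns 1 if the packet should be kept, 0 if it can be dropped. */
    struct decoded_pkt {
        uint32_t src_ip, dst_ip, seq;
        uint16_t src_port, dst_port, payload_len;
        int syn, ack;                       /* TCP flag bits, already extracted */
    };

    int check_packet(const struct decoded_pkt *p, uint64_t now_us,
                     struct handshake_slot *table, uint32_t M, uint64_t timeout_us)
    {
        if (p->syn && !p->ack) {
            /* Client SYN: open (or recycle an expired) slot, hashed on seq + 1
             * so the final handshake ACK, whose seq equals ISN + 1, maps to
             * the same slot. */
            uint32_t i = slot_index(p->src_ip, p->dst_ip, p->src_port,
                                    p->dst_port, p->seq + 1, M);
            if (table[i].state == 0 || now_us - table[i].syn_ts_us > timeout_us)
                table[i] = (struct handshake_slot){ p->src_ip, p->dst_ip,
                            p->src_port, p->dst_port, p->seq + 1, 1, now_us };
            return 1;                       /* handshake packets are kept */
        }
        if (p->ack && !p->syn && p->payload_len == 0) {
            uint32_t i = slot_index(p->src_ip, p->dst_ip, p->src_port,
                                    p->dst_port, p->seq, M);
            if (table[i].state != 0 && table[i].expected_seq == p->seq) {
                /* Final handshake ACK: take the RTT sample and free the slot. */
                /* rtt_us = now_us - table[i].syn_ts_us; */
                table[i].state = 0;
                return 1;                   /* keep the first ACK of the flow */
            }
            return 0;                       /* drop empty ACKs outside handshakes */
        }
        return 1;                           /* all other packets pass unmodified */
    }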

A. SYSTEM MODEL AND MEMORY REQUIREMENTS FOR HIGH-SPEED DEPLOYMENT
The connection records list is searched for every empty-payload packet that is received. As shown in the previous section, this connection record must be kept in memory for at least the duration of the connection RTT. At high speed, the search operation acts as a bottleneck, and rapid-access memories are required, namely processor caches or on-board NIC memories. To increase performance, we choose to store the above hash tables in static memory, thus saving the time consumed by allocation and deallocation of dynamic memory. In this section, we focus on the memory requirements in terms of size and speed. Before we investigate these requirements, let us briefly discuss current cache technologies, to explain the hardware limitations. We will then calculate the amount of memory required and discuss whether it fits in typical cache memory sizes. Table 2 lists the I/O throughput for the most typical processors (in bytes per second, according to the SiSoftware benchmarks [33]). The table shows the limitations in size and speed of the cache memories at different cache levels. We note that cache memory is much faster than the system RAM and that the associated core has exclusive access to it, instead of sharing the RAM between the different cores. Therefore, our NATRA tables should be dimensioned to fit in cache memory.

2) MEMORY AND APPROXIMATE CAPACITY PLANNING
Rapid-access memories are typically small, which limits the size of the list of connection records. As the network bandwidth increases, so do the number of concurrent connections and the size of the corresponding list of connection records. Thus, memory size acts as a limiting factor that restricts the applicability of NATRA in high-speed environments. In this section, we analyse the size of the list of connection records from both empirical and analytical standpoints. The analytical model is intended to allow extrapolation of the results to higher speeds.
Next, we provide a model that estimates the size of the list of connection records for the sake of planning memory capacity. If the trace includes many multiplexed TCP connections, each newly arrived TCP connection (whose arrival times are assumed to follow a non-homogeneous Poisson process) will open a new connection record in the list. Then, the connection record will be kept open for either the RTT of the TCP handshake or the garbage-collection timeout. Consequently, the list can be modelled as an M/D/∞ queuing system, with an arrival rate of λ connections per second and a holding time of 1/µ seconds. This holding time for the connection record is equal to the RTT of the TCP handshake, and we set an upper bound of 0.2 s, in accordance with previously reported data [31]. The selected upper bound is supported by the RTTs obtained using Tshark [30] to derive the RTT of the TCP handshake in the trace of the university Internet link (Fig. 6).
To better understand the arrival of flows, we verify the Poisson hypothesis using the CAIDA traces, which include many multiplexed TCP connections. This analysis considers that the arrival rate changes with the time of day. Figure 7 shows the flow interarrival time distribution for the CAIDA traces. The data rate is quite stable for the one-hour duration of each CAIDA trace. As shown, the exponential distribution fits the experimental data remarkably well.
According to the model, the maximum number of concurrently open TCP handshakes in connection records follows a Poisson distribution with parameter λ/µ. To calculate the maximum number of open TCP handshakes, we use the 0.99 percentile of the Poisson distribution. To this end, we employ the Normal approximation of the Poisson distribution [13]. As a result, we obtain the following formula for the number of open handshakes, denoted by K, in terms of the number of TCP connection entries in the NATRA system:

    K = \lambda/\mu + z_{99}\sqrt{\lambda/\mu}    (1)

where λ is the flow arrival rate in flows per second and z_{99} is the 0.99 percentile of the standard Gaussian distribution.
We note that the hash function produces a random share of the incoming connection records, such that the load per slot is K/M, where M is the number of slots in our hash table and K follows equation 1. Thus, for any given collision probability P, the number of required slots is given simply by:

    M = K/P    (2)

This theoretical approximation closely matches the experimental results, as will be demonstrated in the next section. Additionally, we note that in the case of a collision in the hash values (i.e., two different connections with the same hash), the corresponding connection record will not be updated, even if the table of connection records has available slots. In that case, we choose not to drop the ACK packets from the colliding flow, which slightly increases the rate of packets per second but achieves a significant gain in accuracy.
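For illustration only, consider the assumed values λ = 10,000 flows per second, 1/µ = 0.2 s, and P = 0.05 (none of which are measurements from our traces):

    \lambda/\mu = 10{,}000 \times 0.2 = 2000
    K \approx 2000 + 2.33\sqrt{2000} \approx 2104 open handshakes (equation 1)
    M \approx K/P = 2104/0.05 \approx 42{,}080 slots (equation 2)

Assuming, for the sake of the example, about 32 bytes per slot, the table would occupy roughly 1.3 MB, which is compatible with typical L2/L3 cache sizes.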

V. RESULTS AND DISCUSSION
First, we assess the validity of the above model for open handshakes, following equation 1. Then, we perform a thorough performance assessment with NATRA use cases, which also includes analysing NATRA performance as a traffic thinning middleware for well-known traffic analysis tools.
The results show that the approximate capacity planning formula for the demand of connection records (equation 1) gives a reasonable estimate of the size of the list of connection records. In the case of the 2014 trace, this estimate is not very accurate because the flow-arrival process includes peaks that depart from the Poisson distribution. In any case, the estimate will be useful for approximate planning of the system memory capacity, as in equation 2.

A. TRAFFIC REDUCTION RATES
Having verified that the number of concurrent TCP handshakes can be approximately upper-bounded using equation 1, we turn our attention to the traffic reduction rate in packets per second. We apply equation 2 with the collision probability P = 0.05 and obtain the memory size M. Then, we perform a trace-driven simulation, and the results are shown in Figures 9 to 11c, for both memory size in number of slots and the percentage of traffic reduction. The reductions in traffic are remarkable for all the traces considered in our analysis. Furthermore, the memory sizes are compatible with current cache technologies (see section IV-A1), which ensures that multi-Gbps throughput can be attained.
The experimentally measured memory size is larger than the memory size calculated using equation 1 directly, as expected, because the collision probability is non-zero. In any case, we note that in the case of a collision, especially if memory is full, the corresponding incoming flow is passed to the upper layers without thinning. This behaviour involves a trade-off between memory occupancy and traffic reduction.

B. EMPIRICAL EVALUATION WITH A REAL IMPLEMENTATION
To evaluate the maximum achievable throughput when using NATRA we use our previously reported high-speed driver [18] in user space to achieve a realistic deployment scenario. The driver has one thread in charge of timestamping, which moves all packets from the shared memory to user-space memory. Thus, we implemented NATRA between the timestamping and memory movement tasks.
To make the simulation as realistic as possible, we placed all packets in reverse order into the main memory. This trick bypasses the pre-caching techniques implemented in various operating systems, which try to cache as much of the data stored in memory as possible. Ordinarily, programs store data sequentially in memory to make the most of pre-caching.
By changing the order of the packets in memory, no pre-caching will be performed, because the packets are not stored in well-aligned and sequential memory. In other words, the implementation intentionally forces cache misses when packets are read from the shared memory. The results show that the throughput per core reaches 31 Gbps reading from host memory and that the traffic reduction in packets per second is significant, up to 25%, which supports the validity of using NATRA for offloading packet sniffers. We stress that only one core was used in the experiment, so the same process could be applied to different RSS queues to speed up the traffic thinning system and reach faster speeds, which is the usual approach for scaling to higher speeds [17], [37].
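A minimal C sketch of the reverse-order placement described above is given below; the buffer layout, slot size, and function names are assumptions made for illustration.

    #include <string.h>
    #include <stddef.h>

    #define SLOT_SIZE 2048   /* one slot per packet, large enough for a full frame */

    /* Copy packets into the replay buffer starting from the last slot, so
     * that consecutive reads during the experiment touch non-sequential
     * addresses and hardware prefetching gains little. */
    static void fill_buffer_reversed(char *buf, size_t nslots,
                                     const char *pkts[], const size_t lens[],
                                     size_t npkts)
    {
        for (size_t i = 0; i < npkts && i < nslots; i++) {
            char *dst = buf + (nslots - 1 - i) * SLOT_SIZE;
            memcpy(dst, pkts[i], lens[i]);
        }
    }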

C. RTT ESTIMATION ACCURACY AFTER NATRA
As discussed above in section III, NATRA is expected to drop a huge quantity of empty-payload ACK packets. RTT is the KPI most affected by this packet thinning process. Nevertheless, the NATRA process preserves all the packets involved in the handshake, so the RTT obtained from these packets will not vary.
In any case, we note that RTT may vary during the connection lifetime. Thus, the RTT estimation can be updated by using ACK packets with non-empty payload, which are not removed by NATRA. In what follows, we show that the RTT estimation remains accurate, in spite of NATRA removing ACK packets without payload.
As a benchmark, we have measured connection RTT with Tshark [30], which is a well-known and widespread tool for obtaining network performance indicators. Tshark consumes significant CPU resources and is not well suited to very large traces. Thus, we used the traces recorded at the university Internet link, which are smaller. Figure 12 shows the Cumulative Distribution Function (CDF) of the RTTs calculated from the original trace and its NATRA-filtered counterpart. The CDFs are very similar, even though they are not identical because of the missing ACKs in the latter trace.
Furthermore, we have distinguished between the connections generated towards the Internet (outbound traffic) and the traffic coming from the Internet (inbound traffic) and have obtained the RTT CDF for each traffic direction. Figure 13 shows the RTTs obtained from inbound connections and Figure 14 shows the RTTs obtained from outbound connections. As it turns out, the original and NATRA-filtered RTT CDFs are closer for inbound connections than for outbound ones. However, even in the latter case, NATRA provides good accuracy.

D. PERFORMANCE GAINS APPLYING NATRA TO WELL-KNOWN TOOLS
To conclude the results and discussion section, we discuss our application of NATRA to the inputs of well-known monitoring tools, and assess the resulting performance gains and accuracy. First, we tested the behaviour of NATRA with FlowMiner, a program used to obtain valuable information from all the different flows included in a PCAP file.
In addition, live monitoring tools have been used to verify how NATRA performs from the CPU-load standpoint. The well-known Tstat [3] and Tshark tools were used for these live tests. We endeavoured to use well-known and freely accessible tools to ensure the reproducibility of the experiments. In addition, we used Dstat [1] to obtain the system CPU statistics.

1) FlowMiner
FlowMiner was used to process the one-hour-long CAIDA traces. This is the only test in this section performed without live traffic, as FlowMiner is an offline processing tool. Note that, since the traces were recorded from Internet backbone links, they include a huge number of concurrent flows, which entails a long processing time. Thus, we assessed NATRA performance in terms of processing-time reduction. Table 3 lists the total processing times when the original traces and the NATRA-filtered ones are given as input to FlowMiner. Interestingly, the NATRA-filtered traces can be processed in less than 3600 seconds, which is the exact trace duration. In contrast, only the 2014 trace could be processed in less than the trace duration without the NATRA filter. The time FlowMiner takes to process the traces without NATRA is closer to 4000 seconds. Overall, NATRA leads to a remarkable reduction in processing time, between 10% and 20% depending on the trace.
Packets were reduced by close to 20-25% in the CAIDA traces, but the improvement in performance is not as large. Nevertheless, the traces could be processed in less time than their respective durations, which allows real-time processing.
In addition, as discussed in section V-C, we evaluated to what extent the introduction of NATRA alters the flow record parameter values. Table 6 shows the percentage of flow record parameters that are altered when using NATRA. The data obtained with and without NATRA are practically the same.
As shown, NATRA accelerates the processing time in FlowMiner while preserving the validity of the measured metrics.

2) TSTAT AND TSHARK
Finally, we ran experiments using the Tstat and Tshark tools with TCPreplay [2] sending live traffic, in order to assess whether the CPU load improves. In this case, the experiment duration remains equal to the traffic replay duration, and the CPU load varies when using NATRA. Table 5 shows the observed CPU load, measured as CPU time employed. This CPU time was sampled from the Dstat log once per second, and an average over the experiment lifetime was then obtained. Better results in terms of CPU-time reduction are observed with Tshark. This tool uses a large amount of processing resources, and the use of NATRA provides a significant improvement.
Apart from the RTT discussion in section V-C, we have calculated the errors in several attributes of TCP flow performance when using NATRA. For the sake of generality, we selected attributes different from those discussed in section V-D1. Similar to the data presented in our discussion of FlowMiner, Table 6 lists the percentage of altered values for the selected TCP flow attributes. We note that the percentage of altered values is negligible.
Furthermore, Figure 15 compares the flow duration CDF for the NATRA and non-NATRA experiments. NATRA causes no noticeable changes in the distribution, while the CPU time is reduced by nearly 23%. Therefore, NATRA pays off in terms of both accuracy and CPU-time reduction.

VI. CONCLUSIONS
This work has presented NATRA, a traffic reduction algorithm that significantly decreases the rate of packets per second that must be analysed by packet sniffers on high-speed networks. In a nutshell, NATRA drops ACK packets that carry no data payload. This offloading gives the sniffer extra CPU time to perform packet processing and analysis. The results from our trace-driven evaluation show traffic reduction rates between 18.70% and 34.54% in packets per second. We assessed the performance of NATRA as a traffic thinner for well-known monitoring tools. The results show remarkable CPU-time reductions, close to 23% using Tshark and 9% using Tstat, and a processing-time reduction near 15% using FlowMiner, with negligible information loss.
Large reductions in traffic rates can be achieved, which reduce the processing load of the subsequent analysis and visualization stages beyond packet capture. This is a cornerstone for scaling network traffic analysis to tens of Gbps.