Improved data transfer efficiency for scale‐out heterogeneous workloads using on‐the‐fly I/O link compression

Graphics processing units (GPUs) are unarguably vital to keep up with the perpetually growing demand for compute capacity of data‐intensive applications. However, the overhead of transferring data between host and GPU memory is already a major limiting factor on the single‐node level. The situation intensifies in scale‐out scenarios, where data movement is becoming even more expensive. By augmenting the CloudCL framework with 842‐based compression facilities, this article demonstrates that transparent on‐the‐fly I/O link compression can yield performance improvements between 1.11× and 2.07× across tested scale‐out GPU workloads.


INTRODUCTION
In the age of ever-increasing data volumes, the overhead of data transfers is a major inhibitor of further performance improvements on many levels.
In heterogeneous compute architectures, the overhead of transferring data (e.g., between host and graphics processing unit (GPU) memory) can still have a major impact on the overall performance, even when the latest state-of-the-art interconnection technologies are used such as NVLink-2 on the intra-node level 1,2 and InfiniBand EDR on the inter-node level. 1 For many data-intensive applications, scaling out to multiple nodes is the most feasible strategy to satisfy their resource demands. Unfortunately, even though latest network technologies have reached speeds of 100 Gbps and more, these high-end inter-node links are mostly employed by hyperscalers internally, as many of their IaaS offerings are limited to network bandwidths ranging between 10 and 25 Gbps. 3 In the vast majority of other data centers, NIC port speeds of 10, 25, or 40 Gbps are also still the norm. 4 The increased latencies and limited bandwidths of such infrastructures aggravate the performance penalty of data movement for scale-out GPU workloads.
Preceding efforts of the research community have identified compression as a viable method for improving data transfer efficiency for certain application domains. 5,6 To work around the issue of insufficient compression throughput, preceding investigations have proposed the use of offline I/O link compression, where the payload for data transfers is available in a pre-compressed form. More recently, however, hardware-accelerated compression techniques are becoming available in an increasing number of computer architectures. 7,8 Even software-based compression facilities have gained notable levels of compression throughput. 9 Based on these observations, we hypothesize that on-the-fly I/O link compression can be used to improve data transfer efficiency and consequently overall performance of data-intensive scale-out GPU workloads, as illustrated in Figure 1.
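As a rough illustration of this hypothesis (a simplification of our own rather than a formal model), let r denote the compression ratio, that is, the compressed size divided by the original size. An ideally pipelined compressed transfer then approaches an effective bandwidth of B_eff ≈ B_link / r, provided that the compression and decompression throughput, measured on the uncompressed side of the data, both reach at least B_link / r. For a 10 Gbps link (≈1.25 GB/s) and r = 0.5, this corresponds to an effective bandwidth of roughly 2.5 GB/s and thus to a required compression and decompression throughput of about 2.5 GB/s.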
To test this hypothesis, this article brings together two research projects that we have investigated independently of each other in the past.
FIGURE 1: Compared to uncompressed data transfers (left), on-the-fly I/O link compression may increase the effective bandwidth (right).

In the first project, we have argued on a theoretical level that on-the-fly I/O link compression based on the 842 compression algorithm might be employed to improve data transfer efficiency in heterogeneous computer architectures. 10 Being a heavy-weight compression algorithm, 842 has been designed with main memory compression in mind and delivers decent compression ratios for various payloads. Even though preceding approaches have successfully employed on-the-fly I/O link compression using workload-specific lightweight compression techniques, 11,12 we are not aware of any successful approaches that have employed heavy-weight compression techniques on commercially available off-the-shelf systems in order to improve data transfer efficiency across a wider range of workloads. 13 To provide all components necessary for a practical evaluation of 842-based on-the-fly I/O link compression, we have then implemented GPU-based decompression facilities for the 842 algorithm. 14 However, a practical evaluation based on scale-out GPU workloads has still been pending thus far.

In the second project, we have proposed CloudCL as a single-paradigm programming model for making scale-out GPU computing accessible to a wider audience. 15 Essentially, CloudCL has extended the Aparapi framework with a dynamic, distributed job model and relies on the dOpenCL API forwarding library for scaling-out job partials across various compute nodes. The improved ease of use of CloudCL has come at the price of limited scalability for data-intensive workloads, as the overhead of inter-node data transfers has diminished the benefits of additional scale-out compute resources.
With this article, we make the following main contributions:
• We propose a pipelined architecture for integrating 842-based, transparent on-the-fly I/O link compression into scale-out GPU workflows.
• We implement the proposed architecture based on the dOpenCL API forwarding library, which serves as the foundation of the CloudCL single-paradigm framework for scale-out GPU computing.
• We present the first approach for user-space access to nx842 compression accelerators on Linux, providing applications with compression and decompression facilities for the 842 algorithm capable of processing up to 28 GB/sec in an S824L two-socket POWER8 server. 16
• We provide a practical evaluation confirming that 842-based on-the-fly I/O link compression improves data transfer efficiency and consequently overall performance of data-intensive scale-out GPU workloads, yielding performance improvements up to 2.07×.
Hereinafter, this article is structured as follows: Section 2 provides background about the employed 842 compression algorithm as well as the CloudCL framework. Subsequently, Section 3 reviews related work in the field of GPU-based compression techniques. In Section 4, central design decisions and assumptions are documented that are important to consider in the context of the aspects discussed in the ensuing sections.
Implementation details for integrating transparent on-the-fly I/O link compression in CloudCL and dOpenCL are elaborated in Section 5. For the practical evaluation of our approach, Section 6 documents all relevant details of our testing environment and our benchmark characteristics. It then presents the measured results as well as a brief discussion thereof. Finally, a conclusion is reached in Section 7.

842 compression
The 842 compression algorithm 17 has been designed with the use case of transparent main memory compression in mind. This use case requires compression throughput well in the two-digit GB/sec range to be feasible, which today is typically achieved using light-weight compression techniques. 18 However, the 842 algorithm can be attributed to the family of Lempel-Ziv derivatives 17 and thus belongs to the class of heavy-weight compression techniques. 18 The compression process deviates from the original Lempel-Ziv algorithm 19 in several aspects, whereas decompression works almost identically to LZ77. 17 Still, the 842 algorithm can achieve high compression throughput, as it has been designed to allow high-throughput/low-latency hardware implementations that can be placed directly on transmission channels. 20 The first implementation of the algorithm has been introduced with the nx842 on-chip compression accelerator, which is available in all IBM POWER processors introduced since the POWER7+.
As illustrated in Figure 2, the 842 algorithm 17 operates on units of 8 bytes, treating input data as sub-phrases of 8, 4, and 2 bytes length, respectively. For each phrase length, a hash table holds offsets to sub-phrases that have already appeared in the raw data stream within a certain window.

TABLE 1: A 5-bit template code specifies a series of actions.
Each nx842 unit can achieve a maximum throughput of 18 GB/sec. 7 With two nx842 units per socket, 21 the total compression throughput of a POWER8 processor can be as high as 36 GB/sec. Prior to this work, the only way for user-space applications to leverage the nx842 accelerator was to use the AIX operating system, where a user-space API exposed the hardware-accelerated compression facilities. On Linux, utilization of the accelerator was only feasible from kernel space, where it is accessible through the Linux Crypto API. Fortunately, this work eliminates this issue, as we present the first approach for user-space access to the nx842 compression accelerators in Section 5.1.
The example provided in Figure 2 demonstrates how the 32-byte string is compressed using four templates. The template 0x00 is used to encode the raw literal PITTERPA, since no matching sub-phrases have appeared in the raw data stream beforehand. The second template 0x13 encodes TTERPITT by providing offsets to two 2-byte phrases in the uncompressed data stream at the positions 2 (TT) and 4 (ER), as well as an offset to a 4-byte phrase at the position 0 (PITT). The third template 0x18 encodes ERPATTER by providing offsets to two 4-byte phrases at the positions 4 (ERPA) and 8 (TTER). Finally, the last template 0x00 encodes LISTENTO as a raw literal.

CloudCL

CloudCL extends the Aparapi framework with a dynamic, distributed job model that partitions workloads into kernel-partials that can be processed independently by different compute nodes. The corresponding job scheduling system can also handle the dynamic properties of cloud-based resource-provisioning by supporting dynamic addition and removal of resources at runtime.
The resulting architecture of CloudCL is illustrated in Figure 3, which stresses the central roles of dOpenCL 23 and Aparapi 22 as the underlying technologies. Based on the discussed enhancements, the framework hides the complexity of distributed programming for dynamic cluster configurations, enabling developers and domain experts to tap heterogeneous compute resources using a single-paradigm compute framework. The single paradigm approach helps to improve productivity, as developers and domain experts may focus their implementation efforts on application logic without having to consider inter-node communication and dynamic resource management.

RELATED WORK
Using compression as a means for improving utilization of main memory has a well-established history. This is well reflected by the work of Mittal et al., 24 which provides a comprehensive survey of the widely researched field of data compression mechanisms for main memory and cache systems.
Complementing traditional LZ-based compression techniques, the field of lightweight compression algorithms has surfaced in the last decade. The work of Damme et al. 18 presents an overview of contemporary lightweight approaches that are heavily used in in-memory columnar databases. In most cases, lightweight compression algorithms can achieve very high compression and decompression performance at the cost of slightly degraded compression ratios. Narrowing down the field of approaches, both heavyweight and lightweight, to GPU-based compression techniques, the existing approaches can be broken down into four major categories:
1. Compression offloading: GPU acceleration is used to speed up compression tasks on the host side.
2. GPU-sided memory compression: Similar to main memory compression, compression is used on the GPU to improve utilization of resources such as registers or device memory.

3. Offline I/O link compression: Compression is applied to improve the overall utilization of the CPU-GPU or the GPU-GPU I/O link, either within the same node or across node boundaries in a scale-out scenario, with payloads that are available in a pre-compressed form (send-decompress).
4. On-the-fly I/O link compression: Compression is likewise applied to the I/O link, but payloads are compressed immediately before transmission (compress-send-decompress). In addition to GPU-based decompression, this approach requires high performance compression mechanisms on the sender side to be feasible.

Compression offloading
In the space of GPU-based approaches, several attempts have been made to port well-established compression algorithms to graphics hardware in order to free up host resources by offloading compression operations. Popular approaches include CULZSS 25 and GLZSS, 26 with CULZSS achieving up to 3× speed-up compared to parallel, CPU-based implementations and GLZSS yielding 2× speed-up over CULZSS. G-Match 27 presents a compression strategy fully optimized for GPU architectures and generates a compressed bitstream that can be decompressed using any Snappy decompressor. It achieves 1.3× and 2.4× speed-up over GLZSS and CULZSS, respectively. CULZSS and GLZSS employ the LZSS algorithm, 28 and G-Match produces a compressed bitstream that is fully compatible with Snappy. 29 Both LZSS and Snappy can be considered heavy-weight approaches, as they are derivatives of the original Lempel-Ziv 19 algorithm.
Stein et al. have presented an approach for accelerating LZSS compression on GPUs in stream processing workflows. 30 Their stream processing architecture divides the LZSS compression process into three fundamental stages: reading input data, finding matches, and filtering and writing the results. Since the matching stage is offloaded to a GPU, creating multiple instances of this stage enables the authors to scale out across multiple GPUs, yielding 135.9× speed-up using two Titan XP GPUs compared to a parallel CPU implementation. Stein et al. have further optimized their approach by implementing latency-aware, adaptive micro-batching strategies that enable them to meet a service level objective by adapting micro-batch sizes elastically at run-time in order to react to unpredictable variations of workloads. 31 While this approach may be appealing for cloud-based deployments, adaptive batching strategies are beyond the scope of this work and are subject for future work.
It is important to note that all approaches in the offloading category are solely focused on providing an efficient implementation of the compression operation, widely disregarding GPU-based decompression. However, due to our focus on data-intensive scale-out workloads requiring large data volumes to be transferred from the master node to the compute nodes efficiently, decompression on the GPU-side is essential to evaluate the performance impact of on-the-fly I/O link compression in this article. For GPU-based decompression, processing multiple chunks in parallel is usually the only viable strategy for parallelization that does not break compatibility with existing compressor implementations (e.g., by modifying the structure of the compressed bitstream). This is also the case in this work, since we have to make sure our GPU-based decompressor is 100% compatible with the compressed bitstreams generated by the hardware-based nx842 compression accelerators used in the master node of our test environment. Our plans for future work include implementing GPU-based compression using the 842 algorithm in order to accelerate data transfers from compute nodes back to the master node. For the planned implementation of this component, we expect that the optimization strategies for the compression operation discussed in this category of related approaches may become helpful.

GPU-sided memory compression
Targeting memory-bound applications, Sathish et al. 32 have proposed using hardware-based compression to increase the efficiency of access to off-chip device memory, yielding up to 37% better performance compared to the uncompressed case. A similar approach has been published by Vijaykumar et al., 33 who are also employing memory and register compression to increase the utilization of all GPU resources, yielding up to 2.6× speed-up across a variety of memory-bound applications. Following the same goal, Lu et al. 34 have recently proposed a low-latency, hardware-based compression architecture optimized for floating point data that reduces bandwidth demand and energy consumption by 44.46% and 44.34%, respectively. Focusing entirely on the register level, Lee et al. 35 have explored register compression with the goal of reducing the energy consumption of graphics hardware. All approaches have in common that they are using custom, lightweight compression algorithms instead of general-purpose algorithms and rely on custom hardware designs. With the introduction of their latest A100 GPUs, NVIDIA has introduced hardware support for device-sided memory compression with the compute data compression feature, 36 promising up to 4× improvements in effective DRAM and L2 bandwidth.
While both research-based and commercially available approaches of this category can improve both effective bandwidth and capacity of GPU-based DRAM resources significantly, they are not feasible for memory transfers between host and GPU memory, as this would require hardware changes in the host system as well. Therefore, these approaches are not applicable for our setup since we are investigating approaches that can be implemented on commercially available off-the-shelf hardware both on the host and the GPU side.

Offline I/O link compression
Targeting efficient decompression of the heavy-weight LZ77 algorithm on GPUs, Gompresso 5 achieves a 2× speedup in decompression performance compared to a multithreaded, CPU-based implementation. Funasaka et al. have demonstrated multiple approaches for efficient decompression facilities on GPUs with the goal of increasing the data transfer efficiency either from host main memory or from non-volatile storage. The approaches are using different compression algorithms, including the heavy-weight LZW algorithm 37 as well as custom light-weight strategies such as the light loss-less data compression (LLL) 38 and adaptive lossless data compression (ALL) 6 approaches. In particular, the custom LLL and ALL algorithms, which are optimized for efficient decompression on GPUs, can significantly outperform CULZSS 25 and their GPU-based LZW implementation. 37 Rozenberg et al. 39 presented a library of GPU-based decompressors for many established light-weight compression algorithms. Their evaluation indicates that light-weight compression techniques can be used to mitigate the PCI Express bottleneck for GPU-based in-memory databases.
All approaches based on well-established heavy-weight compression algorithms such as LZ77 or LZW have in common that they are modifying the layout of the compressed bitstream to enable more efficient decompression strategies on GPUs. The remaining approaches present custom compression strategies that are designed to enable efficient, GPU-based decompression. Both strategies are not applicable in the context of this work, as our goal is to remain fully compatible with the nx842 hardware compression units.

On-the-fly I/O link compression
Aiming at high compression rates with space savings below 0.5, Patel et al. 13 have explored the feasibility of on-the-fly compression based on heavy-weight techniques for data transfers between host and GPU. The authors conclude that on-the-fly compression is not feasible using their software-based compression approach implemented on the host CPU. On the side of light-weight approaches, Kaczmarski et al. 11 have successfully demonstrated that on-the-fly I/O link compression can be used to speed-up transfers between host and GPU memory for GPU-based in-memory database use cases. Tavana et al. 12 are also investigating GPU-based light-weight compression approaches. However, they are using on-the-fly I/O link compression to improve data transfers between multiple GPUs.
In this work, we refrain from using light-weight compression approaches since they are usually optimized for certain application domains and data types. With 842, we are using a traditional, heavy-weight compression algorithm instead that achieves reasonable compression ratios over a wider range of workloads. Patel et al. have reported that, due to the low compression throughput of the software-based approaches available at the time of their writing, their setup only improved data transfer efficiency for compression ratios below 0.5. 13 Since compression ratios that low are not realistic for most workloads, we picked the 842 compression algorithm as it is capable of much higher compression throughput due to its design for efficient main memory compression (see Section 2.1). Even though state-of-the-art central processing units (CPUs) can achieve 842 compression throughput high enough to saturate common 10 or 40 Gbps network links under full CPU load, this strategy prevents the system from performing any other task and comes at a high cost economically and energy-wise. With the availability of nx842 hardware compression accelerator units however, we can make sure that sufficient compression throughput is available using commercially available off-the-shelf hardware and without having to spend excessive amounts of CPU cycles on the task.

DESIGN RATIONALE
A major contribution of this article is a pipelined architecture for transparent integration of 842-based on-the-fly I/O link compression into the dOpenCL API forwarding library, which serves as the foundation of the CloudCL framework for scale-out GPU computing. This section delineates the conceived architecture by documenting central design decisions that have shaped it.

Choice of compression algorithm
For this work, we decided to use the 842 compression algorithm as it has been designed for high throughput and low latency due to its main purpose being transparent main memory compression (see Section 2.1). As elaborated in Section 3, the 842 algorithm belongs to the family of heavy-weight compression approaches that are applicable to a wide range of workloads. Therefore, it does not focus on the specific characteristics of floating point data, but rather aims at eliminating redundancy regardless of the employed data type. As it will be demonstrated in Section 6, this generic approach yields sufficient compression ratios across various data sets, including floating point data.
Another important reason for using 842 compression in our work is the availability of nx842 hardware compression units, which are part of all IBM Power CPUs introduced since the Power7+. Using hardware-accelerated compression, we can make sure that sufficient compression throughput is available using commercially available off-the-shelf hardware without having to spend excessive amounts of CPU cycles on the task. The software-based implementation that will be discussed in Section 5.2 is capable of achieving compression throughput high enough to saturate common 10 or 40 Gbps network links using current high-end CPUs. However, the software-based approach is only used as a fall-back option in situations where no nx842 hardware compression units are available.

Focus on scale-out GPU workloads
The demand for GPU-based computing resources has been steadily increasing over the last few years. Data-science workloads such as machine learning have contributed to the growing demand for GPU-based computing resources, even though other application domains have adopted the use of GPUs as well. Many use cases even require multiple GPUs to satisfy their resource demands. In order to use multiple GPUs effectively, workloads have to be partitioned into partials that can be processed mostly independent of each other, without requiring fine-grained communication between GPUs. 40,41 Even though these workload partials are usually scaled out across multiple GPUs hosted within the same node, the same distribution pattern can be applied for scaling out workload partials to GPUs that are distributed across multiple nodes.

FIGURE 4: Due to resource fragmentation, it is not possible to satisfy the resource demands of all users even though sufficient GPUs are available.

FIGURE 5: Our cluster model assumes nx842 compression units in the master node, and GPU-based decompression on the compute nodes.
Using OpenCL API forwarding techniques to scale out multi-GPU applications across multiple compute nodes can be considered as a form of resource disaggregation. 42 Across the field, resource disaggregation is considered as a promising approach to improve the efficiency of data centers, 43-45 as resource disaggregation eliminates many resource allocation issues such as fragmentation. As illustrated in Figure 4, fragmentation is a big issue for operators of GPU compute infrastructures. 42

Assumed cluster model
To saturate fast commodity networks such as 10 Gbps Ethernet or faster, on-the-fly I/O link compression requires sufficiently high compression throughput both on the side of the master node as well as on the side of the compute nodes. As depicted in Figure 5, we assume that ideally, the master node should be able to saturate the network interfaces of all compute nodes, therefore having to deal with much larger traffic volumes than each individual compute node. Therefore, our assumed cluster model employs a master node equipped with nx842 compression accelerators, which are available in IBM Power CPUs. On the side of the compute nodes, arbitrary CPU types can be used, as decompression is handled by an OpenCL-based 842 decompression kernel on the GPU. Since we assume that the results computed by each compute node are usually smaller in volume compared to the input data, CPU-based software compression is sufficient to transfer results back to the master node.
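As a rough sanity check (a back-of-the-envelope estimate of our own based on the figures reported in this article), keeping the network interfaces of n compute nodes saturated with compressed data requires a master-node compression throughput, measured on the uncompressed side, of at least n · B_NIC / r. For example, eight compute nodes with 10 Gbps NICs (≈1.25 GB/s each) and a compression ratio of r = 0.5 require roughly 8 · 1.25 / 0.5 = 20 GB/s, which is within the up to 28 GB/sec that the nx842 units of a two-socket POWER8 server can deliver (see Section 5.1).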

Transparent integration of I/O link compression
Keeping in mind the basic CloudCL architecture presented in Section 2, we identified dOpenCL as the ideal target for transparently integrating on-the-fly I/O link compression into the CloudCL stack, as the library is also the interface between the master node and the compute nodes. This approach enables us to integrate the compression facilities without having to introduce any new components, as illustrated in Figure 6. Also, this strategy allows us to remain compatible with regular OpenCL applications that are making use of multiple GPUs. Integrating transparent compression at the level of an OpenCL implementation gives us the advantage that the OpenCL standard defines a series of rules and requirements that an OpenCL implementation must fulfil for the movement of data between the host and the set of devices associated with a buffer via its context. Even though the OpenCL standard leaves ample room for exotic implementations, most implementations (including dOpenCL) follow a set of reasonable rules for data movement, aimed at minimizing unnecessary copies, and applications rely on those rules for optimal performance. This environment turned out to be ideal to implement on-the-fly I/O link compression in dOpenCL for all common transfer methods, so the application running on the master node does not need to be adapted to use a certain, specific mechanism in order to use compressed transfers.
On the master node, dOpenCL uses the lib842 library that will be introduced in Section 5, which provides access to the hardware-based compression and decompression facilities of nx842 units, if available. Otherwise, the library falls back to an optimized, software-based implementation for both compression and decompression. On compute nodes, dOpenCL is also responsible for coordinating the workflow for compressed data transfers, using lib842 to decompress data in GPU memory based on an OpenCL-based decompression kernel. The decompressed buffers are left in GPU device memory, so that the actual application kernel can work on them without any additional overhead. After the execution of the application kernel has completed, buffers that should be transferred back to the master node are first copied back to the main memory of the compute node. There, the CPU-based software compressor available as part of lib842 is used to compress data prior to being sent back to the master node. As soon as a GPU-based compression kernel for the 842 algorithm becomes available, the involvement of the host CPUs in the compute nodes can be reduced even further.

Pipelined workflow
As illustrated in Figure 7, even high-throughput compression facilities are unlikely to yield performance improvements when on-the-fly I/O link compression is implemented naïvely (B) compared to an uncompressed workflow (A). In the illustrated example, a workload is assumed that can be compressed with the ratio r = 0.5. Under the simplified assumption that all stages of the compressed workflow (compress, send, host-to-device memcopy, and decompress) are taking equally long, the naïve compressed workflow (B) may even increase transfer time compared to the uncompressed workflow (A). For the compressed workflow to yield notable performance improvements compared to the uncompressed case (A), it is necessary to introduce pipelining into the compressed workflow by overlapping the individual operations as much as possible (C). To enable pipelining, we divide the payload into smaller units, so-called micro-batches, that can be processed independently. After the first micro-batch has cleared the compression stage, it can proceed to the send stage while the second micro-batch can already enter the compression stage. The same principle applies throughout all stages of the workflow, and as all stages can be interleaved, an increased level of parallelism is achieved.
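The effect can be summarized with a simple timing model (again a simplification of our own): for a payload of size S, compression ratio r, network bandwidth B_net, host-to-device copy bandwidth B_copy, and compression and decompression throughputs T_comp and T_dec (measured on the uncompressed side), the naïve workflow takes roughly t_naive = S/T_comp + r·S/B_net + r·S/B_copy + S/T_dec, whereas the pipelined workflow approaches t_pipe ≈ max(S/T_comp, r·S/B_net, r·S/B_copy, S/T_dec) once the payload is split into many micro-batches, because pipeline fill and drain times become negligible. Compared to the uncompressed case t_uncompressed = S/B_net + S/B_copy, pipelined compression pays off as soon as the slowest compressed stage is faster than the uncompressed transfer.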
In our approach, a micro-batch is comprised of 16 chunks of 64kiB each, resulting in a payload of 1MiB per micro-batch (see Figure 8). Multiple micro-batches are further grouped into GPU-batches of 512MiB (512 micro-batches); these are the sizes used throughout the evaluation (see Section 6.1).

IMPLEMENTATION
This section aims at providing insights into the implementation strategy of all major software artifacts that our approach is comprised of. As a major contribution of this article, Section 5.1 presents the first approach for user-space access to the nx842 compression accelerators on Linux. The subsequent subsections describe our optimized, software-based 842 implementation for arbitrary CPUs, our GPU-based decompression facilities, and the transparent integration of these compression facilities into CloudCL and dOpenCL.

Hardware-based compression on POWER CPUs
To the best of our knowledge, the approach presented in this section is the first to make the resources of the nx842 compression accelerators available to user-space applications running on Linux.
In the Linux kernel, the 842 compression scheme has already been in use for some time to implement the zram memory compression feature.
Compression facilities for the 842 algorithm are provided through the kernel-sided Linux Crypto API, where a driver for the nx842 exists as well as a software-based fallback implementation. 46 However, only a small subset of the Linux Crypto API is exposed to user-space applications. Even though the cryptodev-linux out-of-tree kernel module provides user-space access to a bigger portion of the Linux Crypto API, it only focuses on encryption and hashing facilities.
To fill this gap, we have created a fork of the cryptodev-linux kernel module 2 which has been extended with access to the 842 compression and decompression facilities available via the kernel-sided Linux Crypto API. The lib842 library has been extended to make use of the nx842 resources via the /dev/crypto device exposed by the cryptodev-linux kernel module. Since each ioctl request to the /dev/crypto device involves a system call, we have augmented the interface with a batching method that enables the lib842 to submit multiple chunks for compression or decompression using a single system call. To further optimize the interaction between lib842 and /dev/crypto, we have implemented session caching in order to re-use sessions initiated on /dev/crypto. With these optimizations in place, we are able to achieve high throughput for compression or decompression from user-space applications with minimal load on the CPUs.

Software-based compression on arbitrary CPUs
Prior to this work, the nx842 hardware compression accelerators have only been accessible from kernel space in Linux. Even more problematic, the only software-based implementation available to the public prior to our initial work on 842 compression 10 has been a fall-back implementation in the Linux kernel, 46 which is also only accessible from kernel space. As a first step, we ported this basic implementation to user space by replacing all kernel dependencies with corresponding equivalents. While this rough port serves well as a baseline, it is hardly optimized and delivers meager compression throughput. Therefore, we have applied a series of optimizations, which are described in the following subsections.

Fast hash tables
With efficient hash table lookups being the major potential bottleneck of the compression process, we replaced the general-purpose hash tables used in the baseline version with a very simplistic hashing mechanism. First, the 8-, 4-, and 2-byte sub-phrases are stored in a vector of 64-bit unsigned integer values. The vector-based representation with a uniform data type enables compilers to perform auto-vectorization of most subsequent operations. Using a vector-scalar multiplication, all fields of the vector are multiplied with the largest prime number that falls within the range of a 64-bit unsigned integer. A right shift operation truncates each multiplication result to its n_hash most significant bits, yielding a vector of hashes. For each sub-phrase length, two buffers are used to form a basic hash table structure: an index array with 2^n_hash unsigned short integer values and a FIFO buffer of 2^8, 2^9, and 2^8 elements for 2-, 4-, and 8-byte sub-phrases, respectively. The latter exponents are fixed constants defined by the FIFO sizes employed by the hardware-based nx842 implementation. The index array uses hash-based addressing to store the offsets of corresponding values in the FIFO buffer. To retrieve the best possible performance, the hash size n_hash has to be chosen carefully to yield acceptable collision rates at a memory footprint that still fits into the CPU caches. For n_hash = 10, the total memory footprint amounts to 3 · 2^10 · sizeof(uint16_t) + 2^8 · sizeof(uint16_t) + 2^9 · sizeof(uint32_t) + 2^8 · sizeof(uint64_t) = 10.5 KiB, which should fit into the L1 data cache of most CPU architectures. We have performed several tests to make sure the presented hashing mechanism has minimal effects on compression ratio.
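As an illustration of the described hashing mechanism, the following minimal sketch shows the operation performed per sub-phrase (the constant and the function name are ours and are not taken from lib842; the actual implementation applies the same operation to whole vectors of sub-phrases to enable auto-vectorization):

#include <stdint.h>

#define N_HASH 10  /* number of most significant bits kept as the hash value */

/* Largest prime number that fits into a 64-bit unsigned integer. */
static const uint64_t HASH_PRIME = 18446744073709551557ULL;

/* Multiply the sub-phrase by a large prime and keep the N_HASH most
 * significant bits of the truncated 64-bit product as the hash. */
static inline uint16_t hash_subphrase(uint64_t subphrase)
{
    return (uint16_t)((subphrase * HASH_PRIME) >> (64 - N_HASH));
}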

Efficient template lookup
In the baseline implementation, the hash tables are queried for known occurrences of a phrase in a complex hierarchy of if-else blocks in order to determine the most suitable template code for the data at hand. This mechanism has been replaced with a simple look-up mechanism, where the template key is computed as exemplified in Listing 1. The resulting template key is used to retrieve the template code (as specified in Table 1) from a look-up table.

// prefer one 4-byte match over two 2-byte matches
uint16_t high = max(templateKey_41, templateKey_21 + templateKey_22);
uint16_t low = max(templateKey_42, templateKey_23 + templateKey_24);

// prefer one 8-byte match over two 4-byte matches
uint16_t templateKey = max(templateKey_81, high + low);

Listing 1: Prime numbers 13, 53, and 149 are used to encode matches of 2-, 4-, or 8-byte phrases, respectively. To encode the action slot of the match, the prime numbers 3, 5, 7, and 11 are used for a matching phrase in the first, second, third, or fourth action slot. When a known value is found in a hash table, the primes indicating phrase length and position are multiplied. The prime numbers have been chosen so that higher template keys indicate more efficient template codes.

Optimized template encoder
The unoptimized baseline version encodes the template code and the four action parameters by calling an append function on the output buffer for each data item independently. Calls to the append function incur a certain degree of overhead due to bookkeeping tasks of the bitstream writer. To reduce the number of append calls, we have implemented fused calls to the append function for each template code, as exemplified in Listing 2. As an additional optimization, the append function has been replaced with a buffered bitstream writer. It accumulates bitstrings until a full 64-bit data sequence can be written to the output buffer. The buffering technique significantly reduces the complexity of appending sub-byte bitstrings to the output buffer.

Listing 2: For all templates except for 0x00, the template key and all action parameters are packed into a single value, which reduces the number of calls to the stream_write_bits() function from five invocations to a single invocation. Template 0x00 requires two invocations.
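For illustration, the following simplified sketch captures the idea behind the buffered bitstream writer (structure and function names are ours and do not reproduce the exact lib842 code; for clarity, bits are accumulated one at a time, whereas the optimized implementation appends whole bitstrings at once):

#include <stdint.h>
#include <stddef.h>

struct bitstream_writer {
    uint8_t *out;      /* output buffer (assumed to be large enough) */
    size_t   pos;      /* number of bytes already written */
    uint64_t pending;  /* accumulated bits that have not been flushed yet */
    unsigned nbits;    /* number of valid bits in 'pending' */
};

/* Append the 'count' least significant bits of 'value' to the stream; data is
 * only written to the output buffer once a full 64-bit word has accumulated. */
static void stream_write_bits(struct bitstream_writer *w, uint64_t value, unsigned count)
{
    for (unsigned i = 0; i < count; i++) {
        uint64_t bit = (value >> (count - 1 - i)) & 1u;
        w->pending = (w->pending << 1) | bit;
        if (++w->nbits == 64) {
            for (int b = 7; b >= 0; b--)  /* flush the word in big-endian order */
                w->out[w->pos++] = (uint8_t)(w->pending >> (b * 8));
            w->pending = 0;
            w->nbits = 0;
        }
    }
}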

GPU-based decompression
Even though hardware-accelerated compression is not accessible for user-space applications yet, an important design goal for our GPU-based implementation 14 of 842 decompression is that it must remain fully compatible with the compressed data streams produced by the nx842 hardware compression accelerator in case it becomes accessible to user-space software. With this limitation in mind, the compression format of the nx842 unit (see Section 2) leaves no obvious avenues for parallelism at the intra-chunk level of granularity. Due to the sliding window mechanism used to encode known phrases within the window as index offsets, there are no entry points that guarantee the absence of data dependencies within a chunk of compressed data. Therefore, naïve parallel decompression of chunks remains the only viable avenue for parallelization. However, we were able to achieve decent decompression performance on various GPU models using the optimization strategies explained hereinafter.

Avoiding divergent execution
Most importantly, the amount of divergent execution among threads had to be reduced to a minimum. In our implementation, we were able to reduce divergent execution by replacing the naïve case differentiation required to process each template code with a branch-free implementation. As outlined in Listing 3, the branch-free implementation strategy relies on a dictionary using the template code as a key, which provides all parameters necessary to interpret the four actions encoded by a template code (e.g., type of action, parameter length in the compressed bitstream, and the length of the decompressed literal). Furthermore, the bitstream reader yielding an arbitrary number of bits from the compressed data stream has been reformulated to get by with very few case differentiations.

Listing 3: The array dec_templates serves as a dictionary, specifying the four actions associated with a template code. For each action, it holds the parameter size of the action (specified in bits), a tag specifying whether the action is an index action or not, and the number of raw bytes produced by the action. Based on this information, templates can be decoded without requiring a complex hierarchy of case differentiations.

Optimized memory access patterns
Another important optimization step was to reduce the number of global memory access operations. This was achieved by refactoring the bitstream reader logic so that it caches data from global memory in registers, using the granularity of a native machine word. Based on this method, we were able to achieve significant speed-up, since not every read operation on the compressed input data results in a global memory access operation.

Transparent integration of compression facilities in CloudCL
With CloudCL being built on top of the dOpenCL library, we decided to integrate transparent compression mechanisms at the level of dOpenCL, as elaborated in Section 4.4. Hiding the compression facilities behind the regular OpenCL API does not only make transparent on-the-fly I/O link compression available to CloudCL, but makes compressed data transfers available to all OpenCL applications that run on dOpenCL. To achieve this goal, we have modified several function calls in dOpenCL, as illustrated in Figure 9. To transparently compress transfers from the master node to the device on a compute node, the clEnqueueWriteBuffer call, as well as the clEnqueueMapBuffer call with map_flags containing the CL_MAP_WRITE flag, have been extended with compression support.
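From the perspective of an application running on the master node, nothing changes: a plain OpenCL transfer such as the following sketch (error handling omitted; buffer size and function name are illustrative) is compressed and decompressed transparently by the modified dOpenCL underneath.

#include <CL/cl.h>

/* Upload input data through the standard OpenCL API; when running on top of
 * the modified dOpenCL, this transfer is transparently compressed on the
 * master node and decompressed on the GPU of the compute node. */
cl_mem upload_input(cl_context context, cl_command_queue queue,
                    const float *host_data, size_t n_floats)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                n_floats * sizeof(float), NULL, &err);

    /* A regular blocking write; no compression-specific calls are required. */
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0,
                         n_floats * sizeof(float), host_data, 0, NULL, NULL);
    return buf;
}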

EVALUATION
The evaluation presented in this section is subdivided into three parts. First, to make our evaluation more repeatable, Section 6.1 specifies all relevant details of the testing environment and the basic benchmark procedures. The characteristics of all datasets employed for the evaluation are documented in Section 6.2. Section 6.3 demonstrates the basic performance characteristics of our system in a series of synthetic performance tests. A detailed description of the benchmarked workloads, as well as the results yielded by them are presented in Section 6.4. Finally, Section 6.5 summarizes central findings of the evaluation.
All test and benchmark procedures used throughout this section are implemented either using plain C/C++ and OpenCL, or Java and CloudCL.
For each test or benchmark, a repository URL is provided in the footnotes in the respective section. Replicability is further facilitated by providing Dockerfiles in the repositories of CloudCL 3 and dOpenCL 4 that recreate the setups used to perform this evaluation.

Testing environment and benchmark procedure
All hardware configurations used in our tests are documented in Table 2. We employed three different classes of compute nodes to represent potential low, medium, and high-performance configurations of compute nodes. The medium and high performance compute nodes in our lab are equipped with eight GPUs, each, and are connected to the same 10 Gbps Ethernet switch as the master node. To simulate scale-out behavior, we created up to eight compute node containers using Docker with one GPU attached to each container, as depicted in Figure 10. By instantiating a varying number of containers, we emulated a varying node count.
For the low power compute nodes, we did not use the container approach but rather used up to eight individual bare-metal micro-servers. All micro-servers are attached to the same 10 Gbps Ethernet switch as the master node. Across all tests, the same master node has been used to warrant a certain degree of commensurability among the different compute node classes.
All performance measurements presented hereinafter were performed after a fresh reboot in order to ensure a clean system state. Furthermore, no other active users or background tasks were running on the involved servers and the network switch was idle. As discussed in Section 4.5, we used a chunk size of 64kiB, a micro-batch size of 1MiB (16 chunks), and a GPU-batch size of 512MiB (512 micro-batches).
In order to retrieve a sufficiently meaningful dataset, each benchmark was executed 10 times. Error bars are used in all plots to report the standard deviation for each measurement. Furthermore, each benchmark was preceded by a warm-up run in order to eliminate any confounding factors. All measurements presented in this work are reported as average values including standard deviation (n = 10).
Execution time was measured from the point where the application is started until it terminated. Therefore, all execution time measurements include the entire execution of a program, including setup, data transfers, computation phases, as well as teardown.
FIGURE 10: To simulate a varying number of compute nodes, we partitioned our GPU servers into eight nodes with one GPU each using Docker.

Dataset characteristics
The properties of the compressed payloads have a significant impact on their compression ratio r. Therefore, basic characteristics such as a brief description, size, and compression ratio of all employed datasets used throughout this work are documented in Table 3. Since performance optimization techniques for compression algorithms can often have an impact on the compression efficiency, we have included the compression ratios achieved by both the hardware-based nx842 units (see Section 5.1) as well as our optimized, CPU-based software implementation (see Section 5.2). However, the reported compression ratios indicate that the differences in compression efficiency are negligible across all datasets.
To facilitate replicability, we are using as many well-disseminated, publicly available datasets as possible. For the remaining, artificial data sets, an additional description is provided hereinafter.
The artificial datasets periodic, zeros, and random are self-explanatory and are used to quantify the impact of on-the-fly I/O link compression on the effective data transfer throughput using extreme cases ranging from the best case (periodic, zeros) to the worst case (random). Distinguishing periodic and zeros makes sense because zeros triggers a special template in the 842 algorithm, whereas the periodic dataset has to be encoded using the regular templates described in Section 2.1.
The matrix dataset contains two generated matrices (N × M, M × P), with N = 13,125 × n_nodes, M = 20,000, and P = 25. A pseudo random number generator is used to populate the matrices with random values, where the sparsity rate S controls the fraction of cells that will contain zero values.
The purpose of this setup will be explained in further detail in Section 6.4.1.

Compression throughput
We conducted a series of synthetic benchmarks to gauge the basic performance characteristics of our system on all node classes presented in Table 2. First, the processing throughput of nx842-based compression, CPU-based software compression, and GPU-based decompression available in lib842 (see Section 5) are measured, respectively. As a compression payload, we employed the enwik9 dataset (see Table 3). Using other datasets did not have a significant impact on compression throughput, except for the zeros dataset. There, the compression throughput for CPU-based compression roughly doubled across all node classes. This effect is caused by the special compression template for encoding an eight-byte sequence of zeros as well as a special template for encoding a repeated occurrence of eight-byte sequences. Each special template skips the entire hash-and-lookup operations during the compression process, which yields the observed speed-up. Since both special templates address corner cases, this effect occurs rarely and compression throughput should remain stable across many datasets.
Next, the effective transfer throughput with and without on-the-fly I/O link compression is measured between the master node and a single compute node. This test has been performed using a modified version 5 of the oclBandwidthTest sample application from the NVIDIA OpenCL SDK. 57 For the tests, we use the synthetic periodic, zeros, and random datasets (see Table 3) as compression payloads. Those artificial payloads are intended to test effective data transfer bandwidth for worst case and best case edge cases. To include more representative payloads, we have included the enwik9, the OLW, as well as the Curiosity dataset (see Table 3).
To evaluate effective transfer throughput in a scale-out scenario, we have included a modified version 6 of the previous benchmarks that performs data transfers to eight nodes, simultaneously. This test uses the same data sets and only increases the data volume in proportion to the larger number of nodes. The effective transfer throughput is aggregated across all nodes.
Our measurements for the processing throughput, the effective single-node transfer throughput, and the effective scale-out transfer throughput are presented in Figure 11(A-C), respectively. Looking at the results for the processing throughput tests, the most important insight here is that compression performance on the master node is by far high enough to saturate the 10 Gbps Ethernet infrastructure used in our testing environment. On the side of decompression performance, it can be observed that the chunk size of 64kiB restricts the decompression performance of the Intel Iris Pro and the NVIDIA Tesla K80 GPUs employed in the low and medium performance compute node configurations. However, the NVIDIA Tesla V100 GPUs used in the high performance compute nodes do not seem to be affected by this issue.
The transfer throughput tests demonstrate that using the enwik9, OLW, and Curiosity datasets, on-the-fly I/O link compression improves effective transfer throughput between 1.29× and 1.81× using medium or high performance compute nodes in the single node scenario, and between 1.40× and 1.91× using any compute node configuration in the scale-out scenario. Random data as the worst-case payload has no negative impact on the throughput, whereas the benevolent periodic and zeros datasets can yield drastic performance improvements. A closer look at the scale-out results for real-world payloads reveals that the limited bandwidth, especially on the master node network interface, remains as a major bottleneck, as the aggregated effective bandwidth of the scale-out tests only slightly exceeds the single node test bandwidth.

FIGURE 12 (A-C): The execution time measurements for the semi-sparse matrix multiplication workloads using low, medium, and high power compute nodes, respectively. For each node type, the uncompressed baseline performance is compared to the performance achieved with compression enabled for the sparsity parameters S = 0.33, S = 0.50, and S = 0.67.

Benchmarks
The main goal of this section is to evaluate whether on-the-fly I/O link compression can be used to mitigate the performance overhead caused by scaling out GPU-based workloads across multiple compute nodes. Usually, the best strategy to achieve this goal is to use established benchmark suites. Unfortunately, our search for multi-GPU benchmarks implemented in OpenCL remained without a result.
Therefore, we implemented four custom benchmarks either using plain C++/OpenCL or Java/CloudCL. In our custom benchmarks, independent kernel instances are used to process partials of the input data in order to avoid inter-GPU communication. For each benchmark, we are measuring its total execution time using regular, uncompressed data transfers for inter-node communication as a performance baseline. By performing the same measurements with compressed data transfers enabled, we are using the baseline performance measurements to quantify the performance improvements introduced by our on-the-fly I/O link compression approach.
For each benchmark, we provide a brief description as well as a link to the implementation before the results are presented and discussed. The benchmarks include a semi-sparse matrix multiplication workload, a database query, a text search, as well as an image downscaling workload.

Semi-sparse matrix multiplication
This test is implemented in CloudCL and assumes a matrix multiplication workload where a certain fraction of cells can be assumed to hold zero values, but where this fraction is hard to determine or where it is not large enough to justify the use of a sparse matrix representation such as compressed sparse rows (CSR). We perform a dense multiplication of matrix A (N × M) and matrix B (M × P), yielding matrix C (N × P) as a result. The dimensions in our benchmarks are defined as N = 13,125 × n_nodes, M = 20,000, and P = 25. When node counts larger than one are used, matrix A is split horizontally and distributed across compute nodes, whereas the second matrix is sent to all compute nodes. The matrix multiplication itself is implemented using a naïve implementation strategy. 7 The amount of data to be transferred to the compute nodes roughly amounts to 4 · (N · M + M · P · n_nodes) bytes, and the computation requires roughly N · M · P flops.
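For reference, a naïve OpenCL matrix multiplication kernel of the kind used in this benchmark looks roughly as follows (an illustrative sketch, not the exact benchmark kernel; one work-item computes one element of the result matrix):

// C (N x P) = A (N x M) * B (M x P); one work-item per output element.
__kernel void matmul_naive(__global const float *A,
                           __global const float *B,
                           __global float *C,
                           const int N, const int M, const int P)
{
    const int row = get_global_id(0);
    const int col = get_global_id(1);
    if (row >= N || col >= P)
        return;

    float acc = 0.0f;
    for (int k = 0; k < M; k++)
        acc += A[row * M + k] * B[k * P + col];
    C[row * P + col] = acc;
}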
To regulate the sparsity of the input matrices, they are both generated with cells holding either random numbers or a zero value according to the sparsity parameter S. For a value of S = 0, all cells hold randomly generated values, whereas S = 1 results in a fully zeroed matrix. Therefore, the matrices used in this benchmark can be compressed with a ratio r of roughly r = 1 − S. We tested the impact of on-the-fly I/O link compression for the sparsity parameters S = 0.33, S = 0.50, and S = 0.67.
The adjustable sparsity parameter S allows us to study the impact of varying compressibility on the potential performance gains of on-the-fly I/O link compression. The measurements presented in Figure 12 demonstrate that compression can improve performance for the range of tested sparsity parameters S. On the low performance compute nodes, compression only pays off for larger node counts (n ≥ 4), with performance improvements ranging between 1.11× for S = 0.33 and 1.54× for S = 0.67.
In contrast to the low performance compute nodes, the medium and high performance compute node types show slight but consistent performance improvements even for lower node counts (1 ≤ n ≤ 2). For larger node counts (n ≥ 4), compression yields performance improvements between 1.23× for S = 0.33 and 1.87× for S = 0.67. However, it should be noted that this benchmark is dominated by transfer time and only a fraction of the execution time is spent on computation.

FIGURE 13: The execution time measurements for the database query workload are reported in (A-C) for low, medium, and high power compute nodes, respectively. Each node has to process the same volume of 28 · 100,000,000 bytes, meaning that a constant execution time across node counts would be equivalent to perfect scale-out behavior. It can be seen that up to a node count of four, on-the-fly I/O link compression enables all compute node types to approach this ideal behavior.

Database query
For this test, we implemented a database query 8 using CloudCL that mimics the characteristics of a column-oriented in-memory database. The filter implements a query inspired by Query 1 of the TPC Benchmark H. 51 Test data for this benchmark has been generated using the same algorithm as the TPC-H DBGEN data generator, 51 with the minor modification that the Java pseudo random number generator is used instead of the custom pseudo random number generator supplied with the DBGEN application.
We want to make clear that for simplicity reasons, neither the query nor the data generator fully complies with the very complex TPC-H specification. As such, our database query benchmark must not be mistaken for a TPC-H benchmark. The implemented query is a relatively simple, join-free aggregation query that involves a simple filter statement. Only the relevant data columns are transferred to the GPUs, using a columnar layout. The data volume to be transferred and processed amounts to 28 · 100,000,000 bytes per node, and 28 · 100,000,000 · n_nodes bytes in total. The GPUs perform the aggregation, to the extent possible, in parallel.
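To sketch the structure of such a query kernel (illustrative only; the column names, the filter predicate, and the use of integer atomics are ours and do not reproduce the exact benchmark implementation):

// Join-free filter-and-aggregate over columnar data: count the matching rows
// and sum their integer quantity column using global atomics.
__kernel void filter_aggregate(__global const int *shipdate,
                               __global const int *quantity,
                               const int date_threshold,
                               const int n_rows,
                               __global int *match_count,
                               __global int *quantity_sum)
{
    const int i = get_global_id(0);
    if (i >= n_rows)
        return;

    if (shipdate[i] <= date_threshold) {
        atomic_inc(match_count);
        atomic_add(quantity_sum, quantity[i]);
    }
}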
The benchmark results depicted in Figure 13 demonstrate that compression allows the filter operation to scale almost perfectly on up to four nodes, as the total execution time barely increases compared to a single node. For two to four nodes, on-the-fly I/O link compression enables almost perfect scaling behavior across all node types, as the multi-node execution times are barely higher compared to the single-node configurations.
For n = 8 nodes, performance improvements of 1.85×, 1.9×, and 2.07× are achieved for low, medium, and high performance compute nodes, respectively. With a wider network interface available on the master node, it can be assumed that compression has the potential to enable perfect scale-out behavior for even larger node counts.

Text search
Here, we implemented 9 a simplistic text search kernel that checks for a match at each position of a large text file. Unlike the preceding benchmarks, this test has been implemented in C++ using the OpenCL C++ bindings, and therefore runs directly on top of the dOpenCL library. The benchmark is performed using the Books, Wikipedia, and OLW datasets as large text corpora. We are employing a simple, computationally expensive but yet powerful implementation strategy that can match any pattern, even non-regular ones. Using this naïve approach also allows us to investigate a task that is dominated by compute time instead of data transfer time. Depending on the number of nodes, the first 1,000,000,000 · n_nodes bytes of the data set are transmitted to the compute nodes, with each node having to process 1,000,000,000 bytes.
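A much simplified, exact-match variant of such a kernel is sketched below (illustrative only; the actual benchmark kernel supports arbitrary, including non-regular, patterns):

// Each work-item tests whether the pattern occurs at its position in the text.
__kernel void text_search(__global const char *text,
                          const int text_len,
                          __constant char *pattern,
                          const int pattern_len,
                          __global int *match_count)
{
    const int i = get_global_id(0);
    if (i + pattern_len > text_len)
        return;

    for (int j = 0; j < pattern_len; j++)
        if (text[i + j] != pattern[j])
            return;

    atomic_inc(match_count);
}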
Looking at the benchmark results provided in Figure 14, the first impression might be that compression does not help too much in this use case, especially for faster compute node configurations. However, as mentioned before, this benchmark is more compute-intensive, which can be seen based on the larger performance differences between the different compute node classes. With the tests being less sensitive to data transfer volumes, it is still notable to see that compression yields 1.14×, 1.25×, and 1.43× performance improvements using n = 8 low, medium, and high performance compute nodes, respectively.
FIGURE 14 (A-C): The execution times measured for the text search benchmark on low, medium, and high performance compute nodes, respectively. The black portion of each bar represents the performance baseline using uncompressed data transfers. The data supports the assumption that this benchmark is dominated by compute time, as the different performance levels of the compute nodes can be easily identified.
Even though the benefit of on-the-fly I/O link compression is not as distinct as in the other benchmarks, compression becomes more beneficial for higher node counts.

FIGURE 15 (A-C): The execution times measured for the image downscaling benchmark on low, medium, and high performance compute nodes, respectively. The black portion of each bar represents the performance baseline using uncompressed data transfers. The measurements clearly illustrate that the workload is dominated by data transfers, as the execution time does not vary significantly across compute node classes and varying node counts. Here, on-the-fly I/O link compression yields significant performance improvements across most conditions.

Image downscaling
Last but not least, we have implemented a simple image downscaling routine 10 using C++ and the OpenCL C++ bindings. In this benchmark, an input image is read in the TIFF format and transferred to all available GPUs as an RGBA pixel buffer. To utilize multiple GPUs, the workload is split up by segmenting the image horizontally. As reference payloads, the Curiosity and Telescope datasets (see Table 3) are used. In contrast to the preceding benchmarks, this test does not clip the datasets proportionally to n_nodes; instead, the entire image is processed regardless of the employed node count.
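As an illustration of the per-pixel work performed by such a routine, the following sketch shows a simple 2×2 box-filter downscaling kernel operating on RGBA pixels (illustrative only; the benchmark kernel and its scaling factor may differ):

// Downscale an RGBA8 image by a factor of two using a 2x2 box filter.
__kernel void downscale2x(__global const uchar4 *src,
                          const int src_w, const int src_h,
                          __global uchar4 *dst)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const int dst_w = src_w / 2;
    const int dst_h = src_h / 2;
    if (x >= dst_w || y >= dst_h)
        return;

    uint4 sum = (uint4)(0);
    for (int dy = 0; dy < 2; dy++)
        for (int dx = 0; dx < 2; dx++)
            sum += convert_uint4(src[(2 * y + dy) * src_w + (2 * x + dx)]);

    dst[y * dst_w + x] = convert_uchar4(sum / 4u);
}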
The results presented in Figure 15 illustrate that this test is largely dominated by transfer time, as the baseline execution time remains mostly stable across varying node counts. With the employed datasets being highly compressible, this benchmark makes it easy to gauge the impact of on-the-fly I/O link compression, which improves execution time by up to 1.67×, 1.71×, and 1.89× on low, medium, and high performance compute nodes, respectively. Nevertheless, the test also demonstrates that scalability is ultimately limited by the fact that the network is fully saturated even when data is compressed.

Summary
In our evaluation, we have successfully tested our hypothesis that on-the-fly I/O link compression improves the overall performance of many scale-out GPU workloads using various compute node configurations by increasing the effective bandwidth between the master node and the compute nodes. Ranging between 1.11× and 2.07×, the performance improvements we observed across many workloads may not appear drastic at first sight. However, it should be noted that this speed-up was achieved without assuming any workload-specific knowledge in the compression scheme, without necessitating any modifications in the workloads themselves, and without introducing any other kind of overhead. Considering that the presented approach can deliver such speed-ups across a very wide range of GPU-based scale-out workloads, performance improvements of up to 2.07× appear much more attractive on second sight, especially as the number of applications that require multiple GPUs to satisfy their resource demands keeps growing.

CONCLUSION
This article has proposed a pipelined architecture for integrating 842-based, transparent on-the-fly I/O link compression into scale-out GPU workflows. We have implemented the proposed architecture based on the dOpenCL API forwarding library, which serves as the foundation of the CloudCL single-paradigm framework for scale-out GPU computing. Based on the successful integration, we conducted a practical evaluation using a semi-sparse matrix multiplication workload, a database query, a text search application, as well as an image downscaling workload as common, data-intensive GPU workloads. In addition to a master node equipped with nx842 hardware compression accelerators, we used three different classes of compute nodes to represent configurations with low, medium, and high-performance GPU hardware. To make the resources of the nx842 units accessible, we presented the first approach for user-space access to nx842 compression accelerators on Linux. Our evaluation confirmed a beneficial impact of on-the-fly I/O link compression on the performance of all tested workloads across all compute node configurations, achieving performance improvements up to 2.07×. In the case of the database query, on-the-fly I/O link compression even provided close-to-ideal scale-out performance for configurations using 2 and 4 compute nodes of any performance class.
Based on these insights, we are confident that our approach for on-the-fly I/O link compression can be used in conjunction with many application domains to mitigate the limited network bandwidths available in small-scale data-centers, university clusters, or in cloud computing infrastructures.
In its current form, the evaluation is limited by the 10 Gbps Ethernet infrastructure available in our lab, as well as the number of GPU-based compute nodes that were available during our experiments. For future work, extending both the network bandwidth as well as the number of compute nodes will be important aspects. Using the nx842 hardware compression units available in IBM POWER CPUs, we are able to achieve compression throughput high enough to extend the same beneficial performance effects to scale-out workloads executed across higher performance networks (e.g., 40 Gbps Ethernet and possibly even faster).
This article represents the first end-to-end demonstration and evaluation of using 842-based, transparent on-the-fly I/O link compression for scale-out GPU workloads. However, our research efforts in this direction are still ongoing, as many new questions and ideas have emerged over the course of this work. For example, we are planning to investigate the effectiveness of the presented approach using real-world scale-out GPU workloads such as machine learning training or inference tasks.

ACKNOWLEDGEMENT
We would like to thank Bulent Abali, who helped us understand all the details of the 842 compression algorithm. Also, we would like to thank Bernhard Rabe and the HPI Future SOC Lab for providing us access to the compute resources that made this work possible. Open access funding enabled and organized by Projekt DEAL.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.