HCE: A Runtime System for Efficiently Supporting Heterogeneous Cooperative Execution

Heterogeneous systems with multiple different compute devices have come into common use in recent years; the heterogeneity of the compute devices is mainly reflected in three aspects: hardware architecture, instruction set architecture, and processing capability. Heterogeneous CPU-accelerator systems in particular have attracted increasing attention. To make full use of multiple CPUs and accelerators to execute data-parallel applications, programmers may need to manually map computation and data to all available compute devices, which is tedious, error-prone, and difficult. Moreover, for some data-parallel applications, the inter-device communication can easily become the performance bottleneck of multi-device co-execution. Therefore, firstly, a runtime system is designed to support heterogeneous cooperative execution (HCE) of data-parallel applications, which helps programmers automatically and efficiently map computation and data to multiple compute devices. Secondly, an incremental data transfer method is designed to avoid redundant data transfers between devices, and a three-way overlapping communication optimization method based on software pipelining is designed to effectively hide the inter-device communication overhead. Based on our previously proposed feedback-based dynamic and elastic task scheduling (FDETS) scheme and asynchronous-based dynamic and elastic task scheduling (ADETS) scheme, the modified FDETS that supports incremental data transfer and the modified ADETS that supports three-way overlapping communication optimization are proposed, which not only effectively partition and balance the workload among multiple compute devices but also significantly reduce the data transfer overhead between devices. Thirdly, a prototype of the proposed runtime system is implemented, which provides a set of runtime APIs for task scheduling, device management, memory management, and transfer optimization.
Our experimental results show that the proposed inter-device communication optimization methods greatly reduce the communication overhead between devices and that the multi-device co-execution significantly outperforms the best single-device execution.

Several heterogeneous parallel programming models and runtime systems have been proposed to support the co-execution of data-parallel applications on a heterogeneous CPU-accelerator system, such as SKMD [9], CoopCL [10], EngineCL [11], FinePar [12], CoreTSAR [13], StarPU [14], and OmpSs [15]. These heterogeneous parallel programming models and runtime systems, which support CPU-accelerator co-execution, can help application programmers automatically map computation and data to multiple CPUs and accelerators. However, efficient inter-device task scheduling remains a great challenge for multi-device co-execution.
Some works [16]-[23] have recently concentrated on inter-device task scheduling strategies in heterogeneous CPU-accelerator systems, which either statically split work across multiple compute devices before execution or dynamically determine the workload assignment among the compute devices at runtime. These task scheduling strategies provide effective workload distribution by maximizing the utilization of all available compute devices and balancing the workload between devices, but most of them do not take inter-device communication optimization into account. For some data-parallel applications, the inter-device communication can easily become the performance bottleneck of multi-device co-execution.
In recent years, many researchers have studied inter-device communication optimization in heterogeneous CPU-accelerator systems. Gowanlock and Karsin [24] adopted CUDA streams and pinned memory to pipeline data transfers between the CPU and GPU, significantly improving the performance of a heterogeneous sorting algorithm. Zheng et al. [25] developed a library named HiWayLib to support efficient inter-device data transfers for pipeline programs executing on hybrid CPU-GPU systems, which avoids duplicated transfers of overlapped data by employing a region-based lazy copy method. Li et al. [26] proposed a dual-buffer-rotation four-stage pipelining scheme, which achieves a good overlap of CPU computation, GPU computation, and CPU-GPU data transfer. Zhang et al. [27] developed a GPU-based parallel secure machine learning framework named ParSecureML to boost the efficiency of secure two-party computation; a fine-grained double pipelining technique for overlapping PCI-E data transfer and GPU computing is adopted in ParSecureML to reduce intra-node communication overhead. Tan et al. [28] proposed a fine-grained pipelining algorithm to achieve a good overlapped execution of the GPU, CPU, PCI-E bus, and IB network, which significantly optimizes the performance of the Linpack benchmark running on large-scale hybrid CPU-GPU clusters. These existing studies prove that pipelining technology can be adopted to effectively reduce the inter-device communication overhead. However, it requires elaborately designed pipeline programs and inter-device task scheduling strategies, and this becomes even more complicated when there are multiple different compute devices in a heterogeneous system.
In our previous work [23], we proposed two inter-device task scheduling strategies to enable the multi-device co-execution of data-parallel applications: the feedback-based dynamic and elastic task scheduling (FDETS) strategy and the asynchronous-based dynamic and elastic task scheduling (ADETS) strategy. FDETS is preferable for data-parallel applications whose computation and data are uniformly distributed, while ADETS is preferable for those that are non-uniformly distributed. Detailed descriptions of FDETS and ADETS are given in Section III. Although our previous work provides efficient inter-device task scheduling for the multi-device co-execution of data-parallel applications that have a small inter-device communication overhead, the performance of FDETS and ADETS is not satisfactory for applications that have a large inter-device communication overhead, and implementing the multi-device co-execution of data-parallel applications using FDETS and ADETS still requires a significant amount of development effort from programmers. On the basis of the previously proposed FDETS and ADETS, the following extensions are proposed in this paper: (i) the modified FDETS that supports incremental data transfer, which can keep a good workload balance and avoid redundant data transfers between devices; (ii) the modified ADETS that supports three-way overlapping communication optimization, which can effectively split work across devices and hide the inter-device communication overhead; (iii) a runtime system named HCE that enables heterogeneous cooperative execution of data-parallel applications, which provides a simple and effective way for application programmers to fully exploit multiple compute devices to cooperatively execute data-parallel kernels (i.e., data-parallel for-loops) on a heterogeneous system.
This paper makes the following main contributions:
• A runtime system named HCE is designed for supporting multi-device co-execution of data-parallel kernels on heterogeneous systems, which can help programmers automatically and efficiently map computation and data to multiple compute devices.
• An incremental data transfer method is designed to avoid redundant data transfers between devices, and the modified FDETS that supports incremental data transfer is proposed.
• A three-way overlapping communication optimization method based on software pipelining is designed to effectively hide the inter-device communication overhead, and the modified ADETS that supports three-way overlapping communication optimization is proposed.
• A prototype of HCE is implemented that targets a heterogeneous system, which provides a set of runtime APIs for task scheduling, device management, memory management, and transfer optimization.
The rest of this paper is organized as follows. Section II presents the overall design of HCE. Section III describes the previous inter-device task scheduling schemes. Section IV discusses the inter-device communication optimization methods. Section V presents the implementation of HCE. Section VI gives the experimental results. Section VII reviews related work. Section VIII concludes the work.

II. THE OVERALL DESIGN OF HCE
Fig. 1 shows an overview of HCE. Programmers can use the hybrid OpenMP/CUDA/Intel Offload parallel programming model and the runtime APIs provided by HCE to write a program that can be cooperatively executed on multiple devices. Specifically, programmers first identify the computational kernel (i.e., the data-parallel kernel) that needs to be accelerated and determine which compute devices need to participate in the multi-device co-execution of the kernel. Then, programmers write the device-specific computational kernel for each compute device that participates in the multi-device co-execution, such as the CPU/GPU/MIC kernel.
Note that the CPU/GPU/MIC kernel is the CPU/GPU/MIC version of the data-parallel code that can run on the CPU/GPU/MIC and is implemented with OpenMP/CUDA/Intel Offload.
As shown in Fig. 1, the HCE runtime system provides easy-to-use runtime APIs related to task scheduling, device management, memory management, and transfer optimization, allowing application programmers to make full use of multiple compute devices to cooperatively execute data-parallel applications on a heterogeneous CPU-accelerator system in a simple and effective way. The runtime system is mainly responsible for partitioning and balancing the workload among multiple compute devices, optimizing the inter-device data transfers, and executing the device-specific computational kernel on each compute device to complete its assigned workload. For each computational kernel, it creates as many controller threads as there are compute devices participating in the multi-device co-execution. Specifically, it creates p OpenMP threads to control p compute devices (i.e., p-1 many-core accelerators and the multi-core CPUs), where the multiple CPUs are treated as one compute device. The thread t_i is in charge of running the device-specific computational kernel on the i-th accelerator, where 1 ≤ i ≤ p-1. First, t_i transfers a part of the input data from the host to the i-th accelerator. Then, t_i launches the available accelerator threads to concurrently perform the computational task assigned to the i-th accelerator. Finally, t_i transfers the results back to the host. Meanwhile, the thread t_p is in charge of running the CPU kernel on the m k-core CPUs, where m is the number of CPUs and k is the number of cores per CPU. Specifically, we enable the nested parallelism of OpenMP so that t_p spawns the specified number of nested OpenMP threads, called worker threads, to concurrently perform the computational task assigned to the CPUs.
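The controller-thread structure described above can be simulated in a short sketch. The Python code below is illustrative only: the real HCE runtime uses OpenMP controller threads with CUDA/Intel Offload kernels, whereas here the function names, the squaring stand-in kernel, and the even initial partition are all assumptions made for the example.

```python
import threading

def accelerator_controller(dev_id, chunk, results):
    # Stand-in for an accelerator controller thread: upload the input
    # chunk, run the device kernel, download the results (all simulated).
    results[dev_id] = [x * x for x in chunk]  # hypothetical device kernel

def cpu_controller(dev_id, chunk, results, workers=4):
    # The CPU controller spawns nested worker threads, mirroring the
    # nested OpenMP parallelism described in the text.
    out = [0] * len(chunk)
    def worker(w):
        for j in range(w, len(chunk), workers):
            out[j] = chunk[j] * chunk[j]
    ts = [threading.Thread(target=worker, args=(w,)) for w in range(workers)]
    for t in ts: t.start()
    for t in ts: t.join()
    results[dev_id] = out

def co_execute(data, p):
    # One controller thread per compute device: devices 0..p-2 stand for
    # accelerators, device p-1 stands for the multi-core CPUs.
    n = len(data)
    bounds = [i * n // p for i in range(p + 1)]
    results = [None] * p
    threads = []
    for i in range(p):
        chunk = data[bounds[i]:bounds[i + 1]]
        target = cpu_controller if i == p - 1 else accelerator_controller
        threads.append(threading.Thread(target=target, args=(i, chunk, results)))
    for t in threads: t.start()
    for t in threads: t.join()
    return [y for part in results for y in part]
```

In the real runtime, the per-device partition sizes would of course come from the scheduling schemes of Section III rather than an even split.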
As noted above, we can use multiple CPUs and accelerators to concurrently and cooperatively execute data-parallel kernels. However, the key issue is how to effectively split the workload among multiple devices and reduce the inter-device communication overhead, which is discussed in the following sections.

III. PREVIOUS DYNAMIC SCHEDULING SCHEMES
This section briefly describes our previously proposed interdevice task scheduling schemes [23], including the FDETS scheme and the ADETS scheme.

A. THE FDETS SCHEME
FDETS first takes 1/n of the total workload of a computational kernel (i.e., 1/n of the total number of iterations of a data-parallel for-loop) as the initial chunk size and assigns the workload of the initial chunk to each compute device that participates in the multi-device co-execution of the kernel according to the initial partition ratios; it then constantly and dynamically adjusts the chunk size and the partition ratios during execution. Specifically, after the workload of the current chunk has been completed, FDETS dynamically decides whether the next chunk size should be doubled, kept unchanged, or halved compared to the current one according to the performance change of the multi-device co-execution, and it dynamically updates the partition ratios, which determine the assignment of the workload of the next chunk between devices, by computing the relative execution speed of each compute device.
To better understand the FDETS scheme, an example is illustrated in Fig. 2. For simplicity, assume that only a CPU and a GPU are utilized. Fig. 2(a) shows the distribution of workload between the CPU and GPU for a data-parallel kernel to be executed once, where W is the total workload of the kernel and W_i is the workload of the i-th chunk. Fig. 2(b) shows the distribution of workload between the CPU and GPU for a data-parallel kernel to be executed several times. As shown in Fig. 2(b), during the first execution of the kernel, W_1 = W/16, W_2 = 2W_1, W_3 = 2W_2, W_4 = W_3, W_5 = W_4/2, and W_6 = 3W/16. Assuming that FDETS finds that the 6th chunk was processed at the fastest speed during the first execution of the kernel, the size of the 6th chunk is used as the size of the first two chunks during the second execution: as shown in Fig. 2(b), during the second execution of the kernel, W_1 = 3W/16 and W_2 = 3W/16.

[Fig. 2. The distribution of workload between the CPU and GPU under FDETS, where W_i is the workload of the i-th chunk: (a) a data-parallel kernel to be executed once; (b) a data-parallel kernel to be executed many times (first, second, ..., last execution).]

It can also be seen from Fig. 2 that the partition ratios used to determine the assignment of the workload of one chunk between the CPU and GPU are updated continuously. For example, after the second chunk depicted in Fig. 2(a) has finished processing, the partition ratio of the CPU is updated from 37.1% to 39.8%, while the partition ratio of the GPU is updated from 62.9% to 60.2%.
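The adaptation rules just described — double, keep, or halve the chunk size according to the observed speed change, and recompute each device's ratio from its relative execution speed — can be sketched as follows. This is a hedged illustration of the idea rather than HCE's actual implementation; `next_chunk_size`, `update_ratios`, and the `min_size` floor are hypothetical names and details.

```python
def next_chunk_size(curr_size, curr_speed, prev_speed, total, min_size):
    # Double, keep, or halve the chunk size according to how the measured
    # co-execution speed of the current chunk compares to the previous one.
    if curr_speed > prev_speed:
        size = curr_size * 2
    elif curr_speed < prev_speed:
        size = max(curr_size // 2, min_size)
    else:
        size = curr_size
    return min(size, total)

def update_ratios(times):
    # A device's new partition ratio is its relative execution speed:
    # the reciprocal of its time on the last chunk, normalised over devices.
    speeds = [1.0 / t for t in times]
    s = sum(speeds)
    return [v / s for v in speeds]
```

For instance, a CPU that took twice as long as the GPU on the last chunk would receive one third of the next chunk's workload.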

B. THE ADETS SCHEME
ADETS first assigns a chunk whose size is W/n to each compute device that participates in the multi-device co-execution of a data-parallel for-loop, and then immediately assigns the next chunk to a compute device once that device has completed its work. The size of the next chunk assigned to device D_i is dynamically adjusted according to the current chunk size and the variance between the previous and current execution speeds of D_i. Fig. 3 shows an example of ADETS. As shown in Fig. 3, the first and second chunks are assigned to the CPU and GPU respectively; once the CPU or GPU has finished its work, the next unassigned chunk is assigned to it immediately. As shown in Fig. 3(a), the 1st, 4th, and 6th chunks are assigned to the CPU, where W_1 = W/16, W_4 = W/16, and W_6 = 2W_4; the 2nd, 3rd, 5th, and 7th chunks are assigned to the GPU, where W_2 = W/16, W_3 = W/16, W_5 = 2W_3, and W_7 = 2W_5; the last chunk is assigned to the CPU and GPU according to the partition ratios computed in the previous executions.

[Fig. 3. The distribution of workload between the CPU and GPU under ADETS, where W_i is the workload of the i-th chunk: (a) a data-parallel kernel to be executed once; (b) a data-parallel kernel to be executed many times (first, second, ..., last execution).]

In Fig. 3(b), the data-parallel kernel needs to be executed many times. Beginning from the second execution of the kernel, the sizes of the first two chunks assigned to device D_i are determined by the size of the chunk processed by D_i at the fastest speed during the previous execution of the kernel. For example, for the CPU, ADETS finds that the 6th chunk was processed at the fastest speed during the first execution of the kernel; thus the sizes of the first two chunks assigned to the CPU are both W/8 during the second execution of the kernel.
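The asynchronous self-scheduling behaviour of ADETS — each device grabs a new, elastically sized chunk as soon as it finishes its current one — can be mimicked with a small event-driven simulation. The sketch below is a toy model under stated assumptions (fixed device speeds, simple chunk doubling, an initial chunk of W/16), not the ADETS algorithm itself; `adets_schedule` and the `devices` dictionary are illustrative names.

```python
import heapq

def adets_schedule(total_work, devices):
    # devices: {name: speed}, where speed is a fixed stand-in for the
    # execution speeds that ADETS would measure at runtime.
    n = 16                                 # assumed initial divisor
    init = total_work // n
    remaining = total_work - init * len(devices)
    heap = []                              # (finish_time, device_name)
    log = {d: [init] for d in devices}     # chunk sizes per device
    for d, speed in devices.items():
        heapq.heappush(heap, (init / speed, d))
    while remaining > 0:
        # The device that finishes first immediately grabs the next chunk,
        # elastically grown from its previous chunk size.
        t, d = heapq.heappop(heap)
        chunk = min(2 * log[d][-1], remaining)
        remaining -= chunk
        log[d].append(chunk)
        heapq.heappush(heap, (t + chunk / devices[d], d))
    return log
```

In such a simulation the faster device naturally ends up processing the larger share of the iteration space, which is the load-balancing effect ADETS aims for.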

IV. INTER-DEVICE COMMUNICATION OPTIMIZATION
This section describes our proposed two inter-device communication optimization methods.

A. MOTIVATION
If the inter-device task scheduling decision is made without considering the data transfer cost on a heterogeneous system, then for some data-parallel kernels the huge inter-device communication overhead will significantly degrade the overall performance of multi-device co-execution; that is, the inter-device communication can easily become the performance bottleneck. If the data transfer overhead is higher than the performance gain actually achieved by offloading computation, the performance of multi-device co-execution will be worse than that of the best single-device multi-thread parallel execution. Likewise, if the partitioning decision is made without considering the data transfer cost and the performance variance of partitioning, it will be suboptimal or even cause a slowdown compared to single-device execution.
As shown in Fig. 4(a), the execution times of the CPU-GPU-MIC co-execution using two different inter-device task scheduling schemes are longer than the execution time of the GPU-only execution for Jacobi with three different problem sizes. Fig. 4(b) also shows that the performance of the CPU-GPU-MIC co-execution is not as good as that of the GPU-only or MIC-only execution for FDTD2d with three different problem sizes. The huge CPU-GPU and CPU-MIC communication overheads have a great impact on the overall performance of the CPU-GPU-MIC co-execution for Jacobi and FDTD2d. Therefore, the inter-device data transfers can easily become the performance bottleneck of multi-device co-execution for some data-parallel kernels; there is a need for inter-device communication optimization, and in particular our proposed task scheduling schemes should take this into account.

B. THE INCREMENTAL DATA TRANSFER METHOD
In this subsection, we first discuss the inter-device redundant transfers and then present the modified FDETS that supports incremental data transfer.

1) The Inter-Device Redundant Transfers
For the multi-device co-execution of some data-parallel kernels, the inter-device task scheduling may incur large communication costs due to frequent inter-device data transfers. As shown in Fig. 5, the matrix addition contains a computational kernel that needs to be executed repeatedly. For simplicity, we assume that only a CPU and a GPU are used to cooperatively execute the kernel. During each execution of the kernel, a part of array A needs to be uploaded from the host to the GPU and downloaded from the GPU to the host due to the change in partition ratios. Similarly, the Jacobi iteration has two computational kernels that need to be executed repeatedly. During each execution of kernel 1, A needs to be partially uploaded to the GPU, and Anew needs to be partially downloaded from the GPU. During each execution of kernel 2, Anew needs to be partially uploaded to the GPU, and A needs to be partially downloaded from the GPU. It is apparent that there are a large number of inter-device data transfers during the repeated executions of the above two applications, and many of them may be redundant. For each accelerator that participates in the multi-device co-execution, if the data to be processed on the accelerator in the next execution are already present in the accelerator memory, but the data are nevertheless downloaded from the accelerator at the end of the current execution and uploaded to the accelerator again at the beginning of the next execution, such data transfers are considered redundant.

2) The Modified FDETS that Supports Incremental Data Transfer
To avoid redundant transfers, we design an incremental data transfer method for data-parallel applications that have one or more computational kernels needing to be executed repeatedly, such as the two applications depicted in Fig. 5. To better support the incremental data transfer, we make some modifications to FDETS. Simply put, at the beginning of each execution of a computational kernel, the total workload of the kernel is split according to the suitable partition ratios, and a part of the entire workload is assigned to each participating compute device. After each compute device has completed its work, we obtain the execution time of each compute device to calculate the new partition ratios.
The key issues to be solved for the incremental data transfer are as follows: (i) how to identify which parts of an array must be uploaded from the host to the specified accelerator at the beginning of each execution of a computational kernel; (ii) how to identify which parts of an array must be downloaded from the specified accelerator to the host at the end of each execution of a computational kernel.

[Algorithm 1. Determine which parts of an array need to be uploaded from the host to the specified accelerator. Require: p, W, i, R_prev.j and R_curr.j (j = 1 to i).]

Algorithm 1 describes how to determine which parts of an array need to be uploaded from the host to the specified accelerator. Specifically, according to the total number of iterations W of the outermost for-loop of the computational kernel, the previous partition ratios R_prev.1, R_prev.2, ..., R_prev.i, and the current partition ratios R_curr.1, R_curr.2, ..., R_curr.i, we first get the begin index W_begin.prev.i and the end index W_end.prev.i of a subarray (i.e., a section of an array) that has been processed on the i-th compute device (i.e., a specified accelerator) in the previous execution of the kernel, where 1 ≤ i ≤ p and p is the number of compute devices that participate in the multi-device co-execution. Secondly, we get the begin index W_begin.curr.i and the end index W_end.curr.i of a subarray that will need to be processed on the specified accelerator in the current execution of the kernel.

[Fig. 6. The cases for determining which parts of a subarray need to be uploaded. Legend: a subarray that has been processed in the previous execution and is already present in the accelerator memory; a whole or part of the subarray that will need to be uploaded to the accelerator in the current execution; a whole or part of the subarray that will not need to be uploaded to the accelerator in the current execution.]
Thirdly, we determine which parts of the subarray need to be uploaded to the specified accelerator by comparing W_begin.prev.i, W_end.prev.i, W_begin.curr.i, and W_end.curr.i. If the whole subarray to be processed is not present in the accelerator memory, then it needs to be uploaded to the accelerator (see cases 1 and 6 in Fig. 6); if only a part of the subarray to be processed is already present in the accelerator memory, then the other part needs to be uploaded to the accelerator (see cases 2, 3, and 5 in Fig. 6); if the whole subarray to be processed is present in the accelerator memory, then it does not need to be uploaded to the accelerator (see case 4 in Fig. 6). Finally, Up_begin.i.1 and Up_end.i.1 are used to store the begin index and the end index of the first part of data that needs to be uploaded to the accelerator, and Up_begin.i.2 and Up_end.i.2 are used to store the begin index and the end index of the second part of data that needs to be uploaded to the accelerator.

[Algorithm 2. Determine which parts of an array need to be downloaded from the specified accelerator to the host. Require: p, W, i, R_curr.j and R_next.j (j = 1 to p).]

Algorithm 2 describes how to determine which parts of an array need to be downloaded from the specified accelerator to the host. Similar to Algorithm 1, according to the total number of iterations W, the current partition ratios R_curr.1, R_curr.2, ..., R_curr.i, and the next partition ratios R_next.1, R_next.2, ..., R_next.i, we first get the begin index W_begin.curr.i and the end index W_end.curr.i of a subarray that has been processed on the i-th compute device (i.e., a specified accelerator) in the current execution of the kernel, where 1 ≤ i ≤ p. Secondly, we get the begin index W_begin.next.i and the end index W_end.next.i of a subarray that will need to be processed on the specified accelerator in the next execution of the kernel.
Thirdly, we determine which parts of the subarray need to be downloaded from the specified accelerator by comparing W_begin.curr.i, W_end.curr.i, W_begin.next.i, and W_end.next.i. If the whole subarray updated on the accelerator in the current execution is not needed by the accelerator in the next execution, then it needs to be downloaded from the accelerator (see cases 1 and 6 in Fig. 7); if only a part of the subarray updated on the accelerator in the current execution is needed by the accelerator in the next execution, then the other part needs to be downloaded from the accelerator (see cases 2, 4, and 5 in Fig. 7); if the whole subarray updated on the accelerator in the current execution is needed by the accelerator in the next execution, then it does not need to be downloaded from the accelerator (see case 3 in Fig. 7). Finally, Down_begin.i.1 and Down_end.i.1 are used to store the begin index and the end index of the first part of data that needs to be downloaded from the accelerator, and Down_begin.i.2 and Down_end.i.2 are used to store the begin index and the end index of the second part of data that needs to be downloaded from the accelerator.

[Fig. 7. The cases for determining which parts of a subarray need to be downloaded. Legend: a subarray that will need to be processed on the accelerator in the next execution; a whole or part of the subarray that is already present in the accelerator memory and needs to be downloaded from the accelerator in the current execution; a whole or part of the subarray that is already present in the accelerator memory and does not need to be downloaded from the accelerator in the current execution.]

Algorithm 3 describes the modified FDETS that supports incremental data transfer. Suppose that a computational kernel needs to be executed tolExecs times repeatedly.
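The per-device comparisons in Algorithms 1 and 2 amount to an interval subtraction: the range that must actually cross the bus is the needed range minus the range already resident on the accelerator. A minimal sketch of that computation follows, assuming 0-based, half-open index intervals and contiguous 1-D partitions; the function names are illustrative, not HCE's API.

```python
def device_interval(W, ratios, i):
    # Begin/end iteration indices (0-based, half-open) of the subarray
    # assigned to the i-th device, given the partition ratios of all devices.
    begin = int(W * sum(ratios[:i]))
    end = int(W * sum(ratios[:i + 1]))
    return begin, end

def ranges_to_transfer(resident, needed):
    # Return the parts of the `needed` interval not covered by the
    # `resident` interval. At most two ranges ever result, matching the
    # index pairs (Up_begin.i.1, Up_end.i.1) and (Up_begin.i.2, Up_end.i.2).
    nb, ne = needed
    rb, re = resident
    parts = []
    if nb < min(rb, ne):
        parts.append((nb, min(rb, ne)))   # portion before the resident data
    if max(re, nb) < ne:
        parts.append((max(re, nb), ne))   # portion after the resident data
    return parts
```

For uploads, `resident` is the interval processed in the previous execution and `needed` is the current one (Algorithm 1). For downloads the roles are swapped: pass the next-execution interval as `resident` and the currently updated interval as `needed`, yielding the parts updated now but not needed on the accelerator next time (Algorithm 2). An empty result corresponds to case 4 in Fig. 6 (or case 3 in Fig. 7).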
The incremental data transfer can be considered in each execution of the kernel, except for uploading data from the host to the accelerator in the first execution and downloading data from the accelerator to the host in the last execution. Moreover, the partition ratios that will be used in the next execution of the kernel need to be determined in advance when considering the incremental data transfer. Specifically, the initial partition ratios are adopted in the first and second executions of the kernel. Starting with the second execution of the kernel, the partition ratios that will be used in the next execution are determined by the new partition ratios computed in the previous execution.
[Algorithm 3. The modified FDETS that supports incremental data transfer. Require: p, W, the initial partition ratios R_1, R_2, ..., R_p, and tolExecs. It initializes R_prev.i = 0 and R_curr.i = R_next.i = R_i (i = 1 to p); then, for each of the tolExecs executions, each controller thread t_i (1 ≤ i ≤ p) in parallel assigns the workload W_curr.i = W × R_curr.i to D_i, performs the incremental transfers if D_i is an accelerator, and updates the partition ratios.]

As seen in Algorithm 3, during each execution of the computational kernel, we first assign the workload W_curr.i (i.e., a part of the total workload W) to device D_i according to its current partition ratio R_curr.i, where 1 ≤ i ≤ p. Secondly, we use device D_i to execute the kernel to complete W_curr.i. If device D_i is an accelerator, we perform Algorithm 1 to identify which parts of the data need to be uploaded from the host to the accelerator before execution and copy these data to the accelerator memory, and we perform Algorithm 2 to identify which parts of the data need to be downloaded from the accelerator to the host after execution and copy these data to the host memory. Thirdly, we obtain the current execution time T_curr.i of device D_i to compute its current execution speed V_curr.i. If device D_i is an accelerator, the current execution time should include the data transfer time. Fourthly, we update the previous and current partition ratios of device D_i: R_prev.i = R_curr.i and R_curr.i = R_next.i. Finally, after all p compute devices have completed the cooperative execution of the kernel, we update the next partition ratio of device D_i: R_next.i = V_curr.i / Σ_{j=1..p} V_curr.j. It is easy to see that the time complexity of both Algorithm 1 and Algorithm 2 is O(2p) in the worst case, where p is the number of compute devices that participate in the multi-device co-execution.
As shown in Algorithm 3, if a compute device is an accelerator, Algorithm 1 needs to be executed to determine which parts of an array will be uploaded from the host to the accelerator, and Algorithm 2 needs to be executed to determine which parts of an array will be downloaded from the accelerator to the host. Therefore, the time complexity of Algorithm 3 is O(7λp) in the worst case, where λ is the number of times the computational kernel needs to be executed repeatedly. It can thus be seen that the modified FDETS that supports incremental data transfer described in Algorithm 3 has a low time complexity, which also means that it has a low runtime scheduling overhead.

C. COMMUNICATION OPTIMIZATION BASED ON SOFTWARE PIPELINING
Another effective way to reduce the inter-device communication cost is to overlap data transfers with kernel execution. This subsection presents a three-way overlapping communication optimization method based on software pipelining.

1) The Three-Way Overlapping Communication Optimization Method based on Software Pipelining
The three-way overlapping communication optimization method relies on two things: (i) the "chunked" computation, i.e., the entire iteration space of a data-parallel for-loop is split into several chunks, and multiple devices are used to cooperatively process these chunks; (ii) the "three-way" overlap of uploading data to the accelerator, downloading data from the accelerator, and kernel execution.
In this work, we use three software pipelines that can be run in parallel to achieve the overlap of data transfers and kernel execution. Specifically, the first pipeline is responsible for asynchronously uploading the next chunk of data to the accelerator, the second pipeline is responsible for asynchronously processing the current chunk on the accelerator, and the third pipeline is responsible for asynchronously downloading the previous chunk of data from the accelerator.
To better understand the inter-device communication optimization method described above, an example of CPU-GPU communication optimization based on software pipelining is illustrated in Fig. 8. For simplicity, assume that the data transfers and the kernel execution take roughly the same amount of time without the communication optimization, namely that the data upload, kernel execution, and data download take up around 25%, 50%, and 25% of the total running time, respectively. Fig. 8(a) shows the distribution of workload between the CPU and GPU for a computational kernel. Fig. 8(b) shows the three-way overlap of uploading data to the GPU, downloading data from the GPU, and kernel execution.

[Fig. 8. CPU-GPU communication optimization based on software pipelining, where W_i is the i-th chunk: (a) the distribution of workload between the CPU and GPU; (b) the three-way overlap of uploading the i-th chunk of data (synchronously or asynchronously) to the GPU, executing the kernel on the GPU to process the i-th chunk, and downloading the i-th chunk of data (synchronously or asynchronously) from the GPU.]

Specifically, firstly, the first chunk (i.e., W_3) of data needs to be synchronously uploaded to the GPU before the first chunk is processed on the GPU. Secondly, pipeline 0 asynchronously uploads the second chunk (i.e., W_4) of data to the GPU while pipeline 1 begins asynchronously processing the first chunk. Thirdly, pipeline 0 asynchronously uploads the third chunk (i.e., W_5) of data to the GPU, pipeline 1 asynchronously processes the second chunk on the GPU, and pipeline 2 asynchronously downloads the first chunk of data from the GPU. Fourthly, until the last chunk (i.e., W_12) is reached, we repeat the following operations: pipeline 0 uploads the next chunk of data and pipeline 2 downloads the previous chunk of data while pipeline 1 is processing the current chunk. Finally, pipeline 2 asynchronously downloads the second-to-last chunk (i.e., W_10) of data from the GPU while pipeline 1 begins asynchronously processing the last chunk, and the last chunk of data needs to be synchronously downloaded from the GPU after it has finished processing.
In the example illustrated in Fig. 8, it is readily seen that the software pipelining mechanism can hide all of the data transfers between the CPU and GPU, except for uploading the first chunk of data and downloading the last chunk of data.
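As a quick sanity check on this claim, the timing of the example (25% upload, 50% execute, 25% download per chunk) can be modeled in a few lines of C. This is a toy model rather than part of HCE, and it assumes the per-chunk kernel time is at least as long as each transfer, as in the example above:

```c
#include <assert.h>

/* Toy timing model for the example in Fig. 8 (not part of HCE).
 * Each of n chunks needs: upload u, execute e, download d.
 * Serial execution exposes every stage; the three-way software
 * pipeline hides all transfers except the first upload and the
 * last download, assuming e >= u and e >= d as in the
 * 25%/50%/25% example. */
double time_serial(int n, double u, double e, double d) {
    return n * (u + e + d);   /* every stage serialized */
}

double time_pipelined(int n, double u, double e, double d) {
    return u + n * e + d;     /* only the first upload and the
                                 last download are exposed */
}
```

With n = 10 chunks and u = 1, e = 2, d = 1 (the 25%/50%/25% split), the serial time is 40 while the pipelined time is 22: all transfers except the first upload and the last download are hidden behind kernel execution.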

2) The Modified ADETS that Supports Three-Way Overlapping Communication Optimization
Considering that the next chunk of data needs to be uploaded to the accelerator while the current chunk is being processed on the accelerator, how to determine the appropriate size of the next chunk before processing the current chunk is a key problem. In order to solve this problem and to support the overlap of data transfers and kernel execution, we make some modifications to ADETS. The modified ADETS that supports three-way overlapping communication optimization based on software pipelining is described in Algorithm 4. Each execution of the computational kernel consists of the following steps.
Step 1: We firstly assign a chunk whose size is W curr.i to device D i and update the assigned workload W a , where W a = W a + W curr.i and 1 ≤ i ≤ p. If this is the first execution of the kernel, W curr.i = W next.i = W/n; otherwise, we find a chunk W fs.i processed by device D i at the fastest speed from the previous execution of the kernel, and W curr.i = W next.i = W fs.i . Secondly, we pre-assign a subsequent chunk whose size is W next.i to device D i and update the assigned workload W a and unassigned workload W u , where W a = W a + W next.i and W u = W − W a . Thirdly, we execute the kernel on device D i to process the first chunk assigned to it. If device D i is an accelerator, we need to synchronously upload the first chunk of data to D i before processing the first chunk assigned to it, and then we need to asynchronously upload the second chunk of data to device D i while processing the first chunk. After device D i has completed its work, we obtain the execution time of device D i to compute its execution speed.
Step 2: We firstly pre-assign the next chunk to device D i if there is unassigned workload and update the assigned and unassigned workloads. Secondly, we use device D i to process the second chunk assigned to it. If device D i is an accelerator, we use pipeline 0 to asynchronously upload the next chunk of data to device D i , we use pipeline 1 to asynchronously process the current chunk on device D i , and we use pipeline 2 to asynchronously download the previous chunk of data from device D i . After device D i has completed its work, we compute the current execution speed V curr.i of device D i . Thirdly, we determine the size of the chunk after the next chunk that will be assigned to device D i (i.e., W next_next.i ) by comparing V prev.i and V curr.i . If |V curr.i − V prev.i | ≤ V prev.i × 10%, W next_next.i = (W curr.i + W next.i ) / 2; otherwise, W next_next.i = W curr.i + W next.i when V curr.i > V prev.i , while W next_next.i = (W curr.i + W next.i ) / 4 when V curr.i < V prev.i . Finally, we update the previous execution speed V prev.i of device D i .
Step 3: Repeat Step 2 until the unassigned workload has finished assignment and all the chunks assigned to device D i have finished processing. If device D i is an accelerator, we need to synchronously download the last chunk of data from D i after the last chunk assigned to it has finished processing.

Algorithm 4 The Modified ADETS that Supports Three-Way Overlapping Communication Optimization
Require: p, W, the initial chunk size W/n, and tolExecs
 1: for t = 1 to tolExecs do
 2:   Initialize Wprev.i = 0, Wa = 0, Wu = W, and Vprev.i = 0;
 3:   for each controller thread ti, 1 ≤ i ≤ p, in parallel do
 4:     if t == 1 then
 5:       Wcurr.i = Wnext.i = Wnext_next.i = W/n;
 6:     else
 7:       Find a chunk Wfs.i processed by Di at the fastest speed from the (t−1)-th execution of the kernel;
 8:       Wcurr.i = Wnext.i = Wnext_next.i = Wfs.i;
 9:     end if
10:     Assign a chunk whose size is Wcurr.i to Di;
11:     Update the assigned workload: Wa = Wa + Wcurr.i;
12:     while Wu > 0 or Wnext.i > 0 do
13:       if Wnext.i > 0 then
14:         Pre-assign a chunk whose size is Wnext.i to Di;
            ⋮
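The speed-feedback rule that Step 2 uses to determine Wnext_next.i can be sketched in C as follows. The function name and integer types are illustrative choices, not HCE's actual API:

```c
#include <assert.h>

/* Speed-feedback chunk sizing from Step 2 of the modified ADETS.
 * v_prev / v_curr are the previous and current execution speeds of
 * device D_i; w_curr / w_next are the current and next chunk sizes.
 * Returns W_next_next.i, the size of the chunk after the next one. */
long next_next_chunk(long w_curr, long w_next, double v_prev, double v_curr) {
    double diff = v_curr > v_prev ? v_curr - v_prev : v_prev - v_curr;
    if (diff <= v_prev * 0.10)
        return (w_curr + w_next) / 2;   /* speed stable: take the average */
    if (v_curr > v_prev)
        return w_curr + w_next;         /* speeding up: grow the chunk */
    return (w_curr + w_next) / 4;       /* slowing down: shrink the chunk */
}
```

For example, with Wcurr.i = Wnext.i = 100, a stable speed yields 100, a speedup of more than 10% yields 200, and a slowdown of more than 10% yields 50.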
Compared with the original ADETS, the modified ADETS with communication optimization does not address the load imbalance that might occur at the end of the entire iteration space. Although this may result in some performance loss, the overall performance can still be greatly improved due to the significant reduction in inter-device communication overhead. When a heterogeneous system has p compute devices that participate in multi-device co-execution and a computational kernel needs to be executed λ times repeatedly, the time complexity of the modified ADETS described in Algorithm 4 is O(λW/p) in the worst case, where W is the total workload of the computational kernel. It is readily seen that the modified ADETS that supports three-way overlapping communication optimization provides low-overhead runtime scheduling.
In short, the original FDETS and ADETS presented in Section III are applicable to data-parallel applications with a small inter-device communication overhead (such as GEMM). The modified FDETS that supports incremental data transfer described in Section IV-B is suitable for data-parallel applications with one or more computational kernels that need to be executed repeatedly and may contain a large amount of redundant data transfers (such as Jacobi). The modified ADETS that supports three-way overlapping communication optimization described in Section IV-C is suitable for data-parallel applications whose multi-device co-execution may incur a huge inter-device communication overhead (such as K-means).

V. THE IMPLEMENTATION OF HCE
This section presents the runtime APIs provided by HCE and an example of using HCE.

A. THE RUNTIME APIS PROVIDED BY HCE
As shown in Fig. 1, our proposed HCE provides a set of runtime APIs for task scheduling, device management, memory management, and transfer optimization. Specifically, task scheduling aims to effectively split work across devices via schemes such as FDETS, ADETS, the modified FDETS that supports incremental data transfer, and the modified ADETS that supports three-way overlapping communication optimization. Device management is responsible for keeping and updating the required information for each compute device that participates in multi-device co-execution, such as the begin and end positions of the iteration space assigned to the specified compute device. Memory management is responsible for the allocation and deallocation of accelerator memory and for the data transfer between devices. Transfer optimization aims to effectively reduce inter-device communication overhead through our proposed incremental data transfer method and three-way overlapping communication optimization method based on software pipelining. A subset of the runtime APIs provided by HCE is listed in Table 1.

B. EXAMPLE OF USING HCE
Fig. 9 demonstrates an example of using the hybrid OpenMP/CUDA parallel programming model and the runtime APIs provided by HCE to write a matrix addition program that can be cooperatively executed on a hybrid CPU-GPU system. The implementation of the matrix addition consists of the following seven key steps.
Step 1: Programmers need to set the number of compute devices that participate in multi-device co-execution (see line 2 in Fig. 9).
Step 2: Programmers need to specify the begin and end positions of the outermost for-loop of the computational kernel (see line 4).
Step 5: Programmers can use some runtime APIs related to device management to specify the unique device ID, device type, device number, and computational kernel function of each device (see lines 27-30).
Step 6: Programmers need to specify the initial partition ratios that will be used in the task scheduling (see lines 31-32).
Step 7: Programmers need to specify the task scheduling scheme and the initial chunk size and to launch the task scheduling (see line 33).
As noted above, with the aid of HCE, it is not necessary for programmers to know which chunks of the computation will be scheduled to which device or to specify which parts of the data will be copied to/from which accelerator, because this complicated and cumbersome work is performed automatically by HCE.
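The usage pattern above can be sketched in a few lines of C. Note that the hce_* names below are illustrative stand-ins, not the actual HCE API listed in Table 1, and minimal stubs are provided so that the sketch is self-contained; Steps 3 and 4 are not shown, matching the step descriptions above:

```c
#include <assert.h>

/* Hypothetical sketch of the seven-step HCE usage pattern for the
 * matrix-addition example (Fig. 9).  All hce_* names are illustrative
 * stand-ins for the real runtime APIs; the stubs below simply count
 * configuration calls so the sketch runs on its own. */
typedef void (*kernel_fn)(long begin, long end);

int hce_calls = 0;  /* stubs count configuration calls */
void hce_set_device_num(int n)                 { (void)n; hce_calls++; }
void hce_set_loop_bounds(long b, long e)       { (void)b; (void)e; hce_calls++; }
void hce_add_device(int id, const char *type, kernel_fn k)
                                               { (void)id; (void)type; (void)k; hce_calls++; }
void hce_set_init_ratio(int id, double ratio)  { (void)id; (void)ratio; hce_calls++; }
void hce_launch(const char *sched, long chunk) { (void)sched; (void)chunk; hce_calls++; }

void cpu_kernel(long b, long e) { (void)b; (void)e; }  /* OpenMP loop body */
void gpu_kernel(long b, long e) { (void)b; (void)e; }  /* CUDA kernel launch */

int hce_matrix_add_example(long n) {
    hce_set_device_num(2);                 /* Step 1: CPU + GPU */
    hce_set_loop_bounds(0, n);             /* Step 2: outermost for-loop range */
    hce_add_device(0, "cpu", cpu_kernel);  /* Step 5: per-device information */
    hce_add_device(1, "gpu", gpu_kernel);
    hce_set_init_ratio(0, 0.2);            /* Step 6: initial partition ratios */
    hce_set_init_ratio(1, 0.8);
    hce_launch("ADETS_CO", n / 64);        /* Step 7: launch task scheduling */
    return hce_calls;
}
```

The division of labor mirrors the text: the program only declares the loop range, the devices, and the scheduling scheme; which chunks go where and which data moves to which accelerator are decided by the runtime.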

VI. EXPERIMENTAL EVALUATION
This section first presents the experimental setup, next evaluates the effectiveness of our proposed inter-device communication optimization methods, then evaluates the performance of HCE's multi-device co-execution, and finally presents the performance comparison among HCE, StarPU, and OmpSs.

A. EXPERIMENTAL SETUP
A series of experiments is conducted on the following two test platforms: (i) a hybrid CPU-GPU-MIC system consisting of two Intel Xeon E5-2640v2 CPUs, an NVIDIA Tesla K40c GPU, an Intel Xeon Phi 7110P coprocessor, and 64 GB of host memory; (ii) a hybrid CPU-GPU-GPU system consisting of two Intel Xeon E5-2680v4 CPUs and two NVIDIA Tesla P100 GPUs. The benchmarks are from [30], except that GEMM and BLKS are from the NVIDIA CUDA SDK [31]. Each benchmark includes one or more data-parallel kernels, and OpenMP, CUDA, and Intel Offload are responsible for the parallelization of the outermost for-loop within each data-parallel kernel on the CPU, GPU, and MIC, respectively. In our experiments, for each benchmark with different problem sizes, we use a random number generator to produce 100 different instances, and the average execution time over the 100 instances is reported.

B. EVALUATION OF THE INTER-DEVICE COMMUNICATION OPTIMIZATION METHODS
To evaluate the effectiveness of our proposed communication optimization methods, we run Jacobi, FDTD2d, Seidel2d, Heat3d, K-means, and BLKS on the hybrid CPU-GPU-MIC system using FDETS without communication optimization (FDETS without CO) and ADETS without communication optimization (ADETS without CO) proposed in [23], FDETS with communication optimization (FDETS with CO) described in Section IV-B2, and ADETS with communication optimization (ADETS with CO) described in Section IV-C2. The incremental data transfer method is used in the CPU-GPU-MIC co-execution of Jacobi, FDTD2d, Seidel2d, and Heat3d. The three-way overlapping communication optimization method based on software pipelining is used in the CPU-GPU-MIC co-execution of K-means and BLKS. Fig. 10 shows a comparison of the performance before and after the inter-device communication optimization in the CPU-GPU-MIC co-execution of the six benchmarks. The results show that the two communication optimization methods significantly improve the performance of the CPU-GPU-MIC co-execution.

C. PERFORMANCE EVALUATION OF HCE'S MULTI-DEVICE CO-EXECUTION
In this subsection, we first compare the performance of HCE's multi-device co-execution with that of the best single-device execution on the hybrid CPU-GPU-MIC system, and then evaluate the performance of HCE's CPU-GPU-GPU co-execution.

1) Comparison with the Best Single-Device Execution
This subsection compares the performance of HCE's CPU-GPU-MIC co-execution with that of the best single-device execution. In this experiment, the best single-device execution refers to the best one among the 16-core CPU-only, GPU-only, and MIC-only executions. To better evaluate the performance of HCE's multi-device co-execution, a suitable inter-device task scheduling scheme is adopted for each benchmark. Specifically, FDETS without CO is adopted for GEMM; ADETS without CO is adopted for BFS; FDETS with CO is adopted for Jacobi, FDTD2d, Seidel2d, and Heat3d; ADETS with CO is adopted for K-means and BLKS. The results show that HCE's CPU-GPU-MIC co-execution is much faster than the best single-device execution in most cases, e.g., it achieves performance improvements of up to 1.14×, 1.16×, and 1.43× in Seidel2d, K-means, and BLKS, respectively. However, a modest performance improvement of 0.32× is observed in BFS. This is because BFS runs much faster on the best compute device than on the other compute devices. The best compute device undertakes the majority of the work in the CPU-GPU-MIC co-execution of BFS; as a consequence, the other compute devices contribute less to the overall performance of the CPU-GPU-MIC co-execution. In general, a proper inter-device task scheduling scheme should be adopted for each data-parallel application. For example, FDETS with CO or ADETS with CO can be adopted for data-parallel applications that have a large inter-device communication overhead.

2) Performance Evaluation on Multiple GPUs
This subsection evaluates the performance of HCE's CPU-GPU-GPU co-execution. In this experiment, a suitable inter-device task scheduling scheme is adopted for each benchmark as described in Section VI-C1. Fig. 13 shows a performance comparison of the 28-core CPU-only execution, CPU-GPU co-execution, and CPU-GPU-GPU co-execution for different benchmarks with large problem size. It is obvious that HCE's multi-device co-execution is much faster than the CPU-only execution. Specifically, compared with the 28-core CPU-only execution, CPU-GPU-GPU co-execution achieves speedups of up to 6.91×, 6.33×, 5.41×, 9.96×, 12.70×, 6.07×, 8.23×, and 5.06× in Jacobi, FDTD2d, Seidel2d, Heat3d, K-means, BFS, GEMM, and BLKS, respectively. The performance benefit mainly comes from the full utilization of the two 14-core Intel Xeon E5-2680v4 CPUs and the two Tesla P100 GPUs (each with 3584 CUDA cores) of the hybrid CPU-GPU-GPU system. We believe that this is also due to the good load balancing and the low communication overhead between devices.
[Table: per-benchmark speedups of the CPU-GPU and CPU-GPU-GPU co-executions for Jacobi, FDTD2d, Seidel2d, Heat3d, K-means, BFS, GEMM, and BLKS.]
The results show that an additional GPU brings a significant performance improvement for most benchmarks, as the GPU performs much better than the CPU on these benchmarks and the increased CPU-GPU communication cost is small. Although the two 14-core E5-2680v4 CPUs perform worse than the Tesla P100 GPU for most benchmarks, HCE assigns a small amount of work to the two CPUs and achieves a good load balance. Fig. 14 shows the execution time comparison among the CPU, GPU 0, and GPU 1 in the CPU-GPU-GPU co-execution of each benchmark. In Fig. 14, T cpu, T gpu0, and T gpu1 denote the time the CPU, GPU 0, and GPU 1 take to complete their assigned workloads, respectively. From the figure, we see that the difference in the execution time of the compute devices is very small for most benchmarks. The good load balance between devices mainly benefits from our proposed inter-device task scheduling schemes.

D. COMPARISON WITH STARPU AND OMPSS
This subsection compares the performance of HCE with that of StarPU [14] and OmpSs [15]. StarPU is a runtime system that offers a unified view of the computational resources to allow programmers to exploit the computing power of the available CPUs and accelerators, while transparently handling low-level issues such as data transfers in a portable fashion. It provides task programming APIs for data partitioning and task scheduling across heterogeneous devices. OmpSs provides a task-based programming model where users can offload computation and data to multiple devices by adding OmpSs directives and clauses, and it is able to schedule tasks in a dataflow manner to the available CPUs and accelerators based on the task graph built at runtime.
In this experiment, we implement the 8 benchmarks using our proposed HCE, the C APIs of StarPU 1.3.8 [32], and the directives of OmpSs 19.06 [33] on the hybrid CPU-GPU-GPU system. The performance model-based dmda (deque model data aware) scheduler is used in StarPU; it schedules each task onto the device where its expected termination time is minimal, taking task execution performance models and data transfer time into account. The versioning scheduler is used in OmpSs; it automatically profiles each task implementation and chooses the most suitable implementation each time the task must be run. Given that StarPU and OmpSs support the overlap of data transfers with computation, our proposed FDETS with communication optimization and ADETS with communication optimization are adopted in HCE. Fig. 15 gives the performance comparison among HCE's, StarPU's, and OmpSs's CPU-GPU-GPU co-execution for different benchmarks with large problem size. The results show that HCE yields better performance than StarPU and OmpSs for some data-parallel applications. For example, compared with StarPU, HCE achieves 31.07%, 32.17%, and 29.32% performance improvements for Jacobi, FDTD2d, and BFS, respectively; compared with OmpSs, HCE achieves 38.53%, 39.39%, and 33.72% performance improvements for Seidel2d, K-means, and BLKS, respectively. Note that OmpSs achieves slightly better performance than HCE for some data-parallel kernels (such as GEMM) that need to be executed only once and have a small inter-device communication overhead. For these 8 benchmarks, HCE achieves an average of 27.61% and 33.17% performance improvement over StarPU and OmpSs, respectively. The performance improvement is mainly because HCE's inter-device task scheduling schemes provide lower runtime scheduling overhead and higher device utilization and effectively reduce the data transfer overhead between devices.
Although our proposed HCE performs better than StarPU and OmpSs for some data-parallel applications, this does not mean that HCE can replace StarPU and OmpSs, because they have their own advantages, disadvantages, and limitations.

VII. RELATED WORK
Heterogeneous CPU-accelerator systems have come into common use recently. Some directive-based parallel programming models have been developed as a powerful way to easily harness the computing power of many-core accelerators, such as hiCUDA [34], OpenMPC [35], and OpenACC [36]. They allow programmers to use directives to identify which parts of a program should be automatically offloaded to an accelerator, but they do not allow for offloading parallel codes to multiple CPUs and accelerators. Unlike these works, Shuja et al. [37] proposed a framework for single instruction multiple data instruction translation and offloading for mobile devices (SIMDOM) in heterogeneous mobile and cloud environments, which allows mobile applications to be executed on edge and cloud servers, and various modules of the SIMDOM framework for optimal execution parameters are analyzed systematically and comprehensively in [38].
Some heterogeneous parallel programming models and runtime systems [9]-[15] have recently focused on how to fully utilize multiple compute devices to execute parallel applications on a heterogeneous system. To make full use of multiple compute devices in OpenCL, SKMD [9] provides an OpenCL runtime for heterogeneous devices, which takes a kernel written for a single device and executes it across multiple devices. Similar to SKMD, CoopCL [10] also provides an OpenCL runtime that targets CPU-GPU systems, which takes applications written for a single device and automatically runs each kernel on both the CPU and GPU. EngineCL [11] presents an OpenCL-based runtime system that effectively splits the workload of a single massive data-parallel kernel across multiple different compute devices so as to maximize their utilization. FinePar [12] offers a software framework that enables fine-grained workload partitioning between the CPU and GPU on the same die for irregular applications written in OpenCL. However, since most existing parallel applications are written in OpenMP, if SKMD, CoopCL, EngineCL, or FinePar were adopted to implement the multi-device co-execution of these applications, programmers would need to make a big effort to rewrite them in OpenCL.
CoreTSAR [13] can automatically schedule data-parallel tasks between the CPU and GPU based on Accelerated OpenMP. It supports co-scheduling of parallel loop regions across an arbitrary number of CPUs and GPUs. In our previous work [23], we have discussed CoreTSAR's two dynamic scheduling strategies: quick scheduling and split scheduling. StarPU [14] and OmpSs [15] are the works most closely related to our proposed HCE. In Section VI-D, StarPU and OmpSs are introduced in detail, and a performance comparison among StarPU, OmpSs, and HCE is made. As shown in Fig. 15, the results show that HCE can achieve better performance than StarPU and OmpSs for some data-parallel applications. Moreover, HCE supports more efficient data transfer between devices in comparison with StarPU and OmpSs.
In a nutshell, our HCE provides efficient inter-device task scheduling strategies, inter-device communication optimization methods, and easy-to-use runtime APIs, which help programmers to automatically and efficiently map computation and data to multiple compute devices on a heterogeneous CPU-accelerator system.

VIII. CONCLUSION
In this paper, we present HCE, a runtime system that efficiently supports the heterogeneous cooperative execution of data-parallel applications on hybrid CPU-accelerator systems. HCE provides a simple and effective way for application programmers to fully exploit the available compute devices to accelerate their applications, reducing the burden on programmers and allowing them to concentrate on the application itself. To effectively reduce the communication overhead between devices, we propose two inter-device communication optimization methods, which have been integrated into the inter-device task scheduling schemes. A prototype of HCE is built on hybrid CPU-accelerator systems. The experimental results show that the data transfer overhead can be greatly reduced with the help of our proposed inter-device communication optimization methods, and that the multi-device co-execution using HCE provides much better performance than the best single-device execution. Compared with the widely used StarPU and OmpSs, HCE also achieves better performance for some data-parallel applications.
In future work, we plan to extend HCE to support the efficient execution of data-parallel kernels on heterogeneous CPU-accelerator clusters. Moreover, considering that the thread configuration affects the performance of a compute device, we will explore how to dynamically determine the best thread configuration of each compute device at runtime according to the workload assigned to it.