
Optimal Model Partitioning with Low-Overhead Profiling on the PIM-based Platform for Deep Learning Inference

Published: 14 February 2024


Abstract

Recently, Processing-in-Memory (PIM) has become a promising solution for achieving energy-efficient computation in data-intensive applications by placing computation near or inside the memory. In most Deep Learning (DL) frameworks, a user manually partitions a model's computational graph (CG) onto the computing devices by considering the devices' capability and the data transfer. Deep Neural Network (DNN) models have become increasingly complex to improve accuracy; thus, it is exceptionally challenging to partition the execution to achieve the best performance, especially on a PIM-based platform that requires frequent offloading of large amounts of data.

This article proposes two novel algorithms for DL inference to resolve this challenge: low-overhead profiling and optimal model partitioning. First, we reconstruct the CG by considering the devices' capability to represent all the possible scheduling paths. Second, we develop a profiling algorithm that finds the minimum profiling paths required to measure all the node and edge costs of the reconstructed CG. Finally, we devise the model partitioning algorithm that obtains the minimum execution time by applying dynamic programming to the profiled data. We evaluated our work by executing the BERT, RoBERTa, and GPT-2 models on ARM multicores with a PIM-modeled FPGA platform for various sequence lengths. For the three computing devices in the platform, i.e., CPU serial, CPU parallel, and PIM execution, we found all the costs in only four profile runs, three for node costs and one for edge costs. Also, our model partitioning algorithm achieved the highest performance in all the experiments over the execution with manually assigned device priority and the state-of-the-art greedy approach.


1 INTRODUCTION

Heterogeneous platforms, which accommodate various computing devices such as CPUs, GPUs, ASICs, and FPGAs with their rich sets of operations, have been widely used for energy-efficient and high-performance computation [8, 13, 36, 39] of a broad class of applications [11, 34, 38]. To fully exploit this heterogeneity, we should consider both each device's computation performance and the data transfer overhead incurred when a device accesses data in another device's memory during scheduling. For example, the CPU and GPU generally have their own memory spaces on a platform, requiring an explicit data copy from one memory to the other to maintain data consistency, which sometimes results in significant overall performance degradation due to the slow PCIe interface [24].

Recently, with the emergence of data-intensive and low-locality applications in Deep Learning (DL), e.g., LSTM [15] and RNN [16], Processing-in-Memory (PIM) has been adopted in computing platforms [12, 18, 19, 22, 23] to resolve the memory performance bottleneck by placing processing units near or inside the memory. Recent Deep Neural Network (DNN) models process huge amounts of data and become increasingly complex to improve accuracy [2]. However, PIM, especially in-DRAM PIM, supports only elementary operations like multiplication and addition due to tight design constraints [32], i.e., a limited design space and power budget; thus, it requires offloading data between the CPU and PIM memory spaces, i.e., cacheable and uncacheable pages [19], more frequently than the other computing devices. This makes it more challenging to program for the best performance on a PIM-based platform than on traditional heterogeneous ones because of the larger number of possible execution paths between devices.

The previous partitioning studies targeting heterogeneous platforms took two approaches: cost model-based partitioning [3, 4, 21, 26, 35] and ML-based partitioning [9, 10, 20]. The first approach measured the costs by profiling the applications and partitioned the execution using the cost model. The cost was often estimated because too many possible profile execution paths exist. The estimation approximated the computation cost through curve-fitting and the data transfer cost by dividing the data size by memory bandwidth. The cost model used the computation and data transfer costs and partitioned the application by determining whether to stay on one device or move to another. The second approach built the dataset and trained the ML model. The dataset was obtained by extracting features from the static code analysis and profiling with different input data sizes and possible device combinations. Then, the ML model was trained using the datasets to obtain the optimal partition. Finally, the ML model provided the optimal partition for a given application.

The recent DNN models' high computational complexity makes it more challenging to profile the programs [7, 25, 33], thus making it difficult to find the optimal model partition for the best performance. For example, ONNX Runtime [30] represents the computation as a computational graph CG(\(V_c,E_c\)), consisting of \(V_c\) nodes representing operators and \(E_c\) edges describing tensors, and CG is a Directed Acyclic Graph (DAG). The BERT-small model [7] is composed of 222 nodes and 262 edges in CG; such a complex CG involves many possible profiling paths, increasing exponentially as the number of devices increases. The frequent offloading of PIM data makes the problem much harder. Therefore, it is not feasible in polynomial time to profile all possible scheduling paths to measure all the node and edge costs and identify an optimal scheduling path from the profiled costs. In most current DL frameworks [1, 31], users manually partition the execution, for example, by marking which nodes map to which devices in PyTorch [31] and TensorFlow [1] and by defining each computing device's priority and capability for each operator in ONNX Runtime; thus, it is hard to obtain the best performance.

This article proposes two novel polynomial-time algorithms for optimally running a DNN inference model on a PIM-based platform: one for profiling, to recognize the minimum number of execution paths required to measure all the costs, and the other for partitioning, to achieve the best execution performance with the profiled costs. Furthermore, the methods proposed in this article can be easily applied to traditional heterogeneous platforms.

Our approach consists of the following three steps. First, we build a Device-mapped Computational Graph, DCG(\(V_d,E_d\)), from CG(\(V_c,E_c\)), in which each node represents a pair of an operator and a device capable of executing it, and each edge represents the tensor's data transfer between devices, where \(|V_d| = O(V_c\times n)\), \(|E_d| = O(E_c\times n^2)\), and n is the number of computing devices on the target platform.

Second, we profile the DCG to measure all the node and edge costs. The node cost represents the computation time of an operator on a device, and the edge cost implies the data transfer time between devices. After measuring all the costs, we substitute the partitioning problem with finding the minimum-cost path from a start to an end node in DCG. Measuring all the node costs needs n profile runs in DCG. However, it is challenging to measure all the edge costs by profiling since there are \(n^{V_d}\) possible execution paths, leading to an exponential number of application runs. Our polynomial-time profiling algorithm tracks the edges instead of execution paths since the number of edges in DCG is polynomial. Furthermore, we reduce the number of profiled edges by considering edge attributes. The data transfer between devices usually uses DMA; thus, we classify all the edges by three attributes that determine the data transfer cost: a source device, a destination device, and a data transfer size. Then, we develop a profiling algorithm to identify the minimum number of execution paths that include all the distinct attribute edges, resulting in a polynomial time complexity of \(O(V_d + E_d^{2})\).

Finally, we apply a dynamic programming technique to find the optimal model partitioning from the profiled costs for achieving the best execution performance. We prove that our partitioning problem is the same as the Assembly Line Scheduling (ALS) problem [6], and therefore, the complexity of our partitioning algorithm is \(O(V_d)\).

We implemented our profiling and partitioning algorithms on the ONNX Runtime framework [30]. We used a PIM-modeled FPGA [19] and an ARM Cortex-A53 as our experimental computing platform, i.e., three computing devices to be scheduled: the CPU serial execution, the CPU parallel execution on the multicores, and the PIM execution, targeting memory-intensive and low-locality applications. To the best of our knowledge, our work is the first to address the optimal model partitioning on a PIM-based platform. We evaluated the performance by running three Transformer-based models, i.e., BERT [7], RoBERTa [25], and GPT-2 [33].

We needed at least three profiling runs to measure all the node costs for our computing platform, i.e., three devices. Additionally, our edge profiling algorithm identified only one execution path to find all the edge costs in all the models. We analyzed the operator-by-operator performance of the models and found that PIM outperformed the others in most operators. However, PIM incurred the data transfer cost; thus, we should carefully assign the operators to devices to achieve the best overall performance. For the detailed performance analysis, we manually made two execution priority orders, (CPU parallel, PIM, and CPU serial) and (PIM, CPU parallel, and CPU serial), and modeled the state-of-the-art greedy partitioning algorithm [40]. Using the profiled costs and applying our optimal model partitioning algorithm, we achieved the highest performance in all the test cases: a speedup of 1.1\(\times\)\(\sim 3.0\times\) compared to the execution with manually assigned device priority orders and 1.09\(\times\)\(\sim 1.23\times\) over the greedy approach. Also, we explored all possible execution paths in subgraphs of the experimented models and showed that our partitioning algorithm provided the best performance.

The remainder of the article is organized as follows: Section 2 introduces background about our experimental platform, including the DL framework and the PIM computing device. Section 3 proposes our low-overhead profiling and optimal model partitioning algorithms. Section 4 presents the performance evaluation. Section 5 describes the related work, and Section 6 concludes the article.


2 BACKGROUND

This section reviews our experimental platform’s ONNX Runtime framework and the PIM computing device.

2.1 ONNX Runtime Framework

Open Neural Network Exchange (ONNX) [29] is an open format that represents a DNN model for providing interoperability between DL frameworks, such as TensorFlow [1] and PyTorch [31]. The DNN model implemented in one framework can be exported to the ONNX format and used in another. ONNX Runtime is a framework that supports and runs the ONNX format on multiple devices by adopting an execution provider interface, allowing us to conveniently integrate various devices by abstracting a computing device and its execution environment, including the libraries and drivers.

Figure 1(a) shows the execution flow of ONNX Runtime deploying multiple computing devices. ONNX Runtime transforms the ONNX format of a DNN model into Graph IR (Intermediate Representation), i.e., a computational graph (CG) and its topologically sorted CG, as shown in Figure 1(b) and (c). A node of CG represents an operator, and an edge implies a tensor representing data movement (dependence) between nodes in the form of a multi-dimensional array or a vector. ONNX Runtime performs graph optimization, partitioning, and execution in order. The graph optimizer applies various hardware-dependent and hardware-independent optimizations to CG. The graph partitioner maps each node to one of the computing devices in a device list by considering the user-defined device priority and capability. It also inserts a memory copy node if a device uses the output from another device. Finally, the graph execution stage traverses the CG nodes in topological order, assigns them to the scheduled computing devices, and executes them in that order, one at a time. Therefore, we do not consider running multiple operators on multiple computing devices simultaneously.
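As a concrete illustration, the snippet below sketches how a user expresses the device list and its priority through the providers argument of an ONNX Runtime session. The name "PIMExecutionProvider" is a hypothetical label for a custom PIM execution provider, and the model path, input name, and shape are placeholders, not values from the article.

```python
import numpy as np
import onnxruntime as ort

# Device priority is given by the order of the providers list: the graph
# partitioner assigns each node to the first listed provider that supports
# its operator and falls back to the next one otherwise.
# "PIMExecutionProvider" is a hypothetical custom provider name.
session = ort.InferenceSession(
    "bert_small.onnx",
    providers=["PIMExecutionProvider", "CPUExecutionProvider"],
)

# Placeholder input; the graph execution stage then runs the partitioned
# nodes one at a time in topological order.
feed = {"input_ids": np.zeros((1, 16), dtype=np.int64)}
outputs = session.run(None, feed)
```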

Fig. 1. (a) Execution flow of the ONNX Runtime framework. (b) A computational graph (CG). (c) A topologically sorted CG.

A user specifies a list of available computing devices in the framework, i.e., the device capability. Also, the user prioritizes the computing devices in the execution provider list to specify the device execution preference. All these specifications depend solely on the user's knowledge of the devices.

2.2 PIM Device: Silent-PIM

We used Silent-PIM as the PIM device on our experimental platform [19]; it satisfies the standard memory interface [17] and performs the PIM execution using standard memory requests.

Figure 2(a) shows the Silent-PIM's datapath, supporting bfloat16 8-way vector addition, subtraction, multiplication, and MAC operations. There are 128-bit\(\times\)4 vecA and 128-bit\(\times\)1 vecB general registers and vACC accumulator registers. The vecA register holds the 4-cycle burst data, i.e., 64 bytes, and the vecB stores data at every cycle during the burst. Whenever vecB stores 16 bytes in a cycle, Silent-PIM performs the PIM operations with its 8-way vector units. Silent-PIM performs a matrix-matrix (MM) multiplication, \(A \times B = C\), by repeating a vector-matrix (VM) multiplication column-wise as many times as the number of rows of A. The element-wise vector/matrix execution is more straightforward than the VM multiplication.
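The following NumPy sketch is a functional illustration only (it models neither the bfloat16 datapath nor DRAM timing): it shows how an MM multiplication reduces to one VM multiplication per row of A, which is why Silent-PIM re-reads B for every row and its MM time grows linearly with the number of rows.

```python
import numpy as np

def mm_as_repeated_vm(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Compute A(p,q) x B(q,r) as p vector-matrix multiplications, the way
    Silent-PIM does; B is re-read for every row of A (no caching)."""
    p, _ = A.shape
    C = np.empty((p, B.shape[1]), dtype=A.dtype)
    for i in range(p):
        C[i, :] = A[i, :] @ B   # one VM multiplication
    return C

A = np.random.rand(16, 512).astype(np.float32)
B = np.random.rand(512, 512).astype(np.float32)
assert np.allclose(mm_as_repeated_vm(A, B), A @ B, rtol=1e-3)
```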

Fig. 2. (a) Silent-PIM datapath [19]. (b) Data transfer cost in time from CPU to PIM and PIM to CPU devices when varying the element sizes.

In most DNN models, the VM multiplication and element-wise operations read source operands only once, thus exhibiting low locality and resulting in higher PIM performance than the CPU. However, if the source operands are available in caches, the CPU may outperform PIM. In the MM multiplication, Silent-PIM repeats the VM multiplication, so we prefer to use the CPU over PIM due to data locality. However, if the source operands are available only in memory, PIM may deliver higher performance than the CPU. In addition, PIM does not support the computation of complex functions, such as exp, tanh, and so on. In summary, the performance depends on the source operand location and the operator type. Therefore, we should carefully assign operators to devices, i.e., partition the workload, to achieve the best performance on the platform.

2.3 Data Transfer Cost

Our experimental platform includes two computing devices, CPU and Silent-PIM, as shown in Figure 7, and uses DMA for data transfer between them.

We cannot accurately estimate the transfer time, i.e., the edge cost, only by considering the data (tensor) size. The DMA data transfer time is affected by the following four factors: a data type, a transfer size, a source device, and a destination device. Silent-PIM uses the bfloat16 type, and the CPU uses float32; therefore, we need a data type conversion before the data transfer between PIM and CPU. The DMA data transfer time is proportional to the data size [27]. Also, we configure the PIM memory as uncacheable and the CPU memory as cacheable [19]. Due to memory ordering, uncacheable memory access takes much longer than cacheable access.

Figure 2(b) shows the data transfer cost from CPU to PIM and vice versa when varying the element sizes, consisting of the type conversion time and the DMA transfer time. The type conversion took a similar time in both cases since the CPU performed the conversion. However, the DMA transfer time differed depending on the source and destination devices, and the CPU-to-PIM transfer was faster than PIM-to-CPU. For example, the 64 KB data copy from CPU to PIM was 1.83 times faster than from PIM to CPU because the DMA reads the source data from fast cacheable memory in the former case and from slow uncacheable memory in the latter. It should be noted that the data transfer size is the same after the type conversion, i.e., 2 bytes per element.
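A minimal sketch of how such an edge cost could be composed from the factors above is shown below; the bandwidth and conversion-rate constants are placeholders chosen for illustration, not measured values from Figure 2(b).

```python
# Placeholder constants, not measured values: the effective DMA bandwidth depends
# on the direction because the source memory is cacheable (CPU) or uncacheable (PIM).
DMA_BYTES_PER_US = {
    ("cpu", "pim"): 4000.0,   # source operands read from cacheable memory
    ("pim", "cpu"): 2200.0,   # source operands read from uncacheable memory
}
CONVERT_ELEMS_PER_US = 1000.0  # float32 <-> bfloat16 conversion, done by the CPU

def edge_cost_us(n_elements: int, src: str, dst: str) -> float:
    convert = n_elements / CONVERT_ELEMS_PER_US
    dma = (n_elements * 2) / DMA_BYTES_PER_US[(src, dst)]  # 2 bytes/element after conversion
    return convert + dma

print(edge_cost_us(32 * 1024, "cpu", "pim"), edge_cost_us(32 * 1024, "pim", "cpu"))
```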


3 LOW-OVERHEAD PROFILING AND OPTIMAL MODEL PARTITIONING

Figure 3 shows the overall architecture of our extended ONNX Runtime, which adds the gray-colored components to Figure 1 to realize our low-overhead profiling and optimal DL inference model partitioning on the PIM-based computing platform.

Fig. 3. The extension of the ONNX Runtime framework from Figure 1. We added the gray-colored components for our work.

After the graph optimization, we generate DCG from CG using the computing device’s capability defined in execution providers. Our low-overhead profiling algorithm finds the minimum number of execution paths to measure all the node and edge costs in DCG, thus resulting in the lowest profile overhead. Then, from the profile runs, we apply our optimal partitioning algorithm to find an execution path to guarantee the minimum execution time. The rest of this section explains each step in detail.

3.1 Reconstructing Computational Graph for Multi-Device Computing

The original graph partitioner allocates each CG node to one of the computing devices by considering only the user-defined priority, i.e., device preference, without considering the device's operator cost and the data transfer cost. Therefore, the performance heavily relies on the user's experience and may fail to reach the desired level. Also, allocating a node's successor to a different device requires inserting a memory copy node between them.

In order to involve all the computing devices of a multi-device computing platform in the model partitioning, we derive DCG from CG, representing the devices' operator capability and their data dependencies; Figure 4 shows the DCG built from the CG in Figure 1(b) with n computing devices. The solid arrow edge (\(\rightarrow\)) represents a data dependence within the same device, and the dashed arrow edge (\(\dashrightarrow\)) shows a data dependence between different devices, requiring an explicit data transfer. We also add a start node and an end node to provide a single entry and exit in DCG. Also, the ONNX Runtime already encodes the device's capability information in the node attribute; therefore, we do not incur any additional overhead for the encoding.
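The following Python sketch is our own illustration with a hypothetical CGNode structure, not the framework's code: it expands the CG into one DCG vertex per (operator, capable device) pair and one DCG edge per producer/consumer device combination, flagging cross-device edges because only they carry a data transfer cost.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

@dataclass
class CGNode:                 # hypothetical CG node representation
    name: str
    preds: List[str]          # producer operators this node consumes tensors from
    capable: Set[int]         # device ids that support this operator

def build_dcg(cg: List[CGNode]):
    """Expand CG into DCG vertices and edges (edges to the start/end dummies omitted)."""
    vertices = [("start", -1)]
    vertices += [(n.name, d) for n in cg for d in sorted(n.capable)]
    vertices += [("end", -1)]
    by_name = {n.name: n for n in cg}
    edges: List[Tuple[Tuple[str, int], Tuple[str, int], bool]] = []
    for node in cg:
        for pred_name in node.preds:
            for ds in by_name[pred_name].capable:   # producer's device
                for dt in node.capable:             # consumer's device
                    cross = ds != dt                # dashed edge: needs a data copy
                    edges.append(((pred_name, ds), (node.name, dt), cross))
    return vertices, edges
```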

Fig. 4. DCG from Figure 1(b) for n computing devices on a multi-device computation.

3.2 Profiling DCG with the Lowest Overhead

Without loss of generality, we assign zero cost to the start and the end nodes because they are dummy operators and to the solid edges because the tensors are in the same device. We acquire the costs of the rest of the nodes and the dashed edges by profile runs.

We classify the profiling runs into node cost profiling and edge cost profiling. The node cost profiling is straightforward; we can identify all the node costs with as many profiling runs as there are devices on our target platform, i.e., n, by assigning the highest priority to each computing device in the device list one by one. For example, in Figure 4, we need n profile executions to acquire all the node costs, i.e., the paths \(A_0B_iC_0D_i\) obtained by assigning the highest priority to device i for \(i=0\) to \(n-1\). From the node cost profiling, we can measure some edge costs. However, the node profiling runs cannot measure all the edge costs.
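A sketch of the n node-cost profile runs is shown below, using ONNX Runtime's built-in profiler to read per-operator times. The PIM provider name, the model path, and the input feed are hypothetical, and the JSON fields follow ONNX Runtime's chrome-trace profile output.

```python
import json
import numpy as np
import onnxruntime as ort

def profile_node_costs(model_path, providers, feed):
    """One profile run with the given provider priority; returns per-node times."""
    opts = ort.SessionOptions()
    opts.enable_profiling = True
    sess = ort.InferenceSession(model_path, opts, providers=providers)
    sess.run(None, feed)
    trace_file = sess.end_profiling()
    with open(trace_file) as f:
        events = json.load(f)
    # "Node" events carry per-operator durations in microseconds.
    return {e["name"]: e["dur"] for e in events if e.get("cat") == "Node"}

feed = {"input_ids": np.zeros((1, 16), dtype=np.int64)}   # placeholder input
device_orders = [                                         # one run per device priority
    ["CPUExecutionProvider"],
    ["PIMExecutionProvider", "CPUExecutionProvider"],     # hypothetical PIM provider
]
node_costs = [profile_node_costs("bert_small.onnx", p, feed) for p in device_orders]
```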

Profiling all the edge costs is problematic because an exponential number of execution paths exists from the start to the end node, i.e., \(O(n^{|V_d|})\) paths in DCG. In this case, we cannot finish the process in polynomial time even if we develop a polynomial-time algorithm, due to the exponential input size. We may use a bookkeeping approach to reduce the working path size since many paths share the same subpaths. However, the number of paths still grows exponentially since every node may have n children. Therefore, instead of managing paths, we track edges since the number of edges in DCG is \(O(n^2 \times V_c)=O(n \times V_d)\), i.e., a polynomial input size. Then, we topologically visit each DCG node to find the minimum number of execution paths covering all the edges.

3.2.1 Reducing the Problem Size.

We reduce the problem size, i.e., the number of edges to be profiled, by identifying distinct attribute (DA) edges. In general, we use DMA for the data transfer between devices; thus, the data transfer cost is determined by a 4-tuple, as discussed in Section 2.3: a data type, a data size, a source device, and a destination device. When constructing DCG, we check the tuple of a new DCG edge and register it as a DA edge if its tuple differs from the pre-identified DA edges. Therefore, we do not need to profile edges with the same attributes repeatedly; we profile each DA edge only once.
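A minimal sketch of the DA-edge registration is shown below; representing each edge by its 4-tuple of attributes is our own simplification.

```python
from typing import Dict, List, Tuple

# Each DCG edge is reduced to its attribute tuple:
# (data type, transfer size in bytes, source device, destination device).
EdgeAttr = Tuple[str, int, str, str]

def find_da_edges(dcg_edge_attrs: List[EdgeAttr]) -> Dict[EdgeAttr, int]:
    """Register each distinct attribute tuple once; only these need profiling."""
    da_index: Dict[EdgeAttr, int] = {}
    for attr in dcg_edge_attrs:
        if attr not in da_index:
            da_index[attr] = len(da_index)
    return da_index

edges = [("bf16", 16384, "cpu", "pim"), ("bf16", 16384, "cpu", "pim"),
         ("bf16", 16384, "pim", "cpu")]
print(find_da_edges(edges))   # two DA edges out of three DCG edges
```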

Table 1 shows the number of DCG edges and DA edges in the transformer-based models for two devices (CPU, PIM). For example, the DCG edges (\(E_d\)) in BERT, RoBERTa, and GPT-2 decreased from 221, 628, and 527 to only 8, 8, and 7 DA edges, respectively. We observed that most nodes' operand dimensions are constant, implying that the transfer size is constant and allowing a significantly small number of DA edges compared to DCG edges. Only a small number of nodes, such as ReduceMean and SoftMax, reduce the data dimension.

                        BERT    RoBERTa    GPT-2
DCG edges (\(E_d\))     221     628        527
DA edges                8       8          7

Table 1. The Number of Edges and DA Edges in DCG from the Transformer-based Models for Two Devices

3.2.2 Edge Profiling Algorithm.

We use Figure 5 as an example to explain the edge profiling algorithm with two computing devices, devices 0 and 1. A subgraph of the initial DCG appears in Figure 5(a), where there are five DA edges (\(e_0\)\(\sim\)\(e_4\)), and operators A and C are not supported by device 1. The node profiling identifies the edge cost of \(e_4\).

Fig. 5. An example of low-overhead edge profiling applied to DCG with a profiled edge list (L) and its profiled path (P). The topological order is \([A_0, (B_1, B_0), C_0, (D_1, D_0), (E_1, E_0)]\) , and the parentheses represent the siblings. (a) Initial at \(T_0\) (b) \(T_1\) and \(T_2\) (c) \(T_3\) (d) \(T_4\) (e) \(T_5\) (f) \(T_6\) and \(T_7\) .

For the edge profiling algorithm, we augment each node with two data structures: a profiled edge list (L) and its associated path, i.e., the profiled path (P). The profiled edge list is indexed by the distinct edges in DCG, and we construct it when building DCG. In the figure, the list represents \([e_0,e_1,e_2,e_3,e_4]\). We initialize each node's fields in the profiled edge list with 0's and set the fields associated with the incoming edges during the node visit. The profiled path is a path from the start node that contains the DA edges marked in a profiled edge list. A node with more than one incoming edge may have multiple pairs of a profiled edge list and a profiled path.

Algorithm 1 identifies the minimum profiling paths for recognizing all the node and edge costs in DCG. The algorithm performs the topological sort of DCG (Line 1) and the node cost profiling (Lines 4\(\sim\)8). We can sort DCG without any overhead since we can reuse the CG's available topological sort information; we only need to treat the devices of an operator as siblings in the order. The algorithm then visits the DCG nodes in the topological order for the edge cost profiling (Line 10). At every node visit, the algorithm generates the node's profiled edge list and its profiled path by considering the DA edges on the explored path together with the predecessor's information (Lines 15\(\sim\)25). When all the DA edges are marked, we stop the algorithm (Lines 12\(\sim\)14) and use all the collected profiled paths for the profile runs (Line 20).

In the example figure, we initially color all the nodes as white, a dequeued one from the topological FIFO queue as black, and currently processed nodes as gray. In Figure 5(a) at \(T_0\), we assume that we already visited the \(A_0\) node, thus dequeuing \(A_0\) and having \([0,0,0,0,1]\) in the profile edge list (\(L(A_0)\)) and \(A_0\) in the profile path (\(P(A_0)\)). Remember that we profiled \(e_4\) from the node profiling.

Figure 5(b) shows two steps to dequeue \(B_1\) and \(B_0\): At \(T_1\), we dequeue \(B_1\) (Line 11) and generate the \([1,0,0,0,1]\) profiled edge list (\(L(B_1)\)) by adding the explored \(e_0\) DA edge to the \(A_0\)’s profiled edge list and \(A_0B_1\) of the profiled path (\(P(B_1)\)) (Lines 16\(\sim\)18). Similarly, at \(T_2\), we dequeue \(B_0\) and generate the \([0,0,0,0,1]\) profiled edge list (\(L(B_0)\)) the same as \(A_0\)’s because there is no DA edge from \(A_0\) to \(B_0\) (Line 23). We also generate \(A_0B_0\) of the profiled path (\(P(B_0)\)) (Line 16).

In Figure 5(c) at \(T_3\), we dequeue \(C_0\), which has two incoming execution paths; thus, we compare the inclusion of the profiled edge lists generated from the paths, i.e., summarize the results. One path, from \(B_0\), includes no new DA edge and generates \([0,0,0,0,1]\) for \(L(C_0)\) and \(~B_0C_0\) for \(P(C_0)\). The other path, from \(B_1\), includes a new DA edge, \(e_1\), and generates \([1,1,0,0,1]\) for \(L(C_0)\) and \(~B_1C_0\) for \(P(C_0)\). Since \([1,1,0,0,1]\) includes \([0,0,0,0,1]\), i.e., covers more DA edges, we only maintain \([1,1,0,0,1]\) with its profiled path (Lines 19\(\sim\)21).

In Figure 5(d) at \(T_4\), we dequeue \(D_1\), which has only one incoming path, from \(C_0\), including two DA edges, \(e_0\) and \(e_2\); thus, we generate \([1,1,1,0,1]\) for \(L(D_1)\) and \(~C_0D_1\) for \(P(D_1)\) using \(L(C_0)\) and \(P(C_0)\). We execute the nodes in the topological order, making the edge profiling algorithm update the profiled edge list in the same order. Therefore, we consider nodes \(D_0\) and \(D_1\) after visiting \(C_0\), even if there is an edge \(e_0\) from \(B_0\) to \(D_1\).

In Figure 5(e) at \(T_5\), we dequeue \(D_0\), which has only one incoming path, from \(C_0\), including one DA edge, \(e_1\), already registered in the predecessor \(C_0\); thus, \(D_0\)'s profiled edge list and path are the same as \(C_0\)'s, i.e., \([1,1,0,0,1]\) for \(L(D_0)\) and \(~C_0D_0\) for \(P(D_0)\).

Figure 5(f) shows two steps to dequeue \(E_1\) and \(E_0\): At \(T_6\), we dequeue \(E_1\) and generate the \([1,1,1,0,1]\) profiled edge list (\(L(E_1)\)) by adding the explored \(e_2\) DA edge to \(D_0\)'s profiled edge list and \(~D_0E_1\) of the profiled path (\(P(E_1)\)). Similarly, at \(T_7\), we dequeue \(E_0\) and generate the \([1,1,1,1,1]\) profiled edge list (\(L(E_0)\)) by adding the explored \(e_3\) DA edge to \(D_1\)'s profiled edge list and \(~D_1E_0\) of the profiled path (\(P(E_0)\)). We have now found all the DA edges, and the algorithm stops (Lines 12\(\sim\)14).

We can determine the inclusion between the profiled edge lists at each node visit in \(O(E_d)\) since there are at most \(E_d\) different profiled edge lists. Also, we can determine whether a path includes new DA edges in \(O(1)\). Therefore, the total time complexity of our profiling algorithm is \(O(V_d + E_d^{2})\), which enables finding the minimum paths that contain all the edge costs in polynomial time.
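The sketch below illustrates the core propagation step under stated assumptions: DA-edge coverage is kept as a bitmask per node together with its profiled path, dominated (included) masks are discarded, and the traversal stops when every DA edge is covered. It is a simplified illustration of the idea, not the article's Algorithm 1 verbatim, and it does not by itself guarantee the minimality of the collected path set.

```python
from typing import Dict, List, Tuple

def profiling_paths(order, in_edges, da_of_edge, n_da):
    """order: DCG vertices in topological order (single start vertex first).
    in_edges: vertex -> list of (predecessor, edge_id) pairs.
    da_of_edge: edge_id -> DA index, or None for a zero-cost same-device edge."""
    full = (1 << n_da) - 1
    state: Dict[object, List[Tuple[int, List]]] = {order[0]: [(0, [order[0]])]}
    covered, paths = 0, []
    for v in order[1:]:
        cands = []
        for pred, e in in_edges.get(v, []):
            bit = 0 if da_of_edge[e] is None else (1 << da_of_edge[e])
            for mask, path in state.get(pred, []):
                cands.append((mask | bit, path + [v]))
        # keep only masks not included in another kept mask (the inclusion test)
        cands.sort(key=lambda c: bin(c[0]).count("1"), reverse=True)
        keep: List[Tuple[int, List]] = []
        for mask, path in cands:
            if not any(mask | m == m for m, _ in keep):
                keep.append((mask, path))
        state[v] = keep
        for mask, path in keep:
            if mask | covered != covered:   # this path reaches new DA edges
                paths.append(path)
                covered |= mask
        if covered == full:                 # all DA edges covered: stop early
            break
    return paths
```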

3.3 Optimal Model Partitioning

We apply the well-known Assembly Line Scheduling (ALS) problem [6] to the optimal partitioning problem on the multi-device computing platform by showing that the two problems are the same. For simplicity, we use only two assembly lines for manufacturing, i.e., two computing devices on the platform.

Figure 6(a) shows the ALS problem of determining which nodes to select from the assembly lines to minimize the manufacturing time. In the graph, \(a_{i,j}\) denotes the cost of the jth node on the ith assembly line, \(t_{i,j}\) denotes the transfer cost from the \((j-1)\)th node on the ith assembly line, and \(e_i\) and \(x_i\) are the entry and exit costs of the ith assembly line. It assumes that the transfer cost between nodes on the same assembly line is zero. The fastest time to get a chassis from the start to the jth node on the ith assembly line, \(f_i[j]\), is expressed by Equation (1) for \(j \ge 1\), where \(f_0[0] = e_0 + a_{0,0}\) and \(f_1[0] = e_1 + a_{1,0}\):

\(f_0[j] = \min (f_0[j-1]+a_{0,j},\; f_1[j-1] + t_{1,j} + a_{0,j})\)
\(f_1[j] = \min (f_1[j-1]+a_{1,j},\; f_0[j-1] + t_{0,j} + a_{1,j})\).    (1)

Fig. 6. Comparison of the ALS and the DCG problems. (a) ALS graph. (b) Modified ALS graph. (c) DCG.

We modified the ALS graph from Figure 6(a) to Figure 6(b) by allowing nodes to have more than one incoming edge from a different assembly line (or more than one outgoing edge to a different assembly line). Then, Equation (1) changes to the following:

\(f_0[j] = \min (f_0[j-1]+a_{0,j},\; f_1[j-1] + T_{0,j} + a_{0,j})\)
\(f_1[j] = \min (f_1[j-1]+a_{1,j},\; f_0[j-1] + T_{1,j} + a_{1,j})\).    (2)

\(T_{i,j}\) denotes the sum of the transfer costs from all the data-dependent predecessors of the jth node on the ith assembly line. For example, \(T_{1,3} = t_{0,3} + t_{0,3}^{^{\prime }}\). We assume that the incoming transfer cost to node j from node k (where \(k \lt j-1\)) on the same assembly line is also zero. The equation still yields the optimal solution since (1) the transfer cost of every edge across the assembly lines is used only once in finding the optimal value and (2) \(t_{i,j}\) represents a transfer cost from the other assembly line; therefore, we can replace all the transfer costs from all data-dependent predecessors of the jth node on the ith assembly line by \(T_{i,j}\).

We can also prove the optimality of Equation (2) by induction: (i) Base case: When \(j=0\), Equation (2) satisfies \(f_0[0] = e_0+a_{0,0}\) and \(f_1[0] = e_1+a_{1,0}\) since there is no path from the other assembly line. (ii) Induction step: Suppose that Equation (2) is valid for \(j=k-1\), i.e., \(f_0[k-1]\) and \(f_1[k-1]\) are the fastest times at the \((k-1)\)th nodes of assembly lines 0 and 1. At the next node, the kth node of assembly line 0, we can identify the fastest execution time by adding the kth node's execution time (\(a_{0,k}\)) to either the cost from the node on the same assembly line (\(f_0[k-1]\)) or the cost from the nodes on the other assembly line (\(f_1[k-1] + T_{0,k}\)). The same argument holds for the other assembly line. With (i) and (ii), we prove that the recurrence on DCG still guarantees optimality.
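To make the recurrence concrete, the sketch below implements Equation (2) for two devices with a backtracking step that recovers the device assignment. The a/T/e/x inputs stand for the profiled node, transfer, entry, and exit costs, incapable nodes can be given an infinite node cost, and all names are illustrative rather than the article's implementation.

```python
from typing import List, Tuple

INF = float("inf")

def als_partition(a: List[List[float]], T: List[List[float]],
                  e=(0.0, 0.0), x=(0.0, 0.0)) -> Tuple[float, List[int]]:
    """a[i][j]: cost of node j on device i; T[i][j]: summed transfer cost into
    node j on device i from predecessors on the other device (Equation (2))."""
    n = len(a[0])
    f = [[INF] * n for _ in range(2)]
    pick = [[0] * n for _ in range(2)]          # device used by the previous node
    f[0][0], f[1][0] = e[0] + a[0][0], e[1] + a[1][0]
    for j in range(1, n):
        for i in range(2):
            stay = f[i][j - 1] + a[i][j]
            move = f[1 - i][j - 1] + T[i][j] + a[i][j]
            f[i][j], pick[i][j] = (stay, i) if stay <= move else (move, 1 - i)
    # pick the cheaper exit and backtrack the device assignment
    last = 0 if f[0][n - 1] + x[0] <= f[1][n - 1] + x[1] else 1
    best = f[last][n - 1] + x[last]
    assign = [last]
    for j in range(n - 1, 0, -1):
        last = pick[last][j]
        assign.append(last)
    assign.reverse()
    return best, assign
```

The loop visits each node a constant number of times, which is consistent with the \(O(V_d)\) complexity of the partitioning algorithm noted at the end of this section.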

We can easily map the modified ALS graph to DCG by removing incapable nodes and processing the ALS recurrence in topological order. Figure 6(c) shows an example: the 1st and 3rd operators are not supported by computing device 1; therefore, we remove them from DCG. As shown in Figure 7, in our execution model, the host CPU has two memory spaces: cacheable system memory and uncacheable PIM memory. Therefore, if two consecutive nodes are on different assembly lines (CPU-PIM or PIM-CPU), an explicit data copy is required, which results in the transfer cost \(T_{i,j}\). However, if both nodes are on the same assembly line, they use the same memory space, and their node execution time includes the memory access time; thus, no transfer cost exists. For example, the CPU's execution time already includes the memory access time through its memory hierarchy.

Fig. 7. Our experimental platform on FPGA.

The time complexity of our partitioning algorithm is \(O(V_d)\), the same as that of the ALS algorithm.


4 PERFORMANCE EVALUATION

4.1 Experimental Environment and Methodology

We developed the PIM-based platform, including the ARM multicores and Silent-PIM [19], as shown in Figure 7; thus, we could define three computing devices for the experiment: CPU_S (CPU serial execution), CPU_P (CPU parallel execution), and PIM (PIM execution). We emulated Silent-PIM on an FPGA board (Xilinx Zynq UltraScale+ board/HTG-Z920) [14]. The ARM SoC of the board consists of two areas: Processing System (PS) and Programmable Logic (PL). The host CPU, a quad-core ARM Cortex-A53 MPCore at 1.5 GHz, was on the PS side, while the PIM device was implemented on the PL side. Due to the different operating frequencies between the areas, we scaled the performance. The PS-side memory was managed as cacheable, while the PL-side memory was managed as uncacheable. The DMA engine was used for the CPU to offload PIM tasks.

We evaluated the performance of our profiling and partitioning algorithms on the platform, using three state-of-the-art transformer-based models, BERT [7], RoBERTa [25], and GPT-2 [33]. We can exploit CPU_P through the ONNX Runtime library and parallelizable operators. Table 2 shows the operators supported by PIM in the models.

Operand sizes are given as (q, r) per model; p is the sequence length.

Gemm (1): \((p, q) \times (q, r) + (r)\)
  BERT: -; RoBERTa: -; GPT-2: (3,072, 768), (768, 2,304), (768, 3,072), (768, 768)
Gemm (2): \((p, q) \times {(r, q)}^T + (r)\)
  BERT: (512, 2), (512, 512); RoBERTa: (768, 2), (768, 768); GPT-2: -
MatMul (1): \((p, q) \times (q, r)\)
  BERT: (512, 2,048), (512, 512), (2,048, 512); RoBERTa: (768, 3,072), (768, 768), (3,072, 768); GPT-2: -
Element-wise (1): \((p, r) + (p, r)\)
  r: BERT 512; RoBERTa 768; GPT-2 3,072
Element-wise (2): \((p, r) + (r)\)
  r: BERT 1/512/2,048; RoBERTa 1/768/3,072; GPT-2 1/768
Element-wise (3): \((p, r) + (1)\)
  r: BERT 2,048; RoBERTa 3,072; GPT-2 3,072
Element-wise (4): \((p, r) \times (p, r)\)
  r: BERT 2,048; RoBERTa 3,072; GPT-2 3,072
Element-wise (5): \((p, r) \times (r)\)
  r: BERT 512; RoBERTa 768; GPT-2 768
Element-wise (6): \((p, r) \times (1)\)
  r: BERT 2,048; RoBERTa 3,072; GPT-2 3,072
Element-wise (7): \((p, r) - (p)\)
  r: BERT 512; RoBERTa 768; GPT-2 768

Table 2. PIM Executable Operators in the Transformer-based Models

4.2 Operator-by-Operator Performance

Before showing the performance of our partitioning algorithm, we compared the operators' performance on CPU_S, CPU_P, and PIM. We chose two operators from Table 2: element-wise addition as (p \(\times\) 512) + (p \(\times\) 512) and matrix-matrix multiplication as (p \(\times\) 512) \(\times\) (512 \(\times\) 512), where p stands for a sequence length. It should be noted that the execution time of all the element-wise operations, such as addition, multiplication, and subtraction, is the same. Also, the execution time of Gemm is the sum of the execution times of a matrix-matrix multiplication and an element-wise addition, and the multiplication dominates the overall time.

In the element-wise addition of Figure 8(a), PIM shows a speedup of 2.1\(\times\)\(\sim 3.4\times\) compared to CPU_S and 2.2\(\times\)\(\sim 2.8\times\) compared to CPU_P. Since the element-wise addition has no data reuse, all the executions issue the same amount of memory requests to DRAM. PIM reads source operands and computes on them simultaneously; therefore, its execution is always faster than the others. Also, the execution time of PIM is linearly proportional to p, i.e., the matrix size. The performance of CPU_P increases linearly with p due to the larger task granularity of threads. For \(p=8\) and 16, CPU_P's speedup is below or close to 1.0\(\times\) due to the parallelization overhead and small task granularity.

Fig. 8. The operators’ speedup of the CPU parallel and PIM execution with respect to the serial CPU execution when varying the sequence length (p) from 8 to 64. (a) Element-wise addition: (p \(\times\) 512) + (p \(\times\) 512). (b) Matrix-matrix multiplication: (p \(\times\) 512) \(\times\) (512 \(\times\) 512).

PIM also shows higher performance than the others in the matrix-matrix multiplication, i.e., a speedup of 5.6\(\times\)\(\sim 10.4\times\) compared to CPU_S and 1.5\(\times\)\(\sim 3.1\times\) compared to CPU_P, as shown in Figure 8(b). As discussed in Section 2, PIM performs the matrix-matrix multiplication by repeating the VM multiplication, which requires reading the (512 \(\times\) 512) matrix p times and results in performance degradation as p increases. On the other hand, the CPU can fully exploit the locality of the second source operand in a cache, thus acquiring an ideal speedup close to the number of cores, 4. As a result, as p increases, the performance gap between CPU and PIM gets smaller.

4.3 Optimal Model Partitioning

We compared the performance of our optimal partitioning with two manually prioritized device orders and three variants of the state-of-the-art greedy partitioning algorithm [40]. The greedy algorithm allocates to each node the computing device with the fastest execution time without considering the communication costs between devices. Then, it gradually increases the coverage of the nodes starting from the first node and adjusts the allocation by considering the communication cost to identify a better partitioning. The algorithm does not guarantee optimality and targets the CPU-GPU environment. We faithfully modeled the greedy partitioning algorithm in [40] and compared its performance with the proposed work.

In summary, we compared the following seven execution schemes:

(1) CPU_S only: CPU serial execution only.

(2) (CPU_P,PIM,CPU_S): Prioritizing node allocation in the order of CPU parallel, PIM, and CPU serial. This is the same as assigning all the thread-parallelizable operators to the CPU parallel execution and the rest to the CPU serial execution because all the PIM-capable operators can also be executed by CPU parallel.

(3) (PIM,CPU_P,CPU_S): Prioritizing node allocation in the order of PIM, CPU parallel, and CPU serial.

(4) DUET(x=0%): Allocating to each node the computing device with the fastest execution time without considering communication costs between devices.

(5) DUET(x=25%): Additionally adjusting the node allocation by considering the communication cost on the first 25% of nodes from the start node.

(6) DUET(x=50%): Additionally adjusting the node allocation by considering the communication cost on the first 50% of nodes from the start node.

(7) OPT: Our optimal execution using CPU serial, CPU parallel, and PIM execution.

Figure 9 shows the execution time and speedup of running the three transformer-based models while varying the sequence length from 8 to 64, with respect to CPU_S. Our optimal execution OPT shows a speedup of 1.1\(\times\)\(\sim 2.1\times\) and 1.1\(\times\)\(\sim 3.0\times\) compared to (CPU_P,PIM,CPU_S) and (PIM,CPU_P,CPU_S), which manually assign nodes based on priorities. The speedup of OPT is also higher than DUET(x=50%) by 1.09\(\times\)\(\sim 1.23\times\). Overall, OPT performs the best in all the cases, showing our approach's strength and robustness.

Fig. 9. Execution time and speedup running the transformer-based models when varying the sequence length (p) from 8 to 64 with respect to CPU_S only. (a) BERT (b) RoBERTa (c) GPT-2.

The speedup of OPT decreases as the sequence length increases because the performance gap between CPU_P and PIM shrinks; the execution time of PIM is linearly proportional to the sequence length, whereas the CPU exploits higher cache locality. Therefore, as the sequence length p increases, the smaller performance gap between CPU_P and PIM reduces OPT's relative speedup. The speedup of (CPU_P,PIM,CPU_S) is higher than (PIM,CPU_P,CPU_S) on BERT and RoBERTa because Gemm requires transposing the input weight, which is not a PIM-friendly operation. The speedup of DUET(x=0%) is higher than (PIM,CPU_P,CPU_S) and (CPU_P,PIM,CPU_S) in most cases, which means that we can achieve considerable performance even by considering only the computation cost. In addition, we observed that the performance of the greedy approach gradually improves as the correction covers more nodes but is consistently lower than OPT.

4.3.1 BERT and RoBERTa.

Figure 10 shows the partitioned results of (PIM,CPU_P,CPU_S) and OPT for BERT and RoBERTa with a sequence length of 16. The OPT partitioning assigns the Gemm and the element-wise operators to CPU_P instead of PIM due to the node and edge costs, respectively. PIM takes more time than CPU_P for the Gemm execution of Table 2 (2), which requires transposing an input. Also, the division and erf operators unsupported by PIM incur high data transfer costs for their successors' PIM execution. However, the MatMul performance on PIM is superior to the others; thus, its assignment is unchanged.

Fig. 10. The partitioned results of BERT and RoBERTa ( \(p=16\) ). (a) (PIM,CPU_P,CPU_S) (b) OPT.

As a result, OPT decreased both the computation time and the data transfer time by 54% and 27% in BERT and 34% and 20% in RoBERTa compared to (PIM,CPU_P,CPU_S), as shown in Figure 11.

Fig. 11. Computation cost and data transfer cost in the models (p=16). (a) BERT (b) RoBERTa (c) GPT-2.

4.3.2 GPT-2.

Figure 12 shows the partitioned results of GPT-2 for (PIM,CPU_P,CPU_S) and OPT when the sequence length is 16. There are two Gemm operators in the decoder that dominate the execution time. They differ in size, and the operator at the bottom is more compute-intensive. The partitioning algorithm assigns the Gemm at the top to CPU_P and the other to PIM. Also, the optimal partitioning assigns the element-wise operations to CPU_P to decrease the data transfer.

Fig. 12. The partitioned results of GPT-2 ( \(p=16\) ). (a) (PIM,CPU_P,CPU_S) (b) OPT.

As shown in Figure 11(c), the computation times of the two schemes are similar, but OPT decreases the data transfer time by 64%, which leads to the optimal execution time.

4.4 Verification of the Correctness of Optimal Model Partitioning

To verify the correctness of our proposed optimal partitioning algorithm, we explored all possible execution paths in subgraphs of the experimented models, measured their execution time, and identified the minimum execution time. With three devices (PIM, CPU_P, and CPU_S), there are about 3\(^{127.5}\), 3\(^{361.6}\), and 3\(^{351.6}\) possible execution paths in BERT, RoBERTa, and GPT-2, respectively. To reduce the problem size, we chose 15 consecutive nodes of each graph as subgraphs and applied integer programming to check the correctness of our algorithm. We could not test larger subgraphs because the exploration ran out of memory.
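As a small-scale analogue of this check, the sketch below enumerates every device assignment of a 15-node chain with two devices and synthetic costs and confirms that the dynamic programming result from the als_partition sketch in Section 3.3 matches the brute-force minimum; the article's actual verification used three devices and integer programming on the real subgraph costs.

```python
import random
from itertools import product

random.seed(0)
n = 15
a = [[random.uniform(1, 10) for _ in range(n)] for _ in range(2)]   # synthetic node costs
T = [[random.uniform(0.1, 2) for _ in range(n)] for _ in range(2)]  # synthetic transfer costs

def path_cost(assign):
    cost = a[assign[0]][0]
    for j in range(1, n):
        cost += a[assign[j]][j]
        if assign[j] != assign[j - 1]:      # device switch: pay the transfer cost
            cost += T[assign[j]][j]
    return cost

brute = min(path_cost(p) for p in product((0, 1), repeat=n))  # 2^15 assignments
best, assignment = als_partition(a, T)                        # DP sketch from Section 3.3
assert abs(brute - best) < 1e-9
```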

Figure 13 shows the execution time of all possible paths, where one point represents one path’s execution time. In the figure, the star-marked point represents the minimum execution time, which is the same as the result of our partitioning algorithm.

Fig. 13. Execution time of all possible execution paths for 15-node subgraphs from (a) BERT (b) RoBERTa (c) GPT-2.

4.5 Profiling and Model Partitioning Overhead

Table 3 shows the DCG characteristics and the algorithm overhead. The CGs of BERT, RoBERTa, and GPT-2 have 222, 616, and 1,934 nodes and 262, 765, and 2,424 edges, respectively. On our experimental platform with three devices, the DCGs built from each model have 469, 1,317, and 2,593 nodes and 686, 1,951, and 1,723 edges, respectively. We could reduce them to only 20, 20, and 24 DA edges, significantly reducing the profiling overhead for finding all the edge costs. Moreover, as we already identified the costs of most DA edges during the node cost profiling, only four DA edges were left to discover in the edge cost profiling for all three models.

                                        BERT         RoBERTa       GPT-2
DCG characteristics
  CG nodes/edges (\(V_{c}/E_{c}\))      222/262      616/725       1,934/2,424
  DCG nodes/edges (\(V_{d}/E_{d}\))     469/686      1,317/1,951   2,593/1,723
  DA edges                              20           20            24
  PIM capable nodes                     108          308           245
Algorithm overhead
  Graph traversal                       0.04 s       0.04 s        0.07 s
  Required profile runs (vertex/edge)   3/1          3/1           3/1
  Optimal partitioning                  0.02 s       0.07 s        0.21 s

Table 3. The Number of Nodes/edges According to the Model and its Profiling/Partitioning Overhead

The graph traversal for finding those four edges took only 0.04 s, 0.04 s, and 0.07 s in BERT, RoBERTa, and GPT-2, respectively. BERT and RoBERTa took a similar time due to their similar CG characteristics. We needed three profile runs for the node cost profiling and only one run for the edge cost profiling. The time spent on the optimal partitioning was negligible and proportional to the graph complexity.


5 RELATED WORKS

Previous research on workload partitioning for heterogeneous platforms has focused on profiling and partitioning on CPU-GPU systems with data-parallel kernels written in OpenCL [37] or CUDA [28]. Even though the GPU has higher throughput than the CPU, the low PCIe bandwidth can degrade the overall performance due to the data transfer overhead. Therefore, accurate profiling and partitioning determine the overall performance. As mentioned, there has been no prior work on PIM-based partitioning, and our work can be easily applied to partitioning on traditional heterogeneous platforms.

The first approach statically partitions the model by profiling applications and building a cost model using the profiled data. Luk et al. [26] developed an analytical performance model to predict the execution time on CPU and GPU through curve-fitting of the profiled data. However, it did not distinguish the data transfer time from the computation time, leading to non-optimal partitioning. Albayrak et al. [3] proposed an inter-kernel greedy mapping algorithm based on profiling kernels on devices. The algorithm iterates through n kernels in topological order and assigns them to devices by comparing CPU and GPU costs, where the costs are the execution time on each device and the data transfer time from the source and sink devices. The algorithm would not provide optimal partitioning when resolving complex kernel data dependencies. Shen et al. [35] proposed an intra-kernel partitioning algorithm targeting a single kernel. When running each kernel with the CPU and GPU, the execution time of a kernel can be expressed as \(\max (T_G+T_D, T_C)\), where \(T_G\), \(T_D\), and \(T_C\) denote the GPU kernel computation time, the data transfer time, and the CPU kernel computation time, respectively. After partial profiling, they obtained two metrics: the relative hardware throughput between CPU and GPU and the ratio of GPU throughput to data transfer bandwidth. Then, they substituted \(T_G\), \(T_D\), and \(T_C\) with the two metrics and achieved the optimal partitioning when \(T_G+T_D=T_C\).

Lee et al. [21] proposed partitioning multiple kernels of an application across multiple devices. It first builds a regression model based on profiled data collected by executing kernels with various input sizes to predict the execution time on each computing device. Then, it constructs a data dependency graph of the kernels. The kernels are listed in topologically sorted order of the graph and mapped to devices based on the priority, the predicted execution time, and the data transfer cost. It further decomposes each kernel into sub-kernels and maps them across multiple devices. Our work does not predict the execution time but profiles it at very low cost, thus providing more accurate partitioning.

The second approach uses runtime information for optimal partitioning. Belviranli et al. [4] proposed a dynamic partitioning algorithm that works in two phases. The first phase learns the computational performance of each hardware device by assigning a small amount of the workload. Then, the rest of the workload is assigned according to the hardware performance in the next adaptive stage. Boyer et al. [5] decomposed a kernel into chunks of the primary computing workload. They executed a small number of chunks on each device at the initial execution. Then, using the measured execution time, they partitioned the remaining work to balance the load between devices, leading to optimal partitioning.

The third approach uses the machine learning (ML) technique to predict the optimal partitioning for the given workload. Grewe et al. [10] proposed optimal partitioning based on an ML model trained with static information from a compiler’s AST IR analysis. Similarly, Ghose et al. [9] additionally used control flow, mainly depending on the thread-id as an important deciding factor for optimal partitioning. Kofler et al. [20] proposed optimal partitioning by the ML model that was trained with static information from compile-time and dynamic information, e.g., data transfer size, from runtime.


6 CONCLUSION

In this article, we proposed the optimal model partitioning method for achieving the best performance of DL inference execution on the PIM-based platform. First, we presented the device-mapped computational graph (DCG) to represent the device capability by restructuring the existing computational graph. Then, we introduced the algorithm to find the minimum profile runs to obtain all node and edge costs in DCG in polynomial time. Also, we classified the DCG edges into distinct attribute edges that must be profiled, significantly reducing the problem size. Finally, our dynamic programming technique adopted from the ALS approach provided the optimal model partitioning from DCG.

We evaluated our method on the PIM-modeled FPGA platform with the ARM multicores by running three transformer-based models, varying the sequence length p from 8 to 64. Our partitioning algorithm achieved a speedup of 1.1\(\times\)\(\sim 2.1\times\) and 1.1\(\times\)\(\sim 3.0\times\) compared to the execution with manually assigned device priority orders (CPU parallel, PIM, and CPU serial) and (PIM, CPU parallel, and CPU serial), respectively, and 1.09\(\times\)\(\sim 1.23\times\) over the state-of-the-art greedy approach. Also, we explored all possible execution paths in subgraphs of the experimented models and showed that our partitioning algorithm provided the best performance. Our low-cost edge profiling algorithm found that only one profile run was needed to obtain all the edge costs; the graph traversal took only 0.04 s, 0.04 s, and 0.07 s for each model with \(p=16\), which was negligible. Also, the partitioning algorithm took little time. We will apply our work to platforms with higher heterogeneity, including GPUs and application-specific accelerators.

REFERENCES

[1] Abadi Martín, Barham Paul, Chen Jianmin, Chen Zhifeng, Davis Andy, Dean Jeffrey, Devin Matthieu, Ghemawat Sanjay, Irving Geoffrey, Isard Michael, Kudlur Manjunath, Levenberg Josh, Monga Rajat, Moore Sherry, Murray Derek G., Steiner Benoit, Tucker Paul, Vasudevan Vijay, Warden Pete, Wicke Martin, Yu Yuan, and Zheng Xiaoqiang. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation. Savannah, GA, 265-283.
[2] OpenAI. 2023. GPT-4 technical report. arXiv:2303.08774. Retrieved from https://arxiv.org/abs/2303.08774
[3] Albayrak Omer Erdil, Akturk Ismail, and Ozturk Ozcan. 2012. Effective kernel mapping for OpenCL applications in heterogeneous platforms. In Proceedings of the 41st International Conference on Parallel Processing Workshops. Pittsburgh, PA, 81-88.
[4] Belviranli Mehmet E., Bhuyan Laxmi N., and Gupta Rajiv. 2013. A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures. ACM Transactions on Architecture and Code Optimization 9, 4 (2013), 1-20.
[5] Boyer Michael, Skadron Kevin, Che Shuai, and Jayasena Nuwan. 2013. Load balancing in a changing world: Dealing with heterogeneity and performance variability. In Proceedings of the 10th ACM International Conference on Computing Frontiers. Ischia, Italy, 1-10.
[6] Cormen Thomas H., Leiserson Charles E., Rivest Ronald L., and Stein Clifford. 2001. Introduction to Algorithms (2nd. ed.). MIT Press.
[7] Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https://arxiv.org/abs/1810.04805
[8] Gharaibeh Abdullah, Costa Lauro Beltrao, Santos-Neto Elizeu, and Ripeanu Matei. 2012. A yoke of oxen and a thousand chickens for heavy lifting graph processing. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. Minneapolis, MN, 345-354.
[9] Ghose Anirban, Dey Soumyajit, Mitra Pabitra, and Chaudhuri Mainak. 2016. Divergence aware automated partitioning of OpenCL workloads. In Proceedings of the 9th India Software Engineering Conference. Goa, India, 131-135.
[10] Grewe Dominik and O'Boyle Michael F. P. 2011. A static task partitioning approach for heterogeneous systems using OpenCL. In Proceedings of the 20th International Conference on Compiler Construction. Saarbrücken, Germany, 286-305.
[11] Gupta Rajesh K. and Micheli Giovanni De. 1993. Hardware-software cosynthesis for digital systems. IEEE Design and Test of Computers 10, 3 (1993), 29-41.
[12] He Mingxuan, Song Choungki, Kim Ilkon, Jeong Chunseok, Kim Seho, Park Il, Thottethodi Mithuna, and Vijaykumar T. N. 2020. Newton: A DRAM-maker's accelerator-in-memory (AiM) architecture for machine learning. In Proceedings of the 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture. 372-385.
[13] Hermann Everton, Raffin Bruno, Faure Francois, Gautier Thierry, and Allard Jeremie. 2010. Multi-GPU and multi-CPU parallelization for interactive physics simulations. In Proceedings of the 16th European Conference on Parallel Processing. Berlin, 235-246.
[14] Hitech Global. 2017. HTG-Z920. Retrieved August 07, 2022 from https://www.xilinx.com/products/boards-and-kits/1-qwrzuv.html
[15] Hochreiter Sepp and Schmidhuber Jürgen. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735-1780.
[16] Jain Lakhmi and Medsker Larry. 1999. Recurrent Neural Networks: Design and Applications (1st ed.). CRC Press, Inc.
[17] JEDEC Standard: DDR4 SDRAM JESD79-4B. 2012. Retrieved from https://www.jedec.org/standards-documents/docs/jesd79-4a
[18] Ke Liu, Zhang Xuan, So Jinin, Lee Jong-Geon, Kang Shin-Haeng, Lee Sukhan, Han Songyi, Cho YeonGon, Kim Jin Hyun, Kwon Yongsuk, Kim KyungSoo, Jung Jin, Yun Ilkwon, Park Sung Joo, Park Hyunsun, Song Joonho, Cho Jeonghyeon, Sohn Kyomin, Kim Nam Sung, and Lee Hsien-Hsin S. 2022. Near-memory processing in action: Accelerating personalized recommendation with AxDIMM. IEEE Micro 42, 1 (2022), 116-127.
[19] Kim Chang Hyun, Lee Won Jun, Paik Yoonah, Kwon Kiyong, Kim Seok Young, Park Il, and Kim Seon Wook. 2022. Silent-PIM: Realizing the processing-in-memory computing with standard memory requests. IEEE Transactions on Parallel and Distributed Systems 33, 2 (2022), 251-262.
[20] Kofler Klaus, Grasso Ivan, Cosenza Biagio, and Fahringer Thomas. 2013. An automatic input-sensitive approach for heterogeneous task partitioning. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing. New York, NY, 149-160.
[21] Lee Janghaeng, Samadi Mehrzad, and Mahlke Scott. 2015. Orchestrating multiple data-parallel kernels on multiple devices. In Proceedings of the 24th International Conference on Parallel Architecture and Compilation Techniques. San Francisco, CA, 256-366.
[22] Lee Sukhan, Kang Shin-Haeng, Lee Jaehoon, Kim Hyeonsu, Lee Eojin, Seo Seungwoo, Yoon Hosang, Lee Seungwon, Lim Kyounghwan, Shin Hyunsung, Kim Hyunsung, Seongil O, Lyer Anand, Wang David, Sohn Kyomin, and Kim Nam Sung. 2021. Hardware architecture and software stack for PIM based on commercial DRAM technology: Industrial product. In Proceedings of the ACM/IEEE 48th Annual International Symposium on Computer Architecture. 43-56.
[23] Lee Sukhan, Kang Shin-haeng, Lee Jaehoon, Kim Hyeonsu, Lee Eojin, Seo Seungwoo, Yoon Hosang, Lee Seungwon, Lim Kyounghwan, Shin Hyunsung, Kim Jinhyun, Seongil O, Iyer Anand, Wang David, Sohn Kyomin, and Kim Nam Sung. 2021. Hardware architecture and software stack for PIM based on commercial DRAM technology: Industrial product. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture. 43-56.
[24] Li Ang, Song Shuaiwen Leon, Chen Jieyang, Li Jiajia, Liu Xu, Tallent Nathan, and Barker Kevin J. 2020. Evaluating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect. IEEE Transactions on Parallel and Distributed Systems 31, 1 (2020), 94-110.
[25] Liu Yinhan, Ott Myle, Goyal Naman, Du Jingfei, Joshi Mandar, Chen Danqi, Levy Omer, Lewis Mike, Zettlemoyer Luke, and Stoyanov Veselin. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692. Retrieved from https://arxiv.org/abs/1907.11692
[26] Luk Chi-Keung, Hong Sunpyo, and Kim Hyesoon. 2009. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture. New York, NY, 45-55.
[27] Microchip. 2008. Direct Memory Access (DMA) (Part III). Retrieved January 02, 2022 from http://ww1.microchip.com/downloads/en/devicedoc/70215c.pdf
[28] NVIDIA Corporation. 2020. CUDA. Retrieved January 02, 2022 from https://developer.nvidia.com/cuda-toolkit
[29] ONNX Developers. 2019. Open Neural Network Exchange (ONNX). Retrieved December 22, 2021 from https://onnx.ai/
[30] ONNX Runtime Developers. 2021. ONNX Runtime. Retrieved December 22, 2021 from https://onnxruntime.ai/
[31] Paszke Adam, Gross Sam, Chintala Soumith, Chanan Gregory, Yang Edward, DeVito Zachary, Lin Zeming, Desmaison Alban, Antiga Luca, and Lerer Adam. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 32nd Conference on Neural Information Processing Systems. Vancouver, Canada, 8026-8037.
[32] Peng Yarui, Ku Bon Woong, Park Younsik, Park Kwang-Il, Jang Seong-Jin, Choi Joo Sun, and Lim Sung Kyu. 2015. Design, packaging, and architectural policy co-optimization for DC power integrity in 3D DRAM. In Proceedings of the 52nd Annual Design Automation Conference. San Francisco, CA, 1-6.
[33] Radford Alec, Wu Jeff, Child Rewon, Luan David, Amodei Dario, and Sutskever Ilya. 2019. Language models are unsupervised multitask learners.
[34] Rose Jonathan, Gamal Abbas El, and Sangiovanni-Vincentelli Alberto. 1993. Architecture of field-programmable gate arrays. Proc. IEEE 81, 7 (1993), 1013-1029.
[35] Shen Jie, Varbanescu Ana Lucia, Zou Peng, and Sips Henk. 2016. Workload partitioning for accelerating applications on heterogeneous platforms. IEEE Transactions on Parallel and Distributed Systems 27, 9 (2016), 2766-2780.
[36] Song Fengguang, Tomov Stanimire, and Dongarra Jack. 2012. Enabling and scaling matrix computations on heterogeneous multi-core and multi-GPU systems. In Proceedings of the 26th ACM International Conference on Supercomputing. San Servolo Island, Venice, Italy, 365-376.
[37] Stone John E., Gohara David, and Shi Guochun. 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science and Engineering 12, 3 (2010), 66-73.
[38] Thompson Chris J., Hahn Sahngyun, and Oskin Mark. 2002. Using modern graphics architectures for general-purpose computing: A framework and analysis. In Proceedings of the ACM/IEEE 35th Annual International Symposium on Computer Architecture. Istanbul, Turkey, 306-317.
[39] Wu Jing and Jaja Joseph. 2013. High performance FFT based Poisson solver on a CPU-GPU heterogeneous platform. In Proceedings of the IEEE 27th International Symposium on Parallel and Distributed Processing. Cambridge, MA, 115-125.
[40] Zhang Minjia, Hu Zehua, and Li Mingqin. 2021. DUET: A compiler-runtime subgraph scheduling approach for tensor programs on a coupled CPU-GPU architecture. In Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium. 151-161.
