Abstract
Recently, Processing-in-Memory (PIM) has become a promising solution for energy-efficient computation in data-intensive applications by placing computation near or inside the memory. In most Deep Learning (DL) frameworks, a user manually partitions a model’s computational graph (CG) onto the computing devices by considering the devices’ capability and the data transfer. Deep Neural Network (DNN) models have become increasingly complex to improve accuracy; thus, it is exceptionally challenging to partition the execution for the best performance, especially on a PIM-based platform requiring frequent offloading of large amounts of data.
This article proposes two novel algorithms for DL inference to resolve the challenge: low-overhead profiling and optimal model partitioning. First, we reconstruct the CG by considering the devices’ capability to represent all the possible scheduling paths. Second, we develop a profiling algorithm that finds the minimum set of profiling paths required to measure all the node and edge costs of the reconstructed CG. Finally, we devise a model partitioning algorithm that obtains the minimum execution time by applying dynamic programming to the profiled data. We evaluated our work by executing the BERT, RoBERTa, and GPT-2 models with various sequence lengths on ARM multicores combined with a PIM-modeled FPGA platform. For the three computing devices in the platform, i.e., CPU serial, CPU parallel, and PIM execution, we found all the costs in only four profile runs: three for node costs and one for edge costs. Moreover, our model partitioning algorithm achieved the highest performance in all the experiments, outperforming both execution with manually assigned device priorities and the state-of-the-art greedy approach.
1 INTRODUCTION
The heterogeneous platform, accommodating various computing devices, such as CPUs, GPUs, ASICs, and FPGAs, with their rich sets of operations, has been widely used for energy-efficient and high-performance computation [8, 13, 36, 39] across a broad class of applications [11, 34, 38]. To fully exploit this heterogeneity, scheduling must consider both each device’s computation performance and the data transfer overhead incurred when a device accesses data residing on another device. For example, CPU and GPU generally have their own memory spaces on a platform, requiring an explicit data copy from one memory to another to maintain data consistency, which can cause significant overall performance degradation due to the slow PCIe interface [24].
Recently, with the emergence of data-intensive, low-locality applications in Deep Learning (DL), e.g., LSTM [15] and RNN [16], Processing-in-Memory (PIM) has been adopted in computing platforms [12, 18, 19, 22, 23] to resolve the memory performance bottleneck by placing processing units near or inside the memory. Recent Deep Neural Network (DNN) models process huge amounts of data and become increasingly complex to improve accuracy [2]. However, PIM, especially in-DRAM PIM, supports only elementary operations like multiplication and addition due to tight design constraints [32], i.e., the limited design space and power budget; thus, it requires offloading data between the CPU and PIM memory spaces, i.e., cacheable and uncacheable pages [19], more frequently than the other computing devices. This makes achieving the best performance on a PIM-based platform more challenging than on traditional heterogeneous ones because of the larger number of possible execution paths between devices.
The previous partitioning studies targeting heterogeneous platforms took two approaches: cost model-based partitioning [3, 4, 21, 26, 35] and ML-based partitioning [9, 10, 20]. The first approach measured the costs by profiling the applications and partitioned the execution using the cost model. The cost was often estimated because too many possible profile execution paths exist. The estimation approximated the computation cost through curve-fitting and the data transfer cost by dividing the data size by memory bandwidth. The cost model used the computation and data transfer costs and partitioned the application by determining whether to stay on one device or move to another. The second approach built the dataset and trained the ML model. The dataset was obtained by extracting features from the static code analysis and profiling with different input data sizes and possible device combinations. Then, the ML model was trained using the datasets to obtain the optimal partition. Finally, the ML model provided the optimal partition for a given application.
The high computational complexity of recent DNN models makes it more challenging to profile the programs [7, 25, 33], and thus to find the optimal model partition for the best performance. For example, ONNX Runtime [30] represents the computation as a computational graph CG(\(V_c,E_c\)), a Directed Acyclic Graph (DAG) whose \(V_c\) nodes represent operators and whose \(E_c\) edges describe tensors. The BERT-small model [7] comprises 222 nodes and 262 edges in its CG: such a complex CG involves many possible profiling paths, and their number increases exponentially with the number of devices. The frequent offloading of PIM data makes the problem even harder. Therefore, profiling all possible scheduling paths to measure all the node and edge costs, and identifying an optimal scheduling path from the profiled costs, is infeasible in polynomial time. In most current DL frameworks [1, 31], users manually partition the execution, making it hard to obtain the best performance; for example, users mark which nodes map to which devices in PyTorch [31] and TensorFlow [1], and define each computing device’s priority and capability per operator in ONNX Runtime.
This article proposes two novel polynomial-time algorithms for optimally running DNN model inference on a PIM-based platform: a profiling algorithm that recognizes the minimum number of execution paths needed to measure all the costs, and a partitioning algorithm that achieves the best execution performance with the profiled costs. Furthermore, the proposed methods apply equally well to traditional heterogeneous platforms.
Our approach consists of the following three steps. First, we build a Device-mapped Computational Graph, DCG(\(V_d,E_d\)), from CG(\(V_c,E_c\)), in which each node represents a pair of an operator and a device capable of executing it, and each edge represents the tensor’s data transfer between devices, where \(|V_d| = O(V_c\times n)\), \(|E_d| = O(E_c\times n^2)\), and \(n\) is the number of computing devices on the target platform.
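The DCG construction above can be sketched in a few lines. This is a minimal illustration, not the article’s implementation: the CG is assumed to be given as node and edge lists, and `capability` is a hypothetical predicate telling whether a device supports an operator.

```python
from itertools import product

def build_dcg(cg_nodes, cg_edges, capability, n_devices):
    """Build a Device-mapped Computational Graph (DCG) from a CG.

    cg_nodes:   operator names in topological order
    cg_edges:   (src_op, dst_op) tensor dependences
    capability: (op, device) -> True if the device supports the operator
    """
    # One DCG node per (operator, device) pair the device can run: O(V_c * n).
    nodes = [(op, d) for op in cg_nodes for d in range(n_devices)
             if capability(op, d)]
    # One DCG edge per capable device pair on each CG edge: O(E_c * n^2).
    # An edge between different devices implies an explicit data transfer.
    edges = [((u, du), (v, dv))
             for (u, v) in cg_edges
             for du, dv in product(range(n_devices), repeat=2)
             if capability(u, du) and capability(v, dv)]
    # Dummy start and end nodes give the DCG a single entry and exit.
    return [("start", -1)] + nodes + [("end", -1)], edges
```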
Second, we profile the DCG to measure all the node and edge costs. A node cost represents the computation time of an operator on a device, and an edge cost the data transfer time between devices. After measuring all the costs, we reduce the partitioning problem to finding the minimum-cost path from the start node to the end node in DCG. Measuring all the node costs needs \(n\) profile runs. However, measuring all the edge costs by profiling is challenging since there are \(n^{V_d}\) possible execution paths, leading to an exponential number of application runs. Our polynomial-time profiling algorithm tracks edges instead of execution paths since the number of edges in DCG is polynomial. Furthermore, we reduce the number of profiled edges by considering edge attributes. The data transfer between devices usually uses DMA; thus, we classify all the edges by three attributes that determine the data transfer cost: a source device, a destination device, and a data transfer size. Then, we develop a profiling algorithm to identify the minimum number of execution paths covering all the distinct-attribute edges, resulting in a polynomial time complexity of \(O(V_d + E_d^{2})\).
Finally, we apply a dynamic programming technique to find the optimal model partitioning from the profiled costs for achieving the best execution performance. We prove that our partitioning problem is the same as the Assembly Line Scheduling (ALS) problem [6], and therefore, the complexity of our partitioning algorithm is \(O(V_d)\).
We implemented our profiling and partitioning algorithms on the ONNX Runtime framework [30]. We used a PIM-modeled FPGA [19] and an ARM Cortex-A53 as our experimental computing platform, i.e., three computing devices to be scheduled: CPU serial execution, CPU parallel execution on the multicores, and PIM execution, targeting memory-intensive and low-locality applications. To the best of our knowledge, our work is the first to address optimal model partitioning on a PIM-based platform. We evaluated the performance by running three Transformer-based models, i.e., BERT [7], RoBERTa [25], and GPT-2 [33].
We needed three profiling runs to measure all the node costs on our three-device platform. Additionally, our edge profiling algorithm identified only one execution path to find all the edge costs in all the models. We analyzed the operator-by-operator performance of the models and found that PIM outperformed the other devices on most operators. However, PIM incurs a data transfer cost; thus, we should carefully assign operators to devices to achieve the best overall performance. For the detailed performance analysis, we manually made two execution priority orders, (CPU parallel, PIM, CPU serial) and (PIM, CPU parallel, CPU serial), and modeled the state-of-the-art greedy partitioning algorithm [40]. Using the profiled costs and applying our optimal model partitioning algorithm, we achieved the highest performance in all the test cases: a speedup of 1.1\(\times\)\(\sim 3.0\times\) over the executions with manually assigned device priority orders and 1.09\(\times\)\(\sim 1.23\times\) over the greedy approach. Also, we explored all possible execution paths in subgraphs of the experimented models and showed that our partitioning algorithm provided the best one.
The remainder of the article is organized as follows: Section 2 introduces background about our experimental platform, including the DL framework and the PIM computing device. Section 3 proposes our low-overhead profiling and optimal model partitioning algorithms. Section 4 presents the performance evaluation. Section 5 describes the related work, and Section 6 concludes the article.
2 BACKGROUND
This section reviews our experimental platform’s ONNX Runtime framework and the PIM computing device.
2.1 ONNX Runtime Framework
Open Neural Network Exchange (ONNX) [29] is an open format that represents a DNN model for providing interoperability between DL frameworks, such as TensorFlow [1] and PyTorch [31]. The DNN model implemented in one framework can be exported to the ONNX format and used in another. ONNX Runtime is a framework that supports and runs the ONNX format on multiple devices by adopting an execution provider interface, allowing us to conveniently integrate various devices by abstracting a computing device and its execution environment, including the libraries and drivers.
Figure 1(a) shows the execution flow of ONNX Runtime deploying multiple computing devices. ONNX Runtime transforms the ONNX format of a DNN model into Graph IR (Intermediate Representation), i.e., a computational graph (CG) and its topologically sorted CG, as shown in Figure 1(b) and (c). A node of CG represents an operator, and an edge implies a tensor representing data movement (dependence) between nodes in the form of a multi-dimensional array or a vector. ONNX Runtime performs graph optimization, partition, and execution in order. The graph optimizer applies various hardware-dependent and independent optimizations to CG. The graph partitioner maps each node to one of the computing devices in a device list by considering the user-defined device priority and capability. It also inserts a memory copy node if a device uses the output from other devices. Finally, the graph execution stage traverses the CG nodes in topological order, assigns them to the scheduled computing devices, and executes them in the order, one at a time. Therefore, we do not consider running multiple operators on multiple computing devices simultaneously.
A user specifies a list of available computing devices in the framework, i.e., the device capability. The user also prioritizes the computing devices in the execution provider list to specify the device execution preference. This specification depends solely on the user’s knowledge of the devices.
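The priority-based mapping described above can be mimicked in a few lines. This is a hedged sketch of the behavior, not ONNX Runtime’s actual implementation; the `capability` predicate and the device names are illustrative.

```python
def assign_by_priority(ops, provider_priority, capability):
    """Assign each operator to the highest-priority execution provider
    that supports it, as the default graph partitioner does."""
    placement = {}
    for op in ops:
        for dev in provider_priority:      # user-defined preference order
            if capability(op, dev):
                placement[op] = dev
                break
    return placement

# Example: PIM preferred over CPU, but PIM cannot run Softmax.
cap = lambda op, dev: not (dev == "PIM" and op == "Softmax")
print(assign_by_priority(["MatMul", "Softmax"], ["PIM", "CPU"], cap))
# -> {'MatMul': 'PIM', 'Softmax': 'CPU'}
```

Because the mapping ignores operator and transfer costs, the resulting partition is only as good as the user-chosen priority order, which motivates the cost-driven approach of Section 3.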
2.2 PIM Device: Silent-PIM
We used Silent-PIM [19] as the PIM device on our experimental platform; it satisfies the standard memory interface [17] and performs PIM execution using standard memory requests.
Figure 2(a) shows the Silent-PIM datapath, supporting bfloat16 8-way vector addition, subtraction, multiplication, and MAC operations. There are 128-bit\(\times\)4 vecA and 128-bit\(\times\)1 vecB general registers and vACC accumulator registers. The vecA register holds the 4-cycle burst data, i.e., 64 bytes, and vecB stores data at every cycle during the burst. Whenever vecB stores a 16-byte chunk, Silent-PIM performs a PIM operation with its 8-way vector units. Silent-PIM performs a matrix-matrix (MM) multiplication, \(A \times B = C\), by repeating a vector-matrix (VM) multiplication, once per row of A. The element-wise vector/matrix execution is more straightforward than the VM multiplication.
In most DNN models, the VM multiplication and element-wise operations read their source operands only once, thus exhibiting low locality and yielding higher PIM performance than the CPU. However, if the source operands are available in caches, the CPU may outperform PIM. For the MM multiplication, Silent-PIM repeats the VM multiplication, so we prefer the CPU over PIM due to data locality; however, if the source operands are available only in memory, PIM may still deliver higher performance. In addition, PIM does not support the computation of complex functions, such as exp and tanh. In summary, the performance depends on the source operand location and the operator type. Therefore, we should carefully offload operators to devices, i.e., partition the workload, to achieve the best performance on the platform.
2.3 Data Transfer Cost
Our experimental platform includes two computing devices, CPU and Silent-PIM, as shown in Figure 7, and uses DMA for data transfer between them.
We cannot accurately estimate the transfer time, i.e., the edge cost, from the data (tensor) size alone. The DMA data transfer time is affected by four factors: the data type, the transfer size, the source device, and the destination device. Silent-PIM uses the bfloat16 type, and the CPU uses float32; therefore, we need a data type conversion before each data transfer between PIM and CPU. The DMA data transfer time is proportional to the data size [27]. Also, we configure the PIM memory as uncacheable and the CPU memory as cacheable [19]; due to memory ordering, uncacheable memory access takes much longer than cacheable access.
Figure 2(b) shows the data transfer cost from CPU to PIM and vice versa for varying element counts, consisting of the type conversion time and the DMA transfer time. The type conversion took a similar time in both directions since the CPU performed the conversion. However, the DMA transfer time differed depending on the source and destination devices, and CPU-to-PIM was faster than PIM-to-CPU. For example, the data copy of 64 K bytes from CPU to PIM was 1.83 times faster than from PIM to CPU: the CPU reads the source data from fast cacheable memory, whereas the PIM reads from slow uncacheable memory. Note that the data transfer size is the same in both directions after the type conversion, i.e., 2 bytes per element.
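The float32-to-bfloat16 conversion that precedes each DMA transfer can be illustrated with plain bit manipulation. This is a generic sketch using round-to-nearest-even; whether the platform’s conversion rounds or simply truncates is an assumption we do not verify here.

```python
import struct

def f32_to_bf16_bits(x):
    """Convert a float32 to a bfloat16 bit pattern: keep the top 16 bits,
    rounding to nearest even on the dropped mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest even
    return (bits >> 16) & 0xFFFF

def bf16_bits_to_f32(b):
    """Widen a bfloat16 bit pattern back to float32 (exact)."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

# A round trip keeps roughly 3 significant decimal digits while halving
# the per-element transfer size from 4 bytes to 2.
assert bf16_bits_to_f32(f32_to_bf16_bits(1.0)) == 1.0
```

The halved element size is why the transferred byte count is the same in both directions after conversion.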
3 LOW-OVERHEAD PROFILING AND OPTIMAL MODEL PARTITIONING
Figure 3 shows the overall architecture of ONNX Runtime with the gray-colored components added to Figure 1, realizing our low-overhead profiling and optimal DL inference model partitioning on the PIM-based computing platform.
After the graph optimization, we generate DCG from CG using the computing device’s capability defined in execution providers. Our low-overhead profiling algorithm finds the minimum number of execution paths to measure all the node and edge costs in DCG, thus resulting in the lowest profile overhead. Then, from the profile runs, we apply our optimal partitioning algorithm to find an execution path to guarantee the minimum execution time. The rest of this section explains each step in detail.
3.1 Reconstructing Computational Graph for Multi-Device Computing
The original graph partitioner allocates each CG node to one of the computing devices according to the user-defined priority, i.e., device preference, without considering the device’s operator cost and data transfer cost. Therefore, the performance relies heavily on the user’s experience and may fail to reach the desired level. Also, allocating a node’s successor to a different device requires inserting a memory copy node between them.
To expose all the computing devices of a multi-device platform to the model partitioning, we build DCG from CG to represent the devices’ operator capability and their data dependences, as shown in Figure 4, derived from the CG of Figure 1(a) with n computing devices. A solid arrow edge (\(\rightarrow\)) represents a data dependence within the same device, and a dashed arrow edge (\(\dashrightarrow\)) a data dependence between different devices, requiring an explicit data transfer. We also add a start node and an end node to provide a single entry and exit in DCG. ONNX Runtime already encodes the device capability information in the node attributes; therefore, the construction incurs no additional encoding overhead.
3.2 Profiling DCG with the Lowest Overhead
Without loss of generality, we assign zero cost to the start and end nodes, because they are dummy operators, and to the solid edges, because their tensors stay in the same device. We acquire the costs of the remaining nodes and the dashed edges through profile runs.
We classify the profiling into node cost profiling and edge cost profiling. The node cost profiling is straightforward: we can identify all the node costs with as many profiling runs as there are devices on the target platform, i.e., n, by assigning the highest priority to each computing device in the device list one by one. For example, in Figure 4, we need n profile executions to acquire all the node costs, i.e., \(A_0B_iC_0D_i\), by assigning the highest priority to devices 0 through \(n-1\). The node cost profiling also measures some edge costs; however, it cannot measure all of them.
Profiling all the edge costs is problematic because an exponential number of execution paths exists from the start to the end node, i.e., \(O(n^{|V_d|})\) paths in DCG. In this case, we cannot finish the process in polynomial time even with a polynomial-time algorithm because the input size itself is exponential. We could use a bookkeeping approach to reduce the working path size, since many paths share the same subpaths; however, the number of paths still grows exponentially because every node may have n children. Therefore, instead of managing paths, we track edges, since the number of edges in DCG is \(O(n^2 \times V_c)=O(n \times V_d)\), i.e., a polynomial input size. Then, we visit each DCG node in topological order to find the minimum number of execution paths covering all the edges.
3.2.1 Reducing the Problem Size.
We reduce the problem size, i.e., the number of edges to be profiled, by identifying distinct attribute (DA) edges. In general, we use the DMA for data transfer between devices; thus, the data transfer cost can be determined by a 4-tuple, as discussed in Section 2.2: a data type, a data size, a source device, and a destination device. When constructing DCG, we check the tuple for a new DCG edge and register it as a DA edge if its tuple differs from the pre-identified DA edges. Therefore, we do not need to profile the edges with the same attribute repeatedly; we profile each DA edge only once.
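Identifying DA edges reduces to deduplicating edges by their attribute tuple. Below is a minimal sketch under our own encoding, where the hypothetical `attr` function returns the 4-tuple for an edge:

```python
def register_da_edges(dcg_edges, attr):
    """Keep one representative edge per distinct-attribute (DA) 4-tuple.
    attr(edge) -> (data_type, transfer_size, src_device, dst_device)."""
    seen, da_edges = set(), []
    for e in dcg_edges:
        t = attr(e)
        if t not in seen:       # first edge with this attribute tuple
            seen.add(t)
            da_edges.append(e)  # profile this edge once; reuse its cost
    return da_edges
```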
Table 1 shows the number of DCG edges and DA edges in the transformer-based models for two devices (CPU, PIM). For example, the DCG edges (\(E_d\)) in BERT, RoBERTa, and GPT-2 decreased from 221, 628, and 527 to only 8, 8, and 7 DA edges, respectively. We observed that most nodes’ operand dimensions are constant, implying constant transfer sizes and allowing a drastically smaller number of DA edges than DCG edges; only a few nodes, such as ReduceMean and SoftMax, reduce the data dimension and introduce new transfer sizes.
3.2.2 Edge Profiling Algorithm.
We use Figure 5 as an example to explain the edge profiling algorithm with two computing devices, device 0 and device 1. A subgraph of the initial DCG appears in Figure 5(a), where there are five DA edges (\(e_0\)\(\sim\)\(e_4\)), and operators A and C are not supported by device 1. The node profiling already identifies the cost of edge \(e_4\).
For the edge profiling algorithm, we augment each node with two data structures: a profiled edge list (L) and its associated path, i.e., the profiled path (P). The edge list is the list of the DA edges in DCG, constructed while building DCG; in the figure, it is \([e_0,e_1,e_2,e_3,e_4]\). We initialize each node’s profiled edge list with 0’s and set the fields associated with the incoming edges during the node visit. The profiled path is a path, executed from the start node, that contains the DA edges marked in the profiled edge list. A node with more than one incoming edge may hold multiple pairs of a profiled edge list and a profiled path.
Algorithm 1 identifies the minimum profiling paths needed to obtain all the node and edge costs in DCG. It performs the topological sort of DCG (Line 1) and the node cost profiling (Lines 4\(\sim\)8). We can sort DCG without any overhead by reusing CG’s topological sort information; we only need to treat the devices as siblings in the order. The algorithm then visits the DCG nodes in topological order for the edge cost profiling (Line 10). At every node visit, it generates the node’s profiled edge list and profiled path by combining the DA edges on the explored path with the predecessor’s information (Lines 15\(\sim\)25). Once all the DA edges are marked, the algorithm stops (Lines 12\(\sim\)14) and uses all the collected profiled paths for the profile runs (Line 20).
In the example figure, we initially color all the nodes as white, a dequeued one from the topological FIFO queue as black, and currently processed nodes as gray. In Figure 5(a) at \(T_0\), we assume that we already visited the \(A_0\) node, thus dequeuing \(A_0\) and having \([0,0,0,0,1]\) in the profile edge list (\(L(A_0)\)) and \(A_0\) in the profile path (\(P(A_0)\)). Remember that we profiled \(e_4\) from the node profiling.
Figure 5(b) shows two steps to dequeue \(B_1\) and \(B_0\): At \(T_1\), we dequeue \(B_1\) (Line 11) and generate the \([1,0,0,0,1]\) profiled edge list (\(L(B_1)\)) by adding the explored \(e_0\) DA edge to the \(A_0\)’s profiled edge list and \(A_0B_1\) of the profiled path (\(P(B_1)\)) (Lines 16\(\sim\)18). Similarly, at \(T_2\), we dequeue \(B_0\) and generate the \([0,0,0,0,1]\) profiled edge list (\(L(B_0)\)) the same as \(A_0\)’s because there is no DA edge from \(A_0\) to \(B_0\) (Line 23). We also generate \(A_0B_0\) of the profiled path (\(P(B_0)\)) (Line 16).
In Figure 5(c) at \(T_3\), we dequeue \(C_0\), which has two incoming execution paths; thus, we compare the generated profiled edge lists from the paths for inclusion and summarize the results. The path from \(B_0\), including no new DA edge, generates \([0,0,0,0,1]\) for \(L(C_0)\) and \(~B_0C_0\) for \(P(C_0)\). The path from \(B_1\), including the new DA edge \(e_1\), generates \([1,1,0,0,1]\) for \(L(C_0)\) and \(~B_1C_0\) for \(P(C_0)\). Since \([1,1,0,0,1]\) includes \([0,0,0,0,1]\), i.e., covers more DA edges, we keep only \([1,1,0,0,1]\) with its profiled path (Lines 19\(\sim\)21).
In Figure 5(d) at \(T_4\), we dequeue \(D_1\) to have only one incoming path from \(C_0\), including two DA edges of \(e_0\) and \(e_2\); thus, we generate \([1,1,1,0,1]\) for \(L(D_1)\), \(~C_0D_1\) for \(P(D_1)\) using \(L(C_0)\) and \(P(C_0)\). We execute the nodes in the topological order, making the edge profiling algorithm update the profiled edge list in the same order. Therefore, we consider nodes \(D_0\) and \(D_1\) after visiting \(C_0\), even if there is an edge \(e_0\) from \(B_0\) to \(D_1\).
In Figure 5(e) at \(T_5\), we dequeue \(D_0\) to have only one incoming path from \(C_0\), including one DA edge of \(e_1\) to be already registered in the predecessor, \(C_0\); thus, \(D_0\)’s profiled edge list and path are the same as \(C_0\), i.e., \([1,1,0,0,1]\) for \(L(D_0)\), \(~C_0D_0\) for \(P(D_0)\).
Figure 5(f) shows two steps to dequeue \(E_1\) and \(E_0\): At \(T_6\), we dequeue \(E_1\) and generate the \([1,1,1,0,1]\) profiled edge list (\(L(E_1)\)) by adding the explored \(e_2\) DA edge to \(D_0\)’s profiled edge list, and \(~D_0E_1\) as the profiled path (\(P(E_1)\)). Similarly, at \(T_7\), we dequeue \(E_0\) and generate the \([1,1,1,1,1]\) profiled edge list (\(L(E_0)\)) by adding the explored \(e_3\) DA edge to \(D_1\)’s profiled edge list, and \(~D_1E_0\) as the profiled path (\(P(E_0)\)). All the DA edges are now found, and the algorithm stops (Lines 12\(\sim\)14).
We can determine the inclusion between the profiled edge lists at each node visit in \(O(E_d)\) since there are at most \(E_d\) different profiled edge lists. Also, we can determine whether a path includes new DA edges in \(O(1)\). Therefore, the total time complexity of our profiling algorithm is \(O(V_d + E_d^{2})\), which enables finding the minimum paths that contain all the edge costs in polynomial time.
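The core of the edge profiling pass can be sketched as follows. This is a simplification of Algorithm 1: it keeps only the single most-covering profiled edge list per node (represented as a bitmask) rather than all incomparable lists, and the graph encoding (`preds`, `da_bit`) is our own illustrative convention.

```python
def profile_path(topo_nodes, preds, da_bit):
    """Visit DCG nodes in topological order, growing per node a profiled
    edge list L (a bitmask over DA edges) and a profiled path P.

    preds:  node -> list of predecessors (empty only for the start node)
    da_bit: (u, v) -> one-hot bitmask of the DA edge on (u, v); absent
            if the edge is not a DA edge
    """
    start = topo_nodes[0]
    L, P = {start: 0}, {start: [start]}
    full = 0
    for bit in da_bit.values():
        full |= bit                       # mask with every DA edge set
    for v in topo_nodes[1:]:
        # Extend the predecessor whose covered DA-edge set is largest.
        L[v], P[v] = max(
            ((L[u] | da_bit.get((u, v), 0), P[u] + [v]) for u in preds[v]),
            key=lambda mp: bin(mp[0]).count("1"))
        if L[v] == full:                  # all DA edges covered: stop early
            return P[v]
    return P[topo_nodes[-1]]
```

On a toy DAG with three DA edges, the pass returns a single path covering all of them, mirroring the single edge-profiling run found for the evaluated models.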
3.3 Optimal Model Partitioning
We apply the famous Assembly Line Scheduling (ALS) problem to the optimal partitioning problem in the multi-device computing platform by showing that the two problems are the same. For simplicity, we use only two assembly lines for manufacturing, i.e., two computing devices on the platform.
Figure 6(a) shows the ALS problem of determining which nodes to select from the assembly lines to minimize the manufacturing time. In the graph, \(a_{i,j}\) denotes the cost of the jth node on the ith assembly line, \(t_{i,j}\) denotes the transfer cost from the \((j-1)\)th node on the ith assembly line to the jth node on the other line, and \(e_i\) and \(x_i\) are the entry and exit costs of the ith assembly line. The transfer cost between nodes on the same assembly line is assumed to be zero. The fastest time to get a chassis from the start to the jth node on the ith assembly line, \(f_i[j]\), is expressed by Equation (1) for \(j \ge 1\), where \(f_0[0] = e_0 + a_{0,0}\) and \(f_1[0] = e_1 + a_{1,0}\):
(1)
\(\begin{aligned} f_0[j] &= \min (f_0[j-1]+a_{0,j},\ f_1[j-1] + t_{1,j} + a_{0,j}) \\ f_1[j] &= \min (f_1[j-1]+a_{1,j},\ f_0[j-1] + t_{0,j} + a_{1,j}) . \end{aligned}\)
We modified the ALS graph from Figure 6(a) to Figure 6(b) by allowing nodes to have more than one incoming edge from a different assembly line (or more than one outgoing edge to a different assembly line). Then, Equation (1) changes to the following:
(2)
\(\begin{aligned} f_0[j] &= \min (f_0[j-1]+a_{0,j},\ f_1[j-1] + T_{0,j} + a_{0,j}) \\ f_1[j] &= \min (f_1[j-1]+a_{1,j},\ f_0[j-1] + T_{1,j} + a_{1,j}) . \end{aligned}\)
\(T_{i,j}\) denotes a sum of a transfer cost from all the data-dependent predecessors of jth node on ith assembly line. For example, \(T_{1,3} = t_{0,3} + t_{0,3}^{^{\prime }}\). We assume that the incoming transfer cost to node j from node k (where \(k \lt j-1\)) on the same assembly line is also zero. The equation still holds the optimal solution since (1) the transfer cost of all the edges across the assembly lines is used only once for finding the optimal value and (2) \(t_{i,j}\) represents the transfer cost from the other assembly lines, and therefore, we can replace all the transfer costs from all data dependent predecessors of the jth node on ith assembly line by \(T_{i,j}\).
We can also prove the optimality of Equation (2) by induction: (i) Base case: When \(j=0\), Equation (2) satisfies \(f_0[0] = e_0+a_{0,0}\) and \(f_1[0] = e_1+a_{1,0}\) since there is no path from the other assembly line. (ii) Induction step: Suppose that Equation (2) is valid for \(j=k-1\), i.e., \(f_0[k-1]\) and \(f_1[k-1]\) are the fastest times at the \((k-1)\)th nodes of assembly lines 0 and 1. At the next, kth node of assembly line 0, we can identify the fastest execution time by adding the kth node’s execution time (\(a_{0,k}\)) to the smaller of the cost from the node on the same assembly line (\(f_0[k-1]\)) and the cost from the nodes on the different assembly line (\(f_1[k-1] + T_{0,k}\)). The same argument holds for the other assembly line. With (i) and (ii), we prove that the recurrence on DCG still guarantees optimality.
We can easily map the modified ALS graph to DCG by removing incapable nodes and processing the ALS in topological order. Figure 6(c) shows an example: The 1st and 3rd operators are not supported by computing device 1; therefore, we remove them from DCG. As shown in Figure 7, in our execution model, the host CPU has two memory spaces: cached system memory and uncached PIM memory. Therefore, if two consecutive nodes are on different assembly lines (CPU-PIM or PIM-CPU), an explicit data copy is required, which results in the transfer cost \(T_{i,j}\). However, if both nodes are on the same assembly line, they use the same memory space, and the memory access time is included in the node execution time; thus, no transfer cost exists. For example, the CPU’s execution time already includes the memory access time through the memory hierarchy.
The time complexity of our partitioning algorithm is \(O(V_d)\), the same as that of the ALS algorithm.
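The partitioning recurrence generalizes to n devices in a short dynamic program. Below is a sketch under our own cost encoding, not the article’s implementation: missing `cost` entries mean the device cannot run the operator, and `trans` holds the summed transfer costs \(T\).

```python
def partition(num_ops, n_devices, cost, trans):
    """f[d][j]: fastest time to finish operator j with its output on
    device d, following the modified ALS recurrence (Equation (2)).

    cost[(j, d)]     -> node cost (absent if device d cannot run op j)
    trans[(j, s, d)] -> summed transfer cost into op j on device d when
                        its predecessors' outputs live on device s
    """
    INF = float("inf")
    f = [[INF] * num_ops for _ in range(n_devices)]
    back = [[0] * num_ops for _ in range(n_devices)]
    for d in range(n_devices):
        f[d][0] = cost.get((0, d), INF)
    for j in range(1, num_ops):
        for d in range(n_devices):
            for s in range(n_devices):
                t = 0 if s == d else trans.get((j, s, d), 0)
                c = f[s][j - 1] + t + cost.get((j, d), INF)
                if c < f[d][j]:
                    f[d][j], back[d][j] = c, s
    # Backtrack the optimal device assignment for every operator.
    d = min(range(n_devices), key=lambda d: f[d][num_ops - 1])
    best, sched = f[d][num_ops - 1], []
    for j in range(num_ops - 1, -1, -1):
        sched.append(d)
        d = back[d][j]
    return best, sched[::-1]
```

The double loop over devices costs \(O(n^2)\) per operator, i.e., \(O(V_d \times n)\) overall, which matches the article’s \(O(V_d)\) bound for a fixed number of devices.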
4 PERFORMANCE EVALUATION
4.1 Experimental Environment and Methodology
We developed the PIM-based platform, including the ARM multicores and Silent-PIM [19], as shown in Figure 7; thus, we could define three computing devices such as
We evaluated the performance of our profiling and partitioning algorithms on the platform, using three state-of-the-art transformer-based models, BERT [7], RoBERTa [25], and GPT-2 [33]. We can exploit
4.2 Operator-by-Operator Performance
Before showing the performance of our partitioning algorithm, we compared the operators’ performance of
In the element-wise addition of Figure 8(a),
4.3 Optimal Model Partitioning
We compared the performance of our optimal partitioning against two manually prioritized device orders and three variants of the state-of-the-art greedy partitioning algorithm [40]. The greedy algorithm first allocates each node to the computing device with the fastest execution time, without considering the communication costs between devices. Then, it gradually increases the coverage of the nodes, starting from the first node, and adjusts the allocation by considering the communication cost to identify a better partitioning. The algorithm does not guarantee optimality and targets the CPU-GPU environment. We faithfully modeled the greedy partitioning algorithm of [40] and compared its performance with the proposed work.
We compared the performance using our optimal partitioning with the following six execution schemes:
(1)
(2)
(3)
(4)
(5)
(6)
(7)
Figure 9 shows the execution time and speedup of running the three transformer-based models while varying a sequence length from 8 to 64 with respect to
The speedup of
4.3.1 BERT and RoBERTa.
Figure 10 shows the partitioned result of
As a result,
4.3.2 GPT-2.
Figure 12 shows the partitioned result of GPT-2 for
As shown in Figure 11(c), their computation times are similar, but data transfer time decreases by 64%, which leads to the optimal execution time.
4.4 Verifying the Correctness of Optimal Model Partitioning
To verify the correctness of our proposed optimal partitioning algorithm, we explored all possible execution paths in the subgraphs from the experimented models, measured their execution time, and identified the minimum execution time. With three devices (
Figure 13 shows the execution time of all possible paths, where one point represents one path’s execution time. In the figure, the star-marked point represents the minimum execution time, which is the same as the result of our partitioning algorithm.
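The exhaustive check described above can be sketched by enumerating every device assignment of a small operator chain. This is an illustrative sketch with hypothetical profiled values (`node_cost`, `transfer`), not the measurement harness used in the experiments.

```python
# Hedged sketch of the exhaustive verification: enumerate every device
# assignment for a small operator chain and take the minimum end-to-end
# time. With k devices and n operators there are k**n candidate paths.
from itertools import product

def brute_force_minimum(node_cost, transfer):
    n, k = len(node_cost), len(node_cost[0])
    best = float("inf")
    for assign in product(range(k), repeat=n):
        t = node_cost[0][assign[0]]
        for i in range(1, n):
            t += transfer[assign[i - 1]][assign[i]] + node_cost[i][assign[i]]
        best = min(best, t)
    return best

# The 2-device, 3-operator toy chain has 2**3 = 8 candidate paths.
minimum = brute_force_minimum([[5, 1], [4, 10], [5, 1]], [[0, 2], [2, 0]])
```

The exponential path count is why this enumeration is only feasible on subgraphs, while the dynamic program scales linearly in the number of nodes.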
4.5 Profiling and Model Partitioning Overhead
Table 3 shows the DCG characteristics and the algorithm overhead. The CGs of BERT, RoBERTa, and GPT-2 have 222, 616, and 1934 nodes and 262, 765, and 2424 edges, respectively. On our experimental platform with three devices, the DCGs built from each model have 469, 1317, and 2593 nodes and 686, 1951, and 1723 edges, respectively. We could reduce the edges to profile to only 20, 20, and 24 DA edges, significantly reducing the profiling overhead for finding all the edge costs. Moreover, as we had already identified most DA edge costs during the node cost profiling, only four DA edges were left to measure in the edge cost profiling for all three models.
The graph traversal for finding those four edges took only 0.04 s, 0.04 s, and 0.07 s for BERT, RoBERTa, and GPT-2, respectively. BERT and RoBERTa took a similar time due to their similar CG characteristics. We needed only three profile runs for the node cost profiling and one run for the edge cost profiling. The time spent on optimal partitioning was negligible and proportional to the graph complexity.
5 RELATED WORKS
Previous research on workload partitioning for heterogeneous platforms has focused on profiling and partitioning for CPU-GPU systems with data-parallel kernels written in OpenCL [37] or CUDA [28]. Even though the GPU offers higher throughput than the CPU, the low PCIe bandwidth can degrade the overall performance due to the data transfer overhead. Therefore, accurate profiling and partitioning determine the overall performance. To our knowledge, there has been no prior work on PIM-based partitioning, and our work can also be easily applied to partitioning on traditional heterogeneous platforms.
The first approach statically partitions the model by profiling applications and building a cost model from the profiled data. Luk et al. [26] developed an analytical performance model to predict the execution time on CPU and GPU by curve-fitting the profiled data. However, it did not distinguish the data transfer time from the computation time, leading to non-optimal partitioning. Albayrak et al. [3] proposed an inter-kernel greedy mapping algorithm based on profiling kernels on devices. The algorithm iterates through n kernels in topological order and assigns them to devices by comparing CPU and GPU costs, where a cost comprises the execution time on each device and the data transfer time from the source and sink devices. The algorithm may not provide optimal partitioning when resolving complex data dependencies between kernels. Shen et al. [35] proposed an intra-kernel partitioning algorithm targeting a single kernel. When a kernel runs on both the CPU and GPU, its execution time can be expressed as \(\max (T_G+T_D, T_C)\), where \(T_G\), \(T_D\), and \(T_C\) denote the GPU kernel computation time, the data transfer time, and the CPU kernel computation time, respectively. From partial profiling, they obtained two metrics: the relative hardware throughput of the CPU and GPU, and the ratio of GPU throughput to data transfer bandwidth. Then, they expressed \(T_G\), \(T_D\), and \(T_C\) in terms of the two metrics and achieved the optimal partitioning when \(T_G+T_D=T_C\).
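The balance condition \(T_G+T_D=T_C\) can be illustrated with a small numeric sketch. The throughput model below (`theta_g`, `theta_c`, `bw`, and the closed-form fraction) is our own illustrative assumption, not taken from [35].

```python
# Illustrative sketch of the balance condition T_G + T_D = T_C for the
# intra-kernel split. Assumed model (not from [35]): a fraction beta of
# the work goes to the GPU, so per unit of total work
#   T_G + T_D = beta / theta_g + beta / bw,   T_C = (1 - beta) / theta_c,
# where theta_g, theta_c are device throughputs and bw is the transfer
# bandwidth, all in work-units per second.
def optimal_gpu_fraction(theta_g, theta_c, bw):
    gpu_side = 1.0 / theta_g + 1.0 / bw  # GPU compute + transfer per work unit
    cpu_side = 1.0 / theta_c             # CPU compute per work unit
    # Setting beta * gpu_side == (1 - beta) * cpu_side and solving for beta:
    return cpu_side / (gpu_side + cpu_side)

# With a GPU 4x faster than the CPU but transfer as slow as GPU compute,
# two thirds of the work should go to the GPU.
beta = optimal_gpu_fraction(theta_g=8.0, theta_c=2.0, bw=8.0)
```

At this fraction the GPU path (compute plus transfer) and the CPU path finish simultaneously, so neither side idles, which is exactly the condition the partial-profiling scheme solves for.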
Lee et al. [21] proposed partitioning multiple kernels of an application across multiple devices. Their method first builds a regression model, based on profiled data collected by executing kernels with various input sizes, to predict the execution time on each computing device. Then, it constructs a data dependency graph of the kernels. The kernels are listed in topologically sorted order and mapped to devices based on priority, the predicted execution time, and the data transfer cost. It further decomposes kernels into sub-kernels and maps them across multiple devices. Our work does not predict the execution time but profiles it at very low cost, thus providing more accurate partitioning.
The second approach uses runtime information for optimal partitioning. Belviranli et al. [4] proposed a dynamic partitioning algorithm that works in two phases. The first phase learns the computational performance of each device by assigning it a small amount of the workload. Then, the rest of the workload is distributed according to the measured performance in the second, adaptive phase. Boyer et al. [5] decomposed a kernel into chunks of its primary computing workload. They executed a small number of chunks on each device at the initial execution. Then, using the measured execution times, they partitioned the remaining work to balance the load between devices, leading to optimal partitioning.
The third approach uses the machine learning (ML) technique to predict the optimal partitioning for the given workload. Grewe et al. [10] proposed optimal partitioning based on an ML model trained with static information from a compiler’s AST IR analysis. Similarly, Ghose et al. [9] additionally used control flow, mainly depending on the thread-id as an important deciding factor for optimal partitioning. Kofler et al. [20] proposed optimal partitioning by the ML model that was trained with static information from compile-time and dynamic information, e.g., data transfer size, from runtime.
6 CONCLUSION
In this article, we proposed an optimal model partitioning method for achieving the best performance of DL inference execution on the PIM-based platform. First, we presented the device-mapped computational graph (DCG), which represents device capability by restructuring the existing computational graph. Then, we introduced an algorithm that finds the minimum number of profile runs needed to obtain all node and edge costs in DCG in polynomial time. Also, we classified the DCG edges into distinct attribute (DA) edges that must be profiled, significantly reducing the problem size. Finally, our dynamic programming technique, adopted from the ALS approach, provided the optimal model partitioning from DCG.
We have evaluated our method on the PIM-modeled FPGA platform with the ARM multicores by running three transformer-based models, varying the sequence length p from 8 to 64. Our partitioning algorithm achieved speedups of 1.1\(\times\)\(\sim 2.1\times\) and 1.1\(\times\)\(\sim 3.0\times\) over the executions with the manually assigned device priority orders (CPU parallel, PIM, and CPU serial) and (PIM, CPU parallel, and CPU serial), respectively, and 1.09\(\times\)\(\sim 1.23\times\) over the state-of-the-art greedy approach. Also, we explored all possible execution paths in the subgraphs from the experimented models and showed that our partitioning algorithm provided the best result. Our low-cost edge profiling algorithm required only one profile run to obtain all the edge costs, which took a negligible 0.04 s, 0.04 s, and 0.07 s for each model with \(p=16\). The partitioning itself also took little time. We will apply our work to platforms with higher heterogeneity, including GPUs and application-specific accelerators.
- [1] . 2016. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation. Savannah, GA, 265–283.
- [2] . 2023. GPT-4 technical report. arXiv:2303.08774. Retrieved from https://arxiv.org/abs/2303.08774
- [3] . 2012. Effective kernel mapping for OpenCL applications in heterogeneous platforms. In Proceedings of the 41st International Conference on Parallel Processing Workshops. Pittsburgh, PA, 81–88.
- [4] . 2013. A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures. ACM Transactions on Architecture and Code Optimization 9, 4 (2013), 1–20.
- [5] . 2013. Load balancing in a changing world: Dealing with heterogeneity and performance variability. In Proceedings of the 10th ACM International Conference on Computing Frontiers. Ischia, Italy, 1–10.
- [6] . 2001. Introduction to Algorithms (2nd ed.). MIT Press.
- [7] . 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https://arxiv.org/abs/1810.04805
- [8] . 2012. A yoke of oxen and a thousand chickens for heavy lifting graph processing. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. Minneapolis, MN, 345–354.
- [9] . 2016. Divergence aware automated partitioning of OpenCL workloads. In Proceedings of the 9th India Software Engineering Conference. Goa, India, 131–135.
- [10] . 2011. A static task partitioning approach for heterogeneous systems using OpenCL. In Proceedings of the 20th International Conference on Compiler Construction. Saarbrücken, Germany, 286–305.
- [11] . 1993. Hardware-software cosynthesis for digital systems. IEEE Design and Test of Computers 10, 3 (1993), 29–41.
- [12] . 2020. Newton: A DRAM-maker's accelerator-in-memory (AiM) architecture for machine learning. In Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture. 372–385.
- [13] . 2010. Multi-GPU and multi-CPU parallelization for interactive physics simulations. In Proceedings of the 16th European Conference on Parallel Processing. Berlin, 235–246.
- [14] . 2017. HTG-Z920. Retrieved August 07, 2022 from https://www.xilinx.com/products/boards-and-kits/1-qwrzuv.html
- [15] . 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
- [16] . 1999. Recurrent Neural Networks: Design and Applications (1st ed.). CRC Press, Inc.
- [17] JEDEC Standard: DDR4 SDRAM JESD79-4B. 2012. Retrieved from https://www.jedec.org/standards-documents/docs/jesd79-4a
- [18] . 2022. Near-memory processing in action: Accelerating personalized recommendation with AxDIMM. IEEE Micro 42, 1 (2022), 116–127.
- [19] . 2022. Silent-PIM: Realizing the processing-in-memory computing with standard memory requests. IEEE Transactions on Parallel and Distributed Systems 33, 2 (2022), 251–262.
- [20] . 2013. An automatic input-sensitive approach for heterogeneous task partitioning. In Proceedings of the 27th International ACM Conference on Supercomputing. New York, NY, 149–160.
- [21] . 2015. Orchestrating multiple data-parallel kernels on multiple devices. In Proceedings of the 24th International Conference on Parallel Architecture and Compilation Techniques. San Francisco, CA, 256–366.
- [22] . 2021. Hardware architecture and software stack for PIM based on commercial DRAM technology: Industrial product. In Proceedings of the ACM/IEEE 48th Annual International Symposium on Computer Architecture. 43–56.
- [23] . 2021. Hardware architecture and software stack for PIM based on commercial DRAM technology: Industrial product. In Proceedings of the ACM/IEEE 48th Annual International Symposium on Computer Architecture. 43–56.
- [24] . 2020. Evaluating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect. IEEE Transactions on Parallel and Distributed Systems 31, 1 (2020), 94–110.
- [25] . 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692. Retrieved from https://arxiv.org/abs/1907.11692
- [26] . 2009. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture. New York, NY, 45–55.
- [27] . 2008. Direct Memory Access (DMA) (Part III). Retrieved January 02, 2022 from http://ww1.microchip.com/downloads/en/devicedoc/70215c.pdf
- [28] . 2020. CUDA. Retrieved January 02, 2022 from https://developer.nvidia.com/cuda-toolkit
- [29] . 2019. Open Neural Network Exchange (ONNX). Retrieved December 22, 2021 from https://onnx.ai/
- [30] . 2021. ONNX Runtime. Retrieved December 22, 2021 from https://onnxruntime.ai/
- [31] . 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32. Vancouver, Canada, 8026–8037.
- [32] . 2015. Design, packaging, and architectural policy co-optimization for DC power integrity in 3D DRAM. In Proceedings of the 52nd Annual Design Automation Conference. San Francisco, CA, 1–6.
- [33] . 2019. Language models are unsupervised multitask learners.
- [34] . 1993. Architecture of field-programmable gate arrays. Proceedings of the IEEE 81, 7 (1993), 1013–1029.
- [35] . 2016. Workload partitioning for accelerating applications on heterogeneous platforms. IEEE Transactions on Parallel and Distributed Systems 27, 9 (2016), 2766–2780.
- [36] . 2012. Enabling and scaling matrix computations on heterogeneous multi-core and multi-GPU systems. In Proceedings of the 26th ACM International Conference on Supercomputing. San Servolo Island, Venice, Italy, 365–376.
- [37] . 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science and Engineering 12, 3 (2010), 66–73.
- [38] . 2002. Using modern graphics architectures for general-purpose computing: A framework and analysis. In Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture. Istanbul, Turkey, 306–317.
- [39] . 2013. High performance FFT based poisson solver on a CPU-GPU heterogeneous platform. In Proceedings of the IEEE 27th International Symposium on Parallel and Distributed Processing. Cambridge, MA, 115–125.
- [40] . 2021. DUET: A compiler-runtime subgraph scheduling approach for tensor programs on a coupled CPU-GPU architecture. In Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium. 151–161.