PISA-DMA: Processing-in-Memory Instruction Set Architecture Using DMA

Processing-in-memory (PIM) has attracted attention as a way to overcome the memory bandwidth limitation, especially for memory-intensive DNN applications. Most PIM approaches use the CPU's memory requests to deliver instructions and operands to the PIM engines, which keeps a core busy and incurs unnecessary data transfer, resulting in significant offloading overhead. DMA can resolve the issue by transferring a high volume of successive data without CPU intervention and without polluting the memory hierarchy, thus fitting the PIM concept well. However, the small computing resources of DRAM-based PIM devices allow only a small amount of data to be transferred in one DMA transaction and require a large number of descriptors, which still incurs significant offloading overhead. This paper introduces a PIM Instruction Set Architecture (ISA) using a DMA descriptor, called PISA-DMA, to express a PIM opcode and operand in a single descriptor. Our ISA makes PIM programming intuitive by treating the commit of one PIM instruction as the completion of one DMA transaction and by representing a sequence of PIM instructions as a DMA descriptor list. PISA-DMA also minimizes the offloading overhead while guaranteeing compatibility with commercial platforms. PISA-DMA eliminates the opcode offloading overhead and achieves 1.25x, 1.31x, and 1.29x speedups over the baseline PIM at a sequence length of 128 with the BERT, RoBERTa, and GPT-2 models, respectively, in ONNX Runtime on real machines. We also study how the proposed PISA affects performance under compiler optimization and show that the operator fusion of matrix-matrix multiplication and element-wise addition achieves a 1.04x speedup, a performance gain similar to that obtained with conventional ISAs.


I. INTRODUCTION
Most modern computers are based on the stored-program concept, i.e., the von Neumann architecture [1], where instructions and data are stored in a memory separate from the processor and handled in the same way. Therefore, when processing low-locality, data-intensive applications, such as recently emerging DNN (Deep Neural Network) [2], e-commerce [3], [4], [5], [6], and graph applications [7], [8], the memory performance determines the overall system performance.
The associate editor coordinating the review of this manuscript and approving it for publication was Mario Donato Marino.
Despite the PIM's performance advantages, in-DRAM PIM has yet to become commercially available because of the following two factors. First, most designs have pursued an "accelerator-first" approach instead of a "memory-first" one, which affects the design of all architecture layers, such as cores [9], [11], [12] and memory controllers [14], [15], [17], [18], [24], and thus makes them incompatible with current computing platforms. For example, the latest PIM studies from Samsung [23] and UPMEM [30] separated the PIM memory area from the non-PIM memory to avoid incompatibility with the JEDEC memory standard [31] while supporting all-bank execution. Our recent work and the baseline for this research, the decoupled PIM [26], satisfies the standard memory interface. However, its performance is lower than that of the all-bank PIMs due to its per-bank execution.
Second and more importantly, no PIM instruction set architecture (ISA) is available that fully supports compatibility with commercial computing platforms and fits both the PIM concept and the characteristics of PIM-target applications. Most PIMs developed their own PIM instructions [11], [12], [22], [23], [30] and offloaded them to the PIM engine by modifying the bus interface and using the CPU load/store instructions. This approach increases the hardware cost and raises system performance issues, such as keeping the core very busy and incurring high latency when accessing uncacheable PIM areas. Recent PIMs use Direct Memory Access (DMA) as the offloading mechanism [22], [25], [26], [30], [32] to resolve the performance issue by transferring opcodes and large operands without CPU intervention.
However, the DMA-based offloading method for in-DRAM PIM has only been used to reduce the offloading overhead. From a performance point of view, the approach still introduces significant offloading overhead because of the PIM's design characteristics. Little space is available in commercial DRAMs for implementing the PIM engines, i.e., only a 64-byte register for one source operand and a 16-byte latch for the other are allowed. These tiny resources limit the data transfer size of one DMA descriptor (in general, 64 bytes × the number of banks), so fetching the large operands that PIM targets requires a very large number of DMA descriptors. One PIM execution stream generally fetches two source operands alternately, each handled differently in the datapath, and thus consists of one opcode descriptor, two source operand descriptors, and one destination operand descriptor per OS page. Therefore, a smaller DMA data transfer size requires more operand descriptors and, consequently, more opcode descriptors, incurring significant operand and opcode offloading overhead that substantially degrades performance. Even worse, the opcode is only 64 bytes, which does not fit well with the characteristics of DMA transfers. Our experiment shows that the opcode descriptors account for about 25% of all descriptors.
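The roughly 25% share follows directly from the per-page descriptor mix described above. The short sketch below only restates that arithmetic; the dictionary `PER_PAGE`, the helper names, and the page count of 1000 are illustrative assumptions, not part of the measured setup.

```python
# Per-page descriptor mix as described in the text: one opcode descriptor,
# two source-operand descriptors, and one destination-operand descriptor.
PER_PAGE = {"opcode": 1, "source": 2, "destination": 1}


def opcode_share(mix):
    """Fraction of all descriptors that carry only an opcode."""
    return mix["opcode"] / sum(mix.values())


def total_descriptors(pages, mix, fuse_opcode=False):
    """Descriptors needed for `pages` OS pages of PIM work.

    With fuse_opcode=True the opcode travels inside the operand
    descriptors, so the opcode-only descriptors disappear.
    """
    per_page = sum(mix.values())
    if fuse_opcode:
        per_page -= mix["opcode"]
    return pages * per_page


share = opcode_share(PER_PAGE)
baseline = total_descriptors(1000, PER_PAGE)       # hypothetical page count
fused = total_descriptors(1000, PER_PAGE, fuse_opcode=True)
print(f"opcode share: {share:.0%}, baseline: {baseline}, fused: {fused}")
```

Under this idealized 1:2:1 mix, removing the opcode-only descriptors cuts the total by exactly one quarter; real workloads deviate slightly from that ratio.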
In this paper, we propose the first Processing-in-Memory Instruction Set Architecture (PIM ISA) to guarantee full compatibility with current commercial computing platforms, which is critical to the success of the PIM products in the market, and make the following contributions.
• Contribution 1: We introduce PIM ISA using a DMA descriptor called PISA-DMA and use the DMA engine as the PIM ISA offloading engine. Most fields of the DMA descriptor are already well-defined to fit the PIM concept and the PIM-target application characteristic, i.e., not involving CPU execution and repeatedly executing the large-size data with the same operation. We specify the PIM operands in the source and destination fields of the DMA descriptor and the PIM opcode in the unused bit fields of the DMA descriptor. As a result, we can depict the PIM operand and opcode in one DMA descriptor and use the DMA engine as the PIM offload engine, thus neither adding any hardware components nor modifying computing platforms. Our approach is very cost-effective in implementation.
• Contribution 2: Our PISA-DMA makes PIM programming intuitive. We can think of committing one PIM instruction as completing one DMA transaction and representing a sequence of PIM instructions using the DMA descriptor list. Therefore, the PIM programming is the same as the DMA programming. Also, since the DMA transactions are serviced one by one, we can think of PIM ISAs as being processed in order as well. When processing a transaction, i.e., each PISA-DMA, we can execute memory requests in parallel across banks by exploiting the bank-level parallelism and in-order inside a bank. The separate opcode and operand offloading in the previous works [22], [32] asked a user to carefully program for synchronizing the PIM memory requests between the opcode and the operand execution, thus incurring high programming complexity and related execution overhead.
• Contribution 3: Our PISA-DMA minimizes the offloading overhead. We cannot reduce the operand fetching needed for the PIM execution, i.e., it is difficult to reduce the operand descriptors. However, if we express the opcode and operands together in a single descriptor, we can eliminate the opcode descriptors. The elimination allowed us to reduce the total number of descriptors by 25.8%, 26.1%, and 24.9%, thus achieving significant speedups of 1.25x, 1.31x, and 1.29x compared to the baseline PIM [26] in the BERT [33], RoBERTa [34], and GPT-2 [35] models, respectively, with a sequence length of 128.
• Contribution 4: We can apply traditional compiler optimization techniques to our PIM code generation. We fused the matrix-matrix multiplication with the following element-wise addition operators in BERT and found that PISA reduced the descriptors by 2.9% and achieved a 1.04x speedup compared with the unfused PISA execution, similar to the gain in the CPU parallel execution. Therefore, we conclude that we can apply various compiler optimization techniques to our PIM code generation without concerns.
The rest of this paper is organized as follows: Section II describes the decoupled PIM, the baseline architecture of this paper, and Section III proposes the PISA-DMA instructions using DMA descriptors and describes the PISA-DMA execution flow. Section IV shows the performance evaluation, and Section V discusses the related work. Then, we present the conclusion in Section VI.

II. BACKGROUND: DECOUPLED PIM
We propose the PISA-DMA based on the decoupled PIM [26], so we review its datapath, interface unit, and execution flow of matrix-matrix multiplication.
A. DATAPATH
Fig. 1 shows the decoupled PIM datapath with the 128-bit DRAM bank data bit-width [25], [26]. We could not afford abundant computing resources due to the space limitation in DRAM for the PIM development. Nevertheless, to compute with the 4-cycle burst standard read/write requests, each bank embeds the datapath including 8 bfloat MACs for the 8-way vector computations, one 128-bit general vector register (vecA), one 128-bit×4 general vector register (vecB[3:0]), and one 176-bit×4 accumulator register (vACC[3:0]). vecB[3:0] stores the whole burst, and vecA stores only 1-cycle burst from the whole burst. The datapath consists of 4 pipeline stages: FE stage for fetching operands from a bank, EX0/EX1 stages for the MAC computations (multiplication in the first stage and addition in the second stage), and WB stage that writes vACC to the bank.
The control unit configures the datapath before one DMA transaction fetches and stores the PIM operands by decoding the information in control registers in the decoupled PIM Interface Unit, provided by the PIM opcode offloading. The pre-configured datapath allows us to significantly lower power consumption by avoiding instruction decoding at every computation.

B. INTERFACE UNIT
Fig. 2 shows the decoupled PIM Interface Unit, which is inside the PIM DRAM and shared by all banks, complying with the JEDEC standard memory interface, receiving Command, Address, and Data signals from standard memory requests as a conventional DRAM device. A programmer (PIM library) offloads the opcode, i.e., stores either the PIM source or destination operand address, its size, and the datapath configuration information of the PIM engine in uncached memory-mapped Control Regs (REG A/B/C/D) before the PIM execution. After initializing all the registers, we ask the CPU to initiate the DMA transfer, and the DMA engine issues memory requests to the PIM device. After completing the transfer, the DMA engine interrupts the CPU to notify the completion of the PIM execution.
During the PIM execution, the PIM Request Identification Unit (RIU) distinguishes the PIM memory requests by matching the PIM operand information in the operand registers (REG A/B/C) from the incoming memory requests and provides the data to the engine if matched. Each bit of the configuration register (REG D) corresponds to one control signal of the PIM datapath.
The decoupled PIM performs two execution phases of memory and computation by controlling each bank's PimS switch in Fig. 2 by the PIM memory requests. All memory requests use the DataS switch to connect to the global bus, i.e., to place data on the bus for read requests and acquire data from the bus for write requests. At the memory phase, we read the bank-private operands from a bank and turn on the corresponding bank's DataS switch to place them on the global bus. At the same time, RIU recognizes the bank-private operands and turns on the bank's PimS switch to store them in its PIM engine's registers. After performing the memory phase bank-by-bank, at the computation phase, we read the bank-shared operands from a bank and turn on the corresponding bank's DataS switch to place them on the global bus. At the same time, RIU recognizes the bank-shared operands and generates the BC match signal to notify the broadcast to all banks' engines. The signal turns on all banks' PimS switches so that all banks' engines receive the broadcast data from the global data bus and perform the computation [26]. The broadcast makes all banks perform the computation simultaneously, thus reaching the computing throughput of the all-bank PIM without any data conflict on the global bus.
Fig. 3 shows the decoupled PIM's matrix-matrix multiplication using an optimal tiling size of (32 × 32) × (32 × 16) [26]. Each bank $j$ ($j = 0, 1, \ldots, 15$, where $j$ is a bank number) multiplies the pairs $(a_{0:31,0}, b_{0,j})$, $(a_{0:31,1}, b_{1,j})$, $\ldots$, $(a_{0:31,31}, b_{31,j})$ and accumulates the multiplication results one-by-one to calculate $c_{0:31,j}$. The execution performs the following phases in order:

1) Memory phase for fetching bank-private operands:
Each bank reads 64-byte (a burst size) columns of the bank-private operand MatB from its memory cell array and stores them to vecB.

3) Memory phase for storing the results into memory:
Each bank stores a 64-byte vACC in the memory cell array.
We transposed all the matrices to match the DRAM's 64-byte read/write granularity and followed the conventional address mapping, which uses address bits 9 to 6 as the bank ID [25]. In other words, contiguous data is distributed across all 16 banks in interleaved 64-byte units.
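The execution flow and the address mapping above can be emulated in a few lines of software. The following is a minimal functional sketch, not the hardware implementation: the function names `bank_id` and `pim_tile_matmul` are hypothetical, plain Python numbers stand in for bfloat values, and the loop order simply mirrors the bank-private load, broadcast-and-accumulate, and store phases described in the text.

```python
NUM_BANKS = 16  # address bits 9..6 select one of 16 banks (from the text)


def bank_id(addr):
    """Bank selected by address bits 9 to 6, i.e., 64-byte interleaving."""
    return (addr >> 6) & (NUM_BANKS - 1)


def pim_tile_matmul(a, b):
    """Emulate one (32x32) x (32x16) tile with one output column per bank.

    a: bank-shared operand, broadcast to all banks column by column.
    b: bank-private operand; column j is held by bank j.
    """
    rows, depth, banks = len(a), len(b), len(b[0])
    # Memory phase: each bank loads its private column of b (vecB).
    vec_b = [[b[k][j] for k in range(depth)] for j in range(banks)]
    # Each bank owns one accumulator vector (vACC), cleared before use.
    v_acc = [[0.0] * rows for _ in range(banks)]
    # Computation phase: column k of a is broadcast on the shared bus;
    # every bank scales it by its own constant b[k][j] and accumulates.
    for k in range(depth):
        a_col = [a[i][k] for i in range(rows)]
        for j in range(banks):
            for i in range(rows):
                v_acc[j][i] += a_col[i] * vec_b[j][k]
    # Store phase: bank j writes its accumulator back as column j of c.
    return [[v_acc[j][i] for j in range(banks)] for i in range(rows)]


# Consecutive 64-byte blocks land in consecutive banks and wrap after 16.
print([bank_id(block * 64) for block in (0, 1, 15, 16)])
```

In hardware the inner loop over banks runs in parallel, which is where the all-bank throughput comes from; the sequential loop here only checks that the data flow computes the same product.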

III. PISA-DMA: PIM INSTRUCTION SET ARCHITECTURE USING DIRECT MEMORY ACCESS
In this section, we describe how to design and execute the PISA-DMA, which represents both the PIM opcode and operand in one descriptor without modifying any architecture layers.

A. PIM EXECUTION BEHAVIOR VS. DMA
We should carefully handle the PIM opcode and operands for correct execution, which imposes the following requirements:
1) The control registers in the PIM Interface Unit should be uncached and memory-mapped, as discussed in Section II-B.
2) The most up-to-date source operands should be in DRAM before the PIM execution.
3) Similarly, the valid destination operands should be in DRAM after the PIM execution.
To satisfy the second and third requirements, we should flush the cached PIM operands into DRAM before the PIM execution and invalidate them. The cache flush and invalidation incur significant overhead, so most PIM studies declared the PIM operands as uncached [22], [23], [24], [25], [26]. However, CPU accesses to uncached data are slow because their memory operations are strictly ordered. This slowness also degrades the performance of the first requirement.
Most PIM-target applications perform their computation using bulk, contiguous memory requests. Therefore, memory requests generated by the CPU incur significant overhead due to their large number and the slow uncached attribute, which degrades the overall performance. A DMA engine is designed to transfer bulk, successive data from one memory to another without CPU intervention, i.e., it is suitable for uncached data. Therefore, considering the PIM opcode and operand management and the PIM application characteristics, we conclude that the DMA engine is well suited to the PIM operation.
Fig. 4 represents the PISA-DMA format using a DMA descriptor [36]. Each PISA-DMA consists of an opcode, one source operand, and one destination operand (their start addresses and size), like most ISAs, e.g., x86, ARM, and RISC-V. The descriptor already contains the fields for specifying the operands (64-bit source and destination addresses and a 32-bit transfer size), so we use 14 bits of the unused descriptor field to store the opcode. The descriptor also contains the next descriptor address, allowing us to represent a sequence of PISA-DMA instructions. We can thus represent both the PIM opcode and the operand information in one DMA descriptor without modifying the DMA descriptor format and, therefore, without modifying the DMA engine. When the DMA engine fetches a descriptor, the PISA-DMA Interface Unit recognizes a PISA-DMA instruction and configures the PIM datapath using the PIM opcode specified in the descriptor. Then, the DMA engine fetches the operands.
In a data transfer operation, the DMA engine issues both read requests using the source address and write requests that store the read data using the destination address, according to the transfer size. However, PIM is different, i.e., it needs only one operand. For a PIM operand fetch, only the read requests are needed: they fetch the PIM operand from memory and provide the read data to the datapath for the computation; therefore, we have no destination operand to specify, although the DMA descriptor requires one. For this purpose, we reserve eight pages (32KB), the maximum size of the consecutive memory accesses made by the PIM math library, and use them as the destination. Similarly, when storing the PIM operands in memory, we assign the reserved space as the source operand.
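To make the descriptor usage concrete, the sketch below packs one PISA-DMA instruction into a byte string. It is an illustration under stated assumptions, not the actual descriptor layout of the DMA engine: the field order, the 32-byte size, the opcode encodings in `OPCODES`, and the address `RESERVED_BASE` are all hypothetical. Only the field widths (64-bit addresses, 32-bit size, 14-bit opcode) and the 32KB reserved region come from the text.

```python
import struct

OPCODE_BITS = 14               # opcode width stated in the text
RESERVED_BASE = 0x8000_0000    # hypothetical address of the reserved region
RESERVED_SIZE = 8 * 4096       # eight 4 KB pages (32 KB), from the text
# Hypothetical opcode encodings; the real values are defined in Table 1.
OPCODES = {"MOVA": 1, "MOVB": 2, "MOVC": 3, "MAC": 4}


def make_descriptor(op, operand_addr, size, next_desc=0):
    """Pack one instruction: opcode plus a single real operand."""
    if size > RESERVED_SIZE:
        raise ValueError("transfer exceeds the reserved placeholder region")
    code = OPCODES[op]
    if code >= 1 << OPCODE_BITS:
        raise ValueError("opcode does not fit in 14 bits")
    if op == "MOVC":
        # Store to memory: the real operand is the destination, so the
        # reserved region fills the source field.
        src, dst = RESERVED_BASE, operand_addr
    else:
        # Fetch from memory: the real operand is the source, so the
        # reserved region fills the destination field.
        src, dst = operand_addr, RESERVED_BASE
    # Assumed layout: next (8) | src (8) | dst (8) | size (4) | opcode word (4)
    return struct.pack("<QQQII", next_desc, src, dst, size, code)


desc = make_descriptor("MOVB", 0x1000, 4096, next_desc=0x40)
print(len(desc), desc.hex())
```

Keeping the placeholder in whichever address field the instruction does not use lets an unmodified DMA engine process the descriptor as an ordinary copy.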
2) OPCODES
Table 1 shows the PISA-DMA opcodes to configure the PIM datapath. We divide them into MOVE, ALU, and CLR types.
The MOVE type specifies operand fetch and store between a memory bank and registers. MOVA/MOVB configures the movement path of incoming data from a bank to the vecA/vecB register by the PIM read request, respectively. MOVC configures the movement path for storing data in the vACC register to a bank by the PIM write request. MOVD is an opcode to configure the data movement from the vecB register to the vACC register for element-wise ADD/SUB operations, requiring two source operands from vACC and vecA. BCAST is an opcode to broadcast the data from one bank to all banks.
The ALU type specifies arithmetic operations, such as MAC/ADD/SUB/MUL. CSTB is an opcode for constant value broadcasting, and k represents a counter value. k is auto-incremented by 1 for every PIM read request and selects one 16-bit constant value from the 512-bit vecB register. The selected 16-bit value is broadcast to every input of the 8-way 16-bit MAC unit. After k reaches 31, it is reset because all 32 values of the 512-bit vecB register have been used. All the computation outputs are stored only in the vACC register.
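The counter behavior of CSTB can be illustrated with a small software model. This is a sketch under assumptions: the class name `ConstBroadcaster` is hypothetical, vecB is modeled as a Python integer, and placing element k in bits 16k to 16k+15 is an assumed ordering that the text does not specify.

```python
LANES = 8                          # 8-way MAC unit
ELEM_BITS = 16                     # width of one constant
VECB_BITS = 512                    # width of the vecB register
ELEMS = VECB_BITS // ELEM_BITS     # 32 constants per vecB load


class ConstBroadcaster:
    """Model of the counter k that walks through vecB on each read."""

    def __init__(self, vec_b):
        if not 0 <= vec_b < 1 << VECB_BITS:
            raise ValueError("vecB must fit in 512 bits")
        self.vec_b = vec_b
        self.k = 0

    def next_constant(self):
        """Return the k-th 16-bit value replicated to all MAC lanes."""
        value = (self.vec_b >> (self.k * ELEM_BITS)) & ((1 << ELEM_BITS) - 1)
        # Auto-increment; wrap to 0 once all 32 elements have been used.
        self.k = (self.k + 1) % ELEMS
        return [value] * LANES


# Fill vecB so that element k holds the value k + 1.
vec_b = sum((k + 1) << (k * ELEM_BITS) for k in range(ELEMS))
unit = ConstBroadcaster(vec_b)
print(unit.next_constant(), unit.next_constant())
```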
The CLR type is an opcode for clearing the broadcasting counter, control registers, and general registers.

C. INTERFACE UNIT
In the decoupled PIM, we store the information of two source operands (src0 and src1) and one destination operand (dest) at one time, i.e., using one write memory command, representing the execution of ''src0 op src1 → dest.'' Then, we read src0, read src1, and store dest in order for the execution using memory requests. In PISA-DMA, we represent one operand and one opcode in one descriptor. Therefore, we reshape the execution of the decoupled PIM into ''read src0; op src1 → acc; store acc into dest,'' needing only one operand at one time.
Therefore, we can reduce the two source registers (REG A/B) and one destination register (REG C) in RIU [25] to one (Rop). However, we need one more register (Rdesc) to store the address information about the PISA-DMA descriptors. As a result, we could reduce the 64-bit×4 registers to 64-bit×3 registers and the 3 address matching logics to 2. The configuration register of the decoupled PIM (REG D) was renamed to Rconf for PISA-DMA.
identified, the interface unit decodes the descriptor and stores the decoded information in the interface unit, i.e., the PIM opcode in Rconf to configure the PIM engine datapath and the operand in Rop.
④ Executing the PIM computation: Then, the DMA engine transfers memory requests and, thus, fetches the PIM operands. PISA-DMA IU recognizes the incoming memory requests as the PIM memory requests by matching them with the Rop register and performs the PIM computation using the configured PIM datapath.
⑤ Completing and continuing more PISA-DMA descriptors: After completing the fetched PISA-DMA descriptor, the DMA engine fetches the subsequent descriptors in order, if available. After completing all the descriptors, the DMA engine interrupts the CPU, and we finish the execution.

E. REPRESENTING A SEQUENCE OF PISA-DMA INSTRUCTIONS
A DMA engine generally supports a descriptor list operation to reduce the overhead of handling multiple descriptors. The descriptor list contains one or more descriptors, and each descriptor in the list corresponds to a single DMA transaction. The CPU stores the descriptor list, i.e., a sequence of descriptors, before a DMA transfer. Then, when the CPU initiates a DMA transfer, the DMA engine fetches the descriptors from memory one by one and generates a DMA transaction for each. After completing the DMA transfer corresponding to one descriptor, the DMA engine fetches the next descriptor using the next descriptor address, as shown in Fig. 4. After completing the DMA transfers of all descriptors in the descriptor list, the DMA engine sends a completion interrupt to the CPU only once.
PISA-DMA represents both an opcode and an operand in one DMA descriptor; therefore, we can represent the sequence of PISAs as the descriptor list. One PISA-DMA instruction commit is equivalent to completing one DMA transaction; the commit of all PISA instructions stored in the descriptor list is the same as completing the code sequence. It allows a programmer to express the PIM codes intuitively and reduces the PISA-DMA execution overhead.
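The correspondence between a PIM instruction sequence and a descriptor list can be sketched as a linked list that a simulated DMA engine walks in order. The names below (`Descriptor`, `build_list`, `run`) are hypothetical, the opcodes and addresses in the example program are placeholders, and the model only tracks what the interface unit would latch (the opcode into Rconf and the operand range into Rop); it performs no computation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Descriptor:
    opcode: str                      # PIM opcode carried in the descriptor
    operand_addr: int                # start address of the single operand
    size: int                        # transfer size in bytes
    next: Optional["Descriptor"] = None


def build_list(instrs):
    """Chain (opcode, addr, size) tuples through the next-descriptor link."""
    head = None
    for opcode, addr, size in reversed(instrs):
        head = Descriptor(opcode, addr, size, head)
    return head


def run(head):
    """Walk the list in order; return the latched state and interrupt count."""
    trace = []
    desc = head
    while desc is not None:
        r_conf = desc.opcode                      # configures the datapath
        r_op = (desc.operand_addr, desc.size)     # range matched on requests
        trace.append((r_conf, r_op))              # one instruction commits
        desc = desc.next
    # A single completion interrupt is raised after the whole list finishes.
    interrupts = 1 if head is not None else 0
    return trace, interrupts


# "read src0; op src1 -> acc; store acc into dest" as three descriptors.
program = [("MOVB", 0x1000, 64), ("MAC", 0x2000, 64), ("MOVC", 0x3000, 64)]
trace, interrupts = run(build_list(program))
print(trace, interrupts)
```

Because each descriptor carries its own opcode, no separate synchronization step between opcode and operand offloading appears anywhere in the list.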

IV. PERFORMANCE EVALUATION
A. EXPERIMENTAL ENVIRONMENT
Fig. 6 represents our experimental platform to use the PISA-DMA, extended from the decoupled PIM [26] on HTG-Z920 (Xilinx Virtex UltraScale board), and Table 2 describes the experiment configuration. Our baseline architecture targets one channel of HBM [25], [26], where all banks are connected to one shared bus inside the chip.
We measured the performance of three DNN application models, BERT, RoBERTa, and GPT-2, using the decoupled PIM (baseline), the CPU serial execution (CPU_S), and the CPU parallel execution using OpenMP (CPU_P).
We did not modify any hardware components other than the PIM device; the Xilinx DDR4 memory controller and the Xilinx CDMA IP are unmodified, as shown in Fig. 6. We only modified the interface unit of the decoupled PIM device [26] described in Section II-B and the PIM library to represent the PISA-DMAs. We developed the PIM in the Programmable Logic (PL) area, i.e., we used the PL memory as the PIM memory. Since the operating frequencies of the PS and PL memory controllers were different, we scaled them for a fair performance comparison with the CPU execution. The decoupled PIM and the PISA-DMA PIM used the Xilinx CDMA (Central Direct Memory Access) IP [36] as their offloading engine, which supports coalescing the completion interrupts of up to 255 descriptors. We allocated the PISA-DMA descriptors in the PL memory.
Also, we developed the PIM MemPool to provide large contiguous physical pages by utilizing a huge page [39] mechanism at an application level without modifying the OS (+MemPool) in order to see the strength and weakness of the PISA-DMA with and without huge contiguous physical pages. The support of the huge contiguous physical pages can allow us to reduce operand descriptors.
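The effect of contiguous physical pages on the operand descriptor count can be estimated with a one-line formula. The sketch below assumes that one operand descriptor covers one physically contiguous run, that per-page allocation yields 4KB runs, and that contiguous allocation yields runs of up to 32KB (the maximum consecutive access size mentioned in Section III); the function name and the 1 MiB operand are illustrative only.

```python
PAGE_BYTES = 4096            # OS page size
MAX_RUN_BYTES = 8 * 4096     # 32 KB, maximum consecutive access size


def operand_descriptors(nbytes, run_bytes=PAGE_BYTES):
    """Descriptors needed when each one covers at most `run_bytes`."""
    if nbytes < 0 or run_bytes <= 0:
        raise ValueError("sizes must be non-negative and runs positive")
    return -(-nbytes // run_bytes)   # ceiling division


operand = 1 << 20  # hypothetical 1 MiB operand
per_page = operand_descriptors(operand)
pooled = operand_descriptors(operand, MAX_RUN_BYTES)
print(per_page, pooled)
```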

B. PIM DESCRIPTORS: OFFLOADING OVERHEAD
The PIM execution consists of three factors, as mentioned in Section III-D: 1) the offloading to generate and store the DMA descriptors in DRAM, 2) the computation by reading/writing operands using DMA, and 3) the notification to the CPU after completing the PIM execution. Since we used the same PIM math algorithm and PIM architecture for all the PIM performance studies, only 1) and 3) determine their performance difference. More precisely, the number of DMA descriptors for the PIM offloading determines the performance, and Fig. 7 shows the numbers when varying the sequence length from 32 to 128. The higher the sequence length (SL), the larger the operand size and the more descriptors are needed. The sequence length corresponds to p in the matrix-matrix multiplication of (p × q) × (q × r).
Our PISA completely eliminated the opcode descriptors by combining them with the operand descriptors; thus, it reduced the total descriptors by 25.8%, 26.1%, and 24.9% compared to baseline+per-page, i.e., without MemPool, in BERT, RoBERTa, and GPT-2, respectively. The MemPool library provides contiguous physical pages, thus significantly reducing the opcode descriptors to specify the operand's address range. However, the library still needs the opcode descriptors, and PISA further removes them, reducing the total by 4.8%, 6.8%, and 5.7% compared to baseline+MemPool. As a result, we conclude that we do not need the MemPool function, which shows the strength of our PISA-DMA.

C. SPEEDUP AND EXECUTION TIME
Fig. 8 shows the speedup of the total execution time with the PIM's execution time breakdown. When varying the sequence length from 32 to 128, CPU_P achieved speedups of 2.90∼3.04x, 3.27∼3.37x, and 3.45∼3.63x in BERT, RoBERTa, and GPT-2, respectively, with respect to CPU_S. The speedup saturated at about 3.5x due to its 4-core execution.
The number of descriptors in Fig. 7 directly affected the PIM performance, and PISA achieved the highest performance in all the experiments. Memcpy in the PIM execution represents the data copy between the CPU and PIM for providing data coherence. The CPU execution in PIM represents the operations that the CPU performs because the PIM device does not support them. Therefore, they are the same in all cases of the baseline and PISA.
PISA achieved significant speedups over CPU_S: 6.22x and 5.48x in BERT, 7.63x and 6.31x in RoBERTa, and 9.15x and 7.98x in GPT-2 at the sequence lengths of 32 and 128, respectively. PISA consistently achieved a higher speedup than the baseline PIM with the per-page and MemPool approaches by remarkably reducing the offloading time: PISA+per-page achieved speedups of 1.21∼1.25x, 1.30∼1.31x, and 1.27∼1.29x compared to baseline+per-page in the three models, respectively. Also, we found that MemPool did not contribute any performance improvement with PISA since a per-page operand descriptor already transfers sufficiently large data, at least 4KB at one time. The large size can diminish the timing gap between multiple per-page operand descriptors by providing many memory requests to a memory controller.

D. DRAM BEHAVIOR
Fig. 9 shows the breakdown of the row buffer hit/miss/conflict of the baseline decoupled PIM and our PISA-DMA. The row buffer behaviors are related to 1) the CPU's storing descriptors, 2) the DMA engine's fetching descriptors, 3) its offloading opcodes, and 4) its fetching and storing operands. We implemented the performance counter inside the Xilinx memory controller for profiling the DRAM behaviors.
VOLUME 11, 2023
A row miss or conflict occurs whenever the access sequence switches among 1) to 4). In PISA, which eliminates the opcode offloading, a row buffer conflict occurs when starting to read a different operand or after consuming all data of the opened row buffer. However, in the decoupled PIM, a row buffer conflict is also encountered whenever re-configuration is needed because the DMA engine reads the opcode descriptor from different DRAM rows and writes the opcode to the control registers in the interface unit to configure the PIM engine. Therefore, the baseline PIM incurs more row buffer conflicts than PISA. As a result, the row buffer hit ratio of PISA-DMA was about 8.2%, 2.7%, and 2.7% higher, and the row buffer conflict ratio was about 8.4%, 2.8%, and 3.0% lower than those of baseline+per-page in BERT, RoBERTa, and GPT-2, respectively. MemPool reduces the number of engine re-configurations in the decoupled PIM, so its DRAM behavior was similar to that of PISA.

E. OPERATOR FUSION
Traditionally, a compiler's code optimization on CPU improves performance by executing fewer instructions, and thus, we study how our proposed PISA affects performance in a compiler's code optimization.
We applied operator fusion to pairs of a matrix-matrix multiplication and its following element-wise addition, a technique widely used to improve performance [40] by removing the spills, i.e., storing the multiplication results and reloading them for the addition. We measured the performance variation when fusing 24 of the 32 matrix multiplication operators in BERT with their element-wise additions at a sequence length of 128. Fig. 10(a) shows the number of descriptors in each execution without and with the operator fusion. Without PISA, the fusion decreased the total number of descriptors by 0.8% and 2.0% in the baseline+per-page and baseline+MemPool executions, respectively: it reduced the operand descriptors by 3% in both executions but increased the opcode descriptors by 5% and 25%, respectively. The fusion increases the opcode descriptors because of the interleaved execution (alternately executing multiplication and addition), which increases the PIM device re-configuration. On the other hand, PISA does not require opcode offloading, so the fusion reduced the PISA descriptors further, by 2.9% in both the per-page and MemPool executions.
Fig. 10(b) shows the speedup of the fused execution with respect to the unfused one, with the execution time breakdown. The fusion improved the CPU_P performance by 3% compared with the unfused CPU_P execution. Likewise, baseline+fusion improved the performance by 3% and 2% in the per-page and MemPool executions, respectively, compared with the unfused baseline execution. In contrast, PISA gained more from the fusion because it needs no descriptors for the device re-configuration: 4% in both the per-page and MemPool executions, compared with the unfused PISA execution. Therefore, PISA directly improved the performance by reducing the operand descriptors, i.e., by removing spills.
The fusion did not improve the performance noticeably because, with N × N matrices, the multiplication requires $O(N^3)$ operations, whereas the fusion removes only $O(N^2)$ spills.
Table 3 compares the interface unit area of the decoupled PIM and PISA-DMA, consisting of control registers and address matching logics, as discussed in Section III-C. For the comparison, we used the 65nm logic process, similar to the DRAM fabrication characteristics [25].
PISA-DMA reduced the total area by about 29% compared to the decoupled PIM, owing to reductions of 36% and 26% in the control registers and address matching logics, respectively. The reduction is crucial since there is little space available for implementation in commercial DRAM. The control registers occupy about twice the area of the address matching logic.

V. RELATED WORK
Most PIM studies [11], [12], [22], [23], [30], [32] have proposed their own PIM ISAs. However, they neither fit the PIM concept, i.e., not involving the CPU execution, nor PIM-target application characteristics, i.e., expressing large-size operands with the same operation. Also, they require hardware modification and are incompatible with current commercial computing platforms.
PEI [11] and GraphPIM [12] needed to modify the core pipeline for their own PIM ISAs, thus lacking compatibility with commercial computing platforms.
In AiM [22] based on GDDR6, a host offloads the AiM instructions to the ISR register inside the memory controller. Then, the controller decodes them into DRAM commands called AiM commands to perform all-bank execution with the DMA-offloaded operands. It requires modifying the memory controller, thus incurring incompatibility with current commercial computing platforms. Also, the separate offloading of the opcode and operands incurs the execution overhead and makes the PIM programming difficult.
UPMEM [30] separates the PIM memory area (MRAM) from the main memory to avoid the memory controller modification and includes the accelerator inside the PIM memory. The DMA engine in the UPMEM device offloads instructions and operands from MRAM to IRAM (instruction memory) and WRAM (scratchpad memory), respectively. The limited capacity of IRAM and WRAM incurs frequent offloading. Its design and execution follow those of traditional accelerators rather than a PIM that handles large operands with the same operation.
Samsung FIM [23] also embedded a core to execute PIM-HBM instructions to support the all-bank execution and separated the PIM memory from the main memory. FIM stores the PIM-HBM instructions in the CRF instruction buffer, and the CPU load/store instructions trigger DRAM commands for the execution. Samsung-FIM issues the memory commands in a user-defined order, and the memory requests to PIM can be reordered while passing through the memory hierarchy and a memory controller. RNN-T [32] based on Samsung-FIM utilizes a DMA engine to guarantee memory ordering. However, it still uses separate opcode and operand offloading, thus incurring the execution overhead and increasing the program complexity.

VI. CONCLUSION
This paper proposed PIM ISAs that represent both the PIM opcode and operand in one data structure using the DMA descriptor while providing full compatibility with commercial platforms. Committing one PIM instruction is the same as completing one PISA-DMA transaction. Also, we can represent a sequence of PIM instructions using the DMA descriptor list. It allows a programmer to express the PIM codes intuitively and reduces the PISA-DMA execution overhead.
We measured the performance of PISA with BERT, RoBERTa, and GPT-2 in ONNX Runtime on real machines. PISA's opcode descriptor elimination allowed us to achieve speedups of 1.25x, 1.31x, and 1.29x in the three models, respectively, over the decoupled PIM in the per-page memory layout. Also, we showed that PISA diminished the necessity of MemPool to provide large contiguous physical pages and incurred fewer DRAM row buffer misses. Additionally, we studied the performance variation when applying the operator fusion and found that the PISA execution gained more from the fusion than the baseline PIM execution.