ReAAP: A Reconfigurable and Algorithm-Oriented Array Processor With Compiler-Architecture Co-Design

Parallelism and data reuse are the most critical issues in the design of hardware acceleration for a deep learning processor. In addition, abundant on-chip memories and precise data management are intrinsic design requirements because most deep learning algorithms are data-driven and memory-bound. In this paper, we propose a compiler-architecture co-design scheme targeting a reconfigurable and algorithm-oriented array processor, named ReAAP. Given specific deep neural networks, the proposed co-design scheme effectively performs parallelism and data reuse optimization on compute-intensive layers to guide reconfigurable computing in hardware. In particular, systematic optimization is performed in our proposed domain-specific compiler to deal with the intrinsic tension between parallelism and data locality, for the purpose of automatically mapping diverse layer-level workloads onto our proposed reconfigurable array architecture. In this architecture, abundant on-chip memories are software-controlled and their massive data accesses are precisely handled by compiler-generated instructions. In our experiments, ReAAP is implemented on an embedded FPGA platform. Experimental results demonstrate that our proposed co-design scheme is effective in integrating software flexibility with hardware parallelism for accelerating diverse deep learning workloads. As a whole system, ReAAP achieves consistently high utilization of hardware resources when accelerating all the diverse compute-intensive layers in ResNet, MobileNet, and BERT.

Domain-specific processors have been studied extensively in both academia and industry [1]. Parallel architectures, such as systolic arrays and multithreading hardware, exhibit favorable characteristics for meeting the explosive demand for computing power from deep learning algorithms. For example, Google TPUv3 [2] essentially consists of multiple two-dimensional (2D) systolic arrays, each with 128 × 128 processing elements (PEs) running in parallel. While employing massive PEs to exploit data-level parallelism in deep neural networks (DNNs), near-neighbor interconnections between PEs in the systolic arrays are organized to maximize data reuse opportunities in DNNs. As a whole, TPUv3 offers a 50× improvement in performance per watt over general-purpose processors for training DNNs when integrated with its XLA compiler and the TensorFlow framework [3]. General matrix multiplication (GEMM) is at the heart of accelerating deep learning algorithms. Matrix-matrix multiplication and matrix-vector multiplication are the basic computation patterns for accelerating most compute-intensive layers in DNNs. The systolic array, which consists of a group of identical PEs locally interconnected with each other, is one of the most promising architectures for matching the basic layer-level computation patterns in DNNs [4], [5]. Essentially, a systolic array is a topological structure of many PEs which rhythmically compute and pass data through the array. A fundamental issue behind designing a successful systolic array architecture is to provide an efficient data access solution that serves the highly-parallel array paradigm while optimizing the cost of data access for both on-chip and off-chip memories. In practice, customizing a unique dataflow for each type of compute-intensive layer is essential to achieve high parallelism and efficient data reuse [4], [6]. A suitable dataflow can provide high efficiency for passing partial sums or sharing input operands inside the systolic array [7], [8]. Typically, a DNN is composed of tens to hundreds of diverse compute-intensive layers, and the combined layer-by-layer pipelines are highly diverse with respect to layer sizes. To meet the diverse conditions of layer types and layer sizes when accelerating any given deep learning algorithm, customization or reconfiguration has become an important consideration in designing domain-specific processors [9], [10].
In DNNs, massive data-level parallelism and intensive data reuse originate from the vast number of data operands in or across the layer-by-layer pipelines. Data-level parallelism and data reuse can be achieved only when sufficient data are supplied in time. A fundamental challenge in mapping deep learning workloads is how to deliver sufficient operands to massively parallel PEs in a timely manner [11], [12]. The guiding principle is simply to keep all the parallel PEs as busy as possible. In practice, PE utilization has become a critical performance counter to assess the mapping efficiency between compute-intensive workloads and parallel hardware architectures [6], [7], [10]. While a specialized hardware architecture can greatly improve performance and energy efficiency compared to general-purpose processors, a dedicated compiler, called a domain-specific compiler in this paper, has the opportunity to raise the level of hardware-software co-design so as to adapt to diverse deep learning workloads [1], [13]. In contrast to traditional workloads, data access patterns are more discoverable at compile time in most deep learning workloads. Compilers and programmers can optimize in advance the use of addressable on-chip scratchpads, rather than dynamically-allocated caches, for efficient use of a multilevel memory hierarchy. A customized dataflow with optimized data blocking strategies, defined during ahead-of-time software compilation, is effective for mapping deep learning workloads onto a specialized hardware architecture. In an end-to-end full-stack system, such a domain-specific compiler plays a critically important role in program transformation, computation scheduling, memory access management, code generation, and more. Taking full advantage of its software flexibility, the domain-specific compiler can not only increase the operational intensity of its targeted algorithms [10], [14], but also help bridge the mapping gap between diverse deep learning workloads and specialized hardware architectures [2], [7].
As a matter of fact, accelerating diverse deep learning workloads with massive data access is a complicated problem that includes data layout mapping, parallel task placement, and how to traverse the parallel tasks through a multilevel memory hierarchy [12]. The complicated combination of basic computation patterns, hierarchical data access, and novel hardware primitives creates an enormous design space for hardware-aware code generation in a domain-specific compiler [13], [15]. As the design complexity of exploring parallelism and data reuse grows from the system level to the array and PE levels, the difficult task of handling massive data-level parallelism becomes more and more challenging. Rather than using heuristic search (TensorFlow + XLA [3]), template-guided search (AutoTVM [16]), or stepping search (Ansor [17]), we demonstrate in this paper that it is more effective to obtain systematic optimization of parallelism and data reuse by using sequential loop transformations based on precise data dependence analysis of layer-level computation patterns. During systematic optimization in our domain-specific compiler, the layer-level computation patterns in DNNs are uniformly expressed as nested-loop functions which are assembled or unfolded for parallelism and data reuse.
In this paper, we propose ReAAP, a reconfigurable and algorithm-oriented array processor for accelerating diverse deep learning workloads. Taking full advantage of software flexibility, our domain-specific compiler systematically abstracts parallelism and data reuse opportunities from the targeted deep learning algorithms and adaptively maps the diverse layer-level workloads onto our proposed reconfigurable array architecture. A systematic approach based on precise data dependence analysis is proposed to realize the domain-specific compiler, which optimizes layer-level computation patterns for parallelism and data reuse. The physical array size and customized dataflow in the proposed array architecture are adaptively determined according to the optimization results from the domain-specific compiler. During the layer-computing process, a dynamically-configured number of parallel PEs perform dense computation while the Just-In-Time (JIT) runtime, running on an embedded CPU, issues data preloading instructions to supply massive operands to the highly-parallel hardware array in time. The principle behind our proposed co-design methodology is to keep all the parallel PEs as busy as possible. By taking balanced consideration of both software flexibility and hardware parallelism, the end-to-end performance evaluation of ReAAP demonstrates a consistently high PE utilization for all the diverse layers in three commonly used models: ResNet, MobileNet, and BERT. Only edge PEs are connected to the on-chip scratchpads, and the other PEs are interconnected with their neighbors. The proposed array architecture can thus be easily reconfigured in scale and can operate at a high working frequency.
The remainder of this paper is organized as follows: Section 2 reviews the background and related work; Section 3 formulates the co-design methodology; Section 4 describes the system implementation in detail. The experimental results are reported in Section 5, and we conclude with further research directions in Section 6.

Domain-Specific Hardware Acceleration
The success of deep learning has inspired a resurgence in domain-specific hardware acceleration to exploit a more efficient form of parallel computing [1]. In essence, hardware acceleration operates on a pre-defined set of multidimensional data and exploits both parallelism among vast computing units for high performance and data reuse across a multilevel memory hierarchy for high efficiency. In contrast to a general-purpose processor (CPU or GPU), the principle of domain-specific design is to tailor the architecture specifically to target deep learning applications and to minimize redundant hardware resources. For example, massive multithreading in GPUs [18], which helps achieve high parallelism and tolerates unpredictable cache misses, is not necessarily ideal for DNNs, which have explicit and prefetchable data access patterns [2]. Addressable on-chip scratchpads for data prefetching can be designed with a much smaller size than multithreading register files, and thus can be much more energy and resource efficient. Additionally, the performance gains demonstrated by Google TPUv1 [6] on DNN inference and Google TPUv3 [2] on DNN training confirm that the systolic array is more effective at capturing the data reuse properties in DNNs than multithreading hardware. While much effort has been devoted to the architecture design of hardware accelerators, there are still many opportunities for research on how to make full use of accelerator architectures during software compilation. A domain-specific compiler composed of domain-specific knowledge and device-specific dialects is indispensable for mapping deep learning workloads onto parallel hardware architectures [13], [15]. For example, the Google TPUs are in-order machines which require their XLA compiler to help manage memory operations at every execution point of a deep learning application. The Nvidia GPUs are latency-tolerant machines, and thus code generation in the CUDA compiler needs to support overlapped operations to serve the multithreading hardware.

Compiler and Architecture Co-Design
Traditional load-store architectures suffer heavily from the expensive cost of the massive data movement required by deep learning applications [2]. The established set of automatic optimizations employed in traditional compilers is also insufficient to support emerging hardware features, such as highly parallel pipelines and decentralized on-chip memories [19], [20]. Achieving high performance and energy efficiency for specialized hardware usually requires modifications to the underlying optimizations in its compiler. That is, for a special domain such as deep learning, the software compiler and the hardware architecture must be co-designed to jointly optimize for performance and efficiency. In DNNs, the layer-by-layer pipelines may contain hundreds of nested loops. Parallelizing all the iteration points in these nested loops results in an enormous design space which is far too large to search exhaustively and is very time-consuming for heuristic search [3], template-guided search [16], or stepping search [17]. Essentially, implementing all the nested loops requires exploring the tradeoff among parallelism, data locality, and re-computation within multilevel tiling of the nested loops [11], [21]. Taking full advantage of software flexibility, a dedicated compiler which can perform complicated nested-loop restructuring to adequately serve a parallel hardware accelerator is therefore favorable. Compiler-architecture co-design shows high effectiveness in coping with this sophisticated design complexity in parallel computing. For example, Darkroom [22] limits image processing pipelines to contain only fixed-size stencil operations. All intermediate values are buffered in local line-buffer memory to eliminate unnecessary communication with off-chip memory. In its dedicated compiler, the complicated problem of minimizing off-chip memory bandwidth is formulated as an integer linear programming (ILP) problem which can be solved to precisely schedule the line-buffer pipelines. PolySA [4] leverages a polyhedral intermediate representation (IR) to map compute-intensive algorithms onto systolic arrays. Starting from a multi-dimensional polyhedron, compute-intensive algorithms such as matrix-matrix multiplication and convolution layers can be mapped onto a systolic array through only a few rounds of affine transformations during software compilation.

Optimizing for Parallelism and Data Reuse
Optimizing for parallelism and data reuse has been studied extensively [19], [23], [24]. Enhancing parallelism and data reuse opportunities in computationally intensive programs is beneficial to speeding up target applications running on multi-core processors. A DNN is a directed acyclic graph (DAG) of various types of layers. The compute-intensive layers, such as convolution layers and self-attention layers, dominate the computation and memory communication, and thus have been the focus of optimization. In most cases, the computation patterns of compute-intensive layers can be uniformly expressed as nested loops. For example, as shown in Algorithm 1, the computation pattern of a convolution layer can be expressed with seven nested loops. It generates output feature maps O with OC channels of size H × W by processing input feature maps I consisting of IC channels. W contains the weights as 3D filters of size IC × K_H × K_W. The stride (S_h, S_w) impacts the output feature map size H × W during the process. The computations of these feature maps can be processed concurrently in the outermost loop over the batch B to increase parallelism and data reuse opportunities. With the layer-computation patterns expressed as nested loops, loop optimizing techniques such as loop unrolling, loop tiling, loop reordering, and loop fusing can be conducted sequentially on the corresponding nested loops to enhance parallelism for high-performance computing [19], [25], or to optimize data access for data-centric processing [26], [27]. For example, loop unrolling allows more units to run in parallel by exploiting temporal or spatial parallelism in the nested loops. Loop tiling decomposes a recursive domain into tiles of a certain size to create parallelism and data reuse opportunities. While loop unrolling and tiling increase parallelism and data locality, they meanwhile cause data dependence problems at loop boundaries. Essentially, how to handle and loosen the sequential constraints of data dependence is a fundamental problem. The complicated data dependence behind arbitrary nested-loop sequences makes how to perform these loop optimizing techniques a long-standing research problem [23], [28].
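For reference, the seven nested loops described above can be sketched in Python-style pseudocode as follows; the variable names follow the text (with the weight tensor written as Wt and the output width as W_out to avoid name clashes), while the specific loop order shown is illustrative.

    # A sketch of the seven-nested-loop convolution pattern (Algorithm 1 in the text).
    def conv_layer(I, Wt, O, B, OC, IC, H, W_out, K_H, K_W, S_h, S_w):
        for b in range(B):                      # batch: independent samples (outermost)
            for oc in range(OC):                # output channels
                for h in range(H):              # output rows
                    for w in range(W_out):      # output columns
                        for ic in range(IC):    # input channels
                            for kh in range(K_H):       # filter rows
                                for kw in range(K_W):   # filter columns
                                    O[b][oc][h][w] += (
                                        I[b][ic][h * S_h + kh][w * S_w + kw]
                                        * Wt[oc][ic][kh][kw]
                                    )
        return O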
The polyhedral model [29] has been the basis for major advances in automatic optimization of nested-loop functions. It is one of the mathematical foundations of our domain-specific compiler for performing parallelism and data reuse optimization. In essence, our domain-specific compiler combines data dependence analysis and affine transformation expressiveness within a mathematical algebra representation to reason about the composition correctness of sequential loop transformations. By employing parametric polyhedra as the IR, our domain-specific compiler can utilize the great analysis power of polyhedral modeling of data dependence to enable equivalent space-time transformations into a newly generated space, without breaking the original data dependences of arbitrary nested-loop sequences [4], [29], [30]. After precise data dependence modeling within the polyhedral space, loop reordering and fusing are subsequently employed to enforce data dependences across loop boundaries. To sum up, our domain-specific compiler is developed with the insights that related work has revealed: precise data dependence analysis and a systematic consideration of sequential loop transformations are critical steps for mapping the targeted deep learning algorithms onto the proposed highly-parallel array architecture.

Mathematical Foundations
In this paper, spatial-temporal mapping [4] and the polyhedral model [29] are the mathematical foundations used to map deep learning algorithms onto our proposed array architecture. During software compilation, the polyhedral model is utilized to perform nested-loop restructuring for parallelism and data reuse optimization. After the nested-loop restructuring, spatial-temporal mapping maps an $n$-nested loop to $n - m$ space loops and $m$ time loops. First, GEMM is formulated as a three-dimensional polyhedron for data allocation at the 2D systolic array level. Each node in the polyhedron corresponds to a loop instance which performs one multiply-and-accumulate (MAC) operation: $C[i][j] = A[i][k] \times B[k][j] + C[i][j]$. To identify the space coordinates and the time stamp for each node, spatial mapping is represented by a mapping matrix $P$ and temporal scheduling is represented by a scheduling vector $\vec{s}$. Accordingly, two basis vectors $\vec{p}_1, \vec{p}_2$ are included in the mapping matrix $P$ which serves a given 2D systolic array architecture. For any node $\vec{n}$, its spatial mapping onto a coordinate (at PE $\vec{e}$) of the 2D systolic array is realized with the two basis vectors as $\vec{e} = [\vec{p}_1, \vec{p}_2]^T \vec{n}$. Its time stamp $t$, denoting when to compute the node $\vec{n}$, is formulated with the temporal-scheduling vector $\vec{s}$ as $t = \vec{s}^T \vec{n}$. Combined together, spatial mapping and temporal scheduling can be formulated as a scattering function $F_A(\vec{n}) = [\vec{p}_1, \vec{p}_2, \vec{s}]^T \vec{n}$, which generates logical stamps $S$ that precisely describe the execution order for all the PEs. Accordingly, a data access function $F_B(\vec{n})$ describes the memory array elements $B_e$ to be accessed by each node $\vec{n}$, i.e., $B_e = F_B(\vec{n})$. At iteration step $t$, when choosing a new scattering function $F_A(t)$, the data access function is updated in the newly generated space as $F_B'(t) = F_B(t)\, F_A(t)^{-1}$, where $F_A(t)^{-1}$ is the inverse of $F_A(t)$ in matrix form [4]. These spatial-temporal mapping functions onto a systolic array are the basis for effectively allocating operands in our proposed array architecture at a fine-grained level. Fig. 1a shows an example of spatial-temporal mapping onto a systolic array, with two spatial vectors $\vec{p}_1^T = (1, 0, 0)$, $\vec{p}_2^T = (0, 1, 0)$ and one temporal vector $\vec{s}^T = (1, 1, 1)$. Its data access function reveals that data from matrix A are reused between PEs vertically and data from matrix B are reused between PEs horizontally. Fig. 1b shows that only two registers need to be allocated for each matrix, and data are passed through PEs by the near-neighbor interconnections in the systolic array architecture.
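As a small illustration of the scattering function above, the following sketch maps each GEMM iteration node onto a PE coordinate and a time stamp using the vectors from the Fig. 1a example; it is a standalone illustration, not part of the compiler.

    import numpy as np

    p1 = np.array([1, 0, 0])   # spatial basis vector selecting the PE row
    p2 = np.array([0, 1, 0])   # spatial basis vector selecting the PE column
    s  = np.array([1, 1, 1])   # temporal scheduling vector (Fig. 1a example)

    F_A = np.stack([p1, p2, s])          # scattering function as a 3x3 matrix

    for i, j, k in [(0, 0, 0), (0, 0, 1), (1, 2, 3)]:
        pe_row, pe_col, t = F_A @ np.array([i, j, k])
        print(f"C[{i}][{j}] += A[{i}][{k}] * B[{k}][{j}]  ->  PE({pe_row},{pe_col}) at t={t}")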
In order to analyze data dependence precisely, statement instances inside arbitrary nested loops are projected onto multi-dimensional logical vectors in the polyhedral space [29]. For example, a loop statement $S_k$ surrounded by three loops $i, j, l$, all iterating from 0 to $N$, has the iteration domain $D_{S_k} = \{S_k(i, j, l) \mid 0 \le i, j, l \le N\}$. Each instance of the statement $S_k$ is defined by its iteration vector $\vec{v}$, which contains integer values for the loop indices from the outermost to the innermost. A loop statement always contains multiple instances, and the set of iteration vectors belonging to the same statement defines a multi-dimensional polyhedron, as illustrated in Fig. 2. In the polyhedral space, the Data Dependence Graph (DDG) is a directed multi-graph with each vertex representing a loop statement, and the data dependence between loop statements $S_r$ and $S_k$ is summarized by a dependence relation. A dependence edge $e \in E$ represents the data dependence from one of the instances of statement $S_k$ to an instance of statement $S_r$, and is characterized by a dependence polyhedron $P_e$ that captures the exact dependence information corresponding to edge $e$. A one-dimensional affine transformation for statement $S_k$ is defined as $f_{S_k}(\vec{v}) = \vec{c}_{S_k} \cdot \vec{v} + d_{S_k}$ (Equation (7)), where $f$ represents the direction (hyperplane normal) of the affine transformation, and, for a dependence edge, the vector $\vec{s}$ represents source iterations and the vector $\vec{t}$ represents target iterations. The equalities corresponding to an affine transformation form part of the dependence polyhedron $P_e$ and can be used to reduce the dimensionality of the polyhedral space by applying statement-wise affine transformations [4], [29]. Equivalently, abstracting the optimal parallelism and data reuse opportunities corresponds to finding a lexicographic minimal solution for the transformation coefficients $\vec{c}_e$, which is presented systematically in Section 3.3. Affine array access defines an exact mapping from the iteration space to the data space [23]: it maps an index vector $\vec{v}$ within the loop bounds $B\vec{v} + \vec{b} \ge \vec{0}$ to the array element location $F\vec{v} + \vec{f}$. In this study, data access in the transformed nested loops is therefore defined by a four-tuple $\mathcal{F} = \langle F, \vec{f}, B, \vec{b} \rangle$ specifying the mapping from the iteration space to an array element in the data space. With each row being a one-dimensional affine hyperplane as defined in Equation (7), a matrix is reconstructed to represent a multi-dimensional affine transformation for each statement $S_k$. If such a transformation matrix has full column rank, it completely specifies when and where a loop iteration executes. This attribute directly fits the definition of the scattering function $F_A(\vec{n})$ in the spatial-temporal mapping onto a systolic array. Based on these intrinsic relations between affine transformation and spatial-temporal mapping, this paper proposes a co-design methodology to address the fundamental challenge of bridging the gap between software optimization and hardware design.
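The affine array access four-tuple can be illustrated with a short sketch; the bounds and the access function below are example values only, not taken from the paper.

    # Affine array access <F, f, B, b>: B v + b >= 0 encodes the loop bounds,
    # and F v + f gives the accessed array element for an iteration vector v.
    import numpy as np

    N = 8
    # Bounds for a 2-deep loop nest 0 <= i, j <= N expressed as B v + b >= 0.
    B = np.array([[ 1,  0],
                  [-1,  0],
                  [ 0,  1],
                  [ 0, -1]])
    b = np.array([0, N, 0, N])

    # Example access A[i][j - 1]: F v + f with F the identity and f = (0, -1).
    F = np.array([[1, 0],
                  [0, 1]])
    f = np.array([0, -1])

    def access(v):
        v = np.asarray(v)
        assert np.all(B @ v + b >= 0), "iteration vector outside the loop bounds"
        return F @ v + f

    print(access([3, 5]))   # -> [3 4], i.e. element A[3][4]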

Bridging the Gap Between Sequential Software Optimization and Parallel Hardware Design
Diverse deep learning workloads and enormous hardware architecture choices present significant challenges for striking an optimized balance among parallelism, data locality, and recomputation [11], [21]. In this paper, we propose to explore the enormous design space of a hardware accelerator by using software optimization techniques. The optimization techniques are conducted sequentially on the software side and are meanwhile equivalent to making architecture choices for the hardware accelerator. This is the key insight behind our approach to bridging the design gap between our proposed domain-specific compiler and the highly-parallel array architecture.
Optimizing code for software is drastically different from architecture design for hardware, and there is a significant gap between sequential software optimization and parallel hardware design. We therefore propose to reveal the intrinsic relations between nested-loop optimizations on the software side and parallel architecture choices on the hardware side. On the one hand, hardware acceleration always takes advantage of parallelism and data reuse. By reusing on-chip data as much as feasible and bringing in only the necessary new data each time, massively parallel units are able to perform concurrent operations at high speed even with limited on-chip memory resources. While data blocks with good locality are buffered in a fast memory (registers, scratchpad, L1 or L2 cache), data reuse can be achieved with memory hierarchy design. On the other hand, nested-loop transformations [12], [23], [24], [25] are widely used in sequential software optimization for coarse-grained parallelism and data locality. For example, partitioning the loop iteration space into tiles allows them to be executed concurrently on different processors with a limited amount of data communication. Assuming a multilevel memory hierarchy, loop tiling can serve the architecture design of a hardware accelerator by fitting large-scale workloads into the fast but limited on-chip memory. Under the same principle, loop reordering and loop unrolling greatly impact parallelization as well as data access patterns in hardware architecture design [31], [32], [33], [34], [35]. While every expression in an unrolled loop statement is implemented with a hardware circuit, the number of unrolled loops conceptually corresponds to the degree of parallelism in the hardware architecture. In a hardware accelerator, the data access patterns across multiple PEs are determined by its dataflow scheme. Dataflows such as weight-stationary (WS), row-stationary (RS), and output-stationary (OS) are widely used in DNN accelerators [5]. With their stationary characteristics, these widely-used dataflows can be uniformly expressed as a sequential combination of loop reordering and loop unrolling on the nested loops [34], [35]. By reordering the loop execution order and merging successive operations across multiple levels of the tiled loops, data reuse opportunities are exploited to optimize data access in a multilevel memory hierarchy. Generally, the enormous design space for a hardware accelerator can be reasoned about in three aspects: data blocking, dataflow, and memory allocation. We summarize the intrinsic relations between each aspect and nested-loop transformations in Table 1. While systematic analysis is carried out on the composition choice and parameter selection for the sequential nested-loop transformations, the design complexity for a hardware accelerator is correspondingly reflected in the choices among the many different tiling and scheduling strategies in software optimization. Hence, in this paper, the design tradeoff of a hardware accelerator among parallelism, data locality, and recomputation is systematically evaluated by interleaving the nested-loop orders of allocation and execution, together with precise data dependence analysis conducted on the diverse layer-level workloads during software compilation.

Algorithm 2. A Systematic Optimizing and Automatic Mapping Approach (excerpt)
  ...
   5:  Perform 1D affine transformations {f^t_{S_1}, ..., f^t_{S_k}}
   6:  end
   7:  Generate all the possible affine transformations {f^1, f^2, ..., f^t}
   8:  for each loop statement S_k do
   9:    Suppose an affine form v(p) = u·p + w
  10:    for each dependence edge e (inter-statement or intra-statement dependence) do
  11:      Get the cost function with one of the possible affine transformations f^1 as:
             d_e(s, t) = f^1_{S_r}(t) − f^1_{S_k}(s),  <s, t> in P_e
  12:      Get all linear inequalities in the affine coefficients by:
             u·p + w − d_e(s, t) ≡ λ_e0 + Σ_{k=1..m_e} λ_ek P_e^k,  λ_ek ≥ 0
  13:      Find the lexicographic minimal solution with u and w in the leading position:
             minimize_≺ {u_1, ..., u_k, w, ..., c_i's, ...}
  ...
  32:    Generate the scattering function F_A(n) for mapping onto the given 2D systolic array
  33:    Generate the data access function F_B(n) with affine array access F = <F, f, B, b>
  34:    end
  35:  end

A Systematic Optimizing and Automatic Mapping Approach During Software Compilation
Optimizing individual loops one-by-one is ineffective; instead, the dependent chains of dataflow, or the entire nested-loop program, must be considered in a systematic way. Meanwhile, the inherently-linked nested-loop transformations must be employed cooperatively to reorder the computation patterns of nested loops for better parallelism and data reuse. The polyhedral model [21], [29] is useful for systematically optimizing the given nested-loop algorithms with precise data dependence analysis. In the polyhedral model, a dynamic iteration of each loop statement instance is viewed as an integer point, and the composition correctness of sequential loop transformations is reasoned about by instantiating an ILP solution with precise characterizations of inter-statement or intra-statement dependences. With the uniform representation of nested-loop iterations in the polyhedral space, as shown in Fig. 2, a positive distance means a data dependence is strongly satisfied, a zero distance means it is weakly satisfied, and a negative distance means it is violated. Meanwhile, a basic theorem in linear algebra (the affine form of Farkas' lemma) states that an affine function $\vec{c}\cdot\vec{x} + d$ is non-negative everywhere in the non-empty polyhedron defined by $\vec{a}\cdot\vec{x} + \vec{b} \ge 0$ if it can be expressed as a non-negative linear combination $\vec{c}\cdot\vec{x} + d \equiv \lambda_0 + \vec{\lambda}^T(\vec{a}\cdot\vec{x} + \vec{b})$, where $\lambda_0, \vec{\lambda} \ge 0$ [21]. So preserving the original data dependence is equivalent to the condition that each dependence edge $e$ has a non-negative distance, and a nested-loop transformation is equivalent to performing an affine transformation under these linear constraints on the affine coefficients $\vec{c}, d$.
In this study, we assume a cost function in affine form $d_e(\vec{t})$ for each dependence edge $e$ between two loop statement instances. Under the condition that the dependence edge $e$ has a non-negative distance, $d_e(\vec{t}) \ge 0$, a set of linear inequalities is formulated by optimizing a linear objective function over the constrained space of affine coefficients $\vec{c}_e, d_e$, where the source iterations $\vec{s}$ are completely replaced by the affine transformation of the target iterations as $\vec{s} = h_e(\vec{t})$, and the dependence polyhedron $P_e$ captures the exact dependence information corresponding to edge $e$, including inter-statement or intra-statement dependence. The set of linear inequalities comprises the dependence condition $d_e(\vec{t}) \ge 0$ and the bounding constraint on the affine coefficients $\vec{c}_e, d_e$. It can be solved at once by finding a lexicographic minimal solution with $\vec{u}$ and $w$ in the leading position, i.e., $\text{minimize}_{\prec}\{u_1, u_2, \ldots, u_k, w, \ldots, c_i\text{'s}, \ldots\}$. Finding the above minimal solution generates one affine hyperplane for the coefficients, and at least as many independent solutions as the dimensionality of the outer loops are needed. Independent hyperplanes are found iteratively until the statement-wise data dependences are satisfied for all loop statements. The cost function in affine form $d_e(\vec{t})$ in Equation (10) holds much significance in optimizing for coarse-grained parallelism and data reuse: it indicates the number of hyperplanes that the dependence edge $e$ traverses along the hyperplane normal $f$. When hyperplanes are executed iteratively, it gives a measure of the data-buffering amount for data reuse, and when hyperplanes are used to generate tiles for parallelization, it is a quantitative factor in the communication volume. As summarized in Algorithm 2, the proposed domain-specific compiler follows a systematic optimization and automatic mapping approach built upon the transformed nested loops $T$ for exploring parallelism and data reuse. Minimizing the cost function with one of the possible affine transformations corresponds to searching for the greatest potential in parallelism or data reuse within the nested loops. Under the condition on the affine coefficients $\vec{u} = 0, w \ge 0$, the transformed nested loops $T$ are guaranteed to have non-negative dependence components so that they can be tiled for coarse-grained parallelism and data reuse.
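As a toy illustration of this cost-driven selection, the following sketch scores candidate hyperplanes for a loop nest with constant (uniform) dependence distances, where the cost d_e reduces to the dot product of the hyperplane with each dependence vector; the dependences and the search range are example values, and the real compiler solves the constraints as an ILP rather than by enumeration.

    from itertools import product

    dependences = [(1, 0), (0, 1), (1, 1)]   # example uniform dependences of a 2D stencil

    def legal(c):
        # A hyperplane c is legal iff c . d >= 0 for every dependence vector d.
        return all(c[0] * d[0] + c[1] * d[1] >= 0 for d in dependences)

    def cost(c):
        # Total number of hyperplanes crossed by all dependences (communication proxy).
        return sum(c[0] * d[0] + c[1] * d[1] for d in dependences)

    candidates = [c for c in product(range(0, 3), repeat=2) if any(c) and legal(c)]
    # Smallest cost first, with a lexicographic tie-break mimicking the lexmin solution.
    best = min(candidates, key=lambda c: (cost(c), c))
    print("legal hyperplanes:", candidates)
    print("chosen hyperplane:", best)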
In DNNs, each input feature map is reused with all the weight filters, while each weight filter is used for each output feature map. The same weight filters are reused across a batch of independent samples. Parallel and pipeline processing with these reused data is a key approach to avoiding large near-MAC buffers. So in this study, a given 2D systolic array is employed at the fine-grained level to capture these parallelism and data reuse opportunities. Only one datum per filter and one partial sum per MAC need to be buffered because of the highly-pipelined organization of our proposed array architecture. Meanwhile, the hardware characteristics of the systolic array are considered as hardware constraint parameters during software compilation. Each tile contains a 2D systolic array of 4 × 4 = 16 PEs; for example, if we decide to allocate an array of 8 × 8 tiles for the mapping of three-dimensional GEMM, the range of the tiling maps would be intersected with the set $\{(b_0, b_1, e_0, e_1) \mid 0 \le b_0 < 8,\ 0 \le b_1 < 8,\ 0 \le e_0 < 4,\ 0 \le e_1 < 4\}$, where $b_0, b_1, e_0, e_1$ are architecture parameters referring to the tile and PE identifiers. While fine-grained parallelism and data reuse are captured by the topological features of the 2D systolic array, a feasibility check is required for automatic mapping onto the physical systolic array: all the iteration nodes mapped to the same spatial position in a newly-generated 2D space cannot operate at the same temporal step, and one unit of delay must be associated with each dependence edge.
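The tile and PE identifiers above can be derived from GEMM output coordinates with a simple index decomposition, as in the following sketch; the blocked (non-cyclic) decomposition is an assumption for illustration, not the compiler's actual data-blocking strategy.

    TILES, PES = 8, 4          # 8x8 grid of tiles, each tile a 4x4 PE array

    def place(i, j):
        b0, e0 = divmod(i, PES)  # tile row and PE row inside the tile
        b1, e1 = divmod(j, PES)  # tile column and PE column inside the tile
        assert 0 <= b0 < TILES and 0 <= b1 < TILES, "output block exceeds the allocated array"
        return (b0, b1), (e0, e1)

    print(place(13, 22))   # -> ((3, 5), (1, 2))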

Combining Spatial and Temporal Computation
For a compute-intensive layer in DNNs, its computation pattern always exhibits a uniform dependence structure where every data dependence has a constant distance. This feature is appropriate for mapping the dependence structure onto a spatial architecture, which leads to local interconnection [7], [9]. Compared to SIMD or SIMT, a spatial architecture is particularly suitable for algorithms whose execution sequences exhibit producer-consumer relationships or can leverage efficient data reuse within a spatial region of computing units. Through a parallel and pipelined organization of relatively simple computing units in a spatial architecture, high parallelism is gained and local interconnection is employed to exploit data reuse potential. In our practice, the size of a mapped PE set is determined by the size of the weights and input maps for a given layer. A customized data-blocking strategy is required to map the PE set onto the physical PE array, which means combining a temporal-data-flow with the spatial architecture. From a temporal perspective, a customized data-blocking strategy with a dedicated dataflow is able to reconfigure the spatial architecture for mapping a given layer-computation pattern in DNNs. To some extent, finding an optimal temporal-data-flow that supports parallel computing with minimal data movement cost is crucial for reconfiguring the spatial architecture. In this study, a basic form of 2D systolic array is chosen as the ideal spatial architecture for mapping layer-computation patterns in DNNs. Interconnected PEs jointly compose the systolic array, which works rhythmically: at every time step, each PE reads inputs from some neighbors, performs computation, and forwards the inputs or partial sums to other neighbors. Parallel computing is driven by a determined temporal-data-flow with reconfigurable architecture parameters. The characteristic of near-neighbor interconnection is appropriate for reusing input data or passing partial sums within each layer, and thus reduces the high pressure of data transmission. While considering the layer-computation pattern as the basis to generate a systolic dataflow configuration, the optimization goal is to minimize the computation latency of the systolic array with balanced memory access. In our practice, combining temporal and spatial computation includes three aspects: the logic circuit design inside each PE, the topological structure design of the spatial architecture, and the temporal-data-flow configuration. In particular, the logic circuit inside each PE is designed to perform arithmetic computation with buffering registers. To avoid impacting latency at the PE level, a double buffering technique is adopted by adding double buffers inside each PE to pass data through neighboring PEs. For the other two aspects, the innovation behind our design is decoupling the temporal-data-flow configuration from the topological structure design of the spatial architecture. Specifically, an identical topological structure is reconfigured with different temporal-data-flow configurations to implement diverse layer-computation patterns. A tuple (x, y, t) is defined to represent specific execution behavior inside the spatial architecture with PE index and time step: it assigns the computation to spatial index (x, y) of the PE array at time step t. In our domain-specific compiler, the diversity of temporal-data-flow configurations is described by an affine IR with metadata.
While the affine IR is used to precisely schedule data for a systolic array, the metadata contains reconfigurable information related to the topological structure of the systolic array. In consequence, a given layer-computation pattern corresponds to a spatial-temporal mapping onto our proposed reconfigurable array architecture.
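To make the notion of a temporal-data-flow configuration concrete, the following sketch shows one hypothetical way such metadata could be organized; the field names and the configuration format are illustrative assumptions, not the actual affine IR or metadata layout used by our compiler.

    from dataclasses import dataclass, field
    from typing import Dict, Tuple

    @dataclass
    class DataflowConfig:
        dataflow: str                          # e.g. "output-stationary"
        tile_grid: Tuple[int, int]             # allocated tiles in the pipeline grid
        pe_per_tile: Tuple[int, int] = (4, 4)  # fixed 4x4 systolic tile
        # Dependence tuple (p, q, r) per operand: (spatial row, spatial col, time).
        deps: Dict[str, Tuple[int, int, int]] = field(default_factory=dict)

    cfg = DataflowConfig(
        dataflow="output-stationary",
        tile_grid=(8, 8),
        deps={"A": (1, 0, 0),   # A flows in from the left neighbor (from the text)
              "B": (0, 1, 0),   # B flowing from the top neighbor is our assumption
              "C": (0, 0, 1)},  # C stays in place, accumulated over time (from the text)
    )
    print(cfg)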

Proposed Reconfigurable Array Architecture
In this study, a reconfigurable array architecture is proposed to enable parametrization and customization through high-level meta-programming and system-level integration. The reconfigurable feature is indispensable for targeting a range of architectural parameters such as dataflow choice, data-blocking strategy, physical array size, bit-width precision, and more. As shown in Fig. 3, the basic element is the logic circuit inside each PE, which performs a MAC operation and, optionally, bit-shifting. 16 PEs are organized in a topological structure of a 4 × 4 systolic array to form a tile. A reconfigurable number of tiles are allocated within a pipeline grid to form the tiles array. The architecture is reconfigurable in supporting widely-used dataflows, customized array sizes, and different data blocking strategies, which can be fixed in advance or configured at run time. During runtime, the Input and Weight matrices are preloaded into the on-chip scratchpad banks from main DRAM memory. As configured, the preloaded data of the Input and Weight matrices feed directly into the tiles array, which writes the data of the Partial Sums matrix either back to the on-chip scratchpad banks or to other functional units. As an essential part, a dedicated functional unit (ALU) is developed to implement element-wise operators in hardware. The ALU works in the layer-by-layer pipeline to handle common operations such as Softmax, pooling, or other element-wise operations. In quantized DNNs, output activations are usually accumulated at a higher bit-width and must be scaled back down before being fed into the next layer; the array architecture therefore includes scalar units which are capable of scaling high bit-width values down to low bit-width. Consequently, the reconfigurable array architecture is driven by VLIW (very long instruction word) instructions which carry the determined information of operating order, precise memory addresses, scaling factors, and so on. After configuration, a settled version of the array architecture can be generally described as pushing or pulling operands for all the PEs at each time step. So in practice, a dependence tuple d = (p, q, r) is defined for each operand to serve the configuration needs, where (p, q) stands for the 2D spatial dependence and r stands for the temporal dependence. For example, an operand C that only depends on the value held by the same PE at the previous time step is defined as d_C = (0, 0, 1), while an operand A that receives data only from its left PE is defined as d_A = (1, 0, 0). During runtime, an operand stays inside a PE if its spatial dependence (p, q) is zero; otherwise it flows through PEs in a certain direction.
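The following sketch illustrates how such dependence tuples determine data movement for one cycle of a small 4 × 4 tile. It is an illustrative simulation only, assuming an output-stationary style in which A flows rightward per d_A = (1, 0, 0) and C stays in place per d_C = (0, 0, 1); the downward flow of B is our assumption for the example, not taken from the text.

    import numpy as np

    P = 4                                  # 4x4 PE tile
    A_reg = np.zeros((P, P))               # operand A register per PE
    B_reg = np.zeros((P, P))               # operand B register per PE
    C_reg = np.zeros((P, P))               # partial-sum register per PE

    def systolic_step(a_col_in, b_row_in):
        """Advance the tile by one cycle: compute, then shift per dependence tuple."""
        global A_reg, B_reg
        C_reg[:] += A_reg * B_reg                               # every PE does one MAC
        A_reg = np.hstack([a_col_in[:, None], A_reg[:, :-1]])   # A moves left -> right
        B_reg = np.vstack([b_row_in[None, :], B_reg[:-1, :]])   # B moves top -> bottom

    systolic_step(np.ones(P), np.ones(P))
    print(C_reg)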

Configuration and Management
Space-time transformation is an essential step to map diverse layer-level workloads onto a physical systolic array [10], and one of the basic restrictions is to preserve the original data dependence of the diverse workloads. In this study, the optimization result from our domain-specific compiler is the basis for performing the array-mapping process. During software compilation, affine transformation is able to find the optimal choice among the fully permuted nested loops of a layer-computation pattern. After affine transformation, statement instances of the space loops are mapped to spatial PEs, and the time loop is used to schedule the original iterations to run on these PEs. For the spatial PE (p, q) at each time step r, its temporal dependence is obtained as $D_r := \{[p, q, r] \to [p, q, r+1]\}$, and its data access dependences for the two operand matrices (A[p][r], B[r][q]) are obtained as $D_p := \{[p, q, r] \to [p+1, q, r]\}$ and $D_q := \{[p, q, r] \to [p, q+1, r]\}$, respectively, where $T := \{[p, q, r] \to [p, q, r]\}$ represents one possible space-time transformation with an identity mapping that preserves the original data dependence. Under the physical constraints of the systolic array architecture, space-time transformation produces a settled version of the array architecture. For instance, the dependence distance on the space loops is no greater than one unit, so data communication only happens between neighboring PEs. Horizontal and vertical interconnections between PEs are inserted to pass data according to the data access dependences $D_p$ and $D_q$.
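A minimal sketch of the feasibility rule implied above follows: under a candidate space-time transformation, each dependence must stay within one PE hop spatially, and a dependence that stays on the same PE must advance time by at least one step. The helper below is illustrative only, not the compiler's actual check.

    deps = {
        "temporal D_r": (0, 0, 1),   # [p,q,r] -> [p,q,r+1]
        "operand A D_p": (1, 0, 0),  # [p,q,r] -> [p+1,q,r]
        "operand B D_q": (0, 1, 0),  # [p,q,r] -> [p,q+1,r]
    }

    def feasible(dep):
        dp, dq, dr = dep
        neighbor_only = abs(dp) <= 1 and abs(dq) <= 1   # near-neighbor links only
        serialized = (dp, dq) != (0, 0) or dr >= 1      # same-PE nodes must differ in time
        return neighbor_only and serialized

    for name, d in deps.items():
        print(name, d, "->", "ok" if feasible(d) else "violates systolic constraints")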
Previous works [36], [37], [38] have shown that data movement and memory management can consume up to 40% of total runtime and 60% of energy consumption in a data-centric architecture. Since data transmission between off-chip and on-chip memories is expensive, memory management is very important for a deep learning processor. In this study, we propose to precisely schedule memory operations acting on addressable on-chip memories. A structural overview of performing dense computation with addressable on-chip memories is shown in Fig. 4. Dense computation is performed in the tiles array while input data and weight data have been preloaded into the on-chip scratchpads in advance. The addressable on-chip scratchpads are software-controlled with instructions, which are also prefetched into the instruction buffer. By using the architectural pattern of line-buffered pipelines, Darkroom [22] compiles an image processing program with all intermediate values stored in local line-buffered memories; how to optimally schedule the line-buffered pipelines is accordingly formulated as an ILP solution. In this study, we adopt this concept and schedule the line-buffered pipelines by flattening the two-dimensional data between pipeline stages into a continuous stream of one-dimensional data. Given a line buffer of fixed size $L$, the data access $f(x + c_1, y + c_2)$ is replaced with $f'(x' + c_1 + L \times c_2)$, where $x' = x + L \times y$ is the current data index in the data streaming program. Some functions of Darkroom are integrated into our domain-specific compiler to reduce expensive off-chip DRAM accesses. The addresses of the off-chip and on-chip memories for buffering the Input, Weight, and Partial Sums matrices are embedded within the sequence of VLIW instructions. At compile time, the domain-specific compiler is responsible for producing the instruction sequence, and the memory address space is reconfigurable with a dynamic size of on-chip memories.
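The address flattening above can be illustrated with a short sketch; the line length L = 640 and the 3 × 3 stencil are example values only.

    L = 640                      # line length (e.g. image width)

    def flat_index(x, y, c1=0, c2=0):
        x_prime = x + L * y      # current position in the 1D data stream
        return x_prime + c1 + L * c2

    # The 3x3 neighborhood around pixel (x, y) = (100, 7) as stream offsets:
    taps = [flat_index(100, 7, c1, c2) for c2 in (-1, 0, 1) for c1 in (-1, 0, 1)]
    print(taps)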

EXPERIMENTAL RESULTS
The full ReAAP is implemented on an embedded FPGA platform, a Xilinx Zynq UltraScale+ MPSoC ZCU102. Specifically, the systematic nested-loop optimization approach in our proposed domain-specific compiler is realized with the schedule primitives in TVM [13], which apply basic loop transformations incrementally. This compiler, realized in software, is responsible for abstracting parallelism and data reuse opportunities from layer-level workloads so as to guide reconfigurable computing in hardware. As illustrated in Fig. 5, the proposed software compiler with its JIT (Just-In-Time) runtime runs on the embedded ARM Cortex-A53 CPU, while the proposed reconfigurable array architecture is instantiated as the layer-computing coprocessor implemented in the programmable logic of the FPGA. For hardware implementation, we develop the layer-computing coprocessor using High-Level Synthesis (HLS) [39], which helps increase our productivity in building the proposed reconfigurable array architecture. By leveraging FPGA programmability with HLS, we carried out the implementation of the layer-computing coprocessor in less than three months. On-chip BRAMs and a BRAM controller are used together as addressable scratchpads for precisely buffering the input, weight, and output matrices. During runtime, the embedded CPU issues instructions to reconfigure the layer-computing coprocessor. Data transmission is established directly between off-chip DDR memory and on-chip BRAMs through independent DMA channels.

Fig. 4. Performing dense computation with addressable on-chip memories. The tiles array performs dense computation over the Input and Weight matrices, which are preloaded into on-chip scratchpads. Partial Sums are finally written into a register file. The addressable on-chip scratchpads are software-controlled with instructions. A local on-chip memory holds one line buffer of streaming data, and the memory-allocated size is dynamically determined by the data streaming size.

On the software side, to preserve the logical equivalence of the targeted algorithms, the decoupled compute-schedule principle from Halide [11] is adopted to denote a specific mapping from tensor-level computation to low-level instructions. It is important to support emerging computation patterns because the landscape of deep learning continues to evolve. So beyond supporting convolution layers [40], the extended compatibility of ReAAP is also evaluated by supporting other compute-intensive layers such as the separable convolution layers in MobileNet [41] and the self-attention layers in BERT [42]. As a whole software-hardware system, ReAAP can be deployed to accelerate various deep learning applications just by customizing a Python script in software for each application. Some application scenarios are shown in Fig. 6. ReAAP takes balanced consideration of both software flexibility and hardware parallelism to implement a full-stack integrated system.
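As a concrete illustration of applying TVM schedule primitives incrementally, the following generic sketch tiles a GEMM with split and reorder; it is only an example of primitive usage, not ReAAP's actual scheduling passes or code generation back-end.

    import tvm
    from tvm import te

    M, N, K = 512, 512, 512
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

    s = te.create_schedule(C.op)
    io, ii = s[C].split(C.op.axis[0], factor=4)   # block rows to a 4-wide PE tile
    jo, ji = s[C].split(C.op.axis[1], factor=4)   # block columns likewise
    ko, ki = s[C].split(k, factor=16)             # tile the reduction for data reuse
    s[C].reorder(io, jo, ko, ii, ji, ki)          # outer tile loops, then inner PE loops
    print(tvm.lower(s, [A, B, C], simple_mode=True))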

Domain-Specific Compilers Evaluation
We first evaluate the optimizing strategy in our domain-specific compiler. The evaluations are based on the convolution layers in ResNet-50, the separable convolution layers in MobileNetV3, and the self-attention layers in BERT-Base. Variable-size layers are included in the evaluation. Table 2 lists the detailed configurations for all the diverse layers in the three commonly used models. For a fair comparison with other domain-specific compilers, we benchmark DNN inference on two popular types of hardware back-ends: a server-class GPU (NVIDIA Titan X) and an edge-class CPU (ARM Cortex-A53). Fig. 7 shows the evaluation results on inference time in comparison with others. TensorFlow-XLA or TensorFlow-Lite [3], AutoTVM [16], and Ansor [17] are the baselines. Compared with all the best alternatives, our domain-specific compiler improves inference time on both the NVIDIA GPU and the ARM CPU. When benchmarking on the server-class GPU, Fig. 7a shows that our optimizing strategy outperforms the TensorFlow-XLA baseline with speedups ranging from 1.9× to 5.7×. When benchmarking on the edge-class CPU, Fig. 7b shows speedups ranging from 1.6× to 3.3× over the TensorFlow-Lite baseline. Compared to the search-based optimizing strategies in AutoTVM and Ansor, our optimizing strategy outperforms their solutions in all cases and does not need to run a large number of auto-tuning trials. In essence, the parallelism and data reuse optimization in our domain-specific compiler is based on precise data dependence analysis of nested loops. Hardware characteristics are considered within the automatic mapping process for hardware-aware code generation. The evaluation results demonstrate that precise data dependence analysis is critical for finding optimal solutions for parallel computing, and that specific hardware characteristics are desirable for better low-level code generation.

End-to-End Performance Evaluation
A well-integrated hardware and software stack would simultaneously provide efficiency gains from hardware specialization and workload compatibility from software reprogramming. To evaluate the optimizing efficiency of the software compiler and the hardware efficiency of the accelerator architecture together, we adopt active cycles [6] and a quantitative metric of PE utilization [8] as hardware performance counters. The PE utilization $m_i$ gives the average percentage of active PEs for the $i$-th layer as $m_i = \frac{\sum_{t=1}^{T} PE_{act}(t)}{T \times PE_{total}} \times 100\%$, where $T$ denotes the total computation cycles, $PE_{act}(t)$ is the number of active PEs at cycle $t$, and $PE_{total}$ is the total number of PEs allocated for accelerating the $i$-th layer's computation. Since our proposed array architecture is reconfigurable, $PE_{total}$ is a reconfigurable number calculated from the allocated tile count $N_0$ as $PE_{total} = 16 \times N_0$. Table 3 shows the end-to-end performance evaluation of ReAAP. The tile count is a reconfigurable parameter, and the percentage of PE utilization is calculated from the reconfigurable tile count and the array active cycles for each diverse layer. In this experiment, a fixed-size array architecture is implemented and its bitstream file is downloaded into the FPGA at initialization time. While performing the overall network inference, the JIT runtime of the proposed software compiler, running on the embedded CPU, decides how many tiles are required and dynamically allocates a specific part of the fixed-size array for accelerating each layer, adapting to each type (or size) of compute-intensive layer. As a comparison against this reconfigurable feature, the tile count is also fixed to a certain number (98 tiles for ResNet-50, 98 tiles for MobileNetV3-Small, 512 tiles for BERT-Base) and compared with the evaluation result obtained by the dynamically-allocated strategy, as shown in the last two columns of Table 3. Three commonly used DNNs (ResNet, MobileNet, and BERT) with their diverse layers are benchmarked respectively. Each tile contains 4 × 4 PEs, and the reconfigurable tile count is decided adaptively for each diverse layer during software compilation.
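The PE utilization metric defined above can be computed directly from a per-cycle trace of active PEs, as in the following sketch; the trace values are made-up examples.

    def pe_utilization(active_per_cycle, n_tiles):
        pe_total = 16 * n_tiles                    # each tile holds 4 x 4 = 16 PEs
        T = len(active_per_cycle)                  # total computation cycles
        return 100.0 * sum(active_per_cycle) / (T * pe_total)

    trace = [1568, 1568, 1520, 1344]               # active PEs sampled over 4 cycles
    print(f"{pe_utilization(trace, n_tiles=98):.1f}% PE utilization")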
Keeping all the parallel PEs busy is always challenging but worthwhile for a high-performance hardware accelerator. The reconfigurable feature and algorithm-oriented characteristic of our proposed ReAAP are crucial for adapting to diverse deep learning workloads. As shown in Table 3, PE utilization maintains a consistently high value for all the diverse layers. Since the total number of multiplications and additions is fixed for a certain DNN, a higher PE utilization $m_i$ means fewer processing cycles to complete its inference. As shown in the columns for the overall network, there is still room for further improvement in PE utilization by reducing the memory bandwidth bottleneck, which is a priority of our future work related to supporting flexibly quantized DNNs for memory efficiency. Benchmarked with three commonly used DNNs (ResNet-50, ResNet-18, MobileNetV3-Small), Fig. 8 shows a comparison of end-to-end performance, comparing ReAAP against a highly-optimized ARM CPU, a Mali GPU, and VTA-Ultra96 [43]. The ARM CPU and Mali GPU are respectively served by their industry-strength compilers: TensorFlow v1.7 and ARM Compute Library (ARMCL) v18.03. The VTA-Ultra96 is a fixed-pattern accelerator served by its AutoTVM compiler. For a fair comparison, all the evaluated DNNs use 8-bit integers rather than 32-bit floating point for inference. In particular, ReAAP outperforms (1) the TensorFlow + Cortex-A53 CPU by 8.6×, 7.3×, and 3.3×; (2) the ARMCL + Mali-T860MP4 GPU by 3.7×, 2.9×, and 2.3×; and (3) the AutoTVM + VTA-Ultra96 accelerator by 2.1×, 1.8×, and 1.4×, on ResNet-50, ResNet-18, and MobileNetV3-Small, respectively. As a whole system, ReAAP demonstrates that compiler-architecture co-design is an effective methodology to integrate software flexibility with hardware parallelism for accelerating diverse deep learning workloads.

CONCLUSION
In this paper, we propose an effective compiler-architecture co-design scheme targeting a reconfigurable and algorithm-oriented array processor: the ReAAP. The co-design scheme integrates software flexibility with hardware parallelism for accelerating diverse deep learning workloads. A systematic optimization approach with precise data dependence analysis has been proposed in our domain-specific compiler to abstract parallelism and data reuse opportunities from the targeted deep learning algorithms. A reconfigurable array architecture has been proposed and instantiated as the layer-computing coprocessor to accelerate diverse compute-intensive layers. The end-to-end performance evaluation of ReAAP demonstrates that our compiler-architecture co-design scheme drives the reconfigurable array architecture to a consistently high hardware resource utilization. In this version, we use 8-bit integers rather than 32-bit floating point to improve both memory and computation efficiency. Employing quantization techniques can further reduce the bit-width of weights and activation maps to make quantized DNNs more practical and efficient on resource-constrained platforms such as various IoT devices. Our future work includes extending our compiler-architecture co-design scheme to flexibly quantized DNNs.
Jianwei Zheng received the PhD degree in electronic and computer engineering from the Xiamen University of China, in 2020. From 2016 to 2018, he was studying as a Joint-Training PhD student with the University of Illinois at Urbana-Champaign, supervised by Professor Deming Chen. He is currently a postdoctoral researcher with the Hong Kong University of Science and Technology, supervised by Professor Kwang-Ting Cheng. His current research interests include deep learning algorithms, computer architecture, high-level synthesis, and hardware-software co-design.
Yu Liu received the BSc and MSc degrees in electrical engineering and nuclear technology from Tsinghua University, China, in 2012 and 2014, respectively, and the PhD degree in system engineering from the City University of Hong Kong, in 2017. He is now a senior researcher with the AI Chip Center for Emerging Smart Systems (ACCESS) in Hong Kong. His main interests include hardware-friendly AI algorithm optimization and efficient AI accelerator toolchain development.
Xuejiao Liu received the master's degrees in microelectronics and solid state electronics from Peking University, in 2009. She has been working on architecture research and development of dedicated ICs for intelligent vision acceleration. She is currently a principal researcher with AI Chip Center for Emerging Smart Systems (ACCESS) in Hong Kong. Her main interests are in the fields of application specific neural network compression and hardware architecture co-design technology and AI assisted IC design EDA technology.
Deming Chen (Fellow, IEEE) received the PhD degree from the Computer Science Department, UCLA, in 2005. He is the Abel Bliss professor of the Grainger College of Engineering, University of Illinois at Urbana-Champaign (UIUC). His current research interests include reconfigurable computing, hybrid cloud, system-level design methodologies, machine learning and acceleration, and hardware security. He has published more than 250 research papers, received ten best paper awards and one ACM/SIGDA TCFPGA Hall-of-Fame Paper Award, and given more than 140 invited talks. He is an ACM distinguished speaker and the editor-in-chief of the ACM Transactions on Reconfigurable Technology and Systems (TRETS). He is the director of the AMD-Xilinx Center of Excellence and the Hybrid-Cloud Thrust Co-Lead of the IBM-Illinois Discovery Accelerator Institute, UIUC.
Kwang-Ting Cheng (Fellow, IEEE) received the PhD degree in electrical engineering and computer sciences from the University of California at Berkeley, Berkeley, in 1988. In May 2016, he became the dean of engineering, concurrent with his appointment as chair professor jointly with the Department of Electronic and Computer Engineering (ECE) and the Department of Computer Science and Engineering (CSE). Before joining HKUST, he was a professor of electrical and computer engineering (ECE) with the University of California, Santa Barbara (UCSB), which he joined in 1993. Prior to joining UC Santa Barbara, he spent five years with AT&T Bell Laboratories. He is a world authority in the field of electronics testing and design verification, as well as an impactful contributor across a wide range of research areas including design automation of electronic and photonic systems, computer vision, and medical image analysis.