FPGA-Based Hardware Matrix Inversion Architecture Using Hybrid Piecewise Polynomial Approximation Systolic Cells

: The hardware of the matrix inversion architecture using QR decomposition with Givens Rotations (GR) and a back substitution (BS) block is required for many signal processing algorithms. However, the hardware of the GR algorithm requires the implementation of complex operations, such as the reciprocal square root (RSR), which is typically implemented using LookUp Table (LUT) and COordinate Rotation DIgital Computer (CORDICs), among others, conveying to either high-area consumption or low throughput. This paper introduces an Field-Programmable Gate Array (FPGA)-based full matrix inversion architecture using hybrid piecewise polynomial approximation systolic cells. In the design, a hybrid segmentation technique was incorporated for the implementation of piecewise polynomial systolic cells. This hybrid approach is composed by an external and internal segmentation, where the ﬁrst is nonuniform and the second is uniform, ﬁtting the curve shape of the complex functions achieving a better signal-quantization-to noise-ratio; furthermore, it improves the time performance and area resources. Experimental results reveal a well-balanced improvement in the design achieving high throughput and, hence, less resource utilization in comparison to state-of-the-art FPGA-based architectures. In our study, the proposed design achieves 7.51 Mega-Matrices per second for performing 4 × 4 matrix operations with a latency of 12 clock cycles; meanwhile, the hardware design requires only 1474 slice registers, 1458 LUTs in an FPGA Virtex-5 XC5VLX220T, and 1474 slice registers and 1378 LUTs when a FPGA Virtex-6 XC6VLX240T is used.


Introduction
Matrix inversion is one of the most useful operations used in many signal processing algorithms (SPA), where the efficient computation and accuracy of this operation are required. Important areas such as image recovery, phased-array radar and sonar, wireless communication, control applications, and others [1] require the efficient computations of the matrix inversion in real time, specifically in Hardware (HW) implementations of wireless communication for designing multiple-input multiple-output (MIMO) systems [2,3], in multiplicative noise generators using The QRD consists of finding a matrix R and, subsequently, of pre-multiplying the matrix A by several GR matrices, which are defined as G(i, j, θ), where the entries (i, i) and (j, j) of matrix G are equal to c = cos(θ), the entry (i, j) is equal to s = sin(θ), and the entry (j, i) is equal to −s. It is important to note that the parameters (i, j) are fixed according to the corresponding rows of matrix A for being rotated. On the other hand, θ is the degrees to be rotated. Therefore, the QRD procedure consists of finding the GR matrices G 1 , G 2 , · · · , G i such as G 1 T × G 2 T , · · · , G i T A is a triangular upper matrix. Since the GR matrices are orthogonal, the multiplication of multiple Givens matrices is also orthogonal; hence, G 1 T × G 2 T , · · · , G i T A = Q T A = R.
In order to carry out the QRD hardware implementation, consider the following nomenclature: R i,j is the element allocated in row i and column j of matrix R and R(α 1 : α 2 , β 1 : β 2 ) is a sub-matrix of R that contains all the elements corresponding to the rows from α 1 to α 2 and the columns from β 1 to β 2 , respectively. Matrix G r is defined as follows: where c 2 + s 2 = 1, and thus, the QRD can be implemented following Algorithm 1 as presented in p. 252 in Reference [23].

Back-Substitution and Matrix Inverse Procedures
Once the matrix R is computed using Algorithm 1, it is necessary to calculate R −1 via the back-substitution algorithm as RR −1 = I, where each of the matrices can be represented as follows: Thus, using Equation (2), it is possible to write a system of linear equations as follows: As can be seen, the first element of Equation (3) is computed directly as follows: Also, substituting the value of r inv j,j in the second row of Equation (3) obtains r inv m−1,m . This procedure is repeated until the entire system is solved in a back substitution fashion. Hence, once the inverse matrix R −1 is obtained, it is possible to compute Having analyzed the QRD Algorithm 1, it can be observed that the RSR operation is a crucial part of the computations of this algorithm. In this sense, a hybrid PPA technique is presented in the following subsection, which permits to improve the accuracy of the proposed hardware matrix inversion architecture.

Evaluation of RSR Using Hybrid PPA Technique
The PPA technique is efficiently applied to evaluate the RSR function. Compared with uniformly sized segments, the inclusion of a hybrid segmentation technique (conformed by an external and internal segmentation, where the first is nonuniform and the second is uniform) is capable of reducing the number of segments which significantly reduces the hardware resources [24].
It is remarked that this paper employs a hybrid segmentation, which is composed via the combination of the uniform and nonuniform segmentation techniques. First, the RSR function is divided by using a nonuniform segmentation technique (providing the external segments), and after that, each of such segments is divided by using a uniform segmentation (providing the internal segments).
The advantage of the hybrid segmentation over basic segmentation methods consists in the reduction of the employed segments while the signal-quantization-to-noise ratio remains. The proposed methodology is the following: In a first level, an external segmentation is applied to divide the arithmetic function in a specific interval with nonuniformity by the power of two. Then, in a second segmentation level, each segment is uniformly divided into a number of subintervals providing a better distribution in all curvature regions.
The following design constraints are necessary to be considered: (1) range reduction strategy, i.e., for a given function f (x) for a ≤ x < b, transforms into a new function H(z) = W( f (x)) for 0 ≤ z < 1, as proposed in Equation (4) in [16]; (2) segmentation method uses the hybrid PPA technique; and (3) bit-width optimization identifies the required number of bits of each fixed-point operand in the data path in order to guarantee a desired SQNR.
As an example, Figure 1 presents a hybrid segmentation of the RSR function. It can be noted that the external segmentation is nonuniform and carried out by dividing z by the power of two, i.e., the limits of the segments are located in z = {1/2, 1/4, 1/8, . . . }, which are represented in the plot of the figure as black boxes. Meanwhile, the internal segmentation is applied by dividing the external segments into uniform segments (limits plotted as red circles).
In the design process, an SQNR analysis is applied to determine the bit word-length as well as the number of internal and external segments in order to achieve the SQNR specification.

QRD Implementation Using Systolic Arrays
Once the QRD and back-substitution operations are analyzed based on the hybrid nonuniform segmentation, the PPA cells are integrated in a systolic array design.
The systolic design is implemented via advanced loop transformations, such as space-time mapping. This methodology improves the throughput while the hardware resources is reduced.
In a first stage, a parallelism and locality analysis of the QRD and back-substitution algorithms are determined with a Dependence Graph (DG) [25]. In this study, the DG is represented as Λ = [Ξ Υ], where Ξ represents a set of nodes and Υ represents a set of arcs (see Figure 2). Each arc ζ ∈ Υ connects the corresponding pair of nodes υ 1 , υ 2 ∈ Ξ, i.e, ζ = υ 1 ζ − → υ 2 , and a linearly bounded lattice denotes an index space of the form k ∈ Z l | Ak ≥ b defines an integral convex polyhedron or, in the case of boundedness, a polytope in Z l . Here, matrix M is considered to be square and of full rank. Then, each vector k is uniquely mapped into a new index point.

Therefore, a linear transformation
, is used as space-time mapping in order to assign a processor index p ∈ Z n−1 (space) and a sequencing index t ∈ Z (time) to index vectors I ∈ I, where Π ∈ Z (1×n) and Σ ∈ Z (n−1)×n represent the scheduling and allocation, respectively, that are essential for systolic implementations. The allocation and scheduling functions must satisfy the well-known causality constrain Σ · u ≥ 0, ∀(u i , u j ), where u is a projection vector of the dependence graph, guaranteeing that no more than two index points are simultaneously assigned to a processing element [26].

Architecture and Hardware Implementation
This section explains the FPGA-based architecture of the high-throughput matrix inversion implementation. The novelty of this design consists in rearranging the traditional 2D systolic array architecture of Figure 2, incorporating hybrid PPA-systolic cells instead of iterative or LUT methods.
The boundary and internal cells (Bs and Is, respectively) are designed to obtain R, and the back-substitution cells (Ss) compute the R −1 matrix through the back-substitution algorithm. This design improves the results of the previous study reported in Reference [27], incorporating in the design a hybrid segmentation in PPA-systolic cells and guarantying an accurate fixed-point evaluation by measuring SQNR, high-throughput design, and a resource area improvement in FPGA devices. The matrix inversion architecture adapts the back substitution, Givens Generation (GG), Givens Rotation (GR), and the QRD 2D systolic array with the whole matrix inversion system.
As previously developed by Reference [25], the target polytope model of QRD is represented as follows: where matrix A is the input data and y = [t p 1 p 2 ] T is the new space-time coordinate. (5) is computed using the Fourier Motzking method as follows: where N and M are the boundaries of the loop program of the Algorithm 1. In this study, the Fourier-Motzkin method is used to solve Equation (6) [26]. This method is a linear programming algorithm that permits to solve a system of inequalities. If the system has a feasible solution, it permits iteratively to eliminate the dependence of variables until it is solved for one variable; then, using back substitution, the rest of the variables are found. It is possible to think that this method is the equivalent of the Gaussian method but for a system of inequalities. The linear inequalities of the target polytope are solved by achieving a new time-processor index domain used to generate the scheduling control signals of the processing elements.
The architecture of the GG boundary block is developed with another systolic array design based on the hybrid PPA technique. This hardware architecture is shown in Figure 3, and it is implemented considering the following parameters: a word-length precision of 16-bit fixed-point operations for signed numbers in two complement format, 30 segments conformed by 3 internal segments for each 10 external segments, and d = 2 degree polynomials.
It is important to mention that a higher number of external segments (nonuniform segmentation) leads to a better representation of the RSR when z approach to zero; however, it is conveyed in an increment in the word length. On the other hand, an increment in the number of internal segments leads to a decrement of the error approximation of the function but with the penalty of increasing the number of polynomials to be considered and, consequently, an increment in usage of LUT, Digital Signal Processing (DSP) blocks, and latency. Thus, in this paper, the number of external and internal segments were selected through a trial-and-error process for assurance of at least 80 dB of SQNR while the number of LUT and DSP blocks is not increased significantly. The GG processor block contains the hybrid PPA-systolic cell for the RSR operation, as illustrated in Figure 4. The hardware of the hybrid PPA-systolic cell is implemented with a pipelined systolic design composed of a buffer memory block, which stores the coefficients of each 30 segments of the RSR curve labeled as Read Memory Only (ROM) A0, A1, and A2; a decoder, or coefficient detector, used to identify the interval where the input value belongs; and the fixed-point arithmetic operations of the systolic degree-2 PPA-cell, which evaluates the input value (considering a range reduction of [0, 1)) with the polynomial coefficients. This block computes the RSR operation in only one clock cycle.
A comparative approximation error of the GR hybrid piecewise polynomial approximation systolic cell is presented in Figure 5. The figure shows a maximum absolute error of 1.2 × 10 −3 in the RSR curve achieving an SQNR equal to 88.37 dBs and improving the 54 dBs reported in Reference [27] for the RSR block. This experiment compares the accuracy of the RSR curve between the continuous floating-point Matlab function (labeled as floating-point approximation, black line) and the proposed fixed-point systolic array architecture (labeled as fixed-point approximation, dotted gray line). As can be seen, the curves overlap perfectly thanks to the SQNR achieved; however, the maximum error and SQNR can be improved if the number of segments in the PPA-systolic cell is increased [19].   In this regard, the output of the GG processing block is synchronized to generate the rotation angle transmitted along the GR processing elements of the general 2D QRD systolic array architecture. All values of triangular matrix R are computed, and matrix Q T is also updated with new rotated elements (Bs). The last step is the back-substitution processing nodes (Ss).
Matrix R −1 is obtained by using the back-substitution cells as illustrated in Figure 2, where the algorithm operations by column are associated to Equations (3) and (4). The architecture proposed for this important block is shown in Figure 6. The variable r inv i,m in r ii * r inv i,m is computed by dividing by r ii . Note that the 1/r ii element is obtained in the GG block of the QR architecture, as shown in Figure 3. After that, the results are passed to the next processing elements in order to implement R −1 . (15,9) Q (15,9) Q (15,9) Q (15,9) Q (15,9) X Q (15,9)  Q (15,9) Q (15,9) Q (15,9) Q (15,9) X Q (15,9) Q(30,18)

Implementation Results and Performance
The results and performance analysis of the matrix inversion architecture are presented in this section. The proposed approach is designed using Verilog-HDL and synthesized using the Integrated Software Environment (ISE TM ) WebPACK TM 14.7 of the Xilinx XST tool. It is important to note that our architecture computes not only a QR decomposition but also a complete matrix inversion. Table 1 summarizes the hardware (HW) implementation results on contemporary Virtex-5 XC5VLX220T and Virtex-6 XC6VLX240T FPGAs, analyzing the scalability of the proposed architecture in terms of hardware resources for different matrix sizes. As can be seen, the number of slice registers and slice LUTs are balanced with the DSP48E resources in order to allocate the designed architecture in the FPGA. According to References [28,29], a FPGA-based hardware matrix inversion can be implemented using only slice registers or with DSP blocks. However, the design using only slice registers requires more than three times the number of resources than the one utilizing both FPGA slice registers and DSP blocks [28]. Also, the area and speed penalty of using only slice registers to implement the design instead of DSP blocks is high. Hence, such dedicated FPGA resources are used for area-efficient and high-performance designs. In this study, an area and accuracy trade-off is explored in the FPGA design by considering the "Optimize Instantiated Primitives Synthesis" option in the Xilinx tool (balanced mode). On the other hand, the resulting SQNR of the complete matrix invertion architecture is 58 dB, configured with a data precisions of 16 bits. This SQNR performance could be improved if the number of bits in the word length of the architecture is increased. Likewise, the total execution time for performing the matrix inversion operation of a 4 × 4 matrix using the proposed architecture is equal to 284 ns, with a latency of 171.4 ns (12 clock cycles, 7 cycles for computing the QRD and 5 cycles for computing the back-substitution algorithm) and a throughput equal to 5.83 Mega-Matrices per second (MM/s). Likewise, a maximum clock frequency of 70 MHz (clock period of 14.2 ns) is achieved when the proposed architecture is synthesized using Virtex-5. For the case of a Virtex-6 FPGA, the maximum clock frequency achieved is 90 MHz (11.1 ns of clock period) with a latency of 133 ns and a throughput equal to 7.51 MM/s.
In order to make a fair comparison of the performance of our proposal with respect to state-of-the-art QRD and matrix inversion architectures, a 4 × 4 matrix dimension is considered using a 16-bit word length precision. Table 2 shows the comparative analysis results. As can be seen, the proposed matrix inversion design based on PPA systolic cells takes less hardware resources in slice registers and LUTs with respect to state-of-the-art architectures: 1474 slice registers and 1458 LUTs when a Virtex-5 FPGA is used and 1474 slice registers and 1378 LUTs when a Virtex-6 FPGA is considered. Also, in Table 2 can be observed the well-balanced performance of the latency parameter when it is compared with the state-of-the-art architectures, e.g., the proposed matrix inversion architecture based on PPA systolic cells is 5 times faster than the CORDIC-based architecture reported in Reference [8] and it is close to up to two times faster than the CORDIC-based architecture reported in Reference [9] considering a 90 MHz clock frequency. Likewise, the reached SQNR allows to implement the architecture with only 16-bit word length precision and to decrease the area resources as slice register, slice LUTs, and DSP48 elements, which improve the resource utilization reported in Reference [27].
On the other hand, it is important to note that the whole execution time of the presented approach only takes 12 clock cycles for computing the 4 × 4 matrix inversion. This characteristic makes the proposed architecture based on hybrid PPA techniques small in latency when it is compared with other designs based on CORDICs and LUTs.

Evaluation of the Matrix Inversion Design
In this subsection, the matrix inversion is evaluated using the hardware/software codesign technique. In the experiment, a matrix inversion architecture is integrated as a coprocessor with a Cortex-A9 ARM processor of 650 Mhz in the ZYBO ZYNQ XC7Z010-1CLG400C platform. Figure 7 illustrates the hardware/software codesign approach of the ARM processor and our matrix inversion architecture. Considering a 4 × 4 matrix as input of the hardware/software (HW/SW) codesign, the ARM processor transmits the matrix data to the coprocessor and receives the solution of the matrix inversion. In this analysis, the time performance was carried out as follows: where t inv is the total time for the matrix inversion operation, t AXI is the time intercommunication between the ARM and FPGA device, t QRD is the time processing of the QRD using PPA systolic cells, and t BS is the processing time of the back-substitution implementation. The resulting performance for the implementation is t AXI equal to 0.3369 ms, t QRD = 0.15 ms, and t BS = 0.17 ms. Therefore, the total time of the parallel matrix inversion is 0.6569 ms. In order to compare this result, the sequential implementation of the same operation, with only the ARM processor, was employed achieving 5.0492 ms, which represents a speed up of 7.6864.

Performance Analysis Discussion
As mentioned in Section 1, hardware implementations based on CORDIC strategies need to carry out several algorithm iterations for converging to the desired results in order to improve the result accuracy; however, these iterations impact in latency metric (cycles and time), as can be seen in Table 2, even when high frequency operations are achieved by the state-of-the-art QRD hardwares. Likewise, the CORDIC-based architectures increase the number of slice registers and slice LUTs. On the other hand, our proposed architecture decreases the number of slice registers, slice LUTs, and latency cycles in the design. It only takes a latency of 7 clock cycles for carrying out the QRD and 5 clock cycles for computing the back-substitution algorithms, which is equivalent to 100 ns and 71.4 ns for a frequency operation of 70 MHz and 77.7 ns and 55.5 ns for a frequency operation of 90 MHz, respectively. As can be observed in Table 2, our proposed matrix inversion architecture is the fastest architecture, which improves the time performances reported in the open literature. However, these outstanding advantages are paid by increasing the number of DSP48 elements in the architecture, which can be balanced by implementing them via slice registers and slice LUTs along the synthesis process or by using other design strategies in order to reduce the number of DSP48 elements [30]. Finally, the proposed matrix inversion architecture in this paper offers a well-balanced improvement in the design, achieving high throughput and less hardware resource utilization in comparison of state-of-the-art FPGA-based architectures.

Conclusions
In this paper, a new design of a hardware matrix inversion architecture was presented. The proposal is based on the QR matrix decomposition and the back-substitution algorithm, which are implemented using a systolic hardware architecture that requires the RSR operation computed by a hybrid piecewise polynomial approximation. The results show that the hardware resources and the time execution are significantly reduced in comparison with the proposals that employ CORDICs and LUTs. In particular, the experimental results for Virtex-6 technology pointed out that the proposal has the capacity to compute 7.51 MM/s when 4 × 4 matrices are processed using 12 clock cycles for the complete matrix inversion operation. Moreover, the SQNR achieved is 58 dB with 16 bit word length. Under this perspective, the authors consider that this proposal of a fast hardware architecture for the matrix inversion can be used as a coprocessor in a hardware and software codesign paradigm for many applications like receivers in MIMO space-time communications and matrix signal processing applications, where efficient and real-time computations are required.