FPGA-based Low Latency Inverse QRD Architecture for Adaptive Beamforming in Phased Array Radars

The main objective of this paper is to facilitate adaptive beamforming, which is one of the most challenging tasks in phased array radar receivers. Recursive least squares (RLS) is considered the best-suited adaptive algorithm for applications where beamforming is mandatory, because of its good numerical properties, high convergence rate and low misadjustment. In this paper, some RLS variants are discussed and the most numerically suitable algorithm, inverse QR decomposition (IQRD), is selected for efficient adaptive beamforming. A novel architecture for IQRD RLS is also presented, which offers low latency and low area occupation for Field Programmable Gate Array (FPGA) implementation. This approach reduces the computations by utilizing the standard pipelining methodology. Efficient adders and multipliers, together with a Look Up Table (LUT) based solution for square root and division, greatly enhance the performance of the algorithm. The proposed IQRD RLS architecture is coded in Verilog and its performance is analyzed in terms of throughput, hardware resources and efficiency.


Introduction
Radar signal processing is an active area of research due to its wide range of applications in radar systems, such as target tracking and navigation. Most radar systems are composed of multiple transmitting and receiving phased array antennas and are known as phased array radars. These phased array antennas are designed to transmit and receive spatially propagating signals, to focus on the arrival direction of the desired signals and to suppress the interfering signals [1], [2]. Beamforming is an important signal processing technique in phased array radars for directional signal transmission or reception [2]. Figure 1 shows a generalized block diagram of a four-channel adaptive beamformer receiver with a uniform linear array (ULA) structure and an adaptive weight calculation (AWC) unit. In ULA phased array radars, all the antenna elements are arranged along a line with uniform spacing s, in order to attenuate interference arriving from different directions along with the desired signal [3]. In many cases the AWC unit is based on an overdetermined system of equations, and various algorithms have been developed to solve these equations. Such a system is also known as an adaptive antenna system. The AWC is usually based on gradient-based adaptive algorithms such as Least Mean Square (LMS), Normalized Least Mean Square (NLMS), Recursive Least Square (RLS) and the Kalman filter, assuming that all the input signals to the ULA system are uncorrelated with each other [3], [4]. RLS is considered the most commonly used of these algorithms because of its fast convergence rate and good numerical properties, but it involves complex mathematical operations such as matrix inversion, which limits its application in real-time systems [5]. In numerical methods, the well-known QR decomposition (QRD) is used to decompose a matrix X into an orthogonal matrix Q ($QQ^T = I$) and an upper triangular matrix R ($X = QR$). To overcome the computational complexity, various QR decomposition based RLS variants (QRD, Inverse QRD, Fast QRD) have been developed. They are widely used in a variety of applications and lead to more efficient architectures, stability and accuracy [5], [6]. In this paper, a few RLS variants are compared and IQRD RLS is selected as the least computationally expensive algorithm for hardware implementation.
The primary goal of this paper is to propose and develop a high-throughput architecture for the selected RLS variant by minimizing its latency and device area occupation for the adaptive beamforming application. The speed of the beamformer is limited by the recursive operation of the AWC unit; therefore, minimizing latency is a critical issue [4], [7]. In this paper, a novel architecture is proposed for IQRD RLS that reduces the computational complexity by using the standard pipelining methodology. It is mapped onto a Virtex 5 FPGA because this device provides powerful computational architecture features for floating-point arithmetic.
The paper is organized as follows. A brief introduction to adaptive beamforming is presented in Sec. 2. In Sec. 3, a thorough study is performed to select IQRD as the most suitable RLS variant. Section 4 describes the conventional systolic array architecture and the proposed new architecture based on low latency modules. Finally, in Sec. 5, the AWC unit implementation on a Virtex 5 FPGA is discussed to confirm the advantages of the new structure.

Adaptive Beamforming
The core task of an adaptive beamformer receiver is to detect and estimate the target signal at the input of the adaptive antenna array. Such systems usually consist of an antenna array along with an adaptive processor (AWC) that adjusts its weights in real time, such that the main lobe of the radar is focused on the arrival direction of the target signal and undesired signals are suppressed [3], [6].
The output of the beamformer in Fig. 1 is related to the desired signal as

$$e = d - z, \qquad (1)$$

where z is the vector of the estimated signal, given as

$$z = X^T w, \qquad (2)$$

where w (m × 1) is the weight vector, X (m × N, m > N) is the input data matrix, d (N × 1) is the desired training signal and e (N × 1) is the additive noise error vector. The primary goal of the adaptive beamformer is to minimize the mean square error, such that z(k) ≈ d(k), and to find the optimized values of the unknown weight vector w. This corresponds to minimizing the exponentially weighted cost

$$\xi(k) = \sum_{i=1}^{k} \lambda^{k-i} \, |e(i)|^2, \qquad (3)$$

where λ is a forgetting factor in conventional RLS [4].
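As a minimal sketch of the quantities defined above, the following Python snippet evaluates the beamformer output, the error vector and an exponentially weighted squared-error cost for real-valued data. The relations z = X^T w and e = d - z, the function names and the toy numbers are assumptions made for illustration; complex data would additionally need conjugation.

```python
def beamformer_output(X, w):
    """Estimated signal z (length N) for an m x N data matrix X and m weights w."""
    m, n = len(X), len(X[0])
    return [sum(X[i][k] * w[i] for i in range(m)) for k in range(n)]


def weighted_cost(d, z, lam):
    """Exponentially weighted sum of squared errors; the newest sample has weight 1."""
    n = len(d)
    return sum(lam ** (n - 1 - k) * (d[k] - z[k]) ** 2 for k in range(n))


# Toy example (hypothetical numbers): 2 antenna elements, 3 snapshots.
X = [[1.0, 2.0, 3.0],
     [0.0, 1.0, 0.0]]
w = [2.0, 1.0]
d = [2.0, 4.0, 6.0]

z = beamformer_output(X, w)
cost = weighted_cost(d, z, lam=0.5)
```

A forgetting factor below 1 discounts older snapshots, so only recent errors dominate the cost that the weight update tries to reduce.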

RLS Variants for Adaptive Beamforming
In this section, a detailed study of RLS variants is performed to select the most efficient one for the AWC unit of the adaptive beamformer. First, a brief introduction to QR decomposition based on the Givens rotation (GR) method is given. A GR matrix

$$G = \begin{bmatrix} c & s \\ -s & c \end{bmatrix}$$

is used to premultiply a two-row matrix A and transform it into an upper triangular matrix. The rotation parameters c and s are chosen such that the first element of the second row of the product GA becomes zero.

In [7], a numerically stable RLS variant, QRD RLS, is described that avoids the complex matrix inversion operation. Moreover, it can be implemented as a systolic array (pipelined) architecture. Hence, many researchers consider QRD RLS the most suitable adaptive algorithm for hardware implementation [8][9][10]. QRD decomposes the input data matrix X through an orthogonal triangularization approach such as Givens rotation (GR), Householder transformation or Gram-Schmidt orthogonalization [11]:

$$X = QR. \qquad (9)$$

In most cases GR is considered the most numerically well-conditioned method. Q is the unitary matrix ($QQ^T = I$) generated as a sequence of Givens rotation matrices [11], and R is the upper triangular matrix, which is also the Cholesky factor of the autocorrelation matrix $X^T X$ [11]. Substituting (9) into the error expression and letting $e' = Qe$ and $d' = Qd$, the error can be written as

$$e' = d' - Rw. \qquad (12)$$

From (12), the weight equation can be written as

$$w = R^{-1} d'. \qquad (13)$$

This equation is the solution of the least squares problem and is obtained by a back substitution process. Hence, the QRD RLS algorithm is composed of two steps. In step 1, the QR decomposition (9) of the data matrix X is performed, which has a systolic realization. In step 2, the back substitution process for the weight calculation (13) begins, as shown in Fig. 2.
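The two steps described above (triangularization by Givens rotations, then back substitution) can be sketched in plain Python for real-valued data. This is an illustrative batch version under stated assumptions, not the recursive systolic implementation: the data matrix is stored with one snapshot per row, and the names `givens` and `qr_least_squares` are hypothetical.

```python
import math


def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] applied to (a, b) gives (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0
    return a / r, b / r


def qr_least_squares(X, d):
    """Solve min ||d - X w|| for X with N rows (snapshots) and m columns, N >= m."""
    n_rows, m = len(X), len(X[0])
    # Augment X with d so that d is rotated together with the data.
    A = [list(row) + [d_k] for row, d_k in zip(X, d)]
    # Step 1: annihilate everything below the diagonal, column by column.
    for col in range(m):
        for row in range(col + 1, n_rows):
            c, s = givens(A[col][col], A[row][col])
            for j in range(col, m + 1):
                top, bot = A[col][j], A[row][j]
                A[col][j] = c * top + s * bot
                A[row][j] = -s * top + c * bot
    # Step 2: back substitution on the upper triangular system R w = d'.
    w = [0.0] * m
    for i in reversed(range(m)):
        acc = A[i][m] - sum(A[i][j] * w[j] for j in range(i + 1, m))
        w[i] = acc / A[i][i]
    return w


# Hypothetical consistent system whose exact solution is w = [1, 2].
w = qr_least_squares([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [1.0, 3.0, 5.0])
```

The back substitution loop is inherently sequential, which is the step that the inverse QRD variant is designed to avoid.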
IQRD RLS is another important variant of RLS, proposed by Alexander and Ghirnikar [12] to avoid the numerically tedious back substitution process for weight calculation. The core idea is to update the inverse of the matrix R in (13) in each iteration of the systolic array, so that the filter weights are computed within the systolic array itself. In [11], a new weight update equation is derived, which is given as

$$w(k) = w(k-1) + R^{-1}(k) R^{-T}(k) x(k) \, e(k),$$

where e(k) is the a priori estimation error and the term $R^{-1}(k)R^{-T}(k)x(k)$ is known as the Kalman gain. The systolic realization of IQRD RLS is discussed in Sec. 4. Hence, IQRD RLS is less computationally expensive than QRD RLS.
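A minimal real-valued sketch of one inverse QRD RLS iteration is given below. It keeps the lower triangular matrix S = R^{-T}, applies a sequence of Givens rotations that annihilate the vector a = λ^{-1/2} S x, and reads the Kalman gain from the extra rotated row, so no back substitution is required. The function name, the initialization constant `delta` and the test signal are assumptions for illustration; the paper's architecture processes complex data in a systolic array.

```python
import math
import random


def iqrd_rls_step(S, w, x, d, lam=1.0):
    """One real-valued IQRD-RLS update; S (= R^{-T}, lower triangular) and w change in place.

    Returns the a priori error e = d - w^T x.
    """
    n = len(w)
    scale = 1.0 / math.sqrt(lam)
    # a = lam^{-1/2} * S * x, using the lower triangular structure of S.
    a = [scale * sum(S[i][j] * x[j] for j in range(i + 1)) for i in range(n)]
    e = d - sum(wi * xi for wi, xi in zip(w, x))

    u = [0.0] * n   # extra row; ends up holding the scaled (negated) Kalman gain
    pivot = 1.0     # last entry of the vector [-a; 1]
    for i in range(n):
        # Givens rotation that zeroes -a[i] against the pivot.
        r = math.hypot(a[i], pivot)
        c, s = pivot / r, -a[i] / r
        pivot = r
        for j in range(n):
            row_ij = scale * S[i][j]
            S[i][j] = c * row_ij - s * u[j]
            u[j] = s * row_ij + c * u[j]

    # The Kalman gain equals -u / pivot, so the weights are updated directly.
    for j in range(n):
        w[j] -= (u[j] / pivot) * e
    return e


# Hypothetical noise-free identification of a 3-tap weight vector.
rng = random.Random(0)
w_true = [0.5, -1.0, 2.0]
n = len(w_true)
delta = 1e-6  # assumed regularization of the initial correlation matrix
S = [[1.0 / math.sqrt(delta) if i == j else 0.0 for j in range(n)] for i in range(n)]
w = [0.0] * n
for _ in range(200):
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    d = sum(wt * xi for wt, xi in zip(w_true, x))
    iqrd_rls_step(S, w, x, d)
```

Each row of S is touched by exactly one rotation per sample, which is what makes the update map naturally onto a row of processing cells.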
In Fast QRD (FQRD) RLS, the matrix update equation is replaced by vector update equations. Detailed derivations of these equations are given in [11]; they further reduce the numerical complexity compared with the QRD RLS and IQRD RLS algorithms. Although FQRD RLS is numerically less complex, it has no weight update procedure, so it is limited to applications that only estimate the output error vector [13].
In this section, three important RLS variants and their properties have been discussed; they are summarized in Tab. 1. It is worth mentioning that QRD and IQRD are computationally more complex than FQRD, but FQRD has no weight update procedure, so it cannot be considered for the adaptive beamforming application. On the other hand, QRD involves more computations (a two-step procedure), so it is also not preferable. For these reasons, IQRD RLS is the best-suited algorithm for hardware implementation.

Performance Analysis of Adaptive Beamforming in Terms of Radiation Pattern
The performance of adaptive beamforming using the RLS adaptive algorithm is analyzed by varying different beamformer parameters. Researchers have analyzed adaptive beamformers for smart antennas based on the Least Mean Squares (LMS), Sample Matrix Inversion (SMI), Recursive Least Squares (RLS) and Conjugate Gradient Method (CGM) adaptive algorithms. That analysis is done by varying the inter-element spacing and the number of antenna elements for each algorithm and studying the radiation patterns, amplitude response, mean square error and absolute weights of the adaptive beamforming algorithms [14], [15]. We assume a ULA beamformer with an operating frequency of 1 GHz. It receives an echo signal from a single target at an azimuth of 30° and an interference signal at -45°. Simulations show that the RLS adaptive algorithm rejects the interference at -45° and enhances the target signal at 30°. The following procedures are adopted for further analysis of the RLS based adaptive beamformer:
• Change the number of antenna elements.
• Change the inter-element spacing of the antennas.

For the second procedure, the performance is judged in terms of the inter-element spacing of the antennas. The selected spacings are 0.5, 1.5 and 2.5 mm. Figure 4 shows the resulting radiation pattern, in which the performance of the adaptive beamformer improves as the distance between the antenna elements is increased. The side lobes are also suppressed much better.
Hence, the RLS based adaptive beamformer achieves good performance and accuracy as the number of antenna elements is increased and as the distance between the antenna elements is increased.
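The effect of the number of elements on the main lobe can be illustrated with the normalized array factor of a ULA. The sketch below uses fixed (non-adaptive) steering towards the target angle rather than RLS weights, and it assumes a spacing expressed in wavelengths; the function name and the half-wavelength spacing are illustrative choices, not values from the simulations above.

```python
import cmath
import math


def array_factor(theta_deg, n_elements, spacing_wl, steer_deg):
    """Normalized |AF| of a ULA steered to steer_deg; spacing given in wavelengths."""
    psi = 2.0 * math.pi * spacing_wl * (
        math.sin(math.radians(theta_deg)) - math.sin(math.radians(steer_deg))
    )
    total = sum(cmath.exp(1j * psi * n) for n in range(n_elements))
    return abs(total) / n_elements


# Response 5 degrees away from a 30 degree target for 10, 15 and 20 elements.
for n_elements in (10, 15, 20):
    print(n_elements, round(array_factor(35.0, n_elements, 0.5, 30.0), 3))
```

The response slightly off the target angle drops as elements are added, which is the main lobe narrowing described for the radiation patterns.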

IQRD Systolic Array Architecture
Systolic arrays are a well-known concept for matrix triangularization and inversion [16], [17], and the same idea is adopted for the IQRD-RLS algorithm. Figure 5 shows the conventional systolic realization of the IQRD-RLS algorithm. The IQRD systolic triangular array consists of four types of processing cells: Boundary Cell (BC), Internal Cell (IC), Inverse Internal Cell (IIC) and Final Cell (FC).
A Boundary Cell determines the rotation parameters cos α and sin α needed to rotate the complex input data rows x of the data matrix X, and updates the corresponding row values of the upper triangular matrix R. It also sends the rotation parameters to the neighboring Internal Cells on its right [16], [18]. Equations (6), (7) and (8) are modified accordingly for the BC. The BC is more complex than the IC, IIC and FC because it performs square-root and inverse operations. All the other cells perform only multiplication and inverse operations.
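As an illustration of why the BC needs a square root and a reciprocal, the following sketch implements a textbook QRD-RLS boundary cell for a real stored value r and a complex input x. It is an assumption-based example, not the paper's modified equations: the function name and the convention of a real cosine with a complex sine are chosen for illustration.

```python
import math


def boundary_cell(r, x, lam=1.0):
    """Return (r_new, c, s) for a stored real r >= 0 and a complex input sample x."""
    scaled = math.sqrt(lam) * r
    r_new = math.sqrt(scaled * scaled + abs(x) ** 2)  # square root
    if r_new == 0.0:
        return 0.0, 1.0, 0j
    inv = 1.0 / r_new                                 # reciprocal (division)
    c = scaled * inv   # real cosine
    s = x * inv        # complex sine, s = s_r + j*s_i
    return r_new, c, s


r_new, c, s = boundary_cell(3.0, 4j)
```

The square root and the reciprocal are the only non-multiplicative operations in the cell, so they are the natural candidates for a table lookup.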
Researchers have done a lot of work to find a square-root and division free BC unit for systolic arrays. In [19], data transformations are applied to obtain square-root and division free expressions for the transformed data. Although the number of multiplications increases due to these transformations, it is a good approach to improve the system latency and reduce the number of computations. In [8], a single cycle look-up table (LUT) based approach is used to replace the square-root and inverse operations and to optimize the QRD RLS latency. This approach considerably improves the overall performance of the QRD RLS algorithm.
An Internal Cell (IC) receives the rotation parameters cos α and sin α from the Boundary Cell (BC) in its row, performs the rotations on the input values of the data matrix X and updates the x and r values; the corresponding update equations are given in [16], [17].

Inverse Internal Cells receive the sine and cosine values of the rotation angle from the internal cell in the same row. Part of the inverse internal cell operation is similar to that of the internal cell, i.e. it rotates the input data by multiplying it with the input rotation parameters. Along with this operation, the inverse internal nodes of IQRD-RLS use $\lambda^{-1/2}$ in (12) and (13). In this paper, we consider the case λ = 1, so that the IIC units are equivalent to the IC units.
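The rotation performed by an internal cell can be sketched as follows. The conjugation convention matches a rotation with a real cosine c and a complex sine s that annihilates the boundary input; both the convention and the function name are assumptions for illustration.

```python
def internal_cell(r, x_in, c, s, lam=1.0):
    """Rotate the stored value r and the input x_in by (c, s); return (r_new, x_out)."""
    scaled = (lam ** 0.5) * r
    r_new = c * scaled + s.conjugate() * x_in  # updated stored value
    x_out = c * x_in - s * scaled              # value passed to the next row
    return r_new, x_out


# Rotation parameters c = 0.6, s = 0.8j (hypothetical values with c^2 + |s|^2 = 1).
r_new, x_out = internal_cell(1 + 2j, 3 - 1j, 0.6, 0.8j)
```

Only multiplications and additions appear here, and for λ = 1 the rotation is unitary, so the combined energy of the stored value and the input is preserved.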
The Final Cell (FC) performs two functions: it receives the rotation parameters cos α and sin α to rotate the input data, and it generates the updated filter weights using the weight update relation of [10], in which the factor γ is computed by a separate relation.

A design methodology is proposed to optimize the throughput and latency of the IQRD-RLS architecture. Figure 6 presents a block diagram of the proposed architecture, which utilizes the standard pipelining methodology and achieves a 50% reduction in the number of BC, IC and IIC units (Tab. 2). We have further optimized the IQRD RLS architecture by replacing the square-root and division operations in the BC with a single cycle look-up table (LUT), as shown in Fig. 7. Generally, a LUT is a table that determines the output for any given input; in an FPGA, such truth tables are used to implement any digital logic. In [8], the same methodology is adopted for the QRD RLS algorithm. This scheme avoids a complex hardware implementation, considerably decreases the latency and improves the speed of the overall system. In some previous work, the conventional implementation of the BC is based on the CORDIC algorithm for non-recursive applications [20], [21]. However, digital beamforming is a recursive procedure, so the conventional approach does not pay off because it leads to considerable latency. In the proposed architecture, pipelined registers (R_reg, x_reg, Rinv_reg) are introduced, which not only hold the updated values but also feed them back to the BC, IC and IIC cells. The controller CON handles the control signals as well as the feedback timing. Equations (15) and (16) are modified accordingly.

The IC unit receives the $c$, $s_i$ and $s_r$ values from the neighboring BC to compute (19) and (20) and to update the values of $x_{ni}$, $x_{nr}$ and $r_n$. These computations require only adder and multiplier units, as shown in Fig. 8. The updated values $x_{ni}$ and $x_{nr}$ are stored in the pipelined register x_reg and are also fed back to the IC module, while the updated $r_n$ is stored in the internal memory of the IC. The $c$, $s_i$ and $s_r$ values are also sent to the neighboring IC (Fig. 8).
One of the major differences between the LUT implemented in [8] and the one in this paper is the floating-point representation format. In [8], the latency optimization of the LUT is discussed for a defined N-bit fixed-point number format, comprising 1 sign bit, i integer bits (i > 1) and f fraction bits (f > 1). However, the floating-point multiplier and adder are not discussed for the newly defined number format.
In this paper, we utilize the same LUT approach for the IQRD RLS architecture with the single-precision IEEE 754 format because of its wider range compared with fixed point. Single-precision numbers are stored in 32 bits: 1 for the sign, 8 for the exponent and 23 for the fraction [22]. In the proposed architecture, the computation of $A_i$ and $A_{ss}$ is handled in a single LUT, where $A_s$ is the 32-bit input address field of the LUT, as shown in Fig. 7. Simulation results show that 97% of the $A_s$ values are greater than 1 (Fig. 9a) when all input variables in (23) are uniformly distributed within the range [-4, 4], and the corresponding output values ($A_i$) are very small (Fig. 9b). Hence, the probability of a value less than 1 occurring is very low, which reduces the range of input values $A_s$ for the LUT. The LUT output $A_i$ is 32 bits wide and is used to compute the $c$, $s_i$ and $s_r$ values for the IC units.
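To make the single-precision layout and the table lookup concrete, the sketch below splits a value into its IEEE 754 sign, exponent and fraction fields and approximates a reciprocal square root with one table read. Several things are assumptions for illustration only: that the LUT output is 1/sqrt of its input, that the table is addressed by the exponent parity and the 8 leading fraction bits (`FRAC_BITS`), and all function names. The actual LUT in Fig. 7 may be organized differently.

```python
import math
import struct

FRAC_BITS = 8  # number of leading fraction bits used in the LUT address (assumed)


def float_fields(x):
    """Split x, rounded to IEEE 754 single precision, into (sign, exponent, fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def build_rsqrt_table():
    """Table of 1/sqrt(m * 2**parity) at the midpoint of each mantissa bin in [1, 2)."""
    table = {}
    for parity in (0, 1):
        for idx in range(1 << FRAC_BITS):
            mantissa = 1.0 + (idx + 0.5) / (1 << FRAC_BITS)
            table[(parity, idx)] = 1.0 / math.sqrt(mantissa * (2.0 if parity else 1.0))
    return table


RSQRT_TABLE = build_rsqrt_table()


def rsqrt_lut(x):
    """Approximate 1/sqrt(x) for a positive normal value with a single table read."""
    sign, exponent, fraction = float_fields(x)
    if sign or exponent in (0, 255):
        raise ValueError("x must be a positive, normal single-precision value")
    e = exponent - 127                   # unbiased exponent
    parity = e & 1                       # odd exponents fold a factor 2 into the table
    half = (e - parity) // 2             # exact halving of the even part
    idx = fraction >> (23 - FRAC_BITS)   # leading fraction bits select the bin
    return RSQRT_TABLE[(parity, idx)] * 2.0 ** (-half)
```

With 8 fraction bits the table has 512 entries and a relative error of roughly 0.1%; a wider address trades memory for accuracy, while the lookup itself stays a single-cycle operation in hardware.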

Implementation and Results
In this paper, we have optimized the BC module of the systolic array in terms of complex mathematical operations. Many researchers have also worked on single-precision floating-point multipliers, which are likewise complex mathematical operations in hardware [23][24][25], but we keep our focus on the divider and square-root operations and use the multiplier core of the FPGA. The Xilinx Virtex 5 FPGA is the selected platform for hardware implementation.
As already discussed, the adaptive algorithm is a basic part of the beamforming system, and we select the IQRD-RLS algorithm as the most suitable one. Three types of architectural modules, MD0, MD1 and MD2 (where MD stands for module), are designed to judge the performance of the algorithm in terms of latency, throughput (1 sample / (clock period × clock cycles)), error and efficiency (throughput / number of slices). The details are:
1. MD0 (module 0), a conventional systolic array architecture with single-precision floating-point multipliers, dividers and a square-root module (generated with the Coregen software).

A radar target detection scenario is simulated in MATLAB to generate the data points for testing MD0, MD1 and MD2. First, a radar received signal is modelled with one target at an azimuth angle of 30° and two interferences at azimuths of 15° and 45°, respectively. It is assumed that the target and the radar lie at the same elevation, i.e. the elevation angle is equal to zero. Figure 10 shows the radiation pattern. The received signal impinges on a 3-element ULA beamformer based phased array radar to generate almost $10^4$ data points.
In the adaptive beamforming application, latency is an important performance metric that affects the overall system throughput. To judge the effect of the LUT on the algorithm performance, we compare the latencies and throughputs of the architectural modules. Table 3 details the latency of each unit in the system individually and the throughput of the adaptive filter (3-tap) for the IQRD-RLS algorithm with the three different architectural modules, i.e. MD0, MD1 and MD2. These modules are modelled in Verilog, and their behavioral descriptions are placed and routed on a Virtex 5 device with the ISE 14.1i tool. It is clear from Tab. 3 that the proposed MD2 module offers a 9 times reduction in latency compared with the MD0 and MD1 modules because of its optimized single cycle LUT based architecture. Furthermore, the MD2 module of IQRD-RLS has a higher throughput than the QRD-RLS M2 module. It is also important to note that in [8] QRD-RLS is implemented using a new floating-point format, whereas we implement IQRD-RLS with the standard IEEE 754 single-precision floating-point representation. Hence, IQRD-RLS is again shown to be the most suitable algorithm for the adaptive beamforming application.
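The throughput and efficiency metrics used in this comparison follow directly from their definitions (one sample per clock period times clock cycles, and throughput per slice). The helper names and the example figures below are hypothetical and are not taken from Tab. 3 or Tab. 4.

```python
def throughput(clock_period_s, clock_cycles):
    """Samples per second when one sample takes clock_cycles cycles."""
    return 1.0 / (clock_period_s * clock_cycles)


def efficiency(throughput_sps, slices):
    """Throughput per occupied FPGA slice."""
    return throughput_sps / slices


# Hypothetical design: 10 ns clock period, 20 cycles per sample, 1000 slices.
tp = throughput(10e-9, 20)
eff = efficiency(tp, 1000)
```

Because latency in clock cycles sits in the denominator, a 9 times reduction in cycles at an unchanged clock period raises the throughput by the same factor.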
The accuracy of the proposed architecture in the presence of the LUT is determined by measuring the difference between the weight coefficients of the ideal method and those of the proposed method in MD2 over $10^4$ simulated samples. This difference is known as the weight error. Results show that the weight error is less than 0.00216 for 96% of the samples. Figure 11 illustrates the filter weights of the ideal and proposed methods for the 3-tap beamformer (for clarity, only 200 iterations are plotted).
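One way to evaluate such a weight error statistic is sketched below. It assumes that the error of a sample is the largest absolute difference over the taps; the text does not state how the per-tap differences are combined, so this choice, the function name and the example data are all illustrative.

```python
def weight_error_fraction(ideal, actual, threshold):
    """Fraction of samples whose largest per-tap |ideal - actual| is below threshold."""
    below = 0
    for w_ref, w_hw in zip(ideal, actual):
        err = max(abs(a - b) for a, b in zip(w_ref, w_hw))
        if err < threshold:
            below += 1
    return below / len(ideal)


# Hypothetical 3-tap weights for four samples.
ideal = [[1.0, 2.0, 3.0]] * 4
actual = [[1.001, 2.0, 3.0], [1.0, 2.01, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0005]]
fraction = weight_error_fraction(ideal, actual, 0.00216)
```

Applied to the ideal and LUT based weight sequences, a function of this kind yields the percentage of samples that stay below a chosen error bound.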
The device utilization details (hardware summary) are given in Tab. 4. It is noteworthy that MD2 performs best in terms of area efficiency and throughput. The conventional architecture MD0 occupies the largest area, while MD1 occupies much less area than MD0. The hardware efficiency and throughput are determined in order to draw a reasonable comparison between the architectures. To the best of our knowledge, this is the first implementation of the IQRD RLS algorithm on an FPGA, so the performance of MD2 is also measured after synthesis for different target FPGAs, as shown in Tab. 5. The throughput of the proposed IQRD architecture on all FPGAs is better than that of the QRD architecture in [8]. Hence, the simulation results clearly indicate that IQRD RLS is the most suitable algorithm for adaptive beamforming due to its better throughput, hardware efficiency and latency.

Conclusion
A low latency architecture is presented that implements the IQRD algorithm for the beamforming application using system-level hardware tools. The proposed architecture improves the divider and square-root operations via a LUT approach. Moreover, it is mapped onto a Xilinx Virtex 5 FPGA and achieves minimal error between the real and ideal weight values. Hence, the proposed architecture performs best in terms of area efficiency, throughput and latency, and IQRD RLS is the most suitable algorithm for adaptive beamforming. In future work, the multiplier and adder units can be implemented via LUTs for further improvement in latency and throughput.

Figure 3 depicts the performance of the beamformer with different numbers of antenna elements. The selected numbers of antennas are 10, 15 and 20. It is clear from the radiation pattern that as the number of antennas increases, the main lobe of the beamformer narrows and points accurately towards the target angle, and the side lobes are strongly suppressed. On the other hand, the performance is also judged in terms of the antenna inter-element spacing.

Fig. 4. Radiation pattern with different antenna inter-element spacing D.

2. MD1 (module 1), a proposed systolic array architecture with single-precision floating-point multipliers, dividers and a square-root module (generated with the Coregen software).
3. MD2 (module 2), a proposed systolic array architecture with a LUT based BC for the divider and square-root operations.