Superior Implementation of Accelerated QR Decomposition for Ultrasound Imaging

In this work, a novel Minimum Variance Distortionless Response (MVDR) beamformer architecture, in which the adaptive weight vector is computed using a modified Column-wise Givens Rotation (CGR), is presented. Compared to the conventional MVDR beamformer, the QR Decomposition (QRD)-based MVDR beamformer is better suited for hardware realization. To meet the real-time performance requirements of the MVDR beamformer, a parallel pipelined CGR-based QRD architecture is employed in the adaptive weight computation stage. The architecture is built on a computationally efficient MVDR beamforming algorithm that requires computing only the R matrix of the QRD, rather than an explicit matrix inverse. The developed architecture generates the adaptive weight vector in 3.9 ns, and hence a beam can be formed in 0.25 ms. The architecture is implemented in Verilog at the Register Transfer Level (RTL), and functional equivalence checking was carried out against the Verasonics Vantage-64 Ultrasound Research Platform (URP). The architecture was also ported to a Xilinx Kintex-7 FPGA based emulation setup and validated in real time, targeting medical ultrasound imaging applications. The developed architecture is compared with existing implementations; it is superior in terms of computational time and can be adopted for ultrafast adaptive beamforming applications.


I. INTRODUCTION
Delay-and-sum (DAS) is the fundamental beamforming technique, given a known direction of arrival (DOA) and the corresponding time delays. It delays and sums the channel signals according to the direction of arrival, so that the system produces destructive interference in all directions except the direction of arrival/interest [1]. This beamformer combines the signals from different channels with fixed, predefined weights. It is the simplest beamformer in terms of computational complexity, but its low resolution and weak interference suppression make it a poor choice for medical imaging. Adaptive beamformers instead use data-adaptive apodization weights. Minimum Variance Distortionless Response (MVDR) is one such algorithm that uses adaptive weights. It offers far better resolution than DAS, at the expense of very high computational complexity [2]. Figure 1 shows the conventional delay-and-sum beamformer structure. The resolution improvement is achieved by suppressing interference from undesired directions while preserving the signal from the desired direction. The most computationally expensive block is the one that calculates the spatial covariance matrix and its inverse for the computation of the adaptive weights. MVDR beamforming improves image resolution to a large extent compared to the DAS method [3]. The main implementation issue with MVDR is the volume of calculation involved in the algorithm. QR decomposition reduces the computational complexity of MVDR: it does not require computing the spatial covariance matrix or its inverse, at the expense of the calculations needed for the QR decomposition itself.
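As a minimal illustration of the DAS principle just described, the following Python sketch delays, weights, and sums a toy two-channel data set. The function and variable names are our own, and integer sample delays with wrap-around shifts are assumed purely for brevity:

```python
import numpy as np

# Minimal delay-and-sum (DAS) sketch: fixed, predefined weights,
# integer per-element focusing delays (illustrative assumption).
def das_beamform(x, delays, weights=None):
    """x: (M, N) channel data; delays: (M,) integer sample delays per element."""
    M, N = x.shape
    if weights is None:
        weights = np.ones(M) / M          # fixed apodization
    y = np.zeros(N)
    for m in range(M):
        # shift channel m by its focusing delay, then weight and sum
        y += weights[m] * np.roll(x[m], -delays[m])
    return y

# Two channels carrying the same pulse offset by one sample: after
# delay compensation the contributions add coherently.
x = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 0.]])
y = das_beamform(x, delays=np.array([0, 1]))
```

After delay compensation both channels contribute at the same sample index, which is exactly the coherent summation that DAS relies on.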
A modified algorithm, the accelerated QR-MV beamformer, further reduces the computational complexity, since it requires only the R calculation. Many methods are available for computing R, such as Givens Rotation (GR), Modified Squared Givens Rotation (MSGR), and fast Toeplitz orthogonalization. Column-wise Givens Rotation is the preferred method, as it requires far fewer calculations than standard Givens Rotation and MSGR. Previous articles show that only the R calculation is required to compute the adaptive weight vectors of the MVDR beamformer, and this calculation dominates the computational complexity of these algorithms. The proposed work uses QR-MV beamforming techniques to implement a method that is computationally superior to the native approaches. The major contributions of this article are as follows: • We present a CGR-based QRD-MV beamformer implementation on the Verasonics Ultrasound Research Platform (URP) and on a Xilinx 7-series (Kintex-7) FPGA.
• We present a novel modified architecture for the CGR-based QR decomposition algorithm (PCGR-QRD) and an MV beamformer realization using this QRD algorithm on an FPGA platform.
• We compare our experimental results with contemporary implementations in the literature and show a significant improvement in throughput, with resolution on par with the native MVDR beamformer.
The organization of this article is as follows: Section II describes the background of MVDR beamforming; Section III presents the literature survey; Section IV describes the accelerated MV beamformer implementation using the CGR algorithm; Section V describes the proposed architecture; Section VI describes the experiment setup; Section VII deals with the results and discussion; and Section VIII presents the conclusions.

II. BACKGROUND OF MVDR BEAMFORMING
MVDR is a technique introduced by Capon in 1969. The algorithm can resolve signals separated by fractions of the antenna beamwidth [3]. To maximize the signal-to-noise ratio, the beamformer steers toward the desired signal's direction of arrival (DOA), and the weight vector of the beamformer recovers the desired signal. The MVDR algorithm maximizes the sensitivity of the sensors in only one direction: the beamformer minimizes its output power under a single linear constraint on the array's response to the desired signal [4]. The algorithm depends on the steering vectors, which in turn depend on the angle of incidence of the signal received by the antenna array. The direction of the useful signal must be known, and the output power is minimized subject to a unit-gain constraint in the direction of the desired signal. Figure 2 shows the beamforming concept.
where λ_max is the largest eigenvalue of the array correlation matrix R_xx(k), given by equation (2).
The received signal vector is denoted x(k). The weights of the array are updated according to the equation given below.
The equation below gives the error signal e(k). To achieve the minimum power of the weighted sensor signals, the Minimum Variance (MV) beamformer continually updates the apodization weights under the restriction that the signal from the point of interest is transmitted without distortion. The MVDR beamformer does not require knowledge of the directions of interference for the weight vector calculation; it needs only the direction of the desired signal. The MVDR/Capon beamformer can attain a higher resolution than the standard (Bartlett) method, but since it involves a full-rank matrix inversion, the algorithm has higher complexity. Technical developments in GPUs have started closing this gap and make Capon beamforming possible in real time. The linearly constrained minimum variance (LCMV) criterion is the one most commonly used in beamformers for suppressing interfering signals and noise. The algorithm was first proposed by Frost [4]. Griffiths and Jim [5] made a practical implementation of LCMV by proposing the generalized sidelobe canceller (GSC). Since the GSC algorithm uses an unconstrained approach instead of a constrained one, its weights can be updated at much higher rates [6]. Separable unitary transforms, such as Discrete Fourier Transforms (DFTs), have been used to de-correlate the input data and improve the convergence rate of the original LMS-GSC model [7]. Glentis attempted to reduce the computational complexity of both GSC-LMS and TD-LMS GSC by treating complex transmissions as a few real signals and using the algorithmic strength-reduction method [8], [9]. This discrepancy can be large enough that the GSC misinterprets the required signal component, nullifying it instead of retaining a distortion-free response to it.
In practice, almost all beamformer applications lack the required signal during the training period [10], [11], which allows the beamformer to be more robust to mismatch errors in the array response. The beamformed output y[n] at point n in the scanline can be described as

y[n] = Σ_{m=0}^{M−1} w_m^* x_m[n − δ_m[n]],

where M is the array size, w_m is the complex apodization weight, x_m[n] is the sampled data obtained from array element m, and δ_m[n] is the delay applied to array element m to focus the beam at point n. Simplifying equation (5), we get y[n] = w^H[n] x[n], where x[n] denotes the vector of delayed channel samples and w[n] the vector of apodization weights; here * denotes the complex conjugate, and H denotes the conjugate (Hermitian) transpose of a vector or matrix.
The MVDR beamformer uses the adaptive apodization weights to minimize the beamformer's output power while maintaining the response from the direction of interest as a constant, which is not maintained in the DAS Beamformer [12].
Because all received signals are scaled and delayed replicas of the transmitted pulse, the signal from the direction of interest (DOA) and the interfering signals are coherent or highly correlated, which may cause signal cancellation and produce poor beamformed output. This problem is solved using a subarray averaging technique [12], which makes the MVDR beamformer highly computationally complex, since it requires the calculation of R[n] and its inverse. In real-time applications the covariance matrix is unknown and must be estimated from data. The computational cost of calculating the spatial covariance matrix and its inverse limits the use of the MVDR beamformer in real-time applications such as therapeutic ultrasound imaging; for a real-time implementation, generating the covariance matrix and inverting it is a difficult task. In subarray MVDR, to decrease the size of the spatial covariance matrix, the transducer array is partitioned into overlapping subarrays, and the covariance matrices of the subarrays are averaged across the array, which makes generating the covariance matrix and computing its inverse practically implementable.
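The subarray-averaged MVDR weight computation described above can be sketched in a few lines of Python. This is an illustration, not the paper's implementation: the diagonal loading term and all names are our own assumptions, added only to keep the toy covariance matrix invertible:

```python
import numpy as np

# Sketch of MVDR weights with subarray (spatial) averaging.
# diag_load is an illustrative regularizer, not part of the original design.
def mvdr_weights_smoothed(x, L, diag_load=1e-3):
    """x: (M,) delayed snapshot; returns (L,) MVDR weights via subarray averaging."""
    M = x.shape[0]
    P = M - L + 1                         # number of overlapping subarrays
    R = np.zeros((L, L))
    for l in range(P):
        xl = x[l:l+L]                     # subarray data vector
        R += np.outer(xl, xl)             # accumulate subarray covariance
    R /= P                                # spatial smoothing (averaging)
    R += diag_load * np.trace(R) / L * np.eye(L)
    a = np.ones(L)                        # steering vector after delay alignment
    Ria = np.linalg.solve(R, a)
    return Ria / (a @ Ria)                # w = R^{-1}a / (a^T R^{-1} a)
```

By construction the weights satisfy the distortionless constraint a^T w = 1, which is the "unit gain in the direction of interest" restriction stated above.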

III. LITERATURE SURVEY
The conventional MVDR beamformer has a high computational cost, since we need to calculate R[n] and its inverse, which prevents practical implementation despite its excellent performance. QR decomposition avoids the explicit calculation of R[n] and its inverse and, when applied to the beamformer, yields a beamformer with lower computational complexity [12]. The transducer array is split into P (P = M − L + 1) overlapping subarrays, and the covariance is generated and averaged across the array. This method is termed spatial smoothing [12].
The subarray data vector is given by

x̄_l[n] = [x_l[n], x_{l+1}[n], …, x_{l+L−1}[n]]^T,

where L is the subarray size and l = 0, 1, 2, …, M − L. To make the subarray vectors orthonormal, we apply a transformation that makes the weight calculation simpler [12]: the subarray data vector, the steering vector, and the weight vector all have transform-domain counterparts obtained by applying the transformation matrix T[n], and the beamformer output [12] is computed from these transformed quantities. T[n] is calculated by defining a subarray data matrix U[n] of size (M − L + 1) × L and applying the QR decomposition algorithm to it,

U[n] = Q[n] S[n],

where Q[n] is an orthogonal matrix and S[n] is an upper triangular matrix.
The transformation matrix can be derived from the above equations; using them we obtain T[n], the transformed subarray data vector, and the transformed steering vector a_z[n]. Since S^T is a lower triangular matrix, a_z[n] can be computed simply by forward substitution, and the beamformer output then takes the form given in [12]. The QR-based MV beamformer block diagram is shown in Figure 3. The PCGR algorithm is explained in later sections.
The accelerated QR-based MVDR block diagram is given in Figure 4. We used a Fast Fourier Transform (FFT) to convert the time-domain data to the frequency domain and an IFFT for the back conversion. The algorithm for finding the output y[n] is described in Figure 5.
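The accelerated QR-MV chain can be sketched in software to make the data flow concrete: build the subarray data matrix U[n], take only its triangular factor S[n] (never Q[n], and never an explicit inverse), and recover the weights with two triangular solves. This is our own illustrative sketch under the paper's equations, not the hardware implementation:

```python
import numpy as np
from scipy.linalg import solve_triangular

# Accelerated QR-MV output sketch: only R (here called S) from the QRD of
# the subarray data matrix is needed, since R_hat = (1/P) S^T S.
def qr_mv_output(x, L):
    """x: (M,) delayed snapshot; returns the scalar beamformed output y[n]."""
    M = x.shape[0]
    P = M - L + 1
    U = np.array([x[l:l+L] for l in range(P)])   # (P, L) subarray data matrix
    S = np.linalg.qr(U, mode='r')                # U = Q S; S upper triangular
    a = np.ones(L)                               # steering vector after delays
    t = solve_triangular(S.T, a, lower=True)     # forward substitution: S^T t = a
    b = solve_triangular(S, t, lower=False)      # back substitution:    S b = t
    w = b / (a @ b)                              # MV weights, since R^{-1} ∝ (S^T S)^{-1}
    return w @ U.mean(axis=0)                    # average output over subarrays
```

Because S^T S equals U^T U, the result matches the covariance-based MV output while skipping both the explicit covariance matrix and its inverse.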

A. QR DECOMPOSITION ALGORITHMS
QR factorization (also called QU factorization) is the common terminology for QR decomposition in linear algebra. It is the process by which a matrix A is decomposed into a product A = QR of an orthogonal matrix Q and an upper triangular matrix R. The QR algorithm is a very powerful algorithm for stably computing eigenvalues and the corresponding eigenvectors/Schur vectors. To compute the QR decomposition, different algorithms can be used, such as Householder transformations, Givens rotations [13], or the Gram–Schmidt process [14].
The method by which we calculate U[n] is explained in Figure 6.

1) SCHMIDT ORTHONORMALIZATION ALGORITHM
We have a set of linearly independent input vectors v_1, v_2, v_3, …, v_n. The algorithm [15] normalizes the first input vector, u_1 = v_1/‖v_1‖. For the second vector u_2 to be orthogonal to the first, we need to remove from v_2 the component parallel to u_1, which is a simple projection, and then normalize the remainder. We now have two orthonormal vectors; for the third vector we again remove the components parallel to each of them, and so on.
When this procedure is applied in practice, the u_k vectors are not precisely orthogonal due to rounding errors. The loss of orthogonality is particularly bad for the classical Gram–Schmidt process, and we can say that the classical Gram–Schmidt method is numerically unstable [16].
A minor change in the algorithm, referred to as modified Gram–Schmidt (MGS), stabilizes the Gram–Schmidt method [17]. The algorithm calculates u_k according to the following equations.
The numerical complexity of the latter algorithm is asymptotically O(nk²) floating-point operations, where n is the vector dimensionality and k is the number of vectors.
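The modified Gram–Schmidt update described above can be written compactly as follows. This is a textbook sketch for illustration, not the paper's implementation:

```python
import numpy as np

# Modified Gram-Schmidt QR: each remaining column is re-orthogonalized
# against u_k immediately, curbing the rounding-error build-up of the
# classical variant.
def mgs_qr(A):
    """A: (n, k) with independent columns; returns Q with orthonormal columns and upper triangular R."""
    A = A.astype(float).copy()
    n, k = A.shape
    Q = np.zeros((n, k))
    R = np.zeros((k, k))
    for j in range(k):
        R[j, j] = np.linalg.norm(A[:, j])
        Q[:, j] = A[:, j] / R[j, j]          # normalize current vector
        for i in range(j + 1, k):            # update the *remaining* columns now
            R[j, i] = Q[:, j] @ A[:, i]
            A[:, i] -= R[j, i] * Q[:, j]     # remove the component along u_j
    return Q, R
```

The only difference from classical Gram–Schmidt is the inner loop: projections are subtracted from the still-unprocessed columns immediately, rather than all at once at the end.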

2) GIVENS ROTATIONS
Wallace Givens introduced this method, which was afterwards named after him. In the Givens rotation scheme the decomposition results are stored in the arrays that originally store A, which avoids the use of additional arrays. The QR decomposition of a matrix has diverse uses; for example, the so-called QR algorithm uses it to determine a matrix's eigenvalues. At every step of the Givens rotation scheme, two rows of the matrix under transformation are rotated [18]. The rotation parameter is chosen such that one of the entries of the current matrix is eliminated. First the entries of the first column are removed one by one, then the same is performed with the second column, and so on, up to column n − 1. The resulting matrix is R.
There are two main stages in the algorithm: selecting the rotation parameter, and the rotation itself, which is applied to two rows of the current matrix. The entries of these rows situated to the left of the pivot column are zero, so they need no modification. The entries in the pivot column are rotated at the same moment the rotation parameter is selected. The second part of the step therefore consists of rotating the two-dimensional vectors formed by the row entries positioned to the right of the pivot column [19]. In terms of operations, modifying a column amounts to multiplying two complex numbers, one of which has modulus 1; multiplying two complex numbers costs one subtraction, one addition, and four multiplications.
where c = cos θ and s = sin θ, so that c² + s² = 1. The product G(i, j, θ)x represents a counter-clockwise rotation of the vector x in the (i, j) plane by θ radians. A QR decomposition can be computed with a series of Givens rotations: each rotation zeros an element in the subdiagonal of the matrix, forming the R matrix, and the concatenation of all the Givens rotations forms the orthogonal Q matrix.
As an example of Givens rotation, the calculation of R is explained as a flow [20]. We take a 3 × 3 matrix whose subdiagonal entries are a_1, b_1, and b_2; the Givens rotation algorithm generates the upper triangular matrix R from this matrix by eliminating a_1, b_1, and b_2 in three stages of elimination.
Stage 1: the first Givens rotation matrix, denoted G1, is defined as in [21]. G1 is multiplied with the input matrix A to eliminate the a_1 term, so the resulting matrix is G1·A. Stage 2: the second Givens rotation matrix, denoted G2, is multiplied with the matrix obtained in stage 1 to eliminate b_1, so the resulting matrix is G2·G1·A. Stage 3: the Givens rotation matrix G3 is multiplied with the matrix obtained in stage 2 to eliminate b_2. The resulting matrix is the upper triangular matrix, our R.
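The staged elimination above generalizes directly to any square matrix. The following Python sketch (our own illustrative code, not the paper's architecture) builds each rotation from the pivot and target entries and accumulates Q:

```python
import numpy as np

# Givens-rotation QR: each rotation zeros one subdiagonal entry; the
# product of all rotations (transposed) forms the orthogonal Q.
def givens_qr(A):
    """Returns Q, R with A = Q @ R and R upper triangular."""
    R = A.astype(float).copy()
    n = R.shape[0]
    Q = np.eye(n)
    for j in range(n - 1):                    # pivot column
        for i in range(j + 1, n):             # entry to eliminate
            r = np.hypot(R[j, j], R[i, j])
            if r == 0.0:
                continue
            c, s = R[j, j] / r, R[i, j] / r   # rotation parameters, c^2 + s^2 = 1
            G = np.eye(n)
            G[[j, i], [j, i]] = c             # sets G[j, j] and G[i, i]
            G[j, i], G[i, j] = s, -s
            R = G @ R                         # zeros R[i, j]
            Q = Q @ G.T                       # accumulate the rotations
    return Q, R
```

Each `G @ R` corresponds to one stage of the 3 × 3 walk-through; for a 3 × 3 input the loop performs exactly the three eliminations G3·G2·G1·A = R.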

3) MODIFIED SQUARED GIVENS ROTATIONS
The QR decomposition of a matrix A results in two matrices Q and R, A = QR, where Q is an orthogonal matrix and R is an upper triangular matrix. The matrix A can also be decomposed using the Squared Givens Rotation (SGR) [21]. The MSGR algorithm is explained here on a 3 × 3 matrix of complex values whose subdiagonal entries are a_1, b_1, and b_2: MSGR generates the upper triangular matrix U from this matrix by eliminating a_1, b_1, and b_2 in a three-stage process.

Stage 1:
Rotate rows r and a to eliminate element a_1. The row u is introduced, defined in [21] as u = r_1^* r, where w_a > 0 is a scalar factor.
The back conversion from V-space is then defined with the updated scale factor W_a. Stage 2: rotate r and b to eliminate b_1; r is already in U-space, and the row b is translated to V-space. The back conversion from V-space follows from equation (42), with the updated scale factor W_b. Stage 3: rotate a and b to eliminate b_2; a must be translated to U-space and b to V-space. Then u_a and v_b are updated, and finally the last row is translated to U-space to obtain the matrix U. The MSGR method for the general case is given in [21]. Finally, the matrix R is obtained from the matrix U.

3) FAST TOEPLITZ ORTHOGONALIZATION
A matrix T is Toeplitz if its elements are constant along each diagonal. Because a Toeplitz matrix is determined by at most 2n − 1 numbers, there are algorithms that solve Toeplitz linear systems using just O(n²) operations, or even O(n log² n) when fast techniques are applied [21].
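To make the complexity gap concrete, the sketch below solves a small Toeplitz system with SciPy's O(n²) Levinson-based solver and checks it against a generic O(n³) solver. The matrix values are arbitrary illustrative numbers:

```python
import numpy as np
from scipy.linalg import toeplitz, solve_toeplitz

# A Toeplitz matrix is fixed by its first column and first row (2n-1
# numbers), so specialized solvers exploit that structure.
c = np.array([4., 1., 0.5, 0.25])   # first column
r = np.array([4., 2., 1., 0.5])     # first row (same leading element)
T = toeplitz(c, r)                  # full (n x n) matrix, for reference only
b = np.ones(4)

x_fast = solve_toeplitz((c, r), b)  # O(n^2) Levinson recursion
x_ref = np.linalg.solve(T, b)       # generic O(n^3) solver
```

The fast solver never forms T explicitly; it works from the 2n − 1 defining numbers alone, which is exactly the shift-invariance property the fast orthogonalization methods exploit.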
The Toeplitz matrix A has size (m+1) × (n+1). The shift-invariance property of the Toeplitz matrix can be exploited to solve our QR decomposition problem. The paper [21] describes how the matrix A can be partitioned and suggests two methods for doing so.
The QR decomposition of matrix A is given by A = QR, where R is an upper triangular matrix of order n + 1. The calculation of R consists of two stages [35].
The above equation describes the QR decomposition of an m × n matrix, where P is an orthogonal matrix of order m and R is an upper triangular matrix of order n [35]. The actual method is shown in Figure 7. Stage 1 computes r_k^{−1} and w_k from r_k and w_1, w_2, …, w_{k−1}.
Stage 2 computes r_{k+1}, U_k, and V_k, after first determining the k-th column of R and V_k using the corresponding equations.

IV. ACCELERATED QRD-MV BEAMFORMER IMPLEMENTATION USING CGR ALGORITHM
CGR-QRD requires fewer multiplications than the GR implementations in the literature [22]; hence, it requires fewer resources to perform the decomposition. Apart from the reduced computational complexity, CGR-QRD can exploit more fine-grained and coarse-grained parallelism than other native approaches. The 2D systolic array architecture for column-wise Givens rotation, described in Figure 8, shows considerable overlap between computation and communication among the processing elements.
In this section we describe the QRD-MV beamformer implementation [12] using the CGR algorithm described in [26]. The acceleration process was explained in the literature survey; here we show how the column-wise Givens rotation is used to find the upper triangular matrix and why it is more efficient than the standard methods.
Consider the non-singular matrix X of size n × n, decomposed with QQ^T = I and Q = Q_{n−1} … Q_2 Q_1, where I is an identity matrix, Q_k = G^k_{k,k+1} … G^k_{n−2,n−1} for 1 ≤ k ≤ n − 1, and G_{i,j} = diag(I_{i−2}, G_{i,j}, I_{m−1}), with

If we denote by GR_M the number of multiplication operations required for Givens rotation and by CGR_M that of column-wise Givens rotation, we arrive at the findings below.
Taking the ratio of equations (46) and (47), as n goes to infinity β approaches 3/4; i.e., as the matrix size increases, the number of multiplications in CGR becomes 3/4 of that of Givens rotation, which reduces the operation complexity [22]. Many previous studies have shown successful results in reducing the computational cost, but all of them are often viewed as approximate implementations of the traditional MV beamformer.
Here we propose a computationally superior QRD-based MV beamformer that is mathematically equivalent to the traditional MV beamformer [12]. The algorithm of the CGR-based MVDR beamformer is explained below.

V. THE PROPOSED ARCHITECTURE-PIPELINED COLUMN WISE GIVENS ROTATION (PCGR)
We explained the QRD-based MV beamformer in the earlier sections. The spatial covariance matrix and its inverse are not required here, but significant calculations are still necessary for the QR decomposition. The paper [12] has already described how to find the beamformed output using U[n] and S[n] without having to calculate Q[n]. The acceleration method is also described in [12] and is adopted in our study; on top of it, we propose a new architecture for finding R that makes the computation faster. The scheduling of CGR operations for the FPGA implementation in the proposed system is shown in Figure 11 as a new scheme named PCGR (Pipelined Column-wise Givens Rotation).
In this section, we propose a new architecture for computing the column-wise Givens rotation more efficiently. The Givens Generation architecture [23], [24] is modified with memory elements that store all the intermediate values, which enables us to pipeline the entire architecture and thereby save clock cycles; the modified Givens Generation architecture is shown in Figure 10, and the modified Row Update architecture is shown in Figure 9. The Row Update architecture [25], [26] is modified with pipelining features that save computation time. The new architecture is shown in Figure 11. We use a 20 MHz system clock, which is multiplied using the FPGA clocking resources to obtain a 200 MHz clock that runs the adaptive weight calculation of the modified algorithm.

A. ARCHITECTURE AND THE WORKING
The architecture of the column-wise Givens rotation is adopted from [21], modified, and implemented on an FPGA, utilizing the FPGA's parallel processing capabilities. The proposed architecture, whose operation is explained below for a 4 × 4 matrix, can be extended to a matrix of any size. Column 1 is given as input to GG1. All the operations in GG1, as shown in the architecture below, require only column 1. The outputs obtained have to be stored in memory, as shown in Figure 11. After GG1, the first column of the matrix becomes zero except for its first element, p3 (obtained from GG1), as shown above. In parallel, RU12 can begin the row 1 update for column 2, followed by the row 4, row 3, and row 2 updates for column 2. The inputs to RU12 are column 2 and column 1, along with the other outputs already stored in memory. The output obtained is the updated column 2. The first element of this updated column 2, i.e., X12, has to be stored separately since it remains unchanged in the final output matrix; the rest of the elements change in GG2.
The input to GG2 is this updated column 2, which takes the place that column 1 had as the input to GG1 earlier.
Here, the first element, i.e., X12, has to be made zero so that the same architecture can be reused. Compared to GG1, the difference is that X41 becomes X42, X31 becomes X32, X21 becomes X22, and X11 becomes X12, which is already zero. The outputs of the GG2 operations and the updated column 2 also have to be stored in memory, as shown above. The second column of the final output matrix is now ready: its first element was already obtained from RU12 and stored earlier, its second element is p2 (obtained from GG2), and the rest are zeros, as shown in Figure 11.
The inputs to RU13 are column 3 and column 1, along with the other outputs from memory. Just like RU12, RU13 can begin the row 1 update for column 3 in parallel with the GG1 operations, followed by the row 4, row 3, and row 2 updates for column 3. The output obtained is the updated column 3. The first element of this updated column 3, i.e., X13, has to be stored separately since it remains unchanged in the final output matrix; the rest of the elements change in the subsequent updates in RU23 and GG3.
The inputs to RU23 are the updated column 2 from GG2, in which X12 was already made zero, and the updated column 3 from RU13, along with the other outputs from the memory of GG2. The first element of this updated column 3 can be made zero, but this is not compulsory, since X12 is already zero. RU23 can begin the row 2 update for column 3 in parallel with the GG2 operations, followed by the row 4 and row 3 updates for column 3. The output obtained is the updated column 3. The output of the row 2 update, i.e., X23, has to be stored separately since it remains unchanged in the final output matrix; the rest of the elements change in GG3. The input to GG3 is this updated column 3, with X13 and X23 made zero. The same architecture can be used, as explained above for GG2. The outputs of the GG3 operations, as well as the updated column 3, also have to be stored in memory, as shown above. The third column of the final output matrix is now ready: its first and second elements were already obtained from RU13 and RU23 and stored earlier, its third element is p1 (obtained from GG3), and its last element is zero, as shown in Figure 11.
The inputs to RU14 are column 4 and column 1, along with the other outputs from the memory of GG1. Just like RU12 and RU13, RU14 can begin the row 1 update for column 4 in parallel with the GG1 operations, followed by the row 4 and row 3 updates for column 4. The output obtained is the updated column 4. The first element of this updated column 4, i.e., X14, has to be stored separately since it remains unchanged in the final output matrix. The rest of the elements change in the subsequent updates in RU24 and RU34.
The inputs to RU24 are the updated column 2 from GG2's memory, in which X12 was already made zero, and the updated column 4 from RU14, along with the other outputs from the memory of GG2. The first element of this updated column 4 can be made zero, but this is not compulsory, since X12 is already zero. RU24 can begin the row 2 update for column 4 in parallel with the GG2 operations, followed by the row 4 and row 3 updates for column 4. The output obtained is the updated column 4. The output of the row 2 update, i.e., X24, has to be stored separately since it remains unchanged in the final output matrix; the rest of the elements change in RU34. The inputs to RU34 are the updated column 3 from GG3, in which X13 and X23 were already made zero, and the updated column 4 from RU24, along with the other outputs from the memory of GG3. RU34 can begin the row 3 update for column 4 in parallel with the GG3 operations, followed by the row 4 update for column 4. The output obtained is the updated column 4. The output of the row 3 update is X34, and the output of the row 4 update is X'44. The resulting matrix is shown below. The architecture for a 4 × 4 matrix is shown in Figure 11, and the architecture for finding the upper triangular matrix using PCGR is shown in Figure 14. The algorithm of the new architecture is explained below. The working of the FSM is stated in the algorithm, which enables us to save various repeated multiplications and divisions by reusing the GG and RU blocks; the additional hardware required is the memory and FIFO blocks.
The architecture is summarized in Figure 12.
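The GG/RU schedule walked through above can be modeled behaviorally in a few lines of Python. This is a software sketch of the dataflow only, not the RTL; the block names `gg` and `ru` are our own illustrative labels for the Givens Generation and Row Update blocks:

```python
import numpy as np

# Behavioral model of the PCGR schedule for a 4x4 matrix: gg(k) zeros
# column k below the diagonal (row 4, row 3, row 2 order) and stores its
# (c, s) parameters; ru replays those stored parameters on a later column,
# mirroring the memory/FIFO reuse described in the algorithm.
def gg(col, k):
    """Givens Generation: zero col[k+1:] against pivot col[k]; return column and params."""
    col = col.copy()
    params = []
    for i in range(len(col) - 1, k, -1):         # bottom-up elimination order
        r = np.hypot(col[k], col[i])
        c, s = (col[k] / r, col[i] / r) if r else (1.0, 0.0)
        col[k], col[i] = r, 0.0                  # pivot absorbs the magnitude
        params.append((i, c, s))
    return col, params

def ru(col, k, params):
    """Row Update: apply gg(k)'s stored rotations to a later column."""
    col = col.copy()
    for i, c, s in params:
        col[k], col[i] = c * col[k] + s * col[i], -s * col[k] + c * col[i]
    return col

A = np.array([[4., 1., 2., 0.],
              [3., 2., 0., 1.],
              [0., 1., 3., 2.],
              [1., 0., 1., 3.]])
cols = [A[:, j].copy() for j in range(4)]
for k in range(3):                               # GG1..GG3
    cols[k], p = gg(cols[k], k)
    for j in range(k + 1, 4):                    # RU blocks for the later columns
        cols[j] = ru(cols[j], k, p)              # can run in parallel in hardware
R = np.column_stack(cols)                        # upper triangular result
```

In hardware, the inner `ru` calls for different columns run concurrently with the next `gg`, which is exactly the overlap the pipelined schedule in Figure 11 exploits; here they are sequential only because this is a functional model.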

VI. EXPERIMENT SETUP
The complete experiment setup consists of a custom 128-channel transceiver board, shown in Figure 15, a high-end FPGA prototyping board (the DBF board), and a PC. The two boards are connected using a board-to-board connector. The ultrasound transducer probe is connected to the transceiver board using the probe connector. The 128-channel ultrasound transceiver board has high-voltage transmit pulsers to energize up to 128 transducer elements and 128 channels of analog front-end (AFE) signal conditioning circuits. The board also contains a Xilinx Kintex-7 XC160T FPGA, which acts as the transmit-receive controller and transmit beamformer for the system, along with appropriate clock and control tree structures for clock and power administration. The high-voltage transmit pulsers energize the transducer array under FPGA control (transmit beamformer), and the received echoes are conditioned by the analog front end and streamed to a 128-channel digital beamformer (DBF) board to perform receive beamforming. The DBF board is engineered around a Xilinx Kintex-7 410T FPGA, which has sufficient resources for adaptive array signal processing algorithms such as the QRD-MV beamformer. The transceiver board's high-voltage transmitter section transmits the ultrasound pulses; the pulses go to the probe connected to the board, which produces echoes after they hit targets in the cyst phantoms. The echoes are captured and processed by the transceiver's analog front-end section and transferred to the DBF board, on which we implemented the QR-MV beamformer algorithm. The beamformed data is transferred to the PC via Gigabit Ethernet and processed by custom-made software, and the ultrasound image is formed.
The data for the second experiment was collected from the Verasonics™ Ultrasound Research Platform [27], shown in Figure 13, using the 64-element phased array probe, and the data was processed in MATLAB.
The setup for validating the results consists of a Verasonics™ Ultrasound Research Platform, a dedicated hardware unit that acquires real data from the phantom through its ultrasound probe, processes the data using built-in beamforming algorithms, and forms the images using its image-forming tools. The research platform is connected to a PC, and the unit is a flexible tool for transmitting, receiving, and processing ultrasound information. Amplification, sampling, and filtering operations are performed on the received ultrasound data. Finally, the Vantage unit stores the data and hands it over to the host computer via a PCI Express cable [27]. Another important piece of equipment that helps in collecting real data is the target phantom. There are different phantoms, such as kidney phantoms, resolution phantoms, etc. A resolution phantom is a rectangular box of tissue-mimicking material containing numerous cysts. The cysts' locations are known and clearly defined; this information is provided to the user on the phantom, which allows the user to fine-tune and validate an algorithm against the resolutions obtained from the real data.
The real experiment setup is shown in Figure 15.

VII. RESULTS AND DISCUSSIONS
The proposed PCGR architecture is implemented on a Kintex-7 custom board using Verilog HDL. The RTL simulation, synthesis, and implementation were carried out for the custom-designed FPGA boards, and the results were validated against results from the Ultrasound Research Platform.

A. FPGA IMPLEMENTATION
The implementation of the accelerated QRD-MV beamformer was completed on the prototype setup described in the experiment setup section. We completed simulations and implementations to validate the algorithm's behavior and performance [30]. We implemented the algorithm on the custom FPGA board, called the Digital Beamformer (DBF) board, which is connected to the PC via a Gigabit Ethernet port; the sensor data reaches this DBF board through the board-to-board connector. The resource utilization for the XC7K410T FPGA is detailed in Table 4 and Table 3. The resource utilization was on the higher side compared with standard implementations because we used pipelining to improve the latency and clock frequency; the price paid for this is the extra hardware needed to implement the pipeline. We used the Xilinx LogiCORE IP to implement the Fast Fourier Transform for the architecture [36], as shown in Figure 3.

B. VALIDATION OF RESULTS
We used the Verasonics Research Platform to validate our results. The Vantage Research Ultrasound Platform uses proprietary hardware and software technologies to provide direct access to raw ultrasound data while preserving the ability to perform high-quality real-time imaging with custom software at clinically useful frame rates. We used a P4-2v 64-element phased array probe for our research; it has a pitch of 0.3 mm, an elevation focus of 50-70 mm, and a sensitivity of -64-95 dB. We used a precision multi-purpose tissue-mimicking resolution phantom to assess the algorithm's resolution performance. The proposed MVDR architecture was validated on real data captured from the resolution phantom using the Vantage-64 Ultrasound Research Platform. Figure 20 shows that the CGR-MVDR beamformer achieves higher resolution than the DAS beamformer in Figure 19.
Simulations and implementations were performed to validate the idea put forward in this article. The performances of the MV and QRD-MV beamformers are similar; the parallel pipelined architecture has improved the beamformer's throughput in frames per second (fps). The beam patterns of the conventional DAS beamformer and the accelerated QRD-MVDR beamformer for a single cyst at a zero-degree angle are shown in Figure 17 and Figure 18. The beamwidth of the accelerated QRD-MVDR beamformer is narrower than that of the conventional beamformer by 0.6 mm. Images from the Research Platform for the DAS beamformer and the accelerated QRD-MV beamformer for multiple cysts are shown in Figure 19 and Figure 20, respectively; the cysts are better separated in Figure 20 than in Figure 19.
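The resolution gain of MVDR over DAS comes from its data-adaptive weights. The following Python sketch (illustrative only; the steering vector is simplified to broadside and the variable names are hypothetical) shows how the adaptive weights can be obtained from the R factor alone via two triangular solves, avoiding the explicit covariance matrix and its inverse:

```python
import numpy as np
from scipy.linalg import solve_triangular

def mvdr_weights_from_r(R, a):
    """MVDR weights w = C^{-1}a / (a^H C^{-1}a) with C = R^H R,
    computed by two triangular solves instead of an explicit inverse."""
    y = solve_triangular(R, a, trans='C', lower=False)  # solve R^H y = a
    z = solve_triangular(R, y, lower=False)             # solve R z = y
    return z / (a.conj() @ z)                           # enforce w^H a = 1

rng = np.random.default_rng(1)
X = rng.standard_normal((32, 8)) + 1j * rng.standard_normal((32, 8))  # snapshots x channels
a = np.ones(8, dtype=complex)        # broadside steering vector (simplified)
R = np.linalg.qr(X, mode='r')        # only the R factor of the data matrix is needed
w = mvdr_weights_from_r(R, a)
```

Since X = QR with Q unitary, X^H X = R^H R, so the triangular solves reproduce the covariance-based MVDR solution while only R ever needs to be computed and stored.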

C. COMPARISON WITH THE LITERATURE
Parallelism and pipelining techniques have been utilized to make the implementation faster and operationally less complex. The MVDR and QRD-MVDR beamformers are equivalent in performance. The FPGA implementation of the PCGR-based QRD on a Kintex-7 FPGA was carried out to compare its results with those from the Research Platform. The resource utilization of the PCGR architecture and that of previous works is tabulated in Table 3. The previous works on MV beamformers compared in the table, [37]-[39], report results comparable to those of our algorithm. The computational complexity of the accelerated MVDR beamformer based on QR decomposition is shown in Table 1.

VIII. CONCLUSION
In this work, an accelerated QRD-MV beamformer using the column-wise Givens rotation algorithm is implemented on the custom transceiver board (Figure 16). We have also implemented the PCGR-based QRD-MV beamformer, which reduces the computational time, as we modified the architecture of the conventional CGR algorithm [21] to port it onto the FPGA. The performance summary of the algorithm is presented in Table 2. For hardware optimization, the computational blocks employed in the PCGR blocks were effectively reused by adding extra memory and control logic. The architecture was implemented on the Verasonics Ultrasound Research Platform; in vitro experiments were carried out, and the image quality was found to be superior to that of the standard methods. It is evident from the implementation that the number of clock cycles required to form a beam depends on the adaptive weight vector generation time, which increases with the number of channels. Real-time performance was investigated on the FPGA prototype lab model, and the results are comparable with those of the MATLAB MVDR model ported on the Verasonics Vantage-64 Ultrasound Research Platform.