A Two-Stage Reconstruction Processor for Human Detection in Compressive Sensing CMOS Radar

Complementary metal-oxide-semiconductor (CMOS) radar has recently attracted much research attention because small, low-power CMOS devices are well suited to sensing nodes in a low-power wireless sensing system. This study focuses on the signal processing of a wireless CMOS impulse radar system that can detect humans and objects in a home-care Internet-of-Things sensing system. The challenges of low-power CMOS radar systems are the weakness of human signals and the high computational complexity of the target detection algorithm. A compressive sensing-based detection algorithm can reduce the computational cost by avoiding matched filters and relaxing the analog-to-digital converter bandwidth requirement. Orthogonal matching pursuit (OMP) is one of the popular signal reconstruction algorithms for compressive sensing radar; however, its complexity is still very high because the high resolution needed to capture human respiration leads to high-dimensional signal reconstruction. Thus, this paper proposes a two-stage reconstruction algorithm for compressive sensing radar. The proposed algorithm not only has about 75% lower complexity than the OMP algorithm but also achieves better positioning performance, especially in noisy environments. This study also designed and implemented the algorithm on a Virtex-7 FPGA chip (Xilinx, San Jose, CA, USA). The proposed reconstruction processor can support a 256 × 13 real-time radar image display with a throughput of 28.2 frames per second.


Introduction
The complementary metal-oxide-semiconductor (CMOS) radar system has recently drawn much research attention because of the great demand for low-power active sensing devices in Internet-of-Things (IoT) systems [1][2][3]. The human-centric home-care system is one of the most promising applications of the CMOS radar system, in which an impulse-radio radar transceiver transmits high-frequency impulses to scan for weak respiration signals and identify the human body in noisy environments [4][5][6]. However, the high-resolution signal increases the dimension of the grid space, which increases the computational complexity of target detection. Thus, target detection and weak human feature extraction are two important signal processing issues for the human-centric CMOS radar system.
Traditional human detection algorithms detect the respiration frequency in either the time domain or the frequency domain [7][8][9]. However, most of them were developed using a simple sinusoidal respiration model. Our previous work [10] designed a CMOS impulse radar system and developed a respiration feature extraction algorithm for detecting fine human features. That single-input single-output (SISO) impulse radar system can detect a single human body at a tunable fixed distance (1D distance) and extract the respiration features based on a four-segment linear waveform model.
Target detection and localization is another signal processing issue for the impulse radar system because of its high computational complexity. Traditional radar systems usually achieve high-accuracy target detection regardless of the computational cost because the radar station is usually equipped with high-performance digital signal processors. For an IoT system, the computational power of the target detection algorithm becomes a critical design issue because of the low-power requirement of wireless sensing nodes. Therefore, many researchers [11,12] applied compressive sensing (CS) reconstruction algorithms to object detection to eliminate high-power matched filters and high-bandwidth analog-to-digital converters. On the other hand, some researchers designed the shape of the transmitted impulse to satisfy the incoherence constraint of CS. Thus, the reconstruction algorithm can rebuild the target scene with better performance and lower complexity than a classical radar in the literature [13,14].
The reconstruction algorithm dominates the signal processing complexity of a CS radar system [15]. $\ell_1$-minimization (basis pursuit) and greedy reconstruction are two typical algorithms. $\ell_1$-minimization [12,13] has robust reconstruction performance but involves very high complexity. Greedy reconstruction algorithms, such as orthogonal matching pursuit (OMP), have worse performance than $\ell_1$-minimization but much lower complexity. Thus, the greedy reconstruction algorithms [16][17][18][19][20][21] are more suitable for low-power hardware implementation. For a given CS-radar-scanned area, higher resolution leads to higher grid density and higher reconstruction complexity. This design issue is especially stringent for a human-centric wireless sensing system because the respiratory vibration is extremely fine and requires a high-resolution radar system.
This work proposes a low-complexity two-stage OMP reconstruction algorithm for a single-input multiple-output (SIMO) CS radar. The contributions of the proposed algorithm are twofold. First, the computational complexity of the proposed algorithm is significantly lower than that of the OMP algorithm, especially when the resolution of the CS radar system increases. Second, the detection performance of the proposed algorithm is better than that of the OMP algorithm, especially in noisy environments. In addition, a reconstruction processor was designed and implemented for real-time radar image display in the SIMO CS radar system.
The rest of this paper is organized as follows. Section 2 introduces the signal model of the CS radar system and the signal reconstruction algorithms for target detection. Section 3 presents the proposed low-complexity reconstruction algorithm. Section 4 analyses the computational complexity and performance of the proposed algorithm. Section 5 presents the FPGA design and implementation of the proposed reconstruction processor. Finally, conclusions are given in Section 6.

Compressive Sensing
The measurement process of a compressive sensing system can be expressed by

$$y = \Phi r, \tag{1}$$

where $y$ is an $M \times 1$ sensing vector; $\Phi$ is an $M \times N$ measurement matrix; and $r$ is an $N \times 1$ vector of the original signal with a higher dimension than $y$. Signal sparsity and incoherence of the sensing matrix are requisites of efficient signal reconstruction [22,23]. $r$ can be described by

$$r = \Psi x, \tag{2}$$

where $\Psi$ is an $N \times N$ basis matrix of the sparse domain and $x$ is an $N \times 1$ coefficient vector in the sparse domain. Using $\Psi$ as the basis, signal $r$ is transformed to the sparse domain as $x$, which has only $K$ non-zero elements, with $K$ much smaller than $N$. Then, the overall compressive sensing signal model can be expressed by

$$y = \Phi \Psi x = A x, \tag{3}$$

where $A = \Phi\Psi$ is the sensing matrix of the CS system.

SISO Compressive Sensing Radar System
In a single-input single-output compressive sensing impulse radar system, the received signal can be expressed by

$$y(t) = \sum_{i} p_i\, s(t - \tau_i), \tag{4}$$

where $s(t)$ is the transmitted impulse signal. The i-th target reflects the transmitted impulse signal with signal gain $p_i$ and propagation delay $\tau_i$. $d_i$ is the distance between the radar and the i-th target and can be calculated by $d_i = \tau_i c / 2$, where $c$ is the speed of light. The received signal can also be rewritten as follows:

$$y(t) = \sum_{i} \alpha_i\, z(t; d_i), \tag{5}$$

where $z(t; d_i)$ is the echo signal of the i-th target and $\alpha_i$ is the signal gain coefficient determined by the path loss and other properties of the target, such as its energy reflection/absorption ratio.
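The delay-to-distance relation and the sum-of-delayed-impulses model above can be illustrated with a minimal sketch. The function names and the rectangular test pulse are illustrative assumptions and are not taken from the paper.

```python
C = 299_792_458.0  # speed of light in m/s


def delay_to_range(tau_s):
    """Target distance d = tau * c / 2 for a round-trip delay tau (seconds)."""
    return tau_s * C / 2.0


def range_to_delay(d_m):
    """Round-trip delay tau = 2 d / c for a target at distance d (metres)."""
    return 2.0 * d_m / C


def received_signal(t, pulse, gains, delays):
    """y(t) = sum_i p_i * s(t - tau_i), with `pulse` playing the role of s(t)."""
    return sum(p * pulse(t - tau) for p, tau in zip(gains, delays))


def rect_pulse(t, width=1e-9):
    """Hypothetical 1 ns rectangular impulse used only for illustration."""
    return 1.0 if 0.0 <= t < width else 0.0
```

With targets at 2 m and 3.5 m, sampling the received signal half a nanosecond after the first echo arrives returns only the gain of the nearer target, because the second echo has not yet arrived.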

MIMO Compressive Sensing Radar System
A multiple-input multiple-output (MIMO) radar system has many advantages over a SISO radar system [24]. The MIMO radar can scan targets in a 2-D space by using phased-array antennas [25]. On the other hand, multiple transmitted impulse signals can be properly designed to reduce the degree of coherence of the compressive sensing system so as to improve the signal reconstruction accuracy [13,14]. The received signal of the MIMO compressive sensing radar can be expressed by

$$y(t) = \sum_{k} \alpha_k\, z(t; \varphi_k, d_k), \tag{6}$$

where $\alpha_k$ is the amplitude coefficient of the k-th target, and $\varphi_k$ and $d_k$ are the direction (angle) and distance of the k-th target, respectively. The echo signal $z(t; \varphi, d)$ is extended from $z(t; d)$ in Equation (5) with an additional dimension $\varphi$.
Assume that a MIMO radar system has $N_R$ receive antennas and $N_T$ transmit antennas, and each receive antenna acquires $N_S$ samples. The echo signal matrix of a target with distance $d$ and angle $\varphi$ is denoted as $Z(t; \varphi, d)$. Echo signal $z(t; \varphi, d)$ in Equation (6) is a vectorized version of the echo signal matrix $Z(t; \varphi, d)$: the matrix is $N_R \times N_S$, and the vector is $N_R N_S \times 1$. The echo signal matrix can be constructed by using a uniformly spaced linear array (ULA) [12] and can be expressed as follows:

$$Z(t; \varphi, d) = a_R(\varphi)\, a_T^T(\varphi)\, S(t - \tau), \tag{7}$$

where $a_R(\varphi)$ is an $N_R \times 1$ receive matrix, $a_T(\varphi)$ is an $N_T \times 1$ transmit matrix, and $S(t - \tau)$ is an $N_T \times N_S$ transmitted signal matrix. The delay $\tau$ can be calculated by using $\tau = 2d/c$. In the transmitted signal matrix $S(t - \tau)$, the i-th row is the delayed transmitted signal $s_i(t - \tau)$ of the i-th transmit antenna. Transmit matrix $a_T(\varphi)$ and receive matrix $a_R(\varphi)$ are given by Equations (9) and (10), respectively, in which $d_T$ and $d_R$ are the normalized spacings (distance divided by wavelength) of the transmit antennas and receive antennas, respectively [12]. According to Equations (8)-(10), echo signal matrix $Z(t; \varphi, d)$ in Equation (7) can be established. Then, echo signal vector $z(t; \varphi, d)$ in Equation (6) can be generated by vectorizing the echo signal matrix $Z(t; \varphi, d)$.
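The construction of the echo signal matrix from the receive matrix, the transmit matrix, and the delayed transmitted signals can be sketched as follows. The steering-vector form used here (phase term $e^{j 2\pi d n \sin\varphi}$ for the n-th antenna, with normalized spacing $d$) is a common ULA convention and is an assumption, as is the row-by-row vectorization order; the paper's own definitions may differ in sign or ordering.

```python
import cmath
import math


def steering_vector(n_ant, spacing, phi_rad):
    """Assumed ULA steering vector: exp(j*2*pi*spacing*n*sin(phi)), n = 0..n_ant-1.

    `spacing` is the antenna spacing normalized by the wavelength.
    """
    return [cmath.exp(2j * math.pi * spacing * n * math.sin(phi_rad))
            for n in range(n_ant)]


def echo_matrix(a_r, a_t, s_delayed):
    """Z = a_R a_T^T S, with a_R (N_R), a_T (N_T) and S (N_T x N_S) as lists."""
    n_s = len(s_delayed[0])
    # Row vector a_T^T S of length N_S: transmit signals combined per sample.
    combined = [sum(a_t[i] * s_delayed[i][n] for i in range(len(a_t)))
                for n in range(n_s)]
    # Outer product with a_R gives the N_R x N_S echo matrix.
    return [[a * c for c in combined] for a in a_r]


def vectorize(z):
    """Stack the rows of Z into one N_R*N_S vector (assumed ordering)."""
    return [v for row in z for v in row]
```

For a SIMO configuration with one transmit antenna, `a_t` reduces to a single element and each row of `Z` is a phase-rotated copy of the delayed transmitted signal.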
The target scene of a MIMO radar system is a discrete $N_r \times N_\theta$ range-azimuth grid. When a target is at the range-azimuth grid point $(\theta_i, r_j)$ within the coverage of the MIMO radar system, that is, $1 \le i \le N_\theta$ and $1 \le j \le N_r$, the received signal can be expressed by

$$y(t) = \sum_{i=1}^{N_\theta} \sum_{j=1}^{N_r} x_{ij}\, z(t; \theta_i, r_j), \tag{11}$$

where $z(t; \theta_i, r_j)$ is the echo signal of the target at $(\theta_i, r_j)$ and $x_{ij}$ is the signal gain coefficient. If there is no target at $(\theta_i, r_j)$, then $x_{ij} = 0$. Object detection and localization amounts to finding the set of non-zero $x_{ij}$ in Equation (11). Equation (11) can be expressed in matrix form as shown in Figure 1. The number of targets, denoted as $K$, is much smaller than the number of grid points ($K \ll N_\theta N_r$). Thus, the matrix form in Figure 1 can be regarded as a CS signal model whose sensing matrix $A$ is composed of the echo signals $z(t; \theta_i, r_j)$. Object detection searches for a set of echo signals in the received signal, and the indices of those echo signals represent the target locations.

Path Loss and Human Respiration Signal Model
Path loss of the travelling electromagnetic wave determines the signal gain coefficient $x_{ij}$. This work utilizes a radar system [10] integrated with a CMOS impulse radar chip [1] to measure the path loss parameters of metallic objects and human bodies, as shown in Figure 2a,b. The CMOS radar system transmits a 25 dBm, 1 GHz sinusoidal impulse with a 10 MHz repetition frequency, resulting in 0.94 mm range resolution. Figure 3a shows the measured path loss versus logarithmic radial distance. The power attenuation of a human is more than 20 dB greater than that of a metallic object due to the high energy absorption ratio of the human body. The 0.94 mm resolution is fine enough to capture tiny respiratory information. This study adds four-segment linear waveform (FSLW) [10] respiration signals to the echo signals of human targets to simulate practical human respiration.
Based on the measurement results, this study constructs a SIMO impulse radar system with 4 receive antennas and 1 transmit antenna, which are configured as shown in Figure 3b. Each receive antenna receives 128 samples in each iteration. The maximum detection range is 5 m, and the direction (angle) is between +45 and −45 degrees. The range dimension has $N_r = 256$ grid points and the angle dimension has $N_\theta = 13$ grid points, that is, the target scene forms a 256 × 13 range-azimuth grid. The sensing matrix of the CS radar system is a 512 × 3328 matrix. $a_T(\varphi)$ and $a_R(\varphi)$ are determined by Equations (9) and (10) with $N_T = 1$ and $N_R = 4$, respectively. The receive antenna spacing $d_R$ is $\lambda/2$. There are three targets in the scene. One is a non-human object and the others are human bodies, which are modelled by the FSLW respiration model [10] with intensity $A = 0.01$ m, inspiration speed $\beta_1 = 0.5$, expiration speed $\beta_2 = 0.5$, and respiration holding ratio $X = 0.5$. The respiration rates of the two persons are 0.5 Hz and 0.8 Hz, respectively. The non-human target is located at distance 4.5 m and angle 22.5°. One person is located at distance 2 m and angle −7.5°, and the other is located at distance 3.5 m and angle 30°.
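The stated sensing matrix size follows directly from the antenna, sample, and grid counts. The short check below reproduces that arithmetic; the uniform angle grid that includes both ±45° endpoints is an assumption made for illustration, chosen because it places all three simulated target angles on grid points.

```python
def scene_dimensions(n_rx=4, n_samples=128, n_range=256, n_angle=13):
    """Return (M, N): rows and columns of the sensing matrix."""
    m = n_rx * n_samples      # stacked samples from all receive antennas
    n = n_range * n_angle     # one column per range-azimuth grid point
    return m, n


def angle_grid(n_angle=13, lo=-45.0, hi=45.0):
    """Assumed uniform angle grid in degrees, endpoints included."""
    step = (hi - lo) / (n_angle - 1)
    return [lo + i * step for i in range(n_angle)]


m, n = scene_dimensions()
print(m, n)                              # 512 3328
print(angle_grid()[1] - angle_grid()[0]) # 7.5
```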

Reconstruction Algorithms for Compressive Sensing
Orthogonal matching pursuit (OMP) [26] is one of the popular greedy reconstruction algorithms for compressive sensing. For a compressive sensing framework $y = Ax$, $y$ is the $M \times 1$ measurement vector, $A$ is the $M \times N$ sensing matrix, and $x$ is the $N \times 1$ signal vector. First, the OMP algorithm searches the support set iteratively according to the matching results by

$$i_k = \arg\max_{i} \left| A_i^T r_{k-1} \right|, \qquad r_0 = y,$$

where $A_i$ is the i-th column of $A$. The determined index with the maximum matched gain is added to the support set $I_k$ of the k-th iteration. Then, the estimated signal vector of the k-th iteration, denoted as $\hat{x}_k$, can be reconstructed by

$$\hat{x}_k = A_{I_k}^{\dagger}\, y = \left( A_{I_k}^T A_{I_k} \right)^{-1} A_{I_k}^T\, y.$$

Then, the residual of the k-th iteration can be calculated by $r_k = y - A_{I_k} \hat{x}_k$. The residual $r_k$ is then used to match the index of the next iteration by

$$i_{k+1} = \arg\max_{i} \left| A_i^T r_k \right|.$$

Algorithm 1 shows the pseudo code of the OMP algorithm.

Algorithm 1 Orthogonal Matching Pursuit Algorithm
Input: Sensing matrix A, measurement y, target sparsity K
Output: Support set $I_k$, estimate $\hat{x}_k$
1: Initialization:
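The matching, least-squares estimation, and residual update described for OMP can be sketched in plain Python as follows. This is a minimal illustration for real-valued data and is not the paper's listing: the column-list representation, the exclusion of already selected indices, and the small Gaussian-elimination helper `solve` are all assumptions made to keep the example self-contained.

```python
def solve(g, b):
    """Solve g x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(g)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def omp(cols, y, k):
    """OMP on a matrix given as a list of columns; returns (support, x_hat)."""
    support, x_hat, residual = [], [], list(y)
    for _ in range(k):
        # Matching: pick the unselected column with the largest |A_i^T r|.
        scores = [0.0 if j in support else abs(dot(c, residual))
                  for j, c in enumerate(cols)]
        support.append(max(range(len(cols)), key=scores.__getitem__))
        # Estimation: solve the normal equations on the current support.
        sub = [cols[j] for j in support]
        gram = [[dot(u, v) for v in sub] for u in sub]
        x_hat = solve(gram, [dot(u, y) for u in sub])
        # Residual: remove the part of y explained by the selected columns.
        residual = [yi - sum(x * c[i] for x, c in zip(x_hat, sub))
                    for i, yi in enumerate(y)]
    return support, x_hat
```

Solving the normal equations from scratch in every iteration is the step whose cost grows with the support size, which is what the matrix inversion bypass in the next subsection targets.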

Orthogonal Matching Pursuit via Matrix Inversion Bypass
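The computational bottleneck of the OMP algorithm is the matrix inversion in Step 5 of Algorithm 1. Thus, this study uses the Schur-Banachiewicz block-wise inversion [27,28] to reduce the complexity. The derived OMP algorithm is called the orthogonal matching pursuit via matrix inversion bypass (OMP-MIB) [29]. We denote the $N \times N$ matrix $G = A^T A$ and $G_{I,J} = A_I^T A_J$, where $I$ and $J$ are any two support sets. Thus, the estimated signal vector can be rewritten as

$$\hat{x}_k = G_{I_k,I_k}^{-1} A_{I_k}^T\, y, \tag{13}$$

where $G_{I_k,I_k}$ is a $K \times K$ matrix. Schur-Banachiewicz block-wise inversion is then applied to Equation (13). Hence, the matrix inversion in Equation (13) can be expressed by

$$G_{I_k,I_k}^{-1} = \begin{bmatrix} G_{I_{k-1},I_{k-1}} & G_{I_{k-1},i_k} \\ G_{i_k,I_{k-1}} & G_{i_k,i_k} \end{bmatrix}^{-1} = \begin{bmatrix} G_{I_{k-1},I_{k-1}}^{-1} + V W^T W & -V W^T \\ -V W & V \end{bmatrix}, \tag{14}$$

where

$$W = G_{i_k,I_{k-1}}\, G_{I_{k-1},I_{k-1}}^{-1}$$

and $V$ is a scalar:

$$V = \left( G_{i_k,i_k} - W\, G_{I_{k-1},i_k} \right)^{-1}.$$

Equation (14) shows that the matrix inversion of $G_{I_k,I_k}$ can be replaced by several matrix operations on the inverse from the previous iteration. Since $V$ is a simple reciprocal operation, no matrix inversion operation is required in Equation (14). The pseudo code of the OMP-MIB algorithm is shown in Algorithm 2.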

Algorithm 2 Orthogonal Matching Pursuit via Matrix Inversion Bypass
Input: Sensing matrix A, measurement y, target sparsity K
Output: Reconstructed signal $\hat{x}$, support set $I_k$
1: Initialization:
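The idea behind the inversion bypass, growing the inverse of the Gram sub-matrix by one row and column using only a scalar reciprocal, can be sketched with the standard Schur-complement identity for a symmetric matrix. The function name and argument layout are illustrative assumptions; the sketch covers only the inverse update, not the full OMP-MIB listing.

```python
def grow_inverse(g_inv_prev, g_cross, g_diag):
    """Inverse of [[G_prev, g], [g^T, d]] computed from G_prev^{-1}.

    g_inv_prev: inverse of the previous (k-1) x (k-1) Gram block (symmetric)
    g_cross:    the k-1 cross terms between old support columns and the new one
    g_diag:     the diagonal Gram entry of the newly selected column
    """
    n = len(g_cross)
    # W = g^T G_prev^{-1}, a row vector of length k-1.
    w = [sum(g_cross[r] * g_inv_prev[r][c] for r in range(n)) for c in range(n)]
    # V = 1 / (d - W g): the only division, a scalar reciprocal.
    v = 1.0 / (g_diag - sum(wi * gi for wi, gi in zip(w, g_cross)))
    # Assemble [[G_prev^{-1} + V W^T W, -V W^T], [-V W, V]].
    top = [[g_inv_prev[r][c] + v * w[r] * w[c] for c in range(n)] + [-v * w[r]]
           for r in range(n)]
    bottom = [-v * wi for wi in w] + [v]
    return top + [bottom]


# Grow the inverse of [[2, 1, 0], [1, 2, 1], [0, 1, 2]] one index at a time.
inv1 = [[0.5]]
inv2 = grow_inverse(inv1, [1.0], 2.0)
inv3 = grow_inverse(inv2, [0.0, 1.0], 2.0)
```

Each growth step costs a matrix-vector product and an outer product on the previous inverse, so no explicit inversion of the enlarged block is needed.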

Two-Stage Reconstruction Algorithm
The complexity of the OMP reconstruction algorithm grows with the sensing resolution and the dimension of the sensing matrix. To reduce the high reconstruction complexity of the CMOS impulse radar system, this study proposes a two-stage OMP reconstruction algorithm consisting of block-wise OMP estimation, a weight updating and decision mechanism, and fine estimation. Figure 4a shows the processing flow chart of the proposed algorithm.

Block-Wise OMP Estimation
The proposed algorithm first performs block-wise OMP, which reduces the sensing matrix size, to estimate the rough target regions. In the CS radar, each column of the sensing matrix $A$ represents the echo signal of a specific grid point. Because neighbouring grid points have similar echo signals, $A$ can be separated into several blocks, each of which is the concatenation of neighbouring columns. Thus, the target locations can be coarsely estimated by the block-wise OMP estimation. The block-wise OMP algorithm first downsizes $A$ into a new sensing matrix $D$ by $D_j = A_{(j-1)B+1} + \cdots + A_{jB}$, where $B$ is the block size. In other words, the j-th column of the downsized sensing matrix $D$ is the sum of the columns in the j-th block of $A$. The block-wise OMP estimation can be regarded as OMP using $D$ as the sensing matrix. The detailed block-wise OMP algorithm is shown in Algorithm 3. The block-wise OMP estimation selects several block candidates and stores their indices in a candidate set $I_{block}$ for the following weight updating.

Algorithm 3 Block-Wise Estimation Algorithm
Input: Sensing matrix A, measurement y, target sparsity K, block size B
Output: Support set $I_k$
1: Initialization:
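The downsizing step, in which each column of the coarse sensing matrix is the sum of one block of neighbouring columns, can be sketched as follows. The column-list representation, the 0-based indexing, and the requirement that the number of columns be a multiple of the block size are assumptions made for illustration.

```python
def downsize(cols, block_size):
    """Return the columns of D, where D_j is the sum of the j-th block of A."""
    if len(cols) % block_size:
        raise ValueError("number of columns must be a multiple of block_size")
    return [[sum(vals) for vals in zip(*cols[s:s + block_size])]
            for s in range(0, len(cols), block_size)]


def block_columns(block_index, block_size):
    """0-based column indices of A that belong to a 0-based block index."""
    start = block_index * block_size
    return list(range(start, start + block_size))


cols = [[1, 0], [2, 1], [0, 3], [1, 1]]   # four columns of a 2 x 4 matrix
print(downsize(cols, 2))                   # [[3, 1], [1, 4]]
print(block_columns(1, 2))                 # [2, 3]
```

Running OMP on the output of `downsize` yields block indices, and `block_columns` maps a selected block back to the original columns that the fine estimation would search.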

Weight Updating
Each block has a weight representing the possibility of having targets within the block. The proposed algorithm calculates the weights from historical coarse estimation results. The weighting mechanism exploits the fact that blocks near a known object are more likely to contain objects, even in a noisy environment. Thus, the proposed algorithm collects the coarse estimation results of previous iterations to evaluate the block weights. The weight updating is expressed by $Z(I_{block}) = Z(I_{block}) + 1$, where $I_{block}$ is the candidate set selected by the coarse estimation. The size of $I_{block}$ is determined by the number of block candidates $K_b$ of the compressive sensing system. The algorithm adopts several sensing matrices $A$ from the UWB radar system and skips the fine estimation in the first $T_{num}$ iterations, which form the training mode. This accumulation increases the reliability of the weight distribution. The algorithm usually needs only one or two training iterations in low-noise conditions.
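The weight update, in which every block chosen by the coarse estimation has its weight incremented and the weights persist across iterations, can be sketched as follows. The function names and the list-of-candidate-sets input are hypothetical and serve only to illustrate the accumulation over a few training iterations.

```python
def update_weights(weights, candidate_blocks):
    """Apply Z(I_block) = Z(I_block) + 1 for every candidate block index."""
    for j in candidate_blocks:
        weights[j] += 1
    return weights


def accumulate(coarse_results, n_blocks):
    """Accumulate weights over successive coarse estimation results."""
    weights = [0] * n_blocks
    for candidates in coarse_results:
        update_weights(weights, candidates)
    return weights


# Block 1 is selected in all three iterations, so it gains the largest weight.
print(accumulate([[1, 4], [1, 5], [1, 4]], n_blocks=6))  # [0, 3, 0, 0, 2, 1]
```

A block that is picked only once, such as block 5 here, keeps a small weight and is the kind of candidate the later threshold decision can reject.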

Decision Strategy for Fine Estimation
After updating the weights, the algorithm selects the blocks to be processed by the fine estimation according to the decision strategy shown in Figure 4b. The first step is weight sorting: the blocks are sorted in descending order of their weights. The second step is block merging. Since the size of a human body is much larger than the respiratory spatial variation, neighbouring selected blocks could represent the same target. Thus, this step merges the block candidates within a merging distance $M_g$ into the block with the largest weight. The merged blocks are then discarded and their weights are added to the weight of the winner block. The final step selects the remaining candidate blocks by comparing the normalized weights with a specified threshold $T_h$. After the threshold decision, the fine estimation combines the columns of the selected blocks into a sensing matrix to perform the fine object positioning. Notice that the block merging in the decision process relies on the body being much larger than the respiratory spatial variation, because the radar ranging resolution is much finer than the body size. Thus, a sparsity $K$ no larger than $K_b$ is sufficient in the fine estimation OMP to locate the detailed positions of the human bodies or objects. The following simulation and hardware design assume $K = K_b$ without loss of generality.
After the locations of targets are estimated by the OMP estimation, the algorithm returns to the next iteration. Notice that the block weights are not reset and remain the same at the beginning of the next iteration. The detailed two-stage reconstruction algorithm is shown in Algorithm 4.

Algorithm 4 Proposed Two-Stage Reconstruction Algorithm
      2. if Z_sort(i) > Z_sort(j) then
      3.   Z_sort(i) = Z_sort(i) + Z_sort(j)
      4.   and remove I_sort(j) from I_sort
      5. else
      6.   Z_sort(j) = Z_sort(j) + Z_sort(i)
      7.   and remove I_sort(i) from I_sort
      8. end if
      9. end for
   (C) Selection:
      1. Threshold calculation: t = T_h × sum(Z)
      2. I_select = I_sort(1 : K)
      3. Removing I_select(i) from I_select for every i that satisfies Z_sort(i) < t
9: Fine positioning: I_out = OMP(A_{I_select}, y_k, K);
10: Output I_out, and let k = k + 1;
11: Goto Step 2 in order to acquire new received signal;
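The sort, merge, and threshold steps of the decision strategy can be sketched as below. This is a simplified illustration rather than the listing above: it assumes block distance is measured on a 1-D block index, merges each block into the first heavier block found within the merging distance, and skips blocks with zero weight. The names `decide`, `merge_dist`, and `threshold_ratio` are hypothetical stand-ins for the decision step, $M_g$, and $T_h$.

```python
def decide(weights, k, merge_dist, threshold_ratio):
    """Return the block indices passed on to the fine estimation."""
    # Sorting: non-zero blocks in descending order of weight.
    order = sorted((j for j, z in enumerate(weights) if z > 0),
                   key=lambda j: -weights[j])
    # Merging: a block within merge_dist of a heavier winner gives its
    # weight to that winner and is discarded.
    merged = {}
    for j in order:
        winner = next((w for w in merged if abs(w - j) <= merge_dist), None)
        if winner is None:
            merged[j] = weights[j]
        else:
            merged[winner] += weights[j]
    # Selection: keep at most k winners whose weight reaches t = T_h * sum(Z).
    t = threshold_ratio * sum(weights)
    ranked = sorted(merged, key=lambda j: -merged[j])[:k]
    return [j for j in ranked if merged[j] >= t]


weights = [0, 5, 3, 0, 0, 0, 1, 0, 4, 0]
print(decide(weights, k=3, merge_dist=1, threshold_ratio=0.1))  # [1, 8]
```

In this example block 2 is absorbed by its heavier neighbour block 1, and the isolated low-weight block 6 falls below the threshold, so only two blocks reach the fine estimation.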

Complexity and Performance Analysis
This section analyses the computational complexity of the conventional OMP algorithm, the OMP via MIB algorithm, and the proposed two-stage OMP algorithms with and without MIB. The complexity analysis is based on the number of multiplications of several matrix operations. The complexity of the matrix inversion for an $n \times n$ matrix is $O(n^3)$. The complexity of the matrix multiplication $A \times B$ is $O(mnk)$, where $A$ is an $m \times n$ matrix and $B$ is an $n \times k$ matrix. The complexity of the matrix pseudo-inverse $A^{\dagger}$ is $O(n^3 + 2mn^2)$.

Orthogonal Matching Pursuit Algorithm
Algorithm 1 shows the OMP algorithm. The computational complexity of the matrix pseudo-inverse and multiplication in Step 5 is $O(k^3 + 2mk^2 + mk)$. The complexities of Step 6 and Step 7 are $O(mk)$ and $O((n - k)m)$, respectively. Thus, the total complexity of the OMP algorithm is $O(k^3 + 2mk^2 + nm + mk)$. For a fair comparison with the OMP-MIB algorithm, the matrix pseudo-inverse is counted with $mk^2$ instead of $2mk^2$ multiplications. Thus, the total complexity of the OMP algorithm is $O(k^3 + mk^2 + nm + mk)$.

Proposed Two-Stage OMP Reconstruction Algorithm
The major complexity of the proposed two-stage reconstruction algorithm comes from the OMP calculations for both coarse and fine positioning. The block-wise estimation is an OMP algorithm for a downsized $m \times n_c$ sensing matrix $D$. Thus, the complexity of the coarse positioning is $O(k^3 + mk^2 + n_c m + mk)$. For block size $b$, $n_c$ can be expressed as $n_c = n/b$. Thus, the complexity of the coarse positioning becomes $O(k^3 + mk^2 + \frac{n}{b} m + mk)$. The fine positioning uses an $m \times n_f$ sensing matrix to perform the OMP algorithm for the selected blocks. Thus, its complexity is $O(k^3 + mk^2 + n_f m + mk)$, and $n_f$ can be expressed as $n_f = kb$. Hence, the complexity of the fine positioning becomes $O(k^3 + mk^2 + kbm + mk)$. The total complexity of the proposed two-stage reconstruction algorithm is $O(k^3 + mk^2 + \frac{n}{b} m + mk) + O(k^3 + mk^2 + kbm + mk)$.
In addition, both the coarse and fine positioning can be realized by using the OMP-MIB algorithm. In that case, the complexities of the coarse and fine positioning are $O(k^2 + n_c k + mk)$ and $O(k^2 + n_f k + mk)$, respectively. Substituting $n_c = n/b$ and $n_f = kb$ gives $O(k^2 + \frac{n}{b} k + mk)$ for the coarse positioning and $O((b + 1)k^2 + mk)$ for the fine positioning.
Figure 5a shows the complexity analysis results with the sparsity factor $k = 3$ and block size $b = 4$. The complexity of the two-stage OMP algorithm is much lower than that of the OMP algorithm. The OMP via MIB algorithm can further reduce the complexity of the OMP and two-stage OMP algorithms, as shown in Figure 5b. In this case, the proposed two-stage algorithm still has lower complexity. Figure 6 shows the complexity for various block sizes. Figure 6b shows the complexity of the OMP-MIB and the proposed two-stage OMP-MIB algorithms separately, since their curves are too close to distinguish in Figure 6a. The complexity of the proposed algorithm decreases as the block size increases because the dimension of the sensing matrix of the block-wise estimation decreases. The proposed two-stage OMP reconstruction with block size $b = 4$ reduces the complexity by approximately 75% compared with the conventional OMP algorithm. Figure 6b shows that the proposed two-stage OMP-MIB algorithm with block size $b = 4$ reduces the complexity by approximately 50% compared with the original OMP via MIB algorithm. The complexity reduction increases further when the radar resolution increases, that is, when the dimension of the original sensing matrix $A$ increases. This is a great benefit to human detection because the impulse radar usually requires very high radio frequency and very high spatial resolution in order to capture the tiny movement of human respiration.
Figure 7 shows the positioning results of the OMP and two-stage OMP algorithms. 500 impulses are used to simulate the CS radar processing under different SNR conditions. The sensing matrix $A$ is derived from the MIMO compressive sensing model in Sections 2.2-2.4. The "estimate" points are the detected targets, which are sent to the respiration feature extraction algorithm [10] to identify the humans and the object. When the SNR is 24 dB (Figure 7a,d), both the OMP and two-stage OMP algorithms can perfectly locate the object and humans. When the SNR is 6 dB (Figure 7b,e), the OMP algorithm generates a few incorrect estimation results of the object and human locations, but the proposed algorithm still perfectly locates the object and humans. When the SNR is −6 dB (Figure 7c,f), the proposed algorithm can still detect one human target with some incorrect results and detects the non-human target perfectly. The proposed two-stage reconstruction algorithm has better positioning performance than the conventional OMP algorithm, especially in noisy environments.
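The complexity expressions of this section can be evaluated numerically to check the quoted reductions. The sketch below treats the big-O expressions as plain multiplication counts with unit constants and assumes $m = 512$ and $n = 3328$ from the simulated radar system together with $k = 3$ and $b = 4$; the settings behind the paper's figures may differ, so the numbers are indicative only.

```python
def omp_cost(m, n, k):
    """k^3 + m k^2 + n m + m k, the OMP expression used for comparison."""
    return k**3 + m * k**2 + n * m + m * k


def omp_mib_cost(m, n, k):
    """k^2 + n k + m k, the OMP-MIB expression."""
    return k**2 + n * k + m * k


def two_stage_cost(m, n, k, b, cost):
    """Coarse stage on n/b columns plus fine stage on k*b columns."""
    return cost(m, n // b, k) + cost(m, k * b, k)


def reduction(m, n, k, b, cost):
    """Fractional saving of the two-stage scheme over the single-stage one."""
    return 1.0 - two_stage_cost(m, n, k, b, cost) / cost(m, n, k)


m, n, k, b = 512, 3328, 3, 4
print(round(reduction(m, n, k, b, omp_cost), 3))      # 0.74
print(round(reduction(m, n, k, b, omp_mib_cost), 3))  # 0.512
```

Under these assumptions the savings come out near 74% without MIB and near 51% with MIB, in line with the approximate 75% and 50% figures stated above.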

Performance Analysis
This study uses three metrics for a detailed analysis of the reconstruction and detection results: normalized mean square error (NMSE), number of hits, and hit ratio. These metrics show some properties that are not revealed in Figure 7.
The normalized mean squared error of the signals reconstructed by the CS radar is defined by

$$\mathrm{NMSE} = \frac{\left\| y_{true} - y_{estimate} \right\|_2^2}{\left\| y_{true} \right\|_2^2}, \tag{17}$$

where $y_{true}$ is the received radar signal and $y_{estimate}$ is the signal reconstructed by the OMP or the proposed algorithm. The number of hits is the number of correct estimates in the simulation. The hit ratio is the detection probability defined by

$$\text{Hit Ratio} = \frac{\text{Number of Correct Estimates}}{\text{Number of Total Estimates}}. \tag{18}$$

Figure 8a shows the NMSE versus SNR performance of the OMP and two-stage OMP algorithms with different block sizes. The proposed two-stage OMP algorithm has worse NMSE performance because it uses the threshold mechanism to discard bad estimates, which distorts the reconstructed signal. However, the NMSE does not necessarily reflect the detection and positioning performance, as shown in Figure 7: a lower NMSE value only means that the signal is better reconstructed by the OMP process, even when the reconstructed signal has an incorrect combination of indices, that is, an incorrect support set $I_K$. The proposed algorithm with $b = 6$ has better NMSE than that with $b = 4$ because the threshold mechanism with $b = 4$ is stricter than that with $b = 6$, since the block weights are not normalized, as mentioned in Section 3. Hence, with $b = 6$ the estimated results reach the threshold more easily, leading to lower NMSE. This is similar to the reason for the better NMSE of the OMP algorithm. Figure 8b shows that the hit ratio of the two-stage OMP is better than that of the OMP. This implies that the matching pursuit quality of the two-stage OMP algorithm is better than that of the OMP algorithm.
Figure 9a shows the NMSE versus SNR with different thresholds. The OMP has lower NMSE than the proposed algorithm because of the threshold mechanism. A higher threshold of the proposed algorithm causes a higher NMSE value because more matched results are discarded. Figure 9b shows the hit ratio of the algorithms for different thresholds. A lower threshold leads to a larger number of hits. Threshold $T_h = 1\%$ gives the highest number of hits in the analysis of the proposed algorithm, but the hit ratio is unstable when the SNR is lower than 10 dB.
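The NMSE and hit metrics can be computed with a few lines of code. The sketch below assumes the common NMSE definition (squared error energy divided by the energy of the true signal) and represents each estimate as a hypothetical (range index, angle index) pair; neither detail is confirmed by the source.

```python
def nmse(y_true, y_est):
    """Squared error energy normalized by the energy of y_true."""
    num = sum(abs(a - b) ** 2 for a, b in zip(y_true, y_est))
    den = sum(abs(a) ** 2 for a in y_true)
    return num / den


def hit_stats(estimates, truths):
    """Return (number of hits, hit ratio) for a list of estimated grid points."""
    hits = sum(1 for e in estimates if e in truths)
    ratio = hits / len(estimates) if estimates else 0.0
    return hits, ratio


truths = {(10, 3), (40, 7), (70, 11)}
estimates = [(10, 3), (40, 7), (99, 1)]
print(nmse([3.0, 4.0], [3.0, 0.0]))   # 0.64
print(hit_stats(estimates, truths))
```

Because the hit ratio is normalized by the number of estimates actually reported, discarding doubtful estimates through a threshold can raise the hit ratio while lowering the number of hits and worsening the NMSE, which is the trade-off discussed above.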

Architecture Design and Implementation
This section introduces the hardware design and implementation of the proposed two-stage reconstruction algorithm. The OMP processing for coarse and fine positioning is the most complex and time-consuming step. Thus, an OMP processor was designed to accelerate the reconstruction processing. Figure 10 shows the overall architecture, which can be divided into two parts: the Matching Result Update and Index Selection Unit, and the Parameter Update Unit. The gray blocks are data memories, including ROMs for $G$ and the sensing matrix, a single-port SRAM for $\alpha_{init}$, and a dual-port SRAM for $\alpha$. G_inv, GIK_IK, $\alpha_{init,I_k}$, and the support set $I_k$ are register files. GIK_IK and $\alpha_{init,I_k}$ are buffers for parallel data streams, and $I_k$ stores the support set. Initially, the $\alpha$ memory serves as a register and first loads the received signal for calculating the matching result; the initialization steps are then executed by the matching result update circuit and the index selection circuit. After initialization, GIK_IK and $\alpha_{init,I_k}$ are updated with the specific values from the $G$ ROM and the $\alpha_{init}$ RAM that are needed to compute the parameters $W$, $V$, and $U$, which are calculated sequentially by the parameter update unit. Then $G^{-1}_{I_k,I_k}$ and $\alpha_k$ are updated. The support set is also updated while $\alpha_k$ is being updated. Then the algorithm goes to the next iteration.
Figure 10. Block diagram of the proposed architecture.
Figure 11 shows the matching result update circuit. The Parallel MAC consists of $N_{mul}$ multiply-and-accumulate circuits (MACs) that determine the speed of matrix multiplication. The number of subtractor circuits in the Parallel Subtractor is also $N_{mul}$.

Initialization and Matching Result Update Circuit
The matching result update circuit not only calculates the matching result but also performs the initialization of the OMP-MIB algorithm. In the initialization phase, the parallel MAC calculates the initial matching result first. Because of the extremely high dimension of the sensing matrix and received signals, the parallel MAC cannot complete the vector-vector multiplication for a column of the sensing matrix in a single clock cycle. Thus, the partial results are accumulated by a summation circuit and registers. For an $M \times N$ sensing matrix, the matching result update circuit produces one element of $\alpha_0$ every $M / N_{mul}$ clock cycles. Since the elements of the initial matching result vector $\alpha_0$ are computed sequentially, the index selection circuit tracks the maximum value of the initial matching result vector one element at a time, which reduces the complexity of the index selection circuit. The resultant $\alpha_0$ is then stored in memory $\alpha_{init}$. The calculation of the initial matching result can be finished in $N M / N_{mul}$ clock cycles.
In the initialization phase, index $i_1$ is computed by the index selection circuit, and $G^{-1}_{I_1,I_1}$ is computed by the reciprocal circuit. After $G^{-1}_{I_1,I_1}$ is calculated, $\hat{x}_{I_1}$ is produced by the parallel MAC and stored in $x_{init}$, which will be used for the calculation of the new matching result $\alpha_1$ by the parallel MAC and parallel subtractor circuits. Figure 11a shows the processing flow of the initialization in the data path circuit.
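The cycle counts quoted for the initial matching result can be reproduced with a small calculation. The value $N_{mul} = 64$ used below is a hypothetical choice for illustration, since the source does not state the number of MAC circuits here, and rounding up when $M$ is not a multiple of $N_{mul}$ is likewise an assumption.

```python
def init_matching_cycles(m, n, n_mul):
    """Cycles per element of alpha_0 and for the whole initial matching result."""
    per_element = -(-m // n_mul)   # ceil(M / N_mul) using integer arithmetic
    return per_element, n * per_element


# Coarse-stage matrix of size 512 x 832 with a hypothetical 64 parallel MACs.
print(init_matching_cycles(512, 832, 64))   # (8, 6656)
```

Doubling the number of MAC circuits halves both counts, which shows how $N_{mul}$ trades hardware area against initialization latency.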
The matching result update circuit updates α k based on α k−1 and matrices W, V, U. Dual-port SRAM is used for simultaneous reading and writing of αs. α k is obtained by subtracting α k−1 by the parallel MAC output of the matrix multiplication GĨ k ,I k × VUW T −VU . Parallel MAC calculates N mul elements of matrix multiplication operation in every k clock cycles, where k is the iteration counter and N mul is the number of MAC circuits. The N mul partial results from parallel MAC subtract the matching result vector α k−1 concurrently by parallel subtractor circuits to obtain N mul elements of α k . There are N − k elements in the vector α k for a M × N sensing matrix and N − k >> N mul . Therefore, the matching result updating is the most time consuming step in the algorithm. Figure 11b shows the processing flow of matching result updating process.  Figure 12 shows the index selection circuit. At initialization stage, the matching result update circuit calculates the elements of initial matching result vector α 0 sequentially, and then index selection circuit compares and selects the index of the elements with maximum matching result. At the iteration stage, the matching result update circuit outputs N mul matching results concurrently. Therefore, index selection circuit compares these N mul matching results by using maximum circuit. Maximum matching result of the maximum circuit is further compared with the previous maximum matching result. The larger result is preserved for the next comparison. Figure 12 shows processing flows of index selection circuit for the initialization and common matching result update processes.  Figure 13 shows the architecture of parameter update unit, which consists two circuits to update the matrices W, V, U. The vector multiplier and parallel multiplier both use K − 1 multipliers to execute vector multiplication and parallel multiplication. Figure 13a shows the parameter update circuit for W, V, and U. 
In the k-th iteration, W is first calculated in k − 1 clock cycles. Then, V and U are computed in one clock cycle. After W, V, and U are updated, the matrix multiplication circuits shown in Figure 13b compute the products VW, $VUW^T$, and $VW^TW$, which are used to update the new matching result $\alpha_k$ and the $G^{-1}_{I_k,I_k}$ stored in the memory G_inv. First, VW is calculated in one clock cycle. Then, VU, $VUW^T$, and $VW^TW$ are calculated concurrently: VU and $VUW^T$ require one clock cycle, and $VW^TW$ requires k − 1 clock cycles. After these matrices are updated, the new matching result $\alpha_k$ and the index $i_k$ are determined by the matching result update circuit and the index selection circuit. Details of several processing modules in the block diagrams of the aforementioned circuits, such as the parallel MACs, parallel subtractors, parallel multipliers, and vector multiplier, can be found in our previous works [30,31].
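The cycle counts quoted above can be combined into a rough per-iteration estimate. This is a back-of-the-envelope sketch: it assumes k ≥ 2, that V and U together take a single cycle, and that pipeline and memory latencies are ignored. The function names and the value $N_{mul}$ = 32 used in the note below are hypothetical and are not taken from the paper.

```python
from math import ceil

def parameter_update_cycles(k):
    """Approximate cycles to update W, V, U and their products in iteration k (k >= 2)."""
    w_cycles = k - 1                  # W takes k - 1 cycles
    vu_cycles = 1                     # V and U in one cycle (assumed shared)
    vw_cycles = 1                     # VW in one cycle
    products = max(1, k - 1)          # VU, VUW^T, VW^T W run concurrently
    return w_cycles + vu_cycles + vw_cycles + products

def matching_update_cycles(k, n, n_mul):
    """Approximate cycles to update the N - k matching results in iteration k."""
    return ceil((n - k) / n_mul) * k  # n_mul elements every k cycles
```

For example, with k = 8, N = 832, and the hypothetical $N_{mul}$ = 32, the parameter update takes about 16 cycles while the matching result update takes about 208, which is consistent with the matching result update being the dominant step.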

Implementation Results and Comparison
The proposed two-stage OMP processor was designed and implemented with a combination of software and a Xilinx Virtex-7 FPGA (Xilinx, San Jose, CA, USA). The proposed two-stage reconstruction algorithm was realized in Matlab software (MathWorks, Natick, MA, USA), except that the pure OMP processing functions (Lines 3.(B) and 9 in Algorithm 4) are accelerated by the FPGA hardware. The original CS radar system has a 512 × 3328 sensing matrix derived from the MIMO compressive sensing model in Section III B, C, and D, with b = 4, k = 8, and SNR = −20 dB. That is, the coarse estimation performs OMP reconstruction for a 512 × 832 sensing matrix, and the fine estimation performs OMP reconstruction for a 512 × 32 sensing matrix. The OMP processor was implemented for both the 512 × 832 and 512 × 32 sensing matrices. In the fixed-point simulation, the number of correctly selected indices in $I_k$ is used to determine the number of fractional bits, with 13 integer bits, for the signals in memory. Table 1 shows the hardware resources utilized in the proposed processor. The processing time for the reconstruction of a 512 × 3328 CS matrix is about 35 ms. The corresponding radar image frame rate is about 28.2 frames per second, which approaches the typical video frame rate.
Table 2 compares the proposed OMP processor with other reconstruction processors in the literature. The major differences of the proposed processor are the large compressive sensing matrix size and the targeted application. The CS radar system has an extremely large CS matrix dimension in order to provide the very high radar image resolution needed to detect the very fine human respiration signal. Thus, directly implementing such a high-dimension signal reconstruction processor for the CS radar system is difficult, if not impossible. The proposed two-stage reconstruction algorithm reduces the complexity by reducing the CS matrix dimension in the coarse positioning and still maintains high-precision reconstruction through the fine positioning.
Although traditional reconstruction processors have small processing latencies, they only support much lower dimensions than the proposed two-stage OMP processor. If these traditional algorithms were applied to a compressive sensing radar system with such a high dimension, a real-time processor for the CS radar system could not be realized. In terms of the application, the proposed processor supports 28.2 frames per second for real-time radar image display, a refresh rate that is sufficient for human visual perception.

Conclusions
This study proposes a two-stage reconstruction algorithm for compressive sensing radar. The proposed algorithm has better positioning performance and lower complexity than the traditional OMP algorithm. The OMP can be replaced by any reduced-complexity algorithm, such as OMP via MIB, and the proposed two-stage reconstruction remains effective in reducing cost and improving performance. This work applied a practically measured path-loss model and a human respiratory signal model from a CMOS impulse radar system to configure a 4 × 1 SIMO radar system for analysing the proposed algorithm. The proposed algorithm utilizes block-wise OMP estimation, a weight threshold mechanism, and fine estimation to reduce the complexity and improve the performance. The simulation results show that the proposed algorithm needs only 25% of the computational complexity of the conventional OMP algorithm and achieves better positioning performance than the OMP algorithm, especially in noisy environments.