A Digital Signal Processing Architecture for Soft-Output MIMO Lattice Reduction Aided Detection

Many wireless communication standards now include the use of multiple transmit and receive antennas as a means of achieving increased throughput or spectral efficiency, including LTE, WiMAX and WiFi (IEEE 802.11n). The task of a detector for a multi-input multi-output (MIMO) communications channel is to separate the spatially mixed and noise-corrupted data streams, and to produce reliable estimates of the transmitted bits. The brute-force maximum-likelihood (ML) detector provides optimal error-rate performance, but is computationally infeasible when either dense symbol constellations or large numbers of antennas are used. Hardware implementation of ML receivers is therefore very challenging, leading to linear detectors based on well-known approaches such as zero forcing (ZF) or minimum mean-square error (MMSE) detection, or nonlinear methods such as successive interference cancellation (SIC), which offer manageable receiver complexity at the expense of highly suboptimal error-rate performance.


Introduction
One powerful class of receivers developed over the past decade is based on the highly developed mathematical theory of point lattices, which are periodic arrangements of discrete points. The basic idea is to consider the distortion introduced by the noise-free part of a MIMO channel as a representation of a lattice, then to perform suboptimal detection on an "improved" representation of the channel matrix derived from a "reduced" lattice. The suitably reduced lattice facilitates the search for the lattice point closest to the received vector, shifting most of the computational complexity to a pre-processing step before linear detection. Such lattice reduction aided detection (LRAD) approaches to MIMO receiver design have significantly closed the gap between feasible yet high-performance MIMO detection and optimal (but impractical) ML detection.
To date, most LRAD-based MIMO detectors produce hard outputs, in which an estimate of the most likely vector of transmitted symbols is generated. For high-performance wireless communication systems, however, it is commonplace that the information transmitted over the air is coded, thereby containing not only raw data, but also the redundant information needed to perform forward error correction (FEC) at the receiver. State-of-the-art FEC codes such as turbo codes and low-density parity-check (LDPC) codes [1], which require estimates of the probability that a given transmitted bit was a 1 or a 0, therefore call for soft-output detectors. The extension of hard-output LRAD detectors to the soft-output case is therefore of high practical relevance, but is also recognized as a difficult problem [2, p. 16]. In this chapter, we present what is believed to be the first digital signal processing (DSP) implementation of a soft-output lattice reduction aided MIMO detector, based on an approach to MIMO detection known as subspace LRAD (SLRAD) proposed by Windpassinger [3,4].
The chapter is organized as follows. In Section 2 we present the wireless MIMO system model, with an emphasis on how transmitted symbols are drawn from point sets consistent with the lattice theoretic approach to follow. In Section 3 we formally define lattices, and present the most celebrated algorithm for lattice reduction, known as the Lenstra-Lenstra-Lovász (LLL) algorithm. We then show how hard-output lattice-based detection can be used in conjunction with commonly used linear MIMO detectors in Section 4. In Section 5 we outline Windpassinger's subspace-based approach to LRAD in which a list of candidate symbols is produced, thereby facilitating soft-output LRAD. Finally, in Section 6 we present a detailed description of our hardware implementation of a soft-output lattice reduction aided MIMO detector.

System model
We consider a MIMO wireless communication system with $n_T$ transmit and $n_R$ receive antennas. The complex baseband model for this MIMO system is

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}, \tag{1}$$

where $\mathbf{y} \in \mathbb{C}^{n_R}$ is the received vector, $\mathbf{H} \in \mathbb{C}^{n_R \times n_T}$ is the channel matrix, $\mathbf{n} \in \mathbb{C}^{n_R}$ is the channel noise, and $\mathbf{x} \in \mathbb{C}^{n_T}$ is the vector of transmitted symbols, as shown in Fig. 1.
We assume that the noise $\mathbf{n} \triangleq [n_1, n_2, \ldots, n_{n_R}]^T$ contains independent and identically distributed (i.i.d.) elements $n_m \sim \mathcal{CN}(0, \sigma^2)$, $m = 1, \ldots, n_R$. The channel matrix $\mathbf{H}$ has i.i.d. entries $h_{m,n} \sim \mathcal{CN}(0, 1)$, for $m = 1, \ldots, n_R$ and $n = 1, \ldots, n_T$, where it is assumed that there are at least as many receive antennas as transmit antennas: $n_R \ge n_T$.
An uncorrelated Rayleigh fading propagation environment is therefore assumed in this chapter, though it should be noted that lattice reduction aided detection receivers similar to those presented later in this chapter have been proposed for environments in which there is either temporal [5] or frequency-selective [6] fading.
The task of the MIMO receiver is to recover $\mathbf{x}$ from $\mathbf{y}$, based on knowledge of both the channel realization $\mathbf{H}$ and the channel noise variance $\sigma^2$.
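The model in (1) can be simulated directly. The following is a minimal sketch using only the Python standard library; the helper names `cgauss`, `channel` and `transmit`, the antenna counts and the noise variance are illustrative assumptions rather than part of the chapter.

```python
import random

def cgauss(var, rng):
    """Draw one circularly symmetric complex Gaussian sample CN(0, var)."""
    s = (var / 2) ** 0.5                      # standard deviation per real dimension
    return complex(rng.gauss(0, s), rng.gauss(0, s))

def channel(n_r, n_t, rng):
    """Uncorrelated Rayleigh channel: i.i.d. entries h_{m,n} ~ CN(0, 1)."""
    return [[cgauss(1.0, rng) for _ in range(n_t)] for _ in range(n_r)]

def transmit(H, x, sigma2, rng):
    """Return y = H x + n with i.i.d. noise n_m ~ CN(0, sigma2)."""
    return [sum(h * xn for h, xn in zip(row, x)) + cgauss(sigma2, rng) for row in H]

rng = random.Random(0)                        # fixed seed for repeatability
H = channel(4, 4, rng)                        # n_R = n_T = 4 (assumed)
x = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]        # one QPSK symbol per transmit antenna
y = transmit(H, x, 0.1, rng)                  # sigma^2 = 0.1 (assumed)
```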
The vector of transmitted symbols is denoted $\mathbf{x} \triangleq [x_1, x_2, \ldots, x_{n_T}]^T$. In this chapter we restrict attention to transmit symbols drawn from finite sets of points on a square grid, known as constellations, and in particular the quadrature phase-shift keying (QPSK), 16-quadrature amplitude modulation (16-QAM) and 64-QAM constellations depicted in Fig. 2. We do not consider non-rectangular constellations, such as 8-PSK, due to an inherent incompatibility with the lattice-theoretic framework exploited by lattice reduction aided detection, and also the limited applicability of non-rectangular constellations in emerging wireless communication standards.
The symbol transmitted from the $n$th antenna, denoted $x_n$, is drawn from a constellation $\mathcal{A}_n$:

$$x_n \in \sqrt{E_{s_n}}\,\mathcal{A}_n,$$

where the scalar $E_{s_n}$ is the average transmitted symbol power. We define the vector $\mathbf{E}_s \triangleq [E_{s_1}, E_{s_2}, \ldots, E_{s_{n_T}}]$ so that

$$E_{s_n} = \mathrm{E}\{|x_n|^2\}, \quad n = 1, \ldots, n_T.$$

The selection of $\mathbf{E}_s$ depends on the particular objective of transmit power scaling and indeed varies in practical implementations. In this chapter, to enable fair comparison between systems employing differing modulation formats, we constrain the average power per information bit to unity ($E_b = 1$).
The constellations considered in this chapter are formed from a subset of the scaled and shifted Gaussian integers $\mathcal{X}$ defined in (4). In this chapter, we restrict attention to the three subsets $\mathcal{X}_n \subset \mathcal{X}$ shown in Fig. 3, where the introduction of the offset term in (4) maintains symmetry of each constellation with respect to the axes. We refer to constellations formed in this manner as Gaussian integer constellations. The constellation $\mathcal{A}_n$ with elements $\alpha_n \in \mathcal{A}_n$ employed at the $n$th transmit antenna is:

$$\mathcal{A}_n \triangleq \left\{ \tilde{x}_n / \sqrt{c_n} : \tilde{x}_n \in \mathcal{X}_n \right\},$$

where $c_n$ is the average energy of $\mathcal{X}_n$. Dividing each element of $\mathcal{X}_n$ by $\sqrt{c_n}$ ensures that $\mathcal{A}_n$ has unity average energy; constellations scaled in this way are referred to as normalized constellations. For square QAM constellations such as those in Fig. 2, $c_n$ is available in closed form. It is important to note that we deliberately allow each transmit antenna to be independently mapped to a constellation set. In summary, the transmitted symbols $x_n$ are formed by scaling of the elements $\tilde{x}_n \in \mathcal{X}_n$:

$$x_n = \sqrt{E_{s_n} / c_n}\;\tilde{x}_n.$$

The effect of a given channel realization $\mathbf{H}$ is to rotate and stretch (or contract) the axes of the otherwise square decision regions of the optimal, maximum-likelihood (ML) receiver. The error probability of a detector is determined by the distance of constellation points (mapped by $\mathbf{H}$) from the associated decision boundaries. The essential idea of LR-aided detectors is to obtain a "more orthogonal" representation for the channel realization $\mathbf{H}$, before detection using a low-complexity (sub-optimal) receiver. In the following section we make these ideas precise, drawing on the well-established mathematical literature on point lattices to formalize what is meant by the notion of a "more orthogonal" representation, and how it can be achieved and quantified.
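The normalization of a Gaussian integer constellation can be checked numerically. The sketch below assumes that the grid places points on the odd integers in each dimension (spacing 2, offset 1 + i); this particular scaling and offset is an assumption made for illustration, and the function names are hypothetical.

```python
def square_constellation(bits_per_symbol):
    """Square constellation on an odd-integer grid (assumed scaling and offset)."""
    side = 2 ** (bits_per_symbol // 2)                    # points per axis: 2, 4 or 8
    levels = [2 * k - (side - 1) for k in range(side)]    # e.g. [-3, -1, 1, 3]
    return [complex(re, im) for re in levels for im in levels]

def average_energy(points):
    """Average energy c_n of a finite point set."""
    return sum(p.real ** 2 + p.imag ** 2 for p in points) / len(points)

def normalize(points):
    """Divide every point by sqrt(c_n) so the set has unit average energy."""
    c = average_energy(points)
    return [p / c ** 0.5 for p in points]

for name, bits in (("QPSK", 2), ("16-QAM", 4), ("64-QAM", 6)):
    X = square_constellation(bits)
    print(name, len(X), average_energy(X), average_energy(normalize(X)))
```

Under this assumed grid the average energies before normalization are 2, 10 and 42 for QPSK, 16-QAM and 64-QAM respectively.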

Lattices
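A complex lattice consists of all Gaussian-integer linear combinations of the set of linearly independent basis column vectors $\mathbf{b}_k$, $1 \le k \le M$, of the basis matrix $\mathbf{B} \in \mathbb{C}^{N \times M}$, $M \le N$. A complex lattice formed from basis matrix $\mathbf{B}$ is therefore the set of points

$$\mathcal{L}(\mathbf{B}) \triangleq \left\{ \mathbf{B}\mathbf{z} = \sum_{k=1}^{M} z_k \mathbf{b}_k : z_k \in \mathbb{Z}[i] \right\},$$

where $\mathbb{Z}[i] \triangleq \{a + ib : a, b \in \mathbb{Z}\}$ is the ring of Gaussian integers [7].
The number of possible bases for a given lattice $\mathcal{L}$ is infinite, since any basis $\tilde{\mathbf{B}} = \mathbf{B}\mathbf{T}$ forms the same lattice $\mathcal{L}(\tilde{\mathbf{B}}) = \mathcal{L}(\mathbf{B})$ when the transformation matrix $\mathbf{T}$ is unimodular, i.e. $\det(\mathbf{T}) = \pm 1$ and $\mathbf{T} \in \mathbb{Z}[i]^{M \times M}$. Finding a basis in which the basis vectors are (roughly speaking) reasonably short and almost orthogonal is known as lattice basis reduction, which we now describe formally.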
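The invariance of a lattice under a unimodular change of basis can be illustrated on a small case. In the sketch below the basis `B`, the transform `T` (chosen so that det(T) = 1) and the coordinate vector `z` are hypothetical values; the point of interest is that the inverse of `T` again has Gaussian integer entries, so integer coordinates in one basis map to integer coordinates in the other.

```python
def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inv2(M):
    """Inverse of a 2x2 matrix via the adjugate."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

B = [[1 + 0j, 0.5 + 0.5j], [0j, 1 + 0j]]     # hypothetical lattice basis
T = [[1 + 0j, 2 - 1j], [1j, 2 + 2j]]         # Gaussian integer entries, det(T) = 1
B_new = matmul(B, T)                         # alternative basis B T
T_inv = inv2(T)                              # entries are again Gaussian integers

z = [[3 - 1j], [-2 + 1j]]                    # coordinates of a lattice point w.r.t. B
z_new = matmul(T_inv, z)                     # coordinates of the same point w.r.t. B T
print(matmul(B, z), matmul(B_new, z_new))    # both products give the same lattice point
```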

Lenstra-Lenstra-Lovász (LLL) algorithm
The Lenstra-Lenstra-Lovász (LLL) algorithm was originally published as a lattice reduction algorithm operating on real-valued matrices [8]. Many works use the real decomposition of the complex-valued MIMO transmission model [3,9]. Lattice reduction methods can operate on both real and complex integer lattices, and in particular the LLL algorithm has been extended for complex lattice reduction [10]. The complex LLL (CLLL) algorithm can be summarized as follows. We make the following definitions:
• $\mathcal{H}_i$ is the squared Euclidean norm of the $i$th orthogonal vector produced by the Gram-Schmidt orthogonalization (GSO) of $\mathbf{H}$
• $\mu_{i,j}$ is the ratio of the length of the orthogonal projection of the $i$th basis vector onto the $j$th orthogonal vector and the length of the $j$th orthogonal vector
• $\mathbf{H}_L^{i}$ and $\mathbf{T}^{i}$ represent the values of the reduced basis and transform after the $i$th step of the LLL algorithm
The LLL algorithm consists of three basic steps:
1. $\mathcal{H}$ and $\mu$ are computed using a modified GSO procedure [11]
2. Size reduction aims to make basis vectors shorter and more orthogonal by enforcing the condition that $|\Re(\mu_{k,j})| \le 0.5$ and $|\Im(\mu_{k,j})| \le 0.5$ for all $j < k$
3. Basis vectors $\mathbf{h}_{k-1}$ and $\mathbf{h}_k$ are swapped whenever the Lovász condition given below is violated, so that size reduction can be repeated to make the basis vectors shorter
Size reduction and basis vector swapping iterate until no pair $\mathbf{h}_{k-1}$, $\mathbf{h}_k$ violates the condition. The resultant basis is then said to be reduced. The Lovász condition, which governs swapping in LLL reduction, is:

$$\mathcal{H}_k \ge \left(\delta - |\mu_{k,k-1}|^2\right)\mathcal{H}_{k-1}, \tag{7}$$

where $\delta$ satisfying $\tfrac{1}{4} < \delta < 1$ is a factor selected to achieve an acceptable quality-complexity trade-off [8].
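After each swapping step, $\mathcal{H}_{k-1}$, $\mathcal{H}_k$ and some of the $\mu_{i,j}$ values need to be updated. Techniques can be employed to minimize the number and frequency of recalculations of $\mathcal{H}$ and $\mu$ elements [11]. The LLL algorithm is detailed in Algorithm 1.
Starting with columns 1 and 2, as $|\mu_{2,1}| > 0.5$, size reduction is performed on these columns by adding the first column to the second, yielding a partially reduced matrix and corresponding transform with

$$\boldsymbol{\mu} = \begin{bmatrix} 1.0000 & 0.0000 \\ 0.2308 & 1.0000 \end{bmatrix} \quad \text{and} \quad \mathcal{H} = \begin{bmatrix} 0.8125 & 0.0192 \end{bmatrix}.$$

With these values the second squared norm is far smaller than the first, so the two columns are swapped. Size reduction is then performed on the columns once more; this time, by subtracting three times the first column from the second, we have:

$$\boldsymbol{\mu} = \begin{bmatrix} 1.0000 & 0.0000 \\ 0.0000 & 1.0000 \end{bmatrix} \quad \text{and} \quad \mathcal{H} = \begin{bmatrix} 0.0625 & 0.2500 \end{bmatrix}.$$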
The Lovász condition (7) is now satisfied, and the algorithm terminates.
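The iteration traced above can be sketched in a few lines. The code below is a textbook-style complex LLL that recomputes the Gram-Schmidt quantities from scratch at every step, which is simple but far less efficient than the update techniques of [11]. The 2×2 matrix `H_cols` is a hypothetical example chosen so that the intermediate values match those quoted above; it is not taken from the chapter, and δ = 0.75 is an assumed setting.

```python
def gso(cols):
    """Gram-Schmidt on a list of columns: squared norms Hn and coefficients mu."""
    n = len(cols)
    ortho, Hn = [], []
    mu = [[0j] * n for _ in range(n)]
    for k in range(n):
        v = list(cols[k])
        for j in range(k):
            mu[k][j] = sum(a * b.conjugate() for a, b in zip(cols[k], ortho[j])) / Hn[j]
            v = [a - mu[k][j] * b for a, b in zip(v, ortho[j])]
        ortho.append(v)
        Hn.append(sum(a.real ** 2 + a.imag ** 2 for a in v))
    return Hn, mu

def cround(z):
    """Round real and imaginary parts separately to the nearest integer."""
    return complex(round(z.real), round(z.imag))

def clll(cols, delta=0.75):
    """Return (reduced columns, columns of T) such that reduced = original * T."""
    cols = [[complex(a) for a in c] for c in cols]
    n = len(cols)
    T = [[complex(i == j) for i in range(n)] for j in range(n)]
    k = 1
    while k < n:
        for j in range(k - 1, -1, -1):                    # size-reduce column k
            c = cround(gso(cols)[1][k][j])
            cols[k] = [a - c * b for a, b in zip(cols[k], cols[j])]
            T[k] = [a - c * b for a, b in zip(T[k], T[j])]
        Hn, mu = gso(cols)
        if Hn[k] < (delta - abs(mu[k][k - 1]) ** 2) * Hn[k - 1]:
            cols[k - 1], cols[k] = cols[k], cols[k - 1]   # condition violated: swap
            T[k - 1], T[k] = T[k], T[k - 1]
            k = max(k - 1, 1)
        else:
            k += 1
    return cols, T

H_cols = [[0.75, 0.5], [-0.5, -0.5]]    # hypothetical example, stored column by column
reduced, T_cols = clll(H_cols)
print(reduced, T_cols, gso(reduced)[0])
```

For this input the first size reduction adds column 1 to column 2, the pair is swapped, and the second size reduction subtracts three times the new first column, leaving squared Gram-Schmidt norms of 0.0625 and 0.25.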

Orthogonality defect
The orthogonality of a matrix $\mathbf{H}$ can be quantified using the orthogonality defect, defined as [4, §4.6.2]:

$$\delta(\mathbf{H}) \triangleq \frac{\prod_{k=1}^{n_T} \|\mathbf{h}_k\|}{\sqrt{\det(\mathbf{H}^H\mathbf{H})}}, \tag{8}$$

where $\mathbf{h}_k$ is the $k$th column of $\mathbf{H}$, $\delta(\mathbf{H}) \ge 1$ for all $\mathbf{H}$ and $\delta(\mathbf{H}) = 1$ if and only if the columns of $\mathbf{H}$ are orthogonal. When the number of columns and rows of $\mathbf{H}$ are equal, the denominator can be simplified to $|\det(\mathbf{H})|$. From (8), matrices with correlated columns, or with larger column norms for a given determinant, will result in higher orthogonality defects. This also causes their inverse or generalized inverse to have larger row norms, leading to noise enhancement. As will be shown in Section 4, matrices with a lower orthogonality defect therefore induce less noise enhancement in ZF- or MMSE-based detectors as the probability of error, for example as calculated in (15), can be reduced.
To illustrate the impact of lattice reduction on orthogonality defect, we generated $10^6$ randomly chosen $\mathbf{H} \in \mathbb{C}^{4 \times 4}$, and computed the lattice reduced equivalent $\mathbf{H}_L$. The orthogonality defect was calculated using (8) both before and after lattice reduction. The results are presented in the form of cumulative distributions in Fig. 4, where the effect of lattice reduction on orthogonality defect is clearly apparent. Lattice basis reduction has also been shown to improve matrix conditioning [12]. It is this improvement that reduces noise enhancement in linear detection methods and reduces the error rate of LRAD-based systems.
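The orthogonality defect is straightforward to evaluate for a single matrix. The sketch below assumes the definition as the product of the column norms divided by the square root of det(H^H H), and uses the fact that this square root equals the product of the Gram-Schmidt norms. The two hypothetical 2×2 matrices are bases of the same lattice, the second being orthogonal.

```python
import math

def orthogonality_defect(cols):
    """prod ||h_k|| / sqrt(det(H^H H)), with H given as a list of columns."""
    ortho, num, den = [], 1.0, 1.0
    for h in cols:
        v = list(h)
        for q in ortho:                                   # remove projections (Gram-Schmidt)
            mu = sum(a * b.conjugate() for a, b in zip(h, q)) / sum(abs(b) ** 2 for b in q)
            v = [a - mu * b for a, b in zip(v, q)]
        ortho.append(v)
        num *= math.sqrt(sum(abs(a) ** 2 for a in h))     # column norm
        den *= math.sqrt(sum(abs(a) ** 2 for a in v))     # Gram-Schmidt norm
    return num / den

before = orthogonality_defect([[0.75, 0.5], [-0.5, -0.5]])   # hypothetical basis
after = orthogonality_defect([[0.25, 0.0], [0.0, 0.5]])      # orthogonal basis of the same lattice
print(before, after)
```

Here the defect falls from √26 ≈ 5.10 to exactly 1, the minimum possible value.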
Numerous researchers have investigated and compared the application of various lattice reduction algorithms for MIMO detection. In addition to the LLL algorithm, these include the Korkine-Zolotarev (KZ) [13] and Seysen [14] lattice reduction algorithms; see [2] and the references therein for applications to MIMO detection. In this chapter we restrict attention to the LLL algorithm, since numerous simulation studies suggest that lattice-reduction-aided detection is well suited to low-complexity MIMO receivers when large constellations are used [15,16].

Hard detection using lattice reduction
Detectors which output an estimate of the most likely vector of transmitted symbols are said to be hard output detectors. Hard estimates are denoted $\hat{\mathbf{b}}$ for bit vector estimates and $\hat{\mathbf{x}}$ for symbol vector estimates. Detectors which generate not just a vector of bit estimates but also an estimate of the probability that a given transmitted bit was a 1 or a 0 are said to be soft output detectors. Soft output detectors provide a significant benefit when combined with channel coding schemes which make use of soft information, such as turbo codes or low-density parity-check (LDPC) codes, but typically increase receiver complexity by a significant degree.

Maximum-Likelihood detection
The maximum-likelihood (ML) detector selects from the set of possible transmitted symbol vectors $\mathbf{x} \in \mathcal{A}^{n_T}$ the vector $\hat{\mathbf{x}}_{\mathrm{ML}}$ which minimizes the Euclidean distance to the receive vector:

$$\hat{\mathbf{x}}_{\mathrm{ML}} = \arg\min_{\mathbf{x} \in \mathcal{A}^{n_T}} \|\mathbf{y} - \mathbf{H}\mathbf{x}\|^2.$$

This is achieved by exhaustively examining all possible transmit vectors; see Algorithm 3. Whilst the ML detection algorithm is conceptually simple, its complexity is exponential in the size of the constellation and number of transmit antennas, and it is therefore practical for real-time hardware implementation only in the simplest of settings. As the optimal detector, the ML detector serves as a performance benchmark for the detection schemes of the following sections.
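The exhaustive search is short to write down, which makes its cost easy to see: the loop visits every one of the |A|^n_T candidate vectors. The sketch below is illustrative only; the channel `H`, the QPSK alphabet scaling and the small perturbation standing in for noise are assumptions.

```python
from itertools import product

def ml_detect(H, y, constellation):
    """Exhaustive search over all |A|^n_T transmit vectors."""
    n_t = len(H[0])

    def dist2(x):
        # Squared Euclidean distance between y and H x.
        return sum(abs(ym - sum(h * xn for h, xn in zip(row, x))) ** 2
                   for row, ym in zip(H, y))

    return min(product(constellation, repeat=n_t), key=dist2)

qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
H = [[0.9 + 0.1j, -0.3 + 0.4j], [0.2 - 0.5j, 1.1 + 0.2j]]     # hypothetical channel
x = (1 - 1j, -1 + 1j)
y = [sum(h * xn for h, xn in zip(row, x)) + 0.05 for row in H]  # small perturbation as noise
print(ml_detect(H, y, qpsk))
```

With two antennas and QPSK the search covers 16 vectors; with four antennas and 64-QAM it would cover 64^4 = 16,777,216.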

Zero Forcing estimation
The most straightforward linear detection scheme is zero forcing (ZF), also known as least squares estimation, which works to reverse the effect of the MIMO channel matrix on the transmitted symbols. The method finds the least squares solution to (1), and is referred to as zero forcing because the interference caused by $\mathbf{H}$ is forced to zero by multiplication of the received vector $\mathbf{y}$ by $\mathbf{W}_{\mathrm{ZF}}$, the inverse (or generalized inverse) of the channel matrix:

$$\tilde{\mathbf{x}}_{\mathrm{ZF}} = \mathbf{W}_{\mathrm{ZF}}\,\mathbf{y}.$$

We use the notation $\tilde{\mathbf{x}}$ to represent an unconstrained estimate of the vector of transmitted symbols. The likelihood that $\tilde{\mathbf{x}}$ actually coincides with a constellation point is negligibly small, and so the nearest valid constellation point must be found. ZF finds the estimate of the vector of transmitted symbols $\hat{\mathbf{x}}_{\mathrm{ZF}}$ as follows:

$$\hat{\mathbf{x}}_{\mathrm{ZF}} = \arg\min_{\mathbf{x} \in \mathcal{A}^{n_T}} \|\tilde{\mathbf{x}}_{\mathrm{ZF}} - \mathbf{x}\|^2,$$

where $\hat{\mathbf{x}}_{\mathrm{ZF}}$ is found by independently rounding each element of $\tilde{\mathbf{x}}_{\mathrm{ZF}}$ to the nearest constellation point. The vector $\hat{\mathbf{x}}_{\mathrm{ZF}}$ can then be demodulated to find $\hat{\mathbf{b}}_{\mathrm{ZF}}$, an estimate of the vector of transmitted bits, as shown in Algorithm 4.
There are numerous methods to find the least squares solution to (1), including those that directly calculate the matrix $\mathbf{W}_{\mathrm{ZF}}$. In this chapter, we utilize the well-known Moore-Penrose pseudoinverse:

$$\mathbf{W}_{\mathrm{ZF}} = \mathbf{H}^{\dagger} = \left(\mathbf{H}^H\mathbf{H}\right)^{-1}\mathbf{H}^H.$$
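A compact sketch of ZF detection follows. To stay within the standard library it is restricted to two transmit antennas, so that the Gram matrix H^H H can be inverted in closed form; the function names and the example channel are hypothetical.

```python
def pinv_two_columns(H):
    """(H^H H)^{-1} H^H for a full-rank matrix H with exactly two columns."""
    g = [[sum(row[i].conjugate() * row[j] for row in H) for j in range(2)]
         for i in range(2)]                               # Gram matrix H^H H
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    g_inv = [[g[1][1] / det, -g[0][1] / det], [-g[1][0] / det, g[0][0] / det]]
    return [[sum(g_inv[i][k] * row[k].conjugate() for k in range(2)) for row in H]
            for i in range(2)]

def zf_detect(H, y, constellation):
    """Multiply by the pseudoinverse, then slice each element independently."""
    W = pinv_two_columns(H)
    x_tilde = [sum(w * ym for w, ym in zip(row, y)) for row in W]   # unconstrained estimate
    return [min(constellation, key=lambda a: abs(a - xt)) for xt in x_tilde]

qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
H = [[1 + 0j, 0.5j], [0.5 + 0j, 1 + 0j], [0j, 1 - 1j]]   # hypothetical 3x2 channel
x = [1 + 1j, -1 - 1j]
y = [sum(h * xn for h, xn in zip(row, x)) for row in H]  # noise-free for clarity
print(zf_detect(H, y, qpsk))
```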

Noise enhancement
Whilst ZF completely reverses the effects of the MIMO channel matrix, if the columns of $\mathbf{H}$ are correlated, ZF will amplify or enhance the noise. By identifying that $\mathbf{W}_{\mathrm{ZF}}\mathbf{H} = \mathbf{I}$ and then multiplying (1) by $\mathbf{W}_{\mathrm{ZF}}$ we can calculate the effective additive noise component of the estimated vector of transmitted symbols:

$$\tilde{\mathbf{x}}_{\mathrm{ZF}} = \mathbf{W}_{\mathrm{ZF}}\mathbf{y} = \mathbf{W}_{\mathrm{ZF}}(\mathbf{H}\mathbf{x} + \mathbf{n}) = \mathbf{x} + \mathbf{W}_{\mathrm{ZF}}\mathbf{n}.$$

It is intuitive that the noise existing in the unconstrained transmit symbol estimate $\tilde{\mathbf{x}}_{\mathrm{ZF}}$ is $\mathbf{W}_{\mathrm{ZF}}\mathbf{n}$. When the rows of $\mathbf{W}_{\mathrm{ZF}}$ have a large Euclidean norm, multiplication of the received vector leads to the additive noise component in $\mathbf{y}$ being amplified. A poorly conditioned or correlated channel matrix therefore results in significant noise enhancement in ZF, as can be shown by examining the probability of error (15). Existing work [17] has looked at the statistical properties of the channel matrix, and in particular the effect of this noise enhancement, leading to a tight analytical bound on the performance of ZF detectors in Rayleigh fading channels.
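Since each element of W_ZF n has variance σ² times the squared norm of the corresponding row of W_ZF, the row energies give a direct measure of noise enhancement. The sketch below compares a hypothetical 2×2 channel with correlated columns against an orthogonal basis of the same lattice; both matrices are illustrative assumptions.

```python
def inv2(M):
    """Inverse of a 2x2 matrix via the adjugate."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

def row_energies(W):
    """Squared Euclidean norm of each row: the noise variance gain per stream."""
    return [sum(abs(w) ** 2 for w in row) for row in W]

H = [[0.75, -0.5], [0.5, -0.5]]        # hypothetical channel, correlated columns
H_L = [[0.25, 0.0], [0.0, 0.5]]        # orthogonal basis of the same lattice
print(row_energies(inv2(H)))           # noise gains when inverting H
print(row_energies(inv2(H_L)))         # noise gains when inverting the reduced basis
```

For these matrices the gains fall from (32, 52) to (16, 4).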

Minimum Mean-Square Error (MMSE) estimation
MMSE estimation acts to balance the reduction of the interference caused by $\mathbf{H}$ and the noise enhancement due to correlation of the columns in $\mathbf{H}$. Rather than completely remove the effect of the MIMO channel, MMSE estimation finds the coefficient matrix which minimizes the mean-square error criterion:

$$\mathbf{W}_{\mathrm{MMSE}} = \arg\min_{\mathbf{W}} \mathrm{E}\left\{\|\mathbf{W}\mathbf{y} - \mathbf{x}\|^2\right\}. \tag{16}$$

The solution to (16) is the well-known MMSE estimator (17), also known as the Wiener filter. The shorthand notation of (18) was first proposed in [18] and is referred to as the extended channel matrix, which in this chapter is denoted $\underline{\mathbf{H}}$. Similarly to ZF detection, MMSE detection finds the estimate of the vector of transmitted symbols $\hat{\mathbf{x}}_{\mathrm{MMSE}}$ as follows:

$$\hat{\mathbf{x}}_{\mathrm{MMSE}} = \arg\min_{\mathbf{x} \in \mathcal{A}^{n_T}} \|\tilde{\mathbf{x}}_{\mathrm{MMSE}} - \mathbf{x}\|^2,$$

where $\hat{\mathbf{x}}_{\mathrm{MMSE}}$ is found by independently rounding each element of $\tilde{\mathbf{x}}_{\mathrm{MMSE}}$ to the nearest constellation point. It is well known that as the noise term approaches zero (at high signal-to-noise ratios), the MMSE estimator becomes equivalent to a ZF estimator.
Compared to ZF detection, MMSE results on average in less noise enhancement, as $\underline{\mathbf{H}}$ is better conditioned than $\mathbf{H}$. This can be seen intuitively as a result of adding a diagonal matrix relating to the noise variance as in (17), or alternatively due to the stacked structure of (18) resulting in a decrease in correlation. Unlike ZF, however, MMSE does not perfectly reverse or remove the interference of $\mathbf{H}$, leading to interference between the otherwise independent transmit antennas. As with ZF, analytical performance bounds for MMSE detectors have been developed [17,19] for various channel models.
Utilizing the shorthand notation of the extended channel matrix of (18), ZF detection can be readily extended to perform MMSE detection, as shown in Algorithm 5. Note that due to the extra rows of $\underline{\mathbf{H}}$ as compared to $\mathbf{H}$, the computational complexity of calculating $\mathbf{W}_{\mathrm{MMSE}}$ is roughly double that of $\mathbf{W}_{\mathrm{ZF}}$.
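The equivalence between the regularized filter and a pseudoinverse of a stacked matrix can be checked numerically. The sketch below uses the textbook unit-symbol-power form, in which the filter is (H^H H + σ²I)^{-1} H^H and the extended matrix stacks σI beneath H; this scaling is an assumption and may differ from the chapter's (17) and (18). The restriction to two columns and all names are illustrative.

```python
def regularized_pinv(H, loading=0.0):
    """(H^H H + loading * I)^{-1} H^H for a matrix H with exactly two columns."""
    g = [[sum(row[i].conjugate() * row[j] for row in H) + (loading if i == j else 0.0)
          for j in range(2)] for i in range(2)]
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    g_inv = [[g[1][1] / det, -g[0][1] / det], [-g[1][0] / det, g[0][0] / det]]
    return [[sum(g_inv[i][k] * row[k].conjugate() for k in range(2)) for row in H]
            for i in range(2)]

def mmse_filter_extended(H, sigma):
    """Pseudoinverse of the stacked matrix [H; sigma * I], cut back to n_R columns."""
    H_ext = [list(row) for row in H] + [[sigma, 0.0], [0.0, sigma]]
    W_ext = regularized_pinv(H_ext)            # plain pseudoinverse of the taller matrix
    return [row[:len(H)] for row in W_ext]     # remaining columns would multiply zeros

H = [[0.9 + 0.1j, -0.3 + 0.4j], [0.2 - 0.5j, 1.1 + 0.2j]]   # hypothetical channel
sigma = 0.5                                                  # assumed noise standard deviation
W_direct = regularized_pinv(H, sigma ** 2)
W_stacked = mmse_filter_extended(H, sigma)
print(W_direct, W_stacked)
```

The stacked matrix has n_R + n_T rows rather than n_R, which is the source of the extra cost noted above.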

Detection using Lattice Reduction
Lattice basis reduction [20, §2.6.1] reduces the orthogonality defect, thereby reducing noise enhancement. This is achieved by finding a closer-to-orthogonal set of basis vectors. This reduced lattice basis is found by optimizing the generating matrix, which in the present application is a MIMO channel matrix realization. The closer-to-orthogonal set is found using elementary operations on basis vectors. Complex integer linear combinations of the column vectors of $\mathbf{H}$ are taken to form the reduced matrix $\mathbf{H}_L$, which spans the same set of points, $\mathbf{H}\mathcal{X}^{n_T} \equiv \mathbf{H}_L\mathcal{X}^{n_T}$, and so

$$\mathbf{H}_L = \mathbf{H}\mathbf{T},$$

where $\mathbf{T}$ is a unimodular matrix with complex integer entries and $\det(\mathbf{T}) = \pm 1$; therefore $\mathbf{T}^{-1}$ also contains only complex integer entries.
As in [3], by finding an equivalent and closer to orthogonal set of the basis vectors, H L , noise enhancement is reduced when quantization is performed.Importantly, as T −1 and x both contain only integer spaced entries, so does T −1 x and so symbol detection or quantization is merely rounding to the grid X.
Once the lattice reduced channel matrix is found, we then calculate the pseudoinverse as would be done in ZF or MMSE detection. LRAD therefore operates using the following steps, which are adapted from [3] and detailed in [21]:
1. Find the reduced lattice basis
2. Use the pseudoinverse of the reduced basis to form estimates
3. Quantize estimates to $\mathcal{X}$

Transform and bound points to constellation points
As shown in Algorithm 6, received vectors $\mathbf{y}$ are multiplied with the pseudoinverse of the reduced basis $\mathbf{H}_L$ to find a soft estimate of the vector of transmitted symbols in the reduced domain. These symbols are then quantized to an integer grid. (Depending on the transform generated, this integer grid may be offset by a half in both real and imaginary dimensions.) These hard estimates are then transformed, using the transform matrix $\mathbf{T}$ generated by the LR algorithm, to find an estimate of the vector of transmitted symbols. However, as these symbols may fall outside the range of constellation points, invalid points are clipped back to the nearest constellation point.
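The processing chain described above (soft estimate in the reduced domain, quantization to a shifted grid, transformation by T, clipping) can be sketched for a 2×2 system. The sketch assumes 16-QAM symbols on the odd-integer grid, so that the grid has spacing 2 and offset 1 + i; the channel `H`, the unimodular transform `T` and the noise values are hypothetical, and plain ZF is included only for comparison.

```python
def inv2(M):
    """Inverse of a 2x2 matrix via the adjugate."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def cround(z):
    return complex(round(z.real), round(z.imag))

LEVELS = (-3, -1, 1, 3)                                   # 16-QAM per-axis levels (assumed grid)

def clip_to_constellation(v):
    """Map each element to the nearest valid constellation point."""
    near = lambda t: min(LEVELS, key=lambda l: abs(l - t))
    return [complex(near(s.real), near(s.imag)) for s in v]

def zf_detect(H, y):
    return clip_to_constellation(matvec(inv2(H), y))

def lrad_detect(H, T, y):
    H_L = [[sum(H[i][k] * T[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
    shift = matvec(inv2(T), [1 + 1j, 1 + 1j])             # grid offset seen in the reduced domain
    z_soft = matvec(inv2(H_L), y)                         # soft estimate in the reduced domain
    z_hat = [2 * cround((z - s) / 2) + s for z, s in zip(z_soft, shift)]   # quantize to the grid
    return clip_to_constellation(matvec(T, z_hat))        # transform back, then clip

H = [[0.75, -0.5], [0.5, -0.5]]                           # hypothetical channel
T = [[1, -2], [1, -3]]                                    # unimodular: H T = diag(0.25, 0.5)
x = [1 - 1j, -1 + 1j]
y = [a + b for a, b in zip(matvec(H, x), [0.2, -0.2])]    # hypothetical noise realization
print(zf_detect(H, y), lrad_detect(H, T, y))
```

For this noise realization the plain ZF decision is wrong in both symbols, whereas quantizing in the reduced domain recovers the transmitted vector.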

Hard-output SLRAD
For hard estimation, quantization of the ZF or MMSE estimate in the transmit constellation domain is replaced by the same quantization in the lattice reduced domain. The equivalent for soft estimation calls for the calculation of the error induced by quantization in the lattice reduced domain. Unfortunately, just as it is hard to ensure quantization to valid symbols in the lattice reduced domain, it is equally hard to iterate over all possible valid symbols in the lattice reduced domain in order to estimate each bit probability.
Whilst Zhang et al. [22] present a detailed comparison of various soft-output detectors and propose several powerful methods for generating soft output information, there are some key shortcomings; in particular, the performance of the detectors in [22] is evaluated only for QPSK constellations. This is problematic in that a range of wireless communication standards are moving to denser constellations, such as 16-QAM and 64-QAM. This motivates the investigation of lattice reduction based detectors capable of producing candidate lists.
The subspace lattice reduction aided detection (SLRAD) approach of Windpassinger [3] forms a subspace of the channel matrix $\mathbf{H}$ by removing a single column from the channel matrix. This column removal allows the corresponding transmit antenna's symbol estimate to be constrained in order to calculate an estimate of what the other transmit antennae sent. For each transmit antenna a number of symbols is systematically proposed, and for each proposal the set of most likely symbols transmitted on the other antennae is calculated, as shown in Algorithm 7.
The SLRAD algorithm therefore creates a list of candidate symbols, the Euclidean distance of each of these candidates from the origin being used to determine the most likely vector of transmitted symbols for a hard-output detector.
Whilst the performance of SLRAD is close to that of ML (see Fig. 5), the complexity is proportional only to the sum of the sizes of the constellations employed on each transmit antenna. Therefore only a modest number of candidate symbols needs to be investigated, even for dense constellations. For example, a system with 4 transmit antennas each utilizing 64-QAM results in only 4 × 64 = 256 candidates.
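The candidate generation idea can be sketched for the smallest case of two transmit antennas. Each symbol is proposed in turn on one antenna, its contribution is removed from the received vector, and the remaining antenna is detected in the one-column subspace. Two simplifications are assumed here: plain ZF is used in the subspace in place of lattice-reduction-aided detection, and candidates are ranked by the squared distance between y and Hx. All names and the example channel are hypothetical.

```python
def slrad_candidates(H, y, constellation):
    """Candidate list for a two-antenna system (n_T = 2 only in this sketch)."""
    candidates = []
    for n in (0, 1):
        other = 1 - n
        h_n = [row[n] for row in H]                       # column of the constrained antenna
        h_o = [row[other] for row in H]                   # remaining one-column subspace
        energy = sum(abs(h) ** 2 for h in h_o)
        for a in constellation:
            r = [ym - hn * a for ym, hn in zip(y, h_n)]   # remove the proposed symbol
            est = sum(ho.conjugate() * rm for ho, rm in zip(h_o, r)) / energy
            b = min(constellation, key=lambda c: abs(c - est))
            x = [a, b] if n == 0 else [b, a]
            d2 = sum(abs(ym - row[0] * x[0] - row[1] * x[1]) ** 2 for row, ym in zip(H, y))
            candidates.append((d2, tuple(x)))
    return candidates

def hard_output(candidates):
    """Pick the candidate with the smallest distance metric."""
    return min(candidates, key=lambda c: c[0])[1]

qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
H = [[0.9 + 0.1j, -0.3 + 0.4j], [0.2 - 0.5j, 1.1 + 0.2j]]   # hypothetical channel
x = (1 - 1j, -1 + 1j)
y = [sum(h * xn for h, xn in zip(row, x)) for row in H]     # noise-free for clarity
cands = slrad_candidates(H, y, qpsk)
print(len(cands), hard_output(cands))
```

The list holds 4 + 4 = 8 candidates here, compared with 16 for exhaustive search; for four antennas with 64-QAM the corresponding figures are 256 and 64^4 = 16,777,216.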

Soft-output SLRAD
As a candidate-based detector, the hard-output SLRAD detector can be extended to generate soft output information. For each bit, the total probability of all the candidates in which the bit is one is divided by the total probability of all candidates in which the bit is zero. An attractive property of subspace [...]
Brun's algorithm is criticised in [25] as it achieves inferior performance and no analytical result has been reported to prove the level of diversity that can be achieved. This work applies a uniform scaling factor to the elements of the same matrix or vector to ensure that the magnitudes of the largest real and imaginary parts are as close as possible to, but smaller than, one. This pragmatic approach offers a good compromise between true floating-point arithmetic, with its computational overhead, and a simple fixed-point arithmetic with significantly reduced dynamic range. However, it appears that no active scaling is performed in the algorithm to prevent numeric overflow. Instead, it is claimed without substantiation that a bound exists which is used to calculate the required number of integer bits.
The work in [26] implements a sorted QR decomposition using Householder CORDIC units to reduce the number of LLL iterations needed. The complex LLL algorithm is used but, as with most LLL implementations, requires the use of divisions, implemented using the Newton-Raphson algorithm, throughout the LLL iterations.
The work of [27] builds on [26] and discusses novel search-based extensions to LRAD introduced in [28] which generate a candidate list and therefore soft outputs. However, the description of the hardware implementation does not discuss this, and it is therefore presumed that the hardware implementation is hard output. Due to the time-multiplexed complex multiplier pipeline, this approach is forced to rely on the use of priority inversion to prevent deadlocks due to data dependencies. No analysis is performed of the precision required; in particular, magnitude bounding is not performed, which results in a large number of integer bits being required.
In [29], the authors build on their prior work [26,27] by offering several improvements. This revision implements sorted QRD to reduce the number of LLL swapping steps. Once again, the hardware implementation is presumed to offer only hard outputs, as no mention is made of the candidate generation required to form soft outputs nor of the hardware required to calculate LLRs. Unlike the prior works, an upper bound of 4 integer bits is identified for the elements of the R matrix, which offers a significant reduction in the precision required.
Several works [30,31] make use of systolic arrays in their implementation. This requires careful scheduling to maximize component utilization. The former work makes use of the complex LLL algorithm, whereas the latter extends the LLL algorithm through the use of the Siegel condition to avoid the requirement for division operations.
The field-programmable gate array (FPGA) implementation of [32] implements the Clarkson's algorithm variant of LLL [33]. However, this implementation only considers slower off-the-shelf FPGA components, including the use of square root and division operations that have not been optimized. The FPGA and application-specific integrated circuit (ASIC) implementation [34] claims to achieve a "fivefold improvement in terms of throughput at the cost of only slightly more FPGA resources" over [26] and [32]. This work uses CORDIC units along with a modification of the LLL algorithm that replaces the size-reduction criterion with the reverse Siegel condition. The hard output performance of this implementation is also enhanced by the use of successive interference cancellation (SIC), which requires the use of the sorted QR decomposition.

Architecture for Subspace Lattice Reduction Aided Detection
Our proposed architecture implements a soft-output lattice reduction-aided detector based on the subspace LRAD (SLRAD) approach of Windpassinger [3,4]. The top-level schematic layout is shown in Fig. 8. A key feature of the detector is the separation of channel and data processing sections, shown above and below the dashed line in Fig. 8, respectively. Channel processing is computationally expensive, and includes the decomposition and lattice reduction of the MIMO channel matrix $\mathbf{H}$. The separation of channel and data processing therefore enables the receiver to exploit the typically slow variation in channel gains relative to the symbol rate, whereby the output of the computationally expensive channel processing step is used in processing the data spanning multiple data frames.
The channel processing section in Fig. 8 is fed with elements $h_{\mathrm{in}}$ of the estimated MIMO channel matrix $\mathbf{H}$ generated by an external MIMO channel estimator (not shown), while channel multiply and accumulator (CMAC) units perform rotations under control of the Givens control unit. Data processing involves the subspace-based detection of incoming received values, in addition to the calculation of soft outputs in the form of log-likelihood ratio (LLR) values. The data processing section is fed elements of the received vector $\mathbf{y}$, scaled by automatic gain control (AGC) to ensure that analog-to-digital converters (ADCs) are not saturated, and therefore that fixed-point inputs are within a defined range. The data multiply and accumulate (DMAC) and detection (DET) blocks in Fig. 8 are described in Section 6.4. The outputs of the data processing section are LLR values for the bits corresponding to each vector of transmitted symbols $\mathbf{x}$.
Unlike [26,27], this work implements the Scaled and Decoupled QR (SDQR) decomposition [35]. The use of the SDQR provides a definitive bound on the required integer precision and allows the number of fractional bits to be varied with a constant and small number of integer bits.

Givens Control Unit
The calculation of the SDQR rotation values is performed by a Givens Control unit. This unit is a single cycle processor which generates the Givens rotation $\mathbf{G}$ which zeros the element $P_{j,i}$ by rotating the $j$th row with the $i$th row of $\mathbf{P}$ and $\boldsymbol{\Phi}$. The Givens Control unit sustains an average throughput of one rotation variable per cycle by calculating a Givens rotation every four cycles: two rotation variables are emitted in the third cycle (values $G_{1,1}$ and $G_{1,2}$) and two in the fourth cycle (values $G_{2,1}$ and $G_{2,2}$) of each Givens rotation calculation. The Givens Control unit also maintains the decoupled $k$ values and dynamically scales $\mathbf{G}$ to maintain scaling of not only $k$ but indirectly $\mathbf{P}$ and the rotated $\mathbf{y}$. This processor implements the reciprocal function required for division through the use of Newton-Raphson iterations.
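A Newton-Raphson reciprocal replaces division by a short sequence of multiplications and subtractions. The following floating-point sketch illustrates the iteration only; the input range of [0.5, 1), the linear initial guess and the iteration count are common textbook choices and are assumptions here, not details of the fixed-point hardware.

```python
def nr_reciprocal(d, iterations=4):
    """Approximate 1/d for 0.5 <= d < 1 using multiplies and subtracts only."""
    x = 48 / 17 - (32 / 17) * d          # linear initial guess (constants precomputed)
    for _ in range(iterations):
        x = x * (2 - d * x)              # Newton-Raphson step: the error is squared each time
    return x

for d in (0.5, 0.75, 0.999):
    print(d, nr_reciprocal(d), 1 / d)
```

Because the relative error is squared at every step, the number of correct bits roughly doubles per iteration, so only a few iterations are needed for a given fixed-point word length.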

Channel MAC (CMAC) Unit
The application of rotation operations is performed by processor units referred to as Channel Multiply and Accumulator (CMAC) units. Each CMAC unit includes sufficient register space to store a full column of the MIMO channel $\mathbf{H}$ as well as necessary intermediate values. All input, output and stored register values are complex numbers specified using custom extensions to the VHDL fixed-point math package. The arithmetic implemented comprises a complex multiplier and a complex addition unit, with the output of the multiplier being one of the operands of the adder, as shown in Fig. 9.
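Applying a 2×2 rotation to a pair of rows reduces to multiply-accumulate steps of the kind described above: each output element is formed by two chained operations of the form acc + a·b. The sketch below uses the standard unitary Givens rotation rather than the scaled and decoupled (SDQR) variant, and works in floating point; the row values and function names are hypothetical.

```python
import math

def mac(acc, a, b):
    """One multiply-accumulate step on complex operands: acc + a * b."""
    return acc + a * b

def givens(a, b):
    """Unitary 2x2 rotation G with G [a, b]^T = [r, 0]^T (standard form, not SDQR)."""
    r = math.sqrt(abs(a) ** 2 + abs(b) ** 2)
    return [[a.conjugate() / r, b.conjugate() / r], [-b / r, a / r]]

def rotate_rows(G, row_i, row_j):
    """Apply G to two rows, element by element, with two MAC steps per output."""
    new_i = [mac(mac(0j, G[0][0], p), G[0][1], q) for p, q in zip(row_i, row_j)]
    new_j = [mac(mac(0j, G[1][0], p), G[1][1], q) for p, q in zip(row_i, row_j)]
    return new_i, new_j

row_i = [1 + 2j, 0.5 - 1j]               # hypothetical rows of the matrix being decomposed
row_j = [2 - 1j, 1 + 1j]
G = givens(row_i[0], row_j[0])           # chosen to zero the first element of row_j
new_i, new_j = rotate_rows(G, row_i, row_j)
print(new_i, new_j)
```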
Whilst this architecture greatly simplifies the challenge of processor unit scheduling, the units are still unavoidably under-utilized. The CMAC units become unused once their corresponding column of $\mathbf{H}$ is fully zeroed. As a result, the CMAC unit corresponding to the $i$th column of $\mathbf{H}$ is in use for $i/n_T$ of the SDQR execution period.
The CMAC units provide outputs which feed a multiplexer, as shown in Fig. 8. Required in order to perform back substitution, this allows the transfer of register values between CMAC units by feeding the output of a unit to the input of another.

Data MAC (DMAC) Unit
Processor units referred to as Data Multiply and Accumulator (DMAC) units, are implemented to apply Givens rotation operations on the received vector y.Each DMAC unit includes sufficient register space to store a full vector of received values y as well as necessary intermediate values.
Multiple DMAC units are implemented so that the rotations required to apply a Givens rotation to a full row of H can be performed in parallel. This avoids the need to stall not only the Givens Control unit but also the DMAC units, which would otherwise have to wait whilst each row element of H is rotated. Multiple DMAC units are implemented to achieve the necessary data throughput rate, such that a single rotation operation can be applied to multiple received vectors in parallel. This relies on the assumption that the MIMO channel is approximately constant over multiple symbol periods. Given a sufficiently static MIMO channel, any number of DMAC units can be implemented. This allows data throughput to be scaled linearly by simply adding more DMAC units, a key design feature of the proposed architecture.
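The idea of reusing one rotation across several received vectors can be sketched as below. The function names and the list-of-lists data layout are hypothetical, and the sequential loop stands in for DMAC units that would operate concurrently in hardware.

```python
def rotate_pair(G, u, v):
    """Apply the 2x2 rotation G to the pair (u, v)."""
    return (G[0][0] * u + G[0][1] * v,
            G[1][0] * u + G[1][1] * v)

def dmac_bank(G, i, j, received):
    """Apply the same rotation G to entries i and j of every received
    vector. Each vector is assumed to live in its own DMAC unit, so the
    loop models work that the hardware would perform in parallel."""
    rotated = []
    for y in received:
        y = list(y)                       # copy: leave the caller's data intact
        y[i], y[j] = rotate_pair(G, y[i], y[j])
        rotated.append(y)
    return rotated

# One rotation (cos = 0.6, sin = 0.8) shared by two received vectors.
G = ((0.6, 0.8), (-0.8, 0.6))
batch = dmac_bank(G, 0, 1, [[1 + 0j, 2 + 0j], [1j, 1 - 1j]])
```

Because G is computed once per channel realisation and does not depend on y, adding vectors to the batch adds no work to the control unit in this sketch.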

H&T Register File
As well as being loaded into the CMAC units, a new H is cached in the H&T register file when it is loaded into the processor. This is done to provide a copy of H for use when calculating the Euclidean distance of candidate estimates. The H&T register file is also used to store T, the lattice basis required to translate candidate estimates from the reduced basis prior to demodulation.
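The two uses of the cached matrices can be sketched as follows, assuming the usual lattice-reduction relation in which the reduced basis equals HT, so that a candidate z in the reduced basis maps to x = Tz. The helper names are hypothetical, and any constellation shifting or scaling is omitted.

```python
def matvec(A, x):
    """Plain matrix-vector product for lists of (complex) numbers."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def to_original_basis(T, z):
    """Translate a candidate from the reduced basis: x = T z
    (assumes the reduced channel is H T with T unimodular)."""
    return matvec(T, z)

def euclidean_distance_sq(H, y, x):
    """Squared Euclidean distance ||y - H x||^2 for a candidate x."""
    Hx = matvec(H, x)
    return sum(abs(yi - hi) ** 2 for yi, hi in zip(y, Hx))

# Toy 2x2 example with illustrative values only.
H = [[1.0, 0.5], [0.0, 1.0]]
T = [[1, -1], [0, 1]]
x = to_original_basis(T, [2, 1])             # candidate in the original basis
d = euclidean_distance_sq(H, [1.5, 1.0], x)  # distance to a noiseless y = H x
```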

Candidate Detection (DET)
Each DMAC unit feeds a symbol detection chain which performs candidate generation and finally bitwise log-likelihood accumulation. This implements the data flow detailed in Fig. 7.

Log-likelihood ratio (LLR) Accumulator
Once a list of vectors of transmit symbol candidates has been generated, the probability of each of these vectors needs to be computed. Many approaches exist that avoid the need to implement the log operations inherent in the calculation of log-likelihood ratio (LLR) values. We implement the shifting-method Log-MAP algorithm presented in [36], which utilizes a piecewise linear approximation. The schematic for the LLR block is shown in Fig. 10.
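For orientation, the sketch below accumulates bitwise LLRs from a candidate list using the exact Jacobian logarithm, with an option to drop the correction term (the max-log approximation). It does not reproduce the piecewise linear approximation of [36]; the correction term `log1p(exp(-|a - b|))` is simply the part that approximations of this kind typically replace. The candidate format, the sign convention (bit 1 minus bit 0) and the handling of missing bit values are assumptions.

```python
import math

def max_star(a, b, exact=True):
    """Jacobian logarithm log(exp(a) + exp(b)).
    With exact=False the correction term is dropped (max-log)."""
    m = max(a, b)
    if not exact:
        return m
    return m + math.log1p(math.exp(-abs(a - b)))

def bit_llrs(candidates, noise_var, exact=True):
    """candidates: list of (bits, squared_distance) pairs (assumed format).
    Returns one LLR per bit position, or None where the list contains
    no candidate for one of the two bit values."""
    n_bits = len(candidates[0][0])
    llrs = []
    for k in range(n_bits):
        acc = {0: None, 1: None}
        for bits, dist_sq in candidates:
            metric = -dist_sq / noise_var
            b = bits[k]
            acc[b] = metric if acc[b] is None else max_star(acc[b], metric, exact)
        if acc[0] is None or acc[1] is None:
            llrs.append(None)          # missing counter-hypothesis: left open here
        else:
            llrs.append(acc[1] - acc[0])
    return llrs

cands = [((0, 1), 0.2), ((1, 1), 1.0), ((0, 0), 1.4)]
llr_maxlog = bit_llrs(cands, noise_var=1.0, exact=False)
llr_exact = bit_llrs(cands, noise_var=1.0, exact=True)
```

Comparing the two results shows how much the correction term contributes for a given list.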

Processor instruction set
The overall architecture is a microcode-based system with detailed low-level micro-operations that combine to implement higher-level complex machine instructions. Each component, including the Givens Control unit, the CMACs, the DMACs and the detection chains, has its own micro-operations. The benefit is a flexible architecture capable of implementing the SLRAD algorithm, but which is also able to switch to simpler LRAD or even ZF algorithms based on the prevailing channel conditions.

Control Unit Micro-operations
The bulk of the channel processing involves the execution of the four operations that generate the Givens rotation G. The first two, C1 and C2, calculate the new values for $P_{j,i}$ and $k_j$ [...] On the other hand, DMAC units are able to perform the equivalent reduction operation in the two micro-operations R1 and R2, as lattice size reduction is performed in a row-wise fashion.

Comparisons with previously published work
The results in this section represent the first known digital signal processing architecture for a soft-output lattice reduction aided MIMO detector. For this reason we are unable to provide a direct comparison of our architecture with previously published work. Nevertheless, it is still possible to compare our implementation with three state-of-the-art VLSI implementations of hard-output LRAD-based MIMO detectors [32], [26], [34].
For $n_T = n_R = 4$, the combination of the CMAC micro-operations leads to the system latency outlined in Table 2. This table assumes a MIMO system represented by an extended channel matrix, requiring the zeroing of 16 elements of H. The majority of these elements require 4 cycles, the exception being the final element of each column, which requires a fifth cycle due to the extra cycle needed to compute the Newton-Raphson based reciprocal. An overhead of 12 cycles exists to load data into the processor.
For the LLL algorithm, column swap operations require 5 cycles to perform the single Givens rotation. Size reduction requires at most 3 cycles per pass over the full matrix. As with prior works, a simple strategy is used to fix the number of iterations of the LLL algorithm, which caps the number of swaps and size reduction passes at 3. This yields 24 cycles per subspace or 96 cycles for the four subspaces. To provide context for the results in Table 2, we compare in Table 3 the latency of the proposed architecture with the latencies of three hard-output LRAD-based MIMO detectors for a 4-input, 4-output MIMO system employing QPSK modulation.
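The cycle counts quoted above can be re-derived with a short calculation. This is a literal reading of the text, assuming one five-cycle element per column for the four columns and exactly three swaps plus three size reduction passes per subspace; the zeroing figure it produces is not checked against Table 2, and all names are illustrative.

```python
# Figures taken from the text for n_T = n_R = 4.
ELEMENTS_TO_ZERO = 16
COLUMNS = 4               # assumed: one 5-cycle (final) element per column
CYCLES_PER_ELEMENT = 4
CYCLES_FINAL_ELEMENT = 5
LOAD_OVERHEAD = 12

SWAP_CYCLES = 5
SIZE_REDUCTION_CYCLES = 3
MAX_PASSES = 3            # cap on swaps and on size reduction passes
SUBSPACES = 4

def zeroing_cycles():
    """Cycles spent zeroing elements of H under the literal reading above."""
    ordinary = ELEMENTS_TO_ZERO - COLUMNS
    return ordinary * CYCLES_PER_ELEMENT + COLUMNS * CYCLES_FINAL_ELEMENT

def lll_cycles_per_subspace():
    """Worst-case LLL cycles for one subspace with the fixed iteration cap."""
    return MAX_PASSES * SWAP_CYCLES + MAX_PASSES * SIZE_REDUCTION_CYCLES

qr = zeroing_cycles()                    # 12*4 + 4*5
lll_one = lll_cycles_per_subspace()      # 3*5 + 3*3
lll_all = SUBSPACES * lll_one
```

The LLL part reproduces the 24 and 96 cycles stated in the text; the zeroing figure follows only from the per-element costs given there.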
Table 3. Latency comparison between the proposed architecture and three state-of-the-art implementations

While the latency of the proposed architecture compares favourably with Barbero et al.'s solution [32], the significant performance penalty for generating soft outputs is apparent in comparison with the results of Gestner et al. [26] and especially Bruderer et al. [34]. We caution that the results in Table 3 need to be interpreted carefully, however, since it is well known that hard-output MIMO detectors such as [32], [26] and [34] do not facilitate high-performance iterative receivers involving joint detection and decoding when error-control codes such as turbo codes and LDPC codes are employed [37], [22]. The proposed approach therefore trades off increased latency for improved BER performance and the ability to readily deal with dense constellations, e.g., 64-QAM.

Conclusion
In this chapter we have presented the first known digital signal processing implementation of a soft-output MIMO wireless communications receiver based on lattice reduction aided detection (LRAD). Further research is needed to provide the ASIC and FPGA synthesis results needed to facilitate a comprehensive comparison with prior works providing only hard outputs.

Figure 2. The three constellations used in this chapter

Figure 4. Cumulative distributions of the orthogonality defect for non-reduced and reduced basis channel matrices

then $e_{min} = e$, $x_{ml} = x$, $b_{ml} = b$ end if end for

Figure 6. Top Level Data Flow Diagram

Figure 7. Candidate Chain Data Flow Diagram

Figure 8. Top-level schematic layout. Channel and data processing sections are shown above and below the dashed line, respectively

Figure 9. Custom Multiply and Accumulate Schematic Layout

Table 2. Latency of Channel Processor