Iterative receivers with channel estimation for multi-user MIMO-OFDM: complexity and performance

A family of iterative receivers is evaluated in terms of complexity and performance for an uplink multi-user (MU) multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) system. Transmission over block fading channels is considered. The analyzed class of receivers performs channel estimation inside the iterative detection loop, which has been shown to improve estimation performance. As part of our results, we illustrate the ability of this type of receiver to reduce the required number of pilot symbols. A remaining question is which combinations of estimation and detection algorithms provide the best trade-off between performance and complexity. We address this issue by considering MU detectors and channel estimators of varying algorithmic complexity. For MU detection, two algorithms based on parallel interference cancellation (PIC) are considered and compared with the optimal symbol-wise maximum a-posteriori probability (MAP) detector. For channel estimation, an algorithm performing joint minimum-mean-square-error (MMSE) estimation is considered along with a low-complexity approximation making use of a Krylov subspace method. An estimator based on the space alternating generalized expectation-maximization (SAGE) algorithm is also considered. Our results show that low-complexity algorithms provide the best trade-off, even though more receiver iterations are needed to reach a desired performance.


Introduction
In future wireless systems, high data rate transmissions need to be supported, requiring larger bandwidths to be used. At the same time, spectral efficiency is becoming increasingly important. A technology that has become popular in recent years, and has also found its way into many wireless standards such as LTE [1], is the use of multiple-input multiple-output (MIMO) antenna systems in combination with orthogonal frequency division multiplexing (OFDM). OFDM is used to efficiently combat inter-symbol interference (ISI), inherent in broadband transmissions, while MIMO is used for improving the channel spectral efficiency and/or suppressing interference.
Introducing multiple users (MU) into such systems creates a MU-MIMO-OFDM system. In the uplink, accurate MU receivers are needed to harvest the available gains. A significant number of algorithms, with varying complexity, have been proposed for this task, ranging from the simple zero-forcing detector to the high-complexity maximum-likelihood (ML) detector. See [2] for an overview.
The degree of channel state information (CSI) available at the receiver plays an important role in the design of the receiver structure. While it is convenient for theoretical investigations to assume that perfect CSI is available, practical receivers need to obtain CSI via, e.g., noisy pilot symbol observations. In the case of a large coherence time, the accuracy of the channel estimate can be made high since many symbols can be dedicated to pilot information without any significant effect on the spectral efficiency. In fast fading environments, or packet-based systems, the number of pilot symbols must, however, be kept small to maintain a reasonable spectral efficiency. To this end, more sophisticated transceiver structures have been developed [3][4][5]. These receivers jointly detect the data symbols and estimate the transmission channel, which allows for a lower number of inserted pilot symbols as compared to traditional pilot-based transceiver systems. While the prospect of reducing the number of pilot symbols is important, these receivers are of limited utility since they have substantially higher computational complexity than traditional pilot-based receivers. This complexity increases dramatically if the data is coded.
The discovery of the turbo principle [6] brought radical changes to the entire communication field. It is today understood that highly complex problems, such as jointly detecting coded data and estimating the underlying transmission channel, can be efficiently handled by iteratively solving much simpler sub-problems. In particular, during the last decade there has been a growing interest in iteratively solving the joint coded data detection and channel estimation problem [7][8][9][10]. The receiver alternates between decoding the outer error correcting code, performing multi-user detection (MUD), and estimating the transmission channel, in an iterative manner. In [10], a theoretical framework is presented for this otherwise ad hoc choice of receiver design, strengthening the motivation for it.
Even though iterative algorithms can reduce the complexity of the digital receiver, they may still be of prohibitive complexity in many practical scenarios, manifesting itself in a large chip area and high power consumption. It is therefore important to find low-complexity algorithms that are both power efficient and able to deliver the performance required to reach high spectral efficiencies.
In the current literature, an impressive number of low-complexity algorithms have been proposed for the different components of an iterative receiver, see e.g., [11]. However, few have studied the trade-off between complexity and performance for the entire receiver, including MUD, channel estimation and channel decoder. In [12], we have performed a trade-off analysis for an interleave division multiple access (IDMA) system, where a number of channel estimation algorithms are evaluated. One other exception is [13], where the complexity and performance of a set of receiver algorithms for MIMO multi-carrier code division multiple access (MC-CDMA) systems are investigated. In contrast to [13], this article evaluates a family of iterative receivers for an uplink MU MIMO-OFDM system, operating over block fading channels. Furthermore, we have tried to place a greater focus on the convergence properties of the different receiver configurations. The convergence speed is important since more iterations require a larger computational effort. Also worth mentioning is the work in [14], where a performance-complexity comparison of receivers for downlink MIMO-OFDM systems is performed. Unlike in our comparison, the investigated receivers do not contain any channel estimator.
In our evaluation, the complexity of all the building blocks of the iterative receiver is derived, and related to the system performance. Our results show that low-complexity algorithms are generally sufficient, but more complex schemes may be needed if convergence speed, measured in iterations, is in focus. The main contributions are summarized as follows:
-A trade-off analysis between complexity and performance is performed for a MU MIMO-OFDM system incorporating iterative channel estimation and MUD. Two popular channel estimation algorithms, one based on expectation maximization (EM) [15], and one performing a joint minimum-mean-square-error (MMSE) estimation of all user channels [8,9], are evaluated. A low-complexity approximation of the latter based on a Krylov subspace projection method, as presented in [13], is also evaluated. Three popular MUDs are considered: two parallel interference cancellation (PIC) based detectors and one full maximum a-posteriori probability (MAP) detector, the latter being a natural performance benchmark.
-In the trade-off analysis, the total complexity, in terms of complex multiplications, required to reach a given bit error rate (BER) is derived for all algorithm combinations at different signal-to-noise ratios and numbers of users. The results show that low-complexity schemes generally provide the best trade-off.
-The convergence properties of the different receiver combinations are presented, in terms of BER and mean square estimation error, and through the use of extrinsic information transfer (EXIT) charts [16]. The EXIT charts visualize the exchange of extrinsic information between the outer code and the rest of the receiver incorporating channel estimation and MUD.
The rest of this article is organized as follows. In Section 2, a description of the considered MU-MIMO-OFDM system is given. The algorithms for obtaining the channel estimate are presented in Section 3, and the MUD algorithms in Section 4. In Section 5 the complexity of the algorithms is discussed, and in Section 6 the performance of different algorithm combinations is investigated. A complexity versus performance analysis is performed in Section 7, before the paper is summarized in Section 8.
with N antennas. The users transmit blocks of S OFDM symbols, each containing M sub-carriers. The first S_p OFDM symbols are reserved for pilot symbols, which are known to the receiver. The following S_d = S − S_p OFDM symbols contain coded data. The total number of information-bearing signal constellation points per block, transmitted from each user, then becomes L = S_d M. A forward error correcting (FEC) code with rate R is used to generate codewords, which, after interleaving, are mapped onto the L signal constellation points. We restrict our investigations to the case of QPSK. An extension to other constellations is conceptually straightforward, but in general non-trivial [17]. After OFDM modulation and pilot insertion, the users transmit their signals over a frequency selective block fading channel, where the different multi-antenna links are independent and identically distributed (IID). The block fading assumption holds if the transmitted data blocks are much shorter than the channel coherence time. Thus, a system with short data blocks transmitted over a channel with moderate Doppler spread is considered. Furthermore, to allow for correct OFDM demodulation at the receiver, the users are assumed to be synchronized both in time and frequency. In frequency, the synchronization requirement is strict, but due to the use of a cyclic prefix (CP), the time requirement is somewhat relaxed to the case where the difference in arrival times is less than the duration of the CP minus the channel delay spread.
At the receiver, the signal is demodulated into the complex baseband, where an iterative receiver is implemented. The complexity-performance trade-off of this receiver is the focal point of this article. The receiver consists of three blocks: a channel estimator, a MUD, and a bank of soft-input soft-output (SISO) channel decoders. First, an initial channel estimation is performed, based on the transmitted pilot symbols. This estimate is then used in the MUD to separate the different user streams, which are then fed to the SISO decoders after de-interleaving (Π^{-1}). The outputs of the decoders are then used in the next iteration to update the channel estimate, and to further improve the user separation in the MUD. Multiple iterations are then performed in the same way. The different components are described in detail in later sections.
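As a small worked example of the block dimensions described above, the following Python sketch evaluates S_d and L for the simulation settings quoted later in the article (S = 20, S_p = 1, M = 256, rate 1/2, QPSK). The variable names are illustrative only, and the information-bit count ignores any trellis-termination bits, which is an assumption rather than something stated in the text.

```python
# Block dimensions for one user, using the simulation settings quoted in the article.
S, S_p, M = 20, 1, 256        # OFDM symbols per block, pilot OFDM symbols, subcarriers
R = 0.5                       # code rate of the FEC code
BITS_PER_QPSK_SYMBOL = 2

S_d = S - S_p                 # data-carrying OFDM symbols
L = S_d * M                   # signal constellation points per user and block
coded_bits = L * BITS_PER_QPSK_SYMBOL
info_bits = int(coded_bits * R)   # assumption: no termination bits are accounted for
pilot_overhead = S_p / S      # fraction of the block spent on pilots

print(S_d, L, coded_bits, info_bits, pilot_overhead)
```

With a single pilot OFDM symbol, only 5% of the block is spent on training, which is the regime where estimation inside the iterative loop is expected to matter most.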

Input-output relationship of the channel
Next we turn our attention to a description of the input-output relationship of the channel used in this article. The notation introduced here will also be used for the description of the various algorithms. Furthermore, a low-rank description of the channel, used by the channel estimation algorithms, is also introduced in this section.
Under the assumption of block-fading channels, the discrete-time model for the received signal at the mth subcarrier, during the transmission of OFDM symbol s, can be written as ... which collects the received signals at antenna n and OFDM symbol s, across all sub-carriers. Further, this vector equals where X_k[s] ∈ C^{M×M} is a diagonal matrix which contains user k's transmitted data in OFDM symbol s along its diagonal, and w_n[s] ∈ C^{M×1} is a vector collecting the noise at receive antenna n across subcarriers.
All channel estimation algorithms to be evaluated in this article are based on low-rank approximations of the wireless channel. The assumption made is that the channel is limited in the delay domain, and can therefore be accurately represented by a relatively small number of base functions. The optimal set of base functions is presented in [18], and is known under the name discrete prolate spheroidal (DPS) sequences. Their use for low-complexity channel estimation was proposed in [19], and estimators using the same type of base functions have also been proposed in, e.g., [20].
Forming a base with I base functions, the frequency response between user k and antenna n of the block fading channel may be expressed as in a notation similar to the one used in [20], where U ∈ C^{M×I} is a matrix collecting I orthonormal base functions in its columns and ψ_{k,n} ∈ C^{I×1} is a vector containing the corresponding channel DPS coefficients. Note that ψ_{k,n} can be interpreted as the impulse response of the channel, though this is not mathematically correct unless U is the Fourier base. Using this model of the channel, the received signal in (2) may be expressed as Now, by collecting the received signal for all S OFDM symbols, and all receive antennas, in a vector, and in a similar way collecting the channel coefficients, ψ_{k,n}, for all users and antennas, the received signal may be expressed using the classical linear model [8,21]. That is, where r ∈ C^{SMN×1} collects the received signal in all time-frequency positions and at all receive antennas, Ξ ∈ C^{SMN×KNI} is an observation matrix collecting the transmitted symbols and channel base functions, ψ ∈ C^{KNI×1} collects the channel coefficients for all users, and w ∈ C^{SMN×1} collects the noise. More explicitly, the data structures are given by: r = (r T
The DPS base functions are obtained by solving the eigenvalue equation [8,18,20], C u_i = λ_i u_i, where C ∈ C^{M×M} is a channel correlation matrix. For later use, the eigenvalues λ_i are collected in a vector, λ = [λ_1, ..., λ_I]^T. For i ≥ ⌈τ_max M⌉ + 1, where ⌈·⌉ denotes the ceiling operation, the eigenvalues λ_i are small and can in general be neglected [18]. This value sets a bound on the number of DPS sequences that are needed to represent the channel in an accurate way.
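The eigenvalue problem above can be illustrated with a short NumPy sketch. The text does not state the correlation matrix C explicitly, so the sketch assumes a flat delay power spectrum on [0, τ_max] (delays normalized to the OFDM symbol duration), which gives a closed-form Hermitian Toeplitz matrix. The function name `dps_basis` and this choice of C are assumptions made for illustration only.

```python
import numpy as np

def dps_basis(M, tau_max, I):
    """Return the I dominant eigenvectors/eigenvalues of an assumed correlation matrix.

    Assumption: delays are uniformly spread over [0, tau_max], so that
    C[p, q] = exp(-j*pi*(p-q)*tau_max) * sinc((p-q)*tau_max).
    """
    d = np.arange(M)[:, None] - np.arange(M)[None, :]      # subcarrier index differences
    C = np.exp(-1j * np.pi * d * tau_max) * np.sinc(d * tau_max)
    eigvals, eigvecs = np.linalg.eigh(C)                    # ascending order for Hermitian C
    order = np.argsort(eigvals)[::-1]                       # sort descending
    return eigvecs[:, order[:I]], eigvals[order[:I]]

M, tau_max, I = 256, 0.1, 36                                # settings used in the article
U, lam = dps_basis(M, tau_max, I)
bound = int(np.ceil(tau_max * M)) + 1                       # index beyond which eigenvalues are small
```

Under this assumed C, the eigenvalues drop sharply around the index given by the bound, which is consistent with choosing I slightly above it.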

Channel estimation algorithms
In order to achieve satisfactory detection performance, high-accuracy channel estimates need to be made available at the receiver. A large number of appropriate algorithms have been proposed in the literature. Amongst these, two popular families of algorithms have received a great deal of attention: algorithms performing joint estimation for all users [8,22,23], and algorithms based on interference cancellation [15,24]. In this article, two algorithms from the first family, and one from the second, are considered. The algorithms make use of the transmitted pilot symbols, as well as decoded data symbols. Thus, they all use the turbo principle to iteratively improve the channel estimate as the reliability of the decoded data symbols increases. Furthermore, the algorithms have in common that they all use the same underlying low-rank channel model, the one given in Section 2.2.
The first algorithm, previously presented for MC-CDMA systems in [8,25] and later for MIMO-OFDM in [22], performs a joint MMSE estimate of the composite channel matrices H[m] based on the model in (3). The second algorithm, presented in [13] for MC-CDMA, uses a Krylov subspace method to approximate a costly matrix inverse in the joint MMSE estimator. The third algorithm, based on [15], uses the EM framework, and iteratively performs per-user channel estimation, i.e., estimates of the columns of H[m]. We slightly modify the third algorithm by using the improved space alternating generalized expectation-maximization (SAGE) [26] algorithm. The three algorithms are described below.

Joint MMSE estimator using soft decisions (joint MMSE)
The optimal channel estimation approach is to estimate all user channels jointly, making use of both pilots and soft estimates of the transmitted symbols, along with the channel correlation properties. Based on the model for the received signal given in (6), the optimal estimate of the channel coefficients ψ (in the MMSE sense) can be derived as [9] where the observation matrix used by the estimator has the same structure as Ξ, but contains both known pilot symbols and soft estimates of the transmitted data-carrying symbols, the latter being the soft symbol outputs from the decoder, and 1_N is the all-ones column vector of length N. Further, the covariance matrix of the DPS sequences. Due to the sizes of the matrices involved in (7), the computational complexity can be expected to be significant. The computational burden is significantly decreased, but still large, if the sparsity and regularity of Ξ are taken into account. We will elaborate more on this in Section 5.
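To make the structure of such a joint estimator concrete, the following NumPy sketch implements a generic linear MMSE estimate for the linear model r = Ξψ + w on a single receive antenna. It is a simplified illustration under stated assumptions, not a reproduction of (7): the symbols are treated as perfectly known, the noise is white with variance `noise_var`, and the prior covariance of ψ is diagonal. The helper names `observation_matrix` and `joint_mmse` are hypothetical.

```python
import numpy as np

def observation_matrix(X, U):
    """Stack the blocks diag(X[s, k]) @ U into an (S*M) x (K*I) matrix (one antenna).

    X: (S, K, M) transmitted symbols, U: (M, I) orthonormal base functions.
    """
    S, K, M = X.shape
    I = U.shape[1]
    Xi = np.zeros((S * M, K * I), dtype=complex)
    for s in range(S):
        for k in range(K):
            Xi[s * M:(s + 1) * M, k * I:(k + 1) * I] = X[s, k][:, None] * U
    return Xi

def joint_mmse(r, Xi, prior_var, noise_var):
    """Linear MMSE estimate of psi for r = Xi @ psi + w (white noise, diagonal prior)."""
    A = Xi.conj().T @ Xi / noise_var + np.diag(1.0 / prior_var)   # (K*I) x (K*I) system
    b = Xi.conj().T @ r / noise_var
    return np.linalg.solve(A, b)

# Small demo with hypothetical dimensions (much smaller than the article's settings).
rng = np.random.default_rng(0)
S, K, M, I = 2, 2, 32, 4
U, _ = np.linalg.qr(rng.standard_normal((M, I)) + 1j * rng.standard_normal((M, I)))
X = (rng.choice([-1, 1], (S, K, M)) + 1j * rng.choice([-1, 1], (S, K, M))) / np.sqrt(2)
psi = (rng.standard_normal(K * I) + 1j * rng.standard_normal(K * I)) / np.sqrt(2)
noise_var = 1e-6
Xi = observation_matrix(X, U)
noise = np.sqrt(noise_var / 2) * (rng.standard_normal(S * M) + 1j * rng.standard_normal(S * M))
psi_hat = joint_mmse(Xi @ psi + noise, Xi, np.ones(K * I), noise_var)
```

The sketch forms and solves the full K·I-dimensional system directly, which is the step whose cost motivates the reduced-complexity alternatives.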

Krylov subspace reduced joint MMSE estimator using soft decisions (Krylov MMSE)
As mentioned above, the implementation of the joint MMSE estimator carries a significant computational cost. Multiplication of matrices of large dimensions, along with a costly matrix inversion, adds greatly to the receiver complexity. In [13] an approach to reduce these costs was proposed. The algorithm makes use of a Krylov subspace method, more precisely the unconditional conjugate gradient method [27], to iteratively solve (7). The method iteratively finds the solution to the linear equation system Ax = b, based on an initial guess, by searching in the Krylov subspace spanned by r, Ar, ..., A^{S_k−1}r, where r is the initial residual. The number of terms S_k gives the dimensionality of the Krylov subspace, and equals the number of iterations in the algorithm.
Looking at (7), it can be rewritten Without going into any further details, the algorithm for obtaining the approximate solution ψ_s, based on an initial guess ψ_0 and the subspace order S_k, is outlined in Table 1 as presented in [27]. In the first receiver iteration, ψ_0 is set to be the all-ones vector, while in the following iterations, the estimate from the previous receiver iteration is used. Note that the subspace order can either be fixed, or an error threshold could be used as a stopping criterion. The former is chosen here in order to get a fixed algorithm runtime and complexity.
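Since Table 1 is not reproduced here, the following is a minimal sketch of a textbook conjugate gradient iteration with a fixed number of steps, which is one common way to realize a Krylov subspace solver for a Hermitian positive definite system Ax = b. It is not claimed to match the exact steps of Table 1. The function name `conjugate_gradient` and the matrix-free callback `apply_A` are illustrative assumptions.

```python
import numpy as np

def conjugate_gradient(apply_A, b, x0, n_iter):
    """Approximate the solution of A x = b using n_iter conjugate gradient steps.

    apply_A: function returning A @ v for a Hermitian positive definite A.
    n_iter plays the role of the subspace order (fixed, no error threshold).
    """
    x = np.array(x0, dtype=complex)
    r = b - apply_A(x)                 # initial residual
    v = r.copy()                       # first search direction
    rs = np.vdot(r, r).real
    for _ in range(n_iter):
        if rs <= 1e-30:                # already converged; avoid division by zero
            break
        Av = apply_A(v)                # the dominant cost: one matrix-vector product
        alpha = rs / np.vdot(v, Av).real
        x = x + alpha * v
        r = r - alpha * Av
        rs_new = np.vdot(r, r).real
        v = r + (rs_new / rs) * v
        rs = rs_new
    return x

# Demo on a small, hypothetical Hermitian positive definite system.
rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6)) + 1j * rng.standard_normal((6, 6))
A = B.conj().T @ B + 6 * np.eye(6)
b = rng.standard_normal(6) + 1j * rng.standard_normal(6)
x_full = conjugate_gradient(lambda v: A @ v, b, np.ones(6), n_iter=6)
x_one = conjugate_gradient(lambda v: A @ v, b, np.ones(6), n_iter=1)
```

Passing the operator as a callback means A never has to be formed explicitly; for the MMSE system it could be applied as two successive products with the observation matrix and its Hermitian transpose.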

SAGE based estimator (SAGE ML)
Even though the Krylov subspace method can significantly reduce the complexity of the joint MMSE estimator, the complexity is still high, since large matrix-vector multiplications are required in each Krylov iteration. A low-complexity alternative, which has shown good performance, is to use an algorithm based on EM/SAGE. In SAGE, given a received signal, the ML solution is iteratively generated based on an underlying subspace model of the data. In [15] one such algorithm was presented, producing an optimal low-rank MMSE estimate of the channel. The details of that algorithm are outlined below, where a conversion from EM to SAGE has been performed.
The algorithm processes one receive antenna channel at a time, based on the following underlying model for the channel between user k and receive antenna n, where w[s] = Σ_k w_k[s] is the complete noise vector. As can be seen, r_{k,n}[s] is the signal contribution from user k, and summing over all users gives (2). For the problem at hand, the SAGE algorithm is formulated as [15,26]
-Initialization: For all k and s
-For each iteration i: , and for all s, compute for all j, j ≠ k,
In (12), the matrix Δ_m = diag(λ_1/(λ_1 + σ_w^2), ..., λ_I/(λ_I + σ_w^2)) stems from the low-rank MMSE estimator, and in (13) averaging is performed to make use of the assumption that the channel is static within a block.
Table 1 Outline of the Krylov subspace projection method
The value of X_k[s] is only perfectly known at time instances where pilots are transmitted. At all other positions, symbol estimates must be used. The estimates are updated by the SISO decoders in every iteration, using the most recent channel estimate. Here, hard decisions, obtained by applying the sign operation to the decoded soft symbols, are used for channel estimation, and soft decisions for interference cancellation.
At the very first receiver iteration, no channel estimate is available. Therefore, the algorithm is initialized witĥ Furthermore, to improve the accuracy of the initial estimate, several internal iterations can be performed within the estimator itself. This can be seen as the algorithm being reinitialized with its own updated channel estimate, without waiting for updates on the symbol estimates. In this article, this is only performed at the initial pilot based stage, where the gain is observed to be the largest. In later stages, multiple internal iterations are not producing any significant gain, thus mainly adding to the computational complexity.
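The per-user structure described in this section (cancel the other users, form a per-symbol estimate, average over the block, and filter in the DPS subspace) can be sketched as follows. This is a SAGE-style illustration under several assumptions that are not taken from the text: unit-modulus symbols, an all-zero initial channel estimate, sequential user updates that immediately reuse the newest estimates, and averaging before rather than after the subspace filter. The function name `sage_style_estimate` is hypothetical.

```python
import numpy as np

def sage_style_estimate(r, X, U, lam, noise_var, n_iter):
    """Per-user channel estimation with interference cancellation on one antenna.

    r: (S, M) received signal, X: (S, K, M) unit-modulus symbols (pilots or decisions),
    U: (M, I) base functions, lam: (I,) eigenvalues. Returns (K, M) frequency responses.
    """
    S, K, M = X.shape
    shrink = lam / (lam + noise_var)             # diagonal of the low-rank MMSE filter
    h = np.zeros((K, M), dtype=complex)          # assumed all-zero initialization
    for _ in range(n_iter):
        for k in range(K):                       # sequential update over users
            others = np.einsum('skm,km->sm', X, h) - X[:, k] * h[k]
            per_symbol = np.conj(X[:, k]) * (r - others)   # per-symbol estimate, |X| = 1
            averaged = per_symbol.mean(axis=0)             # channel is static within a block
            h[k] = U @ (shrink * (U.conj().T @ averaged))  # subspace filtering
    return h

# Noise-free demo with hypothetical dimensions.
rng = np.random.default_rng(2)
S, K, M, I = 8, 2, 32, 4
U, _ = np.linalg.qr(rng.standard_normal((M, I)) + 1j * rng.standard_normal((M, I)))
X = (rng.choice([-1, 1], (S, K, M)) + 1j * rng.choice([-1, 1], (S, K, M))) / np.sqrt(2)
psi = rng.standard_normal((K, I)) + 1j * rng.standard_normal((K, I))
h_true = psi @ U.T                               # row k equals U @ psi[k]
r = np.einsum('skm,km->sm', X, h_true)
h_est = sage_style_estimate(r, X, U, np.ones(I), 1e-9, n_iter=30)
```

Each user update only involves element-wise products and two multiplications with U, which is why the cost per user stays constant as users are added.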

Soft-input soft-output MU detectors
With estimates of the transmission channel having been made available by the channel estimator, the next stage of the iterative receiver structure is to produce likelihood-ratios of the coded data symbols. This operation is performed by the MUD, which apart from the received signal and channel estimate, uses a-priori information of the transmitted symbols. This information is provided, from the previous iteration, by the channel decoder. The optimal SISO detector is the symbol-wise MAP detector, implemented through the BCJR algorithm [28]. Unfortunately, the complexity of the MAP detector in the MIMO case is prohibitive in most situations, except for the cases when the number of users K is small. Therefore, reduced complexity techniques have to be considered for most practical applications. Furthermore, although optimal detection is not generally feasible in practice, it remains important as a benchmark reference, and will therefore be considered in this article. The principles behind the MAP algorithm are outlined in Section 4.1.
Many reduced complexity detection algorithms have been proposed in the literature [2]. To restrict the investigations, two such algorithms have been selected and are presented in Section 4.2. Both algorithms are based on PIC. The first algorithm applies a matched filter (MF) after the cancellation, while the other applies an MMSE filter, in an attempt to further suppress the inter-user interference. While the latter approach yields better performance it is also more complex. In later sections we shall investigate whether the performance gain motivates the increased complexity.

Maximum a-posteriori probability
As stated previously, the optimal MUD is the symbol-wise MAP detector. While the PIC-based algorithms, introduced in Section 4.2, only make use of the mean values of the symbols x_k[m, s], the symbol-wise MAP detector works with the probability mass function of x[m, s], denoted P_a(x[m, s]).
In the case of QPSK transmission, the data vector x[m, s] contains 2K code bits, c_1, ..., c_{2K}. The MAP detector computes the marginal mass functions, represented by log-likelihood ratio (LLR) values, for these 2K bits: As was discussed above, the complexity of the symbol-wise MAP detector (16) may in many cases be prohibitively large, motivating the use of low-complexity schemes.
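The exponential cost of symbol-wise MAP detection can be seen from a brute-force sketch that enumerates all 4^K QPSK hypotheses for one subcarrier and OFDM symbol. The bit-to-symbol mapping, the LLR sign convention (positive favours bit 0), and the function name `map_llrs` are assumptions made for this illustration; they are not taken from (16).

```python
import itertools
import numpy as np

def map_llrs(r, H, noise_var, prior_llr=None):
    """Brute-force symbol-wise MAP LLRs for K QPSK users on one subcarrier.

    r: (N,) received vector, H: (N, K) channel, prior_llr: (2K,) a-priori LLRs
    defined as log P(c=0)/P(c=1). Assumed mapping: x = ((1-2c_re) + j(1-2c_im))/sqrt(2).
    """
    N, K = H.shape
    n_bits = 2 * K
    if prior_llr is None:
        prior_llr = np.zeros(n_bits)
    bits = np.array(list(itertools.product([0, 1], repeat=n_bits)))   # (4**K, 2K)
    signs = 1 - 2 * bits
    x = (signs[:, 0::2] + 1j * signs[:, 1::2]) / np.sqrt(2)           # all hypotheses
    metric = -np.sum(np.abs(r[None, :] - x @ H.T) ** 2, axis=1) / noise_var
    metric = metric + 0.5 * (signs @ prior_llr)     # log prior, up to a constant
    llr = np.empty(n_bits)
    for i in range(n_bits):
        llr[i] = (np.logaddexp.reduce(metric[bits[:, i] == 0])
                  - np.logaddexp.reduce(metric[bits[:, i] == 1]))
    return llr

# Noise-free demo with hypothetical dimensions.
rng = np.random.default_rng(0)
K_demo, N_demo = 2, 4
H_demo = (rng.standard_normal((N_demo, K_demo))
          + 1j * rng.standard_normal((N_demo, K_demo))) / np.sqrt(2)
bits_demo = rng.integers(0, 2, size=2 * K_demo)
x_demo = ((1 - 2 * bits_demo[0::2]) + 1j * (1 - 2 * bits_demo[1::2])) / np.sqrt(2)
llr_demo = map_llrs(H_demo @ x_demo, H_demo, noise_var=0.01)
```

The hypothesis table has 4^K rows, so every additional user multiplies the work per subcarrier and OFDM symbol by four.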

PIC based detectors
A popular low-complexity approach is to make use of interference cancellation. Though simple to implement, PIC-based detectors have shown good performance [8,10,29]. The interference cancellation operates separately on each subcarrier and OFDM symbol, and makes use of the most recent channel estimate Ĥ[m] and the soft symbol estimates of x[m, s] from the SISO decoders. Following the notation in (1), the interference cancelled output for user k is given by The first algorithm, which will be referred to as PIC-MF, applies a MF to the interference cancelled output, i.e., where ĥ_{k,:}[m] is an estimate of the channel between user k and the base station. In the case of QPSK, the complex-valued LLRs (with one symbol per complex dimension) are produced as where is the variance of the residual interference plus noise for user k.
The drawback of PIC-MF is that the noise and residual interference are not taken into account when performing user separation. To alleviate this problem, an MMSE filter can be applied instead of the MF. The resulting algorithm will be referred to as PIC-MMSE. An appropriate MMSE filter can be shown to yield [8] The output of the MMSE filter can be modeled as The complex LLR output is then produced as
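As an illustration of the PIC-MF idea, the sketch below cancels the soft-estimated interference from all other users, applies a matched filter, and scales the result into a complex LLR using a Gaussian approximation of the residual interference plus noise. The LLR scaling and the residual-variance expression are derived here under that approximation and an assumed QPSK mapping x = (±1 ± j)/√2; they are not copied from the article's equations, and the function name `pic_mf` is hypothetical.

```python
import numpy as np

def pic_mf(r, H_hat, x_soft, noise_var):
    """PIC followed by a matched filter for all users on one subcarrier.

    r: (N,) received vector, H_hat: (N, K) channel estimate,
    x_soft: (K,) soft symbol estimates from the decoders.
    Returns the MF outputs z (K,) and complex LLRs (K,).
    """
    G = H_hat.conj().T @ H_hat                   # Gram matrix, G[k, j] = h_k^H h_j
    gain = np.real(np.diag(G))                   # ||h_k||^2
    # h_k^H (r - sum_{j != k} h_j x_j), computed for all k at once
    z = H_hat.conj().T @ r - G @ x_soft + gain * x_soft
    resid = np.clip(1.0 - np.abs(x_soft) ** 2, 0.0, None)   # residual symbol variances
    var = noise_var * gain + (np.abs(G) ** 2) @ resid - gain ** 2 * resid
    llr = 2 * np.sqrt(2) * gain / var * z        # real part: first bit, imaginary part: second
    return z, llr

# Noise-free demo with perfect soft symbols and hypothetical dimensions.
rng = np.random.default_rng(3)
N_rx, K_users = 4, 3
H = (rng.standard_normal((N_rx, K_users))
     + 1j * rng.standard_normal((N_rx, K_users))) / np.sqrt(2)
x = (rng.choice([-1, 1], K_users) + 1j * rng.choice([-1, 1], K_users)) / np.sqrt(2)
z, llr = pic_mf(H @ x, H, x, noise_var=0.1)
```

With perfect soft symbols the interference is removed completely and the MF output reduces to the desired symbol scaled by the channel energy; with unreliable soft symbols, the residual term inflates the variance and shrinks the LLRs.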

Complexity analysis
When it comes to practical implementations of iterative MU receivers, complexity considerations are of great importance. Since several receiver iterations are generally needed to reach a desired performance, the total computational effort can grow very large. To get an estimate of this cost, we have chosen to present and compare the complexity of the addressed algorithms in terms of the required number of complex-valued multiplications. This measure is chosen since it provides a reasonable estimate of the complexity, while being analytically tractable. Obviously, the final computational and hardware complexity depends on a large number of parameters, such as memory requirements, parallelization, hardware reuse, word lengths, etc. In the following sections, the complexities of the algorithms for both MUD and channel estimation are presented. The expressions for the complexity of the SISO decoder, which is not treated in a separate section, are given as derived in [30]. The required number of complex multiplications per user for the various algorithms is given in Table 2. Also given in the table is an example of the required number of multiplications per information bit, given QPSK modulation and a rate 1/2 convolutional code, for the following system settings: N = 4 receive antennas, K = 4 users, S = 20 OFDM symbols, S_p = 1 OFDM pilot symbol, M = 256 subcarriers and I = 36 DPS sequences. Note that the DPS sequences, which are used for channel estimation, are assumed to be precalculated and read from memory; thus their construction does not contribute to the computational complexity.

Channel estimator complexity
Three different channel estimation algorithms were presented in Section 3: joint MMSE, Krylov MMSE and SAGE ML. As seen from Table 2, the difference in complexity is significant. For the discussions below, we will assume that the number of OFDM symbols in each block is smaller than the number of subcarriers, i.e., S < M.
Looking at the first algorithm, the optimal joint MMSE algorithm, the complexity is large, as previously discussed. Since all user channels are estimated jointly, using all available frequency and time samples, the dimensionality of the problem to solve becomes very large. Looking at (7), a straightforward implementation would be very costly due to the dimensionality of the involved data structures. Fortunately, considerable reductions can be achieved. Firstly, under the assumption of independent receive antenna channels, the same estimator can be used independently on each antenna. Secondly, under the block fading assumption, the matrix Ξ = X_N Ū_N is the product of a block diagonal matrix and a block matrix with diagonal sub-matrices. Thus, the operations involving this structure can be computed efficiently. It should be noted that under the assumption of independent receive antennas, Ξ is block diagonal with identical sub-matrices. The estimator only involves one of these SM × KI sub-matrices. In the end, the main part of the complexity is related to two operations: the product Ξ^H Δ^{−1} Ξ and the inversion of a KI × KI matrix. The computational complexity of the former is approximately M(IK)^2, and that of the latter approximately (KI)^3. For the system settings considered in this article the two are of comparable size. Also note that the Hermitian properties of the data structures can be exploited to further reduce complexity.
The second algorithm makes use of a Krylov subspace method to avoid the explicit matrix inversion in (7). At the same time, the explicit computation of Ξ^H Δ^{−1} Ξ can be avoided. This will be beneficial as long as S < M. Referring back to Section 3.2 and Table 1, the main part of the complexity lies in calculating A v_s, which is performed once for every subspace dimension S_k. From a complexity point of view, it is preferable to keep S_k low. On the other hand, too small a value will provide a poor approximation of the matrix inverse, and thus poor performance. The value thus needs to be chosen with care, trading complexity for performance. An upper limit on the number of dimensions may be set by timing constraints in the receiver.
The last algorithm, based on SAGE, has the lowest complexity and performs a separate channel estimate for each user channel. SAGE ML has less than half the complexity of Krylov MMSE with S_k = 1. This suboptimal approach has an attractively low complexity and, as will be seen in Section 6, also delivers good performance. The complexity is linear in the number of users, i.e., the complexity per user is constant. The main part of the complexity is shared between the per-symbol estimate, the interference cancellation, and the subspace filtering, i.e., the utilization of the frequency correlation.
The former two are proportional to the number of OFDM symbols S, while the latter is proportional to the subspace order I, all with the same proportionality constant. The complexity can thus be reduced by lowering the number of OFDM symbols taken into account when performing the estimation, or by reducing I. Both actions would come at the price of a performance loss.
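The claim that the two dominant terms are of comparable size can be checked by plugging in the system settings used in this article (M = 256, I = 36, K = 4). The short sketch below only evaluates the two approximate expressions quoted above; the variable names are illustrative.

```python
# Dominant joint MMSE cost terms (complex multiplications, per antenna),
# evaluated for the system settings quoted in the article.
M, I, K = 256, 36, 4

gram_cost = M * (I * K) ** 2       # approximate cost of forming the KI x KI product
inverse_cost = (K * I) ** 3        # approximate cost of inverting a KI x KI matrix
ratio = gram_cost / inverse_cost   # equals M / (K * I)

print(gram_cost, inverse_cost, round(ratio, 2))
```

The ratio reduces to M/(KI), so the matrix product dominates whenever the number of subcarriers exceeds the total number of estimated coefficients per antenna.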

MUD complexity
As for the different channel estimation algorithms, the complexity of the considered MUDs differs significantly, as seen from Table 2. The one with the lowest complexity is the PIC-MF, which due to its simplicity requires relatively few arithmetic operations. The complexity is shared between the interference cancellation plus MF, and generating the LLRs, the former requiring slightly more computational effort. Despite its low complexity, as will be seen in Section 6, the performance is still competitive at low user loads. Using a soft information based MMSE filter instead of the MF, the performance will be shown to improve. This comes at the cost of an increased complexity due to the MMSE filter in (21), which needs to be calculated for each user and for each data symbol. The filter includes an inverse of a K × K matrix. At high user loads, computing the inverse will dominate the complexity. If the number of users grows very large, subspace methods such as the one used in the Krylov MMSE estimator could be used to reduce the complexity.
If the optimal MAP receiver is considered, the complexity is significantly increased. The complexity, as derived in [31], grows exponentially in the number of users. For few users, the complexity is manageable, but as the number of users grows, it rapidly becomes prohibitive. It should be noted that there exist a number of reduced complexity MAP-like detectors which are based upon searching trees [32,33], which are not included in our comparison.

Simulation results
In order to investigate the receiver performance under the use of the different algorithms, computer simulations were performed. In the simulations, each user transmits S = 20 OFDM symbols, each with M = 256 subcarriers. If nothing else is stated, a single OFDM symbol is dedicated for training information, i.e., S p = 1, which is generated randomly for each user. Non-orthogonal transmission of the pilot symbols are assumed, i.e., all users transmit their pilot symbols simultaneously in time and frequency. This may incur a loss in performance, but is motivated by the flexibility it brings to the system configuration if varying number of users is to be supported. A rate 1/2 convo-lutional code with generator polynomial (7, 5) 8 is used to generate the code bits, which after random interleaving are mapped to QPSK symbols. For the receiver, we are restricting the investigation to N = 4 antennas, while different number of transmitting users are considered.
A fading multi-path IID channel is assumed, mimicking a rich scattering environment. The channel impulse response between user k and receive antenna n is given by [34] g k,n (τ ) = P−1 p=0 α p,k,n δ τ − τ p,k,n , where a p,k,n are zero-mean complex Gaussian random variables with an exponential power delay profile, θ τ p,k,n = Ce −τ p,k,n /τ rms , where C is a constant, and the delays τ p,k,n are uniformly distributed within the CP. In this article, the length of the channel, normalized to the symbol duration, is τ max = 0.1, the root mean square delay spread set to τ rms = 0.03, and the number of multi-path components P = 100. The channel delay is assumed to be no longer than the CP, and the block fading channel is generated independently for each user and receive antenna link. The number of DPS sequences used in the channel estimation process is chosen as I = 36, guided by the discussion in Section 2.2, and adding a few for improved performance at high SNR. The subspace order in Krylov MMSE estimator is set to S k = 5, if nothing else is stated.
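One realization of the multi-path channel described above can be drawn with a few lines of NumPy. The sketch makes two assumptions that the text leaves open: the delays are drawn uniformly on [0, τ_max] (normalized to the OFDM symbol duration, so that the subcarrier spacing is one), and the constant C is chosen so that the power delay profile sums to one. The function name `draw_channel` is hypothetical.

```python
import numpy as np

def draw_channel(M, P, tau_max, tau_rms, rng):
    """Draw one frequency response (M subcarriers) for a single user-antenna link."""
    tau = rng.uniform(0.0, tau_max, size=P)              # assumed delay distribution
    pdp = np.exp(-tau / tau_rms)                         # exponential power delay profile
    pdp /= pdp.sum()                                     # assumed normalization of C
    alpha = np.sqrt(pdp / 2) * (rng.standard_normal(P) + 1j * rng.standard_normal(P))
    m = np.arange(M)[:, None]
    return np.exp(-2j * np.pi * m * tau[None, :]) @ alpha   # Fourier transform of the taps

rng = np.random.default_rng(4)
M, P, tau_max, tau_rms = 256, 100, 0.1, 0.03             # settings used in the article
h = draw_channel(M, P, tau_max, tau_rms, rng)
avg_power = np.mean([np.mean(np.abs(draw_channel(M, P, tau_max, tau_rms, rng)) ** 2)
                     for _ in range(200)])
```

Independent calls give independent links, matching the IID assumption across users and receive antennas.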
In the following, the motivation for performing the computationally complex operation of channel estimation inside the loop of an iterative receiver is first illustrated with an example. In the example, the average BER performance at different $E_b/N_0$ is compared for receivers using the channel estimator inside or outside of the iterative loop. It will be seen that performing the estimation inside the loop can provide significant performance gains. Here, $E_b$ is the average bit energy at the receiver. Furthermore, the impact of the array gain has been removed by scaling the noise variance by $N$.
We then study the evolution of the BER and of the MSE of the channel estimate over the receiver iterations. This is done for different user loads. The results illustrate the difference in convergence speed between the receiver configurations, which is important when assessing the total computational complexity needed to reach a certain level of performance. Finally, the convergence analysis is extended with the use of EXIT charts, providing additional insight into the receiver.

Illustration of the gains of using channel estimation inside the detection loop
As was seen in Section 5, performing channel estimation adds significantly to the total receiver complexity. Furthermore, with the estimation inside the loop of an iterative receiver, this costly operation needs to be performed multiple times. It would therefore, from a complexity point of view, be attractive to move the estimation outside the loop, performing it only once for each code block based on the transmitted pilot symbols.
To illustrate the motivation for using channel estimation inside the iterative receiver, simulations are performed for a system with N = 4 receive antennas and K = 4 users. Two different receiver configurations are considered. The first performs pilot-based channel estimation only, while the other performs channel estimation inside the iterative loop. For both receivers, the MAP MUD is used in combination with the joint MMSE channel estimator. In Figure 2, the BER performance is shown for different numbers of transmitted pilot symbols. For the purely iterative receiver, only one pilot OFDM symbol is used, while for the other receiver $S_p = 1$, 2, and 10 pilot symbols are transmitted. For comparison, single-user performance when perfect channel state information (PCSI) is available at the receiver is also shown. In addition, an example with orthogonal pilots is provided, where the users consecutively transmit one pilot symbol each during the first four symbol intervals. Each pilot has been boosted to contain the equivalent energy of four regular symbols.
As seen from the figure, if only pilot-based estimates are used, there is a significant performance loss compared to using channel estimation in the iterative loop. For few pilot symbols, a loss in performance of 1-3 dB is observed, while if the number of pilot symbols is increased to $S_p = 10$, the loss is small. Recall that the total number of OFDM symbols in a block is S = 20; transmitting ten pilot symbols thus yields a 50% pilot overhead, which is unacceptable for most applications. Transmitting orthogonal boosted pilots also results in a loss of up to 1 dB. The performance achieved with orthogonal pilots is only slightly better than when transmitting $S_p = 4$ non-orthogonal pilots, since joint channel estimation is performed. Furthermore, if the channel estimates are updated iteratively, close to single-user performance with PCSI is achieved. It can therefore be concluded that the use of channel estimation inside an iterative receiver can give significant performance gains compared to purely pilot-based approaches. This means that the pilot density can be kept low without sacrificing performance, thus improving the system throughput.
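The overhead figures quoted above follow from simple arithmetic, sketched here for the block length S = 20. The helper names are hypothetical, and the throughput measure ignores the cyclic prefix, which is an assumption made to keep the example short.

```python
S_TOTAL = 20         # OFDM symbols per block, as stated in the text
CODE_RATE = 0.5      # rate-1/2 convolutional code
BITS_PER_QPSK = 2    # code bits carried by one QPSK symbol

def pilot_overhead(s_pilot, s_total=S_TOTAL):
    """Fraction of the block spent on pilot OFDM symbols."""
    return s_pilot / s_total

def info_bits_per_subcarrier_use(s_pilot, s_total=S_TOTAL):
    """Average information bits per user and subcarrier use (cyclic prefix ignored)."""
    return CODE_RATE * BITS_PER_QPSK * (1.0 - pilot_overhead(s_pilot, s_total))

for s_p in (1, 2, 4, 10):
    print(s_p, pilot_overhead(s_p), info_bits_per_subcarrier_use(s_p))
```

Under these assumptions, going from $S_p = 10$ to $S_p = 1$ raises the share of the block that carries data from 50% to 95%.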

Convergence performance: BER and MSE
In the previous section we illustrated how iterative channel estimation can provide a significant performance gain. At the same time, the complexity can be significant, as seen in Section 5. Since the computational cost increases linearly with the number of iterations, the convergence properties of the different receiver configurations are of importance. To illustrate these properties, the BER and the MSE are shown as functions of the number of iterations in Figures 3 and 4, respectively. The results are shown for the cases of K = 4 and K = 7 users, at $E_b/N_0 = 10$ dB.
Starting with the BER in Figure 3, it is clear that the convergence properties differ between algorithm combinations. At the smaller user load, i.e., K = 4, the difference in convergence is relatively small, with all algorithms reaching roughly the same BER within 3-8 iterations. The fastest convergence is achieved using the MAP-based MUD with joint MMSE channel estimation, while the slowest is obtained when using the PIC-MF detector with SAGE ML estimation. When the Krylov MMSE estimator is used with $S_K = 5$, a small performance loss compared to joint MMSE is observed; increasing this value to $S_K = 10$ gives performance close to that of joint MMSE. At a system load of K = 7 users, a behavior similar to that for K = 4 is seen. Comparing the performance achieved with the different MUDs, the best performance is given by the MAP detector. A gain of 1-5 iterations over the PIC-MMSE detector is observed. There is a large difference in convergence depending on which estimator is used, and additional insight into this will be given when looking at the EXIT charts in the next section. Furthermore, at this high user load, the PIC-MF detector cannot provide sufficient detection performance for receiver convergence. It is also interesting to note that performance close to that of a single user with PCSI at the receiver is achieved for all receiver configurations, except for PIC-MF at K = 7 users. This illustrates the good performance obtained by the iterative receiver.
Looking at the average MSE, as shown in Figure 4, similar trends as for the BER are seen. The convergence speed of the joint MMSE estimator is better than that of SAGE ML, and the difference increases with the user load. Furthermore, in the first iteration, only pilot symbols are used for channel estimation, and a large MSE is obtained due to the relatively small number of available pilots. In the iterative process, as the reliability of the

Convergence performance: EXIT charts
Even though the BER and MSE convergence provide some insight on the behavior of the different algorithms, they have some limitations. One significant drawback is that the performance of the channel estimation and detection algorithms cannot be separated from that of the code. Other means are therefore of interest for the receiver evaluation.
One popular technique for visualizing the convergence behavior of iterative decoders is the EXIT chart [16]. The charts are used to visualize the exchange of extrinsic information between the SISO units making up an iterative decoder. In [35], it was shown that the MUD could be seen as a SISO unit serially concatenated with the outer channel decoder. In our case, we have three units: the MUD, the channel estimator and the decoder. Even though it is possible to visualize the exchange between all three SISO units [36,37], it is more convenient to combine the estimator and the MUD into a single SISO unit [38], referred to as MUD/CE.
In order to produce an EXIT chart, information transfer functions of the SISO units have to be produced. Each unit can be seen as an LLR transformer ($\Lambda_a \to \Lambda_{ext}$), where the transfer function measures the improvement of the LLR transformation in terms of mutual information between the LLRs and the underlying variables $x$. The transfer function is given as [39]

$$I_{ext} = T(I_a),$$

where $I_a = I(x; \Lambda_a)$ is the a priori input mutual information and $I_{ext} = I(x; \Lambda_{ext})$ is the output extrinsic information.
When producing the transfer functions, all elements of $\Lambda_{ext}$ (becoming $\Lambda_a$ for the next component decoder) are assumed to be independent and to follow a Gaussian distribution, $\mathcal{N}(x\mu_{ext}, \sigma_{ext}^2)$, with the consistency condition $\mu_{ext} = \sigma_{ext}^2/2$ and where $x = \pm 1$. With this distribution of the LLRs, there is a one-to-one mapping between the mutual information $I_{ext}$ and the variance $\sigma_{ext}^2$ given by

$$I_{ext} = J(\sigma_{ext}^2),$$

where the J-function is defined in [16]. When generating the transfer functions, the J-function is used for generating input sequences with different a priori information content. More specifically, given an input symbol $x$ and a value for the a priori information $I_a$, the input LLRs are given by

$$\Lambda_{ext}(I_a) = \frac{\sigma_{ext}^2}{2}\,x + \sigma_{ext}\,w,$$

where $w \sim \mathcal{N}(0, 1)$ and $\sigma_{ext}^2 = J^{-1}(I_a)$.

For the MUD/CE, as shown in Figure 1, the transfer function is now derived for a number of values $I_{ext} \in [0, 1]$. We first generate the soft input symbols $\tilde{x}_k = \tanh(\Lambda_{ext}(I_a)/2)$ and the known pilot symbols for all users. After QPSK mapping, channel estimation and MUD are performed. The LLR output generated by the MUD is then fed to a sink, where the mutual information is computed through [39]

$$I(x; \Lambda) = \frac{1}{2} \sum_{x = \pm 1} \int_{-\infty}^{\infty} p(\Lambda \mid x) \log_2 \frac{2\,p(\Lambda \mid x)}{p(\Lambda \mid +1) + p(\Lambda \mid -1)} \, d\Lambda,$$

where the probability density function, $p(\Lambda \mid x)$, is approximated using histogram calculations. The transfer functions are then averaged over 20 channel realizations. The transfer function for the SISO decoder is obtained in a similar way.

When generating the transfer function for the MUD/CE, an initial guess for the Krylov MMSE and SAGE ML estimators has to be provided. In the receiver, this value is given by the estimate obtained in the previous iteration. Since this value is unknown here, we run the channel estimator twice, first initialized with the all-ones channel and then reinitialized with the new output. This potentially leads to overestimated performance at high $I_{ext}$. For SAGE ML, it also leads to underestimated performance at low values.
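The a priori LLR generation and the histogram-based mutual information measurement described above can be sketched as follows. This is a minimal illustration under stated assumptions: the J-function is evaluated here by numerical integration of the Gaussian LLR density and parameterized by the variance, its inverse is found by bisection, equiprobable symbols $x = \pm 1$ are assumed, and all function names, bin counts and sample sizes are hypothetical choices rather than those used for the reported results.

```python
import math
import random

def j_function(variance, steps=2000):
    """Mutual information carried by consistent Gaussian LLRs with the given variance."""
    if variance < 1e-12:
        return 0.0
    sigma = math.sqrt(variance)
    mu = variance / 2.0
    lo = mu - 10.0 * sigma
    h = 20.0 * sigma / steps
    total = 0.0
    for i in range(steps + 1):  # trapezoidal rule over mu +/- 10 sigma
        xi = lo + i * h
        pdf = math.exp(-((xi - mu) ** 2) / (2.0 * variance)) / (sigma * math.sqrt(2.0 * math.pi))
        # log2(1 + exp(-xi)), written so that exp() never overflows
        log_term = (max(-xi, 0.0) + math.log1p(math.exp(-abs(xi)))) / math.log(2.0)
        total += (0.5 if i in (0, steps) else 1.0) * pdf * log_term
    return 1.0 - total * h

def j_inverse(info, lo=0.0, hi=400.0):
    """Variance giving the requested mutual information, found by bisection."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if j_function(mid) < info:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def a_priori_llrs(x, variance, rng):
    """LLRs distributed as N(x * variance / 2, variance) for symbols x in {+1, -1}."""
    sigma = math.sqrt(variance)
    return [0.5 * variance * xi + sigma * rng.gauss(0.0, 1.0) for xi in x]

def soft_symbols(llrs):
    """Soft symbol estimates tanh(LLR / 2)."""
    return [math.tanh(l / 2.0) for l in llrs]

def mutual_information_hist(x, llrs, bins=100):
    """Histogram estimate of I(x; LLR) in bits, assuming equiprobable x = +/-1."""
    lo, hi = min(llrs), max(llrs)
    width = (hi - lo) / bins or 1.0
    counts = {1: [0] * bins, -1: [0] * bins}
    for xi, l in zip(x, llrs):
        counts[xi][min(int((l - lo) / width), bins - 1)] += 1
    n_pos, n_neg = sum(counts[1]), sum(counts[-1])
    info = 0.0
    for b in range(bins):
        p_pos, p_neg = counts[1][b] / n_pos, counts[-1][b] / n_neg
        for p in (p_pos, p_neg):
            if p > 0.0:
                info += 0.5 * p * math.log2(2.0 * p / (p_pos + p_neg))
    return info

rng = random.Random(0)
x = [rng.choice((1, -1)) for _ in range(20000)]
for i_a in (0.1, 0.5, 0.9):
    llrs = a_priori_llrs(x, j_inverse(i_a), rng)
    print(i_a, round(mutual_information_hist(x, llrs), 3))
```

Feeding the generated LLRs straight into the histogram estimator, as in the loop above, should return values close to the requested $I_a$, which is a useful sanity check before a MUD/CE or decoder is placed between the two steps.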
In Figure 5, the EXIT chart is shown for the different receiver combinations for the case of N = 4 receive antennas and K = 4 users at $E_b/N_0 = 10$ dB. The transfer functions in the case of PCSI are also shown. Furthermore, the convergence path for PIC-MF with SAGE ML estimation is shown as a dashed line, and the receiver is estimated to converge in five iterations. This coincides with the observation for the BER in Figure 3. For the receivers where SAGE ML is used, a dip is seen in the transfer function at low $I_a$. This occurs because the algorithm does not take the quality of the soft symbols into account, and thus produces estimates based on very unreliable hard estimates of the transmitted symbols. This dip could be partly removed if only pilots were considered in the estimator ($I_a = 0$) when the reliability of the produced soft symbols is low.
Comparing the channel estimation algorithms, Krylov MMSE, used with $S_K = 5$, delivers performance identical to joint MMSE. For SAGE ML, the performance is much worse, but the performance at low $I_a$ is somewhat underestimated, as discussed above. From Figure 5, we also see the impact of inaccurate CSI, which shows up as a gap between the transfer functions obtained when using channel estimation and when having PCSI. As the reliability of the a priori information increases, this gap decreases since the produced estimates become increasingly accurate. Looking at the MUDs, the MAP detector has the best performance, followed by PIC-MMSE and PIC-MF. Furthermore, when the SNR is reduced (essentially leading to a downward shift of the transfer functions of the MUD/CE), or when the user load is increased (essentially changing the slope of the transfer functions), the PIC-MF will be the first MUD to close the gap to the SISO decoder transfer function, and thus to fail to converge.
Overall, we see that the insight given by the EXIT chart matches fairly well with what was observed for the BER. Furthermore, for the MAP detector with K = 7 users in Figure 3, a large difference in convergence performance between using the MMSE estimators and SAGE ML was observed. This can be explained by the fact that the gap in the EXIT chart is smaller for the latter estimator. From an algorithm design point of view, it is also interesting to observe that, for the case presented in Figure 5, there is still room for further simplifications of the receiver structure. Additionally, the performance obtained when using an alternative channel code can be estimated by replacing the transfer function for the chosen convolutional code in Figure 5 with that of the alternative code.
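The way a convergence path is traced between two transfer functions, and how a downward shift of the MUD/CE curve can close the gap to the decoder curve, can be illustrated with a small sketch. Both transfer functions below are hypothetical toy curves chosen only for illustration (a straight line for the MUD/CE and a smooth S-shaped curve for the decoder); they are not the curves of Figure 5, and the offsets, slope and stopping tolerance are assumptions.

```python
def mud_ce_transfer(i_a, offset):
    """Hypothetical MUD/CE curve: extrinsic output grows linearly with the a priori input."""
    return min(1.0, offset + 0.4 * i_a)

def decoder_transfer(i_in):
    """Hypothetical S-shaped decoder curve mapping input to extrinsic output information."""
    return 3.0 * i_in ** 2 - 2.0 * i_in ** 3

def exit_trajectory(offset, max_iter=50, tol=1e-3):
    """Trace the staircase path; stop when an iteration no longer adds information."""
    path = []
    i_a = 0.0  # first pass: no a priori information from the decoder
    for _ in range(max_iter):
        i_mud = mud_ce_transfer(i_a, offset)
        i_dec = decoder_transfer(i_mud)
        path.append((i_mud, i_dec))
        if i_dec - i_a < tol:
            break  # the path has reached an intersection of the two curves
        i_a = i_dec
    return path

open_tunnel = exit_trajectory(offset=0.55)    # MUD/CE curve stays above the decoder curve
closed_tunnel = exit_trajectory(offset=0.30)  # shifted down, mimicking a lower SNR

print(len(open_tunnel), round(open_tunnel[-1][1], 3))
print(len(closed_tunnel), round(closed_tunnel[-1][1], 3))
```

In the first case the path climbs until the decoder output information is close to one, while in the second it stalls at an early intersection of the two curves, which corresponds to a receiver that fails to converge.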

Complexity versus performance trade-off
From a receiver design point of view, the trade-off between performance and complexity is an important aspect. In an attempt to shed some light on this aspect, the total receiver complexity, in terms of the number of complex multiplications needed to reach a specific target BER, is investigated. The total complexity depends on the choice of channel estimator and MUD, as well as on the number of iterations needed to reach the target. For the evaluation, a target BER of $10^{-3}$ is chosen. The system settings are the same as described in Section 6, i.e., N = 4 receive antennas, $S_p = 1$ and S = 20 OFDM symbols, M = 256 subcarriers and $I = 36$ DPS sequences. The subspace order in Krylov MMSE is set to $S_K = 5$.
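The evaluation principle described above, total complexity as the per-iteration cost times the number of iterations needed to reach the target BER, can be sketched as follows. All numbers in the table are hypothetical placeholders invented for the example and are not taken from the simulations or from Section 5; only the target BER of $10^{-3}$ and the linear growth of cost with iterations come from the text.

```python
TARGET_BER = 1e-3

# Hypothetical per-iteration cost (complex multiplications per information bit)
# and hypothetical BER after each receiver iteration, for illustration only.
configs = {
    "MAP + joint MMSE":       {"cost_per_iter": 400.0, "ber": [2e-2, 8e-4, 1e-4]},
    "PIC-MMSE + Krylov MMSE": {"cost_per_iter": 30.0,  "ber": [5e-2, 5e-3, 7e-4]},
    "PIC-MF + SAGE ML":       {"cost_per_iter": 12.0,  "ber": [1e-1, 3e-2, 8e-3, 2e-3, 6e-4]},
}

def iterations_to_target(ber_per_iter, target=TARGET_BER):
    """First iteration (1-based) at which the BER reaches the target, or None."""
    for iteration, ber in enumerate(ber_per_iter, start=1):
        if ber <= target:
            return iteration
    return None

def total_cost(config, target=TARGET_BER):
    """Total multiplications to reach the target; cost grows linearly with iterations."""
    iterations = iterations_to_target(config["ber"], target)
    return None if iterations is None else iterations * config["cost_per_iter"]

costs = {name: total_cost(cfg) for name, cfg in configs.items()}
reachable = {name: c for name, c in costs.items() if c is not None}
best = min(reachable, key=reachable.get)
print(costs)
print("lowest total complexity:", best)
```

With placeholder numbers of this kind, a cheap configuration can win even though it needs more iterations, because its per-iteration cost is much lower; a configuration that never reaches the target is excluded from the comparison.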
To start with, the case of K = 4 users, signaling at $E_b/N_0 = 10$ dB, is considered. In Figure 6, the BER is shown as a function of the number of complex multiplications per information bit. As was previously seen in Figure 3, under these system settings, all receivers reach the same BER of approximately $10^{-4}$. On the other hand, looking at the number of multiplications needed to reach this value, there is more than an order of magnitude difference between the receiver configurations. The receiver configurations using the MAP detector are found on the right, requiring the largest number of multiplications to reach convergence. To the left, we find the PIC-based MUDs using SAGE ML, which provide the cheapest alternative. Looking at the target BER of $10^{-3}$, the algorithms with the lowest total complexity are PIC-MF followed by PIC-MMSE, reaching the target in about 70 and 100 complex multiplications per information bit, respectively. When using the Krylov MMSE estimator, we see that PIC-MF and PIC-MMSE reach the target using approximately the same number of multiplications, though PIC-MF requires one more iteration.
Finally, an overview of which algorithm combinations to choose in different scenarios is given. In Figure 7, the receiver configuration with the lowest total complexity, at different user loads and $E_b/N_0$, is shown for a target BER of $10^{-3}$. The marker shape indicates which MUD is used, while the color indicates the choice of channel estimation algorithm. Due to their large complexity, neither the MAP detector nor the joint MMSE estimator is competitive in any of the evaluated scenarios, not even at high system loads. Note that eight users is at the limit of what the system can handle; still, even with suboptimal algorithms, a low BER can be achieved. Overall, the most favorable receiver configuration from a complexity point of view is the PIC-MF MUD combined with the SAGE ML estimator. At higher user loads, though, the PIC-MMSE detector gives the best trade-off between complexity and performance. For channel estimation, using anything but SAGE ML is in general not required for the considered system.
The results shown in Figure 7 take only the overall computational complexity into account and may therefore fail to show other interesting trade-offs. An example of this is seen in Figure 3, where the difference in convergence speed between the algorithms is large. Depending on the hardware architecture used, this may affect the latency of the system, and for time-critical systems, the best choice of algorithm combination may therefore be a different one. We believe, however, that our evaluation shows that combinations of algorithms with low computational complexity, when used in an iterative receiver, can deliver very competitive performance for a large range of scenarios.

Conclusion
In this article, we have studied the trade-off between complexity and performance for uplink receivers in a packet-based MU MIMO-OFDM system. The considered iterative receivers contained three main components: a MUD, a channel estimator and a convolutional decoder. Three different MUD algorithms were considered: two suboptimal approaches based on PIC and one optimal approach based on MAP. For channel estimation, three algorithms were evaluated: one optimal joint MMSE based estimator, a low-complexity Krylov subspace based version of the same, and one suboptimal estimator based on SAGE. The difference in complexity between the algorithms was shown to be large. When only considering performance, the high-complexity algorithms naturally showed the fastest convergence. The low-complexity algorithms showed BER performance similar to that of the more complex ones, when converging, but at a generally slower convergence speed. More insight into the convergence was also provided through EXIT charts. When taking complexity into account, we demonstrated that the suboptimal low-complexity algorithms are often the most attractive choice. Even though a larger number of receiver iterations was needed, the total number of complex multiplications was still lower, due to a significantly lower computational cost per iteration. At the same time, it should be noted that the simplest receiver failed earlier than the others at high user loads, which indicates that an appropriate balance between complexity reduction and performance needs to be achieved. Furthermore, for time

Endnote
a In general the notation will be that sub-indices state which user and receive antenna is considered, while the time and frequency position will be given in brackets.