Algorithm and VLSI Design for 1-Bit Data Detection in Massive MIMO-OFDM

The use of low-resolution data converters in the radio-frequency (RF) chains of all-digital massive multiple-input multiple-output (MIMO) basestations promises significant reductions in power consumption, hardware costs, and interconnect bandwidth. We propose a quantization-aware data-detection algorithm which mitigates the performance loss of 1-bit quantized massive MIMO orthogonal frequency-division multiplexing (OFDM) systems. Since the system performance heavily depends on the quality of channel estimates, we also develop a nonlinear 1-bit channel estimation algorithm that builds upon the proposed data detection algorithm. We show that the proposed algorithms significantly outperform linear data detectors and channel estimators in terms of bit error rate. For the proposed nonlinear data detection algorithm, we develop a very large scale integration (VLSI) architecture and present implementation results on a Xilinx Virtex-7 field programmable gate array (FPGA). Our implementation results are, to the best of our knowledge, the first for 1-bit massive MU-MIMO-OFDM systems and demonstrate comparable hardware efficiency with respect to state-of-the-art linear data detectors designed for systems with high-resolution data converters, while achieving lower bit error rate.

of high-resolution (e.g., 10-bit to 12-bit) analog-to-digital converters (ADCs). The presence of hundreds of such high-quality RF chains inevitably results in high power consumption, interconnect data rates, and hardware costs, especially when deployed for the large bandwidths offered at millimeter-wave (mmWave) frequencies [6], [7].
To mitigate these issues, one can deploy low-resolution ADCs [8]-[11], which is motivated by the observation that the power consumption of ADCs scales exponentially with the number of quantization bits [12]. Another benefit of deploying low-resolution ADCs is that the data rates on the fronthaul link, which connects the baseband unit and remote radio head, can be lowered significantly. In addition, the quality requirements on the RF circuitry (e.g., low-noise amplifiers, mixers, and filters) can be relaxed, which enables further power and cost savings. All these benefits are attained to their greatest extent for the case of 1-bit ADCs, which can be implemented with simple comparators and eliminate the need for automatic gain control (AGC) circuits. This results in even more savings in RF chain power consumption and cost.
However, due to the strong nonlinearity introduced by 1-bit ADCs, baseband processing tasks including data detection become more challenging in these systems. In this paper, we focus on data detection and channel estimation in massive MU-MIMO systems with 1-bit ADCs, operating over frequency-selective channels. Our results can be extended to the case of multi-bit ADCs using the general framework in [13], without significantly affecting the algorithm complexity. A detailed study of the multi-bit case is left for future work.

A. Related Previous Work
Linear channel estimation and data detection algorithms for 1-bit massive MU-MIMO systems, such as maximum-ratio combining (MRC) and linear minimum mean square error (L-MMSE), have been studied in [8], [14]-[16] for systems operating in frequency-flat channels. These papers demonstrate that reliable multiuser communication is possible, even for higher-order constellations [16]. Furthermore, the results in [8] show that 3-bit to 4-bit ADC resolution is sufficient to approach the achievable rates of infinite-resolution data converters.
The practically more relevant case of frequency-selective channels has been studied in [13], [17]. In [17], it has been demonstrated that linear data detection achieves acceptable performance for wideband systems with 1-bit ADCs, assuming that the channel has a sufficiently large number of taps. For massive MU-MIMO systems with orthogonal frequency division multiplexing (OFDM), it was shown in [13] that 4-bit to 6-bit ADCs are sufficient to achieve similar performance as systems with infinite-resolution data converters. While these results show that linear channel estimators and data detectors can be used in conjunction with 4-bit to 6-bit ADCs, sophisticated nonlinear channel estimation and data detection algorithms are necessary for systems that use ADCs with 3 bits or fewer.
arXiv:2009.02068v1 [eess.SP] 4 Sep 2020
To improve the performance of 1-bit massive MU-MIMO systems, an L-MMSE channel estimator based on Bussgang's decomposition [18] has been developed in [19]. Sophisticated nonlinear channel estimation and data detection algorithms have been proposed in [10], [11], [13], [20]-[23] for systems with low-resolution ADCs. The methods in [20]-[23] perform data detection using generalized approximate message passing, which enables excellent error-rate performance for Rayleigh-fading channels, but at the cost of high complexity and rather poor performance in correlated or line-of-sight propagation conditions; the less complex methods in [10], [11] are only suitable for frequency-flat systems. The channel estimators and data detectors in [13] rely on a convex-optimization procedure, which performs well under realistic propagation conditions in coarsely-quantized massive MU-MIMO-OFDM systems but at the cost of high complexity. However, to the best of our knowledge, none of the above algorithms has been implemented in hardware.
A large number of data detector hardware designs for massive MU-MIMO systems have been proposed in the past; see, e.g., [24]-[31]. All of these data detectors have been designed for BS architectures with high-resolution ADCs. Furthermore, these implementations are suitable for frequency-flat channels, or rely on OFDM or single-carrier frequency-division multiple access (SC-FDMA) to decompose frequency-selective channels into orthogonal, frequency-flat subcarriers. However, due to the severe distortion caused by 1-bit ADCs, OFDM and SC-FDMA processing no longer results in orthogonal and frequency-flat subcarriers [13]. Hence, the use of conventional data detectors that have been designed with frequency-flat channels and high-resolution ADCs in mind inevitably results in poor performance in BS architectures that use 1-bit ADCs.

B. Contributions
We propose a new data detection algorithm and develop a corresponding VLSI design specialized for 1-bit massive MU-MIMO-OFDM systems operating over frequency-selective channels. Our contributions are summarized as follows:
• We propose a nonlinear quantization-aware data detection algorithm that solves a relaxed version of the ML detection problem in 1-bit massive MU-MIMO-OFDM systems. The proposed algorithm includes optimizations that enable an efficient VLSI implementation.
• Based on our data detection algorithm, we develop a nonlinear channel estimation algorithm that mitigates the performance loss under 1-bit quantization. We further improve the quality of the channel estimates using a time-domain maximum likelihood estimator, which exploits correlation across subcarriers to denoise the channel estimates.
• We use simulations to demonstrate that the proposed channel estimation and data detection methods outperform linear algorithms for frequency-selective channels.
• We present an efficient VLSI architecture for the proposed data detection algorithm and show the first implementation results of a 1-bit massive MU-MIMO-OFDM data detector on a field programmable gate array (FPGA). Our simulations and FPGA implementation results show that our design achieves comparable hardware efficiency but (often significantly) lower error rate compared to data detectors that have been designed for systems with high-resolution ADCs.

C. Notation
Boldface lowercase and uppercase letters represent column vectors and matrices, respectively. For a matrix $\mathbf{A}$, the transpose and Hermitian transpose are denoted by $\mathbf{A}^T$ and $\mathbf{A}^H$, respectively, the $k$th column is $\mathbf{a}_k = [\mathbf{A}]_k$, and the entry on the $m$th row and $n$th column is $A_{m,n} = [\mathbf{A}]_{m,n}$. The $\ell_2$-norm of a vector $\mathbf{a}$ and the Frobenius norm of a matrix $\mathbf{A}$ are $\|\mathbf{a}\|_2$ and $\|\mathbf{A}\|_F$, respectively. The diagonal matrix with main diagonal given by the vector $\mathbf{a}$ is $\mathbf{A} = \mathrm{diag}(\mathbf{a})$. The $M \times N$ all-zeros and $N \times N$ identity matrices are $\mathbf{0}_{M \times N}$ and $\mathbf{I}_N$, respectively. The $N \times N$ discrete Fourier transform (DFT) matrix is denoted by $\mathbf{F}$ and normalized so that $\mathbf{F}\mathbf{F}^H = \mathbf{I}_N$. For a vector $\mathbf{a}$, the $k$th entry is denoted by $a_k = [\mathbf{a}]_k$, and the real and imaginary parts are $\Re(\mathbf{a}) = \mathbf{a}^R$ and $\Im(\mathbf{a}) = \mathbf{a}^I$, respectively.

We use $\odot$ to define an extended dot product that takes two $M \times N$ matrices $\mathbf{A}$ and $\mathbf{B}$ and returns an $M \times 1$ vector $\mathbf{c} = \mathbf{A} \odot \mathbf{B}$, whose $m$th element is the dot product of the $m$th rows of $\mathbf{A}$ and $\mathbf{B}$, i.e., $c_m = [\mathbf{A}^T]_m^T [\mathbf{B}^T]_m$. We use $\circ$ to define a Hadamard product that takes two $M \times N$ matrices $\mathbf{A}$ and $\mathbf{B}$ and returns an $M \times N$ matrix $\mathbf{C} = \mathbf{A} \circ \mathbf{B}$ whose entry on the $m$th row and $n$th column is $C_{m,n} = A^R_{m,n} B^R_{m,n} + j A^I_{m,n} B^I_{m,n}$. The signum function $\mathrm{sign}(\cdot)$ operates entry-wise on vectors and for each entry $x$ returns $+1$ if $x > 0$ and $-1$ otherwise. A proper complex-valued zero-mean Gaussian vector $\mathbf{a}$ with covariance matrix $\boldsymbol{\Sigma}$ is denoted by $\mathbf{a} \sim \mathcal{CN}(\mathbf{0}, \boldsymbol{\Sigma})$.
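The two non-standard products defined in this subsection can be illustrated with a minimal Python sketch. The function names `extended_dot` and `split_hadamard` are illustrative choices rather than names from the paper, and matrices are represented as plain lists of rows.

```python
def extended_dot(A, B):
    """Extended dot product: c[m] is the (unconjugated) dot product
    of the m-th rows of A and B, so two M x N inputs give M outputs."""
    return [sum(a * b for a, b in zip(row_a, row_b))
            for row_a, row_b in zip(A, B)]


def split_hadamard(A, B):
    """Entry-wise product applied separately to real and imaginary parts:
    C[m][n] = Re(A)Re(B) + j Im(A)Im(B)."""
    return [[complex(a.real * b.real, a.imag * b.imag)
             for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(A, B)]


def sign(x):
    """Signum as defined in the notation: +1 if x > 0, otherwise -1."""
    return 1.0 if x > 0 else -1.0


A = [[1 + 2j, 3 - 1j], [0 + 1j, 2 + 0j]]
B = [[1 - 1j, 2 + 2j], [1 + 1j, -1 + 0j]]
c = extended_dot(A, B)      # one complex value per row
C = split_hadamard(A, B)    # same shape as A and B
```

Note that the second product is not the ordinary complex Hadamard product: multiplying by a 1-bit sample such as $1 - j$ only flips the signs of the real and imaginary parts independently.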

D. Paper Outline
The rest of the paper is organized as follows. Section II introduces the system model. Section III presents our quantization-aware data detection and channel estimation algorithms for 1-bit massive MU-MIMO-OFDM systems. Section IV describes the VLSI architecture and shows FPGA implementation results of the data detector. Section V concludes the paper.

II. SYSTEM MODEL
We consider the uplink of a 1-bit massive MU-MIMO-OFDM system illustrated in Figure 1, where $U$ single-antenna UEs communicate with a BS that is equipped with $B \gg U$ antennas. We assume a block-fading scenario and communication over a frequency-selective channel using OFDM. Each OFDM symbol consists of $W = W_{\text{used}} + W_{\text{guard}}$ subcarriers, where $W_{\text{used}}$ refers to the number of subcarriers used to carry data or pilot symbols, and $W_{\text{guard}}$ refers to the number of guard subcarriers. The set of subcarriers used for data or pilot symbols is denoted by $\Omega_{\text{used}}$ and the set of subcarriers used as guard tones is denoted by $\Omega_{\text{guard}}$.

Remark 1.
In what follows, we assume perfect timing and frequency synchronization at the BS. Due to the severe distortion of 1-bit quantized received signals, timing and frequency synchronization is challenging. A recent study in [32] has shown that accurate synchronization is feasible using the conventional Schmidl-Cox algorithm for the downlink of OFDM-based systems with 1-bit DACs. To the best of our knowledge, not much is known about uplink synchronization with 1-bit ADCs, but we expect that results in [32] can be extended to our scenario. A detailed study of synchronization with 1-bit quantized signals is left for future work.
The channel is modeled in the time domain by the matrices $\mathbf{H}_t \in \mathbb{C}^{B \times U}$, $t = 1, \ldots, L$, where $L$ is the number of taps of the channel's impulse response. The frequency-domain channel matrices $\mathbf{H}_w \in \mathbb{C}^{B \times U}$, $w = 1, \ldots, W$, are obtained from the time-domain representation via a DFT:

$$\mathbf{H}_w = \sum_{t=1}^{W} \mathbf{H}_t \exp\!\left(-j 2\pi \frac{(t-1)(w-1)}{W}\right), \tag{1}$$

noting that $\mathbf{H}_t = \mathbf{0}_{B \times U}$ for $t = L+1, \ldots, W$. To simplify notation, we often use the frequency-domain channel matrices $\mathbf{H}_b \in \mathbb{C}^{W \times U}$, $b = 1, \ldots, B$, corresponding to each BS antenna. The $w$th row of $\mathbf{H}_b$ is the channel vector between the $b$th BS antenna and all users on the $w$th subcarrier for all $w = 1, \ldots, W$. Throughout the paper, channel matrices with the subscript $b$ correspond to the $W \times U$ frequency-domain channel matrices associated with each BS antenna; channel matrices with the subscript $w$ correspond to the $B \times U$ frequency-domain channel matrices associated with each subcarrier.

Uplink communication within each channel coherence interval is divided into two phases. In the first phase, which consists of $N_t = UT$ OFDM symbols, the UEs send pilot signals. The parameter $T$ determines the number of training symbols per UE. In the second phase, the UEs transmit $N_d$ data-carrying OFDM symbols. In what follows, we describe the transmission model for the duration of one OFDM symbol, which is the same for both channel training and data transmission; the only difference is the choice of frequency-domain symbols.
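The relation between the channel taps and the per-subcarrier and per-antenna channel matrices described above can be sketched as follows. This is a minimal illustration under the assumption of an unnormalized DFT across the tap index with zero-based indices; the function names and the nested-list layout `taps[t][b][u]` are hypothetical.

```python
import cmath
import math


def taps_to_subcarriers(taps, W):
    """Map L time-domain tap matrices taps[t][b][u] to W frequency-domain
    matrices freq[w][b][u]. Taps beyond L are implicitly zero.
    Assumption: unnormalized DFT over the tap index."""
    L, B, U = len(taps), len(taps[0]), len(taps[0][0])
    return [[[sum(taps[t][b][u] * cmath.exp(-2j * math.pi * t * w / W)
                  for t in range(L))
              for u in range(U)]
             for b in range(B)]
            for w in range(W)]


def per_antenna_view(freq):
    """Regroup freq[w][b][u] (one B x U matrix per subcarrier) into
    H_b[w][u] (one W x U matrix per BS antenna)."""
    W, B = len(freq), len(freq[0])
    return [[freq[w][b] for w in range(W)] for b in range(B)]


# Two-tap channel, one BS antenna, one UE, four subcarriers.
taps = [[[1 + 0j]], [[1 + 0j]]]
freq = taps_to_subcarriers(taps, W=4)
H_b = per_antenna_view(freq)
```

The regrouping step only changes the indexing order: the same frequency-domain coefficients are viewed either per subcarrier or per antenna.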
During the transmission of each OFDM symbol, each UE generates its own frequency-domain symbol vector $\mathbf{s}_u \in \mathbb{C}^W$. For the subcarriers reserved as guard tones, these symbols are zero, i.e., $[\mathbf{s}_u]_w = 0$ for $w \in \Omega_{\text{guard}}$. For the other subcarriers $w \in \Omega_{\text{used}}$, these symbols are chosen from a constellation set $\mathcal{X}$ (or pilot constellation set $\mathcal{X}_t$), i.e., $[\mathbf{s}_u]_w \in \mathcal{X}$ (or $[\mathbf{s}_u]_w \in \mathcal{X}_t$) and are normalized as $\mathbb{E}[|[\mathbf{s}_u]_w|^2] = E_s$ for all $w$. Each UE then converts its frequency-domain vector $\mathbf{s}_u$ into the time domain using a $W$-point inverse DFT, and transmits the resulting vector after prepending a cyclic prefix (CP) of length $P$. We assume perfect synchronization and a CP length of $P \geq L - 1$, which is sufficient to avoid inter-symbol interference.
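The per-UE transmit processing described above (inverse DFT followed by a cyclic prefix) can be sketched in a few lines of Python. The sketch assumes the unitary DFT normalization stated in the notation section; the helper names are illustrative.

```python
import cmath
import math


def idft_unitary(s):
    """W-point inverse DFT with 1/sqrt(W) scaling (i.e., F^H applied to s)."""
    W = len(s)
    return [sum(s[w] * cmath.exp(2j * math.pi * w * n / W) for w in range(W))
            / math.sqrt(W)
            for n in range(W)]


def ofdm_modulate(s, P):
    """Convert frequency-domain symbols to the time domain and prepend
    a cyclic prefix made of the last P time-domain samples."""
    x = idft_unitary(s)
    return (x[-P:] + x) if P > 0 else x


def min_cp_length(L):
    """Smallest CP length that avoids inter-symbol interference for L taps."""
    return L - 1


tx = ofdm_modulate([0j, 1 + 0j, 0j, 0j], P=1)
```

Here only subcarrier 1 is active, so the time-domain samples are a single complex exponential and the prefix equals the last sample of the symbol.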
At the BS side, each antenna receives a noisy superposition of the UEs' signals. To simplify notation, we will use the $W \times U$ matrix $\mathbf{S}$ whose $u$th column contains the frequency-domain symbols of the $u$th UE. Let $\mathbf{y}_b \in \mathbb{C}^W$ denote the (unquantized) signal vector received at the $b$th BS antenna after removal of the cyclic prefix. This vector can be modeled as

$$\mathbf{y}_b = \mathbf{F}^H \mathbf{z}_b + \mathbf{n}_b, \tag{2}$$

where $\mathbf{n}_b \sim \mathcal{CN}(\mathbf{0}_{W \times 1}, N_0 \mathbf{I}_W)$ is the thermal receive noise at the $b$th BS antenna with variance $N_0$ per complex entry, and $\mathbf{z}_b \in \mathbb{C}^W$ is the vector of noiseless frequency-domain signals associated with the $b$th BS antenna given by

$$\mathbf{z}_b = \mathbf{H}_b \odot \mathbf{S}, \tag{3}$$

where $\odot$ is the extended dot product defined in Section I-C.
In what follows, we assume that the in-phase and quadrature baseband signals at the output of each BS RF chain are quantized by a pair of zero-threshold 1-bit ADCs. For a complex-valued scalar $z$, we model this quantization operation as $r = Q(z) = \mathrm{sign}(z^R) + j\,\mathrm{sign}(z^I)$, which is applied element-wise to vectors and matrices. The $W$-dimensional vector of the 1-bit quantized observations at the $b$th BS antenna for the duration of one OFDM symbol is thus given by

$$\mathbf{r}_b = Q(\mathbf{y}_b) = Q(\mathbf{F}^H \mathbf{z}_b + \mathbf{n}_b). \tag{4}$$

Assuming $\mathbb{E}[\|\mathbf{H}_b\|_F^2] = W U E_h$ for $b = 1, \ldots, B$, the average receive signal-to-noise ratio (SNR) at each BS antenna prior to quantization is given by $\rho = W_{\text{used}} U E_s E_h / (W N_0)$.
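The quantizer model and the SNR expression above translate directly into code. The following is a small sketch; the function names are illustrative, and the numeric values in the usage lines are arbitrary examples.

```python
def sign(x):
    """Signum as defined in Section I-C: +1 if x > 0, otherwise -1."""
    return 1.0 if x > 0 else -1.0


def quantize_1bit(z):
    """Zero-threshold 1-bit quantization of in-phase and quadrature parts."""
    return complex(sign(z.real), sign(z.imag))


def receive_snr(W_used, W, U, E_s, E_h, N0):
    """Average per-antenna receive SNR prior to quantization:
    rho = W_used * U * E_s * E_h / (W * N0)."""
    return W_used * U * E_s * E_h / (W * N0)


r = [quantize_1bit(z) for z in (0.3 - 2.0j, -1e-3 + 5.0j, 0j)]
rho = receive_snr(W_used=100, W=128, U=8, E_s=1.0, E_h=1.0, N0=6.25)
```

Because the quantizer keeps only signs, scaling the input by any positive constant leaves the output unchanged, which is why the detector output must be re-scaled later.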

III. 1-BIT DATA DETECTION AND CHANNEL ESTIMATION
We now present our data detection and channel estimation algorithms for 1-bit massive MU-MIMO-OFDM systems, and we demonstrate their effectiveness via simulation results.

A. Quantization-Aware Data Detection with Box Constraints
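In order to derive the quantization-aware data detection algorithm, we first formulate the ML problem and then relax its constraints to arrive at a problem that can be solved efficiently. We start with the likelihood of the output of a 1-bit quantizer $r = Q(\mu + n) = \mathrm{sign}(\mu + n)$, given the noiseless input $\mu$ and assuming that $n$ is circularly-symmetric complex Gaussian noise with variance $N_0 = \sigma^2$. For this model, the likelihood function is given by the following expression [13], [33], [34]:

$$p(r \mid \mu) = \Phi\!\left(\frac{\sqrt{2}\, r^R \mu^R}{\sigma}\right) \Phi\!\left(\frac{\sqrt{2}\, r^I \mu^I}{\sigma}\right). \tag{5}$$

Here, $\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp(-u^2/2)\, du$ is the cumulative distribution function of a standard normal random variable. Now let us assume that $\mathbf{r} = Q(\boldsymbol{\mu} + \mathbf{n})$, where $\mathbf{r}$, $\boldsymbol{\mu}$, and $\mathbf{n}$ are $N$-dimensional vectors, and the noise vector is distributed according to $\mathbf{n} \sim \mathcal{CN}(\mathbf{0}, N_0 \mathbf{I})$. The likelihood function of the vector $\mathbf{r}$ is given by $p(\mathbf{r} \mid \boldsymbol{\mu}) = \prod_{n=1}^{N} p(r_n \mid \mu_n)$. Hence, the ML data detection problem corresponds to [13]

$$\hat{\mathbf{S}}^{\text{ML}} = \arg\max_{\mathbf{S} \in \mathcal{X}^{W \times U}} \prod_{b=1}^{B} p\big(\mathbf{r}_b \mid \mathbf{F}^H (\mathbf{H}_b \odot \mathbf{S})\big), \tag{6}$$

where $\mathbf{r}_b$ is the vector of 1-bit observations at the $b$th BS antenna as defined in (4). Note that this problem is NP-hard, as is the ML data detection problem for the infinite-resolution case, and an exhaustive search would require one to evaluate the objective function $|\mathcal{X}|^{WU}$ times. In order to overcome this prohibitive complexity, we relax the discrete constellation constraints. The same idea has been used in the special case of frequency-flat channels in [1], [10], [11], [24], and in a more general framework with multi-bit ADCs in [13].

In order to arrive at an efficient method, we relax the discrete constellation to its bounding box (convex hull), which is, for quadrature-amplitude modulation (QAM), given by

$$\mathcal{B} = \left\{ s^R + j s^I : s^R, s^I \in [-S_{\mathcal{X}}, +S_{\mathcal{X}}] \right\}, \tag{7}$$

where $S_{\mathcal{X}} = \max_{s \in \mathcal{X}} \{|s^R|, |s^I|\}$. Concretely, we replace the constraints $\mathbf{S} \in \mathcal{X}^{W \times U}$ by $\mathbf{S} \in \mathcal{B}^{W \times U}$. The resulting optimization problem is convex and can be formulated in equivalent form as follows:

$$\hat{\mathbf{S}}^{\text{1BOX}} = \arg\min_{\mathbf{S} \in \mathbb{C}^{W \times U}} f(\mathbf{S}) + I\big(\mathbf{S} \in \mathcal{B}^{W \times U}\big). \tag{8}$$

Here, $f(\mathbf{S}) = -\sum_{b=1}^{B} \log p\big(\mathbf{r}_b \mid \mathbf{F}^H (\mathbf{H}_b \odot \mathbf{S})\big)$ is the negative logarithm of the likelihood function in (6), and $I(\cdot)$ is the indicator function which outputs zero if its input is satisfied and infinity otherwise.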
Since magnitude is lost completely in 1-bit measurements, we re-scale the output $\hat{\mathbf{S}}^{\text{1BOX}}$ according to (9) to obtain the normalized output $\hat{\mathbf{S}}^{\text{norm}}$, followed by mapping the normalized output $\hat{\mathbf{S}}^{\text{norm}}$ to the nearest constellation point in the discrete constellation set $\mathcal{X}$. The optimization problem in (8) can be solved using the forward-backward splitting (FBS) framework [35], [36]. We use FBS to design an iterative algorithm that we call 1-Bit OFDM boX (1BOX) detector, which is summarized in Algorithm 1. In each iteration of 1BOX, lines 4 to 11 calculate the (scaled) negative gradient of the objective function $f(\mathbf{S})$ of (8), and line 12 increments the estimates of the current iteration in the direction of the negative gradient by a step-size of $\kappa$ and projects the result onto the box constraint set.

Calculating the gradient of $f(\mathbf{S})$ requires the inverse Mills ratio for a normal random variable defined as

$$\omega(x) = \frac{\frac{1}{\sqrt{2\pi}} \exp(-x^2/2)}{\Phi(x)}, \tag{10}$$

and its complex version $\omega_c(z) = \omega(z^R) + j\omega(z^I)$, which is applied entry-wise to vectors. With this definition, the negative gradient of the objective function $f(\mathbf{S})$, with respect to the estimated matrix $\tilde{\mathbf{S}}$, is a $W \times U$ matrix whose $w$th row is

$$\frac{\sqrt{2}}{2\sigma} \left( \mathbf{H}_w^H \mathbf{v}_w \right)^T. \tag{11}$$

Here, $\mathbf{V}$ is a $B \times W$ matrix with $w$th column $\mathbf{v}_w = [\mathbf{V}]_w$, whose $b$th row is given by

$$\left[ \mathbf{F} \left( \mathbf{r}_b \circ \omega_c\!\left( \frac{\sqrt{2}}{\sigma}\, \mathbf{r}_b \circ \mathbf{F}^H \mathbf{z}_b \right) \right) \right]^T, \tag{12}$$

and $\mathbf{z}_b$ is defined in (3). The quantity $\mathbf{G}$ on line 10 is a scaled version of the gradient, where we omit the prefactor $\sqrt{2}/(2\sigma)$, which is absorbed into the step-size $\kappa$. The function $\mathrm{proj}_c(\cdot)$ on line 12 of Algorithm 1 implements a projection onto the box constraints in (7). This function operates element-wise on the entries of its input and simply performs the operation

$$\mathrm{proj}_c(x) = \mathrm{proj}(x^R, \mathcal{B}) + j\, \mathrm{proj}(x^I, \mathcal{B}), \tag{13}$$

where $\mathrm{proj}(\cdot)$ is given by $\mathrm{proj}(x, \mathcal{B}) = \mathrm{sign}(x) \min(|x|, S_{\mathcal{X}})$.
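To make the iteration described in this subsection concrete, the following Python sketch implements a projected gradient loop in the spirit of 1BOX. It is an illustration under several assumptions and does not reproduce Algorithm 1 line by line: all subcarriers are treated as used, the DFT is a naive $O(W^2)$ unitary transform, the inverse Mills ratio is clipped outside $[-4, 4]$ to stay numerically safe, and the names `one_box`, `split_mul`, and `proj_box` are hypothetical.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive unitary W-point DFT (F) or inverse DFT (F^H)."""
    W, s = len(x), (1 if inverse else -1)
    return [sum(x[n] * cmath.exp(s * 2j * math.pi * k * n / W)
                for n in range(W)) / math.sqrt(W) for k in range(W)]


def omega(x):
    """Inverse Mills ratio pdf(x)/cdf(x), clipped for |x| > 4."""
    if x < -4.0:
        return -x
    if x > 4.0:
        return 0.0
    cdf = 0.5 * math.erfc(-x / math.sqrt(2))
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi) / cdf


def split_mul(r, a):
    """Product applied separately to real and imaginary parts."""
    return complex(r.real * a.real, r.imag * a.imag)


def proj_box(x, S_X):
    """Clip real and imaginary parts to [-S_X, +S_X]."""
    clip = lambda v: math.copysign(min(abs(v), S_X), v)
    return complex(clip(x.real), clip(x.imag))


def one_box(R, H, sigma, kappa, K, S_X):
    """R[b][n]: 1-bit time-domain samples, H[b][w][u]: per-antenna
    frequency-domain channel. Returns the W x U box-constrained estimate."""
    B, W, U = len(H), len(H[0]), len(H[0][0])
    S = [[0j] * U for _ in range(W)]
    for _ in range(K):
        V = []
        for b in range(B):
            z = [sum(H[b][w][u] * S[w][u] for u in range(U)) for w in range(W)]
            t = dft(z, inverse=True)                       # noiseless time domain
            alpha = [split_mul(R[b][n], t[n]) * math.sqrt(2) / sigma
                     for n in range(W)]
            x = [split_mul(R[b][n], complex(omega(a.real), omega(a.imag)))
                 for n, a in enumerate(alpha)]
            V.append(dft(x))                               # back to frequency domain
        for w in range(W):
            for u in range(U):
                g = sum(H[b][w][u].conjugate() * V[b][w] for b in range(B))
                S[w][u] = proj_box(S[w][u] + kappa * g, S_X)
    return S


S_X = 1 / math.sqrt(2)
S_hat = one_box(R=[[1 + 1j]] * 4, H=[[[1 + 0j]]] * 4,
                sigma=1.0, kappa=0.1, K=10, S_X=S_X)
```

In this toy case (one subcarrier, one UE, four antennas with unit channel gain), every observation points toward the positive quadrant, so the estimate grows until it hits the corner of the box. The result would still need the re-scaling and nearest-constellation-point mapping described above.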

Remark 2.
We note that for constant-modulus modulation schemes, such as 8-PSK, one could achieve better performance by using a circle-like polytope as the constraint set rather than a rectangular box as in (7). However, to develop simpler VLSI implementations, we have used box constraints regardless of the modulation scheme. Projection onto more complex circle-like polytopes significantly increases hardware complexity [37].

B. Optimization for Hardware Implementation
The 1BOX algorithm summarized in Algorithm 1 has two drawbacks from a hardware implementation perspective. We next introduce two solutions that address these issues.

1) Stable Gradient Calculation:
The first issue arises from the inverse Mills ratio $\omega(x)$ in (10), which is numerically unstable for negative inputs of large absolute value. Since both the numerator and denominator of (10) asymptotically approach zero for large negative values of $x$, such input values result in zero-over-zero division with finite-precision arithmetic. In order to circumvent this issue, one can use l'Hôpital's rule to find the negative infinity limit of $\omega(x)$, which is equal to $-x$. For sufficiently large positive values of $x$, $\omega(x)$ produces outputs very close to zero. As illustrated in Figure 2, these properties of $\omega(x)$ can be exploited to simplify a hardware implementation using the approximation:

$$\omega(x) \approx \begin{cases} -x, & x < t_n, \\ \omega(x), & t_n \leq x \leq t_p, \\ 0, & x > t_p. \end{cases} \tag{14}$$

The thresholds $t_p = 4$ and $t_n = -4$ have been selected based on simulations to minimize the performance degradation due to the approximation (14). This approximation enables efficient hardware designs with a small look-up table (LUT) that stores only the function values between $t_n$ and $t_p$.
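The piecewise approximation with a look-up table can be sketched as follows. The thresholds $t_n = -4$ and $t_p = 4$ come from the text; the table resolution `STEP = 1/16` and the nearest-entry lookup are assumptions made only for illustration and are not taken from the paper's hardware design.

```python
import math

T_N, T_P = -4.0, 4.0   # thresholds from the text
STEP = 1.0 / 16.0      # hypothetical LUT resolution


def mills_exact(x):
    """Inverse Mills ratio: standard normal pdf divided by cdf."""
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * math.erfc(-x / math.sqrt(2.0))
    return pdf / cdf


# Table of function values on a uniform grid covering [T_N, T_P].
LUT = [mills_exact(T_N + i * STEP) for i in range(int((T_P - T_N) / STEP) + 1)]


def mills_approx(x):
    """Linear asymptote below T_N, zero above T_P, table lookup in between."""
    if x < T_N:
        return -x
    if x > T_P:
        return 0.0
    return LUT[round((x - T_N) / STEP)]
```

Outside the table range no division is performed at all, which removes the zero-over-zero case for strongly negative inputs.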
2) Limiting the Noise Standard Deviation: The second problem arises from convergence issues at high SNR with fixed-point arithmetic. Adaptive step-size rules are able to deal with such convergence issues; see, e.g., [11], [36]. Such rules, however, entail excessively high complexity as they require a search over suitable step sizes in every iteration that includes repeatedly evaluating the objective function (which per se requires high complexity). Instead, we propose a simple modification to Algorithm 1 that simplifies our hardware design in Section IV. Due to the large dynamic range of the objective function at high SNR, small step sizes are required to ensure convergence. From a hardware perspective, it is desirable to have a fixed step size that can be implemented with simple arithmetic shift operations. Since at high SNR the role of the step size is to control the dynamic range of the entries of $\mathbf{G}$ to be added to $\mathbf{S}^{(k-1)}$ on line 12 of Algorithm 1, limiting the dynamic range of the entries of $\boldsymbol{\alpha}_b$ on line 6 has an equivalent effect. This effect is accomplished by thresholding the value of the noise standard deviation $\sigma$, i.e., we replace the value of $\sigma$ with $\sigma_{\text{th}}$ whenever $\sigma < \sigma_{\text{th}}$. Our simulations have shown that suitable values for $\sigma_{\text{th}}$ correspond to an SNR between 10 dB and 15 dB. In Section IV-B3, we will denote the thresholded noise standard deviation by $\bar{\sigma}$, i.e., we set $\bar{\sigma} = \sigma_{\text{th}}$ if $\sigma < \sigma_{\text{th}}$ and $\bar{\sigma} = \sigma$ otherwise.
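The thresholding rule can be expressed in a few lines. In this sketch, the threshold is derived from a chosen SNR using the SNR definition of Section II with $N_0 = \sigma^2$; the default parameter values and the 10 dB choice are examples within the range mentioned above, not values prescribed by the paper.

```python
import math


def sigma_from_snr_db(snr_db, W_used=100, W=128, U=8, E_s=1.0, E_h=1.0):
    """Noise standard deviation that yields the given receive SNR,
    obtained by solving rho = W_used*U*E_s*E_h / (W*sigma^2) for sigma."""
    rho = 10.0 ** (snr_db / 10.0)
    return math.sqrt(W_used * U * E_s * E_h / (W * rho))


def threshold_sigma(sigma, sigma_th):
    """Never let the noise standard deviation drop below the threshold."""
    return max(sigma, sigma_th)


sigma_th = sigma_from_snr_db(10.0)          # hypothetical threshold choice
sigma_hi_snr = sigma_from_snr_db(20.0)      # true sigma at 20 dB SNR
sigma_used = threshold_sigma(sigma_hi_snr, sigma_th)
```

At 20 dB the true standard deviation is below the threshold, so the detector operates as if the SNR were 10 dB, which bounds the scaling applied inside the gradient computation.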

C. Linear-Quantized Data Detection
Since linear data detection algorithms have been studied extensively in the literature for both full-resolution and low-resolution converters [8], [17], [38], we use them as a benchmark to evaluate the performance of the 1BOX data detector. Zero-forcing and L-MMSE detectors allow for efficient hardware implementations [26], [31] and achieve similar error-rate performance in massive MU-MIMO systems where $B \gg U$. Hence, we consider ZF detection (referred to as ZF-DET), which first converts the 1-bit received data at each BS antenna $\mathbf{r}_b$, $b = 1, \ldots, B$, into the frequency domain using a DFT:

$$\hat{\mathbf{r}}_b = \mathbf{F} \mathbf{r}_b.$$

Let $\hat{\mathbf{R}}$ be a $W \times B$ matrix whose $b$th column is $\hat{\mathbf{r}}_b$. Then, ZF is applied on each active subcarrier $w \in \Omega_{\text{used}}$ as follows:

$$\hat{\mathbf{s}}_w = \left( \hat{\mathbf{H}}_w^H \hat{\mathbf{H}}_w \right)^{-1} \hat{\mathbf{H}}_w^H \, [\hat{\mathbf{R}}^T]_w,$$

where $\hat{\mathbf{s}}_w^T$ forms the $w$th row of the detector output. The outputs are then normalized as in (9) and quantized to the nearest points in the constellation $\mathcal{X}$.

D. 1-Bit Channel Estimation
The performance of coherent data detectors depends heavily on the accuracy of the available channel estimates; this is even more critical in 1-bit quantized systems, which suffer from nonlinear distortions. We next develop a method that relies on the same tools as 1BOX to calculate improved channel estimates. As discussed in Section II, during the channel training phase, all UEs transmit $N_t = UT$ pilot OFDM symbols concurrently. This results in $UTW$ training symbols in total, as each OFDM symbol contains $W$ frequency-domain symbols. The frequency-domain pilot symbols of all UEs transmitted during the $n$th pilot OFDM symbol are gathered in the matrix $\mathbf{T}_n \in \mathcal{X}_t^{W \times U}$, where $\mathcal{X}_t$ is the set of pilot symbols, augmented with zero to take into account the zero symbol used for guard subcarriers. To simplify exposition, we additionally introduce the per-subcarrier matrix $\mathbf{T}_w \in \mathcal{X}_t^{N_t \times U}$ whose $u$th column contains the frequency-domain pilot symbols of the UE $u$ transmitted over $N_t$ pilot OFDM symbols on the $w$th subcarrier. Let $\mathbf{Y}_b$ be the $W \times N_t$ matrix associated with the $b$th BS antenna, whose $n$th column contains the unquantized time-domain samples (after removing the cyclic prefix) received during the $n$th training OFDM symbol. Then, the 1-bit quantized observations at the $b$th BS antenna are given by

$$\mathbf{R}_b = Q(\mathbf{Y}_b) = Q(\mathbf{F}^H \mathbf{Z}_b + \mathbf{N}_b),$$

where $\mathbf{N}_b$ contains the thermal receive noise and $\mathbf{Z}_b$ is a $W \times N_t$ matrix, whose $n$th column is given by $\mathbf{H}_b \odot \mathbf{T}_n$.

1) Linear-Quantized Channel Estimation: A naïve way of obtaining channel estimates from 1-bit measurements is to ignore quantization altogether and perform $W$ independent channel estimation tasks relying on the orthogonality of OFDM together with linear channel estimators, such as ZF or L-MMSE, on a per-subcarrier basis [13], [17]. Since ZF and L-MMSE channel estimation provide similar performance in massive MU-MIMO, we focus on ZF channel estimation.
Similar to ZF-DET, ZF channel estimation (called ZF-CHEST) first converts the 1-bit measurements at each BS antenna $\mathbf{R}_b$ into the frequency domain according to

$$\hat{\mathbf{R}}_b = \mathbf{F} \mathbf{R}_b.$$

Then, the channel estimates for each BS antenna $b$ and for each active subcarrier $w \in \Omega_{\text{used}}$ are obtained as follows:

$$\hat{\mathbf{h}}_{b,w} = \left( \mathbf{T}_w^H \mathbf{T}_w \right)^{-1} \mathbf{T}_w^H \, [\hat{\mathbf{R}}_b^T]_w,$$

where $\hat{\mathbf{h}}_{b,w}^T$ forms the $w$th row of $\hat{\mathbf{H}}_b$. We note that a Bussgang-based L-MMSE (BL-MMSE) channel estimator has been proposed in [19] that outperforms the conventional L-MMSE channel estimator by taking into account the nonlinearities caused by 1-bit ADCs. A similar approach can be used to design BL-MMSE-based data detectors [39], [40]. Unfortunately, the complexity of these methods is prohibitive in OFDM systems, as they require the inversion of a $UW \times UW$ matrix, caused by the fact that 1-bit quantization destroys the orthogonality that enables one to decouple the estimation problem into independent problems for each subcarrier. Furthermore, the matrix to be inverted applies a nonlinear arcsine function to each entry, which prevents efficient iterative methods from finding the inverse.
2) ML-based Channel Estimation: Another approach to obtain improved channel estimates $\hat{\mathbf{H}}_b$, $b = 1, \ldots, B$, is to directly solve the ML channel estimation problem, which is derived analogously to (6):

$$\hat{\mathbf{H}}_b = \arg\max_{\mathbf{H} \in \mathbb{C}^{W \times U}} \prod_{n=1}^{N_t} p\big([\mathbf{R}_b]_n \mid \mathbf{F}^H (\mathbf{H} \odot \mathbf{T}_n)\big). \tag{18}$$

The idea of ML-based channel estimation with 1-bit quantized measurements for frequency-flat channels has been put forward in [10]. We emphasize, however, that the method in [10] is not directly applicable to OFDM systems, and the general ML problem needs to be derived for such systems, as done in (18). The problem in (18) is convex and we propose to solve it using normalized gradient descent (NGD), which we call NGD-CHEST. The algorithm resembles that of the 1BOX data detector in (6), except that (i) there are no constraints on the channel matrices (and hence no projection operation is required), (ii) we assume that the pilot matrices $\mathbf{T}_n$, $n = 1, \ldots, N_t$, are known, and (iii) we initialize the algorithm with the ZF-CHEST estimates to improve convergence. Due to the structural similarity between 1BOX and NGD-CHEST, it would be possible, with only a few modifications, to use the same VLSI architecture proposed for 1BOX in Section IV to also carry out the NGD-CHEST algorithm. However, due to the short time available for channel training in each channel coherence interval, it would not be practical to carry out both NGD-CHEST and 1BOX data detection with the same hardware instance; separate instances should be used for each task. Next, we describe channel denoising and normalization techniques that are applied to NGD-CHEST and ZF-CHEST outputs to improve the channel estimates.
3) Channel Denoising and Normalization: In most practical systems, the number $L$ of channel taps in the time domain is less than the number of OFDM subcarriers; this implies that adjacent subcarriers are correlated. One can exploit this property to denoise the channel estimates by means of a time-domain maximum likelihood estimator (TDMLE) [41]. For the frequency-domain channel $\hat{\mathbf{h}}_{b,u} \in \mathbb{C}^{W_{\text{used}} \times 1}$ between the $b$th BS antenna and $u$th user, a denoised channel estimate can be obtained as follows:

$$\hat{\mathbf{h}}^{\text{denoised}}_{b,u} = \mathbf{F}_{\Omega_{\text{used}}} \left( \mathbf{F}_{\Omega_{\text{used}}}^H \mathbf{F}_{\Omega_{\text{used}}} \right)^{-1} \mathbf{F}_{\Omega_{\text{used}}}^H \hat{\mathbf{h}}_{b,u}.$$

Here, $\mathbf{F}_{\Omega_{\text{used}}}$ is a $W_{\text{used}} \times L$ matrix, constructed from a $W$-point DFT matrix by taking its first $L$ columns and the rows indexed by $\Omega_{\text{used}}$. The resulting denoised channel estimates are collected in the matrices $\hat{\mathbf{H}}^{\text{denoised}}_b$, $b = 1, \ldots, B$. Since for 1-bit quantizers amplitude information is lost completely, we apply a re-scaling procedure analogous to (9). More concretely, we assume that the BS is able to acquire an estimate $\gamma_b$ of the average channel gain at each BS antenna. Due to the spatial proximity of BS antennas, we assume that $\gamma_b = \gamma$ for all $b = 1, \ldots, B$. In order to minimize the mean squared error (MSE) between the exact channel matrices $\{\mathbf{H}_b\}_{b=1}^{B}$ and their estimates $\{\hat{\mathbf{H}}_b\}_{b=1}^{B}$, the BS re-scales the channel estimates as follows:
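Separately from the re-scaling step, the TDMLE denoising described in this subsection can be sketched as a least-squares projection of a per-user frequency-domain estimate onto the subspace spanned by an $L$-tap channel. The sketch below assumes this standard least-squares form, zero-based subcarrier indices, and a unitary DFT; the helper names are hypothetical, and the small Gaussian-elimination solver is included only to keep the example self-contained.

```python
import cmath
import math


def dft_submatrix(W, L, used):
    """Rows `used` and first L columns of the unitary W-point DFT matrix."""
    return [[cmath.exp(-2j * math.pi * w * t / W) / math.sqrt(W)
             for t in range(L)] for w in used]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0j] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def tdmle_denoise(h, W, L, used):
    """Least-squares fit of L time-domain taps to h, mapped back to the
    used subcarriers: F (F^H F)^{-1} F^H h."""
    F = dft_submatrix(W, L, used)
    rows = range(len(used))
    gram = [[sum(F[w][i].conjugate() * F[w][j] for w in rows)
             for j in range(L)] for i in range(L)]
    rhs = [sum(F[w][i].conjugate() * h[w] for w in rows) for i in range(L)]
    taps = solve(gram, rhs)
    return [sum(F[w][t] * taps[t] for t in range(L)) for w in rows]


used = list(range(1, 7))                     # 6 used subcarriers out of W = 8
F = dft_submatrix(8, 2, used)
h_clean = [F[w][0] * 1.0 + F[w][1] * 0.5j for w in range(6)]
h_out = tdmle_denoise(h_clean, W=8, L=2, used=used)
```

An estimate that already corresponds to an $L$-tap channel passes through unchanged, while components outside that subspace (which can only be estimation noise under the model) are removed.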

E. Complexity Analysis
In this section, we provide analytic complexity expressions for the 1BOX and NGD-CHEST algorithms, measured in terms of the number of real-valued multiplications. Each iteration of 1BOX, as shown in Algorithm 1, consists of two for-loops with $B$ and $W$ iterations. Line 5 of Algorithm 1 involves $W$ inner products between $U$-entry vectors. By assuming that each complex-valued multiplication requires four real-valued multiplications, this line requires $4UW$ real-valued multiplications. The computations on lines 6 and 7 involve DFT operations on $W$-entry vectors that can be implemented using fast Fourier transforms (FFT). Each $W$-point FFT requires $2W \log_2 W$ real-valued multiplications, assuming a radix-2 implementation. The complexity of calculating $\omega_c$ for a scalar input using look-up tables can be approximated by one real-valued multiplication. The Hadamard products with the vectors $\mathbf{r}_b$ on lines 6 and 7 do not require actual multiplications and can be implemented efficiently with conditional negations as described in Section IV-B3. Line 10 consists of the product of a $U \times B$ matrix by a $B \times 1$ vector, which requires $4UB$ real-valued multiplications. Multiplying all entries of $\mathbf{G}$ by the real-valued constant $\kappa$ on line 12 requires $2UW$ real-valued multiplications. In total, 1BOX involves $8BUW + 4BW \log_2 W + BW + 2UW$ real-valued multiplications per algorithm iteration. To obtain an estimate of each of the $W \times U$ channel matrices $\mathbf{H}_b$, $b = 1, 2, \ldots, B$, we need to run NGD-CHEST, which is similar to 1BOX, with the exception that the dimension $B$ is replaced by $N_t$. Therefore, the overall complexity of obtaining estimates for $B$ channel matrices is given by $BK(8N_t UW + 4N_t W \log_2 W + N_t W + 2UW)$ for $K$ algorithm iterations.
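The two complexity expressions above can be evaluated numerically with a short helper. The function names are illustrative, and the example parameters (B = 128, U = 8, W = 128, T = 2, K = 5) are taken from the simulation settings used later in the paper.

```python
def onebox_mults_per_iter(B, U, W):
    """Real-valued multiplications per 1BOX iteration:
    8BUW + 4BW log2(W) + BW + 2UW (W must be a power of two)."""
    if W < 2 or W & (W - 1):
        raise ValueError("W must be a power of two for a radix-2 FFT")
    log2_w = W.bit_length() - 1
    return 8 * B * U * W + 4 * B * W * log2_w + B * W + 2 * U * W


def ngd_chest_mults(B, U, W, N_t, K):
    """Total real-valued multiplications to estimate all B channel
    matrices with K NGD-CHEST iterations each (B replaced by N_t)."""
    return B * K * onebox_mults_per_iter(N_t, U, W)


per_iter = onebox_mults_per_iter(B=128, U=8, W=128)
chest_total = ngd_chest_mults(B=128, U=8, W=128, N_t=16, K=5)
```

For these parameters the matrix-vector products (the $8BUW$ term) account for roughly two thirds of the per-iteration multiplications, with the FFTs contributing most of the remainder.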

F. Simulation Results
To demonstrate the effectiveness of NGD-CHEST and the 1BOX data detector, we now present simulation results and a comparison with linear channel estimators and data detectors designed to operate with high-resolution ADCs.
1) Simulation Settings: We consider a 1-bit massive MU-MIMO-OFDM system with $W = 128$ subcarriers, whose middle part of $W_{\text{used}} = 100$ subcarriers is used for data and pilots, and the remaining $W_{\text{guard}} = 28$ subcarriers at the two sides are left unused. During the channel estimation phase, all UEs simultaneously transmit pilots. Since non-sparse pilot matrices perform better than diagonal training matrices, a phenomenon that has been observed in [13], we set the frequency-domain pilot matrices $\mathbf{T}_n$, $n = 1, \ldots, N_t$, to contain random QPSK symbols that are known to the BS. In our simulations, we use $N_t = UT$ training OFDM symbols with $T = 2$. For NGD-CHEST, we set the number of algorithm iterations to $K = 5$ with a fixed step-size of $1/16$. For both ZF-CHEST and NGD-CHEST, we use the TDMLE channel denoising and normalization techniques outlined in Section III-D3 to improve the channel estimates. During the data transmission phase, the UEs generate frequency-domain symbols from either 8-PSK or 16-QAM constellations. For the 1BOX detector, we use $K = 3$ iterations and set the step-size to $\kappa = \sqrt{2}/64$ for the floating-point experiments and $\kappa = 1/32$ for our fixed-point results, as we absorb the $\sqrt{2}$ factor into the scaling schedule of the FFT hardware design. In addition, the choice of $\kappa = 1/32$ simplifies the fixed-point design as it can be implemented with trivial arithmetic right shifts.
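The remark that a step-size of $1/32$ reduces to an arithmetic right shift can be illustrated with integer arithmetic. The fixed-point format below (10 fractional bits) is a hypothetical choice for illustration and is not the word length used in the paper's implementation.

```python
FRAC_BITS = 10  # hypothetical number of fractional bits


def to_fixed(x):
    """Convert a real value to a signed fixed-point integer."""
    return int(round(x * (1 << FRAC_BITS)))


def from_fixed(q):
    """Convert a fixed-point integer back to a real value."""
    return q / (1 << FRAC_BITS)


def step_update(s_fixed, g_fixed, shift=5):
    """s + kappa * g with kappa = 2**-shift, using an arithmetic right
    shift instead of a multiplier (Python's >> floors negative values)."""
    return s_fixed + (g_fixed >> shift)


s_new = step_update(to_fixed(0.25), to_fixed(-1.0))
```

Since the shift rounds toward negative infinity, very small negative gradient entries are mapped to one least-significant bit rather than to zero, which is one of the effects a fixed-point simulation has to account for.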
2) Error-Rate Performance: Figure 3 shows uncoded bit error rate (BER) results for (i) a system with $B = 128$ BS antennas and $U = 8$ UEs and (ii) a system with $B = 64$ BS antennas and $U = 4$ UEs. For both systems, we use Gray-coded 8-PSK and 16-QAM. Each plot in Figure 3 contains six curves: (i) ZF-CHEST followed by ZF-DET, with infinite-resolution ADCs for both channel estimation and data detection (used as a reference), (ii) 1-bit ZF-CHEST followed by 1-bit ZF-DET, (iii) 1BOX detector with perfect CSI, (iv) 1-bit ZF-CHEST followed by 1BOX detection, (v) NGD-CHEST followed by 1BOX detection, and (vi) NGD-CHEST with the fixed-point version of 1BOX detection (denoted by "(fp)" in the figure legends), which uses the fixed-point implementation parameters detailed in Section IV-D. As we can see from Figure 3, the 1BOX detector combined with NGD-CHEST achieves significantly better error-rate performance compared to linear quantized uplink processing, i.e., ZF-DET with ZF-CHEST. The SNR gap at BER $= 10^{-2}$ ranges from 1.5 dB for the $128 \times 8$ setup with 8-PSK signaling to 8 dB in the $128 \times 8$ setup with 16-QAM signaling. Additionally, we see that obtaining channel estimates from NGD-CHEST results in significantly better error-rate performance than those from ZF-CHEST.
In Figure 3, we observe that the performance of the fixed-point version of 1BOX, which corresponds to the hardware implementation detailed in Section IV, closely matches the floating-point performance. Finally, we observe that the proposed NGD-CHEST and 1BOX algorithms need higher SNRs to achieve the same BER as ZF-CHEST and ZF-DET in high-resolution systems. This indicates the existence of a trade-off between the error-rate performance and the reduction in cost and power consumption of basestation RF chains when using 1-bit ADCs. A detailed study of this trade-off and an assessment of the overall system cost and power consumption as a function of ADC resolution is left for future work.

Remark 3. A key limitation of data detection with 1-bit ADCs is that supporting higher-order modulation schemes (such as 64-QAM or higher) is challenging [13]. While the proposed algorithm shows acceptable performance with 8-PSK and 16-QAM, it does not perform well for higher-order modulation schemes, especially for SNR values typically encountered in mmWave systems. Therefore, 1-bit data detection is suitable for systems that operate at lower per-user data rates in exchange for reduced RF chain cost and power consumption.

IV. ARCHITECTURE AND FPGA IMPLEMENTATION
We now present a VLSI architecture for the 1BOX data detection algorithm and show reference FPGA implementation results. We then provide a comparison with existing linear data detectors that have been developed for infinite-resolution massive MU-MIMO systems. To the best of our knowledge, this is the first data detector implementation for 1-bit massive MU-MIMO systems reported in the open literature.

A. Architecture Overview and Operation Principles
The proposed VLSI architecture is shown in Figure 4 and consists of the following five modules: (i) UPC (short for update, project, and control), (ii) MVM1 (short for matrix-vector multiplication unit 1), (iii) FTF (short for frequency-time-frequency), (iv) MVM2 (short for matrix-vector multiplication unit 2), and (v) H-MEM (short for H-memory). The UPC module carries out the operations on line 12 of Algorithm 1 and prepares the control signals; the MVM1 module is responsible for the matrix-vector multiplication on line 5; the FTF module performs the operations on lines 6 and 7; the MVM2 module is responsible for the matrix-vector multiplication on line 10; and the H-MEM module stores the frequency-domain channel matrices {Ĥ_w}, w = 1, . . . , W. The signal names in Figure 4 correspond to the variables in Algorithm 1, e.g., [H_w]_b represents the channel vector of the bth BS antenna over the wth subcarrier. The proposed architecture has been optimized in terms of hardware efficiency, measured in throughput per FPGA resource. With this optimization strategy, our goal is to minimize the resource consumption needed to achieve a specific throughput. Therefore, even though each instance of the proposed architecture may achieve relatively low throughput, it is possible to scale up the throughput by replication, i.e., by instantiating N parallel designs that process different sets of receive signals. This approach enables us to increase the throughput by a factor of N while maintaining the same hardware efficiency.
In addition, the proposed architecture operates in a streaming fashion, which reduces the overhead of control and data buffering. As illustrated in Figure 5, the architecture processes the data of the bth BS antenna over W consecutive clock cycles, one clock cycle for each of the W subcarriers. For example, the elements of the W × 1 vector z_b^(k) on line 5 of Algorithm 1 are produced in MVM1 sequentially over W consecutive clock cycles. Consequently, it takes BW clock cycles to compute all vectors z_b^(k), for b = 1, 2, . . . , B, in the kth iteration of 1BOX. Therefore, each algorithm iteration requires D = BW + L_UPC + L_MVM1 + L_FTF + L_MVM2 clock cycles, where L_X denotes the latency of module "X," caused by pipelining. Consequently, our data detector architecture achieves a sustained throughput of where |X| is the cardinality of the constellation set and f is the circuit's clock frequency in MHz. A detailed discussion of the timing schedule is provided in Section IV-C.
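The cycle count per iteration can be sketched as follows. The function `cycles_per_iteration` implements the expression for D stated above. The function `throughput_estimate_mbps` is only a back-of-the-envelope estimate under our own assumption that one detection task delivers U · (number of data subcarriers) · log2|X| bits every K · D clock cycles. The values used for L_UPC, L_MVM2, and the clock frequency are placeholders and are not taken from the implementation results.

```python
from math import log2


def cycles_per_iteration(B, W, l_upc, l_mvm1, l_ftf, l_mvm2):
    """D = B*W + L_UPC + L_MVM1 + L_FTF + L_MVM2 clock cycles."""
    return B * W + l_upc + l_mvm1 + l_ftf + l_mvm2


def throughput_estimate_mbps(U, n_data_subcarriers, constellation_size,
                             K, D, f_mhz):
    """ASSUMED estimate: bits per detection task / (K * D cycles) * f."""
    bits_per_task = U * n_data_subcarriers * log2(constellation_size)
    return bits_per_task / (K * D) * f_mhz


# L_MVM1 = 5 (U = 8) and L_FTF = 702 (W = 128) are the stated latencies;
# L_UPC = 3 and L_MVM2 = 4 are hypothetical placeholder values.
D = cycles_per_iteration(B=128, W=128, l_upc=3, l_mvm1=5, l_ftf=702, l_mvm2=4)

# Hypothetical 200 MHz clock, 16-QAM, K = 3 iterations, 100 data subcarriers.
estimate = throughput_estimate_mbps(8, 100, 16, 3, D, 200.0)
```

For these dimensions, the BW term accounts for 16384 of the cycles, so the streaming phase rather than the pipeline latency dominates the iteration time.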

B. Architecture Details
The operating principles and implementation details of the five modules are detailed next.
1) UPC: This module contains a U × W memory block labeled S-MEM that stores the matrix of estimated symbols S^(k) at iteration k. All entries of this memory block are updated at the end of each iteration with the values obtained from the MVM2 module. The control unit, labeled "CTRL", is responsible for synchronizing the operations of the entire architecture, as all other modules mainly consist of streaming data paths. Since the matrix S^(k) accumulates the estimates over iterations as in Algorithm 1, it has to be initialized at the beginning of the first iteration. In order to avoid wasting any clock cycles for erasing the content of the S-MEM block and to allow for continuous processing, the control unit asserts the reset signal of the output register of the UPC module, so that initial values of zero are provided to the next module during the first W clock cycles of the first iteration.
At with κ[G T ] w , and writes back the results to wth column of S-MEM after applying the projection operation, as shown on line 12 of Algorithm 1. When updating the S-MEM block with the result from the MVM2 module during the first iteration, the control unit asserts the reset signal of the register before the accumulator to ensure that the S-MEM content from the previous detection task does not get accumulated. In addition, during the first W clock cycles of each iteration, the control unit sets the select signal of the multiplexer (MUX) at the output of the UPC module so that its output comes directly from the projection operation, since during these clock cycles the content of S-MEM is being updated and hence not ready to be passed to the output.
2) MVM1: This module consists of U complex-valued multipliers and a balanced adder tree that sums the outputs of the U multipliers. In every clock cycle, two U-dimensional vectors enter the module, and their inner product is available after L_MVM1 clock cycles, including the additional clock cycles caused by pipelining. The value of L_MVM1 depends on the number of UEs U. For U = 4 and U = 8, for example, the latency is 4 and 5 clock cycles, respectively.
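The structure of MVM1 can be sketched in software as U parallel products followed by a pairwise (balanced) reduction. The function names are illustrative, and any conjugation required by line 5 of Algorithm 1 is omitted. The sketch also returns the number of adder levels, which is log2(U) for a power-of-two U.

```python
def adder_tree_sum(values):
    """Sum a non-empty list pairwise, level by level.

    Returns (total, number_of_adder_levels).
    """
    levels = 0
    while len(values) > 1:
        if len(values) % 2:              # pad odd-sized levels with a zero
            values = values + [0]
        values = [values[i] + values[i + 1]
                  for i in range(0, len(values), 2)]
        levels += 1
    return values[0], levels


def mvm1_inner_product(h_row, s_col):
    """U parallel complex multiplications followed by a balanced adder tree."""
    products = [h * s for h, s in zip(h_row, s_col)]
    return adder_tree_sum(products)


# U = 4 example with complex-valued entries: two adder levels are needed.
total, levels = mvm1_inner_product([1 + 1j, 2, 0, 1j], [1, 1j, 5, 1j])
```

The tree depth is 2 for U = 4 and 3 for U = 8. The stated latencies of 4 and 5 clock cycles are consistent with this depth plus two register stages around the multipliers, although this decomposition is our reading and not a stated design detail.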

3) FTF: The FTF module contains the FFT and inverse FFT (IFFT) submodules, which are implemented using the Xilinx LogiCORE FFT IP with the radix-2 pipelined streaming I/O architecture. This streaming FFT architecture achieves high throughput and provides continuous processing capability, i.e., it accepts one sample of a W-point vector per clock cycle and produces one entry of the resulting transform per clock cycle, after a latency of L_FFT clock cycles. The IFFT and FFT submodules carry out the operations on lines 6 and 7 of Algorithm 1, respectively. Each output of the IFFT core is multiplied by 1/σ (cf. Section III-B2) and then sign-refined by the corresponding 1-bit received signal [r_b]_w (carried out with the logic labeled "SR" in Figure 4) to produce the vector α_b, as shown on line 6. Note that the missing factor of √2 is absorbed into the scaling schedule of the IFFT implementation. The result, which is in the time domain, is then fed to the nonlinear function ω defined in (14), which is implemented by a small LUT, as detailed in Section III-A. The output of the ω submodule is once again sign-refined before being converted back to the frequency domain by the FFT core. For an OFDM system with W = 128 subcarriers, the FTF module has a latency of L_FTF = 702 clock cycles, which is the sum of the latencies of all submodules.

4) MVM2: This module is responsible for the matrix-vector multiplications on line 10 of Algorithm 1, as well as the multiplication by the step size κ, i.e., it computes κG. To this end, the MVM2 module contains U processing elements (PEs), each of which consists of a complex-valued multiplier, a complex-valued adder, logic to perform complex conjugation (denoted by "conj."), and a shifter that carries out the multiplication by the step size κ, since the step size is chosen to be a negative power of two (i.e., 1/32). The architecture details of a PE are shown on the right side of Figure 4.
We note that the sequence of operations carried out by the MVM2 module is not exactly the same as that of the for-loop on lines 9 to 11 of Algorithm 1. This is because the MVM2 module receives the elements of the B × W matrix V, defined on line 7 of Algorithm 1, in a row-by-row fashion over BW clock cycles, while line 10 of the algorithm shows the multiplication with one column of V at a time. The module receives the bth row [V^T]_b during clock cycles (b − 1)W + 1 to bW. As soon as it receives the bth element of

5) H-MEM: This module is a BW × U memory that stores the W frequency-domain channel matrices {Ĥ_w}, w = 1, . . . , W. During the last iteration of each data detection task, in which the entries of the channel matrices are used for the last time, the control unit within the UPC module signals the channel estimation module that the H-MEM is ready to receive new matrices pertaining to the next data detection problem and asserts the write-enable (WE) signal of the H-MEM block.
Remark 4. All modules of the proposed architecture have been carefully pipelined in order to improve hardware efficiency. As a result, our designs achieve relatively high clock frequencies.
We use pipelining registers at the inputs and outputs of each multiplier, which are implemented by the DSP48 units of the FPGA. However, in order to keep our architecture diagrams simple, we do not show these pipelining registers in Figure 4. The critical path of the design is shown in red in Figure 4 and passes through the memory array "r-RAM" within the FTF module.
C. Timing Schedule
Figure 5 illustrates the sequence of operations carried out within one iteration of Algorithm 1 by each of the four main modules detailed in Section IV-B. The horizontal axis shows the progress of time measured in clock cycles. Each row of this diagram corresponds to one of the modules, indicated by its name on the right. The small boxes in each row show the operations of one clock cycle of the corresponding module (if it is active). Each box contains the antenna (b) and frequency (w) indices of the data being processed in that clock cycle, as well as the output of the module in that cycle (if it exists). Additionally, the diagram shows the timing schedule of the modules relative to each other. For example, at the beginning of each iteration, the MVM1 module begins to compute the first entry of z_1, i.e., [z_1]_1, after a latency of L_UPC clock cycles. Subsequently, the FTF module begins its operation once it receives a valid output from the MVM1 module, which requires a latency of L_UPC + L_MVM1 clock cycles, counted from the beginning of the iteration. It is important to note that the MVM2 module produces valid outputs only during the last W clock cycles of its operation in each iteration, due to its multiply-accumulate approach for computing the matrix-vector product; see Section IV-B for more details. Once the result (κG) of the MVM2 module is ready, it is transferred to the UPC module in order to update the estimate matrix S^(k−1) of the previous iteration, according to line 12 of Algorithm 1. In total, each algorithm iteration requires BW + L_UPC + L_MVM1 + L_FTF + L_MVM2 clock cycles, as discussed in Section IV-A.
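The multiply-accumulate behavior of MVM2 can be illustrated with a small software model. The sketch assumes that line 10 of Algorithm 1 multiplies the conjugate-transposed channel matrix of subcarrier w with the wth column of V; the indexing `H[w][b][u]` and `V[b][w]` and the function names are our own conventions. The streaming version consumes V row by row, so each accumulator only holds its final value after the last row (b = B) has been processed, and it yields the same κG as the column-wise reference.

```python
def mvm2_streaming(H, V, kappa=1 / 32):
    """Accumulate kappa*G row by row; G[u][w] = sum_b conj(H[w][b][u]) * V[b][w]."""
    W, B, U = len(H), len(H[0]), len(H[0][0])
    acc = [[0j] * W for _ in range(U)]
    for b in range(B):            # one row of V every W clock cycles
        for w in range(W):        # one subcarrier per clock cycle
            for u in range(U):    # the U PEs operate in parallel in hardware
                acc[u][w] += complex(H[w][b][u]).conjugate() * V[b][w]
    return [[kappa * g for g in row] for row in acc]


def mvm2_columnwise(H, V, kappa=1 / 32):
    """Reference: one column of V (one subcarrier) at a time."""
    W, B, U = len(H), len(H[0]), len(H[0][0])
    G = [[0j] * W for _ in range(U)]
    for w in range(W):
        for u in range(U):
            G[u][w] = kappa * sum(
                (complex(H[w][b][u]).conjugate() * V[b][w] for b in range(B)),
                0j,
            )
    return G


# Toy dimensions: W = 2 subcarriers, B = 2 antennas, U = 2 UEs.
H = [[[1 + 1j, 2], [0, 1j]], [[1, 1j], [2, 1 - 1j]]]
V = [[1, 1j], [2, -1]]
same = mvm2_streaming(H, V) == mvm2_columnwise(H, V)
```

Both orderings produce the same sums. The streaming order matches the order in which the FTF module delivers its outputs, at the cost of keeping U × W partial sums alive for the whole iteration.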
Remark 5. Typical values of B and W encountered in massive MU-MIMO-OFDM systems result in relatively high latency and low per-instance throughput, as seen in Table I. This stems from the sequential nature of the proposed 1BOX algorithm, which exhibits stringent data dependencies caused by the repeated time-to-frequency and frequency-to-time transforms. Therefore, it is non-trivial to decompose our algorithm into independent tasks, as opposed to linear data detectors that transform the frequency-selective system into a set of independent frequency-flat systems. Nevertheless, a straightforward way to increase the throughput is to deploy multiple parallel instances that operate on different sets of received data. In addition, it is possible to increase the throughput per instance of the proposed data detector by using highly parallel FFT architectures and matrix-vector product engines. Further possible improvements are (i) more aggressive pipelining and (ii) interleaved processing of multiple symbols concurrently in a single instance. Combining these techniques with an ASIC implementation in a modern CMOS technology node might achieve Gb/s throughputs. The design of such improved architectures is left for future work.

D. Fixed-Point Parameters
In order to maximize hardware efficiency, we deploy fixed-point arithmetic. The word length of each signal has been optimized based on BER simulations to minimize the loss compared to a floating-point reference model, while maintaining low area. In what follows, we use the notation [x.y] to denote the 2's complement binary format of a fixed-point signal that has x integer bits (including the sign bit) and y fractional bits, with a total word length of x + y bits. The reported word lengths are for each of the real and imaginary components of complex-valued signals. The channel entries [H_w]_b are in format [4.4]. The entries of the vectors z_b are in format [5.5]. The entries of the matrices V, G, and S^(k) are in format [4.4], [1.7], and [2.7], respectively. For the entries of α_b, we use format [5.4]. The ω LUT, which stores the fixed-point values of the function ω(t) for t_n < t < t_p, consists of 128 entries in format [3.4]. The LUT containing the values of 1/σ consists of 256 entries in format [2.6]. In Figure 4, we show the word length of the real and imaginary parts of each complex-valued signal with a number next to the signal connection. The BER performance corresponding to the fixed-point hardware implementation is shown in Figure 3.
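The [x.y] notation can be made concrete with a small quantization helper. The sketch below rounds a real value to the nearest representable value and saturates at the range limits. The rounding mode (Python's `round`, which rounds ties to even) and the use of saturation instead of wrap-around are assumptions, as these details of the datapath are not specified here.

```python
def quantize(x, int_bits, frac_bits):
    """Quantize x to the 2's complement format [int_bits.frac_bits].

    int_bits includes the sign bit, so the representable range is
    [-2**(int_bits - 1), 2**(int_bits - 1) - 2**(-frac_bits)].
    """
    scale = 1 << frac_bits
    total_bits = int_bits + frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    q = max(q_min, min(q_max, round(x * scale)))   # round, then saturate
    return q / scale


h = quantize(0.3, 4, 4)      # [4.4] format used for the channel entries
g = quantize(0.999, 1, 7)    # [1.7] format used for G: saturates below 1.0
```

With the [4.4] format, for instance, the step size between adjacent values is 1/16 and the largest representable value is 7.9375.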

E. Implementation Results and Comparison
In order to demonstrate the efficacy of the proposed architecture for the 1BOX algorithm, we now present implementation results on a Xilinx Virtex-7 XC7VX690T FPGA. Table I summarizes the FPGA implementation results of the 1BOX detector for two system dimensions B × U, i.e., 64 × 4 and 128 × 8. We reiterate that these are, to the best of our knowledge, the first hardware implementation results of a 1-bit data detector, which renders a fair comparison with existing designs difficult.
Nevertheless, we provide a comparison with state-of-the-art massive MIMO data detectors that have been designed for massive MU-MIMO systems with high-resolution ADCs. We reiterate that this comparison is not entirely fair and its purpose is to show the overhead in hardware efficiency caused by the proposed iterative detection algorithm that supports massive MU-MIMO-OFDM systems with 1-bit ADCs. We emphasize that the reference linear data detectors are not specialized to operate in 1-bit systems and therefore perform poorly in such systems; see Section III-F for the details.
The design in [31] implements an L-MMSE data detector for single-carrier frequency-division multiple access (SC-FDMA)-based massive MU-MIMO systems. This design, which is referred to as "NS-SCFDMA" in this paper, uses three iterations of a Neumann series to perform an approximate matrix inversion per subcarrier. The design in [26] implements an optimized coordinate descent (OCD) algorithm that approximates a box-constrained detection problem for massive MU-MIMO-OFDM systems; frequency-domain conversion circuitry is, however, excluded. The design in [42] implements a parallel Gauss-Seidel (PGS) algorithm that iteratively solves an L-MMSE data detection problem for frequency-flat channels. In Table I, we also include two of the most recent state-of-the-art linear detectors for massive MU-MIMO systems, proposed in [27] and [43]. The design presented in [27] implements a recursive conjugate-gradient-based L-MMSE detector, called RCG, and the design in [43] implements an algorithm based on the steepest descent and Barzilai-Borwein algorithms, abbreviated as ESDBB, that solves the L-MMSE detection problem iteratively. All of the compared designs consider a system with B = 128 BS antennas and U = 8 UEs. However, the throughputs reported for the reference designs NS-SCFDMA, OCD, PGS, and RCG are for 64-QAM, while we consider 16-QAM for the 1BOX data detector. Therefore, in Table I, we scale the reported throughput of these reference designs by a factor of 4/6, so that the throughput of all designs is with respect to 16-QAM. We include the throughput for ESDBB as it was reported in [43], since the modulation scheme was not specified for that design. We note that the PGS, NS-SCFDMA, and RCG designs contain hardware to compute the post-equalization signal-to-interference-plus-noise ratio (SINR) and log-likelihood ratios (LLRs), which is not present in our design.
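The scaling factor of 4/6 follows from the number of bits per symbol: 16-QAM carries log2(16) = 4 bits and 64-QAM carries log2(64) = 6 bits. A minimal sketch of this normalization is given below; the function name and the example throughput of 600 Mb/s are hypothetical and do not correspond to any entry of Table I.

```python
from math import log2


def rescale_throughput(reported_mbps, reported_order, target_order=16):
    """Rescale a throughput from one QAM order to another via bits per symbol."""
    return reported_mbps * log2(target_order) / log2(reported_order)


# A hypothetical detector reporting 600 Mb/s for 64-QAM is listed with
# 600 * 4/6 Mb/s when all designs are compared with respect to 16-QAM.
normalized = rescale_throughput(600.0, reported_order=64)
```

This normalization assumes that the symbol rate of a design does not depend on the constellation size, which holds when the detector processes one symbol vector per fixed number of clock cycles.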
Table I includes two metrics for comparing the hardware efficiency of different designs: (i) throughput per look-up table (LUT) utilization and (ii) throughput per normalized resource consumption (NRC), which is calculated as NRC = LUT + FF + 280 × DSP48 [44]. The NRC is a more accurate measure for assessing the resource consumption than the LUT count alone as it takes into account the other FPGA resources as well.
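The two efficiency metrics can be computed as in the following sketch, which uses the NRC definition given above. The resource counts and the throughput in the example are hypothetical placeholders and are not taken from Table I.

```python
def nrc(lut, ff, dsp48):
    """Normalized resource consumption: NRC = LUT + FF + 280 * DSP48."""
    return lut + ff + 280 * dsp48


def throughput_per_lut(throughput_mbps, lut):
    """Hardware efficiency with respect to the LUT count only."""
    return throughput_mbps / lut


def throughput_per_nrc(throughput_mbps, lut, ff, dsp48):
    """Hardware efficiency with respect to the normalized resource count."""
    return throughput_mbps / nrc(lut, ff, dsp48)


# Hypothetical design: 10000 LUTs, 15000 FFs, 50 DSP48 slices, 20 Mb/s.
resources = nrc(10000, 15000, 50)                     # 39000
efficiency = throughput_per_nrc(20.0, 10000, 15000, 50)
```

In this example, the 50 DSP48 slices contribute 14000 to the NRC, which is more than the LUTs do. This illustrates why a LUT-only metric can favor multiplier-heavy designs.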
As shown in Table I, the throughput and latency results of the 1BOX data detector fall short of those of the reference data detectors, which are specialized for BS architectures with high-resolution ADCs. However, our design achieves similar hardware efficiency in terms of throughput/NRC compared to the other designs, and even higher efficiency than PGS and ESDBB. The RCG implementation achieves the highest throughput and efficiency, but excludes OFDM processing circuitry. It is important to note that the proposed 1BOX implementation directly operates on 1-bit measurements, while the reference designs perform linear data detection based on high-resolution measurements. Consequently, the error-rate performance of the reference designs would, at best, be close to that of ZF-DET shown in Section III-F for 1-bit massive MU-MIMO systems. Additionally, our 1BOX implementation is designed for OFDM systems and includes DFT processing, while the other designs do not include DFT processing, except for NS-SCFDMA, which includes logic for SC-FDMA processing. As a result, their hardware efficiency would be lower if OFDM processing circuitry were included. We note that it is not possible to exclude the submodules corresponding to OFDM processing in the FTF module from the 1BOX detector to enable a meaningful comparison with the reference designs that do not include OFDM processing. This is due to the fact that the operations in the FTF module are an integral part of each iteration of the 1BOX algorithm and are not separable from the rest of the algorithm. This is in stark contrast to linear detectors for high-resolution OFDM-based systems, which decompose the system into independent frequency-flat subsystems, so that a comparison between detectors designed for OFDM-based systems and detectors designed for frequency-flat systems is possible.
In order to shed more light on the FPGA resource utilization of the proposed 1BOX data detector, Table II shows a breakdown of resources corresponding to the individual modules. We note that the FTF module consumes the largest portion of resources. In turn, roughly 80% of the FTF resources are consumed by the FFT and IFFT cores that carry out the time-domain to frequency-domain transform and vice versa. This confirms that a direct comparison with other data detectors that exclude such time-to-frequency conversion is not accurate. In addition, Table II shows that most of the block RAMs are used in the H-MEM module, which contains the channel matrices for all subcarriers. The reference FPGA designs in Table I for high-resolution ADCs do not provide information on the amount of storage allocated for channel matrices.

V. CONCLUSIONS
We have proposed a quantization-aware data detection algorithm, called 1BOX, for massive MU-MIMO-OFDM systems operating over frequency-selective channels with 1-bit ADCs. To improve the channel estimates for such architectures, we have also proposed a channel estimation algorithm referred to as NGD-CHEST. We have shown using simulations that NGD-CHEST combined with the 1BOX data detector outperforms conventional linear channel estimation and data detection methods in terms of error-rate performance. Furthermore, we have developed a reference VLSI architecture and presented corresponding FPGA implementations for the 1BOX detector. Our results demonstrate that the proposed design achieves hardware efficiency comparable to that of data detectors that have been designed for high-resolution systems. The proposed channel estimation and data detection algorithms are particularly useful in all-digital massive MU-MIMO-OFDM systems operating at mmWave or terahertz (THz) frequencies, which require large antenna arrays and high bandwidths. Other use cases for the proposed algorithms and hardware designs could be radar or imaging systems for automotive or robotics applications that operate at high carrier frequencies (e.g., mmWave or THz), in which the RF power consumption is expected to be a bottleneck due to the high bandwidths and the large number of antennas.
There are numerous avenues for future work. Each instance of the 1BOX data detector achieves throughputs of tens of Mb/s, which is insufficient for next-generation wireless systems operating at mmWave frequencies. However, there exist several techniques, as outlined in Section IV-C, that can significantly increase the throughput per instance of our architecture. The design of such high-throughput architectures is part of ongoing work. Another important research direction is to explore the overall performance-complexity-cost trade-off in basestations that use low-resolution data converters. In such a study, the effect of reduced data converter resolution on the overall system power consumption and cost should be considered. Finally, the design of efficient algorithms for uplink timing and frequency synchronization as well as parameter estimation (such as SNR or noise variance) for systems with 1-bit ADCs is an important open research problem.