Photonic Computing to Accelerate Data Processing in Wireless Communications

Massive multiple-input multiple-output (MIMO) systems are considered as one of the leading technologies employed in the next generations of wireless communication networks (5G), which promise to provide higher spectral efficiency, lower latency, and more reliability. Due to the massive number of devices served by the base stations (BS) equipped with large antenna arrays, massive-MIMO systems need to perform high-dimensional signal processing in a considerably short amount of time. The computational complexity of such data processing, while satisfying the energy and latency requirements, is beyond the capabilities of the conventional widely-used digital electronics-based computing, i.e., Field-Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs). In this paper, the speed and lossless propagation of light is exploited to introduce a photonic computing approach that addresses the high computational complexity required by massive-MIMO systems. The proposed computing approach is based on photonic implementation of multiply and accumulate (MAC) operation achieved by broadcast-and-weight (B&W) architecture. The B&W protocol is limited to real and positive values to perform MAC operations. In this work, preprocessing steps are developed to enable the proposed photonic computing architecture to accept any arbitrary values as the input. This is a requirement for wireless communication systems that typically deal with complex values. Numerical analysis shows that the performance of the wireless communication system is not degraded by the proposed photonic computing architecture, while it provides significant improvements in time and energy efficiency for massive-MIMO systems as compared to the most powerful Graphics Processing Units (GPUs).


Introduction
The next generations of wireless communication networks, i.e., 5G and Beyond (5GB), are designed to accommodate a large number of smart devices, connected to each other, that are being served by base stations equipped with a massive number of antennas while the requirements of individual devices on the reliability, latency, and energy consumption are met [1,2]. Massive-MIMO systems are one of the key enablers of 5GB to provide high-data-rate and low-latency connectivity [3]. Such dense connectivity together with the demands for higher data rate and lower latency will increase the complexity, and accordingly, the cost of data processing in 5GB systems. In massive-MIMO systems, due to the large number of antennas at the BS and massive number of mobile devices in the network, parallel signal processing for missions such as channel estimation, precoding, and signal detection will become increasingly complex and time-consuming. This complexity has been considered as of the main bottlenecks of realizing massive-MIMO systems. Accordingly, efficient methods for reducing this complexity and improving the efficiency of the radio transceiver architectures are required. Different optimization techniques have been proposed in the literature to address the 5GB challenges from both algorithmic and hardware implementation perspectives [4][5][6][7][8]. Moreover, machine learning techniques have been recently exploited to reduce the ever-increasing computational complexity and to satisfy ultra-reliability, high data rate, and low latency requirements of 5GB [9,10]. It has also been shown that a careful co-design of algorithms and hardware parameters can result in an even more energy-efficient signal processing in MIMO systems [11,12].
While the proposed methods can result in a less complex signal processing for massive-MIMO systems, the limitations imposed by the digital electronics-based hardware preclude full exploitation of the improvements offered by those algorithms. In particular, in order to provide the desired connectivity in 5GB massive-MIMO systems, the base stations are potentially equipped with more than thousands of antennas to serve more than hundreds of mobile users in the network. In these scenarios, precoding the signals that are to be transmitted to the users or detecting the signals received from the users at the BS requires billions of MAC operations per second. For millimeter-wave (mmWave) 5G, in which the latency requirements limit the slot length to be as short as tens of millisecond, the required rate for MAC operations to perform tasks such as precoding or detection is in the order of hundreds of Tera MACs per second. This number can easily reach to tens of Peta MACs per second for beyond-5G wireless communications. In addition, due to the mobility of the users and the dynamics of the environment, the channel state information matrix, and accordingly, the precoding and/or detection matrix need to be updated frequently [13]. Such computational requirements can be hardly met with current power-and bandwidth-limited base stations with digital electronics-based processing units. Computing based on digital electronics hardware which is accompanied by components such as analog-to-digital (ADC) and digital-to-analog (DAC) converters faces fundamental challenges in terms of processing rate and energy consumption [14,15]. On the other hand, the fact that analog electronic components are frequency-dependent with poor reconfigurability in the radio frequencies (RF) limits their application in 5GB systems [16].
Photonics-based computing has been proposed as a promising approach to provide highperformance and low-latency systems for large-scale signal processing [17][18][19]. In particular, an integrated optical platform comprising of both active elements such as modulators, lasers and photodetectors, and passive elements such as waveguides and couplers has shown orders of magnitude improvement in computation time and throughput as compared to the electronics counterparts [20]. Leveraging unique features of light, optical computing has been regarded as one of the emerging technologies to address the "von-Neumann bottleneck" [21].
One of the recently-developed photonics-based computing protocols is Broadcast-and-Weight (B&W) [22], in which wavelength-division multiplexing (WDM) scheme, a bank of microring modulators (MRM), and balanced photodetectors (PD) are utilized to implement weighted addition in a photonic platform. This photonic MAC unit can provide significant potential improvements over digital electronics in energy, processing speed, and compute density [20]. However, the proposed architecture [22] has limitations that makes it unqualified for communication systems. One of the most important limitations is that the B&W architecture can only realize real-valued vectors or matrices. Moreover, it is not capable of operating MAC over two negative-valued inputs. Finally, there exist constraints on the number of MRMs and parallel wavelength channels that can be realized in the system. Therefore, the typical large matrices in communication systems need to be partitioned in an efficient way before processing.
In this paper, we exploit the B&W architecture to develop a photonic computing platform that meets the stringent requirements of next-generation wireless communication systems, including massive-MIMO-enabled networks. The proposed photonic computing platform tackles the aforementioned limitations of the B&W architecture, and hence, it is capable of supporting wireless communication networks. In particular, we devise simple preprocessing steps by which inputs (vectors or matrices) with arbitrary values (real or imaginary, positive or negative) can fit into the proposed computing platform. Furthermore, by utilizing different algorithmic approaches such as matrix inversion approximation and parallelization techniques, the efficiency of the proposed architecture for matrix inversion and large-size matrix multiplication is improved. Several numerical analyses show that while the performance of the proposed photonic architecture is comparable to the performance of the digital-electronics processing units such as GPUs, the time efficiency of the proposed architecture is significantly improved.

System Model
The computational complexity required for signal detection in a massive-MIMO system is formulated in this section and the capability of digital electronics and optics to address such computational requirement is explored accordingly. Consider a massive-MIMO uplink system with single-antenna users and a BS equipped with antennas, where the channels between the users and the BS are modelled as block Rayleigh fading channels. If x ∈ C ×1 denotes the vector of the transmitted symbols, that are selected from finite modulation set S, i.e., x = { | ∈ S}, and n denotes the white symmetric Gaussian noise, i.e., ∼ CN (0, 2 ), then the received signal, y ∈ C ×1 , can be written as where matrix H ∈ C × denotes the channels between the users and the BS, which is assumed to be perfectly known at the transmitter and the receiver.

Signal Detection in MIMO Systems
After the signals transmitted by the users are received, the BS is to obtain the best estimation of the signal transmitted by each user,x, by solving the following optimization problem, The optimal solution of the optimization problem in (2) can be obtained by the Maximum Likelihood (ML) detection. However, since the computational complexity of the ML detection increases exponentially by increasing the number of antennas, alternative detection schemes, such as Zero Forcing (ZF) or Minimum Mean-Square Error (MMSE) have been proposed in the literature [23]. Depending on the detection scheme that is employed by the massive-MIMO system, a detection matrix, namely A, is constructed and used to detect the transmitted signal through linear processing as followsx where

Computational Complexity of Signal Detection in Massive-MIMO Systems
The complexity of signal processing in 5GB can be mostly attributed to computationally-complex matrix operations such as multiplications and inversions of matrices with large sizes (see (4)).
In order to measure the complexity of signal processing required in massive-MIMO systems, the number of Floating Point operations (FLOPs) [24] or the number of MAC operations can be calculated (note that each MAC operation includes two FLOPs). As can be seen in (4), the linear detection process involves matrix inversions and matrix multiplications which both have complexity of the order ( 3 ) MACs (FLOPs). Accordingly, the computational complexity in a wireless communication system scales with the number of antennas at the BS, or with the number of users in the cellular network, or both. This is specifically a fundamental challenge for 5GB, where massive-MIMO is an essential part of the development.

Digital and Analog Electronics
Consider a massive-MIMO system in which the BS with more than a thousand antennas is serving more than a hundred mobile users. In this system, according to (4), the signal detection requires about one million MAC operations, while in order to meet the latency requirement the slot length should be as low as 125 s. Thus, the required MAC operations rate to complete the detection task is more than 440 TMACs/s. The BS must be able to process the data at such rate while keeping the power consumption and the size of the data processing unit equitable.
In recent years, GPUs and FPGAs have been developed to encompass general purpose tasks in the high-performance computing arena. The most powerful GPU architecture to date is NVIDIA VOLTA with 640 tensor cores and 21 billion transistors which can deliver more than 50 TMACs/S. However, tens of these units are needed to meet the required computational power. In these systems, I/O latency and sequential processing capabilities cannot exceed the time resolution of the processor which is ultimately bounded by its clock rate.
To tackle the speed limitation of digital-electronics-based processors while maintaining a reasonable area and power consumption, an optical computing approach is proposed in this paper as a revolutionary computing paradigm. It allows very complex operations to be performed in real time, which can significantly offload electronic post-processing and provide a technology to make RF decisions on-the-fly.

Photonic Computation for Massive-MIMO Systems
The architecture of the proposed photonic computing platform for ultra-fast signal processing in the next generations of wireless communication networks is depicted in Figure 1. The processor core which is based on the B&W architecture [22] is a photonic-integrated circuit (PIC) fabricated on a silicon photonics (SiPh) platform. It contains a matrix-multiplication engine where the input vectors are loaded using microring resonators (MRRs) in the modulation and weight bank sections. The photodetector array performs the optical summation. Wavelength-division multiplexing scheme is adopted where all optical inputs, spatially separated by wavelength, lies in a single waveguide. The Electronic Control and Reconfigurable Unit (ECRU) is composed of interconnected ASIC, FPGA, central processing unit (CPU) and random-access memory (RAM) modules. Its main function is to generate analog control signals for setting the weights of the microring photonic modulators. Another important feature of the ECRU unit is to make sure that the processor core is well calibrated by correcting for the fabrication variations and regulating the controls signals against any thermal fluctuations. By means of the General-Purpose Input/Output (GPIO) or Universal Serial Bus (USB), the ECRU unit maintains a high-bandwidth communication link with a computer motherboard.

Matrix-Multiplication Engine
As depicted in Figure 2, in order to implement matrix multiplication, i.e. A × B, where A ∈ R × + and B ∈ R × , in the proposed optical architecture, the elements of the first vector-to-be-multiplied are loaded using all-pass MRMs and are encoded in the intensities of the wavelength-multiplexed signals (Modulation section). The elements of the second vector-to-be-multiplied are encoded as weights using add-drop MRMs (Weight Bank section). The interfacing of optical components with electronics are facilitated by the use of mixed-signal integrated circuit blocks such as DACs and ADCs, integrated inside the ASIC in the ECRU unit. The multiplication is performed by linking the elements of the first and the second vectors via an optical waveguide, and the accumulation is performed by the photodetector followed by a transimpedance amplifier (TIA) to provide electronic gain, which is also integrated in the same ASIC. For heterogeneous integration, the different analog and digital electronic control circuitry such as ADCs, DACs,

Photonic Matrix Inversion
In addition to matrix multiplication, matrix inversion is a widely-used operation in wireless communication networks. In 5GB, the inverse of an arbitrary, potentially large, and complex matrix needs to be calculated on-the-fly. Although the computational complexity of matrix multiplication and matrix inversion are in the same order, matrix multiplication is preferred from the hardware implementation perspective [13]. Accordingly, several algorithms have been proposed in the literature to approximate the inverse of a matrix with a set of matrix multiplications, e.g., conjugate gradients [25] and iterative algorithms such as Newton [26] and Neumann series method [27,28]. Cholesky factorization can also be considered as another technique that can be used to implement matrix inversion with a number of matrix multiplications [29]. Iterative algorithms can outperform other approaches by taking advantage of reusing available resources in the processing unit, and hence are considered to be more hardware-friendly [35].
In massive-MIMO systems calculating the inverse of the Gram matrix, H H, is an essential operation. In a massive-MIMO system with users and antennas at the BS, where , the Gram matrix becomes diagonally dominant, and that leads to an accurate approximation of inverse matrix in both Newton and Neumann series approximations.
In the following, calculating the inverse of matrix A ∈ C × using Newton iterative method and Neumann-series approximation techniques is explained.

Neumann-Series Approximation
The Neumann-series expansion of the inverse of a matrix A is given as [27,28] whereÂ −1 is guaranteed to converge to the exact inverse of matrix A when In massive-MIMO systems, where A = H H is a diagonally-dominant matrix, the condition in (6) holds. In that case, if matrix A is rewritten as where matrix A diag is a diagonal matrix with diagonal elements of A and matrix A off-diag holds all elements of the matrix A and has zeros on the main diagonal, the -term Neumann series approximation isÂ

Newton Approximation Method
For an arbitrary invertible matrix A, with an initial rough estimation of its inverse X −1 0 , the estimated inverse matrix at the th iteration of Newton approximation technique is [26] where A −1 diag , which is a diagonal matrix with diagonal elements of A, can be used as the first rough estimation, X −1 0 . The main advantage of the Newton method is that it converges quadratically to the inverse matrix if I − AX −1 0 < 1. This condition is satisfied for the diagonally-dominant Gram matrix in the uplink data detection in massive-MIMO systems.

Algorithm-Hardware Co-Design for Photonic Computing
In B&W architecture, the elements of the LHS matrix are encoded into the light intensities. This implies that only real positive values can be realized in the this architecture and it limits the application of the B&W-based photonic MAC in a variety of cases including wireless communication networks. In order to tackle this issue, in the following sections, preprocessing steps are proposed such that any arbitrary matrix can be represented by the optical architecture.

Preprocessing Step 1: Addressing Complex-valued Matrices
In order to represent complex-valued matrices with the proposed photonic computing platform, the real representation of complex-valued matrices is explored. Any arbitrary complex-valued matrix, A ∈ C × , can be written as the summation of the real and imaginary parts, where A ∈ R × and A ∈ R × are real-valued matrices denoting the real and imaginary parts of A, respectively. According to (10), the multiplication of two complex-valued matrices, namely A ∈ C × and B ∈ C × , can be obtained as (11b) Therefore, in order to multiply two complex-valued matrices, four parallel real-valued matrix multiplications of the same size as that of the original matrices can be considered.

Preprocessing Step 2: Addressing Negative-valued Matrices
The proposed solution to represent negative-valued LHS matrices in this architecture is to project the negative sign of the elements of the LHS matrix, A, to the sign of the RHS matrix, B.
In doing so, the negative-valued matrix A ∈ R × is rewritten as a subtraction between two positive-valued matrices,Ā ∈ R × + and | min |1, where min is the element of matrix A with the smallest value, and 1 is an all-one matrix of size × . The negative sign of the subtraction can then be projected into the sign of the RHS matrix elements as follows Therefore, negative-valued matrix multiplication can be performed by summation of two positivevalued matrix multiplications which can be processed in parallel in the proposed photonics-based computing architecture.

Preprocessing Step 3: Parallelization and Matrix Tiling Based on the Photonic Computing Architecture
The last preprocessing step is proposed to implement parallelization and matrix tiling methods. This step is required so that matrices with any arbitrary size can be processed efficiently using the proposed photonic computing system with limited number of cascaded modulators and parallel channels. Consider a computing unit with parallel wavelength channels, all-pass MRRs in the Modulation section (representing LHS matrix), and accordingly, add-drop MRRs in the Weight Bank section (representing RHS matrix) (Figure 2).Each single usage of such architecture can compute multiplication of matrices with sizes that lie within the range of parameters and ; see Figure 3. In order to optimally utilize this architecture to perform the multiplication between arbitrary matrices, A ∈ R × + and B ∈ R × + , the matrices need to be partitioned based on and , following the mapping discussed in Section 3.1. The results of the partial multiplications are recorded in the memory and in the last step, corresponding parts are added together to generate the final result. Figure 4 illustrates the implementation of the multiplication of matrices A and B, in which, without loss of generality, it is assumed that and are dividable into and , respectively.

Numerical Analysis
The time and power efficiency of the proposed photonic computing approach mainly depend on the number of parallel multiplexed waveguide channels, , and the number of different wavelengths that can be realized in each of those channels, . An optimally designed MRR is capable of supporting up to 108 WDM channels, taking both the finesse of the resonator and the channel spacing in linewidth-normalized units into account [30]. Here, we consider a finesse of F = 368 for the MRRs, and a minimum channel spacing of 3.41 × linewidth, based on the assumption that a 3dB cross-weight penalty is allowed [31]. Accordingly, the maximum number of the MRRs, in each of the modulation and weight-bank parts, will be = 100. An architecture with multiplexed waveguide channels contains 2 MRRs in total. However, assuming that a maximum of 1024 MRRs can be manufactured in the optical architecture [32], there is a finite set of feasible values for each of and . The optimal arrangement of the number of parallel waveguide channels and the number of MRRs fundamentally depends on the computational and energy requirements.
In this section, the performance and the efficiency of the proposed photonic computing platform in different practical scenarios are evaluated and compared to those of the conventional digital electronics-based processing units.

Power Consumption
The total power consumption of the proposed photonic computing architecture can be obtained by adding up the power usage of different photonic and electronic components. In the architecture proposed here, lasers are utilized for generating different wavelengths, each with 100 mW power usage. The architecture contains 2 MRRs and 2 DACs each with 19.5 mW and 26 mW power consumption, respectively. Finally, the TIA with 17 mW and the ADC with 76 mW power usage are integrated at the output of each waveguide. Hence, the total power consumption will be calculated as Using (13), the power usage of the proposed architecture is calculated and reported in Table 1, along with those of other computation hardware baselines [32]. The results show that the power consumption of the photonic system is close to that of digital processing hardware such as GPUs. Furthermore, based on the computational requirements of the task to be executed, the power consumption of the photonic computing system can be 1/3 of that of the best digital electronic processors in the literature.

Time Efficiency
The computation time of the proposed architecture mainly depends on the bandwidth of the components and the time that it takes for light to propagate through the architecture. The propagation time after multiplexing, when 2 MRRs are considered in each waveguide channel is estimated by where MRR is the radius and F is the finesse of the MRRs. and denote the speed of light and the effective refractive index of the waveguide, respectively. For an architecture with 2 cascaded MRRs, light propagates through the shared bus waveguide with the minimum length of 2 MRR × 2 and will be trapped by the in-resonance MRRs (only two in total, one in the modulation section and one in the weigh-bank section) F /2 times [33]. Accordingly, if 2 = 200, MRR = 10 , F = 368, and = 2.4, the propagation time is calculated to be 110 . The throughput of the other components integrated in the proposed architecture is provided in Table 2 [32]. SDRAM is connected to a computer and is considered as digital memory before DACs and after ADCs. According to Table 2, the speed is mainly limited by ADCs, DACs, and TIAs with a throughput of 10 / . Hence, the approximate computation time is equal to 100 for each single usage of the photonic system, namely, single use = 100 . In order to obtain the total processing time for multiplication of matrices A ∈ C × and B ∈ C × , the number of times that the architecture is used to calculate the final result needs to be calculated which depends on the dimensions of the matrices. Moreover, the upper-bound of the processing time is obtained by assuming that both the first and the second preprocessing steps in Section 3.3 are required. Considering an optical chip with parallel waveguide channels, and MRRs to represent the corresponding elements of each of the matrices, the number of times that chip should be used to compute the multiplication of A and B, namely use , can be obtained (see Figure 4) as and accordingly, the total processing is calculated as The total processing time in (16) and the power usage calculated in (13) show that designing the optimal architecture requires optimization over and parameters considering the requirements of the target application.
In this work, General Matrix Multiplication (GEMM) is considered as a benchmark to evaluate the computation speed of the proposed architecture. Figure 5 illustrates the computation time of the proposed photonic system as compared to the runtime of Titan XP GPU [34]. The parameters of the benchmarked scenarios are the dimensions of the matrices as listed in Table 3. As shown in Figure 5, by increasing the number of parallel waveguide channels and/or the number of MRMs in the proposed architecture, the processing time can be reduced notably. Moreover, Figure 5 highlights the fact that the processing time of the photonic computing platform is lower than that of Titan XP GPU, and the gap between the processing times exponentially increases when the photonic architecture scales up.

Massive-MIMO Wireless Communication Systems
In this section, the performance and the efficiency of the proposed photonic computing system when employed in massive-MIMO wireless communication systems is studied. For this purpose, MMSE signal detection in an uplink scenario, where single-antenna users transmit their signals to a BS equipped with antennas, is considered. The Channel State Information (CSI) is assumed to be perfectly known at the transmitter and the receiver. The performance, in terms of the Symbol Error Rate (SER), and the processing time are compared with one of the most powerful GPUs, namely NVIDIA GeForce RTX 2080 Ti .

Time Efficiency Analysis
In this numerical study, a massive-MIMO system that consists of a 1024-antenna BS which is serving = 64 users is modelled. The processing time of this system for different parameters of the photonic system, i.e., different numbers of waveguide channels and different numbers of MRRs in each channel, is evaluated using (16). Additionally, the computation time for two matrix inversion approximation methods, namely Neumann-series and Newton approximations, explained in Sections 3.2.1 and 3.2.2, is reported. As shown in Figure 6, the processing time associated with the proposed photonic computing system is significantly less than that of GPU in both approximation approaches. Furthermore, as expected, increasing the number of parallel waveguide channels and modulators in each of those channels can further reduce the runtime. However, other limitations must be taken into account when increasing the number of components in the photonic system. It can be seen in Figure 6 that the computation time of Neumann-series approximation is marginally lower than that of the Newton approximation method. Moreover, the study in [35] shows that both methods perform equally in terms of SER. Hence, Neumann approximation method is adopted in the next numerical experiments.

Precision Analysis
In this section, we seek to evaluate the impact of the modulator control precision on the performance of the proposed optical platform. We consider a MIMO system with = 8 users being served with a BS equipped with = 64 antennas. The signal detection process in the BS is performed utilizing a photonic system with = 8 parallel waveguide channels, each of which with 2 = 16 MRRs. Figure 7 illustrates the SER for different precisions over 10 5 channel realizations. (In that figure GPU Exact indicates the performance of the GPU when the matrix inversion is computed using the built-in functions in the considered software, rather than approximating that using Neumann approximation method.) It can be seen that in the low-SNR regime, where the effect of transmission noise is dominant, the performance of the proposed photonic computing platform is similar to that of the GPU, regardless of the precision bits. However, as the SNR increases, the low-precision error becomes more dominant. As shown in Figure 7, for relatively higher SNR values, there is a notable gap between the performance of the photonic computing platform with only 6-bit modulator precision and that of the GPU. However, photonic computing can achieve the same performance as that of the GPU when the precision reaches to 8 bits (plus the sign bit). Therefore, in the next numerical experiments, an 8-bit precision is considered for the performance analysis.

Performance Analysis
To evaluate the performance of the proposed photonic computing system when the number of BS antennas increases in a massive-MIMO wireless communication network, an uplink system with = 8 users and BS with different antenna array length is modelled. The signal detection in the BS is performed using MMSE detection method. Figure 8 summarizes the simulation results for a photonic computing system with = 8 parallel waveguide channels, 2 = 16 MRRs in each channel, and 8-bit precision (plus the sign bit) over 10 5 channel realizations. It can be seen in Figure 8   number of antenna elements (antenna array gain) can significantly improve the performance of the system in both cases. (Note that by increasing SNR, SER improves to the extent that for SNR greater than -14 dB, no symbol error has been found.) Considering the above evaluation of the performance, power consumption, and the processing time of one of the most powerful GPUs, i.e., NVIDIA GeForce RTX 2080 Ti, and those of the proposed optical computing platform, it can be seen that optical computing can provide the same performance in a relatively lower power consumption as conventional GPUs, while it can significantly reduce the processing time. That makes the proposed optical architecture as a promising candidate which is capable of supporting high computational complexity required in the next generations of wireless communication networks, while it can meet low latency and high reliability requirements of those systems.

Conclusion
In this paper, a photonic computing architecture is proposed to be employed in next-generation massive-MIMO wireless communication systems to address their stringent computational requirements. The proposed computing approach is based on the B&W architecture to implement MAC operations in the optical domain exploiting the light speed and lossless propagation. Preprocessing steps are developed so that the proposed computing system can represent matrices with any arbitrary values. Numerical experiments confirm that the proposed photonic computing architecture can offer the same performance as those of the most powerful digital electronics-based data processing units such as GPUs, while its time efficiency is shown to be several orders of magnitude better than that of the modern state-of-the-art GPUs. Based on the simulation results, the proposed photonic platform can be integrated into the 5GB base stations as a powerand cost-efficient solution to enable ultra-fast data processing for the next-generation wireless communication networks.
Funding. The research carried out for this work is self-funded. 16 14 12 10 SNR (dB)  University, Ontario, Canada, for the fruitful discussions that the authors had with him.
Disclosures. The authors declare no conflicts of interest.