Learning Energy-Efficient Transmitter Configurations for Massive MIMO Beamforming

Hybrid beamforming (HBF) and antenna selection are promising techniques for improving the energy efficiency (EE) of massive multiple-input multiple-output (mMIMO) systems. However, the transmitter architecture may contain several parameters that need to be optimized, such as the power allocated to the antennas and the connections between the antennas and the radio frequency chains. Therefore, finding the optimal transmitter architecture requires solving a non-convex mixed integer problem in a large search space. In this paper, we consider the problem of maximizing the EE of fully digital precoder (FDP) and HBF transmitters. First, we propose an energy model for different beamforming structures. Then, based on the proposed energy model, we develop a self-supervised learning (SSL) method to maximize the EE by designing the transmitter configuration for FDP and HBF. The proposed deep neural networks can provide different trade-offs between spectral efficiency and energy consumption while adapting to different numbers of active users. Finally, towards obtaining a system that can be trained using in-the-field measurements, we investigate the ability of the model to be trained exclusively using imperfect channel state information (CSI), both for the input to the deep learning model and for the calculation of the loss function. Simulation results show that the proposed solutions can outperform conventional methods in terms of EE while being trained with imperfect CSI. Furthermore, we show that the proposed solutions are less complex and more robust to noise than conventional methods.


I. INTRODUCTION
Wireless communication has been revolutionized by massive MIMO (mMIMO) technologies, which are already one of the key enabling technologies of fifth-generation (5G) wireless networks thanks to their potential to increase transmission capacity through the deployment of large-scale antenna arrays at the transmitter or receiver side [1]. As a result, millimeter wave (mm-Wave) communications can be used at longer ranges, thus greatly increasing the bandwidth available to wireless networks [2].
The conventional implementation of MIMO systems utilizes a dedicated radio frequency (RF) chain for each antenna element. Even though this approach is appropriate for common small-scale MIMO systems, it is inadvisable for mMIMO systems equipped with a large number of antenna elements due to the high production costs and power consumption associated with the RF circuitry. Therefore, even though mMIMO is an important technology for future generations of wireless networks, it still faces many technical challenges in improving its energy efficiency (EE), which remains a subject of ongoing research [3]. In light of this, hybrid beamforming (HBF) and antenna selection have been proposed as effective ways to facilitate the implementation and to improve the EE of mMIMO systems [4]. Indeed, HBF reduces the number of RF chains and digital-to-analog converters (DACs), which helps to improve the EE. Accordingly, HBF techniques are being examined for 5G cellular networks in the mm-Wave frequency bands, and will likely also be found in sixth-generation (6G) networks [5].
Different HBF structures have been proposed to achieve different trade-offs between cost, energy consumption, and spectral efficiency (SE). They can be grouped into three general categories: fully-connected HBF (FC-HBF) [6], fixed subarray HBF (FSA-HBF) [7], and dynamic subarray HBF (DSA-HBF) [8]. Each category has its advantages and limitations. FC-HBF offers flexibility but has higher implementation complexity. FSA-HBF balances SE and complexity. Finally, DSA-HBF provides adaptability at the cost of additional design complexity. To configure an HBF structure, one of the most prominent techniques consists of minimizing the Euclidean distance between the desired fully digital precoder (FDP) and its hybrid counterpart [9]. However, this technique requires designing the FDP, which is computationally complex and not necessarily energy-efficient. Furthermore, the number of possible HBF structures is extremely large, which makes it difficult to find an optimal one. Therefore, the question that arises is how to efficiently design the best HBF structure in terms of energy consumption and SE. Towards answering this question, our first step consists of proposing an accurate energy model that gives the power consumption of each component in different beamforming structures. Our second step involves applying machine learning-based approaches to design the beamforming structure instead of using complex optimization-based ones.
Thanks to the enormous success of deep learning (DL) in a wide variety of engineering fields, deep neural networks (DNNs) have received significant attention in recent years and have been widely applied to wireless communication systems [10], [11]. Although training DNNs to solve wireless communication problems can be computationally intensive, it can take place offline, and only the trained DNN model is used to make online decisions, thus reducing the overall complexity. Different studies have used DNNs to address complex problems within the physical layer [12]. In supervised learning approaches, the time spent on data labeling is not negligible. In addition, this procedure must be performed each time a new dataset is used for training. In reinforcement learning (RL) approaches, an agent collects online data as it interacts with its environment in a trial-and-error manner. In mMIMO systems, since the HBF action space is large, the convergence of the RL model requires a large number of experiments. As a consequence, unsupervised learning has an advantage over supervised learning and RL for this problem: it can extract meaningful patterns from large datasets without relying on explicit labels or a large training overhead.
To summarize, in this study, we aim to optimize the EE of mm-Wave mMIMO systems by designing HBF structures and FDP using DL-based techniques. The problem consists of jointly designing the transmitter configuration and beamforming weights that maximize the EE. To accomplish this, we first propose an accurate energy model that takes into account the power consumption of the different components of the mMIMO system. Second, we propose an unsupervised deep learning approach that incorporates two key components to design an energy-efficient beamforming structure: (i) a novel loss function that considers different trade-offs between SE, energy consumption, and active users, and (ii) imperfect channel state information (CSI) during both the training and inference phases.

A. Related Works
In [13], the authors compared the EE of six different phase-shifter (PS)-based and switch-based HBF structures. However, given the hardware available today, the energy model in [13] overstates the power consumption of PSs, which makes the conclusion unfair to PS-based approaches. Many studies have been proposed in the context of DL-aided HBF design and antenna selection algorithms [14]-[26]. In particular, a received signal strength indicator (RSSI)-based FC-HBF design implemented with supervised learning is proposed in [14]. The authors of [15] suggested a supervised learning approach for FC-HBF design under perfect CSI. Another form of supervised learning is also proposed for the FSA-HBF design with perfect CSI in [16]. The authors of [17] proposed a reinforcement learning (RL) approach to design the HBF. However, they assumed that the CSI is known perfectly, and due to the continuous action space, their method relies on deep deterministic policy gradient (DDPG), which is computationally complex [27]. In the context of unsupervised learning for the FC-HBF design, the authors in [19], [20] presented a novel HBF design employing imperfect CSI for single base station (BS) and cell-free mMIMO (CF-mMIMO), respectively. However, their approaches are only for FC-HBF. In [21], the authors proposed an unsupervised learning approach for HBF and antenna selection using a differentiable activation function for 1-bit PSs. However, the main objective is to maximize the SE of the mMIMO system, and the authors neither optimized the EE nor considered an accurate energy model. In [22]-[24], the authors proposed a joint antenna selection and precoding design with an iterative algorithm and a DL solution to maximize the SE of multi-user multiple-antenna downlink systems. The proposed ML approach assumes perfect CSI for the training data, does not optimize the EE, and requires a complex iterative algorithm to generate the training samples. In [25], [26], the authors proposed a supervised learning approach to solve the antenna selection problem. However, the proposed method only applies to FDP.

B. Contributions
In this paper, we consider both FDP and HBF transmitters and develop new unsupervised deep learning solutions that jointly design beamforming and antenna selection while taking into account the power consumption and insertion loss (IL) of all components. For FDP, the proposed solution designs the FDP vectors along with the antenna selection solution, while for HBF, thanks to a multi-tasking DNN, the proposed solution directly provides the analog precoder (AP) and digital precoder (DP) with the power allocation among the antennas. A preliminary version of this work was published in [28], where we only considered maximizing the SE by designing the HBF for fixed and dynamic HBF structures.
In summary, the contributions of this work are as follows:
• We propose an accurate energy model for the FDP and HBF structures while considering the latest state-of-the-art hardware solutions.
• We propose an unsupervised deep learning solution robust against imperfect CSI to find the optimal energy-efficient antenna selection for FDP and transmit power allocation for HBF considering the proposed accurate energy model. Due to the binary constraints of beamforming connections, our unsupervised deep learning approach makes use of the Gumbel-Sigmoid technique inspired by Gumbel-Softmax. The Gumbel-Sigmoid technique is designed such that it considers the constraints of all components involved in the beamforming connections.
• We design an unsupervised loss function that takes into account the SE, the energy consumption (EC), and the number of active users. Thanks to this loss function, the proposed solution is flexible: it can adjust the power consumption according to the number of active users and can provide a trade-off between SE and EC.
• We train the proposed unsupervised deep learning solution using imperfect CSI for both the DNN input and the loss function computation. We also investigate the noise tolerance of our approach by showing that imperfect inputs can be beneficial and improve the EE of the mMIMO system.
• The proposed solutions are evaluated in a realistic ray-tracing channel model generated using a three-dimensional model of an urban environment to capture the geometry-based characteristics of the channel. The simulation results show that the proposed solution outperforms conventional solutions in terms of EE with lower computational complexity, and can be adapted to achieve different trade-offs between SE and EC.

C. Paper Organization and Notation
The rest of the paper is organized as follows: In Section II, we present the system setup followed by the baseline solutions and the channel model. The proposed energy model for the different beamforming structures is provided in Section III. Section IV presents the proposed energy-efficient unsupervised learning solutions for HBF and FDP, including a discussion of the DNN structure, the training phase, and the online phase. In Section V, we evaluate the performance of the proposed algorithms by comparing them with state-of-the-art solutions using a realistic ray-tracing channel model. Finally, Section VI concludes the paper.

II. SYSTEM MODEL AND BASELINES
Let us assume a time-division duplex (TDD) multi-user mMIMO system where channel reciprocity is available such that the uplink channel estimate can be used for the downlink transmission. The mMIMO system consists of a single BS in a single cell equipped with $N_T$ antennas and $N_{RF}$ RF chains serving $N_U$ single-antenna users simultaneously, as shown in Figure 1. The DP is performed in the baseband and then the output signal goes through the RF chains. Each RF chain is composed of a DAC, a low pass filter (LPF), a local oscillator (LO) and a mixer, and is connected to the $N_T$ antennas. The resolutions of all DACs and PSs are fixed to $b_D$ and $q$, respectively. The RF chains are connected to the antennas through PSs. The network of these connections and PSs, known as the AP, can be tuned based on different HBF structures.

A. Conventional Beamforming Structures
We first review the three conventional beamforming structures followed by their non-DL design methods, which are used as baselines for our proposed solutions. Connections between the RF chains and the antennas are represented by a binary matrix $\Omega$. For HBF, $\Omega = \Omega_{HB} \in \{0, 1\}^{N_T \times N_{RF}}$, and $[\Omega_{HB}]_{n,m} = 1$ if antenna $n$ is connected to RF chain $m$. For FDP, $\Omega = \Omega_{FD}$ is an $N_T \times N_T$ diagonal binary matrix, with $\Omega_{FD} = \mathrm{diag}(\omega)$, where $\omega = [\omega_1, \ldots, \omega_{N_T}]$ and $\omega_n = 1$ if antenna $n$ is activated.
1) Fully Digital Precoder (FDP): In FDP, each antenna is connected to an RF chain through a circuit consisting of a DAC, an LPF, an LO, and a mixer. The signal received by each user can be written as where $h_u \in \mathbb{C}^{N_T \times 1}$ stands for the channel vector from the $N_T$ antennas of the BS to user $u$, $x = [x_1, \ldots, x_u, \ldots, x_{N_U}]$ is the vector of transmitted symbols for all users, normalized to is the additive white Gaussian noise (AWGN) with mean 0 and variance $\sigma^2$, and $u_u$ denotes the precoder vector for the $u$-th user. The SE of the optimal fully digital precoder (O-FDP), $U_{opt} = [u_1, \ldots, u_u, \ldots, u_{N_U}]$, for single-antenna users is obtained by solving the following problem: where , the signal-to-interference-plus-noise ratio (SINR) of the $u$-th user is given by and $P_{TX}$ is the normalized total transmit power constraint.
The baseline results presented in this paper are obtained by solving (2) based on [29]. We refer to this approach as optimal FDP (O-FDP). For a given connection matrix $\Omega_{FD}$, the FDP is given by $U = \Omega_{FD} \times U_{opt}$.
2) Fully Connected Hybrid Beamforming (FC-HBF): In all HBF structures, we assume $N_{RF} \ll N_T$. Regardless of the chosen HBF structure, the signal received by each user can be written as The HBF vectors consist of a DP, $W = [w_1, \ldots, w_u, \ldots, w_{N_U}] \in \mathbb{C}^{N_{RF} \times N_U}$, and an AP, $A \in \mathbb{C}^{N_T \times N_{RF}}$. Since the AP is a combination of the PSs and combiners, and it depends on the HBF structure and on the connections between the antennas and the RF chains, we define it as follows: $A = P_q \otimes \Omega_{HB}$ (5), where $\otimes$ denotes the element-wise product and $P_q \in \mathbb{C}^{N_T \times N_{RF}}$ contains the coefficients of the $q$-bit PSs, with $[P_q]_{n,m} \in \{e^{j2\pi k/2^q} : k \in \{1, \ldots, 2^q\}\}$ the coefficient of the PS connecting the $n$-th antenna and the $m$-th RF chain. Therefore, the SE for a given HBF $(A, W)$ is given by and the SINR of the $u$-th user can be expressed as In FC-HBF, we set $\Omega_{HB} = \Omega_{FC}$, and all RF chains are connected to all antennas through PSs, combiners, and power amplifiers (PAs), as shown in Figure 2 (top), where the green boxes show the connections. This structure enables maximum design flexibility but requires a large number of PSs and combiners, which increases the implementation cost and energy consumption. The AP of the FC-HBF can be expressed according to (5) with $\Omega_{FC} = 1_{N_T \times N_{RF}}$ (8), i.e., the all-ones matrix. Since all the antennas are connected to all the RF chains through a PS with $q$-bit quantization, the feasible analog precoder for the $n$-th antenna and the $m$-th RF chain is $[A]_{n,m} \in \{e^{j2\pi k/2^q} : k \in \{1, \ldots, 2^q\}\}$. Conventional HBF solutions either rely on codebook-based solutions to limit the number of feasible solutions [30] or, more rarely, use real-valued PSs [9]. The conventional approach consists of first designing the O-FDP matrix in (2). Then, the AP and DP are designed in such a way that the resulting precoders approximate $U_{opt}$, subject to (5), (8), and (9). We obtain the FC-HBF solution of (10) using "PE-AltMin" and "MO-AltMin" proposed in [9].
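The feasible set of a $q$-bit PS described above can be illustrated with a short sketch. It enumerates the $2^q$ unit-modulus coefficients and maps an arbitrary phase to the nearest one. The function names `feasible_set` and `quantize_phase` are hypothetical helpers introduced only for this illustration, and nearest-phase rounding is an assumed quantization rule, not necessarily the one used by the baselines.

```python
import cmath
import math


def feasible_set(q: int) -> list:
    """All coefficients e^{j 2 pi k / 2^q} for k = 1, ..., 2^q."""
    levels = 2 ** q
    return [cmath.exp(2j * math.pi * k / levels) for k in range(1, levels + 1)]


def quantize_phase(theta: float, q: int) -> complex:
    """Map a phase in radians to the nearest q-bit PS coefficient.

    Nearest-phase rounding is an assumption made for this sketch.
    """
    levels = 2 ** q
    step = 2 * math.pi / levels
    k = round(theta / step) % levels  # index in {0, ..., 2^q - 1}
    if k == 0:
        k = levels  # the text indexes k from 1 to 2^q; k = 2^q is phase 0
    return cmath.exp(1j * step * k)


# A 1-bit PS can only apply a phase of pi or 2*pi, i.e. the values -1 and +1.
one_bit = feasible_set(1)
# A 2-bit PS maps a phase slightly above pi/2 to the coefficient j.
coeff = quantize_phase(math.pi / 2 + 0.1, 2)
```

Every returned coefficient has unit modulus, which reflects the constant-modulus constraint that makes the AP design non-convex.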
3) Subarray Hybrid Beamforming: Each antenna in a subarray structure is connected to only one RF chain through a PS. Consequently, the total number of PSs is reduced to $N_T$, instead of $N_T \times N_{RF}$ in the FC-HBF. In the subarray HBF structure, we consider two types of connection: (i) a structure equipped with fixed connections, known as fixed subarray HBF (FSA-HBF), or (ii) a structure equipped with dynamic connections, known as dynamic subarray HBF (DSA-HBF). Examples of possible connection matrices for each case are shown in Figure 2. The DSA-HBF structure enables the connections between the antennas and the RF chains to be dynamically switched at each time interval in response to changing conditions. It was shown that such a dynamic structure significantly enhances the SE of the system by providing more degrees of freedom in the HBF design compared to an FSA-HBF structure, and reduces the power consumption compared to the FC-HBF structure [8]. Therefore, based on the general definition of the AP ($A = P_q \otimes \Omega_{HB}$), the constraint on matrix $\Omega_{HB}$ for subarray HBF is given by $\sum_{m=1}^{N_{RF}} [\Omega_{HB}]_{n,m} = 1$ for every antenna $n$ (11). To find the precoder matrices for FSA-HBF, the general approach described in (10) for FC-HBF can be used. For the DSA-HBF, the connection pattern ($\Omega_{HB}$) between the RF chains and the antennas is dynamic and needs to be optimized, resulting in a large design space.
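The three connection patterns can be made concrete with a minimal sketch that builds binary connection matrices as nested lists (rows are antennas, columns are RF chains). All function names are hypothetical. The contiguous-block layout used for the fixed subarray and the uniformly random assignment used for the dynamic subarray are assumptions for illustration only; the actual layouts are those of Figure 2 and of the optimized design.

```python
import random


def fully_connected(n_t: int, n_rf: int) -> list:
    """Every antenna is connected to every RF chain."""
    return [[1] * n_rf for _ in range(n_t)]


def fixed_subarray(n_t: int, n_rf: int) -> list:
    """Assumed layout: antenna n is wired to RF chain n // (n_t // n_rf)."""
    assert n_t % n_rf == 0, "this sketch assumes equal-size subarrays"
    block = n_t // n_rf
    return [[1 if n // block == m else 0 for m in range(n_rf)]
            for n in range(n_t)]


def dynamic_subarray(n_t: int, n_rf: int, rng: random.Random) -> list:
    """One arbitrary dynamic pattern: each antenna picks a single RF chain."""
    omega = []
    for _ in range(n_t):
        chosen = rng.randrange(n_rf)
        omega.append([1 if m == chosen else 0 for m in range(n_rf)])
    return omega


def is_subarray(omega: list) -> bool:
    """True if every antenna (row) is connected to exactly one RF chain."""
    return all(sum(row) == 1 for row in omega)


def count_phase_shifters(omega: list) -> int:
    """One PS per established antenna/RF-chain connection."""
    return sum(sum(row) for row in omega)


fc = fully_connected(8, 2)
fsa = fixed_subarray(8, 2)
dsa = dynamic_subarray(8, 2, random.Random(0))
```

With 8 antennas and 2 RF chains, the fully connected matrix uses 16 PSs while both subarray matrices use 8. The number of patterns with exactly one RF chain per antenna is $N_{RF}^{N_T}$, which for 64 antennas and 8 RF chains is $8^{64} \approx 6.3 \times 10^{57}$ and illustrates why the dynamic design space is large.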

B. Problem Definition
The main objective of this paper is to maximize the EE of the mMIMO system by selecting the antennas and designing the beamforming (BF) structure. For the FDP case, the problem consists of finding the precoder matrix $U$ and antenna selection $\Omega_{FD} = \mathrm{diag}(\omega)$ that maximize the EE, while achieving a desired minimum average SE denoted as $R_d$. More formally, we seek to solve the following optimization problem: maximize where $P_{FDP}$ is the total power consumed by the BF components.
Similarly, the HBF design consists of finding the precoder matrices $W$ and $A$ and the power allocation that maximize the EE. Therefore, we have the following optimization problem: maximize where $P_{HBF}$ is the total power consumed by the HBF transmitter, and again $R_d$ is the minimum average required SE.
The power consumptions $P_{FDP}$ and $P_{HBF}$ are described in detail in Section III. In this paper, for simplicity, we consider a total power constraint for the transmitter, where the power transmitted by each antenna is not necessarily equal or limited to $P_{TX}/N_T$. It should be noted that, in HBF, turning off an antenna does not necessarily correspond to deactivating an RF chain. In contrast, in FDP each antenna is connected to its own RF chain and the power consumed by the RF chains is significant, so turning off an RF chain deactivates the corresponding antenna.

C. Channel Model
The experiments presented in this paper are based on the generic deep learning dataset for mm-Wave mMIMO systems (known as deepMIMO) [31], which provides a channel vector $h$ of length $N_T$ for each user position on a quantized grid. Each $N_T \times N_U$ channel matrix in the dataset is obtained by concatenating $N_U$ channel vectors randomly selected from the available user positions of the considered area.
Since we consider TDD communication with channel reciprocity, the estimated CSI in the uplink can be employed for the downlink. However, due to channel estimation errors, the downlink channel cannot be perfectly estimated. Thus, to model the channel estimation error, the BS uses a minimum mean square error (MMSE) estimator such that the estimated channel matrix is given by [32]: where $\beta$ represents the reliability of the estimate, and $\epsilon \sim \mathcal{N}(0, \sigma_e^2)$ is an error matrix modeled as zero-mean Gaussian noise with variance $\sigma_e^2$. Unlike previous DL-based studies, where perfect CSI is available during the training of the DNN, in this work, we propose to use the imperfect CSI ($\hat{H}$) not only as the input to the DNN but also to compute the loss function during the training phase. In Section V, we further evaluate the impact of the imperfect CSI by varying the value of $\beta$ and show that a moderate level of imperfection in the CSI can act as a regularizer for the DNN and slightly improve the SE.
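To make the role of the reliability coefficient concrete, the following sketch generates an imperfect channel estimate from a perfect one. It assumes the commonly used form $\hat{H} = \beta H + \sqrt{1 - \beta^2}\,\epsilon$ with circularly symmetric complex Gaussian entries in $\epsilon$. This form, the complex noise, and the function name `imperfect_csi` are assumptions made for illustration and may differ from the exact expression of [32].

```python
import math
import random


def imperfect_csi(h: list, beta: float, sigma_e: float,
                  rng: random.Random) -> list:
    """Return an estimate beta*H + sqrt(1 - beta^2)*eps (assumed model).

    h is a list of rows of complex channel coefficients. Each entry of eps
    is complex Gaussian with total variance sigma_e**2, split equally
    between the real and imaginary parts.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is expected to lie in [0, 1]")
    noise_scale = math.sqrt(1.0 - beta ** 2)
    std = sigma_e / math.sqrt(2.0)
    return [[beta * x + noise_scale * complex(rng.gauss(0.0, std),
                                              rng.gauss(0.0, std))
             for x in row]
            for row in h]


rng = random.Random(0)
h_true = [[1 + 2j, 3 - 1j], [0.5j, -2 + 0j]]
h_perfect = imperfect_csi(h_true, 1.0, 0.5, rng)  # beta = 1: no error
h_noisy = imperfect_csi(h_true, 0.9, 0.5, rng)    # beta < 1: perturbed
```

Under this assumed model, a value of $\beta$ close to 1 reproduces the true channel, while smaller values give more weight to the error term, which is how the estimation quality can be swept in experiments.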

III. ENERGY MODEL
In this section, we present an energy model for the different FDP and HBF hardware configurations, considering both the direct energy consumption as well as the energy consumption resulting from IL of each component.

A. General Beamforming Structure
We consider a regularity assumption where components of the same type have the same input/output interface, i.e., their inputs and outputs are connected to the same type and number of components. This assumption generally holds because it eases the design of generic circuits.
To better represent each HBF structure, we suggest a general template as shown in Figure 3 (a), where a given antenna is connected to a combiner having $c \in \{1, \ldots, N_{RF}\}$ inputs. Each input of a combiner is connected to the output of a phase shifter. Then, each phase shifter is connected to an RF chain through a switch. The number of switches is $\psi \in \{1, \ldots, N_{RF}\}$. As a result, the analog precoder can be fully characterized by specifying the tuple $(\psi, c)$. For instance, for the three conventional HBF structures that we discussed previously, we have:
• $(N_{RF}, N_{RF})$ for the FC-HBF structure. In the FC-HBF structure, all the switches are connected (i.e., $\psi = N_{RF}$), while the outputs of all the PSs are combined before each antenna, i.e., $c = N_{RF}$.
• $(1, 1)$ for the FSA-HBF structure. In FSA-HBF, each antenna is connected to a single, fixed RF chain, so there are neither switches nor combiners, i.e., $\psi = 1$ and $c = 1$.
• $(N_{RF}, 1)$ for the DSA-HBF structure. In DSA-HBF, only one switch can be connected at each time interval, therefore $c = 1$, while there are possible connections for all the switches, thus $\psi = N_{RF}$. It should be noted that such a configuration of switches works like a multiplexer. Thus, in a practical system, the switches are replaced by a $\psi \times 1$ multiplexer.
The hardware complexity of different beamforming techniques is compared in Table II.

B. Energy Consumption Analysis
We now describe the energy consumption of each component, and we list the most recent state-of-the-art hardware solutions.We consider components that are suitable for operating in the frequency range of 20-40 GHz.
A component of the set $\{D, L, M, LO, \Psi, \Phi, C, PA\}$ is denoted by $o$, and the correspondence between a component and its notation is defined in Table I. We denote by $\mathrm{IL}_o$ the insertion loss of passive component $o$, and when $o$ depends on some parameter $x$, we use $\mathrm{IL}_o(x)$. The average power dissipated by the active component $o$ is denoted as $P_o$, or $P_o(x)$ if $o$ depends on the parameter $x$. See Table I for the list of components and their parameters. Note that the power dissipated by the wires is neglected and that, when $c = 1$, there is no need for a combiner (i.e., $\mathrm{IL}_C(1) = 0$ dB). Likewise, the switches can be replaced with wires when $\psi = 1$ or $\psi = c$, that is, $\mathrm{IL}_\Psi(1) = \mathrm{IL}_\Psi(c) = 0$ dB, since all possible connections are always established.
In our energy model, we consider the possibility of turning off the RF chains or antennas to save power. The $n$-th antenna or the $m$-th RF chain is turned off when the $n$-th row or the $m$-th column of the matrix $\Omega$ is zero, respectively. Therefore, we can define $\mathcal{N}_T(\Omega) = \{n : \sum_{m} [\Omega]_{n,m} > 0\}$ and $\mathcal{N}_{RF}(\Omega) = \{m : \sum_{n} [\Omega]_{n,m} > 0\}$ as the sets of activated antennas and RF chains, respectively.
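The rule above (an antenna is off when its row of $\Omega$ is zero, an RF chain is off when its column is zero) can be sketched as follows. The function names are hypothetical, and indices are 0-based here, whereas the text counts antennas and RF chains from 1.

```python
def active_antennas(omega: list) -> set:
    """Indices n whose row of the connection matrix is not all zero."""
    return {n for n, row in enumerate(omega) if any(row)}


def active_rf_chains(omega: list) -> set:
    """Indices m whose column of the connection matrix is not all zero."""
    n_cols = len(omega[0])
    return {m for m in range(n_cols) if any(row[m] for row in omega)}


# Three antennas and two RF chains: antenna 1 is off, RF chain 1 is unused.
omega_hb = [[1, 0],
            [0, 0],
            [1, 0]]
antennas_on = active_antennas(omega_hb)
rf_chains_on = active_rf_chains(omega_hb)

# For FDP the matrix is diagonal, so both sets coincide.
omega_fd = [[1, 0, 0],
            [0, 0, 0],
            [0, 0, 1]]
```

The sizes of these sets determine how many PAs and RF chains contribute to the total power in the energy model.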
1) RF Front-End: The RF front-end corresponds to the circuitry between the antenna and the DAC. As shown in Figure 3 (b), for the FDP, this consists of low pass filters (LPFs), mixers, local oscillators (LOs), switches, and power amplifiers (PAs). On the other hand, in Figure 3 (a), the HBF requires a network of PSs, splitters, and combiners in addition to the components described for the FDP. Mixers, combiners, switches, and PSs are assumed to be passive devices that introduce IL.
For the mixer, based on the recent solution in [33], we consider $\mathrm{IL}_M = 6.4$ dB. The IL of the PS and the combiner plays a key role in designing energy-efficient HBF, especially for the FC-HBF, where all the RF chains are connected to all the antennas through PSs and combiners. In Table III, we list the ILs of PSs from some recent state-of-the-art references.
Based on this table, we choose $\mathrm{IL}_\Phi = 3.7$ dB with a resolution of $q = 9.4$ bits, and we assume $\mathrm{IL}_C = 1.8$ dB [34]. For DSA-HBF, the switches dynamically change the connections between the RF chains and the antennas to improve the flexibility of the structure. Since the switch losses apply to low-power signals and have little impact on the final power consumption, we assume $\mathrm{IL}_\Psi(\psi) = 1.1$ dB for the other values of $\psi$, considering a single pole single throw (SPST) switch [35]. Now, denoting by $P^{out}_{BB}$ the output power of each RF chain, the input power of the PA before the $n$-th antenna for all structures of the HBF (in mW) can be written as where $\mathrm{IL}_\Phi$ denotes the IL of the PSs and the IL values are expressed in a linear scale. In the FC-HBF, where all the RF chains are connected to the antennas ($\Omega_{HB}$ given in (8)), we have $(\psi, c) = (N_{RF}, N_{RF})$ and $\mathrm{IL}_\Psi(\psi = c) = 1$. For DSA-HBF, with the structure $(\psi, c) = (N_{RF}, 1)$ and the connection matrix $\Omega_{HB}$ in (11), the IL of the switches gives $\mathrm{IL}_\Psi(\psi) = 1.1$ dB, i.e., approximately 1.29 in a linear scale. In FSA-HBF, which has the structure $(\psi, c) = (1, 1)$, there are neither combiners nor switches. As a result, $\mathrm{IL}_\Psi(1) = 1$ and $\mathrm{IL}_C = 1$. Similarly, for the FDP, as shown in Figure 3 (b), the input power of the PA on the $n$-th antenna can be obtained as Finally, the direct current (DC) power drawn by the $n$-th active PA, $P_{PA}$, can be written as where $\alpha$ is the power-added efficiency (PAE) of the linear power amplifier (LPA), $P^{n}_{TX}$ is the transmit power of the $n$-th antenna, and BF should be replaced with HBF or FDP according to the chosen transmitter type. Based on the recent PA solutions listed in [40]-[42], we consider an average PAE of $\alpha = 36\%$.
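Because the IL values are quoted in dB but enter the power expressions in a linear scale, a small conversion sketch may help. The dictionary keys and function names are hypothetical; the numerical values are the ones stated in the text.

```python
def db_to_linear(loss_db: float) -> float:
    """Convert a power ratio from dB to a linear scale."""
    return 10.0 ** (loss_db / 10.0)


# Insertion losses in dB as stated in the text.
IL_DB = {
    "mixer": 6.4,
    "phase_shifter": 3.7,
    "combiner": 1.8,
    "switch": 1.1,
}

IL_LINEAR = {name: db_to_linear(value) for name, value in IL_DB.items()}


def cascade_linear(*losses_db: float) -> float:
    """Linear loss of passive components in series (dB values add up)."""
    total = 1.0
    for loss_db in losses_db:
        total *= db_to_linear(loss_db)
    return total


# A loss of 0 dB corresponds to a factor of 1, i.e. a component replaced
# by a wire.
no_loss = db_to_linear(0.0)
# Mixer, PS and combiner in series: 6.4 + 3.7 + 1.8 = 11.9 dB.
series_loss = cascade_linear(IL_DB["mixer"], IL_DB["phase_shifter"],
                             IL_DB["combiner"])
```

In a linear scale the four losses are roughly 4.37 (mixer), 2.34 (PS), 1.51 (combiner), and 1.29 (switch), and the series example above is about 15.5, meaning the signal power is divided by that factor before reaching the PA.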
2) Digital to Analog Converter: DACs are among the components with the largest power consumption in wireless applications. The power consumed by a DAC ($P_D$) is a linear function of the sampling frequency ($f_s$) and the figure of merit ($\mathrm{FoM}_D$) of the converter, and grows exponentially with the number of bits of resolution ($b_D$) as $P_D = \mathrm{FoM}_D \times f_s \times 2^{b_D}$ [43]. The sampling frequencies for ultra wideband applications are in the range of 0.5-1 GHz. It is shown in [43] that, in terms of the required signal-to-quantization noise ratio (SQNR), FDP requires 2 bits fewer than HBF. Therefore, we assume $b_D = 4$ for FDP and $b_D = 6$ for HBF. Moreover, based on [44], we consider $\mathrm{FoM}_D = 54.5$ fJ/conv.
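As a worked example of the DAC power, the sketch below assumes the usual form $P_D = \mathrm{FoM}_D \times f_s \times 2^{b_D}$, which matches the description above (linear in $f_s$ and $\mathrm{FoM}_D$, exponential in $b_D$). The sampling frequency of 1 GHz is an assumption taken from the upper end of the stated 0.5-1 GHz range, and the function name is hypothetical.

```python
def dac_power_watts(fom_joule_per_conv: float, f_s_hz: float,
                    bits: int) -> float:
    """DAC power under the assumed model FoM * f_s * 2^bits."""
    return fom_joule_per_conv * f_s_hz * (2 ** bits)


FOM_D = 54.5e-15  # 54.5 fJ per conversion step, as stated in the text
F_S = 1e9         # assumed sampling frequency of 1 GHz

p_dac_fdp = dac_power_watts(FOM_D, F_S, 4)  # b_D = 4 for FDP
p_dac_hbf = dac_power_watts(FOM_D, F_S, 6)  # b_D = 6 for HBF
```

Under these assumptions, one DAC draws about 0.87 mW in FDP and about 3.49 mW in HBF. The two extra bits make each HBF DAC four times more power-hungry, but HBF uses one DAC per RF chain while FDP uses one per antenna.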
3) Low Pass Filter in TX: The output of the DACs requires an analog LPF to reject spectral images and maintain out-of-band emission limits. For an $m'$-th order active LPF with cutoff frequency $f_c$, the $\mathrm{FoM}_L$ is the power consumed per pole per Hertz [45]. The power drawn by the LPF is given by $P_L = \mathrm{FoM}_L \times m' \times f_c$. Based on [45], we assume a first-order LPF with $f_c = 500$ MHz and $\mathrm{FoM}_L = 1.4$ mW/GHz. Furthermore, we define $P_{LO}$ as the power consumed by the mixer from the LO and we consider $P_{LO} = 10$ dBm [46].
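The LPF and LO figures can be turned into milliwatts with a few lines. The sketch assumes $P_L = \mathrm{FoM}_L \times m' \times f_c$, which follows from reading $\mathrm{FoM}_L$ as power per pole per Hertz; the function names are hypothetical.

```python
def lpf_power_mw(fom_mw_per_ghz: float, order: int, f_c_ghz: float) -> float:
    """LPF power under the assumed model FoM * (number of poles) * f_c."""
    return fom_mw_per_ghz * order * f_c_ghz


def dbm_to_mw(power_dbm: float) -> float:
    """Convert a power level from dBm to milliwatts."""
    return 10.0 ** (power_dbm / 10.0)


# First-order LPF, f_c = 500 MHz, FoM_L = 1.4 mW/GHz (values from the text).
p_lpf = lpf_power_mw(1.4, 1, 0.5)
# LO power of 10 dBm (value from the text).
p_lo = dbm_to_mw(10.0)
```

With these values the LPF draws 0.7 mW and the LO contributes 10 mW per RF chain, so the LO dominates the LPF in the per-chain power budget.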
4) Total Energy Consumption: Now, putting it all together, the total power consumed by a given beamforming structure can be written as follows: where $P_{PA,BF}$ should be replaced with either $P_{PA,HBF}$ or $P_{PA,FDP}$ according to the transmitter type and $\Omega \in \{\Omega_{FD}, \Omega_{HB}\}$. In this paper, we focus on passive PSs, but we note that active PSs can easily be considered in the model by setting $\mathrm{IL}_\Phi$ to 1 and adding the power consumption of all active PSs to (18). The energy consumption $E_{BF}$ can then be obtained as $E_{BF} = T_s \times P_{BF}$, where $T_s$ is the duration of a symbol. When considering a fixed symbol duration, minimizing the power consumption is equivalent to minimizing the energy. Therefore, we evaluate the EE in b/s/Hz/W. It is interesting to see that, based on (15), when considering passive PSs and combiners, the power consumed by the different HBF structures is similar, since the IL of the passive components is applied to the low-power signals before the PAs. However, in terms of hardware complexity and cost, shown in Table II, the subarray HBF is more efficient than FC-HBF.
In (6) and (18), we observe that both the SE and the EE are influenced by the matrix $\Omega$. This matrix defines the connections between the RF chains and the antennas. Having more connections results in a higher SE as it increases beamforming flexibility. However, each connection corresponds to the use of an RF chain in FDP, and in the case of HBF, it involves a PS and a combiner, leading to increased costs and energy consumption. This dependency makes the optimization problem in (13) difficult to solve. Consequently, to address this issue, we propose a novel unsupervised learning solution in the following sections. This approach aims to jointly optimize both SE and EE.

IV. ENERGY-EFFICIENT BEAMFORMING DRIVEN BY DEEP UNSUPERVISED LEARNING
In this section, we describe the unsupervised learning solution to design the antenna selection and efficient HBF as well as FDP. We start by describing the architecture of the proposed DNN in Section IV-A. Then, the proposed method is divided into two phases: the training phase is described in Section IV-B, and the online phase is described in Section IV-C.

A. Deep Neural Network Architecture
The input and the hidden layers of the proposed DNN architecture are common for both the FDP and the HBF structures. However, the output layers are different for each BF structure. We start by describing the architecture of the input and the hidden layers, denoted as the DNN core, as shown in Figure 4. Then, in the following subsections, we describe the architectures of the output layers of the HBF and the FDP.
The input of the DNN is given by the imperfect channel matrix $\hat{H}$ given in (14). To improve the representation learning, we normalize the input to $\bar{H} = \hat{H}/\|\hat{H}\|_F$ such that $\|\bar{H}\|_F^2 = 1$. Then, we separate the real part $\Re\{\bar{H}\}$ and the imaginary part $\Im\{\bar{H}\}$ of $\bar{H}$ into two channels that are fed to the first convolutional layer (CL). The DNN core consists of 2 CLs $16@N_T \times N_U$, where 16 is the number of channels and $N_T \times N_U$ is the dimension of each channel, followed by 1 CL $8@N_T \times N_U$. The kernel size is $3 \times 3$ for all CLs. The CLs are followed by 2 fully-connected layers (FLs), each with 1024 neurons. The "Leaky ReLU" activation function and batch normalization are used after all layers except for the output layers. This DNN core is then combined with different output layers to form the HBF model, called efficient HBF network (E-HBF-Net), and the FDP model, called efficient FDP network (E-FDP-Net). The models are relatively small. For example, for $N_T = 64$ and $N_U = 8$, the total number of parameters including the output layers is 5.8M in E-HBF-Net (with $N_{RF} = 8$) and 4.9M in E-FDP-Net. A detailed complexity analysis is presented in Section V-D.
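The input preparation can be sketched in plain Python. The sketch assumes the normalization divides by the Frobenius norm, so that the stated unit-norm property $\|\bar{H}\|_F^2 = 1$ holds, and then stacks the real and imaginary parts as two input channels. The function names are hypothetical, and nested lists stand in for the tensors that a DL framework would use.

```python
import math


def normalize_channel(h_hat: list) -> list:
    """Scale a complex matrix so that its squared Frobenius norm equals 1."""
    fro = math.sqrt(sum(abs(x) ** 2 for row in h_hat for x in row))
    return [[x / fro for x in row] for row in h_hat]


def split_real_imag(h: list) -> list:
    """Return [real_part, imag_part], i.e. a 2-channel real-valued input."""
    real = [[x.real for x in row] for row in h]
    imag = [[x.imag for x in row] for row in h]
    return [real, imag]


h_hat = [[3 + 4j, 0j],
         [0j, 0j]]
h_bar = normalize_channel(h_hat)
dnn_input = split_real_imag(h_bar)
squared_norm = sum(abs(x) ** 2 for row in h_bar for x in row)
```

In this toy example the only non-zero entry becomes $0.6 + 0.8j$, the squared norm is 1, and `dnn_input` has shape 2 x 2 x 2 (channels, rows, columns), mirroring the two-channel input of the first CL.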
1) Output Layers for HBF: As shown in Figure 5 (a), we divide the output of the last FL into 4 parallel layers. The first and second parallel layers, both of size $N_{RF} \times N_U$, generate the real and imaginary parts of the DP. The output of the third parallel layer generates the AP, thus its dimension is $N_{RF} \times N_T$. The output of the AP can also be adapted to the PS resolution. It is shown in [28] that, using the straight-through estimator (STE) technique, we are able to have different numbers of quantization bits for the PSs. In this paper, we again consider the same approach for the output of the DNN dedicated to PS quantization in the AP. The fourth layer, of size $N_{RF} \times N_T$, designs the matrix $\Omega_{HB}$.
As we described before, $\Omega_{HB}$ must be a binary matrix. Typically, this binary constraint requires using the "Sigmoid" function during training and then, during the online phase, applying a rounding technique to transform the real values into binary values. However, we found that this approach does not lead to good results for unsupervised learning, because the SE measured during training can be very different from the actual SE measured during testing. To solve this problem, we propose to use a differentiable approximation during training, called "Gumbel-Sigmoid", inspired by the "Gumbel-Softmax" estimator [47]. The Gumbel-Softmax approximation is a technique that allows sampling from a categorical distribution during the forward pass of a neural network, by combining a re-parameterization trick and a smooth relaxation. The connection between the RF chains and the antennas can be represented using a categorical binary distribution. Hence, defining $\pi_{n,m}$ as the probability that antenna $n$ is connected to the RF chain $m$, we can form an $N_T \times N_{RF}$ matrix $\Pi$ that corresponds to the probability states between antenna $n$ and the RF chain $m$. The Gumbel-Sigmoid function, $G(\Pi)$, applied to each element of the matrix $\Pi$ can then be defined as where $\hat{\Omega}_{HB}$ is the output of the DNN, and $g$ and $g'$ are independent samples with zero mean and unit variance, drawn from the Gumbel distribution. Note that the $\exp(\cdot)$ and $\log(\cdot)$ functions are applied element-wise when taking a matrix as input. The parameter $\tau$ is called the "Gumbel temperature". When $\tau \to 0$, $G(\Pi)$ tends to the categorical distribution, but when $\tau \to \infty$, it converges to the uniform distribution [47]. Therefore, there is a trade-off between small temperatures, where sample vectors are close to one-hot but the variance of the gradient is large, and large temperatures, where samples are more uniform but the variance of the gradient is small. We thus consider $\tau$ as a hyper-parameter to be optimized in our implementation.
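The relaxation can be illustrated with a scalar sketch of the standard two-class Gumbel-Softmax, which is one common way to write a Gumbel-Sigmoid: $G(\pi) = \exp((\log \pi + g)/\tau) / [\exp((\log \pi + g)/\tau) + \exp((\log(1-\pi) + g')/\tau)]$. This form, the use of standard Gumbel(0, 1) noise, and the function names are assumptions for illustration and may differ in detail from the definition used in this paper. A training implementation would use a DL framework so that gradients flow through the sample.

```python
import math
import random


def sample_gumbel(rng: random.Random) -> float:
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
    return -math.log(-math.log(u))


def gumbel_sigmoid(pi: float, tau: float, rng: random.Random) -> float:
    """Relaxed binary sample in [0, 1] for a connection probability pi."""
    p = min(max(pi, 1e-12), 1.0 - 1e-12)
    a = (math.log(p) + sample_gumbel(rng)) / tau
    b = (math.log(1.0 - p) + sample_gumbel(rng)) / tau
    # exp(a) / (exp(a) + exp(b)) written as a numerically stable logistic.
    z = a - b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


rng = random.Random(0)
# Low temperature: samples are pushed towards 0 or 1.
cold = [gumbel_sigmoid(0.9, 0.05, rng) for _ in range(5000)]
# High temperature: samples concentrate around 0.5.
hot = [gumbel_sigmoid(0.9, 50.0, rng) for _ in range(5000)]
fraction_on = sum(v > 0.5 for v in cold) / len(cold)
```

For a connection probability of 0.9, about 90% of the samples exceed 0.5 regardless of the temperature, while the temperature controls how close each sample is to a hard 0/1 decision. This is the trade-off between near-binary samples and gradient variance discussed above.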
2) Output Layers for FDP: The proposed architecture for FDP is shown in Figure 5 (b). We divide the output layer into three parallel layers. The first two layers are dedicated to the real and imaginary parts of the FDP, with dimension $N_T \times N_U$. The third layer, similar to the one for HBF, designs the antenna selection vector ($\omega$) described in Section II-A1. Here again, we use the Gumbel-Sigmoid described in (19) to obtain the binary variables of $\omega$. Let $\pi'_n$ denote the probability of activating the $n$-th antenna and $\pi' = [\pi'_1, \ldots, \pi'_{N_T}]^T$. Then, we have $\hat{\Omega}_{\mathrm{FD}} = \mathrm{diag}(\hat{\omega})$, where $\hat{\omega} = G(\pi')$.

B. Training Phase: Unsupervised Learning
In the training phase, thanks to unsupervised learning, the data samples consist only of imperfect channel matrices, without the need for labels. The imperfect channel ($\hat{H}$) is modeled as in (14) and includes a coefficient $\beta$ that determines the magnitude of the estimation error, which allows us to study the impact of the channel estimation error on the DNN training.
Although the approach to train the DNN is similar for E-HBF-Net and E-FDP-Net, their hardware configurations differ. Therefore, we first present the aspects shared by both DNN models and then explain the parts specific to each model.
The objective of the proposed solutions is to design the beamforming configuration so as to not only maximize the SE but also minimize the EC, while adapting to the number of active users. That is, when the number of active users is small, the DNN intelligently turns off part of the antennas, since they are no longer needed, and thereby reduces the EC. To achieve this objective, we design an unsupervised loss function to train the DNN that is composed of two terms: the first term is related to the EC, and the second term, which is related to both the SE and the number of active users, is called the adaptive antenna selection (AAS) term. The hyperparameter $\gamma$ is required to achieve proper training convergence and should be tuned in the training phase. Each term of the loss function is described in detail in the sequel.

EC term ($\mathcal{L}_{\mathrm{EC}}$): This term adds a penalty to the total loss function in order to reduce the EC. It is a function of $\hat{P}_{\mathrm{BF}}$, the total power consumption for either HBF ($\hat{P}_{\mathrm{HBF}}$) or FDP ($\hat{P}_{\mathrm{FDP}}$) given in (18) and discussed in Section III, which depends on $\hat{\Omega}_{\mathrm{BF}} \in \{\hat{\Omega}_{\mathrm{FD}}, \hat{\Omega}_{\mathrm{HB}}\}$. Thus, $\hat{\Omega}_{\mathrm{BF}}$ affects both the SE and the EC.

AAS term ($\mathcal{L}_{\mathrm{AAS}}$): This term depends on $R_d$ and $R$, where, as discussed in Section II-B, the parameter $R_d$ denotes the desired average SE value for each user, and $R$ is either $\hat{R}_{\mathrm{HBF}}(\bar{A}, W)$ for HBF or $\hat{R}_{\mathrm{FDP}}(U \times \hat{\Omega}_{\mathrm{FD}})$ for FDP. Thanks to the AAS term, the SE is forced to approach $R_d$, while part of the antennas can be turned off to reduce the EC (according to the EC term $\mathcal{L}_{\mathrm{EC}}$). As a result, the AAS term ensures that the minimum power is consumed to satisfy the desired average SE ($R_d$).

1) Efficient Hybrid Beamforming Network (E-HBF-Net): To design an efficient HBF structure, each of the $N_T \times N_{\mathrm{RF}}$ connections is made programmable in order to find the connection matrix ($\Omega_{\mathrm{HB}}$) that maximizes the EE. As shown in the "Training Phase" of Figure 6, the proposed DNN for HBF, E-HBF-Net, jointly designs the DP ($W = \Re[W] + i\Im[W]$), the PSs ($\hat{P}_q$), and the connection matrix ($\hat{\Omega}_{\mathrm{HB}}$) by employing the proposed "Gumbel-Sigmoid" function as in (19).
Obtaining $\hat{P}_{\mathrm{HBF}}$ requires computing (18), and thus we first need the power consumed by the PAs. Based on (17), this in turn requires the input and output power of the PAs. The output power of PAs $1, \ldots, N_T$ is collected in the vector $\hat{\mathbf{p}}_{\mathrm{TX}} = [\hat{P}^{1}_{\mathrm{TX}}, \ldots, \hat{P}^{N_T}_{\mathrm{TX}}]^T$ given by (23), where $\bar{A}$ and $W = [w_1, \ldots, w_{N_U}]$ are the AP and DP outputs designed by the proposed DNN. Due to the total power constraint assumed at the BS in (13c), the power must be normalized. To respect the inequality of the power constraint, we introduce a new power threshold $\hat{P}_{\mathrm{TX}}$ that is a function of the connection matrix. With this threshold, the maximum transmitted power is limited to $P_{\mathrm{TX}}$ when all connections are established ($[\hat{\Omega}_{\mathrm{HB}}]_{n,m} = 1\ \forall n, m$), while reducing the number of connections reduces the transmit power. After power normalization, we can obtain the input power of the PAs according to (15). However, the DNN loss function cannot contain a sum over a dynamic set as defined in (15). Therefore, we reformulate (15) without the dynamic sum. In the reformulation, $\hat{\mathbf{p}}^{\mathrm{in}}_{\mathrm{PA,HBF}} = [\hat{P}^{\mathrm{in},1}_{\mathrm{PA,HBF}}, \ldots, \hat{P}^{\mathrm{in},N_T}_{\mathrm{PA,HBF}}]$ is the vector of input powers of the PAs, $\hat{\Omega}_{\mathrm{HB}}$ is the output of the Gumbel-Sigmoid function for HBF, and $\mathbf{1}_N$ denotes the all-one column vector of size $N$. According to (17) and (23)-(25), we can then obtain $P^{\mathrm{DC},n}_{\mathrm{PA,HBF}}$.

To compute the power consumption of all activated RF chains as in (18), we need to determine the number of activated RF chains, i.e., $N_{\mathrm{RF}}(\hat{\Omega}_{\mathrm{HB}})$. However, finding $N_{\mathrm{RF}}(\hat{\Omega}_{\mathrm{HB}})$ again requires a summation over a dynamic set, which is not appropriate for the loss function. As a consequence, we use an alternative linear algebra formulation. First, we compute the expectation over all antennas of each RF chain as $\hat{\Omega}^T_{\mathrm{HB}} \mathbf{1}_{N_T} / N_T$. Then, we derive the expected number of activated RF chains from this vector.

Algorithm 1: Efficient HBF (E-HBF-Net)
2) Fully Digital Precoder (E-FDP-Net): E-FDP-Net provides the precoder $U = \Re[U] + i\Im[U]$ and the vector $\hat{\omega}$ for antenna selection, where $\hat{\Omega}_{\mathrm{FD}} = \mathrm{diag}(\hat{\omega})$. To evaluate the first term of the loss function detailed in (20), the total power consumption of FDP ($\hat{P}_{\mathrm{FDP}}$) is required. Computing $\hat{P}_{\mathrm{FDP}}$ for E-FDP-Net is simpler than for HBF because in FDP each antenna is connected to one RF chain. Consequently, the input power of each PA is simply $P^{\mathrm{out}}_{\mathrm{BB}}$. Similar to HBF, to respect the power constraint for FDP, $\|\bar{U}\|_F^2 \le P_{\mathrm{TX}}$, the output power should be a function of $\hat{\omega} = [\hat{\omega}_1, \ldots, \hat{\omega}_{N_T}]$. As a consequence, the output power of the $n$-th antenna is given by (28), and the power consumed by the PAs then follows from (17).
Finally, the power consumed by the active RF chains is also simple to compute because the number of active RF chains is given by $\hat{N}_{\mathrm{RF}}(\hat{\Omega}_{\mathrm{FD}}) = \sum_{n=1}^{N_T} \hat{\omega}_n$.

C. Online Phase: Transmitting Data
Once the DNN has been trained, the online phase can start, as shown in Figure 6 (right). In the online phase, the only input of the DNN is the imperfect channel matrix $\hat{H}$. As in the training phase, the AP ($P_q$) and the DP ($W$) for HBF, and the precoder ($U$) for FDP, can be used as output by the DNN without any further processing. This is not the case for the connection matrix $\Omega$. Since the connection matrix ($\hat{\Omega}_{\mathrm{HB}}$ in HBF or $\hat{\omega}$ in FDP) should be binary, it requires binary quantization once it is output by the DNN in the online phase. To do so, we apply the element-wise rounding function ($\lfloor \cdot \rceil$) to each element of the connection matrix: $\Omega_{\mathrm{HB}} = \lfloor \hat{\Omega}_{\mathrm{HB}} \rceil$ for HBF, and $\omega = \lfloor \hat{\omega} \rceil$ with $\Omega_{\mathrm{FD}} = \mathrm{diag}(\omega)$ for FDP. The output power of the $n$-th antenna for E-HBF-Net is given by the $n$-th element of the power vector defined in (23), while for E-FDP-Net it is given by (28). The two proposed DNN solutions, E-HBF-Net and E-FDP-Net, are described in Algorithm 1 and Algorithm 2, respectively.
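The binarization step of the online phase can be sketched in a few lines of Python. The sketch assumes that the HBF connection matrix is stored with one row per RF chain and one column per antenna (the orientation used in the description of Figure 11), and it uses a 0.5 threshold as the rounding rule. The function names and the numerical values are hypothetical.

```python
def binarize(values):
    """Round relaxed entries in [0, 1] to {0, 1} using a 0.5 threshold."""
    return [[1 if v >= 0.5 else 0 for v in row] for row in values]


def count_active_rf_chains(omega):
    """An RF chain (row) is active if it is connected to at least one antenna."""
    return sum(1 for row in omega if any(row))


def count_active_antennas(omega):
    """An antenna (column) is active if it is connected to at least one RF chain."""
    return sum(1 for col in zip(*omega) if any(col))


# Illustrative relaxed DNN output for 2 RF chains and 3 antennas (not from the paper).
omega_hat_hb = [[0.93, 0.08, 0.81],
                [0.12, 0.30, 0.04]]
omega_hb = binarize(omega_hat_hb)

# FDP case: a selection vector, where each active antenna uses one RF chain.
omega_hat_fd = [0.97, 0.02, 0.64, 0.40]
omega_fd = binarize([omega_hat_fd])[0]
active_rf_fdp = sum(omega_fd)
```

In the HBF example, the second RF chain has no remaining connection after rounding and can therefore be switched off, while in the FDP example the number of active RF chains equals the number of selected antennas.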

V. PERFORMANCE EVALUATION
In this section, the performance of the proposed DNNs, implemented using the PyTorch deep learning library, is numerically evaluated. The scenario "O1-28 GHz" of the DeepMIMO channel model [31] is employed to generate the unlabeled dataset (the channel coefficients $h_u$ for user $u$) for training and testing. In the DeepMIMO dataset [31], realistic channel information is generated by applying ray-tracing methods to a three-dimensional model of an urban environment in order to capture geometry-based characteristics, such as the correlation between the channels at different locations and the dependence on the materials of the various environmental elements, among others. The parameters used to generate the DeepMIMO dataset are shown in Table IV, where the channel model parameters active_user_first and active_user_last are set to 1100 and 2200, respectively. The BS is equipped with $N_T = 64$ antennas and $N_{\mathrm{RF}} = 8$ RF chains with PSs, and serves $N_U = 4$ users randomly located in a dedicated area (S1 in Figure 7). Scenario "O1" consists of several user locations randomly placed in two streets surrounded by buildings. These two streets are orthogonal and intersect in the middle of the considered area. The size of the DNN dataset is set to $2 \times 10^6$ samples, with 85% of the samples used for the training set and the remainder used to evaluate the performance. We used "AdamW" as the DNN training optimizer. The hyper-parameters used in our DNN model are listed in Table V. In addition, the hyper-parameter $\tau$, known as the Gumbel-Sigmoid temperature, is set to 0.1.

A. Spectral Efficiency and Power Consumption Analysis
We first verify the maximum SE that can be achieved by the proposed DNNs when they are trained without considering their power consumption, and compare them with the baseline solutions presented in Section II-A. This maximal SE is shown in Figure 8 for varying noise power. Taking into account channel attenuation, the average signal-to-noise ratio (SNR) ranges from −7.8 dB to 22.2 dB. To obtain the maximum SE, we set $\gamma = 0$ so that the loss function for E-HBF-Net and E-FDP-Net in (21) depends only on $\mathcal{L}_{\mathrm{AAS}}$, and we set $R_d = 15$ to have no constraint on the SE. On the one hand, the proposed E-FDP-Net gives a close-to-optimal performance. On the other hand, E-HBF-Net outperforms the other conventional solutions and is very close to the E-FDP-Net performance. In the low-noise regime, the SE of all solutions continues to increase. However, both E-HBF-Net and E-FDP-Net outperform the conventional non-DL solutions in high SNR regimes.
In Figure 9, we compare the power consumption of different BF hardware configurations at a given SE. It is shown that by adjusting $R_d$ for E-FDP-Net and E-HBF-Net, different SE and power consumption trade-offs can be obtained, where for each proposed technique we set $R_d \in \{1, 3, 5, 6, 8\}$. To cover a range of SE values, we also adjust the transmitted power for the conventional methods by setting $P_{\mathrm{TX}} \in \{0.1, 1, 10\}$ W. We see that the optimal FDP and the proposed E-FDP-Net with $R_d = 8$ achieve the best SE. However, they also consume the most power because they require activating all $N_T$ RF chains.
In this figure, we see that when the desired SE parameter $R_d$ is reduced, both E-FDP-Net and E-HBF-Net are able to reduce their power consumption. For example, when $R_d$ is decreased from 8 to 5 bits/s/Hz/user, the consumed power of both E-FDP-Net and E-HBF-Net is reduced significantly (64% less for E-FDP-Net and 68% less for E-HBF-Net). By decreasing $R_d$ further, both the power consumption and the SE continue to decrease. Furthermore, we see that E-FDP-Net and E-HBF-Net achieve much better energy efficiency than the baseline approaches. For example, when $R_d = 6$, E-HBF-Net achieves an SE similar to that of FC-HBF solved with MO-AltMin, while consuming almost 1.7 times less power. Further, the baseline solutions exhibit a power floor, shown by red lines in the figure, that corresponds to the power consumed by the RF chains. When the transmit power of O-FDP and FC-HBF is decreased to $P_{\mathrm{TX}} = 1$ W and $P_{\mathrm{TX}} = 0.1$ W, the SE is degraded due to the lower transmit power. However, each beamforming technique still incurs a constant power consumption due to the operation of its RF chains. In contrast, E-FDP-Net and E-HBF-Net are able to reduce their power consumption below these floors by adaptively turning off their RF chains.
To illustrate how many antennas are activated by E-FDP-Net, we plot in Figure 10 the connection matrix $\hat{\Omega}_{\mathrm{FD}}$ for one sample of the test set, for different values of $R_d$, where a blue square represents the value 1 and a white square represents the value 0. It can be seen that large values of $R_d$ lead to more active antennas (and thus more active RF chains), and thus to a higher power consumption. In Figure 11, we show the average value of $\hat{\Omega}_{\mathrm{HB}}$ over the inputs, for different values of $R_d$. When decreasing $R_d$, the number of active antennas (non-zero columns) remains constant, while the number of active RF chains (non-zero rows) is reduced. This is because the power consumption of an antenna depends on its transmit power, which can be adjusted, whereas RF chains consume a fixed amount of power and must be turned off to save power. It is interesting to see that with a lower value of $R_d$, E-HBF-Net designs the connection matrix such that a small number of RF chains are activated, each connected to several antennas, which helps to increase the spatial multiplexing gain and degrees of freedom. Finally, we see that as $R_d$ increases, more antennas and more RF chains are activated, and thus more power is used.

Figure 12 presents the EE versus SE comparison for the proposed E-FDP-Net and E-HBF-Net, for varying values of $R_d$. Notably, as the SE decreases, E-HBF-Net demonstrates superior EE performance compared to E-FDP-Net. This outcome is attributed to the behavior of E-HBF-Net at lower SE values, where it intelligently deactivates RF chains while keeping multiple antennas active. Conversely, in E-FDP-Net, turning off an RF chain also turns off the associated antenna. Consequently, E-HBF-Net excels in conserving energy while simultaneously offering enhanced SE due to its higher flexibility. Furthermore, as the SE increases, E-HBF-Net maintains its efficiency advantage over E-FDP-Net, although the performance gap between the two approaches diminishes.

B. Varying the Number of Users
To show the impact of antenna and RF chain selection when varying the number of active users, we present Figure 13 for $R_d = 3$ and $R_d = 5$, where the left-side sub-plots present E-FDP-Net and the right-side ones show E-HBF-Net. To improve the presentation, we use the normalized number of active RF chains ($N_{\mathrm{RF}}(\Omega)/N_{\mathrm{RF}}$), which in the case of FDP is equal to the number of active antennas. In the proposed solutions, we see that when the number of active users increases, the DNN not only activates more RF chains but also increases the transmitted power to meet $R_d$. Moreover, when $R_d$ is small, the DNN requires a smaller number of active RF chains while minimizing the transmitted power, thus lowering the power consumption and consequently increasing the EE. Figure 13 shows that the proposed DNN approaches are adaptive to the number of active users in the network. That is, depending on the scenario, the DNN designs the beamforming structures to adapt to the varying number of users. For instance, in a high-traffic scenario, when the number of active users is large, the DNN activates more antennas and RF chains to meet the average SE. On the other hand, in a low-traffic scenario, when the number of users is low, the DNN has no need to activate a large number of antennas and RF chains, and thus can significantly increase its EE. Finally, we notice that the power consumption can be adjusted by controlling the value of $R_d$, which depends on the application and the objective of the service provider.

C. Training with Imperfect CSI
Unlike other studies that assume perfect CSI for DNN training, in this work we employ imperfect CSI not only for the input of the DNN but also for the computation of the loss function. The robustness of the proposed methods against imperfect CSI is evaluated and compared to other non-DL methods in Figure 14. Here we train the DNN with different values of $\beta \in \{0, 0.1, 0.2, 0.3, 0.4, 0.5\}$. It is clear that the SE performance decreases as the value of $\beta$ increases. In particular, when $\beta$ increases from 0 to 0.5, the SE performance of O-FDP degrades by 38%. For PE-AltMin, the degradation is around 25%, whereas it is around 27% for MO-AltMin. The lowest degradation in terms of SE performance is achieved by E-HBF-Net and E-FDP-Net (around 9% and 11%, respectively). Therefore, the proposed methods are more robust against estimation errors. Moreover, the red lines in Figure 14 show the ideal case of perfect CSI, when $\beta = 0$. It is interesting to see that for a small $\beta$ (i.e., 0.1) the SE performance did not degrade but instead slightly improved in the online phase. This is because training with imperfect CSI can act as a regularization technique, known as noise injection in the machine learning literature, and can thus improve the generalization of the DNN in the online phase [48].
In Figure 15, we present the convergence of the training of the proposed E-FDP-Net in terms of SE, power consumption, and EE, when $R_d = 3$ and $N_U = 4$. We see in the top subplot that the DNN quickly learns to design the connection matrix and the FDP to obtain an SE of $N_U R_d = 12$, i.e., after a few epochs, the achieved SE for each user is around $R_d$. Then, while the SE target is respected, the DNN learns to gradually reduce the power consumption by turning off some RF chains until it achieves the minimum power consumption, as shown in the middle subplot.

D. Computational Complexity Analysis
To evaluate the computational complexity of the proposed DNNs, we derive the analytical expression of the number of real multiplications (RMs) and compare it with other approaches. We assume that one complex multiplication (CM) corresponds to 4 RMs and that one complex division corresponds to 8 RMs (assuming that one real division is equivalent to 1 RM). Only the matrix multiplications and inversions are taken into consideration; the other operations are considered negligible. Multiplying a complex matrix of size $N \times P$ by a complex matrix of size $P \times M$ requires $NMP$ CMs. To invert a square matrix of size $N$, around $N^3/3$ CMs are required if the Gaussian elimination algorithm is employed. Finally, we consider that the eigenvalues of a square matrix of size $N$ are obtained using the Cholesky decomposition [49], which requires approximately $4N^3$ RMs. O-FDP requires $4(2^{N_U} - 1)\left(2 N_U N_T^2 + N_U^2 N_T + \tfrac{1}{3} N_T^3\right)$ RMs, as described in [19].

For the considered scenario, we replicate the implementation of the state-of-the-art algorithms. We observe that the PE-AltMin algorithm typically converges within an average of $\ell_{\mathrm{PE}} = 15$ iterations. Given that computing the singular-value decomposition of a $p \times q$ matrix requires approximately $4p^2 q + 22q^3$ RMs, the total number of RMs required by PE-AltMin is $\ell_{\mathrm{PE}}\left(8 N_{\mathrm{RF}} N_U (N_T + N_U) + 22 N_{\mathrm{RF}}^3\right)$. MO-AltMin has a much higher complexity than PE-AltMin [9]. MO-AltMin is composed of a main loop that computes the DP and of an inner loop that applies the "Conjugate Gradient" algorithm to find the HBF. In the main loop, computing the DP requires $4 N_T N_U N_{\mathrm{RF}}$ RMs, while in the inner loop, the Kronecker product of an $N_{\mathrm{RF}} \times N_T$ matrix with an $N_U \times N_T$ matrix is computed, which requires $4 N_T^2 N_U N_{\mathrm{RF}}$ RMs. In the considered scenario, the outer loop is repeated $\ell_{\mathrm{MO}} = 2$ times while the inner loop is repeated $\ell' = 30$ times, so the total number of RMs used by MO-AltMin is $4 \ell_{\mathrm{MO}} N_T N_U N_{\mathrm{RF}} (1 + \ell' N_T)$. To design the HBF, both PE-AltMin and MO-AltMin require designing the FDP as discussed in (10); thus, the complexity of obtaining the FDP should be added to the complexity of PE-AltMin and MO-AltMin.
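The expressions above for PE-AltMin and MO-AltMin can be evaluated directly. The short Python sketch below does so for the simulated configuration ($N_T = 64$, $N_{\mathrm{RF}} = 8$, $N_U = 4$) with the reported iteration counts. The function names are hypothetical, and the counts exclude the cost of the FDP design that both algorithms additionally require, so they need not match the values reported in Table VII.

```python
def rm_pe_altmin(n_t, n_rf, n_u, iterations=15):
    """Real multiplications of PE-AltMin, excluding the FDP design cost."""
    per_iteration = 8 * n_rf * n_u * (n_t + n_u) + 22 * n_rf ** 3
    return iterations * per_iteration


def rm_mo_altmin(n_t, n_rf, n_u, outer=2, inner=30):
    """Real multiplications of MO-AltMin, excluding the FDP design cost."""
    # Each outer iteration computes the DP once and runs `inner` inner iterations.
    return 4 * outer * n_t * n_u * n_rf * (1 + inner * n_t)


N_T, N_RF, N_U = 64, 8, 4  # configuration used in the simulations
pe_rms = rm_pe_altmin(N_T, N_RF, N_U)
mo_rms = rm_mo_altmin(N_T, N_RF, N_U)
ratio = mo_rms / pe_rms  # how much more costly MO-AltMin is under these counts
```

Under these counts, the inner conjugate-gradient loop dominates the cost of MO-AltMin because its per-iteration cost grows with $N_T^2$.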
On the other hand, to evaluate the computational complexity of the DNN approaches, we need to compute the number of parameters of the DNN architectures. Both DNN architectures, E-HBF-Net and E-FDP-Net, have the same DNN core, but their output layers differ because of their different output dimensions. The number of RMs in the DNN core is calculated for each layer separately and then summed. The widths of the $l$-th FC and CL are denoted by $f_l$ and $c_l$, respectively. The number of multiplications required for the DNN core is $M(\mathrm{DNN}_{\mathrm{core}}) = (2c_1 + c_1 c_2 + c_2 c_3 + c_3 f_1/\kappa^2) N_T N_U \kappa^2 + f_1 f_2$, where $\kappa$ is the kernel size, i.e., $\kappa = 3$ [19]. Since E-HBF-Net has four output layers (one layer for the AP, two layers for the DP, and one layer for the connection matrix), its total number of multiplications is $M(\mathrm{DNN}_{\mathrm{core}}) + f_2 (N_T N_{\mathrm{RF}} + 2 N_U N_{\mathrm{RF}} + N_T N_{\mathrm{RF}})$. Likewise, for E-FDP-Net, the total number of multiplications is $M(\mathrm{DNN}_{\mathrm{core}}) + f_2 (2 N_U N_T + N_T)$. Examples of the numerical values of these analytical expressions are shown in Table VII. It can be seen that for HBF transmitters, E-HBF-Net reduces the complexity by 38% compared to the least complex conventional approach (PE-AltMin), while for FDP transmitters, O-FDP is 1.5 times more complex than E-FDP-Net.
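The multiplication counts of the two networks can be computed with the following Python sketch, which transcribes the expressions above. The layer widths used in the example call are placeholders chosen only for illustration (the actual values are those of Table V, which is not reproduced here), and the function names are hypothetical.

```python
def dnn_core_mults(c1, c2, c3, f1, f2, n_t, n_u, kappa=3):
    """Multiplications in the shared DNN core.

    The term (c3 * f1 / kappa**2) * n_t * n_u * kappa**2 is written as
    c3 * f1 * n_t * n_u so that the result stays an integer.
    """
    conv_part = (2 * c1 + c1 * c2 + c2 * c3) * n_t * n_u * kappa ** 2
    flatten_to_fc = c3 * f1 * n_t * n_u
    return conv_part + flatten_to_fc + f1 * f2


def e_hbf_net_mults(core, f2, n_t, n_rf, n_u):
    """Core plus the AP, the two DP layers, and the connection-matrix layer."""
    return core + f2 * (n_t * n_rf + 2 * n_u * n_rf + n_t * n_rf)


def e_fdp_net_mults(core, f2, n_t, n_u):
    """Core plus the two precoder layers and the antenna-selection layer."""
    return core + f2 * (2 * n_u * n_t + n_t)


# Placeholder widths, NOT the values used in the paper.
widths = dict(c1=8, c2=8, c3=8, f1=256, f2=256)
core = dnn_core_mults(n_t=64, n_u=4, **widths)
hbf_total = e_hbf_net_mults(core, widths["f2"], n_t=64, n_rf=8, n_u=4)
fdp_total = e_fdp_net_mults(core, widths["f2"], n_t=64, n_u=4)
```

Because both networks share the same core, the difference between them comes only from the output layers, which scale with $N_T N_{\mathrm{RF}}$ for E-HBF-Net and with $N_T N_U$ for E-FDP-Net.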

VI. CONCLUSION
In this paper, we studied the problem of antenna selection and beamforming design in a massive multiple-input multiple-output (mMIMO) system with the objective of maximizing energy efficiency (EE). First, we derived an accurate energy model for the mMIMO system. Our proposed energy model takes into account the transmit power as well as the power consumed by the hardware, by considering the insertion loss and the direct power consumption of different components such as the combiners and the power amplifiers. Next, based on our energy model, we designed unsupervised deep learning approaches to intelligently and adaptively select the BF structures and the transmitting antennas. Specifically, we proposed two deep neural network models, called E-HBF-Net and E-FDP-Net, for hybrid BF and for fully digital precoding, respectively. Both DNNs optimize the EE of the mMIMO system by intelligently selecting the transmitting antennas and choosing the precoding matrices for HBF and FDP, which allows them to achieve significantly better EE than conventional solutions. Simulation results confirm that the proposed DNNs can adapt to the number of active users and that they provide different trade-offs between SE and EC that can be controlled by tuning a hyper-parameter. Furthermore, we showed that the DNN models can be trained exclusively using imperfect channel state information (CSI), i.e., the imperfect CSI was used as input to our DNN models as well as to compute the loss function during training.

Fig. 1. Massive MIMO system model structure with one transmitter BS employing HBF to serve a set of users.

Fig. 6. Training (left) and online (right) phases for efficient BF. The outputs of the DNN depend on the BF structure (HBF, FDP).

Fig. 9. Power required to achieve a given SE for the various transmitter configurations. Idle $P_{\mathrm{BF}}$ is the power consumed by the BF structure when $P_{\mathrm{TX}} = 0$. The parameters are set to: $N_U = 4$, $N_T = 64$, $N_{\mathrm{RF}} = 8$, and $\sigma^2 = -130$ dBm.

Fig. 10. The connection matrix $\hat{\Omega}_{\mathrm{FD}} = \mathrm{diag}(\hat{\omega})$ of E-FDP-Net for one input sample, and for different values of hyper-parameter $R_d$, where a blue square represents the value 1 and a white one represents the value 0. System parameters are set to: $N_U = 4$, $N_T = 64$, and $\sigma^2 = -130$ dBm.

Fig. 11. The average value of the connection matrix $\hat{\Omega}_{\mathrm{HB}}$ of E-HBF-Net for different values of hyper-parameter $R_d$, where the shade of each square represents the range of values from 0 (light) to 1 (dark). System parameters are set to: $N_U = 4$, $N_T = 64$, $N_{\mathrm{RF}} = 8$, and $\sigma^2 = -130$ dBm.

Fig. 13. The number of active RF chains, EE, and power consumption of the proposed E-FDP-Net (left sub-plots) and E-HBF-Net (right sub-plots) versus different numbers of users. System parameters are set to: $N_{\mathrm{RF}} = 8$, $N_T = 64$, and $\sigma^2 = -130$ dBm.