Joint Client Selection and Receive Beamforming for Over-the-Air Federated Learning With Energy Harvesting

Federated learning (FL) is a well-regarded distributed machine learning technology that leverages local computing resources while protecting privacy. The over-the-air (OTA) computation has been adopted for FL to prevent excessive consumption of communication resources by employing the superposition nature of wireless waveform. Meanwhile, energy harvesting technology can relieve the energy constraint of clients and enable durable computation for FL. However, few of the existing works on OTA FL have considered jointly performing client selection and receive beamforming optimization with energy harvesting clients. The objective of this work is to address this issue to improve the learning performance of OTA FL. Specifically, we first derive the expression of the optimality gap regarding client selection and receive beamforming design. Then, to minimize the optimality gap, a mixed-integer nonlinear programming (MINLP) problem is formulated and decomposed into two sub-problems. Next, the semidefinite relaxation method and the channel-energy-data (CED)-based method are developed to optimize the receive beamforming sub-problem and client selection sub-problem iteratively. One alternative optimization method is proposed to deal with the decoupled sub-problems for obtaining the solutions to the original MINLP problem. Our simulation results demonstrate that the proposed solution is superior to the other comparison schemes in various parameter settings.


I. INTRODUCTION
Due to the increase in the number of deployed Internet of Things (IoT) devices, a great amount of data is continuously generated. Developers typically resort to deep learning techniques to extract meaningful information from these data. Moreover, with the development of hardware, the computing and storage capabilities of end devices, such as smartphones, smartwatches, and other intelligent IoT devices, have been significantly enhanced, which makes it possible to train models locally. Federated learning (FL) [2] has been proposed as an effective technique for distributed training on distributed IoT devices while preserving privacy.
Under the typical FL paradigm, clients collaborate to train a shared model with their local data and send the updates to the parameter server (PS). Then, the PS aggregates the received signals and broadcasts the averaged updates to the selected clients. The communication resources between clients and PS are usually constrained and the clients are required to interact with the PS multiple rounds during the training processes. As a result, the communication bottleneck becomes a significant problem that needs to be solved for FL. Some strategies have been proposed to impose communication efficiency for FL such as adjusting the number of local epochs [2] and client selection [3]. In [2], the communication cost can be reduced for FL by carrying out multiple local epochs at the client side before communication. In [3], the authors jointly optimize client selection and bandwidth allocation for FL to realize fast convergence and communication efficiency.
Different from the above works, which are implemented with digital transmission for FL, some studies [4], [5] adopt over-the-air (OTA) computation to reduce the communication cost for FL. Compared with digital transmission, clients can share the same wireless channel via analog transmission due to the superposition property of the multiple access channel. The communication overhead can be reduced for OTA FL by making full utilization of the spectrum resources when clients send the gradients to PS. In addition, the privacy leakage from the client to the PS can be avoided for OTA FL since the signals received at the PS side through OTA computation are aggregated signals. Recent research studies for OTA FL focus on client selection [6], power control [7], data heterogeneity [8], and energy constraint [9]. The authors in [6] conduct the convergence analysis for OTA FL and develop a client selection scheme when transmission power control is taken into consideration. In [7], the authors optimize the transmission power control to directly maximize the convergence speed of the OTA FL system with convergence analysis. In [8], the convergence analysis for OTA FL with heterogeneous data is given, and the authors conclude that convergence can be guaranteed for OTA FL with heterogeneous data and fading channels. Besides, due to the limited battery capacity of IoT intelligent devices, the energy constraint is one key issue that needs to be addressed. The authors in [9] first derive the convergence analysis for OTA FL when energy constraint is considered for local clients. Then, they formulate an online optimization client selection problem and employ the Lyapunov optimization technique to optimize a stable client selection scheme by solving the nonlinear integer programming problem.
However, most of the current papers do not consider energy harvesting technologies. Energy harvested from the environment, such as wind, solar power, and human motion, can enable local clients to perform sustained training and achieve green computing. Some prior studies have successfully applied energy harvesting techniques to wireless transmission [10], [11], and to task offloading and resource allocation [12], [13] for mobile edge computing. The authors in [13] aim to minimize energy consumption while meeting the quality of service for local clients with energy harvesting by optimizing task offloading and resource scheduling among the local clients, edge server, and cloud. There are also some prior works regarding FL with energy harvesting clients via traditional digital transmission [14], [15]. In [14], time-division duplex is used for gradient transmission in typical FL. The authors formulate an integer linear program for client selection and client association and employ the branch and bound algorithm to minimize the training loss when multiple base stations exist. In [15], two energy harvesting modes are investigated: deterministic energy arrivals and stochastic energy arrivals. The authors conclude that energy harvesting technology can realize sustainable distributed learning. Unlike [14], [15], which focus only on typical FL, we propose a novel energy-aware OTA FL system that incorporates the energy harvesting technique to supply power to local clients. Instead of treating energy consumption as a constraint for typical FL as in [14], [15], we quantify the impact of energy consumption and energy harvesting on the convergence performance of OTA FL. In [16], the authors investigate the energy harvesting technique for OTA FL, and two distributions of energy arrival processes, the Bernoulli distribution and the uniform distribution, are discussed.
Different from [15], [16], which merely make client selection decisions based on the distribution of the harvested energy arrivals, in this paper, we make client selection decisions based on the channel-energy-data (CED) coefficient. Instead of simply giving the convergence analysis of the optimality gap for OTA FL regarding client selection as in [16], we give the convergence analysis of the optimality gap for OTA FL regarding energy harvesting, energy consumption, receive beamforming, client selection, and data size. In our previous study [1], the client selection problem under the energy constraint is formulated as a nonlinear integer programming problem for a single-input single-output OTA FL system with energy harvesting clients, and the receive beamforming design is not considered.
In this paper, we apply energy harvesting techniques to the OTA FL system to realize durable computation and reduce reliance on conventional battery sources. We jointly optimize client selection, receive beamforming design, and energy management for a single-input multiple-output¹ OTA FL system with energy harvesting and energy constraints. The main contributions of this paper are summarized as follows:
• We first derive the optimality gap between the actual loss and the optimal loss for OTA FL and quantify the impacts of energy harvesting, energy consumption, client selection, and receive beamforming design on the optimality gap. To minimize the optimality gap, we formulate a mixed-integer nonlinear programming (MINLP) problem based on the convergence analysis.
• Then, the original intractable MINLP problem is transformed into an online MINLP problem. Decomposing the online MINLP problem yields two sub-problems: one for client selection and the other for receive beamforming design.
• To address the MINLP problem, we introduce several optimization strategies: a semidefinite relaxation (SDR) approach for the receive beamforming design sub-problem, a CED-based technique for the client selection sub-problem, and an alternative optimization method to jointly optimize the decoupled sub-problems.
• We conduct a thorough theoretical analysis of the proposed scheme and validate its performance via simulation experiments. A comparison with other schemes reveals that our proposed approach exhibits substantial advantages in terms of convergence performance for OTA FL. We also evaluate the impact of design parameters on the learning performance of OTA FL.
The rest of this paper is organized as follows. In Section II, we briefly review the related works about OTA FL. In Section III, we introduce the FL model, the communication model for OTA FL, and the energy management model for OTA FL. The convergence performance is derived and the problem is formulated in Section IV. Section V develops the joint online optimization algorithm for client selection and receive beamforming design. Section VI shows the simulation results. Finally, this paper is concluded in Section VII. The notations used throughout this paper are summarized in Table 1.
1. The single-input multiple-output setting is considered for OTA FL in this paper. Massive multiple-input multiple-output and clustered beamformers [17] can reduce the aggregation error for OTA computation with low latency; integrating them with our work is left for the future.
Notations: Let $\mathbb{R}$ denote the set of real numbers and $\mathbb{C}$ denote the set of complex numbers. Let regular letters, bold lower-case letters, and bold capital letters denote scalars, vectors, and matrices, respectively. The complex normal distribution with mean 0 and covariance matrix $\sigma^2$ is denoted by $\mathcal{CN}(0, \sigma^2)$. The transpose operation and conjugate transpose operation are denoted as $(\cdot)^T$ and $(\cdot)^H$, respectively. Let $\mathbb{E}[\cdot]$ denote the expectation operation.

II. RELATED WORKS
Recently, OTA FL has been proposed for communication-efficient distributed learning by making full use of the spectrum resources [4], [5]. Most of the recent articles focus on the transmission power control [18], [19], client selection scheme design [5], [20], [21], and receive beamforming design [22] for the OTA FL system. The design of transmission power for local clients can mitigate aggregation errors and improve the convergence speed for OTA FL. In [18], the authors adopt the successive convex approximation method and the trust region method to optimize transmission power for OTA FL when non-uniform channel fading exists. In [19], the authors optimize the transmission power to minimize the aggregation error for OTA FL when taking gradient statistics into account. Client selection is one effective scheme to improve the learning performance of the OTA FL system. In [5], the authors adopt the difference-of-convex-functions method to mitigate the influence of system heterogeneity and network heterogeneity for OTA FL. Specifically, the optimization of client selection and receive beamforming for OTA FL is realized by minimizing the model aggregation error and maximizing the number of selected clients. In [23], the clients with weak channels are neglected for the broadband situation and the authors prove that the latency for OTA FL is smaller compared with the FL via digital transmission. In addition, the diverse energy constraints are investigated in [9] and the client selection scheme is designed with the Lyapunov optimization to realize fast convergence. The work [21] adopts the SDR technique to minimize the optimality gap of the training loss by jointly optimizing client selection and power control for OTA FL.
Multi-antenna beamforming is another way to improve the performance for OTA FL [20], [22], [24], [25], [26], [27]. In [20], the receive beamforming is optimized with the difference-of-convex-functions method, and the client selection is designed with the Gibbs sampling method for reconfigurable intelligent surface-assisted OTA FL system. The work [22] jointly optimizes the receive beamforming design and learning rate with the combined method of difference-of-convex-functions method and exhaustive search method. In [27], the authors adopt the SDR technique to deal with the receive beamforming design, and the client selection is solved with the difference-of-convex-functions method.
Different from the previous studies that only focus on transmission power control for OTA FL, the influence of energy constraint and energy harvesting on the learning performance of OTA FL is also investigated in this paper. The convergence analysis of the OTA FL system is derived, which is related to energy management, client selection, receive beamforming design, and transmission power control. The optimized decisions of the client selection and receive beamforming for the OTA FL system are influenced not only by the transmission power control but also by the energy management.

III. SYSTEM MODEL
We consider an OTA FL system composed of a PS with $N$ antennas and $K$ single-antenna local clients, as shown in Fig. 1.² Let $\mathcal{K}$ denote the local client set. The feature-label pairs for local client $k \in \mathcal{K}$ can be represented as $\{(x_{k,i}, y_{k,i})\}_{i=1}^{D_k}$, where $x_{k,i}$ is the $i$-th input feature vector and $y_{k,i}$ is the corresponding ground truth label. Let $D_k$ denote the number of local training samples for local client $k$; the total number of training samples for all local clients is $D = \sum_{k \in \mathcal{K}} D_k$. In addition, each client can harvest energy from the ambient environment, and the harvested energy is used to sustain its computation and transmission energy consumption. In the following, we introduce the FL model, the communication model, and the energy consumption and harvesting models.
2. The typical FL with a one-stage communication structure is considered in this paper. Two-stage communication [28], which introduces an intermediary server for distant clients, can improve the convergence performance for OTA FL; it is beyond the scope of this paper and is left as future work.

A. FEDERATED LEARNING MODEL
We assume that the total number of training rounds is $T$ for a typical FL system, and that information is exchanged between the local clients and the PS throughout this period. At the $t$-th training round, the typical FL system conducts the following four steps:
• Client selection: First, a subset of clients $\mathcal{K}_t$ is selected from $\mathcal{K}$ by the PS to join the training.
After the training process is completed, the selected client k transmits the local gradients g k,t to the PS. • Model aggregation: After receiving the signals, the global gradient can be represented as Let η denote the fixed learning rate. The global model is updated as The global loss for the selected clients can be given by
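The aggregation and update steps described above can be sketched in a few lines. This is a minimal illustration, assuming the common data-size-weighted average of the selected clients' gradients followed by a gradient-descent step with learning rate η; the function names `aggregate_gradients` and `update_model` are hypothetical, and the exact expressions used in this paper may differ.

```python
from typing import Dict, List


def aggregate_gradients(grads: Dict[int, List[float]],
                        data_sizes: Dict[int, int]) -> List[float]:
    """Data-size-weighted average of the selected clients' gradients (assumed form)."""
    total = sum(data_sizes[k] for k in grads)
    dim = len(next(iter(grads.values())))
    return [sum(data_sizes[k] * g[i] for k, g in grads.items()) / total
            for i in range(dim)]


def update_model(w: List[float], g: List[float], eta: float) -> List[float]:
    """One gradient-descent step: w_{t+1} = w_t - eta * g_t."""
    return [wi - eta * gi for wi, gi in zip(w, g)]


# Two selected clients holding D_1 = 1 and D_2 = 3 samples (toy values).
grads = {1: [1.0, 0.0], 2: [0.0, 4.0]}
sizes = {1: 1, 2: 3}
g_t = aggregate_gradients(grads, sizes)
w_next = update_model([1.0, 1.0], g_t, eta=0.5)
```

Clients with more samples pull the aggregate toward their own gradient, which is why the data size appears later in the client selection trade-off.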

B. COMMUNICATION MODEL
For the OTA FL system, the communication process includes global model broadcasting from the PS to the clients and local model uploading from the clients to the PS. In this paper, we assume that the global model can be broadcast correctly from the PS to the clients. In the upload process via OTA computation, the selected clients need to send their updated gradients to the PS synchronously over the shared wireless channel. Let $p_{k,t}$ represent the power control factor of client $k$ at the $t$-th training round. The pre-processing operation for local gradients $g_{k,t}$ at client $k$ during the $t$-th training round is as follows:
Let $h_{k,t} \in \mathbb{C}^{N}$ denote the channel coefficient of client $k$ at the $t$-th training round, where $N$ is the number of receive antennas at the PS side, and let $m_t \in \mathbb{C}^{N}$ denote the receive beamforming vector at the $t$-th training round. We assume that the channel coefficients may fluctuate across training rounds but stay quasi-static within the same training round. Let $z_t = [z_t^{[1]}, z_t^{[2]}, \ldots, z_t^{[S]}] \in \mathbb{C}^{N \times S}$ represent the additive white Gaussian noise (AWGN) matrix, in which $z_t^{[s]}$ denotes the noise induced on the $s$-th gradient signal. Each element of $z_t^{[s]}$ follows the complex normal distribution $\mathcal{CN}(0, \sigma^2)$. The PS receives the aggregated gradient vector $r_t \in \mathbb{C}^{S}$ at the $t$-th training round as
To get the averaged gradients of the OTA FL system, the PS post-processes the received aggregated gradients [6] as
where $\alpha_t$ represents the normalization scaling factor at the $t$-th training round. Then, the error estimate between the post-processed aggregated gradients and the error-free gradients $g_t$ at the $t$-th training round can be expressed as
According to [29], to minimize the mean squared error, the uniform-forcing technique is used for OTA computation.
As a result, the power control factor $p_{k,t}$ needs to satisfy the following requirement for the selected client $k$ at the $t$-th training round:
Besides, the average transmission power of client $k$ during the $t$-th training round is constrained by the maximum transmission power $P_0$ as follows:
Substituting (10) into (8), the averaged gradients can be rewritten as follows:
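To make the uniform-forcing idea concrete, the following sketch simulates one scalar gradient entry aggregated over the air. It assumes a power control factor of the form p_k = D_k α / (m^H h_k) for every selected client, so that the beamformed channel is inverted and the combined signal equals α Σ_k D_k g_k plus noise; this form, and the helper names `inner` and `ota_average`, are illustrative assumptions rather than the paper's exact expressions.

```python
from typing import Optional, Sequence


def inner(m: Sequence[complex], h: Sequence[complex]) -> complex:
    """Beamformed channel m^H h."""
    return sum(mi.conjugate() * hi for mi, hi in zip(m, h))


def ota_average(grads: Sequence[float], sizes: Sequence[int],
                channels: Sequence[Sequence[complex]], m: Sequence[complex],
                alpha: float, noise: Optional[Sequence[complex]] = None) -> float:
    """Simulate over-the-air aggregation of one gradient entry (assumed model)."""
    n = len(m)
    # Start from the per-antenna noise; all client signals superpose on top of it.
    received = list(noise) if noise is not None else [0j] * n
    for g, d, h in zip(grads, sizes, channels):
        p = d * alpha / inner(m, h)      # uniform-forcing power control (assumed form)
        for a in range(n):
            received[a] += h[a] * p * g
    combined = inner(m, received)        # receive beamforming m^H y
    # Post-processing: undo the common scaling alpha and the total data size.
    return (combined / (alpha * sum(sizes))).real


channels = [[1 + 1j, 0.5j], [0.2 - 0.3j, 1 + 0j]]
m = [1 + 0j, 1j]
estimate = ota_average([1.0, 4.0], [1, 3], channels, m, alpha=0.1)
```

Without noise, the estimate equals the data-size-weighted average of the gradients. With noise, the residual error is m^H z divided by α Σ_k D_k, so a smaller scaling factor α amplifies the noise after post-processing.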

C. ENERGY CONSUMPTION AND HARVESTING MODELS
According to [30], the transmission time of each training round for OTA FL is calculated as
where $T_{slot}$ is the transmission time for each resource block and $R$ is the number of transmitted signals for each resource block. Let $c_k$ denote the computation energy consumption for one training sample of client $k$. The data size $D_k$ may vary across clients because of the heterogeneous data distribution. The total energy consumption includes the computation energy consumption and the transmission energy consumption. In this paper, the number of local epochs is set as 1 for simplicity.³ By combining (10) and (13), the total energy consumption $e^{tot}_{k,t}$ [9] for local client $k$ during the $t$-th training round can be calculated as
3. Extending to multiple local epochs is viable by adequately re-orchestrating the energy consumption and harvesting models.
VOLUME 4, 2023
We assume that the local client $k$ can harvest energy from the surrounding environment. The energy harvesting process is constructed based on a successive energy arrival model. Let $e^{arr}_t = [e^{arr}_{1,t}, \ldots, e^{arr}_{K,t}]$ denote the newly arrived energy vector during the $t$-th training round, where $e^{arr}_{k,t}$ is the energy arriving at client $k$ during the $t$-th training round. For each local client, the energy harvested from the surrounding environment can be stored in the battery and used to supply subsequent computation and transmission energy consumption.
Let $b_{k,t}$ represent the battery level at the beginning of the $t$-th training round for local client $k$. The maximum battery capacity for all clients is set as $B_{max}$. At the $t$-th training round, the total energy consumption cannot exceed the residual battery level for each selected client $k$, which is represented as
The updated battery level of each client cannot exceed the maximum battery capacity $B_{max}$ during each training round. The updated battery level $b_{k,t+1}$ for client $k$ at the end of the $t$-th training round can be expressed as follows:
The transmission energy consumption is constrained by the residual battery level and the maximum transmission power. Based on (11), (14), and (15), the transmission energy for client $k$ at the $t$-th training round cannot be larger than the transmission energy threshold, which is calculated as $\min\{b_{k,t} - D_k c_k,\ \tau^{tr} P_0\}$.
The presence of channel fading and the introduced noise make the received gradient signals deviate from their actual values for OTA FL. The client selection and receive beamforming design both have impacts on the received signals for OTA FL because of the energy constraint, channel fading, and communication error. In the next section, we derive how the convergence performance of the OTA FL system depends on the client selection matrix $\beta$, the receive beamforming matrix $m$, and the transmission energy threshold.
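The energy bookkeeping described in this subsection can be sketched as follows. The threshold uses the expression min{b − D c, τ P0} stated with Lemma 1; the battery update is an assumed form (spend only when selected, add the harvested energy, clip at the maximum capacity), and the function names and toy values are hypothetical.

```python
def transmission_energy_threshold(b: float, d: int, c: float,
                                  tau_tr: float, p0: float) -> float:
    """Energy left for transmission: min{b - D*c, tau_tr * P0}."""
    return min(b - d * c, tau_tr * p0)


def can_participate(b: float, e_tot: float) -> bool:
    """Energy constraint: total consumption must not exceed the residual battery."""
    return e_tot <= b


def update_battery(b: float, e_tot: float, e_arr: float,
                   selected: bool, b_max: float) -> float:
    """Assumed battery update: subtract consumption if selected, add harvest, clip."""
    spent = e_tot if selected else 0.0
    return min(b - spent + e_arr, b_max)


# Toy values: 2 J in the battery, 500 samples at 0.001 J each.
threshold = transmission_energy_threshold(b=2.0, d=500, c=0.001, tau_tr=0.5, p0=0.5)
full = update_battery(b=19.5, e_tot=0.3, e_arr=1.0, selected=True, b_max=20.0)
idle = update_battery(b=1.0, e_tot=0.3, e_arr=0.2, selected=False, b_max=20.0)
```

In the toy case the power limit (0.25 J) is tighter than the battery limit (1.5 J), so the threshold is set by the maximum transmission power rather than by the residual energy.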

IV. CONVERGENCE ANALYSIS AND PROBLEM FORMULATION
The convergence analysis of the OTA FL system is given in this section. Note that the energy constraint and transmission power control are both introduced into the convergence analysis for OTA FL. The derived convergence results reveal that the channel coefficient, the transmission energy threshold, and the data size all have impacts on the convergence rate of the OTA FL system. Based on the convergence result, the non-convex MINLP problem is built as P 1 . Because of the stochasticity of the channel coefficient and energy harvesting, the offline problem P 1 is transformed into the online non-convex MINLP problem P 2 .

A. CONVERGENCE ANALYSIS
According to (4) and (8), the global model for the OTA FL system at the t-th training round is updated through For ease of convergence analysis, we have the following assumptions regarding loss functions [3], [6].

Assumption 1 (l-Smoothness):
For parameters $w$ and $v$, there is a non-negative constant $l$ such that the following inequality holds:

Assumption 2 (PL Inequality):
There exists a non-negative constant $\mu$ such that the Polyak-Lojasiewicz (PL) condition is satisfied as follows:

Assumption 3 (Gradient Bound):
The squared norm of the local gradient $\|\nabla f(w)\|_2^2$ is bounded by the squared norm of the global gradient $\|\nabla F(w)\|_2^2$ with parameters $\lambda_1 \ge 0$ and $\lambda_2 \ge 0$, such that
Let $F(w^*)$ denote the optimal loss. The optimality gap between the actual loss and the optimal loss at the $t$-th training round can be formally represented as $\mathbb{E}[F(w_{t+1})] - F(w^*)$. A smaller optimality gap represents better learning performance for the OTA FL system. The relationship between the optimality gaps of two adjacent rounds is given in Lemma 1.
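For reference, the three assumptions are commonly written in the standard forms below. This is a sketch under the assumption that the displays for Assumptions 1-3 follow the usual conventions in the FL convergence literature, with $f$ a local loss and $F$ the global loss; the exact expressions used in this paper may differ in detail.

```latex
% Assumption 1 (l-smoothness), standard form
F(w) \le F(v) + \nabla F(v)^{T}(w - v) + \frac{l}{2}\,\|w - v\|_2^2
% Assumption 2 (PL inequality), standard form
\|\nabla F(w)\|_2^2 \ge 2\mu\,\bigl(F(w) - F(w^{*})\bigr)
% Assumption 3 (gradient bound), assumed form
\|\nabla f(w)\|_2^2 \le \lambda_1 + \lambda_2\,\|\nabla F(w)\|_2^2
```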
Lemma 1: When Assumptions 1-3 are satisfied, with the learning rate $\eta = 1/l$, the given client selection vector $\beta_t$, the given receive beamforming vector $m_t$, and the transmission energy threshold $\min\{b_{k,t} - D_k c_k,\ \tau^{tr} P_0\}$, the optimality gap $\mathbb{E}[F(w_{t+1})] - F(w^*)$ at the $t$-th training round within the energy constraint is given by
where $\psi_t$ and $\varphi_t$ can be represented as follows:
Proof: See Appendix A.
Lemma 1 reveals that the upper bound of the optimality gap at the $t$-th training round, $\mathbb{E}[F(w_{t+1})] - F(w^*)$, is related to the optimality gap at the $(t-1)$-th training round, $\mathbb{E}[F(w_t)] - F(w^*)$, as well as to $\psi_t$ and $\varphi_t$. By repeatedly applying Lemma 1 and collecting terms, we obtain the optimality gap after a total of $T$ training rounds, as stated in Theorem 1.
Different from Lemma 1, Theorem 1 illustrates the relationship between the optimality gap after T training rounds and the initial optimality gap.
Theorem 1: Suppose that the total number of training rounds is set as $T$ and the initial global model is set as $w_1$. The optimality gap after $T$ training rounds of the OTA FL system is given by
where
Proof: See Appendix B.
Theorem 1 reveals that the optimality gap is upper-bounded by a term that depends on the client selection $\beta$ and the receive beamforming $m$. When the initial optimality gap $\mathbb{E}[F(w_1)] - F(w^*)$ and the parameters $\mu$, $l$, $\lambda_1$, and $\lambda_2$ of Assumptions 1-3 are known, the upper bound of the optimality gap is decided by $\psi_1, \varphi_1, \ldots, \psi_T$, and $\varphi_T$.
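The step from the per-round lemma to the T-round theorem can be checked numerically. The sketch below assumes that the per-round bound has the linear form gap_{t+1} ≤ ψ_t · gap_t + φ_t, which is an assumption about the structure of Lemma 1 rather than its exact statement; under that assumption, repeated substitution gives a product of the ψ terms multiplying the initial gap plus a weighted sum of the φ terms. The function names are hypothetical.

```python
from math import prod
from typing import Sequence


def iterated_bound(gap1: float, psi: Sequence[float], phi: Sequence[float]) -> float:
    """Apply the assumed per-round recursion gap <- psi_t * gap + phi_t round by round."""
    gap = gap1
    for p, f in zip(psi, phi):
        gap = p * gap + f
    return gap


def unrolled_bound(gap1: float, psi: Sequence[float], phi: Sequence[float]) -> float:
    """Closed form: prod(psi) * gap1 + sum_t phi_t * prod(psi_{t+1..T})."""
    bound = prod(psi) * gap1
    for t in range(len(psi)):
        bound += phi[t] * prod(psi[t + 1:])   # prod of an empty slice is 1
    return bound


psi = [0.9, 0.8, 0.95]
phi = [0.05, 0.02, 0.01]
step_by_step = iterated_bound(1.0, psi, phi)
closed_form = unrolled_bound(1.0, psi, phi)
```

Both routines return the same value, which shows why the final bound is fully determined by the initial gap and the sequences of ψ and φ terms, and why reducing each φ term round by round is a sensible online surrogate.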

B. PROBLEM FORMULATION
According to Theorem 1, the optimality gap after $T$ training rounds, $\mathbb{E}[F(w_{T+1})] - F(w^*)$, is influenced by the client selection vector $\beta_t$, the receive beamforming vector $m_t$, and the transmission energy threshold, which are obtained for each training round $t$ and each selected client $k$. The transmission power control and energy constraint are both introduced into the convergence analysis of the optimality gap. To minimize the optimality gap after $T$ training rounds, the problem $P_1$ can be formulated as
It is difficult to solve the problem $P_1$ directly because of the stochasticity of the channel coefficient $h_t$ and the harvested energy vector $e^{arr}_t$ in each training round $t$. We therefore reformulate the problem $P_1$ and solve it in an online manner according to Lemma 1.
According to Lemma 1, the optimality gap at the $t$-th training round, $\mathbb{E}[F(w_{t+1})] - F(w^*)$, is decided by $\varphi_t$ when the parameters $\mu$, $l$, $\lambda_1$, $\lambda_2$ of the assumptions and the optimality gap at the $(t-1)$-th training round are known. Therefore, the problem $P_1$ can be transformed to $P_2$, which minimizes $\varphi_t$ in an online manner and can be formulated as follows:
The formulated problem $P_2$ is an MINLP problem rather than the nonlinear integer programming problem in [1]. The receive beamforming design and transmission power control are both considered in $P_2$, which makes the problem more difficult to solve than that in [1]. By analyzing (12) and (24), one can observe that if more clients are selected to upload the gradients, the loss decreases faster. However, due to the constraints of the transmission power control and residual battery level, $\alpha_t$ becomes smaller and the signal-to-noise ratio (SNR) decreases when more clients are selected to participate in the training, which slows down the convergence.

V. JOINTLY ONLINE OPTIMIZING CLIENT SELECTION AND RECEIVE BEAMFORMING
In this section, we propose an alternative optimization method to solve the problem P 2 . Specifically, the SDR programming method is adopted to optimize the receive beamforming when the client selection vector is given, and the discrete search method named the CED-based method is proposed to realize client selection with the given receive beamforming design. The process of the alternative optimization will iterate for a predefined number of iterations. Besides, we also give discussions about the computation complexity of the proposed algorithms.

A. RECEIVE BEAMFORMING OPTIMIZATION
At the $t$-th training round, assume that the client selection vector $\beta_t$ is given. To optimize the receive beamforming design for the OTA FL system, the problem $P_2$ can be simplified as
For the convenience of the analysis, the min-max problem $P_{3.1}$ is reformulated as a minimization problem $P_{3.2}$, which is presented in Lemma 2.
Lemma 2: By assuming that $M_t = m_t m_t^H$ and $H_{k,t} = h_{k,t} h_{k,t}^H$, the non-convex receive beamforming optimization problem $P_{3.1}$ can be reformulated as
Proof: See Appendix C.

One effective method to deal with the problem $P_{3.2}$ is the SDR method, which obtains a convex relaxed problem by dropping the rank-one constraint. In this way, the problem $P_{3.2}$ can be transformed to
Then, an approximate solution can be obtained by solving the convex problem $P_{3.3}$ with the CVX toolbox [31]. Let $M_t^*$ be the approximate optimal solution of $P_{3.3}$. The eigenvalue decomposition of $M_t^*$ yields a matrix whose columns are the eigenvectors and a diagonal matrix $\mathrm{diag}(\rho_1, \rho_2, \ldots, \rho_N)$ containing the eigenvalues. Let $\rho_{\max}$ be the maximum eigenvalue of $M_t^*$. According to [29], if the constraint $\mathrm{rank}(M_t^*) = 1$ is satisfied, the optimal receive beamforming vector $m_t^*$ is the eigenvector corresponding to $\rho_{\max}$ scaled by $\sqrt{\rho_{\max}}$. If the constraint $\mathrm{rank}(M_t^*) = 1$ cannot be satisfied for the relaxed problem, we adopt Gaussian randomization [32] to obtain candidate vectors $m_t^{[i]}$ from random vectors $\xi_t^{[i]}$ generated from $\mathcal{CN}(0, I_{N \times 1})$. We generate a total of $I$ candidates for the receive beamforming vector. Then, from the candidates $m_t \in \{m_t^{[1]}, \ldots, m_t^{[I]}\}$, we select the vector $m_t^*$ that attains the minimum objective value according to (33). The process for designing the receive beamforming vector with SDR combined with the Gaussian randomization approach is summarized in Algorithm 1.
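The Gaussian randomization step can be sketched without a convex solver. The code below assumes that a factor L with M* = L L^H is already available from the relaxed solution, draws candidates m = L ξ with ξ ~ CN(0, I), and keeps the best one. The objective `minmax_objective` is a hypothetical stand-in of the usual uniform-forcing form (squared norm of m times the worst weighted inverse channel gain) and is not claimed to be the paper's expression (33); all names and numbers are illustrative.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[complex]


def inner(m: Sequence[complex], h: Sequence[complex]) -> complex:
    """Beamformed channel m^H h."""
    return sum(mi.conjugate() * hi for mi, hi in zip(m, h))


def minmax_objective(m: Sequence[complex], channels: Sequence[Sequence[complex]],
                     weights: Sequence[float]) -> float:
    """Hypothetical stand-in objective: ||m||^2 * max_k w_k / |m^H h_k|^2."""
    norm_sq = sum(abs(x) ** 2 for x in m)
    return norm_sq * max(w / abs(inner(m, h)) ** 2 for h, w in zip(channels, weights))


def gaussian_randomization(factor: Sequence[Sequence[complex]],
                           objective: Callable[[Vector], float],
                           num_candidates: int, seed: int = 0) -> Tuple[Vector, float]:
    """Draw candidates m = L xi with xi ~ CN(0, I) and return the best one."""
    rng = random.Random(seed)
    n_cols = len(factor[0])
    scale = 0.5 ** 0.5                      # real and imaginary parts have variance 1/2
    best, best_val = [], float("inf")
    for _ in range(num_candidates):
        xi = [complex(rng.gauss(0.0, scale), rng.gauss(0.0, scale)) for _ in range(n_cols)]
        cand = [sum(row[j] * xi[j] for j in range(n_cols)) for row in factor]
        val = objective(cand)
        if val < best_val:
            best, best_val = cand, val
    return best, best_val


channels = [[1 + 0.5j, 0.2 - 1j], [0.3 + 0.3j, -0.8 + 0.1j]]
weights = [1.0, 2.0]
factor = [[1 + 0j, 0.2j], [0j, 0.7 + 0j]]   # illustrative L with M* = L L^H


def obj(m: Vector) -> float:
    return minmax_objective(m, channels, weights)


m5, val5 = gaussian_randomization(factor, obj, num_candidates=5)
m50, val50 = gaussian_randomization(factor, obj, num_candidates=50)
```

Because both runs share the same seed, the first five candidates coincide, so the 50-candidate result can only match or improve on the 5-candidate one. This reflects the usual trade-off: more candidates cost more objective evaluations but can only tighten the solution.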

B. CLIENT SELECTION OPTIMIZATION
Given the optimized receive beamforming vector $m_t$ obtained from Algorithm 1, to optimize the client selection decisions, the problem $P_2$ can be simplified as
Note that $P_4$ is in essence a nonlinear integer programming problem. There is an inverse correlation between the optimization objective value $\varphi_t(\beta_t)$ and the maximum con-
The CED coefficient for client $k$ at the $t$-th training round is denoted as follows:
which is decided by the channel coefficient $h_{k,t}$, the transmission energy threshold, and the dataset size $D_k$.

6: Generate $I$ random vectors, where the $i$-th vector is $\xi_t^{[i]} \sim \mathcal{CN}(0, I_{N \times 1})$;
7: Obtain the candidate optimal values $m_t^{[i]}$;
8: Obtain the optimal receive beamforming vector $m_t^*$ according to (33).
9: $m_t = m_t^*$.

Let $Q_t = [q_{1,t}, q_{2,t}, \ldots, q_{K,t}]$ denote the list of CED coefficients. The maximum value of $Q_t$ is calculated as $Q_t^{\max} = \max_{k \in \mathcal{K}} q_{k,t}$. One can observe that if the selected clients have worse channel quality, less available transmission energy, and larger data size, $Q_t^{\max}$ becomes larger. By analyzing the problem $P_4$, one can observe that there is a trade-off between the total number of selected training samples $\sum_{k \in \mathcal{K}} \beta_{k,t} D_k$ and $Q_t^{\max}$. If more clients are selected during the training process, the number of training samples $\sum_{k \in \mathcal{K}} \beta_{k,t} D_k$ becomes larger, which makes the global loss converge faster. However, if more clients are selected, $Q_t^{\max}$ increases, which leads to slower convergence. As a result, to improve the learning performance of the OTA FL system, an effective client selection scheme is required to solve the problem $P_4$.
The proposed discrete search method for $P_4$, named the CED-based method, is summarized in Algorithm 2. First, $Q_t$ is sorted in ascending order to get the sorted queue $\tilde{Q}_t$. Let $\tilde{q}_{k,t}$ be the $k$-th value in $\tilde{Q}_t$, and let $S^{[k]}$ be the subset of $k$ clients selected according to the $k$ smallest values of $Q_t$. There are $K$ possible client selection decisions, and the final client selection is decided by
The indicator $k^*$ can be obtained by calculating the index of the minimum value of $\tilde{\varphi}^{[t,k]}$ as
A client is selected if its index $k$ in the sorted queue $\tilde{Q}_t$ does not exceed $k^*$, that is, if $\tilde{q}_{k,t} \le \tilde{q}_{k^*,t}$.
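The discrete search described above can be sketched as a prefix search over the sorted CED coefficients. The routine below sorts the clients by coefficient, evaluates an objective for each of the K prefixes, and selects the best prefix. The objective is passed in as a callable because its exact form is not reproduced here; `toy_objective` is a purely hypothetical surrogate that only mimics the stated trade-off (more samples help, a larger maximum coefficient hurts).

```python
from typing import Callable, List, Sequence, Tuple


def ced_select(q: Sequence[float], data_sizes: Sequence[int],
               objective: Callable[[int, float], float]) -> Tuple[List[int], float]:
    """Evaluate the K prefixes of the ascending CED order and keep the best one."""
    order = sorted(range(len(q)), key=lambda k: q[k])   # ascending CED coefficient
    best_val, best_count = float("inf"), 0
    total = 0
    for count, k in enumerate(order, start=1):
        total += data_sizes[k]
        # q[k] is the largest coefficient within the current prefix.
        val = objective(total, q[k])
        if val < best_val:
            best_val, best_count = val, count
    beta = [0] * len(q)
    for k in order[:best_count]:
        beta[k] = 1
    return beta, best_val


def toy_objective(total_samples: int, q_max: float) -> float:
    """Hypothetical surrogate, not the paper's objective: penalty for the worst
    coefficient plus a penalty for leaving samples unused (100 samples overall)."""
    return q_max + (1 - total_samples / 100)


q = [0.9, 0.1, 5.0, 0.3]
sizes = [20, 30, 10, 40]
beta, value = ced_select(q, sizes, toy_objective)
```

With these toy numbers, the search keeps the two clients with the smallest coefficients and rejects the client with coefficient 5.0, whose poor channel or energy state would dominate the maximum. Sorting dominates the cost, so the search needs only K objective evaluations instead of 2^K.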

C. THE ALTERNATIVE OPTIMIZATION FOR CLIENT SELECTION AND RECEIVE BEAMFORMING
The proposed alternative optimization method to minimize the optimality gap of the OTA FL system is summarized in Algorithm 3. At the beginning of the $t$-th training round, all clients are selected in the initial setup, that is, $\beta_t(0) = [1, 1, \ldots, 1]$. Then, the receive beamforming is optimized with the SDR method according to Algorithm 1. Once the optimized receive beamforming vector $m_t$ is obtained, the selected clients are determined by Algorithm 2. These two steps are repeated $J$ times. When the joint optimization process is completed, the PS sends the global model to the selected clients. Then, the local clients update the global model and upload the gradients via OTA computation. The PS obtains the aggregated gradients based on (8). The clients harvest energy from the energy sources and update the current battery level queue based on (16).
The computational complexity of Algorithm 1 for designing the receive beamforming is dominated by the procedure of obtaining $M_t^*$ with the SDR algorithm (see Line 1 in Algorithm 1), which is $O((N^2 + K)^{3.5})$ [33]. The computational complexity of Algorithm 2 for optimizing client selection is dominated by the sorting process (see Line 3 in Algorithm 2), which takes $O(K \log K)$ operations. In addition, as Algorithm 3 is conducted for $T$ rounds during the training procedure and the alternative optimization is conducted for $J$ iterations in each round, the overall time complexity is $O(TJ((N^2 + K)^{3.5} + K \log K))$.
Fig. 2 illustrates the relations among the problems of the whole OTA FL system with energy harvesting. First, the problem $P_1$ is formulated according to Theorem 1 to minimize the optimality gap after $T$ training rounds for OTA FL. Then, the intractable offline MINLP problem $P_1$ is transformed to the online MINLP problem $P_2$ to minimize the optimality gap at the t-th training round based on (28) and Lemma 1. To solve the problem $P_2$, the MINLP problem is decoupled into two sub-problems: $P_{3.1}$ for receive beamforming design and $P_4$ for client selection. SDR programming is used to optimize the receive beamforming sub-problem. Specifically, the problem $P_{3.1}$ is transformed to $P_{3.2}$ via matrix lifting according to Lemma 2. Then, the non-convex problem $P_{3.2}$ is relaxed to the convex problem $P_{3.3}$ by dropping the rank-one constraint, and $P_{3.3}$ can be solved via the CVX toolbox. For the client selection sub-problem, the discrete search method combined with the CED coefficient is proposed to deal with the nonlinear integer programming problem $P_4$.

Algorithm 3 The Alternative Optimization Algorithm
Input: $h_t$, $b_t$, $c_t$, D, $\sigma^2$, $\tau_{tr}$, $P_0$, D, I, J, j = 0, and $\beta_t(0)$.
Output: $\beta_t$, $m_t$.
1: repeat
2:   Given $\beta_t(j)$ and ($h_t$, $b_t$, $c_t$, D, $\tau_{tr}$, $P_0$, I), obtain $m_t(j + 1)$ via Algorithm 1;
3:   Given $m_t(j + 1)$ and ($h_t$, $b_t$, $c_t$, D, $\sigma^2$, $\tau_{tr}$, $P_0$, D), obtain $\beta_t(j + 1)$ via Algorithm 2;
4:   Update $m_t = m_t(j + 1)$, $\beta_t = \beta_t(j + 1)$, j = j + 1;
5: until j = J.
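The control flow of Algorithm 3 can be sketched in a few lines of Python. This is only an illustrative sketch: the two callables stand in for Algorithm 1 (SDR-based beamforming) and Algorithm 2 (CED-based selection), which are not implemented here, and the names `alternating_optimization`, `toy_beamformer`, `toy_selector`, and `gains` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Beam = Sequence[complex]
Selection = List[int]


def alternating_optimization(
    num_clients: int,
    num_iterations: int,
    optimize_beamforming: Callable[[Selection], Beam],
    select_clients: Callable[[Beam], Selection],
) -> Tuple[Selection, Beam]:
    """Alternate J times between beamforming design and client selection."""
    if num_iterations < 1:
        raise ValueError("num_iterations (J) must be at least 1")
    beta: Selection = [1] * num_clients  # beta_t(0): all clients selected
    m: Optional[Beam] = None
    for _ in range(num_iterations):
        m = optimize_beamforming(beta)  # stand-in for Algorithm 1
        beta = select_clients(m)        # stand-in for Algorithm 2
    return beta, m


# Toy stand-ins (hypothetical, NOT the SDR or CED methods of the paper).
gains = [0.9, 0.1, 0.5, 0.7]


def toy_beamformer(beta: Selection) -> Beam:
    # Scale a single-antenna "beam" by the weakest selected channel gain.
    weakest = min(g for g, b in zip(gains, beta) if b)
    return [complex(1.0 / weakest)]


def toy_selector(m: Beam) -> Selection:
    # Keep clients whose gain is at least 0.4, regardless of the beam.
    return [1 if g >= 0.4 else 0 for g in gains]


beta, m = alternating_optimization(4, 3, toy_beamformer, toy_selector)
```

Passing the two sub-solvers as callables keeps the loop independent of how each sub-problem is solved, which mirrors the decoupling into two sub-problems described above.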

VI. PERFORMANCE EVALUATION
In the simulation, the PS is located at (0, 0, 10), and 40 clients are randomly located within a circle with a radius of 250 meters. Let $\bar{h}_k = G_P G_C \left(\frac{3 \times 10^8}{4\pi f_c L_k}\right)^{\rho}$ denote the average channel gain for the free-space path loss model, where $G_P = 5$ dBi denotes the antenna gain of the PS, $G_C = 0$ dBi denotes the antenna gain of the local clients, $f_c = 915$ MHz is the carrier frequency, $L_k$ is the distance between the PS and client $k$, and the path loss exponent $\rho$ is set as 3. The channel gain $h_{k,t}$ for client $k$ during the t-th training round is expressed as $h_{k,t} = \bar{h}_k \gamma_{k,t}$, where $\gamma_{k,t}$ is generated from the Gaussian distribution with zero mean and unit variance. The transmission slot for each resource block $T_{\mathrm{slot}}$ is set as 1 ms, and the number of the transmitted signals $R$ is set as 14 [30]. Let $\bar{e}_k$ denote the average amount of harvested energy per round for client $k$, and let $e^{\mathrm{arr}}_{k,t}$ denote the harvested energy for client $k$ during the t-th training round. The harvested energy for client $k$ during the training process is $[e^{\mathrm{arr}}_{k,1}, e^{\mathrm{arr}}_{k,2}, \ldots, e^{\mathrm{arr}}_{k,T}]$, which follows a Poisson distribution with an average of $\bar{e}_k$ [34]. We assume that $\bar{e}_k$ is uniformly distributed between 0.1 J and 1 J for different clients. The computation energy consumption per sample $c_k$ is set as 0.001 J for all clients. The maximum battery capacity $B_{\max}$ is set as 20 J, and the maximum transmission power $P_0$ is set as −10 dB. The total number of candidates $I$ for the receive beamforming vector is set as 5 in Algorithm 2. The iteration number $J$ for the alternative optimization method is set as 20.
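The average channel gain under this setup can be sketched as follows. This is a minimal illustration, assuming that the dBi antenna gains are converted to linear scale, that clients sit at ground level (height 0), and that they are placed uniformly over the disc area; none of these details are stated in the text, and the function names are hypothetical.

```python
import math
import random

C_LIGHT = 3e8  # propagation speed in m/s, as used in the text


def db_to_linear(value_db: float) -> float:
    """Convert a gain in dB or dBi to linear scale."""
    return 10.0 ** (value_db / 10.0)


def average_channel_gain(
    distance_m: float,
    g_ps_dbi: float = 5.0,
    g_client_dbi: float = 0.0,
    carrier_hz: float = 915e6,
    path_loss_exp: float = 3.0,
) -> float:
    """Average gain G_P * G_C * (c / (4 * pi * f_c * L_k)) ** rho."""
    ratio = C_LIGHT / (4.0 * math.pi * carrier_hz * distance_m)
    return db_to_linear(g_ps_dbi) * db_to_linear(g_client_dbi) * ratio ** path_loss_exp


def sample_client_distances(
    num_clients: int = 40,
    radius_m: float = 250.0,
    ps_height_m: float = 10.0,
    seed: int = 0,
) -> list:
    """Distances from a PS at (0, 0, ps_height_m) to ground-level clients."""
    rng = random.Random(seed)
    distances = []
    for _ in range(num_clients):
        # sqrt of a uniform variate gives a radius uniform over the disc area.
        horizontal = radius_m * math.sqrt(rng.random())
        distances.append(math.hypot(horizontal, ps_height_m))
    return distances


distances = sample_client_distances()
avg_gains = [average_channel_gain(d) for d in distances]
```

With a path loss exponent of 3, doubling the distance reduces the average gain by a factor of 8, so clients near the edge of the circle see much weaker channels than those close to the PS.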
We use the Fashion-MNIST dataset [35] to conduct the experiments. Following [36], two settings of the data distribution among clients are considered: balanced and unbalanced. For both settings, the training samples at local clients are independent and identically distributed (i.i.d.). For the balanced data setting, the number of samples is equal to 800 for all clients. For the unbalanced data setting, the sample size is set randomly in [100, 200] for half of the clients, and in [1000, 2000] for the other half. By analyzing $\phi_t$ in (24), one can observe that the convergence rate of the optimality gap is influenced by the number of selected clients for the balanced data setting. However, for the unbalanced data setting, the convergence rate of the optimality gap is influenced not only by the number of selected clients, but also by the data size of the selected clients. We use the two data settings to illustrate that, by introducing the data size $D_k$ of client $k$ in our formulations, the proposed scheme can be applied to different data size distributions. A four-layer convolutional neural network is adopted for training, which consists of two 5 × 5 convolution layers, one fully connected layer with 50 units, and one softmax layer. The total number of model parameters is 21840, and the total number of training rounds is set as 1500. The learning rate $\eta$ is set as 0.01 by default.
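The stated total of 21840 parameters can be checked with a short calculation. The sketch below assumes one configuration that is consistent with this total: 10 and 20 output channels for the two convolution layers, no padding, and 2 × 2 pooling after each convolution on 28 × 28 single-channel inputs. The channel counts and pooling are assumptions, since the text only gives the kernel size, the 50 hidden units, and the total.

```python
def conv_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights plus biases of a square convolution layer."""
    return in_channels * out_channels * kernel * kernel + out_channels


def fc_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def count_parameters(
    conv1_channels: int = 10,  # assumed, not given in the text
    conv2_channels: int = 20,  # assumed, not given in the text
    hidden_units: int = 50,
    num_classes: int = 10,
    image_size: int = 28,
    kernel: int = 5,
    pool: int = 2,             # assumed 2 x 2 pooling after each convolution
) -> int:
    size = (image_size - kernel + 1) // pool  # 28 -> 24 -> 12
    size = (size - kernel + 1) // pool        # 12 -> 8 -> 4
    flat_features = conv2_channels * size * size
    return (
        conv_params(1, conv1_channels, kernel)
        + conv_params(conv1_channels, conv2_channels, kernel)
        + fc_params(flat_features, hidden_units)
        + fc_params(hidden_units, num_classes)
    )


total_parameters = count_parameters()
```

Under these assumptions the layer counts are 260, 5020, 16050, and 510, which sum to 21840.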

A. COMPARISON SCHEME
We compare the proposed solution CED+SDR with the following comparison methods:
• Perfect: all clients are selected in each training round, and perfect aggregation with error-free transmission is assumed.
• DC [5]: the client selection and receive beamforming are optimized with the two-step difference-of-convex-functions (DC) method, and the threshold of the mean squared error is set as 15 dB.
• CED only: clients are selected with the proposed CED method without beamforming optimization for each training round.
• SDR only: for the client selection sub-problem, the local clients with sufficient energy can be selected, which is similar to [15], [16]. Besides, the receive beamforming is optimized with the proposed SDR method.
Fig. 3(a) and Fig. 3(b) show the training loss and test accuracy of the proposed method and the baselines for the OTA FL system under the balanced data setting. The number of antennas for the PS is set as 10 and the noise power is set as −70 dB. We omit the training loss of the DC method and the SDR only method since they are too large compared with the other three methods, whose training loss is depicted in Fig. 3(a). From Fig. 3(b), it is observed that the proposed CED+SDR method performs better than the DC method and the SDR only method. Besides, from Fig. 3(a), one can observe that the training loss of the proposed CED+SDR method decreases faster than that of the CED only method, even though the accuracy of these two methods is close in Fig. 3(b). The DC method cannot converge as it only takes transmission power control into account and ignores the energy constraint of local clients for the OTA FL system. The SDR only method performs worst, as the clients selected by the SDR only method may have poor channels or inadequate energy.
The CED only method performs better than the SDR only method, indicating that the client selection has a larger impact on the learning performance of OTA FL than the receive beamforming optimization. Fig. 4(a) and Fig. 4(b) show the learning performance of the proposed method and the baselines for the OTA FL system under the unbalanced data setting, which follows a trend similar to that of the balanced data setting. From Fig. 3 and Fig. 4, we conclude that the proposed scheme can be applied to different data size distributions.

B. PARAMETER ANALYSIS
In this subsection, the impacts of the noise power, the number of antennas at the PS side, and the average arrival rate of the harvested energy on the learning performance of the proposed CED+SDR method are discussed for the OTA FL system. We show the experimental results of the proposed method under the unbalanced data setting as examples since the results of the proposed method under the balanced data setting have similar trends to those under the unbalanced data setting.
In Fig. 5(a) and Fig. 5(b), different noise power settings are investigated for the proposed CED+SDR method under the unbalanced data setting for the OTA FL system. The number of antennas at the PS side is set as 10. We can see that the accuracy of the OTA FL system decreases as the noise power increases. The reason is that the introduced noise makes the gradients deviate from their actual values in each training round. When the noise power is too large, such as −50 dB, the SNR is too low for the OTA FL system to converge. When the noise power is set as −80 dB, the proposed CED+SDR method has a similar performance to the Perfect method.
In Fig. 6(a) and Fig. 6(b), the impacts of the number of antennas at the PS side are shown for the proposed CED+SDR method under the unbalanced data setting for the OTA FL system. The noise power is set as −60 dB so that the impact of the number of antennas at the PS side on the learning performance can be observed easily. We can see that the proposed CED+SDR method converges even when the number of antennas at the PS side is set as 1. Moreover, as the number of antennas at the PS side increases, the proposed method performs better, which indicates that the proposed SDR method is effective for receive beamforming optimization of the OTA FL system.
To show the influence of the average arrival rate of the harvested energy $\bar{e}_k$ on the learning performance of the proposed CED+SDR method conveniently, we set $\bar{e}_k$ to the same value for all clients for the OTA FL system. Besides, the number of antennas at the PS side is set as 10 and the noise power is set as −60 dB. Fig. 7(a) and Fig. 7(b) show the training loss and the test accuracy of the proposed CED+SDR method with different average arrival rate settings for energy harvesting. We can see that the accuracy increases when the average arrival rate of the harvested energy $\bar{e}_k$ increases for the OTA FL system under the unbalanced data setting. It is observed that when $\bar{e}_k$ is set too small, such as 0.2, the proposed method cannot converge, as the transmission energy constraint makes the SNR too low. However, if $\bar{e}_k$ is too large, the learning performance cannot continue to improve, as the transmission energy of local clients is limited not only by the current battery capacity but also by the maximum transmission power.
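The effect of the arrival rate on how often a client can afford to train can be illustrated with a simple battery simulation. This is a rough sketch under strong assumptions, not the battery queue update in (16) or the CED selection rule: the client joins a round whenever its battery covers the computation energy of 800 samples at 0.001 J each, transmission energy is ignored, arrivals are integer Poisson counts in joules, and the battery starts empty and is capped at 20 J. All function names are hypothetical.

```python
import math
import random


def poisson_sample(rng: random.Random, mean: float) -> int:
    """Draw a Poisson variate with Knuth's method (fine for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def participation_rate(
    mean_arrival_j: float,
    rounds: int = 1500,
    samples: int = 800,
    energy_per_sample_j: float = 0.001,
    battery_max_j: float = 20.0,
    seed: int = 0,
) -> float:
    """Fraction of rounds in which the battery covers the computation energy."""
    rng = random.Random(seed)
    compute_cost = samples * energy_per_sample_j  # 0.8 J per round
    battery, joined = 0.0, 0
    for _ in range(rounds):
        if battery >= compute_cost:  # assumed participation rule
            battery -= compute_cost
            joined += 1
        # Harvest energy, then clip at the battery capacity.
        battery = min(battery + poisson_sample(rng, mean_arrival_j), battery_max_j)
    return joined / rounds


low_rate = participation_rate(0.2)
high_rate = participation_rate(1.0)
```

Under these assumptions, a client harvesting 0.2 J per round on average can afford the 0.8 J computation cost in only about one round out of four, while a client harvesting 1 J per round can take part in most rounds.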

VII. CONCLUSION
Energy management is one of the key issues for the OTA FL system. In this paper, we employ the energy harvesting technique for OTA FL and derive the convergence analysis of the optimality gap with respect to client selection, receive beamforming, energy constraint, and power control. Based on the convergence analysis results, we formulate an online MINLP optimization problem to minimize the optimality gap when jointly considering client selection and receive beamforming. The alternative optimization is developed to solve the MINLP problem. The CED-based method is proposed to optimize client selection decisions, and the receive beamforming is optimized with the SDR method. The simulation results show that our proposed method outperforms the other benchmarks.

APPENDIX A PROOF OF LEMMA 1
Let $o = \nabla F(w_t) - r_t$ denote the error introduced by client selection and noisy channels. When Assumption 1 holds, incorporating (18) into (19) gives (38). Taking the expectation of (38) and setting the learning rate to $\eta = \frac{1}{l}$ gives a bound on $\mathbb{E}[F(w_{t+1})]$. Let $\mathcal{K}_t = \{k \mid \beta_{k,t} = 1, k \in \mathcal{K}, t \in \{1, \ldots, T\}\}$ denote the set of selected clients, and let $\{k \mid \beta_{k,t} = 0, k \in \mathcal{K}, t \in \{1, \ldots, T\}\}$ be the set of unselected clients [3]; then $\mathbb{E}\|o\|_2^2$ is bounded in terms of the sample gradients $\nabla f(w_{k,t}, x_{k,i}, y_{k,i})$. Compared to typical FL [3], the difference for OTA FL lies in the introduced aggregation error part $\frac{S\sigma^2 \|m_t^H\|^2}{\alpha_t \left(\sum_{k \in \mathcal{K}} \beta_{k,t} D_k\right)^2}$ caused by noise.
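The noise-induced aggregation error term can be checked numerically. The sketch below assumes that the receiver post-processes the noise as $m_t^H n$ divided by $\sqrt{\alpha_t} \sum_k \beta_{k,t} D_k$, that the noise entries are independent circularly symmetric complex Gaussian with variance $\sigma^2$, and that $S$ counts the transmitted entries. These modelling details, the numerical values, and the function names are assumptions for illustration.

```python
import math
import random


def noise_error_term(S, sigma2, m, alpha, data_sum):
    """Closed form: S * sigma^2 * ||m||^2 / (alpha * data_sum^2)."""
    m_norm_sq = sum(abs(x) ** 2 for x in m)
    return S * sigma2 * m_norm_sq / (alpha * data_sum ** 2)


def simulated_noise_error(S, sigma2, m, alpha, data_sum, seed=0):
    """Monte Carlo estimate of ||m^H N||^2 / (alpha * data_sum^2)."""
    rng = random.Random(seed)
    std = math.sqrt(sigma2 / 2.0)  # per real and imaginary component
    total = 0.0
    for _ in range(S):
        # Project one noise vector (one entry per antenna) onto the beam.
        proj = sum(
            x.conjugate() * complex(rng.gauss(0.0, std), rng.gauss(0.0, std))
            for x in m
        )
        total += abs(proj) ** 2
    return total / (alpha * data_sum ** 2)


# Illustrative values (assumed): 4 antennas, noise power of -70 dB,
# 40 clients with 800 samples each, and an arbitrary scaling factor.
beam = [0.5 + 0.5j, 0.5 - 0.5j, -0.5 + 0.5j, -0.5 - 0.5j]
params = dict(S=21840, sigma2=10 ** (-70 / 10), m=beam, alpha=1e-6, data_sum=40 * 800)
analytic = noise_error_term(**params)
simulated = simulated_noise_error(**params)
```

The closed form shows why the error shrinks when more data is aggregated or when $\alpha_t$ is larger, and why it grows linearly with the noise power.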
Based on Assumption 3, we obtain (41). Incorporating (41) into (39) and subtracting $F(w^*)$ from both sides of (39) bounds the optimality gap $\mathbb{E}[F(w_{t+1})] - F(w^*)$. According to (14), (15), and (17), which represent the calculation of the total energy consumption $e^{\mathrm{tot}}_{k,t}$, the energy consumption constraint caused by the available battery $b_{k,t}$, and the definition of the transmission energy threshold, respectively, the normalization scaling factor $\alpha_t$ at the t-th training round needs to satisfy a corresponding condition, from which $\alpha_t$ can also be expressed explicitly. Based on Assumption 2 and Assumption 3, we further have $\|g_{k,t}\|^2 \leq \lambda_1 + \lambda_2 \|\nabla F(w)\|^2$.