A Hybrid Model-Based and Data-Driven Approach to Spectrum Sharing in mmWave Cellular Networks

Inter-operator spectrum sharing in millimeter-wave bands has the potential of substantially increasing the spectrum utilization and providing a larger bandwidth to individual user equipment at the expense of increasing inter-operator interference. Unfortunately, traditional model-based spectrum sharing schemes make idealistic assumptions about inter-operator coordination mechanisms in terms of latency and protocol overhead, while being sensitive to missing channel state information. In this paper, we propose hybrid model-based and data-driven multi-operator spectrum sharing mechanisms, which incorporate model-based beamforming and user association complemented by data-driven model refinements. Our solution has the same computational complexity as a model-based approach but has the major advantage of having substantially less signaling overhead. We discuss how limited channel state information and quantized codebook-based beamforming affect the learning and the spectrum sharing performance. We show that the proposed hybrid sharing scheme significantly improves spectrum utilization under realistic assumptions on inter-operator coordination and channel state information acquisition.


I. INTRODUCTION
M ILLIMETER-WAVE (mmWave) communications appear as a promising solution to support extremely high data rates and low latency services in future wireless networks [1]. Although mmWave bands offer a much wider spectrum than the commonly used sub 6-GHz bands, it is still essential to seek an optimal use of the spectrum with the ultimate goal of maximizing the benefits for users while fostering healthy competition in the spectrum market [2]. Spectrum sharing addresses these goals by allowing multiple service providers (hereafter called operators) to access the same band for the same or different uses. This paper investigates the case of spectrum sharing for mobile broadband services among multiple mobile operators [3]- [7].
Manuscript received September 20, 2019; revised January 9, 2020; accepted March 2, 2020. Date of publication March 17, 2020; date of current version December 9, 2020. This work was partially sponsored by the Ericsson project SPECS II and the Swedish Research Council under grant 2018-00820. The associate editor coordinating the review of this article and approving it for publication was H. T. Dinh. (Corresponding author: Hossein S. Ghadikolaei.) Hossein S. Ghadikolaei Spectrum sharing provides substantially more bandwidth to individual operators but gives rise to increased interference levels. This is usually addressed by heavy coordination among the base stations (BSs) and computationally-prohibitive optimization problems. In mmWave networks, however, large antenna arrays, directional communications, and the unique propagation environment substantially simplify the problem of managing interference in a shared spectrum, making it more feasible [8].

A. Literature Survey
A series of recent works proposed various technology enablers and performance evaluation methods that help realize the vision of managing the spectrum without bounds and networks without borders [7], and ultimately making the best use of radio spectrum, see [5], [9] and references therein. In particular, Hu et al. [5] conducted a comprehensive survey on the benefits of spectrum sharing in four application scenarios of future wireless networks: wider coverage, massive capacity, massive connectivity, and low latency.
Rebato et al. [4] proposed a hybrid spectrum sharing scheme in mmWave networks, where an operator has exclusive access to some parts of the mmWave bands but also some shared access to some other mmWave bands. The authors showed the advantages of this hybrid method (where data packets are scheduled through two mmWave carriers with different propagation characteristics) over traditional fully licensed or fully pooled spectrum access schemes. Jurdi et al. [6] used a system-level analysis to show that infrastructure sharing can be advantageously combined with sharing spectrum licenses in the mmWave bands.
Coordination mechanisms have a large impact on the gains that spectrum sharing can achieve, and are intertwined with the supporting architectural solutions [3], [5], [8], [10]- [13]. The early work by Mihovska et al. proposed an approach for both intra-and inter-operator coordination scenarios and concluded that operators can advantageously pool spectrum resources when network loads are temporarily uneven among the cooperating operators [10]. Shokri-Ghadikolaei et al. [8] showed that the large antenna setting can reduce the need for inter-operator coordination. In fact, they showed that in the case of digital beamforming with ideal channel estimations inter-operator coordination can be limited to cell-edge users.
A large part of the literature utilizes the increasing number of antennas to form narrow beams, which reduces both intra-This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ and inter-operator interference, defined as the interference within the same or among different operators. However, the inherent imperfections in terms of errors in the channel state information (CSI) acquisition, hardware limitations, and the constraints of quantized code-books make inter-operator coordination a necessary ingredient of managing a common spectrum pool [8], [14]. Besides, there is no consensus on how to properly model the coordination cost.
Due to the complexity and inherent data acquisition difficulties of coordinating a large set of radio network nodes, learning-based coordination mechanisms to better manage the spectrum sharing were recently proposed by [15]- [17]. Unfortunately, the schemes developed in [15] and [16] suit secondary users and are not directly applicable in inter-operator spectrum sharing scenarios, in which the participating operators share the spectrum pool on an equal right basis. In contrast, the Q-learning framework of [17] facilitates interoperator sharing by the mechanism of intelligent user offloading. However, none of these schemes addresses the problem of optimizing the network utility while maintaining an acceptable level of coordination and setting the precoders and combiners to reduce the intra-and inter-operator interference.

B. Model-Based Approaches for Spectrum Sharing
Model-based approaches, while being ubiquitous in communication systems [18], may rely on inaccurate and unrealistic assumptions for the sake of mathematical tractability. Consequently, performance evaluation and protocol development based on such approximated and inaccurate models run the risk of not working well in practice [19]. Datadriven approaches address this disadvantage by learning and optimizing from the data -usually acquired by measurements -making minimal assumptions on the system model. These approaches have been the core of the success of modern machine learning and artificial intelligence. Data-driven approaches, however, may need a large number of training samples to perform well, which are hard to obtain in most wireless networks due to their inherent nonstationary nature [20]. This is indeed the case for general network optimization problems and in particular for spectrum sharing [21].
In this paper, we advocate the use of a hybrid approach for spectrum sharing, in which the model-based part operates on a small timescale, whilst the data-driven part operates on a coarser time scale and refines the models used in the model-based part. The benefit of hybrid approaches has been demonstrated in the context of speech signal processing for the localization and tracking tasks [22] and in these parallel and independent works [23], [24].

C. Contributions of the Present Paper
In this paper, we propose a framework to analyze and quantify the benefits of spectrum sharing over exclusive spectrum access for a multi-operator millimeter-wave network. More specifically, we capture the trade-offs among the signaling cost, coordination complexity, and overall network performance by an optimization task that takes as input a model for the rate functions and returns the optimal association and coordination policies throughout the network along with proper beamforming vectors. We then augment this approach by adding a learning functionality that continuously refines the rate models to compensate for missing information (mostly missing CSI) and to keep the signaling overhead manageable. To enable this new function, every operator runs some carefully designed rate measurement tasks, reports the results to a cloud server that keeps an updated dataset for the learning and runs the spectrum sharing optimization problem using the updated data-driven rate models. The main contributions of our work can be summarized as follows: • We propose a new generic and tractable approach for modeling the cost of coordination among multiple BSs, which is of significant interest on itself, beyond the scope of this paper. • We investigate the gains of beamforming and coordination for spectrum sharing schemes in mmWave networks. We argue that a pure model-based solution approach to this problem is infeasible, mainly due to modeling inaccuracy, the overhead of pilot transmission, and the lack of sufficient information (including erroneous or completely missing CSI). • We develop a hybrid model-based and data-driven approach where the model-based part optimizes the decision variables (association and coordination) and finds proper beamforming vectors, and the data-driven part sequentially and continuously refines the model. Our approach has the same computational complexity as the pure model-based approach but operates with a much lower signaling overhead. 1 • We then use domain-specific knowledge (large antenna arrays and the sparse scattering environment of mmWave systems) to properly initialize the learning process to minimize its running complexity while guaranteeing the user performance. • We discuss how large antenna arrays, limited feedback, and imperfect/missing CSI affect the learning process and consequently the spectrum sharing performance. Conceptually, our hybrid solution could be considered both in a centralized and in a more realistic distributed implementation.

D. Paper Organization
The rest of the paper is organized as follows. We introduce our system model, including a novel coordination model, in Section II. We formulate the problem of spectrum sharing in Section III and discuss the complexities of pure modelbased approaches. Section IV develops our hybrid solution approach and numerical performance evaluations. We provide important engineering insights in Section V, followed by concluding remarks of Section VI. Due to space limitations, we have provided all the proofs and extended numerical results in the extended version of this paper [25].
Notations: Capital bold letters denote matrices and lower bold letters denote vectors. The superscripts (X ) T , (X ) H , (X ) † stand for the transpose, transpose conjugate, and Moore-Penrose pseudo-inverse of X, respectively. The subscript [X ] mn denotes entry of X at row m and column n, and [X ] n represents column n of X. I x , and 1 x , and 0 x are the identity, all-one, and all-zero matrices of size x, respectively. Table I lists the main symbols used in the paper.

II. SYSTEM MODEL
In this paper, we use the following system model for our model-based approach that we propose in Section III. This system model is generic and embraces distinct model elements for the network, the employed association scheme, the deployed antenna and channel models, and models for beamforming and multi-operator coordination.

A. Network Model
We consider the downlink of a multi-operator cellular network with a total bandwidth W to be shared among Z operators in the network. Each operator z controls and operates the subset B z of the BSs such that B = B 1 ∪ B 1 ∪ . . . ∪ B Z is the set of all BSs in the network. With no infrastructure sharing, for example, {B z } Z z =1 are disjoint sets. We denote by U the set of all UEs, by U z the set of all UEs of operator z, and by W z the bandwidth of operator z. Without loss of generality, we assume universal frequency reuse within an operator's network. Consequently, all non-serving BSs of an operator cause interference to every user equipment (UE) of that operator in the downlink.

B. Association Model
We denote by a bu a binary variable that is equal to 1 if UE u ∈ U is served by (or associated to) BS b ∈ B. We collect all binary control variables a bu in association matrix a, where A = z ∈[Z ] A z . Binary matrix A z of size |B| × |U | is the association of operator z, namely [A] bu = 1 if and only if u ∈ U z and a bu = 1. Let N b = u∈U a bu and A b be the number and the set of UEs that are being served by BS b, respectively. We also call N b the load of BS b. Note that without national roaming, each BS can serve only UEs of the same operator. Namely, a bu = 0 for all b ∈ B z , u ∈ U k where z = k. We first impose the constraint that national roaming is not permitted, which will be relaxed in Section III-D to examine the potential performance improvement due to national roaming.
We define the association period as a consecutive series of coherence intervals (CIs) over which association A remains unchanged, see Fig. 1. Although beamforming should be recomputed every CI, the association is a long-term process in the sense that it remains fixed over some CIs [26]. Such an assumption is natural, due to the inherent cost of handover for re-association. In this paper, we investigate the performance of optimal association; i.e., we find the optimal A z for all operators. Using these associations, the BSs and UEs recalculate their beamforming vectors every CI. To avoid the interplay between the short-term scheduling and the association problem, which should be handled at different time scales, we ensure that each BS can serve all its associated UEs simultaneously by imposing that the number of served UEs is compatible with the number of RF chains at each BS.

C. Antenna and Channel Model
We consider a half wavelength uniform linear array (ULA) of N BS antenna elements for all BSs and a ULA of N UE antennas for all UEs, albeit our mathematical framework can be easily extended to other antenna models. We consider a narrowband mmWave channel model [27]. Let N bu be the number of paths between BS b ∈ B and UE u ∈ U, and g bun be the complex gain of the n-th path that includes both path loss and small scale fading. In particular, g bun is a zero-mean complex Gaussian random variable with E[|g bun | 2 ] = L bu for n = 1, 2, . . . , N bu , where L bu is the path loss between BS b and UE u. The channel matrix between BS b and UE u is given by where a BS ∈ C N BS and a UE ∈ C N UE are the vector response functions of the BSs' and UEs' antenna arrays to the angles of arrival and departure (AoAs and AoDs), θ BS bun is the AoD of the n-th path, θ UE bun is the AoA of the n-th path, and (·) H is the conjugate transpose operator. For a ULA with half wavelength antenna spacing at the BS, we have a BS (θ) = 1 √ N BS 1, e j π sin(θ) , . . . , e j (N BS −1)π sin(θ) H .
(2) a UE (θ) can be obtained from (2) by changing N BS to N UE .

1) Analog Combiners:
To simplify the implementation requirements, we consider an analog combiner using phase shifters at the UE side (only one RF chain per UE). With these phase shifters, each UE can only change its antenna boresight. Let w UE u ∈ C N UE be the combining vector of UE u. Assume that BS b serves UE u and that the estimates of the channel gains and the corresponding AoAs are available. We pick for UE u the analog combiner that maximizes its link budget [28], namely 2) Precoders: For the sake of presentation simplicity, we assume that each BS employs a fully-digital precoder. At the end of this subsection, we show how to extend our derivations to the case of hybrid (analog-digital) precoding.
We assume that all the UEs are concurrently served by their respective BSs with multiuser MIMO. To ensure this, we impose the condition N b ≤ N BS for all BSs and all operators in the next sections. Let W BS b ∈ C N BS ×N b be the digital precoding matrix at BS b whose u-th column w BS bu ∈ C N BS is the precoding vector for UE u. We define the transmitted sym- are the data symbols for the N b UEs of this cell with normalized power, and ρ Tx is the average transmit power at each BS. Moreover, λ b normalizes the maximum transmit power of the BS b to ρ, namely We consider regularized zero forcing (RZF), which is of practical interest for minimizing the inter-BS (within and among different operators) interference. 2 For every BS b, define H b as the effective channel that the digital precoder observes containing (w UE u ) H H bu for several u in its rows; formally defined later in this section.
Suppose that UE u is being served by BS b, and that (w UE u ) H H bu has appeared in row m of H b . Using RZF, the precoding vector of UE u is where δ is an arbitrary (usually very small) positive number, and I is an identity matrix of proper size.
In the case of hybrid precoding, we can still design W BS b based on (5) and then approximate the true hybrid precoding matrix with a cascade of an analog and a digital precoder, while satisfying the constant-modulus constraint of the analog precoder; see, e.g., [29] and [30].
3) Coordination: Let UE u be served by BS b using combiner w UE u . Define the effective channel between any BS i ∈ B and any UE u as H iu := (w UE u ) H H iu . In fact, the effective channel is the actual channel between BS i and UE u processed by the analog combiner of the UE. We define binary matrix then acquiring this effective channel has a much lower cost than if i and b belong to different operators. To model this, we add a penalty for the coordination to promote the optimal use of coordinations. For a given association of the BSs and UEs a, we assign a penalty [P] iu corresponding to the element [C ] iu of the coordination matrix. For sake of simplicity, in the following, we consider a constant penalty matrix P, though it can be in general a function of the distance, operator load, and number of antennas, among others. The penalty terms may vary for each operator, reflecting various billing policies. When incurring almost no cost of estimating the channel of the own served UEs, a higher cost of estimating the effective channel of a UE within operator, and an even higher cost of estimating the effective channel of UEs of other operators. This abstraction of the penalty matrix facilitates the cross-layer design of spectrum sharing. Notice there should be some inter-operator architectural support whose design is out of the scope of this paper. Interested readers are referred to [3] and references therein. To implement the penalty matrix, we recall the set of BSs and UEs of all operators. Furthermore, the penalty matrix P 0 represents the cost associated with channel estimation, where [P 0 ] bu is the penalty when BS b estimates the channel of UE u. This penalty may not be identical for all non-serving operator.
Example 1: Let p andp denote the penalty of intra-operator and inter-operator coordinations, incurring identical costs for all operators. The template penalty is then computed as Given P 0 , we then set where a bu ∈ {0, 1}, u ∈ U, and p b is the coordination penalty for the UEs associated to BS b. Note that a bu = 1 means BS b serves UE u. The coordination cost in each CI to serve users of operator z is thus u∈Uz b∈B Ultimately, the goal is to find the optimal coordination policy that maximizes a network objective (e.g., sum-rate of the UEs) while bounding the coordination cost; see Section III. As we show throughout this paper, under realistic settings for CSI acquisitions and network topologies, this optimization task is possible by a hybrid model-based and data-driven approach.
The effective channel H i is a matrix of dimension Example 2: To illustrate the notations of this paper, Fig. 2 shows an illustrative example with two operators, each having 2 BSs and 5 UEs. We run this example throughout the paper. Every BS can estimate the effective channel of its associated UEs. BS 2 can estimate the effective channel toward UE 5 via intra-operator coordination. Moreover, BSs 1 and 2 (of operator blue) can estimate their effective channel toward UE 7 (of operator red). For this topology, For all b ∈ B and u ∈ U, the penalty of coordinating with associated UEs is p b = 1, while the penalty of intra-operator and inter-operator coordination is 10 and 100, respectively.   (3). Then, given a coordination matrix C, every BS i obtains H i and finds w BS bu from (5). The data transmission phase then follows.

III. PROBLEM FORMULATION AND SOLUTION APPROACHES
In this section, we formulate the problem of spectrum sharing among multiple operators. Specifically, we use the models of Section II and then show the complexity and limitations of this model-based approach to optimize the beamforming, association, and coordination for spectrum sharing. Note that all the variables with superscript b are operator-dependent. This dependency exists since BS b belongs to B z for some z.

A. SINR and Rate for Model-Based Approach
We define a cell as the set of UEs that are served by the same BS. The received power at each UE u ∈ U z when the serving BS is b ∈ B consists of the desired power ρ Rx , intra-cell interference I (1) , inter-cell interference I (2) , interoperator interference I (3) bu , and noise power spectral density σ 2 . I (1) corresponds to the signals transmitted to other UEs by the same BS. I (2) denotes the interference from the signals transmitted by other BSs of the same network operator. I We first note that the received power at UE u from BS b is Recall the definitions of the binary association variables a ij and the set of associated UEs A i . Each BS serves multiple UEs at the same time and frequency resources, as UEs are separable at the spatial domain. The intra-cell and inter-cell interference to UE u ∈ U z when served by BS b ∈ B z are For UE u, inter-operator interference I bu depends on the set of operators (and BSs) that share the same bandwidth. Without loss of generality, we assume that W z = W . With universal frequency reuse, UE u receives interference from all BSs of all operators, and the inter-operator interference can be expressed as Note that the special characteristics of mmWave networks, such as high penetration loss and directional communications, substantially reduce the interference components (9)-(11), compared to sub-6 GHz systems, as established in [31]. We use this property later on in Section IV to substantially reduce the complexity of the hybrid model-based and data-driven optimization algorithm by a proper initialization.
The long-term rate that UE u will receive from all BSs is where the expectation is over all random channel gains. Notice that we do not assume joint transmission, so b∈B a bu = 1 for all u ∈ U. Sharing the spectrum increases the bandwidth available to each operator (with a prelog contribution to the rate in high SINR regimes); however, it also increases the interference power. As we discuss later in this section, not being able to compute r u due to missing CSI is an important disadvantage of the model-based approaches.

B. Optimal Spectrum Sharing With Model-Based Approach
For given a and C, and in every CI, BS b estimates H b , and finds the digital precoding and analog combiner using (5). Given that each BS can evaluate the average rate r u for its associated UEs from (12), a cloud server (logical controller) collects {r u } from all BSs, computes the coordination cost per CI from (7), and evaluates a network utility f z (A, C ) for operator z. Given r u in (12), we use a logarithmic utility that ensures both high network throughput and some level of fairness among individual UEs [26]: Given B and U , the controller computes P 0 from (6) and formulates the following optimization problem to find the optimal association and coordination strategies: where {α z } z are a set of positive constants that scalarize the multi-objective optimization problem, and Z z =1 α z = 1. Constraint (14b) guarantees association of each UE to only one BS, mitigating joint scheduling requirements among BSs. Constraint (14c) ensures that N b ≤ N BS , so all N b UEs that are associated to BS b can be served together with multiuser MIMO. If N b < N BS , some RF chains will be switched off, and the BS automatically gives higher transmit power to the active RF chains. Constraint (14e) ensures that the coordination cost of every operator is upper-bounded by its maximum budget P max z . Constraint (14f) ensures that the UEs of operator z can be only served by BSs of the same operator.
Remark 2 (Signaling Complexity): To compute the rate function, (14a), and thereby solving (14), an operator should coordinate with all UEs, leading to a coordination cost of u∈U b∈B [P ] bu . Notice that (14e) is the coordination cost of network operation when the solution to (14) is deployed.
Special Case (National Roaming Variant of P 1 ): We can modify P 1 to allow for national roaming. To this end, we should only replace (14b) by b∈B a bu = 1, ∀u ∈ U, replace (14c) by u∈U a bu ≤ N BS , ∀b ∈ B, and remove constraint (14f).

C. Practical Considerations for Model-Based Approach
While theoretically sound, optimally solving P 1 (and its distributed variant) with the signaling and time-limitations of the conventional radio access and core networks would be infeasible. To solve P 1 , for instance, the BSs of every operator should be able to send (or receive) pilot signals to all UEs of all operators and exchange a huge amount of information with a central controller, which should then solve P 1 . The complexity and cost of such level of channel estimation and coordination grow large with the number of BSs and UEs, and are in general overwhelming for mmWave networks with dense BS deployment. Moreover, if BSs or UEs belong to different network operators, a huge inter-operator signaling via the core networks is required for synchronization and for the calculation of I (3) bu . Furthermore, channel aging may render the exchanged information outdated before it serves its purpose. To tackle this problem, most of the works in the literature consider the noise-limited assumption and ignore the interference terms, see [32] and references therein, namely I bu ≈ 0. This is a rather limiting assumption, and it has been shown that a few links may observe strong interference [33]. Moreover, with the interference-free assumption, there is no gain of using a precoder to reduce the interference, which would be an incorrect design decision.
These impairments have prohibited the application of optimal spectrum sharing in state-of-the-art wireless systems. Nonetheless, the solution of P 1 gives a theoretical upper bound for the performance of spectrum sharing (a benchmark). In the following, we take a data-driven approach as a completely different alternative to address the problem of spectrum sharing in mmWave networks.

D. Illustrative Numerical Results
In this section, we numerically investigate the effect of the input/design parameters, namely, the number of antennas, network topology, association, and coordination levels. We use these insights to develop an efficient hybrid approach in the next section.
We consider an illustrative scenario of two operators, each having 2 BSs and 10 UEs with the topology of Fig. 2. We generate 100 random channels, find the beamforming vectors in every realization, and evaluate the interference terms. We consider two antenna settings: (N BS = 8, N UE = 2) and (N BS = 64, N UE = 16). For all b and u, we set p b = 1, the intra-operator coordination penalty to 10, and the inter-operator coordination penalty to 100. Fig. 3 shows three example settings for the association and coordination matrices. In the first scenario, Fig. 3(a), we assume no coordination among UEs and unintended BSs, namely C = A. In the second scenario, Fig. 3(b), we set [C ] bu = [A] bu and then allow BS 1 to estimate the effective channel toward UE 6 and cancel the resulting interference. In Fig. 3(c), we assume full coordination, namely [C ] bu = [A] bu . This level of coordination may improve the rate performance at the expense of a very high coordination cost. Moreover, for every antenna setting, we run P 1 and its national roaming variant, introduced in Special Case of Section III-B. Fig. 3(d) shows the optimal association and coordination for (N BS = 8, N UE = 2) and P max z = 120 (up to one inter-operator coordination) with national roaming. To find this solution, we first apply a continuous relaxation to the binary constraints of P 1 and then rounding to recover binary solutions. Furthermore, we assume that [A] bu = 1 implies [C ] bu = 1 for every b ∈ B and u ∈ U, which further reduces the feasibility space. This is a natural simplification of the optimization problem, as a serving BS will always estimate the channels of its serving UEs. Table II shows the performance of the network under three scenarios of Fig. 3 and the optimal solution, obtained from P 1 and its national roaming variation. From this table, coordination substantially reduces the interference and improves both the network sum rate and the minimum UE rate. This improvement is significant for UE 6, which is served by BS 3 (belongs to the red operator) but is located very close to BS 1 (belongs to the blue operator). Imposing [C ] 16 = 1 leads to a substantial reduction of I (3) and thus to an improvement in the achievable rate.
For the small antenna setting (N BS = 8, N UE = 2) and for the topology of this example, the availability of national roaming can substantially reduce the overall coordination overhead by selecting a much better association. The optimal serving BS for UE 6 is now BS 1, and consequently, the coordination cost reduces from 100 (i.e., inter-operator cost) to 1 (i.e., p b for associated UEs). The use of large antenna arrays (N BS = 64, N UE = 16) reduces the interference footprint and the need for coordination. Still, selecting a better association and coordination solution lead to an improvement in the rate performance. However, as mentioned before, this may entail a formidable signaling overhead.

IV. HYBRID SOLUTION APPROACH
So far, we have observed that neither P 1 nor its distributed variant can be solved in practice due to missing CSI and lack of proper rate models. Data-driven approaches bypass the need for precise modeling techniques and are thereby less sensitive to missing features and modeling inaccuracies. In this section, we propose that the learning task continuously refines the rate model of every UE rather than optimizing the decision variables. The model-based part then uses the updated rate models to find proper association and coordination strategies.
To enable this hybrid solution approach, we introduce two types of frames, training and operation, designed to improve the interplay among the exploration and exploitation and quality of service at UEs. In the training frames, the BSs and UEs use a randomized policy to explore the space of "proper" solutions for (A, C), formally described in Section IV-E, and to improve the rate models. In the operation frames, the operators apply a previously found good solution to protect the UE performance from potentially weak rates of some candidate (A, C). The new solutions will be applied to the operation frames only after passing a predefined confidence on their rate performance, measured in several training frames. Fig. 4 illustrates the proposed hybrid approach.

A. Data-Driven Part
Developing a solution approach for P 1 is challenging. First, due to the lack of a closed-form solution, we need iterative approaches to solve P 1 . These solvers must evaluate the objective function for several A and C matrices, until convergence. Thus, one needs to send additional pilots to evaluate the updated combining vectors at the UEs (which change as a is updated), and estimate some new channels {H bu } for some b and u. These additional pilot transmissions and channel estimations can be very expensive as we may need many iterations before convergence, and we may typically end up in a situation where we have to estimate almost all the channels; clearly this is impractical in a cellular network. Moreover, it is at odds with the coordination cost model (7), where we consider the cost associated with the final solution only. Second, when we know the effective channels corresponding to the final solution, every BS computes ρ Rx bu and I (1) bu from (8) and (9), and feed them back to the cloud server. However, we have access only to some summands of I (2) bu and I (3) bu for which the respective entry of C is 1. Consequently, the central controller cannot compute I (2) bu + I (3) bu and therefore the objective function. To address these challenges, the data-driven part takes as input the network topology, the association matrix a, the coordination matrix C, the effective channels H b , and outputs an approximation of the rate of UE u, denoted by r u . More specifically, the data-driven part is comprised of two components: a dataset and a learning method. Each entry of the dataset includes (A, C , H b ,{r u } u∈U ), while the learning method approximates the rate function. We maintain a dataset at the cloud server and update it before and after every training frame; see Section IV-C.
At every CI, BS b measures r u for its associated UEs (having [A] bu = 1). This is simply done by a feedback from the UE reporting its throughput in this CI. It collects these values and reports them to the cloud server prior to every training frame. The server updates the input-output dataset along with the mapping r u (A, C ) for all u ∈ U, and computes the next tuple (A, C) to be examined in the following training frame. This is done by the EXPLORE function. After that frame, the cloud updates the dataset and the rate models and decide whether to apply new association and coordination solutions to the subsequent operation frames. Initialize A (0) , C (0) (described in Section IV-B) 13: for k = 1, 2, 3, . . . do 14: Run A-step and find A (k +1) using (17) 15: Run C-step and find C (k +1) using (18)  In Section IV-D, we discuss how to initialize the rate models. The cloud server then gradually updates these models with any new entry in the dataset through the UPDATE procedure. The other functions of this algorithm, called by the operators, will be illustrated in Algorithm 1.

B. Model-Based Part
Given the updated rate models, the cloud server formulates and solves an optimization problem similar to P 1 and finds the new association and coordination solutions. In the following, we derive the modified optimization problem and develop a solution algorithm.
We start by re-writing the optimization problem as a function of A and C. We write (7) as where 1 is a matrix of ones having appropriate size. Then, we can rewrite the coordination cost (7) as u∈Uz b∈B where • is the Hadamard product, and (·) T is the transpose operation. If required, every operator can obtain an approximation of the rate functions of its UEs through the DOWNLOAD function of Algorithm 1, and then find an approximation of f z (A, C ), denoted by f z (A, C ), for any a and C, where f z = u∈Uz log r u . We can now write the modified optimization problem as: s. t. Constraints (14b), (14c), (14f), and (14g) Notice that the computational complexity of (16) is of the same order of magnitude as that of (14), and we can reuse the existing solution algorithms of the pure modelbased approach, (16), in the model-based part of our hybrid approach. However, the main benefit of (16) is having a much lower signaling complexity and latency to acquire the needed channel state information. In many cases, we may not be able to compute the objective function of (14) due to the heavy signaling complexity and other challenges involved; see Section III-C. In general, the objective f z is not jointly convex in A and C, and the space of the problem is combinatorial. Thus, we employ the block-coordinate descent (BCD) framework (also known as alternating optimization) [34], where P 1R is split into two subproblems solved iteratively: (A-step) to find the optimal association and (C-step) to find the optimal coordination.
Denoting by A (k ) and C (k ) denote the values for a and C at iteration k, BCD yields the following update rules: (A-step): s. t. Constraints (14b), (14c), and (14f) (17d)

Algorithm 2 Hybrid Model-Based and Data-Driven Spectrum Sharing
Input: CI index n; An indexed sequence of training and operation frames; a feasibility space for the association and coordination F Output: Beamforming vectors in every CI, optimal a and C Find the precoding vectors from (5) 8: Operate with those precoding and combining vectors 9: Measure ru at the end of CI n and record it 10: if CI n is a training frame then 11: Run UPDATE(A, C , {ru }) at the cloud for rates obtained from all UEs in CI n 12: Set Run (A , C ) = OPTIMIZE(A (0) , C (0) ) at the cloud 14: Run Clear recorded rates at every BS 16: if confidence criteria met for (A new , C new ) then 17: Set A of ← A new and C of ← C new 18: end if 19: Set A ← A of and C ← C of 20: end if 21: if CI (n + 1) is a training frame then 22: Run UPDATE(A, C , {ru }) at the cloud for rates obtained from all UEs in the previous operation frames 23: Set Run (A , C ) = OPTIMIZE(A (0) , C (0) ) at the cloud 25: Run Set A ← A tf and C ← C tf 27: Clear recorded rates at every BS 28: end if 29: end for (C-step): Although the above subproblems are combinatorial, they may be still be solved effectively using binary programming or branch-and-bound solvers [35]. We must emphasize that the use of BCD drastically reduces the size of the search space from O(2 |B| 2 |U | 2 ), for the joint optimization optimization in P 1R , to O(2 |B||U | ) for each BCD iteration. Moreover, we can further seek sufficient conditions on the approximation functions f u . For instance, when the learning function is bilinear in A and C, and the coordination penalty matrix consists of integers values, then linear program relaxation of these subproblems is optimal or close to optimal [35]. In the future, we will investigate efficient solution methods and relaxations for P 1R . This current work, however, is aimed at showing the usefulness of this approach, rather than its large-scale implementation.
Moreover, not being able to show the local optimality is a known downside of almost all first-order methods (including BCD) in a nonconvex landscape. Indeed, the iterative algorithms may converge to a saddle point, which is stationary but neither local maxima nor minima. However, recent studies showed that the gradient noise in the stochastic (mini-batch) gradient along and the use of the perturbed gradient descent method, as we have used in our work, are efficient approaches to escape first-order saddle points [36].
Let A of and C of denote the association and coordination matrices for operation frames, A tf and C tf denote the association and coordination matrices for a training frame, and A (k ) and C (k ) denote the association and coordination matrices at iteration k of BCD. Algorithm 2 is a pseudo-code of our hybrid solution approach. Below, we show the monotonically increasing nature of the BCD updates.
Lemma 1 (Convergence of BCD): Let f z be continuous biconcave in a and C. Then, the BCD updates in (17) and (18) ). Moreover, the updates converge to a limit point lim k →∞ f z (A (k ) , C (k ) ).
Although the convergence of BCD updates to a limit point is shown using standard BCD results, establishing that the limit point is stationary with respect to P 1R is more challenging. Indeed, the coupling between a and C in constraint (16c) implies that the conventional BCD convergence cannot be applied to show that lim k →∞ f z (A (k ) , C (k ) ) is a stationary point of P 1R .

C. Training Frames
The OPTIMIZE function of the server will be re-executed before and after every training frame. The purpose of these frames is to dynamically refine the current rate models and thereby find a better association and coordination solution. Naturally, we expect a high frequency of training frames in the first few association periods (as we assume no a priori knowledge of the network), while this frequency can be decreased as we obtain more knowledge on the rate models. In the presence of non-stationary environments, where the rate distributions are changing over time, we may need to add enough training frames to enable the tracking functionality. In Section IV-E, we numerically investigate how many training frames are required to find a close-to-optimal solution after a change in the number of UEs.
Before every training frame, the server gets all the new rate measurements, updates its models, and re-executes the BCD procedure. It then runs a randomized policy on a set of feasible solutions F ⊆ {0, 1} |B|×|U | × {0, 1} |B|×|U | and returns one association and one coordination matrix to be explored in the following training frame. After this exploration, the cloud updates the rate models and checks whether there is a new "reliable" solution to be applied in the operation frames. This reliability can be measured in terms of some predefined confidence bounds on the objective function. The consequence of this conservative approach is protecting UEs from service interruption due to unsure A and C.

D. Initializations
We underline the importance of initializing both the UPDATE procedure and the OPTIMIZE function. More specifically, we discuss a "good" starting point to speed up learning {r u }, and initial solutions A (0) , C (0) to the BCD algorithm.
1) Rate Model: We first observe that severe pathloss, blockage, and directionality substantially reduce the interference footprint of mmWave networks in both cellular [31] and ad hoc [37] settings. In this case, we can use the well-known Gaussian approximation for the interference by an i.i.d. realization of a Gaussian process [38]. In particular, (19) where I We can now prove the following proposition.
Proposition 1: Let A and C be given, [A] bu = 1, N bu = 1, θ UE bu and θ UE iu be AoA of the LoS links between UE u and BSs b and i, respectively. Let L iu = E[|g iun | 2 ] for n = 1 (LoS path). Then, where sinc(x) is sin(x)/x for x = 0 and 1 for x = 0. Notice that (20) is valid for N bu = 1, namely single path between BS b and UE u. However, we have numerically observed that (20) indeed leads to a very good initialization of the rate models, which could be due to the sparse scattering characteristic of the mmWave systems.
From the definition of RZF, I (1) bu = 0 and ρ Rx bu = N BS N UE L bu ρ Tx for any feasible coordination solution in which a BS obtains the CSI of its associated UEs. Using (20), we can also simplify the expressions of I (2) bu and I (3) bu in (19). Employing these expressions, the cloud server can initialize the rate models for every association and coordination matrices A and C with one of the following three scenarios: • Full topological knowledge: If the cloud server knows a priori L iu , θ UE bu and θ UE iu for all i,b,u such that [A] bu = 1 and [C ] bu = 0, then it substitutes (20) into (19), and sets I In this case, our initialization reduces to the well-known interference-free assumption [8], [32], [39]. Moreover, we set ρ Rx bu = N BS N UE ρ Tx for all BS and UE pairs. After the initialization, the cloud server gradually updates the rate models with any update in the dataset through the UPDATE procedure of Algorithm 1.
2) BCD Solver: To initialize the BCD iterations for the very first time, we use the INITIALIZE function (in Algorithm 1) with one of the following options: • Full/partial topological knowledge available: We use the following rule as an approximation of the strongest BS association. For every z and u ∈ U z , [A (0) ] bu = 1 for b ∈ arg max b∈Bz L bu . We then set C (0) = A (0) . • No topological knowledge available: We randomly allocate UEs to BSs within the same operator. We then set C (0) = A (0) . In the subsequent frames, we initialize the BCD solver by the current association and coordination matrices used in the operation frames.

E. Illustrative Numerical Results
In this section, we numerically investigate the performance of our proposed spectrum sharing approach. We use the same network as that in Table II, a CI of 1 ms, and two antenna configurations, small (N BS = 8, N UE = 2) and large (N BS = 64, N UE = 16). The network is stationary during the simulation, so that the optimal association and coordination are fixed. In this case, the optimal performance of the solutions are presented in Table II. For the learning task inside the UPDATE procedure, we use a fully-connected deep neural network with 1 input layer having 2|B||A| nodes, 5 hidden layers each having 20 nodes, and one output layer having |U | nodes. We use a quadratic loss (for the regression task) and train the neural network with backpropagation, mini-batch gradient method with a mini-batch size of 10 samples [40], and the ADAM optimizer for adaptive step-size [41]. To ensure escaping the first-order saddle points, we have also slightly perturbed gradients for a few times once the iterations stall [36]. Notice that the input layer takes a concatenation of the vectorized form of A and C, and the output layer returns the regression results for { r u } u . 3 For the INITIALIZE function, we assume the availability of the full topological knowledge, so the location of all nodes and path-loss of all links are available to the cloud. For the EXPLORE function, we restrict the set of feasible association by limiting the cell-size to 150 meters. This is a reasonable assumption in mmWave networks, due to severe path loss and a dense BS deployment. Moreover, we enforce that every BS should estimate the effective channel toward its associated UEs. Moreover, to improve the exploitation, we gradually decay exploration parameter by setting ← 0.9× after every 1000 CIs. Finding the optimal decrement rate for or even developing a deterministic exploration policy are interesting topics for future work. We have considered two benchmarks: closest BS association and Oracle (upper bound on performance). In the first benchmark, every UE is served by the closest BS. In this case, a BS acquires CSI of only its associated UEs in every CI (so no inter-BS coordination). The Oracle benchmark shows the performance of the solution of the pure model-based approach, P 1 , given also in Table II, in which the cloud server needs perfect CSI of all channels in the network. Although we were not able to find any state-of-the-art approaches for our problem setting, we should emphasize that their potential performance would respect our benchmarks. As we shall see, the performance of our approach is very close to that of the Oracle in most cases. Fig. 5(a) illustrates the instantaneous network sum rate of our hybrid approach. 4 From this figure, the envelope of the sum-rate is increasing with CI index. Interestingly, we also observe that sum-rate values converge to the Oracle, which suggests that Algorithm 2 is asymptotically optimal in this example. This convergence behavior validates our earlier discussions regarding the importance of initialization for the learning function; see Section IV-D. We should emphasize that the particular propagation characteristics of mmWave networks allow for that initialization. Observe that these conclusions also hold for large antenna scenario, where the increased sumrate is due to a reduction in interference -which is in turn due to the increased directionality. Moreover, notice that the fluctuations in Fig. 5 are normal due to the i.i.d. realizations of the small-scale fading in every CI and the randomness in the channel estimation error. Fig. 5(b) shows minimum UE rate for the same numerical setup, where the above conclusions still hold. Furthermore, the increased variance of the fluctuations is a result of looking at the minimum rate, which has inherently more randomness than the sum-rate. Surprisingly, Fig. 5(b) also reveals that Algorithm 2 offers good robustness and fairness (with respect to the minimum rate), although the sum-rate is the objective that is maximized. Finally, our approach substantially outperforms the closest-BS association in terms of both the network sum-rate and the minimum rate of UEs. The gain is mainly due to 1) coordination in the small antenna regime, where the interference may be stronger, and 2) load balancing over the network in the larger antenna regime, where the interference may be less dominant.

A. Performance in the Large Antenna Regime
In this subsection, we evaluate the asymptotic behavior of spectrum sharing when the number of antennas grows large. It was shown in [8] that the array response vectors at the BS and UE, i.e., {a BS (θ)} θ and {a UE (θ)} θ form an orthonormal basis, which can serve as orthogonal spatial signatures of the BSs and UEs, as N BS and N UE grow large. Moreover, in this regime, there exist infinitely many spatial signatures (corresponding to different values of θ). Thus, multiuser interference vanishes as a result of assigning different signatures to  Table II. different UEs and BSs. In the asymptotic regime, we can show using similar steps as those in [8, Proposition 1] that the following holds: Remark 3: Suppose that a BS has perfect CSI toward its associated UEs. The interference components, formulated in (9)-(11), vanish almost surely as either N BS → ∞ or N UE → ∞.
Remark 3 suggests that we can ignore the intra-and interoperator coordination completely, and consequently P 1 and its distributed variant (introduced in Section III-B) yield the same optimal solution. Table II confirms the same trend in the finite antenna regime, where increasing the number of antennas reduces the contributions of coordination on reducing the interference components. Notice that in reality, the perfect CSI assumption of Remark 3 may not hold, leading to a residual sporadic strong interference [37]. Consequently, we need some level of coordination to tame strong interference terms. However, this mandatory level of coordination at the mmWave bands is much less than that at the sub-6 GHz bands.

B. Imperfect CSI and Hardware
Although this work alleviates the need for a complete CSI knowledge of the entire network, through the learning functionality, the BSs should have access to error-free effective channels of some selected UEs. However, CSI is estimated using pilots and will inevitably have some estimation errors. These effects are also compounded by the limited number of RF chains in mmWave MIMO, and quantized analog precoding/combining. But there have been great strides in efficient methods for channel estimation (exploiting sparsity [42] or reciprocity [30]), and hybrid precoding that closely approximates fully digital solutions [29]. Moreover, in a distributed setting, CSI acquisition (at the network level) may be done using so-called Forward-Backward training methods to estimate the CSI in a fully distributed manner [43]. These methods, however, may further increase the coordination cost. Sensitivity analysis of the proposed hybrid scheme to the estimation error in the effective channels, convergence with feedback quantization [44], and the extension of our approach toward robust learning are important future directions.

C. Signaling and Computational Overheads
In our approach, we have two sources of signaling. In every CI, we need to acquire CSI from every BS b to UE u for which [C ] bu = 1, whereas the Oracle need CSI for each BS-UE pair. This significantly fewer number of pilot transmissions is feasible due to our rate approximation. To enable it, the cloud collects the current rate measurements from all BSs, reexecutes the BCD solver, and announces the new association and coordination (only if they have been changed). This process should be done twice for every training frame, once before the training frame and once after it. Therefore, besides some CSI estimation in every CI, the signaling/communication overhead of the proposed hybrid scheme is mainly dominated by the number of training frames. The frequency of these frames is inevitably large in the first few CIs since we assume no a priori knowledge about the network. However, we can gradually decrease the exploration frequency by replacing several training frames with operation frames. The lower bound on the exploration frequency depends on many factors, including the dynamics of the topology and the fluctuations of the network load, whose characterization is an interesting topic for future works.
As for the computational complexity, the main contributing factor is solving the two subproblems using BCD (see Algorithm 1). Although this entails solving two combinatorial problems, one can develop low-complexity solutions, e.g., via relaxations or decompositions. Moreover, the BCD solution is carried out at the cloud server which has large computational resources. Another contributing factor is the matrix inversion in the computation of the RZF precoder at each BS, which scales cubically with the number of UEs served by the BS.

D. Dynamic Number of BSs/UEs
Our main algorithms have been developed for a fixed number of BSs and UEs. In a real network, however, some UEs may join and leave the network, and some BSs may be turned on or off to save energy.
We should highlight that the special characteristics of mmWave communications (directionality, blockage, and propagation loss) would substantially reduce the impact of farther BSs/UEs [19]. In other words, adding/removing some BSs or UEs will have only local effects, impacting the rate models of only a few surrounding UEs. In this situation, the INITIALIZE function can enable fast adaption to dynamic U and B using a few new samples. In the case of having new UEs, we use the INITIALIZE function for both finding a good initialization for the rate function of the new UEs and for adding some interference terms to the rate models of the existing UEs.
In the case of smaller U , we can remove their impacts on other UEs by removing their contributions to the rate function, approximated by the INITIALIZE function.
In the light of the above discussion, we argue that the complexity of the functional approximator (e.g., deep neural network) should be manageable in a real network. The main reason is that the cloud server trains an individual approximator for every UE. In our experiments, our neural network was already over-parameterized. Such a network can easily approximate more complicated rate functions, which may happen for larger |U | and |B|, as we have shown in our experiments over a much bigger network; see Figs. 6 and 7 of the extended version [25]. Moreover, due to the interference locality at the mmWave networks [19], a reasonable change in the number of UEs or BSs does not substantially change the hardness of the rate function (to be approximated). Finally, we reemphasize the fact that current work is intended as a proof of concept of usefulness and viability of the proposed hybrid approach. Several of the issues raised by the reviewers (e.g., scalability and complexity reduction) will be part of our future research.

VI. CONCLUSION
In this work, we investigated the problem of spectrum sharing in mmWave networks and argued the formidable complexity of a pure model-based solution approach. As a viable alternative, we proposed to complement it by a data-driven approach to make the spectrum sharing problem solvable in practical systems. In particular, the model-based part chooses the beamforming and optimizes association and coordination decisions, given a set of rate models. The data-driven part continuously refines the rate models, maintaining the optimality of our solution even in non-stationary environments. The resulting algorithm balances the use of training frames (designed to explore the solution space) and operation frames (designed to exploit good solutions). Our hybrid scheme has the same computational complexity as the pure model-based approach while being robust to insufficient signaling and missing CSI. Our numerical results revealed large gains in network sum-rate while satisfying a predetermined budget on the coordination cost.