Meta-Bandit: Spatial Reuse Adaptation via Meta-Learning in Distributed Wi-Fi 802.11ax

IEEE 802.11ax introduces several amendments to previous standards, with a special interest in spatial reuse (SR) to respond to dense user scenarios with highly demanding services. In dynamic scenarios with more than one Access Point, the joint adjustment of Transmission Power (TP) and Clear Channel Assessment (CCA) threshold remains a challenge. With the aim of mitigating Quality of Service (QoS) degradation, we introduce a solution that builds on meta-learning and multi-armed bandits. Simulation results show that the proposed solution adapts with an average of 1250 fewer environment steps and achieves a 72% average improvement in terms of fairness and starvation over a transfer learning baseline.


I. INTRODUCTION
Wi-Fi technology has evolved dramatically in the last 10 years, with a 16x increase in throughput from the 802.11n to the 802.11ax standard. Moreover, in 2022 the number of global Wi-Fi devices in use amounted to nearly 18 billion [1]. This growth, together with the variety of use cases such as cloud computing, multi-gigabit streaming, augmented and virtual reality (AR/VR) and telepresence, in combination with highly dense user deployments, has imposed several challenges on Wi-Fi technology [2]. IEEE 802.11ax [3] reduces the impact of the aforementioned issues with an interest in spatial reuse (SR). Improving SR decreases packet collisions among stations and allows determination of channel access rights. Additionally, SR allows adjusting the CCA threshold and the AP's Transmission Power (TP) dynamically, enhancing adaptability. Yet, the challenge lies in dynamically fine-tuning these settings, particularly in densely populated areas like malls, stadiums, and transportation hubs.
Machine Learning (ML) has been considered a viable instrument for the optimization of SR in the context of Wi-Fi 802.11ax. Among several studies, the authors in [4] propose a federated learning-based solution in a multi-Basic Service Set (multi-BSS) scenario to improve SR. In [5], the authors use a centralized approach based on an Infinite Multi-Armed Bandit comprising a Gaussian Mixture Sampler and a Thompson Sampling (TS) optimizer to select the TP and Overlapping BSS/Preamble-Detection (OBSS/PD) threshold parameters appropriately. Furthermore, in [6], the authors present a comparison between cooperative and non-cooperative algorithms to improve SR in decentralized settings. Moreover, they propose to model the problem as a Multi-Agent Contextual Multi-Armed Bandit (MA-CMAB), and, for the first time, lay the groundwork for the application of transfer learning to deal with dynamic environments in the SR context. Despite the advantages of transfer learning techniques in RL, they are susceptible to negative transfer, in which the target task underperforms after receiving knowledge from the source task [7].
To address the aforementioned issue, newer techniques such as meta-learning allow quick adaptation to unseen tasks given sufficient expertise on source tasks [8].
In this letter, we address the problem of SR adaptability in dynamic scenarios using Meta-Reinforcement Learning (meta-RL). However, convergence of the learning process remains a challenge. Ideally, the exploration performed by RL agents should complete promptly so that the best action can be further exploited. However, premature exploration could lead to a policy that ends up at a local minimum or to the agent's total failure [9]. Meta-learning has been conceptualized as "learning to learn," tackling the learning convergence challenge with just a few learning instances [10]. Meta-learning has previously been applied in the wireless communications domain. In [11], the authors propose a meta-learning and recurrent neural network (RNN) approach to predict mmWave/THz link blockages. In the Wi-Fi context, the authors in [12] propose using meta-learning to adapt quickly in the presence of new types of impersonation attacks.
To the best of our knowledge, this is the first study that addresses adaptability via meta-learning and contextual multi-armed bandits in the 802.11ax SR context. More specifically, a deep contextual MAB proposed in [13], namely Sample Average Uncertainty (SAU)-Sampling, is leveraged together with the Model-Agnostic Meta-Learning (MAML) algorithm [8]. Figure 1 illustrates a high-level overview of the meta-learning strategy in an Open-WiFi setting [14]. In the proposed MA-CMAB scheme, each Contextual Multi-Armed Bandit (CMAB) agent experiences a set of $T_N$ scenarios subject to mobility-induced fluctuation of the user demand. The acquired knowledge is stored in the Open-WiFi controller for posterior meta-training in an offline fashion. Finally, the meta-agent, named "meta-bandit," is used for few-shot adaptation in the unseen $T_{N+1}$ scenario. The main goal of this letter is to propose a method capable of accelerating the learning convergence of SR-based RL in highly dense Wi-Fi networks, while maintaining Key Performance Indicators (KPIs) comparable to state-of-the-art methods such as transfer learning. Moreover, our results show that the meta-bandit exhibits, at convergence time, a considerable improvement in terms of network fairness when compared with the presented baseline.

TABLE I NOTATIONS AND DEFINITIONS
The rest of this letter is organized as follows. Section II introduces the system model, followed by the proposed MA-CMAB. Section III describes the meta-reinforcement learning algorithm, which is followed by the performance evaluation of the meta-learning approach along with discussions in Section IV. Finally, Section V concludes this letter.

II. MULTI-AGENT CONTEXTUAL MULTI-ARMED BANDIT ALGORITHM
In this section, we start with the introduction of the system model, followed by the context definition, action space, and reward function for the MA-CMAB algorithm utilized in this letter. Table I presents the list of notations.

A. System Model
We consider a Wi-Fi 802.11ax scenario composed of M APs and S stations that are randomly positioned. A CMAB agent resides in each AP and is capable of communicating back and forth with the Open-WiFi Cloud Controller. In addition, each station and AP is equipped with two antennas supporting up to two spatial streams in transmission and reception. In this letter, we assume the 5 GHz frequency band with an 80 MHz channel bandwidth in a Line-of-Sight (LOS) setting. The propagation loss is modeled on the basis of the Log Distance propagation loss model with a constant-speed propagation delay. Finally, an SINR-based adaptive rate data model is considered with downlink UDP traffic.
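For concreteness, the following is a minimal sketch of how the system-model parameters above could be captured as a configuration object; the class, field names, and the example values of M and S are illustrative assumptions, not taken from the authors' simulation scripts.

```python
from dataclasses import dataclass

@dataclass
class ScenarioConfig:
    """Illustrative container for the system-model parameters of Section II-A."""
    num_aps: int                    # M APs, each hosting one CMAB agent
    num_stations: int               # S randomly positioned stations
    frequency_ghz: float = 5.0      # operating band
    bandwidth_mhz: int = 80         # channel bandwidth
    spatial_streams: int = 2        # 2x2 antennas at both APs and stations
    los: bool = True                # Line-of-Sight setting
    path_loss: str = "LogDistance"  # with constant-speed propagation delay
    traffic: str = "downlink_udp"   # SINR-based adaptive rate data model

cfg = ScenarioConfig(num_aps=4, num_stations=20)  # M and S are placeholders
```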

B. Context Definition
Our MA-CMAB solution is based on a contextual multi-armed bandit algorithm introduced in [13]. More specifically, we employ a cooperative multi-agent version of that algorithm, named Cooperative Sample Average Uncertainty-Sampling (SAU-Coop) [6]. Note that the methods presented in this letter are extensible to non-cooperative implementations.
The context is composed of the local observations of the APs, namely: 1) the number of starving stations, $|\Psi^{AP}_m|$, i.e., stations under an $\omega$ fraction of their attainable throughput during the $t$-th episode; 2) the average RSSI, $I^{AP}_m$; and 3) the average noise, $\Upsilon^{AP}_m$, all measured at the $m$-th AP during the $t$-th episode. The number of starving stations is normalized as

$$\bar{\Psi}^{AP}_m = \frac{|\Psi^{AP}_m|}{N^{AP}_m}, \quad (1)$$

where $N^{AP}_m$ corresponds to the total number of STAs attached to the $m$-th AP. Here, $\bar{I}^{AP}_m$ is:

$$\bar{I}^{AP}_m = \begin{cases} 0, & -60\ \text{dBm} \le I^{AP}_m \le -50\ \text{dBm},\\ 0.25, & -70\ \text{dBm} \le I^{AP}_m < -60\ \text{dBm},\\ 0.5, & -80\ \text{dBm} \le I^{AP}_m < -70\ \text{dBm},\\ 0.75, & -90\ \text{dBm} \le I^{AP}_m < -80\ \text{dBm},\\ 1, & I^{AP}_m < -90\ \text{dBm}, \end{cases} \quad (2)$$

where the normalized values of the average RSSI and noise, $\bar{I}^{AP}_m$ and $\bar{\Upsilon}^{AP}_m$, are chosen based on the discretization of their maximum values presented in [15], with $\bar{\Upsilon}^{AP}_m$ obtained through an analogous discretization (3). Finally, the context of each CMAB agent is defined as:

$$x_{m,t} = \left(\bar{\Psi}^{AP}_m,\ \bar{I}^{AP}_m,\ \bar{\Upsilon}^{AP}_m\right). \quad (4)$$
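The following is a minimal sketch of the context construction above. The piecewise RSSI mapping follows Eq. (2); the starving-station normalization by $N^{AP}_m$ and the pre-discretized noise input are assumptions consistent with the surrounding definitions.

```python
import numpy as np

def normalize_rssi(rssi_dbm):
    """Discretize the average RSSI into [0, 1] per the piecewise mapping of Eq. (2)."""
    if rssi_dbm >= -60:
        return 0.0
    if rssi_dbm >= -70:
        return 0.25
    if rssi_dbm >= -80:
        return 0.5
    if rssi_dbm >= -90:
        return 0.75
    return 1.0

def build_context(n_starving, n_attached, avg_rssi_dbm, noise_norm):
    """Assemble the per-AP context of Eq. (4): starving-station fraction,
    normalized RSSI, and normalized noise. The noise value is assumed to be
    discretized analogously to the RSSI, per [15]."""
    return np.array([n_starving / max(n_attached, 1),
                     normalize_rssi(avg_rssi_dbm),
                     noise_norm])
```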

C. Action Space
The action space of the $m$-th AP corresponds to the set of combinations of CCA threshold ($P^m_{cs}$) and TP ($P^m_{tx}$) values, which in the context of MABs translates into the number of arms of each MAB agent. The action space is defined as:

$$\mathcal{A} = \mathcal{A}_{cs} \times \mathcal{A}_{tx}, \quad \mathcal{A}_{cs} = \left\{P^{min}_{cs},\ P^{min}_{cs} + \Delta_{cs},\ \ldots,\ P^{max}_{cs}\right\}, \quad (5)$$

with $\mathcal{A}_{tx}$ built analogously over $[P^{min}_{tx}, P^{max}_{tx}]$. Here, $L_{cs}$ and $L_{tx}$ correspond to the number of levels into which the CCA threshold and TP values are quantized, respectively. Finally, the number of arms corresponding to the action space of the $m$-th agent becomes $K^{AP}_m = |\mathcal{A}_{cs}|\,|\mathcal{A}_{tx}| = L_{cs} L_{tx}$.
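A minimal sketch of the arm enumeration follows; uniform quantization is assumed, and the CCA/TP ranges shown are placeholders, not the letter's settings.

```python
import itertools
import numpy as np

def build_arms(cca_min, cca_max, l_cs, tp_min, tp_max, l_tx):
    """Enumerate the K = L_cs * L_tx arms as (CCA threshold, TP) pairs,
    assuming uniform quantization of both ranges."""
    cca_levels = np.linspace(cca_min, cca_max, l_cs)
    tp_levels = np.linspace(tp_min, tp_max, l_tx)
    return list(itertools.product(cca_levels, tp_levels))

# Placeholder ranges: CCA threshold in [-82, -62] dBm, TP in [1, 21] dBm.
arms = build_arms(-82.0, -62.0, 5, 1.0, 21.0, 5)
assert len(arms) == 25  # K = 5 * 5 arms for this configuration
```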

D. Reward Definition
The following two terms contribute to the reward function: a local reward per AP, $r^{AP}_m$, which depends on the effective and achievable throughputs of the attached stations and on the set of starving stations, and the Jain's fairness of the network, $r_J$, given by the standard Jain's fairness index:

$$r_J = \frac{\left(\sum_{m=1}^{M}\sum_{s \in N^{AP}_m} R^m_s\right)^2}{S\,\sum_{m=1}^{M}\sum_{s \in N^{AP}_m} \left(R^m_s\right)^2},$$

where $\Psi^{AP}_m$ is the set of starving stations attached to the $m$-th AP, $r_J$ stands for the overall network's Jain's fairness index, and $N^{AP}_m$ denotes the set of stations attached to the $m$-th AP. Finally, $R^m_s$ and $R^m_{s,A}$ are the effective and achievable throughput of the $s$-th station attached to the $m$-th AP, respectively. The cooperative multi-agent SAU-Sampling procedure is summarized in Algorithm 1.
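A minimal sketch of the two reward ingredients follows. The Jain's index is the standard formula; the starving-station test (effective throughput below an omega fraction of the achievable one) follows the context definition in Section II-B, with omega = 0.5 as a placeholder. The exact closed form of $r^{AP}_m$ is not reproduced here.

```python
import numpy as np

def jain_fairness(throughputs):
    """Standard Jain's fairness index over the effective throughputs of all
    S stations in the network."""
    x = np.asarray(throughputs, dtype=float)
    return float(x.sum() ** 2 / (len(x) * (x ** 2).sum() + 1e-12))

def starving_stations(effective, achievable, omega=0.5):
    """Indices of stations whose effective throughput falls below an omega
    fraction of their achievable throughput (omega is a placeholder)."""
    eff = np.asarray(effective, dtype=float)
    ach = np.asarray(achievable, dtype=float)
    return np.flatnonzero(eff < omega * ach)
```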

III. META-REINFORCEMENT LEARNING ALGORITHM
In this section, we introduce our Meta-Bandit scheme, which leverages the previously presented MA-CMAB and MAML. The proposed algorithm consists of two offline stages and one online stage: 1) meta-task dataset generation, 2) meta-training, and 3) meta-adaptation.

A. Meta-Task Dataset Generation
Meta-learning requires a large number of meta-tasks to train the meta-agent [8]. In our study, the Open-WiFi cloud controller learns from various independent environments to prepare for the Meta-Training stage, where the meta-tasks comprise a diverse set of scenarios. We leverage the internal structure of the SAU-Sampling CMAB and store the context $x_t = (x_{1,t}, \ldots, x_{M,t})$, the selected action $a_t = (a_{1,t}, \ldots, a_{M,t})$ and the predicted reward $\hat{\mu}_t = (\hat{\mu}(x_{1,t}), \ldots, \hat{\mu}(x_{M,t}))$ per agent in each simulated load. Three datasets per agent are utilized in the Meta-Training stage: train $\mathcal{D}^{train}$, test $\mathcal{D}^{test}$ and validation $\mathcal{D}^{val}$, where $\mathcal{D}$ is the database of all datasets. Such datasets are assumed to be available at the Open-WiFi controller for further training in the second stage.
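A minimal sketch of the per-agent log of $(x_t, a_t, \hat{\mu}_t)$ tuples follows; the class and method names are illustrative, and the 0.6:0.2:0.2 split mirrors the one reported in Section IV.

```python
from collections import namedtuple

Record = namedtuple("Record", ["context", "action", "pred_reward"])

class MetaTaskDataset:
    """Per-agent log of (x_t, a_t, mu_hat_t) tuples collected while the
    SAU-Sampling CMAB interacts with one simulated scenario (one meta-task)."""
    def __init__(self, agent_id, task_id):
        self.agent_id = agent_id
        self.task_id = task_id
        self.records = []

    def store(self, context, action, pred_reward):
        self.records.append(Record(context, action, pred_reward))

    def split(self, train=0.6, test=0.2):
        """Split the log into train/test/validation subsets (0.6:0.2:0.2)."""
        n = len(self.records)
        i, j = int(n * train), int(n * (train + test))
        return self.records[:i], self.records[i:j], self.records[j:]
```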

B. Meta-Training
The SAU-Sampling CMAB maps its context onto the rewards via NNs. In addition, it considers the action taken by the $m$-th agent and uses it as a one-hot vector at training time. The network structure utilized for the meta-agent comprises an input layer $\ell_{in}$ of dimension $n + p$ (Eq. (10)), where $n$ denotes the dimension of the context $x$ and $p$ the dimension corresponding to the number of each agent's $K$ arms.
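A minimal PyTorch sketch of such a reward predictor follows, consistent with the two-hidden-layer structure mentioned in the complexity analysis of Section III-D. The hidden width and the way the one-hot action enters the network are assumptions.

```python
import torch
import torch.nn as nn

class RewardPredictor(nn.Module):
    """Sketch of the SAU reward-prediction network: the context x (dimension n)
    concatenated with the one-hot action (dimension p = K) is mapped through
    two hidden layers to a predicted reward. Hidden width is a placeholder."""
    def __init__(self, n_context, k_arms, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_context + k_arms, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, a_onehot):
        # Predicted reward mu_hat for the context/action pair.
        return self.net(torch.cat([x, a_onehot], dim=-1))
```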
In this case, a supervised multi-output regression is still in effect. Thus, the general form of the loss function is defined as the mean-squared error:

$$\mathcal{L}(\theta) = \frac{1}{|\mathcal{D}_d|}\sum_{(x,\,a,\,\hat{\mu}) \in \mathcal{D}_d} \left(f_\theta(x, a) - \hat{\mu}\right)^2, \quad (11)$$

where $x$, $a$ and $\hat{\mu}$ are the context, action and predicted output sampled from the $d$-th dataset and the $m$-th agent. The output of the algorithm corresponds to a meta-agent represented by the parameterized functions $\psi^{(m,i)}$ and $\theta$ (see Algorithm 2, line 10), capable of adapting with few examples. The meta-learning algorithm, named Meta-Bandit, is described in Algorithm 2.
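For illustration, the following is a sketch of one MAML outer update over a batch of meta-task datasets. A first-order MAML variant is shown for brevity (second-order gradients are ignored); the task-batch format and hyperparameters are assumptions.

```python
import copy
import torch

def maml_outer_step(meta_model, task_batch, outer_opt,
                    inner_lr=1e-2, loss_fn=torch.nn.MSELoss()):
    """One meta-training update. Each task provides support (x_s, y_s) and
    query (x_q, y_q) splits of (input, predicted-reward) pairs."""
    outer_opt.zero_grad()
    for x_s, y_s, x_q, y_q in task_batch:
        adapted = copy.deepcopy(meta_model)
        # Inner loop: one gradient step on the task's support set.
        inner_loss = loss_fn(adapted(x_s), y_s)
        grads = torch.autograd.grad(inner_loss, list(adapted.parameters()))
        with torch.no_grad():
            for p, g in zip(adapted.parameters(), grads):
                p -= inner_lr * g
        # Outer loop: evaluate the adapted parameters on the query set.
        loss_fn(adapted(x_q), y_q).backward()
        # First-order approximation: accumulate the adapted gradients
        # directly onto the meta-parameters.
        with torch.no_grad():
            for p, p_a in zip(meta_model.parameters(), adapted.parameters()):
                p.grad = p_a.grad.clone() if p.grad is None else p.grad + p_a.grad
    outer_opt.step()
```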

C. Meta-Adaptation
The third stage of the algorithm, named Meta-Adaptation, is performed in an online manner. After each agent of the MA-CMAB scheme experiences F samples, a k-shot sampling procedure follows, which is further utilized in the meta-adaptation. Finally, each meta-agent adapts its model as follows:

$$\theta^{(m)} \leftarrow \theta - \alpha\, \nabla_\theta\, \mathcal{L}_{F^{(m)}}\!\left(f_\theta\right), \quad (12)$$

where $F^{(m)}$ is the set of F samples from the $m$-th AP and $\alpha$ is the adaptation step size.
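A minimal sketch of this online k-shot adaptation step follows; the step size, number of steps, and batch handling are placeholder choices.

```python
import copy
import torch

def meta_adapt(meta_model, x_support, y_support, k=5, alpha=1e-2, steps=1):
    """Online k-shot adaptation (sketch of Eq. (12)): starting from the
    meta-trained parameters theta, take a few gradient steps on the k
    sampled pairs collected from the m-th AP."""
    model = copy.deepcopy(meta_model)
    opt = torch.optim.SGD(model.parameters(), lr=alpha)
    loss_fn = torch.nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x_support[:k]), y_support[:k])
        loss.backward()
        opt.step()
    return model
```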

D. Complexity Analysis
The complexity analysis must be performed separately for the different stages of our proposal. In addition, we consider the regret as the standard metric to compare bandits. As mentioned before, our SAU-Coop algorithm is built upon [13], where an empirical reduction in running time and a complexity comparable to other MAB techniques such as TS [16] have been shown. More specifically, the SAU metric $\tau^2_a$ used in this letter resembles TS according to Proposition 2 in [13], with an empirical behavior similar to TS as well. The expected regret is logarithmically bounded as:

$$\mathbb{E}[\mathcal{R}_n] \le O\!\left(\sum_{a \in K:\ \Delta_a > 0} \frac{\ln n}{\Delta_a}\right), \quad (13)$$

where $\mu_a$ and $\mu^* = \max\{\mu_1, \ldots, \mu_K\}$ correspond to the expected reward for action $a$ and the maximum expected reward, respectively; $a \in K$ refers to the action taken among the possible K arms, $n$ is the number of times SAU chooses an action, and $\Delta_a$ is defined as $\Delta_a = \mu^* - \mu_a$. Furthermore, we present the time complexity of our proposed SAU-Coop scheme, which corresponds to the complexity of the neural network that comprises the reward predictor (Algorithm 1, line 6). In our scenario, each predictor is composed of two hidden layers plus the input and output layers. Thus, at least three matrices are needed to represent the four layers' weight relationships: $W_{ab}$, $W_{ca}$, $W_{dc}$, where $a$, $b$, $c$, $d$ are the numbers of nodes of each layer. As an example, the propagation from layer $a$ to layer $b$ over $t$ samples can be written as:

$$Z_b = W_{ab} H_a, \quad (14)$$
$$H_b = \phi(Z_b), \quad (15)$$

where $\phi(\cdot)$ is the activation function. The matrix product in Eq. (14) costs $O(a \cdot b \cdot t)$, while the element-wise activation in Eq. (15) costs $O(b \cdot t)$, which gives a total complexity of $O(a \cdot b \cdot t)$. Analogously, performing the same procedure with matrices $W_{ca}$ and $W_{dc}$ yields a total time complexity of $O(n \cdot t \cdot (ab + ca + dc))$, which can be further reduced to $O(ab + ca + dc)$, since $n = t = 1$ in our MA-CMAB proposal. This corresponds to a low time complexity, and thus has no implications on the performance of the algorithm.
On the other hand, MAML's time complexity corresponds to $O(d^2)$, where $d$ is the problem dimension. The quadratic term reflects the outer- and inner-loop updates of the MAML algorithm, which can be quite costly if the model's parameter dimension is large. However, the MAML training is performed offline, so this cost can be ignored. In the online stage described in Section III-C, the meta-adaptation is performed via gradient descent with a complexity of $O(knd)$, where $k$, $n$ and $d$ correspond to the number of iterations, the number of samples and the number of features, respectively. If $d$ increases, the complexity could become an issue; however, this is not the case here and, consequently, performance is not affected.

IV. PERFORMANCE EVALUATION
In this section, we present the simulation settings and baseline model under consideration.

A. Simulation Setting
The dataset is generated by simulating S random scenarios with different user loads per AP via the discrete-event network simulator ns-3. OpenAI Gym [17] is used to interface between ns-3 and the MA-CMAB solution. Table II and Table III present the learning hyperparameters and network settings for the dataset generation and online meta-adaptation.
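The following is a minimal sketch of such an interaction loop under the classic Gym API. The environment id, the agent stub, and the per-AP action packing are all hypothetical: the letter does not specify the ns-3/Gym bridge it uses.

```python
import gym

class PlaceholderAgent:
    """Stand-in for a SAU-Sampling CMAB agent; only the interface matters."""
    def __init__(self, action_space):
        self.action_space = action_space

    def select_arm(self, obs):
        return self.action_space.sample()

    def update(self, obs, action, reward):
        pass  # a real agent would refit its reward predictor here

env = gym.make("ns3-v0")  # hypothetical id exposed by an ns-3 <-> Gym bridge
agents = [PlaceholderAgent(env.action_space) for _ in range(4)]  # one per AP
obs = env.reset()
done = False
while not done:
    actions = [agent.select_arm(obs) for agent in agents]
    obs, reward, done, info = env.step(actions)
    for m, agent in enumerate(agents):
        agent.update(obs, actions[m], reward)
env.close()
```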

B. Baseline Model
We consider the transfer learning-based baseline model previously proposed in [6]. Specifically, we utilize partial transfer learning, where specific layers of the CMAB agents are transferred from task $T_N$ to $T_{N+1}$ instead of transferring the full model. The selection of the CMAB neural network hidden layers $l_t$ to be transferred is treated as an additional hyperparameter. This approach reduces model overfitting, allowing improved adaptive behavior.
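A minimal PyTorch sketch of such a partial layer transfer follows; the layer names refer to the reward-predictor sketch above and are illustrative.

```python
import torch

def partial_transfer(source, target, layer_names):
    """Copy only the selected hidden layers (the hyperparameter l_t) from the
    source CMAB network to the target network, leaving all other layers at
    their current values."""
    src = source.state_dict()
    tgt = target.state_dict()
    for name in layer_names:
        for suffix in (".weight", ".bias"):
            key = name + suffix
            if key in src:
                tgt[key] = src[key].clone()
    target.load_state_dict(tgt)

# e.g., transfer only the first hidden layer of the reward-predictor sketch:
# partial_transfer(source_model, target_model, layer_names=["net.0"])
```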

C. Results
In this section, we evaluate the main network KPIs of the meta-bandit under different k-shot sampling strategies, with k ∈ {1, 5, 20, 50}, to select the optimal k-shot configuration, and compare against the transfer learning baseline. Each meta-bandit is trained with S = 200 random tasks, where the train, test and validation split is 0.6:0.2:0.2. In the online stage, 50 runs per traffic mode are performed, with 10 random unseen scenarios per run, to observe the performance of the trained meta-bandit. Similarly, the transfer learning strategy uses the first trained policy per run as the policy transferred to the remaining nine unseen scenarios. In Fig. 2, we present the results for the meta-bandit and the transfer learning baseline in the online stage under different traffic regimes at a 95% confidence level.
Figure 2(a) depicts the cumulative network throughput of the techniques under study. One-shot is outperformed in all cases, whereas the best performance is achieved with 50-shot or the transfer learning technique. It is worth noting that the transfer learning technique achieves better throughput performance for two reasons: 1) it does not modify the internal NN structure of the CMAB agents, and 2) it benefits from longer fine-tuning. A slightly different behavior is observed in Fig. 2(b) and 2(c) for fairness and user starvation, respectively. For traffic loads of 0.001 and 0.056 Gbps, transfer learning exhibits the worst performance, especially in terms of starvation, as a product of negative transfer. As mentioned in Section I, negative transfer occurs when the transfer of a source policy to a target policy results in degraded performance of the target policy. Evidently, the selection of the source policy has a significant impact on the transfer learning algorithm. In this letter, we select a random policy to be transferred, which may lead to the negative transfer phenomenon in some scenarios. Selection of the best source policy remains an open issue that can be tackled with intelligent RL solutions [18].
Figure 3 presents the impact of the meta-learning approach against the transfer learning baseline in terms of fairness convergence, MCS index and delay, at a 95% confidence level and 0.001 Gbps data traffic. In Fig. 3(a), it can be seen that the one-shot sampling strategy is not capable of adapting successfully. On the other hand, for k ∈ {5, 20, 50}, successful adaptation is observed. In addition, we show a red and a blue arrow indicating the improvement in terms of convergence and adaptability over the transfer learning baseline, respectively. In terms of convergence, the proposed techniques need on average 1,250 fewer environment steps. Furthermore, in terms of fairness and starvation, 64% and 80% improvements over the transfer learning baseline are achieved, respectively. In Fig. 3(b), the ECDF of the MCS index indicates that the best performance is obtained with the 50-shot sampling strategy. In the same fashion, Fig. 3(c) depicts the ECDF of the delay over all strategies, with 50-shot standing out as the best strategy.
V. CONCLUSION
In this letter, we have proposed a meta-learning-based improvement for spatial reuse in distributed cooperative Wi-Fi 802.11ax networks by leveraging a Multi-Agent Contextual Multi-Armed Bandit (MA-CMAB). The results have shown that meta-learning is able to improve spatial reuse adaptation in terms of convergence for highly dense and dynamic scenarios when compared to a transfer learning baseline.

Manuscript received 2 April 2023; accepted 12 April 2023. Date of publication 20 April 2023; date of current version 2 January 2024. This work was supported in part by the Mitacs Accelerate Program and in part by NetExperience Inc. The associate editor coordinating the review of this article and approving it for publication was N. Passas. (Corresponding author: Melike Erol-Kantarci.)

Fig. 1. Meta-agent training and deployment workflow in an Open-WiFi architecture.

Fig. 2. Network Key Performance Indicators for the 0.001, 0.056, 0.11 and 0.16 Gbps traffic regimes in terms of a) cumulative throughput, b) fairness and c) user starvation. Each figure shows the performance of one-shot, 5-shot, 20-shot and 50-shot for the meta-bandit, and the transfer learning baseline.

Fig. 3. Network Key Performance Indicators for the 0.001 Gbps traffic regime for the meta-agent and the transfer learning baseline: a) fairness convergence graph, b) Empirical Cumulative Distribution Function (ECDF) of the Modulation and Coding Scheme (MCS) index, and c) ECDF of the delay. The red double arrow in a) indicates the improvement in convergence time of the meta-agents over transfer learning.
© 2023 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/