Toward More Reliable Deep Learning-Based Link Adaptation for WiFi 6

Abstract: The problem of selecting the modulation and coding scheme (MCS) that maximizes the system throughput, known as link adaptation, has been investigated extensively, especially for IEEE 802.11 (WiFi) standards. Recently, deep learning has been widely adopted as an efficient solution to this problem. However, when the model fails, predicting a higher-rate MCS can result in a failed transmission. A retransmission is then required, which severely degrades the system throughput. To address this issue, we formulate the adaptive modulation and coding (AMC) problem as a multi-label multi-class classification problem. The proposed formulation allows more control over what the model predicts in failure cases. In this context, we propose a simple yet powerful loss function to reduce the number of retransmissions due to higher-rate MCS classification errors. Since wireless channels change significantly with the surrounding environment, a huge dataset is generated to cover all possible propagation conditions. However, to reduce training complexity, we train the CNN model using part of the dataset. The effect of different subdataset selection criteria on the classification accuracy is studied. It is shown that some criteria for dataset selection consistently behave better than others. To confirm the performance, we applied the proposed model to adapting the IEEE 802.11ax standard in outdoor propagation scenarios. The simulation results show that the proposed loss function reduces retransmissions by up to 50% compared to traditional loss functions. Finally, we propose an optimal subdataset selection criterion.

I. INTRODUCTION
To accommodate the ever-increasing growth in throughput demand, developing high-performance wireless systems has become more essential. These wireless systems should consider both the unique features of the different data services and the dynamic, spatiotemporal characteristics of the wireless channels. Therefore, techniques like dynamic resource allocation and link adaptation are incorporated into the different wireless standards in order to support the quality of service (QoS) requirements while serving the increased number of users [1]. Link adaptation represents a key element in determining the system's latency and throughput performance [2]. Fortunately, machine learning (ML) is anticipated to provide viable solutions to the link adaptation challenges in wireless systems [3], [4].
In the literature, the link adaptation problem is formulated as a multiclass classification problem where the class labels represent different modulation and coding scheme (MCS) combinations [4]-[8]. According to this formulation, each data point is allowed to belong to only one class, and a supervised ML model can be trained to select the ideal MCS based on the training data. However, supervised models in general, and neural networks (NN) specifically, have a limited level of accuracy [9]. Failing to predict the ideal MCS therefore has unpredictable implications for the system throughput. In fact, predicting a higher-rate MCS will result in a failed transmission and, consequently, a retransmission is required, which severely degrades the system throughput. These problems come from the fact that the multiclass classification formulation provides no control over what the model predicts in the failure cases. The question now is: if the model fails to predict the optimal MCS, why not train it to predict a suboptimal one?
To answer this question, we model the link adaptation problem, for the first time, as a multi-label multi-class classification problem. In this model, a data point is allowed to belong to more than one class at the same time (all the successful MCSs in the AMC problem). Based on that, the model learns to predict not only the optimal MCS, but also all suboptimal ones. Such a modeling approach gives more control over what the model learns in the training phase and what it predicts in the failure cases. However, we need to force the model to avoid predicting higher-rate MCSs that cause retransmissions. To solve this issue, we propose a new loss function that adds more penalization to these cases. The proposed loss function reduces the number of retransmissions compared to the traditional crossentropy loss function, which is widely employed in the literature.
On the other hand, wireless channels vary significantly according to the surrounding environment. To obtain realistic results, a huge dataset is constructed to cover all possible channel variations. However, it is computationally expensive to use all the samples for training. Randomly selecting part of the dataset is an intuitive way to train the model while keeping a fair representation of the full spectrum of studied cases. In this work, we examine selection criteria other than random selection, and we compare their effect on the resulting classification accuracy.
The aforementioned selection criteria are based on domain knowledge and an understanding of the nature of wireless channels. For orthogonal frequency-division multiplexing (OFDM) based systems, we assume an interference-free, noise-free, single-user, single-input single-output setup. In this case, the delay dispersion of the channel is the decisive factor in the MCS selection. Hence, instead of randomly selecting the training subdataset, we select the subdataset that comprises a uniform (or as close as possible to uniform) distribution of the channels' delay dispersion behaviors. Given that the channel dispersion behavior is not easy to characterize fully, we employ well-known criteria characterizing the delay dispersion, such as the root-mean-square delay spread and the window delay spread. We study the effect of the selection criteria on the model accuracy, and the optimal selection criterion is highlighted.
The contributions of this work can be summarized as follows:
• Unlike the existing literature, we formulate the AMC problem as a multi-label multi-class classification problem. The model is trained to predict all the possible labels for successful transmission (including the optimal MCS and suboptimal ones).
• We employ a convolutional neural network (CNN) with an innovative loss function. The proposed model makes it possible to control which link parameters are used when the optimal ones are not selected. Consequently, it outperforms a CNN with the traditional crossentropy function in terms of the retransmission rate.
• We study the impact of training subdataset selection criteria on the AMC problem and highlight the corresponding effect on the classification accuracy.

A. Problem Formulation
Assume we have C different combinations of MCS and guard interval (GI), each of which is called a transmission mode (TM). The TMs are indexed as i ∈ I ⊂ N, where the cardinality of I is the number of available combinations. The index i, hereafter referred to as the class, maps uniquely to a combination of MCS and GI. We adopt the IEEE 802.11ax standard for a single-input single-output (SISO) system with 0.8 µs and 3.2 µs guard intervals and a fixed bandwidth of 20 MHz. In terms of multi-label multi-class classification, link adaptation is therefore the problem of selecting all the class labels, i, to which a certain channel realization belongs. Thus, for a certain channel realization ch_n, the classifier selects all the labels, i, corresponding to all valid transmission modes TM_i. We can then express the classifier as a function F that maps a channel realization ch_n to a set of labels y ⊂ {1, 2, ..., c}, where c ≤ C, such that:
$$y = F(\mathrm{ch}_n) = \{\, i \in I : TX(\mathrm{ch}_n, TM_i) = 1 \,\} \tag{1}$$
where TX(ch_n, TM_i) = 1 when transmitting a packet through channel ch_n with the transmission configuration given by TM_i is successful, and zero otherwise. From the predicted TMs, we select the one corresponding to the highest data rate. As shown in Fig. 1, a user station (STA) sends the estimated channel state information (CSI) to the access point (AP). The AP then uses the received CSI to adapt the transmission link parameters.
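The decision rule described above (predict every valid TM, then transmit with the highest-rate one) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the 0.5 decision threshold, and the example rate values are assumptions introduced here.

```python
from typing import Optional, Sequence


def select_transmission_mode(probs: Sequence[float],
                             rates: Sequence[float],
                             threshold: float = 0.5) -> Optional[int]:
    """Return the index of the highest-rate TM among those predicted valid.

    probs[i] is the model's per-class output for TM i (multi-label, so the
    entries need not sum to one); rates[i] is the data rate of TM i.
    The 0.5 threshold is an assumption, not a value from the paper.
    """
    predicted = [i for i, p in enumerate(probs) if p >= threshold]
    if not predicted:
        return None  # no TM predicted valid; the fallback policy is left to the caller
    return max(predicted, key=lambda i: rates[i])


# Illustrative values only: four TMs with increasing data rate (Mb/s).
example_rates = [8.6, 17.2, 25.8, 34.4]
example_probs = [0.9, 0.8, 0.6, 0.2]
chosen = select_transmission_mode(example_probs, example_rates)
```

In this example the three lowest-rate TMs pass the threshold, so the rule picks index 2, the fastest of the TMs predicted to succeed.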

B. Datasets Generation
In this work, we selected four scenarios that have diverse delay dispersion characteristics: urban micro-cell, suburban macro-cell, urban macro-cell, and rural macro-cell. Using the MATLAB WINNER II toolbox [10], 50,000 channels are generated for each scenario. For each channel, using the MATLAB IEEE 802.11ax toolbox, we simulate transmitting a packet with every TM and record the corresponding labels. We split the generated channels into 80% for training and 20% for testing.

C. Selection of Training Subdatasets using Different Delay Dispersion Criteria
The training subdatasets are constructed using two approaches: random selection criteria and different delay-spread-based selection criteria. Based on the random approach, Cases 1 & 2 are identified, and based on the delay-spread approach, Cases 3, 4, & 5 are identified.
1) The random selection criteria (Cases 1 & 2): The random approach is applied in the following two ways:
• Case 1, Random Full Dataset (RandomFD): all data points (i.e., a total of 160,000 data points; 40,000 data points from each of the four scenarios) are used as the training dataset.
• Case 2, Random Partial Dataset (RandomPD): the training subdataset is composed of data points selected randomly and equally from each scenario.
RandomFD represents a reference case where all data points are used for training, and RandomPD is the typical widely-used way of reducing the number of data points through random selection.
2) The delay-spread-based criteria (Cases 3, 4, & 5): The delay-spread-based selection approach is applied to select different training subdatasets, each of which has the same number of data points as RandomPD. Instead of being selected in a totally random fashion as in RandomPD, the goal here is to make the selection such that the data points of the built subdatasets experience the full delay dispersion behaviour of RandomFD in some sense. Using this approach, from the total of 160,000 available data points, we select the subdataset points such that the distribution of the delay dispersion metric is as close as possible to uniform.
Let RandomFD_i be the ith data point in the RandomFD dataset, and let S(RandomFD_i) be its corresponding delay dispersion evaluated with a specific metric of interest, S, for i = 1, 2, ..., I (where I is the total number of data points in RandomFD). Let min S(RandomFD) and max S(RandomFD) be the minimum and maximum delay dispersion values, respectively, among all the data points of RandomFD. We assume the interval [min S(RandomFD), max S(RandomFD)] to be divided into Z equal disjoint sub-intervals. We define the histogram of S(RandomFD) as the function that counts the number of delay-spread observations, n_z, that fall into the zth sub-interval, where z = 1, 2, ..., Z, and n_min and n_max are the minimum and maximum number of observations, respectively, obtained per sub-interval using the full dataset, i.e., RandomFD.
Then, our proposed delay-spread-based selection approach can be applied as follows. Select a subdataset from RandomFD such that the selected subdataset has a histogram, m_z, defined as in (2), where T is the total number of data points in the selected subdataset.
The value of x determines the maximum number of data points in each of the Z intervals, which results in selecting a subdataset whose histogram tends toward a uniform distribution of the delay dispersion behaviour over the [min S(RandomFD), max S(RandomFD)] range. The possibility of ending up with a perfectly uniform distribution increases as the number of data points in RandomFD increases.
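The selection procedure can be illustrated with a short sketch. It assumes one plausible reading of the text, namely that each of the Z sub-intervals keeps at most x randomly chosen points (m_z = min(n_z, x)), so that T is the sum of the kept counts. The function and parameter names are hypothetical, and the exact rule used in the paper may differ.

```python
import random
from typing import List, Sequence


def select_uniform_subset(spreads: Sequence[float],
                          num_bins: int,
                          cap: int,
                          seed: int = 0) -> List[int]:
    """Pick indices so the delay-spread histogram is as flat as possible.

    spreads[i] is the delay dispersion S of data point i, num_bins plays the
    role of Z, and cap plays the role of x (ASSUMED: at most `cap` points are
    kept per sub-interval, chosen at random within the sub-interval).
    """
    rng = random.Random(seed)
    lo, hi = min(spreads), max(spreads)
    width = (hi - lo) / num_bins or 1.0  # guard against identical spreads
    bins: List[List[int]] = [[] for _ in range(num_bins)]
    for idx, s in enumerate(spreads):
        z = min(int((s - lo) / width), num_bins - 1)  # max value goes in last bin
        bins[z].append(idx)
    selected: List[int] = []
    for members in bins:
        rng.shuffle(members)
        selected.extend(members[:cap])  # keep min(n_z, cap) points
    return sorted(selected)


# Ten points with a small delay spread and four with a larger one.
example_spreads = [0.0] * 10 + [1.0] * 3 + [2.0]
subset = select_uniform_subset(example_spreads, num_bins=2, cap=3)
```

With a cap of 3, the over-represented first sub-interval contributes only 3 of its 10 points, so both sub-intervals end up equally represented in the 6-point subset.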
Based on the applied delay-spread metric (i.e., S), which is our design criterion, we can now define the differences among Case 3, Case 4, and Case 5 of the studied cases.
• Case 3, root-mean-square delay spread Partial Dataset (rmsPD). In this case, the training dataset is selected using the delay-spread metric defined as the normalized second-order moment of the delay profile of the channels.
• Case 4, window (40%) delay spread Partial Dataset (W40%PD). In this case, we characterize the delay dispersion using the delay window parameter, which is defined as "the length of the middle portion of the power delay profile containing a certain percentage of the total power found in that impulse response" (p. 4, [11]). Here we use 40% as our design criterion.
• Case 5, window (70%) delay spread Partial Dataset (W70%PD). In this case, we use the same definition of the delay dispersion metric as in Case 4; however, here we use the window that contains 70% of the power of the delay profile.
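The two families of metrics used in Cases 3 to 5 can be sketched for a discrete power delay profile as follows. This is an illustrative computation under stated assumptions: the rms delay spread is taken as the square root of the power-weighted second central moment of the delays, and the delay window is taken so that the excluded power is split equally before and after the window. The function names and the tap values are hypothetical.

```python
import math
from typing import Sequence


def rms_delay_spread(delays: Sequence[float], powers: Sequence[float]) -> float:
    """Square root of the power-weighted second central moment of the delays."""
    total = sum(powers)
    mean = sum(d * p for d, p in zip(delays, powers)) / total
    second = sum((d - mean) ** 2 * p for d, p in zip(delays, powers)) / total
    return math.sqrt(second)


def delay_window(delays: Sequence[float], powers: Sequence[float],
                 fraction: float) -> float:
    """Length of the middle portion of the profile holding `fraction` of the power.

    ASSUMPTION: the power outside the window is split equally between the
    leading and trailing parts; taps must be sorted by increasing delay.
    """
    total = sum(powers)
    lower_target = total * (1.0 - fraction) / 2.0
    upper_target = total * (1.0 + fraction) / 2.0
    cumulative = 0.0
    start = None
    end = delays[-1]
    for d, p in zip(delays, powers):
        cumulative += p
        if start is None and cumulative >= lower_target:
            start = d  # first tap at which the leading excluded power is used up
        if cumulative >= upper_target:
            end = d    # first tap at which the window holds the target power
            break
    return end - start


# Hypothetical five-tap profile (delays in arbitrary time units).
taps_delay = [0, 1, 2, 3, 4]
taps_power = [1, 2, 4, 2, 1]
rms = rms_delay_spread(taps_delay, taps_power)
w50 = delay_window(taps_delay, taps_power, 0.5)
w90 = delay_window(taps_delay, taps_power, 0.9)
```

Calling `delay_window` with `fraction=0.4` or `fraction=0.7` would give the metrics corresponding to W40%PD and W70%PD under the same assumptions.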

III. PROPOSED DEEP LEARNING APPROACH FOR AMC
Convolutional neural networks (CNNs) have shown superior performance in different domains, including computer vision, natural language processing, and speech synthesis [12]. One main advantage of CNNs is their proven capability to process raw data, which eliminates the burden of data pre-processing. Inspired by this, we propose a CNN-based approach for AMC in IEEE 802.11ax. While all the work in the literature treats the problem as multiclass classification [13], this is the first work to tackle it as multi-label multi-class classification. In this case, each channel realization can belong to more than one class, which means that a packet can be successfully transmitted over this channel with more than one TM configuration.

A. CNN Model
The proposed deep convolutional neural network (DCNN) includes convolutional layers, average pooling layers, and fully connected layers. Specifically, the first hidden layer is a convolutional layer with 20 filters. The second hidden layer is a convolutional layer with 32 filters. The following layer is an average pooling layer with a pool size of 4. Then, another convolutional layer with 64 filters is added. After that, an average pooling layer with a pool size of 2 is used. The fourth convolutional layer consists of 32 filters and is followed by another average pooling layer with a pool size of 2. In all convolutional layers, every filter has a size of 10 × 2, with ReLU activation, F(x) = max(x, 0). After the 4 convolutional layers, there are 2 fully connected layers, which contain 50 and C neurons, respectively, where C is the number of available TMs. Since one CSI can belong to many classes at the same time, we use the sigmoid activation function, σ(x) = 1/(1 + e^(−x)) (3), in the output layer to approximate the multinomial distribution of the class labels. To mitigate overfitting, an l2 regularizer is added to the last two layers.
The Adam optimizer [14] is adopted to train the model along with our customized loss function (Section IV). The DCNN is trained for 1000 epochs with a batch size of 128. After training, the DCNN is deployed to predict the appropriate TM.
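As a rough aid for reading the architecture, the layer stack can be written down as data and the size of each convolutional layer estimated. The sketch below assumes a single input channel and one bias per filter, neither of which is stated in the text, and it does not attempt to model the input shape, padding, or the fully connected layers.

```python
from typing import List, Tuple

# Layer stack as described in the text: (kind, filters or pool size).
LAYERS: List[Tuple[str, int]] = [
    ("conv", 20),
    ("conv", 32),
    ("avg_pool", 4),
    ("conv", 64),
    ("avg_pool", 2),
    ("conv", 32),
    ("avg_pool", 2),
]
KERNEL = (10, 2)  # every convolutional filter is 10 x 2


def conv_param_counts(layers: List[Tuple[str, int]],
                      kernel: Tuple[int, int],
                      in_channels: int = 1) -> List[int]:
    """Trainable weights per convolutional layer.

    ASSUMPTIONS: `in_channels=1` for the CSI input and one bias per filter.
    Pooling layers have no trainable weights and are skipped.
    """
    counts = []
    channels = in_channels
    for kind, value in layers:
        if kind == "conv":
            counts.append((kernel[0] * kernel[1] * channels + 1) * value)
            channels = value  # the next layer sees one channel per filter
    return counts


conv_params = conv_param_counts(LAYERS, KERNEL)
```

Under these assumptions the four convolutional layers hold 420, 12,832, 41,024, and 40,992 weights, so most of the convolutional capacity sits in the third and fourth layers.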

B. Dataset Description
Consider a labeled dataset consisting of pairs of x and y. Here, x represents the different CSI in the different selection cases described in subsection II-C. The label vector y is a vector in {0, 1}^C, where C is the number of available TMs (i.e., the same as the number of available classes). If the ith position in the label vector of the jth data instance is set to one, a transmission over a channel with CSI equal to the jth CSI in the dataset using the ith transmission mode will be successful. In the same way, 0 indicates a failed transmission. In our experiments, the label vector is a 24-dimensional vector representing the different available combinations of MCS and GI.
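A small sketch of this label encoding is given below. The 24 classes are consistent with 12 MCS indices combined with the two guard intervals, but the ordering of the classes is not given in the text; the layout used here (all MCSs for the 0.8 µs GI first, then all MCSs for the 3.2 µs GI) and the helper names are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

GUARD_INTERVALS_US = (0.8, 3.2)
NUM_MCS = 12  # ASSUMED: MCS 0-11, giving 12 x 2 = 24 transmission modes
NUM_CLASSES = NUM_MCS * len(GUARD_INTERVALS_US)


def decode_tm(index: int) -> Tuple[int, float]:
    """Map a class index to (MCS, guard interval in microseconds).

    The index layout is hypothetical; the paper does not specify it.
    """
    gi_idx, mcs = divmod(index, NUM_MCS)
    return mcs, GUARD_INTERVALS_US[gi_idx]


def build_label_vector(successful: Iterable[int],
                       num_classes: int = NUM_CLASSES) -> List[int]:
    """Return y in {0, 1}^C with ones at every TM that transmitted successfully."""
    labels = [0] * num_classes
    for i in successful:
        labels[i] = 1
    return labels


# A channel on which three TMs succeed: MCS 0 and 1 at 0.8 us, MCS 0 at 3.2 us.
y = build_label_vector([0, 1, 12])
```

Because several entries of y can be one at the same time, a single channel realization carries information about every usable TM rather than only the best one.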

C. Evaluation Metrics
To evaluate the proposed model in the context of communication system efficiency, we applied two system-specific evaluation metrics, namely, data-rate loss (DRL) and number of retransmissions (NR). We define δ as in (4), where R(·) is a function that maps a TM to the data rate associated with this TM, TM_i is the optimal TM given in the dataset, and TM̂_i is the predicted TM. A positive value of δ means that the model predicts a higher-index TM than the optimal one, which implicitly incurs a retransmission. The number of retransmissions is given by the NR metric. A negative value of δ implies that the model predicts a suboptimal TM, which leads to a rate loss. The difference between the data rates of TM_i and TM̂_i is given by the DRL.
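The two metrics can be sketched as below. The sketch assumes that δ is the plain rate difference between the predicted and the optimal TM, which matches the sign convention in the text but may differ from the exact definition in (4), for example by a normalization. The function name and the rate values are illustrative.

```python
from typing import Sequence, Tuple


def evaluate_link_adaptation(optimal_rates: Sequence[float],
                             predicted_rates: Sequence[float]
                             ) -> Tuple[int, float, float]:
    """Return (NR, NR as a percentage of data points, accumulated DRL).

    ASSUMPTION: delta = R(predicted TM) - R(optimal TM). A positive delta is
    counted as a retransmission; a negative delta adds |delta| to the DRL.
    """
    retransmissions = 0
    rate_loss = 0.0
    for opt, pred in zip(optimal_rates, predicted_rates):
        delta = pred - opt
        if delta > 0:
            retransmissions += 1   # rate too high: the packet fails
        elif delta < 0:
            rate_loss += -delta    # rate too low: throughput is left unused
    nr_percent = 100.0 * retransmissions / len(optimal_rates)
    return retransmissions, nr_percent, rate_loss


# Illustrative rates in Mb/s for four test channels.
optimal = [25.8, 25.8, 17.2, 8.6]
predicted = [34.4, 17.2, 17.2, 8.6]
nr, nr_pct, drl = evaluate_link_adaptation(optimal, predicted)
```

In this example one prediction overshoots (a retransmission), one undershoots by 8.6 Mb/s (a rate loss), and two are exact.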

A. Why we need a customized loss
The traditional loss function used in multi-label multi-class classification problems is the crossentropy (5). In (5), C is the total number of classes, which equals the dimension of y. The function in (5) treats all wrong predictions equally, which does not suit the considered AMC problem. It pushes the model toward learning the true distribution of the class labels. Although this is the ultimate goal of any classification problem, in some cases we aim to emphasize a certain type of error (false positives or false negatives).
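For reference, the multi-label crossentropy is commonly written in the following binary form, with one term per class. This is the standard textbook expression rather than a verbatim copy of (5); in particular, whether the original includes a 1/C normalization is not known, and ŷ_i here denotes the sigmoid output for class i.

```latex
\mathrm{CE}(y, \hat{y}) = -\sum_{i=1}^{C} \Big[\, y_i \log \hat{y}_i + (1 - y_i) \log\big(1 - \hat{y}_i\big) \Big]
```

The first term inside the brackets penalizes false negatives (y_i = 1 with small ŷ_i) and the second penalizes false positives (y_i = 0 with large ŷ_i); both enter with the same weight, which is the symmetry the text refers to.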
Definition 1. The labels of a class i are denoted as Y_i. The positive instances of Y_i are denoted as Y_i^+ and the negative instances as Y_i^−. Also, the predicted labels for a class i are denoted as Ȳ_i. The positive instances of Ȳ_i are denoted as Ȳ_i^+ and the negative instances as Ȳ_i^−.
Definition 2. Given a classifier f, the false positives and false negatives are defined in terms of these sets: a false positive for class i is an instance in Ȳ_i^+ that belongs to Y_i^−, and a false negative is an instance in Ȳ_i^− that belongs to Y_i^+.
In the problem under consideration, a false positive in a higher-rate MCS may lead to a retransmission, which is very costly in terms of bandwidth utilization. A false negative, however, indicates selecting a lower TM, which can be tolerated more than a retransmission. For this reason, we aim to design a loss function that emphasizes false positive errors more than false negatives.
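The two error types can be made concrete for a single data instance with binary label and prediction vectors. This is a minimal sketch; the function names and the five-class example are hypothetical.

```python
from typing import List, Sequence


def false_positives(y_true: Sequence[int], y_pred: Sequence[int]) -> List[int]:
    """Classes predicted as valid although the transmission would fail."""
    return [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t == 0 and p == 1]


def false_negatives(y_true: Sequence[int], y_pred: Sequence[int]) -> List[int]:
    """Classes predicted as invalid although the transmission would succeed."""
    return [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t == 1 and p == 0]


# Five TMs ordered by increasing rate; only the first three are truly usable.
truth = [1, 1, 1, 0, 0]
guess = [1, 1, 0, 1, 0]
fp = false_positives(truth, guess)
fn = false_negatives(truth, guess)
```

Here the false positive at index 3 is the costly error: if the highest-rate predicted TM is used, the packet is sent at a rate the channel cannot support. The false negative at index 2 only costs throughput when it hides the best usable TM.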

B. Proposed Loss
We propose a new customized loss function that adds more penalization to false positive predictions. Since the proposed loss function emphasizes false positives, we name it Crossentropy+, CE+. The new loss is given by:
$$CE^{+}(y, \hat{y}) = CE(y, \hat{y}) + \varphi(y, \hat{y}) \tag{6}$$
where CE(y, ŷ) is the traditional crossentropy given in (5) and φ(y, ŷ) is an extra penalization term for the false positive predictions, given by (7). In (7), C is the total number of classes and β is a weight term added to control the credit assigned to the traditional crossentropy term and the newly added term. Setting β to a high value may lead the model to predict the vector ŷ = {0}^C, which minimizes the second term and completely ignores the first term. On the other hand, if we set β ≤ 1, the model may ignore the penalization term and learn parameters that minimize only the first term of (6). We set β = 1.3 for all the experiments in this work. In the future, however, a value for β could be learned to meet different QoS requirements (which may differ between a public WiFi network and a 5G URLLC network).
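The structure of such a loss can be sketched in plain Python. The crossentropy part follows the standard binary form, but the penalty term below is an assumed stand-in: it charges β times the predicted probability placed on labels whose true value is zero. The exact form of φ in (7) is not reproduced here, and the function names are hypothetical.

```python
import math
from typing import Sequence


def crossentropy(y: Sequence[int], y_hat: Sequence[float],
                 eps: float = 1e-12) -> float:
    """Standard multi-label (binary) crossentropy summed over the classes."""
    return -sum(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
                for t, p in zip(y, y_hat))


def false_positive_penalty(y: Sequence[int], y_hat: Sequence[float],
                           beta: float = 1.3) -> float:
    """ASSUMED penalty: beta times the probability mass put on negative labels."""
    return beta * sum((1 - t) * p for t, p in zip(y, y_hat))


def crossentropy_plus(y: Sequence[int], y_hat: Sequence[float],
                      beta: float = 1.3) -> float:
    """Crossentropy plus the extra false-positive term, mirroring the two-term form."""
    return crossentropy(y, y_hat) + false_positive_penalty(y, y_hat, beta)


labels = [1, 1, 0, 0]
fp_prediction = [0.9, 0.9, 0.9, 0.1]  # confident false positive on class 2
fn_prediction = [0.9, 0.1, 0.1, 0.1]  # confident false negative on class 1

ce_fp = crossentropy(labels, fp_prediction)
ce_fn = crossentropy(labels, fn_prediction)
plus_fp = crossentropy_plus(labels, fp_prediction)
plus_fn = crossentropy_plus(labels, fn_prediction)
```

Plain crossentropy scores the two mistakes (almost) identically, while the two-term loss scores the false positive higher, which is the asymmetry the section argues for. With this stand-in penalty, raising β widens the gap, in line with the tradeoff described in the text.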

V. SIMULATION RESULTS
We organize this section into two subsections: the prediction results of the CNN model using the different proposed delay-spread-based subdataset selection criteria, and the improved prediction results achieved when incorporating the proposed loss function.

A. Results of AMC using CNN
To determine the effect of the training set size, we trained the model with varying set sizes, namely, 10K, 20K, 30K, 40K, and 50K channels, for each selection criterion. We also consider a larger RandomFD dataset. For each training set, we test the model using three different scenarios, namely, suburban macro-cell (C1), urban macro-cell (C2), and rural macro-cell (D1). The urban micro-cell is not included in the analysis because all studied algorithms achieve almost the same results. Fig. 2 shows the percentage of retransmissions relative to the total data points in each test scenario. In terms of the different selection criteria, W40%PD obtains the best performance in all the test scenarios. Also note that, for all criteria, scenario D1 has a higher retransmission rate than both C1 and C2. The figure also shows that the RandomPD and rmsPD training subdatasets always yield a higher retransmission percentage than W40%PD and W70%PD. We can also notice that the performance greatly improves as the size of the training dataset increases. However, little or no improvement is recorded when increasing the size from 40K to 50K. The Vapnik-Chervonenkis (VC) dimension theorem [15] can explain this saturation behavior. According to the VC-dimension theorem, a model keeps learning better with more training data points up to a certain number, N_vc, after which the model capacity reaches a saturation point and adding more data points no longer improves the learning.
Fig. 3 shows the percentage of data rate loss due to deploying the CNN model with different training subdataset selection criteria. Recall from Section IV that a data rate loss happens when the model predicts a false negative at the index of the ideal TM. The figure shows an inverse trend between the retransmission rate and the data rate loss.
However, it is worth noting that, since the overall system performance is determined by both the rate loss and the retransmission rate, a reasonable rate loss is more tolerable than repeated retransmissions. We can see that W40%PD, which gives the best performance in terms of retransmissions, obtained around -3.1% rate loss in the worst case (scenario C2). Based on these observations, we can conclude that training a model on W40%PD gives the best retransmission performance with an acceptable rate loss. The proposed CNN approach also obtained near-optimal TM selection. However, we can still improve the performance of the proposed model by adapting the loss function, as we will see in the next subsection.

B. The Performance of the Proposed Loss-Function
To evaluate the performance of the proposed loss function, we trained a model with the traditional crossentropy and with our proposed loss function. To obtain a fair comparison, we used the same model capacity in the two cases. We also fixed all other hyperparameters (e.g., the same number of epochs, initialization, and optimizer). The results of training the model using the two loss functions are shown in Table I. This table shows the number of retransmissions in scenario C2. We selected this test scenario since it has the largest percentage of retransmissions compared with the other scenarios, as shown in Fig. 2. We can see that the proposed loss function has largely reduced the number of retransmissions under all selection criteria and dataset sizes. The proposed loss function achieved more than 50% improvement over the traditional crossentropy in some cases. Table I also shows the percentage of rate loss for each training set size. We can see that the rate loss using our proposed loss function is larger than that using the traditional crossentropy. Given that the model capacity is constant, this can be explained by the fact that reducing the false positives may increase the false negatives. However, depending on the specifications of the communication system in use (specifically, the cost of a retransmission compared to a rate loss), varying the value of β in (7) provides a wide range of tradeoffs for performance selection.

VI. CONCLUSION
A convolutional neural network framework for adaptive modulation and coding (AMC) is presented. The proposed framework is validated for adapting IEEE 802.11ax in outdoor scenarios. We model the AMC problem, for the first time, as a multi-label multi-class problem to predict the best available transmission modes. We showed that traditional loss functions are limited in solving such a problem. We proposed a new loss function that reduces the number of retransmissions while increasing the likelihood of selecting the ideal modulation and coding scheme (MCS). The proposed loss function proved to outperform the traditional crossentropy function. Empirically, we showed that the best throughput is obtained by applying the 40% window delay spread subdataset selection criterion.