Online machine learning algorithms to optimize performances of complex wireless communication systems

Abstract: Data-driven, feedback-cycle-based approaches are necessary to optimize the performance of modern complex wireless communication systems, and machine learning technologies can provide solutions for these requirements. This study presents a comprehensive framework for optimizing wireless communication systems and proposes two optimal decision schemes that have not been well investigated in existing research. The first is supervised learning modeling combined with optimal decision making by optimization, and the second is a simple and implementable reinforcement learning algorithm. The proposed schemes were verified through real-world experiments and computer simulations, which revealed the necessity and validity of this research.


Introduction
Recent advancements in wireless communication technologies have led to their widespread deployment and use worldwide. Along with this widespread usage, requirements and use cases are becoming more demanding and broader, leading to highly complex wireless communication systems.
Today's wireless communication systems are becoming large-scale networks in which numerous wireless devices are deployed in a rather small area, as in massive IoT, and network structures are becoming more heterogeneous: for example, multiple radio access technologies are used simultaneously in a mobile terminal, and multiple radio transmission ranges are deployed.
In this study, we discuss the application of machine learning technologies to wireless communication systems in terms of performance optimization. Existing research is classified according to the amount of information available and the method of deciding the optimal action. We then propose a new scheme for optimizing the performance of complex wireless communication systems using machine learning technologies. A second scheme, a simple reinforcement learning-based optimal decision method, is proposed to provide a feasible solution for mobile wireless devices with rather limited resources.
The rest of the paper is organized as follows. Starting with the issues of current and future wireless communication systems in Section 2, various approaches to optimal decision making by machine learning-based modeling are reviewed in Section 3, where the proposed schemes are also introduced. In Section 4, the first proposed scheme, supervised learning modeling with an optimization algorithm, is elaborated, and experimental results are presented. In Section 5, the second proposed scheme, a simple reinforcement learning-based optimal decision making method using a multi-armed bandit (MAB) algorithm, is elaborated, and experimental results are presented. Finally, Section 6 concludes the paper with some remarks.

Issues of current and future wireless systems
Owing to advancements in radio and communication technologies, today's wireless communication systems provide high-speed data transfer and wide-area communication links, but they have become very complex systems. Wireless systems, like wired systems, are generally composed of multiple layers with multiple functions, such as the physical layer, MAC layer, network layer, transport layer, and application layer, and each layer has its own protocols. This makes modeling wireless systems difficult. 5G systems [3] have been deployed in several regions of the world very recently. Compared to 4G/LTE or 3G, the major difference in 5G systems is that there is no single defining technology, such as orthogonal frequency division multiple access (OFDMA) in 4G/LTE or code division multiple access (CDMA) in 3G. A 5G network consists of a radio access network (5G NR) and a core network (5G CN). For 5G NR, physical transmission technologies such as OFDM(A) in higher frequency bands such as 60 GHz have been proposed, along with massive multiple-input-multiple-output (MIMO) with beam forming. For 5G CN, several network management technologies are required, such as network function virtualization (NFV), software-defined networking (SDN), edge computing, and network slicing. Along with these, the 5G system is more comprehensive: it can include a legacy 4G/LTE system as well as licensed assisted access (LAA) / licensed shared access (LSA), which assume usage of the industrial, scientific, and medical (ISM) frequency band and therefore require a channel access scheme, such as listen before talk (LBT), for sharing spectra with other wireless systems.
From the viewpoint of performance requirements, there are three major requirements for 5G systems: enhanced mobile broadband (eMBB), massive machine-type communications (mMTC), and ultra-reliable low-latency communications (URLLC). Fundamentally, these are different aspects of the requirements. Indeed, various independent use-case scenarios have been proposed based on them: rapid download of large data, such as movie data of several gigabytes (GB), by satisfying eMBB; management of massive numbers of mobile sensors, such as IoT devices, by satisfying mMTC; and telemedicine over an extremely low-latency wireless network by satisfying URLLC.
To satisfy these extreme requirements of 5G systems, high-performance devices and equipment are needed in wireless communication systems, which must also allow the management of low-end devices such as IoT sensors. All these requirements and use cases indicate that current and future wireless communication systems are becoming more and more complex and heterogeneous. The authors of [4] indicate that these viewpoints also apply to 6G and propose communication networks using artificial intelligence (AI).

Research challenge to optimize future wireless communication system
Numerous studies have focused on seeking the optimal decisions of wireless communication systems through mathematical and theoretical formulations of wireless channels and transmission power control [5][6][7][8][9][10], modulation and coding, and the behavior of the MAC protocol [11] or higher-layer protocols. The common approach of these studies is to define a mathematical model of the wireless communication system function, formulate a maximization or minimization problem, and obtain the optimal solution by solving that problem.
An advantage of these "classical" approaches is that theoretical optimal solutions or parameters, or at least their upper or lower bounds, can be obtained under the assumption of continuity of the formulated function.
In contrast, in terms of modeling wireless communication systems, the classical approach generally focuses on the performance of a certain layer, such as channel capacity in the physical layer or throughput in the MAC layer, and cannot cover whole-system modeling. Indeed, current and future complex wireless communication systems are hard to describe mathematically as a whole. Moreover, mathematical formulations of wireless systems cannot always be applied to time-varying situations. Wireless systems such as Wi-Fi or Bluetooth operate autonomously and interact with each other over time, which is beyond the reach of mathematical modeling. These points indicate that the classical approach faces fundamental difficulties in optimizing modern wireless communication systems.
In short, there are two issues in optimizing the performance of current and future complex wireless communication systems.
• Issue 1. Classical one-way optimal decisions are not suitable for complex time-varying situations.
• Issue 2. Classical modeling by mathematical formulation cannot be applied to multi-layer and multiprotocol complex wireless systems.

Cognitive radio technology
Cognitive radio [12,13] is a fundamental concept in wireless systems. It learns the environment and behavior of wireless communication nodes and performs trial-and-error to seek the best action to improve the performance. It can adapt to various changes in the environment, such as sudden increases or decreases in traffic demand, variations in wireless channels, and contention among wireless transmitters. Figure 1 shows the concept of cognition cycle [12]. The key idea of cognitive radio is the cognitive cycle [13]: learning and action, and their feedback cycle.
The concept of cognitive radio gives insights on facing the issues mentioned above. Important points are feedback cycle and learning. In the next section, we discuss those by applying machine learning technologies.

Figure 1. Concept of cognitive radio (the cognition cycle: observe, orient, plan, decide, act, and learn, interacting with the outside world).

Machine learning for wireless system
Machine learning technologies are becoming an increasingly important solution for various issues in modern society. They provide a strong tool for building whole-system models from large collections of data in wireless communication systems and networks. Machine learning technologies, including deep learning, have recently been increasingly researched in the field of communication technology [14][15][16][17]. Several studies indicate that the management of 5G/Beyond-5G systems requires machine learning technologies [18,19].
In reference [4], an AI-enabled intelligent architecture for 6G is proposed. The proposed architecture is divided into four layers: an intelligent sensing layer, a data mining and analytics layer, an intelligent control layer, and a smart application layer. Among these, the intelligent control layer consists of learning, optimization, and decision making. The authors indicate that the highly dynamic and complex networks of 6G cannot be optimized through traditional mathematical algorithms. Our research is based on the same viewpoint and proposes schemes to optimize such complex networks using machine learning technologies.
There are two points of view when using machine learning to optimize the decisions and actions of wireless communication systems. The first point is the amount of data. Supervised learning, especially deep learning and related methods, can deal with a large amount of data to extract the characteristics of the system from which the data are collected. If the amount of data is limited, the reinforcement learning approach would be more suitable: it allows optimal decisions to be made through iterated trial-and-error cycles in an environment with limited information and few controllable parameters.
The other point is how to decide the optimal action to achieve higher performance. There are two strategies: decision by a learning scheme or by an optimization algorithm, namely the maximization or minimization of a formula. If changes in the environment surrounding the wireless communication system are relatively slow, and if the relation between parameters and performance is continuous rather than discrete, then an optimization algorithm would be suitable. Conversely, if the relation between parameters and performance is discrete, the reinforcement learning approach would be suitable. Figure 2 shows the relation of these approaches and examples of their application to optimizing wireless communication systems. In this study, we investigate two panels of this figure. The lower-right panel is based on modeling by supervised machine learning and decision making through an optimization algorithm. The upper-left panel is simple reinforcement learning, for which we focus on a multi-armed bandit (MAB) problem formulation.

Classical theoretical model and optimal decision
The lower-left panel in Figure 2 indicates the "classical" theoretical model and optimization. It uses a mathematical formulation of wireless communication systems, usually focusing on a certain layer (or two layers, such as the physical and MAC layers). The formulation is then converted into an optimization problem: maximization or minimization of the formulated equation, whose solution yields the optimal parameters of the system. The obtained optimal parameters, usually under the assumption of continuity of the formulated function, are the theoretical optimal values. There are many good examples of this type of research, such as water-filling optimization of transmission power [20].

Deep-learning-based model and reinforcement-learning-based decision
If a large amount of data on wireless communication systems can be obtained and the best action is to be sought by learning, the approach can be a combination of modeling by supervised learning and decision making by reinforcement learning, as shown in the upper-right panel of Figure 2. Deep reinforcement learning (DRL) is a typical example of this approach.
Deep learning (DL) is a recently developed and rapidly spreading technique in various fields. It is an advanced form of the artificial neural network and a type of supervised learning. The first achievements of DL were in the field of computer vision. The technique has since been introduced into various layers of communication systems [21]. Early applications of DL were the estimation of propagation channel parameters [22,23] and device location estimation [24][25][26][27].
DRL is a combination of deep learning and reinforcement learning, as shown in Figure 3. DRL has been applied very recently in the field of wireless communication systems [28][29][30][31][32][33][34]. It can be seen as an implementation of the cognitive cycle: it learns the relationship among the environment, parameters, actions, and performance of the wireless communication node through deep learning, and it makes decisions by trial and error, seeking better actions through reinforcement learning. The major strength of DRL is that it builds a performance model by deep learning in an online manner and utilizes it to predict the performance of the system when certain parameters are deployed. Reinforcement learning, usually a Q-learning-based algorithm, is applied to seek better actions by evaluating the results, updating the network, and choosing better-predicted parameters using deep learning. Note that DRL always requires information on the state of the wireless communication system, which might be unrealistic in the real world.
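As an illustration of the reinforcement learning half of this combination, the following toy Python sketch applies a single-state tabular Q-learning update (rather than a deep Q-network, which would replace the table with a neural approximator) to choose among a few candidate parameter settings with noisy reward feedback. The reward values are invented for illustration and do not come from any experiment in this paper.

```python
import random

def q_learning_select(rewards, episodes=5000, alpha=0.1, eps=0.1, seed=0):
    """Toy single-state Q-learning: each 'action' is a candidate
    parameter setting whose mean reward (e.g., throughput) is
    given in `rewards`; returns the index of the learned best action."""
    rng = random.Random(seed)
    q = [0.0] * len(rewards)
    for _ in range(episodes):
        # epsilon-greedy action selection
        if rng.random() < eps:
            a = rng.randrange(len(q))
        else:
            a = max(range(len(q)), key=q.__getitem__)
        r = rewards[a] + rng.gauss(0, 0.1)   # noisy observed reward
        q[a] += alpha * (r - q[a])           # single-state Q update
    return max(range(len(q)), key=q.__getitem__)
```

In DRL proper, the table `q` is replaced by a deep network trained online, which is what allows the approach to scale to large, continuous state spaces.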

Supervised-learning-based model and optimal decision
DRL is a strong technique for seeking better decisions for wireless communication systems, as described in the previous section. However, there are some assumptions and limitations. First, it requires a large amount of data owing to the training required by deep learning, which leads to temporal training overhead in real-world implementations. Second, it uses reinforcement learning: even if the variables are continuous and could be optimized by minimizing or maximizing continuous functions, it is forced to proceed by trial and error. It may suffer from insufficient performance because the number of trials is finite in the real world. In other words, the DRL approach can underperform compared to classical mathematical formulation and optimization approaches, because the classical scheme yields theoretically determined parameters that achieve the maximum, or at least some upper bound of, wireless communication system performance, whereas it cannot be assured in general that the DRL approach will reach the theoretical maximum.
This discussion provides insights into a better way to use machine learning technologies to optimize wireless communication systems. What if some form of mathematical optimization could be applied to seek the best parameters, while machine learning is used as a tool to build the performance model of the wireless communication system? The answer proposed in this paper is a wireless communication system optimization method based on cognitive cycles using machine learning and an optimization algorithm. It uses a supervised learning algorithm to build the performance model of the wireless communication system and defines an optimization problem using this performance model as a function of the variables, observables, and performance of the system. Then, by solving that optimization problem, the optimal parameters are obtained. After taking actions according to the optimal parameters, the wireless communication system observes the results and then updates the performance model by machine learning. This feedback loop is an implementation of a cognitive cycle based on machine learning that takes advantage of classical mathematical formulation approaches. The details of the proposed scheme are elaborated in the following sections. Note that the word "optimization" is used here in the sense of mathematical optimization, i.e., the maximization or minimization of certain functions.

Simple reinforcement learning decision
The proposed cognitive cycle approach described in the previous section assumes that the communication nodes in the wireless communication systems to be optimized can obtain and control various observables and parameters. However, real-world wireless equipment, especially simple devices such as IoT sensors, generally cannot handle that many parameters. Indeed, typical IoT devices have only one or a few parameters to control, and similarly few observables. In addition, the computational resources and communication frequencies of such devices are limited. Therefore, simpler, more lightweight algorithms must be developed.
The multi-armed bandit (MAB) problem [35] is a simple machine learning problem whose algorithms can achieve good performance under a finite number of trials. It has been used in some areas of wireless communication systems, such as channel selection in cognitive radio [36,37] and resource allocation in 5G small cells [38], but the number of such studies has been limited. In addition, very few, if any, have validated the research experimentally by implementing it on wireless devices.
In this study, a lightweight, high-performance algorithm is proposed using the recently developed Tug-of-War (TOW) algorithm. Its performance is similar to that of well-known algorithms such as UCB1-tuned, while its implementation is very simple [39]. Notably, TOW does not require any information on the current state of the system.
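To make the idea concrete, here is a deliberately simplified, TOW-inspired two-armed bandit sketch in Python. It is not the exact algorithm of [39]: the oscillation term is omitted, and the weight ω is estimated from empirical reward rates. The channel reward probabilities are invented for illustration. The key property retained is that an arm's score drifts upward only if its reward rate exceeds the mean of the top two rates, so play concentrates on the best arm.

```python
import random

def tow_like_bandit(probs, steps=3000, seed=1):
    """Simplified TOW-style play: score +1 on reward, -omega on failure;
    always play the arm with the larger accumulated score."""
    rng = random.Random(seed)
    q = [0.0] * len(probs)                        # arm scores
    wins = [1.0] * len(probs)                     # smoothed reward counts
    plays = [2.0] * len(probs)                    # smoothed play counts
    for _ in range(steps):
        k = max(range(len(q)), key=q.__getitem__)
        r = 1 if rng.random() < probs[k] else 0   # e.g., packet success on channel k
        wins[k] += r
        plays[k] += 1
        top = sorted(w / n for w, n in zip(wins, plays))[-2:]
        omega = sum(top) / (2 - sum(top))         # TOW-style weight from top-two rates
        q[k] += 1.0 if r else -omega
    return q, plays
```

With this choice of ω, an arm's expected score increment is positive exactly when its success probability exceeds the average of the two best estimated rates, which is the mechanism that lets such a scheme track the better channel without any state information.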

Contribution of this study
In summary, we propose two novel approaches that differ from classical mathematical optimization and current state-of-the-art deep reinforcement learning. The first is a wireless communication system optimization method based on a cognitive cycle using machine learning: it uses machine learning to build the performance model of the system, obtains the optimal parameters through an optimization formulation, and updates the performance model in an online manner. The second is a simple reinforcement learning approach based on an MAB problem formulation; using a lightweight algorithm, it is more feasible to implement and operate on wireless devices such as IoT sensors. Through these proposals and discussions, this paper provides a new understanding that is useful for building strategies to optimize the performance of complex wireless communication systems.

Optimal decision making through the cognitive cycle using the supervised-learning-based model
Wireless systems have recently become increasingly complex, which makes it difficult to build a cross-layer model, as previously mentioned. The relations between radio variables and system performance have become further complicated. Consequently, the optimization of a whole wireless system through cross-layer modeling cannot be realized.
As discussed in Section 3, if sufficient data are available, using supervised machine learning technology can overcome the issue of modelling. The relationship between action and performance is learned by increasing the number of samples. Thus, the complex relations among various radio parameters and network performance can be obtained, which improves the precision of decision-making for the best performance.
This section proposes a wireless system optimization method based on the cognitive cycle using machine learning. Our approach uses an optimization algorithm to seek the optimal action, in contrast to the reinforcement learning used in DRL, as indicated on the right-hand side of Figure 2 and in Section 3.3.

Related works
Technologies for next-generation wireless networks, such as 5G, are a major topic in the field of wireless communication today. In reference [18], the possibilities of machine learning technologies for next-generation 5G networks are discussed. Supervised learning techniques can be used to support channel state estimation in MIMO systems. Unsupervised learning for cell clustering, especially in heterogeneous networks, and reinforcement learning for the decision-making process of mobile users have also been suggested. The authors of [40] discussed autonomic communications in future software-driven networks. In particular, they suggested the potential of machine learning in network optimization and the need to redesign more decentralized concepts.
In next-generation wireless networks, networks become heterogeneous. Licensed shared access (LSA) has been discussed in [41][42][43]: 5G network nodes can use not only licensed spectra but also unlicensed bands. S. Haykin discussed the comprehensive function of a cognitive dynamic system for organizing communications using both licensed and unlicensed bands [44], and the need for dynamic spectrum management using a cognitive dynamic system in 5G was discussed. In reference [45], the authors analyzed the performance optimization of heterogeneous cognitive wireless networks; a typical optimization problem of load balancing was analyzed in both the centralized and decentralized cases. In reference [46], the authors introduced machine learning in mobile terminals to optimize the aggregation method for IEEE 1900.4 [47] heterogeneous wireless networks and maximize throughput.

Cross layer modeling of wireless system
Several studies have attempted to understand the relationships between various variables and performance to optimize wireless networks [6,10,48,49,50]. These studies generally focused on the performance of a certain layer, such as channel capacity in the physical layer and throughput in the MAC layer, and do not cover higher-layer application throughputs. As an example of the optimization of wireless network capacity, we refer to the resource allocation problem in [6]. In principle, assuming ideal link adaptation, the sum capacity of a multicell wireless network can be expressed as

$C(\mathcal{U}, P) = \sum_{n=1}^{N} \sum_{u \in \mathcal{U}[n]} \log_2\bigl(1 + \Gamma_u(\mathcal{U}, P)\bigr),$

where $N$ is the total number of cells, $\Gamma_u$ is the signal-to-interference and noise ratio (SINR) at the receiver of user $u$, $\mathcal{U}$ is the set of users simultaneously scheduled across all cells, $\mathcal{U}[n]$ is the set of users in cell $n$, and $P$ is the transmit power of the scheduled users. Then, the capacity optimization problem by resource allocation is formulated as

$\arg\max_{\mathcal{U},\, P} C(\mathcal{U}, P). \quad (4.1)$

As noted in [6], this problem is nonconvex, so the solution is not straightforward; however, this equation represents the fundamental relations among radio variables and system performance.
In another example, in [10], the optimization problem of cooperative sensing in cognitive radio networks was analyzed. This is a sensing-throughput tradeoff problem: a strict sensing policy minimizes the possibility of interference to the primary user, although opportunities to gain more throughput are missed, and vice versa. The achievable MAC-layer throughput of the secondary users can be given as

$R(\tau, \varepsilon, K) = \frac{T - \tau}{T}\, P(H_0)\, \bigl(1 - P_f(\tau, \varepsilon, K)\bigr)\, R_0,$

where $\tau$ is the sensing time, $T$ is the total frame time (including the sensing time $\tau$), $K$ is the number of sensing results of sensor nodes ($1 \le K \le M$, where $M$ is the total number of sensor nodes), and $\varepsilon$ is the threshold parameter of the energy detector at the sensor node. $R_0$ is the ideal throughput of the secondary users if the primary user is always absent, $P(H_0)$ is the probability that the primary user is absent from the channel, and $P_f$ is the probability of a false alarm. Focusing on the maximization of the secondary users' throughput, a minimum probability of detection of the primary user is assumed, and the sensing threshold $\varepsilon$ can then be given as a function of $\tau$, $K$, and the received signal-to-noise ratio (SNR). Under this condition, the optimization of the throughput of the secondary users is formulated as

$\arg\max_{\tau,\, K} R(\tau, \varepsilon, K). \quad (4.2)$

Because the throughput depends on the probabilities of false alarm and detection, which depend on the SNR, Eq (4.2) can be expressed as a function of $\tau$, $K$, and the SNR. This formulation was examined by computer simulation, and the optimal values of $\tau$ and $K$ for a given SNR were obtained.
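To illustrate the shape of this tradeoff numerically, the following short Python sketch uses toy values that are not taken from [10]: a frame of T = 100 time units, P(H0) = 0.8, an ideal throughput of 6 Mb/s, and a schematic false-alarm curve exp(-0.5τ) that falls as the sensing time grows. A grid search then locates the sensing time that balances detection accuracy against lost transmission time.

```python
import math

T, P_H0, R0 = 100.0, 0.8, 6.0          # toy frame length, idle probability, ideal rate

def pf(tau):
    """Schematic false-alarm probability: decreases with sensing time."""
    return math.exp(-0.5 * tau)

def throughput(tau):
    """R(tau) = ((T - tau) / T) * P(H0) * (1 - Pf(tau)) * R0."""
    return (T - tau) / T * P_H0 * (1 - pf(tau)) * R0

# Very short sensing wastes the channel on false alarms; very long sensing
# wastes the frame on sensing. The optimum sits in between.
tau_opt = max(range(1, 100), key=throughput)
```

Under these toy numbers the optimum falls at a small but nonzero sensing time, reproducing the qualitative tradeoff described above.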
The formulations of the optimization problems (4.1) and (4.2) can be generalized as follows. Let the radio parameters be $p$ (such as $\mathcal{U}$ and $P$, or $\tau$ and $K$), the observed radio environment be $e$ (such as the SINR or SNR), and the system performance be $y$ (such as capacity or throughput). Then, they can be related as $y = f(p, e)$, where $f$ represents the relations among the radio parameters, environment, and performance. The optimization problem is then formulated as

$p^* = \arg\max_{p} U\bigl(f(p, e)\bigr), \quad (4.3)$

where $U(\cdot)$ is the utility function of the throughputs, for example, the summation of the expected throughput of each node. By solving the above equation, the optimal set of parameters $p^*$ required to maximize the network performance is obtained. This can be achieved if the relation between the inputs and the output is mathematically described.
In recent wireless systems, however, the situation has become more complicated. As mentioned above, modern wireless systems are equipped with various technologies at each layer. Some systems transmit signals on a single carrier with frequency hopping, and others on multiple carriers with OFDM. The channel access of one protocol is TDMA, and that of another is CSMA/CA. In general, wireless communication applications use higher-layer protocols such as IP, TCP, or UDP. Therefore, we need to consider various observables $e$ and parameters $p$. Moreover, the relations among these variables and network performance are hardly known, so the mathematical formulation of the function $f$ cannot be realized. Machine learning technologies, which are fundamentally data-driven modeling methods, can aid with this difficulty: using them, the hidden and complex relations among the various wireless observables, parameters, and network performance can be obtained. We therefore propose a generalized cross-layer model of wireless system performance using machine learning. In the proposed model, $U(\cdot)$ denotes the utility of the whole system performance, including the application, $p$ denotes the parameters of the various layers, and $e$ denotes the various observables. The optimization method using the proposed model is described in the next subsection.

Cognitive cycle and optimization for complex wireless systems
Cognitive radio is the concept of an intelligent radio that can learn from its past experience and autonomously decide its actions suitable for radio environments and needs for communication. The cognitive cycle [12] is a feedback cycle of observation, learning, decision making, and action. Haykin proposed a more concrete process of cognitive radio in [13] from an engineering perspective. He addressed the following fundamental tasks for a cognitive radio: radio-scene analysis, channel-state estimation, transmit-power control, and dynamic spectrum management. Wireless network nodes can change the radio parameters of transmission and reception to avoid interference among users and improve communication quality.
In general, wireless communication requires learning to establish wireless links and satisfy communication quality requirements. For example, a radio frequency (RF) module controls the coding rate based on the received signal strength indicator (RSSI) to reduce the error probability of wireless links; that is, the RF module learns the relationship between the inputs (RSSI, coding rate) and the output (link quality). In cognitive radio networks, the cognitive engine should determine and coordinate the actions of the cognitive radio based on learning of the environment. The relationship between inputs and outputs becomes more complicated in cognitive radio networks owing to their flexibility, as with software-defined radios. Cognitive radio can control various parameters such as frequency channel, coding rate, and transmission power, and the relationship between these parameters and the performance of wireless communication is hardly formulated. Thus, machine learning technologies, which can learn complex, non-linear relationships among various kinds of information, would be the solution. By combining the concept of the cognitive cycle and optimal decision making, we propose the concept of the entire system. Figure 5 elaborates the proposed wireless system optimization method using machine learning. The observables of the environment, $e$, are collected, which include not only the radio status but also MAC statistics or higher-layer statistics. For $p$, various parameters of the wireless node or network are considered. Besides these variables, the network performance $y$ is observed. Together they form a set of samples $\mathcal{D} = \{(e_i, p_i, y_i)\}$ for a machine learning algorithm. Using $\mathcal{D}$, the cognitive engine builds and updates the model $f$ by machine learning. The updating method depends on the type of algorithm: supervised learning uses $\mathcal{D}$ as training data, while unsupervised learning uses $\mathcal{D}$ for clustering or dimension reduction.
By solving the optimization problem (4.3), the cognitive engine decides the optimal action to adapt to the current situation. The solution of Eq (4.3), $p^*$, yields the optimal parameters for the communication entities. After deciding the optimal action, i.e., to use parameter $p^*$, the cognitive engine starts to reconfigure the wireless network, and the necessary information is sent to the communication entities.
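The cycle above can be sketched in a few lines of Python. This is illustrative only: `measure` stands in for the real observation step, the model is a trivial per-parameter running mean rather than a real supervised learner such as SVR, and the "optimization" is an exhaustive argmax over a small candidate set.

```python
class MeanModel:
    """Toy stand-in for the learned model f(p, e) -> y: predicts the
    mean performance observed so far for each parameter value."""
    def __init__(self):
        self.obs = {}

    def fit(self, samples):
        self.obs = {}
        for e, p, y in samples:
            self.obs.setdefault(p, []).append(y)

    def predict(self, p, e):
        ys = self.obs.get(p)
        return sum(ys) / len(ys) if ys else 0.0

def cognitive_cycle(measure, candidates, model, n_cycles=20):
    """Observe (e, y) under p -> update model -> argmax over candidates -> act."""
    samples, p = [], candidates[0]
    for t in range(n_cycles):
        e, y = measure(p)                     # observe environment and performance
        samples.append((e, p, y))
        model.fit(samples)                    # update the model from the sample set
        if t < len(candidates) - 1:
            p = candidates[t + 1]             # briefly try every candidate once
        else:
            p = max(candidates, key=lambda c: model.predict(c, e))  # exploit
    return p
```

In the actual proposal, fitting corresponds to updating the supervised model from $\mathcal{D}$, and the argmax corresponds to solving problem (4.3) with a proper optimization algorithm.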
In the following subsections, we describe examples of wireless communication systems used to evaluate our proposed method.

Application to IEEE 802.11 WLAN
We evaluate the proposed optimization method by applying it to the IEEE 802.11 WLAN. As an optimization scenario, we consider the parameter optimization of IEEE 802.11 stations (STAs) operated in infrastructure mode. Each STA, together with a cognitive controller connected to the access points (APs), has the functions of the cognitive engine described in the previous section and runs the cognitive cycle as follows.
Each STA measures the wireless environment $e$, obtains the current radio parameters $p$ and performance $y$, and then adds a sample to $\mathcal{D}$. Here, $e$ includes the radio status at the STA, such as the RSSI and wireless link quality; $p$ includes wireless parameters such as the transmit power, operating channel, and address of the connecting AP; and the uplink or downlink throughput is considered as the performance index $y$.
The cognitive engine in the STA updates the learning model using $\mathcal{D}$. We consider supervised learning for the evaluation: the cognitive engine builds a model that represents the relations among $e$, $p$, and $y$ from the training samples and then sends the information of the model to the cognitive controller.
The cognitive controller solves the optimization problem (4.3) using the model information from the STAs and obtains the optimal parameters $p^*$ of the STAs. The information on the optimal parameters is sent to the STAs to reconfigure the network. Each STA changes its wireless parameters according to $p^*$ and then continues the cycle, starting again from the measurement of the environment and performance.

Implementation of learning and optimization
We use support vector regression (SVR) as the learning algorithm, similar to [46]. SVR is an analog-output version of the support vector machine (SVM) [51]. In SVR, the estimation function can be expressed as [52]

$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i')\, k(x_i, x) + b,$

where $n$ is the number of training samples, $x_i$ is the input of the training samples ($e$ and $p$), $x$ is an unknown input for the learning algorithm, and $k$ is a kernel function. The coefficients $\alpha_i$, $\alpha_i'$, and $b$ are unknown parameters obtained by the optimization technique proposed in [52] using the training samples. We formulate the optimization problem (4.3) in the evaluation as

$p^* = \arg\max_{p \in \mathcal{P}} \sum_{j=1}^{N_s} \log \hat{y}_j(p_j, e_j),$

where $N_s$ is the number of STAs, $\mathcal{P}$ is the possible parameter set for the STAs, $e_j$ is the currently measured quality of the radio environment at STA-$j$, and $\hat{y}_j(p_j, e_j)$ is the estimated throughput of STA-$j$ obtained using the throughput model described above. Here, we use the logarithmic utility function of throughput, considering fairness among STAs: STAs with lower throughput have relatively larger gains in the objective function than those with higher throughput.
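The structure of this log-utility objective can be illustrated with a small self-contained Python sketch. Here `predict(j, p, e)` stands in for the per-STA SVR throughput model (the throughput table below is invented), and the search is exhaustive over a tiny joint parameter space; in the experiments, a PSO solver plays this role instead.

```python
import itertools
import math

def optimal_params(predict, param_sets, envs):
    """Maximize sum_j log(y_hat_j(p_j, e_j)) by exhaustive search
    over all joint parameter combinations."""
    best, best_u = None, -math.inf
    for combo in itertools.product(*param_sets):
        u = sum(math.log(max(predict(j, p, envs[j]), 1e-9))  # log-utility for fairness
                for j, p in enumerate(combo))
        if u > best_u:
            best, best_u = combo, u
    return best

# Two STAs choosing between channels 1 and 6; predicted throughputs in Mb/s.
table = {(0, 1): 2.0, (0, 6): 5.0, (1, 1): 4.0, (1, 6): 1.0}
choice = optimal_params(lambda j, p, e: table[(j, p)], [[1, 6], [1, 6]], [None, None])
```

Note how the logarithm rewards the assignment that avoids starving either STA, which is the fairness property motivating the choice of utility.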

Experiments using IEEE 802.11 devices
We implemented the method on IEEE 802.11 WLAN devices. The experiments were conducted in our university laboratory working space [53,54].
The IEEE 802.11 WLAN APs and STAs are operated in the 2.4 GHz ISM band. Laptop PCs with Ubuntu 14.04 are used as both STAs and APs. In each cognitive cycle, the STA observes the delay and packet loss ratio through pinging, the RSSI from its connecting AP using the iwconfig command, and the number of packets around the STA using the tcpdump command as the link quality (q), and measures the throughput (y) using the iperf command over TCP. The STA sets the transmission power, the channel number (from 1 to 13), and the physical-layer data rate (from 6 to 54 Mb/s) as the current wireless parameters (p).
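The observation step can be sketched as a small parser over the command output. The field patterns below ("Link Quality=x/y", "Signal level=-NN dBm") follow typical wireless-tools `iwconfig` output and are assumptions of this sketch, not the exact parsing used in the experiments.

```python
import re

def parse_link_quality(iwconfig_output):
    """Extract the RSSI and normalized link quality from `iwconfig` output.

    Field patterns are assumptions based on typical wireless-tools output.
    """
    rssi = re.search(r"Signal level=(-?\d+) dBm", iwconfig_output)
    qual = re.search(r"Link Quality=(\d+)/(\d+)", iwconfig_output)
    return {
        "rssi_dbm": int(rssi.group(1)) if rssi else None,
        "link_quality": int(qual.group(1)) / int(qual.group(2)) if qual else None,
    }
```

In a running system the input string would come from invoking `iwconfig` via a subprocess in each cognitive cycle.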
The STA then builds the throughput model through SVR and sends information regarding the SVR model to its connecting AP. The AP sends it to the cognitive controller. We set up one of the APs as the cognitive controller, which calculates the optimal set of STA parameters p* and returns the result to the AP; the STA then obtains the result from its connecting AP. To reduce the calculation cost of solving the optimization problem, we use the particle swarm optimization (PSO) algorithm [55,56] at the cognitive controller.
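The PSO step at the cognitive controller can be sketched as follows. This is a generic, minimal PSO maximizing a function over a continuous box; the hyperparameters (inertia `w`, acceleration coefficients `c1`, `c2`, swarm size) are common defaults, not the values used in the experiments.

```python
import random

def pso_maximize(f, bounds, n_particles=20, iters=60, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimization over a box, maximizing f(position)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # personal best positions
    pbest_val = [f(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # global best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp to the feasible box after the velocity update.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], bounds[d][0]), bounds[d][1])
            val = f(pos[i])
            if val > pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val > gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

In the scheme above, `f` would be the estimated log-utility over the candidate STA parameters, with discrete parameters handled by rounding or enumeration.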
In the experiment, three APs and nine STAs are operated on channels 1, 6, and 11 in IEEE 802.11g. The operating channel is fixed for each AP. The locations of all APs and STAs are fixed during the experiment. We use uplink TCP throughput to evaluate the performance, since uplink traffic generally makes radio resource usage more competitive in CSMA/CA. We also add background UDP traffic of approximately 8 Mb/s on channel 11. To verify the performance of the proposed system, the uplink throughput is compared with that of other algorithms, focusing on the selection of the connecting AP at the STA, as follows: (A) selection by RSSI, (B) random selection, (C) selection by radio resource utilization, and (D) balancing the number of STAs as equally as possible among channels. In algorithm (A), the STA selects the AP with the highest RSSI; this appears to be the most popular method among devices on the market. In algorithm (C), the STA selects the AP of the channel on which the minimum number of packets is observed in each cycle. In each algorithm, each cycle runs for 30 s. All STAs start iperf traffic of 2 s at the same time in each cycle. Before starting the proposed method, the STA observes the radio environment in each channel for 1 h and uses the observations as training data.

Figure 7 compares the average throughput per channel among the algorithms. The utilization-based algorithm (C) shows a higher throughput on channel 6, which is detected as the most vacant channel. However, the throughputs on the other channels are much lower. This algorithm is based on observations of the wireless environment, but it neither learns nor optimizes the entire system.
In contrast, the proposed method, which has a function of learning and optimization, shows higher throughput at channels 1 and 6 and lower throughput at channel 11, which has a higher background traffic. As a whole, the proposed method can improve the network performance. These results indicate that the proposed method can build an appropriate throughput model through learning and can select the optimized wireless parameters that improve the overall network performance.

Evaluation by computer simulation
We also conducted a computer simulation for an extended evaluation of the proposed method [54]. The basic implementation is the same as that in the experiments already shown, and the binary programs for learning and optimization were the same as those in the experimental devices. The network simulator QualNet 7.4 [57] was used as the simulation platform. The number of STAs was 21, the number of APs was 3, and the operating channels were 1, 6, and 11. The variables of the learning sample (q, p, y) were the same as those in the laboratory experiment. The STAs in the proposed system send uplink TCP traffic with two types of offered loads. The background traffic is generated as constant bit rate (CBR) traffic, and the offered load on channel 6 is smaller than those on channels 1 and 11. The detailed simulation settings are shown in Table 1. As background communication nodes, three APs in each channel, five STAs in channel 1, one STA in channel 6, and seven STAs in channel 11 are set. Background CBR traffic adds 500 Kbps per STA in channel 1, 100 Kbps in channel 6, and 500 Kbps per STA in channel 11.
The main difference in settings from those of the experiments is the offered load variation of the STAs in the proposed system. Similar to the experimental results, the computer simulation results show an improvement from introducing the proposed method, as shown in Figures 8 and 9. Figure 9 shows that the proposed cognitive cycle using machine learning can optimize the choice of channel according to the formulation of Eq (4.6).

Application for space communication
The proposed supervised learning-based optimization scheme is also applied to space communication. Figure 10 shows a communication system in space, inspired by the use case indicated in [58]. One characteristic of space communication that differs from terrestrial wireless communication is the large communication delay, whose main component is the propagation delay. The distance from the Earth to the Moon, for example, is 384,400 km on average, so bidirectional wireless communication experiences a round-trip delay of at least 2.56 seconds. This delay scale has to be handled at layers higher than the physical or MAC layers, namely at the transport layer with protocols such as TCP. However, physical and MAC parameters, such as the MCS and the transmission power, also have to be taken into consideration. This situation is similar to that of the application to IEEE 802.11 devices shown in the previous subsection. Therefore, in this subsection, the application of the proposed supervised learning-based optimization scheme to space communication is examined.

Table 2 shows the algorithms for evaluation. These correspond to the right quadrants of Figure 2, i.e., the amount of available information is large and the decision is made by a learning or optimization algorithm. A desktop computer running Ubuntu 17.10 is used to emulate space communication by adding communication delay with the tc command. For the application traffic, an image file of 10 MByte is transferred by a socket program. The duration of the transfer is monitored to obtain the throughput. Parameters such as the delay caused by radio propagation in space and other processing factors, as well as the packet error rate, are variable. To train each algorithm, several pre-training rounds with randomly sampled parameters were conducted before the experiments and used by each algorithm to build its model. Table 3 shows the other parameter settings.

Implementation and evaluation
As a reference algorithm, deep reinforcement learning, as previously shown in Figure 3, is examined through experiments. The training data for all algorithms are obtained through 500 rounds before the experiment. The training data are composed of the time for the file transfer of 10 MB, the observed round-trip time (RTT), the selected TCP algorithm, the MTU, and the MCS. Parameters are randomly selected for generating the training data. The neural network is composed of three fully connected layers, including two hidden layers with 7 and 50 neurons, respectively, as described in the reference research [59]. Figure 11 shows the throughput results of these algorithms with a communication distance of 390,000 km, which roughly corresponds to the distance between the Earth and the Moon (384,400 km); the throughput performances show the superiority of the proposed algorithm. The percentage values show the relative throughput increase or decrease with respect to that of the deep reinforcement learning (DRL) algorithm (deep learning with the ε-greedy algorithm). The proposed algorithm (c), using support vector regression (SVR) modeling and an optimization algorithm (PSO), shows an 18% increase in throughput.
In contrast, supervised learning modeling with a reinforcement learning decision (b), i.e., SVR and MAB with the ε-greedy algorithm, shows lower throughput than both DRL (a) and the proposed scheme (c). Figure 12 shows the TCP and MTU parameter selections of each algorithm. It suggests that the proposed algorithm selects the proper TCP algorithm (BBR) and MTU values around 1200, while the other algorithms do not, which accounts for the difference in throughput.

In the context of complexity and real-time issues, the authors of [61] propose a modified deep reinforcement learning-based optimization for beamforming, introducing post-decision state learning to improve the learning speed. Regarding our proposed scheme, SVR can in general be deployed with a rather limited amount of data compared to deep learning. On the other hand, the complexity of solving the optimization problem grows as the number of parameters to optimize increases. Therefore, using a mathematical optimization algorithm, such as PSO in this example, is recommended to reduce the calculation cost when the scheme is implemented on real devices.

Simple reinforcement-learning-based decision making
This section elaborates on the proposed approach of simple reinforcement-learning-based decision making, as previously indicated in the top left of Figure 2 and in Section 3.4. We formulate the problem of seeking better actions with simple reinforcement learning as a multi-armed bandit (MAB) problem and demonstrate the effectiveness of a novel MAB algorithm called the tug-of-war (TOW) model. We examined our approach in two cases: the first is network selection in heterogeneous wireless networks [62], and the second is channel selection in massive IoT. Experimental results are shown through implementation on real devices or through computer simulations.

Wireless system optimization as an MAB problem
The MAB problem [35] is a simple machine learning problem in which a player attempts to obtain the maximum reward from multiple slot machines. The aim of the MAB is to decide which slot machine should be selected to obtain the maximum reward through finite trials, under the assumption that the player has no prior information about any slot machine. The player starts by gathering information, trying as many slot machines as possible. Then, the player estimates which slot machine has the highest expected reward and selects that slot machine to play. Through this process, the player obtains more rewards. There is a trade-off between exploitation and exploration: if the player spends a long time on estimation, the rewards can be estimated more precisely, but the time left for playing the selected slot machine becomes short; if the player spends only a short time on estimation, there is more time to play the selected slot machine, but its reward might be low. Figure 13 shows the concept of optimal decision making for the MAB problem in wireless communication systems.
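As a concrete baseline for this exploration-exploitation trade-off, a minimal ε-greedy bandit can be sketched as follows; the arms, ε, and round count are illustrative.

```python
import random

def epsilon_greedy(arms, rounds=1000, eps=0.1, seed=0):
    """Baseline ε-greedy bandit: with probability eps explore a random arm,
    otherwise exploit the arm with the best empirical mean reward."""
    rng = random.Random(seed)
    n = len(arms)
    counts, sums = [0] * n, [0.0] * n
    for _ in range(rounds):
        if rng.random() < eps or 0 in counts:
            arm = rng.randrange(n)                                   # explore
        else:
            arm = max(range(n), key=lambda i: sums[i] / counts[i])   # exploit
        reward = arms[arm](rng)   # each arm maps a RNG to a stochastic reward
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums
```

Over many rounds, the arm with the higher expected reward accumulates most of the plays, while the ε fraction of random plays keeps the estimates of the other arms from going stale.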

Multi-armed bandit algorithm and tug-of-war model
Several algorithms have been proposed to solve MAB problems [63][64][65], such as the ε-greedy algorithm, the softmax algorithm, and the UCB1-tuned algorithm. Although the UCB1-tuned algorithm is known as the best among parameter-free algorithms, the TOW model [39,[66][67][68]] has approximately the same performance. The calculation in the TOW model does not require a variance, unlike the UCB1-tuned algorithm; it is therefore suitable for mobile devices such as those in the IoT [69]. The TOW model can also adapt to variable environments where the reward probability changes dynamically. The authors of [70] discuss the issue of adapting to uncertain and dynamic environments using reinforcement learning, as well as the complexity of reinforcement learning algorithms. Unlike the Q-learning algorithm and others, the TOW model does not require state information, which reduces the complexity. Indeed, the TOW model has been confirmed to work well in real time on an IoT testbed under such an environment [69]. Therefore, the TOW model is well suited to the decision-making problem in cognitive terminals.
The TOW model is a multi-armed bandit algorithm inspired by the behavior of amoeboid organisms. Unlike other algorithms that estimate the reward probability of each slot machine separately, TOW dynamics use a unique learning method that is equivalent to updating the estimates of all machines simultaneously based on the volume conservation law. In the TOW model, the decision is made according to the displacements of imaginary volume-conserving objects, which increase or decrease along with rewards, as shown in Figure 14. The TOW model for two machines is formulated as follows. Imagine that the player plays slot machine A or B at each time step. When playing machine A, if the player receives a reward, 1 is added to an estimator Q_A; otherwise, Q_A is decreased by ω (punishment). After playing until time step t, the displacement of machine A, X_A (= −X_B), is expressed as follows:

X_A(t) = Q_A(t) − Q_B(t) + δ(t),   (5.1)

Q_k(t) = N_k(t) − (1 + ω) L_k(t),   (5.2)

where δ(t) is a fluctuation, N_k(t) (k ∈ {A, B}) is the number of times machine k has been played until time t, L_k(t) counts the number of punishments received when playing machine k until time t, and ω is a weighting parameter described below. Let the reward probabilities of the machines be P_k (k ∈ {A, B}). Considering the ideal situation in which the sum of the reward probabilities γ = P_A + P_B is known to the player, the expected difference of the estimates in this ideal situation can be compared with the expected difference between Q_A and Q_B obtained from Eq (5.2); equating the two yields the nearly optimal weighting parameter in terms of γ:

ω = γ / (2 − γ).   (5.7)

If the number of machines is K (K > 2), ω = γ/(2 − γ) is used with γ = P_1 + P_2, where P_1 and P_2 are the first- and second-highest reward probabilities, respectively [67]. Then, Eq (5.1) can be expressed as follows:

X_k(t) = Q_k(t) − (1/(K − 1)) Σ_{j≠k} Q_j(t) + δ_k(t),   (5.8)

where δ_k is the fluctuation of each slot machine. The player selects the machine with the highest X_k(t). In this paper, we use an oscillating fluctuation δ_k(t) of amplitude A whose phase is shifted equally among the machines.
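A simplified sketch of the TOW decision rule follows: it applies the +1/−ω estimator update, implements the volume conservation by subtracting the mean of the other estimators, and replaces the exact fluctuation term with a phase-shifted cosine oscillation, which is an assumption of this sketch rather than the paper's precise form.

```python
import math
import random

def tow_select(reward_fn, n_arms, rounds=1000, omega=1.0, amp=1.0, seed=0):
    """Simplified tug-of-war (TOW) bandit sketch.

    Estimators receive +1 on reward and -omega on punishment; the
    phase-shifted cosine fluctuation is an assumption of this sketch.
    """
    rng = random.Random(seed)
    Q = [0.0] * n_arms
    counts = [0] * n_arms
    for t in range(rounds):
        # Displacement X_k = Q_k - mean of the other estimators + fluctuation.
        X = [Q[k] - sum(Q[j] for j in range(n_arms) if j != k) / (n_arms - 1)
             + amp * math.cos(2 * math.pi * (t + k) / n_arms)
             for k in range(n_arms)]
        arm = max(range(n_arms), key=lambda k: X[k])
        counts[arm] += 1
        if reward_fn(arm, rng):
            Q[arm] += 1.0        # reward
        else:
            Q[arm] -= omega      # punishment
    return counts
```

Because every displacement depends on all estimators, a single play effectively updates the standing of every arm at once, which is the volume-conservation property described above.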

Application of the MAB algorithm to wireless network selection
For the first example, we apply the MAB problem and the TOW algorithm to wireless network selection in a cognitive mobile terminal [62]. Figure 15 shows the concept of reward in the MAB problem in this model. In this example, three networks, Wi-Fi 1, Wi-Fi 2, and LTE, are available at the cognitive mobile terminal. The mobile terminal evaluates each network based on the reward corresponding to the performance of the network, such as its throughput.

Background
Recent wireless devices, such as smartphones, are equipped with multiple wireless interfaces, and various wireless networks, such as 3G, LTE, and Wi-Fi, are available. Moreover, just as with access points in Wi-Fi, a mobile terminal can choose from multiple access networks. Ideally, in such a heterogeneous network, a user may choose the best wireless network by gathering information from each network. Several studies have focused on heterogeneous wireless network selection, with two types of approaches: network-centric and user-centric. In the network-centric approach [71][72][73], the selection decision is performed by a central controller, under the assumption that detailed information on the status of each network is available at the controller. However, it is difficult to exchange information among various wireless networks that operate independently. Therefore, in this section, we focus on the selection of wireless networks at mobile terminals.
Several studies have investigated wireless network selection at the mobile terminal [74][75][76][77]. Most of them require considerable information about the networks or substantial computational capacity. However, it is not always possible for mobile devices to gather information from all networks or to spare battery resources for complex calculations. For heterogeneous network selection, it is important to seek the best solution as well and as fast as possible using limited information about the networks, while suppressing the complexity of the calculations for making decisions. These constraints and requirements are similar to those of the MAB problem, in which a player of slot machines attempts to obtain the maximum reward through finite trials. Therefore, the MAB problem approach can help in heterogeneous network selection.
The major challenges in the selection of wireless networks at mobile terminals in heterogeneous environments are as follows.
• Efficient decisions must be made in situations where only a small amount of information regarding each network is available.
• A practical algorithm that can be implemented on resource-constrained mobile devices is required.
To overcome these issues, several studies have investigated algorithms and their performances. In reference [74], a non-cooperative game formulation and analysis were provided for Wi-Fi and cellular network selection at a mobile terminal. Computer simulations showed that the game can converge to Nash equilibria. However, the assumption that the mobile terminal can obtain information from other mobile users does not always hold. In reference [77], a reinforcement learning solution and simulation analysis were provided for heterogeneous cellular networks. Although the simulation results showed fast convergence and the suppression of overheads, the approach requires feedback information from the networks, which is only feasible in cellular networks. It is important for the mobile terminal in a heterogeneous network to select a better network without coordination from the networks, while suppressing the complexity of the calculations for making decisions; as noted above, these constraints and requirements closely resemble those of the MAB problem. We therefore propose a wireless network selection technique based on the MAB problem. As described in the previous section, we use the TOW algorithm to solve the MAB problem. Figure 16 shows the system model of the proposed wireless network selection. The mobile terminal, capable of connecting to various types of wireless networks i (i ∈ {1, 2, ..., K}), runs the TOW algorithm. It observes the performance y_s of network s, where s is the selected network. y_s can be any index of performance, such as the throughput, delay, or other metrics of the wireless network. The TOW algorithm then judges whether a reward or a punishment is to be given by evaluating the performance of network s. If a reward is given, it updates the estimator Q_s as Q_s + 1; otherwise, it updates the estimator as Q_s − ω. Then X_i, the displacement of each network, is updated as shown in Eq (5.8).
Note that all X_i (i ∈ {1, 2, ..., K}), and not only that of the selected network s, are updated here. The algorithm in the mobile terminal is as follows:

Heterogeneous network selection as an MAB problem
1) Start by observing the performance of each wireless network i; all networks are monitored at least once.
2) Update Q_s and all X_i based on the observation of network s, using Eqs (5.2) and (5.8).
3) Select the wireless network i* with the highest X_i.
4) Observe the performance of the selected network and decide whether a reward or a punishment is to be given.
5) Return to 2) and continue.

Implementation of the proposed scheme
To validate the proposed method in a heterogeneous wireless network environment, we implemented the proposed algorithm on a wireless device and performed experiments. The TOW algorithm was installed on Ubuntu Linux on a laptop PC serving as the cognitive mobile terminal, equipped with both 802.11n/ac (2.4 GHz and 5 GHz) and LTE communication modules. To judge the reward or punishment (1 or −ω) in Eq (5.2), we used the average throughput of the wireless networks as a threshold: if the observed throughput of wireless network i is larger than the average, a reward is given for Q_i; otherwise, a punishment is given. We use the sum of the first- and second-highest reward probabilities among the networks as γ in Eq (5.7). The experiments were conducted in and around the university building. The mobile terminal selects the wireless network from two Wi-Fi networks (2.4 GHz and 5 GHz) of the university infrastructure and LTE to communicate with the server.
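The selection loop of steps 1)-5) with the running-average reward threshold can be sketched as follows. Here `observe_throughput` is a placeholder for the iperf measurement, the fluctuation is approximated by small uniform noise, and the parameter values are illustrative.

```python
import random

def select_network(observe_throughput, networks, rounds=200, omega=0.8, seed=0):
    """Steps 1)-5): probe every network once, then repeatedly pick the
    network with the largest estimator (plus a small random fluctuation)
    and reward it when its throughput beats the running average."""
    rng = random.Random(seed)
    Q = {n: 0.0 for n in networks}
    history = []
    for n in networks:                            # step 1: probe each network
        thr = observe_throughput(n, rng)
        history.append(thr)
        avg = sum(history) / len(history)
        Q[n] += 1.0 if thr >= avg else -omega
    selected = networks[0]
    for _ in range(rounds):
        # step 3: highest estimator wins; uniform noise stands in for the fluctuation
        selected = max(networks, key=lambda n: Q[n] + rng.uniform(-0.5, 0.5))
        thr = observe_throughput(selected, rng)   # step 4: observe performance
        history.append(thr)
        avg = sum(history) / len(history)         # running-average threshold
        Q[selected] += 1.0 if thr >= avg else -omega  # step 2: reward/punish
    return selected, Q
```

With persistently faster networks, the rewarded estimator pulls ahead of the others, so the loop settles on the best network while the noise term preserves occasional exploration.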

Experimental setup
We used the iperf command to observe the throughput. The parameter settings, including the amplitude of the fluctuation, are listed in Table 4. To verify the performance in heterogeneous environments, we examined the throughput at each location. Figure 18 shows the average throughputs of each wireless network and of the proposed TOW algorithm. Each experiment was repeated three times, and the throughputs shown are the averages. Locations (c-1) and (c-2) are both outside the building but differ in the Wi-Fi traffic situation: (c-1) is more crowded. The proposed TOW algorithm achieves a high average throughput compared with the individual wireless networks. In the laboratory room (a) and inside the building (b), the throughput of the proposed system is as high as that of Wi-Fi at 5 GHz and at 2.4 GHz, respectively. In contrast, outside the building at (c-1) and (c-2), where the signal strength from the Wi-Fi access point becomes much lower, the throughput of the proposed system is as high as that of LTE. Moreover, as shown in case (c-2), the proposed algorithm achieves an average performance as high as that of LTE even where the differences in performance among the networks are rather small. These results show that the proposed algorithm can accurately estimate the reward probabilities of the wireless networks.

Application of the MAB algorithm to dynamic channel selection in IoT devices
In this subsection, another example of a simple reinforcement learning-based optimal decision, dynamic channel selection in IoT devices, is presented. The major challenge in selecting wireless channels autonomously at the mobile terminal in massive IoT use cases is making efficient decisions in situations where little or no information about each mobile node is available. It is also challenging to find a practical algorithm that can be implemented on the resource-constrained mobile devices of the IoT. Lai et al. modeled a cognitive radio as an MAB problem [36,37]; in their model, the channel selection of a cognitive radio is defined as an MAB problem under the assumption of a probabilistic vacancy of each channel. In a previous work [78], TOW applications for channel selection in wireless LANs were proposed, providing efficient dynamic spectrum sharing for cognitive radio. In this subsection, we show another application of TOW in massive IoT, evaluated through computer simulation experiments.

Figure 19 shows the proposed wireless channel selection in the massive IoT scenario. The devices use one of the wireless channels #c (c ∈ {1, 2, ..., C}), which is decided by the TOW algorithm in each device. The acknowledgement frame (ACK) received upon successful communication serves as the reward of the TOW algorithm.

Simulation setup
To verify the proposed simple reinforcement learning algorithm in a massive IoT scenario, network simulation using ns-3 [79] was conducted with the LR-WPAN model. The settings of the simulation are shown in Table 5. Two simulation scenarios, A and B, were considered, as described in Table 6. Note that the association between the device and the coordinator is simplified in the simulation (the message passing is omitted). The performance index of this simulation is the frame success rate (FSR), i.e., the ratio of the number of received packets to the total number of transmitted packets, as shown in Eq (5.10).
FSR = (Σ_{n=1}^{N} Σ_{c=1}^{C} R_{n,c}) / (Σ_{n=1}^{N} Σ_{c=1}^{C} T_{n,c}),   (5.10)

where N is the number of devices, C is the number of available channels, R_{n,c} is the number of times node n received a reward (ACK) when using channel c, and T_{n,c} is the number of transmission attempts of node n on channel c. Scenario B, in which not only the devices but also the coordinators select their operational channels autonomously, makes it more difficult to select optimal channels; this scenario is an example of a highly distributed decision network. The result of this scenario is shown in Figure 20(b). When the devices select their operational channels by the TOW algorithm, the FSR outperforms those of the ε-greedy and UCB1-tuned algorithms. This indicates that the TOW algorithm can select operational channels more efficiently than the other MAB algorithms, even in this more distributed situation. In scenario B, the channels of the coordinators are also selected by the same algorithm as the devices. In both scenarios, the selected channels converge under the TOW algorithm; those of UCB1-tuned also converge but fluctuate slightly over time, while the channels selected by ε-greedy are unstable and fluctuate because of the randomness of the algorithm's greedy exploration. The results show that the proposed algorithm can efficiently select channels in a distributed manner.
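The FSR of Eq (5.10) is a straightforward ratio of summed counts; a minimal sketch, with `ack_counts[n][c]` and `attempt_counts[n][c]` as illustrative names for R_{n,c} and T_{n,c}:

```python
def frame_success_rate(ack_counts, attempt_counts):
    """FSR of Eq (5.10): rewarded (ACKed) transmissions over transmission
    attempts, summed over all devices n and channels c.

    ack_counts[n][c] corresponds to R_{n,c}; attempt_counts[n][c] to T_{n,c}.
    """
    acked = sum(sum(row) for row in ack_counts)
    attempts = sum(sum(row) for row in attempt_counts)
    return acked / attempts if attempts else 0.0
```

For example, with two devices and two channels, 6 ACKed frames out of 10 attempts gives an FSR of 0.6.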

Conclusions
Advancements in wireless communication technologies have led to enormous positive changes globally. Along with this, wireless systems have become increasingly complex, not only within a single communication node but also as a communication system. From the viewpoint of exploiting this capability, two simple questions arise. One is how to build models of the complex wireless systems of the present and the future. The other is how to decide optimal actions using models of wireless communication systems.

Classical mathematical formulation-based optimization schemes cannot be applied to today's complex wireless communication systems, because the complexity of the systems prevents building mathematical models. This opens the window for applying machine learning technologies to optimize the performance of wireless communication systems; data-driven modeling by machine learning addresses these issues. Deep reinforcement learning, which combines deep learning-based modeling with reinforcement learning-based decision making, is the state-of-the-art scheme in recent research, and various studies have applied it in the field of wireless communication. However, open problems remain. One is to seek alternative, continuous-function-based modeling approaches that obtain better solutions for continuous systems. Another is to seek a feasible yet effective scheme when the amount of available information is rather small, as in IoT systems. In this study, to provide solutions for these points, two novel schemes are proposed that differ not only from classical mathematical optimization but also from the current state-of-the-art deep reinforcement learning approach.
One of the proposed schemes is supervised learning-based modeling and optimization. It uses a certain amount of information to build a model of the wireless communication system and obtains the optimal parameters using an optimization algorithm, based on a cognitive cycle using machine learning: a supervised machine learning algorithm builds the performance model of the system, the optimal parameters are obtained by solving an optimization problem, action is taken according to the decision, and the performance model is updated in an online manner. The validity of the proposed scheme was confirmed through both real-world experiments and computer simulations of its application to IEEE 802.11 WLANs.
The other proposed scheme is simple yet easily implementable reinforcement learning based on the MAB problem formulation. By using the novel, lightweight, and distributed TOW algorithm, adaptive learning can be realized in wireless communication systems with limited software and/or hardware capabilities, such as the IoT. Two applications are shown: heterogeneous network selection and channel selection in massive IoT. Both real-world experiments and computer simulations demonstrated the validity of the proposed scheme.
These results show the effectiveness and feasibility of the proposed schemes. Various applications based on the proposed schemes are currently being developed [69][80][81][82], showing that this research has opened a new field for the application of online machine learning technologies to optimize the complex wireless communication systems of the present and the future.
The major achievements of this study are as follows: 1) we have provided a comprehensive framework for optimizing complex wireless communication systems from the viewpoint of the application of machine learning technologies; 2) we have proposed a scheme using supervised learning and optimization that is a better alternative to deep reinforcement learning, especially when parameters are continuous; and 3) we have proposed a scheme using simple reinforcement learning based on TOW, a lightweight MAB algorithm, which provides a feasible solution for increasing the performance of wireless communication systems when the amount of available information is small, as in the IoT.
Future work includes confirming the effectiveness of this research on more complex real-world networks and applications. The proposed schemes are based on a fundamental framework for optimizing complex wireless communication systems and can therefore be applied to various applications, including 5G/6G. Mobile edge computing (MEC), or multi-access edge computing, is one of the hot topics in 5G/6G networks, and the proposed schemes can be applied to optimize communication and computing in MEC. Network orchestration is also an important topic in heterogeneous 5G/6G networks; end-to-end optimization of communications in heterogeneous networks, in either a distributed or a centralized manner, is to be examined with the proposed schemes. Non-terrestrial networks add vertical communication to 5G/6G networks; further applications, such as relay networks in space and multiple interfaces for radio and optical communication, can be optimized through the proposed schemes. Through these verifications, a more concrete picture of future communication networks using machine learning will be revealed.

Copyright notice
This article includes materials under the following copyright: Copyright 2019 IEEE. Reprinted, with permission, from [54].
This article also includes materials from "Efficient wireless network selection by using multi-armed bandit algorithm for mobile terminals" [62] by the same author, which appeared in Nonlinear Theory and Its Applications, Copyright © 2019. The material in this paper was presented in part in those publications, and all the figures of this paper are reused from them with the permission of the IEICE.

Conflict of interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.