Enhancing energy efficiency for cellular-assisted vehicular networks by online learning-based mmWave beam selection

Millimeter Wave (mmWave) technology has been regarded as a feasible approach for future vehicular communications. Nevertheless, high path loss and penetration loss pose severe challenges for mmWave communications. These problems can be mitigated by directional communication, which is not easy to achieve in highly dynamic vehicular communications. Existing works addressed the beam alignment problem by designing online learning-based mmWave beam selection schemes, which adapt well to highly dynamic vehicular scenarios. However, these works focus on network throughput rather than network energy efficiency and thus ignore energy consumption. Therefore, we propose an Energy efficiency-based FML (EFML) scheme to compensate for this shortfall. In EFML, the energy consumption is reduced as far as possible while meeting the basic data rate requirements of vehicle users, and users in close proximity that request the same content can be organized into the same receiving group to share the same mmWave beam. The simulation results demonstrate that, compared with the baseline method with the best energy efficiency, the proposed EFML improves energy efficiency by 17–41% in different scenarios.

In the existing related works, the performance metric of each beam is the amount of data received by the vehicle rather than the energy efficiency. Because energy consumption is not considered, this metric does not suit the development demands of green communication. Also, the authors in [3] do not consider aggregating users in close proximity who request the same data content (e.g., the latest traffic congestion information, real-time high-definition electronic maps, current events, and news) into a multicast group, so resources may be consumed repeatedly to send the same content. Therefore, to address the above problems, we propose an Energy efficiency-based FML (EFML) scheme and list the main contributions as follows.
1. Different from the existing online learning-based mmWave beam selection schemes aiming to maximize the overall aggregated received data, our scheme aims at enhancing the energy efficiency of cellular-assisted vehicular networks.
2. In our scheme, the transmission power of each mmSBS is allowed to be adjusted as long as the energy efficiency can be improved. Therefore, the energy consumption can be reduced as far as possible under the premise of meeting the basic data rate requirements of vehicle users.
3. To further reduce energy consumption and save communication resources, the users requesting the same data content in close proximity are organized into the same receiving group to share the same mmWave beam and reduce the occupation of RF chains.
4. The simulation results show that, compared with the existing online learning-based mmWave beam selection schemes, the EFML scheme substantially improves the energy efficiency and the amount of data of cellular-assisted vehicular networks at the cost of more system overhead. However, after a period of sufficient online learning, there is no difference in the cost of updating the beam performance of the system.
In the rest of the paper, the related works are presented in Sect. 2. The system model and the details of the EFML algorithm are addressed in Sects. 3 and 4, respectively. Simulation results are discussed in Sect. 5. Finally, we conclude this paper in Sect. 6. For the convenience of readers, the main notations of this paper can be found in Table 1.

Related work
There have been many solutions to the problem of beam selection in traditional networks (e.g., the works in [5,6]). However, they need complex transceiver links and accurate location information and thus cause high overhead and delay. Unlike the above works in the sub-6 GHz bands, there are many works on beam selection in mmWave networks. The authors in [7] proposed a mmWave beam selection method based on deep learning, which utilizes the channel characteristics of the sub-6 GHz band to solve the mmWave beam selection problem, while the authors in [8] presented a beam alignment algorithm based on machine learning for the beam management problems in mmWave massive Multiple-Input Multiple-Output (MIMO) networks. The author in [9] proposed an iterative order minimum optimization training scheme based on the simulated beam selection of machine learning. The above schemes require a large number of prior data samples and beam training processes.
In [10], the authors studied the problem of multiple RF chains of mmWave transceivers in mobile mmWave communication systems and developed a codebook-based beam tracking strategy, showing that the performance of the beam tracking strategy can be improved by optimizing the transmit power of the training beam. The authors in [11] gave an overview of current beam management approaches based on 5G standardization, including some of the major challenges and future trends for mmWave communications in current 5G New Radio (NR) standards. By analyzing the average search delay of two different mmWave network models, the authors in [12] found that the average number of searches is related to the number of search sectors. The authors in [13] designed a low-cost beamforming module-assisted hybrid architecture and proposed a fast beam training method. The authors in [14] studied the sensitivity of the beam stability selected by the base station. By observing different operating frequencies, dynamic channel characteristics, and different user mobility, they found that the perceived time-of-stay of the beam is affected by beam management parameterization.
The authors in [15] leveraged the advantage of the mmWave characteristics in ultra-dense networks and proposed a method for joint optimization and resource allocation between base stations and users. Specifically, they aimed at maximizing user throughput in the system while also considering fairness. To reduce the overhead and complexity of the wireless backhaul and access process, the authors in [16] proposed a hybrid beamforming multi-stage design scheme based on channel feedback. To improve the efficiency of user-provided networks through resource allocation of links, the authors in [17] proposed a joint incentive and resource allocation algorithm, which considered the restriction of network resources, the incentive system, and user fairness. Moreover, to alleviate the overload problem of cellular networks and save cellular network resources, the authors in [18] proposed a traffic offloading method through opportunistic mobile networks. Also, the authors in [19] proposed an incentive mechanism based on delay constraint and reverse auction to stimulate Wi-Fi access points to participate in the data offloading process. To further reduce the traffic burden of cellular networks and the cost of content service providers, the authors in [20] proposed a new method based on incentive drive and deep Q network, which considered the incentive mechanism and content caching strategy to improve the offloading performance.
Table 1 (excerpt) Main notations: the expected performance of energy efficiency; the optimal subset of beam-power pairs in the time slice t of the l-th sector; the set of under-explored beam-power pairs of the set of subspaces of the l-th sector; the number of under-explored beam-power pairs of the set of subspaces of the l-th sector; the set of under-explored beam-power pairs of the set of subspaces of the mmSBS; the set of beam-power pairs to be used of the set of subspaces of the mmSBS.
To reduce the overhead of establishing a mmWave link in vehicle-to-everything networks, the authors in [21] proposed a beam training method based on the assistance of out-of-band information. The authors in [22] proposed a beamforming scheme based on deep learning for high-mobility mmWave systems. To further reduce the beamforming overhead of the mmWave system, the authors in [23] proposed an intelligent predictive beam alignment algorithm at the Medium Access Control (MAC) layer of the mmWave vehicle system. The authors in [24] proposed a machine learning approach based on situational awareness to predict mmWave beams. Specifically, this approach learns beam information from past observations, including the position of the vehicles and the optimal beam. The authors in [25] proposed a neural network-based algorithm for beam alignment in vehicular networks. However, this scheme needs to learn more information about the channel state and can only select the best beam direction for a single user. Considering the propagation characteristics of mmWave in 5G vehicular networks, the authors in [26] proposed a simulated annealing-based beam management model to improve the effective communication of the system. The challenges of mmWave communication for vehicular networks are also investigated in [27,28].
Moreover, the Multi-Armed Bandit (MAB) is a classic and general online learning method and has been used to solve various problems in wireless communication networks [29]. The author in [30] developed an equivalent structured MAB model to solve the beam alignment problem in the mmWave system. However, this method requires an exhaustive search for beam alignment between transceivers, which causes great system overhead due to the large search space. The authors in [2] proposed FML to address the context-aware beam selection issue. Specifically, they modeled the problem of beam selection as a contextual MAB problem and proved the convergence of FML. However, they only consider one-dimensional contextual information, and only one vehicle can be served within the beam range.
The authors in [31] modeled the problem of beam selection as a contextual combinatorial MAB problem with delayed feedback and Quality of Service (QoS) constraints and proposed an online learning algorithm that achieves a good balance between satisfying the performance guarantee of the system and maximizing the network capacity. However, since this prediction mechanism requires the view information of the source mmWave base station, it will cause greater system overhead. In addition, the fast mobility of the vehicle scenarios is also a big challenge to this prediction method. The authors in [32] developed an online learning algorithm for beam selection by using the MAB framework that requires learning rough beam orientation in the pre-defined codebook.
To reduce the time consumed in beam training, the authors in [33] proposed a beam selection scheme based on deep learning, which realized low-delay and high-speed communication by reducing the number of measurements. In [34], the authors proposed low-cost joint designs of digital filters and analog beam selection, which achieved higher network sum-rates than the benchmark without joint design. Due to the high path loss and penetration loss, it is not easy to establish and track beams in mmWave vehicular communications. The authors in [35] proposed a beam selection method based on integrated learning classification to determine the beam pairs suitable for mmWave vehicular communication, which used the position and type of the receiving vehicle and its neighboring vehicles. The authors in [36] designed a location-based beam prediction and selection technology to maximize the achievable rate in mmWave cellular systems, which leveraged machine learning tools to deal with blockages. With the social information and context of vehicles and passengers, the author in [37] proposed a two-layer online learning algorithm for fast and effective beam allocation for mmWave base stations. However, the goal of the above studies is to maximize the achievable rate or increase the system capacity.
The authors in [3] proposed an online learning-based algorithm for mmWave beam selection to improve the network capacity of vehicular communication systems. Furthermore, the algorithm selects a more appropriate beam direction and beam width for the mmWave base station by setting and learning more dimensions of context information. However, all of the above studies only consider maximizing the system throughput or achieving the maximum network rate, and they do not consider power adjustment to reduce the energy consumption. Overall, the work in [3] is the most relevant to our work, but there is still room for improvement in IFML due to the problems discussed above. For example, the problem of power adjustment is not described in detail, and only unicast communication scenarios are considered. The main motivation for this paper is to consider user multicast groups and power adjustment requirements for green and energy-saving communications.

Network architecture
An integrated mmWave/sub-6 GHz cellular network is considered in this paper, in which several mmSBSs are deployed within the coverage area of an LTE eNB. As shown in Fig. 1, a mmSBS can communicate with its associated LTE eNB over a wired or wireless backhaul link. Each vehicle is equipped with two kinds of radio interfaces: an LTE interface is used to keep a connection to the LTE eNB, and a mmWave interface is adopted for high-speed data transmission. From a theoretical point of view, an infinite number of virtual beams can be programmed per mmSBS, and the beam width of each beam can be set between 0° and 360°. Beams are allowed to overlap, and the number of RF chains is much smaller than the number of virtual beams due to the manufacturing cost and the limitation of form factor.
All the vehicles are grouped according to how close they are to each other and whether they request the same data content. In this paper, each mmSBS provides service only to vehicle groups. Even a single vehicle in a vehicular network must form a vehicle group of its own. Each vehicle group has a unique identifier, and the other parameters associated with the group include the number of vehicles, the identifiers of the vehicles, the requested data content, the central coordinates of the distribution of vehicles within the group, and the identifier of the vehicle farthest from the target mmSBS within the group.
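The group record described above can be sketched as a small data structure. This is only an illustration: the class and field names, the greedy grouping rule, and the `max_gap` proximity threshold are assumptions made for the example and are not taken from the paper.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Vehicle:
    vehicle_id: str
    x: float
    y: float
    content: str  # identifier of the requested data content


@dataclass
class VehicleGroup:
    group_id: int
    content: str
    vehicle_ids: list
    center: tuple             # central coordinates of the group members
    farthest_vehicle_id: str  # member farthest from the target mmSBS

    @property
    def size(self):
        return len(self.vehicle_ids)


def build_groups(vehicles, mmsbs_xy=(0.0, 0.0), max_gap=50.0):
    """Greedy grouping (assumed rule): a vehicle joins the first group that
    requests the same content and whose first member is within max_gap metres."""
    buckets = []
    for v in vehicles:
        for bucket in buckets:
            head = bucket[0]
            same_content = head.content == v.content
            if same_content and hypot(head.x - v.x, head.y - v.y) <= max_gap:
                bucket.append(v)
                break
        else:
            buckets.append([v])  # a lone vehicle still forms its own group

    groups = []
    for gid, bucket in enumerate(buckets):
        cx = sum(v.x for v in bucket) / len(bucket)
        cy = sum(v.y for v in bucket) / len(bucket)
        far = max(bucket, key=lambda v: hypot(v.x - mmsbs_xy[0], v.y - mmsbs_xy[1]))
        groups.append(VehicleGroup(gid, bucket[0].content,
                                   [v.vehicle_id for v in bucket],
                                   (cx, cy), far.vehicle_id))
    return groups
```

A single isolated vehicle ends up in a one-member group, which matches the requirement that every served vehicle belongs to a group.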
The maximum number of RF chains at a mmSBS determines the maximum number of vehicle groups that this mmSBS can simultaneously serve. If both the number of vehicle groups in the coverage of a mmSBS and the number of its virtual beams exceed the number of its RF chains, the mmSBS should select the best subset of beam-power pairs in order to provide the best system performance. To reach that target, we formulate each mmSBS's beam selection as a MAB problem, where each mmSBS can identify the subset of best beams with the matching transmission power values over time. According to the description in [38], the decision maker of a MAB problem has to choose a subset of actions with unknown expected rewards so as to maximize the reward over time. Actions whose rewards are still unknown must be explored, but those that have already generated high rewards should also be exploited, and handling this exploration vs. exploitation dilemma is a challenging problem.

Problem statement
As in [3], the number of virtual beams at a mmSBS is not limited in this paper. However, unlike the work in [3], the transmission power of each mmSBS is allowed to be adjusted. Therefore, besides preserving the coverage division in [3], we must also focus on how to find an appropriate transmission power for each mmSBS from the set of available transmission powers.
Firstly, for the purpose of reducing the search time of the online learning process, each mmSBS's coverage area is divided into L non-overlapping sectors (e.g., L = 4 in Fig. 1), where there are no more than $M_l$ virtual beams for the l-th sector ($l \in \{1, \ldots, L\}$) and these virtual beams are allowed to overlap. In the l-th sector, each mmSBS uses a beam set $\mathcal{M}_l$ and a power set $\aleph_l$, which include $M_l = |\mathcal{M}_l|$ virtual beams and $N_l = |\aleph_l|$ transmission power levels, respectively. Therefore, there are $M_l \times N_l$ beam-power pairs for the l-th sector.
For each sector, the mmSBS can choose a subset of no more than n beam-power pairs to serve no more than n vehicle groups simultaneously ($n < M_l \times N_l$, $\forall l \in \{1, \ldots, L\}$), where n is bounded by the maximum number of RF chains at the mmSBS.
From the perspective of all the sectors, the n beam-power pairs in the same sector may not be the best n beam-power pairs. Thus, first of all, the mmSBS should choose no more than n best beam-power pairs from every sector separately to provide them to no more than n vehicle groups in every sector respectively. Then, it chooses no more than n best beam-power pairs from all the chosen beam-power pairs to provide them to no more than n vehicle groups in the whole coverage area of the mmSBS.
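The two-stage choice described above (at most n pairs per sector, then at most n pairs overall) can be sketched as follows. The function name, the dictionary layout, and the use of a single scalar score per beam-power pair are assumptions made for illustration; in the scheme itself the score would be the learned performance estimate.

```python
import heapq


def select_pairs(scores_by_sector, n):
    """Two-stage selection.

    scores_by_sector maps a sector index to a dict {(beam, power): score}.
    Returns at most n tuples (sector, (beam, power), score), best first.
    """
    shortlisted = []
    # Stage 1: keep the n best beam-power pairs of every sector separately.
    for sector, scores in scores_by_sector.items():
        top = heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])
        shortlisted.extend((sector, pair, score) for pair, score in top)
    # Stage 2: keep the n best pairs among all shortlisted ones.
    return heapq.nlargest(n, shortlisted, key=lambda item: item[2])
```

Stage 1 bounds the candidate list to at most n times L entries, so stage 2 stays cheap even when each sector has many beam-power pairs.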
The LTE eNB is capable of providing the vehicle group context information to the mmSBS. With the help of the LTE eNB, a vehicle group will know the location of the mmSBS and the beam-power pair chosen for it. Figure 2 shows the whole process of information interaction between vehicles, the LTE eNB, and mmSBSs. When a vehicle wants to communicate with a mmSBS via a mmWave link, it first sends a registration request message (refer to "1: registration request" in Fig. 2) to the LTE eNB, with which it maintains continuous connectivity via its LTE interface. This registration request message contains the vehicle's velocity, location, and requested data content. The LTE eNB may receive a large number of registration requests from the vehicles in the service areas it covers. So, it periodically analyzes and handles the received registration requests based on certain policies, where the interval between successive processing operations can be adjusted according to the delay requirements of registration requests. If the interval is longer, more registration requests may be processed in a single period, but the response to each vehicle will be slower, and vice versa.
Once the time to process the registration requests arrives, the LTE eNB first analyzes the received registration requests to build the vehicle groups one by one according to the vehicles' velocity, location, and requested data content, and then sends a mmWave service request message to a potential mmSBS (refer to "2: service request" in Fig. 2). This message contains each vehicle group's identifier, the cellular system identifier of each vehicle in a vehicle group, the identifier of the vehicle farthest from the target mmSBS within a vehicle group, and the expected direction of arrival at the mmSBS.
By using the EFML scheme, the mmSBS will respond to the LTE eNB's mmWave service request (refer to "3: service response" in Fig. 2) with the chosen beam-power pairs. Upon receipt of service response from the mmSBS, the LTE eNB will send each vehicle in each vehicle group a registration response message about the mmSBS (refer to "4: registration response" in Fig. 2). This message contains the mmSBS's location and the chosen beam-power pairs.
Once each vehicle in the vehicle group reaches the covered area, it sends the mmSBS an associating request to start a mmSBS associating process, and then it receives an associating response from the mmSBS (refer to "5: association" in Fig. 2). Then, each vehicle obtains the Channel State Information (CSI) by analyzing the associating response message from the mmSBS and feeds the CSI back to the mmSBS. Based on the CSI feedback from all the vehicles in a group, the mmSBS can know whether the beam-power pair it chose can meet the data rate requirement of the vehicle with the worst channel quality in the group.
After the associating operations, the mmSBS starts the data transfer process (refer to "6: communication" in Fig. 2). If the data transfer is successful, it receives acknowledgments of the transferred data frames, and no other feedback is required. If any vehicle in the vehicle group cannot detect the mmSBS within the chosen beam-power pair, it sends feedback to the LTE eNB (refer to "7: service feedback" in Fig. 2). Finally, in order to help the mmSBS make better decisions in the future, the LTE eNB forwards the feedback to the mmSBS (refer to "8: service feedback" in Fig. 2).
The selection results of beam-power pairs should be adjusted in time to serve the most suitable set of vehicle groups, so each mmSBS uses a discrete time setting, where system time is divided into time slices with equal length and denoted as t (t ∈ {1, …, T}).
When each time slice t passes, all the selection results of beam-power pairs are updated. If each time slice is shorter, the selection results of beam-power pairs are updated in a more timely manner, but a higher system overhead is generated. Thus, a reasonable tradeoff is critical, and one option is to determine the specific value through experience. The detailed process of selection and update for beam-power pairs is described below.
1. At the first time slot of each time slice t, a set $g_t^l = \{g_{t,i}^l \mid i \in \{1, \ldots, G_t^l\}\}$ of vehicle groups will be registered in the l-th ($l \in \{1, \ldots, L\}$) sector of the mmSBS via the LTE eNB, in which $G_t^l$ is the number of vehicle groups and meets $G_t^l = |g_t^l| \ge n$. The parameter n is determined by the maximum number of RF chains that the mmSBS can support, so it also represents the maximum number of vehicle groups that can simultaneously obtain downlink transmission services in the entire coverage of the mmSBS.
As mentioned above, the mmSBS obtains the information about the group context $o_{t,i}^l$ of each incoming vehicle group $g_{t,i}^l$. The group context $o_{t,i}^l$ may be described by D context dimensions, so it is regarded as a D-dimensional vector for each $i \in \{1, \ldots, G_t^l\}$ and is acquired by the mmSBS after the first time slot of each time slice t. In this paper, we only consider the vehicle group distance (which is determined by the vehicle farthest from the target mmSBS within the group) and the direction of arrival as the context for a vehicle group, so the context vector is two-dimensional (i.e., D = 2).
2. The mmSBS chooses a subset of no more than n best beam-power pairs from the l-th sector, in which the set of chosen beam-power pairs in each time slice t is denoted as $BP_t^l$. After this, it reselects no more than n beam-power pairs from $\bigcup_{l=1}^{L} BP_t^l$ to serve no more than n vehicle groups within the mmSBS's coverage area. Finally, no more than n vehicle groups in $\bigcup_{l=1}^{L} g_t^l$ are selected to accept service, and each vehicle of each selected vehicle group is informed about the chosen beam-power pair by the associated LTE eNB via its LTE interface.
3. When any vehicle of each chosen vehicle group (e.g., $g_{t,i}^l$) reaches its expected coverage of the mmSBS, it receives communication data from this mmSBS and feeds this situation back to it. The mmSBS only observes the amount of data successfully received by the vehicle with the worst channel quality within each group, and regards it as the amount of data $r_{\langle b_{t,j}^l, p_{t,j}^l \rangle}(o_{t,i}^l)$ that the vehicle group $g_{t,i}^l$ successfully receives via the chosen beam-power pair $\langle b_{t,j}^l, p_{t,j}^l \rangle$, until the time slice t is over or the vehicle with the worst channel quality in the group is no longer covered by its beam.
The amount of data $r_{\langle b_{t,j}^l, p_{t,j}^l \rangle}(o_{t,i}^l)$ is usually limited to $r_{\max}$, in which $r_{\max}$ is the maximum amount of data that can be received by the vehicle with the worst channel quality in the group. The contact time and the Shannon theorem can be employed to estimate $r_{\max}$. The contact time is the time during which the mmSBS can send data to the vehicle with the worst channel quality in the group; it is bounded by the coverage area of the chosen beam-power pair and depends on the vehicle speed, beam direction, beam width, and transmission power.
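A minimal sketch of the $r_{\max}$ estimate is given below. It assumes that the Shannon capacity is evaluated for a fixed bandwidth and a constant linear SNR over the whole contact time, and that the contact time is the shorter of the time needed to cross the beam footprint and the slice length. The function names and parameters are hypothetical, and the paper's exact estimator may differ.

```python
from math import log2


def contact_time(path_length_m, speed_mps, slice_length_s):
    """Time the worst-channel vehicle spends inside the beam footprint,
    capped by the length of the time slice (assumed model)."""
    if speed_mps <= 0:
        raise ValueError("speed must be positive")
    return min(path_length_m / speed_mps, slice_length_s)


def estimate_r_max(contact_time_s, bandwidth_hz, snr_linear):
    """Upper bound on received data in bits: contact time times Shannon capacity."""
    if contact_time_s < 0 or bandwidth_hz <= 0 or snr_linear < 0:
        raise ValueError("invalid link parameters")
    return contact_time_s * bandwidth_hz * log2(1.0 + snr_linear)
```

For instance, a vehicle moving at 10 m/s across a 20 m footprint within a 3 s slice has a 2 s contact time, and with 1 MHz of bandwidth and a linear SNR of 3 the bound is 4 Mbit.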
In this paper, the performance of the chosen beam-power pair $\langle b_{t,j}^l, p_{t,j}^l \rangle \in BP_t^l$ for the vehicle group with the context $o_{t,i}^l \in O_{G_t^l}$ during the time slice t is estimated by formula (1). The performance measure in (1) approximates the energy efficiency of the vehicle group with the context $o_{t,i}^l \in O_{G_t^l}$ when this vehicle group is served by the beam-power pair $\langle b_{t,j}^l, p_{t,j}^l \rangle$ during the time slice t. We consider $e_{\langle b_{t,j}^l, p_{t,j}^l \rangle}(o_{t,i}^l)$ as a random variable and denote its expected value as $\tilde{e}_{\langle b_{t,j}^l, p_{t,j}^l \rangle}(o_{t,i}^l)$, which is also seen as the expected performance of the beam-power pair $\langle b_{t,j}^l, p_{t,j}^l \rangle$ under the group context $o_{t,i}^l$. The goal of the mmSBS in choosing a subset of the beam-power pairs is to maximize the expected energy efficiency at a subset of vehicle groups. In other words, its goal is to maximize the average expected beam-power pair performance. The optimal subset of beam-power pairs in the time slice t of the l-th sector of the mmSBS is defined for the group contexts $o_{t,i}^l$, $i \in \{1, \ldots, G_t^l\}$, and its n beam-power pairs formally satisfy formula (2).
At the beginning of system initialization, if the mmSBS already knew the expected beam-power pair performance $\tilde{e}_{\langle b_{t,j}^l, p_{t,j}^l \rangle}(o_{t,i}^l)$ for each vehicle group context $o_{t,i}^l \in O_{G_t^l}$ and each beam-power pair $\langle b_{t,j}^l, p_{t,j}^l \rangle \in \mathcal{M}_l \times \aleph_l$, it could easily choose the optimal subset of beam-power pairs for each set of arriving vehicle groups in the l-th sector by (2). As shown in formula (3), the average energy efficiency can be obtained through the total amount of data expected to be received over all time slices.
Usually, the mmSBS has no information about the communication environment, so it must learn the expected beam-power pair performance $\tilde{e}_{\langle b_{t,j}^l, p_{t,j}^l \rangle}(o_{t,i}^l)$. That is, in order to learn these performances, the mmSBS must explore different beam-power pairs for different group contexts over time.
Also, it should exploit the beam-power pairs proved to have good performance. Thus, the mmSBS must make a trade-off between exploring the beam-power pairs with the unknown performance and exploiting those with the known high performance.
Next, we elaborate on the EFML scheme, in which the best n beam-power pairs are selected from $\bigcup_{l=1}^{L} \mathcal{M}_l \times \aleph_l$ based on the incoming vehicle groups and their contexts. We regard the regret of learning as the expected difference in the average energy efficiency achieved by a vehicle group and by the learning algorithm. According to (3) and (4), it can be estimated by formula (5).

The energy efficiency-based FML
Each mmSBS executes the EFML scheme independently. To begin with, the context space of the vehicle groups in each sector of the mmSBS is evenly divided into context subspaces of the same size. Then, the EFML learns the performance of the different beam-power pairs separately in each subspace. In each time slice, the EFML executes either an exploration action or an exploitation action, depending on the control function of the system and the contexts of the arriving vehicle groups. In an exploration action, the scheme randomly chooses a subset of beam-power pairs. In an exploitation action, it chooses the beam-power pairs that performed best in the previous time slices. Finally, by observing the average energy efficiency achieved by the vehicle groups in its coverage area, the EFML scheme obtains performance estimates of the chosen beam-power pairs. Thus, the algorithm learns the performance of each beam-power pair under each vehicle group context over time.
The pseudo-code description of the EFML scheme is listed in Algorithms 1-3. In lines 1-5 of Algorithm 1, the EFML evenly divides the vehicle group context space into subspaces. In (6), $K: \{1, \ldots, T\} \to \mathbb{R}$ is a deterministic, monotonically increasing control function, which is adopted to decide whether to execute an exploration process or an exploitation process. The control function $K(t)$ should be chosen adequately to ensure that the EFML scheme obtains a good expected performance with respect to its regret. Theorem 1 in [2] provides an appropriate choice for the control function, and it is also described in [3]. For the convenience of readers, we repeat it as follows.
A detailed proof of Theorem 1 is given in [2]; the parameters there differ slightly from ours but are essentially the same. For the reader's convenience, we also repeat the description of Assumption 1 here.
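Assumption 1 There exist $\alpha > 0$ and $\beta > 0$ such that for all $\langle b, p \rangle \in \mathcal{M}_l \times \aleph_l$ and for all $x, y \in O_{G_t^l}$ in the l-th sector, it is true that $|\tilde{e}_{\langle b,p \rangle}(x) - \tilde{e}_{\langle b,p \rangle}(y)| \le \beta \|x - y\|^{\alpha}$, where $\|\cdot\|$ represents the Euclidean norm in $\mathbb{R}^D$.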
For Assumption 1, although the parameters are somewhat different from those described in [2,3], they are essentially the same. Lines 6-14 in Algorithm 2 show that the EFML scheme executes the exploration process if there are under-explored beam-power pairs. If the number $u_l(t) := |BP^{ue}_{GH_t^l}(t)|$ of under-explored beam-power pairs is at least n, the EFML scheme randomly chooses n beam-power pairs from them. If the number $u_l(t)$ of under-explored beam-power pairs is less than n, the EFML scheme chooses all the $u_l(t)$ under-explored beam-power pairs. Furthermore, it chooses the remaining $(n - u_l(t))$ beam-power pairs, which satisfy formula (7).
In (7), $j = 1, \ldots, (n - u_l(t))$. As shown in lines 15-17 in Algorithm 2, the EFML scheme conducts an exploitation action when there are no under-explored beam-power pairs, and it chooses the n beam-power pairs that satisfy formula (8). In (8), $j = 1, \ldots, n$. After choosing the n beam-power pairs from each sector respectively, the EFML scheme reselects no more than n beam-power pairs from all the chosen beam-power pairs of all the sectors, as described in lines 8-18 in Algorithm 1. After this, the EFML scheme observes the received data of each vehicle group $g_{t,i}^l$ with the context $o_{t,i}^l$ in the subspace $s_{t,i}^l$ of $O_{G_t^l}$ via each beam-power pair $\langle b_{t,j}^l, p_{t,j}^l \rangle$, and estimates the energy efficiency of each vehicle group $g_{t,i}^l$ according to formula (1) (see lines 2-3 in Algorithm 3).
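The per-sector choice between exploration and exploitation described above can be sketched as follows. The set of under-explored pairs is taken as an input because the control-function test of formula (6) is not reproduced here. The sketch also assumes that formulas (7) and (8) rank the remaining pairs by their current performance estimate; the function and parameter names are hypothetical.

```python
import random


def choose_pairs(all_pairs, under_explored, estimates, n, rng=random):
    """Return up to n beam-power pairs for one sector.

    all_pairs      : list of candidate (beam, power) pairs
    under_explored : set of pairs that still need exploration
    estimates      : dict mapping a pair to its estimated performance
    """
    under = [p for p in all_pairs if p in under_explored]
    if len(under) >= n:
        # Enough under-explored pairs: explore n of them at random.
        return rng.sample(under, n)
    # Otherwise take every under-explored pair and fill the remaining
    # slots with the best-estimated explored pairs. When `under` is empty
    # this reduces to pure exploitation of the n best pairs.
    explored = [p for p in all_pairs if p not in under_explored]
    best = sorted(explored, key=lambda p: estimates.get(p, 0.0), reverse=True)
    return under + best[: n - len(under)]
```

Passing a seeded `random.Random` instance as `rng` makes the exploration step reproducible in simulations.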
Based on these observations and estimations, the EFML scheme updates its internal counters (see lines 4-13 in Algorithm 3), where the weight coefficient ζ denotes the contribution of the most recently observed beam-power pair performance to the updated beam-power pair performance and is usually set empirically with respect to the communication performance of the system.
Figure 3 shows the simulation scenario adopted in this paper, where the mmSBS's coverage is partitioned into four sectors (i.e., L = 4) of equal size, and the sectors differ in terms of road length and road distribution. Each sector is equally divided along two dimensions: the mmSBS is first taken as the central point for equiangular division, and then it is used as the starting point for equal-length division. Based on this spatial division principle, besides paying attention to the data content requested by vehicles, the LTE eNB should also place the registered vehicles located in the same angle range into the same group, where the group member farthest from the mmSBS determines how far the group is from the mmSBS. For simplicity and without loss of generality, the distance from the mmSBS is classified into three levels: Near, Moderate, and Far. The group distance from the mmSBS is regarded as 'Near' if it is less than $R_N$. Otherwise, it is regarded as 'Moderate' if it is less than $R_M$. In all other cases, it is considered to be 'Far'.
As in [3], seven types of beams are adopted in the EFML scheme for each sector based on the different beam widths (i.e., from 30° to 90° with a step size of 10°), and the number of beams per type is one. Thus, $M_l$ is equal to 7. In addition, we set $N_l$ to 10, which means that the discrete transmission power values range from 0.1 to 1 W with a step size of 0.1 W. In the simulations, a time slice is set to a fixed length of time by the mmSBS.
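As a quick check of the action space per sector, the seven beam widths and ten power levels stated above can be enumerated as follows. The variable names are chosen for this illustration only.

```python
from itertools import product

# Beam widths from 30 to 90 degrees in steps of 10 degrees (M_l = 7).
BEAM_WIDTHS_DEG = list(range(30, 91, 10))
# Transmission powers from 0.1 W to 1.0 W in steps of 0.1 W (N_l = 10).
POWER_LEVELS_W = [k / 10 for k in range(1, 11)]

# Every beam-power pair available in one sector (M_l x N_l = 70 pairs).
BEAM_POWER_PAIRS = list(product(BEAM_WIDTHS_DEG, POWER_LEVELS_W))
```

With L = 4 sectors this gives 280 beam-power pairs per mmSBS, from which the scheme picks at most n in every time slice.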
For each vehicle group, its two-dimensional context vector involves the arrival direction dimension and the vehicle group distance dimension (i.e., D = 2). The arrival direction of a vehicle group is defined as the angle between the positive X-axis in the plane coordinate system with the mmSBS as the origin and the line connecting the center point of the vehicle group with a mmSBS.
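The arrival direction defined above can be computed as in the following sketch. It assumes that the center point of a vehicle group is the centroid of its members' coordinates, which the text does not specify, and the function name is hypothetical.

```python
import math


def arrival_direction_deg(member_positions):
    """Angle in degrees, in [0, 360), between the positive X-axis and the line
    from the mmSBS (the origin) to the center point of a vehicle group.

    member_positions: list of (x, y) coordinates relative to the mmSBS.
    The center point is taken as the centroid of the members (assumption).
    """
    cx = sum(x for x, _ in member_positions) / len(member_positions)
    cy = sum(y for _, y in member_positions) / len(member_positions)
    return math.degrees(math.atan2(cy, cx)) % 360.0
```

Using `atan2` rather than a plain arctangent keeps the quadrant information, so groups on opposite sides of the mmSBS map to different angles.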

Simulation settings
We denote the number of two-dimensional subspaces in each sector of the mmSBS by $O_T$ and set $O_T = 18$. Furthermore, the parameter α and the length of a time slice t are set to α = 0.34 and t = 3 s, respectively. Thus, according to Theorem 1, the time horizon T is approximately 6000, and the value of the control function K(t) is approximately 2.03. We implement our simulation environment in the event-based network simulator OMNeT++, together with the vehicular network simulation framework Veins and the road traffic simulator SUMO. In addition, vehicle speeds range between 5 and 10 m/s. The mmWave channel propagation model in our simulation is formulated as follows.
where $p^r_i$ is the received power of vehicle i and $p^t_i$ is the transmission power of the mmSBS, $G^t_i$ and $G^r_i$ are the directional transmission gain and directional reception gain between vehicle i and the mmSBS, respectively, and $G^c_i$ is the channel gain between vehicle i and the mmSBS. The estimation of the gain parameters in the above channel model can be found in [39,40]; for the convenience of readers, they are briefly restated here. When the mmSBS selects a certain beam-power pair for vehicle i, the transmission gain and the reception gain of this mmWave channel can be estimated by (10) and (11), respectively. In (10) and (11), $\theta^t_i$ denotes the transmitting beam width of the mmSBS, while $\theta^r_i$ is the receiving beam width of vehicle i. The gain of the main lobe is determined by $\theta^t_i$ and $\theta^r_i$, while ξ represents the gain of the side lobe, with $0 < \xi \ll 1$. $\phi^t_i$ denotes the angle between the line connecting the mmSBS with vehicle i and the center line of the transmitting beam of the mmSBS, and $\phi^r_i$ is the angle between the line connecting vehicle i with the mmSBS and the center line of the receiving beam of vehicle i. According to [41], the channel gain $G^c_i$ is given by (12), where δ(·) denotes the Dirac delta function, $\chi^c_i$ is the amplitude of the path from vehicle i to the mmSBS, and $\tau_i$ is the propagation delay of that path, which can be estimated by
$$\tau_i = \frac{d_i}{c}. \qquad (13)$$
In (13), c and $d_i$ are the speed of light and the length of the path from vehicle i to the mmSBS, respectively. Wireless signals propagate over either Line-of-Sight (LOS) or Non-Line-of-Sight (NLOS) paths. When there are buildings and plants between the transmitter and the receiver, the NLOS path suffers from problems such as high path loss, reflection and penetration loss. Here, we consider only one reflection on a given path. According to [41], the amplitudes of the LOS and NLOS paths can be estimated as follows.
where λ denotes the wavelength of the mmWave signal in this simulation, which can be estimated by $\lambda = c/f_c$, with $f_c$ the carrier frequency, and ∂ is the reflection coefficient of the path between the vehicle and the mmSBS.
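The quantities named in this channel model can be sketched in Python as follows. The delay and wavelength relations follow directly from the definitions above. The gain function, however, is an assumption: it uses a commonly used sectored-antenna model with a constant main-lobe gain and side-lobe gain ξ, and it may differ from the expressions (10) and (11), which are not reproduced in this text. All function names are hypothetical.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def wavelength(f_c_hz):
    """Wavelength lambda = c / f_c of a carrier at frequency f_c."""
    return C / f_c_hz


def propagation_delay(d_m):
    """Propagation delay tau = d / c of a path of length d."""
    return d_m / C


def sector_gain(theta_rad, phi_rad, xi):
    """Assumed sectored-antenna gain (not necessarily the paper's (10)-(11)).

    theta_rad: beam width; phi_rad: offset from the beam center line;
    xi: side-lobe gain with 0 < xi << 1.
    The main-lobe value is chosen so that the gain integrates to 2*pi
    over all directions.
    """
    if abs(phi_rad) <= theta_rad / 2:
        return (2 * math.pi - (2 * math.pi - theta_rad) * xi) / theta_rad
    return xi
```

Under this assumed model, narrowing the beam raises the main-lobe gain, which is why the beam width and the pointing offset both matter when a beam-power pair is selected for a vehicle group.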

Simulation schemes and performance metrics
The EFML scheme is most similar to the works in [2,3]. However, the work in [2] considers only a one-dimensional context vector and unicast communication scenarios. Although the work in [3] considers a two-dimensional context vector, it uses the road identifier and the direction of arrival as the context instead of the arrival direction and the vehicle group distance. Furthermore, the work in [3] aims at maximizing network throughput and does not consider the power adjustment of the base station. Thus, to compare energy efficiency as the optimization goal against the amount of received data as the optimization goal under the same context, we design two comparison schemes based on the works in [2,3], called VFML and NFML for convenience. The VFML and NFML schemes retain the core ideas of [2] and [3], respectively (i.e., the optimization target is the amount of received data, the transmission power is fixed, and the vehicle group mode is not applied), but are otherwise the same as our EFML scheme. We also adopt a User Experience Quality Assurance (UEQA) scheme for vehicle groups as a further comparison scheme. Based on the approximate estimation formula $f(\gamma_i) = 1 - e^{-0.5\gamma_i}$ in [42], the bit transmission success rate from the mmSBS to a vehicle can be easily estimated if the signal-to-noise ratio (SNR) $\gamma_i$ at this vehicle is known. If we know the receiving bit error rate (BER) threshold $\mathrm{BER}^{th}_i$ of each vehicle, we can estimate the corresponding SNR threshold $\gamma^{th}_i$ by letting $\mathrm{BER}^{th}_i$ be equal to $e^{-0.5\gamma^{th}_i}$, which can be expressed by
$$\gamma^{th}_i = -2\ln\left(\mathrm{BER}^{th}_i\right). \qquad (15)$$
To ensure that the BER of the data each vehicle receives from the mmSBS does not exceed $\mathrm{BER}^{th}_i$, the transmission power of the mmSBS should not be lower than the transmission power threshold $p^{th}_i$, which is estimated as follows.
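The relation between the BER threshold and the SNR threshold can be checked numerically with a short Python sketch. It only inverts the approximation f(γ) = 1 − e^(−0.5γ) quoted from [42]; the function names are hypothetical, and the SNR is on a linear scale rather than in dB.

```python
import math


def success_rate(snr):
    """Bit transmission success rate f(gamma) = 1 - exp(-0.5 * gamma)."""
    return 1.0 - math.exp(-0.5 * snr)


def snr_threshold(ber_th):
    """Linear-scale SNR at which exp(-0.5 * gamma) equals the BER threshold."""
    if not 0.0 < ber_th < 1.0:
        raise ValueError("BER threshold must lie strictly between 0 and 1")
    return -2.0 * math.log(ber_th)
```

For example, a BER threshold of 0.001 corresponds to an SNR threshold of about 13.8, and evaluating `success_rate` at that threshold gives 0.999, that is, one minus the BER threshold.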
In (16), W and $N_0$ represent the bandwidth of the mmWave band and the background noise power spectral density, respectively. In other words, when the mmSBS adopts $p^{th}_i$ to send data to vehicle i, the bit transmission success rate from the mmSBS to vehicle i can be expressed by (17). Thus, combined with the Shannon theorem, the energy efficiency of the UEQA from the mmSBS to vehicle i can be expressed by (18). The energy efficiency of a vehicle group is determined by the member with the minimum bit-transmission energy efficiency among all the members of the group.
The performance metrics adopted in the simulation experiments are the energy efficiency, the online learning cost, the cumulative received data, and the aggregate received data. The energy efficiency of the EFML, the VFML and the NFML is defined in formula (1), while that of the UEQA is defined in formula (18). The online learning cost is defined as the number of exploration rounds that a scheme needs to achieve a certain percentage of the performance of the optimal solution, where all exploration operations in one discrete time slice are regarded as one round of exploration. The cumulative received data is defined as the amount of data received by all the vehicles during the time horizon T, while the aggregate received data is defined as the amount of data received by all the vehicles during a time slice.
Table 2
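The rule that a vehicle group inherits the energy efficiency of its weakest member can be sketched as follows. The per-vehicle expression is an assumption made for illustration (successfully delivered Shannon rate divided by transmission power); it is not the paper's formula (18), which is not reproduced in this text, and all names and numeric values are hypothetical.

```python
import math


def vehicle_energy_efficiency(snr, bandwidth_hz, tx_power_w):
    """Assumed per-vehicle energy efficiency in bit/J (not the paper's (18)).

    Shannon rate W * log2(1 + gamma), scaled by the success rate
    f(gamma) = 1 - exp(-0.5 * gamma), divided by the transmission power.
    """
    rate = bandwidth_hz * math.log2(1.0 + snr)  # bit/s
    success = 1.0 - math.exp(-0.5 * snr)
    return success * rate / tx_power_w


def group_energy_efficiency(member_snrs, bandwidth_hz, tx_power_w):
    """The group value is set by its least energy-efficient member."""
    return min(
        vehicle_energy_efficiency(s, bandwidth_hz, tx_power_w)
        for s in member_snrs
    )
```

Because all members of a group share one beam-power pair, the transmission power is common to the group and only the SNR differs between members, so under this assumed expression the member with the lowest SNR determines the group value.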

Analysis of simulation results
We evaluate the performance metrics of the EFML scheme against the benchmark schemes, namely the VFML, NFML and UEQA schemes. In Figs. 4, 5 and 6, we investigate the impact of the number of vehicles in the simulation area on the performance metrics, in which no more than 6 selected beam-power pairs are employed simultaneously in each time slice and the number of vehicles ranges from 35 to 95 in steps of 15.
As shown in Fig. 4, the cumulative received data increases with the number of vehicles in the simulation area. The reason is that the fewer the vehicles, the less contextual information is available in the system, which ultimately leads to a poorer learning effect, so the mmSBS is unlikely to accurately select the beam-power pair that maximizes the received data. Conversely, an increase in the number of vehicles provides more contextual information, which is conducive to a better learning effect. In this case, the system is more likely to select precisely the beam-power pair that maximizes the received data. It can also be observed in Fig. 4 that the EFML outperforms the VFML, the NFML and the UEQA in cumulative received data. There are two reasons for this. First, the EFML can serve every vehicle group with the same request (i.e., multicast), whereas the NFML focuses on unicast communication (i.e., only one member of a vehicle group can be served) and the VFML considers only a one-dimensional context. Second, the UEQA only considers the power that meets the worst SNR within the vehicle group, while the EFML allows the power to be adjusted as long as the BER threshold of the system is satisfied. That is, in addition to finding a more appropriate beam orientation and beam width for vehicle groups, the EFML can also provide a more suitable power to reduce the power consumption.
Fig. 4 Impact of vehicle density on cumulative received data (x-axis: the number of vehicles in the simulation area; y-axis: cumulative received data (Gbit); schemes: EFML, VFML, NFML, UEQA)
Figure 5 shows the online learning costs of the different schemes under different numbers of vehicles, with the same simulation configuration as in Fig. 4. We can observe from Fig. 5 that the online learning cost, measured in exploration rounds, decreases as the number of vehicles in the simulation area increases.
The main reason is that as more vehicles enter the system in a scheduling period, the number of context subspaces that contain vehicles rises. As a result, the performance of each beam-power pair can be detected in more subspaces. If there is no historical performance data, or the recorded historical data is not sufficient, exploration should start as soon as possible. Therefore, more beams can be scheduled for detection in a scheduling time slice, which speeds up the detection of the performance of each beam-power pair in each context subspace and thus effectively reduces the number of exploration rounds.
In Fig. 5, we can also see that the cost of the online learning of the EFML is higher than that of VFML, NFML, and UEQA. This is because these comparison schemes do not pay attention to the dimension of power and the VFML only considers one-dimensional context (i.e., the direction of vehicle arrival). In other words, the EFML has a greater learning space than other schemes that do not take into account transmission power adjustment. Therefore, the EFML needs to spend a higher online learning cost than other schemes to select a more appropriate transmission power for each mmSBS.
In Fig. 6, we can see that as the number of vehicles in the simulation area increases, the energy efficiency of the network also increases. The main reason is that, when the number of concurrent beams is limited, having more vehicles increases the probability of selecting vehicles with a relatively high data rate and relatively low power consumption. Moreover, Fig. 6 shows that the energy efficiency achieved by the EFML is better than that achieved by the VFML, the NFML and the UEQA. This can be explained in two ways. Firstly, the VFML and the NFML adopt a fixed transmission power, while the UEQA only considers meeting the worst SNR in a certain vehicle group. That is, the adjustability of the transmission power is not considered in these schemes. In contrast, in addition to meeting the BER threshold for the vehicles within the group, the EFML can select a more appropriate transmission power for each mmSBS to improve the energy efficiency of the system. Secondly, in the EFML, the vehicles with the same content request and in close proximity are constantly grouped together to share the same mmWave beam-power pair, which saves power and improves the energy efficiency of the system.
In Figs. 7, 8 and 9, we investigate the impact of the number of selected beam-power pairs per time slice on the performance metrics, in which the number of vehicles in the simulation area is set to 65 and the number of selected beam-power pairs per time slice varies from 2 to 6 in steps of 1. In Fig. 7, it can be seen that as the number of beam-power pairs that can be used concurrently in each time slice increases, the cumulative received data of all schemes also increases. This is because a greater number of concurrently usable beam-power pairs means that more vehicles can be served at the same time. We can also see from Fig. 7 that the cumulative received data achieved by the EFML is higher than that achieved by the VFML, the NFML and the UEQA. The explanation of the difference in cumulative received data between the schemes is similar to that of the result in Fig. 4.
From Fig. 8, we observe that the number of exploration rounds decreases as the number of selected beam-power pairs per time slice increases. The more beam-power pairs that can be used simultaneously in each time slice, the more beam-power pairs with unknown or uncertain performance information can be explored in the same time slice. Therefore, when the number of context spaces of the system is fixed, the number of exploration rounds decreases as the number of selected beam-power pairs per time slice increases. It can also be seen that the online learning cost of the EFML is higher than that of the other schemes, and the explanation of the difference among the schemes is similar to that of the results in Fig. 5.
Figure 9 shows that the energy efficiency decreases slightly as the number of selected beam-power pairs per time slice increases. As mentioned earlier, in any exploitation process, the EFML selects the beam-power pairs that have proved to have the best performance in the previous time slices. In this case, the average energy efficiency is higher if a smaller number of optimal beam-power pairs is selected. We can also see from Fig. 9 that the energy efficiency achieved by the EFML is higher than that achieved by the VFML, the NFML and the UEQA, and the explanation for this difference is similar to that of the results in Fig. 6.
In Figs. 10, 11 and 12, we investigate the effect of the thermal noise power density on the performance metrics, in which the number of selected beam-power pairs per time slice is set to 6, the number of vehicles in the simulation area is set to 65, and the thermal noise power density ranges from −170 to −150 dBm/Hz in steps of 5 dB. It is obvious from Fig. 10 that the cumulative received data decreases as the thermal noise power density increases. This is because the SNR at any receiver falls as the thermal noise density rises, and, by the Shannon theorem, the data rate decreases as the SNR decreases.
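The dependence of the data rate on the thermal noise density can be illustrated with a short Python sketch of the Shannon rate over the noise range used above. The received power of 1 nW and the bandwidth of 1 GHz are hypothetical values chosen only for illustration, and the function names are likewise hypothetical.

```python
import math


def dbm_per_hz_to_w_per_hz(n0_dbm_hz):
    """Convert a noise power density from dBm/Hz to W/Hz."""
    return 10 ** (n0_dbm_hz / 10.0) / 1000.0


def shannon_rate(rx_power_w, bandwidth_hz, n0_dbm_hz):
    """Shannon rate W * log2(1 + SNR) with SNR = p_r / (N0 * W)."""
    noise_w = dbm_per_hz_to_w_per_hz(n0_dbm_hz) * bandwidth_hz
    return bandwidth_hz * math.log2(1.0 + rx_power_w / noise_w)


# Noise densities of -170, -165, ..., -150 dBm/Hz, as in the simulations.
NOISE_LEVELS_DBM_HZ = list(range(-170, -149, 5))
# Hypothetical received power (1 nW) and bandwidth (1 GHz).
RATES = [shannon_rate(1e-9, 1e9, n0) for n0 in NOISE_LEVELS_DBM_HZ]
```

With the received power held fixed, each entry of `RATES` is smaller than the previous one, which mirrors the behaviour of the fixed-power schemes as the noise density grows.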
Furthermore, we can see from Fig. 10 that the cumulative received data of the UEQA is almost unaffected when the thermal noise power density is relatively small. This is because the UEQA adjusts the transmission power to maintain a certain SNR that meets the BER threshold of the system according to formula (16), and thus the cumulative received data remains almost unchanged. However, when the thermal noise power density is too large, the transmitter may not meet the BER threshold of the system even if the transmission power is adjusted to the maximum value. In this case, the cumulative received data decreases as the thermal noise power density increases. It can also be seen from Fig. 10 that the cumulative received data of the EFML is higher than that of the VFML, the NFML and the UEQA, and the explanation of this difference is similar to that of the result in Fig. 4.
It can be seen from Fig. 11 that the thermal noise density has almost no effect on the online learning cost. This is because the size of the context space required by the online learning algorithm does not change with the thermal noise density. Moreover, the explanation of the difference in online learning cost among the schemes is similar to that of the results in Fig. 5.
From Fig. 12, we can observe that the network energy efficiency declines significantly as the thermal noise density increases. There are two reasons for this decrease. On the one hand, the cumulative received data decreases as the thermal noise power density increases, as shown by the results in Fig. 10. On the other hand, no matter how the environmental noise changes, the VFML and the NFML always maintain a fixed transmission power, whereas the EFML and the UEQA must increase the transmission power of each mmSBS as the thermal noise density increases to ensure that each vehicle satisfies the SNR threshold derived from the BER threshold, which reduces the network energy efficiency.
Fig. 12 also shows the difference in network energy efficiency among the schemes, and the explanation of this difference is similar to that of the result in Fig. 6.
In Figs. 13, 14 and 15, we analyze the performance metrics achieved by the schemes over a time horizon of 6000 time slices, in which the number of selected beam-power pairs per time slice is 6 and the number of vehicles in the simulation area is set to 65. Figure 13 shows the aggregate received data achieved by the different schemes over the time horizon. We can see from Fig. 13 that the aggregate received data per time slice achieved by the EFML begins to show an upward trend after the 1300th time slice and exceeds that of the UEQA after the 2200th time slice. This is because the context space of the EFML is larger than that of the VFML, the NFML and the UEQA. Owing to insufficient online learning before the 1300th time slice, most of the beam-power pairs allocated by the EFML to the vehicles are selected randomly. Moreover, the VFML, the NFML and the UEQA may have already entered the exploitation phase while the EFML is still in the exploration phase. Therefore, the received data of the EFML may not be as good as that of the other schemes before the 1300th time slice. However, after a period of sufficient online learning, the EFML can choose a set of beam-power pairs, including beam directions, beam widths and transmission powers, that is more reasonable than those of the other schemes.
Using the same simulation parameters as in Fig. 13, Fig. 14 shows the number of exploration operations per time slice over the time horizon of 6000 time slices. We can observe from Fig. 14 that the number of exploration operations in each time slice decreases as the number of elapsed time slices increases. This is because the number of under-explored beam-power pairs in the system decreases over time. It can also be seen from Fig. 14 that the EFML requires more time slices to explore beam performance than the other schemes. Since the EFML considers the power dimension, its context subspace is larger than that of the other schemes. That is, it takes longer for the EFML to enter the exploitation phase.
As can be seen from Fig. 15, after a certain number of time slices, the energy efficiency per time slice achieved by the EFML is higher than that of the UEQA. The main reason is the one mentioned above: since the EFML has a larger context space than the UEQA, it needs more time slices to explore beam performance. As long as the learning is sufficient, the EFML can adjust and select a better power for the mmSBS to reduce the power consumption and improve the energy efficiency of the system. We also see that the VFML and the NFML are always less energy efficient than the other schemes over the time horizon of 6000 time slices. This is because they always select the maximum transmission power for each mmSBS without reasonable power adjustment.
Fig. 15 The energy efficiency over the time horizon
In Figs. 16, 17 and 18, we investigate the impact of the learning information space size on the performance metrics of the EFML and the NFML, in which "EFML 7 × 2 × 18", "EFML 7 × 1 × 18", "NFML 7 × 2 × 10" and "NFML 7 × 1 × 10" denote the different learning information spaces used by the EFML and the NFML, respectively. "EFML 7 × 2 × 18" means that each sector in the EFML scheme adopts beams with seven different widths, the number of beams of each width type is 2, and the number of contextual subspaces in each sector is 18. "EFML 7 × 1 × 18", "NFML 7 × 2 × 10" and "NFML 7 × 1 × 10" have analogous meanings. Figures 16, 17 and 18 show the cumulative received data, online learning cost, and energy efficiency of the EFML and the NFML with different learning information space sizes under different numbers of vehicles. It can be seen from Figs. 16 and 18 that, in terms of cumulative received data and energy efficiency, the scheme with a large learning information space is better than that with a small learning information space. This indicates that the performance can be improved by increasing the number of beams with the same width and the granularity of the subspace partition, provided that beam overlap is allowed. In other words, increasing the number of beams and the fineness of the context division helps the mmSBS flexibly select more suitable beam-power pairs for the vehicle groups. At the same time, we can see that the EFML performs better than the NFML, because the EFML considers vehicle group multicast communication and can serve more vehicle users when beam overlap is allowed and more RF chains are available. Moreover, we can see that without changing the core idea of the NFML algorithm, the improvement obtained by expanding the learning information space is not obvious, which further illustrates the advantages of the proposed scheme.
Combined with Fig. 17, we can also see that, although the performance of "EFML 7 × 1 × 18" is slightly worse than that of "EFML 7 × 2 × 18", the online learning cost of the latter is much higher than that of the former. This means that "EFML 7 × 1 × 18" has a relatively higher performance-to-cost ratio. Also, we can see that the online learning cost of EFML is higher than that of NFML, and the explanation is similar to that in Fig. 5.

Conclusions
In this paper, we proposed the EFML scheme, based on MAB theory, to improve network energy efficiency in cellular-assisted vehicular networks. By reducing energy consumption as far as possible while meeting the basic data rate requirements of vehicle users, the EFML scheme avoids unnecessary power consumption. By grouping users in close proximity who request the same data content into the same receiving group, the EFML scheme saves mmWave beams and reduces the occupation of RF chains. The simulation results show that, compared with the existing online learning-based mmWave beam selection schemes, the EFML scheme improves not only the energy efficiency but also the amount of received data in cellular-assisted vehicular networks, at the cost of more system overhead. However, after a certain number of time slices, there is no difference in beam performance update cost between the EFML scheme and the comparison schemes.

Methods/experimental
The simulation scenario is shown in Fig. 3, where the coverage radius of each mmSBS is 100 m and each mmSBS coverage is partitioned into four sectors (i.e., L = 4). Each sector is divided evenly along two dimensions: the mmSBS is first taken as the central point for equiangular division, and then as the starting point for equal-length division. We implement our simulation environment in the event-based network simulator OMNeT++, together with the vehicular network simulation framework Veins and the road traffic simulator SUMO. In addition, vehicle speeds range between 5 and 10 m/s. To compare energy efficiency as the optimization goal against the amount of received data as the optimization goal under the same context, we design two comparison schemes, called NFML and VFML for convenience. We also adopt the UEQA scheme as a further comparison scheme for our EFML scheme.