An intelligent decision system for virtual machine migration based on specific Q-learning

Due to the convenience of virtualization, the live migration of virtual machines is widely used to fulfill optimization objectives in cloud/edge computing. However, live migration may lead to side effects and performance degradation when migration is overused or an unreasonable migration process is carried out. One pressing challenge is how to capture the best opportunity for virtual machine migration. Leveraging rough sets and AI, this paper provides an innovative strategy based on Q-learning that is designed for migration decisions. The highlight of our strategy is the harmonious mechanism for applying rough sets and Q-learning. In the ABDS (adaptive boundary decision system) strategy of this paper, the exploration space of Q-learning is confined by the boundary region of rough sets, while the thresholds of the boundary region are dynamically adjusted according to feedback from the computing cluster. The structure and mechanism of the ABDS strategy are described in this paper. The corresponding experiments show a firm advantage of combining rough sets and reinforcement learning algorithms. Considering both energy consumption and application performance, the ABDS strategy outperforms the benchmark strategies in comprehensive performance.


Introduction
Due to the explosion of the Internet of Things, the amount of data handled by internet data centers (IDCs) is booming with an irreversible trend. Confronting the complexity and uncertainty of modern application requirements (high concurrency, high fault tolerance, large fluctuations), the traditional approach to resource optimization and distributed computing is no longer sustainable. Innovative algorithms and artificial intelligence (AI) will be necessary for the autonomic optimization and management of intelligent IDCs in the future.
Cloud/edge computing is a consolidated computing paradigm. Equipped with massive virtualized CPU, memory and storage resources, Infrastructure as a Service (IaaS) providers offer low-level computing resources on demand to end users. Computing instances in the form of virtual machines (VMs) can be provisioned flexibly to accommodate diverse applications. The foremost advantage of VMs is that they can be transparently migrated from one physical machine (PM) to another without interrupting online services. Live migration is made convenient by virtual machine monitors (VMMs), e.g., VMware, KVM and Xen. In large computing clusters, there are many scenarios in which a VM migration process should be launched.
As the optimization problem becomes more complex, AI technology is now used by advanced cloud providers such as Amazon AWS, Tencent Cloud and MS Azure to provide reliable analysis and operation. For instance, reinforcement learning algorithms have been applied to job scheduling and resource provisioning in cloud/edge computing scenarios. In [1], a genetic algorithm and deep reinforcement learning were combined for cost-aware scheduling. Study [2] proposed a DRL-based preemptive scheduling approach for cloud jobs. Specifically, the researchers in [2] focused on reducing the execution cost of VMs and guaranteeing users' expected response time with a deep Q-network.
Although VM migration is crucial and convenient, it also has side effects. First, the performance of the application running on the migrating VM is affected, especially during the beginning and the downtime of a migration process [3]. Performance degradation in VMs is accompanied by high resource utilization on PMs. Experiments by Xu et al. [4] showed that applications running on VMs endure serious performance degradation and variation; e.g., the loading time of the Doom 3 game increases by 25 to 110% when an Amazon EC2 instance endures resource contention with other instances. Inappropriate VM migration leads to performance variation or degradation. The corresponding SLAV (service level agreement violation) will be triggered, which causes loss of revenue and business reputation for the IaaS provider. Second, frequent VM migrations also induce additional energy consumption. Research [5, 6] has investigated the operating cost of live migration and shown inevitable delays in both the execution time and the response time of applications during migration. Considering that thousands of servers run in clusters, the electricity cost is not trivial for IaaS providers [7, 8].
The general assumption in the literature on virtual resource optimization is that VM migration is rational and imperative, as the literature mainly focuses on how to schedule migration based on the execution time, routing path and other migration costs. The missing fact is that some migrations may be irrational or erroneous, as discussed in our previous research [9]. VM migration is usually overused, and the decision about whether to migrate is crucial but has not received sufficient attention. Although the application of artificial intelligence to enhancing the performance, design, monitoring and maintenance of time-critical computing systems has been well described in [10], live VM migration, the focus of this paper, is a different research topic. Before selecting a migration destination, the crucial step is deciding whether a VM should migrate at all.
Reinforcement learning algorithms such as DQN (deep Q-network), actor-critic and DDPG (deep deterministic policy gradient) can achieve optimal actions by improving prediction accuracy. However, in our experiments, we found it difficult to use these advanced RL algorithms to address the VM migration decision problem. The environment of the VM pool in a computing cluster is constantly changing, so a deep neural network trained on previous data needs to be constantly readjusted. This leads to slow convergence, which makes it difficult to meet the needs of live VM migration. The aim of this paper is to find a harmonious combination of reinforcement learning and rough sets. As green computation is one of the optimization objectives, Q-learning, the elementary reinforcement learning algorithm, is applied in this paper. We innovated the method of updating the value table, which increases the efficiency of Q-learning. In addition, the action selection method was innovated to incorporate a rough set decision system. All these innovations were patented by this team. These engineering innovations are expected to have positive value in real-time AI decision scenarios.
The remainder of this paper is organized as follows: Sect. "Motivation scenarios" presents the motivation of this paper. A thorough formal definition is given in Sect. "Problem formulation". In Sect. "Strategy interpretation", we describe the details of the innovative algorithm, and the corresponding performance evaluation is presented in Sect. "Performance evaluation". We discuss related works in Sect. "Related works", while the conclusion and future work are included in Sect. "Conclusion remarks and future directions".

Motivation scenarios
Here, we address two scenarios based on our experiments. The motivation of the research in this paper is made clear by the following two typical examples. The workload benchmarks for the experiments are taken from the RUBiS [11] and DaCapo [12] suites. The mixed application types consist of mail service, matrix multiplication, database read/write operations, web page browsing, etc. Note that the workloads can be adjusted by specific parameters such as the size of the matrix and the number of connections in the Apache web server.
The first scenario involves overmigration under fluctuating resource demand. As shown in our previous research [9], the CPU utilization of a PM fluctuates between 49 and 87%. In the eighth sample (data are sampled once per second), the utilization value is 83%. VM migration is then triggered because the CPU utilization exceeds the predetermined threshold. Due to performance degradation in this experiment, it took 205 s to complete this migration. Although 205 s is still normal considering network congestion across different network partitions, the problem is that the duration of excessive CPU utilization is only 3 s. Due to resource contention, this 205 s migration results in an extended execution time of jobs. This leads to a corresponding SLAV that is distinctly greater than that of enduring a resource shortage of a mere 3 s (the shaded area in Fig. 1).
The second scenario concerns the phenomenon of resource shortage during migration. In this experiment, there are 3 VMs running on PM2. The initial CPU utilization of PM2 is less than 80%. Due to a storage failure at another node, a VM migration process is triggered, and PM2 is chosen as the destination. As migration is resource intensive, the CPU utilization of PM2 increases significantly (from 78% to more than 90%) over the next few seconds. The main problem in this scenario is that when the final synchronization process of live migration confronts the increased resource demand on PM2, the resource shortage becomes more severe (CPU utilization reaches 100% in our experiment). This brings about significant performance degradation, and it takes more than 50 min (almost 100 times longer than a normal migration) for this abortive migration to complete. In addition, severe SLAV for applications in the VM and extra energy consumption of the PM are incurred over the prolonged migration time in this experiment.
The static threshold approach has drawbacks, as it leads to overmigration, as shown in the first scenario. However, a fixed threshold is prevalent even in mature platforms (e.g., VMware and Xen), and this dichotomy is currently common due to its easy implementation. Our early research [9] provided a three-way decision approach based on probability theory and rough sets. Although the migration decision issue can be transformed into a three-classification problem by [9], we did not provide a judgment method for the boundary region. A specific performance probe from our patented approach is utilized in [9] to make further decisions about live migration. However, performance monitoring with probes requires elaborate adjustment of parameters. An advanced strategy with automatic judgment from an AI perspective is expected, which is also our straightforward motivation for this paper.

Problem formulation
In this paper, both rough sets and Q-learning are involved in the ABDS (adaptive boundary decision system) strategy for VM migration decisions. In this section, the three regions of rough sets are introduced to reduce the optimization space, and the corresponding Q-learning approach is interpreted in the following formulation.
Denote U as a finite set, while E represents an equivalence relation on U. The equivalence relation E induces a partition of U, which is denoted as U/E. For an object x ∈ U, the equivalence class that contains x is denoted as [x] = {y ∈ U | xEy}.
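As a minimal illustration (the objects and the bucketing rule below are hypothetical, chosen only to make the partition concrete), the partition U/E and an equivalence class [x] can be computed as:

```python
from collections import defaultdict

def partition(U, key):
    """Partition universe U by an equivalence relation expressed as a key
    function: x E y  iff  key(x) == key(y)."""
    classes = defaultdict(set)
    for x in U:
        classes[key(x)].add(x)
    return classes

# Hypothetical universe: (cpu%, mem%) utilization pairs, bucketed to 10%.
U = {(83, 40), (87, 42), (49, 30), (52, 31)}
key = lambda x: (x[0] // 10, x[1] // 10)

U_over_E = partition(U, key)              # the partition U/E
eq_class_of_x = U_over_E[key((83, 40))]   # [x] for x = (83, 40)
```

Here (83, 40) and (87, 42) fall into the same equivalence class because they share the same 10-point bucket in both dimensions.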
For migration decisions, the first critical step is to determine whether the related PM is in a state of overload. An element x is a two-dimensional vector that represents the resource status of a computing node. Therefore, a clear partition of U (the whole space of resource utilization states) is needed to classify the status of a PM. D is the subset of U that indicates the overload state of a PM, while D^c represents the PMs that are not oversubscribed. To move beyond the static dichotomy, the approximation space and rough sets are introduced here to solve this problem. The lower and upper approximations of subset D are defined as (1).
Leveraging rough sets [13, 14], a space or set can be approximately divided into positive, negative and boundary regions. According to this method, for subset D, space U can be divided into three regions: the positive region POS(D), the negative region NEG(D) and the boundary region BND(D). Following the definition of regions in rough sets, the corresponding three regions can be formulated as (2).
Combined with Formula (1), the three regions for the migration decision can also be represented as (3).
For most engineering applications with discrete spaces, applying the above Pawlak rough set model may be too strict [15, 16]. As probability and statistics are always involved in artificial intelligence algorithms, a probabilistic rough set model is utilized here to enable a robust response to uncertainty. In complex environments, modeling uncertainty is effective in reducing the side effects of an arbitrary threshold. In this paper, uncertainty is quantified by the degree of overlap between an equivalence class and the approximated set, i.e., [x] and D in (2). Pr(D|[x]) is the conditional probability that an object in [x] belongs to D. In the cloud/edge computing scenario, Pr(D|[x]) is the conditional probability that the target computing node is overloaded.
The symbol |·| in (4) represents the cardinality of a set. According to the above definitions, the three regions of rough sets in (2) can be equivalently represented as (5), where α and β are the parameters used in the partition of U. They are the key thresholds of the regions in the three-way classification [13, 17, 18].
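The probabilistic three-way partition can be sketched as follows (a minimal sketch: Pr(D|[x]) is estimated as |[x] ∩ D| / |[x]| per (4), and the boundary between strict and non-strict inequalities at α and β is our assumption, since (5) is not reproduced here):

```python
def conditional_prob(eq_class, D):
    """Pr(D | [x]) = |[x] ∩ D| / |[x]|, cf. Formula (4)."""
    return len(eq_class & D) / len(eq_class)

def three_way_region(eq_class, D, alpha, beta):
    """Probabilistic three-way classification in the spirit of Formula (5)."""
    p = conditional_prob(eq_class, D)
    if p >= alpha:
        return "POS"   # accept: treat the PM as overloaded
    if p <= beta:
        return "NEG"   # reject: the PM is not overloaded
    return "BND"       # defer: boundary region, further judgment needed
```

With α = 0.8 and β = 0.4, an equivalence class with 75% of its members overloaded would fall into the boundary region.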
The principle of (5) is that x is identified as a member of D if the probability value is greater than α, and x is rejected as a member of D if the probability is less than β. We neither accept nor reject x if the probability of x belonging to D is between α and β. Leveraging the rough set approach in (5), a flexible three-way decision can be applied more effectively than the arbitrary dichotomy method. In this paper, vector x represents the resource utilization of a PM, while D represents the set of overloaded PMs. Therefore, if a PM is in an overload state (the certainty of being overloaded is greater than α), it is placed in the positive region. The corresponding instruction is that the PM can be regarded as oversubscribed, and VM migration should be conducted immediately (otherwise, it may lead to performance degradation, as discussed in Sect. "Motivation scenarios"). If a PM is far from the overload state (the probability of belonging to set D is less than β), then any migration plan regarding this PM as a source should be denied. The problem lies in the boundary region. If the state of the PM is within the boundary region, a dedicated monitoring probe was utilized in our previous work [9] to support further judgment. In this paper, an innovative method based on Q-learning is proposed to decrease the overhead and enable automatic decision-making. The formulation of the underlying mechanism is provided here, while the details of the ABDS strategy are given in the next section.
The Bellman equation is utilized here to model the decision process. Considering the discount factor γ in Q-learning, the expectation of reward (where the reward of the current migration represents feedback from the computing cluster) within an episode can be presented as (6). Using the iterative relation, it can also be converted to Formula (7).
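For reference, the discounted-return relation behind (6) and (7) takes the standard textbook form (a reconstruction, since the typeset equations did not survive extraction; symbols follow the surrounding text):

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
    = R_{t+1} + \gamma G_{t+1}
```

The last equality is the iterative relation referred to in (7).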
The principle of Q-learning is to achieve the optimal action-value function (the action in this paper has only two choices: to migrate or not). Considering the probability factor, the optimal function within the boundary region can be formulated as (8).
The probability of a state transition can be derived from previous data, and the best Q value can be derived from Formula (8). Then, the appropriate action is chosen based on the optimal Q value and a specific policy. Usually, mature policies such as ε-greedy, UCB (upper confidence bound) or gradient bandit are utilized for action choice. Note that we innovated the action choice method in our experiment, which is interpreted in the following section. After the action is conducted, the Q table is updated dynamically according to Formula (9).
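Formula (9) is the standard tabular Q-learning update; a minimal sketch follows (the state and action encodings are hypothetical, and the default learning rate and discount factor are the values used later in this paper):

```python
def q_update(Q, s, a, r, s_next, alpha=0.6, gamma=0.7):
    """Standard Q-learning update, cf. Formula (9):
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]
```

For example, with Q(s0, migrate) = 0, a reward of -0.5 and max Q(s1, ·) = -0.1, the new value is 0.6 × (-0.5 + 0.7 × (-0.1)) = -0.342.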
In this paper, updating the Q table via Formula (9) is critical for the thresholds of the boundary region in Formula (5). The parameters and other details of the innovative method incorporating Q-learning and rough sets are presented in the next section.

Strategy interpretation
In this paper, the migration decision issue is solved from an AI perspective. As VM migration closely interacts with the computing cluster, e.g., the source PM, the destination PM and the throughput of the routing path, a reinforcement learning algorithm is utilized here to address the interaction between the individual and the environment.
In our research, the ABDS was developed based on Q-learning, which is a classic RL approach. Compared with the original Q-learning algorithm, many details have been modified in the ABDS, as shown in Fig. 2. In Fig. 2, the computing cluster is recognized as the environment, while the PM is considered the individual in the RL algorithm.
Given a computing node, a migration decision should be made based on the corresponding resource utilization. The foremost contribution of our ABDS is the harmonious combination of rough sets and the RL algorithm. As Fig. 2 shows, the migration choice affects the cluster environment, while feedback (reward) is collected to update the thresholds of the rough sets. Meanwhile, the dynamically updated boundary region leads to flexible adjustment of the exploration space for the related Q-learning.

Macro configuration of the ABDS
Leveraging the Q-learning framework, some parameters need to be specified in the proposed ABDS strategy. First, the resource utilization of the PM is regarded as the state S. The real-time resource utilization can be derived using tools such as vmstat and htop. Although various resources can be monitored, only CPU and memory are considered. Therefore, state S is a two-dimensional space composed of CPU and memory sampling data. To minimize the state space and improve the convergence rate, this two-dimensional continuous space needs to be discretized to a certain extent. In our experiment, each sample is set approximately equal to the value of the nearest discrete point. Second, episodes are also an important concept in Q-learning. Both the solution exploration and the parameter updates depend on the iteration of episodes. Unfortunately, unlike in the classic maze problem, there is no clear episode, as the computing cluster works continuously. In this paper, we set 20 min as an episode period, and the thresholds of the boundary region in the ABDS are adjusted once per episode. Third, the action of an individual (PM) has only two choices: migrate or not. In addition, other parameters should be considered. γ is the discount factor used to discount previous reward values. In this paper, the ABDS strategy applies the discount series in the opposite direction compared with classic Q-learning: R_t denotes the reward of the environment at time t, while a former reward R_{t-n} is discounted to γ^n R_{t-n}. α in Formula (9) represents the learning rate, which is used to limit the extent of updates to the Q table. Our experiments show that VMs may have abnormal migration experiences; thus, the learning rate is used to update the Q table in a soft way.
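The nearest-grid-point discretization of the two-dimensional state can be sketched as follows (the 10-point grid spacing is our illustrative assumption, not a value stated in the paper):

```python
def discretize(cpu, mem, step=10):
    """Snap a continuous (cpu%, mem%) sample to the nearest grid point so
    that the two-dimensional state space stays small and Q-learning
    converges faster."""
    snap = lambda v: min(100, max(0, int(round(v / step)) * step))
    return (snap(cpu), snap(mem))
```

For example, a sample of (83.2% CPU, 47.8% memory) maps to the discrete state (80, 50); values outside [0, 100] are clamped.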

Calculation and updating of the Q table in ABDS
The Q table is crucial in our ABDS strategy. In this paper, the Q table has two dimensions: the rows represent different resource states (discrete resource utilization vectors), while the columns represent the action choices (migrating or not migrating). Each value in the Q table is the discounted summation of experienced rewards, while the current reward is related to the migration interaction with the computing cluster. In this paper, three components are considered in the calculation of the Q value. First, when live migration is executed, the performance of applications running on the VM degrades [19]. C_vm is the cost of migration from the VM perspective, and the value of C_vm differs for different types of applications. The second level is the PM. If a migration occurs, then resource contention issues may arise at the destination PM. C_pm denotes the cost of migration from the perspective of resource contention on the PM. Note that only PMs whose resource utilization exceeds 60% are involved in the calculation, as migration is usually triggered by high resource utilization. For the PM, the migration cost is the opposite of the reward: if the migration process relieves resource contention, then the predicted reward value is positive, and vice versa. The third factor is energy consumption, which is not negligible when solving the performance and energy trade-off issue. An experiment in [20] showed that the power of a PM mainly depends on CPU utilization. In our experiment, the power of two types of servers is measured at every 10 points of CPU utilization, and some key data points are shown in Table 1.

Fig. 2 Structure of the decision system in our research
Let C_ec be the increment in energy consumption due to migration in the computing cluster. Linear interpolation is applied to obtain the C_ec value at each CPU percentage point. From the perspective of power consumption, both the duration of migration and the power values on the source and destination PMs are related to the reward of the migration process. The prediction of resource utilization and migration duration has been well researched [21, 22]. For brevity, the details of the other components are not included; instead, we focus on how to further calculate and update the reward value in the Q table.
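Obtaining a per-percentage-point power value from the 10-point measurements reduces to straight-line interpolation between adjacent samples; a sketch follows (the wattage values below are hypothetical placeholders, not the measurements in Table 1):

```python
def interpolate_power(measured, util):
    """Linearly interpolate server power (W) at an integer CPU percentage
    from measurements taken every 10 points.
    measured: dict mapping CPU% (multiples of 10) to watts."""
    lo = (util // 10) * 10
    hi = min(lo + 10, 100)
    if util == lo or lo == hi:
        return measured[lo]
    frac = (util - lo) / (hi - lo)
    return measured[lo] + frac * (measured[hi] - measured[lo])

# Hypothetical 10-point measurements for one server model.
watts = {0: 40, 10: 55, 20: 66, 30: 75, 40: 83, 50: 90,
         60: 96, 70: 102, 80: 108, 90: 113, 100: 117}
```

The per-point power values on the source and destination PMs, multiplied by the predicted migration duration, then give an estimate of the energy increment C_ec.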
Parameter settings and other factors are also provided here for better interpretation. For the discount factor γ in Sect. "Problem formulation", we set the value to 0.7 to guarantee that the interactions of the latest 5 migrations can be traced back. Since previous rewards can be derived by looking up the Q table, the remaining implementation problem is the action selection method. As mentioned in Sect. "Problem formulation", the selection method differs from the general ε-greedy and UCB methods. In this paper, the action choice is related to the corresponding regions of the rough sets. If state S is in the positive region, then the probability of selecting migration is 100%, which means that a migration is carried out immediately. If state S is in the negative region, then the probability is 0. If state S is in the boundary region shown in (5), the probability value depends on the reward sampling data within the boundary region. As discussed in this section, the predicted reward of the current migration is composed of C_vm, C_pm and C_ec. The weight parameters for the three components can be adjusted according to the actual situation. Considering the nonlinear negative effect of SLAV, both the performance degradation in the VM and the resource contention in the PM should penalize the estimated reward, which is denoted as pt in (10).
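The region-dependent action choice described above can be sketched as follows (the single `bnd_migrate_prob` parameter is a simplified stand-in for the paper's reward-sampling rule inside the boundary region):

```python
import random

def choose_action(region, bnd_migrate_prob=0.5, rng=random):
    """Pick 'migrate' or 'stay' from the rough-set region of the state:
    POS -> migrate with probability 1, NEG -> with probability 0,
    BND -> probabilistic choice driven by sampled rewards."""
    if region == "POS":
        return "migrate"
    if region == "NEG":
        return "stay"
    return "migrate" if rng.random() < bnd_migrate_prob else "stay"
```

This replaces the generic ε-greedy/UCB operators: exploration is confined to the boundary region, which is what shrinks the search space.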
Normalization is applied to the reward, and the cluster benefits more as the reward value approaches 0. Since the reward feedback is fixed by Formula (10), the remaining parameter in Formula (9) is α, the learning rate used to control the extent of updating the Q value. It is set to 0.6 in our experiment. Our underlying experiments verify that values of α from 0.55 to 0.67 are acceptable. If we continue to reduce the value of α, the AI decision system will respond inadequately to current changes in the computing cluster environment.
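One plausible reading of (10), sketched below with heavy caveats: the exact weighting and penalty form are not reproduced in this text, so the weighted sum, the additive penalty `pt` and the clipping to [-1, 0] are our illustrative assumptions, consistent only with the stated facts that the reward is normalized, negative and best near 0.

```python
def estimated_reward(c_vm, c_pm, c_ec, weights=(1/3, 1/3, 1/3), pt=0.0):
    """Combine the three normalized migration costs (each in [0, 1]) into a
    reward in [-1, 0]; pt >= 0 penalizes SLAV-related degradation.
    Illustrative sketch only, not the paper's exact Formula (10)."""
    w1, w2, w3 = weights
    cost = w1 * c_vm + w2 * c_pm + w3 * c_ec + pt
    return -min(1.0, max(0.0, cost))
```

With equal weights (the initial setting used in the evaluation), three costs of 0.3 each yield a reward of about -0.3.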

Adaptive adjustment of the boundary thresholds
This paper provides a harmonious combination of rough sets and reinforcement learning. The three regions in the ABDS strategy provide a flexible decision system. In contrast to predefined boundary thresholds in rough sets, a new method is introduced to update the related thresholds.
Considering the actual migration environment, the threshold values α and β should be updated at a moderate frequency. In our ABDS strategy, threshold adjustment is triggered at the end of each episode, as shown in the following pseudocode. In general, the Q value is updated frequently, at each action step, while the thresholds of the boundary region are adjusted at the end of each episode. In this paper, the initial values of α and β are set to 100% and 60%, respectively, which means that performance degradation can be ignored when resource utilization is less than 60%. After a certain number of migration samples are collected, the thresholds can be adjusted according to the distribution of the sampling data. The innovative method is that the lower bound of resource utilization over all positive samples in the Q table is regarded as the ideal value of threshold α, while the upper bound over all negative samples is the target value of threshold β. To reduce the oscillation of the thresholds, a half-step adjustment method is applied in which the mean of the current value and the target value is adopted to enhance the stability of the decision system. The framework of our ABDS strategy is shown in the following pseudocode.
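The end-of-episode half-step adjustment can be sketched as follows (a minimal sketch: `pos_utils`/`neg_utils` stand for the episode's positive/negative sample utilizations, and falling back to the current threshold when a sample set is empty is our assumption):

```python
def adjust_thresholds(alpha, beta, pos_utils, neg_utils):
    """Half-step threshold update at the end of an episode:
    target for alpha = lower bound of positive-sample utilization,
    target for beta  = upper bound of negative-sample utilization,
    and each threshold moves halfway toward its target to damp oscillation."""
    target_alpha = min(pos_utils) if pos_utils else alpha
    target_beta = max(neg_utils) if neg_utils else beta
    return (alpha + target_alpha) / 2, (beta + target_beta) / 2
```

Starting from the initial (α, β) = (1.0, 0.6), positive samples bounded below by 0.9 and negative samples bounded above by 0.7 would move the thresholds to (0.95, 0.65).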

Algorithm 1
As with other reinforcement learning algorithms, the input of the ABDS in the above pseudocode is the status of a computing node. For example, when a polling computing node is taken into consideration, its resource utilization is noted as the state of the node. The output of the ABDS is the action decision, i.e., to migrate or not (the action type). As mentioned in Sect. "Calculation and updating of the Q table in ABDS", the reward value is a negative value between 0 and -1.
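Putting the pieces together, one decision step of the loop can be sketched as follows (a minimal sketch: the function names, the injected `prob_overload`/`env_feedback` callables and the greedy choice inside the boundary region are our illustrative simplifications, not the paper's exact Algorithm 1):

```python
def abds_step(state, Q, alpha_th, beta_th, prob_overload, env_feedback,
              lr=0.6, gamma=0.7):
    """One decision step: classify the state via the rough-set thresholds,
    choose an action, observe the cluster's reward, update the Q table."""
    p = prob_overload(state)                      # Pr(D | [x]) for this state
    if p >= alpha_th:
        action = "migrate"                        # positive region: migrate now
    elif p <= beta_th:
        action = "stay"                           # negative region: deny migration
    else:
        action = max(Q[state], key=Q[state].get)  # boundary region: consult Q
    reward, next_state = env_feedback(state, action)   # reward in [-1, 0]
    best_next = max(Q[next_state].values())
    Q[state][action] += lr * (reward + gamma * best_next - Q[state][action])
    return action, next_state
```

At the end of each 20-min episode, the thresholds `alpha_th` and `beta_th` would then be adjusted from the episode's samples, as described above.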

Performance evaluation
In this section, corresponding experiments are performed to evaluate performance. First, we introduce the experimental settings and the metrics used in our evaluation. Then, we report the experimental results and present the analysis.

Experimental settings and performance metrics
The related experiments are carried out on two basic types of servers in our laboratory: server A (Intel i7-9750H, hexa-core at 2600 MHz, 16 GB RAM) and server B (Intel octa-core i9-9900K at 3600 MHz, 32 GB RAM). We adopt OpenStack (4.0.2) to manage the virtual resources of clusters composed of the above two types of servers. Nova (15.1.1) is implemented to undertake computing jobs in VMs from the control node and compute nodes in this cluster. In addition, the realization of the innovative strategy in this paper also depends on modifying related components (e.g., the Nova scheduler) in OpenStack. To fit a real IaaS cloud scenario, different types of VMs are created based on different mirror images. The small VM type has 1 CPU unit and 2 GB of RAM, the medium type has 2 CPU units and 4 GB of RAM, and the large type has 4 CPU units and 8 GB of RAM.
The workloads applied in this paper are mixed types of benchmarks from SPEC CPU2006 (http://www.spec.org/cpu2006/), Netperf, Hadoop instances and SPECweb2005 (http://www.spec.org/web2005/). In addition, considering the importance of web applications, the ApacheBench test is also implemented for performance tests on response time. To increase the frequency of data sampling for migration, extra workloads are implemented to cause fluctuations in resource consumption. These workloads include database transactions, matrix transposition and a simple probe designed to test the execution time of CPU-intensive applications. In this paper, the resource situation of the cluster is monitored by Prometheus (2.8.1).
For the experimental metrics, both energy consumption and performance should be considered. For energy consumption, we measured the actual power of each PM model at different CPU utilizations. Table 1 shows the power of the two PM types measured at 10-point intervals of CPU utilization. Linear regression is applied within each 10-point CPU power range to obtain the power value at each integer CPU percentage point. For the SLAV metric, the response time is taken into consideration because it is the most obvious indicator for web applications. The response time data come from the ApacheBench test: the client constantly sends requests to the Apache server to access the homepage of the website in our experiment.
The values of the learning rate, discount factor and initial thresholds in our ABDS strategy follow the values discussed in Sect. "Strategy interpretation". The weights of the three components in the reward calculation are set equal in the initial setting. Note that we only focus on the optimization of overloaded PMs. In addition to the metrics discussed above, downtime is also an important metric for migration. In our experiment, we used nova migration-list to measure the duration of migration, while the downtime of live migration could be derived from the timestamp differences of VMs in the Nova log files. As the sampled downtime is usually quite low, it is not included in the metrics, and the default values are maintained for related parameters such as max_downtime, steps and delay in the evaluation experiment. In addition, the experiment is carried out in our LAN with high throughput; thus, the network bandwidth and topology have no direct effect on the performance evaluation.

Comparative algorithms and experimental results
In this subsection, we address the experimental results and present the relevant analysis. Five benchmark algorithms of four types are compared in the performance evaluation. The algorithm proposed in our previous research [9] is not used here because of its additional factitious intervention within the boundary region. The first type is the empirical static threshold. For migration, 80 to 85 CPU percentage points are regarded as the proper threshold range for a source PM with high resource utilization [23, 24]. Here, 80TR and 85TR represent the triggering algorithms with thresholds of 80 and 85 CPU percentage points, respectively. The third algorithm is classic Q-learning, in which the action selection depends on the ε-greedy operator. The fourth algorithm, denoted RS + R, is a combination of rough sets and random selection. In RS + R, state S is also divided into three regions; the difference is that when the current state is in the boundary region, RS + R chooses the action completely at random. The fifth comparative algorithm is the combination of rough sets and classic Q-learning, denoted RS + Q, in which classic Q-learning is used only in the action selection phase within the boundary region. Our ABDS involves the combination of rough sets and the modified Q-learning described in Sect. "Strategy interpretation". The specific API for the environment in this paper is developed with OpenAI Gym.
First, we evaluate the power consumption of the comparative algorithms. Note that we only focus on the CPU power of active servers. The reference line is the initial power state of four servers (two of server A and two of server B) at 70% CPU utilization. Due to the constantly triggered workloads, all six power curves in Fig. 3 increase in different ways. Note that underload detection of servers and PM hibernation are not considered in this experiment; thus, the total power increases when a new PM is activated during migration.
Figure 3 shows that the performance of the traditional static threshold strategies is acceptable. The power of both the 80TR and 85TR strategies is moderate compared with that of the other strategies. Basic Q-learning has the highest power value, and we found it difficult to achieve performance convergence over the entire 100-min duration. This shows that effectively applying a naïve AI algorithm is not trivial.
On deeper analysis, the values in the Q table do not converge in a constantly changing computing cluster environment. In addition, the exploration of ε-greedy leads to many poor migration decisions. The power of the RS-related strategies is quite competitive, which shows the benefit of the confined space due to region partitioning. RS + Q and ABDS have the best power performance; the performance of the ABDS is in line with that of RS + Q, and these two strategies have the lowest power consumption. During the last 20 min, the power of the RS + Q strategy is nearly 20% less than that of the 80TR strategy. The underlying reason for this large power difference is that more migrations are triggered in the 80TR strategy, which leads to more active servers and significant power increases.

Fig. 3 Performance evaluation on the CPU power
The second metric in our experiment is the response time (RS time). This sub-experiment is launched by running the AB (Apache Bench) test against an Apache server in a VM, and the data in Fig. 4 are the mean values of 10 experiments.
Since 20 min is the predefined episode duration in the Q-learning-related algorithms, the 100 min are divided into 5 parts, as shown in Fig. 4.
As shown in Fig. 4, the RS time fluctuates across the different durations. In the initial 20 min, the 80 strategy has the worst performance, obviously higher than that of the other traditional strategies. The reason is that the 80 strategy has a lower triggering threshold; thus, more migrations are launched than with the other strategies during the first 20 min. The RS time is strongly related to the migration process: interestingly, in our experiment, the RS time increased sharply during the downtime of migration (to more than 0.8 s, almost 500 times greater than in the normal state). During the evolution (completion and creation) of workloads in the following durations, the RS time increases moderately. The RS time of the 85 strategy is the highest during the second 20 min; we noticed that the CPU utilization of the four PMs approached the specific threshold, which led to an obvious increase in migrations. Q-learning has the highest RS time overall, as it is still far from the optimal policy during the middle durations. The performance of RS + R is worse than that of RS + Q and ABDS; we traced the Nova log file and found that half of the migration processes triggered by the absolutely random policy were unsuitable. RS + Q and ABDS dominate the other strategies during most of the study period, as they suffer the least interference from inappropriate migration processes.
In addition to the RS time, the execution time is another important metric for the performance of applications running on VMs. For CPU-intensive or data-intensive jobs, makespan is an important QoS (quality of service) metric. The expected execution time of an application is extended when its VM endures resource contention or an unsuitable live migration. As discussed in Sect. "Experimental Settings and performance metrics", a lightweight probe with a matrix calculation job is applied to quantify the ratio of the actual execution time to the expected running time. Note that the definition of an SLAV (service-level agreement violation) can be flexible in different computing scenarios. In this paper, an SLAV is defined as the situation in which the RS time of a request in the AB test exceeds 0.1 s or the actual execution time is extended by more than 20% of the expected duration.
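The SLAV definition above translates directly into a small predicate. This is only a restatement of the paper's stated thresholds; the function name and argument names are illustrative.

```python
def is_slav(response_time_s, actual_exec_s, expected_exec_s):
    # SLAV per this paper's definition: AB-test response time above 0.1 s,
    # or actual execution time stretched more than 20% past the expected
    # duration.
    return response_time_s > 0.1 or actual_exec_s > 1.2 * expected_exec_s
```

For example, a request answered in 0.05 s by a job that finished within 10% of its expected duration is not counted as an SLAV, while either threshold being exceeded counts as one violation.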
Figure 5 shows the mean number of SLAVs for the six algorithms during the entire experimental duration. In Fig. 5, RS and ET represent the number of SLAVs caused by the RS time and by the prolonged execution time, respectively. The number of SLAVs for both the 80 and 85 strategies lies within a moderate interval between 20 and 25. For Q-learning, the ET value differs markedly from the RS value: Q-learning has the best ET performance because it always tries to carry out live migration before resource contention occurs.
Fig. 4 Response time performance evaluation

The ABDS strategy becomes cautious in the boundary region of the rough sets when determining whether a virtual machine should migrate. This modest delay in VM migration results in a higher ET value than that of classical Q-learning. However, classical Q-learning suffers severe performance degradation due to its frequent VM migrations. The performance of RS + R is also relatively poor; we speculate that randomly choosing whether to migrate leads to unreasonable migration decisions. Compared with the static-threshold strategies, RS + Q shows no advantage on the SLAV metric. However, we can still observe the positive effect of the rough sets on Q-learning in Fig. 5: by limiting the exploration space, RS + Q decreases the SLAV count by 25.6% compared with naïve Q-learning. ABDS has the minimum SLAV count because the regions of the rough sets are adaptively adjusted across the episode iterations.
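The adaptive adjustment of the rough-set regions can be sketched as a simple feedback rule. The update below is only an assumed illustration of the ABDS idea, not the paper's exact formula: the step size and the widen-on-improvement direction are hypothetical choices made for this sketch.

```python
def adjust_boundary(lower, upper, episode_reward, prev_reward, step=0.01):
    # Illustrative ABDS-style update (not the paper's exact rule):
    # a better episode reward relaxes the thresholds, widening the
    # boundary region; a worse reward tightens them again.
    if episode_reward > prev_reward:
        lower = max(0.0, lower - step)
        upper = min(1.0, upper + step)
    else:
        lower = min(upper, lower + step)
        upper = max(lower, upper - step)
    return lower, upper
```

The key property this sketch captures is that the reward signal from the cluster, rather than a static preset, steers how wide the uncertain region is in the next episode.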
Fig. 5 Number of SLAVs for applications in VMs

Fig. 6 Normalized reward value over 100 min

Figure 6 shows the normalized reward value for the different optimization strategies. From Figs. 3, 4 and 5, we can see that RS + Q dominates naïve Q-learning, while ABDS performs much better than the random choice in RS + R; therefore, Q-learning and RS + R are excluded from further experiments. Figure 6 shows that the reward values of the 80 and 85 strategies fluctuate steadily around their mean values. The underlying reason is that each migration process is triggered under the same conditions. During the experimental duration, the reward values of RS + Q and ABDS increase with large fluctuations. In the first 20 min, the rewards of the RS + Q and ABDS strategies are lower than those of the 80 and 85 strategies because the AI agent cannot find the optimal policy during the exploration phase. During the remainder of the experiment, the rewards of both strategies increase along with the updates of the Q table. However, the reward value of ABDS dominates that of RS + Q during most of the experimental duration. We infer that compressing the exploration space of reinforcement learning is conducive to earlier convergence, while the reward from the cluster can be applied to adjust the boundary region of the rough sets, which leads to better migration decisions.
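For completeness, normalizing a reward series for a plot such as Fig. 6 is typically done with min-max scaling. The paper does not state its exact normalization, so the following is an assumption made for illustration only.

```python
def normalize(rewards):
    # Min-max scale a reward series to [0, 1] for plotting; the exact
    # normalization used for Fig. 6 is an assumption here.
    lo, hi = min(rewards), max(rewards)
    if hi == lo:
        return [0.0 for _ in rewards]  # degenerate flat series
    return [(r - lo) / (hi - lo) for r in rewards]
```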

Related works
VM migration is a key operation for cluster management and has received sustained attention in recent years [5-7, 25]. Hybrid algorithms have been applied to address the trade-off between energy and performance. Multiscale sliding windows were applied to estimate the load, and a specific relationship between resource utilization and SLA was proposed to address the performance and energy issues [26]. Salehi et al. [27] discussed a preemption technique to reduce overall energy consumption under SLA constraints. Factorization techniques were adopted to forecast workload fluctuations, and an integrated method was introduced to reduce energy consumption while maintaining the SLA. Karthikeyan et al. [28] devised a VM swap system in which energy conservation and the SLA can be guaranteed at the same time. Although these studies focus on the energy and performance issues of VM migration, their limitations are obvious: the threshold for triggering live migration is always a static value defined in advance. The greatest challenge for VM migration is how to use smart AI algorithms to make live migration decisions. For reinforcement learning algorithms, two points hinder their application to live VM migration. The first is agent exploration of the optimization space. Policy improvement depends on effective exploration, but exploration itself leads to two problems: first, the convergence rate is slow, which causes the negative impact of delayed decision-making; second, an excessive number of searches increases the computational load, which violates the requirement of reducing energy consumption in green computing. The second point is the instability of the trained model. Based on our underlying experiments, in a complex and time-varying distributed (cloud/edge) computing environment, it is impossible to guarantee the stability of the trained model; some parameters need to be constantly updated and retrained, which leads to the consumption of considerable additional computing power. As a result, in the last 5 years, only one study [29] has focused on the VM migration problem using a reinforcement learning algorithm. In [29], Hummaida utilized reinforcement learning to learn when it is rewarding to migrate a VM; the reward function drives the policy toward high CPU utilization and penalizes overachieving SLAs. Similar to this paper, CPU utilization and response time are regarded as key indicators. However, [29] used a simulation method, which is far from the real computing cluster environment used in this paper.
Compared with the above studies, this paper applies a special threshold for the boundary region that can be adjusted dynamically based on the modified Q-learning approach.
From the perspective of the optimization strategy itself, the main distinguishing characteristic is the category of the algorithms. Considering the complexity of the optimization problem, heuristics are the most common algorithms applied in this research. Torre et al. [30] introduced a heuristic algorithm based on an island population model to approximate the Pareto-optimal VM placement. In some scenarios, it is difficult to design an effective heuristic algorithm for multiobjective optimization, so many bionic algorithms have been applied to solve complex problems. In [31], a hybrid IEFWA/BBO algorithm was used to construct an energy-efficient program for virtual resource management, while an ant colony algorithm was used in [32, 33] to optimize the process of live VM migration. With the growth of AI technology in recent years, machine learning algorithms have become exciting solutions for complex problems in various research areas. An individualized machine learning algorithm was introduced in [34] to increase the accuracy of load prediction. Leveraging reinforcement learning, Peng et al. [35] designed a special approach to manage virtual resources in the IaaS cloud. Reinforcement learning is a suitable approach for modeling the complex interactions between VMs/PMs and computing clusters; however, few studies on VM migration decisions exist. Applying AI techniques in complex scenarios is not easy, and many factors (discretization, episodes, parameters, etc.) need to be carefully designed for real environments. In contrast to the above studies, this paper proposes a new decision system based on Q-learning with specific innovations. To the best of our knowledge, this is the first time rough sets have been combined with the action-selection criterion of Q-learning.

Concluding remarks and future directions
VM migration is widely used for its convenience; however, the side effects of live migration have not received enough attention. This paper describes specific scenarios in which VM migration is misused. Focusing on the problem of overmigration, a new formulation is introduced based on rough sets and Q-learning. To solve this complex optimization problem, we establish an innovative ABDS strategy in which Q-learning is modified and new methods are introduced. A series of experiments verifies that the boundary regions of rough sets can be effectively applied in the action selection of reinforcement learning algorithms, and our ABDS strategy outperforms the benchmarks in the energy and performance evaluations.
There are two directions for our future work. One is to introduce the network topology of a hybrid computing cluster; considering the topology would greatly increase the complexity of the optimization problem, which is beyond the scope of this paper. The other is to explore the feasibility of using advanced AI algorithms to solve the live VM migration problem. Among these algorithms, federated learning has a special advantage in the live optimization of multiple objectives; for example, [36] obtained Pareto-optimal solutions balancing resource efficiency and test accuracy. We will attempt a federated learning method because it has the potential to solve the live VM migration problem in the future.

Table 1 Power (Watts) of the CPU at different utilization levels