A DRL-Based Container Placement Scheme with Auxiliary Tasks

Abstract: Containers are an emerging virtualization technology, widely adopted for providing cloud services because of their lightweight, flexible, isolated and highly portable properties. Cloud services are often instantiated as clusters of interconnected containers. Due to stochastic service arrivals and the complicated cloud environment, it is challenging to achieve an optimal container placement (CP) scheme. We propose to leverage Deep Reinforcement Learning (DRL) to solve the CP problem, since DRL learns from the experience of interacting with the environment and does not rely on a mathematical model or prior knowledge. However, applying a DRL method directly does not lead to satisfying results because of the sophisticated environment states and huge action spaces. In this paper, we propose UNREAL-CP, a DRL-based method to place container instances on servers while considering end-to-end delay and resource utilization cost. The proposed method is an actor-critic-based approach, which has advantages in dealing with huge action spaces. Moreover, the idea of auxiliary learning is also included in our architecture. We design two auxiliary learning tasks about load balancing to improve algorithm performance. Extensive simulation results show that, compared to other DRL methods, UNREAL-CP performs up to 28.6% better in terms of reducing delay and deployment cost, with high training efficiency and response speed.


Introduction
Containers have become a popular operating-system (OS) level virtualization technology for providing cloud computing services, due to their high performance, scalability, lightweight resource allocation and good isolation. Unlike virtual machines (VMs), containers do not require the resources of an entire OS. They are able to share the same OS kernel with each other to reduce provisioning cost and improve resource utilization efficiency. Application development platforms like Docker allow containers to be deployed on top of any [...] (such as throughput, network utility, etc.). One related problem is the VNF placement problem. Song et al. [Song, Zhang, Yu et al. (2017)] consider the tradeoff between computing resource cost and communication resource cost. In Gu et al. [Gu, Chen, Jin et al. (2018)], the VNF placement problem is formulated as a mixed-integer linear program and solved by a relaxation-based algorithm to deal with the computational complexity. Cloud container cluster provisioning belongs to the category of virtual network embedding. Zhou et al. [Zhou, Li and Wu (2019)] design an online method to address the optimal placement of containers with maximal value over all served clusters. Lv et al. [Lv, Zhang, Li et al. (2019)] study the container allocation problem in a real industrial environment, considering the balance between communication cost and resource utilization in large-scale data centers, and propose two algorithms to solve the container placement problem and the container reassignment problem respectively. Zhang et al. [Zhang, Chen, Dong et al. (2019)] propose an improved genetic algorithm to search for a placement scheme with optimal energy consumption.
These existing works mainly use mathematical models such as integer linear programming (ILP) and integer nonlinear programming (INLP) to abstract the problem, and apply mathematical methods like primal-dual or heuristic algorithms [Zhou, Li and Wu (2019); Piet, Bart and Pieter (2018); Quang, Singh, Bradai et al. (2018)] to solve it. However, these approaches only work under the assumption that arriving services are predictable or known a priori. Due to the random arrival of service requests, the network state and network flows are time-varying, and the deployment methods mentioned above have limited ability to adapt to these changes. DRL-based algorithms have great advantages in solving such problems because they interact with the environment and learn from experience [Huang, Yuan, Qiao et al. (2018); Li (2017); Wei, Wang, Guo et al. (2019)]. In Zhao et al. [Zhao, Liang, Niyato et al. (2018)], a double deep Q-learning (DDQN) approach is introduced to obtain optimal resource allocation in heterogeneous networks. However, DDQN only works for problems with a low-dimensional discrete action space. Xu et al. [Xu, Tang, Meng et al. (2018)] propose to leverage deep deterministic policy gradient for model-free control in networks, but find that direct application of a DRL-based solution does not lead to satisfying performance. Therefore, an effective DRL-based method is needed to capture network dynamics and deal with a high-dimensional action space.

Problem definition

Network model
The cloud network is modeled as an undirected graph. Let the binary variable $x_{v,z}$ equal 1 if container $v$ is placed on server $z$, and 0 otherwise. If all the containers of a service are successfully deployed, the service request is satisfied. Fig. 1 shows a placement scheme for the CC of service 1. Let $y_h(v_i, v_j)$ denote the bandwidth requirement for packet transmission from $v_i$ to $v_j$ in service $h$'s CC.

Average delay
If $h$ is accepted, the processing rate at $v$ is $\mu_v^h$; hence the average packet processing time at $v$ is $1/\mu_v^h$. We have the total processing delay of $h$ as:
$$D_h^{proc} = \sum_{v \in h} \frac{1}{\mu_v^h}.$$
The average transmission time the packets experience from $v_i$ to $v_j$ is denoted $t_h(v_i, v_j)$. Let $z(v)$ denote the server where container $v$ is placed. If $v_i$ and $v_j$ are hosted on the same server, the transmission process is negligible, i.e., $t_h(v_i, v_j) = 0$. Therefore, the total transmission delay of $h$ is:
$$D_h^{trans} = \sum_{(v_i, v_j)} t_h(v_i, v_j).$$
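A minimal sketch of this delay model: per-container processing time is taken as $1/\mu$, and transmission delay applies only between containers hosted on different servers. The function name, argument shapes, and dictionary keys are illustrative assumptions, not the paper's notation.

```python
# Hedged sketch of the service delay model: processing delay sum(1/mu) plus
# transmission delay, where same-server pairs contribute zero.
# `mus` maps container -> processing rate, `host` maps container -> server,
# `t` maps container pair -> transmission time (illustrative names).

def service_delay(mus, edges, host, t):
    proc = sum(1.0 / mu for mu in mus.values())
    trans = sum(0.0 if host[i] == host[j] else t[(i, j)]
                for (i, j) in edges)
    return proc + trans
```

With both containers on the same server, only the processing term remains; placing them apart adds the per-edge transmission time.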

Resource cost
For container $v$ in $h$, each unit of resource $r$ it uses incurs a cost of $f_{v,r}^h$. Similarly, the cost per unit of bandwidth is $f_h(v_i, v_j)$. Therefore, the total cost of a satisfied service $h$ is:
$$C_h = \sum_{v \in h} \sum_{r} d_{v,r} \, f_{v,r}^h + \sum_{(v_i, v_j)} y_h(v_i, v_j) \, f_h(v_i, v_j),$$
where $d_{v,r}$ is the amount of resource $r$ used by container $v$. Notations are summarized in Tab. 1.
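The per-service cost, i.e., resource-usage cost per container plus bandwidth cost per CC edge, can be sketched as follows; the dictionary layout and names (`usage`, `unit_cost`, `bw`, `bw_cost`) are illustrative assumptions.

```python
# Hedged sketch of a service's total cost: sum over containers and resource
# types of (amount used * unit cost), plus sum over CC edges of
# (bandwidth requirement * unit bandwidth cost).

def service_cost(usage, unit_cost, bw, bw_cost):
    c = sum(usage[v][r] * unit_cost[v][r]
            for v in usage for r in usage[v])       # per-resource cost
    c += sum(bw[e] * bw_cost[e] for e in bw)        # per-edge bandwidth cost
    return c
```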

Optimization problem
Based on the above definitions, our objective is to minimize the total cost of all deployed CCs and the total delay. We transform the multiple objectives into a single scalar and define the objective to be minimized as a weighted sum of cost and delay:
$$\min \; \sum_{h} \left[ \omega \, C_h + (1 - \omega) \left( D_h^{proc} + D_h^{trans} \right) \right],$$
where $C_h$ is the total cost of service $h$ and $D_h^{proc} + D_h^{trans}$ its total delay, subject to constraints (4a)-(4d). Constraint (4a) guarantees that a container can be deployed on at most one server. Constraint (4b) indicates that a service request is satisfied only when all the containers of its CC are successfully deployed. Constraints (4c) and (4d) ensure that each CC's resource occupation cannot exceed the network's resources.
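The scalarized objective can be sketched as a one-liner; the function name, the weight `w`, and the flat per-service cost/delay inputs are illustrative assumptions rather than the paper's exact notation.

```python
# Hedged sketch of the weighted-sum objective: J = w * total_cost + (1 - w) * total_delay.
# `costs` and `delays` hold per-service cost C_h and delay D_h for satisfied services.

def placement_objective(costs, delays, w=0.5):
    assert 0.0 <= w <= 1.0
    total_cost = sum(costs)     # sum of C_h over satisfied services
    total_delay = sum(delays)   # sum of processing + transmission delay per service
    return w * total_cost + (1.0 - w) * total_delay
```

Setting `w` closer to 1 prioritizes deployment cost; closer to 0 prioritizes delay.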

Algorithm description

A3C algorithm
In a standard DRL setup, an agent interacts with the environment, and the optimization problem is modeled as a Markov decision process (MDP), typically expressed as $\{s, a, r, p\}$, where $s$ is the state space, $a$ is the action space, $r$ is the instantaneous reward, and $p$ is the state transition probability. At each epoch $i$, the agent observes the state $s_i$, executes an action $a_i$, and obtains the next state $s_{i+1}$ and the instantaneous reward $r_i$. Using $r_i$, the agent computes an n-step return, defined as the discounted sum of rewards:
$$G_i = \sum_{k=0}^{n-1} \gamma^k r_{i+k},$$
where $\gamma \in [0,1]$ is a discount factor. This process continues until the agent reaches a terminal state, and restarts afterwards. The agent's objective is to find a mapping from states to actions, called the policy $\pi(s)$, that maximizes the expected return $\mathbb{E}[G_i]$. The action-value function $Q^\pi(s, a)$ is the expected return following action $a$ from state $s$. In the seminal work, value-based model-free reinforcement learning methods (like Q-learning) approximate the action-value function by $Q(s, a; \theta)$ with parameters $\theta$. The parameters $\theta$ are learned by minimizing a mean-squared error, for example optimizing the loss function in n-step Q-learning:
$$L(\theta) = \mathbb{E}\left[ \left( \sum_{k=0}^{n-1} \gamma^k r_{i+k} + \gamma^n \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \right)^2 \right],$$
where $s'$ is the next state after $s$. Compared to value-based methods, policy gradient methods parameterize the policy as $\pi(a|s;\theta)$, updating the parameters $\theta$ to adjust the policy and maximize the expected reward by performing gradient ascent on $\mathbb{E}[G_i]$. However, value-based DRL methods can only work for problems with a low-dimensional action space, since they need to find the action that maximizes the action-value function, which requires an iterative process solving a non-linear programming problem at every epoch. Policy-based methods can handle a large action space, but can only update parameters after each episode is completed, resulting in a very slow learning speed.
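The n-step discounted return above can be computed by folding rewards back to front. The optional `v_boot` argument is an illustrative addition: with the default 0.0 the function is the plain discounted sum, while a nonzero value bootstraps from a terminal value estimate.

```python
# Minimal n-step discounted return: G = sum_k gamma^k * r_k (+ gamma^n * v_boot).

def n_step_return(rewards, v_boot=0.0, gamma=0.99):
    g = v_boot
    for r in reversed(rewards):  # fold rewards from last to first
        g = r + gamma * g
    return g
```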
Since the total number of possible CC placement decisions is enormous, our optimization problem is difficult to solve by value-only or policy-only methods. We therefore tackle it with an actor-critic approach, the asynchronous advantage actor-critic (A3C) algorithm, which uses a parameterized actor network to generate actions, so it can handle a high-dimensional action space; at the same time, the critic's value function estimation supports the actor's gradient updates.
The parameterized actor function is implemented by a CNN and expressed as $\pi(a|s;\theta)$. The value function $V^\pi(s_i)$ is defined as the expected discounted return obtained by following policy $\pi$ from the current state $s_i$. In the actor-critic method, the value function is estimated by a parameterized critic $V(s; \theta_v)$. The parameters of the actor and critic are learned by iteratively minimizing a sequence of loss functions, where the actor's and critic's losses are respectively defined as:
$$L_{actor}(\theta) = -\log \pi(a_i|s_i;\theta)\,\big(G_i - V(s_i;\theta_v)\big) - \beta H\big(\pi(\cdot|s_i;\theta)\big),$$
$$L_{critic}(\theta_v) = \big(G_i - V(s_i;\theta_v)\big)^2.$$
To make the process of propagating rewards to relevant states much more efficient, $G_i$ is updated toward the n-step return, defined as
$$G_i = \sum_{k=0}^{n-1} \gamma^k r_{i+k} + \gamma^n V(s_{i+n};\theta_v),$$
where $H$ is the entropy and the hyperparameter $\beta$ controls the strength of the entropy regularization term, which is used to prevent the policy from converging too early.
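The per-step actor and critic losses can be illustrated with plain floats: the critic minimizes the squared advantage, and the actor loss combines the log-probability of the taken action, the advantage, and an entropy bonus. A real implementation would backpropagate through network parameters; this sketch only evaluates the scalar losses.

```python
import math

# Hedged sketch of per-step A3C losses: critic minimizes (G - V)^2; actor
# maximizes log pi(a|s) * advantage plus an entropy bonus (so its loss is the
# negative). `probs` is the full action distribution at this state.

def a3c_losses(g_return, value, action_prob, probs, beta=0.01):
    advantage = g_return - value
    critic_loss = advantage ** 2
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    actor_loss = -math.log(action_prob) * advantage - beta * entropy
    return actor_loss, critic_loss
```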
To eliminate the correlation between samples, the A3C algorithm relies on asynchronous actor-learners and accumulated updates to improve training stability. Compared with previous solutions that use experience replay to reduce time correlation, the asynchronous RL framework saves storage space. Besides stabilizing learning, we obtain a reduction in training time that is roughly linear in the number of parallel actor-learners. To apply the A3C algorithm, we model our placement problem as an MDP and design the state space, action space and reward function.
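The asynchronous actor-learner idea can be illustrated with a toy example: each thread accumulates its own updates over a rollout and then applies them to shared parameters. This is only a sketch of the accumulate-then-apply pattern, not the actual A3C training loop, and the lock is a simplification of A3C's lock-free updates.

```python
import threading

# Toy sketch of asynchronous actor-learners: each thread accumulates local
# updates (standing in for gradients over an n-step rollout) and applies the
# sum to shared parameters.

shared_params = [0.0]
lock = threading.Lock()

def learner(updates):
    local = 0.0
    for u in updates:      # accumulate updates locally over the rollout
        local += u
    with lock:             # apply the accumulated update to shared weights
        shared_params[0] += local

threads = [threading.Thread(target=learner, args=([0.5, 0.5],))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Training time shrinks roughly linearly with the number of such parallel learners, as each contributes independent experience.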
Action space: The action is defined as a placement scheme for all services' CCs, i.e., a solution to the optimization problem.

Reward function: When the agent performs an action according to the current state and accomplishes the CC placement, it gets an instantaneous reward. Since the goal of DRL is to maximize the cumulative reward, we use the difference between the objective function at the current state and that at the next state:
$$r_i = J(s_i) - J(s_{i+1}),$$
where $J$ is the objective function defined above.

Although the A3C algorithm performs well in many tasks as a state-of-the-art DRL method, our experimental results show that applying A3C directly to CC placement does not bring satisfactory performance: its convergence speed is intolerable when the solution space is large. We speculate that this is due to the following reasons:
• In the general A3C training process, the agent only considers maximizing the cumulative reward to achieve the best policy, and does not clearly know how to explore.
• Although A3C uses an asynchronous method to reduce the correlation between samples, it forgoes the replay buffer, which could otherwise increase the efficiency and stability of learning.
In response to these problems, we propose an improved A3C algorithm for the CC placement problem.

UNREAL-CP
In the general RL training process, the agent only considers maximizing the cumulative reward to achieve the best policy, and a simple random-noise-based exploration method does not work well for our CP problem. However, the environment may also contain other useful training information. To address the problem of low exploration efficiency, we introduce the idea of auxiliary learning to guide the agent's exploration in the base task. The agent is trained to maximize the reward of multiple tasks that share the same goal. This method does not require additional supervision and is an unsupervised learning method, called unsupervised reinforcement and auxiliary learning (UNREAL). In addition to the auxiliary tasks, we also use a replay buffer to improve the accuracy of the value function and the efficiency of the auxiliary tasks. Given a set of auxiliary control tasks $C$, we define each auxiliary control task $c \in C$ by a reward function $r^{(c)}$, with the agent's policy for it denoted $\pi^{(c)}$. The overall objective is to maximize total performance across all tasks:
$$\arg\max_\theta \; \mathbb{E}[G] + \lambda \sum_{c \in C} \mathbb{E}\big[G^{(c)}\big],$$
where $G^{(c)}$ is the discounted return of auxiliary task $c$, and $\theta$ is the set of parameters shared by $\pi^{(c)}$ and $\pi$. By sharing parameters, the agent improves the performance of the policy $\pi$ by optimizing its performance on the auxiliary tasks. For each auxiliary task $c$, we sample minibatches of transitions from the replay buffer and use Q-learning to optimize the action-value function $Q^{(c)}(s, a)$. For each control task $c$, we optimize an n-step Q-learning loss:
$$L_Q^{(c)} = \mathbb{E}\left[ \left( \sum_{k=0}^{n-1} \gamma^k r^{(c)}_{i+k} + \gamma^n \max_{a'} Q^{(c)}(s', a') - Q^{(c)}(s, a) \right)^2 \right].$$
In our problem, the balance of server resource utilization has a great impact on network performance. First of all, load balancing can prevent any single server from being overloaded or crashing, thereby increasing service availability. Second, when resource utilization is high, servers usually exhibit exponentially growing response times; load balancing keeps servers' resource utilization acceptable, resulting in shorter response times.
Third, under a balanced workload, no server becomes a bottleneck, which improves overall network throughput. Given the significance of load balancing for network performance, we design the following auxiliary tasks:

Resource utilization control
If the resource utilization of one server is significantly higher than that of the others, it will become a bottleneck of the service and seriously reduce overall network performance. We therefore want the variance of resource utilization across all servers to be as small as possible.
The variance of the utilization of resource type $r$ across all servers is given by
$$Var(U_r) = \frac{1}{|Z|} \sum_{z \in Z} \big(U_{zr} - \bar{U}_r\big)^2,$$
where $U_{zr}$ is the utilization of $r$ on server $z$, $\bar{U}_r$ is the mean utilization of $r$ across all servers, and $|Z|$ is the number of servers. The total resource utilization variance is the sum over all resource types:
$$Var(U) = \sum_{r} Var(U_r).$$
The objective of this auxiliary task is to minimize the total resource utilization variance (i.e., $\min Var(U)$), so we refer to it as resource utilization control. The reward function of this task is defined as
$$r^{RUC} = Var(U) - Var'(U),$$
where $Var'(U)$ is the total utilization variance of the next state.
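The resource-utilization-control reward, i.e., the drop in total utilization variance between consecutive states, can be sketched directly; the nested-list input layout (`util[r][z]`) is an illustrative assumption.

```python
# Hedged sketch of the RUC reward: r = Var(U) - Var'(U), where Var(U) sums the
# per-resource-type population variance of utilization across servers.

def total_util_variance(util):
    """util[r][z]: utilization of resource type r on server z."""
    total = 0.0
    for per_server in util:
        mean = sum(per_server) / len(per_server)
        total += sum((u - mean) ** 2 for u in per_server) / len(per_server)
    return total

def ruc_reward(util_now, util_next):
    # Positive reward when the placement action reduces imbalance.
    return total_util_variance(util_now) - total_util_variance(util_next)
```

Moving from a skewed load (0.2 vs. 0.8) to a balanced one (0.5 vs. 0.5) yields a positive reward.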

Available resource balance control
The amounts of different types of available resources on each server also need to be balanced. If no RAM is available, the remaining CPU resources on a server are useless for incoming service requests. For two kinds of resources $r_i$ and $r_j$, let $\delta(r_i, r_j)$ denote the expected ratio between $r_i$ and $r_j$; the available resource balance index for $r_i$ and $r_j$ measures how far the servers' available-resource ratios deviate from $\delta(r_i, r_j)$. Finally, the reward function of the available resource balance control task is defined as
$$r^{RBC} = Bal(A) - Bal'(A),$$
where $Bal'(A)$ is the total available resource balance index of the next state. Based on the above definitions, the UNREAL algorithm for the CC placement problem aims to maximize total performance across all these auxiliary tasks, optimizing a single combined loss function with respect to the joint parameters $\theta$. The loss function combines the A3C loss $L_{A3C}$ with the auxiliary resource utilization control loss $L_{RUC}$ and the available resource balance control loss $L_{RBC}$:
$$L(\theta) = L_{A3C} + \lambda_{RUC} L_{RUC} + \lambda_{RBC} L_{RBC}.$$
In summary, our UNREAL-CP algorithm is expressed as Algorithm 1.

End-to-end average packet delay, total deployment cost, and the weighted sum of delay and cost are used as indicators to measure the performance of each algorithm under different numbers of containers. In addition, we compare the performance of A3C and UNREAL-CP in terms of learning efficiency and cumulative reward. From Fig. 2, we can see that, compared to the other three algorithms, UNREAL-CP reduces deployment cost significantly. For example, when each CC contains 5 containers, UNREAL-CP reduces the deployment cost by 28.1%, 39.2% and 51.6% compared to A3C with RUC, A3C with RBC and the original A3C, respectively. When the number of containers is small, the delays under the placement strategies obtained by the four algorithms are similar. However, the gap between the algorithms becomes larger as the number of containers increases. This is because the UNREAL-CP algorithm considers resource balance.
When the number of containers in a service increases, some overloaded servers become network bottlenecks, affecting traffic transmission, while balancing the network load can effectively avoid such problems.

Figure 3: Average delay under different algorithms
Because UNREAL-CP performs better in reducing placement cost and average latency, it achieves an average reduction of the weighted sum of cost and delay by 11.9%, 19.1% and 28.6% respectively, as shown in Fig. 4. From Fig. 5, we can observe that UNREAL-CP has better convergence performance: it reaches a better placement solution with higher reward within 60 episodes, while A3C uses more time and only obtains a locally optimal solution. This means that the auxiliary tasks we designed significantly improve the learning efficiency of the agent in the container placement problem.

Conclusion
In this paper, we present UNREAL-CP, a DRL approach for placing container clusters in the cloud, taking deployment cost and average E2E delay into consideration. The A3C algorithm architecture is used because it makes decisions under the guidance of both the actor's and the critic's DNNs, and has great advantages in solving dynamic and continuous control problems. Since network load balancing has a significant impact on reducing network latency, we propose two auxiliary tasks, resource utilization control and available resource balance control, to improve the convergence performance of A3C. The results of extensive experiments show that the proposed UNREAL-CP algorithm can reduce deployment cost and E2E delay by up to 28.6%. Compared with the original A3C, the convergence speed of our algorithm is also improved.