A smart cache content update policy based on deep reinforcement learning

This paper proposes a DRL-based cache content update policy for cache-enabled networks to improve the cache hit ratio and reduce the average latency. In contrast to the existing policies, a more practical cache scenario is considered in this work, in which the content requests vary by both time and location. Considering the constraint of the limited cache capacity, the dynamic content update problem is modeled as a Markov decision process (MDP), and the deep Q-learning network (DQN) algorithm is utilised to solve it. Specifically, a neural network is optimised to approximate the Q value, with the training data chosen from the experience replay memory, and the DQN agent derives the optimal policy for the cache decision. The simulation results show that, compared with the existing policies, our proposed policy improves the cache hit ratio by 56%–64% and reduces the average latency by 56%–59%.

of the backhaul can be reduced and content retrieval from the edge can be faster than from the remote core network [6] [7].
However, because of the limited cache capacity, it is necessary to update the cached content so that cache-enabled networks always store the most popular content [8]. The two most common content update policies are the least frequently used (LFU) policy and the least recently used (LRU) policy [9]. LRU preferentially retains the content with the most recent access time, whereas LFU preferentially retains the content with the largest cumulative request count.
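For concreteness, the following minimal Python sketch (ours, not from the paper) illustrates the two eviction rules; the class and method names are hypothetical.

```python
# Illustrative sketch: LRU and LFU update rules for a fixed-capacity cache.
from collections import OrderedDict, Counter

class LRUCache:
    """Evicts the content whose last access is oldest."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()              # keeps access order

    def request(self, content):
        hit = content in self.store
        if hit:
            self.store.move_to_end(content)     # mark as most recently used
        else:
            if len(self.store) >= self.capacity:
                self.store.popitem(last=False)  # drop least recently used
            self.store[content] = True
        return hit

class LFUCache:
    """Evicts the content with the smallest cumulative request count."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = set()
        self.freq = Counter()                   # cumulative request counts

    def request(self, content):
        self.freq[content] += 1
        hit = content in self.store
        if not hit:
            if len(self.store) >= self.capacity:
                victim = min(self.store, key=lambda c: self.freq[c])
                self.store.discard(victim)
            self.store.add(content)
        return hit
```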
Besides, a heterogeneous cache structure is proposed in [10], in which the most popular contents are stored at small BSs and the less popular contents are stored at macro BSs; the combination of small BSs and macro BSs can maximise the network capacity and satisfy the content transmission demand. In [11], an optimal cooperative cache policy that increases the cache hit ratio was presented, where the cache hit ratio describes how often the content requested by mobile users can be served from the local cache. In [9], an adaptive cache policy was proposed that reduces user access latencies. In [12], an edge cache policy was proposed to reduce the average content delivery latency. However, these conventional methods lack adaptability in dynamic cache scenarios because they assume that the content popularity distribution is known or can be accurately predicted, which is difficult to achieve in practice. When the assumed popularity distribution is inaccurate, their cache performance degrades, since it depends heavily on an accurate content popularity distribution.
Motivated by the success of deep reinforcement learning (DRL) in solving dynamic problems [13], DRL has been applied to cache policies to improve cache performance in dynamic cache scenarios. In [14], a DRL approach was proposed to reduce the transmission cost by jointly considering proactive caching and content recommendation. In [15], a DRL-based cache content update policy was proposed to improve energy efficiency. In [16], a DRL model was utilised to minimise transmission latencies; specifically, reinforcement learning (RL) is applied to obtain the optimal cache policy. In [17], a DRL-based policy was proposed to minimise system power consumption. In [18], the deep Q-learning network (DQN) algorithm, one branch of DRL, is applied to make network slicing decisions and allocate spectrum resources for content delivery. In [19], a DQN-based mobile edge computing network is proposed, in which several computation tasks are offloaded from the user terminals to the computational access points. Although DQN has attracted significant attention in cache-enabled networks, little work has applied DQN to the cache content update phase. Moreover, most of the previously mentioned DRL-based cache policies model content requests as varying only in time; they do not consider the more practical scenarios in which content requests vary in both time and location, also known as spatiotemporally varying scenarios.
Inspired by the aforementioned literature, this paper proposes a DQN-based content update policy at the BSs to increase the cache hit ratio and reduce the average latency in spatiotemporally varying scenarios, i.e., scenarios in which content requests vary by both time and location. DQN is chosen for two reasons: 1) DQN converges faster than conventional DRL policies such as advantage actor-critic (A2C) and deep deterministic policy gradient (DDPG) [14]; 2) DQN can adapt to varying scenarios, as long as the dynamic problem is correctly modeled and the DQN agent is allowed to continuously learn from the environment [18]. The main contributions are summarised as follows:
• The dynamic cache content update problem is formulated as a Markov decision process (MDP) problem, which is solved by a DQN algorithm. Specifically, a neural network is utilised to approximate the Q value, and the DQN agent decides whether or not to cache the requested content.
• Our proposed policy is compared with the LRU, LFU, and DRL [20] policies, and the simulation results demonstrate that it achieves the best cache performance in terms of cache hit ratio and average latency.
The rest of this paper is organised as follows. The system model and problem formulation are introduced in Section 2. The detailed elements of the MDP framework and the principles of the DQN-based cache content update policy are discussed in Section 3. The simulation results are shown in Section 4 and the conclusion is provided in Section 5.

System model and problem formulation
In this section, the system model and the problem of how to maximise the cache hit ratio and minimise average latency are introduced.

System model
As shown in Figure 1, the cache-enabled system includes one core network, ℳ cache-enabled BSs, and ℧ mobile users. Each BS can store at most ℋ contents. The total content library ℱ = {1, 2, …, F} contains F kinds of contents, and each content has the same size s. The core network is assumed to have enough capacity to store all of the contents.
Each BS covers a circular cellular region with a fixed radius, and all of the mobile users within a BS's coverage area request contents from that serving BS.

Problem formulation
The problem in this study consists of two sub-problems: maximising the cache hit ratio and minimising average latency.

A) Maximising the cache hit ratio
The cache hit ratio describes the probability that the requested content is found at the local cache. For N requests, the system cache hit ratio $h_{ratio}$ is formulated as:

$$h_{ratio} = \frac{1}{N}\sum_{n=1}^{N} x(f_n),$$

where $x(f_n)$ is a function that tests whether the requested content $f_n$ is cached locally:

$$x(f_n) = \begin{cases} 1, & \text{if } f_n \text{ is cached at the serving BS,} \\ 0, & \text{otherwise.} \end{cases}$$

Maximising the cache hit ratio is then expressed as:

$$\mathrm{P}_1: \ \max \ h_{ratio}.$$
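As an illustration of P_1, the short sketch below (ours, not from the paper) replays a request trace through a cache policy object such as the LRU/LFU sketches given earlier and computes the hit ratio exactly as the reconstructed formula does.

```python
# Illustrative sketch: replay a request trace through a cache policy object
# exposing request(content) -> bool (hit or miss) and compute
# h_ratio = (1/N) * sum_n x(f_n).
def hit_ratio(requests, cache):
    if not requests:
        return 0.0
    hits = sum(1 for f in requests if cache.request(f))  # sum of x(f_n)
    return hits / len(requests)                          # divide by N
```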

B) Minimising the average latency
The latency is an indicator that evaluates the cache content update policy's performance.
The latency is the time taken for content to be transmitted from one location to another. It consists of the transmission latency $L_{tra}$, the propagation latency $L_{pro}$, the processing latency $L_{proc}$, and the queue latency $L_{que}$. From [20], the latency is given as:

$$L = L_{tra} + L_{pro} + L_{proc} + L_{que}.$$

Normally, in the content update process, the destination of the content packet is determinate and the content packet is assumed not to need to wait for transmission. Hence, the processing and queue latencies can be neglected during the content update process [20] [21], and the latency simplifies to:

$$L = \frac{s}{v} + \frac{d}{\mathcal{R}}\,L^{*},$$

where $s$ is the content size, $v$ is the content transmission rate, $\mathcal{R}$ is the maximal coverage radius of the serving BS or core network, $d$ is the distance between the user and the serving BS or between the serving BS and the core network, and $L^{*}$ is the maximal propagation latency between the user and the serving BS or between the serving BS and the core network. To meet the requirement of fifth-generation (5G) communication [22], the indicator $L^{*}$ is decomposed into $L^{*}_{u-b}$, the maximal propagation latency between the user and the serving BS, and $L^{*}_{b-c}$, the maximal propagation latency between the serving BS and the core network.
In more detail, if the requested content is cached locally, the content can be retrieved directly from the serving BS. Thus, for a hit content request, we consider the maximal propagation latency between the user and the serving BS $L^{*}_{u-b}$, the distance between the user and the serving BS $d_{u-b}$, and the maximal coverage radius of the serving BS $\mathcal{R}_{b}$. The hit content latency $L_{hit}$ is defined as:

$$L_{hit} = \frac{s}{v} + \frac{d_{u-b}}{\mathcal{R}_{b}}\,L^{*}_{u-b}.$$

If the requested content is missed at the serving BS, the serving BS first retrieves the requested content from the core network and then delivers it to the corresponding user. Hence, for a missed content request, we consider the maximal propagation latency between the user and the serving BS $L^{*}_{u-b}$, the maximal propagation latency between the serving BS and the core network $L^{*}_{b-c}$, the distance between the user and the serving BS $d_{u-b}$, the distance between the serving BS and the core network $d_{b-c}$, and the corresponding maximal coverage radii. The missed content latency $L_{miss}$ is defined as:

$$L_{miss} = \frac{2s}{v} + \frac{d_{u-b}}{\mathcal{R}_{b}}\,L^{*}_{u-b} + \frac{d_{b-c}}{\mathcal{R}_{c}}\,L^{*}_{b-c}.$$

The system latency $L_{sys}$ is the sum of the latencies of all of the hit content requests and all of the missed content requests, and the average latency $L_{avg}$ is the system latency divided by the number of content requests $N$:

$$L_{sys} = \sum_{\text{hit requests}} L_{hit} + \sum_{\text{missed requests}} L_{miss}, \qquad L_{avg} = \frac{L_{sys}}{N}.$$

The problem of minimising the average latency can then be formulated as:

$$\mathrm{P}_2: \ \min \ L_{avg}.$$
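The following sketch mirrors the reconstructed hit/miss latency expressions. The argument names are ours, and the assumption that a miss incurs the transmission latency on both hops is not confirmed by the paper.

```python
# Illustrative sketch of the reconstructed latency model (all names ours).
def hit_latency(s, v, d_ub, R_b, L_ub):
    """Latency when the content is served directly by the serving BS."""
    return s / v + (d_ub / R_b) * L_ub

def miss_latency(s, v, d_ub, R_b, L_ub, d_bc, R_c, L_bc):
    """Latency when the BS first fetches the content from the core network."""
    return 2 * s / v + (d_ub / R_b) * L_ub + (d_bc / R_c) * L_bc

def average_latency(latencies):
    """System latency divided by the number of content requests."""
    return sum(latencies) / len(latencies) if latencies else 0.0
```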

A deep Q-learning network-based cache content update policy
The related elements of the deep Q-learning network will be introduced in section 3.1. The principle of the DQN algorithm and the workflow of our proposed cache policy will be provided in section 3.2.

The description of the related elements of the deep Q-learning network
The principle of the DQN can be regarded as a Markov decision process (MDP) [23] [24].
To apply the DQN to the cache content update problem, the related notations under the DQN framework are described.

A) State space
In time slot $t$, the instant state consists of the currently cached content, the currently requested content and its corresponding user, the user's next location, and the current time. It is defined as:

$$s_t = \{C_i, f_t, u_j, l_j, t\},$$

where $C_i$ is the cached content at the $i$-th DQN agent, $f_t$ is the currently requested content, $u_j$ is the unique name of the mobile user currently requesting the content, $l_j$ is the next location of the $j$-th user, i ∈ {1, 2, …, ℳ}, and j ∈ {1, 2, …, ℧}. The state space $S$ is the set of all of the instant states over a time period:

$$S = \{s_1, s_2, \dots, s_T\}.$$

B) Action space
The action of the $i$-th DQN agent, $A_i$, uses a one-hot code, which means only one action can be executed in a time slot. In this study, $a_0 = 1$ means the cached content remains the same, and $a_v = 1$ means the $v$-th cached content is replaced by the currently requested content.

The state-action value function is the expected value of the discounted cumulative reward obtained from the current state and action under the policy used to choose actions.
The state-action value function is defined as:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s, a_t = a\right],$$

where $\gamma \in [0, 1]$ is a discount factor that weights the future reward relative to the current state $s$. The target of the MDP is to find the optimal policy $\pi^{*}(s)$ and the corresponding optimal value function $Q^{*}(s, a)$ that maximise the value function.
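To make the MDP elements concrete, a minimal encoding of the state and the one-hot action is sketched below; the field names and types are our assumptions, not the paper's notation.

```python
# Illustrative encoding of the MDP elements (ours).
from dataclasses import dataclass
from typing import Tuple, List

@dataclass
class CacheState:
    cached_contents: Tuple[int, ...]  # contents stored at the i-th BS (DQN agent)
    requested_content: int            # content requested in the current time slot
    user_id: int                      # mobile user issuing the request
    next_location: int                # the user's next location
    time_slot: int                    # current time

def one_hot_action(index: int, cache_capacity: int) -> List[int]:
    """a_0 = 1 keeps the cache unchanged; a_v = 1 (v >= 1) replaces the
    v-th cached content with the currently requested content."""
    action = [0] * (cache_capacity + 1)
    action[index] = 1
    return action
```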

The cache content update based on the deep Q-learning network
A) Principle of the DQN framework
DQN is a hybrid framework that combines neural networks and Q-learning. In this framework, a neural network is used to predict the Q values rather than recording them in a Q table. However, the plain combination of Q-learning and a neural network alone is not efficient; the following two characteristics, experience replay and a separate target network, improve the DQN framework's efficiency. The neural network enables $Q(s, a, \theta) \approx Q(s, a)$ [26]. According to [5], the evaluation of $Q(s, a)$ is derived from Q-learning as:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right],$$

where $\alpha \in (0, 1)$ is the learning rate and $\gamma \in [0, 1]$ is the discount factor.
The neural network is trained by minimising the loss function, which is defined as:

$$\mathrm{Loss}(\theta) = \mathbb{E}\!\left[\left(r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}, \theta^{-}) - Q(s_t, a_t, \theta)\right)^{2}\right],$$

where $r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}, \theta^{-})$ is the target network's Q value and $Q(s_t, a_t, \theta)$ is the evaluation network's Q value.
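A minimal sketch of this loss in PyTorch is given below; the framework choice is ours (the paper does not name one), and eval_net and target_net stand for the evaluation and target networks.

```python
# Illustrative DQN loss: squared TD error against the frozen target network.
import torch
import torch.nn.functional as F

def dqn_loss(eval_net, target_net, states, actions, rewards, next_states, gamma):
    # Q(s_t, a_t, theta) for the actions actually taken
    q_eval = eval_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # the target network's parameters theta^- stay fixed
        q_target = rewards + gamma * target_net(next_states).max(dim=1).values
    return F.mse_loss(q_eval, q_target)  # averaged over the batch
```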
The detailed optimisation of the evaluation network and the target network is shown in Figure 2. In each training step, a batch of experiences is randomly selected from the experience replay memory, and the resulting loss is backpropagated through the evaluation network. The parameter $\theta$ of the evaluation network is then updated by minimising the loss function via stochastic gradient descent (SGD). After several steps, the parameter $\theta^{-}$ of the target network is updated by assigning the latest $\theta$ to $\theta^{-}$. After a training period, the two neural networks are stably trained.
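The update procedure described above could be sketched as follows, assuming each experience in the replay memory is stored as a tuple of tensors (state, action, reward, next_state); the batch size, optimiser, and synchronisation period are illustrative placeholders, not values from the paper.

```python
# Illustrative training loop: random replay batches, SGD on the evaluation
# network, and periodic copying of theta into the target network's theta^-.
import random
import torch
import torch.nn.functional as F

def train(eval_net, target_net, replay_memory, optimizer, gamma,
          batch_size=32, sync_every=100, steps=1000):
    for step in range(steps):
        if len(replay_memory) < batch_size:
            continue  # wait until enough experience has been collected
        batch = random.sample(replay_memory, batch_size)       # random batch
        states, actions, rewards, next_states = map(torch.stack, zip(*batch))
        q_eval = eval_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():                                   # frozen target net
            q_target = rewards + gamma * target_net(next_states).max(dim=1).values
        loss = F.mse_loss(q_eval, q_target)
        optimizer.zero_grad()
        loss.backward()                                         # backpropagate the loss
        optimizer.step()                                        # gradient step on theta
        if step % sync_every == 0:                              # copy theta -> theta^-
            target_net.load_state_dict(eval_net.state_dict())
```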

Results and discussion
In this study, we consider a cache-enabled network with 4 BSs and 10 mobile users and ensure that each user is covered by a BS. For simplicity, the users are distributed along the edge of their serving BS and each BS is at the maximal communication distance from the core network; hence the ratio d/ℛ equals 1. Besides, there is no overlap between any two BSs, so handover between BSs is avoided. Furthermore, each content has the same size (2,000 bits) and the content transmission rate is 35 Mbit/s. The neural network has three layers: the input layer, one hidden layer, and the output layer. The hidden layer has 512 neurons, and the numbers of neurons at the input and output layers are (ℋ+3) and (ℋ+1), respectively. The maximal cache capacity ℋ is specified in each experiment. The learning rate is 0.9, the greedy parameter is 0.9, and the discount factor is 0.1. The content requests of each user are generated following the Zipf distribution law (a request-generation sketch is given at the end of this section):

$$P(k) = \frac{k^{-\beta}}{\sum_{n=1}^{N} n^{-\beta}},$$

where $k$ is the content rank, $\beta$ is the Zipf parameter, and $N$ is the total number of content requests. In each experiment, we assume that the total number of content requests is 7,200.

We first compare the cache hit ratios of the LRU, LFU, DRL [20], and our proposed policies. The Zipf parameters vary from 1.1 to 1.8, the users' locations are fixed, and the cache can store at most 288 types of contents. As the Zipf parameter increases, the cache hit ratios of all four policies increase. This occurs because, as the Zipf parameter increases, fewer contents account for larger request probabilities; in other words, the popular content becomes more popular, the unpopular content becomes less popular, and the number of distinct requested contents decreases. For the same cache capacity, the cached content is therefore more popular, and the cache hit ratio increases.
Our proposed policy has the highest cache hit ratio regardless of the Zipf parameter. The simulation demonstrates that the effect of the popular content on the cache hit ratio increases as the Zipf parameter increases. Thus, our proposed policy is superior to the three other policies.

In the spatiotemporally varying scenario, in which the users' locations and the Zipf parameters are randomly generated, our proposed policy improves the cache hit ratio by 56%-64% and reduces the average latency by 56%-59% compared with the other policies, respectively. This significant improvement occurs because our proposed policy accounts for the users' random distribution and the random generation of the Zipf parameters.
Therefore, our proposed policy quickly adapts to spatiotemporally varying content requests.
Consequently, we conclude that our proposed policy is superior for managing spatiotemporally varying problems.

We next compare the average latencies of the four policies. Here, the Zipf parameters vary from 1.1 to 1.8, the mobile users' locations are fixed, and the cache can store at most 288 types of contents. As demonstrated, our proposed policy always has the lowest average latency among the four policies. This is because it achieves the highest cache hit ratio: the higher the cache hit ratio, the more contents can be retrieved locally, and the local latency from the BS is much smaller than the remote latency from the core network. Therefore, our proposed policy performs better than the other three policies in terms of the average latency. As shown in Figure 7, we also investigate the effect of the cache capacity on the average latency.
In this simulation, the Zipf parameter is 1.4.
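For reference, the Zipf-based request generation described in the simulation setup could be sketched as below; numpy is our choice of library, and the library size used in the example call is a placeholder, not a value from the paper.

```python
# Illustrative sketch: draw content requests from a Zipf law over content ranks.
import numpy as np

def zipf_requests(num_contents, beta, num_requests, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    ranks = np.arange(1, num_contents + 1)
    probs = ranks.astype(float) ** (-beta)
    probs /= probs.sum()                 # normalise into a probability distribution
    return rng.choice(ranks, size=num_requests, p=probs)

# Example: 7,200 requests with Zipf parameter 1.4 over a hypothetical library size
requests = zipf_requests(num_contents=500, beta=1.4, num_requests=7200)
```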

Conclusions
In this study, a DRL-based cache content update policy is proposed with the objective of maximising the cache hit ratio and minimising the average latency. Compared to the existing policies, a more practical cache scenario is considered, in which the content requests vary spatiotemporally. The dynamic content update problem is formulated as an MDP problem, and DQN is applied to solve it. Specifically, the neural network is trained to approximate the Q value, with the training data chosen from the experience replay memory, and the DQN agent derives the optimal cache decision policy from the neural network. Compared with the existing policies, i.e., the LFU, LRU, and DRL [20] policies, the simulation results show that our proposed DRL-based cache content update policy achieves the best cache performance in the considered spatiotemporally varying scenario, improving the cache hit ratio by 56%-64% and reducing the average latency by 56%-59%.

Data availability
The content request data used in this study are generated as described in the simulation section.

Conflicts of interest
Lincan Li, Chiew Foong Kwong, Qianyu Liu, and Jing Wang declare that there are no conflicts of interest regarding the publication of this paper.

Funding statement
This study was supported by Ningbo Natural Science Programme (NBNSP), project code 2018A610095.