Minibatch Recursive Least Squares Q-Learning

The deep Q-network (DQN) is one of the most successful reinforcement learning algorithms, but it has drawbacks such as slow convergence and instability. In contrast, traditional reinforcement learning algorithms with linear function approximation usually have faster convergence and better stability, although they easily suffer from the curse of dimensionality. In recent years, many improvements to DQN have been made, but they seldom make use of the advantages of traditional algorithms to improve DQN. In this paper, we propose a novel Q-learning algorithm with linear function approximation, called minibatch recursive least squares Q-learning (MRLS-Q). Different from the traditional Q-learning algorithm with linear function approximation, the learning mechanism and model structure of MRLS-Q are more similar to those of a DQN with only one input layer and one linear output layer. It uses experience replay and the minibatch training mode, and it uses the agent's states rather than the agent's state-action pairs as inputs. As a result, it can be used alone for low-dimensional problems and can also be seamlessly integrated into DQN as the last layer for high-dimensional problems. In addition, MRLS-Q uses our proposed average RLS optimization technique, so that it can achieve better convergence performance whether it is used alone or integrated with DQN. At the end of this paper, we demonstrate the effectiveness of MRLS-Q on the CartPole problem and four Atari games and investigate the influences of its hyperparameters experimentally.


Introduction
Reinforcement learning (RL) is an important machine learning methodology for solving sequential decision-making problems. In theory, by interacting with an initially unknown environment, the RL agent can learn the optimal action policies at different states to maximize the cumulative expected return [1]. Unfortunately, in the past several decades, due to the so-called "curse of dimensionality," RL could only be used to solve some real-world problems with small-scale discrete or low-dimensional continuous state spaces. It was not until 2013 that this dilemma was partially solved by Mnih et al. [2]. By combining the Q-learning algorithm with deep learning, they proposed the preliminary version of the deep Q-network (DQN) algorithm. Two years later, Mnih et al. [3] presented the normal version of DQN, which achieves human-level performance on 49 classical Atari games. Since then, DQN has attracted more and more research attention, many other novel deep RL algorithms [4,5] and new applications [6,7] have been proposed, and deep RL has become a thriving research branch in artificial intelligence. However, although DQN has succeeded in some more complicated problems [8][9][10], it still has many drawbacks, such as slow convergence, instability, and low sample efficiency. Therefore, we will focus on how to improve the DQN's performance in this paper.
Currently, there are three main categories of research work on improving DQN. The first category mainly focuses on how to estimate action values accurately. For example, Hasselt et al. [11] proposed the double DQN, which can reduce the observed overestimation by exploiting the idea of double Q-learning. Wang et al. [12] introduced a dueling network architecture, which separately estimates state values and advantage values to improve the policy evaluation. Hausknecht and Stone [13] presented the deep recurrent Q-network, which is more suitable for solving partial observation problems, by adding recurrent LSTM layers to convolutional networks. Kim et al. [14] combined the mellowmax method with DQN to calculate the target action values, preventing overestimation effectively. Anschel et al. [15] proposed the averaged DQN, which uses some previously learned action-value estimates to produce the current action value. This algorithm can reduce the approximation error variance in the target values. The second category mainly focuses on how to explore or exploit samples efficiently. Schaul et al. [16] presented prioritized experience replay, which can make effective use of historical samples to improve the DQN's convergence performance. Fortunato et al. [17] proposed the noisynet DQN, which adds noise to the deep network parameters to aid efficient exploration. Lee et al. [18] introduced an episodic backward update to improve the sample efficiency. The third category mainly focuses on how to reduce memory and computation. Mnih et al. [19] proposed asynchronous variants of four standard reinforcement learning algorithms, such as the asynchronous one-step Q-learning algorithm and the asynchronous n-step Q-learning algorithm. Interestingly, this work also opened the door to research on the asynchronous advantage actor-critic (A3C) algorithm.
In traditional RL, Q-learning algorithms often use linear functions to approximate action values, which have better stability and fewer hyperparameters to be trained than DQNs [20]. In particular, the least squares (LS) type RL algorithms, such as the least squares policy iteration (LSPI) algorithm [21], the fitted-Q iteration (FQI) algorithm [22], and the recursive least squares temporal difference with forgetting factor (RLS-TD-f) algorithm [23], not only have better stability but also have faster convergence. In the research community of adaptive filtering, the LS and the recursive least squares (RLS) algorithms are famous for their fast convergence rate. Obviously, the success of LS-type RL algorithms mainly benefits from this merit. In recent years, many new machine learning algorithms, such as the extreme learning machine (ELM) [24] and the broad learning system [25,26], have been proposed by combining LS or RLS algorithms. In practice, the last layer of the neural network used for DQN is usually a linear layer, which means that we can probably improve the DQN's performance by integrating DQN with the LS-type RL algorithms. In fact, Levine et al. [20] proposed a hybrid approach, the least squares deep Q-network (LS-DQN), which combines DQN with LSPI or FQI. By retraining the last layer of the policy network with a batch least squares update periodically, LS-DQN can obtain better convergence performance than DQN. However, LS-DQN is not easy to use. At each update by using LSPI or FQI, LS-DQN needs to use the current network parameters to generate a training dataset, which requires running a forward pass of the deep network for each sample in the experience replay buffer. In addition, LS-DQN needs to generate new state-action features and compute the matrix inverse.
Judging from the DQN's learning mechanism, an ideal LS-type algorithm for integration should be able to use the inputs of the DQN's last layer to approximate action values and should share the DQN's learning mode.
In our previous work [27], we proposed two policy control algorithms called ESNRLS-Q and ESNRLS-Sarsa. They seem to meet the above requirements to some extent, although they are also difficult to integrate with DQNs. They use the same experience replay and minibatch learning mode as DQN. In addition, they can avoid computing the matrix inverse and are more suitable for online learning by using recursive least squares (RLS). Based on this work and inspired by the work of Levine et al., we propose a novel minibatch RLS Q-learning algorithm with linear function approximation, called MRLS-Q. Our main contributions are as follows. (1) By borrowing the experience replay to remove the temporal correlation between the observed transitions, we first combine the traditional Q-learning algorithm with the RLS optimization technique. (2) By using state features rather than state-action features for linear function approximation, we make MRLS-Q usable alone and also seamlessly integrable into DQN. (3) In order to reduce the computational complexity and make the RLS method suitable for training parameters in the minibatch mode, we present an average approximation method for updating the RLS autocorrelation matrix. (4) In order to alleviate the feature change of the same state and integrate MRLS-Q into DQN, we present a new method to define the feature function of MRLS-Q. (5) We demonstrate the effectiveness of MRLS-Q, alone and as the last layer of DQN, on the CartPole problem and four Atari games, respectively. We also test the influences of its hyperparameters experimentally. The remainder of this paper is organized as follows. Section 2 describes the related theories and algorithms behind MRLS-Q. Section 3 presents the detailed derivation and the practical implementation of MRLS-Q. Then, in Section 4, comparison experiments on the CartPole problem and four Atari games are conducted to separately verify the effectiveness of MRLS-Q used alone and as the last layer of DQN. Finally, Section 5 summarizes the whole paper.

Background
In this section, we briefly review the related theories and algorithms of our MRLS-Q, including the Markov decision process (MDP), DQN, and LS-DQN. In addition, we also describe some notations that will be used throughout this paper.

Markov Decision Process.
In RL, a sequential decision problem is generally formulated as an MDP with a five-tuple ⟨S, A, P, r, γ⟩, where S is the state space, A is the action space, P(s'_t|s_t, a_t) ∈ [0, 1] and r(s'_t|s_t, a_t) ∈ R are the state-transition probability and the immediate reward from the state s_t to the next state s'_t by taking the action a_t, and γ ∈ (0, 1] is the discount factor. At the state s_t, the agent's action a_t is determined by the control policy π. For a given MDP, the goal of RL is to learn the optimal policy π* that maximizes the cumulative expected return J(π), i.e.,

\[
\pi^{*} = \arg\max_{\pi} J(\pi), \tag{1}
\]

Computational Intelligence and Neuroscience
where J(π) is usually defined in the form of the discounted return [1] as

\[
J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_{0}\Big], \tag{2}
\]

where s_0 is the initial state. However, J(π) can hardly be calculated by using the above equation directly, since P(s'_t|s_t, a_t) is unknown in RL, and s'_t and r_t can only be obtained through the agent's interaction with the environment.
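As a small worked example of the discounted return above, it can be accumulated backward over a reward sequence. This is an illustrative sketch; `discounted_return` is a hypothetical helper, not from the paper:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return sum_t gamma^t * r_t for one finite trajectory."""
    g = 0.0
    for r in reversed(rewards):  # accumulate backward: g <- r + gamma * g
        g = r + gamma * g
    return g

# rewards [1, 1, 1] with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```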
To tackle this problem, RL usually resorts to estimating the action value Q^π(s_t, a_t) to measure the performance of π when the initial state and action are s_t and a_t. In this paper, we assume that S is continuous and A is discrete. For this kind of MDP problem, to overcome the curse of dimensionality, Q^π(s_t, a_t) is often approximated by linear function approximators or deep neural networks.

The DQN Algorithm.
DQN is probably the most important algorithm in deep RL. It combines the Q-learning algorithm with deep neural networks and uses the experience replay to break the correlation among samples when training network parameters.
The DQN algorithm can be summarized as follows. At the current step t, the agent first uses the ϵ-greedy policy to select the action a_t as

\[
a_t =
\begin{cases}
\arg\max_{a \in A} Q(s_t, a; \Theta_{t-1}), & \text{with probability } 1-\epsilon, \\
\text{a random action in } A, & \text{with probability } \epsilon,
\end{cases} \tag{3}
\]

where ϵ is the exploration factor, Θ_{t−1} is the policy network parameter, and Q(s_t, a; Θ_{t−1}) is approximated by this network. Then, after taking a_t, the agent moves to the next state s'_t, obtains the reward r_t, and stores (s_t, a_t, s'_t, r_t, d_t) into the experience replay buffer D, where d_t ∈ {0, 1} denotes whether s'_t is a terminal state. Next, by using the minibatch M_t = {(s_{t,i}, a_{t,i}, s'_{t,i}, r_{t,i}, d_{t,i})}_{i=1,…,M} sampled from D at random, the algorithm calculates the loss of the policy network as

\[
L(\Theta_{t-1}) = \frac{1}{M} \big\| y_t - Q(S_t, A_t; \Theta_{t-1}) \big\|^{2},
\qquad
y_t = r_t + \gamma\, (1 - d_t) \circ \max_{a \in A} Q(S'_t, a; \bar{\Theta}), \tag{4}
\]

where r_t = [r_{t,1}, …, r_{t,M}]ᵀ, d_t = [d_{t,1}, …, d_{t,M}]ᵀ, ∘ denotes the Hadamard product, and \bar{Θ} is the target network parameter, which is copied from the policy network every fixed number of steps or episodes. Finally, by using some gradient descent optimization method, the algorithm updates Θ_{t−1} to Θ_t. For example, by using the SGD method [28], Θ_{t−1} is updated as

\[
\Theta_t = \Theta_{t-1} - \alpha\, \nabla_{\Theta_{t-1}} L(\Theta_{t-1}), \tag{5}
\]

where α is the learning rate and ∇_{Θ_{t−1}} denotes ∂L(Θ_{t−1})/∂Θ_{t−1}.
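The action selection and target computation described above can be sketched in numpy as follows. This is a minimal illustrative sketch, not the paper's implementation; all names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, eps, rng):
    """q_values: (|A|,) action values Q(s_t, .; theta). Explore with prob. eps."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def td_targets(r, d, q_next, gamma=0.99):
    """y = r + gamma * (1 - d) * max_a Q(s', a; target params), element-wise."""
    return r + gamma * (1.0 - d) * q_next.max(axis=1)

# toy minibatch of M = 3 transitions with |A| = 2 actions
r = np.array([1.0, 0.0, 1.0])
d = np.array([0.0, 0.0, 1.0])                       # last transition is terminal
q_next = np.array([[0.5, 1.0], [2.0, 0.0], [3.0, 3.0]])
print(td_targets(r, d, q_next, gamma=0.5))          # values 1.5, 1.0, 1.0
```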

The LS-DQN Algorithm.
LS-DQN is a hybrid approach, which combines the traditional LSPI or FQI algorithm with the DQN algorithm. By enjoying the stability and efficiency of LSPI or FQI, it can obtain better performance than DQN. The LS-DQN algorithm can be briefly summarized as follows. Whenever the agent has run DQN for some steps, it uses LSPI or FQI to retrain the last layer of the policy network once. The retraining consists of the following three substeps. Firstly, by recalculating all samples in the experience replay buffer with the current network parameters, the policy network generates a new dataset D. Secondly, by using the current network parameters and the dataset D, the algorithm generates state-action features. Finally, the algorithm uses LSPI to retrain the current last-layer parameter Θ^L_t in the policy network as

\[
\Theta^{L}_{t} = A^{-1} b, \tag{6}
\]

where the solution is arranged so that Θ^L_{t(:,i)}, the i-th column vector of Θ^L_t, holds the weights of the i-th action. Besides, A and b are defined as follows:

\[
A = \sum_{j} \phi(s_j, a_j) \big[ \phi(s_j, a_j) - \gamma\, \phi\big(s'_j, \pi(s'_j)\big) \big]^{\mathsf{T}}, \tag{7}
\]
\[
b = \sum_{j} \phi(s_j, a_j)\, r_j, \tag{8}
\]

where ϕ(s_j, a_j) is the state-action feature of the state-action pair (s_j, a_j). As Levine et al. stated in their work [20], the algorithm can also retrain Θ^L_t by using FQI, since it is a batch shallow RL algorithm that computes iterative approximations of the Q-function using regression. For brevity, we will not discuss the FQI algorithm in this paper.
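The batch retraining idea can be sketched as follows. This is an FQI-style sketch (ridge regression onto precomputed targets), under stated assumptions: `ls_retrain`, `reg`, and the target construction are illustrative, and the LSPI variant would instead build A and b as in the equations above:

```python
import numpy as np

def ls_retrain(phi, y, reg=1e-6):
    """Batch least-squares fit of last-layer weights.

    phi : (n, N) features of the n replayed samples
    y   : (n,)   regression targets (e.g. TD targets)
    Solves (A + reg * I) w = b with A = sum phi phi^T and b = sum phi * y.
    """
    A = phi.T @ phi + reg * np.eye(phi.shape[1])
    b = phi.T @ y
    return np.linalg.solve(A, b)

# sanity check: recover known weights from noiseless synthetic data
rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0, 0.5])
phi = rng.normal(size=(100, 3))
w = ls_retrain(phi, phi @ w_true)
print(np.round(w, 3))   # recovers w_true up to a tiny regularization bias
```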

The Proposed Algorithm
In this section, we introduce the detailed derivation and the practical implementation of our proposed algorithm. Our algorithm, the MRLS-Q algorithm, can be used not only alone but also as the last layer of DQN.

Algorithm Derivation.
MRLS-Q is a new Q-learning algorithm with linear function approximation, but it is more similar to the DQN algorithm than to the traditional Q-learning algorithm. It uses the experience replay and the minibatch training mode, separates the linear function approximator into a policy approximator and a target approximator, and uses state features rather than state-action features. Besides, it uses an average RLS method for updating parameters.
First, we introduce the agent's interaction with the environment. At the current step t, the agent also uses the ϵ-greedy policy to select the action a_t as in equation (3). Note here that Q(s_t, a; Θ_{t−1}) is approximated by the policy approximator as

\[
Q(s_t, a; \Theta_{t-1}) = \phi(s_t)^{\mathsf{T}}\, \Theta_{t-1(:,\mathrm{in}(a))}, \tag{9}
\]

where ϕ(s_t) ∈ R^N is the feature vector of s_t, Θ_{t−1} ∈ R^{N×|A|} is the policy approximator parameter, in(a) denotes the index of a in A, and Θ_{t−1(:,in(a))} is the in(a)-th column vector of Θ_{t−1}. Then, the agent takes a_t, moves to s'_t, obtains r_t, and stores (s_t, a_t, s'_t, r_t, d_t) into the experience replay buffer D.
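The linear policy approximator above amounts to one matrix-vector product per state, yielding all action values at once. A minimal sketch, with illustrative names and toy numbers:

```python
import numpy as np

def q_values(phi_s, theta):
    """All action values Q(s, a; theta) = phi(s)^T theta_(:, in(a)) at once.

    phi_s : (N,)      state feature vector phi(s)
    theta : (N, |A|)  parameter matrix, one column per action
    """
    return phi_s @ theta

theta = np.array([[1.0, 0.0],
                  [0.0, 2.0]])        # N = 2 features, |A| = 2 actions
phi_s = np.array([0.5, 1.0])
q = q_values(phi_s, theta)
print(q, int(np.argmax(q)))           # values [0.5, 2.0]; greedy action index 1
```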
Second, we introduce the RLS update of the policy approximator parameter in the minibatch training mode. Let M_n = {(s_{n,i}, a_{n,i}, s'_{n,i}, r_{n,i}, d_{n,i})}_{i=1,…,M} denote the minibatch sampled from D at the n-th step, and let Φ(S_n) = [ϕ(s_{n,1}), …, ϕ(s_{n,M})]ᵀ denote the feature matrix of S_n. Define the least squares loss function as

\[
L(\Theta) = \sum_{n=1}^{t} \lambda^{t-n} \big\| Q^{\pi}(S_n, A) - Q(S_n, A; \Theta) \big\|^{2}, \tag{10}
\]

where λ ∈ (0, 1] is the forgetting factor. Here Q(S_n, a_n; Θ) is approximated by the policy approximator as

\[
Q(S_n, a_n; \Theta) = \Phi(S_n)\, \Theta_{(:,\mathrm{in}(a_n))}, \tag{11}
\]

and Q^π(S_n, a_n) is estimated by the target approximator as

\[
Q^{\pi}(S_n, a_n) = r_n + \gamma\, (1 - d_n) \circ \max_{a \in A} \Phi(S'_n)\, \bar{\Theta}_{(:,\mathrm{in}(a))}, \tag{12}
\]

where \bar{Θ} is the target approximator parameter, which is copied from the policy approximator every fixed number of steps or episodes. Then, the parameter learning problem of the policy approximator can be transformed into

\[
\Theta_t = \arg\min_{\Theta} L(\Theta). \tag{13}
\]

By using the chain rule, we can get

\[
\nabla_{\Theta} = -2 \sum_{n=1}^{t} \lambda^{t-n}\, \Phi(S_n)^{\mathsf{T}} \big[ Q^{\pi}(S_n, A) - Q(S_n, A; \Theta) \big], \tag{14}
\]

where ∇_Θ denotes ∂L(Θ)/∂Θ, an element in Q(S_n, A; Θ) ∈ R^{M×|A|} is defined as

\[
Q(s_{n,i}, a; \Theta) =
\begin{cases}
Q(s_{n,i}, a_{n,i}; \Theta), & a = a_{n,i}, \\
0, & a \neq a_{n,i},
\end{cases} \tag{15}
\]

and an element in Q^π(S_n, A) ∈ R^{M×|A|} is defined as

\[
Q^{\pi}(s_{n,i}, a) =
\begin{cases}
Q^{\pi}(s_{n,i}, a_{n,i}), & a = a_{n,i}, \\
0, & a \neq a_{n,i}.
\end{cases} \tag{16}
\]

Let ∇_Θ = 0. Then, we can get

\[
\Theta_t = A_t^{-1} b_t, \tag{17}
\]

where

\[
A_t = \sum_{n=1}^{t} \lambda^{t-n}\, \Phi(S_n)^{\mathsf{T}} \Phi(S_n), \qquad
b_t = \sum_{n=1}^{t} \lambda^{t-n}\, \Phi(S_n)^{\mathsf{T}} Q^{\pi}(S_n, A). \tag{18}
\]

Rewrite the above two equations as the following recursive forms:

\[
A_t = \lambda A_{t-1} + \Phi(S_t)^{\mathsf{T}} \Phi(S_t), \tag{19}
\]
\[
b_t = \lambda b_{t-1} + \Phi(S_t)^{\mathsf{T}} Q^{\pi}(S_t, A). \tag{20}
\]

Further, rewrite the above two equations as the following vector forms:

\[
A_t = \lambda A_{t-1} + \sum_{i=1}^{M} \phi(s_{t,i})\, \phi(s_{t,i})^{\mathsf{T}}, \tag{21}
\]
\[
b_t = \lambda b_{t-1} + \sum_{i=1}^{M} \phi(s_{t,i})\, Q^{\pi}(s_{t,i}, A), \tag{22}
\]

where Q^π(s_{t,i}, A) is the i-th row vector of Q^π(S_t, A). Unfortunately, we cannot directly use the Sherman-Morrison formula [29] to compute A_t^{-1}, since the last term on the right-hand side of equation (21) is a sum of vector products.

Next, we present an average approximation method to deal with the above problem. Considering that all training samples are from the same environment and thus their features have some similarity, we rewrite equations (21) and (22) as follows:

\[
A_t \approx \lambda A_{t-1} + k\, \bar{\phi}_t \bar{\phi}_t^{\mathsf{T}}, \tag{23}
\]
\[
b_t \approx \lambda b_{t-1} + k\, \bar{\phi}_t \bar{q}^{\pi}_t, \tag{24}
\]

where k is the approximation factor, and \bar{ϕ}_t and \bar{q}^π_t are defined as

\[
\bar{\phi}_t = \frac{1}{M} \sum_{i=1}^{M} \phi(s_{t,i}), \qquad
\bar{q}^{\pi}_t = \frac{1}{M} \sum_{i=1}^{M} Q^{\pi}(s_{t,i}, A). \tag{25}
\]

Let P_t = A_t^{-1}. By using the Sherman-Morrison formula for (23), we can get

\[
P_t = \frac{1}{\lambda} \bigg( P_{t-1} - \frac{k\, v_t v_t^{\mathsf{T}}}{\lambda + k\, v_t^{\mathsf{T}} \bar{\phi}_t} \bigg), \tag{26}
\]

where v_t = P_{t−1} \bar{ϕ}_t. It follows that

\[
P_t \bar{\phi}_t = \frac{v_t}{\lambda + k\, v_t^{\mathsf{T}} \bar{\phi}_t}. \tag{27}
\]

Plugging equations (24) and (27) into (17), we finally get

\[
\Theta_t = P_t \big( \lambda\, b_{t-1} + k\, \bar{\phi}_t \bar{q}^{\pi}_t \big), \tag{28}
\]

which, by using b_{t−1} = A_{t−1} Θ_{t−1} and defining

\[
\bar{q}_t = \frac{1}{M} \sum_{i=1}^{M} Q(s_{t,i}, A; \Theta_{t-1}), \tag{29}
\]

where Q(s_{t,i}, A; Θ_{t−1}) denotes the i-th row vector of Q(S_t, A; Θ_{t−1}), simplifies under the same average approximation into

\[
\Theta_t = \Theta_{t-1} + \frac{k\, v_t \big( \bar{q}^{\pi}_t - \bar{q}_t \big)}{\lambda + k\, v_t^{\mathsf{T}} \bar{\phi}_t}. \tag{30}
\]
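The rank-one inverse update behind the Sherman-Morrison step can be checked numerically. The following sketch (names and the random test matrix are illustrative) compares the recursive form of P_t against a direct matrix inverse:

```python
import numpy as np

# Check: with A_t = lam * A_prev + k * phi phi^T and v = P_prev phi,
# P_t = (1 / lam) * (P_prev - k * v v^T / (lam + k * v^T phi)).
rng = np.random.default_rng(2)
N, lam, k = 4, 0.99, 0.5
B = rng.normal(size=(N, N))
A_prev = B @ B.T + N * np.eye(N)     # a well-conditioned SPD matrix
P_prev = np.linalg.inv(A_prev)
phi = rng.normal(size=N)

v = P_prev @ phi
P_new = (P_prev - k * np.outer(v, v) / (lam + k * (v @ phi))) / lam
direct = np.linalg.inv(lam * A_prev + k * np.outer(phi, phi))
print(np.allclose(P_new, direct))    # True
```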

Practical Implementation.
As reviewed in Section 2.2, DQN generally uses gradient descent methods to update network parameters. To make MRLS-Q easier to integrate into DQN, we next rewrite equation (30) in the "gradient descent" form with respect to ∇_{Θ_{t−1}}.
If the loss function of MRLS-Q is defined by equation (4), by using the chain rule for equation (4), we can get

\[
\nabla_{\Theta_{t-1}} = -\frac{2}{M}\, \Phi(S_t)^{\mathsf{T}} \big[ Q^{\pi}(S_t, A) - Q(S_t, A; \Theta_{t-1}) \big]. \tag{31}
\]

Recall the fact that we used k\bar{ϕ}_t\bar{ϕ}_tᵀ and k\bar{ϕ}_t\bar{q}^π_t in equations (23) and (24) to approximate the minibatch terms of equations (19) and (20), respectively, which means

\[
\Phi(S_t)^{\mathsf{T}} \Phi(S_t) \approx k\, \bar{\phi}_t \bar{\phi}_t^{\mathsf{T}}, \qquad
\Phi(S_t)^{\mathsf{T}} Q^{\pi}(S_t, A) \approx k\, \bar{\phi}_t \bar{q}^{\pi}_t.
\]

By the same average approximation, from equations (9), (11), and (25), we also have

\[
\Phi(S_t)^{\mathsf{T}} Q(S_t, A; \Theta_{t-1}) \approx k\, \bar{\phi}_t \bar{q}_t.
\]

Plugging the above approximations into equation (31) and absorbing the constant factor 2/M into k, we obtain

\[
\nabla_{\Theta_{t-1}} \approx k\, \bar{\phi}_t \big( \bar{q}_t - \bar{q}^{\pi}_t \big).
\]

Since v_t = P_{t−1} \bar{ϕ}_t, we can therefore rewrite equation (30) as

\[
\Theta_t = \Theta_{t-1} - \frac{P_{t-1}}{\lambda + k\, v_t^{\mathsf{T}} \bar{\phi}_t}\, \nabla_{\Theta_{t-1}}. \tag{42}
\]

It shows that P_{t−1}/(λ + k v_tᵀ \bar{ϕ}_t) plays the role of the learning rate of Θ_t in MRLS-Q.
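One averaged-RLS parameter update of this kind could be sketched in code as follows. This is a hypothetical sketch, not the paper's implementation: `mrls_q_update`, the masking convention for the target and prediction matrices, and the toy numbers are all illustrative assumptions:

```python
import numpy as np

def mrls_q_update(theta, P, Phi, q_pi, q_pred, lam=1.0, k=0.5):
    """One averaged-RLS step: average feature, average TD-error row,
    rank-one update of P, and the RLS parameter update.

    theta  : (N, |A|)  policy approximator parameter
    P      : (N, N)    inverse autocorrelation matrix
    Phi    : (M, N)    minibatch feature matrix
    q_pi   : (M, |A|)  masked target values
    q_pred : (M, |A|)  masked predicted values
    """
    phi_bar = Phi.mean(axis=0)                 # average feature vector
    err_bar = (q_pi - q_pred).mean(axis=0)     # average TD-error row
    v = P @ phi_bar
    denom = lam + k * (v @ phi_bar)
    P_new = (P - k * np.outer(v, v) / denom) / lam
    theta_new = theta + k * np.outer(v, err_bar) / denom
    return theta_new, P_new

# toy check: N = 2 features, |A| = 1 action, M = 2 samples
theta, P = np.zeros((2, 1)), np.eye(2)
Phi = np.array([[1.0, 0.0], [1.0, 0.0]])
q_pi, q_pred = np.ones((2, 1)), np.zeros((2, 1))
theta_new, P_new = mrls_q_update(theta, P, Phi, q_pi, q_pred, lam=1.0, k=1.0)
print(theta_new[0, 0])   # 0.5: the parameter moves partway toward the target
```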
However, although RLS has a fast convergence rate, it often suffers from overfitting. In recent years, there has been extensive research on this problem. Based on Ekşioglu's work [30], we add an L1 regularization term into the above equation, i.e.,

\[
\Theta_t = \Theta_{t-1} - \frac{P_{t-1}}{\lambda + k\, v_t^{\mathsf{T}} \bar{\phi}_t} \big( \nabla_{\Theta_{t-1}} + \eta\, \mathrm{sgn}(\Theta_{t-1}) \big), \tag{43}
\]

where η is the regularization factor and sgn(·) is the sign function. Based on the above derivation, the pseudocode of MRLS-Q is summarized in Algorithm 1, and the flow diagram of MRLS-Q is summarized in Figure 1. In the practical implementation, ∇_{Θ_{t−1}} can be calculated directly by the automatic differentiation package of PyTorch or TensorFlow. Besides being used alone, MRLS-Q can also be used as the last layer of DQN, since it uses the same loss function and experience replay as DQN. However, there is still an obstacle to the combination of MRLS-Q and DQN. As the training goes on, the parameters of the DQN network are continuously changing, and so are the outputs for the same inputs. Thus, we cannot use the inputs of the DQN's last layer as the features of MRLS-Q directly. In order to alleviate this kind of change and integrate MRLS-Q into DQN, we present a new method to define the feature function of MRLS-Q by normalizing the penultimate-layer outputs row by row:

\[
\phi(s_{t,i}) = \frac{X^{L}_{t(i,:)}}{\big\| X^{L}_{t(i,:)} \big\| + \nu}, \quad i = 1, \ldots, M, \tag{44}
\]

where X^L_t ∈ R^{M×N_{L−1}} is the output matrix of the DQN's penultimate layer and ν is a small hyperparameter to prevent the denominator from becoming zero.

Experiments
In this section, we use two sets of experiments to demonstrate the effectiveness of MRLS-Q. In Section 4.1, we test MRLS-Q on the CartPole problem as an independent algorithm. In Section 4.2, we test MRLS-Q on four Atari games as the last layer of DQN.

The CartPole Problem.
In this set of experiments, we first verify the performance of MRLS-Q on the CartPole-v0 problem from the OpenAI Gym. For comparison purposes, we build a new algorithm called Adam-Q by replacing P_{t−1}/(λ + k v_tᵀ \bar{ϕ}_t) in equation (42) with the Adam optimizer, since the traditional Q-learning algorithm with linear function approximation hardly converges within 100 episodes. Then, we investigate the influences of hyperparameters on MRLS-Q experimentally.
To compare the performance between MRLS-Q and Adam-Q, the experimental settings are summarized as follows. (1) Both algorithms use 400 radial basis functions (RBFs) for action-value approximation.
To investigate hyperparameter influences on MRLS-Q, we test |D| ∈ {500, 1000, 5000, 10000}. From Figure 2(b), it can be seen that the capacity of D has a significant influence on the performance of MRLS-Q: a larger capacity results in better performance, since a big D is helpful to remove the correlation between the observed transitions. From Figure 2(c), it can be seen that MRLS-Q is robust to the initialization P_0, although too big a P_0 will make MRLS-Q unstable and too small a P_0 will make it converge slowly. From Figure 2(d), it can be seen that k also has a significant influence on MRLS-Q. From the recursive update of P_t, a bigger k makes P_t update with higher strength. If state feature values change greatly, k should be set to a big value.

Four Atari Games.
In this set of experiments, we verify MRLS-Q as the last layer of DQN on four Atari games from the OpenAI Gym: Pong-v0, Breakout-v0, SpaceInvaders-v0, and RiverRaid-v0. Here, we choose the traditional DQN algorithm with the Adam optimizer for comparison. For Adam-DQN and the second to fifth layers of Hybrid-DQN, the learning rate, β1, and β2 of Adam are 0.0000625, 0.9, and 0.999; in the last layer of Hybrid-DQN, the initialization P_0, the forgetting factor λ, the regularization factor η, the approximation factor k, and ν are 0.1I, 1, 10^{-8}, 1/2, and 10^{-12}, respectively. Note that here we use a big k to update P_t for adapting to the feature change.
The average evaluation results are presented in Figure 3. It shows that Hybrid-DQN can speed up the convergence on all tested games. Figure 3(a) demonstrates this advantage most clearly, since the Pong game is much simpler than the other three games. In addition, Figures 3(a), 3(c), and 3(d) show that Hybrid-DQN can improve the convergence quality of Pong, SpaceInvaders, and RiverRaid, and Figure 3(b) shows that Hybrid-DQN can improve the learning stability of Breakout. In summary, by integrating our MRLS-Q, Hybrid-DQN can improve both stability and performance. Compared with the LS-DQN algorithm, MRLS-Q can be used as the last layer of DQN directly, and thus Hybrid-DQN is easier to use.

Conclusion
How to improve the convergence and stability of the DQN algorithm is one of the key issues in deep RL. In this paper, we propose MRLS-Q, a linear RLS function approximation algorithm with a learning mechanism similar to that of DQN. MRLS-Q can be used not only alone but also as the last layer of DQN. Similar to LS-DQN, the Hybrid-DQN with MRLS-Q can enjoy the rich representations of deep RL networks as well as the stability and data efficiency of the RLS method, but it integrates MRLS-Q seamlessly and is thus easier to use. In MRLS-Q, we use the experience replay to break the correlation between training samples, present an average RLS optimization method to improve the convergence performance and reduce the computational complexity, employ an L1 regularization technique to prevent overfitting, and propose a new method to define the feature function for alleviating the feature change of the same state and integrating MRLS-Q into DQN. Experimental results on the CartPole problem demonstrate that MRLS-Q has better convergence than Adam-Q and reveal the hyperparameter influences on MRLS-Q. In addition, experimental results on four Atari games demonstrate that DQN can improve its convergence and stability by integrating with MRLS-Q.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no known conflicts of interest or personal relationships that could have appeared to influence the work reported in this paper.