An Empirical Investigation of Transfer Effects for Reinforcement Learning

Previous studies have shown that training a reinforcement learning model for the sorting problem takes a very long time, even for small sets of data. To study whether transfer learning can improve the training process of reinforcement learning, we employ Q-learning as the base reinforcement learning algorithm, use the sorting problem as a case study, and assess performance from two aspects: time expense and brain capacity. We compare the total number of training steps between the nontransfer and transfer methods to study their efficiency, and we evaluate their differences in brain capacity (i.e., the percentage of updated Q-values in the Q-table). According to our experimental results, the difference in the total number of training steps becomes smaller as the size of the list to be sorted increases. Our results also show that the brain capacities of transfer and nontransfer reinforcement learning are similar once both reach a similar training level.


Introduction
Reinforcement learning (RL) aims at learning policies that map states to actions so as to maximize the expected accumulated reward and reach the goal. Compared with supervised learning approaches, where models are trained on given input and output sets, the RL agent has to interact with the environment and learn from these experiences through trial and error to yield optimal behaviour.
Mathematically, RL can be formulated as a Markov decision process (MDP), a framework for modelling decision-making problems [1]. An MDP is represented by the tuple <S, A, T, R>, where S denotes the state space of the environment and A is the set of actions available in a given state.
Function T is defined as P(s′ | s_t, a_t), the probability of transitioning to the next state s′ ∈ S at time step t + 1 given the current state s ∈ S and the action a ∈ A taken at time step t. Function R is a reward scheme that assigns a score to the action performed in state s and serves as guidance for the agent to produce suitable behaviours. The objective of the RL agent is then to learn a policy π_θ(a|s), parameterized by θ, which tells the agent the best action a ∈ A to perform while the environment is in state s ∈ S. In general, there are two main approaches to solving RL problems: model-based and model-free learning. Model-based approaches learn a model of the environment and obtain the optimal policy from past transitions. Model-free approaches, on the other hand, learn the optimal policy directly through trial-and-error interactions without modelling the underlying environment. Model-based approaches are often sample-efficient, but the requirement of specifying a model of real-world tasks is often restrictive and difficult to satisfy. Therefore, the model-free approach is commonly preferred over the model-based approach when trajectories are not hard to sample [2,3]. Q-learning [4] and SARSA [5] are two well-known model-free RL algorithms that fit the optimal policy by learning the action-value (Q-value) function, which expresses the expected reward for each state-action pair (s, a). In recent years, as deep learning methods have gained significant attention and achieved innovations in many fields, it has become common to combine deep learning with RL algorithms to boost performance. Deep Q-networks (DQN) [7,8], a combination of the convolutional neural network (CNN) [6] and Q-learning, were proposed to handle large state-action spaces.
DQNs have been shown to reach, or even surpass, human-level performance on many games. An alternative double-estimator method, double Q-learning [9], was introduced to reduce the overestimation of action values in the Q-learning algorithm. As double Q-learning was proposed in a tabular setting and the DQN algorithm suffers from overestimation, double DQN combines double Q-learning and DQN for large-scale function approximation and to reduce overestimation [10-12].
RL algorithms usually require a large amount of trial and error and many learning iterations to determine an effective policy over a very large state-action space, making them very time-consuming. Recently, there has been strong interest in developing deep learning models with the ability to transfer experience across similar tasks. The two representative types of methods are the transfer of trained models and the transfer of learned knowledge [13]. The first transfers neural network layers from a pretrained model to the target model [14,15], whereas the second aims at transferring learned knowledge from a trained network to the target network [16,17]. A Q-learning-based approach has been applied to the sorting problem [18]. However, it takes a large number of training steps to finish the training process, even for small sets of data. Since transfer learning has been widely adopted to speed up training, this motivates us to devise a transfer scheme and compare its training performance with the nontransfer method. In this paper, we conduct a series of experiments using the sorting problem as a case study. We transfer the knowledge learned from task n to task n + 1, where n is the size of the list to be sorted, and continue training the model with a Q-learning-based method. The total number of training steps and the size of the brain capacity, which denotes the knowledge in the Q-table, are the two metrics we use to measure the impact of the transfer learning technique. The rest of this paper is organized as follows. Section 2 reviews the background and related work. Section 3 describes our training strategies and detailed methodology. The experimental setup and results are presented and discussed in Section 4. In Section 5, we discuss conclusions and future work.

Background and Related Work
In this section, we first give an overview of Q-learning, the base RL algorithm in this paper. The application of RL to the sorting problem is discussed as well.
Q-learning, a model-free method, is one of the best-known RL algorithms and was initially designed for Markov decision processes. It updates the Q-value with the following rule:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)],   (1)

where Q(s_t, a_t) is the action-value function computing the expected reward of a state-action pair at time step t, α is the learning rate, γ is the discount factor, and r_t is the reward obtained after selecting action a_t in state s_t. The max operator in the update rule indicates that the agent bootstraps from the best action a, i.e., the maximum Q-value, for the next state s_{t+1}. The mechanism of exploiting the maximum Q-value during the update makes this an off-policy algorithm: the choice of the taken action a_t and the bootstrap action a does not follow the same policy. In contrast, SARSA updates the Q-value based on the policy actually being followed:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)].   (2)

When the algorithm uses the same mechanism for the behaviour policy (i.e., the choice of a_t) and the estimation policy (i.e., the choice of a_{t+1} in Q(s_{t+1}, a_{t+1})), it is called on-policy [19]. The sorting problem is a quintessential computer science task and has been applied in many fields since its emergence. By the standard lower-bound analysis, any comparison-based sorting algorithm requires O(n log n) comparisons. An RL-based approach, which applies stability and resiliency ideas from feedback control, was proposed to overcome the errors and early-program-termination limitations of traditional computing [20]. An empirical exploration compared the RL model with two traditional sorting algorithms and showed that the RL sorting model completes the task with fewer array manipulations. To investigate the effect of two different reward schemes, immediate reward and pure delayed reward, a Q-learning algorithm was implemented to compare the total number of training steps and the average number of sorting steps [18]. A case study of the sorting problem concluded that immediate reward takes far fewer steps to finish the task.
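The two tabular update rules can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the dictionary-backed Q-table, the state labels, and the action set are our own illustrative choices; α and γ are taken from the experimental parameters reported later in the paper.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.05, 0.9  # learning rate and discount factor used in the paper

def q_learning_update(Q, s, a, r, s_next, actions):
    """Off-policy Q-learning: bootstrap from the greedy action in s_next."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy SARSA: bootstrap from the action actually taken in s_next."""
    Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)])

Q = defaultdict(float)        # unvisited state-action pairs default to 0
actions = [0, 1]
q_learning_update(Q, "s0", 0, 1.0, "s1", actions)
# with all Q-values initially 0, the update moves Q[("s0", 0)] to α·r = 0.05
```

The only difference between the two rules is the bootstrap term: Q-learning uses the maximum over actions in s_{t+1}, while SARSA uses the Q-value of the action the behaviour policy actually selected.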

Methodology and Learning Design
In this section, we describe the important features of our proposed methodology, including the training level and brain capacity. We also discuss how we designed the RL algorithm to formulate the sorting problem in RL terms.

RL-Based Setting for Sorting Problem.
We model the initial state s_0 as the list of numbers to be sorted, which consists of n elements; hence, there are n factorial possible states, denoted by S = {s_1, s_2, ..., s_{n!}}. For example, with n = 5, the initial state could be s_0 = [4, 5, 3, 2, 1] and the goal state S_goal = [1, 2, 3, 4, 5]. As the previous study [18] suggested that immediate reward performs better than pure delayed reward, we use the immediate reward scheme in this research. We give the reward by considering whether the action actually improves the number of elements in the correct position. A similarity value is introduced to measure the similarity between the current state s_t and the goal state S_goal (i.e., the sorted list) as follows:

sim(s_t, S_goal) = Σ_{i=1}^{n} Equal(s_t(i), S_goal(i)),   (3)

where the Equal function returns one if the two states have the same value at position i and zero otherwise. We then compute the difference between sim(s_{t+1}, S_goal) and sim(s_t, S_goal) to assign the reward as follows:

r_t = reward_better if sim(s_{t+1}, S_goal) > sim(s_t, S_goal); reward_equal if sim(s_{t+1}, S_goal) = sim(s_t, S_goal); reward_worse otherwise.   (4)

In this paper, reward_better is 1, reward_equal is 0, and reward_worse is −1. For the aforementioned example, since s_0 = [4, 5, 3, 2, 1] receives a similarity value of 1 and s_1 = [2, 5, 3, 4, 1] receives a value of 2, the reward reward_better is given.
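The similarity measure and the immediate reward scheme above can be sketched as follows. The function names `sim` and `reward` are ours, chosen to mirror the similarity and reward definitions; the reward constants are the values stated in the text.

```python
def sim(state, goal):
    """Count the positions where the state already matches the sorted goal."""
    return sum(1 for x, g in zip(state, goal) if x == g)

def reward(s_t, s_next, goal, better=1, equal=0, worse=-1):
    """Immediate reward: did the action move more elements into place?"""
    diff = sim(s_next, goal) - sim(s_t, goal)
    return better if diff > 0 else (worse if diff < 0 else equal)

goal = [1, 2, 3, 4, 5]
s0 = [4, 5, 3, 2, 1]   # only the 3 is in its correct position → sim = 1
s1 = [2, 5, 3, 4, 1]   # now 3 and 4 are in place → sim = 2
# reward(s0, s1, goal) → 1 (reward_better), matching the worked example
```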

Learning Algorithm.
The objective of the learning algorithm is to sort a given example of n numbers over a series of episodes until the success rate reaches a predefined threshold. Algorithm 1 (RL_Sort) shows how we train the model on one training instance based on the Q-learning algorithm. The algorithm takes a list S_training and a Q-table as input and produces an updated Q-table and the number of training steps as output. RL_Sort begins with the initialization of upper_bound, train_steps, and success_rate. The variable upper_bound defines the maximum allowed number of swaps for sorting, and we set the threshold to n + 1. The variable train_steps stores the number of episodes spent on training. The variable success_rate is the criterion for terminating the training process and is set to 0.75 in our experiments. S_goal is the correct sorting result. The experimental parameters are α = 0.05, γ = 0.9, and ε = 0.85. In each episode (lines 11 to 31), the model chooses an action a(i, j) for the current state s based on ε-greedy [21] and observes a new state s′ (lines 12∼13).
There are two conditions under which the episode will end. In the first, s′ is S_goal, and a positive reward (reward_win = 1) is given (lines 16∼18). In the second, the number of swaps exceeds upper_bound, and a negative reward (reward_lose = −1) is received. Since the first condition corresponds to a success, we examine the success rate over the latest 100 episodes to determine whether the training process should stop or a new episode should begin. When the current episode must continue (lines 23∼28), the Q-table is updated based on the reward in equation (4).
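A self-contained sketch of RL_Sort in this spirit is given below. It is an approximation under stated assumptions, not the paper's code: we assume the action a(i, j) swaps the elements at positions i and j, we adopt the convention that ε = 0.85 is the exploitation probability, and we add a `max_episodes` safety cap that the paper does not have.

```python
import itertools
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.05, 0.9, 0.85  # parameters reported in the paper

def rl_sort(s_training, Q, n, max_episodes=5000):
    """Train on one instance until the success rate over the latest
    100 episodes reaches 0.75, capping each episode at n + 1 swaps."""
    goal = tuple(range(1, n + 1))
    actions = list(itertools.combinations(range(n), 2))  # swap positions (i, j)
    upper_bound = n + 1
    history, train_steps = [], 0

    def sim(state):
        return sum(1 for x, g in zip(state, goal) if x == g)

    while True:
        s, swap_times = tuple(s_training), 0
        while True:
            if random.random() < EPSILON:          # exploit (assumed convention)
                a = max(actions, key=lambda act: Q[(s, act)])
            else:                                  # explore
                a = random.choice(actions)
            i, j = a
            nxt = list(s)
            nxt[i], nxt[j] = nxt[j], nxt[i]        # perform the swap
            s_next = tuple(nxt)
            swap_times += 1
            if s_next == goal:
                r, done = 1, True                  # reward_win
            elif swap_times > upper_bound:
                r, done = -1, True                 # reward_lose
            else:
                d = sim(s_next) - sim(s)           # immediate reward, eq. (4)
                r, done = (1 if d > 0 else -1 if d < 0 else 0), False
            best_next = max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            s = s_next
            if done:
                break
        train_steps += 1
        history.append(1 if s == goal else 0)
        if len(history) >= 100 and sum(history[-100:]) / 100 >= 0.75:
            return Q, train_steps
        if train_steps >= max_episodes:            # safety cap, not in the paper
            return Q, train_steps
```

A call such as `rl_sort([3, 1, 2], defaultdict(float), 3)` returns the updated Q-table and the number of episodes spent, mirroring the outputs of Algorithm 1.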
When the training task moves from sorting n numbers to sorting n + 1 numbers, the values in the Q-table are usually set to zero or randomly initialized. In our transfer setting, the knowledge learned from sorting n numbers is instead migrated to the problem of sorting n + 1 numbers: each Q-value whose state-action pair has a counterpart in the new task is copied over. The remaining, nontransferable Q-values are set to zero or randomly initialized. Figure 1 demonstrates how we transfer a Q-table from n = 3 to n = 4.
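One plausible way to implement such a migration is sketched below. The exact state mapping of Figure 1 is not reproduced here, so the particular rule used — extending each n-element state by appending the new largest element n + 1 in its final (already correct) position and keeping the same swap action — is our assumption for illustration.

```python
from collections import defaultdict

def transfer_q_table(Q_n, n):
    """Hypothetical transfer sketch: map each learned (state, action) entry
    for sorting n numbers to the (n+1)-number state obtained by appending
    n + 1 at the end; all other entries default to zero (nontransferable)."""
    Q_next = defaultdict(float)
    for (state, action), value in Q_n.items():
        Q_next[(state + (n + 1,), action)] = value
    return Q_next

Q3 = defaultdict(float)
Q3[((2, 1, 3), (0, 1))] = 0.7          # swapping positions 0 and 1 was valuable
Q4 = transfer_q_table(Q3, 3)           # reused when sorting 4 numbers
```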

Performance Metrics.
In this paper, we define three performance metrics which include training level, number of training steps, and brain capacity.
Training level is a performance-oriented indicator that measures how well the model can use its existing knowledge to perform the task during training. After finishing the training procedure of one instance for sorting n numbers, the model is scheduled to sort n! tasks, where each task is a permutation of those n numbers. We then compute the average number of sorting steps over these n! tasks as the model's training level. Number of training steps, denoted train_steps in Algorithm 1, is the number of episodes the model spends training on an example. It is an important factor for measuring the effectiveness of the algorithm. Brain capacity concerns the status of the Q-table and is an important measure for comparing the knowledge usage of the nontransfer and transfer methods. It is defined as the ratio of entries that have been updated in the Q-table.
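Brain capacity can be computed as in the sketch below. Two details are our assumptions: we treat "updated" as "nonzero" (which holds when the table is zero-initialized), and we size the full table as n! states times C(n, 2) swap actions.

```python
import math

def brain_capacity(Q, n_states, n_actions):
    """Fraction of Q-table entries that have been updated (here: are
    nonzero) out of the full |S| x |A| table."""
    updated = sum(1 for v in Q.values() if v != 0)
    return updated / (n_states * n_actions)

# Illustrative numbers (not from the paper): for n = 3 there are 3! = 6
# states and, assuming swap actions over position pairs, C(3, 2) = 3 actions.
n = 3
Q = {((2, 1, 3), (0, 1)): 0.7, ((1, 2, 3), (0, 1)): 0.0}
cap = brain_capacity(Q, math.factorial(n), math.comb(n, 2))
# one nonzero entry out of 6 * 3 = 18 table cells
```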

Experimental Setup and Results.
In order to compare the difference and efficacy between the nontransfer and transfer methods, a case study of the sorting problem is presented. We present a series of experiments for both nontransfer and transfer RL to investigate the difference in training speed and the contrast in knowledge requirements.

Experimental Setup.
We design an experimental setting in which the model is trained to sort lists of n numbers, where each list is a permutation of {1, 2, ..., n}. To provide an equitable comparison, we run nontransfer and transfer RL in parallel using the algorithm presented as pseudocode in Algorithm 2. The input of Algorithm 2 consists of a list S_training, which is a permutation of {1, 2, ..., n}, a nontransfer Q-table NRQ_n[S_n, A_n], and a transfer Q-table TRQ_n[S_n, A_n] built by the mechanism discussed in Section 3.B. A variable upper_bound is used as one of the constraints on the training level. The input list S_training is assigned to both s_nt and s_tr as the initial sorting list for the two methods. Then the algorithm starts iteratively solving the sorting tasks. We begin with nontransfer RL. This process consists of training and evaluation. In the training part, we pass the current Q-table (NRQ_n[S_n, A_n]) and the list s_nt to Algorithm 1 to train the model (line 11). The number of training steps returned from Algorithm 1 is accumulated in the variable NonTrans_Tr_Steps (line 13). For the evaluation part, the NRQ_n[S_n, A_n] returned by Algorithm 1 is used to sort the n! lists given by the permutations of {1, 2, ..., n}, and the average number of sorting steps is the model's training level, denoted Avg_nt. We then select the list that takes the maximum number of steps to sort as the new s_nt (line 15). The same procedure is applied for transfer RL (lines 12, 14, and 16). The above process is repeated until the two models reach a similar training level (i.e., Avg_nt and Avg_tr are very close, or both are lower than upper_bound). This restriction ensures that both methods exhibit comparable abilities to sort the n! lists, so that a further comparison of the total number of training steps and the brain capacity is fair.
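The stopping criterion in line 17 of Algorithm 2 can be expressed directly; the function name and the tolerance parameter name are ours, with the 0.1 tolerance taken from the listing.

```python
def similar_training_level(avg_nt, avg_tr, upper_bound, tol=0.1):
    """Stopping criterion of Algorithm 2: the two methods have reached a
    similar training level (relative gap within tol), or both already sort
    within upper_bound steps on average."""
    return (abs(avg_nt - avg_tr) / avg_tr <= tol) or \
           (avg_nt <= upper_bound and avg_tr <= upper_bound)

# e.g. averages 10.5 vs 10.0 differ by 5% → the outer loop terminates
```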

Experimental Results.
As an empirical study, we present our results for n equal to 5, 6, 7, and 8. To give a fairer view of the comparison, we repeat Algorithm 2 for 30 episodes for each n. The total number of training steps and the brain capacity are the two perspectives from which we measure performance. The total number of training steps for the nontransfer method is abbreviated to NonTrans_Tr_Steps and that for the transfer method to Trans_Tr_Steps; the corresponding brain capacities are abbreviated to NonTrans_Br_Capacity and Trans_Br_Capacity. The results are presented in Tables 1-4 for different n.

input: S_training, Q_n[S_n, A_n]
(1) initialize
(2)   upper_bound = n + 1
(3)   train_steps = 0
(4)   success_rate = 0.75
(5)   S_goal = [1, 2, ..., n]
(6) repeat
(7)   end = FALSE
(8)   swap_times = 0
(9)   s = S_training
(10)  current_rate = 0
(11)  repeat
(12)    Select an action a based on ε-greedy
(13)    Perform the action a and observe s′ and the corresponding reward
(14)    Increase swap_times by one
(15)    if (s′ is S_goal) then
(16∼17)   end = TRUE
(18)      Check the success rate for the latest 100 episodes and assign it to current_rate
(19)    elseif (swap_times > upper_bound) then
(20∼21)   end = TRUE
(22)    else
(23∼28)   Update the Q-table based on the reward in equation (4)
(29∼31) until end is TRUE
until current_rate >= success_rate
output: Q_n[S_n, A_n], train_steps

ALGORITHM 1: RL_Sort.

input: S_training, NRQ_n[S_n, A_n], TRQ_n[S_n, A_n]
(1∼5) initialize; upper_bound = n + 1
(6) Assign S_training to s_nt and s_tr
(7) finish = FALSE
(8) NonTrans_Tr_Steps = 0
(9) Trans_Tr_Steps = 0
(10) repeat
(11)   NRQ_n[S_n, A_n], Steps_nt = RL_Sort(s_nt, NRQ_n[S_n, A_n])
(12)   TRQ_n[S_n, A_n], Steps_tr = RL_Sort(s_tr, TRQ_n[S_n, A_n])
(13)   NonTrans_Tr_Steps = NonTrans_Tr_Steps + Steps_nt
(14)   Trans_Tr_Steps = Trans_Tr_Steps + Steps_tr
(15)   Sort the n! lists in S_n by NRQ_n, compute the average Avg_nt, and pick the list with the maximum value as s_nt
(16)   Sort the n! lists in S_n by TRQ_n, compute the average Avg_tr, and pick the list with the maximum value as s_tr
(17)   if (|Avg_nt − Avg_tr| / Avg_tr <= 0.1) or (Avg_nt <= upper_bound and Avg_tr <= upper_bound) then
(18)     finish = TRUE
(19) until finish is TRUE

ALGORITHM 2: The algorithm for training the nontransfer and transfer RL methods.
Looking at the comparison of the total number of training steps, we can see that the values of NonTrans_Tr_Steps and Trans_Tr_Steps increase significantly as n increases. It is worth noting that some of these values are less than 100 when n is 5. Therefore, instead of using the latest 100 episodes to check the success rate, as mentioned in Section 3.B, we opt for the latest 10 episodes. Regarding the comparison of the brain capacity, the values of NonTrans_Br_Capacity and Trans_Br_Capacity are generally smaller than 0.25, and almost all are below 0.1 when n is greater than 6. This implies that the knowledge required to solve the sorting task occupies only a small portion of the Q-table.
For each episode, we also calculate the ratio of the total number of training steps (Ratio_Tr_Steps) as NonTrans_Tr_Steps divided by Trans_Tr_Steps, and the ratio of the brain capacity (Ratio_Br_Capacity) as NonTrans_Br_Capacity divided by Trans_Br_Capacity. For Ratio_Tr_Steps, there are nine values greater than or equal to 5.00 when n equals 5. However, as n increases, this phenomenon no longer appears and the transfer effect diminishes. For Ratio_Br_Capacity, the range is much narrower and is largely concentrated between 0.75 and 1.25. As described in Algorithm 2, both the nontransfer and transfer methods are required to reach very close training levels before a training episode finishes. Since a close training level means that the two methods have similar abilities and performance in sorting the n! lists, this could explain why Ratio_Br_Capacity is around 1. In general, the transfer method exhibits better performance in terms of training steps. However, in some cases, Ratio_Tr_Steps is smaller than 1, which means the nontransfer method takes fewer steps to complete training. Since both methods require a similar brain capacity to sort the n! lists, it is possible that the transfer model exploits the transferred knowledge but does not explore enough to expand its knowledge. This can lead to more training steps being needed to finish the training process.
To explore the distributions of Ratio_Tr_Steps and Ratio_Br_Capacity, boxplots are presented in Figures 2 and 3 for statistical analysis. A boxplot represents the minimum, 25th percentile, median, 75th percentile, and maximum of the given dataset. In Figure 2, we observe that the medians of Ratio_Tr_Steps, shown as the red lines inside the boxes, gradually decrease as n increases. This is in accordance with our previous observation that the growth of n may lower the transfer effect. In Figure 3, the medians of Ratio_Br_Capacity all occur around 1.00, mostly aligning with our previous conjecture. In addition to the statistics in the boxplots, we also report the averages of Ratio_Tr_Steps and Ratio_Br_Capacity in Table 5. The average performance shows very similar trends to the boxplots.

Conclusions
Prior research has reported that the Q-learning-based approach to the sorting problem requires a large number of training steps. Since transfer learning can share the knowledge learned from source domains with a target domain, we devised a transfer scheme to investigate the time cost and knowledge usage of nontransfer and transfer models. The Q-table obtained from the prior task serves as the knowledge source transferred to the next task. We chose the sorting problem as our case study to analyse two important performance metrics: the number of training steps and the brain capacity. Our experiments show that the brain capacities of the two models are similar after they reach a similar training level. The difference in the total number of training steps between the two models is significant when n is small. However, as n increases, the proportion of transferred knowledge becomes smaller and the difference less pronounced, making the transfer effect insignificant.
As shown in Table 4, the maximum number of total training steps is close to 100,000 when n equals 8. Faster learning will be necessary to handle larger n. Future work will therefore be concerned with reducing the state space. State abstraction [22,23], with its ability to leverage knowledge learned from prior experience, is a worthwhile direction for improving the scalability of the current approach. Another avenue of future work is to extend the current tabular representation to deep learning-based methods to improve learning stability and computational efficiency.

Data Availability
No data were used to support the findings of the study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.