Empirical explorations of strategic reinforcement learning: a case study in the sorting problem

Recent advances in deep learning and reinforcement learning have made it possible to create agents capable of mimicking human behaviours. In this paper, we are interested in how a reinforcement learning agent behaves under different learning strategies and whether, in principle, it can complete a task at a level comparable to human performance. To study the effect of different reward types, two reward schemes are introduced: immediate reward and pure-delayed reward. To build a more human-like agent when interacting with the environment, we propose a goal-driven design that forces the agent to achieve a level close to human ability and a training mechanism that learns only from good trajectories. Q-learning, one of the most popular reinforcement learning algorithms, is employed for our study. As the sorting problem is a classical topic in theoretical computer science with widespread applications, it is used for the empirical evaluation. We compare our results against algorithmic solutions.


INTRODUCTION
The goal of reinforcement learning is to learn a mapping from states to actions that maximizes a reward or achieves a goal. Unlike supervised learning problems, where each instance is given to the learner together with a correct label, the reinforcement learning agent has to discover by trial and error which state-action sequences yield the best performance on the desired task. At each time step, the agent perceives the current state and takes an action accordingly. In return, the environment transitions to a new state based on the agent's action and provides either an immediate reward or a one-time reward deferred until the final time step. Since the agent's actions affect not only the reward but also the future state, all of this poses the challenge of developing efficient and effective algorithms for reinforcement learning problems.
Q-learning is a well-known reinforcement learning algorithm in which the agent learns to act optimally by successively approximating the action-value function [1]. The action-value is defined as the sum of the reward received for taking a particular action in a particular state and the expected future reward thereafter. It is desirable for the agent to try all actions in all states sufficiently often under some exploration scheme. The strength of Q-learning is that it does not require any model of the environment to perform its iterative updates, and it has been proved that Q-learning converges with probability one. In reinforcement learning tasks, the reward not only defines the objective but also shapes the decision-making process. A properly chosen reward can guide the agent towards desired behaviours; on the contrary, a poorly chosen reward may cause the agent to fail to learn or to move away from the objective [2,3]. Reward schemes can, in general, be classified into two categories, immediate reward and pure-delayed reward, although some tasks fall in between these two extremes [4]. In the class of immediate reward problems, the environment assigns a value to each action taken in a particular state; the card game Snap is an example of this type. In pure-delayed problems, the agent receives no reward for individual actions, but a reward is given at the end to indicate a success or a failure; playing the board game backgammon is an example that can be characterized as a pure-delayed problem.
The convergence time of reinforcement learning is a serious concern, and developing methods for speeding it up is important. For example, if a game level is too hard, learning directly from the most difficult level may take a long time [5]. Shaping is an idea, originating from behaviourist psychology, for reducing the learning curve through goal-directed exploration and faster training [6,7]. It gradually increases the complexity of the task, so that the agent is allowed to learn easier versions of the task first and use the skills it obtains to accelerate learning as the tasks become progressively harder. The Learning from Easy Missions mechanism is one such shaping method: the robot deals with easier situations at the early stages and later navigates in more difficult situations [8].
The main aim of this paper is to explore the use of reinforcement learning to empirically improve the average performance on a problem with already known complexity. Learning from good trajectories can make the agent mimic good experiences and lower the computational cost by updating the parameters only for those successful cases. The goal-driven design allows our explicit goals to be exercised gradually. We propose a training method combining good-trajectory adoption and the goal-driven design to balance the speed of convergence and the quality of results. In addition, since the choice of reward is considered one of the major influences on the quality of policies found by reinforcement learning, we use two extremes (immediate reward and pure-delayed reward) to investigate how they affect the timeliness and accuracy of the training task. Sorting, a fundamental data operation, has been applied to many computing tasks and has attracted intensive interest since its introduction. Among all comparison-based sorting algorithms, the performance cannot be better than O(n log n) in the average or worst case. To illustrate our approach and to lay the groundwork, we consider the sorting task, in which an agent is asked to perform under the designated strategies. The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes our approach and the detailed strategies under which the agent operates. Experimental results are presented and discussed in Section 4. In Section 5, we summarize conclusions and future work.

RELATED WORK
In the reinforcement-learning problem, an agent takes sequential actions for the corresponding states of the environment in an attempt to maximize a reward signal. There are two main families of methods for solving such a problem: model-based and model-free learning. Model-free approaches directly learn the policy through trial-and-error interaction without modelling the behaviour of the underlying environment, whereas model-based approaches learn to build an explicit model of the environment and then compute the optimal policy from the derived model [9]. The details of model-based approaches are out of the scope of this study and their review is omitted for simplicity. There are various algorithms for model-free approaches, but most are classified into one of two families, either value-based methods or policy-based methods, according to the goal of the training.
Value-based methods fit an optimized policy by primarily learning a value function, which includes the state value function and the action value (Q-value) function [10]. A state value function is used to determine the expected reward the agent can receive in a given state, whereas an action value function assesses how well the agent performs an action in a given state [11]. Q-learning [1] and SARSA [9] are two well-known and extensively studied value-based methods. SARSA is an on-policy algorithm which fits the Q-value to the current policy, depending only on the states visited and actions taken in the past. On the other hand, Q-learning, an off-policy algorithm, attempts to directly find the Q-value of the optimal policy rather than of the policy that was used to generate the data. Because it uses the maximum action value to approximate the expected values of actions, Q-learning is less robust due to overestimation of the Q-value. Double Q-learning, a double estimator method, uses two estimators to decouple the selection and evaluation of an action, so that it can eliminate the harm caused by the overestimation in Q-learning [12]. In recent years, deep learning methods have gained significant attention and have been successfully applied to a wide range of areas. It is therefore natural to adopt deep learning methods for reinforcement learning problems, as the rich representation of a deep network can enable traditional reinforcement learning algorithms to perform more effectively. A recent development is the combination of the Convolutional Neural Network (CNN) [13] and Q-learning into a reinforcement learning agent called the deep Q-network (DQN) [14]. It has been shown to be effective in Atari games with large state-action spaces and can achieve human-level performance or beyond. An extension of DQN replaces the CNN layers with Long Short-Term Memory networks [15] to address, in particular, the Partially Observable Markov Decision Process (POMDP). The resulting deep recurrent Q-network [16] is capable of integrating information over time: it shows similar performance on Atari games but performs better on POMDP domains. Similarly, there are also double SARSA [12] and deep SARSA [17] variants.
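To make the distinction concrete, the following minimal Python sketch contrasts the tabular update rules of the two methods; the parameter names (alpha for the learning rate, gamma for the discount factor) and the dictionary representation of the Q-table are our own illustrative choices, not taken from the cited works.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Off-policy: bootstrap from the greedy (maximum-valued) action in s_next,
    # regardless of which action the behaviour policy will actually take.
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: bootstrap from the action a_next actually selected by the
    # current (e.g. epsilon-greedy) policy in s_next.
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0))

Double Q-learning, in this notation, would maintain two such tables, using one to select the greedy action and the other to evaluate it.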
In contrast to value-based methods, policy-based methods try to learn the policy directly instead of maintaining a state/action value function from which the best action is selected. In practice, the policy is represented explicitly by a parametric probability distribution π_θ(a|s), such that action a is chosen in state s according to the current policy. The objective is to update the parameter θ over time in order to derive the optimal policy π* and obtain the largest reward [18]. A major advantage of such methods is their ability to handle large or continuous action spaces by computing the probability of each action or learning the probability distribution [11]. The REINFORCE algorithm, a well-known policy gradient algorithm, uses Monte Carlo sampling to update the parameter θ via gradient ascent [1]. The natural policy gradient, which follows the steepest direction under the Fisher information metric, has been proposed to replace the ordinary gradient with the natural gradient for the purpose of speeding up learning [19].
Actor-critic methods combine the advantages of value-based methods and policy-based methods by optimizing both the policy and the value function, where the actor refers to the learnt policy and the critic to the learnt value function. Since the gradient of policy-based methods is usually estimated by simulation, it may exhibit high variance and make the algorithm converge too slowly. Adding a critic not only reduces the variance but also delivers faster convergence [20,21]. Building on top of the traditional actor-critic model, Advantage Actor-Critic (A2C) introduces an advantage function to evaluate the policy instead of using the value function [22]. The advantage function represents a relative action value defined as the difference between the action value function and the state value function (i.e., A^π(s, a) = Q^π(s, a) − V^π(s)). It captures how much better an action is, compared to the average performance of the policy at a given state, in terms of the expected reward [23]. Asynchronous Advantage Actor-Critic (A3C) is an asynchronous version of A2C. The A3C algorithm has multiple actors executing different policies in parallel to stabilize training, as more actors allow for more exploration. Moreover, although deep reinforcement learning algorithms based on experience replay have achieved success in many challenging domains, the asynchronous updates in A3C are able to efficiently reduce the memory and computation cost per real interaction [24].
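As a toy illustration of the advantage function above, the snippet below computes A(s, a) = Q(s, a) − V(s) for a single state; the numeric values are invented, and V(s) is approximated here as the mean Q-value under a uniform policy, an assumption made only for this example.

# Invented Q-values for three actions in one state.
q_values = {"swap_0_1": 4.0, "swap_1_2": 2.5, "swap_0_2": 1.0}
# V(s) approximated as the average action value (uniform-policy assumption).
state_value = sum(q_values.values()) / len(q_values)
# Positive advantage means the action is better than the policy's average.
advantages = {a: q - state_value for a, q in q_values.items()}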
Although Convolutional Neural Networks have gained significant popularity and success, most usable network architectures depend heavily on expert knowledge and experience. A meta-modelling algorithm based on reinforcement learning, MetaQNN, is trained to search over connections between convolution, pooling, and fully connected layers through an ε-greedy Q-learning strategy with experience replay [25]. MetaQNN is able to yield good performance on small datasets such as CIFAR-10 but is computationally expensive for big datasets owing to the search of a huge space. To address this issue, a block-wise network generation pipeline, BlockQNN, automatically designs the network architecture using a fast Q-learning framework in which the state s represents the status of the current layer and the action a is the decision for the next successive layer [26]. The focus of BlockQNN switches to learning the entire topological structure of network blocks to improve performance rather than designing the whole network. Since on-policy and off-policy techniques have their own advantages, recent methods have been developed to make use of both on-policy and off-policy learning. While policy gradient methods offer stable learning, they require the collection of large amounts of on-policy experience; an approach combining policy gradient and Q-learning (PGQ) has therefore been proposed to take advantage of off-policy data by drawing experience from a replay buffer [27]. The PGQ approach achieves better performance than DQN and A3C on Atari games. Q-Prop is a sample-efficient policy gradient method which trains an off-policy Q critic as a general control variate to reduce on-policy gradient variance by using a Taylor expansion [28]. In addition to improving sample efficiency compared to state-of-the-art policy gradient methods, Q-Prop has outperformed other actor-critic techniques in humanoid locomotion tasks. Among existing imitation learning methods, Deep Q-learning from Demonstrations (DQfD) pre-trains the network in DQN by leveraging small sets of demonstration data from a human expert and includes a margin loss which encourages the expert's actions to have higher Q-values than other actions [29]. Once the pre-training is completed, the agent starts to interact with the environment and explores a much larger state space. Experiments are conducted on 42 Atari games and DQfD achieves state-of-the-art results for 11 games.
Compared to prior studies, we are more interested in guiding the reinforcement learning agent to accomplish the task quickly. Our novel goal-driven design gradually relaxes the constraints imposed on the agent and forces it to achieve a level close to human ability. Moreover, to shorten the convergence time, the goal-driven design is accompanied by a training mechanism that learns only from good trajectories, so as to reduce the computational cost of updating the parameters.

METHODOLOGY
In this section, we illustrate our approach with a case study in the sorting problem and present an algorithm for learning from good trajectories. The sorting task is characterized as a reinforcement learning process, and our proposed learning technique acts as an agent that sorts the list. At each time step t, the agent observes the state s_t representing the current sorting result of the given list and takes an action to exchange the values at positions i and j. After performing an action, the agent may receive an immediate reward to assess the action, or the reward may be delayed.

Qlearning
Q-learning, a form of model-free reinforcement learning, was proposed for Markov decision processes [1]. It directly provides agents with the capability of learning to act optimally over pairs of states and actions without relying on an explicit model of the Markov process. The core of the algorithm is a Q-value iteration derived from the Bellman equation [30], given by

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ],   (1)

where α is the learning rate and γ is the discount factor; the update combines the received reward with the discounted value of the next possible state, obtained by choosing the optimal action a to maximize that value. However, a purely greedy method based on exploitation alone (i.e., always choosing the action yielding the highest value) may quickly cause the agent to run into local optima. Therefore, the agent needs to be capable of incorporating exploration (i.e., selecting an action which may not be the optimum for the given state). In practice, ε-greedy is often the first choice to balance the trade-off between exploitation and exploration [31].
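A minimal sketch of ε-greedy action selection, assuming a Q-table stored as a dictionary keyed by (state, action) pairs; the function name and default ε are illustrative.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # Explore with probability epsilon by picking a random action; otherwise
    # exploit by picking the action with the highest current Q-value.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

Smaller values of ε favour exploitation of the current estimates, while larger values keep the agent visiting state-action pairs it would otherwise neglect.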
In this study, we apply Q-learning as our base reinforcement learning algorithm for the sorting problem. The state is defined as a list of numbers; for the task of sorting six numbers, there are 720 states in total. The action denotes the swap of the values at position i and position j, so there are 15 actions when sorting six numbers. The Q-learning algorithm calculates a Q-value based on the current sorting result (i.e., the state) and the exchange of two numbers (i.e., the action). This Q-value indicates the expected value the agent may receive by selecting the action.
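The state and action encoding for n = 6 can be made explicit as in the sketch below; the helper names are ours, and the enumeration is one straightforward way to realise the 720 states and 15 swap actions described above.

from itertools import combinations, permutations

n = 6
states = list(permutations(range(1, n + 1)))   # 6! = 720 possible orderings
actions = list(combinations(range(n), 2))      # C(6, 2) = 15 position pairs (i, j)

def apply_action(state, action):
    # Swapping the values at positions i and j yields the successor state.
    i, j = action
    lst = list(state)
    lst[i], lst[j] = lst[j], lst[i]
    return tuple(lst)

# Tabular Q-function over the full state-action space, initialised to zero.
Q = {(s, a): 0.0 for s in states for a in actions}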

Rewards
In reinforcement learning, the goal of a task is characterized in terms of the rewards, and the agent uses the received reward as guidance to produce suitable behaviours. An agent's ultimate objective is to maximize the expected cumulative reward it receives over time, rather than focusing on short-term rewards while interacting with its environment. The reward at time step t is denoted as r_t and the cumulative reward is defined as

R_t = r_t + r_{t+1} + r_{t+2} + … + r_T.   (2)

This formulation of the cumulative reward may be problematic for continuing tasks because it could diverge to infinity. Moreover, all rewards are weighted equally, no matter how far away in the future they are. The discount factor γ, with 0 ≤ γ ≤ 1, is introduced to prevent the cumulative reward from increasing to infinity and to control the weight of future rewards against the current reward. The discounted cumulative reward is defined as

R_t = r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_{k=0}^{∞} γ^k r_{t+k}.   (3)

In the sorting problem, we consider two types of rewards: immediate reward and pure-delayed reward. For the immediate reward, a reward is given at each time step depending on whether the action increases the number of items in the correct position. For example, if the state of six numbers is '1,3,2,4,6,5', where two numbers are in the correct position, an action of swapping two numbers can result in '1,2,3,4,6,5', where four numbers are in the correct positions, and the agent will receive a positive reward. On the contrary, if the action makes the ordering worse, a negative reward is given. For the pure-delayed reward scheme, a reward is only assigned at the end, to indicate a successful or unsuccessful sorting. The purpose of this design is to investigate whether an immediate reward can reduce the number of required trials or whether a pure-delayed reward better reflects the long-term goal by only providing rewards in the far future.
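A minimal sketch of the two reward schemes as described above; the exact magnitudes (+1/−1 and the terminal values) are not specified in the text and are assumptions for illustration.

def correct_positions(state):
    # Number of items already in their sorted position (list values 1..n).
    return sum(1 for idx, value in enumerate(state) if value == idx + 1)

def immediate_reward(state, next_state):
    # Positive reward if the swap increases the number of correctly placed
    # items, negative if it decreases it; magnitudes are illustrative.
    delta = correct_positions(next_state) - correct_positions(state)
    if delta > 0:
        return 1.0
    if delta < 0:
        return -1.0
    return 0.0

def pure_delayed_reward(done, sorted_successfully):
    # No feedback during the episode; a single terminal reward indicates
    # success or failure (values are illustrative).
    if not done:
        return 0.0
    return 1.0 if sorted_successfully else -1.0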

Goaldriven design
Although reinforcement learning has been used for autonomous tasks, its prominence largely depends on whether it can be scaled up to larger and harder tasks. In particular, learning to obtain good performance on challenging tasks often takes a very long time. In this paper, we propose a method based on relaxing the constraints in order to approach the goal more rapidly and learn from easier tasks. By relaxing the constraints currently imposed on the agent, we give the learner more freedom and fewer restrictions to discover useful and effective policies. What we do is essentially adjust the termination condition to allow better execution, in an attempt to increase the chance of success for the missions. Generally, it is useful to know which learning condition is easier to fulfil than others in order to reach the goal, given that a priori knowledge of the problem is available. In the sorting problem, we consider the constraint on the number of actions (i.e., swaps between two numbers) taken. We initially put severe constraints on the problem and gradually relax them over the course of learning until the goal is achieved or time runs out. Doing this has the great merit of making the problem simpler by increasing the allowed number of swaps.

Good trajectories
The agent learns to improve its skill from observable histories called trajectories, where each trajectory is a state-action sequence of length h denoted as ⟨(s_1, a_1), (s_2, a_2), …, (s_h, a_h)⟩. A good trajectory, in this study, is a sequence in which the agent reaches the goal state successfully within a predefined number of steps. By contrast, if the agent does not finish the task within that limit, we consider it a bad trajectory. Although good trajectories may not be optimal solutions, and there is no guarantee that learning from good trajectories will give the agent performance comparable to or better than that of an expert, it is at least clear that the agent should try to imitate those aspects of the teacher agents. Bad trajectories, on the contrary, are even more ambiguous, because it is not obvious, and sometimes difficult, to determine whether the entire trajectory was wrong or whether some parts of it were correct.
Figure 1 shows the flowchart of our training process. The input of the algorithm is a training set (denoted Training_Set) randomly selected from the n! lists, where each training sample contains a state and a constraint. At each iteration, the agent interacts with each training sample to perform the sorting task. If a training sample cannot be sorted, we relax the constraint of the sample as described in the previous section and save the sample into the replay set (denoted Replay_Set) without updating the Q-table; otherwise, the sample is removed from the Training_Set and the Q-table is updated. At the end of each iteration, the elements in the Replay_Set become the new training samples. The training process keeps iterating until all training samples have been trained successfully.
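The sketch below mirrors our reading of the flowchart in Fig. 1: each sample is trained under its current constraint, the Q-table is updated only when the sample is sorted successfully (a good trajectory), and failed samples have their constraint relaxed and are placed in the replay set for the next iteration. We assume the constraint is relaxed by one additional swap per iteration, consistent with the results in Section 4, and the helper callables run_episode and update_q_table are assumed interfaces rather than part of the original description.

def train(training_set, q_table, run_episode, update_q_table):
    # training_set: list of (state, constraint) pairs, e.g. 40 randomly
    # selected permutations, each initially constrained to n swaps.
    while training_set:
        replay_set = []
        for state, constraint in training_set:
            # run_episode is assumed to train on this sample under the given
            # constraint and report whether the success criterion was met.
            trajectory, success = run_episode(q_table, state, constraint)
            if success:
                # Good trajectory: learn from it and drop the sample.
                update_q_table(q_table, trajectory)
            else:
                # Bad trajectory: leave the Q-table untouched, relax the
                # constraint by one swap, and retry in the next iteration.
                replay_set.append((state, constraint + 1))
        # Elements of the replay set become the next iteration's training samples.
        training_set = replay_set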

EXPERIMENT AND RESULTS
In order to determine the efficacy of our proposed approach, a case study in the sorting problem is presented.We conducted a series of experiments to observe the performance under the learning configurations as well as to compare the results with other algorithmic solutions.

Experimental setup
We construct two experimental tasks that aim at evaluating our learning strategies for sorting n numbers. The first task applies the two reward schemes specified in Subsection 3.2 to the sorting problem with loose constraints: a Q-learning agent is asked to reach the goal state within n² actions. Given a training sample (i.e., a list of numbers), we are interested in the number of episodes needed for the agent to reach a 90% success rate over the latest 100 episodes. This task is designed to assess the effect of the different reward schemes. For the second task, the agent is expected to interact with the environment following the training flowchart in Fig. 1. We start this task with a strict constraint: the agent is required to finish sorting within n actions. Given a training example, we are interested in whether the agent can reach a 90% success rate over the latest 100 episodes within 45 000 episodes. If the agent is able to fulfil this requirement, we update the Q-table and remove the training example from the training set; otherwise, the Q-table stays unchanged and the constraint is relaxed by one additional action at the next iteration. The goal of this task is to investigate the feasibility of forcing the agent to learn from good trajectories and easier missions. In the next section, we report our performance on the designated tasks and compare our results to Quicksort.
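The 90% criterion over the most recent 100 episodes can be tracked as in the following sketch; the window size and threshold come from the text above, while the class name and the use of a fixed-length deque are our own.

from collections import deque

class SuccessTracker:
    # Tracks whether at least 90% of the latest 100 episodes succeeded.
    def __init__(self, window=100, threshold=0.9):
        self.outcomes = deque(maxlen=window)
        self.window = window
        self.threshold = threshold

    def record(self, success):
        self.outcomes.append(1 if success else 0)

    def satisfied(self):
        # Require a full window before declaring the criterion met.
        return (len(self.outcomes) == self.window and
                sum(self.outcomes) / self.window >= self.threshold)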

Experimental results
To evaluate our approach, we employ the proposed Q-learning agent to sort n! lists, where each list is a permutation of n numbers. As a case study, we demonstrate our results for n equal to 6, 7, and 8. In the training step for both tasks described in Subsection 4.1, we randomly select 40 lists as our training set for each value of n.
For the first task, Figs 2-4 illustrate the number of training steps needed to sort each training example within n² actions for different values of n under the two reward schemes. We note that the performance of immediate reward is significantly better than that of pure-delayed reward. The comparison of the average training steps is outlined in Table 1. As n increases, the difference between them tends to increase as well. For n = 6 there are 6! = 720 states, while the state space explodes to 40 320 states for n = 8. Even with such large state spaces, the agent is still able to learn to sort within n² actions after an acceptable number of training steps. In this experiment, we keep updating the Q-table whether the agent fails or succeeds in the sorting, i.e., we learn from both good and bad experiences.
For the second task, the detailed training results at each iteration based on the algorithm in Fig. 1 are reported in Tables 2-4. We note that, in general, immediate reward completes the training in fewer iterations than pure-delayed reward. For n = 6 and given the immediate reward scheme, there is only one list that requires 10 (i.e., n + number of iterations) actions to finish the training procedure. For n = 7 and n = 8 with the immediate reward scheme, all 40 lists can be trained to sort within 11 (i.e., 7 + 4) actions and 13 (i.e., 8 + 5) actions, respectively. To examine whether our empirical results are competitive with an existing solution, we assess them against Quicksort. After the training process, we conduct experiments to sort the n! lists 50 times for n equal to 6, 7, and 8. Table 5 displays the comparative analysis of the maximum and average number of actions for Quicksort, pure-delayed reward, and immediate reward given different values of n.

Table 1. The average training steps for two reward schemes when n equals 6, 7, and 8

From the aspect of average performance, immediate reward achieves promising results, needing less than half of the values obtained by Quicksort for all cases of n. On average, it only takes 5.03 actions to sort six numbers, 6.67 actions to sort seven numbers, and 9.18 actions to sort eight numbers. Although pure-delayed reward did not yield a good average value when sorting six numbers, it still performs better than Quicksort for sorting seven and eight numbers. The detailed results of the 50 experiments for n equal to 6, 7, and 8 are shown in Fig. 5, Fig. 6, and Fig. 7, respectively. In conclusion, our training algorithm for both reward schemes is capable of improving the behaviour by learning from good experiences and the goal-driven strategy, despite its training overhead.

CONCLUSION AND FUTURE WORK
In this paper, we described in detail our learning method that enables a reinforcement learning agent to improve its behaviours. We introduced human-like strategies for the agent to adopt while gaining experience with the environment. Two different reward schemes (pure-delayed reward and immediate reward) were chosen to guide the learning process and to assess their impact on training. As a result of our experiments, immediate reward required far fewer steps to complete the training process than pure-delayed reward. To shorten the learning time and avoid unnecessary trial and error, we adopted rapid and easy task learning by relaxing the constraint on the maximum allowable actions. To exploit and imitate past good experiences, we proposed learning from good trajectories in which the agent reaches the goal state efficiently. Our case study was conducted on the sorting problem. On average, the number of comparisons for immediate reward is about half the number of comparisons necessary for Quicksort. The empirical results indicate that there is a substantial reduction in the number of steps to solve the problem. One piece of future work is to investigate the selection of training samples and evaluate its influence on the sorting task. In our current approach, we randomly selected 40 lists as our training set without applying instance selection techniques. We plan to study whether the properties of the training samples influence the outcome of the training and the accuracy of the performance by considering the difference between choosing difficult states (i.e., most digits are in the wrong positions) and easy states (i.e., most digits are in the correct positions) as training targets. Moreover, we will also investigate whether the number of training samples has a significant impact on the training process and address the automatic determination of the number of training samples. Another line of future work is to improve the efficiency of the training process. As we can see from Table 1, it takes more than 1000 training steps to complete the training task for n equal to 8 with loose constraints (n² actions). It would be necessary to accelerate learning in order to handle larger n. Transfer learning has been used to speed up learning through the adaptation of previously learned behaviours via inter-task mapping. In the sorting problem, instead of randomly initializing weights for larger n, we will explore state similarity with previously learned tasks and initialize the action-values by leveraging the knowledge from prior learned and similar states.

Fig. 2. Distribution of the number of training steps for two reward schemes when n = 6.

Fig. 3. Distribution of the number of training steps for two reward schemes when n = 7.

Fig. 4. Distribution of the number of training steps for two reward schemes when n = 8.

Table 2. The number of training iterations for two reward schemes when n equals 6

Table 3. The number of training iterations for two reward schemes when n equals 7

Table 4. The number of training iterations for two reward schemes when n equals 8