Reinforcement Learning

Reinforcement learning is the type of learning done by an agent who is trying to figure out a good policy for interacting with an environment. The framework is that time is discrete; at each time step the agent perceives the current state of the environment and selects a single action. After performing the action, the agent earns a real-valued reward or penalty, time moves forward


Introduction
Reinforcement learning (RL), with the aid of advances in deep neural networks, has made major breakthroughs in diverse disciplines. Some early highlights were in computer games (Mnih et al., 2015), in chess and Go (Silver et al., 2016) and in robotics (Lillicrap et al., 2015; Haarnoja et al., 2018b). Recent highlights include the development of efficient algorithms, for example in matrix multiplication (Fawzi et al., 2022) and in sorting (Mankowitz et al., 2023).
There are a few applications of RL in astronomy as well. Telescope automation is closely related to robotics and RL can be used in telescope control including adaptive optics (Nousiainen et al., 2022; Landman et al., 2021; Nousiainen et al., 2021) and adaptive reflective surface control (Peng et al., 2022) as well as in observation scheduling (Jia et al., 2022, 2023a,b). Going further down the data flow, RL has been applied in radio astronomical data processing pipelines (Yatawatta and Avruch, 2021; Yatawatta, 2023) for hyper-parameter tuning. Considering modern astronomy to be a flow of data or information from the observing telescope to the scientist, we foresee many more applications of RL to aid and refine this flow, which motivates this publication.
Several methodologies fall under the umbrella of machine learning (ML). Supervised learning is the most commonly used methodology, where a machine is given both the input and the required output to learn to perform a certain task. In unsupervised learning, on the other hand, only the input is given to the machine. Reinforcement learning follows a different paradigm where a machine learns to perform a task by repeated attempts and by getting some form of feedback from an external environment. Another noteworthy difference in RL is the temporal aspect, i.e., the task to perform is considered to be a sequence of actions to take instead of just one action, as in, say, supervised learning where a classifier outputs the class corresponding to the input in one step.
In this paper, we provide an overview of modern deep-RL with a focus on its use in astronomy. Reinforcement learning has a long history and multiple origins, stemming from several disciplines such as machine learning, dynamic programming and control. The scope of this paper is to give a brief but sufficient overview of the topic so that a new user can quickly apply RL techniques in their work. This paper is organized as follows: in section 2, we give a theoretical overview of RL. In section 3, we discuss model-free RL algorithms both for discrete and continuous action spaces. Next, in section 4, we discuss model-based RL where a model representing the environment is built and used. Finally, in section 5 we discuss practical aspects of RL including commonly used software and we conclude in section 6.
Notation: We do not distinguish between scalars, vectors or tensors in our notation and the actual dimension of the variables should be inferred from the context of their use. The matrix transpose, inverse and determinant are given as $(\cdot)^T$, $(\cdot)^{-1}$ and $\det(\cdot)$, respectively. A multivariate Gaussian with mean $\mu$ and covariance $\Sigma$ is given as $\mathcal{N}(\mu, \Sigma)$.

Reinforcement learning theory
In this section, we provide a concise theoretical overview of reinforcement learning. Modern RL sits in the intersection of many diverse disciplines and we take an approach based on machine learning (Sutton and Barto, 2018; Szepesvári, 2010) to provide this introduction. Alternative viewpoints do exist, for example from dynamic programming and control (Bertsekas, 2005, 2012) or even from neuroscience (Bertsekas and Tsitsiklis, 1996).

The state, action and reward
We consider a form of artificial intelligence, which we call the agent, interacting with its environment as seen in Fig. 1. The agent performs a task and has an objective or a goal that we can specify. Information is passed from the environment or the world to the agent in the form of observations. After considering the observations, the agent performs the task, which is fed to the environment as the action. Based on the action, the environment will undergo a change. To evaluate the effect of the action taken, the agent also receives a reward, which is a numeric evaluation of the quality of the action and its effect on the environment. We put the relationships between the agent and its environment into a more rigorous mathematical form as follows. Let $S$ be a set whose elements are possible representations of the state of the environment. Any element in $S$ can be given by $s$ and without loss of generality we assume $s$ to be a vector.
Note that the state does not have to be equal to the observation: the state is a condensed form of the observation, for example obtained by removing observables that are directly dependent on other observables. We consider $A$ to be a set of possible actions to take. Any element in $A$ can be given by $a$, which is also assumed to be a vector. Note also that both $s$ and $a$ can be vectors of discrete values (integers) as well as continuous values depending on the problem. In Table 1 we have listed several examples of RL applications. In the chess game example, both the state and action are discrete because there are only a finite number of chess pieces and moves that can be made at any point in the game. In the other two examples, we have continuous values for $s$ and $a$, although some state (or action) variables can be discrete. The dimensions of $s$ and $a$ can also be large for real life problems, for example in the self-driving automobile example in Table 1.
The quality of the action taken by the agent is measured by the reward $r$, which is generally a real-valued scalar. We consider the function generating the reward to be in the function space $R$. The definition of the reward is unique to each RL problem and we (as the users) have the freedom to define it as long as we follow the general rule: the higher the reward is, the closer we are to achieving our objective. In other words, we give higher rewards to the agent when the actions it takes lead to reaching the goal and lower rewards (or higher penalties) when the actions lead to poor results or failure.

Markov decision processes
As seen in Fig. 1, the interactions between the agent and its environment are repetitive, i.e., at any given iteration, the agent observes and receives the state $s$ (and reward $r$) and recommends action $a$, which is passed on to the environment. The environment takes the action $a$ and as a result, its state will change. This updated state (or observation) is passed on to the agent, together with the reward $r$. This cycle continues until we reach our goal, hence it is a process. Most RL problems are episodic, i.e., after many repetitions of this cycle, the agent reaches its goal or ends in failure. If we consider the chess game example in Table 1, the game ends when the agent wins, loses or draws the chess game. Therefore, one chess game can be considered as one episode. In the bipedal walker and self-driving automobile examples, the episode will end when we reach our destination, when we run out of energy (fuel) or when we meet with an accident. This temporal aspect is unique to RL compared with other ML problems such as supervised learning. We use the subscript $t$ to denote this temporal dependence when we describe the state, action and the reward, such as $s_t$, $a_t$ and $r(s_t, a_t)$. At step $t$, the probability of the transition from the state $s_t$ to the next state $s_{t+1}$ is given by $p$, which is called the state transition probability and belongs to $P$.
The tuple $(S, A, R, P)$ is called a Markov decision process (MDP). The Markov property implies that the state transition probability at step $t$ is only dependent on the current state $s_t$ and the action taken $a_t$. In other words, we can parameterize the state transition probability as $p(s_{t+1}|s_t, a_t)$, which is only dependent on $s_t$ and $a_t$ (we use the notation for conditional probability here). The reward received at state $s_t$ after performing action $a_t$ is denoted by $r(s_t, a_t)$, which can also be represented as $r_t$.
An important question to answer in many RL problems is how to determine if the agent has learnt to perform the required task. Considering the examples in Table 1, the answer is clear for the chess example, where we consider the agent to have learnt to play chess if it wins almost all games. For the bipedal walker example, determining whether or not the walker has learnt to walk is less clear-cut. In order to alleviate this issue, we consider the agent to have learnt the task if it can sustain a sufficiently high reward at each time step $t$. Obviously, we do not expect the agent to reach such a high reward at the start and we need to look forward to the future. Therefore, while learning a task, the agent does not attempt to maximize only the immediate reward, but the cumulative reward over a number of future steps as well. In an infinite horizon MDP, we consider an infinite number of steps in the future. To account for uncertainties in the future, we calculate a discounted cumulative reward with a discount factor $\gamma \in (0, 1)$. This also makes the summation of rewards over future steps converge to a finite value.
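As a minimal sketch of the discounted cumulative reward described above, the function below sums a finite list of rewards weighted by increasing powers of $\gamma$. The function name and the truncation to a finite list are assumptions made for illustration; an infinite horizon is approximated by a long list.

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted cumulative reward: sum over k of gamma**k * rewards[k]."""
    total = 0.0
    # Work backwards so that each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    # A constant reward of 1 approaches the finite limit 1 / (1 - gamma) = 10.
    print(discounted_return([1.0] * 1000, gamma=0.9))
```

With a constant reward, the sum is a geometric series, which illustrates why a discount factor below one keeps the cumulative reward finite.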

Q function, value function and policy
In order to illustrate the basic RL concepts, we start with a simple example: in Fig. 2 we show a maze, which is our environment. The agent can start from any of the empty squares and has to navigate to the top right hand square, which is the goal. At each step, the agent can make one of four moves (actions): left (←), right (→), up (↑) or down (↓). If the agent reaches the top right hand corner, we reach the end of the episode (terminal state). The state space $S$ is discrete with 5 states and the action space $A$ is also discrete with 4 actions.
In order to reach our goal, we can define the reward given for each action taken at each state as in Table 2. Note that we give a very low reward ($-\infty$) to impossible actions. We give a high reward (100) when we reach our goal by moving up at state 4.
Table 2: Tabulated reward $R[s, a]$ for the maze environment to achieve the goal. Each row corresponds to a state and each column corresponds to an action.

In order to reach our goal in the maze, we tabulate the quality of each state and action pair in a table that we name the Q-table, $Q[s, a]$, which is updated as in algorithm 1.

Algorithm 1
...
4: Select action $a \leftarrow$ column number that gives the maximum value of row number $s$ in the Q-table. If more than one column has the maximum value, choose randomly between them.
5: Next state $s' \leftarrow$ follow action $a$ and Fig. 2.
6: if $s'$ = terminal state then
7: $Q[s, a] \leftarrow R[s, a]$.
8: Stop.
9: else
10: $Q[s, a] \leftarrow R[s, a] + \gamma \max_{a'} Q[s', a']$.
11: Next state $s \leftarrow s'$.
12: end if
13: end while

What we have shown in algorithm 1 is a crude form of Q-learning that is only feasible with low dimensional, discrete state and action spaces. The application of algorithm 1 to Fig. 2 is tedious, but can be done by hand with some value selected for $\gamma$ (say $\gamma = 0.9$). Python source code implementing algorithm 1 for the maze environment is given in Appendix A. After a sufficient number of iterations and sufficient repetitions of algorithm 1 with different initial states (episodes), we will get an end result as in Table 3. Using Table 3, we can solve the maze environment: for example, in state 0 we get the highest Q-value (79.1) by taking the action in the first column (→), while in state 1 we can choose either the first (→) or the third (↑) column, and so on. In most problems however, the dimensions of the state space and the action space are too high to use a tabular method. The same can be said if the state or action spaces are continuous.
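The tabular procedure can be sketched in a few lines of Python. The sketch below is not the code of Appendix A and does not use the maze of Fig. 2: it assumes a hypothetical corridor of five states in a row, where the goal is reached by moving up from state 4, and all names (`R`, `step`, `train`) are illustrative.

```python
import math
import random

GAMMA = 0.9
LEFT, RIGHT, UP, DOWN = 0, 1, 2, 3
NEG_INF = -math.inf
GOAL = "goal"  # marker for the terminal state

# Hypothetical corridor maze (NOT the layout of Fig. 2): states 0..4 in a row.
# R[s][a] is the reward; -inf marks impossible moves, 100 rewards the goal.
R = [
    [NEG_INF, 0.0, NEG_INF, NEG_INF],
    [0.0, 0.0, NEG_INF, NEG_INF],
    [0.0, 0.0, NEG_INF, NEG_INF],
    [0.0, 0.0, NEG_INF, NEG_INF],
    [0.0, NEG_INF, 100.0, NEG_INF],
]


def step(s, a):
    """Deterministic transition of the hypothetical corridor."""
    if s == 4 and a == UP:
        return GOAL
    return s - 1 if a == LEFT else s + 1


def train(episodes=100, seed=0):
    rng = random.Random(seed)
    # Q-table: zero for feasible moves, -inf for impossible ones.
    Q = [[0.0 if math.isfinite(r) else NEG_INF for r in row] for row in R]
    for episode in range(episodes):
        s = episode % 5  # cycle through all initial states
        while True:
            # Greedy action with random tie-breaking.
            best = max(Q[s])
            a = rng.choice([i for i, q in enumerate(Q[s]) if q == best])
            s_next = step(s, a)
            if s_next == GOAL:
                Q[s][a] = R[s][a]  # terminal: no future reward
                break
            Q[s][a] = R[s][a] + GAMMA * max(Q[s_next])
            s = s_next
    return Q


if __name__ == "__main__":
    for row in train():
        print(row)
```

In this assumed corridor, the learnt Q-values along the path to the goal decay by a factor of $\gamma$ per step away from the goal, so following the largest value in each row leads to the goal.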
In order to generalize RL algorithms to handle high dimensional or continuous problems, we move on from a tabular representation to a functional representation relating $S$ and $A$. Some of the major components are:
• Policy: The policy is a mapping from $S$ to $A$. A deterministic policy $\pi(s)$ will produce action $a$ given the state $s$, $\pi(s) \to a$. A stochastic policy $\pi(a|s)$ will predict the conditional probability distribution of the action $a$ given the state $s$. We can generate samples from the conditional probability distribution to get a representation of the action, i.e., $a \sim \pi(a|s)$.
• Q-function: The Q-function $Q(s, a) \to q$ is a mapping from the state and action spaces to a real number $q$, $S \times A \to \mathbb{R}$. This is the generalization of the Q-table (e.g., Table 3) from discrete RL problems to continuous or high dimensional RL problems. Given the state $s$, $Q(s, a)$ will give an indication of the quality (or expected cumulative reward) of taking action $a$ under policy $\pi(\cdot)$. The higher $Q(s, a)$ is, the closer we are to achieving the goal of the RL problem. The Q-function is also called the action-value function or the state-action value function.
• Value: The value of state $s$ is the expected cumulative reward starting with state $s$ and following a policy $\pi(\cdot)$. In other words, it is a measure of the importance of being in state $s$ to reach the goal of the RL problem. We use $V(s) \to v$ to denote the value function that maps $S \to \mathbb{R}$.
It is important to understand how the aforementioned policy and value functions are related to one another. At the solution of the RL problem (or after reaching the goal), the optimality conditions are defined by the Bellman equation (1)
$$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a'). \tag{1}$$
The Bellman equation relates $Q(s, a)$ to the (optimal) policy $\pi(s)$. The current state and action taken are given by $s$ and $a$ respectively, while the next state and action (under the optimal policy) are given by $s'$ and $a'$. The immediate reward is given by $r(s, a)$ while the discount factor is $\gamma$. Note that we have already used a form of (1) in algorithm 1 to solve the maze problem (see line 10). Note also that if $s$ is the terminal state (end of episode), $s'$ does not exist and the right hand side of (1) is just $r(s, a)$.
To solve any given RL problem, we have to learn the optimal $\pi(s)$, $Q(s, a)$ or $V(s)$ tailored to that particular problem. In order to do this, we represent each of the aforementioned functions as deep neural networks (DNN, LeCun et al. (2015)). There are two main reasons why deep neural networks are well suited for this task. First, deep neural networks have the ability to find arbitrary representations. Second, from a practical viewpoint, modern deep learning frameworks have built-in learning capabilities with gradient descent and built-in gradient calculation such as reverse mode automatic differentiation (Paszke et al., 2017). The deep neural networks are parameterized by trainable parameters and in section 3 we will describe how we train these deep neural networks to solve the RL problem.

Deep reinforcement learning algorithms
In this section, we will focus our attention on model-free deep RL algorithms. We first discuss the common difficulties faced while training an RL agent.
• Not enough data: A large amount of training data is required even in other deep learning tasks such as supervised learning. This is even more prominent in RL. Note that in RL, we generate training data by interacting with an environment. Most real life environments are complex and expensive to operate (for example a self-driving automobile). Hence, generating enough training data is hard in RL. In section 4, we discuss model-based RL as a way to generate data without interacting with an environment. Alternatively, we can store and re-use past experience, which we will elaborate on in this section.
• Exploitation vs exploration: The optimization problems underlying the training of RL are non-convex and ill-conditioned with high dimensional parameter spaces. Moreover, the state space $S$ and action space $A$ can also be high dimensional. In order to reach the global optimal point in our optimization, we need to uniformly sample the full $S$ and $A$. It is easy for the solutions to converge to local minima or overfit because of poor sampling. In order to overcome this, during selection of the action to take, we can balance between random sampling (exploration) and using the policy to maximize the reward (exploitation). This is commonly termed $\epsilon$-greedy action selection, where we select a random action with probability $\epsilon$ and we exploit the policy with probability $1 - \epsilon$, where $\epsilon$ is a small positive value.
• Stability: The solution to the RL problem is obtained by (indirectly) solving (1) or something similar in an iterative manner. We do see that $Q(s, a)$ appears on both sides of the equation, which makes the iteration unstable. Generally a small learning rate is chosen to avoid divergence, but other improvements are also used to overcome this, as we illustrate later in this section.
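The $\epsilon$-greedy action selection mentioned in the list above can be sketched for a discrete action space as follows. The function name and the representation of the Q-values as a plain list are assumptions for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Select an action index from a list of Q-values.

    With probability epsilon a uniformly random action is taken (exploration),
    otherwise the action with the largest Q-value is taken (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    return q_values.index(best)  # first maximizer if there are ties


if __name__ == "__main__":
    print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1))
```

In practice $\epsilon$ is often decreased as training progresses, so that the agent explores early and exploits later; that schedule is a design choice left out of this sketch.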

Experience replay
In order to alleviate the data deficiency for training, we can store our interactions with the environment at each step $t$ and re-use them as experience. We keep a special buffer (replay buffer) $D$ for this and at each step, we store the tuple (state $s$, action $a$, reward $r$, next state $s'$) in the buffer. Since we are using past experience, we are using actions based on past policies that might be different from the current policy. Hence the use of a replay buffer compels us to use off-policy RL algorithms. It is also possible to prioritize the past experience that we retrieve from $D$ based on how poorly the agent has performed in the past (Schaul et al., 2015). For example, we can prioritize retrievals from $D$ by giving high priority to past steps where the transitions from $s$ to $s'$ have a large change in the Q-values (importance sampling).
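A replay buffer with uniform sampling can be sketched with a fixed-size queue, as below. The class name, the capacity limit and the extra `done` flag marking terminal transitions are assumptions for illustration; prioritized retrieval is not implemented here.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) tuples."""

    def __init__(self, capacity, seed=None):
        # The oldest experience is discarded once the capacity is exceeded.
        self.data = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next, done=False):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        """Uniformly sample a mini-batch without replacement."""
        k = min(batch_size, len(self.data))
        return self.rng.sample(list(self.data), k)

    def __len__(self):
        return len(self.data)


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=1000, seed=0)
    buffer.store(s=[0.0], a=[1.0], r=0.5, s_next=[0.1])
    print(buffer.sample(batch_size=4))
```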
We present the high level algorithm for RL using a replay buffer in algorithm 2. This algorithm uses transitions from step $t$ to $t + 1$ for learning, also called temporal difference learning with one step look ahead (TD(0)). We use two loops, an outer loop over $E$ episodes and an inner loop over $L$ steps per episode. We consider an infinite-horizon RL problem and hence make $L$ sufficiently large for this purpose.

Algorithm 2
...
Environment: simulate or realize new (random) observation.
7: Agent: select action $a$ given state $s$.
8: Environment: take action $a$, determine reward $r(s, a)$ and next state $s'$.
9: $D$: store $(s, a, r, s')$.
10: Agent: sample a mini-batch from $D$ and learn.
...
16: end for
17: end for

An important statement in algorithm 2 is in line 10, where the actual training of the agent is done. A mini-batch is one or more tuples of $(s, a, r, s')$ sampled from $D$ and fed to the learning algorithm. Generally, a large mini-batch will give a more stable performance of the learning algorithms and the largest size of the mini-batch is mostly determined by the available memory. We will expand the learning statement (line 10 in algorithm 2) in the following to consider various learning strategies.

Discrete action RL
In a discrete action space $A$, the action $a$ can only take a finite number of values. However, the state space $S$ can be continuous or discrete; we do not impose any restriction. In contrast to the tabular algorithm 1, we use a Q-function $Q(s, a)$ where $a$ is discrete. The Q-learning (Watkins and Dayan, 1992) step with learning rate $\mu$ can be given as in (2):
$$Q(s, a) \leftarrow Q(s, a) + \mu \left( r(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right). \tag{2}$$
The maximization $\max_{a'} Q(s', a')$ can be done by evaluating $Q(s', a)$ for all possible values of $a$ and selecting the maximum Q-value. If we have more than one tuple of $(s, a, r, s')$ in a mini-batch, we repeat (2) for all tuples sequentially. Initial values for $Q(s, a)$ will be set to 0, especially for the terminal states $s$. If the next state $s'$ is terminal, the right hand term of (2) is just $\mu \left( r(s, a) - Q(s, a) \right)$ and the same is implicitly applied to all learning steps hereafter.
As noted before, solving (1) by iterating (2) is not stable and leads to overestimation (Van Hasselt et al., 2016), especially for large values of $\mu$. Double Q-learning (Van Hasselt, 2010) is an improvement to Q-learning in this regard. In double Q-learning, instead of one, we use two Q-functions, i.e., $Q_1(s, a)$ and $Q_2(s, a)$. The motivation for this is to keep one Q-function fixed while updating the other. With probability 0.5 we update the first Q-function as
$$Q_1(s, a) \leftarrow Q_1(s, a) + \mu \left( r(s, a) + \gamma Q_2\left(s', \arg\max_{a'} Q_1(s', a')\right) - Q_1(s, a) \right) \tag{3}$$
and otherwise we update the second Q-function
$$Q_2(s, a) \leftarrow Q_2(s, a) + \mu \left( r(s, a) + \gamma Q_1\left(s', \arg\max_{a'} Q_2(s', a')\right) - Q_2(s, a) \right). \tag{4}$$
Note that in (3) and (4), the target reward is estimated by the Q-function which is fixed. The evaluation of $Q_2\left(s', \arg\max_{a'} Q_1(s', a')\right)$ in (3) can be described as first finding the value $a'$ that gives the maximum of $Q_1(s', a')$ and using this to evaluate $Q_2(s', a')$.
Thus far, we have treated the Q-functions without any consideration of how they are represented. In most practical algorithms, the Q-functions are represented by deep neural networks. For example, using a DNN with trainable weights given by $\theta$, we can model the Q-function in (2) and we denote this by $Q_\theta(s, a)$. Using the same trick as in double Q-learning, we create two Q-functions, one parameterized by $\theta$ which is trained and another parameterized by $\theta'$ which is denoted as the target Q-function, i.e., $Q_\theta(s, a)$ and $Q_{\theta'}(s, a)$ respectively.
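One double Q-learning step, as described in the text for (3) and (4), can be sketched for tabular Q-functions stored as nested lists. The function name, the `terminal` flag and the default values of `mu` and `gamma` are assumptions for illustration.

```python
import random


def double_q_step(Q1, Q2, s, a, r, s_next, terminal,
                  mu=0.1, gamma=0.9, rng=random):
    """Update one of two Q-tables in place, chosen with probability 0.5."""
    if rng.random() < 0.5:
        updated, fixed = Q1, Q2
    else:
        updated, fixed = Q2, Q1
    if terminal:
        target = r  # no next state, so only the immediate reward remains
    else:
        # The table being updated selects a'; the fixed table evaluates it.
        row = updated[s_next]
        a_best = max(range(len(row)), key=lambda k: row[k])
        target = r + gamma * fixed[s_next][a_best]
    updated[s][a] += mu * (target - updated[s][a])


if __name__ == "__main__":
    Q1 = [[0.0, 0.0], [1.0, 5.0]]
    Q2 = [[0.0, 0.0], [2.0, 3.0]]
    double_q_step(Q1, Q2, s=0, a=0, r=1.0, s_next=1, terminal=False)
    print(Q1, Q2)
```

Separating the selection of $a'$ from its evaluation is what reduces the overestimation that a single maximization over noisy Q-values would produce.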
We minimize the mean squared error loss $J(\theta)$ during training,
$$J(\theta) = \left( r(s, a) + \gamma \max_{a'} Q_{\theta'}(s', a') - Q_\theta(s, a) \right)^2. \tag{5}$$
A gradient descent step to minimize (5) can be given as
$$\theta \leftarrow \theta - \mu \nabla_\theta J(\theta) \tag{6}$$
where $\mu$ is the learning rate. Note that the gradient of (5) can be calculated as
$$\nabla_\theta J(\theta) = -2 \left( r(s, a) + \gamma \max_{a'} Q_{\theta'}(s', a') - Q_\theta(s, a) \right) \nabla_\theta Q_\theta(s, a) \tag{7}$$
using the chain rule. Note the similarity of the gradient (7) to the Q-learning step (2). The target network parameters $\theta'$ are updated at a lower cadence, after repeating (6) several times, by copying $\theta$, i.e., $\theta' \leftarrow \theta$. For a mini-batch of several $(s, a, r, s')$, the loss (5) is averaged over the whole mini-batch. With modern deep learning frameworks, we only need to specify the loss to minimize as in (5) and it is not necessary to explicitly calculate the gradient or the gradient descent step.
Looking back at Table 1, the chess game example has both discrete state and action spaces. However, their dimensionalities are high and, unlike in other RL problems, the environment (another chess player) acts in an adversarial manner. For this type of RL problem (games), a combination of Monte Carlo tree search (for dimensionality reduction) and DNNs (to reduce the depth and breadth of the tree search using value functions) is used (Silver et al., 2016).
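To make the gradient step concrete without a deep learning framework, the sketch below replaces the DNN by a hypothetical linear Q-function, $Q_\theta(s, a) = \theta_a^T s$, with one weight vector per discrete action. This substitution, and all names used, are assumptions for illustration; the target is held fixed through a separate copy of the weights, as described in the text.

```python
def q_value(theta, s, a):
    """Linear stand-in for a DNN: Q(s, a) = dot(theta[a], s)."""
    return sum(w * x for w, x in zip(theta[a], s))


def dqn_step(theta, theta_target, s, a, r, s_next, terminal,
             mu=0.01, gamma=0.9):
    """One gradient descent step on the squared error; returns the loss."""
    if terminal:
        target = r
    else:
        target = r + gamma * max(q_value(theta_target, s_next, b)
                                 for b in range(len(theta_target)))
    error = target - q_value(theta, s, a)
    # Gradient of error**2 with respect to theta[a] is -2 * error * s, so the
    # descent step adds mu * 2 * error * s; the target is treated as constant.
    theta[a] = [w + mu * 2.0 * error * x for w, x in zip(theta[a], s)]
    return error ** 2


if __name__ == "__main__":
    theta = [[0.0, 0.0], [0.0, 0.0]]
    theta_target = [row[:] for row in theta]
    for _ in range(3):
        print(dqn_step(theta, theta_target, [1.0, 0.5], 0, 1.0,
                       [0.0, 0.0], True))
```

Copying `theta` into `theta_target` every few steps would complete the scheme; the copy interval is a tuning choice not fixed by this sketch.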

Continuous action RL
In a continuous action space, the action $a$ can have an infinite number of values. Therefore, the direct application of algorithms such as Q-learning is not feasible. The policy $\pi(s)$ or $\pi(a|s)$ plays a major role in calculating $a$, instead of searching through all possible actions. In dynamic programming, there are two families of algorithms for solving RL problems in continuous action spaces: value iteration and policy iteration. In value iteration, the value function $V(s)$ is updated iteratively (according to the Bellman optimality condition (1)) until convergence and, based on the converged value function, the policy is calculated. On the other hand, in policy iteration, the value function is updated while evaluating the current policy and thereafter, the policy is also iteratively updated.
In this paper however, we focus on actor-critic methods, where we jointly update both the value function and the policy; these methods can be considered as a combination of value iteration and policy iteration. As seen in Fig. 3, we decompose the agent into an actor and a critic as follows:
• Actor: implements the policy. The actor will produce an action $a$ given state $s$, $\pi(s) \to a$, or the conditional probability of $a$ given state $s$, $\pi(a|s)$, that can be sampled to generate $a$. In algorithm 2, the actor is active in line 7 and line 10.
• Critic: evaluates the state $s$ (and action $a$) using the Q-function $Q(s, a)$ or the value function $V(s)$. Conceptually, the critic provides a critique of the action taken by the actor. The critic is active in the learning step in line 10 of algorithm 2.
We describe some of the state-of-the-art actor-critic algorithms in the following.
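
Deep deterministic policy gradient (DDPG)
In DDPG, the actor uses a deterministic policy $\pi_\phi(s)$ parameterized by $\phi$ and the critic uses a Q-function $Q_\theta(s, a)$ parameterized by $\theta$, with corresponding target networks $\pi_{\phi'}(s)$ and $Q_{\theta'}(s, a)$. In line 7 of algorithm 2, the action to take given the state $s$ is generated as
$$a = \pi_\phi(s) + \epsilon \tag{8}$$
where $\epsilon$ is generated from an Ornstein-Uhlenbeck process (Uhlenbeck and Ornstein, 1930), i.e., the noise is correlated between successive steps in algorithm 2. At each learning step in algorithm 2, the critic is updated using the loss function
$$J(\theta) = \left( r(s, a) + \gamma Q_{\theta'}\left(s', \pi_{\phi'}(s')\right) - Q_\theta(s, a) \right)^2 \tag{9}$$
which is quite similar to (5) except that the maximization step with respect to $a'$ is replaced by evaluation of the target policy $\pi_{\phi'}(s')$. Afterwards, the policy is updated to maximize the Q-value $Q(s, a)$, therefore the loss to be minimized with respect to $\phi$ is given by
$$J(\phi) = -Q_\theta\left(s, \pi_\phi(s)\right). \tag{10}$$
Using gradient descent, both $\theta$ and $\phi$ are updated as
$$\theta \leftarrow \theta - \mu_Q \nabla_\theta J(\theta), \quad \phi \leftarrow \phi - \mu_\pi \nabla_\phi J(\phi) \tag{11}$$
where $\mu_Q$ and $\mu_\pi$ are the learning rates for the critic and the actor, respectively. At each learning step, the target network parameters are also updated by a small amount using Polyak averaging as
$$\theta' \leftarrow \tau \theta + (1 - \tau) \theta', \quad \phi' \leftarrow \tau \phi + (1 - \tau) \phi' \tag{12}$$
where $\tau$ is a small positive value.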
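Two ingredients of the above description, temporally correlated exploration noise and Polyak averaging of the target parameters, can be sketched as follows. The discretization of the noise process and the default values of `kappa`, `sigma`, `dt` and `tau` are assumptions for illustration, and the parameters are represented as flat lists rather than DNN weights.

```python
import math
import random


class OUNoise:
    """Discretized mean-reverting noise: x <- x - kappa*x*dt + sigma*sqrt(dt)*N(0,1)."""

    def __init__(self, size, kappa=0.15, sigma=0.2, dt=0.01, seed=None):
        self.kappa, self.sigma, self.dt = kappa, sigma, dt
        self.rng = random.Random(seed)
        self.x = [0.0] * size

    def sample(self):
        # Each new value depends on the previous one, so successive samples
        # are correlated, unlike independent Gaussian noise.
        self.x = [x - self.kappa * x * self.dt
                  + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
                  for x in self.x]
        return list(self.x)


def polyak_update(target, online, tau=0.005):
    """In place: target <- tau * online + (1 - tau) * target."""
    for i, w in enumerate(online):
        target[i] = tau * w + (1.0 - tau) * target[i]


if __name__ == "__main__":
    noise = OUNoise(size=2, seed=0)
    print(noise.sample(), noise.sample())
    target = [0.0, 0.0]
    polyak_update(target, [1.0, 2.0], tau=0.1)
    print(target)
```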

Twin delayed DDPG (TD3)
The main shortcoming of DDPG is the overestimation of the Q-value (Fujimoto et al., 2018) and TD3 is an attempt to overcome this. In this algorithm we use two Q-functions instead of one, parameterized by $\theta_1$ and $\theta_2$, i.e., $Q_{\theta_1}(s, a)$ and $Q_{\theta_2}(s, a)$. Similar to DDPG, each Q-function has its corresponding target, parameterized by $\theta'_1$ and $\theta'_2$, i.e., $Q_{\theta'_1}(s, a)$ and $Q_{\theta'_2}(s, a)$. At initialization, $\theta_1$ and $\theta_2$ are randomly initialized to be as different as possible. The actor uses policy $\pi_\phi(s)$ parameterized by $\phi$ and target policy $\pi_{\phi'}(s)$ with $\phi'$ as parameters.
The agent generates an action in line 7 of algorithm 2, given state $s$, as
$$a = \mathrm{clip}\left( \pi_\phi(s) + \mathrm{clip}(\epsilon, -c, c), \underline{a}, \overline{a} \right) \tag{13}$$
where $\epsilon$ is a vector (of the same size as the action) whose values are zero mean Gaussian noise with standard deviation $\sigma = 0.2$, clipped to have values in $[-c, c]$ with $c = 0.5$ (the range of values from $-c$ to $c$ is denoted as $[-c, c]$). Furthermore, the action itself is clipped to only have values in $[\underline{a}, \overline{a}]$, where $\underline{a}$ and $\overline{a}$ are the lower and upper limits of the action. The objective of this clipping is to smoothen the generated actions, so that it acts as a form of regularization on $\pi_\phi(s)$. The learning step (line 10 in algorithm 2) is as follows. Using the next state $s'$ (sampled from $D$) and using the target policy, we generate
$$a' = \mathrm{clip}\left( \pi_{\phi'}(s') + \mathrm{clip}(\epsilon, -c, c), \underline{a}, \overline{a} \right) \tag{14}$$
where $\epsilon$ and the clipping operations are similar to (13). Because we have two Q-functions, the loss for the $i$-th Q-function ($i \in \{1, 2\}$) can be written as
$$J(\theta_i) = \left( r(s, a) + \gamma \min_{j=1,2} Q_{\theta'_j}(s', a') - Q_{\theta_i}(s, a) \right)^2 \tag{15}$$
where we find the minimum Q-value of the target Q-functions evaluated at $s'$, $a'$ by $\min_{j=1,2} Q_{\theta'_j}(s', a')$. By comparing two Q-values and using the minimum of the two, we try to avoid overestimating the Q-value, which is an improvement over DDPG. We minimize the total loss $J(\theta_1) + J(\theta_2)$ (averaged over the mini-batch) to update both $\theta_1$ and $\theta_2$.
In TD3, the policy is updated less frequently than the Q-functions. While $\theta_1$ and $\theta_2$ are updated at each learning step, the parameters of the policy $\phi$ are updated with a slower cadence (say at every 5 learning steps). The policy update is done by maximizing the Q-value with respect to $\phi$, quite similar to DDPG (10), except that we only use one Q-function for this; for example we minimize
$$J(\phi) = -Q_{\theta_1}\left(s, \pi_\phi(s)\right). \tag{16}$$
Notice that during the policy update, we do not add noise to the actions as in (13) or (14). The target parameters are updated similar to DDPG (12), except that the update is performed with the same cadence with which the actor is updated, as in
$$\theta'_i \leftarrow \tau \theta_i + (1 - \tau) \theta'_i, \; i \in \{1, 2\}, \quad \phi' \leftarrow \tau \phi + (1 - \tau) \phi' \tag{17}$$
where $\tau$ is a small positive value.
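The target used in the critic loss of TD3 combines the clipped noise, the clipped action and the minimum over the two target Q-functions. A minimal sketch is given below, where the target policy and target Q-functions are passed in as plain Python callables; these callables, the function names and the scalar action limits `a_low` and `a_high` are assumptions for illustration.

```python
import random


def clip(x, low, high):
    """Limit x to the interval [low, high]."""
    return max(low, min(high, x))


def td3_target(r, s_next, terminal, target_policy, target_q1, target_q2,
               a_low, a_high, gamma=0.99, sigma=0.2, c=0.5, rng=random):
    """Return r + gamma * min of the two target Q-values at (s', a')."""
    if terminal:
        return r
    # Smoothed target action: clipped Gaussian noise, then clipped action.
    a_next = [clip(a + clip(rng.gauss(0.0, sigma), -c, c), a_low, a_high)
              for a in target_policy(s_next)]
    q_min = min(target_q1(s_next, a_next), target_q2(s_next, a_next))
    return r + gamma * q_min


if __name__ == "__main__":
    y = td3_target(0.5, [0.0], False,
                   target_policy=lambda s: [0.9],
                   target_q1=lambda s, a: 1.0,
                   target_q2=lambda s, a: 2.0,
                   a_low=-1.0, a_high=1.0)
    print(y)
```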

Soft actor critic (SAC)
Both DDPG and TD3 have deterministic policies $\pi(s)$ and therefore, to enable some exploration while choosing actions, some form of noise is added to the action as in (8) or in (13). In contrast, the actor in SAC implements a stochastic policy $\pi(a|s)$ that inherently includes some randomness. Each element of the vector $a$ is modeled as an independent random variable to create the policy $\pi(a|s)$. The probability density function (PDF) of the $i$-th element of $a$ (say $a_i$) is generated using a Gaussian random variable with mean $\mu_i$ and variance $\sigma_i^2$. However, this leads to values in the range $(-\infty, \infty)$ and, to keep the actions within a finite range, the Gaussian random variable is transformed by a $\tanh(\cdot)$ function that maps the output to $(-1, 1)$. In practice we use a DNN with parameters $\phi$ to model $\pi_\phi(a|s)$ as the mean and the variance of each element $a_i$.
Similar to TD3, SAC uses two Q-functions parameterized by $\theta_1$ and $\theta_2$ as $Q_{\theta_1}(s, a)$ and $Q_{\theta_2}(s, a)$ and corresponding target Q-functions $Q_{\theta'_1}(s, a)$ and $Q_{\theta'_2}(s, a)$ parameterized by $\theta'_1$ and $\theta'_2$. In order to generate an action in line 7 of algorithm 2, we sample the policy as
$$a \sim \pi_\phi(\cdot|s).$$
The learning step in line 10 of algorithm 2 involves minimizing loss functions to update the critic ($\theta_1$ and $\theta_2$) and thereafter to update the policy ($\phi$). Generally, high randomness in actions leads to a high chance of exploration (as opposed to exploitation). High randomness also means high differential entropy of the conditional PDF, so we can increase the reward given to policies with high entropy. With this motivation, we add an extra reward to the Q-value based on the differential entropy of the policy $\pi_\phi(a|s)$, which is proportional to $-\log \pi_\phi(a|s)$. This gives a loss function for the critic as
$$J(\theta_i) = \left( r(s, a) + \gamma \left( \min_{j=1,2} Q_{\theta'_j}(s', a') - \alpha \log \pi_\phi(a'|s') \right) - Q_{\theta_i}(s, a) \right)^2$$
where $a' \sim \pi_\phi(\cdot|s')$ and $\alpha$ is the entropy regularization factor (small value $\approx 0.1$). Note that we need to apply the $\tanh(\cdot)$ to the Gaussian PDF to calculate $\log \pi_\phi(a|s)$ in closed form, as given in Haarnoja et al. (2018a,b). The policy is updated by minimizing
$$J(\phi) = \alpha \log \pi_\phi(a_\phi|s) - \min_{j=1,2} Q_{\theta_j}(s, a_\phi) \tag{21}$$
where $a_\phi$ is a sample drawn from $\pi_\phi(\cdot|s)$. In order to have a differentiable (with respect to $\phi$) cost function $J(\phi)$ in (21), we use the re-parameterized (Kingma and Welling, 2013) generation of $a_\phi$ as
$$a_\phi = \tanh\left( \mu_\phi(s) + \sigma_\phi(s) \odot \xi \right)$$
where $\odot$ is the element-wise product and $\xi$ is randomly generated from $\mathcal{N}(0, I)$, which is Gaussian noise with zero mean and unit covariance. Note that we still have a differentiable mapping from the vectors $\mu_\phi$ (mean) and $\sigma_\phi$ (standard deviation, i.e., the square root of the diagonal covariance) to $a_\phi$. Both $\mu_\phi$ and $\sigma_\phi$ are modeled by a DNN with parameters $\phi$.
After updating $\theta_1$ and $\theta_2$ at each learning step, the target network parameters $\theta'_1$ and $\theta'_2$ are also updated by Polyak averaging as in (17).
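The re-parameterized sampling of a tanh-transformed Gaussian action, together with its log-probability, can be sketched as below. The mean and standard deviation are passed in as plain lists (in practice they would be DNN outputs); the function name and the small constant guarding the logarithm are assumptions for illustration.

```python
import math
import random


def sample_action(mu, sigma, rng=random):
    """Return (action, log_prob) for a = tanh(mu + sigma * xi), xi ~ N(0, 1)."""
    action, log_prob = [], 0.0
    for m, s in zip(mu, sigma):
        xi = rng.gauss(0.0, 1.0)
        u = m + s * xi                       # re-parameterized Gaussian sample
        a = math.tanh(u)                     # squash to (-1, 1)
        # Log-density of u under N(m, s**2); note (u - m) / s equals xi.
        log_prob += -0.5 * xi * xi - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        # Change of variables for tanh: subtract log(1 - tanh(u)**2).
        log_prob -= math.log(1.0 - a * a + 1e-6)
        action.append(a)
    return action, log_prob


if __name__ == "__main__":
    print(sample_action([0.0, 0.5], [1.0, 0.2], rng=random.Random(0)))
```

Because the randomness enters only through `xi`, the action is a deterministic, differentiable function of the mean and standard deviation, which is the property the policy update relies on.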

Model based reinforcement learning
Generating enough data to train RL agents is a fundamental problem in many applications. In some applications, the data generation is expensive and may even harm the actual physical system; for example, a robot or an automobile might be damaged if certain actions are performed. In order to overcome this problem, as shown in Fig. 4, we can create a representative model of the actual physical system to generate more data. The use of an internal model to represent the environment is called model based RL and offers a rich variety of algorithms (Levine et al., 2015; Nagabandi et al., 2017; Clavera et al., 2018; Chua et al., 2018; Janner et al., 2019; Wang et al., 2019; Clavera et al., 2020; Wang et al., 2021). Of course, model based RL will only work if the model we construct as the proxy for the environment can accurately represent the dynamics of the environment.
There are two forms of uncertainty that we need to consider when building a model to represent the actual environment (Chua et al., 2018). Aleatoric uncertainty is the uncertainty due to inherent randomness in the measurements (e.g., thermal noise, quantization). In contrast, epistemic uncertainty is the uncertainty due to the lack of complete information about the actual physical system (e.g., misrepresentation of the state). We can account for both forms of uncertainty by using an ensemble of probabilistic DNNs as our model. A probabilistic DNN models the transition probability $p(s'|s, a)$ of the next state $s'$ given $s$ and $a$, as opposed to a deterministic DNN which directly gives the next state $s'$ as output. By using a probabilistic DNN we can capture the aleatoric uncertainty. In order to capture the epistemic uncertainty, we use an ensemble of such probabilistic DNN models.

Probabilistic ensemble models
We can select any probability distribution to model the transition probability and often a multinomial Gaussian with a mean µ θ i and a diagonal covariance Σ θ i is chosen, as in We use parameters θ i to represent µ θ i and Σ θ i in the probabilistic model.In order to estimate the parameters θ i , we maximize the likelihood, or minimize the negative log-likelihood (ignoring the constant terms) 25) We minimize J (θ i ) in (25) by using s, a and s ′ sampled from the replay buffer D (Lakshminarayanan et al., 2016).In the ensemble, we have B probabilistic DNN models with parameters θ i , i ∈ [1, B] for each model.During training, each θ i is randomly initialized and we independently sample minibatches with replacement for each i to minimize J (θ i ) in ( 25).The reason for this is to find a diverse set of feasible solutions for θ i for each i.
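For a diagonal covariance, the negative log-likelihood reduces to a sum over the state dimensions, and sampling with replacement gives each ensemble member its own mini-batch. The sketch below assumes the predicted mean and the diagonal of the covariance are given as plain lists; the function names are illustrative.

```python
import math
import random


def gaussian_nll(mean, var, s_next):
    """Negative log-likelihood (constants dropped) for diagonal covariance.

    Equals (mean - s')^T Sigma^{-1} (mean - s') + log det Sigma, where
    var holds the diagonal entries of Sigma.
    """
    return sum((m - x) ** 2 / v + math.log(v)
               for m, v, x in zip(mean, var, s_next))


def bootstrap_minibatch(data, batch_size, rng=random):
    """Sample a mini-batch with replacement for one ensemble member."""
    return rng.choices(data, k=batch_size)


if __name__ == "__main__":
    print(gaussian_nll([0.0, 0.0], [1.0, 1.0], [1.0, 2.0]))
    transitions = [(0, 0, 1), (1, 0, 2), (2, 1, 3)]
    print(bootstrap_minibatch(transitions, 4, rng=random.Random(0)))
```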
Once we have a trained model $p_{\theta_i}$, there are several ways to use it in RL. A direct way is to use the model to generate more training data (Janner et al., 2019; Wang et al., 2021). In model predictive control (Nagabandi et al., 2017; Chua et al., 2018) on the other hand, the dynamics model is used to predict future rewards and plan the sequence of actions to take. It is also possible to perform gradient descent directly through the model, for example to optimize a policy (Clavera et al., 2020).

Probabilistic ensemble with trajectory sampling
As an example of model based RL, we discuss the probabilistic ensemble with trajectory sampling algorithm (PETS, Chua et al. (2018)) in this paper. The PETS algorithm is simpler in the sense that it does not use gradient descent optimization, compared to the model-free algorithms described in section 3. However, note that learning the dynamics model by minimizing (25) still involves optimization based on gradient descent. We use an ensemble with $B$ probabilistic models. The basic idea of PETS is to look ahead $T$ steps and, based on the expected reward, decide on which action $a_t$ to take. In algorithm 3, we have given the pseudo-code for PETS. Note that we use a replay buffer $D$ similar to algorithm 2. In fact, algorithm 3 is similar to algorithm 2 in the manner of interactions with the environment and the use of a replay buffer. The major difference is given in line 6 of algorithm 3, where we select the action $a_t$ by sampling and not by using a policy.
Algorithm 3 PETS (continued)
4: Train $p_{\theta_i}$ using $D$ by sampling with replacement for all $i$.
5: Select initial state $s_1$ from environment.
8: Execute $a_t$ in the environment, record $s_{t+1}$ and reward $r_t$.
10: end for
11: end for

The selection of the optimal action is done by sampling trajectories starting from the current state $s_t$ using the cross entropy method (CEM, Botev et al. (2013)) that is summarized in algorithm 4. We look ahead $T$ steps and let us consider the candidate actions for this trajectory to be $a_t, a_{t+1}, \ldots, a_{t+T-1}$. In one trial, we propagate these actions with one (randomly selected) dynamics model $i$ (from $B$ models), as $s_{t+1} \sim p_{\theta_i}(s_t, a_t)$, $s_{t+2} \sim p_{\theta_i}(s_{t+1}, a_{t+1})$ and so on until we reach $s_{t+T} \sim p_{\theta_i}(s_{t+T-1}, a_{t+T-1})$. Afterwards, using the state and action pairs $(s_t, a_t)$, $(s_{t+1}, a_{t+1})$, up to $(s_{t+T-1}, a_{t+T-1})$, we calculate the expected rewards $r_t$, $r_{t+1}$, to $r_{t+T-1}$. We roll out $P$ trials, randomly selecting the dynamics model $i$ in each trial to generate the state sequence, and we record the average reward for this candidate action as the average over all $P$ trials and all $T$ steps.
We select $M$ elite actions from the $C$ candidate actions at step $t$, i.e., $a_t$, that have the highest average reward and calculate their mean and variance to update $\mu$ and $\sigma^2$. The updated $\mu$ and $\sigma^2$ are used to generate candidate actions for the next iteration. After $N$ iterations, $\mu$ is returned as the optimal action $a_t$. The crucial component in algorithm 4 is the evaluation of the reward (line 6), which needs some information on how the rewards are calculated. If this information is not available, we can train the dynamics model $p_{\theta_i}$ to predict the reward in addition to the state transition probability.

Algorithm 4 CEM (continued)
7: Average $r(s^p_\tau, a_\tau)$ over $p$ ($P$ values) and $\tau$ ($T$ values) and record this for this candidate action.
8: end for
9: From the candidate actions, find $M$ elite actions corresponding to the $M$ highest averaged rewards.
10: Update $\mu$ and $\sigma^2$ with the mean and variance of the elite actions.
11: end for
12: Return $\mu$ as optimal action.
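The action selection described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the reference implementation: the function name `cem_action`, the callable interface `model(s, a, rng)` for sampling the next state, and the known reward function `reward_fn(s, a)` are all hypothetical. We also assume that each candidate is a full sequence of $T$ actions sampled from a Gaussian with mean $\mu$ and variance $\sigma^2$, and that the first element of the final mean is returned as $a_t$.

```python
import numpy as np

def cem_action(s_t, models, reward_fn, action_dim,
               T=3, C=64, M=8, P=4, N=6, rng=None):
    """Select an action with the cross entropy method and trajectory sampling.

    models: list of B callables, model(s, a, rng) -> sampled next state.
    reward_fn: callable, reward_fn(s, a) -> scalar reward (assumed known).
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = np.zeros((T, action_dim))
    sigma2 = np.ones((T, action_dim))
    for _ in range(N):
        # Draw C candidate action sequences of length T
        noise = rng.standard_normal((C, T, action_dim))
        candidates = mu + np.sqrt(sigma2) * noise
        scores = np.zeros(C)
        for c in range(C):
            total = 0.0
            for _ in range(P):
                # One randomly selected dynamics model per trial
                model = models[rng.integers(len(models))]
                s = np.asarray(s_t, dtype=float)
                for tau in range(T):
                    a = candidates[c, tau]
                    total += reward_fn(s, a)
                    s = model(s, a, rng)
            # Average reward over P trials and T steps
            scores[c] = total / (P * T)
        # Keep the M candidates with the highest average reward
        elite = candidates[np.argsort(scores)[-M:]]
        mu = elite.mean(axis=0)
        sigma2 = elite.var(axis=0) + 1e-6  # small floor keeps sampling valid
    return mu[0]
```

The nested loops are written for readability; a practical implementation would evaluate all candidates and trials in one batched call to the dynamics models.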

Hint assisted RL
Almost all tasks in astronomy (planning, scheduling, resource allocation, data processing etc.) already have a rich variety of methods in use. Existing methods rely on fundamentals from statistics, signal processing etc. and also on heuristics and experience gained by decades of practice. In such situations, an obvious question to ask is if we can incorporate the knowledge we already have into an RL agent in an efficient manner.
Hint assisted RL (Yatawatta, 2023) is one way of incorporating already existing knowledge into the training of RL agents. As shown in Fig. 5, the provision of hints can be based on anything, for example it could be based on an existing model or it could be based on an experienced astronomer suggesting a solution. Generally a hint $h$ is a replacement for an action $a$ (having the same dimensionality and the same domain, for example). We use a constraint $c(a, h)$ (for example $c(a, h) = \|a - h\|^2$ or any other metric) as a distance measure between the action $a$ and the hint $h$. As shown in Fig. 5, the hint is directly fed into the actor, affecting the learning of the policy.

We modify the learning of the policy as follows. First, we define a function $g(a, h)$ as
$$ g(a, h) = \left[c(a, h) - \delta\right]_+ \tag{26} $$
where $[x]_+ = x$ if $x > 0$ and $[x]_+ = 0$ otherwise (similar to a ReLU function). In (26), a threshold is given by $\delta$ ($> 0$) that determines how far apart the hint and the action can be. In other words, we take into account the situations where the hint can be inaccurate, so a large value for $\delta$ indicates that we have less trust in the accuracy of $h$.
We employ the alternating direction method of multipliers (ADMM, Boyd et al. (2011); Giesen and Laue (2019)) for the policy optimization with the hint being applied as a constraint. We modify (16) or (21) as
$$ J_h(\phi) = J(\phi) + \frac{\rho}{2} g(a, h)^2 + \lambda g(a, h) \tag{27} $$
where $J(\phi)$ is the original loss function for the policy in TD3 (16) or in SAC (21). The Lagrange multiplier is given by $\lambda$ and $\rho$ is the regularization factor. In the learning step of algorithm 2 (line 10), the parameters $\phi$ for the policy are updated as
$$ \phi \leftarrow \arg\min_{\phi} J_h(\phi) \tag{28} $$
where $J_h(\phi)$ is the augmented Lagrangian in (27). Thereafter, the Lagrange multiplier is updated as
$$ \lambda \leftarrow \lambda + \rho\, g(a, h). \tag{29} $$
The execution of (29) can be done less frequently than the execution of (28). Similar to TD3, the parameters of the target policy (if it exists) can be updated by Polyak averaging as in (17).
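To make the role of the threshold, the Lagrange multiplier and the regularization factor concrete, here is a small sketch of the hint constraint. It assumes a squared Euclidean distance for $c(a, h)$ and the standard augmented Lagrangian form $J + \frac{\rho}{2} g^2 + \lambda g$ with the dual update $\lambda \leftarrow \lambda + \rho g$; the function names are hypothetical, and the scalar `policy_loss` stands in for the TD3 or SAC policy loss.

```python
import numpy as np

def hint_penalty(a, h, delta):
    """g(a, h) = [c(a, h) - delta]_+ with c(a, h) = ||a - h||^2 (assumed)."""
    c = float(np.sum((np.asarray(a) - np.asarray(h)) ** 2))
    return max(c - delta, 0.0)

def augmented_loss(policy_loss, g, lam, rho):
    """Augmented Lagrangian: original policy loss plus the hint terms."""
    return policy_loss + 0.5 * rho * g ** 2 + lam * g

def update_multiplier(lam, g, rho):
    """Dual ascent step for the Lagrange multiplier."""
    return lam + rho * g

# An action far from the hint is penalized, a nearby one is not
g_far = hint_penalty([1.0, 0.0], [0.0, 0.0], delta=0.5)
g_near = hint_penalty([0.1, 0.0], [0.0, 0.0], delta=0.5)
```

Because the penalty is zero whenever the action stays within the threshold of the hint, an inaccurate hint only influences the policy when the two disagree by more than $\delta$.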

Applications in astronomy
In this section we discuss some practical aspects of applying RL to new tasks and also present some simple examples to illustrate the performance of the algorithms discussed in sections 3 and 4. Reinforcement learning algorithms are well supported in both major deep learning frameworks, i.e., PyTorch (Paszke et al., 2019) and TensorFlow (Abadi et al., 2016). Furthermore, extensive collections of environments such as Gym (Brockman et al., 2016; Towers et al., 2023) and collections of standard algorithm implementations such as stable-baselines (Raffin et al., 2021) and model based RL libraries (Pineda et al., 2021) also exist.
There are several practical issues that a prospective user of RL should consider when applying the aforementioned tools and utilities to their problem.
• In some tasks such as telescope system control, the state can be obvious and can directly relate to the physical measurements. On the other hand, in more abstract tasks such as tuning a regression problem, this might not be the case. Therefore some amount of insight into the problem and also some experimentation is required to determine the state representation. An obvious question to ask is if the task behaves as an MDP. If this is not the case and the state transition depends not only on the current state but on some history as well, the historical data can also be included in the current state (in other words, the current state is a window of states extending into the past). Similarly, the action can be part of the state and the action learnt by the agent can be the incremental action (or the scaling of the action).
• Both the state and the action will be formed by combining information from various sources, in other words combining apples and oranges. We need to pay attention to the numerical stability of the DNNs that are used to represent various models such as the actor or the critic. Ideally, all data should have the same dynamic range for the neural networks to perform well. Therefore, when combining data from different sources, we need to pay special attention to scale or normalize data appropriately.
• The calculation of the reward in each task is entirely determined by the objective to achieve, and constraints such as differentiability do not apply (Tadepalli and Ok, 1996; Henderson et al., 2018). In practice, actions that deliver good results are boosted by scaling the reward up and conversely, penalties can be added to the reward to discourage undesirable actions. Clipping of rewards is also common in practice (Hu et al., 2020), mainly for the numerical stability of the DNNs.
• In some applications, we will encounter actions that include both continuous and discrete variables, i.e., hybrid action spaces. In such situations, we can model the probabilities of the discrete actions as the policy and select the discrete action with the highest probability. For example, if we have K possible discrete actions, we can use a vector of K continuous variables and apply a soft-max operation to get a vector of probabilities. Afterwards, we can choose the element with the highest probability as the discrete action.
• The dimensions of the input and the output can vary even within a single problem. For example, the sky models (Nijboer et al., 2006) used in various data processing steps in radio astronomy can have a different number of sources depending on the direction in the sky. In order to accommodate that, we can design our RL agent to handle the largest possible number of dimensions and fill missing values with zeros. An improvement is to use auto-encoders or self attention mechanisms (Vaswani et al., 2017) to encode the input before feeding it into the RL agent. The encoding can be learnt during or before training the RL agent.
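Several of the practical points in the list above amount to small preprocessing steps, which can be sketched as follows. The helper names (`stack_history`, `normalize`, `discrete_choice`, `pad_with_zeros`) are hypothetical and the choices made here (repeating the oldest state when the history is short, scaling to the range [-1, 1]) are assumptions for illustration only.

```python
import numpy as np

def stack_history(states, window):
    """Build an augmented state from the last `window` states."""
    recent = list(states[-window:])
    # Repeat the oldest available state if the history is still too short
    pad = [recent[0]] * (window - len(recent))
    return np.concatenate(pad + recent)

def normalize(x, low, high):
    """Scale x elementwise from [low, high] to [-1, 1]."""
    return 2.0 * (x - low) / (high - low) - 1.0

def discrete_choice(logits):
    """Soft-max over K values, then pick the most probable discrete action."""
    z = logits - np.max(logits)  # subtract the maximum for numerical stability
    probs = np.exp(z) / np.sum(np.exp(z))
    return int(np.argmax(probs)), probs

def pad_with_zeros(x, max_dim):
    """Embed a variable-length input in a fixed-length vector."""
    out = np.zeros(max_dim)
    out[:len(x)] = x
    return out

# A hybrid action: 3 continuous values followed by 4 values for a discrete choice
raw = np.array([0.2, -0.7, 0.9, 0.1, 2.0, -1.0, 0.3])
continuous_part = raw[:3]
discrete_part, probabilities = discrete_choice(raw[3:])
```

Zero padding is the simplest option for variable input sizes; a learnt encoder would replace `pad_with_zeros` with a trained model.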
All forms of tasks in astronomy can be considered as possible applications of RL. We refer the reader back to section 1 for a review of all such existing applications. We also highlight some of the potential applications of RL in astronomy:
• Planning and control: Astronomical observatories serve multiple users with diverse science goals. Data collection for such diverse science goals needs a significant amount of planning and control and RL can aid in performing this.
• Resource allocation: Various forms of resources are required to produce science from raw astronomical data. Examples of such resources are: observing duration, computing resources, storage, network bandwidth etc. The resources are limited and need to be allocated fairly among the users. There are additional constraints such as minimizing the monetary cost and energy consumption. Such allocation problems can also be tackled by RL.
• Hyper-parameter tuning: Generic pipelines are adapted for processing each specific observation in most astronomical data processing pipelines. In order to do this, various hyper-parameters need to be tuned, mostly by grid-search based approaches. As shown by previous work (Yatawatta and Avruch, 2021; Ichnowski et al., 2021; Yatawatta, 2023), RL agents can outperform grid-search based approaches in tuning tasks such as regression, classification, and clustering.
• New science: Data collected by various observatories for specific science purposes are stored at large archives of astronomical data. Such archival data can be re-used by other astronomers whose science can be significantly different from the science for which the data were originally collected. We can train RL agents to re-use archival data for new scientific purposes. This discovery can be formulated as a task for an RL agent to learn and the learning can be done in a manner of indexing the archives for potential science.

Example: Bipedal walker
Wrapping this section up, we present a simple and yet challenging RL task: training a bipedal robot to walk. This environment is provided with Gym (Brockman et al., 2016) and is called the BipedalWalker-v3 environment. The difficulty of training the agent to walk increases depending on the ground on which the robot tries to walk. In Fig. 6, we show the environment with almost flat ground, which is seemingly a simple task. In contrast, the hard version is shown in Fig. 7 where the path has many obstacles including stairs, pits and hurdles. The state s of the bipedal walker is a vector of 24 real numbers corresponding to the positions of the leg joints and various velocities of the body. The action a corresponds to the torques applied to the 4 leg joints, thus the action space is a continuous 4 dimensional space. We employ several algorithms to train both the easy and hard versions of the bipedal walker and show the results in Figs. 8, 9 and 10. An agent is considered to have learnt to walk and reach the end of the path once the cumulative reward per episode is greater than or equal to 300. The DNN architectures for the actor and the critic and the hyper-parameters used in TD3 and SAC are given in Appendix B. In each comparison, we initialize the environment and the agents using a fixed random seed.
In Fig. 8 we compare the model free algorithms TD3 and SAC in training the bipedal walker to walk in the easy environment shown in Fig. 6. We see that with the SAC algorithm we are able to train the walker to reach the target cumulative reward while with TD3 we get a slightly lower reward. We move on to the hard environment in Figs. 9 and 10. In Fig. 9, we compare the performance of the SAC algorithm with and without using hints. The hints in this case are generated by an agent that is trained with the easy environment (Fig. 8). Therefore, the hints provided are inherently not accurate but nonetheless, the walker trained with the use of hints is able to achieve a higher reward, almost reaching the target. The terrain in the hard environment has many obstacles such as stairs, pits and hurdles that are randomly generated as seen in Fig. 7. This makes the environment highly non-stationary, which is also reflected by the dispersion of the rewards achieved in each episode as in Fig. 9. This also prevents the vanilla version of SAC (without using hints) from achieving the target reward of 300.
In Fig. 10 we perform a similar comparison using the TD3 algorithm. We compare the performance of TD3 with and without using hints in the hard environment. The hint assisted version of TD3 is able to get a higher reward, but much less than the SAC version. The hints are provided by the TD3 agent trained in the easy bipedal walker environment but as we see in Fig. 8, the TD3 agent is still not reaching the target reward. Therefore we are using an incompletely trained agent to provide the hints, which explains the poor results in Fig. 10 compared to Fig. 9. Nonetheless, the version using the hints still gets a higher reward than the version without hints in Fig. 10. We see the use of the hints as a 'brain transplant' from one agent to another and the hints could well be derived from another source.

Example: Calibration
In this example, we consider a problem more applicable to astronomy. The objective of this example is to give a more conceptual overview of formulating the state, action and reward for a general problem, and more in-depth details about this particular example can be found in (Yatawatta, 2023). Given a data vector $y$, we build a model describing the data as
$$ y = \sum_{i=1}^{K} s_i(\theta) + n. \tag{30} $$
Problems similar to (30) can arise in many applications, including regression, model fitting, calibration and so on. In our model, we have $K$ basis functions $s_i(\cdot)$ (generally non linear) that are parameterized by the parameter vector $\theta$ and our data is corrupted by noise in the vector $n$.
As illustrated in Fig. 11, we consider the mechanism that actually solves (30) to be a black-box, i.e., a specialized mechanism to solve the problem at hand of which we do not intend to have low-level control. The application of RL to the aforementioned problem can be described as follows.
• Objective: We will use RL to determine the best model to use for (30). We consider the $K$ basis functions $s_i(\cdot)$ as our dictionary and we will use RL to select the indices $i$ from $1, \ldots, K$ to best fit our situation. We need to consider two criteria to make this selection. The first criterion is the quality, i.e., we need to construct a model that best describes the data. The second criterion is the cost because we have a finite amount of computational resources that limit the size of the model to create as well as the number of iterations that we expend in creating the model. Generally we will encounter multiple realizations of (30) with different ground truth models and we use RL to determine the best model for any realization of (30).
• State: The state should summarize the performance of the system in solving (30). Since we have a black-box as in Fig. 11, we can use statistical tools such as the influence function (Hampel et al., 1986; Koh and Liang, 2017) to determine the state. The exact details in deriving the influence function for problems similar to (30) can be found in (Yatawatta, 2019).
• Action: Let us use the set $I$ to represent the indices $i$ from $1, \ldots, K$ that we have selected for our model. The action in our problem should specify $I$ and additionally, the maximum computational budget to expend within the black-box in Fig. 11. Given $K$ directions, we can formulate the action to predict the probabilities of each $i = 1, \ldots, K$ being selected to create $I$. Therefore, the action can be formulated to be a vector of $K$ values within the range (0, 1] and if any value is greater than 0.5, we select its index to be part of $I$. In order to determine the computational budget, an additional scalar within the range (0, 1] can be appended to the action. We can scale this value to fit within the minimum and maximum computational budget. To recapitulate, the action is formulated to be a vector of $K + 1$ values in the range (0, 1] derived from the DNN for the actor that can be trained to predict values in the range [−1, 1] (or values in the standard normal distribution followed by tanh(·) activation for a stochastic actor).
• Reward: Once we have $I$, we determine the parameters $\theta$ and we can find the residual as
$$ r = y - \sum_{i \in I} s_i(\theta). \tag{31} $$
One way to measure the quality of our model is to use the Akaike information criterion AIC (Akaike, 1974). Given the standard deviations of the data $y$ and the residual $r$ as $\sigma_y$ and $\sigma_r$, respectively, we can determine the AIC as
$$ \mathrm{AIC} \propto \left(\frac{\sigma_r}{\sigma_y}\right)^2 \mathrm{length}(y) + \mathrm{length}(\theta). \tag{32} $$
The first term of (32) represents the quality of the model (fractional reduction in the variance by using the model). The second term of (32) represents the degrees of freedom consumed by the model and $\mathrm{length}(\theta)$ is directly dependent on the cardinality of $I$ (how many directions are selected).
Using the AIC, we can formulate the reward to use for training our RL agent as
$$ \mathrm{reward} \propto -\mathrm{AIC} - \mathrm{penalty} \tag{33} $$
where the penalty represents the computational cost required to determine the parameters $\theta$ (for example by using maximum likelihood estimation). Since we use an iterative method to find $\theta$, we make the penalty proportional to the number of iterations used by the maximum likelihood estimation algorithm.
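The action and reward formulation above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the actor output in [−1, 1] is mapped linearly to [0, 1] (so the lower end of the range is included, unlike the half-open range in the text), indices are 0-based, and `penalty_weight` is an arbitrary proportionality constant for the iteration count.

```python
import numpy as np

def decode_action(raw, min_budget, max_budget):
    """Map K+1 actor outputs in [-1, 1] to an index set and an iteration budget."""
    u = 0.5 * (np.asarray(raw, dtype=float) + 1.0)  # now in [0, 1]
    selected = np.flatnonzero(u[:-1] > 0.5)          # index set I (0-based)
    budget = min_budget + u[-1] * (max_budget - min_budget)
    return selected, int(round(budget))

def aic(y, residual, n_params):
    """AIC up to proportionality: variance ratio times length(y) plus length(theta)."""
    ratio = np.std(residual) / np.std(y)
    return ratio ** 2 * len(y) + n_params

def reward(y, residual, n_params, n_iter, penalty_weight=0.01):
    """Reward: negative AIC minus a penalty proportional to the iterations used."""
    return -aic(y, residual, n_params) - penalty_weight * n_iter

# K = 3 directions plus one budget value
selected, budget = decode_action([1.0, -1.0, 0.2, 0.0], min_budget=10, max_budget=30)
```

With this decoding, a model that explains most of the variance with few selected directions and few iterations receives the highest reward.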
Guided by the aforementioned concepts, we train an RL agent to maximize the reward for our problem (30) with $K = 6$ using the SAC algorithm. Hints are provided to SAC by using an exhaustive search of the $2^K$ possible choices for $I$. The midpoint within the minimum and maximum number of iterations is used as the hint for the computational budget. The agent is trained by using simulated random realizations of (30), where each realization is called an episode. In each episode, the agent is given 7 steps to take an action and learn. The evolution of the average reward of each episode is shown in Fig. 12. Note that Fig. 12 shows the same reward curve at two different scales, in order to highlight the initial low rewards and the slow increase of the reward at later iterations, showing that the agent is learning.

Conclusions
We have provided a brief overview of deep reinforcement learning algorithms that are directly applicable in various astronomical tasks. Taking into account the plethora of alternative methods and techniques that already exist in astronomy to perform these tasks, we have also introduced a simple mechanism to transfer the knowledge from existing methods to RL agents via the use of hints. The growth of data intensive astronomy needs efficient and autonomous agents to monitor, control and process data with minimal human involvement. The use of reinforcement learning can help us to achieve this goal. Source code implementing all algorithms discussed in this paper is publicly accessible at (hint assisted reinforcement learning).

Figure 1: An agent interacting with its environment. The agent receives an observation and performs an action and receives a reward corresponding to the action.

Figure 2: The maze environment with 5 valid states 0, 1, . . ., 4. The agent can move (act) ←, →, ↑, or ↓. The state S is a discrete space with 5 states and the action A is also a discrete space with 4 actions.

Algorithm 2 Training the RL agent
Require: Number of episodes E, number of loop iterations L
1: Initialize Environment and Agent.
2: Setup empty (or reload saved) ReplayBuffer D.
3: for e = 1, . . ., E do
4:

Figure 3: An RL agent composed of an actor and a critic.

Figure 4: Model based RL. A dynamics model representing the environment is created and used by the agent.

Figure 5: Hint assisted RL. An external hint is directly provided to the actor in the RL agent.

Figure 8: Cumulative reward of the bipedal walker environment with episodes for SAC and TD3.

Figure 9: Cumulative reward of the hard version of the bipedal walker with episodes for the SAC agent with and without using hints.

Figure 10: Cumulative reward of the hard version of the bipedal walker with episodes for the TD3 agent with and without using hints.

Figure 11: Calibration considered as a black-box.

Figure 12: Average reward of each episode during training to solve (30). The reward is shown in two different scales in the two plots to highlight the early and late behaviors.

Table 1: Some examples where RL can be applied.

The Q-table therefore is a table with rows corresponding to each state and columns corresponding to each action (similar form as Table 2). Let us call this Q[s, a] and let us call the reward table in Table 2 as R[s, a]. We follow algorithm 1 to update the values of Q[s, a] iteratively.

Table 3: Converged Q-table for the maze environment after several iterations with γ = 0.9.

Algorithm 3 PETS
Require: Number of episodes E, number of loop iterations L
1: Setup replay buffer D, fill by taking random actions.
2: Randomly initialize ensemble dynamics model $p_{\theta_i}$ for all $i$.
3: for e = 1, . . ., E do