Optimal Policy Learning for Disease Prevention Using Reinforcement Learning

China West Normal University, Nanchong, China; Institute of Computing, Kohat University of Science and Technology, Kohat, Pakistan; Faculty of Engineering and Information Technology, Northern University, Nowshera, Pakistan; Faculty of Computer and Information Technology, Al-Madinah International University, Kuala Lumpur, Malaysia; Faculty of Computer Science, University of Swabi, Swabi, Pakistan; King Abdulaziz University, Jeddah, Saudi Arabia


Introduction
Diseases such as malaria, flu, dengue, and HIV have a huge impact on the quality of life of the human population [1][2][3]. Considering malaria alone, the World Health Organization reports that approximately 3.2 billion people are at risk of the disease. According to the same reports, 217 million malaria cases were recorded in 2016 and 219 million in 2017, which shows an increase in recent years [4]. Therefore, the effective use of resources to bring malaria under control is critical. Insecticide-Treated Nets (ITNs) are the primary method of malaria prevention [5] because the Anopheles mosquito, which transmits malaria, bites after 9 p.m. When a mosquito lands on the net, it dies due to the insecticide, which disrupts the reproductive cycle. In addition to ITNs, other malaria prevention policies include Indoor Residual Spraying (IRS) [6], larvicide in bodies of water [7], and malaria vaccination [8][9][10][11].
Machine learning algorithms have been applied in many domains and have made tremendous progress [12]; the healthcare sector in particular has been influenced by machine learning in the past few years [13][14][15]. These algorithms focus on the diagnosis of diseases [16] or on forecasting future outcomes [17], whereas the treatment of diseases remains largely unexplored [18]. Diagnosis is an important step toward treatment, and machine learning techniques can support healthcare professionals in treatment to some extent, but finding the best policy to treat patients remains a challenging problem for medical professionals [19]. Recently, reinforcement learning (RL) [20] has gained much popularity in video games [21][22][23], where an agent learns good and bad actions through interactions with the environment and the environment's response. RL has performed very well in games, but limited progress has been made in real-world domains such as health care. In games such as Go (as played by AlphaGo) and StarCraft, the agent plays a large number of actions in the environment and learns the optimal policy. In health care, however, it is considered unethical to use humans to train RL algorithms, and such a process would also be costly and take years to complete. Moreover, we cannot observe everything happening in the body of a person. We can measure blood pressure, temperature, and some other quantities at different intervals of time, but these measurements do not represent the complete state of a patient. Similarly, patient data collected in health care may exist for one point in time and not for others. For example, chest X-rays used in the treatment of pneumonia [24] are collected before a person is infected and after the person is cured, but the RL model needs estimates of all the states the patient goes through.
This is very challenging in health care, where many facts about a patient are unknown at every time step.
The reward function is one of the most important components of RL, and finding a good reward function is challenging in many real-world applications. In health care, it is even more challenging to find a reward function that balances short-term success against overall long-term improvement. For example, in the case of sepsis [25], improvements in blood pressure over short periods may not lead to improvement in the overall outcome. Similarly, giving only a single large reward at the end of an episode (i.e., survived or died) means that a long trajectory is followed without any intermediate rewards [26,27]. It is also difficult to know which actions result in reward and which result in penalty. The major breakthroughs in deep RL have relied on simulated data equivalent to many years of actual experience [28]. This is not a problem when data can be generated by simulators, but in health care it is not possible to generate simulated data for the treatment of different diseases. The data are generally too scarce even to begin training supervised learning models, and the data that exist take effort to annotate.
Furthermore, hospitals are unwilling to share patient data, mainly for privacy reasons. All these facts make the application of deep RL to health care even more challenging.
By nature, health care data are nonstationary and dynamic [29]. For example, patients' symptoms may be recorded at different intervals of time, and different records may be stored for different patients. Over time, the objectives of treatment may also change. In the literature, several studies [30][31][32] focus on reducing overall mortality. When the condition of a person improves, the focus shifts to a different objective, such as the duration for which the virus stays in the body. Similarly, viruses or infections may change rapidly and evolve with dynamics [33][34][35] that are most probably not observed in the training data used for supervised or semisupervised learning algorithms. Decision-making in medical diagnosis is inherently sequential [36,37]: a patient visits a health care centre for the treatment of a disease, and the doctor, based on previous experience, decides on a treatment to be followed. Later, when the patient returns to the same doctor, the previously suggested treatment determines the current state of the patient and also helps the doctor decide what to do next. Existing state-of-the-art AI strategies for disease treatment [38,39] ignore this sequential nature of the decisions [40]. These AI systems make decisions on the basis of the present state of the patient only. The sequential nature of medical treatment can be effectively modelled as a Markov Decision Process (MDP) [41][42][43][44] and solved through RL. RL algorithms consider not only the instantaneous outcomes of treatment but also the long-term benefits to the patient [45].
This paper systematically explores interventions, in the form of actions, to avoid malaria. It demonstrates a real-world example of reinforcement learning in which simulated humans are trained to learn an effective technique to avoid malaria. In the literature, AI techniques are used for prediction, diagnosis, and healthcare planning; this paper takes a different approach by simulating an environment in which simulated humans use different reinforcement learning techniques to avoid malaria. A combination of interventions is explored to control the transmission of malaria and to learn techniques for malaria avoidance. The paper is organized as follows: related work is discussed in Section 2. The problem of malaria avoidance and the methodology of reinforcement learning are given in Section 3. Experiments and an analysis of their results are presented in Section 4. Concluding remarks are given in Section 5.

Related Work
Recent advancements in machine learning and big data have motivated researchers in different domains to use these algorithms in their problems. Biomedical and health care researchers benefit from these algorithms in early disease recognition, community services, and patient care. In [46], machine learning and MapReduce algorithms are used to effectively predict different diseases in disease-frequent societies. The paper reports 94.8% accuracy and a convergence speed faster than that of CNN (Convolutional Neural Network) based algorithms. Similarly, deep learning and big data techniques have been used in [47] to predict infectious diseases. The authors combined a Deep Neural Network (DNN) with Long Short-Term Memory (LSTM) and compared its performance with the Autoregressive Integrated Moving Average (ARIMA) model in predicting different diseases one week into the future; better results were achieved than with ARIMA. Automatic diagnosis of malaria makes health care services more reliable in areas where resources are limited, and machine learning techniques have been investigated for automating malaria detection. In [48], malaria classification is performed using a CNN. Similarly, in [49], a CNN has been used for malaria detection and has demonstrated promising accuracy. Deep reinforcement learning (DRL) has recently attained remarkable success, notably in complex games such as Atari, Go, and Chess. These achievements are mainly possible because of the powerful function approximation provided by DNNs. DRL has also proved effective in the medical context, and several applications of RL have been reported in medicine. For instance, RL methods have been used to develop treatment strategies for epilepsy [50] and lung cancer [51]. The authors of one study use a sepsis dataset that is a subset of the MIMIC-III dataset [25].
The action space consists of vasopressors and IV fluids, and the dosage of each drug is grouped into four bins. A Double Deep Q-Network is used for learning, and the SOFA score, which measures organ failure, is used for the reward function. A U-curve is used for evaluation: the mortality rate is expressed as a function of the difference between the dosage prescribed by the policy and the dosage actually administered.
In [19], DRL is used to develop a framework that predicts an optimal strategy for Dynamic Treatment Regimes from medical data. The paper claims that its RL model is more flexible and adaptive in high-dimensional action and state spaces than other RL-based approaches. The framework models real-world complexity to help doctors and patients make personalized decisions about treatment choices and disease progression. It combines supervised learning and DRL using DNNs. The dataset is taken from the registry database of the Centre for International Bone Marrow Transplant Research (CIBMTR). The framework achieves promising accuracy in predicting a human doctor's decisions while at the same time obtaining a high reward. In [52], an RL system is developed that encourages diabetes patients to engage in physical activity. Messages sent to patients were personalized, and the results demonstrated that participants who received messages chosen by the RL algorithm increased their amount of physical activity and their walking speed. Wang et al. [53] combine supervised RL with a recurrent neural network (SRL-RNN) in a framework for treatment recommendation.
Their experiments on the MIMIC-III dataset demonstrated that the RL-based framework can reduce the estimated mortality while providing promising accuracy in matching doctors' prescriptions. In [54], the authors describe a technique that uses RL to find the optimal policy for treating patients with chemotherapy. The authors use Q-Learning, and the action space is defined by quantifying the doses that an agent can choose from for a given time period. The dose cycle is initiated with a frequency determined by an expert. At the end of each cycle, transition states are compared.
The mean reduction in tumour diameter determines the reward function. Simulated clinical trials are used for the evaluation of the algorithm.
In [55], the authors take a different approach that uses RL techniques to encourage healthy habits instead of looking for direct treatment. In [56], the authors focus on sepsis and RL but use RL techniques for glycemic control. In [57], the authors focus on counterfactual inference and domain-adversarial neural networks. Decision-making under uncertainty is a complicated problem. Health care practitioners work under challenging constraints, with limited tools for making data-driven decisions. In [58], the authors formulate the search for an optimal malaria policy as a stochastic multiarmed bandit problem and develop three agent-based strategies to explore the policy space. Gaussian Process regression is applied to the findings of each agent, both for compression and to account for the stochastic results of simulating the spread of malaria in a fixed population. The policies generated by the simulation are compared with those of human experts in the field for direct reference. In [59], the authors expose subtleties associated with evaluating RL algorithms in health care. The focus is on the observational setting, where an RL algorithm proposes a treatment policy that is evaluated on historical data. A survey in [60] discusses the different applications of reinforcement learning in health care; it provides a systematic understanding of theoretical foundations, methods and techniques, challenges, and new insights into emerging directions. A context-aware hierarchical RL scheme [61] has been shown to significantly improve the accuracy of symptom checking over traditional systems while reducing the number of inquiries. Another study, which introduces the basic concepts of RL and how RL can be used effectively in health care, is given in [62].
A policy for malaria control using reinforcement learning is explained in [63,64]. The authors applied Genetic Algorithms [65], Bayesian Optimization [66], and Q-Learning with sequence breaking to search for an optimal policy over a few years. Their experiments demonstrated the best performance by the Q-Learning algorithm. A systematic review of agent-based models for malaria transmission is given in [67]. The paper covers an extensive array of topics spanning the transmission and intervention of malaria. Machine learning algorithms for the prediction of different diseases are studied in [68]. The authors used Decision Tree and MapReduce algorithms and claimed to achieve 94.8% accuracy. Machine learning algorithms have been used to diagnose malaria automatically in [69], where Deep Convolutional Neural Networks are used for classification. The authors in [70] discuss safety in those AI domains where deep reinforcement learning is applied to the control of automatic mobile robots. An investigation of the risk associated with malaria infection, aimed at identifying bottlenecks in different malaria elimination techniques, is discussed in [71]. Other relevant studies can be found in [72][73][74].

Methodology
Reinforcement learning (RL) [75] is a class of machine learning methods that falls between supervised and unsupervised learning, in which an agent learns by interacting with the environment. The agent performs certain actions and receives feedback from the environment. This feedback takes the form of a negative or positive reward and determines which sequence of actions is good or bad to adopt in a particular situation. As a result, the agent can perform its task efficiently without any intervention from a human. In other words, RL is a learning method in which an agent learns a sequence of actions that eventually increases its cumulative reward. The agent decides which action is the most appropriate and yields the maximum reward. An action may not give a positive immediate reward, so the long-term reward is also considered. RL has two components, the agent and the environment, as shown in Figure 1. The agent represents the RL algorithm, and the environment determines which action returns which reward. The environment starts the interaction by sending the agent a state at time $t$, $S_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of possible states. The action taken by the agent at time $t$ is $A_t \in \mathcal{A}(S_t)$, where $\mathcal{A}(S_t)$ is the set of actions available in state $S_t$. The reward received for performing that action is $R_{t+1} \in \mathcal{R}$, where $\mathcal{R}$ is the set of rewards. After one time step, the environment sends the next state $S_{t+1}$ to the agent along with the reward $R_{t+1}$. This reward helps the agent to evaluate its last action and so increases its knowledge. This process of receiving a state, acting, and receiving a reward continues until the environment sends a terminal state to the agent.
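The interaction loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToyEnv`, its five states, and its reward rule are hypothetical stand-ins and are not the simulator used in this paper.

```python
class ToyEnv:
    """Hypothetical episodic environment with states 1..5 (not the paper's simulator)."""

    def reset(self):
        self.state = 1
        return self.state

    def step(self, action):
        reward = float(action)   # made-up reward R_{t+1}
        self.state += 1          # next state S_{t+1}
        done = self.state > 5    # True once the terminal state is reached
        return self.state, reward, done


def run_episode(env, policy):
    """Roll out one episode and return the list of (S_t, A_t, R_{t+1}) tuples."""
    trajectory = []
    state, done = env.reset(), False
    while not done:
        action = policy(state)                       # agent chooses A_t from S_t
        next_state, reward, done = env.step(action)  # environment answers
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


# A fixed policy that always picks action 1, used only to exercise the loop.
trajectory = run_episode(ToyEnv(), policy=lambda s: 1)
```

The loop ends when the environment signals a terminal state, which mirrors the stopping condition in the text.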
In addition to the agent and environment, there are four components in an RL system: (i) policy, (ii) reward, (iii) value function, and (iv) model of the environment.
(1) Policy. A policy defines the behaviour of an agent at a particular instant of time. Sometimes a policy can be described by a simple function or a lookup table, whereas in other cases it may involve a lot of computation, for example, a search process. The policy is the central part of the RL agent because it alone is sufficient to determine the agent's behaviour. The policy may be stochastic, specifying a probability for every action. The policy is represented by $\pi_t$, where $\pi_t(a \mid s)$ is the probability that $A_t = a$ if $S_t = s$.
(2) Reward. A reward signal defines the goal of an RL problem. As a result of an action taken by the agent, the environment returns a number, called a reward, at every time step. The objective of the agent is to maximize the total reward it receives over time. Thus, the reward signal indicates whether an action is good or bad and determines which action is to be taken. If an action returns a low reward, then the policy will be changed to select another action in a similar situation. In general, a reward signal is a stochastic function of the state and action.
The value estimate $V(s)$ of a state $s$ is updated toward the estimate for the next state $s'$ as given in equation (1), where $\alpha'$ is a small positive fraction called the step-size parameter, which influences the rate of learning. The quantity $r + \gamma V(s')$ is called the Temporal Difference target and serves as an estimate of $V(s)$. In equation (1), $r$ represents the reward and $\gamma$ represents the discounting factor. This update rule is an example of a Temporal Difference learning method, so called because its changes are based on a difference between estimates at two different times, $V(s')$ and $V(s)$:
$$V(s) \leftarrow V(s) + \alpha' \left[ r + \gamma V(s') - V(s) \right] \qquad (1)$$
Figure 1: A typical reinforcement learning paradigm.
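As a minimal sketch of the Temporal Difference update in equation (1), the following Python function applies one update to a table of state values. The dictionary representation, the parameter defaults, and the `terminal` flag are assumptions made for illustration only.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one temporal-difference update to the value table V in place."""
    # TD target: observed reward plus the discounted estimate of the next state.
    # A terminal next state contributes no future value.
    target = r if terminal else r + gamma * V[s_next]
    # Move V[s] a fraction alpha of the way toward the target.
    V[s] += alpha * (target - V[s])
    return V[s]


# Hypothetical example: five yearly states, all values initialized to zero.
V = {year: 0.0 for year in range(1, 6)}
td0_update(V, s=1, r=10.0, s_next=2)
```

With all values at zero, the first update moves `V[1]` one tenth of the way toward the reward of 10.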

Scientific Programming
Let us assume that the sets of states and rewards are finite. Consider an environment that responds at time $t + 1$ to the action taken at time $t$. In general, this response may depend on everything that happened earlier. The complete probability distribution of the dynamics of the system can then be defined as in equation (2), for all $r$, $s'$, and all possible values of the past events, that is, the earlier states, actions, and rewards up to $S_t$, $A_t$, and $R_t$. If the Markov property holds, however, the response of the environment at $t + 1$ depends only on the state and action at time $t$, and the dynamics of the environment can be defined as in equation (3), for all $r$, $s'$, $S_t$, and $A_t$. In other words, a state or an environment has the Markov property if and only if equations (2) and (3) are equal for all $s'$, $r$, and all histories. An RL task that satisfies the Markov property is known as a Markov Decision Process (MDP). Given a state $s$ and an action $a$, the probability of the next state $s'$ together with the reward $r$ is given in equation (4). The expected reward for a state-action pair is given in equation (5), and the expected reward for a state-action-next-state triple is given in equation (6). Value functions, which are functions of states or state-action pairs, are used to estimate the performance of an agent in a given state.
This performance is computed in terms of the future rewards to be collected. The state value under a policy $\pi$ in state $s$ is denoted by $V_\pi(s)$ and is computed as shown in equation (7), where $\mathbb{E}_\pi[\cdot]$ denotes the expected value of a random variable when the agent follows policy $\pi$, at any time step $t$. Similarly, the value of taking action $a$ in state $s$ and following policy $\pi$ thereafter, denoted by $q_\pi(s, a)$, is given in equation (8), where $q_\pi$ is the action-value function for policy $\pi$. An RL problem is solved by searching for a policy that helps the agent collect the maximum possible reward over the execution of the simulation. A policy $\pi$ is considered better than or equal to another policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words, $\pi \geq \pi'$ if and only if $V_\pi(s) \geq V_{\pi'}(s)$ for all $s \in \mathcal{S}$. An optimal policy is a policy that is better than or equal to all other policies. Optimal policies are denoted by $\pi_*$. All optimal policies share the same state-value function, $V_*$, defined as $V_*(s) = \max_\pi V_\pi(s)$ for all $s \in \mathcal{S}$. They also share the same optimal action-value function, $q_*$, defined as $q_*(s, a) = \max_\pi q_\pi(s, a)$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$.
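For reference, equations (2) to (8) referred to in this subsection follow the standard finite MDP formulation. The block below is a sketch of how they are usually written, assuming the conventional definitions; the use of $\gamma$ for the discount factor and of the discounted sum of rewards inside the expectations are assumptions based on that convention.

```latex
\begin{align}
&\Pr\{R_{t+1}=r,\ S_{t+1}=s' \mid S_0, A_0, R_1, \ldots, S_{t-1}, A_{t-1}, R_t, S_t, A_t\} \tag{2}\\
&\Pr\{R_{t+1}=r,\ S_{t+1}=s' \mid S_t, A_t\} \tag{3}\\
&p(s', r \mid s, a) = \Pr\{S_{t+1}=s',\ R_{t+1}=r \mid S_t=s,\ A_t=a\} \tag{4}\\
&r(s, a) = \mathbb{E}[R_{t+1} \mid S_t=s,\ A_t=a]
  = \sum_{r} r \sum_{s'} p(s', r \mid s, a) \tag{5}\\
&r(s, a, s') = \mathbb{E}[R_{t+1} \mid S_t=s,\ A_t=a,\ S_{t+1}=s']
  = \frac{\sum_{r} r\, p(s', r \mid s, a)}{\sum_{r} p(s', r \mid s, a)} \tag{6}\\
&V_\pi(s) = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\Big|\, S_t=s\Big] \tag{7}\\
&q_\pi(s, a) = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\Big|\, S_t=s,\ A_t=a\Big] \tag{8}
\end{align}
```

The Markov property amounts to expressions (2) and (3) being equal for every history, which is what allows (4) to describe the dynamics with the current state and action alone.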
Model-based RL relies on a model that simulates the dynamics of a given environment. The model learns the probability of moving from the current state $s_0$, after taking action $a$, to the next state $s_1$. Once the transition probabilities are learned, the agent can determine the probability of entering a state given the current state and action. However, model-based algorithms become impractical as the state space and action space grow. Model-free algorithms, on the other hand, rely on trial and error to update their knowledge. Therefore, no space is required to store all combinations of states and actions. In this paper, we use model-free algorithms. RL algorithms are further classified as on-policy or off-policy. When the value update is based on the current action $a$ derived from the current policy, the algorithm is on-policy. When the update uses an action $a^*$ obtained from a different policy, the algorithm is off-policy.

Q-Learning.
A well-known algorithm in RL is Q-Learning, developed by Watkins [76]. Its proof of convergence is given by Jaakkola [77]. Q-Learning is a simple technique that can compute optimal action values without intermediate cost evaluation and without the use of a model [78]. The algorithm is model-free and off-policy and is derived from the Bellman equation shown in equation (9), where $\mathbb{E}$ denotes the expectation and $\lambda$ the discounting factor. The update equation is shown in Algorithm 1 on line 10, where $\alpha$ is the learning rate. The next action is determined by the maximum Q-value of the next state rather than by the current policy. The overall objective of the algorithm is to maximize the Q-value.
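A minimal tabular sketch of Q-Learning is given below, assuming the standard update $Q(s,a) \leftarrow Q(s,a) + \alpha[r + \lambda \max_{a'} Q(s',a') - Q(s,a)]$. The `step` callback, the epsilon-greedy exploration, the parameter defaults, and the `toy_step` chain are hypothetical choices for illustration and are not taken from Algorithm 1 or from the malaria simulator.

```python
import random
from collections import defaultdict


def q_learning(step, actions, episodes=200, alpha=0.1, discount=0.9,
               epsilon=0.1, start_state=1, seed=0):
    """Tabular Q-Learning; step(state, action) returns (reward, next_state, done)."""
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], zero-initialized
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            # Behaviour policy: epsilon-greedy with respect to the current Q.
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            r, s_next, done = step(s, a)
            # Off-policy target: the best Q-value in the next state,
            # regardless of which action is actually taken there.
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + discount * best_next - Q[(s, a)])
            s = s_next
    return Q


def toy_step(state, action):
    """Hypothetical 5-step chain: action 1 earns reward 1, action 0 earns 0."""
    reward = 1.0 if action == 1 else 0.0
    next_state = state + 1
    return reward, next_state, next_state > 5


Q = q_learning(toy_step, actions=[0, 1])
greedy = {s: max([0, 1], key=lambda a: Q[(s, a)]) for s in range(1, 6)}
```

The `max` inside the target is what makes the method off-policy: the value being learned is that of the greedy policy even though the agent explores while acting.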

SARSA.
An algorithm similar to Q-Learning is SARSA [79,80]. Whereas Q-Learning updates its estimates toward the greedy policy, SARSA is on-policy: it learns Q-values from the actions actually performed under the current policy. Algorithm 2 shows the SARSA algorithm, in which the current policy is used to select actions.
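The on-policy character of SARSA can be illustrated with a short tabular sketch. It assumes the standard SARSA update, in which the target uses the Q-value of the action that the current policy actually selects in the next state; the helper names, the epsilon-greedy policy, and the `toy_step` chain are hypothetical and do not reproduce Algorithm 2.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def sarsa(step, actions, episodes=200, alpha=0.1, discount=0.9,
          epsilon=0.1, start_state=1, seed=0):
    """Tabular SARSA; step(state, action) returns (reward, next_state, done)."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s = start_state
        a = epsilon_greedy(Q, s, actions, epsilon, rng)
        done = False
        while not done:
            r, s_next, done = step(s, a)
            # On-policy: the next action comes from the same policy that acts.
            a_next = epsilon_greedy(Q, s_next, actions, epsilon, rng)
            next_q = 0.0 if done else Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (r + discount * next_q - Q[(s, a)])
            s, a = s_next, a_next
    return Q


def toy_step(state, action):
    """Hypothetical 5-step chain: action 1 earns reward 1, action 0 earns 0."""
    reward = 1.0 if action == 1 else 0.0
    next_state = state + 1
    return reward, next_state, next_state > 5


Q = sarsa(toy_step, actions=[0, 1])
```

The only difference from a Q-Learning update is the target term: `Q[(s_next, a_next)]` for the action actually chosen, instead of the maximum over all next actions.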

Deep Deterministic Policy Gradient.
Deep Deterministic Policy Gradient (DDPG) [81,82] is an actor-critic architecture. The actor tunes the parameter x of the policy as given in equation (10). The critic evaluates the policy computed by the actor using the Temporal Difference error, as shown in equation (11). The policy decided by the actor is denoted by v. DDPG adopts the ideas of experience replay and a separate target network from the Deep Q Network (DQN) [83]. Algorithm 3 shows the DDPG algorithm.
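The two ideas that DDPG borrows from DQN, experience replay and a separate target network, can be sketched without any deep learning library. The code below is an illustration under stated assumptions: the class and function names, the flat list of weights, and the soft-update rate `tau` are hypothetical, and the actor and critic networks themselves are omitted.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def soft_update(target_weights, online_weights, tau=0.01):
    """Move each target weight a fraction tau toward its online counterpart."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]


# Hypothetical usage: keep only the three most recent transitions.
buffer = ReplayBuffer(capacity=3)
for year in range(5):
    buffer.store(year, 0, 0.0, year + 1, False)
batch = buffer.sample(2)
```

Keeping the target weights as a slowly moving copy of the online weights stabilizes the Temporal Difference targets that the critic is trained on.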

Simulation and Discussion
In this section, we present the results of the algorithms explained in Section 3, obtained in a simulated human population, and examine which algorithm performs better at protecting humans from disease. For the evaluation, we need an environment with different states, actions, and agents (representing the human population) looking for the best policy to avoid diseases such as malaria, flu, and HIV. Results are shown for malaria avoidance only, but a similar environment with sufficient information can be used for the avoidance of other diseases such as flu, HIV, and dengue. An environment comprising humans, mosquitoes, and other factors that influence the transmission of malaria to humans is shown in Figure 2. The box on the left contains factors relevant to humans, and the box on the right contains factors pertaining to mosquitoes. Factors that influence the disease are shown inside the arrows linking the two boxes. Environmental factors and interventions are shown at the top and bottom of the boxes. The IBM Africa research team has taken steps to control malaria by developing a world-class environment for the distribution of bed nets and repellents. Their goal is to develop a custom agent that helps identify the policies with the best rewards in the simulation environment. Our work leverages the environment developed by IBM Africa research for the reinforcement learning competition on hexagon-ml (https://compete.hexagon-ml.com/practice/rl_competition/38/), where an agent learns the best policy for the control of a disease, namely, malaria. The environment provides stochastic transmission models for malaria, with which researchers can evaluate the impact of different malaria control interventions. In the environment, an agent may explore optimal policies to control the spread of malaria.
A diagram of the environment developed by Hexagon-ML for finding the best policy for avoiding malaria is given in Figure 3. The environment spans five years, and every year is a state. In every state, an action is taken in the form of ITN and IRS coverage.
States are represented as $S \in \{1, 2, 3, 4, 5\}$, where each number is the index of the year. We aim to make one-shot policy recommendations for a simulated intervention period of 5 years. The main control methods used in different regions are mass distribution of long-lasting ITNs, IRS with pyrethroids, and the prompt and [...] component is the use of seasonal spraying, which defines the proportion of the population covered by this intervention ($a_{IRS} \in (0, 1]$). Seasonal spraying is performed by alternating the intervention between April and June every year in different regions. The policy decision is framed as the proportion of the simulated population to be covered by a particular intervention; the policy space $A$ is designed through [...]. Health care organizations should be able to explore all possible sets of actions for appropriate malaria interventions within their populations.

Algorithm 3: DDPG.
(1) Randomly initialize critic network $Q(s, a \mid \theta^{Q})$ with weights $\theta^{Q}$
(2) Randomly initialize actor $\mu(s \mid \theta^{\mu})$ with weights $\theta^{\mu}$
(3) Initialize target network $Q'$ with weights $\theta^{Q'} \leftarrow \theta^{Q}$
(4) Initialize target network $\mu'$ with weights $\theta^{\mu'} \leftarrow \theta^{\mu}$
(5) Initialize replay buffer $R$
(6) for every episode do
(7)     Initialize a random process $\mathcal{N}$ for exploration
(8)     Get initial observation state $s_1$
(9)     for every step in the episode do // Repeat until $s$ is terminal
(10)        Select action $a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$ as per the current policy and exploration strategy
(11)        Perform action $a_t$ and observe reward $r_t$ and new state $s_{t+1}$
(12)        Store $(s_t, a_t, r_t, s_{t+1})$ in $R$
(13)        Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
(14)        Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
(15)        $L = \frac{1}{N} \sum_i (y_i - Q(s_i, a_i \mid \theta^{Q}))^2$ // Update rule for critic to minimize the loss
(16)        Update the actor policy using the sampled policy gradient
(17)        $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\theta^{Q'}$ // Update rule for target networks, with soft-update rate $\tau$
(18)        $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\theta^{\mu'}$
These policies include a mix of actions, such as the distribution of ITNs, IRS, larvicide in water, and vaccination for malaria control. The space of possible policies for the control of malaria cannot be explored completely or efficiently by health care experts without an adequate decision support system. The simulation environment handles the distribution of the interventions in the simulated population. The agent is in charge of the complex actions of targeted interventions, which have not been reported previously. Although the action space is finite (i.e., there is a finite number of people in the simulation environment), its size grows exponentially as more interventions are added. The computation time of the simulation also grows linearly with the size of the population. Therefore, an exhaustive exploration of the entire action space becomes impossible as the complexity approaches that of a real-world equivalent simulation. The agent receives different rewards during the learning process, and the idea of learning is to collect as much reward as possible during the execution of the experiment. These rewards are unbounded and are represented by $R_\pi \in (-\infty, +\infty)$, where $\pi$ is the policy. Every policy $a_i$ is associated with a reward $R_\theta(a_i)$, where $\theta$ is a stochastic parameterization of the simulation that produces a random distribution of parameters for the simulated environment. The environment is executed for 100 episodes, and the rewards are collected. An episode consists of five consecutive years. The rewards collected by the different algorithms are shown in Figure 4. The result of the random selection algorithm, in which there is no learning, over 100 episodes is given in Figure 4(a). In random policy learning, every time an episode finishes, the environment is initialized with different random states, and a different policy is tried at random to move from one state to another and collect rewards. No learning is involved in this algorithm, and the experiment is performed only as a baseline for comparison. The rewards collected by Q-Learning are shown in Figure 4(b).
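To make the growth of the policy space and the random baseline concrete, the sketch below counts the policies on a discretized action grid and runs a random policy for 100 episodes of five years each. Everything in it is an assumption for illustration: the five coverage levels, the `toy_reward` function, and its noise model are made up and do not reproduce the hexagon-ml environment or its rewards.

```python
import itertools
import random

# Hypothetical discretization: each coverage level lies in (0, 1].
LEVELS = [0.2, 0.4, 0.6, 0.8, 1.0]
ACTIONS = list(itertools.product(LEVELS, LEVELS))  # (a_ITN, a_IRS) pairs
YEARS = 5


def policy_space_size(n_levels, n_interventions, n_years):
    """Number of distinct multi-year policies on a discretized action grid."""
    return (n_levels ** n_interventions) ** n_years


def toy_reward(year, action, rng):
    """Stand-in for the simulator: an arbitrary, noisy, made-up reward."""
    a_itn, a_irs = action
    return 10.0 * a_itn * (1.0 - a_irs) + rng.gauss(0.0, 1.0)


def random_policy_baseline(episodes=100, seed=0):
    """Collect the total reward of each episode under uniformly random actions."""
    rng = random.Random(seed)
    episode_rewards = []
    for _ in range(episodes):
        total = 0.0
        for year in range(1, YEARS + 1):
            action = rng.choice(ACTIONS)  # no learning: pick any action
            total += toy_reward(year, action, rng)
        episode_rewards.append(total)
    return episode_rewards


rewards = random_policy_baseline()
```

Even this coarse grid of 5 levels for 2 interventions over 5 years yields 25 to the power 5, that is 9,765,625, candidate policies, and each additional intervention multiplies the per-year choices again, which is why exhaustive search quickly becomes infeasible.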
Compared to the random search algorithm, this algorithm shows improvement because the agent learns through the Q-Learning mechanism to collect rewards. The result of reward collection with the SARSA algorithm is shown in Figure 4(c). The SARSA-trained agent searches for a policy to avoid malaria in the simulated human environment and shows improvement over the simple Q-Learning algorithm. An even more sophisticated algorithm, DDPG, is also used in the environment to collect rewards, and its results are shown in Figure 4(d).
This algorithm shows improvement over the other three algorithms and demonstrates that deep learning methods can potentially achieve better results in reinforcement learning.
We have combined the results of the algorithms trained in this paper in Figure 5. In the random search process, there is no learning, and therefore the reward is not maximized. In the other algorithms, namely Q-Learning, SARSA, and DDPG, learning is involved, and therefore the reward increases. The overall rewards collected by the different algorithms are combined in one figure (Figure 5(b)). The maximum reward is collected by DDPG, the most complex of the algorithms compared. A comparison of the three learning algorithms is shown in Table 1.
This comparison shows the best policy obtained by operating in the environment to avoid malaria and the reward collected by performing that policy. The table demonstrates that DDPG outperforms the traditional learning algorithms.

Conclusion
Since the development of human civilization, humans have always sought to improve the quality of life from different perspectives. We look for the most comfortable accommodation, fast and secure transport, clean and healthy food, comfortable clothes, and many other things. But because of environmental changes and different actions taken by humans, different pathogens can enter the human body and affect the quality of life. For instance, malaria, flu, HIV, and dengue are diseases that affect not only a single individual but potentially the whole population, as infections spread through it. Over time, humans have learned different methods to treat these diseases. Doctors prescribe medicine to treat diseases and thereby keep them under control. The problem is that a doctor's decision requires a huge amount of knowledge and experience to cure a disease effectively. We think it is possible to minimize this human effort by exploring AI-based solutions. Different AI-based solutions have already been explored by researchers in the form of supervised learning, such as ANN, KNN, and SVM. However, the problem with supervised learning is that the model is trained on existing data to make similar decisions when similar data are presented at test time.
There is a huge gap in generalizing such solutions further. Therefore, unsupervised learning algorithms and reinforcement learning are becoming popular. In this paper, we have explored reinforcement learning-based algorithms, in which an agent interacts with the environment to get feedback and improves its state of knowledge. We have experimented with three reinforcement learning algorithms: Q-Learning, SARSA, and DDPG. All these algorithms perform better than random search because learning is involved. Q-Learning and SARSA are based on traditional methods of reinforcement learning. However, because of the popularity of deep learning, researchers are interested in introducing deep learning into reinforcement learning. DDPG is a deep learning-based algorithm. Our experiments have demonstrated that deep learning-based

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.