Autonomous Decision-Making for Aerobraking via Parallel Randomized Deep Reinforcement Learning

Aerobraking is used to insert a spacecraft into a low orbit around a planet through many passages through its atmosphere. These atmospheric passages are challenging because of the high variability of the atmospheric environment. This paper develops a parallel domain-randomized deep reinforcement learning architecture for autonomous decision-making in stochastic environments such as aerobraking atmospheric passages. In this context, the architecture is used to plan aerobraking maneuvers that avoid thermal violations during the atmospheric passages while targeting a final low-altitude orbit. The parallel domain-randomized deep reinforcement learning architecture is designed to account for large variability of the physical model as well as uncertain conditions. The parallel approach also speeds up the training process for simulation-based applications, and domain randomization improves the generalization of the resulting policy. This framework is applied to the 2001 Mars Odyssey aerobraking campaign; with respect to the 2001 Mars Odyssey mission flight data and a Numerical Predictor Corrector (NPC)-based state-of-the-art heuristic for autonomous aerobraking, the proposed architecture outperforms the state-of-the-art heuristic with a 97.5% decrease in the number of thermal violations. Furthermore, it yields a 98.7% reduction in the number of thermal violations with respect to the Mars Odyssey mission flight data and requires 13.9% fewer orbits. Results also show that the proposed architecture can learn a generalized policy in the presence of strong uncertainties, such as aggressive atmospheric density perturbations, different atmospheric density models, and different simulator maximum step sizes and error tolerances.


I. INTRODUCTION
Sustained interest in Mars exploration has resulted in a growing number of reentry vehicles and spacecraft inserted into orbit around the planet [1], [2], [3], [4], [5], [6], [7], [8]. Currently, orbiters in the Martian system are vital for conducting science investigations from orbit through remote sensing as well as maintaining relay communication with Mars surface missions. However, in the context of future human missions to Mars, demands on Mars orbiters may surge. Indeed, new infrastructure capabilities may be required to enable and support future human activities; for instance, a Mars global positioning network, weather and observation satellite constellations, and formation-flying science missions will be vital. This infrastructure may be essential for future Mars science and exploration missions.
There are three options to insert a spacecraft into a stable orbit on arrival in the Mars system: 1) a fully propulsive maneuver, which requires a large propellant mass for high-thrust propulsion systems or a long duration for low-thrust propulsion systems; 2) aerocapture, which has never been performed and requires the spacecraft to be fully encapsulated in a protective aeroshell [9]; or 3) aerobraking, which uses a small propulsive maneuver to insert into a high-energy orbit and then atmospheric drag to deplete the spacecraft's energy via multiple passes through the atmosphere (Fig. 1) until the desired orbit is reached. With respect to a fully propulsive insertion maneuver, aerobraking enables large propellant savings. However, the propellant mass savings come at the cost of increased mission risk [10]. The spacecraft passes multiple times through the thermosphere, where the flow is rarefied and the temperature, velocity, and pressure change considerably; the intrinsic, high variance of atmospheric density at these altitudes can cause the density to reach magnitudes able to damage the spacecraft [11]. For instance, the Mars thermosphere density can vary on scales of 20 km in latitude and longitude [12], and, depending on latitude and season, the density can vary from one orbit to another by 15% to 100% [13]. Density variations of large magnitude are hazardous to mission safety because they can generate temperatures that exceed the limits of the spacecraft components.
As stated in the NASA Space Technology Roadmaps [14], autonomy is particularly beneficial when efficient decision-making must occur faster than communication constraints allow or when autonomous decision-making reduces the overall cost. Currently, aerobraking missions are managed by human ground teams on Earth; the navigation team is charged with planning and designing aerobraking maneuvers (ABMs) to relocate the periapsis if thermal violations are predicted. For this reason, aerobraking requires around-the-clock operations, ground station coverage, and Deep Space Network (DSN) use over a long time period (2-11 months) [15]; also, ABMs cannot be planned frequently for low-apoapsis orbits. Autonomous aerobraking maneuver planning (on-board or off-board) may reduce operational costs, free the mission from ground-team costs, and reduce the likelihood of faults caused by human error [10]. Furthermore, performing autonomous decision-making onboard would reduce the need for DSN coverage and its associated cost and allow more aggressive aerobraking conditions (lower periapsis altitudes) without increasing risk, resulting in shorter and more efficient campaigns. A ground-station-in-the-loop approach requires about four hours to complete an aerobraking maneuver design; this threshold is the sum of the time needed to communicate the tracking data, process the data, make decisions, and upload commands to the spacecraft. When the orbit period decreases below this time range in the later phases of an aerobraking campaign, only less efficient aerobraking passages can be performed [16]. With onboard autonomous decision-making, ABMs could be performed more frequently and with smaller impulses, decreasing the criticality of each maneuver and possibly saving propellant.
Several works regarding autonomous aerobraking, mostly led by Tolson, have been developed [17], [18], [19], [20], [21], [22]. Also, a major investigation conducted by NASA concluded with two extensive reports and the generation of the Autonomous Aerobraking Development Software (AADS) [20], [23], [24]. However, the current literature on autonomous aerobraking focuses only on the development of models able to accurately predict the flight performance of a single orbit, a drag passage, or a whole aerobraking campaign. Several studies focused on the development of models able to accurately estimate the trajectory of the spacecraft on the flight computer over the whole aerobraking campaign [10], [12], [16]; other studies focused, instead, on the prediction of the atmospheric density and the implementation of models to describe complex atmospheric behavior [22], [25], [26], [27]; some other studies analyzed the thermal environment [13], [28]. A more recent work focuses on implementing a different concept of operations in the drag passage, which uses articulated solar panels to gain trajectory control authority [29]. Preliminary efforts by the AADS team formulated heuristics to implement aerobraking planning on board [21], and, to date, no studies in the literature focus on implementing an efficient decision-making process to avoid thermal violations or to target a final orbit.
In response to these challenges, we develop a parallel domain-randomized deep reinforcement learning (PR-DRL) architecture to enable an autonomous decision-making capability for aerobraking maneuvers. This architecture is used in combination with the AeroBraking Trajectory Simulator (ABTS), an aerobraking simulator [30], to develop an autonomous decision-making process for aerobraking campaigns, which defines when, how, and whether ABMs should be performed. This architecture could enable an onboard capability or be used for mission design purposes or by the ground station navigation team. This work leverages advancements in machine learning (ML), specifically reinforcement learning algorithms, to improve and automate aerobraking at the campaign level, filling a significant gap in the literature and existing technology.
Reinforcement learning (RL) is a subfield of ML used to learn optimal decisions over time. The broad applicability of the RL approach has caused its use to increase significantly in many scientific and engineering fields. In the space applications field, researchers have applied ML approaches to tackle complex problems, such as the development of interplanetary trajectories with low-thrust engines [31]; orbit classification [32]; hovering around and landing on asteroids [33], [34], [35], [36] and estimation of their gravity fields [37]; and classification of debris [38] and prediction of their orbits [39], [40]. Recently, an increasing number of works have used deep reinforcement learning (DRL) algorithms for spaceflight campaign design [41] and high-level spacecraft operations [42], [43]. Furthermore, RL and DRL have been applied to generate guidance and control for proximity operations [44], [45] and for several hypersonic applications [46], [47], [48], [49]. All these studies obtained performance improvements with respect to the current state of the art.
However, DRL is also prone to overfitting, in which the learned policy specializes in a particular task and struggles to achieve policy generalization; generalization is the ability to learn a task beyond the specifics of the training environment. Generalization in DRL aims to produce policies that perform well in unseen situations, avoiding overfitting to their training environments. Achieving generalization is particularly important if DRL policies are to be deployed in real-world applications, especially when training on simulation-based data, where models may not match reality. Generalization is even more critical when an erroneous decision might lead to the failure of a space mission. Studies focusing on generalization are booming, and one research direction that attempts to tackle generalization is domain randomization (DR), which consists of randomizing the environment and the dynamics across a distribution of parameters [50], [51], [52], [53]. However, as the training environments are randomized more broadly, optimization becomes increasingly difficult, thus requiring more samples [54]. In this work, we develop a parallel randomized deep reinforcement learning (PR-DRL) architecture, which uses DR in combination with a parallel dynamics-simulator architecture to increase sample efficiency while achieving generalization. Furthermore, a generalization sensitivity study is carried out; this study is essential to define the safety and performance of the mission in environments far from the expected one.
Finally, the proposed architecture is applied to the last three phases of the 2001 Mars Odyssey aerobraking mission, which are also the most hazardous because they do not allow for a continuous ground-in-the-loop approach and the spacecraft orbit decay may become more aggressive and risky. The performance of the proposed approach is compared with the 2001 Mars Odyssey aerobraking mission flight data and the AADS heuristic. The results and comparison allow defining fundamental attributes of onboard decision-making capabilities for aerobraking purposes.
The contributions of this work are as follows: 1) The development of a novel parallel domain-randomized DRL architecture for computationally intensive simulation-based applications. This architecture enables faster learning with respect to a serial approach and demonstrates the ability to learn a more generalized policy in a highly uncertain environment.
2) The delineation of a procedure to convert a general aerobraking campaign into a Markov decision process (MDP) framework, including the design of a 3-D continuous running reward function in terms of action and spacecraft state.
3) The definition of a generalization sensitivity study, which uses out-of-distribution generalization environments to evaluate the algorithm's performance and the mission's safety.
4) The definition of the advantages of an autonomous decision-making capability for aerobraking missions and the implications of its use from a mission performance perspective.
The rest of this article is organized as follows. Section II introduces relevant models based on the 2001 Mars Odyssey mission and ABTS, which is used to simulate the aerobraking dynamics and the environment and dynamics uncertainties, and discusses previous attempts to develop and include onboard decision-making capability for aerobraking. Section III presents the parallel domain-randomized architecture and the employed neural network architecture. Section IV describes how to frame an aerobraking campaign as an MDP, in terms of action space, state space, terminal reward, and running reward function. Section V shows the results for the case study in terms of training and testing, compares the testing results with the state-of-the-art heuristic and the Odyssey mission flight data, and reports the results in terms of generalization. Finally, Section VI concludes this article.

II. MODEL
Two different environments shape aerobraking orbits: an in-space portion of the orbit, in which the spacecraft experiences only the gravity of the planet, and an atmospheric portion of the orbit, the drag passage, in which both drag and gravity forces act on the spacecraft. Both portions are evaluated using ABTS [30]. Previous work proposed using analytical Keplerian motion formulas for the in-space portion of the orbit to decrease the computational time required to simulate the dynamics [55]. However, due to the short-term effects of the J2 perturbations on the semimajor axis and eccentricity, the energy depleted after a drag passage was often overestimated. In this work, J2 perturbations are included, and the in-space flight segments are evaluated through numerical integration of the equations of motion. To decrease the computational cost, a less restrictive maximum integration step is used for the in-space segments.

A. AeroBraking Trajectory Simulator
The propagation of the orbits is evaluated using ABTS, an open-source simulator written in Python [30], [56], that has been validated against Mars Odyssey data. Specifically, ABTS is used to model an entire aerobraking orbit at Mars, starting at apoapsis and ending at the following apoapsis. ABTS integrates the equations of motion in (1) using solve_ivp, a SciPy library function, with a Runge-Kutta method of order 5(4) (RK45), which controls the error assuming the accuracy of the fourth-order method but takes steps using the fifth-order accurate formula. The drag passage is run limiting the step size to 1 s, with an absolute tolerance of 1 × 10⁻⁷ and a relative tolerance of 1 × 10⁻⁵, while the in-space portion of the orbit is run using a maximum step size of 10 s, with an absolute tolerance of 1 × 10⁻⁵ and a relative tolerance of 1 × 10⁻⁴. However, Falcone and Putnam [30] show how the integration method and integration step size influence the results. For this reason, in the generalization analysis, the policy is also tested with a different integrator accuracy, maximum step size, and integration method, using the same values reported in the reference. This represents a substantial addition with respect to previous work, in which the influence of the integration step was not considered or evaluated [55]. Testing the policy in a variable simulator environment is fundamental to verify the reliability of the model results. Specifically, the learned policy is tested using the backward differentiation formula (BDF) method, an implicit method for stiff problems, with a drag-passage maximum step size of 0.1 s, an absolute tolerance of 1 × 10⁻¹¹, and a relative tolerance of 1 × 10⁻⁹, while the in-space portion of the orbit is run using a maximum step size of 5 s, with an absolute tolerance of 1 × 10⁻¹² and a relative tolerance of 1 × 10⁻¹⁰.
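To make these integrator settings concrete, the following is a minimal sketch of how the drag-passage propagation could be configured with SciPy's solve_ivp; the dynamics and density law here are simple placeholders (point-mass two-body gravity plus drag with an illustrative exponential atmosphere), not the actual ABTS force models.

import numpy as np
from scipy.integrate import solve_ivp

MU_MARS = 4.2828e13   # Mars gravitational parameter [m^3/s^2]
R_MARS = 3389.5e3     # mean Mars radius [m]

def rho_exp(r_norm):
    # Illustrative exponential density law (placeholder for MarsGRAM 2010).
    h = r_norm - R_MARS
    return 1.0e-7 * np.exp(-(h - 100e3) / 7.5e3)   # [kg/m^3], assumed scale height

def eom(t, y, beta_inv):
    # Placeholder point-mass dynamics: two-body gravity plus drag.
    # y = [x, y, z, vx, vy, vz]; beta_inv = Cd*A/m [m^2/kg].
    r, v = y[:3], y[3:]
    rn = np.linalg.norm(r)
    a_grav = -MU_MARS * r / rn**3
    a_drag = -0.5 * rho_exp(rn) * beta_inv * np.linalg.norm(v) * v
    return np.concatenate((v, a_grav + a_drag))

# Hypothetical periapsis state at 110 km altitude with a 4.8 km/s velocity.
y0 = np.array([R_MARS + 110e3, 0.0, 0.0, 0.0, 4800.0, 0.0])

# Drag-passage settings quoted above: RK45, max step 1 s, atol 1e-7, rtol 1e-5.
sol = solve_ivp(eom, (0.0, 1200.0), y0, args=(2.0e-2,),
                method="RK45", max_step=1.0, atol=1e-7, rtol=1e-5)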
As in previous work, ABTS uses an inverse-square gravity model with J2 perturbation. The Mars atmosphere is modeled with MarsGRAM 2010, run online in the simulation, which provides atmospheric density, temperature, and winds, as well as correlated uncertainties, as a function of altitude, latitude, longitude, time of day, and season. The spacecraft is modeled as a rectangular prism (bus) and flat plates (solar panels), based on the 2001 Mars Odyssey spacecraft. Spacecraft aerodynamic coefficients are evaluated through the flow-of-rarefied-gases theory of Schaaf and Chambre for the combination of three flat plates [11]. Finally, the heat rate on the solar panels is evaluated for a free-molecular-flow flat plate perpendicular to the flow, as described by Schaaf and Chambre [11]. Table I summarizes the planetary parameters used throughout the simulation, while the vehicle parameters are reported in Table II. Finally, ABMs are modeled as finite burns with a thrust of 4 N performed at apoapsis, whereas in previous work the ABMs were modeled as impulsive maneuvers [55]. This modification is fundamental to develop a more realistic aerobraking environment.
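As an illustration of the free-molecular heat-rate evaluation mentioned above, the sketch below uses the common hypersonic-limit simplification q̇ ≈ ½αρV³ for a flat plate normal to the flow; the full Schaaf and Chambre expressions used by ABTS also depend on the molecular speed ratio and wall temperature, so this is only an order-of-magnitude approximation.

def free_molecular_heat_rate(rho, v_rel, accommodation=1.0):
    # Hypersonic free-molecular simplification for a plate normal to the flow:
    # q = 0.5 * a * rho * V^3 [W/m^2]; the full Schaaf-Chambre relations also
    # involve the molecular speed ratio and the wall temperature.
    return 0.5 * accommodation * rho * v_rel**3

# Example: rho = 4e-8 kg/m^3 near 110 km, V = 4.8 km/s -> about 0.22 W/cm^2,
# i.e., the order of magnitude of the heat rate corridors discussed later.
q_w_per_cm2 = free_molecular_heat_rate(4e-8, 4800.0) * 1e-4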

B. 2001 Odyssey Mission
The baseline mission of this study is the 2001 Mars Odyssey mission [57]. Mars Odyssey was the second mission to perform aerobraking at Mars; the Odyssey campaign is also the fastest of all aerobraking campaigns performed at Mars. Since this study shows how autonomous ABM planning can be more efficient in an aggressive aerobraking mission environment, the Odyssey mission is chosen as a baseline due to its more aggressive aerobraking timeline. The Mars Odyssey aerobraking campaign is divided into five phases: 1) Walkin, 2) Main Phase I, 3) Main Phase II, 4) Endgame, and 5) Walkout. The Main Phase switches from I to II when the argument of periapsis (AOP) crosses the 90° line. This crossing represents the point at which the periapsis altitude, which had been naturally increasing, instead starts to decrease. For this reason, from Main Phase II onward, the Mars Odyssey aerobraking became more hazardous; if not controlled, the spacecraft could reach low altitudes that could result in excessively high heat rates on the solar panels. Also, from the start of Main Phase II, the orbital period decreased to less than 4.7 h; after the start of the Endgame, the orbital period decreased to less than 3.4 h. As already mentioned, the 4-h limit represents the threshold for completing a ground-station-in-the-loop scheduling and planning approach for each orbit. Although aerobraking became riskier from Main Phase II, the Walkout phase was the most hazardous, since an uncontrolled apoapsis decay could result in surface impact within a few orbits. For the Odyssey mission, the heat rate limit was strictly reduced so that the spacecraft would not impact Mars within a day if it remained uncontrolled over that period. Table III reports the flight data for each phase. While some data were explicitly reported in [57], others, such as the apoapsis range, the ΔV budget for the Walkin and Walkout phases, and the number of thermal violations, are calculated from information reported in the same reference. Specifically, the apoapsis range was evaluated for a periapsis altitude fixed at 110 km.
This study focuses on the Mars Odyssey mission starting from Main Phase II. While previous results analyzed the effect on short, low-apoapsis aerobraking phases (partial Endgame and Walkout) [55], including a longer aerobraking segment is essential to evaluate the mission advantages with respect to the current state of the art and to compare with real flight data [57]. A longer aerobraking segment means more complex decision-making for ABM planning, specifically in a strongly uncertain scenario.

C. Inclusion of Uncertainty
To achieve a robust policy, the initial conditions and the atmospheric environment are dispersed using uniform distributions during training. Specifically, the initial conditions are perturbed because the beginning of an aerobraking campaign is closely connected with uncertainties inherited from the prior Earth-Mars transfer trajectory. Also, the AOP is allowed to vary broadly, specifically between 60° and 90°, because the rate of the natural decrease of the periapsis from one orbit to the next is strongly affected by the AOP value. Varying the AOP teaches the agent how to react to the periapsis decay in different conditions. The right ascension of the ascending node (RAAN) is also free to vary between 110° and 120° to allow flexibility during the mission and, together with the AOP variation, to decouple the periapsis location from the planet latitude and longitude. As already mentioned, atmospheric density varies significantly with the latitude and longitude of the drag passage. For this reason, large dispersions of AOP and RAAN could be beneficial for creating a robust policy. However, an excessively broad range of initial AOP and RAAN has the effect of anchoring the atmospheric density and the periapsis heat rate of the first drag passage of an aerobraking campaign to a wide variety of initial values. If the initial heat rate is too far from the initial corridor, the periapsis location must be modified accordingly to meet the desired heat rate corridor. This process can be achieved only through a gradual, multistep change of the periapsis. Indeed, this is usually accomplished during the Walkin phase, in which the periapsis is gradually decreased until the spacecraft encounters the sensible atmosphere. However, this process is considered beyond the scope of the study. For this reason, AOP and RAAN are uniformly perturbed within a reasonable range.
The initial orbital inclination is dispersed between 88.6° and 98.6°, which is ±5° of the initial inclination of the 2001 Mars Odyssey mission (93.6°) [57]. Generally, the spacecraft aerobraking inclination is strictly tied to the final orbit requirements, and the vast majority of Mars orbiters target polar orbits; for this reason, the inclination is not perturbed beyond this range. Also, the hypersonic rarefied aerodynamic coefficients are dispersed within ±10% of their nominal values. Dispersing the hypersonic drag and lift coefficients is necessary both because of the lack of knowledge of the vehicle model and to avoid developing an association between the drag-passage forces and the apoapsis radius difference between the atmospheric interface (AI) and the atmospheric exit (AE). In addition, the perturbed atmospheric density and winds are provided directly by MarsGRAM 2010; however, the seed value of the random number generator is changed for every aerobraking campaign. Also, the MarsGRAM rpscale parameter, the random density scale factor, and the rwscale parameter, the random wind scale factor, were both set to 1. Finally, no dust storms were simulated (intens=0) [58]. However, dust storm effects were evaluated in the generalization analysis. The sensitivity of the proposed model to fundamentally complex aerobraking environments is essential to verify the reliability of the proposed approach. This inclusion and, in general, the generalization analysis represent one of the main contributions of the current work with respect to previous work [55].
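A minimal sketch of how the per-episode domain randomization described above could be implemented is shown below; the dictionary keys and the sampling helper are illustrative and do not correspond to ABTS's actual interface, but the uniform ranges are those quoted in the text.

import numpy as np

def sample_initial_conditions(rng):
    # Per-episode uniform dispersions quoted above; keys and structure are illustrative.
    return {
        "aop_deg": rng.uniform(60.0, 90.0),        # argument of periapsis
        "raan_deg": rng.uniform(110.0, 120.0),     # right ascension of ascending node
        "inc_deg": rng.uniform(88.6, 98.6),        # inclination (93.6 deg +/- 5 deg)
        "cd_scale": rng.uniform(0.9, 1.1),         # +/-10% drag-coefficient dispersion
        "cl_scale": rng.uniform(0.9, 1.1),         # +/-10% lift-coefficient dispersion
        "marsgram_seed": int(rng.integers(1, 2**31 - 1)),  # unique MarsGRAM seed
    }

rng = np.random.default_rng()
episode_initial_conditions = sample_initial_conditions(rng)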

D. State-of-the-Art AADS Heuristic
During a major investigation conducted by NASA to develop the AADS, a heuristic for planning ABMs on board was proposed, as described by the logic in Fig. 2. This logic aims to maintain the heat rate in, or close to, the defined heat rate corridor using a numerical predictor-corrector (NPC) approach. After each drag passage, the onboard computer predicts the heat rate for the next drag passage; if the predicted heat rate is in the corridor, no ABM is designed or performed. However, if the predicted heat rate is outside the corridor, the onboard computer designs an ABM that attempts to place the next heat rate on the corridor target. This heuristic requires considerable computational power for each prediction (occurring at each aerobraking orbit) and does not guarantee that the heat rate reaches the corridor target after the designed ABM is performed. Specifically, this heuristic becomes more challenged at low orbit periods, where the aerobraking environment is more hazardous due to the intense apoapsis decay. Finally, this heuristic is not able to target a final orbit. The AADS heuristic is used for comparison; specifically, ABTS is used to simulate both the predicted and the actual drag passages using the distributions presented in Table IV. The predicted and actual drag passages are simulated using two different seeds, initialized at the beginning of the aerobraking campaign. Finally, the predicted environment and heat rate are used to define the ABM ΔV required to hit the corridor target, set to 0.15 W/cm². The ABM ΔV is calculated through the bisection method. The presented logic is used to evaluate the results presented in Section V.
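The following sketch illustrates the bisection step of the heuristic under the assumption that the predicted peak heat rate decreases monotonically with the ABM ΔV; the predictor function, ΔV bounds, and tolerances are placeholders rather than the AADS implementation.

def abm_delta_v_bisection(predict_peak_heat_rate, q_target=0.15,
                          dv_lo=-0.5, dv_hi=0.5, tol=1e-3, max_iter=50):
    # Bisection on the apoapsis ABM Delta-V [m/s] so that the predicted peak heat
    # rate [W/cm^2] of the next drag passage hits the corridor target.
    # predict_peak_heat_rate(dv) is assumed to wrap an NPC propagation (e.g., ABTS)
    # and to decrease monotonically with dv (raising periapsis lowers the heat rate).
    f = lambda dv: predict_peak_heat_rate(dv) - q_target
    lo, hi = dv_lo, dv_hi
    if f(lo) * f(hi) > 0:
        # Target not bracketed by the assumed bounds: return the closer endpoint.
        return lo if abs(f(lo)) < abs(f(hi)) else hi
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if abs(f(mid)) < tol:
            return mid
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)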

III. PARALLEL RANDOMIZED DEEP REINFORCEMENT LEARNING ARCHITECTURE
A. Tabular Q-Learning
A tabular Q-learning algorithm was previously used by the authors to define the aerobraking maneuver scheme in an aerobraking mission campaign [55]. This first effort shaped the MDP proposed in this work. The analysis of the preliminary tabular Q-learning results provided valuable information that shaped the proposed DRL approach. For instance, results showed that the agent had difficulty learning how to avoid thermal violations because many actions resulted in one. This effect, combined with an overly large action space [the agent chose among 51 periapsis altitudes (actions) at each apoapsis] and a tendency of the agent to prefer small periapsis variations (frequent small impulses over large impulses), led to redefining the action space in terms of the actual ABMs. Moreover, the tabular approach used a simple reward function; the results showed that the running reward greatly influences the learning process. Based on these results, in the DRL approach, a greater effort was devoted to designing the reward function. Finally, tabular Q-learning was trained only on nominal flight condition data, ignoring any source of variability. However, the complexity and risk of aerobraking lie in the strong variability of the atmospheric environment; the ability of the policy to complete an episode under different conditions is considered vital. Furthermore, the tabular Q-learning algorithm stores the Q-value of each state-action pair in a table. This approach becomes increasingly taxing as the number of states grows. Also, increasing the number of states makes learning increasingly difficult for a Q-learning algorithm, due to the decreasing probability of visiting a specific state-action pair. For these reasons, employing a DQN algorithm, which uses a neural network (NN) as a function approximator, becomes indispensable. NNs can handle large state-space problems due to their ability to generalize earlier experiences to unseen states.

B. Parallel Randomized Deep Reinforcement Learning
To account for the perturbed environment and speed up the learning process, a parallel architecture is proposed. For some simulation-based applications, as in this case, the bottleneck of the learning process is the simulation time. Specifically, in this case, the simulation time increases with the drag-passage duration, which increases as the orbital period and apoapsis radius decrease. For this reason, an architecture inspired by Gorila (developed by Google DeepMind) is designed [59]. The architecture is presented in Fig. 3; the parallel architecture has two main components: 1) parallel workers that generate new behaviors and 2) a master that is trained from the replay memory and provides actions to the workers. Both the workers and the master run on a single compute node, using different CPU cores. The master core is connected to a GPU, which is used to train the Q-network. The master has exclusive access to the network parameters and the replay memory, asynchronously receives the workers' transitions, and stores all the experiences in the replay memory. The master is also in charge of sending the next action back to each worker. These represent the main differences from Gorila, in which each process contains one worker, a separate replay memory, and a master that samples data from the replay memory and trains its own NN; the gradients evaluated by each master are sent to a central parameter server, which updates the central parameters and sends the updated parameters back to each master after a fixed interval. Furthermore, without parallelization, the proposed reinforcement learning approach would be unfeasible, since it would demand almost a year of training to generate the number of samples needed. The parallelized approach decreases the training time to about a week.
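A heavily simplified sketch of this master-worker layout is shown below using Python's multiprocessing; the dummy environment, episode lengths, and random placeholder policy stand in for ABTS and the epsilon-greedy Q-network, and the interprocess-communication details are assumptions rather than the paper's implementation.

import multiprocessing as mp
import numpy as np

N_WORKERS = 8  # placeholder; workers and master share the cores of one node

class DummyAerobrakingEnv:
    # Stand-in for a randomized ABTS episode: 9-element observation, discrete actions.
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
    def reset(self):
        return self.rng.random(9)
    def step(self, action):
        return self.rng.random(9), -0.1, self.rng.random() < 0.02, {}

def worker(wid, request_q, action_pipe, transition_q):
    # Worker: simulates its own randomized episodes, asks the master for each action,
    # and ships every transition back for the shared replay memory.
    env = DummyAerobrakingEnv(seed=wid)
    s = env.reset()
    for _ in range(500):
        request_q.put((wid, s))
        a = action_pipe.recv()
        s2, r, done, _ = env.step(a)
        transition_q.put((s, a, r, s2, done))
        s = env.reset() if done else s2

def master(request_q, action_pipes, transition_q):
    # Master: sole owner of the replay memory and Q-network. A random action stands
    # in here for the epsilon-greedy argmax_a Q(s, a) of the trained network.
    replay, rng = [], np.random.default_rng(0)
    for _ in range(N_WORKERS * 500):
        wid, state = request_q.get()
        action_pipes[wid].send(int(rng.integers(0, 11)))
        while not transition_q.empty():
            replay.append(transition_q.get())  # asynchronous transition storage
            # ...periodically sample a batch from replay and train the Q-network here

if __name__ == "__main__":
    request_q, transition_q = mp.Queue(), mp.Queue()
    pipes = [mp.Pipe() for _ in range(N_WORKERS)]
    procs = [mp.Process(target=worker, args=(i, request_q, pipes[i][1], transition_q))
             for i in range(N_WORKERS)]
    for p in procs:
        p.start()
    master(request_q, [p[0] for p in pipes], transition_q)
    for p in procs:
        p.join()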
The parallel approach is well suited to randomization, and particularly to DR, since each worker can perform a different, randomized episode. During training, each worker simulates a different randomized aerobraking mission, with different initial conditions and perturbed hypersonic coefficients; also, each worker uses its own MarsGRAM 2010 model with a unique seed number. This differentiation is necessary to generate independent samples and learn a generalized policy. Also, the uniform distributions are centered around the Mars Odyssey mission nominal conditions to guarantee that the policy is able to handle the scenario. Indeed, it is expected that the proposed approach would be used only after baseline mission definition and that the policy would be trained for each specific mission. The same PR-DRL approach is used for testing, both when the testing distribution and the training distribution are the same [Fig. 4(a)] and when the testing distribution is far from the training one [Fig. 4(b)]. The latter is used to analyze the performance of the policy in extreme aerobraking environments. While this architecture has been used in previous work, its ability to generalize the policy was not considered or tested [55]. This represents an essential step to highlight the remarkable ability of the proposed architecture. Furthermore, the Q-network is trained using the NN shown in Fig. 5: a feed-forward NN with two hidden layers. The input, a vector of dimensions [9×1], is sent to the first hidden layer, composed of 1024 units activated through rectified linear unit (ReLU) functions. The second hidden layer consists of 1024 fully connected units, again activated through the ReLU function. Finally, the output layer is a fully connected layer with a single output for each valid action.
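A possible PyTorch rendition of this Q-network is sketched below; the framework choice and the number of discrete actions are assumptions, while the layer sizes follow the description above.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Feed-forward Q-network as described above: a 9-element observation, two fully
    # connected hidden layers of 1024 ReLU units, and one output per discrete ABM
    # action (the action count used here is a placeholder).
    def __init__(self, n_obs=9, n_actions=11, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_obs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )
    def forward(self, obs):
        return self.net(obs)  # Q(s, a) for every discrete action

q_net = QNetwork()
q_values = q_net(torch.rand(1, 9))           # one normalized observation
greedy_action = int(q_values.argmax(dim=1))  # action with the highest Q-value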
After a broad tuning of the hyperparameters, the discount factor β is set to 0.95 and the initial learning rate to 1 × 10⁻⁴. Also, the Adam optimization algorithm is used. The probability required by the exploration algorithm is evaluated through the decaying-epsilon formula, with a decay rate set to 5000. The batch size is set to 256, and the network is trained every five steps. Training starts after 10 000 steps and stops after 1 100 000 steps. The replay buffer stores 20 000 state-action-transition tuples. Finally, the target Q-network is updated every 10 000 steps. This value and the replay memory size had the most significant effect on the learning stability. These hyperparameters were identified after a long tuning process; specifically, a batch size of 256, with respect to 128, was shown to have a great influence on the training stability [55].
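For reference, the quoted hyperparameters are collected below; the exact decaying-epsilon expression is not reproduced in the text, so the schedule shown is an assumed common exponential form with placeholder start and end values.

import math

# Hyperparameters quoted above (Adam optimizer, discount factor beta = 0.95, etc.).
DISCOUNT = 0.95
LEARNING_RATE = 1e-4
EPS_DECAY_RATE = 5000
BATCH_SIZE = 256
TRAIN_EVERY_STEPS = 5
WARMUP_STEPS = 10_000
MAX_TRAIN_STEPS = 1_100_000
REPLAY_CAPACITY = 20_000
TARGET_UPDATE_STEPS = 10_000

def exploration_probability(step, eps_start=1.0, eps_end=0.05):
    # Assumed exponential decaying-epsilon schedule; eps_start and eps_end are
    # placeholders, since the exact formula is not reproduced in the text.
    return eps_end + (eps_start - eps_end) * math.exp(-step / EPS_DECAY_RATE)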

IV. MARKOV-DECISION PROCESS AND DEEP Q-NETWORK FOR AEROBRAKING
The deep Q-network (DQN) is one of the most widely used DRL algorithms. Its popularity grew when Google DeepMind published an agent trained using the DQN algorithm that achieved human-level results when playing Atari games [60]. The DQN algorithm is a Q-learning algorithm in which the complex, nonlinear function Q is approximated through an NN (Fig. 5). Specifically, Q-learning is a model-free and off-policy control algorithm, which aims to approximate the Q-function. The Q-function, also called the state-action value function, specifies how good an action a ∈ A under the policy π is in the state s ∈ S. The policy is the strategy that the agent uses to determine the next action based on the current state. The Q-learning algorithm updates the Q-function through Bellman's equation

Q_new(s, a) = Q(s, a) + α [ r + β max_{a'} Q(s', a') − Q(s, a) ]    (2)

where Q_new is the updated state-action value function; Q is the current value function; s, a, and r are the current state, action, and reward, respectively; α and β are the learning rate and the discount factor; and max_{a'} Q(s', a') is the estimate of the optimal future value function. The DQN, since it aims to approximate the Q-function, performs stochastic gradient descent to minimize a loss based on Bellman's equation. Also, DQN is more sample efficient than its competitors, e.g., A3C [61]; since the studied application requires long and computationally intensive simulations, the decision to use the DQN algorithm family aimed to prioritize this aspect. The double DQN (DDQN) algorithm is an extension of the DQN algorithm; indeed, DQN often overestimates Q-values, harming the learning process. The idea behind double Q-learning is to use two networks, Q and Q′, where Q, the trained network, is used for action selection and Q′, the target network, for action evaluation. The plain DQN algorithm utilizes the target network for both action selection and evaluation; Van Hasselt [62] addresses the maximization bias by decoupling the selection of the action from its evaluation in the max operation. This study uses the DDQN algorithm.
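The decoupling used by DDQN can be summarized in a few lines; the sketch below shows the target computation with the online network selecting the next action and the target network evaluating it, where the batch layout and the Huber loss are assumptions rather than details taken from the paper.

import torch
import torch.nn.functional as F

def ddqn_loss(batch, q_net, q_target_net, discount=0.95):
    # Double-DQN: the online network selects argmax_a' Q(s', a'), the target
    # network evaluates it. `batch` is an assumed dict of tensors with keys
    # states, actions, rewards, next_states, dones.
    with torch.no_grad():
        next_actions = q_net(batch["next_states"]).argmax(dim=1, keepdim=True)          # selection
        next_q = q_target_net(batch["next_states"]).gather(1, next_actions).squeeze(1)  # evaluation
        targets = batch["rewards"] + discount * (1.0 - batch["dones"]) * next_q
    q_pred = q_net(batch["states"]).gather(1, batch["actions"].unsqueeze(1)).squeeze(1)
    return F.smooth_l1_loss(q_pred, targets)  # Huber loss assumed; MSE is also common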

A. Markov-Decision Process for Aerobraking
As required by RL, the aerobraking campaign was framed as an MDP, which requires the strict definition of the episode (initial and terminal conditions), state or observation space, action space, and reward function. In this study, an aerobraking campaign represents the episode. Each episode starts at the initial apoapsis and terminates if any of the following conditions occur: 1) the spacecraft reaches or surpasses the target orbit; 2) the heat rate is too low (q̇⁻_lim) or too high (q̇⁺_lim); 3) the periapsis altitude is higher than the region in which the atmospheric density is sensible (135 km); 4) the periapsis altitude is too close to the planet surface (85 km). Generally, the third and fourth conditions are overridden by the second condition, depending on the thermal limits used. Although the second condition can be considered too taxing, and in a real mission modest thermal violations do not cause mission failure, this condition helps to limit the episode length, speeding up the learning process.
In addition, the observation space is defined by the spacecraft orbital position (apoapsis radius, periapsis altitude, inclination, AOP, and RAAN) at the AE, the drag-passage time duration, the date, the maximum atmospheric density magnitude encountered in the passage, and the maximum heat rate on the solar panels. Finally, all the observations are preprocessed and normalized within the range of 0 to 1, because differences in scale across observations can lead to unstable learning and significant errors during the gradient descent step. Specifically, the drag-passage time duration is normalized between 400 and 1200 s, which represents a conservative duration range for drag passages in aerobraking; the date is normalized between 1 January 2001 and 1 January 2003, expressed in Proleptic Gregorian ordinal format. The inclination is normalized between 79° and 100°, the AOP between 0° and 360°, and the RAAN between 90° and 180°.
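A minimal sketch of this min-max normalization is given below; the ranges follow the text where quoted, while the remaining observation ranges and the sample values are placeholders.

import datetime
import numpy as np

DATE_MIN = datetime.date(2001, 1, 1).toordinal()
DATE_MAX = datetime.date(2003, 1, 1).toordinal()

def normalize(value, lo, hi):
    return (value - lo) / (hi - lo)

# Min-max normalization with the ranges quoted above; the remaining observation
# ranges (apoapsis radius, periapsis altitude, density, heat rate) are assumptions.
obs = np.array([
    normalize(datetime.date(2001, 12, 18).toordinal(), DATE_MIN, DATE_MAX),  # date
    normalize(850.0, 400.0, 1200.0),   # drag-passage duration [s]
    normalize(93.6, 79.0, 100.0),      # inclination [deg]
    normalize(115.0, 0.0, 360.0),      # AOP [deg]
    normalize(115.0, 90.0, 180.0),     # RAAN [deg]
    # ... apoapsis radius, periapsis altitude, max density, max heat rate
])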
The agent action is a discrete ABM at apoapsis; the discretization of the action space is required by the DDQN algorithm. The action space is defined as a discrete set of actions that can increase or decrease the spacecraft velocity at apoapsis. This discrete variation has the effect of lowering or raising the periapsis of the next drag passage and varying the maximum heat rate encountered. Negative actions lower the periapsis altitude, while positive actions raise it. The action space was chosen arbitrarily, but the range covers the ΔV budget considered in the literature [20], [57]. Also, due to the smaller energy depletion (short-term effect of the J2 perturbations) and the use of finite burns, the minimum and maximum actions of the action space were reduced with respect to previous work [55]. Also, not all thermal violations have the same impact on the mission. For this reason, different thermal violation thresholds, namely low, soft, medium, and hard thermal violations, are included in the study. A low thermal violation occurs when the heat rate experienced is lower than q̇⁻_lim; a larger number of this type of thermal violation increases the aerobraking campaign length. A soft thermal violation is defined when the heat rate experienced is between q̇⁺_lim and q̇⁺⁺_lim, where q̇⁺⁺_lim > q̇⁺_lim. This type of thermal violation, although undesirable, would not lead to mission failure. A medium thermal violation is defined when the heat rate experienced is between q̇⁺⁺_lim and q̇⁺⁺⁺_lim, where q̇⁺⁺⁺_lim > q̇⁺⁺_lim and q̇⁺⁺⁺_lim is the heat rate corresponding to the immediate-action line. This type of thermal violation would not lead to mission failure, but is considered unwanted. Finally, a hard thermal violation is defined when the heat rate experienced is greater than q̇⁺⁺⁺_lim; it is considered totally unacceptable and would lead to mission failure.
Based on the previous discussion, the terminal reward is defined as in (5), where d_target is the distance between the final apoapsis radius and the target apoapsis radius, q̇ is the maximum heat rate on the solar panels, and nint is the nearest-integer function.
Equation (5) reports that a reward is obtained if the agent is in the proximity of the final orbit; specifically, the closer the final aerobraking orbit is to the target orbit, the higher the terminal reward is. If the final aerobraking orbit matches the target orbit, the terminal reward is set to +10; if the distance between the two is between 9 and 10 km, the terminal reward is set to +6.
In RL, sparse rewards make learning difficult. For this reason, it is important to define a running reward function that regularly provides rewards and informs the agent of the quality of its actions. Furthermore, through a running reward function, it is possible to indirectly favor one trajectory with respect to another to encourage the agent to exhibit a desired behavior. For this reason, a 3-D continuous running reward function was designed. This function provides a reward based on the performed action, the heat rate, and the distance to the goal. The running reward is given by (6), where d̄ is the normalized distance to the goal, r_apoapsis,km is the apoapsis radius state, r_target,km is the apoapsis target radius, and r_initial,km is the initial apoapsis radius. Also, q̄ is the normalized maximum heat rate experienced by the spacecraft in the drag passage. Finally, ΔV̄ is the normalized magnitude of the ABM, and thus of the action. In this case, ΔV is already capped between 0 and 1, so ΔV and ΔV̄ coincide; otherwise, a normalization of the ΔV would be required.
Low heat rates are allowed and not penalized, to encourage a periapsis raise as the mission approaches the goal state. A higher periapsis altitude has two effects: the first is to diminish the apoapsis decay rate and make it easier to reach the final goal state; the second is to decrease the ΔV budget of the periapsis-raise maneuver needed to place the periapsis outside the atmosphere when aerobraking is completed. For this reason, if q̄ is less than q̇⁻_lim, ΔV̄ is replaced by 1 − ΔV̄. Fig. 6 shows the behavior of the running reward function (6) in terms of normalized distance to the goal and normalized experienced heat rate for three different action magnitudes. The running reward function always results in a penalty, which has the effect of shortening the episode. In the context of aerobraking, a short episode corresponds to a short aerobraking campaign; a short campaign is favorable because it reduces the mission time, the mission risks, and possibly the ΔV for ABMs. The proposed running reward function is bounded between −0.2 and 0.0. Also, the penalty decreases as the distance to the goal decreases. This trend is due to the first term of (6), which teaches the agent to reach the goal state location. Furthermore, as visible in all three plots, the penalty increases with the magnitude of the action used. This trend is achieved through the third term of (6) and is included so that the aerobraking campaign is completed with a limited overall ΔV budget. For this reason, the three contour plots are capped at three different maximum and minimum values: the minimum and maximum values of Fig. 6(a) are, respectively, −0.1 and 0; the minimum and maximum values of Fig. 6(b) are, respectively, −0.11 and −0.01; and the minimum and maximum values of Fig. 6(c) are −0.2 and −0.1. High-ΔV maneuvers are penalized except, as already mentioned, when the agent is in the 100-km neighborhood of the target state. Finally, the second term of (6) connects the reward with the experienced heat rate. This term provides the bell-shaped trend visible in the three contour plots; to achieve this trend, a normal-distribution density function is used. The heat rate is related to the energy depleted in the drag passage: the higher the heat rate, the higher the atmospheric density and the drag force experienced by the spacecraft. Large drag forces slow the spacecraft more; for this reason, higher heat rates result in fewer passages and a faster aerobraking campaign. However, the heat rate must be maintained within the heat rate corridor. Therefore, the second term of (6) provides the largest reward when the heat rate is in the middle of the corridor, as also visible in Fig. 6. Ultimately, the overall MDP framework for aerobraking was generalized to be usable for different aerobraking campaigns (i.e., different phases, initial orbits, bodies, thermal limits, and final target orbits). Previous work lacked such a generalized framework and was tailored to the specific case studied. The MDP framework is one of the main contributions of this work.
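Since (6) is not reproduced above, the sketch below only illustrates a three-term penalty with the qualitative properties described: bounded in [−0.2, 0], a normal-density bell peaking at mid-corridor heat rate, and a penalty growing with the normalized ΔV and with the distance to the goal. The weights and the exact functional form are assumptions, not the paper's equation.

import numpy as np

def running_reward(d_norm, q_norm, dv_norm, q_mid=0.5, sigma=0.15):
    # Illustrative three-term penalty with the qualitative shape described for (6);
    # weights, sigma, and the exact form are assumptions.
    distance_term = -0.05 * d_norm                              # smaller penalty near the goal
    bell = np.exp(-0.5 * ((q_norm - q_mid) / sigma) ** 2)       # normal-density shape
    heat_term = -0.05 * (1.0 - bell)                            # best reward at mid-corridor
    action_term = -0.10 * dv_norm                               # no ABM (dv_norm = 0) has no penalty
    return distance_term + heat_term + action_term              # bounded in [-0.2, 0]

r = running_reward(d_norm=0.3, q_norm=0.55, dv_norm=0.0)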

V. RESULTS FOR AEROBRAKING AT MARS
The proposed architecture is applied to aerobraking at Mars, using the 2001 Mars Odyssey mission characteristics, to assess the architecture's performance in an aerobraking environment and compare it with existing Mars Odyssey data in the literature. While training over the whole aerobraking campaign would be preferable, the feasibility and performance of the architecture can be demonstrated, and the computational resource requirements limited, by training the architecture on the most challenging aerobraking phases. To this end, the PR-DRL architecture is trained to perform the Mars Odyssey mission from Main Phase II to the Walkout phase. Aerobraking becomes more challenging from Main Phase II for two reasons. First, the orbital period is around the threshold that enables a ground-in-the-loop approach for each aerobraking orbit, so more meticulous planning is required. Second, the periapsis naturally starts to decrease with Main Phase II. This effect grows as the orbital period decreases; in the Walkout phase, with an uncontrolled apoapsis decay, the spacecraft would impact the planet's surface within one day.
The heat rate target is set to 0.15 W/cm², the low thermal boundary to 0.05 W/cm², and the high thermal boundary to 0.25 W/cm². This heat rate corridor is lower than the one designed for the 2001 Mars Odyssey Main Phase and is closer to the corridor used for the Mars Odyssey Endgame [57]. Indeed, Mars Odyssey followed a decreasing and complex heat rate corridor scheme that varied during the mission, beginning around 0.32 W/cm² in Main Phase II and decreasing to 0.25 W/cm² at the end of the Endgame. Furthermore, the Mars Odyssey Walkout phase used a severely decreasing corridor, from 0.2 to 0.03 W/cm², due to the lack of controllability of the remote decision-making. To maintain simplicity, a constant corridor is used in the proposed MDP framework. Furthermore, a soft thermal violation is defined between 0.25 and 0.3 W/cm², a medium thermal violation between 0.3 and 0.45 W/cm², and a hard thermal violation occurs when the heat rate is greater than 0.45 W/cm².

A. Training Results
The PR-DRL architecture is trained using the randomized environment shaped by the distributions presented in Table IV. Fig. 7 shows the training results; Fig. 7(a) reports the training and testing loss averaged over 25 000 training steps and its standard deviation, while Fig. 7(b) shows the same statistics for the achieved reward. The testing is performed over 40 episodes. As visible in Fig. 7(a), the loss decreases and stabilizes, indicating stable training. Fig. 7(b) reports the reward growth; however, the final reward is negative, since the running reward function (6) always penalizes new steps. The results for the testing of the learned policy are also reported in Fig. 7. The testing is performed either using the terminal reward of (5) (green line) or neglecting the first four terminal conditions of the same equation (orange line), i.e., those that stop the episode when a thermal violation occurs. The latter case corresponds to a more realistic mission. As expected, the reward standard deviation for the tolerant case (orange) reaches lower values than its conservative counterpart (green), which also results in a slightly larger loss. Nevertheless, for most of the training, the mean values agree with each other, for both the training and testing cases. However, in some policy evaluations, the tolerant case (orange) performs significantly worse than the training and the other testing case. These policies are rejected, since agreement between the performance of the two testing cases is preferable because of their different physical meanings: while it is essential to achieve the goal state (orange line), it is equally important to minimize the number of thermal violations (green line). For the same reason, for the rest of the analysis, the last trained policy in which training and testing are in agreement is used (policy frozen at 1.1 million training steps). Fig. 8 reports the performance of the policy testing when thermal violations do not trigger a terminal condition (orange case of Fig. 7). Each subplot, except the first one [Fig. 8(a)], reports the mean and standard deviation of the analyzed feature; Fig. 8(a) instead shows the episode outcome, in percentage, over the 40 tested episodes. This plot highlights that, as training proceeds, the agent learns to reach the final state more often. This behavior is also shown in Fig. 8(b), which reports the final distance to the goal; with increasing training steps, the final-distance mean and standard deviation slowly concentrate in the neighborhood of 0 km. Fig. 8(c) shows the evolution of the thermal violations, starting from 200 000 training steps, along the testing process, displaying how the mean number of thermal violations tends to shrink with the training step, bounding the sum of the mean thermal violations to well below one for long-trained policies. However, some long-trained policies show a large standard deviation, mainly for low thermal violations. These policies are not considered in the rest of the article because of their relatively poor performance. Fig. 8(d) shows the required ΔV and the number of maneuvers. While the ΔV shows a tendency to decrease over time for long-trained policies, the number of maneuvers tends to oscillate between 40 and 180.
Finally, Fig. 8(e) reports the episode time and the number of aerobraking orbits, which are consistent with each other. Fig. 9 reports the average number of repeated actions for the selected policy: the figure shows that the agent more frequently chooses positive actions, which raise the periapsis. As already mentioned, the periapsis naturally decays over the aerobraking campaign, and the periapsis location requires continuous correction to avoid thermal violations. However, the agent most often decides not to perform any ABM, which is the only action with zero penalty [see (6)].

B. State-of-the-Art Heuristic Comparison
The proposed PR-DRL approach and the AADS heuristic are used to simulate 40 episodes generated using the perturbed environment presented in Table IV [pink distributions of Fig. 4(a)]. The results of this analysis are presented in Table V and Fig. 10. Specifically, Table V shows the reward scored by the two approaches, the number of thermal violations encountered, and the percentage of episodes that reached the apoapsis goal state. The PR-DRL architecture outperforms the AADS heuristic in total reward; specifically, the PR-DRL architecture achieves an average reward 87.2% higher than the AADS heuristic average score. Also, the AADS final reward exhibits more volatility, as shown by its larger standard deviation. In addition, the PR-DRL policy avoids more thermal violations overall and across the distinct classes; for the PR-DRL, the most frequent class is the low thermal violation class, and the PR-DRL policy reports a negligible number of medium thermal violations. For the AADS heuristic, the most frequent class is also the low thermal violation class; however, all the AADS thermal violation classes are at least one order of magnitude larger than the PR-DRL results, except for the hard thermal violations, which are zero for the AADS as well. Overall, the PR-DRL architecture achieves a 97.5% decrease in the average occurrence of thermal violations relative to the AADS heuristic. Finally, the PR-DRL reaches the goal 100% of the time with a threshold of ±10 km, compared with 53% for the AADS. If the threshold is tightened to ±5 km, the goal percentage remains 100% for the PR-DRL and drops to only 38% for the AADS, showing that the PR-DRL approach prioritizes the targeting task. It has to be noted, however, that the AADS heuristic is not designed to target a final orbit. Fig. 10(a) shows the heat rate corridor for the first 10 of the 40 tested episodes. As visible, the PR-DRL consistently maintains flight within the prescribed heat rate corridor, whereas the AADS heuristic results in many thermal violations. This result confirms that, although the atmospheric density variations are considerable, the PR-DRL policy can react proactively to such variability and stay far enough from the limit to avoid future thermal violations. Furthermore, some low or soft/medium thermal violations are present at the beginning of the phase; this is due only to the initial conditions. In this case, the initial heat rate did not meet the corridor, and both the AADS heuristic and the PR-DRL policy design actions to reach it. Finally, at the end of the phase, the PR-DRL gradually decreases the heat rate; this represents an attempt by the agent to build a strategy to reach the goal. The lower the heat rate, the lower the apoapsis difference between passages, and the higher the probability of reaching the apoapsis goal. Such behavior is absent in the AADS heuristic.
Finally, Fig. 10(b) shows the actions taken by the two approaches. The AADS heuristic shows sparser actions with larger magnitudes; specifically, the magnitude increases on approaching the goal state. Overall, the PR-DRL policy appears more proactive and designs smaller-magnitude and more frequent actions than the AADS heuristic. For the AADS heuristic, larger-magnitude positive (negative) actions are designed only to raise (lower) the periapsis at the end (beginning) of the phase and to react to sudden changes in the atmospheric density. The AADS heuristic waits for a prediction that the vehicle will be out of the corridor before performing an action, which then requires a larger ΔV to return to the corridor and its associated target heat rate.

C. Mars Odyssey Mission Flight Data Comparison
The same policy used for the previous analysis is used to simulate the 2001 Mars Odyssey mission from Main Phase II to the Walkout phase [red distributions of Fig. 4(a)] to investigate the performance of the PR-DRL in a real mission scenario. In this case, the inclination, AOP, and RAAN were chosen to match the Mars Odyssey mission data [57]: the inclination was set to 93.6°, the AOP to 115°, and the RAAN to 89°. All the missions started on 18 December 2001. All the other initial conditions were set as in the previous analysis. One hundred episodes were simulated using the PR-DRL policy and the AADS heuristic. Their performance, expressed in terms of mean and standard deviation, is compared with the 2001 Mars Odyssey mission flight data provided in [57]; the results are shown in Fig. 11. In the case of the number of thermal violations [Fig. 11(d)], the standard deviation is reported only over physically feasible (positive) values. Fig. 11(a) reports the number of ABMs; the number of maneuvers performed by the PR-DRL is higher than the number performed by the Mars Odyssey mission navigation team, while the number performed by the AADS is lower. The discrepancy between the behavior of the two autonomous decision-making approaches might lie in the heat rate corridor: while both approaches use the same definition of the heat rate corridor, the AADS heuristic appears too conservative in its heat rate predictions. The PR-DRL policy shows a proactive behavior toward avoiding thermal violations, while the AADS heuristic shows a reactive behavior. A tighter operational corridor would possibly be necessary for the AADS heuristic to induce a more reactive behavior. The number of ABMs performed by the PR-DRL policy [see Fig. 11(a)] aligns with the notion that an onboard planning algorithm would allow more frequent maneuvers, whereas a ground-station-in-the-loop approach allows fewer maneuvers due to the planning time and the orbital period constraints. Fig. 11(b) and (c) show the phase duration and the ΔV used for the ABMs, the apoapsis correction maneuver, needed to correct any apoapsis targeting error, and the periapsis raise maneuver, needed to raise the periapsis outside the Mars atmosphere. In this case, the three approaches result in comparable mission time and fuel consumption. Specifically, the PR-DRL results are on average one day slower, and the AADS results are on average 5.5 days faster, than Odyssey. However, the AADS uses on average 2.4 m/s more ΔV than the 2001 Mars Odyssey mission, while the PR-DRL requires on average 3.7 m/s less ΔV than Odyssey.
In addition, Fig. 11(d) reports the number of thermal violations for the three analyzed cases. The thermal violations are counted with respect to the specific heat rate corridor defined for each case, which differs between the 2001 Mars Odyssey mission and the two autonomous decision-making approaches. The heat rate corridor defined by the navigation team of the 2001 Mars Odyssey mission is more lenient than the one set in this study, i.e., higher heat rate values are tolerated. Nevertheless, the PR-DRL policy is able to reduce the number of thermal violations by 98.7% with respect to the real flight data. Likewise, the AADS heuristic reports a lower average number of thermal violations; however, it also shows an undesirably large variability, indicating that its performance could, in some scenarios, be worse than Mars Odyssey's.
Finally, Fig. 12 shows an example of aerobraking performed using the PR-DRL policy and the 2001 Mars Odyssey mission flight data, in terms of heat rate versus orbit number; the heat rate corridor defined for the PR-DRL approach is also shown. In this plot, the PR-DRL results show an almost oscillatory behavior at the beginning of the aerobraking; later in the mission, the variation of the heat rate appears more scattered. Also, the Main Phase II of the PR-DRL aerobraking is longer than the one reported by the Mars Odyssey mission; this is because the heat rate corridor was less restrictive in the actual mission, and overall lower periapsis passages were allowed. Furthermore, the Walkout phase shortens for the PR-DRL, causing a drastic reduction in the number of aerobraking orbits. The Walkout phase is the most dangerous phase of aerobraking, and this reduction might result in a reduction of the correlated risks. Overall, the PR-DRL reports 29 fewer orbits on average (a 13.9% decrease), but, due to variations in how the orbital period changes over time, the PR-DRL scheme requires slightly more time than Odyssey (+1 day).

D. Generalization
A generalized policy, i.e., a policy able to maintain high performance in unseen situations, is vital for critical aerospace applications. Generalization does not have only one meaning [54]. Specifically, independent and identically distributed (IID) generalization refers to the case in which the same distribution describes the training and testing environments [Fig. 4(a)], while out-of-distribution (OOD) generalization refers to the case in which the training and testing environments do not share the same distribution [Fig. 4(b)]. This work, until this point, has focused on IID generalization; however, characterizing the performance for unseen environments is equally important to identify potential weaknesses in the approach and for future real-world applications. Also, we believe that using a DR approach indirectly enables generalization to unseen situations. For this analysis, the learned policy is tested in a range of different situations, namely, 1) with a more aggressive atmospheric environment, 2) using a different atmospheric density model, 3) with a shorter episode length, 4) with different initial conditions, and 5) with a modified simulator accuracy. The performance is analyzed using the training-testing performance gap. Although [54] presents an RL formalism based on the expected reward, in this analysis we retain the original definition from supervised learning because our terminal reward is a function of the length of the aerobraking campaign. The generalization gap is then defined as

    GenGap(π) := L(π, M | C_test) − L(π, M | C_train)    (9)

where L(π, M | C) is the loss of the policy π acting on the environment described by the distribution C [54].
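To make Eq. (9) concrete, the following minimal Python sketch estimates the two loss terms by Monte Carlo sampling over environments; the environment samplers and the per-episode loss routine are hypothetical placeholders supplied by the caller and are not part of the ABTS/PR-DRL code base.

    from statistics import mean
    from typing import Any, Callable

    def estimate_loss(policy: Any,
                      sample_env: Callable[[], Any],
                      episode_loss: Callable[[Any, Any], float],
                      n_episodes: int = 100) -> float:
        # Monte Carlo estimate of L(pi, M | C): average episode loss over
        # environments drawn from the distribution C via sample_env().
        return mean(episode_loss(policy, sample_env()) for _ in range(n_episodes))

    def generalization_gap(policy: Any,
                           sample_train_env: Callable[[], Any],
                           sample_test_env: Callable[[], Any],
                           episode_loss: Callable[[Any, Any], float],
                           n_episodes: int = 100) -> float:
        # GenGap(pi) := L(pi, M | C_test) - L(pi, M | C_train), as in Eq. (9).
        return (estimate_loss(policy, sample_test_env, episode_loss, n_episodes)
                - estimate_loss(policy, sample_train_env, episode_loss, n_episodes))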
The results of this analysis, in terms of generalization gap and mission performance, are reported in Table VI. In these results, the more aggressive atmospheric environment is simulated using MarsGRAM 2010 parameters, increasing rpscale and rwscale to 2 (their upper limit) and setting a dust storm of maximum intensity (intens set to 3 and dusttau set to 0.3) for the whole duration of the aerobraking campaign. The policy is also tested using an exponential atmospheric density law (a sketch of this law follows this paragraph). Furthermore, the policy is tested by replicating only the last phase of the Mars Odyssey mission (Walkout) and by simulating from an unseen initial state (from Main Phase I). These last two simulations aim to assess how the policy performs with seen and unseen initial conditions. Specifically, starting the campaign from Main Phase I means setting the initial AOP to greater than 90°, where the periapsis experiences a natural rise. However, training is performed with AOP always less than 90°; thus, the policy is tested on a physical phenomenon not covered during training. Finally, the accuracy of the ABTS integrator is decreased, as discussed in [30]. As already assessed in Section IV, the simulator's accuracy settings were adjusted to speed up the training; an assessment of the policy's performance with a slightly different simulator is therefore essential.
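For reference, the exponential density law used for this test has the standard form ρ(h) = ρ0 exp(−(h − h0)/H). A minimal sketch follows, where the reference density, reference altitude, and scale height are illustrative Mars-like values and not the constants adopted in this work.

    import math

    def exponential_density(altitude_km: float,
                            rho0: float = 0.020,           # kg/m^3, reference surface density (assumed)
                            h0_km: float = 0.0,            # reference altitude (assumed)
                            scale_height_km: float = 11.1  # Mars-like scale height (assumed)
                            ) -> float:
        # rho(h) = rho0 * exp(-(h - h0) / H)
        return rho0 * math.exp(-(altitude_km - h0_km) / scale_height_km)

    # Example: density near a typical aerobraking periapsis altitude (~100 km).
    rho_periapsis = exponential_density(100.0)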
Results show good performance for every case except the long-duration aerobraking case. Most importantly, the ABTS integrator characteristics and the density laws have a marginal and contained impact on the performance. Interestingly, the ΔV budget and the number of maneuvers appear to be linked to the variability of the atmospheric environment: a more variable environment results in a larger required ΔV, a greater number of aerobraking maneuvers, and more thermal violations. Conversely, the policy is less challenged by a milder atmospheric environment (exponential density law) and can reach the final state faster with no thermal violations and a drop in the aerobraking duration and episode length. The modified accuracy and error tolerance of the simulator integrator show only a marginal degradation of the performance with respect to the nominal case results. The largest degradation is found when the policy is tested with different initial conditions (short and long aerobraking campaigns). Starting the aerobraking campaign from the Walkout phase only moderately increases the generalization gap, while the total number of thermal violations increases and the reached-goal percentage decreases; however, the performance is similar to that of the other cases. On the other hand, the results in the table show that the long aerobraking campaign reports the worst performance in all the analyzed parameters except the reached-goal percentage. Once more, this result was foreseen since DQN is a model-free algorithm, and the algorithm learns behavior only for experienced physical models. This result highlights a fundamental point and exposes a weakness of the model: the most impactful physical phenomena must be present in the training environment when model-free approaches are used. This conclusion may seem obvious, but identifying the most relevant physical factors for a specific application is less obvious when designing a model-free architecture. Overall, these results establish that the PR-DRL approach is able to achieve good performance in aerobraking for scenarios far from the nominal case.

VI. CONCLUSION
A PR-DRL approach is developed and tested for autonomous aerobraking maneuver planning. Autonomous aerobraking requires an onboard algorithm to plan and design ABMs so as to maintain the spacecraft in a safe thermal environment. In addition, the PR-DRL architecture is used to teach the spacecraft how to navigate an aerobraking campaign. To this end, an MDP framework for aerobraking is developed, and a terminal and 3-D running reward function, in terms of goal distance, heat rate, and ΔV, is derived. The reward function trains the spacecraft to avoid thermal violations altogether and to reach a target apoapsis state. The proposed autonomous aerobraking architecture is tested on the aerobraking campaign of the 2001 Mars Odyssey mission.
Results show that the PR-DRL approach outperforms the current state-of-the-art heuristic, achieving an average reward 87.2% higher than the heuristic average score and a 97.5% decrease in the average occurrence of thermal violations relative to the heuristic. Results also show that the PR-DRL approach can proactively and robustly avoid thermal violations and develop a behavior that reaches the target orbit 94% of the time in a broadly perturbed environment. Furthermore, the PR-DRL approach reduces the number of thermal violations by 98.7% and the number of aerobraking orbits by 13.9% with respect to the 2001 Mars Odyssey mission flight data while using a comparable ΔV budget and duration. Finally, the trained policy is able to generalize to different atmospheric conditions, integrator characteristics, and seen initial states; however, the performance is expectedly poor in the case of unmodeled physical phenomena.
Overall, results show that the proposed PR-DRL approach may reduce aerobraking mission risk by reducing the number of thermal violations and the number of orbits, and therefore atmospheric passes, with flight performance similar to the 2001 Mars Odyssey mission aerobraking campaign in terms of required ΔV and mission time. These results show how beneficial autonomous planning of aerobraking maneuvers may be, especially if implemented onboard. However, not every autonomous algorithm shows the same performance: the AADS heuristic performance is tied to the heat rate corridor defined for the specific mission, which shows that an approach aware of the variability of the environment is necessary to maintain a low-risk mission. Moreover, the generalization analysis points out that a careful definition of the model may enable a safe onboard approach. While the performance degradation shown for different atmospheric conditions could be fixed with a short fine-tuning during the Walkin phase of the mission, model imperfections may lead to more significant issues. Therefore, future work should include partial physical models in the learning architecture.