Comparing Reinforcement Learning Methods for Real-Time Optimization of a Chemical Process

One popular method for optimizing systems, referred to as ANN-PSO, uses an artificial neural network (ANN) to approximate the system and an optimization method like particle swarm optimization (PSO) to select inputs. However, with reinforcement learning developments, it is important to compare ANN-PSO to newer algorithms, like Proximal Policy Optimization (PPO). To investigate ANN-PSO's and PPO's performance and applicability, we compare their methodologies, apply them to steady-state economic optimization of a chemical process, and compare their results to conventional first principles modeling with nonlinear programming (FP-NLP). Our results show that ANN-PSO and PPO achieve profits nearly as high as FP-NLP, with PPO achieving slightly higher profits than ANN-PSO. We also find PPO has the fastest computational times, 10 and 10,000 times faster than FP-NLP and ANN-PSO, respectively. However, PPO requires more training data than ANN-PSO to converge to an optimal policy. This case study suggests PPO has better performance, as it achieves higher profits and faster online computational times, while ANN-PSO shows better applicability with its capability to train on historical operational data and its higher training efficiency.


Motivation
Many chemical processes, like coal combustion and bioreactors, are complex, and thus deriving appropriate models and optimizing outputs is difficult [1][2][3]. Machine learning has shown success in optimizing complex systems, such as scheduling electricity prices to manage demand and maximize power grid performance [4][5][6]. This motivates exploration of other machine learning techniques, like reinforcement learning (RL), for model-free optimization [7]. RL research has seen many breakthroughs in recent years, with new algorithms capable of defeating most humans in difficult games [8][9][10][11]. These algorithms are not specifically designed to play games, but to learn and accomplish general tasks. A real-world example can be seen in OpenAI's algorithm, which learned how to control a robotic hand to solve a Rubik's Cube under disturbances [12].

Literature Review
One popular RL algorithm in process systems, ANN-PSO, trains an artificial neural network (ANN) to model the system and uses a global optimization method, such as particle swarm optimization (PSO), to select optimal inputs. ANN-PSO is an off-policy algorithm, meaning it can train using data previously gathered under normal operating conditions. ANN-PSO has been shown to be effective in real case studies.

Contributions
While RL applications in process systems are an active area of research, few papers compare the algorithms' applicability and performance in process systems. To investigate whether the widely used ANN-PSO can be replaced by newer actor-critic methods, this paper presents a novel comparison between two algorithms, ANN-PSO and PPO, by comparing their methods and evaluating them on a case study of a stochastic steady-state chemical optimization problem. To explore both algorithms' potential to achieve an optimal policy, ANN-PSO and PPO are also compared to two benchmark control strategies: nonlinear programming on a first principles model and maximum production. The algorithms' performance and applicability are compared based on implementation feasibility, system optimization, training efficiency, and online computational time. Furthermore, insight into the behavior of these algorithms is gained through parity plots of the agents' predicted and actual profits and sensitivity analysis of the agents' actions. Below is a summary of the novel contributions of this work.
1. This work represents a first-of-its-kind comparison study between PPO and other methods for real-time optimization. Specifically, comparisons are made to maximum production operation (no optimization), optimization using an ML model (artificial neural network) and particle swarm optimization (ANN-PSO), and optimization using a first principles model and gradient-based nonlinear programming.
2. Our results demonstrate that PPO increases profitability by 16% compared to no optimization. It also outperforms ANN-PSO by 0.6% and comes remarkably close to matching the performance of the FP-NLP method, reaching 99.9% of FP-NLP profits.
3. Though more time must be invested in training the system, PPO reduces online computational times, computing 10 and 10,000 times faster than FP-NLP and ANN-PSO, respectively.
4. ANN-PSO has higher training efficiency than PPO, as ANN-PSO converges with ≈10^5 training examples while PPO converges to an optimal policy with ≈10^6 training examples.
5. Parity plots suggest ANN-PSO's lower profits are due to PSO exploiting errors in the ANN, causing ANN-PSO to consistently overpredict profit and select suboptimal actions.
6. Comparing PPO and ANN-PSO, PPO has better performance, as shown by its higher profits and faster computational times; ANN-PSO has better applicability, as shown by its higher training efficiency and capability to train on historical operational data.
The following are the sections and the material they cover. Section 2 introduces the optimization methods and their theory. Section 3 presents the case study of optimizing steady state operation of a continuously stirred tank reactor. Section 4 describes how the optimization methods are implemented and evaluated. Section 5 presents and analyzes the results from the case study. Section 6 summarizes our findings and suggests future work.

Markov Decision Process
An optimization problem is considered where an agent interacts with an environment which is assumed to be fully observable. This problem can be formulated as a Markov Decision Process (MDP) where the environment is described by a set of possible states S ⊆ ℝ^n, possible actions A ⊆ ℝ^m, a distribution of initial states p(s_0), a reward distribution function R(s_t, a_t) given state s_t and action a_t, a transition probability p(s_{t+1} | s_t, a_t), and a future reward discount factor γ. An episode begins with the environment sampling an initial state s_0 ∈ S with probability p(s_0). The agent then selects action a_t ∈ A and receives a reward r_t sampled from R(s_t, a_t). The environment then samples s_{t+1} ∈ S with probability p(s_{t+1} | s_t, a_t). The agent selects another action, and this process continues until the environment terminates. This process is shown in Figure 1. The two algorithms of interest, ANN-PSO and PPO, and a benchmark algorithm operate as the agent in the MDP to select actions. All three algorithms are briefly discussed in Sections 2.2-2.4 and are summarized in Figure 2.
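The agent-environment loop described above can be sketched in a few lines. The environment below is a toy stand-in (its dynamics, reward rule, and termination probability are illustrative placeholders, not the CSTR model from the case study); only the structure of the MDP interaction is the point.

```python
import random

class Environment:
    """Toy fully observable environment for the MDP loop described above.
    The dynamics and reward here are illustrative placeholders, not the
    CSTR model from the case study."""

    def reset(self):
        # Sample an initial state s_0 from p(s_0).
        self.state = random.uniform(0.0, 1.0)
        return self.state

    def step(self, action):
        # Sample a reward from R(s_t, a_t) and the next state from
        # p(s_{t+1} | s_t, a_t); here both are simple stochastic rules.
        reward = -(self.state - action) ** 2 + random.gauss(0.0, 0.01)
        self.state = min(1.0, max(0.0, self.state + random.gauss(0.0, 0.1)))
        done = random.random() < 0.05  # environment terminates at random
        return self.state, reward, done

def run_episode(env, policy, max_steps=100):
    """Agent-environment loop: observe state, act, receive reward, repeat."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

random.seed(0)
ret = run_episode(Environment(), policy=lambda s: s)  # identity policy
```

Any of the three algorithms compared in this paper can be dropped in as `policy`; only how the mapping from state to action is obtained differs.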

Artificial Neural Network with Particle Swarm Optimization (ANN-PSO)
For ANN-PSO, an artificial neural network is trained to approximate the environment reward function by mapping states and actions to an expected reward. This is done by minimizing the mean squared error between predicted and actual rewards on collected state-action pair data, as shown in Figure 2 (ANN-PSO offline preparation section). Once the ANN accurately predicts rewards from states and actions, particle swarm optimization (PSO) is used on the ANN to find actions that maximize the expected reward for a given state, as shown in Figure 2 (ANN-PSO online computation section). ANN-PSO is an off-policy algorithm, meaning the algorithm can learn from data collected from other policies. ANN-PSO's weaknesses include overfitting to noise in stochastic environments and being computationally expensive online, since PSO performs many iterations to find optimal actions for each state. Below is a brief explanation of ANN-PSO, which uses Mnih et al.'s Q-learning formulation to train the ANN and Kennedy and Eberhart's PSO algorithm [37,38] to search for optimal actions. To maximize the reward, a state-action value function, Q^π(s, a), is defined in Equation (1). This equation states that, given an initial state, the policy's value of a state-action pair is the expected cumulative reward of policy π.
π is a policy mapping states to actions. An optimal state-action value function Q*(s, a) satisfies the Bellman equation, which is shown in Equation (2).
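Equations (1) and (2) are not reproduced in this excerpt; in the standard Q-learning formulation the text refers to [37], they take the following forms (using the γ and R(s_t, a_t) defined in the MDP section):

```latex
% Eq. (1): value of a state-action pair under policy \pi
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}
  \,\middle|\, s_t = s,\; a_t = a\right]

% Eq. (2): Bellman optimality equation for Q^{*}
Q^{*}(s,a) = \mathbb{E}_{s_{t+1}}\!\left[ R(s_t,a_t)
  + \gamma \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1})
  \,\middle|\, s_t = s,\; a_t = a\right]
```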
With Equation (2), Q* can be estimated through iteration with a function approximator, such as a linear or nonlinear function. In this paper, an ANN is used to estimate Q*. Training is done by discretizing each dimension of the state set S and action set A into N spaces. Every combination of the discretized states and actions is passed to the system, which returns the reward. The neural network is then trained to estimate the reward given a state-action pair. For stochastic environments, the ANN may overfit to noise, which causes a plant-model mismatch.
To find the maximum-value state-action pair for the estimated Q*, PSO is used. PSO is a derivative-free optimizer which uses multiple particles, collectively called the swarm, to search the solution space for an optimal action. The particles are initially given random positions and velocities and are then guided by each particle's individual best and the swarm's best positions to find the optimal solution of function f. The algorithm is shown in Figure 2 (ANN-PSO online computation section).
For ANN-PSO, the actions are the positions and the function to be maximized is the reward returned by the ANN. For every new state, PSO searches for optimal actions. This process is computationally expensive since there are many particles and iterations that must be performed for every new state encountered.
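The particle updates described above can be sketched as follows. This is a minimal Kennedy-and-Eberhart-style PSO (inertia weight `w` and cognitive/social coefficients `c1`, `c2` are typical but assumed values, as are the swarm size and iteration count); in ANN-PSO the objective `f` would be the trained ANN's predicted reward, for which a concave toy function stands in here.

```python
import numpy as np

def pso_maximize(f, bounds, n_particles=30, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimizer (Kennedy & Eberhart style).
    f: vectorized objective mapping (n_particles, dim) -> (n_particles,).
    bounds: list of (low, high) per dimension."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    dim = len(bounds)
    x = rng.uniform(lo, hi, size=(n_particles, dim))   # positions
    v = np.zeros_like(x)                               # velocities
    pbest, pbest_val = x.copy(), f(x)                  # personal bests
    g = pbest[np.argmax(pbest_val)].copy()             # swarm best
    for _ in range(n_iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # Velocity pulled toward each particle's best and the swarm's best.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        val = f(x)
        improved = val > pbest_val
        pbest[improved], pbest_val[improved] = x[improved], val[improved]
        g = pbest[np.argmax(pbest_val)].copy()
    return g, pbest_val.max()

# Maximize a concave toy "reward surface" standing in for the trained ANN.
best_x, best_val = pso_maximize(lambda x: -((x - 0.3) ** 2).sum(axis=1),
                                bounds=[(0, 1), (0, 1)])
```

Because this whole loop must rerun for every new state encountered online, the computational cost noted in the text scales with both the swarm size and the iteration count.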

Proximal Policy Optimization (PPO)
While ANN-PSO finds the optimal action by learning the values of state-action pairs and searching for the best action, Proximal Policy Optimization [39] learns an optimal policy to sample actions given states. PPO is an actor-critic algorithm: a policy network is trained to maximize the cumulative reward, while a critic is trained to estimate how much better the new policy's actions are, pushing the probabilities of those actions up or down. PPO is an on-policy algorithm, meaning it learns from data collected under its own policy, and thus requires interaction with the environment.
The PPO training algorithm is shown in Figure 2 (PPO offline preparation section). Online, the mean action is selected from π_θ(a_t|s_t), as shown in Figure 2 (PPO online computation section).
For PPO, the policy, denoted as π(a|s), is an action probability distribution function given a state.
π(a|s) is approximated using a neural network π_θ(a|s) with parameters θ optimized to maximize a utility function U(θ). U(θ) is shown in Equation (3) and represents the policy's expected reward from time t to time t + t_f starting from state s_t.
τ denotes the possible trajectories from state s_t, r(τ) is the discounted reward of τ, and p(τ|θ) is the probability of τ given parameters θ. The action probability distribution is typically a normal distribution. Instead of directly maximizing Equations (3) and (4), a surrogate objective with the same gradient is maximized so that constraints can be applied to limit update step sizes.

However, using r(τ) in U(θ) poses a problem, as the agent increases the probabilities of all trajectories with positive reward, which leads to high variance and low sample efficiency. Thus, r(τ) is replaced with a generalized advantage estimate, Â_τ, which is shown in Equations (5a) and (5b).
λ is a factor trading off bias vs. variance, and V(s_t) is the expected episode cumulative discounted reward of state s_t given π_θ(a|s). V(s_t) is approximated with a neural network V_θ(s_t), which, in this paper, shares parameters with π_θ(a|s).
Â_τ provides a baseline by comparing the reward received for action a_t to the previous policy's estimated value of state s_t, pushing the agent to increase the probabilities of trajectories with higher rewards than the current policy expects. Combining Equations (4) and (5a) yields Equations (6a) and (6b). To prevent the agent from making excessively large updates, the surrogate objective is clipped as shown in Equation (7).
ε is a hyperparameter that limits the policy updates. This limits how much the policy changes the probabilities of trajectories compared to the old policy, since η_τ(θ) denotes the trajectory probability ratio between the new and old policies. As the policy and value function neural networks share parameters in this paper, the objective function incorporates the value function error. An entropy bonus is also added to encourage the agent to explore and avoid premature convergence to a suboptimal solution. The final objective is shown in Equations (8a) and (8b).
c_1 and c_2 are hyperparameters, S[π_θ](s_t) is an entropy bonus function, and V_t^targ is the observed cumulative discounted reward from time t to the end of the episode.
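The referenced equations are not reproduced in this excerpt; in the standard PPO formulation of Schulman et al. [39], the probability ratio, clipped objective, and combined objective take the following forms (using the symbols defined here, with ε the clip hyperparameter):

```latex
% Probability ratio between new and old policies
\eta_\tau(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

% Eq. (7)-style clipped surrogate objective
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(\eta_\tau(\theta)\,\hat{A}_\tau,\;
  \operatorname{clip}(\eta_\tau(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_\tau\big)\right]

% Eq. (8)-style combined objective with value loss and entropy bonus
L^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\!\left[ L^{CLIP}(\theta)
  - c_1\,\big(V_\theta(s_t) - V^{targ}_t\big)^2 + c_2\, S[\pi_\theta](s_t)\right]
```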

First Principles with Nonlinear Programming (FP-NLP)
Maximizing the reward by applying nonlinear programming (NLP) to a first principles model (FP-NLP) is a reliable and well-accepted approach [40][41][42]. Thus, FP-NLP is used as a benchmark to evaluate the performance of the two previous methods. Using FP-NLP requires a thorough understanding of the system's governing equations, such as fundamental energy and mass balances, to formulate the first principles model. This model can then be cast as an optimization problem and solved with NLP algorithms such as the interior-point method IPOPT. While this method does not require learning like ANN-PSO and PPO, the need for an accurate first principles model makes it difficult to implement in systems where accurate models are hard to develop. Furthermore, FP-NLP requires an iterative line search, as shown in Figure 2 (FP-NLP online computation section), which can lead to long computational times.
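The FP-NLP workflow can be sketched as follows. The paper solves its model with GEKKO and IPOPT; here SciPy's SLSQP is substituted so the sketch is self-contained, and the profit function, its Arrhenius-like response, and the prices are hypothetical placeholders rather than the paper's CSTR model.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative stand-in for FP-NLP: maximize profit of a toy first principles
# model subject to bounds, using a gradient-based NLP solver. The paper uses
# GEKKO with IPOPT; SLSQP is substituted here so the sketch is self-contained.
# The profit function and prices below are hypothetical, not the CSTR model.

def profit(u, prices):
    vb, T = u                                # decision variables: flow of B, temperature
    conversion = 1.0 - np.exp(-T / 400.0)    # toy Arrhenius-like response
    production = vb * conversion
    return prices["C"] * production - prices["B"] * vb - prices["q"] * T

prices = {"B": 1.0, "C": 3.0, "q": 0.002}
res = minimize(lambda u: -profit(u, prices),     # NLP solvers minimize
               x0=[0.5, 350.0],
               bounds=[(0.0, 2.0), (300.0, 450.0)],
               method="SLSQP")
best_vb, best_T = res.x
```

Each new set of prices requires re-solving this problem, which is the iterative online cost the text attributes to FP-NLP.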

Case Study
These methods were evaluated by having them perform economic optimization of a steady-state continuously stirred tank reactor (CSTR) model, as shown in Figure 3. The values of the model constants are listed at the end of this section in Table 1. The CSTR is assumed to be well mixed. A mass balance with species A and B flowing in at rates v̇_A and v̇_B, respectively, is shown in Equation (9).
c is a valve constant and h is the liquid height in the tank. Inside the CSTR, an elementary, liquid-phase, endothermic, irreversible reaction occurs, as shown in Equation (10).
The rate −r A is given by Equation (11).
C_A and C_B are the concentrations of A and B in the reactor, and k is related to temperature through the Arrhenius equation, as shown in Equation (12).
A is the Arrhenius pre-exponential factor, E_a is the activation energy of the reaction, R is the universal gas constant, and T is the temperature of the reactor.

As the system is at steady state, the mole and energy balance equations for the system are written as Equations (13)-(16).
A and B are assumed to have similar densities and heat capacities, ρ and c_p, respectively, which remain constant over the operating temperature range. A and B enter with concentrations C_A,0 and C_B,0, respectively, T_0 is the feed temperature of both species, V is the volume of the reactor, ∆H_rxn is the heat of reaction, and q is the heat input.
Thus, C_C and q can be written as Equations (19) and (20).
The cost-per-time (CPT) function is defined as Equation (21).
P_i is the price of commodity i and is sampled from a uniform distribution ranging from P_i,low to P_i,high. The agents can manipulate v̇_B within the range [0, v̇] and the temperature set point (Temp SP) within [T_0, T_ub], where T_ub is the upper temperature limit for steady-state operation of the CSTR. The goal is to minimize the cost, so the optimization problem can be formulated as Equation (22a) subject to Equations (17)-(20). For the MDP formulation, the possible states are all combinations of prices, and the possible actions are all combinations of v̇_B and T. p(s_0) is a uniform distribution over the ranges of all four prices, and the reward function is −CPT in Equation (21).
To make the model more realistic, feed concentrations and temperatures were made stochastic, sampled from normal distributions with given means and a relative standard deviation of σ. Actions were also perturbed by sampling from a normal distribution with a mean of the chosen action and a relative standard deviation of σ. For each set of prices, the agent selects one action, the simulation is repeated N times, and the average reward is calculated and returned.
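The repeated-simulation averaging described above can be sketched as follows. The one-line steady-state "model" inside the loop is a hypothetical placeholder for the CSTR equations, and the variable names (`vb`, `T`, `CA0`, `T0`) are illustrative; only the noise-and-average structure mirrors the text.

```python
import random
import math

# Sketch of the stochastic evaluation loop described above: for a fixed action,
# perturb feed conditions and the action itself with relative noise sigma, run
# the (toy) steady-state model N times, and return the average reward. The
# one-line "model" here is a hypothetical placeholder for the CSTR equations.

def noisy(mean, sigma, rng):
    """Sample with relative standard deviation sigma."""
    return rng.gauss(mean, sigma * mean)

def average_reward(action, feed_means, sigma=0.02, n_repeats=50, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_repeats):
        ca0 = noisy(feed_means["CA0"], sigma, rng)   # feed concentration of A
        t0 = noisy(feed_means["T0"], sigma, rng)     # feed temperature
        vb = noisy(action["vb"], sigma, rng)         # perturbed action
        temp_sp = noisy(action["T"], sigma, rng)
        # Toy steady-state "profit": rises with vb, Arrhenius-like in T.
        k = math.exp(-800.0 / temp_sp)
        total += ca0 * vb * k - 1e-4 * (temp_sp - t0)
    return total / n_repeats

r = average_reward({"vb": 1.5, "T": 320.0},
                   {"CA0": 1.0, "T0": 300.0})
```

Averaging over N repeats smooths the reward signal the agents train on, though, as discussed later, the ANN can still overfit to the residual noise.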

Evaluation
Simulations were run on an Intel(R) Core(TM) i7-6500U CPU with 2.50 GHz clock rate and 8 GB of RAM.

ANN-PSO Implementation
The neural network was trained to map prices and actions to profits. The prices, P_A, P_B, P_C, and P_q, and actions v̇_B and T were each discretized into N spaces. N ranged from 3 to 14 to test the performance of ANN-PSO given various amounts of training data. All combinations of the prices and actions (N^6) were fed through the reactor simulation, and the profits were returned. The ANNs were configured and trained using Keras [43] with the Adam optimizer. The networks were trained on min-max scaled data, and k-fold validation was used to evaluate them. The network architecture was optimized by varying the number of layers, activation functions, regularization coefficients, and learning rate. The network trained with N = 13 discretization points performed best and was used for algorithm comparisons. PSO was performed using Pyswarms [44]. The hyperparameters can be found in Table A1.
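Generating the N^6 training set described above amounts to taking the Cartesian product of the discretized variables. A minimal sketch follows; the variable ranges are only loosely based on values quoted elsewhere in the paper (P_B and P_C bounds and the action bounds are assumptions), and the grid here would be fed to the simulation and then to the ANN.

```python
import itertools

# Sketch of how the N^6 training set described above can be generated: each of
# the four prices and two actions is discretized into N points and every
# combination is simulated. Some ranges below are placeholders, not the
# paper's exact values.

def linspace(lo, hi, n):
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def build_training_grid(ranges, n=3):
    """ranges: dict of variable -> (low, high). Returns a list of dicts,
    one per state-action combination."""
    names = list(ranges)
    axes = [linspace(lo, hi, n) for lo, hi in ranges.values()]
    return [dict(zip(names, combo)) for combo in itertools.product(*axes)]

ranges = {"P_A": (2.0, 20.0), "P_B": (20.0, 35.0), "P_C": (5.0, 17.0),
          "P_q": (0.0, 2e-5), "vb": (0.0, 2.0), "T": (300.0, 450.0)}
grid = build_training_grid(ranges, n=3)   # 3**6 = 729 state-action pairs
```

At the paper's best-performing N = 13, this product already contains 13^6 ≈ 4.8 million combinations, which is why the discretization level directly controls the training-data requirement.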

PPO Implementation
PPO training was done using Stable Baselines, an RL Python package [45]. The PPO agents were trained for 14^6 ≈ 7.5 × 10^6 time steps, and agents were saved periodically to investigate performance improvement over training time. The agents were trained in a normalized, vectorized environment, and the network used two fully connected layers of 64 units followed by a long short-term memory layer of 256 units, which was then split into the value and policy functions. Hyperparameters such as the entropy coefficient, value function coefficient, activation functions, and number of epochs were varied to optimize agent performance. The final network hyperparameters, as well as the particle swarm hyperparameters, are listed in Table A2. For evaluation, the means of the action probability distributions were used.

Benchmark Algorithms
For FP-NLP, the case study model was coded in GEKKO, an optimization suite [46]. IPOPT was then used as the NLP solver to minimize the cost. Another strategy evaluated was maximizing the production of product C regardless of prices. The actions for maximum production were T = T_ub and v̇_B = v̇/2.

Testing and Metrics
To evaluate the agents' overall performance in profitability, sample efficiency, and computational time, the ANN-PSO agents, PPO agents, FP-NLP, and maximum production strategy chose actions for the same 1000 random price combinations. The random prices are not necessarily examples the agents have seen before, but they are within the range of prices the agents have encountered. The received profits from the stochastic system, the predicted profits, and the computational times were saved and compared between algorithms. To examine the agents' actions and their effects on profits, a sensitivity analysis was performed by holding the prices of B and C at their average values of 27.5 $/m^3 and 11 $/kmol, respectively. The prices of A and q were discretized into 500 points and ranged from 2 to 20 $/m^3 and 0 to 2 × 10^-5 $/kJ, respectively. The agents' selected actions for each price combination and the profits returned by the simulation were recorded. For the sensitivity analysis, the simulation was made deterministic by setting all random variables to their means.

PPO Achieves Higher Profits, but ANN-PSO Has Better Training Efficiency
The methods' average profits for the 1000 random price evaluations are shown in Figure 4. FP-NLP has the highest average profit at $139.28/min, followed by PPO's best profit of $139.20/min. ANN-PSO's best profit is $138.37/min, and maximum production has a mean profit of $120.23/min. Both RL methods achieve a higher profit than the maximum production strategy, with ANN-PSO and PPO increasing the average profit by 15% and 16%, respectively. PPO also performs nearly as well as FP-NLP, achieving 99.9% of FP-NLP's profit. Figure 4 also shows ANN-PSO has better training efficiency than PPO: ANN-PSO's profit overtakes the maximum production benchmark with ≈4.1 × 10^3 data points and plateaus with ≈1.2 × 10^5 data points, whereas PPO overtakes the maximum production benchmark with roughly four times the data, at ≈1.6 × 10^4 data points, and plateaus at ≈2.6 × 10^5 data points. PPO's lower training efficiency is due to needing to learn both the policy and the value of the state, while ANN-PSO only needs to learn the values of state-action pairs. Furthermore, ANN-PSO's exploration is determined by the fed data, while PPO explores based on its stochastic policy. This allows for easier implementation of ANN-PSO, since normal operation data, such as when PID controllers or expert personnel control the process, can be used as training data.
More training examples would likely not improve the performance of either method, as ANN-PSO's and PPO's average profits start oscillating near 10^7 training examples.
While ANN-PSO's training efficiency is appealing, PPO still converges to a policy with higher profits. This can be explained by examining the neural networks' prediction accuracy, shown by the parity and residual plots in Figure 5. Figure 5a shows that ANN-PSO's neural network accurately predicts the agent's profits, as seen by the high R^2 value of 0.998. However, the residual plot shows that the prediction error has a mean of almost $3/min, indicating the prediction network overestimates the profit. This is due to the PSO algorithm exploiting small errors in the neural network that arise from overfitting to stochastic data. Figure 5b shows that PPO's value network has a similarly high R^2 value. Its residual plot shows that PPO's profit prediction errors are evenly distributed around 0, with a mean of −0.01. PPO does not overpredict the profit because of its update clipping, stochastic policy, and online training. The update clipping prevents both the critic from overfitting to noise and the actor from exploiting the critic's prediction errors. The stochastic policy helps further prevent critic error exploitation, and the online training allows the PPO agent to explore actions with high predicted profit and update the critic with the observed reward.

Figure 6 shows the computational times of the algorithms to return the optimal action. ANN-PSO has the longest average computational time at 41 s. PPO and FP-NLP have relatively fast computational times of 0.002 s and 0.02 s, respectively. This highlights the speed of PPO, whose online computation is a single forward pass through the policy neural network. Both FP-NLP and ANN-PSO require iterations to find the optimal action, resulting in longer computational times.

Sensitivity Analysis
The results of the sensitivity analysis are shown in Figure 7. For these pricing cases, the temperature set point is the main variable affecting profit, while the flow of B can be held nearly constant. All three methods have similar profit contour plots, but ANN-PSO has a more jagged profile and an overall darker contour, indicating the agent earned a lower profit than the other methods. This is caused by PSO exploiting the errors in the ANN, which is shown clearly in ANN-PSO's flow of B actions. While FP-NLP's flow of B ranges from 1.63 to 1.68 m^3/min, ANN-PSO's ranges from 1.6 to 1.73 m^3/min. Overall, the ANN-PSO contour plot appears random, showing how PSO finds apparently optimal actions in areas where the ANN overpredicts the reward. The erroneous flow of B may contribute to the suboptimal temperature set point selection, as ANN-PSO typically selects lower temperatures than FP-NLP when P_q is between 0.25 and 0.75 $/kJ. The temperature profile is also jagged, indicating that PSO is again exploiting the ANN's overpredictions. PPO, on the other hand, has smoother actions, and its profiles appear very similar to FP-NLP's. The main reason for the smooth action contours is that, instead of optimizing over a complicated ANN, PPO maps states to actions with a continuous function. The main difference between FP-NLP's and PPO's actions is in the flow of B, but the difference is negligible. PPO chooses actions based on training in the stochastic environment, while FP-NLP chooses actions based on a perfect deterministic model. This causes PPO's actions to be slightly suboptimal in the tested deterministic environment compared to FP-NLP.

Conclusions
In this study, we compared the performance and applicability of two RL algorithms, ANN-PSO and PPO, by exploring their methods and applying them to stochastic steady-state economic optimization of a CSTR, with FP-NLP as a benchmark algorithm [4][5][6]. We evaluated the RL algorithms' performance by their profitability and online computational times, and their applicability by their data requirements and training efficiencies. On the case study, PPO shows better performance, as its average profits are higher than ANN-PSO's and its online computational times are faster. ANN-PSO shows better applicability, as the algorithm can train on normal operational data and converges to a policy with less training data. Both algorithms perform similarly to FP-NLP, with ANN-PSO and PPO achieving 99.3% and 99.9% of FP-NLP's average profit, respectively. PPO has the fastest online computational time, 10 and 10,000 times faster than FP-NLP and ANN-PSO, respectively. While these results are based on this case study, they may be useful guidelines for more complex applications.
Investigation of the ANN-PSO residual plots reveals that PSO exploits overpredictions of the ANN, resulting in suboptimal actions. PPO does not have this issue, as it uses update clipping and a stochastic policy. ANN exploitation is also seen in a similar algorithm, deep deterministic policy gradient. Some methods to combat this issue are to clip the artificial neural network updates and to perturb actions to prevent ANN error exploitation [47].
As ANN-PSO has high training efficiency, researching methods that limit value overprediction to improve performance, such as training two ANNs and taking the minimum predicted profit of the two, would be beneficial. Other off-policy algorithms, like twin-delayed deep deterministic policy gradients and V-trace, should also be investigated to observe their performance on a chemical system after training on operational data. Other improvements to ANN-PSO include decreasing its computational time; our previous work shows that replacing PSO with an actor ANN can significantly reduce the online computational time while still finding an optimal policy [48]. As PPO and other on-policy algorithms require interaction with an environment, their ability to augment FP-NLP should be explored. Another research topic is how these algorithms perform on more complex systems or when the system changes, such as with multiple reactions, reversible reactions, or heat exchanger fouling.

General symbols
π: Function which maps states to actions
Q^π(s, a): Value of a state-action pair given policy π

PPO symbols
ε: Coefficient to limit policy updates in L^CLIP
η_τ(θ): Trajectory τ probability ratio between new and old policies
Â_τ: Generalized estimated advantage of parameters θ over θ_old
λ: Factor for bias vs. variance trade-off
τ: Trajectory starting from state s_t
θ: Vector of policy and value neural network parameters
c_1: Coefficient to weight the value function loss in L^{CLIP+VF+S}(θ)
c_2: Coefficient to weight the entropy bonus in L^{CLIP+VF+S}(θ)
L^{CLIP+VF+S}(θ): Final objective function to maximize L^CLIP, minimize the value function error, and encourage exploration
L^CLIP(θ): Clipped objective function to prevent excessively large updates
L^VF(θ): Mean squared error between V_θ(s_t) and V_t^targ
r(τ): Discounted reward of trajectory τ
S[π_θ](s_t): Entropy bonus function for exploration encouragement
t_f: Time length of trajectories τ
U(θ): Policy's expected cumulative reward from time t to t + t_f starting from state s_t given parameters θ
V_θ(s_t): Expected episode cumulative discounted reward given policy with parameters θ
V_t^targ: Observed cumulative reward from time t to the end of the episode

Case study symbols
∆H_rxn: Heat of reaction

Tables

Table A1. ANN-PSO hyperparameters.

Hyperparameter: Value
Adam learning rate: 1 × 10^-4
Adam exponential decay rate for the first-moment estimates: 0.9
Adam exponential decay rate for the second-moment estimates: 0.999
Adam small number to prevent any division by zero: 10