A New Approach to Solving Stochastic Optimal Control Problems

A conventional approach to solving stochastic optimal control problems with time-dependent uncertainties involves the use of the stochastic maximum principle (SMP) technique. For large-scale problems, however, such an algorithm frequently leads to convergence complexities when solving the two-point boundary value problem resulting from the optimality conditions. An alternative approach consists of using continuous random variables to capture uncertainty through sampling-based methods embedded within an optimization strategy for the decision variables; such a technique may also fail due to the computational intensity involved in excessive model calculations for evaluating the objective function and its derivatives for each sample. This paper presents a new approach to solving stochastic optimal control problems with time-dependent uncertainties based on BONUS (Better Optimization algorithm for Nonlinear Uncertain Systems). The BONUS has been used successfully for non-linear programming problems with static uncertainties, but we show here that its scope can be extended to the case of optimal control problems with time-dependent uncertainties. A batch reactor for biodiesel production was used as a case study to illustrate the proposed approach. Results for a maximum profit problem indicate that the optimal objective function and the optimal profiles were better than those obtained by the maximum principle.


Introduction
Several phenomena from various fields are described through dynamic models. Due to the unsteady nature of these systems and the complexity involved in describing the parameters or variables included, the incorporation of uncertainties is usually needed to represent the inherent unsteady behavior. Stochastic processes result from the uncertain behavior of dynamic variables and are represented by using any probability law for the evolution of the variables x over time t, x t . Thus, for given time values t 1 < t 2 < t 3 . . . < t n , one can calculate the probability that the corresponding variable values, x 1 , x 2 , x 3 . . . x n , lie on some specific ranges. When time t 1 arrives and we observe the actual value of x 1 , we can condition the probability of future events on this information [1]. One of the commonly used stochastic processes is the Brownian motions such as the one-dimensional random walk process, which is a random process where the value of the variable x at time t depends only and exclusively on the value in the previous time t − 1, plus an uncertain term with zero mean and unit variance. Thus, the random walk satisfies the general Wiener process properties [2], falling into this category.
Wiener processes or Ito processes can act as basic pieces to model a wide range of complex continuous and stochastic dynamic variables [3]. In an Ito process, the change over time of the state variable, dx, has two components: a deterministic term and a stochastic component. Ito processes are mainly applied to financial analyses and stock price modeling to represent time-dependent uncertainties; nevertheless, their scope of application could be considerably extended [1].
Recently, Ito processes have been used to approach dynamic systems under uncertainty in various engineering fields. The use of optimal control theory in systems described by SDE (Stochastic Differential Equations) results in stochastic optimal control problems (SOCPs). Stochastic optimal control problems imply the improvement of the system performance by the determination of the optimal profiles of both the control variables and the uncertain parameters.
Optimal control usually requires the calculation of time derivatives; however, stochastic processes do not follow the ordinary rules of calculus. It is necessary to use tools that allow differentiate and integrate stochastic processes like Ito's lemma [1]. This makes the problem of stochastic optimal control a difficult problem to solve. Most of the problems involving stochastic optimal control have been solved in literature using stochastic dynamic programming. A book by Andrew [4] in this area provides the approach as well as applications. The applications are mostly in finance. However, stochastic dynamic programming involves partial differential equations that are difficult to solve. Rico-Ramirez and Diwekar [5,6] used this mathematical tool to propose the stochastic version of the Maximum Principle for solving SOCPs. Such an approach has been successfully used in various fields. For instance, in engineering, Ulas and Diwekar [7] addressed thermodynamics uncertainties in the batch separation process. Diwekar's group members use this method to obtain optimal temperature profiles for both crystallization [8] and biodiesel production in batch reactors [9,10]; in the area of ecology, the approach was used to establish optimal guidelines and strategies for pH control in lakes [11] and ecosystem management [12]; whereas in medicine, the stochastic maximum principle was used for predicting optimal drug dosage in both insulin-dependent diabetic patients [13] and in the superovulation stage of in vitro fertilization [14].
Indeed, the use of the stochastic maximum principle to approach the solution of SOCPs presents several advantages, and it has shown to be useful in different kinds of applications. However, it is also limited in certain respects; for instance, it cannot handle constraints in control variables. Further, there are optimal control problems where the mathematical model is completely or partially unknown (black box models); then, the existing theory cannot be used to derive the required optimality conditions. Also, the solution to the resulting two-point boundary value problem for large-scale systems is quite complex and sometimes unmanageable for standard numerical methods.
An alternative approach that can be used to avoid such limitations and to solve the SOCP is the use of stochastic non-linear programming (SNLP) techniques by the discretization of the SDE. Generally, there are two approaches to solve SNLP problems [15]: (i) Structures identification and transformation techniques that convert the problem into a deterministic non-linear programming problem [16] and (ii) decomposition techniques that divide the problem into sub-problems or stages, like the L-shaped decomposition [17]. Another approach that can be used is to capture uncertainty by an embedded sampling loop for uncertain parameters in the optimization process, representing the discretized uncertainty space, giving an interactive stochastic model. However, due to the model having to be re-run for every sample point and the large scale that these problems usually have, this approach can be computationally expensive.
To avoid these high computational requirements, Sahin and Diwekar [18] proposed an algorithm named Better Optimization of Nonlinear Uncertain Systems (BONUS) which reduces a large number of model evaluations by a reweighting scheme using Kernel Density Estimation. BONUS has been successfully proved for solving a variety of stochastic non-linear problems (SNLPs) such as the minimization of water consumption in a coal power plant [19], optimal sensor placement [20,21], and water management in cooling-constrained power plants [22].
This work intended to take the advantages of the BONUS algorithm and extend its scope in order to solve coupled design and control problems; such problems require an integrated strategy that can deal with both uncertainties and decision variables either static or time-dependent nature. In the present work, we are then presenting a new approach for solving SOCPs based on the use of the BONUS algorithm. To illustrate our approach, the batch reactor for biodiesel production studied by Benavides and Diwekar [9] was used as our case study. The inherent uncertainties in the biodiesel production process may imply a negative impact on the process economics and a huge variation in the product quality; since those uncertainties are better represented as time-dependent uncertainties, the problem is an interesting instance of a SOCP where the optimal temperature profile that maximizes the profit needs to be determined.

Case Study: Production of Biodiesel in a Batch Reactor
Due to the increasing environmental concerns, recent literature reports significant efforts to develop clean and renewable energy sources and to reduce the consumption of fossil fuels. Biodiesel has become one of the most promising green alternatives to minimize the use of fossil energy, generally blended with fossil diesel, keeping the right balance of fuel performance, cost, and environmental impact [23]. Biodiesel can be produced from a wide diversity of feedstock like vegetable oils, animal fats, microorganisms, or waste cooking oil through a variety of technologies such as microemulsion of oil, pyrolysis, and fractional cracking [24]. Among these, transesterification is the most common industrial alternative pathway to biodiesel production, mainly due to the biodiesel yield and reaction time [25], usually done in a batch reactor due to the high biodiesel yields and short reaction time [26].
Transesterification is a catalytic reaction of long-chain triglycerides (C14-C20) with short-chain alcohols (C1-C4) [27]. The reactive system consists of the production of three molecules of methyl ester (ME) from one molecule of triglycerides in a series of three reversive reactions: in the first, a molecule of triglyceride (TG) is converted to one diglyceride and one ME; in the second, the DG is converted to one monoglyceride (MG) and one ME; finally, the MG is converted to one glycerol (GL) and one ME [26]. Glycerol is usually obtained as a byproduct with high commercial value. Also, it can be refined to increase its purity or used to produce fuel additives through its esterification, improving the overall profit of the transesterification process [28]. A mathematical model to represent the biodiesel production dynamics was proposed by Noureddini and Zhu [29] (Equations (1)-(6)).
where F i represents the rate of each reaction, C TG , C DG , C MG , C E , C A , and C GL are the concentrations of TG, DG, MG, ME (main product), methanol (A), and glycerol (GL), respectively. The reaction constants, k j , are a function of the reaction temperature, T, and are formulated in terms of Arrhenius-like expressions (Equation (7)).
where a j is the frequency factor for that reaction, Ea j is the activation energy for each component j, and R is the universal gas constant.
Although biodiesel production has been widely studied, several challenges remain in order for biodiesel to succeed fossil-diesel as a replacement. Raw material price [30], energy consumption [31], time, and scale-up costs are the main among them [32]. Even though it is known that raw material is the highest cost in biodiesel processes, being about 70-75%, it is derived from the use of edible feedstock. With the growing use of no-edible biomass, this percentage can significantly decrease, leaving human labor, machines, and energy expense as other important costs in the overall process [33].
Time and temperature are two crucial factors to enhancing energy consumption in the batch reactor. On the one hand, the optimal reaction temperature is important due to the high temperature increase reaction rate, but it also requires a large heat duty incurring high operating costs. On the other hand, the optimal time not only affects the heat duty, but exceeding this time may result in a reversible reaction, decreasing the process yield [34][35][36]. Recent efforts have determined the optimal temperature profile as a key parameter for yield in biodiesel batch operations [37][38][39].
Furthermore, the type and number of triglycerides in bio-based materials varies considerably, significantly affecting the process economy and the product quantity and quality [40][41][42]. Then, it is necessary to consider such uncertainty in the feed composition, since it is one of the key factors in the process influencing the design, modeling, and control of it. Probabilistic techniques can be used to model this uncertainty and propagate it by the use of o iterative procedures [10].
In this paper, we used the dynamic model given by Equations (1)- (7) to formulate the stochastic maximum profit problem for biodiesel production. The problem consists of determining both the optimal temperature profile and the minimum batch time while maximizing the overall profit. The model further considers uncertainty in the variation in the feedstock composition. The objective function for this problem is given by Equation (8) [43]: where E represents the expected value, P r is the sales value of ME ($/kg), B O and Co are the quantity (kg) and cost ($/kg) of feed respectively which consists of the initial amount of triglycerides and methanol; t is batch time (min), and t s is the set-up time for each batch (min). This equation does not include a term derived from energy consumption, because feedstock represents 80% to 90% of the total production cost [44][45][46]. Table 1 presents the values of the parameters needed to calculate the objective function. In order to incorporate the inherent feedstock uncertainty into the formulation, this must be quantitatively characterized. Following the approach of Abbasi and Diwekar [40], soybean was selected as a raw material. Five main triglycerides components are represented through the probability distributions shown in Figure 1. However, these static uncertainties in batch reactors cause a time-dependent uncertain behavior of the system [10]. The presence of time-dependent uncertainties results in a SOCP difficult to solve by conventional algorithms. Benavides and Diwekar [10] proposed an approach to solving this SOCP on biodiesel batch production. They propagated the static uncertainties through the model to derive time-dependent uncertainties in the component concentrations. Then, Ito processes were used to model these uncertainties in order to use the stochastic maximum principle to solve the problem. The dynamics of biodiesel production were then represented by stochastic differential equations like the one shown in Equation (9).
where dCi represents the differential of the concentration of component i; Fi(C, t) is the right-hand side of the differential equations for the concentration of component i (Equations (1)-(6)); gi is the variance parameter; and εt is a random value which has a unit normal distribution with zero mean and unit standard deviation. The discrete form of Equation (9) is expressed for each component as: Further details of the resulting formulation can be reviewed in the work of Benavides and Diwekar [10]. Benavides and Diwekar [10] proposed an approach to solving this SOCP on biodiesel batch production. They propagated the static uncertainties through the model to derive time-dependent uncertainties in the component concentrations. Then, Ito processes were used to model these uncertainties in order to use the stochastic maximum principle to solve the problem. The dynamics of biodiesel production were then represented by stochastic differential equations like the one shown in Equation (9).
where dC i represents the differential of the concentration of component i; F i (C, t) is the right-hand side of the differential equations for the concentration of component i (Equations (1)-(6)); g i is the variance parameter; and ε t is a random value which has a unit normal distribution with zero mean and unit standard deviation. The discrete form of Equation (9) is expressed for each component as: Further details of the resulting formulation can be reviewed in the work of Benavides and Diwekar [10].

A New Approach for Stochastic Optimal Control Problems
This section describes the algorithm used in this work to solve the biodiesel production problem as a SOCP. Rather than applying the stochastic maximum principle technique, we used the BONUS algorithm as a numerical tool to optimize the stochastic system. The algorithmic steps are as follows.

Discretization
In order to solve a SOCP using NLP techniques, control variables and inputs must be represented as scalar values. In the maximum profit problem of the biodiesel production case, temperature, T, is a continuous control variable, which is a vector; therefore, it needs to be discretized in n intervals of equal temporality as is shown in Equation (16). The n value must be large enough to be a good representation of the continuous variable, but without compromising the computational requirements.
A conventional approach would also require the discretization of the state variables; in that way, each differential equation would become a set of n algebraic equations, making the problem extremely large. The approach used in this work avoids this complexity by keeping the state variables in their continuous form, including the SDE for the time-dependent uncertainties. However, during the integration of the differential state variables to perform function evaluations, the values of the decision variables are needed in any arbitrary continuous time. To perform the integration, a simplified linear behavior was assumed here for each of the discretized control variables, so that a continuous representation was obtained back by using their discrete values and linear interpolation (see Figure 2).

A New Approach for Stochastic Optimal Control Problems
This section describes the algorithm used in this work to solve the biodiesel production problem as a SOCP. Rather than applying the stochastic maximum principle technique, we used the BONUS algorithm as a numerical tool to optimize the stochastic system. The algorithmic steps are as follows.

Discretization
In order to solve a SOCP using NLP techniques, control variables and inputs must be represented as scalar values. In the maximum profit problem of the biodiesel production case, temperature, T, is a continuous control variable, which is a vector; therefore, it needs to be discretized in n intervals of equal temporality as is shown in Equation (16). The n value must be large enough to be a good representation of the continuous variable, but without compromising the computational requirements.
A conventional approach would also require the discretization of the state variables; in that way, each differential equation would become a set of n algebraic equations, making the problem extremely large. The approach used in this work avoids this complexity by keeping the state variables in their continuous form, including the SDE for the time-dependent uncertainties. However, during the integration of the differential state variables to perform function evaluations, the values of the decision variables are needed in any arbitrary continuous time. To perform the integration, a simplified linear behavior was assumed here for each of the discretized control variables, so that a continuous representation was obtained back by using their discrete values and linear interpolation (see Figure 2).

The BONUS Algorithm
The NLP numerical techniques require the evaluation of the model and objective function in each iteration in the optimization process. Also, perturbation analysis is commonly used to estimate the derivative information of each decision variable. Because of this, the number of model evaluations increases exponentially, making this approach prohibitive even for moderately complex problems. Being tested for various models [15,[19][20][21], the reweighting approach called BONUS [18] has demonstrated its potential to avoid such limitations. The BONUS performs a computational manipulation of the samples taken from a base sample space using a narrow reference distribution

The BONUS Algorithm
The NLP numerical techniques require the evaluation of the model and objective function in each iteration in the optimization process. Also, perturbation analysis is commonly used to estimate the derivative information of each decision variable. Because of this, the number of model evaluations increases exponentially, making this approach prohibitive even for moderately complex problems. Being tested for various models [15,[19][20][21], the reweighting approach called BONUS [18] has demonstrated its potential to avoid such limitations. The BONUS performs a computational manipulation of the samples taken from a base sample space using a narrow reference distribution at the values of discretized control variables. A set of distribution weights is obtained and used to reweight the output distribution function. Figure 3 illustrates the general BONUS approach. A uniform probability distribution was assumed for all of the decision variables along with specific probability distributions for the uncertain variables to generate the base input probability distribution for the analysis. The BONUS samples the solution space of the objective function only at the beginning of the analysis by using those base distributions. Then, the stochastic differential model was simulated for each sampling point, the result was the output base probability distribution. In each iteration, the input distribution changed as the decision variables changed. Using the reweighting scheme with the current and the base distribution, a new output distribution was estimated. As the optimization proceeded, the decision variables were determined by the optimizer. In this manner, in BONUS, the inner sampling loop is only used for the first iteration. A detailed description of the technique is presented in the book on BONUS [15]. at the values of discretized control variables. A set of distribution weights is obtained and used to reweight the output distribution function. Figure 3 illustrates the general BONUS approach. A uniform probability distribution was assumed for all of the decision variables along with specific probability distributions for the uncertain variables to generate the base input probability distribution for the analysis. The BONUS samples the solution space of the objective function only at the beginning of the analysis by using those base distributions. Then, the stochastic differential model was simulated for each sampling point, the result was the output base probability distribution. In each iteration, the input distribution changed as the decision variables changed. Using the reweighting scheme with the current and the base distribution, a new output distribution was estimated. As the optimization proceeded, the decision variables were determined by the optimizer. In this manner, in BONUS, the inner sampling loop is only used for the first iteration. A detailed description of the technique is presented in the book on BONUS [15].

BONUS: Extending the Scope of the Algorithm for SOCP with Time-Dependent Uncertainties
Despite all of the benefits that BONUS presents, it has the inconvenience that it was originally developed to deal only with static uncertainties. That is why the original algorithm was modified in order to solve SOCPs with time-dependent uncertainties. The modified BONUS, given in Figure 4, can be divided into two sections: initialization and NLP optimization. A brief description of the main steps is presented next.

Initialization
The decision variables are represented as a set of discrete values as is described in Section 2.2.1. Once the decisions variables have been discretized, uniform probability distributions are assumed for all of them to generate the base input probability distribution for the analysis f(ui*) through the kernel density estimation function (KDE) in Equation (17). The sampling is done through an efficient sampling technique called Hammersley sequence sampling (HSS) [48]; the number of samples, N, must be large enough to capture the effect that time-dependent uncertainties have on the objective function.

BONUS: Extending the Scope of the Algorithm for SOCP with Time-Dependent Uncertainties
Despite all of the benefits that BONUS presents, it has the inconvenience that it was originally developed to deal only with static uncertainties. That is why the original algorithm was modified in order to solve SOCPs with time-dependent uncertainties. The modified BONUS, given in Figure 4, can be divided into two sections: initialization and NLP optimization. A brief description of the main steps is presented next. and (20) thus avoiding the sampling and the model execution at each iteration.
where ωi is the weighting factor of the ith decision variable.

Initialization
The decision variables are represented as a set of discrete values as is described in Section 2.2.1. Once the decisions variables have been discretized, uniform probability distributions are assumed for all of them to generate the base input probability distribution for the analysis f (u i *) through the kernel density estimation function (KDE) in Equation (17). The sampling is done through an efficient sampling technique called Hammersley sequence sampling (HSS) [48]; the number of samples, N, must be large enough to capture the effect that time-dependent uncertainties have on the objective function.
where h is the width of the window, also called the smoothing or bandwidth parameter, while u i * is the set of base samples of the decision variables. The stochastic differential model is then simulated for each sampling point, covering the entire solution space of the objective function by using the input base distribution; the result is the output base probability distribution, F(Z(u i *)). All of the information that can be obtained from the dynamic model has been processed at this point.

NLP Optimization
The information generated in the initialization section was used in a built-in reweighting approach to determine the optimal profiles of decision variables that maximize/minimize the expected value of the objective function E(Z(u)).
In each iteration, k, the real probability distributions f (u i ) are calculated using the KDE of Equation (18).
As the decision variables change, the distributions also change. Thus, the objective function Z(u) and its gradient information are estimated through the weighting scheme given by Equations (19) and (20) thus avoiding the sampling and the model execution at each iteration.
where ω i is the weighting factor of the ith decision variable.

Results and Discussions
The case study involved a 10,000 L feed of a mixture of triglycerides and methanol in a ratio of 1:6, with an initial triglycerides concentration of 0.3615 mol/L. The temperature profile was discretized in 6 points; the resulting SOCP included, therefore, 7 control variables (6 values for temperature and the reaction time) and 6 state variables (components concentrations).

Using BONUS with Static and Time-Dependent Approaches to Uncertainty
To determine the BONUS capability to manage dynamic (time-dependent) uncertainties, the SOCP was solved first by using static uncertainties in triglycerides concentration and ordinary differential equations; then, the same problem was solved by using the stochastic differential equations under the Ito processes approach. In both cases, 500 samples were taken using the HSS technique. Figure 5 presents the optimal temperature profiles obtained. During the first 20 min of the reactor operation of the simulation, the profiles were different, but they presented similar trends in the interval 332-338 K to 390 K; after that, both of the profiles were practically the same, reaching 400 K as the final temperature. Also, both of the solutions presented similar reaction times, 59.7292 min for the case of static uncertainties and 58.6958 min for the case of time-dependent (Ito processes) uncertainties. The case study involved a 10,000 L feed of a mixture of triglycerides and methanol in a ratio of 1:6, with an initial triglycerides concentration of 0.3615 mol/L. The temperature profile was discretized in 6 points; the resulting SOCP included, therefore, 7 control variables (6 values for temperature and the reaction time) and 6 state variables (components concentrations).

Using BONUS with Static and Time-Dependent Approaches to Uncertainty
To determine the BONUS capability to manage dynamic (time-dependent) uncertainties, the SOCP was solved first by using static uncertainties in triglycerides concentration and ordinary differential equations; then, the same problem was solved by using the stochastic differential equations under the Ito processes approach. In both cases, 500 samples were taken using the HSS technique. Figure 5 presents the optimal temperature profiles obtained. During the first 20 min of the reactor operation of the simulation, the profiles were different, but they presented similar trends in the interval 332-338 K to 390 K; after that, both of the profiles were practically the same, reaching 400 K as the final temperature. Also, both of the solutions presented similar reaction times, 59.7292 min for the case of static uncertainties and 58.6958 min for the case of time-dependent (Ito processes) uncertainties. Once the optimal profiles were obtained, they were tested using the 500 samples of the original stochastic model with static uncertainties. Figure 6 shows the methyl ester (biodiesel) concentration profiles of each sample of the temperature profiles obtained using dynamic uncertainties; a similar plot was obtained for the case of dynamic uncertainties (not shown). Further, the mean concentration values obtained by using both static and dynamic uncertainties are compared in Figure 7; the mean values are slightly different at the beginning of the simulation due to the differences in temperature. Once the optimal profiles were obtained, they were tested using the 500 samples of the original stochastic model with static uncertainties. Figure 6 shows the methyl ester (biodiesel) concentration profiles of each sample of the temperature profiles obtained using dynamic uncertainties; a similar plot was obtained for the case of dynamic uncertainties (not shown). Further, the mean concentration values obtained by using both static and dynamic uncertainties are compared in Figure 7; the mean values are slightly different at the beginning of the simulation due to the differences in temperature. However, in both cases, the final product concentration was the same (0.96 mol/L).   Table 2 shows the comparison of the results obtained through the stochastic maximum principle (SMP) and BONUS. Even though the optimal reaction time was lower in the SMP case, the biodiesel concentration reached when the BONUS approach was used was 24% larger. Also, and more   Table 2 shows the comparison of the results obtained through the stochastic maximum principle (SMP) and BONUS. Even though the optimal reaction time was lower in the SMP case, the biodiesel concentration reached when the BONUS approach was used was 24% larger. Also, and more significant, the profit increased by a factor of 2 with the BONUS reweighting scheme, being practically the same with both static and dynamic uncertainties. These results show that this  Table 2 shows the comparison of the results obtained through the stochastic maximum principle (SMP) and BONUS. Even though the optimal reaction time was lower in the SMP case, the biodiesel concentration reached when the BONUS approach was used was 24% larger. Also, and more significant, the profit increased by a factor of 2 with the BONUS reweighting scheme, being practically the same with both static and dynamic uncertainties. These results show that this particular problem possesses multiple solutions; SMP uses a gradient method which might be trapped in a local minimum; on the other hand, we believe that BONUS, by using sampling through all the feasible space for reweighting, was able to avoid such local minimum.

Conclusions
This work presented a new approach to solving stochastic optimal control problems using the BONUS [15] reweighting scheme. The maximum profit in biodiesel batch production was used as an illustrative example which showed that BONUS provides similar results when it is used with both static and time-dependent uncertainties. The illustrative case study also proved that the approach can solve problems that involve stochastic dynamic models. Also, for our specific case study, the reweighting approach gave better results than the stochastic maximum principle technique. This last result shows another advantage of the use of BONUS. Throughout the inherent exploration of all the feasible space in the initial sampling step, it may avoid local minimum solutions. We, therefore, believe that our approach provides a simplified efficient alternative for solving SOCPs and, because of the reweighting strategy, it has great potential for dealing with large-scale systems.

Conflicts of Interest:
The authors declare no conflict of interest.