Formulating data-driven surrogate models for process optimization

Recent developments in data science and machine learning have inspired a new wave of research into data-driven techniques for mathematical optimization. This paper first considers two essential conditions for integrating surrogates into process optimization and discusses achieving those conditions. Next, we consider two perspectives for developing process engineering surrogates: a surrogate-led and a mathematical programming-led approach. These data-driven surrogate models must be integrated into a larger process optimization problem, so this paper next discusses the verification problem, i


Introduction
Both data-driven techniques and mathematical optimization have been pillars of process systems engineering (PSE) since its inception (Sargent, 1972; Pistikopoulos et al., 2021). But recent developments in data science and machine learning have inspired a new wave of research into data-driven techniques for mathematical optimization (Ning and You, 2019). This research belongs to larger efforts at the intersection of data science and PSE (Qin and Chiang, 2019; Shang and You, 2019; Tsay and Baldea, 2019; Schweidtmann et al., 2021; Thebelt et al., 2022).
This paper first considers two essential conditions for integrating surrogates into process optimization and discusses achieving those conditions. Next, we consider two perspectives for developing process engineering surrogates:
• A surrogate-led perspective first selects an appropriate surrogate, for instance a Gaussian process for its statistical properties, and then develops effective optimization formulations for that particular surrogate model.
• A mathematical programming-led perspective selects a specific surrogate model based on its desired optimization properties, e.g., linearity.
The surrogate-led approach selects a particular data-driven surrogate model and then develops an optimization formulation for that surrogate model. The second, mathematical programming-led, perspective is important for PSE domains where the optimization problems being solved are so difficult that a surrogate model must conform to specific properties.
These data-driven surrogate models must be integrated into a larger process optimization problem, so this paper next discusses the verification problem, i.e., checking that the optimum of the surrogate corresponds to the optimum of the truth model. Finally, we consider relevant software.

Surrogates for process optimization
We mention two essential conditions for process optimization surrogates:
Condition 1. Surrogate model accuracy can be enforced by constraining the surrogate models to a desired space so that they do not extrapolate with large errors.
Condition 2. Overfitting data-driven models must be avoided for surrogate models embedded into optimization problems.
These conditions are tied to the stability of optimal solutions, which is partially characterized by Lipschitz continuity (LipC) for input/output relations and data uncertainties. Moreover, the LipC property depends strongly on surrogate model parameters with bounded confidence regions that result from data uncertainties. If these conditions do not hold, the optimum determined from a surrogate model is unlikely to be robust to changes in the problem data or input conditions.
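For reference, the two notions used above can be written out explicitly. This is only a sketch under stated assumptions: the surrogate map $\hat{g}$, the Lipschitz constant $L_{\hat{g}}$, the sampled inputs $\hat{u}_i$ and the weights $\lambda_i$ are notation introduced here for illustration, and the convex hull of the training inputs is just one possible choice of the "desired space" in Condition 1.

```latex
% Lipschitz continuity of a surrogate input/output map \hat{g} on a domain U
\| \hat{g}(u_1) - \hat{g}(u_2) \| \le L_{\hat{g}} \, \| u_1 - u_2 \|
\qquad \forall \, u_1, u_2 \in U

% One way to avoid extrapolation: restrict u to the convex hull of the N sampled inputs
u = \sum_{i=1}^{N} \lambda_i \hat{u}_i, \qquad
\sum_{i=1}^{N} \lambda_i = 1, \qquad
\lambda_i \ge 0, \quad i = 1, \dots, N
```

A small $L_{\hat{g}}$ bounds how far the surrogate output can move under a perturbation of the inputs, which is the sense in which Lipschitz continuity relates to the stability of the optimum.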

Levels of Surrogacy
Truth models for plant system-wide equations can be classified at the physical property, unit and system (or plant) levels, through the following form:

$$
\begin{aligned}
y &= g(x_{sys}, x_{un}, x_{pp}, u_{sys}) \\
0 &= f_{sys}(x_{sys}, x_{un}, x_{pp}, u_{sys}, p_{un}) \\
0 &= f_{un}(x_{sys}, x_{un}, x_{pp}, p_{un})
\end{aligned}
\tag{1}
$$

where $y$ and $u_{sys}$ are the system output and input vectors, and $x_l$ are the state variables at level $l$. At the plant level, the abstraction is the most generalized and seeks to substitute the entire system of Equations (1) with a single surrogate model, $y = g_{surr}(u_{sys}) = g(x_{sys}, u_{sys}) + \varepsilon_y$, where $\varepsilon_y$ is the approximation error vector of the surrogate model. This approach avoids calculating mass or energy balances and attempts to determine a simulation topology solely based upon the simulation input variables $u_{sys}$. Such models have the following advantages. First, plant surrogate models may have few input variables (degrees of freedom) for the entire plant and they may lead to high fidelity interpolative plant models. Moreover, they are solvable with simpler derivative-free optimization solvers. On the other hand, extrapolating these surrogates will likely violate conservation laws and other first principle relations. Moreover, these surrogates are not reusable for related cases, as they need to be reconstructed for every specific plant case.
Instead, surrogate models at the unit level can be linked to form a plant model. Unit-level surrogates represent an intermediate level of surrogacy that satisfies the overall mass and energy balances of the plant, although these surrogate models are not designed to account for conservation or other first principle laws within the unit. The underlying system of Equations (1) is reduced to a form (Equation (2)) in which $\varepsilon_{un}$ is the approximation error vector of the surrogate model. Such models have the following features. Unit-level surrogate models have few input variables (degrees of freedom) based on the unit structure, and they can lead to high fidelity interpolative unit models. Moreover, conservation laws hold at the plant level and these surrogates are reusable for new plant-level cases. But extrapolating the surrogate model may violate first principle relations and conservation laws in the unit, and surrogate extrapolation errors may lead to convergence failures at the plant level. Also, plant-wide optimization with embedded unit surrogates requires mid-scale optimization solvers, which are more computationally expensive.
Finally, surrogate models describing first-principles relations within common process units, e.g., flash calculations and physical properties, represent the lowest level of abstraction, leaving the rest of the unit- and plant-level model equations in place. Equation (1) then takes a form (Equation (3)) in which $\varepsilon_{pp}$ is the approximation error for the surrogate model. These rigorous first-principle or "physics-based" models are now integrated within the unit models and the rest of the process is then modeled with rigorous unit- and plant-level equations. These equations, along with the physics-based surrogates, form an equation-oriented model, which must be solved with a large-scale optimization solver. Such models have the following features. These surrogate models have few input variables (degrees of freedom) for subunit models and can lead to high fidelity interpolative subunit models. Moreover, conservation laws hold at the unit and plant levels and these surrogates are reusable for new plant-level and unit-level cases. On the other hand, the optimization problem based on Equation (3) must be solved with large-scale optimization solvers in order to evaluate plant-level and unit-level cases efficiently. Moreover, surrogate model extrapolation errors can lead to convergence failures at both unit and plant levels. Ma et al. (2022) and Goldstein et al. (2022) explore the performance of surrogates at these different modeling levels.

Optimization models of data-driven surrogates
Regression based on polynomials: Developing polynomial regression models is common and has been extensively reviewed (Bhosekar and Ierapetritou, 2018). Some newer ideas in the process systems engineering literature adaptively select which polynomial regressors to use, e.g., ALAMO (Wilson and Sahinidis, 2017), and build a polynomial surrogate explicitly for global optimization, e.g., ARGONAUT (Boukouvala and Floudas, 2017).
Neural networks: In machine learning, data-driven models are traditionally based on (deep) neural networks (DNNs), which are usually trained using variants of the stochastic gradient descent (SGD) algorithm, such as adaptive moment estimation (Adam). These algorithms have been very successful in practice but do not have the favorable convergence properties (e.g., global, superlinear) that are standard for modern nonlinear optimization algorithms. Since they are very successful in fitting (training) DNNs, it is an open question how these DNNs and SGD algorithms compete with modern, large-scale optimization strategies for process engineering models.
This disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks (DNN) has recently been explored by Zhang et al. (2022). Here it is shown that neural network weights, with either differentiable (e.g., sigmoidal) or nondifferentiable (ReLU) activation functions, often do not converge to stationary points of the loss function, even though stable convergence to minimum loss, approaching zero, is observed. This stabilization of the training loss can be explained through a convergence proof on the DNN weight distributions.
These caveats lead us to take a mathematical programming view of the optimization models applying to several types of data-driven surrogates. Some of the earliest formulations for optimizing over neural networks are big-M mixed-integer programming formulations relevant to ReLU activation functions (Lomuscio and Maganti, 2017; Fischetti and Jo, 2018). Grimstad and Andersson (2019) considered tightening big-M parameters for optimization over NN with ReLU activation functions. Alternative mixed-integer formulations for ReLU activation functions include adding cuts representing the convex hull of a single neural network node (Anderson et al., 2020) or a partition-based formulation that includes a subset of the convex hull constraints (Tsay et al., 2021).
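To illustrate what a big-M encoding of a single ReLU node looks like, the sketch below writes the usual four constraints for y = max(0, a), with pre-activation a = w·x + b bounded by L ≤ a ≤ U and a binary z, and checks by enumeration that they admit only the ReLU output. The helper names (`preactivation_bounds`, `bigm_feasible`, `relu_from_bigm`), the weights and the box are hypothetical, and the bounds come from plain interval arithmetic rather than from any of the tightening procedures cited above.

```python
# Big-M encoding of one ReLU node y = max(0, a), a = w.x + b, with L <= a <= U:
#   y >= a,   y >= 0,   y <= a - L * (1 - z),   y <= U * z,   z in {0, 1}
# All names and numbers here are illustrative assumptions.

def preactivation_bounds(w, b, lo, hi):
    """Interval-arithmetic bounds (L, U) on w.x + b for x in the box [lo, hi]."""
    L = b + sum(wi * (l if wi >= 0 else h) for wi, l, h in zip(w, lo, hi))
    U = b + sum(wi * (h if wi >= 0 else l) for wi, l, h in zip(w, lo, hi))
    return L, U

def bigm_feasible(a, y, z, L, U, tol=1e-9):
    """True if (y, z) satisfies the four big-M constraints for pre-activation a."""
    return (
        y >= a - tol
        and y >= -tol
        and y <= a - L * (1 - z) + tol   # binding when z = 1 (node active)
        and y <= U * z + tol             # binding when z = 0 (node inactive)
    )

def relu_from_bigm(a, L, U):
    """Enumerate z in {0, 1}; return every output y the constraints admit."""
    admitted = []
    for z in (0, 1):
        # With z fixed, the constraints pin y to a single candidate value.
        y = a if z == 1 else 0.0
        if bigm_feasible(a, y, z, L, U):
            admitted.append(y)
    return admitted

w, b = [1.0, -2.0], 0.5
lo, hi = [-1.0, -1.0], [1.0, 1.0]
L, U = preactivation_bounds(w, b, lo, hi)

for x in ([1.0, -1.0], [-1.0, 1.0], [0.5, 0.0]):
    a = b + sum(wi * xi for wi, xi in zip(w, x))
    print(x, a, relu_from_bigm(a, L, U))
```

In a mixed-integer solver, z becomes a binary decision variable instead of being enumerated. Looser values of L and U keep the encoding valid but weaken its linear relaxation, which is why bound tightening matters.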
Other mathematical programming formulations for ReLU activation functions include a semidefinite relaxation (Raghunathan et al., 2018) and a quadratic relaxation derived through applying the S-Lemma (Fazlyab et al., 2019). For ReLU activation functions of the form $r = \max(0, x)$, Yang et al. (2021) considered three formulations:
1. embedded, i.e., $r = \max(0, x)$ written directly, which needs smoothing for a nonlinear optimization solver, e.g., $r = x/(1 + e^{-ax}) \approx \max(0, x)$,
2. binary variables to handle the max functions within a mixed-integer linear strategy,
3. complementarity formulations with $r = x + y$, $0 \le y \perp x + y \ge 0$, using basic (not relaxed) complementarity formulations within a nonlinear optimization problem.
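The smoothed and complementarity variants can be checked numerically on scalar inputs. The sketch below is illustrative only: the sigmoid-weighted smoothing x / (1 + exp(-a x)) is one common choice and is assumed here, and the function names and the sharpness value a = 50 are hypothetical.

```python
import math

def relu(x):
    return max(0.0, x)

def smoothed_relu(x, a):
    """Sigmoid-weighted smoothing x / (1 + exp(-a x)); larger a is sharper."""
    t = -a * x
    if t > 700.0:          # exp(t) would overflow; the ratio is effectively 0
        return 0.0
    return x / (1.0 + math.exp(t))

def complementarity_split(x):
    """Return (r, y) with r = x + y, y >= 0, r >= 0 and y * r = 0."""
    y = max(0.0, -x)       # slack that absorbs the negative part of x
    r = x + y
    return r, y

for x in (-3.0, -0.5, 0.0, 0.5, 2.0):
    r, y = complementarity_split(x)
    print(x, relu(x), smoothed_relu(x, 50.0), (r, y))
```

The smoothing error is largest near x = 0 and shrinks as a grows, while the complementarity pair reproduces the ReLU output exactly. The price is a complementarity constraint that the nonlinear solver has to handle.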
Yang et al. (2021) also consider convex hull constraints used to prevent extrapolation of neural networks. In addition, Ma, Sahinidis and coworkers (Ma et al., 2022) consider data-driven approaches to minimize the energy cost of extractive distillation with truth models from an Aspen simulator. First, they apply surrogate-based optimization using ALAMO's generalized linear models, and then also consider neural network models with the ReLU activation function. Second, they use DFO to optimize the problems directly using simulation results. In a detailed comparison, they observed that ALAMO performs well for less complex systems, while the ReLU network performs better for complex ones. On the other hand, the most effective performance was obtained with DFO using smooth penalty functions. Other types of neural network activation functions have also been considered, including binarized NN (Khalil et al., 2018) and a reduced space formulation for nonlinear smooth activation functions (Schweidtmann and Mitsos, 2019). But much remains to be done in this research area, for instance transforming nonsmooth ReLU activation functions using smoothed mathematical optimization problems with complementarity constraints.
Regression trees: Mixed-integer programming formulations of gradient-boosted regression trees are from Mišić (2020) and Mistry et al. (2021). These formulations are available, for example, in the black-box optimizer ENTMOOT (Thebelt et al., 2021, 2022) and the formulation tool OMLT (Ceccon et al., 2022).
Gaussian processes: Optimization over Gaussian processes can be managed in many different ways. First, Gaussian processes are a natural fit for robust optimization strategies (Bertsimas et al., 2010a,b; Bogunovic et al., 2018; Wiebe and Misener, 2021; Wiebe et al., 2022). Of course, there are infinitely many possible functions in a Gaussian process, so (depending on the application) we can either optimize over the mean while integrating some notion of uncertainty or use pathwise conditioning to sample from the Gaussian process posterior (Wilson et al., 2020). Schweidtmann et al. (2021) have also developed a reduced space formulation for global optimization.
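As a reminder of what "optimizing over the mean while integrating some notion of uncertainty" can look like, the block below states the standard zero-mean Gaussian process posterior and one possible objective built from it. The notation is introduced here for illustration: $K$ is the kernel matrix on the training inputs, $k(x)$ the vector of kernel values between $x$ and the training inputs, $\mathbf{y}_{data}$ the observed outputs, $\sigma_n^2$ the noise variance, and $\kappa \ge 0$ a user-chosen weight. The confidence-bound objective is only one example, not necessarily the choice made in the works cited above.

```latex
% Gaussian process posterior mean and variance (zero prior mean)
\mu(x) = k(x)^{\top} \left( K + \sigma_n^2 I \right)^{-1} \mathbf{y}_{data}, \qquad
\sigma^2(x) = k(x, x) - k(x)^{\top} \left( K + \sigma_n^2 I \right)^{-1} k(x)

% One way to combine mean and uncertainty in a minimization problem
\min_{x \in X} \; \mu(x) - \kappa \, \sigma(x)
```

With $\kappa = 0$ this reduces to optimizing the posterior mean alone. A negative sign on $\kappa$ would instead penalize uncertain regions, which is closer in spirit to a robust formulation.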

Selecting a data-driven surrogate based on its optimization properties
The preceding discussion develops optimization models for data-driven surrogates. But sometimes we wish to focus first on the needs of an application and then choose a corresponding surrogate. Early, foundational work initiating this line of inquiry explored how individual surrogates may fit into larger decision-making problems (Palmer and Realff, 2002; Caballero and Grossmann, 2008; Henao and Maravelias, 2011). Some of the other ideas in this area include: developing a decision tree with desired properties (Bertsimas and Dunn, 2017), selecting ReLU neural networks to expand the applicability of multi-parametric programming (Katz et al., 2020), and developing a neural network with desired properties (Tsay, 2021).
There has been significant work solving optimization problems based on hybrid data-driven / mechanistic models and the consequences of these algorithms for surrogate models (Eason and Biegler, 2016; Bajaj et al., 2018; Eason and Biegler, 2018; Kim and Boukouvala, 2020). The next subsections further develop these ideas. For optimization, all models are imperfect, but some are useful. As a result, surrogate models ranging from first principles to shortcut models to data-driven models are widely applied in the context of optimization studies.

Strategies for surrogate-based optimization
Derivative-free optimization methods are generally formulated for unconstrained optimization problems, i.e., $\min_{x \in \mathbb{R}^n} f(x)$. These methods are either stochastic or deterministic in nature. The former methods are based on opportunistic sampling and selection algorithms, which converge asymptotically but offer no guarantees for a finite number of samples. On the other hand, deterministic methods are based on generalized pattern searches, which adapt themselves to the response surface and often provide guarantees of convergence to local optimality of the surrogate model. These include the DFO, NOMAD and DIRECT solvers described in Conn et al. (2009). In addition, surrogate models with optimally chosen basis functions are created by the ALAMO solver, which applies an MIP formulation, linear least squares and modified AIC criteria.
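The simplest member of the deterministic pattern-search family is a compass (coordinate) search, sketched below. This is an illustrative toy, not the algorithm implemented in any of the solvers named above; the function name, the step-halving rule and the quadratic test objective are assumptions made for the example.

```python
def compass_search(f, x0, step=1.0, tol=1e-6, max_iter=10_000):
    """Minimize f by polling +/- each coordinate direction; no derivatives used."""
    x = list(x0)
    fx = f(x)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in range(len(x)):
            for sign in (1.0, -1.0):
                trial = list(x)
                trial[i] += sign * step
                f_trial = f(trial)
                if f_trial < fx:       # accept the first improving poll point
                    x, fx = trial, f_trial
                    improved = True
                    break
            if improved:
                break
        if not improved:
            step *= 0.5                # unsuccessful poll: refine the mesh
    return x, fx

def quadratic(x):
    """Hypothetical smooth test objective with minimizer (1, -0.5)."""
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2

x_min, f_min = compass_search(quadratic, [3.0, 2.0])
print(x_min, f_min)
```

The method only compares function values, so `f` could equally be an expensive simulation. The cost is that each poll may need up to 2n evaluations in n dimensions.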
For constrained optimization formulations with data-driven models, optimization strategies with embedded surrogate models, which substitute for high-fidelity (or "truth") models, are widely applied in process engineering. There, the high-fidelity model is replaced over the entire optimization space with a surrogate model such as a polynomial, DNN or Kriging model (Bhosekar and Ierapetritou, 2018). While this approach no longer requires additional evaluations of the high-fidelity model once the surrogate model is established, it is likely that the optimization will lead to extrapolation of the surrogate model. These extrapolation errors can lead to convergence failures, or to termination at a point that is not the optimum of the high-fidelity model. Consequently, it is challenging to maintain the accuracy of the surrogate model over the entire optimization space.

Conditions where the optimum of the surrogate model corresponds to the optimum of the truth model
For global optimization there are a number of DFO methods with convergence guarantees "in the limit" (see Huyer and Neumeier (2008)).However, unlike conventional methods based on spatial branch and bound search, they do not provide lower bounds and certificates for global solutions.
On the other hand, for local optimization, this challenge can be addressed through locally approximated surrogate models that are updated with recourse to the truth model, as part of the optimization strategy. Starting from unconstrained approaches by Fahl and Sachs (2003) and Conn et al. (2009), Eason and Biegler (2016, 2018) developed the trust region filter (TRF) method for constrained optimization that samples from "truth models" which have smooth input-output properties. The TRF method iteratively solves sub-problems with local surrogate models under trust region constraints along with stabilizing filter methods. Applied to any surrogate model with smooth input-output properties, this approach guarantees convergence to the solution of the target problem with truth models and requires few function evaluations of these black-box models. In addition, if the derivatives of the high-fidelity model are available, then simpler surrogate models can be used along with first order corrections (FOC). FOC approaches have been applied to surrogates of large, computationally expensive models, such as those in aerodynamics and pressure swing optimization (Alexandrov et al., 1998; Agarwal and Biegler, 2013), using the derivative information of the original high-fidelity model.
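The interplay of a trust region with first order corrections can be sketched in one dimension. The code below is a deliberately simplified, unconstrained trust-region loop, not the TRF method (it has no filter and no constraints). The truth model, the biased quadratic surrogate, the acceptance thresholds and all names are hypothetical. At each iterate the surrogate is shifted and tilted so that its value and slope match the truth model, then minimized inside the trust region, and the step is accepted or rejected by comparing actual and predicted reduction.

```python
import math

def truth(x):                 # stand-in "truth" model (cheap here, for illustration)
    return (x - 2.0) ** 2 + 0.3 * math.sin(x)

def truth_grad(x):
    return 2.0 * (x - 2.0) + 0.3 * math.cos(x)

CENTER = 1.5                  # minimizer of the uncorrected surrogate

def surrogate(x):             # simple, deliberately biased low-fidelity model
    return (x - CENTER) ** 2

def surrogate_grad(x):
    return 2.0 * (x - CENTER)

def trust_region_foc(x0, delta=0.5, delta_max=2.0, tol=1e-6, max_iter=100):
    x = x0
    for _ in range(max_iter):
        t_k, g_k = truth(x), truth_grad(x)
        if abs(g_k) < tol:    # stationary for the truth model
            break
        # First order correction: match truth value and slope at x.
        c0 = t_k - surrogate(x)
        c1 = g_k - surrogate_grad(x)

        def model(z, x=x, c0=c0, c1=c1):
            return surrogate(z) + c0 + c1 * (z - x)

        # The corrected model is a convex quadratic, so its trust-region
        # minimizer is the unconstrained minimizer clipped to [x - delta, x + delta].
        z_free = CENTER - 0.5 * c1
        z = min(max(z_free, x - delta), x + delta)

        predicted = t_k - model(z)        # model(x) equals t_k by construction
        actual = t_k - truth(z)
        rho = actual / predicted if predicted > 0.0 else 0.0
        if rho >= 0.1:                    # accept the step
            x = z
            if rho > 0.75:                # good agreement: enlarge the region
                delta = min(2.0 * delta, delta_max)
        else:                             # poor agreement: shrink the region
            delta *= 0.5
    return x

x_star = trust_region_foc(x0=-1.0)
print(x_star, truth_grad(x_star), CENTER)
```

Minimizing the uncorrected surrogate alone would return `CENTER`, which is not stationary for the truth model. With the correction, the loop stops at a point where the truth gradient vanishes, at the cost of one truth evaluation and one truth gradient per iteration.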
The trust region filter (TRF) method for surrogate-based optimization has rigorous guarantees of convergence to local optimality for the truth model. These are based on DFO properties in Conn et al. (2009) and are independent of the choice of the surrogate model. On the other hand, the performance of the TRF method depends on the accuracy of the surrogate model, and sampling of the truth model is required as TRF proceeds. The TRF method has been applied in a number of surrogate-based optimization case studies, where direct optimization of the truth models was prohibitive. These include periodic adsorption processes (Agarwal and Biegler, 2013), air-fired and oxycombustion power plants (Dowling et al., 2016), surrogate equations of state and MWD models for polymerization (Eason et al., 2018; Kang et al., 2019), heat exchanger networks with surrogates of detailed exchanger models (Kazi et al., 2020), real-time optimization of refineries (Chen et al., 2021) and optimization of benzene chlorination processes (Yoshio and Biegler, 2020). Future research will deal with relaxing the TRF properties to include model mismatch and noise from surrogate models, based on work on ε-exact models by Biegler et al. (2014).

Conclusions
This paper has offered a brief glimpse into formulating data-driven surrogate models as mathematical programming formulations for process optimization. There is still much work to be done in this area.