Maximizing information from chemical engineering data sets: Applications to machine learning

It is well documented that artificial intelligence can have (and is already having) a major impact on chemical engineering. But classical machine learning approaches may be poorly suited to many chemical engineering applications. This review discusses how challenging data characteristics arise in chemical engineering applications. We identify four characteristics of data arising in chemical engineering applications that make applying classical artificial intelligence approaches difficult: (1) high variance, low volume data, (2) low variance, high volume data, (3) noisy/corrupt/missing data, and (4) restricted data with physics-based limitations. For each of these four data characteristics, we discuss applications where these data characteristics arise and show how current chemical engineering research is extending the fields of data science and machine learning to incorporate these challenges. Finally, we identify several challenges for future research.


Introduction
Data have always played a critical role in chemical engineering applications, but recent advances in artificial intelligence enable new possibilities for increasing the information gained from chemical engineering data sets. Previous reviews discuss technical advances relevant to chemical engineering, e.g., artificial intelligence (Venkatasubramanian, 2019), machine learning (Lee et al., 2018; Yan et al., 2020), optimization approaches (Rios and Sahinidis, 2013; Biegler et al., 2014; Boukouvala et al., 2016; Ning and You, 2019), surrogate modeling (Bhosekar and Ierapetritou, 2018; McBride and Sundmacher, 2019), hybrid data-driven/mechanistic modeling (Von Stosch et al., 2014), and latent variable methods (Dong and Qin, 2018). Other reviews highlight the applications and possibilities for artificial intelligence in the process industries (Qin and Chiang, 2019; Shang and You, 2019; Tsay and Baldea, 2019; Pistikopoulos et al., 2021).
This review complements previous reviews by showing how different data characteristics arise in chemical engineering. Table 1 lists common descriptors characterizing data and gives examples of how these types of data arise in chemical engineering applications. Figure 1 illustrates two common data concepts: variance and volume. The upper right quadrant in Figure 1 (shaded grey, high variance, high volume) is where existing data science approaches have found most success, and this review article does not discuss it because classical machine learning is already relevant. This review article discusses the upper left and lower right quadrants (shaded blue): these are regimes where current chemical engineering research offers transformations to increase the information content of relevant data sets. We do not discuss the low variance, low volume quadrant: such data are likely not useful for machine learning, and engineering methods should be used to generate additional data.
We first discuss how challenging data characteristics arise in chemical engineering applications. Then, we identify four characteristics of data arising in chemical engineering applications that make applying classical artificial intelligence approaches difficult:
• High variance, low volume data, illustrated in the lower right corner of Figure 1.
• Low variance, high volume data, illustrated in the upper left corner of Figure 1.
• Noisy/corrupt/missing data: data veracity can be construed as a third characteristic.
• Restricted data: physics-based limitations give rise to restrictions in chemical engineering.
Note that these qualifying distinctions, e.g., between high variance/low volume versus low variance/high volume, may be problem-specific. For example, low volume may have a completely different meaning for safety-critical versus fairly innocuous processes. Likewise, the definition of low variance may change between online versus offline applications. The meanings of the qualifiers high and low therefore only have significance when considering the scope of the application and the goals for using the data.
For each of these four data characteristics, this paper discusses applications where these data arise and shows how current chemical engineering research is extending the fields of data science and machine learning to incorporate these challenges. We also identify challenges for future research. The overall vision is that chemical engineers are using (1) traditional engineering approaches, (2) classical artificial intelligence, and (3) new research at the intersection of chemical engineering and artificial intelligence to derive significant value from data that are often (erroneously) viewed as information-poor. Figure 2 depicts several commonly used data-driven model architectures that we mention throughout the text.

How "information-poor" data arise in chemical engineering
This section outlines typical characteristics of chemical engineering data sets and gives examples of how such data sets originate in the real world. A large number of chemical engineering applications are subject to two types of information-poor data sets, namely, data restricted in volume, i.e., high variance, low volume, and data restricted in variability, i.e., low variance, high volume. Sections 3 and 4, respectively, give a general discussion of these data. Moreover, chemical engineering data sets may consist of noisy, corrupted, and missing data: this case is presented in detail in Section 5. Section 6 discusses various forms of inaccessible data due to known or unknown restrictions.
Limited process understanding. Process engineers may have limited data, e.g., from pilot or experimental campaigns, for developing predictive models and designing a complete process (Tsay et al., 2018). As a result, the process design stage typically focuses on feasibility in meeting stringent safety, regulatory, and product constraints. After finding a feasible design, process operators may be risk-averse and maintain processes close to a few operating points known to be feasible. Operational changes after plant commissioning are typically minor and conducted with experience-guided trial and error, as changes to process set points meeting the requisite constraints are not easily predicted.
Similarly, there may be known physical restrictions on certain states/properties, but how these limitations are linked to the process parameters might not be fully understood, or they may be governed by complex relations. For example, some states or properties may be described by algebraic equations, a system of partial differential equations (e.g., a computational fluid dynamics model), or by quantum chemistry simulations. Process parameter restrictions may be the result of hidden constraints on latent variables (Martelli and Amaldi, 2014; Conn et al., 2009): for example, avoiding flooding of distillation columns (Górak and Sorensen, 2014) may introduce hidden constraints, as operators have imperfect knowledge of the combinations of operating conditions and parameters that result in column flooding (Piché et al., 2001). Hidden constraints may also be fully non-quantifiable, in which case the outcome of certain process configurations cannot be observed. For example, during data acquisition phases, non-quantifiable hidden constraints can result in unsuccessful experiments from which no data are obtained or for which process simulators fail to converge. Hidden constraints are especially challenging for optimization and control applications, when engineers cannot directly incorporate the hidden constraints into a mathematical model. The upper left quadrant of Figure 1 visualizes the resulting data sets, which are large but may be subject to low variability. Section 4 discusses more details of such data sets. Section 6 discusses specifics of data sets arising when processes are subject to unknown limitations.
Operational limitations. Operators may artificially impose operational limitations in pursuit of process efficiency. When a process is well understood, it can be subjected to process design optimization. Such design calculations traditionally assume continuous, steady-state processing, for maximizing the large-scale manufacturing of bulk chemical products. However, the optimized design yields sub-optimal performance when the process deviates from its nominal operating point, and therefore process operators seek to minimize such deviations, which would otherwise generate output data with more "variance." Further, optimal operating points (in terms of economic performance, energy efficiency, etc.) often lie near one or more process constraints. For example, it can be most efficient to make products at the requisite purity, or to operate equipment near safety limits. Operation close to limits can further reduce flexibility to deviate from steady state.
Besides process efficiency, product quality requirements can strictly limit the range of process parameter manipulation. This results in large data sets with very little variation around a few operating conditions (Qin and Chiang, 2019). Especially in the pharmaceutical industry, there can be strict regulations regarding operation of production processes (Troup and Georgakis, 2013). We mainly consider the operational limitations to be known limitations on the process parameters, giving rise to feasible domains for the process parameters. If these limitations can be modeled mathematically as algebraic expressions, they can be incorporated as constraints into optimization and optimal control frameworks. However, the exact mechanism(s) linking the parameters and product properties might not always be fully known, and instead the limitations might be based on heuristic correlations and rules of thumb.
In addition to intentional limitations based on process efficiency and product requirements, there may be limitations on process operation due to equipment considerations, such as a maximum pressure that can be achieved by a pump/compressor, limits on heating/cooling, maximum achievable flow through pipes, temperature limitations due to material properties, and spatial limitations. Such limitations can also arise from safety considerations, e.g., maximum allowed temperature and pressure in a reactor (Green and Southard, 2019). Limitations on process parameters and operating conditions may also be imposed by regulatory limits on allowed emissions, forbidding certain parameter settings. Unreachable and unsteady states may make it practically intractable to gather data for some operating conditions. For example, limitations in measurement instrumentation may prevent reliable measurements during relatively quick process transitions. In experimental studies for catalyst development, certain catalyst compositions may not result in stable configurations usable in chemical processes.
Data sets arising from such limitations can give rise to both low variance, high volume and high variance, low volume data sets; see the upper left and lower right quadrants in Figure 1. In experimental studies, such limitations may restrict search spaces, as detailed in Section 3. On the other hand, in steady-state process settings, such limitations may limit the variance of data sets which are still sizeable due to constant monitoring. Section 4 and Section 6 give more details.
Process control. In addition to guiding processes to desired states (setpoint tracking), process control systems seek to mitigate the effects of disturbances. Advances in process control, e.g., improvements in controller tuning, control structures, and sensors, continue to decrease process deviations and closed-loop response times, reducing disturbance-related variance in process outputs (Huang et al., 1997; Simkoff et al., 2020). Model predictive control (MPC), a standard technology in the chemical industry, explicitly accounts for process dynamics; this enables processes to transition between setpoints quickly and reject disturbances effectively. Needless to say, fast disturbance rejection is desirable. However, an unintended consequence is that variance in recorded data is further reduced, as the number of non-steady-state samples is likely decreased, leading to data sets similar to the upper left quadrant in Figure 1. Section 4 addresses this in more detail.
Resource constraints. Data gathering can be expensive: costs may be related to labor, material, equipment, etc. Modeling projects are often postponed or delayed to prioritize other activities, e.g., expansion or debottlenecking, viewed as higher impact for a production facility. As a result, project planners enforce tight resource budgets to limit spending. Moreover, certain experiments may be more expensive than others, e.g., catalyst development, where material costs vary widely between different catalyst configurations, leading to sparse data sets. Economic constraints can also lead to noisy, missing, and corrupted data. Alleviating hardware problems with more accurate sensors and data transmission/storage systems may not be economically viable. Often, less accurate/reliable but cheaper sensors are used, which can lead to increased noise and more frequent sensor failures. Economic constraints may also restrict the total number of sensors which can be placed in a process, as well as their locations, as it may be more expensive to install and maintain a sensor in some locations than others. On top of economic constraints, industrial projects are subject to tight schedules that assign certain time frames to different project steps. Time-consuming experiments with many steps and long preparation periods may result in similarly sparse data sets compared to economically constrained experimental trials.
Corrupted (but useful) data sets arising from such limitations are often sparse and have high variability, similar to the lower right quadrant of Figure 1. Section 5 discusses cases when such data sets are the result of noisy, missing, or corrupted measurements. More general details on high variance, low volume data sets are given in Section 3.
Hardware limitations. Often, hardware limitations of the sensors and data transmission and storage systems lead to low quality data. Measurements are inherently noisy due to interference from the environment. Physical phenomena like sensor coking or aging can cause drift and other corruptions. Missing data can be the result of failed sensors or problems in acquiring, transmitting, or storing data (Imtiaz and Shah, 2008). Measurements outside a sensor's range can also lead to missing or inaccurate values. Even when economic considerations are excluded and hardware is accurate, it may be impossible to place sensors at certain process locations because of physical restrictions. For example, in cryogenic processes, certain sensors may not fit within the process' insulation. Similarly, in small (e.g., intensified, microchannel) reactors, data may be limited to measurements from certain areas where sensors can be placed. Hardware limitations with respect to sensors can both cause low variability in large data sets and limit the volume of data sets with high variability, i.e., the Figure 1 upper left and lower right quadrants, respectively. Section 5 addresses uncertain data due to hardware limitations.
Real-world example 1: Estimating Powder Compositions. As a prototypical challenge at the intersection of chemical engineering and data science, consider predicting a low dimensional latent structure with partial least squares for pharmaceuticals (García Muñoz and Torres, 2020). Partial least squares outperforms principal component analysis when the output is strongly correlated with directions in the data that have low variance, so it may be applicable to a low variance regime (depending on the scope of the application). But García Muñoz and Torres (2020) show an industrial example where this classical data science technique is insufficient: an improved approach representing the causal relationship between spectra and chemical composition allows more accurate predictions and shows the possibilities of adding chemical know-how to the data analysis.
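To make the contrast concrete, the following is a minimal numerical sketch, not the industrial example of García Muñoz and Torres (2020), of why a one-component PLS model can succeed where one-component principal component regression fails when the response is driven by a low-variance direction. All data here are synthetic and the two-feature setup is an assumption chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two latent directions: one with high variance that is unrelated to the
# response, one with low variance that drives the response entirely
# (a caricature of the situation described above; all data are synthetic).
t_high = rng.normal(size=n) * 10.0
t_high -= t_high.mean()
t_low = rng.normal(size=n)
t_low -= t_low.mean()
t_low -= t_high * (t_high @ t_low) / (t_high @ t_high)  # decorrelate in-sample
t_low *= 0.5 / t_low.std()

X = np.column_stack([t_high, t_low])
y = 3.0 * t_low + 0.01 * rng.normal(size=n)

Xc, yc = X - X.mean(axis=0), y - y.mean()

# One-component principal component regression: project onto the leading
# PCA direction (the high-variance one), then regress y on that score.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
t_pca = Xc @ Vt[0]
y_pcr = t_pca * (t_pca @ yc) / (t_pca @ t_pca)

# One-component PLS1: the weight vector is proportional to X'y, so it is
# pulled toward directions correlated with y even when their variance is low.
w = Xc.T @ yc
w /= np.linalg.norm(w)
t_pls = Xc @ w
y_pls = t_pls * (t_pls @ yc) / (t_pls @ t_pls)

def r2(y_hat):
    return 1.0 - np.sum((yc - y_hat) ** 2) / np.sum(yc**2)

print(f"one-component PCR R^2:  {r2(y_pcr):.3f}")
print(f"one-component PLS1 R^2: {r2(y_pls):.3f}")
```

Here PCR latches onto the high-variance but uninformative direction, while the PLS weight vector recovers the low-variance direction that actually carries the composition signal.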
Real-world example 2: Optimizing Catalyst Performance. Mistry et al. (2020) present another example utilizing industrial catalyst data in collaboration with BASF. Modeling the effectiveness of a chemical catalyst is a tedious task that requires highly nonlinear models that vary between different applications. Data-driven tree ensembles are popular for modeling such systems due to their excellent prediction accuracy. Moreover, such algorithms are efficient and allow for cheap model development. Mistry et al. (2020) combine tree ensembles with traditional mixed-integer techniques (Mišić, 2020) to guarantee the feasibility of catalyst properties while leveraging the predictive power of tree ensembles trained on real industrial data. This approach allows chemical companies to derive feasible and promising product designs while reducing resources dedicated towards building models.

High variance, low volume data
While machine learning methods have been shown to perform well for applications where large amounts of data are available, it remains challenging to get deep insights into physical processes given high variance, low volume data. For this type of data set, we distinguish between two general application settings: (1) before sample collection, when there is a limited budget of experiments available and the goal is to maximize a pre-defined utility based on these trials, and (2) after sample collection, where the goal is to utilize a high variance, low volume data set in the most effective way, e.g., for target prediction or process control.

Relevant applications
High variance, low volume data sets have implications for various chemical engineering applications. The following discusses a few applications in more detail.
Batch or semi-batch production. Batch or semi-batch production processes are common when production demand is low and/or products are expensive. While large chemical plants with high throughput are monitored at all times, data for batch processes may be limited. Without steady-state limitations, batch procedures allow dynamic treatment, e.g., different temperature profiles, to maximize key targets such as yield and selectivity. While this may allow better performance compared to steady-state continuous processes, it also makes process modeling, monitoring, and control more complicated, often leading to poorly performing physical models. In the context of high variance, low volume data, applications for batch or semi-batch production seek to effectively use existing data rather than generating new promising data. Ongoing research focuses on enhancing or replacing physical models with data-driven approaches. Challenges in batch or semi-batch production processes include using high variance, low volume data for (1) dynamically handling controllers during experiments or optimizing control policies and (2) predicting final batch product quality.
Material discovery and development. Research and development of new materials includes catalysis development, battery material production, etc. Experimental studies optimize material configurations and properties based on single or multiple objectives given limited resources. Experiments are often guided by applying physical principles, model-based design of experiments, and intuition derived from previous experiments. Depending on the objective, this application category is generally referred to as black-box optimization or design of experiments (DOE). Challenges for these applications include: (1) deriving an initial design of experiments, (2) proposing promising new experiments based on previous ones, and (3) effectively exploring the underlying search space. For many DOE applications, explicit constraints need to be handled to ensure feasible experimental settings, e.g., bounds on temperature or closure of mass-balance equations.
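As a small illustration of constraint handling when deriving an initial design, the sketch below generates a Latin hypercube design over two hypothetical factors and filters out candidate experiments violating an assumed explicit constraint. The factor ranges and the constraint are illustrative assumptions, not taken from any cited study.

```python
import numpy as np

def latin_hypercube(n_samples, n_dims, rng):
    """Basic Latin hypercube: one sample per equal-probability bin per dimension."""
    # Stratified draws in [0, 1): one draw per bin, then shuffle bins per dimension.
    u = (rng.random((n_samples, n_dims)) + np.arange(n_samples)[:, None]) / n_samples
    for j in range(n_dims):
        rng.shuffle(u[:, j])
    return u

rng = np.random.default_rng(42)

# Hypothetical 2-factor design: temperature in [300, 400] K, pressure in [1, 5] bar.
lo = np.array([300.0, 1.0])
hi = np.array([400.0, 5.0])
design = lo + latin_hypercube(50, 2, rng) * (hi - lo)

# Illustrative explicit constraint (an assumption, standing in for e.g. a
# safety envelope): rule out simultaneously high temperature and pressure.
feasible = design[(design[:, 0] / 400.0 + design[:, 1] / 5.0) <= 1.6]
print(f"{len(feasible)} of {len(design)} candidate experiments are feasible")
```

In practice the rejected budget would be redistributed, e.g., by resampling until the desired number of feasible trials is reached.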
Design of computer experiments. Computational experiments can be fairly expensive and time-consuming, requiring large and powerful computing clusters. Popular examples of such calculations include density functional theory, computational fluid dynamics, and finite element method simulations. Such methods describe many complicated physical processes with high accuracy and may enhance, or even substitute for, real-world experimental studies. Similar to real-world DOE, the main challenges include: (1) choosing initial sets of computations to run, (2) proposing promising new experiments based on existing knowledge, and (3) sufficiently exploring the search space. Contrary to many classic DOE settings, physical systems described by computer simulations are already well understood. However, mathematical descriptions may be too complicated to be fully integrated in optimization problems that determine optimal experimental conditions. In these cases, the computer simulation becomes a black box that may be enhanced with intuition based on explicit constraints from the full physical model. Optimizing these problems remains challenging.
Model parameter estimation. Mathematical models are important for various applications in chemical engineering, e.g., process design and process control. Many models use restrictive prior assumptions to allow a mathematical description of physical systems that captures major trends but is inaccurate when predicting target quantities. Model parameter estimation based on real-world data enhances such models. Specifically, fitting parameters in mathematical equations that capture general trends in an underlying system to real-world target values can improve model accuracy drastically. In the context of high variance, low volume data, the main challenges relate to (1) identifying new data locations to improve weaknesses in existing models, (2) estimating uncertainty of model parameters, and (3) deriving new governing equations from data instead of fitting parameters.
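A minimal sketch of parameter estimation from a small data set: fitting the two Arrhenius parameters of a rate-constant model by linearized least squares. The "measurements" are synthetic, generated from assumed true parameter values, so this only illustrates the mechanics, not any particular study.

```python
import numpy as np

rng = np.random.default_rng(1)
R = 8.314  # gas constant, J/(mol K)

# Hypothetical "measured" rate constants generated from assumed true
# parameters, with multiplicative noise standing in for experimental error.
A_true, Ea_true = 1.0e7, 50_000.0  # pre-exponential factor, activation energy
T = np.linspace(300.0, 400.0, 8)   # a small, high-variance data set
k_meas = A_true * np.exp(-Ea_true / (R * T)) * np.exp(rng.normal(0.0, 0.05, T.size))

# Linearize ln k = ln A - (Ea/R) * (1/T) and solve the least-squares problem.
X = np.column_stack([np.ones_like(T), 1.0 / T])
coef, *_ = np.linalg.lstsq(X, np.log(k_meas), rcond=None)
A_fit, Ea_fit = np.exp(coef[0]), -coef[1] * R

print(f"fitted A  = {A_fit:.3g} (true {A_true:.3g})")
print(f"fitted Ea = {Ea_fit:.0f} J/mol (true {Ea_true:.0f} J/mol)")
```

Note the asymmetry typical of small data sets: the activation energy (slope) is recovered much more tightly than the pre-exponential factor (an extrapolated intercept), which is one reason parameter uncertainty estimates matter.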

How this type of data is addressed in the literature
We categorized applications of high variance, low volume data into two categories: before and after sample collection. Data-driven approaches in the literature that address high variance, low volume data can be further categorized into (1) pre-processing of data and (2) enhancing machine learning models trained on low volume data. The following presents examples of approaches that successfully handle high variance, low volume data.
Feature selection and dimensionality reduction. Whether a data set is low or high volume depends on the amount of data and the intrinsic problem dimensionality. In contrast to data set features, which may not accurately describe the underlying system, intrinsic dimensionality refers to the minimum number of variables needed to represent the data. For example, a subset of all features presented or a set of latent features (e.g., obtained via principal component analysis) might sufficiently represent the data. In general, this makes machine learning models more effective, as the same amount of data is applied to represent fewer degrees of freedom. Therefore, dimensionality reduction algorithms identify intrinsic dimensionalities to allow learning in low dimensional spaces. Such techniques are popular when deriving quantitative structure-activity relationship models using molecular descriptors to link structural properties to physicochemical properties of interest (Ponzoni et al., 2017; Eklund et al., 2014). These types of models are used for drug discovery and optimization. Reducing complexity without losing information is essential to obtain better fits with machine learning models. Janet and Kulik (2017) use feature selection techniques to improve the accuracy of machine-learning models when predicting quantum mechanical properties for chemical discovery, especially in modestly sized data sets. Other research (Bartók et al., 2013; Ghiringhelli et al., 2015; Huang and Von Lilienfeld, 2016) finds similar advantages when using feature selection techniques to model chemical properties based on molecular structures.
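The notion of intrinsic dimensionality can be illustrated with a short sketch: 20 observed features generated, by assumption, from only 3 latent variables plus noise, where PCA via the singular value decomposition reveals how few directions carry nearly all the variance.

```python
import numpy as np

rng = np.random.default_rng(7)

# 20 recorded features that are really driven by only 3 latent variables
# (plus measurement noise) -- a synthetic stand-in for the intrinsic
# dimensionality discussed above.
n, d_latent, d_obs = 500, 3, 20
latent = rng.normal(size=(n, d_latent))
mixing = rng.normal(size=(d_latent, d_obs))
X = latent @ mixing + 0.05 * rng.normal(size=(n, d_obs))

# PCA via SVD of the centered data; the explained-variance ratio shows
# how many directions carry almost all of the information.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
evr = s**2 / np.sum(s**2)

n_components = int(np.searchsorted(np.cumsum(evr), 0.99) + 1)
print(f"components needed for 99% of the variance: {n_components} of {d_obs}")
```

A subsequent model would then be trained on the 3 leading scores rather than all 20 raw features, spending the limited data on far fewer degrees of freedom.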
Model-based and Bayesian optimization. Bayesian optimization (BO) is a popular approach for black-box optimization and design of experiments. In general, BO is based on a data-driven model derived from Bayesian statistics. For an unknown function f : R^n → R, BO proposes the next evaluation point, e.g., the next experiment, in search of the optimal solution x* of f. For general black-box optimization, no other information about f, e.g., gradients, is available. To derive new query points for black-box f, BO instead learns an approximate/surrogate model and optimizes an acquisition function. Acquisition functions combine the predictive mean of the surrogate model and some variance measure quantifying the uncertainty of model predictions to handle the exploitation vs. exploration trade-off. Exploitation refines the surrogate model prediction near promising query points for the black-box function f, while exploration evaluates the underlying search space in regions with high surrogate model uncertainty. Thus, popular models for Bayesian optimization have both good prediction and uncertainty quantification capabilities. Examples of such models are Gaussian processes (GPs), Bayesian neural networks, and random forests (see Figure 2 for some examples of trained models). More detailed overviews of BO can be found in the literature (Shahriari et al., 2015; Frazier, 2018; Lizotte, 2008).
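The loop described above can be sketched under simple assumptions: a zero-mean GP prior with a fixed RBF kernel length scale, a lower-confidence-bound acquisition minimized over a candidate grid, and a synthetic one-dimensional "experiment". All hyperparameter choices here are illustrative assumptions, not recommendations.

```python
import numpy as np

def rbf_kernel(a, b, ls=0.2):
    """Squared-exponential kernel between two 1-D input arrays."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """GP posterior mean and standard deviation (zero prior mean, RBF kernel)."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf_kernel(x_query, x_train)
    mean = Ks @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

# Hypothetical expensive black-box experiment (minimization target).
f = lambda x: np.sin(3 * np.pi * x) + 0.5 * x

rng = np.random.default_rng(3)
x_obs = rng.random(3)              # small initial design
y_obs = f(x_obs)
grid = np.linspace(0.0, 1.0, 201)  # candidate query points

for _ in range(10):
    mean, std = gp_posterior(x_obs, y_obs, grid)
    lcb = mean - 2.0 * std          # lower confidence bound: exploit + explore
    x_next = grid[np.argmin(lcb)]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, f(x_next))

print(f"best point found: x = {x_obs[np.argmin(y_obs)]:.3f}, f = {y_obs.min():.3f}")
```

The std term drives the loop toward unexplored regions early on and toward refinement of the incumbent minimum later, which is exactly the exploitation/exploration trade-off described above.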
BO frameworks and data-driven model-based optimization methods have many applications in design of experiments. Several model types can be used to achieve promising results. Rall et al. (2019) show that artificial neural networks (ANNs) can be used to model synthetic membranes for desalination and ion separation processes. The resulting models are then optimized over in a single- and multi-objective fashion to determine optimal fabrication conditions for the membranes used. After carrying out the proposed experiments, the authors were able to show the reliable prediction ability of ANNs when modeling membrane performance. Bradford et al. (2018a) propose a multi-objective optimization method based on GPs: the algorithm uses spectral sampling to approximate function samples drawn from the GP posterior distribution. A genetic algorithm optimizes the samples to find promising new evaluation points of the black-box function. The Bradford et al. (2018a) approach has been applied to continuous flow chemistry (Schweidtmann et al., 2018) and pharmaceutical processes (Clayton et al., 2020). Other GP-based approaches were applied to tissue engineering (Olofsson et al., 2018), solar cell material optimization (Herbol et al., 2018), and optimization of sustainable algal production (Bradford et al., 2018b). Other data-driven model-based design of experiment approaches use tree ensembles (Mistry et al., 2020; Thebelt et al., 2021, 2022) and algebraic basis functions (Wilson and Sahinidis, 2017) to predict new promising points for evaluation.
Hybrid modeling. Small data sets with high variance are useful for recognizing trends in the underlying feature space but may be difficult to use for interpolation between data points. Hybrid data-driven/mechanistic methods try to fill these data set gaps with domain knowledge such as mathematical equations. Here, machine learning models using small data sets can enhance existing physics-based models. Rall et al. (2020) use data-driven models to enable both membrane synthesis and membrane process design in a hybrid modeling approach. While the process design is modelled using common mechanistic models, membrane properties are estimated using ANNs. They demonstrate that the proposed hybrid modeling strategy can lead to better overall performance compared to conventional approaches that only design processes based on a collection of known membranes. Henao and Maravelias (2011) use ANNs to replace complicated models for operation units in process synthesis, while keeping deterministic models for simple units. The authors investigate applications related to the design of continuous stirred tank reactors, solvent regeneration units, and synthesis of a reaction-separation process, finding that ANNs can lead to compact and accurate representations of superstructure optimization models. A key advantage is replacing the various nonlinearities stemming from process models with a single type of nonlinearity from the activation function of the ANNs used, which enables universal treatment when optimizing over such models. Schweidtmann and Mitsos (2019) propose a framework that allows deterministic global optimization of ANN embedded structures. This framework has been applied to hybrid modeling applications including optimization of organic Rankine cycles (Huster et al., 2019; Schweidtmann et al., 2019).
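One common flavor of hybrid modeling can be sketched as follows: a mechanistic backbone captures the dominant trend, and a small data-driven model (here just a polynomial, standing in for an ANN or GP) is fit only to the residual mismatch. The "plant", the mechanistic model, and the data are all synthetic assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# "True" process: a mechanistic trend plus an unmodeled nonlinear effect.
def plant(T):
    return 2.0 * T + 10.0 * np.sin(T / 20.0)  # unknown to the modeler

def mechanistic(T):
    return 2.0 * T                             # first-principles part only

# Small, scattered data set (high variance, low volume).
T_data = rng.uniform(0.0, 100.0, 15)
y_data = plant(T_data) + rng.normal(0.0, 0.5, 15)

# Hybrid model: keep the mechanistic backbone, learn only the residual.
z = T_data / 100.0  # scale inputs to [0, 1] for a well-conditioned fit
residual_coefs = np.polyfit(z, y_data - mechanistic(T_data), deg=5)
hybrid = lambda T: mechanistic(T) + np.polyval(residual_coefs, T / 100.0)

T_test = np.linspace(5.0, 95.0, 50)
err_mech = np.max(np.abs(mechanistic(T_test) - plant(T_test)))
err_hyb = np.max(np.abs(hybrid(T_test) - plant(T_test)))
print(f"max error, mechanistic only: {err_mech:.2f}; hybrid: {err_hyb:.2f}")
```

Because the data-driven part only has to learn the (smaller, smoother) residual, fifteen noisy points suffice where a fully data-driven model of the raw response would struggle.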
Low volume data models. While some methods handle small data sets by increasing their volume via additional experiments, other approaches enhance limited data by using deterministic models. If these options are unavailable, specifically tailored models may perform well in small data settings. GPs are popular models for this category as they smoothly fit existing data points and give valuable uncertainty quantification in unexplored regions (Rasmussen, 2003). While this does not solve the problem of inaccurate interpolation between distant data points, it does indicate where low model accuracy is expected. While GPs have a built-in uncertainty measure, there exist approaches that aim to estimate uncertainty for other commonly used data-driven models. Springenberg et al. (2016) use Hamiltonian Monte Carlo methods, specifically based on Chen et al. (2014), to derive reliable uncertainty estimates for ANNs in low volume data applications. Garnelo et al. (2018) propose Neural Processes that combine the advantages of both GPs and ANNs and give explicit uncertainty estimates.

Low variance, high volume data
Another challenge faced in applying data-driven solutions to chemical processes is that the vast majority of operating data recorded by large-scale processes correspond to normal operation at a (few) routine point(s) (Qin and Chiang, 2019). Several reasons for this are summarized in Section 2. Nevertheless, the volume of such data can be significant: a chemical process can have many sensors that record measurements at frequencies on the order of minutes, and stored records can go back in time a decade or more.

Relevant applications
Machine learning methods applied to the low variance, high volume data from large-scale processes must be carefully selected/adapted to meet this unique challenge.
Outlier/fault detection. Machine learning can be deployed online to detect (and/or classify) faults or outliers in the behavior of a process. Specifically, when new data are continually recorded, fault detection seeks to answer: are the data (statistically) different from normal operation? In a data-driven methodology, "normal operation" is quantified using a recorded data set. Due to its importance and direct applicability to process operations, data-driven fault detection has been widely studied (Venkatasubramanian et al., 2003; Qin, 2012; Ge et al., 2013; Jiang et al., 2019). The main challenges related to low variance, high volume data are: (1) building classification models from unbalanced data sets, (2) determining statistical limits for "normal operation," and (3) attributing outliers/faulty measurements to a root cause.
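As a minimal sketch of determining a statistical limit for "normal operation" (challenge 2 above), the following fits a Hotelling's T² monitor on synthetic multivariate "normal operation" data and flags a shifted sample. The empirical 99% quantile used as the control limit is one simple, hedged choice; chi-squared or F-distribution limits are the textbook alternatives.

```python
import numpy as np

rng = np.random.default_rng(11)

# Synthetic "normal operation" data: correlated sensor readings around a setpoint.
n, d = 1000, 4
L = rng.normal(size=(d, d))
X_normal = rng.normal(size=(n, d)) @ L  # correlated, zero-mean process data

mu = X_normal.mean(axis=0)
Sinv = np.linalg.inv(np.cov(X_normal, rowvar=False))

def t2(x):
    """Hotelling's T^2: squared Mahalanobis distance from normal operation."""
    dev = x - mu
    return float(dev @ Sinv @ dev)

# Empirical 99% control limit estimated from the training data itself.
limit = np.quantile([t2(x) for x in X_normal], 0.99)

x_ok = X_normal[0]                              # an in-control sample
x_fault = mu + 6.0 * np.std(X_normal, axis=0)   # all sensors shifted high
print(f"T2 in-control = {t2(x_ok):.1f}  (limit {limit:.1f})")
print(f"T2 faulty     = {t2(x_fault):.1f}")
```

Because the limit is estimated from normal-operation data only, no labeled fault examples are needed, which sidesteps the unbalanced-data problem, though root-cause attribution (challenge 3) requires additional contribution analysis.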
Process drift. A related challenge to fault detection is identifying "drift" in process behavior (Lee et al., 2011; Montgomery et al., 1994). For instance, closed-loop process behavior can change slowly over time due to degradation of control system performance (e.g., plant-model mismatch), equipment deterioration (e.g., heat exchanger fouling, build-up of trace components), etc. Unlike in many fault detection applications, drifting process behavior often results in measured data that are not statistical outliers, but rather correspond to gradual shifts. For equipment degradation in particular, many research efforts focus on condition-based monitoring, where additional data streams specific to equipment, e.g., machine vibration, audio/video data, are used to model their condition(s). The challenges are: (1) leveraging and combining information from diverse data streams, (2) quantifying a slow change in the underlying process dynamics, and (3) isolating individual effects from other phenomena/noise in recorded data.
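Gradual shifts that never become outliers can still be caught by cumulative statistics. Below is a hedged sketch of a one-sided CUSUM detector applied to a synthetic signal with a slow ramp; the target value, slack k, and threshold h are illustrative choices, not tuning advice.

```python
import numpy as np

def cusum(signal, target, k=0.5, h=8.0):
    """One-sided CUSUM: return the sample index at which an upward drift is
    flagged, or -1 if none. k is the slack (in signal units), h the threshold."""
    s = 0.0
    for i, x in enumerate(signal):
        s = max(0.0, s + (x - target - k))  # accumulate exceedance above target+k
        if s > h:
            return i
    return -1

rng = np.random.default_rng(2)

# Simulated measurement: in control for 200 samples, then a slow ramp
# (e.g., heat-exchanger fouling) of 0.01 units per sample.
n = 500
noise = rng.normal(0.0, 1.0, n)
drift = np.concatenate([np.zeros(200), 0.01 * np.arange(n - 200)])
signal = 10.0 + noise + drift

alarm = cusum(signal, target=10.0)
print(f"drift flagged at sample {alarm} (drift starts at sample 200)")
```

No single drifted sample here exceeds a 3-sigma outlier limit until long after the CUSUM alarm, which is precisely why cumulative statistics complement outlier-based fault detection.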
Flexible operations. Recent trends towards flexible operation of chemical processes deviate from the accepted paradigm of steady-state operation (Riese and Grünewald, 2020; Pattison et al., 2016). For example, large fluctuations in electricity prices may motivate over-production when prices are low, and vice versa (assuming products can be stored). Creating these flexible production schedules requires a good understanding of feasible process operating points, and perhaps also of feasible transitions. Therefore, machine learning models can help production schedulers understand the feasible operating regimes of a process by examining data from its past operation. When process operational data include multiple operating points, the range of feasible operating points and transitions can potentially be identified from historical data. The main challenges identified here are: (1) identifying recorded data that correspond to routine or desirable operations and (2) creating data-driven formulations that both accurately describe process operations and are amenable to scheduling formulations.
Unconventional control strategies. As noted above, improvements to control technologies have greatly reduced variance in process data. However, it can at times be favorable to sacrifice some degree of control performance in order to gain data-driven process knowledge (Mesbah, 2018; Hewing et al., 2020). For instance, additional (multi-objective) terms can be included in a model-predictive-control objective function to encourage excitation of a process when model improvements are desired. Alternatively, a reinforcement-learning-type approach can be taken, allowing the controller to learn an optimal policy over time rather than requiring an accurate open-loop model a priori. Strategies such as these balance control performance and exploration of the input-output space, allowing process models and/or their control systems to be improved as data are collected. The main challenges identified here are: (1a) training/improving dynamic models from quasi-stationary data, (1b) training/improving open-loop dynamic models with closed-loop data, i.e., closed-loop identification, and (2) balancing risk-averse operation with exploration for model/controller improvements.

How are these challenges addressed?
A central theme in dealing with these data is making the most of data that predominantly correspond to "normal" operation. Intuitively, while large data volume facilitates estimation of noise/variance, low "variance," i.e., few changes in operation, complicates the understanding of underlying process dynamics. Therefore, many chemical engineering applications dealing with such data exploit subject-matter expertise to assist in modeling underlying behavior. When such information is not readily available, some purely data-driven approaches have still found success.
Data reconciliation and moving horizon estimation. When a mathematical model is available in addition to process data, model predictions can differ significantly from measured values, owing to model assumptions, measurement error, etc. To this end, data reconciliation adjusts measured data, e.g., by solving a maximum likelihood estimation (MLE) problem subject to known model equations, often comprising conservation laws and/or variable constraints (do Valle et al., 2018). When a more sophisticated process model is available, its parameters can be simultaneously estimated via MLE, resulting in error-in-variables problems, e.g., Esposito and Floudas (1998); Gau and Stadtherr (2002). Solution of this (global) optimization problem is known to be challenging: to maintain computational tractability, the number of considered data samples can be fixed, an approach known as moving horizon estimation (MHE) (Johansen, 2011). Several research efforts (Zavala et al., 2008; Alessandri et al., 2011; Hashemian and Armaou, 2015) have focused on expediting computational solution of the MHE problem for deployment in online applications. In summary, data reconciliation and MHE techniques use data to continually improve process knowledge via a hypothesized mathematical model. In turn, this results in improved models for process control, and MPC systems can exploit the reconciled measurements and parameter estimates from MHE (Huang et al., 2010; Voelker et al., 2013). Additionally, fault detection can also be performed by tracking parameter estimates from MHE (Bemporad et al., 1999; Spivey et al., 2010).
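For a single linear balance with Gaussian measurement errors, the MLE reconciliation problem has a closed-form weighted-least-squares solution. The sketch below (a hypothetical three-stream splitter with balance F1 = F2 + F3 and assumed measurement standard deviations) projects the raw measurements onto the balance:

```python
def reconcile_flows(meas, sigmas):
    """Reconcile a 3-stream splitter F1 = F2 + F3 by weighted least squares.

    Analytic projection onto the mass balance a.x = 0, a = (1, -1, -1):
        x = m - diag(s^2) a' (a diag(s^2) a')^-1 (a.m)
    """
    a = (1.0, -1.0, -1.0)
    var = [s * s for s in sigmas]
    residual = sum(ai * mi for ai, mi in zip(a, meas))   # balance violation
    denom = sum(ai * ai * vi for ai, vi in zip(a, var))
    return [mi - vi * ai * residual / denom
            for ai, mi, vi in zip(a, meas, var)]

m = [100.0, 60.0, 45.0]   # measured flows (kg/h); violate F1 = F2 + F3 by -5
x = reconcile_flows(m, sigmas=[1.0, 1.0, 1.0])
print(x)                   # reconciled flows satisfy the balance
print(x[0] - x[1] - x[2])  # ~0
```

With equal measurement variances the residual is split equally among the streams; less trusted sensors (larger sigma) absorb more of the adjustment. Nonlinear models and inequality constraints require the numerical optimization formulations discussed above.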
Filtering and state estimation. In a similar vein to the above, observed data can be used to estimate the values of the state, or hidden/unmeasured, variables of a process. Specifically, the state estimation problem determines the values of the process states, given a model structure and a sequence of measured data. While state estimation can also be formulated as an MHE (optimization) problem, several filtering methods are commonly used to update state estimates using only the most recent measurement, summarizing previous data using, e.g., state and covariance matrix estimates. The popular Kalman filter provides optimal estimates for the case of a linear, unconstrained system subject to Gaussian noise (typically also estimated from data). Extensions such as the extended and unscented Kalman filters enable state estimation in nonlinear systems by linearizing the system around its current state or additional sampled points, respectively. Rather than assuming Gaussian noise, a sampling/Monte Carlo approach can be taken, a technique known as particle filtering (Zhao et al., 2014). Daum (2005) and Rawlings and Bakshi (2006) provide more comprehensive overviews of (nonlinear) state estimation. Importantly, while they are generally simpler than MHE problems, filtering-based schemes address similar applications, such as fault detection (Bhagwat et al., 2003). Finally, Kalman-filtering-based approaches can also update estimates of model parameters by treating the parameters themselves equivalently to hidden process states (Ljung and Gunnarsson, 1990; Guo, 1990).
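A scalar example conveys the predict/update structure of the Kalman filter; the random-walk state model and the noise variances below are illustrative assumptions, not taken from the cited works:

```python
import random

def kalman_1d(zs, q=1e-4, r=0.25, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a random-walk state observed with noise.

    q: process-noise variance, r: measurement-noise variance.
    """
    x, p = x0, p0
    estimates = []
    for z in zs:
        p = p + q           # predict (state model: x_k = x_{k-1} + w_k)
        k = p / (p + r)     # Kalman gain
        x = x + k * (z - x) # update with the innovation z - x
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates

# Noisy level measurements around a true value of 2.0
random.seed(0)
zs = [2.0 + random.gauss(0.0, 0.5) for _ in range(500)]
est = kalman_1d(zs, x0=zs[0])
print(est[-1])  # smoothed estimate near 2.0
```

The same two-step structure generalizes to vector states with matrix covariances, and, as noted above, augmenting the state with unknown parameters turns the filter into an online parameter estimator.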
Scale-bridging models. While low variance data are often insufficient to construct a detailed process model, useful coarse approximations may still be derived. For instance, several works "bridge" multiple time scales by identifying feasible operating regimes from historical data, in order to embed lower-level process knowledge in decisions at a higher level (Tsay and Baldea, 2019). This can involve identifying samples corresponding to feasible steady-state operating points (Xenos et al., 2016a,b) and/or modeling regions of such feasible operation using convex-region surrogate models (Zhang et al., 2016a,b). Such data-driven models naturally allow for discontinuous operating modes by incorporating multiple feasible regions. On the other hand, dynamic scale-bridging models can be trained from operating data, where transitions between feasible operating points are also modeled. Recent works (Pattison et al., 2016; Tsay et al., 2019) accomplish this by performing system identification on (closed-loop) recorded process data, resulting in approximations of input-output relationships between process setpoints and outputs. As the underlying variations in such data are few, data-mining techniques can further reduce the size of the dynamic scale-bridging models (Tsay and Baldea, 2020). Overall, scale-bridging models derived using the above techniques can be used to compute optimal schedules of feasible operation.
Unsupervised learning. In terms of fault detection, process datasets are almost always highly unbalanced, i.e., nearly all recorded samples correspond to normal/routine operation. Therefore, unsupervised learning, e.g., anomaly detection or clustering, is a well-established approach for low variance, high volume data, rather than supervised learning, e.g., explicit classification of faulty vs. normal samples. As in Section 3, dimensionality reduction, or manifold learning, seeks to represent high-dimensional data with a low-dimensional set of latent variables. Note that supervised methods for dimensionality reduction also exist. Once the set of latent variables is learned, they can be used in fault detection (Chiang et al., 2000), operational optimization (García-Muñoz et al., 2008), and process control (Laurí et al., 2010). MacGregor and Cinar (2012) and Qin et al. (2020) discuss further applications and techniques. Relatively simple fault detection rules can be derived by applying multivariate statistics, e.g., Hotelling's t-squared statistic, to measured process data or latent variables (Yin et al., 2014). Clustering represents an alternative class of unsupervised learning, with the goal of partitioning data into a number of clusters based on some defined similarity metric. By partitioning historical data into clusters, multiple operating modes of a process can be identified (Quiñones-Grueiro et al., 2019), enabling, e.g., fault detection tailored to each operating mode. Fault detection can also be constructed directly from a clustering scheme (Detroja et al., 2006), by observing how new data affect the learned clusters. An advantage of using unsupervised techniques is the potential to detect new fault types, rather than only those used to train a supervised model.
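For instance, Hotelling's t-squared statistic is a Mahalanobis-type distance from the training mean. The two-variable sketch below (synthetic, correlated temperature/pressure data; the limit 11.8 approximates a 99.7% chi-squared threshold for two variables) flags a point that breaks the learned correlation even though each variable is individually within its normal range:

```python
import random

def hotelling_t2(train, x):
    """Hotelling's T^2 of point x relative to 2-D normal-operation data."""
    n = len(train)
    mx = sum(p[0] for p in train) / n
    my = sum(p[1] for p in train) / n
    sxx = sum((p[0] - mx) ** 2 for p in train) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in train) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in train) / (n - 1)
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mx, x[1] - my
    # T^2 = d' S^-1 d with the 2x2 inverse written out explicitly
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

random.seed(1)
# Correlated temperature/pressure data from "normal" operation
normal = []
for _ in range(500):
    e = random.gauss(0, 1)
    normal.append((350 + e, 5.0 + 0.4 * e + random.gauss(0, 0.1)))

print(hotelling_t2(normal, (350.0, 5.0)) < 11.8)  # inside the limit
print(hotelling_t2(normal, (352.0, 4.2)) > 11.8)  # breaks the correlation
```

In practice the statistic is computed on a handful of latent variables rather than raw measurements, and the control limit is set from an F-distribution that accounts for the sample size.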
Data-driven process control. Machine learning techniques similar to the paradigm of "reinforcement learning" (RL) can use data to continually improve the performance of a process and its control system. In contrast to the above methodologies, which primarily use available data to learn a model (from which control actions can be optimized), RL uses data to directly learn an optimal control policy (Sutton and Barto, 2018; Hoskins and Himmelblau, 1992; Shin et al., 2019). However, the "model-free" RL approach is often more difficult, data-demanding, and therefore less effective for practical systems, in part due to the presence of noise (Rawlings and Maravelias, 2019). RL controllers may violate constraints while learning optimal policies; therefore, RL-based control is also popular in fields with fewer safety-critical constraints, e.g., building systems (Wang and Hong, 2020). Even when using a dynamic process model is feasible and desirable, an approximate explicit MPC controller can still be trained based on closed-loop process data (Åkesson and Toivonen, 2006; Hussain, 1999; Lovelett et al., 2020). A major challenge here is incorporating system and controller constraints in the learned control policy (Vaupel et al., 2020). Finally, a direct way to deal with low variations in process data is to enforce variations via future control actions. Here, "dual control," or simultaneous identification and control, can be achieved by heuristically incorporating system excitation in a process control problem (Mesbah, 2018). Persistent excitation strategies add constraints to the standard MPC problem to maintain a minimum level of excitation (Genceli and Nikolaou, 1996). System excitation can also be enforced as a secondary objective, resulting in a multi-objective MPC problem (Aggelogiannaki and Sarimveis, 2006; Feng and Houska, 2018; Heirung et al., 2015).
Noisy/corrupt/missing data
Chemical engineering data sets are often noisy and contain corrupted or missing values, so applying machine learning methods requires considering these properties. Figure 3 illustrates the types of uncertainty that can arise from low quality data. This section discusses how noisy, corrupted, and missing data arise in chemical engineering applications, where and why this type of data set is particularly relevant, and how machine learning has been applied to low quality data. Besides noise and corruption in the data, we consider two different types of missing data. First, data sets may contain missing values because of hardware failures, limited sensor ranges, or outlier removal. Second, data relevant to the process may simply not be in the data set, because sensors could not be installed due to economic or physical constraints. Different methods may be applicable to data sets with missing data depending on whether values are missing due to hardware issues or due to economic and physical constraints.

Relevant applications
The following discusses several applications where noisy, corrupted, and missing data are particularly relevant and have been addressed in detail in the literature.
Fault detection. Section 4 discusses machine learning for fault detection, but process data used for detecting and pin-pointing equipment failures are also subject to noise, corruption, and missing values. Fault detection has been studied in the presence of both missing data (He et al., 2009; Zhang and Dong, 2014; Askarian et al., 2016; Guo et al., 2020) and noisy data (Venkatasubramanian and Chan, 1989; Hoang and Kang, 2019; Pham et al., 2020). Major challenges in applying fault detection to such data sets are: (1) developing methods that work in the presence of missing values, (2) ensuring noisy data do not lead to false positive faults, and (3) ensuring noisy or corrupted data do not delay or prevent the detection of faults. Addressing these challenges is important to assure the accuracy of fault detection and therefore the reliability of the plant.
Degradation and fouling modeling. Equipment degradation and fouling are increasingly of concern not just in condition monitoring, but also in process design, planning, and scheduling (Yildirim et al., 2017; Basciftci et al., 2018; Wiebe et al., 2018, 2020). Degradation is usually modeled by stochastic models because of its inherent randomness and the limitations of vibrational and other data underlying these models (Jardine et al., 2006). Making use of such models in a process design, monitoring, or control context requires careful consideration of the data limitations and resulting uncertainties. Challenges include: (1) developing models capturing the underlying equipment degradation in the presence of noisy/corrupted data and (2) making decisions based on uncertain equipment degradation states.
State estimation and soft-sensing. Section 4 discusses state estimation as a method for dealing with data that are missing due to economic or physical restrictions, but these methods themselves can also be subject to errors in the input process data (Kadlec et al., 2011; Wang et al., 2018; Guo and Huang, 2020). A main challenge arising from this is developing state estimation and soft-sensing methods that are robust to missing or noisy input data. Other control-related areas relevant to low quality data include disturbance detection and rejection (Lawryńczuk, 2008), sensor planning (Tewari et al., 2020), trend estimation or slow feature analysis (Zhao and Huang, 2018; Si and Wang, 2019), meta-learning, and sparse optimization/compressed sensing.

How this type of data is addressed in the literature
Chemical engineering contributions have taken two different perspectives on low quality data, both of which routinely employ machine learning techniques. The data perspective considers noise, corruption, and missing values to be data properties that need to be addressed. Such approaches often try to pre-process the data or develop methods that are robust, e.g., to missing values. On the other hand, the uncertainty perspective recognizes that noise, corruption, and missing values lead to uncertainty in process knowledge. Instead of trying to "fix" the data, approaches that take the uncertainty view acknowledge the existence of uncertainty and try to make good decisions despite it. Noisy measurements can lead either to parametric uncertainty, if a process parameter is measured, or to black-box uncertainty, if the data represent an unknown functional dependency. Missing values can exacerbate this uncertainty. Systematic corruption, e.g., by sensor drift or coking, can lead to measurement drifts while process conditions remain constant. Section 4 discusses process drift in more detail. The time at which disturbances or equipment failures occur may also be uncertain due to noisy measurements or missing values. Below, we discuss the main approaches for using machine learning to deal with noisy, corrupted, or missing data from both perspectives.
While low quality data are an important source of uncertainty in chemical engineering processes, there are other sources of uncertainty as well. For example, the small size of available datasets can also be an important source of uncertainty, as discussed in Section 3.
Data pre-processing. The most common approach for addressing low quality data from the data perspective is pre-processing (Xu et al., 2015). In the pre-processing paradigm, missing values are usually addressed by data imputation. Missing data imputation for process engineering data sets has been previously reviewed (Severson et al., 2017; Imtiaz and Shah, 2008; Walczak and Massart, 2001). For a detailed review of machine learning techniques for data imputation, see Lin and Tsai (2020). Supervised learning is often applied to predict missing values, e.g., with k-nearest-neighbours, decision trees, random forests, and ANNs (Lin and Tsai, 2020). Recent contributions in machine learning have also applied deep generative approaches such as generative adversarial networks (GANs) or variational autoencoders (VAEs) for data imputation (Yoon et al., 2018; Camino et al., 2019; Nazábal et al., 2020). These approaches can achieve high accuracy but tend to have many parameters and therefore require large training sets; they may therefore not be applicable to chemical engineering applications in low volume data regimes. The choice of method most suitable for a given application depends on the amount of training data available, their quality, the percentage of missing values, and more. Pre-processing for noisy and corrupted data includes outlier removal. Both unsupervised machine learning approaches, e.g., k-means clustering (Pamula et al., 2011), and supervised approaches, e.g., support vector machines, have been used to detect and remove outliers. Unlike supervised approaches, clustering-based approaches do not require a training set with labeled outliers.
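As a sketch of k-nearest-neighbour imputation (toy data, not from the cited reviews), a missing entry can be filled with the mean of that feature over the nearest complete rows, with distances computed on the co-observed features only:

```python
def knn_impute(rows, k=3):
    """Fill None entries using the mean of that feature over the k nearest
    complete rows (distance computed on co-observed features only)."""
    complete = [r for r in rows if None not in r]
    filled = []
    for r in rows:
        if None not in r:
            filled.append(list(r))
            continue
        obs = [j for j, v in enumerate(r) if v is not None]
        neighbours = sorted(
            complete,
            key=lambda c: sum((c[j] - r[j]) ** 2 for j in obs),
        )[:k]
        filled.append([v if v is not None
                       else sum(c[j] for c in neighbours) / len(neighbours)
                       for j, v in enumerate(r)])
    return filled

data = [
    [1.0, 10.0], [1.1, 10.2], [0.9, 9.9],  # one operating mode
    [5.0, 50.0], [5.2, 50.5],              # another mode
    [1.05, None],                          # missing value in mode 1
]
print(knn_impute(data)[-1])  # imputed from the mode-1 neighbours
```

Because the neighbours are chosen in feature space, the imputed value respects the operating mode of the incomplete sample rather than the global mean, which here would mix two very different regimes.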
Models which are robust to noise/missing values. An alternative approach to pre-processing is to use techniques that work in the presence of, and are robust to, noisy, corrupted, and missing data. Many machine learning methods cannot be applied when the data contain missing values. To alleviate this, Eirola et al. (2013, 2014) propose distance estimation between vectors with missing values, which allows distance-based machine learning algorithms like k-nearest-neighbours or support vector machines to be applied to data sets with missing values without pre-processing. Mesquita et al. (2019) propose a similar approach for estimating the expected value of Gaussian kernels with incomplete data. These techniques have the advantage that they can also be applied when a large percentage of values is missing. Other approaches that have been successfully applied in the presence of missing and noisy data are Bayesian networks (Zhang and Dong, 2014; Askarian et al., 2016). While most machine learning techniques can be applied without modification in the presence of noise, noise may lead to inaccurate predictions or overfitting.

Stochastic models. The uncertainty perspective on low quality data often starts by modeling the data using a stochastic model. Stochastic models are often a better choice than deterministic models for noisy or corrupted data sets because they quantify the resulting uncertainty. Stochastic processes can be interpreted as probability distributions over functions. As such, they are particularly useful for modeling uncertainty in functional dependencies based on noisy or corrupted data, i.e., black-box function uncertainty. Simple Lévy-type stochastic processes, like the Wiener or Gamma process, are commonly used in data-based condition monitoring/equipment degradation (Zhang, 2015; Nguyen et al., 2018). These models have also increasingly been incorporated into process scheduling and planning applications (Wiebe et al., 2018). Another commonly used class of stochastic processes, Gaussian processes (GPs), are extensively used as surrogate models in chemical engineering, but have also been used for applications with black-box uncertainty due to noisy measurements of functional dependencies (Wiebe et al., 2020; Liu et al., 2020; Bradford et al., 2020). Other stochastic models which have been applied to this type of data include Dirichlet process and Gaussian mixture models (Campbell and How, 2015; Chen and Zhang, 2010; Ning and You, 2018), as well as Bayesian networks (Zhang, 2015; Jain et al., 2018).
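As an illustration of a Wiener degradation model (drift, volatility, threshold, and horizon values below are illustrative assumptions), the probability of a degradation path reaching a failure or cleaning threshold within a planning horizon can be estimated by Monte Carlo simulation:

```python
import math
import random

def wiener_failure_prob(mu, sigma, threshold, horizon, n_paths=5000, dt=1.0):
    """Monte-Carlo estimate of P(degradation exceeds threshold within horizon)
    for a Wiener degradation process X(t) = mu*t + sigma*W(t)."""
    random.seed(42)  # fixed seed for reproducibility of the estimate
    steps = int(horizon / dt)
    failures = 0
    for _ in range(n_paths):
        x = 0.0
        for _ in range(steps):
            x += mu * dt + sigma * math.sqrt(dt) * random.gauss(0.0, 1.0)
            if x >= threshold:  # first passage of the threshold
                failures += 1
                break
    return failures / n_paths

# Fouling resistance drifting at 0.1 units/day toward a cleaning threshold of 10
p = wiener_failure_prob(mu=0.1, sigma=0.5, threshold=10.0, horizon=50.0)
print(p)
```

For the Wiener process, this first-passage probability also has a closed form (an inverse-Gaussian distribution); the simulation approach generalizes to the Gamma and other processes mentioned above, and such probabilities feed directly into maintenance scheduling decisions.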
Data-driven optimization under uncertainty. Several recent contributions in process systems engineering combine machine learning with robust optimization to make optimal decisions based on noisy or corrupted data. Data-driven robust optimization uses data to construct uncertainty sets. Constraints in which uncertain parameters occur are then required to hold for all values of the parameter within the uncertainty set. Early approaches in data-driven robust optimization focused on constructing uncertainty sets from data using confidence regions and statistical hypothesis testing (Bertsimas et al., 2018). Other authors use Dirichlet process mixture models to construct unions of ellipsoidal uncertainty sets (Campbell and How, 2015) or polyhedral uncertainty sets (Ning and You, 2018). The Dirichlet approach has the advantage that it can capture multimodal uncertainties. It may therefore be particularly applicable to chemical engineering data sets, where multimodal uncertainty is common, e.g., due to distinct operating modes. Other contributions use unsupervised learning to construct data-driven uncertainty sets: e.g., Shang et al. (2017) use kernel-based support vector clustering to derive data-driven uncertainty sets, while Goerigk and Kurtz (2020) construct sets from the output of unsupervised deep neural networks. ANNs have also been used in the context of distributionally robust and chance-constrained optimization. Zhao and You (2020) use GANs to construct empirical distributions based on (noisy) data and create ambiguity sets for distributionally robust optimization based on these distributions. While robust optimization is traditionally used for addressing parametric uncertainty, it can also be applied to black-box function uncertainty: e.g., Wiebe et al. (2020) use GPs to model black-box functions based on data and develop a method for robust reformulation of constraints depending on these black-box functions.
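The simplest data-driven uncertainty set is a box built from sample statistics. The sketch below (hypothetical demand data; the coverage parameter kappa is a modeling choice) shows how the robust counterpart of a single linear constraint reduces to bounding against the worst case in the set:

```python
import statistics

def box_uncertainty_set(samples, kappa=3.0):
    """Data-driven box set [mean - kappa*std, mean + kappa*std]."""
    mu = statistics.fmean(samples)
    s = statistics.stdev(samples)
    return mu - kappa * s, mu + kappa * s

def robust_production(demand_samples, capacity):
    """Smallest production level x meeting every demand in the uncertainty set.

    For the constraint x >= d with d in [lo, hi], the robust counterpart
    is simply x >= hi (the worst case in the box)."""
    lo, hi = box_uncertainty_set(demand_samples)
    if hi > capacity:
        raise ValueError("no robustly feasible production level")
    return hi

demand = [100 + 2 * ((i * 7) % 5 - 2) for i in range(50)]  # noisy demand data
print(robust_production(demand, capacity=150.0))
```

More sophisticated sets (ellipsoids, polyhedra, unions from mixture models, as cited above) tighten this worst case by exploiting correlations and multimodality, reducing the conservatism of the box.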

Restricted data
Advances in sensor technologies and process monitoring/control systems have made large amounts of process data easily accessible for chemical processes. Emerging technologies, such as the Internet of Things (Atzori et al., 2010) and Industry 4.0 (Lasi et al., 2014), also promise more efficient data collection and integration in the process industry (Isaksson et al., 2018). However, there are fundamental challenges in data acquisition for chemical processes that can make it impossible to sample or measure properties during some operating conditions, resulting in restricted data sets. Restricted data sets can cause difficulties in directly applying classical data-based technologies.

Applications dealing with restricted data
There are unique challenges in applying machine learning methodologies in chemical engineering that arise from restricted data sets. This section briefly describes application areas and discusses challenges due to restricted data sets.
Process control. Machine learning for process control has been an active research topic since the early 1990s (Hoskins and Himmelblau, 1992; Bhat and McAvoy, 1990), and it has gained more attention in recent years due to advances in machine learning (Shin et al., 2019; Rawlings and Maravelias, 2019). The restricted data setting creates several challenges for process control, including: (1) learning accurate dynamic models from restricted data sets, (2) incorporating knowledge of hidden constraints, (3) reliably solving the large-scale constrained nonlinear optimization problems formed by MPC, (4) learning a control law within the process limitations, (5) ensuring a controller respects limitations and safety requirements at all times, and (6) identifying model mismatch.
Optimizing operation and production processes from data. Accurate mathematical models may not be known for each process unit, and some properties can change continuously or require mathematical relations too complicated to be directly integrated into an optimization problem. By constructing data-driven surrogate models, it is possible to form a mathematical model of an optimization task containing some unknown relations. Applications learning a model from data and using the model within an optimization framework for decision-making include: flowsheet optimization (Caballero and Grossmann, 2008), superstructure optimization (Henao and Maravelias, 2011), supply chain management (Wan et al., 2005), and process intensification (Gutiérrez-Antonio and Briones-Ramírez, 2015; Quirante et al., 2015). Some challenges within the restricted data setting are: (1) training accurate models on restricted data sets, (2) learning constraints for the problem, (3) learning models whose complexity is appropriate for the resulting optimization problem, (4) incorporating model uncertainty, and (5) solving the resulting optimization problems.
Inverse problems and product discovery. As previously mentioned, chemical engineering applications of inverse problems include product discovery and the design of materials. Restricted data sets also create challenges for inverse problems. For example, data might only be available for certain regions of the input space, resulting in models with high uncertainty in large parts of the input space. In molecular design, constraints arise from structural constraints, chemical feasibility, and required product properties (Harper and Gani, 2000; Austin et al., 2016; Folic et al., 2008; Gani, 2004). Unstable molecular designs may appear as hidden constraints in the inverse problem. The challenges include: (1) learning accurate models from restricted data sets, (2) taking restrictions and hidden constraints into consideration, (3) incorporating model accuracy and uncertainty into the inverse problem, and (4) efficiently solving the resulting optimization/inverse problem.

How restricted data challenges are addressed in the literature
This section reviews existing approaches for addressing some of the challenges of restricted data. These are active research topics, and multiple solutions to these challenges may exist.
Learn constraints and limitations from data. Identifying feasible regions is a key component of flexibility analysis (Swaney and Grossmann, 1985; Grossmann et al., 2014), but the classical approaches assume that algebraic constraints containing some uncertain parameters are known (Halemane and Grossmann, 1983; Grossmann and Floudas, 1987). We refer the interested reader to Banerjee et al. (2010), Rogers and Ierapetritou (2015), Wang and Ierapetritou (2017), and Metta et al. (2020) for reviews covering approaches to estimate feasible regions and constraints based on surrogate functions. However, efficiently modeling constraints with surrogate functions requires data samples in all regions of interest and a mixture of data points where the constraints are both satisfied and violated. Especially in situations where data must be collected during normal operation, there may be very few (if any) data points available where the constraints are violated. In such circumstances, the surrogate function approach may not be a viable strategy. One-class classification (Ruff et al., 2018; Khan and Madden, 2009) has been proposed to address similar situations in machine learning: one-class classification could also be used in chemical engineering applications to represent feasible process configurations, but the authors are not yet aware of any such applications.
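To sketch how one-class classification might represent a feasible region (a toy hypersphere model, far simpler than the deep one-class methods cited above), one can fit a centre and radius to known-feasible operating points and flag departures from past operation:

```python
import math

class OneClassRegion:
    """Toy one-class classifier: a hypersphere around feasible operating data.

    Fits the centroid of known-feasible points and sets the radius to the
    q-quantile of training distances, so points far from past feasible
    operation are flagged as potentially infeasible."""

    def fit(self, points, q=0.95):
        n, d = len(points), len(points[0])
        self.centre = [sum(p[j] for p in points) / n for j in range(d)]
        dists = sorted(self._dist(p) for p in points)
        self.radius = dists[min(int(q * n), n - 1)]
        return self

    def _dist(self, p):
        return math.dist(p, self.centre)

    def is_feasible(self, p):
        return self._dist(p) <= self.radius

# Hypothetical feasible operating points (T, P) clustered around (350, 5)
feasible = [(350 + 0.5 * math.cos(t), 5 + 0.1 * math.sin(t))
            for t in [0.1 * k for k in range(100)]]
region = OneClassRegion().fit(feasible)
print(region.is_feasible((350.1, 5.02)))  # near past operation
print(region.is_feasible((360.0, 8.0)))   # far outside
```

Note that only feasible samples are needed for training, which matches the situation described above where infeasible data points are rare or absent; the quantile parameter q controls how conservatively the region is drawn.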
Dealing with hidden constraints. The main difficulty with hidden constraints is the inability to directly measure or observe the state or property behind the hidden constraint, i.e., we cannot quantify by how much the constraint is satisfied or violated. If a hidden constraint is encountered within an optimization framework, we need to somehow acknowledge that the point is infeasible and move away from the infeasible solution. But, due to the hidden nature of the constraint, we might not obtain any information other than that the current point, e.g., process configuration, is infeasible. With derivative-free optimization algorithms, such as population-based search methods, hidden constraints can be directly dealt with by a simple penalty approach (Martelli and Amaldi, 2014), where a large penalty is imposed on the objective to force the search away from infeasible solutions. Within an optimization framework, it could also be possible to use so-called no-good cuts (Nannicini and Belotti, 2012; D'Ambrosio et al., 2010) to exclude a small neighborhood around the infeasible point from the search space. No-good cuts add complexity to the optimization problem, and a large number of such cuts may result in a computationally intractable optimization problem. Another approach for dealing with hidden constraints is to use a support vector machine to identify and remove infeasible solutions from the search space (Ibrahim et al., 2018).
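The penalty approach for hidden constraints can be sketched in a few lines: the oracle only reports feasible/infeasible, and infeasible points receive a large objective penalty inside a derivative-free (here, purely random) search. The constraint and objective below are illustrative:

```python
import random

def hidden_constraint(x):
    """Oracle: reports only feasible/infeasible, no violation magnitude."""
    return x[0] + x[1] <= 1.5  # unknown to the optimizer

def objective(x):
    return (x[0] - 1.0) ** 2 + (x[1] - 1.0) ** 2

def penalized(x, big=1e6):
    """Replace the objective by a large penalty at infeasible points."""
    return objective(x) if hidden_constraint(x) else big

def random_search(n_iter=20000, seed=3):
    random.seed(seed)
    best_x, best_f = None, float("inf")
    for _ in range(n_iter):
        x = (random.uniform(0, 2), random.uniform(0, 2))
        f = penalized(x)
        if f < best_f:
            best_x, best_f = x, f
    return best_x, best_f

x, f = random_search()
print(x, f)  # near the constrained optimum (0.75, 0.75) with f = 0.125
```

The penalty contributes no gradient information toward the feasible region, which is exactly why this device suits population-based and other derivative-free methods rather than gradient-based solvers.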
Learning models from restricted data sets. The challenges in learning accurate models from restricted data sets can be quite similar to learning from small data sets. Transfer learning (Pan and Yang, 2009; Taylor and Stone, 2007) is a machine learning concept that reuses knowledge learned from a similar task to improve performance and reduce the amount of training data needed. Transfer learning could also be useful for efficiently learning models from restricted data sets in chemical engineering applications. For example, consider creating a model that predicts the product yield of a chemical process from a set of operating conditions. Training an accurate model might require a large data set with high variance, but using knowledge and data from a similar process might greatly reduce the amount of data needed. A simple approach to practically implement transfer learning for such applications is to use (part of) a model trained for a similar task as a starting point for the new model, e.g., by reusing weights from some layers of an ANN. Hybrid ANNs can be another approach for dealing with small and restricted data sets, incorporating known physical relations or first-principles equations to reduce the complexity of the model that is learned from data (Psichogios and Ungar, 1992; Medsker, 2012; Bellos et al., 2005). For example, physical relations can be incorporated by penalizing physical inconsistencies in the loss function while training ANNs (Raissi et al., 2019; Raissi, 2018; Karpatne et al., 2017). There is also a risk of overfitting the models on restricted data sets, resulting in models with overall poor performance. Several methods for training low-complexity and sparse models have been presented (Wilson and Sahinidis, 2017; Bishop, 2006; Louizos et al., 2018; Manngård et al., 2018), which can improve the generalization ability of models trained on restricted data sets.
Safety guarantees. The black-box nature of many machine learning models creates a set of challenges regarding safety guarantees, especially for automatic control applications. For many control applications, it is crucial that the controller behaves as expected under all circumstances to avoid dangerous situations. Therefore, successful implementation of a data-driven controller in a safety-critical application might require some safety guarantees. For example, we could ideally ensure that the controller applies reasonable control actions under all circumstances and does not take "forbidden" control actions. Adversarial examples for image classification (Goodfellow et al., 2014) have highlighted the sensitivity of ANNs by showing examples where image classifications can change drastically under practically invisible perturbations to the input images. This has led to the development of robust verification techniques (Cheng et al., 2017; Bunel et al., 2018; Botoeva et al., 2020; Ehlers, 2017) that analyze the input-output behaviour of ANNs by proving whether certain outputs can occur while the inputs are restricted to specific domains (Carlini and Wagner, 2017). Optimization and verification techniques might also be useful for obtaining safety guarantees for ANNs in process control applications. For example, Dai et al. (2021) guarantee Lyapunov stability of ANN controllers during training by also learning a Lyapunov function as an ANN, while Paulson and Mesbah (2020) propose a projection operator for guaranteeing feasibility and constraint satisfaction.
Data-driven process optimization. Here we consider situations where mathematical models are known for some processes and unit operations, but others are not fully known. To solve the optimization task, we need to combine the known algebraic expressions with learned surrogate models into a tractable optimization problem. Several types of surrogate models have been used for chemical engineering applications, including ANNs (Schweidtmann et al., 2019), radial basis functions (Wang and Ierapetritou, 2017), and Kriging/Gaussian processes (Palmer and Realff, 2002; Caballero and Grossmann, 2008; Jia et al., 2009). Restrictions with known algebraic expressions can be incorporated directly into the optimization problem, whereas unknown restrictions might also need to be learned from data and represented using surrogate models. Eason and Biegler (2016) presented a trust-region filter method for this specific problem structure. Bhosekar and Ierapetritou (2018) provide an overview of surrogate-based optimization methods. The optimization task here can also be viewed as a constrained black-box optimization problem, where the process units/operations with unknown mathematical models are considered as black-box functions. A variety of methods have been proposed for constrained black-box optimization problems (Audet and Dennis Jr, 2006; Hernández-Lobato et al., 2016; Conn et al., 2009; Banks et al., 2008; Boukouvala et al., 2016). We refer the interested reader to Conn et al. (2009); Boukouvala et al. (2016) for more details on black-box optimization.
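A minimal sketch of this setup follows. It is an illustrative stand-in, not any of the cited methods: the "unknown" unit is a hypothetical yield function sampled at a few operating points, a quadratic surrogate is fit to the samples, and the surrogate is combined with a known algebraic cost term and a known temperature bound in a simple grid search.

```python
import numpy as np

def true_yield(T):
    """Black-box unit: in practice only measurable, not known in closed form."""
    return 0.9 - (T - 350.0) ** 2 / 1000.0

# Sample the black box at a few operating points and fit a surrogate.
T_samples = np.array([300.0, 330.0, 360.0, 390.0])
y_samples = true_yield(T_samples)
coeffs = np.polyfit(T_samples, y_samples, 2)    # quadratic surrogate

def surrogate(T):
    return np.polyval(coeffs, T)

# Known algebraic part: operating cost rises linearly with temperature.
def cost(T):
    return 0.01 * (T - 300.0)

# Maximize profit = 2 * yield - cost subject to the known bound T <= 380,
# using a coarse grid search as a stand-in for a proper optimizer.
grid = np.linspace(300.0, 380.0, 161)
profit = 2.0 * surrogate(grid) - cost(grid)
T_opt = grid[np.argmax(profit)]
```

In a realistic workflow, the grid search would be replaced by a derivative-free or trust-region method, and the surrogate would be refined with new samples near the incumbent optimum.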
Optimizing over ML models with constraints. Most of the applications mentioned in this section involve optimizing over ML models with constraints that represent various limitations. Solving these optimization problems efficiently can be challenging, and different approaches are available depending on the type of machine learning model. Some machine learning models, such as ANNs with piecewise-linear activation functions, can be represented directly as mixed-integer linear optimization problems (Anderson et al., 2020; Fischetti and Jo, 2018; Tsay et al., 2021) and solved by established software. Global optimization techniques have also been presented for gradient-boosted trees (Mistry et al., 2020; Thebelt et al., 2021) and neural networks with sigmoidal activation functions (Schweidtmann et al., 2019). If the entire machine learning model and optimization problem can be represented by algebraic equations and inequalities, the problem can be passed to a deterministic global optimization solver, such as ANTIGONE (Misener and Floudas, 2014), BARON (Tawarmalani and Sahinidis, 2005), or SCIP (Gamrath et al., 2020).
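To make the mixed-integer representation concrete, consider a single ReLU neuron $y = \max(0, a)$ with pre-activation $a = w^{\top} x + b$ bounded by $|a| \le M$. A standard textbook big-M encoding (a generic formulation, not tied to any one of the cited works) introduces a binary variable $z$ indicating whether the neuron is active:

$$
y \ge a, \qquad y \ge 0, \qquad y \le a + M(1 - z), \qquad y \le M z, \qquad z \in \{0, 1\}.
$$

When $z = 1$ the constraints force $y = a$ (requiring $a \ge 0$), and when $z = 0$ they force $y = 0$ (requiring $a \le 0$), so the feasible set exactly reproduces the ReLU. Applying this encoding to every neuron turns the trained network into linear constraints suitable for a MILP solver.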

Conclusion
The overall vision of this review is that chemical engineers are using (1) traditional engineering approaches, (2) classical artificial intelligence, and (3) new research at the intersection of chemical engineering and artificial intelligence to derive significant value from "information-poor" data. The commonality between the four data characteristics we identify is that each of these types of data would be unsuited to classical machine learning approaches: the challenge for researchers at the interface between chemical engineering and computer science is to increase the information gained from the resulting available data.

Figure 1 :
Figure 1: Variance and volume are two common descriptors characterizing data.

Figure 2 :
Figure 2: Prediction surfaces of different types of data-driven models.

Figure 3 :
Figure 3: Uncertainty in chemical engineering data sets.

Table 1 :
Common descriptors that characterize data together with examples (in italics)