Optimized data exploration applied to the simulation of a chemical process

In complex simulation environments, certain parameter space regions may result in non-convergent or unphysical outcomes. All parameters can therefore be labeled with a binary class describing whether or not they lead to valid results. In general, it can be very difficult to determine feasible parameter regions, especially without previous knowledge. We propose a novel algorithm to explore such an unknown parameter space and improve its feasibility classification in an iterative way. Moreover, we include an additional optimization target in the algorithm to guide the exploration towards regions of interest and to improve the classification therein. In our method we make use of well-established concepts from the field of machine learning like kernel support vector machines and kernel ridge regression. From a comparison with a Kriging-based exploration approach based on recently published results we can show the advantages of our algorithm in a binary feasibility classification scenario with a discrete feasibility constraint violation. In this context, we also propose an improvement of the Kriging-based exploration approach. We apply our novel method to a fully realistic, industrially relevant chemical process simulation to demonstrate its practical usability and find a comparably good approximation of the data space topology from relatively few data points.


Introduction
Chemical process design is in its nature a multicriteria optimization (MCO) task [16,4]. To find a Pareto-optimal design [10], an engineer needs to explore multiple trade-offs between competing (design) parameters and objectives. Recently, powerful decision support tools have been developed for usage in industrial contexts to aid an engineer in this design process. These tools enable him to navigate through Pareto-efficient solutions from simulation runs performed for different parameter variations [4,7].
In a typical industrial application, a flowsheet simulation of a chemical process involves solving several hundreds to several thousands of nonlinear equations for each specified parameter combination. Only certain combinations of the parameters lead to numerically converging, physically reasonable results. Therefore, in order to estimate the Pareto frontier it is necessary to determine valid parameter combinations, i. e., to find the operation window or feasibility region in the parameter space.
A binary feasibility classification task in this context can be challenging due to an often vast parameter space and the computational effort involved in solving the system of nonlinear equations. There exist different approaches from various branches of engineering how to solve this problem. Many studies propose to reduce the computational cost by approximating the feasible region. For example, in Ref. [12] parameters are classified according to their feasibility using a support vector machine (SVM) [9] combined with k-means clustering to reduce the model training cost.
An alternative to binary classification of feasible points is to train a surrogate model of a continuous feasibility function, which reflects by how much the constraints which define the feasibility range are violated. The surrogate model is usually trained adaptively by sampling regions with high model uncertainty and the vicinity of the predicted boundary. Suggested ways to construct a surrogate model include high dimensional model representation techniques [1], Krigingbased modelling [5,6], and most recently, radial basis function (RBF)-based modelling [30]. The RBF-based modeling outperforms Kriging-based modeling in terms of achieving better accuracy with fewer sampling points, however, both approaches exhibit limitations in five-and higher-dimensional problems [30]. Moreover, in certain applications it is impossible to access the information about maximum constraints violation.
In this manuscript, we present a novel method to sequentially explore feasibility classifications. Inspired by Ref. [6] our proposed algorithm does not only provide us with an adaptive estimator for feasibility, but can also take an optimization target into account. This allows us to focus the exploration on parameter regions of interest. We therefore consider our approach as a method for optimized data exploration, which exceeds bare feasibility classification. Briefly put, we propose an adaptive screening of an operation window combining feasibility classification and optimization.
For demonstration purposes we apply our method to the simulation of a realistic chemical process. In particular, we consider the simulation of two serialized distillation columns to separate an azeotropic mixture of chloroform and acetone [3]. We show that we can improve the estimation of feasible regions in comparison with a uniform grid approach and a latin hypercube sampling (LHS) [15] approach. We use Chemasim, a BASF in-house flowsheet simulator, to run the simulations. By design, Chemasim, like other flowsheet simulators, does not provide a complete quantification of equation violations for unfeasible parameters, hence a surrogate model of a continuous feasibility function can not be applied. Our algorithm, however, is based entirely on a binary feasibility classification, which we can directly extract from Chemasim.
In the following, we will first present a formal description of our algorithm. Subsequently, we will explain its functionality with the help of a toy example. Using a benchmark, we will show the strengths of our method in comparison with a Kriging-based exploration approach for a binary feasibility classification scenario. We also propose an improvement of this Kriging-based approach for a discrete feasibility constraint violation. Finally, we will outline the chemical process simulation to which we have applied our algorithm and present the results. We will conclude with a brief summary and outlook.

Optimized data exploration
Our method of optimized data exploration can be considered as a sequential design of experiments, where each new experiment (i. e., each new simulation evaluation) is chosen based on a utility function. Summarized, the method consists of four major steps: (i) Start-up: Ensure that an initial data set is available. This preliminary step is required to ensure at least basic knowledge about the data topology. Therefore, one may use previously obtained data or evaluate a set of parameters, e. g., on a regular grid.
(ii) Choice: Choose the parameter for which the utility function is maximized. The utility function makes use of the previously obtained data to interpolate or extrapolate missing information.
(iii) Evaluation: Evaluate the simulation for the chosen parameter.
(iv) Repetition: Repeat from (ii) or finish the exploration based on a termination criterion.
The choice of a suitable utility function and a sufficient initial knowledge about the data are of course crucial for the success of our approach. From a practical perspective, data exploration is often driven by an optimization, hence we consider in the following that we will not only evaluate a simulation outcome for its validity but will also extract some kind of quantitative optimization target from the result. In order to emphasize on the main aspects of our data exploration method, we will only consider a single optimization target. A MCO problem can, however, always be reduced to this one-dimensional case by an appropriate scalarization of its targets.
In this section, we will first describe the general framework of our method in which we introduce the main formal ingredients. We will subsequently explain the exploration algorithm itself in more detail. A toy example will help us to demonstrate our method.

Framework
We use simulation evaluations and estimators to generate data. Therefore, we will first formally define those concepts. Furthermore, we will discuss our choice of a utility function. These definitions will serve as a framework for the following studies.

Data generation
We consider a simulation S which is described by a mapping of parameters x from the compact p-dimensional parameter space χ ⊂ R p onto the solution space This solution space consists of two parts. First, a classification space containing two classes which describe whether the simulation outcome y ≡ y(S, x) ∈ η was valid (i. e., numerically convergent and physically reasonable in the sense that the numerically obtained simulation result fulfills a set of predefined conditions) or invalid. And second, the optimization target space τ ⊂ R which contains all possible results for the optimization target t ≡ t(S, x) ∈ τ . The optimization target is meaningful only for valid simulation outcomes. Without loss of generality we assume in the following that a maximization of t is considered optimal. The symbol ⊗ in Eq. (2) denotes a tensor product. Each evaluation of the simulation therefore leads to a data point given by the evaluated parameter x, the corresponding outcome y and the optimization target t from the mapping S, Eq. (1). Although the optimization target is meaningless for invalid outcomes, we still include it in d(x) to achieve a unified notation for valid and invalid data points. Note that we assume that the result of a simulation is purely deterministic and completely defined by the choice of parameters x. The collection of results from n evaluations consequently allows us to define a data set 4 Data exploration is achieved by adding new elements to such a set. For reasons of convenience we use to denote the corresponding collection of evaluated parameters.

Prediction
A sufficiently large data set allows us to perform model predictions for solutions of yet unevaluated parameters with the help of an estimator which maps from the space of possible data sets D and the parameter space χ onto the solution space S, Eq. (2), and the predicted outcome probability space η p ≡ [0, 1]. The solution space contains the predicted outcomeŷ ≡ŷ(E, D, x) ∈ η and the predicted optimization targett ≡t(E, D, x) ∈ τ , whereas the predicted outcome probability space contains the probability of the predicted outcomeŝ pŷ ≡pŷ(E, D, x) ∈ [0, 1]. Briefly put, E allows us to predict simulation outcomes with their associated probability and optimization targets for arbitrary parameters based on previous simulation results. So far, E is considered completely general and our algorithm is not limited to a specific choice. However, it has turned out in practice that E is best represented by two independent estimators: First, a kernel SVM used for a classification of the outcomeŷ and second, a kernel ridge regression (RR) [17] to predict the optimization targett. The probability of the predicted outcomeŝ pŷ is then obtained from Platt scaling [20]. The kernel RR can be understood as a surrogate model for the optimization target. For both of these estimations we use the well-known kernel method [11], which is discussed in section A in more detail.

Utility function
At the heart of our data exploration method lies the utility function which describes the estimated exploration benefit of evaluating the parameter x based on a previously obtained data set D. Therefore, in each sequential exploration step a new evaluation of the simulation is chosen for the parameter with the best utility score. The two ingredients of the utility function are the utility vector and the weight vector which both consist of three components. Each component of the utility vector can be assigned a straightforward interpretation: Outcome prediction uncertainty Optimization target prediction Distance to nearest neighbor The components (or weights) s ≥ 0, o ≥ 0 and r ≥ 0 of the weight vector determine the influence of U s (E, D, x), U o (E, D, x) and U r (D, x), respectively, on the utility function. In other words, w determines the explorative behavior.
The expression ||w|| 1 = s + o + r in Eq. (8) represents the 1-norm of w and since by definition u ∈ [0, 1] 3 , as we will see below, one has U (E, D, x, w) ∈ [0, 1]. In the following, we will formally define the components of u, Eq. (10). The first component represents the estimated outcome prediction uncertainty based on the Shannon information entropy [27] S(pŷ) ≡ −pŷ lnpŷ + (1 −pŷ) ln(1 −pŷ) ln 2 (14) in bits. Here we have recalled the estimated probability of predicting a valid or invalid outcomepŷ(E, D, x) from the estimator mapping, Eq. (7). The second component (15) represents the estimated optimization score. It is based on which makes use of the extremal optimization targets respectively, to rescale the predicted optimization targett(E, D, x), from Eq. (7), in such a way thatt r (E, D, x) ∈ [0, 1]. As a last component, the utility vector contains the classification feature space distance [25] Here, the expression || · || 2 stands for the 2-norm distance and γ C represents a hyperparameter of the kernel SVM estimator. A derivation of Eq. (18) can be found in section B. By definition, U r (D, x) becomes smaller the more similar the parameter x is to its nearest neighbor in D x , Eq. (6). Therefore, this expression can also be understood as an artificial repulsion of data points which ensures that newly suggested parameters x new explore unknown regions of the parameter space χ. The amount of utility reduction with decreasing distance is controlled by γ C . In section A this hyperparameter is discussed in more detail. Note that it is determined during the training of E, Eq. (7), as explained further below.
Summarized, we have introduced the utility function U (E, D, x, w), Eq. (8), based on the utility vector u, Eq. (10), and the weight vector w, Eq. (11). The utility vector consists of three components with a distinct meaning, Eq. (12). These components are weighted by the weights s, o and r contained in the weight vector. Consequently, we can control the explorative behavior by tuning the weights.

Algorithm
Our data exploration method is outlined in Algorithm 1. As described in the introduction of this section, it can be partitioned in four major steps. These steps correspond to the following lines of the algorithm: As a termination criterion we use a desired number of newly evaluated points N . We assume here that the simulation S, Eq. (1), is defined by the application and is not modified during the exploration process. By contrast, the estimator E, Eq. (7), is retrained in each iteration step. For the sake of simplicity we omit S and E in the notation.

7
Algorithm 1 Outline of our novel data exploration algorithm. A detailed description can be found in section 2.2. while n < N do 13: x ← Suggestion(D expl , χ, w) 14: D expl ← D expl ∪ Simulation(x) 15: n ← n + 1 16: end while 17: return D expl 18: end function

Functions
The entry point to the algorithm is the Exploration function. Its arguments represent an initial data set from previous evaluations D init (which might also be an empty set), the parameter space to explore χ, the parameter G ≥ 0 controlling the number of initial evaluations G p in the hyperrectangular compact set χ G ⊆ χ, the desired total number of points to evaluate N ≥ G p and the weight vector w, Eq. (11), which controls the behavior of the utility function, Eq. (8). Furthermore, the Exploration function contains three implicit functions: • Simulation: Performs the mapping S and returns the corresponding data point d(x).
• Suggestion: Performs two steps in order to obtain the next parameter to evaluate. First, the estimator E for the current data set D expl is trained. Second, Eq. (9) is evaluated and the parameter x new with the best utility is returned.
• Grid: Returns a set of G p ≤ N parameters which constitute a regular grid in the parameter subspace χ G . If previous knowledge about the data topology is available, it is generally reasonable to choose a starting grid in such a way that both valid and invalid solutions are sampled in a region of an expectably good optimization target.
After N points have been evaluated, the explored data set D expl ⊃ D is returned.
The simulation S, Eq. (1), is considered completely general up to this point. Therefore, our algorithm is universal and can be applied in many different scenarios. We will specify simulation mappings further below in explicit examples for data exploration.

Estimator training
During the training of E we tune the hyperparameters of the SVM and the RR independently in each iteration step by cross validation [11] as soon as the size of the data set D expl allows it. For training and prediction we use standardized parameters by removing the mean and scaling to unit variance. All available data at each iteration step is considered as training data.
The optimization target t corresponding to an invalid outcome y is meaningless, but it might nevertheless be of practical use to be able to assign some numerical value to it, e. g., to train an estimator. Therefore, we assume in the following that invalid optimization targets taken from a data set D correspond to the worst valid optimization target of this data set t min (D), Eq. (17b), for all practical purposes.
Moreover, we assume that the initial data set D init or the data set obtained from the initial grid sampling is sufficiently large so that a suitable estimator of our choice is well-defined. Otherwise, either the start-up step has to be changed accordingly or a different estimator has to be chosen. We will not further discuss such pathological cases.

Demonstration
To illustrate our data exploration method from the previous section, we will first present a two-dimensional toy example (i. e., p = 2). Specifically, we consider parameters in the parameter space Thus, the parameter space is a simple square of edge length 4.

Toy simulation
The chosen toy simulation S toy , Eq. (1), is given by respectively, where ||x|| 2 denotes the 2-norm of x. Thus, simulation outcomes y are considered valid within and on a circle of radius √ 3 in the first and third quadrant and invalid otherwise as shown in Figure 1. This particular example is interesting to study data exploration behavior because it has both sharp and round edges and two distinct feasibility areas only connected by a single point. The optimization target t directly corresponds to x 1 .
We assume no previous knowledge about the data so that D init = {}. The initial grid is chosen to expand uniformly across the whole parameter space χ G = χ toy , Eq. (20). To solve Eq. (9) numerically in each iteration step we use a differential evolution approach [28]. Although our proposed algorithm, Algorithm 1, is completely deterministic in its general form, the usage of cross validation for the estimator training, the statistics involved in the Platt scaling and a differential evolution approach to solve the optimization problem introduce a certain degree of randomness. The presented results are therefore chosen as representative examples.

Results
The results are shown in Figure 2. Each row, (1) to (5), corresponds to a different set of exploration hyperparameters N , G, χ G , s, o and r as summarized in table 1. We will hereafter refer to a set of such hyperparameters as a setup. Each setup yields a different explored data set D expl . Column (a) shows the evaluated outcomes y, Eq. (21a), and labels them as valid ( ) or invalid ( ). The contours ( ) separate the valid from the invalid predicted outcomesŷ, Eq. (7), as given by the final estimator. The corresponding regions are shaded accordingly ( and ). Column (b) shows the optimization target t, Eq. (21b), for valid outcomes as color-interpolated markers from the best possible result t = √ 3 ( ) to the worst possible result t = − √ 3 ( ). Also shown are the invalid outcomes ( ). For the first and second row we also mark the iteration number n + 1 from Algorithm 1, i. e., the chronological order in which the points have been evaluated after the initial grid had been set up.
Apparently, setup (1) already provides us with a rough estimate of the parameter topology with 16 evaluations, which is then refined by setups (2) and (3) with 9 and 39 additional evaluations, respectively. For all of these first three setups, the optimization target has no effect on the utility. This situation changes for setup (4), where we make it the main contribution to the utility function. Consequently, the evaluations are concentrated in an area with high values of t and the lower left part of the topology is hardly explored. Finally, setup (5) shows a pure grid approach in the framework of our algorithm for comparison, which is finished after the initial sampling step since G 2 = N .

Relative success rate and score
In section C we define the relative success rate and the score as a quality measure for our estimator. The score represents the fraction of correct outcome predictions performed for all parameters in the explored data set D expl , whereas R is defined as the fraction of correct outcome predictions for almost all parameters in the parameter space χ. In other words, σ is a local and R a global quality measure of our estimator. The fraction consequently tells us how well the outcome predictions from the estimator obtained from the explored data set generalize to the whole parameter space. If D expl is a representative subset of χ, g can consequently be used to measure whether the estimator is overfitted (g > 1) or underfitted (g < 1). However, since the aim of our exploration algorithm is to find a representative subset of the parameter space in the first place, we can instead consider g as an estimated quality measure for our training set itself. For g > 1 the local prediction is better than the global prediction, hence the explored data can be seen as an oversimplified subset. For g < 1, conversely, the local prediction is worse than the global prediction and the explored data can be seen as overcomplicated subset. If g = 1, the local and global predictions are equally good. In this case the explored data set can be considered as perfectly representative. The kernel SVM we use for classification allows us to almost always achieve a perfect score σ and therefore we remain in the realm g ≥ 1. Practice has shown that it seems to be a good approach to use such oversimplified data subsets during exploration. This observation could be due to the fact that our utility function is in such cases mostly based on the main features of the classification border and tends to neglect minor details, which are usually not important for all but the very last exploration steps.

Ratios of false positives and false negatives
Two additional global quality measures for our estimator are given by the ratio of false positives and the ratio of false negatives defined in section D. We use the convention that positive results correspond to valid outcomes and negative results to invalid outcomes. Consequently, r fp represents the fraction of wrongly predicted valid outcomes and r fn the fraction of wrongly predicted invalid outcomes, respectively, performed for almost all parameters in the parameter space χ. An estimator of high quality is indicated by a high relative success rate R, Eq. (22), a low ratio of false positives r fp and a low ratio of false negatives r fn . Depending on the application, either false positives or false negatives might be considered far more adverse. However, if the two ratios are to be valuated equally, considering R might be a sufficient quality measure since by definition holds true.

Validity ratio
The calculation of valid data points yields meaningful optimization targets and therefore provides us with more information than the calculation of invalid data points. Hence, the fraction of valid to invalid data points can serve as a measure for the usefulness of a sampling. For this purpose we have defined the validity ratio as the fraction of valid to invalid data points in the explored data set D expl in section E. Moreover, we have defined its reference limit as the fraction of the total volumes for valid and invalid outcomes, respectively, in the whole parameter space. The validity ratio α of a uniform random sampling in the whole parameter space will eventually converge to α ∞ for a sufficient number of samples. Therefore, this value represents a reasonable reference scale for the validity rate α which other sampling approaches have to be measured up to. The calculation of α ∞ also allows us to specify a worst-case reference limit for the ratios of false positives and false negatives, Eq. (25). Specifically, we consider a completely randomized estimator that predicts valid and invalid outcomes with equal chance. Performing such random predictions for almost all parameters in χ lead us to the reference limit for the ratio of false positives and the reference limit for the ratio of false negatives respectively as explained in section E. An estimator exceeding these limits is consequently worse than a "random coin-tossing" estimator.

Exploration characteristics
We list the characteristic values of exploration R, Eq. (22), σ, Eq. (23), r fp , Eq. (25a), r fn , Eq. (25b), and α, Eq. (27) together with the best and worst optimization targets and respectively, for each of the explored data sets, (1) to (5) with N = 10 000 data points; see section C. It is important to emphasize that global quality measures like R are only possible because our considerations are not limited to predefined data sets, but rather make use of the fact that we can calculate new simulation data on demand. This enables us to create the randomly chosen samples necessary for the Monte Carlo approach.

Discussion
A comparison of the setups (2) and (5) in table 2 shows that our data exploration approach increases the highest relative success rate R by almost 8% in comparison with a regular grid sampling with the same number of sampling points. As expected, the highest value for R is given by setup (3), followed by the setups (2) and (1). The worst performance is given by setup (4) and the grid approach, setup (5). However, while setup (4) has a poor value of R, it leads to the best optimization target t best ≈ 1.716, which almost reaches the theoretical limit of √ 3 ≈ 1.732. This is no surprise given our choice of the weight vector w, Eq. (11), which enforces parameter sampling near the best optimization target. Summarized, we see from our toy example that by tuning the weight vector we can intuitively control the exploration behavior.
The score σ is perfect for all setups, which can be expected from such small training sets. Thus, we remain in the realm g > 1, Eq. (24).
Only the setups (2) and (3) have ratios r fp below the worst-case reference limit r fp∞ , Eq. (29a), while the ratios of all other setups exceed it. The best ratio is achieved for setup (3) and the worst for setup (4), which incorporates the optimization target. The ratios r fn , on the other hand, are almost the same for all setups and are all much smaller than the worst-case reference limit r fn∞ , Eq. (29b). We assume that this result is due to the fact that invalid outcomes are mostly found in the outer realm of the parameter space, a topological behavior that can be uncovered with almost any sampling method even with only a few samples because of the low dimensionality of the parameter space. A comparison of the validity ratios α reveals that the best result is achieved for setup (4), followed by (3) and (1). Setups (2) and (5) even fail to beat the reference limit α ∞ , Eq. (28). As we will see further below, these relatively bad validity ratios are a consequence of the very small number of samples considered here and will improve with ongoing exploration. The fact that setup (1) has a better validity ratio than setup (2) is also a result of the sparse sampling.
For an exploration approach with a weight o = 0, for which the optimization target is ignored, a lower value of t worst can be considered more favorable since it indicates a more complete parameter space exploration. On the other hand, if the optimization target is of importance by choosing a weight o > 0, exploration of such uninteresting regions should rather be avoided and a higher value of t worst can be considered more favorable. The latter is the case for setup (4) and we find that it shares the same value for t worst with setup (1) and the grid approach, setup (5).

Summary
The toy example clearly shows how the weights in the utility function can be used to control the explorative behavior in an intuitive way. For an accomplished data exploration, we consider a high value of R, α and t best as desirable results, whereas r fp and r fn should be small. In other words, we seek (i) a good outcome prediction while (ii) evaluations in regions with a good optimization score should be preferred and (iii) the explored data set has a good ratio of valid to invalid outcomes. In this sense, these five exploration characteristics can be seen as objectives of a MCO problem with possibly conflicting goals. Relative importance of these objectives varies by the application.  Figure 2: Comparison of different data exploration setups for the toy example. Each row represents one of the exploration setups (1) to (5) from table 1. Column (a) shows the sampled outcomes y, Eq. (21a), with colored markers and the predicted outcomesŷ, Eq. (7), with colored regions in the toy example parameter space χ toy , Eq. (20). Column (b) shows the respective targets t, Eq. (21b), for each sample. In both columns we also indicate the contours of the true feasibility region in analogy to Figure 1. The numbers in (1b) and (2b) show the chronological order in which the points have been evaluated after the initial grid had been set up.  (1) to (5) with their respective exploration hyperparameters. We list the total number of samples N and the parameter G controlling the number of initial samples G 2 , which are placed on a regular grid in χ G = χ toy , Eq. (20). The weights s, o and r constituting the weight vector w, Eq. (11), determine the influence of the three components of the utility function, Eq. (8). Specifically, s controls the outcome prediction uncertainty impact, o the optimization target prediction impact and r the distance to the nearest neighbor impact, respectively, Eq. (12). Setups (1) to (4) use our algorithm, whereas setup (5) represents a pure grid approach in the framework of our algorithm, which is finished after the initial sampling step since  (23), the ratio of false positives r fp , Eq. (25a), the ratio of false negatives r fn , Eq. (25b), the validity ratio α, Eq. (27), the best optimization score t best , Eq. (30a), and the worst optimization score t worst , Eq. (30b). The arrows indicate whether high values (↑) or low values (↓) are considered more favorable for each characteristic.
Depending on the context, either a low value of t worst (e. g., for o = 0) or a high value (e. g., for o > 0) can be desired. (8)

Benchmark
To demonstrate the strengths of our algorithm, we will in the following briefly present a benchmark with a Kriging-based exploration approach, which has been described in Refs. [5,30]. We have already briefly mentioned this alternative strategy in section 1. Summarized, the Kriging-based exploration works in an iterative way similar to our novel method: Starting from an initial data set, a feasibility estimator is trained each step with the currently explored data points. However, in contrast to our algorithm, the estimator is continuous and therefore relies on a continuous feasibility function which reflects the degree of feasibility constraint violation. Based on the continuous estimator, a new parameter is suggested and the simulation is evaluated for the new parameter. The resulting data point is included in the set of explored data points and the next iteration begins.
A more detailed explanation of the Kriging-based exploration can be found in Refs. [5,30]. Our algorithm is described in section 2. For the benchmark we will make use of our toy simulation, Eq. (21), from section 3.

Competing algorithms
The main purpose of the benchmark is to study the behavior of the two algorithms of interest -our proposed method and the Kriging-based approach -with respect to the number of sampled points N . Therefore, we run our algorithm on a collection of setups (6) n defined by the hyperparameters G = 4, χ G = χ toy , s = 2, o = 0 and r = 1. By setting s > r we suppressed the spreading of data points in favor of a more precise feasibility border sampling. Each setup in the collection only differs by its number of samples N = n.
The notation of the setups follows section 3. Furthermore, we use an additional convention: The type of brackets we use to denote an exploration setup stands for the type of algorithm this setup is using to determine the sampling. Setups with round brackets refer to our algorithm, whereas non-round brackets represent different algorithms, as we will see in the following.

Kriging-based approach
For comparison, we run the Kriging-based exploration algorithm on a collection of setups [7] n , where n stands for the total number of sampled points in the same sense as for the setups (6) n . For the estimator we use a Gaussian Process Regression (GPR) [22] and achieved the best results using a Matérn kernel [14] with smoothness parameter ν = 1.5. In each iteration the hyperparameters of the GPR kernel are optimized by maximizing the log-marginal-likelihood of the GPR model with the help of the L-BFGS-B algorithm [8]. The training data is standardized by removing the mean and scaling to unit variance. Because the toy simulation, Eq. (21), is defined as a binary classification problem and by design no explicit information about feasibility constraint violation is available, we use as target values for the GPR model. In each iteration step, a new parameter x GPR new ∈ χ is sampled where the expected improvement [30] becomes maximal so that holds true. Here we have made use of the GPR model predictionŷ GPR (x) and its corresponding standard error s GPR (x) evaluated for the parameter x.
The initial data set for the training of the GPR model is chosen as a regular grid of 4 × 4 data points in complete analogy to the initial data set for our algorithm. The numerical solution of Eq. (34) is obtained with a differential evolution approach.

Modified Kriging-based approach
The expected improvement, Eq. (33), is supposed to focus the sampling on the feasibility border where the predicted violationŷ GPR (x) vanishes. However, in our binary classification scenario with the constraint violation described by Eq. (32), all points are penalized by either one of the two outcomes regardless of their distance to the border. This lack of knowledge about a continuous distance measure might significantly worsen the performance of the Krigingbased approach. Therefore, we suggest a modification which can compensate this deficiency. Specifically, we consider a reformulation represents an upper bound of the closest Euclidean distance of the parameter x to the feasibility border. According to Eq. (36), this upper bound corresponds to the Euclidean distance of x to its nearest neighbor of opposing feasibility x in the currently explored data set D expl . In particular, additional samples in D expl can only improve d fb (x), which converges to a tight bound for the theoretical limit of an infinite number of samples.
Apart from the major difference of using y GPR continuous (x) instead of y GPR (x) we leave the Kriging-based algorithm unchanged. For comparison, we run this modified algorithm on a collection of setups {8} n , where n stands for the total number of sampled points in the same sense as for the setups (6) n and [7] n .

Results
For the benchmark we compare the exploration characteristics introduced in section 3.2. Specifically, we consider the success rate R, Eq. (31), the score σ, Eq. (23), the ratio of false positives r fp , Eq. (25a), the ratio of false negatives r fn , Eq. (25b), and the validity ratio α, Eq. (27), together with the best and worst optimization scores t best and t worst , respectively, Eq. (30). The exploration characteristics are summarized in table 3 for four different numbers of samples n ∈ {25, 50, 100, 150} for each of the three candidate algorithms. Additionally, we show the progression of certain characteristics for an increasing number of samples n ∈ [20,150] in Figure 3. The vertical lines ( ) correspond to the four chosen numbers of samples from table 3. Finally, snapshots of the explored outcomes for these samples can be found in Figure 4 analogously to the left column in Figure 2.

Discussion
The top plot in Figure 3 shows that after about 35 samples our algorithm ( ) steadily exceeds the success rate R of the Kriging-based method ( ) by roughly 0.02, and similarly, by roughly 0.01 the success rate R of the modified Kriging algorithm ( ), albeit after about 80 samples. Both Kriging models exhibit higher R values initially, however at the cost of increased false positive rates r fp . The false positive rate is consistently better for our algorithm over all number of samples. Both Kriging-based algorithms exhibit a peak in the false positive rate in the initial sampling stage up, even up to the reference limit r fp∞ , Eq. (29a), indicating overestimation of the feasible range as illustrated in the top row of Figure 4. On the other hand, the third plot in Figure 3 shows that our algorithm tends to underestimate the feasible region, especially for the first 35 samples. However, the major difference between benchmarked algorithms is highlighted in the bottom plot in Figure 3. Our algorithm exceeds the reference validity ratio α ∞ , Eq. (28) already after about 35 samples. After about 70 samples the validity ratio of our algorithm is twice as high as that of the Krigingbased algorithm, and about 4/3 times higher than that of the modified Kriging algorithm. Such high validity ratio values indicate that majority of new points are placed within feasible region and thus relevant parameter space is sampled more efficiently. Without modification, the Kriging-based algorithm reaches the validity ratio of a uniform sampling, thus confirming that the binary constraint violation formulation in Eq. (32) does not penalize sufficiently sampling far from the feasibility border. The modified constraint violation function, Eq. (35), appears to partially address this issue as it results in an increased validity ratio, albeit it does not reach the level of our algorithm.
Our algorithm constantly pushes t best and t worst to their limits with an increasing number of samples, whereas the Kriging-based algorithm already reaches fixed values after about 25 sampled points. Those fixed values are far from the theoretical limits given by ± √ 3 ≈ ±1.732. This indicates that an exploration of the feasibility region is rather coarse with the Kriging-based approach. The modified Kriging algorithm reaches similar t best and t worst values indicating better sampling at the feasible boundary than the unmodified Kriging-based algorithm. Note that we have not included the optimization target in our utility function, which would lead to a better value of t best for much less samples as shown in section 3.2. Moreover, the optimization target is not explicitly considered in the Kriging-based algorithms in the first place. The best optimization target can therefore be expected to be much worse if the target maximum is not directly located on the feasibility border.
From the progression of R in Figure 3 we see that the prediction quality of all compared algorithms appears to be almost saturated for 150 samples. Although additional data points can still lead to an improvement of the estimators' precision, this improvement is expected to be rather small and might not be worth the additional calculation effort. According to table 3 we have σ < R so that g < 1, Eq. (24), for (6) 150 , which indicates that the local prediction is worse than the global prediction. Thus, the explored data set can be seen as an overcomplicated subset of the complete parameter space. Such a behavior could in fact be used as a possible quantitative stopping criterion for our algorithm in agreement with the qualitatively observed saturation.

Summary
The benchmark has shown that our method is comparable to the Kriging-based approach for a very small number of samples, but outperforms it for an increasing number of iterations. It is particularly remarkable that we achieve a much higher ratio of valid to invalid data points with our algorithm, which makes our sampling more efficient. Moreover, we can clearly consider our modification of the Kriging-based approach as an improvement for the case of a discrete feasibility constraint violation.

Application
In the current section we will apply the proposed algorithm from section 2 to the simulation of a realistic chemical process. This process describes the inner workings of a production plant consisting of two connected distillation columns. The task of the plant is to separate a two-component mixture consisting of chloroform and acetone by means of a so-called pressure swing distillation. Such kind of operations are very common in chemical engineering and their physical description is well-known [3].
As shown in Figure 5 each column has one input stream (feed) and two output streams (distillate and bottom stream), consisting of chloroform and acetone  (31), the score σ, Eq. (23), the ratio of false positives r fp , Eq. (25a), the ratio of false negatives r fn , Eq. (25b), the validity ratio α, Eq. (27), the best optimization score t best , Eq. (30a), and the worst optimization score t worst , Eq. (30b). The arrows indicate whether high values (↑) or low values (↓) are considered more favorable for each characteristic. Depending on the context, either a low or a high value of t worst might be preferable. According to these characteristics, our algorithm is comparable to the Kriging-based approach for few samples, but superior for many. The modified Kriging-based approach generally performs better than the original version. A visualization of exploration characteristics for different sample sizes can be found in Figure 3.  mixtures. Different parameters such as the operating pressure P influence the concentrations of the components in the output streams. The two columns are connected in such a way that the bottom stream of the first column constitutes the feed for the second column and the bottom stream of the second column is recycled and mixed with the educt of the process to form the feed of the first column. The two distillate streams constitute the products of the process, which are desired to have high purities. The distillation process we consider is not straightforward since the mixture of chloroform and acetone exhibits an azeotropic behavior. This means that at the azeotropic point, the composition in the vapor phase equals the composition in the liquid phase, which is sensitive to P in this particular case. One possibility to surpass this distillation limit and to separate the azeotropic mixture is to use a special setup in which the first column operates at low pressure (P = 1 bar) and the second column operates at high pressure (P = 10 bar). As a result, a high concentration of acetone in the distillate stream of the first column can be achieved and the concentrations in the corresponding bottom stream are close to the azeotrope. The higher pressure in the second column changes the azeotrope composition in such a way that a high concentration of chloroform can be achieved in the distillate stream of the second column.

Chemical simulation
The most common approach for modeling distillation columns is the equilibrium stage model (ESM) [3], which is based on a cascade of interconnected equilibrium stages. Every stage has a vapor (boiling) and a liquid (condensing) output stream, where the vapor stream of each stage raises to the stage above and the liquid output stream of each stage flows to the stage below. The reflux ratio of a column represents the ratio between the internal liquid streams of the last two stages and the distillation stream. A higher reflux ratio means that more energy is needed for cooling and heating. For every stage certain physically motivated equations (conservation of mass and conservation of enthalpy together with thermodynamic equilibrium and closure conditions) must hold. The ESM is used in all commercially available flow sheet simulators and is used wold-wide to simulate chemical distillation processes.
We model each column with an ESM consisting of 28 stages (not shown in Figure 5). The non-random two-liquid (NRTL) model [23] is used to describe the interactions between the substances in each stage. The complete chemical process is then represented by a system of about 400 coupled linearly independent equations, some of which are highly nonlinear. We also have a comparable number of independent internal variables describing, e. g., the concentrations, temperatures and flow rates of the internal streams between the stages.
As already mentioned in section 1, we use the flowsheet simulator Chemasim to perform the calculations. Given the parameters x, we formally define the function if the Chemasim evaluation using x is well-defined and leads to a convergent and physically feasible result invalid otherwise (37) to describe the outcome of a Chemasim evaluation in the classification space η, Eq. (3). Convergence and physical feasibility are exclusively decided by Chemasim-internal criteria and we have no knowledge about the degree of a violation, i. e., we have no access to a continuous feasibility function.
We consider a four-dimensional parameter space (i. e., p = 4) for the chemical process simulation. It is constituted by the mass fractions of acetone at the distillate stream of column one m ac ∈ [0.1, 1.0] and chloroform at the distillate stream of column two m cl ∈ [0.8, 1.0], as well as the reflux ratios of the two columns r 1 ∈ [5, 35] and r 2 ∈ [5, 35], respectively. Using the parameter vector our parameter space can consequently be written as and therefore corresponds to a four dimensional hyperrectangle. The simulation S sim , Eq. (1), can formally be expressed by respectively, where we have recalled Eq. (37). Our optimization target t(S sim , x) is chosen as the average of the two distillate mass fractions, which we aim to maximize. Recall that by definition, the optimization target is only meaningful for valid simulation outcomes y(S sim , x). We specify sufficient internal variables of the simulation with a suitably chosen but fixed value so that given the parameters x the system of equations is well-defined and can be solved by Chemasim.
We assume that we have no previous knowledge about the data (i. e., D init = {}) except for the fact that there is a parameter space window in which both valid and invalid parameters can be found. For the start-up step of our algorithm we choose a regular grid which expands inside of a chosen hyperrectangle χ G . Eq. (9) is again solved numerically using a differential evolution approach.

Results
The results are shown in Figure 6 analogously to Figure 2. To achieve a two dimensional illustration of the four dimensional parameter space χ sim , Eq. (39), we show three different planar cuts through χ sim in the first three columns and project all data points onto the respective planes. Specifically, we choose the planes for which holds true with c = 10 in column (a 1 ), c = 20 in column (a 2 ) and c = 30 in column (a 3 ), respectively. As a consequence, the outcomes of the projected data points do not necessarily have to correspond to the estimated feasibility borders ( ) on each plane. It is also important to emphasize that the exact theoretical feasibility border is unknown so we cannot plot it. For the optimization target visualization in column (b), the specific choice of the projection plane does not affect the plots. Each of the five rows represents a different setup as summarized in table 4. Specifically, the first two setups are based on N = 40 evaluations and while setup (1) ignores the optimization target, it is taken into account by setup (2) with an equal weight. The third and fourth setup also contain these two cases but for N = 81 evaluations. All of these four setups have an initial grid parameter G = 2 and make use of a grid in the parameter space window χ win , Eq. (41). Finally, the setups (5) and (6) represent pure grid approaches in the framework of our algorithm, which are finished after the initial sampling step since G 4 = N . In table 4 we also list three additional setups, 7 to 9 , which are not shown in Figure 6. Each of these setups represents a typical LHS of N parameter points in χ sim , Eq. (39). The other exploration hyperparameters have no meaning for these setups.
As one would expect, it becomes apparent from columns (a 1 ) to (a 3 ) that the estimated feasible regions expand and include higher distillate mass fractions m ac and m cl , respectively, with increased reflux ratios r 1 = r 2 . Interestingly, this observation can be made for all depicted setups. The only exception can be found in (1a 2 ), which covers a smaller validity area than (1a 1 ), but still includes higher distillate mass fractions. Since r 1 , r 2 ∈ [5, 35], column (a 2 ) in fact shows a projection onto the central reflux plane. For the calculation of R, r fp , r fn and the respective reference limits α ∞ , r fp∞ and r fn∞ we use a Monte Carlo approach with N = 10 000 data points. Specifically, we make use of the success rate approximation, Eq. (31), the approximation of the ratio of false positives

Exploration characteristics
and the approximation of the ratio of false negatives as explained in sections C and D, respectively. Moreover, the above-mentioned reference limits are approximated by and as explained in sections D and E, respectively. A numerical evaluation of Eq. (44) yields which indicates a close balance between valid and invalid points in the whole parameter space. Using Eq. (46) we can directly determine and which are also very similar due to this balancing.

Discussion
We find that according to table 5 setup (3) reaches the best value for the relative success rate R, closely followed by setup (1). Both results have overlapping confidence intervals. A slightly worse value for R is achieved by the setups (2) and (4). By contrast, remarkably bad results can be found for the grid and LHS approaches. A comparison from setup (6) with setup (3) reveals that our data exploration approach results in an improvement of R by almost 90% with only one third of the sampling points (81 instead of 256). This result shows that our algorithm is way more efficient especially for higher dimensional spaces than a uniform grid sampling. The scores σ show that the estimators trained on the LHS samples also perform bad on the training set itself. Since σ increases with a higher number of samples, we assume that this behavior might be a consequence of the sparsity of the sample in the four dimensional parameter space, which can lead to an overly complex prediction of the feasibility border. It might therefore be possible that using a different estimator may result in a better overall performance for the LHS approaches. For all setups, we have g > 1, Eq. (24).
When we compare the ratios of false positives and false negatives for the first four setups, we find that setup (3) has the lowest value of r fp , followed by (1), (2) and (4). The lowest value of r np is, on the other hand, achieved by (4), followed by (2), (1) and (3). It is remarkable that the setups (4), 8 and 9 exceed the reference limit r fp∞ , Eq. (47a), but have almost no false negatives, whereas the setups (5), (6) and 7 exceed the reference limit r np∞ , Eq. (47b), but have almost no false positives. Clearly, all of the associated estimators generalize very badly due to the unrepresentative sampling used for the training.
The validity ratio α is best for setup (4), followed by setup (2). Both of these ratios clearly exceed the reference limit α ∞ , Eq. (46). The other setups, including the three LHS approaches, only have validity ratios smaller than α ∞ and can therefore be considered less useful than a typical random sampling. This result is no surprise since respecting the optimization score will guide the exploration towards parameter space realms with valid outcomes.
As expected, we see from a comparison between (1) and (2) or (3) and (4) in Figure 6 that omission of the optimization target leads to a more regular spreading of the evaluations in the parameter space. Again, it is no surprise that incorporating the optimization score can lead to a worse outcome classification.
The highest value for t best is reached by setup (4), closely followed by setup 9 . All other setups achieve slightly worse values. Therefore, we find that even a small number of evaluations provides us with a reasonably good optimization target. From a direct optimization of the simulation by means of a MISQP procedure [24] we find an optimization target of t ≈ 0.950, which is only slightly better than our best exploration result of t ≈ 0.930. Its calculation can, however, require a few hundred parameter evaluations. Moreover, the Chemasim-internal optimizer has difficulties solving this specific problem due to the fact that the optimization target, Eq. (40b), has vanishing derivatives with respect to r 1 and r 2 , which makes them insignificant for a gradient-based optimization approach. The feasibility outcome, on the other hand, explicitly depends on those parameters. Since the optimizer ignores this relation, it can be very difficult to find an optimal target in Chemasim without injecting expert knowledge.
The value of t worst shows us in how far regions of a rather uninteresting optimization score have been explored. Exploration of such regions might be necessary to improve R, but should be avoided in favor of more relevant regions if the optimization target is of interest. From a comparison between (3) and (4) we find, without surprise, that incorporating the optimization target into the utility function leads to a more favorable value of t worst . However, a comparison between (1) and (2) shows the opposite effect. We assume that this behavior is a result of an unfavorably trained estimator for the optimization target during one iteration of the exploration process.

Summary
Our data exploration method has proven to be clearly superior to grid-based or LHS approaches. From the calculation of comparably few data points we already get a very good estimation of the feasibility region in the whole parameter space. Moreover, by incorporating the optimization target in the utility function we can achieve a sampling which contains data points very close to the optimum. Such data points can serve as suitable starting points for a rigorous optimization algorithm.

Conclusions and Outlook
From our benchmark, we have found that our method yields better exploration characteristics than a previously suggested Kriging-based approach [5,30] for a binary feasibility classification scenario as soon as a critical number of sampling points has been exceeded. We have also seen that the ratio of valid to invalid data points in the sampling is much higher with our strategy, which can be important if the data set should be further used, e. g., to train a shortcut model or for optimization purposes. The performance of the Kriging-based approach might have been worsened by the lack of knowledge about a continuous feasibility constraint violation in our example. Therefore, we have suggested an   (4), which use our algorithm, the setups (5) and (6), which represent a pure grid approach in the framework of our algorithm, and the setups 7 to 9 , which are based on a LHS approach, with their respective exploration hyperparameters. For each setup, we show the total number of samples N . For setups (1) to (6) (1) to (4), which use our algorithm, the setups (5) and (6), which represent a pure grid approach in the framework of our algorithm, and the setups 7 to 9 , which are based on a LHS approach. The respective hyperparameters are shown in table 4. We list the success rate R, Eq. (31), the score σ, Eq. (23), the ratio of false positives r fp , Eq. (43a), the ratio of false negatives r fn , Eq. (43b), the validity ratio α, Eq. (27), the best optimization score t best , Eq. (30a), and the worst optimization score t worst , Eq. (30b). The arrows indicate whether high values (↑) or low values (↓) are considered more favorable for each characteristic. Depending on the context, either a low or a high value of t worst might be preferable. These characteristics show that our algorithm provides us with a very good estimation of the feasibility region from relatively few data points in comparison with grid-based or LHS approaches. The chemical process simulation has shown that our data exploration method provides us with an excellent approximation of the data space topology from a relatively small number of data points in comparison with grid-based or LHS approaches. By tuning the explorative hyperparameters we can intuitively control the exploration behavior. In this way we can concentrate the exploration on parameter regions with a relatively good optimization target. Hereby, we have discovered an almost perfect optimization target which could be used as a very suitable starting point for rigorous optimization strategies to speed up the optimization process. Since the simulated chemical process is fully realistic and industrially relevant we have demonstrated that our method is applicable to a real-world problem.
It is important to emphasize that some industrially relevant applications certainly require more than four design parameters. In such cases both our data exploration method and the previously suggested Kriging-based approach might be challenged by the exploration of a high dimensional parameter space [30]. To circumvent this curse of dimensionality, we propose to couple the data exploration strategy with a dimensionality reduction method [29] in order to restrict the search to a lower dimensional manifold. However, an extended discussion of such an approach is beyond the scope of this manuscript.
Naturally, our method introduces an additional computational overhead in comparison with conventional data exploration strategies like a regular grid or a randomized sampling. Therefore, it is best suited for simulation environments where suggesting the next data point to evaluate takes significantly less time than the actual evaluation. Simulations of chemical processes constitute perfect candidates for this requirement due to their computational complexity and the difficulty of predicting their operation window.
As a stopping criterion for exploration we have used a fixed number of evaluations. This has allowed us to compare our method with grid-based and random sampling approaches. In practice one might instead want to make use of a suitable precision measure for the estimator to stop the evaluation at a sufficient accuracy. In the scope of our benchmark we have suggested to use the fraction of the relative success rate to the score as a possible stopping criterion comparable to a saturated exploration.
We have shown that the utility function strongly influences the outcome of our algorithm. It seems natural to ask in how far this function could be modified or generalized. In the following we will therefore briefly discuss open questions for future research related to this topic.
First of all, from a more general point of view, maximizing the utility function can also be regarded as a MCO problem, where each element of the utility vector corresponds to an objective function. In this sense the complete parameter space represents the feasible set of decision vectors. We use the weight vector to reduce this problem to a single objective problem. However, it could be insightful to treat the data exploration multicriterially. In a similar way it would also be possible to take a non-scalarized MCO target into account (e. g., in our application from section 5 both mass fractions could act as optimization targets). As indicated in the introduction, such a scenario is common in the context of chemical process engineering and therefore of great practical interest.
Furthermore, we always assume a constant weight vector for the entire duration of the exploration. One possible modification of our algorithm would be a dynamic approach in which the coefficients of the weight vector change in each step of the iteration. For example, an exploration which starts with a strong weight of the outcome prediction uncertainty and ends with a strong weight of the optimization target prediction would first focus on the classification boundary and later on the region of interest. Such a dynamic approach could also be combined with a MCO strategy.
In this manuscript we have focused on a binary classification approach to separate valid from invalid simulation outcomes. However, for certain applications it could be beneficial to distinguish different causes of invalidity. One could for example use a ternary classification approach to separate valid, physically unfeasible and numerically divergent outcomes. Depending on the simulation, even more different invalidity classes might be of interest. Our exploration method could be straightforwardly generalized to incorporate such an extension. SVMs could still be used to calculate prediction probabilities for such a multi-class classification problem [31]. Since the task of classification itself is still an active area of research with connections to many other scientific fields (such as quantum mechanics [26] with promising results [2,13]), it can be expected that the quality and performance of classification methods is subject to future improvements.
In real-world applications the outcomes of a simulation may crucially depend on a large number of different simulation parameters (e. g., algorithmic parameters or initialization values) and their mutual interactions. This means that divergent outcomes could turn into feasible outcomes for different simulation parameters and vice versa. However, in a complex simulation environment the reasons for a numerical divergence can become practically untraceable. To take this behavior into account, it would be possible to make use of a stochastic perspective, where numerically divergent outcomes are only considered invalid with a certain probability. This invalidity probability would consequently reflect the ignorance about the influence of the simulation parameters on the outcomes.
Prediction probabilities for the (multi-class) classification of outcomes already provide a statistical framework that can be exploited in this context. It would for example allow to identify numerically divergent data points of high uncertainty, i. e., data points that result from a divergent simulation run but have a comparably low probability of belonging to the class of divergent outcomes. Such data points could then be re-evaluated (and possibly re-labeled if a different outcome occurs) using differently tuned simulation parameters. If the simulation environments allows it, a higher computational effort (e. g., by increasing the numerical precision or the iteration steps of the underlying equation solver) could be used for each revision in the hope that convergence can eventually be achieved. The allocation of suitable sample weights [18] would be an intrinsic way of SVMs to assign a higher certainty to re-calculated data points with the same divergent outcome.
Although the proposed statistical perspective fits rather naturally into our new method of optimized data exploration, we only consider it a conceptional idea that goes beyond the scope of this manuscript. Furthermore, the benefit of re-evaluating divergent data points harshly depends on the specific simulation environment and requires full control over the simulation parameters.
Summarized, we consider the straightforward variability of our algorithm through a modification of the utility function as a conceptional strength, which allows us to study different approaches in a single framework. Therefore, our method can also be seen as a very versatile starting point for further research in the fields of data exploration, feasibility classification and optimization. Moreover, its demonstrated practicability gives way for different kind of applications in the realm of chemical process engineering and beyond.

Acknowledgements
This work was developed in the Fraunhofer Cluster of Excellence "Cognitive Internet Technologies". The authors would like to thank Christian Bauckhage, Jürgen Franke and Marius Kloft for their helpful and constructive comments.
Our numerical examples were realized with the help of scikit-learn [19].

A Kernel methods
As explained in section 2.1.2 we use a kernel SVM and a kernel RR to perform predictions, Eq. (7). Both estimators rely on the kernel method [11]. Specifically, we assume that there exists a mapping from the parameter space χ to a feature space F in such a way that the inner product represents a Gaussian kernel for all parameters x and x in χ, where γ is a hyperparameter of the feature space metric and || · || 2 stands for the 2-norm distance.
It is important to emphasize that the feature space for the SVM is not necessarily the same as for the RR. Therefore, we assume in fact two feature space mappings, namely φ C for the SVM, which maps to the classification feature space F C , and φ R for the RR, which maps to the regression feature space F R . Both mappings are defined in analogy to Eq. (48) and are associated with the Gaussian kernels k C (x, x ) and k R (x, x ), respectively, of the form given by Eq. (50) with hyperparameters γ C and γ R , respectively.

B Classification feature space distance
The feature space distance between two parameters x 1 and x 2 with the common feature mapping φ, Eq. (48), can be written as [25] δφ(x 1 , based on an arbitrary kernel function k. We make use of this expression in section 2.1.3 to define the third component of the utility vector, Eq. (10), as the classification feature space distance between the parameter x and its nearest neighbor x NN from the data set D x , Eq. (6). Here, φ C represents the classification feature space mapping of the kernel SVM estimator described in section A. The nearest neighbor is obtained from solving The choice of a Gaussian kernel k C (x, x ), Eq. (50), allows us to rewrite Eqs. (52) and (53) as Eq. (18). By definition, U r (D, x) reduces the utility the closer the parameter x is located to its nearest neighbor x NN in the classification feature space F C , i. e., the more similar the parameter is to its nearest neighbor in D x . For neighbors with identical features, U r (D, x) becomes 0 and for neighbors with fundamentally different features it can asymptotically approach 1.

C Relative success rate and score
To quantify the quality of the exploration we make use of the relative success rate It contains the volume of correct outcome predictions with the indicator function and the total volume of the parameter space We can also write where the score of a test data set D t represents the fraction of correct outcome predictions performed for all parameters x ∈ D t . The absolute values in Eq. (59) denote a set cardinality, a notation which we will also use in the following sections. The specific test data set from Eq. (58) is required to contain data points d(x), Eq. (4), for all or almost all parameters x from χ. Summarized, the relative success rate R(D) ∈ [0, 1] is a quality measure for an outcome estimator taken across the whole parameter space, whereas the score σ(D t ) ∈ [0, 1] may also be used for a local quality measure depending on the chosen test data set D t . A larger value of R(D) and σ(D t ) indicates an estimator of higher quality.
In practice, the relative success rate can be determined by evaluting the integral in Eq. (55) with the help of a standard Monte Carlo approach [21], which also provides us with an estimated error. Given the collection of evaluated parameters respectively. Summarized, the ratios r fp (D) ∈ [0, 1] and r fn (D) ∈ [0, 1] can be considered as a quality measure for an outcome estimator taken across the whole parameter space. A worst-case reference value for the ratios of false positives and false negatives will be introduced in section E based on evaluating randomized predictions for almost all parameters in χ. Note that an estimator of high quality is indicated by a large relative success rate R(D), Eq. (54), but small ratios of false positives and false negatives; also see Eq. (26).
In complete analogy to section C, we can use a Monte Carlo approach to numerically calculate V fp (S, E, D, χ) and V fn (S, E, D, χ) in Eq. (68) using a collection D MC respectively. Here we have recalled the Monte Carlo expectation value, Eq. (64), and the Monte Carlo error estimate, Eq. (66), to simplify our notation.

E Validity ratio
In general, valid data points are more insightful than invalid data points because they contain information about the optimization target. To quantify the usefulness of a sampling, we therefore introduce the validity ratio α(D) ≡ |{(x, y, t) ∈ D | y = valid}| |{(x, y, t) ∈ D | y = invalid}| .
The higher the fraction α(D) of valid to invalid data points in the data set D, the more useful the corresponding sampling. However, to provide a clear scale for α, some kind of reference value is necessary. Such a reference value can be obtained from the idea of an infinite sample: For a given simulation S, a fully dense sampling in the parameter space χ with infinitely many data points results in the infinitely large data set D ∞ (χ). For this reference data set the validity rate converges to the reference limit lim D→D∞(S,χ) given by the fractions of the total volumes for valid and invalid outcomes in χ, respectively. Here we have introduced the valid volume based on the total parameter space volume V total (χ), Eq. (57). For the two-dimensional toy example from section 3 the reference limit α ∞ (S toy , χ toy ), Eq. (75), can be calculated explicitly using the definitions of the parameter space χ toy and the toy simulation S toy , Eqs. (20) and (21), respectively. It is straightforward to see from Figure 1  is given by Eq. (28). For the chemical process simulation from section 5, an explicit expression of the reference limit of the validity rate can not be obtained since we use a flowsheet simulator to perform calculations. Therefore, we have to fall back to a numerical approximation using a Monte Carlo approach, Eq. (44). We will discuss this method further below in more detail.
Our previous considerations also allow us to define a reference value for the ratio of false positives and false negatives, Eq. (68), from section D in a straightforward way. Suppose a completely randomized estimator, who predicts valid and invalid outcomes with the same chance. For a fully dense sampling D ∞ (χ), the ratios with respect to this randomized estimator converge to the worst-case reference limits respectively. One has and r fn∞ ≡ r fn∞ (S, χ) ≡ V invalid (S, χ) 2[V valid (S, χ) + V invalid (S, χ)] = 1 2(α ∞ + 1) (84b) so that r fp∞ +r fn∞ = 1 2 in general and r fp∞ = r fn∞ = 1 4 for α ∞ = 1, as expected for random outcome predictions.
For the toy example from section 3, these terms can be evaluated using the explicit expression of α ∞ , Eq. (82), which results in Eq. (29). In contrast, we only have an approximation of α ∞ , Eq. (44), for the chemical process simulation from section 5 and can therefore only use the approximations given by Eq. (45). This approach is explained in the following.
Analogously to the considerations from sections C and D, a Monte Carlo approach allows a numerical evaluation of V valid (S, χ) in Eq. (75) when we presume a collection D MC x of N evaluated parameters, Eq. (61), which have been chosen randomly from χ. As a result, we find Eq. (44) with the approximation and its estimated error respectively. Here we have made use of the abbreviations and