1 Introduction

Machine learning is rapidly entering the field of engineering. Data-driven prediction with such methods already outperforms traditional engineering algorithms for multiple properties [1,2,3]. With the transition from a computer-science gimmick to an application in real-world scenarios, the stakes rise significantly. Whether human lives are on the line or an expensive production step is being planned, the confidence in the algorithm needs to be exceptional. An emerging solution is to provide human-understandable explanations for the decisions of machine learning models, which can spark trust, and suspicion where necessary [4, 5].

Recent research has introduced matrix completion methods (MCM) to predict the thermodynamic properties of mixtures, or, in other words, the mixture behavior, from a sparse data set of experimental values [1,2,3, 6, 7]. Among others, these methods allow predicting activity coefficients, which are a measure of the non-ideality of a mixture. In the present work, models for the prediction of activity coefficients of solutes at infinite dilution in solvents at a constant temperature of 298.15 K [1, 7] are taken as a prototype to create an algorithmic pipeline that is transferable to a broader series of use cases. For instance, in the context of process-level production planning, an accurate and trusted machine learning algorithm enables precise, fast and, most importantly, cheap simulations, thereby avoiding costly and time-consuming experiments.

We order the data set in matrix form with solutes as one axis, solvents as the other and the mixture behavior, i.e., the activity coefficients, as cell entries. The assumption is that the resulting matrix is of low rank, i.e., that it can be described by a few factors. The MCM algorithm learns a predefined number of latent features (factors) per row and column that are optimized to reproduce the existing entries through vector products of the factors. Here, four latent features have proven to yield excellent results [1, 7]. We name them u1 to u4, though the numbering does not imply an order: different starting conditions of the algorithm could result in a switch of the numbering. The features are called latent because they are intermediate features in the mixture prediction workflow and are typically not shown in practice. However, we consider them the point of interest of the algorithm, since they contain all information within the learning algorithm about each individual substance; the subsequent processing is a trivial vector multiplication.
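To make the factorization concrete, the following minimal sketch fits latent features to the observed entries of a sparse matrix with plain stochastic gradient descent. The published models use a Bayesian formulation [1, 7]; the optimizer, data, dimensions and hyperparameters below are illustrative assumptions, not the original method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sparse observations: (solute index, solvent index, ln gamma).
n_solutes, n_solvents, n_factors = 240, 250, 4
obs = [(rng.integers(n_solutes), rng.integers(n_solvents), rng.normal())
       for _ in range(2000)]

# One latent feature vector (u1..u4) per solute and per solvent.
U = 0.1 * rng.standard_normal((n_solutes, n_factors))
V = 0.1 * rng.standard_normal((n_solvents, n_factors))

lr, lam = 0.01, 0.05  # learning rate and L2 regularization (assumed values)
for epoch in range(50):
    for i, j, y in obs:
        err = y - U[i] @ V[j]          # reconstruction error on a known entry
        ui = U[i].copy()
        U[i] += lr * (err * V[j] - lam * U[i])
        V[j] += lr * (err * ui - lam * V[j])

# Prediction for an unmeasured pair is the dot product of the latent vectors.
gamma_hat = U[0] @ V[1]
```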

An explanation of the latent features could describe the learned, compressed model of each substance’s mixture behavior and thereby increase trust in the current model where justified, possibly superseding the empirical models [8, 9] that are currently used in practice. Ideally, explanations also open up future models to be substance-data-driven instead of mixture-data-driven. This would alleviate a current drawback of MCM: in its pure form, it cannot extrapolate to substances outside the training set.

We base our explanations of the substances on a comparison with chemical knowledge captured in two additional data sets. First, a chemist has annotated each substance with its most defining chemical class. Second, we gathered a set of readily available physicochemical descriptors, e.g., molar mass, for each substance. The questions we are trying to answer throughout this chapter are:

  • Is there structure in the learned latent space that is sensible to a human, i.e., does it coincide with domain knowledge?

  • Are there correlations with physicochemical descriptors and properties that explain certain latent features, ideally allowing bidirectional reasoning?

Since the latent space is spanned by four dimensions, communicating its information is hard: a direct visualization is impossible. Therefore, we rely on two interactive visual analytics tools [10, 11] that employ dimension reduction techniques to create two-dimensional, and thereby viewable, embeddings.

Throughout this chapter, we provide the following contributions:

  • We provide an analysis of the feature space learned by MCM with two visual analytics tools regarding its relationship to two types of physicochemical knowledge in Sects. 2.3, 3.2 and 3.3.2.

  • We propose an extension of a decision boundary visualization tool towards regression models in Sect. 3.3.1.

2 Rangesets

We will first introduce the challenges and possibilities of interpreting high-dimensional embeddings, present a solution with rangesets proposed in [10], and then provide a rangeset analysis of latent features in matrix completion with regard to domain knowledge.

2.1 Motivation

Reading attribute information out of high-dimensional embeddings is difficult, as the reduction of dimensions aggregates the original data onto typically just two viewing axes. The interpretation of these axes depends on the type of projection. Linear projections like principal component analysis (PCA) [12], as presented in Sect. 3, can still be meaningfully annotated with axes. However, the linearity of the projection can also be a constraint when the original dimensionality is too high, impeding cluster analysis and outlier detection. In these cases, non-linear techniques, which try to untangle the complex coherence of data points, often work better to uncover structures in high-dimensional space. Even though corresponding methods like multidimensional scaling (MDS) [13], t-distributed stochastic neighbor embedding (t-SNE) [14] and uniform manifold approximation and projection (UMAP) [15] are commonly used in computer science and engineering, these techniques share one drawback: they prevent the direct annotation of the original dimension axes in projection space.
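For orientation, all of these techniques are available behind a nearly identical Python interface; a minimal sketch, assuming the learned latent features are stored in a NumPy array `U` (the random data here is a placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE

U = np.random.rand(240, 4)  # placeholder for the latent features u1-u4

emb_pca = PCA(n_components=2).fit_transform(U)    # linear: axes remain interpretable
emb_mds = MDS(n_components=2, random_state=0).fit_transform(U)    # preserves pairwise distances
emb_tsne = TSNE(n_components=2, random_state=0).fit_transform(U)  # preserves local neighborhoods
# UMAP lives in the separate umap-learn package but follows the same pattern:
# import umap; emb_umap = umap.UMAP(random_state=0).fit_transform(U)
```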

Fig. 1. [10] Comparison of different augmentation strategies for an original attribute in a non-linear embedding of an exemplary dataset [16, 17]. (a) Color-coded points require mental grouping for outlier detection. (b + c) Field-based approaches fail to capture regions with diverse values. (d) Rangesets alleviate both problems.

However, the visual retrieval of original data attributes is vital for the interpretation of these otherwise abstract plots. An augmentation of the embedding with color can provide this information. Nonato and Aupetit [18] classify augmentation strategies for non-linear dimension reductions into three main categories: direct enrichment, spatially structured enrichment and cluster-based enrichment. In direct enrichment, the layout is enriched per point [19,20,21,22]. The most common technique, color-coding each point, can be seen in Fig. 1 (a). While simple to implement and understand, these techniques suffer from occlusion and overplotting, making it hard to identify clusters and respective outliers [23]. Spatially structured enrichments encode the embedding space based on a geometrical abstraction. These provide an immediate sense of the attribute value distribution, but resort to averaging, as in the iso-contours of Fig. 1 (b), or fine-grained tessellation, as in the triangulation of Fig. 1 (c). Cluster- or set-based approaches group points based on their visual or data-space proximity and plot abstractions of these groups [24,25,26]. The technique used in this chapter belongs to this third category, while integrating parts of the previous two to increase readability.

2.2 Rangeset Construction

Rangesets [10], shown in Fig. 1 (d), first bin data points with similar attribute values and then draw geometric contours based on visual proximity, yielding a set-based visualization that captures both data-space and visual proximity. Clusters of points with similar attribute values are conveyed through non-convex α-hulls, while outliers are kept as points. This enables users to quickly observe structure and detect outliers.

As this approach first groups in data attribute space and then in embedding space, we outline the algorithm illustrated in Fig. 2 in the following. It is designed to show the distribution of a specific data attribute in an arbitrary (non-linear) embedding. This data attribute does not necessarily need to be considered for the creation of the embedding beforehand.

Fig. 2. [10] Key steps of the rangeset algorithm to compute contours and outliers.

As a set-based visualization, the attribute values to be displayed need to be categorical. Categorical data can be used directly; numerical data needs to be binned. For each bin, the corresponding data points are extracted and a Delaunay triangulation of the filtered points is computed. From this Delaunay triangulation, all triangles that contain an edge longer than a defined threshold ε are removed. The remaining connected triangles form α-hulls that describe connected regions, while the unconnected points are highlighted as dots of increased size.

Both α-hulls and outliers are colored based on their respective bin. When the attribute distribution is visualized as rangesets instead of as a continuous field (cf. Fig. 1 (b), (c)), polygons can overlap, which is accounted for by semi-transparent rendering.
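The core geometric step can be sketched in a few lines with SciPy; the function and variable names below are ours, not those of the NoLiES implementation:

```python
import numpy as np
from scipy.spatial import Delaunay

def rangeset_components(points, eps):
    """Split one bin's 2D points into kept alpha-hull triangles and outliers."""
    points = np.asarray(points)
    if len(points) < 3:
        return np.empty((0, 3), dtype=int), np.arange(len(points))
    tri = Delaunay(points)
    # Keep a triangle only if all three of its edges are shorter than eps.
    keep = []
    for simplex in tri.simplices:
        a, b, c = points[simplex]
        if max(np.linalg.norm(a - b), np.linalg.norm(b - c),
               np.linalg.norm(c - a)) <= eps:
            keep.append(simplex)
    keep = np.array(keep, dtype=int).reshape(-1, 3)
    covered = np.unique(keep)  # vertices covered by at least one kept triangle
    outliers = np.setdiff1d(np.arange(len(points)), covered)
    return keep, outliers
```

Calling this once per bin yields the connected regions (the kept triangles) and the points drawn as enlarged dots.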

The choice of the parameter ε strongly influences the visual appearance of rangesets. The effects of various ε values are shown in Fig. 3. For \(\varepsilon =0\), all points are outliers and drawn as dots, Fig. 3 (a). For small values of ε, tight contours are created with many points considered outliers, Fig. 3 (b). Larger values of ε lead to larger polygons, up to the convex hull of the considered set of points. A default value is proposed in [10] based on Wilkinson [27]:

$$\varepsilon ={q}_{75}+1.5\cdot ({q}_{75}-{q}_{25})$$

with \({q}_{25}\) and \({q}_{75}\) being the 25th and 75th percentiles of the edge lengths in the minimal spanning tree.
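A sketch of this default, assuming the 2D embedding coordinates are given as a NumPy array; the helper name is ours:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def default_epsilon(points):
    """Default contour threshold from the MST edge lengths (cf. Wilkinson [27])."""
    mst = minimum_spanning_tree(squareform(pdist(points)))
    edges = mst.data                       # the n-1 edge lengths of the MST
    q25, q75 = np.percentile(edges, [25, 75])
    return q75 + 1.5 * (q75 - q25)
```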

While the mathematical formulation of rangesets is well-defined, the best parameter choice for interpretation varies with the individual use case. The shape of rangesets depends both on the choice of ε and on the choice of bins for numerical data. While default values have been stated in the previous paragraphs, users can refine bin ranges and shown attributes in an interactive browser tool called NoLiES [10]. NoLiES further provides a comparison of attribute distributions via small multiples [28] and colored histograms. The tool is built in Jupyter Notebook with common plotting libraries [29,30,31]. A demo is available at bndr.it/96wza and the code at github.com/Jan-To/nolies.

2.3 Application to Process-Level

With the technique introduced above, we can relate the latent feature space of solutes learned by MCM to available chemical knowledge. We first check whether the learned structure is sensible at all through a comparison with chemical classes, then analyze the structure of the learned latent space itself, and lastly look for correlations with physicochemical substance descriptors.

2.3.1 Chemical Class as Descriptor of Learned Solute Features

The chemical plausibility of the learned latent space spanned by u1-u4 can be initially reviewed via the distribution of chemical classes. We know from empirically designed models that structural groups are often well-suited for characterizing mixture behavior [8, 9]. Hence, chemical classes that are defined by these structural groups should be a good high-level descriptor to check whether the learned latent distribution matches expectations.

The MDS projection on latent features u1-u4 in Fig. 3 is optimized to preserve high-dimensional distances between points in the 2D environment. Substances with similar latent feature values are generally projected closer to each other than substances with dissimilar values. Consequently, substances with the same chemical class should be close to each other as well and form visible groups. To encode this visually, chemical classes are chosen as the attribute for rangesets.

Chemical class is already a categorical variable and needs no further discretization to define the rangesets, but the filtering parameter ε still needs to be chosen. Varying ε confirms that, first, coloring per point, Fig. 3 (a), is inferior at communicating distribution, clustering and outliers. Second, too high values of ε, Fig. 3 (c), integrate outliers into clusters, leading to inexpressive polygons. Lastly, the default value ε = 0.54, Fig. 3 (b), and values slightly above it, Fig. 3 (d), give a good balance between connected components and outliers for this dataset. Further analysis is performed in this configuration.

Fig. 3. Visual effect of various values for the contour parameter ε. MDS embedding of 240 solutes based on latent features u1-u4, with rangesets colored by chemical class.

Figure 3 (d) shows a striking coherence between chemical classes and similarity in latent features, which constitutes the positioning in the embedding. The sparsely overlapping rangeset polygons for most colors (blues, oranges, browns, light green, light purple) indicate that chemical class can be a distinct descriptor of a solute’s learned latent features. The polygons for aromatics, alkanes and alkenes span a wider space and have minor overlap with other classes; these classes have a common, but not unique, latent feature profile. Each of the rangesets for nitriles, alcohols and aldehydes is clustered yet separated from the rest, indicating that for these solutes a distinct, characteristic latent feature combination is learned. The polygons for esters and ketones overlap, indicating similar learned solute properties. All three observations fit with chemical knowledge.

To analyze the coherence within each chemical class, we look at the outliers, highlighted by bigger dots, with respect to the polygon(s) of the same color. We observe that alcohols, nitriles, amides and alkanes have at most one outlier each, indicating uniform latent features and hence uniform learned solute behavior. On the other hand, aromatics generally share latent features, but aromatics like chrysene or phenol differ significantly, in line with their unique chemical structure. Water and heavy water are isolated as well, again due to their unique chemical structure.

From the analysis above, we can conclude that chemical classes generally coincide with the learned distribution of latent features. The cases where position, and therefore latent feature values, do not align with chemical class can mostly be explained with the chemical knowledge of an expert. For the considered set of solutes, chemical class is therefore a suitable descriptor of MCM features, even though the features were derived purely from the respective mixture behavior.

2.3.2 Latent Feature Distribution

The MDS projection used as the basis for the analysis in this section is a non-linear projection technique. The tradeoff of such projections is that the high-dimensional axes are no longer readable, as there is no direct mapping. We lose the ability to quickly find high/low values, the direction along which values increase, and the occurring value combinations. While the point-based, non-linear definition of the MDS projection forbids a perfect reconstruction of the axes, rangesets provide insight into these lost attributes.

Since the MDS projection is conducted to reduce the latent features u1-u4 to two dimensions, the distribution of the individual u’s could explain the spatial structure of the dimension reduction. In Fig. 4 (a)–(d), the rangeset attributes are set to the individual latent features, discretized into five equidistant bins from very low to very high.
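The equidistant binning itself is a one-liner; a sketch, with `U` again holding the latent features as above:

```python
import numpy as np

attr = U[:, 0]  # placeholder: one latent feature as the rangeset attribute
edges = np.linspace(attr.min(), attr.max(), 6)        # 5 bins need 6 edges
bin_ids = np.clip(np.digitize(attr, edges) - 1, 0, 4)  # 0 = very low .. 4 = very high
```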

For u2 and u3 there are clear directions of increasing values, indicated by black arrows, which give these directions a simple meaning. For u1 and u4, the trends are non-linear and not monotonically increasing. For u1, the values in the orange high bin form a connected patch but are enclosed by, and overlap with, the yellow medium bin. The deduction of the u1 value from the embedding position is therefore ambiguous in this area. The same phenomenon occurs for u4 with the blue very low bin. We further observe a plethora of rangeset outliers in u1 and u4 that do not follow the overall trend, which further hampers the ability to guess u values from the MDS projection.

Fig. 4. 240 solutes in an MDS embedding based on latent features u1-u4. Rangesets are chosen to interpret the distribution of individual latent features (a)–(d) or physicochemical descriptors of each solute (e)–(f) in the dimension reduction. Black arrows are added manually to indicate major value trends where applicable.

Comparing the rangesets of the original dimensions in a small-multiples setting in Fig. 4 also reveals commonly occurring feature combinations. The matching trends of increasing values from top left to bottom right in the rangesets of u3 and u4 imply a positive correlation between the dimensions. In contrast, the trends of u2 and u3 are perpendicular and therefore uncorrelated. Comparing the patches of u1 and u3, we recognize that substances with both very high and very low u1 values have high or very high values in u3.

In essence, the analysis of the projection dimensions with rangesets unveils the lost ‘axes’ of the projection and their interaction, even though both can be too complex to grasp fully.

2.3.3 Physicochemical Descriptors of Learned Solute Features

The analysis in Sect. 2.3.1 showed that while chemical classes work as general descriptors of the learned solute features, they are too coarse-grained to describe the feature combinations precisely. However, any precise correlation between readily available information and MCM features would be essential for enhancing MCM into a data-driven virtual approach. As physicochemical descriptors are available for most substances, we apply rangesets to analyze possible correlations.

Figure 4 (e) and (f) show two simple descriptors, molar mass and polarity, where correlations can be seen. As before, the properties are discretized into five equidistant bins and ε = 1. For molar mass, the distribution of red and blue outlier points at opposite sides, with overlapping regions in the center, hints that extreme molar mass values are characteristic of solute features, but medium values are not. The findings for polarity are even clearer. The big blue polygon in Fig. 4 (f) indicates that nonpolar substances share common solute features. From this region, polarity gradually increases with the change in similarity, analogously to u3, suggesting that polarity is captured rather continuously in the solute features.

Some of the descriptors may be good at describing individual MCM features, or at least be captured in combinations thereof. However, rangesets capture only the trends of the continuous relationship between changes in features and attributes, chemical classes or physicochemical descriptors. Rangesets group data points based on their neighborhood in one specific projection. Visually filling the space between points suggests that we have knowledge of this space; due to projection ambiguity, these neighborhood relationships are not necessarily monotonous or continuous, as seen in the overlaps in Fig. 4. To get further insight into which parameters need to change, and by how much, to achieve a certain value, a more detailed analysis requires a different tool, which we present in the next section.

3 Decision Boundary Visualization

The relationship between a high-dimensional space and a related variable can be modeled as a multivariate function. In this section, we present an interactive tool to explore decision functions with regard to their high-dimensional input spaces and apply it to deepen our analysis of the relationship between latent MCM features and chemical classes. Afterwards, we propose an extension of the tool for regression analysis, with which we then expand on the physicochemical descriptors’ link with MCM features.

3.1 CoFFi

Machine learning approaches span a high-dimensional space in their input or, as with MCM in this chapter, in their latent features. As such, MCM is considered a black-box algorithm, since the relationship between input data and generated latent features is inaccessible. Explaining this relationship can improve trust in properly performing systems [6] and can point out flaws in ill-formed systems [7]. We abstract the black-box model to a decision function y = f(x) that can be probed for any input x to generate an output y.

Visual explanation of black-box decision functions is a pressing research field that branches in various directions. Common approaches fall into two categories. Sample-based approaches find fitting projections of labeled datasets and explain changes based on the contrastive juxtaposition of discrete data points. Rangesets fall into this category, but more specialized approaches exist that focus on individual regions [32,33,34]. The other direction is to compute visual maps by probing the input space densely in a fixed two-dimensional embedding [35]. The literature agrees that the interesting parts of the decision function lie where the output value changes significantly [32, 35]. Since humans internally reason by comparison [36], counterfactual reasoning, i.e., reasoning over what would need to change to achieve a different result, is another preferable approach for explaining decision functions [37]. The visual analytics tool Counterfactual Finder (CoFFi) [11] unites sample-based analysis with visual maps and counterfactual reasoning, and we therefore use it to further analyze the data at hand. While CoFFi was previously only intended for classification problems, we introduce modifications for regression analysis after a tool overview.

The interface features five components, marked in Fig. 5, that are linked and interactively explorable. A data table (A) displays the data set in a familiar fashion. The topology view (B) provides multiple algorithms for non-linear dimension reduction to assess the overall class distribution and data set separability. This view is identical to rangesets with \(\varepsilon =0\), i.e., plain glyphs colored by class affiliation. The partial dependence view (C) shows the expected outcome for changes to individual input parameters of a currently selected reference point, while holding the other parameters fixed. This univariate behavior analysis, typically called partial dependence analysis [38], is displayed as horizon charts [39], where the vertical baseline is the decision boundary. A higher prediction per class is indicated by stacked areas, each indicating a 25% increase in prediction. Areas are colored according to the most probable class, with more confident predictions in richer colors.
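Conceptually, such partial dependence curves can be computed by probing the decision function along one feature at a time. A minimal sketch with our own names, assuming a scikit-learn-style model exposing predict_proba:

```python
import numpy as np

def partial_dependence_1d(model, x_ref, feature, grid):
    """Vary one feature around a reference point, holding all others fixed."""
    X = np.tile(x_ref, (len(grid), 1))
    X[:, feature] = grid
    return model.predict_proba(X)   # one probability curve per class

# Hypothetical usage: probe u4 around a selected reference solute.
# curves = partial_dependence_1d(surrogate, x_ref=U[42], feature=3,
#                                grid=np.linspace(-2, 2, 100))
```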

The embedding view (D) extends the partial dependence analysis to multivariate space. A PCA projection based on a local neighborhood of data points is regularly sampled to produce a visual map of a slice of the output space. Due to the unique properties of PCA, the original high-dimensional axes can be overlaid as a gray biplot [40] capturing feature variance and correlation. Positively correlated axes point in similar directions, negatively correlated axes in opposite directions. The feature changes necessary to reach the white decision boundaries can be read off with regard to the axes. Since PCA is linear, feature values increase linearly in the direction of each axis, but do not change orthogonally to the axis. Decision boundaries orthogonal to an axis rely on the respective feature value crossing a threshold, which is shown on mouse-over. Finally, axes with little importance or of little interest can be fixed, and thereby excluded from the multivariate analysis, in the feature selection (E). For more details on functionality and theoretical background, we refer the reader to the original publication [11]. A demo is available at bndr.it/cqk5w and the open-source code at github.com/Jan-To/COFFI.
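The visual map behind view (D) boils down to sampling the local PCA plane and projecting each sample back into feature space before probing the model. A sketch under the same assumptions as above (names are ours, not CoFFi’s):

```python
import numpy as np
from sklearn.decomposition import PCA

def decision_map(model, X_local, resolution=100):
    """Probe the model on a regularly sampled local PCA plane."""
    pca = PCA(n_components=2).fit(X_local)
    Z = pca.transform(X_local)
    xs = np.linspace(Z[:, 0].min(), Z[:, 0].max(), resolution)
    ys = np.linspace(Z[:, 1].min(), Z[:, 1].max(), resolution)
    grid = np.array([(x, y) for y in ys for x in xs])
    # Map each 2D sample back to full feature space, then probe the model.
    labels = model.predict(pca.inverse_transform(grid))
    # Render e.g. with imshow; the biplot axes are the rows of pca.components_.
    return labels.reshape(resolution, resolution)
```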

3.2 Chemical Classes in Latent Feature Space

An analysis of the class distribution in CoFFi may provide more hints on how each class is encoded in the MCM’s latent feature values. The latent features u1-u4 are the input dimensions that define the feature space, and the chemical class is the output dimension that defines the color. We hope to read off which changes are necessary to flip between classes.

However, two preprocessing steps are necessary. We filter the data set to the eight most frequent classes, since saturation was previously used to expand the colormap, but is overloaded here as a probability indicator. Further, MCM is by design only defined on the training samples and can therefore not be probed at intermediate samples. As this is a disadvantage of MCM we would like to overcome in the future, we instead train a surrogate model to predict the chemical class from the latent features. While a surrogate model creates a continuous decision function, the exact values are only interpolated by the model and need to be interpreted with care. Nonetheless, surrogate models are an established explanation approach [41], and the general decision areas are expressive enough to deduce explanations. Our experiments showed that a three-layer fully-connected neural network is able to capture all decision boundaries while keeping them simple.
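A minimal sketch of such a surrogate, assuming latent features in `U` and the filtered class labels in `classes`; only the depth of three hidden layers follows the text, the layer width and other hyperparameters are our assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# U: (n_solutes, 4) latent features; classes: chemical class per solute,
# already filtered to the eight most frequent classes.
X_tr, X_te, y_tr, y_te = train_test_split(U, classes, stratify=classes,
                                          random_state=0)
surrogate = MLPClassifier(hidden_layer_sizes=(32, 32, 32), max_iter=2000,
                          random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", surrogate.score(X_te, y_te))
```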

Fig. 5. CoFFi interface with chemical classes for solutes relative to the latent feature space of MCM. (A) Data table (B) Non-linear projection (C) Partial dependence (D) Sampled linear projection (E) Importance distribution (F) Manually added linear projection of decision boundaries in the neighborhood of cyclohexene. Alkanes (cyan) differ from alkenes (blue) at u4 ≈ 0.

We first gain an overview by centering the PCA at the data set mean in Fig. 5. Comparing the MDS embedding in (B) with the one in Fig. 3 (d) reveals that the classes alcohols (pink), alkanes (cyan), alkenes (purple) and aromatics (orange) are even more distinct than without the filtering to eight classes. The model importance (E) is highest for u3 and u4 and lowest for u1, which therefore seems to be less discriminative, yet still relevant, regarding chemical class. In Fig. 5 we find that u2 > 1 is characteristic of alcohols. We deduce that alcoholic mixture behavior is encoded in high u2 values by MCM. In both the linear (D) and non-linear projection (B), alkanes and alkenes are neighbors, as expected from their similar molecular structure. The deciding feature between the two seems to be a threshold of u4, since the boundary is orthogonal to that axis. We confirm this assumption by updating the embedding view to a representative of the alkenes, cyclohexene, and its neighbors in Fig. 5 (F), finding a threshold of u4 ≈ 0.

Fig. 6. Chemical classes in latent feature space, focused on 2,2,2-Trifluoroethanol surrounded by similar substances. A gray cross marks the counterfactual probe on the decision boundary to the alcohols. In the usual univariate analysis with partial dependence, the same change, marked by dark gray lines, is not visible as a counterfactual.

We pick an outlier in the MDS plot to check whether its exceptional latent feature values align with chemical expectations. 2,2,2-Trifluoroethanol is a highly reactive halogenated compound with an isolated position in the MDS plot. In the focused plot of Fig. 6 (A), the neighboring classes are aromatics (orange) and alcohols (pink). The closeness to the alcohols is sensible, since 2,2,2-Trifluoroethanol, as the name suggests, contains a hydroxyl group and can therefore be considered an alcohol as well. By probing the decision boundary to the alcohols (gray cross), we learn that the alcohols differ by slightly higher values in all features, something that was not visible in the univariate and non-linear analyses. Another interesting finding is that u1 and u3, as well as u2 and u4, are highly correlated when restricting to this local neighborhood, contrary to the global view in Fig. 5 (D), where alkanes and alkenes were not excluded.

3.3 Latent Features in Physicochemical Descriptor Space

A reproducible link between latent MCM features and physicochemical descriptors would significantly improve our capability to build machine learning algorithms – possibly up to the point of extrapolation. The previous rangeset analysis fell short in uncovering usable relationships. As a next step, we relate individual latent features in CoFFi to a set of readily available physicochemical descriptors of the solutes. The descriptors – dipole moment, polarizability, anisotropy, normed anisotropy, H-bond acceptance, H-bond donation, HOMO-LUMO gap, ionization energy, electron affinity and molar mass – are described in Table 1.

Table 1. Overview of the considered physicochemical descriptors.

We again learn a surrogate model, this time to predict the latent features from the set of physicochemical descriptors. We deliberately decided against direct u-to-descriptor scatterplots, which would avoid the surrogate model but could miss higher-dimensional effects. We use the same model architecture as in the previous section, as it proved to provide a good balance between simplicity and accuracy. CoFFi has previously only been used for classification problems, but predicting a continuous latent feature value is a regression problem. In the following section, we propose an adaptation of the existing workflow to handle regression problems in CoFFi.

3.3.1 CoFFi Adaptation for Regression

The fundamental basis of counterfactual reasoning is the definition of the decision boundary. With categorical outcome variables, this boundary is trivially defined as the change in the most probable class. For continuous outcome variables, this is no longer given. We therefore segment the continuous range into intervals that can be described together. Thereby, the regression can be handled like a classification, with decision boundaries at the segment crossings.

In practice, the transformation requires some precautions. Contrary to the five equidistant bins used for rangesets, the segment edges should always be designed according to the modality of the distribution. Since the edges are even more pronounced here, they need to be chosen and communicated explicitly. A histogram in the bottom left of Fig. 7 displays both the distribution and the currently chosen segment edges. In this case, we chose the segments low, median and high, since the distribution is unimodal. The physicochemical descriptors then serve as the input for a surrogate model that predicts whether one specific u lies within a certain segment.
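A sketch of this adaptation, assuming one latent feature column from `U` and a descriptor matrix `X_desc`; the percentile-based edges are one plausible choice for a unimodal distribution, not a fixed rule of the tool:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

u = U[:, 0]                          # one continuous latent feature (placeholder)
edges = np.percentile(u, [25, 75])   # segment edges chosen from the distribution
segments = np.digitize(u, edges)     # 0 = low, 1 = median, 2 = high

# The regression is now a classification with boundaries at the segment edges.
seg_model = MLPClassifier(hidden_layer_sizes=(32, 32, 32), max_iter=2000,
                          random_state=0).fit(X_desc, segments)
```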

3.3.2 Latent Feature Analysis

In this section, we analyze the individual latent features u1-u3 by relating them to the physicochemical descriptors in the regression variant of CoFFi, and draw conclusions on their relevance for model explanation. The analysis for u4 is omitted due to space constraints.

u1. Figure 7 shows the CoFFi interface for u1, containing all solutes. The distribution of u1 values is unimodal with a peak at 0. Therefore, u1 has little to no influence on the mixture computation in MCM for most solutes. We deduce that u1 does not encode a common mixture behavior, but is rather specialized to encode a few rather exotic solutes. In the MDS and PCA projections, we observe that the solutes with low u1 values (blue) are spread along the outside of the plots, hence holding unusual physicochemical descriptor combinations. Selecting the blue dots in the MDS updates the data table and reveals that these are aromatics and esters with high organic solubility. Selecting high u1 values (red) reveals that their characteristics are not united in the physicochemical descriptors, but that these values are rather taken by phenol, chloroform and their variants. The model importances are rather evenly distributed between 5% and 14%, which is why we cannot deduce any dominant relationship between a particular physicochemical descriptor and the u1 value. We conclude that u1 encodes mixture behavior which is not captured in the current set of physicochemical descriptors.

Fig. 7. Default CoFFi interface for the physicochemical descriptors’ influence on u1. u1 is unimodally distributed with most values close to zero. Solutes with low u1 values (blue) hold uncommon physicochemical descriptor values (peripheral distribution in the embeddings), while phenols and chloroform hold high values (red) and blend in with the average points (yellow).

Fig. 8. CoFFi interface with physical properties related to u2. After selection of a balanced group of green, yellow and red data points, the other components update accordingly. Further filtering to influential property axes updates the embedding to cover only the variance in these properties, centered at the selection mean.

u2. The distribution of u2 is bimodal, as shown in the bottom left of Fig. 8. We therefore segment into negative (blue), close-to-zero (green), medium (orange) and high (red) values. The UMAP projection shows clusters of distinct physicochemical descriptor combinations in line with the segmentation. High u2 values are reached by alcohols, while medium values encode ketones. We select an evenly class-distributed subgroup of substances with the lasso tool in UMAP to contrast the high and medium solutes with the close-to-zero solutes. The partial dependence view and model importance show that the H-bond characteristics are most important, with H-bond acceptance above 0.018 changing from high u2 to medium u2, and H-bond donation below 0.005 changing to the main close-to-zero group (horizontal arrows). We filter the embedding view to the influential physicochemical descriptors only, to extend our uni-dimensional analysis to two-dimensional dependencies. The decision boundaries orthogonal to the respective axes reveal that the H-bond characteristics dominate over changes of similar magnitude in dipole moment. Wondering about the curved boundary in the bottom right, we hover over the embedding and realize that it starts where the H-bond donation is already zero, but the HOMO-LUMO gap is still decreasing. We conclude that the HOMO-LUMO gap is therefore only influential on u2 for solutes that are not H-bond donors, e.g., ketones.

Fig. 9. CoFFi interface relating physical attributes and bins of u3 (A). Selection of a string cluster (B) reveals that solutes with small u3 are alkanes of different lengths (C). The model importance analysis (D) shows polarizability and/or molar mass to be the explaining descriptors. Their axes (E) coincide, signaling high positive correlation; probing (gray cross in E) reveals the split point between bins (gray lines in F).

u3. The distribution of u3, shown in Fig. 9 (A), differs from the previous ones in that all values are negative. Most values lie between −1.5 and 0 (red), with segments of medium (yellow) and significantly negative (blue) values. The MDS and the data table in Fig. 9 (B + C) reveal a string cluster of medium-size alkanes (yellow) transitioning into long-chained alkanes (blue). We concentrate our analysis, via selection, on just this cluster to retrieve the responsible physicochemical descriptors. Polarizability (39%) and molar mass (21%) are the most influential descriptors in our surrogate model (Fig. 9 (D)). This explanation aligns with chemical knowledge, since polarizability and molar mass increase with chain length in alkanes. We notice the strong but sensible correlation in the embedding – the polarizability and molar mass axes point in exactly the same direction in Fig. 9 (E) – and probe the embedding (gray cross) to read in Fig. 9 (F) that above a polarizability of 135 \(a_0^3\) and a molar mass of 154 g/mol (arrows), the solutes have u3 < −2.6 (blue). However, we need to note that this is merely an explanation and not a physical dependency. Other hidden properties may be the actual influential factors, but the surrogate model identified polarizability and molar mass as specific descriptors to distinguish alkanes from each other and from the other substances, which is sensible and can be used for the development of data-driven MCM algorithms.

4 Conclusion and Future Work

Matrix completion methods have proven to be more accurate than current state-of-the-art solutions for the prediction of thermodynamic properties of mixtures. In this chapter, we analyzed the latent feature space of such a matrix completion model with regard to chemical knowledge. With two interactive visual analytics tools, we were able to provide explanations for the learned solute features. We found that chemical classes coincide with the structure of the learned feature space wherever the chemical class is defining for a substance’s solute behavior, and that chemical similarity is captured by the neighborhood relation in latent space. Alcohols and other substances with hydroxyl groups were particularly exceptional in their learned characteristics. Finally, some latent features were clearly explained by physicochemical descriptors, while others only revealed trends. The insight gained in this chapter serves as a first step towards a fully data-driven mixture prediction, potentially reducing costs while increasing accuracy in process planning.

The current work is limited to analyzing one attribute or latent feature at a time. Simultaneously correlating physicochemical descriptors to multiple latent features could provide a global understanding of which descriptors are reflected in which part of the learned space. Additionally, parts of the current analysis rely on a simple surrogate model. An analysis of the predictability of solute features from physicochemical descriptors with a more sophisticated model is left to future work.