1 Introduction

Machine learning is rapidly entering the field of engineering. Data-driven prediction with such methods already outperforms traditional engineering algorithms for multiple properties [1,2,3]. With the transition from a computer-science gimmick to an application in real-world scenarios, the stakes rise significantly. Whether human lives are on the line or an expensive production step is being planned, the confidence in the algorithm needs to be exceptional. An emerging solution is to provide human-understandable explanations for the decisions of machine learning models, which can spark trust, and suspicion where necessary [4, 5].

Recent research has introduced matrix completion methods (MCM) to predict the thermodynamic properties of mixtures, or, in other words, the mixture behavior, from a sparse data set of experimental values [1,2,3, 6, 7]. Among others, these methods allow predicting activity coefficients, which are a measure of the non-ideality of a mixture. In the present work, models for the prediction of activity coefficients of solutes at infinite dilution in solvents at a constant temperature of 298.15 K [1, 7] are taken as a prototype to create an algorithmic pipeline that is transferable to a broader series of use cases. For instance, in the context of process-level production planning, an accurate and trusted machine learning algorithm enables precise, fast and, most importantly, cheap simulations, thereby avoiding costly and time-consuming experiments.

We order the data set in matrix form with solutes as one axis, solvents as the other and the mixture behavior, i.e., the activity coefficients, as cell entries. The assumption is that the resulting matrix is of low rank, i.e., that it can be described by a few factors. The MCM algorithm learns a predefined number of latent features (factors) per row and column that are optimized to reproduce the existing entries through vector products of the factors. Here, four latent features have proven to yield excellent results [1, 7]. We name them u1 to u4, though the numbering does not imply an order: different starting conditions of the algorithm could result in a switch of the numbering. The features are called latent because they are intermediate features in the mixture prediction workflow and are typically not shown in practice. However, we consider them the point of interest of the algorithm, since they contain all information within the learning algorithm about each individual substance; the subsequent processing is a trivial vector multiplication.
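To make the factorization concrete, the following minimal sketch fits latent features to the observed entries of a sparse matrix with plain stochastic gradient descent. The published models use a Bayesian formulation [1, 7]; the optimizer, data, dimensions and hyperparameters below are illustrative assumptions, not the original method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sparse observations: (solute index, solvent index, ln gamma).
n_solutes, n_solvents, n_factors = 240, 250, 4
obs = [(rng.integers(n_solutes), rng.integers(n_solvents), rng.normal())
       for _ in range(2000)]

# One latent feature vector (u1..u4) per solute and per solvent.
U = 0.1 * rng.standard_normal((n_solutes, n_factors))
V = 0.1 * rng.standard_normal((n_solvents, n_factors))

lr, lam = 0.01, 0.05  # learning rate and L2 regularization (assumed values)
for epoch in range(50):
    for i, j, y in obs:
        err = y - U[i] @ V[j]          # reconstruction error on a known entry
        ui = U[i].copy()
        U[i] += lr * (err * V[j] - lam * U[i])
        V[j] += lr * (err * ui - lam * V[j])

# Prediction for an unmeasured pair is the dot product of the latent vectors.
gamma_hat = U[0] @ V[1]
```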

An explanation of the latent features could describe the learned, compressed model of each substance’s mixture behavior and thereby increase trust in the current model where justified, possibly superseding the empirical models [8, 9] that are currently used in practice. Ideally, explanations also open up future models to be substance-data-driven instead of mixture-data-driven. This would alleviate a current drawback of MCM: in its pure form, it cannot extrapolate to substances outside the training set.

We base our explanations of the substances on a comparison with chemical knowledge captured in two additional data sets. First, a chemist has annotated each substance with its most defining chemical class. Second, we gathered a set of readily available physicochemical descriptors, e.g., molar mass, for each substance. The questions we are trying to answer throughout this chapter are:

  • Is there structure in the learned latent space that is sensible to a human, i.e., does it coincide with domain knowledge?

  • Are there correlations with physicochemical descriptors and properties that explain certain latent features, ideally allowing bidirectional reasoning?

Since the latent space is spanned by four dimensions, communicating its information is hard: a direct visualization is impossible. Therefore, we rely on two interactive visual analytics tools [10, 11] that employ dimension reduction techniques to create two-dimensional, and thereby viewable, embeddings.

Throughout this chapter, we provide the following contributions:

  • We provide an analysis of the feature space learned by MCM with two visual analytics tools regarding its relationship to two types of physicochemical knowledge in Sects. 2.3, 3.2 and 3.3.2.

  • We propose an extension of a decision boundary visualization tool towards regression models in Sect. 3.3.1.

2 Rangesets

We will first introduce the challenges and possibilities of interpreting high-dimensional embeddings, present a solution with rangesets proposed in [10], and then provide a rangeset analysis of latent features in matrix completion with regard to domain knowledge.

2.1 Motivation

Reading attribute information out of high-dimensional embeddings is difficult, as the reduction of dimensions aggregates the original data onto typically just two viewing axes. The interpretation of these axes depends on the type of projection. Linear projections like principal component analysis (PCA) [12], as presented in Sect. 3, can still be meaningfully annotated with axes. However, the linearity of the projection can also be a constraint when the original dimensionality is too high, impeding cluster analysis and outlier detection. In these cases, non-linear techniques, which try to untangle the complex coherence of data points, often work better to uncover structures in high-dimensional space. Even though corresponding methods like multidimensional scaling (MDS) [13], t-distributed stochastic neighbor embedding (t-SNE) [14] and uniform manifold approximation and projection (UMAP) [15] are commonly used in computer science and engineering, these techniques share one drawback: they prevent the direct annotation of the original dimension axes in projection space.
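For orientation, all of these techniques are available behind a nearly identical Python interface; a minimal sketch, assuming the learned latent features are stored in a NumPy array `U` (the random data here is a placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE

U = np.random.rand(240, 4)  # placeholder for the latent features u1-u4

emb_pca = PCA(n_components=2).fit_transform(U)    # linear: axes remain interpretable
emb_mds = MDS(n_components=2, random_state=0).fit_transform(U)    # preserves pairwise distances
emb_tsne = TSNE(n_components=2, random_state=0).fit_transform(U)  # preserves local neighborhoods
# UMAP lives in the separate umap-learn package but follows the same pattern:
# import umap; emb_umap = umap.UMAP(random_state=0).fit_transform(U)
```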

Fig. 1. [10] Comparison of different augmentation strategies for an original attribute in a non-linear embedding of an exemplary dataset [16, 17]. (a) Color-coded points require mental grouping for outlier detection. (b + c) Field-based approaches fail to capture regions with diverse values. (d) Rangesets alleviate both problems.

However, the visual retrieval of original data attributes is vital for the interpretation of these otherwise abstract plots. An augmentation of the embedding with color can provide this information. Nonato and Aupetit [18] classify augmentation strategies for non-linear dimension reductions into three main categories: direct enrichment, spatially structured enrichment and cluster-based enrichment. In direct enrichment, the layout is enriched per point [19,20,21,22]. The most common technique, color-coding each point, can be seen in Fig. 1 (a). While simple to implement and understand, these techniques suffer from occlusion and overplotting, making it hard to identify clusters and respective outliers [23]. Spatially structured enrichments encode the embedding space based on a geometrical abstraction. These provide an immediate sense of the attribute value distribution, but resort to averaging, as in the iso-contours of Fig. 1 (b), or fine-grained tessellation, as in the triangulation of Fig. 1 (c). Cluster- or set-based approaches group points based on their visual or data-space proximity and plot abstractions of these groups [24,25,26]. The technique used in this chapter belongs to this third category, while integrating parts of the previous two to increase readability.

2.2 Rangeset Construction

Rangesets [10], shown in Fig. 1 (d), first bin data points with similar attribute values and then draw geometric contours based on visual proximity, yielding a set-based visualization that captures both data-space and visual proximity. Clusters of points with similar attribute values are conveyed through non-convex α-hulls, while outliers are kept as points. This enables users to quickly observe structure and detect outliers.

As this approach first groups in data attribute space and then in embedding space, we outline the algorithm illustrated in Fig. 2 in the following. It is designed to show the distribution of a specific data attribute in an arbitrary (non-linear) embedding. This data attribute does not necessarily need to be considered for the creation of the embedding beforehand.

Fig. 2. [10] Key steps of the rangeset algorithm to compute contours and outliers.

As a set-based visualization, the attribute values to be displayed need to be categorical. Categorical data can be used directly; numerical data needs to be binned. For each bin, the corresponding data points are extracted and a Delaunay triangulation of the filtered points is computed. From this Delaunay triangulation, all triangles that contain an edge longer than a defined threshold ε are removed. The remaining connected triangles form α-hulls that describe connected regions, while the unconnected points are highlighted as dots of increased size.

Both α-hulls and outliers are colored based on their respective bin. When the attribute distribution is visualized as rangesets instead of as a continuous field (cf. Fig. 1 (b), (c)), polygons can overlap, which is accounted for by semi-transparent rendering.
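The core geometric step can be sketched in a few lines with SciPy; the function and variable names below are ours, not those of the NoLiES implementation:

```python
import numpy as np
from scipy.spatial import Delaunay

def rangeset_components(points, eps):
    """Split one bin's 2D points into kept alpha-hull triangles and outliers."""
    points = np.asarray(points)
    if len(points) < 3:
        return np.empty((0, 3), dtype=int), np.arange(len(points))
    tri = Delaunay(points)
    # Keep a triangle only if all three of its edges are shorter than eps.
    keep = []
    for simplex in tri.simplices:
        a, b, c = points[simplex]
        if max(np.linalg.norm(a - b), np.linalg.norm(b - c),
               np.linalg.norm(c - a)) <= eps:
            keep.append(simplex)
    keep = np.array(keep, dtype=int).reshape(-1, 3)
    covered = np.unique(keep)  # vertices covered by at least one kept triangle
    outliers = np.setdiff1d(np.arange(len(points)), covered)
    return keep, outliers
```

Calling this once per bin yields the connected regions (the kept triangles) and the points drawn as enlarged dots.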

The choice of the parameter ε strongly influences the visual appearance of rangesets. The effects of various ε values are shown in Fig. 3. For \(\varepsilon =0\), all points are outliers and drawn as dots, Fig. 3 (a). For small values of ε, tight contours are created with many points considered outliers, Fig. 3 (b). Larger values of ε lead to larger polygons, up to the convex hull of the considered set of points. A default value is proposed in [10] based on Wilkinson [27]:

$$\varepsilon ={q}_{75}+1.5\cdot ({q}_{75}-{q}_{25})$$

with \({q}_{25}\) and \({q}_{75}\) being the 25th and 75th percentiles of the edge lengths in the minimal spanning tree.
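A sketch of this default, assuming the 2D embedding coordinates are given as a NumPy array; the helper name is ours:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def default_epsilon(points):
    """Default contour threshold from the MST edge lengths (cf. Wilkinson [27])."""
    mst = minimum_spanning_tree(squareform(pdist(points)))
    edges = mst.data                       # the n-1 edge lengths of the MST
    q25, q75 = np.percentile(edges, [25, 75])
    return q75 + 1.5 * (q75 - q25)
```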

While the mathematical formulation of rangesets is well-defined, the best parameter choice for interpretation varies with the individual use case. The shape of rangesets depends both on the choice of ε and on the choice of bins for numerical data. While default values have been stated in the previous paragraphs, users can refine bin ranges and shown attributes in an interactive browser tool called NoLiES [10]. NoLiES further provides a comparison of attribute distributions via small multiples [28] and colored histograms. The tool is built in Jupyter Notebook with common plotting libraries [29,30,31]. A demo is available at bndr.it/96wza and the code at github.com/Jan-To/nolies.

2.3 Application to Process-Level

With the technique introduced above, we can relate the latent feature space of solutes learned by MCM to available chemical knowledge. We first check whether the learned structure is sensible at all through a comparison with chemical classes, then analyze the structure of the learned latent space itself, and lastly look for correlations with physicochemical substance descriptors.

2.3.1 Chemical Class as Descriptor of Learned Solute Features

The chemical plausibility of the learned latent space spanned by u1-u4 can be initially reviewed via the distribution of chemical classes. We know from empirically designed models that structural groups are often well-suited for characterizing mixture behavior [8, 9]. Hence, chemical classes that are defined by these structural groups should be a good high-level descriptor to check whether the learned latent distribution matches expectations.

The MDS projection on latent features u1-u4 in Fig. 3 is optimized to preserve high-dimensional distances between points in the 2D environment. Substances with similar latent feature values are generally projected closer to each other than substances with dissimilar values. Consequently, substances with the same chemical class should be close to each other as well and form visible groups. To encode this visually, chemical classes are chosen as the attribute for rangesets.

Chemical class is already a categorical variable and needs no further discretization to define the rangesets, but the filtering parameter ε still needs to be chosen. Varying ε confirms that, first, coloring per point, Fig. 3 (a), is inferior at communicating distribution, clustering and outliers. Second, too high values of ε, Fig. 3 (c), integrate outliers into clusters, leading to inexpressive polygons. Lastly, the default value ε = 0.54, Fig. 3 (b), and values slightly above it, Fig. 3 (d), give a good balance between connected components and outliers for this dataset. Further analysis is performed in this configuration.

Fig. 3. Visual effect of various values for the contour parameter ε. MDS embedding of 240 solutes based on latent features u1-u4, with rangesets colored by chemical class.

Figure 3 (d) shows a striking coherence between chemical classes and similarity in latent features, which constitutes the positioning in the embedding. The sparsely overlapping rangeset polygons for most colors (blues, oranges, browns, light green, light purple) indicate that chemical class can be a distinct descriptor of a solute’s learned latent features. The polygons for aromatics, alkanes and alkenes span a wider space and have minor overlap with other classes; these classes have a common, but not unique, latent feature profile. Each of the rangesets for nitriles, alcohols and aldehydes is clustered yet separated from the rest, indicating that for these solutes a distinct, characteristic latent feature combination is learned. The polygons for esters and ketones overlap, indicating similar learned solute properties. All three observations fit with chemical knowledge.

To analyze the coherence within each chemical class, we look at the outliers, highlighted by bigger dots, with respect to the polygon(s) of the same color. We observe that alcohols, nitriles, amides and alkanes have at most one outlier each, indicating uniform latent features and hence uniform learned solute behavior. On the other hand, aromatics generally share latent features, but aromatics like chrysene or phenol differ significantly, in line with their unique chemical structure. Water and heavy water are isolated as well, again due to their unique chemical structure.

From the analysis above, we can conclude that chemical classes generally coincide with the learned distribution of latent features. The cases where position, and therefore latent feature values, do not align with chemical class can mostly be explained with the chemical knowledge of an expert. For the considered set of solutes, chemical class is therefore a suitable descriptor of MCM features, even though the features were derived purely from the respective mixture behavior.

2.3.2 Latent Feature Distribution

The MDS projection used as the basis for the analysis in this section is a non-linear projection technique. The tradeoff of such projections is that the high-dimensional axes are no longer readable, as there is no direct mapping. We lose the ability to quickly find high/low values, the direction along which values increase, and the occurring value combinations. While the point-based, non-linear definition of the MDS projection forbids a perfect reconstruction of the axes, rangesets provide insight into these lost attributes.

Since the MDS projection is conducted to reduce the latent features u1-u4 to two dimensions, the distribution of the individual u’s could explain the spatial structure of the dimension reduction. In Fig. 4 (a)–(d), the rangeset attributes are set to the individual latent features, discretized into five equidistant bins from very low to very high.
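The equidistant binning itself is a one-liner; a sketch, with `U` again holding the latent features as above:

```python
import numpy as np

attr = U[:, 0]  # placeholder: one latent feature as the rangeset attribute
edges = np.linspace(attr.min(), attr.max(), 6)        # 5 bins need 6 edges
bin_ids = np.clip(np.digitize(attr, edges) - 1, 0, 4)  # 0 = very low .. 4 = very high
```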

For u2 and u3 there are clear directions of increasing values, indicated by black arrows, which give these directions a simple meaning. For u1 and u4, the trends are non-linear and not monotonically increasing. For u1, the values in the orange high bin form a connected patch but are enclosed by, and overlap with, the yellow medium bin. The deduction of the u1 value from the embedding position is therefore ambiguous in this area. The same phenomenon occurs for u4 with the blue very low bin. We further observe a plethora of rangeset outliers in u1 and u4 that do not follow the overall trend, which further hampers the ability to guess u values from the MDS projection.

Fig. 4. 240 solutes in an MDS embedding based on latent features u1-u4. Rangesets are chosen to interpret the distribution of individual latent features (a)–(d) or physicochemical descriptors of each solute (e)–(f) in the dimension reduction. Black arrows are added manually to indicate major value trends where applicable.

Comparing the rangesets of the original dimensions in a small-multiples setting in Fig. 4 also reveals commonly occurring feature combinations. The matching trends of increasing values from top left to bottom right in the rangesets of u3 and u4 imply a positive correlation between the dimensions. In contrast, the trends of u2 and u3 are perpendicular and therefore uncorrelated. Comparing the patches of u1 and u3, we recognize that substances with both very high and very low u1 values have high or very high values in u3.

In essence, the analysis of the projection dimensions with rangesets unveils the lost ‘axes’ of the projection and their interaction, even though both can be too complex to grasp fully.

2.3.3 Physicochemical Descriptors of Learned Solute Features

The analysis in Sect. 2.3.1 showed that while chemical classes work as general descriptors of the learned solute features, they are too coarse-grained to describe the feature combinations precisely. However, any precise correlation between readily available information and MCM features would be essential for enhancing MCM into a data-driven virtual approach. As physicochemical descriptors are available for most substances, we apply rangesets to analyze possible correlations.

Figure 4 (e) and (f) show two simple descriptors, molar mass and polarity, where correlations can be seen. As before, the properties are discretized into five equidistant bins and ε = 1. For molar mass, the distribution of red and blue outlier points at opposite sides, with overlapping regions in the center, hints that extreme molar mass values are characteristic of solute features, but medium values are not. The findings for polarity are even clearer. The big blue polygon in Fig. 4 (f) indicates that nonpolar substances share common solute features. From this region, polarity gradually increases with the change in similarity, analogously to u3, suggesting that polarity is captured rather continuously in the solute features.

Some of the descriptors may be good at describing individual MCM features, or at least be captured in combinations thereof. However, rangesets capture only the trends of the continuous relationship between changes in features and attributes, chemical classes or physicochemical descriptors. Rangesets group data points based on their neighborhood in one specific projection. Visually filling the space between points suggests that we have knowledge of this space; due to projection ambiguity, these neighborhood relationships are not necessarily monotonous or continuous, as seen in the overlaps in Fig. 4. To get further insight into which parameters need to change, and by how much, to achieve a certain value, a more detailed analysis requires a different tool, which we present in the next section.

3 Decision Boundary Visualization

The relationship between a high-dimensional space and a related variable can be modeled as a multivariate function. In this section, we present an interactive tool to explore decision functions with regard to their high-dimensional input spaces and apply it to deepen our analysis of the relationship between latent MCM features and chemical classes. Afterwards, we propose an extension of the tool for regression analysis, with which we then expand on the physicochemical descriptors’ link with MCM features.

3.1 CoFFi

Machine learning approaches span a high-dimensional space in their input or, as with MCM in this chapter, in their latent features. As such, MCM is considered a black-box algorithm, since the relationship between input data and generated latent features is inaccessible. Explaining this relationship can improve trust in properly performing systems [6] and can point out flaws in ill-formed systems [7]. We abstract the black-box model to a decision function y = f(x) that can be probed for any input x to generate an output y.

Visual explanation of black-box decision functions is a pressing research field that branches in various directions. Common approaches fall into two categories. Sample-based approaches find fitting projections of labeled datasets and explain changes based on the contrastive juxtaposition of discrete data points. Rangesets fall into this category, but more specialized approaches exist that focus on individual regions [32,33,34]. The other direction is to compute visual maps by probing the input space densely in a fixed two-dimensional embedding [35]. The literature agrees that the interesting parts of the decision function lie where the output value changes significantly [32, 35]. Since humans internally reason by comparison [36], counterfactual reasoning, i.e., reasoning over what would need to change to achieve a different result, is another preferable approach for explaining decision functions [37]. The visual analytics tool Counterfactual Finder (CoFFi) [11] unites sample-based analysis with visual maps and counterfactual reasoning, and we therefore use it to further analyze the data at hand. While CoFFi was previously only intended for classification problems, we introduce modifications for regression analysis after a tool overview.

The interface features five components, marked in Fig. 5, that are linked and interactively explorable. A data table (A) displays the data set in a familiar fashion. The topology view (B) provides multiple algorithms for non-linear dimension reduction to assess the overall class distribution and data set separability. This view is identical to rangesets with \(\varepsilon =0\), i.e., plain glyphs colored by class affiliation. The partial dependence view (C) shows the expected outcome for changes to individual input parameters of a currently selected reference point, while holding the other parameters fixed. This univariate behavior analysis, typically called partial dependence analysis [38], is displayed as horizon charts [39], where the vertical baseline is the decision boundary. A higher prediction per class is indicated by stacked areas, each indicating a 25% increase in prediction. Areas are colored according to the most probable class, with more confident predictions in richer colors.
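Conceptually, such partial dependence curves can be computed by probing the decision function along one feature at a time. A minimal sketch with our own names, assuming a scikit-learn-style model exposing predict_proba:

```python
import numpy as np

def partial_dependence_1d(model, x_ref, feature, grid):
    """Vary one feature around a reference point, holding all others fixed."""
    X = np.tile(x_ref, (len(grid), 1))
    X[:, feature] = grid
    return model.predict_proba(X)   # one probability curve per class

# Hypothetical usage: probe u4 around a selected reference solute.
# curves = partial_dependence_1d(surrogate, x_ref=U[42], feature=3,
#                                grid=np.linspace(-2, 2, 100))
```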

The embedding view (D) extends the partial dependence analysis to multivariate space. A PCA projection based on a local neighborhood of data points is regularly sampled to produce a visual map of a slice of the output space. Due to the unique properties of PCA, the original high-dimensional axes can be overlaid as a gray biplot [40] capturing feature variance and correlation. Positively correlated axes point in similar directions, negatively correlated axes in opposite directions. The feature changes necessary to reach the white decision boundaries can be read off with regard to the axes. Since PCA is linear, feature values increase linearly in the direction of each axis, but do not change orthogonally to the axis. Decision boundaries orthogonal to an axis rely on the respective feature value crossing a threshold, which is shown on mouse-over. Finally, axes with little importance or of little interest can be fixed, and thereby excluded from the multivariate analysis, in the feature selection (E). For more details on functionality and theoretical background, we refer the reader to the original publication [11]. A demo is available at bndr.it/cqk5w and the open-source code at github.com/Jan-To/COFFI.
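The visual map behind view (D) boils down to sampling the local PCA plane and projecting each sample back into feature space before probing the model. A sketch under the same assumptions as above (names are ours, not CoFFi’s):

```python
import numpy as np
from sklearn.decomposition import PCA

def decision_map(model, X_local, resolution=100):
    """Probe the model on a regularly sampled local PCA plane."""
    pca = PCA(n_components=2).fit(X_local)
    Z = pca.transform(X_local)
    xs = np.linspace(Z[:, 0].min(), Z[:, 0].max(), resolution)
    ys = np.linspace(Z[:, 1].min(), Z[:, 1].max(), resolution)
    grid = np.array([(x, y) for y in ys for x in xs])
    # Map each 2D sample back to full feature space, then probe the model.
    labels = model.predict(pca.inverse_transform(grid))
    # Render e.g. with imshow; the biplot axes are the rows of pca.components_.
    return labels.reshape(resolution, resolution)
```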

3.2 Chemical Classes in Latent Feature Space

An analysis of the class distribution in CoFFi may provide more hints on how each class is encoded in the MCM’s latent feature values. The latent features u1-u4 are the input dimensions that define the feature space, and the chemical class is the output dimension that defines the color. We hope to read off which changes are necessary to flip between classes.

However, two preprocessing steps are necessary. We filter the data set to the eight most frequent classes, since saturation was previously used to expand the colormap, but is overloaded here as a probability indicator. Further, MCM is by design only defined on the training samples and can therefore not be probed at intermediate samples. As this is a disadvantage of MCM we would like to overcome in the future, we instead train a surrogate model to predict the chemical class from the latent features. While a surrogate model creates a continuous decision function, the exact values are only interpolated by the model and need to be interpreted with care. Nonetheless, surrogate models are an established explanation approach [41], and the general decision areas are expressive enough to deduce explanations. Our experiments showed that a three-layer fully-connected neural network is able to capture all decision boundaries while keeping them simple.
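A minimal sketch of such a surrogate, assuming latent features in `U` and the filtered class labels in `classes`; only the depth of three hidden layers follows the text, the layer width and other hyperparameters are our assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# U: (n_solutes, 4) latent features; classes: chemical class per solute,
# already filtered to the eight most frequent classes.
X_tr, X_te, y_tr, y_te = train_test_split(U, classes, stratify=classes,
                                          random_state=0)
surrogate = MLPClassifier(hidden_layer_sizes=(32, 32, 32), max_iter=2000,
                          random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", surrogate.score(X_te, y_te))
```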

Fig. 5. CoFFi interface with chemical classes for solutes relative to the latent feature space of MCM. (A) Data table (B) Non-linear projection (C) Partial dependence (D) Sampled linear projection (E) Importance distribution (F) Manually added linear projection of decision boundaries in the neighborhood of cyclohexene. Alkanes (cyan) differ from alkenes (blue) at u4 ≈ 0.

We first gain an overview by centering the PCA at the data set mean in Fig. 5. Comparing the MDS embedding in (B) with the one in Fig. 3 (d) reveals that the classes alcohols (pink), alkanes (cyan), alkenes (purple) and aromatics (orange) are even more distinct than without the filtering to eight classes. The model importance (E) is highest for u3 and u4 and lowest for u1, which therefore seems to be less discriminative, yet still relevant, regarding chemical class. In Fig. 5 we find that u2 > 1 is characteristic of alcohols. We deduce that alcoholic mixture behavior is encoded in high u2 values by MCM. In both the linear (D) and non-linear projection (B), alkanes and alkenes are neighbors, as expected from their similar molecular structure. The deciding feature between the two seems to be a threshold of u4, since the boundary is orthogonal to that axis. We confirm this assumption by updating the embedding view to a representative of the alkenes, cyclohexene, and its neighbors in Fig. 5 (F), finding a threshold of u4 ≈ 0.

Fig. 6. Chemical classes in latent feature space, focused on 2,2,2-Trifluoroethanol surrounded by similar substances. A gray cross marks the counterfactual probe on the decision boundary to the alcohols. In the usual univariate analysis with partial dependence, the same change, marked by dark gray lines, is not visible as a counterfactual.

We pick an outlier in the MDS plot to check whether its exceptional latent feature values align with chemical expectations. 2,2,2-Trifluoroethanol is a highly reactive halogenated compound with an isolated position in the MDS plot. In the focused plot of Fig. 6 (A), the neighboring classes are aromatics (orange) and alcohols (pink). The closeness to the alcohols is sensible, since 2,2,2-Trifluoroethanol, as the name suggests, contains a hydroxyl group and can therefore be considered an alcohol as well. By probing the decision boundary to the alcohols (gray cross), we learn that the alcohols differ by slightly higher values in all features, something that was not visible in the univariate and non-linear analyses. Another interesting finding is that u1 and u3, as well as u2 and u4, are highly correlated when restricting to this local neighborhood, contrary to the global view in Fig. 5 (D), where alkanes and alkenes were not excluded.

3.3 Latent Features in Physicochemical Descriptor Space

A reproducible link between latent MCM features and physicochemical descriptors would significantly improve our capability to build machine learning algorithms – possibly up to the point of extrapolation. The previous rangeset analysis fell short in uncovering usable relationships. As a next step, we relate individual latent features in CoFFi to a set of readily available physicochemical descriptors of the solutes. The descriptors – dipole moment, polarizability, anisotropy, normed anisotropy, H-bond acceptance, H-bond donation, HOMO-LUMO gap, ionization energy, electron affinity and molar mass – are described in Table 1.

Table 1. Overview of the considered physicochemical descriptors.

We again learn a surrogate model, this time to predict the latent features from the set of physicochemical descriptors. We deliberately decided against direct u-to-descriptor scatterplots, which would avoid the surrogate model but could miss higher-dimensional effects. We use the same model architecture as in the previous section, as it proved to provide a good balance between simplicity and accuracy. CoFFi has previously only been used for classification problems, but predicting a continuous latent feature value is a regression problem. In the following section, we propose an adaptation of the existing workflow to handle regression problems in CoFFi.

3.3.1 CoFFi Adaptation for Regression

The fundamental basis of counterfactual reasoning is the definition of the decision boundary. With categorical outcome variables, this boundary is trivially defined as the change in the most probable class. For continuous outcome variables, this is no longer given. We therefore segment the continuous range into intervals that can be described together. Thereby, the regression can be handled like a classification, with decision boundaries at the segment crossings.

In practice, the transformation requires some precautions. Contrary to the five equidistant bins used for rangesets, the segment edges should always be designed according to the modality of the distribution. Since the edges are even more pronounced here, they need to be chosen and communicated explicitly. A histogram in the bottom left of Fig. 7 displays both the distribution and the currently chosen segment edges. In this case, we chose the segments low, median and high, since the distribution is unimodal. The physicochemical descriptors then serve as the input for a surrogate model that predicts whether one specific u lies within a certain segment.
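A sketch of this adaptation, assuming one latent feature column from `U` and a descriptor matrix `X_desc`; the percentile-based edges are one plausible choice for a unimodal distribution, not a fixed rule of the tool:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

u = U[:, 0]                          # one continuous latent feature (placeholder)
edges = np.percentile(u, [25, 75])   # segment edges chosen from the distribution
segments = np.digitize(u, edges)     # 0 = low, 1 = median, 2 = high

# The regression is now a classification with boundaries at the segment edges.
seg_model = MLPClassifier(hidden_layer_sizes=(32, 32, 32), max_iter=2000,
                          random_state=0).fit(X_desc, segments)
```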

3.3.2 Latent Feature Analysis

In this section, we analyze the individual latent features u1-u3 by relating them to the physicochemical descriptors in the regression variant of CoFFi, and draw conclusions on their relevance for model explanation. The analysis for u4 is omitted due to space constraints.

u1. Figure 7 shows the CoFFi interface for u1, containing all solutes. The distribution of u1 values is unimodal with a peak at 0. Therefore, u1 has little to no influence on the mixture computation in MCM for most solutes. We deduce that u1 does not encode a common mixture behavior, but is rather specialized to encode a few rather exotic solutes. In the MDS and PCA projections, we observe that the solutes with low u1 values (blue) are spread along the outside of the plots, hence holding unusual physicochemical descriptor combinations. Selecting the blue dots in the MDS updates the data table and reveals that these are aromatics and esters with high organic solubility. Selecting high u1 values (red) reveals that their characteristics are not united in the physicochemical descriptors, but that these values are rather taken by phenol, chloroform and their variants. The model importances are rather evenly distributed between 5% and 14%, which is why we cannot deduce any dominant relationship between a particular physicochemical descriptor and the u1 value. We conclude that u1 encodes mixture behavior which is not captured in the current set of physicochemical descriptors.

Fig. 7. Default CoFFi interface for the physicochemical descriptors’ influence on u1. u1 is unimodally distributed with most values close to zero. Solutes with low u1 values (blue) hold uncommon physicochemical descriptor values (peripheral distribution in the embeddings), while phenols and chloroform hold high values (red) and blend in with the average points (yellow).

Fig. 8. CoFFi interface with physical properties related to u2. After selection of a balanced group of green, yellow and red data points, the other components update accordingly. Further filtering to influential property axes updates the embedding to cover only the variance in these properties, centered at the selection mean.

u2. The distribution of u2 is bimodal, as shown in the bottom left of Fig. 8. We therefore segment into negative (blue), close-to-zero (green), medium (orange) and high (red) values. The UMAP projection shows clusters of distinct physicochemical descriptor combinations in line with the segmentation. High u2 values are reached by alcohols, while medium values encode ketones. We select an evenly class-distributed subgroup of substances with the lasso tool in UMAP to contrast the high and medium solutes with the close-to-zero solutes. The partial dependence view and model importance show that the H-bond characteristics are most important, with H-bond acceptance above 0.018 changing from high u2 to medium u2, and H-bond donation below 0.005 changing to the main close-to-zero group (horizontal arrows). We filter the embedding view to the influential physicochemical descriptors only, to extend our uni-dimensional analysis to two-dimensional dependencies. The decision boundaries orthogonal to the respective axes reveal that the H-bond characteristics dominate over changes of similar magnitude in dipole moment. Wondering about the curved boundary in the bottom right, we hover over the embedding and realize that it starts where the H-bond donation is already zero, but the HOMO-LUMO gap is still decreasing. We conclude that the HOMO-LUMO gap is therefore only influential on u2 for solutes that are not H-bond donors, e.g., ketones.

Fig. 9. CoFFi interface relating physical attributes and bins of u3 (A). Selection of a string cluster (B) reveals that solutes with small u3 are alkanes of different lengths (C). The model importance analysis (D) shows polarizability and/or molar mass to be the explaining descriptors. Their axes (E) coincide, signaling high positive correlation; probing (gray cross in E) reveals the split point between bins (gray lines in F).

u3. The distribution of u3, shown in Fig. 9 (A), differs from the previous ones in that all values are negative. Most values lie between −1.5 and 0 (red), with segments of medium (yellow) and significantly negative (blue) values. The MDS and the data table in Fig. 9 (B + C) reveal a string cluster of medium-size alkanes (yellow) transitioning into long-chained alkanes (blue). We concentrate our analysis, via selection, on just this cluster to retrieve the responsible physicochemical descriptors. Polarizability (39%) and molar mass (21%) are the most influential descriptors in our surrogate model (Fig. 9 (D)). This explanation aligns with chemical knowledge, since polarizability and molar mass increase with chain length in alkanes. We notice the strong but sensible correlation in the embedding – the polarizability and molar mass axes point in exactly the same direction in Fig. 9 (E) – and probe the embedding (gray cross) to read in Fig. 9 (F) that above a polarizability of 135 \(a_0^3\) and a molar mass of 154 g/mol (arrows), the solutes have u3 < −2.6 (blue). However, we need to note that this is merely an explanation and not a physical dependency. Other hidden properties may be the actual influential factors, but the surrogate model identified polarizability and molar mass as specific descriptors to distinguish alkanes from each other and from the other substances, which is sensible and can be used for the development of data-driven MCM algorithms.

4 Conclusion and Future Work

Matrix completion methods have proven to be more accurate than current state-of-the-art solutions for the prediction of thermodynamic properties of mixtures. In this chapter, we analyzed the latent feature space of such a matrix completion model with regard to chemical knowledge. With two interactive visual analytics tools, we were able to provide explanations for the learned solute features. We found that chemical classes coincide with the structure of the learned feature space wherever the chemical class is defining for a substance’s solute behavior, and that chemical similarity is captured by the neighborhood relation in latent space. Alcohols and other substances with hydroxyl groups were particularly exceptional in their learned characteristics. Finally, some latent features were clearly explained by physicochemical descriptors, while others only revealed trends. The insight gained in this chapter serves as a first step towards a fully data-driven mixture prediction, potentially reducing costs while increasing accuracy in process planning.

The current work is limited to analyzing one attribute or latent feature at a time. Simultaneously correlating physicochemical descriptors to multiple latent features could provide a global understanding of which descriptors are reflected in which part of the learned space. Additionally, parts of the current analysis rely on a simple surrogate model. An analysis of the predictability of solute features from physicochemical descriptors with a more sophisticated model is left to future work.