Fuzzy Divisive Hierarchical Clustering of Solvents According to Their Experimentally and Theoretically Predicted Descriptors

Abstract: The present study describes a simple procedure for separating a large group of solvents, 259 in total, into patterns of similarity, with each solvent represented by 15 specific descriptors (experimentally found and theoretically predicted physicochemical parameters). Solvent data are usually characterized by high variability, different molecular symmetry, and spatial orientation. Chemometric methods can be used to extract and explore accurately the information contained in such data. To this end, advanced fuzzy divisive hierarchical-clustering methods were applied in the present study to a large group of solvents using specific descriptors. The fuzzy divisive hierarchical associative-clustering algorithm provides not only a fuzzy partition of the solvents investigated, but also a fuzzy partition of the descriptors considered. In this way, it is possible to identify the descriptors most specific (in terms of highest, lowest, or intermediate values) to each fuzzy partition (group) of solvents. Additionally, the partitioning performed could be interpreted with respect to molecular symmetry. The chemometric approach used for this goal is the fuzzy c-means method, a semi-supervised clustering procedure. The advantage of such a clustering process is that it separates the solvents into similarity patterns with a certain degree of membership of each solvent in a certain pattern, while also allowing for possible membership of the same object (solvent) in another cluster. Partitioning based on a hybrid approach that combines theoretical molecular descriptors with experimentally obtained ones permits a more straightforward separation into groups of similarity and an acceptable interpretation. It was shown that an important link is achieved between the similarity groups of objects and the similarity groups of variables.
Ten classes of solvents are interpreted depending on their specific descriptors; one of the classes includes a single object and could be interpreted as an outlier. Setting the results of this research into a broader perspective, it has been shown that the fuzzy clustering approach provides a useful tool for partitioning by the variables related to the main physicochemical properties of the solvents. It becomes possible to offer a simple guide for solvent recognition based on theoretically calculated or experimentally found descriptors related to the physicochemical properties of the solvents.


Introduction
The large number of different solvents used in many important chemical processes and technologies needs special attention, since their properties depend on a range of specific chemical and physical parameters such as melting and boiling point, water solubility, polarity, vapor pressure, density, viscosity, toxicity, and many others.
Solvents can be classified by one of four basic criteria: solvent power (solubility, polarity, acidity/basicity properties/parameters), evaporation rate/boiling point, chemical structure, and hazard classification. The last of these identifies both physical hazards (e.g., flash point, flammability, or reactivity) and toxicity. Partitioning based on chemical structure uses three groups: hydrocarbons, oxygenated solvents, and chlorinated solvents [1][2][3].
Parker [1] divides solvents into protic, aprotic, and inert classes according to the dipolarity of the solvent molecules and their ability to act as hydrogen bond donors. One disadvantage of a classification scheme such as this is that the groups are not sharply delimited.
Partitioning of solvents based on physicochemical properties has proved to be a significant and challenging problem [2][3][4][5][6]. Of special interest is a recent study [7] in which a new solvent similarity index is introduced to aid in finding the most suitable solvent for a specific purpose. The solvent similarity index was calculated for 261 pure solvents at 298 K, and the solvents were classified according to their solvation properties. Pushkarova et al. [8] used Taft-Kamlet-Abboud polarity functions as empirical characteristics of solvent-solute interactions to determine the solvatochromic polarity. The practice of solvatochromic probing is growing rapidly, but classification of media based on these values can be difficult. That paper focuses on artificial neural networks (ANN) for the classification of solvents on the basis of their solvatochromic characteristics. The influence of data variation on the stability of classification was also studied.
In the study of Gramatica et al. [9] a neural network approach was used for solvent separation. In general, many other chemometric methods, such as regression analysis, factor analysis, or partial least squares regression, have contributed to proper solvent selection for practical needs [3][4][5].
Bradley et al. [10] used the Abraham general solvation model to predict the solvent coefficients for all organic solvents. The models were used to propose sustainable solvent replacements for commonly used solvents.
Recent efforts are concentrated on the application of chemometric strategies as suitable tools for classification of solvents (as objects of the analysis) characterized by many properly selected variables (chemical, structural, and physicochemical descriptors) [11][12][13][14][15]. The majority of the methodologies, such as cluster analysis, principal components and factor analysis, artificial neural networks, partial least squares regression, and discriminant analysis, are well developed and widely used for classification, interpretation, and modeling purposes. A limited number of applications are related to fuzzy analysis [16,17]. Fuzzy clustering and partitioning also find application in solvent characterization [18]. Fuzzy clustering analysis offers unique opportunities for decomposition of a large data set into a fixed number of similarity groups or clusters. Classical cluster analysis (hierarchical or non-hierarchical) could achieve similar results, but the strong advantage of the fuzzy partitioning strategy is that it does not assign a certain object (or variable) to a single group of similarity but calculates a membership function for each object. Thus, a single object could be attributed to more than one cluster. This makes the interpretation more flexible, since the specific distribution of objects into clusters can be considered together with the respective degrees of membership. It reduces the ambiguity in interpretation that arises from the often unavoidable overlapping of clusters.
The major goal of the present study is to achieve a reliable partitioning of a large number of solvents with broad practical use by application of fuzzy partitioning methodology.
In this study, the fuzzy divisive hierarchical clustering and the powerful fuzzy divisive hierarchical associative-clustering method, which offer an excellent possibility to associate each fuzzy partition of samples to a fuzzy set of characteristics (descriptors), were successfully applied for the characterization of 259 solvents, according to their 15 specific descriptors (experimentally found and theoretically predicted). What is quite new is the partitioning of solvents and their association with different descriptors with high, moderate, and low values. The obtained results clearly demonstrated the efficiency and information power of the advanced fuzzy clustering method in solvents characterization and clustering.

Fuzzy Clustering Methods
The application of fuzzy logic for various scientific and technical goals has been commented on for decades [19]. This approach differs from the classical hard clustering where each object of the data set finds its own cluster. Thus, an object either belongs to a defined cluster or is out of it. The application of Fuzzy theory to the problem of finding similarity between objects of interest leads to the conclusion that a particular object can belong simultaneously to more than one cluster, but with different degrees of membership (DOMs) between 0 and 1 [20,21]. In one of the possible approaches to so-called fuzzy c-means clustering (FCM), each cluster is replaced by a cluster prototype [22,23] with a respective center, which contains information about the size and the shape of the cluster. The degrees of membership are computed from the distances of the data point to the cluster centers. These distances are responsible for the value of DOM and determine the cluster properties and shape (point, line, etc.) [24].
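As an illustration of how degrees of membership follow from the distances to the cluster centers, the sketch below applies the standard fuzzy c-means membership formula to a single point. It is a minimal example and not the software used in this study; the function name `fcm_memberships`, the fuzzifier value m = 2, and the two example centers are assumptions made only for illustration.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Degrees of membership of one point in each cluster (standard FCM formula).

    m is the fuzzifier (m > 1); m = 2 is an assumed, commonly used value.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    inverse = [d ** -power for d in dists]
    total = sum(inverse)
    # Memberships are normalized so that they sum to 1 over all clusters.
    return [v / total for v in inverse]

# Hypothetical two-cluster example in a two-descriptor space.
centers = [(0.0, 0.0), (4.0, 0.0)]
doms = fcm_memberships((1.0, 0.0), centers)
print(doms)
```

The point is closer to the first center, so it receives the larger DOM there while still keeping a nonzero membership in the second cluster.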
There are different algorithms in fuzzy clustering applications, the most used being the binary divisive algorithm and the generalized fuzzy c-means algorithm (GFCM). The fuzzy methods briefly described above and the corresponding software were clearly described and efficiently applied in previous papers [25][26][27][28][29][30].
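The idea of a binary divisive scheme can be sketched as follows: a group is split into two fuzzy clusters, each object is assigned to the cluster in which its DOM is larger, and the procedure is repeated on each half. This is a simplified sketch under stated assumptions, not the algorithm of refs. [25][26][27][28][29][30]: the two-cluster step is plain fuzzy c-means, the stopping rule (`max_leaf`) is hypothetical, and the data are invented. Only the labeling pattern (A1, A2, A21, A22, ...) follows the notation used in this paper.

```python
import math

def two_cluster_fcm(points, m=2.0, iters=50):
    """Plain fuzzy c-means with c = 2; returns one [u1, u2] pair per point."""
    # Assumed seeding: the first point and the point farthest from it.
    first = points[0]
    centers = [first, max(points, key=lambda p: math.dist(p, first))]
    power = 2.0 / (m - 1.0)
    dim = len(first)
    memberships = []
    for _ in range(iters):
        memberships = []
        for p in points:
            # Clamp distances to avoid division by zero on a center.
            d = [max(math.dist(p, c), 1e-12) for c in centers]
            inv = [x ** -power for x in d]
            total = sum(inv)
            memberships.append([v / total for v in inv])
        # Each center becomes the membership-weighted mean of all points.
        centers = []
        for k in range(2):
            w = [u[k] ** m for u in memberships]
            centers.append(tuple(
                sum(wi * p[j] for wi, p in zip(w, points)) / sum(w)
                for j in range(dim)))
    return memberships

def divisive(names, points, label="A", max_leaf=2):
    """Recursively halve a group; children of 'A' are 'A1' and 'A2', and so on."""
    if len(names) <= max_leaf:  # hypothetical stopping rule
        return {label: names}
    u = two_cluster_fcm(points)
    left = [i for i, (u1, u2) in enumerate(u) if u1 >= u2]
    right = [i for i in range(len(names)) if i not in left]
    if not left or not right:
        return {label: names}
    leaves = {}
    for suffix, idx in (("1", left), ("2", right)):
        leaves.update(divisive([names[i] for i in idx],
                               [points[i] for i in idx],
                               label + suffix, max_leaf))
    return leaves

# Invented one-descriptor data for six hypothetical objects.
names = ["s1", "s2", "s3", "s4", "s5", "s6"]
points = [(0.0,), (0.2,), (10.0,), (10.2,), (13.0,), (13.2,)]
leaves = divisive(names, points)
print(leaves)
```

In a full fuzzy treatment the DOMs of every object in every partition would be retained rather than reduced to a hard assignment; the hard split is used here only to keep the sketch short.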

Data Set
The dataset consists of 269 solvents. Each solvent was described by 15 variables (molecular descriptors and experimentally obtained properties) shown below in Table 1. In the present study, the following set of subprograms implemented in EPI Suite™ version 4.10 was used: MPBPWIN™, WATERNT™, HENRYWIN™, KOAWIN™, KOWWIN™, and BCFBAF™.
The MPBPWIN™ module in EPI Suite™ was applied to predict the melting point (MP), boiling point (BP), and vapor pressure (VP). MPBPWIN™ estimates the melting point by two methods: (1) the Joback method (a group contribution method); (2) the Gold and Ogle method, MP = 0.5839 * BP (in K). The boiling point is estimated by an adaptation of the Stein and Brown (1994) method, which is also a group contribution method. Vapor pressure is estimated by three methods: (1) the Antoine method, (2) the modified Grain method, and (3) the Mackay method. WATERNT™ estimates water solubility directly using a "fragment constant" method similar to that used in the KOWWIN™ program.
The Henry's law constant is estimated by the subprogram HENRYWIN™, which calculates it (as the air/water partition coefficient) using both the group contribution and the bond contribution methods.
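As a small worked example of the second melting point method, the Gold and Ogle relation quoted above can be written as a one-line function. This sketch covers only that single relation and does not reproduce how MPBPWIN™ combines its estimates; the function name and the input value are hypothetical.

```python
def melting_point_gold_ogle(boiling_point_k: float) -> float:
    """Gold and Ogle estimate: MP = 0.5839 * BP, both in kelvin."""
    return 0.5839 * boiling_point_k

# Hypothetical input: a boiling point of 373.15 K.
bp_k = 373.15
mp_k = melting_point_gold_ogle(bp_k)
print(f"estimated MP = {mp_k:.2f} K")
```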
The KOAWIN™ program estimates the logarithm of the octanol-air partition coefficient (KOA) of an organic compound from the compound's octanol-water partition coefficient (KOW) and Henry's law constant (H). For KOAWIN™, only a chemical structure is needed for the estimation of KOA; structures are entered as SMILES codes (Simplified Molecular Input Line Entry System). KOA can be predicted from KOW and H by the following equation:
$$K_{OA} = \frac{K_{OW}\,RT}{H}$$
where R is the ideal gas constant and T is the absolute temperature. KOA and KOW are unitless values. H/RT is the unitless Henry's law constant, also known as the air-water partition coefficient (KAW).
Therefore, the equation to estimate KOA is:
$$K_{OA} = \frac{K_{OW}}{K_{AW}}$$
The KOWWIN™ program predicts the octanol-water partition coefficient. The basis of prediction in KOWWIN™ is a "fragment constant" methodology, in which the starting structure is divided into fragments that are then evaluated.
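The relation between KOA, KOW, and the Henry's law constant described above can be illustrated numerically. The sketch below is not the KOAWIN™ implementation; it only evaluates log KOA = log KOW - log(H/RT). The function name, the choice of units for H (Pa·m³/mol, with R = 8.314 Pa·m³/(mol·K)), the default temperature, and the input values are assumptions.

```python
import math

R = 8.314  # ideal gas constant in Pa·m^3/(mol·K); assumes H is given in Pa·m^3/mol

def log_koa(log_kow: float, henry_pa_m3_mol: float, temp_k: float = 298.15) -> float:
    """log KOA from log KOW and Henry's law constant H via KAW = H / (R * T)."""
    kaw = henry_pa_m3_mol / (R * temp_k)  # unitless air-water partition coefficient
    return log_kow - math.log10(kaw)

# Hypothetical compound: log KOW = 2.0 and H = 50 Pa·m^3/mol.
value = log_koa(2.0, 50.0)
print(f"log KOA = {value:.2f}")
```

If H is supplied in other units (for example atm·m³/mol), R must be expressed in the matching units so that H/RT stays unitless.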
The comparison with the available experimental data shows a high level of correlation. In such a way, missing data in the large data set could be replaced.

Fuzzy Divisive Hierarchical Clustering of Descriptors
The fuzzy clustering of the variables (15 in total) aims to check the following:

• If the experimental values of the respective variables conform with the calculated ones (i.e., if they fall within the same fuzzy cluster with a high membership function);
• If the partitioning procedure can determine stable groups of similarity between the variables with high DOM;
• The procedure is also important for revealing information about possible descriptors for classification of the solvents of interest.
In the supplemental information section (Supplement T1) the fuzzy partitioning results for the 15 variables are presented. In total, 28 groups are considered. The summary of the final partitioning is shown below:
A1 - only HLc is included (a typical outlier)
A2 - MPe MPc BPe BPc Dens WSe WSc VPe VPc HLe logKOWe logKOWc logKOAc logBCF (the rest of the variables show a high level of similarity with a distinct difference from HLc).
In the next steps of fuzzy partitioning respective groups of similarity based on DOM will be sought.

A21 - MPe MPc BPe BPc Dens VPe VPc HLe logKOWe logKOWc logKOAc logBCF
A22 - WSe WSc
In this partitioning stage, the experimentally found and theoretically calculated values of water solubility are extracted as a group of similarity different from the rest of variables in subgroup A21.

A2122122 - HLe logKOWe logKOWc logKOAc
A21221221 - HLe logKOWe logKOWc
The fuzzy partitioning carried out for 15 variables characterizing a set of solvents revealed the following fuzzy linkage of the variables:

• Very good agreement between experimentally determined and theoretically calculated values of the variables characterizing the solvents; this means that if experimental values for some solvents are missing, calculated substitutes could be successfully used for classification and interpretation goals;
• HLc was defined as a typical outlier;
• The group of variables characterizing the distribution between different media (important for the determination of toxicity properties) is very compact;
• The parameters characterizing physicochemical properties (MP, BP, WS, and VP) indicate various types of similarity with the other parameters: water solubility is the most distant from the rest of the parameters, followed by BP and MP; density is closest to BP; logBCF is slightly different as compared to the rest of the "toxicity esteems."
Additional material could be found in Supplement

Fuzzy Divisive Hierarchical Clustering of Solvents
To compare the partitions, and the similarity and differences of the investigated solvents, we have to analyze both the characteristics of the prototypes corresponding to the partitions hierarchy obtained by applying fuzzy divisive hierarchical clustering and DOMs of solvents corresponding to all fuzzy partitions. The results presented in Table 2 clearly illustrate the most specific characteristics of each fuzzy partition and their similarity and differences.
The initial two clusters A1 and A2 indicate that one typical outlier is present in the list of solvents, perfluorooctane, whose properties are completely different from those of the other 268 solvents. The further divisive fuzzy clustering indicates the level of the membership function of each solvent in each of the subsequent groups (22 in total).
Next, Table 2 shows the final fuzzy partitioning with the prototypes of the partitions, ranked solvents for each group and the range of DOM.

Fuzzy Divisive Hierarchical Associative-Clustering of Solvents and Descriptors
To compare the partitions, and the similarity and differences of solvents, we have to analyze the DOMs corresponding to all fuzzy partitions for both the samples and the characteristics (descriptors). The results obtained by applying the fuzzy divisive hierarchical associative-clustering method to the descriptor data are presented in Table 3. By carefully analyzing the fuzzy partitions at each level (partition history/hierarchy) in parallel with the descriptor data considered, the following remarks may be made. The fuzzy partitioning of the solvents with indication of the descriptors related to each fuzzy partition (cluster) is depicted in Table 3.
Table 3. The fuzzy partitioning of the solvents and variables (descriptors).

[Table 3 body not recovered; column headings: Solvents, Variables, DOM Solvents, DOM Variables.]
The final fuzzy partitioning of the objects (solvents) was performed using 10 variables (only the experimentally found ones) (Table 4).
[Table 4 fragment: Phenetole, Diisobutyl adipate, Geranyl acetate, Menthanyl acetate, Trichloroethylene, Pentyl acetate, 1-Octanol.]

Class 10 (HLc): Outlier Perfluorooctane 20
The solvents underlined above do not strictly belong to a logical formation of similarity classes and seem odd rather than reasonable as members of the respective class (polar, non-polar, or volatile solvents determined by specific variables). A careful check of the position of these 12 solvents in the fuzzy partitioning groups indicates that all of them have a quite low maximal value of DOM as determined by fuzzy analysis (this value is shown next to the name of the solvent).
The few exceptions found (only 9 out of 259 solvents), namely (diethyl carbonate, benzene, nitrobenzene); (o-, m-, p-xylene); and (carbon tetrachloride, bromobenzene, trichloroethylene), result from their low maximal DOM, so their position in a single group of similarity is not stable, and they could be considered either as members of the group with low probability or as members of a different class.
Table 5 presents the summarized results for the classes obtained. The table could be used as a practical guide for selecting the type of solvent based on physicochemical properties.
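The check described above, finding objects whose largest DOM is low and whose class assignment is therefore unstable, can be sketched in a few lines. The membership values, object names, and the threshold of 0.5 below are invented for illustration and are not taken from Tables 3 to 5.

```python
def unstable_members(memberships, threshold=0.5):
    """Return objects whose maximal DOM is below the (assumed) threshold.

    memberships maps object name -> {class label: DOM}.
    """
    flagged = {}
    for name, doms in memberships.items():
        best_class = max(doms, key=doms.get)
        if doms[best_class] < threshold:
            flagged[name] = (best_class, doms[best_class])
    return flagged

# Hypothetical DOM values for three hypothetical objects.
example = {
    "solvent_a": {"class_1": 0.85, "class_2": 0.10, "class_3": 0.05},
    "solvent_b": {"class_1": 0.40, "class_2": 0.35, "class_3": 0.25},
    "solvent_c": {"class_1": 0.05, "class_2": 0.15, "class_3": 0.80},
}
flagged = unstable_members(example)
print(flagged)
```

Objects returned by such a check could be reported with their maximal DOM next to the name, as is done for the exceptions listed above.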

Conclusions
The fuzzy hierarchical clustering of a large group of solvents into 10 classes of similarity made it possible to find patterns of chemicals with specific properties, distinguished by important descriptors. The fuzzy partitioning method applied helped in finding relationships between solvents of various nature (polar, non-polar, volatile, etc.) and the physicochemical variables used. Additionally, the chemometric analysis has shown that if data for specific descriptors are missing, they can be calculated theoretically with a very close approximation to the experimentally observed and established physicochemical indicators.
Thus, the present study offers a simple methodological approach to the complex problem of solvent partitioning.
In order to understand the similarity and differences of various solvents, fuzzy divisive hierarchical clustering and fuzzy divisive hierarchical associative-clustering were successfully applied. The fuzzy partition hierarchy of solvents and associated descriptors made it possible to identify partitions (groups) of solvents with more or less similar characteristics in terms of highest, lowest, or intermediate values of the descriptors considered.