Gas principal properties as new compact descriptors for data-driven gas solubility modelling

Principal properties (PPs), new compact descriptors for 48 gases, were derived and their physico-chemical significance is discussed. Their application to the prediction of gas solubility in organic solvents by means of data-driven soft models is reported.


Introduction
Optimization of the desired biological and physico-chemical properties of complex chemical entities (molecules or macromolecules), or of the performance of chemical processes involving different chemical building blocks (reactions, industrial and physico-chemical processes, etc.), requires parametrization of non-continuous variables (substituents, amino acids, solvents, catalysts, etc.) by means of chemical descriptors. The need for mutually orthogonal variables (descriptors) suitable for multivariate experimental design led to the derivation of principal properties (PPs), intrinsic properties representative of experimentally observable macroscopic descriptors for chemical entities. [7][8][9] Principal properties for heteroaromatic moieties, based on aromaticity 10 and on 3D-GRID structural parameters, 11 have also been reported.
PPs have been successfully applied in the field of Quantitative Structure-Activity Relationships (QSARs), 5 in particular for the design of biologically active peptides. 6,7 Dedicated PPs for amino acids were then derived for peptide QSARs 8 and for quantitative sequence-activity modelling. 12 More recently, MIF (Molecular Interaction Field) molecular descriptors 9,[13][14][15][16] were used for ligand-based virtual screening in antitumor drug design.
PPs were also derived for Ionic Liquids (ILs), low melting point salts comprising an organic cation and an inorganic or organic anion, which exhibit unprecedented efficiency at the molecular level and provide opportunities for the development of green and sustainable chemical procedures. As experiments could not possibly explore the huge chemical space covered by the cationic and anionic counterparts of ILs, derivation of in silico VolSurf+ physico-chemical descriptors was needed. These parameters, used as such or compacted into cationic and anionic PPs for ILs (PP+ and PP-, respectively), 17 were applied to develop QSARs relating the chemical structure of ILs to their toxicities, 17,18 or to important physico-chemical properties such as polarity, 19 heat capacity, 20 viscosity, density, decomposition temperature and conductivity. 21 Such an approach has been illustrated in detail and allowed the design of ILs for specific applications. 22
The above examples illustrate the use of PPs as molecular descriptors for building blocks to predict the biological or physico-chemical properties of more complex chemical entities. An interesting example of PPs multivariate optimization of a chemical process with three chemical building blocks is provided by the Fischer indole synthesis, 23,24 a two-step synthesis involving the reaction of ketones with phenylhydrazines and ring closure in the presence of an acid. When unsymmetrical ketones are used, a mixture of indole regioisomers can be obtained. The use of PPs of ketones, of Lewis acid catalysts and of solvents led to the regiospecific synthesis of single indole regioisomers 25 and to a one-pot reaction under milder conditions. 26
In spite of several successful applications, PPs have not been very popular over the past three decades. The main criticisms are that they are subjective quantities, dependent on the data set adopted for their derivation, and that they lack an immediately interpretable physical meaning. On the other hand, PPs are statistically orthogonal and can therefore be safely used in multiparameter linear equations, avoiding the danger of collinearity. Furthermore, they are less influenced by measurement errors and system-specific variations than single descriptors (e.g. solvent polarity scales) and can be derived for a wider set of objects than the original descriptors (e.g. different solvent polarity scales), allowing the investigation of a wider chemical space. Finally, the physical meaning of PPs can be evidenced by the PCA descriptor loadings obtained in their derivation.
In 1803 Henry's law provided the first and still most popular quantitative measurement for gas solubility in a solvent at a given temperature. 27 Henry's constant H is defined as P/x, where P is the partial pressure of the gas in bar units and x is its molar fraction in the liquid phase. Accordingly, high H values denote low solubility, while low H values correspond to higher solubility. In 1936, Hildebrand 28 proposed the definition of a "solubility parameter", further extended by Hansen. 29 Similar Hansen parameters indicate miscibility in various proportions, while dissimilar values denote limited solubility. Henry, Hildebrand and Hansen solubility parameters, providing a quantitative estimate for the ancient motto "like dissolves like", are widely adopted in industrial processes requiring the knowledge of gas-liquid solubility, for the selection of extraction solvents and in many other applications. In this context, a recent review dealt with the solubility parameters of permanent gases, an important issue for industrial processes and environmental elimination. 30
Modelling and prediction of gas solubility by theory-driven approaches, such as quantum mechanical calculations, 31 mixed quantum mechanics/molecular mechanics 32 and force field 33 models, have been reported. 34 It has recently been proposed 21 that data-driven modelling approaches can complement and usefully integrate theory-driven ones.
In particular, the SIMCA approach 35 adopting PCA/PLS 36 modelling compacts raw data into data of higher relevance, eventually adopting different soft models of local validity for different classes, leading to simple linear predictive equations. PCA/PLS modelling has been successfully applied by our group in many different areas such as cultural heritage, 37 to predict NMR shifts, 38 in food chemistry, 39 for drug identification 40 and in genome based cancer research [41][42][43] including leukaemia. 44,45 A limitation of PLS models, relating molecular descriptors (variables in the X matrix) to a given molecular property (the y dependent variable), is that they are derived from a learning set of objects spanning a given chemical space. Accordingly, PLS predictions, which can in principle be calculated for all test set objects for which the descriptors are available, are reliable only within the investigated experimental domain. However, a PLS model of local validity usually requires a lower number of descriptors than a more general one. Therefore, disentangling the objects into more homogeneous classes spanning a limited chemical space may lead not only to more accurate predictions, but also to simple linear equations, with a lower number of independent variables, which can be applied directly by experimentalists to address a specific problem. In this context, we here derive by PCA gas PPs as new compact experimental descriptors and report two examples of PLS analysis to predict gas solubility in organic solvents, a key physico-chemical property in many industrial processes.

Derivation of gas PPs
The gas PPs reported in Table 1 were derived by SIMCA (Soft Independent Modelling of Class Analogy) 35 as the scores of a PCA carried out on a data matrix including 48 gases as objects and 42 experimentally determined physico-chemical properties as variables 46 (Table 2). PCA is an "open" statistical procedure which depends on the choice of the objects (gases) and variables (observable gas properties) included in the data matrix, which has to achieve an optimal balance between model (and therefore PPs) generality and descriptor prediction ability. In the present case we selected 48 gases among the most common ones, for which a significant number of experimental determinations was available in the Air Liquide database. 46 PCA of the selected data matrix provided a model with 4 significant principal components (PCs) and good predictive ability (Q2 = 0.72), explaining 89.7% of variance (Table S1). The resulting scores are listed in Table 1 and plotted as t1-t2 and t3-t4 in Figures 1a and 2a respectively, together with the corresponding p1-p2 and p3-p4 loadings plots in Figures 1b and 2b respectively.
The loadings elucidate the information carried by the descriptors, providing guidance for interpreting the physico-chemical meaning of gas PPs. Figure 1b clearly shows a grouping of descriptors typical of thermal properties, such as heat capacities (Cp, Cv) and thermal conductivities, mainly in the top left quadrant (negative p1 and positive p2). Properties related to the capability of molecules to move (increase in translational energy), to rotate (increase in rotational energy) and to vibrate (increase in vibrational energy), such as viscosity, vapor pressure, Z ratio, Cp/Cv ratio and liquid/gas equivalent ratio, are in the bottom left quadrant (negative p1 and p2).
Interestingly, most neutral gases are located in the same quadrant of Figure 1a. Positive p1 loadings are exhibited by properties related to the gas molecular weight, namely boiling and melting points, specific gravity and density. Therefore, gas PP1 values in Table 1 (i.e. t1 values in Figure 1a) are very high for halogenated gases and increase with the number of carbon atoms in hydrocarbons. In Figure 2a acid gases and the only basic one (ammonia) are located in the upper right quadrant (positive t3 and t4). In the same quadrant of the loadings plot (Figure 2b) we find gas properties related to the equilibrium with other phases, such as latent heats of vaporization and fusion, triple point, and critical pressure and temperature, all affected by the capability to form hydrogen bonds. The lower right quadrant of the loadings plot (positive p3 and negative p4) in Figure 2b is characterized by the presence of vapor pressure properties. Interestingly, in the same quadrant of the scores plot in Figure 2a we find volatile hydrocarbons.
In both scores plots hydrogen, a diatomic gas, exhibits a peculiar behavior, lying outside the confidence ellipse of the model, due to its unique properties deriving from its electronic configuration and position in the periodic table.
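The derivation described above (autoscaling followed by PCA, with scores used as PPs and loadings used for interpretation) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a plain singular value decomposition from NumPy instead of the SIMCA package used in this work, it assumes a complete data matrix with no missing values, and the matrix `X_demo` is random placeholder data, not the Air Liquide properties. The function names are hypothetical.

```python
import numpy as np

def autoscale(X):
    """Centre each column and scale it to unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    return (X - mean) / std

def pca_scores_loadings(X, n_components):
    """Return scores T, loadings P and explained-variance fractions.

    Uses an SVD of the autoscaled matrix (an assumption of this sketch;
    the paper relies on the SIMCA software instead).
    """
    Z = autoscale(X)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # scores t_ia (the PPs)
    P = Vt[:n_components].T                      # loadings p_ak
    explained = s**2 / np.sum(s**2)
    return T, P, explained[:n_components]

# Placeholder data with the same shape as the gas matrix (48 objects x 42 variables).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(48, 42))

T, P, explained = pca_scores_loadings(X_demo, n_components=4)
print(T.shape, P.shape)   # one row of 4 scores per object, one row of 4 loadings per variable
```

The score columns obtained this way are mutually orthogonal, which is the property that makes PPs safe to use together in multiparameter linear equations.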

Application of gas PPs for gas solubility modelling
Gas PPs listed in Table 1 and available solvent PPs 2,3 represent orthogonal descriptors for each of the two chemical building blocks which constitute a more complex physico-chemical process, gas solubility in organic solvents. In particular, multivariate PLS modelling of gas solubility in organic solvents was carried out using the gas PPs derived herein and the solvent PPs reported in reference 3 as descriptors, and gas solubility as the dependent variable. The solvent descriptor space spanned by this analysis includes 8 solvents and the gas space 10 gases, as defined by the experimentally determined Henry's constants 47 considered as the dependent variable. In the present case, the y dependent variables in the PLS model were converted into the base 10 logarithms of the Henry's constants, log H. The choice of the logarithmic form has been pointed out 48 to be relevant for theoretical considerations, as log H is inversely proportional to the solvation free energy ΔGs.
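As a small worked illustration of the dependent variable, the sketch below applies the definition H = P/x given in the Introduction and the base 10 logarithm used for the PLS response. The two Henry's constants are hypothetical values chosen only to show that a higher H corresponds to a lower dissolved mole fraction; they are not taken from reference 47, and the function names are illustrative.

```python
import math

def mole_fraction(partial_pressure_bar, henry_constant_bar):
    """Henry's law, H = P / x, rearranged to x = P / H."""
    return partial_pressure_bar / henry_constant_bar

def log_henry(henry_constant_bar):
    """Base 10 logarithm of Henry's constant, the PLS response."""
    return math.log10(henry_constant_bar)

# Hypothetical Henry's constants (bar), for illustration only.
H_poorly_soluble = 2000.0
H_well_soluble = 20.0

x_poor = mole_fraction(1.0, H_poorly_soluble)   # mole fraction at P = 1 bar
x_well = mole_fraction(1.0, H_well_soluble)

print(x_poor, log_henry(H_poorly_soluble))
print(x_well, log_henry(H_well_soluble))
```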
A preliminary PLS analysis was carried out using a 77x6 descriptor matrix including 77 gas-solvent combinations (Table S2) and 6 descriptor variables (4 gas and 2 solvent PPs), with log H as the response. This model provided two significant PLS components, and the resulting scores plot (Figure S1) evidenced significant differences in the gas structures, with hydrocarbons separated from hydrogen sulfide, sulfur dioxide and carbon dioxide, and suggested the adoption of different class models. Actually, the latter three gases do not represent a sufficient number of learning set objects to build a separate class model. However, a separate PLS model could be derived for the solubility of hydrocarbons, leading to satisfactory statistical parameters (Table S3) and to the VIP (Variable Importance on the Projection) values bar plot reported in Figure S2. VIP values, giving an indication (in absolute values) of which variables in the X block (PPs of both gases and solvents) are relevant to determine the dependent variable (gas solubility), suggested that PP4 of the gases and PP2 of the solvents could be eliminated without a significant loss of information. Accordingly, a new PLS model was derived for a matrix including 48 objects in the learning set and only four important descriptors (three PPs for the gases and one PP for the solvents). The analysis provided a satisfactory model for the solubility of hydrocarbons in organic solvents where (see Table S4) 3 PLS components explain 76.8% of y variance (Q2 = 0.73). The VIP descriptor values reported in Figure S3 indicate, as expected, that the gas structure has a major influence on the process as compared to the solvent structure.
In Figure 3 we report the correlation plot spanning 3.5 log units, including model predictions for 48 learning set objects and 8 randomly selected test set objects distributed along the y experimental domain. The predictions for test set objects (Figure 3 and Table S2) are not significantly different from those of the learning set ones, providing an external validation of the predictive ability of the model. As mentioned above, one advantage of the soft modelling approach adopted here is that gas solubility for hydrocarbons can be easily calculated by a four-parameter equation (Eq. 1), whose independent variables are readily available in Table 1 and in reference 3.
Another literature data set suitable to test the performance of gas PPs includes the solubility in five n-alkanols, expressed as logarithms of Ostwald coefficients, 49 for nine gases, seven of which are neutral (5 noble gases, nitrogen and oxygen), one a hydrocarbon (methane) and one a fluorinated compound (SF6). The peculiarity of the chemical structure of the latter suggested we should exclude it from the analysis, while methane was retained to verify whether it would fit a soft model derived for neutral gases.
A PLS analysis carried out on a matrix including 30 objects in the learning set and 6 descriptors (4 gas and 2 solvent PPs, Table S5) gave an excellent model with 3 PLS components (see Table S6), explaining 97.4% of y variance with a good predictive ability (Q2 = 0.948). External model validation is provided by Figure 4, the correlation plot which also includes 6 test set objects distributed along the y experimental domain. In the present case, gas solubility for neutral gases and methane in n-alcohols can be calculated by the following six-parameter equation:
$$\log L = 0.29118 - 0.00374\,\mathrm{PP1_{solv}} - 0.00156\,\mathrm{PP2_{solv}} + 0.13772\,\mathrm{PP1_{gas}} + 0.07827\,\mathrm{PP2_{gas}} + 0.00003\,\mathrm{PP3_{gas}} + 0.22384\,\mathrm{PP4_{gas}} \qquad \text{(Eq. 2)}$$
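Eq. 2 can be applied directly once the six PP values are known. The sketch below hard-codes the coefficients reported in Eq. 2; the dictionary keys and the function name are illustrative choices, and the PP values in the example are placeholders rather than entries from Table 1 or reference 3.

```python
# Coefficients copied from Eq. 2 (log of the Ostwald coefficient in n-alkanols).
EQ2_INTERCEPT = 0.29118
EQ2_COEFFS = {
    "PP1_solv": -0.00374,
    "PP2_solv": -0.00156,
    "PP1_gas": 0.13772,
    "PP2_gas": 0.07827,
    "PP3_gas": 0.00003,
    "PP4_gas": 0.22384,
}

def predict_log_L(pps):
    """Evaluate Eq. 2 for a dict holding the six PP values."""
    missing = set(EQ2_COEFFS) - set(pps)
    if missing:
        raise KeyError(f"missing descriptors: {sorted(missing)}")
    return EQ2_INTERCEPT + sum(c * pps[name] for name, c in EQ2_COEFFS.items())

# Placeholder descriptor values, NOT taken from Table 1 or reference 3.
example_pps = {
    "PP1_solv": 0.0, "PP2_solv": 0.0,
    "PP1_gas": 1.0, "PP2_gas": 0.0, "PP3_gas": 0.0, "PP4_gas": 0.0,
}
print(predict_log_L(example_pps))
```

As discussed in the Introduction, such predictions are expected to be reliable only for gases and solvents lying within the chemical space of the learning set, here neutral gases and methane in n-alkanols.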

Conclusions
New gas PPs based on experimentally determined properties were derived and interpreted according to the gas structural features of different classes. The gas and solvent PPs, both open access in Arkivoc, can be adopted as descriptors to develop data-driven soft models for different classes to investigate an important process such as the solubility of gases in organic solvents. This flexible approach provided simple equations which can be conveniently used by experimentalists to predict gas solubility, a key physico-chemical property in many industrial processes.

Experimental Section
Computational methods. The data set used for PCA 35 was a table (matrix) in which 48 gases were characterized by 42 physico-chemical properties. 46 The variables were autoscaled by multiplying them by appropriate weights (the reciprocal of the variable standard deviation) to give them unit variance (i.e., the same importance). PCA was carried out by using the SIMCA software package 26 on a data matrix containing 48 x 42 elements xik, where the index k is used for the physico-chemical properties (variables) and the index i for the gases (objects). Autoscaled matrix elements were then fitted to the model given by Equation (3), where the number A of significant cross terms (components) and the parameters pak and tia are calculated by minimizing the residuals, eik, after subtracting x̄k (the mean value of the experimental quantities xik for variable k).
$$x_{ik} = \bar{x}_k + \sum_{a=1}^{A} t_{ia}\,p_{ak} + e_{ik} \qquad (3)$$
Parameters x̄k and pak (the loadings) depend only on the physico-chemical properties (variables), and the tia (scores) only on the gases. The deviations from the model are expressed by the residuals, eik. The number of significant components (A) was determined using the cross-validation technique (CV). 50
The Partial Least Squares Projections to Latent Structures (PLS) 36 chemometric tool allows relationships to be found between the gas and solvent PPs (X matrix) and the response, in this case the gas solubility in a given solvent. The PLS algorithm computes PLS components for each of the two matrices (X and Y), searching simultaneously for a linear relationship between the X-scores (ta) and Y-scores (ua) of the PLS components by means of Equation (4), where ba is a proportionality coefficient:
$$\mathbf{u}_a = b_a\,\mathbf{t}_a \qquad (4)$$
The main statistical parameters provided by the PLS method 36 are R2X and R2Y (respectively the fraction of the sum of squares of all the Xs and Ys explained by all extracted components) and Q2, the fraction of the total variation of the Ys predicted by all PLS components, as estimated by cross validation. Q2 was computed as 1 - PRESS/SS, where SS is the residual sum of squares and PRESS is the sum of the squared differences between observed and predicted values for the data kept out of the model fitting. CV was performed in the same way as for PCA. In the present case the PLS method is able to detect which variables in the X block (i.e. gas and solvent PPs) are relevant to determine the dependent variables (i.e. the gas solubility) by means of the VIP values. SIMCA computes VIP values by summing over all model dimensions the contributions VIN (variable influence). For a given PLS dimension, a, (VIN)ak² is equal to the squared PLS weight (wak)² of that term, multiplied by the percentage of the residual sum of squares explained by that PLS dimension. The accumulated (over all PLS dimensions) value, VIPk = Σa (VIN)ak², is then divided by the total percentage of the residual sum of squares explained by the PLS model and multiplied by the number of terms in the model.
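The PLS quantities described above can be illustrated with a compact NumPy sketch for a single response. Several assumptions are made and should be kept in mind: the code implements a textbook NIPALS PLS1 rather than the SIMCA implementation, Q2 is estimated by leave-one-out cross validation with SS taken as the total sum of squares of the mean-centred response, the VIP function takes the square root of the accumulated value (the commonly quoted form, for which the squared VIP values sum to the number of X variables), and the data are synthetic. All function names are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Textbook NIPALS PLS1 for one response (sketch, not the SIMCA code)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    n_vars = X.shape[1]
    W = np.zeros((n_vars, n_components))   # PLS weights w_ak
    P = np.zeros((n_vars, n_components))   # X loadings
    q = np.zeros(n_components)             # inner-relation coefficients
    ssy = np.zeros(n_components)           # y sum of squares explained per component
    for a in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)
        t = Xr @ w                         # X-scores of component a
        tt = t @ t
        P[:, a] = Xr.T @ t / tt
        q[a] = yr @ t / tt
        W[:, a] = w
        ssy[a] = q[a] ** 2 * tt
        Xr = Xr - np.outer(t, P[:, a])     # deflate X
        yr = yr - q[a] * t                 # deflate y
    coef = W @ np.linalg.solve(P.T @ W, q)
    return {"x_mean": x_mean, "y_mean": y_mean, "coef": coef, "W": W, "ssy": ssy}

def pls1_predict(model, X):
    return model["y_mean"] + (X - model["x_mean"]) @ model["coef"]

def vip(model):
    """Squared weights weighted by explained y sum of squares, then normalised."""
    W, ssy = model["W"], model["ssy"]
    return np.sqrt(W.shape[0] * (W ** 2 @ ssy) / ssy.sum())

def q2_loo(X, y, n_components):
    """Q2 = 1 - PRESS/SS with leave-one-out cross validation (an assumption)."""
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = pls1_fit(X[keep], y[keep], n_components)
        press += (y[i] - pls1_predict(model, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Synthetic data: 48 objects, 4 descriptors, only the first two matter.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(48, 4))
y_demo = X_demo @ np.array([1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=48)

model = pls1_fit(X_demo, y_demo, n_components=2)
vip_values = vip(model)
q2 = q2_loo(X_demo, y_demo, n_components=2)
print(vip_values, q2)
```

In this toy example the descriptors that actually drive the response receive the largest VIP values, which mirrors how VIP was used above to discard PP4 of the gases and PP2 of the solvents.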

Figure 1. PCA scores plot for gases (1a) and loadings plot for descriptor variables (1b) for the first and second components.

Figure 2. PCA scores plot for gases (2a) and loadings plot for descriptor variables (2b) for the third and fourth components.

Table 2. Experimentally measured properties included as descriptors in the data matrix for gas PPs derivation

© ARKAT USA, Inc