Prediction of blood:air and fat:air partition coefficients of volatile organic compounds for the interpretation of data in breath gas analysis

In this article, a database of blood:air and fat:air partition coefficients (λb:a and λf:a) is reported for 1678 volatile organic compounds recently reported to appear in the volatilome of the healthy human. For this purpose, a quantitative structure-property relationship (QSPR) approach was applied and a novel method for the prediction of Henry’s law constants was developed. A random forest model using Molecular Operating Environment 2D (MOE2D) descriptors was built on 2619 literature-reported Henry’s constant values. The calculated Henry’s law constants correlate very well (R2 = 0.967 on the test data) with the available experimental data. Blood:air and fat:air partition coefficients were calculated according to the method proposed by Poulin and Krishnan using the estimated Henry’s constant values. The obtained values correlate reasonably well with the experimentally determined ones for a test set of 90 VOCs (R2 = 0.95). The provided data aim to fill the gap in the literature data and to assist the interpretation of results in studies of the human volatilome.


Introduction
It is well established that volatile organic compounds (VOCs) produced and then partially released by the human body have a great potential for diagnosis in physiology and medicine. In particular, this volatile chemical fingerprint can provide non-invasive and real-time information on infections, cancer development, metabolic disorders, progression of therapeutic intervention as well as individual's exposure to environmental pollutants, or toxins [1][2][3]. For instance, VOC patterns identified in breath proved to be useful for recognition of lung cancer [4][5][6][7][8][9], gastric cancer [10,11], and breast cancer [12]. Despite this huge potential, the use of these patterns within a clinical setting is still rather limited. The main unresolved issue is the poor understanding of the origin, behavior, and metabolic fate of their constituents in the human organism. Within this framework, the knowledge of the fundamental physicochemical parameters of identified markers governing their distribution in human organism is highly desirable.
A recent review reported a database of 1764 volatiles appearing in different human body fluids [13]. Amongst these, 874 were detected in exhaled breath, 279 in urine, 504 in skin emanations, 353 in saliva, 130 in blood, and 381 in feces. These compounds belong to diverse chemical families and thereby exhibit very different physicochemical properties. In principle, two fundamental physicochemical parameters govern the behavior and fate of VOCs in humans. These are blood:air and fat:blood partition coefficients. In particular, the blood:air partition coefficient (λ b:a ) is a paramount determinant of pulmonary gas exchange, which together with ventilatory flow and cardiac output determines both the inhalational uptake of exogenous vapors and the elimination of endogenous compounds via exhalation. This parameter is particularly crucial in breath gas analysis, which usually aims at the identification and exploitation of volatile constituents of human breath for the diagnosis of disease states that frequently occur in distant parts of the body [1,2]. More specifically, VOCs exhibiting lower affinity for blood (λ b:a < 10 (mol × L_b^−1)/(mol × L_a^−1)) exchange merely in the alveoli, whereas those with high blood affinity (λ b:a > 100) exchange also in the airways [14][15][16]. Moreover, breath levels of poorly blood-soluble VOCs react very sensitively to changes in ventilation and perfusion, which can be misinterpreted as fluctuations in the endogenous (blood) levels [14,17]. Blood affinity also influences the peripheral gas exchange via the relation λ tissue:b = λ tissue:a /λ b:a . The tissue:blood partition coefficient (λ tissue:b ) is commonly used to describe a venous equilibrium between blood and the respective tissue.
The fat:blood partition coefficient (λ f:b ), in turn, governs the distribution of VOCs between the blood compartment on the one hand and fat tissue and lipophilic cell membranes on the other. Lipophilic volatiles tend to accumulate in lipid membranes or the fat compartment, whereas compounds with low λ f:b readily leave lipophilic cell membranes and drain into blood. Together, λ f:b and λ b:a determine the equilibrium concentration of a given compound between breath, blood and fat and assist in modeling the uptake, distribution, and elimination of VOCs in the human organism [14][15][16][17][18][19].
The blood:air and the fat:blood partition coefficients of VOCs observed in the human volatilome can differ by more than 12 orders of magnitude [20,21]. This means that species having comparable levels in exhaled breath can exhibit disparate concentrations in blood and fat. This effect can be illustrated using two volatiles that are omnipresent in human breath: isoprene and acetone [22][23][24][25]. The blood:air partition coefficient (λ b:a ) of isoprene amounts to 0.95 (mol × L_b^−1)/(mol × L_a^−1) [26], whereas its λ f:b amounts to 82 (mol × L_f^−1)/(mol × L_b^−1) [27]. Acetone, in turn, has a blood:air partition coefficient of ~340 [28]; its fat:blood partition coefficient is given in [29]. Assuming a concentration of both species in alveolar air of 200 ppb (7.76 × 10^−9 mol × L^−1 at 37 °C and 1 bar), their equilibrium concentrations in blood, given by the product of λ b:a and the alveolar concentration, are approximately 7.4 × 10^−9 mol × L^−1 for isoprene and 2.6 × 10^−6 mol × L^−1 for acetone. Thus, the same concentration of VOCs in alveolar air may correspond to blood concentrations that vary more than 8 orders of magnitude. The same holds true for the concentrations in the fat compartment. These essential differences in physicochemical properties of compounds forming the human volatilome [30] may result in different exhalation kinetics manifested, e.g. via different responses of the breath constituents during a moderate workload ergometer challenge [14,15,17,19,31]. In this context, it becomes clear that the knowledge of reliable blood:air and fat:blood partition coefficients of VOCs observed in the human volatilome is of utmost importance for the understanding of their behavior in the human body, the identification of their underlying biochemical pathways and the assessment of their applicability in diagnosis and therapy monitoring. Currently, the methods for determining blood:air and fat:blood partition coefficients can be classified as experimental and predictive approaches.
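The arithmetic behind the isoprene and acetone comparison can be sketched as follows. This is a minimal illustration, assuming ideal-gas behavior for the alveolar air and a simple equilibrium of the form C_blood = λ b:a × C_air; the function names are hypothetical and not part of the original work.

```python
# Worked example: mixing ratio in alveolar air -> molar concentration -> blood level.
# Assumption: ideal gas, and equilibrium C_blood = lambda_ba * C_air.

R_GAS = 8.314462618  # universal gas constant in J mol^-1 K^-1


def ppb_to_mol_per_litre(ppb, temp_k=310.15, pressure_pa=1.0e5):
    """Convert a mixing ratio in ppb to mol/L using the ideal gas law."""
    total_mol_per_m3 = pressure_pa / (R_GAS * temp_k)  # all gas molecules, mol/m^3
    return ppb * 1e-9 * total_mol_per_m3 / 1000.0      # analyte only, mol/L


def blood_concentration(c_air, lambda_ba):
    """Equilibrium blood concentration for a given air concentration."""
    return lambda_ba * c_air


c_air = ppb_to_mol_per_litre(200.0)              # 200 ppb at 37 degC and 1 bar
c_isoprene = blood_concentration(c_air, 0.95)    # lambda_ba of isoprene [26]
c_acetone = blood_concentration(c_air, 340.0)    # lambda_ba of acetone [28]

print(f"alveolar air: {c_air:.3e} mol/L")
print(f"isoprene in blood: {c_isoprene:.3e} mol/L")
print(f"acetone in blood:  {c_acetone:.3e} mol/L")
```

The same pattern extends to the fat compartment by multiplying the blood concentration with λ f:b.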
The experimental methods (mainly headspace techniques) employ direct measurements of the gas and blood/fat concentrations of an analyte in closed containers under equilibrium conditions [26,29,32-34]. It should be stressed here that the fat:blood partition coefficient is usually determined indirectly, by measuring the fat:air partition coefficient and dividing it by the respective blood:air partition coefficient. Unfortunately, the experimental approach suffers from various downsides related to tissue sampling and handling, analytical treatment (e.g. losses of analytes, contaminations, sample decomposition), or unavailability of reference materials. Moreover, this approach is relatively time- and effort-consuming and involves special (frequently sophisticated) analytical instrumentation. Consequently, experimentally determined values of the blood:air or fat:air partition coefficients are still lacking for many volatile compounds. On the other hand, predictive approaches calculate λ f:b and λ b:a using other physicochemical parameters of the compound under scrutiny, such as water:air and n-octanol:water partition coefficients (λ w:a and λ o:w ), vapor pressures, blood composition, or previously obtained λ f:b and/or λ b:a of homolog compounds [33,35-40]. Predictive approaches can, however, fail or lead to wrong results when the necessary physicochemical characteristics are not available or are incorrectly determined, or when unknown factors influence the compound's solubility (e.g. VOC protein binding).
Within this context, the primary objective of this paper is to estimate blood:air and fat:blood partition coefficients for 1678 VOCs reported to occur in the human volatilome [13]. Thereby, we expect to fill the literature data gap and support the interpretation of results in studies on the human volatilome.

Methods
In principle, partition coefficients of organic compounds can be predicted according to two different methods: (a) from their chemical structure or (b) by inference from other physicochemical properties.
If predictions are to be made based on the chemical structure, ab-initio methods based on primary physical principles (quantum mechanical calculations or force field simulations) or methods based on chemical similarity can be used. Ab-initio methods are usually very time-consuming, and their predictive accuracy varies with the target property. On the other hand, methods based on the chemical similarity can give very good results if a large enough set of training data or knowledge base is available. This is the realm of quantitative structure-property relationship (QSPR) models, which are used in daily routine in many different fields that require predictions of physicochemical properties. QSPR model predictions are usually very fast to calculate, and the uncertainty in the individual predictions can be quantified very well.
Predictions based on inference from other physicochemical properties (classic formulae) can be used if these properties and the scaling factors are known. For example, the blood:air and fat:air partition coefficients can be predicted from the octanol:water and the water:air partition coefficients according to the formula of Poulin & Krishnan [41]. While octanol:water partition coefficients for many different organic volatile compounds can be found in respective databases, the data on water:air partition coefficients are much more limited. Thus, a two-stage strategy was used to derive estimates for blood:air and fat:air partition coefficients. First, water:air partition coefficients were calculated from the chemical similarity using a QSPR model, and then blood:air and fat:air partition coefficients were calculated by inference from tabulated octanol:water and estimated water:air partition coefficients.
All values of Henry's constant were converted to atm × m^3 × mol^−1. This unit corresponds to the Henry's constant expressed as the ratio between the partial pressure of the gas and its concentration in water. This unit was used for all further modeling. Apart from this, all data given in the above compilations were used as given; no further inclusion/exclusion criteria of data points were applied.

Chemical structure assignment
In the four sources described above, the chemical identity of the compounds is specified by the CAS (Chemical Abstracts Service) number, trivial name or IUPAC name. The chemical structures (as SMILES) of the compounds were derived via two different web services: the CACTVS chemical identifier resolver hosted by the NIH (http://cactus.nci.nih.gov/chemical/structure), and the ChemSpider API [46]. Structures obtained for CAS numbers were considered the most reliable. If no CAS number was available, structures obtained from the Chemical Identifier Resolver were prioritized over structures obtained from the ChemSpider API. Structures with chemical identifiers where the ChemSpider API yielded more than ten different solutions were individually checked.
Identifiers that yielded zero hits in any web service were individually determined. For all chemical structures, the InChIKey [47] was calculated using OpenBabel [48] in order to identify and remove duplicate entries within each dataset and between the datasets. After removal of all duplicates and counterions, the structure assignment and cleanup procedure yielded 2619 different compounds with assigned Henry's constant values. A full list of those can be found in the supporting information.

Descriptors
In order to generate QSPR (quantitative structure-property relationship) models, the full set of 2D descriptors (n = 192) available from MOE [49] was calculated for all compounds. A table with the numerical values for all descriptors and compounds can be found in the supporting information. During the course of model development, different descriptor sets were also tried, namely the RDKit 2D descriptors [50] and the ParaSurf Surface Integral Model descriptors [51]. However, the models based on these descriptors did not perform better and are not reported in this publication.

Model training
Based on the descriptors and the measured Henry's constant values, random forest models [52] were trained to predict the Henry's constants. Random forests are a standard nonlinear machine learning tool in chemoinformatics for generating QSPR models. In large scale validations, random forests usually turn out to be among the best performing machine learning algorithms for QSPR [53].
In brief, a random forest is a collection of decision trees (in this study n trees = 500), each trained on a bagged subset of the overall data (in this study the default 'sampling with replacement' was used, so that on average 63.2% of the compounds in the overall dataset are used as training set for each tree). At every split, a randomly selected subset of the descriptors (the default for regression: n descriptors /3) is evaluated for the split point that gives the largest reduction in root mean squared error (RMSE). For every tree, the out-of-bag compounds are predicted, and the final prediction for a compound is the average of its out-of-bag predictions, i.e. of the predictions from all trees for which it was not part of the training set. No descriptor selection was applied; therefore, the out-of-bag predictions represent the predictions for an independent validation set. In other words, no compound is used to predict its own partition coefficient at any stage, so the results presented resemble a fully independent validation.
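The 63.2% figure follows from sampling n compounds with replacement: the expected fraction of distinct compounds in a bootstrap sample is 1 − (1 − 1/n)^n, which approaches 1 − 1/e for large n. The short sketch below checks this numerically for n = 2619; it is an illustration only and does not reproduce the 'randomForest' implementation used in the study. The helper names and the random seed are hypothetical.

```python
import random


def expected_in_bag_fraction(n):
    """Expected fraction of distinct items in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_in_bag_fraction(n, seed=0):
    """Draw one bootstrap sample of size n and count the distinct items."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n


n_compounds = 2619                    # size of the Henry's constant data set
n_descriptors = 192                   # MOE 2D descriptors
mtry = max(n_descriptors // 3, 1)     # descriptors tried per split (regression default)

print(f"expected in-bag fraction:  {expected_in_bag_fraction(n_compounds):.4f}")
print(f"simulated in-bag fraction: {simulated_in_bag_fraction(n_compounds):.4f}")
print(f"descriptors evaluated per split: {mtry}")
```

The remaining compounds, roughly 36.8% per tree, form the out-of-bag set on which that tree is evaluated.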
Sheridan et al [54] reported that random forests tend to underestimate extreme values and that this behavior can be alleviated by rescaling the predictions. Thus, an additional cross-validation loop was introduced to calculate scaling parameters and apply them to the predictions: the overall dataset was split into ten parts, and nine out of the ten parts were repeatedly used to train the model and to calculate the scaling parameters. The Henry's law constant values of the tenth part, which was not used for any model building, were then predicted and the predictions were rescaled.
During the course of the model development, different linear QSPR modeling approaches were tried, including PLS [55] and stepwise regression with descriptor pool size adjusted F-values [56]. However, those linear models performed slightly worse and are not reported in this publication.

Error model training
Sheridan has recently introduced the concept of building a separate error model to predict the confidence intervals for individual predictions [54]. In this approach, an additional random forest model is trained to predict the absolute error of the QSAR/QSPR model. Sheridan showed that the tree standard deviation and the absolute predicted value are two essential descriptors that have a high predictivity for the absolute error [57,58]. In initial experiments, it was found that the MOE descriptors add further predictivity, improving the correlation between the predicted and the absolute error. Therefore, all MOE descriptors from above plus the tree standard deviation and the predicted value were used to train error models.
To build the error models, the strategy outlined by Sheridan was followed, using the same random forest setup and the tenfold cross validation for scaling, as above. The error model predictions were scaled to reproduce the moving window RMSE (or standard deviation, this is the same here). In order to calculate the moving window RMSE, the compounds were sorted according to the predicted absolute error, and the experimental and predicted RMSE for each compound were calculated from all compounds within a window of 101 (50 to the left, 50 to the right). The RMSE within the moving window is termed 'local' RMSE, since it is different for every compound and depends on the order of the compounds. For both QSPR model and error model, the R implementation 'randomForest' by Wiener and Liaw [59] was used to generate the models.

Performance metrics
The quality of the QSPR models is measured using the R2, RMSE, and mean unsigned error (MUE). All metrics are calculated based on the predictions of the outer loop cross validation test folds. In the definitions of these metrics, n is the number of compounds, log10 H_exp is the logarithm of the experimentally determined Henry's constant, log10 H_pred is the logarithm of the predicted Henry's constant, and the mean values are the averages of the experimentally determined and of the predicted Henry's constants on the log10 scale.
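The defining equations of the three metrics are not reproduced in the text. A plausible reconstruction, assuming the standard definitions and assuming that R2 is the squared Pearson correlation (which is what the use of both mean values suggests), is given below. The shorthand x_i and y_i is introduced here for readability and is not the authors' notation.

```latex
% Assumed shorthand: x_i = \log_{10} H_{\mathrm{exp},i}, \quad y_i = \log_{10} H_{\mathrm{pred},i}
\begin{align}
R^2 &= \frac{\left[\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})\right]^2}
            {\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2} \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - x_i)^2} \\
\mathrm{MUE} &= \frac{1}{n} \sum_{i=1}^{n} \left| y_i - x_i \right|
\end{align}
```

With these definitions, RMSE and MUE are expressed in log10 units of the Henry's constant, while R2 is dimensionless.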

Henry law constant prediction for the data base of VOCs
The final QSPR and error model were used to predict Henry's constant values and the prediction standard deviation for the 1741 compounds of the data base of volatile compounds. In addition, the structures of the data base of volatile compounds were compared with the 2619 compounds from the training set and 68 overlapping compounds were found. The predictions and the reported experimental values are given in the supporting information.

Prediction of blood-air and fat-air partition coefficients
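Using the predicted values of the Henry's law constant, the blood-air and the fat-air partition coefficients were estimated using the method of Poulin & Krishnan [41]. In this method, a tissue:air partition coefficient is estimated by the formula λ tissue:a = λ o:w × λ w:a × (A + 0.3 × B) + λ w:a × (C + 0.7 × B), where A, B and C are the fractions of neutral lipids, phospholipids and water in the respective tissue. For the fat-air partition coefficient, A ≈ 0.798 is the fraction of neutral lipids in adipose tissue (fat), B ≈ 0.002 is the fraction of phospholipids in adipose tissue, and C ≈ 0.15 is the fraction of water in adipose tissue. The blood-air partition coefficient is estimated analogously from the corresponding fractions in blood.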
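The two calculation steps can be sketched as follows. The sketch assumes the commonly quoted form of the Poulin & Krishnan relation, λ tissue:a = λ o:w × λ w:a × (A + 0.3 × B) + λ w:a × (C + 0.7 × B), and assumes that the Henry's constant (in atm × m^3 × mol^−1) is valid at body temperature, so that λ w:a = R × T / H without any further temperature correction. The function names and the example compound are hypothetical. The tissue fractions for blood are deliberately left as arguments and have to be taken from [41].

```python
R_ATM_M3 = 8.205736e-5  # gas constant in atm m^3 mol^-1 K^-1


def water_air_from_henry(h_atm_m3_per_mol, temp_k=310.15):
    """Dimensionless water:air partition coefficient from H = p_gas / c_water.

    Assumption: H is valid at temp_k (no temperature correction applied).
    """
    return R_ATM_M3 * temp_k / h_atm_m3_per_mol


def tissue_air(lambda_ow, lambda_wa, neutral_lipids, phospholipids, water):
    """Tissue:air partition coefficient from tissue composition fractions."""
    lipid_term = lambda_ow * lambda_wa * (neutral_lipids + 0.3 * phospholipids)
    water_term = lambda_wa * (water + 0.7 * phospholipids)
    return lipid_term + water_term


# Composition of adipose tissue as given in the text (A, B, C).
FAT = {"neutral_lipids": 0.798, "phospholipids": 0.002, "water": 0.15}

# Hypothetical compound, for illustration only.
log10_henry = -4.0   # log10 of H in atm m^3 mol^-1
log10_kow = 1.0      # log10 of the octanol:water partition coefficient

lambda_wa = water_air_from_henry(10.0 ** log10_henry)
lambda_fa = tissue_air(10.0 ** log10_kow, lambda_wa, **FAT)
print(f"water:air = {lambda_wa:.1f}, fat:air = {lambda_fa:.1f}")
```

Calling `tissue_air` with the blood fractions from [41] gives λ b:a in the same way, and λ f:b then follows as the ratio λ f:a /λ b:a .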

Henry's law constants
For the 2619-membered Henry's constant data set, a QSPR model for the Henry's constant values was generated with R2 = 0.967, RMSE = 0.49 and MUE = 0.22 (RMSE and MUE in log10 units). A plot of predicted versus measured Henry's constant is shown in figure 1. Predictions are based on cross-validation, so no compound has been used to predict its own value.
The predicted values of log10 Henry's constant span a range from roughly −15 to 1 (H in atm × m^3 × mol^−1) and agree well with literature values. Errors tend to become larger as the values of Henry's constant become extremely negative. The predicted local RMSE of the hold-out test set, calculated with the moving window approach, is plotted against the empirical local RMSE in figure 2. Figure 2 shows that the predicted error correlates very well with the empirical local RMSE. This indicates that the individual error estimates are highly reliable. Overall, the predicted error estimates span a range from 0.01 to 3.14 log10 units of the Henry's constant. The error estimates represent the standard deviation of the prediction interval. This means that there is a 68% probability that the experimental value lies within the predicted value ± one standard deviation. Large standard deviations indicate that the model predictions are unreliable and that the Henry's constant probably has to be measured. In contrast, small standard deviations indicate that the predicted values are highly reliable; whether this reliability is sufficient depends on the intended use of the predicted values.
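The empirical local RMSE used in this comparison can be sketched as follows. This is a minimal illustration of the moving window idea (sort by predicted error, then take the RMSE of the residuals in a window of 50 neighbors on each side). It assumes that windows are simply truncated at both ends of the sorted list, which the text does not specify, and the function name is hypothetical.

```python
import math


def local_rmse(residuals, predicted_errors, half_width=50):
    """Empirical RMSE in a moving window over compounds sorted by predicted error.

    residuals        : predicted minus experimental log10 H, one per compound
    predicted_errors : error-model output, one per compound
    Returns one local RMSE per compound, in the original compound order.
    Assumption: windows are truncated at the ends of the sorted list.
    """
    order = sorted(range(len(residuals)), key=lambda i: predicted_errors[i])
    sorted_res = [residuals[i] for i in order]
    result = [0.0] * len(residuals)
    for rank, idx in enumerate(order):
        lo = max(0, rank - half_width)
        hi = min(len(sorted_res), rank + half_width + 1)
        window = sorted_res[lo:hi]
        result[idx] = math.sqrt(sum(r * r for r in window) / len(window))
    return result


# Tiny illustration with a window of one neighbor on each side.
example = local_rmse([1.0, -1.0, 2.0, -2.0], [0.1, 0.2, 0.3, 0.4], half_width=1)
print(example)
```

With the default `half_width=50`, interior compounds get a window of 101 compounds, as described for the error model.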

Blood:air and fat:air partition coefficients
The predicted values of blood:air and fat:air partition coefficients for selected compounds of interest are presented in table 1. These values were used to calculate the fat:blood partition coefficients (λ f:b ) using the formula λ f:b = λ f:a /λ b:a . The values of estimated blood:air and fat:air partition coefficients were compared to the experimentally determined values of species of interest. An extensive literature survey resulted in collecting the experimental data for 90 VOCs in the case of λ b:a and 29 VOCs in the case of λ f:a from the list of 1678 species [29,33,40,[60][61][62][63][64][65][66][67][68][69][70][71]. Only values obtained for human blood and adipose tissue were taken into consideration. For the majority of species, only a single literature value, determined for a small population, was available. This fact may lower the reliability of the experimental data as reference values. A plot of estimated versus measured blood:air partition coefficients is shown in figure 3. The predicted blood:air partition coefficients correlate reasonably well (R2 = 0.95) with the experimentally determined values of this parameter. For instance, the predicted value of λ b:a for acetone amounts to 126 and agrees well with the available experimental data (340 [72], 245 [61], 186 [29]). In the case of isoprene, the predicted and experimentally determined values amounted to 0.253 and 0.95 [26], respectively.
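To put the acetone and isoprene examples on a common scale, the deviation between predicted and measured values can be expressed as a fold difference or as a difference on the log10 scale. The sketch below uses only the values quoted above; the helper names are hypothetical, and the log10 scale is chosen here because partition coefficients span many orders of magnitude.

```python
import math


def fat_blood(lambda_fa, lambda_ba):
    """Fat:blood partition coefficient from fat:air and blood:air values."""
    return lambda_fa / lambda_ba


def fold_difference(predicted, measured):
    """Ratio of the larger to the smaller value (always >= 1)."""
    return max(predicted, measured) / min(predicted, measured)


def log10_error(predicted, measured):
    """Signed prediction error on the log10 scale."""
    return math.log10(predicted) - math.log10(measured)


acetone_predicted = 126.0
acetone_measured = [340.0, 245.0, 186.0]   # [72], [61], [29]
isoprene_predicted, isoprene_measured = 0.253, 0.95

for value in acetone_measured:
    print(f"acetone vs {value:g}: fold {fold_difference(acetone_predicted, value):.2f}, "
          f"log10 error {log10_error(acetone_predicted, value):+.2f}")
print(f"isoprene: fold {fold_difference(isoprene_predicted, isoprene_measured):.2f}, "
      f"log10 error {log10_error(isoprene_predicted, isoprene_measured):+.2f}")
```

In both examples the prediction lies below the measured values, by less than one order of magnitude.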

Conclusions
The main goal of this work was to create a database of blood:air and fat:air partition coefficients for 1764 VOCs reported to occur in the human volatilome. Since experimental data on these parameters are relatively rare (e.g. measured λ b:a have been found for only 90 of these VOCs), a predictive approach was applied. For the purpose of this approach, a QSPR model was generated to predict the Henry's constant values. The QSPR model was built based on 2619 Henry's constant values assembled from the literature. It uses standard QSPR methodology, the random forest machine learning algorithm, and MOE2D descriptors. In addition, a separate error model was generated to estimate individual model uncertainty, an approach that has recently been pioneered by Sheridan [54]. Compared to other QSPR models, the model has an excellent performance (R2 = 0.97) and a highly reliable error estimation, which allows judging the predicted values for each compound individually. The blood:air partition coefficients were calculated for 1678 species from the list of 1764 volatiles reported by de Lacy Costello et al according to the method of Poulin & Krishnan using the modelled Henry's constant values. These values agree reasonably well with the available experimental data, with R2 = 0.95.
[Table 1. Caption fragment: the table also presents the occurrence of these species in the human body (after de Lacy Costello et al [13]).]
Nevertheless, the limitations of the study should be indicated. First, the values of the estimated partition coefficients can be affected by the uncertainties of the other parameters used for their prediction (e.g. octanol:water partition coefficient, enthalpy of vaporization). Secondly, additional factors influencing the compound's solubility (e.g. binding to blood proteins, differences in blood composition) were not taken into consideration. Considering these variations, it becomes clear that the predictions do not provide precise values of the partition coefficients under scrutiny. Thus, the values reported within this manuscript should be considered as an approximation of the real values, and the database should be used with care. It is recommended in the first place to use experimentally determined values of blood:air and fat:air partition coefficients. If human experimental data are not available in the literature, experimental values of partition coefficients determined for animals (e.g. rat λ b:a ) can be applied as a reasonable surrogate. Finally, if no measured values of λ b:a and λ f:a are available, the partition coefficients estimated within this study should be applied. To sum up, it is expected that the partition coefficient data provided by this study will assist future investigations in this exciting field.