BioCompoundML: A General Biofuel Property Screening Tool for Biological Molecules Using Random Forest Classifiers

: Screening a large number of biologically derived molecules for potential fuel compounds without recourse to experimental testing is important in identifying understudied yet valuable molecules. Experimental testing, although a valuable standard for measuring fuel properties, has several major limitations, including the requirement of testably high quantities, considerable expense, and a large amount of time. This paper discusses the development of a general-purpose fuel property tool, using machine learning, whose outcome is to screen molecules for desirable fuel properties. BioCompoundML adopts a general methodology, requiring as input only a list of training compounds (with identi ﬁ ers and measured values) and a list of testing compounds (with identi ﬁ ers). For the training data, BioCompoundML collects open data from the National Center for Biotechnology Information, incorporates user-provided features, imputes missing values, performs feature reduction, builds a classi ﬁ er, and clusters compounds. BioCompoundML then collects data for the testing compounds, predicts class membership, and determines whether compounds are found in the range of variability of the training data set. This tool is demonstrated using three di ﬀ erent fuel properties: research octane number (RON), threshold soot index (TSI), and melting point (MP). We provide measures of its success with these properties using randomized train/test measurements: average accuracy is 88% in RON, 85% in TSI, and 94% in MP; average precision is 88% in RON, 88% in TSI, and 95% in MP; and average recall is 88% in RON, 82% in TSI,


■ BACKGROUND
The assertion that fuel performance is controlled by a finite number of meaningful fuel properties is essential in the study, down-selection, and a priori assessment of various fuels and fuel-related compounds.The challenge in evaluating from thousands to hundreds of millions of potential known substances and compounds, without direct experimental characterization poses a daunting task.A number of techniques exist for dealing with this problem.A common approach is the use of chemical intuition to filter compounds by known, established, and experienced properties. 1 Although powerful, the chief limitation of chemical intuition is the limitation of all manual human efforts, namely, the efficient and tedious evaluation of large amounts of data. 2 To this end, a considerable effort has been expended over the last halfcentury in generating computational methods for predicting quantitative structure−property relationship (QSPR) and quantitative structure−activity relationship (QSAR) of various fuel properties. 3The techniques for studying these vary considerably: from regression-based methods (e.g., multiple linear regression, neural network general regression, and partial least squares) to classification methods (e.g., ensemble trees, support vector machines, logistic regression, and linear discriminant analysis). 4They also cover a large number of compound properties related to fuel performance, including cetane number, 5 octane number, 6 threshold soot index (TSI), 7 flashpoint, 5 melting point (MP), 8 and others.In addition to these machine-learning methods, other more recent approaches have developed model-based screens for fuel candidates using computer-aided molecular design. 9n this paper, we present a general-purpose open-source machine-learning classifier for screening compounds for desirable fuel properties.In compound screening, we do not predict specific values of fuel properties but rather the probability that a property value is above or below a defined threshold.For example, spark-ignition (SI) engines require fuel with a minimum research octane number (RON).Our classifier can predict the probability that a compound with unknown RON has a value above the required minimum.In contrast to more resource-intensive computational tools (e.g., quantum mechanical methods and density functional theory), our classifier trains in minutes and processes individual compounds in seconds.While BiocompoundML may be used for any chemical property, we use this tool to present our predictive classifiers for RON, TSI, and MP.These three properties have special importance in that they define three key aspects of fuel performance, namely, the anti-knock performance (RON), the atmospheric/health impacts (TSI), and the ability for a pure compound to handle and blend into a fuel (MP).
Our software is a discrete classification tool, which allows it to be used to quickly screen compounds.Included in this package are two additional important features: (1) a clustering tool and (2) a feature reduction function.The clustering tool was designed with screening in mind.Because we anticipate one common use case is the screening of a large number of compounds, our automated clustering algorithm helps ensure that tested compounds bear structural resemblance to training compounds, prior to classification.Our feature reduction function implements Boruta feature selection 10 to automatically reduce the number of features in the model.The Boruta selection feature is important because our feature selection protocol includes an automated feature collection package for obtaining data directly from the open chemical repository PubChem, 11 an implementation of PaDEL-Descriptor (http:// padel.nus.edu.sg/software/padeldescriptor/) for generating QSAR descriptors (called using the chemofeatures parameter in BioCompoundML), and an interface for providing usersupplied features.The interface is especially useful when including experimental, proprietary, or closed-licensed data sources.Users can, in fact, provide separate substance data files (SDFs) in training and testing directories to collect QSAR descriptors for prediction using PaDEL-Descriptor.
This software is evaluated using three strategies.The first strategy is leave-out resampling, using 50% build versus 50% test data sets, for RON, TSI, and MP.The second strategy directly measured RON for 16 compounds not included in our training data set.The third strategy involved testing the 8696 compounds stored in the MetaCyc database. 12Nearly all of these compounds are known to be biologically produced and were thus evaluated to screen for potentially high-performance biofuel blendstocks.
This software differs in a variety of ways from other generalpurpose chemical model development tools.Unlike camb, an open-source chemical model development tool, written in R, BioCompoundML does not focus specifically on protein activity or quantification focusing but rather on chemical characterization and classification. 13BioCompoundML is a fully packaged distribution, which allows it to be run on user data, directly from a personal computer, server, or cloud instance.This separates BioCompoundML from other tools, such as OCHEM, which provide an online development environment for chemical property prediction and classification. 14Bio-CompoundML is a fully functional executable program, created in Python.With the exception of selecting parameters and providing training and testing data, users should not have to directly interact with the provided libraries, unlike scikit-chem (https://github.com/richlewis42/scikit-chem)and RDKit (https://github.com/rdkit/rdkit),which provide a set of Python libraries for designing chemical analysis workflows, predicting common features, and converting between data formats.

■ METHODS AND IMPLEMENTATION
Feature Collection.BioCompoundML is designed to be used in conjunction with PubChem Entrez application programming interface (API) of the National Center for Biotechnology Information (NCBI).We include with this package a feature extraction package that directly collects experimental and fingerprint data from NCBI using either the compound identifier (CID) of NCBI or standard Chemical Abstracts Service (CAS) Registry Numbers.Two main sets of features are extracted: experimental/estimated features and fingerprints.The experimental/estimated features that are used by the feature extraction package include experimentally measured properties (e.g., MP, boiling point, and vapor pressure), inferred structural features (e.g., rotatable bond count and heavy atom count), chemical properties (e.g., molecular weight and formal charge), and inferred chemical properties (e.g., XLogP3, which estimates the octanol−water partitioning, a property directly related to hydrophobicity).PubChem fingerprints are 881 properties relating to the compound structure.These include, for example: >4H, ≥1 any ring size 6, and simplified molecular-input lineentry system (SMILES) patterns [e.g., C(−C)(−C)(C)].A full list of fingerprints is available at ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt.BioCompoundML is designed to connect directly to PubChem and extract these features for any compound in its training or testing set.These retrievals can, however, also be turned off directly.Additionally, this package also has the capacity to retrieve SDFs from NCBI.These may be useful in downstream QSPR/QSAR feature extraction.
BioCompoundML additionally includes a copy of PaDEL-Discriptor, a molecular descriptor calculator, that takes as input a SDF and provides thousands of individual QSPR and QSAR descriptors for each compound. 15By default, BioCompoundML calculates 1444 of these descriptors (1D/2D descriptors).This software is provided with its open source Apache 2.0 license.
Commonly, analysts with interest in fuel properties will have access to proprietary data, experimental results, curated features, or licensed calculators.BioCompoundML includes the option of adding these features to the training and testing data sets.These features may be provided to the program as separate columns in the input data files.
Feature Reduction/Selection and Value Imputing.The most obvious means of feature reduction is to remove all non-variable features.This is performed as a first step, following feature collection.The next step of imputing missing values is achieved using a two-step approach.The first step is to perform k-nearest neighbor (k-NN) imputation. 16This process takes a distance matrix and imputes missing values using the k-NN.The distance matrix in this tool is calculated using the Jaccard distance/Tanimoto score, 17 using the 881 NCBI fingerprint variables.This allows for the distance matrix to be collected separately from value imputation.This matrix is used to identify the nearest neighbors.The default for BioCompoundML is k = 5.The distance matrix is then used to assign a weight to each value for the nearest neighbors and return a weighted average, such that nearer neighbors (in this case, compounds) are more heavily weighted.This approach is generalizable and has shown consistent success as an approach to missing data. 18In cases where features were too sparse to fully resolve using k-NN, we used the mean value for the feature as a minimum information imputer (see Figure 1).
In addition to feature reduction, we were also interested in reducing data complexity through feature selection.We used the Boruta algorithm for selecting features for random forest classification. 10The original Boruta algorithm was written in R. We use a separate package, written by Daniel Homola for Python https://github.com/danielhomola/boruta_py/.This algorithm works by generating a set of shadow random features, by duplicating and then shuffling the variables.These are used to create z-score distributions for each feature, through random iteration, followed by classification using random forest.Each original feature is compared to the maximum z score for shadow features.Features that do not score significantly better than the shadow features are excluded from the model.These can dramatically reduce the complexity of the model, by eliminating uninformative features (see Figure 1).
Random Forest Classification.Random forest classifiers work by creating ensembles of decision trees using randomized "bags of features". 19These features are randomly sampled and classified into decision trees.These trees are then used in the process of "voting" for a machine-learned model.The incorporation of randomness into the training stage 19 means that, as the error rate decreases in the training set, through the addition of trees, the error rate in the testing set remains constant. 20This is due to strong classifiers dominating the decision process, at the expense of weak (overfitting-prone) classifiers. 21n additional feature of the random forest classifier is its ability to separate tree-solving steps on parallel processes. 22This makes the steps prior to voting, embarrassingly parallel.This time-consuming step can be handled by separate central processing units (CPUs), allowing the classifier to scale efficiently.
Use of Clustering To Limit Tests.As part of BiocompoundML, we include a clustering tool, which screens and selects compounds that are structurally similar to the training set.First, our clustering tool evaluates similarities and differences between compounds in the training data set.PubChem fingerprints, prior to feature reduction by Boruta, are transformed using the random trees embedding algorithm. 23This unsupervised transformation algorithm uses a forest of random trees to encode data (compounds) by the indices of leaves that each compound falls into, resulting in a high dimensional sparse binary code.Dimensionality reduction by truncated singular value decomposition (SVD) is then implemented on the sparse binary codes. 24

RON Classifier Development.
A principle fuel property of interest for contemporary SI engines is the RON.RON is a measure of the resistance of a chemical to autoignition, which can occur in the fuel−air mixture ahead of the flame front created by the spark if temperature and pressure become too high for the autoignition resistance of the fuel.RON is one of the hallmark properties of SI engine fuels, and future, highefficiency engines may require significantly higher RON than available in the market today. 25or this analysis, we collected RON measurements from 148 pure training compounds from our review of the literature 26−29 and 36 internally tested compounds.All but three of the internally tested compounds were tested according to the ASTM D2699 standard.The other compound RON values were estimated using derived cetane numbers, which is not exact but allows for discrete classification. 30These compounds had a median RON of 94.4 (see Figure 2).We chose this RON as the threshold for the classifier, because it (1) evenly split the data set and (2) generally measures high versus low RON, as it is experimentally evaluated.To build our BioCompoundML model, we selected 881 NCBI fingerprint features and 26 NCBI experimental features.Only 157 of these features were variable in our training set.After Boruta feature selection (following k-NN imputation), only 9 features proved to be useful for model prediction (Table 1).
Using 100 random 2-fold splits of the data, we observed high accuracy, precision, and recall.The average accuracy in these 50% leave-out experiments was 87.79%, with a standard deviation of 5.8%.Precision was 88.16%, with a standard deviation of 11.4%.Recall was 88.17%, with standard deviation of 11.3%.The receiver operator characteristic (ROC) area under the curve (AUC) was 0.88, with 0.056 standard deviations.Figure 3 shows the ROC curves for 10 random 50% leave outs.It is expected that the performance of the RON classifier will dramatically improve as the diversity of training data increases, expanding the functional group space.The current functional groups for the training set include alkenes, alkanes, alkynes, benzenes, cycloalkenes, cycloalkanes, alcohols, esters, ketones, and naphthalenes.
The heaviest weighted features in the classifier are shown in Table 1.The most heavily weighted feature, XLogP3, is a measure of water−octanol partitioning.A high XLogP3 is predictive of a high RON value.The other LogP-related features (i.e., LogP and XLogP-AA) are alternatively calculated estimates of LogP.The temperature of autoignition, boiling point (at 760 mmHg), and vapor pressure (mmHg at 25 °C) are also important features in predicting whether a compound has a RON value greater than or less than 94.4.The structural features C−C−C−C−C−C and C−C−C−C−C, which are SMILES patterns that refer to the presence of six-and fivecarbon structures, are also important.It is necessary to note that, although some features correlated directly with RON, ensembles of decision trees work hierarchically, meaning that feature interactions are not necessarily additive or direct.These weightings can, however, give insight into the physical attributes that govern the property of interest (in this case, RON) and, hence, the attributes that could merit further experimental investigation.
Using our RON model, we tested our results using 16 additional compounds not used to build the model.RON for these compounds was measured using the ASTM D2699 standard and has not previously been measured.They were selected on the basis of our interest in them as potential targets for synthetic bioengineering.
From the SI engine fuel perspective, there are essentially two classes: RON significantly >94.4 and RON not significantly >94.4.These compounds considerably expanded the range of variability, from the training data, and serve as a test of the expandability of our model.A linear regression model of classification probability of RON > 94.4 by observed RON has a R 2 = 0.653 (p < 0.001) (see Figure 4).On the basis of this linear regression, the classification decision boundary (the point where the model intersects 94.4, given y = 0.0074x + 0.1547) is 0.849.Using this decision point, all tested molecules have been correctly assigned.However, including the 95% confidence interval at the decision boundary (94.4), the range of marginal values covers from p > 0.800 to p < 0.897.One compound (linalool with an observed RON = 96.7)falls within this marginal region.This suggests that the observed accuracy [(true positives + true negatives)/total] is between 93.75 and 100% (see Table 2).
RON Evaluation for Large Numbers of Compounds Using BioCompoundML.We collected a large corpus of biologically produced compounds, using MetaCyc. 12This database includes 8696 potentially biologically produced compounds.A much smaller proportion is of interest to us, and therefore, we included a feature in BioCompoundML that automatically clusters compounds using a Tanimoto fingerprint similarity criterion, which has been shown to be a powerful functional clustering technique. 17Clustering training data to the total MetaCyc database allowed us to exclude a large set of compounds that would otherwise stretch the limits of any machine-learned model (see Figure 5).
We used the classifier to investigate the 1977 hydrocarbons and oxygenates in MetaCyc, to identify which compounds had the highest and lowest probability for being high RON compounds.Supplemental Table 1 of the Supporting Information shows 107 compounds that score as having high RON (on the basis of the threshold defined above of p > 0.897).It is important to note that the choice of compounds for     inclusion into a fuel is complicated, and there are a variety of reasons why a compound would be chosen for inclusion or exclusion from a fuel blend.However, the general purpose of this tool is to rank and classify a large number of compounds based on individual properties.It is the combination of these properties that ultimately inform us about which compounds will be useful in fuels.TSI Classification.Soot formation is important in many types of combustion for both environmental and engineering reasons.Soot formation can be measured as the smoke point (ASTM D1322 standard, the height in millimeters of a smokeless flame when the fuel is burned in a specific lamp), with lower values indicating a higher sooting tendency.Aviation turbine fuels are limited to a minimum smoke point because high levels of soot cause a higher level of the fuel energy content to be released as thermal radiation, reducing engine durability and efficiency.The TSI was developed to provide a comparison of the soot-forming tendency of different fuels that takes into account differences in air−fuel combustion stoichiometry and different experimental setups.TSI is defined as a value between 0 and 100 for evaluating the onset of soot formation in both premixed and diffusion flames 31 and is defined mathematically by the equation where a and b are experimental constants.TSI corresponds to the inverse of the smoke point, scaled by the molecular weight and experimental constants to between 0 and 100.In some cases, SI engines that employ direct injection can produce high levels of soot particles, and TSI is one metric that has been proposed for predicting fuel effects. 32Our study used the TSI value of a previous study to determine success of our classification technique in predicting this value using 98 compounds. 33Training data for this feature were found to optimally cluster into two classes (using the k-means algorithm and optimizing silhouette score; see Figure 6).The TSI classification boundary was chosen as the median, which for this data set was 5.9.
Using NCBI fingerprint and NCBI experimental and computed features, Boruta reduced the total number of features from 907 to 16 important features.The majority of selected features correspond to SMILES patterns (10 of 16).These may ultimately correspond to bond availability.Complexity is unsurprisingly the most strongly weighted feature (weight = 0.1639).It has been well-known for decades that increasing complexity in general leads to greater resistance to oxidation and corresponds to increased soot formation. 34,35Features relating to mass and molecular weight are also heavily weighted, which is unsurprising, given that they are important features in the definition of TSI.XLogP3 is also heavily weighted (weight = 0.0639).
Using 100 random 2-fold splits of the data, we observed high accuracy, precision, and recall.The average accuracy in these 50% leave-out experiments was 87.21%, with a standard deviation of 9.2%.Precision was 90.71%, with a standard deviation of 12.4%.Recall was 82.30%, with standard deviation of 19.3%.ROC AUC was 0.87, with 0.09 standard deviations.
MP Classification.MP is an example of a fuel property in which there is a large body of experimental data from which to build a machine-learning classifier.MP is important in that fuels are required to be in the liquid state for handling and blending.Additionally, MP of a proposed bioblendstock is related to the minimum wintertime operability temperature of the final fuel blend.We set the classification criterion at 20 °C.This temperature corresponds to identification of compounds that are liquid at room temperature and is the maximum allowable boundary for a fuel blendstock.To build our classifier, we used the Jean-Claude Bradley Open Melting Point Dataset. 36After the removal of redundancies and low-quality measurements, this data set provided 14 869 individual MP measurements.
To classify MP, we adopted two distinct approaches: we classified using exclusively NCBI fingerprints and separately using NCBI fingerprints and a small number of "computed properties".These computed properties include 20 basic The case of many features of limited discriminating power can lead to a problem of overfitting.In this case, having a wide breadth of data can be a huge benefit.Figure 7 maps data relative to MetaCyc.In contrast to RON, this map covers a huge span of likely data.The average accuracy using 100 random 50% leave-out experiments was 92.3% (±0.5%); average precision was 94.7% (±0.6%); average recall was 96.2% (±0.5%); and average ROC AUC was 0.8438 (±0.01).
Using fingerprint plus computed properties let Boruta reduce the number of features from 901 to 194.The most heavily weighted features primarily include structural fingerprints but weighs the selected features much more heavily than fingerprint only.The five most heavily weighted features include complexity (weight = 0.0903), computed property topological polar surface area (weight = 0.0670), molecular weight (weight = 0.0582), exact mass (weight = 0.0578), and monoisotopic mass (weight = 0.0556).For this classifier, the average accuracy using 100 random 50% leave-out experiments was 93.7% (±0.5%), the average precision was 95.4% (±0.6%), the average recall was 97.1% (±0.4%), and the average ROC AUC was 0.8662 (±0.01).

■ DISCUSSION
In this paper, we present a general-purpose fuel property characterization tool.This tool uses a variety of machinelearning techniques to perform automatic random forest classification of compounds.BiocompoundML can be adapted to rapidly screen compounds for any desired compound property.We used this technique on a variety of compounds to classify three key fuel properties (i.e., RON, TSI, and MP).We found that this technique provided accurate and precise classification of compounds into high and low classes.We also found that it accurately classified high RON compounds and, when it failed to accurately classify low RON compounds, most errors were marginal.We further ran this tool on a large set of biologically producible hydrocarbons and identified numerous strong predictions of compounds with high and low RONs.This tool proved accurate and precise in predicting TSI class.For data sets with a large number training set (MP), BioCompoundML proved accurate and precise, even with exclusively structural data.
This tool is fully open-source and is usable by a broad range of fuel researchers and industry.The use of this tool is not restricted to the three properties chosen, and it is intended that, in the future, it will find applicability to a wide number of chemical properties.

Notes
The authors declare no competing financial interest.

Figure 1 .
Figure 1.Feature imputation and selection.The first panel of this figure shows the raw data for RON-selected compounds using the fingerprint of NCBI and experimental data.In this panel, white cells correspond to missing data.These data are then imputed using k-NN and mean value in the second panel.Boruta feature selection/reduction is this used to remove uninformative features.The third panel shows the removal (in white) of uninformative features from the data set.
Using this transformed model of the training data set, we identify the distinct number of clusters in the training data by calculating the silhouette coefficient and k-means clustering for 2−8 clusters.The cluster number with the highest average silhouette score is selected as the number of clusters that exist within the training data.This clustering tool can then be used to evaluate the training data in the context of the testing data and extracts test data similar in structure to the compounds in the training.Initially, all testing data, prior to feature reduction by Boruta, are transformed using random trees embedding and dimensions are reduced via truncated SVD, creating a model of the testing data.Training data are then fitted to the testing data model.Each cluster (determined in the first step) of the fitted training data is used to next create a OneClassSVM model for each of the clusters.OneClassSVM is an unsupervised outlier detection method, which allows for the prediction of whether test data are similar to the training clusters or outlie them.Each of the fitted test compounds are run through each of the OneClassSVM training cluster models.If a compound is determined to be similar to at least one of the clusters, further classification and prediction of the compound occurs; otherwise, the compound is disregarded.

Figure 2 .
Figure 2. Distribution of RON values in training data.RONs in the literature range from 0 to 120.Our classifier is particularly interested in compounds that are higher than 94.4.The change in color on this figure corresponds to this boundary.

Figure 3 .
Figure 3. ROC curves for 10 random 50% leave-out experiments.Average AUC is 0.82.This figure shows 10 random ROC curves with 50% for training and 50% left out for testing.

Figure 4 .
Figure 4. Regression of observed RON by probability in a high RON class (RON > 94.4) for experimentally measured compounds.The 95% confidence intervals are included.On the basis of this regression, the expected threshold for classification in high/low RON is p = 0.8487.

Figure 5 .
Figure 5. Non-metric multiple dimensional scaling of the first two principal components.RON training data maximally clustered into two classes (using the k-means algorithm and maximizing silhouette score).Black-and green-colored regions illustrate boundaries around these two clusters.Only 4.5% (n = 398) of 8696 compounds within MetaCyc fall into this category.

Figure 6 .
Figure 6.Clustering of TSI training compounds.The left panel shows the cumulative silhouette score for each class, with the red line representing the average silhouette score.The right panel shows the multiple dimensional scaling plot for the first two principle components.Compound types are labeled by point shape.

Figure 7 .
Figure 7. MetaCyc compounds with MP compounds mapped.Data are visualized using non-metric multiple dimensional scaling of the first two principal components.Compounds with available MPs are plotted in black.

■
NOMENCLATURE NCBI = National Center for Biotechnology Information RON = research octane number TSI = threshold soot index MP = melting point QSPR = quantitative structure−property relationship QSAR = quantitative structure−activity relationship CID = compound identifier CAS = Chemical Abstracts Service SDF = substance data file k-NN = k-nearest neighbor SVD = singular value decomposition ROC = receiver operator characteristic AUC = area under the curve SMILES = simplified molecular-input line-entry system

Table 1 .
RON Predicting Features by Weight and Type

Table 2 .
Experimentally Tested Compounds with Model Predictions This research was conducted as part of the Co-Optimization of Fuels & Engines (Co-Optima) Project sponsored by the Bioenergy Technologies and Vehicle Technologies Offices, Office of Energy Efficiency and Renewable Energy (EERE), U.S. Department of Energy (DOE).Co-Optima is a collaborative project of multiple national laboratories initiated to simultaneously accelerate the introduction of affordable, scalable, and sustainable biofuels and high-efficiency, lowemission vehicle engines.Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the National Nuclear Security Administration, DOE, under Contract DE-AC04-94AL85000.This work was part of the DOE Joint BioEnergy Institute (http://www.jbei.org) supported by the Office of Biological and Environmental Research, Office of Science, DOE, through Contract DE-AC02-05CH11231 between Lawrence Berkeley National Laboratory and the DOE.This work was also supported by the Vehicle Technologies Office, DOE, under Contract DE347AC36-99GO10337 with the National Renewable Energy Laboratory.The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript or allow others to do so for United States Government purposes.