Descriptors-based machine-learning prediction of cetane number using quantitative structure-property relationship

The physicochemical properties of liquid alternative fuels are important but difficult to measure/predict, especially when complex surrogate fuels are concerned. In the present work, machine learning is used to develop quantitative structure-property relationship models. The fuel chemical structure is represented by molecular descriptors, allowing the linking of important features of the fuel composition and key properties of fuel utilization. Feature selection is employed to select the most relevant features that describe the chemical structure of the fuel and several machine learning algorithms are tested to construct interpretable models. The effectiveness of the methodology is demonstrated through the development of accurate and interpretable predictive models for cetane numbers, with a focus on understanding the link between molecular structure and fuel properties. In this context, matrix-based descriptors and descriptors related to the number of atoms in the molecule are directly linked with the cetane number of hydrocarbons. Furthermore, the results showed that molecular connectivity indices play a role in the cetane number for aromatic molecules. Also, the methodology is extended to predict the cetane number of ester and ether molecules, leveraging the design of alternative fuels toward fully sustainable fuel utilization.


Introduction
Low-carbon alternative fuels are becoming increasingly important, but fossil fuels still play a key role in energy supply, especially in difficult-to-decarbonize transport applications such as shipping, road freight, and aviation. Overall, they are responsible for more than 50% of the CO2 emissions of the entire transport sector [1]. With the need to take a step towards net zero emissions and sustainable energy utilization, renewable fuels and biofuels derived from sources other than petroleum are becoming increasingly important [2]. The design of alternative fuels is often based on the life cycle assessment methodology, an all-encompassing evaluation method employed to estimate fuel viability and benefits, offering insights into environmental impact, energy efficiency, economic feasibility, and sustainable decision-making [3].
In light of the fuel design, one of the essential "fuel criteria" is that an alternative fuel must be compatible with the existing global infrastructure [4].
So, it can be integrated into the current transportation system using existing infrastructure and be burned in existing engines (such as diesel engines for optimal fuel economy) with minor adjustments as drop-in fuels. Among efforts on developing low-emission fuels for diesel engine combustion, liquid synthetic fuels have shown high potential for low-carbon transport applications [5]. Synthetic fuels are carbon-based liquid fuels chemically synthesized from a carbon source and designed to mimic the physicochemical properties of fossil fuels [2,4]. Synthetic fuels can be manufactured via chemical conversion processes from 'defossilised' carbon dioxide sources, such as point source capture from the exhausts of industrial processes and direct capture from air or biological sources. Bio-based fuels using carbon from biological sources are renewable and have been in use. Liquid synthetic fuels could be a direct alternative to liquid fossil fuels [2]. Also, gas-to-liquid fuels are synthetic fuels produced from natural gas and other hydrocarbon gases as feedstocks [6]. One example of synthetic fuels is oxymethylene ethers (OMEx). As a clean alternative fuel for compression-ignition diesel engines, OMEx is a class of dimethyl ether (DME) derivatives that can be produced from a range of waste feedstocks and biomass, thereby avoiding new fossil carbon from entering the supply chain [7,8]. Also, a range of methodologies and processes have been used for the synthesis of OMEx [9]. In addition, OMEx can be produced from stored renewable energy via catalytic conversion of hydrogen (H2) and carbon dioxide (CO2), as an electrofuel (or e-fuel), and thereby used as a sustainable energy carrier [4]. It has been demonstrated that OMEx can substantially reduce harmful emissions of nitrogen oxides (NOx) and particulates (soot), avoiding the NOx-soot trade-off [10,11]. However, OMEx is currently used as a blending component instead of a complete replacement for diesel due to differences in their physicochemical properties [11].
For the rapid integration of liquid alternative fuels into current infrastructures for storage, transport, and direct injection in combustion engines, the physicochemical properties associated with fuel composition must be known. This represents a significant challenge since alternative fuels are often composed of complex mixtures and the physicochemical properties depend on fuel composition variability linked with production source and process [2].
To address this challenge, accurate information on the physicochemical properties of complex mixtures over the engine operational ranges is mandatory to adapt the system operation to alternative fuels, but this is not readily available.
In terms of fuel utilization, important properties of liquid fuel include cetane number (CN), density, flash point, freezing point, autoignition temperature, energy content, and combustion emissions. Cetane number is a key physicochemical property to measure the ignition quality of diesel-like fuels.
The ignition quality is directly linked to the molecular structure of the diesel fuel blends [12]. Although the chemical structures of liquid synthetic fuels are mostly known, a critical aspect faced in designing new diesel-like fuels is understanding the relationship between the molecular structure and fuel properties. Usually, the physicochemical properties are measured in costly experimental facilities [13,14]. Also, no robust physicochemical models able to predict CN are available [15,16].
Recently, machine-learning (ML) models have gained attention in the renewable energy sectors [17,18]. This is the case with fuel design, where ML can be a powerful tool to predict the physicochemical properties of fuels from the chemical structures [19,20]. Furthermore, an AI-assisted fuel design methodology was developed to predict the properties of pure compounds and fuel blends [21]. Freitas et al. [22] proposed a methodology to explore the thermodynamic properties of practical fuels by combining molecular dynamics (MD) simulations and ML models. The results show that ML models can yield accurate predictions of fundamental fuel properties from the chemical compositions of the fuels.
Going further, research efforts have been dedicated to developing predictive CN models based on quantitative structure-property relationships (QSPR) [23,24]. Guan et al. [25] proposed an active subspace method for descriptor selection to build a QSPR predictive model for cetane number. Saldana et al. [26] were pioneers in using ML techniques for CN prediction based on molecular structure. An artificial neural network (ANN) was also developed to predict the cetane numbers of hydrocarbons and oxygenates, which are critical for compression-ignition engines [27]. A systematic data quality analysis integrated with graph neural networks (GNNs) for reliable CN prediction was also developed [28]. Also, Schweidtmann et al. [29] proposed a GNN for predicting fuel ignition quality indicators of hydrocarbons, combining multitask learning, transfer learning, and ensemble learning to address the limited experimental data available to train the model. Despite remarkable advances in the construction of QSPR machine learning models, these methods might suffer from a lack of interpretability, which limits the understanding of the underlying physics/chemistry of the predictive models, i.e., the relationship between the molecular structure and physicochemical property is not fully understood. Explainable artificial intelligence tools are being developed to leverage the interpretability of ML models so that the effects of the input quantities on the predictions can be better understood [30]. The present work aims to characterize the dependence of the cetane number of liquid synthetic fuels on chemical composition/structure using ML techniques to leverage data obtained through experiments. Moreover, the study provides insights into the interpretability of the ML models, highlighting the importance of feature selection and the correlation between molecular descriptors and physicochemical properties, with implications for the development of predictive models for renewable fuels. In particular, the novelty lies in the use of sophisticated ML techniques to design descriptor-based interpretable quantitative structure-property relationship (IQSPR) models. Such an approach allows an understanding of the underlying physics/chemistry which links the chemical structure with the physicochemical properties. To do so, the chemical structure of such complex fuels is described using molecular descriptors [31]. Descriptors allow us to correlate the chemical structure of the fuels with physicochemical properties. Furthermore, we use a recursive feature elimination (RFE) strategy to select the set of most suitable descriptors that best characterize the molecular structure to construct robust IQSPR predictive models. Moreover, we test different ML algorithms to evaluate their performance (as a form of model selection) in constructing IQSPR predictive cetane number models. Finally, we use the SHapley Additive exPlanations (SHAP) [32] method to measure the impact of descriptors on CN predictions.
It is worth pointing out that the ML-assisted liquid synthetic fuel property modeling methodology depicted here has the potential to construct predictive models able to assist the design of alternative fuels with desired properties that can potentially lead to fully sustainable fuel utilization, i.e., engines that are 100% powered by sustainable/renewable synthetic fuels. This would break a barrier in alternative fuel utilization, as synthetic fuels are currently used as a 'drop-in' component instead of on their own. Furthermore, it is worth highlighting that the present approach is used to build predictive models for cetane numbers. Still, such an approach can be extended to other physicochemical properties important for the design of liquid alternative fuels, as it is ubiquitous for data analytics of fuel properties.
The remainder of this paper is organized as follows. The idea of descriptor-based QSPR models is explained in Section 2. In Section 3, the CN dataset used to build the QSPR models is described. Finally, the results are shown in Section 4, and the conclusions are drawn in Section 5.

Molecular Descriptors
A molecular descriptor is a mathematical characterization of a chemical structure. The main idea is transforming chemical information embedded within a symbolic representation of a molecule into a set of features useful to represent the chemical composition [33]. In particular, molecular descriptors reflect complex structural patterns, such as the number of atoms in the molecule, molecular shape, size, as well as atomistic properties [34].
Molecular descriptors can be divided into classes concerning the information content about the chemical structure and the ease of computing them [35]. 1D descriptors are computed from the bulk properties of the molecules, such as the number of atoms, molecular weight, and so on. These descriptors are simple to compute but do not provide any information regarding the molecular structure. 2D molecular descriptors provide topological information based on the 2D structure representation of the molecule, such as the number of rings, the number of hydrogen bond acceptors and donors, matrix-based descriptors, etc. Finally, 3D descriptors reflect information about the spatial coordinates of atoms of the chemical structure.
These descriptors may provide useful information about the molecule, such as the intramolecular hydrogen bonding. However, the determination of 3D descriptors might be a time-consuming task due to their complexity [36,37].
In addition, molecular descriptors based on functional groups in the molecule structure have been proposed to build QSPR models [38,39]. In particular, these descriptors are calculated by counting the number of functional groups, such as methyl (−CH3) and methylene (−CH2−), in the chemical structure.
Molecular descriptors can be a powerful tool to convert molecular structures into values that can be used to construct predictive machine learning models [40,41]. Quantitative structure-property relationship models to predict sooting tendencies were developed using molecular descriptors [42,43].
Such a description may be used to connect important features of the fuel composition with key properties of fuel utilization, allowing a step toward the development of IQSPR-ML models. Going further, revealing the dependence of physicochemical properties of liquid synthetic fuels on fuel mixture chemical composition/chemical structure may lead to new information about the physicochemical properties.
In the present work, we compute the molecular descriptors using the open-source descriptor calculator Mordred [34]. Mordred is chosen for its ease of implementation, as the descriptors are generated from the simplified molecular input line entry specification (SMILES), and because it has a lower computation time than other traditional descriptor calculators such as PaDEL-Descriptor [44]. Mordred can compute more than 1800 molecular descriptors including 2D and 3D descriptors, such as the number of carbon atoms, the number of aromatic bonds, topological indices, matrix-based descriptors, and so on.

Feature Selection
As the descriptor calculator software can provide thousands of molecular descriptors, selecting the set of most suitable descriptors that best characterize the chemical structure is critical for developing interpretable and accurate predictive models. Moreover, over-parametrization may hamper the predictability of IQSPR-ML models. In this sense, feature selection techniques can be applied to remove irrelevant descriptors and select the most important features regarding the physicochemical properties [45].
In this regard, several strategies for selecting the descriptors that significantly contribute to constructing robust QSPR models have been developed [46,47,20]. St. John et al. [43] used principal component analysis (PCA) and RFE to select the best features for the prediction of sooting tendency from molecular structure. Furthermore, a QSPR model for the cetane number of hydrocarbons using the active subspace method to remove irrelevant descriptors has been proposed [25]. Recently, Comesana et al. [45] developed a systematic methodology for selecting features to construct accurate and interpretable predictive models for several physicochemical properties of organic molecules.
Based on Comesana et al. [45], in the present work, we have developed a method using recursive feature elimination to develop an IQSPR model for the prediction of the cetane number of liquid synthetic fuels guided by the molecular structure. RFE is a supervised feature selection method that recursively removes the least important features at each iteration, based on feature importance.
The procedure is repeated until a predefined desired number of features is achieved.
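As a minimal sketch of the elimination loop described above, assuming a least-squares linear model as the estimator, features can be removed one at a time by smallest absolute coefficient. The estimator choice, the function name `rfe_linear`, and the synthetic data are illustrative assumptions and not the configuration used in this work.

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Recursive feature elimination with a least-squares linear estimator.

    Hypothetical helper: at each iteration the model is refit on the
    remaining features and the one with the smallest |coefficient| is dropped.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize so coefficient magnitudes are comparable across features.
    # Constant columns must be removed beforehand (their std is zero).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        coef, *_ = np.linalg.lstsq(Xs[:, remaining], yc, rcond=None)
        weakest = int(np.argmin(np.abs(coef)))
        remaining.pop(weakest)
    return remaining

# Synthetic data (not from the paper): only features 0 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.01 * rng.normal(size=60)
selected = rfe_linear(X, y, n_keep=2)
print(sorted(selected))
```

Scikit-learn provides the same procedure as `sklearn.feature_selection.RFE`, which accepts any estimator that exposes coefficients or feature importances.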
The main steps of the feature selection methodology proposed in the present work are given as follows. First, we generate the molecular descriptors from SMILES using the Mordred software for each molecule. Next, we remove non-numerical values, zeros, and descriptors with 95% of matching values across molecules, i.e., descriptors that have similar values for different molecules, since they do not provide reliable information about the molecule structures. Autocorrelation descriptors may not be linked with the molecular structure and physicochemical properties of the molecule [45,48], so they are also removed. We then compute the correlation between the molecular descriptors and the physicochemical property to be predicted and rank the descriptors from the highest to lowest correlation. The correlation provides prior knowledge of the most relevant descriptors before employing the RFE method and avoids eliminating reliable information. Here, the Spearman coefficient is used [49]. Next, we select the top-ranked descriptors and remove the descriptors with a Spearman correlation coefficient greater than 0.9 with respect to the top-ranked descriptors to avoid redundant information. Removing collinear features can also help the QSPR model to generalize and improve the interpretability of the model [50]. The remaining descriptor set is used as input for the RFE method.
Here, the Scikit-learn [51] ML library is used to implement the RFE method with a model estimator (e.g., a predictive model that assigns weights to features). Further details about the ML estimator models are presented in the next subsection.
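The filtering steps that precede RFE can be sketched in plain Python as follows. This is an illustration under stated assumptions: the helper names (`is_informative`, `filter_descriptors`), the toy descriptor table, and the greedy order in which redundant descriptors are dropped are hypothetical, while the 95% and 0.9 thresholds are taken from the text.

```python
from collections import Counter

def ranks(values):
    """Average ranks (1-based); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return r

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman(a, b):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def is_informative(values, max_share=0.95):
    """False if one value covers at least max_share of the molecules."""
    return Counter(values).most_common(1)[0][1] / len(values) < max_share

def filter_descriptors(table, target, max_corr=0.9):
    """Rank by |Spearman| with the target, then drop descriptors that are
    collinear (|Spearman| > max_corr) with an already kept, higher-ranked one."""
    usable = [name for name in table if is_informative(table[name])]
    ranked = sorted(usable, key=lambda n: abs(spearman(table[n], target)), reverse=True)
    kept = []
    for name in ranked:
        if all(abs(spearman(table[name], table[k])) <= max_corr for k in kept):
            kept.append(name)
    return kept

# Toy values for illustration only (not data from this work).
target = [10, 20, 30, 40, 50, 60]
table = {
    "n_carbon": [1, 2, 3, 4, 5, 6],
    "mol_weight": [16, 30, 44, 58, 72, 86],  # monotone in n_carbon, redundant
    "n_rings": [0, 1, 0, 1, 0, 1],
    "constant": [0, 0, 0, 0, 0, 0],          # carries no information
}
kept = filter_descriptors(table, target)
print(kept)
```

The surviving descriptor names would then be passed to the RFE step.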

Machine Learning Models
In this section, we present a brief description of ML models for a generic property γ that is a function of the chemical structure/composition and state variables. In particular, the aim is to learn a mapping f characterizing the macroscopic thermodynamic relation between the physicochemical property and the molecular structure:
$$\gamma = f(\Phi, \theta, \xi).$$
Here, f is a map that acts as an equation of state that may replace costly experimental measurements. Φ is the vector of molecular descriptors that characterize the chemical structure of the fuel. Also, θ is the vector of state variables, such as temperature and pressure. The vector ξ denotes potential noise from the data and is often considered random.
Despite many works dedicated to constructing QSPR based on ML models [52,53], there is no clear statement regarding the best ML algorithms to be chosen. The choice of the ML model is a challenge, as it might depend on the physicochemical properties to be learned and their relationship with the molecular descriptors. This becomes more critical in a small data regime, where ML models are prone to overfitting and generalize poorly [54,55].
In this context, we test eight different ML algorithms. The models used are the least absolute shrinkage and selection operator (LASSO), support vector regression (SVR), random forest (RF), gradient boosting regression (GBR), and Gaussian process regressor (GP). These models are implemented using Scikit-learn [51]. We also implemented a light gradient-boosting machine (LGBM) [56] and extreme gradient boosting (XGB) [57]. Moreover, we train a multi-layer perceptron (MLP) using PyTorch [58]. All ML models are deterministic except the GP model. GP is a popular probabilistic Bayesian approach to building predictive models, mainly due to its capacity to handle both parametric and epistemic uncertainties [59]. The optimization of model hyperparameters is detailed in the next section.
The aim of testing different ML algorithms is to evaluate the performance (like a model selection) of the ML algorithms to construct IQSPR predictive models for physicochemical properties guided by the molecular structure.

Dataset
In this paper, the cetane number dataset is acquired from the report by Murphy et al. [60]. The dataset consists of experimental CNs of hydrocarbons obtained using four different methods: ASTM D613, ASTM D6890, a blend method, and an unknown method. Constructing a dataset based on CNs of hydrocarbons is a natural choice since they are typical surrogates for diesel fuels [61], and the design of alternative fuel blends is based on them [2].
Here, we follow the same approach proposed by Guan et al. [25] to build the dataset, where the CNs from the first two methods are prioritized due to measurement repeatability, and the mean values are considered for fuels with multiple measurements. Thus, the final dataset consists of 110 CNs of hydrocarbons divided into n-alkanes, iso-alkanes, cycloalkanes, alkenes, and aromatics (see Table S1 in the Supplementary Material (SM) for a complete list of CNs).

Results and Discussions
In this section, we assess the performance of the ML-assisted liquid synthetic fuel design methodology to construct IQSPR predictive models of CNs for hydrocarbons. Such an assessment has two main goals: to evaluate the performance of descriptor-based IQSPR-ML models in predicting CNs, and to investigate which molecular descriptors are most important for predicting CNs and how the molecular structure relates to the physicochemical property. Also, we describe how the methodology can be extended to the design of alternative fuels. The code and data accompanying this manuscript are made publicly available at https://github.com/RodolfosmFreitas/AI-SyntheticFuel.

Performance of descriptor-based IQSPR-ML models
To evaluate the performance of descriptor-based QSPR-ML models, we compute the coefficient of determination (R2-score), the mean absolute error (MAE), and the mean absolute percentage error (MAPE). These are commonly used metrics to measure the performance and accuracy of ML models.
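For concreteness, the three metrics can be computed as in the sketch below, which assumes their standard textbook definitions with MAPE expressed as a fraction rather than a percentage. The function names and the sample values are illustrative and are not results from this work.

```python
def r2_score(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

def mean_absolute_error(measured, predicted):
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)

def mean_absolute_percentage_error(measured, predicted):
    """Returned as a fraction (0.05 means 5%); measured values must be nonzero."""
    return sum(abs((m - p) / m) for m, p in zip(measured, predicted)) / len(measured)

# Made-up cetane numbers for illustration only.
measured = [40.0, 50.0, 60.0, 80.0]
predicted = [44.0, 45.0, 63.0, 80.0]

r2 = r2_score(measured, predicted)                          # 1 - 50/875
mae = mean_absolute_error(measured, predicted)              # 3.0
mape = mean_absolute_percentage_error(measured, predicted)  # 0.0625
print(round(r2, 4), mae, mape)
```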
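The R2-score is given by:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (\gamma_i - \hat{\gamma}_i)^2}{\sum_{i=1}^{n} (\gamma_i - \bar{\gamma})^2},$$
where $n$ is the number of samples (molecules), $\gamma_i$ is the measured physicochemical property, $\hat{\gamma}_i$ is the predicted value, and $\bar{\gamma} = \frac{1}{n}\sum_{i=1}^{n} \gamma_i$ is the mean over the samples. The MAE is a commonly used metric because it is not very sensitive to large errors and outliers, which is desirable if the dataset spans different scales. It is given as follows:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| \gamma_i - \hat{\gamma}_i \right|.$$
Also, the MAPE metric is given by:
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \left| \frac{\gamma_i - \hat{\gamma}_i}{\gamma_i} \right|.$$
Furthermore, to avoid overfitting we use k-fold cross-validation [54]. The [...] consider 0.9 a good target. Such behavior may suggest that these models are overfitting, which is typical when we are dealing with a small data regime.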
That is further confirmed in Fig. 2(b). We observe that the tree-based models present a lower MAE on the training set, but the error increases significantly on the test set. The same behavior is observed for the probabilistic GP model.
Furthermore, linear models like LASSO and SVR (with a linear kernel) do not reach a satisfactory accuracy. That might suggest a non-linear link between the molecular descriptors and the cetane number.
Also, we can verify that the MLP model returns a satisfactory accuracy in both train and test sets. Thus, we rank the MLP descriptor-based model first for CN predictions, since the goal here is to construct a QSPR model able to generalize patterns in training data so that it becomes feasible to correctly predict CNs from new unseen data (molecules). Here, the MLP has 2 hidden layers and 32 neurons per layer. The MLP hyperparameters are selected using Ray Tune [62], searching for the configuration that gives the lowest mean squared error. The hidden layers use a hyperbolic tangent activation function. The models are trained for 1,000 stochastic gradient descent steps using the Adam optimizer [63] with a learning rate of 5 × 10^−3, together with a learning rate scheduler that drops the rate by a factor of 10 on plateau during the training process. Indeed, MLPs are universal function approximators that can detect and decode intrinsic relations from data. MLP models are an effective tool for accurately predicting several physicochemical properties [40,42].
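The architecture and training settings stated above can be sketched in PyTorch as follows. Only the layer sizes, the tanh activation, the Adam learning rate, the 1,000 steps, and the plateau scheduler come from the text. The number of input descriptors, the scheduler patience, the full-batch updates, and the synthetic data are assumptions made for illustration, and the Ray Tune search is not reproduced.

```python
import torch
from torch import nn

torch.manual_seed(0)

n_descriptors = 10  # assumed number of selected descriptors (hypothetical)

# Two hidden layers with 32 neurons each and tanh activations.
model = nn.Sequential(
    nn.Linear(n_descriptors, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, 1),
)

optimizer = torch.optim.Adam(model.parameters(), lr=5e-3)
# Reduce the learning rate by a factor of 10 when the loss stops improving.
# The patience value is an assumption; the text does not state it.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=50
)
loss_fn = nn.MSELoss()

# Synthetic stand-in for scaled descriptors and targets (not the CN dataset).
X = torch.randn(80, n_descriptors)
y = torch.tanh(X[:, :1]) + 0.5 * X[:, 1:2]

losses = []
for step in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)  # full-batch update for simplicity
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())
    losses.append(loss.item())

print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.4f}")
```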
Table 1 shows the test dataset accuracy metrics for the different models.
As we can see, the MLP-QSPR model returns an MAE of 5.074, an acceptable value since the experimental error range of CN measurements is from 2.8 to 4.8 [64]. Also, by comparing the predictions of the present model with QSPR models available in the literature, we can verify that our model presents competitive performance compared to the state-of-the-art QSPR models. In particular, Guo et al. [27] used an ANN to construct predictive models for CNs of hydrocarbons and oxygenates with average absolute errors of 6.5 and 4.4 for compounds with and without rings, respectively. Finally, a QSPR model combining an active subspace method and linear regression returns an average absolute error of 5.0 [25]. Here, we do not claim that this represents a fair comparison, as the models have been constructed on different train/test sets, but we use it to demonstrate the performance of the present methodology to build QSPR models.
Figure 3 shows the parity plots between the measured and the predicted [...] suggests that the model captures well the relationship between the molecular structure and CNs. Furthermore, we measure the structural deficiency of the MLP model (also called epistemic uncertainty [65]) for CN predictions.
The model error follows a normal distribution with zero mean and constant variance, as shown in Fig. 4. This implies that such discrepancies are only due to uncorrelated noise/fluctuations in the data, i.e., not to structural deficiencies of the model.
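A simple way to inspect this kind of residual behavior is sketched below: it reports the mean and standard deviation of the residuals and the share falling within one standard deviation of the mean, which is about 68% for a normal distribution. The helper name `residual_summary` and the synthetic predictions are illustrative assumptions. The sketch is not the analysis behind Fig. 4 and does not test for constant variance.

```python
import random
import statistics

def residual_summary(measured, predicted):
    """Mean, standard deviation, and one-sigma coverage of the residuals."""
    residuals = [m - p for m, p in zip(measured, predicted)]
    mean = statistics.fmean(residuals)
    std = statistics.stdev(residuals)
    share = sum(abs(r - mean) <= std for r in residuals) / len(residuals)
    return mean, std, share

# Synthetic example: predictions equal the measurements plus Gaussian noise.
rng = random.Random(0)
measured = [rng.uniform(0.0, 100.0) for _ in range(200)]
predicted = [m + rng.gauss(0.0, 4.0) for m in measured]

mean, std, share = residual_summary(measured, predicted)
print(f"mean={mean:.2f}, std={std:.2f}, within one sigma={share:.2f}")
```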
In light of the design of alternative fuels, the knowledge of the physicochemical properties of such fuels raises the potential to leverage the efficiency of the existing global infrastructure and combustion engines, carbon neutrality (emissions), and financial viability. In this context, we extend the methodology to construct an IQSPR predictive model for the CN of esters and ethers. Biodiesel is a low-carbon alternative fuel to petrodiesel [66], which is primarily composed of fatty acid methyl esters (FAMEs). Furthermore, oxymethylene dimethyl ethers, including dimethyl ether (OME0), have been [...] Here, the trained MLP-QSPR model for CNs of hydrocarbons is used to initialize the parameters of the MLP model for CNs of esters and ethers, which mimics a transfer learning approach [29], and the model is retrained using the ester and ether data. The results show that the model predicts the OMEx CNs very well and demonstrate the potential of the present approach to assist the design of new fuels (see Table S2 in the Supplementary Material (SM) for a complete list of CNs of esters and ethers).

Model Interpretability
In the current context, model interpretability means the ability to correctly and efficiently understand the relationship between the molecular structure and the physicochemical property, allowing the construction of IQSPR models that present the underlying chemistry behind predictions. Here, we do not claim to define the meaning of all molecular descriptors but instead provide some physical/chemical basis for linking the top-ranked descriptors and physicochemical properties.
The key advantage of descriptor-based IQSPR-ML models is the ability to reveal the importance of each descriptor in predicting CNs. Here, feature importance is evaluated using the SHAP approach, which is a game-theoretic approach to explain the predictions of ML models. Fig. 6 shows the average impact on the cetane number of the top ten molecular descriptors for the MLP model. These descriptors suggest that molecular size, charge distribution, and atomistic properties affect the CN of hydrocarbons. In particular, we highlight the spectral mean absolute deviation from the Barysz matrix weighted by van der Waals volume (SpMAD_Dzv) as the top-ranked descriptor. Indeed, matrix-based descriptors are topostructural and topochemical indices computed from a graph representation of the molecule [67]. More specifically, the molecule's structure is represented as a hydrogen-depleted graph, from which several graph-theoretical matrices are calculated, and the chemical information is extracted through a set of basic algebraic operators. In particular, SpMAD_Dzv is a weighted distance matrix-based descriptor that encodes chemical and topological information derived from the presence of heteroatoms and multiple bonds in the molecule [68]. Such a descriptor accounts for molecular shape and branching, and it is computed by a sum of atomic properties of the carbon atom weighted based on bond order and atomic properties of the bonded atoms. So, it might be expected that the values tend to increase with the chain length, branching, and heteroatoms. The complementary information content (CIC) is an information-theoretic topological parameter calculated from the premise that each chemical structure contains a subset of elements. It is directly linked to the molecular structure and, specifically, it is a function of the number of atoms in the molecule [33].
Therefore, it is a natural choice since it is well known that the CN for hydrocarbons is related to the number of carbons in the molecule [12]. Here, we also highlight the molecular connectivity indices (AXp-2dv), which encode different aspects of atom connectivity within a molecule, such as the amount of branching, ring structures present, and flexibility. Molecular connectivity indices aim to measure the extent of branching of the carbon-atom skeleton of saturated hydrocarbons. These indices represent a good measure of the molecular area and are useful to describe the physicochemical properties of molecules as a measure of intermolecular interactions [69]. This descriptor plays a role in aromatic molecules, as will be shown subsequently. Table 2 shows a complete definition of top-ranked descriptors. [...] S1 in the SM illustrates the correlation analysis for iso-alkanes, cycloalkanes, and alkenes.) In fact, such correlation analysis allows us to construct IQSPR predictive [...] with SpMAD_Dzv, as shown in Fig. 7. The same tendency is confirmed for alkenes and iso-alkanes, as we can see in Fig. S2 in the SM.
For iso-alkanes, if the branch is concentrated at one end of the molecular structure, leaving a long chain at the other end, the CNs tend to increase with the size of the long chain. Also, the size of the branch and the position play a role but are secondary compared with the size of the long chain. Similar to alkanes, CNs of alkenes tend to increase with molecular size, aligning with the choice of SpMAD_Dzv as the most important descriptor. On the other hand, aromatics have presented a non-monotonic behavior with the top correlated descriptor (AXp-2dv). Also, as we can see, the CNs of aromatics tend to increase with the size of the n-alkyl group attached to the aromatic ring, aligning with the results found in the literature. This explains the choice of SpMAD_Dzv as the second most correlated descriptor since it is directly linked with molecular size [12]. Furthermore, these results suggest that the number of descriptors required to describe the molecule correctly might increase with the complexity of the structure. That is the case of n-dodecylbenzene and 2-phenyltetradecane, which have very similar structures, as can be seen in Fig. 9, so that more than one descriptor is required to distinguish the molecules.
Thus, the selection of the most relevant descriptors is crucial; the wrong choice may hamper the predictability of QSPR-ML models, especially in a small data regime. [...] molecule (MAXsCH3 and SsCH3). The ETA descriptors are directly correlated with molecular shape and size [71]. The E-state indices combine the electronic state of the bonded hydrogen atoms with the topological structure of the molecule. These molecular descriptors have been demonstrated to be very useful for constructing predictive QSPR models [72]. Moreover, we can verify that ether molecules present a high negative correlation with the RPCG descriptor. This result might suggest that descriptors related to intermolecular bonding interactions can be a good choice to construct IQSPR predictive models for CN prediction of renewable fuels like oxymethylene dimethyl ethers, as shown in Fig. 11.
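To illustrate the game-theoretic idea behind the SHAP attributions used in this section, the sketch below computes exact Shapley values for a toy model by enumerating all feature orderings against a single baseline input. It is a conceptual illustration only: the function name `shapley_values`, the toy model, and the baseline are assumptions, and the SHAP library relies on approximations and a background dataset rather than full enumeration.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature for the prediction model(x).

    Features absent from a coalition take their baseline value. The cost
    grows factorially with the number of features, so this is only
    practical for very small inputs.
    """
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]           # add feature i to the coalition
            value = model(current)
            phi[i] += value - previous  # marginal contribution of feature i
            previous = value
    return [p / len(orders) for p in phi]

# Toy model with one additive term and one interaction term.
def toy_model(d):
    return 2.0 * d[0] + d[1] * d[2]

x = [3.0, 2.0, 4.0]
baseline = [1.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction at `x` and at the baseline, and the interaction term is shared equally between the two features that produce it.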

Conclusions
We have presented a novel ML-assisted liquid synthetic fuel property prediction methodology that can potentially assist in fully sustainable fuel utilization, i.e., the engines are 100% powered by sustainable/renewable synthetic fuels. The methodology developed can be summarized as follows.
Molecular descriptors have proven to be a useful tool to characterize the molecular structure of fuels. The use of feature selection methods may reduce the number of features by minimizing the correlations between the molecular descriptors to develop high-performing models. The novelty of the presented approach lies in the use of ML models to construct robust interpretable QSPR predictive models of fuel properties based on molecular structure. Such an approach provides tools to understand the underlying physics/chemistry of the link between the molecular structure and the fuel properties.
The effectiveness of the approach was tested by constructing IQSPR machine-learning models for CN predictions. CN is a key physicochemical property that measures the quality of diesel fuels during the ignition process. Therefore, developing accurate IQSPR predictive models for CN predictions is essential for the design of alternative diesel fuels. The results showed that the methodology predicts the CN well for different fuel molecules. Moreover, the key advantage of the resulting models is the ability to correctly and efficiently reveal the relationship between the molecular structure and CNs.
It is worth mentioning that the present approach was used to predict the CN, but such a methodology can be extended to any important physicochemical property for the design of liquid alternative fuels because the ML approach is based on an interpretable quantitative structure-property relationship that is ubiquitous for data analytics of fuel properties. Indeed, the prediction of reliable physicochemical properties of alternative fuels is an important step forward towards the generation of digital tools that can assist in decarbonization by the use of renewable fuels. In this context, future studies will be focused on extending the methodology to inverse fuel design by exploring the chemical space to learn fuel blends matching the desired properties, leveraging the design of alternative fuels toward fully sustainable fuel utilization.
train the descriptor-based IQSPR-ML models we preprocessed the dataset. Here, we use the unsupervised K-means clustering algorithm to divide the dataset into six subdomains based on CN values. This ensures that data points from each subdomain appear in both the training and testing sets. The dataset is scaled using the standard scaler, i.e., by removing the mean and dividing by the standard deviation. Next, 75% of the data points of each subdomain are selected randomly to train the IQSPR-ML models. The remaining 25% are used to test them. Such a strategy enhances the ability of the IQSPR models to generalize to unseen fuel molecules. Figure 1 shows an overview of the training and testing data, including the percentage of the five categories of hydrocarbons used in the present work. In terms of numbers, 80 CNs are used for training, and 30 CNs are used for testing.
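The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn, uses synthetic descriptor and CN arrays (`X` and `cn` are hypothetical stand-ins for the real dataset), and follows the order given in the text, with scaling applied before the split. Fitting the scaler on the training subset only is a common alternative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 110 molecules, 8 descriptors each.
X = rng.normal(size=(110, 8))
cn = rng.uniform(0.0, 110.0, size=110)

# Step 1: cluster on the CN values alone to define six CN subdomains.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)
labels = kmeans.fit_predict(cn.reshape(-1, 1))

# Step 2: standard scaling (remove the mean, divide by the standard deviation).
X_scaled = StandardScaler().fit_transform(X)

# Step 3: random 75/25 split drawn within each subdomain (stratified by cluster).
X_train, X_test, y_train, y_test, lab_train, lab_test = train_test_split(
    X_scaled, cn, labels, test_size=0.25, stratify=labels, random_state=0
)

print(X_train.shape[0], "training points,", X_test.shape[0], "testing points")
```

Stratifying on the cluster label is what guarantees that every CN subdomain is represented on both sides of the split.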

Figure 1: Distribution of the cetane number training and testing datasets.

models are trained using k − 1 folds as the training data, and the remaining left-out fold is used as the validation set. Here, we split the training dataset into k = 5 folds. The cross-validation is also repeated 5 times with different random seeds to evaluate the robustness of this approach. The ensemble models are optimized by grid search over the number of estimators ∈ {10, 20, 50, 100}, i.e., the number of trees in the forest or the number of boosting stages to perform. For all models, the best number of estimators was 100. Moreover, we test different forms of the kernel for the GP model, including the radial basis function (RBF) and the Matérn family [59]. The RBF kernel returns the best predictions.
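A minimal sketch of this model-selection procedure is given below, assuming scikit-learn and synthetic data. The random forest stands in for any of the ensemble models, and the GP settings (`alpha`, `normalize_y`, default kernel hyperparameters) are illustrative assumptions rather than the values used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic training set (assumption): 80 samples, 5 scaled descriptors.
X = rng.normal(size=(80, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=80)

# 5-fold cross-validation repeated 5 times with different shuffles.
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)

# Grid search over the number of estimators for an ensemble model.
param_grid = {"n_estimators": [10, 20, 50, 100]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0), param_grid, cv=cv, scoring="r2"
)
search.fit(X, y)
print("best number of estimators:", search.best_params_["n_estimators"])

# Compare two GP kernel families with the same cross-validation splits.
kernels = {"RBF": RBF(), "Matern": Matern(nu=1.5)}
gp_scores = {}
for name, kernel in kernels.items():
    gp = GaussianProcessRegressor(
        kernel=kernel, alpha=1e-2, normalize_y=True, random_state=0
    )
    gp_scores[name] = cross_val_score(gp, X, y, cv=cv, scoring="r2").mean()
print(gp_scores)
```

Reusing the same `cv` object for every candidate keeps the comparison between hyperparameters and kernels on identical splits.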

Figure 2(a) shows the train and test set R² scores for the eight QSPR-ML models. As we can see, ensemble models such as RF, GBR, and XGB present high scores for the train set, with R² ≥ 0.95. However, such models perform the worst on the test set, returning an R² score lower than 0.9. Here, we

Figure 3: Parity plot between predicted and measured cetane number of hydrocarbons for the MLP model.

Figure 4: Probability distribution of the model error of the MLP model.

Figure 5: Parity plot between predicted and measured cetane number of esters and ethers for the MLP model.

Figure 6: Feature importance of descriptor-based MLP model. The top 10 important features are measured as the average magnitude of SHapley Additive exPLanations (SHAP) values.
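The ranking described in this caption, the average magnitude of SHAP values per feature, reduces to a column-wise mean of absolute values once a SHAP matrix is available. The sketch below assumes such a matrix of shape (samples, features) has already been computed by a SHAP explainer. Here it is replaced by a synthetic array, and the descriptor names are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a SHAP matrix (assumption): 50 samples, 12 descriptors.
# Column j has magnitude j + 1 with a random sign, so the ranking is known.
n_samples, n_features = 50, 12
signs = rng.choice([-1.0, 1.0], size=(n_samples, n_features))
shap_values = signs * np.arange(1, n_features + 1)

# Hypothetical descriptor names, used only for display.
names = [f"descriptor_{j}" for j in range(n_features)]

# Global importance: mean absolute SHAP value of each feature.
importance = np.abs(shap_values).mean(axis=0)

# Indices of the ten most important features, largest first.
top_idx = np.argsort(importance)[::-1][:10]
for j in top_idx:
    print(f"{names[j]}: {importance[j]:.2f}")
```

Taking the absolute value before averaging matters: positive and negative contributions would otherwise cancel and hide influential descriptors.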

Figure 7: Pearson's correlation between the molecular descriptors and cetane number for different molecular structures.
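As an illustration of the correlation analysis summarized in this figure, Pearson's coefficient between each descriptor and the CN can be computed and ranked by absolute value. The sketch below uses NumPy only. The CN values and the three descriptors are synthetic, hypothetical examples chosen so that the expected signs are obvious.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic CN values for 40 molecules (assumption).
cn = rng.uniform(20.0, 100.0, size=40)

# Hypothetical descriptors: one increasing with CN, one decreasing, one unrelated.
descriptors = {
    "size_like": 2.0 * cn + 5.0,
    "inverse_like": -0.5 * cn + 3.0,
    "noise": rng.normal(size=40),
}


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


corr = {name: pearson_r(values, cn) for name, values in descriptors.items()}

# Rank descriptors by the strength of the linear relationship, ignoring sign.
ranked = sorted(corr, key=lambda name: abs(corr[name]), reverse=True)
for name in ranked:
    print(f"{name}: r = {corr[name]:+.3f}")
```

In an analysis like the one in the paper, the same calculation would be repeated separately for each molecular family, since the most correlated descriptor differs between alkanes, alkenes, and aromatics.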

Finally, we assess the correlation of the ten top descriptors with the CNs of esters and ethers. Note that the CNs of esters correlate with topological indices such as the extended topochemical atom (ETA) index and the electrotopological state indices (E-state) for the -CH3 groups in the

Figure 8: Predictions of the descriptor-based IQSPR-ML model for cetane number of hydrocarbons. Measured CNs (black) and predicted CNs (red).

Figure 9: Comparison of molecular descriptors of n-dodecylbenzene and 2-phenyltetradecane.

Figure 10: Pearson's correlation between the molecular descriptors and cetane number for different molecular structures.

Figure 11: Predictions of the descriptor-based IQSPR-ML model for cetane number of OMEx. Measured CNs (black) and predicted CNs (red).

Table 1: Accuracy metrics of the ML models for the test set.
model. Here, we use the whole dataset to show that the MLP model is not biased towards a specific category of hydrocarbons. The parity plot shows that the MLP model predictions lie near the diagonal x = y line, which