Quantitative Structure–Retention Relationships with Non-Linear Programming for Prediction of Chromatographic Elution Order

In this work, we employed a non-linear programming (NLP) approach via quantitative structure–retention relationships (QSRRs) modelling for prediction of elution order in reversed phase-liquid chromatography. With our rapid and efficient approach, error in prediction of retention time is sacrificed in favor of decreasing the error in elution order. Two case studies were evaluated: (i) analysis of 62 organic molecules on the Supelcosil LC-18 column; and (ii) analysis of 98 synthetic peptides on seven reversed phase-liquid chromatography (RP-LC) columns with varied gradients and column temperatures. On average across all the columns, all the chromatographic conditions and all the case studies, percentage root mean square error (%RMSE) of retention time exhibited a relative increase of 29.13%, while the %RMSE of elution order a relative decrease of 37.29%. Therefore, sacrificing %RMSE(tR) led to a considerable increase in the elution order predictive ability of the QSRR models across all the case studies. Results of our preliminary study show that the real value of the developed NLP-based method lies in its ability to easily obtain better-performing QSRR models that can accurately predict both retention time and elution order, even for complex mixtures, such as proteomics and metabolomics mixtures.


Introduction
Quantitative structure-retention relationships (QSRRs) [1,2] modelling has become a de-facto standard for the prediction of retention time in reversed-phase liquid chromatography analysis, which accounts for >90% of separations in modern laboratories [3]. QSRRs and retention prediction in general have numerous applications. From identification of the most informative structural molecular descriptors with respect to retention mechanisms, prediction of retention for new analytes, up to comparison of different chromatographic columns and determination of physical properties (lipophilicity, dissociation constants, relative bioactivities).
Elution order in reversed phase-liquid chromatography (RP-LC) is typically governed by polarity of the mobile phase, whereby the more hydrophobic the analytes are, the longer it takes for them to elute (decreasing polarity) [4]. For simple analytical mixtures (e.g., <10 analytes), it is straightforward to predict their elution order based on hydrophobicity (expressed as logP) which can be determined either experimentally or in silico. However, nowadays, chromatographers face analytical mixtures with ever-increasing complexity (e.g., proteomics, wastewater, pharmaceutical mixtures) which can Int. J. Mol. Sci. 2019, 20, 3443 2 of 11 lead to chromatograms comprised of thousands of close and even overlapping peaks. In this case, a retention time prediction model with a low error does not guarantee the same for elution order.
There are only a few studies in literature dealing with the problem of elution order prediction in RP-LC including our previous work where we presented a multi-objective-optimization (MOO)-based method [4][5][6][7]. For instance, Vorslova et al. [5] present a study for prediction of retention times of phenylisothiocyanate derivatives of 25 natural amino acids using gradient RP-LC. The two-parameter solvatic sorption QSRR model with three physicochemical constants was used for prediction of the retention times. Namely, the electrostatic interaction energy of analytes with water, partial molar volume of analytes in water, surface tension and dielectric permittivity values for both the mobile and stationary phases, and a constant which includes the phase ratio and other characteristics of both stationary and mobile phases. The authors have reported average deviations between predicted and experimental retention time values of <6%, while the predicted elution order mostly corresponded to the experimental ones, with some larger deviations for retention times >15 min, with several unresolved (simulated) peaks.
Shinoda et al. [6] have used artificial neural networks (ANNs) to model the retention times of peptides with up to 50 amino acid residues. The authors report a good model for 834 peptides (with the determination coefficient, R 2 of 0.928). The QSRR model is further applied to a dataset of 121,273 peptides resulting from LysC-digestion of the Escherichia coli proteome, however without experimental validation. The developed ANN-based QSRR model has also been used to predict elution order for improvement of peptide identifications in reversed phase-liquid chromatography / tandem mass spectrometry (RP-LC-MS/MS) workflows. Elution order of peptides was predicted with an error of <11%. The method itself was based on prediction of anteroposterior relations of each peptide pair. However, the details of the methodology are not very well described. On the other hand, Bach et al. [7] presented a complex machine learning-based methodology for prediction of elution order in metabolomics based on rank support vector machines and dynamic programming.
The developed QSRR models were based on molecular fingerprints of two molecules as input and elution order as output. The authors postulate that elution order is far more conserved across different columns and instruments than retention time seemingly overcoming the main limitation of QSRR. However, the results of the elution order predictions are quite sensitive to the composition and number of training samples, while the developed method itself is computationally intensive [7].
In our previous work [4], we have presented an MOO-based elution order prediction method using genetic algorithms (GA) [8,9] for optimization employing two QSRR models with a priori selected molecular descriptors related to the RP-LC retention mechanism. Although the presented results were quite promising, showing "positive" trends (i.e., considerable decrease in elution order errors, with an increase of retention time errors), GA required considerable computing times of several minutes, whereas the execution of the multiple linear regression-non-linear programming (MLR-NLP) is nearly instantaneous. On top of that, the interior-point algorithm used to solve the NLP formulation of elution order is much less complex than GA.
In this work, we have defined elution order prediction as an NLP problem ( Figure 1) with relaxed constraints; considerably faster compared to the MOO-based method. The developed NLP-based method is directly implemented within the QSRR modelling process and was used for prediction of elution order of two (more simple) analytical mixtures: (i) analysis of 62 organic molecules on the Supelcosil LC-18 column; and (ii) analysis of 98 synthetic peptides on seven RP-LC columns with varied gradients and column temperatures. Results are compared to the QSRR models built using only multiple linear regression (MLR) [10] termed control models.
peptides resulting from LysC-digestion of the Escherichia coli proteome, however without experimental validation. The developed ANN-based QSRR model has also been used to predict elution order for improvement of peptide identifications in reversed phase-liquid chromatography / tandem mass spectrometry (RP-LC-MS/MS) workflows. Elution order of peptides was predicted with an error of < 11%. The method itself was based on prediction of anteroposterior relations of each peptide pair. However, the details of the methodology are not very well described. On the other hand, Bach et al. [7] presented a complex machine learning-based methodology for prediction of elution order in metabolomics based on rank support vector machines and dynamic programming. The developed QSRR models were based on molecular fingerprints of two molecules as input and elution order as output. The authors postulate that elution order is far more conserved across different columns and instruments than retention time seemingly overcoming the main limitation of QSRR. However, the results of the elution order predictions are quite sensitive to the composition and number of training samples, while the developed method itself is computationally intensive.

Results and Discussion
In this work, an NLP-based formulation directly implemented within the QSRR modelling process has been derived for prediction of chromatographic elution order in RP-LC. The method was applied to two case studies with rather simple separations on seven columns in varied chromatographic conditions. Two QSRR models were evaluated: one for RP-LC separation of organic compounds, and the other for the RP-LC separation of peptides. MLR was used to construct "control" models, while an NLP formulation was formed to solve the problem of elution order prediction. The two were compared in terms of performance and with the paired t-test.
As it can be observed from Figures 2 and 3, most of the columns follow a "positive" trend; with the increase of retention time %RMSE, %RMSE of elution order considerably decreases, with a few exceptions (Kaliszan 1, Licrospher 1, and Licrospher 4). In our previous work [4], we have presented an MOO-based elution order prediction method using genetic algorithms (GA) [8,9] for optimization employing two QSRR models with a priori selected molecular descriptors related to the RP-LC retention mechanism. Although the presented results were quite promising, showing "positive" trends (i.e., considerable decrease in elution order errors, with an increase of retention time errors), GA required considerable computing times of several minutes, whereas the execution of the multiple linear regression-non-linear programming (MLR-NLP) is nearly instantaneous. On top of that, the interior-point algorithm used to solve the NLP formulation of elution order is much less complex than GA.
In this work, we have defined elution order prediction as an NLP problem ( Figure 1) with relaxed constraints; considerably faster compared to the MOO-based method. The developed NLP-based method is directly implemented within the QSRR modelling process and was used for prediction of elution order of two (more simple) analytical mixtures: (i) analysis of 62 organic molecules on the Supelcosil LC-18 column; and (ii) analysis of 98 synthetic peptides on seven RP-LC columns with varied gradients and column temperatures. Results are compared to the QSRR models built using only multiple linear regression (MLR) [10] termed control models.

Results and Discussion
In this work, an NLP-based formulation directly implemented within the QSRR modelling process has been derived for prediction of chromatographic elution order in RP-LC. The method was applied to two case studies with rather simple separations on seven columns in varied chromatographic conditions.  Two QSRR models were evaluated: one for RP-LC separation of organic compounds, and the other for the RP-LC separation of peptides. MLR was used to construct "control" models, while an NLP formulation was formed to solve the problem of elution order prediction. The two were compared in terms of performance and with the paired t-test.
As it can be observed from Figures 2 and 3, most of the columns follow a "positive" trend; with the increase of retention time %RMSE, %RMSE of elution order considerably decreases, with a few exceptions (Kaliszan 1, Licrospher 1, and Licrospher 4). Statistical significance of the differences between the QSRR model performances for all the columns between the two methods (MLR and MLR-NLP) has been tested with a paired t-test. Table 1 summarizes the t-test results and it was shown that the two approaches exhibit statistically significant differences (p < 0.05). The relative differences between %RMSE in retention time and elution order are evident from Figure 4, with deviations from the "positive" trend for three chromatographic columns, in which the NLP-based method surprisingly exhibited better performance than the MLR control models in terms of %RMSE(tR).  Statistical significance of the differences between the QSRR model performances for all the columns between the two methods (MLR and MLR-NLP) has been tested with a paired t-test. Table 1 summarizes the t-test results and it was shown that the two approaches exhibit statistically significant differences (p < 0.05). The relative differences between %RMSE in retention time and elution order are evident from Figure 4, with deviations from the "positive" trend for three chromatographic columns, in which the NLP-based method surprisingly exhibited better performance than the MLR control models in terms of %RMSE(t R ). In fact, one of the chromatographic columns, Supelcosil LC, has exhibited a decrease in both %RMSE(t R ) and %RMSE(order). These deviations can be explained with the non-linearity between the parameters calculated from the molecular structure of the analytes and their retention times. Thereby, for the columns in question, our formulation has led to a better QSRR model. The MLR model itself is fully linear, whereas our NLP-based formulation introduces a degree of non-linearity due to its multinomial quadratic form (see Section 3.4.). In fact, one of the chromatographic columns, Supelcosil LC, has exhibited a decrease in both %RMSE(tR) and %RMSE(order). These deviations can be explained with the non-linearity between the parameters calculated from the molecular structure of the analytes and their retention times. Thereby, for the columns in question, our formulation has led to a better QSRR model. The MLR model itself is fully linear, whereas our NLP-based formulation introduces a degree of non-linearity due to its multinomial quadratic form (see Section 3.4.).
Detailed results for both case studies and all the columns / chromatographic conditions are summarized in Table 2, while the performance plots for all the columns are available in the Supporting Information (Figures S1-S10). Out of the evaluated chromatographic columns / conditions, two exemplary QSRR models from both case studies were detailed here (Supelcosil LC with tG = 10 min, T = 35 °C and Xterra with tG = 20 min, T = 40 °C). Both the NLP-based QSRR models have exhibited low %RMSE(tR) of 8.07% and 15.17% (Table 2), with the former decreasing, and the latter increasing in comparison to the control MLR models. This can also be observed from the predictive ability plots in Figure 5A,D. The respective %RMSE(order) were 51.77% and 22.4% (Table 2). In both cases the larger errors in elution order seem to originate from the training samples ( Figure 5B,E). Increasing the degree of non-linearity in the QSRR model itself and the method formulation should lead to further improvements, especially in the second case study involving peptides > 5 kDa for which the relationship between molecular descriptors and retention time is nonlinear [11,12].
Finally, all the analytes predicted using the NLP-based QSRR elution order prediction method fall within their respective chemical domains of applicability. This is evident from Figure 5C,F whereby for both columns the points are within the warning limits of three multiples of standard deviations of standardized residuals and critical leverage values. The QSRR models are thereby considered stable and robust for small organic molecules and peptides up to 24 peptides (the longest peptide in the dataset). Detailed results for both case studies and all the columns / chromatographic conditions are summarized in Table 2, while the performance plots for all the columns are available in the Supporting Information (Figures S1-S10). Out of the evaluated chromatographic columns / conditions, two exemplary QSRR models from both case studies were detailed here (Supelcosil LC with t G = 10 min, T = 35 • C and Xterra with t G = 20 min, T = 40 • C). Both the NLP-based QSRR models have exhibited low %RMSE(t R ) of 8.07% and 15.17% (Table 2), with the former decreasing, and the latter increasing in comparison to the control MLR models. This can also be observed from the predictive ability plots in Figure 5A,D. The respective %RMSE(order) were 51.77% and 22.4% (Table 2). In both cases the larger errors in elution order seem to originate from the training samples ( Figure 5B,E). Increasing the degree of non-linearity in the QSRR model itself and the method formulation should lead to further improvements, especially in the second case study involving peptides >5 kDa for which the relationship between molecular descriptors and retention time is non-linear [11,12].
Finally, all the analytes predicted using the NLP-based QSRR elution order prediction method fall within their respective chemical domains of applicability. This is evident from Figure 5C,F whereby for both columns the points are within the warning limits of three multiples of standard deviations of standardized residuals and critical leverage values. The QSRR models are thereby considered stable and robust for small organic molecules and peptides up to 24 peptides (the longest peptide in the dataset).

Chromatographic Experiments
Chromatographic experiments performed to obtain the data for development of the NLP-based elution order prediction method are detailed in refs. [13,14]. Briefly, for both case studies gradient elution was used. For the first case study the mobile phase was comprised of methanol and 100 mM tris buffer at pH values of 2.5 and 7.2, while for the second case study it was water with 0.12% trifluoroacetic acid (TFA), and acetonitrile with 0.10% TFA. The 62 organic analytes were dissolved in the mixture of methanol and tris buffer, whereas the 98 synthetic peptides were dissolved in water containing 0.10% of TFA. Dead volumes were determined based on the elution of the second solvents.

Chromatographic Experiments
Chromatographic experiments performed to obtain the data for development of the NLP-based elution order prediction method are detailed in refs. [13,14]. Briefly, for both case studies gradient elution was used. For the first case study the mobile phase was comprised of methanol and 100 mM tris buffer at pH values of 2.5 and 7.2, while for the second case study it was water with 0.12% trifluoroacetic acid (TFA), and acetonitrile with 0.10% TFA. The 62 organic analytes were dissolved in the mixture of methanol and tris buffer, whereas the 98 synthetic peptides were dissolved in water containing 0.10% of TFA. Dead volumes were determined based on the elution of the second solvents. All the measurements were performed with a flow rate of 1 mL/min, and the injected volume of 20 µL. UV detection was used in both case studies, with wavelengths of 214 and 223 nm, for the first and second case study, respectively. In the first case study, the Supelcosil LC-18 column was used, whereas for the second case study: Xterra MS C18, LiChrospher RP-18, LiChrospher CN, Discovery HS F5-3, Discovery RP Amide C16, PLRP-S and Chromolith columns were used.

QSRR Model Development
Upon obtaining the experimental retention data from literature [13,14], two QSRR models were developed. The QSRR model formulation for the first case study was a simple model involving three parameters defined with the following relationship: where µ is the total dipole moment, δ min is the natural bond orbital (NBO) [15,16] charge of the most negatively charged atom, while SASA is the solvent accessible surface area. The molecular descriptors for QSRR model defined with Equation (1) were originally obtained using a low-level of theory (MM/AM1) [13]. So, in this work, we have re-optimized the molecular structures and computed all the descriptors using high-level density functional theory (DFT) [17] calculations using the Minnesota 15 (MN15) functional [18] and the 6-311+G ** basis set. [19] The solvation model density (SMD) solvation model [20] with water as a solvent was used to model the pronounced solvent effects. The DFT calculations were performed in Gaussian 16 software (Ref. S1).
In the second case study, a QSRR formulation specifically devised for RP-LC separation of peptides was used: where logSum AA is the logarithm of the sum of gradient retention times of 20 natural amino acids, logvdW vol. is the logarithm of van der Waals volume, and clogP is the in silico octanol-water partition coefficient describing hydrophobicity. Commonly, the functional forms of Equations (1) and (2) are linear with coefficients estimated using the MLR method [10].

External Validation
Both datasets were uniformly separated into training and external validation sets (70/30%) using the Kennard and Stone algorithm [21]. Such external validation was used for the MLR (control), and the MLR-NLP QSRR models. Performance metrics such as the percentage root mean square error (%RMSE) were evaluated and predictive ability of the developed models was also depicted. %RMSE [22,23] was defined as: where i is the i-th out of n compounds, while y i andŷ i are experimental and predicted retention times, respectively. After predicting the retention times and sorting them w.r.t. the experimental ones, computing the predicted elution order is straight-forward. For %RMSE of elution order, the retention time parameter is simply replaced with the analyte index.

Applicability Domain
Chemical applicability of the QSRR models to a large set of compounds is one of the approaches of their validation. The concept of applicability domain (AD) is introduced for that purpose. AD represents the domain in which compounds possess similar structural, physicochemical or biological properties to the ones of the training compounds. Typical graphical description of the AD is the dependence between standardized residuals of the model and the corresponding leverage values (Williams plot). Leverage values are calculated as the diagonal of the Hat matrix: where X 1 is the training set matrix of descriptors, whereas X 2 can correspond to both training and validation set matrix of descriptors.
To determine whether a compound falls within the AD; warning limits: the critical leverage value h* and three multiples of standard deviation of standardized residuals are determined. The critical leverage value is defined as [22,24]: where N is the number of observations, and K is the number of variables.

Elution Order Prediction
In this work, an NLP formulation for elution order prediction with relaxed inequality constraints was defined. For a QSRR model with three descriptors x j,1 , x j,2 , x j,3 for a compound j and the corresponding retention time y j sorted in ascending order y j ≤ y j+1 , the QSRR in the optimization formulation can be defined as: min a j y j −ŷ j 2 (6) where y j = f (x 1 , x 2 , x 3 ) and f (x 1 , x 2 , x 3 ) can have any functional form. Thereby, Equation (6)  when x j,i and y j are mean-centered and MLR is used. This formulation is thereby an NLP problem. When the retention times are sorted in ascending order it is straight-forward to calculate the predicted elution order. From the point of view of mathematical programming, this problem can be handled by adding inequality constraints: min a j y j − a 1 x j,1 − a 2 x j,2 − a 3 x j,3 2 s.t. y j ≤ y j+1 or a 1 x j,1 − a 2 x j,2 − a 3 x j,3 ≤ a 1 x j+1,1 − a 2 x j+1,2 − a 3 x j+1,3 However, the resulting constrained NLP problem comprises of too severe inequality constraints which cannot be satisfied and at the same time provide a meaningful QSRR model even for simple mixtures which are the case studies in this thesis.