pKalculator: A pKa predictor for C–H bonds

Determining the pKa values of various C–H sites in organic molecules offers valuable insights for synthetic chemists in predicting reaction sites. As molecular complexity increases, this task becomes more challenging. This paper introduces pKalculator, a quantum chemistry (QM)-based workflow for the automatic computation of C–H pKa values, which is used to generate a training dataset for a machine learning (ML) model. The QM workflow is benchmarked against 695 experimentally determined C–H pKa values in DMSO. The ML model is trained on a diverse dataset of 775 molecules with 3910 C–H sites. Our ML model predicts C–H pKa values with a mean absolute error (MAE) and a root mean squared error (RMSE) of 1.24 and 2.15 pKa units, respectively. Furthermore, we apply our model to 1043 pKa-dependent reactions (aldol, Claisen, and Michael) and successfully indicate the reaction sites with a Matthews correlation coefficient (MCC) of 0.82.


Selecting unique conformers
Conformers with relative energies above 3 kcal/mol are removed from the conformer selection when finding the unique conformers, as they are considered too high in energy compared to the rest. Hereafter, the similarity of the conformers is compared using a distance (RMS) matrix of the conformers of a molecule, and the unsupervised, non-hierarchical Butina clustering algorithm is applied. Each cluster centroid is the conformer with the largest number of neighbors, where neighbors lie within a distance threshold of 0.5 Å. The conformers are sorted by their number of neighbors, and the first conformer in the list (the centroid) is selected. Each conformer inside the 0.5 Å distance threshold becomes part of the same cluster and is removed from the list. This is repeated for the rest of the list, and conformers that are not part of any cluster become single instances [1].
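The clustering step above can be illustrated with a minimal pure-Python sketch of the Butina procedure. This is not the RDKit implementation used in the workflow; the function name and the full-matrix input format are ours:

```python
def butina_cluster(dist, threshold=0.5):
    """Cluster conformer indices given a full symmetric distance matrix.

    Centroids are picked greedily as the remaining conformer with the most
    remaining neighbors; all neighbors within the threshold join its cluster.
    Conformers with no close neighbors end up as singletons.
    """
    n = len(dist)
    # Neighbor lists: conformers within the distance threshold (incl. self).
    neighbors = {i: {j for j in range(n) if dist[i][j] <= threshold}
                 for i in range(n)}
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Centroid = remaining conformer with the largest neighbor count.
        centroid = max(unassigned,
                       key=lambda i: len(neighbors[i] & unassigned))
        members = neighbors[centroid] & unassigned
        clusters.append(sorted(members))  # singletons fall out naturally
        unassigned -= members
    return clusters
```

For example, two pairs of near-identical conformers plus one isolated conformer yield two clusters and one singleton.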

Benchmark study - computational methods
This benchmark study evaluates the computational effort and the accuracy of single-point calculations or re-optimizations based on GFN2-xTB structures. These calculations utilize ORCA (v.5.0.4) [2,3] with DMSO as the solvent. The composite electronic structure method r²SCAN-3c [4], combined with its custom def2-mTZVPP/J basis set and the universal solvation model (SMD) [5], is employed. Furthermore, density functional theory (DFT) calculations are conducted using the D4 dispersion-corrected functional CAM-B3LYP [6,7], the Karlsruhe triple-ζ basis set (def2-TZVPPD) [8,9], and the conductor-like polarizable continuum model (CPCM) [10] implicit solvation model. An unpublished previous benchmark study indicates that a long-range-corrected functional with dispersion correction excels in computing anions. This trend is also observed here, with CAM-B3LYP-D4 showing the best performance, as demonstrated in Figure S1, Table S1, and Table S2.

Table S1: Absolute pKa unit error metrics for different levels of theory using the Bordwell dataset (419 compounds).

Finding outliers
Of the 775 molecules, 43 compounds are from [11] and have no experimental pKa values. 732 compounds are therefore left to establish the linear relationship between the experimental pKa values and the lowest ΔG° values; see Figure S7. Outliers with an error exceeding 5 pKa units are reviewed for calculation errors or errors in the literature. The observed outliers typically result from one of the following reasons: calculation errors concerning the expected minimum-pKa site, discrepancies between literature structures and database structures, mislabeled experimental pKa values, or extrapolated pKa values. Notably, the extrapolated pKa values correspond to compounds beyond the scale (pKa ≥ 35) measurable in DMSO because of DMSO's autoprotolysis (pKa = 35) [12,13]. The final result consists of 695 molecules, shown in Figure S8, where the outliers have been omitted. Table S3 lists the compound names omitted from the linear regression in Figure S8. It should also be noted that we altered our QM workflow for compounds that exhibit bridgeheads. Instead of generating min(1 + 3·n_rot, 20) conformers for each SMILES using RDKit (v.2022.09.4) [14], we generate min(10 + 3·n_rot, 20) conformers for the neutral SMILES. Hereafter, we deprotonate each C-H bond for each generated neutral conformer and pass it through the QM workflow. The reason for doing this is that our original QM workflow sometimes shortlists deprotonated conformers that are very different from the neutral conformers, producing high energy differences that yield erroneous pKa values.

Table S3: Outliers for the Bordwell and iBond datasets (n = 37). Estimated: extrapolated pKa values (n = 21); high error: high errors on the expected lowest-pKa site compared to the calculated values (n = 12); wrong site: failed computation on the expected lowest-pKa site (n = 4).
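Reading the conformer-count rule as min(1 + 3·n_rot, 20), with n_rot the number of rotatable bonds (our naming and interpretation of the formula), the bridgehead modification can be sketched as:

```python
def n_conformers(n_rot, bridgehead=False):
    """Number of conformers to embed for a molecule with n_rot rotatable bonds.

    Bridgehead compounds get a larger base count (10 instead of 1), per the
    altered workflow; both variants are capped at 20 conformers.
    """
    base = 10 if bridgehead else 1
    return min(base + 3 * n_rot, 20)
```

For example, a molecule with two rotatable bonds would receive 7 conformers under the original rule and 16 under the bridgehead rule; both rules saturate at 20 for flexible molecules.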

The descriptor
The following section evaluates which descriptor vector best describes our pKa values. From Table S4 and Figure S9, it is seen that the charge-shell descriptor with six shells, with values sorted according to the Cahn-Ingold-Prelog (CIP) rules, best describes the pKa values.
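As an illustration, a charge-shell descriptor of this kind can be sketched as follows. The graph representation, padding scheme, and function name are our assumptions; the actual workflow sorts the charges within each shell by CIP priority, which we approximate here with a simple numeric sort:

```python
from collections import deque

def charge_shell_descriptor(adj, charges, atom, n_shells=6, width=4):
    """Collect partial charges in bond-distance shells around `atom`.

    `adj` maps atom index -> list of bonded atom indices; `charges` holds one
    partial charge per atom. Each shell is sorted (a stand-in for CIP sorting)
    and padded/truncated to a fixed width so all sites get equal-length vectors.
    """
    # Breadth-first search to get the bond distance from the site atom.
    dist = {atom: 0}
    queue = deque([atom])
    while queue:
        a = queue.popleft()
        for b in adj[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    desc = [charges[atom]]
    for shell in range(1, n_shells + 1):
        vals = sorted((charges[a] for a, d in dist.items() if d == shell),
                      reverse=True)
        vals = (vals + [0.0] * width)[:width]  # pad/truncate to fixed width
        desc.extend(vals)
    return desc
```

The fixed width per shell keeps the descriptor length constant across molecules, which is what allows sites from differently sized molecules to share one ML model.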

Machine learning models
Regression and binary classification models have been trained to evaluate either how well we can predict the pKa value of each site or how well we can predict the site of reaction. Table S5 shows the performance metrics for the different regression models using fivefold cross-validation.
For the binary classification, the minimum-pKa site in each molecule is set to '1' (lowest pKa site) and all other sites to '0' (not the lowest pKa site). We also introduce a tolerance where a pKa value within +1 or +2 pKa units of the lowest pKa value is accepted as '1', to account for slight variations; see Table S6. From that, a confusion matrix (CM) is constructed to compare the predictions of the machine learning (ML) model to the calculated sites.
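The labeling rule, including the tolerance, can be sketched as follows (the function name is ours):

```python
def label_sites(pka_values, tol=0.0):
    """Label each C-H site '1' if its pKa lies within `tol` units of the
    molecule's lowest pKa, and '0' otherwise."""
    lowest = min(pka_values)
    return [1 if p <= lowest + tol else 0 for p in pka_values]
```

With site pKa values of 15.0, 16.5, and 20.0, a tolerance of 0 labels only the first site '1', while a tolerance of 2 pKa units also accepts the second site as a true site.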

Outliers for the test set
This section examines the outliers associated with the test set for the best regression model; see Table S10. Generally, the regression model struggles to accurately predict the C-H pKa values at bridgehead positions (comp69, comp70, comp321). Bridgehead deprotonation is often energetically unfavorable, leading to an unstable anion. Since our descriptor vector solely describes CM5 charges, it fails to account for the steric strain associated with the bridgehead position. The regression model also encounters difficulties when an anion gains extra stability through charge delocalization via resonance, as shown in Figure S10. Our model is expected to struggle with the additional stability arising from conjugation, since the descriptor vector of CM5 charges originates from the neutral molecule. A more accurate approach to generating the descriptor vector could involve concatenating the CM5 charges for each deprotonated site within the molecule, or taking the difference between the neutral descriptor vector and the deprotonated descriptor vector for each site, thereby providing a more precise representation of the molecule.

Outliers for Reaxys
Figures S2-S6 show linear regressions between the minimum ΔG° values at T = 295.15 K and the experimental pKa values for the Bordwell dataset (419 compounds) for the different QM methods.

Figure S9:
Figure S9: MAE versus n-shells and RMSE versus n-shells for the Bordwell dataset (419 compounds). MAE: mean absolute error; RMSE: root mean squared error.
A site is classified as a true positive (TP) or true negative (TN) if the ML model's prediction aligns with the QM-computed sites. Conversely, it is labeled a false positive (FP) or false negative (FN) if the ML prediction differs from the calculated sites. From these classifications, we derive several evaluation parameters: accuracy (ACC), the Matthews correlation coefficient (MCC), recall/sensitivity (true-positive rate, TPR), specificity (true-negative rate, TNR), precision (positive predictive value, PPV), and the negative predictive value (NPV).
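These parameters follow directly from the four confusion-matrix counts; a minimal sketch (the function name is ours):

```python
import math

def cm_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)  # recall / sensitivity (true-positive rate)
    tnr = tn / (tn + fp)  # specificity (true-negative rate)
    ppv = tp / (tp + fp)  # precision (positive predictive value)
    npv = tn / (tn + fn)  # negative predictive value
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "MCC": mcc, "TPR": tpr,
            "TNR": tnr, "PPV": ppv, "NPV": npv}
```

The MCC is a balanced summary even when the two classes are of very different sizes, which is the case here since most C-H sites in a molecule are not the lowest-pKa site.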

Figure S10:
Figure S10: Outliers from the held-out test set (20%; 155 compounds; 789 pKa values) using the best regression model. The outliers below have an error of seven pKa units or more between the calculated (QM) and predicted (ML) values.

Figure S11:
Figure S11: False negatives (FN) and false positives (FP) from the held-out test set (20%; 155 compounds; 789 pKa values) using the best regression model as a binary classifier.

Figure S12:
Figure S12 displays a sample of 20 random compounds from the Reaxys dataset, highlighting the limitations of our model in predicting reaction sites based on pKa-dependent reactions. Because our method relies on the principle of lowest energy, it predicts the thermodynamic pKa. However, as Roszak et al. [15] discuss, various factors can influence the reaction site, including (i) the formation of a dianion, (ii) deprotonation facilitated by pre-coordination of a base, (iii) the emergence of strained intermediates, and (iv) the enolization of ketones. Such factors allow for the manipulation of the reaction site, resulting in the functionalization of a less acidic C-H site. In the context of unsymmetrical ketones, two acidic sites are possible. The more substituted site typically results in a more stable anion (the thermodynamic anion), while the less substituted site yields the kinetic anion. By selecting appropriate bases, synthetic chemists can thus favor either the thermodynamic or the kinetic anion and thereby steer the site of reaction.

Table S4:
Performance metrics using different numbers of shells (n-shells) for the descriptor vector for the Bordwell dataset (419 compounds). MAE: mean absolute error; RMSE: root mean squared error.
(a) MAE vs n-shells (b) RMSE vs n-shells

Table S5:
Performance metrics for different regression models. The dataset (775 compounds; 3910 pKa values) is randomly split into a training set (80%; 620 compounds; 3121 pKa values) and a held-out test set (20%; 155 compounds; 789 pKa values). Subsequently, a fivefold randomly shuffled cross-validation (CV) is conducted. Within each fold, the original training set is split randomly into a new training set (90% of the original training set) and a validation set (10% of the original training set). Hereafter, each ML model is trained on the original training set (80%; 620 compounds; 3121 pKa values) and tested against the held-out test set (20%; 155 compounds; 789 pKa values). Mean MAE and mean RMSE denote the mean absolute error and root mean squared error averaged over the fivefold CV; MAE and RMSE denote the corresponding errors on the held-out test set.
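The splitting scheme described in the caption above can be sketched as follows (the seeds and function name are ours; this is an illustration of the scheme, not the paper's code):

```python
import random

def split_dataset(n, seed=42):
    """80/20 train/test split of n compound indices, plus five randomly
    shuffled CV folds, each holding out ~10% of the training set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(0.2 * n)
    test, train = idx[:n_test], idx[n_test:]
    folds = []
    for fold_seed in range(5):
        fold = train[:]
        random.Random(fold_seed).shuffle(fold)
        n_val = round(0.1 * len(fold))
        # (fold training set, validation set) per CV fold
        folds.append((fold[n_val:], fold[:n_val]))
    return train, test, folds
```

With n = 775 compounds this reproduces the counts in the caption: 620 training and 155 test compounds, with 62-compound validation sets per fold.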

Table S6:
Examples of targets in binary classification. '1' is given for the lowest pKa value, and '0' is given for all other calculated pKa values. +1 and +2 denote that a pKa value within +1 or +2 pKa units of the lowest pKa value is accepted as '1' (true site).

Table S7:
Training and validation set performance metrics for different binary classification models. The dataset (775 compounds; 3910 pKa values) is randomly split into a training set (80%; 620 compounds; 3121 pKa values) and a held-out test set (20%; 155 compounds; 789 pKa values). Subsequently, a fivefold randomly shuffled cross-validation (CV) is conducted. Within each fold, the original training set is split randomly into a new training set (90% of the original training set) and a validation set (10% of the original training set). +1 and +2 denote that a pKa value within +1 or +2 pKa units of the lowest pKa value is accepted as '1' (true site). AUC: area under the curve.

Table S8:
Held-out test set performance metrics for different binary classification models, including the best regression model. The dataset (775 compounds; 3910 pKa values) is randomly split into a training set (80%; 620 compounds; 3121 pKa values) and a held-out test set (20%; 155 compounds; 789 pKa values). Each pKa value in a molecule is set to either '1' (lowest pKa site) or '0' (not the lowest pKa site). +1 and +2 denote that a pKa value within +1 or +2 pKa units of the lowest pKa value is accepted as '1' (true site). For the null models, all sites are set to '0'. The best models are marked in bold.

Table S9:
Held-out test set performance metrics for different binary classification models, including the best regression model. The dataset (775 compounds; 3910 pKa values) is randomly split into a training set (80%; 620 compounds; 3121 pKa values) and a held-out test set (20%; 155 compounds; 789 pKa values). Each pKa value in a molecule is set to either '1' (lowest pKa site) or '0' (not the lowest pKa site). +1 and +2 denote that a pKa value within +1 or +2 pKa units of the lowest pKa value is accepted as '1' (true site). For the null models, all sites are set to '0'. The best models are marked in bold.

Table S10:
Outliers (n = 15) from the held-out test set (20%; 155 compounds; 789 pKa values) using the best regression model. The outliers below have an error of seven pKa units or more between the calculated (QM) and predicted (ML) values. * denotes the lowest pKa site, and ** denotes bridgehead sites.

Table S13:
Reaxys performance metrics: comparison of different binary classification models, including the best regression model. The Reaxys dataset comprises 1043 reactions (584 aldol, 51 Michael, and 408 Claisen). Each pKa value in a molecule is set to either '1' (lowest pKa site) or '0' (not the lowest pKa site). +1 and +2 denote that a pKa value within +1 or +2 pKa units of the lowest pKa value is accepted as '1' (true site). For the null models, all sites are set to '0'. The best models are marked in bold.