The Atomic Partial Charges Arboretum: Trying to See the Forest for the Trees

Abstract Atomic partial charges are among the most commonly used interpretive tools in quantum chemistry. Dozens of different ‘population analyses’ are in use, which are best seen as proxies (indirect gauges) rather than measurements of a ‘general ionicity’. For the GMTKN55 benchmark of nearly 2,500 main‐group molecules, which span a broad swathe of chemical space, some two dozen different charge distributions were evaluated at the PBE0 level near the 1‐particle basis set limit. The correlation matrix between the different charge distributions exhibits a block structure; blocking is, broadly speaking, by charge distribution class. A principal component analysis on the entire dataset suggests that nearly all variation can be accounted for by just two ‘principal components of ionicity’: one has all the distributions going in sync, while the second corresponds mainly to Bader QTAIM vs. all others. A weaker third component corresponds to electrostatic charge models in opposition to the orbital‐based ones. The single charge distributions that have the greatest statistical similarity to the first principal component are iterated Hirshfeld (Hirshfeld‐I) and a minimal‐basis projected modification of Bickelhaupt charges. If three individual variables, rather than three principal components, are to be identified that contain most of the information in the whole dataset, one representative for each of the three classes of Corminboeuf et al. is needed: one based on partitioning of the density (such as QTAIM), a second based on orbital partitioning (such as NPA), and a third based on the molecular electrostatic potential (such as HLY or CHELPG).


Introduction
Atomic partial charges q A are a central concept in general chemistry. Unfortunately, they do not correspond to a single, well-defined, quantum mechanical observable. The very notion of atomic partial charges, however, does imply that "atoms in molecules" (in the broad sense of the word, not the narrow one of Bader/QTAIM [1] charges) are meaningful concepts.
Parr, Ayers, and Nalewajski [2] consider partial charges to be noumena: The term noumenon, originally coined by Plato from the Greek word noös (knowledge, cognition), is defined by the Oxford English Dictionary as "an object knowable by the mind or intellect, not by the senses; specifically (in Kantian philosophy) an object of purely intellectual intuition". Frenking and Krapp [3] speak of "unicorns": "my[th]ical animal[s] whose appearance is known to everybody although nobody has ever seen one".
In reaction, Matta and Bader [4,5] strongly took exception. They point out that Bader's QTAIM charges, in particular, can be seen as expectation values of well-defined operators, namely: Heaviside "step" functions that are unity within one set of zeroflux surfaces and zero outside. The fact that QTAIM charges often are at odds with "chemical intuition" (or at least chemical received wisdom) leaves the reader even more perplexed, and more in need of a guide. [6] Perhaps a third term is more apropos here, namely the statistical one "proxy variable". A proxy variable is a (fairly easily) measurable quantity that acts as an indirect "proxy" for a deeper concept that eludes direct measurement. Examples from economics are proxies for the standard of living such as GDP per capita or PPP (purchasing power parity). Other examples, from psychology, are IQ scores when used as proxies for general intelligence, or (relatedly) SAT scores as predictors for scholastic success. (As a hoary quip goes, the only thing standardized cognitive tests directly measure is performance on those tests.) In that sense, partial charges are "proxy variables" for a deeper chemical concept that Meister and Schwarz [7] have termed "[molecular] ionicity".
In a separate category they place "experimental" charges, which are measured (or rather, for which trends are inferred) from NMR or EXAFS chemical shifts, dipole moments, and the like.
Cramer and Truhlar [9] proposed a slightly different taxonomy which, with variations, we will adhere to in the present work.
* Class I charges are derived from experimentally measurable properties, e. g., from observed deformation densities (as in the work of the late Philip Coppens [27,28] ), from the electronegativity equalization principle, or from, e. g., dipole moments of diatomic and small (usually highly symmetric) polyatomic molecules.
* Class III charges are extracted from the wave function or electron density by fitting a physical observable (e. g., the electrostatic potential) derived from it.
* Class IV charges, such as the CM5 model introduced by Truhlar and coworkers, [29] are based on semiempirical adjustment of a well-defined Class II or Class III model to better reproduce one or more physical observables (e. g. the dipole moment).
(We note in passing that the legal fiction known as "formal charges" might be termed Class 0.) A number of review articles have been devoted to partial charges, such as Wiberg and Rablen, [30,31] Bachrach, [32] Cramer and Truhlar, [9] and most recently Ayers and coworkers [26] (which focuses principally on Hirshfeld-type approaches from an information theoretical point of view).
Normally, "observability" in the quantum mechanical sense of the word implies the existence of an operator for which one can obtain an expectation value. Cioslowski and Surjan [33] propose a weaker, more general definition where a uniquely defined computational protocol that extracts uniquely defined numerical values from the wave function, which have welldefined infinite basis set limits, is said to satisfy a "generalized observability criterion". (It should be pointed out that the hoary Mulliken population analysis, [14] for instance, does not satisfy said criterion on account of its pathological basis set dependence -nor do modified Mulliken analyses like Löwdin, [34] Ros-Schuit, [35] or Bickelhaupt. [36] However, minimal basis set projections [37] -denoted here as MBS-Mulliken, [14,37] MBS-Bick-elhaupt (Ref. [36] and vide infra), etc. -do satisfy the Cioslowski-Surjan criterion. [36] ) So, there are now several dozen charge distributions "on the market". But how distinct are they really? How different a picture do they really paint? Answers to this question might be found in feature reduction techniques -most notably from the best-known such technique, PCA (principal component analysis).
Investigating this question requires a dataset of molecules large and chemically diverse enough that any conclusions reached cannot (easily) be dismissed as sample selection artefacts. The almost 2,500 unique species in the GMTKN55 benchmark [38] (general main-group thermochemistry, kinetics, and noncovalent interactions, 55 problem subsets) for density functional methods are one useful starting point. (See Refs. [38][39][40][41][42] for example applications to DFT. ) We shall show that there is indeed a "method in the madness" and that almost all of the data variation can be represented by two to three principal components.

Population Analyses Considered
We considered the following population analyses as representatives of the different classes in the Cramer and Truhlar [9] taxonomy (where we shall introduce some subclasses below): (1) for Class I charges, the EEQ (electronegativity equalization principle) charges as implemented in the DFTD4 program of the Grimme group (Section II.A of Ref. [43]); (2) Class II is subdivided here into: * Class IIa: from partitioning orbitals. (NB: this effectively corresponds to Corminboeuf's Class 2.) We can subdivide further into: -IIa1: Mulliken [14] and variants (Löwdin, [34] Bickelhaupt, [36] …) applied in the full basis set. We considered these but discarded them on account of their pathological basis set dependence.
-IIa1 M: projection of the MOs onto a minimal basis set [37] followed by the above techniques. This eliminates the basis set hypersensitivity and leads to values that satisfy the extended Cioslowski-Surjan observability criterion. [33] We here consider MBS-Mulliken [37] and (by straightforward extension) MBS-Bickelhaupt. The latter differs from MBS-Mulliken in that off-diagonal overlap populations are taken into account by diagonal-overlap weighted average instead of simple average: -IIa2: techniques based on some form of natural or intrinsic atomic orbital: here we consider Weinhold's NPA (natural population analysis [15] ) and the IAO (intrinsic atomic orbitals) of Knizia, [16] which are functionally equivalent [17] to Ruedenberg's QUAOs (quasi-atomic orbitals [18,44] ).
to Class 3 in Corminboeuf's taxonomy.) We can further distinguish between fuzzy and discrete domains.
-IIb1: fuzzy domains: we primarily focus on variants of the "stockholder" population analysis of Hirshfeld, [20,21] specifically: * his original method (in which the deformation density is partitioned based on the promolecular density); * the iterated Hirshfeld [22] (Hirshfeld-I) method, where the proatom densities that define the promolecular density are iteratively interpolated between ionization states; * the Iterative Stockholder Approach [23] (ISA) in which proatoms are avoided through angular integration around an atom; * MBIS (minimal basis iterative stockholder [25] ), which mitigates certain issues with proatoms that are unbound in vacuo (e. g., N -, N 2-, N 3-); * Finally, the DDEC6 (density derived electrostatic and chemical approach, version 6) of Manz [24](a-f) which is designed for computational resilience in problematic systems and contains many adjustments to ensure proper behavior in solids 2 . This approach has been further developed for additional properties (e.g., bond orders, [24](c) dispersion coefficients [24](e) ) that are beyond the scope of the present paper.
Stockholder-type charges have been rationalized based on information theory [2,45] and were the subject of a recent review. [26] (We considered Salvador's topological fuzzy Voronoi charges [46] for a representative sample consisting of the W4-17 benchmark, [47] and found said charges to be very similar in practice to Bader QTAIM charges, R 2 = 0.97.) -IIb2: disjoint/discrete domains: the most important representative of this are QTAIM (quantum theory of atoms in molecules [1] ), a.k.a. "Bader charges", in which zero-flux surfaces are used to partition the electron density into atomic cells within which numerical integration takes place.
Maslen and Spackman [48,49] proposed an modification inspired by Hirshfeld charges: while zero-flux surfaces are determined as in QTAIM, the integration is performed instead on the deformation density (i. e. 1 molecule -1 promolecule ). This mitigates the tendency of QTAIM to amplify charges: zero-flux surfaces tend to run far from high electronegative atoms like O and F, at the expense of less electronegative atoms.
The Voronoi Deformation Densities (VDD) of Fonseca-Guerra et al., [19] go one step further, in that they partition the deformation density through simple Voronoi tessellation: any point in space closer to a given atom than to the others is assigned to that atom.
(3) Class III can be further subdivided into: * Class IIIa: based on the electrostatic potential. (This corresponds to Class 1 in Corminboeuf's taxonomy.) The original CHELP of Chirlian and Francl [10] was modified by Breneman and Wiberg (CHELPG) [11] to improve rotational invariance. Merz-Kollman (a.k.a. Merz-Singh-Kollman [12,50] ), HuÀ Lu-Yang (HLY [13] ), and RESP (restrained electrostatic potential [51] ) all have the same physical basis but merely represent different integration grid specifics for the ESP. As HLY appears to be the most stable numerically and is available for all elements represented in our dataset, we have focused primarily on HLY.
* Class IIIb: based on other electrostatic properties. Here we consider Cioslowski's APT (atomic polar tensor [52] ) charges, based on the trace of the dipole moment derivatives.
(4) under Class IV we consider CM5 (charge model 5), [29] an empirical adjustment of Hirshfeld to better reproduce molecular dipole moments, and ADCH: [53] (atomic dipole corrected Hirshfeld), an adjustment to ensure reproduction of molecular dipole moment obtained at the same calculation level.
In addition, we consider ACP, atomic charge partitioning [54] and especially its iterative variant i-ACP, [55] which arguably can be pigeonholed both under Class IV (on account of their adjustment for better electrical properties) and under IIb1 (as they bear an obvious kinship to MBIS in representing proatoms by Slater-type functions).

Other Computational Details
All electronic structure calculations were carried out using the PBE0 functional [56,57] with the def2-TZVPP basis set [58] using the Gaussian, [59] MOLPRO, [60] and Q-CHEM [61] electronic structure program systems. SG-3 [62] or equivalent integration grids (e. g. Grid = UltraFine in Gaussian) were employed. The reference geometries were obtained from the online supporting information to the GMTKN55 paper and used without further optimization. Species with trivial charge distributions (e. g. atoms, homonuclear diatomics, tetrahedral P 4 ,…) were removed from the dataset, as were the handful of duplicate species, which left 2,125 unique molecules. Details of the electronic structure and postprocessing software used to generate each specific set of partial charges are given in Table 1. The various aspects of this process were automated using a collection of scripts developed in-house. For easy data manipulation, Python scripts using numpy and pandas libraries were written. Duplicates were identified from the input geometries (by comparison of the rotational constants). Holes in the data were generally from technical limitations dealing with molecules that contain heavy elements for which some parameter was lacking in the available implementation.
Principal component analysis was initially performed upon 25 different population analysis methods. Our initial analysis identified some near-redundancies (see below), leading to some winnowing.
Upon completing the analysis, its statistical stability was tested by adding elements that were randomly chosen from the dataset to provide noise. The structure of principal components along with the explained variance values was compared between the untouched and adulterated data to assess how robust the implied structure markers within the data are.

Results and Discussion
The full correlation matrix of all variables, as well as the full partial charges dataset, are given in Microsoft Excel format in the Supporting Information.
By way of illustration, we show here the block between the four electrostatic charge variants: Its eigenvalues and eigenvectors (i. e. the principal components of the correlation matrix) are: Note that the eigenvalues of a correlation matrix add up to its dimension, in casu, four. The largest eigenvector (effectively the arithmetic mean of all four variables) has eigenvalue 3.93, the next largest (corresponding to CHELPG vs. HLY) having eigenvalue 0.055. One naive variable selection technique, discussed by Jolliffe [69] in Section 9.3 of his textbook on Principal Component Analysis, is to eliminate the variables that have the largest coefficient in the eigenvectors with the lowest eigenvalues. This 'backward elimination' (BE) strategy leads to Merz-Kollman and RESP being eliminated, leaving just CHELPG and HLY. The eigenvectors for the remaining 2 × 2 correlation matrix are just (CHELPG � HLY)/ p 2. Whether one chooses to retain CHELPG or HLY for further analysis is basically arbitrary: CHELPG is available in a larger number of codes, while HLY is implemented (in Gaussian) for all elements in the Periodic Table. In the present paper, we have elected to retain HLY.
Between the four iterative Class IIb1 methods, we have the following correlation matrix: Its largest two eigenvalues are 3.934 and 0.045, the former corresponding to all charges moving in tandem and the latter to ISA pulling in the opposite direction from the three others. Within the lower 3-variable block, the top eigenvalue is 2.977, showing that these three charge types on the whole contain very similar information.
The squared correlation matrix for 18 variables has been given in Table 2, reordered to maximize blocking. Some comments concerning it are in order here: (a) Note the high correlation (R 2 = 0.96) between VDD and ordinary/original Hirshfeld. Both are based on partitioning the same promolecular density: the key difference is of course that Hirshfeld uses fuzzy boundaries derived from 1 proatom /1 promolecule , while VDD employs discrete Voronoi tesselation. Apparently, the impact of this difference on the statistical similarity of the variables is smaller than one might naively expect.
The first eigenvector, PC 1 , has all charges pulling in the same direction with different strengths: PC 1 can thus be said to correspond to Schwarz and Meister's [7] 'principal component of ionicity'. The different coefficients correspond roughly to how pronounced charge differences are: e. g., for the original Hirshfeld we see 0.08, for iterative Hirshfeld 0.26, for DDEC6 0.21, and for QTAIM 0.38. PC 2 is another matter: it pits QTAIM, APT, and i-ACP with same signs against every other distribution with opposite sign. PC 2 can hence be seen as a secondary "principal component of ionicity". Why APT, despite its different physical principles, marches in comparative lockstep with QTAIM is somewhat opaque to the authors. PC 3 , which may still be of marginal statistical significance, pits the electrostatic charges and ISA against the orbital-based charges.
In order to verify how well determined PC 2 and PC 3 are, about 10 % random numbers were added to the sample and the analysis repeated. The eigenvalue for PC2 clearly stays in well-defined territory, but PC 3 appears to be in danger of  'drowning in the noise'. We therefore caution against overinterpreting PC 3 . What happens if we retain just one representative of each "physical" group? This leads to the following: Principal component analysis is one way to address the feature selection problem. PCs still require calculating or measuring all the variables, however: hence, in data science and machine learning, some attention has been devoted to the variable selection problem, i. e.: to extracting the subset of original variables that contains most of the information in the dataset. A simplistic way is 'hunting' for the variables with the largest coefficients in the PCs, but Jolliffe's textbook on PCA [69] cautions against this. Ref. [70] discusses more sophisticated approaches and their implementation in the subselect module of the R statistical software system. [71] "Subselect" works through simulated annealing to optimize a choice of objective functions. [70] We ended up using the GCD (generalized coefficient of determination) criterion, that is, the overlap between the subspace spanned by variable subset size n and that spanned by first n principal components. (The analysis was run using R version 3.6.1 on the senior author's Macbook Pro.) However, if we apply subselect to the covariance matrix of the various population types, we find that the objective function goes through a maximum for 3 variables, and that these are MBSMulliken, QTAIM, and HLY. (Solutions using another electrostatic potential charge instead of HLY, or NPA instead of MBSMulliken, are essentially of the same quality.) For a single variable, we get Hirshfeld-I as the optimum; for two variables, MBSMulliken and QTAIM. Expanding to four and five variables successively adds EEQ and APT, respectively.
If one forces inclusion of Hirshfeld-I, then successive additions lead to (again, using the GCD criterion): Intriguingly, said three variables correspond to one representative each of the three Corminboeuf classes.
Let us consider the eigenvalues and eigenvectors of the covariance matrix for this 3-variable subset: The structure here is very clear: (1) Principal ionicity; (2) QTAIM in opposition to the others; (3) Electrostatic versus orbital.
Another angle on the issue is afforded by considering the squared Pearson correlations R 2 for least-squares fits of the individual charges in terms of the first few principal components. The result is shown in Table 4.
We note that the first principal component has the strongest correlation with Hirshfeld-I (R 2 = 0.9752) and with MBSBickelhaupt (R 2 = 0.9765), while MBIS and DDEC6 (both iterative Hirshfeld variants) come quite close at R 2 = 0.963 and 0.964, respectively. The effect of adding the second principal component is essentially negligible for Hirshfeld-I and MBSBickelhaupt (both of which have small loadings in PC2) but QTAIM is now the winner at R 2 = 0.9842, albeit closely followed by the iterative Class IIb1 methods other than ISA. If we consider the largest increases in R 2 from one to two PCs, then QTAIM followed by APT seems to be most associated with PC2. The corresponding increase from adding PC3 is by far the largest for HLY, which jumps from 0.84 to 0.95. To reach 0.99 territory, HLY requires adding in PC5 and PC6: its loading in PC4 is too small to be useful. CM5 and EEQ would benefit from a 4 th PC.
From the converse point of view, we may consider how well the three first PCs would be expressed as linear combinations of the three variables (MBSMulliken or NPA, QTAIM, HLY). The R 2 for these fits are 0.990 for PC1, 0.977 for PC2, and 0.843 for PC3. Increasing the latter number further (as well as good fits for PC4 and PC5) can be achieved by adding EEQ and APT.

Conclusions
From a principal component analysis of about two dozen different charge distributions for a sample of over 2,000 maingroup molecules, we can establish the following. Intriguingly, original Hirshfeld is the most resistant to expansion in principal components, requiring as many as ten of them. 8. Behavior of IBO (intrinsic bond orbital) charges is statistically very similar to NPA (natural population analysis) as well as MBSMulliken, that is, Mulliken population analysis after minimal basis set projection.  Somewhat surprisingly, discrete vs. fuzzy partitioning of the deformation density (i. e., Voronoi Deformation Densities vs. original Hirshfeld charges) causes a much smaller difference than iterating the pro-atomic charges that make up the promolecular density.
We also define a minimal basis set-projected variant of Bickelhaupt charges, which should be easy to implement in any electronic structure system, or postprocessing code for same, for which one has source code available.