Machine Learning of Molecular Electronic Properties in Chemical Compound Space

The combination of modern scientific computing with electronic structure theory can lead to an unprecedented amount of data amenable to intelligent data analysis for the identification of meaningful, novel, and predictive structure-property relationships. Such relationships enable high-throughput screening for relevant properties in an exponentially growing pool of virtual compounds that are synthetically accessible. Here, we present a machine learning (ML) model, trained on a data base of \textit{ab initio} calculation results for thousands of organic molecules, that simultaneously predicts multiple electronic ground- and excited-state properties. The properties include atomization energy, polarizability, frontier orbital eigenvalues, ionization potential, electron affinity, and excitation energies. The ML model is based on a deep multi-task artificial neural network, exploiting underlying correlations between various molecular properties. The input is identical to \emph{ab initio} methods, \emph{i.e.} nuclear charges and Cartesian coordinates of all atoms. For small organic molecules the accuracy of such a"Quantum Machine"is similar, and sometimes superior, to modern quantum-chemical methods---at negligible computational cost.

ground-and excited-state properties. The properties include atomization energy, polarizability, frontier orbital eigenvalues, ionization potential, electron affinity and excitation energies. The machine learning model is based on a deep multitask artificial neural network, exploiting the underlying correlations between various molecular properties. The input is identical to ab initio methods, i.e. nuclear charges and Cartesian coordinates of all atoms. For small organic molecules, the accuracy of such a 'quantum machine' is similar, and sometimes superior, to modern quantum-chemical methods-at negligible computational cost.
insights that relate structure to property in a predictive and quantitative manner. How are we to systematically construct robust models of electronic structure properties that properly reflect the information already obtained for thousands to millions of different chemical compounds?
With increasing numbers of data and available computational resources, increasingly sophisticated statistical data analysis, or machine learning (ML) methods, have already been applied for predicting not only outcomes of experimental measurements but also outcomes of computationally demanding high-level electronic structure calculations. In close analogy to the quantitative structure property relationships (QSPRs) prevalent in cheminformatics and bioinformatics, QSPRs can also be constructed for electronic structure properties. Examples include QSPRs for exchange-correlation potentials using neural networks (NNs) [2,3], basisset effects using support vector machines [4,5] or molecular reorganization energies affecting charge transfer rates [6,7], or for solid ternary oxides [8]. Ordinarily, these applications rely on association, using regression methods that create statistically optimized relationships between the so-called descriptor variables and the electronic property of interest. Not surprisingly, the heuristic ad hoc identification and formatting of appropriate descriptors represents a crucial and challenging aspect of any QSPR, and is to be repeated for every property and class of chemicals.
We make use of an alternative ML approach, recently introduced by some of us for the modeling of molecular atomization energies [9]. This approach is based on a strict first principles view on chemical compound space [11]. Specifically, solutions to Schrödinger's equation (SE) are inferred for organic query molecules using the same variables that also enter the electronic Hamiltonian H , i.e. nuclear charges Z I and positions R I , 9 and that are mapped to the corresponding total potential energy, H ({Z I , R I }) −→ E [11,12]. Unlike the aforementioned QSPRs this ML model is free of any heuristics: it exactly encodes the supervised learning problem posed by SE, i.e. instead of finding the wavefunction which maps the system's Hamiltonian to its energy, it directly maps the system to energy (based on examples given for training), {Z I , R I } ML −→ E. The employed descriptor, dubbed the 'Coulomb'matrix, is directly obtained from {Z I , R I }. As such this constitutes a well-defined supervisedlearning problem and in the limit of the converged number of training examples the ML model becomes a formally exact inductive equivalent to the deductive solution of SE. It is advantageous that the training data can come from experiment just as well as from numerical evaluation of the corresponding quantum mechanical observable using approximate wavefunctions (separated nuclear and electronic wavefunctions, Slater determinant expansions, etc), Hamiltonians (such as the Hückel or any exchange-correlation potential), and self-consistent field procedures. Building on our previously introduced work [9], we present here a more mature ML model developed to accomplish the following two additional tasks: (i) simultaneously predict a variety of different electronic properties for a single query and (ii) reach an accuracy comparable with the employed reference method used for generating the training set. The presented ML model is based on a multi-task deep artificial NN approach that captures correlations between seemingly related and unrelated properties and levels of theory. A remarkable predictive accuracy for 'outof-sample' molecules (i.e. molecules that were not part of the training set) has been obtained through the use of random Coulomb matrices that introduce invariance with respect to atom indexing. For training, we generated a quantum chemical database containing nearly 10 5 entries for over 7000 stable organic molecules, made of up to seven atoms from main-group elements, consisting of C, N, O, S and Cl, saturated with hydrogen to satisfy valence rules [13,14]. For each molecule, the atomization energy, static polarizabilities, frontier orbital eigenvalues and excitation energies and intensities have been calculated by a variety of widely used electronic structure methods, including state-of-the-art first principles methods, such as hybrid densityfunctional theory and the many-body single particle Green's function and screened Coulomb interaction (GW) approach (see section 2) 10 . Figure 1 illustrates the complete property database and how it has been used in model training and prediction.

Molecular structures (input)
While the present ML model approach is generally applicable, for the purpose of this study we restrict ourselves to the chemical space of small organic molecules. For all the cross-validated training and out-of-sample model performance testing, we rely on a controlled test bed of molecules, namely a subset of the general molecular data base (GDB)-13 database [13,14] consisting of all 7211 small organic molecules that have up to seven second and third row atoms consisting of C, N, O, S or Cl, saturated with hydrogen. The entire GDB-13 database represents an exhaustive list of the ∼0.97 B organic molecules that can be constructed from up to 13 such 'heavy' atoms. All GDB molecules are stable and synthetically accessible according to organic chemistry rules [15]. Molecular features such as functional groups or signatures include single, double and triple bonds; (hetero-) cycles, carboxy, cyanide, amide, amine, alcohol, epoxy, sulfide, ether, ester, chloride, aliphatic and aromatic groups. For each of the many possible stoichiometries, many constitutional isomers are considered, each being represented only by a single conformational isomer.
Based on the string representation (SMILES [16,17]) of molecules in the database, we used the universal force field [18] to generate reasonable Cartesian molecular geometries, as implemented in OpenBabel [19]. The resulting geometries were relaxed using the PBE approximation [20] to Kohn-Sham density functional theory (DFT) [21] in a converged numerical basis, as implemented in the FHI-aims code [22] (tight settings/tier2 basis set). All the geometries are provided in the supplementary material (available from stacks.iop.org/NJP/15/095003/mmedia).

Molecular representation (descriptor)
One of the most important aspects for creating a functional ML model is the choice of an appropriate data representation (descriptor) that reflects important constraints and properties due to the underlying physics, SE in our case. While there is a wide variety of descriptors used in chem-and bio-informatics applications [23][24][25][26][27], they are conventionally based on Cartoons of ten exemplary molecules from the database are shown; they are used as the input for quantum chemistry, for learning or for prediction. Relying on the input in the 'Coulomb' matrix form, the concept of a 'quantum machine' (QM) is illustrated for two seemingly uncorrelated properties, atomization energy E and HOMO eigenvalue, which are decoded in terms of the largest two principal components (PCA1, PCA2) of the last NN layer for 2k molecules not part of the training. The color-coding corresponds to the HOMO eigenvalues.
prior knowledge about chemical binding, electronic configuration or other quantum mechanical observables. Instead, we derive our representation without any pre-conceived knowledge, i.e. exclusively from stoichiometry and configurational information, from that generated according to the previous subsection. As such, the molecular representation is in complete analogy to the electronic Hamiltonian used in ab initio methods.
For this study, we use a randomized variant of the recently introduced 'Coulomb matrix', M [9,10]. The Coulomb matrix is an inverse atom-distance matrix representation that is unique (i.e. no two molecules will have the same Coulomb matrix unless they are identical or enantiomers) and retains invariance with respect to molecular translation and rotation by construction: Off-diagonal elements encode the Coulomb repulsion between nuclear charges of atoms I and J , while diagonal elements represent the stoichiometry through an exponential fit in Z to the free atoms' potential energy. We have enforced invariance with respect to atom indexing by representing each molecule as a probability distribution over Coulomb matrices p(M) generated by different atom indexing of the same molecule. Details for producing such random Coulomb matrices can be found in the supplementary material (available from stacks.iop.org/NJP/15/095003/mmedia).

Molecular electronic properties (output)
The reference values necessary for learning and testing consist of various electronic groundand excited-state properties of molecules in their PBE geometry minimum. Specifically, we consider the atomization energies E, static polarizabilities (trace of tensor) α, frontier orbital eigenvalues HOMO and LUMO, ionization potential IP and electron affinity EA. Furthermore, from optical spectrum simulations (10-700 nm), we consider the first excitation energy E * 1st , excitation of maximal optimal absorption E * max and its corresponding intensity I max . Data ranges ofpropertiesforthemolecularstructuresandforvariouslevelsoftheoryaregiveninfootnote9; property mean values in the data set also feature in table 1.
To also gauge the impact of the reference method's level of theory on the ML model, polarizabilities and frontier orbital eigenvalues were evaluated with more than one method. Static polarizability has been calculated using self-consistent screening (SCS) [28] as well as hybrid density functional theory (PBE0) [29,30]. PBE0 has also been used to calculate atomization energies and frontier orbital eigenvalues. The electron affinity, ionization potential, excitation energies and maximal absorption intensity have been obtained from ZINDO [31][32][33]. Hedin's GW approximation [34] has also been used to evaluate frontier orbital eigenvalues. GW is a quasi-particle ab initio many-body perturbation theory, known to accurately account for electronic excitations that describe electron addition and removal processes [34]. The SCS, PBE0 and GW calculations have been performed using FHI-aims; [22,35], ZINDO/s calculations are based on the ORCA code [36]. ZINDO/s is an extension of the INDO/s semiempirical method with parameters to accurately reproduce single excitation spectra of organic compounds and complexes with rare-earth elements. The INDO Hamiltonian neglects some two-center two-electron integrals in order to simplify the calculation of time-dependent Hartree-Fock equations. While the ZINDO results are usually not as accurate as highly Table 1. Mean absolute errors (MAEs) and root mean square errors (RMSE) for out-of-sample predictions by the ML model, together with typical error estimates of the corresponding reference level of theory. Errors are reported for all 14 molecular properties, and are based on out-of-sample predictions for 2211 molecules using a multi-task multi-layered NN ML model obtained by cross-validated training on 5000 molecules. The corresponding true versus predicted scatter plots feature in figure 3. Property labels refer to the level of theory and molecular property, i.e. atomization energy (E ref ), averaged molecular polarizability (α), HOMO and LUMO eigenvalues, ionization potential (IP), electron affinity (EA), first excitation energy (E * 1st ), excitation frequency of maximal absorption (E * max ) and the corresponding maximal absorption intensity (I max ). To guide the reader, the mean value of the property across all 7211 molecules in the database is shown in the second column. Energies, polarizabilities and intensity are in eV, Å 3 and arbitrary units, respectively. correlated methodologies, the semiempirical Hamiltonian reproduces the most important features of the absorption spectra of many small molecules and complexes, particularly characterizing their most intense bands on the UV-vis spectra. All properties are provided in the supplementary material (available from stacks.iop.org/NJP/15/095003/mmedia). Similar conclusions hold for the selected levels of theory: the employed methods can be considered to represent a reasonable compromise between computational cost and predictive accuracy. It should be mentioned that ML methods can, in principle, be applied to any method or level of approximation.

Training the model
Our model consists of a deep and multi-task NN [37,38] that is trained on molecule-properties pairs. It learns to map Coulomb matrices to all 14 properties of the corresponding molecule simultaneously. NNs are well established for learning functional relationships between the input and the output. They have successfully been applied to different tasks such as object recognition [39] and speech recognition [40]. Given a sufficiently large NN, its universal approximation capabilities [41] and the existence of the underlying noise-free SE, an NN solution can be expected to exist that satisfyingly relates molecules to their properties. Specifically, a deep NN will properly unfold, layer after layer, a complex input into a simple representation of molecular properties. Finding the true relationship unfolding among those that fit the training data can be challenging because there is typically a manifold of solutions. The multi-task setup forces the NN to predict multiple properties simultaneously. This is conceptually appealing because these additional constraints narrow down the search for the 'true model' [42], as the set of models that fit all properties simultaneously is smaller. Details of the NN training procedure can be found in the supplementary material (available from stacks.iop.org/NJP/15/095003/mmedia).

Results and discussion
Before reporting and discussing our results, we note the long history of statistical learning of the potential energy hyper surface for molecular dynamics applications. It includes, for example, the modeling of potential energy surfaces with artificial NNs starting with the work of Sumpter and Noid in 1992 [43][44][45][46][47][48][49] or Gaussian processes [50,51]. Our work aims to go beyond single molecular systems and learn to generalize to unseen compounds. This extension is not trivial, as the input representation must deal with molecules of diverse sizes and compositions in the absence of one-to-one mapping between atoms of different molecules.

Database
Scatter plots among all properties for all the molecules are shown in figure 1. Visual inspection confirms the expected relationships between various properties: Koopman's theorem relating ionization potential to the HOMO eigenvalue [52], the hard soft acid base principle linking polarizability to stability [53] or electron affinity correlating with the first excitation energy. Correlations of identical properties at different levels of theory reveal more subtle differences. Polarizabilities, calculated using PBE0 or with the more approximate SCS model [28], are strongly correlated. Also, less well-known relationships can be extracted from these data. One can obtain to a very decent degree, for example, the GW HOMO eigenvalues by subtracting 1.5 eV from the corresponding PBE0 HOMO values.
Some properties, such as atomization and HOMO energies, exhibit very little correlation in their scatter plot. The inset of figure 1 illustrates how our QM (i.e. the NN-based ML model) extracts and exploits hidden correlations for these properties despite the fact that they cannot be recognized by visual inspection. Similar conclusions hold for atomization energy versus first excitation energy or polarizability versus HOMO energy.

Accuracy versus training set size
It is an important feature of any ML model that the error can be controlled systematically as the training set size is varied. We have investigated this dependence for our ML model. Figure 2 shows a typical decay of the ML model's mean absolute error (MAE) for predicting the properties of 'out-of-sample' molecules as the number of molecules in the training set increases logarithmically from 500 to 5000. For all the investigated properties, the improvement of error suggests that the MAE could still be lowered even further through the addition of more molecules. However, since the reference method's 'precision' (i.e. the estimated accuracy of the employed level of theory) is reached for almost all properties already using 5000 examples, adding further examples does not make sense. For the atomization energy the decay is particularly dramatic: a tenfold increase in the number of molecules (500 → 5000) reduces the error by 70%, from 0.55 to 0.16 eV. But also for the HOMO/LUMO eigenvalues, the error reduces substantially. We find that the expected error decay law of ∝1/ √ N is only recovered for the atomization energy; for other properties the error decays more slowly. Figure 2 also features the statistical error bars for the MAEs-a measure of outliers. The error bar is only slightly larger than symbol size, and hardly varies as the training set increases and the testing set decreases.

The final machine learning model
After cross-validated training on the largest training set with 5000 randomly selected molecules, 2211 predictions have been made for the remaining 'out-of-sample' molecules, yielding at once all 14 quantum chemical properties per molecule. The corresponding true versus predicted scatter plots feature in figure 3. The corresponding mean absolute and root-mean square errors (RMSE) are shown in table 1, together with the literature estimates of errors typical of the corresponding level of theory. Errors of all properties range in the single digit percent of the mean property. Remarkably, when compared to published typical errors for the corresponding level of theory, i.e. used as a reference method for training, similar accuracy is obtained-the sole exception being the most intense absorption and its associated excitation energy. This, however, is not too surprising: extracting the information about a particular excitation energy and the associated absorption intensity requires sorting the entire optical spectrum-thus encoding significant knowledge that was entirely absent from the information employed for training. For all other properties, however, our results suggest that the presented ML model makes 'out-of-sample' predictions with an accuracy competitive with the employed reference methods. These methods include some of the more costly state-of-the-art electronic structure calculations, such as GW results for HOMO/LUMO eigenvalues and hybrid DFT calculations for atomization energies and polarizabilities. Work is in progress to extend our ML approach to other properties, such as the prediction of ionic forces or the full optical spectrum. We note, however, that for the purpose of this study any level of theory and any set of geometries could have been used.
The remarkable predictive power of the ML model can be rationalized by (i) the deep layered nature of the NN model that permits us to progressively extract the relevant problem subspace from the input representation and gain predictive accuracy [59,60]; (ii) inclusion of random Coulomb matrices for training, effectively imposing invariance of property with respect to atom indexing, clearly benefits the model's accuracy: additional tests suggest that using random, instead of sorted or diagonalized [9], Coulomb matrices also improves the accuracy of kernel ridge regression models to similar degrees; and (iii) the multi-task nature of the NN accounts for the strong and weak correlations between seemingly unrelated properties and different levels of theory. Aspects (i) and (iii) are also illustrated in figure 4.
We reiterate that evaluation of all 14 properties at the said level of accuracy for an outof-sample molecule requires only milliseconds using the ML model, as opposed to several CPU hours using the reference methods used for training. The downside of such accuracy, of course, is the limit in transferability. All ML model predictions are strictly limited to out-of-sample molecules that interpolate. More specifically, the 5000 training molecules must resemble the query molecule in a similar fashion as they resemble the 2211 test molecules. For compounds that bear no resemblance to the training set, the ML model must not be expected to yield accurate predictions. This limited transferability might one day become moot through a more intelligent choice and construction of molecular training sets tailored to cover all of a pre-defined chemical compound space, i.e. all of the relevant geometries and elemental compositions, up to a certain number of atoms.

Conclusion
We have introduced a ML model for predicting the electronic properties of molecules based on training deep multi-task artificial NNs in chemical space. Advantages of such a QM (conceptually speaking, as illustrated in figure 1) are the following: (i) multiple dimensions: a single QM execution simultaneously yields multiple properties at multiple levels of theory; (ii) a systematic reduction of error: by increasing the training set size the QM's accuracy can be converged to a degree that outperforms modern quantum chemistry methods, hybrid densityfunctional theory and the GW method in particular; (iii) a dramatic reduction in computational cost: the QM makes virtually instantaneous property predictions; (iv) user-friendly character: training and the use of the QM do not require knowledge about the electronic structure or even about the existence of the chemical bond; (v) arbitrary reference: the QM can learn from data corresponding to any level of theory, and even experimental results. The main limitation of the QM is the empirical nature inherent in any statistical learning method used for inferring solutions, namely that meaningful predictions for new molecules can only be made if they fall in the regime of interpolation.
We believe our results to be encouraging numerical evidence that ML models can systematically infer highly predictive structure-property relationships from high-quality databases generated via first-principles atomistic simulations or experiments. In this study, we have demonstrated the QM's performance for a rather small subset of chemical space, namely for small organic molecules with only up to seven atoms (not counting hydrogen) as defined by the GDB. Due to its inherent first principles setup, we expect the overall approach to be equally applicable to molecules or materials of arbitrary size, configurations and composition-without any major modification. We note, however, that in order to apply the QM to other regions in chemical space with a similar accuracy differing amounts of training data might be necessary.
We conclude that combining reliable databases with ML promises to be an important step toward the general goal of exploring chemical compound space for the computational bottom-up design of novel and improved compounds. We construct a four-layer NN with 2000, 800, 800 and 1000 nodes at each layer. The network implements the function φ −1 • , where functions f 1 , f 2 and f 3 between each layer correspond to a linear transformation learned from data followed by a sigmoid nonlinearity. The NN is trained to minimize the MAE of each property using the stochastic gradient descent algorithm (SGD) [62]. Errors are back-propagated [63] from the top layer back to the inputs in order to update all parameters of the model. We run 250 000 iterations of the SGD and present at each iteration 25 training samples. During training, each molecule-property pair is presented in total 1250 times to the NN, but each time with different atom indexing. A moving average of the model parameters is maintained throughout training in order to attenuate the noise of the stochastic learning algorithm [64]. The moving average is set to remember the last 10% of the training history and is used for the prediction of outof-sample molecules. Training the NN on a CPU takes ∼24 h. Once the NN has been trained, the typical CPU time for predicting all 14 properties of a new out-of-sample molecule is ∼100 ms. Prediction of an out-of-sample molecule is obtained by propagating ten different realizations of p(M) and averaging outputs. Prediction of multiple molecules can be easily parallelized by replicating the trained NN on multiple machines. For more details on training neural networks, the reader is referred to [65,66]