Machine learning for the solution of the Schrödinger equation

Machine learning (ML) methods have recently been increasingly widely used in quantum chemistry. While ML methods are now accepted as high-accuracy approaches to construct interatomic potentials for applications, the use of ML to solve the Schrödinger equation, either vibrational or electronic, while not new, is only now making significant headway towards applications. We survey recent uses of ML techniques to solve the Schrödinger equation, including the vibrational Schrödinger equation, the electronic Schrödinger equation, and the related problems of constructing functionals for density functional theory (DFT) as well as potentials which enter semi-empirical approximations to DFT. We highlight similarities and differences, specific difficulties that ML faces in these applications, and possibilities for cross-fertilization of ideas.


Introduction
Machine learning (ML) has been increasingly used in theoretical and computational chemistry and in computational materials science. It has been particularly successful in accelerating materials research with ML-assisted alloy design, discovery of solar cell and battery materials, etc [1][2][3][4][5][6]. These uses of ML effectively create shortcuts through the field of quantum chemistry calculations. Quantum chemistry calculations themselves are ultimately based on the solution of the nuclear Schrödinger equation (SE) or the electronic SE. The nuclear SE is usually solved on a potential energy surface (of electronic energies) to obtain vibrational properties or reaction dynamics [7][8][9]. The electronic SE is solved either directly, as a (high-dimensional) partial differential equation, or indirectly (with density functional based methods) to obtain electronic energies, band structures, and various electronic properties [10][11][12][13][14][15][16].
In the application to the construction of interatomic potentials, machine learning has grown from being considered a curiosity of mediocre accuracy in the 1990s [17,18] to a highly accurate approach which is accepted as such in the applied literature. This followed a series of developments since the mid-2000s, from the first spectroscopically accurate molecular potential energy surfaces by Manzhos and Carrington [19] to accurate potentials suitable for large-scale solid-state and nanoparticle simulations by Behler et al [20][21][22]. These developments were achieved by combining general properties of ML methods, mostly NNs (neural networks) and GPR (Gaussian process regression), with modifications specific to the application, such as sum-of-product representations [23], expansions of the energy over orders of coupling [24,25], or its representation as a sum of atomic energies dependent on the chemical environment [26,27], as well as data selection schemes specific to the application [19,28]. The field of machine learning for interatomic potentials has been active and diverse and will not be reviewed here.
Exploration of ML for the solution of the vibrational and electronic Schrödinger equation is also not new (as will be shown below), but as of yet ML has not established itself as a highly accurate approach to solve the SE in applications. There are signs that it is on the verge of doing so, as promising results have recently appeared in the literature for the solution of long-outstanding, genuinely difficult problems such as the construction of kinetic energy and exchange-correlation functionals as well as the solution of the multi-electron SE, as described below. It is likely that in this direction too, highly accurate solutions and acceptability for applications will be achieved by combining generic properties of ML methods, such as universal approximation, with application-specific representations and data selection schemes.

Machine learning for solution of the vibrational Schrödinger equation
ML has been applied to the solution of the vibrational SE. In some works, the wavefunction was represented as values on a grid and in others, as a continuous function (such as a NN model). Zeiri et al [52] used a genetic algorithm (GA) to solve the SE on a one-dimensional double-well potential. GA populations consisted of numerical values of the wavefunction on an equidistant grid. The algorithm searched for the state with an energy closest to a reference energy. The lowest state was found as the state closest to zero reference energy, the first excited state as the closest state orthogonal to the ground state, and so on. A similar approach was used by Saha et al for a one-dimensional symmetric double well problem and a two-dimensional coupled harmonic oscillator, considering the ground and the first excited state [53].
Makarov and Metiu [54] used genetic programming, whereby a functional form of the wavefunction was constructed from primitives. The wavefunction was represented as a tree with nodes for various functions or operators and 'leaves' for parameters or variables. A GA-like algorithm was used to find the best function. In this way they constructed wavefunctions for one-dimensional single- and double-well potentials; the algorithm was helped by imposing asymptotic conditions [54].
Neural networks were used to directly represent bound-state wavefunctions, mostly for synthetic problems. Lagaris et al [55] used a NN with ridge activation functions [56] to solve the SE, (T̂ + V(r))Ψ(r) = EΨ(r), where r spans the configuration space, T̂ is the KEO (kinetic energy operator), and V is the potential, for a one-dimensional Morse potential as well as for two and three coupled anharmonic oscillators (i.e. potentials often used to model inter-atomic interactions) for several lowest-energy states. The wavefunction was represented with a single hidden layer NN with sigmoid activation functions, multiplied by exponential factors assuring correct asymptotic behavior for bound states. The NN parameters were tuned to minimize the residual of the SE at a set of points r_i on equidistant grids, and the energy was computed from the wavefunction (E = ⟨Ψ|Ĥ|Ψ⟩, Ĥ = T̂ + V). We note that the authors' use of a single hidden layer NN (which is a universal approximator [57,58]) is equivalent to using a spectral representation with a tunable basis, and that minimizing the pointwise residual is equivalent to rectangular collocation (i.e. solving the SE by satisfying it at a set of collocation points rather than variationally) [59]. This has the advantage that the wavefunction expansion need not be integrable or differentiable everywhere, allowing for the use of discontinuous or non-integrable functions (the common sigmoidal functions, e.g., can have infinite integrals over their infinite support), which only need to be smooth at the collocation points [23,28,60]. Lagaris et al also solved the variational problem with a NN wavefunction.
Nakanishi and Sugawara [61,62] used an NN with ridge activation functions to represent the wavefunctions of a dozen lowest-energy levels of a one-dimensional SE for a harmonic oscillator, a double-well potential, and a Morse oscillator, obtaining level accuracy on the order of 0.1% with fewer than 10 neurons. A genetic algorithm was used to pre-optimize the NN parameters before derivatives-based optimization. The energy was a fitting parameter. They used an amplitude-phase decomposition of the wavefunction in which the amplitude A(x) and the phase S(x) were separate single hidden layer NNs. The asymptotic behavior was not artificially enforced but was obtained through NN training. The NNs were trained to minimize the residual on a set of about 100 uniformly distributed random points within the range where the wavefunction has appreciable amplitude. The above works used a separate NN for each state, with all NN parameters trained together. Manzhos and Carrington [63] used linear algebra to determine the linear coefficients in the NN representation of the wavefunctions, and many levels were computed at the same time from one set of nonlinear NN parameters. NNs with radial basis functions as activation functions (RBFNN), which are also universal approximators [64,65], were used; their localized neurons ensure the correct asymptotic behavior of bound states without explicit decay factors. It was suggested that RBFNNs escape the curse of dimensionality [66]. The energy was a fitting parameter that was adjusted to be equal to that computed from the wavefunction. A method by Boutry et al for rectangular matrix pencils was used to solve the collocation equations [67]. In this way, the SE for coupled harmonic oscillators in up to six dimensions (d = 6) was solved for several lowest-energy levels.
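The linear-algebra side of such rectangular collocation schemes can be sketched compactly. The toy example below is not the code of any of the cited works: it uses a fixed Gaussian basis in place of optimized neurons and a simple least-squares (pseudoinverse) reduction of the rectangular pencil instead of the Boutry et al method, for the 1D harmonic oscillator H = −½ d²/dx² + x²/2 (lowest levels 0.5, 1.5, ... in atomic units).

```python
import numpy as np

# Gaussian basis functions phi_j(x) = exp(-(x - c_j)^2 / (2 sigma^2)),
# a fixed stand-in for the tunable neurons of the NN-based works.
centers = np.linspace(-5.0, 5.0, 21)
sigma = 1.0

# Collocation points: more points than basis functions ("rectangular").
x = np.linspace(-6.0, 6.0, 81)

X = x[:, None] - centers[None, :]                 # (n_points, n_basis)
phi = np.exp(-X**2 / (2 * sigma**2))              # basis values at points
d2phi = phi * (X**2 / sigma**4 - 1.0 / sigma**2)  # analytic 2nd derivative

# H phi evaluated at the collocation points, H = -1/2 d^2/dx^2 + x^2/2
Hphi = -0.5 * d2phi + 0.5 * (x**2)[:, None] * phi

# Least-squares reduction of the rectangular pencil  Hphi c = E phi c
M = np.linalg.pinv(phi) @ Hphi                    # (n_basis, n_basis)
levels = np.sort(np.linalg.eigvals(M).real)       # approximate spectrum
```

The lowest eigenvalues closely approximate the exact levels 0.5 and 1.5; with a tunable (NN-optimized) basis, far fewer functions would be needed for the same accuracy.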
Rather than using an equidistant collocation grid, it was found advantageous to use a non-uniform point distribution in which the points were selected from an 'inverse potential' probability density function: points r sampled uniformly and randomly in the configuration space were accepted if a random number rand, uniformly distributed over [0, 1], fell below an acceptance probability decreasing from 1 at the potential minimum to a small value at V_max, the maximum used value of the potential; a bias parameter was used to adjust the point density at V_max. Only 5000 collocation points and on the order of 50 neurons were used. This distribution was found to be more efficient than a uniform grid, similar to the experience when fitting potential energy surfaces or solving the electronic Schrödinger equation [19,28,68,69]. This approach was also used by the same authors to represent vibrational wavefunctions of a real molecule, H2O [70]. It was not obvious how the accuracy achieved on synthetic potentials in prior works would translate to real molecules, and this work provided that insight: it took 6000 collocation points and 50 neurons to achieve an accuracy of about 2 cm⁻¹ for the first five vibrational levels.
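The sampling idea can be sketched as simple rejection sampling. The specific acceptance rule below, P(accept) ∝ (V_max − V(r) + bias), is an assumed stand-in for the published criterion, chosen only to reproduce the qualitative behavior: points concentrate where the potential is low (where the wavefunctions have amplitude), with a nonzero bias-controlled density near V_max.

```python
import numpy as np

rng = np.random.default_rng(0)

def V(x):
    return 0.5 * x**2            # toy 1D harmonic potential

V_max, bias = 8.0, 0.5           # energy cutoff and floor density control
n_target = 2000

points = []
while len(points) < n_target:
    x = rng.uniform(-4.0, 4.0)   # uniform proposal in configuration space
    # accept with probability decreasing from ~1 (V=0) to bias/(V_max+bias)
    if V(x) <= V_max and rng.random() < (V_max - V(x) + bias) / (V_max + bias):
        points.append(x)
points = np.array(points)

# fraction of points in the low-potential region |x| < 1 (V < 0.5);
# a uniform distribution would give 0.25
low_V = float(np.mean(np.abs(points) < 1.0))
```

With this weighting the low-potential region holds noticeably more than its uniform share of the points, which is the intended effect.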
Recently, Teng [71] used an RBFNN together with variational Monte Carlo (VMC) calculations to solve the SE for a two-dimensional harmonic oscillator in a linear potential and for a one-dimensional particle in a box with a linear potential. Space was sampled with a very large number (50 000) of points.
Spectroscopically accurate (better than 1 cm⁻¹) ML solutions for multiple (dozens or hundreds of) levels of the vibrational SE of real polyatomic molecules are still wanting. The computational power and ML algorithms that are now easily available suggest that they are feasible. One factor that makes achieving spectroscopic accuracy with real molecules cumbersome is the use of the exact KEO. The synthetic problems cited above used simple KEOs based on second derivatives, but the KEO is complicated in internal coordinates [72]. In the case of the water molecule, it was easy to apply the exact KEO analytically to the neurons, which is not the case for larger molecules. We have shown, however, that this issue is easily circumvented by applying the exact KEO numerically [73]. Sub-cm⁻¹ accuracy for a four-atomic molecule (d = 6) can be obtained with rectangular collocation when using about 100 000 collocation points and about 30 000 basis functions, without ML optimization of the basis or of the points [28,73,74]. It is expected that significant gains can be achieved for computational vibrational spectroscopy by basis set and point optimization with ML. This is supported by results to date with optimized bases and collocation points for vibrational [53,74,75] and electronic [69] (see below) problems. Multi-output NNs (as opposed to level-by-level calculations) are also expected to be useful.

Wavefunction representation
The same paper of Zeiri et al [52] described in section 2 also applied the same GA-based approach to a model electronic problem of the HF (Hartree-Fock) or DFT type (for two electrons) on a synthetic potential (a harmonic binding potential between the electron and the nucleus) as well as to the helium atom. The authors achieved an accuracy on the order of 10⁻² Ha and pointed out the inefficiency of the evenly spaced grid they used. Only the ground state was considered, using a single radial coordinate.
Chaudhury and Bhattacharyya [76] also represented the wavefunction as a string of values on a grid and used a GA to minimize the residual of the SE on the grid; the energy was computed from the wavefunction. One-dimensional (radial) SEs for an electron in a screened Coulomb potential and for an anharmonic (quartic) oscillator were solved, and the ground state of a two-electron (He) atom was also computed. They also applied the GA in the framework of perturbation theory, solving for the second- and higher-order corrections to the wavefunction, which were similarly encoded as values on a grid. The accuracy, however, was only on the order of 10⁻¹ Ha for the ground state of He. The same group later used a similar approach to solve the SE for an H atom including the first excited state (2s), which was produced by imposing an orthogonality constraint as penalty terms in the fitness function [53].
Caetano et al [77] used single hidden layer NNs and an amplitude-phase decomposition of the wavefunction similar to that of Nakanishi and Sugawara. A GA was used to find near-optimal NN parameters by minimizing the SE residual on a mesh. While standard sigmoid neurons were used in the hidden layer, the output layer had Gaussian neurons, which enforced the asymptotics. They compared results with the energy treated as a fitting parameter to results with the energy computed from the wavefunction; convergence with the latter was better. For a test case of a one-dimensional double harmonic oscillator, a 0.1 mHa accuracy was obtained for the ground state. A mHa accuracy was obtained for the ground state of the H atom (one-dimensional radial SE). The KS (Kohn-Sham) equation was solved for the He and Li atoms with a ~10 mHa accuracy in the total energy. They also considered the Hooke atom and the TF (Thomas-Fermi) atom.
The same paper of Lagaris et al [55] considered in section 2 also used their NN approach (the asymptotically corrected NN wavefunction) to solve radial SEs and Dirac equations (where separate NNs were used for the two parts of the reduced radial wavefunction) for muonic atoms, as well as the equation for the n + α system. Saito [78] used a standard single hidden layer NN to represent the wavefunction of the Bose-Hubbard model.
Choo et al [79] and Hermann et al [80] used NN representations of the wavefunction in conjunction with the variational Monte Carlo training approach to compute energies and potential curves of small molecules, approaching full CI (configuration interaction) accuracy.
Machine learning has also been used to select important configurations in the CI method. Coe used NNs to choose important configurations in an iterative CI scheme [81,82]. This significantly sped up convergence and reduced the overall CPU cost of single-point calculations as well as of building potential energy curves for N2, H2O, and CO. Potential curves of near-full CI quality were obtained at a tenth of the cost. Single hidden layer NNs were used, with the (binary) occupations of the spin orbitals in the configuration of interest as inputs. The output was the predicted transformed coefficient of the configuration, fitted to the coefficients in the training set. Advantages over stochastic configuration selection were demonstrated.
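The selection idea, predicting a configuration's importance from its occupation vector and keeping only the top-ranked configurations, can be illustrated with a deliberately simplified stand-in. Coe used a single hidden layer NN on real CI data; here a linear model on a synthetic coefficient rule (both assumptions, for brevity) is enough to show the ranking step.

```python
import numpy as np

rng = np.random.default_rng(1)
n_orb, n_conf = 12, 400

# Binary spin-orbital occupation vectors, one row per configuration
occ = rng.integers(0, 2, size=(n_conf, n_orb)).astype(float)

# Synthetic 'true' CI coefficients: log-magnitude decays with a weighted
# count of occupied high-energy orbitals (a stand-in for real CI data).
orb_penalty = np.linspace(0.1, 1.2, n_orb)
log_coef = -occ @ orb_penalty + 0.05 * rng.standard_normal(n_conf)

# Train a linear model on half of the configurations, rank the other half
train, test = np.arange(0, n_conf, 2), np.arange(1, n_conf, 2)
A_train = np.hstack([occ[train], np.ones((len(train), 1))])
w, *_ = np.linalg.lstsq(A_train, log_coef[train], rcond=None)
pred = np.hstack([occ[test], np.ones((len(test), 1))]) @ w

# How well does the predicted top-50 match the true top-50?
k = 50
true_top = set(np.argsort(log_coef[test])[-k:])
pred_top = set(np.argsort(pred)[-k:])
overlap = len(true_top & pred_top) / k
```

The high overlap between predicted and true important configurations is what allows an iterative scheme to skip explicit evaluation of most of the CI space.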
It is only recently that ML, specifically deep NNs, has started to be used to solve the multi-electron SE. Han et al [83] computed ground-state energies of Be, B, LiH, and a chain of 10 hydrogen atoms. The antisymmetric nature of the wavefunction was imposed by a product representation whose components were represented with multi-layer NNs. The ground state was obtained by minimizing the variance of the local energy. For two-electron systems, a mHa accuracy was achieved; the accuracy worsened towards 10⁻¹ Ha for the larger systems.
A more involved approach has been developed by Schütt et al [84] using a deep NN [85]. The NN inputs included atom types and positions. Descriptors of the chemical environment of atoms and atom pairs were deduced by the first NN stage and were used to predict the energy and the Hamiltonian and overlap matrices in a basis of atomic orbitals. Ultimately, basis coefficients were predicted, providing the wavefunction. A mHa accuracy was achieved for water, malondialdehyde, ethanol, and uracil.
The works of Han et al and Schütt et al are examples of combining a generic ML method with application-specific representations to achieve interesting accuracy. This is the road that was previously taken to achieve highly accurate potential energy surfaces, and it is also proving its value in ML solutions of the SE.

Basis tuning
Schütt and VandeVondele [86] used GPR to tune a localized basis set which achieved, with a minimal basis, near-complete-basis results for structural observables such as pair correlation functions in MD simulations of liquid water. The basis was built in a molecular-geometry-dependent way; the GPR descriptors were sensitive to the atoms' chemical environments.

ML as booster to lower-level methods to approximate higher level methods
There is often a high degree of correlation between the results of lower and higher levels of ab initio theory. On the basis of such correlation between Hartree-Fock and MP4 (Møller-Plesset 4th order perturbation theory) energies, Malshe et al [87] used an NN to predict MP4(SDQ) energies with the same basis set for configurations of vinyl bromide. The use of ML saved about 78% of the computational time that would be required for the MP4(SDQ) calculations. A single hidden layer NN with 80 to 140 neurons was used to achieve accuracies on the order of 0.03 eV.
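The booster idea exploits the fact that the difference between a cheap and an expensive method is usually smaller and smoother than either energy itself, so it is easier to learn. The sketch below uses synthetic 1D energy curves as stand-ins for the ab initio methods (an assumption; no real HF/MP4 data are involved) and a polynomial fit as the simplest possible regressor.

```python
import numpy as np

r = np.linspace(0.8, 3.0, 40)                # 'bond length' training grid

def e_low(r):                                # stand-in for the cheap method
    return (1 - np.exp(-1.5 * (r - 1.2)))**2

def e_high(r):                               # stand-in for the expensive method:
    return e_low(r) + 0.02 * np.exp(-(r - 1.5)**2)   # low level + small smooth shift

# Fit only the (small, smooth) correction with a low-order polynomial...
coef = np.polyfit(r, e_high(r) - e_low(r), deg=6)
# ...and, for comparison, fit the full high-level curve directly
coef_direct = np.polyfit(r, e_high(r), deg=6)

r_test = np.linspace(0.9, 2.9, 200)
e_pred = e_low(r_test) + np.polyval(coef, r_test)
rmse = np.sqrt(np.mean((e_pred - e_high(r_test))**2))
rmse_direct = np.sqrt(np.mean((np.polyval(coef_direct, r_test) - e_high(r_test))**2))
```

The correction-based prediction is far more accurate than fitting the high-level curve directly with the same model capacity, which is the essence of the approaches reviewed in this section.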
Earlier, Silva et al [88] used NNs to fit the correlation energy of atoms and diatomic molecules. They were able to recover about 90% of the correlation energy as a function of atomic charges and interatomic bond lengths with a single hidden layer NN.
McGibbon et al [89] trained an NN on a large set (on the order of 10⁵ points) of CCSD(T) (coupled cluster singles and doubles with perturbative triples) data to improve MP2 (Møller-Plesset 2nd order perturbation theory) results by predicting corrections to the energies. This was shown to significantly improve binding energies of weakly bound compounds dominated by non-covalent interactions. An NN with 10 hidden layers of 50 neurons each was trained as a function of 22 features to reweight the energy components of MP2 to fit the CCSD(T) data.
Margraf and Reuter [90], on the other hand, used MP2 to reduce the computational cost of coupled cluster calculations: approximate T-amplitudes estimated using MP2 were used as inputs for KRR (kernel ridge regression) machine learning to predict the CCSD correlation energy.
Welborn et al [91] used GPR to predict MP2 and CCSD correlation energies from HF calculations, using the Fock matrix F, Coulomb matrix J, and exchange matrix K as descriptors. They demonstrated degrees of transferability among geometries, molecules, and elements when training the GPR model on configurations of a training set of small molecules. The same group later used GPR to predict correlation energies based on B3LYP calculations [92]. MP2 amplitudes, together with other descriptors such as HF matrix elements (F, J, K), two-electron integrals, and MP2 denominators, were also used to learn CCSD amplitudes by Townsend and Vogiatzis [93] using the k-nearest neighbors algorithm [94].

ML for data selection
Ku et al [69] applied ML to optimize the collocation point set when solving the KS equation with the rectangular collocation method [68]. They used a combination of GPR and a GA to reduce the number of points from about 50 000 to about 2000 while maintaining mHa accuracy. Important in such optimization is the avoidance of repeated solution of the SE on trial point sets, which would largely defeat the purpose. Instead of solving the KS equation on trial point sets, Ku et al showed that a good solution can be achieved with points which are best for the ML representation of the potential.
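The idea of choosing points that best represent the potential, without ever solving the SE on trial sets, can be sketched with a generic active-learning loop: greedily add the point where a GPR surrogate of V is most uncertain. This is only an illustration of the principle, not the actual GPR+GA scheme of Ku et al.

```python
import numpy as np

def V(x):
    return 0.5 * x**2                         # toy 1D potential

def k(a, b, length=0.6):                      # Gaussian covariance kernel
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * length**2))

candidates = np.linspace(-4.0, 4.0, 201)
selected = [0.0]                              # start from a single point

for _ in range(14):                           # grow the set to 15 points
    S = np.array(selected)
    K = k(S, S) + 1e-10 * np.eye(len(S))      # jitter for stability
    Ks = k(candidates, S)
    # GPR posterior variance at every candidate (prior variance = 1)
    var = 1.0 - np.einsum('ij,ij->i', Ks @ np.linalg.inv(K), Ks)
    selected.append(candidates[np.argmax(var)])

# How well do the selected points represent V via GPR interpolation?
S = np.array(selected)
K = k(S, S) + 1e-10 * np.eye(len(S))
pred = k(candidates, S) @ np.linalg.solve(K, V(S))
max_err = float(np.max(np.abs(pred - V(candidates))))
```

The variance criterion spreads the points over the whole domain without any reference to the solution of the SE, which is the property that makes such schemes cheap.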

Machine learning for density functional theory
Similar to the uses of ML for solving the SE, applications to DFT have often been explored on simplified synthetic systems, including one-dimensional systems, which as of this writing continue to be prominent in the literature, rather than on real-life molecules and solids. One reason for this is that the exact ground-state energies (and energy components) and densities used in ML can then be computed from the exact solution of the SE. In the last couple of years, however, accurate machine-learned XC (exchange-correlation) functionals, kinetic energy functionals (KEF), and total energy functionals have been reported.

Exchange-correlation functionals
The field of ML for exchange-correlation functionals is in full bloom. While machine-learned functionals are still not available in end-user codes, promising results are appearing that create hope of ML functionals entering applied research soon. While many works continue to focus on synthetic, often low-dimensional systems, machine-learned functionals for real-life systems have also appeared.
Nagai et al [95] used a multi-layer NN to map the electron density to the Hartree-exchange-correlation potential for a test system, a one-dimensional two-particle spinless fermion model. A multi-layer NN was also used to learn the electron correlation in the Anderson-Hubbard model by Ma et al [96]. Custodio et al [97] used single- and two-hidden-layer NNs to learn an LSDA-type functional for a one-dimensional Hubbard model. Schmidt et al [98] used a neural network to learn a universal exchange-correlation functional that simultaneously reproduces both the exact exchange-correlation energy and the potential, for a one-dimensional system with two strongly correlated electrons in a non-singular potential. A multi-layer NN with exponential linear units (rather than common sigmoidal neurons) was used, and symmetry was built in by fixing the first-layer weights. Ryczko et al [99] used a large deep NN to predict correlation, exchange, external, kinetic, and total energies simultaneously for model systems with harmonic oscillator and simple random external potentials. They also showed that for these systems a deep NN is capable of predicting these quantities directly from the external potential, effectively obviating the need for the KS scheme.
Similar to the use of ML to solve the SE, while very small errors can be achieved on such model systems (e.g. 0.03% in the ground-state energy in the work of Custodio et al), they might not be representative of the accuracy which would be achieved with the same methods for real molecules or solids.
Zhou et al [100] machine-learned the exchange-correlation potential for real molecules. They showed that, using a convolutional NN that learns the exchange-correlation potential from the electron density based on small-molecule CCSD data, one can achieve portability to larger molecules. The convolutional NN effectively generated a density-derived description inside itself. More accurate bond lengths and vibrational frequencies were obtained with the NN functional than with B3LYP for small molecules. The NN was also able to learn the weak vdW (van der Waals) interactions from the CCSD data. Lei and Medford [101] fitted the exchange-correlation portion of B3LYP formation energies of small molecules to chemical accuracy. They used a two-hidden-layer NN with rectified linear neurons. Space-distributed spherical harmonics were used as rotationally invariant descriptors rather than the density directly. This work used a very large data set (10 million points) and a large test set (0.55% training data and 99.45% testing data). Kolb et al [102] also used single hidden layer NNs to learn the exchange-correlation energies of NH3 as functionals of the density, using B3LYP and CCSD calculations as reference, with near-chemical accuracy. KRR was used by Ji and Jung [103] to learn PBE and LDA densities and exchange-correlation potentials of small molecules as a function of a representation of the molecule based on spherical averages of modified pseudopotentials at grid points; mHa accuracy was obtainable on individual molecules but worsened to dozens of mHa when the model was used on other molecules.
Nagai et al [104] also used a (multi-layer) NN to learn a non-local XC functional. Atomization and total energies of small molecules as good as those from popular hybrid and meta-GGA (generalized gradient approximation) functionals were obtained, as well as transferability beyond the training set. The NN was trained using Monte Carlo updates of the weights alternated with self-consistent KS-DFT calculations, rather than by back-propagation.
ML is also useful for improving existing XC functionals. For example, Li et al [105] used an NN to optimize the range-separation parameter of the LC-BLYP functional. A single hidden layer NN was used, with descriptors including the number of atoms, the number of electrons, the ZPE (zero-point energy), and the dipole moment, as well as descriptors of pairwise and self-interaction. A balanced accuracy over reaction energies and barriers, ionization potentials, and electron affinities of a set of molecules was achieved.
Bayesian optimization has also been used to improve exchange-correlation functionals. Fritz et al [106] used it to optimize a GGA vdW functional for calculations on water, based on a pool of 15 GGA exchange functionals. Vargas-Hernandez [107] was able to slightly improve molecular atomization energies and bond lengths by optimizing the free parameters of hybrid exchange-correlation functionals with Bayesian optimization; the method was also able to identify the most appropriate exchange-correlation functional.
Nudejima et al [108] used a two-hidden-layer neural network with rectified linear neurons to learn the correlation energy density of a CCSD(T) reference. The descriptors included the electron density, its gradient, and the kinetic energy density, as well as the Hartree-Fock exchange energy density and the electron density based on the fractional occupation numbers of molecular orbitals. The model was able to achieve highly accurate fits (R² > 0.999) to the correlation energy density and could compete with several popular functionals for reaction energies between small molecules.

Kinetic energy functionals
Widespread use of OFDFT (orbital-free DFT) in applications beyond light metals is held back by the absence of sufficiently accurate KEFs. The high stakes of the KEF issue are well known [109,110] and will not be repeated here. Recent years have witnessed the appearance of several works using ML for KEF construction for systems where previous approximations were not good enough for use in applications (i.e. for practically all types of matter beyond light metals). Typically, the non-interacting (KS) KEF is the target, and one attempts to learn the KS KED (kinetic energy density) τ = −(1/2) Σᵢ φᵢ*∇²φᵢ or its positive-definite version τ₊ = (1/2) Σᵢ |∇φᵢ|², where the sums are over occupied orbitals φᵢ, as a functional of a number of density-dependent variables (the density and its various derivatives and powers). Often, rather than aiming to reproduce τ or τ₊ directly, one fits the enhancement factor (or enhancement functional) F = τ₊/τ_TF/vW, where τ_TF/vW is the Thomas-Fermi or von Weizsäcker KED or their hybrid. Other target functions are possible, such as G = (τ − τ_vW)/τ_TF [111]. A significant fraction of work on ML for KEFs was done on model and one-dimensional systems [112]. In their seminal works, Snyder et al [113,114] used KRR to learn the energy of non-interacting 'electrons' in 1D as a function of the density, using model potentials: sums of three Gaussian-like wells [113] and two softened Coulombic potentials representing a 1D 'diatomic molecule' [114]. They obtained the correct dissociation behavior of the 1D models of H2, He2, Li2, Be2, and LiH. Burke and co-workers also used KRR to learn the universal energy functional F[n] = min_{Ψ→n} ⟨Ψ|T̂ + V̂ee|Ψ⟩ (subsuming the kinetic energy), where T̂ is the KEO and V̂ee is the electron repulsion operator, for a 1D system modeling chains of hydrogen atoms [115]. The KRR worked on a representation of the density in a basis set specially constructed for this purpose. On these low-dimensional model systems, an equivalent of chemical accuracy was obtained.
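The KED quantities defined above can be illustrated numerically on the hydrogen 1s density n(r) = e⁻²ʳ/π (atomic units). For a one-orbital system the von Weizsäcker functional is exact, so integrating τ_vW = |∇n|²/(8n) must recover the exact KS kinetic energy of 0.5 Ha, while the Thomas-Fermi value differs; their ratio at each point is what an enhancement factor encodes.

```python
import numpy as np

# Radial grid (a.u.); the density is negligible beyond r = 20
r = np.linspace(1e-6, 20.0, 200000)
dr = r[1] - r[0]

n = np.exp(-2 * r) / np.pi                    # hydrogen 1s density
dn = -2 * n                                   # dn/dr (analytic)

tau_vW = dn**2 / (8 * n)                      # von Weizsaecker KED
c_TF = 0.3 * (3 * np.pi**2)**(2 / 3)          # Thomas-Fermi constant
tau_TF = c_TF * n**(5 / 3)                    # Thomas-Fermi KED

# Radial integration: T = 4*pi * integral of r^2 tau(r) dr
T_vW = float(np.sum(4 * np.pi * r**2 * tau_vW) * dr)
T_TF = float(np.sum(4 * np.pi * r**2 * tau_TF) * dr)
```

T_vW comes out at the exact 0.5 Ha, whereas T_TF is substantially lower (about 0.29 Ha), illustrating why a density-dependent enhancement factor on top of these base functionals is needed.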
Ultimately, the methods have to be validated on real-life systems, which has recently been done. The same paper of Kolb et al [102] mentioned above also presented an NN fit of the kinetic energy of NH3. They were not able to achieve chemical accuracy but improved by orders of magnitude over the TF-vW model (a linear combination of the Thomas-Fermi and von Weizsäcker functionals). They fitted the kinetic energy as a function of the density and did not explore the choice of density-dependent variables, which appears to be critical to achieving chemical accuracy (vide infra).
Yao and Parkhill [116] used a convolutional NN to fit the enhancement factor F of a TF-vW hybrid. They explored the density and the scaled (dimensionless) modulus of the gradient of the density as NN inputs. Potential energy minima along bond coordinates were obtained; however, the potential energy curves were jagged, and the fit showed a significantly higher test-set error than fitting-set error (overfitting).
Golub and Manzhos used NNs to fit the KS KED of metallic as well as (solid-state and molecular) covalently bound systems [117]. They used terms of the 4th-order gradient expansion as five density-dependent NN inputs. A single hidden layer NN with few (about a dozen) neurons was able to closely reproduce the KS KED without overfitting. However, deep NNs were required to reproduce the KS KED simultaneously for several materials, which is a prerequisite for a transferable KEF. We note that limited transferability is sufficient for use in applications: indeed, in applications of KS DFT a large number of XC functionals have been proposed and are used, each advantageous for the modeling of specific classes of materials or phenomena.
The Nakai group also produced accurate NN fits to KS KEDs [118]. They used electron densities and their gradients up to the third order as descriptors and fitted the enhancement factor. A three-hidden-layer NN with 30 neurons per layer was used. In a subsequent work [119], the group reproduced smooth potential energy curves of a number of small molecules, which is a major achievement. It was achieved with the introduction of new descriptors, namely the distances between the grid points and the nuclei. A five-hidden-layer NN with 50 neurons per layer was used.
Variable selection is an important issue in ML KEFs. While Yao and Parkhill used scaled gradients that satisfy exact conditions, Nakai and co-workers did not scale the density-dependent variables. Our own tests in the lead-up to [117] showed much worse numerics with scaling. In the work of Golub and Manzhos, as terms of the formal gradient expansion were used as variables, the exact conditions would be satisfied in the limit of a linear fit. The non-linearity of the NN fit was mild (most of the KED variance could be captured by a linear fit); as a result, the exact conditions were, if one can say so, 'nearly satisfied'. Burke and co-workers showed on one-dimensional synthetic problems that the use of scaled variables may or may not improve machine-learned KEFs [120]; they used the KRR method. It appears that using non-scaled variables is a useful line of exploration which should not be ignored, as exact uniform density scaling does not happen in Nature.
Rather than using multiple density-dependent variables (powers and derivatives), Brockherde et al [121] used a representation of the density in a specially constructed basis, and this representation was fed to KRR. This is an effective way to obviate the need for higher-order powers and derivatives of the density, which may be difficult to converge. They were able to machine-learn density-potential and density-energy mappings which were accurate enough to model several small molecules and to perform molecular dynamics simulations on malonaldehyde, capturing the intramolecular proton transfer process. This approach was recently used to machine-learn density functionals from coupled-cluster energies, based only on DFT densities, to chemical accuracy for water and ethanol [122]. The authors showed that one can reduce the computational effort by learning only the correction to a standard DFT calculation.
ML has also been used to learn the relation between the potential and the density in DFT: Wang et al used an RBFNN to learn such mappings along one-dimensional cuts through space and suggested that this approach can be used to deduce the underlying SE [123].

Data distribution
A critical issue in machine learning of KEFs is data sampling in the space of density-dependent variables. Golub and Manzhos [117] emphasized the extremely uneven distribution of KS KED data. They were able to achieve accurate NN fits to the KS KED without overfitting by using a combination of partial histogram equalization and (denser) sampling on two-dimensional surfaces (in the three-dimensional Cartesian space). We also observed a much smoother distribution of the function G (than of τ), which might be useful in future ML work on KEFs. This suggests that data pre-processing may be key to achieving good ML functionals. In a subsequent work [124], Golub and Manzhos also showed that minima on the potential energy surface could be found even with KEFs trained on single-geometry data. This suggests that it may be possible to sample the space of density-dependent variables without sampling the configuration space. This issue requires further exploration. When searching for suitable density-dependent variables, we also observed signs of multi-valuedness when using too few variables. The space of density-dependent variables should be of sufficient dimensionality for accurate KED machine learning.
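The partial histogram equalization idea can be sketched as follows (a schematic illustration under our own assumptions, not the procedure of [117]): the number of points retained per histogram bin of a descriptor is capped, so that over-represented regions of the space of density-dependent variables do not dominate the training set:

```python
import numpy as np

# Hedged sketch of partial histogram equalization for training-set selection:
# cap how many points are kept per histogram bin of a (here one-dimensional)
# descriptor, flattening a strongly skewed data distribution.

def equalize_sample(values, n_bins=50, cap=100, rng=None):
    rng = np.random.default_rng(rng)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    bins = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
    keep = []
    for b in range(n_bins):
        idx = np.flatnonzero(bins == b)
        if idx.size > cap:                      # over-represented bin: subsample
            idx = rng.choice(idx, size=cap, replace=False)
        keep.extend(idx.tolist())
    return np.sort(np.array(keep))

# Strongly skewed toy data: most points near zero, a long sparse tail
data = np.random.default_rng(0).exponential(scale=1.0, size=100_000)
sel = equalize_sample(data, n_bins=20, cap=200)
```

In a real KEF application the same capping would be applied in the multidimensional space of density-dependent variables rather than along a single descriptor.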

ML as booster to lower-level methods to approximate higher level methods
Balabin and Lomakina [125] used NNs to estimate the DFT energy with a large basis set from lower-level energy values and molecular descriptors. They estimated energies with 6-311G(3df,3pd) and 6-311G(2df,2pd) basis sets and different functionals from energies computed with smaller bases and from molecular descriptors (atomic composition, the gap, the polarizability, and dipole and quadrupole moments). An error on the order of 0.5 kcal mol−1 was achieved. They found that using energies obtained with several lower-level methods is advantageous. A single-hidden-layer NN with several neurons was used, and it was found that the NN structure did not depend significantly on the functionals or basis sets used. The NN approach was found to outperform multiple linear regression (MLR).
NNs were also used to boost the quality of optical absorption calculations. Li et al [126] trained a single-hidden-layer NN (pre-trained with a GA) to reduce the error of the calculated absorption energies of 150 organic molecules from 0.47 to 0.22 eV for the inexpensive TDDFT/B3LYP/6-31G(d) calculation. NN inputs included the TDDFT absorption energies, oscillator strengths, and molecular descriptors including the number of double bonds, the number of aromatic rings, the number of electrons, and the HOMO-LUMO gap. The same group earlier used the approach to correct energies of formation and compared NNs with MLR; similar to the work of Balabin and Lomakina, NNs were found to be more powerful [127]. Single-hidden-layer NNs were also used to improve heats of formation computed with various DFT and HF setups by Duan et al [128]. Ramakrishnan et al [129] used KRR to learn deviations of reference second-order approximate coupled-cluster (CC2) singles and doubles spectra from TDDFT on a large (up to 20 000) set of small organic molecules, and were able to achieve corrected excitation energies within 0.1 eV of the reference values for the test set.
Chen et al [130] used multi-layer NNs to boost their energy-based fragment method, which was used to compute excited states of large systems. Deep NNs were trained for monomer energies and for dimer and trimer interaction energies in the energy decomposition, for photochemically inert surroundings of the chromophore.
Ramakrishnan et al [131] used kernel ridge regression to learn corrections to lower-level methods (from semi-empirical to DFT to wavefunction methods) to approximate energies of highly correlated methods. Machine-learned quantities included enthalpies, free energies, entropies, and electron correlation energies, which were learned on a large set of organic molecules. The correction to a property was expressed as ∑_i α_i exp(−|R_i − R_b|/σ), where α_i are regression coefficients, σ is the kernel width (a hyper-parameter), and |R_i − R_b| is a measure of similarity between the descriptor vectors of the query molecule R_b and the training-set molecules R_i. For this measure, the Manhattan norm between Coulomb matrix representations [34] was used. In this way, staple methods (such as DFT/B3LYP) could be corrected to essentially G4MP2 or CCSD quality. This work also applied the approach to improving semi-empirical methods like PM7 or OM2 to DFT accuracy and to predicting correlation energy.
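A minimal sketch of this type of KRR ∆-learning model, with a Laplacian kernel and Manhattan distance, can be written in a few lines; here synthetic descriptors and a synthetic high-minus-low-level correction stand in for the Coulomb-matrix representations and reference data of [131]:

```python
import numpy as np

# Schematic Delta-learning KRR: fit regression coefficients alpha so that the
# correction at a query point is sum_i alpha_i * exp(-|R_i - R_query|_1 / sigma).

def laplacian_kernel(A, B, sigma):
    # K[i, j] = exp(-Manhattan distance between A_i and B_j / sigma)
    d = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    return np.exp(-d / sigma)

def krr_fit(X, y, sigma, lam=1e-8):
    K = laplacian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_query, sigma):
    return laplacian_kernel(X_query, X_train, sigma) @ alpha

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 3))      # toy molecular descriptors
delta_E = np.sin(X.sum(axis=1))            # toy high-minus-low-level correction
alpha = krr_fit(X, delta_E, sigma=1.0)
pred = krr_predict(X, alpha, X[:5], sigma=1.0)
```

In practice the kernel width σ and the regularization λ are hyper-parameters tuned, e.g., by cross-validation, and the predicted correction is added to the lower-level energy.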
Lentz and Kolpak [132] used NNs to predict band gaps obtained with a hybrid functional (HSE) based on the electron density obtained with the PBE functional. An RMSE of the gap of about 0.2 eV was achieved for a range of solid semiconductors, which was also shown to be much better than a linear regression between the PBE and HSE band gaps. They found that a single hidden layer was sufficient and outperformed more complex NN architectures.
ML has been used to improve a posteriori models of dispersion interactions. Gao et al [133] used NNs to fit dispersion corrections to DFT calculations with various functionals and relatively small basis sets, using CCSD(T)/CBS (complete basis set) as the benchmark. Root mean square errors of the DFT calculations relative to the benchmark were improved by at least 70% with this approach. Proppe et al [134] used GPR to refit the errors of the PBE functional with the D3(BJ) dispersion correction scheme, using CCSD(T) as a reference, on a set of about 1250 molecular dimers.
Dick and Fernandez-Serra [135,136] and Bogojeski et al [122] used NNs to learn corrections to energies (of gas phase and liquid water) obtained with a lower-level method, based on DFT densities. Rather than using the density directly as input, they developed atom-centered density-based descriptors and used Behler-type NNs (i.e. a sum of atomic contributions). Dick and Fernandez-Serra [136] used a delta-learning model (i.e. the functional is built as a correction to an existing non-ML functional) to learn an XC functional as a function not only of the density but also of atomic types and positions. They also represented the density in an atom-centered basis and used a Behler-Parrinello type neural network. Accuracy better than 10 meV vs CCSD(T) was obtained for dissociation curves of several small molecules.
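The 'sum of atomic contributions' architecture can be illustrated schematically (a toy numpy sketch under our own assumptions, not the networks of the cited works): each atom contributes an energy from a per-element network applied to its atom-centered descriptor vector, which makes the total invariant under permutations of like atoms:

```python
import numpy as np

# Schematic Behler-type model: E = sum_i NN_{Z_i}(G_i), one small network per
# element, applied to atom-centered descriptors G_i. Weights here are random
# placeholders; in practice they are trained on reference energies/forces.

rng = np.random.default_rng(0)

def make_net(n_in, n_hidden):
    return {"W1": rng.normal(size=(n_in, n_hidden)) * 0.1,
            "b1": np.zeros(n_hidden),
            "w2": rng.normal(size=n_hidden) * 0.1,
            "b2": 0.0}

def atomic_energy(net, g):
    h = np.tanh(g @ net["W1"] + net["b1"])     # single hidden layer
    return h @ net["w2"] + net["b2"]

def total_correction(nets, species, descriptors):
    # permutationally invariant: sum over atoms of per-element networks
    return sum(atomic_energy(nets[z], g) for z, g in zip(species, descriptors))

nets = {"H": make_net(4, 8), "O": make_net(4, 8)}
species = ["O", "H", "H"]                      # a water-like toy molecule
G = rng.normal(size=(3, 4))                    # toy atom-centered descriptors
E = total_correction(nets, species, G)
```

Swapping the two H atoms (together with their descriptor vectors) leaves the total unchanged, which is the symmetry property this architecture is designed to enforce.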

Pseudopotentials
Most ab initio materials simulations, especially plane-wave based calculations, are facilitated by the use of pseudopotentials (PP). Highly accurate PPs used in KS-DFT are non-local. OFDFT must use local PPs, which are in general less accurate. While many PP libraries exist for KS-DFT software covering most of the periodic table, this is not the case for OFDFT. While there is general awareness of the importance of KEFs for OFDFT, the issue of PPs has been underappreciated, although good approximations to the KEF are highly unlikely in the absence of good local pseudopotentials, which would permit KEF approximations to operate on smoother densities. The availability of sufficiently accurate local PPs would enable large-scale simulations of many practically relevant materials as well as facilitate KEF development.
A local PP is a simple-shaped univariate function, but calculations are very sensitive to the fine details of its shape. ML-based approaches to building PPs for OFDFT are very promising, as they are in principle capable of finding the best possible shape of the PP, especially if non-parametric approaches are used that do not impose a functional form. This is a different type of problem from most ML applications, where ML operates in multidimensional spaces. As local PPs have a well-defined asymptotic form, only the non-asymptotic region around the nucleus is of interest. Legrain and Manzhos produced highly accurate PPs for Li, Na, and Mg [137] by using a parameterized functional form and tuning the parameters with a GA to reproduce structural and energetic parameters. This approach made it possible to palliate, in a semi-empirical way, deficiencies of the OFDFT approximation (the KEF); e.g. the cohesive energy could be reproduced, contrary to a PP obtained with a density-inversion procedure [138]. Del Rio et al [139] used an evolutionary algorithm to tune Gaussian-like correction terms to the local pseudopotential to reproduce properties of liquid and solid lithium and solid gallium with OFDFT. ML might have the most success in designing non-parametric PPs, but those are still wanting.

Density functional tight binding parameterization
Machine learning has been used to construct pairwise repulsive potentials of the self-consistent charge (SCC-) DFTB method [140][141][142]. These potentials are critical for the accuracy of the method and also offer opportunities to tune the DFTB setup to perform well for systems which were not included in the DFTB parameterization [143,144]. Knaup et al [145] used a genetic algorithm to produce an accurate pair potential for N-H, whereby the GA was used to tune the knots of the spline representation of the potential. Panosetti et al [146] used GPR to build DFTB repulsive potentials for the modeling of Li intercalation in graphite. They achieved an accuracy similar to that of DFT for inter-layer distances and Li diffusion barriers. In Kranz et al [147], machine-learned repulsive potentials were produced for pairs of O, N, C, H, and F atoms by training potential parameters on energies and forces of equilibrium and non-equilibrium structures of 2100 molecules and testing on about 130 000 organic molecules. The authors started from an existing DFTB parameterization and tuned polynomial corrections to the repulsive potential. The polynomial parameters were tuned as a function of bond descriptors, to account for the chemical environment, to reproduce atomization energies and forces for a set of equilibrium and perturbed molecular geometries. While the fit itself was a traditional least-squares fit, the bond clustering, done with the mean-shift algorithm [148,149], is an ML approach.
Witek and co-workers used particle swarm optimization (PSO) to tune a complete set of DFTB parameters, including atom parameters (orbital energies, orbital and density confining parameters, and Hubbard parameters and their derivatives) and atom pair parameters (the number and locations of the spline knots defining the repulsive potential, and constraints on it) [150]. PSO was used due to the large dimensionality of the resulting parameter space. The parameters were fitted to reproduce structures, formation energies, band structures, and vibrational properties of target systems which included different phases of silicon and small molecules. They achieved an accuracy for these quantities typical of high-quality DFT simulations. The electronic part of the parameterization done by Panosetti et al [146] also used PSO.
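As an illustration of this kind of parameter tuning, the following toy sketch (our own construction, not the workflow of [150]) uses a basic PSO to fit the knot values of a repulsive pair potential to a reference curve; the knot positions are fixed and piecewise-linear interpolation stands in for the spline representation:

```python
import numpy as np

# Toy PSO over the knot values of a repulsive pair potential: minimize the
# RMSE between the interpolated potential and a reference exponential curve.

rng = np.random.default_rng(2)
r = np.linspace(1.0, 3.0, 50)
target = 5.0 * np.exp(-2.0 * (r - 1.0))        # reference repulsive potential
knots_r = np.linspace(1.0, 3.0, 8)             # fixed knot positions

def rmse(knot_vals):
    return np.sqrt(np.mean((np.interp(r, knots_r, knot_vals) - target) ** 2))

n_part, n_dim, w, c1, c2 = 30, 8, 0.7, 1.5, 1.5
x = rng.uniform(0.0, 5.0, (n_part, n_dim))     # particle positions (knot values)
v = np.zeros_like(x)
pbest, pbest_f = x.copy(), np.array([rmse(p) for p in x])
g = pbest[np.argmin(pbest_f)].copy()           # global best

for _ in range(200):
    r1, r2 = rng.random((n_part, n_dim)), rng.random((n_part, n_dim))
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
    x = x + v
    f = np.array([rmse(p) for p in x])
    better = f < pbest_f
    pbest[better], pbest_f[better] = x[better], f[better]
    g = pbest[np.argmin(pbest_f)].copy()

final_error = rmse(g)
```

In the real application the objective combines many targets (structures, energies, band structures, vibrations) and the search space also includes the electronic atom parameters, which is what motivates a swarm-based optimizer.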
ML has also been used to improve DFTB performance by learning the difference in energy between DFTB and higher-level methods [151]. A recent example is the work by Irle and co-workers using an NN to learn a correction to DFTB (vs DFT) for a large number of geometries of glycine [152]. A Behler-type NN modeling the atomic chemical environment was used [21,45,153,154]. The resulting quality of the potential energy surface was similar to that of a highly accurate DFT approach.
ML can be used to parameterize semi-empirical Hamiltonians beyond the two-center ansatz of the DFTB approximation. Hegde and Bowen [155] used kernel ridge regression to build transferable predictions of DFT Hamiltonians for different environments of a Cu atom in an fcc crystal structure with near-DFT accuracy, and Dral et al [156] used KRR to re-optimize the parameters of the semi-empirical OM method.

Discussion and conclusion
Machine learning has been used to solve the Schrödinger equation or to improve the quality of lower-level solutions. When solving the Schrödinger equation, machine learning has been used to represent the wavefunction (e.g. with a neural network) in a flexible way, which can be viewed (especially in the case of a single-hidden-layer NN) as an expansion over a tunable basis, or as grid point values directly entering ML-based optimization (e.g. with a genetic algorithm). ML has also been used to tune parameters entering the solution with a given method (e.g. to reweight MP2 energy components) or to fit a correction to a lower-level method. Representing the wavefunction as values on a grid is enticing as it allows a truly non-parametric representation (no functional form is imposed); however, this approach, mostly tested on simple low-dimensional problems, faces in-principle difficulties in the high-dimensional spaces encountered in both the electronic and vibrational problems. On the other hand, a direct (generally non-uniform) grid representation may have potential for practically important low-dimensional problems such as the Kohn-Sham equation.
The use of Gaussian process regression for wavefunction representation remains largely unexplored. GPR has the potential to combine the advantages of the truly non-parametric grid representation with low-density grids. GPR is a hyper-parametric representation: while there exists an analytic expression for the function between the grid points, the functional form is only indirectly imposed via the choice of kernel type and parameters.
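The idea can be illustrated with a toy sketch (our own example, not taken from any cited work): the GPR posterior mean, built from a handful of non-uniform grid values of a model wavefunction (here the harmonic oscillator ground state), provides an analytic interpolant between the points:

```python
import numpy as np

# Noise-free GPR (kernel interpolation) of a 1D model wavefunction from a
# sparse, non-uniform grid. The posterior mean is an analytic function of x
# whose form is imposed only indirectly, via the kernel and its length scale.

def rbf_kernel(a, b, ell):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

# Sparse non-uniform 'grid': denser where the wavefunction varies most
x_train = np.array([-4.0, -3.2, -2.5, -1.5, -0.7, 0.0, 0.7, 1.5, 2.5, 3.2, 4.0])
psi_train = np.exp(-x_train**2 / 2.0)          # harmonic oscillator ground state

ell = 1.0                                      # kernel length scale (hyper-parameter)
K = rbf_kernel(x_train, x_train, ell) + 1e-10 * np.eye(len(x_train))
coef = np.linalg.solve(K, psi_train)

# Analytic interpolant evaluated between the grid points
x_dense = np.linspace(-4.0, 4.0, 401)
psi_pred = rbf_kernel(x_dense, x_train, ell) @ coef
max_err = np.max(np.abs(psi_pred - np.exp(-x_dense**2 / 2.0)))
```

In a real calculation the length scale would be optimized (e.g. by maximizing the marginal likelihood) rather than fixed by hand, and the grid values themselves would come from the solver rather than from a known analytic wavefunction.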
Several of the works considered above applied the same ML techniques to electronic and vibrational problems [52,55]. There are similarities and differences. Both problems are linear ODEs with potentials that have wells that support bound states. In the vibrational case, however, the potential minimum is smooth, while in the electronic problem it is singular (unless pseudopotentials are used). The potential in the electronic case approaches its long-range asymptote more slowly (e.g. as r−1) than interatomic potentials do. The density of the lowest-energy states is much more even in vibrational problems. There is nevertheless a degree of portability of ideas between the two fields.
While ML has been used directly for wavefunction representation in both electronic and vibrational problems, ideas such as ML-based corrections to, or term reweighting of, a lower-accuracy method have been more explored in the electronic world; they are expected to be useful also for vibrational problems. As of today, spectroscopically accurate (accuracy better than 1 cm−1) ML solutions for multiple (dozens or hundreds) levels of the vibrational SE for real polyatomic molecules are still wanting, and these ideas can help achieve them. Variational approaches easily achieve errors of ≪ 1 cm−1 for four- or five-atom molecules [157]. These, however, do not take into account the error introduced when building the potential energy surface, which can be substantial. In our experience, it was difficult to achieve a spectrum error much smaller than 1 cm−1 with ML (an NN representation of the wavefunction), but ML allows solving the SE directly from ab initio data. It is expected that with sufficient data, solutions of the SE as accurate as the dataset permits can be obtained.
It is only recently that milli-Hartree accuracy was achieved when solving the electronic Schrödinger equation for real-life systems with ML [84]. This was achieved by combining an ML approach (NN) with a traditional basis set rather than by a straight-up application of an ML method to represent the wavefunction. This is a viable way forward to applications. Nevertheless, direct ML wavefunction representations with chemical accuracy are worth pursuing. While variational approaches often achieve µHa accuracy, their steep scaling with the number of degrees of freedom is notorious. For electronic as well as for vibrational problems, this scaling is in particular due to the use of direct-product type bases. ML allows representing the solution in a non-direct-product form and effectively in an optimized basis (e.g. with NNs) and in this sense obviates the exponential scaling of the cost. Variational approaches require large quadrature grids to achieve high accuracy. These too can be obviated by the use of ML-assisted representations. For example, the use of a single-hidden-layer NN (either ridge or radial basis function) to represent the wavefunction is equivalent to a traditional basis expansion but with a non-direct-product, tunable basis. This allows achieving high-quality solutions with very small bases (compared to direct-product bases). In turn, the smallness of the basis allows using fewer quadrature or collocation points [59,69]. This not only reduces the CPU and memory cost of the calculation but also the cost of obtaining the underlying potential data, which can be quite expensive, e.g. when solving the nuclear Schrödinger equation for solid state, interfacial and nanoparticle systems where sampling the potential with ab initio calculations is costly.
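The equivalence of a single-hidden-layer NN to an expansion over a tunable, non-direct-product basis can be made explicit in a short sketch (a schematic illustration; the parameter counts are our own toy numbers):

```python
import numpy as np

# A single-hidden-layer ridge NN is a basis expansion
#   psi(x) = sum_n c_n * sigma(w_n . x + b_n)
# with N tunable, non-direct-product basis functions. For D degrees of
# freedom its parameter count grows as N*(D+2), whereas a direct-product
# basis needs n^D coefficients for n functions per degree of freedom.

def nn_wavefunction(x, W, b, c):
    """x: (n_points, D); W: (D, N); b, c: (N,). Returns psi at each point."""
    return np.tanh(x @ W + b) @ c

D, N = 6, 50                                   # 6 DOF, 50 'basis functions'
rng = np.random.default_rng(3)
W, b, c = rng.normal(size=(D, N)), rng.normal(size=N), rng.normal(size=N)

x = rng.normal(size=(10, D))                   # arbitrary evaluation points
psi = nn_wavefunction(x, W, b, c)

n_params_nn = W.size + b.size + c.size         # N*(D+2) = 400
n_params_direct = 10 ** D                      # 10 functions per DOF: 10^6
```

Here W and b define the tunable basis (they would be optimized together with the linear coefficients c against the Schrödinger equation), which is what breaks the direct-product structure and its exponential parameter count.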
Rectangular collocation with radial basis functions has already shown its advantages over variational approaches that use direct-product bases in terms of the required numbers of basis functions and collocation points, even without ML optimization. ML optimization of the basis and of the points promises significant gains.
With the works of Kranz et al [147] and Chou et al [150], which presented parameters for a wide selection of systems, ML firmly established itself as a powerful tool for accurate parameterization of density functional tight binding. This bodes well for wider use of this method, enabling large-scale and therefore more realistic simulations. A particular class of systems which are amenable to DFTB modeling and for which accurate parameterizations are still absent is organic batteries [158]. The results with ML-based DFTB parameterization to date suggest that ML could be effective in achieving accurate DFTB modeling of organic battery materials. This is especially important for the modeling of polymeric materials (requiring relatively large-scale models to account for their amorphous character) and of realistic charge-discharge dynamics [159].
There are still only a few works using ML to tune basis sets. Basis tuning is a powerful approach to diminish the cost of both electronic structure and vibrational calculations. This has been well known and used to build bases with methods other than ML [61,160]. ML-based basis set optimization promises significant gains and should therefore be further pursued. ML looks extremely promising in the development of orbital-free DFT. As of now, ML has been successfully used to learn density-kinetic energy, density-energy, and potential-energy mappings as well as the mapping between the potential and the density. Two methods have been widely used: kernel ridge regression and NNs. NNs have the advantage of being able to work with large datasets (millions of data points), while kernel-based methods are in principle somewhat more accurate (with optimal kernels and hyper-parameters) but are costlier to fit and to evaluate. A somewhat underappreciated area is the development of local pseudopotentials for OFDFT. Here, recommended future directions include machine-learned non-parametric small-core PPs. The results are extremely sensitive to the shape of the PP, and it is difficult to construct a correct functional form, which is why non-parametric approaches are desirable, while targeting semi-core PPs might alleviate the issue of the intrinsically lower accuracy of local PPs.
As the kinetic energy is a significant part of the total energy, orbital-free functionals represent a stringent test of ML, which it appears to be passing. The exchange-correlation problem is more forgiving, and ML has now been used to make exchange-correlation functionals beating some of the widely used functionals and approaching the accuracy of wavefunction-based methods. In both the kinetic energy functional problem and the XC problem, machine-learning the correction to a lower-level method (dubbed ∆-learning) has been successful and can be recommended.
A significant advantage of ML is that the parameters of the wavefunctions or orbitals need not be determined from linear algebra the way basis coefficients usually are. ML is therefore well suited to solve non-linear equations [40,77,161,162]. Specifically in the context of ab initio calculations, this has the potential to solve the Kohn-Sham equation directly without self-consistent field cycling. This direction is little explored and also deserves further studies.
Data selection is a critical issue. While in one- and low-dimensional cases uniform sampling is feasible and can provide high accuracy, in multiple dimensions adapted sampling distributions are necessary, e.g. versions of inverted-potential distributions when solving the Schrödinger equation or partial histogram equalization when sampling the space of density-dependent variables in DFT. There are indications that a proper sampling scheme is critical for the success of machine learning of KEFs for OFDFT [117]. When sampling in the multidimensional space of features is sparse, techniques suitable for sparse sampling which have been shown to be effective in the field of multidimensional potential energy surfaces, such as representation with sub-dimensional component functions or dimensionality reduction [24,25,163], could be useful. When data are abundant and sample well all relevant regions of space, single-hidden-layer NNs are sufficient and are often advantageous over multi-layer NNs, as was noted by several authors cited above; this was also our experience when fitting potential energy surfaces. It is with sparse or very unevenly distributed data that multi-layer (so-called 'deep') neural networks are advantageous [117].
Another important issue is feature/variable selection. The use of both scaled (i.e. satisfying the exact conditions) and unscaled densities and derivatives (gradient, Laplacian, and higher-order derivatives) presents viable lines of exploration. An alternative which may be more robust is working with a basis expansion of the density. The proposition that it may be possible to sample the space of density-dependent variables without sampling the configuration space deserves further exploration.
ML has its share of disadvantages. Machine learning methods typically fail when recalled outside of the learning domain (extrapolation). This can be palliated by adopting representations which enforce e.g. boundary conditions (such as wavefunction decay, as described above) or via coordinate transformations [157]. The ML methods considered above are ignorant of symmetry. This can be addressed by explicit or implicit symmetrization [46,98,136,157]. ML typically requires relatively costly optimization of parameters (NNs) or hyper-parameters (GPR, KRR) or extensive evolutionary searches (GA-inspired schemes). Nevertheless, with the growth of computer power, these issues are becoming easier to resolve than the exponential scaling of traditional approaches. This indicates an important place for ML techniques in the computational chemistry toolbox going forward.