A quantum mechanical NMR simulation algorithm for protein-scale spin systems

Nuclear magnetic resonance spectroscopy is one of the few remaining areas of physical chemistry for which polynomially scaling simulation methods have not so far been available. Here, we report such a method and illustrate its performance by simulating common 2D and 3D liquid state NMR experiments (including accurate description of spin relaxation processes) on isotopically enriched human ubiquitin - a protein containing over a thousand nuclear spins forming an irregular polycyclic three-dimensional coupling lattice. The algorithm uses careful tailoring of the density operator space to only include nuclear spin states that are populated to a significant extent. The reduced state space is generated by analyzing spin connectivity and decoherence properties: rapidly relaxing states as well as correlations between topologically remote spins are dropped from the basis set. In the examples provided, the resulting reduction in the quantum mechanical simulation time is by many orders of magnitude.


Introduction
The computing power required for nuclear magnetic resonance (NMR) simulations grows exponentially with the spin system size 1 , and the current simulation capability is limited to about twenty spins 2 . Proteins are much bigger and the inability to accurately model their NMR spectra is a significant limitation. In particular, exponential scaling complicates validation of protein NMR structures: an ab initio simulation of a protein NMR spectrum from atomic coordinates and list of spin interactions has not so far been feasible. It is also not possible to cut a protein up into fragments and simulate it piecewise without losing essential dipolar network information 3 . For this reason, some of the most informative protein NMR experiments (e.g. NOESY) are currently only interpreted using simplified models 4 . Very promising recent algorithms, such as DMRG 5 , are also challenged by time-domain NMR simulations of proteins, which contain irregular three-dimensional polycyclic spin-spin coupling networks that are far from chain or tree topologies required by tensor network methods. In this communication we take advantage of the locality and rapid relaxation properties of protein spin systems and report a solution to the protein NMR simulation problem using restricted state spaces 6 . NOESY, HNCO and HSQC simulations of 13 C, 15 Nenriched human ubiquitin protein (over 1000 coupled spins) are provided as illustrations.

Simulation methods
The restricted state space approximation in magnetic resonance 6 is the observation that a large part of the density operator space in many spin systems remains unpopulated and can be ignored -the analysis of quantum trajectories in liquid state NMR indicates that only low orders of correlation connecting nearby spins are in practice populated 6,7 . The reasons, recently explored [6][7][8][9][10][11][12][13][14] , include sparsity of common spin interaction networks 6,7 , the inevitable presence of spin relaxation 11,15 , the existence of multiple non-interacting density matrix subspaces 10,12 , the presence of hidden conservation laws 12 and simplifications brought about by the powder averaging operation 8,14 . It is possible to determine the composition of the reduced space a priori, allowing the matrix representations of spin operators to be built directly in the reduced basis set 11,12 . Taken together, this yields a polynomially scaling method for simulating liquid phase NMR systems of arbitrary size. Our final version of this method is described in this communication -we build the reduced operator algebra by only including populated spin product states in the basis. The populated subspace is mapped by analysing the topology of the spin interaction network. A rigorous accuracy analysis is highly technical and has been published separately 11 .
There are two distinct spin interaction networks in NMR systems: the J-coupling network, defined by electron-mediated interactions that propagate through chemical bonds, and the dipolar coupling network, defined by through-space magnetic dipolar couplings between nuclei. In the liquid phase, these two networks have very different manifestations: the Jcoupling network is responsible for multiplicity patterns observed directly in NMR spectra, whereas the dipolar network is partially responsible for line widths and cross-relaxation processes. Both networks are irregular, three-dimensional, and contain multiple interlocking loops that challenge current DMRG techniques 5 . In a typical NMR experiment, nuclear magnetization flows across both networks and the locality of the operator basis set should therefore be understood as locality on the corresponding graphs.
After testing a variety of state space restriction methods 6,7,[11][12][13][14] , we propose the following procedure for generating the reduced basis set in liquid state NMR simulations: 1. Generate J-coupling graph (JCG) and dipolar coupling graph (DCG) from J-coupling data and Cartesian coordinates respectively. User-specified thresholds should be applied for the minimum significant J-coupling and maximum significant distance.
Because spin interactions are at most two-particle, the computational complexity of this procedure and the number of edges in the resulting graphs scale quadratically with the number of spins.
2. Use the depth-first search algorithm 16   The accuracy of the basis set can be varied systematically by changing subgraph size in Stage 2 -the limiting case of the whole system corresponds to the formally exact simulation 11 . The basis set nomenclature implemented in our software library, called Spinach 17 , and used for the simulations described below, is given in Table 1. The procedure described above runs in quadratic time with respect to the total number of spins in the system.
Once the active space is mapped, matrix representations should be built for relevant spin operators and state vectors. Experimentally encountered spin interactions are at most twoparticle, and the Hamiltonian appearing in the equation of motion for the density operator is therefore a sum of at most two-spin operators with a known direct product structure 1 : where n  are interaction magnitudes, N is the total number of spins, and , Their faithful representations have exponential dimensions, but representations in low correlation order basis sets are cheap 12 . In a given operator basis  k O : Because dot products commute with direct products and the trace of a direct product is a product of traces we have: r Tr in which the dimension of individual matrices ,   Figure 1 took less than 24 hours on a large shared-memory computer. Importantly, the problem dimension remains too big for matrix factorizations: the recently developed diagonalization-free methods 15 are essential.
The storage of the system trajectory in the indirect dimension of the 2D NMR simulation shown in Figure 1 requires 512×848,530 complex doubles (6.96 GB) of memory. It is clear that 3D NMR simulations would put some strain on modern computing facilities. This would have been a difficult problem, were it not for a peculiar property of propagator semigroupssimulations can be partially run backwards, even in the presence of relaxation. The general algebraic summary is given below and a special case of the HNCO pulse sequence is illustrated in Figure 3.
The free induction decay coming out of a 3D NMR experiment is a function of the three evolution times   , , t t t and may be formally written as where 0  is the initial density matrix,  is the detection state, This transformation splits a 3D NMR simulation into one forward 2D simulation from the initial state, one backward 2D simulation from the detection state and one dot product in the middle. Equation (7) is formally equivalent to Equation (6), but the reduction in storage requirements is considerable -for a typical protein 3D NMR experiment, instead of a dense 64×64×256×10 6 array of complex doubles (over 16 TB of data) at the end of the 3 t period in Equation (6), the arrays in Equation (7) have dimensions of 64×64×10 6 and 64×256×10 6 as well as better sparsity, resulting in the worst-case storage requirements of about 256 GB. As per Equation (7), their scalar product along the last dimension returns the required 64×64×256 free induction decay. Importantly, Equation (7) retains the parallelization opportunities and the time-memory trade-offs offered by the fact that different 1 t increments may be evolved independently in 2 t forward, and different 3 t increments may be evolved independently in 2 t backward. The final operation -the matrix dot product in Equation (7) -is also intrinsically parallel. Practical testing shows that the two-sided propagation technique reduces the simulation time of 3D NMR experiments on proteins (HNCO example is given in Figure 4) by at least an order of magnitude.
Even in reduced spaces the algebraic structure of the time-domain NMR simulation problem lends itself to multiple efficiency tweaks. Sparse matrix algebra 19 is advantageous because in the Pauli basis all spin Hamiltonian matrices are guaranteed to be sparse 18 . The direct product structure in Equation (2) is completely defined by its indices -repeated requests for the same operator can be served from disk or RAM using the index array as a database record identifier. Parallelization is straightforward at both the propagation 18

Experimental methods and data processing
All NMR spectra were recorded at 300 K on Bruker AVANCE-III 900 and Varian Inova600 spectrometers equipped with 1 H, 13  While chemical shift data is a necessary outcome of NMR structure determination 3 , complete J-coupling data is not expected to be available in the foreseeable future for any protein. We found that missing J-couplings can be obtained with sufficient accuracy (±25% is required for 2D/3D NMR simulations reported) from atomic coordinates using semi-empirical estimates, and implemented a graph-theoretical estimator with the following stages: 1. The molecular bonding graph is partitioned into connected subgraphs of size two, and one-bond J-couplings are assigned from a complete database of atom pairs. Our experience with ubiquitin indicates that there are fewer than 100 unique connected atom pairs in regular proteins, and that most one-bond J-couplings within those pairs can be either found in the literature 3 , or measured in individual amino acids, or estimated with sufficient accuracy using electronic structure theory software 28 .
2. The molecular bonding graph is partitioned into connected subgraphs of size three, and two-bond J-couplings assigned from a complete database of connected atom triples. The number of unique connected atom triples in proteins is also reasonable -we saw fewer than 150 in regular proteins, a small enough number for an exhaustive list to be compiled from experiments, literature and electronic structure theory estimates.
3. The molecular bonding graph is partitioned into sequentially connected subgraphs of size four and dihedral angles are computed from atomic coordinates, allowing three-bond Jcouplings to be assigned from a complete database of Karplus curves 29 , with angle averaging for sites designated as mobile. Karplus curves are a well-researched topic, with specific data available for the backbone and less accurate generic curves available for rest of the structure 3 . The number of unique sequentially connected atom quartets found in proteins (fewer than 300, many belonging to similar structural types) was sufficiently small for a complete database of Karplus curves to be compiled from literature data, experiments, and electronic structure theory estimates.
J-couplings across more than three bonds were ignored. The effect of the electrostatic environment was also ignored -on the scale of the accuracy required for protein simulations its effect on J-coupling is small 30,31 . Matlab code listing the specific parameter values is available as a part of the Spinach package 17 . More accurate J-coupling estimation methods are undoubtedly possible, but are beyond the scope of the present paper -we should note very clearly here that this paper is an exercise in quantum mechanics rather than structural biology. 3. Automatic protein NMR structure validation. Structure validation can be defined as making sure that atomic coordinates coming out of a crystallographic or NMR experiment correspond to reality and eliminating any mismatches between the mathematical solution and the true biological structure. The critical step in that process -back-calculation of protein NMR spectra -is now possible.

Results and discussion
Taking a more distant and speculative view, it may eventually become feasible to run protein NMR structure determination and validation directly from atomic coordinates, using ab initio or DFT methods to predict spin interaction parameters and then the methods described above to generate candidate NMR spectra for least squares fitting. Such "direct structure fitting" has been demonstrated for EPR of small molecules 37 . Its routine use would require significant im-provements in the accuracy of quantum chemistry methods, but such improvements are quite likely in the next ten years.

Conclusions
The algorithm reported results in the reduction of liquid state NMR simulation time of protein-scale spin systems by many orders of magnitude -a considerable improvement over brute-force simulations using traditional techniques 1,19 .

Basis set Description
IK-0(n) All spin correlations up to, and including, order n, irrespective of proximity on J-coupling or dipolar coupling graphs. Generated with a combinatorial procedure, by picking all possible groups of n spins in the current spin system and merging state spaces of those groups. Recommended for testing and debugging purposes. IK-1(n,k) All spin correlations up to order n between directly J-coupled spins (with couplings above a user-specified threshold) and up to order k between spatially proximate spins (with distances below the userspecified threshold). Generated by coupling graph analysis as described in the main text. The minimum basis set recommended for liquid state protein NMR simulations is IK-1(4,3) with the distance threshold of 4.0 Angstrom.

IK-2(n)
For each spin, all of its correlations with directly J-coupled spins, and correlations up to order n with spatially proximate spins (below the user-specified distance threshold). Generated by coupling graph analysis as described in the main text. Recommended for very accurate simulations on very large computer systems with the distance threshold of 5.0 Angstrom or greater.  Figure 1 at different accuracy levels.
Basis set for the reduced state space (see Table 1