Modern Technologies of Solution Nuclear Magnetic Resonance Spectroscopy for Three-dimensional Structure Determination of Proteins Open Avenues for Life Scientists

Nuclear magnetic resonance (NMR) spectroscopy is a powerful technique for structural studies of chemical compounds and biomolecules such as DNA and proteins. Since the NMR signal sensitively reflects the chemical environment and the dynamics of a nuclear spin, NMR experiments provide a wealth of structural and dynamic information about the molecule of interest at atomic resolution. In general, structural biology studies using NMR spectroscopy still require a reasonable understanding of the theory behind the technique and experience on how to recorded NMR data. Owing to the remarkable progress in the past decade, we can easily access suitable and popular analytical resources for NMR structure determination of proteins with high accuracy. Here, we describe the practical aspects, workflow and key points of modern NMR techniques used for solution structure determination of proteins. This review should aid NMR specialists aiming to develop new methods that accelerate the structure determination process, and open avenues for non-specialist and life scientists interested in using NMR spectroscopy to solve protein structures.


Introduction
Tertiary structure determination of biomolecules such as nucleic acids and proteins at atomic resolution provides essential insight into the function of bioactive molecules. X-ray crystallography [1] and nuclear magnetic resonance (NMR) spectroscopy [2] have been the primary methods over the past few decades to obtain high-resolution structures. More recently, the rapid technological growth of cryo-electron microscopy [3] has seen this technique emerge as a third major approach to solve biomacromolecular structures at atomic resolution. Solution NMR offers a number of distinct features for structural biology studies: 1) Dynamics of protein folding, structural fluctuations, internal mobility Computational and Structural Biotechnology Journal 15 (2017) [328][329][330][331][332][333][334][335][336][337][338][339] and chemical exchange of target molecules can be investigated over a wide range of timescales (i.e., between picoseconds and submillisecond timescales) [4,5]. The most successful application of NMR to study biomolecular dynamics has been carried out on intrinsic disordered proteins (IDPs) such as zinc finger proteins that usually do not yield crystals, which provide sufficient quality X-ray diffraction data for high-resolution structure determination [6][7][8][9][10][11]. Moreover, these proteins are too small or too flexible to obtain strong contrast images by modern cryo-EM analysis. 2) Studies of protein-protein or protein-ligand interactions can be performed under physiological conditions. The affinity and the location of the interaction sites between the target protein and its binding partner molecules can be determined accurately and sensitively even if the interaction is very weak (K d =~mM) by performing simple NMR experiments, e.g., chemical shift perturbation analysis by measuring two-dimensional (2D) heteronuclear single-quantum coherence (HSQC) spectra of target proteins with and without the binding partner, while such an interaction study by cocrystallization or co-immunoprecipitation can be challenging [12][13][14]. Therefore, techniques based on using solution NMR can be readily used to screen chemical fragment libraries for preliminary hits, which have the potential to act as seed compounds in drug development. Such approaches usually identify weak binders for the target protein, but represent a powerful starting point for structure-based drug discovery and development studies.
Among the various applications of solution NMR to study protein function, as aforementioned, tertiary structure determination of proteins by NMR remains a steadfast and powerful application of this spectroscopy. Recent development of software and a number of webbased resources, which reduce the burden of complex NMR data analysis, have contributed to the systematic integration of sophisticated semi-automated NMR platforms for structure determination of biomolecules [15][16][17][18][19]. Such advances, along with improvements in NMR hardware, have lowered the knowledge barrier to facilitate entry into the field of structural biology by NMR for non-specialists. However, only a handful of articles that describe the workflow and practical aspects of protein structure determination by solution NMR spectroscopy, which covers the latest method developments, have been published despite the practicality of this technique.

Workflow of Protein Structure Determination by Solution NMR
The standard approach to protein structure determination by solution NMR comprises several steps: (i) preparation of a protein sample labeled with stable NMR spin-1/2 isotopes, 13 C and 15 N; (ii) acquisition of NMR spectra; (iii) data processing and assignments of signals; (iv) generation of a chemical shift table that gives the signal assignments derived from the analysis of NMR spectra such that the analyst or an automated program can subsequently assign NOE signals to generate Fig. 1. Workflow of protein structure determination by solution NMR spectroscopy. (A) First, the target protein is uniformly labeled with NMR active isotopes ( 13 C and 15 N) and purified by chromatography processes. In many cases, it is essential to optimize the solution composition of the NMR sample (e.g., buffer, pH, type and concentration of salt, and other additives to prevent aggregation and/or to improve thermal stability of the target protein) and NMR parameters (e.g., sample temperature during NMR data collection, strength of static magnetic field, pulse sequences, inversion recovery delay) prior to collecting all the multidimensional heteronuclear NMR experiments. (B) Generally, various sample conditions and NMR parameters are verified by measuring 2D 1 H-15 N HSQC NMR spectra and/or 2D projection spectra of 3D HNCACB (e.g., 2D HN(CA)CB) until optimal conditions are found. If sample/NMR data collection optimization is performed by only 15 N-based NMR experiments, only a 15 N-labeled target protein (without 13 C-labeling) is required. (C) The NMR signals of all 1 H, 13 C and 15 N nuclei are assigned by analyzing multidimensional heteronuclear NMR spectra using NMR software that facilitates assignment of the signals. The right lower panel of this figure is a screenshot of a NMR signal assignment using the Kujira and MagRO software, developed by Prof. Naohiro Kobayashi (Osaka University, Japan), which is available at PDBj-BMRB http://bmrbdep.pdbj.org/en/nmr_tool_box/magro_nmrview.html. (D) After completing the assignment process, the solution structures of the target protein are determined by performing simulated annealing using NMR-based restraints such as 1 H-1 H distances and dihedral angles derived from NOEs and chemical shifts, respectively. (E) Finally, the precision and the accuracy of the determined structure are verified statistically and experimentally (e.g., by analyzing the Ramachandran plot and measuring RDCs of the target protein, respectively). distance or dihedral angle restraints; and (v) simulated annealing (SA) by simplified molecular dynamics calculations with the NMR-based restraints are used for initial structure modeling to obtain an ensemble of structures that satisfy the experimentally determined constraints. These SA calculations are followed by molecular dynamics (MD) simulation with explicit or implicit water using a more sophisticated force field. (vi) The final stage of the workflow is validation of the precision and accuracy of the determined structure (Fig. 1). Details of these steps are explained in the subsections below. In this mini-review, basic and general information for researchers interested in NMR structure determination of proteins will be provided. We will mainly describe general procedures that are routinely used for NMR structure determination of soluble and small-to-medium size (c.a. b 25 kDa) proteins. More advanced techniques required for NMR analysis of more challenging proteins, e.g. membrane proteins or larger proteins (c.a. N 25 kDa), will also be described briefly.

Protein Sample Preparation for NMR Measurements
In the initial NMR studies aimed at solving protein structures, complete assignment of the proton ( 1 H) NMR signals was achieved by acquiring 1 H NMR spectra, e.g., 2D 1 H-1 H double quantum filteredcorrelation spectroscopy (DQF-COSY), 2D 1 H-1 H totally correlation spectroscopy (TOCSY), and 2D 1 H-1 H NOE correlated spectroscopy (NOESY) spectra. These spectra provided sufficient information to obtain 1 H-1 H distance restraints to calculate the tertiary structure for non-isotope labeled peptides and small proteins (c.a. b 10 kDa) [20,21]. Extension of the 1 H only approach to larger proteins was not possible because of severe signal overlap precluding unambiguous assignment of signals [21]. In the early 1990s, owing to the development of technologies for expressing recombinant proteins using bacteria cultures with medium containing stable isotopes, a large number of target proteins could be uniformly labeled with the stable "NMR active" isotopes, 13 C and 15 N, which can be detected by NMR measurements. Two-(2D), three-(3D) and four-(4D) dimensional NMR spectra observing signals that correlated 1 H, 13 C and 15 N nuclei made for dramatic improvements in signal separation and removed signal overlap present in 1 H NMR spectra [21].
The Escherichia coli (E. coli) expression system is most commonly used for stable isotope labeling of proteins, as it has several advantages: easy handling, bacteria cells grow quickly, cost efficiency and the availability of many established methods for protein isotope labeling [22,23]. For example, the most powerful application of the methods would be site-specific/desired amino acid selective isotope labeling of proteins, which has been firmly established using the E. coli expression system [22][23][24][25]. Alternative 13 C-labeling of protein is also possible by using particular 13 C-enriched carbon sources, such as [1-13 C]-glucose [26], [1,3-13 C 2 ]-or [2-13 C]-glycerol [27], 13 C-acetate [28], and [1,2-13 C 2 ]-or [3-13 C]-pyruvate [29][30][31]. As this technique enables NMR observation of only the desired moiety of the target protein, it is especially useful for large molecular-weight (c.a. N 25 kDa) proteins or IDPs that show severe signal degeneracy.
In situations where an insufficient amount of the recombinant protein is produced by the E. coli expression system, the yeast expression systems, Pichia pastoris [32] or Kluyveromyces lactis [33] as host cells, can be used as an alternative. These systems not only have the advantage of being similar to the E. coli expression system (easy handling, rapid cell growth, and many available and firmly established isotope labeling techniques with relatively low cost), but also offer eukaryotic cell-specific features that facilitate the over-expression and correct folding of particular proteins, e.g., protein requires a complex array of disulfide bonds to form for native folding [22,23,34,35].
Cell-free protein expression systems are another option to overexpress proteins. This approach synthesizes recombinant proteins in vitro in a reaction mixture containing the DNA coding the target protein and a cell lysate containing gene transcription/translation molecular machineries, and, if required, isotope labeled amino acids for protein labeling [36][37][38][39][40]. Preparation of a fresh cell lysate, which is necessary for synthesizing a sufficient amount of the recombinant protein stably and reproducibly, is challenging. Furthermore, the running cost of such a cell-free system is generally high [23], indicating that it is not suitable for large-scale protein production with stable isotopes. However, this approach is still adequate for amino acid selective isotope labeling or selective introduction of unnatural amino acid residues into the target protein, because amino acid scrambling is relatively mild when compared with other expression systems that use host cells [41][42][43][44][45][46]. As an additional remarkable point, reconstructing an isotopelabeled membrane protein into artificial lipid bilayer discs, e.g., nanodiscs, is relatively easy to achieve by cell-free expression [47][48][49][50].
Isotope labeling of recombinant protein using insect or mammalian cells is also possible, although this is generally an expensive method with low yields. Moreover, the available isotope labeling variation is limited when compared with that of other expression systems [51,52].
In NMR protein structure determination, sample purity should be as high as possible because signal linewidths and protein stability are affected by non-specific contact with impurities. However, NMR spectra can be obtained even if the sample is crude, as demonstrated by studies examining cell lysates or metabolites [53][54][55][56][57][58].
The conditions of the protein solution and the NMR experimental parameters should be sufficiently optimized prior to starting multidimensional heteronuclear NMR measurements [59][60][61]. In many cases, the protein concentration, hydrogen ion (pH) value of the sample solution, the type of salt/buffer and its concentration and the NMR measurement temperature should be carefully examined by several pilot NMR experiments (typically 1D 1 H or 2D 1 H-15 N HSQC spectra are recorded) with different sample solution compositions and NMR parameters. If an optimal condition is found, the NMR spectra will show sharp (narrow linewidths) and sufficiently well-dispersed signals [59,61,62]. To assign backbone 1 H, 13 C and 15 N signals successfully by the inter-residue chemical shift linking approach, the observation of 13 C β nuclei with reasonable chemical shift dispersion and sensitivity is a key point to consider. Pilot measurements of 3D HNCACB and/or 2D HN(CA)CB spectra can be good indicators for judging the adequacy of the solution conditions and potential difficulties of obtaining signal assignments. Perdeuteration of target proteins combined with transverse relaxation optimized spectroscopy (TROSY)-type NMR pulse schemes may be beneficial when the molecular weight of the target protein exceeds 25 kDa (see Section 4).
Although the sensitivity of solution NMR is continuously improving thanks to the development of new NMR pulse schemes and hardware, e.g., an increase in the static magnetic field strength and the innovation of cryogenic probe [63], a higher protein concentration is preferable to obtain NMR signals with higher signal-to-noise ratio and resolution within a reasonable period of NMR instrument use. In particular, for protein structure determination, sufficiently high solubility and stability are typically required. Ideally, the target protein should be a monodispersed species at a protein concentration N0.5 mM and be stable in the NMR instrument for a minimum of 1-5 days.
The aggregation character of the target protein can be monitored by observing broadening of the NMR signals as a function of protein concentration or by investigating the rotational correlation time of the target protein via measurement of the [ 15 N, 1 H]-TRACT [64] or the 15 N T 1 /T 2 value [65]. Biochemical wet experiments (e.g., size-exclusion column chromatography) are also useful for investigating the character of the target protein. However, this method cannot address protein aggregation at the protein concentration used for NMR measurements, because the protein is significantly diluted in chromatography [59].
In general, the pH value of the protein solution is selected by considering the theoretically/experimentally determined isoelectric point (pI) of the protein. A lower pH is preferable to reduce the chemical exchange between water 1 H and protein 1 H N nuclei, because 1 H fast exchange reduces the sensitivity of 1 H N -15 N correlation signals [66,67]. In some cases, however, the protein cannot be placed at a low pH because of the pI, solubility, or stability. As buffer components, phosphate buffer (e.g. sodium or potassium phosphate) is an ideal choice because it does not have 1 H nuclei [61]. As a character of buffer conductivity and its effect on the sensitivity of NMR signal detection using cryogenic probes, however, other buffers such as HEPES-or MES-NaOH would be more preferable [68].
The sample temperature during NMR data collection is often room temperature (typically 25°C). However, a higher sample temperature should give better spectra unless such temperatures affect the structure and stability of the target protein (up to 40°C when a cryogenic probe is used). Higher temperatures offer sharper signals by decreasing solvent viscosity and thus the rotational correlation time of the protein [59,61].
Some types of chemical additives can increase the solubility and/ or stability of proteins [59,61]. For example, a reducing agent (e.g., dithiothreitol (DTT), tris(2-carboxyethyl)phosphine(TCEP)-HCl) may be essential for preventing protein oligomerization caused by inter-molecular nonspecific disulfide bond formation, when free cysteines exist on the surface of the protein. Multiple well plate-based highthroughput screening methods using fluorescence correlation spectroscopy or thermal shift assays can explore chemical additives that improve protein solubility and stability, which may help to find the best condition without exhaustive sample condition optimization work that consumes limited NMR instrument availability [59,61].
Note that 5%-10% (v/v) 2 H 2 O must be added to the NMR sample to provide a lock signal for the static magnetic field passing through the sample solution. In addition, in many cases, a standard chemical shift reference compound is dissolved in the sample solution, as described in the next section.

Acquisition of NMR Spectra and Their Assignment
Initially, 1 H, 13 C, and 15 N NMR signals arising from the nuclei of the target protein have to be assigned before we can obtain angle and distance restraints for protein solution structure determination. In general, a number of NMR spectra are recorded to obtain a complete set of NMR assignments [69][70][71].
For backbone ( 1 H, 13 C and 15 N) signal assignments: 2D 1 H-15 N HSQC, 3D HNCA*, 3D HN(CO)CA*, 3D HNCACB, 3D CBCA(CO)NH, 3D HNCO, 3D HN(CA)CO* and 3D HBHA(CO)NH are recorded. The 3D HNCACB and 3D CBCA(CO)NH are the most basic spectra for backbone signal assignments [71]. The asterisk next to the experiment acronym indicates optional spectra required when 3D HNCACB and 3D CBCA(CO)NH spectra are insufficient for completing backbone assignment. The 3D HNCO and 3D HN(CA)CO are useful complements to inter-residue chemical shift linking using 3D HNCACB and 3D CBCA(CO)NH experiments, and also help to eliminate any potential assignment unambiguity that exists. Furthermore, chemical shift assignments of 13 C α , 13 C β , 13 C′ and 1 H α nuclei are important for secondary structure prediction by TALOS or its related programs (e.g., TALOS+, TALOS-N) as described in Section 5.2. In those pulse schemes, except CBCA(CO)NH, 1 H-13 C-15 N correlations are measured using the outand-back coherence transfer approach between 1 H and 13 C/ 15 N to improve the sensitivity of those low gyromagnetic ratio (γ) nuclei (typically 5-10 fold). First, polarization of proton magnetization is transferred to low-sensitive nuclei 13 C or 15 N via J-couplings, and then the chemical shifts of those nuclei are recorded. Finally, the magnetization of 13 C or 15 N is returned to the starting proton, and the chemical shifts of the 1 H nuclei are recorded as the free induction decay (FID). However, nuclei denoted within brackets in the acronym of the pulse sequence participate in the coherence transfer pathway but their chemical shifts are not encoded in those multi-dimensional NMR spectra. Incidentally, those out-and-back coherence transfer schemes starting from the amide proton are also applicable to perdeuterated proteins.
For side chain assignments: 2D 1 H-13 C constant-time (CT) HSQC ( 13 C offset on aliphatic and aromatic regions), 3D 15 N-edited TOCSY-HSQC, 3D CC(CO)NH, 3D H(CCCO)NH, 3D (H)CCH-TOCSY ( 13 C offset on aliphatic and aromatic regions), 3D H(C)CH-TOCSY ( 13 C offset on aliphatic and aromatic regions), 2D (HB)CB(CGCD)HD, 2D (HB)CB(CGCDCE)HE, 13 C-edited NOESY-HSQC and 15 N-edited NOESY-HSQC spectra are required. There are many variations of those pulse schemes for time evolution such as CT or semi-CT methods, echo and anti-echo, single or multiple quantum coherence, or TROSY type acquisition. It is preferable that each user selects proper pulse schemes for individual protein samples by performing pilot NMR experiments to collect better quality NMR spectra that have the highest signal-to-noise ratio and resolution. Catalogues of standard pulse programs disclosed by the manufacturer of NMR instruments, such as Bruker (available from their websites), may help determine the best NMR measurements to use.
The minimum number of multi-dimensional NMR spectra required to obtain complete assignment of 1 H, 13 C, and 15 N signals of proteins using magnetization transfer via inter-nuclear J-couplings is six: 3D HNCACB, 3D CBCA(CO)NH, 3D (H)CCH-TOCSY (aliphatic and aromatic regions) and 3D H(C)CH-TOCSY (aliphatic and aromatic regions) [71]. In practice, however, using NOESY spectra (e.g., 13 C-edited NOESY-HSQC), which provide a wide range of inter-nuclear distance information between the backbone and side chains, may be helpful, in particular, for assignment of aromatic side chain signals.
The 13 C-and 15 N-edited NOESY-HSQC spectra are essential for both signal assignments and for generation of 1 H-1 H distance restraints [21,72]. The mixing time used in NOESY spectra of proteins is generally 80-150 ms, as described in Section 5.1.
In general, when the molecular weight of the target protein exceeds 30 kDa, measurement and assignment of the protein NMR signals become difficult owing to the increasing degeneration and linebroadening of the signals, of which effect is associated with fast transverse magnetization decay [21]. The latter issue is the biggest burden to solution NMR spectroscopy. For example, the transverse relaxation rate of 13 C α magnetization is dominated by the dipole-dipole (DD) interaction with the proximate 1 H α , leading to dramatic sensitivity losses to 13 C α signals or signals in multidimensional NMR spectra that transfer magnetization via 13 C α . This is particularly problematic when the molecular weight of the target protein exceeds 20 kDa [73,74]. In this case, the DD relaxation of 13 C α magnetization can be suppressed remarkably by substitution of 1 H α to 2 H (perdeuteration) by overexpressing the target protein in 2 H 2 O media [73,74]. In addition, 1 H-15 N correlation spectra with narrower signals can be obtained by TROSYtype NMR pulse schemes, which selectively observes the coherence where DD relaxation has been attenuated by the chemical shift anisotropy (CSA) effect [75]. The TROSY effect will be more prominent in combination with a perdeuterated protein and higher static magnetic fields (i.e., ≥800 MHz) because cancellation of DD relaxation by the CSA effect is field dependent. However, the benefits from perdeuterated proteins are limited to improving backbone NMR signal assignments and for 13 C-direct detection NMR experiments [76][77][78][79]. 13 C-direct detection NMR methods are also applicable to protein structure determination [80]. In many cases, moreover, mild denaturation and regeneration treatments of the deuterated protein sample in a 1 H 2 O buffer may be required to back-exchange deuterium to proton at amide positions to regenerate 1 H N -15 N correlations. However, multi-dimensional NMR experiments that use polarization transfer starting from non-labile protons, e.g. 3D CBCA(CO)NH, cannot be used with deuterated proteins [71].
Prior to starting the assignment of 1 H, 13 C and 15 N chemical shifts of the target protein, the NMR spectra should be calibrated [81,82]. 1 H chemical shift calibration (or denoted as "referencing") uses the temperature-dependency of the 1 H chemical shift of 1 H 2 O automatically when processing NMR spectra using the program NMRPipe [83]. In addition, chemical shift calibration using standard compounds that yield 1 H chemical shifts, e.g., 4,4-dimethyl-4-silapentane-1-sulfonic acid-d 6 (d 6 -DSS), offer further reliable calibration [81,82]. Practically, calibration can be achieved by simply measuring the chemical shift value of the major 1 H peak of the trimethyl group of d 6 -DSS (expected to appear near 0 ppm) in a 1D 1 H spectrum. It is important for appropriate calibration to acquire the 1D 1 H spectrum with sufficient number of data points, in order to read its peak maximum as precisely as possible. Then, the 1 H carrier offset of the NMR spectra of the target protein is adjusted as the chemical shift value of the 1 H peak of the trimethyl group of d 6 -DSS becomes 0.000 ppm. As shown in the BioMagResBank (BMRB) website (http://www.bmrb.wisc.edu/ref_ info/cshift.html), the calibrated offset of the 13 C and 15 N carrier frequencies (Hz) of the target protein are determined by multiplying factors based on the gyromagnetic ratio (γ 13C /γ 1H or γ 15N /γ 1H , corresponding to 0.251449530 or 0.101329118, respectively) to the calibrated 1 H center frequency (Hz) [81,82]. After chemical shift referencing of the 1 H, 13 C, and 15 N center frequencies, all of the NMR spectra of the target protein are processed.
It is also possible to obtain the 1D 1 H NMR spectrum of a standard compound by dissolving it in a target protein sample solution as an internal standard. However, the user has to be careful that occasionally nonspecific interactions between the standard compound and the protein can occur. Furthermore, a strong 1 H signal and 1 H noise derived from the standard compound often obstructs analysis of the NMR spectra. Therefore, it is also acceptable to measure the 1 H reference spectrum of the standard compound using a NMR sample only containing standard compound, e.g., 0.5 mM d 6 Chemical shift referencing is mandatory for depositing the NMR data and structure into public databases such as the BMRB and Protein Data Bank (PDB). In addition, chemical shift calibration is important for accurate prediction of secondary structures and backbone dihedral angles by comparing data to chemical shift/structure databases using TALOS or related programs [84,85], as described in Section 5.2. In any case, it is crucial to obtain reference chemical shift data on a standard compound at an identical temperature and static magnetic field strength to the NMR data collection on the target protein.
The processing of the collected NMR data involves a series of several mathematical conversions from time domain data to frequency domain data. Here, fast Fourier transformation of discretely sampled digital data of the direct and indirect dimensions with adequate shaping by apodization and window functions, linear prediction, phase adjustment and baseline correction are comprehensively achieved by programs such as NMRPipe (https://spin.niddk.nih.gov/bax/software/NMRPipe/) [83].
The chemical shift of the signals arising from the target protein should be assigned as completely as possible. In particular, 1 H assignments are important, as they form the basis of inter-proton distance restraints. The completeness and accuracy of the 1 H assignments will strongly and directly affect the accuracy and convergence of the calculated NMR structure [86][87][88][89][90]. There are many visualization software packages to examine and analyze the NMR spectra and perform chemical shift assignments, e.g., Sparky (Goddard TD and Kneller DG, SPARKY 3, University of California, San Francisco), CARA (developed by Dr. Keller of the Kurt Wüthrich's group, http://cara.nmr-software.org/portal/), NMRView [91], and Kujira and MagRO (developed by Prof. Naohiro Kobayashi of the PDBj-BMRB group (Osaka University, Japan), http:// bmrbdep.pdbj.org/en/nmr_tool_box/magro_nmrview.html) [15,17] with NMRView and CcpNmr (http://www.ccpn.ac.uk/). Programs for automated assignment of backbone and side chain signals are also available, such as PINE and PINE-SPARKY, which is included in the new version of Sparky (NMRFAM-Sparky) [92,93], MARS [94], UNIO (assembly of MATCH/ASCAN/CANDID/ATNOS algorithms) [87,88,[95][96][97] and FLYA [98], which is partially powered by the GARANT algorithm [99]. The basic concept of the automated signal assignment process is matching peaks between experimentally observed and expected ones, which are based on the amino acid sequence of the target protein and magnetization transfer schemes of the recorded NMR spectra [99]. From a practical perspective, successful performance of automated assignment software requires the collection of high-quality NMR spectra that show sufficiently well-resolved signals and the expected number of signals. They typically provide reliable assignment of well-separated signals by linking chemical shifts, which are expected from coherence transfer pathways based on the pulse schemes used for the NMR experiments and topology of chemical structures of amino acid residues. Moreover, auto-assignment programs can offer possible assignment of unassigned signals, which are difficult to assign manually owing to ambiguity of inter-residue chemical shift linking. In many cases, therefore, the user has to manually confirm and carefully correct the assignment made by automated programs using NMR spectrum visualization software, as described above.

Distance Restraints
Protein structure determination by simulated annealing uses the 1 H-1 H distance and dihedral angle restraints derived from NMR data [86,90]. The 1 H-1 H distance information is obtained by estimating the intensity of the 1 H-1 H NOE signals [21]: the intensity of the 1 H-1 H NOE signals is dependent on the mixing time (τ m ), during which 1 H-1 H cross-peaks are generated [21]. However, although build-up rates obtained from linear dependency of the NOE cross-peak intensity on the τ m correlate with the 1 H-1 H distance when a limited (short τ m ) NOE build-up regime is used, quantitative estimation of 1 H-1 H distances from the intensities of the 1 H-1 H NOE signals becomes impossible over the limited τ m range because relaxation and spin diffusion also proceed during long mixing times [21,100]. In general, the τ m value should be within the initial NOE build-up range (typically 80-150 ms for proteins) [21]. Nonetheless, recently developed innovative methods offering more exact conversion of NOE data to internuclear distance information, Exact NOE (eNOEs), are powerful protocols for accurate and precise protein solution structure determination [101][102][103]. The 3D 13 Cor 15 N-edited NOESY-HSQC spectra are widely used. Following peak picking and integration of the 1 H-1 H NOE signals, the table containing the residue number/atom name/chemical shift/ 1 H-1 H NOE signal intensity is used as primary input data for simulated annealing [90]. In the classical way to prepare NOE based distance restraints, the NOE peaks are roughly classified into three (or four) groups depending on their signal intensity: strong, medium and weak (and very weak) [86,90]. In the structure calculation, this information is translated into 1 H-1 H distance ranges of 1.8-2.7, 1.8-3.3, and 1.8-5.0 (and 1.8-6.0) Å [86,90]. The lower limit of the 1 H-1 H distance, 1.8 Å, corresponds to twice the van der Waals radius of 1 H atoms [86,90].
Rough classification and wide range (not rigorous) distance restrictions based on NOE data are based on the difficulty of converting the NOE signal intensity to an inter-nuclei distance accurately and directly because the NOE signal intensity is affected by not only the interatomic distance, but also various factors such as spin-diffusion and conformational averaging as a result of local structural fluctuations. Because of the mildness of distance restraints based on NOE experiments, a sufficient number of distance restraints are required for accurate and precise protein structure determination (generally speaking more than 10-20 distance restraints per residue, not including intra-residue NOEs) in order to build a sufficient density of NOE-network anchoring [90]. The concept of NOE-network anchoring is based on the consideration that the correctly assigned NOE distance restraints will form a self-consistent NOE set, which is compatible with tertiary structure models of target proteins [87,88]. It is also important to empower automated NOE assignment by the program CYANA [90] (see Section 6).
In particular, a sufficient number of long-range distance information (5.0-6.0 Å, which are upper limit distance restraints) are critical for precise and accurate structure determination [104]. As the intensity of NOE peaks with important distance information, i.e., long-range, is often weak, optimization of experimental conditions of the NOESY spectra is important to gather a sufficient number of weak NOE signals. Higher sample concentration (N 0.5 mM), lower temperature, lower pH (in the case of 15 N-edited NOESY-HSQC) and stronger static magnetic fields are preferable parameters to ensure observation of weak NOEs. When only a minimal number of NOE signals are observed owing to the poor character of the target protein, e.g. aggregation or signal line broadening owing to averaging of multiple conformation caused by fluctuation in an intermediate-slow NMR time scale regime, long-range distance information can be collected, which may complement the shortage of NOEbased distance restraints, by protein labeling with paramagnetic metal ions or radicals to measure the paramagnetic relaxation enhancement (PRE) effect [105] or the pseudo-contact shift (PCS) [106,107]. The PCS data provide important NMR-based restraints that describe the global fold of the protein and/or tertiary structure of protein-protein complexes with high accuracy owing to its unique features: PCS provides both long-range distance restraints and angle information of bond vectors in the protein against the tensor frame of the magnetically aligned paramagnetic center molecule.
In addition, the resolution of NOESY spectra significantly affects the quality of the structure determined by NMR [108]. Using the highest possible resolution is appropriate for precise and accurate structure determination; this is especially important when performing automated NOE peak picking and assignments. Algorithms for automated NOE peak picking, e.g., NMRView [91], AUTOPSY [109], ATNOS [88], AUDANA [110], or CYPICK [111], have been used for preparing peak lists to perform automated NMR analysis.
Information about the location of hydrogen bonds can be used as distance restraints [90]. The rate of chemical exchange between the amide proton in the protein and protons in the solvent water can be estimated by performing H/D exchange [112] or CLEANEX-PM [113] experiments. It is possible that amide protons showing a significantly slow exchange rate may be involved in the formation of hydrogen bonds. At a stage of the structure calculation process that the global fold of the protein has been trustfully determined, pair residues forming hydrogen bonds may be deduced from the modeled structure. However, this step must be carefully executed to prevent incorrect assignment of hydrogen bonded residues. More ideally, therefore, pair residues forming hydrogen bonds should be experimentally determined by long-range HNCO measurements [114,115] or other analogous NMR techniques to avoid incorrect interpretation of the modeled structure. The distances between backbone amide protons and carbonyl oxygen atoms, or backbone nitrogen and carbonyl oxygen, are normally set as 1.8-2.0 or 2.7-3.0 Å, respectively. Since the distance restraints of hydrogen bonding are strong, giving large entropic information that contrasts those of dihedral angles and NOE-based distance restraints, introduction of hydrogen bonds should be applied at a later stage in the structure calculation. In a similar manner, information of the location of intramolecular disulfide bonds also available as restraints should be entered carefully into a structure calculation, perhaps after an initial model has been calculated using purely distance and dihedral restraints.

Dihedral Angle Restraints
Dihedral angles of the target protein, such as backbone φ(N-C α ), ψ(C α -C'), ω(N α -C') and side-chain χ(n), are determined by a wide variety of NMR experiments [86,90]. These angles are important for defining the secondary structure and the side-chain conformation of the target protein.
According to the Karplus equation, a dihedral angle can be estimated by measuring the 3 J-coupling constant [116]. For example, the 3 Jcoupling constants of each residue can be directly measured by NMR experiments such as 3D HNHA [117], ARTSY-J [118] and 3D HNHB [119]. However, acquiring these experiments with high precision and accuracy cannot be achieved easily because the sensitivity of these NMR experiments that measure 3 J-couplings is relatively low. In those pulse schemes, long evolution and refocusing periods (total~50 ms in the case of 3D HNHA) are required to measure the rather small 3 Jcoupling constants (b10 Hz), which would be especially difficult to measure accurately with a protein that gives rise to broad signals owing to, for example, molecular size or chemical exchange processes [118]. As an alternative approach, dihedral angles and the secondary structure of proteins are predicted by the programs TALOS (and related versions) [84,85], CamShift [120] and SHIFTX2 [121]. These programs use prior chemical shifts statistics, including 1 H N , 15 N, 13 C α , 13 C β , 1 H α and carbonyl 13 C nuclei, with corresponding structure coordinates.
Dihedral angle information is useful to refine secondary structures in spite of their application in simulated annealing as relatively mild restraints (lower and upper limits are normally around +/− 20-40°). For application of the predicted dihedral angles in structure modeling, it is important to use trusted chemical shifts derived from correctly calibrated NMR spectra, as described above.

Automated NOE Assignment and Structure Modeling of Proteins
Programs for automated NOE assignments and molecular modeling, e.g., CYANA [86,90,122], Xplor-NIH [123], AUREMOL (http://www. auremol.de/), critical assessment of automated structure determination of proteins by NMR (CASD-NMR) developed by the e-NMR project (http://haddock.chem.uu.nl/enmr/eNMR-portal.html) [19,124,125], PONDEROSA-C/S (http://ponderosa.nmrfam.wisc.edu/) developed by the National Magnetic Resonance Facility at Madison (NMRFAM) (http://www.nmrfam.wisc.edu/software.htm) [16,18,126] and UNIO (http://perso.ens-lyon.fr/torsten.herrmann/Herrmann/Software.html) [19,87,88,[95][96][97], are used for solution structure determination of biomolecules such as proteins and nucleic acids. In this mini-review, we will show an example of the program CYANA, which is used widely for NMR structure determination, to explain the general workflow of protein structure calculation. The amino acid sequence data of the target protein, the chemical shift table, and the NOE peak list are required as primary input files to perform automated NMR assignments and molecular modeling using CYANA [86,90,122]. In NOE based structure modeling by CYANA, seven steps of iterative calculations including the structure-modeling step for obtaining an initial global fold of the protein are performed. This is followed by refinements of the automated NOE assignments by referencing the calculated intermediate structure model of the protein using the algorithm extended from the program CANDID [90,122]. The automated NOE assignment process has the advantages of being fast, unambiguous and unbiased [88,127]. However, in many cases, manual intervention and/or re-assignment of the results of automated NOE assignment are necessary because completeness of the automated process is not perfect, especially when the peak maxima of the NOESY signals do not match 1 H chemical shift assignment table [88,127]. Several other algorithms, e.g., NOAH [128,129], ASDP [130], ARIA and its related programs [131,132], KNOWNOE [133], AutoNOE-ROSETTA [134] and PASD in Xplor-NIH [135], have also been developed for automated NOE assignments and structure calculation. Even if the stereo-specific assignments of methylene protons or methyl groups of leucine or valine residues are ambiguous, in the process of the structure-aided NOE auto-assignment, those signals can be automatically swapped with fair correctness by accounting for the modeled structure just before the final stage in the calculation. A high precision of modeled structures can be strongly supported by random combination of long-range distance restraints and sufficiently condensed NOEnetwork anchoring generated in the early stage of the CYANA calculation [87,90].
In particular, when performing structure-aided NOE auto-assignment and structure modeling, the quality of the resultant structures strongly depends on the completeness and correctness of the assigned chemical shits [86][87][88][89]. Notably, a number of spurious NOEs such as noise peaks included in the input can lead to wrong NOE assignments, and thereby result in incorrect modeled structures. These issues have been intensively discussed in a literature [90]; nevertheless, a sufficient number of correct but weak NOE signals providing long-range distance information is a key feature for accurate and precise structure modeling [90]. In contrast, using the conventional approach of manual NOE assignments and generation of distance restraints, violations in the structure calculation caused by restraint violations may act as indicators to determine whether there are problems in the restraints list. Thus, in the conventional way, when systematic violations arise from the presence of incorrect restraints, the structure calculation and revision of restraints have to be repeated exhaustively. Both conventional and automated approaches can lead to endless and repetitive rounds of calculations, and thus a set of criteria is needed to ensure such a cycle is avoided. Previously, the national project Protein 3000 mainly performed by RIKEN solved a large number of structures by solution NMR using the CYANA and Kujira systems (over 1080 structures) [15,17]. This project revealed a number of benchmarks to achieve when solving structures by NMR: 1) the remaining number of unassigned NOE peaks should be b~5%, which should not be localized to a certain region of the calculated structure; 2) high completeness and accuracy of the assigned chemical shifts of 1 H, 13 C and 15 N signals (N90% of completeness); and 3) a sufficient number of NOE signals in the NOESY spectra (N 10-20 NOEs per residue), with a wide intensity range and sufficiently high signal-to-noise ratio and resolution. Using the Kujira and MagRO system, the analysts can easily and visually inspect the accuracy of the assigned chemical shifts and the NOE signal that were automatically assigned by CYANA. In the process, NOE signals showing abnormal lineshapes and/or spurious NOE peaks can be eliminated. The aforementioned strategy should be suitable for structural studies of small proteins that give good-quality NMR spectra, which has been quantitatively evaluated by Kobayashi et al. [15,17].
Furthermore, a systematic violation found in the structure calculations when there is sufficient NOE network anchoring may aid in identifying inappropriate angle restraints.
A general way to model structures using distance and angle restraints is by simulated annealing, as described above [86,90]. The protocol of the calculation has been standardized. A completely unfolded state of the target protein, namely at an extremely high temperature (e.g. 10,000 K), is virtually generated as an initial model, and the temperature is gradually cooled to around 0 K to minimize the total potential energy (or target functions) and to satisfy the input NMR restraints. It is well known that because the modeled NMR structures calculated by simulated annealing tend to be trapped into local minima of the energy landscape, the analysts should run multiple simulated annealing calculations with different random seeds to get an ensemble of modeled structures as mentioned below.
The structure calculation by simulated annealing using CYANA is performed using a highly simplified force field with smaller van der Waals radii and without static electric potential energy. In modern NMR structure determination, therefore, the models determined by CYANA are further refined by molecular dynamics calculations with explicit or implicit water model systems using other computer programs powered by advanced force fields, e.g., Xplor-NIH [123,136], CNS [137], ARIA and related software [131,132], AMBER [138], OPLS-AA [139] and CHARMM [140].
Conventionally, the final NMR structure is represented by an ensemble of 10-30 lowest-potential energy structures or structures with the lowest number of violations. Superposition of the structural ensemble can be performed manually based on the secondary structures in the determined structures. The automated identification of the ordered region of the determined structures is also possible using particular programs, e.g., NMRCORE [141], FindCore [142], CYRANGE [143] and FitRobot [144]. The resulting structural ensemble can be graphically represented using molecular viewers such as UCSF Chimera [145], CCP4mg [146], VMD-XPLOR [147], PyMol (https://www.pymol.org/) and MOLMOL [148]. The convergence of the modeled structures is estimated by the root-mean-square deviation (RMSD, Å) of backbone and/or heavy atoms of the 10-30 models that represent the ensemble [149,150]. This RMSD provides an indicator of the precision of the models.
Regions where the structure is determined by a sufficient number of restraints appear well-converged, and therefore the RMSD value in those regions is small. The local diversity of the RMSD values directly correlates to the number of restraints used in the structure calculation. Through superimposed representation of the structural ensemble it is easy to identify regions that poorly converge or are over-restrained.
Poorly converged regions appear partially disordered or flexible. However, other possibilities may explain why the structure in that region could not be determined correctly in the modeled ensemble, i.e., insufficient restraints to force convergence of this region of the protein structure. Indicating whether the non-converged region is really disordered or dynamic in the deposition of the NMR structure into public databases or publication in a journal sometimes cannot be unambiguously stated. Thus, there is no conclusive evidence about whether the poorly converged region undergoes conformational fluctuations, without further investigation of the dynamics, kinetics and stoichiometrical features, which can be performed by NMR relaxation experiments.

Validation of Precision and Accuracy of the Determined Protein Structure
Validation of the modeled structures to assess the precision and accuracy is an important final step in the structure determination process by NMR. There is no guarantee that the protein structure determined by solution NMR with high precision matches the same protein structure determined by a different method [149,150]. To finalize the structural study by NMR, the analyst should carefully verify and assess both the precision and accuracy of the determined structure based on particular criteria.
As mentioned above, the convergence (precision) of the modeled structures can be generally assessed through the RMSD values of the 10-30 lowest minimal potential energy structures [151].
The accuracy of the modeled structures can be assessed through geometric parameters derived from the coordinates of the lowest energy structures [149,150,152]. The most widely used analysis examines bond angles, chirality and the side chain rotamer states, as well as the backbone conformation by creation of a Ramachandran plot. The results of the Ramachandran plot analysis are classified into four groups: 1) most favored, 2) additionally allowed, 3) generously allowed and 4) disallowed. Ideally the number of residues found in the generously allowed and disallowed regions of the Ramachandran plot is zero, or at least less than 10%, indicating that there are only a few geometric issues among the atoms and bonds in the structural ensemble [150]. This kind of analysis can be performed by PROCHECK-NMR [153,154] or MolProbity [155], and is a requirement when depositing the molecular coordinates into the PDB.
The orientation of bond vectors in a protein, such as 1 H N -15 N and 1 H α -13 C α , relative to the principal axis of the molecular alignment tensor of the structure can be investigated by measuring residual dipolar coupling (RDC) between the dipolar-coupled nuclei using uniformly 15 N-and/or 13 C-labeled protein in a weakly aligned medium [152].
The information about the bond vector orientations in a protein can be used to validate the accuracy of the determined structure [152]. The interaction of the dipolar coupling between two spin-1/2 nuclei depends on the angle and distance between the bond vector and their gyromagnetic ratio. In contrast to solid-state NMR, the Hamiltonian derived from the dipolar coupling interaction between two dipolarcoupled nuclei is not observed because it averages to zero because of isotropic tumbling of the protein in solution. When the protein is dissolved together with a medium that orients under the influence of the magnetic field, e.g., a liquid crystalline formed by lipid discs such as bicelles [156], or filamentous phage such as Pf1 [157,158], the aligned medium will provide some extent of anisotropy to the proteins owing to slight restriction of the isotropic molecular tumbling due to repetitive collisions between the protein and the aligned media [159]. As a result, a small residual value of the dipolar coupling between nuclei can be observed, and the size of the RDC can be tuned by the extent of alignment of the protein.
From the Cartesian coordinates of the aligned protein, it is theoretically possible to predict the alignment tensor of the weakly aligned protein using programs such as PALES [160] or REDCAT [161] by simulation of the predicted dipolar coupling between the bond vector and the principal axes of the estimated alignment tensor of the protein.
RDC values can also be calculated from the determined structure. The back-calculated RDC values from the structure (the RDC data have not been used as restraints in the structure calculation) should show a high correlation (correlation coefficient = 0.8-1.0) with the measured RDCs, confirming that the structure has been correctly determined [162]. Therefore, we would encourage this approach as a means to validate the protein structure, if the target protein is stable in alignment media and a sufficient number of RDCs are measured.
As aforementioned, if the analyst wants to directly apply the RDC values as restraints for structure calculation [159], it is necessary that the alignment tensor of the protein be determined precisely and unambiguously in advance. In a conventional manner, precise and unambiguous determination of the alignment tensor of the protein based on RDCs is impossible without information about the tertiary structure of the target protein. Therefore, conventionally, restraints for bond vector orientations using RDCs are generally used during the final refinement stage in NMR structure calculations [159]. Recently, however, methods for estimating the alignment tensor based solely on a histogram distribution of the experimental RDCs (without information about the tertiary structure) have been developed [163].
There is always a certain possibility that the determined structure with a high convergence is (locally or wholly) incorrect because internuclear distance restraints based on NOE and dihedral angle restraints fundamentally provide local (short-range, b6 Å) information, which means that those restraints cannot unambiguously restrict the relative orientation between each secondary structure or sub-domains of a protein [159]. In particular, in the cases of a multidomain protein or a protein-peptide complex, accurate determination of the relative orientation between each molecule is difficult when using such shortrange restraints, even if the tertiary structure of each component has been determined precisely and accurately [164][165][166][167].When the NMR structures have been refined using RDC data as restraints, as a matter of course, the RDC data may no longer be applied to validation.

Concluding Remarks
Solution NMR spectroscopy is a popular method to determine the tertiary structure of proteins. Surprisingly, however, there are only a handful of articles that describe the workflow of protein NMR structure determination [71,[168][169][170]. We anticipate that this mini-review should help scientists who are interested in protein solution structure determination by NMR.
The rapid and steady progress in NMR hardware and software, e.g., CYANA upgrades and automation of NMR signal assignment programs, has continued unabated [171]. Furthermore, recently, integrated NMR software platforms offering systematic and semi-automatic biomolecular structure determination, e.g. PONDEROSA-C/S [16,18,126] and UNIO [19,87,88,[95][96][97], have been developed. Moreover, global fold determination of desired proteins has now become easier owing to the development of programs, e.g. CS-ROSETTA [172,173], which can determine protein structures using only chemical shifts and RDC data as restraints.
In addition, solution NMR spectroscopy is a powerful tool for intermolecular interaction studies and screening of druggable seed compounds with atomic level resolution, because it can detect and determine a binding site, even if the interaction is extremely weak. By performing magnetization relaxation dispersion NMR experiments, tertiary structures of minor, excited or invisible protein conformations can be determined when this conformation is populated by b1% [201,202]. Therefore, it is expected that methodological advances of solution NMR spectroscopy, which can determine the tertiary structure of a protein in the bound state accurately and precisely in a weak interaction, will become a unique and beneficial structural biology tool by maximizing the features of solution NMR spectroscopy. Experimentally determined internuclear distances and secondary structure information predicted by TALOS and its related programs [84,85], CamShift [120] and SHIFTX2 [121] can greatly assist accurate homology based modeling or docking simulations to obtain structural models. The most successful examples using these methods would be HADDOCK [203,204] and CS-ROSETTA [172,173], whose feasibility has been generally accepted to yield structural insights that give rational explanations describing the biological functions of biomolecular complexes and models. The integration of the above methods and/or additional structural data such as X-ray scattering and information about intermolecular contacts derived from pull-down and yeast-two-hybrid assays as well as point mutation data, the so-called hybrid/integrative methods, is a new approach to study large biomacromolecular systems [205].
A number of software and web-based resources for NMR data analysis as described in this article are available and user friendly, and should help to open avenues for non-specialist and life scientists interested in studying the structure of proteins.