Going clean: structure and dynamics of peptides in the gas phase and paths to solvation

The gas phase is an artificial environment for biomolecules that has gained much attention both experimentally and theoretically due to its unique characteristic of providing a clean room environment for the comparison between theory and experiment. In this review we give an overview mainly on first-principles simulations of isolated peptides and the initial steps of their interactions with ions and solvent molecules: a bottom up approach to the complexity of biological environments. We focus on the accuracy of different methods to explore the conformational space, the connections between theory and experiment regarding collision cross section evaluations and (anharmonic) vibrational spectra, and the challenges faced in this field.


Biomolecules in the gas phase
In spite of bearing little resemblance to biological environments, the experimental and theoretical study of biomolecules in the gas phase has been steadily gaining importance in the past decades, especially among physical scientists. Pioneer experimental studies starting in the late 90s encompassing all main groups of biomolecules [1][2][3][4][5][6][7][8][9] were able to show that much physical insight on structure formation and dynamics of these molecules can be gained from transfering them to the gas phase. The reason is that the gas phase offers clean conditions, under which theory and experiment can meet on equal footing and can follow a stepwise bottom-up approach towards the full complexity of the real biological environment. The reduced size of the systems allows their treatment with a range of theoretical methods that rely on approaches to solving the quantum mechanical Schrödinger equation, usually referred to as 'first-principles' methods. These methods are typically much more accurate than empirical models-but due to the intrinsic approximations in them, it is also a priori unclear how well they are actually able to describe the structure and dynamics of biomolecules in the gas phase. In this synergistic combination, experiments can serve as a benchmark for testing how appropriate the theoretical treatment of these complex systems is, while theory can be employed to give a physical interpretation to experiments.
In this review, we give a brief survey of the current state of the field regarding the study of, in particular, peptides in the gas phase. We focus on the theoretical side of this field, summarizing what is the current state-of-the-art with respect to accuracy of such calculations, the systems sizes they can treat, what is their predictive power, and where there is room for improvement-basing a good portion of it on works in which the authors were involved. We also give special attention to the dynamical nature of these molecules, and the importance of grasping at least local entropic, anharmonic and temperature effects. There will be less of a focus on long time scale dynamics of these molecules, which involve large conformational rearrangements (e.g. folding). We choose to concentrate on local dynamics because these span time scales currently accessible to first principles potentials. Also they can be connected to most state-of-the-art experiments available in the literature for medium-sized peptides.

Polymeric biomolecules in the gas phase
There are three main classes of biomolecular oligomers and polymers, namely peptides and proteins (see figure 1), nucleic acids (figure 2(A)) and carbohydrates ( figure 2(B)). Below we briefly describe each of them, with a stronger focus on peptides and proteins, which will be the main subject of this review.
Peptides and proteins make up the machinery of life and are involved in virtually all of its manifestations, from comparably small signaling peptides to gigantic protein complexes. A peptide or protein is a linear chain (oligomer) of amino acids (residues) that are linked by so-called peptide bonds (see figure 1, top). Peptide bonds are formed between the amino group and carboxylic acid group of two building blocks. In addition, amino acids carry a side chain 'R' of differing chemical functionality. The sequence of the different amino acid side chains R is called primary structure and defines the structure and dynamics of the peptide or protein. Oligomers beyond a certain length (from about 50 amino acids on) that are able to form distinct structural motifs are called proteins. Structure formation at the level of peptides (secondary structure) is mainly dependent on the conformational properties of the monomers and backbone hydrogen bonding. In larger oligomers, i.e. in proteins, side chain interactions and packing gain importance and govern tertiary structure formation. These larger proteins or even complexes thereof can be studied in isolation as well (see, for example, a recent review by Carol Robinson [11]).
Among other biological functions, nucleic acids are the carriers of genetic information. In an organism, a sequence of nucleotides in deoxyribose nucleic acids (DNA) can be transcribed into ribose nucleic acids (RNA) that then serves as template for the stepwise linkage of the amino acids into peptide or protein chain. They feature a sugar-phosphate backbone with nucleobases connected to the (deoxy)ribose moieties (see figure 2(A) for a pictorial representation of the different groups). Structure formation is mainly triggered via intermolecular hydrogen bonding between specific pairs of bases (base pairing) in case of DNA or intramolecular base pairing in case of RNA. A recent review by Abi-Ghanem and Gabelica [12] may serve as an entry point to the literature about nucleic acids in the gas phase. Arcella et al [13] investigate DNA in the gas phase by combining ion-mobility mass spectrometry and extensive classical molecular dynamics (MD) and ab initio molecular dynamics (AIMD) simulations. They describe rich dynamics of DNA that quickly looses memory of its solution structure in the gas phase and explores a large conformational space. Of special interest is the observation of protons hopping between phosphates of the DNA backbone that was seen in AIMD simulations.
Polymeric carbohydrates serve as nutrition and energy source or as structural scaffolds. They can also be linked to proteins acting as recognition molecules and possibly playing a role in protein folding. Most of the known carbohydrates are composed of around 20 different monosaccharide units connected to each other by what is called the glycosidic bond or linkage (see figure 2(B)). In contrast to the backbone of peptides or the backbone of nucleic acids, carbohydrates are not necessarily composed as a linear chain. The building blocks have one donor (the anomeric C) but multiple acceptors for glycosidic bonds, such that branched structures can be realized. In addition, due to chirality, glycosidic bonds can be formed in two chiral (enantiomeric) forms (α or β). These contributions result in a diversity of possible topologies of carbohydrates that surpasses the number of possible sequences in nucleic acids and peptides by orders of magnitude, even with relatively small numbers of building blocks [14]. The significant conformational degrees of freedom are rotations around the single bonds of the glycosidic linkages and the conformation of the monosaccharide rings.
The main focus in this review will be on secondary structure stabilization and dynamics in peptides containing from a few to some tens of amino acids. These motifs are shown in figure 1. Briefly there are three main elements of secondary structure, namely, helices, pleated-sheets, and turns. The turns are regarded as non-periodic motifs, while helices and sheets are regarded as periodic, in the sense that a repeating unit can be defined, allowing for a characterization based on pairs of torsional angles. The nomenclature given to the helices depend on the hydrogen bonding pattern that arise from their constituting residues (amino acids). The most famous types, the α and the 3 10 helices are characterized by H-bonds between residue i to i + 4 and residue i to i + 3, respectively. Sheets are also stabilized by backbone H bonds and can be characterized as parallel and anti-parallel depending on the relative orientation of their peptide chains. Finally, turns are necessary motifs to reverse the propagation of sheets and helices, so that compact structures can be formed. It is not necessary for a H-bond to form in order for the motif to be characterized as a turn, but many do form through the formation of H-bonds. The most common type is known as the β turn. Turns cause a complete reversal of the direction of structure propogation.

Experimental techniques probing conformation and dynamics in the gas phase
The study of (bio)molecules in the gas phase has become more popular in the past decades mainly due to the development of experimental techniques in the late eighties, that can gently transfer intact biomolecules to the gas phase, like MALDI (matrix-assisted laser-desorption ionization [15]) and ESI (electro-spray ionization [16]), in combination with high-accuracy mass spectrometers [17,18] or in molecular beams [19].
When dealing with peptides, it is possible to isolate secondary structure motifs in the gas phase, so that their 'unperturbed' energy landscape and stabilizing intermolecular interactions can be carefully studied. The environmental effects can then be added in a controlled way, for example by the stepwise addition of water molecules to the polypeptide or by adding ions to the complexes. At the same time, these clean experiments in the gas phase allow to benchmark theoretical methods, at system sizes that can be treated in a fully first-principles manner. There is much debate as to how biologically relevant the study of biomolecules in the gas phase actually is [20][21][22], since it is to be expected that due to the lack of solvent and hydrophobic/hydrophilic interactions one can actually stabilize different conformations in the gas phase. From a more physical perspective, it is undeniable though that in these experiments much insight about the fundamental stabilizing interactions can be gained, also encompassing what is the role of the protonation state and the first shells of solvation, or the interaction with only ions, ions and water, etc. This understanding can be certainly transferred to the more complex biological environment.
There are several reviews about studies of biomolecules (peptides, proteins, sugars, etc) in the gas phase in the literature, from which we highlight only a few for the interested reader, namely [1-3, 6, 8, 9, 18, 20-25]. Below we give a brief overview of the main experimental techniques that yield quantities which can be connected to theoretical calculations that we will review in the next sections.

Ion mobility-mass spectrometry.
Mass spectrometry (MS) is a powerful gas-phase experimental technique that separates ionic clusters or molecular ions based on their mass-to-charge ratio (m/z). With ion mobility (IM), or gas chromatography, charged molecules and clusters can be separated according to their different mobility in a buffer gas. Especially the combination of both techniques, ion mobility-mass spectrometry (IM-MS), first accomplished in 1962, [26] can allow for the separation and characterization of mixtures of compounds or conformers that would otherwise not be distinguishable. In the context of this review, we focus especially on the ability of IM-MS to investigate structural and dynamical properties of peptides.
In IM-MS experiments, an electric field drags the ions through a drift tube of a certain length. This drift tube is filled with a buffer gas (often He or N 2 ) and collisions between buffer gas and ions slow down the ions depending on their shape and size. As a result, an arrival-time distribution (ATD) of m/z selected ions is measured by a detector, as sketched in figure 3. The arrival times can be transformed into collision cross sections (CCSs) by the Mason-Schamp equation [27] where ze refers to the net charge of the system, μ is the reduced mass of the ion and the buffer-gas particles (usually He atoms), and k B is Boltzmann's constant. The reduced mobility K 0 is the proportionality constant that relates drift velocity v d and the electric field E of the apparatus following the relation The resulting CCS is a geometrical property of the molecule and it is ideally independent of the apparatus used.
A few examples of the use of IM-MS experiments to study structure and dynamics of peptides in the gas phase are • Jarrold and coworkers have developed a high-temperature drift-tube instrument and studied polyalanine helices in the gas phase from room temperature to 725 K [28]. The surprising finding is that helical structures can be observed still at these high temperatures for the peptide Ac-Ala 15 -Lys(H + ). Tkatchenko et al identified van der Waals (vdW) interactions as the crucial stabilizing contribution in DFT-based molecular dynamics simulations [29], being essential to explain the high temperature stability of the helical structure observed in experiment. • Based on ion-mobility measurements, Shelimov and Jarrold were able to show the unfolding and refolding of Cytochrome C in vacuum [30]. The folded versus unfolded state is linked to different charge states with a folded to unfolded transition between charge states +5 and +7. • By using a combination of IM-MS and MD simulation, von Helden and co-workers studied different combinations of cis/trans isomerization states of prolyl peptide bonds of ubiquitin [31]. CCS measurements and computations are sensitive enough to reveal the cis or trans conformation of a single peptide bond in a biological macromolecule. • The group of Clemmer has played a leading role in devising drift-tube apparatus using them to investigate different kinetically-trapped conformations of, for example, Bradykinin [32,33]. • Russel and co-workers have used a cryogenic drift tube at 80 K to investigate the structures of singly-protonated water clusters [34]. They were able to measure small (1-30 water molecules) and large clusters (31 up to about 120 water molecules) and to assign changes of H bonding upon loss of single water molecules from the clusters.  is encountered, due to the absorption of single or multiple photons, the sample dissociates or fragments and detection via mass spectrometry is possible. One can detect either the fragments or the depletion of the molecular beam. At this point, two types of action spectroscopy can be performed. One technique, commonly called infrared photo-dissociation IRPD (sketched in figure 4(A)), is usually performed at lower temperatures and uses inert tags (e.g. H 2 , Ar, Ne, etc) on the target molecule that are released after the absorption of one or very few photons, due to the low binding energy of the tag. Another technique, called infrared multiple-photon dissociation IRMPD (sketched in figure 4(B)) does not use any tag and simply measures the fragmentation of whole molecules due to the sequential absorption of at least a few tens of photons. For detailed reviews of the experimental techniques, we point the reader to [18,23,35,36]. In both action spectroscopy techniques mentioned above, non-linear effects can arise due to the absorption of more than one photon. Therefore, different from absorption spectroscopy where the spectra can be safely approximated by a linear response theory, here it is not a priori clear that the vibrational spectra measured in this manner will allow a linear response modelling. Especially for IRMPD, where indeed many photons are absorbed sequentially, causing induced and spontaneous emission as well as energy redistribution among vibrational modes, it is clear that a linear response approximation may fail. It has been shown that while the line shape and intensity of the peaks can be strongly influenced by these non-linear absorption effects, the peak positions usually follow the ones calculated by linear response [37,38] with slight red-shifts due to anharmonicities. In certain systems, it is found in particular, for lower frequencies below 1200 cm −1 , some peaks are transparent to IRMPD (but not to IRPD), as was shown by the group of Asmis for microhydrated nitrate-nitric acid clusters [39] and bisulfate/sulfuric acid/water clusters [40]. The reason they propose is that the absorption of photons disrupts the hydrogen bond network of these systems and causes the modes to go out of resonance with the frequency of the laser. More specific comparisons regarding theoretical modeling and experimental IR spectra will be given in section 4.2.
In the experimental studies, several different parts of the vibrational spectra can be probed, which are sensitive to different conformational properties: (i) The amide A/B regions, comprising localized CH and NH stretch vibrations above ≈ 2500 cm −1 , sensitive to the H-bonding pattern; (ii) the amide I (mainly collective CO stretch vibrations), amide II (mainly collective NH bend vibrations), and amide III (collective and localised CH and CN bend vibrations) regions between 2000 and 800 cm −1 , sensitive to backbone conformation; and (iii) the 'far-infrared' region, below 800 cm −1 , which contains collective vibrations and is also sensitive to backbone conformation. While much focus has been given to the amide A/B and amide I and II regions in most studies, attention has been called to the amide III region in mid-sized polypeptides [41,42] and to the far-infrared region in small polypeptides [43] as regions that can be used to differentiate conformations, if anharmonicities of the potential-energy surface are taken into account. As an illustration, we show these regions and the harmonic normal modes of vibrations calculated with the PBE exchange correlation functional for the formamide molecule in figure 5.
Gas phase investigations can be used to study distinct aspects of protein secondary structure, peptide bond properties, and aspects of microsolvation. In the following we discuss a few outstanding examples: • Tanabe et al have used UV/IR pump-probe experiments on clusters of acetanilide and water to investigate the motion of a single water molecule from the hydrogen bond acceptor (CO group) to the hydrogen bond donor (NH group) of a peptide bond [44]. • Gerhards and co-workers studied dimers of the short peptide Ac-Val-Tyr(Me)-NHMe in molecular beam experiments [45]. The combination of IR/UV doubleresonance spectroscopy and simulated vibrational spectra (harmonic, B3LYP/cc-pVDZ) identifies the formation of an anti-parallel β sheet-like structure. The study shows that sheet-formation can be regarded as an intrinsic trend of peptides that is not necessarily linked to aqueous solution. • The group of Rizzo has a long-standing experience in UV/ IR experiments on helical peptides Ac-Phe-Ala n -LysH + with a C terminal protonated Lys and a Phe residue as UV chromophore [46]. The helical pattern has been elucidated with a 15 N labeling technique. The C terminal capping motif that is present in the longer helices with ⩾ n 5 has recently been shown to be present also in short peptides with n = 1 [47]. These results confirm predictions about the helix onset made by Rossi et al for very related systems [48]. These systems with the aromatic Phe side chain are a challenge to theory and will be discussed further on in this review.
• Besides strands/sheets and helices, turns are the third main secondary structure motif in proteins. The group of Mons studied peptides Ac-Phe-NH 2 , Ac-Phe-Pro-NH 2 , and Ac-Pro-Phe-NH 2 from supersonic molecular beams wit UV/IR double resonance spectroscopy [49]. The authors assign various turn types and indicate the dependence of Phe conformations on the neighboring residues. • Johnson and his group have most elegantly shown how gas-phase infrared spectroscopy of cold ionic complexes can be used to elucidate not only molecular structure, but also the way two molecules interact with each other [50]. They use site-specifically placed 13 C labels as conformational reporters. Difference spectra between the distinctly labeled systems allow for structural investigations of a single peptide ion and also complex formation through binding to sodium cations or with other molecules. • In order to directly estimate the energy barriers between different conformers, Zwier and co-workers developed a double resonance conformer selective pump and dump technique that excites molecules to a higher electronic level and then relaxes them back into a specific vibrational ground state [51]. With this approach the authors were able to reconstruct the potential-energy surface of tryptamine. • Compagnon and coworkers have carried out seminal work on the FELIX free electron laser on peptides in the gasphase, for example looking at backbone preferences [52], internal proton transfer that can stabilize zwitterionic structures in the gas-phase [53], and microsolvation of amino acids [54]. More recently they have been looking at sugars in the gas phase [55], focusing especially on the issue of 'anharmonicities in vibrational modes'. • The group of Lisy has a body of work based on IRPD regarding the influence of charge due to the interaction with 'metal ions and temperature' on the conformational preferences of small biomolecules [56,57]. • Vaden, Snoek, and coworkers have measured IRMPD spectra of a variety of peptides in the gas phase, also performing extensive structural searches involving density functional theory. They have, for example, studied the Ala n H + , n = 3, 4, 5, 7 series of peptides [58] in the amide A/B region concluding that these peptides form mostly globular structures at larger sizes, despite the high propensity of the Ala amino acid to for helices. They have also looked at peptide sequences relevant to amyloid formation, showing that even if the isolated structure of Ac-VQIVYK-NHMe is folded, the simple interaction with another monomer in the gas phase seems be energetically favorable enough to trigger a conformational change and 'β-sheet aggregation' [59].
Vibrational spectroscopy techniques can also be combined with ion mobility-mass spectrometry. A first example of, in that case, electronic spectroscopy of mobility selected peptides was published by Rizzo and coworkers [60]. They use a field asymmetric waveform ion mobility spectrometry (FAIMS) setup combined with UV photofragment spectroscopy in order to decompose the electronic spectrum of doubly-protonated bradykinin in a conformer-specific manner. Also Voronina and Rizzo demonstrate how to use a combination of ion-mobility selection and cold-ion spectroscopy to study kinetically trapped conformers of triply-protonated bradykinin [61]. von Helden, Pagel, and co-workers have used ion mobility in order to separate conformers of protonated benzocaine and to record vibrational spectra [62].
The spectral resolution can be improved by measuring vibrational spectra of cold species. This can, for example, be realized in cold traps that are utilized in IR/UV double resonance experiments where changes of UV fragmentation yield are recorded as a function of IR excitations [63,64]. Alternatively, ions can be embedded in liquid He nanodroplets [65] and therewith cooled to an equilibrium temperature of about 0.4 K. Employing such a setup, von Helden and coworkers have measured vibrational spectra of the short peptide leucine-enkephalin [66].

Accuracy of the potential-energy surface
The potential-energy surface (PES) of a system is often defined as an energy function of the coordinates that tells how energy changes with respect to a change in any atomic position. This definition assumes an adiabatic separation of the electronic and nuclear degrees of freedom (known as the Born-Oppenheimer approximation). Moreover, it is usually (but not necessarily) also connected to the assumption that nuclei are classical particles. Even if both of these assumptions can break down in many situations (some of them discussed in the next sections), they are, in most cases, a good approximation or at least a good starting point to map the energy profile of a system.
When dealing with biomolecules in general, the challenging aspect is that the PES are often far from simple: due to the existence of several soft and anharmonic degrees of freedom these PES tend to have several different local minimaall of which will contribute to the partition function and thus define the thermodynamical properties of the system. If this PES is not rigorously described also all thermodynamic properties and structural preferences of the system will not be reliable. Especially the amount of anharmonic degrees of freedom make most harmonic approximations fail for these systems.
Perhaps the most popular way of evaluating PES are the so-called force fields. Force fields are parametrized empirical energy functions that represent the energy of a given system in terms of the sum of qualitatively different interactions. In the case of molecules (and especially peptides) the different contributions are separated into bonded interactions (e.g. potentials for bond lengths, bond angles, and torsions) and non-bonded interactions (e.g. van der Waals and electrostatics). For all of these terms, the functional form is physically motivated but arbitrary, and the parameters are fitted to either experimental data or quantum chemistry methods. The advantages of such an approach is that energy evaluations are computationally cheap. Therefore, these methods (if used in combination with smart sampling techniques [67]) typically allow enough statistical sampling to enable the evaluation of thermodynamical properties and to treat system sizes that can bear more connection to biological size-and time-scales with respect to more accurate methods that are too computationally expensive. If used with caution, these potentials can yield good physical insight on the structure and dynamics of biomolecules. However, it is becoming more clear that their performance in many situations is far from optimal. Especially regarding polypeptides, recent literature has shown that force fields have several limitations when compared and benchmarked against higher level quantum chemistry methods. Relative energies between different peptide conformations are not well reproduced [68][69][70][71] and differ quite drastically between different force fields. Regarding the interaction of peptides with ions, force fields have been shown to yield even poorer energetics with respect to high level theoretical benchmark data [72,73], even when especially tailored parameters and polarizable potentials are used. More recently, a study has shown that kinetic models derived from converged simulations based on different non-polarizable force fields largely disagree [74].
The desired solution would be to describe the potentialenergy surface (PES) at least as accurately as possible for the electronic degrees of freedom-which would mean to use methods like full configuration interaction (full CI), coupled cluster with a high enough order of excitations (e.g. with single, double, and perturbative triple excitations CCSD(T)), or quantum Monte Carlo (QMC). These methods are considered the gold standard of quantum chemistry, and do indeed provide a very accurate description of potential energy surfaces, but of course, are very costly to compute. Even if they can be used for benchmarking purposes it is not computationally feasible to routinely use them for PES exploration and the simulation of other physical properties.
A good compromise can be found among the wave function based methods, for example with Møller-Plesset perturbation theory (MP2) or coupled-cluster methods with lower-order excitations, e.g. singles and doubles (CCSD). A promising route is to use approximations like the domain based local pair natural orbital coupled cluster method with single-, double-, and perturbative triple excitations (DLPNO-CCSD(T)) [75]. The method is described as efficient enough to perform rather accurate coupled cluster calculations even for relatively large molecules with hundreds of atoms. However, some of the approximations must be carefully balanced [76]. It is typically computationally cheaper to use electronic density based methods like density-functional theory (DFT). DFT, with its approximate exchange correlation functionals, is arguably the best compromise between cost and accuracy in the market of electronic structure theory methods. Its advantage is that it allows one to treat molecules of sizes up to a few thousand atoms and reach time scales of hundreds of picoseconds in its most optimized implementations (Big-DFT [77], ONETEP [78], FHI-aims [79], CASTEP [80], CP2K [81], etc).
It is well known that results from DFT can depend on the choice of exchange-correlation functional. However, since the theory itself is based on the first principles of quantum mechanics, it is possible to obtain accurate results as long as one ensures that the chosen functional can describe the relevant physical properties of the system. For example, most standard DFT functionals lack, by construction, long range van der Waals (vdW) dispersion. It is, however, now widely accepted that these interactions have a critical impact on the structure [48,69,70,72,73,83,84] and dynamics [29,83] of peptides, especially for the larger sizes. It becomes thus almost mandatory to include these interactions in the most accurate manner in DFT calculations of peptides in any type of environment, and several schemes for including these corrections have been proposed in the last decade, which were nicely reviewed in [85]. Also, the inclusion of Hartree-Fock exchange can mitigate the self-interaction/delocalization problem of DFT and substantially change the strength of H Mean absolute error and maximum error for the energy hierarchies of 16 conformers of Gly-Phe-Ala (GFA), 15 conformers of Gly-Gly-Phe (GGF), 15 conformers of Phe-Gly-Gly (FGG), and 27 conformers of Ac-Ala 3 -NMe, compared to CCSD(T) reference data from [70,82] Note: Values for the mean-absolute errors (MAE) and maximal errors (Max.) are reported in meV (in parentheses: converted to kcal mol −1 ).
bonds, the description of polarizability, or barriers for conformational dynamics. What is more prudent to avoid in DFT is to blindly use different types of functionals without any kind of physical reasoning or benchmarks.
As an example of the type of accuracy that can be reached with state-of-the-art DFT methods nowadays, we show in table 1 mean absolute errors and maximum errors on relative energies for three-residue peptides, shown in figure 6 (FGG, GFA, GGF, Ac-Ala 3 -NMe), of DFT functionals with respect to CCSD(T) reference benchmark data. We test a generalized gradient exchange correlation functional (PBE [86]) and include both a pairwise van der Waals correction with C 6 coefficients that depend on the electronic density [87] (vdW TS ), and another that includes both electrostatic screening and many body effects up to infinite order through a coupled fluctuating dipole model [88,89] (MBD@rsSCS, which we here call MBD). We also test a hybrid exchange correlation functional with these corrections, namely PBE0 [90]. For comparison, we also calculate the same relative energies with popular non-polarizable force fields (OPLS-aa [91], Amber99sb [92], Charmm22 [93,94]) and the polarizable force field AmoebaPro04 [95,96]. Augmenting DFT approaches with a correction for long-range van der Waals interactions leads to energy estimates that agree very well with CCSD(T) calculations, which is evident by low mean-absolute errors (MAE) and low maximal errors. For example PBE0 + MBD yields MAEs of only up to 28 meV (0.6 kcal mol −1 ) and a maximal error of 66 meV (1.5 kcal mol −1 ). The force fields tested here and the bare functionals alike give higher MAE and also higher maximal errors that severely limit their predictive power.
In order to illustrate how such errors can impact larger polypeptides, the experimental benchmark helix-forming peptide Ac-Phe-Ala 5 -LysH + is ideal. From very accurate conformer selective UV-IR double resonance experiments in the gas-phase by Stearns and coworkers [46], it was established that four conformers are present in the experimental beam, which have been satisfactorily assigned to helix-forming structures, based on the similarity of their harmonic IR spectra to the measured ones. A subsequent study [97] considered 19 density functionals, plus Hartree-Fock and MP2 methods, finding that the spread of the relative energies of these four conformers could vary by around 0.15 eV for these methods. None of the functionals considered included long-range van der Waals interactions. Further studies on the same system by Rossi and coworkers [69] considered a larger pool of conformers coming from an extensive first-principles scan of the PES of this peptide. Based on the benchmarks shown in figure 6, the authors found that when considering the energy hierarchies at the PBE0 + MBD level and (harmonic) zero point energy contributions on this system, the four conformers observed in experiment are indeed predicted to be the ones with lowest energies. The spread of their energy differences is also consistent with what is estimated from experiment (≈50 meV), and within the estimated error bars, such that the detailed energy hierarchy between them cannot be safely predicted by any DFT method. Interestingly, [69] finds that the relative abundances for different conformers observed in experiment are better explained by a kinetic trapping from higher temperatures.
Finally for even larger peptides, where the experimental data is also not so conclusive, small energy differences can be even more important as the conformational landscape can get more congested. We take as an example the 20-residue peptide Ac-Lys-Ala 19 -H + , studied in [98] by Schubert, the author of this review, and coworkers. We show in figure 7 (data reproduced from [98] and [99]) in panel (A) the comparison between the force field relative energies for thousands of conformers predicted by the OPLS-aa force field, and relative energies of the same conformers when further relaxed with PBE + vdW TS 'light settings' (smaller basis sets and integration grids in the FHI-aims [79] code) and 'tight settings' (larger basis sets and integration grids). The scatter is huge, spanning up to 1.5 eV in DFT for conformers that were 0.5 eV apart in OPLS-aa. We also show in figures 7(B) and (C), for a set of selected conformers of this molecule the comparison between the energy hierarchies of PBE + vdW TS and the AmoebaPro13 force field [100], and the comparison between the different functionals and van der Waals corrections discussed above. The energy differences between the functionals are much smaller than the difference comparing to AmoebaPro13. Many body van der Waals dispersion do indeed have an impact in molecules of this size, which in this case also improves agreement to experimental data, as discussed further in section 4.2.

Sampling the PES connecting to first-principles methods
The degrees of freedom (DOF) that define a PES are the positions of all atoms of the molecules expressed in, for example, Cartesian space, internal coordinates, etc. For molecules, one can often simplify that (reducing the number of DOF) by assuming a fixed configuration of the molecular system (basically assuming that covalent bonds do not break). As a consequence, an internal coordinate system consisting of bond lengths, bond angles, and torsion angles can be used to describe a molecule's structural (conformational) space. Since bond lengths and bond angles typically vary around a single equilibrium value, torsion angles are often the most descriptive internal coordinates for a molecular system. An exploration of a molecule's potential energy surface must sample the space defined by the combination of all its torsional degrees of freedom. For a typical peptide molecule with three backbone torsion angles per residue and further torsions in the side chain, the problem easily gets too large for a systematic grid-based enumeration of possible points on the PES. A single alanine building block in a peptide chain has three torsional DOF. (See figure 8: the torsions φ and ψ represent rotations around single bonds and the peptide bond torsion angle ω adopts cis or trans conformations. Assuming a grid of 60 degrees for discretization of the single-bond rotations yields × × = 6 6 2 72 conformations to test for a single building block. For a chain of N building blocks this number virtually explodes already for short peptides with 72 N . A variety of strategies has been developed and employed to explore these conformational spaces connecting to first principles methods. Below, we will give a rough definition and some examples of them. • Systematic searches can be performed by discretization of the involved degrees of freedom with sufficiently fine grids. All combinations of torsion angles are either subject to a single point energy calculation or serve as starting point for local geometry optimizations. Such an approach is well applicable to small molecular systems, e.g. dipeptides. With a more 'target-oriented' objective, also bigger systems can be studied in a systematic way, if only a particular region of the search space is of interest. An example is the search for all possible helical structures in homologous peptides, i.e. peptides which have their backbones extended by methylene units. With the aim of finding such periodic and hydrogen bonded structures, the same combination of backbone torsion angles is applied to all subunits and only geometries that are (i) clash free and (ii) feature a backbone hydrogen-bonding pattern of interest are considered [101][102][103]. • Systematic searches can easily be performed for monomers. The knowledge gained in this way can then be combined in the creation of starting structures for longer oligomers of the respective building block(s). This approach has been successfully employed for example to β-peptides, which are homologous peptides with an addition of one methylene unit. [103][104][105][106]. • Parallel-tempering or replica-exchange molecular dynamics (REMD) can substantially enhance the sampling of conformational space in comparison to standard MD simulations [107][108][109][110][111][112]. REMD requires only limited human interaction and no definition of collective variable or alike. Robust protocols exist for a wide range of simulation programs. Several copies (a.k.a. replicas) are simulated in parallel by means of MD simulations at different temperatures. At predefined intervals, pairs of replicas with neighboring temperatures are eventually swapped based on a Metropolis criterion. The individual copies traverse a wide temperature range and can overcome barriers. • Basin hopping [113] reduces the PES to attraction basins centered on local minima. In contrast to REMD, moves on the landscape do not follow realistic pathways. The basic algorithm starts with a structure guess and a local optimization to the next local minimum. A perturbation of coordinates generates a new staring point for a geometry optimization that leads to the next minimum. This sequence of coordinate perturbation and local optimization is repeated until a convergence criterion is met. Frequently used implementations are for example in the programs TINKER [114] or GMIN [113]. • Genetic algorithms (GAs) are frequently used for global structure search and optimization of chemical compounds [115][116][117]. They use a 'survival of the fittest' concept.
Starting from a population of random solutions, genetic operations are applied and energy-optimal solutions are selected. GAs use the accumulated information to explore promising regions of conformational space. Examples are the program foldaway by Damsbo et al [118] and the program Fafoom [119,120] that can employ first-principles techniques.
A complete sweep of the potential-energy surface with any of the above mentioned methods is anything but trivial. All methods require parameter choices that have to be made by the respective user as well as a careful selection of the energy function to be used. While force fields offer low computational costs and therefore allow for a more exhaustive sampling of the PES, the results can suffer from the systematic energy errors that were discussed in the previous section. Firstprinciples methods offer a description of the energetics that is unbiased by empirical parameters, but that may demand far more computational resources. Clever combinations of search techniques and stepwise increase of accuracy can be a way out that, however, requires experience. In the next section, we will review some of these combination methods.

How can theory predict structure and dynamics?
As presented in the last sections, several benchmark works have shown that force fields may not be accurate enough to predict quantitative energy differences between peptide conformations in the gas phase. However, as also mentioned in the previous section, the high dimensionality of the potential-energy surface renders the direct exploration with first-principles potentials an elusive task. Therefore, theoretical studies that aim to explore the PES of larger polypeptides (and biomolecules in general) with first principles methods tend to follow an overall similar work flow [48,69,72,98,[121][122][123][124][125].
The general aim is to balance a broad sampling of conformers and an accurate description of the energetics with the available computer power. We exemplify this work flow in general below, illustrating it by the technique followed in [98], which we believe to be among the largest current computational efforts to study the conformational space of alanine based polypeptides from first principles. The work flow is also schematically represented in figure 9. The first step involves a thorough enumeration of different conformers using a force field. These conformers are commonly local minima in the force field, found by different sampling techniques, like basin hopping, replica exchange, genetic algorithms, or any other sampling method. The idea is to perform a global and thorough exploration of structure space. For example in [98] replica-exchange molecular dynamics (REMD) simulations were performed with the OPLS-aa force field [91,126] with 16 replicas for a total of 500 ns per replica. From these simulations conformations at each 2 ps were considered to generate an overall set of conformations. The less reliable the PES is at this step, the more conformers will have to be considered in the second step.
The second step is choosing which conformers from the force field sampling will be considered for the treatment with higher level methods (e.g. density-functional theory or other quantum chemistry methods). The conformers can be ranked by energy from lowest to highest. As described above, there can be large possible errors related to the force fields. The discrepancy between empirical and first principles descriptions is highlighted, for example, in figure 7(A). Many conformers (hundreds to thousands, depending on the system's characteristics) should be considered, otherwise low-energy conformers may be completely missed. Alternatively, conformers can be sorted by structural criteria in order to generate a pool of candidate structures that is as diverse as possible for investigation. Examples are clustering algorithms based on the rootmean-square deviation (RMSD) of Cartesian coordinates (e.g. in [98]) or sorting of structures according to hydrogen-bonding patterns (e.g. in [42,101]). Other descriptors for structural similarity can, for example, be found by using machine learning methods similar to the ones presented in [127][128][129]. The chosen conformers are typically fully optimized with higherlevel methods. Especially the local geometry optimization of force field minima with first-principles methods can involve large conformational changes that may lead to new local minima, which are not present in the force field. In [48,69,72,98,122,123] we could highlight the importance of considering a large pool of conformers: Considering only a couple of tens of conformers would have led to missing many of the relevant structures discussed in these papers. The discrepancy in relative energies from FF and DFT illustrated in figure 7(A) also raises the question if all relevant local minima can be located from simply re-relaxing the force-field conformers. As a means of ameliorating the situation, it is possible to introduce a third step a local first-principles sampling. In [98], for example, × 16 20 ps ab initio REMD simulations were performed and the most stable conformer (C2) of the study was only found in this refinement step.
After that step, one can continue increasing the accuracy for a subset of the conformers from the previous step. The conformers can again be clustered and a new smaller set can be chosen according to the same criteria as in the first step or others. The accuracy can be increased either by increasing numerical settings of the calculations (basis sets, grids, etc) or by going to even higher level theoretical methods. In [98] both were done, by going to a higher numerical accuracy as well as using computationally more expensive (and often more accurate) hybrid DFT functionals, and many-body van der Waals dispersion corrections [88]. Other works have also used MP2 and CCSD(T) methods for smaller systems in this step [70].
In order to exemplify the range of computational costs of different methods, we present in figure 10 timings that were measured for a comparably small system, namely phenylalanine with a Ca 2+ cation. Please note that the accuracy level of the DFT (really_tight settings mean a very large basis and very fine integration grid) and the wavefunction calculations (with 3-4 extrapolation to the complete basis-set limit) are chosen rather high compared to what one would perform as standard calculation. The specific timings for each method can vary considerably when using different (smaller or larger) basis sets, when using different codes, or when treating larger and denser systems. The nominal scaling with system size N is for DFT N 3 , for MP2 N 5 , and for CCSD(T) N 7 . In all cases however, developments are ongoing to reduce the respective scaling by the use of smart algorithms [75,78,131]. Nevertheless, the timings presented in figure 10 are good guidelines for what to expect in computational cost when increasing accuracy.  [114]. DFT calculations in the generalized gradient approximation (PBE and BLYP) and with hybrid functionals (PBE0 and B3LYP) were done with FHI-aims [79] (including pairwise Tkatchenko-Scheffler van der Waals correction and really_tight computational settings). Wavefunction-based calculations (MP2 and DLPNO-CCSD(T) [75]) were performed with the Orca code [130] using Ahlrich's basis sets for a 3-4 extrapolation to the completebasis-set limit. The timings for the DFT calculations include force evaluations. The timings for the wavefunction calculations include both steps, the triple-and quadruple-ζ calculations. If the calculations were running in parallel (DFT and wavefunction), the real timings were multiplied with the number of cores. Please note, the numbers are meant to give a rough qualitative idea about the range of timings that can be expected with different methods. Different codes, settings, systems, and computer infrastructures will result in quantitatively different timings. Having finished with a smaller subset of the most-likely structure candidates, it is desirable to connect to more physical quantities than a simple scan of the potential energy surface. Free energies and thermodynamic properties at realistic/ experimental conditions can be explored at this step, either by performing anharmonic free energy evaluations with a method of choice (steered dynamics, metadynamics, umbrella sampling, replica exchange, etc) or at least considering these contributions in the harmonic approximation. If the system is too large, again it becomes unfeasible to calculate more accurate anharmonic quantities with a higher-level electronic-structure method, such that the harmonic approximation remains as the last resort. Its predictive power, though, has to be critically assessed for these soft and flexible systems.
With the low (free) energy conformers at hand, the connection to experiment can be established by computing experimentally accessible observables. In the present work we focus especially on collision cross sections that are experimentally derived from IM-MS (see section 1.2.1) and on vibrational spectra (see section 1.2.2). Other possible quantities of interest are electronic spectra, neutron scattering data, or any other experimental technique that is the most applicable to the environment where the biomolecule is measured in experiment.
Another important application of first-principles based conformational searches are studies that compare properties across chemical space. An example is the search for essentially all conformers of 20 proteinogenic amino acids alone and interacting with either of the cations Ca 2+ , Ba 2+ , Sr 2+ , Cd 2+ , Pb 2+ , and Hg 2+ [124]. As a result, one obtains comparable data for sets of compounds and/or complexes, generated on equal footing with respect to the search technique and the employed energy function. Based on such grounds, physical observables can be computed and compared across chemical compound space. The workflow employed by Ropo and coworkers [124] starts from a force field based structure search (Tinker scan [114] with the OPLS-AA force field [91]) and the relaxation with DFT-PBE + vdW. Again, it is necessary to refine the search results with a local first-principles search step. The bias from the initial treatment with empirical potentials can only be compensated by ab initio REMD simulations. The multi-step search procedure yielded an essentially unbiased first-principles data set of more than 45,000 stationary points on the PESs of the different molecular systems. The data can be used as a starting point for, e.g. the parameterization of empirical potentials, comparisons of properties like cation binding strength across chemical space, or as input for spectra calculations. The data is available from the website http://aminoaciddb.rz-berlin.mpg.de and from the NoMaD repository [132].

Theory-experiment comparison-computation of experimentally accessible observables
A major challenge when performing simulations is to match the experimental conditions in a simulation setup. An effort on both ends is needed. Experimental conditions should be well controlled and the data recorded precise and sharp and the system size and character that is considered in the simulation should be as realistic as computationally feasible. The gas phase is an excellent environment in this respect, where it is possible to simulate physical observables on a very similar footing with experiments.
In the next section we focus on the calculation of collision cross sections and vibrational spectra. In addition, there are several optical spectroscopy techniques that can probe also electronic excitations and dynamics of excited states in the gas phase, connected to UV and visible probes. For example, in the UV-IR pump-probe experiments mentioned above, the UV laser induces electronic excitations that can be used to select different conformers. Reviews and perspectives of such optical spectroscopies in the gas phase, applied to peptides and other biomolecules can be found in [8,9,7,133]. Antoine and Dugourd report the possibility of recording electron photo-detachment following electronic excitation in negatively charged peptides to obtain gas-phase optical spectra for large systems (even proteins), since this process does not suffer from limitations brought by energy redistribution into vibrational modes with system size and is less congested than a vibrational spectrum for large systems [133]. Theoretical modelling of electronic excited states and the resulting processes and dynamics is a major challenge, since it requires the use of time-dependent or explicitly correlated electronic structure techniques [134][135][136] that can treat excited states. These are very computationally expensive if compared to ground state techniques and have many further limitations included in the approximations, such that their application to large biomolecular systems is still limited, but growing fast.

Collision cross sections
From the Cartesian coordinates of conformers that result from a structure search for a particular molecular ion, it is possible to compute CCS values. The underlying collisions of the ion with the buffer-gas atoms (e.g. He) or molecules (e.g. N 2 ) can be modeled including different levels of detail. We will review here the three most-commonly used methods, the projection approximation [137], the exact hard-sphere scattering [138], and the trajectory method [139].
The projection approximation, or in short PA [137], takes the shape of the molecule into account, modelling the interaction between ion and buffer-gas particles by means of Lennard-Jones and charge-dipole interactions. The averaged collision cross section in the PA (CCS PA ) is calculated by using the collision parameters θ, φ, and γ as well as the minimal impact parameter b min as follows: In practice, b min is tabulated as atom-wise impact parameters, and in a simplified view they are stored as up-scaled atomic radii. The CCS value for a given molecular conformation is computed numerically by: (i) projecting the atoms of the molecule onto a randomly chosen plane, (ii) drawing the collision radii around positions of the nuclei, and (iii) repeatedly selecting random points within an area A enclosing the projected molecule. Out of step (iii), a CCS value for a planar orientation N is computed following the formula ( ) where h is the number of hits within the projected outline of the molecule and t is the number of overall tries. Steps (i) to (iii) are repeated for different planes and an average CCS value out of CCS N values is computed until convergence to a given threshold is reached. PA is shown to work well especially for largely convex molecules.
PA neglects scattering events as well as multiple collisions between buffer-gas particles and the ion. However, such effects are especially pronounced for concave molecular surfaces where certain surface areas can be shielded by parts of the molecule, while in others multiple collisions may occur. The projection-superposition approximation (PSA) aims to compensate for this with a shape factor that accounts for the concavity of a molecule [140]. Alternatively, scattering and multiple-collision effects can be considered by regarding ion and buffer-gas particles as hard-spheres. The exact hardsphere scattering (EHSS) approach [138] explicitly follows the trajectory of a He atom that is shot at the molecule or cluster through all possible collisions until it leaves the molecule/cluster for good. Here, the scattering angle χ (the angle between the trajectories before and after a collision event between the molecular ion and a buffer-gas particle) is computed as a function of the collision parameters θ, φ, and γ and the impact parameter b for multiple collision geometries and thus an average CCS EHSS can be obtained: The trajectory method (TM) models one extra bit of the physics defining the drift of an ion through a buffer gas, namely long-range interactions between the drifting ion and the buffer gas. The importance of this contribution depends on the polarizability of the buffer gas, which is for example stronger in N 2 than in He, and on the charge distribution in the (molecular) ion. The charge(s) of the drifting ion induces dipoles in the buffer gas atoms altering its drift velocity without 'physical contact' [139].
In addition to the symbols explained above, the reduced mass μ and the relative velocity g are being used. The interaction between the ion and the buffer-gas particles is modeled by two terms: a Lennard-Jones 12-6 potential and a term that accounts for the interaction between the charge (distribution) of the ion and the charge-induced dipole of the buffer-gas particle. This treatment can consider differences in polarizability between buffer gases, for example between He and N 2 .
We note, though, that in principle all methods are designed to work with He as the buffer gas. When comparing to measurements made with, for example, N 2 , parameters going into the calculations have to be adapted. An overview about specific contributions to the collision cross section can be found in a paper by Wyttenbach et al [141] where for a wide range of systems experimental and PSA-simulated CCS are compared. There are several programs described in the literature, which can be more or less straightforward to obtain. We list here only some of the more popular ones: MOBCAL is developed in the group of Jarrold and incorporates PA, EHSS, and TM. It can be downloaded at www.indiana.edu/nano/software.html. sigma is developed in the group of Bowers and it computes CCSs according to the PA and EHSS method. It is available under this URL: bowers.chem.ucsb.edu/ theory_analysis/cross-sections/sigma.shtml. FHIsigma is a spin-off of sigma by Wesemann and von Helden and comes with a graphical user interface. The program is available at: sigma.fhi-berlin.mpg.de. IMPACT is intended for structural proteomics applications and claims to compute extremely fast PA-CCSs [142]. The software is available at: benesch.chem.ox.ac. uk/resources.html.
The choice of method, for example between PA, PSA, EHSS, and TM, can be critical for the predictive power of the CCS calculation. Some examples are collected in table 2. Depending on the nature of the ionic cluster/complex or molecular ion under investigation, the alternative methods can agree, like in the case of two peptides from reference [123], where PA amd TM give virtually the same results. But there are also examples where the methods give qualitatively different results. Different protonation states (protomers) of benzocaine exist that result in either the distribution of the positive charge over the molecule or in its localization at a protonated amino function [62]. In the experiment, both forms can be separated with a polarizable buffer gas (N 2 ). In simulations, the CCSs computed with the PA are indistinguishable, while  130 144 TM predicts distinct values for the protomers and allows an interpretation of the experiment. The interpretation of an experimental arrival-time distribution or of the derived CCS distribution is not unambiguous. The theoretical CCS of a single conformer represents a projection of the conformational degrees of freedom onto a single coordinate. As a consequence similar CCSs may still result from different structures. Also, in the experimental CCS, even a single sharp peak represents not only a projection of spatial coordinates, but also of the dynamics of the molecular or cluster ion over the drift time. Consequently, measuring a single sharp peak can mean that either (i) there is only a single conformational family present in the ion cloud, (ii) there are multiple (more than one) conformational families present in the ion cloud that have the same CCS, or even (iii) the time average over multiple interconverting conformers for a single molecule is converged during the drift time and the measured CCS basically represents a converged average over the CCSs of the different structures. An example was shown in [123], where IMS data of a β peptide is interpreted to represent the interconversion between related helix types. In a sense, ionmobility experiments, especially in conjunction with molecular simulations, can be used to deduce not only the structure of molecules, but also their dynamics.

Vibrational spectra
As mentioned in section 1.2.2, several experiments probe the vibrational spectra of biomolecules in the gas phase. These spectra contain more detailed structural information than CCS experiments and simulations. However, especially for larger and more anharmonic systems, a comparison to theoretical simulations is necessary in order to interpret the experimental signal. Good reviews on several types of theoretical spectroscopy methods that can be used in connection to first principles potential-energy surfaces for biomolecules can be found, e.g. in [143,144].
Theoretically, the 'zeroth-order' way to model the vibrational properties of any system is the harmonic (or double harmonic) approximation. In this approximation a Taylor expansion of the Born-Oppenheimer potential with respect to displacements of nuclear coordinates is truncated on the second (quadratic) order and harmonic frequencies of vibrations are calculated for the problem of coupled harmonic oscillators with force constants corresponding to the second derivative of the potential [145]. From Fermi's golden rule, it is known that the IR intensities are proportional to the square of the matrix elements of dipole-allowed transitions. One can thus Taylor expand the dipole moment with respect to nuclear displacements, solve the quantum mechanical Hamiltonian in the harmonic approximation, and find the allowed transitions. By truncating the expansion of the dipole moment at first order, one arrives at the expressions for the so called 'double harmonic' approximation. Not only this approximation does not contain any anharmonicities, it also does not allow any other transition beyond the fundamental ones. For Raman spectra, similar expressions can be calculated for the harmonic approximation relying on the estimation of matrix elements of allowed transitions from the polarizability tensor [145]. This type of approximation is frequently used for a first comparison of structural properties in connection with scaling factors that compensate for the complete lack of anharmonicities (both of the classical PES and connected to the quantum nature of the nuclei).
A fundamental problem with the harmonic approximation for the study of biomolecules is that these molecules can have very anharmonic potential-energy surfaces. A well known way to calculate IR transitions including anharmonicities is to relate Fermi's Golden Rule to time correlation functionsa derivation found in many textbooks (e.g. [146]). One finds that the IR absorption spectrum can be written as the product of the frequency-dependent refractive index ( ) ω n and the Beer-Lambert absorption coefficient ( ) α ω as where β is the inverse temperature, V the volume, ε 0 the dielectric permittivity of vacuum, c the speed of light and H H The Kubo transformed correlation has the same symmetries as a classical correlation function [147] and arises naturally in several approximate quantum dynamics schemes [147,148]. The Fourier transform of the Kubo transformed time correlation ˜( ) ω µµ I and the one of the canonical time correlation ( ) ω µµ I are related by Thus, the commonly coined 'quantum correction factor' [37,149] arises naturally from the relationship of these two correlations. The expression that one usually calculates for IR absorption is 3 d e 0 t 2 0 2 0 i (9) where the brackets denote a time average, and ( ) µ t is generated by classical or approximate quantum dynamics for the nuclei. Similar expressions for Raman spectra can be found with respect to the autocorrelation functions of the polarizability tensor [150]. When classical dynamics (e.g. Born-Oppenheimer ab initio molecular dynamics) is employed to approximate these autocorrelation functions only the anharmonicities of the underlying (classical) potential-energy surface are taken into account. The remaining discrepancies when comparing to benchmark experiments can be due to the lack of considering the quantum nature of the nuclei (which introduces what is sometimes referred to as quantum anharmonicities), the use of an approximate potential-energy surface, or sampling of the wrong (ensemble of) conformers-all of which can cause the spectra to change considerably, as discussed in more detail below.
Other techniques to obtain anharmonic vibrational spectra are, e.g. vibrational self consistent field (VSCF) and second order vibrational perturbation theory (VPT2). These methods and their applications to biomolecules have been reviewd by Roy and Gerber [151], and Barone and coworkers [152] recently. In both of them the quantum nuclear Hamiltonian is approximately solved either in a mean field approximation or a perturbation theory one, thus including quantum anharmonicities. However, the inclusion of temperature and explicit dynamics (where many conformations may be sampled) is not straightforward [153,154], and the methods are expensive to treat very large molecules. An impressive recent work from a computational point of view was the application of VSCF-PT2 with the B3LYP functional to the spectra of two conformers of Gramicidin S, comparing to cold gas-phase IR-UV double resonant spectra, obtaining satisfactory agreement [155].
Even though the evaluation of IR and other vibrational spectra from autocorrelation functions has been popular for decades especially for condensed phase systems and empirical potentials, Gaigeot and coworkers have pioneered its use in connection to first-principles (DFT) potential energy surfaces and applying it to isolated and solvated small polypeptides [24,[156][157][158]. It is remarkable how well the simulated spectra based on a linear absorption regime (see equation (9)) agree with those measured with the IRMPD technique. Great examples are spectra for Ala 2 H + , Ala 3 H + that were derived from ab initio molecular dynamics simulations employing the BLYP functional [159,160]. The authors observe that at room temperature the peptides interconvert between a few different structures and that these dynamics are important for the comparison with the IRMPD spectra. This type of studies serves also as an indirect probe of the dynamics. They also reported sensitivity to different conformations in the amide III regions for polyalanine peptides [24], and good structure selectivity and comparison to IR-UV IRMPD spectra in the far-infrared region for Ac-Phe-Gly-NH 2 and Ac-Phe-Ala-NH 2 [43]. This is very interesting, since vibrations in this lower wavenumber region are more classical in nature and can be more accurately represented by classical (ab initio) molecular dynamics, not requiring simulation techniques that incorporate quantum effects of the nuclei.
As an illustration of their work about the importance of anharmonicities in comparison to experiments, we highlight a larger peptide, Ala 7 H + , for which IRMPD spectra were measured by Vaden and coworkers [58]. In that study, Vaden and coworkers also performed extensive structural searches starting with a force field, then passing through a cascade of more accurate (standard) DFT functionals (until B3LYP), identifying conformational families, and finally performing single point calculations with MP2 for the energetically most favored conformers and calculating harmonic vibrations at the B3LYP level. The most likely globular structures, and the comparison of their harmonic IR spectra at the B3LYP level with the measured room temperature IRMPD spectrum is shown in figures 11(A) and (B) (reproduced from [58]). Gaigeot and coworkers then took these structures and calculated IR spectra from ab initio molecular dynamics at the BLYP and level and T = 350 K in [161]. The comparison between this anharmonic spectrum and the same experiment is shown in figure 11(C), reproduced from [161]. It is immediately apparent that even if the agreement is not perfect, anharmonicities in this NH and CH stretch regions are necessary to reproduce the experimentally observed intensities below ≈3100 cm −1 . The authors conclude that these structures adopt more globular conformations with the NH + 3 group self solvated within CO groups of the molecule. As will be shown below, the exact placement of the position of the simulated peaks with respect to experiment in the anharmonic case may be a fortuitous cancellation of errors, since the inclusion of van der Waals interactions can change considerably the dynamics of the molecule and inclusion of nuclear quantum effects cause large red shifts in this spectral region.
It is worth noting that intensities are typically not to be trusted when comparing theory and IRMPD experiments due to the strong non-linear effects expected in the multiplephoton abosption process. Attempts have been made by Calvo and coworkers to model specifically IRMPD [162] with all relevant dynamical effects, which can yield good results for small molecules albeit relying on some empirical modelling. Comparisons to IRPD would be interesting, since it is less prone to to non-linearity in the lineshape and peak positions. However, the tag which is often used can also disturb the spectrum (as observed in Kr tagged gold clusters [163] and Ar tagged protonated water clusters [164]), and one is usually restrained to low temperatures due to the low binding energy of the tag. In most of the work present in the literature so far, it must be said, though, that the modelling of the IR spectra within linear response theory (including anharmonicity) has been able to provide important interpretations to vibrational signatures obtained from IRPD or IRMPD.
Blum and coworkers (including the authors of this review) have focused on the study of larger polypeptides, especially in the fundamental characterization of interactions governing structure formation and dynamics. For the benchmark series of helix-forming alanine based polypeptides Ac-Ala n -LysH + the authors have studied many different aspects related to secondary structure formation using DFT and ab initio molecular dynamics. Regarding the smaller members of this polypeptide series, n = 4-8, the authors have reported that beyond the formation of stable H-bond chains with increasing n, an important contribution to helix stabilization comes from the vibrational entropy of very soft modes that are present in the helices but not in more compact structures [48]. Helices are predicted to be the most stable isolated structures in the gas phase starting at n = 8, in agreement with experimental evidence from IMMS measurements [165].
For a more direct structural characterization, Rossi and coworkers have also calculated the (classical-nuclei) anharmonic IR spectra of n = 5, 10, and 15, and compared to experimental IRMPD measurements at room temperature [42]. In general, the structural characterization of gas-phase peptides based on vibrational spectra requires an objective metric of agreement between simulation and experiment. To that end, Rossi and coworkers have employed the Pendry reliability factor R P [166] in an implementation that was distributed with [167]. Since, as it was already discussed, the IRMPD spectra could have peak intensities that are distorted due to the absorption of many photons, a simple overall least squares fit for the intensities would not suffice for a comparison between theory and experiment. The Pendry R-factor, originally used in low energy electron diffraction experiments [166], addresses the need to match mainly peak positions, rather than the intensities. Given two continuous curves with intensities ( ) ω I exp and ( ) ω I th , this R-factor compares the renormalized logarithmic derivatives of the intensities, given by: , and W approximately the half width of peaks in the spectra. The advantage is that the L functions have a sign inversion exactly where the maximum of the peak is, and if peaks are far enough apart, relative intensities are completely ignored, while if they are close together, ( ) ω L is moderately sensitive. However, the L functions would be too sensitive to zeroes in the intensity, since the logarithmic derivatives would have singularities in this case. The Y function is a simple transformation of L, which avoids such singularities, by giving similar weights to maxima and zeroes in the intensities. The Pendry R-factor (R P ) is then defined as: complete anti-correlation. R P is always defined with respect to a rigid shift ∆ between the two curves considered. A python script for the calculation of this and other reliability factors is available from Github 3 . We reproduce in figure 12(A) the theoretical IR spectra obtained with DFT-PBE adding pairwise van der Waals corrections (PBE + vdW [87]) for helical structures of Ac-Ala 10 -LysH + and Ac-Ala 15 -LysH + compared to experiment. For n = 15 the comparison of the harmonic spectra of a helix containing mostly 3 10 helical H-bonds, another containing αhelical H-bonds, and the anharmonic spectra obtained from equation (9) from PBE + vdW molecular dynamics shows (quantitatively) how the agreement to experiment increases in the anharmonic case. A Pendry reliability factor R P of 0.32, obtained with respect to a rigid shift ∆ of the whole spectrum by 26 cm −1 is an indication that the structure of this molecule is indeed the α-helical one shown in figure 12(B), where the lysine residue is completely self-solvated in the backbone carbonyl groups. Also in panel B, we show the H-bond dynamics of the molecule in the trajectory generating that spectrum, highlighting 3 10 -and α-helical H bonds. Although fluctuations are observed, the molecule maintains a mostly α-helical structure throughout. For Ac-Ala 10 -LysH + we also find a good agreement between the theoretical (anharmonic) and experimental IR spectrum for the α-helix. Examining the dynamics of this molecule when switching off the vdW interactions, we can show in panel (C) that the structure becomes more extended, stabilizing a 3 10 helical motif, and worsening the agreement with experiment (shown only in [83]). This observation is also in line with a study of interplay between H-bond cooperativity and vdW contributions in polyalanine helices: H-bonds get systematically strengthened by vdW interactions, and the high temperature stability of Ac-Ala 15 -LysH + is increased, while at lower temperatures the lack of vdW interactions also stabilize a more extended 3 10 -helical structure [29].
The effect of the location of the charge and the peptide sequence was also studied for even larger alanine-based polypeptides, namely Ac-Ala 19 -LysH + and Ac-Lys-Ala 19 -H + [98]. Ac-Ala 19 -LysH + was seen to form helices, consistent with measured ion mobility cross sections. Ac-Lys-Ala 19 -H + presented cross sections consistent with more compact, globular conformers (as expected due to the unfavorable interaction of the charge with the possible helix macrodipole), but its IR spectrum was very similar to helical structures. Theoretical calculations could solve this puzzle: even if of a compact/ globular nature, energetically favored conformers of Ac-Lys-Ala 19 -H + still retained a large helical content.  [42], copyright 2010 American Chemical Society. Comparison between experimental (gray lines) and theoretical (red lines) (PBE + vdW functional) vibrational spectra, all normalized to 1 for the highest peak. ((a), (b)) Ac-Ala 15 -LysH + : calculated spectra based on the harmonic approximation, for a 3 10 -helical (a) and an α-helical (b) local minimum of the potential-energy surface. (c) Ac-Ala15-LysH + : calculated spectrum from AIMD (including anharmonic effects), starting from an α-helix and α-helical in character throughout the simulation. (d) Same as panel (c), for Ac-Ala10-LysH + . Pendry R-factors and rigid shifts ∆ between measured and calculated spectra are included in each graph (calculated spectra are shifted by ∆ for visual comparison). (B): Illustration of the hydrogen bond network evolution of Ac-Ala 15 -LysH + during a PBE + vdW microcanonical simulation. On the right side of the plot, the ratios of α-helical and 3 10 -helical bonds observed during the simulation for each oxygen, labeled from N to C-terminus is shown. (C): Illustration of the hydrogen bond network evolution of Ac-Ala 10 -LysH + during a PBE + vdW TS and a PBE microcanonical simulation (labels are the same as in (B)).
Here we take the opportunity to address a commonly adopted approximation in these simulations, namely that of performing dynamics considering classical nuclei. Hydrogen atoms, ubiquitous in these molecules, are quite quantum entities even at temperatures as high as room temperature. These effects are known to affect the structure and dynamics of condensed phase systems (especially water) [169,170] and hydrogen bonds [171,172]. A simulation technique that has been progressively gaining more attention to include nuclear quantum effects (NQE) beyond the harmonic approximation at least in non time-dependent observables is path integral molecular dynamics (PIMD). This technique exploits an exact isomorphism between the statistical properties of a quantum system and that of a classical ring polymer, where each bead is a repetition of the original system, connected to each other by harmonic springs. A detailed explanation of this technique is beyond the scope of this manuscript, but good descriptions can be found in [173,174]. This technique is especially suited to massively parallel architectures, since the replicas of the system can be run in parallel given that there are enough CPUs available. For time-dependent observables, e.g. time correlation functions, the situation is much trickier, due to the difficulties of modelling true quantum dynamics. Also within path integral molecular dynamics there are a few approximations to time correlation functions that have been proposed, namely centroid molecular dynamics [175], ring polymer molecular dynamics [147], and thermostatted ring polymer molecular dynamics (TRPMD) [168]. Albeit approximate these methods can give reliable results especially for larger systems and/ or extended systems [176], and are the only methods so far that can be applied on a more routine basis to realistic multidimensional systems. At room temperature, even for the most efficient of these methods, one must use several tens of replicas of the system, making these simulations still substantially more costly than their classical-nuclei counterparts.
We used TRPMD to calculate the IR spectrum of Ac-Ala 10 LysH + , shown in figure 13. We used the FHI-aims program package [79] in connection to the i-PI program [177] in order to perform the dynamics. We simulated 20 ps of TRPMD dynamics, starting from the thermalized α-helical structure, a time step of 0.5 fs for the integration, 16 replicas of the system (beads), and light settings in FHI-aims for the PBE + vdW force evaluation. In figure 13 we compare the IR spectrum thus obtained with the AIMD-PBE + vdW spectrum (tight settings, without any shifts applied) and the IRMPD room temperature experimental spectrum already published in [42]. We observe that while for very low frequency modes the classical and quantum nuclei simulations agree pretty well, above 1000 cm −1 most of the modes are softened (red-shifted) in the quantum case, something that becomes progressively more pronounced for all modes above 2500 cm −1 . This observation is in line with the fact that higher frequency modes are more quantum in nature. Even if TRPMD is known to over-broaden the line-shapes [168], the red-shifts should be reliable, modulo the limitations of the DFT functional itself (lower barriers, softer H-bonds). As also shown in figure 13, this effect goes in the opposite direction of the experimental data, which is already slightly blue shifted from the classical nuclei simulation. This is an indication that the PBE + vdW functional itself is here at fault. In these systems, when calculating harmonic frequencies of vibration with, e.g. the PBE0 + vdW functional, they are all blue shifted with respect to PBE + vdW. The over-softening of the modes is one more manifestation of the self-interaction problem. It seems, thus, that in order to get better agreement of peak positions with experiment in a fully anharmonic picture, one should perform a simulation with van der Waals corrected hybrid functionals (which are, unfortunately, considerably more expensive than standard generalized gradient ones) and include nuclear quantum effects.
So far, only studies of polypeptides in isolation have been discussed. As mentioned in the introduction, the gas phase is ideal not only due to its 'clean room' conditions, but also to the fact that it is straightforward to control the gradual inclusion of 'external agents', as for example ions, metal cations and small metallic clusters, and solvent molecules, for example water. We dedicate a following section to the discussion of microsolvation. Here we briefly review the interaction with ions. Since the early 2000s, IMS experiments have pointed to the role of cations stabilizing helical structures in polyalanine peptides [178], and more recently evidence for helix stabilization has been established based on the measurement of gas phase IRMPD spectra in the Amide A/B range of sodiated polyalanine peptides of various sizes [179]. Through Figure 13. Infrared absorption spectra of Ac-Ala 10 -LysH + calculated with ab initio molecular dynamics (AIMD-PBE + vdW) at 300 K, with ab initio thermostatted ring polymer molecular dynamics [168] (TRPMD-PBE + vdW) at 300 K, and the experimental IRMPD room-temperature spectrum from [42]. measurement of IR spectra, also the role of metal cations to stabilize the zwitterionic form of some amino acids in the gas phase has been studied [18].
We had a detailed look at the effect of small cations (Li + and Na + ) on the structure of prototypical turn-forming peptides Ac-Ala-Ala-Pro-Ala-NMe and Ac-Ala-Asp-Pro-Ala-NMe [72]. The different systems were investigated by means of theoretical and experimental vibrational spectroscopy. First of all it was evident that in the gas phase, the interaction of the peptide carbonyl groups with the strong positive charge of the cations enforces conformations on the backbone that would not be possible for the peptide alone. Furthermore, the preferred conformations differ depending on the cation. The comparison between experimental and simulated spectra revealed that multiple conformers co-exist and probably interconvert in the gas phase. Consequently, the computed spectra for individual conformers have to be mixed in order to match the spectra recorded in the experiments, but a good agreement is reached. One can raise the question of how relevant are these results in solution. Hints come from short ab initio MD simulations that were performed on energetically stable conformations of peptide-cation systems with a few dozens of waters. Within the time scales accessible, the interactions between the cation and the peptide backbone remained preferred over direct solvation of the cation by the water molecules.

Towards first-principles free energies
Even if the PES is really the basis for all thermodynamic quantities, the sole knowledge of the PES does not allow a direct connection with real-world physics. For equilibrium properties, what is really needed is a good estimate of the partition function from statistical mechanics and all thermodynamic quantities that can be derived from it, most importantly, free energies.
Unfortunately, estimating free energy values for biomolecules is not an easy task. The harmonic approximation for the free energy (discussed in many textbooks, e.g. [146]), is the most common approximation. The reason is that it is the only one feasible with more costly (e.g. first principles) potentials and for larger molecular sizes. Due to the anharmonic nature of these molecules, it is not guaranteed though that this approximation will be plausible even at relatively low temperatures.
In order to get vibrational contributions to the free energy it is possible to use, for example, the VSCF and VPT2 methods, already discussed in the last section. For small molecules, Basire and coworkers have developed a technique which relies on the estimation of microcanonical densities of states and partition functions, that gives access to temperature effects and relative populations connected to a second order vibrational perturbation theory [153,154] approach. However for higher dimensional and flexible systems this technique becomes very challenging. Quasi-harmonic analysis, in which dynamics can be decomposed into principal components and entropies calculated from this decomposition can be used as an approximation, provided there is enough sampling, but again, they rely on a quasi-harmonic picture that is likely to fail in many situations.
We have shown in the previous sections that it is possible to extract, for example, vibrational spectra from first-principles molecular dynamics (MD) simulations. However, the estimation of (relative) free energies requires a sampling of the conformational space that can currently only be realized for rather small molecular systems with few well defined degrees of freedom [180]. For larger systems, with hundreds of atoms, it is a much larger (and close to impossible) effort to gather the required statistical sampling of conformational space in order to estimate these free energies. It is worth noting though that with smart algorithms and optimized codes these quantities are becoming accessible [181]. There are two main points do be addressed [182]: (i) The simulation has to be long enough to ensure that the time-average of the simulations resembles the ensemble average of the system and (ii) free energies from MD simulations require the definition of collective variables, that are not trivial to define. In the field of biomolecular simulations a variety of MD-based simulation techniques are being used to solve point (i), we only summarize some frequently used types here: • A straightforward approach is the computation of long (μs to ms time-scale) trajectories. This idea brought to the extreme is the construction of dedicated hardware like the molecular-dynamics supercomputer Anton [183] that provides access to the kinetics and thermodynamics of, for example, protein folding [184,185]. • Alternatively, many short MD trajectories can be combined by using Markov-chain models [186][187][188][189][190]. This approach is striking because it is inherently parallel and allows the use of distributed computational resources [191,192]. • The necessity of either very long or large numbers of independent shorter MD simulations comes from the nature of the transitions between the different meta-stable states on the free-energy landscape of a given system. These transition are often rare events and in order to obtain converged values, these events have to be observed sufficiently often. In order to enhance sampling and therewith shorten the required simulation times, multiple methods are available: replica-exchange MD, umbrella sampling [193][194][195][196], metadynamics [197,198], etc.
One or several collective variables are needed as degrees of freedom (DOF) that define the free-energy surface. In case of, e.g. umbrella sampling or meta-dynamics, these collective variables have to be known a priori, while they can be defined a posteriori in non-biased MD simulations. Overall, it would be interesting to pursue methods that can be even more efficient in sampling, or methods that can reach convergence with a small amount of statistics.

Challenges towards solvation
A biomolecule immersed in a solvent presents three different qualitative types of interactions that need to be described. These are the intramolecular interactions, the biomolecule-solvent interactions, and the intra-and inter-molecular interactions of the solvent. The interactions between the biomolecule and the solvent and the influence of the collective interactions of the solvent on the biomolecule are the ones that will ultimately define the solvated state. It is important to note that the solvent is often not a simple homogeneous environment, but includes ions and other inhomogeneities that also need to be accurately captured. Studying biomolecules directly in solution has the drawback that the resulting measurements are quite congested by the amount of different interactions that play a role. It is thus desirable to build up the solvated state step by step, so that theory and experiment can work in synergy towards a consistent and reliable description of these molecules in solution.
Experimentally, regarding the solute-solvent and solventsolvent interactions, perhaps the most detailed characterizations of physical properties are connected to mass spectrometry (MS), where it is possible perform thermochemical equilibrium measurements [199] and, if connected to spectroscopy techniques, to measure also more detailed structural information. In these experiments, solvation with water molecules or ions (or both) can be investigated in a stepwise manner, such that the physical properties of the very first stages of solvation can be identified. For example, it is possible to measure equilibrium constants, binding enthalpies, and vibrational spectra that can be directly connected to calculations.
Using only IM-MS, thermochemical equilibrium properties and overall geometric information have been gathered for a range of biomolecules and the first stages of their interaction with the solvent (microsolvation) [3,178,[200][201][202][203]. A review in this area can be found in [3]. More recently, also the measurement of vibrational spectra of mass selected species in the gas phase were able to probe more detailed conformational properties of clusters of solvent molecules [164,[204][205][206][207]208] or the first stages solvation especially of peptides [209], and sugars [6,210,211]. We highlight here two recent experimental works dealing with peptides to illustrate the state of the field. Impressive results have been reported by Nagornova and coworkers [209] on the microsolvation of Gramicidin S cooled to 12 K. By performing conformer selective double resonance IR-UV spectroscopy they were able to connect IR features to structural changes caused by the absorption of 1-15 water molecules. Another work by Warnke and coworkers [200] instead used ion mobility-mass spectrometry to show how crown-ethers can micro-solvate charged Lys side chains in cytochrome-C and other proteins. The authors were able to decompose the effects responsible for the unfolding of highly-charged states in the gas phase into Coulomb repulsion and side chain to backbone interactions that interrupt backbone hydrogen bonding.
Experiments nowadays are able to provide more and more accurate data on thermochemical and structural properties of (micro)solvated biomolecules, but without the support of theoretical calculations, the understanding of the results is limited. It is not straightforward to obtain quantitative data for these systems from simulations, though. The difficulties are at least two fold: (i) One still has the high conformational freedom of the biomolecule itself, but now further complicated by the presence of ions and solvent which introduce an extra range of qualitatively different interactions to be modeled; (ii) It is known to be difficult to simulate even the solvent in isolation, with most quantum chemical methods failing to correctly describe overall structural properties like radial distribution functions, or diffusion coefficients [212][213][214][215][216][217][218], or the correct relative energies of hydrogen bonded structures [219,220].
The main challenge is to correctly and thoroughly explore the potential energy surface (PES) and the entropic contributions to the free energy-even more important when related to the solvent. These simulations must involve an accurate evaluation of the potential energy and span a long time scale (or a huge volume of phase space). Unfortunately nowadays one can have either one or the other: an accurate evaluation of points in the PES can be achieved by the highest-level quantum chemistry methods but these are too computationally expensive to allow a thorough sampling of the PES, while empirical potentials allow a thorough sampling of the PES but do not provide quantitative estimates. It is also important to note that only describing the electronic structure of these systems is not enough-especially in connection with the solvent, the inclusion of nuclear quantum effects beyond the harmonic approximation is necessary [170,[221][222][223][224][225].
Nevertheless some successes from theory have been achieved for the microsolvation of model peptides, for example the Ac-Ala n -LysH + series already mentioned in this review. The groups of Bowers [203] as well as of Jarrold [165] performed IM-MS experiments for the monohydrated structures of a few conformers (different sizes) of this peptide series. In these experiments, they had access to equilibrium constants of the monohydration reaction, derived from the ratio between the intensity of the peaks corresponding to the bare and the monohydrated structures. Based on previous observations that more globular/compact structures had a lower propensity to adsorb one water molecule than helical ones, they concluded that the shortest helical member of this series would happen at n = 8-without thorough theoretical support, it is difficult though to understand what is the atomistic mechanism for this difference in water adsorption propensity. In [122], Chutia and coworkers have performed extensive first principles conformational scans of n = 5 and n = 8 microsolvated by up to 5 water molecules. One conclusion is that the intramolecular hydrogen bonds of the self-solvated ammonium group, in both cases, are the most stable hydration sites. For one water molecule the most stable conformers are shown in figure 14, together with the calculated standard (Gibbs free) energy of formation ∆G 0 of the reaction. The agreement with experimental values is pretty good (also at other temperatures, shown in [122]). From the theoretical work, the authors concluded that the decrease in water adsorption propensity is not due to a radically different binding site, but instead only to modified internal free energy contributions (harmonic vibrational free energy) in the specific H 2 O adsorption site at the LysH + termination, in an example of how theory can help to gather a deeper understanding of experimental data. However, it is still a challenge for theory to be able to give even more reliable results for larger peptides surrounded by more solvent molecules. In this respect, theoretical advances as proposed by Gaigeot et al [226] that allow a separation of solute and solvent vibrational spectra in simulations are of great importance.
It is thus pressing to build a tighter relationship between the quantum and the empirical world. While for water there is an appreciable effort to build better and more accurate potentials based on quantum mechanical calculations [221,[227][228][229], for the solvent-biomolecule (or ion-biomolecule) interaction these efforts are much less pronounced. An improvement in this area can be achieved precisely by performing these theory-experiment benchmarks of the stepwise build-up of solvation, and modifying empirical potentials according to this data.

Conclusions
The aim of this review was to give an overview on the interplay of experiment and simulation regarding the structure and dynamics of biomolecules in the gas phase. Given the scientific fields of the authors, the focus was clearly on first-principles calculations on peptides towards the computation of physical observables like vibrational spectra and collision cross sections. For flexible molecular systems, for which biomolecules are a prime example, a thorough search of the accessible conformational space is crucial before any attempt to compare simulated properties with their experimental counterpart.
A typical work flow is outlined in the following (and in figure 9): (i) The exact chemical structure (connectivity of the atoms) of the molecular system has to be known. This includes knowledge about possible alternative protonation states (protomers). In cases where, for example, cations like H + or Na + are involved, their presence and location relative to the molecule has to be considered as well. (ii) An initial enumeration of structural candidates can be performed by the sampling of a computationally-cheap potential-energy surface (PES), for example of an empirical force field. (iii) As we have outlined in this review, the limited accuracy of force-field methods requires a refinement at the level of electronic-structure theory. This can be facilitated by using density-functional theory (DFT) methods or quantum-chemistry methods like Møller-Plesset perturbation theory (MP2). Higher-level methods, like coupled cluster, quantum Monte Carlo, or full configuration interaction, are computationally very demanding and thus normally limited to small systems and benchmark-type calculations. (iv) In order to remove a possible bias from the initial sampling of the force-field based PES, further exploration of the first-principles PES in the proximity of already located low-energy structures is advisable. This can be facilitated by, for example, (replica-exchange) ab initio molecular dynamics simulations. (v) Free-energy estimations in the harmonic approximation should be considered, not the least because they also offer a first glimpse at the vibrational spectrum of the molecular system. Further MD-based sampling can potentially be used to obtain more accurate thermodynamical observables (free energies, enthalpies, etc). However, the size of structure space and the computational cost of the required converged simulations again restrict such approaches to either rather small or rigid molecular systems. (vi) The comparison to experiment serves as (i) validation of the method (search strategy and energy function) and (ii) as a way to add structural resolution to the experiment. Both can be achieved by the computation of physical observables, e.g. collision cross sections, vibrational spectra, optical spectra, etc.
Each simulation represents an approximation to reality and inherently produces errors. The gas phase is a clean-room environment and gas-phase experiments can produce accurate and sharp data that represents a challenge to theory and simulation. We would dare to say that the higher signal-to-noise ratio that is present in condensed-phase experiments might actually cover some of the involved systematic errors in the theoretical description. This highlights the importance of the gas phase as an ideal environment for validating energy functions and simulation techniques.
An important point that we can conclude is that it is not sufficient to focus on a single or a few structures, given the complex dynamics observed in the gas phase (and even more Torr) for monohydration of Ac-Ala 5 -LysH + and Ac-Ala 8 -LysH + compared to literature data. Also shown, the most stable conformations of monohydrated Ac-Ala 5 -LysH + and Ac-Ala 8 -LysH + from theory (PBE + vdW). Values and structures from [122]. Ac-Ala5-LysH + H2O Ac-Ala8-LysH + H2O so in solution). Most of the larger sources of uncertainties in the theoretical treatment have to do with an insufficient or still inaccurate treatment of dynamics. If an accurate free-energy surface could be accessed and sampled, most of the remaining problems would be solved. This would allow, for example, the correct prediction of the conformational ensembles observed in ion-mobility measurements (CCS/ATD) or in vibrational spectroscopy. In addition, it would give access to reliable barriers and a natural inclusion of anharmonic effects in vibrational spectra. In order to reach this goal, we need to compute potential energies and forces including the correct physics, which then need to be sampled faster and for long time scales. We note that the correct physics may go even beyond just grasping the physics of the electronic structure but also the quantum nature of the nuclei, which can cause much stronger anharmonicities (as shown in this review) and change considerably effective barrier heights. Going even further, for these highly anharmonic and high-dimensional systems, in many situations the dynamics of nuclei and electrons are coupled. These non-adiabatic effects are truly difficult to treat from a theoretical point of view in these structures. The efficient exploration of conformational space for highdimensional flexible systems in an accurate manner thus poses one of the most pressing issues in this field. For it to be solved, either the accuracy of force fields must be improved, or the computational limitations of first-principles methods, when it comes to larger length scales and longer time scales, needs to be lifted. Possible routes that can be followed in methodological developments involve, for example, better parametrization of force fields based on the increasing number of first-principles data present in the literature, development of smarter free energy evaluation methods that can deal with fewer statistical sampling, and/or even better scaling of first-principles codes in massively parallel architectures. As these issues are already recognized by the community, several efforts in all fronts are paving the way to treat larger systems with state-of-the-art accuracy (e.g. [75,98,183,[230][231][232][233][234] and many others).
Nevertheless, as it has been shown in this review, both the time and length scale currently accessible to first-principles methods already allow an accurate treatment of systems with hundreds of atoms in simulations. On the experimental side, it is routinely possible to transfer large biomolecules, e.g. large proteins and even complexes, to the gas phase by electrospray ionization and to study them by mass spectrometry and ion mobility-mass spectrometry (IM-MS) [11]. However, with the size of the molecular systems, vibrational spectroscopy investigations get hindered by more and more congested spectra. A promising route that is currently being followed to circumvent this problem is to measure conformer selective spectra by either using (i) UV/IR double-resonance techniques and (ii) pre-selecting conformers by using IM-MS. A way to get sharper spectra is to measure them at low temperatures for example by using either cold-ion traps [63,64] or helium droplets [65,66]. Conformational selection and cold-ion spectroscopy can also be combined.
The investigation of biomolecules in the gas phase is a dynamically growing field and a constant challenge to experimentalists and theorists alike. The constant developments and improvements of experimental techniques trigger the use of more and more sophisticated simulations and vice versa. As such this line of research pushes our understanding of the very basics of biomolecular structure formation and dynamics. For the development of simulation methods, the precise data that can be obtained from gas-phase experiments is ideal to develop and test new methodologies that will also have an impact in condensed-phase simulation.