A Structure-free Method for Quantifying Conformational Flexibility in proteins

All proteins sample a range of conformations at physiologic temperatures and this inherent flexibility enables them to carry out their prescribed functions. A comprehensive understanding of protein function therefore entails a characterization of protein flexibility. Here we describe a novel approach for quantifying a protein’s flexibility in solution using small-angle X-ray scattering (SAXS) data. The method calculates an effective entropy that quantifies the diversity of radii of gyration that a protein can adopt in solution and does not require the explicit generation of structural ensembles to garner insights into protein flexibility. Application of this structure-free approach to over 200 experimental datasets demonstrates that the methodology can quantify a protein’s disorder as well as the effects of ligand binding on protein flexibility. Such quantitative descriptions of protein flexibility form the basis of a rigorous taxonomy for the description and classification of protein structure.

The scattering intensity of a sphere with uniform charge density can be derived using the Fourier-transform, in spherical coordinates, of a homogeneous sphere with charge density 0 ρ and radius S R : . [1] Here the scattering vector, , corresponds to the change in momentum between the incoming and outgoing wave-vectors, = ( , , ) denotes a point in the sphere, and dV is the volume element. Without loss of generality, we can choose the direction of to coincide with the zaxis.
With this convention we have • = cos . Using the fact that = ! sin , we obtain: where . Note that the right hand side of equation [2] makes it clear that the intensity of a spherically symmetric object, with a homogeneous charge density, is only a function of the magnitude of the scattering vector. Integration over the azimuthal angle, , yields a factor of 2π and integration over the polar angle, , yields: . [3] The intensity therefore reduces to: . [4] To obtain the final result we need to obtain an expression for the intensity at q=0. At low values of q, we perform a Taylor series expansion, keeping only the first order terms of the right hand side of equation [4] to obtain: Hence as the q approaches zero, the intensity approaches the square of the total charge of the sphere. Therefore we rewrite equation [4] as follows: Using the fact that 2 , we obtain: where 5 3 α = and (0) S I is the intensity at 0 q = .

Finding optimal values for I S (0), µ and σ
Solving the R g D model entails finding optimal parameters for µ , σ and I S (0) that minimize where I exp (q) is the experimentally determined scattering intensity, ε exp (q) is the associated error, and I S (0) appears in the expression for I µ,σ (q) (see equation [7]). Furthermore, since experimental error estimates are not available for all BIOISIS entries, we set in equation [8] to ensure that all of the scattering profiles would be treated equally.
A natural choice for I S (0) is the experimental scattering intensity at . However, in SAXS measurements one cannot measure directly at q=0 due, in part, to the difficulty that arises from the non-zero width of the incoming beam. We therefore use Guinier's law to exploit the analytical behavior at small q, thereby obtaining an estimate for the experimental scattering intensity at q=0. More precisely, at small q the scattering intensity can be approximated as follows 1 : 3 . [9] ε exp (q) = 1 0 q = This approximation is the basis of the widely used Guinier plots, where the natural logarithm of is plotted versus , and a line is fit to the data at low values of q. The slope of the line is used to compute the average radius of gyration, which serves as our initial guess for µ in our optimization algorithm, and the y-intercept is used as the estimate of . Once the value of is known, we normalize the entire intensity spectrum by this value, i.e.
. Normalization facilitates the comparison of the different intensity profiles that are obtained using different protein concentrations and different experimental setups. This is particularly important because the intensity at is, in general, a function of the protein concentration, the X-ray source intensity, and the sensitivity of the detector. After normalization we set .
Once initial values for ! 0 and µ are chosen, we perform a grid-search minimization where we vary σ values from 0.01 to 1 in increments of 0.01. The value of σ that minimizes , where µ is the value obtained from Guinier's law, is used as the initial value in a gradientdescent algorithm.
In the first step of the gradient-descent algorithm we fix ! 0 = 1.0 and then search for values of µ and σ that minimize ε µ,σ , I S (0) ( ) . We then perform another set of minimizations where all three variables are allowed to vary to find the minimum value for ε µ,σ , I S (0) ( ) . This stepwise procedure generally leads to faster convergence rates.

Guinier analyses
As outlined above, solving the RgD model involves first performing a Guinier analysis on the scattering data. An important part of the Guinier analysis is to choose a region to linearly fit the data. Unfortunately the choice of this region can be problematic. Although a plot of the natural logarithm of I(q) versus q 2 will be linear in the limit where q is smaller than the inverse dimensions of the protein, this condition does not translate to a well-defined cutoff in all cases.
In our program, RgD, the user manually chooses the upper cutoff for the q 2 axis of the Guinier plot; then, a linear fit is performed in the range from the lowest q data available up to the specified cutoff.
The RgD program also has an automated option that iteratively calculates the cutoff. First, an initial value for the radius of gyration is calculated as a function of the number of amino acids N.
For this function, a phenomenological constant capturing the relationship between R g and N was determined by plotting R g against N 1/3 for all entries in the BIOISIS database. Using the initial estimate, the first cutoff in the Guinier Plot is calculated as (1.3/R g ) 2 ; then R g is calculated again using the new cutoff, and the process is continued until the calculations converge within a tolerance of 10 -4 . In the study of complexes, both manual and automated options for the cutoff were used, and the calculated entropies differed by less than 1%.

Estimating the error in RgD calculations of entropy
It is also important to investigate the effect of experimental noise on the calculated entropy. This is particularly important for comparison of two systems measured in similar experimental conditions. Because we do not have the raw data for the entries in the databases, we simulate raw data for each entry by using the reported standard deviation of the experimental scattered intensity provided in the database entries; we back-simulate the raw data. To do this, at each q in the spectrum, the noise was calculated by using a random Gaussian distribution that had a standard deviation equal to that of the experimental scattered intensity (at that particular q). Five noise-simulated spectra were generated for each system and the statistics of the entropy were calculated. Table S3 shows the results for representative datasets from the Bioisis and SASBDB databases.

RgD on Model Systems
We demonstrate the RgD model on three model systems using theoretical SAXS profiles. We chose the intrinsically disordered K18 tau isoform (130 residues), the partially disordered protein CcdA (144 residues, with two 34 residue intrinsically disordered regions), and the folded protein CcdB (202 residues) (Fig. 3). The Bayesian Weighting algorithm was used to generate an ensemble for the K18 tau isoform (available through http://www.rle.mit.edu/cbg/data.htm) 2 . The 77 structures from this ensemble which had a 5% probability of having a weight of at least 0.005 were used for this analysis. NMR models for the CcdA protein in two states were used as an For each ensemble, theoretical SAXS profiles were generated in Crysol, primarily using the default parameters 6 . The order for the Fibonacci grid was set to its maximal value to improve the surface representation. The computed theoretical curves had 100 points ranging from q=0 Å -1 to q = 0. 5 Å -1 . 6 . The ensemble-averaged mean SAXS profile was then computed, together with the standard errors over the ensemble. For K18 tau, the Boltzmann weights provided by the Bayesian weighting algorithms were used to compute the ensemble average. For CcdA and CcdB, the conformations were assumed to be equally probable.
The RgD algorithm was run for each theoretical SAXS profile with both a user-determined maximum q 2 and an automatically-determined maximum q 2 . The intensity computed for q=0 Å -1 was deleted prior to running RgD. The user-determined maximum q 2 was found in Primus by searching for the largest q such that qR g < 1.3 7 . The automatically determined maximum q 2 values were within 0.0007 of the manually determined values, and the resulting RgD entropy values were identical.
The ensemble-optimization method, implemented as EOM, was used to compute R FLEX 8 . The genetic algorithm component of EOM (GAJOE, Genetic Algorithm Judging Optimisation of Ensembles) was run using a set of 10,000 random conformations that were generated based on the primary sequence of each protein by EOM. For these calculations, all default parameters were selected, with the exception that a compact-chain was selected for CcdB, the folded protein in our data set, and native-like chains were selected for the remaining proteins.

MD simulations of CcdB
The modeler software and the SCWRL4.0 side-chain prediction algorithm were used to model missing residues and atoms in CcdB structures, and chain termini were ionized [9][10][11] . For each of the two simulations, the CcdB structure was solvated in a dodecahedral box of TIP3P water, so that the minimum distance between the protein and the box wall was 1nm. Sodium and Chloride ions were added so that the concentration was 150mM with neutral charge. MD simulations were performed in the NVT ensemble using the charmm27 force field in Gromacs 5.0.4 [12][13][14][15][16] .
Particle-mesh Ewald (PME) and periodic boundary conditions were used to compute long-range electrostatic interactions 17 . The temperature was stabilized at 300K using the V-rescale thermostat, a modified version of the Berendsen coupling thermostat, with separate coupling for the protein and the solvent 18 . The LINCS algorithm was used to constrain the lengths of all bonds, allowing a simulation time-step of 2fs 19 . Before solvating, each structure was minimized in vacuum using steepest descent minimization until the maximal force was less than 500kJ/(mol nm) or a maximum of 500 steps. A second round of steepest descent minimization was performed after minimization until the maximal force was less than 100 kJ/(mol nm) or a maximum of 50,000 steps. A 100ps molecular dynamics simulation was then performed for which the protein was position-restrained, to allow time for the solvent to equilibrate. The position-restraints on the protein were then removed, and a 100ns simulation was performed to equilibrate the system. Finally, a production simulation of approximately 700ns was performed for each CcdB structure. During the equilibration and production simulations, one atom was fixed so that the protein did not diffuse outside of its box. Structures were extracted every five nanoseconds from the trajectories to generate an ensemble of 285 conformations accessible to CcdB. Table S3: Estimate of the effect of noise on the uncertainty of the RgD entropy. The noise simulations were performed by using the reported standard deviation for the scattered intensity at each q. * The noise simulations for C3b were performed using the standard deviations of the experimental scattered intensity provided for C3b/Efb.