Anisotropic protein-protein interactions in dilute and concentrated solutions

Interactions between biomolecules are ubiquitous in nature and crucial to many applications including vaccine development; environmentally friendly textile detergents; and food formulation. Using small angle X-ray scattering and structure-based molecular simulations, we explore protein–protein interac- tions in dilute to semi-concentrated protein solutions. We address the pertinent question, whether interaction models developed at inﬁnite dilution can be extrapolated to concentrated regimes? Our analysis is based on measured and simulated osmotic second virial coefﬁcients and solution structure factors at varying protein concentration and for different variants of the protein Thermomyces Lanuginosus Lipase (TLL). We show that in order to span the dilute and semi-concentrated regime, any model must carefully capture the balance between spatial and orientational correlations as the protein concentration is ele- vated. This requires consideration of the protein surface morphology, including possible patch interactions. Experimental data for TLL is most accurately described when assuming a patchy interaction, leading to dimer formation. Our analysis supports that the dimeric proteins predominantly exist in


a b s t r a c t
Interactions between biomolecules are ubiquitous in nature and crucial to many applications including vaccine development; environmentally friendly textile detergents; and food formulation. Using small angle X-ray scattering and structure-based molecular simulations, we explore protein-protein interactions in dilute to semi-concentrated protein solutions. We address the pertinent question, whether interaction models developed at infinite dilution can be extrapolated to concentrated regimes? Our analysis is based on measured and simulated osmotic second virial coefficients and solution structure factors at varying protein concentration and for different variants of the protein Thermomyces Lanuginosus Lipase (TLL). We show that in order to span the dilute and semi-concentrated regime, any model must carefully capture the balance between spatial and orientational correlations as the protein concentration is elevated. This requires consideration of the protein surface morphology, including possible patch interactions. Experimental data for TLL is most accurately described when assuming a patchy interaction, leading to dimer formation. Our analysis supports that the dimeric proteins predominantly exist in their open conformation where the active site is exposed, thereby maximising hydrophobic attractions that promote inter-protein alignment.

Introduction
Understanding and controlling protein colloidal stability associated with aggregation phenomena are besides biological interest, of significant importance in biotechnology where it affects expression, fermentation, purification, formulation, and eventually efficacy of produced protein molecules [1]. It is also central to formulation of protein-based pharmaceuticals such as insulin, where the controlled colloidal properties of the proteins directly couple to their release profile when used as a drug [2]. The factors influencing stability are already intricate in controlled and pure systems, but in native systems there is an added level of complexity due to e.g. glycosylation and unprocessed (signal) pro-peptide that biotech-produced protein samples typically have, depending on the choice of host organism and production conditions [3][4][5].
In this work we investigate protein-protein interactions of a 30 kDa lipase (TLL) from the fungal organism Thermomyces lanuginosus (formerly Humicola lanuginosa). The TLL enzyme protein is a triacylglycerol lipase (EC 3.1.1.3) that catalyses the hydrolysis of ester-bonds in substrates such as triglycerides. This type of lipase has widespread use from stereo-specific synthesis of drugs to laundry detergents where it enables removal of fatty and oily soil and allows for washing under more sustainable conditions, e.g. at lower temperatures and with reduced use of chemicals [6,7]. TLL exhibits interfacial activation, i.e. the activity of TLL is strongly enhanced upon recruitment to an interface where the substrate is enriched. In the course of this process, TLL undergoes a conformational change upon shifting the local environment: from the aqueous environment of high dielectric constant to a more hydrophobic (substrate-rich) environment of lower dielectric constant. The lid domain -made of residues 82-98 and covering the hydrophobic active-site region -rolls over, thereby exposing the active site to the interface [8][9][10].
To study intermolecular interactions between different variants of TLL in aqueous solutions we performed small-angle X-ray Scattering (SAXS) combined with molecular modelling. Both SAXS and the related technique, small-angle neutron scattering (SANS), are commonly used to study non-ideal macro-molecular solutions [11][12][13][14], including the effect of protein glycosylation [15,16]. We here complement experimental scattering studies with computer simulations aiming to shed light on the origin and detail of the observed aggregation properties. Molecular Dynamics (MD) is used to establish models for equilibrium structures of the different protein molecules and are subsequently used in coarse grained Monte Carlo simulations (MC) to model one-body, two-body, and manybody interactions. With this analysis we address the pertinent question: can parameters valid at dilute conditions be reliably transferred to more concentrated regimes?

Protein samples
Purified TLL lipase samples were prepared for SAXS experiments with and without N-linked deglycosylation, and with and without the presence of a SPIRR (Ser-Pro-Ile-Arg-Arg) propeptide at the N-terminus. For the deglycosylation, Endoglycosidase H from Streptomyces plicatus was purchased from Sigma-Aldrich, (Saint-Louis, USA) and was used by incubation in a pH 5.5, 50 mM NH 4 Ac + 200 mM NaCl buffer at 30C for 1 h. For SAXS experiments, samples were buffer exchanged by size exclusion and run over a Superdex200 column at an Äkta purifier using a pH 8, 10 mM Tris + 100 mM NaCl buffer. Based on stock concentrations obtained in the range of 6-8 mg/ml, we prepared dilution series of the samples that went down to 0.3-0.6 mg/ml (sample dependent). The final samples were characterised by intact molecular weight analyses by mass spectrometry (see following paragraph).

Protein concentrations
The protein samples were dried, then hydrolysed in 18.5% HCl + 0.1% phenol at 110°C for 16 h. Amino acid analyses were performed by precolumn derivatisation using the Waters AccQ-Tag Ultra Method. In short, amino acids were derivatised by the AccQ-Tag Ultra Reagent and separated with reversed-phase UPLC (UPLC, Waters Corp., Milford, MA), and the derivatives quantitated using UV absorbance with a molar extinction coefficient of 35920 M À1 cm À1 at 280 nm [17,18].

Mass Spectrometry
The intact molecular weight analyses were performed using a MAXIS II electrospray mass spectrometer (Bruker Daltonik GmbH, Bremen, Germany). The samples were first diluted to 0.1 mg/ml in MilliQ water. The diluted samples were applied to an AdvanceBio Desalting-RP column (Agilent Technologies) followed by washing and elution from the column running an acetonitrile linear gradient and introduced to the electrospray source with a flow of 400 ml/min by an Ultimate 3000 LC system (Dionex). Data analysis was performed with DataAnalysis version 4.3 (Bruker Daltonik GmbH, Bremen, DE). The molecular weight was calculated by deconvolution of the raw data in the range 20 to 60 kDa.
The lipase sequence contains one N-glycosylation site (Asn-Xaa-Thr, sequence motif) that is glycosylated with a high mannose type glycan structure. The N-glycosylation is efficiently removed by EndoH (Endo-N-acetylglucosaminidase H). EndoH cleaves the bond in the diacetylchitobiose core of the oligosaccharide between two N-acetylglucosamine (GlcNAc) subunits directly proximal to the asparagine residue, generating a truncated sugar molecule with one N-acetylglucosamine residue remaining on the asparagine. Thus by measuring the intact protein mass after EndoH treatment it is possible to precisely measure the percentage of glycosylated molecules in the protein sample by simply comparing the intensity of the unmodified mass peak with the mass peak + 1xGlcNAc. A spreadsheet showing the calculations is included in the supporting information.

Small-Angle X-ray Scattering
SAXS experiments were performed on the BioXolver L (Xenocs) placed at Department of Drug Design and Pharmacology, at the University of Copenhagen. During the measurements, the instrument was equipped with a 250 W liquid gallium alloy X-ray source (MetalJet) with a wavelength of k ¼ 1:34 Å. The scattering data were collected at 298 K in 20 x 30 s exposures with a sampledetector distance of d ¼ 654 mm. This provided a q-range of 0:011 À 0:50 Å À1 , with q defined as q ¼ 4p=k sinðhÞ, where 2h is the scattering angle. Data was averaged using RAW [19] and absolute scale calibrated using water as a reference. Pair distance distribution functions, pðrÞ, were obtained by indirect Fourier transformation performed using BayesApp [20].

All Atom Molecular Dynamics
As the structure of Thermomyces lanuginosus lipase (TLL) with the SPIRR fragment attached to the N-terminus (TLL+) is not already available, all atom molecular dynamics (MD) simulations were performed to generate those structures, for the closed and open forms of TLL. The PDB structures of the closed (1DU4) and open (1EIN) form of TLL [21] were used as starting points for the SPIRR-containing propeptides, see Fig. 1. The SPIRR fragment was attached to the N-terminus of the two forms of TLL using PyMol [22] and protonation states of pH 8 were obtained using PDB2PQR [23]. To relax the generated stucture, MD simulations were performed using Amber 16.12 [24], with the ff14SB forcefield [25], SPC/E water [26] and Li/Merz parameters for monovalent ions for SPC/E [27]. Harmonic C a constraints of 10 kcal/mol/Å 2 were applied during the minimisation. Constant pressure (1 bar) was maintained using a Monte Carlo (MC) barostat [28], with volume change attempts every 100 steps. Langevin dynamics with a collision frequency of 1.0 ps À1 were used to maintain a constant temperature of 298 K [29]. A real-space cutoff of 8 Å was applied to non-bonded interactions and the remaining electrostatic interactions were accounted for using the particle-mesh Ewald procedure with a Fourier spacing of 72 Å. The system was equilibrated for 500 ps, followed by 500 ns of sampling. A principal component analysis was applied to the trajectory of atoms in order to identify discrete conformational states and the structure corresponding to the lowest free energy region was selected as the final structure.

Metropolis Monte Carlo Simulations
For studying protein-protein interactions at multiple concentrations, we constructed a coarse grained (CG) interaction model for the closed and open forms of TLL starting from the PDB structures 1DU4, 1EIN, and as well as MD generated variants with SPIRR attached. Each residue was represented by a spherical bead, which could be charged or neutral, depending on residue type and solution pH. In all simulations pH = 8, giving a wildtype TLL net charge of À7e, and the temperature was set to 293. 15 where N is the number of interaction sites, b ¼ 1=k B T is the inverse thermal energy, k B ¼ be 2 =4p 0 r ¼ 7 Å is the Bjerrum length in water; r ij is the distance between the ith and jth particle; z their charges; e and r is the LJ depth and diameter, respectively. Lorentz-Berthelot mixing is used so that e ij ¼ ffiffiffiffiffiffiffiffiffi e ii e jj p , and r ij ¼ ðr ii + r jj Þ=2. Table 1 shows the used r ii and e ii values, where M w is the residue molecular weight, and q is the density defined in [30].
For hydrophobic residues (ALA, ILE, LEU, MET, PHE, PRO, TRP, VAL) residues with a solvent accessible surface area (SASA) larger than 30 Å 2 [31], a specific hydrophobic-hydrophobic LJ strength, e HÀH , was used to match experimental virial coefficients at low protein concentrations, see Results and Discussion. SASA was measured using a spherical probe of radius 1.5 Å.
De-glycosylation of TLL leaves a mannose unit on ASN 33, and we thus added an uncharged bead of radius equivalent to that of a residue (2.9 Å) along the surface of ASN 33 and with be ii ¼ 0:005, assuming negligible attraction to protein residues. For the cases where TLL is glycosylated (TLLm), a neutral bead of radius 20 Å is attached to the previously mentioned mannose unit.

Two-body Interactions
Two proteins were placed inside a cylindrical cell (L ¼ 500 Å, R ¼ 80 Å) with hard sides and periodic ends, see Fig. 2, middle. Proteins were allowed to rotate and translate along the z-axis of the cylinder only. For TLLm+ and TLL+, we considered that the proportion of proteins having the SPIRR propeptide attached was of 63%, as determined by mass spectrometry. Thus, the simulation box contained one protein with SPIRR and one protein without SPIRR. The angularly averaged two-body protein-protein distribution function, gðrÞ, was sampled via the histogram method as a function of mass-center separation, r. The potential of mean force (PMF) is obtained by Boltzmann inversion, bwðrÞ ¼ À ln gðrÞ þ c where the free energy of the reference state, c, is chosen such that wðrÞ ! 0 for large r. The second virial coefficient was estimated from the PMF via B 2 ¼ 2p R 1 contact ðe ÀbwðrÞ À 1Þr 2 dr and was normalised by the same second virial coefficient of hard spheres (B 2HS ) as used for the experimental data.
During the simulations, the contacts between the residues of the two proteins were recorded. A contact between a residue of protein 1 and a residue of protein 2 was considered effective if the distance between the surfaces of the two residues was less than 3 Å. The number of contacts for each residue was then summed at the end of the simulation and normalised so that the total number of contacts for one protein is 100%.

Many-body simulations
325 rotating and translating proteins were placed in a cubic cell with periodic boundaries and volumes matching the experimental protein concentrations. A spherical cut-off of 5 times the Debye length was used to truncate the pair-potentials to reduce the computational time. Due to oligomer formation, a cluster move with a threshold of 6 Å was used to rotate and translate aggregated proteins. Protein mass center positions were used to compute the concentration dependent structure factor, SðqÞ, by explicitly averaging [32] Eq. 2 over the 3+6+4 directions obtained by permuting the crystallographic index ½100; ½110; ½111 to define the scattering vector q ¼ 2pp=Lðh; k; lÞ where p ¼ 1; 2; . . . ; p max . The sampled q- , L being the cube side length and p max ¼ 15.
N p is the number of proteins; r i protein mass center positions. For comparison, we also evaluate SðqÞ using the Debye formula [33,34], Although Eq. 3 is strictly not valid for periodic cells, the interaction range in our system is short compared to the box size and it is interesting to compare with the more slowly converging Eq. 2.

One-body simulations
In both two-body and many-body simulations, salt implicitly accounted for using a Debye length, 1=j, c.f. Eq. 1. To validate this approximation, we performed a series of one-body, Grand canonical Monte Carlo (GCMC) simulations of a single protein in explicit salt solution. The 1:1 NaCl salt particles and protein counter ions consisting of charged beads (r NaÀCl =4.6 Å, NaÀCl =0.01 kT) with a constant chemical potential, are inserted in a spherical cell with radius R ¼ ð3=4pc p N A Þ 1=3 . The mobile configurational space was sampled by a combination of single ion translational moves and grand canonical insertion and deletion of electroneutral salt pairs (Fig. 2, left). The interaction energy is still Eq. 1 albeit with a Debye length of infinity as salt is now handled explicitly, plus a term accounting for the energy fluctuation due to the residue protonation state, kT P np i ðpK a;i À pHÞðlnð10ÞÞÞ.

Results and Discussion
We initially study four variations of TLL, see Fig. 3, and in the first part we discuss the experimental SAXS observations, followed by a dissection using theory and molecular modelling. Table 1 Lennard-Jones parameters used for the coarse grained protein model. Radii, r ii =2, of amino acid residues are estimated from their molecular weights, Mw, using q ¼ 1 g/mol/Å 3 , and range between 3.3-4.6 Å.  4 shows concentration normalised scattering intensities, IðqÞ=c p (top) and the corresponding internal pair-correlation functions, pðrÞ (bottom). From the levelling-off at low q-values; Guinier plots (see SI); and the pair-distance distribution functions, it is evident that the samples are free from large-scale aggregates. From the systematic concentration dependent increase of the scattering intensity at low q it is furthermore clear that a concentrationdependent oligomerisation is present for all four variants. This oligomerisation is also observed in pðrÞ as a second peak or shoulder at around 60 Å that increases as a function of concentration and also gives rise to an increase of the D max . At high q, while the data are noisy for the lower concentrations, then at higher concentrations, a distinct double bump is visible with maxima at 0.25 Å À1 and 0.4 Å À1 . Comparing the scattering of TLL in our experiments with that of the known structures of the wild-type enzyme, it is evident that the SAXS data with the lowest concentration corresponds well to monomeric TLL, while data obtained from the highest concentration corresponds well to a dimeric TLL enzyme ( Figure S1). This is consistent with the previously reported concentration dependent formation of dimers in TLL [35,36]. Scattering intensities, Ið0Þ, for q ! 0 were obtained from an inverse Fourier transform (IFT) analysis to obtain pðrÞ.

Separation of form and structure factor contributions, and introduction of the second virial coefficient
The Small-angle scattering intensity of interacting particles in solution can generally be described as the product of a form factor and a structure factor, where the normalisation factor AðcÞ ¼ cMDq 2 m ; c is the particle mass concentration (units of g/m 3 ); M is the molar mass of the individual particles (g/mol); and Dq m their excess scattering length mass density (units of m/g). PðqÞ is the normalised form factor intensity and Sðq; cÞ denotes the effective structure factor normalised such that SðqÞ equals unity at infinite dilution. Note that the ''effective structure factor", Sðq; cÞ is concentration dependent and takes into account all interaction effects between general particles. This may be both specific, short range attractive interactions leading to dimer formation as in the case of TLL, as well as more unspecific (long range) interactions, including excluded volume effects. This implies that Sðq; cÞ may generally not be expressed analytically in a closed form. We refer to [37] for further description of this approach to separate form and structure factors in the context of aggregating or oligomerising particles. Eq. 4 can be rewritten as: Let PðqÞ denote the normalised particle form factor of the protein monomers, such that for a set of SAXS data at different protein concentrations, c p , the structure factor at q ¼ 0; Sðc p ; 0Þ, is related to a virial expansion [38], 1 Sðc p ; 0Þ ¼ Pð0Þ where A 2 is the second virial coefficient in molÁm 3 Ág À2 ; c p the protein concentration in g/m 3 , and M w the protein molar mass in g/mol. Pð0Þ ¼ 1 as per definition. As mentioned above, the experimental Iðc p ; 0Þ values, were approximated through IFT analysis yielding the pðrÞ functions. For each series, we then plotted 1=Sðc p ; 0Þ as a function of the protein concentration and performed a linear regression on the data points. The slope of this curve equals 2M w A 2 . We converted A 2 to m 3 : B 2 ¼ A 2 N a =M 2 w , then normalised B 2 by the second virial coefficient of equivalent hard spheres, where R p is the radius of the protein, for which we take the hydrodynamic radius estimated by Zhu et al. [39], 27.8 Å. Table 2 shows the obtained B 2 =B 2HS and we note that for the same state of glycosylation, the addition of the SPIRR propeptide decreases the value of B 2 =B 2HS . This attraction is reasonable as the addition of the SPIRR propetide reduces the protein net charge from À7e to À5e and adds two hydrophobic residues. Similarly, glycosylation of the protein leads to a B 2 increase as expected from the effect of increased steric repulsion between the proteins.

Metropolis Monte Carlo Simulations
In this section we develop a structural model to understand and predict the scattering intensities and osmotic second virial coefficients. The strategy is to first calibrate a two-body model -valid at low protein concentrations -with the experimental virial coefficient for wildtype TLL (Table 2). Next, we test if the model is able to predict virial coefficients for the variants as well as many-body effects and SAXS spectra occurring at higher protein concentrations.

Two-body Interactions
The osmotic second virial coefficient reports exactly on the pairinteraction between two proteins in the linit of infinite dilution. TLL exists in a closed and open form, with the latter having $ 20% more hydrophobic surface area exposed to the solvent. To determine how to model short-range attractive interactions, we explore two schemes for non-electrostatic interactions between the proteins (see also Fig. 2): 1. Non-hydrophobic residues have a low LJ-parameter, b i ¼ 0:005, effectively meaning that the LJ interaction is chiefly repulsive and therefore contributes to excluded volume only. This model has previously been used to calculate structure factors for protein solutions [30]. 2. Non-hydrophobic residues have a moderate LJ-parameter, b i ¼ 0:05 which mimics a generic short-range attraction distributed uniformly over all residues. This model has been used in a number of studies for protein-protein interactions [40,41] and would in the limit of perfectly spherical proteins correspond to a colloidal Hamaker constant [42].
For both models, a single LJ interaction parameter between hydrophobic residues, HÀH , is fitted to match the experimental virial coefficient for wildtype TLL. Two proteins are placed in a cylindrical geometry (Fig. 2,  where the protein-protein interaction is predominantly repulsive (B 2 =B 2HS > 0) and independent of the value of the hydrophobic attraction, therefore likely dominated by the electrostatic compo- needed to match the experiments. This is due to a higher exposure of hydrophobic residues in the case of the open conformation: the total SASA of the hydrophobic residues of the open form is indeed 1.2 times the one of the closed form, primarily since four residues become solvent-accessible due to the opening of the lid. The contact between the two proteins is indeed strongly predominant in the lid region, as can be seen on Fig. 6 for wild-type TLL.
The values of e HÀH best fitting experiment are shown in Table 3 and are now used to predict the virial coefficients for the TLL variants. Fig. 7 shows that the parameterisation, based purely on wildtype data, well reproduces the TLLm variant for all interaction models, i.e. open/closed, be LJ =0.05 or 0.005.
However, the agreement between simulation and experiment becomes worse in the case of the SPIRR-containing forms of the lipase (TLL+ and TLLm+). Further, simulating using the closed form of TLL gives larger deviations than using the open form, i.e. the protein-protein interaction is too attractive when attaching SPIRR. This result fits the expectation that when the proteins are in close proximity, they do indeed take the open form which is stabilised by short-range interactions.
It is important to note that the results presented here are valid at low protein concentrations and the question is if the developed  models can be transferred to higher concentrations? This is the topic of the next section.

Many-body Interactions
The values of e HÀH matching the experimental data determined in the dilute two-body simulations (Table 3) have been kept as a reference in the coming many-body simulations at higher protein concentrations. We now exclusively focus on wildtype TLL, and investigate the effect of the open/closed conformations, and values of e LJ . Fig. 8, left shows the simulated structure factors obtained for the four different interaction models and it is clear that large differences arise as the protein concentration increases. Remember that by design, all four models give the same virial coefficient in the dilute regime. The right-hand side of Fig. 8 shows the concentration normalised scattering intensities, IðqÞ=c p from the simulations in comparison with experiment. Clearly the interaction scheme using a closed structure of TLL and a moderate LJ parameter (be LJ ¼ 0:05) differs significantly from experiment due to strong attraction as the proteins move closer. The remaining three models all give similar experimental agreement, although using the open structure and a moderate LJ interaction gives the best overall agreement at all concentrations. Fig. 9 shows the values of 1=Sð0Þ for each condition at different concentrations, each obtained by fitting the low q part of the corresponding SðqÞ. The root-mean-square deviation of the simulated values of 1=Sð0Þ to the experimental ones is presented in Table 4.
This confirms our previous notion that the open TLL structure and an intermediate LJ parameter of 0:05 kT gives the best agreement with experiment. It also highlights the danger of transferring interaction models from infinite dilution to high concentration where multiple proteins may simultaneously interact. A particularity of the present system is that the protein-protein interaction is driven largely by a hydrophobic patch [14], i.e. the active site. This calls for a particularly well-balanced interaction between the generic LJ interaction (working between all residues) and the specific hydrophobic attraction. We can balance this at infinite dilution to match e.g. osmotic second virial coefficients, but further introducing concentration series is a much more stringent, and here crucial, test.

Effect of Structural Flexibility
In the previous section we observed that the open and closed forms of TLL gave rise to significantly different protein-protein interactions. When free in solution, we expect the closed form to dominate as it is unfavorable to expose hydrophobic groups to the solvent. Upon association, the lid is expected to open, thereby stabilising the dimer through hydrophobic interactions as previ-  Table 3 Best fitting values of the hydrophobic-hydrophobic interaction strength, beHÀH, to the experimental B2=B2HS for wildtype TLL. Results are obtained for both the open and the closed structure, as well as for two LJ interaction models (see Fig. 2).

Conformation
be LJ be HÀH  Table 3. The blue background colours reflect .the cell values.
ously shown for Pseudomonas fluorescens lipase [36] and suggested for several other lipases, including TLL [35]. We now mimic this situation in the coarse grained simulations by introducing a MC swap move [43][44][45]. Here there is a dynamic exchange between the open and closed form. A caveat is that the intrinsic probability or free energy barrier to exchange between the two forms is currently unknown. We therefore arbitrarily set this barrier to zero, corresponding to the artificial situation where the two forms are equally likely. While this is certainly a harsh simplification, the missing term is somewhat captured in the heuristic parameterisation of the hydrophobic residue interaction, e HÀH fitted to the experimental virial coefficients. For the two LJ interaction models, i.e. ''weak" (be LJ ¼ 0:005) and ''moderate" (be LJ ¼ 0:05) the hydrophobic interaction, be HÀH is 0.537 and 0.404, respectively. Using this parameterisation from the low protein concentration regime we now investigate the effect of increasing the concentration. Fig. 10 shows that regardless of the interaction model, the open form of TLL is more abundant as the protein concentration increases. This is because the open form -with its large hydrophobic surface exposed -is stabilised in the dimers found at higher concentra-    tions. As expected, in the dilute limit we recover the 1:1 ratio between the open and closed form, bearing in mind that this is an artificial choice. Similar to the discussion in the Many-body Interaction section we have different models based on either a ''moderate" or a ''weak" general Lennard-Jones interaction scheme that both reproduce the experimental second virial coefficient in the limit of infinite protein dilution. To more stringently test these models, we perform simulations at finite protein concentration and compare scattering intensities with experimental SAXS data. As shown in Fig. 11, the two models give increasingly different results as the protein concentration is elevated, highlighting the importance of correctly balancing the strength and range of site-site interactions. Incorporating dynamic exchange between the open and closed form, the best overall agreement is seen for a ''weak" to negligible LJ attraction as also used in previous studies [30]. It is important to reiterate that in our current implementation of the MC swap move, we artificially assume a 1:1 distribution of the open and closed for of TLL. A first estimate of a more realistic distribution could be obtained using e.g. meta-dynamics all-atom simulations which determines the transformation free energy between the two TLL forms.

Effect of Protein on the Salt Excess Chemical Potential
For protein solutions prepared by dialysis against a buffer solution, the salt chemical potential in the sample, by definition, equals that of the buffer. That is, for different protein concentrations the salt activity is constant, yet the salt concentration, c p s may vary due to the chemical environment in the protein solution. In models for protein-protein interactions it is however often assumed that the salt concentration (not activity) in the buffer solution, c s , equals c p s , or c s ¼ c s =c p s % 1. In this section we investigate if this approximation holds as it affects the ionic strength used to calculate the Debye screening length. The approach is to use Grand Canonical MC moves for salt particle insertion [46] and explore different salt activities and protein concentrations. As a reference point, we per-formed salt only simulations (bulk solution, c s ), then added a coarse grained TLL in the center of a spherical simulation cell. In this so-called cell model description [47][48][49], the radius of the simulation box now determines the protein concentration.
For different salt activities, a s , Fig. 12 shows the ratio c s ¼ c s =c p s , where c s is the salt concentration in the bulk solution (reference point). For all activities, c s increases with the protein concentration, indicating a gradual deviation from the bulk salt solution. In the SAXS experiments and in the previous MC simulations, the ionic strength is 0.105 M, while c p , ranges from 0.01 mM to 0.3 mM. This region is marked in Fig. 12 (red square) and c s is here close to unity, suggesting that c s % c p s is a good approximation. However, large deviations occur at higher protein concentration and care must be taken when estimating the salt concentration. The departure from ideality is driven mainly by (i) the protein excluded volume, and (ii) the non-uniform electric potential surrounding the charged proteins. Within the cell model, (i) can be estimated by assuming spherical proteins that exclude a fraction of the total volume: Here R cell is the radius of the simulation cell;

Conclusion
Aiming to understand protein-protein interactions in dilute to semi-concentrated protein solutions, we have studied variants of the enzyme Thermomyces Lanuginosus Lipase (TLL) dissolved in aqueous salt solutions. Using a combination of small-angle X-ray scattering and molecular simulations we have simultaneously measured and calculated osmotic second virial coefficients, B 2 , and solution structure factors, SðqÞ. Via experimental data from dilute protein solutions, we created and tested several coarse grained protein interaction models where short range attraction is (i) accumulated solely on the hydrophobic, active site of TLL, or (ii) also smeared over the entire protein surface. We also explored the effect of experimentally known open and closed form of TLL. Our results clearly show that while a given model may well reproduce experimental data (B 2 ) at infinite dilution, it is no guarantee to predict the solution structure at even mildly elevated concentrations. For the present system we attribute this to the fact that the TLL system interacts through a single-patch interaction that promotes dimer formation. Therefore, the generic attraction (between all surface residues, e LJ ) and specific patch attraction (between hydrophobic residues, e HÀH ) must be balanced to correctly capture both spatial and angular correlations between the proteins as the concentration increases. This highlights the importance of careful consideration of protein structure when performing a priory predictions of solution stability.
We further explored a two-state conformation model where Monte Carlo swap moves dynamically account for the known open and closed forms of TLL. While this added degree of freedom in principle elevates realism, the intrinsic free energy difference between the two forms is currently unknown and requires further experimental and/or theoretical investigation. For all interaction models, we do observe a transition from closed to open form as the concentration increases. This agrees well with existing experimental knowledge of TLL. Using an open form -with the hydrophobic patch exposed -consistently gives best agreement with measured scattering intensities as a function of protein concentration. Overall the above findings all support the idea of a single patch interaction.
In summery, we investigated variants of TLL as typically found in industrial protein production pipelines. Fixing model parameters to capture the experimental B 2 of the wildtype, we explored if variant behaviour could be predicted at dilute conditions. When using a fixed, open form of TLL, the coarse grained interaction model reasonably well reproduces the effect of glycosylation (TLLm) and to some extend also the effect of added a pro-peptide (TLL+ and TLLm+). In all cases, using a closed TLL structure reduces agreement with experiment, again suggesting that dimer formation occurs through interactions between the active site patches, whereby a restricted angular space is a fundamental model requirement.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.