Capturing protein communities by structural proteomics in a thermophilic eukaryote

Abstract The arrangement of proteins into complexes is a key organizational principle for many cellular functions. Although the topology of many complexes has been systematically analyzed in isolation, their molecular sociology in situ remains elusive. Here, we show that crude cellular extracts of a eukaryotic thermophile, Chaetomium thermophilum, retain basic principles of cellular organization. Using a structural proteomics approach, we simultaneously characterized the abundance, interactions, and structure of a third of the C. thermophilum proteome within these extracts. We identified 27 distinct protein communities that include 108 interconnected complexes, which dynamically associate with each other and functionally benefit from being in close proximity in the cell. Furthermore, we investigated the structure of fatty acid synthase within these extracts by cryoEM and this revealed multiple, flexible states of the enzyme in adaptation to its association with other complexes, thus exemplifying the need for in situ studies. As the components of the captured protein communities are known—at both the protein and complex levels—this study constitutes another step forward toward a molecular understanding of subcellular organization.

Simplification of cell lysate by anion exchange Lysate was produced as above and passed through a HiTrap DEAE FF column (GE Healthcare) 0.25 ml/min in 100 mM HEPES pH 7.4, 95 mM NaCl, 5 mM KCl, 1 mM MgCl 2 . Flow-through was collected and concentrated by spin filtration using 100,000 Da cut-off Amicon Ultra centrifugal filter unit. Sample was then separated by size exclusion chromatography and prepared for mass spectrometry analysis as above.
Mass spectrometry analysis For protein co-elution, size exclusion chromatography fractions were analyzed in biological triplicates. Peptides mixtures were separated with a BEH300 C18 (75 µm x 250 mm, 1.7 µm) nanoAcquity UPLC column (Waters) using a stepwise 75 min gradient from 3% to 85% (v/v) acetonitrile in 0.1% (v/v) formic acid at a flow rate of 300 nl/min. The connected LTQ-Orbitrap Velos Pro instrument (Thermo Scientific) that was operated in data-dependent mode performing one survey MS1 scan followed by up to 20 (TOP20) collision-induced dissociation (CID)-based fragmentation MS/MS scans of the highest abundant ions available. Only charge states >1 were allowed for fragmentation. Important MS settings were: m/z range: 375-1600; resolution: 30,000 FWHM; AGC: 10 6 ; maximum ion time: 500 ms. Important MS2 settings were: minimum signal threshold: 150; dynamic exclusion time: 30 s; isolation width: 2 Da; normalized collision energy: 40; activation Q: 0.25; AGC: 30,000; maximum ion time: 50 ms, The mass spectrometry proteomics will be deposited to the ProteomeXchange Consortium (Vizcaino et al, 2014) via the PRIDE partner repository.
Finally, simple calculation of the percentage of proteins identified in U2OS and HeLa cells was performed. For HeLa cells, Nagaraj et al (Nagaraj et al, 2011) report 10,255 proteins; Kristensen et al. (Kristensen et al, 2012) found 1,961 proteins in their SEC experiment (19.1 % of total proteome), using the same column (see above). Similarly, for U2OS cells, Beck et al (Beck et al, 2011) identified 10,006 proteins; Kirkwood et al (Kirkwood et al, 2013) report 2,926 proteins identified in the larger size fractions (29.2% of total proteome).
Mass spectrometry data search, protein identification and analysis MS raw data for each fraction were searched against the Chaetomium thermophilum protein database including common protein contaminants using the MaxQuant search engine (version 1.305) (Cox & Mann, 2008). Important search parameters were: missed cleavages, 2; variable modifications, oxidation (M), Nacetylation (protein N-terminus); multiplicity, 2, allowing for light and heavy dimethyl labeling; static, carbamidomethylation (C); matching between runs, yes. Proteins and peptides were filtered for 1% FDR using target-decoy-based reverse database search. Further filters applied were: protein score, equal or higher than 60; PEP, <0.05; number of unique peptide counts, equal or higher than 2. Major protein ID was used for identified protein groups. Label-free quantitation of protein intensities was deemed more efficient than labeled spike-in standard quantitation. Thus, protein intensities for the biological triplicates (dimethyl light channel) were exported as iBAQ scores as determined by the MaxQuant software.
Preparation of cross-linking samples for Xi and xQuest searches For cross-linking of C. thermophilum protein complexes, consecutive fractions were pooled into three-fraction pools. For the xQuest pipeline, native protein complexes in fraction pools were cross-linked using 0.5 mM isotope-labeled disuccinimidyl suberate (DSS, Creative Molecules) at 35°C, 600 rpm, for 30 min. The cross-linking reaction was quenched by the addition of 100 mM ammonium bicarbonate at 35°C, 600 rpm, for 10 min. For the Xi pipeline native protein complexes in fraction pools were cross-linked using 1:1 protein mass: cross-linker (BS3 from Thermo Scientific) mass ratio at 4°C, 600 rpm, for 2 hours min. The cross-linking reaction was quenched by the addition of 100 mM ammonium bicarbonate at 4°C, 600 rpm, for 30 min.
An additional cross-linking experiment was performed using a preperative SEC column for the xQuest pipeline: 2 ml of an approximately 12.5 mg/ml solution was injected onto a HiPrep 26/60 Sephacryl S-400 HR column at 0.2 ml/min and 300 µl fractions were collected. Consecutive fractions were pooled into seven-fraction pools and cross-linked using DSS. These were individually concentrated to 100 µl in order to achieve a concentration of approx. 0.5-1 mg/ml.
Cross-linked protein complexes were denatured by addition of 4 M Urea/0.1% 0.2% (w/v) Rapigest® and subjected to carbamidomethylation and digestion as described above. Purified peptide mixtures of cross-linked samples were dried in a vacuum concentrator and stored at -20°C until further use. Cross-linked peptides derived from each fraction pool were enriched using peptide gel filtration as described previously (Leitner et al, 2014). In brief, samples were reconstituted (30% (v/v) acetonitrile / 0.1% (v/v) trifluoroacetic acid) and fractionated using a Superdex Peptide PC 3.2/30 column (GE) on an Äktamicro LC system (GE) at a flow rate of 50 µl/min. Fractions eluting between 0.9 and 1.4 ml based on the 215 nm UV absorbance profile were collected (3-4 fractions total for each of the original fraction pools) and dried in a vacuum concentrator. Fractions were reconstituted in 20-40 µl buffer containing 5% (v/v) acetonitrile and 0.1% (v/v) formic acid prior to mass spectrometry analysis Mass spectrometry and data analysis with xQuest of cross-linked samples For cross-linking analysis, a total of 24 peptide gel filtration fractions were analyzed in technical duplicates with the same settings as described above, except a stepwise 60 min gradient was used and the minimum signal threshold was set to 100.
Mass spectrometry and data analysis with Xi of cross-linked samples Samples were analyzed using a UltiMate 3000 Nano LC system coupled to an Orbitrap Fusion Lumos Tribrid mass spectrometer equipped with an EasySpray Source (Thermo Fisher Scientific, San Jose, CA). Mobile phase A consisted of 0.1% formic acid in water, mobile phase B of 80% acetonitrile, 0.1% formic acid and 19.9% water. Peptides were loaded onto a 500 mm C-18 EasySpray column (75 µm ID, 2 µm particles, 100 Å pore size) with 2% B at 300 nl/min flow rate for 11 min and eluted at 300 nl/min flow rate with a linear gradient from 2% -40% B over 139 min.

Prediction of co-eluting proteins
Proteins detected in more than one biological replicate and with a positive identification in more than three consecutive fractions were retained for co-elution analysis. Data in each fraction were averaged across biological replicates and each chromatogram was correlated point by point with each other chromatogram to generate a Pearson Correlation Coefficient. A cross-correlation score was then derived as a threshold for protein co-elution discovery, and calibrated by an assembled benchmark of protein complexes with known molecular structures. An algorithm is available from the authors upon request.

Ortholog mapping
Orthologous protein groups within C. thermophilum, C. globosum, Neurospora crassa and 20 related fungi (pezizomycotina) were generated using the in-house eggNOG pipeline and these pezizomycotina orthologuous groups were mapped to fungal orthologous groups (fuNOG) from eggNOG (Jensen et al, 2008). If fuNOG groups had at least two members of the pezizomycotina orthologous groups in them and any S. cerevisiae protein in the same group, it was assigned as an ortholog of C. thermophilum.

Protein interaction prediction using known 3D structures of homologs
We used the Mechismo (Betts et al, 2015) pipeline (minus filtering for experimentally determined interactions) to predict C. thermophilum protein-protein interactions based on known 3D interaction structures of homologs, and scored the resulting interactions with InterPreTS (Aloy & Russell, 2002).

Protein interaction prediction
For the computation of the network and subsequent clustering, the R programming language was used with related packages (Liaw & Wiener, 2002;Williams, 2011). Existing protein-protein interaction knowledge was mined from String (v.9.1) (Franceschini et al, 2013) by extracting all known interactions between Chaetomium orthologs in S. cerevisiae. String scores were re-processed to remove physical association evidence and retaining all other experimental evidence (e.g. genetic interactions). Crosscorrelation co-elution (CCC) scores were filtered for scores above 0. The distributions of these three datasets and their overlap are found in Appendix Fig. S7. Weighted biochemical interaction data from the co-elution study (936,056 interactions containing 330,330 positively scored (>0) experimental interactions) were combined with interface predictions from Mechismo (744, Z-score) and existing knowledge derived from String (75,914) using a Random Forest (RF) classifier and a manually curated training set of reference interactions to filter out spurious connections and infer a network of high confidence, in a similar manner to (Havugimana et al, 2012). As a result, no interaction in the final data exists without a foundation of an experimentally derived score.
A benchmark set of known protein complexes was assembled from the homologous PDB complexes and AP-MS-derived core S. cerevisiae complexes (Benschop et al, 2010) (Dataset EV2). This benchmark contains 5,926 true positives (TP) and 99,837 true negatives (TN), assumed from inter-complex interactions. A machine learning training set was created from it by randomly subsampling from the positive and negative interactions to achieve a balanced, 1:1 ratio of TP:TN (~10,000 interactions), and split into three independent sets for learning, validation, and testing (70/15/15% respectively). The performance of the model was validated on the out-of-bag (OOB) sample, such that each of the 500 trees was constructed using a different bootstrap sample from the original data to get an unbiased estimate of the test set error. Approximately 63% of the available data were used for the tree construction, leaving the remaining 37% for testing [the OOB data]. Our RF Model has yielded an OOB error rate of ~8% and a consistent ROC of ~97% across independently evaluated sets (Appendix Fig. S7). Quality was evaluated using the OOB error rate, ROC, as well as the mean decrease accuracy and mean decrease Gini as measures of predictor variable importance in reducing classification error and partitioning the data into defined classes respectively. The resulting RF model was applied to the unclassified data (322,672 positive co-elution experimental scores). RF-produced probability score distribution was evaluated and a cut-off for significance of 1 Standard Deviation (SD) was selected for predicted class one interactions (resulting in an RF probability threshold of >= 0.85) (Appendix Fig. S7). The retained network contained 679 proteins connected with 6,002 unique, un-directed, non-duplicated edges (Dataset EV3). The quality of the RF-derived network was evaluated using a randomized Erdős-Rényi model (Erdős & Rényi, 1959) (for the same number of nodes and edges). Calculated network centralities show that the RF network is a non-random, scale-free, real-world network (Appendix Fig. S7). The degree distribution follows the power law with a correlation of 0.8 and R-squared of 0.7, compared with the randomized network correlation of -0,044, and R-squared = 0.003 and the global clustering coefficient of 0.5 vs. random networks' 0.026. Network was visualized using Cytoscape v3.3 (Shannon et al, 2003) (Appendix Fig. S7).

Community detection
Protein complexes were inferred using the ClusterONE algorithm (Clustering with Overlapping Neighborhood Expansion) . ClusterONE assumes that protein complexes appear as densely connected regions within the inferred interaction network, and proteins may belong to multiple complexes, thus the regions may overlap. Complexes are discovered by growing multiple clusters from seed proteins, independently of each other, and the greedy algorithm attempts to maximize the cohesiveness of the complex . The input network weighted by RF probability was clustered with the following parameters: node penalty value=2.0; cluster density threshold=0.3; haircut threshold=0.3; seeding from all nodes; multi-pass; weights=RF probability. All parameter combinations were varied and optimized to yield the highest Maximum Matching Ratio (a measure of how accurately the predicted complexes represent the reference set). MMR alongside of the positive predictive value (PPV), clustering-wise sensitivity (cws), and geometric accuracy (acc) (Brohee & van Helden, 2006) formed a quality benchmark evaluating how well the result recovers the benchmark set of complexes. Clustering yielded a total of 92 significant clusters (p-value <0.1). Removal of redundancy (merging clusters containing ≥50% overlapping subunits) reduced this to 65 clusters. Clusters with a clustering coefficient (transitivity) of 0 were removed, as we expect protein complexes to show higher density (connectivity) among members (Wasserman & Faust, 1994). Three clusters containing proteins eluting in the final fraction were not considered due to their incomplete elution peaks. The final 48 high-confidence clusters are shown in Dataset EV4.

Structure prediction of proteins participating in high-molecular weight assemblies
Prediction of the structure of all 1,176 identified proteins was performed with iTASSER v4.2 (Yang et al, 2015). The top predicted model was selected according to its respective c-score, (Roy et al, 2010). Details for model quality for those with >30% of sequence identity and coverage are shown in Appendix Fig. S2; detailed coordinates for the models are available from the authors upon request.

Protein complex assignment using the Protein Data Bank
Each of the 1,176 proteins found in total in all three biological replicates were manually submitted to the NCBI BLAST server (http://blast.ncbi.nlm.nih.gov/Blast.cgi) and searched against the PDB (www.pdb.org). A threshold of 30% of sequence identity was assigned. Decision on the assembly was taken after back-BLASTing the rest of the subunits, if any, of the PDB structure to the C. thermophilum proteome. All results are included in Dataset EV2.

Calibration of cross-linking ld score using structural models
For proteins for which cross-links were identified, cross-linking distances on the structure were calculated with Xwalk (Kahraman et al, 2011). Furthermore, protein models with known complex states were modeled according to their template protein complex by superimposing the C. thermophilum atomic models on top of their structural homologs with PyMOL v1.2 (www.pymol.org). Data is included in Dataset EV2 regarding homology.

Modeling of protein interfaces using cross-linking data
The HADDOCK webserver (guru access) was used (de Vries et al, 2010;van Zundert et al, 2015). Missing side-chains were properly built and the interface of the complex was optimized using the OPLS force field (Jorgensen & Tirado-Rives, 1988) and non-bonded interactions were calculated using a cut-off of 8.5 Å. Electrostatic energy (E elec ) was calculated by a shift function, while a switching function (between 6.5 and 8.5 Å) was used for the van der Waals energy (E vdw ). Desolvation energy was calculated by implementing empirical atomic solvation parameters (Fernandez-Recio et al, 2004). All calculations were performed with HADDOCK version 2.1/CNS version 1.2 (Brunger, 2007;Dominguez et al, 2003). Cross-linking data were implemented as interaction restraints, set to have an effective (and maximum) Cα-Cα distance of 35.2 Å, whereas the minimum distance was only defined by energetics. This distance was selected due to the maximum Cα distance that the DSS cross-linker may have, when cross-linking Lys sidechains. For docking calculations, the standard HADDOCK protocol was used (Dominguez et al, 2003). Standard HADDOCK scoring for it0 was applied to select the 200 top-ranking structures that awere subsequently passed onto the next docking steps (semi-flexible simulated annealing and final refinement in explicit water). Finally, clustering at 7.5 Å was performed. Parameter files for each run performed in this study are available from the authors upon request.

Negative-stain electron microscopy and 2D class averaging
Samples were directly deposited on glow-discharged (60 sec) Quantifoil®, type 300 mesh grids and negative-staining with uranyl acetate 2%(w/w) water was performed. In short, after sample was applied, it was rinsed twice with distilled water after 30 sec and uranyl acetate was then applied for 60 sec. After blotting, grids were left to air-dry and subsequent imaging at the FEI Morgagni 268(D) was performed for each fraction. Recording of data was performed with a side-mounted 1K CCD Camera (SIS). After data acquisition (pixel size=7.1 Å), E2BOXER was used for particle picking with a box size of 80X80 pixels. In total, 37,424 particles were picked out of 30 fractions, resulting in 1,170±827 particles per fraction. Class averaging was performed using RELION 1.2 (Scheres, 2012a;Scheres, 2012b). Briefly, 20 classes per fraction were set for 2D classification. Cross-correlation of final class averages was performed using an in-house MATLAB script. The algorithm is available from the authors upon request.

ctFAS enzyme preparation and vitrification
Three fractions derived from SEC enriched in ctFAS (according to quantitative MS data) were pooled and subsequently visualized for structural integrity. ctFAS was ~50% enriched (see Appendix Fig. S14) and overall protein concentration was determined to be ~40 ng/µL. Samples were then deposited on glow-discharged (60 sec) carbon-coated holey grids from Quantifoil®, type R2/1. A FEI Vitrobot® was used for plunge-freezing. In short, humidity was set to 70%, temperature to 293 o K, blotting and drain time to 3 and 0.5 sec, respectively. Sample volume applied was 3 µL and blot offset was set to -3 mm.

Image acquisition
The vitrified samples were recorded on a FEI Titan Krios microscope at 300 kV and automatic image acquisition was performed with FEI EPU software. Pixel size was set to 2.16 Å and an FEI Falcon 2 camera was used in movie mode. The total number of frame groups was seven and dose applied per frame group (e-/Å 2 /sec) was set to 4/4/4/4/4/4/24, respectively. Total dose applied was summed to 48 e-/Å 2 , but the last frame was used only for particle picking. Magnification was set to 75000X, whereas defocus ranged between 0.6 and 3.0 µm. A total number of 13,419 micrographs were acquired in 21 hours (1 frame/ 6 sec; 1 movie/42 sec).

Data processing and 3D reconstruction
Motion correction was applied to acquired micrographs (Li et al, 2013) leading to 1,917 summed micrographs. E2BOXER was used for particle picking with a box size of 256X256 pixels. 7370 particles were selected from 1597 micrographs (4-5 particles/image). For CTF correction, CTFFIND from the Grigorieff lab was used (Rohou & Grigorieff, 2015). The RELION 1.2 package (Scheres, 2012a;Scheres, 2012b) was then used for 2D class averaging, 3D classification and 3D reconstruction of the density map. Briefly, 20 classes were set for 2D classification and, after visual inspection, eight were selected for further processing that included 4,898 particles. 3D classification was set to five classes, performed without imposing symmetry, using the cerulenin-inhibited yeast FAS as an initial model (Gipson et al, 2010), low-pass filtered to 60 Å. 3,933 particles were successfully classified in a single class (Appendix Fig. S14) and this was subsequently used for reconstruction. 3D reconstruction was performed using the same initial model but applying D3 symmetry and underwent 25 interactions, showing convergence. Significant dependency of the resolution according to masking was observed (Appendix Fig. S14), but default Gaussian mask from RELION 1.2 leads to a calculated resolution (Gold-standard FSC=0.143) of 4.7 Å.
Modeling of the ACP-enoyl reductase domain interaction and the FAS-carboxylase metabolon Models of C. thermophilum acyl carrier protein (ACP) and enoyl reductase (ER) domains were generated using Modeller 9v2 and chosen structural homologs were selected from the yeast homolog with resolved densities for both (Leibundgut et al, 2007). Additional density of ACP was observed close to the ER domain of fatty acid synthase (FAS); thus, coarse placement of the ACP was performed using CHIMERA (Pettersen et al, 2004) and subsequently fitted to the density. Manual inspection of the fit agreed with the ACP orientation with the lipid-binding domain facing the ER domain. Energy calculations using the refinement webserver were performed, the domain interaction serving as input. Energy calculations were performed as previously described (Kastritis & Bonvin, 2010;Kastritis et al, 2014). Correlation of van der Waals energy with experimentally measured equilibrium dissociation constants for known complexes was derived from (Kastritis et al, 2014).  (NNET)). (ii) ROC curve for the RF model plotted for evaluation of the model on the independent (never seen) test dataset. The ROC curve clearly indicates prediction capacity for protein-protein interactions by the RF approach (iii) Evaluation of the variable importance on the model using Mean Decrease in Gini (not shown), and Mean Decrease in Accuracy (shown), which depicts how much inclusion of a predictor variable (STRING, co-elution cross-correlation coefficient (CCC), or Mechismo score) in the model reduces the prediction error. Although the most significant contributor is the external data, if CCC values from co-elution are removed, a deterioration in prediction capacity of the RF network model is observed. The smallest contribution comes from Mechismo interface score due to their low number (see Dataset EV3) (C) (i) Distribution of the CCC (rescaled 0-1) and the STRING combined scores covered by the RF probability threshold of 0.85-1.00. (ii) Distribution of RF-derived probabilities of predicted interactions, class 0 (<0.5 probability), class 1 (interaction >=0.5) with 1SD threshold of 0.85 probability for network reconstruction. The final accuracy of the dataset is within 15% of FDR; those are shown in the network in Appendix Figs 8-10. (iii) The quality of the RF-derived networks was evaluated using a randomized Erdős-Rényi model (same number of nodes and edges). Calculated network centralities showed that the network has non-random, scale-free, real-world network properties with degree distribution following the power law with correlation of 0.8 and Rsquared of 0.7, compared to the randomized network correlation of -0,044, and Rsquared = 0.003, and the global clustering coefficient of 0.5 vs. random network 0.026.
Quality calibration of intra-and inter-protein cross-links based on structural models of C. thermophilum proteins and complexes. (A) Upper panel, intra-link FDR calibration for (a) data from the two experiments using the Orbitrap Velos Pro, and (b) from the Orbitrap Lumos Fusion data. (B) Distance calculations of intra-protein cross-links (violated > 30 Å of distance); the log-normal experimentally determined distributions for the crosslinkers is recapitulated (C) Similar to (A), but corresponding to inter-protein cross-links. (D) Calculated distances of cross-links on protein-protein interactions; a log-normal distribution is recapitulated and few distances are violated (E) Docking calculations and highlighted structural models for unknown homomultimers (for details in identified novel homomultimers see Dataset EV5 and text). capacity of the van der Waals interactions. Calculated van der Waals for the ACP/ER interaction implies the interaction to be in the µm range. (E) Complementation of electrostatic surface potentials for the two protein domains is apparent and comprises the major energetic determinant. Co-factors are also modeled.