Identification of Extracellular Segments by Mass Spectrometry Improves Topology Prediction of Transmembrane Proteins

Transmembrane proteins play a crucial role in signaling, ion transport and nutrient uptake, as well as in maintaining the dynamic equilibrium between the internal and external environment of cells. Despite their important biological functions and abundance, less than 2% of all determined structures are transmembrane proteins. Given the persisting technical difficulties associated with high resolution structure determination of transmembrane proteins, additional methods, including computational and experimental techniques, remain vital in promoting our understanding of their topologies, 3D structures, functions and interactions. Here we report a method for the high-throughput determination of extracellular segments of transmembrane proteins based on the identification of surface-labeled and biotin-captured peptide fragments by LC/MS/MS. We show that reliable identification of extracellular protein segments increases the accuracy and reliability of existing topology prediction algorithms. Using the experimental topology data as constraints, our improved prediction tool provides accurate and reliable topology models for hundreds of human transmembrane proteins.


Results
Labeling and enrichment of extracellular protein segments. We have optimized a well-established labeling method [41][42][43][44][45][46][47] to enhance topology prediction of hundreds of endogenous proteins by high throughput experimental topology data (Fig. 1). This method relies on the selective chemical tagging of cell surface proteins by Sulfosuccinimidyl-2-(biotinamido)ethyl-1,3-dithiopropionate (sulfo-NHS-SS-biotin), which is a membrane impermeable reagent labeling only extracellular amino termini and lysine side chains. While covalent surface labeling has been used to identify cell surface proteins [42][43][44][45][46] and to determine the topology of particular proteins (e.g. IFITM1 41), here our aim was to label extracellular segments of the majority of surface exposed transmembrane proteins to improve topology prediction of the transmembrane proteome. Labeling conditions were optimized to minimize false positives, i.e. labeling of intracellular segments of TMPs. Cell-surface biotinylation of Chinese hamster ovary (CHO) cells was verified by confocal microscopy, which showed homogeneous fluorescent labeling of the cell surface and no signal in the cytoplasm (Supplementary Figures 1 and 2). Importantly, treatment with sulfo-NHS-SS-biotin did not compromise the integrity of the cells, as suggested by measuring cell death by propidium iodide uptake (Supplementary Figure 3A). Labeling of intact cells was followed by the solubilization and digestion of the membrane preparations and the affinity enrichment of the modified peptides. Samples corresponding to different stages of the purification process were blotted onto a PVDF membrane to allow semi-quantification of biotinylated peptides. Results presented in Supplementary Figure 4 show that the binding capacity of the neutravidin beads did not limit the enrichment of the biotinylated peptides.
Biotin contents of samples taken before affinity isolation (Supplementary Figure 4 C1, 3,5) and samples from fractions after affinity binding (Supplementary Figure 4 C2, 4,6) clearly show that the biotinylated components remained bound to the neutravidin beads.
Peptides were eluted from the neutravidin agarose affinity columns using reducing agents or formic acid in the absence or presence of alkylating agents, which were added to block peptide aggregation through disulphide bridges. The method results in the covalent labeling of extracellular primary amines by a thioacyl [...]. Altogether 17493 peptides containing covalently modified lysines were detected by liquid chromatography tandem mass spectrometry (LC/MS/MS, Supplementary Table 2). Modified peptides from the various searches were mapped to unique proteins, and the modifications in unique proteins were counted. We considered only those lysines that were identified in at least three independent labeling experiments. Using this filter, we identified 730 labeled lysine positions in 198 TMPs (Supplementary Table 3). Taking into account all the homologous TMPs that can produce the labeled peptides, our experiments provided topology data for 2776 human TMPs in the UniProt database (Supplementary Table 3).

Validation of the experimental results.
To validate our results, we compared the topological location of the labeled lysine residues to independent experimental results reported in the literature 15 . While no prior experimental topology data were available for 85 proteins, we identified 450 out of the 730 labeled lysines in 113 proteins whose topology could be confirmed by independent experiments. Of these, 98.7% of the positions were correctly classified. It is important to note that the 113 proteins whose topology could be confirmed are not sequentially similar or identical, thus this validation is not biased by selecting only one specific type of TMPs from the TOPDB database.

Characterization of the labeled lysine residues.
To gain insight into the efficiency of the labeling method, we compared the sequential environment of the labeled lysines with that of the extracellular lysines that were expected to be labeled but were not detected in our experiments. The sequence logos of the surrounding 20 amino acids clearly show that our method preferentially identified those lysines that are followed by positively charged amino acids that serve as tryptic cleavage sites and ensure the appropriate peptide size for MS analysis (Supplementary Figure 5). Moreover, the highest value (the smallest entropy) at position −1 of the labeled peptides suggests that the proximal amino acid may alter the chemical reaction of the labeling process. We also calculated the length distributions of those extracellular segments that were identified by covalently modified lysines and of those extracellular segments that would be identified if the same number of lysines had been modified randomly (Supplementary Figure 6). The two distributions are not significantly different, suggesting that reactivity with sulfo-NHS-SS-biotin is not influenced by the length of the extracellular loops or domains.
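The per-position entropy that underlies a sequence logo can be illustrated with a minimal sketch. The function name and the four short example windows below are hypothetical and are not taken from the study's data; the code only shows how a low-entropy (highly conserved) column near a labeled lysine would be detected. In a standard logo the column height is roughly log2(20) minus this entropy, ignoring small-sample corrections.

```python
import math
from collections import Counter

def column_entropies(windows):
    """Return the Shannon entropy (in bits) of every column
    in a list of equal-length sequence windows."""
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    entropies = []
    for i in range(length):
        counts = Counter(w[i] for w in windows)
        total = sum(counts.values())
        # H = -sum(p * log2 p) over the residues observed in this column
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

# Hypothetical 5-residue windows centred on a labeled lysine (index 2).
windows = ["GAKRL", "SAKKV", "TAKRI", "DAKHL"]
entropies = column_entropies(windows)
```

In this toy input the column at index 1 (position −1 relative to the lysine) is fully conserved, so its entropy is zero, while the first column is maximally variable for four sequences.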

Validation of the overall method.
To measure the potential impact of our experiments on topology predictions, we analyzed a human TMP benchmark set consisting of 333 human TMP sequences 3 . The topology of these TMPs was established based on the available 3D structures of the same or homologous proteins. The set contains 8099 extracellular and 4892 intracellular lysines. Using the CCTOP algorithm, we simulated topology predictions taking into account an increasing number of extracellular lysines as constraints 11 . To avoid any bias, all further computational and experimental constraints were neglected, and only the selected lysines were considered. We selected 25%, 50% and 75% of extracellular lysines by chance, and compared the results of the predictions to the established topology of the benchmark TMPs, using only these randomly selected extracellular lysine residues as constraints (Fig. 3A, blue line). The randomizations were repeated 50 times to calculate the average and the standard deviation of the prediction accuracy and reliability (Fig. 3B, Supplementary Figure 7). To assess the theoretical limits of our approach, a simulation was run in which all extracellular lysines were considered as constraints (100% on the plots). As expected, the accuracy of the topology predictions was significantly improved by involving extracellular lysines as constraints (Fig. 3). The simulations suggest that the maximal benefit is a 23% increase in the prediction accuracy (from 56% to 79%) (Fig. 3A, blue line), which would occur with the labeling of all extracellular lysines. By limiting the constraints to 20% of the extracellular lysines (corresponding to the percent of labeled lysines in our experiments), the accuracy of the topology predictions is still increased by 14% (from 56% to 70%). 
To simulate the effect of erroneously identified positions on prediction accuracy, we corrupted the input of the prediction algorithm by replacing 4, 8, 12 and 16% of the randomly selected 25, 50, 75 and 100% extracellular lysines with intracellular lysines. As shown in Fig. 3A, false positive constraints have a drastic effect, resulting in an actual decline of the prediction accuracy.
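The construction of such constraint sets can be sketched as follows. This is only an illustration of the sampling and corruption steps under stated assumptions: the function name, the position tuples and the seed are hypothetical, and the CCTOP prediction itself is not invoked here.

```python
import random

def sample_constraints(extracellular, intracellular, fraction, error_rate, rng):
    """Select a fraction of extracellular lysines as constraints, then
    replace a given share of the selected positions with intracellular
    lysines to mimic false positive labeling."""
    n_selected = round(len(extracellular) * fraction)
    selected = rng.sample(extracellular, n_selected)
    n_false = round(n_selected * error_rate)
    false_hits = rng.sample(intracellular, n_false)
    # Drop n_false true positions and add the same number of false ones,
    # so the size of the constraint set stays unchanged.
    return selected[n_false:] + false_hits

# Hypothetical positions: (protein identifier, residue number).
extracellular = [("P1", i) for i in range(100)]
intracellular = [("P1", i) for i in range(100, 160)]

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
# 50 repetitions, mirroring the number of randomizations in the text.
runs = [sample_constraints(extracellular, intracellular, 0.25, 0.08, rng)
        for _ in range(50)]
```

Each of the 50 constraint sets would then be passed to the predictor, and the mean and standard deviation of the resulting accuracy computed over the runs.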

Discussion
Sulfo-NHS-SS-biotin has been extensively used to determine the orientation of unique protein termini 41 or to identify cell surface proteins in different cell lines, such as blood and lymphatic vascular endothelial cells 42 , mesenchymal stromal cells 43 , hepatoma cells 44 , B-cell precursor acute lymphoblastic leukemia 45 , melanoma cells 46 and pancreatic cancer cells 47 . Here we used this well-established labeling agent for the identification of extracellular lysine residues of transmembrane proteins in order to increase the topology prediction accuracy and the reliability of the CCTOP algorithm. We assessed the potential of the experimental approach by modelling the impact of constraints on the accuracy and reliability of the CCTOP predictions using a TMP benchmark set. As expected, both values increased by applying additional constraints, although not in a linear fashion (Fig. 3 and Supplementary Figure 7). Nearly half of the maximal increase was achieved by using only 20% of the potential extracellular lysines as constraints. Moreover, the simulations revealed that the prediction accuracy can be increased only if experimental data are free or almost free of error. Unfortunately, labeling of intracellular lysines of TMPs results in a significant deterioration of the prediction accuracy, which rules out experimental strategies that would increase false positive hits. In view of the simulations, we optimized the labeling procedure to minimize the risk of labeling intracellular residues of TMPs. In particular, we used a membrane-impermeable reagent and optimized the experimental conditions to ensure maximal extracellular biotinylation without intracellular labeling of TMPs (Supplementary Figures 1-3). We also verified that the binding capacity of neutravidin agarose columns was not exhausted and that the bound biotinylated peptides were fully removed for MS analysis (Supplementary Figure 4).
Importantly, we purified labeled peptides rather than labeled proteins and did not consider unlabeled peptides as hits. This strategy lowered the number of identified proteins, but ensured a low false discovery rate.
To validate our results, we collected published information on the exact localizations of the labeled segments. 450 out of the 730 labeled lysines could be compared to independent experimental data, of which 98.7% were confirmed to be located extracellularly. Based on this result, we are confident that our experimental data may be used as constraints in the CCTOP algorithm to enhance the large scale topology prediction of human TMPs.
In our experiments we identified 730 topological positions for 198 TMPs in three cell lines. Not all of the known TMPs were identified from these cell lines, possibly for the following reasons: i) lack or inaccessibility of primary amines in the extracellular domain of membrane spanning proteins; ii) too short or too long peptides were produced during the digestion, preventing identification by LC/MS/MS; iii) post-translational modifications may prevent the identification of peptides and proteins; iv) different abundance of proteins in a particular sample (dynamic range problem) 42 .
Besides the extracellular part of the TMPs, some abundant intracellular and extracellular proteins were also labeled (Supplementary Table 2). It is important to note that the observed intracellular labeling was restricted to intracellular proteins, and did not affect intracellular segments of TMPs. Labeled cytosolic proteins are abundant and likely originate from damaged cells that are attached to the cell surface in normal blood circulation 48 . For example, we detected labeled histone proteins that are bound to neutrophil extracellular traps containing DNA, histones and cell-specific granule proteins 49 . Other labeled cytosolic proteins have also been reported as adsorbed proteins of the particular cell surface (tubulin, actin), which were also present as contaminants in our data 50,51 . A similar level of intracellular protein labeling was reported by Hofmann et al. who used Cell Surface Capture (CSC) analysis to study surface proteins from four Hodgkin and four non-Hodgkin lymphoma cell lines 52 . Since the labeling of intracellular lysines was restricted to intracellular proteins that were labeled outside of the cells, we were confident that the experimental strategy was in keeping with the expected low false positive hit rate.
Our experimental data yielded 6 conflicting positions, 3 of which belong to two proteins that have homologous protein structures in the PDBTM database. For the topology assessment of ADT2_HUMAN (ADP/ATP translocase 2) protein we used a closely related structure, 2C3E (89% sequence identity) 53 , which suggested that Lys-147 is cytosolic and Lys-23 is localized in the membrane. Lys-23, which was detected 41 times in our experiments, is located in the cavity of the outward open structure, which is likely accessible to the Sulfo-NHS-SS-biotin reagent. Similarly, labeling of Lys-147 (detected 6 times) can also be explained by the penetration of the biotinylation agent through the open gate (Supplementary Figure 8).
Two positions with conflicting data were identified in the GTR1_HUMAN (Solute carrier family 2, facilitated glucose transporter member 1) protein, which were analyzed in the PDB:4PYP structure (99% sequence identity) 54 . This structure captures the protein in an inward open conformation where the coordinates of the bound ligand were also available. Positions Lys-245 and Lys-256 (both detected 3 times) belong to an α-helix positioned in front of the cytosolic entrance of the gate, suggesting that the labeling agent could have reached these lysines from the extracellular compartment in the open conformation. Interestingly, on the extracellular side, only Lys-117 was detected more than 50 times. The low level labeling of Lys-38, Lys-183 and Lys-300 is also consistent with the continuous transition between the open and closed states when the moving ligand covers these lysines while passing through the channel.
Cell Surface Capturing by chemical tagging of N-linked glycopeptides has been recently used for the characterization of the cell surface proteome of several cell lines 26 . Since glycosylation occurs only on extra-cytoplasmic segments, glycan-specific purification of peptides offers a highly specific method for the identification of extracellularly localized peptides. However, extracellular lysines are four times more frequent than the N-X-S/T motifs at which glycosylation occurs. Therefore, chemical labeling and identification of extracellular lysines may offer more input for topological predictions. Unfortunately, Lys-cell surface capture technology (Lys-CSC), reported by Hofmann et al., showed a very high level of labeling of intracellular TMP segments: 17% of the labeled lysines were located on intracellular segments and only 83% on the extracellular segments of TMPs (calculated using the topology prediction results of the CCTOP method on the observed TMPs and the unambiguously tagged lysine positions reported in the Supporting materials of Hofmann et al. 52 ). The parameter optimization reported in our work diminished false positive lysine labeling and increased the labeling accuracy to 98.7%, which allowed a significant improvement of the constrained topology prediction.
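The comparison between the two kinds of labeling sites can be illustrated by counting them in a sequence segment. The following is a minimal sketch, not the procedure used to derive the fourfold figure above: the function name and the example segment are hypothetical, and excluding proline at the X position of the N-X-S/T motif is an assumption based on the commonly used definition of the N-glycosylation sequon.

```python
import re

# N-X-S/T motif with X != P (assumed sequon definition); the lookahead
# lets overlapping motifs be counted as well.
SEQUON = re.compile(r"(?=N[^P][ST])")

def count_sites(segment):
    """Count lysines and N-X-S/T motifs in a protein segment."""
    return {
        "lysines": segment.count("K"),
        "sequons": len(SEQUON.findall(segment)),
    }

# Hypothetical extracellular segment, for illustration only.
sites = count_sites("MKNGTAKLNPSKKNASK")
```

In this toy segment the NPS triplet is skipped because of the proline, so two motifs and five lysines are counted.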
The topology information gained from our experimental results contributes to a more accurate topology prediction of the human transmembrane proteome. In future studies, we plan to increase the scope of the predictions by including further cell lines and cellular organelles expressing different unique TMPs 26 , combined with different proteases to increase sequence coverage. Further experimental data will provide a better understanding of the topology structure of individual TMPs and will help us to elucidate the structure-function relationship of transmembrane proteins.
Scientific Reports | 7:42610 | DOI: 10.1038/srep42610

Methods
Experimental Design and Statistical Rationale. The proteomic study was performed on two human cell lines (HL60, an acute promyelocytic leukemia cell line, and K562, a chronic myelogenous leukemia cell line), and on human red blood cells (RBC). Labeling reactions and downstream purifications were carried out in at least three biological replicates for each sample type. Mass spectrometry measurements were repeated at least three times from each isolation.
Human samples. The study was approved by the regional ethical committees (Department of Health, Office of Hungarian Government, Budapest, Hungary), and all procedures were performed in accordance with the Declaration of Helsinki. The blood samples were collected after obtaining written consent; sampling was performed at the Hungarian National Blood Transfusion Service 55 .
Cell isolation. RBC (1-2 × 10^10 RBC/sample) were isolated from 1-2 ml blood by centrifugation at 500 × g for 5 minutes at 4 °C; pelleted cells were washed with PBS three times to remove contaminating platelets and white blood cells. HL60 and K562 cells were collected similarly by centrifugation at 500 × g for 5 minutes at 4 °C. In the last washing step, 4 mM iodoacetamide alkylation agent was used for the blocking of free sulfhydryl groups to avoid the production of "piggy-back" disulphide peptides.
Membrane protein solubilization and digestion. Membrane proteins were solubilized in 100 mM NH4HCO3 buffer (pH = 8.0) containing 0.05-0.1% (w/v) Rapigest surfactant, 10% acetonitrile, 1 mM iodoacetamide and 1 mM 2,2′-thiodiethanol. The latter was used to prevent overalkylation during the overnight digestion. Solubilization was assisted by brief pulses of sonication followed by incubation on ice for 30 minutes. The suspension was incubated with 500 units of PNGaseF (New England Biolabs) for 2 hours at 37 °C before adding trypsin, chymotrypsin or thermolysin in a 1:50 (w/w) protease:protein ratio (the various enzymes were applied to separate samples). The samples were incubated at the optimum temperature of the given enzyme (trypsin: 37 °C, chymotrypsin: 30 °C, thermolysin: 70 °C) for 16 hours. Thermolysin and chymotrypsin digestion mixtures were supplemented with 0.5 mM CaCl2 or 10 mM CaCl2 and 0.5 mM MgCl2, respectively. Digestion was stopped by heat inactivation (95 °C for 10 min) followed by the addition of the appropriate enzyme inhibitors to the reaction mixture: 100 μM TLCK, 100 μM TPCK and 10 mM EDTA, 10 mM 1,10-phenanthroline in case of trypsin, chymotrypsin and thermolysin, respectively. Protein samples (10 μg) were loaded on a 12% SDS-PAGE to compare digestion efficiencies. Gels were stained with Coomassie Brilliant Blue.

Cell surface labeling. Cell surface biotinylation was performed using 2 mM Sulfo-NHS-SS-biotin (Thermo
Biotinylated peptide isolation. The biotinylated peptides were precipitated on neutravidin agarose beads (Pierce). In order to bind all biotinylated products, saturation of the neutravidin column was monitored by dot-blots (Supplementary Methods and Supplementary Figure 4). Digestion mixtures were incubated with 300-500 μl of packed, equilibrated neutravidin agarose beads for 2 hours at room temperature. Columns were washed extensively to reduce the number of non-specific peptides or contaminants. Washing steps were performed with 5-10 ml (20 bead volumes) of the following buffers: 100 mM NH4HCO3 (pH = 8.0), 5 M NaCl in PBS, 100 mM NH4HCO3 (pH = 8.0), 100 mM NaHCO3 (pH = 11.0) and a final wash with 100 mM NH4HCO3 (pH = 8.0) at 65 °C. Beads were transferred into a new, equilibrated spin column before the final washing step carried out with 100 mM NH4HCO3 (pH = 8.0). Enriched peptides were eluted by incubating the beads with 100 mM NH4HCO3 (pH = 8.0) buffer containing 50 mM DTT or TCEP for 1 hour at room temperature or by incubating the beads with 100 μl of concentrated formic acid (98%) for 1 hour at 37 °C. In order to avoid further disulphide-bridge formation, free sulfhydryls were alkylated either with iodoacetamide (Sigma), N-ethylmaleimide (Sigma), or 2-bromoethylamine (Sigma).
Mass spectrometry analysis and peptide identification. Peptide mixtures were analyzed by LC/MS/MS using two different instrument setups. In one setting a nanoAcquity (Waters, Milford, MA, USA) Ultrahigh Pressure Liquid Chromatography (UPLC) system was coupled online to a Linear Trap Quadrupole (LTQ)-Orbitrap Elite (Thermo Scientific, Waltham, MA, USA) mass spectrometer. 5 μl (~1/50-1/80) of the peptide mixture was injected onto a Symmetry C18 nanoAcquity UPLC trap column (Waters, 0.18 × 20 mm, 5 μm, 100 Å) with a flow rate of 10 μl/min for 2 min and separated on a BEH300C18 nanoAcquity UPLC column (Waters, 0.075 × 250 mm, 1.7 μm, 300 Å) using a linear gradient of 3-40% of solvent B in 40 or 100 min (solvent A was 0.1% formic acid in water, solvent B was acetonitrile containing 5% DMSO (dimethyl sulfoxide) and 0.1% formic acid, the flow rate was 300 nl/min). Survey scans measured in the Orbitrap (resolution = 60000) were followed by CID acquisitions in the linear trap, or HCD acquisitions, from the 10 or 5 most abundant multiply charged ions, respectively (normalized collision energy was 35% and dynamic exclusion was enabled for 30 sec exclusion duration). In some cases, the peptide mixture eluted from the neutravidin gel was further purified or prefractionated using a C18 ZipTip (Millipore).
In another setting we used a Bruker Maxis II ETD Q-TOF (Bremen, Germany) mass spectrometer with CaptiveSpray nanoBooster ionization source coupled to a Dionex Ultimate 3000 NanoLC System (Sunnyvale, CA, USA). Peptides were trapped on an Acclaim™ PepMap100™ C18 Nano-Trap column (5 μm, 100 Å, 100 μm × 20 mm, Thermo Fisher Scientific, Waltham, MA, USA) and separated online using a 15 cm Waters Peptide BEH C18 nanoACQUITY 1.7 μm particle size UPLC column and gradient elution (2.5-25% eluent B in 80 min, then 25-45% eluent B in 20 min). Solvent A was water + 0.1% formic acid (FA), while solvent B was acetonitrile + 0.1% FA. For the MS measurements a fixed cycle time of 2.5 sec was used, MS spectra were acquired at 3 Hz in the 150-2200 m/z mass range, while CID was performed at 16 Hz for abundant precursors and at 4 Hz for ones of low abundance.
For the LTQ-Orbitrap Elite data, Proteome Discoverer (Thermo, v1.4) or PAVA script 59 was used for MS/MS peak list generation and the database search was executed using ProteinProspector v5.14.1. At least two consecutive searches were performed. First, the complete SwissProt (SwissProt.06.10.2014, 545388/545388 entries searched) protein database was used to identify proteins present in the sample. For the second search the following database was compiled and concatenated with its randomized sequences: the human UniProt (UniProtKB.2015.4.16, 145723/47262724 entries searched) database was concatenated with the SwissProt hits. The data were searched for tryptic peptides (if chymotrypsin or thermolysin was used for digestion, the appropriate enzyme was set) with one or two non-specific and 2-3 missed cleavage sites. No constant modification was used but several variable modifications were set: carbamidomethyl (C), oxidation (M), deamidation (NQ), thioacylation (K) and carbamidomethylthio-propanoylation (K). When the peptides were eluted from the beads with formic acid, the thioacyl-Biotin (+389.090 Da) modification of lysine residues was set. When an alkylating agent other than iodoacetamide was applied to separate samples after the DTT or TCEP elution, additional variable modifications were listed for lysine residues: Thio(AE) (+131.04 Da) or Thio(NEM) (+213.045 Da), and for cysteine residues: aminoethyl (+43.04 Da) or N-ethylmaleimide (+125.048 Da) (Supplementary Table 1). A maximum of 3 modifications was permitted per peptide. Mass tolerance was set to 10 ppm for the precursor ions. Fragment ion mass accuracy was set to 0.6 Da for ion trap CID data and 25 ppm for HCD data. The ProteinProspector default acceptance criteria (min. 15 and 22 score and 0.05 and 0.01 max. E value for peptides and proteins, respectively) were used for the evaluation process with manual inspection for labeled Lys containing peptides.
The false discovery rate was calculated by dividing twice the number of decoy peptide hits by the number of identified spectra, and it was found to be less than 1% in every search result.
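The false discovery rate estimate described above amounts to a one-line calculation. The sketch below uses a hypothetical function name and invented counts purely for illustration; it is not output from any of the searches.

```python
def decoy_fdr(decoy_hits, identified_spectra):
    """Estimate the false discovery rate as twice the number of decoy
    peptide hits divided by the number of identified spectra."""
    if identified_spectra <= 0:
        raise ValueError("identified_spectra must be positive")
    return 2 * decoy_hits / identified_spectra

# Hypothetical counts: 12 decoy hits among 3000 identified spectra.
fdr = decoy_fdr(12, 3000)
passes_threshold = fdr < 0.01  # the 1% limit stated in the text
```

The factor of two reflects the assumption that a search against a concatenated target and randomized database produces as many false matches among the target hits as among the decoy hits.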
For the QTOF data, raw data were first recalibrated using Bruker Compass DataAnalysis software 4.3 (Bruker Daltonik GmbH, Bremen, Germany) for the internal calibrant. MS/MS peak list generation was performed using ProteinScape software 3.1 (Bruker Daltonik GmbH, Bremen, Germany). As above, the samples were first matched with the human SwissProt database (SwissProt 2014_08, 546238/546238 entries searched). The decoy database was generated by Mascot. The parameters of the Mascot search engine were set as follows: semiTrypsin enzyme, maximum 4 missed cleavages, carbamidomethyl (C) as fixed modification and several variable modifications: oxidation (M), deamidation (NQ), thioacylation (K and protein N-term) and carbamidomethylthio-propanoylation (K and protein N-term). MS tolerance was set to 7 ppm; MS/MS tolerance was 0.05 Da. The Mascot ion score corresponding to p < 0.05 was 13; however, to ensure confident modified peptide identifications, matches were accepted only above a score of 24. The false discovery rate was less than 1% in every search result.

Processing of the MS results.
Mass spectrometry experiments yielded confidently identified modified peptides belonging to different proteins (Supplementary Table 2). Since different search engines and databases were used for the data evaluation, we unified the results by mapping all the resulting peptides to UniProt human sequences and filtering to 95% sequence identity of the mapped proteins. We used the CCTOP algorithm to decide whether the mapped proteins were indeed TMPs (containing at least one predicted TMS) or not. Positions corresponding to labeled peptides from at least three different biological replicates were considered further (Supplementary Table 3).
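The replicate filter in the last step can be sketched as a simple grouping operation. The record layout, identifiers and function name below are hypothetical and serve only to illustrate the rule that a position is retained when it is supported by labeled peptides from at least three different biological replicates.

```python
from collections import defaultdict

def filter_positions(observations, min_replicates=3):
    """Keep (protein, position) pairs observed in at least
    `min_replicates` distinct biological replicates."""
    replicates = defaultdict(set)
    for protein, position, replicate in observations:
        # A set ignores repeated detections within the same replicate.
        replicates[(protein, position)].add(replicate)
    return sorted(key for key, reps in replicates.items()
                  if len(reps) >= min_replicates)

# Hypothetical observations: (protein, lysine position, replicate label).
observations = [
    ("PROT_A", 23, "rep1"), ("PROT_A", 23, "rep2"),
    ("PROT_A", 23, "rep3"), ("PROT_A", 23, "rep3"),
    ("PROT_B", 117, "rep1"), ("PROT_B", 117, "rep1"),
    ("PROT_B", 117, "rep2"),
]
kept = filter_positions(observations)
```

Here the PROT_B position is discarded: it was detected three times, but only in two distinct replicates.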