Structural Analysis of Multiprotein Complexes by Cross-linking, Mass Spectrometry, and Database Searching

Most protein complexes are inaccessible to high resolution structural analysis. We report the results of applying a combined approach of cross-linking, mass spectrometry, and bioinformatics to two human complexes containing large coiled-coil segments, the NDEL1 homodimer and the NDC80 heterotetramer. An important limitation of the cross-linking approach has so far been the identification of cross-linked peptides from fragmentation spectra. Our novel approach overcomes this data analysis bottleneck of cross-linking and mass spectrometry. We constructed a purpose-built database to match spectra with cross-linked peptides, defined a score that expresses the quality of our identification, and estimated false positive rates. We show that our analysis sheds light on critical structural parameters such as the directionality of the homodimeric coiled coil of NDEL1, the register of the heterodimeric coiled coils of the NDC80 complex, and the organization of a tetramerization region in the NDC80 complex. Our approach is especially useful for complexes that are difficult to address by standard structural methods.

Molecular & Cellular Proteomics 6:2200-2211, 2007.
Mass spectrometry-based proteomics is a powerful tool for the analysis of multiprotein complexes (1). Thousands of complexes have been isolated, and their protein compositions have been determined (2)(3)(4). Although many complexes will feed into large scale crystallization trials, only a few are likely to reveal their structure. Many protein complexes are heterogeneous, insoluble at the concentrations needed for crystallization, or yield crystals lacking the quality needed for structure determination. When structures are obtained they often comprise only parts of the proteins because difficult areas have been removed to increase solubility or crystallization properties. Cross-linking in conjunction with mass spectrometry is a very promising tool to yield structural information on proteins and protein complexes that is difficult to address using standard structural methods (5). Just as mass spectra can reveal the identity of the protein components of a complex, if the complex or protein has been cross-linked mass spectra can be used to identify direct proximity of proteins in a complex (6) and aid fold recognition of proteins (7). Although this has been shown in proof of principle (6,7), general application has yet to be achieved.
The success of mass spectrometry in identifying proteins is largely due to the apparent simplicity and the automation of protein identification using mass spectrometric data. Three features are central to the automation. (a) Based on an observed peptide mass, a list of candidate peptides can be extracted from protein databases. (b) The candidate peptides can be evaluated by assessing their match to the fragmentation spectrum, resulting in a single number, the score. (c) The rate of false identifications can be estimated by computing the likelihood of a random hit. Unfortunately, this straightforward automation procedure could not so far be applied to cross-linked peptides.
In the absence of automatic tools similar to those used for normal peptide identification, cross-linking cannot be used routinely for structural analysis of multiprotein complexes. Indeed work based on identifying cross-linked peptides has so far been limited to complexes composed of not more than two different proteins (5). Standard database search tools cannot create a list of candidate cross-linked peptides based on the observed mass. A number of dedicated programs consider pairs of peptides contributing together to the observed mass (7)(8)(9)(10)(11)(12)(13)(14)(15)(16)(17). The candidates then need to be validated on the basis of their match to fragmentation spectra. Currently this requires screening the spectra either completely manually or through software assistance (13,15,16,18). No scoring system or algorithm has yet been developed to replace human intervention. Ideally such a scoring algorithm would objectively sort the false from the true matches and add a measure of confidence to the results.
Here we present an algorithm that automatically finds and validates cross-linked peptides using fragmentation spectra, thereby overcoming the key limitation in the analysis of protein cross-links. Proteins are cross-linked using a 1:1 mixture of stable isotope-labeled and non-labeled cross-linker to reduce false positive rates of the process. Proteins are digested using trypsin, and peptides are analyzed by LC-MS/MS prior to data analysis with our algorithm (see Fig. 1). The algorithm was applied to data acquired from two human coiled-coil complexes, the NDEL1-(17-174) homodimer (38 kDa) and the NDC80 heterotetramer (176 kDa). We used a standard database search tool, Mascot (19), and a purpose-built cross-link database (XDB) that contains cross-linked peptides represented as single linear peptides. Our algorithm assigns a score and describes the confidence of each match through comparison with negative controls.

EXPERIMENTAL PROCEDURES
Purification of Protein Complexes-The NDC80 complex was purified according to our published procedure (20).
NDEL1-(17-174) was purified as described earlier (21) with the following changes. Polymerase chain reaction fragments of human NDEL1 corresponding to amino acids 17-174 of the full-length protein were subcloned in the pGEX6P-1 expression vector (GE Healthcare) and expressed in Escherichia coli strain BL21(DE3). The protein was purified by glutathione affinity chromatography. The GST tag was removed using Prescission protease (GE Healthcare), and the resulting sample was further purified by size exclusion chromatography using a Superdex 200 column equilibrated with 10 mM Hepes, pH 7.5, 100 mM NaCl. Treatment of the GST fusion protein with the Prescission protease leaves a 5-residue extension at the N terminus of NDEL1-(17-174), numbered −5 to −1.
Cross-linking-Human NDEL1-(17-174) (29 µg of protein equivalent to 775 pmol) and human NDC80 complex (15 µg of protein equivalent to 86 pmol) were mixed with a 100× excess of isotope-labeled cross-linker bis(sulfosuccinimidyl)glutarate (BS2G) (Pierce) in a final volume of 150 µl of 10 mM Hepes, pH 7.5, 100 mM NaCl at room temperature. The cross-linker, a 1:1 mixture of light BS2G-d0 and heavy BS2G-d4, was freshly prepared as a 10 nmol/µl solution in DMSO. The reaction was stopped after 30 min by adding 5 µl of 1 M ammonium bicarbonate. Sample buffer was added for separation by SDS-PAGE.
Digestion-The samples were electrophoresed through Novex NuPAGE 1-mm 4-12% Tris-glycine gels (Invitrogen) in MOPS buffer (Invitrogen), fixed in 50% methanol, 5% acetic acid, and stained with the colloidal blue kit (Invitrogen). Bands were excised and processed following a standard trypsin digestion procedure (22): reduction in 100 mM DTT for 30 min at room temperature, alkylation with 55 mM iodoacetamide for 30 min at room temperature in the dark, and digestion with 12.5 ng/µl trypsin (proteomics grade, Sigma) overnight at 37°C. The supernatant was loaded onto StageTips (23), and peptides were eluted in 20 µl of 80% acetonitrile, 0.1% trifluoroacetic acid. The acetonitrile was allowed to evaporate off (Concentrator 5301, Eppendorf AG, Hamburg, Germany), and the volume of each eluate was adjusted to 5 µl with 1% trifluoroacetic acid, of which 2.5 µl, i.e. half, was injected for LC-MS/MS analysis.
Nano-LC-MS/MS and Data Analysis-The proteins, after digestion with trypsin, were analyzed by LC-MS/MS using an HPLC system (1100 binary nanopump, Agilent, Palo Alto, CA) coupled on line to an ion trap FTICR hybrid mass spectrometer (LTQ-FT, ThermoElectron, Bremen, Germany). C18 material (ReproSil-Pur C18-AQ 3 µm, Dr. Maisch GmbH, Ammerbuch-Entringen, Germany) was packed into a spray emitter (75-µm inner diameter, 8-µm opening, 70-mm length; New Objectives) using an air pressure pump (Proxeon Biosystems, Odense, Denmark) to prepare an analytical column with a self-assembled particle frit (24). Mobile phase A consisted of water, 5% acetonitrile, and 0.5% acetic acid, and mobile phase B consisted of acetonitrile and 0.5% acetic acid. The samples were loaded from an Agilent 1100 autosampler onto the column at a 700 nl/min flow rate. The gradient had a flow rate of 300 nl/min, and the percentage of buffer B varied linearly from 0 to 20% in the first 77 min and then from 20 to 80% in a further 15 min. We used a SIM method for mass acquisition (25) with one low resolution FT-MS scan (fill target, 1,000,000 ions; resolution, 25,000; maximum fill time, 2 s; mass range, m/z 300-1575). The three most intense signals (dynamic exclusion for 180 s) were selected for SIM (fill target, 500,000 ions; maximum fill time, 50 ms; window, m/z 22) in the FTICR cell and MS2/MS3 in the ion trap (normal scan; wideband activation; fill target, 10,000 ions; maximum fill time, 100 ms). Each cycle lasted ~3 s.
In principle, precursor selection for MS/MS could be directed onto doublet signals, focusing the fragmentation on candidate cross-linked peptides. However, the usually weak signals of cross-linked peptides are of low quality in a full FT-MS spectrum, so an FT-SIM scan is necessary for reliable observation of both signals of a doublet. Directed selection of precursors would require the acquisition of the fragmentation spectrum to follow the SIM scan, whereas on our LTQ-FT the MS/MS spectrum can be recorded in the ion trap in parallel to the SIM scan being recorded in the FT cell. Recording both spectra in parallel, regardless of the multiplicity of the precursor, is more time-efficient than recording first the SIM scan and then an MS/MS scan for peptides with doublet signals. The doublet information is instead used for post-acquisition data filtering. High mass accuracy FT-MS/MS would have a different time economy: because the MS/MS spectrum would then have to be recorded after the MS and SIM spectra, doublet-directed sequencing would be highly advisable.
Peaks were picked from the raw data files using DTAsupercharge (version 0.94, made available by SourceForge, Inc.) with the following settings: precursor mass deviation, m/z 0.08; smart picking for MS/MS activated; maximum search level, 8. The four lists, one for each band in the gel, contained in total the fragment information of 14,871 precursors. A peak list was then created for light precursors, and a corresponding peak list was created for heavy precursors. The apparent occurrence of isotopic doublets, which indicates the presence of the cross-linker, was used to enrich the dataset for spectra of cross-linked peptides. For the selection of doublets, all SIM scans were extracted from the raw data files by a custom program written in the ".NET"-integrated programming language C# using the XDA-api (Xcalibur Development kit, Thermo Inc.). We then extracted, for all precursors in the complete list, the m/z, charge state, and scan number. The appropriate scan was located in the SIM file, and it was determined whether the precursor had a partner signal with intensity 0.4-2.5× at plus or minus 4.025 ± 0.01 Da. The partner intensity threshold was imposed to take into account that peptides containing the non-deuterated and deuterated cross-linker show shifted elution profiles due to the difference in isotope composition of the two species. This shift can result in partner peak intensity ratios different from 1:1, depending on the timing of the SIM acquisition with respect to the elution profiles of the peptides. The threshold values were determined by inspecting SIM scans of peptides that we identified as being modified with the hydrolyzed cross-linker. If there was a matching signal above the precursor m/z, the precursor was taken as a candidate peptide containing the light form of the cross-linker, and the MS/MS peak list of the precursor was added to the peak list of light precursors.
Equivalently if a matching signal was found below the precursor m/z, the precursor was taken as a candidate peptide containing the heavy form of the cross-linker, and the MS/MS peak list of the precursor was added to the peak list of heavy precursors. This process resulted in a total of 1452 queries, i.e. 10% of the acquired data.
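The doublet filter described above can be sketched as follows. This is a minimal illustration, not the C# program used for the analysis. The function name `classify_precursor` and the peak representation are hypothetical, and dividing both the 4.025-Da shift and its 0.01-Da tolerance by the charge state to obtain an m/z spacing is an assumption.

```python
DOUBLET_SHIFT_DA = 4.025    # mass difference between heavy and light cross-linker
SHIFT_TOLERANCE_DA = 0.01   # allowed deviation of the shift
RATIO_RANGE = (0.4, 2.5)    # allowed partner/precursor intensity ratio


def classify_precursor(mz, charge, intensity, sim_peaks):
    """Return 'light', 'heavy', or None for one precursor.

    sim_peaks is a list of (m/z, intensity) pairs from the matching SIM scan.
    Assumption: shift and tolerance are divided by the charge to work in m/z.
    The precursor intensity is assumed to be greater than zero.
    """
    spacing = DOUBLET_SHIFT_DA / charge
    tolerance = SHIFT_TOLERANCE_DA / charge
    for peak_mz, peak_intensity in sim_peaks:
        ratio = peak_intensity / intensity
        if not (RATIO_RANGE[0] <= ratio <= RATIO_RANGE[1]):
            continue
        delta = peak_mz - mz
        if abs(delta - spacing) <= tolerance:
            return "light"   # partner above the precursor: precursor is light
        if abs(delta + spacing) <= tolerance:
            return "heavy"   # partner below the precursor: precursor is heavy
    return None


# Hypothetical doubly charged precursor with a partner 4.025 Da above it
print(classify_precursor(500.0, 2, 1000.0, [(502.0125, 900.0)]))
```

Precursors classified as `None` would be dropped, which corresponds to the data reduction step described in the text.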
Peptide Identification via Mascot Database Search-The complete peak list for each band from the gel was searched against Swiss-Prot (www.expasy.org/sprot), to which the exact sequences of the recombinant proteins under investigation were added, using Mascot (version 2.0) with the following parameters: monoisotopic mass values; peptide tolerance, 0.08 Da; MS/MS tolerance, 0.5 Da; instrument, ESI-TRAP; fully tryptic specificity; cysteine carbamidomethylation as fixed modification; oxidation on methionine and hydrolyzed cross-linker on protein N terminus, lysine, serine, and tyrosine as variable modifications; two missed cleavage sites allowed. The results of this first search were used in three ways. First, the peptides matching the protein complex members were used to estimate the mass accuracy of the analysis. We took as the mass accuracy the mass deviation that included 97% of the identified peptides (588 peptides). This value was ~4.5 ppm for all four sets of data. For comparison, the average deviation was 1.3 ppm. Second, we could see the extent of side reaction of the amine-specific cross-linker with serine and tyrosine. We detected only a very small number of serine-containing peptides being modified, and there was no indication of tyrosine modification. Therefore, we did not further consider these modifications in our analysis. Third, the identified proteins were selected for the construction of the XDB. Although we worked with a purified complex, in addition to the four expected proteins several other proteins were present in the sample as judged from the gel; presumably they were contaminants from the expression system. One approach would be to identify all proteins present and to include them in the XDB. This would, however, unnecessarily inflate XDB as we are only interested in those proteins actually found in the respective fraction we analyze.
By searching Swiss-Prot for each analysis we ensure that we consider exactly those proteins that can be detected in the respective gel band. Identifying cross-linked peptides with a standard database search tool like Mascot also required a special database and a separate search. The selected proteins were digested in silico allowing for up to two missed cleavages. The obtained peptides were filtered to contain either an internal Lys or the protein N terminus and joined up in all possible pairwise combinations. It is essential to have both linear permutations of a peptide pair, i.e. AB and BA, to allow the complete matching of fragments (see also Fig. 1 and "Results"). Creating one entry per peptide pair has the disadvantage of resulting in many short entries and occupying more memory than combining the peptides in linear succession. If a protein P gives a peptide set [a, b, c] and a protein Q gives a peptide set [A, B, C], then XDB would contain a single protein in which the peptides of the proteins P and Q were concatenated in a single sequence as caabacbbccCAABACBBCCcAaAbAcBaBbBcCaCbC. The search program will create from this sequence all possible pairs in both permutations: aa, ab, ba, bb, bc, cb, cc, ca, ac, etc. This way of constructing a cross-link database is a very condensed way of writing all possible pairwise combinations, in our example leading to 30 letters instead of 54, i.e. resulting in almost 50% compaction. Note that the peptides concatenated in XDB contain missed cleavage sites. The search will also create chimeric peptides containing parts of the two original peptides. These are known false positives. The reversed cross-link database was obtained by inverting the entire cross-link database, i.e. writing the sequence from C terminus to N terminus.
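The condensed concatenation can be checked with a short script. The sketch below reads the example sequence above as a plain run of letters, one letter per peptide, and tests whether every ordered pair of the six peptides occurs as two neighbouring letters. The helper name `adjacent_pairs` is hypothetical, and the check covers only the toy example, not a real XDB.

```python
from itertools import product


def adjacent_pairs(condensed):
    """Ordered pairs of neighbouring peptides in a condensed sequence."""
    return {condensed[i:i + 2] for i in range(len(condensed) - 1)}


# One letter per peptide: a, b, c from protein P and A, B, C from protein Q
peptides = "abcABC"
condensed = "caabacbbccCAABACBBCCcAaAbAcBaBbBcCaCbC"

# Every ordered pair, including a peptide paired with itself (aa, ab, ba, ...)
wanted = {x + y for x, y in product(peptides, repeat=2)}
missing = wanted - adjacent_pairs(condensed)

print("pairs not covered:", sorted(missing))
```

An empty result means that a search allowing one joined neighbour would meet every peptide pair in both permutations without each pair being stored as a separate entry.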
The cross-link database was searched with the peak lists of light and heavy precursors using Mascot with the parameters: monoisotopic mass values; peptide tolerance, 0.08 Da; MS/MS tolerance, 0.5 Da; instrument, ESI-TRAP; fully tryptic specificity; cysteine carbamidomethylation as fixed modification; light or heavy cross-linker hydrolyzed and oxidation on methionine as variable modifications; five missed cleavage sites allowed. For the second control using a wrong mass for the cross-linker, 3 Da were added to the correct masses of the heavy and light cross-linker in the modification file of Mascot.

Identification of Cross-links by Analyzing the Database Search Results-The database search retrieves the peptides that match the observed mass. Mascot does not return all candidates but only those considered non-random based on an initial matching of fragments. Mascot already uses some fragment information to select higher value candidates than obtained on the basis of the measured peptide mass alone. The cross-linked peptides can be found as miscleaved peptides in the output of the database search. The score used for expressing the quality of match between a spectrum and a cross-linked peptide is presented under "Results." When calculating our score, we considered all b- and y-ions of the cross-linked peptide. Other ions such as those resulting from loss of water or ammonia, internal fragments, and multiply charged fragments are observable (26) but not currently included in the algorithm. Considering all possible fragments results in a large number of mass values and lowers the selectivity of the score at our current, low mass accuracy (±0.5 Da). For the Mascot-independent matching of precursor masses with predicted cross-linked peptides we wrote a Perl script that computes all predicted cross-linked peptides matching to precursors within 4.5 ppm deviation based on input protein sequences and number of missed cleavages allowed (two) and the amino acid required for the linkage (lysine or protein N terminus). Note that we focus on those products composed of two linked peptides and containing a single cross-linker molecule. Including other cross-link products is possible but increases the search space and consequently the background of the data analysis. Currently data of peptides containing more than one cross-linker do not contribute to the background because, not being identified as a doublet of 4-Da spacing, they are filtered out.
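The mass-matching step within 4.5 ppm can be illustrated as follows. The script used for the analysis was written in Perl and is not reproduced here; this Python sketch only shows the tolerance test. The function names and the candidate masses are hypothetical.

```python
def ppm_deviation(observed_mass, predicted_mass):
    """Signed deviation of the observed from the predicted mass in ppm."""
    return (observed_mass - predicted_mass) / predicted_mass * 1e6


def matching_candidates(observed_mass, candidates, tolerance_ppm=4.5):
    """Return the labels of candidates whose predicted mass lies within tolerance.

    candidates maps a label for a predicted cross-linked peptide to its mass in Da.
    """
    return [
        label
        for label, predicted_mass in candidates.items()
        if abs(ppm_deviation(observed_mass, predicted_mass)) <= tolerance_ppm
    ]


# Hypothetical predicted masses for two candidate peptide pairs
candidates = {"pair_1": 2000.005, "pair_2": 2000.020}
print(matching_candidates(2000.000, candidates))
```

With these invented values, the first candidate deviates by about 2.5 ppm and is kept, whereas the second deviates by about 10 ppm and is rejected.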

Algorithm
Spotting Candidate Signals of Cross-linked Peptides-We used a cross-linker targeting amino groups, BS2G, in a 1:1 mixture of its unlabeled light and labeled heavy form (the latter containing four deuterium atoms) (27). The use of a light/heavy mixture results in doublet mass signals for those cases in which the cross-linker was incorporated between two peptides or alternatively on a single peptide. We began our analysis by selecting from the entire LC-MS/MS dataset only those fragmentation spectra of peptides with doublet signals, thus focusing our analysis on fragmentation products of cross-linked peptides (Fig. 1a). It should be noted that the doublet information serves solely for the reduction of data to focus the analysis onto likely cross-linked peptides. In this way, the false positive rate of the database search is minimized. For small datasets such as obtained for a single protein or a small complex this will not be necessary, and any cross-linker can be used.
Assigning Candidate Peptide Pairs-Two observations allowed the construction of a special database for the identification of cross-linked peptides. First, a cross-linked peptide has the same mass as a peptide obtained by fusing the two linked peptides via a normal peptide bond and adding the hydrolyzed cross-linker as a modification (Fig. 1b). Second, from its ends up to the linkage site the "linearized" virtual peptide generates the same fragments as the cross-linked peptide (Fig. 1c). The two possible permutations of the linearized peptide (αβ and βα) cover the entire set of possible single bond fragments of the cross-linked peptide, which are usually the most intense signals. If a linearized version of all possible cross-linked peptides were built, a standard database search tool should make it possible to find cross-linked peptides using fragmentation data very much like any ordinary peptide. Thus, in our XDB, every peptide in a target protein is combined with every other peptide in a linear sequence and in both permutations (Fig. 1d). We took into account cross-links within a protein and between proteins. A standard database search algorithm can now find matches to the observed mass of cross-linked peptides simply by allowing for missed cleavages of the enzyme used for digestion and considering the hydrolyzed cross-linker as a variable modification. Mascot (19) can thus be used in its normal function as a database search tool to find peptides matching the experimental data. The analysis of the non-linked peptides gives a clear indication of the mass accuracy of the measurement in MS and MS/MS and which proteins to include in XDB.
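The first observation, that linearization leaves the mass unchanged, follows from a simple balance: fusing two peptides removes one water, and the hydrolyzed cross-linker carries one water more than the bridging cross-linker. The sketch below checks this numerically. The elemental compositions assumed for the light BS2G bridge (C5H4O2) and its hydrolyzed form (C5H6O3) and the residue masses are standard monoisotopic values supplied here as assumptions; they are not taken from the text, and the example peptides are invented.

```python
# Monoisotopic atomic masses in Da
C, H, O = 12.0, 1.00782503, 15.99491462
WATER = 2 * H + O

# Assumed mass additions of light BS2G (not stated in the text)
XL_BRIDGE = 5 * C + 4 * H + 2 * O       # linker bridging two amino groups
XL_HYDROLYZED = 5 * C + 6 * H + 3 * O   # linker with one end hydrolyzed

# Monoisotopic amino acid residue masses in Da
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def peptide_mass(sequence):
    """Neutral monoisotopic mass of a linear, unmodified peptide."""
    return sum(RESIDUE[aa] for aa in sequence) + WATER


def crosslinked_mass(alpha, beta):
    """Two separate peptides joined by one bridging cross-linker."""
    return peptide_mass(alpha) + peptide_mass(beta) + XL_BRIDGE


def linearized_mass(alpha, beta):
    """Peptides fused by a peptide bond plus one hydrolyzed cross-linker."""
    return peptide_mass(alpha + beta) + XL_HYDROLYZED


# Hypothetical peptides, each with an internal lysine
alpha, beta = "AKTLER", "LKDGSR"
print(crosslinked_mass(alpha, beta), linearized_mass(alpha, beta))
```

Both permutations, `alpha + beta` and `beta + alpha`, give the same linearized mass, so the precursor mass alone cannot distinguish them; only the fragment matches differ.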
For small complexes, the candidate list of cross-linked peptides returned from the database search is short enough for manual validation. However, the number of candidates increases dramatically with the size of the complex. This increase of complexity follows the third power of the number (n) of tryptic peptides, assuming the complexity of the dataset increases linearly with n and the database size increases with n². Only a score, ideally probabilistic, can free the investigator from having to validate every match manually. A score expressing the quality of match between a spectrum and a linear peptide, such as the Mascot score, does not fulfill this function. The database search program computes the score based on the fragment matches to the linearized version of the cross-linked peptide (Fig. 1e). Both permutations of the linearized version together account for all single bond cleavages of the cross-linked peptide. However, as only one permutation at a time is considered by the database search program, only a subset of observed fragments is matched, whereas the other subset is not and thus lowers the score.

FIG. 1. a, a protein complex is cross-linked using a 1:1 mixture of the light (L; unlabeled) and heavy (H; stable isotope-labeled) versions of a cross-linking agent. The complex is then digested, and peptides are analyzed by LC-MS/MS. Peptides containing the cross-linker are recognized as doublets in the MS spectrum. b, a cross-linked peptide can be linearized without changing the mass into a missed cleavage peptide carrying a hydrolyzed cross-linker as modification for the purpose of using standard database search algorithms. c, the single bond fragments of a cross-linked peptide coincide with those of the two permutations of the linearized peptide between peptide termini and linked residues (asterisks). Each of the two linearized peptides accounts for a subset of the possible single bond fragments of the cross-linked peptides, and together they account for the complete set. d, the acquired fragmentation data are used in standard database searching to identify the cross-linked proteins. The sequences of these proteins are used to construct an XDB in which all peptides that are considered candidate partners in cross-linking are combined linearly with each other in permutations αβ and βα. e, the fragmentation data are used to search XDB. A candidate cross-link between peptide α and β is found as a missed cleavage αβ (hit 1) and/or βα (hit 2). The candidates are then scored as cross-linked peptides.
Scoring-We developed a scoring algorithm that expresses how well an identified cross-linked peptide agrees with the experimental fragmentation spectrum, based on an algorithm recently used for MS3 scoring (25). This score does not provide an absolute answer, but together with the estimation of false positives described below it can be used to conclude whether an identification is correct or not. The scoring algorithm uses a probabilistic approach that considers the fragments of a cross-linked peptide. The scoring includes only matches with the most intense ions of the spectrum in a given m/z window. Ion matching is performed with the same tolerance used in the database search. To describe the chance of matching the sequence of a cross-linked peptide to a fragmentation spectrum we use the binomial distribution

$$S = \binom{n}{k}\, p^{k} (1 - p)^{n - k}$$ (Eq. 1)

where k is the number of matched ion masses, n is the number of calculated fragments in the mass range under consideration, and p is the probability of a random match for a fragment mass. For convenience, S is reported as a score similar to the Mascot score.

$$S' = -10 \log_{10} S$$ (Eq. 2)

The probability of a random match (p) for fragment masses is given by the number of considered peaks (N) in an m/z window of width W and the mass accuracy (Δm, the ± tolerance) used in the database search as Equation 3.

$$p = \frac{2 N \Delta m}{W}$$ (Eq. 3)

We found empirically that selection of the four most intense peaks in an m/z window of width 100 Da represents the best parameters for scoring our data. For ion trap fragmentation data (mass error, ±0.5 Da) the probability of a random match is 0.04. The same values have been used in the algorithm for MS3 scoring (25) and another algorithm used for spectrum matching (28).
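The scoring arithmetic can be sketched in a few lines. The sketch assumes that Eq. 1 is the single-term binomial probability and that Eq. 3 has the form p = 2NΔm/W, which reproduces the stated value p = 0.04 for N = 4, Δm = 0.5 Da, and W = 100. The function names and the example fragment counts are hypothetical.

```python
from math import comb, log10


def random_match_probability(n_peaks=4, window=100.0, tolerance=0.5):
    """Chance that one fragment mass hits a considered peak by accident.

    Assumed form: n_peaks tolerance windows of width 2 * tolerance per m/z window.
    """
    return 2 * n_peaks * tolerance / window


def binomial_probability(k, n, p):
    """Probability of exactly k random matches among n calculated fragments."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def crosslink_score(k, n, p):
    """Score on a -10 log10 scale; higher means less likely to be random."""
    return -10 * log10(binomial_probability(k, n, p))


p = random_match_probability()
# Hypothetical cases: 3 or 10 matched ions out of 20 calculated fragments
print(p, crosslink_score(3, 20, p), crosslink_score(10, 20, p))
```

With these invented counts, 10 matched fragments out of 20 score far higher than 3 out of 20, which is the behaviour needed to sort candidates before validation.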
False Positive Estimation-Database searches conducted under conditions that yield only false results and no true cross-links give an estimate of how many wrong identifications we expect at a given score cutoff. In conventional large scale protein identification experiments, false positive rates are determined by searching against a control database containing reversed or otherwise falsified sequences (29). The operator can then determine for any score cutoff the rate of random, false matches in the database search by looking at how many matches were found using the same score cutoff in the control database search. We adapted the same concept for our XDB creating and searching a reversed XDB. As a second negative control we conduct a database search against XDB but using a false mass for the cross-linker. This is a more strict control because in any false match one of the two peptides may actually be correct, and only the second one may be wrong.
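Counting control hits at a score cutoff can be illustrated as follows. This is a minimal sketch. The function name and the score lists are hypothetical, and expressing the estimate as the ratio of control hits to target hits is an assumption about how the counts would be summarized.

```python
def false_positive_estimate(target_scores, control_scores, cutoff):
    """Count hits at or above the cutoff in the target and control searches.

    Returns (target hits, control hits, control/target ratio).
    The ratio is 0.0 when no target hit passes the cutoff.
    """
    n_target = sum(score >= cutoff for score in target_scores)
    n_control = sum(score >= cutoff for score in control_scores)
    ratio = n_control / n_target if n_target else 0.0
    return n_target, n_control, ratio


# Hypothetical scores from a search against XDB and against a control
target_scores = [50, 40, 30, 20, 10]
control_scores = [25, 12, 8]
print(false_positive_estimate(target_scores, control_scores, cutoff=20))
```

The same function applies to either negative control: the reversed XDB or the search with a false cross-linker mass.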

Coiled Coil Analysis: the NDEL1 Homodimer
Kinetochores are complex proteinaceous scaffolds that represent the site of attachment of chromosomes to the mitotic spindle (30). Several coiled-coil proteins play essential roles at the kinetochore. These include NDEL1 and the members of the NDC80 complex. High resolution structural analysis of large coiled-coil complexes such as the four-protein NDC80 complex and the NDEL1 homodimer suffers from the general difficulty of crystallizing coiled-coil domain-containing complexes. In particular, the elongated shape of these domains and the difficulties in determining the register and overall organization of heterologous and/or antiparallel coiled coils restrain the potential of protein engineering for designing stable constructs for crystallization.
As a proof of principle to establish our approach, we tried to detect sites of cross-link on the coiled-coil domain of NDEL1 (residues 17-174, indicated as NDEL1-(17-174)) (Fig. 2a). NDEL1 is a regulator of cytoplasmic dynein that acts by forming a complex with LIS1 (31). NDEL1 localizes to the centrosome where it is implicated in centrosomal separation and centrosomal maturation and for mitotic entry (32). NDEL1 also localizes to kinetochores during mitosis and is believed to regulate dynein function there during the process of microtubule-kinetochore attachment (33). NDEL1 binds LIS1, the product of a gene that is mutated in type I lissencephaly (31). A previous structural and biochemical analysis of LIS1 and its interaction with NDEL1 suggested that NDEL1 might need to form an antiparallel coiled coil to bind to LIS1, but this was not corroborated by structural analysis (21).
To study whether NDEL1 forms parallel or antiparallel dimers, we carried out a cross-linking analysis on a recombinant form of NDEL1-(17-174). We efficiently cross-linked NDEL1-(17-174) to form a dimer (Fig. 2b). Analysis of the cross-linked protein upon tryptic digestion by LC-MS/MS and Mascot searches using XDB gave us three matches, all of which passed manual inspection (see Fig. 2c and for additional annotated spectra Supplemental Fig. 1). Cross-links I and II involved overlapping sequences and could therefore be unambiguously identified as sites of interprotomer cross-linking (Fig. 3a). It is clear from Fig. 3b that these two sites are only compatible with a parallel arrangement of the α-helices of the NDEL1 coiled-coil region. Cross-link III involved different tryptic peptides and therefore could not be identified unambiguously as an intra- or interprotomer cross-link.
Thus, the results of our analysis are inconsistent with the hypothesis that NDEL1 forms an antiparallel dimer, although we cannot formally rule out the possibility that the NDEL1 coiled coil changes its orientation upon binding to LIS1. The information we obtained is also in perfect agreement with that revealed by the crystal structure of NDEL1-(1-174). We mapped the position of the cross-linked lysine residues onto the crystal structure of the NDEL1 coiled coil. This showed that the cross-linked residues occupy positions g and e of the coiled coil that face the same side of the structure and are hence ideally situated for cross-linking (Fig. 3c). The crystal structure shows that the Cα atoms of Lys-80 and of Lys-82 are ~9.6 Å away, which is in very good agreement with the length of the BS2G cross-linker (7.7 Å) and the length of the lysine side chain (~6 Å).

Multiprotein Complexes: the NDC80 Complex
After testing the approach on the relatively simple problem represented by NDEL1, we decided to approach a much more complex problem. The NDC80 complex is a constituent of the outer plate of the kinetochore and plays a critical role in establishing the stable kinetochore-microtubule interactions required for chromosome segregation in mitosis (34). The NDC80 complex comprises NDC80 (also known in humans as HEC1 for highly expressed in cancer 1), NUF2, SPC24, and SPC25; all four proteins contain coiled-coil domains. The NDC80 complex is dumbbell-shaped with a central shaft flanked by globular regions at either end: the N-terminal heads of NDC80 and NUF2 at one end and the globular C-terminal heads of SPC24 and SPC25 at the opposite end (20,35). The structures of the C-terminal heads of SPC24-SPC25 and of the globular domain of NDC80 in yeast have been determined (36,37). However, the interactions of the four subunits in the central rod, which are likely mediated by coiled coils, are currently unclear. This complex architecture makes the structure of the complex significantly more difficult to study than the NDEL1 dimer. The NDC80 complex was cross-linked with BS2G, and after separation by SDS-PAGE, four high molecular weight bands were excised, digested, and analyzed by LC-MS/MS (Fig. 4).
FIG. 3. a, structure of the cross-links observed for NDEL1- . The N-terminal extension resulting from the expression system covers residues −5 to −1, and the sequence of NDEL1 starts with Ala-17 following the numbering of the full-length protein. b, model of the NDEL1-(17-174) homodimer as parallel coiled coil starting with the same amino acid. Cross-linked residues are underlined, and residues predicted as hydrophobic center of the coiled coil are bold. c, the coiled coil wheel (view along the helical axis) shows the sequence around cross-link II.

The number of candidate cross-links for the 176-kDa NDC80 complex was too large to allow manual interpretation of the database search hits as described for the 38-kDa NDEL1-(17-174) complex. High mass accuracy and the use of isotope labeling to recognize the signals of peptides containing a cross-linker, the gold standard in the field so far for creating a candidate list, returned 1427 matches between spectra of precursors with a doublet signal and computed cross-linked peptides. The NDC80 complex creates such a large database, as a result of its size, that virtually any precursor mass produces a match even at high mass accuracy (4.5 ppm). Mascot, as a routine tool for matching fragmentation spectra with peptide sequences, condensed the list of matches to 125 by using XDB, yielding a 10-fold reduction of the data. Nevertheless the number of matches was still very large for manual validation. Next the matches were sorted using our scoring algorithm. This made it possible to start the manual validation with the best quality data. However, manual validation is not free of error. Its success rate is unknown and furthermore depends on subjective criteria. Given controls, the score can be used to assign a degree of confidence to candidate cross-linked peptides. This is how scores are used in protein identification, and this is how we planned to use our scoring algorithm in protein structure determination.
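The precursor mass matching at 4.5 ppm mentioned above can be illustrated with a short sketch. The relative mass error in parts per million is the standard definition; the function names and the example masses are hypothetical and are not taken from XDB or from our data.

```python
def ppm_error(observed_mass, theoretical_mass):
    """Relative mass deviation in parts per million (standard definition)."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def candidates_within_tolerance(observed_mass, theoretical_masses, tol_ppm=4.5):
    """Return every theoretical mass that lies within tol_ppm of the observed mass."""
    return [
        mass for mass in theoretical_masses
        if abs(ppm_error(observed_mass, mass)) <= tol_ppm
    ]


if __name__ == "__main__":
    # Hypothetical candidate masses (Da), for illustration only.
    database = [1999.98, 2000.004, 2000.02]
    print(candidates_within_tolerance(2000.0, database))
```

The larger the list of computed cross-linked peptide masses, the more candidates fall inside the tolerance window for any given precursor, which is why precursor mass alone does not suffice for a complex of this size.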
Using a score of 15 at peptide mass error 4.5 ppm (Fig. 5) left us with 69 matches having 90-100% confidence, leading to the identification of 26 cross-linked peptide pairs with unique sequence containing 25 different linkage sites (Table I). We designed two negative controls to determine the false positive rate at this score and peptide mass accuracy cutoff: (a) assuming a false mass for the cross-linker and (b) searching against the inverted version of XDB. At 4.5 ppm, the accuracy of our measurement for peptide masses, 2436 false positives were obtained taking both controls together. Remarkably the number of false positives was reduced to 55 using fragment ion information in Mascot to search XDB. Assuming a similar number of false positives for our experiment, however, these controls predict that up to half of our 125 initial matches might be false positives. The score is a quality measure that ranks the candidates according to their match to the fragment ions (Fig. 5a). Through the control searches we could estimate the confidence (C) associated with our identifications for any score cutoff as $C = (1 - N_c/N_r) \times 100$, where $N_c$ is the number of hits above the cutoff in the control search and $N_r$ is the number of hits above the cutoff in the real search. Conducting this calculation within score windows of five units and applying a cutoff at 90-100% confidence resulted in our high confidence list of 69 matches. Fig. 5c illustrates the high specificity achieved by our scoring algorithm despite the relatively low accuracy (±0.5 Da) of our fragment data. Other types of instruments can yield fragment data of 10-100× higher accuracy than those obtained here and will result in even higher specificity. Note in Fig. 5a that 20 cross-links are found with a score larger than 45 but no false positives regardless of the peptide mass accuracy.
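The confidence estimate described above, evaluated within score windows of five units, might be sketched as follows. This is a minimal illustration and not the code used in the study: the function names and example scores are hypothetical, and how the hits of the two negative controls are combined into a single control count is left to the caller.

```python
from collections import Counter


def confidence(n_control, n_real):
    """C = (1 - N_c / N_r) * 100 for one cutoff or one score window."""
    if n_real <= 0:
        raise ValueError("confidence is undefined without hits in the real search")
    return (1 - n_control / n_real) * 100


def windowed_confidence(real_scores, control_scores, width=5):
    """Confidence per score window, keyed by the lower bound of each window.

    Windows that contain no hits from the real search are skipped.
    """
    def window(score):
        return int(score // width) * width

    real = Counter(window(s) for s in real_scores)
    control = Counter(window(s) for s in control_scores)
    return {start: confidence(control[start], real[start]) for start in sorted(real)}


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    real_hits = [16, 17, 18, 19, 22, 47]
    control_hits = [16, 3]
    for start, c in windowed_confidence(real_hits, control_hits).items():
        print(f"score {start}-{start + 5}: {c:.0f}% confidence")
```

A window in which the control search returns more hits than the real search yields a negative value, which simply signals that no confidence can be assigned there.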
Because high-scoring matches are free of false positives irrespective of the peptide mass accuracy, the algorithm can also be used for data obtained with instruments, such as a stand-alone ion trap, that provide less accurate data for peptide masses than those we used for our analyses. Note also in Fig. 5a the virtual absence of any match with a score below 5. This is the effect of the random match filter of Mascot. Our score would also have been able to cope with these random matches. For a complete list of the 69 high confidence matches see Supplemental Table 1, and for additional annotated spectra see Supplemental Fig. 2.
A closer look at the 69 high confidence matches reveals a current limitation of the analysis and points toward a possible solution. In eight instances, both forms of the cross-link, light and heavy, were identified within the same LC-MS/MS run (Supplemental Table 1). The observation that only one of the pair was identified in all other cases indicates that the analysis was not exhaustive. The same line of reasoning based on SILAC (stable isotope labeling by amino acids in cell culture)-labeled peptide pairs recently demonstrated the limited depth of analysis for complex peptide mixtures (38). As a result of the incompleteness of our LC-MS/MS analyses we are possibly missing a significant proportion of cross-links. A solution would be to focus the data acquisition on potential cross-links. Using the doublet detection as a selection criterion for MS/MS is unfortunately not possible with the current version of our instrument software. However, we noticed that nearly all identified cross-links (64 of 69) were observed with z > 2. In contrast, about 50% of all selected precursors had z = 2. In agreement with this we observed that restricting the selection of precursors for MS/MS to those with z > 2 reduces the background and focuses the analysis better on cross-linked peptides.⁴ Importantly the high charge states of cross-linked peptides can be utilized for the enrichment of these species using strong cation exchange.⁴
The 25 linkage points we obtained provide a detailed picture of the NDC80 complex and illustrate the potential of using cross-linking and mass spectrometry in structure elucidation (Fig. 6). So far, the full-length human and yeast NDC80 complexes have failed to provide diffraction quality crystals. Scanning force microscopy and electron microscopy studies on the reconstituted human NDC80 complex (20) or on its yeast homolog (35) have indicated, however, that the complex is ~57 nm long. The architecture of the NDC80 complex is characterized by the existence of two dimeric subcomplexes consisting of the SPC24 and SPC25 subunits and of the NUF2 and NDC80 subunits, respectively (20,35). As already explained above, the two globular regions have been proposed to contain the globular, N-terminal heads of NDC80-NUF2 and the globular, C-terminal heads of SPC24-SPC25. The central shaft has been proposed to contain the coiled-coil regions in the N-terminal portions of SPC24 and SPC25 and in the C-terminal portions of NUF2 and NDC80 with a tetramerization domain containing the C-terminal tails of NDC80-NUF2 and the N-terminal heads of SPC24-SPC25.
FIG. 5. a, each match of the combined data of four analyses is plotted with its score and its mass deviation between the observed mass and the mass of the matched cross-linked peptide (○). The same data are searched as negative control either assuming a 3-Da heavier cross-linker mass for both forms of the cross-linker, light and heavy (×), or against a reversed sequence database (+). The lines at 4.5 ppm indicate the mass accuracy of the measurement. The line at score 15 indicates 95% confidence as estimated by the hits of the two negative controls. The lack of matches with scores below 5 is a result of the prefiltering of random matches. b, the number of hits is plotted over the score using the data of a at 4.5 ppm accuracy. Each data point sums the counts up to the next point, i.e. counts at score 20 give the count of hits having score 20-30. The number of matches with scores below 5 was computed separately as described under "Experimental Procedures." c, confidence of the cross-linked peptides plotted over score. The confidence C* is calculated in each score range as $C^{*} = (1 - N^{*}_{c}/N^{*}_{r}) \times 100$, where $N^{*}_{c}$ is the number of hits in the control search and $N^{*}_{r}$ is the number of hits in the real search falling into the respective score range. The shaded region designates the region of confidence below 90-100%.
Numbering refers to a linker region (sequence GPLGS, numbered −5 to −1) that precedes the natural N terminus of NUF2 or SPC24 (numbered 1) and that is retained at the N termini of these proteins after proteolytic cleavage from the GST tag used for purification.

The list of cross-link sites we identified is consistent with this prediction (Table I and Fig. 6). A number of intramolecular cross-links bridge seven (IX, X, XI, XVI, and XXVI) or 10 residues (VII and VIII). This corresponds to two or three turns of a helix, bringing the linked residues again onto the same side of the helix, and supports a helix as a secondary structure element in the region of these cross-links. The lysine residues of cross-link XIV are in positions g and e of the predicted coiled coil, supporting the prediction locally as well. In agreement with the previous analyses (20,35), we found several linkages connecting the SPC24 and SPC25 subunits (cross-links XXIII and XXV) and the NUF2 and NDC80 subunits (cross-links I, IV, XII, XIV, XVII, and XVIII). These linkages are consistent with a parallel orientation of the predicted coiled coils in both subcomplexes. As far as the SPC24-SPC25 subcomplex is concerned, cross-links XXIII and XXV approximately define the register of the two chains in the coiled coil as they reveal a 10- to 14-residue offset between SPC24 and SPC25 (residues 60 SPC24/50 SPC25 and 98 SPC24/84 SPC25). The analysis of cross-links in the NDC80-NUF2 subcomplex provides a useful framework to understand the register of the two chains in the coiled-coil regions. Cross-links XII, XIV, XVII, and XVIII all map to the predicted coiled-coil region of this subcomplex. Sites XIV, XVII, and XVIII represent a consistent set with separations of ~45 residues on NDC80 and ~50 residues on NUF2. Indeed we show below that the register of the coiled coil defined by these cross-links remains unaltered until the C-terminal end of the NDC80-NUF2 subcomplex. Conversely the distance between site XII (residues 360 NDC80/252 NUF2) and site XIV (residues 462 NDC80/299 NUF2) is different for the NDC80 and NUF2 chains (102 residues on NDC80 and 47 residues on NUF2). The discontinuity on the NDC80 chain correlates with a drop in the coiled coil prediction between residues 420 and 460 of NDC80 (data not shown), suggesting the presence of a non-coiled-coil loop region extending from the NDC80 chain around this region. This interpretation is reinforced by the presence of two long range intramolecular links bridging roughly equivalent positions in NUF2 and NDC80 (XIII and XV).
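The register arithmetic for the two subcomplexes can be reproduced with a few lines of code. The sketch below uses only residue numbers quoted in the text; the `Site` type and the function names are illustrative and not part of our analysis software.

```python
from typing import NamedTuple


class Site(NamedTuple):
    """One interchain cross-link: a residue number on each of the two chains."""
    res_a: int
    res_b: int


def register_offset(site):
    """Offset between the residue numbers of the two chains at one linkage site."""
    return site.res_a - site.res_b


def separations(upstream, downstream):
    """Residues separating two linkage sites, measured on each chain."""
    return (downstream.res_a - upstream.res_a, downstream.res_b - upstream.res_b)


# SPC24/SPC25 residue pairs quoted in the text.
SPC24_SPC25 = [Site(60, 50), Site(98, 84)]
# NDC80/NUF2 residue pairs for sites XII and XIV quoted in the text.
SITE_XII = Site(360, 252)
SITE_XIV = Site(462, 299)

if __name__ == "__main__":
    print("SPC24-SPC25 offsets:", [register_offset(s) for s in SPC24_SPC25])
    print("XII to XIV (NDC80, NUF2):", separations(SITE_XII, SITE_XIV))
```

Equal separations on both chains indicate a continuous register between two sites, whereas unequal separations, as between sites XII and XIV, point to additional residues looping out of one chain.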
Overall these data suggest that there is an interruption in the coiled-coil region of the NDC80-NUF2 subcomplex, approximately centered on residue 440 of human NDC80, that might represent a site of flexibility in the NDC80 complex rod. Coiled-coil predictions on the NDC80 complex of Saccharomyces cerevisiae display a very similar pattern with an ~60-residue interruption of the coiled coil approximately centered on residue 475 (data not shown). Visualization of the yeast NDC80 complex by electron microscopy confirms the speculation that this represents a site of flexibility in the NDC80 rod (35). A majority of the rotary shadowed particles showed evident bending of the NDC80 rod at about a third of its length from the N terminus (35), which is fully consistent with the position of a non-coiled-coil segment in the central shaft of the NDC80 complex (Fig. 6). Future work will have to address the functional significance of the 40-50-amino acid insertion in the coiled coil of NDC80. Upstream from site XII, the coiled-coil predictions for NDC80 and NUF2 envision a ~110-residue coiled-coil segment for both chains, suggesting that the pairing defined by site XII extends ~110 residues upstream from this site, i.e. from the point at which the coiled coil forms as the NUF2 and NDC80 chains emerge from the N-terminal globular regions.
In summary, our cross-linking studies shed light on the register of the coiled coils within the shaft of the NDC80 complex. In combination with low resolution visualization approaches and prediction methods of common usage, this approach allows probing the structural complexity of a large protein assembly. In this respect, the set of cross-links between different NDC80 subcomplexes (cross-links between NDC80-NUF2 and SPC24-SPC25) is particularly useful in defining the organization of the tetramerization domain. Consistent with the predicted organization of the NDC80 complex (20,35), these sites involve the C-terminal regions of NDC80-NUF2 and the N-terminal regions of SPC24-SPC25 (XIX-XXIV). Specifically we identified cross-links between residues 400 NUF2 or 418 NUF2 and the N terminus of a five-residue extension of SPC24 that was created by the PreScission protease after cleavage from the GST tag (sites XIX and XXI). We also found a cross-link between 577 NDC80 and the same N-terminal extension of SPC24 (site XX). We therefore suggest that 577 NDC80 faces the midpoint between 400 NUF2 and 418 NUF2 (409 NUF2). This pairing is fully consistent with the register of the coiled coil established by the set of cross-links XIV, XVII, and XVIII. Specifically if the coiled coil of NDC80 and NUF2 runs uninterrupted C-terminal to site XVIII (504 NDC80 and 349 NUF2), one would expect that ~70 residues downstream from site XVIII, and therefore around residues 574 NDC80 and 419 NUF2, the NDC80 and NUF2 chains should still be paired. Within the accuracy allowed by our cross-linking experiment, this prediction is in good agreement with our argument (see above) that 577 NDC80 faces 409 NUF2. Thus, we can conclude that the NDC80-NUF2 coiled coil runs roughly uninterrupted after the predicted loop around 440 NDC80.
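The extrapolation of the register from site XVIII toward the tetramerization domain can be followed with a short calculation. The sketch assumes an uninterrupted coiled coil in which both chains advance residue by residue; the function names are illustrative, and the text does not specify a numerical tolerance for what counts as agreement.

```python
def extrapolate_register(site, n_residues):
    """Advance both chains of a paired site by the same number of residues."""
    ndc80, nuf2 = site
    return (ndc80 + n_residues, nuf2 + n_residues)


def partner_under_register(anchor, ndc80_residue):
    """NUF2 residue expected to face a given NDC80 residue if the register holds."""
    ndc80, nuf2 = anchor
    return nuf2 + (ndc80_residue - ndc80)


SITE_XVIII = (504, 349)          # NDC80 and NUF2 residues of site XVIII
NUF2_MIDPOINT = (400 + 418) // 2  # midpoint of the two NUF2 residues linked to SPC24

if __name__ == "__main__":
    print("70 residues downstream:", extrapolate_register(SITE_XVIII, 70))
    expected = partner_under_register(SITE_XVIII, 577)
    print("expected NUF2 partner of 577 NDC80:", expected)
    print("deviation from suggested partner 409:", expected - NUF2_MIDPOINT)
```

The deviation of about a dozen residues between the extrapolated and the suggested partner illustrates the limited positional accuracy of this kind of argument.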
FIG. 6. Model of the NDC80 complex fitting best to the observed cross-links. Amino acid residue numbers are Arabic numbers, and cross-links are Roman numbers. For a full list of residue numbers for all cross-links see Table I. aa, amino acids.

We also found that 416 NUF2 cross-links to the N terminus of SPC25 (site XXII) and that 633 NDC80 cross-links to 52 SPC25 (site XXIV). Based on arguments developed in the previous paragraph, 416 NUF2 is expected to face ~584 NDC80. Because 416 NUF2 faces the N terminus of SPC25, 584 NDC80 is also predicted to be close to 1 SPC25. About 50 residues downstream from this point, the NDC80 and SPC25 chains have advanced at the same pace as revealed by the cross-link between 633 NDC80 and 52 SPC25 (site XXIV). These data also suggest that the C terminus of NDC80 extends beyond the C terminus of NUF2 by some 10-15 residues. On the other hand, the N terminus of SPC24 might advance the N terminus of SPC25 by ~7-10 residues as revealed by their cross-links with NUF2 or NDC80. This is roughly consistent with the offset between these chains described above.
Based on the cross-linking analysis presented here, we carried out extensive additional subcloning to test the expression properties and solubility of different subcomplexes of the NDC80 complex. This work allowed the construction of new expression constructs to generate a stable recombinant version of the tetramerization domain.⁵ More importantly, the determination of the register of the coiled coils in the NDC80-NUF2 and SPC24-SPC25 subcomplexes promoted the generation of engineered constructs of the NDC80 complex containing direct fusions of truncated versions of the coiled-coil segments of NDC80 and SPC25 and of NUF2 and SPC24, resulting in the creation of a chimeric dimeric NDC80 complex (data not shown). In this arrangement, residues 80-286 of NDC80 were fused to residues 118-224 of SPC25, and residues 1-169 of NUF2 were fused to residues 122-197 of SPC24. This strategy, which bypassed the requirement for a tetramerization domain, resulted in the production of a stable mini-NDC80 complex (named NDC80 bonsai) that readily crystallized and whose crystal structure was determined at ~3.0-Å resolution.⁶ The geometry of residues cross-linked in site XXVI, 122 SPC25 and 129 SPC25, can be observed in the crystal structure of NDC80 bonsai. The Cα atoms of these residues are ~10.5 Å away from each other, a distance that the 7.7-Å BS2G cross-linker and the ~6-Å flexible side chain of lysine can easily bridge. Indeed there are several cross-links with an equivalent positioning of linkage sites in the primary sequence in our dataset, such as sites X and XI. The coiled-coil register displayed by the crystal structure of NDC80 bonsai is consistent with that revealed by the cross-linking analysis described here (data not shown).
Thus, our cross-linking studies were instrumental in tailoring appropriate manipulations within the complex architecture of the NDC80 complex, such as the one described above that led to its crystallization or such as those that might be required to design deletion mutants to probe the significance of different segments of a coiled-coil region in a large complex.
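The geometry of intrahelical cross-links such as site XXVI, which bridges seven residues, can be related to textbook parameters of an ideal α-helix. The sketch below assumes a rise of 1.5 Å per residue and 3.6 residues per turn; these values are general knowledge rather than measurements from this study, and the axial rise only approximates the Cα-Cα distance when both residues lie on nearly the same face of the helix.

```python
# Textbook parameters of an ideal alpha-helix (assumptions, not data from the study).
RISE_PER_RESIDUE_A = 1.5   # axial rise per residue in angstrom
RESIDUES_PER_TURN = 3.6    # residues per helical turn


def axial_rise(n_residues):
    """Distance along the helix axis spanned by n_residues."""
    return n_residues * RISE_PER_RESIDUE_A


def helical_turns(n_residues):
    """Number of helical turns spanned by n_residues."""
    return n_residues / RESIDUES_PER_TURN


if __name__ == "__main__":
    for n in (7, 10):
        print(f"{n} residues: {axial_rise(n):.1f} angstrom, {helical_turns(n):.2f} turns")
```

For a spacing of seven residues the ideal axial rise is 10.5 Å over roughly two turns, in line with the Cα distance observed for 122 SPC25 and 129 SPC25 in the crystal structure; a spacing of 10 residues corresponds to nearly three turns.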

Conclusion
Cross-linking and mass spectrometry can potentially yield important structural information for multiprotein complexes in the absence of high resolution structural information. Using our novel data analysis algorithm we report the resolution and interpretation of an unprecedented number of cross-links for a 176-kDa, four-protein complex. The density of observed linkages results in a hitherto unobtainable depth of detail and gives clear directions for minimal constructs of the NDC80 complex to obtain a crystal structure. Because our algorithm is simple and mass spectrometry detects peptides without large bias toward their sequence, any study of multiprotein complexes can be complemented with a topological analysis. This significantly increases the focus of follow-up experiments to reveal functional and structural aspects of the complexes. Our link into Mascot as a standard database search tool can be taken up by the providers of this and alternative tools and should ensure professional support and easy access for a wide range of researchers. Our database search strategy does not depend upon the use of isotope labels in any way and is hence equally applicable to work with non-labeled cross-linkers. However, the use of labeled cross-linkers allowed us to reduce the size of our NDC80 dataset 10-fold (~14,000 to ~1400 fragmentation spectra) and was important to reduce the false positive rate. With more accurate fragment masses than in our study and with the automation of data analysis now in place, even larger structures than the NDC80 complex can be addressed in the near future.