Identification of Protease Specificity by Combining Proteome-Derived Peptide Libraries and Quantitative Proteomics*

We present protease specificity profiling based on quantitative proteomics in combination with proteome-derived peptide libraries. Peptide libraries are generated by endoproteolytic digestion of proteomes without chemical modification of primary amines before exposure to a protease under investigation. After incubation with a test protease, treated and control libraries are differentially isotope-labeled using cost-effective reductive dimethylation. Upon analysis by liquid chromatography–tandem mass spectrometry, cleavage products of the test protease appear as semi-specific peptides that are enriched for the corresponding isotope label. We validate our workflow with two proteases with well-characterized specificity profiles: trypsin and caspase-3. We provide the first specificity profile of a protease encoded by a human endogenous retrovirus and for chlamydial protease-like activity factor (CPAF). For CPAF, we also highlight the structural basis of negative subsite cooperativity between subsites S1 and S2′. For A disintegrin and metalloproteinase with thrombospondin motifs (ADAMTS) -4, -5, and -15, we show a canonical preference profile, including glutamate in P1 and glycine in P3′. In total, we report nearly 4000 cleavage sites for seven proteases. Our protocol is fast, avoids enrichment or synthesis steps, and enables probing for lysine selectivity as well as subsite cooperativity. Due to its simplicity, we anticipate usability by most proteomic laboratories.

Proteolysis is an irreversible posttranslational modification, ubiquitously shaping every proteome. Degradative proteolysis regulates proteome composition, mediates protein turnover, and ensures proteome integrity by removing misfolded proteins. At the same time, limited proteolysis yields stable cleavage products, thereby controlling enzyme or chemokine activity, assembly of structural proteins, and release of bioactive proteins from the cell surface (shedding) or from larger precursors (1)(2)(3). Dysregulated proteolysis is a hallmark of numerous diseases, including cancer, neurodegeneration, and hereditary skin diseases. In short, proteolytic processing is a key event in health and disease.
Proteases typically recognize substrates in an extended active-site cleft, involving multiple interactions between substrate side chains and corresponding binding pockets of the protease. In many cases, these interactions occur both N- and C-terminally to the scissile peptide bond. This relation was formalized in 1968 with the Schechter and Berger nomenclature (4). Here, substrate residues N-terminal to the scissile peptide bond are denoted as P1, P2, P3, … and residues C-terminal to the scissile peptide bond are denoted as P1′, P2′, P3′, …; the corresponding subsites of the protease are denoted as S1, S2, S3, … and S1′, S2′, S3′, …, respectively.
Many proteases have strict subsite specificities. Prominent examples include trypsin (lysine or arginine in P1) or GluC (glutamate or aspartate in P1). On the other hand, numerous proteases are characterized by comparably broad subsite preferences, often encompassing multiple subsites. Examples include cysteine cathepsins or matrix metalloproteases (5)(6)(7). Determination of protease specificity is an important step in biochemical protease characterization. It enables deorphanizing of previously uncharacterized proteases and probing protease activity in vitro (8). Knowledge of active-site specificity also serves to guide protease inhibitor development (9) and allows identification of structural origins of protease specificity and promiscuity (10,11).
Several complementary experimental strategies exist for protease specificity profiling (12). Phage and bacterial display techniques employ genetic approaches and iterative enrichment to screen peptide pools with a large sequence variety for preferred cleavage motifs (13,14). Positional scanning synthetic combinatorial libraries represent a powerful biochemical approach for the determination of subsite selectivity (15). Synthetic mixture-based oriented peptide libraries represent a two-step strategy to characterize proteases with specificity profiles that include both prime and nonprime sites (7). In peptide nucleic acid arrays, peptidic substrates are coupled to defined nucleic acid sequences, allowing spatial deconvolution of complex mixtures on complementary microarrays (16). In combination with in vitro translation, nucleic acid sequencing has been used to determine protease cleavage sites (17). Such a setup has recently enabled the family-wide portrayal of matrix metalloprotease (MMP) specificity determinants (18). Cysteine cathepsin specificity has also been investigated by a nonpeptide approach termed fast profiling of protease specificity (19).
In recent years, proteome-derived peptide libraries have been introduced for the specificity profiling of proteases (5), including cysteine, serine, and metalloproteases (5,20,21), with adaptations to enable multiplexed stable isotope tagging for kinetic investigations (22) as well as for investigating the specificity of carboxypeptidases and Nα-acetyltransferases (23,24). Using proteome-derived peptide libraries, Eckhard et al. recently reported thousands of matrix metalloprotease cleavage sites (25).
Here, we present a tag-free, straightforward strategy to employ proteome-derived peptide libraries for protease specificity profiling. The technique is based on the comparison of protease-treated peptide libraries and control samples that carry different stable isotope labels. LC-MS/MS analysis allows relative quantitation of each isotope-labeled peptide and thereby specific selection of peptides that occur only in the protease-treated sample, indicating cleaved sequences. Notably, affinity enrichment or modification of primary amines prior to the actual profiling step is no longer required. The protocol enables identification of specificity toward unmodified lysine residues as well as probing of subsite cooperativity. Presently, 3502 of the 3988 (88%) proteases annotated in the MEROPS database (version 9.12) (26) have fewer than 10 known substrates. This highlights that deorphanizing remains an important goal in protease research and underlines the need for straightforward approaches for protease cleavage site identification.

EXPERIMENTAL PROCEDURES
Experimental Design and Statistical Rationale-A total of 12 specificity experiments were conducted. Semi-specific peptides that are enriched more than eightfold (log2 fold-change = 3) are considered to unambiguously represent cleavage products of the proteases under investigation. This is validated by the control experiments presented below, which accurately depict the specificity for trypsin and caspase-3.
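The eightfold criterion can be illustrated with a minimal Python sketch. The function name, its arguments, and the treatment of peptides that lack any control signal are assumptions made for illustration; they are not taken from the in-house script.

```python
import math

LOG2_CUTOFF = 3.0  # log2(8): the eightfold enrichment threshold stated above


def is_cleavage_product(treated, control, cutoff=LOG2_CUTOFF):
    """Return True if a semi-specific peptide passes the enrichment cutoff.

    treated, control: non-negative label intensities (hypothetical inputs).
    """
    if treated <= 0:
        return False  # no signal in the protease-treated channel
    if control <= 0:
        return True   # assumption: treated-only peptides count as enriched
    return math.log2(treated / control) > cutoff
```

In this sketch a peptide at exactly eightfold enrichment is rejected, because the criterion is read as strictly "more than eightfold".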
Peptide Library Preparation-Escherichia coli MG1655 was grown in Luria-Bertani (LB) medium. Cells were lysed, and lysates were digested by either trypsin or GluC as described elsewhere (27). GluC digests were performed in the presence of 1.0 μM tosyl phenylalanyl chloromethyl ketone and 1.0 μM tosyl-L-lysine chloromethyl ketone hydrochloride. LysC digestion was performed in the same way as tryptic digestion, with LysC replacing trypsin. Following digestion, primary amines were not modified. The peptide digest was further purified by C18 solid phase extraction (Sep-Pak, Waters) according to the manufacturer's instructions. Aliquots of the peptide library were stored in water at −80°C.
Nanoflow-HPLC-MS/MS-Samples were analyzed on a Q-Exactive plus (Thermo Scientific, Waltham, MA) mass spectrometer coupled to an Easy nanoLC 1000 (Thermo Scientific) with a flow rate of 300 nl/min. Buffer A was 0.5% formic acid, and buffer B was 0.5% formic acid in acetonitrile (water and acetonitrile were at least HPLC gradient grade quality). A gradient of increasing organic proportion was used for peptide separation (5-40% acetonitrile in 80 min). The analytical column was an Acclaim PepMap column (Thermo Scientific), 2 μm particle size, 100 Å pore size, length 150 mm, inner diameter 50 μm. The mass spectrometer operated in data-dependent mode with a top 10 method at a mass range of 300-2000.
Data Analysis-Raw LC-MS/MS data were converted to the mzXML format (34), using msconvert (35) with centroiding of MS1 and MS2 data and deisotoping of MS2 data. For spectrum to sequence assignment, X! Tandem (Version 2013.09.01) was used (36). The E. coli proteome database (strain K12, reference proteome) was used as described previously (37), consisting of 4304 protein entries and 8608 randomized sequences, derived from the original E. coli proteome entries. The decoy sequences were generated with the software DB toolkit (38). X! Tandem parameters included: precursor mass error of ± 10 ppm; fragment ion mass tolerance of 20 ppm; semi-tryptic, semi-GluC, or semi-LysC specificity with up to one missed cleavage; static residue modifications: cysteine carboxyamidomethylation (+57.02 Da), lysine and N-terminal dimethylation (light formaldehyde 28.03 Da; heavy formaldehyde 34.06 Da, or 36.08 Da in the case of the caspase-3 experiment, for which cyanoborodeuteride was used); no variable modifications. X! Tandem results were further validated by PeptideProphet (39) at a confidence level of > 95%. For relative peptide quantification, XPRESS (40) was used. Mass tolerance for quantification was 0.015 Da. An in-house Perl script (downloadable at http://www.mol-med.uni-freiburg.de/mom/schilling/TAILS_v21_xpress-only/at_download/file) then filters for semi-specific peptides, with their N or C terminus generated by the initial digestion protease and the respective other terminus derived from cleavage by the test protease. Similar to the original PICS procedure (5), the script then determines bioinformatically, through database lookup, the corresponding prime or nonprime sequence. Web-PICS was used to generate heatmap style representation of protease specificity (41).
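The filtering and lookup step can be sketched as follows. This is not the in-house Perl script but a simplified Python illustration; the function name, the P1 rule table, the four-residue window, and the use of the first sequence match only are assumptions, and refinements such as exceptions for proline after the cleavage site or peptides shared between proteins are ignored.

```python
# P1 residues accepted by each digestion protease (simplified rules).
DIGEST_P1 = {"trypsin": set("KR"), "GluC": set("DE"), "LysC": set("K")}


def classify_semi_specific(protein, peptide, digest="trypsin", flank=4):
    """Reconstruct the test-protease cleavage site from a semi-specific peptide.

    Returns None if the peptide is absent, fully specific, or nonspecific.
    Otherwise returns the observed side plus nonprime and prime sequences.
    """
    start = protein.find(peptide)  # assumption: first occurrence only
    if start < 0:
        return None
    end = start + len(peptide)
    p1 = DIGEST_P1[digest]
    # A terminus is "specific" if it matches the digestion protease
    # or coincides with a protein terminus.
    n_specific = start == 0 or protein[start - 1] in p1
    c_specific = end == len(protein) or peptide[-1] in p1
    if n_specific == c_specific:
        return None  # both or neither terminus is specific
    # Index of the P1' residue of the test-protease cleavage site.
    cut = start if c_specific else end
    return {
        "observed": "prime" if c_specific else "nonprime",
        "nonprime": protein[max(0, cut - flank):cut],  # P4..P1
        "prime": protein[cut:cut + flank],             # P1'..P4'
    }
```

For the hypothetical protein `AAAKCDEFGHIKLMNR`, the peptide `GHIK` (tryptic C terminus only) yields an experimentally observed prime side `GHIK` and a looked-up nonprime side `CDEF`; the peptide `CDEF` yields the same site from the nonprime side.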
Molecular Modeling of CPAF Peptide Complex-The structure of mature CPAF (pdb code 3DOR) and of the active site mutant CPAF-S499A in complex with a peptide (pdb code 3DPN) (44) served as templates for molecular modeling. The corresponding residues P1, P2, P3, P1′, P2′, and P3′ were changed with Coot (45), and figures were generated with PyMol (46).

RESULTS AND DISCUSSION
Overview and Proof of Concept-Our procedure is outlined in Fig. 1. Briefly, proteomes, such as cellular or bacterial lysates, are harvested and digested by specific endopeptidases (digestion protease), such as trypsin, GluC, or LysC, yielding a proteome-derived peptide library. After inactivation of the digestion protease, the peptide library is divided into a control sample and a sample for incubation with a protease under investigation (test protease). Following this incubation step, the samples are differentially labeled, for example, by reductive dimethylation using different stable isotopes of formaldehyde (47). The samples are then mixed, desalted, and analyzed by LC-MS/MS. In data analysis, the procedure focuses on semi-specific peptides that are more abundant in the sample treated with the test protease. Semi-specific refers to peptides for which only the N or C terminus represents cleavage by the digestion protease, while the other terminus is generated by a protease with a different specificity. By focusing on semi-specific peptides, which are more abundant in the sample treated with the test protease, the procedure investigates the specificity of this enzyme. If only the C terminus of such a peptide is derived from treatment with the digestion protease, the N terminus was likely generated by the test protease, and the peptide thus constitutes the prime-site cleavage sequence. The requirement that the semi-specific peptide needs to be present only in the test protease treated sample but not in the control excludes artifacts arising from unspecific background proteolysis in the proteome. Similar to the original PICS procedure, the corresponding nonprime-site cleavage sequences are bioinformatically retrieved through a database lookup. We have previously shown that this combination of experimentally and bioinformatically derived sequences is suitable for profiling of protease specificities (5).
We have now expanded the strategy to enable bioinformatic retrieval of prime-site cleavage sequences based on experimentally determined nonprime-site cleavage sequences (i.e. N terminus generated by the digestion protease, C terminus produced by the test protease). In a last step, all cleaved sequences are aligned, optionally normalized, and the aggregate specificity motifs visualized, e.g. as heatmaps (41) or sequence logos (48).
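The alignment and summary step can be illustrated with a short Python sketch that tabulates amino acid occurrences per position. The fixed P4 to P4′ window, the position labels, and the function name are assumptions made for this illustration; Web-PICS may normalize and render the data differently.

```python
from collections import Counter

POSITIONS = ["P4", "P3", "P2", "P1", "P1'", "P2'", "P3'", "P4'"]


def positional_frequencies(windows):
    """Summarize aligned cleavage windows as percent occurrence per position.

    windows: iterable of 8-letter strings spanning P4..P4'.
    Windows of any other length are skipped.
    """
    counts = {pos: Counter() for pos in POSITIONS}
    for window in windows:
        if len(window) != len(POSITIONS):
            continue
        for pos, residue in zip(POSITIONS, window):
            counts[pos][residue] += 1
    frequencies = {}
    for pos, counter in counts.items():
        total = sum(counter.values())
        frequencies[pos] = (
            {aa: 100.0 * n / total for aa, n in counter.items()} if total else {}
        )
    return frequencies
```

The resulting nested dictionary maps each position to percent occurrences and can serve as input for a heatmap or a sequence logo.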
As an initial proof-of-concept study we generated a proteome-derived peptide library from E. coli lysates by GluC digestion. We then incubated this library with trypsin at a protease:library ratio of 1:500 for either 1 h or 16 h at 37°C. Upon LC-MS/MS data analysis, we used an eightfold increase in abundance to discriminate semi-specific peptides that are generated by trypsin incubation from background signals. A total of 786 (1 h trypsin incubation, sequences in Supplemental Tables IA and IB) and 819 (16 h trypsin incubation, sequences in Supplemental Tables IIA and IIB) semi-specific peptides fulfilled this criterion and were further analyzed. As shown in Fig. 2A, both incubation times clearly result in specificity profiles that reflect the prototypical trypsin selectivity for arginine and lysine in P1. This result validates our experimental strategy and highlights its suitability for assessing the capacity of proteases to act on lysine-containing substrates.
We further chose human caspase-3 for a second proof-of-concept experiment. Caspase-3 features subsites with differing grades of specificity, including strict selectivity (P1D), moderate selectivity (P4D), and partial preference (P2V). Using our modified PICS strategy, we profiled recombinant human caspase-3 with tryptic E. coli peptide libraries. We identified 63 caspase-3 cleavage sites (sequences in Supplemental Tables IIIA and IIIB). The resulting specificity profile (Fig. 2B) matches prototypical caspase specificity (49,50). In P3, aliphatic residues rather than glutamate appear to be preferred. This is corroborated by a phage-display study on caspase-3, which identified Asp-Leu-Val-Asp rather than Asp-Glu-Val-Asp as the preferred sequence motif for caspase-3 (51).
Specificity Profiling of HERV-K(HML-2) Protease-The human genome contains numerous sequences that originate from germ line insertions of exogenous retroviruses, so-called human endogenous retroviruses (HERVs). HERV sequences, that is, "fossilized" retroviral sequences, account for ~8% of the human genome (52). Similar to present-day exogenous retroviruses, a relatively small number of HERV sequences in the human genome comprises ORFs for typical retroviral proteins, among them a retroviral protease. The protease of an evolutionarily young HERV group, named HERV-K(HML-2), bears similarity to the aspartyl protease of viruses such as the human immunodeficiency virus (HIV) (53), and it has been suggested that the HERV-K(HML-2) protease may functionally complement HIV-1 protease (54).
Despite its interesting biology, little is known about the specificity of the HERV-K(HML-2) protease. Here, we sought to elucidate its preferred cleavage site motif using proteome-derived peptide libraries. Incubation of a tryptic peptide library with recombinant, purified HERV-K(HML-2) protease (protease:library ratio 1:100) yielded 95 cleavage sites (specificity profile in Fig. 2C, sequences in Supplemental Tables IVA and IVB). Aromatic residues (Phe, Trp, Tyr) in P1 constitute the primary specificity determinant of HERV-K(HML-2) protease. This finding is corroborated by its ability to process HERV-K(HML-2) Gag protein at Tyr 143 and Phe 252 (52,55,56). At the same time, a substantial proportion (21%) of cleavage sites feature a P1 Gly. In line with our observation, HERV-K(HML-2) protease also cleaves the HERV-K(HML-2) Gag protein at Gly 532 in vitro (52,55,56). Further, minor specificity determinants are represented by a preference for aliphatic residues in P2 and for aromatic or aliphatic residues in P1′. The acidic residues Asp and Glu are found in position P2 in 32% of all cleavage sites. However, due to the low pH of the reaction condition (pH 5.0), they were likely present in their protonated, noncharged form.
Overall, the specificity profile of HERV-K(HML-2) protease bears similarity to HIV-1 protease, which also displays a preference for aliphatic residues in P2 as well as for aromatic residues in P1 and P1′ (5). HIV-1 protease is a prototypical case for subsite cooperativity (5,57). In particular, bulky residues in P1 (e.g. Phe) result in a preference for small residues in P3 (e.g. Ala) and vice versa. This behavior is also found for HERV-K(HML-2) protease (data not shown). Interestingly, the similar specificity profiles of HERV-K(HML-2) and HIV-1 proteases persist despite a rather limited sequence homology of the two enzymes (54).
Using the HERV-K(HML-2) dataset, we also investigated whether the appearance of cleavage products coincides with the depletion of the original tryptic peptides. For 22 of the 95 cleavage sequences, we also identified and quantified the original tryptic peptide. As expected, protease treatment resulted in depletion of the substrate peptides; for HERV-K(HML-2), the average depletion was more than fourfold. This observation further corroborates our strategy.
FIG. 2. (A) Specificity profiling of trypsin using a GluC peptide library. The protease:library ratio was 1:500 (wt/wt). Incubation at pH 8.0 occurred for either 1 h or 16 h at 37°C. The histograms show the fold-change value distribution (log2 of label ratios for trypsin/control) of the semi-specific peptides. Semi-specific peptides with a more than eightfold enrichment (log2 fold-change value > 3) for the protease-treated sample (here: trypsin) are considered to represent specific cleavage events mediated by the test protease. These were used for reconstruction of the substrate cleavage sites, which were aligned and summarized as heat maps clearly showing the expected stringent trypsin specificity with arginine and lysine in P1. (B) Specificity profiling of human caspase-3. The protease:library ratio was 1:300 (wt/wt). Incubation at pH 7.4 occurred for 3 h at 37°C. The specificity heatmap is in line with canonical description of caspase-3 specificity, with the exception of the P3 position. However, caspase-3 affinity for aliphatic residues in P3 and a preference for DLVD over the DEVD sequence present in common synthetic caspase-3 substrates have also been reported by others (51). (C) Specificity profiling of human endogenous retrovirus HERV-K(HML-2) protease. The protease:library ratio was 1:100 (wt/wt). Incubation at pH 5.0 occurred for 16 h at 37°C. P1 constitutes the major specificity determinant with a preference for aromatic residues.
Specificity and Subsite Cooperativity of Chlamydial Protease-like Activity Factor-Chlamydial protease-like activity [...] For the present work, we used purified recombinant CPAF. To probe CPAF specificity, we generated a proteome-derived peptide library by GluC digestion of E. coli lysates.
CPAF incubation (protease:library ratio of 1:500 (wt/wt)) at pH 7.5 for 1 h and 16 h, respectively, at 37°C resulted in the identification of 501 cleavage sequences (1 h incubation, sequences in Supplemental Tables VA and VB) and 680 cleavage sequences (16 h incubation, sequences in Supplemental Tables VIA and VIB). Both incubation times yielded highly similar specificity profiles (Fig. 3A). P1 constitutes the major specificity determinant with a preference for alanine, glycine, or methionine. P1M accounts for 8.4% (1 h incubation) or 8.5% (16 h incubation) of the CPAF cleavage sites. Because methionine is rare in natural proteins, this corresponds to a strong overrepresentation. The selectivity for P1M is corroborated by the three previously reported CPAF self-cleavage sites, two of which feature a P1M (44). Further positional preferences include tyrosine in P3, aliphatic residues (isoleucine and valine) in P2′, and proline in P3′. Mixed specificity in a given subsite has been previously observed, for example, for cathepsin B, which prefers small or aromatic residues in P1′ (20).
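The overrepresentation argument for P1M can be made explicit by relating the observed P1 occurrence to the background abundance of the residue in the library. The following Python sketch shows one way to do this; the function names, the eight-residue window layout with P1 at index 3, and the use of raw library sequences as background are assumptions for illustration only.

```python
from collections import Counter


def background_fractions(sequences):
    """Fraction of each amino acid across all given sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def p1_fold_enrichment(windows, background, residue, p1_index=3):
    """Observed P1 fraction of `residue` divided by its background fraction.

    windows: aligned cleavage windows spanning P4..P4' (P1 at index 3).
    background: mapping from amino acid to fraction, e.g. from
    background_fractions(); the residue must be present in it.
    """
    observed = sum(1 for w in windows if w[p1_index] == residue) / len(windows)
    return observed / background[residue]
```

A value well above 1 indicates that the residue occurs in P1 more often than expected from its abundance alone.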
We further investigated whether CPAF preference for P1M is positively or negatively correlated to other positional preferences. We noticed that P1M negatively correlates with P2′I and vice versa (Figs. 3B and 3C). In order to further characterize the substrate recognition by CPAF, we analyzed the structure of the active site and created a molecular model of a complex of CPAF with two model substrates containing either the motif ValMet↓ValAla or ValMet↓ValIle (P2-P2′, "↓" indicates the cleavage site, Fig. 3D). The side chain of methionine in the P1 position reaches into a large hydrophobic pocket close to the active site serine (Ser499) in CPAF. This interaction places the peptide bond in optimal distance to the active site serine. The side chain of valine in position P1′ points away from the protein surface. The side chain of the residue in position P2′ points toward the protein surface. However, there is limited space and only residues with a short side chain such as alanine fit well. Side chains of larger size such as isoleucine clash with the protein matrix and give a rationale for the negative correlation between isoleucine in position P2′ and methionine in P1. Such cooperativity for subsites with mixed specificities has been previously observed, e.g. in the case of cathepsin B (20) and other proteases (58). Generally, our technique enabled detailed, high-content profiling of CPAF specificity and unraveled subsite cooperativity for the mixed P1 specificity. We conclude that CPAF is a protease with multiple enzyme-substrate interactions and comparably broad subsite specificities.
FIG. 4. Specificity profiling of proteases of the A disintegrin and metalloproteinase with thrombospondin motifs (ADAMTS) family. The protease:library ratio was 1:50 (wt/wt). Incubation at pH 7.5 occurred for 16 h at 37°C. P1 constitutes the major specificity determinant with a canonical preference for glutamate.
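Subsite cooperativity of the kind described for CPAF (P1M versus P2′I) can be probed by comparing how often a residue occurs at one position overall with how often it occurs when a second position is fixed. The Python sketch below is a minimal illustration under assumed names and an assumed P4 to P4′ window layout (P1 at index 3, P2′ at index 5); it does not reproduce the analysis behind Figs. 3B and 3C.

```python
def conditional_preference(windows, fixed_index, fixed_residue,
                           probe_index, probe_residue):
    """Compare the overall and conditional occurrence of a probe residue.

    Returns a dict with the fraction of all windows carrying probe_residue
    at probe_index ("overall") and the same fraction among windows that
    also carry fixed_residue at fixed_index ("given_fixed", or None if no
    window matches the fixed residue).
    """
    windows = list(windows)
    subset = [w for w in windows if w[fixed_index] == fixed_residue]

    def fraction(group):
        if not group:
            return None
        return sum(1 for w in group if w[probe_index] == probe_residue) / len(group)

    return {"overall": fraction(windows), "given_fixed": fraction(subset)}
```

A conditional fraction clearly below the overall fraction points to negative cooperativity between the two subsites, a clearly higher one to positive cooperativity; with real data, a significance test on the underlying counts would be advisable.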
Canonical Specificity of ADAMTS Proteases-The A disintegrin and metalloproteinase with thrombospondin motifs (ADAMTS) family comprises 19 mammalian members (59). These are secreted proteases with a variety of ancillary domains. Members of the ADAMTS family are involved in a plethora of functions in health and disease. Here, we focus on ADAMTS-4, -5, and -15. For ADAMTS-15, MEROPS does not yet report any cleavage sites, and we present its first specificity profile. We employed tryptic and LysC libraries to profile the active site specificity of ADAMTS-4, -5, and -15, using an enzyme:library ratio of 1:50. Earlier reports on specificity profiling of related proteases of the A disintegrin and a metalloproteinase (ADAM) family employed enzyme:library ratios of 1:10 (60). Our own studies highlighted that specificity profiling with proteome-derived peptide libraries is not prone to overdigestion or blurred specificity motifs as a result of prolonged incubation times (20) or elevated enzyme:library ratios (5). For ADAMTS-4, we identified 180 tryptic and 335 LysC-based cleavage sites (Supplemental Tables VIIA, VIIB, VIIIA, and VIIIB); for ADAMTS-5, we identified 215 tryptic and 246 LysC-based cleavage sites (Supplemental Tables IXA, IXB, XA, and XB); for ADAMTS-15, the corresponding numbers were 77 tryptic and 63 LysC-based cleavage sites (Supplemental Tables XIA, XIB, XIIA, and XIIB).
The specificity profiles are summarized in Fig. 4. All three ADAMTS proteases yielded similar specificity profiles. Glutamate in P1 constitutes the major specificity determinant. ADAMTS-4 is unique in also displaying a minor preference for tyrosine in P1. Glycine in P3′ is prominently preferred by all three ADAMTS proteases. In P2, there is a common, but less pronounced, preference for glutamine, and in P1′ there is a shared preference for alanine. The LysC library further unraveled a preference for arginine in P2′, which is shared by all three ADAMTS proteases. Generally, there is good agreement between the tryptic and LysC-based specificity profiles, which further testifies to the robustness of our approach. For ADAMTS-4 and -5, our specificity profiles are in good agreement with cleavage site data deposited in MEROPS. Interestingly, the canonical ADAMTS specificity is distinct from the specificity profiles of further metzincin proteases such as MMPs, ADAMs, or meprins.

CONCLUSION
We present a simple and robust protocol for profiling of protease specificity, including subsite cooperativity, with proteome-derived peptide libraries. We showcase and validate our technique through characterization of the well-known trypsin and caspase-3 specificity. We demonstrate its wider applicability by deorphanizing HERV-K(HML-2) protease and CPAF, which are both proteases with previously unknown specificity. Furthermore, we determine the canonical specificity profile of ADAMTS proteases.