Predictions of Cleavability of Calpain Proteolysis by Quantitative Structure-Activity Relationship Analysis Using Newly Determined Cleavage Sites and Catalytic Efficiencies of an Oligopeptide Array*

Calpains are intracellular Ca2+-regulated cysteine proteases that are essential for various cellular functions. Mammalian conventional calpains (calpain-1 and calpain-2) modulate the structure and function of their substrates by limited proteolysis. Thus, it is critically important to determine the site(s) in proteins at which calpains cleave. However, the calpains' substrate specificity remains unclear, because the amino acid (aa) sequences around their cleavage sites are very diverse. To clarify calpains' substrate specificities, 84 20-mer oligopeptides, corresponding to P10-P10′ of reported cleavage site sequences, were proteolyzed by calpains, and the catalytic efficiencies (kcat/Km) were globally determined by LC/MS. This analysis revealed 483 cleavage site sequences, including 360 novel ones. The kcat/Kms for 119 sites ranged from 12.5–1,710 M−1s−1. Although most sites were cleaved by both calpain-1 and −2 with a similar kcat/Km, sequence comparisons revealed distinct aa preferences at P9-P7/P2/P5′. The aa compositions of the novel sites were not statistically different from those of previously reported sites as a whole, suggesting calpains have a strict implicit rule for sequence specificity, and that the limited proteolysis of intact substrates is because of substrates' higher-order structures. Cleavage position frequencies indicated that longer sequences N-terminal to the cleavage site (P-sites) were preferred for proteolysis over C-terminal (P′-sites). Quantitative structure-activity relationship (QSAR) analyses using partial least-squares regression and >1,300 aa descriptors achieved kcat/Km prediction with r = 0.834, and binary-QSAR modeling attained an 87.5% positive prediction value for 132 reported calpain cleavage sites independent of our model construction. These results outperformed previous calpain cleavage predictors, and revealed the importance of the P2, P3′, and P4′ sites, and P1-P2 cooperativity. 
Furthermore, using our binary-QSAR model, novel cleavage sites in myoglobin were identified, verifying our predictor. This study increases our understanding of calpain substrate specificities, and opens calpains to “next-generation,” i.e. activity-related quantitative and cooperativity-dependent analyses.

(C2), which are called the "conventional" calpains (in this paper, "calpains" refers to the conventional calpains unless otherwise indicated). C1 and C2 each form a heterodimer composed of a larger (~80 kDa) catalytic subunit (CAPN1 or CAPN2) and a common smaller (~28 kDa) regulatory subunit (CAPNS1). Because CAPN1 and CAPN2 have more than 60% aa sequence identity, C1 and C2 show highly similar, if not identical, substrate specificities (1, 4-6). They generally function by limited proteolysis, cleaving a few peptide bonds in their substrate protein, which changes the protein's function and/or structure to modulate cellular functions. Thus, calpains are called "modulator proteases." To understand the calpains' physiological functions, it is essential to clarify their substrate specificity/selectivity, i.e. what proteins calpains proteolytically process and at which position(s).
There have been many attempts to define calpains' substrate specificities. The initial studies, focusing on whether specific proteins are proteolyzed or not (6-9), were followed by more detailed studies using substrate cleavage site amino acid (aa) sequence alignment and a position-specific scoring matrix (PSSM) method (10-12). Next, peptide libraries were used (13, 14). For example, Cuerrier and his colleagues used a peptide sequencing method to quantitatively determine calpains' preference for each aa residue (aar) at each position relative to the cleavage site (13), and developed a sensitive oligopeptidyl fluorescence substrate, H-E(EDANS)PLFAERK(DABCYL)-OH. More recently, machine-learning methods have been applied to the construction of calpain cleavage predictors (15-20).
However, PSSM-based and machine-learning methods have so far yielded rather limited accuracy in predicting calpain cleavage sites. This is because, unlike with caspases and granzymes (19), there appears to be no explicit rule for calpain specificity, and the number of known aa sequences for calpain cleavage sites is rather small (<200, before this study). Furthermore, the cleavage efficiency of most of the reported calpain cleavage sites is unknown, and the cleavage patterns change depending on the reaction conditions.
Notably, the most important question in identifying cleavage specificity is not whether a protein is cleaved. Technically, all peptide bonds can be cleaved by calpains (or any protease) with some efficiency, i.e. kcat/Km > 0, which depends on the cleavage conditions. In other words, the apparent "cleavability" of a bond is defined by the threshold kcat/Km determined by both the proteolytic conditions and the detection sensitivity. Therefore, the ultimate cleavage predictor should predict a kcat/Km value for each peptide bond within a given protein sequence under given cleavage conditions.
To address the above points, here we sought to identify calpain cleavage-site sequences through literature searches and by performing in vitro digestions of a concentrated, synthesized oligopeptide library. Using the identified cleavage-site sequences, we performed quantitative structure-activity relationship (QSAR) analyses, which revealed the important P- and P′-site positions (the positions N- and C-terminal to the cleavage site, respectively) on which to focus. Although the reaction conditions used in this study were slightly different from those used in typical calpain kinetics studies, several verification analyses confirmed that our results successfully elucidated the calpains' substrate specificity.

EXPERIMENTAL PROCEDURES
Peptides and Calpains-From 116 reports, 147 calpain substrates and their 420 cleavage-site sequences (after excluding two overlapping sequences from a total of 422) were collected (supplemental Table S1). The substrate proteins were numbered SB0001 to SB0150 (substrates reported multiple times under different conditions were assigned different SB numbers; see supplemental Table S1), among which SB0001-SB0090 were already reported in our previous paper (15). Next, a database, CaMP DB (Calpain for modulatory proteolysis database (21), http://www.calpain.org/), was constructed from the collected information, including all the cleavage sites, secondary structures, and references.
From the above collected site sequences, 86 were selected according to their position in the substrate protein (to have 10 or more P- and P′-site aars) and aa composition (to be not too hydrophobic), and the 20 aars surrounding the reported calpain cleavage site (10 on each side of the site) were selected for oligopeptide sequence preparation (there were several exceptions; see supplemental Table S2). Eight (ID031, 34, 36, 37, 55, 72, 73, and 84) of these 86 sequences were randomly selected, scrambled, and used as control peptides (ID087-94) (supplemental Table S2).
In preliminary experiments, most of the peptides were detected as either or both of the following: (1) uncleaved (i.e. both N- and C-termini capped with Ac and DKP, respectively [both-capped, BC]) peptides that were synthesized correctly and/or in truncated form; (2) fragments cleaved as previously reported (Rp), and/or not as reported (i.e. novel, Nv). The time course of the signals indicated that the optimal reaction time for most of the peptides was between 10 and 20 min (data not shown). Thus, the reaction time was set to 15 min for subsequent experiments. To maximize the number of cleaved peptides, the peptide concentration was increased to 1.7 mM (20 µM each) in the reaction mixture. After testing several combinations of peptides and calpains, we decided to use 0.3-1.7 mM (3.3-20 µM each) peptides and 2.5 µM calpains in the following kinetics study. The ratio of calpain to each peptide was high compared with typical calpain proteolysis experiments. The most likely reason for the high calpain requirement is that the calpain activity was inhibited by impurities derived from the peptide synthesis process and by the high ionic concentration of the reaction mixture, which resulted from the need for excess buffer to neutralize acetic acid present in peptide solvents. Although these assay conditions may not have been optimal for peptides with high-end and low-end kcat/Km values, they appeared to be appropriate for most of the peptides (see supplemental Fig. S1).
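As a rough illustration of why a fixed 15-min digestion with 2.5 µM calpain cannot suit every peptide equally, the expected fraction cleaved can be sketched for different kcat/Km values. The sketch below assumes simple pseudo-first-order decay (substrate concentration well below Km, no competition among peptides in the mixture, no enzyme inactivation); these are simplifying assumptions made here for illustration, not conditions verified in this study. The function name `fraction_cleaved` and the intermediate value of 100 M−1s−1 are hypothetical; the two extreme values are the limits of the measured range.

```python
import math


def fraction_cleaved(kcat_over_km, enzyme_molar, seconds):
    """Fraction of a substrate cleaved after `seconds`.

    Assumes pseudo-first-order decay ([S] << Km), so that
    [S](t) = [S]0 * exp(-(kcat/Km) * [E] * t).  This is an assumption of
    the sketch, not a statement about the actual assay.
    """
    return 1.0 - math.exp(-kcat_over_km * enzyme_molar * seconds)


ENZYME_MOLAR = 2.5e-6    # 2.5 µM calpain, as used in the kinetics study
REACTION_SECONDS = 900   # 15 min reaction time

if __name__ == "__main__":
    # 12.5 and 1,710 M^-1 s^-1 are the extremes of the measured range;
    # 100 M^-1 s^-1 is an arbitrary intermediate value for illustration.
    for k in (12.5, 100.0, 1710.0):
        pct = 100.0 * fraction_cleaved(k, ENZYME_MOLAR, REACTION_SECONDS)
        print(f"kcat/Km = {k:7.1f} M^-1 s^-1 -> {pct:5.1f}% cleaved")
```

Under these assumptions, a site at the low end of the range is only a few percent cleaved after 15 min, whereas a site at the high end is almost fully consumed, which is consistent with the remark that the conditions may not be optimal at either extreme.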
Among the clearly detected proteolytic fragments obtained by cleavages at Rp sites, oligopeptides corresponding to 78 C-terminal and 26 N-terminal fragments were newly synthesized with C-terminal DKP or N-terminal Ac modification, respectively, as described above (supplemental Table S3, ID0XX-Rp-C or -N series). Peptides corresponding to 39 (C-terminal) and 15 (N-terminal) fragments obtained by cleavages at Nv sites were also synthesized (supplemental Table S3, ID0XX-Nv series). These peptides (158 total peptides, named "P158mix") were used to quantify the generated calpain-cleaved peptides in the following kinetics experiments.
Peptide Proteolysis and MS Analysis-P87mix (for final concentrations, see Table I) in 100 mM HEPES (pH 8.5) and 1 mM TCEP was denatured at 60°C for 1 h, and digested with 2.5 µM C1 or C2 in the presence of 1 mM or 5 mM CaCl2, respectively, at 30°C for 15 min in a 20-µl volume (see Fig. 1 for the overview of the experiments). As a standard for quantification of the cleaved peptides, P158mix (each peptide at 5 µM) was incubated under the same conditions, without calpains. After the reaction, TCEP, SDS, triethylammonium bicarbonate, and three control peptides for iTRAQ™ standardization (C001: NH2-EFILRVFSEKRNL-COOH, Mr 1,649.93; C002: NH2-DFCIRVFSEKKAD-COOH, Mr 1,556.77; C003: NH2-DFVLRFFSEKSAG-COOH, Mr 1,501.76) were added to final concentrations of 4.36 mM, 0.0952%, 167 mM, and 0.5 µM each, respectively, and denatured at 60°C for 1 h.
Next, methyl methanethiosulfonate was added to a concentration of 8.33 mM; the reaction mixture was then incubated at room temperature for 10 min, and labeled with the iTRAQ™ 8-plex labeling kit (Sciex), according to the manufacturer's instructions (Table I). The resulting reaction mixture was subjected to 2D-LC-MALDI MS as described above. The same sample was also analyzed by 2D-LC/MS using the DiNa 2D nLC system and Sciex QSTAR Elite with NanoSpray™ ESI. MS and MS/MS spectra were acquired with Analyst QS Ver. 2.0 software (Sciex), using the standard parameters recommended by the manufacturer. Peptides were identified using ProteinPilot™ Ver. 4.5 with the following Paragon parameters: Sample Type: iTRAQ 8plex (Peptide Labeled); Cys Alkylation: MMTS; Digestion: None; Instrument: QSTAR Elite ESI or 4800; Special Factors: "N-Ac and C-DKP" or "N-Ac and C-DKP, cleavable" (see below); Species: None. "N-Ac and C-DKP" and "N-Ac and C-DKP, cleavable" were added by describing them in the ParameterTranslation.xml and ProteinPilot.DataDictionary.xml files of the ProteinPilot™ software (see supplemental Experimental Procedures for the description). The database was constructed as described below. A global false discovery rate (FDR) below 5% (normal condition) or 1% (stringent condition) was used to define significant data. Identified peptides were exported as PeptideSummary.txt for further data processing by Microsoft Excel Ver. 2010. Peptide structures and their proteolytic sites were assigned according to whether Ac and/or DKP was present (see supplemental Experimental Procedures).
First, the C-terminal 20 aars were selected from the proteome database entries that had 20 or more aars, resulting in 90,858 entries (1,817,160 aars). Among these entries, those similar to Core DB entries when reversed, i.e. entries whose reverse sequence contained a four-aa block included among the Core DB sequences, were eliminated, to construct "Hs50K DB" (50,330 entries, 1,006,600 aars). Next, forward sequences containing a four-aa block included in the Core DB were also eliminated, reducing the number of entries to 30,317. From the remaining entries, 4,000 were randomly selected, resulting in "Hs4K DB" (4,000 entries, 80,000 aars). "Core DB + Hs50K DB and FDR < 1%" and "Core DB + Hs4K DB and FDR < 5%" were used as the "stringent" and the "normal" condition, respectively. In this study, the reported results were obtained under the normal condition, because both conditions gave essentially the same results (see supplemental Fig. S5C).

Fig. 1. Overview of the experiments. ... (supplemental Table S1). These data were summarized in the CaMP (Cleavage site sequences from Calpain for Modulatory Proteolysis) database (DB) web site (A). Next, 86 sequences corresponding to the P10-P10′ of some of the above cleavage sites and 8 control scrambled sequences were selected for oligopeptide synthesis (P94mix) with the N- and C-termini capped by acetyl and DKP modifications, respectively (B). Shorter reference peptides corresponding to segments created by calpain cleavage were also synthesized (P158mix). Next, varying amounts of P87mix (7 peptides were excluded from P94mix because of insolubility and other reasons) were incubated with or without C1 or C2 at 30°C for 15 min (C). After the digestion, peptide solutions were labeled with iTRAQ™ reagents (D), and peptides that were cleaved or uncleaved (i.e. with both termini capped) were identified and quantified by liquid chromatography combined with MS (E). Finally, the v0 (initial velocity of the cleavage reaction) values were calculated from the iTRAQ™ signals, and 1/v0 was plotted against 1/[S] (where [S] was the substrate concentration) to determine the kcat/Km value for each cleavage (F). The identified peptide sequence was compared with the originally synthesized peptide sequence to determine the calpain proteolytic site (G) associated with the determined kcat/Km.

Kinetics-A kcat/Km value for each cleavage was calculated using Lineweaver-Burk and Eadie-Hofstee plots. A comparison of the results revealed that the former gave much better estimations than the latter (data not shown), so the Lineweaver-Burk method was used. For cleaved fragments, v0 was calculated as (I_n/I_121) × 5 × 10^−6 M/900 s, where I_121 was the iTRAQ™ signal intensity (standardized by those of control peptides) of iTRAQ™-121, which corresponded to 5 µM standard fragment peptides. In general, calculations using the full-length values showed considerably larger variance than those obtained using the fragments. This may have been due to the somewhat high variances among the iTRAQ™ signals, and to their narrow dynamic range, as well as to unknown reasons. As verified in supplemental Fig. S1, kcat/Km values could be calculated with moderate errors, and the amounts of full-length peptides remaining after the reaction were smoothly distributed, supporting the appropriateness of the reaction time (15 min) in this study. For the rationale for calculating kcat/Km, see supplemental Experimental Procedures.
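The Lineweaver-Burk estimate described under Kinetics can be sketched as follows. This is a minimal illustration of the textbook double-reciprocal relation, 1/v0 = (Km/Vmax)(1/[S]) + 1/Vmax with Vmax = kcat[E], so that kcat/Km = 1/(slope × [E]); the exact procedure used in this study is given in the supplemental Experimental Procedures and may differ. The function names, the unweighted least-squares fit, and the synthetic test values (kcat = 0.2 s−1, Km = 1 mM, intermediate substrate concentrations) are assumptions introduced here.

```python
def initial_velocity(i_n, i_121, standard_molar=5e-6, seconds=900.0):
    """v0 (M/s) of a cleaved fragment from its iTRAQ reporter intensity.

    i_121 is the intensity of the 5 µM standard channel; 900 s is the
    15-min reaction time.
    """
    return i_n / i_121 * standard_molar / seconds


def kcat_over_km(substrate_molar, v0, enzyme_molar):
    """Estimate kcat/Km (M^-1 s^-1) from a Lineweaver-Burk plot.

    Fits 1/v0 against 1/[S] by ordinary least squares; the slope equals
    Km / (kcat * [E]), so kcat/Km = 1 / (slope * [E]).
    """
    xs = [1.0 / s for s in substrate_molar]
    ys = [1.0 / v for v in v0]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return 1.0 / (slope * enzyme_molar)


if __name__ == "__main__":
    enzyme = 2.5e-6                      # 2.5 µM calpain
    true_kcat, true_km = 0.2, 1.0e-3     # hypothetical; kcat/Km = 200
    # 3.3 and 20 µM bracket the range used; the middle values are assumed.
    substrate = [3.3e-6, 6.7e-6, 10e-6, 20e-6]
    # Noise-free Michaelis-Menten velocities for the synthetic check.
    v0 = [true_kcat * enzyme * s / (true_km + s) for s in substrate]
    print(round(kcat_over_km(substrate, v0, enzyme), 1))
```

With noise-free synthetic data the fit recovers the kcat/Km used to generate the velocities; with real iTRAQ™ signals the low-concentration points dominate a double-reciprocal fit, which is one reason the estimates carry the errors discussed above.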
Determination of Cleavage Sites by N-terminal Sequencing and MS/MS Analysis-Human heart troponin T2 (Merck 648484-100UGA, ca. 30 pmol) and horse myoglobin (Sigma-Aldrich, M0630, ca. 60 pmol) were digested with C1 (Merck Millipore #208712, 0.9 pmol) in 50 µl of 100 mM Tris-HCl (pH 7.5), 1 mM DTT, and 5 mM CaCl2 at 30°C for 20 min. The digested samples were directly separated by SDS-PAGE, and the proteolyzed fragments were then blotted onto a PVDF membrane and subjected to peptide sequencing analysis (Apro-Science Inc., Tokushima, Japan). For sequence analysis by MS, the same digestion reactions were performed, terminated by adding a 3-fold volume of 7% TCA followed by incubation on ice for 30 min, spun (20,000 × g, 2°C, 10 min), and the supernatant was collected. An aliquot of the soluble fraction was desalted and concentrated to a few µl using Zip-Tip C-18, and analyzed by Sciex 5600+ with the Eksigent nanoLC system. The samples were analyzed in triplicate, the data were merged, and the peptide sequences were identified using ProteinPilot (Ver. 4.5) and Swiss-Prot DB (2015_08; 549,008 sequences; 195,692,017 aars) using the default parameters.
Determination of Cleavability of Synthetic Peptides by nLC-Peptides [tp1: Ac-QHLCGSHLVEALYLVCGERG (corresponding to ID014: INS); tp2: LEGNLYGSLFSVPSSKLLGN (ID040: GRIN2A), and tp3: GGGGYSASLHSEPPVYANLS (ID048: JUN)] for nLC analysis were synthesized and purified by Toray Research Center Inc. (Tokyo, Japan) with >98% purity (determined by the manufacturer from the ratio of peak areas in HPLC), and were dissolved in distilled water. Each peptide (initial concentration: 6.7-20 µM) was incubated with 1 pmol of either C1 (Merck Millipore #208712) or C2 in 50 µl of 50 mM HEPES (pH 7.5), 1 mM TCEP, and 1 or 5 mM CaCl2 at 30°C for 20 min. The digested sample was directly separated by DiNa nanoLC and monitored by a UV spectroscope MU701 (GL Sciences, Tokyo, Japan). Each peak sample was collected, and the contained peptide was determined by the Sciex 4800 MALDI MS system as described above. The areas of peaks were quantified using SmartChrom data analysis software Ver. 2.28J (KYA).
Statistics and QSAR Calculations-Statistical tests were performed using Excel 2010 (Microsoft), SAS Studio Release 3.1 of the SAS University Edition (SAS Institute Inc., Cary, NC), and Molecular Operating Environment (MOE, Ver. 2013.08, Chemical Computing Group Inc., Montreal, Quebec, and Ryoka Systems Inc., Tokyo, Japan). Analyses for 3D structures and model constructions using the partial least squares (PLS) and binary-QSAR methods were performed by MOE.
A binary-QSAR model was constructed by Auto-QSAR (binary) of MOE software using default parameters and 812 aa descriptors at specific positions. The aa descriptors used were 3 secondary structure descriptors for each position (total of 3 × 20 = 60) and those that showed the largest r^2 values between the measured kcat/Km values and the corresponding aa descriptor's values (see supplemental Tables S11-S13). In the binary QSAR analysis, all of the cleaved and uncleaved sequences without measured kcat/Km values were assigned values of 1 and 0 M−1s−1, respectively, and a cut-off value of 0.5 M−1s−1 was used so that all of the cleaved and uncleaved sequences were set as positive and negative samples, respectively. First, P10-P10′ aars, which contained many missing aars close to both ends, were used for the construction. This resulted in a classification that placed unusual emphasis on whether an aar was missing or not, which was considered artifactual. Thus, only cleavage sequences with no missing aars in the varying ranges (P10-P10′, P9-P9′, P8-P8′, …) were used and tested. The trajectory of backward variable selection was analyzed manually, and the most balanced model was selected as having a leave-one-out (LOO) cross-validated accuracy (XA) of more than 0.7 and the lowest number of descriptors. The best model was found using the range P6-P6′ with eight descriptors (see Table III).

A PLS-QSAR model was constructed by Auto-QSAR (PLS) in the MOE software using default parameters and the same 812 aa descriptors at specific positions as above. After the first analysis, the calculated outliers were excluded by MOE, and the analysis was performed again. The trajectory of backward variable selection was analyzed manually, and the most balanced model, with eight descriptors, was selected as having an r^2 value cross-validated with LOO (Xr^2) of more than 0.6 and the lowest number of descriptors (see Table V).
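The selection rule applied to the backward variable selection trajectory (smallest number of descriptors whose LOO cross-validated score exceeds a cut-off) can be written compactly. In this study the trajectory was inspected manually, so the sketch below is only an illustration of the stated criterion, not the procedure actually run in MOE; the `Candidate` type, the `pick_model` function, and all trajectory values are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    """One point on a backward variable selection trajectory."""
    n_descriptors: int
    cv_score: float  # LOO cross-validated accuracy (XA) or r^2 (Xr^2)


def pick_model(trajectory: Sequence[Candidate],
               min_score: float) -> Optional[Candidate]:
    """Return the candidate with the fewest descriptors whose
    cross-validated score exceeds min_score, or None if there is none.
    Ties in descriptor count are broken by the higher score."""
    eligible = [c for c in trajectory if c.cv_score > min_score]
    if not eligible:
        return None
    return min(eligible, key=lambda c: (c.n_descriptors, -c.cv_score))


if __name__ == "__main__":
    # Hypothetical trajectory; the values are invented for illustration.
    trajectory = [
        Candidate(20, 0.78),
        Candidate(12, 0.76),
        Candidate(8, 0.75),
        Candidate(5, 0.68),
    ]
    print(pick_model(trajectory, min_score=0.7))   # binary-QSAR cut-off
```

The same function covers both criteria in the text by changing the cut-off: 0.7 for the cross-validated accuracy of the binary model and 0.6 for the cross-validated r^2 of the PLS model.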
For the standard aa compositions, the following values taken from Swiss-Prot DB release 2012_9 were used: Ala, 8

Literature Search and Peptide Library Digestion Followed by MS Detection Identified 420 and 483 Calpain Cleavage Sites, Respectively-One of the major reasons for the previously incomplete accuracy of calpain cleavage predictors (15-20) is the small number of positive (i.e. cleavage site sequence) samples. To increase the number of samples, we first searched the literature extensively for calpain cleavage site sequences, and picked up 420 sites from 147 substrates (supplemental Table S1).
To ensure that the reported (Rp) cleavage sites would be cleaved in the oligopeptide context, a mixture of oligopeptides (P87mix library), each of which corresponded to one of the above cleavage sites, was proteolyzed by either C1 or C2. The digests were then analyzed by LC/MS for the global identification of cleavage site sequences. In this analysis, most of the Rp sites (i.e. mostly the middle of each peptide) as well as many novel (Nv) sites were identified. Therefore, for the kinetics study (see below), peptides corresponding to some of the identified cleavage fragments (104 Rp and 54 Nv sites) were synthesized (P158mix library, supplemental Table S3).
Finally, 418 cleavage sites (106 Rp and 312 Nv) were identified for C1, 360 (107 Rp and 253 Nv) for C2, and a total of 483 (123 Rp and 360 Nv) for both combined (Table II, supplemental Tables S7 and S8). In total, we found that 98 of the 131 Rp sites existing in the P87mix were proteolyzed by calpains (74 (out of 131) Rp sites were in the middle of the peptide [i.e. after position 10], and 70 of these were proteolyzed), even using oligopeptides (supplemental Table S4), indicating that the calpain substrate specificity was consistent and validating our experimental system.
All Cleavage Site Sequences Identified Using Oligopeptides Showed Similar Trends to Those Reported-To examine whether the Nv site sequences were distinct from those of Rp sites, the P10-P10′ sequences for 420 sites from the literature ("Lit" sites) were compared with those of the 360 Nv sites identified above (Figs. 2A-2C). When the aa frequencies of all of the aars at all positions (P10-P10′) were compared for Lit and Nv, they showed significant correlation (p = 2.1 × 10^−38), with a Pearson's correlation coefficient (r) of 0.59 (Fig. 2C). Although the r at each position varied from less than 0.2 to more than 0.8, they all showed significant correlation (p < 0.05, supplemental Fig. S2A(1)). In addition, 123 Rp sites and 360 Nv sites also showed significant correlation by the same analysis (supplemental Fig. S2A(2) and S2B).
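The comparison of position-specific aa frequencies between two sets of aligned cleavage-site sequences can be sketched as below. This is a minimal illustration, assuming that the frequencies are computed per position over the 20 standard aars, that missing residues are skipped, and that all position/aa pairs are pooled into one vector before computing Pearson's r; the function names and the toy sequences are hypothetical, and the statistics in this study were obtained with the software listed under Statistics and QSAR Calculations.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_frequencies(sequences):
    """Flat list of aa frequencies: for each aligned position, the
    fraction of sequences carrying each of the 20 standard residues.
    Characters outside the 20 residues (e.g. '-' for a missing aar)
    are ignored when counting."""
    length = len(sequences[0])
    freqs = []
    for pos in range(length):
        residues = [s[pos] for s in sequences if s[pos] in AMINO_ACIDS]
        for aa in AMINO_ACIDS:
            freqs.append(residues.count(aa) / len(residues) if residues else 0.0)
    return freqs


def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equal-length vectors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Toy 5-residue windows, invented for illustration only.
    lit_sites = ["ALKSP", "GLRTP", "ALKTA"]
    novel_sites = ["ALRSP", "SLKTP", "AVKTA"]
    r = pearson_r(position_frequencies(lit_sites),
                  position_frequencies(novel_sites))
    print(f"r = {r:.2f}")
```

For P10-P10′ windows the pooled vectors have 20 positions × 20 aars = 400 entries each; restricting the vectors to a single position gives the per-position r values.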
Therefore, we concluded that the calpains' preference for the Nv sites was not significantly different from that of Rp sites as a whole, although small differences in several specific aars were observed (data not shown). The slight differences were probably because the aa composition at each position of the P87mix peptides was somewhat different from the standard, as most of these peptides were selected to have a calpain cleavage site in the middle. The aa preference of all of the cleavage sites (Lit + Rp + Nv) is shown in Fig. 2D.
To test whether Nv sites were cleavable in the context of a whole protein, purified cardiac troponin T (TNNT2, corresponding to ID007) was digested by calpain. MS and peptide sequencing analyses revealed that two of the three identified Nv sites [C-terminal to Phe80 and Leu84 (corresponding to mouse Phe73 and Leu77, respectively)] were detected (supplemental Fig. S4). This experiment showed that some of the Nv sites, if not all, are cleaved by calpains in full-length proteins, and they have just not been reported yet.
These results strongly suggested that the calpains did not randomly proteolyze the oligopeptide mixture, but that all of the detected proteolytic sites strictly complied with an as-yet-unknown rule for calpain substrate specificity. Therefore, the limited proteolytic activity of calpains observed in vivo is likely to depend on secondary and/or higher-order structures.
There were 123 and 65 sites that were specifically cleaved by C1 and C2, respectively, and were uncleaved by the other (supplemental Fig. S3C). Comparison of the aa preferences of these C1- and C2-specific sequences showed that both had significantly lower correlation (r = 0.49, p < 0.001) than that for all sequences (Figs. 3A versus 3B), and that the above distinctive features at P9-P7, P2, and P5′ were emphasized in these sequences (Figs. 3C-3E, and supplemental Table S6). Although there appeared to be a much greater difference between the C1- and C2-specific sequences than among the total sequences, more samples are required to clarify this issue.
The kcat/Km Values for 119 Calpain Cleavage Sites Ranged From 10 to 2,000 M−1s−1-To shed further light on the calpain substrate specificity, the efficiency, i.e. the kcat/Km, for each cleavage site was determined. First, the decay of both-capped ("BC"; i.e. "uncleaved") peptides was analyzed (because of the presence of truncated synthetic peptides, the number of BC peptides was much larger than 87; see supplemental Table S9). Although it was possible to calculate kcat/Km, the data were so variable that many signals could not be used for the calculation. There are several possible reasons for this variability, including the large variance in iTRAQ™ 8-plex signals, the rapid degradation of efficiently cleaved peptides (making them inappropriate for quantification), and probably other unknown reasons. The calculated kcat/Km values ranged from 1 to 600 M−1s−1 (supplemental Table S9). These values correspond to the apparent kcat/Km of the total cleavages taking place in one peptide.
To obtain data for each cleavage site with more confidence, the cleaved peptides generated in the P158mix were quantified. In this case, the deviations in the data were mostly small, and 71 and 48 kcat/Km values were calculated for Rp and Nv cleavage sites, respectively, with modest standard deviations (Fig. 4A and supplemental Table S8). The kcat/Km values for different sequences ranged widely, from 10 to 2,000 M−1s−1. To examine whether the kcat/Km values of Rp and Nv sites were distinct, those in the same peptides were compared (supplemental Table S10). The average kcat/Km values were 259.8 M−1s−1 and 189.4 M−1s−1 for the Rp and Nv sites, respectively, which were not significantly different (p = 0.33), supporting the above conclusion that the Nv sites are not essentially different from Rp sites.
Most of these sites were cut by both C1 and C2 with a similar kcat/Km value (r = 0.92; Fig. 4B), indicating that C1 and C2 share highly similar cleavage site efficiencies as well as highly similar sequence dependences. A few peptides showed apparently different kcat/Km values for C1 and C2 (Fig. 4A). However, when we examined three peptides independently for their cleavability (tp1-tp3, see Experimental Procedures), no clear difference between C1 and C2 was observed (data not shown). It is possible that the relatively large deviations obtained using the iTRAQ™-MS method were responsible for the apparent differences between C1 and C2. Thus, although C1 and C2 have distinct aa preferences, we have not yet observed a clear difference in their cleavage efficiency. Further studies are required to clarify the distinct substrate specificities of C1 and C2.

Fig. 3. ... [supplemental Table S6] between C1 and C2 (red dots; some are labeled with their position, aa, and P). For the r at each position, see supplemental Fig. S3D. C, D, The P10-P10′ cleavage site sequences specific for C1 (C, 123 sequences) or C2 (D, 65 sequences) were aligned, and the occurrence of each aar at each position was shown as in Fig. 2. Several aars that did not occur at some positions, and are therefore not shown in (C) and (D), are listed in (E). Red bold underlining indicates that the aa's absence represented a significant difference (p < 0.05; yellow marked: p < 0.01, binomial probability).

Calpains Significantly Prefer Longer P-site Sequences (N-terminal Side of the Cleavage Site) Than P′-site Sequences (C-terminal)-To investigate whether the P- and P′-sites have distinct features, the positions of calpain cleavage sites in the oligopeptides were analyzed statistically. If the peptides were randomly cleaved by calpains without specificity, all of the positions should show an ~5% frequency (Fig. 5, gray line). However, the peptides were designed to contain a calpain cleavage site mostly in the middle (between positions 10 and 11), and, as expected, this site showed a significantly higher cleavage frequency (Fig. 5, black line between 10 and 11).
Unexpectedly, the site after position 11 showed a significantly higher cleavage frequency than expected (Fig. 5, dashed line between 11 and 12), and those after positions 12-14 had the same tendency as position 11, although the difference was not significant. On the other hand, sites N-terminal to position 8 and C-terminal to position 15 tended to be cleaved less frequently than expected. In summary, the sites between positions 10 and 14 are preferred by calpains, and those after the N-terminal 7 aars and before the C-terminal 5 aars are cut poorly by calpains. These asymmetric features of cleavability suggest that calpains require a longer P-site sequence than P′-site sequence. In addition, there was no difference in these trends between C1 and C2 in this analysis.
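The comparison of an observed cleavage frequency at one position with the frequency expected from random cleavage can be sketched with a Z-test for a proportion. The sketch assumes that a 20-mer offers 19 peptide bonds, so that random cleavage gives 1/19 (about 5.3%, in line with the ~5% quoted above) at each bond, and it uses a two-sided p value from the normal approximation; both choices, the function name, and the count of 60 cleavages in the example are assumptions for illustration, and only the total of 483 sites is taken from this study.

```python
import math


def one_proportion_z_test(successes, n, p0):
    """Z-test for a single proportion against a null value p0.

    Returns (z, two-sided p value) using the normal approximation with
    the standard error computed under the null hypothesis.
    """
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    n_bonds = 19                 # peptide bonds in a 20-mer (assumed null model)
    expected = 1.0 / n_bonds     # about 5.3% per bond under random cleavage
    total_sites = 483            # total cleavage sites identified
    observed_at_bond = 60        # hypothetical count at one bond
    z, p = one_proportion_z_test(observed_at_bond, total_sites, expected)
    print(f"expected = {expected:.3f}, z = {z:.2f}, p = {p:.2e}")
```

A positive z indicates a bond cleaved more often than expected under random cleavage and a negative z one cleaved less often, which is how the asymmetry between the P- and P′-sides shows up in such an analysis.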

Binary-QSAR Model Constructed with Cleavage Site Sequences Showed a Better Prediction Performance Than Previous Models-To predict calpain cleavage sites, we used a binary-QSAR model (see Discussion for advantages of this model) with the information gathered in the experiments above.
For aa descriptors, we used the AAindex (26), predicted secondary structures, and molecular descriptors in the MOE package (see supplemental Tables S11 and S12). Several ranges of sequences were tried, and P6-P6′ were used, because longer and shorter ranges did not perform well, probably because there were too many missing values and the sequences were too short, respectively. Of all the possible P87mix site sequences (1,703), 806 (314 cleaved and 492 uncleaved) sequences did not contain any missing values between P6 and P6′, and were used for training data to construct a predictor. The best-balanced binary-QSAR model achieved was constructed with eight descriptors, associated with P6, P2, and P1 (Table III). This predictor performed with a leave-one-out (LOO) accuracy of 74.9% (Table IV, versus P87 P6-P6′).
Fig. 5. Occurrence rates of the number of cleavage sites detected at each position were plotted along with those expected by random cleavages. Cleavages before and after position 11 showed significantly increased occurrences (P was calculated by the Z-test for a proportion).

To test the real prediction performance of the binary-QSAR model, 331 cleavage site sequences from the literature ("Lit" data set) that were not used in its construction were analyzed with our model. The 331 reversed sequences were used as negative control samples. The model had 63.1% total accuracy (Fig. 6A). It should be noted that our model achieved a positive prediction value (the ratio of true positives to those predicted as positive) of 84.0% when the classification threshold was set to 0.95 (Fig. 6A, thin line at threshold = 0.95 crossing the PPV line). This means that sites predicted by our binary-QSAR model with a threshold of 0.95 are very likely to be cleaved by calpains, at the cost of sensitivity.
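The trade-off between positive prediction value and sensitivity as the classification threshold is raised can be sketched as follows. The sketch assumes that the model returns a score between 0 and 1 for each sequence, that a sequence is called positive when its score is at or above the threshold, and that the reversed sequences serve as the negative set; the function names and all scores are hypothetical and do not reproduce the MOE output.

```python
def reversed_negatives(sequences):
    """Negative control sequences obtained by reversing each site sequence."""
    return [s[::-1] for s in sequences]


def classification_metrics(scores_pos, scores_neg, threshold):
    """Accuracy, PPV, and sensitivity at a given score threshold.

    scores_pos: model scores for true cleavage-site sequences.
    scores_neg: model scores for negative controls (e.g. reversed sequences).
    A sequence is predicted positive when score >= threshold (assumed rule).
    """
    tp = sum(s >= threshold for s in scores_pos)
    fn = len(scores_pos) - tp
    fp = sum(s >= threshold for s in scores_neg)
    tn = len(scores_neg) - fp
    predicted_pos = tp + fp
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "ppv": tp / predicted_pos if predicted_pos else float("nan"),
        "sensitivity": tp / (tp + fn),
    }


if __name__ == "__main__":
    # Hypothetical scores, invented for illustration only.
    pos = [0.99, 0.97, 0.60, 0.40]
    neg = [0.96, 0.30, 0.20, 0.10]
    for threshold in (0.5, 0.95):
        print(threshold, classification_metrics(pos, neg, threshold))
```

Because the reversed sequences are not guaranteed to be uncleavable, false positives counted this way are an upper bound, so the accuracy and PPV computed from such a negative set are conservative.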
Next, using 132 cleavage site sequences that were not used for training any of the previous calpain predictors, the predictors' performance was compared. The results showed that our model outperformed all other reported prediction methods (Tables IV (versus Lit) and S14; note that reversed sequences were not necessarily true negative samples, and might be cleavable, implying that the accuracy of our model would be better than the value shown).
Finally, to identify calpain cleavage sites in a novel substrate protein, the sequence of horse myoglobin (MYO) was subjected to our prediction analysis. Among 12 sites predicted (Fig. 7A, red horizontal bars), three sites (arrows) were in loop/unstructured regions according to the 3D structure of MYO. Identification of the fragments generated by the calpain digestion of MYO showed that two of these sites were cleaved by calpains in actuality (Fig. 7A, red arrows, 7B-7D).
The First PLS QSAR Model for Calpain Cleavage Site Efficiency-Finally, to predict quantitatively the cleavage efficiency of calpains for any peptide bond, a QSAR analysis of the 119 site sequences with kcat/Km values was performed using the partial least squares regression (PLS) method. Using the LOO method, the most balanced PLS model had eight descriptors associated with P10, P2, P1, P3′, and P4′ (Table V). This model showed a LOO r of 0.78 (total r = 0.83, after excluding three outliers) (Fig. 6B).
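As a rough illustration of how a LOO r can be obtained for a PLS model, the sketch below implements single-response PLS (PLS1, with NIPALS-style deflation) in plain Python and correlates the observed values with the leave-one-out predictions. It is not the MOE implementation used in the study; the function names and the toy descriptor matrix are assumptions made for illustration only.

```python
import math


def pls1_fit(X, y, n_components):
    """Fit PLS1; returns (x_mean, y_mean, [(w, p, q), ...])."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centered X
    f = [v - y_mean for v in y]                                # centered y
    comps = []
    for _ in range(n_components):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break  # no covariance left to model
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y by the component just extracted.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps


def pls1_predict(model, x):
    """Predict y for one descriptor vector by replaying the deflation."""
    x_mean, y_mean, comps = model
    e = [x[j] - x_mean[j] for j in range(len(x))]
    y_hat = y_mean
    for w, p, q in comps:
        t = sum(e[j] * w[j] for j in range(len(e)))
        y_hat += q * t
        e = [e[j] - t * p[j] for j in range(len(e))]
    return y_hat


def pearson_r(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((z - mb) ** 2 for z in b)
    return cov / math.sqrt(va * vb)


def loo_r(X, y, n_components):
    """Pearson r between observed y and leave-one-out predictions."""
    preds = []
    for i in range(len(X)):
        model = pls1_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_components)
        preds.append(pls1_predict(model, X[i]))
    return pearson_r(y, preds)


# Toy data with an exact linear relationship: y = 1 + 2*x1 - 3*x2.
X_toy = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8], [6, 4]]
y_toy = [1 + 2 * a - 3 * b for a, b in X_toy]
r_loo = loo_r(X_toy, y_toy, 2)
```

Because the toy response is an exact linear function of two descriptors, two components recover it fully; with real descriptor data the LOO r is expected to be lower than the r obtained on the training data, as reported above.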
Because the PLS model was constructed using the data from only 119 sequences from the P87mix data set, all the rest of the P87mix data (364 "cleaved" and 1,220 "uncleaved" sequences without kcat/Km values) were evaluated by the model. As shown in Table VI (versus P87 unused), the average predicted kcat/Km of the "cleaved" data set was significantly greater than that of the "uncleaved" set (180.8 M−1s−1 versus 114.4 M−1s−1, p = 0.00049). These results indicated that our PLS model appropriately describes at least a portion of the calpain cleavage efficiencies. In other words, these findings indicate that the selections of aa descriptors and their weights by the MOE program are appropriate and reflect calpains' substrate specificity.

DISCUSSION
First Report of the Comprehensive Measurement of kcat/Km Values-In this study, using an oligopeptide library and the iTRAQ™ proteomic method, 483 calpain cleavage sites were identified in addition to the 420 sites previously reported in the literature. Among the identified sites, 360 are novel, and the kcat/Km was determined for 119. These findings enabled us to analyze calpain substrate specificity not only precisely but also quantitatively. This is the first report to address calpain substrate specificity from the viewpoint of proteome-wide quantitative structure-activity relationships.
To date, the kcat/Km values for fewer than 10 calpain substrates have been reported (6,38), which range from 41.7 to 141 M−1s−1. These values are consistent with those obtained in this study. Because the proteolytic conditions used in this study were somewhat unusual (concentrated calpains and unpurified peptides), the kcat/Km values determined here may be underestimated compared with those obtained under more typical conditions. However, the smooth distribution of the kcat/Km values that we obtained (see Fig. 4A) indicates that at least the relative kcat/Km values among the 119 determined values hold true.
Calpains also show amidase-like activity and, surprisingly, the kcat/Km for hydrolysis of the NH2 group at the C terminus of substance P (RPKPQQFFGLM-NH2) is 10^6 M−1s−1 (39). This activity is mainly achieved by an ~10^4-fold increase in the kcat without a significant change in the Km (39), by an unknown mechanism. Although this amidase-like calpain activity may be involved in as-yet-unknown physiological functions, there has been no further report on it. We did not detect any C-terminal DKP hydrolyzing activity in this study (data not shown; see supplemental Experimental Procedures).
Confirmation that the Substrate Sequence Selectivity of Calpains is Rather Weak-Consistent with all previous PSSM-type studies of calpain substrate sequences, both C1 and C2 showed weak sequence selectivity in this study (see supplemental Fig. S3). In terms of the 3D structure (40-42), the substrate recognition by calpains is mainly determined by relatively weak interactions between an atom in the peptide bonds of a substrate and an atom of calpains' subsite residues. For example, Gly198 of CAPN2 (supplemental Fig. S6A, corresponding to Gly208 of CAPN1 (supplemental Fig. S6C)) interacts with the O (-2.0 kcal/mol) and NH (-1.7 kcal/mol) of the P1-P2 and P2-P3 peptide bonds, respectively, whereas Gly261 of CAPN2 (S6A, corresponding to Gly271 of CAPN1 (S6C)) interacts with the NH (-4.7 kcal/mol) of P1-P2.

In other words, most of the side-chains of the substrate residues are exposed to the solvent without forming a strong interaction with calpain atoms. These features, which are common to both C1 and C2, are in sharp contrast to caspases, which strongly interact with P1 and P4 Asp side chains (supplemental Fig. S6D). These weak interactions contribute to the calpains' recognition of highly divergent substrate sequences. Exceptions are the P2 and P3′ positions, where the side-chains of Leu and Pro, respectively, are deeply encompassed by the active site cleft of the calpains (supplemental Fig. S7). This point will be discussed further, below.

TABLE V Descriptors used in the partial least squares regression (PLS) model
For the values of aars for each descriptor, see supplemental Tables S11 and S12.

Existence of Many Nv Sites Suggests that Substrate Protein Cleavages By Calpains are Regulated By Both Primary and Higher-order Structures-The literature contains reports of 420 unique calpain cleavage sites in 147 substrate proteins. Most of these sites are cleaved in the context of a whole protein or part of a protein that is expected to have a proper 3D structure. On the other hand, the 483 sites identified in this study were in 20-mer peptides, which are unlikely to contain potentially cleavable sites made inaccessible by steric hindrance. Thus, the 360 Nv sites identified in this study are considered calpain-cleavable, not artifactual, sites that are not exposed in the context of a whole protein structure. The lack of significant differences in the aa preferences and kcat/Km values between the Rp and Nv sites supports this idea (see Fig. 2 and supplemental Table S10).
Therefore, most substrates have many sites that are potentially cleavable by calpains that escape cleavage when the substrate protein retains its higher-order structures. We thus conclude that the calpains' substrate specificity is defined by both primary and higher-order structures. The limited proteolysis by calpains that is often observed under physiological conditions probably reflects the fact that only extremely small amounts of calpains are activated in vivo.
Sequences Proximal to the Cleavage Sites Were Highly Similar for C1 and C2, and Both Preferred Longer Sequences in the P- than the P′-region-As in almost all previous reports, the aa sequence preferences around the cleavage sites for C1 and C2 were almost identical in this study, which is supported by the calpains' 3D-structural features, as described above. Surprisingly, however, detailed analysis revealed that the preferences for specific positions (P9-P7, P2, and P5′) were significantly different between C1 and C2 (Figs. 3C and 3D, and supplemental Table S5). Among them, the calpain aars most proximate to P8-P7 and P5′ are different between C1 and C2, i.e. Asp256, Ile257, and Leu260 of C1 are within 5 Å of Ser169-Thr170 (corresponding to P8-P7) of calpastatin, whereas the corresponding residues of C2 (Ser246, Ala247, and Ser250, respectively) are not (supplemental Fig. S8A); Glu172 of C2 and Met329 of C1 are close to Glu185 (P5′) of calpastatin, whereas the corresponding Gln182 of C1 and Gln319 of C2, respectively, are not (supplemental Fig. S8B). How these differences lead to distinct aa preferences is unknown at present. Moreover, there appears to be no significant difference in the P9- and P2-proximate aars between C1 and C2. To clarify the different substrate specificities of C1 and C2, further studies with larger sample sizes are required.
The cleavage positions showed asymmetric frequencies (see Fig. 5), suggesting that calpains require a longer segment of P-site than P′-site residues. The P10-P5 sites are mainly recognized by the calpain CBSW domain (19,40,41), which may play a crucial role in substrate recognition (see supplemental Fig. S7A; the right side surface corresponds to CAPN2's CBSW domain). These results are in concert with calpains' amidase-like activity, for which only the P-site region plays a role (39).
Binary-QSAR Analyses of Calpain Substrate Cleavages Suggest That Discrete Positions (P6, P2, P1) Determine "Cleavability"-Many attempts have been made to predict calpain cleavage sites, including studies using PSSM, support vector machine (SVM), multiple kernel learning (MKL), a form of hierarchical clustering, and other methods (12-20), each of which has advantages and disadvantages. Here, we used the binary-QSAR model, which uses Bayes' theorem. It is a robust method that is low in computational cost and high in performance. In addition, it is easy to interpret the relative importance of various factors using a binary-QSAR model (43,44).
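The Bayes' theorem step underlying such a classifier can be sketched for a single descriptor. This is a deliberately simplified illustration and not the binary-QSAR algorithm as implemented in MOE: the histogram binning, the pseudocount smoothing, the function names, and the toy descriptor values are all assumptions.

```python
def bin_index(value, edges):
    """Index of the histogram bin containing `value` (edges sorted ascending)."""
    return sum(value >= e for e in edges)


def fit_bins(values, labels, edges):
    """Count uncleaved (label 0) and cleaved (label 1) examples per bin."""
    counts = [[0, 0] for _ in range(len(edges) + 1)]
    for v, lab in zip(values, labels):
        counts[bin_index(v, edges)][lab] += 1
    return counts


def posterior_cleaved(value, counts, edges, pseudo=1.0):
    """P(cleaved | descriptor value) from Bayes' theorem.

    Class-conditional likelihoods are smoothed bin frequencies and the
    priors are the class frequencies in the training data.
    """
    n0 = sum(c[0] for c in counts)
    n1 = sum(c[1] for c in counts)
    k = len(counts)
    b = bin_index(value, edges)
    like1 = (counts[b][1] + pseudo) / (n1 + pseudo * k)
    like0 = (counts[b][0] + pseudo) / (n0 + pseudo * k)
    prior1 = n1 / (n0 + n1)
    prior0 = 1.0 - prior1
    return like1 * prior1 / (like1 * prior1 + like0 * prior0)


# Toy training data: one hydrophobicity-like descriptor value per site.
values = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
edges = [0.5]  # two bins: below 0.5, and 0.5 or above

counts = fit_bins(values, labels, edges)
p_high = posterior_cleaved(0.85, counts, edges)
p_low = posterior_cleaved(0.15, counts, edges)
```

A model with several descriptors would combine such per-descriptor evidence; how correlated descriptors are handled is outside the scope of this sketch.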
Our binary-QSAR model showed that the aa properties of only sites P6, P2, and P1 could reasonably predict the macro "cleavability" of a substrate by calpains (Table III, Fig. 8). That is, these sites are primarily involved in the cleavage efficiency of substrates by calpains with a certain hierarchy. Consistent with previous studies, P2 was the most important, and in the binary-QSAR model, P2 was associated with six descriptors, which are all related to hydrophobicity (NADH010102, BIOV880101 and 102, vsurf_W2 and _W3, and GUOD860101) (Table III). In brief, the model predicts that sequences with Leu at P2 will always be cleaved, regardless of P1 or P6; those with Ile, Val, Phe, Thr, Gln, Asn, Asp, Ser, Tyr, or Met at P2 are dependent on P1 and P6; and those with Glu, Lys, Trp, Cys, Gly, His, Ala, Arg, or Pro at P2 are predicted to be uncleaved regardless of P1 or P6 (Figs. 8A-8C).
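The three-way P2 rule summarized above can be written as a simple lookup. The sketch only encodes the residue groups listed in the text; the function names, the returned labels, and the example window are hypothetical, and the sketch does not reproduce the probabilities of the actual model.

```python
# Residue groups at P2 as listed in the text (one-letter codes).
P2_ALWAYS = {"L"}                  # Leu
P2_CONTEXT = set("IVFTQNDSYM")     # Ile, Val, Phe, Thr, Gln, Asn, Asp, Ser, Tyr, Met
P2_NEVER = set("EKWCGHARP")        # Glu, Lys, Trp, Cys, Gly, His, Ala, Arg, Pro


def p2_class(residue: str) -> str:
    """Qualitative prediction of the model based on the P2 residue alone."""
    residue = residue.upper()
    if residue in P2_ALWAYS:
        return "cleaved"
    if residue in P2_CONTEXT:
        return "depends on P1 and P6"
    if residue in P2_NEVER:
        return "uncleaved"
    raise ValueError(f"not a standard amino acid: {residue!r}")


def p2_of_window(window: str) -> str:
    """P2 residue of a 12-residue P6-P6' window (P6 is index 0, P2 is index 4)."""
    if len(window) != 12:
        raise ValueError("expected a 12-residue P6-P6' window")
    return window[4]


# Hypothetical window, for illustration only.
example_window = "AAAALFAAAAAA"
example_call = p2_class(p2_of_window(example_window))
```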
P6 and P1, which are associated with one descriptor each, contribute only moderately to the cleavability, compared with P2. At P1, a water-accessible surface area (probe radius of 1.4 Å) with a partial positive charge (ASA+) yields the maximum cleavage probability at 138 Å² (Asn, Gln, Lys, Phe, and Tyr are close to this value; Fig. 8D). Larger and smaller ASA+ values decrease the probability (by about 0.26 at maximum), suggesting that the condition at the S1 subsite of calpains is not very flexible; thus, Ile, Pro, or Leu at P1 markedly decreases cleavability.
A lower probability of a random coil secondary structure at P6 slightly increased the cleavability (by less than 0.2, Figs. 8A-8C, 8E). The 3-D structures of C2/calpastatin co-crystals revealed that calpains' S6 subsite is on the surface of the CBSW domain, and S3-S10 are almost aligned (19,40,41) (supplemental Fig. S7A). Therefore, our results support the idea that the secondary structure in the middle of this region may decrease a substrate's affinity for the CBSW domain by reducing flexibility, resulting in lower cleavability.

PLS QSAR Analyses Suggest That P3′-P4′ Most Affects Cleavage Efficiency, Followed By P2, P1, and P10-To our surprise, the P3′ and P4′ positions had the greatest effect on the kcat/Km values, which changed by ca. 1,000 M−1s−1, depending on the aars at P3′-P4′ (Fig. 9A).
The kcat/Km values predicted by our PLS QSAR model showed the best correlation with the partial specific volume and mass density of the aar at P3′ (Fig. 9C). This finding is consistent with the 3D-structural observations that the side-chain of P3′ has no specific interaction with calpain atoms, and is buried in a calpain surface cleft surrounded by a relatively hydrophobic environment (supplemental Fig. S7B).
P2 and P1 are also important (each kcat/Km change > 300 M−1s−1), and Leu, Ile, and Val at P2, which gave high cleavage probability in the binary-QSAR model, were also associated with high efficiency (Fig. 9B). On the other hand, Asn and Asp at P2, which moderately increased cleavability, showed rather low efficiency. The predicted kcat/Km values were dependent on the sum of the van der Waals surface area of aars at P2 and P1, where the atomic partial charge is less than −0.3 (Table V, PEOE_VSA-6). The preference of the P2 site was also related to the 3D structure; the P2 residue side-chain penetrates the cleft beside the calpain active site, making weak hydrophobic interactions with calpain atoms (supplemental Fig. S7A, green surfaces).
Notably, Pro at P1, which markedly lowered the cleavability, caused the greatest increase in efficiency among the 20 aars. This result suggests that most substrates with a Pro at P1 are not easily cleaved, whereas they are rather efficiently cleaved if the aars at other positions are favorable for cleavage. The accessible surface area, which is related to hydrophilicity, of the aar at P10 also contributes to the calpain cleavage efficiency, by 290 M−1s−1.
Cuerrier and his colleagues developed a highly sensitive fluorescent oligopeptide substrate, H-E(EDANS)PLFAERK(DABCYL)-OH (13), which is cleaved after Phe (F) (4). Our PLS model predicted that PLFAER for P3-P3′ would have a kcat/Km of 763 M−1s−1, which is almost the maximum value (822 M−1s−1) for all possible P3-P3′ peptides, consistent with the sensitivity of the PLFAER substrate and supporting the effectiveness of our PLS QSAR approach. Indeed, Leu-Phe at P2-P1 and Arg at P3′ was one of the best combinations of these positions (see Figs. 9A and 9B). PSSM-based methods count cleavages equally, regardless of the sequences' cleavage efficiencies, whereas the peptide sequencing-based method used by Cuerrier et al. (13) as well as our PLS method take the cleavability of each peptide into account. Thus, further PLS studies with more kcat/Km data should eventually reveal the ultimate substrate specificities of calpains.

Figure legend (fragment): ... Fig. 6B and Table V), the change in kcat/Km value (Δkcat/Km) was calculated as a function of the aars at P3′ and P4′ (A), or P2 and P1 (B). C, Δkcat/Km was plotted as a function of the BULH740102 (see below) value for each aar at P3′. A kcat/Km value for each aar was calculated by entering each of the 20 aars for P3′ into our PLS-QSAR model equation assuming that all other positions are fixed; i.e. for each aar, aa_i (i = 1-20; aa_1 = Ala (A), aa_2 = Cys (C), . . . aa_20 = Trp (W)), P3′(aa_i) = 62.6·E_ang(aa_i) + average[-8.60·Q_VSA_PNEG(aa_i, aa_j) (j = 1-20)]. The difference between the maximum (Ile) and minimum (His) values at the P3′ position was calculated to be 537 M−1s−1. Next, the most correlated aa descriptor was determined: first, r and ρ were calculated between the kcat/Km estimated above and each of the 1,315 aa descriptors; then, the descriptors were ranked independently for r and ρ, and the sums of the ranks of both were again ranked; the best descriptor was BULH740102 (r = 0.896, ρ = 0.869). ρ was used in addition to r, because ρ is robust against abnormal distributions with outliers, which are features of some aa descriptors, whereas r is greatly affected by the outliers. For the values of the aars of each descriptor, see supplemental Tables S11 and S12.
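The descriptor-ranking procedure described in the figure legend (rank the descriptors separately by two correlation coefficients, then rank the sums of the ranks) can be sketched as follows. The second, outlier-robust coefficient is assumed here to be Spearman's rank correlation, and the use of absolute values is also an assumption; the descriptor names and numbers are invented for illustration.

```python
import math


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def ranks(values):
    """Ascending ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(a, b):
    """Spearman's rank correlation: Pearson r computed on the ranks."""
    return pearson(ranks(a), ranks(b))


def best_descriptor(target, descriptors):
    """Name of the descriptor with the smallest sum of r-rank and rho-rank."""
    names = list(descriptors)
    r_abs = [abs(pearson(target, descriptors[n])) for n in names]
    rho_abs = [abs(spearman(target, descriptors[n])) for n in names]
    r_rank = ranks([-v for v in r_abs])      # rank 1 = strongest correlation
    rho_rank = ranks([-v for v in rho_abs])
    totals = [a + b for a, b in zip(r_rank, rho_rank)]
    return min(zip(totals, names))[1]


# Invented values: a "predicted efficiency" and three hypothetical descriptors.
target = [1, 2, 3, 4, 5]
toy = {
    "d_linear": [2, 4, 6, 8, 10],
    "d_outlier": [1, 2, 3, 4, 100],   # monotonic, but with one extreme value
    "d_noise": [3, 1, 4, 1, 5],
}
winner = best_descriptor(target, toy)
```

In the toy data, the descriptor with one extreme value keeps a perfect rank correlation while its Pearson r drops, which illustrates why a rank-based coefficient is useful alongside r.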
Taken together, our PLS QSAR analyses showed that substrates having (Leu or Ile) (Val, Pro, or Ala) at P3′-P4′ and P2-P1 are cleaved with high efficiency by calpains, and those with Glu or Asp at P3′, P2, and P1 are cleaved with the least efficiency. This information may be useful for mutation studies seeking to change calpain substrates to be uncleavable and/or to insert de novo calpain cleavage sites. Therefore, this study opens new avenues into the study of calpain substrates. Further elucidation of the context-dependent and quantitative structure-activity relationships of calpains and their substrates will improve our understanding of calpain substrate specificity.