Molecular Cloning of cDNA for Rat Cathepsin C
CATHEPSIN C, A CYSTEINE PROTEINASE WITH AN EXTREMELY LONG PROPEPTIDE*

A cDNA for rat cathepsin C (dipeptidylaminopeptidase I) was isolated. The deduced amino acid sequence of cathepsin C comprises 462 amino acid residues: 28 NH2-terminal residues corresponding to the signal peptide, 201 residues corresponding to the propeptide, and 233 COOH-terminal residues corresponding to the mature enzyme region. Four potential glycosylation sites were found, three located in the propeptide region and one in the mature enzyme region. The amino acid sequence of mature cathepsin C has 39.5% identity to that of cathepsin H, 35.1% to that of cathepsin L, 30.1% to that of cathepsin B, and 33.3% to that of papain. Cathepsin C, therefore, is a member of the papain family, although its propeptide region is much longer than those of other cysteine proteinases and shows no significant amino acid sequence similarity to any other cysteine proteinase.

Lysosomal cysteine proteinases play important roles in intracellular protein degradation, antigen presentation, tumor metastasis, and muscular dystrophy (1-3), and their inhibition causes various pathogenic states. For example, the injection of leupeptin, a cysteine proteinase inhibitor, or chloroquine, a general lysosomal enzyme inhibitor, into young rat brain induces the subsequent formation of ceroid-lipofuscin-like granular aggregates (4). Thus, cysteine proteinases participate in various biological actions and in homeostasis.
Recent cDNA cloning experiments have revealed the amino acid sequences of the cysteine proteinases cathepsins B, H, and L (5-7). As their amino acid sequences are similar to one another and also to that of papain, they belong to the papain family of cysteine proteinases (8). From SDS-PAGE analysis, they exist in single-chain forms estimated as 30 kDa in size and double-chain forms, in which the single chain is processed to 25- and 5-kDa chains. On the other hand, cathepsins C, J, and K (9, 10) and bleomycin hydrolase (11) are oligomeric in structure and are larger than cathepsins B, H, and L; from gel filtration profiles cathepsins C and J are estimated as 200 kDa in size, whereas cathepsin K and bleomycin hydrolase are 250 and 650 kDa, respectively. Recently, a cDNA isolated from Trypanosoma brucei was found to encode a cysteine proteinase with a long cysteine- and proline-rich COOH-terminal extension peptide (12) similar to a cold-induced proteinase in tomatoes (13). These facts suggest that the papain family includes many proteinases from various sources.
* This work was supported in part by Grant-in-Aid for Encouragement of Young Scientists 02770133, Grant-in-Aid for Cancer Research 02151027, Grant-in-Aid for Co-operative Research 63304034, and Grant-in-Aid for Scientific Research 02454159 from the Ministry of Education, Science and Culture of Japan, Grant 63-2 from the National Center of Neurology and Psychiatry of the Ministry of Health and Welfare, Japan, and by Aid for Project Medical Research Grant 9002 from Juntendo University. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Cathepsin C, or dipeptidylaminopeptidase I (EC 3.4.14.1), is a lysosomal cysteine proteinase involved in cellular protein degradation. Other postulated functions of this enzyme include cell growth (14), platelet factor XIII activation (15), and neuraminidase activation (16), in addition to the other roles already mentioned. Cathepsin C is distinct from cathepsins H and L in the following properties: cathepsin C requires chloride ion to express significant enzyme activity (17), it forms oligomeric structures estimated at 200 kDa from gel filtration analysis (10), and it is inhibited more slowly than cathepsins H and L by E-64 (N-[N-(L-3-trans-carboxyoxirane-2-carbonyl)-L-leucyl]agmatine) (18).
In this study, we isolated a cDNA for rat cathepsin C and determined its nucleotide sequence. The deduced amino acid sequence reveals that cathepsin C has an extremely long propeptide and apparently belongs to the papain family.

Isolation and Sequencing of cDNA for Rat Cathepsin C-
Four independent cDNA clones were isolated from a rat kidney cDNA library (4.0 × 10^5 plaques). Restriction mapping and Southern hybridization analyses (23) showed these cDNA clones to overlap one another (data not shown). Fig. 1 shows the restriction map of cathepsin C. Since the nucleotide sequences (nucleotides -56 to 806 in Fig. 2) on the 5' side of the EcoRI site of fragments λC6, λC11, λC14, and λC15 were identical, only the nucleotides on the 3' side of the EcoRI site of λC14 were determined. The nucleotide sequence of the cDNA for rat cathepsin C comprises 1842 bp containing 49 bp in the 5'-noncoding region, 1389 bp in the coding region, and 404 bp in the 3'-noncoding region (Fig. 2).
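The length bookkeeping above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the variable names are ours, and the assumption that the 1389-bp coding region includes a single stop codon is inferred from 1389 = 3 × 462 + 3 rather than stated in the text.

```python
# Region lengths (bp) as reported for the rat cathepsin C cDNA.
five_prime_noncoding = 49
coding_region = 1389
three_prime_noncoding = 404

total_bp = five_prime_noncoding + coding_region + three_prime_noncoding

# Assumption (not stated in the text): the coding region includes
# one 3-bp stop codon, so the remaining bases encode amino acids.
STOP_CODON_BP = 3
encoded_residues = (coding_region - STOP_CODON_BP) // 3

# Domain sizes (residues) reported for preprocathepsin C.
domains = {"signal_peptide": 28, "propeptide": 201, "mature_enzyme": 233}
domain_total = sum(domains.values())

print(total_bp, encoded_residues, domain_total)
```

Under these assumptions the three region lengths sum to the reported 1842 bp, and the coding region accounts for the 462 residues of preprocathepsin C.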
The ATG codon numbered as 1 must be the translation initiation site, since the nucleotide sequence around it ...
Portions of this paper (including "Materials and Methods," part of "Results," Figs. 1s-4s, and Tables 1s-3s) are presented in miniprint at the end of this paper. Miniprint is easily read with the aid of a standard magnifying glass. Full size photocopies are included in the microfilm edition of the Journal that is available from Waverly Press.

In the 3'-noncoding region, a poly(A)+ addition signal was found at nucleotides 1773-1778 (32). Preprocathepsin C comprises 462 amino acid residues containing three functional domains. The 28 NH2-terminal amino acid residues correspond to the signal peptide, in accordance with their hydropathy and the predicted signal peptide processing site (33); the next 201 residues correspond to the propeptide region; the 233 COOH-terminal residues correspond to the mature enzyme region. The amino acid sequences derived from peptide fragments of the purified cathepsin C (see the Miniprint) were found in the deduced amino acid sequence of the cDNA. NH2-terminal analysis showed that purified cathepsin C is processed to a two-chain form: the amino acid sequences of the NH2 termini are LPESWDW and DPFNPFEL (see the Miniprint). Calculated from the amino acid sequence, the molecular weights of preprocathepsin C, procathepsin C, cathepsin C, and the heavy and light chains of cathepsin C are 52234.36, 49342.95, 26056.93, 18362.32, and 7712.63, respectively. Four potential N-glycosylation sites were found: three in the propeptide region and one in the mature enzyme region. Some of these sites must be glycosylated, since the lysosomal enzymes bear mannose 6-phosphate as a recognition marker (34, 35).
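The calculated molecular weights for the mature enzyme and its two chains appear mutually consistent if one assumes that processing to the two-chain form hydrolyses a single peptide bond and therefore adds one water molecule. The short sketch below checks that relation; the rounded mass of water (18.02) and the function name are our assumptions, not values or methods taken from the paper.

```python
# Calculated molecular weights reported in the text.
mature = 26056.93
heavy_chain = 18362.32
light_chain = 7712.63

# Assumption: average molecular mass of water, rounded to two decimals.
WATER = 18.02


def mass_after_single_cleavage(intact_mass: float) -> float:
    """Combined mass of the two fragments after hydrolysing one peptide bond."""
    return intact_mass + WATER


expected_two_chain_total = mass_after_single_cleavage(mature)
observed_two_chain_total = heavy_chain + light_chain

# Absolute discrepancy between the two totals, rounded to 0.01.
discrepancy = abs(round(observed_two_chain_total - expected_two_chain_total, 2))
print(discrepancy)
```

A discrepancy near zero indicates that the heavy and light chains together account for the whole mature enzyme region, with no residues lost at the cleavage site, under the single-cleavage assumption.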
A search for homology in the National Biomedical Research Foundation protein sequence data bank revealed no significant similarities to other proteins except cysteine proteinases.
RNA Blot Hybridization Analysis-
RNA blot hybridization analyses revealed that two sizes of mRNA for cathepsin C, 2.1 and 2.7 kb, are expressed in all tissues (Fig. 3). Judging from the nucleotide sequence length of the cDNA and the ratio of the two sizes of mRNA, the isolated cDNAs for cathepsin C probably correspond to the 2.1-kb transcript. The cDNA corresponding to the 2.7-kb mRNA could not be isolated from the cDNA library. Results from genomic Southern hybridization analyses under stringent and less stringent conditions show that cathepsin C is encoded by a single-copy gene (data not shown). Further, the fact that there is no difference in the ratio of the two transcripts in any of the tissues examined shows that the second transcript was generated from another poly(A)+ addition signal located in the downstream region of the mRNA rather than by alternative splicing. The mRNA levels for cathepsin C are high in liver, spleen, small and large intestine, lung, and kidney, moderate in esophagus, stomach, and heart muscle, and low in the submandibular gland, aorta, and skeletal muscle. Only small amounts of mRNA for cathepsin C exist in brain, pancreas, adrenal gland, and testis. These results support the observation that cathepsin C is involved in cell growth (14) and suggest that the enzyme plays an important role in the alimentary tract.

DISCUSSION
In the mature enzyme region, the amino acid sequence of cathepsin C shares 39.5, 35.1, 30.1, and 33.3% identity with those of cathepsins H, L, B, and papain, respectively (Fig. 4A). In particular, around the active site cysteine, histidine, and asparagine residues, the amino acid sequence of cathepsin C is very similar to those of cathepsins H, L, B, and papain. Thus, cathepsin C belongs to the papain family. The tyrosine residue next to the active site cysteine in cathepsin C replaces the tryptophan residue that is well conserved among other cysteine proteinases as part of the hydrophobic backbone (36). This tyrosine substitution, therefore, may affect substrate specificity.
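For readers unfamiliar with how identity percentages of this kind are obtained, the following is a minimal sketch of one common convention: count identical residues over the aligned columns that contain no gap. The paper does not state which denominator was used, so this convention is an assumption, and the two short sequences are hypothetical toy strings, not the alignment of Fig. 4A.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over ungapped columns of a pairwise alignment.

    Assumption: columns with a gap in either sequence are excluded
    from both the numerator and the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


# Hypothetical toy alignment (not real cathepsin sequences).
toy_a = "CGSCYAF-ST"
toy_b = "CGSCWAFSST"
toy_identity = percent_identity(toy_a, toy_b)
print(f"{toy_identity:.1f}% identity")
```

Other conventions (for example, dividing by the length of the shorter sequence or by the full alignment length) give somewhat different percentages, so values from different studies are comparable only when the same denominator is used.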
Some amino acid sequence similarities have been reported for the propeptide regions of cysteine proteinases (7). The cathepsin C propeptide region, however, shows no sequence similarity to any other cysteine proteinase except cathepsin H (Fig. 4B). The region of cathepsin H that is similar to cathepsin C is different from the region that is similar to the ... further expression studies using deleted- and site-directed mutants of the cDNA for cathepsins C and H in Escherichia coli, such as have been done with human cathepsin L (37), will provide clues to the function of this region.
The amino acid sequence of a major peptide (E) (see Table 2s, Miniprint) in the propeptide region (from -109 to -102 in Fig. 2) generated from purified cathepsin C by V8 proteinase digestion indicates that purified cathepsin C contains a propeptide, as is the case for human cathepsin H (38). We raised an antibody against a synthetic peptide corresponding to the propeptide region of cathepsin C (-109 to -100 in Fig. 2). Materials reacting with this antibody were detected in all preparations of the purified cathepsin C. This finding will be discussed further in a separate paper.³
Recent cDNA cloning has revealed that cysteine proteinases can be divided into subgroups based on their primary structures. The two major superfamilies have diverged from different ancestral genes: the papain superfamily, consisting of papain, cathepsins B, H, and L, and calpain (39), and the superfamily that includes viral cysteine proteinases that resemble trypsin-like serine proteinases rather than papain-like cysteine proteinases, even though they have a cysteine residue at the active site (40). The papain superfamily can be divided into two families, the papain family and the calpain family. The papain family includes many proteinases: papain, cathepsins B, H, and L, and cysteine proteinases from Schistosoma mansoni (41), T. brucei (12), tomato (13), and rice seed (42).
Although cathepsin C differs in its oligomeric properties (10), slow inhibition by E-64 (18), and an extremely long propeptide in its primary structure, it is apparently a member of the papain family as discussed above.
This paper reports for the first time the structure of procathepsin C. Although cathepsin C has characteristics different from those of other cysteine proteinases, it is still a member of the papain family. The function of the long propeptide of cathepsin C is still unknown, but may provide a ...
³ D. Muno and E. Kominami, manuscript in preparation.
