Differential RNA splicing and post-translational cleavages in the human salivary proline-rich protein gene system.

The nucleotide sequences of cDNAs coding for human salivary proline-rich proteins (PRPs) were determined. Clones cP1 and cP2 contain repetitive regions in which sites for the restriction enzyme HaeIII occur repeatedly; they code for the precursors of acidic PRPs. Clones cP3 to cP7 contain repetitive regions in which BstNI sites occur repeatedly; they code for precursors of basic and glycosylated PRPs. The clones cP3, cP4, and cP5 are identical except that cP4 and cP5 are missing 399 and 459 base pairs, respectively, from the repetitive region of cP3. The sequences at these deletion end points are homologous to the consensus sequences of RNA splicing donor and acceptor sites. This strongly suggests that all three cDNAs are derived from the transcript of a single gene via differential RNA splicing. All of the precursor proteins share a feature--the N-terminal region, following the signal peptide, is acidic, while the remainder of the molecule, made of proline-rich repeats of about 21 amino acids, is basic. Each precursor can generate multiple PRPs by various post-translational cleavages on the carboxylic side of specific arginine residues. The data show how differential RNA splicing and post-translational cleavages could generate a large number of proteins, such as those found in saliva, from a much smaller number of genes.

A wide variety of proline-rich proteins (PRPs) form the major protein components of human saliva. They are characterized by a predominance of the amino acids proline (25 to 42%), glycine (16 to 22%), and glutamic acid/glutamine (15 to 28%). PRPs have been classified into three groups, acidic, basic, and glycosylated, based on their electrophoretic and chemical properties (Bennick, 1982). The amino acid sequences of two acidic proline-rich proteins (Wong and Bennick, 1980) and of some small basic proline-rich proteins (Kauffman et al., 1982; Saitoh et al., 1983a, 1983b) have been determined. Each contains repeated sequences that are homologous to the repeated sequences in the others. Although not all of the proline-rich proteins in saliva have yet been characterized, Kauffman and Keller (1979) have identified at least 11 basic proline-rich proteins in the saliva of a single individual.
* This work was supported by National Institutes of Health Grants AM 20120, GM 20069, and DE03658-20. This is paper 2810 from the Laboratory of Genetics, University of Wisconsin-Madison. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
DNA clones corresponding to two members of the human PRP gene family, clones PRP1 and PRP2, have recently been isolated and their partial nucleotide sequences have been determined. Both clones contain regions comprising nearly identical tandem repeats of 63 nucleotides, the decoded amino acid sequences of which are homologous to the repeated amino acid sequences found in proline-rich proteins. A probe was made from the repeated region of the PRP1 clone and used to examine EcoRI digests of DNA from several individuals. The resulting Southern blots were remarkable in showing six hybridizing bands in most individuals and in failing to show any correlation between the hybridizing bands in the tested individuals and the null alleles previously assigned to some of the individuals by studying their PRPs at the protein level. Maeda (1985) recently reviewed and extended the Southern blot data and re-examined the patterns of inheritance of the PRPs. She hypothesized that the data could be interpreted more economically in terms of six loci falling into two subfamilies rather than in terms of 13 loci with null alleles. She called the six loci PRH1, PRH2, PRB1, PRB2, PRB3, and PRB4. Her hypothesis accounted in part for the lack of correlation between the genomic DNA hybridization patterns and the postulated presence of null alleles by suggesting that the three acidic PRPs (Db, Pa, and PIF) are coded by three productive alleles at one of the six loci, rather than by a productive allele and a null allele at each of three loci.
Some of the complexities seen in the PRPs are also due to post-translational processing. This has been demonstrated for the acidic proteins, protein C and protein A (Wong et al., 1983), and is suspected in the case of the basic proteins (Bennick, 1982; Maeda, 1985).
To fully understand the structures of the PRPs, their mode of inheritance, and the evolution of the PRP multigene family, it is necessary to identify all of the genes and their products. With this purpose in mind we constructed a cDNA library from the total poly(A+) RNA of a single human parotid gland. From this library we have isolated cDNAs that code for PRPs and have sequenced seven different species. In this paper we present these sequences, which permit us to classify the PRP genes into two groups. One group codes for acidic proteins and contains a repetitive region where sites for the restriction enzyme HaeIII occur repeatedly. The other group codes for basic and glycosylated PRPs and contains a repetitive region where BstNI sites occur repeatedly. The data suggest that, at least in the transcript from one PRP gene, differential RNA splicing has occurred to give mRNAs of different lengths. The data also demonstrate that post-translational proteolytic cleavage generates multiple products from all of the PRP genes.

EXPERIMENTAL PROCEDURES
Material-A normal parotid gland, obtained as a by-product from a patient who had undergone head and neck surgery for cancer, was frozen in liquid nitrogen and stored at -80 °C. Saliva of the same individual was collected for protein typing (Azen et al., 1979; Denniston, 1980, 1981) and a sample of total DNA was isolated from his white blood cells (Poncz et al., 1982).
Purification of mRNA-A urea buffer solution (10 ml of 10 mM Tris buffer, pH 7.4, containing 1 mM EDTA, plus 0.2 g of SDS, 4.2 g of urea, and 0.2 g of NaCl) and phenol (10 ml) were frozen and ground into powder under liquid nitrogen. The frozen parotid gland tissue (weight 2 g) was added to the frozen buffer powder and both were ground together under liquid nitrogen. The nitrogen was then allowed to boil off, and the frozen powder was transferred to a 50-ml plastic centrifuge tube and thawed at room temperature with continuous gentle shaking. Chloroform (5 ml) was added to the resulting viscous solution, which was centrifuged, and the lower phase was discarded. The supernatant was extracted twice more with phenol/CHCl3 (2:1, v/v) followed by four washes with 2 volumes of CHCl3/isoamyl alcohol (24:1, v/v).
Solid CsCl (1.1 g/2.5 ml) was added to the aqueous solution, and the mixture was layered onto 5.8 M CsCl solution in six tubes and centrifuged at 30,000 rpm in a Beckman SW 41 rotor at 20 °C for 15 h. The supernatant was removed carefully; the RNA pellet was dissolved in the urea buffer solution (above) and extracted once with phenol, and the RNA was precipitated with ethanol. Poly(A+) RNA was obtained from the total RNA after two cycles of oligo(dT)-cellulose chromatography (Aviv and Leder, 1972).
Cell-free Translation-Poly(A+) RNA was translated in a cell-free rabbit reticulocyte lysate system purchased from Bethesda Research Laboratories in the presence of [35S]methionine or [14C]proline using the manufacturer's method. Reaction products were electrophoresed on 15% polyacrylamide slab gels containing 0.1% SDS and the gels were exposed at -70 °C for 24 h after treatment for fluorography (Bonner and Laskey, 1974).
Plasmid Construction-A cDNA library was prepared from the poly(A+) RNA by cloning in the plasmid vector pGFY279 (J. Brosius and O. Smithies, unpublished work) using the principles described by Okayama and Berg (1982). In outline, the poly(A+) RNA (0.75 μg) was annealed to a dT tail (approximately 80 residues) synthesized on the 3' end of the PstI site of the 4.6-kb EcoRI-PstI fragment prepared from pGFY279. cDNA was synthesized with 24 units of reverse transcriptase (Life Science Institute, Miami, FL) in 20 μl of reaction mixture containing 50 mM Tris/HCl, pH 8.3, 8 mM MgCl2, 30 mM KCl, 0.5 mM dithiothreitol, 2 mM concentrations of each deoxynucleotide, and 35 units of RNasin (Promega-Biotec, Madison, WI). The reaction was carried out for 90 min at 37 °C. Homopolymer tracts of dC (approximately 8 residues) were added to the 3' ends of the plasmid-cDNA-mRNA complex with terminal deoxynucleotidyl transferase (Pharmacia P-L Biochemicals, Milwaukee, WI). The product was digested with BamHI and ligated to the 0.5-kb PstI-BamHI fragment of pGFY279 having an oligo(dG) tail on its PstI end. Replacement of the RNA strand (equivalent to 0.2 μg of mRNA) with DNA was performed with 0.1 mM concentrations of each deoxynucleotide and 7 units of Klenow fragment (Pharmacia P-L Biochemicals) in 50 μl of 0.03 M Tris/HCl buffer, pH 7.6, containing 10 mM MgCl2 and 1 mM dithiothreitol at 15 °C for 1 h, followed by 1 h at 25 °C. Portions of the resulting material (equivalent to about 0.05 μg of poly(A+) RNA) were used to transform the Escherichia coli strains C600SF8 or HB101.
Hybridization-Colonies were grown for 16 h at 37 °C on nitrocellulose layered on agar medium containing ampicillin. Colony hybridization to 32P-labeled probes was performed by the procedure of Grunstein and Hogness (1975) in 6× SSC (0.9 M sodium chloride/0.09 M sodium citrate) with Denhardt's modification (0.02% bovine serum albumin/0.02% Ficoll/0.02% polyvinylpyrrolidone) at 68 °C for 20 h. Filters were washed under nonstringent conditions (twice in 3× SSC with 0.5% SDS at 68 °C for 1 h each), or under stringent conditions (once in 0.1× SSC with 0.02% SDS at 68 °C for 1 h).
Nucleotide Sequence Analysis-The nucleotide sequences of the cDNA inserts were determined by the method of Maxam and Gilbert (1977). All the presented sequences were determined in both directions.

Protein Typing and Characterization of Poly(A+) RNA-
The proline-rich proteins in the saliva of the individual whose parotid gland was used in this study had the following phenotypes: Pr1-1, Db-, Pa-, PIF+, Ps1, G1 1-3, PmF-, PmS-.
The poly(A+) RNA prepared from the donor parotid gland was electrophoresed in a 1.5% agarose gel, transferred to nitrocellulose, and hybridized to a probe (Hinf-980) made from the region that contains repeats in the human PRP gene clone PRP1. Two main bands of 1.2 and 0.78 kb were detected in addition to less strongly hybridizing bands of 1.4 and 0.9 kb (data not shown). This result indicates that PRP-related species of at least 1.4 kb should be represented in cDNA clones made from this RNA preparation.
Cell-free translation of the same poly(A+) RNA gave several bands showing a high proline content as judged by the ratio of proline/methionine incorporation in the presence of either [3H]proline or [35S]methionine. A major band was seen at approximately Mr 40,000, relative to globular protein size markers (data not shown). Other proline-rich bands were observed at 53,000, 45,000, 34,000, 32,000, and 30,000. Under the same conditions of SDS-polyacrylamide gel electrophoresis, Ps2 synthesized in vivo migrates at the rate expected for a globular protein of size 50,000, Ps1 at 44,000, and PmS at 24,000. Since many of the in vivo proteins are modified post-translationally by cleavage, by glycosylation, or by phosphorylation, it is not surprising that the products translated in the cell-free system could not be correlated with the in vivo products. Nonetheless, the ability to translate the poly(A+) RNA into PRPs indicates that the preparation contains translatable mRNA for PRPs. The data also suggest that the number of PRP precursor proteins may not be as large as the number of proteins found in saliva.

cDNA Clones Coding for PRPs Fall into Two Groups-
Clones of cDNA were prepared from the parotid gland poly(A+) RNA as described under "Experimental Procedures" and were screened by colony hybridization using the Hinf-980 probe. Ninety-five of 750 colonies gave a positive signal after a nonstringent wash. These could be classified into two main groups on the basis of their signals after more stringent washes. One group, about 72% of the initially positive clones, hybridized strongly to the probe even after a stringent wash; the other group, about 28% of the clones, no longer gave strong signals after the stringent wash. By sequencing some representatives of each group with insert sizes ranging from 400 to 1500 bp, we found that the two groups differ reproducibly in the structure of the repeating region they contain.
The clones that retained their signal with the Hinf-980 probe after the stringent wash have a repeated sequence very similar to that of clone PRP1, which has BstNI sites (CC(A/T)GG) spaced approximately every 63 bp. We have called these repeats BstNI repeats. The clones in the other group have a structure in which HaeIII sites (GGCC) occur repeatedly and we have named them HaeIII repeats. We find that both groups share a homologous 5' untranslated and secretory signal sequence (see below), and so we used a probe made from this region to isolate 19 full length cDNA clones for PRPs.
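The classification by repeated restriction sites can be illustrated with a short site scan. The following is a minimal sketch, not the procedure used in this work: the function names and the 63-bp toy unit are hypothetical and invented for illustration, and only the two recognition sequences (CC(A/T)GG for BstNI, GGCC for HaeIII) are taken from the text.

```python
import re

# Recognition sequences named in the text; everything else here is illustrative.
SITES = {
    "BstNI": "CC[AT]GG",
    "HaeIII": "GGCC",
}

def site_positions(seq, enzyme):
    """Return 1-based start positions of every recognition site in seq."""
    # The lookahead reports overlapping occurrences as well.
    pattern = re.compile("(?=(" + SITES[enzyme] + "))")
    return [m.start() + 1 for m in pattern.finditer(seq.upper())]

def spacings(positions):
    """Distances between consecutive site positions."""
    return [b - a for a, b in zip(positions, positions[1:])]

# Hypothetical 63-bp unit (one BstNI site plus filler), repeated three times.
# This is NOT sequence from any PRP clone.
toy_unit = "CCAGG" + "A" * 58
toy_repeat = toy_unit * 3

bstni = site_positions(toy_repeat, "BstNI")
print(bstni, spacings(bstni))
print(site_positions(toy_repeat, "HaeIII"))
```

Applied to a real insert, a run of near-constant spacings close to 63 for one enzyme and few or no sites for the other would correspond to the two groups described above.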
Overall Structure of cDNAs-The nucleotide sequences of the seven cDNAs (presented in Figs. 1-3 below) indicate that the PRP proteins are clearly related and derived from a common ancestor. They are composed of four domains (illustrated below in Fig. 4): a secretory signal sequence (S), N-terminal regions (N1 and N2), repeats BstNI (B) or HaeIII (H), and the C-terminal region. The secretory signal sequence (S) is composed of 16 predominantly hydrophobic amino acids and is similar to other secretory signal peptides (Watson, 1984). The 17-amino acid N-terminal region (N1) which follows is rich in acidic and hydrophobic residues. A second N-terminal region (N2) is unique to the HaeIII family proteins and has no counterpart in the BstNI family proteins. The repeated regions are made of about 21 amino acid residues and are rich in proline, but the number of repeats varies in the different cDNAs. A small C-terminal region (C, or C,) follows the repeats.
cDNA Clones with HaeIII Repeats Code for Acidic PRPs-The nucleotide sequences of 12 full length cDNA clones with HaeIII repeats were determined. Four correspond to the sequence cP1 and five correspond to the sequence cP2 shown in Fig. 1. Two of the 12 clones had rearrangements almost certainly created during the cDNA construction. One of these two is identical to cP2 except for the addition at the 5' end of 93 bp of extra DNA, the sequence of which is an exact copy of the opposite strand of cP2 from positions 381 to 473. The other clone also has an inverted sequence; in this case positions 89 to 195 of cP1 are missing and are replaced with 46 bp from the 5' end of cP1 in an inverted form. (Similar artifactual inversions have been observed in other cDNAs (Derynck et al., 1980; Fagen et al., 1980; Weaver et al., 1981; Pletnev et al., 1983).) A third cDNA was found identical to cP2 except that the 12-bp sequence GGCCATCCCCGT from positions 393 to 404 is replaced by the sequence CACCCCCCCCCA, which we have not found anywhere else. The reading frame of this cDNA is intact and would give a decoded amino acid sequence of His-Pro-Pro-Pro in place of the usual Gly-His-Pro-Arg. We suspect but cannot prove that this clone is also an artifact since six of seven clones covering this region lacked the unusual sequence.
The nucleotide sequences presented in Fig. 1 show that, although the decoded amino acid sequences of the clones cP1 and cP2 are very similar, they code for the precursors of two different acidic PRPs. They are identical except for amino acid positions 4 and 50; cP1 has aspartic acid and asparagine at positions 4 and 50, respectively, while cP2 has asparagine and aspartic acid at these positions. The decoded amino acid sequences of cP1 and cP2 and the amino acid sequence of an acidic PRP, protein C, determined by Wong and Bennick (1980) also differ at positions 4 and 50; both positions in protein C are reported as asparagine. Since the donor of the parotid gland is producing only two acidic PRPs (Pr1 and PIF), cP1 and cP2 must correspond to Pr1 and PIF. These two proteins co-migrate in 7% polyacrylamide gel electrophoresis at pH 8.9, but they can be separated from each other by isoelectric focusing over the pH range 3.5 to 5.2 (Azen and Denniston, 1981) despite the fact that both have the same total amino acid composition. Our results suggest that protein C as sequenced was probably a mixture of the two proteins. (Asparagine is frequently the residue assigned when protein sequencing experiments show aspartic acid and asparagine at the same position.) One feature shared by all of the acidic PRPs, except Pa, is their appearance as double bands after electrophoresis. Wong and Bennick (1980) have shown that the amino acid sequence of their protein A is identical to the first N-terminal 106 residues of protein C, which itself is composed of 150 residues. The salivary protease kallikrein has been shown to cleave protein C on the carboxyl side of Arg-106 (Wong et al., 1983) to give protein A plus a carboxy-terminal peptide of 44 amino acids. The identification in human saliva by Isemura et al. (1980) of a peptide P-C which is identical in sequence to the C-terminal 44 residues of protein C suggests that such proteolytic cleavages occur in vivo in the salivary PRPs. Since the decoded amino acid sequences of both the cP1 and cP2 clones have the Arg-106, this mechanism is most likely responsible for the double-bandedness of PIF (PIF slow and PIF fast) and Pr1 (Pr1 and Pr3). Among all the PRP cDNA sequences that we have determined, including sequences of an additional three incomplete cDNA clones with HaeIII repeats, we have found none which is missing codons for the C-terminal 44 residues. This result strongly supports the hypothesis that the double-banded feature of acidic PRPs is due to post-translational cleavage at Arg-106 and is not due to the existence of different mRNAs coding for the two proteins making up any particular double-banded phenotype.
Possible Differential RNA Splicing in cDNA Clones with Different Numbers of BstNI Repeats-Seven full-sized cDNAs and one partial cDNA with BstNI repeats were sequenced. They form three subgroups based on their sequences. Fig. 2 shows the nucleotide sequences of one subgroup of BstNI family clones: cP3, cP4, and cP5. We obtained two examples of cP3, one of cP4, and two of cP5. Alignment of the nucleotide sequences shows that cP4 and cP5 are identical to cP3 except that they are missing 399 and 459 bp, respectively, from the central part of cP3. We can exclude the possibility that they are derived from alleles at one locus because there are three different lengths from one individual. We can also exclude the possibility that recombination within the repeats during cloning might have caused the deletions, because inspection of the sequences in the appropriate alignment shows that the deletions could not have been generated by homologous crossovers between different repeating units. We therefore considered the possibility that they are differentially spliced transcripts from a single gene.
Inspection of the nucleotide sequence of cP3 shows that the 5' and 3' ends of the cP4 and cP5 deletions are homologous to the consensus sequences for RNA splicing donor sites, (C/A)AG:GT(A/G)AGTA, and acceptor sites, (C/T)11N(C/T)AG:G (Mount, 1982). Consequently these nucleotide sequences support the hypothesis that the three different length mRNAs, cP3, cP4, and cP5, are all derived from a single transcript via differential RNA splicing. In cP3, there are two possible donor sites at nucleotide positions 347 and 529, and eight acceptor sites at 379, 439, 499, 562, 622, 682, 745, and 805. cP4 is apparently the result of a 347 to 745 splicing and cP5 is the result of a 347 to 805 splicing. We do not know whether the other possible sites are ever used or not.
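The quoted splice positions can be checked against the reported deletion sizes with simple arithmetic. The sketch below assumes that the donor and acceptor positions mark the first and last nucleotides of the removed segment (an inclusive count); this is an interpretation, adopted here because it reproduces the 399 and 459 bp stated in the text. The function name is hypothetical.

```python
def deleted_length(donor, acceptor):
    """Length of the removed segment, counting both end positions (assumed inclusive)."""
    return acceptor - donor + 1

# Donor/acceptor pairs quoted in the text for cP4 and cP5.
splices = {"cP4": (347, 745), "cP5": (347, 805)}

report = {}
for clone, (donor, acceptor) in splices.items():
    n = deleted_length(donor, acceptor)
    # A deletion that is a multiple of 3 leaves the reading frame unchanged.
    report[clone] = {"bp": n, "in_frame": n % 3 == 0, "codons": n // 3}
    print(clone, report[clone])
```

Under this assumption both deletions are whole numbers of codons (133 and 153), so neither would shift the reading frame of the sequence downstream of the splice.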
Amino Acid Sequence of the BstNI Family Clone cP3-The nucleotide sequences of the 5' flanking, signal (S), and N-terminal (N1) portions of the BstNI family sequence cP3 are homologous to those of the HaeIII family sequences cP1 and cP2. We have already seen that protein C probably corresponds to a mixture of the translation products of cP1 and cP2. We can therefore use the sequence of protein C and the nucleotide sequence data to predict the structure of the proteins corresponding to cP3, cP4, and cP5. The N-terminal amino acids of the three secreted proteins will most likely be glutamine. They will all have an N-terminal region of 17 amino acids, followed by a differing number of the BstNI the first three B3 repeats are 19 and the fourth B3 repeat is 20 amino acids in length. Their consensus sequences are given in Fig. 4b.) The unit "B1-B2-B3" occurs four times in cP3. The nucleotide sequence of the C-terminal region (C,) of cP3 is not homologous to that of the repeating units, although its decoded amino acid sequence is still rich in proline.
The overall decoded amino acid sequence of cP3 corresponds to a PRP of 315 amino acids. None of the presently sequenced salivary PRPs is as long as this. This suggests that some of the currently sequenced proteins could be post-translational cleavage products of a cP3-like precursor protein. An arginine residue occurs at the equivalent position in all but one of the B3 repeats of the cP3 precursor protein. If complete proteolytic cleavage occurred after all these arginine residues (at positions 75, 136, and 259) cP3 would give rise to four different basic proline-rich peptides. It would generate peptides of 75 residues (from N1 and the first B1-B2-B3 unit), of 61 residues (from the second B1-B2-B3 unit), of 123 residues (two B1-B2-B3 units left together because of the replacement of arginine by glutamine at position 197), and of 56 residues (the last B1 repeat and C-terminal region). cP4 and cP5 would generate the same N- and C-terminal peptides as cP3. In addition, cP4 would produce a basic peptide of 50 residues and cP5 would produce a basic peptide of 30 residues.
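The peptide lengths predicted for complete cleavage of the cP3 precursor follow directly from the precursor length and the three arginine positions. A minimal sketch of that bookkeeping is given below; the function name is hypothetical, and cleavage is assumed to occur on the carboxyl side of each listed residue, as proposed in the text.

```python
def cleavage_fragments(length, sites):
    """Split residues 1..length after each site; return (start, end, size) tuples."""
    bounds = [0] + sorted(sites) + [length]
    return [(a + 1, b, b - a) for a, b in zip(bounds, bounds[1:])]

# cP3: 315 residues, cleavage after Arg-75, Arg-136, and Arg-259 (values from the text).
cp3_fragments = cleavage_fragments(315, [75, 136, 259])
for start, end, size in cp3_fragments:
    print(f"residues {start}-{end}: {size} amino acids")
```

The four sizes (75, 61, 123, and 56) sum to 315, so the predicted peptides account for the whole secreted precursor.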
Previous work by others on the amino acid sequences of salivary PRPs provides evidence for this interpretation of the nucleotide sequences. The amino acid sequences of the four B1-B2-B3 units of 61 amino acids we expect from cP3 are very similar although not identical to the amino acid sequences of the small basic PRPs. Residues 87 to 197 are identical to IB-9 and P-E (Kauffman et al., 1982; Isemura et al., 1982), except that the glutamine at 197 of cP3 is replaced by arginine in IB-9 and P-E. Residues 198 to 259 of cP3 are identical to P-F (Saitoh et al., 1983a) with three exceptions. Ala-249 and Gln-256 of cP3 are replaced by proline and arginine, respectively, in P-F, while Arg-259 is missing in P-F. The amino acid sequence of the basic protein P-H determined by Saitoh et al. (1983b) is identical to that from positions 260 to 315 of cP3, except that Saitoh et al. found alanine at the N-terminus of P-H rather than the serine at the equivalent position (residue 260) of cP3. Since the PRP system is genetically very polymorphic, some of these minor differences between the protein sequences and those predicted from cP3 could well be due to inherited variations between different individuals.
Arginine residues at positions 72 and 194 of cP3 could also be cleavage sites for proteases. However, the amino acid sequences of the basic proline-rich proteins IB-9 and P-F retain this Arg-Ser bond intact which suggests that this bond is resistant to cleavage. Nonetheless, in conjunction with the differential RNA splicing we have suggested, post-translational cleavage of precursor proteins could produce at least 6 different basic proline-rich proteins from the single gene corresponding to cP3.
cDNA Clones with BstNI Repeats Code for Two Types of Glycosylated PRPs-Fig. 3 presents the nucleotide sequences of two other types of cloned PRP cDNAs. cP6 (Fig. 3a) has a signal sequence (S) and N-terminal region (N1) very similar to those of cP3. However, the repeated region of cP6 is characteristically different from cP3 in that for the most part it contains a fourth variety of BstNI repeat (B4) which occurs seven times. Two and one-half more repeats (like B1, B2, and part of B2) follow the seven copies of B4. The C-terminal region of cP6 is an abbreviated form (CW) of the equivalent region (Cs) in cP3.
The B4 repeat is different because the decoded sequence includes the sugar attachment site "N-Q-S." This suggests that cP6 codes for a glycosylated PRP having six carbohydrate side chains. (The first of the seven B4 repeats has a single base pair change in the site so it would not have a side chain.) Furthermore, the amino acid sequence of the B4 repeat is very similar to the amino acid sequences of the glycopeptides, CD-IIf and CD-IIg, obtained from proteolytic digests of a glycosylated PRP (Shimomura et al., 1983). If cP6 were translated and glycosylated without proteolytic cleavage, it would code for a secreted polypeptide of 231 amino acids. The sequence of the last six amino acids in the seventh B4 repeat is similar to that of the B3 repeat and contains two arginine residues. If proteolytic cleavage occurs in the protein corresponding to cP6 in the same way as we suggest it occurs in that corresponding to cP3, the secreted glycoprotein would be 161 (or possibly 158) amino acids in length (Fig. 3a). One glycosylated PRP that has been studied (Levine et al., 1979) lacks hydrophobic amino acids. This suggests that cP6 might also be cleaved after Arg-23 to give a basic glycopeptide of 138 amino acids lacking hydrophobic residues plus an N-terminal unglycosylated acidic peptide of 23 residues not rich in proline but containing hydrophobic amino acids. The C-terminal basic peptide would be 70 amino acids in length and nonglycosylated. This C-terminal sequence corresponds to the amino acid sequence of the basic proline-rich peptide P-D (Saitoh et al., 1983c) except that cP6 has alanine at position 193 where P-D has proline.
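Sugar attachment sites such as "N-Q-S" can be located in a decoded sequence by a simple pattern scan. The sketch below uses the general N-X-S/T rule for N-linked glycosylation (X being any residue except proline), which is standard background rather than something stated in this paper; the function name and the toy peptide are hypothetical and do not come from any PRP clone.

```python
import re

# N-linked glycosylation sequon: Asn, any residue except Pro, then Ser or Thr.
# The lookahead also reports sequons that overlap one another.
SEQUON = re.compile(r"(?=(N[^P][ST]))")

def sequons(protein):
    """Return (1-based position, tripeptide) for each candidate attachment site."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(protein.upper())]

# Invented peptide containing N-Q-S, N-K-S, and a non-site N-P-S.
toy_peptide = "PPGNQSQGPPNKSPPGNPSG"
found = sequons(toy_peptide)
print(found)
```

A single base change that replaces the asparagine or the serine of such a tripeptide removes it from the scan, which is the situation described for the first B4 repeat of cP6.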
The sequence of clone cP7 (Fig. 3b) shows that it also contains BstNI repeats but is not full length even though it is larger than 1 kb. The repeat unit of cP7 is similar to the B1-B2-B3 units of cP3. Some repeats (B3, and B2,) contain a sequence for a sugar attachment site, "N-K-S." If the precursor protein of cP7 is processed post-translationally at Arg-9, -71, and -133 in the manner we are proposing, it would produce peptides of 62 amino acids with one carbohydrate side chain attached to each molecule. Variations in the degree of glycosylation of the basic PRPs have been described: some contain no carbohydrate, but others contain 1.7 to 5.2 sugar residues/100 amino acids (Bennick, 1982). The 62 amino acid peptides from the cP7 precursor could be some of the less glycosylated basic PRPs. Three additional basic PRPs without sugar side chains and very similar to those corresponding to cP3 are also expected from the sequence of cP7. The peptide expected from residues 134 to 195 is identical to P-F except that P-F does not have an arginine at its C terminus. The C-terminal peptide (residues 196 to 252) is completely identical to P-H (Saitoh et al., 1983b).
Assignment of cDNA Clones to Five PRP Loci-In the human genome there are probably six loci controlling the synthesis of the proline-rich proteins in saliva (Maeda, 1985). Two loci have HaeIII-type repeats and are named PRH1 and PRH2; four have BstNI-type repeats and are named PRB1, PRB2, PRB3, and PRB4. Each locus has several alleles, some of which show polymorphic length differences of the repeating region.³ We have isolated full length cDNAs corresponding to four loci and an incomplete cDNA clone corresponding to a fifth locus. Assigning these cDNA clones to the appropriate genetic locus requires the isolation and characterization of genomic clones. Genomic clones containing the genes at the PRH1 and PRH2 loci have been isolated from the DNA of the same individual whose parotid gland was used to make cDNA.
By sequencing the exons of these genomic clones and by mapping genomic DNAs from other individuals with different acidic PRP phenotypes, we have been able to assign cP1 tentatively to the PRH2 locus (Pr1) and cP2 to the PRH1 locus (PIF).⁴ The BstNI family clones, cP3, cP4, and cP5, are most likely transcripts of the PRB1 locus. Thus, a genomic clone, PRP1, which corresponds to the locus PRB1 has been isolated from the DNA of another individual and partially sequenced. The nucleotide sequences of PRP1 and cP3 are not identical, but they are probably alleles since both have the same number (13) of BstNI repeats, and the repeated regions share identical restriction enzyme sites not found in the genes at other loci (data not shown). Locus PRB2 is probably represented by cP7. The sequenced region of a genomic clone, PRP2, is identical to cP7 except for one base difference. The restriction enzyme maps of the repeated region of PRP2 and cP7 are also very similar. Genomic clones which contain genes at the loci PRB3 and PRB4 have also been isolated from the DNA of the same individual as the cDNA and partially characterized, and we have found that cP6 is the transcript of the gene at the PRB4 locus.
K. Lyons, unpublished data.
FIG. 3. In addition to the symbols used as in Figs. 1 and 2, the sugar attachment sites are marked with three black squares. As the clone cP7 is not a complete cDNA, the glycine residue after the poly(G) tail is arbitrarily numbered position 1 in the amino acid sequence.
The relative frequency of obtaining cDNA clones corresponding to each of the six loci is likely, in the absence of selective effects, to be directly proportional to the transcriptional activity of the loci. We obtained 28% HaeIII- and 72% BstNI-type cDNA clones for PRP. Since there are twice as many BstNI loci as HaeIII loci, this suggests that the two types are transcribed approximately equally. We have not identified any cDNA corresponding to the PRB3 locus. We cannot tell whether this is due to technical difficulties, such as its mRNA being particularly long, or to its being transcribed at a lower level than the other loci. It has been suggested³ that the genes at the PRB3 locus code for the glycopeptide G1. Since the individual whose tissue was used to construct the cDNA library is producing the G1 proteins in his saliva, this suggests that the locus PRB3 is being expressed. Fig. 4a summarizes schematically the results and conclusions we draw from studying the seven types of cDNA clone described in this paper. The tentative identities of the genetic loci and the types of protein product we expect after in vivo proteolysis are also given. All full length cDNA clones have a homologous signal peptide (S) and N-terminal region (N1). The consensus sequences of S and N1 are given in Fig. 4b.
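The clone-frequency argument made earlier in this paragraph can be written out as a per-locus calculation. This is only a sketch of the reasoning: it assumes, as the text does, that clone frequency is proportional to transcription, and it divides each family's share evenly among its loci, which is a simplification introduced here (no PRB3 clone was actually recovered).

```python
# Percentage of PRP cDNA clones of each repeat type, and loci per type (from the text).
clone_share = {"HaeIII": 28, "BstNI": 72}
loci_count = {"HaeIII": 2, "BstNI": 4}

# Average share of clones per locus, assuming an even split within each family.
per_locus = {family: clone_share[family] / loci_count[family] for family in clone_share}
ratio = per_locus["BstNI"] / per_locus["HaeIII"]

print(per_locus)
print(round(ratio, 2))
```

The per-locus averages (14% and 18%) differ by a factor of about 1.3, which is the sense in which the two families appear to be transcribed approximately equally.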

CONCLUSION
Clones cP1 and cP2, which code for acidic proline-rich proteins, differ from the others in having an additional N2 region. The two types of repeats, H and B, which give rise to the repetitive HaeIII sites and BstNI sites in the DNA sequences, are homologous to each other, as is clear from their consensus amino acid sequences shown in Fig. 4b. Both repeats code for basic polypeptides. The B-type repeats can be subdivided into B1, B2, B3, and B4 repeats, as is also shown in Fig. 4b. Interestingly, the variations which lead to the subdivision of the B repeats are clustered within the second half of the repeats, while the first half is much more conserved.

FIG. 4. [...] are also shown. Subtypes of BstNI repeats are designated as B1, B2, B3, and B4. Repeats B2, B3, and B4 contain sugar side chains as indicated. Assignments of the cDNAs to PRP loci are shown on the left. Positions of post-translational cleavage are marked by downward arrowheads, and expected proline-rich peptides are indicated as acidic, basic, or glycosylated proteins. The peptides P-C and P-H are assigned by the identity of the amino acid sequences, and Pr1 and PIF are assigned by the protein type of the individual whose tissue was used to construct the cDNA library. Differential splicing of cP3, cP4, and cP5 is indicated with dotted lines. b, consensus amino acid sequences of the regions. Consensus sequences of S and N1 are obtained from the sequences of four clones, cP1, cP2, cP3, and cP6. When the residue at one position is not predominant, two residues are given. The consensus sequence for HaeIII repeats (H) is from 5 repeats each of cP1 and cP2, and that for BstNI repeats (B) is from 32 repeats of cP3, cP6, and cP7. Residues that vary in the consensus sequences of BstNI subtype repeats are shown as differences from B. An asterisk indicates a gap.
Two features of the PRP cDNAs that we have studied show how a relatively small number of genes (six loci) could produce a much larger number of proteins. The first feature, differential RNA splicing, was suggested by the relationships between cP3, cP4, and cP5. Fig. 4a illustrates how cP4 and cP5 could be obtained from the same transcript as cP3 by splicing out different lengths of RNA. We do not know whether this is a general feature of PRP genes, or whether the potential splicing sites in other genes are used. (Clone cP6 has a potential donor site at position 479 and two potential acceptor sites 3' to the donor site at positions 510 and 573.) Differential RNA splicing of several genes is known to produce secreted and membrane-associated forms of immunoglobulins (Early et al., 1980), differential tissue-specific promoter usage in α-amylase (Young et al., 1981), and a different number of exons in Drosophila myosin (Rozek and Davidson, 1983) and human fibronectin (Kornblitt et al., 1984). The case of a single donor site combining with several acceptor sites has been demonstrated in the rat fibronectin gene (Tamkun et al., 1984). The present examples of differential RNA splicing in a PRP gene are unique in that both alternative donor and acceptor sites occur within a single exon (see also Azen et al., 1984) without altering the reading frame or the general nature of the translated product.
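The statement that these splicing events leave the reading frame intact can be checked against the deletion sizes given in the abstract (399 and 459 base pairs missing from cP4 and cP5, respectively, relative to cP3). The following is a minimal sketch of that check; the function name `frame_effect` is hypothetical and introduced only for illustration.

```python
CODON_LENGTH = 3  # nucleotides per codon

# Base pairs missing from each clone relative to cP3, as given in the abstract.
deleted_bp = {"cP4": 399, "cP5": 459}

def frame_effect(n_bp):
    """Return (whole codons removed, True if the reading frame is preserved)."""
    codons, remainder = divmod(n_bp, CODON_LENGTH)
    return codons, remainder == 0

results = {clone: frame_effect(n) for clone, n in deleted_bp.items()}

for clone, (codons, in_frame) in results.items():
    status = "in frame" if in_frame else "frameshift"
    print(f"{clone}: {codons} codons removed, {status}")
```

Both deletions are exact multiples of three (133 and 153 codons), consistent with the observation that the spliced products remain in frame.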
The second feature leading to multiple proteins is the probable ubiquity of post-translational proteolytic cleavage in these proteins. Fig. 4a indicates the various protein products to be expected if post-translational cleavage, such as occurs at Arg-106 in the acidic PRPs, is a general process. Such processing would generate at least four basic PRPs from the cP3 precursor, all of different lengths and amino acid sequences but all having a homologous repeated region. cP6 has potential cleavage sites at Arg-23 and at Arg-161, which would generate a heavily glycosylated PRP and an additional basic PRP. The cleaved products of cP7 would be basic PRPs with or without a single carbohydrate side chain. As indicated above, some of these postulated products have sequences identical or similar to known basic PRPs. In most cases we do not know to what extent the proteolytic processing occurs, except that the HaeIII acidic protein is known to be partially cleaved. Thus, Wong et al. (1983) found the ratio of protein C to its cleaved product protein A to be approximately 1:1 in saliva, a ratio not affected by salivary flow rate. If cleavage at the arginine residues in the precursor proteins with BstNI-type repeats occurs completely, then Fig. 4a shows 12 different BstNI-type (basic and glycosylated) PRPs. If cleavages are incomplete and permuted, the number will be larger.
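How incomplete cleavage enlarges the set of products can be illustrated by counting fragments. The sketch below is not based on the actual precursor sequences: the precursor length and cleavage positions are hypothetical, the function names are ours, and it assumes only that cleavage occurs on the carboxylic side of specific residues and that any subset of sites may be cut.

```python
from itertools import combinations

def complete_cleavage(length, sites):
    """Fragments (start, end), 1-based inclusive, when every site is cut.

    A site at residue r means the bond on the carboxylic side of r is cleaved.
    """
    bounds = [0] + sorted(sites) + [length]
    return [(a + 1, b) for a, b in zip(bounds, bounds[1:])]

def all_possible_fragments(length, sites):
    """Every distinct fragment obtainable when each site may or may not be cut.

    The uncleaved precursor itself is included.
    """
    bounds = [0] + sorted(sites) + [length]
    return [(a + 1, b) for a, b in combinations(bounds, 2)]

# Hypothetical precursor of 300 residues with three cleavage sites
# (illustrative numbers only, not taken from any cDNA in this paper).
length, sites = 300, [100, 200, 250]

complete = complete_cleavage(length, sites)
possible = all_possible_fragments(length, sites)

print(len(complete), "fragments with complete cleavage:", complete)
print(len(possible), "distinct fragments if cleavage may be incomplete")
```

With k cleavage sites, complete cleavage gives k + 1 fragments, whereas allowing any site to remain uncut gives up to (k + 1)(k + 2)/2 distinct species. For the three hypothetical sites here, that is 4 versus 10.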
The question remains as to the function of the many different PRPs, the major protein components of saliva. Acidic PRPs bind ionic calcium (Gron and Hay, 1976), bind strongly to hydroxylapatite, and inhibit the precipitation of calcium phosphate in saliva (Hay, 1973). The functions of the glycosylated and basic PRPs are not known. PRPs also occur in the respiratory tract (Warner and Azen, 1984), which suggests that they could have some general function different from those specific to the oral cavity. It is possible that the precursor proteins, rather than the processed proteins, are more important functionally. All of the precursor proteins share the feature that their N-terminal region is acidic and the rest of the molecule, made up of proline-rich repeats, is basic. Because of their unusual amino acid compositions and primary structures, the molecules are also expected to have some interesting higher order structures. Conceivably the precursor molecules serve one function (perhaps in the respiratory tract), and their many proteolytic products serve a different function (perhaps in saliva). Whatever may be the case, the PRP system is unusual in the number and variety of different protein products that can be obtained from a small number of genes.