Structures of two HaeIII-type genes in the human salivary proline-rich protein multigene family.

Two members of the human salivary proline-rich protein (PRP) multigene family have been isolated and completely sequenced. These PRP genes, PRH1 and PRH2, are of the HaeIII-type subfamily and code for acidic PRP proteins. Both genes are approximately 3.5 kilobase pairs (kb) in length and contain four exons. Exon 3 encodes the proline-rich part of the protein and includes five 63-base pair (bp) repeats. CAT and ATA boxes and several possible enhancer sequences occur in a 1-kb region 5' to exon 1. Two sets of repeats occur in the sequenced region in addition to the 63-bp repeats: one pair of about 140 bp flanks 500 bp of DNA in the first intervening sequence, and the other pair of 72 bp is tandemly repeated 1.4 kb 5' to the PRH1 gene. The 4-kb region of sequenced DNA from PRH1 differs by an average of 8.7% from the same region in PRH2, but the nucleotide sequences of the exon 3 of the two genes differ by only 0.2%. This result suggests the occurrence of a recent gene conversion event. The regions containing the 5-fold repeated sequences of 63 bp are identical in the two genes, PRH1 and PRH2. A comparison of the human HaeIII and BstNI subfamily repeats and a comparison of the human, mouse, and rat repeats suggest that the individual repeats have evolved in a concerted fashion within each gene and within the PRP gene family as a whole.

with different lengths, possibly contributing an additional source of complexity.
The PRP proteins in humans are classified into three groups: acidic, basic, and glycosylated. Protein studies led to the proposal of four loci coding for the acidic PRPs. They are Pr, proline-rich (Azen and Oppenheim, 1973); Db, double band (Azen and Denniston, 1974); Pa, acidic protein (Friedman and Merritt, 1975); and PIF, parotid isoelectric focusing variant (Azen and Denniston, 1981). The locus Pr has two common alleles, Pr1 and Pr2, and an infrequent variant, Pr1'. One productive allele and one null allele have been described at each of the Db, Pa, and PIF loci on the basis of protein data. Our DNA studies, in contrast, indicate that only two loci encode the acidic PRPs and that these two loci are a subfamily of the larger (six loci) PRP gene family. Thus, Maeda (1985) re-examined the pattern of inheritance of the PRPs in conjunction with DNA blotting data and hypothesized that the three acidic PRPs (Db, Pa, and PIF) are coded by three alleles (proposed nomenclature PRH1 I, PRH1', and PRH14) at the single locus, PRH1. PRH2', PRH22, and PRH23 were suggested to be alleles at a second locus, PRH2, that also codes for acidic PRPs, Pr1, Pr2, and Pr1', respectively.
The PRH1 and PRH2 genes that code for the acidic PRPs form a PRP subfamily in which sites for the restriction enzyme HaeIII occur repeatedly. The other subfamily codes for the basic and glycosylated PRPs and consists of four BstNI-type genes, PRB1, PRB2, PRB3, and PRB4. These BstNI-type genes have a region where BstNI sites occur repeatedly. Two clones corresponding to the BstNI group, PRP1 and PRP2, have been isolated and partially sequenced. Tandem repetitive sequences of 63 nucleotides encoding proline-rich repeated amino acid sequences were observed within the BstNI-type genes.
Multigene families are of particular interest because of the possibility that recombinational events between family members play a role in their evolution. The PRP multigene system shows signs of being involved in such events because polymorphic length differences are frequently observed in the PRP genes of different individuals. This suggests that unequal crossing-over between the repeated units within the genes may have been occurring during the evolution of the family.
To gain a better understanding of the complexity of the PRP genes and of the evolutionary inter-relationships of the genes in this multigene family, we have isolated both human salivary PRP genes of the HaeIII-type, PRH1 and PRH2, and in this paper we describe their complete DNA sequences. Comparison of these nucleotide sequences suggests that a recent gene conversion between the two genes rendered the exons containing the repeated regions more alike than other parts of the genes.
PRP protein phenotypes Db' , Pa+, and PIF-and Prl-2 (genotypes
Cloning Procedures-BglII and BamHI libraries were constructed from complete BglII and BamHI digests of genomic DNA of R. D. The digests were ligated into λ phage Charon 35 (Loenen and Blattner, 1983) and packaged in vitro (Hohn, 1979). An EcoRI library was made from the same individual using the vector Charon 32 (Loenen and Blattner, 1983). A library was made of 12-18-kb size-selected fragments from a partial MboI digest of DNA from O. S. ligated into the BamHI sites of Charon 35. All libraries were screened without amplification. A 500-bp fragment (HaeIII 500) of a human PRP cDNA clone which encodes all of the repetitive region (plus a little of both the 5' and 3' regions) was cloned into the SmaI site of the plasmid pLL10 (Rothstein et al., 1979). This fragment was used as a probe during most of the cloning work. The HaeIII 500 probe hybridizes strongly to both HaeIII- and BstNI-type genes under non-stringent washing conditions at 68 °C in 3 × SSC (0.45 M sodium chloride, 0.045 M sodium citrate, pH 7) with 0.5% sodium dodecyl sulfate. After washing under more stringent conditions (0.1 × SSC at 68 °C), we could distinguish the still strongly labeled HaeIII-type clones from the now weakly labeled BstNI-type clones.
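The salt concentrations of the two washes follow from simple dilution arithmetic. The sketch below is illustrative only; it assumes the conventional definition of 1 × SSC as 0.15 M sodium chloride and 0.015 M sodium citrate, which is consistent with the 3 × SSC values quoted in the text. The helper name `ssc_molarity` is hypothetical.

```python
# Assumed stock definition: 1 x SSC = 0.15 M NaCl, 0.015 M sodium citrate.
SSC_1X = {"sodium chloride": 0.15, "sodium citrate": 0.015}


def ssc_molarity(fold):
    """Return the molar concentration of each salt at the given SSC strength."""
    return {salt: round(conc * fold, 6) for salt, conc in SSC_1X.items()}


non_stringent = ssc_molarity(3)    # wash in which both gene types stay labeled
stringent = ssc_molarity(0.1)      # wash that weakens the BstNI-type signal

for label, wash in (("3 x SSC", non_stringent), ("0.1 x SSC", stringent)):
    print(label, wash)
```

The stringent wash therefore contains thirty-fold less salt than the non-stringent one at the same temperature, which is what allows the less closely related BstNI-type hybrids to be distinguished.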
DNA Sequencing and Analysis-DNA sequencing was carried out by the method of Maxam and Gilbert (1977) with slight modifications (Slightom et al., 1980). All regions were sequenced in both directions. Sequences were analyzed using software provided by the University of Wisconsin Genetics Computer Group (Devereux et al., 1984).

RESULTS AND DISCUSSION
The PRH1 and PRH2 Loci-The 18-kb region which contains the PRH1 gene (allele PRH1') was obtained from DNA of R. D. as a series of overlapping phage clones 1-4 (Fig. 1). R. D. is homozygous for the PRHZ4 allele (old nomenclature, PIF). Three libraries containing BglII, BamHI, and EcoRI fragments were screened with the HaeIII 500 probe (probe 1 in Fig. 1) and clones 1-3 were isolated. The 2.2-kb BglII fragment from clone 1 was sequenced completely. It contained most of the coding region of PRH1 corresponding to the sequence of the cDNA clone cP2 (Maeda et al., 1985) but was missing the leader region and 3'-untranslated sequence. The entire PRH1 gene was included in a 5-kb BamHI-HindIII fragment from clone 2. The nucleotide sequence of this fragment is given in Fig. 2. Clone 3, an EcoRI fragment, was found to be contained within the BamHI fragment of clone 2.
A 1-kb BamHI-XbaI fragment from the 5' end of clone 2 was used as a probe (probe 2) to isolate clone 4, which contains a 7.5-kb EcoRI fragment that covers the region upstream of the PRH1 gene. A map generated by genomic Southern blotting (Southern, 1975) was used to find the overlaps of these clones. Similar strategies were used to isolate the 12-kb region (Fig. 1) covering the PRH2 gene. A 1.1-kb BglII fragment, clone 5, and a 2.5-kb BamHI-HindIII fragment, clone 6, were isolated from DNA of R. D. and completely sequenced. They were found to contain the coding regions corresponding to the cDNA clone cP1 (Maeda et al., 1985) except for the 5'-untranslated region and exon 1. Comparison of the maps of the genes PRH2 and PRH1 suggested that the entire PRH2 gene should be included in a 5-kb EcoRI fragment, but we were unable to isolate any phage clone containing this fragment from the DNA of R. D. A 12-kb fragment (clone 7) was, however, isolated from DNA of another individual (O. S.).
We present the sequence obtained from these clones as a continuous sequence in Fig. 2. The 3'-half (BamHI-HindIII fragment) of the illustrated sequence therefore corresponds to the PRH2' allele since it was obtained from a PRHP homozygote, but the 5'-half (EcoRI-BamHI fragment) may be either a PRH2' or a PRH2' since O. S. is heterozygous at this locus. Clone 8 (containing a 3.8-kb EcoRI DNA fragment from R. D.) contains the region 5' to the PRH2 gene; it was obtained using the BamHI-XbaI fragment (probe 2) from PRH1.
The maps of the regions which contain the two genes are summarized in Fig. 1. Although some restriction enzyme sites are common to the PRH1 and PRH2 loci, the two restriction enzyme maps are distinctly different, and no six-base recognition enzymes tested gave the same length fragments in both genes with the HaeIII 500 probe.
Gene Organization-Fig. 2 presents and compares the complete DNA sequences of the human salivary PRP genes, PRH1 and PRH2. The four exons and three introns were assigned by comparison to the cDNA sequences cP2 and cP1 (Maeda et al., 1985). The nucleotide sequences at the RNA splicing donor and acceptor sites of the PRH1 and PRH2 genes are all in agreement with the consensus sequences described by Mount (1982). The first three exons in both genes have identical lengths, but the fourth exons differ slightly in length. The sizes of all the introns in the two genes are significantly different.

[Fig. 2: aligned nucleotide sequences of PRH1 and PRH2 (alignment not reproduced)]
Exon 1 is 64 bp in length and codes for the secretory signal and for the first five NH2-terminal residues of the acidic PRPs. Exon 2, located approximately 1 kb downstream from exon 1, contains only 36 bp and codes for the next 12 residues of the NH2-terminal region of the proteins. Exon 3, located approximately 360 bp downstream from exon 2, encodes the main repeated region of the two proteins where the HaeIII-type repeats occur tandemly five times. These repeated regions are not interrupted by any introns. The termination codon TAA also occurs in exon 3.
Exon 4, located after a long intervening sequence of about 1200 bp, contains only the 3'-untranslated region and the poly(A) addition signal sequence AATAAA (Fitzgerald and Shenk, 1981). The nucleotide positions 1615 in PRH1 and 850 in PRH2 are indicated by bent arrows in Fig. 2. They are the 5' ends of the nucleotide sequences of the cDNAs cP2 and cP1, respectively. The glutamine residue in exon 1, indicated by an asterisk, is the NH2-terminal amino acid of the secreted protein as judged by the amino acid sequence of the acidic PRP, protein C, determined by Wong and Bennick (1980). The poly(A) attachment sites in PRH1 and PRH2 were deduced from the nucleotide sequences of the cDNAs cP1 and cP2. They occur after the sequence TTGC, as indicated by the bent arrows in exon 4 in Fig. 2, at nucleotide positions 4835 and 3926. Several sequences associated with transcriptional initiation are found in the 5' regions of the two genes, including an ATA box (boxed in Fig. 2; Goldberg, 1979) and a possible CAT box (Efstratiadis et al., 1980) located 27 bp upstream of the ATA box. The sequence TGGAAAG, the core sequence of some viral enhancers (Khoury and Gruss, 1983), occurs twice in the 5' region, once at 483 bp and once at 345 bp upstream from the ATG codon. Similar sequences, TGAAAAA, TGAAAAG, and TGAAAAC, occur at 534, 376, 364, and 331 bp upstream from codon 1. We do not know their significance. The transcription units of the genes from the start of transcription to the poly(A) attachment site extend 3714 bp in PRH1 and 3578 bp in PRH2.
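Locating short core motifs such as TGGAAAG relative to the initiation codon amounts to a simple string scan. The following is a minimal sketch, not the procedure used in this study: the function name `upstream_motif_distances` and the toy sequence are hypothetical (the toy sequence is not taken from PRH1 or PRH2), and the distance is assumed to be measured from the first base of the motif to the A of the ATG.

```python
import re


def upstream_motif_distances(sequence, motif, atg_index):
    """Distances (bp) from each motif start to the ATG, for motifs 5' to the ATG."""
    seq = sequence.upper()
    # A zero-width lookahead reports overlapping occurrences as well.
    pattern = "(?=" + re.escape(motif.upper()) + ")"
    distances = [
        atg_index - match.start()
        for match in re.finditer(pattern, seq)
        if match.start() < atg_index
    ]
    return sorted(distances, reverse=True)


# Hypothetical toy 5' region carrying two copies of the enhancer core motif.
TOY_UPSTREAM = "CC" + "TGGAAAG" + "ACGTACGTAC" + "TGGAAAG" + "TTTTT" + "ATGAAG"
ATG_INDEX = TOY_UPSTREAM.index("ATG")

print(upstream_motif_distances(TOY_UPSTREAM, "TGGAAAG", ATG_INDEX))
```

Running the same scan with the related heptamers (TGAAAAA, TGAAAAG, TGAAAAC) would list their positions in the same coordinate convention.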
A proline-rich protein gene, MP2, from a mouse has recently been sequenced (Ann and Carlson, 1985). This mouse PRP gene has a similar organization to that of the human PRP genes except that it does not have a short exon corresponding to exon 2 of the human genes. In the mouse gene, sequences homologous to the cyclic AMP control region and a hormone-binding site have been found 5' to the gene, but these sequences are not found near the human genes.
Some Interesting Sequences-A pair of direct repeats (140 and 142 bp in length) separated by 497 bp occurs in the first intron of the PRH1 gene starting at positions 1758 and 2381 as indicated in Fig. 2. The entire region of 765 bp is flanked by 6-bp short direct repeats having the sequence TTGGGG. A comparable set of sequences occurs in the PRH2 gene except that there appears to have been a deletion of 119 nucleotides in the 3' end of the second repeat. Direct repeats of this type are frequently associated with transposable elements (Calos and Miller, 1980), and it is possible that this region of the PRP genes is a transposon of length 497 or 496 bp with long terminal repeats of 142 or, less likely, 126 bp. However, genomic Southern blot hybridization of human DNA to a probe made from most of this region from PRH1 or to a probe containing the whole 5'-terminal repeat failed to detect any other copies of the sequence in the genome (data not shown), nor were we able to find any homologous sequences in the GenBank library (version of February 22, 1985). Thus, there is no evidence from the presence of other copies to support the idea that the region is a transposon.
A pair of 72-bp direct repeats having 93% sequence identity with each other is present in the PRH1 gene 1429 and 1357 bp upstream from exon 1. The equivalent region of PRH2 has not been sequenced. This 72-bp tandemly repeated sequence in the PRH1 gene has only limited similarity (35% identity) to the enhancer element of SV40 (Benoist and Chambon, 1981; Gruss et al., 1981), but the sequences TGGAAA and CAAACCA, which occur within the PRH1 repeats, are identical to virus enhancer core sequences. This suggests that the 72-bp tandem repeats in the PRH1 gene may be important for its transcription.
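Percentage identity between two repeat copies is the fraction of aligned positions carrying the same base. The sketch below assumes an ungapped alignment of two equal-length copies; the function name `percent_identity` and the 10-base toy sequences are hypothetical and are not taken from PRH1. Under the same ungapped assumption, 93% identity over 72 bp would correspond to 67 matching positions.

```python
def percent_identity(copy_a, copy_b):
    """Percentage of positions with the same base in two equal-length sequences."""
    if len(copy_a) != len(copy_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(copy_a.upper(), copy_b.upper()))
    return 100.0 * matches / len(copy_a)


# Hypothetical toy copies differing at one of ten positions.
toy_identity = percent_identity("ACGTACGTAC", "ACGTACGTAA")

# Arithmetic check for a 72-bp repeat pair with 67 matching positions.
identity_72bp = 100.0 * 67 / 72

print(toy_identity, round(identity_72bp))
```

A gapped alignment would need an explicit rule for scoring gaps, so the figure obtained depends on that choice.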

Fig. 4. Repeated sequences of the PRP genes (sequence panels not reproduced). a, the repeats of the PRH1 and PRH2 genes (both genes are identical in this region). Bases which differ from the consensus sequence of HaeIII-type repeats are indicated by small letters. Asterisks denote nucleotides different from the consensus sequence of the HaeIII-type repeat but identical to that of the BstNI-type repeat. b, the HaeIII- and BstNI-type consensus sequences. Bases which differ between the two consensus sequences are shown by asterisks. The HaeIII and BstNI recognition sites are underlined. c, consensus sequences of repeats from a rat cDNA clone, pRP33, and a mouse gene, MP2. Bases which differ from the human consensus sequences are shown by small letters.

Uneven Distribution of Sequence Differences between Two Genes-The differences between the nucleotide sequences of the two HaeIII-type genes are diagrammatically summarized in Fig. 3, in which the two sequences are homologously aligned with their exons indicated by solid black bars. The overall difference between the two genes is 8.7% when each base mismatch and each gap is counted as one difference (4031 positions were compared with each gap counted as a single position). The ratio between transition and transversion nucleotide substitutions is 1.51, and all the length differences are less than 11 bp except for the difference in intron 1 described above, where PRH2 is shorter by 119 bp than PRH1.
The distribution of differences along the genes is not uniform. In the lower part of Fig. 3, the percentage differences of each exon, intron, and the 5'- and 3'-untranslated regions are shown by a bar diagram. A remarkable conservation is seen in exon 3 (419 positions), which encodes the repetitive region. There is only one base difference between the two genes in this exon, and that difference is outside the region of the HaeIII-type repeats. In contrast, intron 3 has 15% differences including 22 gaps in the 1136 positions compared.
This uneven distribution of differences is intriguing. A likely explanation for this feature is that a recent gene conversion has occurred between the two genes. Some support for this idea comes from noting that there are only six nucleotide differences between the two genes in a 660-bp region (from nucleotide positions 2998 to 3650 of PRH1 in Fig. 2) that includes over 250 bp of intron or untranslated sequences in addition to exon 3.
HaeIII-type Repeats-The repeated region in exon 3, consisting of five tandem repeats of 63 bp, is completely identical in PRH1 and PRH2, as discussed in the preceding paragraph. The five repeated sequences are aligned to show their relatedness in Fig. 4a. Each repeat has the common HaeIII site, GGCC, which translates to Gly-X. The first repeat lacks 15 bp at its 5' end, whereas the second and third repeated sequences are 12 bp shorter near their 3' ends. Of all positions, 63% are identical in all five repeated sequences. A consensus sequence for the HaeIII-type PRP repeat has been constructed from these five repeated sequences and is compared in Fig. 4b to the consensus sequence of a BstNI-type PRP repeat. The total difference between the consensus sequences of these two gene types is 19%, which suggests that the duplication leading from a single human salivary PRP ancestral gene to the present HaeIII- and BstNI-type genes is not recent. When the individual HaeIII-type repeats in the two genes are compared to the consensus sequence, more differences are found in the first and the fifth repeats. Conceivably, the middle three repeats have been more homogenized by gene conversions and unequal crossing-over events between the repeats than have the outer repeats.
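A consensus sequence and the fraction of positions identical in every repeat can both be read column by column from an alignment of the repeats. The sketch below is illustrative and is not the software used in this study: the function name `consensus_and_invariant_fraction` and the three 6-base toy repeats are hypothetical, the consensus is assumed to be the most frequent symbol in each column, and a gap character, if present, is treated like any other symbol.

```python
from collections import Counter


def consensus_and_invariant_fraction(aligned_repeats):
    """Return (majority-rule consensus, fraction of columns identical in all rows)."""
    length = len(aligned_repeats[0])
    if any(len(row) != length for row in aligned_repeats):
        raise ValueError("all repeats must be aligned to the same length")
    consensus = []
    invariant_columns = 0
    for column in zip(*aligned_repeats):
        counts = Counter(column)
        # Most frequent symbol in this column becomes the consensus base.
        consensus.append(counts.most_common(1)[0][0])
        if len(counts) == 1:
            invariant_columns += 1
    return "".join(consensus), invariant_columns / length


# Hypothetical toy alignment of three short repeats.
toy_repeats = ["GGCCAC", "GGCCAT", "GGACAC"]
toy_consensus, toy_invariant = consensus_and_invariant_fraction(toy_repeats)
print(toy_consensus, round(100 * toy_invariant))
```

Comparing each repeat to the consensus with a position-by-position count would then show which copies, such as the outer repeats, carry the most differences.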
Proline-rich proteins have been found in the saliva of other mammals. A cDNA clone, pRP33, of a rat contains six repeats of 19 amino acids (Ziemer et al., 1984). A mouse PRP gene, MP2, contains 13 repeats of 14 amino acids (Ann and Carlson, 1985). The consensus nucleotide sequences of the repeats in these genes are shown in Fig. 4c for comparison. Clearly, the repeats are all related and derived from a common ancestor.
The human HaeIII- and BstNI-type repeats are, however, more closely related to each other than either is to the rat or mouse repeats. This suggests that the divergence of HaeIII- and BstNI-type repeats may be more recent than the divergence of humans and rodents. Alternatively, the human genes may have evolved in a concerted fashion. We are in the process of sequencing other human PRP genes, as well as investigating the organization of the six human PRP loci, in order to understand better the evolution of this gene family.
In conclusion, our studies on the PRH1 and PRH2 genes show that the evolution of these genes is more complicated than simple gene duplication and divergence and suggest the occurrence of recombinational events between genes in the family, as well as between the repeats within each gene.