Molecular Evolution of the Mouse Proline-rich Protein Multigene Family
INSERTION OF A LONG INTERSPERSED REPEATED ELEMENT*

Proline-rich proteins (PRPs) in the salivary glands of mice, rats, and hamsters are encoded by tissue-specific inducible multigene families. Mouse PRP genes are located on chromosome 8, and transcription is dramatically induced (about 70-fold) by isoproterenol treatment. Clones containing two nonallelic PRP genes (MP2 and M14) were isolated from cosmid and phage libraries of CD-1 mouse genomic DNA. The cloned regions comprise a contiguous block of 77 kilobase pairs of the mouse genome. Restriction mapping established the physical linkage of PRP genes MP2 and M14, which are tandemly arrayed. The DNA sequence analysis presented in this report suggests that genes M14 and MP2 (Ann, D. K., and Carlson, D. M. (1986) J. Biol. Chem. 260, 16863-16872) arose via a gene duplication of a common ancestor. Two major differences between M14 and MP2 were observed. PRP gene MP2 has 13 tandemly arrayed 42-nucleotide repeats in exon II, whereas M14 has 17 repeats, and PRP gene M14 has an insertion by transposition of a 2-kilobase pair member of the long interspersed repeated DNA (LINE) family (LIMd) into intron I. The evolution of this PRP multigene family has been dominated by intra-exonic amplification of repeating nucleotide units coding for these and other proline-rich repeated peptides and by gene duplication. The LIMd element gives rise to heterogeneous EcoRI, BamHI, and HindIII restriction enzyme patterns, and this insertion is also present in BALB/c, C57BL/6J, and DBA/2J mice.
* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
The nucleotide sequence(s) reported in this paper has been submitted to the GenBank™/EMBL Data Bank with accession number(s) J03891.
¹ The abbreviations used are: PRP, proline-rich protein; MP2, M14, mouse proline-rich protein genes; bp, base pair; kb, kilobase pairs; CRPs, contiguous identical repeat peptides.

Mammalian proline-rich proteins (PRPs)¹ are encoded by tissue-specific multigene families whose members have diverged with respect to structure and regulation (1-9). The nucleotide sequences of several PRP mRNAs from rat (4, 5), mouse (5), and human (6) and the structure and organization of complete PRP genes from the mouse (7), hamster (8), and human (9) have been reported. The common evolutionary origin of these genes is evident from the extensive conservation of 5'-untranslated regions, coding sequences, and intron/exon structures. It has been proposed (7) that the 42-nucleotide repeat unit CCA CCA CCA CCA GGA GGC CCA CAG CCG AGA CCC CCT CAA GGC is the ancestral unit. During gene duplication, multiples of three bases were likely recruited into or deleted from this ancestral unit, and gene conversion homogenized the divergence between the internal repeats (7). The mouse PRP genes are clustered on chromosome 8 (2). Unusual strain differences of PRP mRNAs in parotid glands of isoproterenol-treated mice have been reported (3).
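As a small illustration of the proposed 42-nucleotide ancestral unit, the sketch below translates it with the standard genetic code and works out how much coding sequence separates a 17-repeat gene from a 13-repeat gene. The function and variable names are illustrative only, and the codon table is deliberately restricted to the codons that occur in this unit.

```python
# Standard genetic code, restricted to the codons present in the repeat unit.
CODON_TABLE = {
    "CCA": "P", "CCG": "P", "CCC": "P", "CCT": "P",
    "GGA": "G", "GGC": "G",
    "CAG": "Q", "CAA": "Q",
    "AGA": "R",
}

# The 42-nucleotide unit as written in the text, with spaces removed.
ANCESTRAL_UNIT = (
    "CCA CCA CCA CCA GGA GGC CCA CAG CCG AGA CCC CCT CAA GGC".replace(" ", "")
)


def translate(dna):
    """Translate an in-frame DNA string codon by codon."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


peptide = translate(ANCESTRAL_UNIT)      # 14 residues per repeat
proline_count = peptide.count("P")       # prolines per repeat

# Difference between M14 (17 repeats) and MP2 (13 repeats).
extra_repeats = 17 - 13
extra_nucleotides = extra_repeats * len(ANCESTRAL_UNIT)
extra_residues = extra_repeats * len(peptide)

print(peptide, proline_count, extra_nucleotides, extra_residues)
```

Because the unit is a multiple of three bases, adding or removing whole repeats keeps the reading frame intact, which is consistent with the variation in repeat number between family members.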
Whatever the mechanism is, interspersed repeated DNA elements may have an effect on gene expression and evolution. One major class of long interspersed repeated DNA (LINE) (>10⁴ copies/genome) in the mouse genome is LIMd (formerly known as BamHI (10) and . The size of LIMd has been estimated to be as large as 7 kb, but most members are truncated at apparently random distances from a common 3' end (14). The 3' end of individual LIMd elements contains an adenine-rich tail. This, coupled with the observation that individual LIMd elements are flanked by small direct repeats of less than 15 base pairs, suggests that each element is generated via an RNA intermediate which is subsequently dispersed to distant locations (13, 14).
In this report, we demonstrate that the EcoRI restriction enzyme site differences in two members of the mouse PRP gene family (7) are due to the presence of a highly repeated DNA element, LIMd. We have isolated PRP genes from cosmid and phage libraries of CD-1 mouse genomic DNA. Restriction site mapping of a 77-kb region revealed the tandem arrangement of two nonallelic PRP genes (MP2 and M14) in the same 5' to 3' orientation. Nucleotide sequence analysis of M14 suggests that MP2 and M14 arose from a common ancestor, with the following major differences: M14 carries an insertion into intron I of a 2-kb LIMd element in the orientation opposite to that of the M14 gene (3' → 5'), and M14 contains 17 tandemly arrayed 42-nucleotide repeats versus 13 in MP2. The location and the nature of the insertion site for the LINE element and the implications for genomic evolution are discussed.
EXPERIMENTAL PROCEDURES
Screening of Mouse Genomic Libraries-Liver DNA was isolated from CD-1 mice and was used to construct cosmid libraries and a λ EMBL-3 library according to established procedures (15). For the cosmid libraries, genomic DNA was partially digested with Sau3A to a length of 30-50 kb and size-fractionated by centrifugation through a 1.25-5 M NaCl gradient. The preparation of vector (pTCF) arms, ligation, and packaging were carried out according to Grosveld et al. (16). The infectious bacteriophage particles were used to transduce Escherichia coli ED8767, with an efficiency of about 320,000 transformants/μg of size-fractionated genomic DNA. Approximately 320,000 recombinant colonies were grown on nitrocellulose filters (10,000 colonies/82-mm filter) and were hybridized to 32P-labeled 5' and exon II probes of mouse PRP gene MP2 (7) according to the screening procedure of Hanahan and Meselson (17). DNA was isolated from positive colonies by alkaline lysis (18). For the phage library, approximately 200,000 recombinant phages were screened, and 13 positive clones were purified and characterized as we reported previously (7, 8).
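To give a feel for the scale of the cosmid screen, the sketch below estimates the number of filters, the genome equivalents screened, and the chance that any given single-copy sequence is represented, using the commonly cited Clarke-Carbon relation P = 1 - (1 - f)^N. The mean insert size (40 kb, the midpoint of the 30-50 kb range) and the haploid mouse genome size (about 3 × 10^9 bp) are assumptions introduced here, not values from this report, and the estimate treats all colonies as independent random clones.

```python
import math


def probability_of_recovery(n_clones, insert_bp, genome_bp):
    """Clarke-Carbon estimate: chance that a given sequence is in the library."""
    fraction_per_clone = insert_bp / genome_bp
    return 1.0 - (1.0 - fraction_per_clone) ** n_clones


N_COLONIES = 320_000          # from the text
COLONIES_PER_FILTER = 10_000  # from the text
MEAN_INSERT_BP = 40_000       # assumption: midpoint of the 30-50 kb fraction
MOUSE_GENOME_BP = 3.0e9       # assumption: approximate haploid genome size

filters_needed = math.ceil(N_COLONIES / COLONIES_PER_FILTER)
genome_equivalents = N_COLONIES * MEAN_INSERT_BP / MOUSE_GENOME_BP
p_recover = probability_of_recovery(N_COLONIES, MEAN_INSERT_BP, MOUSE_GENOME_BP)

print(filters_needed, round(genome_equivalents, 2), round(p_recover, 3))
```

Under these assumptions the screen covers roughly four genome equivalents, so a single-copy locus would be expected to appear in a handful of overlapping cosmids.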
Characterization of Cosmid Clones-Complete restriction maps were derived by a five-step procedure. The vector (pTCF) contains two SalI sites flanking the BamHI cloning site, which facilitates mapping of the recombinant clones. First, the coding regions of the genes were mapped by Southern blotting (19) of restriction digests and hybridization with fragments of the 5' and exon II probes of MP2. Second, the terminal fragments were subcloned by recircularization of cosmid DNA after digestion with restriction enzymes HindIII or BamHI. The plasmid DNA from transformants contained the vector fragment with the origin of replication and AmpR sequences plus the 5'- or 3'-terminal insert fragment. The insert fragment was then isolated by digestion with either SalI plus HindIII (5') or SalI plus BamHI (3') followed by gel purification. The fragments were tested for repetitive sequences by probing a Southern blot of genomic DNA. Third, EcoRI, HindIII, BamHI, and SalI restriction fragments were ordered by a modified partial digestion followed by indirect end labeling. Recombinant cosmids were digested with MluI or NruI first, and the digested cosmid DNA was incubated with the appropriate restriction enzyme. Aliquots of the digest were removed at several intervals between 1 and 30 min and placed in tubes containing 1/3 volume of 0.5 M EDTA. Fragments contained in the aliquots were combined and subjected to electrophoresis, and the separated fragments were blotted to a NYTRAN membrane. Two specific vector probes (NruI/SalI and SalI/NruI) from the left and right side of the BamHI cloning site, respectively, were labeled and hybridized to the blots. Fourth, the complete spectrum of restriction fragments was visualized by radioautography of gels of single and double digestions with several restriction enzymes (six-base cutters), and the digests were labeled by filling in with appropriate [α-32P]deoxynucleotides. Fifth, the exact maps were ascertained by subcloning either SalI or BamHI fragments into pUC19 for detailed restriction enzyme analyses.
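The logic of the third step (partial digestion followed by indirect end labeling) can be sketched as follows. Because the probe lies at one fixed end of the linearized cosmid, every band detected on the blot runs from that end to one restriction site, so the band sizes read directly as site positions, and the differences between successive sizes give the order and length of the fragments. The band sizes below are hypothetical and are not taken from Fig. 1.

```python
def map_sites_from_partials(labeled_band_sizes_kb):
    """Convert end-labeled partial-digest band sizes into a restriction map.

    Each band spans from the probed end (position 0) to one site, so the
    sorted sizes are the site positions and successive differences are the
    ordered fragment lengths.
    """
    positions = sorted(set(labeled_band_sizes_kb))
    fragments = [
        round(right - left, 3)
        for left, right in zip([0.0] + positions, positions)
    ]
    return positions, fragments


# Hypothetical band sizes (kb) read from a blot probed with one vector-end probe.
bands = [9.0, 2.0, 15.5, 5.5]
site_positions, ordered_fragments = map_sites_from_partials(bands)

print(site_positions)     # distance of each site from the probed end
print(ordered_fragments)  # fragment lengths in map order
```

Repeating the analysis with the probe from the opposite side of the cloning site gives an independent reading of the same map from the other end, which is presumably why two vector probes were used.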
DNA Sequence Analysis-Fragments containing the coding regions were subcloned from the phage clones. Progressive and nonrandom shortening of the inserts, from either end separately, was carried out according to Guo and Wu (20) with slight modifications (7). All DNA sequencing was performed by the chemical method of Maxam and Gilbert (21). Computer analyses were performed on an IBM PC/XT computer using Pustell sequence analysis programs (22).

RESULTS
Isolation and Characterization of the Mouse PRP Gene M14-Southern blot hybridization analysis of mouse genomic DNA had shown that two EcoRI fragments (3.1 and 8.6 kb) hybridized to the 5'-specific probe of MP2 (7). Based on restriction enzyme mapping analysis, mouse PRP gene MP2 was represented by the 8.6-kb EcoRI fragment (Fig. 1). The initial purpose of this study was to identify the mouse PRP clone(s) representing the 3.1-kb EcoRI fragment. Two phage clones, M14 and M56, were isolated from an unamplified phage (EMBL3) genomic library. Both clones hybridized to 5' and exon II probes, whereas only clone M14 hybridized to the 3'-specific probe (data not shown). The 3.1-kb EcoRI fragment that hybridized to the 5'-specific probe was detected in both clones. Restriction enzyme analysis showed that M14 contained an 18.4-kb insert and M56 contained a 10.5-kb insert. The 5' part of M14 is identical to M56 with a 2.1-kb extension on the 5' region (Fig. 1). The 3.1-kb EcoRI fragment and the 3.8-kb EcoRI/SalI fragment of clone M14 (Fig. 1
The Structure of M14-The location of M14 coding and noncoding sequences in Fig. 2 indicates that M14 is arranged in the exon/intron configuration characteristic of mammalian PRP genes (7-9), similar to MP2 (7), and that no intron is located between repeats (7-9). M14 has three exons (amino acids -15 to 6 and 7 to 312, and the 3'-untranslated region) (Figs. 1 and 2). Both introns are flanked by consensus splice junctions of RNA transcripts synthesized by RNA polymerase II (23). However, whereas intron I of MP2 is 1433 base pairs in length, intron I of the M14 gene is 3784 base pairs long. Relative to intron I of MP2, intron I of M14 contains two separate insertions of 223 and 2005 base pairs.² To investigate the relationship between MP2 and M14, we compared the two sequences using a homology matrix, or "dot plot," method. A clear homology between these two sequences is revealed using a minimum stringency of 28 out of 40 nucleotides (Fig. 3). This plot shows virtually no spurious background. The displacement of the diagonal line indicates (i) insertions of 223 and 2005 nucleotides in the M14 structure, (ii) fractional sequence differences in the simple repetitive sequences (CA, TA, TAGA) between the M14 and MP2 genes, and (iii) four more 42-nucleotide internal repeats in M14 than in MP2. The dot plot of Fig. 3 serves as a basis for aligning the two sequences (data not shown). Excluding the insertions, the simple sequences and the four
² The 223-nucleotide segment in intron I of M14 may be either the result of an insertion into MP2 or a deletion from M14 to form MP2. The authors are assuming that this 223-nucleotide segment is an insertion, and it is treated accordingly in the paper.
Table I. The remarkable identity indicates that M14 and MP2 have a common ancestor. There are two typical poly(A) addition signals, AATAAA (24), separated by 662 base pairs (Fig. 2). Whether the second signal is used or whether any differential regulation is involved is not known.
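The homology matrix comparison described above can be sketched as a windowed dot plot: a dot is recorded wherever a 40-nucleotide window of one sequence matches a 40-nucleotide window of the other at 28 or more positions. This is a minimal illustration of the general technique, not the Pustell program used for the analysis; the function name and the toy sequences (a random 120-nucleotide sequence and a copy carrying a 30-nucleotide insertion) are assumptions made for the example.

```python
import random


def dot_plot(seq_a, seq_b, window=40, min_matches=28):
    """Return (i, j) window starts where seq_a and seq_b match at >= min_matches
    of `window` aligned positions."""
    hits = []
    for i in range(len(seq_a) - window + 1):
        window_a = seq_a[i:i + window]
        for j in range(len(seq_b) - window + 1):
            matches = sum(x == y for x, y in zip(window_a, seq_b[j:j + window]))
            if matches >= min_matches:
                hits.append((i, j))
    return hits


# Toy data: seq_long is seq_short with a 30-nucleotide insertion at position 60.
rng = random.Random(0)
ancestor = "".join(rng.choice("ACGT") for _ in range(120))
insertion = "".join(rng.choice("ACGT") for _ in range(30))
seq_short = ancestor
seq_long = ancestor[:60] + insertion + ancestor[60:]

hits = dot_plot(seq_short, seq_long)
# Each diagonal is identified by its offset j - i; an insertion in seq_long
# displaces the diagonal by the length of the inserted segment.
diagonal_offsets = sorted({j - i for i, j in hits})
print(diagonal_offsets)
```

In this toy case the diagonal starts at offset 0 and is displaced by 30 after the insertion point, which is the same kind of displacement that reveals the 223- and 2005-nucleotide insertions in the comparison of M14 with MP2.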
Interruption of Intron I by Insertion of a LIMd-PRP Element-To investigate the nature of the large inserted element (2005 bp), a SpeI/AvrII fragment of M14 (bp 933-2994) was used to probe a CD-1 mouse genomic total DNA blot (Fig. 1, probe E). The characteristics of the repetitive DNA and the pattern of hybridization to various restriction enzyme digests of mouse total DNA suggested that this inserted DNA was a member of the mouse LINE family (LIMd). Comparison of this inserted sequence and the LIMd-A2 of BALB/c mouse (25) clearly showed that the inserted sequence was the 3' portion of the mouse LINE element and that it had been transposed into intron I in the orientation opposite to that of M14. The LIMd-PRP, like most mouse LINE elements, is truncated at the 5' end (12, 13). It contains the typical polyadenylation signal AATAAA (24) and an adenine-rich element, and it is flanked by a pair of 10-bp imperfect direct repeats (TGTCTTTTTT) (Figs. 2 and 4). We subsequently aligned the nucleotide sequences bordering LIMd-PRP with their homologous sequences in MP2 (Fig. 4). The 5' boundary of LIMd-PRP was confirmed by the existence of 10-bp imperfect direct repeats at both the putative 5' boundary and adjacent to the 3'-poly(A) tract. This 10-bp sequence is present only once in the MP2 target region (Fig. 4). The presence of this direct repeat, apparently generated by duplication of the single target sequence, is consistent with the hypothesis that LIMd-PRP entered the PRP locus via transposition. The 10-bp direct repeat is imperfect due to the presence of one nucleotide substitution (T/C), which was presumably introduced during or after the transposition event. However, each of the direct repeats differs from the MP2 target sequence by one base substitution. Another sequence feature is the presence of 17-18 bases outside either border of the putative target site of transposition that show dyad symmetry (Fig. 4).
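The two sequence checks used in this argument, an imperfect direct repeat flanking the element and an element lying in the opposite orientation, can be sketched in a few lines. Only the 10-bp repeat TGTCTTTTTT and the AATAAA signal come from the text; the second repeat copy, the element sequence, and the flanking bases below are hypothetical stand-ins, since the actual sequences are shown only in Figs. 2 and 4.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def flanking_repeats(seq, start, end, length=10):
    """Return the bases just 5' and just 3' of seq[start:end] and their mismatches."""
    left = seq[start - length:start]
    right = seq[end:end + length]
    return left, right, hamming(left, right)


left_copy = "TGTCTTTTTT"                     # 10-bp repeat given in the text
right_copy = "TGTCTTTTTC"                    # hypothetical copy with one T/C change
element = "GGATCCAGTCAATAAAGCTAAAAAAAAAA"    # hypothetical truncated element
locus = "ACGTACGT" + left_copy + element + right_copy + "GATTACA"

start = len("ACGTACGT") + len(left_copy)
end = start + len(element)
left, right, mismatches = flanking_repeats(locus, start, end)
print(left, right, mismatches)

# An element inserted in the opposite orientation shows its polyadenylation
# signal as the reverse complement on the gene's coding strand.
print(reverse_complement("AATAAA"))
```

A pair of flanking copies that differ at no more than one position, together with a single copy of the same sequence at the uninterrupted site, is the pattern expected for a duplicated target site.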
Comparison of the other inserted sequence of 223 nucleotides with either the LIMd consensus sequence or the mouse Alu-equivalent B1 and B2 elements (26, 27) showed no evidence of sequence homology. Although this
Linkage of MP2 and M14-To link MP2 and the duplicated gene M14, two cosmid libraries were screened directly after construction. Positive clones were mapped with several restriction enzymes and were compared with the restriction maps of existing clones. Two specific cosmid clones, MC16 and MC22, were selected for further characterization. A compiled map of these PRP genes is shown in Fig. 1. In total, 77 kb were cloned from mouse genomic libraries. The positions of the exons and the orientations of the two genes have been determined by blot analysis of cosmid clones (MC16, MC22) and phage clone (M56) using various probes (Figs. 1 and 5). MP2 is located 29 kb upstream from M14, and they are tandemly arrayed. Fig. 5 shows Southern blots of two overlapping PRP cosmid clones and one phage clone digested with EcoRI and hybridized with 5' (probe B) and exon II (probe C) sequences of MP2. Both the 8.6- and 3.1-kb fragments (Fig. 1, fragments b and d), which were detected in genomic DNA by using the 5' probe, were observed in MC22 (lane 2, probe B). MC16 contained only the 8.6-kb EcoRI fragment (lane 3, probe B). This suggested that MC22 contained portions of both PRP genes MP2 and M14. When the blots were hybridized to exon II, MC22 contained the EcoRI fragment of 8.6 kb (Fig. 1, fragment b) as predicted by the total genomic blot and a 7.5-kb fragment (Fig. 5, lane 2, probe C). The 7.5-kb fragment represented a truncated EcoRI sequence (Fig. 1, fragment e) plus part of the pTCF vector (SalI-EcoRI). This analysis suggested that MC22 as well as M56 contained a 3'-truncated M14 gene.
To show that these two clones were overlapping and that no recombination had occurred during the cloning, the blot was screened with a terminal insert fragment from MC22 (probe A). Terminal fragments were isolated by recircularization of the cosmid DNA after digestion with BamHI, which does not cleave the vector (see "Experimental Procedures"). An 8.5-kb EcoRI fragment from MC16 hybridized to this terminal probe (Fig. 1, fragment a, and Fig. 5, lane 2, probe A), as did the truncated EcoRI fragment a of MC22 plus part of the vector (Fig. 5, lane 3, probe A). Interestingly, a 5.5-kb EcoRI fragment from MC22 and M56 also showed hybridization (Fig. 5, lanes 1 and 2, probe A, and fragment c in Fig. 1). This terminal fragment did not contain repetitive elements, which suggested that the 5'-upstream regions of M14 and MP2 shared sequence homology. Both MP2 and M14 also had two HindIII sites separated by 0.5 kb and about 3.5 kb upstream from the EcoRI sites (Fig. 1). This is in contrast to other multigene families, such as the mouse amylase gene families (28), where restriction enzyme sites in the flanking regions are not conserved. Blots were screened with LIMd-PRP, with alternating purine and pyrimidine sequences (simple repetitive sequences) (Fig. 1, probes D and E), and with the Alu-equivalent B1 probe of mouse (29) (Fig. 5). Only one copy of LIMd was observed in this 77-kb DNA fragment (Fig. 5, probe E), and no other stretches of simple repetitive sequences were found (Fig. 5, probe D). The Alu-equivalent B1 probe hybridized to two distinct EcoRI fragments, fragment g of MC16 and fragment f of MC22 (Fig. 1). Although the exact position was not clear, there is at least one copy of the B1 element in the 5'-upstream region (at least 8 kb) of each PRP gene. This 77-kb DNA fragment contains at least four different genetic elements: LINE, short interspersed repeated DNA, simple repetitive sequences, and tandem repeats.
Both MP2 and M14 are transcriptionally active in the parotid gland when the mouse is treated with isoproterenol (5).³ The number of tandem repeats within each gene is variable, as we proposed (7, 8), and in this respect resembles the variable number of tandem repeats that Nakamura et al. (30) used as markers for mapping human genes. The differences are that our tandem repeats form the major body of the active gene and that there is no sequence similarity between this repeat and the invariant core sequence of the variable number of tandem repeats (30, 31).

DISCUSSION
Evolution of the PRPs-Structural relationships among two mouse PRP genes, MP2 and M14 (7 and this report), human PRP gene PRH1 (9), and hamster PRP gene H29 (8) are summarized in Fig. 6. These genes all contain at least two protein-coding exons. Exon I corresponds generally to the signal peptides of these secreted proteins. Exon II (b or a + b) represents the mature protein including the highly conserved repeated sequences, and exon III contains a 3'-untranslated region. The combination of exons IIa and IIb in the mouse PRP genes (Fig. 6) probably represents a species-specific difference in PRP gene structures. Size variations among the gene products are generally proportional to the number of amino acids in each repeat and to the number of copies of repeated sequences in exon IIb as reported here and previously (3). In contrast to the similarities among the exons, the introns of PRP genes in different species are considerably different in size (Fig. 6), and there is no evidence of significant similarities among their sequences.
Some observations and speculations are given with respect to the transition and carboxyl-terminal segments flanking the highly conserved repeats. These regions are variable in size, and the peptide sequences differ in various family members of the same species and of different species (1). However, in the PRP cDNAs pUMP40 and pUMP4, which are likely encoded by MP2 and M14, respectively, and which have been cloned and sequenced (7), these sequences are still relatively rich in the amino acids Pro, Glu + Gln, Gly, and Asp + Asn, which is a characteristic of the PRPs. The entire mature protein encoded by the PRP gene could have originated from the same simple oligonucleotide series which encodes the repeat, as proposed before (7, 8). Since there are no introns in this coding sequence, numerous duplications likely gave rise to both flanking (transition and carboxyl-terminal regions) and repeating segments over a rather short length of DNA. It is remarkable that the mechanism of duplication has preserved, with one exception, the exactness of the 42-nucleotide repeat in the highly conserved repeat region (7). Mutations likely occurred in the flanking regions which promoted variations in the transition region and the carboxyl terminus. A proposed role for introns is to limit amplification in genes whose protein products cannot tolerate variation in size (32). If so, the presence of introns between each repeat could have inhibited the development of new PRP genes. Ohno (33) has proposed that oligonucleotide repeats are the primordial source of all genes. It seems plausible that genes created recently would be the most likely to retain evidence of primordial repeats.
Recently, Heinrich and Habener (34) and Mirels et al. (35) reported on a multigene family encoding glutamine/glutamic acid-rich secretory proteins, or contiguous identical repeat polypeptides (CRPs), from rat submandibular gland. The CRPs are related to the PRPs, but they contain high amounts of Gln plus Glu (35%) and only about 10% proline. Other interesting similarities between the PRPs and the CRPs are that there are only two introns and three exons in the CRPs with essentially the same gene organization as shown for the mouse PRP genes (Fig. 6), and exons I of these two families (the signal peptide plus 5'-untranslated region) are highly conserved (7, 8, 34). These observations suggest two possibilities: the signal sequences encoded by exon I may be critical for secretion of these salivary gland products, or there were recombinational events in the genesis of these multigene families. In other words, the signal peptide domains and 5'-untranslated regions were added to members of these gene families after duplication and evolution of primordial repeats.
One model for the evolution of the PRP genes is that 42-bp primordial oligonucleotides were duplicated and then diverged in the flanking regions to form transcriptional units. These transcriptional units evolved independently and were subject to differential rates of change. The signal peptide domain may have been added to members of the gene family at this stage or later. Gene conversion could account for the homologies. The observation that gene conversion events in higher eukaryotic organisms often appear to occur within multigene families on the same chromosome (36) supports this contention. Unequal crossing over and gene conversion can result in the concerted evolution of gene families.
In situations such as this, standard numerical techniques for inferring molecular phylogenies or estimating evolutionary distances cannot be directly applied. Further insight into the evolution of the mammalian PRP genes requires characterization of homologs in other species. Restriction site differences at the mouse PRP gene loci MP2 and M14 are due to the presence of a 2.0-kb element. DNA sequence determination showed that the 2.0-kb element is a member of the highly repeated mouse LINE family. Four different strains of mice (DBA/2J, CD-1, C57BL/6J, and BALB/c) were examined, and all contain this LINE insertion (data not shown). Therefore, this nonallelic transposition occurred early in the evolutionary history of the mouse.

Recent studies of the rat genome suggest that the target sites for L1 insertion are nonrandom (37). For example, the target site is usually an A + T-rich region, and a stretch of alternating purine and pyrimidine residues is located 3' downstream of the target site. Figs. 4 and 7 show the junctions between non-LINE and LIMd-PRP DNAs and the comparisons between MP2 and M14. An imperfect stretch of a simple sequence (Fig. 7B, segment B) is 67 base pairs 3'-ward from the target site sequence (boxed), and a perfect stretch of 15 (TA) pairs (Fig. 7B, segment A) occurs 203 base pairs beyond the target site. These two stretches in the LINE-containing site are longer than in MP2. This difference in length is not unexpected since "simple" sequences can either expand or contract by unequal crossover or by "slippage" during DNA replication (38). In M14, there is also a 223-base pair insert between simple sequence B and the non-LINE 5'-ward junction (Fig. 7B). This stretch of simple sequence has the potential to form a non-B DNA structure, such as Z-DNA or a cruciform structure. The non-B DNA conformations, or the transition to and from these conformations, may be involved in various recombinational events (39). In particular, rec-1 protein of Ustilago binds tightly to Z-DNA, which may be an intermediate in the recombination promoted by this protein (40).
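Stretches of simple sequence such as the (TA) run described above can be located with a short pattern search. The sketch below is one way to do this with Python's standard `re` module; the function name, the minimum copy number, and the toy sequence are assumptions for illustration and do not reproduce the sequence in Fig. 7.

```python
import re


def find_simple_repeats(seq, unit="TA", min_copies=5):
    """Return (start, copies) for each perfect tandem run of `unit`
    containing at least `min_copies` copies."""
    pattern = re.compile(r"(?:%s){%d,}" % (re.escape(unit), min_copies))
    return [
        (match.start(), len(match.group()) // len(unit))
        for match in pattern.finditer(seq)
    ]


# Toy sequence: a (CA)7 run and a perfect (TA)15 run separated by four bases.
toy = "GGC" + "CA" * 7 + "CTCA" + "TA" * 15 + "GTAG"

ta_runs = find_simple_repeats(toy, unit="TA")
ca_runs = find_simple_repeats(toy, unit="CA")
print(ta_runs)  # [(21, 15)]
print(ca_runs)  # [(3, 7)]
```

Running the same search on two aligned homologous regions and comparing the copy numbers is a simple way to quantify the expansion or contraction of such stretches between genes.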
The role of LINE elements in the mammalian genome is as yet unknown. A DNA element which is transcriptionally active or highly repeated, or both, could have dramatic effects on contiguous regions of DNA. The transposition of a repeated DNA element to (or from) a given locus could alter the regulatory environment. The integration of a LINE element may supply either external promoter or enhancer activities. In this report, the LINE element is inserted into the PRP gene in the opposite orientation. This insertion may affect isoproterenol inducibility of PRP gene M14. Hutchison et al. (41) have proposed that gene duplications are the major source of genomic growth and that genome growth is a consequence of a higher rate of fixation than deletion during evolution. The altered locus could then be subject to a number of genetic effects, such as recombination of the repeated DNA sequences, to produce the plasticity of the genome. The mouse albumin and α-fetoprotein genes (42) and two chicken γ-crystallin genes (43) are present in the same 5' to 3' orientation. In each pair, both genes are expressed in the same tissue, but they have diverged with respect to developmental regulation, and one gene of each pair is considerably more active. As these examples indicate, tandem gene duplication followed by divergence can produce independently regulated genes in close physical proximity. This report represents a first step in characterizing the chromosomal organization of the mouse PRP multigene family.