Comparison of the flanking regions and introns of the mouse 2,3,7,8-tetrachlorodibenzo-p-dioxin-inducible cytochrome P1-450 and P3-450 genes.

The C57BL/6N inbred mouse cytochrome P1-450 and P3-450 genes, two genes in the same family and under control by the Ah receptor, have been completely sequenced. The transcription initiation sites were confirmed by primer extension studies. An additional 823 and 893 bp of the 5' upstream flanking regions of P1-450 and P3-450, respectively, and 1771 and 1251 bp of the 3' downstream flanking regions of P1-450 and P3-450, respectively, were sequenced and studied. P1-450 exons total 2619 nucleotides, and the gene spans 6215 bp. P3-450 exons total 1892 nucleotides, and the gene spans 6716 bp. Three interesting highly homologous regions of 11 or 12 bp, upstream between -280 and -530 from the cap site of both genes, are noted as possible candidates for binding by the inducer-Ah receptor complex (and/or other DNA-binding regulatory proteins). Several stretches of DNA upstream from the cap site, in several introns, and in the 3' flanking region of both genes have a high degree of homology with known core enhancer sequences. Other interesting stretches (DNA with Z-DNA-forming properties, DNA with recombinational potential, highly repetitive and middle repetitive sequences between 50 and 360 bp in length, and "simple" sequences presumably having no function in gene expression) exist throughout many of the introns and flanking regions in both the positive and negative strands of both genes. The mouse 2,3,7,8-tetrachlorodibenzo-p-dioxin-inducible and rat phenobarbital-inducible P-450 genes were compared for the amino acid residue number at each exon-intron junction, the location in the coding triplet at which the exons are split, and homologies among introns and exons. It can be shown that these two gene families probably diverged from a common ancestor more than 200 million years ago and that P1-450 and P3-450 split from each other about 65 million years ago.

Full-length cDNA clones for both P1-450 and P3-450 have recently been isolated (10) and sequenced (11,12). The complete cDNA nucleotide and deduced amino acid sequences exhibit 68% and 73% similarity, respectively. It was estimated that these two homologous genes of the same P-450 subfamily diverged from each other about 65 million years ago (12). The genomic clones for P1-450 and P3-450 have similar exon-intron patterns, the 2nd and 7th exons being much larger than the other 5 (10). The P1-450 and P3-450 genes have been localized to mouse chromosome 9 (13). The transcriptional activation of both genes by 3-methylcholanthrene and TCDD (14) has been rigorously correlated with the presence of the inducer-receptor complex gaining high affinity for nuclear chromatin material (14,15). The P1-450 induction process occurs developmentally several weeks earlier than the P3-450 induction process (16,17); both the P1-450 and P3-450 induction processes occur in liver, and P1-450 but not P3-450 induction occurs in C57BL/6N kidney (17). All of these lines of evidence taken together strongly suggest that these two genes will be found to lie in tandem and that the inducer-receptor complex controls the expression of both genes in an unknown manner with developmental and tissue specificity.
In this report we have examined the 5' and 3' flanking regions and all introns of both strands of both the P1-450 and P3-450 genes. We had anticipated, and in fact do find, numerous segments having Z-DNA-forming potential, regions of DNA with homology to previously published enhancer core sequences, and stretches of DNA homologous to highly and middle repetitive sequences. One strength of this study is that we are comparing these two homologous genes in the same P-450 subfamily that had been cloned from the same genomic DNA library from liver of the inbred C57BL/6N mouse strain. Any interesting regions that occur approximately the same distance upstream from the cap site, downstream in the 3' flanking region, or in the same intron of both P1-450 and P3-450 will have a much higher probability of being functionally important (e.g., associated with transcriptional activation by TCDD) than if the interesting region exists in one gene but not the other. On the other hand, one or more interesting regions that differ between the two genes may reflect the distinct dissimilarities in developmental and tissue specificity mentioned above.
P-450 denotes any or all forms of cytochrome P-450 (multisubstrate monooxygenases). Mouse P1-450 and P3-450 denote those forms of 3-methylcholanthrene- or TCDD-induced P-450 from C57BL/6N liver with the highest turnover numbers for induced aryl hydrocarbon (benzo[a]pyrene) hydroxylase and acetanilide 4-hydroxylase activity, respectively (1). P3-450 was formerly called "P-448" (1). More recently, a second P-450 having a Soret maximum of 448 nm for the reduced hemoprotein-CO complex was purified from DBA/2N liver (2) and named P2-450. P2-450 in DBA/2N mice may represent a polymorphism with P3-450 in C57BL/6N mice. Mouse P1-450 and P3-450 correspond to rat P-450c and P-450d (3), respectively, and rabbit form 6 and form 4 (4, 5), respectively.

EXPERIMENTAL PROCEDURES
Isolation and Subcloning of the P1-450 and P3-450 Genes-C57BL/6N mice were obtained from the Veterinary Resources Branch of the National Institutes of Health (Bethesda, MD). Liver from these 3-methylcholanthrene-treated animals was used to isolate P1-450 and P3-450 full-length cDNA clones as previously described (10). A C57BL/6N liver genomic library was constructed with the lambdoid phage vector Charon 30 (18) as outlined in detail in Ref. 10. The library was screened with the P1-450 and P3-450 full-length cDNA inserts in order to isolate the P1-450 and P3-450 genes (10). Subclones of the P1-450 and P3-450 genes were introduced into pBR322 by means of BamHI and HindIII digestion, respectively. Inserts were isolated from these subclones and subjected to DNA sequencing.
DNA Sequencing-All DNA sequence determinations were carried out by use of the M13 cloning and dideoxynucleotide sequencing strategies (19). Libraries of shotgun clones were produced and sequenced from each genomic subclone (20). Briefly described, DNA fragments (5 µg) were circularized by ligation and sonicated with four bursts (10 s each) at 100 W in 0.5 ml of 10 mM Tris-HCl buffer, pH 7.5. Fragments were concentrated by ethanol precipitation and repaired with the Escherichia coli DNA polymerase large fragment. DNA fragments ranging from 500 to 1000 bp were isolated by agarose gel electrophoresis and ligated into the SmaI site of M13 mp11. Sequencing was carried out with the dideoxynucleotide reagent kits (P-L Biochemicals, Milwaukee, WI), [α-32P]dATP (400 Ci/mmol; Amersham Radiochemicals, Chicago, IL), and the DNA polymerase large fragment (Bethesda Research Laboratories, Bethesda, MD). Sequences were displayed on a salt gradient (21) and standard 50% urea-6% acrylamide gels (22). Sequence alignments were made with use of the Staden consensus program (23); any gaps that remained after the sequencing of many shotgun clones were filled in by sequences obtained from restriction site-directed clones. Partial restriction maps, and directions of the primary fragments used to determine the complete nucleotide sequence of each gene, are displayed in Fig. 1. The BamHI and HindIII sites in the P1-450 and P3-450 genes that were used to construct the subclones (with the exception of one HindIII site in exon 2 of P3-450) were not crossed by DNA sequencing. The possibility of a small stretch of DNA missed at these sites was ruled out, however, based on high resolution restriction mapping with acrylamide gels.
Primer Extension Analysis-Locations of the cap sites of the P1-450 and P3-450 mRNAs were determined by the method of primer extension analysis (24). Primers were isolated from the P1-450 and P3-450 cDNA clones and labeled at their 5' ends with [γ-32P]ATP (4000 Ci/mmol; Amersham Radiochemicals) and polynucleotide kinase. Each primer was denatured in 100 µl of 80% formamide by incubation at 90 °C for 5 min. The solution was cooled to 55 °C, and 50 µl were added to 5 µg of control or 5 µg of 3-methylcholanthrene-induced mRNA preparations. (These preparations, obtained from 3-methylcholanthrene-treated C57BL/6N liver, had previously been dried with NaCl, ethylenediaminetetraacetic acid, and PIPES (pH 6.4) to give reconstituted concentrations of 0.5 M, 1 mM, and 40 mM, respectively.) The samples were incubated for 3 h at 55 °C and concentrated by ethanol precipitation. The primers were elongated by use of reverse transcriptase and then denatured and electrophoresed on salt gradient sequencing gels (21).

RESULTS AND DISCUSSION
Exon-Intron Junctions-Sequences of the two genes and flanking regions are shown in Figs. 2 and 3. Analysis of both genes across the exon-intron junctions (Table I) reveals that all splice sites rigorously follow the Chambon rule (25). The consensus sequence of the acceptor splice sites is PyAG|N. The consensus sequence of the donor splice sites is Pu|GTPu.
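The two consensus patterns stated above can be written as a simple pattern check. The following Python sketch is illustrative only: the function names are hypothetical, the junction sequences passed to them would have to come from Table I, and it assumes the usual shorthand in which Pu is A or G, Py is C or T, N is any base, and the vertical bar marks the exon-intron boundary.

```python
import re

# Donor site, Pu|GTPu: exon ends in a purine, intron begins GT + purine.
DONOR = re.compile(r"^[AG]GT[AG]$")
# Acceptor site, PyAG|N: intron ends in pyrimidine + AG, exon begins with any base.
ACCEPTOR = re.compile(r"^[CT]AG[ACGT]$")


def is_consensus_donor(exon_end: str, intron_start: str) -> bool:
    """Test the last exon base plus the first three intron bases."""
    window = (exon_end[-1:] + intron_start[:3]).upper()
    return bool(DONOR.match(window))


def is_consensus_acceptor(intron_end: str, exon_start: str) -> bool:
    """Test the last three intron bases plus the first exon base."""
    window = (intron_end[-3:] + exon_start[:1]).upper()
    return bool(ACCEPTOR.match(window))
```

Each function inspects only the four bases spanning the junction, which is all the stated consensus constrains; it says nothing about branch points or longer splice-site context.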
Primer Extension Analysis-We had concluded, from sequencing of the presumably full-length cDNA clones, that P1-450 mRNA was 2620 nucleotides (12) and P3-450 mRNA was 1894 nucleotides (11) in length. Sequencing of the genomic clones, plus primer extension studies (Fig. 4), confirmed that our cDNA clones (10) contained virtually every base in the mRNA. The 5'-most G and the 3'-most A in both P1-450 (12) and P3-450 (11) cDNA represent nucleotides from the poly(dG) and poly(dA) tracts, respectively, of the cloning vector (26). One additional base in P1-450 exon 7 (between position 6047 and 6048 in the gene sequence of Fig. 2) was originally reported (12) and is an error. This now stands corrected in Fig. 2. Hence, we are now certain that the total lengths of the P1-450 and P3-450 mRNAs are 2619 and 1892 bp, respectively.
Primer extension analysis (Fig. 4)  a second initiation site 2 nucleotides upstream. The reason for the P1-450 fragments of 4 and 5 nucleotides following reverse transcriptase treatment is unclear but may represent homology to another mRNA in this primer extension assay. In view of the lack of information about the number of members in the P-450 gene superfamily, it is possible that these bands are not artifacts but rather represent true extension products on previously uncharacterized mRNAs.
In each case two potential cap sites were detected in both P1-450 and P3-450 mRNAs. This kind of difficulty in pinpointing the cap site to a single nucleotide is commonly found among eukaryotic genes and may represent two bona fide cap sites. In conclusion, primer extension studies confirm that the P1-450 gene with its 7 exons and 6 introns spans 6215 bp (Fig. 2), whereas the P3-450 gene with the same number of exons and introns spans 6716 bp (Fig. 3).
(35-37). Although each of these illustrated stretches of DNA is homologous to core enhancer elements, it goes without saying that biological function must be demonstrated before any of these regions can be proved without a doubt to be important in the expression of either gene.
No direct repeats, inverted repeats, or dyad symmetries of significant length were found in either gene or the extensive flanking regions that we have sequenced. No homology across ... compared. Hence, we find no evidence for gene conversion and/or unequal crossing-over between these two genes. Most likely, these two genes thus arose from duplication.
"Simple sequences" are stretches of DNA that consist of one or several tandemly repeated sequences, e.g., (AT)n, (AG)n, (GT)n(CA)n, (CTT)n, and (GAG)n (41). These simple sequences are repetitive, interspersed throughout many eukaryotic genomes, and are believed to arise by slippage replication and unequal crossing-over and to have no general function with regard to gene expression (41). Many of these sequences can be seen in the introns and flanking regions of the P1-450 and P3-450 genes (Figs. 2 and 3).
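Tandem repeats of the kind described here, e.g., (AT)n or (CTT)n, can be located in a sequence with a back-referencing regular expression. The Python sketch below is a minimal illustration and not the procedure used in this study; the function name, the unit lengths of 2 to 3 bases, and the threshold of four consecutive copies are all assumptions chosen for the example.

```python
import re


def find_simple_sequences(seq, min_unit=2, max_unit=3, min_repeats=4):
    """Return sorted (start, unit, repeat_count) tuples for tandem repeats.

    start is a 0-based offset into seq; thresholds are illustrative assumptions.
    """
    seq = seq.upper()
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # One captured unit followed by at least (min_repeats - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if len(set(unit)) == 1:
                continue  # skip homopolymer runs such as AAAAAAAA
            hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(hits)
```

Because the scan reports the unit as it is first encountered, the same repeat may be reported as (AT)n or (TA)n depending on where the run begins; a fuller treatment would normalize rotations of the unit and also search the complementary strand.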
Comparison of Exon and Intron Conservation-The two homologous genes in this P-450 family have similar exon-intron patterns, with the 2nd and 7th exons much larger than the other 5. The 727-bp difference in size between P1-450 and P3-450 cDNAs can be mostly accounted for by the larger 3' seventh exon of P1-450 (Table II). Variability among the introns is noteworthy. P1-450 intron 1 is more than twice as large as P3-450 intron 1, P3-450 intron 3 is more than seven times longer than P1-450 intron 3, and P3-450 intron 6 is more than 10 times longer than P1-450 intron 6. The degrees of nucleotide similarity among the introns range between 20 and 42%; these values were consistent with the expected amount of nucleotide divergence among introns during about 65 million years.
Although the first 6 exons are similar in size, it is interesting that exons 4, 5, and 6 are exactly the same size in the two genes: 90, 124, and 87 bp, respectively. Besides exon 2 (81% similarity), exons 4, 5, and 6 are quite conserved (73%, 85%, and 78% similarity, respectively). Exons 2, 4, 5, and 6 and the 5' end of exon 7 thus are good candidates for encoding highly conserved protein domains-perhaps those of importance to catalytic activities of these two membrane-bound multicomponent enzymes. Other possible reasons for conservation of these protein subunits include (i) type of attachment and configuration in the endoplasmic reticulum, (ii) participation in the heme-binding enzyme active site, and (iii) area of attachment of the NADPH-P-450 oxidoreductase. The so-

Comparison of P1-450 and P3-450 with Rat P-450e-It should be noted that the locations in the coding triplet at which the exons are divided are identical between P1-450 and P3-450 (Table III). The residue number at the end of each exon varies by no more than 3 amino acids between P1-450 and P3-450. Differences between the TCDD-inducible and the phenobarbital-inducible P-450 gene families are striking: (i) the former has 7 exons and the latter has 9 (43,44); (ii) the former genes span less than 7 kb while the latter genes span

TABLE IV List of potentially interesting homologous regions between mouse P1-450 or P3-450 and sequences published in the genetic sequence data bank
The computer program SRCHN (47) compares a given nucleic acid sequence with all sequences in the GenBank Data Bank as of June 1, 1984, using the algorithm of Wilbur and Lipman. It must be recognized in gene data bank searches that repetitive tracts such as (AG)n, (CT)n, (CCT)n, (GC)n, (CAA)n, and (GAA)n may yield high similarity scores of unknown biological importance; such tracts are omitted from further consideration in this table. Besides those listed, other P1-450 and P3-450 exons exhibited significant similarity with rat P-450e exons, but more homology was seen at the protein level. At the nucleotide level, substantial divergence has occurred, and therefore the significance of the match does not show up in this search program.
homologies found between several exons at the nucleotide level (Table IV) and among the amino acids encoded by distinct exons of the two P-450 gene families, we conclude that P1-450 and P3-450 exon 2 are similar to P-450e exons 1 through 5, P1-450 and P3-450 exon 3 is similar to P-450e exon 6, P1-450 and P3-450 exons 4 and 5 are similar to P-450e exon 7, P1-450 and P3-450 exon 6 is similar to P-450e exon 8, and P1-450 and P3-450 exon 7 is similar to P-450e exon 9. No inversions of genetic material have been detected during this evolutionary process. We therefore conclude that the ancestral P-450 gene had a minimum of 14 exons (46). No significant homology with other published P-450 nucleotide sequences was found in any flanking region, intron, or on the negative DNA strand (Table IV). These data also indicate that there exists no evidence for inversion during evolution of these two P-450 gene families. Comparison of the two TCDD-inducible P-450 proteins from mouse and rat (12) results in an estimate of 1% divergence every 2.4 million years.
We thus conclude that the P1-450 and P3-450 genes diverged from each other about 65 million years ago and that the TCDD-inducible and phenobarbital-inducible subfamilies separated from each other more than 200 million years ago.
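The 65-million-year figure is consistent with a simple linear clock: the 73% amino acid similarity between P1-450 and P3-450 quoted in the introduction implies 27% divergence, and 27 x 2.4 is approximately 64.8 million years. The calculation is not spelled out in the text, so the Python sketch below rests on two assumptions: that the rate of 1% per 2.4 million years applies at a constant pace, and that the amino acid similarity is the value the estimate was based on. The function name is hypothetical.

```python
def divergence_time_myr(percent_similarity: float, myr_per_percent: float = 2.4) -> float:
    """Estimate time since divergence, in millions of years.

    Assumes a constant rate of 1% sequence divergence per `myr_per_percent`
    million years (2.4 is the rate quoted in the text for the mouse and rat
    TCDD-inducible P-450 proteins).
    """
    percent_divergence = 100.0 - percent_similarity
    return percent_divergence * myr_per_percent


# 73% amino acid similarity between P1-450 and P3-450 (value quoted in the text).
print(round(divergence_time_myr(73.0), 1))
```

A linear extrapolation of this kind ignores multiple substitutions at the same site, so it becomes less reliable for more distant comparisons such as the estimate of more than 200 million years for the two gene families.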

The data in Tables III and IV confirm further that these two gene families arose from a common ancestor.
Intermediate Repetitive Sequences-During the computer program gene search, several large stretches of DNA (50 to 360 bp) in the P1-450 and P3-450 introns were homologous to previously reported sequences (Table IV) at occurrence rates between 10 and 56 standard deviations above the mean. Such stretches appeared on both the positive strand and the reverse strand. These appear to represent highly repetitive and middle repetitive sequences, such as the PR1 and Alu types, many of which are found in the introns and flanking regions of other reported genes. None was found in any exon. Hence, P1-450 introns 1 and 2 and the P3-450 5' flanking region, introns 1, 2, 3, and 6, plus the P3-450 3' flanking region, all have repetitive elements of one or another family. Such insertions of these repetitive elements, in addition to duplication and unequal crossing-over of simple sequences which presumably have no function, undoubtedly account for the large differences (in one case more than 10-fold) in lengths of the P1-450 and P3-450 corresponding introns. These data show quite dramatically, during the approximately 65 million years that these genes have diverged, how the exon sequences and lengths are highly conserved yet the intron sequences and lengths clearly are not.
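Expressing a match as a number of standard deviations above the mean amounts to a z-score of its similarity score against the distribution of scores obtained in the same search. How SRCHN computes this value is not described here, so the Python sketch below is only a generic illustration; the function name and the use of the population standard deviation of a set of background scores are assumptions.

```python
from statistics import mean, pstdev
from typing import Sequence


def sd_above_mean(score: float, background_scores: Sequence[float]) -> float:
    """Express a similarity score in standard deviations above the background mean."""
    mu = mean(background_scores)
    sigma = pstdev(background_scores)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (score - mu) / sigma
```

On this scale a value of 10 or more, as reported for the repetitive elements above, lies far outside what chance similarity among unrelated sequences would be expected to produce.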
Conclusions-We have sequenced all introns and exons, plus a portion of the 5' and 3' flanking regions, of both the P1-450 and P3-450 genes in the C57BL/6N inbred mouse strain. Regions of possible regulatory importance have been pointed out. This study is an important prelude to experiments designed to prove a biological function for any of these interesting regions within the gene and/or flanking regions.