Structure of the Human Type I DNA Topoisomerase Gene*

We describe the molecular organization of the human gene coding for type I DNA topoisomerase. The coding sequence is split into 21 exons distributed over at least 86 kilobase pairs (kb) of human genomic DNA. The sizes of the 20 introns vary widely between 0.2 and at least 30 kb and all contain the sequence elements known to be required for pre-mRNA splicing. Several of the intron sequences separate exons encoding parts of the enzyme that are highly conserved between human and yeast, suggesting that at least some of the exons may code for individual, structurally or functionally important domains of the enzyme. We also describe the promoter sequence of the human topoisomerase I gene and show that it is composed of distinct functional elements.


Norbert Kunze, Guochen Yang, Martin Dolberg, Rolf Sundarp, Rolf Knippers, and Arndt Richter
From the Division of Biology, University of Konstanz, Germany

Eukaryotic topoisomerases participate in many genetic reactions requiring the separation of complementary DNA strands, such as transcription (see for example Liu and Wang, 1987; Brill and Sternglanz, 1988; Stewart et al., 1990; Ostrander et al., 1990, and references therein), replication (Yang, 1987; Avemann et al., 1988; Holm et al., 1989; for recent reviews see Zhang et al., 1990), and recombination (Bullock et al., 1985; Wallis et al., 1989; Rose et al., 1990; for a recent review see Wang et al., 1990). DNA strand separation causes an alteration of the twist and the writhe of topologically restrained chromosomal DNA. Topoisomerases recognize these changes and catalyze the relaxation of the torsional tension in the DNA (reviewed by Saavedra, 1990). One of the two major cellular topoisomerases is the type I DNA topoisomerase.
Type I topoisomerase initiates its reaction cycle by binding to the substrate DNA, followed by the introduction of a nick into one strand of the double-stranded DNA molecule and the formation of a covalent phosphodiester bond between a tyrosine-OH group of the enzyme and the 3' terminus of the broken DNA strand. Topological tension is then released at the site of strand discontinuity and the nick is resealed as a final step of the reaction cycle. Topoisomerase I changes the linking number of DNA duplexes in steps of one; the eukaryotic enzyme is able to relax positive and negative superhelical turns in topologically fixed superhelical DNA molecules (reviewed by Vosberg, 1985; Wang, 1985).

* This work was supported by Fonds der Chemie, the Deutsche Forschungsgemeinschaft through SFB 156, and the FAZIT Stiftung through a fellowship (to G. Y.). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact. The nucleotide sequence(s) reported in this paper has been submitted to the GenBank™ …
Reactions leading to an alteration in the DNA structure are expected to occur quite frequently in genetically active cells. It is therefore not surprising that the enzyme is present in a relatively high amount of about 10^6 copies per mammalian cell nucleus, i.e. roughly one enzyme molecule per five nucleosomes.
Topoisomerase I is constitutively expressed in all nucleated cells examined as well as under a variety of physiological conditions. In spite of a high basal level of topoisomerase I expression, the topoisomerase I gene is highly regulated. Increased levels of topoisomerase I mRNA are found after growth stimulation of resting culture cells (Sobczak et al., 1989; Romig and Richter, 1990a), after adenovirus infection (Romig and Richter, 1990b), and as a response to phorbol ester treatment (Hwong et al., 1989). However, the molecular basis for the topoisomerase I gene regulation has yet to be explored.
In a more medically oriented vein, topoisomerase I is interesting because it is the target of a therapeutically used antitumor drug, camptothecin (Hsiang et al., 1985; Mattern et al., 1987; Hsiang and Liu, 1988). It is also an autoantigen in a subtype of systemic sclerosis (Shero et al., 1986; Maul et al., 1989; Verheijen et al., 1990).
This and its general physiological importance prompted us to investigate the molecular genetics of the human topoisomerase I.
The human genome contains three loci hybridizing with topoisomerase I-derived cDNA probes (Kunze et al., 1989). Two of these are truncated processed pseudogenes (retrosequences) (Yang et al., 1990), whereas the third locus on chromosome 20q11.2-13.1 contains the functional topoisomerase I gene (Juan et al.; Kunze et al., 1989). Recently, we have isolated and sequenced the promoter region of the functional topoisomerase I gene and determined the location of DNase I hypersensitive sites within the upstream region (Kunze et al., 1990).
Here we describe the structure of the complete coding region of the human topoisomerase I (hTOP1) locus which includes 21 exons spread over at least 85 kb of genomic DNA. We also present an analysis of the minimal promoter region necessary for the expression of the topoisomerase I gene.

MATERIALS AND METHODS
Genomic Libraries-Human HeLa cell and lymphocyte DNA was partially digested with the restriction endonuclease Sau3A and linked to the arms of the phage λ-derived vectors EMBL3 and L47.1, respectively. The EMBL3 library has been described before (Yang et al., 1990). The L47.1 library was kindly given to us by B. Horsthemke (Essen, Germany). One clone, ATP15 (see Fig. 1 … Type Culture Collection (ATCC 57712). The libraries were screened using either radioactively labeled fragments of a topoisomerase I-specific cDNA (Oddou et al., 1988; D'Arpa et al., 1988) or labeled oligonucleotides synthesized according to the published cDNA sequence (D'Arpa et al., 1988) and by genomic walking. The screening procedure and the isolation and propagation of positive phage clones were performed according to standard protocols (Maniatis et al., 1982).
Identification of Coding Regions-DNA inserts from positive phage clones were digested with different restriction endonucleases and investigated by the Southern procedure (Southern, 1975) to identify restriction fragments carrying topoisomerase I encoding exons. Fragments of appropriate size were then subcloned in M13mp18 and mp19 vectors and sequenced according to the dideoxy method (Sanger et al., 1977). Southern blots of restricted insert DNA were also hybridized to BLUR 8, a representative of the AluI family of highly repetitive interspersed DNA elements (Deininger et al., 1981), and to DNA probe 3.1.8, a member of the long interspersed repetitive DNA family (Gruss et al., 1988).
Promoter Analysis-A 918-bp restriction fragment extending from the XhoI site in the 5'-nontranslated region of exon 1 to the SphI site 770 bp upstream of the 5'-most transcriptional start site of the hTOP1 gene was ligated to the appropriate polylinker sites upstream of the chloramphenicol acetyl transferase (CAT) reporter gene in plasmid pBLCAT3 (Luckow and Schutz, 1987). As previously shown, the CAT sequence of the resulting plasmid pTPCAT1 is expressed under the direction of the linked TOP1 gene promoter (Kunze et al., 1990). Deletions of the TOP1 gene promoter region were constructed by exonuclease III trimming (Sambrook et al., 1989) of the SphI-BstEII-digested pTPCAT1 plasmid. The exact lengths of the upstream regions present in the derivatives (clones E3.2, -6, -9, -15, and -16) were determined by nucleotide sequencing. Plasmids with intact or deleted promoter regions (2 μg) were transfected into HeLa cells using the calcium phosphate precipitation technique (Banerji et al., 1983). The β-galactosidase-encoding plasmid pSVp (MacGregor and Caskey, 1989) (2 μg) was cotransfected and served as an internal standard to control the transfection efficiency. The CAT activities in cell extracts were determined using [3H]acetyl-coenzyme A as substrate. The reaction product, [3H]acetylchloramphenicol, was recovered from the reaction mixture by extraction with a nonaqueous scintillation cocktail as described by Neumann et al. (1987). The measured CAT activity was corrected relative to the β-galactosidase activity (determined according to MacGregor and Nolan, 1989).
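The correction of CAT activity for transfection efficiency described above can be illustrated with a minimal sketch. It assumes that the correction is a simple ratio of CAT signal to β-galactosidase activity and that results are expressed relative to the full-length promoter construct; the function names and all numbers are hypothetical and are not data from this study.

```python
from statistics import mean

def normalized_cat_activity(cat_signal, beta_gal_activity):
    """Correct one CAT measurement for transfection efficiency (assumed: simple ratio)."""
    if beta_gal_activity <= 0:
        raise ValueError("beta-galactosidase activity must be positive")
    return cat_signal / beta_gal_activity

def relative_activity(construct_values, reference_values):
    """Mean normalized activity of a construct as a percentage of the reference construct."""
    return 100.0 * mean(construct_values) / mean(reference_values)

# Hypothetical (CAT signal, beta-gal activity) pairs from two transfections each.
full_promoter = [normalized_cat_activity(c, b) for c, b in [(12000, 1.0), (6000, 0.5)]]
deletion = [normalized_cat_activity(c, b) for c, b in [(6000, 1.0), (3000, 0.5)]]

print(relative_activity(deletion, full_promoter))
```

Dividing each measurement by its own internal standard before averaging keeps a poorly transfected dish from lowering the apparent promoter strength.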

RESULTS AND DISCUSSION
Gene Structure-We have isolated a large number of phage clones with inserted 10-15-kb-long human genomic DNA fragments carrying regions hybridizing with topoisomerase I cDNA. One set of clones contains sections of the two processed truncated TOP1 retrosequences (pseudogenes) which we have described before (Yang et al., 1990). Another set of clones possesses inserts derived from the active topoisomerase I gene. These clones cover the entire coding region of the hTOP1 locus as shown in Fig. 1.
The coding region is split into 21 exons (Fig. 1). Their exact lengths and their relationship to the known nucleotide sequence of the topoisomerase I cDNA (D'Arpa et al., 1988) are summarized in Table I. The combined coding sequence of the gene is almost identical with the published topoisomerase cDNA sequences except for some minor deviations: (i) The first nine nucleotides of the cDNA are not found in the genomic sequence. Most probably, these nucleotides are derived from the oligonucleotide linker used for the construction of the cDNA library. (ii) In the genomic sequence we find a cytosine residue instead of a thymine residue at a position corresponding to nucleotide 645 of the published cDNA sequence (D'Arpa et al., 1988). This results in a Val → Ala exchange at position 145 of the protein sequence. (iii) Two additional nucleotides are recognized in the 3'-nontranslated region: an adenine residue at the cDNA position 3160 and a cytosine residue at position 3163. These latter changes have been noted before (Maul et al., 1989; Zhou et al., 1989).
Seventeen of the 21 hTOP1 exons are in the size range of 69-188 bp like most exons of mammalian genes (Traut, 1988; Hawkins, 1988). Exons 2 and 5 are smaller than average with 25 and 56 bp, respectively (Table I). The largest exon includes the coding sequence for the C-terminal 34 amino acids, the stop codon (TAG) and a long 3'-untranslated region of about 1200 nucleotides. As pointed out before (Yang et al., 1990) the published cDNA sequence (D'Arpa et al., 1988) ends at a genomic EcoRI restriction site (see restriction maps of clones ATP10 and ATP15 in Fig. 1) 1138 nucleotides distal to the stop codon. In the sequence of exon 21, one putative polyadenylation signal (AATAAA) appears 1167 nucleotides distal to the stop codon. In the TOP1 pseudogenes, the same AATAAA motif occurs in an identical sequence environment and is followed, 25 bp further downstream, by a stretch of adenine residues (Yang et al., 1990). It is therefore quite likely that the polyadenylation site in the functional gene is also located 25 nucleotides 3' to the AATAAA motif. Thus, exon 21 comprises almost 1300 bp and is much longer than the other exons of the hTOP1 gene as is quite typical of the terminating exons in mammalian genes (Hawkins, 1988).
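The Val → Ala exchange noted under (ii) follows directly from the standard genetic code: every valine codon has T at its second position, and replacing that T with C yields an alanine codon. The short sketch below checks this for all four valine codons. It assumes only that the substitution falls at the second codon position; the actual codon in the hTOP1 sequence is not stated in the text, and the function names are illustrative.

```python
# Standard genetic code, restricted to the two amino acids discussed in the text.
VAL_CODONS = ("GTT", "GTC", "GTA", "GTG")
ALA_CODONS = ("GCT", "GCC", "GCA", "GCG")

def substitute(codon, position, base):
    """Return the codon with the base at the 0-based position replaced."""
    return codon[:position] + base + codon[position + 1:]

def val_codons_after_t_to_c():
    """Map each valine codon to the codon produced by T -> C at the second position."""
    return {codon: substitute(codon, 1, "C") for codon in VAL_CODONS}

for before, after in val_codons_after_t_to_c().items():
    label = "Ala" if after in ALA_CODONS else "other"
    print(before, "->", after, label)
```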
The 21 exons are separated from each other by intron sequences of various lengths. The lengths of 17 introns could be determined by nucleotide sequencing or by restriction mapping of the inserts of the respective clones. However, introns 2, 3, and 8 are on nonoverlapping clones. Based on the size of the insert DNA, intron 2 must be longer than 28 kb, and introns 3 and 8 must be longer than 7 and 3.4 kb, respectively. We performed Southern blot hybridizations to estimate the minimal sizes of the gaps between the nonoverlapping inserts (data not shown). For this purpose we used as probes single copy DNA fragments located close to the insert ends of the respective clones. The data obtained are included in Table I.
We have used these data to construct a map of the hTOP1 locus (Fig. 1). A striking feature of the map is that the 21 exons are unevenly distributed over the more than 100-kb-long genomic region (Fig. 1). The closely adjacent exons 1 and 2 are separated by more than 30 kb of intronic DNA from exon 3. Another relatively long intron sequence (>7 kb) is followed by a cluster of the closely spaced exons 4-13. This cluster of exons is again separated by a large intron of 11 kb from a second cluster of coding sequences including exons 14-21.
Even though intron sizes vary widely from 0.2 to about 30 kb they all have the canonical GT dinucleotide at their 5' and the AG dinucleotide at their 3' ends. Another important element for pre-mRNA splicing is the branch point, which is located 18-40 bp upstream of the 3' intron boundary in most known introns (for a recent review see Mattaj, 1990, and references therein). Therefore, we searched for the branch point consensus YNYYRAY (Y, pyrimidine; R, purine; N, any nucleotide) in the 3' parts of the introns. Matches with the consensus sequence within 18-40 bp upstream of the intron-exon boundary are listed in Table I. In four out of 20 introns we find a perfect match with the consensus; in 14 introns there is one, and in two introns there are two deviations from the consensus. These latter two introns (4 and 13, see Table I) possess additional consensus sequences with one mismatch 45 and 47 bp, respectively, upstream of the 3' end of the intron (data not shown). It remains to be determined which of these two alternative splice sites may be used during mRNA maturation. In introns 3, 6, and 7 additional branch point consensus sequences are found 16-17 bp upstream of the intron-exon boundary (Table I, underlined sequences). But, as shown for other genes (Reed and Maniatis, 1985), these sites are probably too close to the intron end to serve as branch points for splicing. More or less pronounced polypyrimidine tracts between the 3' intron-exon boundaries and the putative branch sites are present in all of the introns (Table I). Thus, in spite of the uncertainties concerning some branch points, we could identify the sequence elements needed for the splicing of a typical mammalian pre-mRNA in all introns of the topoisomerase I gene.
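A search of this kind can be sketched as a sliding-window comparison against the consensus YNYYRAY, counting deviations and keeping only candidates within the 18-40 bp window. This is an illustrative reconstruction, not the procedure actually used here. In particular, the distance is assumed to be the number of intron nucleotides downstream of the branch adenosine, and the test sequence is hypothetical.

```python
CONSENSUS = "YNYYRAY"
# Bases allowed by each consensus symbol (Y, pyrimidine; R, purine; N, any nucleotide).
ALLOWED = {"Y": "CT", "R": "AG", "N": "ACGT", "A": "A"}

def mismatches(heptamer):
    """Count positions at which a 7-mer deviates from the consensus."""
    return sum(base not in ALLOWED[symbol] for symbol, base in zip(CONSENSUS, heptamer))

def branch_point_candidates(intron_tail, min_dist=18, max_dist=40, max_mismatch=0):
    """Return (distance, heptamer, mismatches) for candidate branch sites.

    intron_tail is the 3' part of an intron, ending with its last nucleotide.
    Distance (assumption): intron nucleotides downstream of the branch A,
    which is the sixth position of the heptamer.
    """
    seq = intron_tail.upper()
    hits = []
    for start in range(len(seq) - len(CONSENSUS) + 1):
        heptamer = seq[start:start + len(CONSENSUS)]
        distance = len(seq) - (start + 6)
        n_mismatch = mismatches(heptamer)
        if min_dist <= distance <= max_dist and n_mismatch <= max_mismatch:
            hits.append((distance, heptamer, n_mismatch))
    return hits

# Hypothetical intron tail: a perfect match followed by a pyrimidine tract and AG.
tail = "GGGGG" + "CTCTGAC" + "T" * 20 + "AG"
print(branch_point_candidates(tail))
```

Raising `max_mismatch` to 1 or 2 lists the weaker candidates as well, which corresponds to the one and two deviations tolerated in the text.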
Introns in the hTOP1 locus occur in all phases relative to the reading frame (Table I) and introns with a given phase class are randomly distributed over the gene. Furthermore, only six out of the 21 exons are flanked by introns of the same phase class. These properties may be important for evolutionary considerations since it has been discussed earlier (Patthy, 1987) that a random distribution of intron phase classes and the presence of introns in different phases on both sides of an exon may exclude an exon-shuffling mechanism in the evolution of a gene.
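The phase of an intron can be derived from the number of coding nucleotides that precede it: phase 0 if the intron lies between two codons, phase 1 or 2 if it splits a codon after the first or second nucleotide. The sketch below computes phases from a list of coding lengths per exon and reports the internal exons flanked by introns of the same phase. The exon lengths are hypothetical, and for exon 1 only the translated part (from the ATG onward) is assumed to be counted.

```python
from itertools import accumulate

def intron_phases(coding_lengths):
    """Phase of each intron: coding nucleotides upstream of it, modulo 3.

    coding_lengths[k] is the number of translated nucleotides in exon k + 1.
    """
    return [total % 3 for total in accumulate(coding_lengths[:-1])]

def same_phase_exons(phases):
    """1-based numbers of internal exons whose two flanking introns share a phase."""
    return [i + 2 for i in range(len(phases) - 1) if phases[i] == phases[i + 1]]

# Hypothetical gene with five exons (translated nucleotides per exon).
lengths = [30, 24, 122, 57, 90]
phases = intron_phases(lengths)
print(phases)
print(same_phase_exons(phases))
```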
As noted before (D'Arpa et al., 1988), human topoisomerase I shares extended regions of sequence similarities with the type I topoisomerases from the yeasts Saccharomyces cerevisiae (Thrash et al., 1985) and Schizosaccharomyces pombe (Uemura et al., 1987). The S. cerevisiae TOP1 gene does not contain introns, whereas the S. pombe gene carries two small introns. They are located in the 5' part of the gene at positions similar to those of human introns 1 and 3. However, the phase classes of the human and the yeast introns are quite distinct. It is therefore unlikely that yeast and human introns could be remnants of the introns of a common primordial topoisomerase I gene (see Dibb and Newman, 1989, for a discussion of intron gain or loss during evolution).
Repetitive Elements-The hTOP1 locus contains many regions harboring members of the AluI family of small interspersed repetitive elements. Using the BLUR 8 probe in Southern blot experiments we found that AluI-like DNA elements are widely spread in the hTOP1 gene locus (Fig. 1). This may not be very surprising given the large size of the gene and the fact that there are several hundred thousand copies of this element dispersed over the entire genome (Weiner et al., 1986). On the other hand, long interspersed repetitive elements were detected in only one region downstream of exon 21 (Fig. 1).
Comparisons-Little is known about possible functional and structural domains in topoisomerase I. In this situation, a comparison of the amino acid sequences of the known eukaryotic type I topoisomerases may be useful since regions which remained highly conserved through evolution may perform essential enzymatic functions.
As mentioned above, the human enzyme shares extended regions of sequence similarities with the yeast topoisomerases. However, regions of amino acid similarities are not evenly distributed along the three polypeptide chains (Fig. 2).
An N-terminal region of about 200 amino acids of the human enzyme shows only weak homology to the yeast enzymes. It may be interesting to note that this section of the human topoisomerase contains nine variations of the sequence KHKD embedded in a highly hydrophilic domain with alternating basic and acidic amino acids. The coding sequences for the KHKD motif and its variations are clustered in exon 3 (four repeats) and exon 4 (five repeats) (Fig. 2).
A central region of approximately 440 amino acids (residues 200-640 in the human sequence) is highly conserved between the human and yeast enzymes (D'Arpa et al., 1988). It is followed on the C-terminal side by a second divergent region of 60 amino acids and, finally, by a C-terminal block of highly conserved amino acids (residues 714-760 of the human sequence). The latter block includes the tyrosine residue that forms the transient covalent linkage with the 3' end of the broken DNA strand during the enzyme's reaction cycle (Lynn et al., 1989).
A closer inspection of Fig. 2 reveals that the central conserved region (amino acids 200-640 in the human sequence) is itself composed of smaller individual blocks with high identity scores separated from each other by short stretches of less well conserved amino acids. It is conceivable that each one of the highly conserved regions represents a structural or functional domain of the enzyme.
It has been observed that functional or structural protein domains are frequently encoded by individual exons in higher eukaryotes (Gilbert, 1978; Blake, 1985; Go, 1985; Gilbert et al., 1986; Traut, 1988). Therefore, we analyzed whether or not the position of the intron-exon boundaries in the human topoisomerase gene corresponds to the less conserved spacers between the highly conserved blocks. For this purpose we have indicated in Fig. 2 the locations at which the 20 introns split the coding sequence of the human gene.

TABLE I Exons and introns of the human topoisomerase I gene
The numbers of the exons are given in the first and in the last column. The table should be read from the left to the right and then continuing with the next line. The first number in the first line is the distance (in bp) from the 5'-most transcriptional start site to the 3' end of exon 1. The last few nucleotides of each exon are given (3' junction of the exon); they are followed by: (i) the first six nucleotides of the intron (5' junction of intron); (ii) the number of the intron; (iii) a note whether the intron is in the triplet phase or off-phase by one or two nucleotides; (iv) the size of the intron; and (v) the complete 3' sequence of the intron between the assumed branch point consensus and the 5' junction of the next exon. Each line ends with the first few 5' nucleotides of the following exon, whose size (in base pairs) is given in the next line. The amino acids encoded by the given exon sequences are shown in the line above the appropriate nucleotide triplets. Note that the three-letter amino acid code is shared between two exons in case of codons which are split by intron sequences. Putative branch points conforming to the consensus YNYYRAY (Y, pyrimidine; R, purine; N, any nucleotide) (Green, 1986) were identified within a region of 18-40 bp upstream of the 3' intron-exon boundaries. Possible alternative branch sites are underlined.

FIG. 2. The parameters used are: K-tuple length 1; gap penalty 3; filtering level 2.5; diagonal window size 10. For the multiple alignment the gap penalty (fixed and varying) was set at 10. The asterisks denote identical amino acids in all three sequences. Stretches of homology, as defined by a MacVector™ protein analysis program, are underlined (scoring matrix pam 250; window 10 amino acids; minimal match score 75%; hash value 1). The underlined yeast sequences are more than 75% homologous to the human sequence; the underlined human sequence is more than 75% homologous to both yeast sequences.
Introns 1 to 7 as well as introns 18 and 19 reside in those regions of the gene which encode the divergent parts of the enzymes. Therefore, it should be more interesting to inspect the locations of introns in the large conserved region of the central part of topoisomerase I. As can be seen (Fig. 2) introns 10, 11, 12, 13, 17, and possibly 16 are located within divergent regions between the highly conserved blocks of amino acids encoded by exons 11, 12, 13, 18, and possibly 17. These highly conserved blocks may represent structural or functional domains of topoisomerase I. However, the only reasonably well established functional site is the enzymatically active tyrosine (amino acid 723, Lynn et al., 1989) which is contained in a conserved stretch of about 20 amino acids encoded by exon 20 (Fig. 2). This exon is bracketed on its 5' side by intron 19 which is located within the divergent part of the enzyme and on its 3' side by intron 20 in a region which is only conserved between the human and the S. cerevisiae enzyme (Fig. 2). Of course, whether or not these exons indeed encode individual enzyme domains cannot be decided before the three-dimensional structure of the protein is available.
Promoter-The sequence of a few hundred nucleotides upstream of exon 1 (Kunze et al., 1990) has some features which are typically found in promoter regions of constitutively expressed genes (Stout and Caskey, 1985): (i) the sequence is rich in GC base pairs (67%) with a high frequency of CpG dinucleotides; (ii) there are multiple, closely spaced transcriptional start sites located in a region 200-250 bp upstream of the translation initiation codon, ATG; (iii) the upstream region conspicuously lacks TATA box and CCAAT box elements, sequence motifs which are common constituents of promoters in regulated genes (Fig. 3).
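The first of these features, GC richness and a high frequency of CpG dinucleotides, can be quantified with two simple measures. The sketch below computes the GC fraction and an observed/expected CpG ratio. The ratio is a commonly used measure chosen here for illustration and was not necessarily the measure applied in this analysis; the test sequence is hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C residues in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_observed_expected(seq):
    """Observed/expected CpG ratio: count(CG) * length / (count(C) * count(G))."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)

# Hypothetical 8-nucleotide example, not taken from the hTOP1 promoter.
example = "CGCGAATT"
print(gc_fraction(example), cpg_observed_expected(example))
```

A ratio near or above 1 indicates that CpG dinucleotides are not depleted, as is typical of CpG-rich promoter regions.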
Instead, we find at least two potential binding sites for the transcription factor Sp1 (Kadonaga et al., 1986) as well as two variations of the octamer motif which may serve as binding sites for the octamer transcription factor 1 (Schreiber et al., 1989). A third common promoter element detected in the hTOP1 gene upstream region is the sequence TGACGTCG, a possible variation of a cAMP-responsive element (Roesler et al., 1988) (Fig. 3).
As an attempt to assess the significance of these sequence elements we have linked an upstream hTOP1 gene fragment to the coding region of the chloramphenicol acetyl transferase (CAT) reporter gene. The upstream region was then shortened by exonucleolytic degradation from the 5' end to construct truncated TOP1 gene promoter segments in front of the CAT reporter gene. The expression of these constructs was determined at 36 h after transfection into HeLa cells. In these cells, the CAT reporter gene was found to be fully expressed in the presence of 256 bp of the upstream hTOP1 sequence. Deletion of the putative octamer transcription factor 1 binding site at nucleotides -158 to -168 caused a reduction of the CAT gene expression to about 70% of the basal activity. An additional deletion of the second octamer transcription factor 1 binding motif, which deviates in the last position from the consensus ATGCAAT, and of the potential distal Sp1 site had only minor effects on promoter activity (reduction to about 57% of basal activity). However, an essentially complete inactivation of the promoter function was achieved by the additional deletion of the proximal Sp1 binding site and the cAMP-responsive element (Fig. 3).

FIG. 2, continued. The intron positions in the human sequence (numbered as in Fig. 1) are given by dashes: slanting dashes, intron phase 0, intron-exon boundary in front of first base of the codon for the amino acid written below; vertical dashes, intron phases 1 or 2, intron-exon boundary splits the codon of the amino acid written below. Variations of the KHKD motif occurring in the human topoisomerase I are indicated by brackets above the amino acid sequence.

FIG. 3. A, nucleotide sequence of a 360-bp segment around and upstream of the transcriptional start sites. The multiple transcriptional start sites (vertical arrows) were identified by nuclease S1 mapping and by primer extension methods as described in a previous communication (Kunze et al., 1990). We define the most 5'-located transcriptional start site as nucleotide +1. Potential regulatory transcriptional elements in the upstream region are underlined and discussed in the text. B, functional promoter elements. The upper line represents the upstream region of the human topoisomerase I (hTOP1) gene. The numbers give the nucleotide positions relative to the 5'-most transcriptional start site. The translation start codon, ATG, as well as some putative transcription factor binding sites (see above) are indicated. The XhoI site was used to link the hTOP1 gene promoter to the coding region of the CAT reporter gene. The extents of the 5' deletions, as determined by nucleotide sequencing, are shown below. The corresponding plasmids were transfected into human HeLa cells, and the CAT activity in cell extracts was determined 36 h after transfection. We used the method of Neumann et al. (1987) to measure CAT activity. The results are expressed in relative values on the right (mean of seven independent experiments).
In summary, a region of about 250 bp upstream of the transcriptional start sites in the hTOP1 gene is sufficient to drive the expression of a linked coding region in HeLa cells; furthermore, the promoter activity is modulated by distinct sequence elements which may be binding sites for known transcription factors. Currently, we are investigating this possibility in more detail to elucidate the molecular basis of hTOP1 gene regulation.
Conclusion-We have described the molecular organization of the human gene for type I DNA topoisomerase with its 21 exons and its promoter region. This work should provide a solid basis for the further exploration of the molecular genetics of this physiologically important enzyme. One interesting aspect is to study the regulation of this gene in resting and proliferating cells as well as after virus infection.
Another point of interest concerns the evolution of the enzyme. Topoisomerase I contains evolutionarily conserved regions interrupted by less well conserved parts. These latter parts may be spacers between functional or structural domains. We have demonstrated that many introns of the hTOP1 gene are located within these less conserved parts. But, clearly, further work on the structure of the enzyme is necessary to determine whether some of the exons of the hTOP1 gene encode individual protein domains. In addition, a comparison of the human gene with the intron-exon structure of topoisomerase I genes from other Metazoan species may be of interest.
The elucidation of the molecular structure of the human TOP1 gene may also be important in analyzing rearrangements of topoisomerase genes occurring during the establishment of drug resistance in cells treated with antitopoisomerase drugs (Tan et al., 1989).