Nucleotide Sequence of the Escherichia coli polA Gene and Primary Structure of DNA Polymerase I*

We report the nucleotide sequence of a 3.2 kilobase pair region of the Escherichia coli polA gene, comprising the coding region for DNA polymerase I with about 400 base pairs of flanking sequence. The amino acid sequence for DNA polymerase I derived from our DNA sequence is largely consistent with previous protein chemical data. In the following paper, Brown et al. (Brown, W. E., Stump, K. H., and Kelley, W. S. (1982) J. Biol. Chem. 257, 1965-1972) present additional protein chemistry experiments that further confirm our sequence.

A. 74, 5632-5636). We have located the site of this cleavage between residues 323 and 324 of the 928 amino acid polymerase molecule. By sequence comparison of the polA1 and wild type alleles, we have identified the polA1 mutation as a change from Trp (TGG) to amber (TAG) at residue 342.
DNA polymerase I of Escherichia coli is a multifunctional single subunit enzyme. Since the discovery of this enzyme in 1956, a large body of research (reviewed in Ref. 1) has elucidated the enzymology of its three catalytic activities (polymerization, 5'-3' and 3'-5' exonucleolytic digestion). Coordination of the polymerization and 3'-5' exonuclease activities allows error-free primed synthesis of DNA. Coordination of all three enzymatic activities results in nick translation in vitro, a model reaction for the enzyme's presumed in vivo function in excision repair, and the removal of RNA primers from Okazaki fragments during discontinuous replication. In contrast to this detailed enzymological description, our physical picture of the DNA polymerase I molecule is limited to the observation that the protein comprises two domains separable by mild proteolysis (2, 3), the smaller NH2-terminal domain containing the 5'-3' exonuclease activity, while the larger COOH-terminal domain carries the polymerase and 3'-5' exonuclease activities (4, 5). The lack of a more detailed physical description of the enzyme molecule can be attributed to technical difficulties resulting from both the large size of DNA polymerase I (about 100,000 daltons) and its relatively low abundance in cell extracts. The cloning of the DNA polymerase I structural gene (polA) onto phage λ (6, 7) has circumvented one of these problems, making it relatively easy to prepare the polymerase in sufficient quantities for X-ray crystallography and protein chemistry. Moreover, the cloned polA gene serves as a convenient source of DNA for sequencing.
In this paper, we report the nucleotide sequence of a 3.2 kb region of the polA gene, and the amino acid sequence of the DNA polymerase I protein. The following paper (8) describes protein chemical experiments that confirm our sequence.

* This work was supported by Grant GM-28550 from the National Institutes of Health. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

EXPERIMENTAL PROCEDURES
Preparation of DNA-Phage NM852 (λpolA att+ Nam7am53 cI+) was kindly provided by Dr. N. E. Murray, University of Edinburgh. Phage particles were purified by two cycles of CsCl density gradient centrifugation. DNA was released by treatment of the phage with sodium dodecyl sulfate, and dialyzed against 10 mM Tris-HCl, pH 8.0, 1 mM EDTA.
Plasmid DNA was prepared from cleared lysates of chloramphenicol-amplified cultures, by CsCl density gradient centrifugation in the presence of ethidium bromide (9).
Restriction Endonuclease Mapping-Restriction enzymes were purchased from New England Biolabs (Beverly, MA) or Bethesda Research Laboratories. HinfI was a gift from Dr. M. D. Rosa, Yale University. DNA from phage NM852 was digested with HindIII, treated with alkaline phosphatase (Boehringer), and 5' end labeled using [γ-32P]ATP (New England Nuclear) and T4 polynucleotide kinase (P-L Biochemicals). After cleavage with Sac I, the relevant fragments were isolated by preparative agarose gel electrophoresis followed by electroelution (10). The labeled polA-containing fragments were mapped using the partial digestion technique of Smith and Birnstiel (11).
DNA Sequencing-The preparation of end-labeled restriction fragments for sequencing was essentially as described by Maxam and Gilbert (12). Plasmid DNA was digested with an appropriate restriction enzyme and end labeled as described above. Labeled fragments were separated by electrophoresis on polyacrylamide gels, and eluted from the crushed gel slice with high salt buffer. Singly labeled DNA fragments were obtained either by digestion with a second restriction enzyme or by electrophoretic strand separation.
Fragments were sequenced using the partial chemical degradation technique of Maxam and Gilbert (12). An A > C reaction was routinely used in addition to the G, A + G, C, and C + T reactions. The chemically cleaved products were examined on thin sequencing gels (13) containing 20%, 8%, or 6% acrylamide. Sequence data were stored and analyzed using the computer programs developed by Staden (14).
12 different restriction enzymes under partial cleavage conditions, we were able to derive a restriction map that was sufficiently detailed to serve as a starting point for our DNA sequencing experiments. The data were checked by mapping each HindIII-Sac I fragment in the opposite direction, after labeling at the Sac I site. A restriction map derived in this manner should be reliable except in the immediate vicinity of the HindIII and Sac I sites. Our DNA sequencing results showed this to be the case.
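The cross-check between the two labeling directions can be illustrated with a short sketch. It assumes that each partial-digest band size equals the distance from the labeled end to a restriction site, so that sizes measured from the right-hand end convert to left-hand coordinates by subtraction from the fragment length. The function name `reconcile_maps`, the tolerance, and all numbers are hypothetical and are not taken from our data.

```python
def reconcile_maps(from_left, from_right, fragment_length, tolerance=20):
    """Pair restriction-site positions measured from each labeled end.

    from_left:  partial-digest band sizes (bp) with the label at the left end.
    from_right: band sizes (bp) with the label at the right end.
    Returns (matched, unmatched). `matched` holds the mean left-hand
    coordinate of each pair agreeing within `tolerance` bp; `unmatched`
    holds positions seen from one end only (in left-hand coordinates).
    """
    # Convert right-end measurements to left-end coordinates.
    right_as_left = sorted(fragment_length - s for s in from_right)
    matched, unmatched = [], []
    for pos in sorted(from_left):
        close = [r for r in right_as_left if abs(r - pos) <= tolerance]
        if close:
            nearest = min(close, key=lambda r: abs(r - pos))
            right_as_left.remove(nearest)
            matched.append((pos + nearest) / 2)
        else:
            unmatched.append(pos)
    return matched, unmatched + right_as_left


# Hypothetical 3000 bp fragment with three sites seen from both ends.
matched, unmatched = reconcile_maps([400, 1500, 2900], [2610, 1490, 95], 3000)
print(matched, unmatched)
```

Sites close to either labeled end give very short or very long partial products, which is one reason positions near the ends are expected to be the least reliable.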
Construction of a Small polA1 Plasmid-To facilitate the preparation of large quantities of DNA for sequencing, we decided to subclone the 5 kb HindIII fragment onto a small multicopy plasmid. Although the wild type polA gene cannot be stably maintained on a multicopy vector, the polA1 amber mutant can be propagated in this way (6). We therefore obtained a plasmid (kindly provided by Dr. N. E. Murray) carrying the 5 kb HindIII fragment from a polA1 mutant in the vector pBR313 (15), and recloned this fragment into the smaller vector pNG16 (16). Fig. 1 shows a simplified restriction map of the resulting plasmid, pCJ1. It is about 8.5 kb in size, conveniently small for the preparation of DNA fragments for sequencing.
DNA Sequencing-Fig. 2B shows the strategy used for sequencing the polA gene and surrounding regions, and the extent of sequences determined. More than 97% of the sequence was determined on both DNA strands and overlapping data were obtained for all restriction sites. Fig. 3 gives the nucleotide sequence of the region covered by Fig. 2; it comprises the DNA polymerase I coding region plus about 400 bp of untranslated sequence. Previous studies on λpolA-lacZ fusions (17) indicated that the polA promoter is close to the Bgl II site and that the direction of transcription is as shown in Fig. 1. Consistent with these results, we located the start of the coding region 100 bp downstream from the Bgl II site by scanning our translated DNA sequence for the previously determined NH2-terminal amino acid sequence (5). As shown in Table I, there is total agreement between our predicted sequence and the experimental results through six cycles of Edman degradation.
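The scan of the translated sequence for a known NH2-terminal peptide can be sketched as follows. This is a minimal illustration and not the Staden programs used in this work: the function names are hypothetical, and the toy DNA string is invented so that it encodes Met-Val-Gln-Ile-Pro-Gln; it is not the polA sequence of Fig. 3.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate `dna` from offset `frame`; stop codons are written as '*'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' marks ambiguous codons
        for i in range(frame, len(dna) - 2, 3)
    )


def find_peptide(dna, peptide):
    """Return (frame, nucleotide offset) of the first match, or None."""
    for frame in range(3):
        pos = translate(dna, frame).find(peptide)
        if pos != -1:
            return frame, frame + 3 * pos
    return None


# Hypothetical test sequence: 5 leader bases, six codons, one stop codon.
toy = "CGGAT" + "ATGGTTCAGATCCCACAA" + "TAA"
print(find_peptide(toy, "MVQIPQ"))  # (2, 5)
```

Once the frame is fixed in this way, the same `translate` call exposes an in-frame stop such as the amber codon (TAG) as a `*` in the protein string.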
Having identified the start of the coding sequence, we observed an open translational reading frame extending for a total of 928 codons from the initiator ATG, interrupted only by a single amber codon at residue 342. To confirm that this amber codon was, as we suspected, the site of the polA1 mutation, we sequenced the corresponding Xho I fragment from phage NM852 (λpolA+). Comparison of the two sequences showed that the polA1 mutation corresponds to a change from TGG (Trp) to TAG (amber) at residue 342. We have sequenced all the polA-containing Xho I fragments from NM852, a total of 830 bp (Fig. 2C), and have found the wild type sequence over this region to be identical with that derived from pCJ1 (polA1), aside from the single amber codon. This observation suggests that, although derived by heavy nitrosoguanidine mutagenesis (18), polA1 is a single point mutation, consistent with previous genetic data (19). (A further argument against gross chromosomal rearrangements in the polA1 fragment cloned on pCJ1 is provided by the good correspondence between the experimentally determined restriction map (from phage NM852) and that derived from the DNA sequence of pCJ1.) On the assumption that there are no further differences between the wild type and mutant sequences, we have derived the amino acid sequence of DNA polymerase I shown in Fig. 3.

DISCUSSION
We have determined the DNA sequence of the coding region of the polA gene and from this we have deduced the amino acid sequence of DNA polymerase I. Since almost the entire sequence was obtained by sequencing both DNA strands, we are confident that it is correct. We have also examined our sequence to see whether it is consistent with data obtained for the DNA polymerase I protein molecule.
Comparison with Protein Chemical Data-As shown in Table I, our sequence agrees with the previously determined NH2-terminal sequence (5) through six cycles of Edman degradation. The divergence of the two sequences after the sixth Nucleotide sequence data spanning this region is shown in Fig. 4. A common error in DNA sequencing is the insertion or omission of a single nucleotide, resulting in a change in the reading frame of a translated gene product and probable misassignment of the COOH terminus of the protein. In the following paper, Brown et al. (8) present several lines of evidence supporting our identification of the COOH terminus of the DNA polymerase I molecule. The most direct evidence is the sequential release of histidine, alanine, and glutamine by carboxypeptidase digestion of DNA polymerase I or its large proteolytic fragment. Confirmatory data is provided by the isolation of the predicted COOH-terminal tryptic peptide, and by chemical cleavage experiments which position a cysteine residue extremely close to the COOH terminus.
The amino acid compositions and molecular weights that we predict for DNA polymerase I and its two proteolytic fragments (Table II) are in good agreement with experimentally determined values (5, 8, 20, 21). The single divergence between our data and earlier work is in the number of cysteines. Previous work indicated that a molecule of the polymerase contains three half-cystine residues, two of which are involved in disulfide bond formation (20), and that all three half-cystines are present in the large proteolytic fragment (21). By contrast, our DNA sequence shows only two cysteine codons, consistent with the composition data of Brown et al. (8). The actual location of the cysteines, at positions 262 (small fragment) and 907 (large fragment), is confirmed by chemical cleavage experiments described in the following paper.
We should point out that the protein chemical data reported by the Kornberg and Klenow (1, 5) groups were obtained using DNA polymerase I from E. coli B, whereas the cloned polA gene which we have sequenced, and which Brown et al. (8) have used to prepare homogeneous polymerase, was derived from E. coli K12. However, we do not find this a convincing reason for the discrepancies discussed above, since our preliminary DNA sequence studies on the resA1 and re& mutant alleles of polA, which were derived from E. coli B (22), have revealed no changes from the K12 sequence.
The polA1 mutation lies just within the NH2-terminal region of the large proteolytic fragment (residue 324), suggesting that, in the absence of degradation, the polA1 amber fragment would be only slightly larger than the small proteolytic fragment. The identification in a polA1 cell extract of a polypeptide indistinguishable in size and enzymatic activity from the small proteolytic fragment (4) is consistent with our assignment of the mutation site, and suggests that the polA1 amber fragment is indeed relatively resistant to proteolysis.

Table I (fragment). NH2-terminal sequence predicted from the DNA sequence: Met-Val-Gln-Ile-Pro-Gln-Asn-Pro-Leu-Ile

Codon Usage-Table III details the codon frequencies in the E. coli polA gene. The observed distribution is similar to that found for other sequenced E. coli genes (tabulated in Ref. 23), and supports the notion that E. coli genes have characteristic codon preferences, distinct from phage- and transposon-encoded genes that are translated in E. coli (23, discussed in Ref. 24). Examples of the biased codon distribution are the preference for CUG (Leu) and CGY (Arg) and the avoidance of AUA (Ile). As discussed by Post et al. (25),

Flanking Sequences of the polA Gene-Fig. 3 shows approximately 300 bp upstream and 100 bp downstream of the polA coding sequence. It is reasonable to suppose that these regions contain the sequences that control initiation and termination of transcription of the polA gene.
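A codon tabulation of the kind summarized in Table III can be sketched in a few lines. The function names and the toy coding sequence below are hypothetical and serve only to show the counting; they do not reproduce the polA data.

```python
from collections import Counter

LEU_CODONS = ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"]


def codon_usage(cds):
    """Count codons in a coding sequence read in frame from its first base."""
    rna = cds.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    return Counter(rna[i:i + 3] for i in range(0, usable, 3))


def fraction(counts, codon, synonyms):
    """Share of `codon` among all occurrences of its synonymous codons."""
    total = sum(counts[c] for c in synonyms)
    return counts[codon] / total if total else 0.0


# Hypothetical coding sequence: AUG CUG CUG CUU AUU AUC CGU CGC
toy_cds = "ATGCTGCTGCTTATTATCCGTCGC"
counts = codon_usage(toy_cds)
print(counts["CUG"], counts["AUA"], fraction(counts, "CUG", LEU_CODONS))
```

Expressing each count as a fraction of its synonymous family, rather than as a raw total, makes a bias such as the preference for CUG comparable between genes of different length and amino acid composition.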
Sequencing and biochemical studies of promoters recognized by E. coli RNA polymerase (reviewed in Refs. 26 and 27) indicate that they contain a highly conserved region about 10 bp upstream of the site of transcription initiation (the Pribnow box, prototypically TATAATPu), and a somewhat less well conserved "polymerase binding site" about 35 bp upstream (the -35 region). Examination of the DNA sequence preceding the polA coding region reveals many sites having some homology with the prototype bacterial promoter sequence. The abundance of sites is probably a consequence of the high A+T content of this region (about 70% between positions -50 and -300 relative to the translational start). On the basis of sequence homology, we feel that the most plausible promoter sequence for the polA gene involves one of the Pribnow boxes at -28 to -22 (CATAATC) or -150 to -144 (AATAATT) (Fig. 3). However, in neither case is there convincing homology at the -35 region.
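The homology search described above amounts to sliding the Pribnow consensus along the sequence and counting matching positions. The sketch below assumes a simple match count against the hexamer TATAAT with an arbitrary threshold of five matches; the function names and the short test string (built around the CATAATC candidate) are illustrative only.

```python
def at_content(seq):
    """Fraction of A and T residues in `seq`."""
    seq = seq.upper()
    return sum(seq.count(b) for b in "AT") / len(seq) if seq else 0.0


def pribnow_candidates(seq, consensus="TATAAT", min_matches=5):
    """Return (offset, hexamer, matches) for windows resembling the consensus."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(consensus) + 1):
        window = seq[i:i + len(consensus)]
        matches = sum(a == b for a, b in zip(window, consensus))
        if matches >= min_matches:
            hits.append((i, window, matches))
    return hits


# Hypothetical context around the CATAATC candidate.
print(pribnow_candidates("GGCCATAATCGG"))  # [(3, 'CATAAT', 5)]
print(at_content("AATTGC"))
```

Because the consensus itself consists only of A and T, the number of windows passing a fixed threshold rises steeply with A+T content, which is why a match count alone cannot single out the functional promoter in a region of this composition.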
Our inability to locate the polA promoter merely by sequence comparison may not be surprising. The majority of the promoter sequences used in deriving a consensus sequence belong to phage genes, stable RNA genes or genes for abundant bacterial proteins. Many of these genes are subject to additional forms of regulation. By contrast, polA is expressed at a relatively low level (about 400 molecules/cell, Ref. 1) and does not appear to be regulated (17). The only other low level, constitutively expressed bacterial gene whose promoter has been sequenced (lacI, Ref. 28) also shows poor homology with the consensus sequence. However, the lacI promoter may not be a suitable model for polA, since the level of expression of lacI is about 30-fold lower (17, 28). An additional factor in the low level of expression of the polA gene may be poor initiation of translation of the polA transcript. Potential base pairing with the 3'-end of 16 S ribosomal RNA in the initiation complex (29) is limited to the 3-base sequence GGA (-7 to -5, Fig. 3).
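The extent of potential pairing with 16 S rRNA can be estimated by looking for the longest stretch upstream of the initiator codon that is complementary to the rRNA 3' end. The sketch below assumes the 3'-terminal sequence 5'-ACCUCCUUA-3' and considers only contiguous Watson-Crick pairs (no G-U pairs); the function name and the upstream test strings are hypothetical.

```python
# Reverse complement of the 16 S rRNA 3' end (5'-ACCUCCUUA-3'), written as DNA.
# Any stretch of mRNA identical to part of this string can pair with the rRNA.
SD_TARGET = "TAAGGAGGT"


def longest_sd_match(upstream, target=SD_TARGET):
    """Longest contiguous stretch of `upstream` that occurs in `target`."""
    upstream = upstream.upper()
    best = ""
    for i in range(len(upstream)):
        # Only try substrings longer than the best match found so far.
        for j in range(i + len(best) + 1, len(upstream) + 1):
            if upstream[i:j] in target:
                best = upstream[i:j]
            else:
                break  # a longer substring from i cannot match either
    return best


# Hypothetical upstream regions: a weak site and a full-length site.
print(longest_sd_match("TTCGGACTATC"))    # GGA
print(longest_sd_match("CCTAAGGAGGTCC"))  # TAAGGAGGT
```

A three-base match such as GGA is close to what a random sequence of this length would give, which is the sense in which the pairing is described as limited.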
On the E. coli genetic map (30) the ribosomal RNA operon, rrnA, immediately precedes polA. We have determined the DNA sequence upstream from the polA gene as far as the HindIII site used in constructing the polA1 clone (about N o bp to the left of the region shown in Fig. 2). We have found no sequences corresponding to ribosomal RNA sequences. At present, we do not know whether this region constitutes a truly "silent" portion of the E. coli genome.
We have identified the next gene downstream from polA as the gene for a small RNA (spot 42) characterized and sequenced by Sahagan and Dahlberg (31). Transcription of this RNA is initiated about 150 bp beyond the end of the polA coding sequence, suggesting that the polA transcript must terminate within the last 100 nucleotides shown in Fig. 3. Since this region contains no obvious dyad symmetries analogous to previously characterized ρ-independent terminators (26), we expect that transcription termination may be ρ-dependent. Further experiments are in progress to locate the precise sequences involved in initiation and termination of the polA transcript, and thus to define the boundaries of the polA gene.
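A search for dyad symmetry of the kind associated with ρ-independent terminators can be sketched as a scan for perfect inverted repeats separated by a short loop. The stem and loop limits below are arbitrary assumptions, the function names are hypothetical, and the test string is an invented hairpin rather than any part of the polA region.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_dyads(seq, min_stem=5, min_loop=3, max_loop=8):
    """Return (start, stem, loop length) for each perfect inverted repeat."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * min_stem - min_loop + 1):
        left = seq[i:i + min_stem]
        partner = reverse_complement(left)
        for loop in range(min_loop, max_loop + 1):
            j = i + min_stem + loop  # start of the candidate right arm
            if seq[j:j + min_stem] == partner:
                hits.append((i, left, loop))
    return hits


# Hypothetical hairpin: stem GCCCGC, loop TTTT, stem GCGGGC, then a T run.
toy = "AA" + "GCCCGC" + "TTTT" + "GCGGGC" + "TTTTTT"
print(find_dyads(toy))
print(find_dyads("ACACACACACACACACACAC"))  # []
```

A single hairpin is reported several times, once for each stem-length window that fits inside it, so overlapping hits should be merged before counting candidate structures. A fuller search would also allow mismatches in the stem and would check for the run of T residues that typically follows the dyad.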
Evolution of the polA Gene-The fact that the DNA polymerase I molecule contains two separable domains, each having DNA-binding and nucleolytic activities, together with the approximate 2:1 ratio of the sizes of the large and small proteolytic fragments, suggests that the polA gene may have resulted from the multiplication (probably triplication) of the gene for a primitive DNA-binding protein. Even after subsequent sequence divergence, such an event might be expected to leave traces of homology between domains when examined at the DNA sequence level. Computer analysis of our sequence has revealed no evidence for such homology. The secondary structure prediction of Brown et al. (8) is also at odds with a triplicated structure for DNA polymerase I, since similar features of super secondary structure do not appear at regular intervals throughout the protein. We therefore believe that the polA gene did not arise from a gene multiplication of the type discussed above, and feel that similarities between active sites on the enzyme may only be apparent at the tertiary structure level.