Nucleotide Sequences from the Adenovirus-2 Genome*

The sequence of 15,441 nucleotides from the adenovirus-2 genome has been determined and includes the regions between coordinates 0-32% and 89-100%. These regions contain the early (E) transcription units E1A, E1B, E2B, and E4, the genes for polypeptides IVa2 and IX, the COOH terminus of fiber polypeptide, as well as the two virus-associated RNAs and the leader sequences for the major late mRNAs. precursor positioned between 28.9 and 23.5

Adenovirus-2 (Ad2) and its close relative Ad5 have been used extensively as model systems with which to study cellular transformation and eukaryotic gene expression. It was in this system that RNA splicing was discovered (2, 3), and there has been widespread interest in the transcription, translation, and replication of adenoviruses (4).
The genome of Ad2 is approximately 36,000 base pairs in length. When isolated from virions, the 5'-end of each DNA strand is covalently linked to a protein with an observed M_r ~55,000 (5). This 55K polypeptide results from the processing of an 87K precursor, which is found at the 5'-ends of replicating DNA molecules in the nucleus (97).
During lytic infection, two distinct phases of transcription are recognized: an early phase, which precedes DNA replication, and a late phase, which accompanies the onset of DNA replication and continues throughout the infection. Four distinct regions of the Ad2 genome are transcribed during the early phase of Ad2 infection (Fig. 1). Early region 1 (1.3 to 11.1%) is of particular interest since DNA fragments which contain this region are both necessary and sufficient to produce the transformed phenotype (1). Early experiments indicated that transcription and translation of one of the products of this region control the expression of the other early regions (6, 7). This regulatory factor has now been shown to be the product of the 13 S mRNA from the E1A region (8). Although the initial nuclear transcripts from each of these early regions appear to be simple linear copies of the genome, they are subsequently processed into numerous cytoplasmic mRNA species. For instance, in early region 4, no fewer than seven distinct mRNA species have been observed (9, 10).
Transcription from the so-called "major late promoter," at coordinate 16.5, shows an interesting switch between early and late times. In both cases, the initial transcript is a linear representation of the genome. At early times, this transcript terminates predominantly around coordinate 39 (9, 11, 12, 89, 98); however, at late times, transcription continues beyond this point and can reach almost to the right-hand terminus of the genome (13-15). Transcripts from this promoter, whether early or late, are all processed in a similar manner. Short leader segments, encoded at coordinates 16.6, 19.6, and 26.5, become joined to give an untranslated leader of 203 nucleotides (16, 17). This tripartite leader then becomes joined to a variety of main bodies, encoded further downstream, leading to a collection of at least 12 different late mRNAs. Electron microscopic studies have shown that these mRNAs fall into 3'-co-terminal families with poly(A) addition sites at positions 39.0, 49.5, 61.5, 78.3, and 91.5 (see Fig. 1) (14, 18-20).
In this paper, we describe the sequence of 15,441 nucleotides, which contains the left-most 32% of the Ad2 genome together with the right-most 11% of the genome.

MATERIALS AND METHODS
Materials-Ad2 DNA was prepared as described previously (42) from purified virus grown in suspension cultures of HeLa or KB cells. The original viral stock was obtained from U. Pettersson (University of Uppsala, Sweden) in 1972 and has been maintained independently since then. For this project, an initial stock of virus was prepared, and the viral DNA used as template was never more than two passages away from this initial stock. The M13 derivatives mp7, mp8, and mp9 were obtained from J. Messing (University of Minnesota), and DNA, in both replicative and single-stranded forms, was prepared essentially as described previously (43, 44). 32P-labeled deoxyribonucleoside triphosphates were purchased from either New England Nuclear or Amersham (α-labeled, 300-600 Ci/mmol; γ-labeled, 1000-3000 Ci/mmol). Dideoxynucleoside triphosphates were obtained from Collaborative Research or P-L Biochemicals. Synthetic primers complementary to M13 DNA were obtained from New England Biolabs or Collaborative Research.
Restriction endonucleases were either isolated following published procedures or purchased from New England Biolabs or Bethesda Research Laboratories. Polynucleotide kinase and the Klenow fragment of DNA polymerase I were obtained from Boehringer Mannheim or New England Biolabs. Polynucleotide kinase and calf intestinal alkaline phosphatase were also gifts from Dr. G. Chaconas (Cold Spring Harbor Laboratory). Exonuclease III was purchased from New England Biolabs or BRL. T7-infected Escherichia coli cells were a gift from Dr. W. Studier (Brookhaven National Laboratory) and T7 exonuclease was isolated as described (45). Initial samples of T7 exonuclease were also the generous gifts of Dr. J. Dunn (Brookhaven National Laboratory) and Dr. P. D. Sadowski (University of Toronto).
DNA Sequence Analysis-The chain termination procedure (46) was used, initially employing templates prepared from intact Ad2 DNA by treatment with exonuclease III (47) or T7 exonuclease (48). In early experiments, Ad2 DNA was treated with alkali (49) to remove the remnants of the terminal protein before digestion with T7 exonuclease; however, it was later found that alkali treatment was unnecessary. Primers were generated by redigestion of individual large restriction fragments of Ad2 DNA with other restriction endonucleases (HaeIII, HpaII, etc.). Digests were fractionated on polyacrylamide gels (5-10%) and individual bands were eluted by diffusion into 0.5 M NaCl, 10 mM Tris-HCl, pH 7.9, 1 mM Na2-EDTA. Occasionally, primers were further purified by passage over small (1.0-ml) columns of DEAE-cellulose. The fragments were loaded onto these columns in 0.1 M NaCl and eluted with 0.7 M NaCl. The DNA was precipitated with EtOH and used as primer for the DNA sequencing reaction. Usually, all primers were treated with exonuclease III before use (48).
Templates were also prepared from M13 clones which contained small restriction fragments inserted into the vectors mp7, mp8, or mp9 (50). In this case, synthetic primers (TCCCAGTCACGACGT or TCACGACGTTGG) were used, complementary to the M13 sequence immediately adjacent to the site of insertion. Occasionally, internal primers were prepared by restriction endonuclease cleavage of replicative form DNA from the clone.
Computer Programs-All computer analysis was performed on either a Digital Equipment Corporation PDP 11/44 or a PDP 11/60. Primary data were stored and overlaps were established using the program ASSEMBLER (54). Analysis was carried out using additional programs described elsewhere (55-57).
Sequence Strategy-Two principal methods were employed for the collection of sequence data. Initially, intact Ad2 DNA was treated with either exonuclease III or T7 exonuclease, and the resulting single strands were used as templates for the chain termination procedure. Primers were obtained in a two-step procedure by redigestion of individual, large restriction fragments. For instance, the left-hand terminal HpaI fragment of Ad2 (0-4.3%) was redigested with HpaII, HaeIII, or MboI, and suitable small fragments were purified and used to prime synthesis on templates prepared either by exonuclease III digestion to obtain r-strand sequences, or by T7 exonuclease digestion to yield l-strand sequences. This method worked well for sequences lying between coordinates 0-17% at the left end and 89-100% at the right end. Greater than 90% of these sequences were obtained from both strands. However, for sequences lying between coordinates 17 and 32%, at the left end, it proved difficult to obtain unambiguous data consistently. Part of the reason for this may lie in the increased possibility of secondary structure formation within the template as the single-stranded tails grow larger. For this reason, the M13 cloning-sequencing system was employed as an alternative means of gathering data from this region. Either HindIII-B (17-31%) or suitable subfragments of HindIII-B (e.g. XhoI-G, 22-26%) were recut with HpaII, MboI, or TaqI, and the resulting fragments were cloned into an M13 phage vector. Initially, mixed digests were used and the resulting clones were identified by sequence analysis, a strategy which allowed most of the sequence of this region to be deduced unambiguously. As soon as the main features of the sequence became apparent, a number of specific fragments from these digests, which were not obtained from the original shotgun approach, were purified and cloned on an individual basis.
In this way, it was possible to obtain a complete set of clones spanning both strands of this portion of the genome. A summary of all M13 clones obtained is presented in Table I. The sequences derived from this work are presented in Figs. 2 and 3.

RESULTS AND DISCUSSION
Sequence Heterogeneity-During the accumulation of sequence data, two regions of the genome caused particular problems due to limited sequence heterogeneity in the DNA from our viral stocks. The first of these occurred at the extreme right end of the genome, between bases 1580 and 1594 (numbered leftward from coordinate 100%) (95.5%), where a run of A residues was encountered on the r-strand which gave readable sequence preceding it, but unreadable sequence following it. Analysis of the complementary strand showed a similar run of T residues which gave a readable sequence preceding it, but an unreadable sequence following it. This sequence lay within the restriction fragment BglII-L. BglII-L was end-labeled using polynucleotide kinase and recut with MboII, which cuts eight nucleotides downstream from the MboII recognition sequence, GAAGA. The two MboII sites are separated by four nucleotides and lie in the same orientation so that cleavage directed by the left-hand site destroys the right-hand site (see Fig. 4). This gave rise to two fragments from the right end of BglII-L, both of which contain the heterogeneous A residues. These fragments were separated and then subjected to chemical sequencing reactions. Samples of the intact fragments were run alongside for comparison (Fig. 4). It is apparent that the "intact" fragments are a mixture of at least three distinct lengths, which differ according to the number of A residues present. M13 clones containing either strand of BglII-L were also prepared and individual clones containing 13, 14, 15, and 16 A residues were obtained. A total of 14 A residues has been placed in the sequence at this position since this appears to represent the peak distribution as judged both from the data of Fig. 4 and from analysis of the clones isolated. In other, independent stocks of Ad2 that we have examined, there is also variability in the number of A residues at this position. These other preparations included one derived from an Ad2 stock, which had recently been plaque-purified, indicating that the heterogeneity arises rapidly during growth. However, the sequence does not become completely heterogeneous, but rather, it oscillates within fairly close limits, suggesting that there may be a requirement for a minimum of 13 A residues at this position and that accumulation of more than 16 residues is perhaps deleterious. These residues lie in the intergenic region between URFs 18 and 19 and so do not affect any coding sequences (see Fig. 10).

FIG. 4. Heterogeneous A residues around coordinate 95.5%. BglII-L was end-labeled using [γ-32P]ATP and polynucleotide kinase and then recleaved with MboII. Two fragments were obtained that contain the right end of BglII-L because the close proximity of the two MboII sites leads to interference in cleavage (see text). From the sequence reaction of the larger fragment, it can be seen that an unreadable sequence results beyond the region of heterogeneity. Three distinct fragment lengths are apparent in each case.
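The interference between the two closely spaced MboII sites can be illustrated with a short sketch. It assumes only what the text states (recognition sequence GAAGA, cleavage eight nucleotides downstream, two sites in the same orientation separated by four nucleotides). The demonstration sequence and the function names are hypothetical, only the top strand is considered, and the actual Ad2 sequence is not reproduced here.

```python
import re

SITE = "GAAGA"   # MboII recognition sequence, as given in the text
CUT_OFFSET = 8   # cleavage occurs this many nucleotides downstream of the site

def mboii_sites(seq):
    """0-based start positions of every GAAGA on the given strand."""
    # A lookahead is used so that overlapping occurrences are not missed.
    return [m.start() for m in re.finditer("(?=" + SITE + ")", seq)]

def cut_after(site_start):
    """0-based index of the last nucleotide to the left of the cut."""
    return site_start + len(SITE) - 1 + CUT_OFFSET

def destroyed_sites(seq):
    """Pairs (cutting_site, destroyed_site): the cut from one site falls inside another."""
    sites = mboii_sites(seq)
    pairs = []
    for s in sites:
        c = cut_after(s)
        for t in sites:
            # The cut lies between index c and c + 1; it splits site t
            # when both of those indices belong to the site.
            if t != s and t <= c < t + len(SITE) - 1:
                pairs.append((s, t))
    return pairs

# Hypothetical sequence: two sites, same orientation, separated by 4 nt.
demo = "CC" + "GAAGA" + "TCTC" + "GAAGA" + "TTTTTTTTTTTT"
print(mboii_sites(demo))      # positions of the two sites
print(destroyed_sites(demo))  # the left-hand site's cut splits the right-hand site
```

With this spacing, the cut directed by the left-hand site falls inside the right-hand recognition sequence, whereas the cut directed by the right-hand site leaves the left-hand site intact. That asymmetry is consistent with the two fragments described above.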
The second region in which we observed heterogeneity in the sequence occurs between nucleotides 9382 and 9396 (25.7%), which lies within the coding region for the Ad2 terminal protein. Here, the heterogeneity was again apparent from sequencing gels, which contained a readable sequence followed by an unreadable sequence. The problem was identified by isolating M13 clones containing the region and showing that two different kinds of clones were obtained. In one, a sequence on the l-strand contained six repeats of the trinucleotide GAA, while in the other series, only five repeats were present. This sequence lies within the terminal protein gene and encodes a run of glutamic acid residues. They, in turn, lie within a tryptic peptide that contains a methionine residue downstream from them. Analysis of the methionine-containing tryptic peptides of the terminal protein (58) revealed the presence of tryptic peptides which contained methionines at positions 18 and 19. Since this is the only tryptic peptide from the terminal protein which contains a methionine at either of these positions, this result is interpreted to mean that the heterogeneity which we have observed at the nucleotide level carries through to the polypeptide level.
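The way the two clone types can be told apart is easy to sketch: count the longest tandem run of GAA in each clone's sequence. The flanking bases below are hypothetical placeholders, not Ad2 sequence, and the function name is an illustrative choice.

```python
import re

def longest_tandem_run(seq, unit):
    """Largest number of consecutive copies of `unit` found in `seq`."""
    runs = re.findall("(?:" + re.escape(unit) + ")+", seq)
    return max((len(r) // len(unit) for r in runs), default=0)

# Hypothetical clone sequences: only the repeat count differs.
clone_a = "ATCG" + "GAA" * 6 + "TGC"
clone_b = "ATCG" + "GAA" * 5 + "TGC"

for name, seq in (("clone A", clone_a), ("clone B", clone_b)):
    n = longest_tandem_run(seq, "GAA")
    # GAA is a glutamic acid codon, so n repeats read in frame give n Glu residues.
    print(name, n, "GAA repeats ->", "E" * n)
```

Because the two forms differ by exactly one codon, the reading frame downstream is unchanged and any residue following the repeat shifts by one position in the peptide. This agrees with the methionines found at positions 18 and 19.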
Coordinate System-Most of the landmarks on the Ad2 genome have previously been defined as per cent coordinates with the ends of the genome providing reference points. This was both convenient and necessary in the absence of detailed sequence information and led to reasonable consistency between coordinates derived by electron microscopy when compared to restriction fragment mapping. As a result of the present work and the sequences derived by Galibert's group (34-37), the complete sequence of more than 60% of the Ad2 genome is now available. It is therefore possible to provide more accurate coordinates based on these sequences. To this end, we have compared the coordinates derived by these other methods with those derived from the sequence. First, it was necessary to define a value for 1% and this has been done by comparing restriction enzyme site coordinates predicted from the sequence using suitable values for 1%, with the values determined by direct physical measurement. This comparison is shown in Tables II and III. It will be noted that a value of 365 nucleotides for 1% gives the best fit for the left end, whereas a lower value of 357 nucleotides for 1% provides the best fit at the right end. This value is higher than the 351 value proposed previously (36). From the standard deviations obtained in each case, the accuracy of the mapping is rather similar at the two ends, suggesting that there may be a systematic difference which should be taken into account when trying to correlate coordinates derived by physical methods with those from sequence data. One possible explanation is suggested by comparing the base composition of the two ends, 57.8% G + C at the left end (0-33%), 49.2% G + C at the right end (70-100%). When the standard Ad2 restriction map is replotted using these new coordinates for the left and right ends, but retaining the original coordinates in the middle of the genome, no gross problems arise. The resulting map is shown in Fig. 5.
Throughout this manuscript, features are located both by absolute position in nucleotides as well as by genome coordinate based on 365 nucleotides for 1% at the left end and 357 nucleotides for 1% at the right end.
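The conversion between absolute nucleotide position and genome coordinate can be written out as a short sketch. The two scale factors are those given in the text; the function names are illustrative, and right-end positions are assumed to be numbered leftward from coordinate 100%, as in the Sequence Heterogeneity section.

```python
LEFT_NT_PER_PCT = 365    # nucleotides per 1% at the left end (from the text)
RIGHT_NT_PER_PCT = 357   # nucleotides per 1% at the right end (from the text)

def left_coordinate(nucleotide):
    """Genome coordinate (%) of a position numbered from the left terminus."""
    return nucleotide / LEFT_NT_PER_PCT

def right_coordinate(nt_from_right_end):
    """Genome coordinate (%) of a position numbered leftward from 100%."""
    return 100.0 - nt_from_right_end / RIGHT_NT_PER_PCT

# Positions quoted in the text, for comparison with the printed coordinates.
for nt in (498, 1630, 3589):
    print(nt, round(left_coordinate(nt), 2))
print(1580, round(right_coordinate(1580), 1))
```

For nucleotides 498, 1630, and 3589, this reproduces the coordinates 1.36, 4.47, and 9.83% quoted for the E1A start, the E1A terminus, and the E1B acceptor site. For the run of A residues 1580 nucleotides from the right end, it gives a value close to 95.5%.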
Early Region I (Genome Coordinates 0-11.1%)-This region of the genome is expressed early during lytic infection. It contains sequences which are both necessary and sufficient for transformation of rodent cells in vitro (1) and appears to play some regulatory role in determining the level of expression of other Ad2 early regions (6-8). As a consequence, it has been studied extensively and the complete sequence of this region is available for the closely related virus, Ad5 (60-63), and the more distantly related, Ad12 (64). Partial sequence information is available for two other related viruses, Ad7 and Ad3 (65-67). The regulation of expression of this and other transcriptional units of Ad2 is complex and is discussed in detail elsewhere (59).
Two independent transcription units are contained within this region, one, under the control of a promoter at coordinate 1.3%, giving rise to the E1A mRNAs, and another, at coordinate 4.6%, giving rise to the E1B mRNAs (24, 68). In both cases, an initial linear transcript is converted into several different cytoplasmic mRNAs by splicing. Data are available for several cDNA clones of these mRNAs, and the precise locations of the start and stop points of these transcripts have been defined (38,40).

Calculated values for Ad2 restriction sites ( 0 -3 s )
Various values for 1% were used to calculate the positions of restriction sites, previously mapped on the Ad2 genome (4). The fit is expressed as either the standard deviation or a variance, which is the sum of the differences between calculated and observed coordinates.

Three mRNAs have been characterized from early region 1A and are termed 9 S, 12 S, and 13 S. These mRNAs all begin at nucleotide 498 (1.36%) (24) and terminate at nucleotide 1630 (4.47%) (40). Cloned copies of both the 12 S and 13 S mRNAs have been obtained (40), and combining the sequence data around the splice points obtained for those clones with the present genomic sequence gives values of M_r = 26,500 and 31,900 for the polypeptides expected to be translated from them (Fig. 6). These molecular weight calculations are based on the assumption, used throughout this manuscript, that translation begins at the first AUG in the reading frame. The predicted values are considerably less than the values of 40,000-60,000 observed by in vitro translation of these mRNAs (Ref. 87 and references therein). In the case of the 9 S RNA, which is detected only after the onset of viral DNA replication (69), the donor splice point is estimated to lie around genome coordinate 1.7, based upon electron microscopy (9), and one acceptable donor splice site, located at nucleotide 611 (1.67%), is present in the sequence. Joining of this sequence to the common acceptor splice sequence at nucleotide 1226 (3.36%) maintains the reading frame and would give rise to a polypeptide of M_r = 13,300.
The amino acid sequences predicted for the 31.9K and 26.5K proteins from this region are identical except for an additional 46 amino acids at the center of the 31.9K polypeptide (Fig. 6). This stretch of 46 amino acids contains five cysteines within it, which could have a dramatic effect on the folding of the large polypeptide as compared to the 26.5K version. The 31.9K polypeptide has recently been shown to regulate the expression of the other early regions (8). The proximity of so many potential intra- or intermolecular crosslinking residues is reminiscent of the hinge region of the immunoglobulin polypeptides.
In early region 1B, two mRNAs of 13 S and 22 S have been reported and are derived by differential splicing of a common precursor (Fig. 6). The precise locations of both the start and stop points of that transcript are known, as well as the locations of the splice sites as deduced from cDNAs (38). Heterogeneous starts have been observed for the E1B mRNAs and they begin at either nucleotide 1699 or 1701 (4.65%) (24). The smaller mRNA contains a splice from nucleotide 2236 (6.16%) to nucleotide 3589 (9.83%), whereas the larger mRNA contains a splice from nucleotide 3504 (9.6%) to the same acceptor at 3589 (9.83%). The first AUG, present in both mRNAs, occurs at nucleotide 1711 (4.69%) and continues in frame for 525 nucleotides, leading to a polypeptide with a predicted M_r = 20,500. This stretch of sequence is common to both mRNAs. However, the larger mRNA also contains a second, much longer, open reading frame beginning at the second AUG in the mRNA located at nucleotide 2016 (5.52%) and leading to a protein of M_r = 54,900. Studies of Ad2, Ad5, and Ad12 (63, 70) have revealed a similar situation and show that the smaller mRNA, which is only found late in infection, is translated to give only the 20.5K polypeptide as expected, whereas the larger mRNA can give rise to either polypeptide. In vivo, it appears that the larger polypeptide is the predominant product of the 22 S mRNA, while in vitro, the smaller polypeptide is the major product (70), perhaps suggesting that some translational control factor operates in vivo that is missing from the in vitro systems.
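As a rough cross-check on coding capacities of this kind, the length of an open reading frame can be converted to an approximate molecular weight. The sketch below uses an average residue mass of 110 daltons, which is a common rule of thumb and an assumption here, not the method used to obtain the values in the text. Those values were presumably calculated from the actual amino acid composition. The function names are illustrative.

```python
AVG_RESIDUE_MASS = 110  # daltons per amino acid; rule-of-thumb assumption

def codon_count(first_nt, terminator_nt):
    """Codons from the first base of the AUG up to, but excluding, the terminator."""
    span = terminator_nt - first_nt
    if span <= 0 or span % 3 != 0:
        raise ValueError("coordinates are not in the same reading frame")
    return span // 3

def approx_mr(n_codons):
    """Approximate molecular weight from codon count alone."""
    return n_codons * AVG_RESIDUE_MASS

# E1B frames on the r-strand, using positions quoted in the text.
small = approx_mr(codon_count(1711, 1711 + 525))  # frame open for 525 nt
large = approx_mr(codon_count(2016, 3501))        # second AUG to terminator at 3501
print(small, large)
```

The estimates (about 19,000 and 54,000) fall within roughly 10% of the 20,500 and 54,900 quoted above, which is as close as a composition-independent approximation can be expected to come.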
The terminator for the 54.9K polypeptide is located at nucleotide 3501 (9.59%) and is immediately followed by a donor splice site at nucleotide 3504 (9.60%) which removes 84 nucleotides and connects the main body of this message to an untranslated sequence of 473 nucleotides (38). Close to the end of this sequence, the hexanucleotide AAUAAA is found and the transcript terminates shortly thereafter. This same AAUAAA also serves to signal polyadenylation for another mRNA, that which encodes polypeptide IX (Fig. 6). This polypeptide is derived from an mRNA, nucleotides 3576 (9.80%) to 4061 (11.13%), synthesized independently of the E1B mRNAs during the late phase of virus infection and the sequence of this region of the genome has been published previously (23).
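The splice arithmetic can be checked with a one-line helper. It assumes that a donor position denotes the last nucleotide retained in the upstream exon and an acceptor position the first nucleotide of the downstream exon. This convention is inferred from the statement that the 3504-to-3589 splice removes 84 nucleotides; it is not stated explicitly in the text.

```python
def intron_length(donor, acceptor):
    """Nucleotides removed when `donor` (last exon base) is joined to `acceptor` (first exon base)."""
    if acceptor <= donor:
        raise ValueError("acceptor must lie downstream of donor on this strand")
    return acceptor - donor - 1

print(intron_length(3504, 3589))  # splice in the larger E1B mRNA
print(intron_length(2236, 3589))  # splice in the smaller E1B mRNA
```

Under this convention, the splice in the larger E1B mRNA removes 84 nucleotides, as stated, and the splice in the smaller mRNA removes 1352.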
The overall structure of this region in Ad2 is almost identical with that found for Ad5, although certain details vary. Between nucleotides 1 and 1608, there are 16 differences within the 672 noncoding nucleotides and eight differences within the coding sequences. Among the latter, two are silent, third position changes, whereas six lead predominantly to conservative amino acid substitutions (Fig. 7). In the intercistronic region separating E1A from E1B, there are no differences. For E1B, there are four silent, third position changes in the 20.5K reading frame plus an insertion of six nucleotides (Gln-Gln) in Ad5, closely followed by an insertion of three nucleotides (Gln) in Ad2. Three of these silent changes, as well as the insertions, lie in the region common to both the 20.5K and the 54.5K proteins. These all result in amino acid changes in the 54.5K protein. In addition, there are five other differences leading to amino acid substitutions and 12 silent, third position changes in the unique section of the 54.5K protein. A comparison of the changes between the E1A and E1B regions (Fig. 7) reveals that, whereas the differences in E1A are scattered throughout the coding region, in E1B these differences are clustered around the sequence where the 20.5K and 54.5K polypeptides overlap. This suggests that the NH2 terminus of the 20.5K and the COOH terminus of the 54.5K proteins contain important functional information. Unfortunately, comparison with Ad7 and Ad12 (71) is not helpful, because of multiple amino acid substitutions as well as large deletions-insertions in these less closely related viruses.

In addition to the open reading frames on the r-strand which correspond to previously characterized polypeptides, there are two other URFs on the l-strand (Fig. 6). The first, URF 10, begins at nucleotide 2413 (6.61%), has an AUG at nucleotide 2290 (6.27%) and terminates with a UAA at nucleotide 2005 (5.49%), giving a total coding capacity of 14,000.
The second, URF 11, begins at 1847 (5.06%), has an AUG at 1712 (4.69%) and terminates with a UAA at 1196 (3.28%) with a total coding capacity of 23,200. No mRNAs containing these sequences have been described so far, although the hexanucleotide AAUAAA, which commonly signals a polyadenylation site, is found downstream of these two reading frames at nucleotide 442 (1.21%).
IVa2 Polypeptide (Genome Coordinates 11.1-16%)-The region between nucleotides 4050 (11.10%) and 5826 (15.95%) contains the coding sequences for a late Ad2 polypeptide called IVa2 (Fig. 8). This protein is translated from an l-strand message in contrast to the early region 1 transcripts. Cap sites for the IVa2 mRNA are found at nucleotides 5826 and 5828 (15.95%) (24) and, by comparison with Ad5, translation begins with an AUG at nucleotide 5708 (15.63%) (63). The splice coordinates for IVa2 mRNA have been determined for Ad5 (63) and are assumed to be identical for Ad2. Only four amino acid residues are encoded between the AUG and the donor splice site at nucleotide 5696 (15.59%). The acceptor splice site is at nucleotide 5418 (14.84%), after which the reading frame continues without interruption to a terminator UAA at position 4085 (11.19%). This terminator is followed immediately by an AAUAAA sequence at nucleotide 4050 (11.10%). Thus, the 3'-end of the IVa2 mRNA overlaps by six nucleotides the 3'-end of both the E1B and polypeptide IX mRNAs. The sequence of this gene has been deduced for the related viruses Ad5 (63) and Ad7 (67). A comparison of the differences between Ad2 and Ad5 is presented in Table IV. A detailed discussion of both the structural features of the gene and the interserotypic comparisons can be found elsewhere (67).
There are two interesting features to the IVa2 gene. The first lies at the 3'-end where the message overlaps the end of the E1B and polypeptide IX mRNAs. The terminator codon UAA forms a part of the AAUAAA sequence. The other interesting feature lies at the NH2 terminus of the gene. In the first exon of the mRNA, only four amino acids of the mature IVa2 polypeptide are encoded, before being spliced to the main body. The reading frame in which these four amino acids lie is part of a much longer reading frame with a potential coding capacity of 120,000, beginning with an AUG at nucleotide 8357 (22.89%) and terminating with UAG at nucleotide 5189 (14.22%) (see "Early Region 2B" and Fig. 8). This means that between the cap site for the IVa2 mRNA and the final terminator of the 120,000 reading frame, about 630 nucleotides are held in common. 220 of these nucleotides, which encode both the extreme COOH terminus of the 120,000 reading frame and the NH2 terminus of the main body of IVa2, are translated in two distinct reading frames. One additional open reading frame, URF 9, also occurs in this region and overlaps the common region described above. It begins at nucleotide 5863 (16.05%), has an AUG at nucleotide 5674 (15.53%) and a terminator UGA at nucleotide 5329 (14.60%) with a maximal coding capacity of 18,900. If this reading frame were to be used, then 89 nucleotides from 5418 (14.84%) to 5329 (14.60%) would be used in all three reading frames.

Comparison of early region 1 in Ad2 and Ad5. Only those sequence differences leading to amino acid changes between the two serotypes are shown. Amino acid residues (single letter code) carrying an asterisk reflect changes in the E1B 20.5K polypeptide, whereas all other changes in the E1B region affect only the 54.5K polypeptide.
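The extent of these overlaps follows directly from the quoted end points, as the sketch below shows. Each region is represented simply by its two l-strand end points taken from the text; lengths are computed as differences of end points, the convention that reproduces the figure of 89 nucleotides. The function and variable names are illustrative.

```python
def common_span(*regions):
    """Length shared by all regions, each given as a pair of end-point positions."""
    high = min(max(r) for r in regions)  # lowest of the upper end points
    low = max(min(r) for r in regions)   # highest of the lower end points
    return max(0, high - low)

# End points quoted in the text (l-strand, so transcription runs high -> low).
iva2_mrna = (5826, 4050)   # cap site to AAUAAA
iva2_body = (5418, 4085)   # acceptor splice site to terminator UAA
frame_120k = (8357, 5189)  # first AUG to terminator UAG
urf9 = (5863, 5329)        # start of URF 9 to terminator UGA

print(common_span(iva2_mrna, frame_120k))        # IVa2 mRNA vs 120,000 frame
print(common_span(iva2_body, frame_120k))        # IVa2 main body vs 120,000 frame
print(common_span(iva2_body, frame_120k, urf9))  # all three reading frames
```

The first value (637) corresponds to the "about 630" nucleotides held in common, and the third (89) matches the stretch that would be read in all three frames. The second value (229) is close to the 220 quoted; the exact figure depends on how the end points are counted.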
Early Region 2B-For a long time, the region between the late promoter at genome coordinate 16.4% and the start of the 52, 55K gene at coordinate 30% was known only to contain coding information for the three segments of the tripartite leader, together with the genes for VA I and VA II. More recently, it has been shown (5) that a series of transcripts is made from the l-strand, which covers all of this region and arises by complex splicing from a promoter probably located around coordinate 75. This promoter had previously been identified as the start of another early transcription unit called E2, which contains the gene for the 72K DNA binding protein (9, 10, 94). Following the discovery of these new transcripts, the 72K gene region has been designated E2A, and the region containing the new transcripts, E2B (5) (see Fig. 1). The E2B transcripts possess leader sequences encoded around coordinates 75 and 39, and in some cases, additional leaders around coordinate 68.5 or 65 are present. The leaders are spliced to main bodies at either coordinates 30, 26, or 23, which then extend to a common termination site at coordinate 11. This termination site appears to correspond with the termination site for IVa2 mRNA. Polypeptides of M_r = 105,000, 87,000, and 75,000 have been identified as potential products of these mRNAs by in vitro translation. One of these, the 87K poly-

The four amino acids at the NH2 terminus of IVa2, shown as the cross-hatched area, originate from the 120,000 reading frame and become joined, by splicing, to the main body of IVa2 in a second reading frame. Other conventions follow those of Fig. 6.

Within the sequence, two large open reading frames can be identified (Fig. 9). One of these begins at nucleotide 10,579 (28.97%), has the first AUG at nucleotide 10,534 (28.85%), and continues to a terminator UAG at nucleotide 8,575 (23.48%).
Comparison of the tryptic peptides present in the terminal protein and its precursor reveals excellent agreement between those observed and those predicted from this sequence (58). This argues strongly that this reading frame is the one used for the production of the terminal protein. The sequence between coordinates 15.8 and 31.6% has been determined independently by Alestrom et al. (88). The only differences between the two sequences lie within this reading frame at nucleotides 9,315 and 9,316. The dinucleotide CG at this position is inverted, leading to an amino acid substitution from arginine to alanine. This difference almost certainly reflects a strain difference. A more extensive discussion of the region encoding the terminal protein can be found in the accompanying manuscripts (58,88).
A second large open reading frame begins at nucleotide 8796 (24.09%), has the first AUG at 8357 (22.89%), and continues to a terminator UAG at nucleotide 5189 (14.21%). The total coding capacity of this reading frame is 132,100 while the capacity from the first AUG to the terminator is 120,400. The smallest E2B mRNA has an acceptor splice point at coordinate 23, as mapped by electron microscopy, making it the best candidate from which the 120K polypeptide might be translated. Although a 105K polypeptide and longer polypeptides have been identified by in vitro translation, their relationship, if any, to this 120K protein has not been established. The N group mutants, which are defective in viral replication, map between coordinates 18 and 22.5% (72). These presumably contain defects in the 120K protein and indicate that it plays some role in DNA replication. Thus, early region 2, both E2A and E2B, seems likely to encode a set of genes essential for viral DNA replication. Recently, a large protein of M_r = 140,000 has been found to co-purify with the Ad2 terminal protein and together they have been shown to exhibit DNA polymerase activity (73). It is possible that this polypeptide is the translation product of this large reading frame.
The E2B mRNA, which contains an acceptor splice site around coordinate 26, is more enigmatic. Examination of the DNA sequence indicates that, close to this coordinate, a third reading frame, URF 7, begins at nucleotide 9270 (25.39%) and overlaps extensively with the COOH terminus of the terminal protein (Fig. 9). The first AUG in this reading frame is encountered at nucleotide 9030 (24.72%). It terminates with UGA at nucleotide 8385 (22.96%) and has a total coding capacity of 31,400. This is much less than the 75,000 which would be required to account for the only other polypeptide assigned to this region by in vitro translation (5). For all three of these mRNAs, the situation is complicated by the fact that they each have leader sequences associated with them, which may or may not contain coding sequences, and so for the moment, these molecular weight estimates must be considered minimal values. In addition to these three open reading frames, there are four other small open reading frames (URFs 5, 6, 8, and 9 in Fig. 9). Polypeptides which might be translated from them vary in size between ~10,000 and 17,000. Because candidate polypeptides have not yet been identified in vitro, their significance remains obscure.

J. B. Lewis and M. Mathews, unpublished results.
The r-strand in this region is considerably less complicated (Fig. 9). The major late transcript begins at nucleotide 6039 (16.5%) (74) and the next 41 nucleotides comprise the first late leader sequence (16, 17). At nucleotide 7101 (19.44%), the acceptor site for the second late leader is found with the donor site at nucleotide 7172 (19.64%). Within this first intervening sequence, there is one short open reading frame which might encode a polypeptide of Mr = 11,600. A polypeptide of Mr = 13,500 has been translated from mRNA isolated by hybridization to M13 clones containing only these intron sequences. This mRNA belongs to a class of immediate early mRNAs (11).
The third leader sequence of the late mRNAs is encoded between nucleotides 9634 (26.38%) and 9724 (26.63%). Between the second and third leader segments another open reading frame is found with a coding capacity of 16,600. The first AUG is located at nucleotide 7968 (21.82%), and the frame extends to nucleotide 8415 (23.04%). This reading frame is almost wholly contained within the region identified as the "i" leader, which was first observed by electron microscopy and is often found as an additional leader segment on certain of the late mRNAs (75). This leader has now been mapped precisely between nucleotides 7942 (21.75%) and 8381 (22.96%) (76). In vitro translation results suggest that, when mRNAs contain this leader sequence, they encode a polypeptide of Mr ~14,000 (11). Recently, direct selection of mRNAs containing these sequences, using M13 clones, has confirmed that a polypeptide identified as 13.6K is translated in vitro.
Two additional open reading frames are present on this strand. URF 3 contains within it the sequences of the third late leader, begins with an AUG at nucleotide 9294 (25.45%), and extends through nucleotide 9798 (26.83%). A polypeptide of Mr = 17,700 is predicted. URF 4 partially overlaps VA I RNA, begins with an AUG at nucleotide 10,421 (28.54%), and extends through nucleotide 10,832 (29.67%). A polypeptide of Mr = 14,400 is predicted. However, no candidate polypeptides have yet been identified for these two URFs, either in vivo or in vitro.
Finally, on the r-strand, the two small VA RNAs are encoded, VA I RNA beginning at nucleotide 10,610 (29.06%) and VA II at nucleotide 10,866 (29.76%). Sequences from this region have been determined by others, and their structure is discussed in detail (28, 41).
52,55K Polypeptide-Although the major adenovirus promoter located at coordinate 16.5 is usually termed the late promoter, it has recently been shown that it is also active at early times (11, 75, 77, 89). However, at these early times, only a subset of the late mRNAs is produced. These mRNAs, which fall into the group L1 and which map between genome coordinates 30-40, derive from the immediate early genes and are apparently expressed before any of the other early genes. The first polypeptides encoded by mRNAs of the L1 group are the 52-55K polypeptides, and examination of the sequence reveals that a large open reading frame begins with an AUG at nucleotide 11,040 (30.23%) and continues in frame to the end of the sequence presented in Fig. 9. This region, therefore, likely encodes the NH2 terminus of the 52-55K polypeptides. The location of this AUG is just 17 nucleotides beyond the 3'-end of VA II RNA, and it is immediately preceded by the acceptor splice site (77).
Early Region IV (Coordinates 91-100%)-This sequence has been determined independently by Herisse et al. (36) using Ad2 DNA cloned into pBR322. Their sequence differs from ours by the addition of a T residue at nucleotide 796 (97.77%). This is the first of a run of T residues lying in a region between two open reading frames (URF 16 and URF 17). It seems likely that this is a strain variation. One other difference is found at nucleotide 1581 (95.57%), which is the first of a run of A residues, in this case, the run where we find variability in our viral stock. The finding of one unique length at this position by Herisse et al. (36) reflects their use of cloned DNA.
This segment of the genome is transcribed at early times from the l-strand and is under positive regulatory control (6, 8, 95). Transcription begins within the sequence TTTTTA at nucleotides 324-329 (99.09%), leading to a heterogeneous array of starts (24, 78). The initial product probably terminates around nucleotide 3137 (91.21%), just beyond an AATAAA sequence which occurs at nucleotide 3117 (91.26%). This would be consistent with electron microscopic results which map the 3'-end of the E4 mRNAs around coordinate 91.4% (9). A complex array of mRNAs is produced from this initial transcript by splicing, and at least seven different species have been identified (9, 10). These mRNAs are indicated in Fig. 10. Also shown are the open reading frames present on the l-strand which can be deduced from the sequence.
With one exception, the major open reading frames that are present in the sequence lie in suitable positions so that they could be accessed by the various splicing events leading to the mature mRNAs. In the case of the 4a,c and 4e mRNAs, the first AUGs within these reading frames are each preceded by a sequence characteristic of an acceptor splice site, although the precise locations of the splice points have not yet been determined. For the 4b,d mRNAs, splicing occurs between nucleotides 390 (98.91%) and 1503 (95.79%),⁵ while for the 4g mRNA, the best potential splice site lies between the first and second AUGs within the genomic reading frame. Thus, in each case, there is a good fit between the sequence data and the electron microscopic results, upon which the assignments are based.
A striking exception is provided by the largest open reading frame present within this sequence, which is labeled URF 20. This begins at nucleotide 1759 (95.07%), has the first AUG at 1861 (94.79%), and terminates at 2743 (92.32%). This spans the large intron located between coordinates 92.4 and 94.4. Only two of the well characterized mRNAs (4a and 4b) contain this sequence intact, but in each case, these mRNAs also contain a complete and separate upstream reading frame. Thus, if the large reading frame is to be used in these mRNAs, then some mechanism must exist for bypassing the upstream sections, either through internal initiation or some form of restart initiation, as is found in prokaryotic mRNAs. Neither of these situations has so far been reported for eukaryotic mRNAs, although it is possible that ribosomes enter through the 5'-end of these long mRNAs and merely skip over the 5'-proximal AUGs, as occurs in poliovirus RNA (79). The latter possibility seems less likely in this case since perfectly acceptable reading frames exist upstream. In the case of the 4a mRNA, 11 upstream AUGs occur, while in the 4c mRNA, six upstream AUGs occur.
⁵ D. Sciaky and N. Stowe, unpublished results.
An alternative way in which this reading frame might be used is suggested by the structures of the mRNAs shown in Fig. 10. The two pairs of mRNAs, 4a,c and 4b,d, are related by the presence or absence of the second intron. Thus, one might imagine that 4c and 4d could result from the further processing of 4a and 4b, respectively. In neither case would this result in loss of information if, in fact, only the first open reading frame is used in these mRNAs. By analogy, one might therefore postulate a precursor to the 4e mRNA in which the second intron had not been removed (see Fig. 10). Such an mRNA would then possess an ideal structure for the translation of URF 20. In the more highly processed version, mRNA 4e, appropriate splicing could result in the joining of the NH2-terminal part of the reading frame to an additional stretch of open reading frame (URF 21) which is present at the 3'-end of all region 4 transcripts. Unfortunately, the precise locations of the donor and acceptor splice sites for this large intron are not known; however, several possible splicing combinations could result in the in-frame joining of these two regions. One reasonable splice between nucleotides 2050 (94.26%) and 2786 (92.20%) would give rise to a polypeptide of Mr = 16,300. A summary of the proposed coordinates for these mRNAs and their corresponding reading frames is presented in Table V.
The polypeptides encoded in E4 have been studied extensively by in vitro translation, and a variety of polypeptides, ranging in Mr between 11,000 and 35,000, have been described (80-83). In a recent study (84), the largest polypeptide detected had Mr = 35,000 and its composition was highly basic.
By tryptic peptide analysis, the 35K polypeptide was related to four smaller polypeptides of Mr = 23,000, 22,000, 21,000, and 18,000. This large polypeptide would be an attractive candidate for the product of the URF 20 reading frame. The details of the relationship between this polypeptide and the smaller polypeptides are far from clear. The other major polypeptide detected in that study had Mr = 11,000 and was somewhat acidic. This polypeptide would fit reasonably with the products predicted for either the 4a,c mRNAs or the 4g mRNA, although in both cases, the predicted molecular weights of the polypeptides are slightly larger than 11,000. For the remaining open reading frames in the sequence, candidate polypeptides exist and have been detected by others, although insufficient information is available at present to attempt a meaningful correlation between individual polypeptides and the reading frames present in the sequence.
In addition to the open reading frames assigned to early region 4, there is one additional reading frame (URF 22) with a coding capacity of about 12,000 that is also apparent on the l-strand. This occurs just beyond nucleotide 3136 (91.22%), where early region 4 is believed to end. No mRNAs have been described so far from which it might be translated; however, the sequence AAUAAA, which is often associated with the presence of a polyadenylation site, does occur downstream from the end of this reading frame.
Fiber mRNA and the Late Strand Transcripts-In addition to the l-strand transcripts corresponding to early region 4, both nuclear and cytoplasmic transcripts from the r-strand of this region are also found. In particular, the mRNA for fiber polypeptide is known to terminate around genome coordinate 91 (85). Inspection of the sequence shows that an open reading frame is present which terminates at nucleotide 3162 (91.14%). This reading frame is the continuation of a reading frame present in EcoRI-E, which encodes the NH2-terminal portion of fiber polypeptide (37). An AATAAA sequence is found at nucleotide 3164 (91.14%), some 20 nucleotides before the polyadenylation site of fiber mRNA. The terminator UAA for fiber polypeptide is actually a part of the AAUAAA signal for polyadenylation of fiber mRNAs, a situation also encountered at the 3'-end of IVa2 (see above).
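The observation that a terminator UAA can lie inside the AAUAAA signal is easy to check mechanically on any DNA sequence. The following is a minimal sketch, assuming the sequence is given as a plain 5'-to-3' string of the coding strand; the function names and the toy sequence are hypothetical and are not taken from the Ad2 data.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_signals(seq, signal="AATAAA"):
    """Return the 0-based start index of every occurrence of `signal`."""
    hits = []
    i = seq.find(signal)
    while i != -1:
        hits.append(i)
        i = seq.find(signal, i + 1)  # step by one so overlapping hits are kept
    return hits


def first_stop(seq, frame_start):
    """Return the 0-based index of the first in-frame stop codon, or None."""
    for i in range(frame_start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def stop_inside_signal(seq, frame_start, signal="AATAAA"):
    """True if the frame's first stop codon lies wholly within a signal."""
    stop = first_stop(seq, frame_start)
    if stop is None:
        return False
    return any(hit <= stop and stop + 3 <= hit + len(signal)
               for hit in find_signals(seq, signal))


# Hypothetical toy sequence: the frame starting at index 0 ends in the TAA
# that forms part of the AATAAA hexanucleotide.
toy = "ATGGCTGAATAAAGGCCTT"
print(find_signals(toy), first_stop(toy, 0), stop_inside_signal(toy, 0))
```

Only the first in-frame stop is tested, since that is the codon that would terminate translation of the frame in question.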
The primary transcript from which fiber mRNA is derived begins at the late promoter at genome coordinate 16.5. This primary transcript can extend a good distance beyond the 3'-end of fiber mRNA, at least as far as genome coordinate 98.2 (86, 96), although the rightmost mRNA so far known to be derived from it is fiber mRNA. Examination of the sequences downstream from the 3'-end of fiber mRNA shows the presence of three AAUAAA sequences. The first two of these occur within 1000 nucleotides of the 3'-end of fiber mRNA, and are preceded by sequences rich in termination codons. It seems unlikely, therefore, that they serve as polyadenylation sites for additional mRNAs derived by splicing of the tripartite leader to sequences beyond the end of fiber mRNA. However, immediately following these two sequences, three significant stretches of open reading frame are found, as indicated in Fig. 10. Translation of these reading frames would give rise to proteins of Mr = 10,000, 16,000, and 12,000. Furthermore, a third AAUAAA sequence occurs at nucleotide 810 (97.73%) which could serve as a polyadenylation site for mRNAs from which these three reading frames were translated. This particular AAUAAA is also of interest because it lies at the start of the variable stretch of 13-16 A residues discussed previously.
Genome Organization-As already noted, the G + C content of the left end of Ad2 is quite high (57.8%) when compared to the right end (49.2%). This in turn is reflected in the dinucleotide frequencies, which do not show the low occurrence of the CG dinucleotide that is usual for eukaryotes (91) and many of their viruses (90). For example, the tetranucleotides CGCG and GCGC are among the most frequent found at the left end of the genome. The CG dinucleotide is not methylated in Ad2 (93). Ad2 is also unlike other animal viruses in its preference for nucleotides at the third position of codons. Wain-Hobson et al. (92) have noted that codons containing U in the third position are the most frequent among animal viruses. Ad2 shows a clear preference for C or G in the third position (Table VI).
One common feature that Ad2 does share with all viral genomes thus far sequenced is that the available coding information is used very economically, as summarized in Table VII. Intergenic distances are short and several examples of overlapping genes occur. The distribution of coding sequences along the two strands is illustrated schematically in Fig. 11. Because many of the open reading frames cannot yet be positively assigned, those known or highly likely to contain coding information are highlighted. At the left end, these account for 38% of the available sequence on the r-strand and 51.3% of that on the l-strand. A further 3756 nucleotides (URFs 3-11) may also contain significant coding stretches and so might have to be included in this assessment.
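The compositional measures discussed under Genome Organization (G + C content, the relative frequency of the CG dinucleotide, and the base found at the third codon position) can each be computed in a few lines. The sketch below is illustrative only: the function names are hypothetical, the observed/expected ratio is one common way of expressing dinucleotide depletion and is an assumption here rather than the statistic used in the paper, and the toy sequence is not Ad2 data.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C residues in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def dinucleotide_ratio(seq, dinuc="CG"):
    """Observed count of `dinuc` divided by the count expected from the
    frequencies of its two bases (assumed measure, see lead-in)."""
    seq = seq.upper()
    observed = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc)
    expected = seq.count(dinuc[0]) * seq.count(dinuc[1]) / len(seq)
    return observed / expected if expected else float("nan")


def third_position_counts(orf):
    """Count the bases found at the third position of each complete codon."""
    orf = orf.upper()
    return Counter(orf[i + 2] for i in range(0, len(orf) - 2, 3))


toy = "ATGCGCGCTAAT"  # hypothetical 12-nucleotide example
print(gc_content(toy))
print(dinucleotide_ratio(toy))
print(third_position_counts(toy))
```

A ratio well below 1 would indicate the CG depletion described as usual for eukaryotes, whereas a value near or above 1 would correspond to the situation reported here for the left end of Ad2.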
It is apparent from Fig. 11 that in several instances genes overlap on the same strand. Thus, 948 nucleotides on the r-strand and 516 nucleotides on the l-strand are used more than once for coding. Two cases exist where complementary strands appear to be used for coding. Both involve the region encoding the 120K polypeptide. From in vitro translation experiments, it has been shown that URF 1 (318 bases) and URF 2 (447 bases) code for proteins of Mr = 13,500 and 13,600, respectively. Their coding regions on the r-strand show complete overlap with the region of the genome shown to encode the 120K polypeptide on the l-strand. As far as we are aware, this situation has not been described before.
It is of some interest to examine how these complementary reading frames are phased with respect to each other. Clearly, there are three possible phasings, as illustrated in Fig. 12. Two of these, labeled A and C, lead to a considerable degree of flexibility in codon choice on each strand, since in both cases, the third position of each codon, which can vary considerably without affecting the amino acid coded, is placed opposite to either the first position or second position of the complementary codon. Thus, in Type A, the first position of each codon will determine the third position of its complement. This phasing relationship is found between URF 2 and the 120K protein. In contrast, the phasing between URF 1 and the 120K protein is of Type B, which is predicted to be the least flexible. It should be noted, however, that URF 8 also lies in this region and has a complementary overlap with URF 1 of Type A. If this frame is used, then a stretch of 154 nucleotides from 5160 (14.14%) to 5006 (13.72%) would provide coding information in three of the six possible frames.
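The three phasings can be made concrete by asking which codon position on one strand lies opposite the third position on the other. The sketch below is an interpretation of the text, since Fig. 12 is not reproduced here: Type A is taken as third opposite first (as stated above), Type C as third opposite second, and Type B as third opposite third, the last two assignments being inferred from the description of A and C. The function names and the coordinate convention (a single genome coordinate axis, with the complementary frame read toward decreasing coordinates) are assumptions for illustration.

```python
def codon_position(coord, frame_start, strand):
    """Codon position (1, 2, or 3) of genome coordinate `coord`.

    frame_start is the coordinate of the first base of any codon in the frame.
    strand is +1 for a frame read toward increasing coordinates and -1 for a
    frame on the complementary strand, read toward decreasing coordinates.
    """
    offset = (coord - frame_start) * strand
    return offset % 3 + 1  # Python's % is non-negative for a positive modulus


def phasing_type(start_fwd, start_rev):
    """Classify the phasing of two complementary frames as 'A', 'B', or 'C'.

    The label mapping is an interpretation of the text (see lead-in).
    """
    third_base = start_fwd + 2  # a coordinate at position 3 of a forward codon
    opposite = codon_position(third_base, start_rev, -1)
    return {1: "A", 2: "C", 3: "B"}[opposite]


# Hypothetical frame starts, chosen only to show each phasing once.
for rev_start in (2, 3, 4):
    print(rev_start, phasing_type(0, rev_start))
```

Because the classification depends only on the two frame starts modulo 3, the same call works for any pair of complementary frames once a codon boundary on each strand is known.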
Of the other URFs that show similar complementary overlaps, the phasing relationships are of Type B for URF 3 and URF 10 and of Type A for URF 4 and URF 5. Type C is found in URF 11. If most of the unassigned URFs are used to code for polypeptides, then clearly this will place considerable constraints on the freedom of the genome to select nucleotides for its assembly. These effects will be amplified by the need for additional precise sequences to define controlling elements such as polymerase binding sites, RNA processing sites, etc. It will be of interest to discover whether the strategies employed by Ad2 to condense its coding information have a parallel within the genome of its host.