Gene for yeast glutamine tRNA synthetase encodes a large amino-terminal extension and provides a strong confirmation of the signature sequence for a group of the aminoacyl-tRNA synthetases.

The gene for the yeast Saccharomyces cerevisiae glutamine tRNA synthetase is shown here to encode a protein of 809 amino acids. This contrasts with the 551 amino acids of the Escherichia coli glutamine tRNA synthetase. The yeast GLN4 transcripts have 5' termini that start approximately 25 nucleotides in front of the long open reading frame. Much of the extra size of the yeast enzyme is due to a large amino-terminal extension. At codon 225, the yeast enzyme aligns with the amino terminus of the E. coli protein. From this point on, the two sequences have an average of 40% identity, with a few small gaps for alignment, until their respective carboxyl termini. At codon 254 of the yeast and codon 30 of the E. coli enzyme, however, there starts an exact 15-amino acid match between the two proteins. This match encompasses and is partially the same as a short sequence which is a signature sequence for the amino acid group of the bacterial aminoacyl-tRNA synthetases which are specific for different amino acids. This is the strongest sequence match found between any yeast cytoplasmic or mitochondrial aminoacyl-tRNA synthetase with its bacterial homologue. This region of the structure is associated with a nucleotide fold. The result provides strong validation of the signature sequence, especially for sequences where the homology relationships are less dramatic than in this example. Because the 224-amino acid extension of the yeast enzyme does not align with any part of the E. coli enzyme, we propose that it is not associated directly with the catalytic function of the enzyme. Its possible function is investigated in the accompanying paper.

(Received for publication, February 4, 1987)
Steven W. Ludmerer and Paul Schimmel
* This work was supported in part by Grant GM15539 from the National Institutes of Health. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

Aminoacyl-tRNA synthetases catalyze the esterification of an amino acid with the cognate tRNA (1). Although they exist in a variety of subunit sizes and quaternary arrangements, recent investigations suggest that there is a common motif underlying their structural organization (2, 3). There is a catalytic core to which additional polypeptide sequences have been joined; these additional sequences are not essential for catalysis but may play a role in other cellular processes (4).
In bacteria, the catalytic region of some of the synthetases is positioned toward the amino-terminal portion of the protein; dispensable sequences are located on the carboxyl-terminal side of the catalytic core (2, 3). In eukaryotes, the cytoplasmic synthetases are typically larger than the prokaryotic homologues (1). One explanation is that the eukaryotic synthetases have additional domains, fused to the core structure, to perform physiological functions not required of the prokaryotic enzymes. Alternatively, the eukaryotic protein has acquired insertions at various sites of the core enzyme which provide subtle modulations of the catalytic functions.
Although the aminoacyl-tRNA synthetases catalyze a common reaction and share a structural motif, recent investigations have established that there is little primary structure similarity among them. The most significant sequence element was discovered by Webster et al. (5) as a consequence of the determination of the sequence of Escherichia coli isoleucine tRNA synthetase. This element is a sequence of 10-15 amino acids which serves as a signature for a group within this class of enzymes. This sequence forms a critical part of the nucleotide fold that is known to be in the amino-terminal part of the bacterial methionine and tyrosine tRNA synthetases (6-9).
The presence of this sequence element may be questioned in some of the enzymes, however, because of the significant amino acid substitutions that occur among the putative signature sequences. While attempts have been made to find other more definitive common sequences in the primary structures of the enzymes, no better example has been found or proposed.
We previously reported the isolation of GLN4, the structural gene for Gln-tRNA synthetase in Saccharomyces cerevisiae (10). GLN4 is an essential gene which exists in single copy on chromosome XV. The size of the transcript, as determined by an RNA blot, is 2900 nucleotides. This suggests that the protein product is considerably larger than the 551 amino acids of the E. coli polypeptide (11).
We report the sequence of GLN4 and an analysis of the primary structure of the encoded enzyme. This sequence is of particular interest because the primary structure of the E. coli counterpart has been determined and a location for the putative signature sequence proposed (5). This proposed sequence element starts 30 amino acids from the amino terminus. Its identification is based upon partial, and in some cases weak, sequence similarities with several enzymes rather than a strong relationship with one particular enzyme. Identification of the putative signature sequence in the yeast enzyme provides a further evaluation of its significance. Furthermore, alignment of the yeast sequence alongside the bacterial sequence enables us to investigate whether the anticipated greater length of the yeast enzyme arises from multiple insertions into the catalytic domain or from the presence of discrete domains fused to the core enzyme. This, in turn, provides a basis for future investigations on the structure of a eukaryotic aminoacyl-tRNA synthetase.

MATERIALS AND METHODS
DNA Sequence Analysis-DNA sequence analysis was conducted by the chain termination method of Sanger et al. (12). Fragments of GLN4 were subcloned into M13mp8 and M13mp9 (New England Biolabs). Sequencing was initiated using the 15- or 17-base universal primer (New England Biolabs).
RNA Preparation-Total RNA was isolated from haploid strain AM644-2B (10) harboring the GLN4-bearing plasmid pSWL203. Cells were grown in minimal media to a Klett reading of 100 and broken with glass beads, and RNA was isolated by the method of Carlson and Botstein (13).
Plasmids-The isolation of the GLN4-bearing plasmid pSWL203 has been previously described (10). Additional plasmids relevant to this work, obtained through subcloning the GLN4 insert of pSWL203 via standard techniques, are as follows: pSWL219, EcoRI(-410) to HindIII(+709) fragment of GLN4 ligated into the EcoRI and HindIII sites of M13 phage mp10w (New England Biolabs); pSWL220, EcoRI(+1987) to ClaI(+2737) fragment of GLN4 ligated into the EcoRI and AccI sites of M13 phage mp10w (New England Biolabs); and pSWL99-7, TaqI(-385 to +266) fragment of GLN4 ligated into the AccI site of M13 phage mp8 (New England Biolabs). Nucleotide numbers have been designated relative to +1 as the start of the GLN4 coding sequence (see below).
S1 Nuclease Mapping-Fifty micrograms of total RNA was mixed with approximately 5 × 10^5 cpm of a gel-purified probe uniformly labeled with [α-32P]dATP (Amersham Corp.), according to the method of Hsu and Schimmel (14). The mixture was ethanol precipitated and resuspended in 50 µl of hybridization buffer, which contained 40 mM PIPES (pH 6.4), 0.4 M NaCl, 1 mM EDTA, and deionized formamide. The optimum concentration of formamide was determined empirically for each hybridization and was generally found to be in the range of 70-80%. The sample was heated at 90 °C for 5 min and then cooled slowly to 42 °C, at which temperature it was incubated overnight.
The hybridization mixture was quickly transferred into ice-cold S1 buffer, which contains 30 mM sodium acetate (pH 4.6), 1 mM zinc sulfate, 0.25 M NaCl, and 5% glycerol. This mixture was digested with 500 units of S1 nuclease (Bethesda Research Laboratories) at 30 °C for 1 h. The digestion was stopped by the addition of 20 µl of 0.2 M EDTA (pH 7.4). This mixture was ethanol precipitated, washed, and then resuspended in 5 µl of 0.1 N NaOH and 5 mM EDTA. To this was added 5 µl of the standard formamide loading buffer used in the sequencing reactions. Samples were run on 8 and 6% denaturing acrylamide gels next to a sequencing ladder or labeled fragments that served as size standards. Generally 3 µl were loaded per lane. Results were evaluated by autoradiography.
Miscellaneous-Restriction endonucleases were purchased from New England Biochemicals and Boehringer Mannheim. They were used as recommended by the supplier. E. coli DNA polymerase I large fragment (Klenow enzyme) was obtained from Boehringer Mannheim. [α-32P]dATP was obtained from Amersham Corp. at a specific activity of >400 Ci/mmol.

RESULTS
Coding Region for Glutamine tRNA Synthetase-We established previously that GLN4 encodes glutamine tRNA synthetase (10). A recombinant plasmid with a 5-kilobase pair insert of genomic DNA from S. cerevisiae was shown to harbor GLN4 by virtue of its ability to restore the Gln+ phenotype to strains that are Gln- due to a mutant gln4 allele and by its ability to confer an elevation in glutamine tRNA synthetase activity.
The abbreviation used is: PIPES, 1,4-piperazinediethanesulfonic acid.

FIG. 1. Restriction map and location of the GLN4 coding sequence and transcript. The Sau3A restriction fragment which harbors GLN4 is displayed. The entire fragment was sequenced except for the small area represented by dark shading. The position of the GLN4 transcript, the GLN4 open reading frame, and its alignment with the E. coli Gln-tRNA synthetase sequence are shown in the top part of the diagram. The location of the exact 15-amino acid match between these sequences is shown by a thick vertical bar. The probes used in S1 nuclease mapping of transcripts are shown at the bottom of the figure. kbp, kilobase pair.

FIG. 2. Amino acid sequence of S. cerevisiae glutamine tRNA synthetase. The amino acids are deduced from the nucleotide sequence of GLN4. The E. coli glutamine tRNA synthetase sequence is shown beneath as a series of dots, except where there is a difference between the two sequences. A dash indicates a gap. The alignment of the two sequences was the best that could be established by a computer sequence alignment method (37).

FIG. 3. The nucleotide sequence of the noncoding regions of GLN4. The sequences were established by the dideoxy sequencing strategy of Sanger et al. (12) as described in the text. A and B, respectively, display the 5' and 3' noncoding sequences. In A, the approximate centers of the locations of the 5'-ends of transcripts, as determined by S1 nuclease mapping, are indicated by arrows at -15 and -28; the location of a potential TATA sequence upstream of these 5'-ends is indicated by underscoring. In B, the position of a potential tripartite polyadenylation sequence is given by underscores. The experimentally determined 3'-ends of transcripts centered at approximately +2501, +2532, and +2559 are designated by arrows.

... identity of 40%, the a priori random probability is vanishingly small that these two sequences would have an exact match of 15 amino acids. This is the only part of these primary structures which has a match of this length of consecutive amino acids.
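The size of this a priori probability can be gauged with a rough calculation. The sketch below is not taken from the paper: it assumes that each aligned position is identical independently with probability 0.40 (the average identity reported here), and it uses the 551-residue length of the E. coli enzyme to count possible starting positions. Real sequences are not independent position by position, so the numbers are only an order-of-magnitude illustration.

```python
# Assumption (not from the paper): aligned positions match independently,
# each with probability equal to the reported average identity of 40%.
p_identity = 0.40
run_length = 15

# Probability that 15 consecutive aligned positions are all identical
# when the run starts at one fixed position.
p_run = p_identity ** run_length

# Crude bound on seeing such a run anywhere: count the possible start
# positions along the 551-residue E. coli sequence and multiply.
ecoli_length = 551
n_windows = ecoli_length - run_length + 1
expected_runs = n_windows * p_run

print(f"P(15 identities in a row at a fixed start) = {p_run:.3e}")
print(f"Expected number of such runs over {n_windows} starts = {expected_runs:.3e}")
```

Under these assumptions the expected number of 15-residue exact runs is well below one in a thousand, which is in line with the statement that the probability is vanishingly small.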
We interpret this as a strong validation of the signature sequence element for a group of aminoacyl-tRNA synthetases (2) and of its assignment within the two glutamine tRNA synthetases. Based upon comparisons with the sequences and structures of the amino-terminal halves of bacterial methionine and tyrosine tRNA synthetases, this also defines the location of a nucleotide fold in the two glutamine enzymes (5, 8).
Sequence of 5'- and 3'-Untranslated Regions and Delineation of the 5'-Ends of Transcripts- Fig. 3, A and B We mapped the 5'-end of the GLN4 transcripts with special concern for whether the transcripts initiate in front of the ATG of the long open reading frame. A previously reported gene disruption experiment showed that the HpaI site, which encompasses codons 74 and 75, is a restriction site internal to GLN4 (10). It was formally possible that this site, while internal to the transcript, lies within a nontranslated region.
We also explored whether GLN4 encodes two or more overlapping transcripts which are distinguished by the locations of their 5'-ends and which possibly utilize alternate ATG codons for translation initiation. If the size difference of these transcripts is less than 75 nucleotides, we may not have detected them in the previously determined RNA blots which demonstrated a single mRNA species of 2900 nucleotides (10).
To cover these possibilities, we chose a probe (for S1 nuclease mapping of the 5'-end of the transcript) which extends from a HindIII site at nucleotide +705 to an EcoRI site located in the 5'-flanking region at nucleotide -410. This probe is designated as pSWL219 in Fig. 1. The probe encompasses all of the potential initiating ATG codons upstream of the aforementioned signature sequence.
Hybridization and S1 nuclease mapping were conducted with total RNA from a strain which overproduces the GLN4 transcript by virtue of carrying GLN4 on autonomously replicating plasmid pSWL203 (10). The S1 nuclease-protected products were separated by size on a 6% denaturing polyacrylamide gel and analyzed by autoradiography. Probe fragments of known molecular weight were run in adjacent lanes as size standards.
Autoradiography of the gel shows the presence of closely spaced bands of approximately 710 nucleotides in length (data not shown). This places the transcriptional start near the ATG located at nucleotide position +1. Because this probe detects no other set of bands, we doubt the possibility that GLN4 produces two overlapping transcripts which utilize different ATGs for translational initiation.
The nuclease protection experiment was repeated with another probe which begins closer to the transcriptional start so as to refine the location of the 5'-end of the GLN4 transcript. The probe, designated as pSWL99-7 (Fig. 1), begins at a TaqI site located at nucleotide +266 and extends to a TaqI site at -385. With this probe, we detect two sets of bands (Fig. 4A). Each set consists of several sub-bands, and each of these is separated from its neighbor by a single nucleotide. The two sets of bands map to close, but distinct, places. One is centered at nucleotide -28, and the other is centered at -15. These transcripts have 5'-ends that lie just upstream of the ATG codon at position +1. There is no ATG in any reading frame positioned between these 5'-ends and the ATG codon of the GLN4 open reading frame. The multiplicity of these sets of bands is possibly due (at least in part) to melting at the ends of the probe-RNA hybrids and subsequent S1 nuclease digestion of the frayed ends of the duplexes.
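The relation between a protected fragment length and the mapped 5'-end can be sketched as simple coordinate arithmetic. The helper names below are hypothetical, and the sketch assumes a numbering scheme with no position 0 (..., -2, -1, +1, +2, ...), which the text implies but does not state explicitly. Fragment sizes read from a gel are approximate, so the results are estimates only.

```python
def five_prime_end(probe_downstream_end: int, protected_length: int) -> int:
    """Coordinate of the 5'-most protected nucleotide.

    Counts `protected_length` nucleotides upstream from (and including)
    `probe_downstream_end`, assuming there is no position 0.
    """
    pos = probe_downstream_end - protected_length + 1
    # Skip the nonexistent position 0 when the count crosses +1.
    if probe_downstream_end > 0 and pos <= 0:
        pos -= 1
    return pos


def protected_length(five_prime: int, probe_downstream_end: int) -> int:
    """Inclusive length from `five_prime` to `probe_downstream_end`, no position 0."""
    length = probe_downstream_end - five_prime + 1
    if five_prime < 0 < probe_downstream_end:
        length -= 1
    return length


# HindIII end of the first probe at +705, protected bands of about 710 nt:
print(five_prime_end(705, 710))

# TaqI end of the second probe at +266, with 5'-ends mapped at -28 and -15:
print(protected_length(-28, 266), protected_length(-15, 266))
```

With these assumptions, a fragment of about 710 nucleotides ending at +705 places the 5'-end a few nucleotides upstream of +1, and the two clusters mapped with the second probe correspond to protected lengths of roughly 294 and 281 nucleotides.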
The consensus sequence TATAT, the Goldberg-Hogness box, is a signal for transcriptional initiation (15). In yeast it is generally found between 39 and 150 nucleotides upstream of the transcriptional start. GLN4 contains the sequences TATAAA and TATAATA at nucleotide positions -93 and -23, respectively. The latter falls between the two mapped transcriptional starts at -28 and -15. The -93 TATA box, which exactly matches the consensus sequence, lies approximately 65 and 78 nucleotides, respectively, in front of the two major GLN4 transcriptional starts. Given its strong similarity to the consensus TATA sequence and its location in what is considered to be the general locale for yeast TATA sequences, we consider this sequence likely to serve as the TATA element for GLN4.
The sequence (A/G)NNAUGG is considered an efficient eukaryotic signal for translational initiation (16). For yeast transcripts this sequence has been refined to read ANNAUGNNU (17). A purine at -3 and the absence of a U at -1 are considered necessary for a high efficiency of translational initiation. The ATG at position +1, the first ATG codon to appear on the GLN4 transcript, contains the necessary alignment for efficient translation initiation (see also the accompanying manuscript (18)).
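The two criteria stated above (a purine at -3 and no U at -1, counting the A of the AUG as +1) can be expressed as a short check. This is a minimal sketch: the function name is hypothetical, and the test sequences are invented placeholders, not the GLN4 leader.

```python
PURINES = {"A", "G"}


def initiation_context_ok(mrna: str, aug_index: int) -> bool:
    """Return True if the AUG at `aug_index` (0-based) has a purine at -3
    and no U at -1, the two criteria named in the text."""
    seq = mrna.upper().replace("T", "U")  # accept DNA or RNA spelling
    if seq[aug_index:aug_index + 3] != "AUG":
        raise ValueError("no AUG at the given index")
    if aug_index < 3:
        raise ValueError("need at least three nucleotides upstream of the AUG")
    minus3 = seq[aug_index - 3]
    minus1 = seq[aug_index - 1]
    return minus3 in PURINES and minus1 != "U"


# Invented example sequences (placeholders only):
good = "CCAAAAUGGCU"   # A at -3, A at -1
poor = "CCUCUAUGGCU"   # U at -3, U at -1
print(initiation_context_ok(good, good.index("AUG")))
print(initiation_context_ok(poor, poor.index("AUG")))
```

The check deliberately ignores positions downstream of the AUG; the yeast refinement ANNAUGNNU would add a test at +6 if that were wanted.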
FIG. 4. Probes were prepared from pSWL219, pSWL220, and pSWL99-7 (see Fig. 1 and "Materials and Methods"). Total RNA from yeast cells harboring the GLN4-encoding plasmid pSWL203 was purified and hybridized to the probes (10). The hybridization products were treated with S1 nuclease as described under "Materials and Methods." The final products were size-separated by electrophoresis on a 6% denaturing polyacrylamide gel and then visualized by autoradiography. In A, the centers of two clusters of protected fragments are marked. The positions of molecular weight standards are shown in B, and arrows indicate fragments that are protected. In A and B, probes pSWL99-7 and pSWL220, respectively, were used. In each part, one lane is for RNA hybridized to the probe and then treated with S1 nuclease; the other lanes are controls in which no RNA was added or no S1 nuclease was added, respectively. In A, a DNA sequence ladder is shown also.

We examined the 5'-untranslated region of GLN4 for sequences that might be associated with gene regulation or control. While the gene for isoleucine tRNA synthetase is subject to general amino acid control, there is no evidence that GLN4 is affected by this regulatory pathway (19, 20). A consensus sequence 5'-AAGTGACTC-3' has been implicated as affecting expression of genes in this regulatory circuit (21, 22). The core sequence TGACTC is considered the most critical region of the consensus sequence; conformity to this part of the consensus sequence is generally high. Genes under general amino acid control usually contain several copies of the consensus sequence. They are located 50-150 nucleotides upstream from the transcriptional start (21, 22).
We found three stretches of nucleotides which fit the six-base consensus sequence in four out of six positions. In contrast, genes under general amino acid control frequently show a perfect fit to this six-base consensus sequence. Our failure to find compelling similarities to general amino acid control sequences is consistent with the lack of a response of levels of glutamine tRNA synthetase to general amino acid supply.
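A search of this kind can be sketched as a sliding-window comparison against the six-base core TGACTC, reporting every window that matches in at least a chosen number of positions. The function name and the short input strings are hypothetical, and the sketch scans one strand only; the paper does not describe how its search was carried out.

```python
def scan_core(seq: str, core: str = "TGACTC", min_matches: int = 4):
    """Return (offset, window, matches) for every window of len(core)
    in `seq` that agrees with `core` in at least `min_matches` positions."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(core) + 1):
        window = seq[i:i + len(core)]
        # Count position-by-position agreements with the core sequence.
        matches = sum(a == b for a, b in zip(window, core))
        if matches >= min_matches:
            hits.append((i, window, matches))
    return hits


# Invented inputs (placeholders, not GLN4 sequence):
print(scan_core("TGACTC"))   # perfect fit, 6 of 6
print(scan_core("TGAGTA"))   # partial fit, 4 of 6
print(scan_core("AAAAAA"))   # no window reaches 4 of 6
```

Setting `min_matches=6` restricts the output to perfect fits, the situation described for genes under general amino acid control.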
The 3'-End of GLN4 Transcripts-To map the 3'-end of the GLN4 transcript, a probe (pSWL220) was constructed that extends from the EcoRI site at nucleotide +1987 to the ClaI site at +2737 (Fig. 1). When hybridized against the GLN4 transcript, this probe gives three fragments that are protected from S1 nuclease digestion (Fig. 4B). The ends of these fragments are positioned at approximately nucleotides +2501, +2532, and +2559. The coding sequence of the GLN4 open reading frame ends at +2427 and thus is encompassed by all three of the 3'-ends that are found by this analysis. Zaret and Sherman (23) have proposed the consensus sequence TAG...TATGT...TTT as a signal for transcriptional termination and polyadenylation of yeast transcripts. This consensus sequence is generally located within 140 bases of the 3'-side of the stop codon. The 3'-end of the GLN4 coding sequence reads TGAATGATTTAATGTGCATATATGTATATAE. The GLN4 stop codon is indicated by boldface, and a sequence homologous to the tripartite consensus sequence is underlined. The 3' sequence of GLN4 conforms, therefore, to those of several other yeast genes. Because of the high AT content of the 3'-noncoding region, there are other sequences which potentially could be the transcriptional terminator for GLN4. The length of the GLN4 transcript, defined by the boundaries of the 5'- and 3'-end points established here, is approximately 2500-2600 nucleotides. The addition of polyadenylation sequences of typically 100-200 nucleotides would increase the mRNA to a size of 2600-2800 nucleotides. This is within 10% of the previously reported size of 2900 nucleotides, as estimated from RNA blots.
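The length estimates in this paragraph follow from the mapped end points by simple arithmetic, sketched below. The end-point coordinates, the 809-codon open reading frame, the 100-200 nucleotide poly(A) range, and the 2900-nucleotide blot estimate are taken from the text; the assumption that the numbering has no position 0 is ours.

```python
ORF_CODONS = 809
orf_end = ORF_CODONS * 3             # last coding nucleotide, +2427

five_prime_ends = [-28, -15]         # mapped 5'-ends (centers of clusters)
three_prime_ends = [2501, 2532, 2559]
poly_a_range = (100, 200)            # typical poly(A) tail lengths
blot_estimate = 2900                 # size from RNA blots


def transcript_length(start: int, end: int) -> int:
    """Length from a negative start to a positive end, assuming no position 0."""
    return end + abs(start)


shortest = transcript_length(max(five_prime_ends), min(three_prime_ends))
longest = transcript_length(min(five_prime_ends), max(three_prime_ends))

with_tail_min = shortest + poly_a_range[0]
with_tail_max = longest + poly_a_range[1]

# Relative deviation of each extreme from the RNA blot estimate.
dev_min = (blot_estimate - with_tail_min) / blot_estimate
dev_max = (blot_estimate - with_tail_max) / blot_estimate

print(f"ORF ends at +{orf_end}")
print(f"Transcript body: {shortest}-{longest} nt")
print(f"With poly(A): {with_tail_min}-{with_tail_max} nt")
print(f"Deviation from blot estimate: {dev_max:.1%} to {dev_min:.1%}")
```

Both extremes fall within 10% of the blot estimate, consistent with the statement in the text.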

DISCUSSION
Functional domains in aminoacyl-tRNA synthetases have been demonstrated by analysis of protein fragments generated by gene deletions of bacterial aminoacyl-tRNA synthetase genes (4, 24). The gene for a eukaryote synthetase affords, in principle, the opportunity to investigate the locations of functional domains within the polypeptide through a determination of the exon/intron structure of the DNA. However, the sequence analysis and the S1 nuclease digestion patterns with specific probes give no evidence for the presence of introns at any point in the long open reading frame of GLN4.
There are 21 reported sequences of aminoacyl-tRNA synthetases (3). These include sequences for 12 different amino acid-specific enzymes from E. coli, Bacillus stearothermophilus, Bacillus caldotenax, and the yeast S. cerevisiae. The strongest homology between enzymes specific for different amino acids is that between E. coli isoleucine and methionine tRNA synthetases (5). In a sequence of 11 amino acids, these two enzymes share 10 identities and one conservative change. This part of the methionine enzyme's structure has been determined, and it corresponds to a segment of the nucleotide fold associated with the ATP/adenylate binding site (8).
From investigations of the primary structure of other aminoacyl-tRNA synthetases, several have been identified to have a sequence that is similar to this segment of the isoleucine and methionine enzymes (5, 25). None, however, has the near perfect match that is found for these two enzymes. In the case of the B. stearothermophilus tyrosine tRNA synthetase, for example, there are only five identities with the isoleucine or methionine sequence in this 11-amino acid segment (5). The structure of the polypeptide backbone of the tyrosine enzyme, however, is superimposable on that of the methionine enzyme over this segment. This confirms, in this instance, the significance of the partial sequence similarity. The E. coli glutamine, tyrosine, and glutamic tRNA synthetases also have segments that are similar to the aforementioned region in isoleucine and methionine tRNA synthetases (2, 5, 25). On the basis of these observations, this segment is a signature sequence for a group of the aminoacyl-tRNA synthetases. Table I tabulates the alignment of nine sequences for five different amino acid-specific enzymes in the region of the signature sequence. The 15 out of 15 identities of the yeast and E. coli glutamine enzymes contrast with the 6 identities over a stretch of 12 codons observed with the E. coli-S. cerevisiae sequence comparisons for either the isoleucine or methionine enzymes. Note that each glutamine enzyme has 8 of 12 identities with E. coli isoleucine tRNA synthetase. The latter enzyme has, in turn, only 6 of 12 identities with its yeast counterpart. Thus, the region of exact match between the two glutamine enzymes is also the section that matches the glutamine enzymes' sequences with those of noncognate aminoacyl-tRNA synthetases.
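Counts such as "8 of 12 identities" or "15 out of 15" can be obtained from any pair of aligned segments by comparing them position by position. The sketch below illustrates this under stated assumptions: the function names are hypothetical, gaps are written as "-", and the two strings are invented placeholders rather than the sequences of Table I.

```python
def count_identities(a: str, b: str, gap: str = "-"):
    """Return (identical, compared) for two aligned strings of equal length.
    Positions where either string has a gap are not compared."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    identical = compared = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        identical += (x == y)
    return identical, compared


def longest_identical_run(a: str, b: str, gap: str = "-"):
    """Return (length, start) of the longest run of consecutive identities."""
    best_len = best_start = 0
    run_len = run_start = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if x == y and x != gap:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_len, best_start = run_len, run_start
        else:
            run_len = 0
    return best_len, best_start


# Invented aligned segments (placeholders only):
seg1 = "ACDEFGHIKLMN"
seg2 = "ACDQFGHIKLMW"
print(count_identities(seg1, seg2))
print(longest_identical_run(seg1, seg2))
```

Applied to a full alignment, the second function would locate a region like the 15-residue exact match, while the first gives the overall identity used to judge how unusual such a run is.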
There are a few other examples of yeast cytoplasmic aminoacyl-tRNA synthetase genes which have been sequenced. These are the aspartyl- (26), histidyl- (27), methionyl- (28), and threonyl-tRNA synthetases (29). The methionine tRNA synthetase has an unequivocal signature sequence. The aspartyl-tRNA synthetase has not been investigated in bacteria so that comparisons of the type described here, between bacterial and yeast enzymes, cannot be made. In the case of HTS1, this locus encodes both the cytoplasmic and mitochondrial forms of His-tRNA synthetase (27). In contrast to the glutamine tRNA synthetase, the yeast histidine enzyme has little sequence similarity to its E. coli counterpart (27, 30).
TABLE I. ... (5) is based upon an 11-amino acid sequence similarity between bacterial aminoacyl-tRNA synthetases specific for different amino acids. This table shows the exact 15-amino acid match between E. coli and S. cerevisiae glutamine enzymes that encompasses the aforementioned 11 amino acids (shown as the last 11 amino acids of each sequence) and additional sequences from S. cerevisiae and E. coli.

The genes for two yeast threonine tRNA synthetases (cytoplasmic and mitochondrial), in addition to the gene for the E. coli enzyme, have been investigated (29, 31, 32). The bacterial and yeast cytoplasmic enzymes are 642 and 734 amino acids, respectively, with almost all of the additional sequences of the yeast cytoplasmic enzyme located in an amino-terminal extension. The mitochondrial enzyme is much shorter, 462 amino acids. The three sequences can be aligned, and in this alignment the shorter length of the mitochondrial protein is largely due to a missing amino-terminal section. While there are several blocks of identical amino acids shared by the three proteins, there are no instances in this system of a 15-amino acid identity between either of the two sequence pairs. The E. coli and yeast methionine tRNA synthetases are 677 and 751 amino acids, respectively (28, 33). The two sequences share a region of several hundred amino acids over which they have approximately 30% identity (34). However, there is no region that has as much as a 15-out-of-15 match. The signature sequence region of the alignment is shown in Table I.
Note that the two sequences are displaced with respect to each other. The yeast enzyme has an amino-terminal extension of 191 amino acids, relative to the E. coli protein, and the E. coli enzyme extends further at its carboxyl end by approximately 100 amino acids. The latter sequences are associated with a domain that is used to form dimers of the E. coli enzyme (the yeast enzyme is a monomer (34,35)). The amino-terminal extension of the yeast methionine tRNA synthetase, which is close in size to that of the yeast glutamine tRNA synthetase, is a stable part of the mature enzyme. However, no elements of similar sequence exist between the two amino-terminal extensions. Because the long amino-terminal extensions are joined to what corresponds to the catalytic portions of the bacterial counterparts, we surmise that in each case these extra sequences are associated with a biological function other than catalysis.