The Primary Structure of Rat Ribosomal Protein L7 THE PRESENCE NEAR THE AMINO TERMINUS OF L7 OF FIVE TANDEM REPEATS OF A SEQUENCE OF 12 AMINO ACIDS*

The covalent structure of rat ribosomal protein L7 was determined in part from the sequence of nucleotides in a recombinant cDNA and in part from the sequence of amino acids in portions of the protein. The complementary analyses supplemented and confirmed each other. Ribosomal protein L7 contains 258 amino acids and has a molecular weight of 30,040. The protein has an unusual and striking structural feature near the NH2 terminus: five tandem repeats of a sequence of 12 residues. Rat L7 appears to be related to ribosomal protein L7 from the moderate halophile Vibrio costicola and perhaps to L30 from Bacillus stearothermophilus, to L7 from the moderate halophile NRCC 41227, and to L22 from Nicotiana tabacum chloroplast. In addition, there is a sequence of 24 amino acids in rat protein L7 that may be related to segments of the same number of residues in Escherichia coli ribosomal proteins S10, S15, L9, and L22.
It is an article of faith that a molecular account of the function of eukaryotic ribosomes will follow from knowledge of the structure and that this has as an indispensable prerequisite information on the chemistry of the constituent proteins and nucleic acids. Progress has been made in this analysis, although a good deal remains to be done (1). The covalent structure of the four species of RNA in rat ribosomes, 5 S (2), 5.8 S (3), 18 S (4, 5), and 28 S (6, 7), has been established. In addition, 84 proteins have been isolated from the particles (8) and the sequence of amino acids in several, P2 (9), L37 (10), and L39 (11), has been determined directly.
This inventory of data is necessary not only for the resolution of the structure of the organelle, but also for analyzing the interaction of the proteins with the nucleic acids. The task of determining the structure of eukaryotic ribosomal proteins is being expedited by the application of recombinant DNA technology. Thus, the structure of a number of rat ribosomal proteins has been determined from recombinant cDNAs. They include S11 (12), S26 (13), S17 and L30 (14), L35a (15), and L19 (16). In a similar way the structure of three mouse ribosomal proteins, L30 (17), L32 (18), and S16 (19), and of the Chinese hamster protein S14 (20) has been established.
We report here the structure of rat ribosomal protein L7, which we have inferred for the most part from the sequence of nucleotides in a recombinant cDNA but which we have completed and confirmed by sequencing portions of the protein.
* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact. This paper is dedicated to the memory of David Vazquez, an esteemed colleague.
There is a second purpose that underlies the analysis of the structure of the components of the ribosome and that is to understand their evolution. It is assumed that an organelle that is at one and the same time universal, essential, and complicated, as the ribosome is, arose on a single occasion. Indeed, the evidence for homology of rRNAs from evolutionarily distant species is substantial (21), albeit the relationship is most easily seen in comparisons of their secondary structures rather than of their nucleotide sequences. In a similar manner, comparison of the sequences of amino acids in ribosomal proteins may inform us concerning the details of their evolution. In addition, identification of conserved amino acid sequences cannot but help in unraveling the function of the proteins. Finally, these comparisons may provide clues as to why the number of proteins has increased from the 52 contained in prokaryotic ribosomes to the 70-80 that are found in eukaryotes without any significant changes in the reactions that the particles catalyze.

RESULTS AND DISCUSSION
The Sequence of Nucleotides in the Recombinant cDNA Encoding Rat Ribosomal Protein L7-A preliminary restriction endonuclease map of the cDNA insert in pL7-2 was prepared, and a set of enzymes was selected that would generate overlapping oligonucleotides suitable for a determination of the sequence (Fig. 3). Nucleotide sequences from both strands of the DNA, and overlapping sequences for each restriction site, were obtained (Fig. 3).
The cDNA insert in pL7-2 contains 826 nucleotides and includes the 5' poly(A) and 3' poly(C) homopolymer linkers, a portion of the 5' noncoding region, and a single open reading frame (Fig. 4). In the other two reading frames the sequence is interrupted by many termination codons. The open reading frame of 756 nucleotides begins at an ATG codon at position 58 and ends with a codon (ATA) for isoleucine. There is no termination codon and no 3' poly(A) sequence. We presume that pL7-2 lacks a 3' end, and hence that the ribosomal protein L7 cDNA insert lacks the nucleotides coding for the carboxyl-terminal amino acids of protein L7, because of a failure to make a full copy during the synthesis of the second strand of the cDNA. This assumption was substantiated later. The context in which the initiation codon occurs, ACCATGG, is the sequence considered optimum for initiation of translation of mRNAs by ribosomes (36, 37).
We note that the first nucleotide of the L7 cDNA (position 36 in Fig. 4; position -22 if the A in the initiation codon ATG is taken as +1) is a cytosine and that it is followed by a sequence of 11 consecutive pyrimidines (i.e. CTCTCTTTTTCC). A similar run of pyrimidines has been found in the 5' untranslated region of many eukaryotic ribosomal protein mRNAs in species as diverse as mammals (16-19) and amphibia (38). The conservation of this tract of pyrimidines at about the same position (i.e. near the initiation codon) strongly suggests that it plays a role in the regulation of the translation of at least some of the mRNAs for eukaryotic ribosomal proteins.
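The numbering and the pyrimidine run described above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the only data taken from the text are the 12-nucleotide string and the positions 36 and 58.

```python
import re

def longest_pyrimidine_run(seq):
    """Return (0-based start, run) for the longest stretch of C/T in seq."""
    best = max(re.finditer(r"[CT]+", seq.upper()),
               key=lambda m: len(m.group()), default=None)
    return (best.start(), best.group()) if best else (None, "")

def relative_to_atg(position, atg_position):
    """Renumber a 1-based position so that the A of ATG is +1 (there is no 0)."""
    offset = position - atg_position
    return offset + 1 if offset >= 0 else offset

tract = "CTCTCTTTTTCC"  # from the text: a cytosine followed by 11 pyrimidines

print(len(tract))                     # 12
print(relative_to_atg(36, 58))        # -22
print(longest_pyrimidine_run(tract))  # (0, 'CTCTCTTTTTCC')
```

The flanking purines in a real mRNA would terminate the run, which is why the search is written for an arbitrary sequence rather than for the tract alone.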
The Primary Structure of Rat Ribosomal Protein L7-The reading frame extends from nucleotide 58 to 813 and encodes a protein of 252 amino acids (Fig. 4). The polypeptide was identified as rat ribosomal protein L7, in the first instance, by positive hybridization in a translation assay. The radioactive product in this reaction migrated on one-dimensional sodium dodecyl sulfate and two-dimensional urea-polyacrylamide gels with authentic L7 (results not shown). The molecular weight calculated from the sequence of amino acids deduced from the DNA sequence was close to that estimated from the migration of the purified protein on sodium dodecyl sulfate gels (39), and the number of individual amino acid residues obtained from the sequence of the cDNA (Table I) approximated the number derived from an analysis of a hydrolysate of L7 isolated from rat ribosomes (39). Thus, we assumed that the pL7-2 insert lacked only the codons specifying a small number of amino acids at the carboxyl terminus.
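The coordinates quoted for the reading frame are mutually consistent, as a short calculation shows. The sketch below is illustrative; the function name is hypothetical, and the figure of 6 missing residues is the value established later in the text from the protein sequence.

```python
def orf_summary(start, end, missing_residues=0):
    """Length in nucleotides, codons encoded, and total residues of the protein.

    start and end are 1-based, inclusive nucleotide positions of an open
    reading frame that has no termination codon.
    """
    length = end - start + 1
    if length % 3:
        raise ValueError("reading frame is not a whole number of codons")
    codons = length // 3
    return length, codons, codons + missing_residues

print(orf_summary(58, 813, missing_residues=6))  # (756, 252, 258)
```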
Since the recombinant cDNA lacked a termination codon, it was necessary to determine the carboxyl-terminal sequence of amino acids in L7 directly. We wished also to confirm the sequence of amino acids deduced from the sequence of nucleotides, to authenticate that pL7-2 encodes ribosomal protein L7, and to establish the identity of the blocked NH2-terminal amino acid. For that purpose, L7 was cleaved with cyanogen bromide and the peptides were isolated (Fig. 2). Protein L7 has 7 methionyl residues (Table I) and, hence, one might expect eight peptides, or seven if one of the methionines is at the NH2 terminus, as seemed most likely from the sequence of nucleotides in pL7-2. In this case, we isolated seven fractions (designated by brackets in Fig. 2). The purity and the identity of the peptides in the fractions were assessed by polyacrylamide gel electrophoresis in sodium dodecyl sulfate, by reaction with dansyl reagent, by determination of the amino acid composition, and from the sequence of amino acids at the NH2 terminus. The results of these analyses and the sequence of amino acids deduced from the sequence of nucleotides in pL7-2 were used as a guide to order the cyanogen bromide peptides. Fraction a has CN3; fraction b, CN2; fraction c, predominantly CN1; fraction d, a mixture of CN6 and CN7; fraction e, CN5; fraction f, CN4; and fraction g, undigested protein L7 (Figs. 4 and 5). The sequence of amino acids in peptides CN1 (6 residues), CN2 (7 residues), and CN3 (3 residues) was obtained by the micro-manual 4-N,N-dimethylaminobenzene 4'-isothiocyanate/phenylisothiocyanate double-coupling procedure (9). Each of the residues corresponds exactly to the sequence of amino acids deduced from the sequence of nucleotides in pL7-2 (compare Figs. 4 and 5).
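The expected number of cyanogen bromide peptides follows from the rule that cleavage occurs on the carboxyl side of each methionine. The sketch below illustrates the counting on a toy sequence. Only the first six residues (Met-Glu-Ala-Val-Pro-Glu) come from the text; the remainder of the toy sequence and the function name are hypothetical. The conversion of methionine to homoserine during cleavage is ignored.

```python
def cnbr_fragments(protein):
    """Split a one-letter sequence on the carboxyl side of every Met (M)."""
    fragments, current = [], ""
    for residue in protein:
        current += residue
        if residue == "M":
            fragments.append(current)
            current = ""
    if current:  # carboxyl-terminal fragment that does not end in Met
        fragments.append(current)
    return fragments

toy = "MEAVPEMGKRMLIK"  # hypothetical: 3 Met, one of them NH2-terminal
fragments = cnbr_fragments(toy)
peptides = [f for f in fragments if len(f) > 1]  # drop the lone NH2-terminal Met

print(fragments)      # ['M', 'EAVPEM', 'GKRM', 'LIK']
print(len(peptides))  # 3
```

With n methionines there are n + 1 fragments in general, but only n peptides when one methionine occupies the NH2 terminus. This is the reasoning behind the expectation of seven peptides from a protein with 7 methionyl residues.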

The sequence of amino acids in CN4 and CN5 (16 residues) was determined by automated liquid-phase Edman degradation. Once again there was exact correspondence between the residues determined directly from the peptides derived from L7 and the amino acid sequence deduced from the sequence of nucleotides in the recombinant cDNA (compare Figs. 4 and 5). The main purpose in undertaking this exercise was to establish the sequence of amino acids at the carboxyl terminus of L7, since the nucleotides encoding these residues were absent from pL7-2. The residues at issue were in CN7, but we could not resolve CN6 and CN7. The expedient that was adopted was to determine the sequence of the two peptides simultaneously using an automated gas-phase sequenator. After each degradation we obtained two phenylthiohydantoin derivatives. We were able to assign each of the pair of residues to either CN6 or CN7 since we had as a guide the order of all of the amino acids of the former and most of the latter from the sequence deduced from pL7-2. Peptide CN6 has 33 residues and CN7 has 26. The first 20 cycles of degradation gave pairs of phenylthiohydantoin derivatives (at cycle 11 only glycine was obtained since that is the amino acid at that position in both CN6 and CN7) that corresponded to the pairs of residues deduced from pL7-2. The amino acids obtained for cycles 21 through 33 were:

-Leu/@-Trp/His-Pro-Phe/Ile-Lys-Leu/&-Ser-Ser-Pro-Arg-Gly-Gly-Met-.
The residues underlined were assigned to CN7 since the other member of the pair is contained in CN6. These are the amino acids not encoded by nucleotides in pL7-2. To confirm that no other amino acids are absent from the protein encoded by pL7-2, we digested L7 with a mixture of carboxypeptidase A and B and established that the carboxyl-terminal sequence is -Ile-Lys-Arg (Fig. 6). Thus, the entire covalent structure of L7 was reconstructed from the sequence of amino acids in the cyanogen bromide peptides derived from the protein and from the sequence of nucleotides in a recombinant cDNA.
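The logic used to read two peptides at once, when the order of one of them is already known, can be sketched as follows. The function name and the toy data are hypothetical and do not reproduce the CN6/CN7 results. The sketch also assumes that both peptides are still being degraded at every cycle supplied, so that a single derivative means the same residue in both.

```python
def assign_second_peptide(cycles, known_peptide):
    """Deduce the unknown peptide from paired Edman cycles.

    cycles: one tuple per cycle holding the one or two residues observed.
    known_peptide: residues of the peptide whose order is already known.
    """
    second = []
    for number, (observed, expected) in enumerate(zip(cycles, known_peptide), 1):
        if expected not in observed:
            raise ValueError(f"cycle {number}: expected {expected}, saw {observed}")
        others = [r for r in observed if r != expected]
        # A single derivative is read as the same residue in both peptides.
        second.append(others[0] if others else expected)
    return second

cycles = [("Ala", "Gly"), ("Gly",), ("Lys", "Ser")]  # toy data
known = ["Ala", "Gly", "Ser"]
print(assign_second_peptide(cycles, known))  # ['Gly', 'Gly', 'Lys']
```

Once the shorter peptide is exhausted, a single derivative belongs to the longer peptide alone. A fuller treatment would have to take the peptide lengths into account.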
Protein L7 has 258 amino acids; 252, all but the carboxyl-terminal 6 residues, are encoded in pL7-2. The molecular weight of the protein derived from the sequence of amino acids is 30,040, which is close to that of 29,200 estimated from the migration of the purified protein in sodium dodecyl sulfate gels (39). The initial residue of L7 is methionine. We know this, even though the amino group of the methionine is blocked, because the NH2-terminal sequence derived from the cDNA is Met-Glu-Ala-Val-Pro-Glu-, whereas the sequence of amino acids in CN1 is Glu-Ala-Val-Pro-Glu-. Peptide CN1 is not blocked, whereas L7 is. Hence, there must be at least 1 residue NH2-terminal to the first glutamic acid to account for the modified amino acid. That can only be methionine. However, we do not know the nature of the chemical modification, although it is most likely to be methylation or acetylation since these are the only known modifications of amino acids in ribosomal proteins (40).
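A molecular weight is obtained from a sequence by summing residue masses and adding one molecule of water. The sketch below shows the calculation and compares the two values quoted in the text. The table holds standard average residue masses rounded to two decimals; it is supplied here as an assumption and is not taken from the paper. The function names are hypothetical, and the NH2-terminal blocking group is ignored.

```python
# Average residue masses in daltons (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.11, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02

def molecular_weight(sequence):
    """Average mass of an unmodified polypeptide given in one-letter code."""
    return sum(RESIDUE_MASS[r] for r in sequence.upper()) + WATER

def percent_difference(value, reference):
    """Signed difference of value from reference, in percent of reference."""
    return 100.0 * (value - reference) / reference

print(round(percent_difference(30040, 29200), 1))  # 2.9
print(round(30040 / 258, 1))                       # 116.4 (mean residue weight)
```

The two estimates thus differ by about 3%.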
Comparison of the Sequence of Amino Acids in Rat L7 with That in Ribosomal Proteins from Other Species-The sequence of amino acids in the rat ribosomal protein L7 was compared, using the computer program RELATE (26), to the sequences of amino acids in 280 other ribosomal proteins contained in a library that we have compiled. The highest score, computed as the distance in standard deviations between the actual comparison and the mean of 100 comparisons of randomized sequences, was 3.0 for the ribosomal protein L7 from the moderate halophile Vibrio costicola. A score of at least 3.0 is ordinarily required to assign significance to a value (41). Scores that we interpret as indicating the possibility of a relationship were obtained for comparisons with Bacillus stearothermophilus L30 (2.8), with the moderate halophile NRCC 41227 L7 (2.6), and with Nicotiana tabacum chloroplast L22 (2.7). Although it is difficult to appraise the significance of these values, we note that these ribosomal proteins span a large evolutionary distance from eubacteria (B. stearothermophilus L30), to archaebacteria (V. costicola L7 and NRCC 41227 L7), to plant chloroplasts (N. tabacum L22), to eukaryotes (rat L7). Furthermore, V. costicola L7 is related to both the prokaryotic ribosomal "A" proteins (30), of which Escherichia coli L7/L12 is the prototype (32), and to rat L7.
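The score described above, a distance in standard deviations between the actual comparison and the mean of comparisons of randomized sequences, can be illustrated with a small script. This is a sketch under stated assumptions, not the RELATE program. The scoring function (a count of identical residues at corresponding positions) is a stand-in chosen for brevity, and all names are hypothetical.

```python
import random
import statistics

def identity_score(a, b):
    """Stand-in score: identical residues at corresponding positions."""
    return sum(x == y for x, y in zip(a, b))

def shuffle_z_score(a, b, score=identity_score, trials=100, seed=0):
    """(actual - mean of randomized scores) / their standard deviation."""
    rng = random.Random(seed)
    actual = score(a, b)
    randomized = []
    for _ in range(trials):
        # Scrambling keeps the composition of each sequence but not its order.
        shuffled_a = "".join(rng.sample(list(a), len(a)))
        shuffled_b = "".join(rng.sample(list(b), len(b)))
        randomized.append(score(shuffled_a, shuffled_b))
    spread = statistics.stdev(randomized)
    if spread == 0:
        raise ValueError("randomized scores do not vary")
    return (actual - statistics.mean(randomized)) / spread
```

A call such as `shuffle_z_score(seq_a, seq_b)` returns the score in standard deviation units, which is then compared with the threshold of 3.0 mentioned in the text.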
To ascertain if L7 and the other ribosomal proteins in the library contained similar partial sequences, we examined the spectrum of count scores, calculated by RELATE (Table II), as we had done before (10, 16). For a given fragment score, n, the program determines the number of fragments with scores exceeding n for both the real and randomized comparisons. The difference is expressed as a distance and is given in standard deviation units. In the comparison of two proteins with a short conserved region one would expect only a few high scores. Unfortunately, the value at which the spectrum of count scores assumes statistical significance has not been established and there is no assurance that 3.0 standard deviations, for example, is of consequence. For that reason, we have used these scores only to identify protein fragments that might show homology and then have aligned the fragments by eye. The fragments from four E. coli ribosomal proteins (S10, S15, L9, and L22) showed similarity with one segment of rat L7 (Fig. 7). Of the 24 residues in this segment (amino acids 106-129), 20 occurred at the same position as identical or related amino acids (lysine and arginine, or isoleucine, leucine, and valine, or serine and threonine) in one or more of the E. coli fragments. For S10 there were 10 identities, for S15 10, for L9 7, and for L22 11. In addition, at three of the remaining four positions there were identities amongst the E. coli proteins that did not appear in rat L7; thus, at only 1 of the 24 positions are there unrelated amino acids in the four fragments (Fig. 7). It is noteworthy that this fragment of rat L7 (residues 106-129) contains 7 basic (arginine and lysine) and 10 hydrophobic (isoleucine, leucine, and valine) amino acids and that they account for most of the identities (see also later). Finally, high scores were obtained for a comparison of separate fragments from rat L7 and B. stearothermophilus S9 and for rat L7 and N. tabacum L2 (Table II).
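Counting identical and related residues in two aligned fragments is straightforward once the groups of related amino acids are fixed. The sketch below uses the three groups named in the text (lysine/arginine; isoleucine/leucine/valine; serine/threonine). The function names and the toy fragments are hypothetical and are not the sequences of Fig. 7.

```python
RELATED_GROUPS = [set("KR"), set("ILV"), set("ST")]  # groups named in the text

def is_related(x, y):
    """True for identical residues or two members of the same group."""
    return x == y or any(x in g and y in g for g in RELATED_GROUPS)

def count_matches(a, b):
    """Return (identities, related-but-not-identical) for aligned fragments."""
    if len(a) != len(b):
        raise ValueError("fragments must be aligned and of equal length")
    identical = sum(x == y for x, y in zip(a, b))
    related = sum(is_related(x, y) for x, y in zip(a, b)) - identical
    return identical, related

print(count_matches("KLSAV", "RISGV"))  # (2, 2)
```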
These results are reminiscent of earlier findings that a segment in rat ribosomal protein L37 and in yeast YP55 might be related to sequences in three E. coli proteins, S4, L20, and L34 (10), and that a portion of rat L19 is, perhaps, related to amino acid sequences in E. coli ribosomal proteins S2, L18, and L30 (16). An analogous observation has been made for a fragment of rat PT3. The occurrence in four rat ribosomal proteins of sequences related to segments of a number of proteins from E. coli or from other species suggests this is not uncommon and not without significance. Unfortunately, it is difficult to assess statistically the reliability of these partial sequence identities using computer programs that only perform pairwise comparisons. The putative "homologies" could be fortuitous, the consequence of comparing many ribosomal protein sequences, or they could reflect the process of evolution within this subset of proteins and perhaps the conservation of a shared function. To distinguish between these possibilities, we are attempting to devise algorithms that allow comparison of several ribosomal proteins simultaneously and provide estimates of the statistical significance of the comparisons. As we have pointed out before (10, 16), the import of these findings is in the support that they lend to the suggestion (42-44) that ribosomal proteins evolved by repeated duplication of a small number of short peptides. The possibility is that the common fragments serve a common function in several proteins, for example, involvement in binding to ribosomal RNA. Surrounding regions may have diverged too far to allow detection of a similarity once there. Alternatively, the surrounding portions of the protein may be unrelated, having arisen from different ancestral fragments.
The Presence of Repeat Sequences in Rat L7-The RELATE program was also employed to search for internal duplications in rat ribosomal protein L7. For the test, all possible pairs of fragments of a prescribed length were compared and the significance of the derived value established by repeating the comparison 100 times with sequences formed by scrambling the amino acids comprising protein L7 (41). For a length of 20 amino acids, the score for the comparison of all possible pairs of fragments within L7 was 4.8 standard deviations greater than the mean score of the randomized sequence comparisons. The raw data generated by RELATE include a list of the segments showing the greatest similarity as well as the number of amino acids separating them, i.e. the displacement between fragments. For rat L7, multiples of 12 amino acids predominated. The fragments yielding the highest scores were at the NH2 terminus, forming five tandem repeats of 12 residues each (Fig. 8). Within the five tandem repeats of 12 residues there are two longer tandem repeats of 22 amino acids. They are at positions 3-24 and 25-46 (Table III). There are 11 identities (if we include arginine/lysine and leucine/valine pairs) in the two fragments. Thus, the similarity between the short repeats is more striking than for the longer ones. Finally, there is another possible tandem repeat in L7 of 20 amino acids at positions 138-157 and 158-177 (Table III), although in this instance the number of like residues is only 5.
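The way in which a predominant displacement reveals tandem repeats can be illustrated with a simplified search. The sketch below is not RELATE: it scores window pairs by counting identities (an assumption made for brevity), the names are hypothetical, and the test sequence is an artificial 12-residue unit repeated five times rather than the sequence of L7.

```python
from collections import Counter

def displacement_profile(seq, window, min_identities):
    """Count similar window pairs for each displacement between them.

    Every pair of windows (i, j) with i < j is compared position by
    position; pairs with at least min_identities identical residues are
    tallied under the displacement j - i.
    """
    profile = Counter()
    last_start = len(seq) - window
    for i in range(last_start + 1):
        for j in range(i + 1, last_start + 1):
            same = sum(seq[i + k] == seq[j + k] for k in range(window))
            if same >= min_identities:
                profile[j - i] += 1
    return profile

unit = "ACDEFGHIKLMN"  # artificial unit of 12 distinct residues
profile = displacement_profile(unit * 5, window=12, min_identities=12)
print(sorted(profile))  # [12, 24, 36, 48]
```

For a perfect tandem repeat only multiples of the unit length appear in the profile. In a real protein the threshold would be lowered and the result compared with scrambled sequences, as described in the text.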
Although the tandem repeats are an unusual and striking feature of the structure of L7, we are unable to assign a function to them. This is at least in part because we know nothing of the function of L7 itself. Repeat sequences of amino acids in ribosomal proteins have been reported before, although they are not common. E. coli protein S1, which is required for the binding of mRNA to the ribosome during the initiation of protein synthesis (45), and which as the α-subunit of the Qβ replicase is involved in the transcription of the plus strand of the RNA phage, has multiple internal repeats (46). The exact number of the repeats and their length are not certain; either six of 87 residues or 12 of 44 residues are possible. The function of the repeats is not known, although the suggestion has been made, without evidence to support it, that they are involved in binding mRNA to the ribosome and phage RNA to the catalytic subunit of Qβ replicase. There may be internal repeats in E. coli ribosomal protein L2 also (47), but they have not been analyzed so extensively and, hence, are not as certain as those in S1.
One way to secure leads to the function of a protein, or a segment thereof, is to examine its higher order structure. The most desirable data come from x-ray diffraction of crystals, but in the absence of this information it is sometimes useful to examine predictions of the secondary structure (48). The repeats are characterized by a concentration of basic residues (arginyl or lysyl) in the NH2-terminal half and by hydrophobic residues (alanyl, phenylalanyl, valyl, and leucyl) in the remainder (Fig. 8). The several programs we have used vary in their predictions of the secondary structure of protein L7. However, there is agreement that the region of the five tandem repeats is likely to have a good deal of α-helical structure (Fig. 9). The distribution of amino acids in the repeats and the possibility that they have an α-helical conformation led us to consider that they might be amphipathic and, hence, capable of inserting into the lipid layer of the rough endoplasmic reticulum. In this way, L7 and more specifically the tandem repeats might be responsible for positioning the large subunit of the ribosome on the membrane. As a first approximation to a test of the proposition we examined Edmundson helical wheels (53) of the region of the tandem repeats (positions 7-66) to determine whether there was segregation of hydrophilic and hydrophobic residues. The results did not lend support to the proposal, although it is still possible that the tandem repeats participate in the positioning of the ribosome on the endoplasmic reticulum as a reflection of some other aspect of their structure. The tandem repeats may also play a role in the association of L7 to one or more of the rRNAs, although there is no direct evidence to support the suggestion.
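A helical wheel places successive residues of an α-helix 100° apart around a circle (3.6 residues per turn), so that segregation of hydrophobic residues onto one face can be judged by eye. The sketch below computes the wheel angles and adds a simple numerical summary: the length of the mean unit vector of the hydrophobic residues. That summary is an assumption introduced here for illustration and is not the procedure used in the paper. The names are hypothetical, and the hydrophobic set is the one listed in the text.

```python
import math

HYDROPHOBIC = set("AFVL")  # alanyl, phenylalanyl, valyl, leucyl (from the text)

def wheel_angles(length, step=100.0):
    """Angle in degrees of each residue on the wheel (100 degrees per residue)."""
    return [(i * step) % 360.0 for i in range(length)]

def hydrophobic_face_bias(segment, step=100.0):
    """Length (0 to 1) of the mean unit vector of the hydrophobic residues.

    Values near 1 mean the hydrophobic residues cluster on one face;
    values near 0 mean they are spread around the helix.
    """
    angles = [math.radians(i * step)
              for i, residue in enumerate(segment) if residue in HYDROPHOBIC]
    if not angles:
        return 0.0
    x = sum(math.cos(a) for a in angles) / len(angles)
    y = sum(math.sin(a) for a in angles) / len(angles)
    return math.hypot(x, y)

print(wheel_angles(5))  # [0.0, 100.0, 200.0, 300.0, 40.0]
```

Applied to a window sliding along a sequence, a measure of this kind gives a rough indication of amphipathic character, although it is no substitute for inspection of the wheels themselves.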
In addition to the five tandem repeats near the NH2 terminus of L7, there are variations of this sequence of 12 amino acids at positions 84, 124, 175, and 230 (Fig. 8). These four have less identity to each other or to the tandem repeats than the latter have to each other; still, they appear to us to be related. Like the tandem repeats, they have a cluster of basic residues in the NH2-terminal half, a tendency for the carboxyl half to be hydrophobic and to contain phenylalanine and threonine, and valine or leucine at position 12. The major difference is that the alignment of like residues is not nearly so exact as in the tandem repeats. Nonetheless, there may be as many as nine repeats of this 12-residue segment in rat ribosomal protein L7 and they are unlikely to be without functional significance.