Sequence Analysis of Protamine mRNA from the Rainbow Trout DEPURINATION AND NEAREST NEIGHBOR ANALYSIS OF PROTAMINE cDNA*


Protamine cDNA, which was a full length copy of protamine mRNA, was labeled during its synthesis by using deoxynucleoside [α-³²P]triphosphates. Depurination analysis showed that there were 19 different pyrimidine oligonucleotides in protamine cDNA, some of which contained isomeric sequences. The stoichiometry of the pyrimidine oligonucleotides indicated that, while some sequences probably occur in each of the protamine mRNA components, other sequences are clearly absent from one or more of the components. Several of the pyrimidine oligonucleotides had sequences consistent with the amino acid sequences of the rainbow trout protamines. The longest oligopyrimidine tract, C7T4, had a complementary RNA sequence of AGGAGAGGAGG, a stoichiometry of close to 1, and fitted the amino acid sequence Arg-Arg-Gly-Gly which occurs near the COOH terminus of each of three major protamine components. Other pyrimidine oligonucleotides analyzed were complementary to RNA sequences from the noncoding region of protamine mRNA. There appears to be no preferential use of one particular arginine codon or set of codons. Of the 21 to 22 arginine codons in protamine mRNA no less than 7 and no more than 12 are of the CGX series. The other two codons, AGA and AGG, both occur but not in a series of more than two together. This indicates that the RNA sequences coding for the arginine tracts tend to contain a mixture of arginine codons. Nearest neighbor frequency analysis of protamine cDNA gives a low value for the frequency of the CpG doublet, despite its occurrence in four out of the six arginine codons. This is in accordance with the observation that the sequence CpG is surprisingly rare in vertebrate DNA and in the RNA transcribed from it.
The nucleotide sequences of several eukaryotic mRNAs are currently being elucidated. Partial sequences are known for human α- and β-globin mRNAs (1-3), rabbit α- and β-globin mRNAs (3-6), and mouse immunoglobulin light and heavy chain mRNAs (7,8). Sequence information about the noncoding region may indicate structural features which are important for the maturation, transport and/or translation of mRNAs. Sequences from the coding region, in addition to verifying the purity of the mRNA, can provide information about the evolution of the structural gene.

* This work was supported by the Medical Research Council of Canada.
‡ Postdoctoral Fellow of the Medical Research Council.
§ To whom correspondence should be addressed.
The mRNA for protamine from the rainbow trout is very amenable to sequence analysis since it is readily purified and can be isolated in appreciable quantities (9). Furthermore, its length of 290 nucleotides is considerably less than that of globin and immunoglobulin mRNAs, and the amino acid sequences of the three major protamines are known (10).
One of the difficulties of sequencing eukaryotic RNAs in general and mRNAs in particular has been to incorporate sufficient radioactive label into the molecule (11). The enzyme reverse transcriptase (12) affords a way around this problem by enabling a highly labeled complementary DNA copy to be made to poly(A)-containing mRNA primed with oligo(dT). With the recent advances in DNA sequencing techniques (13) considerable information can be obtained directly from the cDNA without the need to prepare a complementary RNA to the cDNA (5).
This report presents sequences of oligopyrimidine tracts released from protamine cDNA after depurination.
In addition, the fact that the ³²P-labeled protamine cDNAs prepared with reverse transcriptase were full length copies of the mRNA enabled the stoichiometries of the pyrimidine nucleotides to be determined.
Information on the type of arginine codons used in protamine mRNA was deduced partly from nucleotide sequences and partly from nearest neighbor frequency analysis.

³²P-labeled oligopyrimidines (greater than 2,000 Cerenkov cpm) extracted from DEAE-cellulose thin layers were digested with dialyzed bacterial alkaline phosphatase (60 µg) in 10 µl of 10 mM Tris/HCl (pH 8.9) for 2 h at 37°. The digestion products were separated on DEAE-paper (Whatman DE81) by electrophoresis in 7% formic acid (pH 1.9) at 3 kV for 70 min. The labeled products, located by autoradiography, were cut out and counted in toluene/Omnifluor.

Nearest Neighbor Frequency Analysis - Nearest neighbor frequencies for each of the four protamine cDNAs individually labeled with one of the deoxynucleoside [α-³²P]triphosphates (dATP, dCTP, dGTP, or dTTP) were determined by the procedure of Kleppe et al. (20). The 3'-mononucleotides from protamine cDNA (1 to 6 x lo" Cerenkov cpm) were separated by electrophoresis, located by autoradiography, cut out, and quantitated by counting in toluene/Omnifluor.

Catalogue of Pyrimidine Tracts in Protamine cDNA - Reverse transcriptase is capable of synthesizing a full length single-stranded transcript from protamine mRNA primed with oligo(dT), as reported by Iatrou and Dixon.³ Fig. 1 shows that T-labeled⁴ protamine cDNA was equal in length to protamine mRNA. When shorter transcription products occurred in other preparations of protamine cDNA they were present in minor amounts and were readily removed by preparative polyacrylamide gel electrophoresis.
Depurination of full length C- and T-labeled protamine cDNAs therefore gave a complete catalogue of the pyrimidine tracts in stoichiometric amounts. The base composition of each oligopyrimidine was deduced, after the two-dimensional separation shown in Fig. 2, from its position on the fingerprint in accordance with the observations of Ling (19). Orientation to the graticule was helped by the absence of pTp, pTTp, and pTTTp from the C-labeled series and the corresponding absence of pCp, pCCp, and pCCCp from the T-labeled series. The identities of these mono-, di-, and trinucleotides were confirmed by digestion with bacterial alkaline phosphatase. Release of free ³²Pi was 100% from the mononucleotides, 50% from the dinucleotides, and 30 to 35% from the trinucleotides.

The stoichiometry of the oligopyrimidine tracts in the cDNAs was calculated for both the C-labeled and T-labeled series in Fig. 2 in the following way. For the C-labeled series,

\[
\text{number of times a particular oligopyrimidine appears in protamine cDNA}
= \frac{\text{cpm in the oligopyrimidine}}{\text{sum of the cpm in all the oligopyrimidines}}
\times \frac{\text{number of C residues in protamine cDNA}}{\text{number of C residues in the oligopyrimidine}}
\]

The corresponding calculation was done for the T-labeled series. However, it was necessary to correct the number of thymidylate residues in protamine cDNA for the oligo(dT) region complementary to the poly(A) tract. The values for the number of deoxycytidylate and thymidylate residues in protamine cDNA were taken to be 75 and 66, respectively, equivalent to the number of guanosine and adenosine residues, respectively, in protamine mRNA corrected for the length of the poly(A) tract. These values for the number of guanosine and adenosine residues were calculated knowing the base composition (29.9% adenosine; 25.9% cytidine; 25.8% guanosine; and 18.4% uridine) (22), the total length of protamine mRNA (290 residues), and the length of the nonadenylated portion of protamine mRNA (270 residues).

⁴ T-, C-, G-, and A-labeled protamine cDNAs refer to protamine cDNAs into which ³²P label was introduced by replacing unlabeled dTTP, dCTP, dGTP, or dATP, respectively, with the corresponding α-³²P-labeled nucleotide.

FIG. 1 (legend fragment). The aliquot was removed prior to alkaline hydrolysis of the template. Protamine mRNA (15 µg), prepared as described under "Methods," was run in Slot C. The protamine mRNA was made visible on staining with a 0.005% solution of Stains-all in 50% formamide (21). The gel markers and protamine cDNA were detected on autoradiographs which overlay their respective slots (A and B).
Values for the stoichiometry of the oligopyrimidine tracts in Table I show good agreement, both between duplicate experiments and between the C- and T-labeled series. The longer oligopyrimidines C7T4, C4T2, C,T,, and CT, have values close to unity. Shorter oligopyrimidines such as C2 and T, clearly occur several times in protamine cDNA, while other oligopyrimidines such as C,T, and C&T, fall well below unity.
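As a worked illustration of the stoichiometry calculation described under "Results," the following Python sketch applies the formula to a single spot. The function name and the cpm value assigned to the spot are hypothetical; only the residue totals (75 C and 66 T), the base composition, the mRNA lengths, and the summed Cerenkov counts for Experiment 1 of the C-labeled series (257,540) are taken from the text.

```python
def tract_stoichiometry(spot_cpm, total_cpm, label_residues_in_cdna, label_residues_in_tract):
    """Copies of an oligopyrimidine per cDNA molecule, from its share of the label."""
    fraction_of_label = spot_cpm / total_cpm
    return fraction_of_label * label_residues_in_cdna / label_residues_in_tract


# Residue totals quoted in the text for protamine cDNA (T corrected for the poly(A) tract).
C_IN_CDNA = 75
T_IN_CDNA = 66

# Cross-check of those totals from the mRNA base composition quoted in the text.
MRNA_LENGTH = 290
POLY_A_LENGTH = MRNA_LENGTH - 270                # 20 residues
g_in_mrna = 0.258 * MRNA_LENGTH                  # about 74.8, i.e. 75 C in the cDNA
a_in_mrna = 0.299 * MRNA_LENGTH - POLY_A_LENGTH  # about 66.7; the text uses 66

# Hypothetical spot: C7T4 (7 C residues) from the C-labeled series.
total_cpm = 257_540
spot_cpm = 24_000  # illustrative value only, not a measured one
copies = tract_stoichiometry(spot_cpm, total_cpm, C_IN_CDNA, 7)
print(f"C7T4: {copies:.2f} copies per cDNA")
```

A tract that occurs once per cDNA and contains 7 of the 75 C residues should carry 7/75 of the total label, which is the case the function returns as 1.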

Sequence Analysis of Pyrimidine Tracts - Deduction of the nucleotide sequence of the oligopyrimidines from autoradiographs of partial exonuclease digests was relatively easy because the angle of the jump between spots is greatest when deoxycytidylate and thymidylate mononucleotides are removed. Sequence analysis was facilitated by knowing the base composition of the oligopyrimidines and confirmatory information was obtained from alkaline phosphatase digests. In Fig. 3 partial snake venom phosphodiesterase digestion of C7T4 gave the sequence (C3T)CTCTCCT. Partial spleen phosphodiesterase digestion gave the sequence CCTCCTC(C2T2), resulting in an overlap of 3 residues to give the complete sequence CCTCCTCTCCT.
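The way the two partial digests combine can be sketched in a few lines of Python. This is only an illustration of the overlap argument, not the procedure used to read the autoradiographs; the function name is hypothetical, and the reads are the resolved portions of the two sequences quoted above (the 5' portion from the spleen phosphodiesterase digest, the 3' portion from the snake venom phosphodiesterase digest).

```python
from collections import Counter


def merge_end_reads(five_prime_read, three_prime_read, total_length):
    """Join a 5'-end read and a 3'-end read of one oligonucleotide of known length."""
    overlap = len(five_prime_read) + len(three_prime_read) - total_length
    if overlap < 0:
        raise ValueError("reads do not span the whole oligonucleotide")
    if overlap and five_prime_read[-overlap:] != three_prime_read[:overlap]:
        raise ValueError("reads disagree in the overlapping region")
    return five_prime_read + three_prime_read[overlap:]


# C7T4 is an undecanucleotide; each digest resolved seven residues.
full_sequence = merge_end_reads("CCTCCTC", "CTCTCCT", total_length=11)
composition = Counter(full_sequence)

print(full_sequence)                       # CCTCCTCTCCT
print(composition["C"], composition["T"])  # 7 4
```

The composition check at the end mirrors the use of the known base composition as an independent constraint on the deduced sequence.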
In Fig. 4 analysis of C2T4 with spleen phosphodiesterase showed that this oligopyrimidine was a mixture of isomers CTTTC(T) and TCTTC(T). The sequence at the 3' end obtained from snake venom phosphodiesterase digestion, (CT2)TCT, was unambiguous.
Isomeric sequences were also seen in C,TS and C,T, and were confirmed when these oligopyrimidines, isolated from C-labeled protamine cDNA, were digested with alkaline phosphatase and were shown to release a portion of their label as ³²Pi. The nucleotide sequences deduced for the oligopyrimidines are presented in Table II.
Additional sequence information can be obtained by determining the base adjacent to the 3' end of the oligopyrimidines in the following manner. The oligopyrimidines released by depurination in the presence of diphenylamine have phosphate groups at both ends. The phosphate group at the 5' end originates from the α position of the pyrimidine nucleotide at the 5' end. However, the phosphate group at the 3' end originates from the α position of the purine nucleotide which was adjacent to this end but which was removed by depurination. The identity of this base, whether adenine or guanine, was determined by separately depurinating and fingerprinting A-labeled and G-labeled protamine cDNAs as shown in Fig. 5. Oligopyrimidines show up on the autoradiograph of the A-labeled series if adenine was the base adjacent to the 3' end or on the autoradiograph of the G-labeled series if guanine was the base adjacent to the 3' end. The shorter sequences which tend to occur several times in protamine cDNA have both adenine and guanine nearest neighbors to their 3' ends, whereas the longer sequences tend to have either adenine or guanine. The oligopyrimidines C1T3, C1T4, C,T,, and C3T2 appear in the A-labeled series but not in the G-labeled series. Conversely C2T3, C2T4, C,T, C,T,, and C&T, appear in the G-labeled series but not in the A-labeled series. T3, which has a stoichiometry of 1.52 in protamine cDNA, appears mainly in the A-labeled series but also to a lesser extent in the G-labeled series. This indicates that the sequence ATTT is the major one and that GTTT occurs less frequently, perhaps in only one of the protamine mRNA components.
In accordance with this observation the sequence ATTT has been seen in an endonuclease IV fragment of protamine cDNA which is believed to occur in each of the protamine mRNA components.⁵ The oligopyrimidine sequences extended by nearest neighbor considerations are shown in Table III together with their complementary mRNA sequences and possible coding assignments. The isomeric sequences pCTp and pTCp were shown to be present in equal amounts. Digestion with alkaline phosphatase of these pyrimidine dinucleotides from either C- or T-labeled protamine cDNA liberated 50% of the total cpm as ³²Pi. Several of the RNA sequences tabulated, such as AGAG, have been seen in pancreatic RNase fragments of protamine mRNA sequenced after iodination of the mRNA⁶ or following postlabeling of the fragments with polynucleotide kinase and [γ-³²P]ATP (22). Other sequences have [...] Table III must come from the noncoding region of protamine mRNA since in any reading frame they are inconsistent with the known amino acid sequences of the three major protamine components from the rainbow trout (10). By the same criterion other RNA sequences are consistent with a location in the coding region, although this does not prove that they occur there rather than in the noncoding region. The longest oligopyrimidine tract from protamine cDNA, C7T4, is complementary to AGGAGAGGAGG which in turn can code for Arg-Arg-Gly-Gly. This is an amino acid sequence which occurs towards the COOH terminus in each of the protamine components. Nearest neighbor considerations place a cytidine residue at the 5' end of the oligoribonucleotide.
Since in each of the protamine components an arginine residue precedes the sequence Arg-Arg-Gly-Gly, it follows from a knowledge of the arginine codons that CG must precede the dodecanucleotide listed in Table III to give a total sequence CGCAGGAGAGGAGG(:).
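The coding assignment for the longest tract can be checked with a short Python sketch. It is illustrative only: the function names are hypothetical, the codon table is deliberately restricted to the arginine and glycine codons of the standard genetic code, and a trailing GG is read as glycine on the grounds that GGN codes for glycine whatever the third base.

```python
DNA_TO_RNA_COMPLEMENT = {"A": "U", "C": "G", "G": "C", "T": "A"}

# Only the codons needed for this example (standard genetic code).
CODONS = {
    "CGU": "Arg", "CGC": "Arg", "CGA": "Arg", "CGG": "Arg",
    "AGA": "Arg", "AGG": "Arg",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
}


def complementary_rna(cdna):
    """RNA strand complementary to a cDNA tract, written 5' to 3'."""
    return "".join(DNA_TO_RNA_COMPLEMENT[base] for base in reversed(cdna))


def translate(rna):
    """Translate from the first base; a trailing GG is read as Gly (GGN)."""
    residues = [CODONS.get(rna[i:i + 3], "???") for i in range(0, len(rna) - 2, 3)]
    remainder = len(rna) % 3
    if remainder and rna[-remainder:] == "GG":
        residues.append("Gly")
    return "-".join(residues)


undecanucleotide = complementary_rna("CCTCCTCTCCT")
print(undecanucleotide)                    # AGGAGAGGAGG
print(translate(undecanucleotide))         # Arg-Arg-Gly-Gly
print(translate("CGC" + undecanucleotide))  # Arg-Arg-Arg-Gly-Gly
```

Prefixing CG and the cytidine placed by nearest neighbor analysis puts a CGC arginine codon in frame with the undecanucleotide, which is the extension argued for in the text.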
Nearest Neighbor Frequency Analysis - The availability of full length A-, C-, G-, and T-labeled protamine cDNAs enabled the nearest neighbor frequencies to be calculated for all of the 16 doublets as shown in Table IV. The values for ApT, CpT, GpT, and TpT obtained from T-labeled protamine cDNA were distorted by the large amount of TpT in this sample which arose from extension of the oligo(dT) primer in a region complementary to the poly(A) tract. Therefore, each of the values in this set was calculated from the three other sets. Thus, for example, ApT = 100% - (ApA + ApC + ApG). The number of times each doublet occurs in the nonadenylated portion of protamine mRNA can be obtained from the figures in Table IV by

Depurination of highly labeled DNA complementary to protamine mRNA released short oligopyrimidine tracts which were suitable for nucleotide sequence analysis. In addition, the availability of full length protamine cDNA enabled some information to be deduced which could not come from partial transcripts. Specifically, this information concerned the stoichiometry of the oligopyrimidine tracts in protamine cDNA and the nearest neighbor frequencies of the bases.
The same 19 oligopyrimidines were consistently released by depurination from different preparations of C- and T-labeled protamine cDNAs in a reproducible ratio (Table I). This vouches for the fidelity of transcription.
An average stoichiometry of close to 1/DNA copy was obtained for the longer oligopyrimidines, CT,, C,T,, and C,T,, which were subsequently shown to be unique sequences (Table II). However, the stoichiometry of other oligopyrimidines, such as CT,, C,T, and C,T,, falls clearly below 1. Furthermore, two of these oligopyrimidines (CT, and C,T,) are each a mixture of at least two isomeric sequences (Table II). A stoichiometry for these sequences of less than unity is consistent with protamine mRNA being a mixture of mRNA components. Heterogeneity in protamine mRNA is to be expected since protamine itself, as isolated from the rainbow trout (10) or as translated from protamine mRNA in a cell-free assay (9), is made up of at least three polypeptides. Although these polypeptides are closely related in both length (32 to 33 residues) and amino acid sequence, coding assignments dictate that they must in part have different nucleotide sequences. Further heterogeneity in protamine mRNA, aside from that introduced by the variable length of the poly(A) tract, could occur in the length and nucleotide sequence of the noncoding region. Recent observations⁷ indicate that protamine mRNA can, in fact, be subfractionated into four components on the basis of length after prolonged electrophoresis in denaturing 6% polyacrylamide gels.

TABLE I. Oligopyrimidines from C-labeled and T-labeled protamine cDNAs were prepared as described in the legend to Fig. 2. The ³²P-labeled oligopyrimidines were recovered from the DEAE-cellulose thin layers as described under "Methods" and the radioactivity of each was measured by Cerenkov counting. The stoichiometry of the oligopyrimidine tracts was calculated from these data as described under "Results." Duplicate values are presented for both the C-labeled and T-labeled series. The sum of Cerenkov counts in the spots from each plate was 257,540 and 262,870 for Experiments 1 and 2 of the C-labeled series, and 162,970 and 557,234 for Experiments 1 and 2 of the T-labeled series.
While sequence heterogeneity must exist between the protamine mRNA components, sequence homology can also be expected to occur, both in the coding region, because of the similar amino acid sequences of the rainbow trout protamines (10), and in the noncoding region, where substantial sequence homology has previously been observed between eukaryotic mRNAs (3,23). For these reasons the unique oligopyrimidines such as C1T5, C,T,, and C,T, which have a stoichiometry of close to 1 are likely to represent homologous regions in each of the protamine mRNA components. In the case of C7T4 the complementary RNA sequence fits the amino acid sequence Arg-Arg-Gly-Gly which occurs in each of the three major protamine components. Although the same RNA sequence could occur twice in one protamine mRNA component, once in the coding region and once in the noncoding region, and be absent from another component to give an average stoichiometry of 0.87, this is the less likely of the two explanations. Based on this rationale, CCTCCTCTCCT is presently being synthesized chemically for use as a specific primer to initiate reverse transcription within the coding region of each protamine mRNA component.

⁷ L. Gedamu, K. Iatrou, and G. H. Dixon (1977) Cell, in press.

FIG. 3. Sequence analysis of the oligopyrimidine C7T4. Partial digests with snake venom phosphodiesterase (SVPD digest) and spleen phosphodiesterase (SPD digest), and separation of the digestion products in two dimensions were done as described under "Methods." The digests contained the alkaline phosphatase-treated oligonucleotide from both C-labeled and T-labeled protamine cDNAs mixed together. The autoradiographs of the separated digestion products are marked between spots with C or T to indicate the nucleotide (dCMP or dTMP, respectively) removed by the exonuclease.

FIG. 4. Sequence analysis of the oligopyrimidine C2T4. Sequence analysis was performed on this oligonucleotide as described in the legend to Fig. 3.
The protamines are unusual polypeptides in that two-thirds of their amino acid residues are arginine and these arginine residues occur in blocks of up to 6 residues long. The arginine codons along with those for serine and leucine are the most degenerate in the genetic code. Of the six codons which specify arginine, four are of the CGX series and the other two are AGA and AGG. It is of interest to know which, if any, of these codons are used predominantly in the protamine genes. The information contained in this report indicates that neither the CGX series nor the AG(A/G) codons predominate and that both AGA and AGG codons are used.
The oligopyrimidine tracts released from protamine cDNA after depurination are complementary to the purine tracts in the protamine mRNA. A sequence of AGA and AGG codons would appear as an unbroken oligopyrimidine tract in the cDNA. Exclusive use of either AGA or AGG codons, or both, for arginine would give rise to numerous, long oligopyrimidine tracts in protamine cDNA up to at least 18 nucleotides in length, corresponding to the arginine tracts in the protamines. This is not the case since protamine cDNA contains only one oligopyrimidine longer than a hexanucleotide. The sequences determined for the oligopyrimidine tracts (Table III) show no more than two AGA or AGG codons are used in series. This is the case in the sequences AGAAGA, AGGAGG, and AGGAGA. The latter sequence is a portion of the undecanucleotide coding for Arg-Arg-Gly-Gly. Conversely, the CGX codons are clearly not used exclusively for arginine since some AGG and AGA codons occur, as in the undecanucleotide.
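The relationship between arginine codon choice and tract length can be made concrete with a small Python sketch. The two six-codon messages below are invented for illustration and are not sequences from protamine mRNA; the function name is likewise hypothetical. Each run of purines in the mRNA is taken to correspond to one oligopyrimidine tract of the same length in the cDNA, and flanking sequence is ignored.

```python
import re


def oligopyrimidine_tract_lengths(mrna):
    """Lengths of the cDNA pyrimidine tracts complementary to purine runs in the mRNA."""
    return [len(run) for run in re.findall(r"[AG]+", mrna)]


# Hypothetical coding sequences for a block of six arginine residues.
only_ag_codons = "AGAAGGAGAAGGAGAAGG"  # AGA/AGG codons throughout
mixed_codons = "AGAAGGCGCAGAAGGCGU"    # a CGX codon after every second AG(A/G) codon

print(oligopyrimidine_tract_lengths(only_ag_codons))  # [18]
print(oligopyrimidine_tract_lengths(mixed_codons))    # [6, 1, 6, 1]
```

With a CGX codon interrupting every pair of AGA or AGG codons, no tract exceeds a hexanucleotide, which is the pattern the recovered oligopyrimidines suggest for most of the arginine blocks.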
It is difficult to estimate from the sequence data alone the proportions of the CGX codons and AG(A/G) codons used for arginine. Many of the mRNA sequences deduced (Table III) could represent AG(A/G) codons for arginine but it is not possible to establish the reading frame with certainty nor can these sequences be unequivocally assigned to the coding region. However, the upper limit to the frequency of the CGX series can be obtained from nearest neighbor frequency analysis of protamine cDNA. Every CGX codon for arginine contains one CpG doublet. Given a length of 270 residues for the nonadenylated portion of protamine mRNA and the information in Table IV, there must be 12 CpG doublets. Hence, a maximum of 12 out of the 21 to 22 arginine codons in protamine mRNA can be of the CGX series. This value of 12 CGX codons is based on the unlikely premises that all the CpG doublets occur in the coding region, although the noncoding region of protamine mRNA is more than one and one-half times as long (160 to 170 residues) as the coding region (100 residues), and that the reading frame of CpG comprises the first and second bases in a codon rather than second and third bases, or third base in one codon and first base in the next. For these reasons the number of CGX codons for arginine is probably somewhat less than 12. A lower limit of seven codons of the CGX series can be deduced from a knowledge of the lengths of the arginine tracts in protamine (10) given that no more than two AG(A/G) codons can occur in series based on the recovery of oligopyrimidine tracts in the cDNA (Table III).
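The bookkeeping behind these limits, and the reason a CpG doublet need not mark a CGX codon, can be sketched in Python as follows. The limits (12 CpG doublets, at least 7 CGX codons, 21 to 22 arginine codons) are those given in the text; the function name and the short test sequence are hypothetical and serve only to show the three possible positions of CpG relative to the reading frame.

```python
def cpg_frame_positions(coding_rna):
    """Count CpG doublets by the position of C and G within the reading frame."""
    labels = ("1-2", "2-3", "3|1")  # codon bases 1-2, bases 2-3, or spanning two codons
    counts = {label: 0 for label in labels}
    for i in range(len(coding_rna) - 1):
        if coding_rna[i:i + 2] == "CG":
            counts[labels[i % 3]] += 1
    return counts


# Hypothetical in-frame sequence GCG-ACG-GCC-GGA: three CpG doublets, no CGX codon.
print(cpg_frame_positions("GCGACGGCCGGA"))  # {'1-2': 0, '2-3': 2, '3|1': 1}

# Limits on codon usage implied by the figures in the text.
MAX_CGX = 12  # one CpG per CGX codon, 12 CpG doublets in all
MIN_CGX = 7   # from the arginine tract lengths and the tracts recovered
for arginine_codons in (21, 22):
    min_ag = arginine_codons - MAX_CGX
    max_ag = arginine_codons - MIN_CGX
    print(f"{arginine_codons} Arg codons: {min_ag} to {max_ag} AGA/AGG codons")
```

Only doublets in the "1-2" class can belong to a CGX codon, so any CpG falling in the other two classes, or in the noncoding region, lowers the number of CGX codons below 12.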
It follows from these figures that the minimum number of AG(A/G) codons in protamine mRNA is 9 to 10. This value is easily exceeded by the number of potential AG(A/G) codons contained in the RNA sequences in Table III, and by the frequency of the doublet ApG (=24). However, sequences complementary to AG, AGA, and AGG have all been seen in endonuclease IV fragments which have originated from the noncoding region of protamine cDNA, whereas CG has not been seen in any of these fragments.⁵ Given this balance between the codons of the CGX and