The sequence of an embryonic myosin heavy chain gene and isolation of its corresponding cDNA.

The complete sequence of an embryonic chicken myosin heavy chain has been determined. Introns and exons were identified by comparison with the corresponding cDNA. The cDNA contains 5,962 bases, of which 85 bases constitute the poly(A) tail. The cDNA represents the entire mRNA transcript, except for 90 bases at the 5'-coding terminus and 101 bases of the 5'-untranslated region. The gene's coding region is split by 37 introns; two additional introns split the 101 base pairs which make up the 5'-untranslated region. The complete gene is approximately 23,000 base pairs and encodes a protein whose molecular weight is 222,559 and consists of 1,940 amino acids. Analysis of the protein and comparison with other myosin sequences reveal that certain regions have been conserved; those amino acids which have been postulated to participate in the ATPase and actin-binding activities of the molecule are highly homologous. These comparisons have allowed the identification of isolated regions within the myosin heavy chain that appear to be essential for the molecule's function.


HL07382.
Established Investigator of the American Heart Association. To whom correspondence should be addressed.
at the NH2 terminus and a fibrous, α-helical domain which makes up the "rod" at the COOH terminus. The ATPase is localized within the head region, as are the sites with which the light chains interact (5,6). On the basis of digestion with a variety of enzymes, the myosin molecule has been subdivided further into a number of regions which are thought to correspond to structural domains. Treatment with chymotrypsin results in cleavage of the protein between the globular and fibrous domains (7). The resulting globular fragment, termed S-1,¹ contains some 800 amino acids and shows both ATPase activity and the ability to bind actin (8). Limited digestion of this fragment with trypsin results in the production of three polypeptides whose molecular weights, in order from the amino terminus, have been estimated to be 27,000, 50,000, and 20,000 (9, 10); each fragment has been assigned various roles in the molecule's activities. The hypothesis that these three fragments are representative of certain domains has been advanced since cleavage of S-1 with a number of other enzymes also results in the production of these or closely related fragments (11,12).
Other portions of the myosin molecule have been defined in a similar manner; by controlled proteolysis, the connecting segment between the globular S-1 region and the helical rod portion has been isolated and termed S-2, or "hinge region" (13-15). This fragment, which is highly susceptible to proteolysis (16), has been postulated to be the functional link between the coiled-coil rod structure and the globular, ATPase-containing head region. The S-2 hinge is thought to permit the movement of the head along the actin, relative to the thick filament (17,18).
The rod region, in contrast, is thought to play a relatively static, structural role in the thick filament. This portion of the molecule, which also can be isolated by controlled proteolysis, is often termed light meromyosin (LMM) and is insoluble at physiological ionic strengths (19,20). The unique properties of LMM in terms of the charge periodicity needed to maintain its conformation and its interactions with the other myosin rods packed into the thick filament have been long appreciated (21,22). The existence in this region of a 28-residue repeat has been confirmed by extensive sequencing (23-25).
Although enzymatic dissections have been invaluable in localizing some of the functional properties of the molecule, the resolution of these methodologies is limited. Indeed, a prerequisite for exact assignation of structure to function is the determination of the protein's sequence. Previously, we have cloned and characterized members of the chicken myosin heavy chain gene family (26,27) and have isolated recombinant bacteriophage which, taken together, encompass a complete myosin heavy chain gene and its promoter elements (28). In addition to the structural considerations at the protein level outlined above, we are also interested in defining the gene's structural basis for its regulation; and we have noted that some of these elements may be localized within the interior of the gene (29, 30). Therefore, we have sequenced the entire embryonic myosin heavy chain gene. In addition, we have prepared an essentially full-length cDNA which lacks only 90 base pairs of the coding region and have used this construct to confirm directly the intron-exon junctions of the gene and the protein's sequence.
¹ The abbreviations used are: S-1, subfragment 1; S-2, subfragment 2; LMM, light meromyosin.

EXPERIMENTAL PROCEDURES
Sequencing Strategy-The bacteriophage containing the fragments of the embryonic myosin gene have been described previously (28). Suitable restriction fragments of between 300 and 3500 base pairs were generated and isolated using low-melt agarose. These fragments were cloned into M13mp18 or M13mp19 (31,32), and both strands of the entire fragment were sequenced directly by generating a series of nested deletions as described by Dale et al. (33). All other sequencing conditions have been described (28).
Synthesis of Complementary DNA-Messenger RNA was isolated from 14-day embryonic chick breast muscle as described previously (28). First-strand synthesis was carried out using 5 µg of the RNA in a volume of 50 µl, essentially as described by Okayama and Berg (34).
Reverse transcriptase XL (Life Science, Inc., St. Petersburg, FL) was found to give superior results in 30-min incubations at 42 °C. A parallel reaction, containing 7 µg of RNA but using Moloney murine leukemia virus reverse transcriptase (Bethesda Research Laboratories), was carried out at 37 °C for 1 h in 50 µl containing 60 mM KCl, 50 mM Tris-HCl (pH 7.5), 3 mM MgCl2, 0.5 mM deoxynucleotide triphosphates, 5 mM dithiothreitol, and 1 µg of oligo(dT)12-18. The reactions were mixed, quenched by the addition of EDTA to 20 mM, extracted with phenol/chloroform, and precipitated by the addition of an equal volume of 4 M ammonium acetate and 2 volumes of ethanol. The tube was placed at -70 °C for 10 min, and the cDNA was collected by centrifugation. Second-strand synthesis was carried out in 250 µl essentially as described by Gubler and Hoffman (35).
The cDNA was loaded in a 1.5-cm slot in a 0.6% low-melt agarose gel and electrophoresed for 2 h at 2 V/cm. The fractions from 8000 to 5000 base pairs were collected and purified as previously described (27,36). This material was tailed with deoxycytosine as described (35) except the reaction was carried out at 37 °C for 7 min. The cDNA was then cloned into dG-tailed pBR325 at the PstI site and used to transform HB101. Myosin-containing clones were selected by standard procedures involving hybridization to a 32P-labeled myosin cDNA (36).

Gene's Organization
As shown in Fig. 1, the gene encompasses approximately 23,000 bases and consists of 40 exons, varying in size from 24 to 390 nucleotides. In contrast to the rather tight constraint on exon sizes, the introns vary widely, the smallest being 74 and the largest being 2,270 bases in length. As previously noted (28,37), the 101 bases of the 5'-untranslated region are split by two introns. The third exon of the gene contains 34 bases of the untranslated region, followed by the ATG initiation codon. This and the remaining exons encode 6,067 nucleotides which are translated into 1,940 amino acids to produce a protein (neglecting glycosylation) of Mr = 222,559.
The distal half of the last exon contains the nucleotides which encode the 3'-untranslated region of 146 bases; this exon also contains the polyadenylation signal. The ratio of intron nucleotides to exon nucleotides is approximately 3:1 which, when compared to other eucaryotic genes, is not unusual. In addition, when compared with the other vertebrate myosin genes (25,37,60), the organization of the gene is highly conserved. This conservation is maintained across both species (25) and isoforms (60) and indicates that a primordial myosin gene gave rise by a series of duplications to a number of different gene isoforms.
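The intron-to-exon ratio quoted above can be reproduced with simple arithmetic. The sketch below is illustrative only: it assumes that the 6,067 nucleotides attributed to the exons is the total exon length and that the gene length is the 22,684 nucleotides reported from the transcriptional start site to the polyadenylation site. The function and constant names are hypothetical.

```python
# Totals as reported in the text. Treating 6,067 as the total exon length
# is an assumption made for this illustration.
GENE_LENGTH_NT = 22_684       # transcription start to polyadenylation site
EXON_TOTAL_NT = 6_067         # nucleotides attributed to the exons
PROTEIN_MR = 222_559          # reported molecular weight of the protein
PROTEIN_LENGTH_AA = 1_940     # reported number of amino acids


def intron_exon_ratio(gene_nt: int, exon_nt: int) -> float:
    """Return intron nucleotides divided by exon nucleotides."""
    intron_nt = gene_nt - exon_nt
    return intron_nt / exon_nt


ratio = intron_exon_ratio(GENE_LENGTH_NT, EXON_TOTAL_NT)
mean_residue_mass = PROTEIN_MR / PROTEIN_LENGTH_AA

print(f"intron:exon ratio = {ratio:.2f}:1")
print(f"mean residue mass = {mean_residue_mass:.1f}")
```

Under these assumptions the ratio comes out somewhat below 3:1, consistent with the approximate figure given in the text.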

Determination of Intron-Exon Junctions
Protein sequence information for the vertebrate myosins, which could be used to confirm intron-exon junctions, is limited. Although canonical splice junctions and the probable intron-exon organization can be determined by computer analysis, unambiguous assignation of the boundaries requires comparison of the gene sequence to the corresponding mRNA (or cDNA). Alternatively, primer extension analysis of each putative junction can be done. A comparison with the contiguous coding nucleotides is, however, the most direct approach. To this end, an essentially full-length cDNA was constructed, sequenced, and used to unambiguously confirm the intron-exon organization. The organization of this cDNA is shown in Fig. 2, and the exons it contains are indicated. All exons that are translated are contained within the transcript, the cDNA terminating only 90 bases away from the translational initiation site. The sizes of the exons and introns and their locations are shown in Table I. The globular S-1 region, which makes up approximately 40% of the protein, is encoded by approximately 19 exons; the S-2 region is encoded by approximately seven; and the rod, which contains 33% of the protein's amino acids, is encoded by approximately 11. (None of the domains' boundaries correspond exactly to exon junctions, hence the approximations.)

Nucleotide Sequence
The entire sequence of the gene is shown in Fig. 3; the nucleotides corresponding to the exons are capitalized. The entire gene, from the transcriptional start site to the site of polyadenylation, consists of 22,684 nucleotides. As previously noted (28), the basal structural elements of a eucaryotic promoter, the CAAT and TATA boxes, are present in their usual positions upstream from the base (+1) at which transcription is initiated.
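The computer analysis of canonical splice junctions mentioned above can be illustrated with a minimal sketch. It assumes the sequence follows the convention of Fig. 3 (exons in capitals, introns in lowercase) and tests each intron against the GT-AG rule. The function names and the toy sequence are hypothetical and are not taken from the gene.

```python
import re


def introns_from_case(gene: str) -> list[str]:
    """Return lowercase runs, taken here to be introns (exons are capitalized)."""
    return re.findall(r"[acgtn]+", gene)


def has_canonical_splice_sites(intron: str) -> bool:
    """True if the intron begins with GT and ends with AG (GT-AG rule)."""
    seq = intron.upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


# Invented toy sequence: three exons separated by two introns.
toy_gene = "ATGGCC" + "gtaagtcccttttcag" + "GAGCTG" + "gcaagtttctacag" + "TAA"

introns = introns_from_case(toy_gene)
flags = [has_canonical_splice_sites(i) for i in introns]
for intron, ok in zip(introns, flags):
    print(intron, "canonical" if ok else "non-canonical")
```

A check of this kind only flags candidate boundaries; as the text notes, comparison with the cDNA is still needed to assign them unambiguously.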

Amino Acid Sequence
The sequence of the protein was determined directly from the gene and cDNA sequences. We noted that only three nucleotide changes had occurred between the gene and cDNA. One of these (amino acid 379) shows a GGT → GAT change which results in a glycine → aspartate change. At amino acid residues 1212 (ATT → ATC, isoleucine) and 1824 (GTG → GTA, valine), silent site substitutions have taken place, and the amino acids have been conserved. These changes merely reflect the extensive polymorphisms present in the myosins which we have previously observed (61) and show the polymorphic variations in different Leghorn flocks. The amino acid sequence deduced from the gene's nucleotide sequence is shown in Fig. 4. The number and percentage in the protein of each amino acid are shown in Table II.
Subfragment 1-As noted above, controlled proteolysis of the intact myosin produces well-defined fragments, and one of these, S-1, corresponds to the globular head region. On the basis of an NH2-terminal sequence obtained from the adjoining fragment (38), we establish the probable cleavage point for generation of this fragment at lysine 846, although other investigators (24,25) have indicated that cleavage might occur at lysine 836 or 842. The resulting fragment has a molecular weight of 96,240 and can be subdivided into three discrete domains on the basis of enzyme cleavage patterns (9,38). The NH2-terminal fragment with Mr ~23,000 contains the residues which react with the photoaffinity-labeled ATP analogue N-(4-azido-2-nitrophenyl)-2-aminoethyl triphosphate (39).
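The three gene/cDNA nucleotide differences reported at the start of this section can be checked against the standard genetic code. The sketch below is illustrative: the table holds only the six codons involved, and the names `CODON_TABLE`, `CHANGES`, and `classify_change` are hypothetical.

```python
# Subset of the standard genetic code covering only the codons discussed.
CODON_TABLE = {
    "GGT": "Gly", "GAT": "Asp",
    "ATT": "Ile", "ATC": "Ile",
    "GTG": "Val", "GTA": "Val",
}

# (amino acid position, first codon, second codon) as listed in the text.
CHANGES = [(379, "GGT", "GAT"), (1212, "ATT", "ATC"), (1824, "GTG", "GTA")]


def classify_change(before: str, after: str) -> str:
    """Label a codon change as silent or as an amino acid replacement."""
    aa_before, aa_after = CODON_TABLE[before], CODON_TABLE[after]
    if aa_before == aa_after:
        return f"silent ({aa_before})"
    return f"{aa_before} -> {aa_after}"


results = {pos: classify_change(b, a) for pos, b, a in CHANGES}
for pos, label in results.items():
    print(pos, label)
```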
On the basis of available information in the literature (9, 10, 38), we tentatively assign amino acids 1-213 (Mr = 24,123) to this fragment. Since the active site-trapping procedure was used with this nucleotide analogue (40), it has been postulated that tryptophan 131, to which it was cross-linked, is a part of the ATP-binding site (39); and we have noted that the amino acids in and around this site are highly conserved (37). Lysine 130 is adjacent to photoreactive tryptophan 131 and is usually present as the Nε-trimethyllysine derivative. Thus, its positive charge can ionically attract the polyphosphate moiety of the ATP (45). This NH2-terminal fragment is known to contain many other reactive lysine residues, and it is noteworthy that lysines 55, 84, 130, 146, 147, and 185 all occur in or around regions whose conformations display a high probability of containing a turn or having a relatively unordered structure (41). Reaction of one of the lysines with 2,4,6-trinitrobenzene sulfonate (42, 43) causes a dramatic decrease in the ATPase activity. Previous sequence information (44) placed this residue in the peptide fragment aspartate-proline-proline-lysine, which corresponds exactly to residues 81-84 in the embryonic sequence.
The 50-kDa peptide which encompasses amino acids 214-640 (Mr = 47,962) is adjacent to the 23-kDa amino-terminal fragment and is also able to interact with ATP analogues. Korner et al. (46) showed that modification of a carboxyl group in this fragment decreased the ATPase activity. More recently, Mahmood and Yount (47) demonstrated that a trapped 3'-O-(4-benzoyl)benzoyl-ATP could be specifically localized to this fragment, indicating the close proximity (6-7 Å) of elements of this peptide to tryptophan 131. Fragment 214-640 has also been implicated in the binding of actin by numerous investigators (48-50).
A third fragment, located at the carboxyl-terminal end of  4). It can interact with the other domains found in the head region and has also been implicated in actin binding (48-50). In addition, this peptide fragment contains two reactive thiol groups, termed SH1 and SH2 (9), corresponding to cysteines 700 and 710 in Fig. 4. Their modification significantly affects ATPase activity. SH2, which is located in a flexible part of the polypeptide, not only can interact with the SH1 site but can be cross-linked to a reactive thiol in the 50-kDa fragment (residue 403, 480, 523, or 541). This cross-linking induces the stable trapping of Mg2+-ADP (52). These apparent interactions between amino acids which are far apart in the primary sequence serve to emphasize the folding which places these domains in proximity to one another and the globular nature of the entire region in the native protein.
Subfragment 2-The residues located between lysine 846 and arginine 1284 make up the central part of the myosin molecule which is designated as subfragment 2. S-2 is the structural link between the head region and the rigid LMM rod. S-2 can be further divided, on the basis of trypsin digestion, into two fragments: the "short" S-2, located at the NH2 terminus, and the more flexible hinge comprising the COOH terminus (14,19). Analysis of the conformation of the entire S-2 using the methods of Chou and Fasman (53) and Finer-Moore and Stroud (41) reveals no striking differences in the high degree of α-helical conformation present in both of these regions, although the S-2:LMM junction is characterized by a high degree of disorder centered between residues 1276 and 1315.
The exact nature of the hinge is unknown. It has been postulated that the flex inherent in the α-helical coil is sufficient to allow for the movement of the head relative to the rod (51). However, other investigators (54) have noted that the region may undergo melting and have hypothesized that this results in a localized conformational change.
A major contribution to the overall stability of the region is made by the interactions between the two α-helices. The coiled-coil was first considered in detail by Crick (55) who postulated that the amino acids would have a 7-residue repeat (a-g). Apolar residues in positions a and d would form a "stripe" inclined around the axis of the helix stabilizing the packing. Charged groups would be located preferentially on the surface. The model has been shown to be correct for a large number of proteins (56), and consideration of the implications for myosin has been detailed (25,51,57). Fig. 5 shows a residue analysis for the short S-2 and the flexible hinge region, as well as for the highly ordered "rigid" LMM. The data show that the hydrophobic, stabilizing interactions at position a are significantly reduced in the hinge section; this may lead to an overall decrease in the thermal stability of this portion of the molecule. Based on a cautious application of the available algorithms for predicting conformation (43,53), we think a plausible structural interpretation of the hinge might be the formation of a highly localized region of disorder within residues 1285-1315, which are unable to form any sort of an ordered structure. The resulting conformational changes could then be propagated over short distances, the process being facilitated by the lower degree of stability of this region relative to the short S-2 and LMM.
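A residue analysis by heptad position, of the kind summarized in Fig. 5, might be sketched as follows. This is not the procedure used for the figure: the set of residues counted as apolar, the register offset, and the toy sequence are all assumptions chosen for illustration, and the function name is hypothetical.

```python
from collections import Counter

# Assumption: residues counted as apolar for this illustration.
APOLAR = set("AVLIMFWC")
HEPTAD = "abcdefg"


def heptad_apolar_fraction(seq: str, offset: int = 0) -> dict[str, float]:
    """Fraction of apolar residues at each heptad position a-g.

    offset is the index in seq of the first residue occupying position a.
    """
    totals: Counter = Counter()
    apolar: Counter = Counter()
    for i, aa in enumerate(seq[offset:].upper()):
        pos = HEPTAD[i % 7]
        totals[pos] += 1
        if aa in APOLAR:
            apolar[pos] += 1
    return {p: apolar[p] / totals[p] for p in HEPTAD if totals[p]}


# Invented toy coiled-coil segment with leucine at positions a and d.
toy_segment = "LKELEDK" * 4
fractions = heptad_apolar_fraction(toy_segment)
for pos, frac in fractions.items():
    print(pos, f"{frac:.2f}")
```

Comparing such fractions between the short S-2, the hinge, and LMM would show whether apolar occupancy at position a drops in the hinge, as the text reports.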
Light Meromyosin-We identify residue 1284 (arginine) as the probable trypsin site which defines LMM. The sequence of this portion of the molecule predicts the high degree of α-helix which has been noted previously in the myosins (25,51). The 7- and 28-residue repeats are maintained as are the positions of the "skip residues," which have the effect of widening the helix and altering the degree of twist (57). A hydrophilicity profile of the molecule shown in Fig. 6 emphasizes the dichotomy of the two ends of the protein. The globular characteristics of S-1 are graphically reflected in the mixed hydrophobic and hydrophilic natures of the region. A deep hydrophobic pocket is present proximal to photoreactive tryptophan 131, and the COOH terminus of the 20-kDa fragment ends in the "loop," the most hydrophilic feature in the entire profile. This loop has been observed in previously sequenced myosins (51). In contrast to S-1, the other regions of the protein, S-2 and LMM, show minimal hydrophobicity, this being a reflection of their essentially rod-like natures.
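A profile of the kind shown in Fig. 6 is typically a sliding-window average of per-residue scores. The sketch below is a stand-in rather than the method used for the figure: the scale behind Fig. 6 is not stated here, so the Kyte-Doolittle hydropathy scale is assumed (positive values are hydrophobic, so the sign is opposite to a hydrophilicity plot), and the window size and toy sequence are arbitrary.

```python
# Kyte-Doolittle hydropathy values (assumed scale; positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 7) -> list[float]:
    """Mean hydropathy over each window of consecutive residues."""
    scores = [KYTE_DOOLITTLE[aa] for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# Invented toy sequence: a hydrophobic stretch followed by a charged one.
profile = hydropathy_profile("IIIIIIIKKKKKKK", window=7)
print([round(v, 2) for v in profile])
```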

Structure
Both the nematode and rat myosin sequences have been determined (24,25). The nematode gene, unc54, encodes a body-wall myosin which contains 1966 amino acids, whereas the rat gene encodes a myosin which is expressed predominantly in embryonic fast-white muscles. The embryonic chicken myosin, as might be expected, is closely related to both of these sequences, but is more homologous to the vertebrate sequence and contains exactly the same number (1940) of amino acids. We compared the amino acids from both the nematode and the rat to the chicken sequence; the overall homology was 48.8 and 83.5%, respectively. Although the exons in the chicken do not appear to define any obvious domains (e.g. no exon junctions are located where enzymatic domains are defined), we arbitrarily chose them as a unit of comparison, and an exon-by-exon analysis was made. Fig. 7 illustrates the results in a histogram showing the percent homology of the amino acids in each exon (or, in the case of the nematode, the amino acids in the analogous regions which correspond to the chicken exons) in relation to the overall homology. The results are striking and demonstrate that certain regions are preferentially conserved, whereas others are hypervariable.
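The exon-by-exon comparison described above can be sketched as a percent-identity calculation over aligned segments. The sketch assumes that the two protein sequences have already been aligned to equal length, that exon boundaries are given as half-open spans in alignment coordinates, and that gap positions count as mismatches; how gaps were treated in the original analysis is not stated. All names and the toy sequences are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of aligned positions with identical residues ('-' never matches)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, equal-length, and non-empty")
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def per_exon_identity(ref: str, other: str, exon_spans) -> list[float]:
    """Percent identity within each (start, end) span of the alignment."""
    return [percent_identity(ref[s:e], other[s:e]) for s, e in exon_spans]


# Invented toy alignment divided into two "exons" of five residues each.
ref_seq = "MKTAYIAKQR"
other_seq = "MKTAYLAKER"
spans = [(0, 5), (5, 10)]

overall = percent_identity(ref_seq, other_seq)
by_exon = per_exon_identity(ref_seq, other_seq, spans)
deviation = [value - overall for value in by_exon]
print(overall, by_exon, deviation)
```

Plotting each exon's deviation from the overall value gives a histogram of the same general form as Fig. 7.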
The third exon, which contains the translational initiation codon, significantly deviates from the average homology, being hypervariable both in the nematode and rat. We have noted this previously in a comparison between the chicken, rabbit, and rat myosin genes (28) and have hypothesized that this exon, which encodes the amino terminus of the protein, may partly account for the differing ATPase activities of the different isoforms.
The fifth exon contains a portion of the ATPase site (40) encoding ε-trimethyllysine 130 and tryptophan 131. In both the rat and nematode, this exon (or corresponding region in the nematode) is preferentially conserved, although the homology is much more pronounced between the rat and chicken. The adjoining exons show an interesting pattern in both comparisons: exons 6 and 8 are highly conserved, whereas exon 7 is not. Exons 9 and 10 are also highly conserved. Previous biochemical analyses have indicated that residues 117-195 (contained within exons 5-7) are important in myosin's interaction with ATP; this appears to be reflected by the striking sequence conservation apparent in exons 6 and 8. Exon 7, by comparison, does not appear to be as highly conserved. However, upon closer examination, we observed that this exon is highly conserved at its 5' end and hypervariable in its 3' region. In fact, it is within this exon (7) that the short consensus sequence (62), which is found in a wide variety of procaryotic and eucaryotic nucleotide-binding proteins (63), is located. Fig. 8 shows the extent of the homologies in and around this region. Between residues 143 and 200, there is 100% homology between the chicken and rat myosins. The available rabbit sequence (45) also shows 100% homology in this fragment. Whereas there are numerous differences between the chicken and nematode sequences, all but three of the substitutions are conservative and would have little effect, according to the available algorithms (41, 53), on the conformation of this region. It is noteworthy that the homologies extend across and terminate before exon borders. This emphasizes the apparent lack of correspondence between probable protein domains and individual exons. The conservation observed between amino acids 473 and 529, encoded by exon 13 (see Fig. 7), is quite striking; it is highly conserved between the nematode and chicken, and the chicken and rat sequences are also homologous.
Interestingly, both actin-binding and ATP interactions have been localized to these residues (47-50); and it is not surprising, considering the importance of these functions to the protein's activities, that these residues also have been preferentially conserved.
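Locating a short nucleotide-binding consensus in a protein sequence can be done with a regular expression. The text does not spell out the pattern, so the sketch below assumes it is the glycine-rich motif G-X4-G-K-(T/S), often called the Walker A motif. The pattern, the function name, and the toy sequence are assumptions for illustration and are not taken from the chicken sequence.

```python
import re

# Assumed consensus: G, any four residues, G, K, then T or S.
WALKER_A = re.compile(r"G.{4}GK[ST]")


def find_consensus(seq: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched residues) for each hit."""
    return [(m.start() + 1, m.group()) for m in WALKER_A.finditer(seq.upper())]


# Invented toy peptide containing one instance of the assumed pattern.
toy_peptide = "MSLITGESGAGKTENTKK"
hits = find_consensus(toy_peptide)
print(hits)
```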
All of these regions are localized in S-1; the pattern of sequence conservation in both comparisons is less apparent in the remainder of the molecule. However, the analysis does reveal some interesting aspects of LMM. The sequences which encode the rod diverge extensively between the different myosins, and this is reflected in the histogram (Fig. 7); the homologies in exons 27-40 are mostly below the average. However, the amino acids encoded by (chicken) exons 37 and 39 are preferentially conserved between the three myosins. These peptides are located at the end of the rod, and it has been hypothesized that this region may play an important role in the assembly of the thick filament (58). It seems probable that, if this is the case, the peptides important for this function are encoded by these sequences.
It should be emphasized that the unit of comparison, the chicken exon, is an arbitrary one; and we do not mean to imply that the exons define any functional or structural domain. Indeed, in the nematode gene, the coding sequence is interrupted by only eight introns (although five of these are located in the positions occupied by analogous introns in the vertebrate genes); and the histogram was constructed using the regions in the nematode which correspond to exon sequences in the vertebrates. It should also be noted that the sizes of the different exons do bias the analysis to some extent since exons encoding a small number of amino acids have a higher probability of deviating from the average degree of homology. Despite the above caveats, the analysis serves to illustrate which general regions are conserved and which diverge extensively as compared to the overall homologies between the genes.
The conservation of gene structure between the rat (25) and chicken genes is high. Both genes contain two exons which encode the 5'-untranslated end, and the methionine initiating codons are both located in the third exons. Both genes encode 1940 amino acids; and, within the coding regions, all intron positions are exactly conserved except for the last. The terminal exon in the chicken is split in two in the rat, the extra mammalian intron occurring 24 bases upstream from the stop codon.
The development of a cDNA which encodes almost the entire polypeptide should be useful in initiating structure-function studies and increasing the biochemical resolution of the system. For example, it should now be possible to perform site-specific mutagenesis of the nucleotides encoding particular amino acids, such as tryptophan 131, or one of the reactive thiols. If these constructs can be expressed in a suitable host and the protein used to reconstitute the myofibril, it will become possible to determine the exact primary structure which underlies the protein's functions.