The Hemoglobin of Urechis caupo THE cDNA-DERIVED AMINO ACID SEQUENCE*

The nucleotide sequence of a cDNA transcript containing part of the 5' noncoding region, the entire coding region, and the entire 3' noncoding region has been determined. The protein sequence predicted from the coding region matches almost exactly the amino-terminal sequence and the sequence of several peptides from Urechis caupo F-I globin. Only 11-20% of the amino acid positions are identical with those of other known globins.

The nucleotide sequence(s) reported in this paper has been submitted to the GenBank™/EMBL Data Bank with accession number(s) J02624.
‡ Recipient of a predoctoral fellowship from the Robert A. Welch Foundation. Present address: Dept. of Hematology, University of Utah School of Medicine, Salt Lake City, UT 84132.
To whom reprint requests should be addressed.
Chemicals, Torrance, CA) of a mixed oligonucleotide probe, complementary to the F-I globin message. The probe was extended by using it as a primer for double-stranded cDNA synthesis (10) from poly(A)+ RNA isolated from red cells of U. caupo (see below) and subsequently cloned directly into M13mp8 with EcoRI linkers. The clone of the extended probe, UCG-1, was identified by the Sanger dideoxy sequencing method (11); the sequence corresponded exactly to the first 14 amino-terminal residues of F-I globin.
RNA Preparation and Analysis-Specimens of U. caupo were collected at Bodega Bay, CA, flown to Texas, and maintained in a marine aquarium. RNA, isolated from 5 ml of packed red cells as described by Cox (12), was passed twice over a column of oligo(dT)-cellulose in order to isolate the poly(A)+ RNA. Northern analysis (13) was carried out on 2 µg of total and 2 µg of poly(A)+ RNA following electrophoresis on 1% agarose gels in 50% formaldehyde. Nick-translated (14) UCG-1 was used as a probe.
Preparation and Analysis of cDNA-A cDNA library was constructed (10) in λgt10 and screened by hybridizing nick-translated UCG-1 to nitrocellulose filters containing replicas of λgt10 plaques (13). Insert DNA from a positive plaque, UCG-2, was subcloned into M13mp8 and sequenced from both ends by the Sanger dideoxy method (11). Subclones were made following digestion by MnlI, HincII, AluI, and HinfI. The complete sequence was determined in both directions with these clones as indicated in Fig. 1 (Appendix).
Protein Analysis and Sequencing-U. caupo F-I globin was prepared as described (9). F-I globin (50 mg) was digested with CNBr (15) and chromatographed on a column of Sephacryl S-200 superfine (2.5 × 165 cm) in 6 M guanidine HCl, 0.2 M sodium acetate, pH 6.0 (Fig. 2a). Peak B was rechromatographed by high performance liquid chromatography (Fig. 2b) as described (16). F-I globin (100 mg) was also digested with BNPS-skatole reagent (15) and chromatographed on Sephacryl S-200 (Fig. 3a) as described above. Peaks B and E were rechromatographed by high performance liquid chromatography (Fig. 3, b and c). Amino acid compositions were determined as described (9) on a Beckman 121MB amino acid analyzer at the Protein Sequencing Center at the University of Texas. Sequence determinations with a Beckman model 890 Sequencer equipped with a cold trap made use of the 0.1 M Quadrol program (No. 121178) with the addition of polybrene (17, 18). The phenylthiohydantoin derivatives of the amino acids were identified by high performance liquid chromatography as described (16).

RESULTS AND DISCUSSION
The cDNA library contained about 1200 recombinants, of which seven hybridized to UCG-1. The largest of these, UCG-2, was completely sequenced (Fig. 4, Appendix) and found to be 727 bp in length. The 5' noncoding region of this clone is only 11 bp long. This is probably shorter than that of the original mRNA because of clipping of the 5' loop by S1  codon. The nucleotide sequence indicates that the globin chain is 141 amino acid residues long and has a calculated molecular mass of 15,045 daltons. This value differs from the earlier estimates obtained by gel chromatography (9): 13,500 daltons for the chain and 57,600 daltons for the tetrameric Hb. The latter values are self-consistent, in agreement with our sedimentation data (9) and with the observations of others (7, 19). A chain mass of 15,045 daltons should give a 4-heme tetramer of 62,700 daltons, whereas a chain mass of 14,000 daltons would give a 4-heme tetramer of 58,500 daltons. Several possible explanations of this discrepancy exist. First, the tetramer might be more compact than other Hbs, but both SDS electrophoresis and chromatography in the presence of guanidine HCl give low estimates for the molecular weight. Another explanation for the low tetramer weight is dissociation, but this cannot explain the low estimate for chain size. A further possibility is that UCG-2 corresponds to a previously undetected minor chain, identical to the major chain except for an additional 10 residues. Although this last possibility does not appear to be consistent with the amino acid composition, we cannot rule it out.
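The tetramer masses quoted above follow from simple arithmetic: four chains plus four heme groups. A minimal sketch in Python, assuming a heme b mass of about 616.5 daltons (this value and the function name `tetramer_mass` are illustrative assumptions, not taken from the paper), reproduces the quoted figures to within rounding:

```python
# Assumed mass of one heme b group in daltons (not stated in the paper).
HEME_MASS = 616.5


def tetramer_mass(chain_mass, n_subunits=4, heme_mass=HEME_MASS):
    """Mass of a hemoglobin built from n identical chains, one heme each."""
    return n_subunits * (chain_mass + heme_mass)


# Chain mass calculated from the cDNA sequence.
from_cdna = tetramer_mass(15045)

# Chain mass of 14,000 daltons used in the text for comparison.
from_lower_estimate = tetramer_mass(14000)

print(f"cDNA-derived chain:   {from_cdna:,.0f} daltons")
print(f"14,000-dalton chain:  {from_lower_estimate:,.0f} daltons")
```

Under this assumption the gap between the two tetramer values, about 4,200 daltons, comes almost entirely from the roughly 1,000-dalton difference per chain, which is close to the mass of the 10 additional residues discussed above.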
The amino-terminal sequence (34 residues, Ref. 9) and the partial sequence of a large CNBr fragment and two BNPS-skatole fragments (Fig. 4, Table I) were compared with the amino acid sequence predicted from UCG-2. CNBr fragment 4B corresponds to residues 98-119, BNPS-skatole fragment 3B corresponds to residues 58-91, and BNPS-skatole fragment 3E corresponds to residues 130-141. Amino acid residue 4 is threonine in UCG-2, although it was found to be alanine in both the amino-terminal sequence and the extended probe, UCG-1. Because this is a neutral substitution, it may not account for an electrophoretic difference unless the substitution caused a change in interaction between other parts of the globin chain. However, differences in other undetected positions could account for at least some of the heterogeneity reported earlier in U. caupo F-I hemoglobin (9).
A long (295 bp) 3' noncoding region follows the termination codon and ends with nine nucleotides of the poly(A) tail. Mammalian and frog α and β globin genes have 88-130 nucleotides in the 3' noncoding region, legHb globin genes have up to 174 bp, whereas the genes of seal and human  (22) starts at base 697. The sequence GGTTTTA starts 11 bases upstream from the polyadenylation site. This sequence is also found upstream from the polyadenylation site of human and seal myoglobin, but is absent from Chironomus, human, and legHb genes. Assessment of the possible significance of this sequence must await the sequences of other invertebrate genes. No other correspondence could be found between the 3' noncoding region of Urechis globin RNA and those from other organisms.
The amino acid sequence of Urechis F-I globin has been compared to those of the globins from man, lamprey, Molpadia, Chironomus, Glycera, Lumbricus, Anadara, Aplysia, and soybean (23-26). The number of identical positions ranges from 11% (Lumbricus) to 20% (Glycera) with a mean of 14%, although 51% of the amino acid sequence of Urechis globin corresponds to positions in at least one of the globins compared. These low levels of identity preclude conclusions as to the phylogenetic position of the Echiura.
Fig. 5 shows that, with few exceptions, a close correspondence exists between the hydrophobicity pattern of the Urechis F-I globin chain and that of the human β chain. As the measure of hydrophobicity we have used the mean area of the amino acid side chain buried upon transfer from the standard state to the folded protein, as described by Rose et al. (27). The detailed, residue-by-residue correspondence strongly suggests that the overall conformation of the two chains is very similar, and that we are justified in the following discussion in assuming that the Urechis hemoglobin has the same helical segments that are present in the human β chain. This close correspondence cannot be recognized if one uses the running average employed by Kyte and Doolittle (28), which blurs these details.
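The blurring effect of a running average can be illustrated with a short sketch. The example below uses the published Kyte and Doolittle hydropathy values rather than the buried-area scale of Rose et al. used for Fig. 5, and the test sequence is a hypothetical alternating peptide, not part of any globin; both choices are assumptions made only to show how a 7-residue window flattens residue-level contrast.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def per_residue_profile(sequence, scale=KYTE_DOOLITTLE):
    """Return one hydrophobicity value per residue (no smoothing)."""
    return [scale[residue] for residue in sequence]


def running_average(values, window=7):
    """Average over a centered window; ends without a full window are dropped."""
    half = window // 2
    return [
        sum(values[i - half:i + half + 1]) / window
        for i in range(half, len(values) - half)
    ]


# Hypothetical peptide alternating a hydrophobic and a charged residue.
peptide = "LKLKLKLKLKLK"
profile = per_residue_profile(peptide)
smoothed = running_average(profile, window=7)

print("per-residue range:", max(profile) - min(profile))
print("smoothed range:   ", max(smoothed) - min(smoothed))
```

In this sketch the per-residue profile swings between strongly hydrophobic and strongly hydrophilic values, while the smoothed profile stays near zero, so any residue-by-residue pattern of the kind compared in Fig. 5 would be lost.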
The Urechis globin A helix appears to start with the highly conserved threonine in position 3. All the charged residues of the A helix appear to be external and available to solvent.
The highly conserved tryptophan occurs at position 14 in the Urechis chain. This residue (β15) acts as an AE helix spacer in the human chain; we presume a similar function here. Three residues, -Ile-Lys-Gly-, join the A and B helices. Glycera hemoglobin has a similar AB segment in an extended configuration (29). The Urechis B helix has only 2 polar residues, B7 Asp and B12 Lys. Position B12 is occupied by Arg in the human β chain where it forms a salt linkage to B8 Glu. The B7 Asp of the Urechis chain appears to be too far away to form such a linkage. Although position B12 is internal in the human chain, the corresponding lysyl residue of the Urechis chain might be external if the intersubunit contacts are different from those in human hemoglobin.
The short Urechis C helix appears in Fig. 5 to have a very different hydrophobicity pattern from that of the human chain, but this depends largely on the glycine at C3. The tabulation of Rose et al. (27), which we are using, gives Gly as the least buried, but then it has the least to bury, so this discrepancy may be misleading. The glycine could easily occupy a hydrophobic pocket without serious energetic penalty. For this reason we have ignored the glycines in parts of Fig. 5. The Urechis chain has the highly conserved Phe at position CD1; in human hemoglobin this residue is packed close to the heme. The patterns for the E, F, FG, and G segments shown in Fig. 5 correspond sufficiently closely to suggest that their lengths are probably very similar in the two chains. The EF interhelical segment appears to have 9 residues in the Urechis chain, 3 residues fewer than in the human β chain. Glutamine replaces the normal distal histidine in the Urechis chain. A striking feature of the F helix in the Urechis chain is the presence of a proline at F6, just two residues away from the proximal histidine. This proline seems certain to cause a shift in the F helix, perhaps similar to that found with the proline in the middle of the G helix of Glycera hemoglobin where it causes a 25° bend (29). The alignment of the G and H helices shown in Fig. 5 suggests that the GH corner and the H helix together are shorter by 3 residues. If the H helix is the same length as that in the human β chain, then the GH segment would be only 2 residues in length. The 180° turn generated by the GH corner would be energetically easier if the GH corner were the same length as in the human chain but the H helix were shorter.

3 × 10^6 counts of nick-translated UCG-1 were hybridized. Washing was in 0.1 × SSC and 0.1% SDS at 68 °C. The film (Kodak XAR-5) was exposed for 3 days with two intensifying screens at -80 °C. Arrows refer to size of markers (rabbit red cell 28 S, 18 S, and 9 S RNA) in kilobases (kb).
Comparison of the 17 residues which form the α1β1 contacts in human hemoglobin with those in Urechis indicates that only 2 are identical. A similar comparison for the α1β2 interface shows that only 2 of 13 residues are identical. However, 5 of 11 residues known to form the heme contacts in human β globin (23) are identical with those in Urechis globin. Thus, although the heme pocket residues appear to be conserved, the data indicate that the intersubunit contacts are not.
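The counts above (2 of 17, 2 of 13, and 5 of 11) correspond to roughly 12%, 15%, and 45% identity. Comparisons of this kind amount to counting identical residues at a chosen set of aligned positions. The following is a minimal sketch of that bookkeeping; the function name, the two short aligned sequences, and the position list are hypothetical placeholders, not the Urechis or human sequences.

```python
def identity_fraction(seq_a, seq_b, positions=None):
    """Count identical residues between two aligned sequences.

    positions: optional 0-based alignment columns to compare
    (for example, the columns that form a subunit contact).
    Columns with a gap ('-') in either sequence are skipped.
    Returns (number identical, number compared).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if positions is None:
        positions = range(len(seq_a))
    compared = [
        (seq_a[i], seq_b[i])
        for i in positions
        if seq_a[i] != "-" and seq_b[i] != "-"
    ]
    identical = sum(a == b for a, b in compared)
    return identical, len(compared)


# Hypothetical aligned fragments and a hypothetical set of contact columns.
aligned_a = "VLSPADKTNV"
aligned_b = "VHLTPEEKSA"
contact_columns = [0, 6, 7]

same, total = identity_fraction(aligned_a, aligned_b)
print(f"all columns: {same} of {total} identical")

same, total = identity_fraction(aligned_a, aligned_b, contact_columns)
print(f"contact columns: {same} of {total} identical")
```

Restricting the comparison to a named subset of columns is what allows the heme contacts and the intersubunit contacts to be scored separately from the overall identity.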
Thus the tetrameric Urechis hemoglobin appears to lack the α1β2 contacts which, in human hemoglobin, shift during oxygenation. Analysis of the CO and O2 association-dissociation kinetics at 20 °C (30) indicates that Urechis hemoglobin can be best classified as a low affinity "T-state" hemoglobin (31). Furthermore, O2 and CO recombination kinetics studied in the nanosecond time regime show very little geminate recombination, a characteristic of T-state hemoglobin. Recent extended x-ray absorption fine structure measurements (32) on the HbCO form are consistent with a very large out-of-plane position for the iron atom when CO-ligated, which is also consistent with assignment of Urechis hemoglobin to the T-state. X-ray diffraction determination of the structure will ultimately be necessary to explain these observations. However, we suggest that the F6 proline close to the proximal histidine may in part be responsible for the unusual functional properties.
The subunits of the tetrameric hemoglobin of the clam, Scapharca inaequivalvis, are assembled very differently from those which form the tetramers of vertebrate hemoglobins (30). The major difference is that the E and F helices form   (33). This helix is absent in all other known globins including those of Urechis and man. The "pre-A" sequence of lamprey globin exists as an extended chain rather than a helix (34).
Codon usage does not appear to differ significantly from that of other known globin genes (human α and β, mouse β, and chicken from Ref. 35; seal myoglobin, legHb, and Chironomus from Refs. 20, 36, and 37, respectively) except that Urechis globin appears to be the only one to utilize the CGA codon for arginine, the UCG codon for serine, and the UUA codon for leucine. Vertebrate globins use mostly CAG to code for glutamine, whereas the invertebrates appear to use only CAA.
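A codon usage comparison of this kind starts from a simple tally of in-frame triplets in each coding sequence. The sketch below shows one way to do the tally; the function name and the short coding sequence are hypothetical and are not taken from UCG-2.

```python
from collections import Counter


def codon_usage(coding_sequence):
    """Tally in-frame codons of a coding sequence, reported as RNA triplets."""
    rna = coding_sequence.upper().replace("T", "U")
    if len(rna) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return Counter(rna[i:i + 3] for i in range(0, len(rna), 3))


# Hypothetical coding sequence: Met-Arg-Ser-Leu-Gln-Gln-stop.
example_cds = "ATGCGATCGTTACAACAATAA"
usage = codon_usage(example_cds)

# Codons singled out in the text.
for codon in ("CGA", "UCG", "UUA", "CAA", "CAG"):
    print(codon, usage[codon])
```

Tables built this way for each gene can then be compared codon by codon, for example to check which genes ever use CGA for arginine or how glutamine is split between CAA and CAG.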
We have preliminary evidence (3) that multiple globin genes occur in Urechis. A Southern analysis of genomic DNA cut with EcoRI and probed with nick-translated UCG-2 DNA showed the presence of four bands of 4, 9, 15, and 17 kilobases in size. If the multiple genes are all expressed, then they are probably very similar in size because Northern analysis (Fig. 6) of Urechis red cell RNA using the same probe shows only a single band. The previously discussed difference of one nucleotide between UCG-1 and UCG-2 indicates that at least two of the multiple genes are almost identical. This could explain the electrophoretic heterogeneity seen in Urechis hemoglobin (9). This is a situation similar to that seen in the globin genes of Chironomus (37), where gene A codes for threonine at position 57 and gene B codes for isoleucine.