Nucleotide Sequence of the luxA Gene of Vibrio harveyi and the Complete Amino Acid Sequence of the α Subunit of Bacterial Luciferase*

The nucleotide sequence of the 1.85-kilobase EcoRI fragment from Vibrio harveyi that was cloned using a mixed-sequence synthetic oligonucleotide probe (Cohn, D. H., Ogden, R. C., Abelson, J. N., Baldwin, T. O., Nealson, K. H., Simon, M. I., and Mileham, A. J. (1983) Proc. Natl. Acad. Sci. U. S. A. 80, 120-123) has been determined. The α subunit-coding region (luxA) was found to begin at base number 707 and end at base number 1771. The α subunit has a calculated molecular weight of 40,108 and comprises a total of 355 amino acid residues. There are 34 base pairs separating the start of the α subunit structural gene and a 669-base open reading frame extending from the proximal EcoRI site. At the 3' end of the luxA coding region there are 26 bases between the end of the structural gene and the start of the luxB structural gene. Approximately two-thirds of the α subunit was sequenced by protein chemical techniques. The amino acid sequence implied by the DNA sequence, with few exceptions, confirmed the chemically determined sequence. Regions of the α subunit thought to comprise the active center were ...

2), and in fully induced cells luciferase comprises up to 5% of the soluble protein (3). The enzyme is a heterodimer and catalyzes the following reaction.
FMNH2 + O2 + RCHO → FMN + RCOOH + H2O + hν
The subunits of the enzyme from Vibrio harveyi have molecular weights, determined by SDS-polyacrylamide gel electrophoresis, of 42,000 and 37,000 for α and β, respectively (4).
Mutant enzyme analyses and chemical modification studies indicate that the single active center resides primarily if not exclusively on the α subunit (5). The specific role of the β subunit is unknown, but it is absolutely required for bioluminescence activity.
Recently, the luciferase genes from V. harveyi were isolated (6)(7)(8) and shown to be closely linked on the bacterial chromosome. luxA encodes the α subunit and luxB encodes the β subunit. Partial sequence information at the nucleotide (7) and amino acid (9) levels suggests that the genes arose by tandem duplication of an ancestral gene.
Partial amino acid sequence information from regions thought to be associated with the active center has been obtained during the past few years (10), but determination of the entire sequence of the subunits has been hampered by the poor solubility of the proteolytic and chemically derived fragments. In order to determine the encoded sequence of the α subunit and to better understand the structure and regulation of the lux region of the V. harveyi chromosome, we have determined the nucleotide sequence of the ~1.85-kb EcoRI fragment known to contain the entire luxA gene and part of the luxB gene. We report here the complete nucleotide sequence of the luxA gene and compare it with amino acid sequences obtained from analysis of peptide fragments comprising approximately two-thirds of the α subunit.

MATERIALS AND METHODS
Subcloning and DNA Sequencing-As the restriction map of the 1.85-kb EcoRI fragment was relatively well characterized, subcloning was done by purifying fragments from the recombinant plasmid pAG101 (7) and ligating them into the appropriate vector. During the course of the sequencing, additional restriction sites were identified, and these were then used to construct additional subclones. Subclones were constructed in the bacteriophage M13 derivatives mp7 and mp8 (11). DNA sequencing was by the chain-termination method (12).
published (13)(14)(15). Starting with 600 g of frozen cell paste, 1050 mg of luciferase was purified, and from this enzyme, 438 mg of α subunit was isolated. The α subunit was judged to be greater than 95% pure based on Coomassie Blue staining of SDS-polyacrylamide gels (16).
Prior to digestion, the α subunit was alkylated by reaction with iodoacetate (17). The carboxymethylated subunit was digested with Staphylococcus aureus strain V8 protease (Miles) in ammonium bicarbonate, pH 7.8, 1 mM EDTA, at a substrate to enzyme ratio of 25:1 (w/w) for 16 h at room temperature. These conditions have been reported to yield cleavage at glutamyl residues only (18). Peptides were resolved by chromatography on columns of the cation exchange resins Aminex A-4 (Bio-Rad) and PA-35 (Beckman) using gradients of pyridine-acetate by methods that have been described (17,19).
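The expected products of a digest with Glu-only specificity can be anticipated with a simple in-silico cleavage. The following is a minimal sketch, not part of the original methods; the function name `digest_after` and the toy sequence are hypothetical and serve only to illustrate cleavage on the carboxyl side of glutamyl residues.

```python
def digest_after(sequence, residues="E"):
    """Split a one-letter protein sequence after each occurrence of `residues`.

    With the default, bonds on the carboxyl side of Glu (E) are cleaved,
    which is the specificity reported for the digestion conditions.
    """
    peptides, current = [], ""
    for aa in sequence:
        current += aa
        if aa in residues:
            peptides.append(current)
            current = ""
    if current:  # keep the carboxyl-terminal peptide if it does not end in a cleavage site
        peptides.append(current)
    return peptides


# Hypothetical toy sequence, not taken from the alpha subunit.
toy_sequence = "GALMKEGTYE"
glu_peptides = digest_after(toy_sequence)
print(glu_peptides)
```

Passing a different `residues` string models a relaxed specificity without changing the function.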
High voltage electrophoresis on paper was used extensively both to analyze fractions for purity and to purify contaminated peptides by published procedures (17,19). Purity of peptides was also assessed on the basis of amino acid composition. Amino acid compositions of samples were determined with a Beckman model 121 amino acid analyzer following 24-h hydrolysis at 110 °C with 6 M HCl.
Edman degradation was performed manually as previously described (17,19). Aliquots were removed after each cycle of degradation and dansylated, and DNS derivatives, released by acid hydrolysis, were identified by thin layer chromatography (20). Automated Edman degradation was performed with a Beckman Sequencer model 890C as described previously (9,17,19). PTH derivatives were determined by thin layer chromatography using a total of three solvent systems (9,17,19). The multiple determinations were considered necessary due to the lack of quantitative data from the chromatograms. Only residues identified by all three systems were considered to be accurate and are presented here.

DNA Sequencing
The clones used for sequencing are listed in Table I; the direction and extent of sequence derived from each clone are shown in Fig. 1. The nucleotide sequence of the 1.85-kb EcoRI fragment is shown in Fig. 2. The sequence of about 75% of the fragment was determined from both strands (see Fig. 1), and in regions where information from only one strand is presented it is from areas of the sequencing gels where the nucleotide assignment is unambiguous and/or regions for which protein sequence was available. Shown below the DNA sequence is the implied protein sequence.

Peptide Sequencing
After digestion with S. aureus protease, 33 α subunit peptides were isolated in sufficient yield and purity for sequence analysis. The amino acid compositions of these peptides are presented in Table II. While knowledge of the sequences of these peptides alone was not sufficient to deduce the sequence of the α subunit, the data were highly useful in checking the nucleotide sequence and ascertaining frameshift errors in reading the DNA-sequencing gels. The locations of most of the peptides presented in Table II are indicated in Fig. 2.

Determination of the Sequence of the α Subunit
The entire sequence of the α subunit was determined from the sequence of the luxA gene. We had also determined the sequence of numerous peptides derived from the protein, and knowledge of the sequences of those peptides was very helpful in confirming the DNA sequence. A detailed interpretation of the sequence follows.
Residues 1-26-The amino acid sequence of residues 1-26 was determined using a Beckman Sequencer, and the sequence has been published (9). Residue 13 was not determined in the earlier work, and the DNA sequence reported here allows us to assign Pro to that position (see Table I and Fig. 1, clone 3). This assignment is consistent with the sequence of the peptide SAP 2. Residue 17 was reported to be Glu (9), but the DNA sequence indicated Gln. It is likely that the DNA sequence is correct, and the error was due to the well-known deamidation of Gln. The peptide SAP 1 was placed at residues 1-4 based on sequence identity. The DNA sequence in this region (residues 707-784) was determined from both strands using clones 3, 4, and 15.
It was of interest that while the S. aureus V8 protease, under the conditions of the digestion, has been reported to be specific for bonds on the carboxyl side of Glu residues (18), we observed cleavages at reasonably high yield at other residues, most notably Gly. SAP 1 resulted from cleavage of a glycyl-asparagine bond, and SAP 2 resulted from cleavage of a threonyl-tyrosine bond, as well as a glutamyl-leucine bond. The cleavages observed and reported here indicate a preference, but hardly a specificity, for Glu.
Residues 27-117-The DNA sequence through this region (residues 785-1063) was determined from both strands (see Fig. 1), and the encoded amino acid sequence was confirmed by peptides SAP 3 through SAP 14. The corresponding region in the DNA sequence was residues 785-1063; the sequence was determined using clones 3, 4, 5, and 15 and by using the synthetic oligonucleotide that was used in the original cloning (Ref. 7; see sequence 14 in Fig. 1). The position of the peptides SAP 3 through 14 was based on the DNA sequence, with the exception of SAP 10 and 12, which were placed relative to each other on the basis of the sequence of the reactive thiol-containing tryptic peptide Phe-Gly-Ile-Cys-Arg (21). The cleavage of bonds associated with the carboxyl side of glycyl residues by the SAP enzyme was quite evident here as well.

... Table II). Underlined amino acid residues indicate those that were determined by automatic Edman degradation of either the whole α or β subunit (9) or of large proteolytic fragments of the α subunit (10). Locations of peptides derived from digestion of the α subunit by trypsin (T) or the Staphylococcus aureus protease (SAP) are indicated by the labeled brackets. Regions of sequence of peptides enclosed within brackets were not determined, and alignments are by amino acid composition only. All other regions were determined by Edman degradation as described in the text. Positions in which there was discrepancy between the amino acid sequence implied from the nucleotide sequence and the protein sequence are indicated by the insertion of the ambiguous residues beneath the encoded amino acid residues. Residues that were not unambiguously identified in the protein sequence are indicated in the text. In all cases, the discrepancies have been reconciled in favor of the nucleotide sequence. Numerous other peptides were isolated and sequenced, but they were of no additional help in elucidation of the sequence and are not shown here for the sake of clarity.

Residues 118-143-The DNA sequence in this region (residues 1063-1139) was determined from both strands using clones 5, 6, and 13. Sequence 14 (Fig. 1) was derived by priming the template DNA from a clone that carried the entire 1.85-kb EcoRI fragment with the mixture of 8 sequences used in the original cloning (7). The protein sequence to which the synthetic probe was directed was Met-Asp-Cys-Trp-Tyr-Asp (amino acid residues 128-133 (10)). The 17-base oligonucleotide was constructed to have 2-base ambiguities at positions 6, 9, and 15, giving a total of 8 sequences.
The DNA sequence demonstrated that the correct base in the ambiguous position in the Asp codon was C, while it was T for both Cys and Tyr.
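The count of 8 probe sequences follows from three independent 2-base ambiguities (2 × 2 × 2). The sketch below enumerates them under stated assumptions: the probe is written in the coding-strand sense as ATG GAY TGY TGG TAY GA (Y = C or T), which is inferred from the peptide Met-Asp-Cys-Trp-Tyr-Asp and the standard genetic code rather than taken from the paper; the synthesized oligonucleotide may in fact have been the complementary strand. The names `PROBE` and `expand` are illustrative.

```python
from itertools import product

# IUPAC codes needed here; Y stands for a pyrimidine (C or T).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "CT"}

# Assumed coding-sense probe for Met-Asp-Cys-Trp-Tyr-Asp, truncated to 17 bases.
PROBE = "ATGGAYTGYTGGTAYGA"


def expand(degenerate):
    """Return every unambiguous sequence represented by a degenerate oligonucleotide."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in degenerate))]


variants = expand(PROBE)
ambiguous_positions = [i + 1 for i, base in enumerate(PROBE) if base == "Y"]

# Member of the mixture matching the bases reported in the text:
# C in the Asp codon, T in both the Cys and Tyr codons.
matching = "ATGGACTGTTGGTATGA"

print(len(PROBE), ambiguous_positions, len(variants), matching in variants)
```

Under these assumptions the ambiguous positions fall at bases 6, 9, and 15 of the 17-mer, in agreement with the description above.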
The amino acid sequence in this region was determined using a Beckman Sequencer and the large proteolytic fragment generated by the action of chymotrypsin on the native luciferase; the sequence has been presented at a meeting (10). The residue at position 124 was not determined chemically; the DNA sequence indicated that the residue is Ser, consistent with poor recovery and difficulty in making an unambiguous identification with the chromatographic techniques employed. The residue at position 126 was erroneously identified as Lys in the earlier work (10). The DNA sequence unambiguously identified the residue as Ala. The error likely was due to the similarity of the chromatographic properties of PTH-Lys and PTH-Ala under the conditions employed. The residue at position 135 was erroneously identified as Phe in the chemical determination, again probably due to the similarity in the chromatographic properties of PTH-Met and PTH-Phe. Position 136 was not identified in the degradation of the protein.
The DNA sequence indicated the sequence Met-Lys for these positions, consistent with the peptide SAP 15, which had the sequence Leu-Met-Lys-Glu. The sequence of SAP 16 was identical with the sequence of residues 138-141 determined from degradation of the proteolytic fragment of the protein as well as that predicted from the DNA sequence.
Residues 144-355-The corresponding region from the DNA sequence (residues 1140-1771) was determined based on clones 7, 8, 9, 10, 11, 12, and 13. From about 1170 to base 1259 (the PstI site in Fig. 1), the DNA sequence was of the ... Fig. 1). From base 1196-1231, the DNA sequence was confirmed by the sequence of SAP 18. The sequence of SAP 18, determined by manual Edman degradation with the DNS-Cl technique, identified Ala (rather than Pro) at position 169 and Pro (rather than Ala) at position 174. The errors were probably due to the similarity in migration of DNS-Pro and DNS-Ala on polyamide sheets (20). SAP 19 was shown to contain a Trp based on absorbance spectroscopy, but of course DNS-Trp was not detected due to the acid hydrolysis step. The Trp was tentatively placed at the amino-terminal end of the peptide since no DNS derivative was obtained from the undegraded peptide. The DNA sequence from 1250-1261 confirms the location of the sequence of SAP 19 and the location of the Trp residue. Due to the alignment of peptides SAP 18 and 19 and the unambiguous sequences read from the DNA sequencing gels, we are confident that the sequence in this region is correct in spite of the fact that the sequence was derived from only one strand.
The sequence from position 1259 through the end of the 1.85-kb EcoRI fragment was determined from both strands with the exception of a stretch of ~10 bases from 1403 to ~1415, where the sequence was exclusively from clone 8 (Fig. 1). The alignment of SAP 20, 21, and 22 (amino acid residues 201-214; nucleotide residues 1307-1348) before this region and tryptic peptide T1 (amino acid residues 238-244; nucleotide residues 1421-1438) after the region of single strand data gives us confidence that the region is correct.
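The correspondence between amino acid residue numbers and nucleotide positions used throughout this section can be written as a one-line mapping. The following is a sketch assuming only that residue 1 is encoded by the codon beginning at base 707; the function name `residue_span` is illustrative. It reproduces the nucleotide ranges quoted for residues 1-26 and 201-214 and the end of the coding region at base 1771.

```python
LUXA_START = 707  # first base of the luxA initiation codon, as given in the text


def residue_span(first, last, start=LUXA_START):
    """Return (first_base, last_base) of the codons for residues first..last.

    Residue numbers are 1-based and inclusive; each residue occupies 3 bases.
    """
    if first < 1 or last < first:
        raise ValueError("residue range must satisfy 1 <= first <= last")
    return (start + 3 * (first - 1), start + 3 * last - 1)


print(residue_span(1, 26))     # amino-terminal region
print(residue_span(201, 214))  # region covered by SAP 20, 21, and 22
print(residue_span(1, 355))    # entire coding region
```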
The proteolytic fragment resulting from chymotryptic cleavage around residue 280 (position ~1250 in the DNA sequence) has been designated the light δ fragment (10). Isolation of this fragment by SDS-gel electrophoresis and sequence analysis on a Beckman Sequencer led to an erroneous sequence due to contamination of the sample (10). The sample actually contained a mixture of two fragments, one beginning with residue 281 and one beginning with residue 283. The two sequences would be determined in the following order. Sequence I begins with residue 281, sequence II begins with residue 283, and sequence III is the reported sequence of δL (10). Re-evaluation of the data necessitated by the lack of agreement with the DNA sequence demonstrated the existence of the two sequences (I and II given above). The error was due to the use of gas chromatography (22) and thin-layer chromatography to identify the PTH derivatives and the differential stabilities of the various PTH derivatives. The sequences of SAP 25, 26, and 27, and tryptic peptide T2 (see Table II) confirm the DNA sequence and are consistent with the above hypothesis explaining the earlier error (10).

Ser-Tyr-Glu-Ile-Asn-Pro-Val-
The sequence from position ~1600 in the DNA sequence to the end of the α subunit coding region was largely confirmed by the sequences of SAP peptides 28, 29, 30, 31, 32, and 33 (see Table II and Figs. 1 and 2).

Structure of the lux Region
Three parallel open reading frames are seen in the DNA sequence. The only complete reading frame is that encoding the α subunit, from nucleotides 707-1771. It encodes a protein of 355 amino acids with a calculated molecular weight of 40,108. This agrees well with the published molecular weight of 42,000 (4). In addition, the composition of the encoded protein (Table III) ...

DISCUSSION
The 1.85-kb EcoRI fragment described in this paper was isolated from a genomic clone bank of V. harveyi DNA on the basis of hybridization with a mixed sequence synthetic oligonucleotide probe designed from α subunit amino acid sequence (7,10). The complete sequence of the fragment, reported here, showed that it contained the entire α subunit-coding region, a region encoding the carboxyl-terminal 223 residues of the polypeptide of unknown function, and the amino-terminal 13 codons of the β subunit. The amino-terminal coding regions of the luxA and luxB genes imply amino acid sequences identical to those of the mature polypeptides (9), demonstrating that neither subunit undergoes post-translational processing in the amino-terminal region. Chemical modification and limited proteolysis studies with bacterial luciferase have shown that a highly reactive sulfhydryl group, thought to reside in or near the flavin-binding site, is located close to a region that is highly sensitive to proteases (10, 23, 24). In the complete sequence, the reactive cysteinyl residue is in position 106 (Fig. 3).
The proteolytic fragment resulting from chymotryptic cleavage around residue 280 has been designated the light δ fragment (10). We have re-evaluated the light δ protein sequence data because of a lack of agreement with the DNA sequence; we found that the light δ sample contained a mixture of 2 peptides resulting from cleavage by chymotrypsin at residues 280 and 282, resulting in errors in the protein sequence determination. The differential stabilities of the PTH amino acids derived in the sequence of the mixture account for the errors.
The secondary structure of the α subunit predicted by the modified method (25) of Chou and Fasman (26) is shown in Fig. 3. The prediction is that 34% of the residues should form α-helix while 12% should be in the β sheet configuration. These values are consistent with the measurements of 28% α-helix and 14% β sheet reported by Holzman and Baldwin for the dimer (27).
Luciferase is a highly soluble protein. For a protein of ~40,000 daltons, the α subunit has a high proportion of hydrophilic residues (aspartic acid, asparagine, glutamic acid, glutamine, lysine, and arginine). Compared with two other polypeptides of similar size, carboxypeptidase A (a monomer) and horse liver alcohol dehydrogenase (one subunit of the dimer), luciferase α subunit has about 25% more external residues (33% versus 26%). The proportion of hydrophobic residues (leucine, methionine, isoleucine, valine, cysteine, phenylalanine, and tyrosine) is about the same in all three examples (~34%), but luciferase α subunit has a lower fraction (33%) of "neutral" residues (alanine, threonine, glycine, proline, serine, histidine, and tryptophan) than the other two, with about 40% apiece. The amino-terminal third is the most nonpolar region of the subunit. From residues 49-84, 71% of the residues are nonpolar. There is only a single trypsin-sensitive bond between residues 29 and 98. A second area of high nonpolarity is from residues 166-199, which contains 59% nonpolar residues. These nonpolar regions are likely to be either internal in the native structure or involved in interactions with the β subunit. Between these two nonpolar regions is a highly polar region from residues 107-153 containing 36% charged residues as well as the reactive cysteine and the first protease labile region. A second polar region, from residues 262-297, has 42% charged residues and contains the second protease labile region. It will be especially interesting to compare the amino acid sequences in these regions with the same regions of the β subunit, since the two subunits appear to be homologous (9) but have very different functions.
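The three residue classes named above partition the 20 common amino acids, so the quoted percentages can be recomputed from any one-letter sequence. The following is a minimal sketch; the class membership is taken from the text, while the function name `class_fractions` is illustrative and the short test peptide (Met-Asp-Cys-Trp-Tyr-Asp, the probe region at residues 128-133) is used only as a worked example, not as a stand-in for the whole subunit.

```python
from collections import Counter

# Residue classes exactly as listed in the text (one-letter codes).
CLASSES = {
    "hydrophilic": set("DNEQKR"),
    "hydrophobic": set("LMIVCFY"),
    "neutral": set("ATGPSHW"),
}


def class_fractions(sequence):
    """Return the fraction of residues falling in each class."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter()
    for aa in sequence:
        for name, members in CLASSES.items():
            if aa in members:
                counts[name] += 1
                break
        else:
            raise ValueError(f"unknown residue: {aa!r}")
    return {name: counts[name] / len(sequence) for name in CLASSES}


peptide = "MDCWYD"  # Met-Asp-Cys-Trp-Tyr-Asp
fractions = class_fractions(peptide)
print(fractions)
```

Applied to a window of the sequence rather than the whole chain, the same function gives regional values of the kind quoted for residues 49-84 or 166-199, provided the same definition of "nonpolar" is used.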
In the nucleotide sequence, an open reading frame is seen upstream from luxA. The reading frame has the same polarity as luxA and extends beyond the proximal EcoRI site at the end of the cloned fragment. The reading frame has 223 codons and ends 34 nucleotides upstream from the luxA initiation codon. The random chance of not finding a stop codon in a sequence of 223 codons is less than 0.002%, strongly suggesting that this region encodes a polypeptide. Previous analyses of the proteins encoded by the 5-kb BamHI fragment that encompasses the 1.85-kb fragment (6) support this view. The BamHI fragment has about 1.2 kb upstream from the proximal EcoRI site. A clone carrying the fragment synthesizes an Mr 35,000 protein in addition to the 2 luciferase subunits. ... erase has been found in V. harveyi (6,28). No function has been ascribed to this protein.
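The layout of the upstream reading frame and the chance estimate can both be checked with a few lines of arithmetic. The sketch below rests on two assumptions that are not stated in the text: that the 669-base figure excludes the 3-base stop codon, and that "random" means 64 equally likely codons of which 3 are stops. Under the second assumption the probability comes out at roughly 0.002%, the same order as the figure quoted above; a different base composition would shift the value somewhat.

```python
ORF_CODONS = 223    # codons in the upstream open reading frame
SPACER = 34         # bases between the upstream frame and the luxA initiation codon
STOP_CODON = 3      # assumption: the stop codon is not counted in the 669 bases
LUXA_START = 707
LUXA_END = 1771

# Length of the upstream frame and the position at which luxA should then begin.
orf_bases = ORF_CODONS * 3
predicted_luxa_start = orf_bases + STOP_CODON + SPACER + 1

# Number of codons spanned by the luxA coding region.
luxa_codons = (LUXA_END - LUXA_START + 1) // 3

# Assumption: uniform codon model, 61 of 64 codons are sense codons.
p_open = (61 / 64) ** ORF_CODONS

print(orf_bases, predicted_luxa_start, luxa_codons)
print(f"P(no stop in {ORF_CODONS} codons) = {p_open:.2e}")
```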
Three areas of similar nucleotide sequence suggesting homology have been noted preceding the luxA and luxB structural genes (7). Two of these areas not only have homology with each other but also have homology in sequence and location with RNA polymerase recognition sequences (-10 and -35 sequences). The complete nucleotide sequence shows that the homology with -10 is contained in the intergenic regions preceding both genes. Perhaps surprisingly, the homology with the -35-like sequence preceding luxB is contained within the carboxyl-terminal coding region of luxA while the corresponding region before luxA is in the carboxyl-terminal coding region of the upstream gene. The DNA sequence information is the only evidence suggesting that these sequences could function as promoters. Furthermore, although cloned genes from V. harveyi can complement mutations in Escherichia coli (29), there is no evidence indicating that the native promoters are present and/or functional in the clones. Thus, it is not known if promoters in V. harveyi are at all similar to those seen in E. coli. Finally, if the lux genes are cotranscribed, as they appear to be in Vibrio fischeri (30), these sequences must have a function other than as promoters, or perhaps they are secondary promoters which function only under special conditions. A third region of homology that has been noted preceding the 2 luciferase structural genes is the ribosome-binding site (31, 32). The luciferase subunits are required in equimolar amounts, and free α or β subunits are not found in wild-type V. harveyi cells (33). In some systems in which functionally related proteins are required in equal amounts, stoichiometric synthesis is achieved by translational coupling (34,35). In these cases, the 2 genes are adjacent, cotranscribed, and the stop codon of the proximal gene is closely followed by the initiation codon of the distal gene.
The second gene typically has no ribosome-binding site preceding it, and ribosomes do not discharge at the end of the first gene. If cotranscription is assumed, on the basis of structural considerations, classical translational coupling is not operating in translation of the lux genes; ribosome-binding sites precede both genes, and 26 nucleotides separate the genes. It is tempting to speculate that the sequence conservation seen between the 2 ribosome-binding sites results in identical rates of translational initiation, resulting in synthesis of equal amounts of each subunit.
* C. Bieger, personal communication.
The codon usage of the luxA gene of V. harveyi, presented in Table IV, shows a bias in codon selection, as do E. coli and all other organisms for which genes have been sequenced (36). Parker et al. (37) have shown a relationship between codon usage and translational fidelity and suggest that codons are selected in order to reduce third position misreading. In codon groups where the third position misreading would result in amino acid substitution, G or C is preferred in the third position. luxA does not follow the rule. The codons for Asn, Gln, His, and Phe show no bias while the codons for Lys show a preference for A over G. The lack of codon bias in these groups suggests that either luxA is not representative of codon usage in V. harveyi or that the parameters dictating codon usage in V. harveyi are significantly different from those dictating codon usage in other organisms such as E. coli. Until more is known about V. harveyi tRNAs, it will be difficult to determine what pressures have been at work in codon selection.
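A codon usage table of the kind referred to above is obtained by counting in-frame triplets, after which the usage within a group of synonymous codons can be compared. The following is a minimal sketch; the function names are illustrative and the short coding sequence is a hypothetical toy, not the luxA sequence, chosen so that Lys favors AAA over AAG while Phe shows no preference.

```python
from collections import Counter


def codon_counts(cds):
    """Count in-frame codons of a coding sequence whose length is a multiple of 3."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return Counter(cds[i:i + 3] for i in range(0, len(cds), 3))


def synonymous_usage(counts, codons):
    """Return the counts for a chosen group of synonymous codons."""
    return {codon: counts.get(codon, 0) for codon in codons}


# Hypothetical toy coding sequence: ATG AAA AAG AAA TTT TTC
toy_cds = "ATGAAAAAGAAATTTTTC"
counts = codon_counts(toy_cds)

lys_usage = synonymous_usage(counts, ["AAA", "AAG"])  # Lys: third base A or G
phe_usage = synonymous_usage(counts, ["TTT", "TTC"])  # Phe: third base T or C

print(lys_usage, phe_usage)
```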
The sequence of the luxA gene and its surrounding DNA sequences indicates that the genes of Vibrio harveyi are not radically different from those of other bacteria. Information on the fine structure of the luxA region suggests that the luciferase genes are cotranscribed and that they are in an operon with at least one additional gene. Recent work by Engebrecht and Silverman (30) indicates that the lux gene family in V. fischeri is composed of two operons comprising 7 distinct complementation groups, the two luciferase subunits, three polypeptides involved in aldehyde metabolism, and two genes involved in regulation of the expression of bioluminescence.
The amino acid sequence of the α subunit has allowed insight into the structure and function of the enzyme, supporting the biochemical data. The sequence of the luxB gene should expand our understanding further and perhaps indicate the manner in which the two subunits interact. It will also be important to define the limits of the operon and identify the functions of the additional genes in V. harveyi, for which there is such a wealth of biochemical data, as has been accomplished for V. fischeri (30).