The nucleotide sequence of the Mr = 28,500 flagellin gene of Caulobacter crescentus.

The DNA sequences which encode the Mr = 28,500 flagellin polypeptide of Caulobacter crescentus CB15 have been determined. The size of the protein, deduced from its DNA sequence (276 amino acids), is in agreement with its apparent molecular weight as measured by sodium dodecyl sulfate-polyacrylamide gel electrophoresis. The distribution of arginine residues within the protein sequence encoded by the gene correlates with their relative location as predicted by peptide alignment analysis (Gill, P.R., and Agabian, N. (1982) J. Bacteriol. 150, 925-933). DNA sequences 5' and 3' to the coding sequence were also determined. In the 5' region, DNA sequences homologous to consensus sequences associated with RNA polymerase recognition and transcription initiation sites in Escherichia coli (Pribnow box) are found. These are centered around 60, 90, and 120 base pairs upstream from the ATG codon at the beginning of the structural gene. Sequences 3' to the coding region were identified which might signal transcription termination. A typical E. coli 16 S ribosomal binding site (Shine-Dalgarno sequence) is located just 5' to the coding sequence, and for most of the amino acids there is a strong codon usage preference. Although this protein is exported from the cell (Gill, P.R., and Agabian, N. (1982) J. Bacteriol. 150, 925-933), the encoded NH2-terminal amino acid sequence is not different from the mature product.

Cell division in the prokaryote Caulobacter crescentus produces two progeny cells which are each functionally differentiated with respect to surface structures (1-4) and developmental potential (5, 6). One of the more obvious manifestations of differentiation in this system is the appearance and disappearance of a single polar flagellum on one of the progeny cells during a restricted interval of the cell cycle (1, 3).
Unlike most bacterial flagella, the filament portion of the Caulobacter flagellum consists mainly of two distinct polypeptide monomer types: Mr = 25,000 and 27,500 (7). In addition, at least one (Mr = 28,500) and perhaps two (Mr = 24,500) other flagellin monomers have been identified by immunologic (3) or by genetic (8) techniques.
A comparison of the peptide maps and tryptic peptide sequences obtained from the Mr = 25,000 and 27,500 flagellins was carried out by Weissborn et al. (9); their studies indicated that these two proteins are encoded by distinct structural genes. In a concurrent study, we showed by NH2-terminal amino acid sequence analysis, peptide mapping, and a peptide alignment technique that the Mr = 25,000 flagellin and a polypeptide with an apparent Mr = 27,500 were derived from distinct structural genes (10). Our flagellin protein isolation conditions were novel and involved the use of strong denaturants, but did not impose a criterion of acid solubility and reassociation into filaments for assay of flagellin polypeptides as described in (9). The DNA sequence reported in this paper, in conjunction with protein sequence analysis (9, 10) of the various flagellins, indicates that the protein sequence reported by Gill and Agabian (10)
The flagellins are the most abundant components of the flagellar apparatus and thus serve as indicators in studying the regulation of expression of this organelle. Flagellin expression during Caulobacter development is regulated primarily at the transcriptional level (11, 12). Sequences flanking the flagellin coding sequence identified in this study represent possible regulatory regions which function in the transcriptional regulation of this gene. Since this is the first gene isolated and sequenced for Caulobacter, the analysis of codon preference for this gene provides a percentage index of codon usage in these G + C-rich organisms.
$ Recipient of National Institutes of Health Grant R01 GM 25527.
intervals for both orientations of the fragment to be sequenced. These clones were obtained using subcloning strategies described below.
The entire coding sequence of the Mr = 28,500 flagellin gene is contained within the 2.2-kb SalI fragment of pCA110 (see Fig. 1). 200 µg of purified pCA110 plasmid DNA were digested with SalI and electrophoresed on a 1.2-cm thick 0.7% agarose gel (Seakem), and then the 2.2-kb SalI fragment was electroeluted using DEAE-paper (Whatman) (19). PstI cleavage of the 2.2-kb SalI fragment produced a 0.37-kb and a 0.93-kb PstI-SalI fragment and a 0.81-kb PstI fragment. These were cloned into SalI-PstI-cut M13mp8 and mp9 vectors or into a PstI-cut M13mp8 vector, respectively. The orientation of the three PstI subfragments of the 2.2-kb SalI fragment was determined using the asymmetric HpaI, PvuII, and ClaI sites (see Fig. 1).
DNA sequence analysis of one of the ends of the 0.81-kb PstI fragment located the NH2-terminal coding region and sequences 5' to the gene. Sequence analysis of the ends of the 0.93-kb SalI-PstI fragment identified some internal sequences and the sequences in the 3' region of the gene. Random subclones of the 0.93-kb SalI-PstI fragment were constructed in order to obtain overlapping sequences within the gene. In one approach, AccI-digested M13mp8 vector was used to clone TaqI or HpaII partial digestion products, and a majority of the sequences in the 0.93-kb SalI-PstI fragment were determined using these subclones. Alternatively, Sau3A partial digestion products of this fragment were cloned into a SalI-BamHI-cut M13mp9 vector. The recombinants generated by this approach contained variable length inserts starting at different Sau3A sites within the 0.93-kb SalI-PstI fragment and ending at the SalI site of this fragment. Fig. 1 shows the sequence strategy and indicates that both strands of ~80% of the gene have been sequenced. All nucleotide residues presented have been determined from at least two and usually three independent sequencing reactions.
The fragments cloned into M13 vectors for DNA sequence analysis were routinely analyzed by hybridization, tests of complementarity with other M13 clones, the size of the inserted sequence, and single lane sequence analysis (16). Not all of these tests were used for each M13 clone.

RESULTS AND DISCUSSION
Both genomic and plasmid mapping analyses were used for deducing the restriction map of the 2.2-kb SalI fragment of pCA110 (Fig. 1). Shown in Fig. 2 is the DNA sequence analysis that was used to identify the coding sequence for the NH2-terminal amino acids and sequences 5' to the Mr = 28,500 flagellin gene, is presented in Fig. 3. In Fig. 4, sequences which contain potential transcription initiation regions are shown in more detail. Preceding the translation start signals at -5 to -9 (discussed below) are three regions which contain sequences that typically identify transcription start sites in E. coli (20, 21). The sequences associated with the Mr = 28,500 flagellin gene are centered at about positions -60, -90, and -120 relative to the initiator Met codon sequence. In E. coli, the consensus sequence TATAAT is centered about 10 bp upstream from the start site of transcription, and the transcripts usually begin with either a G or an A residue. In such TATAAT sequences, the last T is invariant, the beginning TA is highly conserved, and the remaining TAA is subject to greater variation (20). At positions -90 and -120 are clusters of sequences which are consistent with the TATAAT sequence, whereas at the -60 position, a single such sequence
The abbreviations used are: bp, base pair; kb, kilobase pair.
Sequences which may signal the termination of Mr = 28,500 flagellin gene transcription have been more difficult to identify on the basis of comparisons with known prokaryotic termination signals. The sequences of transcription termination signals are less conserved than those of basic promoter sequences. Furthermore, the difficulties in identifying transcription termination signals are increased for termination signals which have increased ρ dependence; nothing is known about ρ-like factors in C. crescentus.
Nevertheless, where information is available, sequences which are correlated with transcription termination include those with regions of hyphenated dyad symmetry, containing a G + C-rich region and a T-rich region 3' to the dyad (20). A region of hyphenated dyad symmetry 3' to the coding region of the Mr = 28,500 flagellin gene exists from positions 840 to 883 bp (Fig. 3).
The T-rich region found in other systems, however, seems to be absent in the 3'-untranslated region of the Caulobacter gene. Whether any of these Caulobacter sequences are functional in transcription termination must await further analysis.
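The search for hyphenated dyad symmetry described above can be illustrated with a short script. This is only a minimal sketch, not the procedure used in this study: the function names, the fixed arm length, the requirement for perfect complementarity, and the test sequence are all assumptions made for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def find_hyphenated_dyads(seq, arm=6, max_gap=10):
    """List (left_start, left_arm, gap) for perfect inverted repeats.

    A hit means that seq[left_start:left_start + arm] is followed, after
    `gap` intervening bases, by its exact reverse complement.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * arm + 1):
        left = seq[i:i + arm]
        target = reverse_complement(left)
        for gap in range(max_gap + 1):
            j = i + arm + gap
            if j + arm > len(seq):
                break
            if seq[j:j + arm] == target:
                hits.append((i, left, gap))
    return hits


# Hypothetical test sequence, not taken from the flagellin gene.
example = "TTGCCGGATAAAATCCGGCAA"
for start, left_arm, gap in find_hyphenated_dyads(example, arm=6, max_gap=6):
    print(start, left_arm, gap)
```

Overlapping hits that differ by one position belong to the same longer stem. A fuller treatment would also score imperfect repeats and test for a T-rich run 3' to the dyad.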
In addition to the sequences shown in Fig. 3, we have sequenced 345 bp upstream from the coding region and have not identified any sequences which have homology with the
the Mr = 28,500 and 25,000 flagellins specifically cross-hybridize.3 Thus, one might expect a hybridization probe prepared for a given flagellin gene to cross-hybridize with mRNA molecules encoded by the other flagellin genes. The different flagellin proteins appear to be synthesized in vastly different amounts, and clearly the mRNA for the Mr = 25,000 flagellin would be expected to be the most abundant mRNA in the cell, although this may not necessarily be the case at early cell cycle times of flagellin gene expression. Further experiments are in progress to determine the appearance of flagellin mRNAs during the cell cycle using gene probes from unique regions of each of the flagellin genes. The flagellin genes of other bacteria are adjacent to transcription initiation signals (24), and usually constitute a single operon. Thus, the Mr = 28,500 flagellin gene in Caulobacter would be novel in its regulation if the gene were a downstream gene in a polycistronic operon.
The confirmation of sequences as being promoters or terminators of transcription must await further genetic and biochemical analysis. Techniques such as S1 mapping (25) of the ends of the transcript and in vitro transcription will enable a more comprehensive analysis of these putative regulatory regions. It must be borne in mind, however, that the cross-hybridization of the flagellin genes may be a persistent problem in these analyses, and the construction of the appropriate mutant strains may be required before meaningful conclusions can be drawn.
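Before such experiments, candidate Pribnow boxes of the kind discussed earlier can be listed by a simple scan for TATAAT-like hexamers. The sketch below is illustrative only. It assumes, following the conservation pattern described for E. coli, that the final T is required and that a small number of mismatches is tolerated elsewhere; the function name, the mismatch threshold, and the test sequence are hypothetical.

```python
CONSENSUS = "TATAAT"


def pribnow_candidates(seq, max_mismatches=1):
    """Yield (offset, hexamer, mismatches) for TATAAT-like hexamers.

    The last T is required (treated here as invariant); up to
    `max_mismatches` differences are allowed in the first five positions.
    """
    seq = seq.upper()
    for i in range(len(seq) - 5):
        hexamer = seq[i:i + 6]
        if hexamer[5] != "T":
            continue
        mismatches = sum(a != b for a, b in zip(hexamer[:5], CONSENSUS[:5]))
        if mismatches <= max_mismatches:
            yield i, hexamer, mismatches


# Hypothetical sequence, not the flagellin 5' region.
for hit in pribnow_candidates("GGCTATAATGCGTAGAATCC"):
    print(hit)
```

A match found this way is only a candidate; as noted above, assignment of a functional promoter requires mapping of the transcript ends.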
Translation Signals-Centered at the -7-bp position relative to the coding sequence of the Mr = 28,500 flagellin gene (Figs. 3 and 4) is the sequence GGAG, which corresponds to a transcribed consensus sequence found 5' to E. coli genes. This sequence is thought to hybridize to the 3' segment of 16 S ribosomal RNA during translation initiation (26). The appropriate complementary sequence has been found at the 3' end of the 16 S ribosomal RNA of C. crescentus (26). The ribosomal genes of C. crescentus also appear to be similar to those of E. coli (17). A methionine codon at the 5' end of the coding sequence for the Mr = 28,500 flagellin gene precedes the known NH2-terminal amino acid sequence (10), suggesting that the initiator N-formylmethionine residue, which is encoded as an ATG at the 5' end of the gene, is cleaved from the NH2 terminus of the newly synthesized polypeptide.
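The position of a Shine-Dalgarno-like motif relative to the initiation codon can be computed as sketched below. The function name, the search window, and the test sequence are assumptions for illustration. The numbering convention assumed here places the base immediately 5' to the A of the ATG at -1.

```python
def shine_dalgarno_sites(seq, start_codon_index, motif="GGAG", window=20):
    """Return centre positions of `motif` relative to the start codon.

    `start_codon_index` is the 0-based index of the A of the ATG. Only
    motifs lying entirely 5' to the ATG and starting within `window`
    bases of it are reported. The centre is the mean of the positions
    of the first and last base of the motif.
    """
    seq = seq.upper()
    lo = max(0, start_codon_index - window)
    centres = []
    i = seq.find(motif, lo)
    while i != -1 and i + len(motif) <= start_codon_index:
        first = i - start_codon_index      # position of first motif base
        last = first + len(motif) - 1      # position of last motif base
        centres.append((first + last) / 2)
        i = seq.find(motif, i + 1)
    return centres


# Hypothetical sequence with GGAG at -8 to -5; the ATG starts at index 10.
print(shine_dalgarno_sites("CCGGAGTTCAATGGCC", 10))
```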
At least two termination codons exist in every reading frame within 200 bp 5' to the coding sequence (Fig. 3). Furthermore, the occurrence of the ribosome binding site centered at -7 indicates that there is no signal sequence which is cleaved from the NH2 terminus of the Mr = 28,500 flagellin polypeptide during export and assembly. The Salmonella typhimurium H2 antigen (flagellin) also does not contain a signal sequence (28), perhaps indicating that at least the flagellins of Gram-negative bacteria are representative of a class of proteins which do not require an NH2-terminal signal sequence for externalization. Although the exported proteins of E. coli are thought to be secreted at distinct sites in the membrane (29), the processes which result in flagellin export and assembly may be quite different and may involve an interplay between flagellin and other polypeptides which make up the components of the flagellar organelle.
The translation termination codon of the gene coding sequence is UAA. This codon is probably the most efficient of the terminator codons (30) and the major terminator used in E. coli (31). Several other terminator sequences are found in all three reading frames 3' to this termination codon.
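The statements about termination codons in each reading frame, both 5' and 3' to the coding sequence, correspond to a simple tally of stop codons by frame. The following is a minimal sketch; the function name and the test sequence are hypothetical, and only the three forward frames of the given strand are examined.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stops_by_frame(seq):
    """Return {frame: [offsets]} of stop codons in the three forward frames."""
    seq = seq.upper()
    result = {0: [], 1: [], 2: []}
    for frame in range(3):
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                result[frame].append(i)
    return result


# Hypothetical sequence, not the flagellin flanking region.
print(stops_by_frame("ATGTAAGCTGACTAGA"))
```

Applied to the 200 bp preceding the initiation codon, a frame with no stops would leave open the possibility of an upstream translational start.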
The distribution of arginine residues in the protein is in agreement with the peptide alignment data used to character-
19,600 and 13,200 (Fig. 3). The distribution of the glutamic acid and aspartic acid residues is also consistent with the peptides generated in peptide mapping experiments.
Codon Usage-There is a strong codon preference for most of the amino acids encoded in the Mr = 28,500 flagellin gene.
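Codon preference of this kind is obtained by counting codons in the reading frame and, for each amino acid, noting which synonymous codon is used most often. The sketch below is illustrative only: the function names and the test sequence are hypothetical, and the synonymous-codon table covers just three amino acids of the standard genetic code.

```python
from collections import Counter

# Subset of the standard genetic code in the RNA alphabet (illustrative only).
SYNONYMOUS = {
    "Phe": ["UUU", "UUC"],
    "Leu": ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"],
    "Lys": ["AAA", "AAG"],
}


def codon_usage(coding_seq):
    """Count codons in an in-frame coding sequence (DNA or RNA letters)."""
    seq = coding_seq.upper().replace("T", "U")
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return Counter(seq[i:i + 3] for i in range(0, len(seq), 3))


def preferred_codons(counts, synonymous=SYNONYMOUS):
    """Return {amino_acid: most frequent codon} for amino acids that occur."""
    preferred = {}
    for amino_acid, codons in synonymous.items():
        if sum(counts[c] for c in codons) == 0:
            continue
        preferred[amino_acid] = max(codons, key=lambda c: counts[c])
    return preferred


# Hypothetical 7-codon reading frame, not the flagellin sequence.
usage = codon_usage("ATGCTGCTGTTCAAGCTCTAA")
print(preferred_codons(usage))
```

Comparing the preferred codons of two genes or organisms then reduces to comparing the two resulting dictionaries.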
For a number of highly expressed genes of E. coli (32) and the budding yeast Saccharomyces cerevisiae (33), there is a strong codon preference; about 70% of the codons that are preferred by the Mr = 28,500 flagellin gene are the same ones preferred by the highly expressed E. coli genes (Table I). This is some-
In E. coli, the codons that are preferred are directly correlated with the relative abundance of tRNA isoacceptor species in the cell (32, 36). Changes in the relative abundance of isoacceptor pools for given tRNA species have been correlated with distinct differentiated states in a number of cell types (37). It will be interesting to compare the codon preferences for a number of Caulobacter genes expressed at different times in development.
Dyad Symmetry 5' to the Gene-5' to the coding region of the Mr = (Fig. 4). Other regions of dyad symmetry are present which overlap the 5' arm of this large domain and several sequences 3' to this domain. Another dyad overlaps the 3' arm of the large domain and the sequences which encode the first few amino acids in the coding sequence. In addition, two other overlapping regions of dyad symmetry are centered around positions -20 and -12 relative to the initiation codon of the coding sequence. One of these (-12) overlaps the proposed ribosomal binding site (Shine-Dalgarno sequence).
The extent of dyad symmetry in the 5' region suggests the potential formation of secondary structure which might have functional significance in the regulation of mRNA expression. The largest domains of dyad symmetry in the 5' region of the gene suggest the possibility that specific regulatory proteins might interact with DNA sequences in a manner analogous to the interaction of the lac repressor and the lac operator (38). The function of such a DNA binding protein might be to repress the transcription of this gene during a specific period of the cell cycle. The occurrence of stage-specific DNA binding proteins has been noted in Caulobacter (5,6, 39).
A (dC-dG)n sequence block is located just 3' to the large dyad centered at -60 bp. When the negative superhelical density of DNA containing alternating dC-dG sequences is greater than 0.072, these sequences can undergo a transition in helical state (from right-handed to left-handed) (40). The dC-dG block located between -60 and the structural gene might thus act to facilitate the formation of a cruciform structure by the large dyad. If the formation of such a cruciform structure is required for regulating transcription, the degree of supercoiling in this region may influence flagellin gene expression. The highly condensed state of the newly replicated swarmer chromosome probably reflects a change in its superhelical density as compared with that of the stalked cell chromosome (41). A discrete interval of DNA replication is required for flagellin synthesis (42), which further suggests that localized changes in supercoiling, such as occur during DNA replication, may influence flagellin gene expression.
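Blocks of alternating dC-dG can be located in a sequence with a regular expression. This is a minimal sketch: the function name, the minimum repeat number, and the test sequence are assumptions, and the scan addresses sequence only; it says nothing about the superhelical density at which a helical transition would occur.

```python
import re


def alternating_cg_blocks(seq, min_repeats=3):
    """Return (start, block) for runs of strictly alternating C and G.

    A run is reported when it contains at least `min_repeats` consecutive
    dinucleotide repeats, in either phase (CGCG... or GCGC...).
    """
    pattern = re.compile(
        r"(?:CG){%d,}C?|(?:GC){%d,}G?" % (min_repeats, min_repeats)
    )
    return [(m.start(), m.group()) for m in pattern.finditer(seq.upper())]


# Hypothetical sequence, not the flagellin 5' region.
print(alternating_cg_blocks("ATTACGCGCGCGTTA"))
```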
Although nothing is known about the stability of the Mr = 28,500 flagellin mRNA, the Mr = 25,000 flagellin mRNA has been shown to have a relatively long functional half-life (2, 11). The mRNA species encoded by the lipoprotein gene (43) and the ompA gene (44) of E. coli also have relatively long functional half-lives, and predictions of secondary structure have suggested that a number of stem-loop structures could be formed to stabilize the mRNA of these genes in vivo. A similar kind of argument may be made for the Mr = 28,500 flagellin mRNA based on possible secondary structure of the mRNA. If transcription were initiated at -90 (Fig. 4), then a particularly stable mRNA structure might be formed centered at position -60. These stem-loop structures in mRNA may interact specifically with other cellular components, perhaps to sequester the mRNA at defined sites in the cell membrane during flagellin biosynthesis.
In contrast to the secondary structures proposed for ompA and lipoprotein mRNAs, the Mr = 28,500 flagellin gene has regions of dyad symmetry which include the ribosomal binding sites. This raises the possibility that secondary structure in the region can be important in regulating the initiation of flagellin mRNA translation. In support of this possibility, Iserentant and Fiers (45) have shown that the efficiency of translation initiation is directly dependent on the degree of
Mr = 28,500 flagellin gene codon usage
(24). In this model, the NH2-terminal sequences, and to a lesser extent the carboxyl-terminal sequences, are essential for flagellin function. The central portion of a bacterial flagellin monomer appears to be quite variable and is usually the site of mutations which alter antigenicity. Immunochemical and chemical analysis (24) of the central portion of the protein sequence of other flagellin monomers indicates that this region typically contains the antigenic determinants for the protein (24). We have found significant sequence homology between Mr = 25,000 and 28,500 flagellin genes in both the NH2-terminal and carboxyl-terminal regions of the protein as would be predicted by Iino's model.3 The protein encoded by the Mr = 28,500 flagellin gene was tested for antigenicity using an antigenic site predictor (46)
The high degree of sequence homology found in the NH2-terminal portion of flagellins from both Gram-positive and Gram-negative, enteric and nonenteric bacteria (10) suggests that these regions of the molecule are essential for polymerization. This is in fact borne out by mutational analysis (24).
Relationship to Other Flagellin Genes-Lagenaur and Agabian (3) first observed in C. crescentus CB15 that in addition to the two major filament monomers, Mr =
produced in small amounts and it appeared transiently during the cell cycle, just preceding the appearance of Mr = 27,500 and 25,000 flagellins. Several lines of evidence, discussed below, suggest that the gene described in this paper encodes the Mr = 28,500 flagellin polypeptide described by Lagenaur and Agabian, and the sequence data provided in this report formally exclude the possibility that the Mr = 28,500 flagellin is a precursor to either Mr = 25,000 or 27,500 flagellin.
Flagellin monomer purification procedures reported by others (9, 48) and those previously described from this laboratory (23) used intact flagellar filaments as starting material. The more recent technique used by us to prepare flagellin polypeptides involves their isolation from insoluble aggregates which form in the medium under certain growth conditions; these aggregates contain large bundles of intact flagellar filaments, and much greater quantities of flagellin are obtained under these conditions as compared with those used previously (10). The purification of flagellin monomers from these aggregates requires their denaturation in urea and Triton X-100, and these denatured flagellins were used as a starting material for purification. Although the protein we purified and used for structural and NH2-terminal amino acid sequence analysis had an apparent Mr = 27,500 and had the same ion exchange elution properties as the Mr = 27,500 flagellin described by Fukuda et al. (48), its actual molecular weight is closer to 28,500 as deduced by DNA sequence analysis. The small differences in molecular weight between this polypeptide and the Mr = 27,500 flagellin, coupled with its ion exchange behavior after denaturation in urea and Triton X-100, were not recognized as unusual at the time.
After determination of the sequence of the Mr = 28,500 flagellin (then thought to be the Mr =