A GENE ORDER DATABASE OF PLASTID GENOMES

A gene order database of 32 completely sequenced plastid genomes was developed. The data structure is formally identical to that of the feature tables in the major GenBank/EMBL/DDBJ databases. The quality of annotations was largely improved. A normalizing gene-labeling system across the complete plastid genomes was developed so that comparative studies are made available without having to go back to sequence analysis. Many incorrect coordinates of tRNA-encoding regions found in the major databases were corrected. We attempted to distinctively label tRNA genes with the anticodon sequence CAT, which encodes either the initiator tRNA, elongator tRNA, or Ile-tRNA. The database is available at http://www.rs.noda. tus.ac.jp/~kunisawa.


INTRODUCTION
Nucleotide or amino acid substitutions provide measures of sequence similarity, which have been widely used to assess functional and phylogenetic relationships.The major sequence databases, GenBank, EMBL and DDBJ, have played a key role in this purpose.The advent of complete genomic sequence data has provided a new opportunity to investigate more macroscopic evolutionary events, such as duplication, inversion and transposition of parts of a genome.For example, phylogenetic trees are derived from the comparison of gene orders between different genomes (Sankoff, Leduc, Antoine, Paquin, Lang & Cedergren, 1992;Korbel, Snel, Huynen & Bork, 2002).Proteins encoded by gene pairs in a conserved order appear to interact physically (Dandekar, Snel, Huynen & Bork, 1998).The prevalence of small inversions is suggested in yeast genome evolution (Seoighe, Federspiel, Jones, Hansen, Bivolarovic, Surzycki et al., 2000).There are, however, no publicly available gene order databases.The major sequence databases can serve as surrogates for a gene order database.There are, however, a number of problems associated with the use of the major sequence databases for gene order comparison.The most serious has to do with the comparability of different database entries.Two orthologous genes may be labeled in different ways in two different database entries (sequences).This difficulty is overcome in the COG database (Tatusov, Galperin, Natale & Koonin, 2000;Tatsuov, Natale, Garkavtsev, Tasuova, Shankavaram, Rao et al., 2001), in which a unique four-digit number is systematically assigned to each group of orthologous (plus paralogous) genes present in different genomes.Such efforts enable an easy comparison of the orders of protein-encoding genes between different genomes.However, genes coding for transfer and ribosomal RNAs are not taken into consideration in the COG database.As will be shown, errors are found at high frequencies in the major sequence databases with respect to the coordinates and/or annotations of genes specifying tRNAs.RNA-encoding genes were also neglected in a recent comparative analysis of protein sequences from 19 plastid genomes by Rivas, Lozano and Ortiz (2002).
According to the GenBank/EMBL/DDBJ databases, more than 30 plastid genomes have been completely sequenced to date.
Under these circumstances, we have decided to develop a gene order database for complete plastid genomes, which can be regarded as excellent model systems in computational genomics studies because of the considerable number available and their small size.Plastid genomes are circular in shape and are of 100-200 kb in size, containing between 30 to 50 different RNA genes and 100 protein-encoding genes in land plants or 200 protein genes in red algae (Sugiura, 1995).We attempted to develop a normalizing gene-labeling system across the completely sequenced plastid genomes in a comparable way.We will show that gene order comparison presents another support both for the identification of orthologous genes with only a weak sequence similarity and for the assignment of a tRNA gene that has the anticodon sequence CAT encoding either initiator Met-tRNA, elongator Met-tRNA or Ile-tRNA.Initiator tRNA is used for the initiation of protein synthesis in all organisms including plastids, whereas elongator tRNA is used for the insertion of methionine into internal peptidic linkages (e.g., Marck & Grosjean, 2002).As well as both initiator and elongator Met-tRNAs, the same anticodon sequence CAT is shared by a peculiar Ile-tRNA species, which recognizes an isoleucine codon AUA by post-transcriptional base modification (Muramatsu, Yokoyama, Horie, Matsuda, Ueda, Yamaizumi et al., 1988).It is not always easy to distinguish among the three types of tRNA genes from sequence data alone, and their specification remains incomplete in the major databases.In the plastid genomes gene-order database, we have tried to distinctively label tRNA genes with the anticodon sequence CAT.

ANNOTATIONS OF PROTEIN-ENCODING GENES
The first step in normalized gene-labeling across plastid genomes is the comparison of all the protein sequences from a plastid genome with all the proteins from other plastids using the FASTA computer program (Pearson & Lipman, 1988).Orthologous relationships were identified on the basis of sequence similarity.In the all-by-all FASTA analysis, we used two criteria for detecting orthologous gene pairs from different genomes; (i) a level of amino acid identity higher than 30% and (ii) a region of similarity longer than any of the halves of the either of two protein-lengths.Another gene from a third genome was included in this orthologous group if its protein sequence satisfied the homology criteria when compared to at least one member of the orthologous group.A unique label, which was taken from the GenBank/EMBL/DDBJ annotations, was assigned to the orthologous group thus identified.When orthologous genes were labeled in different ways in the major databases, we arbitrarily used one of the alternate labels (Table 1).(vi) apcA, apcB, apcD, apcF, cpcA, cpcB and cpeB, (vii) ycf27 and ycf29.On the basis of multiple sequence alignments by ClustalW (Thompson, Higgins, & Gibson, 1994) and gene-order comparisons between genomes we have confirmed that these paralogous genes were correctly labeled to reflect their orthologous relationships in the major databases.Using this all-by-all FASTA analysis, we were able to label more than 98% of a total of 3497 protein-encoding genes in a comparable way.
The second step is a gene order comparison for gene pairs that do not satisfy the homology criteria mentioned above.For each of these genes, their neighboring genes were examined and their gene orders were compared between genomes.We found that gene orders are well conserved for gene pairs that show sequence similarities of over 15% amino acid identity irrespective of the length of sequence similarity.A typical example where an amino acid identity is only 16% is observed in a comparison of two open reading frames labeled ORF111 from Porphyra (111 aa) and ycf41 from Odontella (113 aa) in the major databases.Although the level of sequence similarity is low, identical gene orders, ycf39-ORF111-psbI and ycf39-ycf41-psbI, are found in both genomes, suggesting an orthologous relationship between the gene pair.Based on such gene order comparisons, we were able to label about 60 protein-encoding genes, which remained unlabeled in the first step.
Using this methodology we labeled a total of 3497 individual protein-encoding genes from 32 plastid genomes in a consistent and comparable way.A substantial fraction of them, i.e. 2993 genes, were identically labeled to the major GenBank/EMBL/DDBJ databases.A complete list of differences between the present database and the major databases is given in Appendix 1.A major difference arises from the fact that most pre-existing gene-labels in the major databases are not updated when homologous relationships are found between a new sequence and pre-existing sequences.Alternate gene labels (synonyms) are not normalized in the major databases, which is another source of difference.
Plastid genomes encode many short proteins of less than 100 amino acids.This generally makes it difficult to detect orthologous relationships among them, since shorter proteins contain less information.
Here, in addition to the primary sequence comparison of individual proteins, both protein length and gene order comparisons were included in orthology detection.Although we have used the identity % obtained by the FASTA alignment in the orthology criteria, this similarity measure can be replaced by the chance probability (P-value) of obtaining the FASTA sequence similarity score, which is known to be a better measure in the detection of weak similarity.Note that orthology detection based on the P-value alone becomes complex and unreliable when too many paralogs are present.The level of sequence similarity, protein length, and gene order are key elements in orthology identification

ANNOTATIONS OF RNA-ENCODING GENES
In the course of our initial survey of the GenBank/EMBL/DDBJ sequence entries, non-homogeneous annotations of tRNA-specifying genes were noticed.In one entry the anticodon species of a tRNA gene is listed, while in the other it is not listed.Similarly, the distinction between an initiator Met-tRNA and elongator Met-tRNA is well annotated in one entry, but is ignored in the other.Thus, we have also developed a normalizing gene-label for genes encoding tRNAs.Identification of tRNA genes was carried out with the tRNAscan computer program (Lowe & Eddy, 1997), which reports both the coordinates of a tRNA-specifying region along the genome and its corresponding anticodon species.Our search only failed to find eight tRNA-encoding regions that were annotated in the major databases, and a previously unmentioned Arg-tRNA gene was identified in the Odontella (Osi) plastid.The discrepancies between the GenBank/EMBL/DDBJ annotations and our search results are commented on in the present gene order database (see Section 5).Here we adopted a tRNA gene-labeling system using four letters; the first letter represents the amino acid species (in upper case) and remaining three show the anticodon sequence (in lower case), for instance, Fgaa for the Phe-tRNA with the anticodon GAA.While His-tRNAs possess an additional base at the 5' terminus, the "minus" 1 residue is often not correctly included in the major database annotations.In addition, there are a lot of mis-typing or mis-counting of base numbers for tRNA-encoding regions in the major databases.We have corrected such incorrect coordinates.Appendix 2 summarizes the corrections necessary in the major databases.
Although tRNAscan is a fine tool, every tRNA-specifying region with the anticodon sequence CAT is assigned as Met-tRNA.The tRNAscan does not distinguish between the initiator, elongator Met-tRNA and the peculiar Ile-tRNA species that recognizes an Ile codon AUA, all of which have the same anticodon sequence CAT.Thus, we have closely examined tRNA sequences identified by tRNAscan and have attempted to divide them into initiator tRNA (labeled fM), elongator tRNA (Mcat), and Ile-tRNA (Icat).
Our close examination has revealed characteristic sequences in the anticodon-loop region.Ggcc are adjacent to fM in most of the green plants, suggesting a common ancestry for these fM genes.
Similarly, in red algal plastids (Cca, Cme, Ppu, Osi, and Gth) fM is located adjacent to psaD and/or ycf36.
In Cyanophora (Cpa) and euglenozoa (Alo and Egr), Tgtg is located upstream of fM.In this way, an examination of gene order conservation was helpful in the assignment of these tRNA species.It should be noted that the apicomplexan (Tgo and Ete) fM genes do not share gene orders with other plastids and that their nucleotide sequences differ considerably from others.Therefore, the present assignment of these apicomplexan tRNA sequences as fM should be viewed as a tentative one.For this reason, these apicomplexan tRNAs are labeled "fM?" in the present database.In bacteria, a mismatch (non Watson-Crick) pairing at the first position of the acceptor-stem (marked "!" at the bottom of Figure 1) constitutes an identity element of fM, which is believed to be involved in its recognition by Met-RNA transformylase (Marck, & Grosjean, 2002).An A:T pairing at the first acceptor-stem position is found in chlorophytes (Nol and Cvu) and apicomplexa (Tgo and Ete), which resembles archaeal initiators (Marck, & Grosjean, 2002).Although the Astasia (Alo) tRNA sequence shows a G:C pair at that position, its gene order is similar to that in another euglenozoa Egr, which shows a mismatch.Thus, the Astasia tRNA is likely to be an initiator fM gene, with the reservation that experimental confirmation is needed.
Figure 2 compares sequences that appear to be Ile-tRNA (Icat) with a CAT anticodon, which can recognize the Ile codon AUA by a post-transcriptional modification of the base C at the first position of the anticodon into lysidine (Muramatsu et al., 1988).All but one of these sequences exhibit a characteristic sequence, aCTCATAAt, in the anticodon-loop and -stem regions, which is uniquely found in Icat.An exception is observed in Cynophora (Cpa), where the first base of the characteristic sequence is replaced by G, as shown in red on the left of Figure 2. Other non-canonical bases are also observed in the apicomplxan plastids (Tgo and Ete).It is to be noted that the third position of the acceptor-stem region is commonly occupied by base A in all members of this group.This feature is not found in the other two groups of initiator and elongator tRNAs.As shown in Figure 2, some similarities in gene arrangement are found among red algal plastids (Cca, Cme, Ppu, Osi and Gth) or among green plant chloroplasts (Mvi to Zma in Figure 1).Thus, the examination of gene order further supports the present classification of most of these tRNA genes.Elongator tRNA (Mcat) genes appear to possess a characteristic sequence at the anticodon-loop region, ACTCATAAG or ATTCATAAG in the non-green plastids, Cyanophora (Cpa) to euglenozoa (Egr), or tTTCATACg in green plant chloroplasts (Nol to Zma), as seen in Figure 3. Mismatches to the characteristic sequences are found in Cvu and Pnu, which are indicated in red in Figure 3.The Cvu tRNA gene is puzzling; a non-initiator is suggested from the G:C pair at the first position of the acceptor-stem, whereas its nearest neighbor genes, rps14 and Ggcc, are identical to those found in the initiator tRNAs from green plant chloroplasts.Conserved orders of other tRNA genes confirm the classification as elongator tRNA.
The annotations of genes encoding ribosomal RNAs in the GenBank/EMBL/DDBJ databases were confirmed based on the multiple sequence alignment by CLUSTALW (Thompson, Higgins, & Gibson, 1994).An incorrect coordinate of Oenothera (Oel) 5S rRNA gene was found in this analysis.The correction is shown in Appendix 2, which lists revisions necessary to the GenBank/EMBL/DDBJ annotations of ribosomal and transfer RNA genes.

DATABASE STRUCTURE
The labeled arrangements of genes on the 32 plastid genomes are structured as a database, which is available online.The gene order of a plastid genome is accessible from the front page of the Web site (see Figure 4).Figure 5 illustrates our database format.The GenBank/EMBL/DDBJ database accession number is given below the species name.Columns 1-6, and 8-13 represent the coordinates of a gene.
When a gene is encoded on a strand, the sequence of which is stored in the major databases, its strandness "+" is shown in column 15.Conversely, if encoded on the complementary strand, "-" is used.
Pseudogenes are indicated by "&" in column 17.When tRNAscan identifies a tRNA gene that is not listed in the major databases, the symbol "%" appears in the column 17.If the tRNAscan fails to identify a tRNA gene that is annotated in the major databases, another symbol "#" is used.The gene labels developed here are shown in columns 21-30.We adopted a numbering system for exons; the second exon (counted from the 5' end) of gene XYZ, for instance, is labeled XYZ_2.Original annotations in the major databases are retained after column 41.
Amino acid or nucleotide sequence data are accessible using the link function; a user can confirm the present annotation and infer its biological function by performing the FASTA search or the CLUSTALW multiple alignment, both of which are available at various Web sites.
Another function of the database is to show gene orders in the neighborhood of a given gene.When a user lists one gene symbol in the front page, five genes present at both the 5' and 3' regions of the given gene are displayed, as shown in Figures 1, 2 and 3. When a gene is present in two copies on a genome, the

CONCLUSIONS
We have developed a gene order database for 32 completely sequenced plastid genomes.We developed a normalizing gene-labeling system across complete genomes, by which comparative studies are made available without returning to sequence analysis.A lot of incorrect tRNA gene coordinates detected in the major databases were corrected.Incomplete annotations of tRNA genes with the anticodon sequence CAT in the major databases were improved, and their classification into initiator tRNA, elongator tRNA and Ile-tRNA genes were specified where possible.The gene order database developed here is available at http://www.rs.noda.tus.ac.jp/ ~kunisawa.Using this database, we are now resolving the phylogenetic relationships of plastid genomes along the lines suggested elsewhere (Kunisawa, Blanchette & Sankoff, 1997;Kunisawa, 2003).At the same time we are extending the present gene order database so that an evolutionary comparison between plastids and cyanobacteria and between plastids and host nuclear genomes will be possible.

Figure 5 .
Figure 5.The database format.Several examples are illustrated.The top two lines indicate the column positions.
Conserved open reading frames, which were not shared among at least two genomes, were represented in the form of OrfXY, where X or Y stands for an arbitrarily chosen letter of the alphabet.By contrast, non-conserved open reading frames were simply labeled "orf".Seven paralogous gene families were found, (i) psaA and psaB, (ii) psbA and psbD, (iii) psbE and psbF, (iv) psbL and psbT, (v) ndhA and ndhH, *****..