The ultimate goal of genome-annotation programs is to correctly predict the sequence of every gene in a given organism. Caenorhabditis elegans has led the way, and Wei et al. now report an adaptation of the TWINSCAN gene-prediction program, with which they have discovered 1,119 new C. elegans genes.

Although the C. elegans genome sequence has been available since 1998, thousands of its genes still lack cDNA or EST evidence. Several gene-prediction programs have therefore been developed and optimized specifically for worms. Wei et al. used these resources and compared the available data with their own results from the TWINSCAN algorithm, which was originally developed to annotate the human genome. The advantage of their method is that it combines the probabilistic Hidden Markov Model approach with information derived from aligning the target genome (C. elegans) to a second genome, known as the informant (Caenorhabditis briggsae).
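
The idea of pairing a Hidden Markov Model with an informant alignment can be illustrated with a toy example. The sketch below is not TWINSCAN itself: it assumes only two hidden states, made-up probabilities, and a three-symbol conservation track ('|' for an aligned match, ':' for an aligned mismatch, '.' for unaligned), and it treats the nucleotide and the conservation symbol at each position as independent given the state. All names and parameters are hypothetical.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical parameters, chosen only for illustration.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
DNA_EMIT = {
    "noncoding": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "coding": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
# Conservation track: '|' aligned match, ':' aligned mismatch, '.' unaligned.
CONS_EMIT = {
    "noncoding": {"|": 0.2, ":": 0.2, ".": 0.6},
    "coding": {"|": 0.7, ":": 0.2, ".": 0.1},
}


def viterbi(dna, cons):
    """Return the most probable state path for a target sequence and its
    conservation track, scoring both tracks jointly at every position."""
    if len(dna) != len(cons):
        raise ValueError("dna and cons must have the same length")
    if not dna:
        return []

    def emit(state, i):
        # Joint log-emission: target nucleotide plus conservation symbol.
        return (math.log(DNA_EMIT[state][dna[i]])
                + math.log(CONS_EMIT[state][cons[i]]))

    score = {s: math.log(START[s]) + emit(s, 0) for s in STATES}
    back = []
    for i in range(1, len(dna)):
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for being in state s at position i.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + emit(s, i)
            pointers[s] = prev
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi("ACGTACGTACGT", "....||||||||"))
```

With these assumed parameters, the conservation track dominates the decision: positions that align well to the informant are labelled coding even though the nucleotide composition alone is nearly uninformative.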

Using information from the entire C. elegans genome, they predicted 2,891 open reading frames (ORFs) that do not overlap with existing WormBase annotations. The authors then tested 265 of these predicted ORFs through amplification and cloning procedures, and confirmed 146 novel gene predictions, or 55% of those targeted. The confirmed genes were poorly conserved between C. elegans and C. briggsae; because poorly conserved genes are difficult to predict, their recovery reflects the strength of this strategy for gene identification.

Why is this approach so successful? The authors claim that the models the program uses for GC–AG splice sites and intron-length distribution, together with the C. briggsae alignment, are the major advances contributing to the accuracy of TWINSCAN's C. elegans predictions.

Although the C. elegans genome is among the best annotated, its total gene count is going to change as a result of this study. The method is also applicable to other model organisms, such as Arabidopsis thaliana, which is likely to contain more than 1,000 unannotated genes and thousands more that are misannotated. Because this computational approach is the first to achieve 60% sensitivity in the exact prediction of proteins in a multicellular organism, the future for the correct annotation of other genomes is bright.
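
For readers unfamiliar with the metric, exact-prediction sensitivity is commonly taken to mean the fraction of reference genes whose complete coding structure is reproduced without any boundary error. The sketch below illustrates that definition under stated assumptions: the gene representation (chromosome, strand and a tuple of exon coordinates) and the function name are hypothetical, and the coordinates are invented rather than taken from the study.

```python
def exact_gene_sensitivity(annotated, predicted):
    """Fraction of annotated genes that are matched exactly by a prediction.

    Each gene is a hashable tuple: (chromosome, strand, exons), where exons
    is a tuple of (start, end) pairs. A single shifted boundary counts as a
    miss, which is what makes the exact measure so strict.
    """
    if not annotated:
        raise ValueError("annotated gene set is empty")
    predicted_set = set(predicted)
    hits = sum(1 for gene in annotated if gene in predicted_set)
    return hits / len(annotated)


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    annotated = [
        ("I", "+", ((100, 200), (300, 400))),
        ("I", "-", ((900, 1000),)),
        ("II", "+", ((50, 120), (180, 260), (300, 350))),
        ("II", "+", ((500, 650),)),
        ("X", "-", ((10, 90), (150, 210))),
    ]
    predicted = [
        ("I", "+", ((100, 200), (300, 400))),              # exact match
        ("I", "-", ((900, 1003),)),                        # boundary off by 3
        ("II", "+", ((50, 120), (180, 260), (300, 350))),  # exact match
        ("II", "+", ((500, 650),)),                        # exact match
        ("X", "-", ((10, 90),)),                           # missing an exon
    ]
    print(exact_gene_sensitivity(annotated, predicted))
```

In this invented example three of the five reference genes are reproduced exactly, so the function returns 0.6; a prediction that misses one exon boundary by a few bases earns no partial credit.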