Abstract
A recent study systematically compared the performance of various computational methods for predicting human protein-coding genes (Guigó et al. 2006). In that study, a set of well-annotated ENCODE sequences was blind-analyzed with different gene-finding programs, and the resulting predictions were compared with the annotations. Predictions were analyzed at the nucleotide, exon, transcript and gene levels to evaluate how well they reproduced the annotation. The study revealed that none of the strategies produced perfect predictions, but that methods relying on mRNA and protein sequences, and those using combined information (including expressed sequence information), were generally the most accurate. The dual- or multiple-genome methods were less accurate, although they performed better than the single-genome ab initio prediction methods. Importantly, at the nucleotide level no prediction method correctly identified more than ∼90% of nucleotides, and at the transcript level (the most stringent criterion) no prediction method correctly identified more than 45% of the coding transcripts.
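The distinction between evaluation levels can be illustrated with a minimal Python sketch. It is not the scoring procedure of the cited study, whose exact metric definitions are not given here; it assumes sensitivity is the fraction of annotated items recovered, that exons are inclusive (start, end) intervals, and that a transcript counts as correct only if every exon matches exactly. All function names and coordinates are hypothetical.

```python
def nucleotide_set(exons):
    """Expand inclusive (start, end) exon intervals into a set of positions."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def nucleotide_sensitivity(annotated, predicted):
    """Fraction of annotated coding nucleotides covered by the prediction."""
    ann = nucleotide_set(annotated)
    pred = nucleotide_set(predicted)
    return len(ann & pred) / len(ann) if ann else 0.0


def exon_sensitivity(annotated, predicted):
    """Fraction of annotated exons predicted with both boundaries exact."""
    ann = set(annotated)
    return len(ann & set(predicted)) / len(ann) if ann else 0.0


def transcript_correct(annotated, predicted):
    """A transcript is correct only if its full exon structure matches."""
    return sorted(annotated) == sorted(predicted)


# Toy data (invented): the second predicted exon starts 10 bases too late.
annotated = [(100, 199), (300, 399)]
predicted = [(100, 199), (310, 399)]

print(nucleotide_sensitivity(annotated, predicted))  # 190 of 200 bases covered
print(exon_sensitivity(annotated, predicted))        # 1 of 2 exons exact
print(transcript_correct(annotated, predicted))      # exon structure differs
```

In this toy case a single misplaced exon boundary costs little at the nucleotide level, halves the exon-level score, and makes the transcript incorrect, which is why the transcript level is the most stringent criterion.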
References
Bendtsen J, Jensen L, Blom N, Von Heijne G, Brunak S (2004) Feature-based prediction of non-classical and leaderless protein secretion. Protein Eng Design Selection 17: 349–356
Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A (2006) Pfam: clans, web tools and services. Nucleic Acids Res 34: D247–D251
Gnomon description (2003) http://www.ncbi.nlm.nih.gov/genome/guide/gnomon.html
Guigó R, Flicek P, Abril J, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic V, Birney E, Castelo R, Eyras E, Ucla C, Gingeras T, Harrow J, Hubbard T, Lewis S, Reese M (2006) EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol 7(Suppl 1): S2.1–S2.31
Hubbard T, Aken B, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, Down T, Dyer S, Fitzgerald S, Fernandez-Banet J, Graf S, Haider S, Hammond M, Herrero J, Holland R, Howe K, Howe K, Johnson N, Kahari A, Keefe D, Kokocinski F, Kulesha E, Lawson D, Longden I, Melsopp C, Megy K, Meidl P, Overduin B, Parker A, Prlic A, Rice S, Rios D, Schuster M, Sealy I, Severin J, Slater G, Smedley D, Spudich G, Trevanion S, Vilella A, Vogel J, White S, Wood M, Cox T, Curwen V, Durbin R, Fernandez-Suarez X, Flicek P, Kasprzyk A, Proctor G, Searle S, Smith J, Ureta-Vidal A, Birney E (2007) Ensembl 2007. Nucleic Acids Res 35: D610–D617
Letunic I, Copley R, Schmidt S, Ciccarelli F, Doerks T, Schultz J, Ponting C, Bork P (2004) SMART 4.0: towards genomic data integration. Nucleic Acids Res 32: D142–D144
Mott R, Schultz J, Bork P, Ponting C (2002) Predicting protein cellular localization using a domain projection method. Genome Res 12: 1168–1174
Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Banyai L, Patthy L (2007) MisPred Database for mispredicted and abnormal proteins. http://mispred.enzim.hu/index.html
Tordai H, Nagy A, Farkas K, Bányai L, Patthy L (2005) Modules, multidomain proteins and organismic complexity. FEBS J 272: 5064–5078
Tress M, Martelli P, Frankish A, Reeves G, Wesselink J, Yeats C, Olason P, Albrecht M, Hegyi H, Giorgetti A, Raimondo D, Lagarde J, Laskowski R, López G, Sadowski M, Watson J, Fariselli P, Rossi I, Nagy A, Kai W, Størling Z, Orsini M, Assenov Y, Blankenburg H, Huthmacher C, Ramírez F, Schlicker A, Denoeud F, Jones P, Kerrien S, Orchard S, Antonarakis S, Reymond A, Birney E, Brunak S, Casadio R, Guigó R, Harrow J, Hermjakob H, Jones D, Lengauer T, Orengo C, Patthy L, Thornton J, Tramontano A, Valencia A (2007) The implications of alternative splicing in the ENCODE protein complement. Proc Natl Acad Sci USA 104: 5495–5500
Unneberg P, Claverie J (2007) Tentative mapping of transcription-induced interchromosomal interaction using chimeric EST and mRNA data. PLoS ONE 2: e254
Wheelan S, Marchler-Bauer A, Bryant S (2000) Domain size distributions can predict domain boundaries. Bioinformatics 16: 613–618
Wolf Y, Madej T, Babenko V, Shoemaker B, Panchenko AR (2007) Long-term trends in evolution of indels in protein sequences. BMC Evol Biol 7: 19
Copyright information
© 2008 Springer-Verlag/Wien
About this chapter
Cite this chapter
Nagy, A. et al. (2008). Quality control of gene predictions. In: Frishman, D., Valencia, A. (eds) Modern Genome Annotation. Springer, Vienna. https://doi.org/10.1007/978-3-211-75123-7_3
DOI: https://doi.org/10.1007/978-3-211-75123-7_3
Publisher Name: Springer, Vienna
Print ISBN: 978-3-211-75122-0
Online ISBN: 978-3-211-75123-7
eBook Packages: Biomedical and Life Sciences (R0)