DNA Sequence Assembly and Annotation of Genes

Christensen, Henrik; Moodley, Arshnee

doi:10.1007/978-3-319-99280-8_2

Part of the book series: Learning Materials in Biosciences (LMB)

Abstract

This chapter describes the main DNA sequencing strategies and their pros and cons, to help you select the optimal strategy for your research question, and explains how to assemble and annotate DNA sequences. DNA sequencing is the determination of the order of nucleotides in parts of, or whole, chromosomes of organisms and viruses. It can be done for a single gene, a whole genome, or many genomes at a time, as in metagenomics. One of the most popular sequencing machines is the MiSeq from Illumina, which is capable of small whole-genome sequencing, transcriptomics, and 16S rRNA metagenomics. Samples can be multiplexed by using unique combinations of specific barcodes and indexes. Real-time, single-molecule sequencing allows native DNA to be sequenced, which results in significantly longer reads and makes the sequence information available in real time, as the bases are incorporated. Base calling is the first step in processing sequencing data: the electronic signal generated in the sequencing machine is separated from random noise and converted to nucleotide information. The nucleotide information then needs to be assembled into DNA sequences that resemble the original DNA as closely as possible. This can be done either de novo, without a reference, or with a reference if the genome of the organism or virus is well known. The most important quality parameter to consider is the coverage. Another important parameter is N₅₀. Different assemblies can be compared with QUAST. The “minimum information about a genome sequence” (MIGS) specification provides an exhaustive list of the information required for genomic sequences, including requirements for metadata. Genome annotation is the identification and labeling of all the relevant features of the genomic sequence. At first, this includes the coordinates, given as nucleotide positions, where coding regions are predicted. It is mainly a prediction of coding genes; however, other structural genes such as rRNA genes are also identified.
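
The two assembly quality parameters named in the abstract, coverage and N₅₀, can be illustrated with a minimal Python sketch. It assumes the commonly used definitions: average coverage is the total number of sequenced bases divided by the genome size, and N₅₀ is the contig length at which contigs of that length or longer together contain at least half of the assembled bases. The function names, contig lengths, and read numbers below are hypothetical and chosen only for illustration; they are not taken from the chapter.

```python
def n50(contig_lengths):
    """Return N50: the length L such that contigs of length >= L
    together contain at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


def mean_coverage(read_count, read_length, genome_size):
    """Average coverage = total sequenced bases / genome size.

    Assumes all reads have the same length (a simplification)."""
    return read_count * read_length / genome_size


# Hypothetical assembly: seven contigs, 300 bases in total.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))  # 70, because 80 + 70 = 150 reaches half of 300

# Hypothetical run: one million 150-base reads on a 5 Mb genome.
print(mean_coverage(1_000_000, 150, 5_000_000))  # 30.0
```

Tools such as QUAST report N₅₀ together with many other statistics; the sketch only shows what the number means, not how any particular tool computes it.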

References

Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes S, Glass EM, Kubal M, Meyer F, Olsen GJ, Olson R, Osterman AL, Overbeek RA, McNeil LK, Paarmann D, Paczian T, Parrello B, Pusch GD, Reich C, Stevens R, Vassieva O, Vonstein V, Wilke A, Zagnitko O. 2008. The RAST Server: rapid annotations using subsystems technology. BMC Genomics 9:75.
Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30: 2114–2120.
Chun J, Oren A, Ventosa A, Christensen H, Arahal DR, da Costa MS, Rooney AP, Yi H, Xu XW, De Meyer S, Trujillo ME. 2018. Proposed minimal standards for the use of genome data for the taxonomy of prokaryotes. Int J Syst Evol Microbiol. 68, 461–466.
Cock et al. 2010. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 38, 1767–1771.
Compeau PE, Pevzner PA, Tesler G. 2011. How to apply de Bruijn graphs to genome assembly. Nat Biotechnol. 29:987–91.
Ewing B, Hillier L, Wendl MC, Green P. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, 175–185.
Field D, Garrity G, Gray T, Morrison N, Selengut J, Sterk P, Tatusova T, Thomson N, Allen MJ, Angiuoli SV, Ashburner M, Axelrod N, Baldauf S, Ballard S, Boore J, Cochrane G, Cole J, Dawyndt P, De Vos P, DePamphilis C, Edwards R, Faruque N, Feldman R, Gilbert J, Gilna P, Glöckner FO, Goldstein P, Guralnick R, Haft D, Hancock D, Hermjakob H, Hertz-Fowler C, Hugenholtz P, Joint I, Kagan L, Kane M, Kennedy J, Kowalchuk G, Kottmann R, Kolker E, Kravitz S, Kyrpides N, Leebens-Mack J, Lewis SE, Li K, Lister AL, Lord P, Maltsev N, Markowitz V, Martiny J, Methe B, Mizrachi I, Moxon R, Nelson K, Parkhill J, Proctor L, White O, Sansone SA, Spiers A, Stevens R, Swift P, Taylor C, Tateno Y, Tett A, Turner S, Ussery D, Vaughan B, Ward N, Whetzel T, San Gil I, Wilson G, Wipat A. 2008. The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol. 26, 541–7.
Glass EM, Wilkening J, Wilke A, Antonopoulos D, Meyer F. 2010. Using the metagenomics RAST server (MG-RAST) for analyzing shotgun metagenomes. Cold Spring Harb Protoc.
Goodwin S, McPherson JD, McCombie WR. 2016. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet. 17:333–51.
Gurevich A, Saveliev V, Vyahhi N, Tesler G. 2013. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–5.
Idury RM, Waterman MS. 1995. A new algorithm for DNA sequence assembly. J Comput Biol. 2, 291–306.
Kanehisa M, Sato Y, Kawashima M, Furumichi M, Tanabe M. 2016. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44(D1):D457–62.
Koren S, Harhay GP, Smith TP, Bono JL, Harhay DM, Mcvey SD, Radune D, Bergman NH, Phillippy AM. 2013. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biology 14: R101.
Larsen MV, Cosentino S, Rasmussen S, Friis C, Hasman H, Marvig RL, Jelsbak L, Sicheritz-Pontén T, Ussery DW, Aarestrup FM, Lund O. 2012. Multilocus sequence typing of total-genome-sequenced bacteria. J Clin Microbiol. 50, 1355–61.
Madigan M, Bender KS, Buckley DH, Sattley WM, Stahl D. 2019. Brock Biology of Microorganisms. Pearson, Harlow UK.
Nurk S, Bankevich A, Antipov D, Gurevich AA, Korobeynikov A, Lapidus A, Prjibelski AD, Pyshkin A, Sirotkin A, Sirotkin Y, Stepanauskas R, Clingenpeel SR, Woyke T, McLean JS, Lasken R, Tesler G, Alekseyev MA, Pevzner PA. 2013. Assembling single-cell genomes and mini-metagenomes from chimeric MDA products. J Comput Biol. 20, 714–37.
Overbeek R, Olson R, Pusch GD, Olsen GJ, Davis JJ, Disz T, Edwards RA, Gerdes S, Parrello B, Shukla M, Vonstein V, Wattam AR, Xia F, Stevens R. 2014. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res. 42(Database issue):D206–14.
Pearson WR, Lipman DJ. 1988. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A. 85, 2444–8.
Pevzner PA, Tang H, Waterman MS. 2001. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci U S A. 98, 9748–53.
Sanger F, Nicklen S, Coulson AR. 1977. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA 74, 5463–7.
Seemann T. 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–9.
Zerbino DR, Birney E. 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–9.

Author information

Authors and Affiliations

Department of Veterinary Animal Sciences, University of Copenhagen, Copenhagen, Denmark
Henrik Christensen & Arshnee Moodley

Corresponding author

Correspondence to Henrik Christensen.

Editor information

Editors and Affiliations

Department of Veterinary Animal Sciences, University of Copenhagen, Copenhagen, Denmark
Henrik Christensen

About this chapter

Cite this chapter

Christensen, H., Moodley, A. (2018). DNA Sequence Assembly and Annotation of Genes. In: Christensen, H. (eds) Introduction to Bioinformatics in Microbiology. Learning Materials in Biosciences. Springer, Cham. https://doi.org/10.1007/978-3-319-99280-8_2

DOI: https://doi.org/10.1007/978-3-319-99280-8_2
Published: 28 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99279-2
Online ISBN: 978-3-319-99280-8
eBook Packages: Biomedical and Life Sciences (R0)