Abstract
Expected-value models have long provided a rudimentary theoretical foundation for random DNA sequencing. Here, we are interested in improving characterization of genome coverage in terms of its underlying probability distributions. We find that the mathematical notion of occupancy serves as a good model for evolution of the coverage distribution function and reveals new insights related to sequence redundancy. Established concepts, such as “full shotgun depth,” have been assumed invariant, but actually depend on project size and decrease over time. For most microbial projects, the full shotgun milestone should be revised downward by about 30%. Accordingly, many already-completed genomes appear to have been over-sequenced. Results also suggest that read lengths for emerging high-throughput sequencing methods must be increased substantially before they can be considered as possible successors to the standard Sanger method. In particular, gains in throughput and sequence depth cannot be made to compensate for diminished read length. Limits are well approximated by a simple logarithmic equation, which should be useful in estimating maximum coverage-based redundancy for future projects.
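For orientation, the kind of expected-value estimate that the abstract calls a rudimentary foundation can be sketched in a few lines. This is not the occupancy model developed in the article, only the classical approximation in which N reads of length L placed uniformly at random on a genome of length G leave a given base uncovered with probability close to exp(-NL/G). The function names and the symbols N, L, and G are illustrative assumptions, and edge effects are ignored.

```python
import math


def redundancy(num_reads, read_length, genome_length):
    """Sequence redundancy (depth): total bases read divided by genome length."""
    return num_reads * read_length / genome_length


def expected_coverage(num_reads, read_length, genome_length):
    """Expected fraction of the genome covered by at least one read.

    Uses the classical approximation 1 - exp(-N*L/G), which assumes
    uniformly random read placement and ignores edge effects.
    This is an expected value only; it says nothing about the
    distribution of coverage around that value.
    """
    depth = redundancy(num_reads, read_length, genome_length)
    return 1.0 - math.exp(-depth)


def expected_uncovered_bases(num_reads, read_length, genome_length):
    """Expected number of bases not covered by any read."""
    depth = redundancy(num_reads, read_length, genome_length)
    return genome_length * math.exp(-depth)


if __name__ == "__main__":
    # Hypothetical project: 1 Mb genome, 500-base reads.
    for n in (2_000, 10_000, 20_000):
        cov = expected_coverage(n, 500, 1_000_000)
        print(f"reads={n:>6}  depth={redundancy(n, 500, 1_000_000):5.1f}  coverage={cov:.6f}")
```

Because the estimate depends on reads and read length only through the product NL, it cannot by itself capture the read-length effects the abstract describes; those follow from the distributional treatment in the article.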
Cite this article
Wendl, M.C. Occupancy Modeling of Coverage Distribution for Whole Genome Shotgun DNA Sequencing. Bull. Math. Biol. 68, 179–196 (2006). https://doi.org/10.1007/s11538-005-9021-4