What is Finished, and Why Does it Matter

Elaine Mardis (1), John McPherson (1), Robert Martienssen (2), Richard K. Wilson (1), and W. Richard McCombie (2, 3)

(1) Washington University School of Medicine, Genome Sequencing Center, St. Louis, Missouri 63108, USA; (2) Cold Spring Harbor Laboratory, Woodbury, New York 11797-2924, USA.

This extract was created in the absence of an abstract.

Our ability to acquire and analyze DNA sequence data has increased phenomenally in the past 12 years. The acquisition of both cDNA and genomic DNA sequence has exerted a major influence on the direction of biological and medical research and will continue to do so. However, the DNA sequencing field has progressed so rapidly that technical differences between various sequencing approaches have resulted in large datasets of differing quality. Although all of these datasets are valuable in their own right, they are composed of experimental data; therefore they are subject to errors, ambiguities, and incompleteness at a level related to the experimental strategy that created them. The picture is further complicated by the lack of a community-accepted nomenclature that clearly defines levels of sequence completeness. Because of the small number of people producing this resource relative to the large number using it, the nature of the data is, unfortunately, not commonly appreciated.

Initially, DNA sequencing was targeted at small (less than 5 kb) genomic regions or cDNAs; thus, there were fewer than 10 sequences of >50 kb available in public databases until the late 1980s (GenBank). This early period established in many people's minds the definition of a finished sequence; namely, if a sequence contained no gaps or ambiguities (only A, T, G, and C), then the sequence was complete and accurate (usually as measured by a correct translation to a known protein). As large genome projects got underway, this definition became inadequate. For example, no one was planning to sequence human centromeres when sequencing the human genome was discussed. Moreover, the nature of data collection made much of the scientifically applicable information in a DNA sequence available before the sequence reached the level of finished, high-quality sequence (McCombie et al. 1992). Thus a valuable but less than complete or …
