Analysis of Sequence-Tagged-Connector Strategies for DNA Sequencing

  Andrew F. Siegel (1,3), Barbara Trask (2), Jared C. Roach (2), Gregory G. Mahairas (2), Leroy Hood (2), and Ger van den Engh (2)

  (1) Departments of Management Science, Finance, and Statistics and (2) Department of Molecular Biotechnology, University of Washington, Seattle, Washington 98195 USA

Abstract

The BAC-end sequencing, or sequence-tagged-connector (STC), approach to genome sequencing involves sequencing the ends of BAC inserts to scatter sequence tags (STCs) randomly across the genome. Once any BAC or other large segment of DNA has been sequenced to completion by conventional shotgun approaches, these STCs can be used to identify a minimum tiling path of BAC clones that overlap the nucleation sequence, so that the sequence can be extended. Here, we explore the properties of STC-sequencing strategies within a mathematical model of a random target with homologous repeats and imperfect sequencing technology, to understand how varying the model parameters affects the incidence of problem clones and the cost of the sequencing project. Problem clones are defined as clones for which either (A) there is no identifiable overlapping STC to extend the sequence in a particular direction or (B) the identified STC with minimum overlap comes from a nonoverlapping clone, owing either to random false matches or to repeat-family homology. Based on the minimum overlap, we estimate the number of clones that must be sequenced in their entirety and then, using cost estimates, identify the decision rule (the degree of sequence similarity required before a match is declared between an STC and a clone) that minimizes overall sequencing cost. A method to optimize the overlap decision rule is highly desirable, because both the total cost and the number of problem clones are shown to be highly sensitive to this choice. For a target of 3 Gb containing ∼800 Mb of repeats with 85%–90% identity, we expect <10 problem clones with 15-fold coverage by 150-kb clones. We derive the optimal redundancy and insert sizes of clone libraries for sequencing genomes of various sizes, from microbial to human.
We estimate that establishing the resource of STCs as a means of identifying minimally overlapping clones represents only 1%–3% of the total cost of sequencing the human genome, and, up to a point of diminishing returns, a larger STC resource is associated with a smaller total sequencing cost.
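
As a rough illustration of the scale implied by the figures above (a 3-Gb target, 150-kb clones, 15-fold coverage), the following minimal sketch computes the library size and the mean spacing between STCs. It is not the model analyzed in the paper: it assumes exactly two end reads per clone, clone ends placed uniformly across the target, and no failed reads, and the function name `stc_resource` is hypothetical.

```python
def stc_resource(genome_bp, insert_bp, coverage):
    """Back-of-envelope size of an STC resource.

    Assumptions (not taken from the paper's model): two end reads (STCs)
    per clone, clone ends uniform across the target, no failed reads.
    Returns (number of clones, number of STCs, mean STC spacing in bp).
    """
    # Clones needed so that total insert length = coverage x genome length.
    clones = round(coverage * genome_bp / insert_bp)
    # One STC from each end of every clone.
    stcs = 2 * clones
    # Average distance between neighboring STCs along the target.
    spacing_bp = genome_bp / stcs
    return clones, stcs, spacing_bp


if __name__ == "__main__":
    clones, stcs, spacing_bp = stc_resource(3_000_000_000, 150_000, 15)
    print(f"clones: {clones}, STCs: {stcs}, mean spacing: {spacing_bp:.0f} bp")
```

Under these assumptions, a 15-fold library of 150-kb clones on a 3-Gb target corresponds to 300,000 clones and 600,000 STCs, or roughly one STC every 5 kb.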

Footnotes

  • 3 Corresponding author.

  • 1 These sequencing error rates are considerably higher than those currently achievable for substitution errors in single reads but were chosen to improve robustness of the results to polymorphism (<1%) as well as to encompass insertion and deletion errors (1%–2%) in single reads.

  • E-MAIL asiegel{at}u.washington.edu; FAX (206) 685-9392.

    • Received July 22, 1998.
    • Accepted January 12, 1999.