Modeling kinetic rate variation in third generation DNA sequencing data to detect putative modifications to DNA bases

  Andrew Kasarskis (1)

  1. Department of Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York, New York 10029, USA
  2. Pacific Biosciences, Menlo Park, California 94025, USA
  3. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota 55455, USA
  4. Department of Statistics, Stanford University, Stanford, California 94305, USA
  5. Tsinghua National Laboratory for Information Science and Technology, and Department of Automation, Tsinghua University, Beijing 100084, China
  6. Department of Neurology,
  7. Department of Pediatrics, The University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
  8. These authors contributed equally to this work.

    Abstract

    Current-generation DNA sequencing instruments are moving closer to seamlessly sequencing genomes of entire populations as a routine part of scientific investigation. However, while significant inroads have been made in identifying small nucleotide variation and structural variations in DNA that impact phenotypes of interest, progress has not been as dramatic regarding epigenetic changes and base-level damage to DNA, largely due to technological limitations in assaying all known and unknown types of modifications at genome scale. Recently, single-molecule real-time (SMRT) sequencing has been reported to identify kinetic variation (KV) events that have been demonstrated to reflect epigenetic changes of every known type, providing a path forward for detecting base modifications as a routine part of sequencing. However, to date no statistical framework has been proposed to enhance the power to detect these events while also controlling for false-positive events. By modeling enzyme kinetics in the neighborhood of an arbitrary location in a genomic region of interest as a conditional random field, we provide a statistical framework for incorporating kinetic information at a test position of interest as well as at neighboring sites that help enhance the power to detect KV events. The performance of this and related models is explored, with the best-performing model applied to plasmid DNA isolated from Escherichia coli and mitochondrial DNA isolated from human brain tissue. We highlight widespread kinetic variation events, some of which strongly associate with known modification events, while others represent putative chemically modified sites of unknown types.

    Footnotes

    • 9 Corresponding author. E-mail: eric.schadt@mssm.edu

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.136739.111.

      Freely available online through the Genome Research Open Access option.

    • Received December 19, 2011.
    • Accepted October 2, 2012.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 3.0 Unported License), as described at http://creativecommons.org/licenses/by-nc/3.0/.
