BiosyntheticSPAdes: reconstructing biosynthetic gene clusters from assembly graphs

  1. Pavel A. Pevzner (1,3)
  1. Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia, 199004;
  2. Tri-Institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medical College, New York, New York 10021, USA;
  3. Department of Computer Science and Engineering, University of California, San Diego, California 92093-0404, USA;
  4. Computational Biology Department, School of Computer Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA;
  5. Bioinformatics Group, Wageningen University, 6708 PB Wageningen, The Netherlands;
  6. Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine of Cornell University, New York, New York 10021, USA;
  7. Englander Institute for Precision Medicine, Meyer Cancer Center, Weill Cornell Medicine, New York, New York 10021, USA;
  8. Department of Statistical Modelling, St. Petersburg State University, St. Petersburg, Russia, 198504
  • Corresponding author: hoseinm{at}andrew.cmu.edu
  • Abstract

    Predicting biosynthetic gene clusters (BGCs) is critically important for the discovery of antibiotics and other natural products. While BGC prediction from complete genomes is a well-studied problem, predicting BGCs in fragmented genomic assemblies remains challenging. Existing BGC prediction tools often assume that each BGC is encoded within a single contig of the genome assembly. This condition is violated for most sequenced microbial genomes, where BGCs are often scattered across several contigs, which makes them difficult to reconstruct. The situation is even more severe in shotgun metagenomics, where contigs are often short and existing tools fail to predict a large fraction of long BGCs. Although it is difficult to assemble a BGC into a single contig, the structure of the genome assembly graph often provides clues on how to combine multiple contigs into segments encoding long BGCs. We describe biosyntheticSPAdes, a tool for predicting BGCs in assembly graphs, and demonstrate that it greatly improves the reconstruction of BGCs from genomic and metagenomic data sets.
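
    The idea of using the assembly graph to join contigs can be illustrated with a toy sketch. This is not the biosyntheticSPAdes algorithm: the graph, the contig names, the set of domain-carrying contigs, and the function `candidate_bgc_paths` are all hypothetical, and the enumeration is a plain depth-first search. It assumes a directed graph whose nodes are contigs, and it lists every simple path that starts and ends on a contig carrying a biosynthetic domain.

```python
from typing import Dict, List, Set


def candidate_bgc_paths(
    graph: Dict[str, List[str]],
    domain_contigs: Set[str],
    max_contigs: int = 5,
) -> List[List[str]]:
    """Enumerate simple paths that start and end on domain-carrying contigs.

    Hypothetical helper for illustration only; `max_contigs` caps the path
    length so the search stays bounded on tangled graphs.
    """
    paths: List[List[str]] = []

    def extend(path: List[str]) -> None:
        last = path[-1]
        # Record the path once it spans at least two contigs and ends on a domain.
        if len(path) > 1 and last in domain_contigs:
            paths.append(list(path))
        if len(path) == max_contigs:
            return
        for nxt in graph.get(last, []):
            if nxt not in path:  # keep the path simple (no repeated contigs)
                path.append(nxt)
                extend(path)
                path.pop()

    for start in sorted(domain_contigs):
        extend([start])
    return paths


# Toy assembly graph (assumed data): c1 branches into c2 and c3, which rejoin at c4.
graph = {
    "c1": ["c2", "c3"],
    "c2": ["c4"],
    "c3": ["c4"],
    "c4": ["c5"],
    "c5": [],
}
domain_contigs = {"c1", "c4", "c5"}  # contigs assumed to carry biosynthetic domains

paths = candidate_bgc_paths(graph, domain_contigs)
for p in paths:
    print(" -> ".join(p))
```

    On this toy graph the branch at c1 yields two alternative routes to c4, which shows why a single linear contig cannot represent the cluster while a set of candidate paths through the graph can.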

    Footnotes

    • Received August 29, 2018.
    • Accepted May 29, 2019.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
