
C-Sibelia: an easy-to-use and highly accurate tool for bacterial genome comparison

[version 1; peer review: 2 approved]
PUBLISHED 25 Nov 2013

Abstract

We present C-Sibelia, a highly accurate and easy-to-use software tool for comparing two closely related bacterial genomes, which can be presented as either finished sequences or fragmented assemblies. C-Sibelia takes as input two FASTA files and produces: (1) a VCF file containing all identified single nucleotide variations and indels; (2) an XMFA file containing alignment information. The software also produces Circos diagrams visualizing high-level genomic architecture for rearrangement analyses. C-Sibelia is a part of the Sibelia comparative genomics suite, which is freely available under the GNU GPL v.2 license at http://sourceforge.net/projects/sibelia-bio. C-Sibelia is compatible with Unix-like operating systems. A web-based version of the software is available at http://etool.me/software/csibelia.

Introduction

The development of inexpensive genome sequencing technologies and efficient assembly methods has revolutionized the study of bacterial genomes, which are being sequenced and assembled on a daily basis. When an assembly is available, the most common first task is to compare it against a reference genome (or another assembly, if no such genome is available) in order to find genetic differences between the newly assembled and reference genomes. This analysis is critical to understanding the genetic factors that determine certain phenotypes of the isolates.

We present Comparative Sibelia software (C-Sibelia) for the comparison of two bacterial genomes in the form of complete sequences or draft assemblies. C-Sibelia is able to compare genomes in the presence of rearrangements and duplications. C-Sibelia takes as input two FASTA files (the assembly and reference files; if the reference genome is not available, it can be substituted by another draft assembly) and produces: (1) a VCF file containing all identified single nucleotide variations (SNVs) and indels; (2) annotation of these variants by SnpEff; (3) an XMFA [1] file containing alignment information. The web-based version also produces a circular diagram visualizing the rearrangement of synteny blocks in the two genomes.

In detecting SNVs and indels, C-Sibelia performs comparably to MUMmer and has a lower false-positive rate than Mauve. C-Sibelia is a part of the Sibelia comparative genomics suite, which is freely available under the GNU GPL v.2 license at http://sourceforge.net/projects/sibelia-bio. Users are encouraged to use the web-based version of C-Sibelia at http://etool.me/software/csibelia.

Methods

From synteny blocks to alignment

The task of finding SNVs and indels is closely related to the problem of whole-genome alignment. Aligning two genomes is more challenging than aligning two short DNA segments because of the presence of rearrangements and repetitive elements. C-Sibelia addresses this problem by first decomposing genomes into synteny blocks, using the iterative de Bruijn graph algorithm described in Minkin et al. [2]. This step separates linear operations (indels, substitutions) from non-linear operations (rearrangements) and thus allows us to apply global alignment to multiple instances of each synteny block. C-Sibelia incorporates LAGAN [3], a global alignment tool, for aligning different instances of the same synteny block.

From alignment to variant calling

C-Sibelia then finds differences between two genomes (indels, SNVs, rearrangements) by analyzing the resulting synteny and alignment blocks. Regions in one genome not covered by synteny blocks are treated as indels. SNVs and small indels that lie within the regions covered by synteny blocks are reported by analyzing the alignment information produced by LAGAN. Identified variants are annotated using SnpEff [4]. The pipeline of C-Sibelia is described in the following pseudocode.

Input: An assembly and a reference genome (in FASTA format).

Algorithm:

  • Decompose the sequences into synteny blocks using Sibelia.

  • Align instances of synteny blocks using LAGAN.

  • Analyze the synteny block decomposition and alignment information.

            – Find indels in non-syntenic regions.

            – Find small indels and SNVs in aligned regions (using the alignment information produced by LAGAN).

            – Annotate the identified variants using SnpEff.

            – Select contigs containing multiple synteny blocks (i.e., rearranged contigs).

Output:

  • All SNVs and indel variants, in a VCF file.

  • Annotation of these variants produced by SnpEff [4].

  • A picture in Circos format [5] for rearranged contigs and the reference genome.
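
The variant-calling step of the pipeline can be illustrated with a minimal Python sketch. This is not C-Sibelia's own code: it assumes the two instances of a synteny block have already been globally aligned into two gapped strings of equal length (with '-' as the gap character), and the function names call_variants and to_vcf_line are hypothetical. Indels are anchored on the preceding aligned base, following the usual VCF convention; a gap run at the very start of a block has no such base and is skipped here.

```python
from typing import List, Tuple

# (1-based reference position, REF allele, ALT allele)
Variant = Tuple[int, str, str]


def call_variants(ref_aln: str, qry_aln: str, ref_start: int = 1) -> List[Variant]:
    """Report SNVs and small indels from one gapped pairwise alignment.

    ref_aln and qry_aln are the two alignment rows ('-' marks a gap);
    ref_start is the reference coordinate of the first reference base.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned rows must have the same length")
    variants: List[Variant] = []
    pos = ref_start - 1  # reference coordinate of the last reference base seen
    i, n = 0, len(ref_aln)
    while i < n:
        r, q = ref_aln[i], qry_aln[i]
        if r != "-" and q != "-":
            # Aligned column: a mismatch is an SNV.
            pos += 1
            if r.upper() != q.upper():
                variants.append((pos, r.upper(), q.upper()))
            i += 1
            continue
        # Collect a maximal run of columns that contain a gap in either row.
        j = i
        while j < n and (ref_aln[j] == "-" or qry_aln[j] == "-"):
            j += 1
        ref_seg = ref_aln[i:j].replace("-", "").upper()
        qry_seg = qry_aln[i:j].replace("-", "").upper()
        if i > 0 and ref_seg != qry_seg:
            # Anchor the indel on the preceding aligned reference base.
            anchor = ref_aln[i - 1].upper()
            variants.append((pos, anchor + ref_seg, anchor + qry_seg))
        pos += len(ref_seg)
        i = j
    return variants


def to_vcf_line(chrom: str, variant: Variant) -> str:
    """Format a variant as a minimal VCF data line (ID, QUAL, FILTER, INFO left as '.')."""
    pos, ref, alt = variant
    return "\t".join([chrom, str(pos), ".", ref, alt, ".", ".", "."])


# One SNV (G>C at position 3), one insertion (after position 4)
# and one deletion (the G after position 6).
example = call_variants("ACGT-ACGTA", "ACCTTAC-TA")
# example == [(3, "G", "C"), (4, "T", "TT"), (6, "CG", "C")]
```

Large indels in non-syntenic regions would need a separate pass over the intervals not covered by any synteny block; that pass is omitted from this sketch.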

Results

A simulated dataset

To evaluate the variant calling feature, we benchmarked C-Sibelia against Mauve [6] and MUMmer [7] on a simulated dataset, designed as follows.

Starting from the complete genome of Staphylococcus aureus (S. aureus) NCTC 8325, we deleted 10 random segments of 2000 bp each and further introduced 1000 SNVs into the resulting genome. We then generated five reversals and five translocations of random segments of 10,000 bp each to evaluate how well these tools perform an alignment in the presence of rearrangements. From this simulated genome we obtained a simulated assembly of 180 contigs; the distribution of contig length was similar to that of the RN4220 assembly reported in Dhanalakshmi et al. [8]. We then used C-Sibelia, Mauve and MUMmer to find variants between this simulated assembly and the original reference genome (NCTC 8325). Table 1 and Table 2 demonstrate that the performance of C-Sibelia in detecting variants is comparable to MUMmer and improves upon Mauve in terms of the false-positive rate. Figure 1 shows the Circos diagram of the rearranged contigs and the reference genome. The scripts and commands used for this benchmark are available in the Supplementary material.
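
The first two simulation steps (deletions, then SNVs) can be sketched in Python as follows. This is an illustrative sketch only and not the authors' benchmark scripts, which are in the Supplementary material; the function names, the fixed random seed and the small toy genome are assumptions, and the reversal, translocation and contig-splitting steps are left out.

```python
import random
from typing import List, Tuple


def delete_segments(seq: str, n_del: int, seg_len: int,
                    rng: random.Random) -> Tuple[str, List[int]]:
    """Delete n_del non-overlapping random segments of length seg_len.

    Returns the shortened sequence and the 0-based start of each deleted
    segment in the original coordinates.
    """
    free = len(seq) - n_del * seg_len
    if free < 0:
        raise ValueError("sequence too short for the requested deletions")
    # Sorted offsets in the 'free' space, shifted by k * seg_len, give
    # segment starts that can never overlap.
    offsets = sorted(rng.randint(0, free) for _ in range(n_del))
    starts = [off + k * seg_len for k, off in enumerate(offsets)]
    pieces, prev = [], 0
    for s in starts:
        pieces.append(seq[prev:s])
        prev = s + seg_len
    pieces.append(seq[prev:])
    return "".join(pieces), starts


def introduce_snvs(seq: str, n_snv: int,
                   rng: random.Random) -> Tuple[str, List[Tuple[int, str, str]]]:
    """Substitute n_snv distinct random positions with a different base.

    Returns the mutated sequence and the truth set as
    (1-based position, original base, new base) tuples.
    """
    bases = list(seq)
    truth = []
    for p in sorted(rng.sample(range(len(seq)), n_snv)):
        old = bases[p]
        new = rng.choice([b for b in "ACGT" if b != old])
        bases[p] = new
        truth.append((p + 1, old, new))
    return "".join(bases), truth


# Toy run: a 5000 bp random genome, 10 deletions of 200 bp, then 100 SNVs.
rng = random.Random(0)  # fixed seed so the simulation is reproducible
genome = "".join(rng.choice("ACGT") for _ in range(5000))
deleted, del_starts = delete_segments(genome, 10, 200, rng)
mutated, snv_truth = introduce_snvs(deleted, 100, rng)
```

Keeping the truth set returned by introduce_snvs is what later makes it possible to count true positives, false positives and false negatives for each tool.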

Table 1. SNV calling on simulated data.

Tool        True Positive   False Positive   False Negative
C-Sibelia   976             0                24
MUMmer      977             0                23
Mauve       991             78               9

Table 2. Indel calling on simulated data.

Tool        True Positive   False Positive   False Negative
C-Sibelia   9               0                1
MUMmer      9               0                1
Mauve       10              1                0
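
To make the comparison in Table 1 concrete, the counts can be turned into precision and recall. The Python sketch below is an illustration, not part of the published benchmark: the function names are hypothetical, and the SNV counts are entered as read from Table 1 (true positives, false positives, false negatives, with 1000 simulated SNVs in total).

```python
from typing import Dict, Set, Tuple


def score_calls(truth: Set[tuple], calls: Set[tuple]) -> Tuple[int, int, int]:
    """Count true positives, false positives and false negatives."""
    tp = len(truth & calls)
    fp = len(calls - truth)
    fn = len(truth - calls)
    return tp, fp, fn


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# SNV counts as read from Table 1: (TP, FP, FN).
SNV_COUNTS: Dict[str, Tuple[int, int, int]] = {
    "C-Sibelia": (976, 0, 24),
    "MUMmer": (977, 0, 23),
    "Mauve": (991, 78, 9),
}

for tool, (tp, fp, fn) in SNV_COUNTS.items():
    p, r = precision_recall(tp, fp, fn)
    print(f"{tool}: precision={p:.3f} recall={r:.3f}")
```

With these counts, C-Sibelia and MUMmer make no false calls (precision 1.0) at a recall of about 0.98, while Mauve recovers slightly more SNVs at a precision of roughly 0.93.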

Figure 1. A picture in Circos format for assembly sequences and the reference genome.

Only contigs with multiple synteny blocks rearranged differently in the genome are shown. Green and red bars depict the direction of synteny blocks on the positive and negative strands, respectively.

A real dataset

The most common approach for comparing an assembly against a reference genome is to first align the assembly against the reference and then write in-house scripts to extract variants. C-Sibelia can achieve this task automatically and with high accuracy. We used C-Sibelia to reproduce the comparison of the S. aureus RN4220 assembly and the reference genome NCTC 8325, reported in Dhanalakshmi et al. [8] (the authors used MUMmer and in-house scripts for this comparison). Among 132 single nucleotide variants and four large deletions reported in Dhanalakshmi et al. [8], C-Sibelia confirmed 121 SNVs and all four large deletions. C-Sibelia also reported six additional variants, which were confirmed by BLAST [9]. The input data as well as the commands for generating these results are available in the Supplementary material.

The Etool Web-Server

The online version of C-Sibelia is available at http://etool.me/software/csibelia. The web form takes as input two FASTA files (one for the assembly and the other for the reference). The web form's parameters allow users to choose whether or not to annotate variants and whether to display the Circos [5] picture for rearrangement analysis (see Figure 1). Results are delivered to registered users by a real-time push notification mechanism [10,11].

Discussion

In this application note, we introduced C-Sibelia, a novel software tool for comparing two closely related bacterial strains. The performance of C-Sibelia is comparable to MUMmer, and better than Mauve in terms of the false-positive rate. The web interface of C-Sibelia makes the task of comparing assemblies against a reference genome convenient for microbiologists who do not want to go to the trouble of downloading and compiling the software. In the future, we plan to extend C-Sibelia to compare multiple genomes or draft assemblies, and to scale the software to larger genomes.

Comments on this article Comments (2)

Version 1
VERSION 1 PUBLISHED 25 Nov 2013
  • Reader Comment 04 Sep 2017
    Ashvini Chauhan, Florida A&M University
    04 Sep 2017
    Reader Comment
    Despite trying several times, I was unable to run the C-Sibelia pipeline to compare two bacterial whole genome sequences.

    The author is unresponsive to my queries, so am putting ... Continue reading
  • Author Response 17 Sep 2014
    Son Pham, UCSD, USA
    17 Sep 2014
    Author Response
    Dear Dr. Hauser and Dr. McLean,
    Thank you very much for your very helpful comments.
    1. About the fasta and fna file, the system now can recognize both file extensions.
    2. About the ... Continue reading
how to cite this article
Minkin I, Pham H, Starostina E et al. C-Sibelia: an easy-to-use and highly accurate tool for bacterial genome comparison [version 1; peer review: 2 approved] F1000Research 2013, 2:258 (https://doi.org/10.12688/f1000research.2-258.v1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to Reviewer Statuses
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Reviewer Report 17 Jul 2014
Loren J Hauser, Computational Biology and Bioinformatics Group, Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA 
Approved
VIEWS 38
C-Sibelia was created to be a user friendly tool for pairwise genome comparison. The web tool is relatively easy to use for the non-bioinformatics trained and will therefore make these kinds of analysis easier for small groups to perform. This is ... Continue reading
HOW TO CITE THIS REPORT
Hauser LJ. Reviewer Report For: C-Sibelia: an easy-to-use and highly accurate tool for bacterial genome comparison [version 1; peer review: 2 approved]. F1000Research 2013, 2:258 (https://doi.org/10.5256/f1000research.2867.r4953)
Reviewer Report 27 Nov 2013
Jeffrey McLean, Microbial and Environmental Genomics, J. Craig Venter Institute, San Diego, CA, USA 
Approved
VIEWS 48
In general, the rationale behind the decision to develop a user friendly tool to compare finished and draft assemblies, to reference genomes, is highly justified. The accurate calling of single nucleotide variations and indels using the comprehensive SnpEff tool will ... Continue reading
HOW TO CITE THIS REPORT
McLean J. Reviewer Report For: C-Sibelia: an easy-to-use and highly accurate tool for bacterial genome comparison [version 1; peer review: 2 approved]. F1000Research 2013, 2:258 (https://doi.org/10.5256/f1000research.2867.r2566)