
Contrasting new and available reference genomes to highlight uncertainties in assemblies and areas for future improvement: an example with Monodontid species

dataset
posted on 2023-06-02, 14:20 authored by Trevor Bringloe

Abstract:   

Reference genomes provide a foundational framework for evolutionary investigations, ecological analysis, and conservation science, yet uncertainties in the assembly of reference genomes are difficult to assess, and by extension rarely quantified. We generated three beluga (Delphinapterus leucas) and one narwhal (Monodon monoceros) reference genomes and contrasted these with published chromosomal scale assemblies for each species to quantify discrepancies associated with genome assemblies. The new reference genomes achieved chromosomal scale assembly using a combination of PacBio long reads, Illumina short reads, and Hi-C scaffolding data. For beluga, we identified discrepancies in the order and orientation of contigs in 2.2-3.7% of the total genome depending on the pairwise comparison of references. In addition, unsupported higher order scaffolding was identified in published reference genomes. In contrast, we estimated 8.2% of the compared narwhal genomes featured discrepancies, with inversions being notably abundant (5.3%). Discrepancies were linked to repetitive elements in both species, indicating additional layers of data providing information on ultra-long genomic distances are needed to resolve persistent errors in reference genome construction. We present a conceptual summary for improving the accuracy of reference genomes with relevance to end-user needs and how they relate to levels of assembly quality and uncertainty.


The repository contains the files required to understand the steps taken to arrive at new reference genomes for beluga and narwhal, and files detailing the comparison of these genomes with published work. Sequence data can be accessed via BioProject PRJNA925093, while code to reproduce the assemblies (along with details) is available on GitHub: https://github.com/tbringloe/Monodontid_assemblies_2023.

Samples, BioSampleID: 

S_20_00703: Beluga, SAMN32779554

S_20_00702: Beluga, SAMN33394026

S_20_00693: Beluga, SAMN33394065

S_20_00708: Narwhal, SAMN33394077


Below are descriptions of the files, provided for each sample, in the order of the workflow (assembly -> polishing -> scaffolding -> annotation -> genome comparison).

Assembly

XXX_flye.log: logfile from long read assembler Flye.

Polishing

XXX_bowtie2_map1|2.out: logfile for mapping reads prior to polishing rounds 1 and 2

XXX_Pilon_polish1|2.out: logfile for correcting (polishing) base call errors in the long read assemblies using Pilon

Scaffolding

XXX.FINAL_scaffold.fasta: Final scaffold sequences for the assembled genomes.

XXX.final.hic: Hi-C contact density data; can be loaded into Juicebox for viewing, along with the .assembly file.

XXX.final.assembly: Contains information on order and orientation of contigs; can be loaded into Juicebox for viewing, along with the .hic file.

XXX.5i23.review.apg: A Golden Path (AGP) file describing assembly of scaffolds from contigs.

XXX.5i23.review.bed: BED file describing assembly of scaffolds from contigs.

XXX.5i23.review.break_report.txt: a report detailing breaks to contigs introduced through the scaffolding workflow 3D-DNA.
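As a quick sanity check on a XXX.FINAL_scaffold.fasta file, scaffold lengths and N50 can be computed with the standard library alone. The following is a minimal sketch, not part of the published workflow; the function names are illustrative.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (sequence name, sequence length) for each FASTA record."""
    name = None
    length = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name = parts[0] if parts else ""  # keep only the ID before any whitespace
            length = 0
        else:
            length += len(line)
    if name is not None:
        yield name, length


def n50(lengths: List[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for n in sorted(lengths, reverse=True):
        running += n
        if running * 2 >= total:
            return n
    return 0


# Tiny in-memory example; replace with open("XXX.FINAL_scaffold.fasta") for a real file.
demo = [">a", "ACGT", "ACGT", ">b some description", "AC", ">c", "ACGTAC"]
demo_lengths = dict(read_fasta_lengths(demo))
demo_n50 = n50(list(demo_lengths.values()))
```

For a real assembly, pass an open file handle instead of the demo list; the parser streams line by line, so memory use stays low even for chromosome-scale scaffolds.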

Repeat annotations

S_20_00703_Repeat_Modeler.out: logfile for RepeatModeler predicting beluga repetitive elements

S_20_00703_beluga_custom_repeat.lib: custom repeat library for RepeatMasker generated using RepeatModeler; contains beluga-specific repeat elements

S_20_00703_beluga_custom_repeat+RMlib.lib: As above, but with the RepeatMasker library concatenated. This is the library used to annotate repetitive elements of the beluga genomes.

S_20_00708_Repeat_Modeler.out: logfile for RepeatModeler predicting narwhal repetitive elements

S_20_00708_narwhal_custom_repeat.lib: custom repeat library for RepeatMasker generated using RepeatModeler; contains narwhal-specific repeat elements

S_20_00708_narwhal_custom_repeat+RMlib.lib: As above, but with the RepeatMasker library concatenated. This is the library used to annotate repetitive elements of the narwhal genome.

XXX.html: html output of the repeat landscape produced by RepeatMasker

XXX.repeatmasker.out.gff3.gz: annotation file of repeat elements predicted for a given assembly
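The repeat annotation files above follow the GFF3 layout (nine tab-separated columns, 1-based inclusive coordinates), so annotated repeat length per scaffold can be tallied directly. This is a rough sketch under that assumption; it is not the authors' code, and overlapping features are counted more than once, so totals may overestimate the masked fraction.

```python
from collections import defaultdict
from typing import Dict, Iterable


def repeat_bp_per_sequence(lines: Iterable[str]) -> Dict[str, int]:
    """Sum annotated feature lengths per sequence ID from GFF3 lines."""
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed feature line
        seqid = fields[0]
        start, end = int(fields[3]), int(fields[4])
        totals[seqid] += end - start + 1  # GFF3 coordinates are 1-based, inclusive
    return dict(totals)


# Illustrative records only; the feature names and attributes are made up.
demo_gff = [
    "##gff-version 3",
    "chr1\tRepeatMasker\tdispersed_repeat\t1\t100\t.\t+\t.\tTarget=rep1 1 100",
    "chr1\tRepeatMasker\tdispersed_repeat\t201\t250\t.\t-\t.\tTarget=rep2 1 50",
    "chr2\tRepeatMasker\tdispersed_repeat\t10\t19\t.\t+\t.\tTarget=rep1 5 14",
]
demo_totals = repeat_bp_per_sequence(demo_gff)
```

For the compressed files in this repository, the same function can be fed a handle from `gzip.open(path, "rt")`.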

Gene Annotations

XXX.gene.models.rename.putative_function.domain_added.gff: Contains annotations for gene models and function predicted with the MAKER2 workflow.

XXX.visible_iprscan_domains.gff: Contains annotations of protein domains predicted using InterProScan v.5.62-94.0.

XXX.all.maker.transcripts.renamed.putative_function.fasta: Contains nucleotide sequences of transcripts predicted with the MAKER2 workflow, with function added to headers.

XXX.all.maker.proteins.renamed.putative_function.fasta: Contains amino acid sequences of proteins predicted with the MAKER2 workflow, with function added to headers.

AED_Scores_MOBELS_30v23.csv: Contains accumulated proportions of AED scores for gene model predictions.
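Accumulated AED proportions like those in the CSV above can be recomputed from a MAKER GFF. The sketch below assumes that mRNA features in the gene model file still carry the `_AED=` attribute that MAKER normally writes; this has not been verified against the files here, and the function names and thresholds are illustrative.

```python
import re
from typing import Dict, Iterable, List

# Match "_AED=" only at the start of an attribute, so "_eAED=" is not picked up.
AED_RE = re.compile(r"(?:^|;)_AED=([0-9.]+)")


def aed_scores(lines: Iterable[str]) -> List[float]:
    """Collect AED values from mRNA features of a MAKER-style GFF."""
    scores: List[float] = []
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "mRNA":
            continue
        match = AED_RE.search(fields[8])
        if match:
            scores.append(float(match.group(1)))
    return scores


def cumulative_proportions(scores: List[float], thresholds: List[float]) -> Dict[float, float]:
    """Proportion of gene models with AED at or below each threshold."""
    n = len(scores)
    return {t: (sum(s <= t for s in scores) / n if n else 0.0) for t in thresholds}


# Made-up records for illustration only.
demo_gff = [
    "chr1\tmaker\tgene\t1\t100\t.\t+\t.\tID=g1",
    "chr1\tmaker\tmRNA\t1\t100\t.\t+\t.\tID=m1;Parent=g1;_AED=0.10;_eAED=0.90",
    "chr1\tmaker\tmRNA\t200\t300\t.\t+\t.\tID=m2;Parent=g2;_AED=0.50;_eAED=0.50",
    "chr1\tmaker\tmRNA\t400\t500\t.\t-\t.\tID=m3;Parent=g3;_AED=1.00;_eAED=1.00",
]
demo_scores = aed_scores(demo_gff)
demo_cumulative = cumulative_proportions(demo_scores, [0.25, 0.5, 1.0])
```

Lower AED indicates better agreement between a gene model and its supporting evidence, so a curve that rises quickly at low thresholds suggests a well-supported annotation.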


BUSCO

XXX_busco.log: log file of BUSCO analysis using the vertebrate database, with results table at end of file.
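To compare completeness across the four assemblies, the one-line BUSCO summary can be pulled out of each log with a regular expression. This sketch assumes the log contains the usual summary string of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; the example values below are invented and do not come from these assemblies.

```python
import re
from typing import Dict, Optional

BUSCO_RE = re.compile(
    r"C:([\d.]+)%\[S:([\d.]+)%,D:([\d.]+)%\],F:([\d.]+)%,M:([\d.]+)%,n:(\d+)"
)


def parse_busco_summary(text: str) -> Optional[Dict[str, float]]:
    """Return the last BUSCO summary found in the text, or None if absent."""
    matches = BUSCO_RE.findall(text)
    if not matches:
        return None
    c, s, d, f, m, n = matches[-1]  # the results table sits at the end of the log
    return {
        "complete": float(c),
        "single": float(s),
        "duplicated": float(d),
        "fragmented": float(f),
        "missing": float(m),
        "n": int(n),
    }


# Invented log excerpt for illustration only.
demo_log = "INFO\tResults:\nINFO\tC:95.0%[S:94.0%,D:1.0%],F:2.0%,M:3.0%,n:3354\n"
demo_summary = parse_busco_summary(demo_log)
```

Reading a real log is then a matter of `parse_busco_summary(open(path).read())`.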


Genome Comparisons

XXX.paf: output from minimap2 alignment of reference genomes

XXX-Discrepancy_table.csv: Tables derived from the .paf describing alignment blocks, including categorizations of discrepancy (debris, translocations, inversions, and relocations). See paper for definitions.

Regression_repeat_discrepancy_table-1vi23.txt: Tab-delimited table with columns for % repeats and % discrepancy by chromosome for each of the cross-compared reference genomes.
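For readers who want to explore the .paf files directly, the sketch below parses the twelve mandatory PAF columns and reports how much of the aligned query span falls on the reverse strand. It is only a first-pass summary, not the discrepancy categorization used in the paper: a reverse-strand block may simply reflect a whole scaffold assembled in the opposite orientation. The `min_mapq` parameter and function names are illustrative.

```python
from typing import Iterable, NamedTuple


class PafRecord(NamedTuple):
    qname: str
    qlen: int
    qstart: int
    qend: int
    strand: str
    tname: str
    tlen: int
    tstart: int
    tend: int
    matches: int
    block_len: int
    mapq: int


def parse_paf_line(line: str) -> PafRecord:
    """Parse the 12 mandatory PAF columns; optional tags are ignored."""
    f = line.rstrip("\n").split("\t")
    return PafRecord(
        f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
        f[5], int(f[6]), int(f[7]), int(f[8]),
        int(f[9]), int(f[10]), int(f[11]),
    )


def reverse_strand_fraction(records: Iterable[PafRecord], min_mapq: int = 0) -> float:
    """Fraction of aligned query bases that lie in reverse-strand blocks."""
    forward = reverse = 0
    for rec in records:
        if rec.mapq < min_mapq:
            continue
        span = rec.qend - rec.qstart  # PAF coordinates are 0-based, end-exclusive
        if rec.strand == "-":
            reverse += span
        else:
            forward += span
    total = forward + reverse
    return reverse / total if total else 0.0


# Invented alignment blocks for illustration only.
demo_paf = [
    "q1\t1000\t0\t600\t+\tt1\t1200\t0\t600\t590\t600\t60",
    "q1\t1000\t600\t1000\t-\tt1\t1200\t700\t1100\t390\t400\t60",
]
demo_records = [parse_paf_line(line) for line in demo_paf]
demo_fraction = reverse_strand_fraction(demo_records)
```

Grouping records by `tname` before calling `reverse_strand_fraction` gives a per-chromosome view that can be set against the per-chromosome repeat percentages in the regression table.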
