Contrasting new and available reference genomes to highlight uncertainties in assemblies and areas for future improvement: an example with Monodontid species
Abstract:
Reference genomes provide a foundational framework for evolutionary investigations, ecological analysis, and conservation science, yet uncertainties in the assembly of reference genomes are difficult to assess, and by extension rarely quantified. We generated three beluga (Delphinapterus leucas) and one narwhal (Monodon monoceros) reference genomes and contrasted these with published chromosomal scale assemblies for each species to quantify discrepancies associated with genome assemblies. The new reference genomes achieved chromosomal scale assembly using a combination of PacBio long reads, Illumina short reads, and Hi-C scaffolding data. For beluga, we identified discrepancies in the order and orientation of contigs in 2.2-3.7% of the total genome depending on the pairwise comparison of references. In addition, unsupported higher order scaffolding was identified in published reference genomes. In contrast, we estimated 8.2% of the compared narwhal genomes featured discrepancies, with inversions being notably abundant (5.3%). Discrepancies were linked to repetitive elements in both species, indicating additional layers of data providing information on ultra-long genomic distances are needed to resolve persistent errors in reference genome construction. We present a conceptual summary for improving the accuracy of reference genomes with relevance to end-user needs and how they relate to levels of assembly quality and uncertainty.
The repository contains the files required to understand the steps taken to arrive at new reference genomes for beluga and narwhal, as well as files detailing the comparison of these genomes with published work. Sequence data can be accessed via BioProject PRJNA925093, while code to reproduce the assemblies (along with details) is available on GitHub: https://github.com/tbringloe/Monodontid_assemblies_2023.
Samples, BioSampleID:
S_20_00703: Beluga, SAMN32779554
S_20_00702: Beluga, SAMN33394026
S_20_00693: Beluga, SAMN33394065
S_20_00708: Narwhal, SAMN33394077
Below are descriptions of the files, provided for each sample, in the order of the workflow (assembly -> polishing -> scaffolding -> annotation -> genome comparison).
Assembly
XXX_flye.log: logfile from long read assembler Flye.
Polishing
XXX_bowtie2_map1|2.out: logfile for mapping reads prior to polishing rounds 1 and 2
XXX_Pilon_polish1|2.out: logfile for correcting (polishing) base call errors in the long read assemblies using Pilon
Scaffolding
XXX.FINAL_scaffold.fasta: Final scaffold sequences for the assembled genomes.
XXX.final.hic: Hi-C contact density data; can be loaded into Juicebox for viewing, along with the .assembly file.
XXX.final.assembly: Contains information on the order and orientation of contigs; can be loaded into Juicebox for viewing, along with the .hic file.
XXX.5i23.review.apg: An AGP (A Golden Path) file describing the assembly of scaffolds from contigs.
XXX.5i23.review.bed: BED file describing the assembly of scaffolds from contigs.
XXX.5i23.review.break_report.txt: A report detailing breaks to contigs introduced by the scaffolding workflow 3D-DNA.
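The order and orientation information in a .assembly file can be read with a few lines of code. The following is a minimal sketch, assuming the usual 3D-DNA layout: header lines of the form ">name index length", followed by one line per scaffold listing signed contig indices (a negative index meaning reverse orientation). The function name parse_assembly and the demo contig names are hypothetical and not part of the repository.

```python
from typing import Dict, Iterable, List, Tuple


def parse_assembly(
    lines: Iterable[str],
) -> Tuple[Dict[int, Tuple[str, int]], List[List[Tuple[str, str]]]]:
    """Parse .assembly text into contig records and ordered scaffolds.

    Assumes the 3D-DNA layout: '>name index length' header lines, then
    one line per scaffold holding space-separated signed contig indices.
    """
    contigs: Dict[int, Tuple[str, int]] = {}  # index -> (name, length)
    scaffolds: List[List[Tuple[str, str]]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name, index, length = line[1:].split()
            contigs[int(index)] = (name, int(length))
        else:
            scaffold = []
            for token in line.split():
                idx = int(token)
                # A negative index marks a contig placed in reverse orientation.
                orientation = "-" if idx < 0 else "+"
                scaffold.append((contigs[abs(idx)][0], orientation))
            scaffolds.append(scaffold)
    return contigs, scaffolds


# Hypothetical three-contig example: scaffold 1 = ctg1 (+), ctg3 (-); scaffold 2 = ctg2 (+)
demo = [">ctg1 1 1000", ">ctg2 2 500", ">ctg3 3 250", "1 -3", "2"]
contigs, scaffolds = parse_assembly(demo)
```

For a real file, the same function can be called on an open text file handle in place of the demo list.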
Repeat annotations
S_20_00703_Repeat_Modeler.out: logfile for RepeatModeler predicting beluga repetitive elements
S_20_00703_beluga_custom_repeat.lib: custom repeat library for RepeatMasker, generated using RepeatModeler; contains beluga-specific repeat elements
S_20_00703_beluga_custom_repeat+RMlib.lib: As above, but with the RepeatMasker library concatenated. This is the library used to annotate repetitive elements of the beluga genomes.
S_20_00708_Repeat_Modeler.out: logfile for RepeatModeler predicting narwhal repetitive elements
S_20_00708_narwhal_custom_repeat.lib: custom repeat library for RepeatMasker, generated using RepeatModeler; contains narwhal-specific repeat elements
S_20_00708_narwhal_custom_repeat+RMlib.lib: As above, but with the RepeatMasker library concatenated. This is the library used to annotate repetitive elements of the narwhal genome.
XXX.html: HTML output of the repeat landscape produced by RepeatMasker
XXX.repeatmasker.out.gff3.gz: annotation file of repeat elements predicted for a given assembly
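The repeat annotation files can be summarized per sequence with a short script. The following is a minimal sketch, assuming the files follow the standard nine-column, tab-separated GFF3 layout with 1-based inclusive coordinates. The function names are hypothetical. Overlapping annotations are counted more than once, so the totals are an upper bound on repeat-covered bases rather than an exact figure.

```python
import gzip
from collections import defaultdict
from typing import Dict, Iterable


def repeat_bases_per_sequence(lines: Iterable[str]) -> Dict[str, int]:
    """Sum annotated feature lengths per sequence from GFF3 text lines."""
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:
        # Skip directives, comments, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF3 feature line
        seqid, start, end = fields[0], int(fields[3]), int(fields[4])
        # GFF3 coordinates are 1-based and inclusive.
        totals[seqid] += end - start + 1
    return dict(totals)


def repeat_bases_from_gff3_gz(path: str) -> Dict[str, int]:
    """Convenience wrapper that reads a gzip-compressed GFF3 file."""
    with gzip.open(path, "rt") as handle:
        return repeat_bases_per_sequence(handle)
```

Calling repeat_bases_from_gff3_gz on one of the XXX.repeatmasker.out.gff3.gz files would return a dictionary mapping each scaffold name to its summed repeat length.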
Gene Annotations
XXX.gene.models.rename.putative_function.domain_added.gff: Contains annotations for gene models and functions predicted with the MAKER2 workflow.
XXX.visible_iprscan_domains.gff: Contains annotations of protein domains predicted using InterProScan v.5.62-94.0.
XXX.all.maker.transcripts.renamed.putative_function.fasta: Contains nucleotide sequences of transcripts predicted with the MAKER2 workflow, with function added to headers.
XXX.all.maker.proteins.renamed.putative_function.fasta: Contains amino acid sequences of proteins predicted with the MAKER2 workflow, with function added to headers.
AED_Scores_MOBELS_30v23.csv: Contains accumulated proportions of AED scores for gene model predictions.
BUSCO
XXX_busco.log: log file of the BUSCO analysis using the vertebrate database, with the results table at the end of the file.
Genome Comparisons
XXX.paf: output from minimap2 alignment of reference genomes
XXX-Discrepancy_table.csv: Tables derived from the .paf files describing alignment blocks, including categorization of discrepancies (debris, translocations, inversions, and relocations). See the paper for definitions.
Regression_repeat_discrepancy_table-1vi23.txt: tab-delimited table with columns for % repeats and % discrepancy by chromosome for each of the cross-compared reference genomes.
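The .paf files can be inspected directly before consulting the discrepancy tables. The following is a minimal sketch, assuming the standard PAF layout written by minimap2 (12 mandatory tab-separated columns, 0-based coordinates with exclusive ends). The names PafBlock, parse_paf, and reverse_strand_fraction are hypothetical. The fraction of aligned bases on the reverse strand is only a rough proxy for inverted sequence; it does not reproduce the discrepancy categories defined in the paper.

```python
from typing import Iterable, Iterator, NamedTuple


class PafBlock(NamedTuple):
    query: str
    q_start: int
    q_end: int
    strand: str
    target: str
    t_start: int
    t_end: int
    mapq: int


def parse_paf(lines: Iterable[str]) -> Iterator[PafBlock]:
    """Yield one PafBlock per alignment line, using the mandatory PAF columns."""
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # skip blank or truncated lines
        yield PafBlock(f[0], int(f[2]), int(f[3]), f[4],
                       f[5], int(f[7]), int(f[8]), int(f[11]))


def reverse_strand_fraction(blocks: Iterable[PafBlock], min_mapq: int = 0) -> float:
    """Fraction of aligned query bases that lie in reverse-strand blocks."""
    totals = {"+": 0, "-": 0}
    for b in blocks:
        if b.mapq < min_mapq:
            continue
        # PAF ends are exclusive, so the block length is end - start.
        totals[b.strand] = totals.get(b.strand, 0) + (b.q_end - b.q_start)
    aligned = totals["+"] + totals["-"]
    return totals["-"] / aligned if aligned else 0.0
```

Passing an open .paf file handle to parse_paf and the result to reverse_strand_fraction gives a single summary value per pairwise comparison; the min_mapq argument is an illustrative filter, not a threshold taken from the paper.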