Sample collection
Branches of
Lophelia pertusa were obtained from the Savannah Banks site, off the southeastern coast of the continental USA, Atlantic Ocean (latitude 31.75420, longitude −79.19442, depth 515 mbsl), while aboard the NOAA Ship
Ronald Brown (expedition RB1903) using ROV
Jason (Dive 1130) on April 17, 2019 (BioSample accession
SAMN31822850). The branches were collected using a hydraulic robotic arm and stored in an insulted biobox until they reached the surface (Figures
1c,d). Once onboard the ship, they were immersed in cold RNALater (Thermo Fisher), left to soak in the refrigerator (4 °C) for 24 hours, and then frozen at −80 °C. Samples remained at that temperature until DNA was purified in the laboratory.
De novo genome assembly
The analytical pipeline to generate the
de novo assembly of
Lophelia pertusa is depicted in Figure
2.
De novo genome assembly of PacBio data was performed using the assemblers flye (v2.9; RRID:
SCR_017016)
[20], wtdbg2 (v2.5, RRID:
SCR_017225)
[21], and FALCON (RRID:
SCR_016089)
[22], in combination with the polishing tools NextPolish v1.3.1
[23] and Arrow as implemented in the Pacific Biosciences GenomicConsensus package (
https://github.com/PacificBiosciences/GenomicConsensus), and the haplotig and contig overlap removal program purge_dups (v.1.2.3, RRID:
SCR_021173)
[24]. First, we generated an assembly with flye using default parameters, followed by purging with purge_dups and polishing with NextPolish (assembly A). Using default parameters, we generated a second assembly with wtdbg2 and polished it with NextPolish (assembly B). A third assembly was generated using FALCON, followed by polishing with Arrow and purging with purge_dups (assembly C). Assemblies A and B were combined by aligning the flye assembly against the wtdbg2 assembly using MUMmer (v4.0; RRID:
SCR_018171)
[25], followed by merging with Quickmerge v0.3
[26] (-hco 5.0 -c 1.5 -l 248998 -ml 5000). The resulting assembly was polished with NextPolish (assembly D). Assembly D was aligned against assembly C using MUMmer and merged with Quickmerge. Finally, the assembly resulting from merging assemblies C and D was polished with NextPolish and purged with purge_dups (assembly E). Assemblies generated with other programs were not included because they had lower assembly contiguity or completeness (see the ‘Data validation and quality control’ section, Table
1).
Figure 2.
Flow chart depicting the assembly pipeline for the
Lophelia pertusa genome. Dotted boxes indicate the different
de novo assemblies. Letters indicate the designed nomenclature of each assembly as reflected in the text and Table
1. Data inputs are indicated in maroon font. Software packages are highlighted with blue background.
Table 1
Statistics for Lophelia pertusa intermediate and final assemblies.
Assembly ID | A | B | C | D | E | F | G | H | I |
---|
Input data | PacBio CLR | PacBio CLR | PacBio CLR | PacBio CLR | PacBio CLR | PacBio CLR | PacBio CLR | PacBio CLR | PacBio CLR | PacBio CLR | PacBio CLR | PacBio CLR | PacBio CLR + Chicago | PacBio CLR + HiC | PacBio CLR + Chicago + HiC | H |
---|
Software | flye | flye + purge_dups | flye + purge_dups + NextPolish | wtdbg2 | wtdbg2 + NextPolish | FALCON + arrow | FALCON + arrow + purge_dups | quickmerge (A + B) | quickmerge (A + B) + Next polish | quickmerge (D + C) | quickmerge (D + C) + Next polish | quickmerge (D + C) + Next polish + purge_dups | SALSA2 (E + Chicago) | SALSA2 (E + HiC) | SALSA2 (F + HiC) | BlobToolKit |
---|
Sanitation | | | | | | | | | | | | | | | Prokaryot. | GC, cov., no-hit, undef |
---|
#contigs | 17,029 | 13,865 | 13,865 | 7,345 | 7,345 | 8,987 | 6,321 | 11,237 | 11,237 | 10,226 | 10,226 | 10,011 | 7,818 | 8,712 | 7,385 | 2,858 |
#contigs (≥10 Kbp) | 11,441 | 8,278 | 8,514 | 5,789 | 5,863 | 8,710 | 6,044 | 6,218 | 6,248 | 5,431 | 5,438 | 5,223 | 3,284 | 4,073 | 2,910 | 2,019 |
#contigs (≥25 Kbp) | 7,012 | 4,688 | 4,708 | 3,543 | 3,573 | 6,729 | 4,283 | 3,226 | 3,227 | 2,768 | 2,768 | 2,528 | 1,300 | 1,765 | 1,033 | 924 |
#contigs (≥50 Kbp) | 4,449 | 3,271 | 3,274 | 2,119 | 2,121 | 4,292 | 2,835 | 2,142 | 2,145 | 1,843 | 1,844 | 1,630 | 897 | 1,148 | 676 | 652 |
Largest contig (Kbp) | 1,284 | 1,284 | 1,278 | 2,222 | 2,198 | 1,100 | 1,100 | 3,039 | 3,036 | 3,134 | 3,136 | 3,136 | 5,013 | 6,202 | 10,677 | 10,677 |
Total length (Kbp) | 781,392 | 615,714 | 618,620 | 546,887 | 548,050 | 685,805 | 487,642 | 620,046 | 619,797 | 635,659 | 635,608 | 588,242 | 589,370 | 588,927 | 589,296 | 556,859 |
Total length (≥1 Kbp) | 781,186 | 615,508 | 618,417 | 546,886 | 548,049 | 685,805 | 487,642 | 619,842 | 619,593 | 635,455 | 635,406 | 588,040 | 589,174 | 588,731 | 589,104 | 556,857 |
Total length (≥5 Kbp) | 774,389 | 608,711 | 611,986 | 545,841 | 547,109 | 685,800 | 487,637 | 613,412 | 613,222 | 629,083 | 629,053 | 581,686 | 583,014 | 582,517 | 583,001 | 556,159 |
Total length (≥10 Kbp) | 751,665 | 585,996 | 590,000 | 536,436 | 537,966 | 683,319 | 485,155 | 594,011 | 593,904 | 611,231 | 611,203 | 563,836 | 566,568 | 565,396 | 566,844 | 551,248 |
Total length (≥25 Kbp) | 680,889 | 529,945 | 530,627 | 498,949 | 499,636 | 648,954 | 455,537 | 547,800 | 547,280 | 570,231 | 570,093 | 522,303 | 537,194 | 530,454 | 539,228 | 534,434 |
Total length (≥50 Kbp) | 589,893 | 480,070 | 480,059 | 449,330 | 448,950 | 685,805 | 403,091 | 509,550 | 509,117 | 537,749 | 537,651 | 490,800 | 523,329 | 509,117 | 526,899 | 525,028 |
GC (%) | 39.57 | 39.59 | 39.57 | 39.22 | 39.40 | 39.36 | 39.37 | 39.55 | 39.55 | 39.54 | 39.54 | 39.55 | 39.55 | 39.55 | 39.55 | 39.53 |
N50 (Kbp) | 114 | 138 | 137 | 249 | 248 | 123 | 142 | 331 | 329 | 455 | 452 | 467 | 901 | 824 | 1,440 | 1,614 |
N75 (bp) | 51 | 59 | 58 | 84 | 82 | 65 | 69 | 82 | 82 | 117 | 117 | 109 | 413 | 219 | 553 | 689 |
L50 | 1,817 | 1,229 | 1,243 | 586 | 590 | 1,582 | 977 | 455 | 457 | 366 | 366 | 329 | 186 | 155 | 94 | 83 |
L75 | 4,373 | 2,935 | 2,976 | 1,509 | 1,527 | 3,487 | 2,200 | 1,453 | 1,458 | 1,038 | 1,040 | 953 | 420 | 495 | 258 | 219 |
#N’s/100 kbp | 2.23 | 3.01 | 0 | 0.00 | 0 | 0.00 | 0.51 | 0 | 0 | 0.05 | 0.0 | 0.5 | 191.90 | 116.74 | 238.68 | 248.99 |
Busco (metazoa, n = 954) | | | 87.44.43.44.8 | | 90.70.92.85.6 | | 75.12.14.218.6 | | 87.94.72.15.3 | | | 88.62.42.16.9 | 89.02.12.16.8 | 88.92.12.16.9 | 89.02.12.16.8 | 88.9
2.2
2.1
6.8 |
Scaffolding
Assembly E was scaffolded with long-insert Chicago and Hi-C reads following the Arima Genomics mapping pipeline A160156 v02 (retrieved from
https://github.com/ArimaGenomics/mapping_pipeline). First, the reads from the Chicago library were aligned to assembly E using the MEM algorithm of the program BWA (v0.7.17; RRID:
SCR_010910)
[27]. Chicago and Hi-C sequence data had mapping rates to the assembly of 96% and 98%, respectively, indicating high quality. Chimeric reads that mapped in the 3
′ direction were excluded using the Arima-HiC Mapping pipeline filter_five_end.pl script
[28]. Reads were combined into pairs with the two_read_bam_combiner.pl script and sorted using Samtools (v.1.10; RRID:
SCR_002105)
[29]. The program Picard tools (v2.26.6; RRID:
SCR_006525)
[30] was used to add read groups to the resulting bam file and remove PCR duplicates. The program SALSA2 v2.2
[31, 32] (-e GATC -m yes) was used for scaffolding assembly E with the mapped Chicago reads (assembly F). The Hi-C reads were mapped to assembly H using the same procedure described above and re-scaffolded with SALSA2 (assembly H).
Sanitation
The program BloobToolKit v2.2
[33] was used to identify non-target scaffolds from assembly H. First, scaffolds were queried against the nucleotide collection database (nt) from the National Center for Biotechnology Information (NCBI), retrieved on May 5, 2020, using NCBI BLAST (RRID:
SCR_004870) plus blastn (v2.10; RRID:
SCR_001598)
[34]. Scaffolds were then queried against the UniProt protein sequence database
[35], retrieved on May 5, 2020, using DIAMOND blastx (v0.9.14.115; RRID:
SCR_016071)
[36]. Assembly coverage evenness was assessed by mapping the raw PacBio reads against assembly H using minimap2 (v2.24-r1122; RRID:
SCR_018550)
[37]. We excluded six scaffolds with significant matches to non-eukaryotic sequences (i.e., bacteria and viruses). We also excluded 4,531 scaffolds with significant deviations in coverage (<×0.01, >×65) or G.C. content (<26%, >52.5%) relative to the assembly-wide means (coverage = 3.27×, G.C. content = 39.81%) (assembly I). This Whole Genome Shotgun (WGS) project was deposited at DDBJ/ENA/GenBank under the accession
JAPMOT000000000.
Annotation
Repetitive elements in the genome assembly I were identified
de novo with the RepeatModeler v2.0.2 package (RRID:
SCR_015027), including the programs RECON (v1.05; RRID:
SCR_021170)
[38] and RepeatScout (v1.06; RRID:
SCR_014653)
[39]. Repetitive elements were classified using RepeatClassifier v2.0.2
[40] and soft-masked using RepeatMasker (v4.1.2; RRID:
SCR_012954)
[40]. This procedure resulted in 57.37% of the genome assembly being masked.
The masked genome assembly was used for functional annotation using the Funannotate v1.8.9 pipeline
[41]. First, we performed a
de novo genome-guided transcriptome assembly using the Funnannotate
train script with the
Lophelia pertusa RNA-seq data published by Glazier and colleagues
[42]. In short, (1) the RNA-seq data reads were normalized with Trinity (v2.8.5; RRID:
SCR_013048)
[43] and mapped to the masked genome assembly using HISAT2 (v2.2.1; RRID:
SCR_015530)
[44]; (2) a transcriptome assembly was generated with these mapped reads using Trinity; (3) the PASA (v2.4.1; RRID:
SCR_014656)
[45] program was used to produce a likely set of protein-coding genes based on transcript alignments.
Second, we performed gene prediction using the Funnannotate
predict script (–repeats2evm –max_intronlen 30000 –busco_db metazoa). With this script, we (1) parsed transcript alignments to the genome to use as transcript evidence; (2) aligned the UniProtKB/SwissProt v2021_04 curated protein database
[46] to the genomes and parsed alignments to use as protein evidence; (3) generated
ab initio gene model predictions from the masked assembly with GeneMark-ES/E.T. (v4.68; RRID:
SCR_011930)
[47, 48], Augustus (v3.3.3; RRID:
SCR_008417)
[49], SNAP (v2013_11_29; RRID:
SCR_007936)
[50], and GlimmerHMM (v3.0.4; RRID:
SCR_002654)
[51], using PASA gene modes for training; (4) computed a weighted consensus of gene models from transcript, protein, and
ab initio evidence using EVidenceModeler (v.1.1.1; RRID:
SCR_014659)
[52] (evidence source/weight: transcript/1, protein/1, Augustus/1, Augustus HiQ/2, GeneMark/1, GlimmerHMM/1, PASA/6, Snap/1); (5) filtered gene models to exclude transposable elements and lengths <50 aa; (6) predicted tRNAs using tRNAscan-SE (v2.0.9; RRID:
SCR_010835)
[53]. In total, 37,945 coding genes and 3,144 tRNA genes were predicted in the genome assembly. The average gene length was 4,972 bp. This analysis indicates that approximately 34% of the
Lophelia pertusa genome is coding DNA.
The protein products of the predicted coding gene models were functionally annotated using the Funnannotate
annotate script. The following annotations were added: (1) Protein family domains from PFAM (v35.0; RRID:
SCR_004726) using HMMer (v3.3.2; RRID:
SCR_005305) to find sequence homologs
[54]; (2) Gene and product names from UniProt D.B. (v2021_04; RRID:
SCR_002380) using DIAMOND blastp v2.0.13
[55] alignments; (3) Orthologous groups, gene, and product names from eggNog (v5.0; RRID:
SCR_002456)
[56] using eggNOG-mapper (v2.1.6; RRID:
SCR_021165)
[57]; (4) Protease annotation from MEROPS (v12.0; RRID:
SCR_007777)
[58] using DIAMOND blastp; (5) Metazonan single-copy orthologs from the OrthoDB (v10; RRID:
SCR_011980)
[59] using BUSCO (v5; RRID:
SCR_015008)
[60]; and (6) protein families and gene ontology (GO) terms from InterPro (v87; RRID:
SCR_006695) using InterProScan (v5.53; RRID:
SCR_005829)
[61]. This procedure yielded 24,665 EggNog annotations, 24,471 InterPro annotations, 16,020 PFAM annotations, 16,646 GO terms, and 1,086 MEROPS annotations.