Largest Complete Mitochondrial Genome of a Gymnosperm, Sitka Spruce (Picea sitchensis), Indicates Complex Physical Structure

Plant mitochondrial genomes vary widely in size. Although many plant mitochondrial genomes have been sequenced and assembled, the vast majority are of angiosperms, and few are of gymnosperms. Most plant mitochondrial genomes are smaller than a megabase, with a few notable exceptions. We have sequenced and assembled the 5.5 Mbp mitochondrial genome of Sitka spruce (Picea sitchensis), the largest complete mitochondrial genome of a gymnosperm. We sequenced the whole genome using Oxford Nanopore MinION, and then identified contigs of mitochondrial origin assembled from these long reads. The assembly graph shows a multipartite genome structure, composed of one smaller 168 kbp circular segment of DNA, and a larger 5.4 Mbp component with a branching structure. The assembly graph gives insight into a putative complex physical genome structure, and its branching points may represent active sites of recombination.


Introduction
Plant mitochondrial genomes are amazingly diverse and complex (Mower et al. 2012). Land plant mitochondrial genomes range in size from 100 kilobases (kbp) for mosses such as Mielichhoferia elongata (Goruynov et al. 2018) to more than 11 megabases (Mbp) in the case of the flowering plant Silene conica (Sloan et al. 2012). Although their genome structure is often portrayed as a circle, the true physical structure of their genome appears to be a variety of circles, linear molecules, and complex branching structures (Backert et al. 1997;Backert & Börner 2000). While many species have a single master circle representation of their mitochondrial genome, others are composed of more than a hundred circular chromosomes ). The precise mechanism of how plant mitochondria replicate and maintain their DNA is not yet fully understood (Cupp & Nielsen 2014). It is hypothesized that recombination-dependent replication plays a role, giving a functional role to the repeat sequences often observed in mitochondria (Gualberto et al. 2014). This model does not fully explain how genomic copy number is regulated and maintained (Oldenburg & Bendich 2015), particularly in multipartite genomes (Vlcek et al. 2010). Although angiosperm mitochondrial genomes are well studied with numerous complete genomes available, few gymnosperm mitochondrial genomes are available: one from each of the cycads (Chaw et al. 2008), ginkgos, gnetales (Guo et al. 2016), and conifers (Jackman et al. 2015). Whereas other gymnosperm mitochondrial genomes are smaller than a megabase, conifer mitochondrial genomes can exceed five megabases (Jackman et al. 2015), larger than many bacteria. The origin and mechanism of this expansion are not known, but the trend correlates well with the very large nuclear genomes of conifers (De La Torre et al. 2014) and spruces in particular (Birol et al. 2013;Nystedt et al. 2013;Warren et al. 2015), compared to other plants. As plant mitochondrial genomes typically have fewer than 100 genes, what role this expanse of DNA serves, if any, remains mysterious.
Assembling plant mitochondrial genomes is difficult due to the presence of large (up to 30 kbp) perfect repeats, which may be involved in active recombination, and hypothesized recombination-dependent replication (Gualberto et al. 2014). A hybrid assembly of both long reads, which are able to span most repeats, and accurate short sequencing reads, which correct indel errors, is well suited to tackle these challenging genome features. When analyzing whole genome sequencing reads to reconstruct mitochondrial genomes, long reads provide additional confidence that the assembled sequences represent the true mitochondrial sequence in regions that share sequence homology between the mitochondrion, plastid, and nuclear genomes, due to the transfer of DNA between cellular compartments (Adams 2003;Smith 2011). Although hybrid assembly has been applied to assemble the plastid genome of Eucalyptus pauciflora (Wang et al. 2018), it has not yet been applied to the assembly of a plant mitochondrial genome.
Annotating plant mitochondrial genomes is also challenging, due to numerous features of plant mitochondria that are not typical of most organisms. For one, RNA editing of C-to-U is pervasive, and this process creates AUG start codons by editing ACG to AUG (Hiesel et al. 1989) or by editing GCG to GUG, an alternative start codon used by some plant mitochondrial genes (Sakamoto et al. 1997). RNA editing can also create stop codons in a similar fashion. Further complicating annotation using available bioinformatics pipelines, the typical GU-AG splice site expected by most splice-aware alignments tools is instead GNGCG-AY (Y denotes C or T) for group II introns (Lambowitz & Zimmerly 2010 and see results). Also, trans-spliced genes are common in mitochondrial genomes (Kamikawa et al. 2016), and no purpose-built software tool exists for identifying and annotating trans-spliced genes. To add further difficulty, transspliced exons may be as small as 22 bp, as is nad5 exon 3 of gymnosperms (Guo et al. 2016) and other vascular plant mitochondria (Knoop et al. 1991). For these reasons, annotating a plant mitochondrial genome remains a laborious and manual task.
In this study, we report on the sequencing and assembly of the mitochondrial genome of Sitka spruce (Picea sitchensis, Pinaceae), a widely distributed conifer in the coastal regions of the Pacific Northwest. We show that this mitochondrial genome is one of the largest among plants and exhibits a multipartite genome structure.

Genome Sequencing and Assembly
Genomic DNA was extracted from young Sitka spruce (Picea sitchensis [Bong.] Carrière, genotype Q903) needles, as described in Coombe et al. (2016). We constructed 18 Oxford Nanopore 1D sequencing libraries, 16 by ligation of 1 to 7 µg of lightly needle-sheared genomic DNA and 2 by rapid transposition of 0.6 µg of unsheared genomic DNA, and sequenced these on 18 MinION R9.4 flow cells. This whole genome sequencing produced 98 Gbp in 9.6 million reads (SRA accession SRX5081713), yielding 5-fold depth of coverage of the roughly 20 Gbp nuclear genome, and 26-fold depth of coverage of the mitochondrial genome. Because separating putative mitochondrial reads by homology to known mitochondrial sequences could discard mitochondrial sequences that are unique to Sitka spruce, we chose to first assemble the whole genome reads and then compare contigs to known mitochondrial sequences. Assembling such a large number of Nanopore reads is not yet straight forward, and so we adopted an iterative approach to assembly. We first obtained a rough but computationally efficient assembly using Miniasm (Li 2016), after trimming adapter sequences with Porechop (Wick et al. 2017a). Miniasm produces an assembly whose sequencing error rate is comparable to that of the original reads, but no better. We polished this assembly using Racon (Vaser et al. 2017). We selected contigs with homology to the white spruce (Picea glauca, interior white spruce genotype PG29) mitochondrial genome (Jackman et al. 2015) using Bandage (Wick et al. 2015), retaining contigs with at least one 5 kbp alignment to the white spruce mitochondrion by BLASTN (Altschul et al. 1990). We note that this selection process would also discard any mitochondrial plasmids smaller than 5 kbp, if present.
We selected putative mitochondrial reads by aligning the Nanopore reads to this assembly using Minimap2 (Li 2018), and retained reads with an alignment score of 5000 or more. We assembled these reads using Unicycler (Wick et al. 2017b). This assembly yielded one circular contig and many linear contigs with no adjacent contigs, indicating that the assembly may not yet be complete, unless the genome were composed of linear chromosomes. We repeated the alignment of the Nanopore reads to the assembly, and again retained reads with an alignment score of 5000 or more. We assembled these reads using Flye (Kolmogorov et al. 2018), taking the output assembly graph 2-repeat/graph_final.gfa that identifies repeats that are longer than the read length and determines their precise boundaries. This Flye assembly was polished using Racon. Contigs with homology to the white spruce mitochondrion were selected using Bandage, which uses BLASTN, requiring an alignment length of at least 5 kbp and percent identity at least 90. Contigs with unambiguous adjacent contigs were merged using the Bandage operation "Merge all possible nodes".
In addition to a generally high sequencing error rate, Nanopore reads are particularly poor in accurately representing the length of homopolymer repeats. To compensate, we polished the assembly using one flow cell of Illumina HiSeq sequencing reads of the same DNA extraction, yielding 59-fold depth of coverage of the mitochondrial genome, to correct for sequencing and homopolymer length errors. We used Unicycler Polish to iteratively align the reads to the assembly using Bowtie2 (Langmead & Salzberg 2012), and correct the consensus sequence using Pilon (Walker et al. 2014). This iterative polishing process yielded no further corrections on the tenth round. Unicycler Polish applies Assembly Likelihood Estimate (ALE) (Clark et al. 2013) to each round to verify that the assembly of the final round of polishing resembles the reads the most. While annotating the genome, we found five indel errors in homopolymer runs that disrupted the reading frame of a gene. These five indel errors were corrected manually after inspecting the sequencing data.

Annotation
We annotated coding genes and non-coding rRNA and tRNA genes using automated methods where possible, and performed manual inspection to refine these automated annotations. We used Prokka (Seemann 2014), which uses Prodigal (Hyatt et al. 2010) to identify single-exon coding genes and open reading frames (ORFs). We used MAKER (Holt & Yandell 2011), which uses BLASTP and Exonerate (Slater & Birney 2005) to identify cis-spliced coding genes. We used tRNAscan-SE (Lowe & Eddy 1997) and Aragorn (Laslett 2004) to identify tRNA. We used RNAmmer (Lagesen et al. 2007) and Barrnap (Seemann 2014) to identify rRNA. We used RNAweasel (Lang et al. 2007) and Infernal (Nawrocki et al. 2009) to identify group II introns. RFAM motif Domain-V (RM00007) represents domain V of group II introns, and RFAM family Intron_gpII (RF00029) represents both domains V and VI (Kalvari et al. 2017).
Following automated annotation, we reviewed coding genes for completeness, compared to their best BLASTP match, and corrected the annotation, most often for aspects that are particular to plant mitochondria. We manually corrected the annotation of genes to address start codons created by RNA editing of ACG to the start codon AUG, and editing of GCG to the alternative start codon GUG (see results for details). Three genes display atypical start codons: rpl16 uses a GUG start codon (Sakamoto et al. 1997); rps19 uses a GUG start codon created by RNA editing GCG, seen also in Pinus strobus AJP33554.1; matR appears to use an unusual GGG start codon, seen also in Cycas taitungensis YP_001661429.1 (Chaw et al. 2008) and Pinus strobus AJP33535.1. The gene sdh4 was missed by automatic annotation, as its coding sequence was found to overlap with cox3 by 73 bp on the same strand.
We reviewed splice sites, and adjusted their position to agree with the expected splicing motifs of group II introns when possible, ensuring not to introduce insertions or deletions into the peptide sequence compared to homologous proteins. We confirmed the presence of domain V of the group II intron upstream of the 3' splice site, identified by RNAweasel or Infernal. We manually annotated transspliced introns by comparing alignments of homologous proteins to the genome. We determined the 5' and 3' splice sites similarly to cis-spliced introns, looking for expected group II splicing motifs, and domain V upstream of the 3' splice site. When Infernal did find a match to RFAM Intron_gpII (RF00029), it frequently identified the precise 3' splice site, in agreement with protein sequence homology.
The scripts to assemble and annotate the Sitka spruce mitochondrial genome are available online at https://github.com/sjackman/psitchensismt.

Complete Genome Assembly
The complete mitochondrial genome of Sitka spruce is 5.52 Mbp assembled in 13 segments, and has a GC content of 44.7%. The genome assembly is composed of two components: a 168 kbp circular segment, and a larger 5.36 Mbp component composed of 12 segments, visualized by Bandage (Wick et al. 2015) in Fig. 1. The two smallest segments (27 kbp and 24 kbp, labeled 12 and 13 respectively) exhibit an estimated copy number of two based on their depth of sequencing coverage, and all other segments have similar depth of coverage, assumed to represent single copy. The single-copy segments range in size from 84 kbp to 1.65 Mbp. No sequence variation is evident in these repeats. An absence of variation in the repeat implies that they may be involved in active recombination (Maréchal & Brisson 2010). Though 10% of reads are larger than 24 kbp, no reads fully span these repeats. Figure 1: The assembly graph of the mitochondrial genome of Sitka spruce. Each segment is labeled with its size and named 01 through 13 by rank of size.
The draft mitochondrial genome assembly of white spruce (Jackman et al. 2015) was assembled from paired-end and mate-pair Illumina sequencing. The draft assembly is composed of 117 contigs larger than 2 kbp arranged in 36 scaffolds with a contig N50 of 102 kbp and scaffold N50 of 369 kbp. It provided little information as to the structure of the mitochondrial genome. The complete mitochondrial genome assembly of Sitka spruce assembled from Oxford Nanopore sequencing is composed of 13 contigs larger than 20 kbp with a N50 of 547 kbp.
The assembly graph (Fig. 1) reveals a multipartite genome structure. The use of long reads were critical in achieving this contiguity and completeness.
The nuclear repeats LTR/Gypsy compose 51% of the repeat sequence, LTR/Copia compose 7%, simple repeat sequences compose 34%, low complexity compose 3%, and 5% are other repeat sequences. The 36-bp Bpu repeat sequence is present in roughly 500 copies in Cycas taitungensis and roughly 100 copies in Ginkgo biloba (Guo et al. 2016). We find only a single full-length copy with four mismatches in Sitka spruce, similar to Welwitschia mirabilis.

Genes
The mitochondrial genome of Sitka spruce has 41 distinct protein coding genes with known function, 3 distinct rRNA genes (Table 1), and 27 distinct tRNA genes representing 18 distinct anticodons ( Table 2). The relative order and orientation of these genes are shown in Fig. 2. The 41 known protein coding genes found in the gymnosperm mitochondria Cycas taitungensis (Chaw et al. 2008) and Ginkgo biloba (Guo et al. 2016) are also found in Sitka spruce. The 29 introns, 16 cis-spliced and 13 trans-spliced, are found in 10 protein coding genes, two pseudogenes, and one plastid-derived tRNA (Table 3).
Four ORFs (937, 507, 371, and 221 amino acids) contain an organellar DNA polymerase type B (DNA_pol_B_2) Pfam family (El-Gebali et al. 2018). The largest one also contains a segment homologous to the structural maintenance of chromosome (SMC_N) Pfam family, which includes the RecF and RecN proteins involved in DNA recombination. We hypothesize that this ORF may be involved in recombination-dependent replication of mitochondrial DNA (Gualberto et al. 2014). These ORFs have homology to DNA polymerase genes found in the mitochondria of Picea glauca and the angiosperms Cocos nucifera, Daucus carota, Helianthus annuus, and Silene vulgaris. Two ORFs (781 and 560 amino acids) contain an RNA polymerase (RNA_pol) domain. These ORFs have homology to DNA-dependent RNA polymerase genes found in the mitochondria of Picea glauca and the angiosperms Beta vulgaris, Cocos nucifera, Daucus carota, and Phoenix dactylifera. The two largest genes of the Sitka spruce mitochondrial genome are these putative DNA and RNA polymerase genes.
The matR gene and three additional ORFs (476, 197, and   transcriptase/maturase (group_II_RT_mat) NCBI conserved protein domain (Marchler-Bauer et al. 2016). We hypothesize that these ORFs may be additional maturases involved in splicing (Matsuura 2001). These ORFs have homology to mitochondrial genes of Picea glauca and the angiosperm Utricularia reniformis.
The full complement of rRNA genes are present in Sitka spruce, shown in Table 1. Unlike rRNA genes of other gymnosperms, the Sitka spruce rRNA genes are present in multiple copies. The 5S rRNA gene rrn5 is present in four copies. The small subunit rRNA gene rrn18 is present in three copies, though one copy is found on the 27 kbp repeat segment with an estimated copy number of two. One copy of the large subunit rRNA gene rrn26 is present, though it is found on the 24 kbp repeat segment, which also has an estimated copy number of two. Sitka spruce has 27 tRNA genes, representing 18 distinct anticodons, coding for 15 distinct amino acids, DEHIKLMNPQRTVWY (Table 2). tRNA genes coding for the amino acids ACFGS are absent in Sitka spruce, and also absent in Welwitschia. trnM-CAU exhibits six copies, trnD-GUC three copies, and trnY-GUA two copies. All other tRNA genes are single copy. trnN-GUU, trnV-UAC, and one copy of trnfM-CAU are derived from plastid origins. One cis-spliced intron is observed in the plastid-derived trnV-UAC gene, also seen in Cycas taitungensis. Six tRNA genes (trnL-CAA, trnR-CCG, trnR-GCG, trnT-AGU, trnT-UGU, and trnY-AUA) found in Sitka spruce are absent in Cycas, Ginkgo, and Welwitschia.
In addition to three plastid-derived tRNA genes, ten partial plastid genes are found in the 14 kbp of plastid-derived sequence: atpB, atpE, atpF, chlN, petA, psaA, rps3, rrn18, a partial copy of rpl2, and a partial trnS-GGA gene with homology to Cycas taitungensis. The rpl2 partial gene is more similar to eudicot plastids (77% identical to Helwingia himalaica, Robinia pseudoacacia, and many other eudicots) than it is to the Sitka spruce plastid (66% identical).  This table is adapted from Table S1 of Guo et al. (2016) with the addition of Sitka spruce. (i) Contains a cis-spliced group II intron. *Anticodon is inferred to be edited (Weber et al. 1990).

Introns
Although the same 27 introns are found in the same 11 genes as Cycas taitungensis (Chaw et al. 2008;Guo et al. 2016), eight introns that are cis-spliced in Cycas are trans-spliced in Sitka spruce, more than doubling the number of transspliced introns found in Cycas. Nearly half of the introns in Sitka spruce are trans-spliced. All introns are group II introns, whose domain V was identified by both RNAweasel and Infernal, with one exception.
The first intron of nad1 is trans-spliced in Sitka spruce and other gymnosperms (Guo et al. 2016). No domain V is detectable by Infernal either downstream of exon 1 or upstream of exon 2 in Sitka spruce nor any of Cycas, Ginkgo, and Welwitschia. The genomic disruption of this intron may occur in domain V itself, as is seen in cox2 of Diphylleia rotans (Kamikawa et al. 2016).
The fourth intron of nad1 is cis-spliced, and it contains matR in Cycas, transspliced with a single disruption in Sitka spruce, and trans-spliced with two distinct genomic disruptions in Welwitschia mirabilis ( Figure S2 of Guo et al. 2016). Whereas matR is found in a cis-spliced intron in Cycas and free-standing in Welwitschia, it is found upstream of nad1 exon 5 in Sitka spruce. In this regard, Sitka spruce appears to be an evolutionary midpoint between Cycas and Welwitschia. Sitka spruce has not however experienced the extensive gene loss observed in Welwitschia.
A second partial copy of nad5 is found in Sitka spruce with one cis-spliced group II intron, representing exons 4 and 5. The translated protein sequence of this partial gene is more similar however to eudicot mitochondria (99% identical to both Chrysobalanus icaco and Hirtella racemosa, >95% identical to many other eudicots, and 94% identical to one monocot Triantha glutinosa) than to the complete nad5 of Sitka spruce (76% identical). It may have been acquired by horizontal gene transfer, as is frequently reported in plant mitochondria (Richardson & Palmer 2006) of both gymnosperms (Won & Renner 2003) and angiosperms (Bergthorsson et al. 2003). This interpretation of horizontal gene transfer in plant mitochondria is not universally accepted (Goremykin et al. 2008). This partial copy of nad5 is also found in white spruce (Jackman et al. 2015) with 100% nucleotide identity. This level of conservation between Sitka spruce, white spruce, and angiosperms suggests that this partial gene may be functional. We find no upstream domain V, whose presence would indicate that it may be part of a larger trans-spliced gene. A putatitve alternative GUG start codon created by RNA editing of GCG could initiate translation of this partial gene.
RNAweasel identifies 34 group II domain V regions in Sitka spruce, 26 of which are associated with the intron of a gene. Two domain V regions are found in the cis-spliced introns of the pseudogenes Ψnad5 and plastid-derived Ψrpl2. The remaining six domain V regions are not associated with a gene, and further Table 3: Intron content of four gymnosperms. Sitka spruce has 29 introns, 16 cis-spliced (•) and 13 trans-spliced (T), in ten protein coding genes, two pseudogenes (Ψ), and one tRNA. "T²" indicates a tripartite (double trans-spliced) intron. "-" indicates intron absence. "x" indicates gene absence. "cp" indicates plastid-derived. This table is adapted from Guo et al. (2016) with the addition of Sitka spruce.
The splice-site motifs of the 14 cis-spliced genes of the Sitka spruce mitochondrial genome are shown in Fig. 3, visualized by WebLogo (Crooks 2004). Because its position is variable, the bulged adenosine of the 3' splice site, typically found at position -7 or -8, is not readily apparent.

Conclusion
The 5.5 Mbp mitochondrial genome of Sitka spruce is among the largest ones in plants, and is the largest complete mitochondrial genome reported for a gymnosperm. It follows the trend seen for spruce and conifer nuclear genomes, which are also among the largest in plants (De La Torre et al. 2014). The physical structure of the Sitka spruce mitochondrial genome is not the typical circularly-mapping single chromosome, but multipartite. The larger component of the assembly graph exhibits a rosette-like structure, mirroring the rosette-like structures observed in electron micrographs of mitochondrial DNA (Backert & Börner 2000). The intricate structure and the conservation of large repeat elements suggest the presence of active sites for hypothesized recombinationdependent replication of the mitochondrial genome (Gualberto et al. 2014). Considering heteroplasmy resulting from naturally hybridizing species of spruce and paternal leakage of mitochondria, intermolecular recombination would result in interspecific hybridization of the mitochondrial genome, which has been reported to occur in natural spruce populations (Jaramillo-Correa & Bousquet 2005). Although sequence identity of the set of common mitochondrial proteins is well conserved, their splicing structure is not. Trans-splicing is a frequentlyemployed mechanism of plant mitochondria to compensate for genomic structural instability, and Sitka spruce has a record number of trans-spliced introns.
As the first long read assembly of a complete plant mitochondrial genome, also exhibiting a multipartite genome structure, this resource should prove invaluable to future investigations into the genome structure and mechanism of replication of conifer mitochondrial genomes.

Data Availability
The Oxford Nanopore MinION sequencing data has SRA accessions SRX5081713, and SRR8264752 through SRR8264769. The Illumina HiSeq sequencing data has SRA accession SRR5028199. The Sitka spruce (Picea sitchensis) genotype Q903 complete mitochondrial genome has NCBI GenBank accessions MK697696 through MK697708.