Tracing Homopolymers in Oikopleura dioica's Mitogenome

Abstract
Oikopleura dioica is a planktonic tunicate (class Appendicularia) found throughout the world's oceans. The genome of a single male individual collected from Okinawa, Japan was sequenced using the single-molecule PacBio HiFi method and assembled with NOVOLoci. The mitogenome is 39,268 bp long, featuring a large control region of around 22,000 bp. We annotated the protein-coding genes atp6, cob, cox1, cox2, cox3, nad1, nad4, and nad5, and found one more open reading frame that did not match any known gene. This study marks the first complete mitogenome assembly for an appendicularian, and reveals that A and T homopolymers cumulatively account for nearly half of its length. This reference sequence will be an asset for environmental DNA and phylogenetic studies.


Introduction
Oikopleura dioica (Fol 1872), a marine tunicate within the class Appendicularia (larvaceans), is notable for its widespread distribution across oceans. Throughout its lifecycle, O. dioica remains adrift, carried by ocean currents, and constructs a distinctive cellulose apparatus called the "house", which it uses for protection and food collection. O. dioica frequently replaces its house when it becomes clogged, contributing to marine snow as discarded houses descend to the ocean floor. This process is important for Earth's carbon cycling, emphasizing the species' ecological significance (Glover 2020).
We reported earlier the possibility of cryptic speciation in O. dioica (Masunaga et al. 2022), and that cox1 nucleotide sequence similarity could be as low as ∼85% when comparing O. dioica isolated from Europe (Denoeud et al. 2010; Danks et al. 2013), Okinawa (Bliznina et al. 2021), or the main Japanese islands (Wang et al. 2015, 2020). Appropriate adjustment of the taxonomy is currently being discussed in the tunicate community. Nevertheless, the evolutionary distance between these O. dioica cryptic species is large enough that separate reference sequences are needed to support applications of molecular biology to experimental research and taxonomic surveys using mitochondrial sequences as barcodes. This evolutionary distance also provides useful data to support phylogenetic studies of appendicularians and tunicates, which are needed to understand the evolution of early vertebrates.
The mitogenome of O. dioica was first characterized in the supplementary material of Denoeud et al. (2010), revealing eight genes (cox1, cox2, cox3, nad1, nad4, nad5, cob, and atp6). The existence of nad2 remained in question. The ascidian mitochondrial genetic code was used by Denoeud et al. (2010) to translate these coding genes. Furthermore, Pichon et al. (2019) used multiple sequence alignments of Cox1 and Cob to demonstrate that this genetic code arose early in tunicate history and to confirm its appropriateness for O. dioica sequences, even though O. dioica is not an ascidian. Denoeud et al. (2010) also noted the presence of poly-T insertions in coding regions, and hypothesized that they are reduced to 6-mers in the transcriptome by an RNA editing mechanism. Bliznina et al. (2021) produced a partial assembly of 9,225 bp containing the previously reported genes except nad5. Unfortunately, the length of the poly-T insertions could not be assessed with confidence because of limitations in the Nanopore basecalling technology and the use of postassembly polishing methods. Neither of the two studies could confirm whether the O. dioica mitochondrial genome was a single circle, linear, or a collection of minicircles as in Salpa thompsoni (Goodall-Copestake 2017). Sequencing complete O. dioica mitochondrial genomes has remained a challenge until now because of the lack of technologies accurate over long homopolymers, and the lack of software capable of assembling these regions.

Sample Collection and DNA Extraction
We collected O. dioica specimens at Ishikawa harbor, Okinawa, Japan (26.114N, 127.665E) in June 2018. The samples were washed three times with 5 mL of filtered autoclaved seawater, then resuspended in 200 µL of lysis buffer from the MagAttract HMW DNA Kit (Qiagen, 67563) with 20 µL of 10 µg/mL proteinase K and incubated for 1 h at 56 °C. After adding 50 µL of 5 M NaCl, the mixture was centrifuged at 5,000 × g at 4 °C for 15 min. The supernatant was transferred into a new microtube with 400 µL of 100% EtOH and 5 µL of glycogen (20 mg/mL) and cooled at −80 °C for 20 min. After centrifuging at 6,250 × g at 4 °C for 5 min, the supernatant was removed. The pellet was washed with 1 mL of cold 70% ethanol, centrifuged, and air-dried for 5 min. Finally, the DNA was resuspended in molecular biology grade water, quantified using a Qubit 3 Fluorometer (Thermo Fisher Scientific, Q32850), and quality controlled using an Agilent 4200 TapeStation (Agilent, 5067-5365).

Sequencing and Assembly
The genome of a single individual ("I25") was sequenced on a PacBio Sequel II using a low-input HiFi library kit. Genomic DNA was sheared with a Megaruptor 3 to an average size of 10 kbp. Library size distribution and concentration were assessed using the FEMTO Pulse system.
The sequence reads were assembled with NOVOLoci (Dierckxsens 2024), a newly developed targeted haplotype-aware assembler based on the same principle as the organelle assembler NOVOPlasty (Dierckxsens et al. 2017), using a partial nad1 gene sequence of 350 bp from the O. dioica OKI2018_I69_1.0 sequence (Bliznina et al. 2021) as a seed. The seed sequence is only used to recruit reads for the first iteration of the assembly and therefore does not affect the final assembly. Nevertheless, it is important not to select a repetitive or duplicated region to initiate the assembly. Hence, we chose a region devoid of long homopolymer sequences.

Annotation
We attempted to annotate the assembly with MITOS (Bernt et al. 2013) using the ascidian mitochondrial genetic code (Pichon et al. 2019). However, the length of the homopolymers (Figs. 1 and 3, and Table 1) made it difficult for MITOS to detect entire genes with enough precision and led to a large number of false positives. We therefore compared the genome to the Trinity transcriptome assembly of OKI2018_I69 (Bliznina et al. 2021) to detect each gene. We queried the transcript models with tblastn (Altschul et al. 1997), using amino acid sequences from MITOS or from related tunicates. The matched models, which were polycistronic, are shown in Table 1 and provided in the supplementary material. We determined the extent of each gene's coding sequence using the getorf command from EMBOSS (Rice et al. 2000) with -table 13. The region encoding the gene was then aligned to the genome by Smith-Waterman alignment with the water command, to discover the locations of poly-A or poly-T insertions. The circular plot was drawn with Circos version 0.69-9 (Krzywinski et al. 2009).
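To illustrate what translating with table 13 involves, the following minimal Python sketch builds the ascidian mitochondrial genetic code (NCBI translation table 13, which differs from the standard code at AGA/AGG, ATA, and TGA) and reports the longest stop-free stretch in the three forward frames. It is not a reimplementation of getorf; the function names `translate` and `longest_orf` are hypothetical, and the reverse strand and start-codon choice are deliberately ignored.

```python
from itertools import product

BASES = "TCAG"
# NCBI translation table 13 (ascidian mitochondrial), in TCAG codon order.
# Relative to the standard code: AGA/AGG -> Gly, ATA -> Met, TGA -> Trp.
TABLE13_AA = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSGGVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), TABLE13_AA)
}


def translate(seq):
    """Translate a DNA string in frame 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def longest_orf(seq):
    """Return (frame, start, end, protein) for the longest stop-free
    stretch over the three forward frames (0-based, end-exclusive)."""
    best = (0, 0, 0, "")
    for frame in range(3):
        protein = translate(seq[frame:])
        pos = 0  # position in the protein, counting stop symbols
        for segment in protein.split("*"):
            if len(segment) > len(best[3]):
                start = frame + 3 * pos
                best = (frame, start, start + 3 * len(segment), segment)
            pos += len(segment) + 1
    return best


if __name__ == "__main__":
    # Toy sequence, not taken from the assembly.
    print(longest_orf("ATGTGAAGATAA"))
```

In the toy sequence, TGA and AGA are read as Trp and Gly rather than as a stop codon and Arg, which is why the choice of genetic code matters when delimiting coding sequences.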

Results
We looked for mitochondrial sequences in 58 samples sequenced with Nanopore and 3 samples sequenced with PacBio. Of the 58 Nanopore sequencing runs, 26 did not contain any mitochondrial sequences and none resulted in a circular assembly. The high error rate in the control region, caused by the abundance of homopolymers, made it impossible to assemble the entire genome. Assembly lengths ranged from 800 to 26,000 bp. Assemblies from the 3 PacBio runs were generally longer, and one resulted in a complete circular genome, which we selected for this study. The circular mitogenome has a length of 39,283 bp, containing 34.8% A-homopolymers of length six or more and 13.6% T-homopolymers of length six or more (Fig. 1). While existing algorithms were incapable of assembling the complete mitochondrial genome, NOVOLoci succeeded by extending the seed step by step into a complete circular genome. We confirmed the existence of each gene reported earlier (Denoeud et al. 2010) (Fig. 2, Table 1). All coding genes ended with TAA adjacent to A-homopolymers of length greater than six. No A-homopolymer was found within open reading frames (ORFs), a feature which in retrospect makes the genome very easy to annotate. The possible exception is cob, for which we lack phylogenomic or proteomic evidence to determine whether translation starts before or after an A-homopolymer present at the beginning of the ORF. Every T-homopolymer longer than 6 in the coding genes was reduced to a 6-mer in the transcriptome. We note that some poly-T regions had non-T insertions, which were also removed. Visual inspection of sequence read alignments to the genome confirmed the accuracy of the sequence at these insertions.
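The homopolymer percentages quoted above can be reproduced in principle with a few lines of code. The sketch below is a minimal illustration, not the script used for this study: it computes, for each base, the fraction of a sequence covered by runs of at least six identical bases. The function name `homopolymer_fractions` is hypothetical, and runs spanning the origin of a circular sequence are not joined.

```python
import re


def homopolymer_fractions(seq, min_len=6):
    """Return {base: fraction of seq covered by runs of that base
    that are at least min_len long}."""
    seq = seq.upper()
    total = len(seq)
    fractions = {}
    for base in "ACGT":
        # e.g. "A{6,}" matches runs of six or more A
        pattern = re.compile(f"{base}{{{min_len},}}")
        covered = sum(len(m.group()) for m in pattern.finditer(seq))
        fractions[base] = covered / total if total else 0.0
    return fractions


if __name__ == "__main__":
    # Toy sequence, not taken from the assembly: one A run of 8,
    # one T run of 6, and one A run of 5 that falls below the threshold.
    toy = "ACGT" + "A" * 8 + "CG" + "T" * 6 + "AAAAA" + "GC"
    for base, frac in homopolymer_fractions(toy).items():
        print(base, round(100 * frac, 1))
```

Applied to the assembled FASTA sequence, the A and T values would correspond to the two proportions reported in Fig. 1, assuming the same threshold of six bases.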
The region between cox1 and cob does not encode known proteins, but we found a long ORF at the same position where Denoeud et al. (2010) hypothesized nad2. Unfortunately, the identity of this gene could not be determined, as its sequence did not have matches in the NCBI BLAST databases. The other ORFs of the region were all shorter and also without known protein or nucleotide matches. The cox1 sequence reported here is 99.4% similar to the transcript model we used as a seed, and encodes the same protein sequence.
To place our reference genome into context and to illustrate the relationship of appendicularians to other taxa, we computed a phylogenetic tree on 87 codon-aligned cox1 chordate sequences (Fig. 2). The placement of the appendicularian branch differs from our previous analysis based on nuclear gene orthogroups (Plessy et al. 2024), but the discrepancy may be caused by long branch attraction. We also note that the Ciona genus grouped with Aplousobranchia instead of Phlebobranchia, where current taxonomy places it, but this discrepancy has been observed in numerous phylogenomic analyses before. Finally, the division between O. dioica and other appendicularians corresponds to the Coecaria (O. longicauda group) and Vexillaria (O. dioica group) and is well supported by taxonomy (Galt et al. 1985) and genomics (Naville et al. 2019). Sequence identity between O. dioica and O. longicauda is below 60%; no full-length cox1 sequence is available for closer relatives of O. dioica such as O. albicans or O. vanhoeffeni (Naville et al. 2019).

Discussion and Conclusion
We present here the first complete mitogenome of O. dioica. The assembly is circular, but because of the length of the control region we cannot exclude the possibility that the mitochondrial genome is actually a linear concatenate or is present in multiple copies in a circle. Our findings confirm the presence of the genes previously reported and identify an unusually high number of poly-T insertions in the control region. Our annotation reveals a general principle that will facilitate the study of other mitogenomes containing homopolymer insertions in the Oikopleura genus. First, genes are separated by poly-A regions directly encoded in the genome. Second, poly-T insertions containing a few non-T bases may also be edited. The discovery of this principle enables the annotation of homopolymer-containing Oikopleura mitogenomes in the absence of transcriptome annotation.
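As a rough illustration of how these observations could be applied to a genomic sequence without transcriptome data, the Python sketch below locates long A runs (candidate gene boundaries) and shortens every T run longer than six bases to a 6-mer, mimicking the editing seen in the transcriptome. This is a simplified sketch under stated assumptions, not the procedure used for this annotation: the function names `find_runs` and `collapse_poly_t` are hypothetical, poly-T insertions interrupted by non-T bases are not handled, and an A run adjacent to a TAA stop codon is reported together with the two A bases of that codon.

```python
import re


def find_runs(seq, base, min_len):
    """Return (start, end) pairs, 0-based and end-exclusive, for every
    run of `base` that is at least `min_len` long."""
    pattern = re.compile(f"{base}{{{min_len},}}")
    return [(m.start(), m.end()) for m in pattern.finditer(seq.upper())]


def collapse_poly_t(seq, max_len=6):
    """Shorten every T run longer than `max_len` to exactly `max_len`.
    Runs of `max_len` or fewer are left untouched."""
    return re.sub(f"T{{{max_len + 1},}}", "T" * max_len, seq.upper())


if __name__ == "__main__":
    # Toy sequence, not taken from the assembly.
    toy = "CCG" + "A" * 8 + "GTC" + "T" * 11 + "GGC"
    print(find_runs(toy, "A", 7))
    print(collapse_poly_t(toy))
```

The collapsed sequence could then be passed to an ORF finder using the ascidian mitochondrial genetic code, with the A runs serving as a guide to where coding sequences end.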
Our protein-coding gene annotation does not resolve the possible loss of the ATP synthase subunit atp8 and the dehydrogenase subunits nad2, nad3, nad4L, and nad6 since the last common ancestor with ascidians. Investigation of the ORFs between cox1 and cob is currently hampered by the lack of sequence homology with other tunicates.
During the preparation of this manuscript, we became aware of the publication of a mitogenome assembly for a European isolate of O. dioica (Klirs et al. 2024). To avoid confusion caused by the high nucleotide divergence, we assigned our assembly to NCBI's taxon ID 3071372 as "unclassified Oikopleura", pending the needed taxonomic adjustments. These two assemblies will be an asset for detecting O. dioica in environmental DNA (eDNA) studies of diverse geographical regions, and open the way for comparative approaches to elucidate noncoding regions and small ORFs. Additional mitogenomes from the Oikopleura genus will be needed to increase the power of these methods.
We confirmed that the gene order in our assembly is identical to that of Denoeud et al. (2010), extracted from a Norwegian laboratory line, and that of Bliznina et al. (2021), extracted from our laboratory line from Okinawa, Japan. Plessy et al. (2024) showed that in the nuclear genome, the order of protein-coding genes is "scrambled" when comparing these two O. dioica lines. It is therefore notable that, although the order of genes is said to be less conserved in mitogenomes (Singh et al. 2009), no change took place in O. dioica at the time scale separating these two populations.
The mechanism of poly-T editing in the mitochondrial mRNAs is not understood. Although the existence of this editing is common knowledge in the scientific community working with appendicularian mitochondrial sequences, our publication and that of Klirs et al. (2024) are only the second ones after Denoeud et al. (2010) to provide evidence matching genome and transcriptome data. Because of (1) this reason, (2) the uncertainty on the coding gene count, and (3) the absence of ncRNA annotation, we were only allowed to deposit the genome's sequence in GenBank as "UNVERIFIED", with no annotation; we provide the annotation as supplementary material. This is unfortunate because it hides essential taxonomic information from metagenomics and eDNA studies that rely on the existence of annotated hits with strong sequence similarity in public databases. This publication aims to build up the evidence available to curators regarding the existence of homopolymer editing in the Oikopleura genus, and the need to adjust database infrastructure so that this biological phenomenon can be properly represented in annotations distributed by public databases.

bioinformatics analysis. K.W. and N.D. drafted the manuscript. C.P. and N.M.L. critically revised the manuscript. All authors approved the final manuscript and agreed to be accountable for all aspects of this work.

Fig. 1. Pie charts showing the proportion of homopolymers of six or more successive bases. C homopolymers are absent and G homopolymers make up only a small proportion, so both are excluded from the charts.

Fig. 2. Phylogenetic relationships between the cox1 sequenced in this work and the other publicly available appendicularian sequences. Bootstrap values are displayed on each node.

Table 1
Coordinates of the annotated genes, IDs of transcripts used for the annotation, number of edited homopolymers