Introduction

The freshwater zebra mussel, Dreissena polymorpha (Pallas, 1771) (Dreissenidae), is one of the most successful invaders in terms of abundance and high density (Nalepa & Schlosser, 1993; Van der Velde et al., 2010; Schlosser & Schmuckal, 2012; Sousa et al., 2014). Native to the lakes, slow moving rivers and low salinity areas of the Black, Caspian and Azov Seas regions of central and eastern Europe, the zebra mussel was introduced to northwestern Russia, much of mainland Europe and England by at least the mid-1800s by the construction of canal systems in Europe and Russia that linked several major river drainages. The zebra mussel together with the quagga mussel (Dreissena bugensis, Andrusov, 1897) was introduced into the Great Lakes of North America in the mid-1980s by the discharge of ballast water from trans-Atlantic freighters (Stańczykowska, 1977; Hebert et al., 1989; Roberts, 1990; Bij de Vaate et al., 2002; Astanei et al., 2005; Bidwell, 2010). The colonization success of the zebra mussel is largely related to its method of reproduction external mode of fertilization, presence of free-swimming planktonic veliger larval stage, and attachment of juveniles and adults to hard surfaces via byssal threads (Borcherding, 1991; Johnson & Carlton, 1996). Despite this, there are only relatively few genetic studies of this invader (Stepien et al., 2002; Soroka, 1999, 2002, 2003, 2010; Astanei et al., 2005; Astanei & Gosling, 2010).

The zebra mussels started expanding their range in Poland approx. 200 years ago, and this expansion was correlated with the development of inland sailing (Stańczykowska, 1977; Bidwell, 2010). It is presumed to have arrived on the Baltic coast via River Nemunas which, at the end of the 18th century, was linked with River Dnieper via the Oginskij Canal (Nowak, 1974; Bidwell, 2010; Kołodziejczyk et al., 2011). In Poland, the earliest record of zebra mussel date from 1824, from the former Eastern Prussia, whereas it is currently present mainly in Pomeranian, Masurian and Suwalskie Lake District as well as in lagoons, lower stretches, and estuaries of the Vistula and Odra rivers (Stańczykowska, 1977; Lewandowski, 1991; Bidwell, 2010; Stańczykowska et al., 2010).

D. polymorpha is one of the most studied invasive species (Karatayev et al., 1997, 2007; Kołodziejczyk et al., 2011; Schlosser & Schmuckal, 2012; Sousa et al., 2014); however, most studies concentrate on their ecology, toxicology, and physiology, whereas purely genetic studies are less abundant and concentrate more on North American than European samples (Sousa et al., 2014). Few markers are available, mainly allozyme electrophoretic and RAPDs markers (Hebert et al., 1989; May & Marsden, 1992; Spidle et al., 1994; Marsden et al., 1995; Lewis et al., 2000; Elderkin et al., 2001, 2004; Stepien et al., 2002). The usefulness of cox1 appeared to be limited for the regional and fine-scale European D. polymorpha populations due to very low levels of polymorphism: only 5 haplotypes were observed over 338 individuals by Tarnowska et al. (2013) in France. Generally, the polymorphism measured with these low-resolution and/or anonymous markers led neither to the full understanding of the invasion nor to its better control (Marsden et al., 1995, 1996; Elderkin et al., 2001, 2004; Müller et al., 2002; Stepien et al., 2002).

Several studies used mitochondrial markers to assess genetic polymorphism and connectivity of D. polymorpha populations. The only available markers are based on the most conservative cox1 and lrn genes. Nevertheless, it has been shown that populations closest to the native range have marginally higher polymorphism in mitochondrial genomes, and a common haplotype is present in all sampled locations in Europe and North America, with the overall diversity derived from cox1 sequencing not exceeding 1.1% (Baldwin et al., 1996; Giribet & Wheeler, 2002; Therriault et al., 2004; Albrecht et al., 2007; Soroka, 2010).

We maintain that the available genetic resources, including the mitochondrial variation currently recognized, are inadequate to fully characterize the D. polymorpha range expansion in Europe. Consequently, we will present here an attempt to fill the gap by providing a large amount of original transcriptomic data and a nearly complete mitogenome of D. polymorpha.

Materials and methods

A sample of 10 zebra mussels was collected from lake Wdzydze (54°01′34.21″N, 17°55′31.96″E) in July 2010. Fresh animals were transported to the laboratory, sexed by microscopic examination of gonadal tissues. Both RNA and DNA were extracted from mantle/gonadal tissue of the same male individual using established protocols (Sańko & Burzyński, 2014). The total RNA sample was sent to Macrogen (South Korea). Library preparation and next-generation sequencing (NGS) was performed by Macrogen, using TrueSeq kit and MiSeq 2 × 150 bp technology. Approximately 3.5 × 106 raw sequence reads were obtained. They were preprocessed and assembled using CLC Genomics workbench software v. 9.01 (QIAGENE), under default settings for de novo NGS assembly. The assembly was evaluated by TransRate (Smith-Unna et al., 2016) and BUSCO (Simão et al., 2015) applications. TransRate is a tool for reference-free quality assessment of de novo transcriptome assemblies, using only the sequenced reads and the assembly as input. BUSCO (Benchmarking Universal Single-Copy Orthologs) is a measure for quantitative assessment of genome assembly and annotation completeness based on evolutionarily informed expectations of gene content. The obtained set of transcripts was first analyzed for the presence of mitochondrial transcripts by wise2 searches (Birney et al., 2004) using hmm profiles of mitochondrialy encoded proteins (Finn et al., 2010) and by nhmmer (Wheeler & Eddy, 2013) searches using hmm profiles of mitochondrialy encoded ribosomal RNA genes (Griffiths-Jones, 2003). The sequences of mitochondrial contigs were then used to design specific primers (Supplementary Table S1) for mtDNA amplification and Sanger sequencing. Sequence annotation was achieved by comparative analysis: the original annotations form wise2 searches were verified with ORF-finding and BLAST (Altschul et al., 1990) searches from within the CLC software. In addition, the trn genes were identified using arwen software (Laslett & Canbäck, 2008). To better evaluate the structure of mitochondrial transcripts, the quality filtered reads were mapped onto the final annotated mitochondrial contig in CLC software. The remaining contigs were analyzed using the BLAST2GO application (Conesa et al., 2005). Phylogenetic analysis based on transcriptomic data was conducted by adding D. polymorpha transcript sequences to the reference alignments downloaded from the DRYAD repository (González et al., 2014) and following similar methodology (González et al., 2015). Maximum-likelihood tree searches were conducted with RAxML v. 8.2.8 (Stamatakis, 2006), using a model of protein evolution with corrections for a discrete gamma distribution with the LG model (“-m PROTGAMMALG” option) (Le & Gascuel, 2008). Bootstrap resampling was conducted for 100 replicates using a rapid bootstrapping algorithm (Stamatakis et al., 2008), and the support values were mapped onto the optimal tree.

The annotated mitochondrial and assembled transcriptomic sequences as well as raw sequence data have been deposited in GenBank under the following accession numbers: mitogenome—KY091877, raw data—SRR5000302. The Transcriptome Shotgun Assembly project has been deposited at DDBJ/EMBL/GenBank under the accession GFBQ00000000. The version described in this study is the first version, GFBQ01000000.

Results

General transcriptome data

The overall quality of the assembly was good, with a TransRate score at 0.18 and approximately 150 BUSCOs present. Given the general lack of annotated reference data for bivalves, this is a good result. There were 26636 contigs, out of which approx. 7500 were positively identified by Blast2GO application (Supplementary Fig. S1), showing diverse distribution of contigs among various GO terms, typical for bivalvian mantle (Sleight et al., 2016; Vendrami et al., 2016; Yarra et al., 2016).

The transcriptome provided 1136 loci compatible with the general phylogenetic framework proposed by González et al. (2015), out of 1377 loci present in the largest supermatrix of González et al. (2014). These were combined forming the new supermatrix, composed of 276,775 alignment patterns. The overall proportion of missing data in this dataset was 46.2%, very similar to the original one (46.6%). This slight decrease was caused by the fact that some of the included transcripts were incomplete, effectively shortening the alignment. This effect was nevertheless negligible. The performed maximum-likelihood analysis of bivalvian phylogeny based on this new dataset was sufficiently informative to unambiguously place D. polymorpha in the general bivalvian phylogenetic context. The obtained tree (Fig. 1) has an overall topology that is identical to the original one (González et al., 2015; Fig. 1) and places D. polymorpha in relatively close proximity to Mya arenaria.

Fig. 1
figure 1

Phylogenetic tree of Bivalvia reconstructed using ML method, based on concatenated alignment of 1136 amino acid sequences, and total number of alignment patterns was 276,775. The alignment was obtained by adding homologous D. polymorpha sequences to the largest supermatrix of González et al. (2014). The colors in the figure follow the convention used by González et al. (2015). Bootstrap support values were 100% for all nodes, with the exception of the few for which the actual value is shown in the picture

Mitogenome data

It was possible to close all but one gap between the mitochondrial contigs leading to a single, linear contig containing all identified mitochondrial transcripts. The mitochondrial contig was more than 18,450 bp long (Fig. 2). Despite much effort, it was impossible to amplify the fragment spanning the last gap, between nad2 and nad5 genes. The remaining part of this contig is quite complete: both rRNA genes were identified as well as 12 protein-coding genes (CDS). The only missing CDS was atp8. Several tRNA genes were unequivocally placed on the map forming a reasonably complete annotation. The three exceptions were the missing trnL and the trnC genes and the presence of one extra trnW. Transcript mapping of the raw cDNA reads to the assembled mitochondrial contig revealed the pattern of transcripts, with similar and very high sequence coverage for most CDS and much higher coverage for lrn. The average coverage for the contig was very high (Table 1), sufficient to positively identify, and map all polyadenylation sites (Fig. 2). Apparently, the genes in this genome are encoded in two blocks of opposite transcriptional orientations. Most transcripts cover just a single gene, with the exception of a nd6-cytb pair which forms a bicistronic transcript. Neither transcripts nor PCR products showed any sequence polymorphism, indicating that only one mitochondrial genome was present and expressed in the analyzed tissues, without even trace evidence for heteroplasmy.

Fig. 2
figure 2

Gene order of the mitochondrial DNA of Dreissena polymorpha, based on transcriptome and amplicon sequencing and comparative annotations. All CDS, rRNA and tRNA, are labeled; three letter amino acid code is used to indicate tRNA specificity. Arrows indicate the direction of transcription

Table 1 Mitochondrial transcript mapping statistics

The partial sequences of cox1 (KC429149, AF474404, EU484431, EF414494, AM748975, DQ840121, AF510510, JX239087: Marescaux & Van Doninck, 2013) and lrn (DQ280038, KP052744, DQ333747: Giribet et al., 2006; Van der Velde et al., 2010) genes of D. polymorpha present in GenBank perfectly match the respective fragment of this genome, confirming its identity. BLAST searches with all annotated D. polymorpha mitochondrial genes against RefSeq database pointed at the mitogenome belonging to M. arenaria as the closest one. However, the gene order of D. polymorpha mitogenome is unique, not resembling that of M. arenaria and the sequence homology was also relatively weak, at 60% sequence identity for cox1, 55% for cytb and 35% for atp6.

There were three relatively long unassigned regions in the mitogenome. The largest one contained an array of AT-rich tandem repeats. The length of the repeat unit was 121–133 bp, but the number of repeats was difficult to infer reliably using short-reads NGS. The assembly suggested the presence of at least 6 repeats. There were also two short open reading frames (ORFs) present within the URs. The shorter one, 153 bp long, was located between trnS and nd4, while the longer one, 246 bp long, was located between trnQ and atp6. Neither of the predicted protein sequences encoded by these ORFs had any similarity to known proteins in BLAST and wise2 searches against nr and pfam databases.

Discussion

Some bivalve species have an exotic system of mitochondria inheritance called DUI (doubly uniparental inheritance) resulting in male heteroplasmy for two, often very divergent mitogenomes (Skibinski et al., 1994; Zouros et al., 1994). Since DUI has been reported also for a venerid clam Ruditapes philipinarum (Passamonti & Scali, 2001), it could have been expected in D. polymorpha, too. However, based on the presented data, it can be positively concluded that zebra mussels do not have DUI as their male gonads are homopalsmic, both at the genetic and transcriptomic levels. The representative mitochondrial genome was characterized at unprecedented level by providing its nearly complete sequence. The incompleteness of this mitogenome of is unfortunate. There are several possible reasons for this failure, but the most likely one is the presence of long and complex repeats coupled with the apparently difficulty of melting secondary structures within the missing part of the mitogenome. The alternative hypothesis would be that the mitogenome is actually linear in this species. There are very few examples of linear mitogenomes in animals, most notably atypical structures have been reported in some crustaceans (Doublet et al., 2012). Either way the complete structure could probably be solved only by using modern NGS technique allowing much longer reads. Other than the structural uncertainty, this genome is reasonably complete, with the possible exception of atp8. This is not surprising given the difficulties associated with detection of this gene (Breton et al., 2010; Śmietanka et al., 2010). The possibility that one of the anonymous ORFs is in fact encoding atp8 cannot be dismissed; however, there is also no positive argument supporting such a claim. Similar sequences do not exist even in the closest mitogenome of M. arenaria. Ultimately, increasing taxonomic sampling of mitogenomes will allow better comparative analysis and may lead to correct identification of these ORFs. The missing and/or extra trn genes could well be the consequence of incompleteness or imprecise genome annotations. It can be argued that there are no numts (nuclear mitochondrial DNA segments) contaminating our data because transcript mapping shows typical pattern of expression and numts are generally not expressed. In fact, even taking into account that the overrepresentation of mitochondrial transcript is a typical feature of animal RNA-seq datasets (Smith, 2013), the observed proportion (more than 26%) of mitochondrial transcripts was very high. In this context, it is a little surprising that the expression of nad6 was much lower than that of the other genes. This may be due to the bicistronic nature of its transcript and the fact that library construction involves priming of the first strand cDNA synthesis from the oligo-dT-based primer which may contribute to the under-representation of 3′ ends of transcripts in the library. Bicistronic transcripts have been observed previously in other bivalves (Chatzoglou et al., 2013; Sańko & Burzyński, 2014).

The phylogenetic relationships within Bivalvia are a matter of ongoing debate. The early inferences based on poor molecular markers often conflicted with established views based on morphological characters, up to the point of questioning the monophyly of Bivalvia (Plazzi et al., 2013). Fortunately, these conflicts can usually be resolved by providing more molecular data. For Bivalvia, the recent analysis of transcriptomes representing all major lineages is such a keystone (González et al., 2015), solving most conflicts. It is therefore reassuring that the presented transcriptomic dataset is fully compatible with this methodology and supports the placement of Dreissenids in close proximity with Myidae, confirming the recent findings based on a much smaller dataset (Bieler et al., 2014) and in accordance also with the presented mitogenomic data. The only existing NGS-derived D. polymorpha dataset was targeted at STR (short tandem repeats, also known as microsatellites) marker development and contains relatively low number of unique contigs (Peñarrubia et al., 2015). Moreover, nearly all microsatellite sequences of D. polymorpha deposited in GenBank can be also found in the presented transcriptomic dataset, demonstrating that it can be also a good data source for STR marker development.

The presented set of transcriptomic and mitogenomic data should be beneficial for development of highly variable genetic markers that is needed for studying and managing of D. polymorpha invasions.