Global connectivity patterns of the notoriously invasive mussel, Mytilus galloprovincialis Lmk using archived CO1 sequence data

The invasive mussel, Mytilus galloprovincialis has established invasive populations across the globe and in some regions, have completely displaced native mussels through competitive exclusion. The objective of this study was to elucidate global connectivity patterns of M. galloprovincialis strictly using archived cytochrome c oxidase 1 sequence data obtained from public databases. Through exhaustive mining and the development of a systematic workflow, we compiled the most comprehensive global CO1 dataset for M. galloprovincialis thus far, consisting of 209 sequences representing 14 populations. Haplotype networks were constructed and genetic differentiation was assessed using pairwise analysis of molecular variance. There was significant genetic structuring across populations with significant geographic patterning of haplotypes. In particular, South Korea, South China, Turkey and Australasia appear to be the most genetically isolated populations. However, we were unable to recover a northern and southern hemisphere grouping for M. galloprovincialis as was found in previous studies. These results suggest a complex dispersal pattern for M. galloprovincialis driven by several contributors including both natural and anthropogenic dispersal mechanisms along with the possibility of potential hybridization and ancient vicariance events.


Introduction
Quantifying dispersal in marine environments has been a long standing challenge due to the difficulty in tracking large numbers of microscopic larvae within oceanic basins [1]. As a consequence, indirect methods have been developed, the most common of which is population genetics. In marine invasion ecology, population genetics is often employed to track the dispersal of invasive species but the dynamic nature of marine invasions caused by changes in vector strength, transient dispersal barriers and stochastic factors, poses a challenge [2]. One potential alternative could be the use of archived sequence data which possess a temporal element. As sequencing costs continues to decline, public data banks that archive sequences are growing at an exponential rate [3]. These data banks play a central role in the life sciences because they allow for the reproducibility of published research, which recently, has been a contentious issue in the life sciences [4]. In invasion genetics and indeed, population genetics as a whole, such resources remain surprisingly underutilized [3], despite the fact that they possess a wealth of spatio-temporal sequence data generated from a variety of projects [5,6]. In this study, we attempted to elucidate global connectivity patterns of the invasive Mediterranean mussel, Mytilus galloprovincialis Lamarck, 1819 by repurposing archived cytochrome c oxidase 1 (CO1) sequence data from public databanks.
Mytilus galloprovincialis is a relatively small marine bivalve (5-8 cm) that is native to the Mediterranean but has aggressively extended its range to the Americas, Asia, southern Africa and Australasia [7][8][9]. The primary vector responsible for the spread of this species includes shipping, more specifically the transportation of planktonic larvae in ballast water of commercial ships and attachment of byssal threads to ship hulls [10]. In addition, the ease of culturing M. galloprovincialis along with its palatability have resulted in the transplantation of stock populations for aquaculture purposes in different regions of the world [10]. The success of the species in its introduced range is due to inherent biological characteristics that make it an aggressive invader, including high fecundity and recruitment rates [11], broad thermal tolerance [12] and resistance to desiccation and parasites [13]. The objective of this study was to assess levels of genetic differentiation of M. galloprovincialis across multiple global localities. We hypothesized that M. galloprovincialis would exhibit marked genetic differentiation between northern and southern hemisphere mussels but overall very low levels of genetic differentiation within hemispheres due to repeated introductions.

Data mining and alignment
A workflow for repurposing repository sequence data was developed (Fig. 1). A mining program was first coded in C++ to search several DNA databases for M. galloprovincialis DNA sequences. These databases included GenBank, the European Nucleotide Archive, DNA Database of Japan and the Barcode of Life Database (BoLD).
An automated search was preferentially chosen over a manual search due to the speed of sequence acquisition, and its highly discriminative nature (a specific code is unlikely to pull duplicates, or ambiguous sequences). The mitochondrial DNA marker, cytochrome c oxidase 1 (CO1) was chosen due to its overrepresentation in population genetic studies for this species relative to other markers. The following qualifiers were incorporated into the coding script: 'Mytilus' , 'galloprovincialis' 'mitochondrial' , 'CO1' . After scanning 2480 mitochondrial gene sequences, 322 CO1 fragments were recovered (date of original search: April 2016). For verification, each database entry was manually checked and discarded if: (i) the sequence could not be linked to published research (peer reviewed articles, technical papers or conference abstracts) and or (ii) the sequence could not be traced to a specific geographic locality. In a few cases the mining program recovered 'CO1-like sequences' which were discarded. To avoid compiling duplicated sequences, database entries were cross referenced and any duplications were also discarded. To confirm collection dates, sequences were cross-referenced to their corresponding publications and in cases where no collection date was specified, authors were contacted directly for confirmation. Based on these aforementioned filters, a finally tally of 209 sequences representing 14 distinct populations were tagged as 'useable' for this study and they were all accessible from the GenBank database (Additional file 1: Table S1; Fig. 2a). A series of alignment algorithms (CLUSTALW, MUSCLE and MAFFT) were tested on the compiled dataset set in Geneious ver 10.1.3 [14] Fig. 1 Workflow for CO1 sequence acquisition of Mytilus galloprovincialis from data mining to sequence alignment and edited in BioEdit ver. 5 [15]. The MAFFT algorithm provided the highest quality dataset as measured by bp length. It was also chosen because to its incorporation of iterative refinement steps that corrects for accidental misalignments [16].

Genetic differentiation analyses
To determine evolutionary relationships among haplotypes, a statistical parsimony network was constructed using TCS ver.1.2.1 [17], with the fixed connection limit set to 95%. Genetic differentiation across populations was calculated via pairwise ɸ ST comparisons that were carried out in Arlequin ver. 3.5. [18]. To determine the extent of differentiation between northern and southern hemisphere populations, populations representing both regions of the world were clustered into batches: (i) North: Spain, Portugal, British Columbia, Turkey, China (South and Northwestern Pacific-NP), Korea, Portugal and (ii) South: South Africa, Chile, Australia (West and East), Tasmania and New Zealand. A hierarchical analysis of molecular variance (AMOVA) along with ɸ ST calculations were carried out on both clusters and among sites to estimate the extent of genetic differentiation.

Results
A 360-bp fragment with 157 variable sites was obtained for the CO1 marker. A total of 67 haplotypes was recovered of which 47 were unique and 38% of these unique haplotypes originated from the South African population (Fig. 2b). There were four distinct haplogroups that, with the exception of Turkish individuals, were separated by at least 20 mutational steps. These haplogroups included Australasia, consisting of some Australian individuals and the entire Tasmanian and New Zealand populations, Turkey, Korea and South China. While the parsimony network showed generally strong geographic patterning of haplotypes, some haplotypes were shared by individuals from six geographically distinct populations. There was also a close relationship between South African and Chilean haplotypes with North Atlantic and Mediterranean haplotypes as indicated by the high frequency of haplotype sharing and the small number of mutational steps separating them.

Discussion
Previous phylogenetic studies using RFLPS and 16S rRNA sequences recovered two distinct northern and southern hemisphere clades of M. galloprovincialis [19][20][21]. In contrast, the CO1 gene in this study failed to recover these such grouping and in fact, hierarchical AMOVA results found that most of the genetic variation (~ 88%) was found within the southern and northern populations rather than between them. This issue is of particular importance because a distinct northern and southern clade of M. galloprovincialis is used as evidence for the support of the 'northern migration' hypothesis which states that M. galloprovincialis diverged from an ancestral M. edulis in the Mediterranean Sea more than 1.5 mya and then migrated to the southern hemisphere during the Pleistocene via the Atlantic Ocean. The lack of distinct northern and southern hemisphere groupings could be due to cryptic dispersal where frequent anthropogenic transport could dilute phylogeographic signal and consequently decrease the value of standard population genetic parameters (e.g. F ST ) [22].
While our study did not recover a northern and southern group, it did recover two Australasian haplogroups consisting of some individuals from east Australia and the entire western Australian, Tasmanian and New Zealand population. Approximately 99% of Mytilus mussels  [23]. In particular, Tasmanian individuals did not share haplotypes with any other population and were even genetically isolated from nearby Australian and New Zealand individuals. These results are congruent with a previous nuclear hybridization study which showed that Tasmanian M. galloprovincialis is endemic and a secondary contact with M. edulis either before or after the founding population became established, could have resulted in the genetic isolation observed in the present study [24]. Both South Korean and South Chinese M. galloprovincialis populations also showed genetic isolation based on haplotype networks and pairwise AMOVA results. More surprising, however was that the Korean population was strongly differentiated from both the NP Chinese and South Chinese populations (ɸ ST = 0.88 and 0.91 respectively-P < 0.05) despite the fact that all three localities are located less than 2000 km away from each other in the Yellow Sea with ocean current direction conducive to fine scale connectivity. The high genetic structure observed in this region could be due to hybridization. In all three Asian localities, M. galloprovincialis' range overlaps with that of M. edulis and M. trossulus. [25]. Hybridization and mtDNA introgression among these species are common in regions where they occur together [7]. When interspecific hybridization occurs, especially during invasion events, there is an elevated response to selection pressures, resulting in rapid rates of adaptation and as a consequence, the development of barriers to mtDNA exchange [26]. It is therefore possible that our sequence alignment may have included hybrid lineages that were submitted under the name of M. galloprovincialis though further genetic analyses will be needed to definitively detect the presence of hybridization. Alternatively, past geological and climatic changes in this coastal system could have resulted in ancient vicariance events [27], leading to the observed genetic structuring of the East Asian populations.
There were also frequent instances of haplotype sharing along between NP Chinese populations, North Atlantic, Mediterranean and Pacific populations of Chile and British Columbia, which is congruent with past population studies of M. galloprovincialis using nuclear markers [28][29][30]. In addition, we found that the North Atlantic and Mediterranean population shared haplotypes with the NP Chinese, Chilean and British Columbia populations. Since it is unlikely that such close kinship is due to dispersal across the Pacific Ocean, we hypothesize that multiple introductory events could be connecting these populations. Previous surveys and studies have shown that the Pacific, especially the northwestern Pacific has been a viable corridor for non-indigenous species for decades [31]. Transoceanic shipping across this corridor could therefore be responsible for the movement of M. galloprovincialis from the Asian Pacific region to British Columbia.
The negative impacts of M. galloprovincialis has been studied extensively on the southern African coast where the mussel had rapidly colonized the western coast of the country upon in its introduction in the 1970s and has since displaced the native mussel Aulacomya ater in this region [8]. Both haplotype networks and AMOVA results agrees with previous studies that suggest a Mediterranean origin for South African M. galloprovincialis [32].
In conclusion, our results are indicative of a complex dispersal pattern for M. galloprovincialis and it likely involves a combination of natural and anthropogenic dispersal coupled with local adaptation and hybridization events. Most importantly, these findings were based on archived genetic data drawn from disparate studies. This study therefore shows that DNA sequence repositories possess valuable genetic data, from which informative results can be gleaned from post hoc analyses.

Limitations
Many researchers are very wary of using public databases to test new hypotheses especially in large scale population genetic studies. A significant issue is the taxonomic reliability of submitted sequences, especially with regards to closely related species like the Mytilus spp. complex of which the study species belongs to. However, we believe our workflow for repurposing sequence data was adequate in filtering bona fide M. galloprovincialis CO1 sequences. However, we did encounter some problems associated with sequence acquisition itself. For some of the DNA barcoding and phylogenetic studies, sampling localities were not included either as a source modifier in GenBank or in the published study, and in such cases authors had to be contacted directly for this information. In addition, GPS co-ordinates were also missing so we were unable to carry out some crucial tests commonly used in connectivity studies including isolation by distance (IBD) calculations and SAMOVA (spatial analysis of molecular variance).

Additional file
Additional file 1: Table S1. GenBank accession data. List of GenBank accession numbers for all CO1 sequences used in the present study. In this file, sequences are separated by population and both sample size and the original purpose of the sequences are provided.