Genome statistics and phylogenetic reconstructions for Southern Hemisphere whelks (Gastropoda: Buccinulidae)

This data article provides genome statistics, phylogenetic networks and trees for a phylogenetic study of Southern Hemisphere Buccinulidae marine snails [1]. We present alternative phylogenetic reconstructions using mitochondrial genomic and 45S nuclear ribosomal cassette DNA sequence data, as well as trees based on short-length DNA sequence data. We also investigate the proportion of variable sites per sequence length for a set of mitochondrial and nuclear ribosomal genes, in order to examine the phylogenetic information provided by different DNA markers. Sequence alignment files used for phylogenetic reconstructions in the main text and this article are provided here.


Specifications
Subject area: Phylogenetics; Genetics; Evolutionary Biology
Type of data: Table, text file, graph, figure
How data was acquired: High-throughput and Sanger DNA sequencing
Data format: DNA sequence alignments are provided as text files in .nex (Nexus) format and phylogenetic trees in .tree format.
Experimental factors: Total DNA was extracted from specimens using CTAB buffer. DNA was paired-end sequenced on the high-throughput Illumina HiSeq 2500 platform. Short-length DNA sequences were amplified via PCR and Sanger sequenced.
Experimental features: mtDNA genome and 45S nuclear ribosomal DNA sequences were assembled using reference sequences. Sequences were aligned, with gaps and poorly aligned positions removed. Phylogenetic trees were constructed using Bayesian (BEAST 1.8.3) and maximum-likelihood (RAxML 8.2.8) methods. The unrooted phylogenetic network of some alignments was investigated using SplitsTree 4.
Data source location: Most specimens originate from New Zealand waters; some were collected from the coasts of Australia, Japan, the USA (California) and the UK.
Data accessibility: Interactive .nwk (Newick) tree files are provided here and with the main article [1].

Value of the data
Summary statistics for whole mitochondrial DNA sequences and 45S nuclear ribosomal genes are presented because such information for gastropods is currently rare, and base bias is known to influence phylogenetic inferences.
Phylogenetic reconstructions (from short-length DNA sequence data) presented here include multiple buccinid and buccinulid taxa not included in the main-text trees, and may be useful to future evolutionary studies of Neogastropoda.
DNA sequence variation and phylogenetic trees are provided because Southern Hemisphere taxa are currently under-sampled.
The proportion of variable DNA sites is compared for a selection of mitochondrial and nuclear genes from buccinulid whelks. This information can improve genetic marker selection for future molluscan studies.
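The first point above notes that base bias can influence phylogenetic inference. As a minimal sketch of how such a summary statistic can be computed from a single sequence, the Python below tallies base frequencies and GC content. The function names and the example fragment are hypothetical and are not taken from the study, which reports its own statistics in the tables.

```python
from collections import Counter


def base_composition(sequence):
    """Return the fraction of A, C, G and T among unambiguous bases.

    Gaps ('-') and ambiguity codes such as 'N' are ignored (an assumption
    of this sketch, not a description of the authors' procedure).
    """
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return {base: counts[base] / total for base in "ACGT"}


def gc_content(sequence):
    """Return the combined G + C fraction among unambiguous bases."""
    composition = base_composition(sequence)
    return composition["G"] + composition["C"]


# Hypothetical AT-rich fragment with one gap and one ambiguous base.
example_fragment = "AAAATTTTGC-N"
example_composition = base_composition(example_fragment)
example_gc = gc_content(example_fragment)
```

A strongly skewed composition (for example, a GC fraction far from 0.5) is the kind of base bias the summary statistics are intended to expose.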

Data
The data presented here originates from a phylogenetic study of Southern Hemisphere whelks [1], a group of marine snails that can be classified within the taxonomic families Buccinulidae or Buccinidae [2–8]. The classification of these gastropod snails depends upon a biogeographic hypothesis and an assumption of reciprocal monophyly between the majority of lineages in the Northern and Southern Hemispheres [3,6–8]. Results from our study indicated that Buccinulidae and Southern Hemisphere whelks are not monophyletic [1]. Thirty-two putative buccinulid and buccinid marine snails, as well as three fasciolariid snails used as a phylogenetic outgroup, were high-throughput sequenced on the Illumina HiSeq 2500 platform. Sequence data was assembled to provide mitochondrial (mtDNA) genomic and 45S nuclear ribosomal DNA (rDNA) sequence data for most taxa, although the entire mtDNA or rDNA could not be sequenced successfully for some individuals. This data was complemented with short-length sequence data from the mitochondrial 16S RNA and cox1 genes and the nuclear ribosomal 28S RNA gene. This short-length sequence data was acquired via PCR amplification and Sanger sequencing using universal primers. Sequence alignments used for analyses presented in the main text are attached to this paper.

A maximum-likelihood derived phylogeny generated using RAxML 8.2.8 [9], based on an alignment of 31 concatenated mitochondrial genome sequences (11,128 bp, incorporating protein-encoding, tRNA and rRNA genes). No partitions were used. No outgroup or monophyly was enforced for this tree. Genera putatively belonging to Buccinulidae are shown in different colours.
Using these sequence alignments, we present maximum-likelihood and Bayesian phylogenetic reconstructions for the sampled buccinulid whelks. These phylogenetic trees are alternative reconstructions that can be compared with the trees presented in the main text. Splits networks are also estimated using the mtDNA genomic and nuclear ribosomal RNA (18S, 5.8S, 28S) sequence data. The proportion of variable sites per sequence length is also investigated for a set of mitochondrial and nuclear ribosomal genes, which provides insight into the information each marker carries for recent and distant evolutionary change in neogastropods (Figs. 1 and 9).
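The proportion of variable sites per sequence length mentioned above can be illustrated with a short Python sketch. It counts alignment columns that contain more than one unambiguous base and divides by the aligned length. The function name and the toy alignment are hypothetical, and treating gaps and ambiguity codes as missing data is an assumption of this sketch; the values in this article were calculated with Geneious.

```python
def proportion_variable_sites(aligned_sequences):
    """Fraction of alignment columns that show more than one unambiguous base.

    All sequences must already be aligned to the same length. Gaps and
    ambiguity codes are treated as missing data (an assumption of this sketch).
    """
    if not aligned_sequences:
        raise ValueError("at least one sequence is required")
    lengths = {len(sequence) for sequence in aligned_sequences}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_columns = lengths.pop()
    if n_columns == 0:
        raise ValueError("alignment has zero length")

    variable_columns = 0
    for column in zip(*aligned_sequences):
        observed_bases = {base.upper() for base in column if base.upper() in "ACGT"}
        if len(observed_bases) > 1:
            variable_columns += 1
    return variable_columns / n_columns


# Hypothetical three-sequence alignment of eight columns; columns 3 and 8 vary.
toy_alignment = ["ACGTACGT", "ACGTACGA", "ACCTACG-"]
toy_proportion = proportion_variable_sites(toy_alignment)
```

Comparing this proportion between genes at the same taxonomic depth gives a rough indication of which markers change quickly enough to resolve recent divergences and which are conserved enough for deeper ones.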

Experimental design, materials and methods
The DNA extraction, purification and sequencing methods and the routine for sequence assembly are described in the main text [1]. The main text also explains how the figures presented here were generated, including the software and settings used. Legends for the tables and figures presented below specify which sequence alignments were used (again referenced in the main text) (Tables 1 and 2).

Bayesian calibrated mtDNA phylogeny of buccinid and buccinulid whelks. A Bayesian phylogeny based on an alignment of 25 concatenated mitochondrial genome sequences (incorporating protein-encoding, tRNA and rRNA genes), which has been fossil calibrated to estimate divergence dates among the whelk lineages. Two sequence partitions were used: 1) protein-encoding and tRNA genes (10,635 bp), and 2) tRNA genes (1065 bp), using the GTR + I + G and HKY + I + G substitution models respectively [10,11]. Black stars indicate splits that were fossil calibrated. Tree root height was calibrated using the earliest known buccinoid fossils [12], and fossil calibrations were also used for the earliest Fasciolariidae (un-enforced outgroup) [13,14] and for the earliest known occurrence of the tip branch Buccinulum v. vittatum [15]. This phylogeny was generated in BEAST 1.8.3 [16] using an MCMC length of 100 million, a sample frequency of 1000 and a 10% burn-in. Node labels are estimated median divergence dates, with the 95% highest posterior density (HPD) range shown as a blue bar. Posterior support values are also shown at nodes, but only if support was < 1.0. Putative buccinulid genera are shown in different colours.

The GTR + I + G substitution model was used [11]. The phylogeny was produced using a Bayesian method (100 million MCMC, 10% burn-in, 1000 sample frequency, node labels are posterior support values), via BEAST 1.8.3 [16]. For this tree no outgroup was specified explicitly, but reciprocal monophyly was enforced for the Fasciolariidae and Buccinidae/Buccinulidae/Nassariidae. Genera putatively belonging to Buccinulidae are shown in different colours.

Fig. 7. Proportion of variable sites at increasingly deep levels of divergence. The proportion of variable sites per sequence length (bp) for a selection of mtDNA and nuclear rDNA genes reflects different rates of DNA substitution. Values were calculated using Geneious 9.1.3 [17]. The trends plotted effectively represent change in the phylogenetic information provided by each gene for different levels of investigation. Average numbers of variable sites were used for groups in genus- and family-level comparisons. For example, we used the average number of differences for all sampled whelk (Buccinidae/Buccinulidae) taxa from all sampled Fasciolariidae taxa. Sampling from Aeneator, Buccinulum and Penion was used to estimate generic-level differences, as these groups contained more than two specimens. Likewise, only P. sulcatus, P. chathamensis and P. c. cuvierianus were used for within-species estimates, as these taxa were sampled twice. Since read coverage varies for some genes, not all individuals were included in the estimates made for each gene.

Fig. 8. Splits network illustrating alternative phylogenetic signal in mtDNA sequence data for marine snails. The splits network is based on an alignment of 31 concatenated mitochondrial genome sequences (incorporating protein-encoding, tRNA and rRNA genes; 11,128 bp). Splits were generated using the Neighbor-Net algorithm in SplitsTree 4 [18]. The splits network presents a generalisation of all possible topological solutions for the phylogenetic signal contained in the underlying sequence data, but it does not quantify the likelihood of alternative phylogenetic relationships. Edge length is proportional to split weight, and box structures within the network indicate signal for alternative topologies in the underlying sequence data.

Fig. 9. Splits network illustrating alternative phylogenetic signal in 45S rDNA sequence data for marine snails. The splits network is based on a 4667 bp alignment of 31 concatenated nuclear rDNA gene sequences (18S, 5.8S, 28S rRNA genes). Splits were generated using the Neighbor-Net algorithm in SplitsTree 4 [18]. The splits network presents a generalisation of all possible topological solutions for the phylogenetic signal contained in the underlying sequence data, but it does not quantify the likelihood of alternative phylogenetic relationships. Edge length is proportional to split weight, and box structures within the network indicate signal for alternative topologies in the underlying sequence data.
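The legend for the variable-sites figure describes averaging differences between groups, for example all sampled whelk taxa against all sampled Fasciolariidae taxa. One way to read that description is sketched below in Python: count the differing sites for every cross-group pair of aligned sequences, divide by the aligned length, and average over pairs. This is an assumption about the procedure rather than a record of the Geneious workflow, and all names and sequences in the block are hypothetical.

```python
from itertools import product


def count_differences(seq_a, seq_b):
    """Count aligned positions where both sequences have different unambiguous bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same aligned length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT" and a != b
    )


def mean_between_group_proportion(group_a, group_b):
    """Average proportion of differing sites over all pairs drawn across two groups."""
    pairs = list(product(group_a, group_b))
    if not pairs:
        raise ValueError("both groups must contain at least one sequence")
    proportions = []
    for seq_a, seq_b in pairs:
        if len(seq_a) == 0:
            raise ValueError("sequences must not be empty")
        proportions.append(count_differences(seq_a, seq_b) / len(seq_a))
    return sum(proportions) / len(proportions)


# Hypothetical aligned fragments standing in for two families.
whelk_sequences = ["ACGTACGT", "ACGTACGA"]
fasciolariid_sequences = ["ACCTTCGT"]
family_level_proportion = mean_between_group_proportion(
    whelk_sequences, fasciolariid_sequences
)
```

The same function could be applied to genera or to conspecific individuals by changing which sequences are placed in each group, which mirrors the increasingly deep levels of divergence compared in the figure.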

Transparency document. Supporting information
Transparency data associated with this article can be found in the online version at https://doi.org/10.1016/j.dib.2017.11.021.

Appendix A. Supporting information
Supplementary data associated with this article can be found in the online version at https://doi.org/10.1016/j.dib.2017.11.021.