Data on the multilocus molecular phylogenies of the Neotropical fish family Prochilodontidae (Teleostei: Characiformes)

The data presented herein support the article “Molecular phylogenetics of the Neotropical fish family Prochilodontidae (Teleostei: Characiformes)” (B.F. Melo, B.L. Sidlauskas, B.W. Frable, K. Hoekzema, R.P. Vari, C. Oliveira, 2016) [1], which inferred phylogenetic relationships of the prochilodontids from an alignment of three mitochondrial and three nuclear loci (5279 bp) for all 21 recognized prochilodontid species and 22 related species. Herein, we provide primer sequences, museum voucher information and GenBank accession numbers. Additionally, we more fully describe the maximum-likelihood and Bayesian phylogenetic analyses of the concatenated dataset, detail the Bayesian species tree analysis, and provide the maximum likelihood topologies congruent with prior morphological hypotheses that were compared with the unconstrained tree using Shimodaira–Hasegawa tests.

Data accessibility
Data are provided with this article and in the GenBank public repository, GenBank: KX086740 through GenBank: KX087100 (see Table 2).

Value of the data
New sequence data were used to infer the first complete molecular phylogenetic analysis of the family Prochilodontidae.
The dataset includes DNA sequences for all 21 valid prochilodontid species and 22 related characiform species, many of which are not otherwise represented in GenBank.
These data facilitate synthesis with previously published sequences and can be reused in other studies because the loci are commonly used in fish phylogenetics.
Constrained phylogenies permit statistical comparison of new molecular results with prior morphological hypotheses.

Data
We provide: 1) a table documenting the deposition of museum voucher specimens, 2) a file containing concatenated alignments for all six loci, 3) a table containing GenBank accession numbers, 4) procedures, parameters and configuration scripts used to estimate phylogenetic relationships, 5) Newick-formatted treefiles inferred with maximum likelihood, concatenated Bayesian, and species tree methods, 6) Newick-formatted treefiles and PDF images of maximum likelihood phylogenies inferred under four topological constraints matching the morphological phylogeny of Castro and Vari [2], and 7) procedures used in Shimodaira-Hasegawa tests of alternative topologies.

Taxon sampling
This dataset includes samples from 77 individuals: 55 individuals representing all 21 species of the three prochilodontid genera, and samples from 22 related taxa from the other three anostomoid families (Anostomidae, Chilodontidae, Curimatidae), three families previously hypothesized to be closely related to Anostomoidea (Hemiodontidae, Parodontidae and Serrasalmidae), and Brycon pesu (Bryconidae) as the outgroup. Nine of the samples were derived from previous studies [3][4][5], and thus 88% of these data are new to science. We used tissue samples stored in 95% ethanol or a saturated DMSO/NaCl solution, primarily from specimens deposited in museum and university collections (see Table 1 in Melo et al. [1]). We included multiple individuals for each prochilodontid species except Ichthyoelephas longirostris, which is exceedingly rare in tissue collections. The authors BFM, BLS and RPV confirmed the taxonomic identity of most voucher specimens using morphological features.

Molecular dataset
We extracted genomic DNA using DNeasy Tissue kits (Qiagen Inc.) or a modified NaCl protocol from Lopera-Barrero et al. [6]. For this dataset, we amplified partial sequences of the mitochondrial genes 16S rRNA (16S, 510 bp), cytochrome c oxidase subunit I (COI, 658 bp) and cytochrome b (Cytb, 991 bp) using one round of polymerase chain reaction (PCR). Additionally, we acquired sequences of the nuclear myosin heavy chain 6 gene (Myh6, 711 bp), recombination activating gene 1 (Rag1, 1379 bp), and recombination activating gene 2 (Rag2, 1030 bp) using nested PCR following Oliveira et al. [3]. Primers for the loci appear in Table 1. We selected these loci because they are commonly used in phylogenetic analyses of Neotropical characiforms [3][4][5] and will facilitate subsequent supermatrix analyses and use by other researchers. Amplification techniques and sequencing reactions are detailed in Melo et al. [1]. We amplified and included all six loci for 42 of the 77 individuals. In the rest of the matrix, we are missing one locus for 22 individuals, two loci for nine individuals, four loci for one individual and five loci for three individuals (both specimens of Ichthyoelephas humeralis and one of Prochilodus britskii; see Table 2). New sequences generated in this analysis were deposited in GenBank with accession numbers KX086740 through KX087100.

Table 2. Specimens and loci used in Melo et al. [1]. For each individual, its taxonomic designation, collection catalog number of voucher, tissue specimen number, and GenBank accession numbers are given (GenBank: KX086740 through GenBank: KX087100).
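The locus lengths and the distribution of missing loci reported above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the deposited files; the dictionary names are our own, and the values are copied from the text.

```python
# Partial-sequence lengths (bp) for the six loci, as reported in the text.
LOCUS_LENGTHS = {
    "16S": 510, "COI": 658, "Cytb": 991,   # mitochondrial
    "Myh6": 711, "Rag1": 1379, "Rag2": 1030,  # nuclear
}

# Number of individuals, keyed by how many loci they are missing (from the text).
INDIVIDUALS_BY_MISSING_LOCI = {0: 42, 1: 22, 2: 9, 4: 1, 5: 3}

# Total length of the concatenated alignment.
total_bp = sum(LOCUS_LENGTHS.values())

# Total number of individuals in the matrix.
n_individuals = sum(INDIVIDUALS_BY_MISSING_LOCI.values())

# Number of locus-by-individual cells with no sequence at all.
missing_cells = sum(k * n for k, n in INDIVIDUALS_BY_MISSING_LOCI.items())

print(f"concatenated length: {total_bp} bp")
print(f"individuals: {n_individuals}")
print(f"missing locus-by-individual cells: {missing_cells} "
      f"of {n_individuals * len(LOCUS_LENGTHS)}")
```

The count of missing cells is a per-locus measure; it is not the same quantity as a percentage of missing data calculated per nucleotide site.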

Alignment, partitioning, and model selection
We aligned and edited sequences using Geneious 7.1.7 ([7]; www.geneious.com). We assigned IUPAC ambiguity codes where we detected uncertainty of nucleotide identity. We aligned the consensus sequences for each gene with the MUSCLE algorithm [8] implemented in Geneious using default parameters and inspected the alignments visually for obvious misalignments. We estimated the index of substitution saturation (Iss) using DAMBE 5.3.38 [9] to evaluate the occurrence of substitution saturation. We found no indication of substitution saturation in transitions or transversions under any topology. Initial examination of the complete 16S data revealed many uncertain alignments resulting from length polymorphism in loop regions. We excluded these hypervariable regions to produce a reduced 16S submatrix, which was in turn concatenated with the other five genes. The final concatenated dataset for all the sampled taxa is 5279 bp long with 8.9% missing data, 944 (17.9%) identical sites, and 1970 variable sites, of which 1463 are parsimony-informative (matrix file Prochilodontidae_matrix.nex). Nucleotide frequencies are presented in Table 1.
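To make the site categories above concrete, the following minimal sketch counts constant, variable and parsimony-informative columns in a toy alignment. It is not the procedure used in Geneious. It assumes that only unambiguous A/C/G/T states are counted and that a site is parsimony-informative when at least two states each occur in at least two sequences. The function name and the toy sequences are hypothetical.

```python
from collections import Counter

UNAMBIGUOUS = set("ACGT")      # states counted when classifying a column
MISSING = set("-?N")           # symbols treated as missing data (assumption)


def site_summary(sequences):
    """Classify alignment columns and count missing cells."""
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()

    summary = {"constant": 0, "variable": 0, "informative": 0, "missing_cells": 0}
    for i in range(n_sites):
        column = [s[i].upper() for s in sequences]
        summary["missing_cells"] += sum(1 for c in column if c in MISSING)
        states = Counter(c for c in column if c in UNAMBIGUOUS)
        if len(states) <= 1:
            summary["constant"] += 1
            continue
        summary["variable"] += 1
        # Informative: at least two states, each present in two or more sequences.
        if sum(1 for n in states.values() if n >= 2) >= 2:
            summary["informative"] += 1
    return summary


# Toy alignment (hypothetical data, four sequences by six sites).
toy = ["ACGTAC", "ACGTTC", "ACCTTA", "AC-TAA"]
result = site_summary(toy)
print(result)
```

Other programs may define identical or variable sites differently (for example in how gaps and ambiguity codes are handled), so counts from this sketch need not match those reported by Geneious.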
We used PartitionFinder 1.1.0 [10] to select the partitioning scheme and the model of molecular evolution for each partition in the scheme using the Bayesian information criterion (BIC). For this analysis, we assumed 16 possible partitions (Table 3): one for each codon position in the five coding genes (COI, Cytb, Myh6, Rag1 and Rag2), plus the 16S stems. The analysis identified six partitions, with models summarized in Table 3.
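The 16 candidate partitions follow from three codon positions in each of five coding genes plus one block for 16S. The sketch below enumerates them in the `name = start-end\3` style used for PartitionFinder data blocks. The concatenation order (16S, COI, Cytb, Myh6, Rag1, Rag2) and the assumption that every coding gene begins at codon position 1 are ours, so the coordinates are illustrative and may not match the deposited matrix.

```python
# (locus name, aligned length in bp, protein-coding?) in an ASSUMED concatenation order.
LOCI = [
    ("16S", 510, False),
    ("COI", 658, True),
    ("Cytb", 991, True),
    ("Myh6", 711, True),
    ("Rag1", 1379, True),
    ("Rag2", 1030, True),
]


def candidate_blocks(loci):
    """Return (block name, coordinate string) pairs, one per candidate partition."""
    blocks = []
    start = 1
    for name, length, coding in loci:
        end = start + length - 1
        if coding:
            # One block per codon position: every third site from an offset start.
            for pos in range(3):
                blocks.append((f"{name}_pos{pos + 1}", f"{start + pos}-{end}\\3"))
        else:
            blocks.append((name, f"{start}-{end}"))
        start = end + 1
    return blocks


blocks = candidate_blocks(LOCI)
for block_name, coords in blocks:
    print(f"{block_name} = {coords};")
print(f"{len(blocks)} candidate blocks")
```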

Concatenated analyses
We analyzed the partitioned matrix using Bayesian methods in MrBayes 3.2 [11] with the substitution models identified by PartitionFinder (Table 3). We performed two independent runs of four Markov chain Monte Carlo (MCMC) chains for 20 million generations each, sampling every two thousand generations. Methods for identifying the maximum-clade credibility (MCC) tree are discussed in Melo et al. [1]. We visualized and edited the final MCC phylogeny with FigTree v1.4.2 (treefile max_cred_tree_newick.nwk).
We inferred a maximum likelihood (ML) topology using RAxML HPC v.8 on XSEDE [12] through the CIPRES Science Gateway v.3.3 [13]. The partitioning scheme was that identified by PartitionFinder; however, substitution models were restricted to GTR due to the limitations of RAxML. Additional information on the ML analysis is provided in Melo et al. [1]. The final maximum likelihood phylogeny is provided here in treefile RAxML_bipartitions.unconstrained_result (Fig. 1).

Species tree analyses
We implemented the sequence-based species tree ancestral reconstruction method *BEAST [14]. This method estimates the posterior probability of all gene trees and the species tree simultaneously from the alignment, with informed priors on substitutions and rates of evolution. *BEAST requires a priori assignment of individuals to species or OTUs, which (rather than individual organisms or sequences) form the tips of the species tree. Due to the non-monophyletic reconstructions of Prochilodus nigricans and P. rubrotaeniatus in the concatenated analysis (see Melo et al. [1]), we assigned each of those species to two separate species units, denoted by 1 and 2 following the species name (see Fig. 5 in Melo et al. [1]). The final analysis included 77 individuals in 41 nominal species and four taxonomic units. We constrained Prochilodontidae to monophyly based on exceptionally strong evidence from morphology [2] and from the concatenated molecular analyses [1]. Brycon pesu served as the outgroup.
We hypothesized six possible partitions (one for each gene), and used the BIC in PartitionFinder 1.1.4 [10] to estimate the best partitioning scheme and to select the best-fit model for each gene (Table 4). We implemented the uncorrelated lognormal (UCLN) rate variation model to estimate trees in BEAST v1.8.3 because previous empirical and simulation studies have demonstrated that the UCLN model is usually the most accurate and robust [15,16] when local clocks are not expected [17]. A lognormal prior was set on the mean clock rate for each gene (… any time) and is considered the most appropriate when extinction is known or suspected to have occurred in the group [15]. Priors and parameters were set in BEAUti 1.8.3 [18]. We ran four independent MCMC chains for 250 million generations, sampling data every 25,000 generations. … sampled trees with a log clade credibility of -8.56 (Fig. 5 in Melo et al. [1]; treefile StarBeast_MCC_Prochilodontidae_concatenation.nwk).
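The chain lengths and sampling intervals reported for the MrBayes and *BEAST analyses imply the number of sampled states per run before any burn-in is removed. The sketch below only does that arithmetic. The class and field names are hypothetical, the counts exclude the initial state, and no burn-in fraction is assumed because none is stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class McmcPlan:
    label: str
    runs: int            # independent runs (MrBayes) or independent chains (*BEAST)
    generations: int     # generations per run
    sample_every: int    # sampling interval in generations

    def samples_per_run(self) -> int:
        # Sampled states per run, not counting the starting state.
        return self.generations // self.sample_every

    def total_samples(self) -> int:
        return self.runs * self.samples_per_run()


# Values copied from the text.
mrbayes = McmcPlan("MrBayes concatenated", runs=2,
                   generations=20_000_000, sample_every=2_000)
starbeast = McmcPlan("*BEAST species tree", runs=4,
                     generations=250_000_000, sample_every=25_000)

for plan in (mrbayes, starbeast):
    print(f"{plan.label}: {plan.samples_per_run()} samples per run, "
          f"{plan.total_samples()} in total before burn-in")
```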

Shimodaira-Hasegawa tests
To compare support for the most likely molecular topology (Fig. 1; treefile F1_RAxML_bestTree.unconstrained_result.nwk) with support for the morphological hypothesis of Castro and Vari [2], we inferred ML trees in RAxML under four morphology-based constraints discussed in Melo et al. [1]. Constraint trees were created in Mesquite 3.04 [19], and results inferred under those constraints appear in Figs. 2-5 (treefiles F2_constraint4_Ichthyoelephas_constrained_RAxML_bestTree.result.nwk, F3_constraint1_Semaprochilodus_taeniurus_constrained_RAxML_bestTree.result.nwk, F4_constraint2_Semaprochilodus_constrained_RAxML_bestTree.result.nwk, F5_constraint3_Prochilodus_constrained_RAxML_bestTree.result.nwk). The best tree inferred under constraint four (Fig. 2) contains an extremely short branch subtending the Semaprochilodus + Prochilodus clade, effectively creating a genus-level polytomy. This topology likely results from the much poorer probability of the sequence data given any of the tree models available under constraint four. The maximum likelihood tree under constraint four essentially makes the best of a poor region of parameter space by setting the evolutionary history shared by Semaprochilodus and Ichthyoelephas, but not Prochilodus, to the minimum possible value. Branch length shortening under the other three constraints is substantially more subtle.
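A topological constraint of the kind described above is commonly written as a multifurcating Newick tree, in which the taxa forced to be monophyletic are grouped in one clade and all other relationships are left unresolved. The sketch below builds such a string. It is not the Mesquite workflow used for the deposited constraint trees, and the function name and taxon labels are hypothetical placeholders rather than the actual matrix labels.

```python
def monophyly_constraint(all_taxa, clade):
    """Return a Newick constraint forcing `clade` to be monophyletic.

    Taxa outside the clade are left in a basal polytomy, so the constraint
    says nothing about their relationships.
    """
    clade_set = set(clade)
    unknown = clade_set - set(all_taxa)
    if unknown:
        raise ValueError(f"clade taxa not in taxon list: {sorted(unknown)}")
    if len(clade_set) < 2:
        raise ValueError("a monophyly constraint needs at least two taxa")

    inside = [t for t in all_taxa if t in clade_set]
    outside = [t for t in all_taxa if t not in clade_set]
    return "(" + ",".join(outside + ["(" + ",".join(inside) + ")"]) + ");"


# Hypothetical placeholder labels, not the names used in the deposited files.
taxa = ["Brycon_x", "Ichthyoelephas_a", "Prochilodus_a",
        "Prochilodus_b", "Semaprochilodus_a"]
constraint = monophyly_constraint(taxa, ["Prochilodus_a", "Prochilodus_b"])
print(constraint)
```

Leaving the remaining taxa in a polytomy means the likelihood search is free to resolve everything except the enforced clade.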
Table 5. Prior parameter settings for major priors applied in *BEAST. Prior names are as in *BEAST/BEAUti and are described in the BEAST documentation [18].

We compared the ML unconstrained phylogeny with the four constrained phylogenies using the Shimodaira-Hasegawa (SH) test [20] as implemented in phangorn v2.0.1 [21]. The script for performing these analyses appears here as SHtest.r, and depends upon the FASTA alignment in prochilodontidae.fasta.
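For readers unfamiliar with the SH test, the following is a minimal pure-Python sketch of its resampling logic, assuming per-site log-likelihoods are already available for each candidate tree. It is not SHtest.r and does not reproduce the phangorn implementation. The function name, the number of replicates, the seed and the toy likelihood values are all hypothetical.

```python
import random


def sh_test(site_lnl, n_boot=1000, seed=1):
    """Return (total lnL, observed difference from best, p-value) for each tree.

    site_lnl: one list of per-site log-likelihoods per tree, all of equal length.
    """
    n_trees, n_sites = len(site_lnl), len(site_lnl[0])
    if any(len(row) != n_sites for row in site_lnl):
        raise ValueError("every tree needs one log-likelihood per site")
    rng = random.Random(seed)

    totals = [sum(row) for row in site_lnl]
    best = max(totals)
    observed = [best - t for t in totals]

    # Bootstrap: resample sites with replacement and re-sum the site likelihoods.
    boot = []
    for _ in range(n_boot):
        idx = [rng.randrange(n_sites) for _ in range(n_sites)]
        boot.append([sum(row[i] for i in idx) for row in site_lnl])

    # Centre each tree's bootstrap totals on that tree's own bootstrap mean.
    means = [sum(rep[t] for rep in boot) / n_boot for t in range(n_trees)]
    exceed = [0] * n_trees
    for rep in boot:
        centered = [rep[t] - means[t] for t in range(n_trees)]
        top = max(centered)
        for t in range(n_trees):
            if top - centered[t] >= observed[t]:
                exceed[t] += 1
    return [(totals[t], observed[t], exceed[t] / n_boot) for t in range(n_trees)]


# Toy per-site log-likelihoods (hypothetical): tree B is worse at every site,
# tree C is slightly worse overall but better at half of the sites.
tree_a = [-2.0 - 0.01 * (i % 7) for i in range(200)]
tree_b = [x - 0.5 for x in tree_a]
tree_c = [x + (0.3 if i % 2 == 0 else -0.31) for i, x in enumerate(tree_a)]

results = sh_test([tree_a, tree_b, tree_c])
for name, (lnl, delta, p) in zip("ABC", results):
    print(f"tree {name}: lnL = {lnl:.2f}, delta = {delta:.2f}, p = {p:.3f}")
```

In this toy case the tree that is worse at every site is rejected, while the tree whose deficit is small relative to the site-to-site variation is not.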

Transparency document. Supplementary material
Transparency data associated with this article can be found in the online version at: http://dx.doi.org/10.1016/j.dib.2016.08.015.