Anonymous nuclear markers data supporting species tree phylogeny and divergence time estimates in a cactus species complex in South America

Supportive data related to the article “Anonymous nuclear markers reveal taxonomic incongruence and long-term disjunction in a cactus species complex with continental-island distribution in South America” (Perez et al., 2016) [1]. Here, we present pyrosequencing results, primer sequences, a cpDNA phylogeny, and a species tree phylogeny.


Type of data
Pyrosequencing filtering steps, primer sequences and characteristics, species tree analysis input and output, species tree and cpDNA phylogenetic tree

How data was acquired
Pyrosequencing filtering in pyRAD, primer sequences designed with Primer3, primer characteristics gathered with DnaSP, species tree and cpDNA phylogenetic tree generated with BEAST2

Data format

Primer sequences allow researchers to test and to use this genomic information in other related taxa.
Plastid and multilocus phylogenies allow comparison of the topologies recovered with the two sets of markers and also enable comparisons with other codistributed taxa.

Data
The data shared in this article consist of primer sequences designed after filtering two pyrosequencing runs, sequencing data from 25 nuclear markers in 40 individuals from four species of the Pilosocereus aurisetus species complex, and the species tree and chloroplast topologies used in Perez et al. [1].

Bioinformatic analysis
The pyrosequencing reads were quality-controlled using the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) and the pyRAD package [2] to recover variable loci with data available across the species and populations analyzed. The following parameters were applied: (1) ≥5 identical sequences for each allele, to minimize the recovery of sequencing errors and homopolymers; (2) ≤2 different bases at a given nucleotide position, as the organisms are diploid and showed no signal of polyploidy [3]; (3) ≤20 polymorphic sites per locus, to avoid including paralogous loci, which usually show high levels of variation. The dataset remaining after each quality control step is shown in Table 1. The pyrosequencing data filtering resulted in a total of 223 loci occurring in at least 10 individuals, which were aligned against GenBank with BLASTn (Table 2). All loci that matched cytoplasmic sequences (cpDNA and mtDNA) and retrotransposons were discarded, resulting in 26 loci in all populations sampled.
Primers were developed for these loci in Primer3 v4.0.0 [4] with the following parameters: (1) primer size between 18 and 23 bp; (2) melting temperature (Tm) between 58 and 63 °C; (3) maximum Tm difference of 2 °C between forward and reverse primers; (4) GC content of 20-70%. All the developed loci showed specific amplification in at least one sample, but one marker was discarded from further analysis owing to amplification and sequencing problems in the outgroup. Sanger sequencing reactions were obtained for 117 sequences (containing both strands), selected to ensure data for at least two individuals for each locus. After combining the Sanger and pyrosequencing sequences for the 25 loci, a total of 687 sequences from 40 individuals were obtained (Supplementary Table 1), with a total of 367 SNPs.
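The locus filtering thresholds and primer design constraints listed above can be restated as simple predicates. The following is only a minimal sketch: the `LocusSummary` record and both function names are hypothetical and do not correspond to pyRAD or Primer3 formats or APIs, and the melting temperatures are assumed to be supplied by the caller (for example, as reported by Primer3) rather than computed here.

```python
from dataclasses import dataclass

# Thresholds quoted in the text; the record layout below is hypothetical.
MIN_IDENTICAL_READS = 5     # (1) at least 5 identical sequences per allele
MAX_BASES_PER_SITE = 2      # (2) at most 2 different bases per position (diploids)
MAX_POLYMORPHIC_SITES = 20  # (3) at most 20 polymorphic sites per locus
MIN_INDIVIDUALS = 10        # loci were retained if present in at least 10 individuals


@dataclass
class LocusSummary:
    """Hypothetical per-locus summary; not a pyRAD output format."""
    min_reads_per_allele: int
    max_bases_per_site: int
    polymorphic_sites: int
    individuals: int


def passes_locus_filters(locus: LocusSummary) -> bool:
    """Return True if the locus satisfies all four thresholds above."""
    return (
        locus.min_reads_per_allele >= MIN_IDENTICAL_READS
        and locus.max_bases_per_site <= MAX_BASES_PER_SITE
        and locus.polymorphic_sites <= MAX_POLYMORPHIC_SITES
        and locus.individuals >= MIN_INDIVIDUALS
    )


def gc_percent(primer: str) -> float:
    """GC content of a primer sequence, in percent."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def passes_primer_constraints(fwd: str, rev: str,
                              tm_fwd: float, tm_rev: float) -> bool:
    """Check a primer pair against the design constraints in the text.

    Tm values (in degrees Celsius) are inputs; this sketch does not
    attempt to reproduce any particular Tm calculation.
    """
    size_ok = all(18 <= len(p) <= 23 for p in (fwd, rev))
    tm_ok = all(58.0 <= t <= 63.0 for t in (tm_fwd, tm_rev))
    delta_ok = abs(tm_fwd - tm_rev) <= 2.0
    gc_ok = all(20.0 <= gc_percent(p) <= 70.0 for p in (fwd, rev))
    return size_ok and tm_ok and delta_ok and gc_ok
```

A pair of 18 and 20 bp primers with 50% GC content and Tm values of 60.0 and 61.5 °C would pass, whereas the same pair with Tm values of 58.0 and 61.0 °C would fail on the 2 °C difference rule.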
The obtained loci were tested for recombination using the DSS method [5] as implemented in TOPALi v2 [6], and we also tested for loci under selection using Tajima's D and Fu and Li's D* and F* in DnaSP [7]. The results of the recombination and selection tests, as well as the main characteristics of each locus, are available in Table 3.
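For readers unfamiliar with the first of these neutrality statistics, the sketch below computes Tajima's D from its standard definition. It is an illustration only, not the DnaSP implementation: the function name is hypothetical, and it assumes equal-length aligned sequences with no gaps or missing data, which real alignments may not satisfy.

```python
from itertools import combinations
from math import sqrt
from typing import Optional, Sequence


def tajimas_d(seqs: Sequence[str]) -> Optional[float]:
    """Tajima's D for aligned sequences without gaps or missing data.

    Returns None when there are no segregating sites (D is undefined).
    """
    n = len(seqs)
    if n < 4:
        # The variance term is zero for n < 4, so D cannot be computed.
        raise ValueError("at least four sequences are required")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    # S: number of alignment columns with more than one state.
    seg_sites = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    if seg_sites == 0:
        return None

    # pi: mean number of pairwise differences.
    n_pairs = n * (n - 1) / 2
    pi = sum(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in combinations(seqs, 2)
    ) / n_pairs

    # Constants from Tajima's (1989) variance derivation.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n**2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)
```

An excess of rare variants (for example, only singletons) gives a negative D, while an excess of intermediate-frequency variants gives a positive D.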

Species tree
A species tree was estimated in BEAST 2 [8] using the STRUCTURE groups (Fig. 1 in Perez et al. [1]) as operational taxonomic units (OTUs). We performed this analysis using a Yule speciation prior and the most likely model of sequence evolution obtained in jModelTest 2 [9]. Either a strict or a relaxed lognormal clock was used for each locus, selected by comparing the marginal likelihoods of runs using each model with a Path Sampling analysis with 8 steps and 500,000 generations after a 50% burn-in. The species tree was obtained from two independent runs of 100,000,000 MCMC generations each, with a 10% burn-in and trees sampled every 5000 steps. The species tree analyses were performed according to the sequence evolution and clock models recovered for each marker (Table 3). A Maximum Clade Credibility (MCC) tree was generated in TreeAnnotator [10] by combining the trees from the two runs. The XML input file, containing all the sequences used (also deposited in GenBank under accession numbers KU161695-KU162858), is available in Supplementary data 1. The MCC tree obtained is available in Newick format in Supplementary data 2.

Table 1 Results from pyrosequencing runs and filtering steps (columns: filtering step, amount of data).

Table 3 Primers and statistics for each locus.
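The per-locus choice between a strict and a relaxed lognormal clock described in the species tree section can be sketched as a comparison of marginal log-likelihoods. This is a hedged illustration: the text does not state a decision threshold, so the sketch simply assumes the model with the higher marginal likelihood is preferred (with ties going to the simpler strict clock). The function name, locus names, and numeric values are placeholders invented for the example and are not results from the study.

```python
from typing import Dict, Tuple


def select_clock(log_ml_strict: float, log_ml_relaxed: float) -> Tuple[str, float]:
    """Return the preferred clock model and 2*ln(Bayes factor) in its favour.

    Assumption: the higher marginal log-likelihood wins; a tie keeps the
    simpler strict clock. The source does not specify a threshold.
    """
    diff = log_ml_relaxed - log_ml_strict
    model = "relaxed lognormal" if diff > 0 else "strict"
    return model, 2.0 * abs(diff)


# Placeholder values for illustration only; NOT results from the study.
# Each entry is (strict clock, relaxed lognormal clock).
marginal_log_likelihoods: Dict[str, Tuple[float, float]] = {
    "locus_A": (-1250.4, -1248.1),
    "locus_B": (-980.2, -983.7),
}

choices = {
    name: select_clock(strict, relaxed)
    for name, (strict, relaxed) in marginal_log_likelihoods.items()
}

for name, (model, two_ln_bf) in choices.items():
    print(f"{name}: {model} clock (2 ln BF = {two_ln_bf:.1f})")
```

Reporting 2 ln BF alongside the chosen model makes it easier to see how strongly the data favour one clock over the other for each locus.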

cpDNA and multiloci data comparison
The plastid dataset (partial trnT-trnL and trnS-trnG data from [11]) and the combined multilocus dataset (Fig. 4a in [1]) were compared by contrasting the topology of the species tree analysis of the nuclear data (Supplementary data 2) with the topology of a BEAST phylogenetic analysis of the plastid data under a relaxed lognormal clock. The cpDNA XML file with the sequences is available in Supplementary data 3, and the cpDNA tree in Newick format is in Supplementary data 4. The divergence time estimates (Mya) between the two main lineages were also compared (Table 4) by constraining them to be monophyletic and calculating the time to the Most Recent Common Ancestor (TMRCA) in BEAST for the plastid dataset and for the combined multilocus dataset including the plastid data (Fig. 4b in [1]). Because substitution rates are not available for the nuclear markers, rates relative to the plastid marker were used, with a prior distribution that included the minimum and maximum substitution rates observed in the chloroplast sequences of angiosperms [12].