Genome assembly and phylogenomic data analyses using plastid data: Contrasting species tree estimation methods

Phylogenomics has become increasingly popular in recent years, mostly because next-generation sequencing techniques have become more affordable. It has sparked interest in multiple fields of research, including systematics, ecology, epidemiology, and even personalized medicine, agriculture, and pharmacy. Despite this trend, it is usually difficult to learn and understand how the analyses in a given study were done, how the results were obtained, and, most importantly, how to replicate the study. Here we present the data and all of the code used to perform phylogenomic inferences using plastome data, from raw data to extensive phylogenetic inference and accuracy assessment. The data presented here use plastome sequences available on GenBank (accession numbers of 94 species are available below), and the code is also available at https://github.com/deisejpg/rosids. Gonçalves et al. is the research article associated with the data analyses presented here.


Data
A dataset comprising 78 plastid protein-coding genes of 94 species of rosids is presented in Table 1.
Here we present all the code used in the analysis of this dataset [1], including the scripts used to quality filter, assemble, extract regions of interest, and perform phylogenomic analysis, in a series of tutorial-like files: I. Genome assembly; II. Phylogenetic Analysis; III. Tree space; IV. Phylogenetic Signal. Part of the data was obtained from GenBank (http://www.ncbi.nlm.nih.gov/genbank). Data for 27 species from groups of rosids for which this information was missing from the database were generated using Illumina HiSeq. A total of 657,471,631 paired-end reads with an average length of 150 bp were generated (Table 2). Although our interest was in extracting and using only the genes from the plastome, the genome assembly pipeline presented here also separates contigs from the three cellular genomic compartments, so it has potential for use in studies that target not only the plastome but also mitochondria, nuclear ribosomal DNA, and other nuclear markers. The next set of tutorial-like markdown files presents the code used for preparing alignments and for inferring phylogenies using an array of data partitioning strategies and methods of phylogenetic inference. The code used to explore the similarities and dissimilarities between topologies and to calculate phylogenetic signal is also presented.

Data preprocessing and genome assembly
Specifications table
Subject area: Biology
More specific subject area: Systematics of angiosperms, plastome evolution
Type of data: Table,

Value of the data
The present data provide details about phylogenomic analysis using a set of well-documented pipelines covering analyses from preprocessing Illumina reads to inferring and testing phylogenies using multiple methods of phylogenetic inference.
The data introduce the practical use of multispecies coalescent methods using plastid protein-coding genes and could be adjusted and used with molecular data from different molecular markers and organisms.
Accessibility to the scripts used and to the data files containing the alignments and trees will enhance the replication of the analyses presented.

For the 27 samples for which data were generated, leaf tissue was ground and total genomic DNA was isolated using the DNeasy Plant Mini Kit (Qiagen) according to the manufacturer's protocol or a modified version of [2] described in Ref. [3]. The DNA was quantified using a Qubit Fluorometric Quantitation (Thermo Fisher) instrument and was sequenced at the Genome Sequencing and Analysis Facility (GSAF) at The University of Texas at Austin. Two species were kindly provided by The Royal Botanic Gardens, Kew, DNA bank (https://www.kew.org/data/dnaBank/). Once the reads were available, the genome assembly pipeline was used to remove adaptors and PhiX, to quality-trim the reads, and to assemble the genomes.
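To make the quality-trimming step concrete, the following is a minimal Python sketch of 3'-end trimming on FASTQ-style records. It is an illustration only, not the trimming tool used in the genome assembly pipeline, which this section does not name. The function names, the Phred+33 encoding, and the thresholds (`min_quality=20`, `min_length=50`) are assumptions chosen for the example.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    The Phred+33 offset is an assumption for this sketch.
    """
    return [ord(ch) - offset for ch in quality_line]


def trim_3prime(sequence, quality_line, min_quality=20, offset=33):
    """Drop bases from the 3' end until one with score >= min_quality is found."""
    scores = phred_scores(quality_line, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_line[:end]


def filter_reads(reads, min_quality=20, min_length=50):
    """Trim each (name, sequence, quality) record and keep reads that are
    still at least min_length bases long. Thresholds are hypothetical."""
    kept = []
    for name, seq, qual in reads:
        seq, qual = trim_3prime(seq, qual, min_quality)
        if len(seq) >= min_length:
            kept.append((name, seq, qual))
    return kept


if __name__ == "__main__":
    # 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
    reads = [("read1", "ACGTAC", "IIII##"), ("read2", "ACGT", "####")]
    for record in filter_reads(reads, min_quality=20, min_length=4):
        print(record)
```

In practice, dedicated read-processing tools also handle adaptor and PhiX removal and keep read pairs synchronized, which this sketch does not attempt.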

Phylogenetic inference
After the sequences of the plastid protein-coding genes were gathered, alignment and phylogenetic inference were performed. The code used to prepare the alignments with MAFFT [4] and MACSE [5], as well as the scripts used to infer phylogenies with a Maximum Likelihood (ML) method, IQ-TREE [6], and with Multispecies Coalescent (MSC) methods, SVDquartets [7] and ASTRAL-II [8], is presented in the phylogenetic analysis pipeline.
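Concatenated analyses with a by-gene partitioning strategy need a supermatrix plus a record of where each gene starts and ends. The Python sketch below shows one way to build both from per-gene alignments held in memory. It is not taken from the authors' scripts: the function names and the toy sequences are hypothetical, and the output uses RAxML-style partition lines, a format that IQ-TREE can also read.

```python
def concatenate_alignments(gene_alignments, missing="-"):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments: dict mapping gene name -> {taxon: aligned sequence}.
    Taxa absent from a gene are padded with the `missing` character.
    Returns (supermatrix, partitions); partitions is a list of
    (gene, start, end) tuples with 1-based inclusive coordinates.
    """
    taxa = sorted({t for aln in gene_alignments.values() for t in aln})
    pieces = {t: [] for t in taxa}
    partitions = []
    position = 0
    for gene, aln in gene_alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not all the same length")
        length = lengths.pop()
        for t in taxa:
            pieces[t].append(aln.get(t, missing * length))
        partitions.append((gene, position + 1, position + length))
        position += length
    supermatrix = {t: "".join(parts) for t, parts in pieces.items()}
    return supermatrix, partitions


def partition_lines(partitions):
    """Format by-gene partitions as RAxML-style lines."""
    return [f"DNA, {gene} = {start}-{end}" for gene, start, end in partitions]


if __name__ == "__main__":
    # Toy alignments; gene and taxon names are placeholders.
    genes = {
        "rbcL": {"taxonA": "ATG", "taxonB": "ATA"},
        "matK": {"taxonA": "CCCC"},
    }
    matrix, parts = concatenate_alignments(genes)
    for taxon, seq in matrix.items():
        print(f">{taxon}\n{seq}")
    print("\n".join(partition_lines(parts)))
```

The same gene boundaries can be reused later to summarize site-wise statistics per gene, and the unconcatenated per-gene alignments remain the input for gene-tree-based MSC methods.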

Calculating distances of tree topologies and phylogenetic signal
Commented scripts show how the inferred phylogenies were further explored. First, the Robinson-Foulds and Kendall-Colijn metrics implemented in the R package TREESPACE [9] were used to visualize the distances among the inferred species trees and between species trees and gene trees. The code used is available at tree space. Lastly, five taxa from different taxonomic levels that had alternative placements were selected for a set of measurements of gene-wise and site-wise log-likelihood support of alternative topologies (phylogenetic signal).

with the option "covstats". The reference index was built with the 23 complete plastomes, each representing a scaffold. The average fold coverage was calculated by mapping reads to the reference index, and values were taken from the scaffold corresponding to each species. For species with incomplete plastomes (marked in bold), we used the closely related species indicated within the parentheses as the scaffold.
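The gene-wise and site-wise log-likelihood comparison described in this section can be illustrated with a short Python sketch. It assumes that site-wise log-likelihoods for two alternative topologies (called T1 and T2 here) have already been computed by a likelihood program, and that gene boundaries in the concatenated alignment are known. The function names and the numbers are hypothetical and are not results from the dataset.

```python
def sitewise_delta(lnl_t1, lnl_t2):
    """Per-site log-likelihood difference between two topologies (T1 minus T2).

    Positive values mean the site fits T1 better; negative values favor T2.
    """
    if len(lnl_t1) != len(lnl_t2):
        raise ValueError("site-wise log-likelihood vectors differ in length")
    return [a - b for a, b in zip(lnl_t1, lnl_t2)]


def genewise_delta(site_deltas, partitions):
    """Sum site-wise differences within each gene.

    partitions: list of (gene, start, end) with 1-based inclusive coordinates.
    """
    return {gene: sum(site_deltas[start - 1:end]) for gene, start, end in partitions}


def count_support(gene_deltas):
    """Count genes favoring T1 (positive), T2 (negative), or neither (zero)."""
    counts = {"T1": 0, "T2": 0, "tie": 0}
    for delta in gene_deltas.values():
        if delta > 0:
            counts["T1"] += 1
        elif delta < 0:
            counts["T2"] += 1
        else:
            counts["tie"] += 1
    return counts


if __name__ == "__main__":
    # Made-up site-wise log-likelihoods for a four-site, two-gene alignment.
    lnl_t1 = [-2.0, -3.0, -1.5, -4.0]
    lnl_t2 = [-2.5, -3.0, -1.0, -3.0]
    partitions = [("gene1", 1, 2), ("gene2", 3, 4)]
    deltas = sitewise_delta(lnl_t1, lnl_t2)
    per_gene = genewise_delta(deltas, partitions)
    print(per_gene)
    print(count_support(per_gene))
```

Summing per-site differences within gene boundaries is what links the site-wise and gene-wise views: a gene's support for a topology is the total of the support contributed by its sites.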