The use of chloroplast genome sequences to solve phylogenetic incongruences in Polystachya Hook. (Orchidaceae Juss.)

Background: Current evidence suggests that several unlinked genes are required for more robust estimates of species trees and divergence times. However, most phylogenetic trees for non-model organisms are based on single sequences or just a few regions obtained with traditional sequencing methods. Massively parallel sequencing, or next generation sequencing (NGS), is an alternative to traditional methods that allows access to hundreds of DNA regions. Here we use this approach to resolve the phylogenetic incongruence found in Polystachya Hook. (Orchidaceae), a genus that stands out due to several interesting aspects, including cytological (polyploid and diploid species), evolutionary (reticulate evolution) and biogeographical (species widely distributed in the tropics and high endemism in Brazil). The genus has a notoriously complicated taxonomy, with several sections that are widely used but probably not monophyletic.
Methods: We generated the complete plastid genome of 40 individuals from one clade within the genus. The method consisted of the construction of genomic libraries, hybridization to RNA probes designed from available sequences of a related species, and subsequent sequencing of the product. We also tested how well a smaller sample of the plastid genome would perform in phylogenetic inference in two ways: by duplicating a fast region and analyzing multiple copies of this dataset, and by sampling without replacement from all non-coding regions in our alignment. We further examined the phylogenetic implications of non-coding sequences that appear to have undergone hairpin inversions (reverse complemented sequences associated with small loops).
Results: We retrieved 131,214 bp, including coding and non-coding regions of the plastid genome. The phylogeny fully resolved the relationships among all species in the targeted clade with high support values. The first divergent species are represented by African accessions and the most recent ones are among Neotropical species.
Discussion: Our results indicate that using the entire plastid genome is a better option than screening highly variable markers, especially when the expected tree is likely to contain many short branches. The phylogeny inferred is consistent with the proposed origin of the genus, showing a probable origin in Africa, with later dispersal into the Neotropics, as evidenced by a clade containing all Neotropical individuals. The multiple positions of Polystachya concreta (Jacq.) Garay & Sweet in the phylogeny are explained by allotetraploidy. Polystachya estrellensis Rchb.f. can be considered a genetically distinct species from P. concreta and P. foliosa (Lindl.) Rchb.f., but the delimitation of P. concreta remains uncertain. Our study shows that NGS provides a powerful tool for inferring relationships at low taxonomic levels, even in taxonomically challenging groups with short branches and intricate morphology.


INTRODUCTION
Orchidaceae is considered the largest family of flowering plants, with over 25,000 species (Dressler, 1990; Christenhusz & Byng, 2016). The family probably dates back to the Late Cretaceous, as indicated by fossil-calibrated molecular phylogenies (Gustafsson, Verola & Antonelli, 2010; Ramírez et al., 2007, 2011). Polystachya Hook. is an orchid genus containing 240 species, with most species found in Africa (Dressler, 1993). A total of 13 species are reported from the Neotropical region (Mytnik-Ejsmont, 2011), but this number may increase when considering the endemic species from Brazil that were not accounted for in Mytnik-Ejsmont (2011) or were considered synonymous (Barros et al., 2010).
Recent studies have shown a number of peculiar cytological, evolutionary and biogeographic aspects of Polystachya. The genus has diploid and polyploid species; the latter recently formed in the Neotropics and Madagascar (Rupp et al., 2010; Russell et al., 2010b). Unlike most genera of Orchidaceae, Polystachya has a wide geographical distribution range (Fig. 1), with species that are pantropical or have a transatlantic distribution. On the other hand, the Neotropics present a high level of endemism. Brazil, for example, has 12 species, of which 10 are endemic (Barros et al., 2010). In addition, there is evidence of reticulate evolution in the genus and hybridization with independent origins (Russell et al., 2010a).
The monophyly of the genus has been reported in the latest studies (Russell et al., 2010b; Mytnik-Ejsmont, 2011), which contrasts starkly with the low level of monophyly observed in the taxonomic sections described within the genus. Those 15 sections (Kraenzlin, 1926; Summerhayes, 1942, 1947 apud Russell et al., 2011; Brenan, 1954; Cribb, 1978) are based on morphological characters and have been useful for field identification and inventories, but do not find support as natural groupings in the molecular studies currently available (Russell et al., 2010b; Mytnik-Ejsmont, 2011). According to those molecular studies, all sections are polyphyletic or paraphyletic, except sect. Isochiloides (Russell et al., 2010b).
Section Polystachya has been described as comprising 32 species worldwide and is the only section with species of pantropical distribution (Mytnik-Ejsmont, 2011). However, according to molecular analyses, some species of this section appear to be more closely related to species of other sections (Russell et al., 2010b; Mytnik-Ejsmont, 2011). These studies highlight the need for new infrageneric divisions based on robust molecular evidence. Russell et al. (2010b), using chloroplast markers, defined five different clades that could be used as the basis for a revised classification of new sections within the genus. Clade III (sensu Russell et al., 2010b) includes species from five different sections (sects. Polystachya, Eurychilae, Caulescentes, Superpositae, Polychaete) and is divided into distinct subclades of morphologically diverse plants. These species are pantropical (such as P. concreta), Neotropical (such as P. foliosa), Malagasy endemics (such as P. henrici) or African (such as P. odorata). The relationships among Clade III species remain unresolved.
New methods of DNA sequencing, together with the development of more powerful algorithms, are driving the replacement of trees generated from one or a few genes by trees constructed from hundreds of them (Edwards, 2009). The improvement of massively parallel sequencing techniques, or next generation sequencing (NGS), has increased the amount of data available for biological research, whether or not fully annotated reference genomes of the species under study have been sequenced (Bräutigam & Gowik, 2010). However, despite its obvious potential, NGS technology is underused in most studies of plant systematics (Cronn et al., 2012; Carstens et al., 2013; Eaton & Ree, 2013), probably as a result of a prevailing focus on non-model organisms (which require de novo genomic sequencing and its inherent challenges), the need to sample many individuals per species and the absence of well-established protocols (McCormack et al., 2013). One method that increases the efficiency of NGS for non-model species compared to other genomic partitioning strategies is sequence capture (or hybridization-based enrichment), which is based on the prior selection of loci of interest (Lemmon & Lemmon, 2013). The main benefit of this technique is that the number of specific sequences obtained can be very high, which makes it advantageous compared to PCR-based approaches if the objective is to sequence several individuals and multiple loci. Furthermore, sequence capture, when combined with NGS platforms such as Illumina, also reduces the costs of the process (Lemmon, Emme & Lemmon, 2012). Sequence capture methods have successfully been used in other plant genera to generate large amounts of useful data for phylogenetic inference (Kamneva et al., 2017; Sousa et al., 2014; Stephens et al., 2015; Weitemier et al., 2014).
The necessity of a molecular phylogenetic framework for (and a morphological taxonomic revision of) Polystachya is clear. It requires a well-resolved phylogenetic hypothesis in order to clarify the relationships between species and also to redefine new infrageneric sections. In this paper, we explore the use of nearly complete plastid genomes (note that we use chloroplast and plastid interchangeably when referring to these genomes), obtained by sequence capture and massively parallel sequencing, to solve the phylogenetic inconsistencies found within Clade III of Polystachya (sensu Russell et al., 2010b). We also explore whether sequencing the entire chloroplast genome using NGS was worthwhile, compared to PCR and Sanger sequencing of a few fast-evolving loci. We hope that the results generated here can be extended to the rest of the genus and thus result in new interpretations of the evolutionary and biogeographic history of the group.

MATERIALS AND METHODS

Sampling and DNA extraction
We sampled 19 species and 48 individuals (Table 1; Fig. 1), of which 15 were collected in different locations in continental Brazil and three were collected on Trindade Island in the South Atlantic. The DNA of the 15 Brazilian samples was extracted from 10 mg of silica gel-dried tissue using the DNeasy Plant Mini Kit (Qiagen, Valencia, CA, USA). DNA samples for the remaining 33 individuals were provided by the University of Vienna and the DNA bank of the Royal Botanic Gardens, Kew. To this sample of individuals, selected because they cluster in Clade III (as defined in Russell et al., 2010b), we added Polystachya tessellata Lindl., a supposed synonym of P. concreta. We also included multiple samples of P. concreta, because previous studies have reported a lack of monophyly for this species and several synonymous species. Permits to collect were provided by the Ministério do Meio Ambiente (MMA), Instituto Chico Mendes de Conservação da Biodiversidade (ICMBio) and Sistema de Autorização e Informação em Biodiversidade (SISBIO), with registration number 29478-1.
Polystachya bicolor Rolfe and P. melanantha Schltr. were chosen as possibly more distant relatives of the focal species, in order to provide additional context for the phylogenetic inference. All studies conducted to date resolve P. melanantha as an outgroup with respect to the Clade III species. Polystachya bicolor has already been treated as a synonym of P. rosea (Mytnik-Ejsmont, 2011), with an uncertain position in the phylogenetic trees generated thus far, being sometimes closely related to P. concreta and other associated species (Mytnik-Ejsmont, 2011) and sometimes closely related to species from other clades (Russell et al., 2010b; Mytnik-Ejsmont, 2011).

Probe design for DNA capture
We used the complete chloroplast genome of Phalaenopsis aphrodite subsp. formosana (NC_007499.1) (Chang et al., 2006) as the reference for the design of capture probes, because there is no completely sequenced chloroplast genome of a Polystachya. According to molecular analyses, Polystachya and Phalaenopsis belong to different sub-tribes but are closely related within the Vandeae tribe (Van Den Berg et al., 2005; Górniak, Paun & Chase, 2010; Freudenstein & Chase, 2015). The use of a quite distantly related species is made possible by the DNA capture kit (MYcroarray, Ann Arbor, MI, USA), which tolerates differences larger than 5% between probe sequences and target sequences (Li et al., 2013). The complete sequence of the chloroplast genome of Phalaenopsis aphrodite subsp. formosana was divided into blocks of 360 bp. Every second block was used as the template for probe design, with the probes consisting of 120 bp sequences, three to each block without overlap. Given that the genomic DNA fragments were between 300 and 400 bp long (see below) and that a fragment can attach to a probe through complementary sequence anywhere along its length, fragments can extend up to 200-300 bp into the flanking regions beyond the probes, or cover the probes with little extension into the flanking sequence, or fall somewhere in between. In this way, captured fragments produce a series of tiled, overlapping sequences for high quality genomic assembly. Additionally, fragments with a base repeated more than seven times in a row were avoided to reduce the capture of repetitive sequences present in many places in the genome. Finally, the reference sequences in the blocks used for probe design totaled 63,720 bp; they were brought together into a single FASTA file and sent to MYcroarray (Ann Arbor, MI, USA) to produce the probes.
Table 1. Notes: Species analyzed; location of the collection and voucher; DNA concentration and purity before and after genomic library assembly.
* Species excluded due to low quality sequencing.
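
The tiling scheme described in this section can be illustrated with a minimal Python sketch. It is not the procedure actually used to produce the probes: the function name `design_probes`, the choice to start with the first block, and the decision to apply the homopolymer filter per probe (rather than per block) are all assumptions made for illustration.

```python
import re

def design_probes(reference, block_size=360, probe_size=120, max_run=7):
    """Tile non-overlapping probes over every second block of a reference.

    Hypothetical helper: starting at the first block and filtering each
    probe individually are assumptions, not details given in the text.
    Returns a list of (start_position, probe_sequence) tuples.
    """
    # Matches any base repeated more than `max_run` times in a row.
    homopolymer = re.compile(r"(.)\1{%d,}" % max_run)
    probes = []
    # Step by two block lengths so that every second block is a template.
    last_start = len(reference) - block_size
    for block_start in range(0, last_start + 1, 2 * block_size):
        block = reference[block_start:block_start + block_size]
        # Three probes per block when block_size / probe_size == 3.
        for offset in range(0, block_size, probe_size):
            probe = block[offset:offset + probe_size]
            if homopolymer.search(probe.upper()):
                continue  # skip probes containing long single-base repeats
            probes.append((block_start + offset, probe))
    return probes
```

Under this scheme each retained block contributes 360 bp of probe sequence, so the 63,720 bp reported above would correspond to 531 probes of 120 bp.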

Sonication and genomic library preparation
Extracted DNA was randomly fragmented by sonication using a Covaris S220 instrument (Covaris, Woburn, MA, USA), in order to evenly cover the full genome. Adapters were incorporated into the fragmented DNA using the NEXTflex™ DNA Sequencing Kit and NEXTflex™ Barcodes kit (BIOO Scientific, Austin, TX, USA). Uniquely indexed adapters were used for each sample. We selected fragments between 300 and 400 bp using the Agencourt AMPure XP magnetic beads kit (Beckman Coulter, Brea, CA, USA). The genomic library was amplified with the following program: 98 °C for 2 min; 14 cycles (98 °C for 30 s; 65 °C for 30 s; 72 °C for 60 s); 72 °C for 4 min. The products were purified using a QIAquick PCR Purification Kit (Qiagen, Valencia, CA, USA). The genomic DNA concentrations before and after sonication and the amplification of the library were measured in a NanoDrop 2000c instrument (Thermo Fisher Scientific, Waltham, MA, USA) (Table 1) to ensure that the final concentration exceeded 400 ng/mL.

Enrichment and sequencing
Before the enrichment, equimolar amounts (400 ng/mL) of each amplified library were pooled into six reactions, each one containing eight indexed samples. The enrichment method involves the selection of genomic regions and capture of DNA samples before sequencing (Mamanova et al., 2010). The enrichment was performed with the MYBaits target enrichment system (MYcroarray, Ann Arbor, MI, USA), following the manufacturer's instructions. The probes were recovered using Dynabeads® MyOne™ Streptavidin C1 (Invitrogen Dynal AS, Oslo, Norway).
To increase DNA concentration, 14 cycles of PCR were performed for each hybridization reaction using Herculase II Fusion DNA Polymerase (Agilent, Waldbronn, Germany) and the following program: 98 °C for 30 s; 14 cycles (98 °C for 20 s; 60 °C for 30 s; 72 °C for 60 s); 72 °C for 5 min. Sequencing was performed on the Illumina MiSeq platform (San Diego, CA, USA) by the Genomics Core Facility (University of Gothenburg, Sweden).

Sequence editing
Illumina reads were processed using the program CLC Assembly Cell (CLC Bio, Aarhus, Denmark). First, the Illumina adapter sequences were removed and low-quality reads were excluded. Reads were then mapped against the reference sequence used for probe design (Phalaenopsis aphrodite). Consensus sequences generated for each sample were converted into FASTA format with the SAMTools software (Li et al., 2009), using the mpileup tool with the reference sequence option, which allows indels to be included in the consensus sequences. These sequences were used as new individual reference sequences for each sample in a second round of mapping. Final consensus sequences were generated using mpileup without the reference sequence option, to avoid erroneous base calling in low read-depth portions of the read alignment. Sequence alignment was performed using the auto strategy in MAFFT version 7 (Katoh & Standley, 2013) and later manually refined using Geneious Pro (Biomatters Ltd., Auckland, New Zealand). In the last step we aligned the sequenced samples with the Phalaenopsis aphrodite subsp. formosana (NC_007499.1) chloroplast genome to annotate the sequenced regions.

Hairpin inversions
Micro-structural features of chloroplast non-coding sequences can have a profound influence on the multiple sequence alignment, and hence also the phylogeny. Hairpins (short stem-loop structures in single stranded DNA or RNA), for example, can create sites that allow small inversions to occur at a high enough frequency that homoplasious inversions can be observed among sequences from closely related species (Kelchner & Wendel, 1996). Sometimes the inverted sequence is not so short and can disrupt phylogenetic analysis, leading to strongly supported but spurious groupings (Joly et al., 2010). Non-coding sequences, such as group II chloroplast introns, contain many such stem-loop structures (Kelchner, 2002).
We examined the non-coding sequences in our alignment for inverted (reverse complemented) sequences and tested for their effect on the phylogenetic inference. This was done by excluding all but one character of the inversion (to down-weight the inversion to a single event) and rerunning the analysis. The selected character to represent the inversion was arbitrarily chosen. This was done to avoid recoding the inversions as indel characters and creating a new, small partition (with only eight characters) that would have required many additional parameters, in comparison to our approach.
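
The two operations described above (recognizing a reverse-complemented loop and reducing it to a single character) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the inversion boundaries are assumed to be supplied by hand as half-open column ranges, and the retained character is taken to be the first column of each inversion, whereas the text only says that the choice was arbitrary.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def is_inverted(loop_a, loop_b):
    """True when two loop sequences differ and one is the reverse
    complement of the other (a candidate hairpin inversion)."""
    return loop_a != loop_b and reverse_complement(loop_a) == loop_b

def downweight_inversions(alignment, inversions):
    """Keep a single column per inverted region and drop the rest.

    alignment:  dict mapping sample name -> aligned sequence
    inversions: list of (start, end) half-open column ranges (assumed input)
    """
    drop = set()
    for start, end in inversions:
        # Keep column `start` (an arbitrary choice), drop the others.
        drop.update(range(start + 1, end))
    return {
        name: "".join(base for i, base in enumerate(seq) if i not in drop)
        for name, seq in alignment.items()
    }
```

Excluding columns in this way leaves the remaining alignment in a single partition, which is the motivation given above for not recoding the inversions as separate indel characters.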

Faster region assessment
We used the sequences from one GenBank sample (FS1045) of Polystachya cultriformis and examined two published markers, psbD-trnT (870 bp in length when aligned to our samples) and matK (1,521 bp in length when aligned to our samples). We compared these sequences pairwise to one of our samples, P. estrelensis8, to check which was the faster-evolving region. We then ran a MrBayes v.3.2.4 (Ronquist et al., 2012) analysis on the faster region, as a representative of a fast part of the chloroplast genome (fast cpDNA hereafter), with one copy of our dataset trimmed to this region alone plus the GenBank sequence.
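
The pairwise comparison used to pick the faster region can be sketched as below. This is a minimal Python illustration, not the software used in the study; in particular, ignoring columns where either sequence has a gap is an assumption, since the text does not say how gaps were treated.

```python
def pairwise_identity(seq_a, seq_b, gap="-"):
    """Fraction of identical bases between two aligned sequences.

    Columns with a gap in either sequence are skipped (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return matches / compared

def fastest_region(identities):
    """Return the region name with the lowest pairwise identity,
    taken here as the faster-evolving region."""
    return min(identities, key=identities.get)
```

The region with the lower identity between the same pair of samples is the one that has accumulated more differences, and is therefore treated as the faster one.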
We then ran successive analyses, using additional copies of the dataset interleaved in the same file, to explore the increase in support with an increase of characters evolving under the same model. This was to discover how much of the fast cpDNA data would be needed to achieve high support on most nodes in the phylogeny (i.e., among species but not necessarily within species). An important assumption we made at this stage was that the single fast region would contain mutations spread across most or all branches of the phylogeny. If this assumption was true, then a single region could carry changes representative of the entire phylogenetic history that we were exploring. The assumption is essentially one of i.i.d. (independently and identically distributed) sites, in that the sites would be representative of many unsampled sites and that double mutations would be rare, coupled with a sufficient dataset size to contain enough changes overall to reflect the history. Although the i.i.d. assumption is rarely true across sites, model-based analysis methods can cope with this, because the frequency of site patterns can be modelled by an i.i.d. process (Steel & Penny, 2000). So we are in effect mainly testing whether the original dataset size was sufficient to carry changes reflective of the entire phylogeny under investigation.
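
A small Python sketch may help make the copying experiment concrete. It assumes that adding copies of the dataset is equivalent to concatenating the same columns several times; the function names are hypothetical and the study itself used interleaved copies in a MrBayes input file.

```python
def concatenate_copies(alignment, copies):
    """Repeat every aligned sequence `copies` times end to end."""
    return {name: seq * copies for name, seq in alignment.items()}

def site_patterns(alignment):
    """Set of distinct alignment columns (site patterns), with samples
    taken in sorted name order so that patterns are comparable."""
    names = sorted(alignment)
    return set(zip(*(alignment[name] for name in names)))
```

Copying multiplies the number of characters but adds no new site patterns, so a branch on which no character changed in the original region remains without evidence however many copies are analyzed.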

Random sample from all non-coding regions
A single region copied many times proved ineffective in recovering most nodes with support (see Results). We therefore explored using random samples of characters without replacement from among all of the non-coding regions in our dataset to test how much data from faster regions would yield supported trees across most nodes. We expected this approach to be less subject to the limitations caused by the stochastic nature of mutations coupled with the limited size of any one region. By sampling across many regions (over 58 kb in this case), even those few characters that have changed on short branches might be sampled occasionally. In contrast, a single region, by chance, may simply not contain any characters changing on a specific short branch.
We sampled without replacement 4%, 8% and 16% of the non-coding data using delete-fraction jackknifing in the seqboot program v3.69 (from http://evolution.genetics.washington.edu/phylip.html), excluding the poorly aligned parts and down-weighting the inverted loops (by excluding all but one character of each loop), in 20 replicates each. The approximate average (and range) of posterior probabilities (PP) per clade was taken across the 20 replicates to get an indication of the likely support for selected clades that a non-coding dataset of these sizes would generate. These values were plotted on the whole alignment analysis to compare to the support received when using the whole dataset. Given that the largest dataset we used here (hereafter the "16% dataset," or ∼9.2 kb) failed to recover support for all nodes found in the whole genome analysis (see Results), we did not end up analyzing the smaller replicates.
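
The character sampling can be illustrated with the following Python sketch. It is not the seqboot implementation: it simply draws a fixed fraction of alignment columns without replacement, and the function names, the seeding scheme and the rounding of the number of retained columns are assumptions.

```python
import random

def sample_columns(alignment, fraction, rng):
    """Draw a fraction of alignment columns without replacement.

    Returns the sub-alignment and the sorted list of kept column indices.
    """
    n_sites = len(next(iter(alignment.values())))
    n_keep = round(n_sites * fraction)
    # random.sample never returns the same index twice.
    keep = sorted(rng.sample(range(n_sites), n_keep))
    sub = {
        name: "".join(seq[i] for i in keep)
        for name, seq in alignment.items()
    }
    return sub, keep

def make_replicates(alignment, fraction, n_replicates=20, seed=0):
    """Build independent jackknife-style replicates of the alignment."""
    rng = random.Random(seed)
    return [
        sample_columns(alignment, fraction, rng)[0]
        for _ in range(n_replicates)
    ]
```

Because the columns are drawn from the whole non-coding alignment, each replicate can include characters from many regions, unlike a single contiguous marker of similar length.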

Phylogenetic analysis
MrBayes v.3.2.4 analyses were used for phylogenetic inference. These analyses were run for five million generations (two million for the random sample replicates), using a mixed substitution model (plus gamma and invariant sites) to account for among-site rate variation. Priors on branch lengths were set to unconstrained:exponential(100) to minimize the chance of inferring incorrectly long branches (Marshall, 2010); all other settings were left at their defaults. The paired runs were checked for convergence and high effective sample sizes in the MrBayes output and Tracer v.1.6 (Rambaut et al., 2014), respectively. Burn-in generations were removed by discarding 10% of the samples of parameters and trees, while summarizing in TreeAnnotator v.1.8 (Rambaut & Drummond, 2010) to ascertain clade PP. Trees were rooted using the Phalaenopsis sequence. Analyses using the character partitions were also done, returning nearly identical results to the analysis described above, so they are not reported further.
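
The way clade PP follows from the sampled trees after burn-in removal can be sketched in Python as below. This is a simplified illustration rather than the TreeAnnotator algorithm: each sampled tree is assumed to be already reduced to the set of clades it contains, and the function name is hypothetical.

```python
from collections import Counter

def clade_posteriors(tree_samples, burnin_fraction=0.10):
    """Estimate clade posterior probabilities from sampled trees.

    tree_samples: list of trees in sampling order, each represented as a
                  set of clades, where a clade is a frozenset of sample
                  names (an assumed, simplified representation)
    """
    # Discard the first 10% of samples as burn-in (default).
    n_burn = int(len(tree_samples) * burnin_fraction)
    kept = tree_samples[n_burn:]
    counts = Counter(clade for tree in kept for clade in tree)
    # PP of a clade = share of retained trees that contain it.
    return {clade: n / len(kept) for clade, n in counts.items()}
```

Under this definition, a clade with PP above 0.95 is one that appears in more than 95% of the retained trees.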

RESULTS
Our NGS approach allowed the capture of coding and non-coding regions throughout the chloroplast genome. We recovered approximately 132 kb, after the exclusion of gaps, representing 116 genes, seven pseudogenes, as well as regions with intergenic sequences and introns with secondary structure. Compared to the reference annotation, seven genes contained frameshifts that are usually associated with pseudogenization and corresponded to previously reported pseudogenized genes in orchids (Luo et al., 2014).
We excluded eight of the 48 samples due to the low quality of the sequencing results (Table 1). These eight samples showed lower DNA concentrations after genomic library construction, which may explain the low quality of the sequencing. The remaining 40 samples were submitted to the EMBL/ENA database under accession numbers ERS2203551-ERS2203590. The coding regions have 48,308 polymorphic sites (38.4%). Introns with secondary structure and regions with intergenic sequences have 21,264 (16.9%) and 56,226 (44.7%) polymorphic sites, respectively. The alignment of the concatenated data showed an unbalanced (but fairly typical) mean nucleotide composition of A = 29.9%, C = 19.9%, G = 19.4% and T = 30.8%.

Hairpin inversions
In the non-coding part of the alignment, we found evidence for eight putative small inversions (Table 2), based on the presence of inverted repeated motifs that could form stems at least 4 bp long. Stems of this length or longer are part of models of group II structures (Michel, Kazuhiko & Haruo, 1989; Toor, Hausner & Zimmerly, 2001; Kelchner, 2002) and are consistent with sequence patterns observed by one of us in the rpL16 intron sequences in other taxa (Pfeil et al., 2002).

Faster region assessment
The pairwise identity between Polystachya concreta5 and P. concreta8 (whose common ancestor is relatively old and near the crown of Polystachya) for psbD-trnT was 98.7%. The pairwise identity for these samples for matK was 99.2%. The psbD-trnT region was therefore used as the representative fast cpDNA region. The analysis with a single copy of this dataset (870 bp) yielded an MCC tree (Fig. 2A) with only five nodes with high support (>0.95 PP). Increasing the number of copies did not result in much improvement. The analysis using 16 copies of psbD-trnT (∼14 kb) produced an MCC tree (Fig. 2B) containing only eight highly supported nodes.

Random sample from all non-coding regions
The pairwise identity between Polystachya concreta5 and P. concreta8 was 98.7% across the 58,236 bp of non-coding sequence contained in our alignment. This compares to 98.8% identity between the same samples across the coding regions in our alignment. The replicate datasets that sampled 16% of the original non-coding alignment (excluding poorly aligned parts and down-weighting the inverted loops) failed to return all nodes/clades found in the whole genome analysis. Of 13 selected nodes found in the tree from the whole genome (four subtended by relatively long branches, four by medium length branches, and five by short branches), only five were found with high support across most or all replicates (i.e., at least 16 of 20 replicates had ≥0.95 PP). Three of the selected nodes instead had five or fewer replicates with high support (≥0.95 PP), but only one or no replicates that contained highly supported contradictory nodes (thus the support for the expected node was ≥0.05). Finally, five of the nodes had generally poor support among replicates (i.e., five or fewer replicates had ≥0.95 PP along with six or more replicates with ≤0.05 PP for these nodes). The 16% datasets (∼9.2 kb) recovered from seven to 21 highly supported nodes among replicates (mean = 14.8), with more nodes recovered in 19 of 20 replicates than was the case with the larger repeated psbD-trnT dataset (∼14 kb and eight supported nodes). This character sampling strategy was probably more reflective of the underlying support for various nodes than using repeated copies of a single small dataset.
The mutually exclusive foliosa1/concreta2 versus foliosa1/foliosa2 clades (see below) were also examined in the 16% datasets. In the first case (foliosa1/concreta2), just four replicates contained this clade with high or moderate support (≥0.90 PP). The contradictory second grouping (foliosa1/foliosa2) was found with a similar level of support (≥0.90 PP) in only two replicates. The fact that both groupings could be recovered, with high support, in at least some replicates suggests that the original dataset contains the signal of both clades. A NeighborNet analysis (Fig. 3B) confirmed that a mixture of patterns exists in the original dataset involving foliosa1, foliosa2, concreta1 and concreta2.

Phylogenetic analyses
Analyses with and without the inverted loops (the latter by down-weighting to a single character) returned almost identical trees. Only the results of the latter analysis are presented in this section. The tree we recovered resolved the phylogenetic relationships among the groups of the large clade selected for this study, with high support values on almost every node (Fig. 3). The tree was characterized by a large clade with relatively short branches containing only sequences from the Neotropics, with a grade of a few small clades and single sequences containing the remaining sequences (Fig. 3). The large clade contained 21 sequences from Brazil, Dominica and Venezuela, whereas the grade included 19 sequences from tropical central and eastern Africa, as well as Madagascar and the nearby islands (Fig. 3). The grade recovered includes a few geographically identifiable clades (Fig. 3). One of these, attaching fairly deeply within the crown, consists of four Malagasy sequences (Polystachya humbertii1, P. humbertii2, P. oreocharis and P. tsinjoarivensis2) that are sister to a Kenyan sequence (P. eurychila). Another clade comprises a Kenyan sequence (P. golungensis) and one from Reunion (P. concreta8). A third clade contains a pair of central African sequences, one from Cameroon (P. odorata2) and one from Nigeria (P. odorata1). A fourth clade contains sequences from central Africa (P. concreta5 from Cameroon), Madagascar (P. tesselata1), the Comoros (P. concreta9) and two sequences without certain provenance. Finally, a fifth pair of sequences came from samples collected in Mauritius (P. concreta7) and Madagascar (P. tesselata2). Lineages containing only a single sequence in this grade included samples from Kenya (P. melanantha and P. steudneri) and Cameroon (P. dolichophylla).
Sequences from the widely sampled and widely distributed P. concreta did not form a monophyletic group and occurred on different branches of the tree, separated by several well supported nodes (Fig. 3). Similarly, the two P. tessellata sequences from Madagascar did not form a clade. Polystachya estrellensis sequences form a clade with P. concreta sequences collected in Brazil. Although the sequences of P. estrellensis are thus paraphyletic, whether the taxon itself is paraphyletic cannot be established for certain here. The identification of this P. concreta sample could be wrong, given that the identification of these species is confused in Brazil and sometimes they are considered synonymous (see also Discussion).

With versus without loops
The down-weighting of the inversions we identified (by excluding all but one character per inversion) resulted in a similar, but not identical, phylogenetic inference. The differences among the maximum clade credibility (MCC) trees involved P. foliosa1, P. foliosa2, P. concreta1, P. concreta2, P. concreta3 and P. with clade PP listed after each node. In contrast, the inference resulting from down-weighted inversions returned this tree (Fig. 3 main panel). There are several supported differences between these trees, with at least one corresponding to the way the inverted loops are weighted. When the entire loops are analyzed, P. foliosa1 and P. foliosa2 are supported as sisters, with these two sequences appearing to share two loop inversions (if this topology is correct; Fig. 4A). However, down-weighting the inversions produces a tree consistent instead with two independent inversions (Fig. 4 main panel).

DISCUSSION
Chloroplast genome sequence provides a robust phylogeny
In this work we used the nearly complete chloroplast sequences of 40 Polystachya samples to infer a robust plastid phylogeny. The dataset significantly increased the phylogenetic resolution within the genus. Thus, our results suggest that increasing the number of molecular markers has the potential not only to resolve the relationships among species, but also to identify new Polystachya clades and define new sections. The delimitation of new sections will, however, depend upon the inclusion of more taxa than were sampled in this study, in other words a higher coverage of the genus. Below we highlight some of the clades recovered, their morphological and/or geographical characterization, and a comparison with previous studies.

Polystachya bicolor/rosea position contradicts Russell's Clade III
Of the two species selected as possible distant relatives to provide more context in the phylogenetic inference, one, P. bicolor (= P. rosea), appears in a clade together with samples of P. concreta (from Cameroon and from the Comoros), P. tessellata (= P. concreta) and P. modesta. Not surprisingly, the clade that includes P. bicolor/rosea is deeply nested within the ingroup, thus contradicting the monophyletic Clade III presented by Russell et al. (2010b).
In prior studies, Polystachya bicolor/rosea has an uncertain position in the phylogenetic trees. In an analysis using plastid markers and Bayesian inference, this species appears in a large polytomy with P. concreta and other related species (Mytnik-Ejsmont, 2011), or related to species of other clades (Russell et al., 2010b) depending on the marker used. A phylogeny using nuclear data (ITS sequences) highlighted the lack of monophyly of this species (Mytnik-Ejsmont, 2011), which may be connected to the difficulty in identifying it. Polystachya bicolor/rosea is often mistaken for P. concreta, since differentiation between these is made by subtle differences in the shapes of leaves, and the size and color of the flowers. Unlike P. concreta, which has a pantropical distribution, P. bicolor/rosea is restricted to Madagascar, Comoros and the Seychelles (Mytnik-Ejsmont, 2011).

Brazilian sequences form a clade
The monophyletic nature of the group formed by the Brazilian accessions, contrasting with the paraphyletic group made up of African accessions, is consistent with the hypothesis that Africa is the center of origin with a subsequent (i.e., more recent) dispersal into the Neotropics (Russell et al., 2010a, 2010b).

Hybrid origins of some taxa suggested
The hybrid origin of P. concreta is a possible explanation for this species being found in different positions in the tree (Russell et al., 2010b). P. concreta individuals that have dispersed out of Africa are tetraploid, whereas plants found in continental Africa can be diploid or tetraploid. The sister taxa of African P. concreta are diploid (Russell et al., 2010b), indicating that tetraploidy is a derived state in P. concreta. Allotetraploidy in P. concreta has been confirmed by analysis of low copy nuclear genes (Russell et al., 2010a).
Interspecific hybridization events, as in P. concreta, are considered a source of chloroplast genome exchange via introgression. Chloroplast genome exchange among species is sometimes suggested as an explanation for the inconsistencies between phylogenetic trees based on nuclear and plastid markers in, e.g., Populus (Salicaceae) (Smith & Sytsma, 1990; Tsitrone, Kirkpatrick & Levin, 2003), Nothofagus (Nothofagaceae) and Crassulaceae (Mort et al., 2002; Acosta & Premoli, 2010). In Nothofagus, chloroplast capture results in the association of chloroplast genomes with geographic locations, rather than taxonomic relationships (Acosta & Premoli, 2010). Relationships based on geographic location could be explored as a possible explanation for the proximity of P. concreta (accessions from Brazil) to P. estrellensis (also from Brazil) rather than to non-Brazilian accessions of P. concreta. Testing this explanation would require a study of nuclear markers in these taxa.

Neotropical species
Relationships in the group that includes P. concreta, P. foliosa, P. estrellensis and other species are not well resolved due to the low sequence divergence levels between species found in both plastid and nuclear genes (Russell et al., 2010a, 2010b, 2011; Mytnik-Ejsmont, 2011). Generally, the morphological variation observed in this group is identified as P. concreta. Although P. estrellensis is considered a valid species on the official plant list of Brazil (Barros et al., 2010), there is no consensus on synonymy with P. concreta. This can be seen in herbarium identifications, which sometimes treat them as two distinct species and sometimes as the same species. The same occurs with P. foliosa, a name which would only be correctly applied to plants from the Amazon basin, the Guyana Shield and the West Indies (Peraza-Flores, Fernández-Concha & Romero-González, 2011). This circumscription is not accepted by Mytnik-Ejsmont (2011), who considers P. estrellensis and P. foliosa to be synonymous.
Genetic dissimilarity between African and Neotropical tetraploids was reported by Russell et al. (2010a) and Russell et al. (2011), but the delimitation of P. estrellensis, P. concreta and P. foliosa remained uncertain. According to our results, from a molecular perspective, P. estrellensis should be considered distinct from P. concreta. Moreover, our results do not corroborate the placement in synonymy of P. estrellensis and P. foliosa as proposed by Mytnik-Ejsmont (2011). In our tree P. foliosa forms a highly supported group with some P. concreta sequences (from samples collected in Brazil). Finally, although our results indicate a possible separation of Brazilian and African P. concreta, the delimitation of this species remains uncertain, considering that there is no generic taxonomic revision that has rigorously analyzed the morphological variation in this species. Moreover, considering the reticulate evolution reported by Russell et al. (2010a), further investigation with nuclear markers would be necessary.
Taken together, our analysis suggests that P. estrellensis can be considered a distinct species from P. concreta and P. foliosa, and that Brazilian and African P. concreta should probably be treated as different species. Evidence of hybridization influencing the evolution of P. concreta (Russell et al., 2010a, 2010b) highlights how important it will be to also consider bi-parentally inherited nuclear DNA when inferring phylogenetic relationships between this species and other species of the genus. The placement in synonymy of P. estrellensis and P. foliosa proposed by Mytnik-Ejsmont (2011) was not confirmed by this study. In our results, P. foliosa forms a highly supported clade including Brazilian samples of P. concreta.

Implications for data requirements
The entire chloroplast is more useful than a fast subset
By using a relatively large number of chloroplast sequences we were able to resolve the polytomy involving the Neotropical species. Although this dataset is promising for formulating more robust phylogenetic hypotheses, complete chloroplast genome sequencing may be costly for systematic projects that consider genera with many species (Särkinen & George, 2013), such as Polystachya, which has about 250 species. This was the main motivation for testing how well a smaller sample of the chloroplast genome would perform in phylogenetic inference. This was done in two ways: by duplicating a fast region and analyzing multiple copies of this dataset, and by sampling without replacement from all non-coding regions in our alignment.
We found that sampling without replacement up to ∼9 kb of non-coding sequence (16% of our alignment) was not sufficient to return a robust inference across all nodes. This was in contrast to the analysis of the entire chloroplast and showed that in the case of these samples of Polystachya, more data were needed to resolve their relationships. The cost of primers, amplification and Sanger sequencing of only three or four regions begins to exceed that of gene capture of the entire chloroplast. It is therefore more cost effective and produces a more robust result to undertake the collection of the entire chloroplast genome. That said, our 16% sample did resolve some nodes with high support, and other nodes obtained moderate to high support from a few of the replicates. This suggests that these data are on the way to resolving most nodes, but a gradual increase in resolving power occurs as characters are added.
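The subsampling step described above can be illustrated with a minimal sketch. It assumes the alignment is held as a dictionary of equal-length strings; the function names `sample_columns` and `variable_sites` and the toy sequences are hypothetical and are not taken from the pipeline used in this study.

```python
import random


def sample_columns(alignment, n_sites, seed=None):
    """Return a sub-alignment of n_sites columns drawn without replacement.

    alignment: dict mapping sample name -> aligned sequence (equal lengths).
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    total = lengths.pop()
    if n_sites > total:
        raise ValueError("cannot draw more columns than the alignment holds")
    rng = random.Random(seed)
    # random.sample never repeats an index, so each column is used at most once.
    columns = sorted(rng.sample(range(total), n_sites))
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def variable_sites(alignment):
    """Count columns that contain more than one unambiguous base."""
    count = 0
    for column in zip(*alignment.values()):
        bases = {base.upper() for base in column} & set("ACGT")
        if len(bases) > 1:
            count += 1
    return count


# Toy alignment (hypothetical data, for illustration only).
toy = {
    "sample_1": "ACGTACGTAC",
    "sample_2": "ACGTTCGTAC",
    "sample_3": "ACGAACGTAC",
}
subset = sample_columns(toy, n_sites=4, seed=1)
print(variable_sites(toy), variable_sites(subset))
```

Repeating the draw with different seeds gives replicate subsamples, each of which can then be analyzed separately and its node support compared with that of the full alignment.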
Duplicating a single fast region even 16 times, in this case psbD-trnT copies totaling ∼14 kb, failed to achieve a robustly resolved phylogeny. The results for the psbD-trnT duplicated analysis were poorer even than those obtained by sampling fewer but more representative characters across the non-coding region (above). It appears that a small sample size (only 870 bp of independent sequence sites) is a serious source of stochastic error in this case. Sampling one versus 16 copies of the same dataset only slightly increased the number of resolved nodes (still falling short of the number of nodes usually resolved with support by the smaller 16% sample), confirming the limitations of the original dataset. It is likely that the original dataset simply did not contain sites that changed on most branches of the phylogeny during the span of history that we investigated.
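The arithmetic behind the duplicated dataset is 16 × 870 bp = 13,920 bp, or roughly 14 kb. The sketch below illustrates, under simplifying assumptions, why such copies add length but no independent information: concatenating a region with itself multiplies the count of every site pattern without creating any new pattern. The helper names and the toy region are hypothetical and do not reproduce the analysis performed here.

```python
from collections import Counter


def site_patterns(alignment):
    """Count how often each alignment column (site pattern) occurs."""
    return Counter(zip(*alignment.values()))


def duplicate(alignment, copies):
    """Concatenate each sequence with itself to build a multi-copy dataset."""
    return {name: seq * copies for name, seq in alignment.items()}


# Toy stand-in for a single fast region (hypothetical data).
region = {
    "sample_1": "ACGTAC",
    "sample_2": "ACGTTC",
    "sample_3": "ACGAAC",
}

single = site_patterns(region)
repeated = site_patterns(duplicate(region, 16))

# The set of distinct patterns is unchanged; only their counts are scaled.
print(len(single), len(repeated))
print(all(repeated[p] == 16 * single[p] for p in single))
```

A branch on which no site changed in the original region therefore remains unsupported by any site in the duplicated dataset, however many copies are analyzed.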
Numerous previous studies have also examined which regions of the plastid genome are typically evolving faster than others (Small et al., 1998; Shaw et al., 2005, 2007, 2014). Prior to NGS methods, the aim was to identify the "best" regions, when sequencing only a limited number could be afforded in most projects. However, given current technology, we should shift our focus to whether a few of the "best" regions are cost effective compared to using the entire genome, as the latter has become affordable for even small phylogenetic projects.

Homoplastic hairpin inversions affect phylogenetic analysis
One issue raised here that is rarely taken into account in analyses of whole chloroplasts is that sequence patterns at the small scale, namely hairpin inversions of loops, can still have an effect on phylogenetic inference, despite using very large data sets. Our results indicate that at least some of the differences between the trees inferred using entire loops versus down-weighted loops were driven by these hairpin loop inversions. This kind of phylogenetic effect has been observed in other cases, although with smaller data sets (Kim & Lee, 2005;Joly et al., 2010). If loops invert in a single molecular event (as is currently believed: Kelchner & Wendel, 1996;Kim & Lee, 2005), such as an intra-molecular recombination, then there is no good reason to use each character state difference found between sequences in the entire loop in an analysis. This simply inflates the phylogenetic impact of a single event, treating it instead as many independent events (corresponding to the number of character state differences in the inversion), as also noted by Kim & Lee (2005). As shown here, a larger data set simply does not give license to ignore known analytical pitfalls.
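The inflation described above can be made concrete with a small sketch. It compares the number of site-by-site differences produced by an inverted loop with a recoding of the loop as a single binary character. Recoding is only one simple way to express the single-event view and is an assumption of this example; it is not necessarily the down-weighting scheme applied in our analyses. The function names and loop sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_differences(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def code_loop_orientation(loops, reference):
    """Code each loop as '0' (reference), '1' (inverted) or '?' (neither)."""
    inverted = reverse_complement(reference)
    coded = {}
    for name, loop in loops.items():
        if loop == reference:
            coded[name] = "0"
        elif loop == inverted:
            coded[name] = "1"
        else:
            coded[name] = "?"
    return coded


# Toy loop sequences (hypothetical data): sample_B carries the inversion.
loops = {"sample_A": "AAGTC", "sample_B": "GACTT", "sample_C": "AAGTC"}

# Site by site, the single inversion looks like several substitutions.
print(count_differences(loops["sample_A"], loops["sample_B"]))
# Recoded, it contributes one character with two states.
print(code_loop_orientation(loops, reference="AAGTC"))
```

In the toy case a five-base loop yields three apparent substitutions from what is assumed to be one inversion event, and longer loops would inflate the count further.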
Together, these findings show that sampling the entire chloroplast, analyzed carefully, is a better option than sampling a few (even a dozen or more) fast regions. This is true at least in Polystachya, but a similar result has also been found by other studies, such as Parks, Cronn & Liston (2009) for Pinus. Based on cost alone, it seems there is no benefit to be gained by screening the chloroplast for faster markers when there are many short branches in the particular tree, as there are here. Whole chloroplast analyses are likely to be a better way forward than sampling individual chloroplast markers in addressing many phylogenetic questions. If gene capture is used, as it was here, it is also very easy to add probes to unlinked nuclear regions, further increasing the power of this approach as a general solution to the issue of data sampling.

CONCLUSION
Our results show that substantially increasing the number of nucleotides can be an effective option in the phylogenetic inference of taxonomically challenging taxa, such as the orchid genus Polystachya. We generated complete chloroplast sequences of 40 Polystachya specimens using a combination of Illumina NGS sequencing and sequence capture, which resolved a notorious polytomy for Neotropical species. Our tests of how well a smaller sample of the chloroplast genome would perform in phylogenetic inference show that the whole chloroplast is a better option than selecting just a few highly variable markers. Full plastid genomes appear particularly powerful when the expected tree is likely to contain many short branches, but nonetheless need to be analyzed with care.