Recent insertion/deletion (reINDEL) mutations: increasing awareness to boost molecular-based research in ecology and evolution

Today, the comparative analysis of DNA molecules mainly uses information inferred from nucleotide substitutions. Insertion/deletion (INDEL) mutations, in contrast, are largely considered uninformative and discarded because of our limited knowledge of their evolution. However, including rather than discarding INDELs would be relevant to any research area in ecology and evolution that uses molecular data. As a practical approach to better understanding INDEL evolution in general, we propose the study of recent INDEL (reINDEL) mutations – mutations where both ancestral and derived state are seen in the sample. The precondition for reINDEL identification is knowledge about the pedigree of the individuals sampled. Sound reINDEL knowledge will allow the improved modeling needed for including INDELs in the downstream analysis of molecular data. Both microsatellites, currently still the predominant marker system in the analysis of populations, and sequences generated by next-generation sequencing, a promising and rapidly developing range of technologies, offer the opportunity for reINDEL identification. However, a 2013 sample of animal microsatellite studies contained unexpectedly few reported reINDELs. As the most likely explanation, we hypothesize that reINDELs are underreported rather than absent and that this underreporting stems from a common lack of reINDEL awareness. If our hypothesis applies, increased reINDEL awareness should allow data to be gathered rapidly. We recommend routinely reporting either the absence or the presence of reINDELs, together with standardized key information on the nature of the mutations when they are detected, and using the keyword “reINDEL” to increase the visibility of both successful and unsuccessful searches.


Introducing reINDELs: Recent Insertion/Deletion Mutations
The comparative analysis of DNA molecules looks back on a history of over 40 years. Increasingly complex models of nucleotide substitution patterns at point mutations have been developed (Sullivan and Joyce 2005) and are routinely applied to DNA sequence markers. Another type of mutation, the gain or loss of nucleotides within a defined locus, termed insertion/deletion (INDEL) mutation, has received much less attention (Lunter et al. 2006). This lack of attention is not due to a lack of relevance; on the contrary, INDELs constitute a considerable fraction of mutations in coding and noncoding parts of the genome, are responsible for copy number variants, and are the signature of transposable elements (Korbel et al. 2007; Huang et al. 2010). Rather, the lack of attention is due to a lack of profound knowledge on the evolution of INDELs (Ogden and Rosenberg 2007 and references therein; Sunday and Hart 2013), for which reason they are usually considered uninformative and removed during data preparation, except in the context of gene finding (Kellis et al. 2004) and microsatellite analyses (Ellegren 2004).
The special case of recent INDEL mutations, henceforth reINDELs, that is, mutations where both ancestral and derived state are seen in the sample, allows us to witness INDEL evolution in real time. Knowledge of reINDELs based on broad sampling should therefore facilitate a better understanding of INDEL evolution generally. The preconditions for reINDEL identification are that we know the pedigree in our sample and that our sample is large (Ellegren 2000; Schlötterer 2000); examples of such situations include studies on extrapair paternity, intraspecific brood parasitism, and eusocial societies, ideally with a single female reproductive (see Fig. 1 for an example study system). Whenever we find previously unknown allelic size variation in genotype data for such systems (see Box 1 for microsatellites as an example), reINDELs can be identified easily. For example, a validated allele detected in a singly mated female's offspring that is unknown from the parents can be deduced to represent a recent mutation. The reINDEL can then be described in base-pair length and understood as either insertion or deletion.
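The deduction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a validated pipeline: alleles are given as fragment lengths in base pairs, the names `ReindelCandidate` and `find_reindel_candidates` are hypothetical, and the putative ancestral allele is taken to be the parental allele closest in size, a parsimony heuristic assumed only for this sketch. Validation of allele calling and re-genotyping would still be required before a candidate counts as a reINDEL.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ReindelCandidate:
    """An offspring allele that is unknown from both parents (hypothetical type)."""
    ancestral_bp: int  # putative ancestral allele size in base pairs
    derived_bp: int    # novel allele size observed in the offspring

    @property
    def length_change_bp(self) -> int:
        # Positive values mean gained nucleotides, negative values lost ones.
        return self.derived_bp - self.ancestral_bp

    @property
    def kind(self) -> str:
        return "insertion" if self.length_change_bp > 0 else "deletion"


def find_reindel_candidates(
    mother: Sequence[int],
    father: Sequence[int],
    offspring: Sequence[int],
) -> List[ReindelCandidate]:
    """Return offspring alleles at one locus that neither parent carries.

    Assumption for illustration: the ancestral allele is the parental
    allele closest in size (ties resolved toward the smaller allele).
    """
    parental = set(mother) | set(father)
    candidates = []
    for allele in offspring:
        if allele in parental:
            continue  # allele is explained by ordinary inheritance
        ancestral = min(parental, key=lambda a: (abs(a - allele), a))
        candidates.append(ReindelCandidate(ancestral, allele))
    return candidates


if __name__ == "__main__":
    # Made-up genotypes: diploid mother, haploid father, diploid offspring.
    for c in find_reindel_candidates((150, 154), (152,), (150, 156)):
        print(f"{c.ancestral_bp} -> {c.derived_bp} bp: "
              f"{c.kind} of {abs(c.length_change_bp)} bp")
```

The genotypes in the usage example are invented for illustration. A tuple of any length can be passed for each individual, so that haploid fathers, as in many eusocial insects, are handled without special cases.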

The Relevance of (recent) INDEL Mutations to the Study of Ecology and Evolution
A better understanding of INDEL evolution, achievable via the study of reINDELs, will allow INDELs to be included as genetic variation in our analyses rather than discarded. The gain will be manifold. Our phylogenetic trees and genetic networks will be based on more, and more solid, information, and our population-genetic and population-genomic inferences will be more accurate (see Box 1 for microsatellite examples). This will have far-reaching effects on any research area using molecular data to make ecological and evolutionary inferences, such as speciation research, sociobiology, the study of species interactions, conservation genetics, invasion biology, and climate change biology.
From the obviously vast range of potential implications of including INDEL information in the data analyses, we illustrate one aspect in more detail: sex differences in germline mutability. There is broad consensus on the existence of a male mutation bias in vertebrates and some plants caused by the higher number of cell divisions during spermatogenesis than during oogenesis (Kirkpatrick and Hall 2004). However, the underlying factors shaping the extent of male mutation bias are still poorly understood (Bartosch-Harlid et al. 2003; Goetting-Minesky and Makova 2006), information from invertebrates is scarce, and recent observations indicate that the mutation rate can differ strongly among closely related species (Venn et al. 2014). Thus, more information on patterns at individual loci and in different organisms is needed (Ellegren 2007). Such information will then open up additional research avenues; for example, such bias may even influence the long-term persistence of populations with skewed sex allocation (Cotton and Wedekind 2010).

Studying reINDELs in the NGS Era
The advent of next-generation sequencing (NGS), the massively parallel sequencing of DNA, has opened up a previously unthinkable array of opportunities (Andrew et al. 2013). For ecology and evolution including applied fields, possibly the most important consequence is that NGS makes genomics available for the study of nonmodel organisms (Hudson 2008; Tautz et al. 2010; Williams et al. 2014). Compared with population genetics, the promises of population genomics include improved identification of adaptive molecular variation as well as improved inferences about population demography and evolutionary history (Luikart et al. 2003; Stapley et al. 2010; Williams et al. 2014).
Mutation rates in microsatellites tend to vary considerably within taxa, from 10⁻⁶ to 10⁻² per site and generation, relatively homogeneously across the tree of life (Bhargava and Fuentes 2010). In contrast, estimates of genome-wide rates of nonmicrosatellite INDEL mutations per site and generation are in the range of 2.1 × 10⁻¹⁰ (Saccharomyces cerevisiae; Lang and Murray 2008) to 1.2 × 10⁻⁸ (Caenorhabditis elegans; Denver et al. 2004), that is, two to eight orders of magnitude lower than the estimated rates of mutations in microsatellites. However, the sheer amount of data in NGS projects compensates for the lower mutation rate. For example, RADseq (Baird et al. 2008), one of the more established NGS methods for population genomics looking at just fractions of the genome in the range of typically 1%, targets thousands or tens of thousands of loci in a single analysis, compared with some dozens or, rarely, hundreds in the most comprehensive microsatellite studies. Some NGS data sets have indeed been found to contain considerable amounts of INDELs (e.g., Baldwin et al. 2012). Currently, it is common to filter out INDELs at an early stage of the bioinformatics pipeline (e.g., Toonen et al. 2013), but there is growing awareness that INDELs in NGS data are not just a nuisance in the alignment process but also a valuable source of information (e.g., Pacurar et al. 2012; Smolina et al. 2014).
Figure 1. Eusocial insects are among study systems that facilitate generating large population-genetic or population-genomic data sets for individuals with known pedigree and that are thus ideal for studying reINDELs. Here, one of the 1,000-10,000 workers (Steiner et al. 2004) of a colony of the ant Lasius austriacus carries a worker sister at the pupal stage, all these workers stemming from the same mother and father (Steiner et al. 2007).
The unparalleled promises of NGS have caused many researchers in ecology and evolution to switch from traditional, locus-based to whole-genome-based approaches or to plan doing so. However, to complete the transition, multiple challenges coming with the new technologies need to be overcome (Sboner et al. 2011; DeWoody et al. 2013; Poisot et al. 2013; Andrews and Luikart 2014; Mesak et al. 2014). The transition is slower in the study of nonmodel organisms than in that of model organisms (McCormack et al. 2013), given that for nonmodel organisms, resources tend to be more limited in terms of both relevant genomic background information (Nevado et al. 2014) and money.
In any case, not the technology per se but the relevance of the question addressed and the stringency in testing the hypotheses raised determine the quality of research. For various research questions, using NGS technology might be like using a sledgehammer to crack a nut (cf. Brewer et al. 2014); viewed matter-of-factly, there are both advantages and disadvantages to microsatellites compared with NGS techniques, and microsatellites are still being used massively (Box 2). In short, some believe in the fast replacement of microsatellites by NGS approaches (Andrew et al. 2013), others in the persistence of microsatellites in the study of populations also in the future (e.g., Zalapa et al. 2012; Dawson et al. 2013; Butler et al. 2014).

Box 1. Microsatellites as a fast road to understanding reINDELs
Microsatellites are noncoding, codominantly inherited DNA loci consisting of simple sequence repeats (SSRs), sometimes also termed short tandem repeats (STRs) or simple sequence length polymorphisms (SSLPs). Due to frequent INDEL mutations via loss and gain of repeat units, they commonly exhibit high variation. The design of studies with known pedigree allows precise expectations on the number and size of alleles in the data set based on the knowledge of the parental alleles. The vast majority of recent mutations in microsatellites will cause variation in allele size (Estoup et al. 2002), and any deviation from the original expectations thus is a reINDEL candidate. For correct data interpretation, validation of allele calling to exclude a scoring error and re-genotyping of the respective individual to exclude a PCR error are necessary. Previous studies on microsatellite evolution have shed some light on, for example, the factors increasing slippage events and multistep changes (Primmer et al. 1996; Chakraborty et al. 1997; Schlötterer 2000; Eckert et al. 2002; Beal et al. 2012) and the existence of differences between male and female germline (Anmarkrud et al. 2011). However, many questions remain, such as about the influence of base composition of repeat motifs, about mating and/or sex-determination systems, about cross-taxa and cross-loci variation, and about differences between experimental and natural populations (Ellegren 2000, 2004; Schlötterer 2000; Leclercq et al. 2010; Anmarkrud et al. 2011). Considering the increase in accuracy that the improvement of nucleotide models of evolution brought about in phylogenetic applications, we can expect comparable advances in microsatellite-based analyses. Currently existing microsatellite models such as the stepwise mutation, the generalized stepwise, or the K-allele model (reviewed in Estoup et al. 2002) are either rather simplistic or make rather unrealistic assumptions, both of which can lead to poor performance in empirical tests (see, e.g., Balloux and Lugon-Moulin 2002 on stepwise-mutation-model-based Rst values or Peery et al. 2012 on the reliability of microsatellite-based bottleneck tests for detecting recent population declines). However, with profound knowledge about microsatellite evolution, future models might incorporate the full range of relevant aspects for which maximum likelihood or Bayesian approaches can make good estimates (cf. Caliebe et al. 2010; Wu and Drummond 2011; Nikolic and Chevalet 2014). Potential implementations include different mutability rates according to motif sequence, motif and allele size, allele-frequency dependence, different probabilities for expansion or contraction events, and different rates for male and female germline. More realistic models will aid multiple research areas in ecology and evolution: kinship, parentage, and behavior analyses will be more accurate, facilitating a better understanding of mating systems, dispersal patterns, and social organization; better estimations of effective population sizes and detection of bottlenecks will aid nature conservation research; hybridization and backcrossing patterns will be more correctly mirrored, which in turn will increase our understanding of these evolutionary forces; aberrant modes of reproduction might be elucidated, phylogeography exploring the recent past improved, and signals of selection better recognized.

Box 2. Microsatellites versus next-generation sequencing in the study of populations
Today, researchers addressing population-level questions can decide among a range of classical population genetic and next-generation sequencing (NGS) approaches. No approach is "inherently better" (Karl et al. 2012) than any other one. The decision is not easy, and the criteria to be considered range from scientific to resource related. In compiling a list of characteristics (Table B1), we considered just microsatellites among classical approaches due to their common use but two different approaches among NGS techniques, representing the two extremes in effort and information. These are RADseq (Baird et al. 2008), the most frequently used among approaches analyzing just a fraction of the genome, and whole-genome resequencing (Huang et al. 2009), the approach using the maximum of information that can be used. We used recent protocols and manuals and our own experience; we aimed at covering a wide range of characteristics and at objectivity but take responsibility for any failure in doing so.
Table B1. Characteristics (as of October 2014) of microsatellites versus two selected next-generation sequencing (NGS) approaches in studying populations, RADseq (Baird et al. 2008) and whole-genome resequencing (Huang et al. 2009). All sequencing information is based on the assumption that Illumina (http://www.illumina.com/) technology is used, except the sequencing for developing microsatellite loci, for which the use of Roche 454 (http://www.454.com/) technology is assumed. Where feasible, we classified using a five-step scale of very low, low, intermediate, high, and very high, for each characteristic calibrated relatively across the three techniques. Secondary bioinformatics: sequence alignment and variant calling; tertiary bioinformatics: further downstream steps of sequence annotation and interpretation (Wright et al. 2011; primary bioinformatics, that is, base calling, usually is performed by the software of the sequencing machine).
We also performed a Web-of-Science-based analysis of the numbers of annually published original articles on microsatellites and population genomics, from 1995, when the first population-genomic paper came out, to 2013 (Fig. B1; Appendix 1) and made two inferences. First, microsatellites still represent the major research approach, with about 4500 contributions in 2013 and thus 47 times the number of population-genomic papers. Second, population genomics should indeed have a bright future, with a growth rate better explained by an exponential than a linear function. In contrast, microsatellites' growth rate is, over the observed period, better explained by a linear than an exponential function, and, in fact, the number of contributions on microsatellites seems to have leveled off in recent years.

Currently Few reINDEL Reports: We Hypothesize Little reINDEL Awareness
Can we simply use the published literature to study reINDELs? We did a standardized analysis of microsatellite literature (see Appendix 2 for details) and found unexpectedly few reINDEL reports. In detail, in 16 animal microsatellite studies from 2013 in which sufficient information on pedigree was available, between 264 and 93,140 microsatellite alleles were seen per study. One study reported the absence of reINDELs explicitly (Liu et al. 2013). No study reported proven reINDELs. One study (32,788 alleles seen) reported 17 putative reINDELs, but did not report the validation of allele calling to exclude a scoring error and re-genotyping of the respective individual to exclude a PCR error (Mayer and Pasinelli 2013). In that study, the position of the loci on autosomes was indicated as was the sex that introduced the mutations, but no information was provided on the identity of the locus or loci, mutation type(s), and allele size(s).
We also calculated the binomial proportion confidence interval to identify mutation rates in line with the number of microsatellite reINDELs reported in the 16 papers surveyed using a confidence level of 0.95, as implemented in a custom Fortran program: 17 reINDELs of 32,788 alleles are in line with mutation rates of 3 × 10⁻⁴ to 8 × 10⁻⁴, which is well within the middle of the reported range of per-locus-per-generation mutation rates of microsatellites of 10⁻⁶ to 10⁻² (Bhargava and Fuentes 2010). On the other hand, zero reINDELs of the combined 184,660 alleles seen in the 15 studies without reINDEL reports (eight animal classes) are compatible with, at most, a mutation rate of 2 × 10⁻⁵. There are two conceivable explanations for a mutation rate at the lower end of the known rate range: (1) The mutation rate could be very low indeed, or (2) researchers may be insufficiently reINDEL aware. We are not able to decide definitively in favor of one of these explanations but for two reasons hypothesize that (2) applies. Firstly, microsatellite loci used in the 15 studies lacking reINDEL detection were all chosen for maximum polymorphism by the authors, rendering (1) rather implausible. Secondly, 14 of the 15 studies lacking successful reINDEL detection lack any statement about the absence of or scanning for reINDELs.
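To make the interval calculation concrete, the following Python sketch computes an exact (Clopper-Pearson) binomial confidence interval using only the standard library. It is an illustration under assumptions, not the program used for the analysis: the text does not state which interval method was applied, so the choice of Clopper-Pearson and the function names `binom_cdf` and `clopper_pearson` are assumptions made here.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), with terms evaluated in log space."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def _solve_cdf(k: int, n: int, target: float, iters: int = 100) -> float:
    """Bisection for p with binom_cdf(k, n, p) == target (CDF decreases in p)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if binom_cdf(k, n, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(k: int, n: int, conf: float = 0.95):
    """Two-sided exact interval for a proportion, given k events in n trials."""
    alpha = 1.0 - conf
    # Lower limit: P(X >= k) = alpha/2, i.e. P(X <= k-1) = 1 - alpha/2.
    lower = 0.0 if k == 0 else _solve_cdf(k - 1, n, 1.0 - alpha / 2.0)
    # Upper limit: P(X <= k) = alpha/2.
    upper = 1.0 if k == n else _solve_cdf(k, n, alpha / 2.0)
    return lower, upper


if __name__ == "__main__":
    # Counts taken from the literature analysis described in the text.
    for k, n in [(17, 32_788), (0, 184_660)]:
        lo, hi = clopper_pearson(k, n)
        print(f"{k} of {n}: {lo:.1e} to {hi:.1e}")
```

Under these assumptions, the two calls reproduce the orders of magnitude quoted in the text; other interval methods (for example, Wilson or a one-sided bound) would give slightly different limits.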

Recommendation
Apparently, an increase of reINDEL awareness is needed, irrespective of the future prime methods in ecology and evolution. To prevent further loss of reINDEL information, we appeal to researchers to scan their data for reINDELs and report them together with a few easy-to-convey and crucial details on the nature of locus and mutation. We recommend that the standard reINDEL report contain the following information: report of absence, or, alternatively, (1) identity of locus affected including GenBank accession number, (2) (putative) ancestral allele and derived allele, (3) sex of (putative) ancestral-allele donor, as well as (4) pipeline of reINDEL validation. Importantly, we suggest that in both instances, absence and presence of reINDELs, the result of the reINDEL search should be reported: Only when reINDEL absence is reported explicitly will it be possible to use absence data in mutability calculations, in contrast to our inability to do so with the results of our literature analysis. By managing to efficiently tap all sources for reINDEL knowledge from now on, we will rapidly create a comprehensive set of information including on insertion/deletion lengths, flanking regions, and chromosomal locations, suitable for the development of INDEL mutation models. No matter whether the study of reINDELs proposed here will be based on microsatellite or NGS data, it seems that the time for advanced INDEL modeling is near.
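One way to picture the recommended standard report is as a small data structure. The Python sketch below is only an illustration: all class and field names are hypothetical, the example values are placeholders rather than real data, and the `alleles_seen` field is an addition assumed here because the number of alleles seen is what makes absence data usable in mutability calculations.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReindelRecord:
    """One detected reINDEL, following items (1) to (4) of the recommendation."""
    locus: str                 # (1) identity of the locus affected
    genbank_accession: str     # (1) GenBank accession number of the locus
    ancestral_allele_bp: int   # (2) (putative) ancestral allele size
    derived_allele_bp: int     # (2) derived allele size
    donor_sex: Optional[str]   # (3) sex of ancestral-allele donor, if known
    validation: str            # (4) description of the validation pipeline
    putative: bool = True      # True until validation has been completed


@dataclass
class ReindelReport:
    """Result of a reINDEL search; an empty record list is a report of absence."""
    alleles_seen: int          # assumed extra field, see lead-in
    records: List[ReindelRecord] = field(default_factory=list)

    @property
    def absent(self) -> bool:
        return not self.records

    def summary(self) -> str:
        return (f"reINDEL search: {len(self.records)} reINDEL(s) "
                f"in {self.alleles_seen} alleles seen")


if __name__ == "__main__":
    # Placeholder values only; locus name and accession are not real.
    absence = ReindelReport(alleles_seen=1000)
    presence = ReindelReport(
        alleles_seen=1000,
        records=[ReindelRecord("LocusA", "XX000000", 154, 156, "male",
                               "allele calling re-scored; re-genotyped")],
    )
    print(absence.summary())
    print(presence.summary())
```

Treating absence as an ordinary report with zero records, rather than as missing information, mirrors the point made above that an unsuccessful search is itself a result worth recording.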

(c) Selection of definitive set of papers
From the 97 papers retrieved under (b), we selected as definitive set of papers those that fulfilled the following criteria:
• Primary research article, that is, not meta-analysis or review article; despite our search for just Article under Document Type under (b), not all results were primary research articles indeed.
• Empirical, that is, not simulated data.
• Number of alleles seen discernible, that is, number of individuals successfully analyzed using microsatellites, number of microsatellite loci successfully genotyped, and ploidy level discernible.
• Sufficient information on pedigree available, either via inference from independent data or from microsatellite data under monogyny/monandry or clonality.

(d) Analysis of definitive set of papers
We retrieved information from the 16 papers selected under (c) on the following:
• Taxonomic affiliation of the animals analyzed at the level of Class.
• Number of alleles seen.
• Number of reINDELs reported, as putative or proven.
• In case of reINDEL report, the presence/absence of statements on identity of locus affected, position of locus on auto-/allosome, mutation type, direction of mutation in case of size-shift mutation, allele size, sex.
• In case of no reINDEL report, the presence/absence of statement that no reINDEL was detected.