Fast sequence-based microsatellite genotyping development workflow for any non-model species

Application of high-throughput sequencing technologies to microsatellite genotyping (SSRseq) has been shown to remove many of the limitations of electrophoresis-based methods and to refine inference of population genetic diversity and structure. However, early proof of concept and species specific development studies resulted in dispersed information making it cumbersome for prospective users to identify a clear path to SSRseq approach set up in species of new interest. To overcome these difficulties, we present here a streamlined SSRseq development workflow that includes microsatellite development, multiplexed marker amplification and sequencing, and automated bioinformatics data analysis. We demonstrate its application to five groups of species across kingdoms (fungi, plant, insect and fish) with different levels of polymorphism and genomic resource availability. We found that relying on previously developed microsatellite assay is not optimal and leads to a resulting low number of reliable locus being genotyped. In contrast, de novo ad hoc primer designs gives highly multiplexed microsatellite assays that can be sequenced to produce high quality genotypes for 20 to 40 loci. We highlight critical upfront development factors to consider for effective SSRseq setup in a wide range of situations. The automated sequence analysis pipeline, which accounts for all linked polymorphisms along the sequence, quickly generates a powerful multi-allelic haplotype-based genotypic dataset. Cost and time effective application of SSRseq approaches are within reach for any species, calling to new theoretical and analytical frameworks to extract more information from multi-nucleotide polymorphism marker systems.


Introduction
Development of high-throughput sequencing (HTS) technologies since the mid 2000's has greatly widened the scope of genetic applications in research fields such as agronomy, ecology and evolutionary biology where the use of molecular markers is common place. The easy collection of millions of DNA sequences has also facilitated the development of molecular markers such as single nucleotide polymorphism (SNP; (Garvin et al. 2010;Delord et al. 2018)) or microsatellites (or single sequence repeats SSR; (Guichoux et al., 2011;Malausa et al., 2011)) for any species of interest.
Beyond marker development, HTS has been largely adopted for genotyping-by-sequencing (GBS), thus taking advantage of restriction enzymes to target sequencing efforts in specific parts of the genome that are shared by conspecific individuals (Miller et al. 2007;Baird et al. 2008). Restriction site-associated DNA polymorphism sequencing allows simultaneous identification and genotyping of SNP in non-model species and interrogation of thousands to hundreds of thousands of loci genomewide but remains expensive when population samples comprise hundreds of individuals. Recent development of amplicon sequencing which targets previously identified SNP makes it possible to genotype hundreds to thousands of individuals at a few hundreds of SNP at a very reasonable cost (Nguyen-Dumont et al. 2013;Campbell et al. 2015;Aykanat et al. 2016). If sequence-based genotyping of SNP is gaining momentum, application of HTS to microsatellite genotyping was until recently lagging behind. Updating microsatellite genotyping to modern technologies remains important for several reasons. Firstly, some current scientific questions in ecology or evolutionary biology can be answered using a moderate number (e.g. a dozen) of highly polymorphic multi-allelic loci such as microsatellites (Harrison et al. 2013). Secondly, variation in the number of repeated oligonucleotide motif is a unique kind of polymorphism with specific mutation mechanism and rate which in itself provides a complementary picture of genetic variation to nucleotide substitutions across populations (Haasl and Payseur 2011) and genomes (Willems et al. 2014). Thirdly, it becomes more and more obvious that microsatellite polymorphism is involved in numerous biological processes such as gene expression regulation and epigenetic mechanisms (Bagshaw 2017;Sadd et al. 2018), and more generally in phenotypic variation (Xie et al. 2019) including human diseases (Gymrek 2017;Hannan 2018). Thus, while marker preference evolves through time with specific markers dominating the genotyping field over a period of time following technological advances (Schlötterer 2004;Seeb et al. 2011), maintaining our capability to interrogate any kind of polymorphism in the context of rapid HTS technological advances is paramount and should be prioritized.
In the age of HTS, and in comparison with SNP genotyping, microsatellite genotyping suffers from several drawbacks: homoplasy (alleles of identical size having different underlying sequence (Viard et al. 1998;Estoup et al. 2002)), time and cost consuming development and genotyping, low throughput, lack of automation and data standardization (Moran et al. 2006;Ellis et al. 2011). Yet, all of these limitations are linked to the fact that currently microsatellite genotyping relies on allele discrimination based on amplicon size assessed by capillary electrophoresis (De Barba et al. 2016) and do not hold true if microsatellite genotyping transitions to sequence-based genotyping. Previous direct comparisons of capillary electrophoresis and sequence-based microsatellite genotyping (called SSRseq thereafter) validated SSRseq as a reliable method (Darby et al. 2016;Vartia et al. 2016).
Advantages of sequence-based over capillary electrophoresis-based microsatellite genotyping are significant. Direct access to allele sequence reveals additional polymorphisms that remain hidden when using only allele size to identify variation (Darby et al. 2016;Vartia et al. 2016; Barbian et al. 2018). Sequence data thus reduces allele homoplasy because alleles of the same size may contain molecular variation that does not translate into size variation such as SNP, indels masking variation in repeat number, or presence of two adjacent SSR motifs with complementary size variation (Darby et al. 2016). As a result, SSRseq offers refined genetic diversity estimation and population structure inference (Darby et al. 2016;Bradbury et al. 2018;Neophytou et al. 2018;Viruel et al. 2018).
Studies published so far have explored specific technical or analytical aspects of SSRseq. Several bioinformatics approaches have been developed (Hoogenboom et al. 2016;Suez et al. 2016;Zhan et al. 2017; Barbian et al. 2018), different laboratory protocols tested (Vartia et al. 2016;Pimentel et al. 2018) and means to account for molecular variation on population genetic inference compared (Neophytou et al. 2018;Viruel et al. 2018). Together, these studies explored numerous issues surrounding the technical and analytical advantages of SSRseq over traditional methods, each of them focusing on a given model species making it difficult to assess finding generalization to other biological models. In addition, laboratory methods for microsatellite marker amplification as well as analytical approaches to convert raw sequences to codominant genotypes markedly differ between studies making it cumbersome for prospective users to identify a clear path to setup a new SSRseq approach for a species of new interest.
Here, we propose an integrative workflow for the development of a SSRseq analysis for application to any species. We apply this workflow to five species groups from families taken across Phylums (Basidiomycota: Armillaria ostoyae; Angiosperms: Quercus faginea and Q. canariensis; Euarthropoda: Melipona variegatipes; Chordata: Alosa alosa, A. fallax and Salmo salar) that markedly differ in the amount of genomic data already available for them. We compare a broad range of possible development scenarios including sequencing of already optimized microsatellite assay traditionally genotyped on capillary sequencers, optimizing primers around already developed microsatellites, and developing microsatellite de novo from a range of genomic resources when available, or from newly generated low coverage random genome sequences for species without existing genomic resources. Building on our previous experience in development of highly multiplex microsatellite genotyping protocols (Guichoux et al., 2011;Lepais & Bacles, 2011), we propose a streamlined, simple and cost effective approach with demonstrated application to groups of species which a wide range of genetic and evolutionary characteristics. We offer an automated microsatellite sequence data analysis pipeline to produce haplotypic data, validated by extensive blind-repeat genotyping to estimate SSRseq error rates. We explore two different data analysis approaches that either account for the whole range of polymorphism detected within the repeated motif and the flanking regions, or restrict the analysis to the repeated motif itself. We find that relying on previously developed microsatellite assay is far from optimal and result in a low number of reliable locus genotyped. In contrast, primer redesign based on strict criteria around known locus or de novo microsatellite development gives successful single highly multiplexed PCR amplification and sequencing, producing high quality genotypes for 20 to 40 loci. Overall, we highlight critical upfront developments needed to generalize SSRseq setup effectively. We emphasize that efficient and powerful multipolymorphism haplotype-based genotyping approaches are time and cost effective and achievable in virtually any species, calling for new theoretical and analytical development to extract more information from this new kind of molecular markers.

Studied species, SSRseq development strategies and DNA isolation
We selected a range of species from different Kingdoms among biological models studied in our laboratories (Table 1). For S. salar we took the most straightforward route by amplifying and sequencing microsatellites using previously developed primers. To develop a refined workflow, we chose species with different level of genomic resource availability to test alternative de novo microsatellite development strategies that are likely to cover a wide range of situations (Table 1).
Protocols for DNA isolation and final DNA quality differ between species (Table 1).

SSRseq using previously-developed primers
The most straightforward approach to SSRseq microsatellite genotyping, i.e. based on sequence information from existing primers, was applied to S. salar using two marker selection strategies. In the first strategy, we selected 23 primers from a list of 81 microsatellites available for S. salar (O'Reilly and Wright 1995;Slettan et al. 1996;Ozaki et al. 2001;Rexroad et al. 2001;Gilbey et al. 2004;Paterson et al. 2004;King et al. 2005;Vasemägi et al. 2005;Thorsen et al. 2005;Yano et al. 2013, see details in Supporting Table S1) and mapped (Vasemägi et al. 2010). Selection criteria included allele size smaller than 300 bp to ensure that sequencing reads can span the entire allele length including library construction and absence of sequences complementarity between primers tested using Multiplex Manager (Holleley and Geerts 2009). In the second strategy, we chose to sequence a set of 15 microsatellites (Supporting Table S1) that are routinely amplified in a single multiplexed PCR and genotyped using electrophoresis-based capillary sequencers (Bacles et al., 2018;Lepais et al., 2017).

Genomic resources for microsatellite (re)development
For the other species, microsatellites primers were either re-designed or developed de novo from various genomic resources (Figure 1 top panel).
For Quercus sp., primers were re-design primers in flanking regions of existing microsatellite markers to optimize multiplex amplification and sequence interpretability while taking advantage of already validated microsatellite markers. We extracted primer sequences from 259 polymorphic and mapped EST-derived (Durand et al. 2010) and 35 genomic microsatellites (Steinkellner et al. 1997;Kampfer et al. 1998)). The primer sequences were mapped on the Q. robur reference genome (Plomion et al. 2018, GenBank accession GCA_003013145.1) using bowtie 2 v2.3.4.1 (Langmead and Salzberg 2012) and genomic sequence spanning from 200 bp downstream of the forward primer to 200 bp upstream of the reverse primer position were extracted using bedtools v2.25.0 (Quinlan 2014) resulting in 294 sequences used as genomic resource to design new primers.
For Alosa sp., we used sequences obtained from a Roche 454 GS-FLX sequencing run on a microsatellite-enriched DNA library (Rougemont et al. 2015) following the method described in Malausa et al. (2011).
For A. ostoyae, we used the reference genome sequence as genomic resource to identify microsatellites loci (Sipos et al. 2017, GenBank accession GCA_900157425.1).
Finally, no genomic resources were available for M. variegatipes. We therefore used DNA from one individual to construct a whole-genome sequencing library using Illumina TruSeq DNA kit. The resulting library was sequenced on an Illumina MiSeq flowcell using v3 2x300 pb paired-end sequencing kit. Mothur software v1.39.5 (Schloss et al. 2009) was used to assemble paired reads and keep paired reads with a minimum overlap of 100 bp without mismatch. We randomly subsampled 500,000 reads from the resulting 6.74 million paired reads for subsequent microsatellite identification, because a few hundreds of thousands of random sequences are sufficient to identify thousands of microsatellites (Castoe et al., 2010;Lepais & Bacles, 2011). Figure 1: Workflow for SSRseq markers optimization or development depending on genomic resource availability, from selection to multiplexed amplification and library preparation to bioinformatics analysis (refined approach).

De novo microsatellite development or primer re-design
The command line version of QDD pipeline v 3.1 (Meglécz et al. 2010(Meglécz et al. , 2014 was run on either i) a reference genome sequence, ii) a set of low coverage random sequences or iii) sequence extracted around already characterised microsatellite loci, to detect sequences containing microsatellites, identify good quality sequence (singletons and consensus) from problematic sequences (sequences showing low complexity, minisatellites or multiple BLAST hits with other sequences) and design primer pairs flanking the identified microsatellites ( Figure 1). QDD pipeline was run with default parameters, except for the primer design step (pipe3) where parameters were stringently defined in order to improve their capacity to be amplified jointly in a single multiplexed PCR (Qiagen Multiplex kit handbook; Lepais & Bacles 2011): primer optimal size was set to 25 nucleotides (min: 21, max: 26), optimal annealing temperature to 68°C (min: 60°C, max: 75°C) with a maximal difference of 10°C between primers of a same pair and optimal percentage of cytosine and guanine of 50% (min: 40%, max: 60%). In addition, PCR product size was set between 120 and 200 bp to be compatible with a wide range of sequencing platforms and to produce robust genotyping assays that can be used to analyse degraded or non-invasive samples. QDD analysis results in a large number of candidate loci with designed primer pairs from which a restricted number of loci can be selected (Figure 1). At the exception of Quercus sp. where a restricted set of input sequences necessarily limited choice among resulting candidate microsatellites, several quality criteria were used to select 60 microsatellites among hundreds to thousands candidates for further testing. We followed recommendations of Meglécz et al. (2014) to prioritize primer pairs with increased likelihood of amplification success by selecting microsatellite from singletons and not from consensus sequence, with pure repeat instead of compound motifs, showing at least 20 bp between the primers and the repeat motif, and with flanking region showing high complexity (e.g. primer pairs from the design group A following QDD terminology: no minisatellite, no other microsatellite in the flanking region, no homopolymer in the flanking region or the primer). In addition, we further selected microsatellites with the highest number of repeats to increase the probability of selecting polymorphic loci, avoided motif that can form hairpin such as AT repeats and when possible included a variety of di-, tri-and tetra nucleotide repeats.

Primer modification and simplex amplification tests
For Ion Torrent sequencing (Table 2), tag A 5'-GCCTTGCCAGCCCGCTCAG-3' was add to the 5' end of each forward primer and tag B sequence 5'-GCCTCCCTCGCGCCA-3' (Margulies et al. 2005;Blacket et al. 2012) was added to the 5' end of each reverse primer. For Illumina sequencing (Table 2), specific tags 5'-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-3' and 5'-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG-3' were added to the 5' end of the forward and reverse primer sequences respectively. Designed primers were tested for potential primer dimer formation using Primer Pooler (Brown et al. 2017). Primer pairs showing a deltaG lower than -6 kcal/mol are likely to form dimer and result in poor amplification in a multiplexed PCR. For locus involved in significant interactions, alternative primers were selected or in absence of alternative, another locus was selected from the candidate list. Oligonucleotides were ordered in a plate format from Integrated DNA Technologies with standard desalt purification at a final concentration of 100 µM. Primer pairs were tested using simplex amplification performed on one individuals per species using Qiagen Multiplex kit in a final volume of 10µL and with a final concentration of each primer of 0.2µM. Amplification conditions consisted of an initial denaturation step at 95°C for 15 minutes, followed by 35 cycles consisting of denaturation at 95°C for 20s, annealing at 59°C for 60s and extension at 72°C for 30s, and a final extension step for 10 minutes at 72°C. Amplicons and 1 Kb size standard were then loaded on a 3% agarose gel containing GelRed or SyberSafe dye and migrated at 100v for 15 minutes. Each locus was screened under UV light for positive amplification with a clear band at the expected size. loci showing substantial evidence for minimum sequencing success (at least 20 sequences in at least 50% of the individuals); ‡ : reliable loci (less than 50% of missing data among individuals and less than 6% of genotyping error based on comparison of repeated genotyping); § routinely genotyped using optimized multiplexed PCR and capillary-based sequencer. FDSTools analysis using two parameter sets: stutterfinder -s:-1:50,+1:10 allelefinder -m 15 -n 20; and stuttermark -s:-1:70,+1:10 allelefinder -m 10 -n 20. For each marker, four parameter combination were used (two strategies and two parameters set) and for each strategy, the best parameter set was used for a given locus.

Multiplex microsatellite amplification and sequencing library construction
From 192 to 960 individuals were analyzed depending on the taxa considered including from 46 to 156 repeated individuals to check the reproducibility of the method (Table 2). For each of the taxonomic groups, a three-round multiplex PCR approach was used to amplify all loci simultaneously and improve amplification homogeneity and thus coverage of sequence between loci (Chen et al. 2016). In the first round, a multiplexed PCR including all selected locus primers was performed ( Figure 1, Supporting Table S1 for locus characteristics including primer sequences). PCR amplification were carried out in 96-well plates in a final volume of 5µL or 10µL using Qiagen

Sequence preparation
After sequence demultiplexing and adaptor trimming using a sequencer platform built-in software, reads shorter than 70 bp were removed using cutadapt (Martin 2011). When paired-end sequencing was used, paired reads were assembled into contigs using pear (Zhang et al. 2014) with a minimum overlap of 50 bp and a maximum assembled sequence length of 450 bp. For Alosa sp., reverse reads quality was generally poor, therefore, only the forward read was used as its length (250 bp) encompasses the whole length of the sequenced loci (max. 200 bp). In some cases, microsatellite amplicons from several species were pooled prior to PCR indexing so that two individuals from two species shared an identical barcode combination and were sequenced in a single run. Forward primer sequences of species-specific loci were used to sort sequences belonging to different species into different fastq files using fqgrep tool (https://github.com/indraniel/fqgrep) allowing for one mismatch.

Converting microsatellite sequences to genotypes
We used the first three steps of the FDSTools v1.1.1 pipeline (Hoogenboom et al. 2016) to identify sequences corresponding to the microsatellite alleles and call genotypes for each individual ( Figure   1). This analytical tool was chosen because it accounts for any kind of polymorphism detected across the analysed sequences (including variation in the number of repeated motifs, SNP or indels) while integrating specific tools to detect true allele from stutter mutation introduced during amplification that are typical of microsatellite markers.
In the first step (  In addition, two analytical strategies were compared. In the first strategy (called FullLength thereafter), all variation identified between primers was considered as a haplotype irrespectively of the nature of the polymorphism because all polymorphisms are physically-linked to each other in reads that encompass the whole locus. The FullLenghth strategy may be too complex for some loci or species showing high levels of polymorphism. In the second strategy, the analysis was therefore restricted to the repeat motif only (strategy called RepeatFocused thereafter). In this case, primer and flanking sequences surrounding the repeated motif are indicated in the primer sequence field of the FDSTools input file. The tssv step then extracts the sequence corresponding to the repeat motif region (still allowing for 8% of mismatch which will accommodate flanking sequence polymorphism) to perform subsequent genotypic call with Stuttermark and Allelefinder as described above. As the analytical approach used by FDSTools is based on counting coverage of unique sequences, any variation identified within the repeated motif region, including variation in repeated motif number, SNP and indel within the motif region, will still be accounted for when defining alleles. While this RepeatFocused strategy may be more robust due to the shorter length of sequence analysed, it should identify a smaller number of alleles compared to the FullLength strategy.

Comparing analytical approaches
We compared genotypic call qualities between the four generated datasets (i.e. FullLength_PS1, FullLength_PS2, RepeatFocused_PS1 and RepeatFocused_PS2). For each locus, we determined the best parameter set for each strategy and then the best overall analysis approach using the following criteria by order of importance: estimated allelic error, amount of missing genotypes and number of detected alleles. Loci that showed more than 6% of allelic error or more than 50% of missing data across individuals were flagged as failed and removed from further inspection. For each species and analytical strategy, we recorded the number of genotyped loci, mean allelic error and missing data rate and the total number of alleles (haplotypes) across loci. We then determined the best overall approach for each locus and used it to genotype each locus generating a final Combined genotypic dataset for each species using a specific bash script (SSRseq_DataAnalysis_FinalGenotyping.sh available https://doi.org/10.15454/HBXKVA). Finally, the overall development success rate was computed for each species by dividing the number of reliable loci by the number of loci included in the multiplexed PCR.

Gain from sequence information
The number of identified alleles based on sequence information (haplotypes) was compared to the number of alleles differing in amplicon length only for all analysed loci to assess the gain of information obtained by using sequence data and estimate size homoplasy. We investigated further the nature of the detected polymorphism by counting, for each locus, the number of variations in the number of repeats, SNP and indels in the repeated motif and the flanking regions that differ between haplotypes.

Sequence-based genotyping of previously developed microsatellites
Attempts to genotype previously developed microsatellites in S. salar either from a new combination of 23 loci or a routinely-used multiplex of 15 loci resulted in a low number of reliable loci genotyped ( Table 2). The overall success rate (the percentage of reliable loci over the number of loci amplified in the multiplexed PCR) ranged from 39% to 47% and the number of reliable loci from 7 to 10 (Table 2).
Moreover, the quality of the generated genotypic datasets is relatively low with a rate of missing data and allelic error above 10% and 1% respectively (Figure 2). The sequences produced on the PGM Ion Torrent platform resulted in the lowest genotypic data quality (Figure 2). The same genotyping protocol sequenced on the Illumina MiSeq platform produced higher genotypic quality with lower missing data and allelic error rates (Figure 2) in spite of a 2.3 times lower mean coverage per sequenced locus per individual ( Table 2). The Illumina MiSeq sequencing platform was thus used for subsequent analyses in other species. Illumina MiSeq sequencing platform. Number of reliable loci, total number of alleles, missing data and allelic error rates are indicated for three bioinformatics analysis strategies that focused either on all polymorphism across the sequence, on polymorphism within the repeated motif only, or a combination of the best strategy for each locus.

Sequence-based genotyping of de novo developed microsatellites
In contrast, overall success rate of de novo microsatellite development ranged from 67% to 86% (Table 2). Given the high number of candidate microsatellites typically identified from highthroughput sequencing or reference genome sequence, we could afford to screen as much as 60 new loci, and combined most of them (from 28 to 60) in a single multiplexed PCR for amplification ( We explored the effect of sequence coverage on genotypic data quality on Alosa sp. by sequencing the same set of 28 microsatellites amplified in 156 individuals using one, two or three Illumina MiSeq nano flowcells ( Table 2). The resulting increased coverage, from 95 to 198 and 267 sequences respectively per locus per individual (Table 2), recovers more data for those highly degraded DNA samples. First, increasing the coverage from 95 to 198 sequences by locus by individual detected three additional loci, while a further increase in coverage failed to recover additional locus (Table 2).
Second, the missing data rate linearly decreases with the increase in coverage, from 16.4% to 9.0% and 6.8% with 95, 198 and 267 sequences per locus per individual respectively (Supporting Material Figure S1). It is worth noting that the allelic error rate is not affected by genome coverage, as it varied only slightly between 0.5% and 0.7% without any correlation to coverage (Supporting Material Figure S1). A significant result is that even in conditions when only highly degraded DNA templates are available, reliable genotypic data can be obtained with moderate coverage, while increasing coverage will reduce missing data but not genotyping error rate.

Whole sequence VS repeated motif polymorphism analysis
Focusing the analysis on the repeated motif slightly increased the number of reliable loci and tends to produce fewer missing data and allelic errors (Figure 3). However, numerous polymorphisms that may be present in the flanking sequences are not accounted for. Indeed, analysing all polymorphism detected between the PCR primers resulted in a higher mean number of allele per locus, at the expense of slightly higher missing data and allelic error rates (Figure 3). Interestingly, 17% of the loci can be analysed reliably using either the FullLength or the RepeatFocused analytical approach. Thus combining analytical strategies by selecting the best approach for each locus resulted into an optimized dataset (Figure 3). Even for loci with reliable genotypes irrespectively of the analytical approach chosen, selecting the one that produces the best quality data (in terms of number of alleles, missing data and error rate) leads to an improved dataset quality. This combined strategy results in recovering the highest number of loci and alleles while keeping missing data and allelic error rates at the lowest (Figure 3). Number of reliable loci, total number of alleles, missing data and allelic error rates are indicated for three bioinformatics analysis strategies that focused either on all polymorphism across the sequence, on polymorphism within the repeated motif only, or a combination of the best strategy for each locus.

Types of polymorphism detected across species
While most common population genetics applications do not necessitate to characterize the nature of the polymorphism differentiating alleles, the main advantage of sequence data (in addition to analysing a much higher number of loci) is to be able to identify allelic variation that does not translate into size variation, i.e. the only variation that is detected when using classical electrophoretic approaches. Across species, the proportion of alleles would have remained undetected by capillary electrophoresis (size homoplasy) ranging from 6% for M. variegatipes, 11% for S. salar, 14% for Alosa sp., 35% for Quercus sp. and 53% for A. ostoyae (Table 3). Conversely, the increase in the proportion of allele detected by accessing sequence data ranges from 6% for M.
variegatipes, 13% for S. salar, 16% for Alosa sp., 56% for Quercus sp. and 113% for A. ostoyae (Table   3). Indeed, beside variation in repeat number, we identified numerous SNP and indel either in the flanking sequence or in the repeat motif itself (Figure 4, Table 3). In fact, additional polymorphism beyond variation in repeat number was the rule rather than the exception (Table 3). However, differences in the proportions of the type of the detected polymorphism were found between species (Figure 4). While repeat number variation represented more than 80% of the polymorphism detected in S. salar, Quercus sp., Alosa sp. and M. variegatipes, SNP in the flanking sequence was the most frequent polymorphism for A. ostoyae representing 49.8% of the variation, much high than variation of repeat number estimated at 29.8% (Figure 4). For A. ostoyae, SNP in the repeat motif and indel in the flanking sequence also represent a significant proportion of the polymorphism detected (12.6% and 6.5% respectively). Quercus sp. were characterised by SNP both within the repeat motif (7.5%) and the flanking region (9.3%). The second most comment polymorphism for Alosa sp. was SNP in the repeat motif (8.5%) and for M. variegatipes SNP in the flanking region (5.5%). For S. salar, variation in the number of motif was much more frequent than for other species (93.0%) compared to other polymorphisms that represent a marginal proportion of the variation (less than 3% each).   (3) 1 (1) 9 (7) 5 (5) † : irrespectively of polymorphism type, computed based on the Combined analysis strategy; ‡ : simulating the number of alleles that would have been identified using traditional capillary electrophoresis, computed based on the FullLength analysis strategy and accounting for allele size only on the same locus as the Combined approach. § : total number of alleles (and number of loci in brackets) for each polymorphism type. ¶ : combination of two sequence based microsatellite genotyping protocols.

Discussion
While relying on previously developed microsatellite assay is far from optimal, our study demonstrates that primer redesign around known locus or de novo microsatellite development based on strict criteria gives successful single highly multiplexed PCR amplification and sequencing for 20 to 40 loci. However, critical upfront factors need to be considered for efficient SSRseq setup.
The development workflow and analysis pipeline provided in this study are able to produce powerful multilocus haplotypic data in a time and cost effective manner in virtually any species.

Not all roads lead to Rome: navigating pitfalls when adopting HTS for microsatellite genotyping
Our preliminary attempts to apply SSRseq using previously developed microsatellite primers or capillary-based multiplexed microsatellite genotyping protocols clearly failed to produce reliable genotypic data for a sufficient number of loci. In the best case scenario, sequencing of 15 microsatellites in Salmo led to the reliable genotyping of 7 loci with 15.6% of missing data and 1.19% of allelic error, a result far worse than the high quality dataset obtained for the same multiplex using traditional capillary-electrophoresis of 14 loci with 0.5% missing data and 0.35% allelic error (Bacles et al. 2018). Similar trends were observed in previous studies that validated the use of HTS to genotype microsatellites and relied on previously developed primers: the generated datasets were characterised by high levels of missing data (up to 45%, (Vartia et al. 2016)) or low number of genotyped loci (7 to 8, (Suez et al. 2016;Barbian et al. 2018)). Two characteristics are problematic for SSRseq from microsatellites initially developed for capillary electrophoresis-based genotyping. First, high variability in locus length and primer characteristics leads to heterogeneous amplification intensity and sequencing coverage across loci. Indeed, efficient multiplexed PCR necessitates careful primer design using strict criteria (Guichoux et al., 2011;Lepais & Bacles, 2011). In addition, the need for variable locus length for optimal multiplexing without allele size range overlap in capillary-based electrophoresis becomes unnecessary for sequencing, because same size loci can be reliably identified simply based on primer sequences. Secondly, as sequencing errors accumulate proportionally to sequence length, lengthy sequences are more difficult to analyse, while complex or interrupted microsatellite motifs produce difficult to interpret stutter patterns. In that case, starting from a limited number of loci results in a handful of reliable loci that may be too low for downstream applications. This conclusion agrees well with a previous study developing SSRseq in S. salar, where only one out of six (17%) of previously developed loci was successfully integrated into the final panel, compared with a 26% success rate for newly developed primers (Bradbury et al. 2018). Adapting previously developed loci in the same species resulted in a success rate of about 45% in our case (7 out of 15 and 10 out of 23 reliable loci), but de novo development in other species was much more successful with a success rate of 75% on average. Not relying of previous primer design is thus of primary importance for the success of reliable SSRseq protocol development.

Workflow for efficient SSRseq development in non-model species
We propose here a workflow for developing new SSRseq approaches and we demonstrated its efficiency for a range of species with marked differences in polymorphism level. Starting from a reasonable number of candidate loci that were identified using readily available or newly generated genomic resources, resulted in an average of 75% of the loci included in the PCR multiplex generating analysis: if a higher number of loci is required; additional sets of 60 loci could be selected from the candidate locus list and amplified in separated PCR multiplexes that can be pooled together before PCR indexing leading to a time and cost effective way to genotype additional loci (Bradbury et al. 2018). We did not test multiplexing more than 60 loci in a single PCR, but we did not see difficulties in doing so (Campbell et al. 2015). In such a situation, careful primer design with strict criteria and control for primer interactions will be key to increase multiplexed amplification success.
The proposed workflow for SSRseq development minimized laboratory steps and analytical optimizations. We chose to start from a moderate number of loci, designed to maximize their compatibility and sequence interpretability, and remove any locus that failed to amplify, produce interpretable or repeatable genotypes. This strategy avoids tedious optimization and reduce the number and complexity of laboratory steps necessary to produce genotypes. Admittedly, additional DNA or amplicon clean up or normalisation might improve sequence quality and coverage across individuals and loci. However, we chose to keep the laboratory procedure as simple as possible and compensate the increase in amplification heterogeneity by additional sequence output resulting in sufficient coverage (220 reads/loci/individuals on average) to obtain a nearly complete genotypic dataset (generally 5% of missing data excluding the atypical case of Armilaria species). Moreover, increasing the number of laboratory steps inflates the risk of handling error and subsequently of genotyping error (Vartia et al. 2016). In this respect, highly multiplexed PCR are a useful technique to reduce laboratory steps and potential associated errors in addition to saving time and cutting costs.
Additional tests not presented here showed that the two-stage multiplex amplification prior to PCR indexing that was used to increase amplification homogeneity across loci is not necessary: high coverage is still efficient to compensate the increase in amplification heterogeneity when using a single multiplex PCR step for locus amplification as performed in previous studies (De Barba et al. 2016;Zhan et al. 2017;Bradbury et al. 2018;Tibihika et al. 2018).
We chose to rely on the set of 384 Illumina barcodes combinations because we found it to fit well with the output of the MiSeq sequencing platform when analysing from 20 loci to 300 loci depending on the type of flow cell used. However, studying more than 384 individuals from a single species necessitates either several MiSeq runs (Bradbury et al. 2018) or custom made dual-indexing strategies, as was successfully performed for the Ion Torrent PGM run (960 barcode combinations used) or in previous studies using the MiSeq platform (960 and 1024 barcode combinations Zhan et al. 2017)).
Finally, including repeated individuals is of paramount importance to assess the reliability of the produced genotypic data for each locus. This procedure aims to be best practice in capillary electrophoresis-based microsatellite genotyping (Hoffman and Amos 2005;Pompanon et al. 2005;Guichoux et al. 2011) but is even more important in SSRseq. Indeed, not all sequenced loci produced reliable genotypic data. It is thus necessary to be able to identify and exclude loci producing high genotyping errors. At the bioinformatics analysis stage, we settled on testing a limited number of parameter combinations (four parameter sets) that were found to provide satisfactory results for most loci across the different species analysed. Selecting the best parameter set for each locus among the four parameter combinations is an easy task that greatly improve the number of reliable loci and decrease the genotyping error rate. However, a few loci will still show high genotyping error rate due to low coverage or complex polymorphism patterns and should be excluded from the final genotypic dataset. Exploratory analyses testing a much wider number of parameter combinations resulted in limited success in increasing the final number of reliable locus. Extensive parameter testing is a tedious task requiring high computation time with only minor improvement for the final genotypic dataset quality and is not worth the extra effort as long as a sufficient number of loci have been included in the genotyping panel. Furthermore, additional analyses using alternative microsatellite sequences analysis tools such as Megasat (Zhan et al. 2017) and MicNeSs (Suez et al. 2016) also lead to the conclusion that all loci cannot be analysed reliably using a single set of parameters. Thus, we stress the importance to include a significant number of repeated individuals (at least 48) in the first analysis of a new SSRseq panel to (1) coarsely optimize analysis parameters for reliable loci, (2) quantify genotyping error rate and (3) identify and exclude unreliable loci. Such procedure has been implemented only in a limited number of previous studies describing SSRseq methods (5 out of 12 published studies thus far, (De Barba et al. 2016;Zhan et al. 2017;Bradbury et al. 2018;Barbian et al. 2018;Viruel et al. 2018)) but should be generalized.

Implications of haplotype based genotyping
We took advantage of the fact that reads span across entire loci to analyse all linked and phased polymorphisms encountered using the FDSTools pipeline (Hoogenboom et al. 2016). This haplotype approach differs from the methods implemented in other sequence-based microsatellite genotyping software such as Megasat (Zhan et al. 2017) which focuses on amplicon length or Micness (Suez et al. 2016) which estimates the number of repeated motifs while accounting for up to one substitution within the microsatellites motif. The haplotype approach has several advantages. Firstly, it is relatively insensitive to sequencing error because the analysis focused on unique sequence with high coverage and thus does not consider the noise generated by sequencing error which produce numerous unique low coverage sequences that are removed. Secondly, by analysing unique sequences, it accounts for any kind of polymorphisms, while at the same time includes an algorithm to identify stutters resulting from slippage mutation due to microsatellite instability during PCR amplification. Thirdly, it is flexible enough to be parameterized for analysis of the whole sequence or of the repeated motif only which is very useful to adapt the analysis to recalcitrant loci genotyping or to specific downstream applications.
Previous studies have demonstrated the added benefit of analysing different types of polymorphism within sequences compared to using amplicon size only (as in traditional capillary electrophoresisbased genotyping) to differentiate alleles (Darby et al. 2016;Neophytou et al. 2018;Barbian et al. 2018;Tibihika et al. 2018). Size homoplasy, due to alleles identical by size but not by sequence, ranges from to 32% and 44% (Darby et al. 2016;Vartia et al. 2016;Barbian et al. 2018). Here, we found high variability in size homoplasy ranging from 6% to 53% between the studied species in direct correlation to species polymorphism level and type of variation. SNP in the repeat motif or in the flanking region was the main source of size homoplasy which showed high prevalence in A.
ostoyae (SNP represented 62.4% of the detected variation and 53% of size homoplasy) and to a lesser extent in Quercus sp. (SNP represented 16.8% of the detected variation and 35% of apparent homoplasy). Even for species with lower apparent polymorphism levels, such as M. variegatipes (6%), S. salar (11%) and Alosa sp. (16%), the increase in the observed number of alleles will substantially improve genotypic resolution power (Darby et al. 2016;De Barba et al. 2016). It is important to note that not all SSRseq approaches maximize the information gained by the sequence information at the same level. If only allele size length is taken into account at the bioinformatics analysis step, conditions and thus results will be very similar to capillary electrophoresis-based genotyping (Zhan et al. 2017) with minimal information gain. Additional information will be obtained from analysing polymorphism within the repeat microsatellite motif, either by counting the number of repeated motifs including one substitution (Suez et al. 2016) or better still using the haplotype approach which is not constrained by the number of substitutions or the complexity of interrupted motifs. However, haplotype-based analysis that accounts for all linked variations across the whole sequence will make the most of the information available from sequence data (Barthe et al. 2012). Previous application of such an analytical approach used custom-made scripts implementing generalist bioinformatics tools and manual scoring of final genotypes (Darby et al. 2016;Neophytou et al. 2018) making them difficult to automate and transfer to other non-model systems (but see Barbian et al. 2018). Here, we integrated FDSTools algorithms into an automated analysis that includes parameters comparison and final genotyping into two pipelines based on bash script (https://doi.org/10.15454/HBXKVA), thus making it easy to estimate genotyping error based on repeat genotyping, to optimize parameters and identify reliable loci and produce the final multilocus genotypic table.
Finally, the ability to detect different sources of variation originating from several mutation mechanisms occurring at different rates provides renewed opportunities to study ecological and evolutionary events that occur at different timescales (Ramakrishnan and Mountain 2004;Barthe et al. 2012). Combining information on linked microsatellites and SNP (into a system called SNPSTR (Mountain et al. 2002) or HapSTR (Hey et al. 2004;Sorenson and Dacosta 2011)) was demonstrated to be a promising approach thanks to the increased phylogenetic resolution offered by explicitly considering complementary mutation properties of the markers. While theoretical and analytical implications of these approaches have been derived (Ramakrishnan and Mountain 2004;Hey et al. 2004;Payseur and Cutter 2006), empirical applications remain scarce and restricted to a very small number of systems due to the previous difficulties encountered to generate such empirical data (Mountain et al. 2002;Hey et al. 2004). This early limitation does not hold anymore with the generalisation of sequence-based microsatellite genotyping, as proposed herein, and the new ability to analyse linked microsatellites and SNP as haplotype. In addition, the flexibility of coalescent programs to simulate linked loci of different types (e.g. fastsimcoal2 (Excoffier et al. 2013)) will authorize far more realistic simulation of mutation mechanisms specifically tailored to each marker system. Such improvement, especially when applied to tens of marker systems, would make simulation-based inference (Beaumont et al. 2002) more accurate over an extended timescale range even for complex evolutionary history scenarios (Mountain et al. 2002).
In conclusion, this study proposes an integrated approach to expedite the development of SSRseq protocol for any species. We found it to be time efficient and cost effective for a range of species across Kingdoms. Building on our experience, we provided several recommendations to improve development efficiency. The two most important advices are to optimize marker selection and primer design for effective multiplex PCR amplification and sequence interpretability, and to use repeated individuals to assess the quality of the generated genotypic data. The automated genotyping pipeline developed in this study is flexible enough to optimize analysis parameters and improve genotypic data quality, and rapidly process sequences to estimate locus-level statistics (including missing data and genotyping error rates). Furthermore, it generates formatted genotypic tables ready for downstream data analysis from sequence information ensuring easy data interoperability (correspondence between allele code and allele sequence). As our method is based on simple bioinformatics algorithms, hundreds of individuals can be genotyped at tens of markers within a couple of hours on any standard laptop running on a Linux operating system. The ability of SSRseq to characterize SNP and indel present along the sequences, in addition to the targeted microsatellite, represents a new opportunity to produce empirical data to apply existing theoretical and statistical frameworks that integrate linked polymorphism with different mutation characteristics (Payseur and Cutter 2006). Finally, the ease and cost effectiveness of parallel development for multiple species make these approach convenient to develop powerful multilocus datasets for comparative population and community genetics studies (Crutsinger 2016), and to further investigate the functional implications (Bagshaw 2017) and adaptive potential of microsatellite variation among natural populations (Xie et al. 2019).
-Pipeline scripts for automated sequence-based microsatellite genotyping with documentation and example files are available at Data Inra https://doi.org/10.15454/HBXKVA.

Authors Contributions
OL designed the study and developed the data analysis pipeline. OL, EC, CB, FS and EG developed the laboratory protocols. AM, LT, AA and CFEB contributed to laboratory analyses. EC, LT and CFEB contributed to data analysis. CD, FD and SL contributed to sampling and preliminary data collection. OL wrote the manuscript which was revised critically and approved by all authors. Table S1: Full details about the loci used or developed in this study (including species, locus name, primer sequences, characteristics, and reference) Figure S1: Effect of sequence coverage on SSRseq data quality for Alosa species.

Supporting / Supplemental information
Supporting Materials and Methods: detailed explanation of FDSTools parameters used in this study.