Multi-scale and multi-site resampling of a study area in spatial genetics: implications for flying insect species

The use of multiple sampling areas in landscape genetic analysis has been recognized as a useful way of generalizing the patterns of environmental effects on organism gene flow. It reduces the variability in inference which can be substantially affected by the scale of the study area and its geographic location. However, empirical landscape genetic studies rarely consider multiple sampling areas due to the sampling effort required. In this study, we explored the effects of environmental features on the gene flow of a flying long-horned beetle (Monochamus galloprovincialis) using a landscape genetics approach. To account for the unknown scale of gene flow and the multiple local confounding effects of evolutionary history and landscape changes on inference, we developed a way of resampling study areas on multiple scales and in multiple locations (sliding windows) in a single large-scale sampling design. Landscape analyses were conducted in 3*104 study areas ranging in scale from 220 to 1,000 km and spread over 132 locations on the Iberian Peninsula. The resampling approach made it possible to identify the features affecting the gene flow of this species but also showed high variability in inference among the scales and the locations tested, independent of the variation in environmental features. This method provides an opportunity to explore the effects of environmental features on organism gene flow on the whole and reach conclusions about general landscape effects on their dispersal, while limiting the sampling effort to a reasonable level.


Introduction
Morabito 2014). Details of primer sequences and the protocol for genotyping are given in Table   145 S2. Results showing negative or ambiguous amplification of particular loci were repeated once 146 and considered null when still unsatisfactory. Individuals exceeding two missing loci were 147 removed for the analysis. Deviation from Hardy Weinberg Equilibrium (F is ) was estimated for 148 each deme, each inferred cluster and for the whole dataset using GENEPOP 4.2 (Raymon & 149 Rousset 1995). The frequency of null alleles at each locus was tested using FREENA (Chapuis & 150 Estoup 2007) among three large demes (n>19). Loci exceeding a rate of 7% of null alleles across 151 populations were discarded from further analysis. Allelic richness was computed for each deme 152 using rarefaction (HP-RARE, Kalinowski 2005). The absence of linkage disequilibrium between 153 pairs of loci was reported in a previous population-based study (Haran et al., 2015).

155
Genetic structure 156 We used the Bayesian approach implemented in STRUCTURE 2.3.4 (Pritchard et al., 157 2000) to identify the main genetic clusters among Iberian demes. STRUCTURE assigns 158 individuals to a predefined number of clusters based on allele composition and linkage 159 disequilibrium. We used the Delta K method (Evanno et al., 2005) to determine the number of 160 clusters (K) that best fitted the data. Genotypes were analyzed using default parameters (admixture 161 model, correlated allele frequency). We made ten repeats of a 200,000 burn-in period followed by 162 500,000 replicates of Markov Chain Monte Carlo (MCMC), for K values ranging from 1 to 20. 163 The results were uploaded in STRUCTURE HARVESTER (Earl et al., 2012) to determine the 164 optimum K. We also explored the existence of genetic clusters among demes using a principal 165 component analysis (PCA) performed on allele frequencies (Adegenet package, Jombart 2008). 166 To account for potential confounding effects of differentiated genetic clusters (possibly of 167 evolutionary history origin) on the inference of gene flow, the landscape genetic analyses of this 168 study (see below) were performed twice, once within the main cluster identified by STRUCTURE 169 and PCA, and once with the whole dataset including all clusters. 170 The genotypes of M. galloprovincialis were also analyzed taking a spatial approach in order 171 to identify nested levels of genetic structure linked to scales of study. The scores of the sampling 173 used as a univariate statistical measure of genetic composition. The scores may encapsulate 174 relevant spatial information, so we explored this point using a specific tool borrowed from 175 geostatistics: the variogram , Goovaerts, 1997. The variogram is used in all 176 branches of life sciences in order to explore spatial patterns and determine the main spatial scales 177 on which structures occur. In our study, we analyzed the score of sample points on axis 1 using a 178 variogram to gain a better understanding of the spatial component of the variation encapsulated in 179 the first axis of the PCA. Let z(u α ), with α=1, 2, ... n, be a set of n values of sample scores on a 180 PCA axis where u α is the vector of spatial coordinates of the αth observation. In geostatistics, 181 spatial dependence is described in terms of dissimilarity between observations expressed as a 182 function of the separating distance (Goovaerts 1997). The average dissimilarity between data 183 separated by a vector h is measured by the empirical semi-variance ŷ(h), which is computed as half 184 of the average squared difference between the data pairs:    (Table S1).

339
The data points of the variogram were grouped into 26 distance classes ranging from 0 to 340 1252 km, with a distance interval of 50 km. The variogram revealed that the first axis of the PCA 341 corresponded to a highly spatially structured pattern (Fig. 2). The semi-variance first progressively 342 increased with increasing lag distance up to a distance of about 190 km and then reached a plateau. 343 For distances of about 400 km, the semi-variance increased again and leveled off for distances 344 further than 1000 km. The resulting plateau of the variogram showed the presence of a long-range 345 spatial variation superimposed over a more local, i.e. short-scale genetic structure, occurring on 346 scales of 200 to 400 km. For scales below 200 km, the variogram showed that genotypes were 347 strongly spatially auto-correlated (i.e. non-independent). 350 Analyses were conducted on both the whole dataset (992 individuals, 132 localities) and 351 within the Spanish cluster (790 individuals, 87 localities). Within each dataset, the number of 352 alleles observed across all individuals and loci was 116 and 102, respectively.

353
Over the whole study area (whole dataset), we generated a total of 30,576 sampling areas.

371
Interpolation of supported IBR hypotheses and IBD was based on areas of scales ranging 372 from 220 to 600 km, because most of the variation in the detection of the effects of environmental 373 features was found on these scales (Fig. 3A). For most IBR hypotheses (IBR-E, IBR-Pr and IBR-374 T) and IBD, the effects were mainly detected in the northern part of the study area (Fig. 4). In 375 contrast, these IBR hypotheses were the least frequently detected on the southern and eastern sides 376 of the Iberian Peninsula. For the IBR-Pc hypothesis, significant effects were detected mainly in 377 Andalucía, along the Betic system. Conversely, low or no effects for this hypothesis were detected 378 in the northern half of the Iberian Peninsula. The distribution of supported hypotheses was similar 379 for those performed on the whole dataset and on the Spanish cluster only (Fig. 4). For hypotheses 380 IBR-E, IBR-Pr and IBR-Pc, the variation of environmental features was lower on average in areas 381 exhibiting significant effects for scales up to 400 -600 km (whole dataset; Fig. 5). Above this 382 scale, the mean standard deviation (SD) of areas with a supported IBR hypothesis was either equal 383 to or higher than the mean SD of areas with no support. For the IBR-T hypothesis, the mean SD 384 of areas with support was above the mean for non-supported areas, for most of the scales.    412 We observed a notable influence of scale on the detection of supported IBR hypotheses 413 with Mantel tests for most of the environmental features tested (IBR-E, IBR-T and IBR-Pr).

Importance of the scale of the study area for flying insects
414 Support was scarcely detected on the smallest spatial scale (220-400 km) and was generally more    Table 1: Commonality coefficients of both unique and common effects for the three sampling 540 areas with the highest variance explained.