An empirical examination of sample size effects on population demographic estimates in birds using single nucleotide polymorphism (SNP) data

Sample size is a critical aspect of study design in population genomics research, yet few empirical studies have examined the impacts of small sample sizes. We used datasets from eight diverging bird lineages to make pairwise comparisons at different levels of taxonomic divergence (populations, subspecies, and species). Our data are from loci linked to ultraconserved elements and our analyses used one single nucleotide polymorphism per locus. All individuals were genotyped at all loci, effectively doubling sample size for coalescent analyses. We estimated population demographic parameters (effective population size, migration rate, and time since divergence) in a coalescent framework using Diffusion Approximation for Demographic Inference, an allele frequency spectrum method. Using divergence-with-gene-flow models optimized with full datasets, we subsampled at sequentially smaller sample sizes from full datasets of 6–8 diploid individuals per population (with both alleles called) down to 1:1, and then we compared estimates and their changes in accuracy. Accuracy was strongly affected by sample size, with considerable differences among estimated parameters and among lineages. Effective population size parameters (ν) tended to be underestimated at low sample sizes (fewer than three diploid individuals per population, or 6:6 haplotypes in coalescent terms). Migration (m) was fairly consistently estimated until <2 individuals per population, and no consistent trend of over-or underestimation was found in either time since divergence (T) or theta (Θ = 4Nrefμ). Lineages that were taxonomically recognized above the population level (subspecies and species pairs; that is, deeper divergences) tended to have lower variation in scaled root mean square error of parameter estimation at smaller sample sizes than population-level divergences, and many parameters were estimated accurately down to three diploid individuals per population. Shallower divergence levels (i.e., populations) often required at least five individuals per population for reliable demographic inferences using this approach. Although divergence levels might be unknown at the outset of study design, our results provide a framework for planning appropriate sampling and for interpreting results if smaller sample sizes must be used.


INTRODUCTION
Genomic-scale data for studying population histories have increased the resolution of demographic estimates, including effective population sizes, migration rates, and times since divergence, even when the number of sampled individuals is relatively low (Willing, Dreyer & Van Oosterhout, 2012;Jeffries et al., 2016;Nazareno et al., 2017). However, it is not well understood how the precision and accuracy of these estimates are impacted by lower population sample sizes. The number of individuals that can be included in a study might be limited by practical considerations such as availability of samples for difficult-to-access or endangered populations, tradeoffs between including more individuals per population or more populations, or decisions about whether to include more loci or more individuals (Felsenstein, 2005;Pruett & Winker, 2008;Jeffries et al., 2016). Because these issues affect study design, it is important to understand the impacts of relatively low within-population sample sizes on population demographic parameters that are now commonly estimated in a coalescent framework.
The impacts of population sample size, and particularly the tradeoff between increased numbers of individuals versus increased number of loci, has been studied primarily with microsatellite datasets. In general, increasing the number of loci decreases the number of individuals needed for accurate parameter estimations in population genetic studies (Morin, Martien & Taylor, 2009;Willing, Dreyer & Van Oosterhout, 2012), but different parameter estimates are not impacted uniformly by low sample sizes. A sample size of eight alleles per population (4:4 diploid individuals) has been suggested as an optimum sample size for obtaining coalescent-based likelihood estimates of theta (Θ = 4N e m; Felsenstein, 2005). This sample size has also been sufficient for non-coalescent-based estimates of unbiased heterozygosity (Pruett & Winker, 2008), which have been effectively estimated with 5-10 individuals. However, other estimators, such as genetic diversity (e.g., A E , H O and unbiased H E ) and differentiation (F ST ), require larger sample sizes for accurate estimation, and often the number of individuals required increases as divergence decreases (Kalinowski, 2005;Morin, Martien & Taylor, 2009).
Modern genomic datasets, with their large numbers of sampled loci, are predicted to decrease the number of individuals required for obtaining accurate estimates of demographic history (Jeffries et al., 2016). However, impacts of sample size on such estimates have undergone only limited investigation thus far, and previous empirical work has focused on estimates of diversity (A E , H O and unbiased H E ) and differentiation (F ST ; Nazareno et al., 2017). Other demographic estimates made using allele frequency spectrum methods have only been evaluated so far with simulated data (Robinson et al., 2014), using the program Diffusion Approximation for Demographic Inference (dadi; Gutenkunst et al., 2009). Robinson et al. (2014) showed that median estimated parameter values in two-population dadi models of divergence in isolation remained close to true values down to three diploid individuals per population. However, this did not hold true across all three model types they examined, and their optimal sampling recommendations depended on the timescale of the demographic events experienced by the populations, with very recent and very ancient events both requiring greater sample sizes (Robinson et al., 2014). In empirical systems, such information on the timescale of demographic events or divergence might be unknown at the outset of a study, particularly in taxa that have not been previously studied, and care must be taken to avoid sampling too few individuals to accurately estimate parameters of interest.
Here we use empirical datasets to conduct pairwise examinations of how inferences of population parameters are impacted by sample size, scaling symmetrically downwards from full datasets that meet or exceed sample sizes widely considered optimal for coalescent-based analyses. We expected that as sample sizes decreased, errors in estimates would increase and accuracy would decrease, but to varying degrees among parameters, and that systematic biases of mean estimates of parameters might emerge at lower sample sizes. We used empirical datasets from diverging avian lineages with different demographic and evolutionary histories to enhance our understanding of how lower sample sizes affect estimates of effective population size (ν), migration (m), time since divergence (T), and Θ (4N ref m).

Study system
We used eight datasets of ultraconserved elements (UCEs) from Beringian birds from McLaughlin et al. (2020 ; Table 1). Genomic data were generated using the methods outlined in Winker et al. (2019) using protocols from Dr. Travis Glenn's lab at the University of Georgia (http://baddna.uga.edu/protocols.html). From these data, we generated repeatedly subsampled datasets at smaller sample sizes for analysis under a coalescent framework using dadi (Gutenkunst et al., 2009). While we tried other programs, we were unable to get them to consistently run on our UCE datasets despite months of effort and over a hundred thousand hours of high-performance computing resources (e.g., jaatha, Mathew et al., 2013;IMa2p, Sethuraman & Hey, 2015). Thus, although we lack independent corroboration, we consider dadi to be sufficient to answer the questions posed. Our empirical datasets represent taxonomically designated levels of population, subspecies and species pairs in three avian orders, contrasting pairs of Asian and North American populations of: Clangula hyemalis (long-tailed duck), Anas crecca crecca/A. c. carolinensis (green-winged teal), and Mareca penelope/M. americana (Eurasian and American wigeons) in Anseriformes; Numenius phaeopus variegatus/N. p. hudsonicus (whimbrel) and Tringa brevipes/T. incana (gray-tailed and wandering tattlers) in Charadriiformes; and Luscinia svecica (bluethroat), Pinicola enucleator kamschatkensis/ P. e. flammula (pine grosbeak) and Pica pica/P. hudsonia (Eurasian and black-billed magpies) in Passeriformes. These datasets, which span divergence levels from populations with substantial levels of gene flow to effectively reproductively isolated species (albeit with low gene flow), enable us to explore how the effects of low sample sizes on demographic inference play out across these levels of divergence. Insofar as taxonomy is not a reliable indicator of genomic divergence levels (Humphries & Winker, 2011), we also include in our evaluations estimates of F ST made from the full datasets (Table 1). Among the lineages in this study, pairwise comparisons fell out into two general groups, one with relatively low divergence and one with relatively high divergence (McLaughlin et al., 2020; Table 1).
These datasets consist of one single nucleotide polymorphism (SNP) per locus from over 1,500 UCE loci per lineage (each lineage is a pairwise, two-population sample of diverging populations, subspecies, or species). For bioinformatics methods, see Winker et al. (2019) and McLaughlin et al. (2020); a summary of our pipeline is given here: https://github.com/jfmclaughlin92/beringia_scripts. Each dataset consists of 100% coverage for all individuals (all individuals have phased, high-quality SNPs called at both alleles for all loci). Z-linked loci were removed because they have a different inheritance scalar from autosomal loci (Winker et al., 2019;McLaughlin et al., 2020). Original sequence data are deposited in the NCBI Sequence Read Archive (SRA; PRJNA393740). Complete data files analyzed for this study are available at https://doi.org/10.6084/m9.figshare. 12622658.v1.

Subsampling datasets and analyses
To produce datasets of varying sample sizes, stepping down from the maximum number of individuals available for each population (6-8) to 1 individual per population, a custom Python script (https://github.com/jfmclaughlin92/beringia_scripts) was used. This script (ngapi_dadi.py) iteratively sampled individuals without replacement from the thinned .vcf files, created new .vcf files containing these individuals, converted these files to the proper dadi input format (using a Perl script by Kun Wang, https://groups.google. com/forum/#!msg/dadi-user/p1WvTKRI9_0/1yQtcKqamPcJ), and ran dadi models with predetermined, lineage-specific best-fit parameters for the split-migration (divergence-with-gene-flow) model that comes with the dadi Demographics2D.py file (split-mig). For six of our eight lineages, split-migration models produced a best-fit Here we chose to include all eight datasets under a single model framework (split-migration, an isolation-with-migration model in dadi, termed "split-mig" therein). We wished to focus here on changes due to sample size variation with multiple empirical datasets and not on more subtle variation due to differences among divergence-with-gene-flow models. For each sample size, 25 subsampled datasets were created, which were each run five times. The best-fit run by highest maximum log composite likelihood value among those five runs was then selected for each dataset and used for subsequent analyses. Parameter estimates for effective population size (ν 1 and ν 2 ), migration (m), divergence time (T), and Θ (defined as 4N ref m, with N ref defined as ancestral population size and m as mutation rate per generation), were then compared across different sample sizes. Raw parameter estimates are used throughout; we did not convert these values to individuals or years (except for three illustrative examples for individuals; see below) because that would introduce lineage-specific idiosyncrasies (e.g., through application of different mutation rate estimates) that would diminish the power of our focal among-lineage comparisons here. For the three exemplar cases in which we translated raw values into numbers of individuals, we used estimates of mutation rate and generation time given in Table S2.
The scaled root mean square error (SRMSE) was calculated, defined as with u in this context representing the estimate from the full dataset,û as the parameter estimate from the subsampled dataset, and n the number of datasets (25) considered, following Robinson et al. (2014). This was scaled by the mean of the parameter estimate at each sample size ( " uÞ to enable inter-lineage comparisons of the changes in accuracy at lower sample sizes (SRMSE). This allowed us to quantify the changes in accuracy of estimates at different sample sizes relative to each species' parameter estimates' means.

RESULTS
Each lineage had a dataset of between 1,636 and 2,656 variable loci (Table 1). Across the eight lineages, 25 datasets were constructed at each sample size from 1:1 individual up to the full sample size minus one for a total of 1,250 subsampled datasets. Overall, as expected, variability in parameter estimates increased and accuracy decreased with smaller sample sizes (Table 2; Fig. 1; Figs. S1-S5). Performance of mean parameter estimates varied both with lineage and with sample size. The effective population size parameters (ν 1 and ν 2 ) tended to be underestimated at the lowest sample sizes, whereas there was a trend towards overestimation of migration at the lowest sample sizes (m; Table 2; Fig. 1; Figs. S1-S5). Divergence time (T) and Θ were more ambiguous, with both    Fig. 1; Figs. S1-S5). These corresponded in many cases to large changes in the biologically meaningful estimates derived from these parameters. For example, this can be seen in the effective population size parameter of Tringa brevipes (ν 1 ), which varied from 1.02 to 8.49 across the full sample size spectrum (Table 2). This represents effective population size estimates of 4,478 to 37,410 individuals. In other cases, however, seemingly large changes translated into minor biological differences (e.g., changes in m among pairwise comparisons with very low levels of gene flow, considered in more detail below). In general, SRMSE increased as sample sizes decreased (Table 3; Fig. 2), reflecting the loss of accuracy at lower sample sizes. Lineages with lower levels of divergence (Table 1; Fig. S6) tended to exhibit more variability among model runs at higher sample sizes than lineages with higher levels of divergence (e.g., Numenius vs. Luscinia in Fig. 1 for ν 2 ). This was most notable in the two population-level splits (L. svecica and C. hyemalis; Fig. 1; Figs. S1-S5). At higher levels of divergence (Table 1)-particularly among T. brevipes/ T. incana, N. phaeopus, and Pica pica/Pica hudsonia-most parameter estimates reached a consistent level at approximately 4 or 5 diploid individuals, after which adding more individuals did not considerably improve estimates (Table 2), whereas SRMSE generally only began to increase markedly below 3:3 comparisons for population size and split-time estimates (Table 3; Fig. 2). In some lower-divergence lineages, such as A. crecca and L. svecica, SRMSE began increasing substantially in most parameters below a sample size of 5 (Table 3; Fig. 2). However, this was not universally the case, with SRMSE values in C. hyemalis remaining similar at most sample sizes for multiple parameter estimates (Table 3; Fig. 2).
Variation among lineages was noteworthy, as was variation among demographic variables as sample sizes changed. Considering aggregate performance, using SRMSE as the basis for among-lineage contrasts, all lineages showed a significant decrease in performance (increased SRMSE) with smaller sample sizes (Table 4). These relationships were all significant using a linear regression except for the SRMSE of m, which showed aberrancies at N = 2 among some high-divergence lineages (Tables 2-4; Fig. 2; Fig. S6). In many cases the linear regression models were substantially improved by breaking the lineages into low-divergence and high-divergence groups (groups from Table 1, split by F ST values < 0.05 and >0.25; Table S1).

DISCUSSION
Sample size is an important consideration in study design, but it remains understudied in large-scale genomic datasets (Nazareno et al., 2017). Our results suggest that the minimum reliable sample size will vary considerably from taxon to taxon, depending on factors such as parameters of interest and the depth of the lineage's divergence. Although analyses using coalescent theory have suggested that sample sizes of 8-10 individuals per population are optimal (Felsenstein, 2005), by genotyping both alleles of diploid animals our sample sizes were doubled (i.e., 1N = 2 haplotypes), and we were able to estimate population parameters at considerably lower sample sizes in terms of individuals. Certain parameters, such as migration rate (m) and effective population sizes (ν 1 and ν 2 ), showed fairly consistent patterns of bias in over-or under-estimation across all lineages ( Fig. 1; Figs. S1-S5). In particular, gene flow (m) was fairly consistently estimated with relatively small departures from accuracy down to two individuals per population, after which it was overestimated in all lineages (Table 2; Fig. S3).

Estimates of migration
We found the most variation in estimates of m occuring when samples were at 2:2 (e.g., Pinicola enucleator and Pica pica/hudsonia; Fig. 2). In most of the cases in which extreme estimates occurred at 2:2, pairings of individuals that caused geographic clustering of within-continent population samples were involved together with numerically very small estimates of m. The values of m were consistently small, but variation around the mean estimate was apparently magnified by more subtle within-continent variation than our study was designed to detect. Biologically, we reason that small values of m are the more informative takeaway, and that increased variation around those very small numbers at N of 2:2 is an artifact arising from a combination of relatively deep divergence and very low gene flow, probably coupled with some more subtle population structure within continental populations. In biological terms, although these variations can appear graphically substantial (Fig. 2, m), in Pinicola enucleator they represented estimates ranging (max-min) from 0.01 to 6.13 × 10 −9 individuals per generation. In Pica, these max-min values were 0.03-2.29 × 10 −9 individuals per generation.

Estimates of population size
The effective population sizes (ν parameters) were not as robust, with variation tending to begin to increase markedly below four diploid individuals per population and accuracy decreasing in all lineages (Tables 2 and 3; Fig. 1; Figs. S1 and S2). They were, however, still reasonably accurate in many lineages at relatively small samples sizes (Tables 2 and 3; Fig. 1; Figs. S1 and S2). The negative relationships between SRMSE values for each demographic parameter and sample size (N) should help users interpret how lineages and individual parameters are affected by smaller sample sizes (Table 4; Table S1).

The impact of divergence
Our results reinforce previous findings (Kalinowski, 2005;Morin, Martien & Taylor, 2009) that an important factor in determining the minimum sample size for a study is the level of divergence in the lineages under examination. Although this might be known at the start of a study, that might not always be true, potentially complicating sampling design. However, some general recommendations are possible, at least within a broader framework of higher-and lower-divergence groups. Lineages with considerable divergence (e.g., species-level, such as in Tringa) had accurate demographic parameters estimated at lower sample sizes ( Fig. 2; Figs. S1-S5). Thus, it seems possible in such systems to reliably use fewer individuals. In shallowly diverged populations that might experience substantial gene flow, however, higher sample sizes may be required to overcome the impact of individuals with varying amounts of admixture, which appears to increase the variation in model performance at lower sample sizes among low-divergence lineages ( Fig. 1; Figs. S1-S6; Table S1).
Our findings of the effects of divergence levels on the minimum sample sizes needed to accurately estimate population demographic parameters broadly agreed with previous findings in other genetic markers, with some exceptions. In lineages that are more shallowly split and have experienced more gene flow, greater sample sizes are required to reliably estimate multiple parameters, including not just the demographic parameters examined here, but also genetic distance (Kalinowski, 2005), F ST (Morin, Martien & Taylor, 2009;Humphries & Winker, 2011), and recent demographic events (e.g., <100 generations; Beichman, Huerta-Sanchez & Lohmueller, 2018). The two population-level splits in our study, L. svecica and C. hyemalis, did not perform as well for most parameter estimates at sample sizes below 6 individuals per population, with accuracy (as measured by SRMSE; Table 3) decreasing rapidly; this fits our understanding that accurately estimating more recent demographic events requires the improved draw on more recent coalescent events that increased sample sizes bring (Beichman, Huerta-Sanchez & Lohmueller, 2018). The presence of a substantial amount of gene flow appears to increase variation in parameter estimates and decrease accuracy, as seen in L. svecica (Tables 2 and 3), and in practical terms would require increased sample sizes for accurate parameter estimation.

Model fit
Due to computational restrictions, we analyzed all subsampled datasets under the split-mig dadi (Gutenkunst et al., 2009) model determined and optimized for the full dataset in each lineage, and we did not investigate the impact of sample size on model fit. Several subsample datasets (notably Clangula hyemalis, Mareca penelope/M. americana, and Luscinia svecica) showed signs in some parameters of beginning to consistently push the upper bounds of some model parameters. This means that both variation and over-estimation of the parameters were likely underestimated in these groups at smaller sample sizes. This situation has also been noted with simulated data, which have been found in some situations to have a better fit with a model type different than the one under which they were simulated (Robinson et al., 2014).

Implications for study design
Research efficiency requires attention not only to the minimum sample size required to meet an objective, but also to the point after which adding more samples begins to produce diminishing returns. In this context, this means the point above which the SRMSE becomes similar between sample sizes, but before the means of estimates start to change due to decreased sample size. This inflection point might represent the minimum reliable sample size, but not necessarily. In some lineages, SRMSE was very similar at larger sample sizes, began to slowly increase at intermediate sizes, and then at low sample sizes increased quickly (Table 3; Fig. 2). This again varied among lineages (Table 3; Figs. S1-S5).
In some, such as the Pica and Tringa species lineages, this inflection point was reached at higher sample sizes than the minimum reliable sample sizes in some parameters (Table 3), whereas in others, such as in most estimates of m, these points were the same (e.g., Fig. S3). However, in some groups, particularly estimates of effective population size (ν 1 ) and migration (m) in L. svecica, this optimal point was not reached until the full dataset was analyzed, and might not have been reached at all in C. hyemalis in any of the parameter estimates (Figs. S1-S5). This is consistent with the findings of Robinson et al. (2014), in that although in some cases a small sample size could be used, larger sample sizes still led to more accurate parameter estimates. This was especially the case in our data for divergence times (T), Θ, and some effective population size (ν) estimates (Table 2; Fig. 1; Figs. S1-S5). Our linear regression models help generalize these relationships (Table 4; Table S1). In sum, two key sources of variation preclude our providing detailed suggestions for threshold sample sizes in future studies: the levels of divergence in a study's focal lineage, and the demographic parameter of most interest for that study. We urge those designing their own studies to consider our results (Tables 2-4, Supplemental Information) at different divergence depths and for different demographic parameters, depending on objectives.

CONCLUSIONS
Sample size is a critical aspect of study design and interpretation, and balancing the need for reliable estimates with cost effectiveness is a key tradeoff. Inadequate sampling can lead to ambiguous or biased results (Nazareno & Jump, 2012;Nazareno et al., 2017), whereas many parameter estimates are not improved above a certain sample size (Felsenstein, 2005;Nazareno et al., 2017). As other researchers, we found that inference of demographic parameters can be strongly influenced by sample size, with estimates becoming less accurate at lower sample sizes and being over-and underestimated, with considerable variation both among parameters and among lineages. In general, for pairwise comparisons at shallow levels of divergence (population), care should be taken to include adequate samples, with the best performance in these data generally occurring at 6 or more diploid individuals per population. Parameter estimates in lineages with deeper divergence (subspecies and species) were generally more resilient to lower sample sizes.